2352 Commits

Author SHA1 Message Date
Iikka Hauhio
1752993414 disable giza when eflomal is in use 2024-06-06 15:44:50 +03:00
Iikka Hauhio
d25a7c44d3 add support for eflomal aligner 2024-06-06 15:32:46 +03:00
Raphaël Merx
8cee20eaca
nonbreaking_prefix.tdt: add "Nu" for "Numeru"
E.g. "Dekretu-Lei Nu. 18/2022" -> "Decree Law No. 18/2022"
2022-05-08 10:33:58 +08:00
swk0627
b2a3b96154 Modify a comment on usage in the script 2022-01-21 21:11:02 +09:00
Raphael Merx
75d4c672e8 Add tokenisation support for the Tetun language 2021-03-13 18:39:56 +08:00
Kenneth Heafield
78ca5f3cc5 Allow Arabic letters to begin a fa sentence 2020-08-03 21:51:09 +01:00
Cristina España i Bonet
8d78dae634
adding rules for Catalan
special characters within words and contractions closer to French than to English
2020-07-31 15:22:47 +02:00
Barry Haddow
47915b561f escape ampersands 2020-06-30 08:10:56 +01:00
Hieu Hoang
d90a8df862
Merge pull request #221 from HjalmarrSv/master
Added some for sv
2020-06-01 17:19:36 -07:00
HjalmarrSv
da3768a296
Update nonbreaking_prefix.sv
Added Å Ä Ö, which are not unusual initials in names, e.g. Åke, Ärling, Östen.
Added some new entries, mostly variations on the existing ones. Both a dot after each letter (or pair) and a dot only after the last letter are accepted forms. A couple of decades ago, there had to be a space after the dot, which explains the third form.
The file for sv is much more useful with these few additions, although it is still far from complete.
Removed: G (occurred twice).
In this list there is one item that is also a word, even when case is kept: tom.
If all words are in lowercase, then tex, mao, tom (again) may be confused with names, and iaf, etc. with named entities.
2020-05-23 17:43:33 +02:00
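The role of the non-breaking prefix list described in the commit above can be illustrated with a minimal Python sketch. It is not the Moses sentence splitter: the function name `split_sentences`, the tiny prefix set, and the splitting rule (a dot, whitespace, then an uppercase letter) are assumptions made for illustration only.

```python
import re

# Hypothetical, tiny prefix set for illustration; the real
# nonbreaking_prefix.sv file contains many more entries.
NONBREAKING_PREFIXES = frozenset({"Å", "Ä", "Ö", "t.ex", "bl.a"})


def split_sentences(text, prefixes=NONBREAKING_PREFIXES):
    """Split after a dot followed by whitespace and an uppercase letter,
    unless the token before the dot is a known non-breaking prefix."""
    sentences = []
    start = 0
    for match in re.finditer(r"\.\s+", text):
        tokens = text[start:match.start()].split()
        last_token = tokens[-1] if tokens else ""
        next_char = text[match.end():match.end() + 1]
        if last_token in prefixes:
            continue  # e.g. the initial "Å." in a name: do not split here
        if not next_char.isupper():
            continue  # the next word is not capitalised: do not split
        sentences.append(text[start:match.start() + 1])
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    example = "Å. Svensson kom hem. Han var trött."
    print(split_sentences(example))
    print(split_sentences(example, prefixes=frozenset()))
```

With the prefix set in place the initial "Å." stays attached to the name; with an empty prefix set the same input is split after the initial as well.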
Kenneth Heafield
89b9b4fba2 sentence splitter -k option to keep line boundaries 2020-03-19 15:44:41 +00:00
Kenneth Heafield
0a892749bc Add Pashto ؟ as a sentence splitting character 2020-03-19 12:06:50 +00:00
William Waites
696a5d9833 flag to turn off sentence splitter from emitting <P> 2020-02-26 14:08:26 +00:00
Kenneth Heafield
22923ddcf0 Revert "line buffering for tokeniser and truecaser"
This reverts commit 691717c425.
2020-02-20 09:52:08 +00:00
William Waites
691717c425 line buffering for tokeniser and truecaser 2020-02-17 14:29:24 +00:00
alvations
d03df21e88
Proper spacing 2020-01-06 11:43:31 +08:00
HjalmarrSv
fa747062dc
Modernized
I wanted to properly parse links on https://dumps.wikimedia.org/mirrors.html when the page is copied as text.
My proposed changes do the job.
Basically, I had to replace the + at the end of line 5 with *(\/)?
The pipe symbol could lead to crashes, which is why I split line 5 into three lines. After reading various posts, I suggest not using the pipe (|).
2019-12-17 20:40:51 +01:00
Barry Haddow
a89691fee3 attempt to handle Korean better; only consider horizontal space in final split 2019-12-16 15:52:45 +00:00
Barry Haddow
2cff8ff6dd split word on any type of space 2019-12-09 17:04:09 +00:00
alvations
f6d7adde15
Single quotes should be escaped as single quotes. 2019-11-25 10:10:40 +08:00
Barry Haddow
74d54b54c3 2 letter codes 2019-11-08 15:36:22 +00:00
Barry Haddow
1037070026 support for several Indic languages 2019-11-08 14:56:58 +00:00
Barry Haddow
b1163966b1 initial hi non-breaking prefixes 2019-11-05 16:59:40 +00:00
Barry Haddow
61b1d06570 list items 2019-11-05 16:52:50 +00:00
Barry Haddow
4da86c360f rupees 2019-11-05 16:02:19 +00:00
Barry Haddow
56b2bad907 fix abbrev rule 2019-11-05 15:58:07 +00:00
Barry Haddow
3910cd6c46 devanagari fix 2019-10-31 21:28:43 +00:00
Barry Haddow
2affb9b624 reorganise indic support 2019-10-31 16:50:17 +00:00
Barry Haddow
d708e26b60 use block notation for indic scripts 2019-10-31 16:12:59 +00:00
Barry Haddow
0fef8ebf4c fix nbp 2019-10-31 16:08:56 +00:00
Barry Haddow
b1d9fb6d75 full cjk test 2019-10-28 09:53:45 +00:00
Barry Haddow
8ebebbc680 Merge branch 'master' of github.com:moses-smt/mosesdecoder 2019-10-28 09:48:40 +00:00
Kevin Canwen Xu
5d3331b922
Update replace-unicode-punctuation.perl 2019-10-14 16:33:58 +08:00
alvations
555829a771
Undoing 0578892581
Causes abbreviations not to split when ending with a full stop. E.g.

> The restructuring of IBM was essential to enable it organisationally to take up the responsibilities entrusted in the role with the recent changes in the policy and legislations, revised charter of function of IBM and the new activities and initiatives undertaken by IBM. IBM is also engaged in handholding the States for auction of mineral blocks for greater transparency in allocation of mineral concessions.
2019-10-01 05:27:06 +08:00
Barry Haddow
486dce270f debug 2019-09-30 16:58:21 +01:00
Barry Haddow
9bffde57ba revert 05788925 2019-09-30 16:53:06 +01:00
Barry Haddow
257d7e5e66 enable custom non breaking prefixes 2019-09-30 16:52:24 +01:00
Barry Haddow
01a8ec41e8 Merge branch 'master' of github.com:moses-smt/mosesdecoder 2019-09-30 15:33:33 +01:00
Barry Haddow
768944d851 do not add spaces in cjk 2019-09-30 15:33:26 +01:00
titsuki
490dc3996a Enable use strict pragma 2019-09-23 15:40:13 +09:00
alvations
0578892581
The dot before an acronym should be optional. 2019-09-04 14:16:41 +08:00
Achim Ruopp
7ad5ffa0c0 Support for Urdu in sentence splitter 2019-07-10 10:48:32 -04:00
Matt Post
63c450b401
escape angle brackets
The script doesn't escape angle brackets which can result in bad SGML / XML output. This fixes that, although ideally, this should be implemented with a proper parser and dumper.
2019-04-26 14:24:07 -04:00
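The kind of escaping this commit describes can be sketched in a few lines of Python. This is an illustration under assumptions, not the code of the script that was changed: the function name `escape_for_xml` is hypothetical, and it covers only the ampersand and the two angle brackets.

```python
def escape_for_xml(text):
    """Escape characters that would otherwise be read as SGML/XML markup."""
    # The ampersand must be replaced first; otherwise the '&' introduced
    # by '&lt;' and '&gt;' would itself be escaped a second time.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


if __name__ == "__main__":
    print(escape_for_xml("a < b & c > d"))
```

Python's standard library provides `xml.sax.saxutils.escape`, which handles the same three characters; the manual version above only makes the replacement order explicit.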
Joel Barry
fdb7384d3d Fix non-ASCII lowercasing 2019-02-27 10:17:29 -05:00
Hieu Hoang
26940e714a Revert "use ucfirst instead of defined uppercase function"
This reverts commit dfbb17e549.
2019-01-04 14:55:55 +00:00
Hieu Hoang
7bc56b66f2
Merge pull request #207 from alvations/patch-truecaser
Reverting split_xml()
2019-01-03 16:56:56 +00:00
alvations
8fdbc74bbf
Reverting split_xml() 2019-01-03 20:51:27 +08:00
Hieu Hoang
db1894ad24 consistent output 2018-12-30 12:05:57 +00:00
alvations
dfbb17e549 use ucfirst instead of defined uppercase function 2018-12-20 11:57:48 +08:00
alvations
40748e528d split_xml should be consistent for training and using 2018-12-20 11:53:02 +08:00