17299 Commits

Author SHA1 Message Date
Barry Haddow
0fef8ebf4c fix nbp 2019-10-31 16:08:56 +00:00
Barry Haddow
b1d9fb6d75 full cjk test 2019-10-28 09:53:45 +00:00
Barry Haddow
8ebebbc680 Merge branch 'master' of github.com:moses-smt/mosesdecoder 2019-10-28 09:48:40 +00:00
Hieu Hoang
286188b82a
Merge pull request #214 from JetRunner/patch-1
Fix the incorrect processing considering fullwidth number character
2019-10-18 19:54:46 -07:00
Kevin Canwen Xu
5d3331b922
Update replace-unicode-punctuation.perl 2019-10-14 16:33:58 +08:00
alvations
555829a771
Undoing 0578892581
Causes abbreviations to not split when ending with a full stop. E.g.:

> The restructuring of IBM was essential to enable it organisationally to take up the responsibilities entrusted in the role with the recent changes in the policy and legislations, revised charter of function of IBM and the new activities and initiatives undertaken by IBM. IBM is also engaged in handholding the States for auction of mineral blocks for greater transparency in allocation of mineral concessions.
2019-10-01 05:27:06 +08:00
Barry Haddow
486dce270f debug 2019-09-30 16:58:21 +01:00
Barry Haddow
9bffde57ba revert 05788925 2019-09-30 16:53:06 +01:00
Barry Haddow
257d7e5e66 enable custom non breaking prefixes 2019-09-30 16:52:24 +01:00
Barry Haddow
01a8ec41e8 Merge branch 'master' of github.com:moses-smt/mosesdecoder 2019-09-30 15:33:33 +01:00
Barry Haddow
768944d851 do not add spaces in cjk 2019-09-30 15:33:26 +01:00
Hieu Hoang
b21b071a66
Merge pull request #213 from titsuki/enable-strict
Enable use strict pragma
2019-09-25 01:04:13 +01:00
titsuki
490dc3996a Enable use strict pragma 2019-09-23 15:40:13 +09:00
Hieu Hoang
fd06cdf026
Merge pull request #212 from moses-smt/alvations-patch-regexes
The dot before an acronym should be optional.
2019-09-04 08:07:06 +01:00
alvations
0578892581
The dot before an acronym should be optional. 2019-09-04 14:16:41 +08:00
Hieu Hoang
9f08d77b0d
Merge pull request #211 from achimr/master
Support for Urdu in sentence splitter
2019-08-21 22:05:45 +01:00
Achim Ruopp
7ad5ffa0c0 Support for Urdu in sentence splitter 2019-07-10 10:48:32 -04:00
Hieu Hoang
158d252389 tweak readme 2019-06-08 18:22:39 +01:00
Hieu Hoang
c0545019eb
Merge pull request #210 from mjpost/patch-1
escape angle brackets
2019-04-27 21:23:50 +01:00
Matt Post
63c450b401
escape angle brackets
The script doesn't escape angle brackets which can result in bad SGML / XML output. This fixes that, although ideally, this should be implemented with a proper parser and dumper.
2019-04-26 14:24:07 -04:00
Hieu Hoang
187a75cb55
Merge pull request #209 from joelb-git/multi-bleu-detok-non-ascii-fix
Fix non-ASCII lowercasing
2019-03-01 23:26:12 +00:00
Joel Barry
fdb7384d3d Fix non-ASCII lowercasing 2019-02-27 10:17:29 -05:00
Hieu Hoang
49b388ac79 check state objects are not null before using them. For alternate weights setting where some feature functions are not used for a particular sentence 2019-01-17 14:34:55 +00:00
Hieu Hoang
26940e714a Revert "use ucfirst instead of defined uppercase function"
This reverts commit dfbb17e549.
2019-01-04 14:55:55 +00:00
Hieu Hoang
7bc56b66f2
Merge pull request #207 from alvations/patch-truecaser
Reverting split_xml()
2019-01-03 16:56:56 +00:00
alvations
8fdbc74bbf
Reverting split_xml() 2019-01-03 20:51:27 +08:00
Hieu Hoang
db1894ad24 consistent output 2018-12-30 12:05:57 +00:00
Hieu Hoang
5a600dfe62
Merge pull request #206 from alvations/patch-truecaser
Patching truecaser
2018-12-29 20:21:37 +00:00
Hieu Hoang
4b2872fad8 rename file so it appears on github website. Clarify mailing list 2018-12-28 15:15:09 +00:00
alvations
dfbb17e549 use ucfirst instead of defined uppercase function 2018-12-20 11:57:48 +08:00
alvations
40748e528d split_xml should be consistent for training and using 2018-12-20 11:53:02 +08:00
Hieu Hoang
413ba6b583 increase cores to 16. For bitextor azure pipeline 2018-12-10 16:17:16 +00:00
Hieu Hoang
dd9ff66479 put fix into UnorderedComparer again. Maybe weird template bug 2018-12-10 13:27:57 +00:00
Hieu Hoang
baefaa1b12 fix weird unordered set error on ubuntu 18.04, gcc 7.3.0, boost 1.65. May be over-optimizing or bug in gcc or boost 2018-12-10 13:15:03 +00:00
Hieu Hoang
20edd331bc debug 2018-12-10 12:29:58 +00:00
Hieu Hoang
c753350641 ems config for moses2 2018-12-08 19:47:10 +00:00
Hieu Hoang
3d4bf99367 sacre bleu 2018-12-04 15:40:00 +00:00
Hieu Hoang
dbbc47292f sacre bleu 2018-12-04 15:27:09 +00:00
Hieu Hoang
345dabcde6 use --discount_fallback 2018-12-04 14:34:47 +00:00
Hieu Hoang
1591cf3676 Merge branch 'master' of github.com:moses-smt/mosesdecoder 2018-11-12 14:03:54 +00:00
Hieu Hoang
13e48bc8b4 removing python port. Sacremoses is newer 2018-11-12 14:03:38 +00:00
Hieu Hoang
19a31ca3f1
Merge pull request #205 from coylz/master
Add option "-b" (unbuffer output) to tokenizer scripts
2018-11-10 22:41:52 +00:00
Loïc Vial
4133726ef9 Add option "-b" (unbuffer output) to tokenizer scripts 2018-11-09 22:53:33 +01:00
Hieu Hoang
a2315ffd3a rename directory to work with python import 2018-11-09 13:01:17 +00:00
Hieu Hoang
a70086c1e6 python wrapper works 2018-11-09 12:58:22 +00:00
Hieu Hoang
2451c46960 start borging Luis Gomes code 2018-11-07 17:12:05 +00:00
Hieu Hoang
2217bc136e
Merge pull request #204 from ozancaglayan/nb-fix
tokenizer.perl: split final dots unconditionally
2018-11-07 14:36:41 +00:00
Ozan Caglayan
9fc964da7f tokenizer.perl: split final dots unconditionally
Allow tokenization of non-breaking prefixes at end of sentences. This should
be a fair compromise in many cases to construct a cleaner vocabulary.

EN-old: So am I.
EN-new: So am I .

DE-old: ... schwer wie ein iPhone 5.
DE-new: ... schwer wie ein iPhone 5 .

FR-old: Des gens admirent une œuvre d' art.
FR-new: Des gens admirent une œuvre d' art .

CS-old: Dvě děti, které běží bez bot.
CS-new: Dvě děti, které běží bez bot .
2018-11-07 10:59:54 +01:00
Barry Haddow
d2b558728f basic support for Gujarati and Hindi, backported from one of the many upstreams 2018-10-30 14:16:16 +00:00
Hieu Hoang
979dd5a403 Merge branch 'master' of github.com:moses-smt/mosesdecoder 2018-10-26 18:57:07 +02:00