Commit Graph

2352 Commits

Author SHA1 Message Date
Hieu Hoang
413ba6b583 increase cores to 16. For bitextor azure pipeline 2018-12-10 16:17:16 +00:00
Hieu Hoang
c753350641 ems config for moses2 2018-12-08 19:47:10 +00:00
Hieu Hoang
3d4bf99367 sacre bleu 2018-12-04 15:40:00 +00:00
Hieu Hoang
dbbc47292f sacre bleu 2018-12-04 15:27:09 +00:00
Hieu Hoang
345dabcde6 use --discount_fallback 2018-12-04 14:34:47 +00:00
Hieu Hoang
1591cf3676 Merge branch 'master' of github.com:moses-smt/mosesdecoder 2018-11-12 14:03:54 +00:00
Hieu Hoang
13e48bc8b4 removing python port. Sacremoses is newer 2018-11-12 14:03:38 +00:00
Loïc Vial
4133726ef9 Add option "-b" (unbuffer output) to tokenizer scripts 2018-11-09 22:53:33 +01:00
Hieu Hoang
a2315ffd3a rename directory to work with python import 2018-11-09 13:01:17 +00:00
Hieu Hoang
a70086c1e6 python wrapper works 2018-11-09 12:58:22 +00:00
Hieu Hoang
2451c46960 start borging Luis Gomes code 2018-11-07 17:12:05 +00:00
Ozan Caglayan
9fc964da7f tokenizer.perl: split final dots unconditionally
Allow tokenization of non-breaking prefixes at end of sentences. This should
be a fair compromise in many cases to construct a cleaner vocabulary.

EN-old: So am I.
EN-new: So am I .

DE-old: ... schwer wie ein iPhone 5.
DE-new: ... schwer wie ein iPhone 5 .

FR-old: Des gens admirent une œuvre d' art.
FR-new: Des gens admirent une œuvre d' art .

CS-old: Dvě děti, které běží bez bot.
CS-new: Dvě děti, které běží bez bot .
2018-11-07 10:59:54 +01:00
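Note: the old/new pairs in the commit message above can be illustrated with a minimal Python sketch. This is not the code from tokenizer.perl; the function name `split_final_dot` and the regular expression are assumptions made for illustration, and the sketch only handles a single period attached to the last token of a line.

```python
import re


def split_final_dot(line: str) -> str:
    """Detach a line-final period from the token it is attached to.

    Hypothetical helper for illustration only; it is not taken from
    tokenizer.perl. A period that already follows a space or another
    period (e.g. an ellipsis) is left alone. Trailing whitespace after
    the period is dropped.
    """
    return re.sub(r"(?<=[^\s.])\.\s*$", " .", line)


if __name__ == "__main__":
    for text in ["So am I.", "... schwer wie ein iPhone 5."]:
        print(split_final_dot(text))
```

A line that is already tokenized, such as "So am I .", passes through unchanged, so applying the sketch twice gives the same result as applying it once.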
Barry Haddow
d2b558728f basic support for Gujarati and Hindi, backported from one of the many upstreams 2018-10-30 14:16:16 +00:00
Rico Sennrich
411f45f249 multi-bleu-detok should take raw reference 2018-09-26 12:24:07 +01:00
Hieu Hoang
48fa6e92a9 grammar 2018-09-16 14:58:39 +01:00
Hieu Hoang
fd1758ba74 Merge branch 'master' of github.com:moses-smt/mosesdecoder 2018-09-10 18:30:46 +01:00
Hieu Hoang
e760db2d17 unused script 2018-09-10 18:30:36 +01:00
Barry Haddow
06f519d4e2 Handle glottal stops in Somalian 2018-09-06 16:09:36 +01:00
Louis MARTIN
53da5f4dbe Fix multi-bleu.perl bug when file does not end with newline
When reading hypothesis and reference files, multi-bleu.perl uses the
chop function to remove the trailing newline character.
If one of these files happens to not end with a newline, then chop will
remove the last character of the last line (instead of the newline).
This causes the BLEU score to be slightly off from its theoretical
value.
Using the safer chomp function solves this problem, since it only
removes a newline when one is present.
2018-07-03 04:06:09 -06:00
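Note: the difference between the two Perl functions described in the commit message above can be sketched in Python. The helpers `chop` and `chomp` below are illustrative stand-ins that mimic the Perl built-ins with the default newline separator; they are not code from multi-bleu.perl.

```python
def chop(line: str) -> str:
    """Mimic Perl's chop: always drop the last character."""
    return line[:-1]


def chomp(line: str) -> str:
    """Mimic Perl's chomp with the default separator:
    drop one trailing newline only if it is present."""
    return line[:-1] if line.endswith("\n") else line


if __name__ == "__main__":
    last_line = "the cat sat"  # final line of a file with no trailing newline
    print(repr(chop(last_line)))   # loses a real character
    print(repr(chomp(last_line)))  # left intact
```

On a line that does end with a newline both helpers return the same string, which is why the bug only shows up on the last line of a file that lacks a final newline.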
Joachim Wagner
5bbd5ca160
fix syntax error; credit https://www.mail-archive.com/moses-support@mit.edu/msg15226.html 2018-06-23 08:19:36 +01:00
Joachim Wagner
2aa5cd2152
fix syntax error in regular expression 2018-06-22 18:16:11 +01:00
Tomas Fulajtar
3a2a63b9dc * Added missing step for the "TRAINING:build-generation-custom".
* Fixed the $cmd parameter - should be "-corpus" instead of "-generation-corpus".
2018-05-18 14:18:11 +02:00
Hieu Hoang
999e83d128
Merge pull request #196 from astronautguo/master
fix bug when copying to cache
2018-05-04 14:42:35 +01:00
Kenneth Heafield
ae47469919 Don't drop last character if file does not end with newline 2018-05-03 10:28:11 +01:00
astro
f47e670f20 fix bug when copying to cache 2018-04-27 19:52:20 -04:00
alvations
686034488a
Contributing MosesTokenizer from NLTK to Moses 2018-04-11 00:27:37 +08:00
Scherrer Yves
4a7f16b366 add fi/sv-specific colon handling in tokenizer.perl 2018-02-14 10:27:46 +02:00
alvations
194964c017
Korean words have spaces =) 2018-01-19 13:29:53 +08:00
Hieu Hoang
3a0631a05b better default 2017-12-12 15:30:56 +00:00
Tomas Fulajtar
5b9a6da9a4 The .gz extension should be also added for 'On Disk' and 'Probing' Phrase tables. 2017-11-28 10:29:58 +01:00
Rico Sennrich
7e9108dd29 multi-bleu-detok.perl - a plain text alternative to mteval-v13a.perl 2017-10-20 10:08:22 +01:00
Hieu Hoang
05a37d218e wording change 2017-10-19 23:31:56 +01:00
Kenneth Heafield
545eee7e75 Attempt to stop people from publishing non-comparable BLEU scores, as discussed in statmt meeting 2017-10-19 22:57:36 +01:00
Jörg Tiedemann
23cf6c4d1f new option for mert-moses: transform-decoded-file 2017-08-15 17:11:46 +03:00
Hieu Hoang
8aa8988320 executable 2017-07-28 22:58:54 +01:00
Hieu Hoang
b8de7c3528 Merge branch 'master' of github.com:moses-smt/mosesdecoder 2017-05-02 10:59:02 +01:00
Hieu Hoang
101e52da60 check for executables before running 2017-05-02 10:57:00 +01:00
Hieu Hoang
b199e654df Merge branch 'master' of github.com:moses-smt/mosesdecoder 2017-04-27 13:48:28 +01:00
Hieu Hoang
2ea75d91dc add new mteval script 2017-04-27 13:48:18 +01:00
Rico Sennrich
ae476ae531 fix rdlm training - extra-settings was missing 2017-04-24 15:30:17 +01:00
Rico Sennrich
61f5b49dee fix rdlm training - train_host option was missing 2017-04-24 13:29:58 +01:00
Rico Sennrich
b99af32113 fix split-input if it is passed, but if output-splitter is defined 2017-04-24 12:16:36 +01:00
alvations
793e64b7d5 removed redundant subdirectory in path 2017-04-12 10:15:18 +08:00
alvations
66cbf46e27 use static path to compile.sh 2017-04-12 10:03:25 +08:00
alvations
9f246cef89 added Dockerfiles for Moses 2017-04-12 09:52:31 +08:00
Ondrej Bojar
d9faf8f901 ignore words where there is nothing to case 2017-04-07 17:28:13 +02:00
Phil Williams
a5c99ca660 reference-from-sgm.perl: fix Perl error 2017-03-07 15:54:08 +00:00
Ulrich Germann
08138b44a7 Bug fix in scripts/generic/bsbleu.py. 2017-03-04 13:21:18 +00:00
Linas Vepstas
8fdd19310b Update to apply CJK processing conditionally. 2017-01-11 11:23:54 -06:00
Linas Vepstas
2e48f83ab4 Handle punctuation+CJK combinations. 2017-01-08 10:08:53 -06:00
Linas Vepstas
6fb2c97029 Bug-fix: regular Western sentence enders not recognized. 2017-01-05 23:29:00 -06:00
Linas Vepstas
bd9d12351b Create a Cantonese version, distinct from Mandarin.
The content is identical, at this moment, but having distinct
language suffixes solves processing-pipeline problems later on.
2017-01-05 12:53:21 -06:00
Linas Vepstas
1933bcbf33 Whoops, revert cut-n-paste damage in previous commit. 2017-01-05 11:39:01 -06:00
Linas Vepstas
203c7c6387 Preliminary support for Chinese. 2017-01-05 11:34:38 -06:00
Linas Vepstas
144f43495e Preliminary support for Chinese.
Also, cleanup some of the comments.
2017-01-05 11:33:10 -06:00
Linas Vepstas
9f5500a3a8 oops. 2017-01-05 10:09:34 -06:00
Linas Vepstas
ab6816f9a7 Purely cosmetic cleanup.
Use same indentation style throughout; wrap long lines; capitalize
sentences; add punctuation; remove trailing whitespace.
2017-01-05 10:08:06 -06:00
Linas Vepstas
d10ba6f049 More abbreviations for Lithuanian. 2017-01-04 23:52:28 -06:00
Linas Vepstas
3ef84b133c More abbreviations 2017-01-04 22:30:53 -06:00
Linas Vepstas
2a5e40ed60 New file: Lithuanian 2017-01-04 22:01:45 -06:00
Hieu Hoang
ff12a13eaa re-tune if decoder changed. eg moses -> moses2 2017-01-02 16:37:56 -05:00
Hieu Hoang
29b0072eda CreateProbingPT2 -> CreateProbingPT 2017-01-02 06:02:54 -05:00
Hieu Hoang
28c0564589 Merge pull request #170 from moses-smt/alvations-patch-1
Changed \p{Hyphen} to \p{Line_Break} in mteval-v13a.pl
2016-12-23 15:00:49 +00:00
Hieu Hoang
59119c0044 Merge pull request #168 from tofula/master
Named group added for the safer 'protected patterns' recognition regexp
2016-12-23 10:26:19 +00:00
Hieu Hoang
fc8829cdda Merge pull request #169 from lonevvolf/master
Fix for number at the end of a string
2016-12-23 09:50:57 +00:00
alvations
c6c3bc84b7 Changed \p{Hyphen} to \p{LineBreak}
Using Perl v5.18.2, it's reporting this warning:
**Use of 'Hyphen' in \p{} or \P{} is deprecated because: Supplanted by Line_Break property values; see www.unicode.org/reports/tr14**
2016-12-23 14:21:20 +08:00
lonevvolf
d68211cba9 Fix for number at the end of a string 2016-12-06 09:41:32 +01:00
Hieu Hoang
5992f58d99 comment out debugging messages 2016-10-31 12:27:25 -04:00
Hieu Hoang
2679c30c1b --num-scores 2016-10-05 21:35:36 +01:00
Hieu Hoang
bcea640c9a handles hiero models too 2016-10-05 21:33:19 +01:00
Hieu Hoang
16d6a89861 output debugging messages to stderr not stdout 2016-09-29 07:16:57 -04:00
Hieu Hoang
3d5500e698 Merge branch 'perf_moses2' of github.com:hieuhoang/mosesdecoder into perf_moses2 2016-09-27 08:21:34 -04:00
Hieu Hoang
a29f7d5c99 can define srilm-dir in general section 2016-09-27 08:21:18 -04:00
Hieu Hoang
9527fb050d duplicate -T arg for OSM 2016-09-26 12:04:33 +01:00
Hieu Hoang
92f5f868cb add --num-scores arg. To binarize regression test tables with 5 scores 2016-09-25 22:53:08 +01:00
Marcin Junczys-Dowmunt
9ff0af4e85 consistent order of parameters in ini 2016-09-04 21:53:41 +02:00
Hieu Hoang
9d176ce7b5 Merge pull request #162 from da-web/patch-2
Changed NoPhraseCount score-option
2016-09-04 12:25:18 +01:00
Hieu Hoang
9236eeeba9 add brodie to list of machines 2016-08-13 18:27:25 +01:00
Hieu Hoang
260b4776ad binarization with CreateProbingPT2 2016-08-13 18:24:18 +01:00
Hieu Hoang
67d8a10d95 binarization with CreateProbingPT2 2016-08-13 18:16:35 +01:00
Hieu Hoang
c42bb54c04 formatting of -show-weights to make it work with mert script 2016-08-12 09:20:43 +01:00
Hieu Hoang
a8325a3e8e make probing pt work with ems 2016-08-11 15:46:43 +01:00
Hieu Hoang
9a31447c23 add support for CreateProbingPT2 2016-08-10 19:57:35 +01:00
Hieu Hoang
bf4f6b3b90 Merge ../mosesdecoder into perf_moses2 2016-08-05 17:15:18 +01:00
Hieu Hoang
34ffa372c1 Merge pull request #163 from a455bcd9/patch-1
Separate comma after a number end sentence
2016-08-05 17:12:44 +01:00
Antoine Dusséaux
6652068a43 Single lower-case letter French word
"a" is a single lower-case letter French word that can be at the end of a sentence: "Oui, il l'a."
2016-07-31 14:56:37 +02:00
Antoine Dusséaux
d04bdc7440 Separate comma after a number end sentence
Separate "," after a number if it's the end of a sentence.

Example:

He is tall,
He was born in 1800,
He wants to go there in 2000.

He is tall ,
He was born in 1800 ,
He wants to go there in 2000 .
2016-07-31 14:10:07 +02:00
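Note: the number case described in the commit message above can be sketched in Python. The function name `split_final_comma_after_number` and the regular expression are assumptions for illustration, not the tokenizer's actual rule; the sketch covers only a comma that directly follows a digit at the end of a line.

```python
import re


def split_final_comma_after_number(line: str) -> str:
    """Detach a line-final comma that directly follows a digit.

    Hypothetical helper for illustration only. Commas inside numbers
    such as "1,000" are untouched because they are not at the end of
    the line. Trailing whitespace after the comma is dropped.
    """
    return re.sub(r"(\d),\s*$", r"\1 ,", line)


if __name__ == "__main__":
    print(split_final_comma_after_number("He was born in 1800,"))
```

Anchoring the pattern at the end of the line is what keeps thousands separators intact in this sketch.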
da-web
fb2d7261cf Changed NoPhraseCount score-option
NoPhraseCount score-option was changed to PhraseCount: i.e. per default PhraseCount is omitted.

1. parse PhraseCount instead of NoPhraseCount from "score-options"
2. pass PhraseCount instead of NoPhraseCount to consolidate

fix for issue #157
2016-07-12 14:30:18 +02:00
Hieu Hoang
7fa586aa9b Merge ../mosesdecoder into perf_moses2 2016-07-11 09:50:22 +01:00
Philipp Koehn
ef9d327841 only train input or output truecaser, if only one is needed 2016-07-10 11:17:21 -04:00
Hieu Hoang
65667ad322 Merge ../mosesdecoder into perf_moses2 2016-07-05 23:36:15 +01:00
Hieu Hoang
bf4109655f Revert "Merge pull request #158 from da-web/patch-1"
This reverts commit f7ea8fe0da, reversing
changes made to 03d8747e65.
2016-07-05 11:26:04 +01:00
Hieu Hoang
7f6ce67bac Merge pull request #155 from ypeels/master
avoid name collisions when filtering multiple reordering tables
2016-07-03 20:20:24 +01:00
Hieu Hoang
f7ea8fe0da Merge pull request #158 from da-web/patch-1
Correctly consider score-options NoPhraseCount Argument
2016-07-03 20:20:13 +01:00
Hieu Hoang
0eb0bf642c Merge ../mosesdecoder into perf_moses2 2016-06-23 11:11:11 +01:00
Barry Haddow
aa37aba8aa add threshold pruning option to binarizer 2016-06-20 22:15:51 +01:00
da-web
51b2d0c302 Correctly consider score-options NoPhraseCount Argument
Handle and propagate NoPhraseCount score-option correctly (per default phrasetable is created WITH phrasecount feature):
1. pass --PhraseCount to consolidate (as --NoPhraseCount is not supported by consolidate)
2. consider --NoPhraseCount when calculating the basic_weight correctly (otherwise Moses.ini is wrong)
2016-06-10 13:02:57 +02:00
Hieu Hoang
b75ef6f619 Merge ../mosesdecoder into perf_moses2 2016-06-04 12:45:30 +01:00
Philipp Koehn
defbf8d7c3 barebone support for quality estimation in experiment.perl 2016-06-04 05:15:34 -04:00
Hieu Hoang
36812013bf Merge ../mosesdecoder into perf_moses2 2016-05-31 14:36:15 +01:00