Ulrich Germann
872facd171
Avoid errors in truecaser if input isn't factored and contains vertical bars.
2014-04-05 15:39:00 +01:00
Rico Sennrich
ee06a0f652
don't complain if input contains non-escaped '<' or '>', but is not XML
2014-02-08 15:43:00 +00:00
Rico Sennrich
d26fe4cc4d
fix truecaser with XML input (didn't do anything depending on formatting/whitespace)
2014-01-29 23:01:53 +00:00
Christian Buck
26bf04df5d
added unbuffered mode for casers (using -b)
2013-03-04 15:29:13 +00:00
phikoehn
124c36a837
bug fix with MML settings
2013-01-14 19:39:26 +00:00
phikoehn
a7f7379fa4
fixed bug in detruecaser / interaction with escaping
2013-01-14 19:25:43 +00:00
phikoehn
344b150372
bug fixes with escaping / truecasing interactions
2013-01-14 19:22:29 +00:00
Hieu Hoang
99d5e738aa
use kenlm if sri specified
2012-10-20 14:01:11 +01:00
Hieu Hoang
b761bd3237
exit 0 on success. /Henry Hu
2012-09-25 10:57:01 +01:00
Rico Sennrich
bed4bc08ad
distortion limit for recaser should be 0
2012-07-11 16:57:05 +02:00
Rico Sennrich
be1f959a1a
truecase corpus before training recaser
gives better results in (small) test, and the code already had a placeholder for it.
(without truecasing, the recaser is more likely to uppercase words like "the" if they are often sentence-initial in the training corpus)
If people don't want the default behavior changed, I can disable truecasing by default and add a command line parameter to enable it.
2012-07-11 16:27:00 +02:00
Hieu Hoang
01b84656bf
default pt implementation if no phrase table specified
2012-06-08 00:19:56 +01:00
Jehan
f3cb3ad789
- Bug fix: when --help set, errors on absence of --corpus or --dir must not be displayed.
- Unset variables must not be set as 0.
2011-11-27 10:14:39 +00:00
Jehan
d875b0774b
- Exit with failure when a step of train-recaser.sh fails.
It is kind of hard to identify the cause of a problem (or even to see there is a problem) if a script continues when a
main step failed. Better to exit when the error occurs with relevant logs.
2011-11-27 09:55:30 +00:00
Jehan
30febce3e8
- Help output for train-recaser script.
2011-11-25 17:21:55 +00:00
Jehan
78ccb137fb
- Coding style fix: use the upstream coding style.
2011-11-25 02:31:18 +00:00
Jehan
5841aea6aa
- Recaser train script updated to support IRSTLM as well.
By default, it will still use SRILM so that any previous use of this script from others won't be broken.
To switch to IRSTLM training, simply add the "-lm irstlm" command line option.
Also if build-lm.sh is not accessible from $PATH, the option "-build-lm /path/to/build-lm.sh" is also available.
2011-11-25 02:16:16 +00:00
bgottesman
518035ed05
add --possiblyUseFirstToken option, which, when selected, allows certain sentence-initial tokens to be taken into account. See comment in header or support mailing list discussion.
...
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@3690 1f5c12ca-751b-0410-a591-d2e778427230
2010-11-09 11:05:23 +00:00
hieuhoang1972
e5edb4b971
delete duplicate detokenizer
...
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@3622 1f5c12ca-751b-0410-a591-d2e778427230
2010-10-13 16:39:46 +00:00
hieuhoang1972
eedef63277
keep perl scripts with Unix line endings
...
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@3612 1f5c12ca-751b-0410-a591-d2e778427230
2010-10-11 11:32:27 +00:00
pjwilliams
2edfc16912
Merge remaining script support for tree-based models from mt3_chart.
...
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@3137 1f5c12ca-751b-0410-a591-d2e778427230
2010-04-16 09:45:51 +00:00
bgottesman
5a3a6bd3b0
set utf8 mode on the input and output files, instead of on stdin and stdout, which are not used. This allows case variants of non-ASCII characters to be recognized correctly
...
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@2987 1f5c12ca-751b-0410-a591-d2e778427230
2010-03-18 19:13:05 +00:00
bojar
dbfe610546
uppercasing first letter even if after punct
...
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@2846 1f5c12ca-751b-0410-a591-d2e778427230
2010-02-03 14:23:20 +00:00
phkoehn
8d5aef137b
bug fix
...
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@2113 1f5c12ca-751b-0410-a591-d2e778427230
2009-02-09 16:00:35 +00:00
phkoehn
a62f8ee316
added truecaser
...
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@2112 1f5c12ca-751b-0410-a591-d2e778427230
2009-02-09 15:32:34 +00:00
bojar
7f3e34207a
added some heuristics for Czech quotation marks
...
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@1567 1f5c12ca-751b-0410-a591-d2e778427230
2008-02-22 15:07:46 +00:00
bojar
6af3140978
added optional sentence uppercasing (use -u)
...
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@1566 1f5c12ca-751b-0410-a591-d2e778427230
2008-02-22 14:50:43 +00:00
jdschroeder
04ae9361d2
added "-v 0" moses flag to decoder call to minimize log output.
...
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@1335 1f5c12ca-751b-0410-a591-d2e778427230
2007-04-04 17:04:50 +00:00
bojar
55ea5d6f94
Adding simple Czech rules to detokenizer. Making detokenizer 'released'.
...
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@1328 1f5c12ca-751b-0410-a591-d2e778427230
2007-03-26 06:08:13 +00:00
bojar
58bf2089af
Adding detokenizer from WMT07 shared scripts.tgz, hoping there are no copyright
problems. Please withdraw if necessary.
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@1327 1f5c12ca-751b-0410-a591-d2e778427230
2007-03-26 05:46:50 +00:00
bojar
3d288d81e4
Proper unicode-based lower and uppercasing.
Added language option to recase.perl, English remains the default.
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@1326 1f5c12ca-751b-0410-a591-d2e778427230
2007-03-26 05:44:27 +00:00
hieuhoang1972
4b0ea463c8
add svn id comments to start of file
...
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@1308 1f5c12ca-751b-0410-a591-d2e778427230
2007-03-14 22:30:25 +00:00
hieuhoang1972
3c07c5df4d
add svn id comments to start of file
...
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@1307 1f5c12ca-751b-0410-a591-d2e778427230
2007-03-14 22:22:36 +00:00
phkoehn
a89acb34ae
minor bug fix to recaser training
...
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@1242 1f5c12ca-751b-0410-a591-d2e778427230
2007-02-26 12:19:06 +00:00
phkoehn
14839768c8
a large number of changes. besides little tweaks:
* training script now has proper default behaviour for single-factor models,
* mert script has better handling of default lambda parameters that now
works with lexicalized reordering models, and also with multiple
model files (e.g. multiple language models)
* parallel mert script is more robust when single jobs fail: detects it
and resubmits the crashed (or killed) jobs
* recaser added that builds on moses
* filtering script added that also binarizes filtered model files
(this will be eventually replaced when the lexicalized reordering
model also uses the binary format)
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@1210 1f5c12ca-751b-0410-a591-d2e778427230
2007-02-13 19:22:35 +00:00