28 Commits

Author SHA1 Message Date
Jeremy Gwinnup
a5fb4d1550 Fixed bug in tokenizer.perl where comma separated lists of single
characters aren't handled correctly

input> A,B,C,D,E,F

yielded> A, B,C , D,E , F

now yields> A, B, C, D, E, F

Updated Russian nonbreaking prefixes list with capital letters
2013-08-16 14:39:50 -04:00
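
Note on the comma fix above: output of this shape typically arises when a substitution consumes the characters on both sides of the comma, so neighbouring matches cannot overlap and every second comma is skipped. The Python sketch below illustrates that failure mode and a lookaround-based alternative. It is not the code from tokenizer.perl; the function names and patterns are hypothetical, and the first function shows the general failure mode rather than reproducing the exact faulty output quoted in the commit.

```python
import re


def split_commas_consuming(text):
    # Hypothetical buggy variant: the pattern consumes the non-digit on each
    # side of the comma, so the character after one comma cannot also serve
    # as the character before the next comma.
    return re.sub(r"(\D),(\D)", r"\1, \2", text)


def split_commas_lookaround(text):
    # Lookbehind/lookahead assert the neighbours without consuming them,
    # so every comma between non-digits is handled.
    return re.sub(r"(?<=\D),(?=\D)", ", ", text)


if __name__ == "__main__":
    sample = "A,B,C,D,E,F"
    print(split_commas_consuming(sample))
    print(split_commas_lookaround(sample))
```

Both variants leave digit groups such as "1,000" untouched, because the comma there sits between digits.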
Phil Williams
1238041f98 Add option to do Penn Treebank style tokenization
tokenizer.perl and detokenizer.perl now have an option called -penn
which does Penn Treebank-like tokenization (English only).  This is
useful if your pipeline involves processing the corpus with tools
trained on PTB-tokenized text.

Unlike PTB, the tokenizer splits on slashes (e.g. "Monday/Tuesday"
becomes "Monday", "@/@", "Tuesday").  If using parse-de-berkeley.perl,
the option -split-slash re-joins tokens that are separated by slashes
for parsing then splits them afterwards.
2013-07-24 13:41:21 +01:00
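
Note on the slash handling above: the round trip described in this commit (split on slashes into an "@/@" token, then re-join for parsing) can be sketched in Python as follows. This is a minimal illustration under assumptions, not the logic of tokenizer.perl or parse-de-berkeley.perl; the function names are hypothetical and the real scripts may apply additional conditions.

```python
import re


def split_slashes(text):
    # Replace each slash that sits between two non-space characters with a
    # standalone "@/@" token. Lookarounds keep the neighbours unconsumed so
    # chains such as "a/b/c" are split at every slash.
    return re.sub(r"(?<=\S)/(?=\S)", " @/@ ", text)


def join_slashes(text):
    # Inverse step: collapse the standalone "@/@" token back into a slash.
    return text.replace(" @/@ ", "/")


if __name__ == "__main__":
    tokens = split_slashes("Monday/Tuesday").split()
    print(tokens)
    print(join_slashes(" ".join(tokens)))
```

Using a distinctive token such as "@/@" keeps the split reversible, since a bare "/" token could not be told apart from a slash that was already separated in the input.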
Hieu Hoang
f96a82d26c add normalize-punctuation.perl, from WMT 2013-05-16 17:03:37 +01:00
amittai
7ca271b200 fixed typo 2013-02-26 19:47:44 -08:00
Pidong Wang
eecefbede6 add multi-threading feature to the tokenizer.perl 2012-06-30 23:04:09 +08:00
phikoehn
6af0b62b8a bug fix 2012-06-26 22:49:59 +01:00
Hieu Hoang
93bff3f201 lock m_vocab variable access in Encode() and Lookup(). Other functions are still not threadsafe 2012-06-26 13:33:34 -04:00
Hieu Hoang
debe090426 Change Bin to RealBin. Thanks to Tom Hoar 2012-06-26 11:57:23 -04:00
phikoehn
135e38d355 escape bar character with proper html escape sequence 2012-06-25 23:37:59 +01:00
phikoehn
2e370ed11b more escaping in tokenizer; wrapper for berkeley parser (german) 2012-05-30 00:58:18 +01:00
phikoehn
561b9ac956 minor fixes 2012-05-26 00:09:50 +01:00
phikoehn
366d427ce6 minor fixes 2012-04-12 00:25:57 +01:00
phikoehn
f15e6515a4 scripts to escape moses specific characters 2012-03-23 07:21:17 +00:00
phikoehn
4d0fc996ba bug fix to filter hierarchical 2012-03-23 07:17:08 +00:00
phikoehn
791b5a7676 lotsa minor changes: mostly bug fixes, tokenizer now escapes special Moses characters (|<>&) 2012-03-20 04:57:37 +00:00
Barry Haddow
83bb286809 Option to disable buffering (from Tom Hoar) 2012-01-18 08:55:12 +00:00
Hieu Hoang
038892601e reminder of language codes 2011-11-30 14:29:34 +07:00
bgottesman
24f5bf6723 when detokenizing, remove whitespace between a pair of CJK (Chinese/Japanese/Korean) words
This gets the Chinese and Japanese tests working, so remove the failure expectation.


git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@4134 1f5c12ca-751b-0410-a591-d2e778427230
2011-08-08 15:30:54 +00:00
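
Note on the CJK change above: removing whitespace between a pair of CJK words can be sketched in Python as below. This is an illustration only, not the detokenizer.perl implementation. The function name is hypothetical, and the character ranges (Han ideographs, Hiragana and Katakana, Hangul syllables) are an assumption about what counts as CJK here.

```python
import re

# Assumed CJK ranges: CJK Unified Ideographs, Hiragana + Katakana,
# and Hangul Syllables. The actual detokenizer may use different classes.
CJK = r"\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af"

# Whitespace that has a CJK character on both sides.
CJK_GAP = re.compile(rf"(?<=[{CJK}])\s+(?=[{CJK}])")


def remove_cjk_spaces(text):
    # Drop whitespace only when both neighbours are CJK characters;
    # spaces next to Latin text are left in place.
    return CJK_GAP.sub("", text)


if __name__ == "__main__":
    print(remove_cjk_spaces("我 喜欢 Moses"))
```

Spaces between a CJK word and a non-CJK word are kept, which matches the commit's wording of "between a pair of CJK words".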
bgottesman
14587cdafc fix a detokenization bug that was preventing the removal of the whitespace following a contracted French or Italian article/pronoun (e.g. "l' immigration") when the contraction was the second-last word in the segment
remove the expectation of failure on the corresponding unit test


git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@4133 1f5c12ca-751b-0410-a591-d2e778427230
2011-08-08 15:02:56 +00:00
rafpayen
cdc4179ce1 Add a space before double punctuation signs in French
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@4016 1f5c12ca-751b-0410-a591-d2e778427230
2011-06-16 17:24:25 +00:00
phkoehn
df901e7ce6 added files from Tom Hoar
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@3881 1f5c12ca-751b-0410-a591-d2e778427230
2011-02-16 10:44:26 +00:00
bojar
76174ccd4b mark web/bin/detokenizer.perl as outdated
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@3880 1f5c12ca-751b-0410-a591-d2e778427230
2011-02-14 13:35:04 +00:00
bojar
26ccace946 Czech detokenization
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@3879 1f5c12ca-751b-0410-a591-d2e778427230
2011-02-14 13:32:41 +00:00
bhaddow
4174082396 Non-breaking prefixes for Dutch
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@3764 1f5c12ca-751b-0410-a591-d2e778427230
2010-12-08 16:09:24 +00:00
sarst
0594b13c61 Added nonbreaking_prefix.sv for Swedish
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@3630 1f5c12ca-751b-0410-a591-d2e778427230
2010-10-19 12:45:49 +00:00
phkoehn
fb8b0eb180 new prefix files for tokenizer
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@3467 1f5c12ca-751b-0410-a591-d2e778427230
2010-09-15 16:06:04 +00:00
hieuhoang1972
579253d3cd add lowercaser
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@3380 1f5c12ca-751b-0410-a591-d2e778427230
2010-08-02 14:05:23 +00:00
phkoehn
2ed6804f12 official release of experiment.perl
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@3224 1f5c12ca-751b-0410-a591-d2e778427230
2010-05-04 23:04:10 +00:00