Jeremy Gwinnup
a5fb4d1550
Fixed bug in tokenizer.perl where comma-separated lists of single characters aren't handled correctly
input> A,B,C,D,E,F
yielded> A, B,C , D,E , F
now yields> A, B, C, D, E, F
Updated Russian nonbreaking prefixes list with capital letters
2013-08-16 14:39:50 -04:00
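The comma-list symptom described in the commit above (every second comma left unsplit) is typical of a substitution that consumes the characters on both sides of the comma. The following is a minimal Python sketch of that failure mode and of a lookaround-based fix. It is an illustration under assumptions, not the actual Perl code from tokenizer.perl: the function names and both patterns are hypothetical, and the faulty variant does not reproduce the exact spacing quoted in the commit message.

```python
import re


def split_commas_consuming(text):
    # Hypothetical faulty variant: the pattern consumes the non-digit
    # after each comma, so that character cannot serve as the left
    # context of the next comma and every second comma is skipped.
    return re.sub(r"(\D),(\D)", r"\1, \2", text)


def split_commas_lookaround(text):
    # Hypothetical fixed variant: lookbehind and lookahead check the
    # neighbours without consuming them, so adjacent commas all match.
    return re.sub(r"(?<=\D),(?=\D)", ", ", text)


print(split_commas_consuming("A,B,C,D,E,F"))   # A, B,C, D,E, F
print(split_commas_lookaround("A,B,C,D,E,F"))  # A, B, C, D, E, F
print(split_commas_lookaround("1,000"))        # 1,000 (digits untouched)
```

Leaving commas between digits alone is a design choice in this sketch, so that numbers such as 1,000 stay in one piece.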
Phil Williams
1238041f98
Add option to do Penn Treebank style tokenization
tokenizer.perl and detokenizer.perl now have an option called -penn
which does Penn Treebank-like tokenization (English only). This is
useful if your pipeline involves processing the corpus with tools
trained on PTB-tokenized text.
Unlike PTB, the tokenizer splits on slashes (e.g. "Monday/Tuesday"
becomes "Monday", "@/@", "Tuesday"). If using parse-de-berkeley.perl,
the option -split-slash re-joins tokens that are separated by slashes
for parsing then splits them afterwards.
2013-07-24 13:41:21 +01:00
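The slash handling described in the commit above can be sketched in Python as follows. Only the input/output pair ("Monday/Tuesday" becomes "Monday", "@/@", "Tuesday") comes from the commit message. The function names and the regular expression are assumptions made for illustration and are not taken from tokenizer.perl or parse-de-berkeley.perl.

```python
import re


def split_slashes(text):
    # Replace each slash that sits between two non-space characters with
    # the standalone token "@/@". Lookarounds keep the neighbours
    # available, so chains like a/b/c are split at every slash.
    return re.sub(r"(?<=\S)/(?=\S)", " @/@ ", text).split()


def rejoin_slashes(tokens):
    # Inverse step, in the spirit of the -split-slash option: glue the
    # tokens around each "@/@" marker back together with a plain slash.
    return " ".join(tokens).replace(" @/@ ", "/")


tokens = split_slashes("Monday/Tuesday")
print(tokens)                  # ['Monday', '@/@', 'Tuesday']
print(rejoin_slashes(tokens))  # Monday/Tuesday
```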
Hieu Hoang
f96a82d26c
add normalize-punctuation.perl, from WMT
2013-05-16 17:03:37 +01:00
amittai
7ca271b200
fixed typo
2013-02-26 19:47:44 -08:00
Pidong Wang
eecefbede6
add multi-threading feature to tokenizer.perl
2012-06-30 23:04:09 +08:00
phikoehn
6af0b62b8a
bug fix
2012-06-26 22:49:59 +01:00
Hieu Hoang
93bff3f201
lock m_vocab variable access in Encode() and Lookup(). Other functions are still not threadsafe
2012-06-26 13:33:34 -04:00
Hieu Hoang
debe090426
Change Bin to RealBin. Thanks to Tom Hoar
2012-06-26 11:57:23 -04:00
phikoehn
135e38d355
escape bar character with proper html escape sequence
2012-06-25 23:37:59 +01:00
phikoehn
2e370ed11b
more escaping in tokenizer; wrapper for berkeley parser (german)
2012-05-30 00:58:18 +01:00
phikoehn
561b9ac956
minor fixes
2012-05-26 00:09:50 +01:00
phikoehn
366d427ce6
minor fixes
2012-04-12 00:25:57 +01:00
phikoehn
f15e6515a4
scripts to escape moses specific characters
2012-03-23 07:21:17 +00:00
phikoehn
4d0fc996ba
bug fix to filter hierarchical
2012-03-23 07:17:08 +00:00
phikoehn
791b5a7676
lotsa minor changes: mostly bug fixes, tokenizer now escapes special Moses characters (|<>&)
2012-03-20 04:57:37 +00:00
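A minimal Python sketch of escaping the four characters named in the commit above (|<>&). The mapping uses standard HTML character references. Whether the Perl scripts emitted exactly these sequences at this commit is an assumption, and the names MOSES_ESCAPES, escape_special and unescape_special are hypothetical.

```python
# Assumed mapping: standard HTML entities, plus the numeric character
# reference for the bar. "&" comes first so that the ampersands
# introduced by the other replacements are not escaped a second time.
MOSES_ESCAPES = [
    ("&", "&amp;"),
    ("|", "&#124;"),
    ("<", "&lt;"),
    (">", "&gt;"),
]


def escape_special(text):
    for raw, escaped in MOSES_ESCAPES:
        text = text.replace(raw, escaped)
    return text


def unescape_special(text):
    # Undo in reverse order so "&amp;" is restored last.
    for raw, escaped in reversed(MOSES_ESCAPES):
        text = text.replace(escaped, raw)
    return text


line = "a | b <c> & d"
print(escape_special(line))  # a &#124; b &lt;c&gt; &amp; d
print(unescape_special(escape_special(line)) == line)  # True
```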
Barry Haddow
83bb286809
Option to disable buffering (from Tom Hoar)
2012-01-18 08:55:12 +00:00
Hieu Hoang
038892601e
reminder of language codes
2011-11-30 14:29:34 +07:00
bgottesman
24f5bf6723
when detokenizing, remove whitespace between a pair of CJK (Chinese/Japanese/Korean) words
This gets the Chinese and Japanese tests working, so remove the failure expectation.
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@4134 1f5c12ca-751b-0410-a591-d2e778427230
2011-08-08 15:30:54 +00:00
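The CJK detokenization rule in the commit above (drop whitespace between a pair of CJK words) can be approximated in Python as below. This is a sketch under assumptions: the Unicode ranges chosen here (hiragana and katakana, CJK Unified Ideographs, Hangul syllables) and the name join_cjk are illustrative and are not taken from detokenizer.perl.

```python
import re

# Assumed character ranges: kana, CJK Unified Ideographs, Hangul syllables.
CJK = "\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af"

# Spaces with a CJK character on both sides. Lookarounds leave the
# characters themselves in place so runs of single-character tokens collapse.
CJK_GAP = re.compile("(?<=[" + CJK + "]) +(?=[" + CJK + "])")


def join_cjk(text):
    return CJK_GAP.sub("", text)


print(join_cjk("我 爱 你"))      # 我爱你
print(join_cjk("hello world"))  # hello world (non-CJK spacing kept)
```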
bgottesman
14587cdafc
fix a detokenization bug that was preventing the removal of the whitespace following a contracted French or Italian article/pronoun (e.g. "l' immigration") when the contraction was the second-last word in the segment
remove the expectation of failure on the corresponding unit test
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@4133 1f5c12ca-751b-0410-a591-d2e778427230
2011-08-08 15:02:56 +00:00
rafpayen
cdc4179ce1
Add a space before double punctuation signs in French
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@4016 1f5c12ca-751b-0410-a591-d2e778427230
2011-06-16 17:24:25 +00:00
phkoehn
df901e7ce6
added files from Tom Hoar
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@3881 1f5c12ca-751b-0410-a591-d2e778427230
2011-02-16 10:44:26 +00:00
bojar
76174ccd4b
mark web/bin/detokenizer.perl as outdated
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@3880 1f5c12ca-751b-0410-a591-d2e778427230
2011-02-14 13:35:04 +00:00
bojar
26ccace946
Czech detokenization
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@3879 1f5c12ca-751b-0410-a591-d2e778427230
2011-02-14 13:32:41 +00:00
bhaddow
4174082396
Non-breaking prefixes for Dutch
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@3764 1f5c12ca-751b-0410-a591-d2e778427230
2010-12-08 16:09:24 +00:00
sarst
0594b13c61
Added nonbreaking_prefix.sv for Swedish
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@3630 1f5c12ca-751b-0410-a591-d2e778427230
2010-10-19 12:45:49 +00:00
phkoehn
fb8b0eb180
new prefix files for tokenizer
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@3467 1f5c12ca-751b-0410-a591-d2e778427230
2010-09-15 16:06:04 +00:00
hieuhoang1972
579253d3cd
add lowercaser
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@3380 1f5c12ca-751b-0410-a591-d2e778427230
2010-08-02 14:05:23 +00:00
phkoehn
2ed6804f12
official release of experiment.perl
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@3224 1f5c12ca-751b-0410-a591-d2e778427230
2010-05-04 23:04:10 +00:00