Commit Graph

10562 Commits

Author SHA1 Message Date
Hieu Hoang
a3e3289b08 In corpus mode, replace number with number symbol 2013-07-25 15:54:47 +01:00
Hieu Hoang
76a9730ca8 Merge branch 'master' of github.com:moses-smt/mosesdecoder 2013-07-25 15:23:12 +01:00
Barry Haddow
7081f06413 Fixes to the shared build 2013-07-25 15:24:34 +01:00
Hieu Hoang
e2c2bc59f1 beautify 2013-07-25 15:23:05 +01:00
Hieu Hoang
78381d0213 @NUM@ --> @num@. In case using recaser 2013-07-25 15:16:15 +01:00
Phil Williams
f0b603e6b5 extract-ghkm: write glue grammars for all sentence offsets
extract-parallel now merges separate glue grammars, so remove
previous workaround.
2013-07-25 13:53:32 +01:00
Hieu Hoang
d0172ed5cd create script to convert phrase-table with alignment in Moses' dead-end format to standard format 2013-07-25 12:56:20 +01:00
Hieu Hoang
018998247a create script to convert phrase-table with alignment in Moses' dead-end format to standard format 2013-07-25 12:52:05 +01:00
Hieu Hoang
c0aba71c79 bug processing unknown word with digits 2013-07-25 08:41:59 +01:00
Barry Haddow
f79746b3c2 Merge branch 'master' of github.com:moses-smt/mosesdecoder 2013-07-24 20:49:59 +01:00
Hieu Hoang
6fc21a32fc Merge branch 'master' of github.com:moses-smt/mosesdecoder 2013-07-24 19:01:57 +01:00
Hieu Hoang
c104dee3b2 merge glue grammars, rather than writing them all to the same file. Required by Phil Williams & others when doing syntax extraction 2013-07-24 19:01:46 +01:00
Achim Ruopp
1813f9784b Additional factoring to allow more NE recognizers; bug fixes 2013-07-24 12:44:53 -04:00
Barry Haddow
46ee1ca42d More lattice fixes squashed by merge 2013-07-24 16:09:32 +01:00
Barry Haddow
0ce50a4c70 Merge branch 'master' of github.com:moses-smt/mosesdecoder 2013-07-24 15:58:08 +01:00
Phil Williams
1238041f98 Add option to do Penn Treebank style tokenization
tokenizer.perl and detokenizer.perl now have an option called -penn
which does Penn Treebank-like tokenization (English only).  This is
useful if your pipeline involves processing the corpus with tools
trained on PTB-tokenized text.

Unlike PTB, the tokenizer splits on slashes (e.g. "Monday/Tuesday"
becomes "Monday", "@/@", "Tuesday").  If using parse-de-berkeley.perl,
the option -split-slash re-joins tokens that are separated by slashes
for parsing then splits them afterwards.
2013-07-24 13:41:21 +01:00
Kenneth Heafield
71ae8c9d19 LM/Factory.cpp -> FF/Factory.cpp oops 2013-07-24 12:13:11 +01:00
Ian Johnson
68779c66b9 Merge branch 'master' of github.com:moses-smt/mosesdecoder 2013-07-24 11:52:21 +01:00
Ian Johnson
08f64dea28 Arrow pipeline submodules now use https protocol. 2013-07-24 11:52:14 +01:00
Barry Haddow
d5e40a5b08 Merge branch 'master' of github.com:moses-smt/mosesdecoder 2013-07-24 11:38:23 +01:00
Phil Williams
b5584fdecf extract-ghkm: workaround for extract-parallel issue
Don't write glue grammar or unknown word label files unless the sentence
offset is 0.  This prevents multiple instances of extract-ghkm writing
to the same two files when extract-parallel is used.

TODO Better solutions might be:
 1. modify extract-parallel so that it only configures one instance of
    extract-ghkm to write the glue / unknown-lhs files (like the current
    workaround, this assumes file chunks are representative of the whole)
 2. add multithreading support directly to extract-ghkm
 3. write distinct output files for each extract-ghkm instance and
    combine them on completion
2013-07-23 14:55:16 +01:00
Hieu Hoang
e6a3df7e97 Merge branch 'master' of github.com:moses-smt/mosesdecoder 2013-07-23 13:12:30 +01:00
Hieu Hoang
206b165d14 randlm compile with refactored code. No regression tests yet 2013-07-23 12:56:35 +01:00
Hieu Hoang
9b9e8cc759 eclipse file with randlm 2013-07-23 12:41:02 +01:00
Nadir Durrani
30544ae17e Sample Config File 2013-07-23 12:29:23 +01:00
Nadir Durrani
61e56ecdcd Sample Config File 2013-07-23 12:18:57 +01:00
Barry Haddow
50de0e06d1 Generate correct ini file for lattices 2013-07-23 11:46:37 +01:00
Barry Haddow
8ed8bcafc2 Merge branch 'master' of github.com:moses-smt/mosesdecoder 2013-07-23 11:21:47 +01:00
Barry Haddow
887d5dad62 Restore EMS lattice fixes, squashed by merge. 2013-07-23 10:38:11 +01:00
Phil Williams
91cc7c329e parse-de-berkeley.perl: escape @ characters in input 2013-07-23 10:22:56 +01:00
Hieu Hoang
1e906bea73 add ControlRecombination feature function 2013-07-23 01:38:08 +01:00
Hieu Hoang
42c1c908a5 add ControlRecombination feature function 2013-07-23 01:32:25 +01:00
Barry Haddow
ecc6c7177c Reinstate lattice fixes squashed by merge 2013-07-22 17:25:01 +01:00
Hieu Hoang
2590601708 add ControlRecombination feature function 2013-07-20 23:41:49 +01:00
Hieu Hoang
a098227abe add ControlRecombination feature function 2013-07-20 23:10:50 +01:00
Hieu Hoang
96da822861 Don't deprecate lmodel-oov-feature 2013-07-20 17:20:12 +01:00
Hieu Hoang
b6f8e3c383 Don't mix old and new ini file format 2013-07-20 17:08:03 +01:00
Hieu Hoang
5b7a9af588 refactor RandLM. Compiles with eclipse but not with bjam 2013-07-20 00:19:04 +01:00
Hieu Hoang
d4e641de80 eclipse 2013-07-19 23:19:17 +01:00
Hieu Hoang
11666a8359 RandLM is currently broken 2013-07-19 22:39:20 +01:00
Achim Ruopp
3a668aaccf Merge branch 'master' of github.com:moses-smt/mosesdecoder 2013-07-19 15:56:04 -04:00
unknown
54eb50523b Converted into modulino; added support for French numbers 2013-07-19 14:41:01 -04:00
Hieu Hoang
4a4b1a168d Merge branch 'master' of github.com:moses-smt/mosesdecoder 2013-07-19 18:52:54 +01:00
Kenneth Heafield
2f6e669fb9 Merge branch 'master' of github.com:moses-smt/mosesdecoder 2013-07-19 18:50:29 +01:00
Kenneth Heafield
e1a2b2f0c9 Reduce scope of lm dependency 2013-07-19 18:50:12 +01:00
Hieu Hoang
c77ec1b904 beautify 2013-07-19 13:56:02 +01:00
Hieu Hoang
a95127b972 add default weights for feature functions that aren't tuneable, e.g. OOV feature 2013-07-19 13:24:05 +01:00
Hieu Hoang
8a28178339 add default weights for feature functions that aren't tuneable, e.g. OOV feature 2013-07-19 11:35:50 +01:00
Hieu Hoang
24a9a7949e eclipse 2013-07-19 09:37:33 +01:00
Kenneth Heafield
b5e6b9c959 Factory 2013-07-18 22:54:52 +01:00