the binary file header, including <unk>. But the vocabulary was sized based on the ARPA file
count (excluding <unk>). Then when the binary file was loaded, the vocabulary size was based on
the count including <unk>. Fix this by pre-padding vocabulary to the count including <unk>.
Also, some minor cleanups: remove a debug message and change some always-true returns to void.
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@4145 1f5c12ca-751b-0410-a591-d2e778427230
- Fix case where "foo bar baz" appears but "bar baz" does not. Previously probing silently returned the wrong answer and trie silently broke.
- More aggressive recombination: if "baz quux" is never followed by any word, then do not include "bar" in the state.
- kenlm assumes that "foo bar" is present if "foo bar baz" is. This is now checked.
- Binary format version number bump because the format has changed to support the above.
- Lower memory consumption trie building. But it will take longer for to ensure correct handling of blanks and aggressive recombination.
- Fix progress bar newlines on trie building.
Agrees with SRI's 1-best outputs on the WMT 10 evaluation set.
git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@3847 1f5c12ca-751b-0410-a591-d2e778427230