mosesdecoder/lm
2012-08-18 12:07:53 -04:00
bhiksha.cc eclipse project 2012-05-28 17:29:46 +01:00
bhiksha.hh KenLM b1daeaf for clang 2012-05-05 00:55:46 -04:00
binary_format.cc KenLM e3b5c55910 including rest costs for probing 2012-06-28 10:58:59 -04:00
binary_format.hh KenLM 98814b2 including faster malloc-backed building and portability improvements 2012-02-28 13:58:00 -05:00
blank.hh Move kenlm up one level, simplify compilation 2011-11-17 12:49:55 +00:00
build_binary.cc KenLM e3b5c55910 including rest costs for probing 2012-06-28 10:58:59 -04:00
clean.sh Move kenlm up one level, simplify compilation 2011-11-17 12:49:55 +00:00
compile.sh Move kenlm up one level, simplify compilation 2011-11-17 12:49:55 +00:00
config.cc KenLM e3b5c55910 including rest costs for probing 2012-06-28 10:58:59 -04:00
config.hh KenLM e3b5c55910 including rest costs for probing 2012-06-28 10:58:59 -04:00
COPYING Move kenlm up one level, simplify compilation 2011-11-17 12:49:55 +00:00
COPYING.LESSER Move kenlm up one level, simplify compilation 2011-11-17 12:49:55 +00:00
enumerate_vocab.hh Move kenlm up one level, simplify compilation 2011-11-17 12:49:55 +00:00
facade.hh Move kenlm up one level, simplify compilation 2011-11-17 12:49:55 +00:00
Jamfile Move max-order to lm directory and direct dependencies. 2012-08-18 12:07:53 -04:00
left_test.cc Cast everything to double before BOOST_CHECK_CLOSE for Yvette 2012-08-06 08:31:02 -04:00
left.hh KenLM maximum n-gram order can now be set via a compile-time flag 2012-08-08 16:22:13 -04:00
LICENSE Move kenlm up one level, simplify compilation 2011-11-17 12:49:55 +00:00
lm_exception.cc Move kenlm up one level, simplify compilation 2011-11-17 12:49:55 +00:00
lm_exception.hh Move kenlm up one level, simplify compilation 2011-11-17 12:49:55 +00:00
max_order.cc Add program to query the KenLM maximum n-gram order 2012-08-08 16:41:29 -04:00
model_test.cc Cast everything to double before BOOST_CHECK_CLOSE for Yvette 2012-08-06 08:31:02 -04:00
model_type.hh KenLM e3b5c55910 including rest costs for probing 2012-06-28 10:58:59 -04:00
model.cc Throw exception for binary files with too large order / Ondrej Bojar 2012-08-15 11:08:08 -04:00
model.hh KenLM maximum n-gram order can now be set via a compile-time flag 2012-08-08 16:22:13 -04:00
ngram_query.cc KenLM e3b5c55910 including rest costs for probing 2012-06-28 10:58:59 -04:00
ngram_query.hh KenLM e3b5c55910 including rest costs for probing 2012-06-28 10:58:59 -04:00
quantize.cc KenLM e3b5c55910 including rest costs for probing 2012-06-28 10:58:59 -04:00
quantize.hh KenLM maximum n-gram order can now be set via a compile-time flag 2012-08-08 16:22:13 -04:00
read_arpa.cc windows-compatible. Thanks to Mike Lagwig 2012-07-10 10:05:14 +01:00
read_arpa.hh KenLM e3b5c55910 including rest costs for probing 2012-06-28 10:58:59 -04:00
README KenLM b1daeaf for clang 2012-05-05 00:55:46 -04:00
return.hh KenLM e3b5c55910 including rest costs for probing 2012-06-28 10:58:59 -04:00
search_hashed.cc KenLM e3b5c55910 including rest costs for probing 2012-06-28 10:58:59 -04:00
search_hashed.hh KenLM e3b5c55910 including rest costs for probing 2012-06-28 10:58:59 -04:00
search_trie.cc KenLM maximum n-gram order can now be set via a compile-time flag 2012-08-08 16:22:13 -04:00
search_trie.hh KenLM e3b5c55910 including rest costs for probing 2012-06-28 10:58:59 -04:00
state.hh KenLM maximum n-gram order can now be set via a compile-time flag 2012-08-08 16:22:13 -04:00
test_nounk.arpa Move kenlm up one level, simplify compilation 2011-11-17 12:49:55 +00:00
test.arpa Move kenlm up one level, simplify compilation 2011-11-17 12:49:55 +00:00
test.sh Move kenlm up one level, simplify compilation 2011-11-17 12:49:55 +00:00
trie_sort.cc Don't segfault trie building when there is no n-gram of a given order. Jon Clark. 2012-08-16 16:01:43 -04:00
trie_sort.hh KenLM maximum n-gram order can now be set via a compile-time flag 2012-08-08 16:22:13 -04:00
trie.cc KenLM e3b5c55910 including rest costs for probing 2012-06-28 10:58:59 -04:00
trie.hh KenLM e3b5c55910 including rest costs for probing 2012-06-28 10:58:59 -04:00
value_build.cc KenLM e3b5c55910 including rest costs for probing 2012-06-28 10:58:59 -04:00
value_build.hh KenLM e3b5c55910 including rest costs for probing 2012-06-28 10:58:59 -04:00
value.hh visual studio compiles but doesn't link 2012-07-11 10:54:21 +01:00
virtual_interface.cc Move kenlm up one level, simplify compilation 2011-11-17 12:49:55 +00:00
virtual_interface.hh Move kenlm up one level, simplify compilation 2011-11-17 12:49:55 +00:00
vocab.cc KenLM e3b5c55910 including rest costs for probing 2012-06-28 10:58:59 -04:00
vocab.hh KenLM e3b5c55910 including rest costs for probing 2012-06-28 10:58:59 -04:00
weights.hh KenLM e3b5c55910 including rest costs for probing 2012-06-28 10:58:59 -04:00
word_index.hh Move kenlm up one level, simplify compilation 2011-11-17 12:49:55 +00:00

Language model inference code by Kenneth Heafield <kenlm at kheafield.com>

THE GIT REPOSITORY https://github.com/kpu/kenlm IS WHERE ACTIVE DEVELOPMENT HAPPENS.  IT MAY RETURN SILENTLY WRONG ANSWERS OR BE SILENTLY BINARY-INCOMPATIBLE WITH STABLE RELEASES.  

The website http://kheafield.com/code/kenlm/ has more documentation.  If you're a decoder developer, please download the latest version from there instead of copying from another decoder.  

Two data structures are supported: probing and trie.  Probing is a probing hash table with keys that are 64-bit hashes of n-grams and floats as values.  Trie is a fairly standard trie but with bit-level packing so it uses the minimum number of bits to store word indices and pointers.  The trie node entries are sorted by word index.  Probing is the fastest and uses the most memory.  Trie uses the least memory and is a bit slower.  
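
As a rough illustration of the probing idea described above, here is a minimal linear-probing table with 64-bit keys and float values.  This is a sketch, not KenLM's implementation: the class name ProbingTableSketch, the use of key 0 to mark empty buckets, and the modulo bucket choice are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of a linear-probing hash table; not KenLM's class.
// Keys are assumed to be 64-bit hashes computed elsewhere.
class ProbingTableSketch {
 public:
  // Allocate more buckets than entries so probe sequences stay short.
  explicit ProbingTableSketch(std::size_t buckets) : buckets_(buckets), used_(0) {
    assert(buckets >= 2);
  }

  // Insert or overwrite.  Key 0 is reserved to mark an empty bucket.
  void Insert(std::uint64_t key, float value) {
    assert(key != 0);
    std::size_t i = key % buckets_.size();
    // Walk forward until we find the key or an empty bucket.
    while (buckets_[i].key != 0 && buckets_[i].key != key) {
      i = (i + 1) % buckets_.size();
    }
    if (buckets_[i].key == 0) {
      // Always keep one bucket empty so that lookups terminate.
      assert(used_ + 1 < buckets_.size());
      ++used_;
    }
    buckets_[i].key = key;
    buckets_[i].value = value;
  }

  // Returns true and stores the value if the key is present.
  bool Find(std::uint64_t key, float *value) const {
    std::size_t i = key % buckets_.size();
    while (buckets_[i].key != 0) {
      if (buckets_[i].key == key) {
        *value = buckets_[i].value;
        return true;
      }
      i = (i + 1) % buckets_.size();
    }
    return false;  // Hit an empty bucket: the key was never inserted.
  }

 private:
  struct Entry {
    std::uint64_t key;
    float value;
  };
  std::vector<Entry> buckets_;
  std::size_t used_;
};
```

The memory cost mentioned above follows from the layout: every bucket, used or not, holds a full 64-bit key and a float.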

With trie, resident memory is 58% of IRST's smallest version and 21% of SRI's compact version.  At the same time, trie's CPU use is 81% of IRST's fastest version and 84% of SRI's fast version.  KenLM's probing hash table implementation goes even faster at the expense of using more memory.  See http://kheafield.com/code/kenlm/benchmark/.  

Binary format via mmap is supported.  Run ./build_binary to make one, then pass the binary file name to the appropriate Model constructor.  
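
For readers unfamiliar with mmap, the following sketch shows the general POSIX technique of mapping a file read-only so its contents can be used in place without copying.  It does not reproduce KenLM's loading code or binary layout; the MappedFileSketch class is hypothetical and assumes a POSIX system.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical RAII wrapper around a read-only memory mapping.
class MappedFileSketch {
 public:
  explicit MappedFileSketch(const char *path) : data_(nullptr), size_(0) {
    assert(path != nullptr);
    int fd = open(path, O_RDONLY);
    if (fd == -1) return;  // Leave the object empty on failure.
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
      std::size_t length = static_cast<std::size_t>(st.st_size);
      void *mapped = mmap(nullptr, length, PROT_READ, MAP_SHARED, fd, 0);
      if (mapped != MAP_FAILED) {
        data_ = mapped;
        size_ = length;
      }
    }
    close(fd);  // The mapping stays valid after the descriptor is closed.
  }

  ~MappedFileSketch() {
    if (data_) munmap(data_, size_);
  }

  // Non-copyable: the mapping has a single owner.
  MappedFileSketch(const MappedFileSketch &) = delete;
  MappedFileSketch &operator=(const MappedFileSketch &) = delete;

  const void *data() const { return data_; }
  std::size_t size() const { return size_; }

 private:
  void *data_;
  std::size_t size_;
};
```

Because pages are loaded lazily by the operating system, opening a large mapped file is cheap and several processes mapping the same file can share its pages.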


PLATFORMS
murmur_hash.cc and bit_packing.hh perform unaligned reads and writes that make the code architecture-dependent.  
It has been successfully tested on x86_64, x86, and PPC64.  
ARM support is reportedly working, at least on the iPhone, but I cannot test this. 
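
To make the portability issue concrete, here is a sketch of bit-level packing that avoids unaligned word accesses by touching one byte at a time.  It is deliberately slow and is not how bit_packing.hh works; per the note above, the real code uses unaligned reads and writes.  The function names and the least-significant-bit-first layout are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of bits needed to represent any value in [0, max_value].
inline unsigned RequiredBitsSketch(std::uint64_t max_value) {
  unsigned bits = 0;
  while (max_value) {
    ++bits;
    max_value >>= 1;
  }
  return bits;
}

// Store the low `length` bits of `value` starting at `bit_offset`.
// Works bit by bit, so it does not depend on alignment or endianness.
inline void WriteBitsSketch(std::vector<std::uint8_t> &buf, std::size_t bit_offset,
                            unsigned length, std::uint64_t value) {
  assert(bit_offset + length <= buf.size() * 8);
  for (unsigned b = 0; b < length; ++b, ++bit_offset) {
    std::uint8_t mask = static_cast<std::uint8_t>(1u << (bit_offset % 8));
    if ((value >> b) & 1u) {
      buf[bit_offset / 8] |= mask;
    } else {
      buf[bit_offset / 8] &= static_cast<std::uint8_t>(~mask);
    }
  }
}

// Read back `length` bits starting at `bit_offset`.
inline std::uint64_t ReadBitsSketch(const std::vector<std::uint8_t> &buf,
                                    std::size_t bit_offset, unsigned length) {
  assert(bit_offset + length <= buf.size() * 8);
  std::uint64_t value = 0;
  for (unsigned b = 0; b < length; ++b, ++bit_offset) {
    std::uint64_t bit = (buf[bit_offset / 8] >> (bit_offset % 8)) & 1u;
    value |= bit << b;
  }
  return value;
}
```

With a vocabulary whose largest word index is 5, for example, each index takes 3 bits instead of a full 32-bit integer, and consecutive entries straddle byte boundaries.  That straddling is why a fast implementation ends up reading words at unaligned addresses.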

Runs on Linux, OS X, Cygwin, and MinGW.  

Hideo Okuma and Tomoyuki Yoshimura from NICT contributed ports to ARM and MinGW.  Hieu Hoang is working on a native Windows port.  


DECODER DEVELOPERS
- I recommend copying the code and distributing it with your decoder.  However, please send improvements upstream as indicated in CONTRIBUTORS.  

- It does not depend on Boost or ICU.  If you use ICU, define HAVE_ICU in util/have.hh (uncomment the line) to avoid a name conflict.  Defining HAVE_BOOST will let you hash StringPiece.  

- Most people have zlib.  If you don't want to depend on that, comment out #define HAVE_ZLIB in util/have.hh.  This will disable loading gzipped ARPA files.  

- There are two build systems: compile.sh and Jamroot+Jamfile.  They're pretty simple and are intended to be reimplemented in your build system.  

- Use either the interface in lm/model.hh or lm/virtual_interface.hh.  Interface documentation is in comments of lm/virtual_interface.hh and lm/model.hh.  

- There are several possible data structures in model.hh.  Use RecognizeBinary in binary_format.hh to determine which one a user has provided.  You probably already implement feature functions as an abstract virtual base class with several children.  I suggest you co-opt this existing virtual dispatch by templatizing the language model feature implementation on the KenLM model identified by RecognizeBinary.  This is the strategy used in Moses and cdec.

- See lm/config.hh for tuning options.  
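
The templatized dispatch suggested in the list above can be sketched as follows.  Everything here is a stand-in: the model structs, the StandInModelType enum, and the FeatureFunction base class are hypothetical and do not use the KenLM headers.  In a real decoder the model type would come from RecognizeBinary and the template argument would be one of the classes in lm/model.hh.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Stand-ins for two model classes.  Not KenLM types.
struct StandInProbingModel {
  static const char *Name() { return "probing"; }
  // Placeholder score; a real model would return a log probability.
  float Score(const std::string &word) const { return -static_cast<float>(word.size()); }
};
struct StandInTrieModel {
  static const char *Name() { return "trie"; }
  float Score(const std::string &word) const { return -static_cast<float>(word.size()); }
};

// Stand-in for the type reported by the binary file's header.
enum StandInModelType { STAND_IN_PROBING, STAND_IN_TRIE };

// The decoder's existing abstract feature function interface (assumed).
class FeatureFunction {
 public:
  virtual ~FeatureFunction() {}
  virtual float Score(const std::string &word) const = 0;
  virtual const char *ModelName() const = 0;
};

// One virtual call at the feature boundary; calls into Model are
// resolved at compile time and can be inlined.
template <class Model> class LanguageModelFeature : public FeatureFunction {
 public:
  float Score(const std::string &word) const override { return model_.Score(word); }
  const char *ModelName() const override { return Model::Name(); }

 private:
  Model model_;
};

// Pick the template instantiation once, when the model is loaded.
std::unique_ptr<FeatureFunction> MakeLanguageModelFeature(StandInModelType type) {
  switch (type) {
    case STAND_IN_PROBING:
      return std::unique_ptr<FeatureFunction>(
          new LanguageModelFeature<StandInProbingModel>());
    case STAND_IN_TRIE:
      return std::unique_ptr<FeatureFunction>(
          new LanguageModelFeature<StandInTrieModel>());
  }
  assert(false && "unrecognized model type");
  return std::unique_ptr<FeatureFunction>();
}
```

The point of the design is that the decoder pays for virtual dispatch only once per feature call, which it was already doing, rather than adding a second virtual call for every language model query.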


CONTRIBUTORS
Contributions to KenLM are welcome.  Please base your contributions on https://github.com/kpu/kenlm and send pull requests (or I might give you commit access).  Downstream copies in Moses and cdec are maintained by overwriting them, so do not make changes there.  


The name was Hieu Hoang's idea, not mine.