akimbal1
915c29b0dd
detokenization fixes and features
2015-02-15 17:19:47 -05:00
akimbal1
eff60db207
stop treating dash like hyphen
2015-02-15 00:23:29 -05:00
akimbal1
6352dc773c
closer match to perl tokenizer
2015-02-14 23:37:44 -05:00
akimbal1
362e6a9374
remove spurious endl
2015-02-02 15:57:04 -05:00
akimbal1
8ea1c9fd40
alignment for hieu
2015-02-02 12:55:21 -05:00
Hieu Hoang
884a0b1c90
forgot to add Parameters.cpp. Change c++11 to c++0x to support older compilers (on Ubuntu 12.04 etc).
2015-01-30 17:45:20 +00:00
Hieu Hoang
1dea58e945
separate parameters into its own class
2015-01-25 15:02:33 +00:00
Hieu Hoang
5d2b0224d6
Jamfile for tokenizer
2015-01-25 14:00:35 +00:00
akimbal1
d38dcd89bb
add glib-2.0 for better unicodification and faster implementation
2015-01-23 13:35:09 -05:00
Kenneth Heafield
e30065072e
C++ tokenizer based on RE2. Not by me.
...
Some differences from Moses tokenizer: fraction characters count as numbers, _ handling, URLs
Currently 3x slower than perl :'(. Looking to make it faster by composing regex substitutions.
TODO eliminate sprintf and fixed-size buffers.
2015-01-21 12:23:44 -05:00