Commit Graph

17477 Commits

Author SHA1 Message Date
Hieu Hoang
d30a1d51c8
Merge pull request #220 from wwaites/master
flag to turn off sentence splitter from emitting <P>
2020-02-27 18:55:17 -08:00
William Waites
696a5d9833 flag to turn off sentence splitter from emitting <P> 2020-02-26 14:08:26 +00:00
Kenneth Heafield
22923ddcf0 Revert "line buffering for tokeniser and truecaser"
This reverts commit 691717c425.
2020-02-20 09:52:08 +00:00
Hieu Hoang
3c881255b1
Merge pull request #219 from wwaites/master
line buffering for tokeniser and truecaser
2020-02-19 10:35:29 -08:00
William Waites
691717c425 line buffering for tokeniser and truecaser 2020-02-17 14:29:24 +00:00
Hieu Hoang
4c5e89f075
Merge pull request #218 from veer66/master
Add AARCH64 support
2020-01-22 11:30:03 -08:00
Vee Satayamas
5694efe10b Add AARCH64 support 2020-01-16 09:13:03 +00:00
Hieu Hoang
e4a52f14e4
Merge pull request #217 from moses-smt/alvations-patch-2
Proper spacing for sent-split perl script
2020-01-05 19:46:25 -08:00
alvations
d03df21e88
Proper spacing 2020-01-06 11:43:31 +08:00
Hieu Hoang
f46ee7c5ac get rid of boost thread local code 2020-01-05 18:56:49 -08:00
Hieu Hoang
745e03b4fc use c++11 thread local construct instead of boost 2020-01-05 18:09:57 -08:00
Hieu Hoang
fdabcd34f8 Merge branch 'master' of github.com:moses-smt/mosesdecoder 2020-01-05 17:29:14 -08:00
Hieu Hoang
afb353b430 limit thread queue to x2 number of threads 2020-01-05 17:29:04 -08:00
Hieu Hoang
25ec481655
Merge pull request #216 from HjalmarrSv/patch-1
Modernized
2020-01-02 03:38:55 +00:00
HjalmarrSv
fa747062dc
Modernized
I wanted to properly parse the links on https://dumps.wikimedia.org/mirrors.html when the page is copied as text.
My proposed changes do the job.
Basically, I replaced the + at the end of line 5 with *(\/)?
The pipe symbol could lead to crashes, which is why I broke line 5 up into three lines. After reading various posts, I suggest not using the pipe (|).
2019-12-17 20:40:51 +01:00
Barry Haddow
a89691fee3 attempt to handle Korean better; only consider horizontal space in final split 2019-12-16 15:52:45 +00:00
Barry Haddow
2cff8ff6dd split word on any type of space 2019-12-09 17:04:09 +00:00
Hieu Hoang
41b31167fd
Merge pull request #215 from moses-smt/alvations-patch-normalization
Single quotes should be escaped as single quotes.
2019-11-24 18:38:05 -08:00
alvations
f6d7adde15
Single quotes should be escaped as single quotes. 2019-11-25 10:10:40 +08:00
Barry Haddow
74d54b54c3 2 letter codes 2019-11-08 15:36:22 +00:00
Barry Haddow
1037070026 support for several Indic languages 2019-11-08 14:56:58 +00:00
Barry Haddow
b1163966b1 initial hi non-breaking prefixes 2019-11-05 16:59:40 +00:00
Barry Haddow
61b1d06570 list items 2019-11-05 16:52:50 +00:00
Barry Haddow
4da86c360f rupees 2019-11-05 16:02:19 +00:00
Barry Haddow
56b2bad907 fix abbrev rule 2019-11-05 15:58:07 +00:00
Barry Haddow
3910cd6c46 devanagari fix 2019-10-31 21:28:43 +00:00
Barry Haddow
2affb9b624 reorganise indic support 2019-10-31 16:50:17 +00:00
Barry Haddow
d708e26b60 use block notation for indic scripts 2019-10-31 16:12:59 +00:00
Barry Haddow
0fef8ebf4c fix nbp 2019-10-31 16:08:56 +00:00
Barry Haddow
b1d9fb6d75 full cjk test 2019-10-28 09:53:45 +00:00
Barry Haddow
8ebebbc680 Merge branch 'master' of github.com:moses-smt/mosesdecoder 2019-10-28 09:48:40 +00:00
Hieu Hoang
286188b82a
Merge pull request #214 from JetRunner/patch-1
Fix the incorrect processing considering fullwidth number character
2019-10-18 19:54:46 -07:00
Kevin Canwen Xu
5d3331b922
Update replace-unicode-punctuation.perl 2019-10-14 16:33:58 +08:00
alvations
555829a771
Undoing 0578892581
Causes abbreviations to not split when ending with a full stop, e.g.:

> The restructuring of IBM was essential to enable it organisationally to take up the responsibilities entrusted in the role with the recent changes in the policy and legislations, revised charter of function of IBM and the new activities and initiatives undertaken by IBM. IBM is also engaged in handholding the States for auction of mineral blocks for greater transparency in allocation of mineral concessions.
2019-10-01 05:27:06 +08:00
Barry Haddow
486dce270f debug 2019-09-30 16:58:21 +01:00
Barry Haddow
9bffde57ba revert 05788925 2019-09-30 16:53:06 +01:00
Barry Haddow
257d7e5e66 enable custom non breaking prefixes 2019-09-30 16:52:24 +01:00
Barry Haddow
01a8ec41e8 Merge branch 'master' of github.com:moses-smt/mosesdecoder 2019-09-30 15:33:33 +01:00
Barry Haddow
768944d851 do not add spaces in cjk 2019-09-30 15:33:26 +01:00
Hieu Hoang
b21b071a66
Merge pull request #213 from titsuki/enable-strict
Enable use strict pragma
2019-09-25 01:04:13 +01:00
titsuki
490dc3996a Enable use strict pragma 2019-09-23 15:40:13 +09:00
Hieu Hoang
fd06cdf026
Merge pull request #212 from moses-smt/alvations-patch-regexes
The dot before an acronym should be optional.
2019-09-04 08:07:06 +01:00
alvations
0578892581
The dot before an acronym should be optional. 2019-09-04 14:16:41 +08:00
Hieu Hoang
9f08d77b0d
Merge pull request #211 from achimr/master
Support for Urdu in sentence splitter
2019-08-21 22:05:45 +01:00
Achim Ruopp
7ad5ffa0c0 Support for Urdu in sentence splitter 2019-07-10 10:48:32 -04:00
Hieu Hoang
158d252389 tweak readme 2019-06-08 18:22:39 +01:00
Hieu Hoang
c0545019eb
Merge pull request #210 from mjpost/patch-1
escape angle brackets
2019-04-27 21:23:50 +01:00
Matt Post
63c450b401
escape angle brackets
The script doesn't escape angle brackets, which can result in bad SGML/XML output. This fixes that, although ideally this should be implemented with a proper parser and dumper.
2019-04-26 14:24:07 -04:00
Hieu Hoang
187a75cb55
Merge pull request #209 from joelb-git/multi-bleu-detok-non-ascii-fix
Fix non-ASCII lowercasing
2019-03-01 23:26:12 +00:00
Joel Barry
fdb7384d3d Fix non-ASCII lowercasing 2019-02-27 10:17:29 -05:00