Rico Sennrich
2a4a44b5c0
move files to package structure; add setup.py
2018-05-16 11:44:24 +01:00
Rico Sennrich
f4f95998c0
fix regression from 7bb1c: don't duplicate empty line
...
fixes #48
2018-05-01 21:27:22 +01:00
Rico Sennrich
662efd1b9d
don't strip UTF-8 whitespace
2018-05-01 19:14:13 +01:00
Rico Sennrich
c00684c035
more tests
2018-04-26 11:41:43 +01:00
Rico Sennrich
9bf7efba31
testing
2018-04-26 09:56:33 +01:00
Rico Sennrich
d40de91976
whitespace-only splitting everywhere.
2018-04-25 16:58:51 +01:00
Rico Sennrich
9a95f9f740
update changelog
2018-03-28 09:19:42 +01:00
Rico Sennrich
aa1bd9fe51
get_vocabulary: don't crash on double whitespace or empty line
2018-03-26 19:04:10 +01:00
Rico Sennrich
7bb1c3d670
don't break on leading whitespace
2018-03-26 10:35:33 +01:00
Rico Sennrich
1d4c3ca4e9
don't assume trailing whitespace has length 1.
...
fixes bug introduced in commit 30e5be, and issue #38.
2018-03-25 14:58:58 +01:00
Rico Sennrich
5300845b52
don't break when same BPE file is used multiple times in script.
...
fixes #39.
2018-03-25 15:45:14 +02:00
Rico Sennrich
0541c3ef0f
make get_vocab consistent with new whitespace-only splitting in apply_bpe
...
(thanks to Ondrej Bojar)
2018-03-14 21:15:47 +00:00
Rico Sennrich
27f1ab81eb
fix regression in commit 30e5be (end-of-line token was segmented wrong)
2018-03-06 17:22:41 +00:00
Rico Sennrich
db8a8687d1
don't crash on double spaces
2018-03-06 14:49:47 +00:00
Rico Sennrich
30e5be7bfb
don't silently replace unicode characters with space or newline.
...
should fix #29.
2018-03-06 12:01:29 +00:00
Rico Sennrich
75773ed42b
reference to package
2018-03-01 19:02:04 +00:00
Rico Sennrich
b4be0ebb87
fix number of arguments in test_glossaries.encode_mock
...
fixes #36
2018-01-22 17:48:59 +00:00
Rico Sennrich
80b7c1449e
Merge pull request #35 from obo/patch-1
...
add up repeated entries with --dist-input
2018-01-12 15:44:15 +00:00
Ondrej Bojar
2c10ebc13a
add up repeated entries with --dist-input
...
This allows directly reading the concatenation of several outputs of get_vocab.py, instead of adding them up with a separate script.
2018-01-05 10:59:48 +01:00
Rico Sennrich
666a914f3e
Merge pull request #34 from ozancaglayan/fixes
...
Fixes
2018-01-03 16:19:50 +00:00
Ozan Caglayan
8247488c35
remove unused imports, fix trailing whitespace
2017-12-21 21:42:38 +01:00
Ozan Caglayan
12406420ce
do not force system's default python
...
This is to make sure that the scripts are executed with the interpreter
defined in the environment instead of what has been installed as
/usr/bin/python.
2017-12-21 21:38:22 +01:00
Ozan Caglayan
818e78f3a8
add .gitignore file
2017-12-21 21:37:07 +01:00
Rico Sennrich
a24880065b
Merge pull request #32 from ozancaglayan/master
...
learn_joint_bpe_and_vocab: Fix parameter passing
2017-12-18 16:51:56 +00:00
Ozan Çağlayan
1290ace1e1
learn_joint_bpe_and_vocab: Fix parameter passing
...
args.separator was being passed to merges instead of separator.
2017-12-18 17:24:24 +01:00
Rico Sennrich
6fd0151fec
Merge pull request #31 from Proyag/master
...
Option to apply fewer BPE operations than learned
2017-12-15 13:47:05 +00:00
Proyag
4b80e3c2af
Option to apply fewer merge operations than learned
2017-12-14 12:08:19 +00:00
Rico Sennrich
3d28265d77
update toy BPE example to use the same representation as learn_bpe.py again
...
learn_bpe.py was changed in commit a749a7 to make end-of-word representation more consistent.
2017-10-05 14:54:13 +01:00
Rico Sennrich
8ba000d953
line buffering for apply_bpe.py
...
python 3 only so far; not sure how to make this work in python 2
2017-08-30 18:20:19 +02:00
Rico Sennrich
88731360d8
cache persists within BPE instance, but not across BPE instances
2017-06-09 13:01:56 +03:00
Rico Sennrich
dfa5db52f4
typo
2017-05-20 15:29:52 +01:00
Rico Sennrich
b4f2c8dd30
Merge pull request #23 from jvdbogae/master
...
chmod +x apply_bpe.py
2017-05-10 10:14:24 +01:00
Joachim Van den Bogaert
8624cfd796
Somehow, apply_bpe.py ended up non-executable, resulting in an empty training corpus and a failed AMUNMT training. When cleaning up afterwards, the AMUNMT example training script deletes and re-clones the subword-nmt repo, so apply_bpe.py ends up non-executable again (even if it had been chmod +x'ed).
2017-05-09 12:00:17 +02:00
Rico Sennrich
629c5066d2
Merge pull request #22 from Unbabel/feat/glossaries
...
Feat/glossaries
2017-05-01 16:36:02 +01:00
dimesq
11a933b1cf
fix glossaries feature
2017-05-01 11:19:43 +01:00
Rico Sennrich
78fabda946
comments
2017-04-28 10:46:08 +01:00
dimesq
bd96908bd5
Implement glossaries feature
2017-04-27 10:21:10 +01:00
dimesq
d95a5c834a
Add tests
2017-04-25 20:48:30 +01:00
Rico Sennrich
15137ff114
changelog
2017-04-21 11:25:06 +01:00
Rico Sennrich
4597d6986d
update README
2017-04-21 11:13:09 +01:00
Rico Sennrich
399db35103
Merge branch 'master' into vocab
2017-04-21 11:00:39 +01:00
Rico Sennrich
2d5a3ecdbc
remove subword marker at end-of-line
2017-04-07 15:13:26 +02:00
Rico Sennrich
90fa4afd13
fix merge conflict
2017-04-01 21:25:05 +01:00
Rico Sennrich
b481fdc4c0
Merge branch 'master' into vocab
2017-03-09 14:07:07 +00:00
Rico Sennrich
fb526f1b00
rename --is-dict to --dict-input
2017-02-27 15:57:11 +00:00
Martin Boyanov
f37902dec6
Allow passing in a word-count file instead of iterating through the whole dataset
2017-02-25 14:17:56 +02:00
Rico Sennrich
4c54e1df2e
make max deterministic by using symbol pair as secondary sort key
2017-02-22 13:58:21 +00:00
Rico Sennrich
669255833f
acknowledgements
2017-02-20 10:54:15 +00:00
Rico Sennrich
f83508017c
documentation fix
2017-02-10 12:36:33 +00:00
Rico Sennrich
b2eb49607c
consistent utf-8 encoding across python versions and environment variables
2017-02-10 11:27:47 +00:00