Commit Graph

79 Commits

Author SHA1 Message Date
Rico Sennrich
2a4a44b5c0 move files to package structure; add setup.py 2018-05-16 11:44:24 +01:00
Rico Sennrich
f4f95998c0 fix regression from 7bb1c: don't duplicate empty line
fixes #48
2018-05-01 21:27:22 +01:00
Rico Sennrich
662efd1b9d don't strip UTF-8 whitespace 2018-05-01 19:14:13 +01:00
Rico Sennrich
c00684c035 more tests 2018-04-26 11:41:43 +01:00
Rico Sennrich
9bf7efba31 testing 2018-04-26 09:56:33 +01:00
Rico Sennrich
d40de91976 whitespace-only splitting everywhere. 2018-04-25 16:58:51 +01:00
Rico Sennrich
9a95f9f740 update changelog 2018-03-28 09:19:42 +01:00
Rico Sennrich
aa1bd9fe51 get_vocabulary: don't crash on double whitespace or empty line 2018-03-26 19:04:10 +01:00
Rico Sennrich
7bb1c3d670 don't break on leading whitespace 2018-03-26 10:35:33 +01:00
Rico Sennrich
1d4c3ca4e9 don't assume trailing whitespace has length 1.
fixes bug introduced in commit 30e5be, and issue #38.
2018-03-25 14:58:58 +01:00
Rico Sennrich
5300845b52 don't break when same BPE file is used multiple times in script.
fixes #39.
2018-03-25 15:45:14 +02:00
Rico Sennrich
0541c3ef0f make get_vocab consistent with new whitespace-only splitting in apply_bpe
(thanks to Ondrej Bojar)
2018-03-14 21:15:47 +00:00
Rico Sennrich
27f1ab81eb fix regression in commit 30e5be (end-of-line token was segmented wrong) 2018-03-06 17:22:41 +00:00
Rico Sennrich
db8a8687d1 don't crash on double spaces 2018-03-06 14:49:47 +00:00
Rico Sennrich
30e5be7bfb don't silently replace unicode characters with space or newline.
should fix #29.
2018-03-06 12:01:29 +00:00
Rico Sennrich
75773ed42b reference to package 2018-03-01 19:02:04 +00:00
Rico Sennrich
b4be0ebb87 fix number of arguments in test_glossaries.encode_mock
fixes #36
2018-01-22 17:48:59 +00:00
Rico Sennrich
80b7c1449e Merge pull request #35 from obo/patch-1
add up repeated entries with --dict-input
2018-01-12 15:44:15 +00:00
Ondrej Bojar
2c10ebc13a add up repeated entries with --dict-input
This allows directly reading the concatenation of several outputs of get_vocab.py, instead of adding them up with a separate script.
2018-01-05 10:59:48 +01:00
Rico Sennrich
666a914f3e Merge pull request #34 from ozancaglayan/fixes
Fixes
2018-01-03 16:19:50 +00:00
Ozan Caglayan
8247488c35 remove unused imports, fix trailing whitespace 2017-12-21 21:42:38 +01:00
Ozan Caglayan
12406420ce do not force system's default python
This is to make sure that the scripts are executed with the interpreter
defined in the environment instead of what has been installed as
/usr/bin/python.
2017-12-21 21:38:22 +01:00
Ozan Caglayan
818e78f3a8 add .gitignore file 2017-12-21 21:37:07 +01:00
Rico Sennrich
a24880065b Merge pull request #32 from ozancaglayan/master
learn_joint_bpe_and_vocab: Fix parameter passing
2017-12-18 16:51:56 +00:00
Ozan Çağlayan
1290ace1e1 learn_joint_bpe_and_vocab: Fix parameter passing
args.separator was being passed to merges instead of separator.
2017-12-18 17:24:24 +01:00
Rico Sennrich
6fd0151fec Merge pull request #31 from Proyag/master
Option to apply fewer BPE operations than learned
2017-12-15 13:47:05 +00:00
Proyag
4b80e3c2af Option to apply fewer merge operations than learned 2017-12-14 12:08:19 +00:00
Rico Sennrich
3d28265d77 update toy example to use the same BPE representation as learn_bpe.py again
learn_bpe.py was changed in commit a749a7 to make end-of-word representation more consistent.
2017-10-05 14:54:13 +01:00
Rico Sennrich
8ba000d953 line buffering for apply_bpe.py
python 3 only so far; not sure how to make this work in python 2
2017-08-30 18:20:19 +02:00
Rico Sennrich
88731360d8 cache persists within BPE instance, but not across BPE instances 2017-06-09 13:01:56 +03:00
Rico Sennrich
dfa5db52f4 typo 2017-05-20 15:29:52 +01:00
Rico Sennrich
b4f2c8dd30 Merge pull request #23 from jvdbogae/master
chmod +x apply_bpe.py
2017-05-10 10:14:24 +01:00
Joachim Van den Bogaert
8624cfd796 Somehow, apply_bpe.py ended up non-executable, resulting in an empty training corpus and a failed AMUNMT training. When cleaning up afterwards, the subword-nmt repo is deleted and cloned again by the AMUNMT example training script, so apply_bpe.py ends up non-executable again (even if it had been chmod +x'ed). 2017-05-09 12:00:17 +02:00
Rico Sennrich
629c5066d2 Merge pull request #22 from Unbabel/feat/glossaries
Feat/glossaries
2017-05-01 16:36:02 +01:00
dimesq
11a933b1cf fix glossaries feature 2017-05-01 11:19:43 +01:00
Rico Sennrich
78fabda946 comments 2017-04-28 10:46:08 +01:00
dimesq
bd96908bd5 Implement glossaries feature 2017-04-27 10:21:10 +01:00
dimesq
d95a5c834a Add tests 2017-04-25 20:48:30 +01:00
Rico Sennrich
15137ff114 changelog 2017-04-21 11:25:06 +01:00
Rico Sennrich
4597d6986d update README 2017-04-21 11:13:09 +01:00
Rico Sennrich
399db35103 Merge branch 'master' into vocab 2017-04-21 11:00:39 +01:00
Rico Sennrich
2d5a3ecdbc remove subword marker at end-of-line 2017-04-07 15:13:26 +02:00
Rico Sennrich
90fa4afd13 fix merge conflict 2017-04-01 21:25:05 +01:00
Rico Sennrich
b481fdc4c0 Merge branch 'master' into vocab 2017-03-09 14:07:07 +00:00
Rico Sennrich
fb526f1b00 rename --is-dict to --dict-input 2017-02-27 15:57:11 +00:00
Martin Boyanov
f37902dec6 Allow passing in a word-count file instead of iterating through the whole dataset 2017-02-25 14:17:56 +02:00
Rico Sennrich
4c54e1df2e make max deterministic by using symbol pair as secondary sort key 2017-02-22 13:58:21 +00:00
Rico Sennrich
669255833f acknowledgements 2017-02-20 10:54:15 +00:00
Rico Sennrich
f83508017c documentation fix 2017-02-10 12:36:33 +00:00
Rico Sennrich
b2eb49607c consistent utf-8 encoding across python versions and environment variables 2017-02-10 11:27:47 +00:00