
Subword Neural Machine Translation

This repository contains preprocessing scripts to segment text into subword units. The primary purpose is to facilitate the reproduction of our experiments on Neural Machine Translation with subword units (see below for reference).

USAGE INSTRUCTIONS

Check the individual files for usage instructions.

To apply byte pair encoding to word segmentation, invoke these commands:

./learn_bpe.py -s {num_operations} < {train_file} > {codes_file}
./apply_bpe.py -c {codes_file} < {test_file}
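
The learning step repeatedly merges the most frequent pair of adjacent symbols in the training vocabulary. The following is a minimal sketch of that merge loop in the spirit of bpe_toy.py; the toy vocabulary is made up for illustration, and learn_bpe.py adds further options and efficiency optimizations:

import re, collections

def get_stats(vocab):
    """Count the frequency of each adjacent symbol pair in the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    """Replace every occurrence of the given symbol pair with its merged form."""
    v_out = {}
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        v_out[pattern.sub(''.join(pair), word)] = v_in[word]
    return v_out

# toy vocabulary: words as space-separated symbols, '</w>' marks the word end
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for i in range(10):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)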

To segment rare words into character n-grams, do the following:

./get_vocab.py < {train_file} > {vocab_file}
./segment-char-ngrams.py --vocab {vocab_file} -n {order} --shortlist {size} < {test_file}
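
As a rough sketch of the idea (not the exact behaviour of segment-char-ngrams.py): words in a shortlist of the most frequent vocabulary entries are kept intact, and all other words are split into character n-grams joined with the "@@ " continuation marker. The shortlist, the sample sentence, and the marker convention below are assumptions for illustration:

def segment_rare_words(line, shortlist, n):
    """Keep shortlisted words intact; split the rest into character n-grams."""
    out = []
    for word in line.split():
        if word in shortlist:
            out.append(word)
        else:
            ngrams = [word[i:i + n] for i in range(0, len(word), n)]
            # '@@' marks segments that continue the same original word
            out.append('@@ '.join(ngrams))
    return ' '.join(out)

shortlist = {'the', 'cat', 'sat'}  # hypothetical shortlist of frequent words
print(segment_rare_words('the cat will transmogrify', shortlist, 3))
# -> the cat wil@@ l tra@@ nsm@@ ogr@@ ify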

The original segmentation can be restored with a simple replacement:

sed -r 's/(@@ )|(@@ ?$)//g'
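
The same replacement can be applied in Python with re.sub; the sample sentence is illustrative:

import re

segmented = 'the bicycl@@ ist was un@@ fazed'
restored = re.sub(r'(@@ )|(@@ ?$)', '', segmented)
print(restored)  # the bicyclist was unfazed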

PUBLICATIONS

The segmentation methods are described in:

Rico Sennrich, Barry Haddow and Alexandra Birch (2016): Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.

ACKNOWLEDGMENTS

This project has received funding from Samsung Electronics Polska sp. z o.o. - Samsung R&D Institute Poland, and from the European Union's Horizon 2020 research and innovation programme under grant agreement 645452 (QT21).