b41c74dc5b
Summary: Changelog: - `e330f56`: Add code for the "Pay Less Attention with Lightweight and Dynamic Convolutions" paper - `5e3b98c`: Add scripts for computing tokenized BLEU with compound splitting and sacrebleu - update READMEs - misc fixes Pull Request resolved: https://github.com/pytorch/fairseq/pull/473 Differential Revision: D13819717 Pulled By: myleott fbshipit-source-id: f2dc12ea89a436b950cafec3593ed1b04af808e9 |
||
---|---|---|
.. | ||
prepare-iwslt14.sh | ||
prepare-wmt14en2de.sh | ||
prepare-wmt14en2fr.sh | ||
README.md |
Neural Machine Translation
Pre-trained models
Description | Dataset | Model | Test set(s) |
---|---|---|---|
Convolutional (Gehring et al., 2017) |
WMT14 English-French | download (.tar.bz2) | newstest2014: download (.tar.bz2) newstest2012/2013: download (.tar.bz2) |
Convolutional (Gehring et al., 2017) |
WMT14 English-German | download (.tar.bz2) | newstest2014: download (.tar.bz2) |
Convolutional (Gehring et al., 2017) |
WMT17 English-German | download (.tar.bz2) | newstest2014: download (.tar.bz2) |
Transformer (Ott et al., 2018) |
WMT14 English-French | download (.tar.bz2) | newstest2014 (shared vocab): download (.tar.bz2) |
Transformer (Ott et al., 2018) |
WMT16 English-German | download (.tar.bz2) | newstest2014 (shared vocab): download (.tar.bz2) |
Transformer (Edunov et al., 2018; WMT'18 winner) |
WMT'18 English-German | download (.tar.bz2) | See NOTE in the archive |
Example usage
Generation with the binarized test sets can be run in batch mode as follows, e.g. for WMT 2014 English-French on a GTX-1080ti:
$ mkdir -p data-bin
$ curl https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf - -C data-bin
$ curl https://dl.fbaipublicfiles.com/fairseq/data/wmt14.v2.en-fr.newstest2014.tar.bz2 | tar xvjf - -C data-bin
$ python generate.py data-bin/wmt14.en-fr.newstest2014 \
--path data-bin/wmt14.en-fr.fconv-py/model.pt \
--beam 5 --batch-size 128 --remove-bpe | tee /tmp/gen.out
...
| Translated 3003 sentences (96311 tokens) in 166.0s (580.04 tokens/s)
| Generate test with beam=5: BLEU4 = 40.83, 67.5/46.9/34.4/25.5 (BP=1.000, ratio=1.006, syslen=83262, reflen=82787)
# Scoring with score.py:
$ grep ^H /tmp/gen.out | cut -f3- > /tmp/gen.out.sys
$ grep ^T /tmp/gen.out | cut -f2- > /tmp/gen.out.ref
$ python score.py --sys /tmp/gen.out.sys --ref /tmp/gen.out.ref
BLEU4 = 40.83, 67.5/46.9/34.4/25.5 (BP=1.000, ratio=1.006, syslen=83262, reflen=82787)
Preprocessing
These scripts provide an example of pre-processing data for the NMT task.
prepare-iwslt14.sh
Provides an example of pre-processing for IWSLT'14 German to English translation task: "Report on the 11th IWSLT evaluation campaign" by Cettolo et al.
Example usage:
$ cd examples/translation/
$ bash prepare-iwslt14.sh
$ cd ../..
# Binarize the dataset:
$ TEXT=examples/translation/iwslt14.tokenized.de-en
$ python preprocess.py --source-lang de --target-lang en \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir data-bin/iwslt14.tokenized.de-en
# Train the model (better for a single GPU setup):
$ mkdir -p checkpoints/fconv
$ CUDA_VISIBLE_DEVICES=0 python train.py data-bin/iwslt14.tokenized.de-en \
--lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--lr-scheduler fixed --force-anneal 200 \
--arch fconv_iwslt_de_en --save-dir checkpoints/fconv
# Generate:
$ python generate.py data-bin/iwslt14.tokenized.de-en \
--path checkpoints/fconv/checkpoint_best.pt \
--batch-size 128 --beam 5 --remove-bpe
To train transformer model on IWSLT'14 German to English:
# Preparation steps are the same as for fconv model.
# Train the model (better for a single GPU setup):
$ mkdir -p checkpoints/transformer
$ CUDA_VISIBLE_DEVICES=0 python train.py data-bin/iwslt14.tokenized.de-en \
-a transformer_iwslt_de_en --optimizer adam --lr 0.0005 -s de -t en \
--label-smoothing 0.1 --dropout 0.3 --max-tokens 4000 \
--min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --max-update 50000 \
--warmup-updates 4000 --warmup-init-lr '1e-07' \
--adam-betas '(0.9, 0.98)' --save-dir checkpoints/transformer
# Average 10 latest checkpoints:
$ python scripts/average_checkpoints.py --inputs checkpoints/transformer \
--num-epoch-checkpoints 10 --output checkpoints/transformer/model.pt
# Generate:
$ python generate.py data-bin/iwslt14.tokenized.de-en \
--path checkpoints/transformer/model.pt \
--batch-size 128 --beam 5 --remove-bpe
prepare-wmt14en2de.sh
The WMT English to German dataset can be preprocessed using the prepare-wmt14en2de.sh
script.
By default it will produce a dataset that was modeled after "Attention Is All You Need" (Vaswani et al., 2017), but with news-commentary-v12 data from WMT'17.
To use only data available in WMT'14 or to replicate results obtained in the original "Convolutional Sequence to Sequence Learning" (Gehring et al., 2017) paper, please use the --icml17
option.
$ bash prepare-wmt14en2de.sh --icml17
Example usage:
$ cd examples/translation/
$ bash prepare-wmt14en2de.sh
$ cd ../..
# Binarize the dataset:
$ TEXT=examples/translation/wmt14_en_de
$ python preprocess.py --source-lang en --target-lang de \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir data-bin/wmt14_en_de --thresholdtgt 0 --thresholdsrc 0
# Train the model:
# If it runs out of memory, try to set --max-tokens 1500 instead
$ mkdir -p checkpoints/fconv_wmt_en_de
$ python train.py data-bin/wmt14_en_de \
--lr 0.5 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--lr-scheduler fixed --force-anneal 50 \
--arch fconv_wmt_en_de --save-dir checkpoints/fconv_wmt_en_de
# Generate:
$ python generate.py data-bin/wmt14_en_de \
--path checkpoints/fconv_wmt_en_de/checkpoint_best.pt --beam 5 --remove-bpe
prepare-wmt14en2fr.sh
Provides an example of pre-processing for the WMT'14 English to French translation task.
Example usage:
$ cd examples/translation/
$ bash prepare-wmt14en2fr.sh
$ cd ../..
# Binarize the dataset:
$ TEXT=examples/translation/wmt14_en_fr
$ python preprocess.py --source-lang en --target-lang fr \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir data-bin/wmt14_en_fr --thresholdtgt 0 --thresholdsrc 0
# Train the model:
# If it runs out of memory, try to set --max-tokens 1000 instead
$ mkdir -p checkpoints/fconv_wmt_en_fr
$ python train.py data-bin/wmt14_en_fr \
--lr 0.5 --clip-norm 0.1 --dropout 0.1 --max-tokens 3000 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--lr-scheduler fixed --force-anneal 50 \
--arch fconv_wmt_en_fr --save-dir checkpoints/fconv_wmt_en_fr
# Generate:
$ python generate.py data-bin/fconv_wmt_en_fr \
--path checkpoints/fconv_wmt_en_fr/checkpoint_best.pt --beam 5 --remove-bpe