Flores101: Large-Scale Multilingual Machine Translation
Introduction
Baseline pretrained models for the small and large tracks of the WMT 21 Large-Scale Multilingual Machine Translation competition.
Flores Task at WMT 21: http://www.statmt.org/wmt21/large-scale-multilingual-translation-task.html
Flores announcement blog post: https://ai.facebook.com/blog/flores-researchers-kick-off-multilingual-translation-challenge-at-wmt-and-call-for-compute-grants/
Pretrained models
Model | Num layers | Embed dimension | FFN dimension | Vocab Size | #params | Download |
---|---|---|---|---|---|---|
flores101_mm100_615M | 12 | 1024 | 4096 | 256,000 | 615M | https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_615M.tar.gz |
flores101_mm100_175M | 6 | 512 | 2048 | 256,000 | 175M | https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_175M.tar.gz |
These models are trained similarly to M2M-100, with additional support for the languages that are part of the WMT Large-Scale Multilingual Machine Translation track. The full list of supported languages is at the bottom of this page.
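As a rough sanity check, the #params column can be approximately reproduced from the other columns of the table. The sketch below rests on assumptions that the table does not state: that "Num layers" applies to both the encoder and the decoder, that one embedding matrix is shared between input and output, and that biases and layer norms are small enough to ignore. `estimate_params` is a hypothetical helper, not part of fairseq.

```bash
#!/usr/bin/env bash
# Hypothetical helper: approximate Transformer parameter count.
# Assumptions (not stated in the table): equal encoder/decoder depth,
# one shared embedding matrix, biases and layer norms ignored.
estimate_params() {
  local layers=$1 d=$2 ffn=$3 vocab=$4
  local embed=$(( vocab * d ))   # shared token embeddings
  local attn=$(( 4 * d * d ))    # Q, K, V and output projections
  local ff=$(( 2 * d * ffn ))    # two feed-forward matrices
  local enc=$(( attn + ff ))     # encoder layer: self-attention + FFN
  local dec=$(( 2 * attn + ff )) # decoder layer: self- and cross-attention + FFN
  echo $(( embed + layers * (enc + dec) ))
}

estimate_params 12 1024 4096 256000   # prints 614465536
estimate_params 6 512 2048 256000     # prints 175112192
```

Under these assumptions the estimates come out at roughly 614M and 175M, close to the sizes listed in the table.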
Example Generation code
Download model, sentencepiece vocab
fairseq=/path/to/fairseq
cd $fairseq
# Download 615M param model.
wget https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_615M.tar.gz
# Extract
tar -xvzf flores101_mm100_615M.tar.gz
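The later steps refer to four files inside the extracted directory: `model.pt`, `dict.txt`, `sentencepiece.bpe.model` and `language_pairs.txt`. A minimal sketch for confirming that they are present after extraction, assuming the archive unpacks into a directory containing exactly these names; `check_model_dir` is a hypothetical helper, not part of fairseq.

```bash
#!/usr/bin/env bash
# Hypothetical helper: verify that the files used by the later commands exist.
# Returns 0 when all files are present and 1 otherwise.
check_model_dir() {
  local dir=$1 missing=0 f
  for f in model.pt dict.txt sentencepiece.bpe.model language_pairs.txt; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

It could be called as `check_model_dir flores101_mm100_615M` before moving on to encoding.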
Encode using our SentencePiece Model
Note: Install SentencePiece from https://github.com/google/sentencepiece
fairseq=/path/to/fairseq
cd $fairseq
# Download an example dataset: the first 20 lines of the WMT19 German-French test set
sacrebleu --echo src -l de-fr -t wmt19 | head -n 20 > raw_input.de-fr.de
sacrebleu --echo ref -l de-fr -t wmt19 | head -n 20 > raw_input.de-fr.fr
# Encode both sides with the model's SentencePiece vocabulary
for lang in de fr ; do
    python scripts/spm_encode.py \
        --model flores101_mm100_615M/sentencepiece.bpe.model \
        --output_format=piece \
        --inputs=raw_input.de-fr.${lang} \
        --outputs=spm.de-fr.${lang}
done
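Encoding works line by line, so each `spm.de-fr.*` file should have as many lines as the corresponding `raw_input.de-fr.*` file. The following is a small sketch of that check; `same_line_count` is a hypothetical helper introduced only for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if both files have the same number of lines.
same_line_count() {
  local a b
  a=$(awk 'END { print NR }' "$1")
  b=$(awk 'END { print NR }' "$2")
  [ "$a" -eq "$b" ]
}
```

For the example above, it could be run as `same_line_count raw_input.de-fr.de spm.de-fr.de`, and likewise for `fr`. A mismatch would suggest that some lines were dropped or split during encoding.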
Binarization
fairseq-preprocess \
    --source-lang de --target-lang fr \
    --testpref spm.de-fr \
    --thresholdsrc 0 --thresholdtgt 0 \
    --destdir data_bin \
    --srcdict flores101_mm100_615M/dict.txt --tgtdict flores101_mm100_615M/dict.txt
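Because binarization uses the fixed dictionary shipped with the model, any encoded token that is absent from `dict.txt` would be mapped to the unknown symbol. The sketch below counts such tokens, assuming the usual fairseq dictionary layout of one `token count` pair per line; `count_oov` is a hypothetical helper, not a fairseq command.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count tokens in an encoded file that are not in the dictionary.
# $1: dictionary file (assumed format: "<token> <count>" per line)
# $2: whitespace-tokenized text file, e.g. spm.de-fr.de
count_oov() {
  awk 'NR == FNR { vocab[$1] = 1; next }   # first file: load dictionary tokens
       { for (i = 1; i <= NF; i++) if (!($i in vocab)) oov++ }
       END { print oov + 0 }' "$1" "$2"
}
```

For example, `count_oov flores101_mm100_615M/dict.txt spm.de-fr.de` would be expected to print a small number when the input was encoded with the matching SentencePiece model.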
Generation
fairseq-generate \
    data_bin \
    --batch-size 1 \
    --path flores101_mm100_615M/model.pt \
    --fixed-dictionary flores101_mm100_615M/dict.txt \
    -s de -t fr \
    --remove-bpe 'sentencepiece' \
    --beam 5 \
    --task translation_multi_simple_epoch \
    --lang-pairs flores101_mm100_615M/language_pairs.txt \
    --decoder-langtok --encoder-langtok src \
    --gen-subset test \
    --fp16 \
    --dataset-impl mmap \
    --distributed-world-size 1 --distributed-no-spawn
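`fairseq-generate` interleaves source, target and hypothesis lines in its output, with each hypothesis line starting with `H-<id>` followed by a tab-separated score and the text. A minimal sketch for recovering the hypotheses in input order, assuming the command's output was redirected to a file; `extract_hypotheses` and the file name `gen_out` are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print hypotheses from saved fairseq-generate output,
# sorted by sentence id so they line up with the input file.
# Assumes hypothesis lines look like: H-<id><TAB><score><TAB><text>
extract_hypotheses() {
  grep '^H-' "$1" \
    | sed 's/^H-//' \
    | sort -t $'\t' -k1,1n \
    | cut -f3-
}
```

With the output saved to `gen_out`, running `extract_hypotheses gen_out > hyp.fr` would give one translation per line, which could then be scored against `raw_input.de-fr.fr` with a tool such as sacrebleu.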
Supported Languages and lang code
Language | lang code |
---|---|
Afrikaans | af |
Amharic | am |
Arabic | ar |
Assamese | as |
Asturian | ast |
Aymara | ay |
Azerbaijani | az |
Bashkir | ba |
Belarusian | be |
Bulgarian | bg |
Bengali | bn |
Breton | br |
Bosnian | bs |
Catalan | ca |
Cebuano | ceb |
Chokwe | cjk |
Czech | cs |
Welsh | cy |
Danish | da |
German | de |
Dyula | dyu |
Greek | el |
English | en |
Spanish | es |
Estonian | et |
Persian | fa |
Fulah | ff |
Finnish | fi |
French | fr |
Western Frisian | fy |
Irish | ga |
Scottish Gaelic | gd |
Galician | gl |
Gujarati | gu |
Hausa | ha |
Hebrew | he |
Hindi | hi |
Croatian | hr |
Haitian Creole | ht |
Hungarian | hu |
Armenian | hy |
Indonesian | id |
Igbo | ig |
Iloko | ilo |
Icelandic | is |
Italian | it |
Japanese | ja |
Javanese | jv |
Georgian | ka |
Kachin | kac |
Kamba | kam |
Kabuverdianu | kea |
Kongo | kg |
Kazakh | kk |
Central Khmer | km |
Kimbundu | kmb |
Northern Kurdish | kmr |
Kannada | kn |
Korean | ko |
Kurdish | ku |
Kyrgyz | ky |
Luxembourgish | lb |
Ganda | lg |
Lingala | ln |
Lao | lo |
Lithuanian | lt |
Luo | luo |
Latvian | lv |
Malagasy | mg |
Maori | mi |
Macedonian | mk |
Malayalam | ml |
Mongolian | mn |
Marathi | mr |
Malay | ms |
Maltese | mt |
Burmese | my |
Nepali | ne |
Dutch | nl |
Norwegian | no |
Northern Sotho | ns |
Nyanja | ny |
Occitan | oc |
Oromo | om |
Oriya | or |
Punjabi | pa |
Polish | pl |
Pashto | ps |
Portuguese | pt |
Quechua | qu |
Romanian | ro |
Russian | ru |
Sindhi | sd |
Shan | shn |
Sinhala | si |
Slovak | sk |
Slovenian | sl |
Shona | sn |
Somali | so |
Albanian | sq |
Serbian | sr |
Swati | ss |
Sundanese | su |
Swedish | sv |
Swahili | sw |
Tamil | ta |
Telugu | te |
Tajik | tg |
Thai | th |
Tigrinya | ti |
Tagalog | tl |
Tswana | tn |
Turkish | tr |
Ukrainian | uk |
Umbundu | umb |
Urdu | ur |
Uzbek | uz |
Vietnamese | vi |
Wolof | wo |
Xhosa | xh |
Yiddish | yi |
Yoruba | yo |
Chinese | zh |
Zulu | zu |