Training open neural machine translation models

Train OPUS-MT models

This package includes scripts for training NMT models for OPUS-MT using MarianNMT and OPUS data. More details are given in the Makefile, but the documentation still needs to be improved. The targets also require a specific environment and currently work well only on the CSC HPC cluster in Finland.

Pre-trained models

The subdirectory models contains information about pre-trained models that can be downloaded from this project. They are distributed under a CC-BY 4.0 license.

Quickstart

Setting up:

git clone https://github.com/Helsinki-NLP/OPUS-MT-train.git
cd OPUS-MT-train
git submodule update --init --recursive --remote
make install
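
Before running the build targets, it can help to confirm that the submodules were actually checked out. The following is a minimal sketch, not part of the repository: `count_uninitialized` is a hypothetical helper name, and it relies only on the documented behaviour of `git submodule status`, which prefixes each uninitialized submodule with `-`.

```shell
#!/bin/sh
# Sketch: count submodules that `git submodule status` reports as uninitialized.
# count_uninitialized is a hypothetical helper, not part of OPUS-MT-train.

count_uninitialized() {
    # Reads `git submodule status` output on stdin; a leading "-" marks a
    # submodule that has not been initialized yet.
    # grep -c exits non-zero when the count is 0, so mask the exit status.
    grep -c '^-' || true
}

# Usage, from inside the cloned repository:
#   git submodule status --recursive | count_uninitialized
```

A result of 0 suggests that every submodule has been initialized; any other number means the `git submodule update` step should be repeated.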

Training a multilingual NMT model (Finnish and Estonian to Danish, Swedish and English):

make SRCLANGS="fi et" TRGLANGS="da sv en" train
make SRCLANGS="fi et" TRGLANGS="da sv en" eval
make SRCLANGS="fi et" TRGLANGS="da sv en" release
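
The three targets share the same language settings, so they can be wrapped in a small script that runs them in the order shown and stops at the first failure. This is only a sketch under stated assumptions: `run_targets` and the `DRY_RUN` variable are hypothetical names belonging to this wrapper, not to the OPUS-MT-train Makefile, and the script defaults to printing the commands instead of executing them.

```shell
#!/bin/sh
# Sketch: run the quickstart targets in order for one language configuration.
# run_targets and DRY_RUN are hypothetical; they are not defined by the Makefile.

run_targets() {
    src="$1"   # source languages, e.g. "fi et"
    trg="$2"   # target languages, e.g. "da sv en"
    for target in train eval release; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run (the default): only print the command that would be run.
            printf 'make SRCLANGS="%s" TRGLANGS="%s" %s\n' "$src" "$trg" "$target"
        else
            # Real run: stop as soon as one target fails.
            make SRCLANGS="$src" TRGLANGS="$trg" "$target" || return 1
        fi
    done
}

run_targets "fi et" "da sv en"
```

Setting `DRY_RUN=0` in the environment makes the script call `make` for real; this should be done from the repository root, in an environment where the targets are known to work.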

More information is available in the documentation linked below.

Documentation

Tutorials

References

Please cite the following paper if you use OPUS-MT software and models:

@InProceedings{TiedemannThottingal:EAMT2020,
  author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
  title = {{OPUS-MT} --- {B}uilding open translation services for the {W}orld},
  booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
  year = {2020},
  address = {Lisbon, Portugal}
 }

Acknowledgements

None of this would be possible without all the great open source software including

... and many other tools like terashuf, pigz, jq, Moses SMT, fast_align, sacrebleu ...

We would also like to acknowledge the support of the University of Helsinki and CSC (IT Center for Science), the funding through projects in the EU Horizon 2020 framework (FoTran, MeMAD, ELG), and the contributors to the open collection of parallel corpora OPUS.