mirror of https://github.com/Helsinki-NLP/OPUS-MT-train.git synced 2024-08-17 00:00:33 +03:00

Training open neural machine translation models



Jörg Tiedemann ddafb43d66 removed dependence on moses tools in preprocessing script for released spm packages		2020-09-12 14:42:10 +03:00
backtranslate	documentation of low-resource languages	2020-09-06 23:56:16 +03:00
doc	added acknowledgements	2020-09-12 12:01:02 +03:00
evaluate	make compatible with mac osx and include submodules for required tools	2020-09-02 15:52:34 +03:00
finetune	make compatible with mac osx and include submodules for required tools	2020-09-02 15:52:34 +03:00
html	train with backtranslations	2020-01-18 20:37:01 +02:00
lib	removed dependence on moses tools in preprocessing script for released spm packages	2020-09-12 14:42:10 +03:00
models	fixed a bug in eval-testsets	2020-05-29 14:43:36 +03:00
pivoting	keep translations even if uncomplete in pivoting	2020-09-06 00:22:48 +03:00
scripts	removed dependence on moses tools in preprocessing script for released spm packages	2020-09-12 14:42:10 +03:00
tatoeba	store and fetch work data	2020-08-22 23:51:37 +03:00
testsets	fix chinese/korean/japanese language codes	2020-06-17 22:02:39 +03:00
tools	added bpe submodule	2020-09-04 15:34:20 +03:00
work-spm	sami	2020-03-27 22:30:51 +02:00
.gitmodules	added bpe submodule	2020-09-04 15:34:20 +03:00
Dockerfile.cpu	Add more aligners	2020-02-03 16:03:26 +07:00
Dockerfile.gpu	Add Dockerfile for GPU	2020-02-03 15:41:34 +07:00
LICENSE	fixed license	2020-01-10 17:04:04 +02:00
Makefile	added acknowledgements	2020-09-12 12:01:02 +03:00
NOTES.md	fixed multilingual tatoeba evaluation	2020-06-11 00:54:40 +03:00
README.md	added acknowledgements	2020-09-12 12:01:02 +03:00
requirements.txt	started tutorial and fixes to backtranslate makefile	2020-09-05 00:16:22 +03:00
TODO.md	store and fetch work data	2020-08-22 23:51:37 +03:00

README.md

Train Opus-MT models

This package includes scripts for training NMT models with MarianNMT on OPUS data for OPUS-MT. More details are given in the Makefile, but the documentation still needs to be improved. The make targets also require a specific environment and currently only work well on the CSC HPC cluster in Finland.

Pre-trained models

The subdirectory models contains information about pre-trained models that can be downloaded from this project. They are distributed under a CC-BY 4.0 license.

Quickstart

Setting up:

git clone https://github.com/Helsinki-NLP/OPUS-MT-train.git
cd OPUS-MT-train
git submodule update --init --recursive --remote
make install

Training a multilingual NMT model (Finnish and Estonian to Danish, Swedish and English):

make SRCLANGS="fi et" TRGLANGS="da sv en" train
make SRCLANGS="fi et" TRGLANGS="da sv en" eval
make SRCLANGS="fi et" TRGLANGS="da sv en" release
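
The three commands above differ only in the make target, so they can be chained in a small wrapper. The following is a minimal sketch, not part of OPUS-MT-train: the names `SRC`, `TRG`, `DRY_RUN` and `run_pipeline` are hypothetical and chosen for illustration. By default it only prints the commands, because actual training assumes the environment described at the top of this README.

```shell
#!/bin/sh
# Sketch only: SRC, TRG, DRY_RUN and run_pipeline are illustrative names,
# not variables or targets defined by OPUS-MT-train.
SRC="${SRC:-fi et}"        # source languages, as in the example above
TRG="${TRG:-da sv en}"     # target languages, as in the example above

run_pipeline() {
    for target in train eval release; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run (default): print the command instead of executing it.
            printf 'make SRCLANGS="%s" TRGLANGS="%s" %s\n' "$SRC" "$TRG" "$target"
        else
            # Stop at the first target that fails.
            make SRCLANGS="$SRC" TRGLANGS="$TRG" "$target" || return 1
        fi
    done
}

run_pipeline
```

Setting `DRY_RUN=0` in the environment runs the make targets for real, from the root of the cloned repository; `SRC` and `TRG` can be overridden the same way to select other language sets.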

More information is available in the documentation linked below.

Documentation

References

Please cite the following paper if you use OPUS-MT software or models:

@InProceedings{TiedemannThottingal:EAMT2020,
  author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
  title = {{OPUS-MT} -- {B}uilding open translation services for the {W}orld},
  booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
  year = {2020},
  address = {Lisbon, Portugal}
}

Acknowledgements

None of this would be possible without all the great open source software, including:

GNU/Linux tools
Marian-NMT
eflomal

... and many other tools like terashuf, pigz, jq, Moses SMT, fast_align, sacrebleu ...

We would also like to acknowledge the support of the University of Helsinki and the IT Center for Science (CSC), the funding through projects in the EU Horizon 2020 framework (FoTran, MeMAD, ELG), and the contributors to OPUS, the open collection of parallel corpora.