mirror of https://github.com/Helsinki-NLP/OPUS-MT-train.git synced 2024-07-04 15:16:30 +03:00

Training open neural machine translation models

Train Opus-MT models

This package includes scripts for training NMT models for OPUS-MT using MarianNMT and OPUS data. More details are given in the Makefile, but the documentation still needs to be improved. The make targets also require a specific environment and currently work well only on the CSC HPC cluster in Finland.

Pre-trained models

The subdirectory models contains information about pre-trained models that can be downloaded from this project. They are distributed under a CC-BY 4.0 license. More pre-trained models trained with the OPUS-MT training pipeline are available from the Tatoeba translation challenge, also under a CC-BY 4.0 license.

Quickstart

Setting up:

git clone https://github.com/Helsinki-NLP/OPUS-MT-train.git
cd OPUS-MT-train
git submodule update --init --recursive --remote
make install

Look into lib/env.mk and adjust any settings that you need in your environment. For CSC users: adjust lib/env/puhti.mk and lib/env/mahti.mk to match your setup (especially the locations where Marian-NMT and other tools are installed and the CSC project that you are using).

Training a multilingual NMT model (Finnish and Estonian to Danish, Swedish and English):

make SRCLANGS="fi et" TRGLANGS="da sv en" train
make SRCLANGS="fi et" TRGLANGS="da sv en" eval
make SRCLANGS="fi et" TRGLANGS="da sv en" release
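
The three calls above can be wrapped so that a failed training run does not go on to evaluation or release. This is only a minimal sketch, not part of the package: the function name run_pipeline and the MAKE override are hypothetical helpers introduced here for illustration, and the language lists are the ones from the example above.

```shell
#!/bin/sh
# Sketch: run the Quickstart targets in order and stop at the first failure.
# run_pipeline and the MAKE override are illustrative names (assumptions),
# not something provided by OPUS-MT-train itself.

SRCLANGS="fi et"        # source languages, as in the example above
TRGLANGS="da sv en"     # target languages, as in the example above

run_pipeline() {
    for target in train eval release; do
        # ${MAKE:-make} allows a dry run with another command, e.g. MAKE=echo
        "${MAKE:-make}" SRCLANGS="$SRCLANGS" TRGLANGS="$TRGLANGS" "$target" || {
            echo "target '$target' failed; stopping" >&2
            return 1
        }
    done
}
```

After sourcing the snippet in the repository root, calling run_pipeline runs train, eval and release in that order. Setting MAKE=echo beforehand only prints the arguments that would be passed to make, which is a cheap way to check the language settings before starting a long training job.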

More information is available in the documentation linked below.

Documentation

Tutorials

References

Please cite the following paper if you use OPUS-MT software and models:

@InProceedings{TiedemannThottingal:EAMT2020,
  author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
  title = {{OPUS-MT} -- {B}uilding open translation services for the {W}orld},
  booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
  year = {2020},
  address = {Lisbon, Portugal}
 }

Acknowledgements

None of this would be possible without all the great open source software including

GNU/Linux tools
Marian-NMT
eflomal

... and many other tools like terashuf, pigz, jq, Moses SMT, fast_align, sacrebleu ...

We would also like to acknowledge the support of the University of Helsinki and CSC - IT Center for Science, the funding through projects in the EU Horizon 2020 framework (FoTran, MeMAD, ELG), and the contributors to OPUS, the open collection of parallel corpora.