# Train Opus-MT models

This package includes scripts for training NMT models using MarianNMT and OPUS data for [OPUS-MT](https://github.com/Helsinki-NLP/Opus-MT). More details are given in the [Makefile](Makefile), but the documentation still needs to be improved. Note that the targets require a specific environment and currently only work well on the CSC HPC cluster in Finland.
## Pre-trained models

The subdirectory [models](https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/models) contains information about pre-trained models that can be downloaded from this project. They are distributed under a [CC-BY 4.0 license](https://creativecommons.org/licenses/by/4.0/). [More pre-trained models](https://github.com/Helsinki-NLP/Tatoeba-Challenge/blob/master/results/tatoeba-results-all.md) trained with the [OPUS-MT training pipeline](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/doc/TatoebaChallenge.md) are available from the [Tatoeba translation challenge](https://github.com/Helsinki-NLP/Tatoeba-Challenge), also under a [CC-BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
## Quickstart
Setting up:
```
git clone https://github.com/Helsinki-NLP/OPUS-MT-train.git
cd OPUS-MT-train
git submodule update --init --recursive --remote
make install
```
Look into `lib/env.mk` and adjust any settings that you need in your environment.
For CSC users: adjust `lib/env/puhti.mk` and `lib/env/mahti.mk` to match your setup (especially the locations where Marian-NMT and other tools are installed and the CSC project that you are using).

Training a multilingual NMT model (Finnish and Estonian to Danish, Swedish and English):
```
make SRCLANGS="fi et" TRGLANGS="da sv en" train
make SRCLANGS="fi et" TRGLANGS="da sv en" eval
make SRCLANGS="fi et" TRGLANGS="da sv en" release
```
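The three targets above can also be chained in a small wrapper so that evaluation and release only run if the previous step succeeded. The following is a minimal sketch and not part of this repository: the function name `run_pipeline` and the `MAKE_CMD` variable are hypothetical, and the sketch assumes it is run from the top-level directory of the cloned repository.

```shell
#!/bin/sh
# Hypothetical wrapper around the quickstart targets (not shipped with OPUS-MT-train).
# Language sets default to the example above and can be overridden from the environment.
SRCLANGS="${SRCLANGS:-fi et}"
TRGLANGS="${TRGLANGS:-da sv en}"

# MAKE_CMD is an assumption of this sketch; it is left unquoted below so that
# a value such as "make -n" (GNU make dry run) is split into command and flag.
MAKE_CMD="${MAKE_CMD:-make}"

run_pipeline() {
  for target in train eval release; do
    echo "==> ${target}"
    # Stop at the first failing step instead of continuing to the next target.
    ${MAKE_CMD} SRCLANGS="${SRCLANGS}" TRGLANGS="${TRGLANGS}" "${target}" || return 1
  done
}
```

Calling `run_pipeline` after these definitions runs `train`, `eval` and `release` in that order. Setting `MAKE_CMD="make -n"` beforehand should only print what make would execute, which can be a cheap way to check the configuration before submitting long-running jobs.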
More information is available in the documentation linked below.
## Documentation
* [Installation and setup](https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/doc/Setup.md)
* [Details about tasks and recipes](https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/doc/README.md)
* [Information about back-translation](https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/backtranslate/README.md)
* [Information about fine-tuning models](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/finetune/README.md)
* [How to generate pivot-language-based translations](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/pivoting/README.md)
## Tutorials
* [Training low-resource models](https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/doc/tutorials/low-resource.md)
* [How to train models for the Tatoeba MT Challenge](https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/doc/TatoebaChallenge.md)
## References
Please cite the following paper if you use OPUS-MT software and models:
```
@InProceedings{TiedemannThottingal:EAMT2020,
author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
title = {{OPUS-MT} — {B}uilding open translation services for the {W}orld},
booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
year = {2020},
address = {Lisbon, Portugal}
}
```
## Acknowledgements
None of this would be possible without all the great open source software, including

* GNU/Linux tools
* [Marian-NMT](https://github.com/marian-nmt/)
* [eflomal](https://github.com/robertostling/eflomal)

... and many other tools like terashuf, pigz, jq, Moses SMT, fast_align, sacrebleu ...

We would also like to acknowledge the support of the [University of Helsinki](https://blogs.helsinki.fi/language-technology/) and the [IT Center for Science (CSC)](https://www.csc.fi/en/home), the funding through projects in the EU Horizon 2020 framework ([FoTran](http://www.helsinki.fi/fotran), [MeMAD](https://memad.eu/), [ELG](https://www.european-language-grid.eu/)), and the contributors to the open collection of parallel corpora [OPUS](http://opus.nlpl.eu/).