
# Train Opus-MT models

This package includes scripts for training NMT models with MarianNMT and OPUS data for [OPUS-MT](https://github.com/Helsinki-NLP/Opus-MT). More details are given in the [Makefile](Makefile), but the documentation is still incomplete. The make targets also require a specific environment and currently work well only on the CSC HPC cluster in Finland.


## Pre-trained models

The subdirectory [models](https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/models) contains information about pre-trained models that can be downloaded from this project. They are distributed under a [CC-BY 4.0 license](https://creativecommons.org/licenses/by/4.0/). [More pre-trained models](https://github.com/Helsinki-NLP/Tatoeba-Challenge/blob/master/results/tatoeba-results-all.md) trained with the [OPUS-MT training pipeline](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/doc/TatoebaChallenge.md) are available from the [Tatoeba translation challenge](https://github.com/Helsinki-NLP/Tatoeba-Challenge), also under a [CC-BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).


## Quickstart

Setting up:

```
git clone https://github.com/Helsinki-NLP/OPUS-MT-train.git
cd OPUS-MT-train
git submodule update --init --recursive --remote
make install
```

Look into `lib/env.mk` and adjust any settings that you need for your environment.
For CSC users: adjust `lib/env/puhti.mk` and `lib/env/mahti.mk` to match your setup (especially the locations where Marian-NMT and other tools are installed and the CSC project that you are using).

Training a multilingual NMT model (Finnish and Estonian to Danish, Swedish and English):

```
make SRCLANGS="fi et" TRGLANGS="da sv en" train
make SRCLANGS="fi et" TRGLANGS="da sv en" eval
make SRCLANGS="fi et" TRGLANGS="da sv en" release
```
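The three commands above share the same language settings, so they can be wrapped in a small script. The following is only a sketch, not part of the package: `run_target` and the `DRY_RUN` switch are hypothetical names introduced here, and the script assumes that `make ... train` runs to completion before `eval` and `release` are started. By default it only prints the commands so that you can check them first.

```shell
#!/bin/sh
# Sketch: run the quickstart targets in order and stop at the first failure.
set -eu

# Language settings from the example above; override them from the environment.
SRCLANGS="${SRCLANGS:-fi et}"
TRGLANGS="${TRGLANGS:-da sv en}"

# DRY_RUN is a hypothetical switch for this sketch, not a Makefile variable.
# 1 (default) prints the commands, 0 actually calls make.
DRY_RUN="${DRY_RUN:-1}"

run_target() {
  # $1 is the make target (train, eval or release)
  if [ "${DRY_RUN}" = "1" ]; then
    printf 'make SRCLANGS="%s" TRGLANGS="%s" %s\n' "${SRCLANGS}" "${TRGLANGS}" "$1"
  else
    make SRCLANGS="${SRCLANGS}" TRGLANGS="${TRGLANGS}" "$1"
  fi
}

for target in train eval release; do
  run_target "${target}"
done
```

Saved under a name of your choice in the top-level directory of the repository, the script prints the three `make` calls when run as is; setting `DRY_RUN=0` executes them instead.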

More information is available in the documentation linked below.


## Documentation

* [Installation and setup](https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/doc/Setup.md)
* [Details about tasks and recipes](https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/doc/README.md)
* [Information about back-translation](https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/backtranslate/README.md)
* [Information about fine-tuning models](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/finetune/README.md)
* [How to generate pivot-language-based translations](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/pivoting/README.md)


## Tutorials

* [Training low-resource models](https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/doc/tutorials/low-resource.md)
* [How to train models for the Tatoeba MT Challenge](https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/doc/TatoebaChallenge.md)


## References

Please cite the following paper if you use OPUS-MT software and models:

```
@InProceedings{TiedemannThottingal:EAMT2020,
  author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
  title = {{OPUS-MT} — {B}uilding open translation services for the {W}orld},
  booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
  year = {2020},
  address = {Lisbon, Portugal}
}
```


## Acknowledgements

None of this would be possible without all the great open source software, including:

* GNU/Linux tools
* [Marian-NMT](https://github.com/marian-nmt/)
* [eflomal](https://github.com/robertostling/eflomal)

... and many other tools like terashuf, pigz, jq, Moses SMT, fast_align, sacrebleu ...

We would also like to acknowledge the support of the [University of Helsinki](https://blogs.helsinki.fi/language-technology/) and [CSC (IT Center for Science)](https://www.csc.fi/en/home), funding through projects in the EU Horizon 2020 framework ([FoTran](http://www.helsinki.fi/fotran), [MeMAD](https://memad.eu/), [ELG](https://www.european-language-grid.eu/)), and the contributors to the open collection of parallel corpora [OPUS](http://opus.nlpl.eu/).