This package includes scripts and makefiles for training NMT models. The documentation below is still incomplete.
All build targets are defined in makefiles. The main idea is to provide a flexible setup for running different jobs for many language pairs and to support all tasks necessary to build and test a model:
* basic training of bilingual and multilingual models ([Makefile](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/Makefile))
* [Generating back-translations](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/backtranslate/README.md) for data augmentation ([Makefile](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/backtranslate/Makefile))
* [Fine-tuning models](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/finetune/README.md) for domain adaptation ([Makefile](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/finetune/Makefile))
* [Generating pivot-language-based translations](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/pivoting/README.md) for data augmentation ([Makefile](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/pivoting/Makefile))
The make targets and essential system properties are defined in a number of makefiles that are included from the top-level makefiles:
* `Makefile`: top-level makefile for main tasks
* `backtranslate/Makefile`: top-level makefile for generating back-translations
* `finetune/Makefile`: top-level makefile for fine-tuning
* `pivoting/Makefile`: top-level makefile for pivot-based translations
Configurations and definitions about the system environment are stored in:
* `lib/env.mk`: system-specific environment (now based on CSC machines)
* `lib/config.mk`: essential model configuration
* `lib/langsets.mk`: definition of language sets
* `${WORKDIR}/config.mk`: model-specific configuration (only if it exists)
The model-specific configuration can store properties that would otherwise need to be given on the command line when calling make targets. You can generate the configuration file using
* model-specific test data: `work/LANGPAIRSTR/test`
* additional test sets are stored in `testsets/` sorted by language pair
* released models are stored in `models/LANGPAIRSTR`
`LANGPAIRSTR` is generated from the specified source and target languages. Source language IDs are joined with `+` as a delimiter, target language IDs are joined in the same way, and the two resulting strings are then joined with `-`. For example, `fi+et-en` is the model directory for a multilingual model that includes Finnish and Estonian as source languages and English as the target language.
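
The naming rule above can be sketched in a few lines of shell. This is only an illustration of the convention, not the code the makefiles actually use: the function name `langpairstr` is hypothetical, and it assumes that language IDs are passed as space-separated lists with single spaces.

```shell
# Hypothetical helper (not part of the package): build a LANGPAIRSTR-style
# name from a space-separated list of source IDs and of target IDs.
langpairstr() {
    # Join the language IDs on each side with "+"
    src=$(printf '%s' "$1" | tr ' ' '+')
    trg=$(printf '%s' "$2" | tr ' ' '+')
    # Join the source side and the target side with "-"
    printf '%s-%s\n' "$src" "$trg"
}

pair=$(langpairstr "fi et" "en")
echo "$pair"                 # prints fi+et-en
echo "work/${pair}/test"     # prints work/fi+et-en/test
echo "models/${pair}"        # prints models/fi+et-en
```

The last two lines show how the resulting string slots into the `work/LANGPAIRSTR/test` and `models/LANGPAIRSTR` locations listed above.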