This page describes the scripts for training and testing models with data from the [Tatoeba Translation Challenge](https://github.com/Helsinki-NLP/Tatoeba-Challenge). The build targets are defined in [lib/projects/tatoeba.mk](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/lib/projects/tatoeba.mk).
Multilingual models that include all combinations of given source and target languages can be trained by calling the following special target, which first fetches the necessary data for all language pairs and then starts a training job. Here is an example with Afrikaans+Dutch as source languages and German+English+Spanish as target languages:
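(The target name `tatoeba-multilingual-train` and the `SRCLANGS`/`TRGLANGS` variables below are assumptions based on the conventions in tatoeba.mk; language codes are the ISO-639-3 codes used by the Tatoeba Challenge.)

```
make SRCLANGS="afr nld" TRGLANGS="deu eng spa" tatoeba-multilingual-train
```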
The following commands can be used to train all language pairs from a given subset of the Tatoeba Challenge data set. Note that each of them runs over all language pairs and prepares the data sets before submitting training and evaluation jobs, which naturally takes a lot of time. Once the data sets are prepared, this step does not have to be repeated. Jobs are submitted to the SLURM batch management system (make sure that this works in your setup).
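(The per-subset training targets below are an assumption, named analogously to the `distsubset` and `evalsubset` targets shown later in this document.)

```
make tatoeba-subset-zero
make tatoeba-subset-lowest
make tatoeba-subset-medium
...
```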
Release packages can also be created for the entire subset (`medium` in the example below) by running:
```
make tatoeba-distsubset-medium
```
If training did not converge in time or jobs were interrupted, evaluation can still be invoked for the entire subset (`medium` again in the example) by running:
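(The target name is an assumption, mirroring the multilingual `evalsubset` targets shown below.)

```
make tatoeba-evalsubset-medium
```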
## Start jobs for multilingual models from one of the subsets
The commands below can be used to create multilingual NMT models with all languages involved in each of the Tatoeba Challenge subsets. First, all data sets are created (which takes a substantial amount of time); after that, the training jobs are submitted using SLURM. Data selections are automatically under/over-sampled to include equal amounts of training data for each language pair (based on the number of lines in the data).
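(The target names below are an assumption, analogous to the `distsubset`/`evalsubset` targets shown in the following paragraphs.)

```
make tatoeba-multilingual-subset-zero
make tatoeba-multilingual-subset-lowest
...
```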
Note that these models include many languages; they may not work well, and training will take a long time.
Similar to the subset targets above, there are also special targets for creating release packages and for evaluating multilingual models. A release package is created by running:
```
make tatoeba-multilingual-distsubset-zero
make tatoeba-multilingual-distsubset-lowest
...
```
Multilingual models also cover many language pairs. To run all test sets for all language pairs, one can run:
```
make tatoeba-multilingual-evalsubset-zero
make tatoeba-multilingual-evalsubset-lowest
...
```
Note that this can be quite a lot of language pairs!
The targets above train models on over/under-sampled data sets to balance the language pairs included in the multilingual model. The sample size per language pair is 1 million sentence pairs, with a threshold of 50 on the maximum number of repetitions of the same data.
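As a sketch, the per-pair sample size could presumably be overridden via the `FIT_DATA_SIZE` variable described below (the target name is again an assumption):

```
# assumed override: reduce the sample to 500k sentence pairs per language pair
make FIT_DATA_SIZE=500000 tatoeba-multilingual-subset-medium
```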
Jobs can also be started for specific tasks and language groups; an example task is `gmw2eng` (West Germanic to English). It is recommended to use `MODELTYPE=transformer` to skip word alignment, and `FIT_DATA_SIZE` controls the data size used when over- and under-sampling to balance the various language pairs in the training data.
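A minimal sketch of such a job, assuming a `tatoeba-gmw2eng-train` target derived from the task name (the exact target suffix is an assumption):

```
# West Germanic -> English: plain transformer without word alignment,
# balanced to 1M sentence pairs per language pair
make MODELTYPE=transformer FIT_DATA_SIZE=1000000 tatoeba-gmw2eng-train
```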