mirror of https://github.com/Helsinki-NLP/OPUS-MT-train.git synced 2024-11-30 22:14:14 +03:00


Joerg Tiedemann ec43fcd30a fixed a bug in eval-testsets		2020-05-29 14:43:36 +03:00

Pivot-based data augmentation

This folder creates synthetic training data by translating one side of an existing bitext for a different language pair. For example, training data for Breton-to-English translation can be created from Breton-French bitexts: the French side of the auxiliary corpus is translated into English with a strong French-English translation model.
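The data flow can be sketched in a few lines of shell. This is only an illustration, not what the Makefile in this folder does: translate_pivot is a hypothetical stand-in for a real French-English model (it just upper-cases its input so the effect is visible), and make_synthetic_bitext is a hypothetical helper that assumes two line-aligned plain-text files.

```sh
# Hypothetical stand-in for a real pivot-to-source MT system.
# It only upper-cases ASCII text so that the data flow is visible.
translate_pivot() {
    tr '[:lower:]' '[:upper:]'
}

# Build a synthetic bitext from a line-aligned pivot/target pair.
#   $1: pivot-language file (e.g. French)
#   $2: target-language file (e.g. Breton)
# Output: one tab-separated pair per line, "translated pivot<TAB>target".
make_synthetic_bitext() {
    translate_pivot < "$1" | paste - "$2"
}
```

Calling make_synthetic_bitext corpus.fr corpus.br (hypothetical file names) would pair every "translated" French line with its original Breton line, which is the shape of the synthetic training data described above.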

This assumes that

auxiliary data sets are in ORIGINAL_DATADIR (defaults to ${PWD}/../work/data)
packaged translation models can be found in ${PWD}/../models
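Both assumptions can be checked before running anything. The following is a minimal sketch: the default for ORIGINAL_DATADIR is taken from the text above, while MODELDIR and check_dir are hypothetical names introduced only for this example and are not variables or targets of the Makefile.

```sh
# Default from the README; override by exporting ORIGINAL_DATADIR.
ORIGINAL_DATADIR="${ORIGINAL_DATADIR:-${PWD}/../work/data}"
# MODELDIR is a hypothetical name used only in this sketch.
MODELDIR="${MODELDIR:-${PWD}/../models}"

# Report whether a directory exists; return non-zero if it does not.
check_dir() {
    if [ -d "$1" ]; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}
```

Run check_dir "$ORIGINAL_DATADIR" and check_dir "$MODELDIR" from this folder to see which of the two locations is missing.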

Usage

Set the variables SRC, TRG and PIVOT and run make all. For example, to translate French-Breton data into English-Breton data:

make SRC=en TRG=br PIVOT=fr all

To print the data that will be translated and the model that will be used, run:

make SRC=en TRG=br PIVOT=fr print-all-data
make SRC=en TRG=br PIVOT=fr print-modelname

If these commands print nothing, running make all will not do anything useful. To submit a job via Slurm, add the suffix .submit to the target, e.g.

make SRC=en TRG=br PIVOT=fr all.submit
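The check on the print targets can be scripted so that make all is only started when there is something to do. This is a sketch under assumptions: prints_something and pivot_if_possible are hypothetical helper names, and it assumes that with make -s the print targets write nothing but the data or model name to standard output, which may not hold for every setup.

```sh
# Succeed only if the given command prints at least one non-blank character.
prints_something() {
    [ -n "$("$@" 2>/dev/null | tr -d '[:space:]')" ]
}

# Run "make all" only when both print targets report something.
# Arguments are passed on to make, e.g. SRC=en TRG=br PIVOT=fr.
pivot_if_possible() {
    if prints_something make -s "$@" print-all-data &&
       prints_something make -s "$@" print-modelname; then
        make "$@" all
    else
        echo "nothing to translate for: $*" >&2
        return 1
    fi
}
```

For example, pivot_if_possible SRC=en TRG=br PIVOT=fr starts the translation only if both data and a model are reported for that combination.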

Specific models

Special targets for specific models are defined in lib/models.mk. Use them like this:

Sami language model: make all-sami (make print-all-data-sami and make print-modelname-sami are also available)

TODO

get models from ObjectStorage instead of fetching them from the local filesystem
get auxiliary data from OPUS instead of pre-processed data in the OPUS-MT dir (with hard-coded path)