Train Opus-MT models
This package includes scripts for training NMT models for OPUS-MT using MarianNMT and OPUS data. More details are given in the Makefile, but the documentation still needs to be improved. The make targets also require a specific environment and currently only work well on the CSC HPC cluster in Finland.
Pre-trained models
The subdirectory models contains information about pre-trained models that can be downloaded from this project. They are distributed under a CC-BY 4.0 license.
Prerequisites
The scripts do not work out of the box because many settings are tailored to the local installations on our IT infrastructure at CSC. Below is an incomplete list of prerequisites for running a training process. Making the training procedures and settings more transparent and self-contained is on our TODO list. Preliminary information is available in the installation and setup documentation.
- marian-nmt: The essential NMT toolkit we use in OPUS-MT; make sure you compile a version with GPU and SentencePiece support!
- Moses scripts: various pre- and post-processing scripts from the Moses SMT toolkit (also bundled here: marian-nmt)
- OpusTools: library and tools for accessing OPUS data
- OpusTools-perl: additional tools for accessing OPUS data
- iso-639: a Python package for ISO 639 language codes
- Perl modules ISO::639::3 and ISO::639::5
- jq JSON processor
Optional (recommended) software:
- terashuf: efficiently shuffle massive data sets
- pigz: multithreaded gzip
- eflomal (needed for word alignment when transformer-align is used)
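Before starting a run, it can help to check which of these tools are actually on the PATH. The following is a minimal sketch, not part of the repository: missing_tools is a hypothetical helper, and the command names in REQUIRED and OPTIONAL are assumptions about a typical installation (for example, that the Marian binary is called marian), so adjust them to your setup.

```shell
#!/bin/sh
# Hypothetical helper (not part of OPUS-MT-train):
# print every given command that cannot be found on PATH, one per line.
# Returns 0 if all commands were found, 1 otherwise.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Assumed command names; change them to match your installation.
REQUIRED="make marian perl python3 jq"
OPTIONAL="terashuf pigz"
```

Calling missing_tools $REQUIRED (unquoted, so that the list is split into words) prints the names that still need to be installed and returns a non-zero status if any are missing. Perl modules and Python packages are not commands and would need a separate check.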
Documentation
- Installation and setup
- Details about tasks and recipes
- Information about back-translation
- Information about Fine-tuning models
- How to generate pivot-language-based translations
- How to train models for the Tatoeba MT Challenge
Structure of the training scripts
Essential files for making new models:
- Makefile: top-level makefile
- lib/env.mk: system-specific environment (now based on CSC machines)
- lib/config.mk: essential model configuration
- lib/data.mk: data pre-processing tasks
- lib/generic.mk: generic implicit rules that can extend other tasks
- lib/dist.mk: make packages for distributing models (CSC ObjectStorage based)
- lib/slurm.mk: submit jobs with SLURM
There are also make targets for specific models and tasks. Look into lib/models/ to see what has been defined already. Note that this changes frequently! There is, for example:
- lib/models/multilingual.mk: various multilingual models
- lib/models/celtic.mk: data and models for Celtic languages
- lib/models/doclevel.mk: experimental document-level models
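Since the set of model-specific targets changes often, a quick way to see what a makefile currently defines is to extract its target names. The sketch below is an illustration only: list_targets is a hypothetical helper, and it relies on a simple pattern that catches plain targets and pattern rules but misses targets whose names are built from make variables or that share one rule line.

```shell
#!/bin/sh
# Hypothetical helper (not part of OPUS-MT-train):
# print the target names defined in the given makefiles, sorted and de-duplicated.
list_targets() {
    # Keep lines of the form "name:" or "name: deps".
    # Recipe lines (tab-indented), comments and ":=" assignments do not match.
    grep -hE '^[A-Za-z0-9_./%-]+[[:space:]]*:([^=]|$)' "$@" |
        # Strip everything from the colon onwards, leaving only the name.
        sed -E 's/[[:space:]]*:.*$//' |
        # Drop special targets such as .PHONY.
        grep -vE '^\.[A-Z_]+$' |
        LC_ALL=C sort -u
}
```

For example, list_targets lib/models/celtic.mk prints one target per line, and list_targets lib/models/*.mk gives an overview across all model-specific makefiles.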
To train a model, for example for translating English to French, run:
make SRCLANGS=en TRGLANGS=fr train
To evaluate the model with the automatically generated test data (from the Tatoeba corpus as a default) run:
make SRCLANGS=en TRGLANGS=fr eval
For multilingual (more than one language on either side) models run, for example:
make SRCLANGS="de en" TRGLANGS="fr es pt" train
make SRCLANGS="de en" TRGLANGS="fr es pt" eval
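The train and eval calls above always come in pairs, so they can be wrapped in a small function. This is a sketch under assumptions: train_and_eval and the MAKE_CMD variable are hypothetical names introduced here, and only the train and eval targets and the SRCLANGS and TRGLANGS variables come from the commands shown above.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of OPUS-MT-train):
# train a model and evaluate it only if training succeeded.
#   $1: source languages, space-separated (for example "de en")
#   $2: target languages, space-separated (for example "fr es pt")
# MAKE_CMD defaults to "make"; set MAKE_CMD="echo make" for a dry run.
train_and_eval() {
    src=$1
    trg=$2
    # Each language list is passed to make as one argument, spaces included.
    ${MAKE_CMD:-make} SRCLANGS="$src" TRGLANGS="$trg" train &&
        ${MAKE_CMD:-make} SRCLANGS="$src" TRGLANGS="$trg" eval
}
```

With MAKE_CMD="echo make", calling train_and_eval "de en" "fr es pt" only prints the two commands (without the quotes, because echo joins its arguments with spaces), which is a cheap way to check the arguments before launching a long job.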
Note that data pre-processing should run on CPUs and training/testing on GPUs. To speed things up, you can process data sets in parallel using the jobs flag of make, for example with 8 parallel jobs:
make -j 8 SRCLANGS=en TRGLANGS=fr data
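Instead of hard-coding the number of jobs, it can be derived from the machine. The following sketch assumes a system that provides nproc (GNU coreutils) or getconf; cpu_jobs is a hypothetical helper name, not something defined in this repository.

```shell
#!/bin/sh
# Hypothetical helper (not part of OPUS-MT-train):
# print the number of CPU cores available to this process, or 1 if unknown.
cpu_jobs() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    else
        getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
    fi
}

# Example (shown as a comment so that sourcing this file does not start a build):
#   make -j "$(cpu_jobs)" SRCLANGS=en TRGLANGS=fr data
```

On a shared login node, a fixed and smaller job count may be the more considerate choice.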