Train Opus-MT models
This package includes scripts for training NMT models using MarianNMT and OPUS data for OPUS-MT. More details are given in the Makefile, but the documentation still needs to be improved. Note that the targets require a specific environment and currently only work well on the CSC HPC cluster in Finland.
Pre-trained models
The subdirectory models contains information about pre-trained models that can be downloaded from this project. They are distributed under a CC-BY 4.0 license.
Prerequisites
Running the scripts does not work out of the box because many settings are adjusted to the local installation on our IT infrastructure at CSC. Below is an incomplete list of prerequisites needed for running the process. Making the training procedures and settings more transparent and self-contained is on our TODO list. Preliminary information about installation and setup is listed under Documentation below.
- marian-nmt: The essential NMT toolkit we use in OPUS-MT; make sure you compile a version with GPU and SentencePiece support!
- Moses scripts: various pre- and post-processing scripts from the Moses SMT toolkit (also bundled with marian-nmt)
- OpusTools: library and tools for accessing OPUS data
- OpusTools-perl: additional tools for accessing OPUS data
- iso-639: a Python package for ISO 639 language codes
- Perl modules ISO::639::3 and ISO::639::5
- jq JSON processor
Optional (recommended) software:
- terashuf: efficiently shuffle massive data sets
- pigz: multithreaded gzip
- eflomal (needed for word alignment when transformer-align is used)
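As a rough sketch of the setup, the Python and Perl dependencies can usually be installed as shown below. The package names follow the projects listed above, but this is only an illustration; on the CSC cluster the tools are typically provided through the module system instead, and marian-nmt has to be compiled separately with GPU and SentencePiece support.

    # Python tools for accessing OPUS data and ISO 639 language codes
    pip install opustools iso-639

    # Perl modules for ISO 639 language codes (via cpanminus)
    cpanm ISO::639::3 ISO::639::5

    # jq usually comes from the system package manager, for example:
    # sudo apt-get install jq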
Documentation
- Installation and setup
- Details about tasks and recipes
- Information about back-translation
- Information about fine-tuning models
- How to generate pivot-language-based translations
- How to train models for the Tatoeba MT Challenge
Structure of the training scripts
Essential files for making new models:
- Makefile: top-level makefile
- lib/env.mk: system-specific environment (now based on CSC machines)
- lib/config.mk: essential model configuration
- lib/data.mk: data pre-processing tasks
- lib/generic.mk: generic implicit rules that can extend other tasks
- lib/dist.mk: make packages for distributing models (CSC ObjectStorage based)
- lib/slurm.mk: submit jobs with SLURM
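These files are pulled together by the top-level Makefile. The sketch below only illustrates that include structure; the actual Makefile contains additional settings and targets.

    # illustrative include structure (not a verbatim copy of the top-level Makefile)
    include lib/env.mk       # machine-specific paths and environment
    include lib/config.mk    # model and training configuration
    include lib/data.mk      # data download and pre-processing rules
    include lib/generic.mk   # generic implicit rules
    include lib/dist.mk      # packaging of trained models
    include lib/slurm.mk     # SLURM job submission wrappers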
There are also make targets for specific projects and tasks. Look into lib/projects/ to see what has been defined already. Note that this changes frequently! Check the file lib/projects.mk to see which project files are enabled. Current examples include:
- lib/projects/multilingual.mk: various multilingual models
- lib/projects/celtic.mk: data and models for Celtic languages
- lib/projects/doclevel.mk: experimental document-level models
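A project file typically just fixes the language settings and reuses the generic targets. The snippet below is a hypothetical example; the file name and target name are illustrative and not taken from an existing project file.

    # hypothetical lib/projects/example.mk
    # convenience target that trains an English-French model
    # by re-invoking make with the generic train target
    example-enfr:
    	${MAKE} SRCLANGS=en TRGLANGS=fr train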
Run this if you want to train a model, for example for translating English to French:
make SRCLANGS=en TRGLANGS=fr train
To evaluate the model with the automatically generated test data (from the Tatoeba corpus as a default) run:
make SRCLANGS=en TRGLANGS=fr eval
For multilingual models (more than one language on either side), run, for example:
make SRCLANGS="de en" TRGLANGS="fr es pt" train
make SRCLANGS="de en" TRGLANGS="fr es pt" eval
Note that data pre-processing should run on CPUs and training/testing on GPUs. To speed things up you can process data sets in parallel using the jobs flag of make, for example with 8 parallel jobs:
make -j 8 SRCLANGS=en TRGLANGS=fr data
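On a SLURM-based cluster such as the CSC machines, the same targets can be submitted as batch jobs through lib/slurm.mk. The suffix-based submission target below is an assumption about how that wrapper is exposed; check lib/slurm.mk for the actual rule names.

    # pre-process data locally on CPUs, then submit the GPU training job to SLURM
    make -j 8 SRCLANGS=en TRGLANGS=fr data
    make SRCLANGS=en TRGLANGS=fr train.submit    # assumed pattern rule from lib/slurm.mk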