Training open neural machine translation models

Train OPUS-MT models

This package includes scripts for training NMT models for OPUS-MT using MarianNMT and OPUS data. More details are given in the Makefile, but the documentation still needs to be improved. The make targets also require a specific environment and currently only work well on the CSC HPC cluster in Finland.

Pre-trained models

The subdirectory models contains information about pre-trained models that can be downloaded from this project. They are distributed under a CC-BY 4.0 license.

Prerequisites

The scripts do not work out of the box because many settings are adjusted for the local installations on our IT infrastructure at CSC. Below is an incomplete list of prerequisites needed for running a process. Making the training procedures and settings more transparent and self-contained is on our TODO list. Preliminary information about installation and setup is available here.

Optional (recommended) software:

  • terashuf: efficiently shuffle massive data sets
  • pigz: multithreaded gzip
  • eflomal (needed for word alignment when transformer-align is used)
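
Since these tools are optional, it can help to check which of them are on your PATH before starting a run. The following is a minimal sketch and not part of this package: the function name check_tools is hypothetical, and the executable names passed to it are assumptions that may differ in your installation.

```shell
#!/bin/sh
# Report which of the given tools cannot be found on PATH.
# Prints one "missing: NAME" line per tool that is not found
# and prints nothing for tools that are available.
check_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
        fi
    done
}

# Assumed executable names for the optional software listed above.
check_tools terashuf pigz eflomal
```

An empty result means all listed tools were found. A missing tool is only reported here, not treated as an error, since the list above describes the software as optional.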

Documentation

Structure of the training scripts

Essential files for making new models:

  • Makefile: top-level makefile
  • lib/env.mk: system-specific environment (currently based on CSC machines)
  • lib/config.mk: essential model configuration
  • lib/data.mk: data pre-processing tasks
  • lib/generic.mk: generic implicit rules that can extend other tasks
  • lib/dist.mk: make packages for distributing models (CSC ObjectStorage based)
  • lib/slurm.mk: submit jobs with SLURM

There are also make targets for specific projects and tasks. Look into lib/projects/ to see what has been defined already. Note that this changes frequently! Check the file lib/projects.mk to see which project files are enabled. Current examples include:

  • lib/projects/multilingual.mk: various multilingual models
  • lib/projects/celtic.mk: data and models for Celtic languages
  • lib/projects/doclevel.mk: experimental document-level models

To train a model, for example for translating English to French, run:

make SRCLANGS=en TRGLANGS=fr train

To evaluate the model with the automatically generated test data (taken from the Tatoeba corpus by default), run:

make SRCLANGS=en TRGLANGS=fr eval

For multilingual models (more than one language on either side), run for example:

make SRCLANGS="de en" TRGLANGS="fr es pt" train
make SRCLANGS="de en" TRGLANGS="fr es pt" eval

Note that data pre-processing should run on CPUs and training/testing on GPUs. To speed things up, you can process data sets in parallel using the jobs flag of make (-j), for example with 8 parallel jobs:

make -j 8 SRCLANGS=en TRGLANGS=fr data
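
The commands in this section follow one pattern: the same SRCLANGS and TRGLANGS values combined with a different make target. As a rough sketch (the helper opusmt_cmd is hypothetical and not part of this package), the shell function below only prints the command line for a given target. This lets you review it before running it on the appropriate kind of node (CPU for data, GPU for train and eval).

```shell
#!/bin/sh
# Print (do not run) a make command line for one step of the workflow.
#   $1: source language(s), space-separated
#   $2: target language(s), space-separated
#   $3: make target (data, train or eval in the examples above)
#   $4: number of parallel make jobs (optional, defaults to 1)
opusmt_cmd() {
    njobs=${4:-1}
    printf 'make -j %s SRCLANGS="%s" TRGLANGS="%s" %s\n' "$njobs" "$1" "$2" "$3"
}

# Multilingual example: pre-process with 8 jobs, then train and evaluate.
opusmt_cmd "de en" "fr es pt" data 8
opusmt_cmd "de en" "fr es pt" train
opusmt_cmd "de en" "fr es pt" eval
```

The first call prints make -j 8 SRCLANGS="de en" TRGLANGS="fr es pt" data. Printing instead of executing is a deliberate choice here, because the targets depend on the local environment described under Prerequisites.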