Training open neural machine translation models

Train Opus-MT models

This package includes scripts for training NMT models with MarianNMT on OPUS data for OPUS-MT. More details are given in the Makefile, but the documentation still needs improvement. Note that the make targets assume a specific environment and currently only work well on the CSC HPC cluster in Finland.

Structure

Essential files for making new models:

  • Makefile: top-level makefile
  • lib/env.mk: system-specific environment (now based on CSC machines)
  • lib/config.mk: essential model configuration
  • lib/data.mk: data pre-processing tasks
  • lib/generic.mk: generic implicit rules that can extend other tasks
  • lib/dist.mk: make packages for distributing models (CSC ObjectStorage based)
  • lib/slurm.mk: submit jobs with SLURM

There are also make targets for specific models and tasks. Look into lib/models/ to see what has already been defined, and note that this changes frequently! Examples include:

  • lib/models/multilingua.mk: various multilingual models
  • lib/models/celtic.mk: data and models for Celtic languages
  • lib/models/doclevel.mk: experimental document-level models

To train a model, for example for translating English to French, run:

make SRCLANG=en TRGLANG=fr train
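The command above is easy to wrap in a small helper when training several pairs. In the sketch below, only `SRCLANG`, `TRGLANG` and the `train` target come from this package; the function name is illustrative, and it only prints the command so you can inspect it before running:

```shell
# Hypothetical helper: builds the make invocation for one language pair.
# Only SRCLANG/TRGLANG and the "train" target are from the Makefile;
# the wrapper itself is an illustrative sketch.
train_cmd() {
  echo "make SRCLANG=$1 TRGLANG=$2 train"
}

train_cmd en fr   # prints the command; pipe to sh to actually run it
```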

To evaluate the model with the automatically generated test data (by default from the Tatoeba corpus), run:

make SRCLANG=en TRGLANG=fr eval

For multilingual models (more than one language on either side), run, for example:

make SRCLANG="de en" TRGLANG="fr es pt" train
make SRCLANG="de en" TRGLANG="fr es pt" eval
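Since the language lists are passed verbatim to make, running both targets for a multilingual setup can be sketched as a small loop. The helper name below is made up; the quoting is what matters, as it keeps the space-separated lists as single make variables:

```shell
# Sketch: print train and eval invocations for a multilingual model.
# multi_cmds is an invented name; the quoted lists are passed as-is to make.
multi_cmds() {
  src_list="$1"; trg_list="$2"
  for target in train eval; do
    printf 'make SRCLANG="%s" TRGLANG="%s" %s\n' "$src_list" "$trg_list" "$target"
  done
}

multi_cmds "de en" "fr es pt"
```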

Note that data pre-processing should run on CPUs, while training and testing should run on GPUs. To speed things up, you can process data sets in parallel using make's jobs flag, for example with 8 parallel jobs:

make -j 8 SRCLANG=en TRGLANG=fr data
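Putting the CPU/GPU split together, the targets for one language pair run in a fixed order: data first, then train, then eval. The sketch below only prints the commands in that order (the helper name is invented; on the CSC cluster each stage would be submitted to the appropriate CPU or GPU partition):

```shell
# Sketch of the recommended target order for one language pair:
# data on CPU (parallel), then train and eval on GPU.
# pair_pipeline is an illustrative name; it only prints the commands.
pair_pipeline() {
  src="$1"; trg="$2"; jobs="$3"
  echo "make -j $jobs SRCLANG=$src TRGLANG=$trg data"   # CPU, parallel jobs
  echo "make SRCLANG=$src TRGLANG=$trg train"           # GPU
  echo "make SRCLANG=$src TRGLANG=$trg eval"            # GPU
}

pair_pipeline en fr 8
```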

Upload to Object Storage

To upload model packages to the OPUS-MT container and make it publicly readable:

swift upload OPUS-MT --changed --skip-identical name-of-file
swift post OPUS-MT --read-acl ".r:*"
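A minimal sketch of scripting the upload, assuming the model packages are passed as file arguments (the helper name and the per-file loop are illustrative; the swift subcommands and flags are the ones above, and the sketch only prints the commands):

```shell
# Sketch: print the swift commands to upload each given package to the
# OPUS-MT container and then make the container publicly readable.
# upload_cmds is an invented name; file arguments are assumptions.
upload_cmds() {
  for f in "$@"; do
    echo "swift upload OPUS-MT --changed --skip-identical $f"
  done
  echo 'swift post OPUS-MT --read-acl ".r:*"'
}

upload_cmds en-fr.zip
```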