mirror of https://github.com/Helsinki-NLP/OPUS-MT-train.git synced 2024-12-04 12:56:34 +03:00

Training open neural machine translation models

Joerg Tiedemann f749bd7a87 goethe test setting		2020-01-12 22:08:50 +02:00

Train OPUS-MT models

This package includes scripts for training NMT models for OPUS-MT using MarianNMT and OPUS data. More details are given in the Makefile, but the documentation still needs to be improved. The make targets also require a specific environment and currently only work well on the CSC HPC cluster in Finland.

Structure

Essential files for making new models:

Makefile: top-level makefile
Makefile.env: system-specific environment (currently based on CSC machines)
Makefile.config: essential model configuration
Makefile.data: data pre-processing tasks
Makefile.doclevel: experimental document-level models
Makefile.tasks: tasks for training specific models and other things (this frequently changes)
Makefile.dist: make packages for distributing models (CSC ObjectStorage based)
Makefile.slurm: submit jobs with SLURM

To train a model, for example for translating English to French, run:

make SRCLANG=en TRGLANG=fr train

To evaluate the model with the automatically generated test data (taken from the Tatoeba corpus by default), run:

make SRCLANG=en TRGLANG=fr eval

For multilingual models (more than one language on either side), run for example:

make SRCLANG="de en" TRGLANG="fr es pt" train
make SRCLANG="de en" TRGLANG="fr es pt" eval
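
The train and eval calls above can be chained per language pair. The following is only a minimal sketch and not part of the repository: `run_pair` and `DRY_RUN` are hypothetical names introduced here, and only the `train` and `eval` targets and the `SRCLANG`/`TRGLANG` variables are taken from the commands above. By default it just prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: run_pair and DRY_RUN are hypothetical, not part of OPUS-MT-train.

# Train and then evaluate one model; stops at the first failing step.
# $1 = source language(s), $2 = target language(s)
# (space-separated lists for multilingual models).
run_pair() {
    src=$1
    trg=$2
    for target in train eval; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            # Dry run (the default): only show the command.
            echo "make SRCLANG=\"$src\" TRGLANG=\"$trg\" $target"
        else
            make SRCLANG="$src" TRGLANG="$trg" "$target" || return 1
        fi
    done
}

run_pair en fr
run_pair "de en" "fr es pt"
```

Setting `DRY_RUN=0` would actually invoke make, which only makes sense in an environment where the targets work (see the note on the CSC cluster above).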

Note that data pre-processing should run on CPUs and training/testing on GPUs. To speed things up, you can process data sets in parallel using the jobs flag (-j) of make, for example with 8 parallel jobs:

make -j 8 SRCLANG=en TRGLANG=fr data
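
The job count does not have to be hard-coded. As a rough sketch (the helper name `data_cmd` is hypothetical and not part of the repository), the command above could be assembled with the number of online CPU cores as the default, falling back to 8 when that number cannot be determined. The helper only prints the command; it does not run make.

```shell
#!/bin/sh
# Sketch only: data_cmd is a hypothetical helper, not part of OPUS-MT-train.

# Print the data pre-processing command for a language pair.
# $1 = source language(s), $2 = target language(s),
# $3 = number of parallel make jobs (optional).
data_cmd() {
    # Default: number of online CPU cores, or 8 if getconf cannot report it.
    jobs=${3:-$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 8)}
    echo "make -j $jobs SRCLANG=\"$1\" TRGLANG=\"$2\" data"
}

data_cmd en fr 8
```

On a shared cluster the cores allocated to a job may be fewer than the cores on the node, so passing the job count explicitly is the safer choice there.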

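Upload to Object Storage

Upload a file to the OPUS-MT container, skipping files that have not changed, and then make the container publicly readable: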

swift upload OPUS-MT --changed --skip-identical name-of-file
swift post OPUS-MT --read-acl ".r:*"