mirror of https://github.com/Helsinki-NLP/OPUS-MT-train.git synced 2024-08-17 16:20:50 +03:00

Training open neural machine translation models



Joerg Tiedemann b36d9a3e22 initial import		2020-01-10 16:45:42 +02:00
models	initial import	2020-01-10 16:45:42 +02:00
testsets	initial import	2020-01-10 16:45:42 +02:00
work-bt	initial import	2020-01-10 16:45:42 +02:00
work-spm	initial import	2020-01-10 16:45:42 +02:00
large-context.pl	initial import	2020-01-10 16:45:42 +02:00
LICENSE	initial import	2020-01-10 16:45:42 +02:00
Makefile	initial import	2020-01-10 16:45:42 +02:00
Makefile.config	initial import	2020-01-10 16:45:42 +02:00
Makefile.data	initial import	2020-01-10 16:45:42 +02:00
Makefile.def	initial import	2020-01-10 16:45:42 +02:00
Makefile.dist	initial import	2020-01-10 16:45:42 +02:00
Makefile.doclevel	initial import	2020-01-10 16:45:42 +02:00
Makefile.env	initial import	2020-01-10 16:45:42 +02:00
Makefile.slurm	initial import	2020-01-10 16:45:42 +02:00
Makefile.tasks	initial import	2020-01-10 16:45:42 +02:00
postprocess-bpe.sh	initial import	2020-01-10 16:45:42 +02:00
postprocess-spm.sh	initial import	2020-01-10 16:45:42 +02:00
preprocess-bpe.sh	initial import	2020-01-10 16:45:42 +02:00
preprocess-spm.sh	initial import	2020-01-10 16:45:42 +02:00
project_2000661-openrc-backup.sh	initial import	2020-01-10 16:45:42 +02:00
README.md	initial import	2020-01-10 16:45:42 +02:00
TODO.md	initial import	2020-01-10 16:45:42 +02:00
verify-wordalign.pl	initial import	2020-01-10 16:45:42 +02:00

README.md

Train Opus-MT models

This folder includes make targets for training NMT models using MarianNMT and OPUS data. More details are given in the Makefile, but the documentation still needs to be improved. The targets also require a specific environment and currently only work well on the CSC HPC cluster in Finland.

Structure

Essential files for making new models:

Makefile: top-level makefile
Makefile.env: system-specific environment (currently based on CSC machines)
Makefile.config: essential model configuration
Makefile.data: data pre-processing tasks
Makefile.doclevel: experimental document-level models
Makefile.tasks: tasks for training specific models and other things (changes frequently)
Makefile.dist: make packages for distributing models (CSC ObjectStorage based)
Makefile.slurm: submit jobs with SLURM

To train a model, for example for translating English to French, run:

make SRCLANG=en TRGLANG=fr train

To evaluate the model with the automatically generated test data (from the Tatoeba corpus as a default) run:

make SRCLANG=en TRGLANG=fr eval

For multilingual models (more than one language on either side), run for example:

make SRCLANG="de en" TRGLANG="fr es pt" train
make SRCLANG="de en" TRGLANG="fr es pt" eval

Note that data pre-processing should run on CPUs, and training and testing on GPUs. To speed things up, you can process data sets in parallel using the jobs flag (-j) of make, for example with 8 parallel jobs:

make -j 8 SRCLANG=en TRGLANG=fr data
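
The calls shown so far can be collected for one language pair. The following is a minimal sketch, not part of the repository: the helper name `opus_mt_cmds` and the default of 8 jobs are assumptions for illustration. It only prints the `make` invocations for the `data`, `train` and `eval` targets used in this README, so that they can be inspected and then run on suitable CPU or GPU nodes; it does not imply any particular dependency between the targets.

```shell
#!/bin/sh
# Hypothetical helper (not part of OPUS-MT-train): print the make calls
# for one language pair without executing them.
# Usage: opus_mt_cmds "<source langs>" "<target langs>" [jobs]
opus_mt_cmds() {
    src=$1
    trg=$2
    jobs=${3:-8}   # number of parallel make jobs for pre-processing (assumed default)
    # Data pre-processing (CPU), several jobs in parallel
    printf 'make -j %s SRCLANG="%s" TRGLANG="%s" data\n' "$jobs" "$src" "$trg"
    # Training and evaluation (GPU)
    printf 'make SRCLANG="%s" TRGLANG="%s" train\n' "$src" "$trg"
    printf 'make SRCLANG="%s" TRGLANG="%s" eval\n' "$src" "$trg"
}

# Example: a multilingual model, pre-processing with 4 parallel jobs
opus_mt_cmds "de en" "fr es pt" 4
```

Printing the commands instead of running them keeps the sketch safe to try outside the CSC environment; the printed lines can be pasted into a shell or a job script once they look right.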

Upload to Object Storage

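The first command uploads a file to the OPUS-MT container, skipping files that have not changed or are identical to the stored copy; the second makes the container publicly readable:

swift upload OPUS-MT --changed --skip-identical name-of-file
swift post OPUS-MT --read-acl ".r:*"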
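
For several files, the same two steps can be generated in a loop. This is a rough sketch under assumptions: the helper name `upload_cmds` and the file names in the example call are hypothetical, and the function only prints the `swift` commands shown in this README rather than running them.

```shell
#!/bin/sh
# Hypothetical helper (not part of OPUS-MT-train): print one upload
# command per file, followed by the command that sets the read ACL.
upload_cmds() {
    for f in "$@"; do
        printf 'swift upload OPUS-MT --changed --skip-identical %s\n' "$f"
    done
    printf 'swift post OPUS-MT --read-acl ".r:*"\n'
}

# Example with placeholder file names
upload_cmds name-of-file another-file
```

File names are printed unquoted, so names containing spaces would need quoting before the printed commands are executed.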