Train OPUS-MT models
This package includes scripts for training NMT models for OPUS-MT using MarianNMT and OPUS data. More details are given in the Makefile, but the documentation still needs to be improved. The make targets also require a specific environment and currently work well only on the CSC HPC cluster in Finland.
Structure
Essential files for making new models:
Makefile: top-level makefile
Makefile.env: system-specific environment (now based on CSC machines)
Makefile.config: essential model configuration
Makefile.data: data pre-processing tasks
Makefile.doclevel: experimental document-level models
Makefile.tasks: tasks for training specific models and other things (this frequently changes)
Makefile.dist: make packages for distributing models (CSC ObjectStorage based)
Makefile.slurm: submit jobs with SLURM
To train a model, for example one that translates English to French, run:
make SRCLANG=en TRGLANG=fr train
To evaluate the model with the automatically generated test data (taken from the Tatoeba corpus by default), run:
make SRCLANG=en TRGLANG=fr eval
For multilingual models (more than one language on either side), run for example:
make SRCLANG="de en" TRGLANG="fr es pt" train
make SRCLANG="de en" TRGLANG="fr es pt" eval
Note that data pre-processing should run on CPUs and training/testing on GPUs. To speed things up, you can process data sets in parallel with the jobs flag (-j) of make, for example with 8 parallel jobs:
make -j 8 SRCLANG=en TRGLANG=fr data
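The three targets shown above (data, train, eval) can be chained in a small wrapper. The following is only a minimal sketch: the run and run_pipeline helpers and the DRY_RUN variable are hypothetical and not part of this repository. By default it only prints the make commands, so each stage can still be started by hand on the appropriate hardware (CPU for data, GPU for train and eval).

```shell
#!/usr/bin/env bash
# Sketch only: run, run_pipeline and DRY_RUN are hypothetical helpers,
# not part of the OPUS-MT training scripts.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Chain the documented targets for one language pair.
# With DRY_RUN=0, "set -e" stops the chain at the first failing stage.
run_pipeline() {
  local src="$1" trg="$2" jobs="${3:-8}"
  run make -j "$jobs" SRCLANG="$src" TRGLANG="$trg" data   # pre-processing (CPU)
  run make SRCLANG="$src" TRGLANG="$trg" train             # training (GPU)
  run make SRCLANG="$src" TRGLANG="$trg" eval              # evaluation (GPU)
}

run_pipeline en fr 8
```

Setting DRY_RUN=0 executes the commands instead of printing them, which only makes sense on a machine where all three stages can run.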
Upload to Object Storage
swift upload OPUS-MT --changed --skip-identical name-of-file
swift post OPUS-MT --read-acl ".r:*"
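The first command uploads a file to the OPUS-MT container and skips files that are unchanged or identical on both sides. The second sets the container's read ACL to ".r:*", which makes its objects publicly readable. To upload several files in one go, the two commands could be wrapped as below. This is a sketch under assumptions: the run and upload_models helpers and the DRY_RUN variable are hypothetical, and name-of-file remains a placeholder as above.

```shell
#!/usr/bin/env bash
# Sketch only: run, upload_models and DRY_RUN are hypothetical helpers.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode (for inspection only), otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Upload every given file to the container, then set the public read ACL.
upload_models() {
  local container="$1"
  shift
  local f
  for f in "$@"; do
    run swift upload "$container" --changed --skip-identical "$f"
  done
  run swift post "$container" --read-acl ".r:*"
}

upload_models OPUS-MT name-of-file
```

Running it with DRY_RUN=0 requires the swift client and valid credentials for the object storage.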