Training open neural machine translation models

Train Opus-MT models

This package includes scripts for training NMT models for OPUS-MT using MarianNMT and OPUS data. More details are given in the Makefile, but the documentation still needs to be improved. The targets also require a specific environment and currently only work well on the CSC HPC cluster in Finland.

Structure

Essential files for making new models:

  • Makefile: top-level makefile
  • Makefile.env: system-specific environment (currently based on CSC machines)
  • Makefile.config: essential model configuration
  • Makefile.data: data pre-processing tasks
  • Makefile.doclevel: experimental document-level models
  • Makefile.tasks: tasks for training specific models and other things (changes frequently)
  • Makefile.dist: make packages for distributing models (CSC ObjectStorage based)
  • Makefile.slurm: submit jobs with SLURM

To train a model, for example for translating English to French, run:

make SRCLANG=en TRGLANG=fr train

To evaluate the model with the automatically generated test data (taken from the Tatoeba corpus by default), run:

make SRCLANG=en TRGLANG=fr eval

For multilingual models (more than one language on either side), run, for example:

make SRCLANG="de en" TRGLANG="fr es pt" train
make SRCLANG="de en" TRGLANG="fr es pt" eval
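The train and eval targets above can be chained for several language pairs. The following is only a sketch under stated assumptions: the RUN variable and the train_and_eval function are hypothetical helpers, not part of this package, and the default RUN=echo merely prints the make commands instead of executing them. Note that echo drops the quotes around multilingual lists in the printed output; the real invocation keeps each list as a single argument.

```shell
# Sketch only: RUN and train_and_eval are hypothetical helpers, not part of this package.
# By default RUN=echo prints the commands; call the script with RUN= to execute them.
RUN="${RUN-echo}"

train_and_eval() {
    src="$1"
    trg="$2"
    # Quoting keeps multilingual lists such as "de en" together as one make variable.
    # Evaluation is only started if training succeeded.
    $RUN make SRCLANG="$src" TRGLANG="$trg" train &&
    $RUN make SRCLANG="$src" TRGLANG="$trg" eval
}

train_and_eval en fr
train_and_eval "de en" "fr es pt"
```

Because training and testing are expected to run on GPUs, a loop like this would normally be started inside a GPU job rather than on a login node.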

Note that data pre-processing should run on CPUs, while training and testing should run on GPUs. To speed things up, you can process data sets in parallel using the jobs flag of make, for example with 8 parallel jobs:

make -j 8 SRCLANG=en TRGLANG=fr data
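Instead of hard-coding the number of jobs, it could be derived from the available CPU cores. This is a minimal sketch, assuming a POSIX shell: pick_jobs is a hypothetical helper, SLURM_CPUS_PER_TASK is only set by SLURM when a job requests --cpus-per-task, and nproc and getconf are tried as fallbacks. The final line only prints the resulting make command.

```shell
# Sketch only: pick_jobs is a hypothetical helper, not part of this package.
pick_jobs() {
    if [ -n "${SLURM_CPUS_PER_TASK:-}" ]; then
        # Inside a SLURM job that requested --cpus-per-task, use the allocated count.
        echo "$SLURM_CPUS_PER_TASK"
    else
        # Otherwise ask the system; fall back to a single job if nothing is available.
        nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
    fi
}

JOBS="$(pick_jobs)"
# Print the command instead of running it; remove "echo" to start pre-processing.
echo make -j "$JOBS" SRCLANG=en TRGLANG=fr data
```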

Upload to Object Storage

swift upload OPUS-MT --changed --skip-identical name-of-file
swift post OPUS-MT --read-acl ".r:*"
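The first command uploads a file to the OPUS-MT container, and the second sets a read ACL that makes the container publicly readable. To upload several files in one go, the two commands could be wrapped as in the sketch below. The RUN variable, the upload_files function and the file names are hypothetical; with the default RUN=echo the commands are only printed.

```shell
# Sketch only: RUN, upload_files and the file names are hypothetical.
# By default RUN=echo prints the commands; call the script with RUN= to execute them.
RUN="${RUN-echo}"
CONTAINER=OPUS-MT

upload_files() {
    for f in "$@"; do
        # Stop at the first failed upload so the problem is noticed.
        $RUN swift upload "$CONTAINER" --changed --skip-identical "$f" || return 1
    done
    # Make the container readable for everyone once all uploads are done.
    $RUN swift post "$CONTAINER" --read-acl ".r:*"
}

upload_files model-a.zip model-b.zip
```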