OPUS-MT-train documentation
This package includes scripts and makefiles for training neural machine translation (NMT) models. The documentation is still incomplete. All build targets are defined in makefiles; the main idea is to provide a flexible setup for running different jobs for many language pairs and to support all tasks necessary to build and test a model.
The package includes 4 components:
- basic training of bilingual and multilingual models (Makefile)
- generating back-translations for data augmentation (backtranslate/Makefile)
- fine-tuning models for domain adaptation (finetune/Makefile)
- generating pivot-language-based translations for data augmentation (pivoting/Makefile)
More information about specific tasks:
- Creating data files
- Training models
- Testing models
- Running batch jobs
- Packaging, releases and storage
- Models for the Tatoeba MT Challenge
Tutorials (to-do)
Documentation of project-specific models:
- Celtic language models
- Romance language models
- Russian models
- Sami language models
- Languages in Finland
- Multilingual models
- Doc-level models
- Simplification models
- Fiskmö project
- MeMAD project
- Wikimedia collaboration model
Main structure of build scripts
The make targets and essential system properties are defined in a number of makefiles that are included from top-level Makefiles.
- Makefile: top-level makefile for main tasks
- backtranslate/Makefile: top-level makefile for generating back-translations
- finetune/Makefile: top-level makefile for fine-tuning
- pivoting/Makefile: top-level makefile for pivot-based translations
Configurations and definitions about the system environment are stored in
- lib/env.mk: system-specific environment (now based on CSC machines)
- lib/config.mk: essential model configuration
- lib/langsets.mk: definition of language sets
- ${WORKDIR}/config.mk: model-specific configuration (only if it exists)
The model-specific configuration can store properties that would otherwise have to be given on the command line when calling make targets. You can generate the configuration file using
make [OPTIONS] local-config
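As a rough illustration of what [OPTIONS] might look like, the sketch below composes such a command line for a given set of languages. The variable names SRCLANGS and TRGLANGS are assumptions that do not appear in this section, and local_config_cmd is a hypothetical helper; check lib/config.mk for the options your checkout actually reads.

```shell
# Sketch only: SRCLANGS and TRGLANGS are assumed option names, and
# local_config_cmd is a hypothetical helper, not part of the package.
local_config_cmd() {
    # $1: space-separated source language IDs
    # $2: space-separated target language IDs
    printf 'make SRCLANGS="%s" TRGLANGS="%s" local-config\n' "$1" "$2"
}

# Print the command for review before running it from the repository root.
local_config_cmd "fi et" "en"
```

Printing the command instead of executing it keeps the sketch safe to try outside the repository.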
Essential targets for training and testing NMT models are provided in
- lib/data.mk: data pre-processing tasks
- lib/train.mk: training models
- lib/test.mk: translating with existing models and evaluating test sets
Targets for job management, packaging and other project related tasks:
- lib/slurm.mk: submit jobs with SLURM
- lib/dist.mk: make packages for distributing models (CSC ObjectStorage based)
- lib/generic.mk: generic implicit rules that can extend other tasks
- lib/misc.mk: miscellaneous tasks
Targets for specific models and projects in lib/models/, currently:
- lib/models/celtic.mk: data and models for Celtic languages
- lib/models/finland.mk: main languages spoken in Finland
- lib/models/fiskmo.mk: models related to the fiskmö project
- lib/models/memad.mk: models related to the MeMAD project
- lib/models/multilingual.mk: various multilingual models
- lib/models/opus.mk: models covering OPUS languages
- lib/models/romance.mk: Romance languages
- lib/models/russian.mk: data and models for Russian
- lib/models/sami.mk: data and models for Sami languages
- lib/models/wikimedia.mk: models related to the Wikimedia collaboration
- lib/models/wikimatrix.mk: models that include WikiMatrix data
Targets related to the Tatoeba MT Challenge:
lib/models/tatoeba.mk
Scripts for various tasks in scripts/:
- scripts/filter: filtering data (currently language identification only)
- scripts/cleanup: language-specific cleanup scripts (should not remove lines to keep alignment)
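Because the cleanup scripts must not remove lines, a simple sanity check is to compare line counts of the source and target files after cleanup. The following is a minimal sketch of such a check; same_line_count is a hypothetical helper and the file names in the usage comment are placeholders, neither is part of the package.

```shell
# Hypothetical helper: succeed only if two files have the same number of lines,
# which is a necessary condition for a parallel corpus to stay aligned.
same_line_count() {
    a=$(wc -l < "$1" | tr -d ' ')   # strip padding that some wc versions add
    b=$(wc -l < "$2" | tr -d ' ')
    [ "$a" -eq "$b" ]
}

# Usage with placeholder file names:
#   same_line_count train.src train.trg || echo "alignment broken" >&2
```

Equal line counts do not prove that the sentences still correspond, but a mismatch reliably signals that a script dropped or added lines.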
Data structure
- original source data is expected in ${OPUSHOME} (see lib/env.mk)
- pre-processed data will be stored in work/data/simple (current default setting, can be adjusted with WORKHOME and settings for PRE)
- model-specific data is stored in work/LANGPAIRSTR
- model-specific training data: work/LANGPAIRSTR/train
- model-specific validation data: work/LANGPAIRSTR/val
- model-specific test data: work/LANGPAIRSTR/test
- additional test sets are stored in testsets/ sorted by language pair
- released models are stored in models/LANGPAIRSTR
LANGPAIRSTR is generated from the specified source and target languages. The source language IDs are joined using + as a delimiter, the target language IDs are joined in the same way, and the two resulting strings are joined using -. For example, fi+et-en is the model directory for a multilingual model that includes Finnish and Estonian as source languages and English as target language.
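The naming rule and directory layout above can be sketched in a few lines of shell. Both functions are hypothetical helpers written for illustration, not code from the makefiles; the sketch assumes language IDs separated by single spaces and the default work directory named work, and it does not sort or deduplicate the IDs.

```shell
# Hypothetical helpers, not part of OPUS-MT-train.

# Build a LANGPAIRSTR-style name from two space-separated lists of language IDs.
langpairstr() {
    src=$(printf '%s' "$1" | tr ' ' '+')   # join source IDs with '+'
    trg=$(printf '%s' "$2" | tr ' ' '+')   # join target IDs with '+'
    printf '%s-%s\n' "$src" "$trg"         # join both sides with '-'
}

# List the model-specific directories for one model.
# $1: LANGPAIRSTR, $2: work directory (assumed default: work)
model_dirs() {
    for sub in train val test; do
        printf '%s/%s/%s\n' "${2:-work}" "$1" "$sub"
    done
    printf 'models/%s\n' "$1"
}

pair=$(langpairstr "fi et" "en")
echo "$pair"          # prints fi+et-en
model_dirs "$pair"    # prints the train, val, test and release directories
```

A model with one source and several target languages works the same way: langpairstr "de" "fi sv" yields de-fi+sv.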