mirror of https://github.com/Helsinki-NLP/OPUS-MT-train.git synced 2024-11-30 12:32:24 +03:00

OPUS-MT-train documentation

This package provides scripts and makefiles for training NMT models; the documentation here is still incomplete. All build targets are defined in makefiles. The main idea is to provide a flexible setup for running different jobs for many language pairs and to support all tasks needed to build and test a model.

The package includes four components:

Basic training of bilingual and multilingual models (Makefile)
Generating back-translations for data augmentation (Makefile)
Fine-tuning models for domain adaptation (Makefile)
Generating pivot-language-based translations for data augmentation (pivoting)

Information about installation and setup is available here.

More information about specific tasks:

Tutorials (to-do)

Documentation of project-specific models:

Main structure of build scripts

The make targets and essential system properties are defined in a number of makefiles that are included from top-level Makefiles.

Makefile: top-level makefile for main tasks
backtranslate/Makefile: top-level makefile for generating back-translations
finetune/Makefile: top-level makefile for fine-tuning
pivoting/Makefile: top-level makefile for pivot-based translations

Configurations and definitions about the system environment are stored in:

lib/env.mk: system-specific environment (now based on CSC machines)
lib/config.mk: essential model configuration
lib/langsets.mk: definition of language sets
${WORKDIR}/config.mk: model-specific configuration (only if it exists)

The model-specific configuration can store properties that would otherwise have to be given on the command line when calling make targets. You can generate the configuration file using

make [OPTIONS] config
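
Because the model-specific file is only used if it exists, it can help to check which of the configuration files listed above are actually present. The following is a minimal sketch, not part of the package: `list_config_files` is a hypothetical helper, and its two arguments (a repository root and a model work directory) are placeholders chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of OPUS-MT-train).
# $1: repository root (assumed to contain lib/)
# $2: model work directory (stands in for ${WORKDIR})
list_config_files() {
    root=$1
    workdir=$2
    for f in "$root/lib/env.mk" \
             "$root/lib/config.mk" \
             "$root/lib/langsets.mk" \
             "$workdir/config.mk"; do
        # Report only the files that are actually present
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Calling `list_config_files . path/to/workdir` prints one existing file per line; if the last entry is missing, no model-specific configuration has been generated yet.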

Essential targets for training and testing NMT models are provided in

lib/data.mk: data pre-processing tasks
lib/train.mk: training models
lib/test.mk: translating with existing models and evaluating test sets

Targets for job management, packaging and other project related tasks:

lib/slurm.mk: submit jobs with SLURM
lib/dist.mk: make packages for distributing models (CSC ObjectStorage based)
lib/generic.mk: generic implicit rules that can extend other tasks
lib/misc.mk: miscellaneous tasks

Targets for specific models and projects in lib/projects/, currently:

lib/projects.mk: high-level makefile that includes enabled projects
lib/projects/celtic.mk: data and models for Celtic languages
lib/projects/finland.mk: main languages spoken in Finland
lib/projects/fiskmo.mk: models related to the fiskmö project
lib/projects/memad.mk: models related to the MeMAD project
lib/projects/multilingual.mk: various multilingual models
lib/projects/opus.mk: models covering OPUS languages
lib/projects/romance.mk: Romance languages
lib/projects/russian.mk: data and models for Russian
lib/projects/sami.mk: data and models for Sami languages
lib/projects/wikimedia.mk: models related to the Wikimedia collaboration
lib/projects/wikimatrix.mk: models that include WikiMatrix data

Targets related to the Tatoeba MT Challenge:

lib/projects/tatoeba.mk

Scripts for various tasks in scripts/:

scripts/filter: filtering data (currently language identification only)
scripts/cleanup: language-specific cleanup scripts (should not remove lines to keep alignment)

Data structure

original source data is expected in ${OPUSHOME} (see lib/env.mk)
pre-processed data will be stored in work/data/simple (current default setting, can be adjusted with WORKHOME and settings for PRE)
model-specific data is stored in work/LANGPAIRSTR
model-specific training data: work/LANGPAIRSTR/train
model-specific validation data: work/LANGPAIRSTR/val
model-specific test data: work/LANGPAIRSTR/test
additional test sets are stored in testsets/ sorted by language pair
released models are stored in models/LANGPAIRSTR
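
As a rough illustration of the layout above, the sketch below prints the default locations for one model. `model_paths` is a hypothetical helper, not something the package provides; it hard-codes the default directory names quoted in the list (`work` and `models`) instead of reading them from the makefiles, so it does not reflect adjusted settings such as WORKHOME.

```shell
#!/bin/sh
# Hypothetical helper (not part of OPUS-MT-train).
# $1: a LANGPAIRSTR value, for example fi-en
# "work" and "models" are the defaults named in the list above.
model_paths() {
    pair=$1
    printf '%s\n' \
        "work/$pair/train" \
        "work/$pair/val" \
        "work/$pair/test" \
        "models/$pair"
}

model_paths "fi-en"
```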

LANGPAIRSTR is generated from the specified source and target languages. The language IDs on each side are joined using + as a delimiter, and the two resulting strings are then joined using -. For example, fi+et-en is the model directory for a multilingual model that includes Finnish and Estonian as source languages and English as the target language.
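
The naming rule can be sketched in a few lines of shell. This only mirrors the rule as described here and is not the code used in the makefiles; `join_langs` and `langpairstr` are hypothetical names, and the input is assumed to be space-separated language IDs without leading or trailing spaces.

```shell
#!/bin/sh
# Hypothetical helpers (not part of OPUS-MT-train).

# Join space-separated language IDs with '+'.
join_langs() {
    printf '%s' "$1" | tr -s ' ' '+'
}

# Combine the joined source side ($1) and target side ($2) with '-'.
langpairstr() {
    printf '%s-%s\n' "$(join_langs "$1")" "$(join_langs "$2")"
}

langpairstr "fi et" "en"   # prints fi+et-en
```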