more documentation

This commit is contained in:
Joerg Tiedemann 2020-08-26 21:45:03 +03:00
parent a8b54f5311
commit 596dd993a5
6 changed files with 276 additions and 14 deletions

doc/BatchJobs.md

@ -0,0 +1,113 @@
# Running batch jobs
The beauty of the whole package is the ability to run batch jobs for training many models in various settings. Some batch jobs are specified with their own targets, others in dedicated makefiles. Submitting jobs to SLURM is also supported to make job creation on a cluster convenient.
## SLURM jobs
There are two generic implicit rules to submit jobs to SLURM using `sbatch` in `lib/slurm.mk`:
* `%.submit`: submit a job to a GPU node
* `%.submitcpu`: submit a job to a CPU node
The suffix can be added to any target to trigger job submission instead of execution in the current shell.
The options are highly specific to the job management system on puhti@CSC and need to be adjusted when used on a different server. For more details, look into `lib/env.mk` and `lib/slurm.mk`. The submitted job is essentially a call to `make` running the target without the suffix. For example,
```
make SRCLANGS=en TRGLANGS=de all.submit
```
submits a job to the GPU queue for running everything needed to prepare, train and evaluate a model for English-German (basically running `make all` on the allocated node). The variable assignments specified on the command line magically get transferred to the job call (I still don't really know why - but this is great ...).
There are important variables that modify allocations requested by the job:
* HPC_NODES: number of nodes to be allocated (default: 1)
* HPC_QUEUE: SLURM CPU queue (default on puhti: small)
* HPC_GPUQUEUE: SLURM GPU queue (default: gpu)
* HPC_DISK: local disc space allocated in GB (default: 500)
* HPC_TIME: allocated walltime in hh:mm (default: 72:00)
* HPC_CORES: number of CPU cores (default: 1)
* HPC_MEM: allocated RAM (default: 4g)
There are three shortcuts/aliases for lazy people:
* MEM: can be used instead of HPC_MEM
* THREADS: can be used instead of HPC_CORES
* WALLTIME: can be used instead of HPC_TIME but only accepts whole hours (like `72`)
GPU-specific parameters include:
* GPU: device name (default: v100 on puhti)
* NR_GPUS: number of GPUs allocated (default: 1)
* GPU_MODULES: software packages to be loaded as modules before running the make command
CPU-specific parameters are:
* CPU_MODULES: software packages to be loaded as modules before running the make command
Extra parameters include:
* EMAIL: e-mail notification when the job is done
* HPC_EXTRA: can be set to pass any additional parameters to the SLURM startup script
* CSCPROJECT: project ID to be billed on puhti
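Any of these variables can be combined with a `.submit`/`.submitcpu` target on the command line. A sketch (language pair and resource values are arbitrary examples) that requests more memory and cores and a shorter walltime:
```
make SRCLANGS=fi TRGLANGS=sv MEM=8g THREADS=4 WALLTIME=24 all.submit
```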
## Combined and generic targets
There are various targets that combine tasks, set up common pipelines or create very task-specific jobs. Some more generic targets are defined in the top-level Makefile and in `lib/generic.mk`. Some of the more interesting ones are:
* `all`: make the entire pipeline from preparing to evaluation
* `train-and-eval`: train a model and evaluate on all test sets
Submitting jobs in combination with other tasks:
* `all-job`: prepare all data and then submit a GPU job to train and eval a model (see the example after this list)
* `train-job`: submit a GPU job for training a model (will also trigger multi-GPU jobs if specified in the model config)
* `train-and-eval-job`: the same as above but also evaluate all test sets
* `bilingual`: prepare all data, submit a multi-GPU training job, reverse the data and submit another multi-GPU job in reverse direction
* `bilingual-small`: the same as above but single GPU jobs, reduced MarianNMT workspace and faster validation frequencies
* `multilingual`: make data for a multilingual model and start a multi-GPU job to train it (use LANG to specify the languages to be used on both sides)
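For example, preparing the data and submitting a GPU training job for a single language pair could look like this (a sketch using the placeholder codes `xx`/`yy`):
```
make SRCLANGS=xx TRGLANGS=yy all-job
```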
Some very complex tasks (these targets are not well tested and might not work - be careful before running them):
* `all2pivot`: make data and create jobs for all languages combined with a PIVOT language in both directions; use LANGS to specify the languages to be considered and PIVOT for the pivot language (default=en)
* `train-and-start-bt-jobs`: train a model, evaluate it, create a local distribution package, start back-translation of all wikidata in separate jobs (only for bilingual models)
* `all-and-backtranslate`: similar to the above but starts back-translation for all language pairs (in case this is a multilingual model); no separate translation jobs and only Wikipedia data
* `all-and-backtranslate-allwikis`: similar to above but back-translate data from all wikis
* `all-and-backtranslate-allwikiparts`: same as above but translate all parts of all wikis
Combining these to run the complete pipeline:
* `all-with-bt`: run `all-and-backtranslate` in reverse direction and then run `all-bt` (entire pipeline with back-translated data)
* `all-with-bt-all`: same as above but run `all-and-backtranslate-allwikis` first
* `all-with-bt-allparts`: same as above but run `all-and-backtranslate-allwikiparts` first
### Generic rules
Some implicit rules can be used to trigger certain batch jobs. Typically, they can be used by adding a suffix to existing targets, for example:
* `%-all`: run a target over all language pairs in WORKHOME, for example `make eval-all`
* `%-allmodels`: run a target over all models in all sub-dirs in WORKHOME, for example `make eval-allmodels` (so this also includes different types of models for the same language pair)
* `%-allbilingual`: run a target over all bilingual models in WORKHOME, for example `make eval-allbilingual`
* `%-allmultilingual`: run a target over all multilingual models in WORKHOME, for example `make eval-allmultilingual`
* `%-all-parallel`: basically the same as `%-all` but enables parallelization
Generic rules for special model types:
* `%-bt`: include back-translation data, e.g., `make train-bt`
* `%-pivot`: include pivot-based translation, e.g., `make train-pivot`
* `%-RL`: right-to-left models, e.g., `make all-RL`
* `%-bpe`: use BPE instead of sentence-piece, e.g. `make all-bpe`
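These suffixes can also be combined with the job-submission suffixes described above. For example, running the entire pipeline with back-translated data as a submitted GPU job could look like this (a hypothetical combination of the pieces above):
```
make SRCLANGS=xx TRGLANGS=yy all-bt.submit
```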


@ -1,10 +1,61 @@
# Creating data files
## Overview
Relevant makefiles:
* [Makefile](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/Makefile)
* [lib/config.mk](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/lib/config.mk)
* [lib/data.mk](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/lib/data.mk)
* [lib/preprocess.mk](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/lib/preprocess.mk)
* [lib/sentencepiece.mk](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/lib/sentencepiece.mk)
* [lib/bpe.mk](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/lib/bpe.mk)
Main recipes:
* `data`: create all data, subword models, optional word alignment, vocabulary
* `devdata`: create validation data set
* `testdata`: create test data set
* `traindata`: create train data set
* `reverse-data`: create data in reverse translation direction (bilingual models only)
* `wordalign`: make word alignments
* `spm-models`: train source and target language sentence-piece models
* `bpe-models`: train source and target language BPE models
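The individual recipes can presumably also be run on their own with the usual language settings, for example (a sketch):
```
make SRCLANGS=xx TRGLANGS=yy traindata
make SRCLANGS=xx TRGLANGS=yy spm-models
```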
Parameters / variables:
* `SRCLANGS`: list of source language codes
* `TRGLANGS`: list of target language codes
* `DEVSET`: corpus name for validation data (default: Tatoeba/GlobalVoices/infopankki/JW300/bible-uedin)
* `TESTSET`: corpus name for test data (default: DEVSET)
* `TRAINSET`: list of corpora for training data (default: all except DEVSET, TESTSET and EXCLUDE_CORPORA (WMT-News, ...))
* `USE_REST_DEVDATA`: if set to 1 then unused DEVSET data is added to train (default: 1)
* `DEVSIZE`: number of sentence pairs in validation data (default: 5000/2500)
* `TESTSIZE`: number of sentence pairs in test data (default: 5000/2500)
* `DEVSMALLSIZE`: reduced size of validation data for small data sets (default: 10000)
* `TESTSMALLSIZE`: reduced size of test data for small data sets (default: 10000)
* `DEVMINSIZE`: minimum number of sentence pairs in validation data (default: 150)
* `BPESIZE`: subword segmentation model size (default: 32000)
* `SRCBPESIZE`: source language subword segmentation model size (default: BPESIZE)
* `TRGBPESIZE`: target language subword segmentation model size (default: BPESIZE)
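Like other settings, these variables can be overridden on the command line. A sketch for a low-resource pair with a smaller subword vocabulary and validation set (values are arbitrary examples):
```
make SRCLANGS=xx TRGLANGS=yy BPESIZE=12000 DEVSIZE=1000 data
```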
Implicit rules:
* `%-bt`: include back-translations
* `%-pivot`: include pivot-based translations
## Detailed information
* data sets are defined in `lib/config.mk`
* data sets are created using targets from `lib/data.mk` and `lib/preprocess.mk`
* subword models are trained and applied with targets from `lib/sentencepiece.mk` and `bpe.mk`
* data sets are created using recipes from `lib/data.mk` and `lib/preprocess.mk`
* subword models are trained and applied with recipes from `lib/sentencepiece.mk` and `bpe.mk`
The main target for creating data sets (train, validation, test sets) for a model translating from languages `xx` to languages `yy` is
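`data`, which can presumably be invoked like this:
```
make SRCLANGS=xx TRGLANGS=yy data
```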


@ -11,20 +11,12 @@ The package includes 4 components:
* [pivoting](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master//pivoting/Makefile) for data augmentation
General principles:
* use variables in make-calls to adjust parameters (like language pair to consider, data to use, ...)
* proper dependencies to automate the whole pipeline (and also to allow parallel execution using the -j flag)
* support submitting SLURM jobs and starting large batches of jobs
* heavy use of (phony) implicit targets with some kind of suffix notation to support setup changes
More information about specific tasks:
* [Creating data files](Data.md)
* [Training models](Train.md)
* [Testing models](Test.md)
* [Running on a cluster](Slurm.md)
* [Running batch jobs](BatchJobs.md)
* [Generating back-translations](https://github.com/Helsinki-NLP/OPUS-MT-train/backtranslate/README.md)
* [Fine-tuning models](https://github.com/Helsinki-NLP/OPUS-MT-train/finetune/README.md)
* [Generate pivot-language-based translations](https://github.com/Helsinki-NLP/OPUS-MT-train/pivoting/README.md)


@ -1 +1,58 @@
# Translating and evaluating
## Overview
Relevant makefiles:
* [Makefile](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/Makefile)
* [lib/config.mk](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/lib/config.mk)
* [lib/test.mk](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/lib/test.mk)
Main recipes:
* `translate`: translate the test set
* `eval`: evaluate translated test set
* `compare`: merge input, output, reference
* `eval-testsets`: translate and evaluate all test sets
* `eval-ensemble`: evaluate model ensemble
* `eval-RL`: evaluate right-to-left model
* `eval-allmodels`: evaluate all models in WORKHOME
Parameters / variables:
* `SRCLANGS`: list of source language codes
* `TRGLANGS`: list of target language codes
* `MODELTYPE`: transformer or transformer-align (with guided alignment) (default: transformer-align)
## Detailed information
Translating and evaluating the test set can be done by running:
```
make [OPTIONS] translate
make [OPTIONS] eval
make [OPTIONS] compare
```
Set the options to correspond to your model, so at least `SRCLANGS` and `TRGLANGS`.
It is not necessary to call `make translate` separately, as `make eval` depends on the translated data and will create it if needed. `make compare` generates a file that merges input, output and reference translation for comparison.
The translations and evaluation scores will be stored in the work directory of the current model (`work/${LANGPAIRSTR}`) with the name of the test set and the name of the model.
* translations: `${WORKDIR}/${TESTSET_NAME}.${MODEL}${NR}.${MODELTYPE}.${SRC}.${TRG}`
* scores: `${WORKDIR}/${TESTSET_NAME}.${MODEL}${NR}.${MODELTYPE}.${SRC}.${TRG}.eval`
* comparison: `${WORKDIR}/${TESTSET_NAME}.${MODEL}${NR}.${MODELTYPE}.${SRC}.${TRG}.compare`
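As a purely illustrative substitution (all values hypothetical): with `TESTSET_NAME=Tatoeba`, `MODEL=opus`, `NR=1`, `MODELTYPE=transformer-align`, `SRC=en`, `TRG=de` and a work directory `work/en-de`, the score file would be named roughly:
```
work/en-de/Tatoeba.opus1.transformer-align.en.de.eval
```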
## Translate additional test sets
There is a collection of additional test sets in `testsets/`. It is possible to run through all test sets of language pairs that are supported by the current model by calling:
```
make [OPTIONS] eval-testsets
```
All translations and evaluation scores will be stored with the test set names in the work directory of the model.


@ -1,13 +1,60 @@
# Training models
## Overview
Relevant makefiles:
* [Makefile](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/Makefile)
* [lib/config.mk](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/lib/config.mk)
* [lib/train.mk](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/lib/train.mk)
* [lib/generic.mk](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/lib/generic.mk)
Main recipes:
* `train`: train a model
* `train-multigpu`: train with 4 GPUs
* `train-RL`: right-to-left model
* `train.submit`: submit train job
* `train.submit-multigpu`: submit multi-GPU job
Parameters / variables:
* `SRCLANGS`: list of source language codes
* `TRGLANGS`: list of target language codes
* `MODELTYPE`: transformer or transformer-align (with guided alignment) (default: transformer-align)
* `NR`: model number, also used for initialisation seed (default: 1)
* `MARIAN_VALID_FREQ`: validation frequency (default: 10000)
* `MARIAN_EARLY_STOPPING`: stop after number of validation steps without improvement (default: 10)
* `MARIAN_WORKSPACE`: allocated space on GPU (default: depends on device, see `lib/env.mk`)
* `WALLTIME`: walltime for HPC jobs in hours (default: 72)
## Detailed information
Training a model can be started by simply running:
```
make SRCLANGS=xx TRGLANGS=yy train
```
This should be done on a machine with a GPU or submitted to a GPU node via SLURM.
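The training parameters listed above can be overridden on the command line as well. A sketch (values are arbitrary examples) that validates more often, tolerates more stalled validation steps and trains a second model for ensembling:
```
make SRCLANGS=xx TRGLANGS=yy MARIAN_VALID_FREQ=5000 MARIAN_EARLY_STOPPING=15 NR=2 train
```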
The model will be trained in the model-specific WORKDIR, which defaults to `work/${LANGPAIRSTR}/`. The name depends on the data and other parameters and the model basename is set to `${MODEL_SUBDIR}${DATASET}${TRAINSIZE}.${PRE_SRC}-${PRE_TRG}.${MODELTYPE}.model${NR}`. This includes:
* MODEL_SUBDIR: optional sub directory (default: empty string)
* DATASET: main data set used for training (default = opus)
* TRAINSIZE: optional size of the training data (cropped from beginning), default = empty (i.e. use all data)
* PRE_SRC and PRE_TRG: segmentation model applied (default = spm32k)
* MODELTYPE: either transformer or transformer-align (using guided alignment), default = transformer-align
* NR: model number (for ensembling, also used as seed for initialisation)
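With the defaults listed above (no sub-directory, no training-size cropping), the basename should resolve to something like `opus.spm32k-spm32k.transformer-align.model1`.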
Logfiles are stored in the same work directory with names similar to the model files (see extension `.log`).
Training should be done on a machine with a GPU or submitted to a GPU node via SLURM.
The default configuration and parameters for training Marian-NMT models are specified in `lib/train.mk` and `lib/config.mk`.
```
@ -75,8 +122,10 @@ make SRCLANGS=xx TRGLANGS=yy train-multigpu
```
## Running on a cluster
Submitting jobs via SLURM is supported but highly specific to the setting on puhti and our infrastructure at CSC.
Add the suffix `.submit` and set appropriate variables for job requirements, for example,
starting a single-gpu job with walltime of 48 hours:
@ -91,4 +140,4 @@ This can be combined with the multi-gpu suffix:
make SRCLANGS=xx TRGLANGS=yy WALLTIME=48 train.submit-multigpu
```
More details on job management in [Slurm.md](Slurm.md)
More details on job management in [BatchJobs.md](BatchJobs.md)


@ -69,7 +69,7 @@ else ifneq ($(wildcard /wrk/tiedeman/research),)
MARIAN = ${HOME}/appl_taito/tools/marian/build-gpu
MARIANCPU = ${HOME}/appl_taito/tools/marian/build-cpu
LOADMODS = ${LOADGPU}
else ifeq (${shell hostname --domain},bullx)
else ifeq (${shell hostname --domain 2>/dev/null},bullx)
CSCPROJECT = project_2002688
WORKHOME = ${shell realpath ${PWD}/work}
APPLHOME = /projappl/project_2001194