# Training models

## Overview
Relevant makefiles: `lib/train.mk`, `lib/config.mk`, `lib/env.mk`
Main recipes:

- `train`: train a model
- `train-multigpu`: train with 4 GPUs
- `train-RL`: train a right-to-left model
- `train.submit`: submit a train job
- `train.submit-multigpu`: submit a multi-GPU train job
Parameters / variables:

- `SRCLANGS`: list of source language codes
- `TRGLANGS`: list of target language codes
- `MODELTYPE`: `transformer` or `transformer-align` (with guided alignment) (default: `transformer-align`)
- `NR`: model number, also used as the initialisation seed (default: 1)
- `MARIAN_VALID_FREQ`: validation frequency (default: 10000)
- `MARIAN_EARLY_STOPPING`: stop after this number of validation steps without improvement (default: 10)
- `MARIAN_WORKSPACE`: workspace allocated on the GPU (default depends on the device, see `lib/env.mk`)
- `WALLTIME`: walltime for HPC jobs in hours (default: 72)
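All of these variables can be overridden on the `make` command line. A hypothetical invocation (the language codes and values are placeholders, not taken from the original documentation):

```
make SRCLANGS=de TRGLANGS=en MARIAN_VALID_FREQ=5000 MARIAN_EARLY_STOPPING=15 train
```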
## Detailed information
Training a model can be started by simply running:
```
make SRCLANGS=xx TRGLANGS=yy train
```
The model will be trained in the model-specific `WORKDIR`, which defaults to `work/${LANGPAIRSTR}/`. The model basename depends on the data and other parameters and is set to `${MODEL_SUBDIR}${DATASET}${TRAINSIZE}.${PRE_SRC}-${PRE_TRG}.${MODELTYPE}.model${NR}`. This includes:
- `MODEL_SUBDIR`: optional subdirectory (default: empty string)
- `DATASET`: main data set used for training (default: opus)
- `TRAINSIZE`: optional size of the training data, cropped from the beginning (default: empty, i.e. use all data)
- `PRE_SRC` and `PRE_TRG`: segmentation model applied (default: spm32k)
- `MODELTYPE`: either `transformer` or `transformer-align` (using guided alignment) (default: transformer-align)
- `NR`: model number (for ensembling; also used as the seed for initialisation)
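For illustration, assuming `${LANGPAIRSTR}` resolves to `de-en`, training with `SRCLANGS=de TRGLANGS=en` and the defaults above would produce model files under a path like the following (the exact file suffixes depend on the makefiles):

```
work/de-en/opus.spm32k-spm32k.transformer-align.model1.npz
```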
Logfiles are stored in the same work directory with a similar name as the model files (with the extension `.log`).
Training should be done on a machine with GPU or submitted to a GPU node via SLURM.
The default configuration and parameters for training Marian-NMT models are specified in `lib/train.mk` and `lib/config.mk`. Training calls `marian` with the following parameters:
```
--type transformer \
--max-length 500 \
--mini-batch-fit \
-w ${MARIAN_WORKSPACE} \
--maxi-batch ${MARIAN_MAXI_BATCH} \
--early-stopping ${MARIAN_EARLY_STOPPING} \
--valid-freq ${MARIAN_VALID_FREQ} \
--save-freq ${MARIAN_SAVE_FREQ} \
--disp-freq ${MARIAN_DISP_FREQ} \
--valid-metrics perplexity \
--valid-mini-batch ${MARIAN_VALID_MINI_BATCH} \
--beam-size 12 --normalize 1 --allow-unk \
--enc-depth 6 --dec-depth 6 \
--transformer-heads 8 \
--transformer-postprocess-emb d \
--transformer-postprocess dan \
--transformer-dropout ${MARIAN_DROPOUT} \
--label-smoothing 0.1 \
--learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report \
--optimizer-params 0.9 0.98 1e-09 --clip-norm 5 \
--tied-embeddings-all \
--overwrite --keep-best \
--devices ${MARIAN_GPUS} \
--sync-sgd --seed ${SEED} \
--sqlite \
--tempdir ${TMPDIR} \
--exponential-smoothing \
--guided-alignment ${TRAIN_ALG}
```
- guided alignment is only used with `MODELTYPE=transformer-align`
- `MARIAN_WORKSPACE` is set depending on the GPU that is used (13000 for p100, 24000 for v100 and 10000 otherwise)
- if no GPU is found (tested using `nvidia-smi`), then the `--cpu-threads` flag is added automatically (see the sketch after the defaults below)
- other default parameters are set like this:
```
MARIAN_GPUS             = 0
MARIAN_EXTRA            =
MARIAN_VALID_FREQ       = 10000
MARIAN_SAVE_FREQ        = ${MARIAN_VALID_FREQ}
MARIAN_DISP_FREQ        = ${MARIAN_VALID_FREQ}
MARIAN_EARLY_STOPPING   = 10
MARIAN_VALID_MINI_BATCH = 16
MARIAN_MAXI_BATCH       = 500
MARIAN_DROPOUT          = 0.1
MARIAN_MAX_LENGTH       = 500
```
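As a rough illustration of the CPU fallback mentioned above, a make fragment along the following lines could implement it. This is a minimal sketch, not the actual code in `lib/env.mk`; `GPU_COUNT` and `CPU_CORES` are hypothetical variable names:

```
## sketch only -- NOT the actual lib/env.mk logic
## count visible GPUs; a missing or failing nvidia-smi yields 0
GPU_COUNT := ${shell nvidia-smi -L 2>/dev/null | grep -c '^GPU'}
ifeq (${GPU_COUNT},0)
  ## no GPU found: fall back to CPU training
  MARIAN_EXTRA += --cpu-threads ${CPU_CORES}
endif
```

Note that `MARIAN_EXTRA` is empty by default, so it can presumably also be used to inject additional Marian flags from the command line, e.g. `make MARIAN_EXTRA="--fp16" ... train` (an assumption based on the variable name; `--fp16` is a standard Marian flag but is not part of this recipe's defaults).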
## Multi-GPU training
Multi-GPU training can be enabled by adding a suffix to the make target:
- `-multigpu` or `-gpu0123`: train on 4 GPUs (devices 0,1,2,3)
- `-twogpu` or `-gpu01`: train on 2 GPUs (devices 0,1)
- `-gpu23`: train on 2 GPUs (devices 2,3)
So, for example:
```
make SRCLANGS=xx TRGLANGS=yy train-multigpu
```
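Similarly, to train on devices 2 and 3 only (e.g., when the first two GPUs are busy):

```
make SRCLANGS=xx TRGLANGS=yy train-gpu23
```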
## Running on a cluster
Submitting jobs via SLURM is supported but highly specific to the setup on puhti and our infrastructure at CSC. Add the suffix `.submit` to the make target and set appropriate variables for the job requirements. For example, to start a single-GPU job with a walltime of 48 hours:
```
make SRCLANGS=xx TRGLANGS=yy WALLTIME=48 train.submit
```
This can be combined with the multi-GPU suffix:

```
make SRCLANGS=xx TRGLANGS=yy WALLTIME=48 train.submit-multigpu
```
More details on job management can be found in `BatchJobs.md`.