mirror of https://github.com/Helsinki-NLP/OPUS-MT-train.git synced 2024-11-27 11:03:13 +03:00

Model fine-tuning

Scripts for fine-tuning transformer models on a small amount of in-domain data.

NOTE: This currently works only for bilingual SentencePiece models.

Requirements

marian-nmt
SentencePiece
Moses pre-processing scripts
OpusTools-perl (for extracting text from TMX)
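
Before running any target, it can help to check that the required command-line tools are reachable. The following is a minimal sketch, not part of this package: check_tools is a hypothetical helper, and marian and spm_encode are assumed here to be the binary names of marian-nmt and SentencePiece on your PATH. The Makefile may locate them differently, for example via explicit paths.

```shell
# Hypothetical helper: report which of the given commands are not on PATH.
# Returns 0 if all are found, 1 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Assumed binary names; adjust to your installation.
check_tools marian spm_encode || echo "install the missing tools first"
```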

Basic use

Create a fine-tuning data set from the newstest data (part of the eval data in this package) and then run the all target, for example for English-German:

make SRC=en TRG=de news-tune-data
make SRC=en TRG=de all

Fine-tune with data from a given TMX file. The translation direction follows the sorted language IDs taken from the TMX file:

make TMXFILE=file.tmx tmx-tune

Fine-tune with data from a given TMX file in reverse direction:

make TMXFILE=file.tmx REVERSE=1 tmx-tune
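
The direction rule above can be illustrated with a small sketch. It assumes that "sorted" means plain alphabetical order of the two language IDs and that REVERSE=1 simply swaps source and target; tmx_direction is a hypothetical helper for illustration and does not come from the Makefile.

```shell
# Hypothetical helper: derive the fine-tuning direction from two language IDs.
# Assumption: the IDs are sorted alphabetically; the third argument mimics REVERSE=1.
tmx_direction() {
  lang_a=$1
  lang_b=$2
  reverse=${3:-0}
  # Sort the two IDs: the first becomes the source, the second the target.
  set -- $(printf '%s\n' "$lang_a" "$lang_b" | LC_ALL=C sort)
  if [ "$reverse" = "1" ]; then
    printf '%s-%s\n' "$2" "$1"
  else
    printf '%s-%s\n' "$1" "$2"
  fi
}

tmx_direction fi en      # default direction
tmx_direction fi en 1    # reversed direction
```

Under these assumptions, a TMX file with the language IDs fi and en would be used for en-fi by default and for fi-en with REVERSE=1.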

Output

Fine-tuned models are stored in subdirectories named after the language pair and the model name, for example:

en-de/news/model

Test scores for the baseline and the fine-tuned models are in:

en-de/news/test/*.eval
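
To look at all scores for one language pair and model name at once, something like the following sketch could be used. show_scores is a hypothetical helper; it only assumes the directory layout described above and makes no assumption about what the .eval files contain.

```shell
# Hypothetical helper: print every *.eval file under <pair>/<name>/test,
# each preceded by a header line with its path.
show_scores() {
  test_dir="$1/$2/test"
  for f in "$test_dir"/*.eval; do
    # Skip the unexpanded pattern when no score files exist yet.
    [ -e "$f" ] || continue
    printf '== %s\n' "$f"
    cat "$f"
  done
}

show_scores en-de news
```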

Step-wise procedure

The whole procedure consists of several steps that can be run separately:

#  make data .............. pre-process train/dev data
#  make tune .............. fine-tune model
#  make translate ......... translate test set with fine-tuned model
#  make translate-baseline  translate test set with baseline model
#  make eval .............. evaluate test set translation (fine-tuned)
#  make eval-baseline ..... evaluate test set translation (baseline)
#  make compare ........... put together source, reference translation and system output
#  make compare-baseline .. same as compare but with baseline translation
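
The steps listed above could be chained in a small wrapper, sketched below. This is an illustration under assumptions: run_steps and DRY_RUN are hypothetical names, running the targets in the listed order is assumed to be valid, and passing SRC and TRG to every step mirrors the basic-use example rather than anything confirmed for each target. By default the sketch only prints the commands.

```shell
# Targets in the order listed above (assumed to be a valid execution order).
STEPS="data tune translate translate-baseline eval eval-baseline compare compare-baseline"

# Hypothetical wrapper: print (DRY_RUN=1, the default) or run each step in turn.
run_steps() {
  src=$1
  trg=$2
  for step in $STEPS; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "make SRC=$src TRG=$trg $step"
    else
      # Stop at the first failing step.
      make SRC="$src" TRG="$trg" "$step" || return 1
    fi
  done
}

run_steps en de
```

Setting DRY_RUN=0 in the environment would execute the targets instead of printing them.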

TODO

Make it work with multilingual models (the pre-processing scripts need to be adjusted for those models).