Models for Sami languages

Recipes for training multilingual language models involving Sami languages.

Overview

Relevant makefiles:

  • lib/projects/sami.mk (main recipes and parameters)
  • backtranslate/Makefile (back-translation recipes)
  • pivoting/Makefile (pivot-based translation recipes)

Main recipes:

  • sami-data: fetch and convert external data sets
  • data-sami: prepare data for a multilingual model including Sami languages
  • train-sami: train a multilingual model for Sami languages
  • eval-sami: evaluate the multilingual model above
  • dist-sami: create a release package for the multilingual model

See also the implicit rules below for additional common recipes.
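
For example, a complete run from data download to release package chains these recipes (each step can also be run on its own):

make sami-data
make data-sami
make train-sami
make eval-sami
make dist-sami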

Recipes for back-translation (go to sub-directory backtranslate):

  • sami-corp: download monolingual Sami corpora from Giellatekno
  • translate-sami-corp: translate monolingual Sami corpora with a multilingual model (hardcoded in backtranslate/Makefile)
  • translate-sami-wiki: translate Northern Sami wiki data into a wide range of languages using a multilingual model (hardcoded in backtranslate/Makefile), and translate wiki data for a selected set of languages (currently: no, nn, ru, sv, en) into Sami languages using the same multilingual model
  • translate-sami: do both of the things above (translate-sami-corp and translate-sami-wiki)
  • translate-sami-xx-wiki: translate wiki data from Northern Sami to other Sami languages, Finnish, Norwegian, and Swedish using a "sami-xx" model
  • translate-xx-sami-wiki: translate wiki data from Finnish, Norwegian, and Swedish to Sami languages using a "xx-sami" model
  • translate-sami-xx-corp: translate monolingual corpora from Giellatekno from Sami languages to Finnish, Norwegian, and Swedish using a "sami-xx" model
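
For example, fetching the monolingual corpora and translating them with the hardcoded multilingual model can be done from the top level with make -C (a sketch; the recipe names are taken from the list above):

make -C backtranslate sami-corp
make -C backtranslate translate-sami-corp
make -C backtranslate translate-sami-wiki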

Recipes for pivot-based translation (go to sub-directory pivoting) with variables SRC (e.g. fi), TRG (e.g. se) and PIVOT (e.g. nb); see the example below:

  • all: fetch model, prepare data sets and translate
  • prepare: prepare the data sets (pivot bitexts)
  • translate: translate the pivot bitexts
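
For example, to prepare and translate pivot bitexts from Finnish to Northern Sami via Norwegian Bokmål (a sketch using the example values given above):

make -C pivoting SRC=fi TRG=se PIVOT=nb all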

Data-related recipes:

  • fetch-sami-tmx: fetch translation memories from Giellatekno
  • convert-sami-tmx: convert the TMX files from above
  • merge-sami-data: merge the converted TMX files into one bitext
  • convert-sami-gloss: convert bilingual glossaries from Giellatekno
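
These steps are normally run together as part of make sami-data (see below), but they can also be invoked one by one, for example:

make fetch-sami-tmx
make convert-sami-tmx
make merge-sami-data
make convert-sami-gloss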

Parameters / variables:

  • GIELLATEKNO_HOME: URL for Giellatekno resources (default: https://victorio.uit.no/biggies/trunk)
  • GIELLATEKNO_TM_HOME: directory of translation memories (default: ${GIELLATEKNO_HOME}/mt/omegat)
  • GIELLATEKNO_SAMI_TM: list of translation memories to be downloaded (see lib/projects/sami.mk)
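
The variables can be overridden on the command line, for example to fetch the resources from a mirror (hypothetical URL, shown only to illustrate the override):

make GIELLATEKNO_HOME=https://my.mirror.example/biggies/trunk sami-data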

Implicit rules:

  • %-sami: run recipe for a multilingual model including Sami languages (e.g. make train-sami)
  • %-sami-xx: run recipe for a model that translates Sami languages to selected other languages
  • %-xx-sami: run recipe for a model that translates selected languages to Sami languages
  • %-bt: include back-translated data (can be combined with the implicit rules above, e.g. make train-bt-sami)
  • %-pivot: include pivot-based translations (can be combined with the implicit rules above, e.g. make train-pivot-sami)
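
For example, the patterns compose with the common recipe names:

make train-sami       # symmetric multilingual Sami model
make train-sami-xx    # from Sami languages to selected other languages
make train-bt-sami    # as train-sami, but with back-translated data added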

Detailed information

There are special recipes for working with Sami languages. First of all, we would like to include additional data sets to extend the training data and to have decent amounts of test data for evaluation. Open data sources are available from Giellatekno, and they can be downloaded and converted using:

make sami-data

This assumes that wget is available on your system and that the OpusTools-perl package is installed (see also installation and setup).
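
A quick sanity check before starting the download (a minimal sketch; it only verifies that wget and a Perl interpreter for OpusTools-perl are on the PATH):

command -v wget >/dev/null || echo "wget is missing"
command -v perl >/dev/null || echo "perl (required by OpusTools-perl) is missing"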

If everything works fine, you should see non-empty files in work/data/simple with the basenames giella and glossary:

ls -al work/data/simple/giella*
-rw-r--r-- 1 tiedeman 3.3M Sep  9 12:34 work/data/simple/giella.fi-se.clean.fi.gz
-rw-r--r-- 1 tiedeman 3.2M Sep  9 12:34 work/data/simple/giella.fi-se.clean.se.gz
-rw-r--r-- 1 tiedeman 1.1M Sep  9 12:34 work/data/simple/giella.fi-smn.clean.fi.gz
-rw-r--r-- 1 tiedeman 1.1M Sep  9 12:34 work/data/simple/giella.fi-smn.clean.smn.gz
-rw-r--r-- 1 tiedeman 6.9M Sep  9 12:34 work/data/simple/giella.nb-se.clean.nb.gz
-rw-r--r-- 1 tiedeman 7.2M Sep  9 12:34 work/data/simple/giella.nb-se.clean.se.gz
-rw-r--r-- 1 tiedeman 608K Sep  9 12:34 work/data/simple/giella.nb-sma.clean.nb.gz
-rw-r--r-- 1 tiedeman 633K Sep  9 12:34 work/data/simple/giella.nb-sma.clean.sma.gz
-rw-r--r-- 1 tiedeman 238K Sep  9 12:34 work/data/simple/giella.nb-smj.clean.nb.gz
-rw-r--r-- 1 tiedeman 240K Sep  9 12:34 work/data/simple/giella.nb-smj.clean.smj.gz
-rw-r--r-- 1 tiedeman 644K Sep  9 12:34 work/data/simple/giella.se-sma.clean.se.gz
-rw-r--r-- 1 tiedeman 633K Sep  9 12:34 work/data/simple/giella.se-sma.clean.sma.gz
-rw-r--r-- 1 tiedeman 410K Sep  9 12:34 work/data/simple/giella.se-smn.clean.se.gz
-rw-r--r-- 1 tiedeman 424K Sep  9 12:34 work/data/simple/giella.se-smn.clean.smn.gz

ls -al work/data/simple/glossary*
-rw-r--r-- 1 tiedeman  33K Sep  9 12:34 work/data/simple/glossary.fi-se.clean.fi.gz
-rw-r--r-- 1 tiedeman  37K Sep  9 12:34 work/data/simple/glossary.fi-se.clean.se.gz
-rw-r--r-- 1 tiedeman  75K Sep  9 12:34 work/data/simple/glossary.fi-smn.clean.fi.gz
-rw-r--r-- 1 tiedeman 120K Sep  9 12:34 work/data/simple/glossary.fi-smn.clean.smn.gz
-rw-r--r-- 1 tiedeman 122K Sep  9 12:34 work/data/simple/glossary.fi-sms.clean.fi.gz
-rw-r--r-- 1 tiedeman 171K Sep  9 12:34 work/data/simple/glossary.fi-sms.clean.sms.gz
-rw-r--r-- 1 tiedeman 150K Sep  9 12:34 work/data/simple/glossary.nb-se.clean.nb.gz
-rw-r--r-- 1 tiedeman 282K Sep  9 12:34 work/data/simple/glossary.nb-se.clean.se.gz
-rw-r--r-- 1 tiedeman  65K Sep  9 12:34 work/data/simple/glossary.nb-sma.clean.nb.gz
-rw-r--r-- 1 tiedeman  82K Sep  9 12:34 work/data/simple/glossary.nb-sma.clean.sma.gz
-rw-r--r-- 1 tiedeman  43K Sep  9 12:34 work/data/simple/glossary.nb-smj.clean.nb.gz
-rw-r--r-- 1 tiedeman  52K Sep  9 12:34 work/data/simple/glossary.nb-smj.clean.smj.gz
-rw-r--r-- 1 tiedeman  21K Sep  9 12:34 work/data/simple/glossary.se-sma.clean.se.gz
-rw-r--r-- 1 tiedeman  23K Sep  9 12:34 work/data/simple/glossary.se-sma.clean.sma.gz
-rw-r--r-- 1 tiedeman  38K Sep  9 12:34 work/data/simple/glossary.se-smj.clean.se.gz
-rw-r--r-- 1 tiedeman  38K Sep  9 12:34 work/data/simple/glossary.se-smj.clean.smj.gz
-rw-r--r-- 1 tiedeman  23K Sep  9 12:34 work/data/simple/glossary.se-smn.clean.se.gz
-rw-r--r-- 1 tiedeman  23K Sep  9 12:34 work/data/simple/glossary.se-smn.clean.smn.gz

Training bilingual models

The first step is to prepare the data using the additional data sets from above. In this example, we translate from Northern Sami to Finnish, take data from the Giellatekno resources as validation and test sets, and add the glossary as extra training data.

make SRCLANGS=se TRGLANGS=fi DATASET=opus+giella DEVSET=giella TESTSET=giella EXTRA_TRAINSET=glossary BPESIZE=4000 config
make SRCLANGS=se TRGLANGS=fi DATASET=opus+giella DEVSET=giella TESTSET=giella EXTRA_TRAINSET=glossary data

Note that we reduce the vocabulary size to 4,000 per language to accommodate the low-resource scenario, and that we change the name of the data set to mark the extra data that we use (the name will also appear in the model name). Note also that the giella data set will be the main data set, and that a large portion of it will be used as training data even though we specify it as validation and test data. This is because distinct dev/test sets are extracted automatically, which makes it possible to add all remaining data points to the training data.

Training and evaluating this model works in the same way as for other models; just keep the extra parameters:

make SRCLANGS=se TRGLANGS=fi DATASET=opus+giella DEVSET=giella TESTSET=giella EXTRA_TRAINSET=glossary train
make SRCLANGS=se TRGLANGS=fi DATASET=opus+giella DEVSET=giella TESTSET=giella EXTRA_TRAINSET=glossary eval
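
Once trained and evaluated, a release package can be created with the generic dist recipe; a sketch, assuming the same parameter set carries over:

make SRCLANGS=se TRGLANGS=fi DATASET=opus+giella DEVSET=giella TESTSET=giella EXTRA_TRAINSET=glossary dist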

Training multilingual models

Multilingual models can be trained by adding languages, for example including five Sami languages on the source side:

make SRCLANGS="se sma smj smn sms" TRGLANGS=fi DATASET=opus+giella DEVSET=giella TESTSET=giella EXTRA_TRAINSET=glossary BPESIZE=4000 config
make SRCLANGS="se sma smj smn sms" TRGLANGS=fi DATASET=opus+giella DEVSET=giella TESTSET=giella EXTRA_TRAINSET=glossary data
make SRCLANGS="se sma smj smn sms" TRGLANGS=fi DATASET=opus+giella DEVSET=giella TESTSET=giella EXTRA_TRAINSET=glossary train
make SRCLANGS="se sma smj smn sms" TRGLANGS=fi DATASET=opus+giella DEVSET=giella TESTSET=giella EXTRA_TRAINSET=glossary eval

Any set of languages can be used on the source or target side. The system automatically uses all combinations of those languages to train the model. This can have undesired effects, as high-resource language pairs may dominate the setup. For example, when training a multilingual model for Northern Sami, Finnish, and Norwegian (Nynorsk and Bokmål), one may want to exclude translation between Finnish and Norwegian using the SKIP_LANGPAIRS parameter:

make SRCLANGS="se fi nb nn" TRGLANGS="se fi nb nn" DATASET=opus+giella DEVSET=giella TESTSET=giella EXTRA_TRAINSET=glossary BPESIZE=4000 config
make SRCLANGS="se fi nb nn" TRGLANGS="se fi nb nn" SKIP_LANGPAIRS="fi-nb|fi-nn" DATASET=opus+giella DEVSET=giella TESTSET=giella EXTRA_TRAINSET=glossary data
...

There are pre-defined implicit rules that define certain setups of multilingual models including Sami languages:

  • %-sami: recipes for symmetric multilingual models that cover se sma smj smn sms vep et fi kv krl nb no nn ru sv en but exclude all pairs among the high-resource/non-Finno-Ugric languages (nb no nn ru sv en)
  • %-sami-xx: recipes for models that translate from Sami languages (se sma smj smn sms) to Nordic languages, English and Russian (nb no nn ru sv en)
  • %-xx-sami: recipes for models in the opposite direction

So, to create models that translate from Sami languages to the given high-resource languages, one can simply run:

make config-sami-xx
make data-sami-xx
make train-sami-xx
make eval-sami-xx

Back-translation

The first step is to fetch some monolingual data for Sami languages:

make -C backtranslate SRC=se fetch-wiki
make -C backtranslate sami-corp

The second recipe may fail; in that case, try again and check the data in backtranslate/wiki/se and backtranslate/giellatekno.
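
Once the monolingual data is in place, the translation recipes listed in the overview can be run, for example:

make -C backtranslate translate-sami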

Notes

To-do list