OPUS-MT-train/doc/models/Sami.md
2020-08-28 15:51:37 +03:00

3.5 KiB

Models for Sami languages

Recipes for training multilingual language models involving Sami languages.

Overview

Relevant makefiles:

Main recipes:

  • sami-data: fetch and convert external data sets and prepare the train/dev/test splits
  • sami-train: train a multilingual model for Sami languages
  • sami-eval: evaluate the multilingual model above
  • sami-dist: create a release package for the multilingual model

See also implict rules for additional common recipes.

Recipes for back-translation (go to sub-directory backtranslate):

  • sami-corp: download monolingual Sami corpora from Giellatekno
  • translate-sami-corp: translate monolingual Sami corpora with a multilingual model (hardcoded in backtranslate/Makefile)
  • translate-sami-wiki: translate Northern Sami wiki data to all kinds of languages using a multilingual model (hardcoded in backtranslate/Makefile) and translate wiki data for a selected number of languages (currently: no, nn ru, sv, en) to Sami languages using the same multilingual model
  • translate-sami: do both of the things above (translate-sami-corp and translate-sami-wiki)
  • translate-sami-xx-wiki: translate wiki data from Northern Sami to other Sami languages, Finnish, Norwegian, and Swedish using a "sami-xx" model
  • translate-xx-sami-wiki: translate wiki data from Finnish, Norwegian, and Swedish to Sami languages using a "xx-sami" model
  • translate-sami-xx-corp: translate monolingual corpora from Giellatekno from Sami languages to Finnish, Norwegian, and Swedish using a "sami-xx" model

Recipes for pivot-based translation (go to sub-directory pivoting), set SRC (e.g. fi), TRG (e.g. se) and PIVOT (e.g. nb)

  • all: fetch model, prepare data sets and translate
  • prepare: prepare the data sets (pivot bitexts))
  • translate: translate the pivot bitexts

Data-related recipes:

  • fetch-sami-tmx: fetch translation memories from Giellatekno
  • convert-sami-tmx: convert the TMX files from above
  • merge-sami-data: merge the converted TMX files into one bitext
  • convert-sami-gloss: convert bilingual glossaries from Giellatekno

Parameters / variables:

  • GIELLATEKNO_HOME: URL for Giellatekno resources (default: https://victorio.uit.no/biggies/trunk)
  • GIELLATEKNO_TM_HOME: directory of translation memories (default: ${GIELLATEKNO_HOME}/mt/omegat)
  • GIELLATEKNO_SAMI_TM: list of translation memories to be downloaded (see lib/models/sami.mk)

Implicit rules:

  • %-sami: run recipe for a multilingual model including Sami languages (e.g. make train-sami)
  • %-sami-xx: run recipe for a model that translates Sami languages to selected other languages
  • %-xx-sami: run recipe for a model that translates selected languages to Sami languages
  • %-bt: include back-translated data (can be combined with the implicit rules above, e.g. make train-bt-sami)
  • %-pivot: include pivot-based translations (can be combined with the implicit rules above, e.g. make train-pivot-sami)

Notes

To-do list