OPUS-MT-train/doc/models/Sami.md

72 lines
3.5 KiB
Markdown
Raw Normal View History

# Models for Sami languages
Recipes for training multilingual language models involving Sami languages.
## Overview
Relevant makefiles:
* [Makefile](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/Makefile)
* [lib/models/sami.mk](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/lib/models/sami.mk)
* [backtranslate/Makefile](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/backtranslate/Makefile)
Main recipes:
* `sami-data`: fetch and convert external data sets and prepare the train/dev/test splits
* `sami-train`: train a multilingual model for Sami languages
* `sami-eval`: evaluate the multilingual model above
* `sami-dist`: create a release package for the multilingual model
See also implict rules for additional common recipes.
Recipes for back-translation (go to sub-directory `backtranslate`):
* `sami-corp`: download monolingual Sami corpora from Giellatekno
* `translate-sami-corp`: translate monolingual Sami corpora with a multilingual model (hardcoded in [backtranslate/Makefile](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/backtranslate/Makefile))
* `translate-sami-wiki`: translate Northern Sami wiki data to all kinds of languages using a multilingual model (hardcoded in [backtranslate/Makefile](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/backtranslate/Makefile)) and translate wiki data for a selected number of languages (currently: no, nn ru, sv, en) to Sami languages using the same multilingual model
* `translate-sami`: do both of the things above (`translate-sami-corp` and `translate-sami-wiki`)
* `translate-sami-xx-wiki`: translate wiki data from Northern Sami to other Sami languages, Finnish, Norwegian, and Swedish using a "sami-xx" model
* `translate-xx-sami-wiki`: translate wiki data from Finnish, Norwegian, and Swedish to Sami languages using a "xx-sami" model
* `translate-sami-xx-corp`: translate monolingual corpora from Giellatekno from Sami languages to Finnish, Norwegian, and Swedish using a "sami-xx" model
Recipes for pivot-based translation (go to sub-directory `pivoting`), set `SRC` (e.g. fi), `TRG` (e.g. se) and `PIVOT` (e.g. nb)
* `all`: fetch model, prepare data sets and translate
* `prepare`: prepare the data sets (pivot bitexts))
* `translate`: translate the pivot bitexts
Data-related recipes:
* `fetch-sami-tmx`: fetch translation memories from Giellatekno
* `convert-sami-tmx`: convert the TMX files from above
* `merge-sami-data`: merge the converted TMX files into one bitext
* `convert-sami-gloss`: convert bilingual glossaries from Giellatekno
Parameters / variables:
* `GIELLATEKNO_HOME`: URL for Giellatekno resources (default: https://victorio.uit.no/biggies/trunk)
* `GIELLATEKNO_TM_HOME`: directory of translation memories (default: ${GIELLATEKNO_HOME}/mt/omegat)
* `GIELLATEKNO_SAMI_TM`: list of translation memories to be downloaded (see [lib/models/sami.mk](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/lib/models/sami.mk))
Implicit rules:
* `%-sami`: run recipe for a multilingual model including Sami languages (e.g. `make train-sami`)
* `%-sami-xx`: run recipe for a model that translates Sami languages to selected other languages
* `%-xx-sami`: run recipe for a model that translates selected languages to Sami languages
* `%-bt`: include back-translated data (can be combined with the implicit rules above, e.g. `make train-bt-sami`)
* `%-pivot`: include pivot-based translations (can be combined with the implicit rules above, e.g. `make train-pivot-sami`)
## Notes
## To-do list