mirror of
https://github.com/Helsinki-NLP/OPUS-MT-train.git
synced 2024-10-27 05:30:16 +03:00
72 lines
3.5 KiB
Markdown
72 lines
3.5 KiB
Markdown
|
# Models for Sami languages
|
||
|
|
||
|
Recipes for training multilingual language models involving Sami languages.
|
||
|
|
||
|
|
||
|
## Overview
|
||
|
|
||
|
Relevant makefiles:
|
||
|
|
||
|
* [Makefile](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/Makefile)
|
||
|
* [lib/models/sami.mk](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/lib/models/sami.mk)
|
||
|
* [backtranslate/Makefile](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/backtranslate/Makefile)
|
||
|
|
||
|
|
||
|
Main recipes:
|
||
|
|
||
|
* `sami-data`: fetch and convert external data sets and prepare the train/dev/test splits
|
||
|
* `sami-train`: train a multilingual model for Sami languages
|
||
|
* `sami-eval`: evaluate the multilingual model above
|
||
|
* `sami-dist`: create a release package for the multilingual model
|
||
|
|
||
|
See also implict rules for additional common recipes.
|
||
|
|
||
|
Recipes for back-translation (go to sub-directory `backtranslate`):
|
||
|
|
||
|
* `sami-corp`: download monolingual Sami corpora from Giellatekno
|
||
|
* `translate-sami-corp`: translate monolingual Sami corpora with a multilingual model (hardcoded in [backtranslate/Makefile](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/backtranslate/Makefile))
|
||
|
* `translate-sami-wiki`: translate Northern Sami wiki data to all kinds of languages using a multilingual model (hardcoded in [backtranslate/Makefile](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/backtranslate/Makefile)) and translate wiki data for a selected number of languages (currently: no, nn ru, sv, en) to Sami languages using the same multilingual model
|
||
|
* `translate-sami`: do both of the things above (`translate-sami-corp` and `translate-sami-wiki`)
|
||
|
* `translate-sami-xx-wiki`: translate wiki data from Northern Sami to other Sami languages, Finnish, Norwegian, and Swedish using a "sami-xx" model
|
||
|
* `translate-xx-sami-wiki`: translate wiki data from Finnish, Norwegian, and Swedish to Sami languages using a "xx-sami" model
|
||
|
* `translate-sami-xx-corp`: translate monolingual corpora from Giellatekno from Sami languages to Finnish, Norwegian, and Swedish using a "sami-xx" model
|
||
|
|
||
|
|
||
|
Recipes for pivot-based translation (go to sub-directory `pivoting`), set `SRC` (e.g. fi), `TRG` (e.g. se) and `PIVOT` (e.g. nb)
|
||
|
|
||
|
* `all`: fetch model, prepare data sets and translate
|
||
|
* `prepare`: prepare the data sets (pivot bitexts))
|
||
|
* `translate`: translate the pivot bitexts
|
||
|
|
||
|
|
||
|
|
||
|
Data-related recipes:
|
||
|
|
||
|
* `fetch-sami-tmx`: fetch translation memories from Giellatekno
|
||
|
* `convert-sami-tmx`: convert the TMX files from above
|
||
|
* `merge-sami-data`: merge the converted TMX files into one bitext
|
||
|
* `convert-sami-gloss`: convert bilingual glossaries from Giellatekno
|
||
|
|
||
|
|
||
|
Parameters / variables:
|
||
|
|
||
|
* `GIELLATEKNO_HOME`: URL for Giellatekno resources (default: https://victorio.uit.no/biggies/trunk)
|
||
|
* `GIELLATEKNO_TM_HOME`: directory of translation memories (default: ${GIELLATEKNO_HOME}/mt/omegat)
|
||
|
* `GIELLATEKNO_SAMI_TM`: list of translation memories to be downloaded (see [lib/models/sami.mk](https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/lib/models/sami.mk))
|
||
|
|
||
|
|
||
|
Implicit rules:
|
||
|
|
||
|
* `%-sami`: run recipe for a multilingual model including Sami languages (e.g. `make train-sami`)
|
||
|
* `%-sami-xx`: run recipe for a model that translates Sami languages to selected other languages
|
||
|
* `%-xx-sami`: run recipe for a model that translates selected languages to Sami languages
|
||
|
* `%-bt`: include back-translated data (can be combined with the implicit rules above, e.g. `make train-bt-sami`)
|
||
|
* `%-pivot`: include pivot-based translations (can be combined with the implicit rules above, e.g. `make train-pivot-sami`)
|
||
|
|
||
|
|
||
|
|
||
|
## Notes
|
||
|
|
||
|
## To-do list
|
||
|
|