
Back-translation

Translate monolingual data (extracted from various Wikimedia sources) to create synthetic training data.
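The idea can be illustrated with a minimal Python sketch. It is not taken from the Makefile: `back_translate` and `toy_translate` are hypothetical names, and `toy_translate` is a stand-in for a real translation model. The sketch only shows how each machine translation is paired with the authentic sentence it came from, so that the pair can serve as a synthetic training example for the reverse translation direction.

```python
from typing import Callable, Iterable, List, Tuple


def back_translate(
    monolingual: Iterable[str],
    translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each machine translation with the authentic sentence it came from.

    The synthetic translation becomes the input side and the original
    sentence the output side of a training example for the reverse direction.
    """
    pairs = []
    for sentence in monolingual:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip empty lines
        pairs.append((translate(sentence), sentence))
    return pairs


def toy_translate(sentence: str) -> str:
    # Hypothetical stand-in for a real MT model: it only tags the text.
    return "<mt> " + sentence


# Illustrative input; real input would be text extracted from wiki dumps.
wiki_lines = ["Hyvää huomenta.", "", "Kiitos paljon."]
synthetic = back_translate(wiki_lines, toy_translate)
for source_side, target_side in synthetic:
    print(source_side, "|||", target_side)
```

In real use the translation step would be done by a trained model rather than a tagging function; the pairing logic stays the same.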

Overview

Relevant makefiles:

  • Makefile

Main recipes:

  • all: translate wiki data for the specified language
  • get-data:
  • extract-text:
  • extract-doc:
  • prepare-model:
  • prepare-data:
  • translate:
  • check-length:
  • print-names:
  • print-modelname:

Recipes for fetching data and pre-processing batch jobs:

  • index.html:
  • all-wikitext:
  • all-wikilangs:
  • all-wikilangs-fast:
  • all-wikis-all-langs:
  • all-wikidocs-all-langs:
  • wiki-iso639: link (shuffled) wikisources to ISO 639-3 compliant language labels
  • wiki-iso639-doc: same as wiki-iso639, but for non-shuffled wikisources with document boundaries

Recipes for translating wiki data:

  • translate-all-parts:
  • translate-all-wikis:
  • translate-all-wikiparts:
  • translate-all-parts-jobs:
  • translate-all-wikis-jobs:
  • translate-all-wikiparts-jobs:

Recipes for Sami languages:

  • sami-corp:
  • translate-sami:
  • translate-sami-corp:
  • translate-sami-wiki:
  • translate-sami-xx-wiki:
  • translate-sami-xx-corp:
  • translate-xx-sami-wiki:

Recipes for Celtic languages:

  • fetch-celtic:
  • translate-celtic-english:
  • translate-english-celtic:
  • breton:

Recipes for Nordic and Uralic languages:

  • finland-focus-wikis:
  • translate-thl:
  • all-nordic-wikidocs:
  • uralic-wiki-texts:
  • uralic-wikis:

Other task-specific recipes:

  • xnli-wikidocs:
  • small-romance:
  • wikimedia-focus-wikis:

Parameters / variables:

  • SRC:
  • TRG:
  • WIKISOURCE:
  • SPLIT_SIZE:
  • MAX_LENGTH:
  • MAX_SENTENCES:
  • PART:
  • MODELSDIR:
  • MULTI_TARGET_MODEL:
  • WIKI_HOME:
  • WIKIDOC_HOME:
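These parameters are ordinary make variables, so they can presumably be overridden on the command line in the usual `VARIABLE=value` form. The sketch below is a hypothetical helper (`make_command` is not part of this repository) that assembles such a command line. The language codes `fi` and `en` are assumed example values, not defaults taken from the Makefile.

```python
import shlex


def make_command(target: str, **variables: str) -> str:
    """Build a shell-safe `make` command line with variable overrides."""
    # Sort the overrides so the generated command is deterministic.
    overrides = [f"{name}={value}" for name, value in sorted(variables.items())]
    return shlex.join(["make", *overrides, target])


# Assumed example values: translate wiki data from fi into en.
command = make_command("all", SRC="fi", TRG="en")
print(command)
```

The printed command can be run from this directory. Other variables from the list above can be passed the same way as keyword arguments.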

Detailed information

Use Wiki data:

NOTE: this only works for SentencePiece models

TODO

  • download base models from ObjectStorage
  • DONE? make it work with multilingual models (the preprocessing scripts need to be adjusted for those models)