add option to skip sentence piece vocabs but use marian_vocab instead

Jörg Tiedemann 2020-09-16 19:33:19 +03:00
parent 913d31472e
commit a61cf48443
5 changed files with 17 additions and 25 deletions

TODO.md

@@ -2,10 +2,9 @@
# Things to do
## Bugs
## Issues
* something is wrong with multi-threaded data preparation
* balancing data for multilingual models does not work well when one language pair is tiny
* get rid of BPE to simplify the scripts
## General settings
@@ -26,7 +25,7 @@
## Fine-tuning and domain adaptation
* status: basically working
* do we want to publishfine-tuned data or rather the fina-tuning procedures? (using a docker container?)
* do we want to publish fine-tuned data or rather the fine-tuning procedures? (using a docker container?)
## Show-case some selected language pairs
@@ -35,12 +34,3 @@
* focus languages: Tagalog (tl, tgl), Central Bikol (bcl), Malayalam (ml, mal), Bengali (bn, ben), and Mongolian (mn, mon)
## Tatoeba MT models
Labels are only taken from the test data, but this can be a problem if there are relevant data sets that will be missed
* example: nor (there is only nno and nob in the test data, but most of the data for Norwegian is only tagged as nor_Latn)
* another example: hbs (hbs labels do not exist in the test data)
* possible solution: take all labels from the train data; problem: some noisy labels may influence the model a lot and it would be better to leave them out (wrong-script data etc.); another issue: over-sampling data sets that only exist in the train data may damage the model


@@ -91,8 +91,9 @@ The variables can be set to override defaults. See below to understand how the d
Data sets, vocabulary, alignments and segmentation models will be stored in the work directory of the model (`work/LANGPAIRSTR/`). Here is an example for the language pair br-en:
```
# MarianNMT vocabulary file:
work/br-en/opus.spm4k-spm4k.vocab.yml
# MarianNMT vocabulary files:
work/br-en/opus.spm4k-spm4k.src.vocab
work/br-en/opus.spm4k-spm4k.trg.vocab
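# (each *.vocab file lists the subword tokens of the corresponding
#  sentence-piece model, one token per line)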
# test data:
work/br-en/test/README.md
@@ -165,7 +166,7 @@ Currently, the makefile looks at the local copy of released OPUS data to find av
Most settings can be adjusted by setting corresponding variables to new values. Common changes are:
* run word-alignment and train with guided alignment: set `MODELTYPE=transformer-align`
* use sentence piece model internally to define vocabularies: set `MODELTYPE=transformer-spm`
* generate the vocabulary file from the training data instead of taking it from the sentence piece models: set `USE_SPM_VOCAB=0` (see the example below)
* change the vocabulary size: set `BPESIZE=<yourvalue>`, for example `BPESIZE=4000` (this is also used for sentence-piece models)
* vocabulary sizes can also be set for source and target language independently (`SRCBPESIZE` and `TRGBPESIZE`)
* use BPE instead of sentence-piece (not recommended): set `SUBWORDS=bpe`
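
Taken together, a customized run could be launched like this (a hypothetical invocation for illustration only; it combines the variables above with the `train` target used elsewhere in these docs):
```
make SRCLANGS=br TRGLANGS=en \
     MODELTYPE=transformer-align \
     USE_SPM_VOCAB=0 \
     BPESIZE=4000 \
     train
```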


@@ -20,7 +20,7 @@ make SRCLANGS=en TRGLANGS=br config
make SRCLANGS=en TRGLANGS=br data
```
This will also download the necessary files if they don't exist on the local file system. It will train sentence piece models for each language separately and apply the model to all data sets. Finally, it also creates the vocabulary file from the segmented training data.
This will also download the necessary files if they don't exist on the local file system. It will train sentence piece models for each language separately and apply the model to all data sets. Finally, it also creates the vocabulary files from the sentence-piece models.
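
The extraction step itself is a simple projection (a sketch with illustrative file names; the actual rule appears in the last file of this commit): the sentence-piece trainer writes a tab-separated `token<TAB>score` list, and only the token column is kept:
```
cut -f1 < source-language.spm.vocab > opus.spm4k-spm4k.src.vocab
```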
## Train the model
@@ -122,7 +122,8 @@ backtranslate/br-en/opus-2020-09-04/source.spm
backtranslate/br-en/opus-2020-09-04/target.spm
backtranslate/br-en/opus-2020-09-04/preprocess.sh
backtranslate/br-en/opus-2020-09-04/postprocess.sh
backtranslate/br-en/opus-2020-09-04/opus.spm4k-spm4k.vocab.yml
backtranslate/br-en/opus-2020-09-04/opus.spm4k-spm4k.src.vocab
backtranslate/br-en/opus-2020-09-04/opus.spm4k-spm4k.trg.vocab
backtranslate/br-en/opus-2020-09-04/opus.spm4k-spm4k.transformer.model1.npz.best-perplexity.npz
```


@@ -348,15 +348,15 @@ MODEL_DECODER = ${MODEL_FINAL}.decoder.yml
## backwards compatibility: if there is already a vocab-file then use it
ifeq (${SUBWORDS},spm)
ifneq ($(wildcard ${WORKDIR}/${MODEL}.vocab.yml),)
MODEL_VOCAB = ${WORKDIR}/${MODEL}.vocab.yml
MODEL_SRCVOCAB = ${MODEL_VOCAB}
MODEL_TRGVOCAB = ${MODEL_VOCAB}
else
ifeq ($(wildcard ${WORKDIR}/${MODEL}.vocab.yml),)
USE_SPM_VOCAB ?= 1
endif
endif
ifeq ($(USE_SPM_VOCAB),1)
MODEL_VOCAB = ${WORKDIR}/${MODEL}.vocab
MODEL_SRCVOCAB = ${WORKDIR}/${MODEL}.src.vocab
MODEL_TRGVOCAB = ${WORKDIR}/${MODEL}.trg.vocab
endif
else
MODEL_VOCAB = ${WORKDIR}/${MODEL}.vocab.yml
MODEL_SRCVOCAB = ${MODEL_VOCAB}
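
In plain terms, the new selection logic reads as follows (a paraphrase of the fragment above, not part of the commit):
```
# a legacy ${MODEL}.vocab.yml exists  -> keep using it for both source and
#                                        target (backwards compatibility)
# no legacy vocab file                -> USE_SPM_VOCAB defaults to 1 and the
#                                        separate ${MODEL}.src.vocab and
#                                        ${MODEL}.trg.vocab files are taken
#                                        from the sentence-piece models
# USE_SPM_VOCAB=0 set explicitly      -> the vocabulary is built from the
#                                        training data with marian_vocab instead
```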


@@ -38,7 +38,7 @@ endif
## get vocabulary from sentence piece model
ifeq (${SUBWORDS},spm)
ifeq ($(USE_SPM_VOCAB),1)
${MODEL_SRCVOCAB}: ${SPMSRCMODEL}
	cut -f1 < $<.vocab > $@
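
(Here `$<.vocab` is the vocabulary written next to the sentence-piece model, a tab-separated list of `token<TAB>score` pairs; `cut -f1` keeps only the token column, i.e. the plain one-token-per-line format of the `.src.vocab` and `.trg.vocab` files listed above.)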