add option to skip sentence piece vocabs but use marian_vocab instead

Jörg Tiedemann 2020-09-16 19:33:19 +03:00
parent 913d31472e
commit a61cf48443
5 changed files with 17 additions and 25 deletions

TODO.md

@@ -2,10 +2,9 @@
# Things to do
## Bugs
## Issues
* something is wrong with multi-threaded data preparation
* balancing data for multilingual models does not work well when one language pair is tiny
* get rid of BPE to simplify the scripts
## General settings
@@ -26,7 +25,7 @@
## Fine-tuning and domain adaptation
* status: basically working
* do we want to publishfine-tuned data or rather the fina-tuning procedures? (using a docker container?)
* do we want to publish fine-tuned data or rather the fine-tuning procedures? (using a docker container?)
## Show-case some selected language pairs
@@ -35,12 +34,3 @@
* focus languages: Tagalog (tl, tgl), Central Bikol (bcl), Malayalam (ml, mal), Bengali (bn, ben), and Mongolian (mn, mon)
## Tatoeba MT models
Labels are only taken from the test data, but this can be a problem if there are relevant data sets that will be missed
* example: nor (there is only nno and nob in the test data, but most of the data for Norwegian is only tagged as nor_Latn)
* another example: hbs (hbs labels do not exist in the test data)
* possible solution: take all labels from the train data; problem: some noisy labels may influence the model a lot and it would be better to leave them out (wrong-script data etc.); another issue: over-sampling data sets that only exist in the train data may damage the model


@@ -91,8 +91,9 @@ The variables can be set to override defaults. See below to understand how the d
Data sets, vocabulary, alignments and segmentation models will be stored in the work directory of the model (`work/LANGPAIRSTR/`). Here is an example for the language pair br-en:
```
# MarianNMT vocabulary file:
work/br-en/opus.spm4k-spm4k.vocab.yml
# MarianNMT vocabulary files:
work/br-en/opus.spm4k-spm4k.src.vocab
work/br-en/opus.spm4k-spm4k.trg.vocab
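# (each *.vocab file lists the subword tokens of the corresponding
#  sentence-piece model, one token per line)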
# test data:
work/br-en/test/README.md
@@ -165,7 +166,7 @@ Currently, the makefile looks at the local copy of released OPUS data to find av
Most settings can be adjusted by setting corresponding variables to new values. Common changes are:
* run word-alignment and train with guided alignment: set `MODELTYPE=transformer-align`
* use sentence piece model internally to define vocabularies: set `MODELTYPE=transformer-spm`
* generate the vocabulary file from the training data instead of taking it from the sentence piece models: set `USE_SPM_VOCAB=0` (see the example below)
* change the vocabulary size: set `BPESIZE=<yourvalue>`, for example `BPESIZE=4000` (this is also used for sentence-piece models)
* vocabulary sizes can also be set for source and target language independently (`SRCBPESIZE` and `TRGBPESIZE`)
* use BPE instead of sentence-piece (not recommended): set `SUBWORDS=bpe`
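
Taken together, a customized run could be launched like this (a hypothetical invocation for illustration only; it combines the variables above with the `train` target used elsewhere in these docs):
```
make SRCLANGS=br TRGLANGS=en \
     MODELTYPE=transformer-align \
     USE_SPM_VOCAB=0 \
     BPESIZE=4000 \
     train
```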


@@ -20,7 +20,7 @@ make SRCLANGS=en TRGLANGS=br config
make SRCLANGS=en TRGLANGS=br data
```
This will also download the necessary files if they don't exist on the local file system. It will train sentence piece models for each language separately and apply the model to all data sets. Finally, it also creates the vocabulary file from the segmented training data.
This will also download the necessary files if they don't exist on the local file system. It will train sentence piece models for each language separately and apply the model to all data sets. Finally, it also creates the vocabulary files from the sentence-piece models.
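
The extraction step itself is a simple projection (a sketch with illustrative file names; the actual rule appears in the last file of this commit): the sentence-piece trainer writes a tab-separated `token<TAB>score` list, and only the token column is kept:
```
cut -f1 < source-language.spm.vocab > opus.spm4k-spm4k.src.vocab
```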
## Train the model
@@ -122,7 +122,8 @@ backtranslate/br-en/opus-2020-09-04/source.spm
backtranslate/br-en/opus-2020-09-04/target.spm
backtranslate/br-en/opus-2020-09-04/preprocess.sh
backtranslate/br-en/opus-2020-09-04/postprocess.sh
backtranslate/br-en/opus-2020-09-04/opus.spm4k-spm4k.vocab.yml
backtranslate/br-en/opus-2020-09-04/opus.spm4k-spm4k.src.vocab
backtranslate/br-en/opus-2020-09-04/opus.spm4k-spm4k.trg.vocab
backtranslate/br-en/opus-2020-09-04/opus.spm4k-spm4k.transformer.model1.npz.best-perplexity.npz
```


@@ -348,15 +348,15 @@ MODEL_DECODER = ${MODEL_FINAL}.decoder.yml
## backwards compatibility: if there is already a vocab-file then use it
ifeq (${SUBWORDS},spm)
ifneq ($(wildcard ${WORKDIR}/${MODEL}.vocab.yml),)
MODEL_VOCAB = ${WORKDIR}/${MODEL}.vocab.yml
MODEL_SRCVOCAB = ${MODEL_VOCAB}
MODEL_TRGVOCAB = ${MODEL_VOCAB}
else
ifeq ($(wildcard ${WORKDIR}/${MODEL}.vocab.yml),)
USE_SPM_VOCAB ?= 1
endif
endif
ifeq ($(USE_SPM_VOCAB),1)
MODEL_VOCAB = ${WORKDIR}/${MODEL}.vocab
MODEL_SRCVOCAB = ${WORKDIR}/${MODEL}.src.vocab
MODEL_TRGVOCAB = ${WORKDIR}/${MODEL}.trg.vocab
endif
else
MODEL_VOCAB = ${WORKDIR}/${MODEL}.vocab.yml
MODEL_SRCVOCAB = ${MODEL_VOCAB}
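
In plain terms, the new selection logic reads as follows (a paraphrase of the fragment above, not part of the commit):
```
# a legacy ${MODEL}.vocab.yml exists  -> keep using it for both source and
#                                        target (backwards compatibility)
# no legacy vocab file                -> USE_SPM_VOCAB defaults to 1 and the
#                                        separate ${MODEL}.src.vocab and
#                                        ${MODEL}.trg.vocab files are taken
#                                        from the sentence-piece models
# USE_SPM_VOCAB=0 set explicitly      -> the vocabulary is built from the
#                                        training data with marian_vocab instead
```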


@@ -38,7 +38,7 @@ endif
## get vocabulary from sentence piece model
ifeq (${SUBWORDS},spm)
ifeq ($(USE_SPM_VOCAB),1)
${MODEL_SRCVOCAB}: ${SPMSRCMODEL}
	cut -f1 < $<.vocab > $@
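
(Here `$<.vocab` is the vocabulary written next to the sentence-piece model, a tab-separated list of `token<TAB>score` pairs; `cut -f1` keeps only the token column, i.e. the plain one-token-per-line format of the `.src.vocab` and `.trg.vocab` files listed above.)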