Merge pull request #64 from rrrepsac/patch-1

Patch 1
tiedemann 2023-02-05 22:13:25 +02:00 committed by GitHub
commit d1ffe9106c


@@ -73,7 +73,7 @@ Note that the data splits are done on-the-fly from shuffled data sets and this m
## Generate back-translations
-Back-translation requires a moel in the opposite direction. First thing to do is to reverse the data. This can be done without generating them from scratch:
+Back-translation requires a model in the opposite direction. The first thing to do is to reverse the data. This can be done without generating it from scratch:
```
make SRCLANGS=en TRGLANGS=br reverse-data
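# Conceptual sketch only (hypothetical file names, not the actual Makefile recipe):
# "reversing" a bitext just swaps the source and target sides, so the same aligned
# sentences can be used to train the opposite br-en model needed for back-translation.
paste corpus.en-br.en corpus.en-br.br | awk -F'\t' '{print $2 "\t" $1}' > corpus.br-en.tsv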
@@ -141,7 +141,7 @@ backtranslate/br-en/latest/wiki.aa.br-en.en.gz
Another way of augmenting the training data is to translate one side of existing bitexts to create more data for the language pair we are interested in. For example, for English-Breton translation we can translate the French side of French-Breton bitexts to English using an existing French-English translation model. The latter is a high-resource language pair and decent translations can be expected. All of this is supported by the recipes in `pivoting`.
-Forst of all, you can check what kind of bitexts are available for a pivot language like French:
+First of all, you can check what kind of bitexts are available for a pivot language like French:
```
make -C pivoting SRC=en TRG=br PIVOT=fr print-all-data
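# Conceptual sketch only, not the pivoting recipes themselves: translate the French
# side of a French-Breton bitext with an existing fr-en model and pair the output
# with the untouched Breton side to obtain synthetic en-br training data.
# Model and file paths are hypothetical; pre-/post-processing (e.g. subword
# segmentation) is omitted.
marian-decoder -m fr-en/model.npz -v fr-en/vocab.yml fr-en/vocab.yml < bitext.fr-br.fr > bitext.fr-br.en
paste bitext.fr-br.en bitext.fr-br.br > synthetic.en-br.tsv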
@@ -303,7 +303,7 @@ For the other direction, the additional back-translation loop does not seem to w
## Multilingual models
Another common approach to improving low-resource translation is to rely on transfer learning and multilingual models.
-The basic steps are the same, only some variables need to be adjusted. Most importantly, you need to set several source and target languages to be covered by the model. All combinations of thos languages will be considered. Furthermore, it might be useful to activate over and under-sampling of data to have more equal proportions of data for each language pair. This is done by setting a value to `FIT_DATA_SIZE` (number of training examples, i.e. aligned sentence pairs). Here would be the example for training a mode between English and French to a number of celtic languages (including Breton):
+The basic steps are the same, only some variables need to be adjusted. Most importantly, you need to set several source and target languages to be covered by the model. All combinations of those languages will be considered. Furthermore, it might be useful to activate over- and under-sampling of the data to have more equal proportions of data for each language pair. This is done by setting a value for `FIT_DATA_SIZE` (the number of training examples, i.e. aligned sentence pairs). Here is an example for training a model between English and French and a number of Celtic languages (including Breton):
```
make SRCLANGS="en fr" TRGLANGS="ga cy br gd kw gv" FIT_DATA_SIZE=500000 config