mirror of
https://github.com/Helsinki-NLP/OPUS-MT-train.git
synced 2024-09-11 20:27:19 +03:00
# Integration

https://github.com/UKPLab/EasyNMT

# Data cleanup

Need better data filtering:

* integrate OpusFilter
* Tatoeba MT Challenge data sets are noisy for smaller languages such as Breton, and similarity scores are not available for filtering them; CCMatrix etc. is not very good for those languages either
* run the cleanup script also before subword splitting?
* stronger filters in cleanup script?
* idea: compare character diversity between the two languages and use a threshold to filter sentences? (language-specific?)

links and tools:

* https://github.com/ZJaume/clean
* https://github.com/Helsinki-NLP/OPUS-MT-distillation

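The character-diversity idea from the list above could be sketched roughly as follows. This is only one possible interpretation, not an existing part of the cleanup script: "diversity" is assumed here to mean the share of distinct non-whitespace characters in a sentence, and a pair is dropped when the two sides differ by more than a factor `max_ratio`. The function names and the default threshold are hypothetical and would likely need language-specific tuning.

```python
def char_diversity(text: str) -> float:
    """Distinct non-whitespace characters divided by all non-whitespace characters."""
    chars = [c for c in text.lower() if not c.isspace()]
    return len(set(chars)) / len(chars) if chars else 0.0


def keep_pair(src: str, trg: str, max_ratio: float = 2.0) -> bool:
    """Keep a sentence pair only if both sides have comparable character diversity.

    max_ratio is an assumed, untuned threshold (hypothetical default).
    """
    d_src, d_trg = char_diversity(src), char_diversity(trg)
    if d_src == 0.0 or d_trg == 0.0:
        return False  # drop pairs where one side is empty
    return max(d_src, d_trg) / min(d_src, d_trg) <= max_ratio


def filter_pairs(pairs, max_ratio: float = 2.0):
    """Return only the (source, target) pairs that pass the diversity check."""
    return [(s, t) for s, t in pairs if keep_pair(s, t, max_ratio)]


if __name__ == "__main__":
    examples = [
        ("Good morning to everyone", "Demat d'an holl"),
        ("Hello world", "a" * 20),  # degenerate target side
        ("", "x"),                  # empty source side
    ]
    for src, trg in filter_pairs(examples):
        print(f"{src}\t{trg}")
```

Comparing the two sides of the same pair (rather than using an absolute cutoff) partly compensates for the fact that this diversity measure shrinks as sentences get longer.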
# More efficient parallelisation

* https://www.gnu.org/software/parallel/
* https://www.gnu.org/software/parallel/parallel_tutorial.html
* https://www.gnu.org/software/bash/manual/html_node/GNU-Parallel.html
* multinode training with MarianNMT: https://github.com/marian-nmt/marian/issues/244

from Bergamot:
https://github.com/browsermt/students/blob/master/train-student/alignment/generate-alignment-and-shortlist.sh

```
# Subword segmentation with SentencePiece.
test -s $DIR/corpus.spm.$SRC || cat $CORPUS_SRC | pigz -dc | parallel --no-notice --pipe -k -j16 --block 50M "$MARIAN/spm_encode --model $VOCAB" > $DIR/corpus.spm.$SRC
test -s $DIR/corpus.spm.$TRG || cat $CORPUS_TRG | pigz -dc | parallel --no-notice --pipe -k -j16 --block 50M "$MARIAN/spm_encode --model $VOCAB" > $DIR/corpus.spm.$TRG
```

# Benchmarking

* SOTA-bench forum: https://forum.sotabench.com/

# OPUS-MT at Hugging Face

* more efficient GPU usage: https://github.com/kb-labb/kblabb-examples/tree/master/text/machine_translation_gpu

# Related projects

* https://browser.mt (Bergamot project)
* https://nteu.eu
* https://gourmet-project.eu
* https://elitr.eu
* https://www.european-language-grid.eu

Multilingual data:

* http://lr-coordination.eu (ELRC)
* https://www.pret-a-llod.eu
* https://www.taus.net

further resources (from http://techiaith.cymru/translation/demo/?lang=en, contact: Dewi Jones, d.b.jones@bangor.ac.uk):

* http://techiaith.cymru/corpws/Moses/CofnodYCynulliad/CofnodYCynulliad.tar.gz
* http://techiaith.cymru/corpws/Moses/Deddfwriaeth/Deddfwriaeth.tar.gz
* http://techiaith.cymru/corpws/Moses/Meddalwedd/Meddalwedd.tar.gz
* http://techiaith.cymru/alinio/rhestr_geiriau.tsv
* http://techiaith.cymru/alinio/hunalign/cy-en.dic

(see work/data/cy-en)

# Celtic languages

LANGS = "ga cy br gd kw gv"

```
ga   gle                    Irish             yes (ga)
cy   wel/cym                Welsh             yes (cy)
br   bre      bre/xbm/obt   Breton            yes (br)
gd   gla                    Scottish Gaelic   yes (gd)
kw   cor      cor/cnx/oco   Cornish           yes (kw)
gv   glv                    Manx              yes (gv)
```

# Romance

LANGS = "fr wa frp oc ca rm lld fur lij lmo es pt gl lad an mwl it co nap scn vec sc ro la"
LANGS_FR = "fr_BE fr_CA fr_FR"
LANGS_ES = "es_AR es_CL es_CO es_CR es_DO es_EC es_ES es_GT es_HN es_MX es_NI es_PA es_PE es_PR es_SV es_UY es_VE"
LANGS_PT = "pt_br pt_BR pt_PT"
LANGS_IT = "it_IT"

## Gallo-Romance

```
fr                            yes (regional variants: BE CA FR)
wa   wln   Walloon            yes
     pcd   Picard             no
     nrf   Norman             no
     frp   Franco-Provençal   yes
oc   oci   Occitano-Romance   yes

ca   cat   Catalan            yes (ca/cat)
rm   roh   Romansh            yes (rm)
     lld   Ladin              yes
     fur   Friulan            yes
     lij   Ligurian           yes
     lmo   Lombard            yes (very little / noisy wikimedia)
```

## Iberian-Romance

```
es                     yes (regional variants: AR CL CO CR DO EC ES GT HN MX NI PA PE PR SV UY VE)
pt                     yes (variants: br BR PT)
gl                     yes
     lad   Ladino      yes
an   arg   Aragonese   yes (an)
     mxi   Mozarabic   no
     mwl   Mirandese   yes (very little / noisy wikimedia)
```

## Italo-Dalmatian

```
it                         yes (regional variants: IT)
co   cos   Corsican        yes
     nap   Neapolitan      yes (very little / noisy wikimedia)
     scn   Sicilian        yes (very little / noisy wikimedia)
     dlm   Dalmatian       no
     vec   Venetian        yes
     itk   Judeo-Italian   no
```

## Sardinian

```
sc   srd   Sardinian   yes
```

## Eastern Romance

```
ro                     yes
```

## Early forms

```
la                        yes
           Vulgar Latin
```