OPUS-MT-train/NOTES.md


# Integration

https://github.com/UKPLab/EasyNMT


# Data cleanup

Need better data filtering:
* integrate OpusFilter
* Tatoeba MT Challenge data sets are noisy for smaller languages like Breton (and no similarity scores are available for filtering); CC-Matrix etc. is not very good for those languages either
* run the cleanup script also before subword splitting?
* stronger filters in cleanup script?
* idea: compare character diversity between the two languages and use a threshold to filter sentences? (language-specific?)
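
The character-diversity idea above could be prototyped roughly as follows. This is only a sketch under assumptions: "diversity" is taken here to mean distinct characters divided by total characters (ignoring whitespace), and `char_diversity`, `keep_pair` and the `max_ratio` threshold are hypothetical names and values, not part of the existing cleanup script. The measure falls as sentences get longer, so a language-specific or length-aware threshold would probably be needed.

```python
def char_diversity(text: str) -> float:
    """Distinct non-whitespace characters divided by total non-whitespace characters."""
    chars = [c for c in text.lower() if not c.isspace()]
    return len(set(chars)) / len(chars) if chars else 0.0


def keep_pair(src: str, trg: str, max_ratio: float = 2.0) -> bool:
    """Keep a sentence pair only if both sides have comparable character diversity."""
    d_src, d_trg = char_diversity(src), char_diversity(trg)
    if d_src == 0.0 or d_trg == 0.0:
        return False  # one side is empty: always drop
    # Ratio >= 1 of the more diverse side to the less diverse side.
    ratio = max(d_src, d_trg) / min(d_src, d_trg)
    return ratio <= max_ratio  # max_ratio is an assumed, untuned threshold


pairs = [
    ("Demat d'an holl", "Hello everyone"),
    ("aaaaaaaaaaaaaaaaaaaa", "This is a normal sentence."),
    ("", "Missing source side"),
]
kept = [p for p in pairs if keep_pair(*p)]
print(kept)
```

With the toy pairs above, only the first pair passes; the repeated-character line and the pair with an empty source side are dropped.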

links and tools:

* https://github.com/ZJaume/clean
* https://github.com/Helsinki-NLP/OPUS-MT-distillation


# More efficient parallelisation

* https://www.gnu.org/software/parallel/
* https://www.gnu.org/software/parallel/parallel_tutorial.html
* https://www.gnu.org/software/bash/manual/html_node/GNU-Parallel.html
* multinode training with MarianNMT: https://github.com/marian-nmt/marian/issues/244

from Bergamot:
https://github.com/browsermt/students/blob/master/train-student/alignment/generate-alignment-and-shortlist.sh

```
# Subword segmentation with SentencePiece.
test -s $DIR/corpus.spm.$SRC || cat $CORPUS_SRC | pigz -dc | parallel --no-notice --pipe -k -j16 --block 50M "$MARIAN/spm_encode --model $VOCAB" > $DIR/corpus.spm.$SRC
test -s $DIR/corpus.spm.$TRG || cat $CORPUS_TRG | pigz -dc | parallel --no-notice --pipe -k -j16 --block 50M "$MARIAN/spm_encode --model $VOCAB" > $DIR/corpus.spm.$TRG
```
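
The Bergamot commands above split the input stream into blocks, run one filter process per block in parallel, and keep the output in input order (`-k`). The same pattern could be sketched in Python as below. The helper names (`blocks`, `run_filter`, `parallel_pipe`) are hypothetical, and the demo uses a stand-in upper-casing command rather than `spm_encode`.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Iterable, Iterator, List


def blocks(lines: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive blocks of at most `size` lines."""
    it = iter(lines)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def run_filter(cmd: List[str], block: List[str]) -> str:
    """Pipe one block through an external line filter and return its stdout."""
    result = subprocess.run(
        cmd, input="".join(block), capture_output=True, text=True, check=True
    )
    return result.stdout


def parallel_pipe(
    cmd: List[str], lines: Iterable[str], jobs: int = 4, block_size: int = 1000
) -> Iterator[str]:
    """Run `cmd` on blocks of lines in parallel, yielding outputs in input order."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # Executor.map returns results in submission order, like `parallel -k`.
        # Note: it submits all blocks up front, so the whole input is held in memory.
        yield from pool.map(lambda b: run_filter(cmd, b), blocks(lines, block_size))


# Stand-in filter for the demo: upper-case stdin.
upper = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
out = "".join(parallel_pipe(upper, ["a\n", "b\n", "c\n"], jobs=2, block_size=2))
print(out, end="")
```

Threads are sufficient here because the actual work happens in the child processes. For very large corpora a bounded queue of pending blocks would be preferable to submitting everything at once.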

# Benchmarking

* SOTA-bench forum: https://forum.sotabench.com/


# OPUS-MT at Hugging Face

* more efficient GPU usage: https://github.com/kb-labb/kblabb-examples/tree/master/text/machine_translation_gpu
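
One common way to use the GPU more efficiently is to translate in batches of similar length, so that little computation is wasted on padding. The sketch below is not taken from the linked repository; `translate_sorted_batches` and `make_marian_translator` are hypothetical helpers, and the model name and batch size are assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence


def translate_sorted_batches(
    sentences: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 32,
) -> List[str]:
    """Translate in length-sorted batches and return results in the original order."""
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    out = [""] * len(sentences)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        translations = translate_batch([sentences[i] for i in idx])
        for i, t in zip(idx, translations):
            out[i] = t
    return out


def make_marian_translator(model_name: str = "Helsinki-NLP/opus-mt-fr-en"):
    """Build a batch translation function (needs `torch` and `transformers`)."""
    import torch
    from transformers import MarianMTModel, MarianTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).to(device).eval()

    def translate_batch(batch: List[str]) -> List[str]:
        inputs = tokenizer(
            list(batch), return_tensors="pt", padding=True, truncation=True
        ).to(device)
        with torch.no_grad():
            generated = model.generate(**inputs)
        return tokenizer.batch_decode(generated, skip_special_tokens=True)

    return translate_batch
```

Usage would look like `translate_sorted_batches(lines, make_marian_translator(), batch_size=64)`. Sorting by character length is only a cheap proxy for token length.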


# Related projects

* https://browser.mt (Bergamot project)
* https://nteu.eu
* https://gourmet-project.eu
* https://elitr.eu
* https://www.european-language-grid.eu

Multilingual data:

* http://lr-coordination.eu (ELRC)
* https://www.pret-a-llod.eu
* https://www.taus.net


further resources: (from http://techiaith.cymru/translation/demo/?lang=en)
contact: Dewi Jones (d.b.jones@bangor.ac.uk)

* http://techiaith.cymru/corpws/Moses/CofnodYCynulliad/CofnodYCynulliad.tar.gz
* http://techiaith.cymru/corpws/Moses/Deddfwriaeth/Deddfwriaeth.tar.gz
* http://techiaith.cymru/corpws/Moses/Meddalwedd/Meddalwedd.tar.gz
* http://techiaith.cymru/alinio/rhestr_geiriau.tsv
* http://techiaith.cymru/alinio/hunalign/cy-en.dic

(see work/data/cy-en)


# Celtic languages

LANGS = "ga cy br gd kw gv"


```
ga	gle			Irish			yes (ga)
cy	wel/cym			Welsh			yes (cy)
br	bre	bre/xbm/obt	Breton			yes (br)
gd	gla			Scottish Gaelic		yes (gd)
kw	cor	cor/cnx/oco	Cornish	 		yes (kw)
gv	glv			Manx			yes (gv)
```


# Romance

LANGS = "fr wa frp oc ca rm lld fur lij lmo es pt gl lad an mwl it co nap scn vec sc ro la"
LANGS_FR = "fr_BE fr_CA fr_FR"
LANGS_ES = "es_AR es_CL es_CO es_CR es_DO es_EC es_ES es_GT es_HN es_MX es_NI es_PA es_PE es_PR es_SV es_UY es_VE"
LANGS_PT = "pt_br pt_BR pt_PT"
LANGS_IT = "it_IT"


## Gallo-Romance

```
fr							yes (regional variants: BE CA FR)
wa	wln			Walloon			yes
	pcs			Picard			no
	nrf			Norman			no
	frp			Franco-Provençal	yes
oc	oci			Occitano-Romance	yes

ca	cat			Catalan			yes (ca/cat)
rm	roh			Romansh			yes (rm)
	lld			Ladin			yes
	fur			Friulan			yes
	lij			Ligurian		yes
	lmo			Lombard			yes (very little / noisy wikimedia)
```

## Iberian-Romance

```
es							yes (regional variants: AR CL CO CR DO EC ES GT HN MX NI PA PE PR SV UY VE)
pt							yes (variants: br BR PT)
gl							yes
	lad			Ladino			yes
an	arg			Aragonese		yes (an)
	mxi			Mozarabic		no
	mwl			Mirandese		yes (very little / noisy wikimedia)
```

## Italo-Dalmatian

```
it							yes (regional variants: IT)
co	cos			Corsican		yes
	nap			Neapolitan		yes (very little / noisy wikimedia)
	scn			Sicilian		yes (very little / noisy wikimedia)
	dlm			Dalmatian		no
	vec			Venetian		yes
	itk			Judeo-Italian		no
```

## Sardinian

```
sc	srd			Sardinian		yes
```

## Eastern Romance

```
ro							yes
```

## Early forms

```
la							yes
				Vulgar Latin
```