Helsinki-NLP/OPUS-MT-train


mirror of https://github.com/Helsinki-NLP/OPUS-MT-train.git synced 2024-11-27 11:03:13 +03:00

Joerg Tiedemann bc54b403cd student model quantisation finetuning added

2022-01-18 14:41:17 +02:00


Integration

https://github.com/UKPLab/EasyNMT

Data cleanup

Need better data filtering:

integrate OpusFilter
Tatoeba MT Challenge data sets are noisy for smaller languages like Breton (and similarity scores are not available for them); CCMatrix etc. is not very good for those languages either
cleanup script also before subword splitting?
stronger filters in cleanup script?
idea: compare character diversity between the two languages and use a threshold to filter sentences? (language-specific?)
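
The character-diversity idea could be sketched as below. This is only one possible reading of the note: "diversity" is taken to mean the number of distinct non-space characters on each side, and a pair is kept when the smaller count divided by the larger one reaches a threshold. The function name `filter_char_diversity`, the tab-separated input format and the default threshold of 0.5 are assumptions for illustration, not part of the existing cleanup script.

```shell
#!/usr/bin/env bash
# Keep sentence pairs whose character diversity is similar on both sides.
# Input (assumed format): one pair per line, "source<TAB>target".
# Output: pairs whose ratio of distinct-character counts (smaller / larger)
#         is at least the threshold; malformed lines are dropped.
# Note: whether awk counts bytes or characters depends on the awk
# implementation and locale (gawk in a UTF-8 locale counts characters).
filter_char_diversity() {
  local threshold="${1:-0.5}"   # hypothetical default, tune per language pair
  awk -F '\t' -v min_ratio="$threshold" '
    # Number of distinct non-space characters in a string.
    # i, c, n and seen are local variables (awk idiom: extra parameters).
    function distinct_chars(s,    i, c, n, seen) {
      n = 0
      for (i = 1; i <= length(s); i++) {
        c = substr(s, i, 1)
        if (c != " " && !(c in seen)) { seen[c] = 1; n++ }
      }
      return n
    }
    NF == 2 {
      a = distinct_chars($1)
      b = distinct_chars($2)
      if (a == 0 || b == 0) next          # drop pairs with an empty side
      ratio = (a < b) ? a / b : b / a
      if (ratio >= min_ratio) print
    }
  '
}
```

Usage would be something like `paste corpus.src corpus.trg | filter_char_diversity 0.4`, with file names chosen here only as placeholders. A language-specific threshold is probably needed, since scripts with large character inventories behave differently from alphabetic ones.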

links and tools:

more efficient parallelisation

https://www.gnu.org/software/parallel/
https://www.gnu.org/software/parallel/parallel_tutorial.html
https://www.gnu.org/software/bash/manual/html_node/GNU-Parallel.html
multinode training with MarianNMT: https://github.com/marian-nmt/marian/issues/244

from Bergamot: https://github.com/browsermt/students/blob/master/train-student/alignment/generate-alignment-and-shortlist.sh

# Subword segmentation with SentencePiece.
test -s $DIR/corpus.spm.$SRC || cat $CORPUS_SRC | pigz -dc | parallel --no-notice --pipe -k -j16 --block 50M "$MARIAN/spm_encode --model $VOCAB" > $DIR/corpus.spm.$SRC
test -s $DIR/corpus.spm.$TRG || cat $CORPUS_TRG | pigz -dc | parallel --no-notice --pipe -k -j16 --block 50M "$MARIAN/spm_encode --model $VOCAB" > $DIR/corpus.spm.$TRG
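
The pattern in the Bergamot snippet (split stdin into blocks, process the blocks in parallel, keep the output order with `-k`) could be wrapped for reuse in other line-based steps such as the cleanup filters. The following is a minimal sketch, not the Bergamot script itself: the function name `run_chunked` is hypothetical, the block size is simply copied from the snippet above, and the serial fallback is an added assumption for machines without GNU parallel. It only makes sense for commands that handle each input line independently.

```shell
#!/usr/bin/env bash
# Run a line-based command over stdin, in parallel chunks when possible.
# Usage: run_chunked JOBS COMMAND [ARGS...]
# With GNU parallel installed, stdin is split into blocks on line boundaries
# and "-k" keeps the output in input order. Otherwise the command is simply
# run once over the whole input.
run_chunked() {
  local jobs="$1"; shift
  if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    parallel --pipe -k -j "$jobs" --block 50M "$@"
  else
    "$@"
  fi
}
```

For example, `pigz -dc corpus.gz | run_chunked 16 tr 'A-Z' 'a-z' > corpus.lc` would lowercase a compressed corpus (file names are placeholders). Shell functions would need `export -f` before GNU parallel can call them, which this sketch does not handle.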

Benchmarking

SOTA-bench forum: https://forum.sotabench.com/

OPUS-MT at Hugging Face

more efficient GPU usage: https://github.com/kb-labb/kblabb-examples/tree/master/text/machine_translation_gpu

Multilingual data:

further resources: (from http://techiaith.cymru/translation/demo/?lang=en) contact: Dewi Jones (d.b.jones@bangor.ac.uk)

http://techiaith.cymru/corpws/Moses/CofnodYCynulliad/CofnodYCynulliad.tar.gz
http://techiaith.cymru/corpws/Moses/Deddfwriaeth/Deddfwriaeth.tar.gz
http://techiaith.cymru/corpws/Moses/Meddalwedd/Meddalwedd.tar.gz
http://techiaith.cymru/alinio/rhestr_geiriau.tsv
http://techiaith.cymru/alinio/hunalign/cy-en.dic

(see work/data/cy-en)

Celtic languages

LANGS = "ga cy br gd kw gv"

ga	gle			Irish			yes (ga)
cy	wel/cym			Welsh			yes (cy)
br	bre	bre/xbm/obt	Breton			yes (br)
gd	gla			Scottish Gaelic		yes (gd)
kw	cor	cor/cnx/oco	Cornish	 		yes (kw)
gv	glv			Manx			yes (gv)

Romance

LANGS = "fr wa frp oc ca rm lld fur lij lmo es pt gl lad an mwl it co nap scn vec sc ro la"
LANGS_FR = "fr_BE fr_CA fr_FR"
LANGS_ES = "es_AR es_CL es_CO es_CR es_DO es_EC es_ES es_GT es_HN es_MX es_NI es_PA es_PE es_PR es_SV es_UY es_VE"
LANGS_PT = "pt_br pt_BR pt_PT"
LANGS_IT = "it_IT"

Gallo-Romance

fr							yes (regional variants: BE CA FR)
wa	wln			Walloon			yes
	pcd			Picard			no
	nrf			Norman			no
	frp			Franco-Provençal	yes
oc	oci			Occitan			yes

ca	cat			Catalan			yes (ca/cat)
rm	roh			Romansh			yes (rm)
	lld			Ladin			yes
	fur			Friulian		yes
	lij			Ligurian		yes
	lmo			Lombard			yes (very little / noisy wikimedia)

Iberian-Romance

es							yes (regional variants: AR CL CO CR DO EC ES GT HN MX NI PA PE PR SV UY VE)
pt							yes (variants: br BR PT)
gl							yes
	lad			Ladino			yes
an	arg			Aragonese		yes (an)
	mxi			Mozarabic		no
	mwl			Mirandese		yes (very little / noisy wikimedia)

Italo-Dalmatian

it							yes (regional variants: IT)
co	cos			Corsican		yes
	nap			Neapolitan		yes (very little / noisy wikimedia)
	scn			Sicilian		yes (very little / noisy wikimedia)
	dlm			Dalmatian		no
	vec			Venetian		yes
	itk			Judeo-Italian		no

Sardinian

sc	srd			Sardinian		yes

Eastern Romance

ro							yes

Early forms

la							yes
				Vulgar Latin
