OPUS-MT-train/TODO.md


# Things to do


## data preparation

* do temperature-based balanced sampling (see https://arxiv.org/pdf/1907.05019.pdf)
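A minimal sketch of the sampling rule from the paper linked above, where a language pair with D sentence pairs is sampled with probability proportional to (D / total)^(1/T). The function name, the dictionary layout, and the language-pair names and corpus sizes in the usage note are assumptions for illustration, not part of the existing scripts.

```python
def temperature_sampling_probs(sizes, temperature=5.0):
    """Return a sampling probability per language pair.

    sizes: dict mapping a language pair to its number of sentence pairs.
    temperature: T=1 keeps the original proportions; larger T moves the
    distribution towards uniform, which upsamples small language pairs.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(sizes.values())
    if total <= 0:
        raise ValueError("sizes must contain at least one non-empty corpus")
    # Scale each proportion by the exponent 1/T, then renormalize.
    weights = {
        pair: (count / total) ** (1.0 / temperature)
        for pair, count in sizes.items()
    }
    norm = sum(weights.values())
    return {pair: weight / norm for pair, weight in weights.items()}
```

For example, `temperature_sampling_probs({"en-fi": 1_000_000, "en-bcl": 10_000}, temperature=5.0)` gives the smaller pair a noticeably larger share than its raw proportion, while `temperature=1.0` reproduces the raw proportions.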


## slurm job pipelines

Create slurm jobs with dependencies to build pipelines of jobs:

* add --dependency to sbatch
* see https://hpc.nih.gov/docs/job_dependencies.html and https://hpc.nih.gov/docs/userguide.html#depend
* grep for the job id in the sbatch output (pattern: 'Submitted batch job 946074')
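The chaining idea can be sketched as follows: extract the job id from the sbatch output and pass it to the next submission via `--dependency=afterok:<id>`. The helper names and the script name `train.sh` are hypothetical; the sketch only builds the command line and does not call sbatch itself.

```python
import re

# sbatch prints a line such as "Submitted batch job 946074" on success.
_JOB_ID_PATTERN = re.compile(r"Submitted batch job (\d+)")


def parse_job_id(sbatch_output):
    """Extract the numeric job id from the text printed by sbatch."""
    match = _JOB_ID_PATTERN.search(sbatch_output)
    if match is None:
        raise ValueError("no job id found in: %r" % sbatch_output)
    return match.group(1)


def sbatch_command(script, after_ok=()):
    """Build an sbatch command that starts only after the given jobs succeed.

    after_ok: job ids (strings) that must finish successfully first.
    """
    cmd = ["sbatch"]
    if after_ok:
        cmd.append("--dependency=afterok:" + ":".join(after_ok))
    cmd.append(script)
    return cmd
```

A pipeline would submit the first job, pass its output through `parse_job_id`, and hand the result to `sbatch_command("train.sh", [job_id])` (for instance via `subprocess.run`) for the next step.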


## Issues

* race conditions in work/data between jobs that fetch data for the same language pairs!
* get rid of BPE to simplify the scripts
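One possible way to address the work/data issue listed above is a per-language-pair lock file that is created atomically before fetching. This is only a sketch under assumptions: the helper name and the lock-file naming are hypothetical, and exclusive file creation may not be reliable on every shared file system.

```python
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path):
    """Hold a lock file for the duration of the with-block.

    os.O_EXCL makes creation fail with FileExistsError if another job
    already holds the lock, so only one job fetches at a time.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield
    finally:
        # Release the lock so that waiting jobs can proceed.
        os.close(fd)
        os.remove(lock_path)
```

A job would wrap its fetch step in `with exclusive_lock(...)` using a lock path derived from the language pair, and either wait and retry or skip the fetch when `FileExistsError` is raised.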


## General settings

* better hyperparameters for low-resource setting (lower batch sizes, smaller vocabularies ...)
* better data selection (data cleaning / filtering); use opus-filter?
* better balance between general data sets and backtranslations


## Backtranslation

* status: basically working, need better integration?!
* add backtranslations to training data
* can use monolingual data from tokenized wikipedia dumps: https://sites.google.com/site/rmyeid/projects/polyglot
* https://dumps.wikimedia.org/backup-index.html
* better in JSON: https://dumps.wikimedia.org/other/cirrussearch/current/

## Fine-tuning and domain adaptation

* status: basically working
* do we want to publish fine-tuned data or rather the fine-tuning procedures? (using a docker container?)


## Show-case some selected language pairs

* collaboration with wikimedia
* focus languages: Tagalog (tl, tgl), Central Bikol (bcl), Malayalam (ml, mal), Bengali (bn, ben), and Mongolian (mn, mon)


# Other requests

* Hebrew-->English and Hebrew-->Russian (Shaul Dar)