2020-01-10 17:45:42 +03:00
# Things to do
## Data preparation
* do temperature-based balanced sampling (see https://arxiv.org/pdf/1907.05019.pdf)
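
As a rough illustration of temperature-based sampling, the sketch below rescales each language pair's share of the data by an exponent of 1/T and renormalises. The function name, the corpus sizes and the temperature value are assumptions made for this example, not settings taken from this repository.

```python
def temperature_sampling_probs(sizes, temperature=5.0):
    """Return per-language-pair sampling probabilities.

    p_i is proportional to (n_i / N) ** (1 / temperature), where n_i is the
    number of sentence pairs for pair i and N is the total over all pairs.
    temperature=1 keeps the original proportions; larger values flatten
    the distribution towards uniform.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(sizes.values())
    scaled = {
        pair: (n / total) ** (1.0 / temperature) for pair, n in sizes.items()
    }
    norm = sum(scaled.values())
    return {pair: weight / norm for pair, weight in scaled.items()}


# Hypothetical corpus sizes (sentence pairs), for illustration only.
sizes = {"de-en": 10_000_000, "fi-en": 1_000_000, "bcl-en": 10_000}
probs_t1 = temperature_sampling_probs(sizes, temperature=1.0)  # size-proportional
probs_t5 = temperature_sampling_probs(sizes, temperature=5.0)  # flatter
```

With a temperature above 1, small language pairs are sampled more often than their raw share of the data and large ones less often.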
## Slurm job pipelines
Create slurm jobs with dependencies to build pipelines of jobs
(pass `--dependency` to `sbatch`, e.g. `--dependency=afterok:<jobid>`).

* see https://hpc.nih.gov/docs/job_dependencies.html
* see https://hpc.nih.gov/docs/userguide.html#depend
* grep for the job ID in the `sbatch` output (pattern: 'Submitted batch job 946074')
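
A minimal sketch of how such a pipeline could be chained: it extracts the job ID from the `Submitted batch job ...` line and passes it to the next `sbatch` call via `--dependency=afterok:<jobid>`. The helper names and the script names in the usage note are hypothetical and not part of the existing scripts.

```python
import re
import subprocess

# Matches the line sbatch prints on success, e.g. "Submitted batch job 946074".
JOB_ID_PATTERN = re.compile(r"Submitted batch job (\d+)")


def parse_job_id(sbatch_output):
    """Extract the job ID from the text printed by sbatch."""
    match = JOB_ID_PATTERN.search(sbatch_output)
    if match is None:
        raise ValueError(f"no job ID found in sbatch output: {sbatch_output!r}")
    return match.group(1)


def sbatch_command(script, after_ok=None):
    """Build the sbatch argument list, optionally waiting for other jobs."""
    cmd = ["sbatch"]
    if after_ok:
        # afterok: start only if all listed jobs finished successfully.
        cmd.append("--dependency=afterok:" + ":".join(after_ok))
    cmd.append(script)
    return cmd


def submit_pipeline(scripts):
    """Submit scripts in order; each job waits for the previous one."""
    job_ids = []
    for script in scripts:
        previous = job_ids[-1:]  # empty list for the first job
        result = subprocess.run(
            sbatch_command(script, previous),
            check=True,
            capture_output=True,
            text=True,
        )
        job_ids.append(parse_job_id(result.stdout))
    return job_ids
```

For example, `submit_pipeline(["fetch-data.sh", "train.sh", "eval.sh"])` (hypothetical script names) would submit three jobs in which each step starts only after the previous one has finished successfully.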
## Issues
* race conditions in work/data when several jobs fetch data for the same language pairs!
* get rid of BPE to simplify the scripts
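
One possible mitigation for the race condition on work/data listed above is a per-language-pair lock file that each job acquires before fetching data. The sketch below is only an illustration: the function name, the lock-file naming and the timeouts are hypothetical. It also assumes that exclusive file creation (`O_CREAT | O_EXCL`) is atomic on the filesystem in use, which should be verified for the cluster's shared storage.

```python
import contextlib
import os
import time


@contextlib.contextmanager
def langpair_lock(data_dir, langpair, timeout=3600.0, poll=5.0):
    """Hold an exclusive lock file for one language pair inside data_dir."""
    lock_path = os.path.join(data_dir, f"{langpair}.lock")
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the lock file already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        # Record the owner to help with debugging stale locks.
        with os.fdopen(fd, "w") as lock_file:
            lock_file.write(str(os.getpid()))
        yield lock_path
    finally:
        os.remove(lock_path)
```

A job would then wrap its fetch step in `with langpair_lock("work/data", "fi-en"):`. A job that is killed while holding the lock leaves a stale lock file behind, so some clean-up policy would still be needed.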
## General settings
* better hyperparameters for low-resource settings (smaller batch sizes, smaller vocabularies, ...)
* better data selection (data cleaning / filtering); use opus-filter?
* better balance between general data sets and backtranslations
## Backtranslation
* status: basically working, need better integration?!
* add backtranslations to training data
* can use monolingual data from tokenized Wikipedia dumps: https://sites.google.com/site/rmyeid/projects/polyglot
* https://dumps.wikimedia.org/backup-index.html
* better in JSON: https://dumps.wikimedia.org/other/cirrussearch/current/
## Fine-tuning and domain adaptation
* status: basically working
* do we want to publish fine-tuned data or rather the fine-tuning procedures? (using a Docker container?)
## Show-case some selected language pairs
* collaboration with Wikimedia
* focus languages: Tagalog (tl, tgl), Central Bikol (bcl), Malayalam (ml, mal), Bengali (bn, ben), and Mongolian (mn, mon)
# Other requests
* Hebrew-->English and Hebrew-->Russian (Shaul Dar)