mirror of
https://github.com/Helsinki-NLP/OPUS-MT-train.git
synced 2024-11-30 22:14:14 +03:00
Translate data as synthetic training data
Use Wiki data as the source text to translate:
- JSON processor (jq): https://stedolan.github.io/jq/
- wiki JSON dumps: https://dumps.wikimedia.org/other/cirrussearch/current/
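As a rough illustration of what the JSON processing step does with the wiki JSON dumps listed above, here is a minimal Python sketch. It assumes the dumps are gzip-compressed files with one JSON object per line, where index metadata lines alternate with page records that carry the article body in a "text" field. The function names and the sample records are hypothetical and are not part of this repository, which relies on jq for this step.

```python
import gzip
import json
from typing import Iterable, Iterator


def iter_article_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the "text" field of every JSON line that has one.

    Lines without a usable "text" field (for example the index
    metadata lines) are skipped. The field name is an assumption
    about the dump layout.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        text = record.get("text") if isinstance(record, dict) else None
        if isinstance(text, str) and text.strip():
            # Collapse runs of whitespace so each article is one line.
            yield " ".join(text.split())


def read_dump(path: str) -> Iterator[str]:
    """Stream article text from a gzip-compressed dump file (hypothetical path)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from iter_article_text(handle)


# Tiny in-memory sample that imitates the assumed dump layout.
sample = [
    '{"index": {"_type": "page", "_id": "1"}}',
    '{"title": "Helsinki", "text": "Helsinki is the capital  of Finland."}',
]
extracted = list(iter_article_text(sample))
print(extracted)
```

The extracted text would still need sentence splitting and the usual preprocessing before it can be passed to a translation model.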
NOTE: this only works for SentencePiece models
TODO
- download base models from ObjectStorage
- make it work with multilingual models (the preprocessing scripts need to be adjusted for those models)