Summary: Integrating LASER (Language-Agnostic SEntence Representations) training code

- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [ Y] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [ N/A] Did you make sure to update the docs?
- [ Y] Did you write any new necessary tests? => an additional test in `test_iterators.py`

## What does this PR do?
This diff introduces the training code for LASER. It includes a specific `laser` task in `laser_task.py` which reads a json configuration file describing the binarized datasets of language pairs. `multitask_data_utils.py` defines dataset wrappers and iterators used by `laser` task.

## PR review
Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Yes.

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1207
Reviewed By: myleott
Differential Revision: D26454296
Pulled By: Celebio
fbshipit-source-id: c987672aa66abf31b039ee11867b06912d3486e5
LASER Language-Agnostic SEntence Representations
LASER is a library to calculate and use multilingual sentence embeddings.
You can find more information about LASER and how to use it on the official LASER repository.
This folder contains source code for training LASER embeddings.
Prepare data and configuration file
Binarize your data with fairseq, as described here.
Create a JSON config file with this format:
{
    "src_vocab": "/path/to/spm.src.cvocab",
    "tgt_vocab": "/path/to/spm.tgt.cvocab",
    "train": [
        {
            "type": "translation",
            "id": 0,
            "src": "/path/to/srclang1-tgtlang0/train.srclang1",
            "tgt": "/path/to/srclang1-tgtlang0/train.tgtlang0"
        },
        {
            "type": "translation",
            "id": 1,
            "src": "/path/to/srclang1-tgtlang1/train.srclang1",
            "tgt": "/path/to/srclang1-tgtlang1/train.tgtlang1"
        },
        {
            "type": "translation",
            "id": 0,
            "src": "/path/to/srclang2-tgtlang0/train.srclang2",
            "tgt": "/path/to/srclang2-tgtlang0/train.tgtlang0"
        },
        {
            "type": "translation",
            "id": 1,
            "src": "/path/to/srclang2-tgtlang1/train.srclang2",
            "tgt": "/path/to/srclang2-tgtlang1/train.tgtlang1"
        },
        ...
    ],
    "valid": [
        {
            "type": "translation",
            "id": 0,
            "src": "/unused",
            "tgt": "/unused"
        }
    ]
}
where the `src` and `tgt` paths point to binarized, indexed fairseq dataset files, and `id` is the target language id.
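For more than a handful of language pairs, writing this file by hand gets tedious. The following is a minimal sketch of generating it from lists of source and target languages. The helper name `build_laser_config` and the directory layout `<data_root>/<src>-<tgt>/train.<lang>` are assumptions that simply mirror the example paths above; they are not part of fairseq or the `laser` task.

```python
import json


def build_laser_config(data_root, src_langs, tgt_langs, src_vocab, tgt_vocab):
    """Build a config dict in the format shown above.

    Hypothetical helper: the <data_root>/<src>-<tgt>/train.<lang> layout is
    an assumption copied from the example paths, not a fairseq requirement.
    """
    train = []
    for src in src_langs:
        # The position of a target language in tgt_langs is used as its id.
        for tgt_id, tgt in enumerate(tgt_langs):
            pair_dir = f"{data_root}/{src}-{tgt}"
            train.append({
                "type": "translation",
                "id": tgt_id,
                "src": f"{pair_dir}/train.{src}",
                "tgt": f"{pair_dir}/train.{tgt}",
            })
    # Placeholder validation entry, as in the example config.
    valid = [{"type": "translation", "id": 0, "src": "/unused", "tgt": "/unused"}]
    return {
        "src_vocab": src_vocab,
        "tgt_vocab": tgt_vocab,
        "train": train,
        "valid": valid,
    }


config = build_laser_config(
    "/path/to",
    ["srclang1", "srclang2"],
    ["tgtlang0", "tgtlang1"],
    "/path/to/spm.src.cvocab",
    "/path/to/spm.tgt.cvocab",
)
config_json = json.dumps(config, indent=4)
print(config_json)
```

Every entry that shares a target language gets the same `id`, regardless of its source language, which matches the example above.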
Training Command Line Example
fairseq-train \
/path/to/configfile_described_above.json \
--user-dir examples/laser/laser_src \
--log-interval 100 --log-format simple \
--task laser --arch laser_lstm \
--save-dir . \
--optimizer adam \
--lr 0.001 \
--lr-scheduler inverse_sqrt \
--clip-norm 5 \
--warmup-updates 90000 \
--update-freq 2 \
--dropout 0.0 \
--encoder-dropout-out 0.1 \
--max-tokens 2000 \
--max-epoch 50 \
--encoder-bidirectional \
--encoder-layers 5 \
--encoder-hidden-size 512 \
--decoder-layers 1 \
--decoder-hidden-size 2048 \
--encoder-embed-dim 320 \
--decoder-embed-dim 320 \
--decoder-lang-embed-dim 32 \
--warmup-init-lr 0.001 \
--disable-validation
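To see what the learning-rate flags in this command imply, here is a small sketch of an inverse square root schedule with linear warmup, which is how fairseq's `inverse_sqrt` scheduler is commonly described. The function `inverse_sqrt_lr` is a hypothetical stand-alone reimplementation written for illustration, not fairseq code, so treat it as an approximation of the real scheduler.

```python
import math


def inverse_sqrt_lr(update_num, lr=0.001, warmup_init_lr=0.001, warmup_updates=90000):
    """Learning rate after `update_num` updates (illustrative sketch only).

    Defaults are taken from the command above:
    --lr 0.001 --warmup-init-lr 0.001 --warmup-updates 90000
    """
    if update_num < warmup_updates:
        # Linear warmup from warmup_init_lr to lr.
        step = (lr - warmup_init_lr) / warmup_updates
        return warmup_init_lr + update_num * step
    # After warmup, decay proportionally to 1 / sqrt(update_num).
    decay_factor = lr * math.sqrt(warmup_updates)
    return decay_factor / math.sqrt(update_num)


for n in (1, 45000, 90000, 360000):
    print(n, inverse_sqrt_lr(n))
```

Because `--warmup-init-lr` equals `--lr` here, the warmup phase under this sketch is flat: the rate stays at 0.001 for the first 90000 updates and only then starts to decay, reaching about half that value at 360000 updates. Separately, `--max-tokens 2000` together with `--update-freq 2` means gradients from roughly 4000 tokens per GPU are accumulated before each optimizer step.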
Applications
We showcase several applications of multilingual sentence embeddings with code to reproduce our results (in the directory "tasks").
- Cross-lingual document classification using the MLDoc corpus [2,6]
- WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia [7]
- Bitext mining using the BUCC corpus [3,5]
- Cross-lingual NLI using the XNLI corpus [4,5,6]
- Multilingual similarity search [1,6]
- Sentence embedding of text files: an example of how to calculate sentence embeddings for arbitrary text files in any of the supported languages.
For all tasks, we use exactly the same multilingual encoder, without any task-specific optimization or fine-tuning.
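As a rough illustration of the similarity search application, the sketch below finds the nearest candidate to a query by cosine similarity. The vectors are toy 3-dimensional values standing in for real sentence embeddings, and the helpers `cosine` and `nearest` are hypothetical names; this is not the code in the "tasks" directory.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest(query, candidates):
    """Return (index, similarity) of the candidate closest to the query."""
    scores = [cosine(query, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy vectors standing in for embeddings of one sentence in language A
# and three candidate sentences in language B (made-up values).
query_embedding = [1.0, 0.1, 0.0]
candidate_embeddings = [
    [0.0, 1.0, 0.2],
    [0.9, 0.2, 0.1],
    [0.1, 0.0, 1.0],
]
best_index, best_score = nearest(query_embedding, candidate_embeddings)
print(best_index, round(best_score, 3))
```

With a shared multilingual embedding space, the same lookup works across languages without any per-language handling, which is what makes the single encoder reusable across the tasks listed above.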
References
[1] Holger Schwenk and Matthijs Douze, Learning Joint Multilingual Sentence Representations with Neural Machine Translation, ACL workshop on Representation Learning for NLP, 2017.
[2] Holger Schwenk and Xian Li, A Corpus for Multilingual Document Classification in Eight Languages, LREC, pages 3548-3551, 2018.
[3] Holger Schwenk, Filtering and Mining Parallel Data in a Joint Multilingual Space, ACL, July 2018.
[4] Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk and Veselin Stoyanov, XNLI: Cross-lingual Sentence Understanding through Inference, EMNLP, 2018.
[5] Mikel Artetxe and Holger Schwenk, Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings, arXiv, Nov 3 2018.
[6] Mikel Artetxe and Holger Schwenk, Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond, arXiv, Dec 26 2018.
[7] Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman, WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia, arXiv, July 11 2019.
[8] Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave and Armand Joulin, CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB.