
Cross-lingual Retrieval for Iterative Self-Supervised Training

https://arxiv.org/pdf/2006.09526.pdf

Introduction

CRISS is a multilingual sequence-to-sequence pretraining method in which mining and training are applied iteratively, improving cross-lingual alignment and translation ability at the same time.

Requirements:

Unsupervised Machine Translation

1. Download and decompress CRISS checkpoints
cd examples/criss
wget https://dl.fbaipublicfiles.com/criss/criss_3rd_checkpoints.tar.gz
tar -xf criss_3rd_checkpoints.tar.gz
2. Download and preprocess Flores test dataset

Make sure to run all scripts from the examples/criss directory.

bash download_and_preprocess_flores_test.sh
3. Run Evaluation on Sinhala-English
bash unsupervised_mt/eval.sh
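
The two preconditions in the steps above (extracting the archive that was actually downloaded, and running from examples/criss) can be sketched as small shell helpers. This is a minimal sketch, not part of fairseq: the function names `archive_name_from_url` and `in_criss_dir` and the variables `CRISS_URL` and `CRISS_ARCHIVE` are hypothetical, and it assumes wget saves the file under the last path component of the URL, which is its default when no output name is given.

```shell
# Hypothetical helpers for illustration; they are not shipped with fairseq.

# Print the last path component of a URL. Assumption: this is the name
# wget uses for the downloaded file when no -O option is passed.
archive_name_from_url() {
    basename "$1"
}

# Succeed only if the given directory (default: the current one)
# is the examples/criss directory the scripts expect to run from.
in_criss_dir() {
    case "${1:-$PWD}" in
        */examples/criss|examples/criss) return 0 ;;
        *) return 1 ;;
    esac
}

# Derive the archive name from the checkpoint URL given in step 1,
# so the name passed to tar always matches the downloaded file.
CRISS_URL="https://dl.fbaipublicfiles.com/criss/criss_3rd_checkpoints.tar.gz"
CRISS_ARCHIVE="$(archive_name_from_url "$CRISS_URL")"
```

With these defined, one possible use is to call `in_criss_dir` before any of the scripts and stop if it fails, then run `wget "$CRISS_URL"` followed by `tar -xf "$CRISS_ARCHIVE"`.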

Sentence Retrieval

1. Download and preprocess Tatoeba dataset
bash download_and_preprocess_tatoeba.sh
2. Run Sentence Retrieval on Tatoeba Kazakh-English
bash sentence_retrieval/sentence_retrieval_tatoeba.sh

Mining

1. Install faiss

Follow the instructions at https://github.com/facebookresearch/faiss/blob/master/INSTALL.md

2. Mine pseudo-parallel data between Kazakh and English
bash mining/mine_example.sh
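
Before running the mining script, it may help to confirm that faiss is importable from the Python interpreter in use. The following is a minimal sketch under that assumption; `can_import` is a hypothetical helper, not something provided by fairseq or faiss.

```shell
# Hypothetical helper: succeed if the interpreter given as $1
# can import the Python module named in $2, fail otherwise.
can_import() {
    "$1" -c "import $2" >/dev/null 2>&1
}
```

For example, `can_import python faiss || echo "faiss is not installed" >&2` prints a message instead of letting the mining step fail partway through.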

Citation

@article{tran2020cross,
  title={Cross-lingual retrieval for iterative self-supervised training},
  author={Tran, Chau and Tang, Yuqing and Li, Xian and Gu, Jiatao},
  journal={arXiv preprint arXiv:2006.09526},
  year={2020}
}