Add XGLM pre-training data format explanation (#4158)

Summary:
1. Add XGLM pre-training data format explanation
2. Add back pointer to pre-print

Pull Request resolved: https://github.com/pytorch/fairseq/pull/4158

Reviewed By: xianxl

Differential Revision: D33825440

Pulled By: todpole3

fbshipit-source-id: 379aa55d55ef3c9016987d1f05de023b7a7aee04

@@ -17,6 +17,33 @@ Model | Layers | Model Dim | Languages | Download
`XGLM 7.5B` | 32 | 4096 | trained on 30 languages| [xglm.7.5B.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/xglm/xglm.7.5B.tar.gz)
`XGLM 4.5B` | 48 | 2048 | trained on 134 languages| [xglm.4.5B.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/xglm/xglm.4.5B.tar.gz)
## Pre-training Data Format
Our models were pre-trained with data in the following format (i.e. paragraphs are separated by newlines and documents are separated by double newlines).
```
<doc0,para0,tok0> ... <doc0,para0,tokX0> # X0: number of tokens in para0 of doc0
<doc0,para1,tok0> ... <doc0,para1,tokY0> # Y0: number of tokens in para1 of doc0
<doc1,para0,tok0> ... <doc1,para0,tokX1> # X1: number of tokens in para0 of doc1
<doc1,para1,tok0> ... <doc1,para1,tokY1> # Y1: number of tokens in para1 of doc1
...
```
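For illustration, here is a minimal sketch of assembling raw documents into this format; the `documents` list and the join logic below are illustrative only and not part of fairseq's preprocessing pipeline:
```python
# Illustrative only: paragraphs within a document are joined with a single
# newline, and documents are joined with a blank line (double newline).
documents = [
    ["First paragraph of the first document.",
     "Second paragraph of the first document."],
    ["First paragraph of the second document."],
]
pretraining_text = "\n\n".join("\n".join(paragraphs) for paragraphs in documents)
print(pretraining_text)
```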
Fairseq's preprocessing replaces newlines with the end-of-sentence symbol (`</s>`). As a result, the models never saw newline characters during pre-training, and the same preprocessing should be applied prior to few-shot inference to maximize performance. For example, our language model scoring function has a `replace_newlines_with_eos` argument that triggers this preprocessing:
```python
from fairseq.models.transformer_lm import TransformerLanguageModel
# Path to the directory obtained by decompressing one of the model .tar.gz archives above
model_dir = 'path_to_decompressed_tar_gz_dir'
lm = TransformerLanguageModel.from_pretrained(model_dir, bpe='sentencepiece')
text = """First paragraph of the first document.
Second paragraph of the first document.
First paragraph of the second document.
"""
tokens = lm.score(text, replace_newlines_with_eos=True)['tokens']
assert '\n' not in lm.decode(tokens) # no newlines were encoded
```
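The dictionary returned by `score` also carries per-token log-probabilities in fairseq's standard `positional_scores` field, which can be aggregated into a sequence-level score. A minimal sketch, assuming the hub scoring interface used above:
```python
# Continuing the example above; 'positional_scores' is the per-token
# log-probability tensor returned by fairseq's sequence scorer.
result = lm.score(text, replace_newlines_with_eos=True)
total_logprob = result['positional_scores'].sum().item()   # sequence log-probability
avg_logprob = result['positional_scores'].mean().item()    # per-token average
print(f"total log-prob: {total_logprob:.2f}, per-token average: {avg_logprob:.2f}")
```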
## Evaluation
### Example (COPA)
@@ -111,6 +138,11 @@ for lang in ['en', 'zh', 'hi']:
# hi-1 0 0
```
## Preprint
[Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668).
Xi Victoria Lin*, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li* (* Equal Contribution).
ArXiv 2021.
## Citation
```
@article{DBLP:journals/corr/abs-2112-10668,