Commit Graph

1580 Commits

Myle Ott
b4d57c6d49 Move TPU grad reductions out of Trainer into TPUDistributedDataParallel (#1397)
Summary:
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1397

Data parallel command: `python train.py ~/data/data-bin/wikitext-103-roberta-bpe-bin/ --task language_modeling --arch transformer_lm --batch-size 8 --tokens-per-sample 512 --log-format simple --log-interval 1 --fp16 --optimizer adam --share-decoder-input-output-embed --lr 0.0001`
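The change moves the cross-replica gradient reduction out of the Trainer and into the module wrapper, which averages each parameter's gradient across replicas after the backward pass. A toy, framework-free sketch of what that reduction computes (the real wrapper would use an XLA/torch collective all-reduce, not plain Python):

```python
def all_reduce_grads(replica_grads):
    # Average each parameter's gradient across replicas, as a
    # DDP-style wrapper does after the backward pass.
    num_replicas = len(replica_grads)
    num_params = len(replica_grads[0])
    averaged = []
    for p in range(num_params):
        total = sum(grads[p] for grads in replica_grads)
        averaged.append(total / num_replicas)
    return averaged

# Gradients for 2 parameters on 4 replicas:
grads = [[1.0, 2.0], [3.0, 4.0], [1.0, 2.0], [3.0, 4.0]]
print(all_reduce_grads(grads))  # [2.0, 3.0]
```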

Data parallel before:
```
2020-11-04 08:20:13 | INFO | fairseq_cli.train | training on 8 devices (GPUs/TPUs)
2020-11-04 08:20:13 | INFO | fairseq_cli.train | max tokens per GPU = None and batch size per GPU = 8
2020-11-04 08:20:13 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/checkpoint_last.pt
2020-11-04 08:20:13 | INFO | fairseq.trainer | loading train data for epoch 1
2020-11-04 08:20:14 | INFO | fairseq.data.data_utils | loaded 1801350 examples from: /private/home/myleott/data/data-bin/wikitext-103-roberta-bpe-bin/train
2020-11-04 08:20:14 | INFO | fairseq.optim.adam | using FusedAdam
2020-11-04 08:20:14 | INFO | fairseq.trainer | begin training epoch 1
2020-11-04 08:20:19 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 64.0
2020-11-04 08:20:19 | INFO | train_inner | epoch 001:      2 / 3587 loss=19.682, ppl=841142, wps=0, ups=0, wpb=32768, bsz=64, num_updates=1, lr=0.0001, gnorm=13.17, loss_scale=64, train_wall=0, wall=5
2020-11-04 08:20:19 | INFO | train_inner | epoch 001:      3 / 3587 loss=16.721, ppl=108002, wps=160870, ups=4.91, wpb=32768, bsz=64, num_updates=2, lr=0.0001, gnorm=4.507, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:20:19 | INFO | train_inner | epoch 001:      4 / 3587 loss=16.07, ppl=68785.8, wps=517232, ups=15.77, wpb=32768, bsz=64, num_updates=3, lr=0.0001, gnorm=2.737, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:20:19 | INFO | train_inner | epoch 001:      5 / 3587 loss=15.714, ppl=53741.4, wps=537322, ups=16.38, wpb=32768, bsz=64, num_updates=4, lr=0.0001, gnorm=2.542, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:20:19 | INFO | train_inner | epoch 001:      6 / 3587 loss=15.441, ppl=44492.1, wps=540488, ups=16.48, wpb=32768, bsz=64, num_updates=5, lr=0.0001, gnorm=2.485, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:20:19 | INFO | train_inner | epoch 001:      7 / 3587 loss=15.199, ppl=37603.2, wps=543411, ups=16.57, wpb=32768, bsz=64, num_updates=6, lr=0.0001, gnorm=2.382, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:20:19 | INFO | train_inner | epoch 001:      8 / 3587 loss=14.984, ppl=32414, wps=540359, ups=16.47, wpb=32768, bsz=64, num_updates=7, lr=0.0001, gnorm=2.274, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:20:20 | INFO | train_inner | epoch 001:      9 / 3587 loss=14.7, ppl=26622.2, wps=533446, ups=16.26, wpb=32768, bsz=64, num_updates=8, lr=0.0001, gnorm=2.16, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:20:20 | INFO | train_inner | epoch 001:     10 / 3587 loss=14.482, ppl=22875.4, wps=539734, ups=16.46, wpb=32768, bsz=64, num_updates=9, lr=0.0001, gnorm=2.055, loss_scale=64, train_wall=0, wall=6
```

Data parallel after:
```
2020-11-04 08:14:02 | INFO | fairseq_cli.train | training on 8 devices (GPUs/TPUs)
2020-11-04 08:14:02 | INFO | fairseq_cli.train | max tokens per GPU = None and batch size per GPU = 8
2020-11-04 08:14:02 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/checkpoint_last.pt
2020-11-04 08:14:02 | INFO | fairseq.trainer | loading train data for epoch 1
2020-11-04 08:14:03 | INFO | fairseq.data.data_utils | loaded 1801350 examples from: /private/home/myleott/data/data-bin/wikitext-103-roberta-bpe-bin/train
2020-11-04 08:14:03 | INFO | fairseq.optim.adam | using FusedAdam
2020-11-04 08:14:03 | INFO | fairseq.trainer | begin training epoch 1
2020-11-04 08:14:08 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 64.0
2020-11-04 08:14:08 | INFO | train_inner | epoch 001:      2 / 3587 loss=19.682, ppl=841142, wps=0, ups=0, wpb=32768, bsz=64, num_updates=1, lr=0.0001, gnorm=13.17, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:14:08 | INFO | train_inner | epoch 001:      3 / 3587 loss=16.721, ppl=108002, wps=157099, ups=4.79, wpb=32768, bsz=64, num_updates=2, lr=0.0001, gnorm=4.507, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:14:08 | INFO | train_inner | epoch 001:      4 / 3587 loss=16.07, ppl=68785.8, wps=560049, ups=17.08, wpb=32768, bsz=64, num_updates=3, lr=0.0001, gnorm=2.737, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:14:08 | INFO | train_inner | epoch 001:      5 / 3587 loss=15.714, ppl=53741.4, wps=558507, ups=17.03, wpb=32768, bsz=64, num_updates=4, lr=0.0001, gnorm=2.542, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:14:08 | INFO | train_inner | epoch 001:      6 / 3587 loss=15.441, ppl=44492.1, wps=514194, ups=15.68, wpb=32768, bsz=64, num_updates=5, lr=0.0001, gnorm=2.485, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:14:08 | INFO | train_inner | epoch 001:      7 / 3587 loss=15.199, ppl=37603.2, wps=552676, ups=16.85, wpb=32768, bsz=64, num_updates=6, lr=0.0001, gnorm=2.382, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:14:09 | INFO | train_inner | epoch 001:      8 / 3587 loss=14.984, ppl=32414, wps=546402, ups=16.66, wpb=32768, bsz=64, num_updates=7, lr=0.0001, gnorm=2.274, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:14:09 | INFO | train_inner | epoch 001:      9 / 3587 loss=14.7, ppl=26622.2, wps=508472, ups=15.5, wpb=32768, bsz=64, num_updates=8, lr=0.0001, gnorm=2.16, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:14:09 | INFO | train_inner | epoch 001:     10 / 3587 loss=14.482, ppl=22875.4, wps=552493, ups=16.84, wpb=32768, bsz=64, num_updates=9, lr=0.0001, gnorm=2.055, loss_scale=64, train_wall=0, wall=6
```

Data parallel command (no_c10d): `python train.py ~/data/data-bin/wikitext-103-roberta-bpe-bin/ --task language_modeling --arch transformer_lm --batch-size 8 --tokens-per-sample 512 --log-format simple --log-interval 1 --fp16 --optimizer adam --share-decoder-input-output-embed --lr 0.0001 --ddp-backend no_c10d`

Data parallel before:
```
2020-11-04 08:19:25 | INFO | fairseq_cli.train | training on 8 devices (GPUs/TPUs)
2020-11-04 08:19:25 | INFO | fairseq_cli.train | max tokens per GPU = None and batch size per GPU = 8
2020-11-04 08:19:25 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/checkpoint_last.pt
2020-11-04 08:19:25 | INFO | fairseq.trainer | loading train data for epoch 1
2020-11-04 08:19:25 | INFO | fairseq.data.data_utils | loaded 1801350 examples from: /private/home/myleott/data/data-bin/wikitext-103-roberta-bpe-bin/train
2020-11-04 08:19:26 | INFO | fairseq.optim.adam | using FusedAdam
2020-11-04 08:19:26 | INFO | fairseq.trainer | begin training epoch 1
2020-11-04 08:19:31 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 64.0
2020-11-04 08:19:31 | INFO | train_inner | epoch 001:      2 / 3587 loss=19.682, ppl=841142, wps=0, ups=0, wpb=32768, bsz=64, num_updates=1, lr=0.0001, gnorm=13.17, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:19:32 | INFO | train_inner | epoch 001:      3 / 3587 loss=16.721, ppl=108001, wps=141659, ups=4.32, wpb=32768, bsz=64, num_updates=2, lr=0.0001, gnorm=4.507, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:19:32 | INFO | train_inner | epoch 001:      4 / 3587 loss=16.07, ppl=68785.9, wps=503762, ups=15.36, wpb=32768, bsz=64, num_updates=3, lr=0.0001, gnorm=2.737, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:19:32 | INFO | train_inner | epoch 001:      5 / 3587 loss=15.714, ppl=53741.5, wps=488599, ups=14.9, wpb=32768, bsz=64, num_updates=4, lr=0.0001, gnorm=2.542, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:19:32 | INFO | train_inner | epoch 001:      6 / 3587 loss=15.441, ppl=44492, wps=507855, ups=15.48, wpb=32768, bsz=64, num_updates=5, lr=0.0001, gnorm=2.485, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:19:32 | INFO | train_inner | epoch 001:      7 / 3587 loss=15.199, ppl=37603, wps=503270, ups=15.34, wpb=32768, bsz=64, num_updates=6, lr=0.0001, gnorm=2.382, loss_scale=64, train_wall=0, wall=7
2020-11-04 08:19:32 | INFO | train_inner | epoch 001:      8 / 3587 loss=14.984, ppl=32414, wps=467778, ups=14.26, wpb=32768, bsz=64, num_updates=7, lr=0.0001, gnorm=2.274, loss_scale=64, train_wall=0, wall=7
2020-11-04 08:19:32 | INFO | train_inner | epoch 001:      9 / 3587 loss=14.7, ppl=26622.2, wps=503800, ups=15.36, wpb=32768, bsz=64, num_updates=8, lr=0.0001, gnorm=2.16, loss_scale=64, train_wall=0, wall=7
2020-11-04 08:19:32 | INFO | train_inner | epoch 001:     10 / 3587 loss=14.482, ppl=22875.3, wps=468486, ups=14.28, wpb=32768, bsz=64, num_updates=9, lr=0.0001, gnorm=2.055, loss_scale=64, train_wall=0, wall=7
```

Data parallel after:
```
2020-11-04 08:14:50 | INFO | fairseq_cli.train | training on 8 devices (GPUs/TPUs)
2020-11-04 08:14:50 | INFO | fairseq_cli.train | max tokens per GPU = None and batch size per GPU = 8
2020-11-04 08:14:50 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/checkpoint_last.pt
2020-11-04 08:14:50 | INFO | fairseq.trainer | loading train data for epoch 1
2020-11-04 08:14:50 | INFO | fairseq.data.data_utils | loaded 1801350 examples from: /private/home/myleott/data/data-bin/wikitext-103-roberta-bpe-bin/train
2020-11-04 08:14:51 | INFO | fairseq.optim.adam | using FusedAdam
2020-11-04 08:14:51 | INFO | fairseq.trainer | begin training epoch 1
2020-11-04 08:14:56 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 64.0
2020-11-04 08:14:56 | INFO | train_inner | epoch 001:      2 / 3587 loss=19.682, ppl=841142, wps=0, ups=0, wpb=32768, bsz=64, num_updates=1, lr=0.0001, gnorm=13.17, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:14:56 | INFO | train_inner | epoch 001:      3 / 3587 loss=16.721, ppl=108001, wps=137677, ups=4.2, wpb=32768, bsz=64, num_updates=2, lr=0.0001, gnorm=4.507, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:14:56 | INFO | train_inner | epoch 001:      4 / 3587 loss=16.07, ppl=68785.9, wps=519541, ups=15.84, wpb=32768, bsz=64, num_updates=3, lr=0.0001, gnorm=2.737, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:14:56 | INFO | train_inner | epoch 001:      5 / 3587 loss=15.714, ppl=53741.5, wps=517063, ups=15.76, wpb=32768, bsz=64, num_updates=4, lr=0.0001, gnorm=2.542, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:14:56 | INFO | train_inner | epoch 001:      6 / 3587 loss=15.441, ppl=44492, wps=490728, ups=14.95, wpb=32768, bsz=64, num_updates=5, lr=0.0001, gnorm=2.485, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:14:56 | INFO | train_inner | epoch 001:      7 / 3587 loss=15.199, ppl=37603, wps=505262, ups=15.41, wpb=32768, bsz=64, num_updates=6, lr=0.0001, gnorm=2.382, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:14:56 | INFO | train_inner | epoch 001:      8 / 3587 loss=14.984, ppl=32414, wps=508874, ups=15.52, wpb=32768, bsz=64, num_updates=7, lr=0.0001, gnorm=2.274, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:14:57 | INFO | train_inner | epoch 001:      9 / 3587 loss=14.7, ppl=26622.2, wps=518028, ups=15.79, wpb=32768, bsz=64, num_updates=8, lr=0.0001, gnorm=2.16, loss_scale=64, train_wall=0, wall=6
2020-11-04 08:14:57 | INFO | train_inner | epoch 001:     10 / 3587 loss=14.482, ppl=22875.3, wps=515996, ups=15.73, wpb=32768, bsz=64, num_updates=9, lr=0.0001, gnorm=2.055, loss_scale=64, train_wall=0, wall=7
```

Model parallel command: `python train.py ~/data/data-bin/wikitext-103-roberta-bpe-bin/ --task language_modeling --arch transformer_lm_megatron --decoder-layers 4 --batch-size 8 --tokens-per-sample 512 --log-format simple --log-interval 1 --fp16 --optimizer adam --model-parallel-size 2 --share-decoder-input-output-embed --lr 0.0001`

Model parallel before:
```
2020-11-04 08:18:38 | INFO | fairseq_cli.train | training on 8 devices (GPUs/TPUs)
2020-11-04 08:18:38 | INFO | fairseq_cli.train | max tokens per GPU = None and batch size per GPU = 8
2020-11-04 08:18:38 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/checkpoint_last-model_part-0.pt
2020-11-04 08:18:38 | INFO | fairseq.trainer | loading train data for epoch 1
2020-11-04 08:18:38 | INFO | fairseq.data.data_utils | loaded 1801350 examples from: /private/home/myleott/data/data-bin/wikitext-103-roberta-bpe-bin/train
2020-11-04 08:18:39 | INFO | fairseq.optim.adam | using FusedAdam
2020-11-04 08:18:39 | INFO | fairseq.trainer | begin training epoch 1
2020-11-04 08:18:44 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 64.0
2020-11-04 08:18:45 | INFO | train_inner | epoch 001:      2 / 7173 loss=55.997, ppl=7.19017e+16, wps=0, ups=0, wpb=16384, bsz=32, num_updates=1, lr=0.0001, gnorm=14.03, loss_scale=64, train_wall=1, wall=7
2020-11-04 08:18:45 | INFO | train_inner | epoch 001:      3 / 7173 loss=28.372, ppl=3.47501e+08, wps=48371.7, ups=2.95, wpb=16384, bsz=32, num_updates=2, lr=0.0001, gnorm=15.339, loss_scale=64, train_wall=0, wall=8
2020-11-04 08:18:46 | INFO | train_inner | epoch 001:      4 / 7173 loss=15.855, ppl=59276.8, wps=72422.5, ups=4.42, wpb=16384, bsz=32, num_updates=3, lr=0.0001, gnorm=4.189, loss_scale=64, train_wall=0, wall=8
2020-11-04 08:18:46 | INFO | train_inner | epoch 001:      5 / 7173 loss=14.713, ppl=26858.7, wps=72933.5, ups=4.45, wpb=16384, bsz=32, num_updates=4, lr=0.0001, gnorm=4.751, loss_scale=64, train_wall=0, wall=8
2020-11-04 08:18:46 | INFO | train_inner | epoch 001:      6 / 7173 loss=13.901, ppl=15299.7, wps=71974.8, ups=4.39, wpb=16384, bsz=32, num_updates=5, lr=0.0001, gnorm=4.361, loss_scale=64, train_wall=0, wall=8
2020-11-04 08:18:46 | INFO | train_inner | epoch 001:      7 / 7173 loss=13.312, ppl=10169.5, wps=72897.8, ups=4.45, wpb=16384, bsz=32, num_updates=6, lr=0.0001, gnorm=3.307, loss_scale=64, train_wall=0, wall=9
2020-11-04 08:18:47 | INFO | train_inner | epoch 001:      8 / 7173 loss=12.914, ppl=7720.21, wps=73044.6, ups=4.46, wpb=16384, bsz=32, num_updates=7, lr=0.0001, gnorm=5.473, loss_scale=64, train_wall=0, wall=9
2020-11-04 08:18:47 | INFO | train_inner | epoch 001:      9 / 7173 loss=12.56, ppl=6036.72, wps=73453.1, ups=4.48, wpb=16384, bsz=32, num_updates=8, lr=0.0001, gnorm=6.112, loss_scale=64, train_wall=0, wall=9
2020-11-04 08:18:47 | INFO | train_inner | epoch 001:     10 / 7173 loss=12.116, ppl=4437.77, wps=73442.6, ups=4.48, wpb=16384, bsz=32, num_updates=9, lr=0.0001, gnorm=4.415, loss_scale=64, train_wall=0, wall=9
```

Model parallel after:
```
2020-11-04 08:12:09 | INFO | fairseq_cli.train | training on 8 devices (GPUs/TPUs)
2020-11-04 08:12:09 | INFO | fairseq_cli.train | max tokens per GPU = None and batch size per GPU = 8
2020-11-04 08:12:09 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/checkpoint_last-model_part-0.pt
2020-11-04 08:12:09 | INFO | fairseq.trainer | loading train data for epoch 1
2020-11-04 08:12:09 | INFO | fairseq.data.data_utils | loaded 1801350 examples from: /private/home/myleott/data/data-bin/wikitext-103-roberta-bpe-bin/train
2020-11-04 08:12:10 | INFO | fairseq.optim.adam | using FusedAdam
2020-11-04 08:12:10 | INFO | fairseq.trainer | begin training epoch 1
2020-11-04 08:12:16 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 64.0
2020-11-04 08:12:17 | INFO | train_inner | epoch 001:      2 / 7173 loss=55.997, ppl=7.19017e+16, wps=0, ups=0, wpb=16384, bsz=32, num_updates=1, lr=0.0001, gnorm=14.03, loss_scale=64, train_wall=1, wall=8
2020-11-04 08:12:17 | INFO | train_inner | epoch 001:      3 / 7173 loss=28.372, ppl=3.47501e+08, wps=53097, ups=3.24, wpb=16384, bsz=32, num_updates=2, lr=0.0001, gnorm=15.339, loss_scale=64, train_wall=0, wall=8
2020-11-04 08:12:17 | INFO | train_inner | epoch 001:      4 / 7173 loss=15.855, ppl=59276.8, wps=72355.5, ups=4.42, wpb=16384, bsz=32, num_updates=3, lr=0.0001, gnorm=4.189, loss_scale=64, train_wall=0, wall=8
2020-11-04 08:12:17 | INFO | train_inner | epoch 001:      5 / 7173 loss=14.713, ppl=26858.7, wps=70526.4, ups=4.3, wpb=16384, bsz=32, num_updates=4, lr=0.0001, gnorm=4.751, loss_scale=64, train_wall=0, wall=9
2020-11-04 08:12:18 | INFO | train_inner | epoch 001:      6 / 7173 loss=13.901, ppl=15299.7, wps=73063.5, ups=4.46, wpb=16384, bsz=32, num_updates=5, lr=0.0001, gnorm=4.361, loss_scale=64, train_wall=0, wall=9
2020-11-04 08:12:18 | INFO | train_inner | epoch 001:      7 / 7173 loss=13.312, ppl=10169.5, wps=73559.4, ups=4.49, wpb=16384, bsz=32, num_updates=6, lr=0.0001, gnorm=3.307, loss_scale=64, train_wall=0, wall=9
2020-11-04 08:12:18 | INFO | train_inner | epoch 001:      8 / 7173 loss=12.914, ppl=7720.21, wps=72693.2, ups=4.44, wpb=16384, bsz=32, num_updates=7, lr=0.0001, gnorm=5.473, loss_scale=64, train_wall=0, wall=9
2020-11-04 08:12:18 | INFO | train_inner | epoch 001:      9 / 7173 loss=12.56, ppl=6036.72, wps=73531.2, ups=4.49, wpb=16384, bsz=32, num_updates=8, lr=0.0001, gnorm=6.112, loss_scale=64, train_wall=0, wall=9
2020-11-04 08:12:19 | INFO | train_inner | epoch 001:     10 / 7173 loss=12.116, ppl=4437.77, wps=73187.6, ups=4.47, wpb=16384, bsz=32, num_updates=9, lr=0.0001, gnorm=4.415, loss_scale=64, train_wall=0, wall=10
```

Test Plan: Imported from OSS

Reviewed By: ngoyal2707

Differential Revision: D24729295

Pulled By: myleott

fbshipit-source-id: beee8bdece3eaa0419a2e813990420411e507c75
2020-11-05 15:29:33 -08:00
Myle Ott
f57b148938 Require process group for all helpers in distributed_utils (#1395)
Summary:
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1395

Data parallel command: `python train.py --task dummy_lm   --arch transformer_lm --tokens-per-sample 512   --max-sentences 8 --decoder-attention-heads 8 --dropout 0.0 --activation-dropout 0.0   --optimizer adam --lr 0.0001   --log-format simple --log-interval 1 --no-save --clip-norm 0.0`

Data parallel before:
```
2020-11-04 07:14:16 | INFO | fairseq_cli.train | training on 8 devices (GPUs/TPUs)
2020-11-04 07:14:16 | INFO | fairseq_cli.train | max tokens per GPU = None and batch size per GPU = 8
2020-11-04 07:14:16 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/checkpoint_last.pt
2020-11-04 07:14:16 | INFO | fairseq.trainer | loading train data for epoch 1
2020-11-04 07:14:16 | INFO | fairseq.trainer | NOTE: your device may support faster training with --fp16
2020-11-04 07:14:16 | INFO | fairseq.optim.adam | using FusedAdam
2020-11-04 07:14:16 | INFO | fairseq.trainer | begin training epoch 1
2020-11-04 07:14:21 | INFO | train_inner | epoch 001:      1 / 1563 loss=16.297, ppl=80495, wps=0, ups=0, wpb=32768, bsz=64, num_updates=1, lr=0.0001, gnorm=2.501, train_wall=2, wall=5
2020-11-04 07:14:21 | INFO | train_inner | epoch 001:      2 / 1563 loss=15.399, ppl=43203.8, wps=101398, ups=3.09, wpb=32768, bsz=64, num_updates=2, lr=0.0001, gnorm=2.101, train_wall=0, wall=6
2020-11-04 07:14:21 | INFO | train_inner | epoch 001:      3 / 1563 loss=14.742, ppl=27411.2, wps=217567, ups=6.63, wpb=32768, bsz=64, num_updates=3, lr=0.0001, gnorm=1.888, train_wall=0, wall=6
2020-11-04 07:14:21 | INFO | train_inner | epoch 001:      4 / 1563 loss=14.206, ppl=18899.3, wps=219413, ups=6.69, wpb=32768, bsz=64, num_updates=4, lr=0.0001, gnorm=1.91, train_wall=0, wall=6
2020-11-04 07:14:22 | INFO | train_inner | epoch 001:      5 / 1563 loss=13.697, ppl=13282.1, wps=219446, ups=6.69, wpb=32768, bsz=64, num_updates=5, lr=0.0001, gnorm=1.98, train_wall=0, wall=6
2020-11-04 07:14:22 | INFO | train_inner | epoch 001:      6 / 1563 loss=13.179, ppl=9274.18, wps=220131, ups=6.71, wpb=32768, bsz=64, num_updates=6, lr=0.0001, gnorm=2.08, train_wall=0, wall=6
2020-11-04 07:14:22 | INFO | train_inner | epoch 001:      7 / 1563 loss=12.634, ppl=6358.37, wps=220236, ups=6.72, wpb=32768, bsz=64, num_updates=7, lr=0.0001, gnorm=2.195, train_wall=0, wall=6
2020-11-04 07:14:22 | INFO | train_inner | epoch 001:      8 / 1563 loss=12.056, ppl=4256.86, wps=220392, ups=6.72, wpb=32768, bsz=64, num_updates=8, lr=0.0001, gnorm=2.259, train_wall=0, wall=6
2020-11-04 07:14:22 | INFO | train_inner | epoch 001:      9 / 1563 loss=11.453, ppl=2804.05, wps=225842, ups=6.89, wpb=32768, bsz=64, num_updates=9, lr=0.0001, gnorm=2.287, train_wall=0, wall=7
2020-11-04 07:14:22 | INFO | train_inner | epoch 001:     10 / 1563 loss=10.842, ppl=1835, wps=238808, ups=7.28, wpb=32768, bsz=64, num_updates=10, lr=0.0001, gnorm=2.311, train_wall=0, wall=7
```

Data parallel after:
```
2020-11-04 07:14:47 | INFO | fairseq_cli.train | training on 8 devices (GPUs/TPUs)
2020-11-04 07:14:47 | INFO | fairseq_cli.train | max tokens per GPU = None and batch size per GPU = 8
2020-11-04 07:14:47 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/checkpoint_last.pt
2020-11-04 07:14:47 | INFO | fairseq.trainer | loading train data for epoch 1
2020-11-04 07:14:47 | INFO | fairseq.trainer | NOTE: your device may support faster training with --fp16
2020-11-04 07:14:47 | INFO | fairseq.optim.adam | using FusedAdam
2020-11-04 07:14:47 | INFO | fairseq.trainer | begin training epoch 1
2020-11-04 07:14:52 | INFO | train_inner | epoch 001:      1 / 1563 loss=16.297, ppl=80495, wps=0, ups=0, wpb=32768, bsz=64, num_updates=1, lr=0.0001, gnorm=2.501, train_wall=2, wall=5
2020-11-04 07:14:52 | INFO | train_inner | epoch 001:      2 / 1563 loss=15.399, ppl=43203.8, wps=96089.4, ups=2.93, wpb=32768, bsz=64, num_updates=2, lr=0.0001, gnorm=2.101, train_wall=0, wall=5
2020-11-04 07:14:52 | INFO | train_inner | epoch 001:      3 / 1563 loss=14.742, ppl=27411.2, wps=239285, ups=7.3, wpb=32768, bsz=64, num_updates=3, lr=0.0001, gnorm=1.888, train_wall=0, wall=6
2020-11-04 07:14:53 | INFO | train_inner | epoch 001:      4 / 1563 loss=14.206, ppl=18899.3, wps=233039, ups=7.11, wpb=32768, bsz=64, num_updates=4, lr=0.0001, gnorm=1.91, train_wall=0, wall=6
2020-11-04 07:14:53 | INFO | train_inner | epoch 001:      5 / 1563 loss=13.697, ppl=13282.1, wps=237484, ups=7.24, wpb=32768, bsz=64, num_updates=5, lr=0.0001, gnorm=1.98, train_wall=0, wall=6
2020-11-04 07:14:53 | INFO | train_inner | epoch 001:      6 / 1563 loss=13.179, ppl=9274.18, wps=231683, ups=7.07, wpb=32768, bsz=64, num_updates=6, lr=0.0001, gnorm=2.08, train_wall=0, wall=6
2020-11-04 07:14:53 | INFO | train_inner | epoch 001:      7 / 1563 loss=12.634, ppl=6358.37, wps=233804, ups=7.13, wpb=32768, bsz=64, num_updates=7, lr=0.0001, gnorm=2.195, train_wall=0, wall=6
2020-11-04 07:14:53 | INFO | train_inner | epoch 001:      8 / 1563 loss=12.056, ppl=4256.86, wps=234025, ups=7.14, wpb=32768, bsz=64, num_updates=8, lr=0.0001, gnorm=2.259, train_wall=0, wall=6
2020-11-04 07:14:53 | INFO | train_inner | epoch 001:      9 / 1563 loss=11.453, ppl=2804.05, wps=238426, ups=7.27, wpb=32768, bsz=64, num_updates=9, lr=0.0001, gnorm=2.287, train_wall=0, wall=6
2020-11-04 07:14:53 | INFO | train_inner | epoch 001:     10 / 1563 loss=10.842, ppl=1835, wps=240069, ups=7.32, wpb=32768, bsz=64, num_updates=10, lr=0.0001, gnorm=2.311, train_wall=0, wall=6
```

Model parallel command: `python train.py --task dummy_lm --arch transformer_lm_megatron --decoder-layers 2 --batch-size 2 --tokens-per-sample 512 --log-format simple --log-interval 1 --fp16 --optimizer adam --model-parallel-size 2 --share-decoder-input-output-embed --lr 0.0001`

Model parallel before:
```
2020-11-04 07:12:22 | INFO | fairseq_cli.train | training on 8 devices (GPUs/TPUs)
2020-11-04 07:12:22 | INFO | fairseq_cli.train | max tokens per GPU = None and batch size per GPU = 2
2020-11-04 07:12:22 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/checkpoint_last-model_part-0.pt
2020-11-04 07:12:22 | INFO | fairseq.trainer | loading train data for epoch 1
2020-11-04 07:12:23 | INFO | fairseq.optim.adam | using FusedAdam
2020-11-04 07:12:23 | INFO | fairseq.trainer | begin training epoch 1
2020-11-04 07:12:28 | INFO | train_inner | epoch 001:      1 / 12500 loss=60.017, ppl=1.16627e+18, wps=0, ups=0, wpb=4096, bsz=8, num_updates=1, lr=0.0001, gnorm=8.531, loss_scale=128, train_wall=2, wall=6
2020-11-04 07:12:28 | INFO | train_inner | epoch 001:      2 / 12500 loss=46.473, ppl=9.77028e+13, wps=48996.6, ups=11.95, wpb=4096, bsz=8, num_updates=2, lr=0.0001, gnorm=15.019, loss_scale=128, train_wall=0, wall=6
2020-11-04 07:12:28 | INFO | train_inner | epoch 001:      3 / 12500 loss=30.525, ppl=1.54543e+09, wps=58424.2, ups=14.25, wpb=4096, bsz=8, num_updates=3, lr=0.0001, gnorm=13.936, loss_scale=128, train_wall=0, wall=6
2020-11-04 07:12:28 | INFO | train_inner | epoch 001:      4 / 12500 loss=18.561, ppl=386799, wps=58399.5, ups=14.24, wpb=4096, bsz=8, num_updates=4, lr=0.0001, gnorm=7.251, loss_scale=128, train_wall=0, wall=6
2020-11-04 07:12:28 | INFO | train_inner | epoch 001:      5 / 12500 loss=15.145, ppl=36230, wps=58275.6, ups=14.21, wpb=4096, bsz=8, num_updates=5, lr=0.0001, gnorm=2.392, loss_scale=128, train_wall=0, wall=6
2020-11-04 07:12:28 | INFO | train_inner | epoch 001:      6 / 12500 loss=14.683, ppl=26304.2, wps=58704.8, ups=14.32, wpb=4096, bsz=8, num_updates=6, lr=0.0001, gnorm=2.487, loss_scale=128, train_wall=0, wall=6
2020-11-04 07:12:28 | INFO | train_inner | epoch 001:      7 / 12500 loss=14.169, ppl=18418.9, wps=58449.2, ups=14.26, wpb=4096, bsz=8, num_updates=7, lr=0.0001, gnorm=2.45, loss_scale=128, train_wall=0, wall=6
2020-11-04 07:12:29 | INFO | train_inner | epoch 001:      8 / 12500 loss=13.574, ppl=12197.4, wps=59106.5, ups=14.42, wpb=4096, bsz=8, num_updates=8, lr=0.0001, gnorm=2.393, loss_scale=128, train_wall=0, wall=6
2020-11-04 07:12:29 | INFO | train_inner | epoch 001:      9 / 12500 loss=12.974, ppl=8047.87, wps=58619.6, ups=14.3, wpb=4096, bsz=8, num_updates=9, lr=0.0001, gnorm=2.317, loss_scale=128, train_wall=0, wall=6
2020-11-04 07:12:29 | INFO | train_inner | epoch 001:     10 / 12500 loss=12.341, ppl=5187.55, wps=58166.5, ups=14.19, wpb=4096, bsz=8, num_updates=10, lr=0.0001, gnorm=2.213, loss_scale=128, train_wall=0, wall=6
```

Model parallel after:
```
2020-11-04 07:11:07 | INFO | fairseq_cli.train | training on 8 devices (GPUs/TPUs)
2020-11-04 07:11:07 | INFO | fairseq_cli.train | max tokens per GPU = None and batch size per GPU = 2
2020-11-04 07:11:07 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/checkpoint_last-model_part-0.pt
2020-11-04 07:11:07 | INFO | fairseq.trainer | loading train data for epoch 1
2020-11-04 07:11:08 | INFO | fairseq.optim.adam | using FusedAdam
2020-11-04 07:11:08 | INFO | fairseq.trainer | begin training epoch 1
2020-11-04 07:11:13 | INFO | train_inner | epoch 001:      1 / 12500 loss=60.017, ppl=1.16627e+18, wps=0, ups=0, wpb=4096, bsz=8, num_updates=1, lr=0.0001, gnorm=8.531, loss_scale=128, train_wall=2, wall=6
2020-11-04 07:11:13 | INFO | train_inner | epoch 001:      2 / 12500 loss=46.473, ppl=9.77028e+13, wps=47018.1, ups=11.47, wpb=4096, bsz=8, num_updates=2, lr=0.0001, gnorm=15.019, loss_scale=128, train_wall=0, wall=6
2020-11-04 07:11:13 | INFO | train_inner | epoch 001:      3 / 12500 loss=30.525, ppl=1.54543e+09, wps=59292.6, ups=14.46, wpb=4096, bsz=8, num_updates=3, lr=0.0001, gnorm=13.936, loss_scale=128, train_wall=0, wall=6
2020-11-04 07:11:13 | INFO | train_inner | epoch 001:      4 / 12500 loss=18.561, ppl=386799, wps=57708.9, ups=14.08, wpb=4096, bsz=8, num_updates=4, lr=0.0001, gnorm=7.251, loss_scale=128, train_wall=0, wall=6
2020-11-04 07:11:14 | INFO | train_inner | epoch 001:      5 / 12500 loss=15.145, ppl=36230, wps=57427.4, ups=14.01, wpb=4096, bsz=8, num_updates=5, lr=0.0001, gnorm=2.392, loss_scale=128, train_wall=0, wall=6
2020-11-04 07:11:14 | INFO | train_inner | epoch 001:      6 / 12500 loss=14.683, ppl=26304.2, wps=58730.2, ups=14.33, wpb=4096, bsz=8, num_updates=6, lr=0.0001, gnorm=2.487, loss_scale=128, train_wall=0, wall=6
2020-11-04 07:11:14 | INFO | train_inner | epoch 001:      7 / 12500 loss=14.169, ppl=18418.9, wps=59523.2, ups=14.52, wpb=4096, bsz=8, num_updates=7, lr=0.0001, gnorm=2.45, loss_scale=128, train_wall=0, wall=6
2020-11-04 07:11:14 | INFO | train_inner | epoch 001:      8 / 12500 loss=13.574, ppl=12197.4, wps=58945.2, ups=14.38, wpb=4096, bsz=8, num_updates=8, lr=0.0001, gnorm=2.393, loss_scale=128, train_wall=0, wall=6
2020-11-04 07:11:14 | INFO | train_inner | epoch 001:      9 / 12500 loss=12.974, ppl=8047.87, wps=59659.2, ups=14.55, wpb=4096, bsz=8, num_updates=9, lr=0.0001, gnorm=2.317, loss_scale=128, train_wall=0, wall=7
2020-11-04 07:11:14 | INFO | train_inner | epoch 001:     10 / 12500 loss=12.341, ppl=5187.55, wps=59681.4, ups=14.56, wpb=4096, bsz=8, num_updates=10, lr=0.0001, gnorm=2.213, loss_scale=128, train_wall=0, wall=7
```

Test Plan: Imported from OSS

Reviewed By: ngoyal2707

Differential Revision: D24728687

Pulled By: myleott

fbshipit-source-id: 2d387d022ee889494f429b98df1942167896e306
2020-11-05 09:44:32 -08:00
alexeib
b58f4f017e end to end hydra configs (#1393)
Summary:
this adds a `hydra_train` binary that uses Hydra configs and command-line overrides instead of argparse

use case 1: built-in configs + overrides from the command line

```
python fairseq_cli/hydra_train.py distributed_training.distributed_world_size=1 dataset.batch_size=2 task.data=/private/home/myleott/data/data-bin/wikitext-103-roberta-bpe-bin/ model=transformer_lm/transformer_lm_gpt task=language_modeling optimization.max_update=5000
```
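Each dotted override above either selects a config group or sets a leaf value. As a rough illustration only (not Hydra's actual implementation), an override like `dataset.batch_size=2` walks a nested config and replaces one leaf:

```python
def apply_override(cfg, override):
    # "dataset.batch_size=2" -> cfg["dataset"]["batch_size"] = 2
    path, _, raw = override.partition("=")
    keys = path.split(".")
    node = cfg
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    # Minimal literal parsing: try int, then float, else keep the string.
    try:
        value = int(raw)
    except ValueError:
        try:
            value = float(raw)
        except ValueError:
            value = raw
    node[keys[-1]] = value

cfg = {"dataset": {"batch_size": 8}, "optimization": {}}
apply_override(cfg, "dataset.batch_size=2")
apply_override(cfg, "optimization.max_update=5000")
apply_override(cfg, "task.data=/path/to/data-bin")
```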

use case 2: an external config file that replaces the bundled configs (dataclass defaults still apply)

```
python fairseq_cli/hydra_train.py --config-path ~/fairseq-py-dev/lm --config-name wiki103
```

the config file contains this:

```
# @package _group_

model:
  _name: transformer_lm
distributed_training:
  distributed_world_size: 1
dataset:
  batch_size: 2
task:
  _name: language_modeling
  data: /private/home/myleott/data/data-bin/wikitext-103-roberta-bpe-bin/
  add_bos_token: false
  max_target_positions: 1024
optimization:
  max_update: 50000
  lr: [ 0.25 ]
criterion: cross_entropy
optimizer: adam
lr_scheduler:
  _name: cosine
```

use case 3: an external config directory that provides additional configs (e.g. for models)

```
python fairseq_cli/hydra_train.py distributed_training.distributed_world_size=1 dataset.batch_size=2 task.data=/private/home/myleott/data/data-bin/wikitext-103-roberta-bpe-bin/ model=transformer_lm/2_layers task=language_modeling optimization.max_update=5000 --config-dir ~/fairseq-py-dev/lm/hydra
```

where ~/fairseq-py-dev/lm/hydra has the following structure:

- model
  - transformer_lm
    - 2_layers.yaml

and inside `2_layers.yaml` is a copy of `transformer_lm_gpt.yaml`, but with `decoder_layers` set to 2
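A rough sketch of the lookup this enables: `model=transformer_lm/2_layers` names a yaml file under a `model/` group inside one of the search directories. The helper below is illustrative only, not Hydra's real search logic:

```python
import tempfile
from pathlib import Path

def resolve_config(group_choice, config_dirs):
    # Hypothetical helper: "model=transformer_lm/2_layers" names
    # <dir>/model/transformer_lm/2_layers.yaml in some search dir.
    group, _, name = group_choice.partition("=")
    for d in config_dirs:
        candidate = Path(d) / group / f"{name}.yaml"
        if candidate.exists():
            return candidate
    return None

# Build a throwaway directory shaped like the tree above.
root = Path(tempfile.mkdtemp())
(root / "model" / "transformer_lm").mkdir(parents=True)
(root / "model" / "transformer_lm" / "2_layers.yaml").write_text("decoder_layers: 2\n")

found = resolve_config("model=transformer_lm/2_layers", [root])
```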

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1393

Reviewed By: myleott

Differential Revision: D24722252

Pulled By: alexeib

fbshipit-source-id: 758ea431fa099cd7c0e4daf41eff680df1d3b841
2020-11-04 18:20:12 -08:00
Alex Xiao
ea4ccd94de Load and broadcast fairseq checkpoints instead of having each rank load them individually
Summary:
This diff is based on feedback in D24379649

Before, when loading checkpoints:

Each rank loaded the checkpoint from Manifold independently.

Now:

Rank 0 loads the checkpoint from Manifold and broadcasts it to all other ranks, which saves IO.

Furthermore, when doing zero-sharding, we only broadcast the relevant parts of the optimizer state to each node. This makes checkpoint loading more memory-efficient and should enable loading models beyond 2-3B parameters.
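A toy single-process illustration of the IO saving, with plain-Python stand-ins for the real storage read and the `torch.distributed` broadcast:

```python
# Counting storage reads shows why rank-0-load + broadcast beats
# every rank reading the checkpoint file itself.
io_reads = 0

def load_from_storage():
    # Stand-in for reading the checkpoint file (e.g. from Manifold).
    global io_reads
    io_reads += 1
    return {"model": [0.1, 0.2], "optimizer": {"step": 100}}

def broadcast(obj, world_size):
    # Stand-in for a torch.distributed object broadcast: rank 0's
    # object is replicated to every other rank over the network.
    return [obj] * world_size

WORLD_SIZE = 8
checkpoint = load_from_storage()           # only rank 0 touches storage
states = broadcast(checkpoint, WORLD_SIZE)  # everyone else receives it
```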

Reviewed By: myleott

Differential Revision: D24660791

fbshipit-source-id: e30b2ea5990083375e4549f0427a112346ba170d
2020-11-04 12:57:29 -08:00
Myle Ott
1a709b2a40 Reproduce #1781. Add Weights and Biases support
Summary:

Fixes https://github.com/pytorch/fairseq/issues/1790.

Reviewed By: alexeib

Differential Revision: D24579153

fbshipit-source-id: 74a30effa164db9d6376554376e36b1f47618899

Co-authored-by: Nikolay Korolev <korolevns98@gmail.com>
Co-authored-by: Vlad Lyalin <Guitaricet@gmail.com>
2020-11-03 20:48:00 -08:00
Myle Ott
dd52ed0f38 Small fixes (#1392)
Summary:
- Set default value of clip-norm back to 0.0 (disabled)
- Add comment explaining that we divide loss by log(2) to convert the base
- Fix `--zero-optimizer=os` (fixes #2811)
- Update requirements to PyTorch >= 1.5
- Fix bug in fixed LR schedule
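The base-conversion comment in the list above boils down to one line of arithmetic; a minimal illustration (the loss value is made up):

```python
import math

# Cross-entropy comes out of the criterion in nats (natural log); dividing
# by log(2) converts it to bits, so the reported ppl is simply 2 ** loss.
loss_nats = 4.852                    # hypothetical per-token loss in nats
loss_bits = loss_nats / math.log(2)  # same quantity, base 2
ppl = 2 ** loss_bits                 # identical to math.exp(loss_nats)
```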

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1392

Reviewed By: alexeib

Differential Revision: D24714231

Pulled By: myleott

fbshipit-source-id: 63dc8cfc74683bbccbf05b44228014eb12ddbfc7
2020-11-03 20:45:06 -08:00
Joshua Meier
b120fbbe8f Fix correctness issue with megatron save/load checkpoints (#1386)
Summary:
# Before submitting

- [x] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [x] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [x] Did you make sure to update the docs?
- [x] Did you write any new necessary tests?

## What does this PR do?
Fixes https://github.com/pytorch/fairseq/issues/2681.

Proof that it's working now:
```
python fairseq_train.py --task masked_lm /checkpoint/bioseq_nonsecure/model-parallel-data/tiny_sample_valid_ur50-bin  --dataset-impl fasta  --save-dir checkpoints/mp-fix4    --dropout 0.1   --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0   --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07   --tokens-per-sample 128 --sample-break-mode none   --max-tokens 128 --no-progress-bar --log-interval 1 --seed 4 --max-epoch 1 --max-update 50 --encoder-layers 4  --arch model_parallel_roberta_large --model-parallel-size 2 --update-freq 2 --save-interval-updates 10

2020-10-29 18:42:08 | INFO | train_inner | epoch 001:     11 / 78 loss=0.939, ppl=1.92, wps=116.7, ups=0.11, wpb=1024, bsz=8, num_updates=11, lr=1.47473e-06, gnorm=2.276, train_wall=0, wall=15
2020-10-29 18:42:08 | INFO | train_inner | epoch 001:     12 / 78 loss=0.938, ppl=1.92, wps=15769.2, ups=15.38, wpb=1024, bsz=8, num_updates=12, lr=1.5997e-06, gnorm=2.612, train_wall=0, wall=15
2020-10-29 18:42:08 | INFO | train_inner | epoch 001:     13 / 78 loss=0.877, ppl=1.84, wps=18658.8, ups=18.2, wpb=1024, bsz=8, num_updates=13, lr=1.72468e-06, gnorm=2.798, train_wall=0, wall=15
2020-10-29 18:42:08 | INFO | train_inner | epoch 001:     14 / 78 loss=0.887, ppl=1.85, wps=18324.5, ups=17.88, wpb=1024, bsz=8, num_updates=14, lr=1.84965e-06, gnorm=2.326, train_wall=0, wall=15
2020-10-29 18:42:08 | INFO | train_inner | epoch 001:     15 / 78 loss=0.867, ppl=1.82, wps=17616.5, ups=17.19, wpb=1024, bsz=8, num_updates=15, lr=1.97463e-06, gnorm=2.112, train_wall=0, wall=15
2020-10-29 18:42:08 | INFO | train_inner | epoch 001:     16 / 78 loss=0.891, ppl=1.85, wps=18624.5, ups=18.17, wpb=1024, bsz=8, num_updates=16, lr=2.0996e-06, gnorm=2.123, train_wall=0, wall=16
2020-10-29 18:42:08 | INFO | train_inner | epoch 001:     17 / 78 loss=0.887, ppl=1.85, wps=17972.5, ups=17.53, wpb=1024, bsz=8, num_updates=17, lr=2.22458e-06, gnorm=2.061, train_wall=0, wall=16
2020-10-29 18:42:08 | INFO | train_inner | epoch 001:     18 / 78 loss=0.862, ppl=1.82, wps=14672.4, ups=14.32, wpb=1024, bsz=8, num_updates=18, lr=2.34955e-06, gnorm=2.282, train_wall=0, wall=16
2020-10-29 18:42:08 | INFO | train_inner | epoch 001:     19 / 78 loss=0.876, ppl=1.83, wps=14398.6, ups=14.05, wpb=1024, bsz=8, num_updates=19, lr=2.47453e-06, gnorm=2.261, train_wall=0, wall=16
2020-10-29 18:42:08 | INFO | train_inner | epoch 001:     20 / 78 loss=0.818, ppl=1.76, wps=18652.2, ups=18.2, wpb=1024, bsz=8, num_updates=20, lr=2.5995e-06, gnorm=1.969, train_wall=0, wall=16

...relaunch...

2020-10-29 18:47:20 | INFO | train_inner | epoch 001:     11 / 78 loss=0.939, ppl=1.92, wps=98.2, ups=0.1, wpb=1024, bsz=8, num_updates=11, lr=1.47473e-06, gnorm=2.276, train_wall=1, wall=0
2020-10-29 18:47:20 | INFO | train_inner | epoch 001:     12 / 78 loss=0.938, ppl=1.92, wps=17137.8, ups=16.72, wpb=1024, bsz=8, num_updates=12, lr=1.5997e-06, gnorm=2.612, train_wall=0, wall=0
2020-10-29 18:47:20 | INFO | train_inner | epoch 001:     13 / 78 loss=0.877, ppl=1.84, wps=17239.6, ups=16.82, wpb=1024, bsz=8, num_updates=13, lr=1.72468e-06, gnorm=2.798, train_wall=0, wall=0
2020-10-29 18:47:20 | INFO | train_inner | epoch 001:     14 / 78 loss=0.887, ppl=1.85, wps=18132, ups=17.69, wpb=1024, bsz=8, num_updates=14, lr=1.84965e-06, gnorm=2.326, train_wall=0, wall=0
2020-10-29 18:47:20 | INFO | train_inner | epoch 001:     15 / 78 loss=0.867, ppl=1.82, wps=17795.1, ups=17.36, wpb=1024, bsz=8, num_updates=15, lr=1.97463e-06, gnorm=2.112, train_wall=0, wall=0
2020-10-29 18:47:20 | INFO | train_inner | epoch 001:     16 / 78 loss=0.891, ppl=1.85, wps=18021.3, ups=17.58, wpb=1024, bsz=8, num_updates=16, lr=2.0996e-06, gnorm=2.123, train_wall=0, wall=0
2020-10-29 18:47:20 | INFO | train_inner | epoch 001:     17 / 78 loss=0.887, ppl=1.85, wps=16452.9, ups=16.05, wpb=1024, bsz=8, num_updates=17, lr=2.22458e-06, gnorm=2.061, train_wall=0, wall=0
2020-10-29 18:47:20 | INFO | train_inner | epoch 001:     18 / 78 loss=0.862, ppl=1.82, wps=17563.3, ups=17.14, wpb=1024, bsz=8, num_updates=18, lr=2.34955e-06, gnorm=2.282, train_wall=0, wall=0
2020-10-29 18:47:20 | INFO | train_inner | epoch 001:     19 / 78 loss=0.876, ppl=1.83, wps=16770.3, ups=16.36, wpb=1024, bsz=8, num_updates=19, lr=2.47453e-06, gnorm=2.261, train_wall=0, wall=0
2020-10-29 18:47:20 | INFO | train_inner | epoch 001:     20 / 78 loss=0.818, ppl=1.76, wps=16808.2, ups=16.4, wpb=1024, bsz=8, num_updates=20, lr=2.5995e-06, gnorm=1.969, train_wall=0, wall=0
```

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding!

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1386

Reviewed By: myleott

Differential Revision: D24640946

Pulled By: joshim5

fbshipit-source-id: cb141d92496b289a04d53f080ecd4d5ac6941672
2020-11-03 14:07:06 -08:00
Shashank Jain
de977736f9 Support running batch of sentences together on GPU with BART fill_mask (#2833)
Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/2833

Add support for filling masks using BART on a batch of sentences. This is helpful when running on GPU.

Reviewed By: myleott

Differential Revision: D24687773

fbshipit-source-id: 1b8005c18a09be526f40e9e2b99207afa38e0f1a
2020-11-02 17:17:15 -08:00
Yuqing Tang
de859692ff Enable translation_multi_simple_epoch to have different source and target dictionaries
Summary: In the past, we always used a shared dictionary for multilingual experiments. This diff re-enables different dictionaries for source and target languages by changing the assertion criteria and reverting to using language-specific dictionaries for source_dict and target_dict.

Reviewed By: chtran

Differential Revision: D24637682

fbshipit-source-id: a982e4f1e48395cc5bf10dc03b98fbe970062f8d
2020-10-30 18:25:25 -07:00
Myle Ott
a4356b1da2 Simplify --user-dir and require user-dir module name to be globally unique (#2815)
Summary:
This PR reverts recent changes that attempted to make `--user-dir` work with non-unique module names. But that new approach introduced other issues (e.g., poor compatibility with multiprocessing and Windows), so let's revert to the previous simpler implementation.
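A minimal sketch of what the simpler `--user-dir` handling looks like, assuming the directory basename is used directly as the module name (the helper below is illustrative, not fairseq's exact code):

```python
import importlib
import os
import sys


def import_user_module(module_path):
    # The directory's basename becomes the module name, which is why it
    # must be globally unique: a collision with an already-importable
    # module would shadow one or the other (hypothetical sketch).
    module_path = os.path.abspath(module_path)
    module_name = os.path.basename(module_path)
    if module_name not in sys.modules:
        sys.path.insert(0, os.path.dirname(module_path))
        importlib.import_module(module_name)
    return sys.modules[module_name]
```

Because it relies only on the normal import machinery, this plays well with multiprocessing (child processes can re-import by name) and with Windows, which is what the reverted approach struggled with.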

Pull Request resolved: https://github.com/pytorch/fairseq/pull/2815

Reviewed By: alexeib

Differential Revision: D24611571

Pulled By: myleott

fbshipit-source-id: cecfe28395585ca0401f844f10bd0d49d014c4d8
2020-10-29 17:08:20 -07:00
Anuroop Sriram
6debe29150 Compute WER for Wav2Vec 2.0 Seq2Seq models (#1376)
Summary:
# Before submitting

- [X] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?

## What does this PR do?
Adds support to compute WER for wav2vec2.0 seq2seq models.

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1376

Reviewed By: alexeib

Differential Revision: D24611516

Pulled By: anuroopsriram

fbshipit-source-id: dd7daab73ebccc21367dd51f41a11e89c404977b
2020-10-29 11:46:08 -07:00
Myle Ott
4cdc81f6f1 Support activation checkpointing in Transformer (#1378)
Summary:
Without activation checkpointing (peak GPU memory usage: 7138MiB)
```
$ python train.py --task dummy_mt --arch transformer --dropout 0.1 --max-tokens 4096 --optimizer adam --lr 0.00001 --log-format simple --log-interval 25 --fp16
(...)
2020-10-28 08:03:03 | INFO | train_inner | epoch 001:     25 / 92 loss=12.67, ppl=6517.2, wps=281380, ups=8.61, wpb=32640, bsz=1088, num_updates=25, lr=1e-05, gnorm=8.541, clip=0, loss_scale=128, train_wall=5, wall=10
2020-10-28 08:03:05 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 64.0
2020-10-28 08:03:06 | INFO | train_inner | epoch 001:     51 / 92 loss=8.938, ppl=490.52, wps=302975, ups=9.28, wpb=32640, bsz=1088, num_updates=50, lr=1e-05, gnorm=6.395, clip=0, loss_scale=64, train_wall=3, wall=12
2020-10-28 08:03:08 | INFO | train_inner | epoch 001:     76 / 92 loss=3.855, ppl=14.47, wps=316039, ups=9.68, wpb=32640, bsz=1088, num_updates=75, lr=1e-05, gnorm=9.078, clip=0, loss_scale=64, train_wall=3, wall=15
2020-10-28 08:03:10 | INFO | fairseq_cli.train | begin validation on "valid" subset
2020-10-28 08:03:17 | INFO | valid | epoch 001 | valid on 'valid' subset | loss 0.048 | ppl 1.03 | wps 1.09646e+06 | wpb 32640 | bsz 1088 | num_updates 91
```

With activation checkpointing (peak GPU memory usage: 6466MiB)
```
$ python train.py --checkpoint-activations --task dummy_mt --arch transformer --dropout 0.1 --max-tokens 4096 --optimizer adam --lr 0.00001 --log-format simple --log-interval 25 --fp16
(...)
2020-10-28 08:01:50 | INFO | train_inner | epoch 001:     25 / 92 loss=12.67, ppl=6517.22, wps=291110, ups=8.91, wpb=32640, bsz=1088, num_updates=25, lr=1e-05, gnorm=8.541, clip=0, loss_scale=128, train_wall=4, wall=9
2020-10-28 08:01:51 | INFO | fairseq.trainer | NOTE: overflow detected, setting loss scale to: 64.0
2020-10-28 08:01:52 | INFO | train_inner | epoch 001:     51 / 92 loss=8.938, ppl=490.54, wps=295438, ups=9.05, wpb=32640, bsz=1088, num_updates=50, lr=1e-05, gnorm=6.394, clip=0, loss_scale=64, train_wall=3, wall=12
2020-10-28 08:01:55 | INFO | train_inner | epoch 001:     76 / 92 loss=3.855, ppl=14.47, wps=308351, ups=9.45, wpb=32640, bsz=1088, num_updates=75, lr=1e-05, gnorm=9.082, clip=0, loss_scale=64, train_wall=3, wall=14
2020-10-28 08:01:57 | INFO | fairseq_cli.train | begin validation on "valid" subset
```
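Activation checkpointing itself is a thin wrapper around `torch.utils.checkpoint`; a minimal standalone sketch of the memory/compute trade-off, assuming a recent PyTorch that accepts the `use_reentrant` flag:

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU())
x = torch.randn(4, 8, requires_grad=True)

# Only the input is kept alive during forward; the layer's intermediate
# activations are recomputed during backward, trading extra compute for
# lower peak memory (as the MiB numbers above show).
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()
```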

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1378

Reviewed By: min-xu-ai

Differential Revision: D24593170

Pulled By: myleott

fbshipit-source-id: 701254e603a2277d22f8b3bcc3ebbade54bb7479
2020-10-28 18:35:56 -07:00
alexeib
b7d8b9dce2 fix architecture params (#1382)
Summary:
fixes architectures not getting applied to migrated models

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1382

Reviewed By: myleott

Differential Revision: D24603110

Pulled By: alexeib

fbshipit-source-id: 18f44d3736853282466feed5e8896db95338b097
2020-10-28 18:29:11 -07:00
freewym
9c66ff54c4 build_generator() in generator.py should accept cfg.generation instea… (#2813)
Summary:
…d of cfg.task

# Before submitting

- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [ ] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

## What does this PR do?
Fixes # (issue).

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding!

Pull Request resolved: https://github.com/pytorch/fairseq/pull/2813

Reviewed By: alexeib

Differential Revision: D24604698

Pulled By: myleott

fbshipit-source-id: e41996147203ec47274ded803bab910460a19eb3
2020-10-28 18:21:08 -07:00
Myle Ott
e4e01780f8 Fix dummy LM task (#1381)
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1381

Reviewed By: alexeib

Differential Revision: D24603479

Pulled By: myleott

fbshipit-source-id: 5aae8da9c0f20d6526c98b0b37bf9b32a8c78393
2020-10-28 18:19:07 -07:00
alexeib
65b02d529a fix wav2vec infer and finetuning (#1384)
Summary:
Fixes #2807, #2810, #2519

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1384

Reviewed By: myleott

Differential Revision: D24605451

Pulled By: alexeib

fbshipit-source-id: 46ec8f273ac2fab86bd444461e2706c35608b250
2020-10-28 17:18:12 -07:00
alexeib
f6d9313092 fix eval lm (#1380)
Summary:
fixes eval_lm, which wasn't parsing arguments correctly

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1380

Reviewed By: myleott

Differential Revision: D24600415

Pulled By: alexeib

fbshipit-source-id: eb56575bef4d20a3cd5cee3dcd279046f085d938
2020-10-28 14:59:44 -07:00
Elijah Rippeth
3c726544d2 fix issue where is_initialized is not available in single-worker paradigm (#2801)
Summary:
# Before submitting

- [x] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [x] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

## What does this PR do?
Fixes https://github.com/pytorch/fairseq/issues/1205

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding!

Pull Request resolved: https://github.com/pytorch/fairseq/pull/2801

Reviewed By: alexeib

Differential Revision: D24579193

Pulled By: myleott

fbshipit-source-id: bcb14bb588d4538398bff4114e0a387fd29818c5
2020-10-28 14:54:11 -07:00
Myle Ott
1bc83c703a Misc fixes (#2786)
Summary:
- Rename type -> key in fairseq/tasks/sentence_prediction.py (fixes https://github.com/pytorch/fairseq/issues/2746)
- Update preprocessing docs (fixes https://github.com/pytorch/fairseq/issues/2565)
- Turn off logging in test_fp16_optimizer.TestGradientScaling
- Documentation updates
- Remove some unused code
- Fix noisychannel example (fixes https://github.com/pytorch/fairseq/issues/2213)

Pull Request resolved: https://github.com/pytorch/fairseq/pull/2786

Reviewed By: shruti-bh

Differential Revision: D24515146

Pulled By: myleott

fbshipit-source-id: 86b0f5516c57610fdca801c60e58158ef052fc3a
2020-10-27 11:26:07 -07:00
Myle Ott
01be083e46 Centralize hydra init (and support packaged location of configs) (#2784)
Summary:
Configs can either be in `/fairseq/configs` (once the package is installed) or `/configs` (if using an editable installation). This centralizes the hydra init and supports these two possible config locations.
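Resolving the two possible config locations amounts to a simple filesystem probe; a hypothetical sketch (path names taken from the description above, not from the actual diff):

```python
import os


def infer_config_dir(root):
    # An installed package ships configs under fairseq/configs, while an
    # editable checkout keeps them under configs/ at the repo root
    # (illustrative layout based on the commit description).
    for candidate in (
        os.path.join(root, "fairseq", "configs"),
        os.path.join(root, "configs"),
    ):
        if os.path.isdir(candidate):
            return candidate
    raise FileNotFoundError(f"no config directory found under {root}")
```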

Pull Request resolved: https://github.com/pytorch/fairseq/pull/2784

Reviewed By: alexeib

Differential Revision: D24513586

Pulled By: myleott

fbshipit-source-id: 8e10a88177ebcf809d5d37d448d2b384142febef
2020-10-27 07:46:44 -07:00
Shruti Bhosale
beeac0ad68 Get 12B M2M-100 model generation to work correctly on exactly 2 32gb gpus (#1366)
Summary:
# What does this PR do?
Addresses https://github.com/pytorch/fairseq/issues/2772 where external users can't generate using the model because the README is currently not accurate.
This PR fixes the issues in the README

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1366

Reviewed By: edunov

Differential Revision: D24455634

Pulled By: shruti-bh

fbshipit-source-id: 480a11f8b95d1278162d585700e58d467a35d35a
2020-10-27 02:15:13 -07:00
Vladimir Smirnov
81677d751d Update README.md (#2796)
Summary:
Fixed link.

# Before submitting

- [-] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [+] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [+] Did you make sure to update the docs?
- [-] Did you write any new necessary tests?

## What does this PR do?
Fixes link.

Pull Request resolved: https://github.com/pytorch/fairseq/pull/2796

Reviewed By: nlaptev

Differential Revision: D24538759

Pulled By: myleott

fbshipit-source-id: af947f432c34ca2aec35c9fe59dd1214e363450b
2020-10-26 08:18:52 -07:00
alexeib
3c41478083 fix loading emissions (#1375)
Summary:
broken in last change to infer.py

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1375

Reviewed By: xuqiantong

Differential Revision: D24531499

Pulled By: alexeib

fbshipit-source-id: fab60abf67a05c48e1ff750fac3ab6d4c0fa2770
2020-10-25 12:54:22 -07:00
alexeib
6ee0364685 fix building components when no configuration is provided (#1374)
Summary:
see title; in particular, this fixes running generate.py with --scoring wer

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1374

Reviewed By: kahne

Differential Revision: D24527059

Pulled By: alexeib

fbshipit-source-id: b01994441fda12eafd4e465d147047c6e84a8335
2020-10-24 21:20:05 -07:00
alexeib
c147060598 add new w2v models (#1373)
Summary:
update readme to add new wav2vec models (incl w/ self training)

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1373

Reviewed By: michaelauli

Differential Revision: D24524182

Pulled By: alexeib

fbshipit-source-id: c918971f8009b11855908e71bfcc247cf6776a8f
2020-10-24 10:21:31 -07:00
Shashank Jain
4b0cf6649b Revert "Fix deprecated usage of nonzero()"
Summary: Reverting the diff because it has already been fixed in https://github.com/pytorch/pytorch/pull/45413

Reviewed By: myleott

Differential Revision: D24511658

fbshipit-source-id: a5561dae50d69a03443ca8a60bebe2cd064e3ee0
2020-10-23 14:45:25 -07:00
alexeib
2409d5a36e refactor dataclass related files, add proper types for static checkin… (#1371)
Summary:
- refactor dataclass/ hierarchy to make it a bit more sane (while avoiding circular references)
- add top level FairseqConfig
- change typehints to reflect the correct config type if it is known

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1371

Reviewed By: myleott

Differential Revision: D24469026

Pulled By: alexeib

fbshipit-source-id: 01f68918f761d51ec5216286b8959ad35f41a7b2
2020-10-23 00:07:33 -07:00
alexeib
cd2bba4419 rename remove_bpe to post_process; add aliasing (#1369)
Summary:
Some binaries (e.g. speech-based ones) used --post-process while others used --remove-bpe. --post-process is the more appropriate name, since the option now does more than just remove BPE. This renames remove_bpe to post_process, adds an alias so existing command lines keep working, and adds checkpoint upgrades so old checkpoints continue to work as well.
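Aliasing a renamed flag so old command lines keep working can be done directly in argparse by giving one option two spellings; a sketch (the `const` value is illustrative):

```python
import argparse

parser = argparse.ArgumentParser()
# One option, two spellings: the legacy --remove-bpe flag keeps working,
# and both write to the same post_process destination.
parser.add_argument(
    "--post-process", "--remove-bpe",
    dest="post_process", nargs="?", const="subword_nmt", default=None,
)

args = parser.parse_args(["--remove-bpe"])
```

Upgrading stored checkpoints is the complementary half: old state dicts that carry `remove_bpe` get the key rewritten to `post_process` at load time.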

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1369

Reviewed By: myleott

Differential Revision: D24465040

Pulled By: alexeib

fbshipit-source-id: 1b3e388291ccc403e76e069ef6606b80ead863a7
2020-10-22 16:31:49 -07:00
Myle Ott
e0737c3c29 Dynamically generate versions based on commit hash (#2774)
Summary:
This will produce version strings like `1.0.0a0+3065963`, similar to PyTorch version strings.
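Generating such a version string typically means shelling out to git at build time; a minimal sketch (the function name and fallback behavior are assumptions, not the actual setup.py code):

```python
import subprocess


def version_with_commit(base_version="1.0.0a0"):
    # Append the short commit hash, mirroring PyTorch-style version
    # strings such as 1.0.0a0+3065963.
    try:
        sha = (
            subprocess.check_output(
                ["git", "rev-parse", "--short", "HEAD"],
                stderr=subprocess.DEVNULL,
            )
            .decode()
            .strip()
        )
        return f"{base_version}+{sha}"
    except (subprocess.CalledProcessError, OSError):
        return base_version  # not a git checkout, e.g. an sdist install
```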

Pull Request resolved: https://github.com/pytorch/fairseq/pull/2774

Reviewed By: alexeib

Differential Revision: D24453517

Pulled By: myleott

fbshipit-source-id: 03a0c324ed6124bbc513ba7edc954abd71d63a0f
2020-10-22 12:51:04 -07:00
Myle Ott
b8a938e96e BART hub fixes + improvements (#1342)
Summary:
- Make BART hub interface extend from GeneratorHubInterface (fixes #1748)
- Add mask filling interface for BART

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1342

Reviewed By: ngoyal2707

Differential Revision: D24264195

Pulled By: myleott

fbshipit-source-id: 0885f90a54fabe1672b1bfe137dfbccbc5d25d0e
2020-10-22 12:45:49 -07:00
alexeib
f0fcb55d5b fix #2764 (#1368)
Summary:
fix interactive.py + add args from tasks before registries (where we catch errors)

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1368

Reviewed By: myleott

Differential Revision: D24462871

Pulled By: alexeib

fbshipit-source-id: 307b829c935aa5061bdd79d8cc339eaf87fd8845
2020-10-22 12:19:31 -07:00
Myle Ott
11aaffdd18 rm FairseqModel::upgrade_args, it's not needed anymore (#1363)
Summary:
Tests seem to pass without it, so let's remove it

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1363

Reviewed By: alexeib

Differential Revision: D24452369

Pulled By: myleott

fbshipit-source-id: 186933ff3ee16be61c77a9581658db8e853c1baa
2020-10-22 12:05:39 -07:00
Chau Tran
31c23baafc Fix fairseq/criss README
Summary: Add requirements, fix wrong command

Reviewed By: tangyuq

Differential Revision: D24452748

fbshipit-source-id: 4837610ea7e5b5df8caecc685226080cafddb3e0
2020-10-22 11:30:54 -07:00
alexeib
18cadab1d0 support new cfg based models; make sure --normalize is consistent in … (#1370)
Summary:
support new cfg-based models; make sure --normalize in infer is consistent with the model

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1370

Reviewed By: myleott

Differential Revision: D24467698

Pulled By: alexeib

fbshipit-source-id: 056b3608e3c1fe8acdb3e45e0306de5d874cb4d1
2020-10-22 07:08:54 -07:00
Pavel Soriano
751bcbfcb9 Changed EnvironmentError to RuntimeError in get_from_cache (#2767)
Summary:
# Before submitting

- [x] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [x] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [x] Did you make sure to update the docs?
No need I believe
- [x] Did you write any new necessary tests?
No
## What does this PR do?
Fixes https://github.com/pytorch/fairseq/issues/2724

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Yes! It is not a big PR at all but it allowed me to familiarize with the caching/downloading logic used in fairseq (which is very similar to that used in pytorch/transformers)

Pull Request resolved: https://github.com/pytorch/fairseq/pull/2767

Reviewed By: edunov

Differential Revision: D24456055

Pulled By: myleott

fbshipit-source-id: bc634a9b97f957ecc5a8da57b112ff892e492107
2020-10-22 06:31:07 -07:00
Myle Ott
43c69a7666 Fix deprecated usage of nonzero() (#1364)
Summary:
PyTorch requires the `as_tuple` argument now, otherwise it prints warnings. Let's just fix this everywhere
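For reference, the two `as_tuple` modes return differently shaped results:

```python
import torch

t = torch.tensor([[0, 1], [2, 0]])

idx = t.nonzero(as_tuple=False)        # index matrix, shape (num_nonzero, ndim)
rows, cols = t.nonzero(as_tuple=True)  # one 1-D index tensor per dimension
```

Passing the argument explicitly silences the deprecation warning while keeping the legacy matrix-of-indices behavior where code expects it.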

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1364

Reviewed By: edunov

Differential Revision: D24452587

Pulled By: myleott

fbshipit-source-id: 7e6d424792ffec74a6197b2a266600cb13f24770
2020-10-22 06:28:27 -07:00
Myle Ott
0f44e89c38 Fix Latent Depth args (#1365)
Summary:
Args should be registered in the Model rather than modules

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1365

Reviewed By: pipibjc

Differential Revision: D24453007

Pulled By: myleott

fbshipit-source-id: d22b0d86a3c940456b394b005acab4bb6a3f5bed
2020-10-21 15:32:28 -07:00
Changhan Wang
ee450dde19 S2T multilingual example + bug fix
Summary:
* S2T multilingual example on MuST-C
* A bug fix for `speech_to_text_dataset` (for multilingual setting)

Reviewed By: jmp84

Differential Revision: D24339394

fbshipit-source-id: ef0c0be08137884897b532e45ebc56551d20be48
2020-10-21 08:10:47 -07:00
Myle Ott
eece1d7082 More detailed error message for data iterator size mismatch (#2768)
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/2768

Reviewed By: vimalmanohar

Differential Revision: D24446804

Pulled By: myleott

fbshipit-source-id: 19220f2fd3e3db49f7528f6fb17188834b09646f
2020-10-21 07:47:54 -07:00
Myle Ott
9b0611e678 Fix torch.hub (fixes #2756) (#2762)
Summary:
Typically `torch.hub.load(...)` doesn't call `pip install`, so our Cython components never get built. We have a hack in our hubconf that builds these components by running the equivalent of `python setup.py build_ext --inplace` using the setuptools sandbox: f6677b6755/hubconf.py (L52-L55).

Unfortunately, this sandbox gets mad if you modify the filesystem, which is what this recent change does: f6677b6755/setup.py (L203-L205). Combined, these two break torch.hub.

The solution is that when we're doing `build_ext`, don't setup the symlinks. This is fine, since `build_ext` doesn't actually build a package, so we don't care about including config or examples.

Pull Request resolved: https://github.com/pytorch/fairseq/pull/2762

Reviewed By: alexeib

Differential Revision: D24430228

Pulled By: myleott

fbshipit-source-id: e05d075a003ddfde196cb8a86b32882d73808015
2020-10-20 15:46:55 -07:00
freewym
d6f2c907be remove unnecessary logging configs (#2733)
Summary:
# Before submitting

- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [x] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [ ] Did you make sure to update the docs?
- [x] Did you write any new necessary tests?

## What does this PR do?
It's sufficient to call logging.basicConfig in the outermost calling code, like train.py or generate.py. The outer logging.basicConfig call (like [here](https://github.com/pytorch/fairseq/blob/master/fairseq_cli/generate.py#L54)) will be overridden if logging.basicConfig has already been invoked somewhere in the inner parts of the code.

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding!

Pull Request resolved: https://github.com/pytorch/fairseq/pull/2733

Reviewed By: alexeib

Differential Revision: D24418987

Pulled By: myleott

fbshipit-source-id: 862d200023357de8947799f380e513f4c411b143
2020-10-20 15:46:48 -07:00
Xu Song
8248a12a64 Upgrade args: max_sentences to batch_size (#2754)
Summary:
# Before submitting

- [x] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [x] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [x] Did you make sure to update the docs?
- [x] Did you write any new necessary tests?

## What does this PR do?

Upgrade args: `max_sentences` to `batch_size`
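A rename like this usually comes with an upgrade shim so stale configs and checkpoints still load; a hypothetical sketch of that mapping:

```python
def upgrade_args(args):
    # Hypothetical shim: accept the old max_sentences key from stale
    # configs or checkpoints and map it onto the new batch_size name.
    if "max_sentences" in args and args.get("batch_size") is None:
        args["batch_size"] = args.pop("max_sentences")
    return args
```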

## Did you have fun?
Make sure you had fun coding �

Pull Request resolved: https://github.com/pytorch/fairseq/pull/2754

Reviewed By: alexeib

Differential Revision: D24418980

Pulled By: myleott

fbshipit-source-id: 5269c2fc8c434513cc5114f7e9d2eccd0c553fbd
2020-10-20 15:42:54 -07:00
Alexei Baevski
f6677b6755 fix #2761, #2760
Summary:
Fixes issue #2761 and #2760
args from registries were not added to argparse

Reviewed By: myleott

Differential Revision: D24422792

fbshipit-source-id: c8a8e835965da5c4f527bd589bd621371441e7fe
2020-10-20 13:44:42 -07:00
alexeib
3b27ed7996 Enable Hydra configs in fairseq (#1343) (#1510)
Summary:
Pull Request resolved: https://github.com/facebookresearch/pytext/pull/1510

this is the main PR that switches on hydra functionality in fairseq

we migrate the "args" object into an omegaconf "DictConfig" at all legacy entry points

in addition, this migrates various components from secondary registries (like bpe encoders and tokenizers) to make the migration smoother

I am going through code that references migrated fairseq components and changing it to inherit from "Legacy*" components instead; hopefully tests will catch most of this

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1343

Reviewed By: myleott

Differential Revision: D23973928

Pulled By: alexeib

fbshipit-source-id: dd9554981fff51ea75c1ff343874d1d6e61793c9
2020-10-20 00:32:26 -07:00
Alexei Baevski
c76cb6dfb9 composite criterion should still use legacy criterion as it will break with subsequent diff
Summary: see title

Reviewed By: myleott

Differential Revision: D24393903

fbshipit-source-id: 4b972b8150c7228fb32977675c6c60b13d5194d0
2020-10-19 20:17:11 -07:00
Myle Ott
de5c2cb35a Fix model parallel LM (#1358)
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1358

Reviewed By: alexeib

Differential Revision: D24393064

Pulled By: myleott

fbshipit-source-id: ee88fd1e7b203d7df6b7a65d3b1b1469e8fe9b6e
2020-10-19 14:15:02 -07:00
Myle Ott
9b8b464070 Package config and examples with fairseq (#1356)
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1356

Reviewed By: alexeib

Differential Revision: D24385688

Pulled By: myleott

fbshipit-source-id: 72c4a702d93d2854a6409d42913d7413207cb61e
2020-10-19 09:24:04 -07:00
Angela Fan
e3168f74a8 minor fix for linter (#1360)
Summary:
# Before submitting

- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [ ] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

## What does this PR do?
Fixes # (issue).

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding!

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1360

Reviewed By: myleott

Differential Revision: D24393217

Pulled By: huihuifan

fbshipit-source-id: a110ef6958b1e15cd8c4e23b610db5cfc994f06d
2020-10-19 09:11:03 -07:00
Shruti Bhosale
65e11a37d5 Readme with instructions to generate and evaluate with a 12B model (#1351)
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1351

Reviewed By: edunov

Differential Revision: D24386349

Pulled By: huihuifan

fbshipit-source-id: ade362d7cb64e24e6b2689ba87c53636073d2246
2020-10-19 06:11:59 -07:00
Myle Ott
a48f235636 Apply black+isort (#1357)
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1357

Reviewed By: alexeib

Differential Revision: D24377772

fbshipit-source-id: 51581af041d42d62166b33a35a1a4228b1a76f0c
2020-10-18 18:14:51 -07:00