fairseq/tests
Myle Ott 656d7e5779 Add support for FullyShardedDataParallel (--ddp-backend=fully_sharded) (#1667)
Summary:
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1667

This enables full parameter + optimizer state sharding by using
FullyShardedDataParallel (FSDP) from fairscale. The user just needs to provide
`--ddp-backend=fully_sharded` to enable it. Other common options work
out-of-the-box (e.g., `--fp16`, `--memory-efficient-fp16`, `--update-freq`,
etc.). This should be a drop-in replacement for the "c10d" backend.
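
As a rough illustration of the flags above, the following sketch assembles a `fairseq-train` argument list in Python without running it. The helper name `build_fsdp_command` and the dataset path `data-bin/my-dataset` are hypothetical, and a real run would also need the usual task, architecture, and optimizer options, which are omitted here.

```python
import shlex


def build_fsdp_command(data_dir, cpu_offload=False, update_freq=1):
    """Assemble (but do not run) a fairseq-train command that selects FSDP.

    Hypothetical helper for illustration only; it uses only flags named in
    the commit message and leaves out every other required training option.
    """
    cmd = [
        "fairseq-train",
        data_dir,
        "--ddp-backend=fully_sharded",  # the backend added by this commit
        "--fp16",
        f"--update-freq={update_freq}",
    ]
    if cpu_offload:
        # The commit message suggests pairing CPU offload with cpu_adam.
        cmd += ["--cpu-offload", "--optimizer=cpu_adam"]
    return cmd


cmd = build_fsdp_command("data-bin/my-dataset", cpu_offload=True)
print(shlex.join(cmd))
```

Switching back to the "c10d" backend would, under the drop-in claim above, only require changing the `--ddp-backend` value.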

This yields pretty big speedups for small models and enables training ~13B
parameter models on 8 GPUs and 175B parameter models on 128 GPUs, without model
parallelism.
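
A back-of-the-envelope check of these model sizes, as a sketch under stated assumptions: mixed-precision Adam is commonly estimated at about 16 bytes of training state per parameter (FP16 weights and gradients, an FP32 weight copy, and two FP32 Adam moments), and all of that state is assumed to be split evenly across GPUs. Activations, buffers, and fragmentation are ignored, so real memory use will be higher. The function name `per_gpu_state_gib` is hypothetical.

```python
# Assumed bytes of training state per parameter with mixed-precision Adam:
# fp16 weights (2) + fp16 grads (2) + fp32 weight copy (4) + Adam m (4) + Adam v (4)
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4


def per_gpu_state_gib(num_params, world_size, bytes_per_param=BYTES_PER_PARAM):
    """Estimate per-GPU training state in GiB, assuming perfectly even sharding."""
    return num_params * bytes_per_param / world_size / 2**30


for params, gpus in [(13e9, 8), (175e9, 128)]:
    sharded = per_gpu_state_gib(params, gpus)
    unsharded = per_gpu_state_gib(params, 1)
    print(
        f"{params / 1e9:.0f}B params on {gpus} GPUs: "
        f"{sharded:.1f} GiB per GPU sharded vs {unsharded:.1f} GiB unsharded"
    )
```

Under these assumptions both configurations land in the low tens of GiB of state per GPU, whereas the unsharded state would be far larger than a single device could hold.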

This also adds a new option `--cpu-offload` that offloads the optimizer state
and FP32 model copy to CPU, which is particularly useful when combined with
`--optimizer=cpu_adam`.

Note: after enabling this, each GPU will save a checkpoint file, since the
optimizer state is sharded. Each checkpoint will contain a single shard of the
optimizer state and the rank 0 checkpoint will contain the full model weights.

Note: a known limitation of the current implementation is that you cannot
resume training on a different world_size. This constraint will be relaxed in
future iterations.
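
The two notes above can be pictured with a toy model of the checkpoint layout. This is not fairseq's checkpoint code: the function names, dictionary keys, and error message are hypothetical, and the sketch assumes non-zero ranks store no model weights, which the notes do not state.

```python
def plan_checkpoints(world_size, model_keys, optim_shards):
    """Return {rank: contents} for a toy sharded checkpoint.

    Each rank keeps one optimizer-state shard; rank 0 also keeps the full
    model weights. What other ranks store for the model is an assumption.
    """
    if len(optim_shards) != world_size:
        raise ValueError("expected one optimizer shard per rank")
    plan = {}
    for rank in range(world_size):
        plan[rank] = {
            "world_size": world_size,
            "optimizer_shard": optim_shards[rank],
            "model": list(model_keys) if rank == 0 else None,
        }
    return plan


def check_resume(plan, new_world_size):
    """Reject a resume attempt whose world size differs from the saved one."""
    saved = plan[0]["world_size"]
    if saved != new_world_size:
        raise ValueError(
            f"checkpoint was saved with world_size={saved}, "
            f"cannot resume with world_size={new_world_size}"
        )


plan = plan_checkpoints(
    world_size=4,
    model_keys=["embed.weight", "layer0.weight"],
    optim_shards=[f"adam-state-{r}" for r in range(4)],
)
check_resume(plan, 4)  # same world size: accepted
try:
    check_resume(plan, 8)  # different world size: rejected
except ValueError as err:
    print(err)
```

Because every rank holds a distinct optimizer shard, a resume with a different number of ranks would need the shards to be re-partitioned, which is the step this toy check refuses to attempt.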

Test Plan: Imported from OSS

Reviewed By: sshleifer

Differential Revision: D26771144

Pulled By: myleott

fbshipit-source-id: 74c2f46f57719e24e2dcfc9d9ee7c2fc0aeedb46
2021-03-04 13:32:46 -08:00
distributed Add tests for fairseq.distributed.utils.all_gather_list (#1548) 2021-01-28 14:21:10 -08:00
gpu Fix NAT code (#1454) 2020-11-20 12:42:33 -08:00
speech_recognition Enable Hydra configs in fairseq (#1343) (#1510) 2020-10-20 00:32:26 -07:00
__init__.py remediation of S205607 2020-07-17 17:21:51 -07:00
test_activation_checkpointing.py Make checkpoint wrapper pickleable (#1603) 2021-02-06 08:07:32 -08:00
test_average_checkpoints.py Apply black+isort (#1357) 2020-10-18 18:14:51 -07:00
test_backtranslation_dataset.py Apply black+isort (#1357) 2020-10-18 18:14:51 -07:00
test_binaries.py Add support for FullyShardedDataParallel (--ddp-backend=fully_sharded) (#1667) 2021-03-04 13:32:46 -08:00
test_character_token_embedder.py Apply black+isort (#1357) 2020-10-18 18:14:51 -07:00
test_checkpoint_utils.py Move checkpoint state_dict creation into Trainer (#1666) 2021-03-04 13:32:44 -08:00
test_concat_dataset.py Apply black+isort (#1357) 2020-10-18 18:14:51 -07:00
test_constraints.py Apply black+isort (#1357) 2020-10-18 18:14:51 -07:00
test_convtbc.py Apply black+isort (#1357) 2020-10-18 18:14:51 -07:00
test_data_utils.py batch_by_size refactoring: 100x speedup and optimization of memory footprint 2020-12-28 21:05:51 -08:00
test_dataset.py Add support for FullyShardedDataParallel (--ddp-backend=fully_sharded) (#1667) 2021-03-04 13:32:46 -08:00
test_dictionary.py Apply black+isort (#1357) 2020-10-18 18:14:51 -07:00
test_export.py Improve torchscript compatibility of transformer and transformer pg (#3247) 2021-02-22 14:22:54 -08:00
test_file_io.py ioPath async - Fairseq unittests (#1669) 2021-03-03 10:50:39 -08:00
test_fp16_optimizer.py end to end hydra configs (#1393) 2020-11-04 18:20:12 -08:00
test_inference_dropout.py Enable Hydra configs in fairseq (#1343) (#1510) 2020-10-20 00:32:26 -07:00
test_iopath.py Support atomic saves for checkpoints (#1520) 2020-12-18 07:40:49 -08:00
test_iterators.py Apply black+isort (#1357) 2020-10-18 18:14:51 -07:00
test_label_smoothing.py Apply black+isort (#1357) 2020-10-18 18:14:51 -07:00
test_lm_context_window.py Fix --context-window and add test (#1526) 2020-12-23 18:35:54 -08:00
test_lstm_jitable.py Apply black+isort (#1357) 2020-10-18 18:14:51 -07:00
test_memory_efficient_fp16.py Enable Hydra configs in fairseq (#1343) (#1510) 2020-10-20 00:32:26 -07:00
test_metrics.py Apply black+isort (#1357) 2020-10-18 18:14:51 -07:00
test_multi_corpus_dataset.py optimize sampling process of multi_corpus_dataset 2021-03-03 19:31:40 -08:00
test_multi_corpus_sampled_dataset.py Relicense fairseq under MIT license (#786) 2019-07-30 07:48:23 -07:00
test_multihead_attention.py Apply black+isort (#1357) 2020-10-18 18:14:51 -07:00
test_noising.py Apply black+isort (#1357) 2020-10-18 18:14:51 -07:00
test_reproducibility.py Support atomic saves for checkpoints (#1520) 2020-12-18 07:40:49 -08:00
test_resampling_dataset.py Apply black+isort (#1357) 2020-10-18 18:14:51 -07:00
test_sequence_generator.py fastseq ngram blocking (#1509) 2020-12-30 12:58:09 -08:00
test_sequence_scorer.py Apply black+isort (#1357) 2020-10-18 18:14:51 -07:00
test_sparse_multihead_attention.py Apply black+isort (#1357) 2020-10-18 18:14:51 -07:00
test_token_block_dataset.py TokenBlockDataset np type promotion issue (#1658) 2021-02-26 21:00:38 -08:00
test_train.py Move checkpoint state_dict creation into Trainer (#1666) 2021-03-04 13:32:44 -08:00
test_utils.py Apply black+isort (#1357) 2020-10-18 18:14:51 -07:00
utils.py LASER training code (#1207) 2021-02-18 03:10:55 -08:00