Commit Graph

2 Commits

Author SHA1 Message Date
Myle Ott
656d7e5779 Add support for FullyShardedDataParallel (--ddp-backend=fully_sharded) (#1667)
Summary:
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1667

Add support for FullyShardedDataParallel (--ddp-backend=fully_sharded)

This enables full parameter + optimizer state sharding by using
FullyShardedDataParallel (FSDP) from fairscale. To enable it, the user just
needs to pass `--ddp-backend=fully_sharded`. Other common options work
out-of-the-box (e.g., `--fp16`, `--memory-efficient-fp16`, `--update-freq`,
etc.). This should be a drop-in replacement for the "c10d" backend.
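The sharding idea can be sketched in plain Python. This is a conceptual illustration only (the function names `shard` and `all_gather` are hypothetical stand-ins, not fairscale's implementation): each rank owns a contiguous 1/world_size slice of the flattened parameters, and the full parameter vector is reassembled before use.

```python
# Conceptual sketch of parameter sharding across ranks (not fairscale's code).

def shard(flat_params, rank, world_size):
    """Return the contiguous shard owned by `rank` (the last rank may be shorter)."""
    n = len(flat_params)
    per_rank = (n + world_size - 1) // world_size  # ceil division
    start = rank * per_rank
    return flat_params[start:start + per_rank]

def all_gather(shards):
    """Reassemble the full parameter vector from the per-rank shards."""
    return [p for s in shards for p in s]

params = list(range(10))
world_size = 4
shards = [shard(params, r, world_size) for r in range(world_size)]
assert all_gather(shards) == params  # sharding then gathering round-trips
```

Because each rank holds only its own shard (plus the matching slice of optimizer state), per-GPU memory scales down roughly with world_size, which is what makes the large-model configurations above fit.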

This yields pretty big speedups for small models and enables training ~13B
parameter models on 8 GPUs and 175B parameter models on 128 GPUs, without model
parallelism.

This also adds a new option `--cpu-offload` that offloads the optimizer state
and FP32 model copy to CPU, which is particularly useful when combined with
`--optimizer=cpu_adam`.
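In rough outline, the offload works like this (a pure-Python stand-in with hypothetical names; the real implementation moves tensors between GPU and CPU memory, and rounding here merely stands in for an FP16 cast):

```python
# Stand-in sketch of CPU offload for mixed-precision training: the FP32
# master weights and optimizer state live on the "CPU", while only a
# low-precision working copy is kept on the "GPU".

def to_low_precision(x):
    return round(x, 3)  # stand-in for casting to FP16

cpu_state = {
    "master_weights": [0.123456, -0.654321],  # full-precision copy (offloaded)
    "momentum": [0.0, 0.0],                   # optimizer state (offloaded)
}
gpu_weights = [to_low_precision(w) for w in cpu_state["master_weights"]]

def step(grads, lr=0.1, beta=0.9):
    """One momentum-SGD update performed in full precision on the CPU copy."""
    cpu_state["momentum"] = [beta * m + g
                             for m, g in zip(cpu_state["momentum"], grads)]
    cpu_state["master_weights"] = [w - lr * m
                                   for w, m in zip(cpu_state["master_weights"],
                                                   cpu_state["momentum"])]
    # Re-cast to low precision for the next forward/backward on the GPU.
    return [to_low_precision(w) for w in cpu_state["master_weights"]]

gpu_weights = step([0.01, -0.02])
```

Keeping the optimizer math on the CPU copy is what lets `--optimizer=cpu_adam` pair well with this flag: the GPU only ever holds the low-precision working weights.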

Note: after enabling this, each GPU will save a checkpoint file, since the
optimizer state is sharded. Each checkpoint will contain a single shard of the
optimizer state and the rank 0 checkpoint will contain the full model weights.

Note: a known limitation of the current implementation is that you cannot
resume training with a different world_size. This constraint will be relaxed in
future iterations.

Test Plan: Imported from OSS

Reviewed By: sshleifer

Differential Revision: D26771144

Pulled By: myleott

fbshipit-source-id: 74c2f46f57719e24e2dcfc9d9ee7c2fc0aeedb46
2021-03-04 13:32:46 -08:00
Guillaume Wenzek
da83e2f356 add fast filter_indices_by_size for RoundRobinZipDatasets (#1555)
Summary:
# Before submitting

- [x] Was this discussed/approved via a Github issue?
    this has been extracted from https://github.com/fairinternal/fairseq-py/issues/1538
- [x] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [ ] Did you make sure to update the docs?
- [x] Did you write any new necessary tests?

## What does this PR do?

Implements a fast `RoundRobinZipDataset.filter_indices_by_size`.
Instead of filtering the zipped dataset sample by sample, the constituent
datasets of the `RoundRobinZipDataset` are now filtered before being zipped
together. This might generate slightly different datasets.
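The idea can be sketched as follows. This is a plain-Python illustration with hypothetical names, not fairseq's actual API: each constituent dataset is filtered by length independently, and only indices that survive in every dataset are kept for zipping.

```python
# Sketch of per-dataset filtering before zipping (hypothetical helper names).

def filter_indices_by_size(lengths, max_size):
    """Indices of samples whose length fits within max_size."""
    return [i for i, n in enumerate(lengths) if n <= max_size]

src_lengths = [5, 12, 7, 30]
tgt_lengths = [6, 11, 40, 8]
max_size = 16

# Keep an index only if it survives filtering in every dataset being zipped.
kept = sorted(set(filter_indices_by_size(src_lengths, max_size))
              & set(filter_indices_by_size(tgt_lengths, max_size)))
assert kept == [0, 1]
```

Filtering each dataset up front replaces a per-sample check over the zipped dataset with a handful of vectorizable passes, which is where the speed-up comes from; the intersection step is also why the result can differ slightly from sample-by-sample filtering.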

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1555

Reviewed By: myleott

Differential Revision: D25924464

Pulled By: gwenzek

fbshipit-source-id: bc64d9dc35eee62da7e3e17fd75a7f9facb60452
2021-02-02 09:26:16 -08:00