Commit Graph

2 Commits

Author SHA1 Message Date
Myle Ott
656d7e5779 Add support for FullyShardedDataParallel (--ddp-backend=fully_sharded) (#1667)
Summary:
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1667

Add support for FullyShardedDataParallel (--ddp-backend=fully_sharded)

This enables full parameter + optimizer state sharding by using
FullyShardedDataParallel (FSDP) from fairscale. To enable it, the user just
needs to pass `--ddp-backend=fully_sharded`. Other common options work
out-of-the-box (e.g., `--fp16`, `--memory-efficient-fp16`, `--update-freq`,
etc.). This should be a drop-in replacement for the "c10d" backend.
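The sharding idea can be sketched in plain Python. This is a conceptual illustration only (the function names `shard` and `all_gather` are hypothetical stand-ins, not fairscale's implementation): each rank owns a contiguous 1/world_size slice of the flattened parameters, and the full parameter vector is reassembled before use.

```python
# Conceptual sketch of parameter sharding across ranks (not fairscale's code).

def shard(flat_params, rank, world_size):
    """Return the contiguous shard owned by `rank` (the last rank may be shorter)."""
    n = len(flat_params)
    per_rank = (n + world_size - 1) // world_size  # ceil division
    start = rank * per_rank
    return flat_params[start:start + per_rank]

def all_gather(shards):
    """Reassemble the full parameter vector from the per-rank shards."""
    return [p for s in shards for p in s]

params = list(range(10))
world_size = 4
shards = [shard(params, r, world_size) for r in range(world_size)]
assert all_gather(shards) == params  # sharding then gathering round-trips
```

Because each rank holds only its own shard (plus the matching slice of optimizer state), per-GPU memory scales down roughly with world_size, which is what makes the large-model configurations above fit.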

This yields pretty big speedups for small models and enables training ~13B
parameter models on 8 GPUs and 175B parameter models on 128 GPUs, without model
parallelism.

This also adds a new option `--cpu-offload` that offloads the optimizer state
and FP32 model copy to CPU, which is particularly useful when combined with
`--optimizer=cpu_adam`.
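In rough outline, the offload works like this (a pure-Python stand-in with hypothetical names; the real implementation moves tensors between GPU and CPU memory, and rounding here merely stands in for an FP16 cast):

```python
# Stand-in sketch of CPU offload for mixed-precision training: the FP32
# master weights and optimizer state live on the "CPU", while only a
# low-precision working copy is kept on the "GPU".

def to_low_precision(x):
    return round(x, 3)  # stand-in for casting to FP16

cpu_state = {
    "master_weights": [0.123456, -0.654321],  # full-precision copy (offloaded)
    "momentum": [0.0, 0.0],                   # optimizer state (offloaded)
}
gpu_weights = [to_low_precision(w) for w in cpu_state["master_weights"]]

def step(grads, lr=0.1, beta=0.9):
    """One momentum-SGD update performed in full precision on the CPU copy."""
    cpu_state["momentum"] = [beta * m + g
                             for m, g in zip(cpu_state["momentum"], grads)]
    cpu_state["master_weights"] = [w - lr * m
                                   for w, m in zip(cpu_state["master_weights"],
                                                   cpu_state["momentum"])]
    # Re-cast to low precision for the next forward/backward on the GPU.
    return [to_low_precision(w) for w in cpu_state["master_weights"]]

gpu_weights = step([0.01, -0.02])
```

Keeping the optimizer math on the CPU copy is what lets `--optimizer=cpu_adam` pair well with this flag: the GPU only ever holds the low-precision working weights.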

Note: after enabling this, each GPU will save a checkpoint file, since the
optimizer state is sharded. Each checkpoint will contain a single shard of the
optimizer state and the rank 0 checkpoint will contain the full model weights.

Note: a known limitation of the current implementation is that you cannot
resume training with a different world_size. This constraint will be relaxed in
future iterations.

Test Plan: Imported from OSS

Reviewed By: sshleifer

Differential Revision: D26771144

Pulled By: myleott

fbshipit-source-id: 74c2f46f57719e24e2dcfc9d9ee7c2fc0aeedb46
2021-03-04 13:32:46 -08:00
Guillaume Wenzek
da83e2f356 add fast filter_indices_by_size for RoundRobinZipDatasets (#1555)
Summary:
# Before submitting

- [x] Was this discussed/approved via a Github issue?
    this has been extracted from https://github.com/fairinternal/fairseq-py/issues/1538
- [x] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [ ] Did you make sure to update the docs?
- [x] Did you write any new necessary tests?

## What does this PR do?

Implements a fast `RoundRobinZipDataset.filter_indices_by_size`.
Instead of filtering the zipped dataset sample by sample, the constituent
datasets of the `RoundRobinZipDataset` are now filtered before being zipped
together. This might generate slightly different datasets.
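The idea can be sketched as follows. This is a plain-Python illustration with hypothetical names, not fairseq's actual API: each constituent dataset is filtered by length independently, and only indices that survive in every dataset are kept for zipping.

```python
# Sketch of per-dataset filtering before zipping (hypothetical helper names).

def filter_indices_by_size(lengths, max_size):
    """Indices of samples whose length fits within max_size."""
    return [i for i, n in enumerate(lengths) if n <= max_size]

src_lengths = [5, 12, 7, 30]
tgt_lengths = [6, 11, 40, 8]
max_size = 16

# Keep an index only if it survives filtering in every dataset being zipped.
kept = sorted(set(filter_indices_by_size(src_lengths, max_size))
              & set(filter_indices_by_size(tgt_lengths, max_size)))
assert kept == [0, 1]
```

Filtering each dataset up front replaces a per-sample check over the zipped dataset with a handful of vectorizable passes, which is where the speed-up comes from; the intersection step is also why the result can differ slightly from sample-by-sample filtering.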

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1555

Reviewed By: myleott

Differential Revision: D25924464

Pulled By: gwenzek

fbshipit-source-id: bc64d9dc35eee62da7e3e17fd75a7f9facb60452
2021-02-02 09:26:16 -08:00