Commit Graph

19 Commits

Author SHA1 Message Date
Vimal Manohar
6b7a7d6457 Fix EMA GPU test
Summary: The GPU test was broken after D33809223 (1b61bbad32)

Reviewed By: cruvadom

Differential Revision: D33931570

fbshipit-source-id: 37962a437d8e25b1dafc58db0efa55c1afa5f3ee
2022-02-04 09:10:06 -08:00
Diana Liskovich
7fddb9d960 lint fixes (#2834)
Summary:
# Before submitting

- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [ ] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/main/CONTRIBUTING.md)?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

## What does this PR do?
Applied `black` and `isort` to fix failing CI

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding 🙃

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/2834

Reviewed By: vedanuj

Differential Revision: D33262876

Pulled By: dianaml0

fbshipit-source-id: 03215c276fcddda9f7c78971bf6ed7c5ac21b2ee
2021-12-29 11:50:55 -08:00
Xian Li
7f3967805f add readme for xglm models (#2808)
Summary:
# Before submitting

- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [ ] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/main/CONTRIBUTING.md)?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

## What does this PR do?
Add readme and task for xglm models.

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding 🙃

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/2808

Reviewed By: punitkoura

Differential Revision: D33237928

Pulled By: xianxl

fbshipit-source-id: 7773cf56e896210dab1f4311ae69f0e00c6d9aff
2021-12-20 13:05:17 -08:00
dianaml0
88e7d2586b fix flake8 issues (#2570)
Summary:
# Before submitting

- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [ ] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/main/CONTRIBUTING.md)?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

## What does this PR do?
- [x] applies flake8 fixes to main branch (https://github.com/fairinternal/fairseq-py/issues/2546) - still more to be fixed

Fix GPU tests:
- [x] when the `torch.ao.quantization` import doesn't work, fall back to `torch.quantization` (see the sketch after this list)
- [x] build apex from an earlier commit in CircleCI so that it's compatible with PyTorch 1.8 and 1.9
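
A minimal sketch of the import fallback (the module paths are PyTorch's real ones; the downstream usage comment is illustrative):

```
# Newer PyTorch moved the quantization namespace to torch.ao.quantization;
# fall back to the legacy torch.quantization module when the new path is missing.
try:
    import torch.ao.quantization as quantization
except ImportError:
    import torch.quantization as quantization

# Downstream code can then use `quantization` uniformly, e.g.:
# qconfig = quantization.get_default_qconfig("fbgemm")
```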

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding 🙃

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/2570

Reviewed By: Mortimerp9

Differential Revision: D32955312

Pulled By: dianaml0

fbshipit-source-id: e163cbd4998f171f819e31b0682c1c0f1986f9e1
2021-12-09 02:34:30 -08:00
dianaml0
0dfd6b6240 Add linting with black (#2678)
Summary:
# Before submitting

- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [ ] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/main/CONTRIBUTING.md)?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

## What does this PR do?
Fixes # (issue).

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding 🙃

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/2678

Reviewed By: Mortimerp9

Differential Revision: D32653381

Pulled By: dianaml0

fbshipit-source-id: 2810d14867cd7d64f4d340740e2b590b82de47fe
2021-11-29 12:32:59 -08:00
Vinayak Tantia
3a5838c320 Update implementation of SlowMo to match its Fairscale implementation (#3996)
Summary:
- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [x] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/main/CONTRIBUTING.md)?
- [x] Did you make sure to update the docs?
- [x] Did you write any new necessary tests?

## What does this PR do?
SlowMo is being moved to [Fairscale](https://fairscale.readthedocs.io/en/latest/). This commit updates the implementation of SlowMo to the Fairscale version and adds tests for SlowMo.
Note: This PR is currently for review. SlowMo is being merged into Fairscale as part of [a PR](https://github.com/facebookresearch/fairscale/pull/378); once that PR lands in Fairscale, this fairseq PR will be ready to merge.
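
A rough usage sketch, assuming the Fairscale API keeps the `SlowMoDistributedDataParallel` wrapper and its `perform_slowmo` post-step hook; treat the exact names and arguments as assumptions, not fairseq's final integration:

```
import torch
import torch.nn as nn
from fairscale.experimental.nn.data_parallel import SlowMoDistributedDataParallel

# Assumes torch.distributed is already initialized, one process per GPU.
model = SlowMoDistributedDataParallel(nn.Linear(16, 16).cuda(), nprocs_per_node=8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(10):
    x = torch.randn(4, 16).cuda()
    optimizer.zero_grad()
    model(x).sum().backward()
    optimizer.step()                 # regular base-optimizer step
    model.perform_slowmo(optimizer)  # periodic averaging + slow momentum update
```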

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding 🙃

Pull Request resolved: https://github.com/pytorch/fairseq/pull/3996

Reviewed By: dianaml0

Differential Revision: D32280163

Pulled By: vtantia

fbshipit-source-id: 70c97b04a7cdc90ada7099375c2a31b0c978ba70
2021-11-09 09:44:45 -08:00
Vimal Manohar
8feccf9441 EMA
Summary:
Adds an exponential moving average (EMA) model for Kaizen semi-supervised training: https://arxiv.org/abs/2106.07759

1. Add `ema.store_ema` to enable storing EMA. The EMA model will be written to extra_state in the state dict when saving a checkpoint.
2. `ema.ema_start_update` controls when the EMA starts accumulating.
3. Tasks can use the `uses_ema` property to decide if the EMA should be passed to the task (default is False).
4. `load_ema_from_checkpoint` can be used to load the EMA model in place of the model for evaluation. Pyspeech has an eval-ema option for this.

```
This module has the EMA class used to store a copy of the exponentially decayed
model params.

Typical usage of EMA class involves initializing an object using an existing
model (random or from a seed model) and setting the config like ema_decay,
ema_start_update which determine how the EMA model is updated. After every
update of the model i.e. at the end of the train_step, the EMA should be updated
by passing the new model to the EMA.step function. The EMA model state dict
can be stored in the extra state under the key of "ema" and dumped
into a checkpoint and loaded. The EMA object can be passed to tasks
by setting task.uses_ema property.
EMA is a smoothed/ensemble model which might have better performance
when used for inference or further fine-tuning. EMA class has a
reverse function to load the EMA params into a model and use it
like a regular model.
```
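
A minimal sketch of that contract (the `step` and `reverse` names come from the docstring above; the decay handling is simplified and ignores buffers, fp32 copies, and ema_start_update):

```
import copy
import torch

class EMA:
    """Keeps an exponentially decayed copy of a model's parameters."""

    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.model = copy.deepcopy(model)  # the smoothed copy
        for p in self.model.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def step(self, new_model):
        # Call at the end of every train_step with the freshly updated model.
        for ema_p, p in zip(self.model.parameters(), new_model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)

    def reverse(self, model):
        # Load the EMA params into a regular model for inference or fine-tuning.
        model.load_state_dict(self.model.state_dict())
        return model
```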

Reviewed By: cruvadom

Differential Revision: D24238379

fbshipit-source-id: 879d3ba5070a614b7d365f9503af357001e875b2
2021-09-01 12:29:51 -07:00
Gagandeep Singh
237184e522 Add torch.cuda.amp support (#3460)
Summary:
# Before submitting

- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [x] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [ ] Did you make sure to update the docs?
- [x] Did you write any new necessary tests?

## What does this PR do?
Fixes https://github.com/pytorch/fairseq/issues/3282
Adds support for `torch.cuda.amp`.
AMP can be enabled with `--amp`, as an alternative to the already-present full-fp16 support enabled with `--fp16`.
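
For reference, this is the standard `torch.cuda.amp` pattern that `--amp` turns on (a generic training-loop sketch, not fairseq's actual trainer code):

```
import torch

model = torch.nn.Linear(16, 16).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid fp16 underflow

for _ in range(10):
    x = torch.randn(4, 16).cuda()
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # runs eligible ops in reduced precision
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # unscales grads; skips the step on inf/nan
    scaler.update()
```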

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding 🙃

Pull Request resolved: https://github.com/pytorch/fairseq/pull/3460

Reviewed By: sshleifer, msbaines

Differential Revision: D27932253

Pulled By: myleott

fbshipit-source-id: 21637aefb5e788c59bf4f3c5de6c4a80f7319543
2021-05-26 14:39:10 -07:00
Weiyi Zheng
8df9e3a4a5 support FSDP sharded_state checkpoint loading during inference
Summary:
Using the very useful feature added by QuentinDuval in https://github.com/facebookresearch/fairscale/pull/683/files, we can consolidate sharded states into a full, regular state. This allows inference on sharded states almost transparently.

The main complexity comes from trying to be smart about what kind of checkpoint the user wants to load; not sure if this is over-engineering:
1. if the file checkpoint-shard0.pt exists and `--checkpoint-shard-count` is > 1, we load a sharded FSDP checkpoint
2. if checkpoint-shard0.pt exists but `--checkpoint-shard-count` is 1, we load a consolidated FSDP checkpoint
3. if checkpoint-shard0.pt does not exist but `--checkpoint-shard-count` is > 1, we load a model-parallel checkpoint
4. otherwise we are loading a single, plain checkpoint

In theory we could be even smarter and load shard0.pt to check how many more checkpoints are needed. This is not implemented, though it would save the user from having to specify `--checkpoint-shard-count`.
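
The four cases as a sketch (the function and argument names are illustrative, not fairseq's actual loader):

```
import os

def checkpoint_kind(path_prefix, checkpoint_shard_count):
    # Mirrors the four cases above; file naming follows the summary.
    shard0 = f"{path_prefix}-shard0.pt"
    if os.path.exists(shard0):
        if checkpoint_shard_count > 1:
            return "fsdp_sharded"       # case 1: sharded FSDP checkpoint
        return "fsdp_consolidated"      # case 2: consolidated FSDP checkpoint
    if checkpoint_shard_count > 1:
        return "model_parallel"         # case 3: model-parallel checkpoint
    return "plain"                      # case 4: single, plain checkpoint
```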

Reviewed By: sshleifer

Differential Revision: D28563441

fbshipit-source-id: dcafcaa7c9eaf5c9ff94f55c16bb3424c98dfa59
2021-05-25 17:45:51 -07:00
Weiyi Zheng
425c36eaff support use_sharded_state on command line
Summary:
We wanted to use sharded_state because:
1. it saves memory
2. it supports sharded state loading, which allows MoE models' weights to live on their respective shards

I added use_sharded_state as a config option and added a unit test to make sure it runs fine.

Comment from an older revision:
fairseq.FSDP has a flag use_sharded_state, but I had to address a couple of problems before being able to use it:
1. fairscale FSDP (FSDP for short) calls self.state_dict/load_state_dict, which have been overridden by fairseq.FSDP; this is not desired behavior
2. the optimizer states shouldn't be sharded again when use_sharded_state is True
3. the option needed to be exposed on the command line
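
A sketch of exposing such a flag through a fairseq-style dataclass config (the field name comes from the summary; the surrounding dataclass is illustrative):

```
from dataclasses import dataclass, field

@dataclass
class DistributedTrainingConfig:
    # Illustrative subset of a fairseq-style config dataclass.
    use_sharded_state: bool = field(
        default=False,
        metadata={"help": "save/load each FSDP worker's shard of the state dict"},
    )
```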

Reviewed By: sshleifer

Differential Revision: D28375035

fbshipit-source-id: c2f59a9c62163405033f34ed595ba78528aea850
2021-05-14 18:53:16 -07:00
Sam Shleifer
05b86005bc Fix FSDP optim state loading (#1819)
Summary:
### Problem:
- if we consolidate the optimizer state dict on rank 0, ranks 1+ save `optimizer.state_dict()`. When they try to load, they call get_shard(last_optim_state), which is wrong since the optim state is already sharded. They should instead find the global consolidated optimizer state dict and load that.

### Possible Solutions:
- if the world size is the same, you could just reuse the local optimizer state dict (OSD).
- [this PR] ranks 1+ load the optimizer state from the rank 0 file and call get_shard
- a separate file for optim_state that every rank loads (like 'shared.pt' on `gshard-azure`). This would save some CPU RAM.

### Note:
- I don't think it's possible to pass `--use-sharded-state` from the command line. It should be, I think.

### Implementation here
+ if FSDP saves -1 as state['last_optimizer_key'], it means that, on load, rank 0's optim state must be loaded.
+ regression test
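
A sketch of the loading fix (helper names are illustrative; `get_shard` here is a stand-in for FSDP's real re-sharding logic):

```
import torch

def load_optimizer_state(rank, rank0_path, local_path, use_sharded_state=False):
    """Mirrors the fix: with consolidated optimizer state, every rank reads
    the global state saved by rank 0 and then takes its shard, instead of
    treating its own (already full) copy as a shard."""
    if use_sharded_state:
        # Each rank saved only its own shard; load it directly.
        return torch.load(local_path)["last_optimizer_state"]
    full_state = torch.load(rank0_path)["last_optimizer_state"]
    return get_shard(full_state, rank)

def get_shard(full_state, rank):
    # Stand-in for FSDP's sharding utility; real re-sharding is more involved.
    raise NotImplementedError("use the FSDP wrapper's sharding helpers here")
```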

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1819

Reviewed By: zhengwy888

Differential Revision: D27910281

Pulled By: sshleifer

fbshipit-source-id: d34987008f77ce7e0cb28b7224dd2aabed38a70c
2021-04-21 15:50:13 -07:00
Sam Shleifer
da0432a3cd MultiGPU test and --log-file workaround (#1793)
Summary:
The initial problem I set out to solve was that it's not easy to add a multi-GPU test. I solved that problem, but it ruined log capturing, both with `self.assertLogs` and with `contextlib.redirect_stdout(StringIO())`.

After some brief digging, I gave up on trying to get those to work and added support for `--log-file AGI_v0.log`, which writes the `progress_bar.log()` statements to the log file as well as to `stdout`. This functionality is used by the resumption test.
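
Under the hood this is plain Python logging with two handlers; a sketch of the pattern (not the actual progress_bar code):

```
import logging
import sys

logger = logging.getLogger("fairseq_train")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(sys.stdout))  # normal stdout output
logger.addHandler(logging.FileHandler("AGI_v0.log"))  # the --log-file target

logger.info("epoch 001 | loss 4.200")  # lands in both stdout and the file
```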

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1793

Reviewed By: myleott

Differential Revision: D27671192

Pulled By: sshleifer

fbshipit-source-id: bcba5f9df7a965889a4cd6993f7eeb0f14b770c6
2021-04-21 06:39:00 -07:00
Myle Ott
d464af2feb Fix NAT code (#1454)
Summary:
D23752010 (add65adcc5) broke some GPU-only tests for NAT (non-autoregressive translation).

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1454

Test Plan: Imported from OSS

Reviewed By: jmp84

Differential Revision: D25108461

Pulled By: myleott

fbshipit-source-id: f32b890221578c421944d6f9a49f06ef1dc075c6
2020-11-20 12:42:33 -08:00
Myle Ott
2d900bf308 Fix tests (#1352)
Summary:
We need to keep `--num-workers=0` during tests.

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1352

Reviewed By: alexeib

Differential Revision: D24375411

Pulled By: myleott

fbshipit-source-id: 9975ed5405f3b19b4dd0877ca15ee3081b185942
2020-10-16 17:36:13 -07:00
Myle Ott
7c292af66f Fix hub (#2687)
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/2687

Reviewed By: alexeib

Differential Revision: D24095130

Pulled By: myleott

fbshipit-source-id: 7d371bccb550ec68b2b9b39dfa4c0718356508d6
2020-10-02 19:02:01 -07:00
Stanislau Hlebik
698e3b91ff remediation of S205607
fbshipit-source-id: 798decc90db4f13770e97cdce3c0df7d5421b2a3
2020-07-17 17:21:51 -07:00
Stanislau Hlebik
7ea5e3b341 remediation of S205607
fbshipit-source-id: 5113fe0c527595e4227ff827253b7414abbdf7ac
2020-07-17 17:21:45 -07:00
Myle Ott
5abc774eea Re-enable test_transformer_fp16 GPU test
Reviewed By: theweiho

Differential Revision: D21890628

fbshipit-source-id: 4088884dd2a82a831f1c129e675eb233c469242a
2020-06-05 06:06:20 -07:00
Wei Ho
ea092c2aa6 Split out fairseq GPU tests & add new deeplearning_fairseq_gpu contbuild using remote execution
Reviewed By: myleott

Differential Revision: D21472387

fbshipit-source-id: efde278baf6a05e8a81a9630b44c7e7e7c7fe7fc
2020-06-03 18:53:35 -07:00