Commit Graph

4 Commits

Author SHA1 Message Date
dianaml0
0dfd6b6240 Add linting with black (#2678)
Summary:
# Before submitting

- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [ ] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/main/CONTRIBUTING.md)?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

## What does this PR do?
Fixes # (issue).

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues, there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding 🙃

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/2678

Reviewed By: Mortimerp9

Differential Revision: D32653381

Pulled By: dianaml0

fbshipit-source-id: 2810d14867cd7d64f4d340740e2b590b82de47fe
2021-11-29 12:32:59 -08:00
Vimal Manohar
1ef3d6a1a2 CPLTask for training with continuous pseudo labeling
Summary:
CPLTaskImpl provides an implementation that augments existing tasks so that their train_step and valid_step take an additional ema_model input for continuous pseudo-labeling (CPL) during training. It passes this ema_model on to the criterion.
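A minimal sketch of the idea, assuming a wrapper-style task; the `ema_model` argument and `set_ema_model` hook are illustrative names, not the actual CPLTaskImpl API:

```python
class CPLTaskWrapper:
    """Wraps an existing task so train_step/valid_step accept an EMA
    (teacher) model and forward it to the criterion for pseudo-labeling."""

    def __init__(self, task):
        self.task = task

    def train_step(self, sample, model, criterion, optimizer, update_num,
                   ignore_grad=False, ema_model=None):
        if ema_model is not None and hasattr(criterion, "set_ema_model"):
            # The criterion uses the EMA model to generate pseudo-labels
            # for the unsupervised portion of the batch.
            criterion.set_ema_model(ema_model)
        return self.task.train_step(
            sample, model, criterion, optimizer, update_num, ignore_grad
        )

    def valid_step(self, sample, model, criterion, ema_model=None):
        if ema_model is not None and hasattr(criterion, "set_ema_model"):
            criterion.set_ema_model(ema_model)
        return self.task.valid_step(sample, model, criterion)
```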

See the Kaizen semi-supervised training paper for more details: https://arxiv.org/abs/2106.07759.

This implementation also supports CPLDataset, which enables the unsupervised data only for `cpl_finetune_epoch > epochs >= cpl_start_epoch`. CPLDataset is like MultiCorpusDataset, but it ignores the unsupervised datasets while sampling outside that epoch range.
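A hedged sketch of that epoch gating; the function name and arguments are illustrative, not the actual CPLDataset API:

```python
def use_unsupervised_data(epoch, cpl_start_epoch, cpl_finetune_epoch):
    """Unsupervised corpora participate in sampling only while
    cpl_start_epoch <= epoch < cpl_finetune_epoch; outside that window
    only the supervised corpora are drawn from."""
    return cpl_start_epoch <= epoch < cpl_finetune_epoch
```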

Another addition in this diff is skipping a dataset in MultiCorpusDataset if its sampling probability is 0.
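A small sketch of that skip, with hypothetical names; the real MultiCorpusDataset logic differs in detail:

```python
def filter_zero_prob(datasets, sampling_probs):
    """Drop corpora whose sampling probability is 0 so they are never
    indexed or iterated during the epoch."""
    return {
        name: dataset
        for name, dataset in datasets.items()
        if sampling_probs[name] > 0
    }
```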

Reviewed By: cruvadom

Differential Revision: D30701536

fbshipit-source-id: 1d840eacfd538ed7aed3baaefc8b254390642b45
2021-10-14 22:09:07 -07:00
Alex Xiao
fc2840de58 optimize sampling process of multi_corpus_dataset
Summary:
The sampling process in multi_corpus_dataset is very inefficient. It turns out we can significantly optimize it by sampling in batches rather than one by one (see the sketch after this list). This allows:

1. fast local development and iteration with corpus sampling, since the turnaround time was long before
2. jobs to start training sooner, giving earlier signal if, for example, there is a configuration issue
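A rough illustration of batched sampling, with made-up probabilities and names (not the actual multi_corpus_dataset code):

```python
import numpy as np

rng = np.random.RandomState(0)
probs = np.array([0.7, 0.2, 0.1])  # per-corpus sampling probabilities
num_samples = 100_000

# Before: one RNG call per sampled index.
slow = [rng.choice(len(probs), p=probs) for _ in range(num_samples)]

# After: draw the whole batch of corpus assignments in one vectorized call.
fast = rng.choice(len(probs), size=num_samples, p=probs)
```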

Reviewed By: zhengwy888

Differential Revision: D26187821

fbshipit-source-id: b4f7f6b7c187b3785499308226e2af671a6c354f
2021-03-03 19:31:40 -08:00
Alex Xiao
1fed7a8426 add unit test for multi_corpus_dataset
Reviewed By: vimalmanohar

Differential Revision: D26220694

fbshipit-source-id: ed13f8527a1b203e1a9d004fa8a86e1ad6423d60
2021-03-03 19:31:39 -08:00