Summary:
# Before submitting
- [ ] Was this discussed/approved via a GitHub issue? (not needed for typo fixes or doc improvements)
- [ ] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/main/CONTRIBUTING.md)?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?
## What does this PR do?
Fixes # (issue).
## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
## Did you have fun?
Make sure you had fun coding!
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/2678
Reviewed By: Mortimerp9
Differential Revision: D32653381
Pulled By: dianaml0
fbshipit-source-id: 2810d14867cd7d64f4d340740e2b590b82de47fe
Summary:
CPLTaskImpl provides an implementation that augments existing tasks so that their train_step and valid_step take ema_model as an additional input, for continuous pseudo-labeling (CPL) during training. The task passes this ema_model on to the criterion.
See the Kaizen semi-supervised training paper for more details: https://arxiv.org/abs/2106.07759
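As a rough illustration of the flow described above, here is a minimal sketch in plain Python. It does not use fairseq, and `ToyCPLTask`, `ToyCriterion`, and the sample layout are hypothetical names chosen for this example; the real train_step and valid_step signatures may differ. The use of the EMA model's prediction as a pseudo-label for unlabeled samples is also an assumption of the sketch.

```python
class ToyCriterion:
    """Hypothetical criterion: squared error against a label or a pseudo-label."""

    def __call__(self, model, sample, ema_model=None):
        target = sample.get("target")
        if target is None:
            # Assumption: unlabeled samples get their target from the EMA model.
            target = ema_model(sample["input"])
        prediction = model(sample["input"])
        return (prediction - target) ** 2


class ToyCPLTask:
    """Hypothetical task whose steps accept ema_model and forward it to the criterion."""

    def train_step(self, sample, model, criterion, ema_model):
        return criterion(model, sample, ema_model=ema_model)

    def valid_step(self, sample, model, criterion, ema_model):
        return criterion(model, sample, ema_model=ema_model)


def model(x):
    return 2.0 * x  # stand-in for the model being trained


def ema_model(x):
    return 3.0 * x  # stand-in for the EMA copy of the model


task = ToyCPLTask()
labeled_loss = task.train_step({"input": 1.0, "target": 2.0}, model, ToyCriterion(), ema_model)
unlabeled_loss = task.train_step({"input": 1.0}, model, ToyCriterion(), ema_model)
```

The only point of the sketch is the plumbing: the task itself does nothing with ema_model except hand it to the criterion.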
This implementation also supports CPLDataset, which uses unsupervised data only while `cpl_finetune_epoch > epochs >= cpl_start_epoch`. CPLDataset is like MultiCorpusDataset but ignores the unsupervised datasets while sampling.
Another addition in this diff is that MultiCorpusDataset now skips a dataset if its sampling probability is 0.
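The two selection rules above can be sketched as a small standalone function. This is not the CPLDataset or MultiCorpusDataset code: `active_datasets` and its parameters are hypothetical, the sketch assumes unsupervised datasets are ignored only outside the CPL epoch window, and renormalizing the remaining probabilities is an additional assumption.

```python
def active_datasets(probs, unsupervised, epoch, cpl_start_epoch, cpl_finetune_epoch):
    """Return {name: probability} for the datasets to sample from at this epoch."""
    in_cpl_window = cpl_start_epoch <= epoch < cpl_finetune_epoch
    active = {}
    for name, prob in probs.items():
        if prob == 0:
            continue  # skip datasets with zero sampling probability
        if name in unsupervised and not in_cpl_window:
            continue  # assumption: unsupervised data is used only inside the window
        active[name] = prob
    total = sum(active.values())
    if total == 0:
        return {}
    # Assumption: renormalize so the remaining probabilities sum to 1.
    return {name: prob / total for name, prob in active.items()}


probs = {"labeled": 0.5, "unlabeled": 0.5, "disabled": 0.0}
before_cpl = active_datasets(probs, {"unlabeled"}, epoch=0, cpl_start_epoch=2, cpl_finetune_epoch=5)
during_cpl = active_datasets(probs, {"unlabeled"}, epoch=3, cpl_start_epoch=2, cpl_finetune_epoch=5)
```

With these example values, only the labeled dataset is active at epoch 0, and both non-zero datasets are active at epoch 3.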
Reviewed By: cruvadom
Differential Revision: D30701536
fbshipit-source-id: 1d840eacfd538ed7aed3baaefc8b254390642b45
Summary:
The sampling process in multi_corpus_dataset is very inefficient. It turns out we can significantly optimize it by sampling in batches rather than one by one. This allows:
1. fast local development and iteration with corpus sampling, since the turnaround time used to be long
2. jobs to start training sooner, which gives an earlier signal if, for example, there is a configuration issue
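To illustrate the difference between the two sampling strategies, here is a minimal sketch using only the Python standard library. It is not the multi_corpus_dataset code: the function names, the `sizes` and `probs` dictionaries, and sampling with replacement are all assumptions made for the example.

```python
import random
from collections import Counter


def sample_one_by_one(sizes, probs, n, rng):
    """Slow path: one weighted dataset draw per sampled example."""
    names = list(sizes)
    weights = [probs[name] for name in names]
    out = []
    for _ in range(n):
        name = rng.choices(names, weights=weights)[0]
        out.append((name, rng.randrange(sizes[name])))
    return out


def sample_in_batches(sizes, probs, n, rng):
    """Batched path: draw all dataset choices at once, then indices per dataset."""
    names = list(sizes)
    weights = [probs[name] for name in names]
    counts = Counter(rng.choices(names, weights=weights, k=n))  # one call for n draws
    out = []
    for name, count in counts.items():
        indices = rng.choices(range(sizes[name]), k=count)  # one call per dataset
        out.extend((name, index) for index in indices)
    return out


sizes = {"a": 1000, "b": 50, "c": 10}
probs = {"a": 0.7, "b": 0.3, "c": 0.0}
batch = sample_in_batches(sizes, probs, n=10_000, rng=random.Random(0))
```

The batched version replaces n separate sampling calls with one call for the dataset choices plus one call per dataset. Its output is grouped by dataset, so a caller that needs a mixed order would shuffle the result afterwards.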
Reviewed By: zhengwy888
Differential Revision: D26187821
fbshipit-source-id: b4f7f6b7c187b3785499308226e2af671a6c354f