Commit Graph

4 Commits

Author SHA1 Message Date
dianaml0
0dfd6b6240 Add linting with black (#2678)
Summary:
# Before submitting

- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [ ] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/main/CONTRIBUTING.md)?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

## What does this PR do?
Fixes # (issue).

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues, there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding 🙃

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/2678

Reviewed By: Mortimerp9

Differential Revision: D32653381

Pulled By: dianaml0

fbshipit-source-id: 2810d14867cd7d64f4d340740e2b590b82de47fe
2021-11-29 12:32:59 -08:00
Vimal Manohar
1ef3d6a1a2 CPLTask for training with continuous pseudo labeling
Summary:
CPLTaskImpl provides an implementation that augments existing tasks so that their train_step and valid_step take an additional ema_model input for continuous pseudo-labeling (CPL) during training. It passes this ema_model on to the criterion.
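A minimal sketch of the idea, assuming a wrapper-style task; the `ema_model` argument and `set_ema_model` hook are illustrative names, not the actual CPLTaskImpl API:

```python
class CPLTaskWrapper:
    """Wraps an existing task so train_step/valid_step accept an EMA
    (teacher) model and forward it to the criterion for pseudo-labeling."""

    def __init__(self, task):
        self.task = task

    def train_step(self, sample, model, criterion, optimizer, update_num,
                   ignore_grad=False, ema_model=None):
        if ema_model is not None and hasattr(criterion, "set_ema_model"):
            # The criterion uses the EMA model to generate pseudo-labels
            # for the unsupervised portion of the batch.
            criterion.set_ema_model(ema_model)
        return self.task.train_step(
            sample, model, criterion, optimizer, update_num, ignore_grad
        )

    def valid_step(self, sample, model, criterion, ema_model=None):
        if ema_model is not None and hasattr(criterion, "set_ema_model"):
            criterion.set_ema_model(ema_model)
        return self.task.valid_step(sample, model, criterion)
```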

See the Kaizen semi-supervised training paper for more details: https://arxiv.org/abs/2106.07759.

This implementation also supports CPLDataset, which enables the unsupervised data only for `cpl_finetune_epoch > epochs >= cpl_start_epoch`. CPLDataset is like MultiCorpusDataset, but it ignores the unsupervised datasets while sampling outside that epoch range.
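A hedged sketch of that epoch gating; the function name and arguments are illustrative, not the actual CPLDataset API:

```python
def use_unsupervised_data(epoch, cpl_start_epoch, cpl_finetune_epoch):
    """Unsupervised corpora participate in sampling only while
    cpl_start_epoch <= epoch < cpl_finetune_epoch; outside that window
    only the supervised corpora are drawn from."""
    return cpl_start_epoch <= epoch < cpl_finetune_epoch
```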

Another addition in this diff is skipping a dataset in MultiCorpusDataset if its sampling probability is 0.
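A small sketch of that skip, with hypothetical names; the real MultiCorpusDataset logic differs in detail:

```python
def filter_zero_prob(datasets, sampling_probs):
    """Drop corpora whose sampling probability is 0 so they are never
    indexed or iterated during the epoch."""
    return {
        name: dataset
        for name, dataset in datasets.items()
        if sampling_probs[name] > 0
    }
```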

Reviewed By: cruvadom

Differential Revision: D30701536

fbshipit-source-id: 1d840eacfd538ed7aed3baaefc8b254390642b45
2021-10-14 22:09:07 -07:00
Alex Xiao
fc2840de58 optimize sampling process of multi_corpus_dataset
Summary:
The sampling process in multi_corpus_dataset is very inefficient. It turns out we can significantly optimize it by sampling in batches rather than one by one (see the sketch after this list). This allows:

1. fast local development and iteration with corpus sampling, since the turnaround time was long before
2. jobs to start training sooner, giving earlier signal if, for example, there is a configuration issue
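A rough illustration of batched sampling, with made-up probabilities and names (not the actual multi_corpus_dataset code):

```python
import numpy as np

rng = np.random.RandomState(0)
probs = np.array([0.7, 0.2, 0.1])  # per-corpus sampling probabilities
num_samples = 100_000

# Before: one RNG call per sampled index.
slow = [rng.choice(len(probs), p=probs) for _ in range(num_samples)]

# After: draw the whole batch of corpus assignments in one vectorized call.
fast = rng.choice(len(probs), size=num_samples, p=probs)
```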

Reviewed By: zhengwy888

Differential Revision: D26187821

fbshipit-source-id: b4f7f6b7c187b3785499308226e2af671a6c354f
2021-03-03 19:31:40 -08:00
Alex Xiao
1fed7a8426 add unit test for multi_corpus_dataset
Reviewed By: vimalmanohar

Differential Revision: D26220694

fbshipit-source-id: ed13f8527a1b203e1a9d004fa8a86e1ad6423d60
2021-03-03 19:31:39 -08:00