fairseq/tests
Naman Goyal 0add50c2e0 allowing sharded dataset (#696)
Summary:
Co-authored-by: myleott <myleott@fb.com>

Changing `data` to be a `str` containing a colon-separated list of paths, for loading sharded datasets. This change is useful for loading large datasets that cannot fit into memory. The large dataset can be sharded, and then one shard is loaded per epoch in round-robin manner.

For example, if there are `5` shards of data and `10` epochs, then the shards will be iterated over in the order `[0, 1, 2, 3, 4, 0, 1, 2, 3, 4]`.
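
The round-robin schedule described above can be sketched roughly as follows. This is only an illustration, not the code from this change: the helper name `shard_for_epoch`, the example paths, and the zero-based epoch numbering are all assumptions made for the sketch.

```python
def shard_for_epoch(data: str, epoch: int) -> str:
    """Pick the shard path for a given epoch, cycling round-robin.

    `data` is a colon-separated list of shard paths; `epoch` is assumed
    to be zero-based here (hypothetical helper, not fairseq API).
    """
    paths = data.split(":")
    return paths[epoch % len(paths)]


# Five hypothetical shard directories joined into one colon-separated string.
data = ":".join("/data/shard{}".format(i) for i in range(5))

# Shard index visited in each of 10 epochs.
schedule = [int(shard_for_epoch(data, epoch)[-1]) for epoch in range(10)]
print(schedule)  # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
```

Because only the path for the current epoch is selected, a loader built this way would need to hold just one shard in memory at a time.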

myleott, we need to look into `translation.py`, as it already expects a list and then concatenates the datasets.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/696

Differential Revision: D15214049

fbshipit-source-id: 03e43a7b69c7aefada2ca668abf1eac1969fe013
2019-05-06 15:27:17 -07:00
..
__init__.py fairseq-py goes distributed (#106) 2018-02-27 17:09:42 -05:00
test_average_checkpoints.py Merge internal changes (#136) 2018-04-02 10:13:07 -04:00
test_backtranslation_dataset.py Back translation + denoising in MultilingualTranslation task (#620) 2019-04-10 10:56:51 -07:00
test_binaries.py Fix and generalize --temperature option (#508) 2019-05-04 16:39:32 -07:00
test_character_token_embedder.py fix tests 2018-09-03 19:15:23 -04:00
test_convtbc.py Remove more Variable() calls (#198) 2018-06-25 12:23:04 -04:00
test_dictionary.py Move string line encoding logic from tokenizer to Dictionary (unified diff). (#541) 2019-02-28 09:19:12 -08:00
test_iterators.py Further generalize EpochBatchIterator and move iterators into new file 2018-09-03 19:15:23 -04:00
test_label_smoothing.py Add FairseqTask 2018-06-15 13:05:22 -06:00
test_multi_corpus_sampled_dataset.py Enable custom sampling strategy in MultiCorpusSampledDataset (#639) 2019-04-16 23:29:02 -07:00
test_noising.py Refactor BacktranslationDataset to be more reusable (#354) 2018-11-25 21:26:03 -08:00
test_reproducibility.py Merge internal changes (#654) 2019-04-29 19:50:58 -07:00
test_sequence_generator.py Modularize generate.py (#351) 2019-02-22 10:08:52 -08:00
test_sequence_scorer.py Modularize generate.py (#351) 2019-02-22 10:08:52 -08:00
test_token_block_dataset.py Merge internal changes (#483) 2019-01-30 09:01:10 -08:00
test_train.py allowing sharded dataset (#696) 2019-05-06 15:27:17 -07:00
test_utils.py Simplify and generalize utils.make_positions 2019-04-15 07:32:11 -07:00
utils.py Fix tests + style nits + Python 3.5 compat 2018-11-01 01:28:30 -07:00