Mirror of https://github.com/facebookresearch/fairseq.git (synced 2024-10-26 17:32:57 +03:00)
53bf2b1293
Summary:

## What does this PR do?

There are a few places where we do file chunking for multiprocessing a single file. However, the code is partly in Binarizer and partly duplicated here and there. This PR extracts the file chunking/reading logic. The multiprocessing logic could probably be extracted too, but I haven't found a good abstraction yet.

# Testing

Added tests for this reading logic, and possibly fixed a bug where the last part of a file might get dropped (though it's unclear whether this could actually happen with the current stopping logic).

Tested by running the preprocessing script as follows:

```
python -m fairseq_cli.preprocess --source-lang de --target-lang en --trainpref ...train.spm.clean.de_en --srcdict ...fairseq.dict --tgtdict .../fairseq.dict --destdir ... --workers 60
```

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1955

Reviewed By: myleott

Differential Revision: D29065473

Pulled By: Mortimerp9

fbshipit-source-id: c60843de8cfd45a63b3dbb8290f57ef3df3bf983

||
---|---|---|
distributed | ||
gpu | ||
speech_recognition | ||
__init__.py | ||
test_activation_checkpointing.py | ||
test_amp_optimizer.py | ||
test_average_checkpoints.py | ||
test_backtranslation_dataset.py | ||
test_binaries.py | ||
test_character_token_embedder.py | ||
test_checkpoint_utils.py | ||
test_concat_dataset.py | ||
test_constraints.py | ||
test_convtbc.py | ||
test_data_utils.py | ||
test_dataset.py | ||
test_dictionary.py | ||
test_export.py | ||
test_file_chunker_utils.py | ||
test_file_io.py | ||
test_fp16_optimizer.py | ||
test_inference_dropout.py | ||
test_iopath.py | ||
test_iterators.py | ||
test_label_smoothing.py | ||
test_lm_context_window.py | ||
test_lstm_jitable.py | ||
test_memory_efficient_fp16.py | ||
test_metrics.py | ||
test_multi_corpus_dataset.py | ||
test_multi_corpus_sampled_dataset.py | ||
test_multihead_attention.py | ||
test_noising.py | ||
test_online_backtranslation.py | ||
test_plasma_utils.py | ||
test_reproducibility.py | ||
test_resampling_dataset.py | ||
test_roberta.py | ||
test_sequence_generator.py | ||
test_sequence_scorer.py | ||
test_sparse_multihead_attention.py | ||
test_token_block_dataset.py | ||
test_train.py | ||
test_transformer.py | ||
test_utils.py | ||
test_valid_subset_checks.py | ||
utils.py |
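
The commit summary above describes chunking a single file so that several workers can each read one part. A minimal sketch of that idea follows, assuming line-oriented UTF-8 text. The names `find_line_aligned_offsets` and `read_chunk` are hypothetical and chosen for illustration; this is not the code in fairseq or in `test_file_chunker_utils.py`.

```python
import os
import tempfile
from typing import Iterator, List


def find_line_aligned_offsets(path: str, num_chunks: int) -> List[int]:
    """Return num_chunks + 1 byte offsets into the file (hypothetical helper).

    The first offset is 0, the last is the file size, and every interior
    offset is moved forward to the start of the next line, so that no line
    is split between two chunks.
    """
    size = os.path.getsize(path)
    offsets = [0] * (num_chunks + 1)
    offsets[-1] = size
    with open(path, "rb") as f:  # binary mode keeps seek()/tell() in bytes
        for i in range(1, num_chunks):
            f.seek(size * i // num_chunks)
            f.readline()  # consume the rest of the line we landed in
            offsets[i] = f.tell()
    return offsets


def read_chunk(path: str, start: int, end: int) -> Iterator[str]:
    """Yield the lines that begin in the byte range [start, end)."""
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if not line:  # defensive: stop at end of file
                break
            yield line.decode("utf-8")


if __name__ == "__main__":
    # Write a small file whose last line has no trailing newline.
    with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
        tmp.write("".join(f"sentence {i}\n" for i in range(10)).encode("utf-8"))
        tmp.write(b"final sentence")
        demo_path = tmp.name
    try:
        offsets = find_line_aligned_offsets(demo_path, 4)
        for start, end in zip(offsets, offsets[1:]):
            # In a real pipeline each (start, end) pair would go to one worker.
            print(start, end, sum(1 for _ in read_chunk(demo_path, start, end)))
    finally:
        os.remove(demo_path)
```

Because the last offset is set to the file size, the final chunk reads through to the end of the file even when the last line has no trailing newline, which is the kind of dropped-tail case the summary mentions. When there are more chunks than lines, some chunks are simply empty.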