Commit Graph

1 Commits

Author SHA1 Message Date
Pierre Andrews
279796224f Preprocess Split (#2738)
Summary:
This is the equivalent to PR https://github.com/fairinternal/fairseq-py/issues/2697 but on top of main instead of gshard (cherry-picked and merged the squash):

* reorganize preprocess.py code a bit
* use Binarizers objects in the multiprocess code
* clean up the make_binary
* multiprocess logic
* learn to count
* format and doc string
* add basic test for vocab binarizer
* generalize to one line
* move multiprocess in binarizer

Testing:
```
python -m fairseq_cli.preprocess --only-source --trainpref ~/fixathon/small_vocab_test/train.in --destdir ~/fixathon/small_vocab_test/data-bin.cherry --workers 20
python -m fairseq_cli.preprocess --only-source --trainpref ~/fixathon/small_vocab_test/train.in --destdir ~/fixathon/small_vocab_test/data-bin.main --workers 20
```

```
 md5sum ~/fixathon/small_vocab_test/data-bin.cherry/train.bin == md5sum ~/fixathon/small_vocab_test/data-bin.main/train.bin
```

```
diff ~/fixathon/small_vocab_test/data-bin.main/dict.txt ~/fixathon/small_vocab_test/data-bin.cherry/dict.tx
```

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/2738

Reviewed By: sshleifer, dianaml0

Differential Revision: D32830875

Pulled By: Mortimerp9

fbshipit-source-id: e7463d5cdd96a877691bf39666daa319ebb3dcb8
2022-01-11 11:56:46 -08:00