Summary:
Imported from https://github.com/fairinternal/fairseq-py/pull/1284. Updated according to PR comments.
Main changes:
* New task: `fairseq.tasks.speech_to_text`
* Multilingual support: multiple train sub-splits, temperature-based sampling, language ID tokens
* New dataset: `fairseq.data.audio.speech_to_text_dataset`
* Added accuracy metrics and BOS prefix removal to label-smoothed cross entropy
* New models: Transformer (`fairseq.models.speech_to_text.s2t_transformer`) and BLSTM (`fairseq.models.speech_to_text.berard`)
* Extended scorers:
  * Added a base scorer class: `fairseq.scorers.BaseScorer` (the parent class for all scorers except the BLEU scorer in CPP)
  * Added an evaluation tokenizer: `fairseq.scorers.eval_tokenizer`, which leverages sacreBLEU's built-in tokenizers and supports character-level tokenization as well as punctuation removal (for WER scoring)
  * Added chrF scorer: `fairseq.scorers.chrf`
* Online Mel-filter bank speech feature extraction (via CPP-based PyKaldi or Python-based TorchAudio): `fairseq.data.audio.audio_utils`
* Online speech feature transforms: `fairseq.data.audio.feature_transforms.*`
* Fixed the subsampled sequence lengths in VGGTransformer (`examples.speech_recognition.models.vggtransformer`)
* Examples under `examples/speech_to_text`:
  * LibriSpeech (ASR): better results than VGGTransformer with smaller Transformer-based models
  * MuST-C (ST): comparable to [SOTA results](https://arxiv.org/pdf/2004.10234.pdf) but with fewer tricks
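
For reviewers unfamiliar with the temperature-based sampling mentioned under multilingual support, here is a minimal sketch of the general technique. It is not the code in this change: the function name `temperature_sampling_probs` and the sub-split names and sizes are hypothetical, and it assumes the common formulation in which sub-split `i` is sampled with probability proportional to `(n_i / N) ** (1 / T)`.

```python
from typing import Dict


def temperature_sampling_probs(
    sizes: Dict[str, int], temperature: float = 1.0
) -> Dict[str, float]:
    """Sampling probability per sub-split, proportional to (n_i / N) ** (1 / T).

    Illustrative helper only (hypothetical name, not part of fairseq).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(sizes.values())
    # Flatten the empirical distribution: T = 1 keeps it unchanged,
    # larger T moves it toward uniform.
    smoothed = {
        name: (n / total) ** (1.0 / temperature) for name, n in sizes.items()
    }
    norm = sum(smoothed.values())
    return {name: value / norm for name, value in smoothed.items()}


# Hypothetical sub-split sizes: one high-resource and one low-resource pair.
sizes = {"train_de": 900, "train_fr": 100}
probs_t1 = temperature_sampling_probs(sizes, temperature=1.0)
probs_t5 = temperature_sampling_probs(sizes, temperature=5.0)
```

With `temperature=1.0` the sub-splits are sampled in proportion to their sizes; raising the temperature gives the smaller sub-split a larger share, which is the usual motivation for using it in multilingual training.
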
Reviewed By: jmp84
Differential Revision: D24065273
fbshipit-source-id: 5f842ca9c826f92d4af660705611885fe440a9ab