Summary:
# Before submitting
- [X] Was this discussed/approved via a GitHub issue? (no need for typos, doc improvements)
- [X] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?
## What does this PR do?
Fixes https://github.com/pytorch/fairseq/issues/3882
Fixes https://github.com/pytorch/fairseq/issues/3884
## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
## Did you have fun?
Make sure you had fun coding!
Pull Request resolved: https://github.com/pytorch/fairseq/pull/3887
Reviewed By: yuntang
Differential Revision: D33152073
Pulled By: kahne
fbshipit-source-id: 7f5c90a9876320e7c5c406ed032681452c7c5056
Summary:
Fix several issues in the simultaneous speech translation example, including:
- Load the pretrained encoder when loading the model.
- Fix the broken config.yaml generated when using gcvm.
- Fix the preprocessed databin.
- Fix some errors in the instructions.
- Add detailed instructions on evaluating a pretrained model.
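The first fix above (loading only the pretrained encoder into a full model) can be sketched as follows. This is a hypothetical illustration, not fairseq's actual loader: the checkpoint layout (a `"model"` key holding a flat state dict with `encoder.`-prefixed parameters) and the helper name are assumptions.

```python
import torch

def load_pretrained_encoder(model: torch.nn.Module, checkpoint_path: str) -> None:
    """Copy only encoder weights from a checkpoint into `model.encoder`."""
    state = torch.load(checkpoint_path, map_location="cpu")["model"]
    # Keep only encoder parameters and strip the "encoder." prefix so the
    # keys match the standalone encoder module's state dict.
    encoder_state = {
        k[len("encoder."):]: v
        for k, v in state.items()
        if k.startswith("encoder.")
    }
    model.encoder.load_state_dict(encoder_state)
```

The decoder and any other submodules keep their freshly initialized weights; only the encoder is overwritten.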
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1703
Reviewed By: jmp84
Differential Revision: D27071600
Pulled By: xutaima
fbshipit-source-id: bfe72005190d7936caeef4f805bd99c8d2cf9c37
Summary:
update audio_utils and fix mTEDx example
- Updated `audio_utils`
- Added support for OGG Vorbis (the only supported lossy compressed format)
- Added a separate `convert_to_mono()` helper function
- Updated `get_waveform()`
    - added new arguments `frames` and `start` for reading part of an audio file
- added new argument `mono` for auto conversion to mono-channel audio
- unified returned waveform shape to channels x length (same as torchaudio default)
- Updated mTEDx and MUST-C data prep scripts
  - Replaced `torchaudio.info()` with `soundfile.info()` (the latter is faster, and the former has an incompatible interface between versions <0.8 and the latest 0.8)
- Replaced `torchaudio.load()` with `get_waveform` for auto conversion to mono channel
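The `get_waveform()` behavior described above can be sketched roughly as below. This is a simplified illustration of the interface, not the actual fairseq implementation; the exact signature and defaults are assumptions, and only the parts named in the list (partial reads via `frames`/`start`, `mono` conversion, channels x length output) are modeled.

```python
import numpy as np

def convert_to_mono(waveform: np.ndarray) -> np.ndarray:
    """Average the channel dimension of a (channels x length) waveform."""
    if waveform.shape[0] > 1:
        return waveform.mean(axis=0, keepdims=True)
    return waveform

def get_waveform(path: str, start: int = 0, frames: int = -1, mono: bool = True):
    """Read (part of) an audio file; return (channels x length array, sample rate)."""
    import soundfile as sf  # imported lazily; only needed when reading files

    data, sample_rate = sf.read(
        path, start=start, frames=frames, dtype="float32", always_2d=True
    )
    # soundfile returns (length x channels); transpose to channels x length,
    # matching the torchaudio default mentioned above.
    waveform = data.T
    if mono:
        waveform = convert_to_mono(waveform)
    return waveform, sample_rate
```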
Reviewed By: jmp84
Differential Revision: D26901114
fbshipit-source-id: fa9560c9714d51a91157d5141564574d4eee454d
Summary:
# Before submitting
- [ ] Was this discussed/approved via a GitHub issue? (no need for typos, doc improvements)
- [ ] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?
## What does this PR do?
Fixes # (issue).
## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
## Did you have fun?
Make sure you had fun coding!
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1660
Reviewed By: jmp84, kahne
Differential Revision: D26708521
Pulled By: xutaima
fbshipit-source-id: c53e9052298c559706ceffeb359dadfede2f1a09
Summary:
Imported from https://github.com/fairinternal/fairseq-py/pull/1284. Updated according to PR comments.
Main changes:
* New task: `fairseq.tasks.speech_to_text`
* Multilingual support: multiple train sub-splits, temperature-based sampling, language ID tokens
* New dataset: `fairseq.data.audio.speech_to_text_dataset`
* Added accuracy metrics and BOS prefix removal to label smoothed cross entropy
* New models: Transformer (`fairseq.models.speech_to_text.s2t_transformer`) and BLSTM (`fairseq.models.speech_to_text.berard`)
* Extended scorers:
* Added a base scorer class: `fairseq.scorers.BaseScorer` (the parent class for all scorers except the BLEU scorer in CPP)
* Added an evaluation tokenizer: `fairseq.scorers.eval_tokenizer` which leverages sacreBLEU's built-in tokenizers and allows character-level tokenization as well as punctuation removal (for WER scoring).
* Added chrF scorer: `fairseq.scorers.chrf`
* Online Mel-filter bank speech feature extraction (via CPP-based pyKaldi or Python-based TorchAudio): `fairseq.data.audio.audio_utils`
* Online speech feature transforms: `fairseq.data.audio.feature_transforms.*`
* Fixed the subsampled sequence lengths in VGGTransformer (`examples.speech_recognition.models.vggtransformer`)
* Examples under `examples/speech_to_text`:
* LibriSpeech (ASR): better results than VGGTransformer with smaller Transformer-based models
  * MuST-C (ST): comparable to [SOTA results](https://arxiv.org/pdf/2004.10234.pdf) but with fewer tricks
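The temperature-based sampling mentioned above (for balancing multilingual train sub-splits) can be sketched as follows. This is a generic illustration of the technique, not fairseq's exact code; the function name and example sizes are made up.

```python
def temperature_sampling_probs(sizes: dict, temperature: float = 1.5) -> dict:
    """Sampling probability per sub-split: p_i ∝ (n_i / N) ** (1 / T).

    T = 1 keeps the natural data distribution; larger T flattens it,
    upsampling low-resource languages relative to high-resource ones.
    """
    total = sum(sizes.values())
    weights = {k: (n / total) ** (1.0 / temperature) for k, n in sizes.items()}
    z = sum(weights.values())
    return {k: w / z for k, w in weights.items()}
```

With illustrative sizes like `{"en-de": 100000, "en-vi": 1000}`, raising the temperature increases the share of the low-resource pair at the expense of the high-resource one.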
Reviewed By: jmp84
Differential Revision: D24065273
fbshipit-source-id: 5f842ca9c826f92d4af660705611885fe440a9ab