Speech-to-Text (S2T) Modeling

https://www.aclweb.org/anthology/2020.aacl-demo.6

Speech recognition (ASR) and speech-to-text translation (ST) with fairseq.

Data Preparation

S2T modeling data consists of source speech features, target text and other optional information (source text, speaker id, etc.). Fairseq S2T uses per-dataset-split TSV manifest files to store this information, with each data field represented by a column in the TSV file.
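A manifest might look like the sketch below (the column names and the `zip:offset:length` audio notation mirror what the example data preparation scripts emit, but check the generated files rather than relying on this verbatim; `n_frames` counts samples for raw audio and frames for pre-computed features):

```tsv
id	audio	n_frames	tgt_text	speaker
dev_0001	flac/dev_0001.flac	103920	how are you	spk_17
dev_0002	fbank80.zip:102400:30720	96	fine thank you	spk_23
```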

Unlike text token embeddings, speech features (e.g. log mel-scale filter banks) are usually fixed during model training and can be pre-computed. The manifest file contains the path to either the feature file in NumPy format or the WAV/FLAC audio file. For the latter, features will be extracted on-the-fly by fairseq S2T. Optionally, feature/audio files can be packed into uncompressed ZIP files (then accessed via byte offset and length) to improve I/O performance.

Fairseq S2T also employs a YAML file for data-related configuration: tokenizer type and dictionary path for the target text, feature transforms such as CMVN (cepstral mean and variance normalization) and SpecAugment, temperature-based resampling, etc.
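A config YAML could contain entries along these lines (a hedged sketch; field names follow what the example preparation scripts generate, so consult the generated file rather than copying this verbatim):

```yaml
# Illustrative sketch of a Fairseq S2T data config.
vocab_filename: spm_unigram10000.txt
bpe_tokenizer:
  bpe: sentencepiece
  sentencepiece_model: spm_unigram10000.model
input_feat_per_channel: 80
sampling_alpha: 1.0
specaugment:
  freq_mask_N: 1
  freq_mask_F: 27
  time_mask_N: 1
  time_mask_T: 100
  time_mask_p: 1.0
transforms:
  '*':
    - utterance_cmvn
  _train:
    - utterance_cmvn
    - specaugment
```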

Model Training

Fairseq S2T uses the unified fairseq-train interface for model training. It requires arguments --task speech_to_text, --arch <model architecture in fairseq.models.speech_to_text.*> and --config-yaml <config YAML filename>.
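Putting these arguments together, a training invocation could look like the following (paths, architecture choice and hyper-parameters are illustrative placeholders, not a tested recipe):

```sh
fairseq-train ${DATA_ROOT} \
  --task speech_to_text --arch s2t_transformer_s \
  --config-yaml config.yaml --train-subset train --valid-subset dev \
  --save-dir ${SAVE_DIR} --max-tokens 40000 --max-update 300000 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
  --warmup-updates 10000 --clip-norm 10.0 --seed 1
```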

Inference & Evaluation

Fairseq S2T uses the unified fairseq-generate/fairseq-interactive interface for inference and evaluation. It requires arguments --task speech_to_text and --config-yaml <config YAML filename>. The interactive console takes audio paths (one per line) as inputs.
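As a sketch (placeholder paths; the scoring metric would be WER for ASR or BLEU for ST), batch decoding and interactive decoding could be invoked as:

```sh
# Batch decoding and scoring on a prepared test split.
fairseq-generate ${DATA_ROOT} \
  --task speech_to_text --config-yaml config.yaml --gen-subset test \
  --path ${SAVE_DIR}/checkpoint_best.pt \
  --max-tokens 50000 --beam 5 --scoring wer

# Interactive decoding: feed audio paths, one per line, on stdin.
echo "/path/to/utterance.flac" | fairseq-interactive ${DATA_ROOT} \
  --task speech_to_text --config-yaml config.yaml \
  --path ${SAVE_DIR}/checkpoint_best.pt --beam 5
```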

Examples

Updates

  • 02/04/2021: Added interactive decoding (fairseq-interactive) support. Examples: ASR (LibriSpeech) and ST (CoVoST 2).
  • 01/08/2021: Several fixes for S2T Transformer model, inference-time de-tokenization, scorer configuration and data preparation scripts. We also add pre-trained models to the examples and revise the instructions. Breaking changes: the data preparation scripts now extract filterbank features without CMVN. CMVN is instead applied on-the-fly (defined in the config YAML).

What's Next

Citation

Please cite as:

@inproceedings{wang2020fairseqs2t,
  title = {fairseq S2T: Fast Speech-to-Text Modeling with fairseq},
  author = {Changhan Wang and Yun Tang and Xutai Ma and Anne Wu and Dmytro Okhonko and Juan Pino},
  booktitle = {Proceedings of the 2020 Conference of the Asian Chapter of the Association for Computational Linguistics (AACL): System Demonstrations},
  year = {2020},
}

@inproceedings{ott2019fairseq,
  title = {fairseq: A Fast, Extensible Toolkit for Sequence Modeling},
  author = {Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli},
  booktitle = {Proceedings of NAACL-HLT 2019: Demonstrations},
  year = {2019},
}