# Speech-to-Text (S2T) Modeling

Paper: [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171)

Examples for speech recognition (ASR) and speech-to-text translation (ST) with fairseq.

## Data Preparation

S2T modeling data consists of source speech features, target text and other optional information (source text, speaker ID, etc.). Fairseq S2T uses per-dataset-split TSV manifest files to store this information, with each data field represented by a column in the TSV file.
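
For illustration, here is a minimal sketch of reading such a manifest with the Python standard library. The column names used below (`id`, `audio`, `n_frames`, `tgt_text`, `speaker`) follow the prep scripts in this directory but are assumptions here; check the header row of your own TSV files.

```python
# Minimal sketch: iterate over a per-split S2T manifest. Column names are
# assumptions (id, audio, n_frames, tgt_text, speaker); verify them against
# the header row of the TSVs generated for your dataset.
import csv

with open("train.tsv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        # "audio" points to a NumPy feature file, a WAV/FLAC file or a ZIP entry
        print(row["id"], row["audio"], row["n_frames"], row["tgt_text"])
        break  # show only the first utterance
```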

Unlike text token embeddings, speech features (e.g. log mel-scale filter banks) are usually fixed during model training and can be pre-computed. The manifest file contains the path to either the feature file in NumPy format or the WAV/FLAC audio file. For the latter, features will be extracted on-the-fly by fairseq S2T. Optionally, feature/audio files can be packed into uncompressed ZIP files (then accessed via byte offset and length) to improve I/O performance.
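
As a rough illustration of such pre-computed features, the snippet below extracts 80-dimensional log mel-scale filter bank features with torchaudio and stores them in NumPy format. This is only a sketch under assumed parameters, not the exact pipeline used by the prep scripts.

```python
# Sketch: pre-compute log mel filter bank features for one (hypothetical)
# utterance and save them as a NumPy file. Parameters (80 mel bins) are an
# assumption; the prep scripts in this directory define the actual pipeline.
import numpy as np
import torchaudio

waveform, sample_rate = torchaudio.load("utt1.flac")  # hypothetical input file
features = torchaudio.compliance.kaldi.fbank(
    waveform, num_mel_bins=80, sample_frequency=sample_rate
)  # shape: (num_frames, 80)
np.save("utt1.npy", features.numpy())
```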

Fairseq S2T also employs a YAML file for data-related configuration: tokenizer type and dictionary path for the target text, feature transforms such as CMVN (cepstral mean and variance normalization) and SpecAugment, temperature-based resampling, etc.
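
To see what ends up in this file for your data, you can simply load it with PyYAML as below; the exact file name and keys depend on the prep script and fairseq version, so treat this purely as an inspection aid.

```python
# Sketch: print the data config generated by the prep scripts (assumed here to
# be named config.yaml under the data root; the actual name depends on the
# prep script). Keys vary by script and fairseq version.
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)
for key, value in config.items():
    print(f"{key}: {value}")  # tokenizer, vocabulary and feature-transform settings
```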

## Model Training & Evaluation

Fairseq S2T uses the unified `fairseq-train`/`fairseq-generate` interface for model training and evaluation. It requires the arguments `--task speech_to_text` and `--arch <arch in fairseq.models.speech_to_text.*>`.

## Example 1: Speech Recognition (ASR) on LibriSpeech

### Data Preparation

Download and preprocess LibriSpeech data with

```bash
python examples/speech_to_text/prep_librispeech_data.py --output-root ${LS_ROOT} --vocab-type unigram --vocab-size 10000
```

where `LS_ROOT` is the root path for downloaded data as well as generated manifest and feature files.

### Training

```bash
fairseq-train ${LS_ROOT} --train-subset train --valid-subset dev --save-dir ${SAVE_DIR} --num-workers 4 \
    --max-tokens 40000 --task speech_to_text --criterion label_smoothed_cross_entropy --max-update 300000 \
    --arch s2t_transformer_s --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt --warmup-updates 10000 \
    --clip-norm 10.0 --seed 1 --update-freq 8
```

where `SAVE_DIR` is the checkpoint root path. Here we use `--arch s2t_transformer_s` (31M parameters) as an example; you may switch to `s2t_transformer_m` (71M) or `s2t_transformer_l` (268M) for better performance. We set `--update-freq 8` to simulate 8 GPUs with 1 GPU; scale it down accordingly when training on more GPUs (e.g. `--update-freq 2` with 4 GPUs) to keep the effective batch size unchanged.

### Inference & Evaluation

Average the last 10 checkpoints and evaluate on the 4 splits (dev-clean, dev-other, test-clean and test-other):

```bash
CHECKPOINT_FILENAME=avg_last_10_checkpoint.pt
python scripts/average_checkpoints.py --inputs ${SAVE_DIR} --num-epoch-checkpoints 10 \
    --output "${SAVE_DIR}/${CHECKPOINT_FILENAME}"
for SUBSET in dev-clean dev-other test-clean test-other; do
    fairseq-generate ${LS_ROOT} --gen-subset ${SUBSET} --task speech_to_text \
        --path ${SAVE_DIR}/${CHECKPOINT_FILENAME} --max-tokens 50000 --beam 5 --scoring wer
done
```
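
Conceptually, checkpoint averaging just averages each parameter tensor across the selected checkpoints. Below is a rough sketch of the idea; the actual logic, including how the last checkpoints are selected, lives in `scripts/average_checkpoints.py`, and the file names here are hypothetical.

```python
# Rough sketch of checkpoint averaging: load the model state dicts and average
# each parameter elementwise. Assumes fairseq checkpoints keep the weights
# under the "model" key; the real script also handles checkpoint selection.
import torch

paths = [f"checkpoint{i}.pt" for i in range(91, 101)]  # e.g. the last 10 epochs
states = [torch.load(p, map_location="cpu")["model"] for p in paths]
averaged = {name: sum(s[name] for s in states) / len(states) for name in states[0]}
```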

### Results

WER on the four LibriSpeech splits:

| `--arch` | Params | dev-clean | dev-other | test-clean | test-other |
|---|---|---|---|---|---|
| `s2t_transformer_s` | 30M | 4.1 | 9.3 | 4.4 | 9.2 |
| `s2t_transformer_sp` | 35M | 3.9 | 9.3 | 4.3 | 8.8 |
| `s2t_transformer_m` | 71M | 3.5 | 8.1 | 3.7 | 8.1 |
| `s2t_transformer_mp` | 84M | 3.3 | 7.8 | 3.7 | 8.2 |
| `s2t_transformer_l` | 268M | 3.3 | 7.7 | 3.5 | 7.8 |
| `s2t_transformer_lp` | 318M | 3.1 | 7.5 | 3.4 | 7.6 |

## Example 2: Speech Translation (ST) on MuST-C

### Data Preparation

Download and unpack MuST-C data to a path `${MUSTC_ROOT}/en-${TARGET_LANG_ID}`, then preprocess it with

```bash
# Generate TSV manifests, features, vocabulary and configuration for each language
python examples/speech_to_text/prep_mustc_data.py --data-root ${MUSTC_ROOT} --task asr \
    --vocab-type unigram --vocab-size 5000
python examples/speech_to_text/prep_mustc_data.py --data-root ${MUSTC_ROOT} --task st \
    --vocab-type unigram --vocab-size 8000

# Add vocabulary and configuration for joint data (based on the manifests and features generated above)
python examples/speech_to_text/prep_mustc_data.py --data-root ${MUSTC_ROOT} --task asr --joint \
    --vocab-type unigram --vocab-size 10000
python examples/speech_to_text/prep_mustc_data.py --data-root ${MUSTC_ROOT} --task st --joint \
    --vocab-type unigram --vocab-size 10000
```

The generated files will be available under `${MUSTC_ROOT}/en-${TARGET_LANG_ID}` (per-language data) and `${MUSTC_ROOT}` (joint data).

### ASR

#### Training

Using the En-De ASR data as an example:

```bash
fairseq-train ${MUSTC_ROOT}/en-de --train-subset train_asr --valid-subset dev_asr --save-dir ${ASR_SAVE_DIR} \
    --num-workers 4 --max-tokens 40000 --task speech_to_text --criterion label_smoothed_cross_entropy \
    --report-accuracy --max-update 100000 --arch s2t_transformer_s --optimizer adam --lr 1e-3 \
    --lr-scheduler inverse_sqrt --warmup-updates 10000 --clip-norm 10.0 --seed 1 --update-freq 8
```

Using joint data from all directions:

```bash
fairseq-train ${MUSTC_ROOT} \
    --train-subset train_de_asr,train_nl_asr,train_es_asr,train_fr_asr,train_it_asr,train_pt_asr,train_ro_asr,train_ru_asr \
    --valid-subset dev_de_asr,dev_nl_asr,dev_es_asr,dev_fr_asr,dev_it_asr,dev_pt_asr,dev_ro_asr,dev_ru_asr \
    --save-dir ${JOINT_ASR_SAVE_DIR} --num-workers 4 --max-tokens 40000 --task speech_to_text --arch s2t_transformer_s \
    --criterion label_smoothed_cross_entropy --report-accuracy --max-update 100000 --optimizer adam --lr 1e-3 \
    --lr-scheduler inverse_sqrt --warmup-updates 10000 --clip-norm 10.0 --seed 1 --update-freq 8
```

where `ASR_SAVE_DIR` (`JOINT_ASR_SAVE_DIR`) is the checkpoint root path. We set `--update-freq 8` to simulate 8 GPUs with 1 GPU; scale it down accordingly when training on more than 1 GPU.

#### Inference & Evaluation

```bash
CHECKPOINT_FILENAME=avg_last_10_checkpoint.pt
python scripts/average_checkpoints.py --inputs ${ASR_SAVE_DIR} --num-epoch-checkpoints 10 \
    --output "${ASR_SAVE_DIR}/${CHECKPOINT_FILENAME}"
fairseq-generate ${MUSTC_ROOT}/en-de --gen-subset tst-COMMON_asr --task speech_to_text \
    --path ${ASR_SAVE_DIR}/${CHECKPOINT_FILENAME} --max-tokens 50000 --beam 5 \
    --scoring wer --wer-tokenizer 13a --wer-lowercase --wer-remove-punct

# For models trained on joint data
python scripts/average_checkpoints.py --inputs ${JOINT_ASR_SAVE_DIR} --num-epoch-checkpoints 10 \
    --output "${JOINT_ASR_SAVE_DIR}/${CHECKPOINT_FILENAME}"
for LANG in de nl es fr it pt ro ru; do
    fairseq-generate ${MUSTC_ROOT} --gen-subset tst-COMMON_${LANG}_asr --task speech_to_text \
        --path ${JOINT_ASR_SAVE_DIR}/${CHECKPOINT_FILENAME} --max-tokens 50000 --beam 5 \
        --scoring wer --wer-tokenizer 13a --wer-lowercase --wer-remove-punct
done
```

#### Results

WER on MuST-C tst-COMMON:

| Data | `--arch` | Params | En-De | En-Nl | En-Es | En-Fr | En-It | En-Pt | En-Ro | En-Ru |
|---|---|---|---|---|---|---|---|---|---|---|
| Single | `s2t_transformer_s` | 31M | 18.2 | 17.6 | 17.7 | 17.2 | 17.9 | 19.1 | 18.1 | 17.7 |
| Joint | `s2t_transformer_m` | 76M | 16.8 | 16.7 | 16.9 | 16.9 | 17.0 | 17.4 | 17.0 | 16.9 |

### ST

#### Training

Using En-De as an example:

```bash
fairseq-train ${MUSTC_ROOT}/en-de --train-subset train_st --valid-subset dev_st --save-dir ${ST_SAVE_DIR} \
    --num-workers 4 --max-tokens 40000 --task speech_to_text --criterion label_smoothed_cross_entropy \
    --report-accuracy --max-update 100000 --arch s2t_transformer_s --optimizer adam --lr 2e-3 \
    --lr-scheduler inverse_sqrt --warmup-updates 10000 --clip-norm 10.0 --seed 1 --update-freq 8 \
    --load-pretrained-encoder-from ${ASR_SAVE_DIR}/${CHECKPOINT_FILENAME}
```

Example for multilingual models:

```bash
fairseq-train ${MUSTC_ROOT} \
    --train-subset train_de_st,train_nl_st,train_es_st,train_fr_st,train_it_st,train_pt_st,train_ro_st,train_ru_st \
    --valid-subset dev_de_st,dev_nl_st,dev_es_st,dev_fr_st,dev_it_st,dev_pt_st,dev_ro_st,dev_ru_st \
    --save-dir ${MULTILINGUAL_ST_SAVE_DIR} --num-workers 4 --max-tokens 40000 --task speech_to_text \
    --arch s2t_transformer_s --criterion label_smoothed_cross_entropy --report-accuracy --ignore-prefix-size 1 \
    --max-update 100000 --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt --warmup-updates 10000 --clip-norm 10.0 \
    --seed 1 --update-freq 8 --load-pretrained-encoder-from ${JOINT_ASR_SAVE_DIR}/${CHECKPOINT_FILENAME}
```

where `ST_SAVE_DIR` (`MULTILINGUAL_ST_SAVE_DIR`) is the checkpoint root path. The ST encoder is pre-trained on ASR for faster training and better performance: `--load-pretrained-encoder-from <(JOINT_)ASR checkpoint path>`. We set `--update-freq 8` to simulate 8 GPUs with 1 GPU; scale it down accordingly when training on more than 1 GPU. For multilingual models, we prepend the target language ID token as the target BOS; it should be excluded from the training loss via `--ignore-prefix-size 1` (see the sketch below).
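
As a toy illustration of this prefix handling (the tag and token strings below are made up; the actual language-tag format is defined by the dataset code):

```python
# Toy illustration: in the multilingual setting, position 0 of each target
# sequence holds a language ID tag. It is excluded from the training loss
# (--ignore-prefix-size 1) and forced at decoding time (--prefix-size 1).
target = ["<lang:de>", "▁Das", "▁ist", "▁ein", "▁Beispiel"]

prefix_size = 1
loss_positions = target[prefix_size:]   # counted in the training loss
forced_prefix = target[:prefix_size]    # forced as BOS prefix during decoding

print(loss_positions)  # ['▁Das', '▁ist', '▁ein', '▁Beispiel']
print(forced_prefix)   # ['<lang:de>']
```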

#### Inference & Evaluation

Average the last 10 checkpoints and evaluate on the tst-COMMON split:

```bash
CHECKPOINT_FILENAME=avg_last_10_checkpoint.pt
python scripts/average_checkpoints.py --inputs ${ST_SAVE_DIR} --num-epoch-checkpoints 10 \
    --output "${ST_SAVE_DIR}/${CHECKPOINT_FILENAME}"
fairseq-generate ${MUSTC_ROOT} --gen-subset tst-COMMON_st --task speech_to_text \
    --path ${ST_SAVE_DIR}/${CHECKPOINT_FILENAME} --max-tokens 50000 --beam 5 --scoring sacrebleu

# For multilingual models
python scripts/average_checkpoints.py --inputs ${MULTILINGUAL_ST_SAVE_DIR} --num-epoch-checkpoints 10 \
    --output "${MULTILINGUAL_ST_SAVE_DIR}/${CHECKPOINT_FILENAME}"
for LANG in de nl es fr it pt ro ru; do
    fairseq-generate ${MUSTC_ROOT} --gen-subset tst-COMMON_${LANG}_st --task speech_to_text --prefix-size 1 \
        --path ${MULTILINGUAL_ST_SAVE_DIR}/${CHECKPOINT_FILENAME} --max-tokens 50000 --beam 5 --scoring sacrebleu
done
```

For multilingual models, we force decoding from the target language ID token (as BOS) via `--prefix-size 1`.

#### Results

BLEU on MuST-C tst-COMMON:

| Data | `--arch` | Params | En-De | En-Nl | En-Es | En-Fr | En-It | En-Pt | En-Ro | En-Ru |
|---|---|---|---|---|---|---|---|---|---|---|
| Bilingual | `s2t_transformer_s` | 31M | 22.7 | 27.3 | 27.2 | 32.9 | 22.7 | 28.1 | 21.9 | 15.3 |
| Multilingual | `s2t_transformer_m` | 76M | 24.5 | 28.6 | 28.2 | 34.9 | 24.6 | 31.1 | 23.8 | 16.0 |

## Example 3: ST on CoVoST

We replicate the experiments in *CoVoST 2 and Massively Multilingual Speech-to-Text Translation* (Wang et al., 2020).

### Data Preparation

Download and preprocess CoVoST (version 2) data with

```bash
# En ASR
python examples/speech_to_text/prep_covost_data.py --data-root ${COVOST_ROOT} --vocab-type char --src-lang en
# ST
python examples/speech_to_text/prep_covost_data.py --data-root ${COVOST_ROOT} --vocab-type char \
    --src-lang fr --tgt-lang en
```

where `COVOST_ROOT` is the root path for downloaded data as well as generated manifest and feature files.

### ASR

#### Training

```bash
fairseq-train ${COVOST_ROOT} --train-subset train_asr --valid-subset dev_asr --save-dir ${ASR_SAVE_DIR} \
    --num-workers 4 --max-tokens 40000 --task speech_to_text --criterion label_smoothed_cross_entropy \
    --report-accuracy --max-update 100000 --arch s2t_transformer_s --optimizer adam --lr 1e-3 \
    --lr-scheduler inverse_sqrt --warmup-updates 10000 --clip-norm 10.0 --seed 1 --update-freq 8
```

where `ASR_SAVE_DIR` is the checkpoint root path. We set `--update-freq 8` to simulate 8 GPUs with 1 GPU; scale it down accordingly when training on more than 1 GPU.

#### Inference & Evaluation

```bash
CHECKPOINT_FILENAME=avg_last_10_checkpoint.pt
python scripts/average_checkpoints.py --inputs ${ASR_SAVE_DIR} --num-epoch-checkpoints 10 \
    --output "${ASR_SAVE_DIR}/${CHECKPOINT_FILENAME}"
fairseq-generate ${COVOST_ROOT} --gen-subset test_asr_en --task speech_to_text \
    --path ${ASR_SAVE_DIR}/${CHECKPOINT_FILENAME} --max-tokens 50000 --beam 5 \
    --scoring wer --wer-tokenizer 13a --wer-lowercase --wer-remove-punct
```

#### Results

WER on the CoVoST 2 test set:

| `--arch` | Params | En |
|---|---|---|
| `s2t_transformer_s` | 31M | 25.6 |

### ST

#### Training

```bash
fairseq-train ${COVOST_ROOT} --train-subset train_st_fr_en --valid-subset dev_st_fr_en --save-dir ${ST_SAVE_DIR} \
    --num-workers 4 --max-tokens 40000 --task speech_to_text --criterion label_smoothed_cross_entropy \
    --report-accuracy --max-update 100000 --arch s2t_transformer_s --optimizer adam --lr 2e-3 \
    --lr-scheduler inverse_sqrt --warmup-updates 10000 --clip-norm 10.0 --seed 1 --update-freq 8 \
    --load-pretrained-encoder-from ${ASR_SAVE_DIR}/${CHECKPOINT_FILENAME}
```

where `ST_SAVE_DIR` is the checkpoint root path. The ST encoder is pre-trained on En ASR for faster training and better performance: `--load-pretrained-encoder-from <ASR checkpoint path>`. We set `--update-freq 8` to simulate 8 GPUs with 1 GPU; scale it down accordingly when training on more than 1 GPU.

#### Inference & Evaluation

Average the last 10 checkpoints and evaluate on the test split:

```bash
CHECKPOINT_FILENAME=avg_last_10_checkpoint.pt
python scripts/average_checkpoints.py --inputs ${ST_SAVE_DIR} --num-epoch-checkpoints 10 \
    --output "${ST_SAVE_DIR}/${CHECKPOINT_FILENAME}"
fairseq-generate ${COVOST_ROOT} --gen-subset test_st_fr_en --task speech_to_text \
    --path ${ST_SAVE_DIR}/${CHECKPOINT_FILENAME} --max-tokens 50000 --beam 5 --scoring sacrebleu
```

#### Results

BLEU on the CoVoST 2 test sets:

| `--arch` | Params | Fr-En | De-En | Es-En | Ca-En | En-De | En-Ca | En-Fa | En-Et |
|---|---|---|---|---|---|---|---|---|---|
| `s2t_transformer_s` | 31M | 26.3 | 17.1 | 23.0 | 18.8 | 16.3 | 21.8 | 13.1 | 13.2 |

## Citation

Please cite as:

```bibtex
@inproceedings{wang2020fairseqs2t,
  title = {fairseq S2T: Fast Speech-to-Text Modeling with fairseq},
  author = {Changhan Wang and Yun Tang and Xutai Ma and Anne Wu and Dmytro Okhonko and Juan Pino},
  booktitle = {Proceedings of the 2020 Conference of the Asian Chapter of the Association for Computational Linguistics (AACL): System Demonstrations},
  year = {2020},
}

@inproceedings{ott2019fairseq,
  title = {fairseq: A Fast, Extensible Toolkit for Sequence Modeling},
  author = {Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli},
  booktitle = {Proceedings of NAACL-HLT 2019: Demonstrations},
  year = {2019},
}
```

## More Paper Code

The following papers also base their experiments on fairseq S2T. We are adding more examples for replication.