
HuBERT

Pre-trained and fine-tuned (ASR) models

Model                           | Pretraining Data   | Finetuning Dataset               | Checkpoint | Quantizer
HuBERT Base (~95M params)       | Librispeech 960 hr | No finetuning (Pretrained Model) | download   | L9 km500
HuBERT Large (~316M params)     | Libri-Light 60k hr | No finetuning (Pretrained Model) | download   |
HuBERT Extra Large (~1B params) | Libri-Light 60k hr | No finetuning (Pretrained Model) | download   |
HuBERT Large                    | Libri-Light 60k hr | Librispeech 960 hr               | download   |
HuBERT Extra Large              | Libri-Light 60k hr | Librispeech 960 hr               | download   |

Load a model

import fairseq

ckpt_path = "/path/to/the/checkpoint.pt"
models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
model = models[0]
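
The loaded model can then be used, for example, to extract frame-level features. The sketch below is a minimal illustration for a pre-trained (not fine-tuned) checkpoint; it assumes the HubertModel.extract_features API and the task.cfg.normalize flag, and feeds a dummy waveform in place of real 16 kHz audio.

import torch
import fairseq

ckpt_path = "/path/to/the/checkpoint.pt"
models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
model = models[0].eval()

# Dummy 16 kHz mono waveform of shape (batch, samples); replace with real audio.
wav = torch.randn(1, 16000)
if task.cfg.normalize:
    wav = torch.nn.functional.layer_norm(wav, wav.shape)

with torch.no_grad():
    # For a pre-trained checkpoint, extract_features returns (features, padding_mask);
    # output_layer=None takes the output of the last transformer layer.
    features, padding_mask = model.extract_features(
        wav, padding_mask=None, mask=False, output_layer=None
    )
print(features.shape)  # roughly (1, 49, hidden_dim); HuBERT emits ~50 frames per second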

Train a new model

Data preparation

Follow the steps in ./simple_kmeans to create:

  • {train,valid}.tsv waveform list files
  • {train,valid}.km frame-aligned pseudo label files
  • dict.km.txt a dummy dictionary

The label_rate is the same as the feature frame rate used for clustering, which is 100 Hz for MFCC features and 50 Hz for HuBERT features by default.
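
For reference, the waveform list files follow the standard fairseq manifest format: the first line is the root directory, and every following line is a tab-separated relative path and sample count (examples/wav2vec/wav2vec_manifest.py generates these). Below is a minimal sketch of the same idea; write_manifest is a hypothetical helper, not part of fairseq.

import os
import soundfile as sf

def write_manifest(wav_root, out_tsv, ext=".wav"):
    # First line is the root directory, then one "<relative path>\t<num samples>" line per file.
    with open(out_tsv, "w") as f:
        print(wav_root, file=f)
        for dirpath, _, filenames in os.walk(wav_root):
            for name in sorted(filenames):
                if not name.endswith(ext):
                    continue
                path = os.path.join(dirpath, name)
                print(f"{os.path.relpath(path, wav_root)}\t{sf.info(path).frames}", file=f)

write_manifest("/path/to/audio/train", "/path/to/data/train.tsv")

Each line of {train,valid}.km then holds the space-separated cluster indices of the utterance on the corresponding line of the tsv, as produced by the ./simple_kmeans scripts.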

Pre-train a HuBERT model

Suppose {train,valid}.tsv are saved at /path/to/data, {train,valid}.km are saved at /path/to/labels, and the label rate is 100Hz.

To train a base model (12-layer Transformer), run:

$ python fairseq_cli/hydra_train.py \
  --config-dir /path/to/fairseq-py/examples/hubert/config/pretrain \
  --config-name hubert_base_librispeech \
  task.data=/path/to/data task.label_dir=/path/to/labels task.labels='["km"]' model.label_rate=100

Fine-tune a HuBERT model with a CTC loss

Suppose {train,valid}.tsv are saved at /path/to/data, and their corresponding character-level transcripts {train,valid}.ltr are saved at /path/to/trans.
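
The .ltr files contain one utterance per line, in the same order as the tsv, with letters separated by spaces and word boundaries marked by "|" (see examples/wav2vec/libri_labels.py for the LibriSpeech labeling script). A minimal sketch of the conversion, where words_to_ltr is a hypothetical helper:

def words_to_ltr(transcript):
    # "HELLO WORLD" -> "H E L L O | W O R L D |"
    return " ".join(" ".join(word) + " |" for word in transcript.split())

with open("/path/to/text/train.txt") as fin, open("/path/to/trans/train.ltr", "w") as fout:
    for line in fin:
        print(words_to_ltr(line.strip().upper()), file=fout)

The label directory typically also needs a dict.ltr.txt listing the letter vocabulary, one "<symbol> <count>" entry per line.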

To fine-tune a pre-trained HuBERT model at /path/to/checkpoint, run:

$ python fairseq_cli/hydra_train.py \
  --config-dir /path/to/fairseq-py/examples/hubert/config/finetune \
  --config-name base_10h \
  task.data=/path/to/data task.label_dir=/path/to/trans \
  model.w2v_path=/path/to/checkpoint

Decode a HuBERT model

Suppose test.tsv and test.ltr are the waveform list and transcript files of the split to be decoded, saved at /path/to/data, and the fine-tuned model is saved at /path/to/checkpoint. We support three decoding modes:

  • Viterbi decoding: greedy decoding without a language model
  • KenLM decoding: decoding with an arpa-format KenLM n-gram language model
  • Fairseq-LM decoding: decoding with a Fairseq neural language model

Viterbi decoding

task.normalize needs to be consistent with the value used during fine-tuning. Decoding results will be saved at /path/to/experiment/directory/decode/viterbi/test.

$ python examples/speech_recognition/new/infer.py \
  --config-dir /path/to/fairseq-py/examples/hubert/config/decode \
  --config-name infer_viterbi \
  task.data=/path/to/data \
  task.normalize=[true|false] \
  decoding.exp_dir=/path/to/experiment/directory \
  common_eval.path=/path/to/checkpoint \
  dataset.gen_subset=test

KenLM / Fairseq-LM decoding

Suppose the pronunciation lexicon and the n-gram LM are saved at /path/to/lexicon and /path/to/arpa, respectively. Decoding results will be saved at /path/to/experiment/directory/decode/kenlm/test.

$ python examples/speech_recognition/new/infer.py \
  --config-dir /path/to/fairseq-py/examples/hubert/config/decode \
  --config-name infer_kenlm \
  task.data=/path/to/data \
  task.normalize=[true|false] \
  decoding.exp_dir=/path/to/experiment/directory \
  common_eval.path=/path/to/checkpoint \
  dataset.gen_subset=test \
  decoding.decoder.lexicon=/path/to/lexicon \
  decoding.decoder.lmpath=/path/to/arpa

The command above uses the default decoding hyperparameters, which can be found in examples/speech_recognition/hydra/decoder.py. These parameters can be configured from the command line. For example, to search with a beam size of 500, append decoding.decoder.beam=500 to the command above. Important parameters include the following (a tuning sketch follows the list):

  • decoding.decoder.beam
  • decoding.decoder.beamthreshold
  • decoding.decoder.lmweight
  • decoding.decoder.wordscore
  • decoding.decoder.silweight
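
These weights usually need tuning for each LM and lexicon. As an illustration only (not a fairseq utility), one way to sweep decoding.decoder.lmweight and decoding.decoder.wordscore is to call infer.py repeatedly with different Hydra overrides:

import itertools
import subprocess

# Illustrative grid search over LM weight and word score for KenLM decoding.
# Paths and task.normalize mirror the command above; adjust them to your setup.
for lmweight, wordscore in itertools.product([1.0, 2.0, 3.0], [-1.0, 0.0, 1.0]):
    subprocess.run(
        [
            "python", "examples/speech_recognition/new/infer.py",
            "--config-dir", "/path/to/fairseq-py/examples/hubert/config/decode",
            "--config-name", "infer_kenlm",
            "task.data=/path/to/data",
            "task.normalize=false",
            "decoding.exp_dir=/path/to/experiment/directory",
            "common_eval.path=/path/to/checkpoint",
            "dataset.gen_subset=test",
            "decoding.decoder.lexicon=/path/to/lexicon",
            "decoding.decoder.lmpath=/path/to/arpa",
            f"decoding.decoder.lmweight={lmweight}",
            f"decoding.decoder.wordscore={wordscore}",
        ],
        check=True,
    )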

To decode with a Fairseq LM, use --config-name infer_fsqlm instead, and change the lexicon and LM paths accordingly.