
HuBERT

Pre-trained and fine-tuned (ASR) models

Model                           | Pretraining Data   | Finetuning Dataset               | Checkpoint | Quantizer
HuBERT Base (~95M params)       | Librispeech 960 hr | No finetuning (Pretrained Model) | download   | L9 km500
HuBERT Large (~316M params)     | Libri-Light 60k hr | No finetuning (Pretrained Model) | download   |
HuBERT Extra Large (~1B params) | Libri-Light 60k hr | No finetuning (Pretrained Model) | download   |
HuBERT Large                    | Libri-Light 60k hr | Librispeech 960 hr               | download   |
HuBERT Extra Large              | Libri-Light 60k hr | Librispeech 960 hr               | download   |

Load a model

import fairseq

ckpt_path = "/path/to/the/checkpoint.pt"
models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
model = models[0]
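
The loaded model can then be used, for example, to extract frame-level features. The sketch below is a minimal illustration for a pre-trained (not fine-tuned) checkpoint; it assumes the HubertModel.extract_features API and the task.cfg.normalize flag, and feeds a dummy waveform in place of real 16 kHz audio.

import torch
import fairseq

ckpt_path = "/path/to/the/checkpoint.pt"
models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
model = models[0].eval()

# Dummy 16 kHz mono waveform of shape (batch, samples); replace with real audio.
wav = torch.randn(1, 16000)
if task.cfg.normalize:
    wav = torch.nn.functional.layer_norm(wav, wav.shape)

with torch.no_grad():
    # For a pre-trained checkpoint, extract_features returns (features, padding_mask);
    # output_layer=None takes the output of the last transformer layer.
    features, padding_mask = model.extract_features(
        wav, padding_mask=None, mask=False, output_layer=None
    )
print(features.shape)  # roughly (1, 49, hidden_dim); HuBERT emits ~50 frames per second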

Train a new model

Data preparation

Follow the steps in ./simple_kmeans to create:

  • {train,valid}.tsv waveform list files
  • {train,valid}.km frame-aligned pseudo label files
  • dict.km.txt a dummy dictionary

The label_rate is the same as the feature frame rate used for clustering, which is 100 Hz for MFCC features and 50 Hz for HuBERT features by default.
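
For reference, the waveform list files follow the standard fairseq manifest format: the first line is the root directory, and every following line is a tab-separated relative path and sample count (examples/wav2vec/wav2vec_manifest.py generates these). Below is a minimal sketch of the same idea; write_manifest is a hypothetical helper, not part of fairseq.

import os
import soundfile as sf

def write_manifest(wav_root, out_tsv, ext=".wav"):
    # First line is the root directory, then one "<relative path>\t<num samples>" line per file.
    with open(out_tsv, "w") as f:
        print(wav_root, file=f)
        for dirpath, _, filenames in os.walk(wav_root):
            for name in sorted(filenames):
                if not name.endswith(ext):
                    continue
                path = os.path.join(dirpath, name)
                print(f"{os.path.relpath(path, wav_root)}\t{sf.info(path).frames}", file=f)

write_manifest("/path/to/audio/train", "/path/to/data/train.tsv")

Each line of {train,valid}.km then holds the space-separated cluster indices of the utterance on the corresponding line of the tsv, as produced by the ./simple_kmeans scripts.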

Pre-train a HuBERT model

Suppose {train,valid}.tsv are saved at /path/to/data, {train,valid}.km are saved at /path/to/labels, and the label rate is 100Hz.

To train a base model (12-layer Transformer), run:

$ python fairseq_cli/hydra_train.py \
  --config-dir /path/to/fairseq-py/examples/hubert/config/pretrain \
  --config-name hubert_base_librispeech \
  task.data=/path/to/data task.label_dir=/path/to/labels task.labels='["km"]' model.label_rate=100

Fine-tune a HuBERT model with a CTC loss

Suppose {train,valid}.tsv are saved at /path/to/data, and their corresponding character-level transcripts {train,valid}.ltr are saved at /path/to/trans.
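
The .ltr files contain one utterance per line, in the same order as the tsv, with letters separated by spaces and word boundaries marked by "|" (see examples/wav2vec/libri_labels.py for the LibriSpeech labeling script). A minimal sketch of the conversion, where words_to_ltr is a hypothetical helper:

def words_to_ltr(transcript):
    # "HELLO WORLD" -> "H E L L O | W O R L D |"
    return " ".join(" ".join(word) + " |" for word in transcript.split())

with open("/path/to/text/train.txt") as fin, open("/path/to/trans/train.ltr", "w") as fout:
    for line in fin:
        print(words_to_ltr(line.strip().upper()), file=fout)

The label directory typically also needs a dict.ltr.txt listing the letter vocabulary, one "<symbol> <count>" entry per line.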

To fine-tune a pre-trained HuBERT model at /path/to/checkpoint, run:

$ python fairseq_cli/hydra_train.py \
  --config-dir /path/to/fairseq-py/examples/hubert/config/finetune \
  --config-name base_10h \
  task.data=/path/to/data task.label_dir=/path/to/trans \
  model.w2v_path=/path/to/checkpoint

Decode a HuBERT model

Suppose test.tsv and test.ltr are the waveform list and transcript files of the split to be decoded, saved at /path/to/data, and the fine-tuned model is saved at /path/to/checkpoint. We support three decoding modes:

  • Viterbi decoding: greedy decoding without a language model
  • KenLM decoding: decoding with an arpa-format KenLM n-gram language model
  • Fairseq-LM decoding: decoding with a Fairseq neural language model

Viterbi decoding

task.normalize needs to be consistent with the value used during fine-tuning. Decoding results will be saved at /path/to/experiment/directory/decode/viterbi/test.

$ python examples/speech_recognition/new/infer.py \
  --config-dir /path/to/fairseq-py/examples/hubert/config/decode \
  --config-name infer_viterbi \
  task.data=/path/to/data \
  task.normalize=[true|false] \
  decoding.exp_dir=/path/to/experiment/directory \
  common_eval.path=/path/to/checkpoint \
  dataset.gen_subset=test

KenLM / Fairseq-LM decoding

Suppose the pronunciation lexicon and the n-gram LM are saved at /path/to/lexicon and /path/to/arpa, respectively. Decoding results will be saved at /path/to/experiment/directory/decode/kenlm/test.

$ python examples/speech_recognition/new/infer.py \
  --config-dir /path/to/fairseq-py/examples/hubert/config/decode \
  --config-name infer_kenlm \
  task.data=/path/to/data \
  task.normalize=[true|false] \
  decoding.exp_dir=/path/to/experiment/directory \
  common_eval.path=/path/to/checkpoint \
  dataset.gen_subset=test \
  decoding.decoder.lexicon=/path/to/lexicon \
  decoding.decoder.lmpath=/path/to/arpa

The command above uses the default decoding hyperparameters, which can be found in examples/speech_recognition/hydra/decoder.py. These parameters can be configured from the command line. For example, to search with a beam size of 500, append decoding.decoder.beam=500 to the command above. Important parameters include the following (a tuning sketch follows the list):

  • decoding.decoder.beam
  • decoding.decoder.beamthreshold
  • decoding.decoder.lmweight
  • decoding.decoder.wordscore
  • decoding.decoder.silweight
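
These weights usually need tuning for each LM and lexicon. As an illustration only (not a fairseq utility), one way to sweep decoding.decoder.lmweight and decoding.decoder.wordscore is to call infer.py repeatedly with different Hydra overrides:

import itertools
import subprocess

# Illustrative grid search over LM weight and word score for KenLM decoding.
# Paths and task.normalize mirror the command above; adjust them to your setup.
for lmweight, wordscore in itertools.product([1.0, 2.0, 3.0], [-1.0, 0.0, 1.0]):
    subprocess.run(
        [
            "python", "examples/speech_recognition/new/infer.py",
            "--config-dir", "/path/to/fairseq-py/examples/hubert/config/decode",
            "--config-name", "infer_kenlm",
            "task.data=/path/to/data",
            "task.normalize=false",
            "decoding.exp_dir=/path/to/experiment/directory",
            "common_eval.path=/path/to/checkpoint",
            "dataset.gen_subset=test",
            "decoding.decoder.lexicon=/path/to/lexicon",
            "decoding.decoder.lmpath=/path/to/arpa",
            f"decoding.decoder.lmweight={lmweight}",
            f"decoding.decoder.wordscore={wordscore}",
        ],
        check=True,
    )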

To decode with a Fairseq LM, use --config-name infer_fsqlm instead, and change the lexicon and LM paths accordingly.