Training Phase

The training is divided into two parts – Pre-Training and Fine-Tuning.

  1. Pre-Training – The objective of the pre-training step is to learn hidden-unit representations. It is run for two iterations for the 'Base' model and one iteration for the 'Large' and 'X-Large' models. Since the training of 'Large' and 'X-Large' takes its clustering targets from the 'Base' model, their single iteration is effectively a third iteration.
  2. Fine-Tuning – Here the model is fine-tuned for the ASR task.

The steps below describe the training process in detail.

  1. The input audio waveform is converted to 39-dimensional MFCC (Mel-Frequency Cepstral Coefficients) features.
  2. K-means clustering with 100 clusters is run on these features. Each segment of the audio is assigned to one of the k clusters, and these clusters become the hidden units. Each hidden unit is mapped to an embedding vector, and these embeddings act as the TARGETS to be predicted by the BERT model in step 3 (see the clustering sketch after this list).

    Pre-Training Iteration 1: Clustering

  3. The raw audio input is passed through the convolutional feature extractor (composed of seven 512-channel convolution layers). Some of the resulting feature frames are masked, and the sequence is fed into the BERT encoder. The encoder's objective is to predict, for each masked position, a representation that matches the hidden unit obtained in step 2. Because the output of the BERT encoder has a higher dimension than the embedding dimension, a projection layer transforms the encoder output, and cross-entropy loss penalizes wrong predictions (a condensed sketch of this masked-prediction objective follows the list).

    Pre-Training Iteration 1: Masked Language Modeling

  4. The above constitutes the first iteration of the model training. In the second iteration, the clustering step uses, instead of the MFCC features, the output of an intermediate layer of the BERT encoder from the previous iteration (the 6th layer for the 'Base' model).

    Pre-Training Iteration 2: Clustering

  5. Steps 2-4 are referred to as 'pre-training', during which the model learns meaningful high-level representations. Pre-training is run for two iterations for the 'Base' model, while the 'Large' and 'X-Large' models are trained for one iteration each: instead of restarting the iterative process from clustering MFCC features, features from the 9th transformer layer of the second-iteration 'Base' HuBERT are used for clustering and for training these two models.
  6. Fine-Tuning – After pre-training, CTC loss is used to fine-tune the model for ASR. The projection layer is replaced with a randomly initialized softmax layer. The CTC target vocabulary includes the 26 English characters, a space token, an apostrophe, and a special CTC blank symbol (a minimal sketch of this CTC setup follows the list).
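The clustering that produces the pre-training targets (steps 1-2) can be sketched with off-the-shelf tools. The snippet below is a minimal illustration, assuming librosa for MFCC extraction and scikit-learn for k-means; the synthetic waveform, the 13-coefficient-plus-deltas layout of the 39 dimensions, and the k-means settings are illustrative assumptions rather than the exact HuBERT recipe.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

def mfcc_39(wav, sr=16000):
    """Frame-level 39-d MFCCs: 13 static coefficients plus deltas and delta-deltas."""
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13)   # (13, frames)
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T                     # (frames, 39)

sr = 16000
wav = np.random.randn(10 * sr)        # stand-in for real speech; pool a whole corpus in practice
feats = mfcc_39(wav, sr)

# Fit k-means with 100 clusters; each frame's cluster id is its "hidden unit",
# i.e. the discrete target the BERT encoder must predict during pre-training.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(feats)
hidden_units = kmeans.predict(feats)  # shape (frames,), values in [0, 100)
```

For the second iteration (step 4), the same clustering is applied to features taken from an intermediate transformer layer of the first-iteration model instead of the MFCCs.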
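The masked-prediction objective in step 3 can be illustrated with a deliberately condensed PyTorch toy model. Assumptions to note: a single convolution layer stands in for the seven-layer, 512-channel feature extractor, a two-layer transformer stands in for the BERT encoder, and the matching against hidden-unit embeddings is simplified to a linear projection onto cluster logits scored with cross-entropy at the masked positions only. The class name, mask probability, and layer sizes are illustrative, not the published HuBERT configuration.

```python
import torch
import torch.nn as nn

class TinyHubertPretrain(nn.Module):
    """Toy masked-prediction model: conv features -> mask -> transformer -> unit logits."""
    def __init__(self, dim=512, num_units=100):
        super().__init__()
        self.conv = nn.Conv1d(1, dim, kernel_size=10, stride=5)      # stand-in feature extractor
        self.mask_embed = nn.Parameter(torch.randn(dim))             # learned mask vector
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)    # stand-in BERT encoder
        self.proj = nn.Linear(dim, num_units)                        # projection to hidden-unit logits

    def forward(self, wav, targets, mask_prob=0.08):
        x = self.conv(wav.unsqueeze(1)).transpose(1, 2)              # (B, frames, dim)
        mask = torch.rand(x.shape[:2], device=x.device) < mask_prob  # frames to mask
        x = torch.where(mask.unsqueeze(-1), self.mask_embed, x)      # replace masked frames
        logits = self.proj(self.encoder(x))                          # (B, frames, num_units)
        # Cross-entropy between predictions at masked positions and their cluster ids.
        return nn.functional.cross_entropy(logits[mask], targets[mask])

model = TinyHubertPretrain()
wav = torch.randn(2, 16000)                                          # two dummy 1-second waveforms
frames = model.conv(wav.unsqueeze(1)).shape[-1]                      # frames produced by the extractor
targets = torch.randint(0, 100, (2, frames))                         # dummy k-means hidden units
loss = model(wav, targets)
loss.backward()
```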
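For step 6, the fine-tuning objective can be sketched with PyTorch's built-in CTC loss. The sketch below assumes a placeholder encoder output and an encoder width of 768; only the vocabulary layout (blank, space, apostrophe, 26 letters) and the randomly initialized output layer follow the description above.

```python
import string
import torch
import torch.nn as nn

# CTC target vocabulary: a blank symbol, space, apostrophe and the 26 English letters.
vocab = ["<blank>", " ", "'"] + list(string.ascii_lowercase)
char_to_id = {c: i for i, c in enumerate(vocab)}

B, T, encoder_dim = 1, 200, 768                     # assumed batch, frame count, encoder width
ctc_head = nn.Linear(encoder_dim, len(vocab))       # randomly initialized output (softmax) layer
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

hidden = torch.randn(B, T, encoder_dim)             # placeholder for HuBERT encoder output
log_probs = ctc_head(hidden).log_softmax(-1)        # (B, T, vocab)

transcript = "hello world"
targets = torch.tensor([[char_to_id[c] for c in transcript]])
loss = ctc_loss(log_probs.transpose(0, 1),          # nn.CTCLoss expects (T, B, vocab)
                targets,
                input_lengths=torch.tensor([T] * B),
                target_lengths=torch.tensor([len(transcript)]))
loss.backward()
```

In actual fine-tuning, `hidden` would be the output of the pre-trained HuBERT encoder, and the loss would be backpropagated through it on labelled speech-text pairs.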
