HuBERT ARCHITECTURE

The HuBERT model consists of:

  1. Convolutional encoder – A seven-block convolutional feature encoder takes raw audio X as input and outputs latent speech representations z1, ..., zT for T time-steps, with 512 channels at each step. Its job is to reduce the dimensionality of the input data. The feature encoder, introduced in the Wav2Vec2 model and reused in HuBERT, uses temporal convolutions with 512 channels in each of its seven blocks, with strides (5, 2, 2, 2, 2, 2, 2) and kernel widths (10, 3, 3, 3, 3, 2, 2). This results in an encoder output frequency of 49 Hz (a stride of about 20 ms between frames), a receptive field of 400 input samples (25 ms of audio), and an overall down-sampling factor of 320x. The encoded audio features are then randomly masked (see the first sketch after this list).
  2. BERT encoder – The core of the HuBERT architecture is the BERT encoder, which takes the feature vectors from the convolutional encoder and produces hidden unit representations. The input sequence first passes through a feature projection layer that increases the dimension from 512 (the CNN output) to 768 for Base, 1024 for Large and 1280 for X-Large. It then moves through a stack of multi-head self-attention blocks: 12 layers with 8 heads for Base, 24 layers with 16 heads for Large and 48 layers with 16 heads for X-Large. Each block also contains a feed-forward network (FFN) that expands the dimension by a factor of 4, to an inner dimension of 3072 for Base, 4096 for Large and 5120 for X-Large, before projecting back down to the model dimension.
  3. Projection layer – A linear layer that transforms the output of the BERT encoder to match the embedding dimension used in the clustering step. The dimensions are 256, 768 and 1024 for Base, Large and X-Large respectively. Cosine similarity is computed between the projected transformer outputs and every hidden unit embedding (the embedded cluster codes from the code embedding layer below) to derive prediction logits, and a cross-entropy loss is then used to penalize incorrect predictions (see the projection/logits sketch after this list).
  4. Code embedding layer – Converts the output of the clustering step (the hidden unit ids) into embedding vectors (the hidden unit embeddings). The dimensions are 256, 768 and 1024 for Base, Large and X-Large respectively.
  5. CTC layer – During ASR fine-tuning the projection layer is removed and replaced with a softmax layer over the output vocabulary, and a CTC (Connectionist Temporal Classification) loss is used (see the CTC sketch after this list).
  6. Clustering layer – A k-means clustering step is used to generate the hidden units. In the first iteration it takes MFCC features as input; in subsequent iterations it clusters representations from an intermediate transformer layer (see the clustering sketch after this list). More details are given in the training section.
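To make the numbers in point 1 concrete, here is a minimal sketch of the convolutional feature encoder, assuming PyTorch; it is illustrative rather than the official fairseq implementation (normalization layers are omitted). It stacks seven 1-D convolutions with the strides and kernel widths listed above and shows the resulting 320x down-sampling on one second of 16 kHz audio.

```python
# Minimal sketch (not the official implementation) of the HuBERT/Wav2Vec2
# convolutional feature encoder: 7 temporal-convolution blocks, 512 channels,
# strides (5,2,2,2,2,2,2) and kernel widths (10,3,3,3,3,2,2).
import torch
import torch.nn as nn

STRIDES = (5, 2, 2, 2, 2, 2, 2)
KERNELS = (10, 3, 3, 3, 3, 2, 2)

class ConvFeatureEncoder(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        blocks = []
        in_ch = 1  # raw waveform has a single channel
        for k, s in zip(KERNELS, STRIDES):
            blocks.append(nn.Sequential(
                nn.Conv1d(in_ch, channels, kernel_size=k, stride=s),
                nn.GELU(),
            ))
            in_ch = channels
        self.blocks = nn.ModuleList(blocks)

    def forward(self, wav):          # wav: (batch, samples)
        x = wav.unsqueeze(1)         # (batch, 1, samples)
        for block in self.blocks:
            x = block(x)
        return x.transpose(1, 2)     # (batch, frames, 512)

# One second of 16 kHz audio -> 49 frames (320x down-sampling, ~20 ms hop).
encoder = ConvFeatureEncoder()
frames = encoder(torch.randn(1, 16000))
print(frames.shape)                  # torch.Size([1, 49, 512])
```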
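Points 3 and 4 can be summarized in a few lines: project the transformer output down to the clustering dimension, embed every cluster code with the code embedding layer, and use scaled cosine similarities as prediction logits. The sketch below is illustrative only (Base-model dimensions, 100 clusters, an assumed temperature value), not the exact fairseq code.

```python
# Hedged sketch of the projection layer, code-embedding table and
# cosine-similarity logits for the BASE model (768-d transformer output,
# 256-d projection, K hidden units). Names and values are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

D_MODEL, D_PROJ, NUM_UNITS = 768, 256, 100   # e.g. 100 k-means clusters in iteration 1

proj = nn.Linear(D_MODEL, D_PROJ)            # projection layer (item 3)
code_emb = nn.Embedding(NUM_UNITS, D_PROJ)   # code embedding layer (item 4)
logit_temp = 0.1                             # assumed temperature for scaling similarities

def prediction_logits(transformer_out):
    """transformer_out: (batch, frames, 768) -> logits: (batch, frames, NUM_UNITS)."""
    x = F.normalize(proj(transformer_out), dim=-1)   # unit-length frame vectors
    e = F.normalize(code_emb.weight, dim=-1)         # unit-length code embeddings
    return torch.matmul(x, e.t()) / logit_temp       # scaled cosine similarities

# Cross-entropy against the k-means cluster ids (dummy targets here).
hidden = torch.randn(2, 49, D_MODEL)                 # stand-in for BERT-encoder output
targets = torch.randint(0, NUM_UNITS, (2, 49))
loss = F.cross_entropy(prediction_logits(hidden).transpose(1, 2), targets)
```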
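For point 5, the sketch below shows how a CTC objective could be attached on top of the encoder output during ASR fine-tuning. The vocabulary size, transcript lengths and linear head are placeholder values for illustration, not the settings from the paper.

```python
# Illustrative sketch of the CTC objective used during ASR fine-tuning:
# the transformer output is mapped to vocabulary logits and nn.CTCLoss aligns
# them with the (shorter) character transcript. Shapes are dummy values.
import torch
import torch.nn as nn

VOCAB = 32                                    # e.g. characters + blank (id 0), illustrative
frames, batch = 49, 2
lm_head = nn.Linear(768, VOCAB)               # replaces the pre-training projection layer
ctc = nn.CTCLoss(blank=0)

hidden = torch.randn(batch, frames, 768)      # stand-in for BERT-encoder output
log_probs = lm_head(hidden).log_softmax(-1).transpose(0, 1)   # (frames, batch, VOCAB)

targets = torch.randint(1, VOCAB, (batch, 12))                # dummy transcripts
loss = ctc(log_probs, targets,
           input_lengths=torch.full((batch,), frames),
           target_lengths=torch.full((batch,), 12))
```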
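Finally, point 6 (the clustering step) amounts to running k-means over frame-level features and using each frame's cluster id as its pseudo-label. Below is a hedged sketch using librosa and scikit-learn as stand-ins; the file name and the 13 MFCC coefficients are illustrative choices rather than the paper's exact setup.

```python
# Hedged sketch of the clustering step: k-means over MFCC frames produces one
# "hidden unit" id per frame. "speech.wav" and n_mfcc=13 are illustrative.
import librosa
from sklearn.cluster import KMeans

wav, sr = librosa.load("speech.wav", sr=16000)            # hypothetical input file
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13).T    # (frames, 13)

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(mfcc)
hidden_units = kmeans.predict(mfcc)                       # one cluster id per frame
```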

How all these components interact during pre-training and fine-tuning is explained in detail in the training section.

HuBERT Model Architecture

HuBERT Model

Since the introduction of the Wav2Vec model, self-supervised learning research in speech has gained momentum. HuBERT is a self-supervised model that allows the BERT model to be applied to audio inputs. Applying BERT to sound is challenging because sound units have variable length and there can be multiple sound units in each input. In order to apply the BERT model, we need to discretize the audio input. This is achieved through hidden units (Hu), as explained in detail below; hence the name HuBERT. However, before understanding HuBERT, we must get a basic understanding of BERT, as HuBERT is based on it.
