HuBERT ARCHITECTURE

The HuBERT model consists of:

  1. Convolutional encoder – A seven-block convolutional feature encoder takes raw audio X as input and outputs latent speech representations z1, ..., zT for T time-steps, with 512 channels at each step. Its job is to reduce the dimensionality of the input data. The feature encoder, introduced in the Wav2Vec2 model and reused in HuBERT, uses temporal convolutions with 512 channels in each of its seven blocks, with strides (5, 2, 2, 2, 2, 2, 2) and kernel widths (10, 3, 3, 3, 3, 2, 2). This results in an encoder output frequency of 49 Hz (a stride of about 20 ms between frames), a receptive field of 400 input samples (25 ms of audio), and an overall down-sampling factor of 320x. The encoded audio features are then randomly masked (see the first sketch after this list).
  2. BERT encoder – The core of the HuBERT architecture is the BERT encoder, which takes the feature vectors from the convolutional encoder and produces hidden unit representations. The input sequence first passes through a feature projection layer that increases the dimension from 512 (the CNN output) to 768 for Base, 1024 for Large and 1280 for X-Large. It then moves through a stack of multi-head self-attention blocks: 12 layers with 8 heads for Base, 24 layers with 16 heads for Large and 48 layers with 16 heads for X-Large. Each block also contains a feed-forward network (FFN) that expands the dimension by a factor of 4, to an inner dimension of 3072 for Base, 4096 for Large and 5120 for X-Large, before projecting back down to the model dimension.
  3. Projection layer – A linear layer that transforms the output of the BERT encoder to match the embedding dimension used in the clustering step. The dimensions are 256, 768 and 1024 for Base, Large and X-Large respectively. Cosine similarity is computed between the projected transformer outputs and every hidden unit embedding (the embedded cluster codes from the code embedding layer below) to derive prediction logits, and a cross-entropy loss is then used to penalize incorrect predictions (see the projection/logits sketch after this list).
  4. Code embedding layer – Converts the output of the clustering step (the hidden unit ids) into embedding vectors (the hidden unit embeddings). The dimensions are 256, 768 and 1024 for Base, Large and X-Large respectively.
  5. CTC layer – During ASR fine-tuning the projection layer is removed and replaced with a softmax layer over the output vocabulary, and a CTC (Connectionist Temporal Classification) loss is used (see the CTC sketch after this list).
  6. Clustering layer – A k-means clustering step is used to generate the hidden units. In the first iteration it takes MFCC features as input; in subsequent iterations it clusters representations from an intermediate transformer layer (see the clustering sketch after this list). More details are given in the training section.
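To make the numbers in point 1 concrete, here is a minimal sketch of the convolutional feature encoder, assuming PyTorch; it is illustrative rather than the official fairseq implementation (normalization layers are omitted). It stacks seven 1-D convolutions with the strides and kernel widths listed above and shows the resulting 320x down-sampling on one second of 16 kHz audio.

```python
# Minimal sketch (not the official implementation) of the HuBERT/Wav2Vec2
# convolutional feature encoder: 7 temporal-convolution blocks, 512 channels,
# strides (5,2,2,2,2,2,2) and kernel widths (10,3,3,3,3,2,2).
import torch
import torch.nn as nn

STRIDES = (5, 2, 2, 2, 2, 2, 2)
KERNELS = (10, 3, 3, 3, 3, 2, 2)

class ConvFeatureEncoder(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        blocks = []
        in_ch = 1  # raw waveform has a single channel
        for k, s in zip(KERNELS, STRIDES):
            blocks.append(nn.Sequential(
                nn.Conv1d(in_ch, channels, kernel_size=k, stride=s),
                nn.GELU(),
            ))
            in_ch = channels
        self.blocks = nn.ModuleList(blocks)

    def forward(self, wav):          # wav: (batch, samples)
        x = wav.unsqueeze(1)         # (batch, 1, samples)
        for block in self.blocks:
            x = block(x)
        return x.transpose(1, 2)     # (batch, frames, 512)

# One second of 16 kHz audio -> 49 frames (320x down-sampling, ~20 ms hop).
encoder = ConvFeatureEncoder()
frames = encoder(torch.randn(1, 16000))
print(frames.shape)                  # torch.Size([1, 49, 512])
```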
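Points 3 and 4 can be summarized in a few lines: project the transformer output down to the clustering dimension, embed every cluster code with the code embedding layer, and use scaled cosine similarities as prediction logits. The sketch below is illustrative only (Base-model dimensions, 100 clusters, an assumed temperature value), not the exact fairseq code.

```python
# Hedged sketch of the projection layer, code-embedding table and
# cosine-similarity logits for the BASE model (768-d transformer output,
# 256-d projection, K hidden units). Names and values are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

D_MODEL, D_PROJ, NUM_UNITS = 768, 256, 100   # e.g. 100 k-means clusters in iteration 1

proj = nn.Linear(D_MODEL, D_PROJ)            # projection layer (item 3)
code_emb = nn.Embedding(NUM_UNITS, D_PROJ)   # code embedding layer (item 4)
logit_temp = 0.1                             # assumed temperature for scaling similarities

def prediction_logits(transformer_out):
    """transformer_out: (batch, frames, 768) -> logits: (batch, frames, NUM_UNITS)."""
    x = F.normalize(proj(transformer_out), dim=-1)   # unit-length frame vectors
    e = F.normalize(code_emb.weight, dim=-1)         # unit-length code embeddings
    return torch.matmul(x, e.t()) / logit_temp       # scaled cosine similarities

# Cross-entropy against the k-means cluster ids (dummy targets here).
hidden = torch.randn(2, 49, D_MODEL)                 # stand-in for BERT-encoder output
targets = torch.randint(0, NUM_UNITS, (2, 49))
loss = F.cross_entropy(prediction_logits(hidden).transpose(1, 2), targets)
```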
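For point 5, the sketch below shows how a CTC objective could be attached on top of the encoder output during ASR fine-tuning. The vocabulary size, transcript lengths and linear head are placeholder values for illustration, not the settings from the paper.

```python
# Illustrative sketch of the CTC objective used during ASR fine-tuning:
# the transformer output is mapped to vocabulary logits and nn.CTCLoss aligns
# them with the (shorter) character transcript. Shapes are dummy values.
import torch
import torch.nn as nn

VOCAB = 32                                    # e.g. characters + blank (id 0), illustrative
frames, batch = 49, 2
lm_head = nn.Linear(768, VOCAB)               # replaces the pre-training projection layer
ctc = nn.CTCLoss(blank=0)

hidden = torch.randn(batch, frames, 768)      # stand-in for BERT-encoder output
log_probs = lm_head(hidden).log_softmax(-1).transpose(0, 1)   # (frames, batch, VOCAB)

targets = torch.randint(1, VOCAB, (batch, 12))                # dummy transcripts
loss = ctc(log_probs, targets,
           input_lengths=torch.full((batch,), frames),
           target_lengths=torch.full((batch,), 12))
```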
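Finally, point 6 (the clustering step) amounts to running k-means over frame-level features and using each frame's cluster id as its pseudo-label. Below is a hedged sketch using librosa and scikit-learn as stand-ins; the file name and the 13 MFCC coefficients are illustrative choices rather than the paper's exact setup.

```python
# Hedged sketch of the clustering step: k-means over MFCC frames produces one
# "hidden unit" id per frame. "speech.wav" and n_mfcc=13 are illustrative.
import librosa
from sklearn.cluster import KMeans

wav, sr = librosa.load("speech.wav", sr=16000)            # hypothetical input file
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13).T    # (frames, 13)

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(mfcc)
hidden_units = kmeans.predict(mfcc)                       # one cluster id per frame
```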

How all these components interact during pre-training and fine-tuning is explained in detail in the training section.

HuBERT Model Architecture

HuBERT Model

Since the introduction of the Wav2Vec model, self-supervised learning research in speech has gained momentum. HuBERT is a self-supervised model that allows the BERT model to be applied to audio inputs. Applying BERT to sound is challenging because sound units have variable length and there can be multiple sound units in each input. In order to apply the BERT model, we need to discretize the audio input. This is achieved through hidden units (Hu), as explained in detail below; hence the name HuBERT. However, before understanding HuBERT, we must get a basic understanding of BERT, as HuBERT is based on it.
