Training Phase

The training is divided into two parts – Pre-Training and Fine-Tuning.

  1. Pre-Training – The objective of the pre-training step is to learn hidden-unit representations. It is run for two iterations for the 'Base' model and one iteration for the 'Large' and 'X-Large' models. Since the training of 'Large' and 'X-Large' takes its clustering targets from the 'Base' model, their single iteration is effectively a third iteration.
  2. Fine-Tuning – Here the model is fine-tuned for the ASR task.

The steps below describe the training process in detail.

  1. The input audio waveform is converted to 39-dimensional MFCC (Mel-Frequency Cepstral Coefficients) features.
  2. K-means clustering with 100 clusters is run on these features. Each segment of the audio is assigned to one of the k clusters, and these clusters become the hidden units. Each hidden unit is mapped to an embedding vector, and these embeddings act as the TARGETS to be predicted by the BERT model in step 3 (see the clustering sketch after this list).

    Pre-Training Iteration 1: Clustering

  3. The raw audio input is passed through the convolutional feature extractor (composed of seven 512-channel convolution layers). Some of the resulting feature frames are masked, and the sequence is fed into the BERT encoder. The encoder's objective is to predict, for each masked position, a representation that matches the hidden unit obtained in step 2. Because the output of the BERT encoder has a higher dimension than the embedding dimension, a projection layer transforms the encoder output, and cross-entropy loss penalizes wrong predictions (a condensed sketch of this masked-prediction objective follows the list).

    Pre-Training Iteration 1: Masked Language Modeling

  4. The above constitutes the first iteration of the model training. In the second iteration, the clustering step uses, instead of the MFCC features, the output of an intermediate layer of the BERT encoder from the previous iteration (the 6th layer for the 'Base' model).

    Pre-Training Iteration 2: Clustering

  5. Steps 2-4 are referred to as 'pre-training', during which the model learns meaningful high-level representations. Pre-training is run for two iterations for the 'Base' model, while the 'Large' and 'X-Large' models are trained for one iteration each: instead of restarting the iterative process from clustering MFCC features, features from the 9th transformer layer of the second-iteration 'Base' HuBERT are used for clustering and for training these two models.
  6. Fine-Tuning – After pre-training, CTC loss is used to fine-tune the model for ASR. The projection layer is replaced with a randomly initialized softmax layer. The CTC target vocabulary includes the 26 English characters, a space token, an apostrophe, and a special CTC blank symbol (a minimal sketch of this CTC setup follows the list).
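The clustering that produces the pre-training targets (steps 1-2) can be sketched with off-the-shelf tools. The snippet below is a minimal illustration, assuming librosa for MFCC extraction and scikit-learn for k-means; the synthetic waveform, the 13-coefficient-plus-deltas layout of the 39 dimensions, and the k-means settings are illustrative assumptions rather than the exact HuBERT recipe.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

def mfcc_39(wav, sr=16000):
    """Frame-level 39-d MFCCs: 13 static coefficients plus deltas and delta-deltas."""
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13)   # (13, frames)
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T                     # (frames, 39)

sr = 16000
wav = np.random.randn(10 * sr)        # stand-in for real speech; pool a whole corpus in practice
feats = mfcc_39(wav, sr)

# Fit k-means with 100 clusters; each frame's cluster id is its "hidden unit",
# i.e. the discrete target the BERT encoder must predict during pre-training.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(feats)
hidden_units = kmeans.predict(feats)  # shape (frames,), values in [0, 100)
```

For the second iteration (step 4), the same clustering is applied to features taken from an intermediate transformer layer of the first-iteration model instead of the MFCCs.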
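The masked-prediction objective in step 3 can be illustrated with a deliberately condensed PyTorch toy model. Assumptions to note: a single convolution layer stands in for the seven-layer, 512-channel feature extractor, a two-layer transformer stands in for the BERT encoder, and the matching against hidden-unit embeddings is simplified to a linear projection onto cluster logits scored with cross-entropy at the masked positions only. The class name, mask probability, and layer sizes are illustrative, not the published HuBERT configuration.

```python
import torch
import torch.nn as nn

class TinyHubertPretrain(nn.Module):
    """Toy masked-prediction model: conv features -> mask -> transformer -> unit logits."""
    def __init__(self, dim=512, num_units=100):
        super().__init__()
        self.conv = nn.Conv1d(1, dim, kernel_size=10, stride=5)      # stand-in feature extractor
        self.mask_embed = nn.Parameter(torch.randn(dim))             # learned mask vector
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)    # stand-in BERT encoder
        self.proj = nn.Linear(dim, num_units)                        # projection to hidden-unit logits

    def forward(self, wav, targets, mask_prob=0.08):
        x = self.conv(wav.unsqueeze(1)).transpose(1, 2)              # (B, frames, dim)
        mask = torch.rand(x.shape[:2], device=x.device) < mask_prob  # frames to mask
        x = torch.where(mask.unsqueeze(-1), self.mask_embed, x)      # replace masked frames
        logits = self.proj(self.encoder(x))                          # (B, frames, num_units)
        # Cross-entropy between predictions at masked positions and their cluster ids.
        return nn.functional.cross_entropy(logits[mask], targets[mask])

model = TinyHubertPretrain()
wav = torch.randn(2, 16000)                                          # two dummy 1-second waveforms
frames = model.conv(wav.unsqueeze(1)).shape[-1]                      # frames produced by the extractor
targets = torch.randint(0, 100, (2, frames))                         # dummy k-means hidden units
loss = model(wav, targets)
loss.backward()
```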
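For step 6, the fine-tuning objective can be sketched with PyTorch's built-in CTC loss. The sketch below assumes a placeholder encoder output and an encoder width of 768; only the vocabulary layout (blank, space, apostrophe, 26 letters) and the randomly initialized output layer follow the description above.

```python
import string
import torch
import torch.nn as nn

# CTC target vocabulary: a blank symbol, space, apostrophe and the 26 English letters.
vocab = ["<blank>", " ", "'"] + list(string.ascii_lowercase)
char_to_id = {c: i for i, c in enumerate(vocab)}

B, T, encoder_dim = 1, 200, 768                     # assumed batch, frame count, encoder width
ctc_head = nn.Linear(encoder_dim, len(vocab))       # randomly initialized output (softmax) layer
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

hidden = torch.randn(B, T, encoder_dim)             # placeholder for HuBERT encoder output
log_probs = ctc_head(hidden).log_softmax(-1)        # (B, T, vocab)

transcript = "hello world"
targets = torch.tensor([[char_to_id[c] for c in transcript]])
loss = ctc_loss(log_probs.transpose(0, 1),          # nn.CTCLoss expects (T, B, vocab)
                targets,
                input_lengths=torch.tensor([T] * B),
                target_lengths=torch.tensor([len(transcript)]))
loss.backward()
```

In actual fine-tuning, `hidden` would be the output of the pre-trained HuBERT encoder, and the loss would be backpropagated through it on labelled speech-text pairs.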
