Wav2Vec2

The architecture of HuBERT is very similar to that of Wav2Vec2; it is the training process that differs substantially. Let's first get a brief understanding of the Wav2Vec2 model.

Wav2Vec2 is a deep learning model designed for automatic speech recognition (ASR). It was developed by Facebook AI Research and introduced in 2020, and it represents a significant advancement in ASR technology. It builds on the original Wav2Vec model and leverages the power of transformers, with a training objective similar to BERT's masked language modelling objective but adapted for speech. The four key components of Wav2Vec2 are the feature encoder, the context network, the quantization module, and the contrastive loss (pre-training objective).
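As a minimal sketch of how these pieces fit together (using the Hugging Face transformers library; the facebook/wav2vec2-base checkpoint is just an illustrative choice), the snippet below passes one second of dummy audio through a pre-trained Wav2Vec2 model and inspects the outputs of the feature encoder and the context network:

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Illustrative checkpoint; any pre-trained Wav2Vec2 checkpoint behaves similarly.
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

# One second of dummy 16 kHz audio stands in for a real waveform.
waveform = torch.randn(16000)
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Feature encoder output: one 512-dim vector roughly every 20 ms of audio.
print(outputs.extract_features.shape)   # e.g. torch.Size([1, 49, 512])
# Context network output: contextualized 768-dim representations of the same frames.
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 49, 768])
```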

Wav2Vec2 operates in a two-step process: pre-training and fine-tuning. During pre-training, the model learns from large amounts of unlabelled audio in a self-supervised fashion: spans of the latent speech representations are masked, and the model is trained with a contrastive objective to pick out the correct quantized representation of each masked time step from a set of distractors. No text transcriptions are used at this stage. This allows it to capture phonetic and linguistic features directly from the audio.
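The snippet below is only a toy illustration of this contrastive idea, not the actual Wav2Vec2 training code: for a masked time step, the context vector should be more similar to the true quantized target than to randomly sampled distractor frames.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_frames, dim = 50, 256

context = torch.randn(num_frames, dim)    # stand-in for context network outputs
quantized = torch.randn(num_frames, dim)  # stand-in for quantized latent targets

masked_t = 10                                              # one masked time step
negatives = quantized[torch.randint(0, num_frames, (5,))]  # randomly sampled distractors
candidates = torch.cat([quantized[masked_t:masked_t + 1], negatives], dim=0)

# Cosine similarity between the masked step's context vector and each candidate.
similarity = F.cosine_similarity(context[masked_t].unsqueeze(0), candidates, dim=-1)

# InfoNCE-style contrastive loss: the true quantized target sits at index 0.
loss = F.cross_entropy(similarity.unsqueeze(0) / 0.1, torch.tensor([0]))
print(loss.item())
```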

After pre-training, Wav2Vec2 can be fine-tuned for specific ASR tasks. It has shown impressive results on various ASR benchmarks while reducing the need for the large amounts of transcribed data that ASR systems traditionally required. Wav2Vec2 has had a significant impact on the development of more accurate and efficient speech recognition models.
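For example, here is a short sketch of ASR inference with a fine-tuned Wav2Vec2 checkpoint using the Hugging Face transformers library (the facebook/wav2vec2-base-960h checkpoint and the small LibriSpeech dummy dataset are illustrative choices):

```python
import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# A single 16 kHz utterance from a small public test split.
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio = dataset[0]["audio"]

inputs = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: take the most likely token per frame, then collapse repeats and blanks.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```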

HuBERT Model

Since the introduction of the Wav2Vec model, self-supervised learning research in speech has gained momentum. HuBERT is a self-supervised model that allows a BERT-style model to be applied to audio inputs. Applying BERT to a sound input is challenging because sound units have variable length and there can be multiple sound units in each input. In order to apply the BERT model, we need to discretize the audio input. This is achieved through hidden units (Hu), as explained in detail below; hence the name HuBERT. However, before understanding HuBERT, we must get a basic understanding of BERT, as HuBERT is based on it.
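As a rough sketch of the hidden-unit idea (the cluster count and the torchaudio/scikit-learn choices here are illustrative; HuBERT's first training iteration clusters MFCC features with k-means), each audio frame is mapped to a discrete cluster ID that a BERT-style masked-prediction objective can then use as its target:

```python
import torch
import torchaudio
from sklearn.cluster import KMeans

# Dummy 16 kHz waveform standing in for real speech.
waveform = torch.randn(1, 16000)

# Frame-level MFCC features, reshaped to (num_frames, num_coefficients).
mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=13)(waveform)
frames = mfcc.squeeze(0).transpose(0, 1).numpy()

# Cluster the frames into a small codebook of "hidden units".
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(frames)

# Each frame is now a discrete unit that masked prediction can be trained against.
hidden_units = kmeans.labels_
print(hidden_units[:20])
```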


