What is BERT?

BERT is short for Bidirectional Encoder Representations from Transformers. To understand BERT, one should be familiar with the Transformer architecture and its self-attention mechanism. BERT is built from the encoder blocks of the Transformer: it receives the entire input sequence at once, so every token can attend to its full context.

BERT is a natural language processing (NLP) model introduced by Google AI in 2018. It advanced the state of the art on a wide range of NLP tasks by pre-training a deep Transformer encoder on massive amounts of text, producing contextualized word embeddings. Unlike earlier models that used unidirectional context (either left-to-right or right-to-left), BERT is bidirectional: it looks at both the left and right context of a word simultaneously, which lets it capture richer contextual information.
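
To make the idea of contextual embeddings concrete, the sketch below is an illustrative example (assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint) that extracts the vector for the word "bank" in two sentences; because BERT attends to both the left and right context, the two vectors come out different.

```python
import torch
from transformers import BertTokenizer, BertModel

# Load a pre-trained BERT encoder and its tokenizer.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = [
    "He went to the bank to deposit money.",   # financial institution
    "They had a picnic on the river bank.",    # edge of a river
]

with torch.no_grad():
    for text in sentences:
        inputs = tokenizer(text, return_tensors="pt")
        hidden = model(**inputs).last_hidden_state[0]  # (tokens, 768)
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
        bank_vec = hidden[tokens.index("bank")]
        # The embedding of "bank" depends on the whole sentence,
        # so these two vectors are not the same.
        print(text, bank_vec[:5])
```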

After pre-training on a vast amount of text, BERT can be fine-tuned for specific NLP tasks such as text classification, question-answering, and sentiment analysis. It has set the standard for many NLP benchmarks, and its ability to understand context and semantics has made it widely adopted in various NLP applications.
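
As an illustrative sketch of fine-tuning (assuming the Hugging Face transformers library; the two-class sentiment setup here is just a placeholder task), one typically places a small task-specific head on top of the pre-trained encoder and trains it on labelled data:

```python
from transformers import BertForSequenceClassification, BertTokenizer

# Pre-trained encoder plus a freshly initialised classification head.
# The head's weights are learned during fine-tuning on labelled data.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # e.g. positive / negative sentiment
)

inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
logits = model(**inputs).logits  # (batch, num_labels); head is still untrained
print(logits)
```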

BERT is pre-trained using two objectives:

  1. Masked Language Modeling (MLM) – a certain percentage of the input tokens are masked, and the model is trained to predict the masked tokens (see the sketch after this list).
  2. Next Sentence Prediction (NSP) – the model is given two sentences and is trained to predict whether the second sentence follows the first.
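
As a minimal, illustrative sketch of the masked language modeling objective, the snippet below (assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint, which was pre-trained with exactly this objective) asks BERT to fill in a masked token. Next sentence prediction uses a separate classification head during pre-training and is not shown here.

```python
from transformers import pipeline

# A fill-mask pipeline built on a BERT checkpoint that was
# pre-trained with the masked language modeling objective.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT reads the whole sentence and predicts the hidden word
# from both its left and right context.
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```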

To apply MLM to speech, the input must first be discretized, and this is where the idea of hidden units comes in. An offline clustering step is used to generate discrete target labels, and the BERT-style model consumes the masked speech features and learns to predict the target labels produced by the clustering step. This way of generating training targets is what makes HuBERT different from Wav2Vec2, and it is explained in detail below.
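
A rough sketch of such an offline clustering step is shown below. It is illustrative only: it assumes librosa and scikit-learn are installed, uses a hypothetical local file speech.wav as a stand-in for a real corpus, and clusters MFCC features, which is what HuBERT's first training iteration does (later iterations re-cluster the model's own intermediate representations).

```python
import librosa
from sklearn.cluster import KMeans

# Offline clustering step (illustrative): turn continuous speech
# features into discrete "hidden units" that can serve as targets
# for BERT-style masked prediction.
waveform, sr = librosa.load("speech.wav", sr=16000)          # hypothetical file
mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13).T  # (frames, 13)

n_units = 100  # codebook size is a design choice
kmeans = KMeans(n_clusters=n_units, n_init=10, random_state=0).fit(mfcc)

# One integer label per frame: these IDs play the role of the "words"
# that the masked prediction loss is computed over.
hidden_units = kmeans.predict(mfcc)
print(hidden_units[:20])
```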

HuBERT Model

Since the introduction of the Wav2Vec model, self-supervised learning research in speech has gained momentum. HuBERT is a self-supervised model that applies BERT-style training to audio inputs. This is challenging because sound units have variable length and a single input can contain many of them, so the audio must first be discretized. The discretization is achieved through hidden units (Hu), as explained in detail below, hence the name HuBERT. HuBERT builds directly on the BERT concepts described above.
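
As a small, hedged example of what such a model consumes and produces (assuming the Hugging Face transformers library and the public facebook/hubert-base-ls960 checkpoint), the sketch below runs a dummy waveform through a pre-trained HuBERT encoder:

```python
import torch
from transformers import HubertModel

# Load a pre-trained HuBERT encoder: it consumes a raw waveform and
# produces one contextual vector per audio frame.
model = HubertModel.from_pretrained("facebook/hubert-base-ls960")
model.eval()

# One second of silence at 16 kHz in place of a real recording.
waveform = torch.zeros(1, 16000)

with torch.no_grad():
    outputs = model(input_values=waveform)

# (batch, frames, hidden_size): one vector per ~20 ms of audio,
# the granularity at which hidden-unit targets are predicted.
print(outputs.last_hidden_state.shape)
```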
