Implementing Automatic Speech Recognition

Below is the Python code for ASR inference using the Hugging Face transformers library.

Before starting with the implementation, we need to install the librosa, torch, and transformers libraries.

!pip install librosa
!pip install torch
!pip install transformers

Let’s import the installed libraries.

  • librosa is a Python package for analyzing and processing audio signals
  • torch is an open-source machine learning framework that provides a flexible and efficient platform for building and training deep neural networks
  • HubertForCTC is the transformers class for using the HuBERT model on connectionist temporal classification (CTC) tasks such as speech recognition
  • AutoProcessor loads the processor that preprocesses raw audio into model-ready inputs

Python3

import librosa
import torch
from transformers import HubertForCTC, AutoProcessor


Define the path to the audio file that we want to transcribe. The audio file can be downloaded from here.

Python3

# Path of audio file to be transcribed
AUDIO_FILE = 'harvard.wav'
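
The pretrained checkpoint we use below was fine-tuned on 16 kHz speech, so the audio must reach the model at that rate. librosa will resample on load (shown later), but as an optional sanity check you can first inspect the file's native rate. A minimal sketch, assuming a reasonably recent librosa version:

Python3

# Optional sanity check: the file's native sampling rate
# (librosa.load will resample to 16 kHz later regardless)
print(librosa.get_samplerate(AUDIO_FILE))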


In the next step, we initialize an AutoProcessor to preprocess the audio data and align it with the model's expected input format. We also initialize the HubertForCTC model, which we will use to transcribe the audio.

Python3

# Load the model and tokenizer
processor = AutoProcessor.from_pretrained("facebook/hubert-large-ls960-ft")
model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft")
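
The snippet above runs on CPU by default. As an optional tweak (not required for this tutorial), inference can be sped up by switching the model to evaluation mode and moving it, together with the inputs, to a GPU if one is available. A minimal sketch:

Python3

# Optional: use a GPU when available and switch to evaluation mode
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()
# Later, move the processed inputs to the same device before the
# forward pass, e.g. logits = model(inputs.to(device)).logits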


Now, using the librosa library, we load the specified audio file. librosa.load returns the audio waveform (speech) and its sampling rate (rate); passing sr=16000 resamples the audio to 16 kHz, which is what the model expects.

Then, using the AutoProcessor, we preprocess the audio. It converts the waveform into tensors that can be fed into the model.

The torch.no_grad() context manager disables gradient computation, which saves memory and time during inference, when we don't need to update model weights.

Then, we pass the processed audio data through the HuBERT model to obtain the logits. The logits are the model's unnormalized predictions. After this, we compute the predicted token IDs by taking the argmax along the last dimension of the logits.

Python3

# Load the audio file, resampling to 16 kHz
speech, rate = librosa.load(AUDIO_FILE, sr=16000)

# Preprocess the waveform into model input tensors
inputs = processor(speech, return_tensors="pt", sampling_rate=rate).input_values

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(inputs).logits

# Greedy decoding: keep the most likely token at each frame
predicted_ids = torch.argmax(logits, dim=-1)
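
For intuition, the logits returned by a CTC model have shape (batch, frames, vocab_size): one score per vocabulary token for every audio frame, and the argmax keeps the highest-scoring token per frame. You can verify this with a quick, optional inspection:

Python3

# Optional: inspect prediction shapes
print(logits.shape)         # (batch, frames, vocab_size)
print(predicted_ids.shape)  # (batch, frames)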


We use the batch_decode method to convert the predicted token IDs into a human-readable transcript.

Python3

# Decode the predicted IDs into text and print the transcription
transcription = processor.batch_decode(predicted_ids)
print(transcription)


Output:

['THE STALE SMELL OF OLD BEER LINGERS IT TAKES HEAT TO BRING OUT THE ODOR A COLD DIP RESTORES HEALTH AND ZEST A SALT PICKLE TASTES FINE WITH HAM TACOS AL PASTOR ARE MY FAVORITE A ZESTFUL FOOD IS THE HOT CROSS BUN']
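
The harvard.wav clip is short enough to transcribe in one pass. For much longer recordings, memory can become a problem; one simple workaround (a sketch, not part of the original pipeline, with an arbitrarily chosen chunk length, reusing the speech, rate, processor, and model objects defined above) is to split the waveform into fixed-size chunks and transcribe each chunk separately. Note that naive chunking can cut through words, so splitting on silences is preferable in practice.

Python3

# A sketch: chunked inference for long audio
CHUNK_SECONDS = 30  # arbitrary chunk length
chunk_size = CHUNK_SECONDS * rate

pieces = []
for start in range(0, len(speech), chunk_size):
    chunk = speech[start:start + chunk_size]
    chunk_inputs = processor(chunk, return_tensors="pt",
                             sampling_rate=rate).input_values
    with torch.no_grad():
        chunk_logits = model(chunk_inputs).logits
    ids = torch.argmax(chunk_logits, dim=-1)
    pieces.append(processor.batch_decode(ids)[0])

print(" ".join(pieces))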

HuBERT Model

Since the introduction of the Wav2Vec model, self-supervised learning research in speech has gained momentum. HuBERT is a self-supervised model that allows the BERT model to be applied to audio inputs. Applying a BERT model to a sound input is challenging, as sound units have variable length and there can be multiple sound units in each input. In order to apply the BERT model, we need to discretize the audio input. This is achieved through hidden units (Hu), as illustrated below; hence the name HuBERT. However, before understanding HuBERT, we must first get a basic understanding of BERT, as HuBERT is based on it.
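
To make the idea of hidden units concrete: in HuBERT's first pre-training iteration, frame-level MFCC features are clustered with k-means (the paper uses 100 clusters), and the resulting cluster IDs serve as the discrete pseudo-labels for BERT-style masked prediction. The sketch below illustrates that discretization step on our audio file. It is a simplified illustration, not the real pipeline (which clusters a large unlabeled corpus and, in later iterations, re-clusters the model's own hidden representations), and it assumes scikit-learn is installed.

Python3

import librosa
from sklearn.cluster import KMeans

# Illustration of hidden-unit discovery: cluster frame-level MFCCs
# and treat each cluster ID as a discrete "sound unit"
speech, rate = librosa.load('harvard.wav', sr=16000)
mfcc = librosa.feature.mfcc(y=speech, sr=rate, n_mfcc=13).T  # (frames, 13)

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(mfcc)
hidden_units = kmeans.labels_  # one discrete unit per analysis frame
print(hidden_units[:20])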
