Implementing Automatic Speech Recognition

Below is the Python code for ASR inference using the Hugging Face Transformers library.

Before starting with the implementation, we need to install the librosa, torch, and transformers libraries.

!pip install librosa
!pip install torch
!pip install transformers

Let’s import the installed libraries.

librosa is a Python package for analyzing and processing audio signals.
PyTorch (torch) is an open-source machine learning framework that provides a flexible and efficient platform for building and training deep neural networks.
HubertForCTC is the Transformers class for a HuBERT model with a CTC (connectionist temporal classification) head, which is used for speech recognition.
AutoProcessor loads the processor that matches a given checkpoint; here it handles audio preprocessing and decoding of the model output.

Python3

import librosa
import torch
from transformers import HubertForCTC, AutoProcessor

Define the path to the audio file that we want to transcribe. The audio file can be downloaded from here.

Python3

# Path of audio file to be transcribed
AUDIO_FILE = 'harvard.wav'
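
Before loading the file, it can help to check its native sampling rate, since the loading step later resamples the audio to 16 kHz. The following is a minimal sketch using Python's standard wave module; the helper name wav_info is hypothetical, and the approach assumes an uncompressed PCM WAV file.

```python
import wave

def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()        # samples per second
        channels = wav.getnchannels()    # 1 = mono, 2 = stereo
        duration = wav.getnframes() / rate
    return rate, channels, duration
```

Calling wav_info(AUDIO_FILE) would report the file's original rate and length without decoding the whole waveform.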

In the next step, we initialize an AutoProcessor to preprocess the audio into the format the model expects, and a HubertForCTC model to transcribe the audio.

Python3

# Load the processor and the model
processor = AutoProcessor.from_pretrained("facebook/hubert-large-ls960-ft")
model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft")

Now, using the librosa library, we load the specified audio file. librosa.load returns the audio waveform (stored in speech) and the sampling rate (stored in rate); passing sr=16000 resamples the audio to 16 kHz, the rate the model expects.

Then, using the processor, we preprocess the audio. It converts the waveform into input values that can be fed into the model.

The torch.no_grad() block disables gradient tracking for the operations inside it, which saves memory and computation when you don’t need to update model weights, as in inference.

Then, we pass the processed audio data through the HuBERT model to obtain the logits. The logits are unnormalized scores over the vocabulary for each audio frame. After this, we calculate the predicted token IDs by taking the argmax along the last dimension of the logits.

Python3

# Load the audio file, resampled to 16 kHz
speech, rate = librosa.load(AUDIO_FILE, sr=16000)

# Preprocess the waveform into model input values
inputs = processor(speech, return_tensors="pt", sampling_rate=rate).input_values

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(inputs).logits

# Pick the most likely token ID for each frame
predicted_ids = torch.argmax(logits, dim=-1)
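
To illustrate what the argmax step in the code above does, here is a minimal sketch in plain Python. The logits are made-up numbers for three frames over a four-token vocabulary, not real model output, and the helper names softmax and argmax_last_dim are hypothetical. Because softmax preserves ordering, taking the argmax of the raw logits selects the same token as taking it over the normalized probabilities.

```python
import math

def softmax(row):
    """Convert one frame of logits into probabilities."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def argmax_last_dim(frames):
    """Return the index of the largest value in each frame."""
    return [max(range(len(row)), key=row.__getitem__) for row in frames]

# Made-up logits: 3 frames, 4 vocabulary entries each
toy_logits = [
    [2.0, -1.0, 0.5, 0.1],
    [-0.3, 3.2, 0.0, 1.1],
    [0.2, 0.1, -2.0, 4.0],
]

ids_from_logits = argmax_last_dim(toy_logits)
ids_from_probs = argmax_last_dim([softmax(row) for row in toy_logits])

print(ids_from_logits)                    # [0, 1, 3]
print(ids_from_logits == ids_from_probs)  # True
```

This is why the inference code can skip the softmax entirely when only the most likely token per frame is needed.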

We use the batch_decode method to convert the predicted token IDs into a human-readable transcript.

Python3

# Print transcription
transcription = processor.batch_decode(predicted_ids)
print(transcription)

Output:

['THE STALE SMELL OF OLD BEER LINGERS IT TAKES HEAT TO BRING OUT THE ODOR A COLD DIP RESTORES HEALTH AND ZEST A SALT PICKLE TASTES FINE WITH HAM TACOS AL PASTOR ARE MY FAVORITE A ZESTFUL FOOD IS THE HOT CROSS BUN']
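
For a CTC model, decoding the per-frame IDs involves collapsing consecutive repeated tokens and then dropping the blank token. The following is a minimal sketch of that greedy decoding idea in plain Python. The vocabulary, the blank index, and the helper name greedy_ctc_decode are hypothetical and chosen for illustration; this is not the processor's actual implementation.

```python
from itertools import groupby

# Hypothetical toy vocabulary: index 0 plays the role of the CTC blank,
# and "|" marks a word boundary
VOCAB = ["<blank>", "|", "C", "A", "T"]
BLANK_ID = 0

def greedy_ctc_decode(frame_ids):
    """Collapse repeated IDs, drop blanks, and map the rest to text."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    kept = [token_id for token_id in collapsed if token_id != BLANK_ID]
    return "".join(VOCAB[token_id] for token_id in kept).replace("|", " ")

# One predicted ID per audio frame (made-up values)
frame_ids = [2, 2, 0, 3, 3, 3, 0, 0, 4, 1, 1, 3, 0, 4, 4]
print(greedy_ctc_decode(frame_ids))  # CAT AT
```

The blank token is what lets a genuinely doubled letter survive: the sequence [4, 0, 4] decodes to "TT", while [4, 4] collapses to a single "T".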

HuBERT Model

Since the introduction of the Wav2Vec model, self-supervised learning research in speech has gained momentum. HuBERT is a self-supervised model that allows the BERT model to be applied to audio inputs. Applying a BERT model to a sound input is challenging because sound units have variable lengths and each input can contain multiple sound units. To apply the BERT model, we need to discretize the audio input. This is achieved through hidden units (Hu), as explained in detail below, hence the name HuBERT. However, before understanding HuBERT, we must get a basic understanding of BERT, as HuBERT is based on it.
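
HuBERT obtains its hidden units by clustering acoustic features with k-means, so each audio frame is mapped to the ID of its nearest cluster. The following is a minimal sketch of that assignment step only, assuming the centroids have already been learned. The 2-D feature vectors, the centroid values, and the helper name nearest_centroid are made up for illustration.

```python
def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to one feature vector."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda k: sq_dist(frame, centroids[k]))

# Made-up cluster centres (in practice learned by k-means on real features)
centroids = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]

# Made-up per-frame feature vectors
frames = [(0.1, -0.2), (0.9, 1.2), (1.1, 0.8), (3.7, 0.3), (0.2, 0.1)]

# Each continuous frame becomes a discrete hidden-unit ID
unit_ids = [nearest_centroid(frame, centroids) for frame in frames]
print(unit_ids)  # [0, 1, 1, 2, 0]
```

The resulting ID sequence plays a role similar to word tokens in BERT: it gives the model discrete targets to predict for masked frames.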
