Implementing Automatic Speech Recognition
Below is the Python code for ASR inference using the Hugging Face transformers library.
Before starting the implementation, we need to install the librosa, torch and transformers libraries.
!pip install librosa
!pip install torch
!pip install transformers
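Let’s import the installed libraries.
- Librosa is a Python package for analyzing and processing audio signals.
- Torch (PyTorch) is an open-source machine learning framework that provides a flexible and efficient platform for building and training deep neural networks.
- HubertForCTC is the transformers class for the HuBERT model with a CTC (connectionist temporal classification) head on top, used for speech recognition.
- AutoProcessor is a transformers class that loads the preprocessing pipeline matching a pretrained checkpoint; here it prepares the audio for the model and decodes the output.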
Python3
import librosa
import torch
from transformers import HubertForCTC, AutoProcessor
Define the path to the audio file that we want to transcribe. The audio file can be downloaded from here.
Python3
# Path of audio file to be transcribed
AUDIO_FILE = 'harvard.wav'
In the next step, we initialize an AutoProcessor to preprocess the audio data into the format the model expects. We also initialize the HubertForCTC model, which performs the speech recognition and lets us transcribe the audio.
Python3
# Load the model and processor
processor = AutoProcessor.from_pretrained("facebook/hubert-large-ls960-ft")
model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft")
Now, using the Librosa library, we load the specified audio file. librosa.load returns two values: speech, which contains the audio waveform, and rate, the sampling rate. Passing sr=16000 resamples the audio to 16 kHz, the rate this checkpoint was trained on.
Then, using the AutoProcessor, we preprocess the audio. It converts the waveform into a tensor of input values that can be fed into the model.
The torch.no_grad() block disables gradient tracking for the operations inside it, which saves memory and computation during inference, when the model weights are not updated.
Then, we pass the processed audio data through the HuBERT model to obtain the logits. The logits are the model's unnormalized scores for every vocabulary token at every time step. After this, we calculate the predicted IDs by taking the argmax along the last dimension of the logits.
Python3
# Importing the file
speech, rate = librosa.load(AUDIO_FILE, sr=16000)
# Preprocessing the input
inputs = processor(speech, return_tensors="pt", sampling_rate=rate).input_values
# Model logits
with torch.no_grad():
    logits = model(inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
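The logits have one row of scores per time step, not per audio sample, so it can help to estimate how many time steps a clip produces. Below is a minimal sketch, assuming the convolutional feature encoder uses the kernel sizes and strides of the default HuBERT configuration (exposed in transformers as model.config.conv_kernel and model.config.conv_stride). The constant CONV_LAYERS and the helper num_output_frames are illustrative names, not part of any library.

```python
# Assumed (kernel, stride) pairs of the convolutional feature encoder.
# These mirror the default HuBERT configuration; check model.config
# if you use a different checkpoint.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_output_frames(num_samples: int) -> int:
    """Return how many time steps the encoder emits for a waveform."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        # Output length of a 1-D convolution without padding
        length = (length - kernel) // stride + 1
    return length


# One second of audio at 16 kHz
one_second = num_output_frames(16000)
print(one_second)  # 49
```

Under these assumptions, one second of 16 kHz audio yields 49 time steps, roughly one prediction every 20 ms, which is why predicted_ids is much shorter than the waveform.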
We use the batch_decode method to convert the predicted token IDs into a human-readable transcript.
Python3
# Print transcription
transcription = processor.batch_decode(predicted_ids)
print(transcription)
Output:
['THE STALE SMELL OF OLD BEER LINGERS IT TAKES HEAT TO BRING OUT THE ODOR A COLD DIP RESTORES HEALTH AND ZEST A SALT PICKLE TASTES FINE WITH HAM TACOS AL PASTOR ARE MY FAVORITE A ZESTFUL FOOD IS THE HOT CROSS BUN']
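To see what decoding involves for a CTC model, here is a simplified sketch of greedy CTC decoding in plain Python: merge consecutive repeated IDs, drop the blank token, then map the remaining IDs to characters. The function ctc_greedy_decode and the toy vocabulary are hypothetical and exist only for illustration; the sketch assumes the blank token has ID 0 and that "|" marks word boundaries, and it is not the library's actual implementation.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse per-frame IDs into text using the greedy CTC rule."""
    collapsed = []
    previous = None
    for idx in frame_ids:
        # Step 1: merge runs of the same ID into a single ID
        if idx != previous:
            collapsed.append(idx)
        previous = idx
    # Step 2: remove blank tokens, then map IDs to characters
    chars = [id_to_token[i] for i in collapsed if i != blank_id]
    # Step 3: turn word delimiters into spaces
    return "".join(chars).replace(word_delimiter, " ").strip()


# Toy vocabulary and per-frame predictions (assumed values for illustration)
toy_vocab = {0: "<pad>", 1: "|", 2: "H", 3: "I", 4: "A", 5: "L"}
frame_ids = [2, 2, 0, 3, 3, 1, 1, 4, 0, 5, 5, 0, 5]

decoded = ctc_greedy_decode(frame_ids, toy_vocab)
print(decoded)  # HI ALL
```

The blank between the two runs of "L" is what allows a genuine double letter to survive the merge step.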
HuBERT Model
Since the introduction of the Wav2Vec model, self-supervised learning research in speech has gained momentum. HuBERT is a self-supervised model that allows the BERT approach to be applied to audio inputs. Applying a BERT model to a sound input is challenging because sound units have variable length and each input can contain multiple sound units. In order to apply the BERT model, we need to discretize the audio input. This is achieved through hidden units (Hu), as explained in detail below; hence the name HuBERT. However, before understanding HuBERT, we must get a basic understanding of BERT, as HuBERT is based on it.
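As a rough illustration of what discretizing audio means, the sketch below assigns each feature frame to its nearest cluster centroid, so that a continuous sequence becomes a sequence of integer unit IDs. HuBERT derives its hidden units by k-means clustering of acoustic features; here the centroids and the two-dimensional frames are made-up values, and nearest_centroid is a hypothetical helper, so treat this as a toy version of the idea only.

```python
def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to the frame."""
    distances = [
        sum((f - c) ** 2 for f, c in zip(frame, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances))


# Assumed cluster centers, as if already learned by k-means
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]

# Assumed feature frames, one per time step
frames = [(0.1, -0.2), (0.9, 1.2), (4.8, 5.1), (5.2, 4.9), (0.0, 0.3)]

# Each continuous frame becomes one discrete hidden-unit ID
hidden_units = [nearest_centroid(frame, centroids) for frame in frames]
print(hidden_units)  # [0, 1, 2, 2, 0]
```

These discrete IDs play the role that word tokens play in BERT: they give the model a finite set of targets to predict for masked positions.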