Text Embedding Generation

In NLP, a text embedding is a vector representation of text. Converting text to embeddings is necessary because machine learning and deep learning models cannot take raw text as input. iNLTK provides get_embedding_vectors(text, language_code) for this: it takes the input text and its language code as arguments and returns one embedding vector per token of the text.

Example:

We generate text embeddings for the same sentence, ‘गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।’ (Hindi for ‘Geeks for Geeks is a great technology learning platform.’)

Python3

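from warnings import filterwarnings

from inltk.inltk import get_embedding_vectors
from IPython.display import display

# Silence library warnings so the notebook output stays readable
filterwarnings("ignore")

text = 'गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।'
# 'hi' is the language code for Hindi; the result holds one vector per token
vectors = get_embedding_vectors(text, 'hi')
display(vectors)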

Output:

[array([-0.737411, 0.203377, 0.005537, -0.468718, …, 0.110487, 0.325836, 0.64981 , 0.463476], dtype=float32),

array([-0.012183, -0.036214, -0.412297, -0.546257, …, 0.094262, 0.0921 , 1.359242, -0.505965], dtype=float32),

array([ 0.021317, -0.130494, -0.248163, -0.203298, …, 0.064852, 0.230874, -0.315259, 0.368123], dtype=float32),

array([-0.737411, 0.203377, 0.005537, -0.468718, …, 0.110487, 0.325836, 0.64981 , 0.463476], dtype=float32),

array([-0.012183, -0.036214, -0.412297, -0.546257, …, 0.094262, 0.0921 , 1.359242, -0.505965], dtype=float32),

array([ 0.526271, -0.111786, 0.024964, -0.413432, …, -0.269101, 0.14501 , 0.139528, 0.036384], dtype=float32),

array([ 0.231323, -0.129719, -0.120698, -0.229107, …, -0.207799, -0.144117, 1.09991 , 0.544219], dtype=float32),

array([ 0.408419, 0.320988, -0.380744, -0.563505, …, -0.254394, -0.200471, 0.201553, -0.074097], dtype=float32),

array([-0.307099, -0.186613, 0.040754, -0.271758, …, 0.477781, 0.759681, 0.485825, 0.222599], dtype=float32),

array([-0.0195 , -0.056414, 0.155854, -0.955072, …, 0.127837, -0.161846, 0.381132, -0.233802], dtype=float32),

array([-0.063136, -0.16291 , -0.412124, -0.580033, …, -0.468475, 0.246613, 0.661614, 0.354779], dtype=float32),

array([-0.182706, -0.237699, 0.478908, -0.567147, …, 0.694749, 0.526647, 0.650397, 0.172727], dtype=float32),

array([-0.183833, -0.005238, -0.187345, -0.113823, …, 0.062584, -1.36463 , 0.665604, -1.425032], dtype=float32),

array([ 0.792413, 0.01189 , -0.71231 , -0.313467, …, 0.190676, 0.938687, 0.464781, 0.195361], dtype=float32)]

Thus, we have generated embeddings for Hindi text using iNLTK.
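Because the result is a list with one vector per token, a model that expects a single fixed-length input usually needs these vectors combined first. One common option is mean pooling, that is, averaging the token vectors component by component. The sketch below is only an illustration under stated assumptions: mean_pool is a hypothetical helper (it is not part of iNLTK), and toy_vectors is small made-up data standing in for the list returned by get_embedding_vectors.

```python
from statistics import fmean


def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-length vector (hypothetical helper)."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    # zip(*...) groups the i-th component of every token vector together
    return [fmean(float(x) for x in dim) for dim in zip(*token_vectors)]


# Stand-in data: three 4-dimensional token vectors with made-up values.
# In practice this would be the list returned by get_embedding_vectors.
toy_vectors = [
    [1.0, 0.0, 2.0, -1.0],
    [3.0, 2.0, 0.0, 1.0],
    [2.0, 4.0, 1.0, 0.0],
]

sentence_vector = mean_pool(toy_vectors)
print(len(toy_vectors), "token vectors ->", len(sentence_vector), "dimensions")
print(sentence_vector)
```

Mean pooling discards word order, so it is best treated as a simple baseline rather than the only way to turn token embeddings into a sentence representation.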

iNLTK: Natural Language Toolkit for Indic Languages in Python

Most of us are familiar with NLTK (Natural Language Toolkit), the popular NLP library used for a wide range of NLP tasks. NLTK, however, is geared mainly toward English and offers little support for Indic languages. In this article, we explore iNLTK, the Natural Language Toolkit for Indic Languages. As the name suggests, iNLTK is a Python library for performing NLP operations on Indian languages.
