Text Embedding Generation
In NLP, a text embedding is a vector representation of text. Machine learning and deep learning models operate on numbers, so raw text must be converted to embeddings before it can be fed to them. iNLTK provides get_embedding_vectors(text, language_code) for this; it takes the input text and its language code as arguments.
Example:
We generate text embeddings for the same sentence ‘गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।’ (the Hindi translation of ‘GeeksforGeeks is a great technology learning platform.’).
Python3
from inltk.inltk import get_embedding_vectors
from warnings import filterwarnings
from IPython.display import display

# Silence library warnings so that only the vectors are shown
filterwarnings("ignore")

text = 'गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।'

# 'hi' is the language code for Hindi
vectors = get_embedding_vectors(text, 'hi')
display(vectors)
Output:
[array([-0.737411, 0.203377, 0.005537, -0.468718, …, 0.110487, 0.325836, 0.64981 , 0.463476], dtype=float32),
array([-0.012183, -0.036214, -0.412297, -0.546257, …, 0.094262, 0.0921 , 1.359242, -0.505965], dtype=float32),
array([ 0.021317, -0.130494, -0.248163, -0.203298, …, 0.064852, 0.230874, -0.315259, 0.368123], dtype=float32),
array([-0.737411, 0.203377, 0.005537, -0.468718, …, 0.110487, 0.325836, 0.64981 , 0.463476], dtype=float32),
array([-0.012183, -0.036214, -0.412297, -0.546257, …, 0.094262, 0.0921 , 1.359242, -0.505965], dtype=float32),
array([ 0.526271, -0.111786, 0.024964, -0.413432, …, -0.269101, 0.14501 , 0.139528, 0.036384], dtype=float32),
array([ 0.231323, -0.129719, -0.120698, -0.229107, …, -0.207799, -0.144117, 1.09991 , 0.544219], dtype=float32),
array([ 0.408419, 0.320988, -0.380744, -0.563505, …, -0.254394, -0.200471, 0.201553, -0.074097], dtype=float32),
array([-0.307099, -0.186613, 0.040754, -0.271758, …, 0.477781, 0.759681, 0.485825, 0.222599], dtype=float32),
array([-0.0195 , -0.056414, 0.155854, -0.955072, …, 0.127837, -0.161846, 0.381132, -0.233802], dtype=float32),
array([-0.063136, -0.16291 , -0.412124, -0.580033, …, -0.468475, 0.246613, 0.661614, 0.354779], dtype=float32),
array([-0.182706, -0.237699, 0.478908, -0.567147, …, 0.694749, 0.526647, 0.650397, 0.172727], dtype=float32),
array([-0.183833, -0.005238, -0.187345, -0.113823, …, 0.062584, -1.36463 , 0.665604, -1.425032], dtype=float32),
array([ 0.792413, 0.01189 , -0.71231 , -0.313467, …, 0.190676, 0.938687, 0.464781, 0.195361], dtype=float32)]
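Thus, we have generated embeddings for Hindi text using iNLTK. The result is a list of float32 arrays, one per token that the tokenizer produces for the sentence, which is why the output contains more arrays than the sentence has words.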
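A common next step is to collapse the per-token vectors into a single sentence vector and compare vectors with cosine similarity. The following is a minimal sketch of that idea, not something iNLTK does for you in this call: the helper names mean_pool and cosine_similarity are hypothetical, mean pooling is only one of several possible strategies, and the toy 3-dimensional values are made up so the example runs without iNLTK installed. The helpers accept any sequences of floats, so the list returned by get_embedding_vectors could be passed to mean_pool in place of toy_vectors.

```python
import math


def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    n = len(token_vectors)
    # zip(*...) groups the i-th component of every token vector together.
    return [sum(float(x) for x in component) / n
            for component in zip(*token_vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins for real token vectors (made-up values, for illustration only).
toy_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
]

sentence_vector = mean_pool(toy_vectors)
print(sentence_vector)  # [2.0, 2.0, 1.0]

# Similarity between the two toy token vectors.
print(cosine_similarity(toy_vectors[0], toy_vectors[1]))
```

Mean pooling discards word order, so it is best treated as a quick baseline for tasks such as rough sentence comparison rather than a substitute for a dedicated sentence encoder.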
iNLTK: Natural Language Toolkit for Indic Languages in Python
Most of us are familiar with NLTK (Natural Language Toolkit), the popular NLP library used for a wide range of NLP tasks. NLTK, however, is geared mainly toward English. In this article, we explore iNLTK, the Natural Language Toolkit for Indic Languages. As the name suggests, iNLTK is a Python library for performing NLP operations on Indian languages.