Import and Initial Setup

Before using a language for the first time in a given system or environment, you need to run the language setup, which downloads the models for that language. This is a one-time step: once a language has been set up, it does not need to be set up again in that environment. Here we set up Hindi (hi) and Bengali (bn), but you can do the same for any language from the list of available languages. You can set up a language as follows:

Python3

# Set up Hindi (hi) and Bengali (bn); needed only once per environment
from inltk.inltk import setup
setup('hi')
setup('bn')

# To run on Google Colab, use a single shell command instead:
# !python -c "from inltk.inltk import setup; setup('hi'); setup('bn')"
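
When several languages are needed, the one-time setup can be wrapped in a small helper. The following is only a minimal sketch: `setup_languages` and its `setup_fn` parameter are hypothetical names introduced here for illustration and are not part of iNLTK. The helper calls whatever setup function it is given once per distinct language code and records whether each call succeeded, so one failed download does not stop the rest.

```python
from typing import Callable, Dict, Iterable


def setup_languages(codes: Iterable[str],
                    setup_fn: Callable[[str], None]) -> Dict[str, str]:
    """Call setup_fn once per distinct language code and report each outcome.

    Hypothetical helper (not part of iNLTK). setup_fn is assumed to behave
    like inltk.inltk.setup: it takes a language code and raises on failure.
    """
    results: Dict[str, str] = {}
    for code in codes:
        if code in results:
            # Already attempted in this call; skip the duplicate.
            continue
        try:
            setup_fn(code)
            results[code] = "ok"
        except Exception as exc:
            # Record the failure and continue with the remaining languages.
            results[code] = f"failed: {exc}"
    return results


# Intended usage with iNLTK (assumes the inltk package is installed):
# from inltk.inltk import setup
# print(setup_languages(["hi", "bn"], setup))
```

Passing the setup function in as an argument keeps the helper independent of iNLTK itself, so it can be tried out with a stub function before any models are downloaded.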


iNLTK: Natural Language Toolkit for Indic Languages in Python

Most of us are familiar with NLTK (Natural Language Toolkit), the popular NLP library used for a wide range of NLP tasks. NLTK, however, is geared mainly toward English and offers little support for Indian languages. In this article, we explore iNLTK, the Natural Language Toolkit for Indic Languages. As the name suggests, iNLTK is a Python library for performing NLP operations on Indian languages.
