Text Preprocessing for NLP Tasks

Natural Language Processing (NLP) has grown rapidly and is now integral to applications ranging from chatbots to sentiment analysis. One of the foundational steps in NLP is text preprocessing: cleaning and preparing raw text data for analysis or model training. Because preprocessing determines the features a model actually sees, it can significantly impact the performance and accuracy of NLP models. This article walks through the essential text preprocessing steps for NLP tasks.

Why Is Text Preprocessing Important?

Raw text data is often noisy and unstructured, containing various inconsistencies such as typos, slang, abbreviations, and irrelevant information. Preprocessing helps in:

  1. Improving Data Quality: Removing noise and irrelevant information ensures that the data fed into the model is clean and consistent.
  2. Enhancing Model Performance: Well-preprocessed text can lead to better feature extraction, improving the performance of NLP models.
  3. Reducing Complexity: Simplifying the text data can reduce the computational complexity and make the models more efficient.


We will now walk through each preprocessing step on a small sample corpus:

Python
corpus = [
    "I can't wait for the new season of my favorite show!",
    "The COVID-19 pandemic has affected millions of people worldwide.",
    "U.S. stocks fell on Friday after news of rising inflation.",
    "<html><body>Welcome to the website!</body></html>",
    "Python is a great programming language!!! ??"
]

1. Text Cleaning

We’ll remove HTML tags, convert the text to lowercase, and strip numbers, punctuation, and other special characters. The HTML tags must go first: once punctuation has been removed, the angle brackets are gone and the tags collapse into garbage tokens like 'htmlbodywelcome'.

Python
import re
import string
from bs4 import BeautifulSoup

def clean_text(text):
    text = BeautifulSoup(text, "html.parser").get_text()  # Remove HTML tags first, while they are intact
    text = text.lower()  # Lowercase
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    text = re.sub(r'\W', ' ', text)  # Replace remaining special characters (e.g., emojis) with spaces
    return text

cleaned_corpus = [clean_text(doc) for doc in corpus]
print(cleaned_corpus)

Output:

['i cant wait for the new season of my favorite show', 'the covid pandemic has affected millions of people worldwide', 'us stocks fell on friday after news of rising inflation', 'welcome to the website', 'python is a great programming language  ']

2. Tokenization

Splitting the cleaned text into tokens (words).

Python
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')  # on newer NLTK releases (>= 3.9) you may also need nltk.download('punkt_tab')

tokenized_corpus = [word_tokenize(doc) for doc in cleaned_corpus]
print(tokenized_corpus)

Output:

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
[['i', 'cant', 'wait', 'for', 'the', 'new', 'season', 'of', 'my', 'favorite', 'show'], ['the', 'covid', 'pandemic', 'has', 'affected', 'millions', 'of', 'people', 'worldwide'], ['us', 'stocks', 'fell', 'on', 'friday', 'after', 'news', 'of', 'rising', 'inflation'], ['welcome', 'to', 'the', 'website'], ['python', 'is', 'a', 'great', 'programming', 'language']]

3. Stop Words Removal

Removing common stop words from the tokens.

Python
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
filtered_corpus = [[word for word in doc if word not in stop_words] for doc in tokenized_corpus]
print(filtered_corpus)

Output:

[['cant', 'wait', 'new', 'season', 'favorite', 'show'], ['covid', 'pandemic', 'affected', 'millions', 'people', 'worldwide'], ['us', 'stocks', 'fell', 'friday', 'news', 'rising', 'inflation'], ['welcome', 'website'], ['python', 'great', 'programming', 'language']]
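
For sentiment-oriented tasks, be careful with wholesale stop-word removal: NLTK's English list includes negations such as 'not', 'no', and 'nor', which can flip the meaning of a sentence. A minimal sketch that keeps them, assuming the stop_words set defined above:

Python
# Keep negation words, which often carry sentiment
negations = {'not', 'no', 'nor'}
keep_words = stop_words - negations

filtered_keep_neg = [[word for word in doc if word not in keep_words] for doc in tokenized_corpus]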

4. Stemming and Lemmatization

Reducing words to their base form. Stemming chops off suffixes with heuristic rules, while lemmatization maps each word to its dictionary form (lemma), so lemmatized output is always a real word.

Python
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')  # some NLTK versions also require nltk.download('omw-1.4')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stemmed_corpus = [[stemmer.stem(word) for word in doc] for doc in filtered_corpus]
lemmatized_corpus = [[lemmatizer.lemmatize(word) for word in doc] for doc in filtered_corpus]
print(stemmed_corpus)
print(lemmatized_corpus)

Output:

[['cant', 'wait', 'new', 'season', 'favorit', 'show'], ['covid', 'pandem', 'affect', 'million', 'peopl', 'worldwid'], ['us', 'stock', 'fell', 'friday', 'news', 'rise', 'inflat'], ['welcom', 'websit'], ['python', 'great', 'program', 'languag']]
[['cant', 'wait', 'new', 'season', 'favorite', 'show'], ['covid', 'pandemic', 'affected', 'million', 'people', 'worldwide'], ['u', 'stock', 'fell', 'friday', 'news', 'rising', 'inflation'], ['welcome', 'website'], ['python', 'great', 'programming', 'language']]
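
Notice that the lemmatizer left 'affected' and 'rising' unchanged: WordNetLemmatizer treats every word as a noun by default. Supplying a part-of-speech hint gives better reductions; a quick sketch using the lemmatizer defined above:

Python
# pos='v' tells the lemmatizer to treat the word as a verb (the default is 'n', noun)
print(lemmatizer.lemmatize('affected', pos='v'))  # affect
print(lemmatizer.lemmatize('rising', pos='v'))    # rise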

5. Handling Contractions

Expanding contractions in the text. Ideally this runs before punctuation removal, since the apostrophe in "can't" carries information; the contractions library happens to recognize the apostrophe-less 'cant' as well, so it still works on the cleaned text.

Python
import contractions

expanded_corpus = [contractions.fix(doc) for doc in cleaned_corpus]
print(expanded_corpus)

Output:

['i cannot wait for the new season of my favorite show', 'the covid pandemic has affected millions of people worldwide', 'us stocks fell on friday after news of rising inflation', 'welcome to the website', 'python is a great programming language  ']
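
For comparison, running the expansion on the raw text, while the apostrophe is still present, does not depend on that fallback; a quick sketch:

Python
# Expanding before any cleaning, while apostrophes are intact
print(contractions.fix("I can't wait for the new season of my favorite show!"))
# I cannot wait for the new season of my favorite show!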

6. Handling Emojis and Emoticons

Converting emojis to their textual representation. This has to happen before cleaning, because clean_text strips emojis as special characters, so we apply demojize to the raw corpus.

Python
import emoji

emoji_corpus = [emoji.demojize(doc) for doc in corpus]  # raw corpus: the cleaned text no longer contains emojis
print(emoji_corpus)

Output:

["I can't wait for the new season of my favorite show!", 'The COVID-19 pandemic has affected millions of people worldwide.', 'U.S. stocks fell on Friday after news of rising inflation.', '<html><body>Welcome to the website!</body></html>', 'Python is a great programming language!!! :snake:']

7. Spell Checking

Correcting spelling errors in the text. Spell checkers rely on a fixed dictionary, so treat their output with care: note below how rare but valid words get "corrected" ('covid' becomes 'bovid', 'friday' becomes 'fridge').

Python
from spellchecker import SpellChecker

spell = SpellChecker()
corrected_corpus = [[spell.correction(word) for word in doc] for doc in tokenized_corpus]
print(corrected_corpus)

Output:

[['i', 'cant', 'wait', 'for', 'the', 'new', 'season', 'of', 'my', 'favorite', 'show'], ['the', 'bovid', 'pandemic', 'has', 'affected', 'millions', 'of', 'people', 'worldwide'], ['us', 'stocks', 'fell', 'on', 'fridge', 'after', 'news', 'of', 'rising', 'inflation'], ['welcome', 'to', 'the', 'website'], ['python', 'is', 'a', 'great', 'programming', 'language']]
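
One caveat: depending on the pyspellchecker version, spell.correction() returns None when it finds no plausible candidate, which would leave holes in the token lists. A defensive variant that falls back to the original word:

Python
# Keep the original token whenever the checker has no suggestion
corrected_safe = [[spell.correction(word) or word for word in doc] for doc in tokenized_corpus]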


After performing all the preprocessing steps, the corpus is ready for further NLP tasks, such as feature extraction or model training.

As the examples show, the order of operations matters: strip HTML, convert emojis, and expand contractions while the raw text is still intact, then lowercase, remove numbers and punctuation, tokenize, drop stop words, and lemmatize. The sketch below assembles the steps in that order. With a pipeline like this in place, your text data is clean, consistent, and ready for any NLP application, from sentiment analysis to text classification.
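
A minimal end-to-end sketch, assuming the same libraries used above and the sample corpus from the top of the article; preprocess() is our own helper, not a library function, and spell checking is left out as an optional, comparatively slow step:

Python
import re
import string

import contractions
import emoji
import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = BeautifulSoup(text, "html.parser").get_text()  # strip HTML while tags are intact
    text = emoji.demojize(text)                           # emojis -> :name: before they get stripped
    text = contractions.fix(text)                         # expand contractions while apostrophes exist
    text = text.lower()                                   # lowercase
    text = re.sub(r'\d+', '', text)                       # remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    tokens = word_tokenize(text)                          # tokenize
    tokens = [t for t in tokens if t not in stop_words]   # drop stop words
    return [lemmatizer.lemmatize(t) for t in tokens]      # lemmatize

preprocessed_corpus = [preprocess(doc) for doc in corpus]
print(preprocessed_corpus)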



