Text Preprocessing

Textual data is highly unstructured and needs attention in several respects, such as casing, punctuation, stop words, and inflected word forms.

Although removing data means some loss of information, we need to do this to make the text suitable to feed into a machine learning model.

Python3




# Lower case all the words of the tweet before any preprocessing
df['tweet'] = df['tweet'].str.lower()

# Removing punctuation present in the text
punctuations_list = string.punctuation

def remove_punctuations(text):
    # Map every punctuation character to None and strip it from the text
    temp = str.maketrans('', '', punctuations_list)
    return text.translate(temp)

df['tweet'] = df['tweet'].apply(remove_punctuations)
df.head()


Output:

Dataset after removal of punctuation
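
As a quick sanity check, the helper can be tried on a single string before it is applied to the whole column (a minimal sketch with a made-up example tweet):

Python3

# Hypothetical example string just to verify the helper's behaviour
print(remove_punctuations("don't @mention me!!!"))  # prints: dont mention me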

The helper function below removes the stop words and lemmatizes the remaining important words.

Python3




def remove_stopwords(text):
    stop_words = stopwords.words('english')
    lemmatizer = WordNetLemmatizer()

    imp_words = []

    # Storing the important (non stop) words
    for word in str(text).split():

        if word not in stop_words:
            # Lemmatize the word before appending
            # it to the imp_words list.
            imp_words.append(lemmatizer.lemmatize(word))

    output = " ".join(imp_words)

    return output


df['tweet'] = df['tweet'].apply(remove_stopwords)
df.head()


Output:

Dataset after removal of stop words and lemmatization
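
Note that the stop-word list and the WordNet lemmatizer rely on NLTK corpora. If they are not already available locally, a one-time download is needed (a minimal sketch, assuming nltk itself was imported in the earlier Importing Libraries and Dataset step):

Python3

# One-time download of the NLTK resources used above
import nltk
nltk.download('stopwords')  # stop-word list used by stopwords.words('english')
nltk.download('wordnet')    # data required by WordNetLemmatizer
nltk.download('omw-1.4')    # needed by the lemmatizer on some NLTK versions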

A word cloud is a text visualization tool that helps us get insights into the most frequent words present in the corpus.

Python3




def plot_word_cloud(data, typ):
    # Joining all the tweets to get the corpus
    tweet_corpus = " ".join(data['tweet'])

    plt.figure(figsize=(10, 10))

    # Forming the word cloud
    wc = WordCloud(max_words=100,
                   width=200,
                   height=100,
                   collocations=False).generate(tweet_corpus)

    # Plotting the wordcloud obtained above
    plt.title(f'WordCloud for {typ} tweets.', fontsize=15)
    plt.axis('off')
    plt.imshow(wc)
    plt.show()


plot_word_cloud(df[df['class'] == 2], typ='Neither')


Output:

Word cloud for the neither class of data
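
The same helper can be reused for the remaining two classes as well (a sketch; the titles 'Hate Speech' and 'Offensive' for classes 0 and 1 are assumed label names used only for the plots):

Python3

# Word clouds for the other two classes (plot titles are assumed label names)
plot_word_cloud(df[df['class'] == 0], typ='Hate Speech')
plot_word_cloud(df[df['class'] == 1], typ='Offensive')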

As we saw above, the data is highly imbalanced, so we will now address this problem using a mixture of downsampling and upsampling.

Python3




# Keep class 2 as it is, downsample class 1 to 3500 tweets,
# and upsample class 0 by repeating it three times in the concat below
class_2 = df[df['class'] == 2]
class_1 = df[df['class'] == 1].sample(n=3500)
class_0 = df[df['class'] == 0]

balanced_df = pd.concat([class_0, class_0, class_0, class_1, class_2], axis=0)


Now let’s check the data distribution across the three classes.

Python3




plt.pie(balanced_df['class'].value_counts().values,
        labels=balanced_df['class'].value_counts().index,
        autopct='%1.1f%%')
plt.show()


Output:

Pie chart for the distribution of the data in three classes

After this step, we can be reasonably sure that the data is approximately balanced across the three classes.
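
To confirm this numerically rather than only visually, we can also print the class counts of the balanced frame (a minimal sketch using the balanced_df created above):

Python3

# Number of tweets per class after balancing
print(balanced_df['class'].value_counts())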

