Word2Vec Conversion

We cannot feed words to a machine learning model because they work on numbers only. So, first, we will convert the our words to vectors with the token id’s to the corresponding words and after padding them our textual data will arrive to a stage where we can feed it to a model.

Python3




features = balanced_df['tweet']
target = balanced_df['class']
 
X_train, X_val, Y_train, Y_val = train_test_split(features,
                                                  target,
                                                  test_size=0.2,
                                                  random_state=22)
X_train.shape, X_val.shape


Output:

((8201,), (2051,))

We have successfully divided our data into training and validation data.

Python3




Y_train = pd.get_dummies(Y_train)
Y_val = pd.get_dummies(Y_val)
Y_train.shape, Y_val.shape


Output:

((8201, 3), (2051, 3))

The labels of the classes have been converted into one-hot-encoded vectors. For this, we will use a vocabulary size of 5000 with each tweet, not more than 100 in length.

Python3




max_words = 5000
max_len = 100
 
token = Tokenizer(num_words=max_words,
                  lower=True,
                  split=' ')
 
token.fit_on_texts(X_train)


We have fitted the tokenizer on our training data we will use it to convert the training and validation data both to vectors.

Python3




# training the tokenizer
max_words = 5000
token = Tokenizer(num_words=max_words,
                  lower=True,
                  split=' ')
token.fit_on_texts(train_X)
 
#Generating token embeddings
Training_seq = token.texts_to_sequences(train_X)
Training_pad = pad_sequences(Training_seq,
                             maxlen=50,
                             padding='post',
                             truncating='post')
 
Testing_seq = token.texts_to_sequences(test_X)
Testing_pad = pad_sequences(Testing_seq,
                            maxlen=50,
                            padding='post',
                            truncating='post')


Hate Speech Detection using Deep Learning

There must be times when you have come across some social media post whose main aim is to spread hate and controversies or use abusive language on social media platforms. As the post consists of textual information to filter out such Hate Speeches NLP comes in handy. This is one of the main applications of NLP which is known as Sentence Classification tasks.

In this article, we will learn how to build an NLP-based Sequence Classification model which can predict Tweets as Hate Speech, Offensive Language, and Normal.

Similar Reads

Importing Libraries and Dataset

Python libraries make it very easy for us to handle the data and perform typical and complex tasks with a single line of code....

Text Preprocessing

...

Word2Vec Conversion

...

Model Development and Evaluation

...

Conclusion

...

Contact Us