Word2Vec Conversion
We cannot feed raw words to a machine learning model because models operate on numbers only. So we first convert each word to its token id, which turns every tweet into a sequence of integers, and then pad these sequences to a common length. After that, the text data is in a form that can be fed to a model.
Python3
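from sklearn.model_selection import train_test_split

features = balanced_df['tweet']
target = balanced_df['class']

# Hold out 20% of the tweets for validation; random_state makes the split reproducible
X_train, X_val, Y_train, Y_val = train_test_split(features, target, test_size=0.2, random_state=22)
X_train.shape, X_val.shape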
Output:
((8201,), (2051,))
We have successfully divided our data into training and validation data.
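The two shapes above can be checked by hand. The snippet below is a minimal sketch, assuming that `train_test_split` rounds the validation count up when `test_size` is a fraction and gives the remaining rows to the training set; the total row count is not stated in the text and is inferred here from the two shapes.

```python
import math

# Total number of tweets, inferred from the two shapes in the output above
n_samples = 8201 + 2051
test_size = 0.2

# Assumption: the validation count is rounded up, the rest goes to training
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)  # 8201 2051
```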
Python3
# One-hot encode the class labels (one column per class)
Y_train = pd.get_dummies(Y_train)
Y_val = pd.get_dummies(Y_val)
Y_train.shape, Y_val.shape
Output:
((8201, 3), (2051, 3))
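To make the three columns in these shapes concrete, here is a small pure-Python sketch of one-hot encoding. The label values are hypothetical and only illustrate the idea; `pd.get_dummies` produces the equivalent table with one column per class.

```python
# Hypothetical class ids for four tweets (three distinct classes)
labels = [0, 2, 1, 2]

# One column per distinct class, in sorted order
classes = sorted(set(labels))

# Each row has a single 1 in the column of its class and 0 elsewhere
one_hot = [[1 if label == c else 0 for c in classes] for label in labels]

for label, row in zip(labels, one_hot):
    print(label, row)
```

Each row sums to 1, so a model with three output units can be trained against these rows directly.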
The class labels have been converted into one-hot encoded vectors. Next, we tokenize the tweets. For this, we will use a vocabulary size of 5000 and limit each tweet to at most 100 tokens.
Python3
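max_words = 5000   # vocabulary size
max_len = 100      # maximum tweet length in tokens

# Build the word-to-id vocabulary from the training tweets only
token = Tokenizer(num_words=max_words, lower=True, split=' ')
token.fit_on_texts(X_train)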
We have fitted the tokenizer on the training data only. We will now use it to convert both the training and validation tweets into padded sequences of token ids.
Python3
# Convert the tweets to sequences of token ids using the tokenizer fitted above
Training_seq = token.texts_to_sequences(X_train)
# Pad or truncate every sequence at the end so that all have length max_len
Training_pad = pad_sequences(Training_seq, maxlen=max_len, padding='post', truncating='post')

# Apply the same tokenizer and padding to the validation tweets
Testing_seq = token.texts_to_sequences(X_val)
Testing_pad = pad_sequences(Testing_seq, maxlen=max_len, padding='post', truncating='post')
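To see what fitting a tokenizer, converting text to sequences, and padding do in principle, here is a simplified pure-Python sketch. It is not the Keras implementation: the helper names `fit_vocab`, `to_sequence`, and `pad_post` and the example tweets are hypothetical, punctuation filtering is left out, and it assumes that ids start at 1 in order of frequency, that id 0 is reserved for padding, and that words unseen during fitting are dropped.

```python
from collections import Counter

def fit_vocab(texts, max_words):
    """Give the most frequent words the ids 1..max_words-1 (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = counts.most_common(max_words - 1)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=1)}

def to_sequence(text, vocab):
    """Replace each known word by its id; words not in the vocabulary are dropped."""
    return [vocab[w] for w in text.lower().split() if w in vocab]

def pad_post(seq, max_len, pad_id=0):
    """Truncate at the end, then pad at the end, so the result has max_len ids."""
    seq = seq[:max_len]
    return seq + [pad_id] * (max_len - len(seq))

# Hypothetical tweets; the vocabulary is built from the training texts only
train_texts = ["good morning everyone", "good night"]
val_texts = ["good evening everyone out there"]

vocab = fit_vocab(train_texts, max_words=5000)
train_pad = [pad_post(to_sequence(t, vocab), max_len=4) for t in train_texts]
val_pad = [pad_post(to_sequence(t, vocab), max_len=4) for t in val_texts]

print(vocab)
print(train_pad)
print(val_pad)
```

Building the vocabulary from the training texts only means the validation tweets are encoded exactly as unseen data would be: words such as "evening" that never appeared in training simply vanish from the sequence.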
Hate Speech Detection using Deep Learning
You have probably come across social media posts whose main aim is to spread hate and controversy or that use abusive language. Because such posts consist of text, NLP comes in handy for filtering out hate speech. This is one of the main applications of NLP, known as sentence classification.
In this article, we will learn how to build an NLP-based sequence classification model that can classify tweets as Hate Speech, Offensive Language, or Normal.