Classification of text documents using sparse features in Python Scikit Learn

A similar classification problem is to classify the given text or document under a particular label. For this example, the following is a brief about the prerequisites for moving ahead.

Processing Text data as input

The machine does not understand the text data as it is, so it needs to be represented in a way that can be ingested by the classifier as features to predict an output. For that, the given data is to be represented as vectors. To do the same bag-of-words model is followed. The idea is to remove the grammar from the document, break it into simple words and represent the count of each word as a numeric within that vector.

Example:

"Geeks For Geeks is a great platform to learn and grow and achieve great knowledge."

Bag of words model

geeks for is a great to and learn knowledge archive platform grow
2 1 1 1 2 1 1 1 1 1 1 1

 

The vector results in the formation of the sparse matrix, for a case of multiple documents, we may have many words which have almost no contribution to the prediction of the label but its large repetition is going to create a lot of noise. For this case, considering the words having more repetition within the document have to be given more weightage as compared to words having repetitions all over the dataset. Vectorization using this idea is known as Tf-idf-weighted vectorization.

Classification Models

To represent this problem and to compare the best model for this use case, the following models will be used:

  1. Logistic Regression: Simple classifier using a one-vs-rest scheme for this use case.
  2. Ridge Classifier: Classifier which treats the problem as a regression problem and trains accordingly with {-1,1} labels for binary and multiple no. for non-binary.
  3. K-Neighbors Classifier: Converts to a regression model with K-Neighbors assigned as a label.

Classification of text documents using sparse features in Python Scikit Learn

Classification is a type of machine learning algorithm in which the model is trained, so as to categorize or label the given input based on the provided features for example classifying the input image as an image of a dog or a cat (binary classification) or to classify the provided picture of a living organism into one of the species from within the animal kingdom (multi-classification). There are various classification models provided in the Scikit Learn library in Python

Similar Reads

Classification of text documents using sparse features in Python Scikit Learn

A similar classification problem is to classify the given text or document under a particular label. For this example, the following is a brief about the prerequisites for moving ahead....

Classification problem

Making the model starts by importing the dataset. The dataset used here is the 20 newsgroups text dataset, which comprises around 18,000 newsgroups posts on 20 topics split into two subsets: one for training (or development) and the other one for testing (or for performance evaluation). For the sake of easiness, we will be selecting four categories out of 20....

Contact Us