Classification problem
Making the model starts by importing the dataset. The dataset used here is the 20 newsgroups text dataset, which comprises around 18,000 newsgroups posts on 20 topics split into two subsets: one for training (or development) and the other one for testing (or for performance evaluation). For the sake of easiness, we will be selecting four categories out of 20.
Once the data is imported, it needs to be vectorized via the bag-of-words model. To vectorize the documents, the Tf-idf vectorization technique is used. The below code shows a glimpse of what the training data may be like after vectorization.
Python3
from sklearn.datasets import fetch_20newsgroups from sklearn.feature_extraction.text import TfidfVectorizer # Keeping 4 of 20 categories only categories = [ "alt.atheism" , "talk.religion.misc" , "sci.electronics" , "sci.space" , ] # importing train and test data data_train = fetch_20newsgroups( subset = "train" , categories = categories, shuffle = True , random_state = 4 , remove = ( "headers" , "footers" , "quotes" ), ) data_test = fetch_20newsgroups( subset = "test" , categories = categories, shuffle = False , random_state = 4 , remove = ( "headers" , "footers" , "quotes" ), ) |
Converting the data into the numeric form using the Tf-Idf vectorizer.
Python3
# initialize vectorizer - keeping the # range of document frequency [0.5,5] vectorizer = TfidfVectorizer( sublinear_tf = True , max_df = 0.5 , min_df = 5 , stop_words = "english" ) X_train = vectorizer.fit_transform(data_train.data) y_train = data_train.target |
The transformed data needs to be fitted to the above-mentioned models and then, the evaluation will be performed in the testing dataset later on.
Python3
from sklearn.linear_model import LogisticRegression, RidgeClassifier from sklearn.neighbors import KNeighborsClassifier from sklearn import metrics models = {} # inverse of regularization strength models[ 'LogisticRegression' ] = LogisticRegression(C = 5 , max_iter = 1000 ) # gradient descent solver for sparse matrix "sparse_cg" models[ 'RidgeClassifier' ] = RidgeClassifier(alpha = 1.0 , solver = "sparse_cg" ) # setting nearest 100 points as 1 label models[ 'KNeighborsClassifier' ] = KNeighborsClassifier(n_neighbors = 100 ) |
Let’s train the above-defined model using the data we have prepared above.
Python3
# transforming the test data into existing train data vectorizer X_test = vectorizer.transform(X_test_raw) # training models and comparing accuracy for k, v in models.items(): print ( "\n=========Training model with classifier {0}===========" . format (k)) v.fit(X_train, y_train) pred = v.predict(X_test) score = metrics.accuracy_score(y_test, pred) print ( "accuracy after evaluation on test data: {0} %" . format (score * 100 )) |
Output
========= Training model with classifier LogisticRegression =========== Accuracy after evaluation on test data: 75.09211495946941 % ========= Training model with classifier RidgeClassifier =========== Accuracy after evaluation on test data: 74.57627118644068 % ========= Training model with classifier KNeighborsClassifier =========== Accuracy after evaluation on test data: 72.88135593220339 %
In the RidgeClassifier case, the training took minimum time and effort because it takes the problem as a simple regression problem, whereas in the case of the LogisticRegression, one vs rest type of classification takes place, which takes more time, as the model gets trained with respect to each class individually one at a time. It takes time but gives better performance because each class is trained individually. In the case of KNN classification, it calculates the distance of each point with respect to the class middle points and thus takes a lot of time.
Classification of text documents using sparse features in Python Scikit Learn
Classification is a type of machine learning algorithm in which the model is trained, so as to categorize or label the given input based on the provided features for example classifying the input image as an image of a dog or a cat (binary classification) or to classify the provided picture of a living organism into one of the species from within the animal kingdom (multi-classification). There are various classification models provided in the Scikit Learn library in Python.
Contact Us