Implementing Sentiment Analysis with CatBoost

Why Use CatBoost for Sentiment Analysis?

For this example, we will use the IMDb dataset, loaded through the Hugging Face datasets library. It contains 50,000 movie reviews labeled as positive or negative, split evenly into 25,000 training reviews and 25,000 test reviews, which makes it well suited to binary sentiment analysis.

Step 1: Install Necessary Libraries

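We will install the CatBoost library, the Hugging Face datasets library, and scikit-learn (used later for TF-IDF and evaluation metrics) with the following commands:

pip install catboost
pip install datasets
pip install scikit-learn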

Step 2: Load Dataset

First, we load the IMDb dataset using the Hugging Face datasets library and separate it into training and test sets. Here, train_data contains the reviews and labels used for training, while test_data contains the reviews and labels used for evaluation.

from datasets import load_dataset

# Load the IMDb dataset
dataset = load_dataset('imdb')
train_data = dataset['train']
test_data = dataset['test']

Step 3: Text Preprocessing using TF-IDF

In the following code, we use TfidfVectorizer from the sklearn.feature_extraction.text module to convert the review text into numerical feature vectors based on the TF-IDF scheme. Setting max_features=5000 limits the vocabulary to the 5,000 terms with the highest frequency across the training corpus. The fit_transform method is applied to the training data (train_data['text']) to learn the vocabulary and transform the text into TF-IDF features. The transform method is then applied to the test data (test_data['text']) so that it is encoded with the same vocabulary and no information from the test set leaks into the features. The labels for the training and test sets are stored in y_train and y_test, respectively, for use in model training and evaluation.

from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize text data
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_data['text'])
X_test = vectorizer.transform(test_data['text'])

# Extract the labels as plain Python lists
y_train = list(train_data['label'])
y_test = list(test_data['label'])

Step 4: Model Training

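Here, the code initializes a CatBoostClassifier and fits it to the TF-IDF transformed training data (X_train and y_train). The parameters set the number of boosting iterations (iterations=1000), the learning rate (learning_rate=0.1), the maximum tree depth (depth=6), and how often training progress is printed (verbose=100, that is, every 100 iterations).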

from catboost import CatBoostClassifier

# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=1000, learning_rate=0.1, depth=6, verbose=100)

# Fit the model
model.fit(X_train, y_train)

Step 5: Model Evaluation

After training the model, we predict the sentiments on the test set and evaluate the model’s performance.

from sklearn.metrics import accuracy_score, classification_report

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the performance
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(classification_report(y_test, y_pred))
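The figures in the classification report are derived from four counts: true negatives, false positives, false negatives, and true positives. As a small worked example on made-up labels (not the IMDb predictions), the sketch below computes those counts with confusion_matrix and derives accuracy, precision, and recall for the positive class by hand.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 0 = negative review, 1 = positive review
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_toy = [0, 0, 1, 1, 1, 0, 1, 0]

# For binary labels the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)
toy_precision = tp / (tp + fp)  # share of predicted positives that are correct
toy_recall = tp / (tp + fn)     # share of actual positives that are found

print(tn, fp, fn, tp)
print(toy_accuracy, toy_precision, toy_recall)
```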

Complete Code for Sentiment Analysis using CatBoost

from catboost import CatBoostClassifier
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report

# Load the IMDb dataset
dataset = load_dataset('imdb')
train_data = dataset['train']
test_data = dataset['test']

# Vectorize text data
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_data['text'])
X_test = vectorizer.transform(test_data['text'])

# Extract the labels as plain Python lists
y_train = list(train_data['label'])
y_test = list(test_data['label'])

# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=1000, learning_rate=0.1, depth=6, verbose=100)

# Fit the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the performance
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(classification_report(y_test, y_pred))

Output:

Accuracy: 0.8766
              precision    recall  f1-score   support

           0       0.89      0.86      0.88     12500
           1       0.87      0.89      0.88     12500

    accuracy                           0.88     25000
   macro avg       0.88      0.88      0.88     25000
weighted avg       0.88      0.88      0.88     25000

Sentiment Analysis using CatBoost

Sentiment analysis is crucial for understanding the emotional tone behind text data, making it invaluable for applications such as customer feedback analysis, social media monitoring, and market research. In this article, we will explore how to perform sentiment analysis using CatBoost.

Table of Contents

Key Features of CatBoost
Why Use CatBoost for Sentiment Analysis?
Implementing Sentiment Analysis with CatBoost

Step 1: Install Necessary Libraries
Step 2: Load Dataset
Step 3: Text Preprocessing using TF-IDF
Step 4: Model Training
Step 5: Model Evaluation
Complete Code for Sentiment Analysis using CatBoost

Conclusion
