Bagging for Imbalanced Classification

Let us start by exploring bagging techniques for the imbalanced classification problem.

Standard Bagging

Ensemble learning is a machine learning approach that combines multiple models to produce a stronger predictor than any single model on its own. Bagging, or bootstrap aggregating, is one such technique: it builds multiple models on different subsets of the training data and then combines their predictions. It rests on two key concepts: bootstrapping and aggregation.

  1. Bootstrap: Creating multiple datasets through sampling with replacement from the original dataset. This technique allows for the creation of multiple “bootstrapped” datasets that are similar but not identical to the original dataset.
  2. Aggregation: Once the bootstrapped datasets are created, a base learner (such as a decision tree) is trained on each dataset. The predictions from the individual models are then aggregated, typically by majority vote for classification or by averaging for regression, to make the final prediction. This aggregation reduces variance and helps improve the overall accuracy of the model. A minimal sketch of both steps follows this list.
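
To make these two steps concrete, here is a minimal, hand-rolled sketch of bootstrapping and majority-vote aggregation. The dataset, the number of models, and the random seeds are arbitrary illustrative choices; scikit-learn's BaggingClassifier, used later in this section, performs these steps internally.

Python


# Minimal sketch: bootstrap sampling + majority-vote aggregation
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_classes=2, random_state=0)

rng = np.random.default_rng(0)
n_models = 10
trees = []
for _ in range(n_models):
    # Bootstrap: draw row indices with replacement to build a resampled dataset
    idx = rng.choice(len(X), size=len(X), replace=True)
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Aggregation: majority vote over the predictions of the individual trees
all_preds = np.array([tree.predict(X) for tree in trees])
majority_vote = (all_preds.mean(axis=0) >= 0.5).astype(int)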

Implementation

We will explore the impact of bagging on imbalanced classification with a simplified example built using the scikit-learn library. For this, we generate an imbalanced dataset with 2 target classes and a class distribution of 90% for the majority class and 10% for the minority class.

Python




# Import Required Libraries
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
 
# Create synthetic dataset
X, y = make_classification(n_samples=1500, n_features=15, n_informative=5, n_redundant=1, n_classes=2, weights=[0.90, 0.10])
 
# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Let’s visualize this class distribution:

Python




import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
 
# Count occurrences of each label
label_counts = np.bincount(y)
 
# Visualize the imbalanced data
fig, ax = plt.subplots(figsize=(8, 6))
ax = sns.barplot(x=np.arange(2), y=label_counts, palette="Set1")
ax.set_xticks(np.arange(2))
ax.set_xticklabels(['Class 0 (Majority)', 'Class 1 (Minority)'])  # make_classification labels the classes 0 and 1
ax.set_title("Count Plot of Synthetic Datapoints", fontsize=16)
ax.set_xlabel("Classes", fontsize=14)
ax.set_ylabel("# Samples", fontsize=14)
plt.show()


Output:

[Bar plot: class 0 holds roughly 90% of the samples, class 1 roughly 10%]

After visualizing the class distribution, we train a standard bagging classifier:

Python




# Standard Bagging
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
 
# Create a bagging classifier
bagging_clf = BaggingClassifier()
 
# Train the bagging classifier on the training data
bagging_clf.fit(X_train, y_train)
 
# Make predictions on the test set
y_pred = bagging_clf.predict(X_test)
 
# Calculate the accuracy of the model
acc_bag = accuracy_score(y_test, y_pred)
print(" Bagging Classifier - Test Accuracy:", round(acc_bag, 2))


Output:

Bagging Classifier - Test Accuracy: 0.92
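
Note that overall accuracy can be misleading here: with a 90/10 class split, a model that always predicts the majority class would already reach roughly 0.90 accuracy. As a quick follow-up (the exact numbers depend on the random split and are not shown), per-class precision and recall can be inspected with scikit-learn's classification_report:

Python


from sklearn.metrics import classification_report

# Per-class precision, recall and F1 show how well the minority class is actually handled
print(classification_report(y_test, y_pred, target_names=["class 0 (majority)", "class 1 (minority)"]))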

Bagging With Random Undersampling

To further improve the performance of the bagging classifier on the imbalanced dataset, we can perform random undersampling of the majority class. This technique randomly removes instances from the majority class until the class distribution is balanced, so the model is less likely to be biased towards the majority class and becomes better at classifying minority instances. In the implementation below, the undersampling is applied once to the training set before bagging; a variant that resamples within each bootstrap is sketched after the output.

Implementation

Python




# Bagging With Random Undersampling
from sklearn.ensemble import BaggingClassifier
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import accuracy_score
 
# Apply random undersampling to the training set
rus = RandomUnderSampler()
X_train_resampled, y_train_resampled = rus.fit_resample(X_train, y_train)
 
# Create a standard bagging classifier (it will be trained on the undersampled data)
bagging_classifier = BaggingClassifier()
 
# Train the bagging classifier on the resampled training data
bagging_classifier.fit(X_train_resampled, y_train_resampled)
 
# Make predictions on the test set
y_pred = bagging_classifier.predict(X_test)
 
# Calculate the accuracy of the model
acc_rndm = accuracy_score(y_test, y_pred)
print(" Bagging Classifier with Random Undersampling - Test Accuracy:", round(acc_rndm, 2))


Output:

Bagging Classifier with Random Undersampling - Test Accuracy: 0.96
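
As a variant of the approach above, the imbalanced-learn library provides BalancedBaggingClassifier, which draws a freshly undersampled bootstrap for every base estimator instead of undersampling the training set once up front. The sketch below assumes imbalanced-learn is installed and relies on its default random undersampling; the resulting score will vary with the data split:

Python


from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.metrics import accuracy_score

# Each base estimator is trained on a bootstrap sample undersampled to a balanced class ratio
balanced_bagging_clf = BalancedBaggingClassifier(n_estimators=10, random_state=42)
balanced_bagging_clf.fit(X_train, y_train)

y_pred_bb = balanced_bagging_clf.predict(X_test)
acc_bb = accuracy_score(y_test, y_pred_bb)
print("Balanced Bagging Classifier - Test Accuracy:", round(acc_bb, 2))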


