Random Forest for Imbalanced Classification

Now that we have an idea of bagging techniques, let's understand Random Forest and how it can be used for imbalanced classification problems.

Standard Random Forest

Random forest is a variation of bagging that uses decision trees as the base models. Each tree is trained on a random bootstrap sample of the training data, and at each split only a random subset of the features is considered. At prediction time, the algorithm aggregates the predictions of all the individual trees into a final prediction, which typically yields higher predictive accuracy and better control of over-fitting.

For classification tasks, to make a prediction for a new data point the random forest passes the point through every decision tree in the forest. Each tree independently predicts a class, and the class that receives the most votes across all trees becomes the final prediction.
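
To make the voting step concrete, here is a minimal, self-contained sketch (the small dataset and five-tree forest are only for illustration) that compares a manual majority vote over the individual trees with the forest's own prediction. Note that scikit-learn internally averages the trees' predicted probabilities, which coincides with a majority vote for hard 0/1 labels:

Python

# Illustration: majority voting across the trees of a random forest
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# A tiny synthetic dataset, used only for this demonstration
X_demo, y_demo = make_classification(n_samples=200, random_state=0)
forest = RandomForestClassifier(n_estimators=5, random_state=0).fit(X_demo, y_demo)

# Each tree votes independently (shape: n_trees x n_points)
tree_votes = np.array([tree.predict(X_demo[:5]) for tree in forest.estimators_])

# With 0/1 labels, the mean vote exceeds 0.5 exactly when class 1
# wins the majority of the trees
majority_vote = (tree_votes.mean(axis=0) > 0.5).astype(int)

print("Per-tree votes:\n", tree_votes.astype(int))
print("Majority vote:  ", majority_vote)
print("Forest predicts:", forest.predict(X_demo[:5]))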

Implementation

Now we will go through a practical example to explore the impact of random forest on imbalanced classification.

Python

# Import Required Libraries
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
 
# Generate an imbalanced dataset (roughly 90% majority / 10% minority);
# random_state is added for reproducibility, matching the models below
X, y = make_classification(n_samples=1500, n_features=15, n_informative=5,
                           n_redundant=1, n_classes=2, weights=[0.90, 0.10],
                           random_state=42)
 
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
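
It is worth confirming the imbalance before modelling: a quick count of the labels should show roughly a 90/10 split (the exact numbers vary slightly, since make_classification only targets the requested weights):

Python

# Check the class distribution of the generated dataset
from collections import Counter

print("Class distribution:", Counter(y))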


After splitting the data, we train a standard Random Forest model:

Python

# Standard Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
 
# Create a Random Forest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
 
# Train the Random Forest classifier on the training data
rf.fit(X_train, y_train)
 
# Make predictions on the test set
y_pred_rf = rf.predict(X_test)
 
# Calculate the accuracy of the model
acc_rf = accuracy_score(y_test, y_pred_rf)
print(f"Standard Random Forest Accuracy: {acc_rf:.2f}")


Output:

Standard Random Forest Accuracy: 0.96

Random Forest With Class Weighting

The standard model's 0.96 accuracy looks strong, but with a 90/10 class split a classifier that always predicts the majority class would already score about 0.90, so accuracy alone says little about the minority class. Another way to handle class imbalance is to adjust the weights assigned to each class in the Random Forest algorithm, known as class weighting. By giving more weight to the minority class and less to the majority class, the algorithm pays more attention to the minority class during training, which helps the model learn from the imbalanced data more effectively and improves its performance on minority-class predictions.
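
In scikit-learn, class_weight="balanced" assigns each class a weight inversely proportional to its frequency, i.e. n_samples / (n_classes * count(class)). A quick sketch of the weights this produces for our training labels (assuming y_train from the split above):

Python

# Inspect the weights that class_weight="balanced" would assign
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
print(dict(zip(classes, weights)))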

Implementation

Python

# Random Forest with Class Weighting
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
 
# Create a Random Forest Classifier with Class Weighting
rf_cw = RandomForestClassifier(random_state=42, class_weight="balanced")
 
# Train the Random Forest classifier on the training data
rf_cw.fit(X_train, y_train)
 
# Make predictions on the test set
y_pred_cw = rf_cw.predict(X_test)
 
# Calculate the accuracy of the model
acc_cw = accuracy_score(y_test, y_pred_cw)
print(f"Random Forest with Class Weighting Accuracy: {acc_cw:.2f}")


Output:

Random Forest with Class Weighting Accuracy: 0.93

Random Forest With Bootstrap Class Weighting

The bootstrapping process is a vital part of random forests, and combining class weighting with it is another effective way to handle class imbalance. In a traditional random forest, each tree is trained on a bootstrap sample of the original dataset, in which some samples are repeated and others left out. With bootstrap class weighting (class_weight='balanced_subsample' in scikit-learn), the class weights are recomputed for every tree from the class distribution of that tree's bootstrap sample, so minority-class samples receive proportionally more weight in each individual tree.
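
To see how this differs from plain 'balanced' weighting, the sketch below draws one bootstrap sample of the training labels and recomputes the balanced weights on it, mimicking what 'balanced_subsample' does for each tree (assuming y_train from the split above):

Python

# Illustration: 'balanced_subsample' recomputes weights per bootstrap sample
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(42)

# Draw one bootstrap sample (sampling with replacement) of the labels
boot_idx = rng.integers(0, len(y_train), size=len(y_train))
y_boot = y_train[boot_idx]

# Weights computed on the full training labels (as class_weight="balanced" does)
print("balanced:          ",
      compute_class_weight(class_weight="balanced", classes=np.unique(y_train), y=y_train))

# Weights recomputed on the bootstrap sample (as 'balanced_subsample' does per tree)
print("balanced_subsample:",
      compute_class_weight(class_weight="balanced", classes=np.unique(y_boot), y=y_boot))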

Implementation

Python

# Random Forest with Bootstrap Class Weighting
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
 
# Create a Random Forest classifier with Bootstrap Class Weighting
rf_bcw = RandomForestClassifier(random_state=42, class_weight='balanced_subsample')
 
# Train the Random Forest classifier on the training data
rf_bcw.fit(X_train, y_train)
 
# Make predictions on the test set
y_pred_bcw = rf_bcw.predict(X_test)
 
# Calculate the accuracy of the model
acc_bcw = accuracy_score(y_test, y_pred_bcw)
print(f"Random Forest with Bootstrap Class Weighting Accuracy: {acc_bcw:.2f}")


Output:

Random Forest with Bootstrap Class Weighting Accuracy: 0.94

Random Forest With Random Undersampling

Similar to bagging, random undersampling can be combined with random forests to improve performance on imbalanced datasets. By randomly removing instances from the majority class in the training set before fitting, we balance the class distribution. This lets the random forest focus more on the minority class during training, potentially improving its ability to learn from minority-class instances and classify them correctly.
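
Counting the labels before and after resampling makes the effect concrete (assuming X_train and y_train from the split above; the resampling step is repeated here purely for inspection):

Python

# Inspect the effect of random undersampling on the class counts
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

rus_demo = RandomUnderSampler(random_state=42)
X_demo_res, y_demo_res = rus_demo.fit_resample(X_train, y_train)

print("Before undersampling:", Counter(y_train))
print("After undersampling: ", Counter(y_demo_res))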

Implementation

Python

# Random Forest with Random Undersampling
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import accuracy_score
 
# Apply random undersampling to the training set
rus = RandomUnderSampler(random_state=42)
X_train_resampled, y_train_resampled = rus.fit_resample(X_train, y_train)
 
# Create a Random Forest classifier with random undersampling
rf_rus = RandomForestClassifier(random_state=42)
 
# Train the Random Forest classifier on the resampled training data
rf_rus.fit(X_train_resampled, y_train_resampled)
 
# Make predictions on the test set
y_pred_rus = rf_rus.predict(X_test)
 
# Calculate the accuracy of the model
acc_rus = accuracy_score(y_test, y_pred_rus)
print(f"Random Forest with Random Undersampling Accuracy: {acc_rus:.2f}")


Output:

Random Forest with Random Undersampling Accuracy: 0.89
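
Note that the undersampled model's lower accuracy does not mean it is worse: plain accuracy is dominated by the majority class and hides what happens on the minority class. Imbalance-aware metrics such as balanced accuracy, or the per-class precision and recall in a classification report, give a fairer comparison. A sketch, assuming the predictions computed above:

Python

# Compare the four models with imbalance-aware metrics
from sklearn.metrics import balanced_accuracy_score, classification_report

models = [("Standard RF", y_pred_rf),
          ("Class weighting", y_pred_cw),
          ("Bootstrap class weighting", y_pred_bcw),
          ("Random undersampling", y_pred_rus)]

for name, y_pred in models:
    bal_acc = balanced_accuracy_score(y_test, y_pred)
    print(f"{name}: balanced accuracy = {bal_acc:.2f}")

# Per-class precision and recall for the undersampled model
print(classification_report(y_test, y_pred_rus))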
