Class Imbalance Handling in Machine Learning

Resampling, which modifies the sample distribution, is a frequently used technique for handling heavily imbalanced datasets. It can be done either by over-sampling, which adds more examples from the minority class, or by under-sampling, which removes samples from the majority class. Either way, the goal is to balance the class distribution and thereby reduce the difficulties caused by severe skew.

Balancing classes with strategies like over- and under-sampling has advantages, but each also has drawbacks.

A basic form of over-sampling replicates randomly chosen records from the minority class, which can cause overfitting.

Conversely, the basic form of under-sampling removes random records from the majority class, which can discard useful information.

In up-sampling, samples from the minority class are randomly duplicated until it reaches parity with the majority class. There are several methods for achieving this; two are shown below.

1. Using Random Up-Sampling

In random up-sampling, observations from the minority class are duplicated, by sampling with replacement, until the minority and majority classes are balanced.

Because it keeps every original row, up-sampling is useful when data is scarce, but the duplicated records increase the risk of overfitting. The same resample utility can also perform the opposite operation, random under-sampling, which has advantages for very large datasets (millions of rows) at the risk of losing important information during removal; a sketch of it follows the example below.

Example:

Python3
# Importing scikit-learn, pandas library
from sklearn.utils import resample
from sklearn.datasets import make_classification
import pandas as pd

# Making a DataFrame with 100
# dummy samples having 4 features,
# divided into 2 classes in a ratio of 80:20
X, y = make_classification(n_classes=2, 
                           weights=[0.8, 0.2],
                           n_features=4, 
                           n_samples=100, 
                           random_state=42)

df = pd.DataFrame(X, columns=['feature_1',
                              'feature_2',
                              'feature_3',
                              'feature_4'])
df['balance'] = y
print(df)

# Dividing majority and minority classes
df_major = df[df.balance == 0]
df_minor = df[df.balance == 1]

# Upsampling minority class with replacement
# until it matches the majority class count (80)
df_minor_sample = resample(df_minor,
                           replace=True,
                           n_samples=len(df_major),
                           random_state=42)

# Combine majority and upsampled minority class
df_sample = pd.concat([df_major, df_minor_sample])

# Display count of data points in both classes
print(df_sample.balance.value_counts())

Output:

    feature_1  feature_2  feature_3  feature_4  balance
0   -1.053839  -1.027544  -0.329294   0.826007        1
1    1.569317   1.306542  -0.239385  -0.331376        0
2   -0.658926  -0.357633   0.723682  -0.628277        0
3   -0.136856   0.460938   1.896911  -2.281386        0
4   -0.048629   0.502301   1.778730  -2.171053        0
..        ...        ...        ...        ...      ...
95  -2.241820  -1.248690   2.357902  -2.009185        0
96   0.573042   0.362054  -0.462814   0.341294        1
97  -0.375121  -0.149518   0.588465  -0.575002        0
98   1.042518   1.058239   0.461945  -0.984846        0
99  -0.121203  -0.043997   0.204211  -0.203119        0

[100 rows x 5 columns]
0    80
1    80
Name: balance, dtype: int64

Explanation:

  • Firstly, we’ll divide the data points from each class into separate DataFrames.
  • After this, the minority class is resampled with replacement by setting the number of data points equivalent to that of the majority class.
  • In the end, we’ll concatenate the original majority class DataFrame and up-sampled minority class DataFrame.
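
As noted above, the same resample utility can also perform random under-sampling by drawing from the majority class without replacement. A minimal sketch, reusing df_major and df_minor from the example above:

Python3
# Down-sampling the majority class without replacement
# to match the minority class count
df_major_sample = resample(df_major,
                           replace=False,
                           n_samples=len(df_minor),
                           random_state=42)

# Combine down-sampled majority and original minority class
df_under = pd.concat([df_major_sample, df_minor])
print(df_under.balance.value_counts())

Both classes now contain 20 samples, at the cost of discarding 60 majority-class rows.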

2. Using RandomOverSampler:

Oversampling is the process of adding more copies of minority-class samples. This approach is helpful when data resources are constrained, though oversampling can lead to overfitting and reduced generalization performance on the test set.

This can be done with the help of the RandomOverSampler method present in imblearn. By default it randomly duplicates existing data points belonging to the minority class, sampling with replacement.

Syntax: RandomOverSampler(sampling_strategy='auto', random_state=None, shrinkage=None)

Parameters:

  • sampling_strategy: specifies which classes to resample. ‘minority’: only the minority class; ‘not minority’: all classes except the minority class; ‘not majority’: all classes except the majority class; ‘all’: all classes; ‘auto’: equivalent to ‘not majority’. Default value is ‘auto’.
  • random_state: seed controlling the randomization of the resampling, so results are reproducible across runs. Default value is None.
  • shrinkage: parameter controlling the shrinkage applied to the smoothed bootstrap. float: shrinkage factor applied to all classes; dict: a specific shrinkage factor per class; None: no shrinkage, i.e. samples are duplicated exactly. Default value is None.
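
As an illustrative sketch of the non-default parameters (the shrinkage value of 0.5 is arbitrary, and the shrinkage parameter requires a reasonably recent imbalanced-learn release), a smoothed bootstrap perturbs each duplicated sample instead of copying it exactly:

Python3
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_classes=2,
                           weights=[0.8, 0.2],
                           n_features=4,
                           n_samples=100,
                           random_state=42)

# Resample only the minority class; shrinkage adds small
# random perturbations to each duplicated sample
sampler = RandomOverSampler(sampling_strategy='minority',
                            shrinkage=0.5,
                            random_state=42)
X_res, y_res = sampler.fit_resample(X, y)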

Implementation of RandomOverSampler

Python3
# Importing imblearn,scikit-learn library
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification

# Making a dataset with 100
# dummy samples having 4 features,
# divided into 2 classes in a ratio of 80:20
X, y = make_classification(n_classes=2, 
                           weights=[0.8, 0.2],
                           n_features=4, 
                           n_samples=100, 
                           random_state=42)

# Printing number of samples in
# each class before Over-Sampling
print('Before Over-Sampling: ')
print('Samples in class 0: ', (y == 0).sum())
print('Samples in class 1: ', (y == 1).sum())

# Over Sampling Minority class
OverS = RandomOverSampler(random_state=42)

# Resample the predictors (X) and
# target (y) using fit_resample()
X_Over, Y_Over = OverS.fit_resample(X, y)

# Printing number of samples in
# each class after Over-Sampling
print('After Over-Sampling: ')
print('Samples in class 0: ', (Y_Over == 0).sum())
print('Samples in class 1: ', (Y_Over == 1).sum())

Output:

Before Over-Sampling: 
Samples in class 0: 80
Samples in class 1: 20
After Over-Sampling:
Samples in class 0: 80
Samples in class 1: 80

  • This code illustrates how to use imbalanced-learn’s RandomOverSampler to address class imbalance in a dataset.
  • By duplicating samples from the minority class, it balances the class distribution.
  • For comparison, the number of samples in each class is printed both before and after oversampling.
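
As a usage sketch (LogisticRegression is just an illustrative choice of estimator), the resampled arrays can be passed directly to any scikit-learn classifier. Note that resampling is normally applied to the training split only, so that the test set keeps the original class distribution:

Python3
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hold out a stratified test set first, then
# oversample only the training portion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

X_train_over, y_train_over = OverS.fit_resample(X_train, y_train)

clf = LogisticRegression().fit(X_train_over, y_train_over)
print('Test accuracy: ', clf.score(X_test, y_test))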

How to Handle Imbalanced Classes in Machine Learning

In machine learning, “imbalanced classes” is a common problem, particularly in classification, arising when a dataset has an unequal ratio of data points in each class.

Training a model becomes much trickier because plain accuracy is no longer a reliable metric for measuring its performance. If the minority class has far fewer data points, it may end up being ignored almost completely during training.
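
A minimal sketch of why plain accuracy misleads, using scikit-learn’s DummyClassifier to stand in for a model that only ever predicts the majority class:

Python3
from sklearn.dummy import DummyClassifier
from sklearn.datasets import make_classification

# 90:10 imbalanced dummy dataset
X, y = make_classification(n_classes=2,
                           weights=[0.9, 0.1],
                           n_samples=1000,
                           random_state=42)

# Always predicting the most frequent class
# still scores roughly 90% accuracy
clf = DummyClassifier(strategy='most_frequent').fit(X, y)
print('Accuracy: ', clf.score(X, y))

Despite never predicting the minority class, the model looks highly accurate, which is exactly the trap described above.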
