Outlier Removal in Dataset using IQR

In this example, we are using the interquartile range (IQR) method to detect and remove outliers in the ‘bmi’ column of the diabetes dataset. It calculates the upper and lower limits based on the IQR, identifies outlier indices using Boolean arrays, and then removes the corresponding rows from the DataFrame, resulting in a new DataFrame with outliers excluded. The before and after shapes of the DataFrame are printed for comparison.

Python3




# Importing
import sklearn
from sklearn.datasets import load_diabetes
import pandas as pd
 
# Load the dataset
diabetes = load_diabetes()
 
# Create the dataframe
column_name = diabetes.feature_names
df_diabetes = pd.DataFrame(diabetes.data)
df_diabetes .columns = column_name
df_diabetes .head()
print("Old Shape: ", df_diabetes.shape)
 
''' Detection '''
# IQR
# Calculate the upper and lower limits
Q1 = df_diabetes['bmi'].quantile(0.25)
Q3 = df_diabetes['bmi'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5*IQR
upper = Q3 + 1.5*IQR
 
# Create arrays of Boolean values indicating the outlier rows
upper_array = np.where(df_diabetes['bmi'] >= upper)[0]
lower_array = np.where(df_diabetes['bmi'] <= lower)[0]
 
# Removing the outliers
df_diabetes.drop(index=upper_array, inplace=True)
df_diabetes.drop(index=lower_array, inplace=True)
 
# Print the new shape of the DataFrame
print("New Shape: ", df_diabetes.shape)


Output:

Old Shape:  (442, 10)
New Shape:  (439, 10)

Detect and Remove the Outliers using Python

Outliers, deviating significantly from the norm, can distort measures of central tendency and affect statistical analyses. The piece explores common causes of outliers, from errors to intentional introduction, and highlights their relevance in outlier mining during data analysis.

The article delves into the significance of outliers in data analysis, emphasizing their potential impact on statistical results.

Similar Reads

What is Outlier?

An Outlier is a data item/object that deviates significantly from the rest of the (so-called normal) objects. Identifying outliers is important in statistics and data analysis because they can have a significant impact on the results of statistical analyses. The analysis for outlier detection is referred to as outlier mining....

Outlier Detection And Removal

Here pandas data frame is used for a more realistic approach as real-world projects need to detect the outliers that arose during the data analysis step, the same approach can be used on lists and series-type objects....

Outlier Removal in Dataset using IQR

...

Conclusion

...

Frequently Asked Questions on Outlier Removal

...

Contact Us