Handling missing values using Sunbird
Sunbird is a feature engineering library. It provides techniques for handling missing values and outliers, categorical encoding, normalization and standardization, feature selection, and more.
Installation:
pip install sunbird
Handling Missing Values:
Datasets might have missing values, which can cause problems for certain machine learning algorithms. It is good practice to identify and replace the missing values in each column of the input data before modeling. This is called imputation.
1. Median Imputation
For columns that hold continuous numeric values, missing entries can be replaced by the median of the remaining values in the column. Because no rows are dropped, this avoids the data loss that comes with discarding incomplete records. Median imputation is one of the statistical approaches to handling missing values, alongside mean and mode imputation.
Python3
from sunbird.nan_values import median_imputation
median_imputation(dataframe, feature, plot=True)

Survived   Age    Fare      Age_median
0          22.0   7.2500    22.0
1          38.0   71.283    38.0
1          NaN    7.9250    26.0
1          NaN    53.100    26.0
0          35.0   8.0500    35.0
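To make the idea concrete, here is a minimal sketch of median imputation written in plain pandas. It is not Sunbird's implementation; the toy DataFrame and the Age_median column name are assumptions chosen for illustration, and the numbers are unrelated to the table above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the dataset behind the table above.
df = pd.DataFrame({"Age": [22.0, 38.0, np.nan, np.nan, 35.0, 30.0]})

# pandas skips NaN by default, so the median is taken over 22, 30, 35, 38.
age_median = df["Age"].median()

# Write the result to a new column so the original column stays untouched.
df["Age_median"] = df["Age"].fillna(age_median)

print(df)
```

Keeping the imputed values in a separate column makes it easy to compare the distribution before and after imputation.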
2. Mean Imputation
For columns that hold continuous numeric values, missing entries can be replaced by the mean of the remaining values in the column. Because no rows are dropped, this avoids the data loss that comes with discarding incomplete records. Like median imputation, this is a statistical approach to handling missing values.
Python3
from sunbird.nan_values import mean_imputation
mean_imputation(dataframe, feature, plot=True)

Survived   Age    Fare      Age_mean
0          22.0   7.2500    22.0
1          38.0   71.283    38.0
1          NaN    7.9250    26.50
1          NaN    53.100    26.50
0          35.0   8.0500    35.0
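The same idea with the mean can be sketched in plain pandas as follows. Again, this is an illustration under assumed names and toy data, not the code Sunbird runs internally.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the dataset behind the table above.
df = pd.DataFrame({"Age": [22.0, 38.0, np.nan, np.nan, 35.0, 25.0]})

# Mean of the observed values only: (22 + 38 + 35 + 25) / 4
age_mean = df["Age"].mean()

# Fill the gaps in a new column and keep the original for comparison.
df["Age_mean"] = df["Age"].fillna(age_mean)

print(df)
```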
3. Mode Imputation
For columns that hold numeric values, missing entries can be replaced by the mode (the most frequent value) of the remaining values in the column. Because no rows are dropped, this avoids the data loss that comes with discarding incomplete records. Like median and mean imputation, this is a statistical approach to handling missing values.
Python3
from sunbird.nan_values import mode_imputation
mode_imputation(dataframe, feature, plot=True)

Survived   Age    Fare      Age_mode
0          22.0   7.2500    22.0
1          38.0   71.283    38.0
1          NaN    7.9250    25.0
1          NaN    53.100    25.0
0          35.0   8.0500    35.0
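A rough pandas equivalent of mode imputation is shown below. One detail worth knowing is that pandas returns the mode as a Series, because several values can tie for most frequent; this sketch simply takes the first one. The data and column names are assumptions, and how Sunbird resolves ties is not stated in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the dataset behind the table above.
df = pd.DataFrame({"Age": [22.0, 25.0, np.nan, 25.0, np.nan, 35.0]})

# mode() ignores NaN and returns a Series sorted in ascending order.
modes = df["Age"].mode()

# Assumption: on a tie, use the smallest of the tied values.
age_mode = modes.iloc[0]

df["Age_mode"] = df["Age"].fillna(age_mode)

print(df)
```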
4. Random Sample Imputation
Random sampling imputation consists of drawing random observations from the pool of available values in the variable and using them to fill the gaps. Unlike the other imputation techniques, random sampling imputation preserves the original distribution of the variable.
Python3
from sunbird.nan_values import random_sampler
random_sampler(dataframe, feature, plot=True)

Survived   Age    Fare      Age_random
0          22.0   7.2500    22.0
1          38.0   71.283    38.0
1          NaN    7.9250    37.0
1          NaN    53.100    28.0
0          35.0   8.0500    35.0
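The sampling step described above can be sketched in plain pandas like this. It is an illustration only: the toy data, the fixed random_state, and the choice to sample with replacement are assumptions, not details taken from Sunbird.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the dataset behind the table above.
df = pd.DataFrame(
    {"Age": [22.0, 38.0, np.nan, np.nan, 35.0, 54.0, np.nan, 27.0]}
)

missing = df["Age"].isna()

# Draw one observed value per missing row. Sampling with replacement
# (an assumption here) also works when gaps outnumber observed values.
drawn = df["Age"].dropna().sample(
    n=int(missing.sum()), replace=True, random_state=0
)

df["Age_random"] = df["Age"].copy()

# to_numpy() drops the sampled rows' index labels, so the values are
# assigned by position instead of being re-aligned to their source rows.
df.loc[missing, "Age_random"] = drawn.to_numpy()

print(df)
```

Fixing random_state makes the imputation reproducible, which matters when the same transformation has to be repeated on later runs.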
5. End Of Distribution Imputation
If you suspect that values are not missing at random, capturing that information is important. In this scenario, you replace the missing data with values from the tails of the variable's distribution.
Python3
from sunbird.nan_values import endof_distribution
endof_distribution(dataframe, feature, plot=True)

Survived   Age    Fare      Age_enddist
0          22.0   7.2500    22.0
1          38.0   71.283    38.0
1          NaN    7.9250    40.0
1          NaN    53.100    40.0
0          35.0   8.0500    35.0
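The article does not say how Sunbird picks the tail value. One common convention, assumed in the sketch below, is the mean plus three standard deviations. The toy data and column names are also assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the dataset behind the table above.
df = pd.DataFrame({"Age": [22.0, 38.0, np.nan, np.nan, 35.0, 25.0]})

# Assumed rule: mean + 3 * standard deviation of the observed values.
# Both statistics skip NaN by default.
tail_value = df["Age"].mean() + 3 * df["Age"].std()

# Imputed rows now sit far out in the right tail, which makes them
# easy for a model to tell apart from genuinely observed values.
df["Age_enddist"] = df["Age"].fillna(tail_value)

print(df)
```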
6. Arbitrary Value Imputation
Arbitrary value imputation consists of replacing all occurrences of missing values within a variable by an arbitrary value. Ideally, arbitrary values should be different from the median/mean/mode, and not within the normal values of the variable.
Python3
from sunbird.nan_values import arbitrary_imputation
arbitrary_imputation(dataframe, feature, arbit, plot=True)

Survived   Age    Fare      Age_arbitrary
0          22.0   7.2500    22.0
1          38.0   71.283    38.0
1          NaN    7.9250    25.0
1          NaN    53.100    25.0
0          35.0   8.0500    35.0
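As a small sketch of the guideline that the arbitrary value should fall outside the normal values of the variable, the pandas snippet below checks the chosen value against the observed range before filling. The value -1.0, the toy data, and the range check itself are illustrative assumptions, not Sunbird behavior.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the dataset behind the table above.
df = pd.DataFrame({"Age": [22.0, 38.0, np.nan, np.nan, 35.0]})

# Hypothetical arbitrary value; an age can never be negative.
arbit = -1.0

# Guard: refuse a value that could be mistaken for a real observation.
observed = df["Age"].dropna()
if observed.min() <= arbit <= observed.max():
    raise ValueError("arbitrary value lies inside the observed range")

df["Age_arbitrary"] = df["Age"].fillna(arbit)

print(df)
```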
7. Capture NAN Imputation
Capture NaN imputation is used on data that is not missing completely at random. It adds an indicator feature that records which values were missing, which captures the importance of the missingness itself. Because it adds extra features, however, it may contribute to the curse of dimensionality.
Python3
from sunbird.nan_values import capture_nan
capture_nan(dataframe, feature)

Survived   Age    Fare      Age_nan
0          22.0   7.2500    0
1          38.0   71.283    0
1          26.0   7.9250    1
1          26.0   53.100    1
0          35.0   8.0500    0
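The output above has two parts: a 0/1 flag and a filled Age column. A minimal pandas sketch of that pattern follows. Filling with the median is an assumption made for illustration, and the toy data is hypothetical. The order matters: the flag has to be built before the gaps are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the dataset behind the table above.
df = pd.DataFrame({"Age": [22.0, 38.0, np.nan, np.nan, 35.0]})

# Step 1: record where values were missing (1 = missing, 0 = observed).
df["Age_nan"] = df["Age"].isna().astype(int)

# Step 2: only now fill the gaps. The median is an assumed choice here.
df["Age"] = df["Age"].fillna(df["Age"].median())

print(df)
```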
8. Frequent Value Imputation
Frequent value imputation is used for handling missing values in categorical features. In this technique, we sort the values in the column by frequency in descending order. The first value in that order is the most frequent one, and we replace the missing values with it.
Python3
from sunbird.nan_values import frequent
frequent(dataframe, feature)

Survived   Age    Fare
0          22.0   7.2500
1          38.0   71.283
1          36.0   7.9250
1          36.0   53.100
0          35.0   8.0500
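The sort-by-frequency step described above maps directly onto pandas' value_counts, which returns counts in descending order and ignores missing values by default. The sketch below uses a hypothetical categorical column; it illustrates the technique and is not Sunbird's code.

```python
import pandas as pd

# Hypothetical categorical column with two missing entries.
df = pd.DataFrame(
    {"GarageType": ["Attchd", None, "Detchd", "Attchd", None]}
)

# value_counts() sorts by frequency, highest first, and skips missing
# values, so the first index label is the most frequent category.
most_frequent = df["GarageType"].value_counts().index[0]

df["GarageType"] = df["GarageType"].fillna(most_frequent)

print(df)
```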
9. Fill With Missing Imputation
This is the simplest feature engineering technique, and it is used when there are a lot of missing values. We replace each missing value with the word "Missing" and keep the remaining values as they are.
Python3
from sunbird.nan_values import fill_missing
fill_missing(dataframe, feature)

     BsmtQual   FireplaceQu   GarageType   SalePrice
0    Gd         TA            Attchd       208500
1    Gd         TA            Missing      181500
2    Gd         TA            Attchd       223500
3    TA         Gd            Missing      140000
4    Gd         TA            Attchd       250000
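In plain pandas, the same effect can be sketched with a single fillna call. The toy DataFrame is hypothetical and the snippet is an illustration, not Sunbird's implementation.

```python
import pandas as pd

# Hypothetical categorical column with two missing entries.
df = pd.DataFrame(
    {"GarageType": ["Attchd", None, "Attchd", None, "Attchd"]}
)

# Treat missingness as its own category by labelling it explicitly.
df["GarageType"] = df["GarageType"].fillna("Missing")

print(df)
```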
Reference:- www.sunbird.ml