Handling missing values using Sunbird

Sunbird is a Python library for feature engineering. It provides techniques for handling missing values and outliers, categorical encoding, normalization and standardization, and feature selection.

Installation:

pip install sunbird

Handling Missing Values:

Datasets often contain missing values, which can cause problems for many machine learning algorithms. It is good practice to identify the missing values in each column of the input data and replace them before building a predictive model. This is called imputation.
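Before choosing an imputation technique, it helps to see how many values are missing in each column. The snippet below is a minimal sketch using plain pandas rather than Sunbird; the toy DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not the dataset used in this article.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Age": [22.0, 38.0, np.nan, np.nan, 35.0],
    "Fare": [7.25, 71.283, 7.925, 53.1, 8.05],
})

# Number of missing values in each column.
missing_counts = df.isna().sum()

# Fraction of rows that are missing, per column.
missing_share = df.isna().mean()

print(missing_counts)
print(missing_share)
```

Columns with a count of zero need no imputation; the others are candidates for the techniques that follow.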

1. Median Imputation

In columns that hold continuous numeric values, missing entries can be replaced by the median of the remaining values in the column. Compared with approaches that drop incomplete rows or columns, this avoids data loss. Median imputation is one of the statistical approaches to handling missing values, alongside mean and mode imputation.

Python3

from sunbird.nan_values import median_imputation
median_imputation(dataframe, feature, plot=True)

Survived  Age   Fare     Age_median
0         22.0  7.2500   22.0
1         38.0  71.283   38.0
1         NaN   7.9250   26.0
1         NaN   53.100   26.0
0         35.0  8.0500   35.0
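To make the idea concrete, here is a minimal sketch of median imputation written with plain pandas. It is not Sunbird's implementation; the toy data and the Age_median column name are assumptions chosen to mirror the output above.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not the dataset used in this article.
df = pd.DataFrame({"Age": [22.0, 38.0, np.nan, np.nan, 35.0, 26.0]})

# Median of the observed values; pandas skips NaN by default.
age_median = df["Age"].median()

# Keep the original column and write the imputed copy next to it.
df["Age_median"] = df["Age"].fillna(age_median)

print(df)
```

Writing the result to a new column keeps the original values available for comparison.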

2. Mean Imputation

In columns that hold continuous numeric values, missing entries can be replaced by the mean of the remaining values in the column. Compared with approaches that drop incomplete rows or columns, this avoids data loss. Like median and mode imputation, it is a statistical approach to handling missing values.

Python3

from sunbird.nan_values import mean_imputation
mean_imputation(dataframe, feature, plot=True)

Survived  Age   Fare     Age_mean
0         22.0  7.2500   22.0
1         38.0  71.283   38.0
1         NaN   7.9250   26.50
1         NaN   53.100   26.50
0         35.0  8.0500   35.0
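The sketch below shows mean imputation with plain pandas and compares the mean with the median on a column that contains one extreme value. It is an illustration under assumed toy data, not Sunbird's own code.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; 80.0 plays the role of an extreme value.
df = pd.DataFrame({"Age": [22.0, 24.0, np.nan, 26.0, 80.0]})

# Mean and median of the observed values (NaN is skipped by default).
age_mean = df["Age"].mean()
age_median = df["Age"].median()

# Impute with the mean and keep the original column untouched.
df["Age_mean"] = df["Age"].fillna(age_mean)

print(f"mean={age_mean}, median={age_median}")
print(df)
```

In this toy column the single extreme value pulls the mean well above the median, which is one reason to check the distribution before choosing between the two.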

3. Mode Imputation

In columns that hold numeric values, missing entries can be replaced by the mode, the most frequently occurring value among the remaining values in the column. Compared with approaches that drop incomplete rows or columns, this avoids data loss. Like mean and median imputation, it is a statistical approach to handling missing values.

Python3

from sunbird.nan_values import mode_imputation
mode_imputation(dataframe, feature, plot=True)

Survived  Age   Fare     Age_mode
0         22.0  7.2500   22.0
1         38.0  71.283   38.0
1         NaN   7.9250   25.0
1         NaN   53.100   25.0
0         35.0  8.0500   35.0
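A minimal pandas sketch of mode imputation follows. It is not Sunbird's implementation; the toy data is made up, and taking the first mode when several values tie is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({"Age": [22.0, 25.0, np.nan, 25.0, 35.0, np.nan]})

# Series.mode() returns all modes in sorted order and ignores NaN;
# this sketch simply takes the first one if there is a tie.
age_mode = df["Age"].mode().iloc[0]

df["Age_mode"] = df["Age"].fillna(age_mode)

print(df)
```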

4. Random Sample Imputation

Random sample imputation fills each missing entry with a value drawn at random from the observed values of the same variable. Unlike the other imputation techniques described here, it preserves the variable's original distribution.

Python3

from sunbird.nan_values import random_sampler
random_sampler(dataframe, feature, plot=True)

Survived  Age   Fare     Age_random
0         22.0  7.2500   22.0
1         38.0  71.283   38.0
1         NaN   7.9250   37.0
1         NaN   53.100   28.0
0         35.0  8.0500   35.0
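The following is a minimal sketch of random sample imputation in plain pandas. It is not Sunbird's code; the toy data, the fixed random_state, and sampling with replacement are assumptions made so the example is small and reproducible.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({"Age": [22.0, 38.0, np.nan, np.nan, 35.0, 28.0]})

# Pool of observed values to draw from.
observed = df["Age"].dropna()

# Row labels of the missing entries.
missing_idx = df.index[df["Age"].isna()]

# Draw one observed value per missing entry (with replacement).
draws = observed.sample(n=len(missing_idx), replace=True, random_state=0)

# Re-label the draws so they line up with the rows being filled.
draws.index = missing_idx

df["Age_random"] = df["Age"]
df.loc[missing_idx, "Age_random"] = draws

print(df)
```

Because each filled value comes from the observed pool, the imputed column contains only values that already occur in the data.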

5. End Of Distribution Imputation

If there is reason to suspect that values are not missing at random, capturing that information is important. In this scenario, the missing data can be replaced with values at the tails of the variable's distribution.

Python3

from sunbird.nan_values import endof_distribution
endof_distribution(dataframe, feature, plot=True)

Survived  Age   Fare     Age_enddist
0         22.0  7.2500   22.0
1         38.0  71.283   38.0
1         NaN   7.9250   40.0
1         NaN   53.100   40.0
0         35.0  8.0500   35.0
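One common way to pick a tail value is the mean plus three standard deviations. The sketch below uses that convention with plain pandas; it is an assumption for illustration and may differ from the rule Sunbird applies internally.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({"Age": [22.0, 38.0, np.nan, np.nan, 35.0, 29.0]})

# Assumed convention: a value far in the right tail, mean + 3 * std.
# Both statistics are computed on the observed values only.
tail_value = df["Age"].mean() + 3 * df["Age"].std()

df["Age_enddist"] = df["Age"].fillna(tail_value)

print(df)
```

The filled rows stand out from the observed values, which lets a model treat them differently from ordinary observations.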

6. Arbitrary Value Imputation

Arbitrary value imputation consists of replacing all occurrences of missing values within a variable by an arbitrary value. Ideally, arbitrary values should be different from the median/mean/mode, and not within the normal values of the variable.

Python3

from sunbird.nan_values import arbitrary_imputation
arbitrary_imputation(dataframe, feature, arbit, plot=True)

Survived  Age   Fare     Age_arbitrary
0         22.0  7.2500   22.0
1         38.0  71.283   38.0
1         NaN   7.9250   25.0
1         NaN   53.100   25.0
0         35.0  8.0500   35.0
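As a rough illustration, arbitrary value imputation can be written in one line of pandas. The sketch below is not Sunbird's implementation, and the value -1.0 is a hypothetical choice that lies outside the normal range of ages.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({"Age": [22.0, 38.0, np.nan, np.nan, 35.0]})

# Hypothetical arbitrary value, chosen to fall outside the observed range.
arbit = -1.0

df["Age_arbitrary"] = df["Age"].fillna(arbit)

print(df)
```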

7. Capture NAN Imputation

Capture NaN imputation is used on data that is not missing completely at random. It adds a binary indicator column that records which values were missing, so the importance of the missing values is captured. Because each indicator is an additional feature, applying the technique to many columns may lead to the curse of dimensionality.

Python3

from sunbird.nan_values import capture_nan
capture_nan(dataframe, feature)

Survived  Age   Fare     Age_nan
0         22.0  7.2500   0
1         38.0  71.283   0
1         26.0  7.9250   1
1         26.0  53.100   1
0         35.0  8.0500   0
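The indicator column shown above can be sketched in plain pandas as follows. This is not Sunbird's code; the toy data is made up, and filling the original column with the median afterwards is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({"Age": [22.0, 38.0, np.nan, np.nan, 35.0]})

# 1 where the value was missing, 0 otherwise.
# This must be computed before the original column is filled.
df["Age_nan"] = df["Age"].isna().astype(int)

# Assumed follow-up step: fill the original column with its median.
df["Age"] = df["Age"].fillna(df["Age"].median())

print(df)
```

The order matters: once the column has been filled, the information about which rows were missing is gone.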

8. Frequent Value Imputation

Frequent value imputation is used for handling missing values in categorical features. The values in the column are sorted by frequency in descending order, and the first value in that order, the most frequent one, replaces every missing value.

Python3

from sunbird.nan_values import frequent
frequent(dataframe, feature)

Survived  Age   Fare
0         22.0  7.2500
1         38.0  71.283
1         36.0  7.9250
1         36.0  53.100
0         35.0  8.0500
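For a categorical column, the steps described above can be sketched with pandas value_counts. This is an illustration rather than Sunbird's implementation, and the GarageType values are toy data.

```python
import pandas as pd

# Toy categorical data for illustration only.
df = pd.DataFrame({"GarageType": ["Attchd", None, "Detchd", "Attchd", None]})

# value_counts() sorts categories by frequency in descending order
# and ignores missing values, so the first label is the most frequent.
most_frequent = df["GarageType"].value_counts().index[0]

df["GarageType"] = df["GarageType"].fillna(most_frequent)

print(df)
```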

9. Fill With Missing Imputation

This is the simplest feature engineering technique, and it is used when there are a lot of missing values. Missing values are replaced with the word "Missing", and the remaining values are kept as they are.

Python3

from sunbird.nan_values import fill_missing
fill_missing(dataframe, feature)

   BsmtQual  FireplaceQu  GarageType  SalePrice
0  Gd        TA           Attchd      208500
1  Gd        TA           Missing     181500
2  Gd        TA           Attchd      223500
3  TA        Gd           Missing     140000
4  Gd        TA           Attchd      250000
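In plain pandas, the same effect can be sketched with a single fillna call. This is not Sunbird's code; the toy column below only mirrors the GarageType values shown in the output above.

```python
import pandas as pd

# Toy categorical data for illustration only.
df = pd.DataFrame({"GarageType": ["Attchd", None, "Attchd", None, "Attchd"]})

# Treat missingness as its own category by using the literal label "Missing".
df["GarageType"] = df["GarageType"].fillna("Missing")

print(df)
```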

Reference: www.sunbird.ml
