Encoding Categorical Data in Python
Certain learning algorithms, such as regression and neural networks, require numeric input. Categorical data must therefore be converted to numbers before these algorithms can be used. Let us take a look at some encoding methods.
Label Encoding in Python
With label encoding, we can number the categories from 0 to num_categories – 1. Let us apply label encoding on the blood type feature.
Python3
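from sklearn.preprocessing import LabelEncoder

# Fit the encoder on the column and replace each category with its integer code
le = LabelEncoder()
without_bogus_records['blood_type'] = le.fit_transform(without_bogus_records['blood_type'])
without_bogus_records['blood_type'].unique()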
Output:
array([0, 4, 1, 3, 2, 5, 7, 6])
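To see which integer was assigned to which blood type, the fitted encoder can be inspected. The following is a minimal sketch on made-up data (the DataFrame `df`, its values, and the `blood_type_code` column are assumptions for illustration, not the article's dataset). It relies on `LabelEncoder` storing the categories in sorted order in its `classes_` attribute.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; the article's own DataFrame is not reproduced here.
df = pd.DataFrame({"blood_type": ["A+", "O-", "B+", "A+", "AB-", "O-"]})

le = LabelEncoder()
df["blood_type_code"] = le.fit_transform(df["blood_type"])

# classes_ holds the categories in sorted order; the position is the code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
decoded = le.inverse_transform(df["blood_type_code"])
print(list(decoded))
```

Since the codes simply follow the sorted order of the labels, they identify categories but do not express any meaningful ranking between them.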
One-hot Encoding in Python
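Label encoding has a limitation: the integer codes suggest an order and magnitude among the categories (for example, that category 4 is "greater" than category 1) even when no such order exists, which can mislead some models. One-hot encoding avoids this by creating a separate binary column for each category. Let us apply it to the marriage status feature.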
Python3
# Replace marriage_status with one indicator column per category
inconsistent_data = pd.get_dummies(inconsistent_data, columns=['marriage_status'])
inconsistent_data.head()
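Output: the first five rows of inconsistent_data, with marriage_status replaced by one indicator column per category.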
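For readers who want to see the shape of the result, here is a small self-contained sketch of `pd.get_dummies` on made-up data. The DataFrame `df`, its `id` column, and the three status values are assumptions for illustration only; `dtype=int` is passed so the indicator columns hold 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical toy data; the category values are assumed for illustration.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "marriage_status": ["married", "unmarried", "divorced", "married"],
})

# One indicator column is created per distinct category, named <column>_<category>.
encoded = pd.get_dummies(df, columns=["marriage_status"], dtype=int)
print(encoded)

# Each row has exactly one indicator set to 1.
row_sums = encoded.filter(like="marriage_status_").sum(axis=1)
print(row_sums.tolist())
```

Note that the number of new columns grows with the number of distinct categories, so this encoding suits features with relatively few categories.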
Ordinal Encoding in Python
Categorical data can be ordinal, meaning the order of the categories matters. For such features, the encoding should preserve that order. We will perform ordinal encoding on income groups, preserving the order 40K-75K < 75K-100K < 100K-125K < 125K-150K < 150K+.
Python3
# Map each income group to an integer that preserves the intended order
custom_map = {'40k-75k': 1, '75k-100k': 2, '100k-125k': 3, '125k-150k': 4, '150k+': 5}
remapping_data['income_groups'] = remapping_data['income_groups'].map(custom_map)
remapping_data.head()
Output: the first five rows of remapping_data, with income_groups holding the integer codes 1 to 5.
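One thing to watch with a dictionary-based `map` is that any value missing from the dictionary becomes NaN, so inconsistent spellings silently turn into missing data. The sketch below, on made-up data, shows this and one possible guard; the DataFrame `df`, the `income_code` column, and the upper-case entry are assumptions for illustration.

```python
import pandas as pd

custom_map = {"40k-75k": 1, "75k-100k": 2, "100k-125k": 3,
              "125k-150k": 4, "150k+": 5}

# Hypothetical toy data; the last value uses upper-case 'K' on purpose.
df = pd.DataFrame({"income_groups": ["40k-75k", "150k+", "75k-100k",
                                     "100k-125k", "125k-150k", "75K-100K"]})

# Mapping the raw column leaves the inconsistent value as NaN.
raw_codes = df["income_groups"].map(custom_map)
print(int(raw_codes.isna().sum()))

# Normalising case first lets every value match a dictionary key.
df["income_code"] = df["income_groups"].str.lower().map(custom_map)

# Any values still unmapped would show up here for manual review.
unmapped = df.loc[df["income_code"].isna(), "income_groups"].unique()
print(df["income_code"].tolist(), list(unmapped))
```

Checking for unmapped values right after the call to `map` is a cheap way to catch categories that were not anticipated in the dictionary.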
Similarly, different encodings can be applied according to the use case.
Handling Categorical Data in Python
Categorical data is a set of predefined categories or groups an observation can fall into. Categorical data can be found everywhere, for instance in survey responses such as marital status, profession, and educational qualifications. However, certain problems can arise with categorical data that must be dealt with before proceeding with any other task. This article discusses some of these problems and how to handle categorical data in a DataFrame.
As mentioned earlier, categorical data can only take a finite set of values. However, due to human error while filling out a survey form, or for any other reason, some bogus values could be found in the dataset.
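One possible way to spot such bogus values is to compare the column against the set of categories that are known to be valid. The sketch below assumes a blood type column and the eight standard ABO/Rh blood types as the valid set; the DataFrame `df` and its contents are made up for illustration.

```python
import pandas as pd

# Assumed set of valid categories: the eight standard ABO/Rh blood types.
allowed = {"A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"}

# Hypothetical toy data containing two invalid entries.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee", "Eve"],
    "blood_type": ["A+", "O-", "C+", "B-", "D-"],
})

# Values present in the data but absent from the valid set are bogus.
bogus_values = sorted(set(df["blood_type"]) - allowed)
print(bogus_values)

# Keep only the rows whose value is a valid category.
without_bogus_records = df[df["blood_type"].isin(allowed)].copy()
print(len(without_bogus_records))
```

Dropping the offending rows is only one option; depending on the use case, the invalid values could instead be corrected or set to missing.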