Encoding Categorical Data in Python

Some learning algorithms, such as regression models and neural networks, require numeric input, so categorical data must be converted to numbers before it can be used with them. Let us look at some common encoding methods.

Label Encoding in Python

Label encoding assigns each category an integer from 0 to num_categories - 1. Let us apply label encoding to the blood type feature.

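Python3

from sklearn.preprocessing import LabelEncoder

# Fit the encoder on the blood type column and replace it with integer codes
le = LabelEncoder()
without_bogus_records['blood_type'] = le.fit_transform(
    without_bogus_records['blood_type'])
# Inspect the distinct codes that now appear in the column
without_bogus_records['blood_type'].unique()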


Output:

array([0, 4, 1, 3, 2, 5, 7, 6])
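
The integer codes alone do not say which blood type each number stands for. As a minimal sketch, assuming a small hypothetical DataFrame named df (not the article's dataset), the fitted encoder's classes_ attribute and inverse_transform method can be used to recover the mapping:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({'blood_type': ['A+', 'O-', 'B+', 'AB+', 'A+', 'O-']})

le = LabelEncoder()
df['blood_type_code'] = le.fit_transform(df['blood_type'])

# classes_ holds the original labels in sorted order; each label's
# position in this array is the code assigned to it
code_to_label = dict(enumerate(le.classes_))

# inverse_transform converts the codes back to the original labels
decoded = le.inverse_transform(df['blood_type_code'])

print(code_to_label)
print(df)
```

Because the codes simply follow the sorted order of the labels, they identify a category but carry no other meaning.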

One-hot Encoding in Python

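Label encoding assigns integers to categories, which can lead a model to treat the codes as if they had a meaningful order or magnitude. One-hot encoding avoids this by creating a separate binary column for each category. Let us apply it to the marriage status feature.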

Python3

import pandas as pd

# Replace marriage_status with one indicator column per category
inconsistent_data = pd.get_dummies(inconsistent_data,
                                   columns=['marriage_status'])
inconsistent_data.head()
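
To see what one-hot encoding produces, here is a minimal sketch on a hypothetical three-row DataFrame (the names and values are made up for illustration). Passing dtype=int is an optional choice made here so that the indicator columns hold 0/1 values rather than booleans:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cy'],
    'marriage_status': ['married', 'unmarried', 'divorced'],
})

# Each category in marriage_status becomes its own 0/1 column;
# the original column is dropped and other columns are left unchanged
encoded = pd.get_dummies(df, columns=['marriage_status'], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one indicator column set to 1, so no ordering between categories is implied.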


Ordinal Encoding in Python

Categorical data can be ordinal, meaning the order of the categories matters. For such features, the encoding should preserve that order. We will perform ordinal encoding on the income groups, preserving the order 40k-75k < 75k-100k < 100k-125k < 125k-150k < 150k+.

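Python3

# Map each income group to an integer that preserves the intended order
custom_map = {'40k-75k': 1, '75k-100k': 2, '100k-125k': 3,
              '125k-150k': 4, '150k+': 5}
# Labels missing from custom_map become NaN after .map()
remapping_data['income_groups'] = (
    remapping_data['income_groups'].map(custom_map))
remapping_data.head()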
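
One pitfall with a hand-written map is that any label not present among its keys silently becomes NaN. The sketch below, on hypothetical data, shows a label with inconsistent capitalization failing to match, along with one possible fix (lowercasing the labels before mapping). The column name income_code is an assumption made for illustration:

```python
import pandas as pd

custom_map = {'40k-75k': 1, '75k-100k': 2, '100k-125k': 3,
              '125k-150k': 4, '150k+': 5}

# Hypothetical toy data; the last label uses an uppercase K
df = pd.DataFrame({'income_groups': ['40k-75k', '100k-125k', '150k+',
                                     '75k-100k', '100K-125K']})

# Direct mapping: the unmatched label becomes NaN
raw_codes = df['income_groups'].map(custom_map)
unmatched = df.loc[raw_codes.isna(), 'income_groups'].tolist()

# Normalizing the case first lets every label match a key
df['income_code'] = df['income_groups'].str.lower().map(custom_map)

print(unmatched)
print(df)
```

Checking for NaN right after mapping is a cheap way to catch labels that the map does not cover.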


Similarly, different encodings can be applied according to the use case.


