Target Encoding

  • A TargetEncoder from the category_encoders library is initialized with cols=col_names[:-1], so every feature column is encoded and the last column (‘Class’, the target) is excluded.
  • An OrdinalEncoder (oe) is also initialized to map the target variable (‘Class’) to integer codes, which gives the target encoder a numeric value to average.
  • oe is fitted on the training target and applied to both targets, producing y_train_oe and y_test_oe.
  • Copies of the training and testing feature sets (X_train and X_test) are created to preserve the original data.
  • The fit_transform method of the TargetEncoder is applied to the training copy (X_train_te) together with the encoded target (y_train_oe): the encoder learns a per-category statistic based on the mean of the encoded target and replaces each training category with it.
  • The transform method is then applied to the testing copy (X_test_te), reusing the statistics learned from the training data. The test target is not passed here, because test labels must not influence the features.
  • The head() method is used to display the first few rows of the transformed training feature set (X_train_te).
Python3
import category_encoders as ce

# Target-encode every feature column (all names except the last, 'Class')
target_encoder = ce.TargetEncoder(cols=col_names[:-1])
# Map the class labels to integer codes so a per-category mean can be computed
oe = ce.OrdinalEncoder(cols=["Class"])
y_train_oe = oe.fit_transform(y_train)
y_test_oe = oe.transform(y_test)
# Work on copies so the original feature sets stay untouched
X_train_te = X_train.copy()
X_test_te = X_test.copy()
# Fit on the training data only; the test set reuses the learned statistics
X_train_te = target_encoder.fit_transform(X_train_te, y_train_oe)
X_test_te = target_encoder.transform(X_test_te)
X_train_te.head()


Output:


      Cost      Maintenance  Doors     Persons   Luggage boot  Safety
107   1.168639  1.159292     1.466667  1.596950  1.522777      1.738462
901   1.521127  1.159292     1.397661  1.627907  1.295896      1.513100
1709  1.684814  1.688623     1.466667  1.000000  1.522777      1.738462
706   1.264706  1.517045     1.450867  1.000000  1.421397      1.513100
678   1.264706  1.517045     1.397661  1.000000  1.421397      1.000000
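
To make the arithmetic behind these numbers concrete, here is a minimal sketch of plain mean target encoding in pure Python. It is not the category_encoders implementation: library encoders typically also smooth each category mean toward the global mean, so their output can differ from these raw means. The helper names (fit_target_means, apply_target_means) and the toy data are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def fit_target_means(categories, targets):
    """Learn the per-category mean of the target and the global mean (training data only)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    global_mean = sum(targets) / len(targets)
    return means, global_mean


def apply_target_means(categories, means, global_mean):
    """Replace each category with its training mean; unseen categories get the global mean."""
    return [means.get(cat, global_mean) for cat in categories]


# Hypothetical toy data: one feature and an integer-coded class (1 or 2)
safety_train = ["low", "low", "med", "high", "high", "high"]
class_train = [1, 1, 1, 2, 2, 1]
safety_test = ["high", "low", "unknown"]  # "unknown" never appears in training

# Statistics come from the training split only
means, global_mean = fit_target_means(safety_train, class_train)
encoded_train = apply_target_means(safety_train, means, global_mean)
# The test split is encoded without ever looking at its labels
encoded_test = apply_target_means(safety_test, means, global_mean)

print(means)
print(encoded_test)
```

The test features are encoded from the training means alone, which mirrors calling transform without a target. Falling back to the global mean for an unseen category is one common choice, assumed here rather than taken from the original code.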



