Implementing One-hot encoding and Target encoding in CatBoost

  1. Install CatBoost: If not already installed, use the command pip install catboost.
  2. Prepare Data: Create a pandas DataFrame with your dataset.
  3. Specify Categorical Features: Use the cat_features parameter to indicate which features are categorical.
  4. Train the Model: Initialize the CatBoost model with the necessary parameters and train it using the fit method.
  5. Evaluate the Model: Use the predict method to evaluate the model on the validation set and print the predictions.

1. Implementing One-Hot Encoding in CatBoost

One-Hot Encoding Example: The feature ‘feature1’ with categories [‘Red’, ‘Green’, ‘Blue’] will be one-hot encoded because its number of unique values (3) does not exceed the threshold set by one_hot_max_size=3. The predictions are based on the transformed binary vectors for the categorical feature.

Python
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

# Toy dataset
data = {
    'feature1': ['Red', 'Green', 'Blue', 'Green', 'Red', 'Blue'],
    'feature2': [1, 2, 3, 4, 5, 6],
    'target': [0, 1, 0, 1, 0, 1]
}

# Prepare data
df = pd.DataFrame(data)
X = df[['feature1', 'feature2']]
y = df['target']

# Split the data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Specify categorical features
cat_features = ['feature1']

# Initialize and train CatBoost model
model = CatBoostClassifier(iterations=100, depth=3, learning_rate=0.1, cat_features=cat_features, one_hot_max_size=3)
model.fit(X_train, y_train, eval_set=(X_val, y_val), verbose=False)

# Model evaluation
predictions = model.predict(X_val)
print(predictions)

Output:

[0 1]

Here, the model is predicting the classes for the two samples in the validation set.

2. Demonstrating Target Encoding in CatBoost

Target Encoding Example: The feature ‘feature1’ with categories [‘A’, ‘B’, ‘C’] will use ordered target encoding. The encoding replaces each category with a statistic of the target values for that category (a smoothed mean in CatBoost), computed using only the preceding data points in a random permutation to avoid target leakage.

Python
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

# Toy dataset
data = {
    'feature1': ['A', 'B', 'C', 'A', 'B', 'C'],
    'feature2': [10, 20, 30, 40, 50, 60],
    'target': [1, 0, 1, 0, 1, 0]
}

# Prepare data
df = pd.DataFrame(data)
X = df[['feature1', 'feature2']]
y = df['target']

# Split the data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Specify categorical features
cat_features = ['feature1']

# Initialize and train CatBoost model
model = CatBoostClassifier(iterations=100, depth=3, learning_rate=0.1, cat_features=cat_features)
model.fit(X_train, y_train, eval_set=(X_val, y_val), verbose=False)

# Model evaluation
predictions = model.predict(X_val)
print(predictions)

Output:

[1 1]

In this case, the model is predicting the classes for the two samples in the validation set.

CatBoost’s Categorical Encoding: One-Hot vs. Target Encoding

CatBoost is a powerful gradient boosting algorithm that excels in handling categorical data. It incorporates unique methods for encoding categorical features, including one-hot encoding and target encoding. Understanding these encoding techniques is crucial for effectively utilizing CatBoost in machine learning tasks.

In real-world datasets, we quite often deal with categorical data. The cardinality of a categorical feature, i.e. the number of distinct values it can take, varies drastically among features and datasets, from just a few to thousands or even millions of distinct values. The values of a categorical feature may be distributed almost uniformly, or some values may occur orders of magnitude more frequently than others. CatBoost supports some traditional methods of categorical data preprocessing, such as one-hot encoding and frequency encoding. However, one of the signatures of this package is its original solution for encoding categorical features.
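Frequency encoding, mentioned above, simply replaces each category with how often it occurs in the column. A quick illustrative sketch with pandas (the toy values here are assumptions for demonstration):

```python
import pandas as pd

colors = pd.Series(['Red', 'Green', 'Red', 'Blue', 'Red', 'Green'])

# Replace each value with its relative frequency in the column:
# 'Red' appears 3/6 of the time, 'Green' 2/6, 'Blue' 1/6
freq = colors.map(colors.value_counts(normalize=True))
print(freq.tolist())
```

Unlike one-hot encoding, this keeps a single numeric column regardless of cardinality, though distinct categories with equal frequencies become indistinguishable.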

Table of Contents

  • One-Hot Encoding in CatBoost
  • Target Encoding in CatBoost
  • Implementing One-hot encoding and Target encoding in CatBoost
    • 1. Implementing One-Hot Encoding in CatBoost
    • 2. Demonstrating Target Encoding in CatBoost
  • Advantages and Disadvantages of One-Hot Encoding and Target Encoding

One-Hot Encoding in CatBoost

One-hot encoding is a common technique used to convert categorical variables into a format that can be provided to machine learning algorithms. In one-hot encoding, each category is represented as a binary vector, where only one element is “1” (indicating the presence of the category) and all other elements are “0”.
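The binary-vector format is easy to see with pandas (shown here as an illustration of the idea, not of CatBoost's internal implementation):

```python
import pandas as pd

colors = pd.Series(['Red', 'Green', 'Blue', 'Green'], name='feature1')

# Each category becomes its own 0/1 column; exactly one is 1 per row
one_hot = pd.get_dummies(colors, dtype=int)
print(one_hot)
```

Row 0 (‘Red’) gets a 1 in the Red column and 0 elsewhere, and so on for the other rows.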

Target Encoding in CatBoost

Target encoding, sometimes referred to as mean encoding, replaces each categorical value with the mean of the target variable for that category. CatBoost uses a more advanced variation known as ordered target encoding.

Advantages and Disadvantages of One-Hot Encoding and Target Encoding

One-Hot Encoding:

  • Advantage: Simple and effective for categorical features with a small number of unique values.
  • Disadvantage: Can lead to high-dimensional data and is not suitable for features with many unique values.

Target Encoding:

  • Advantage: Captures the relationship between categorical features and the target variable, and handles high-cardinality features effectively.
  • Disadvantage: Prone to overfitting if not implemented correctly; requires careful handling to avoid target leakage.
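The dimensionality trade-off is easy to see with a quick comparison (the synthetic user-ID feature here is an illustrative assumption):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# A high-cardinality categorical feature: up to 1000 distinct user IDs
users = pd.Series(rng.integers(0, 1000, size=5000).astype(str), name='user_id')
target = pd.Series(rng.integers(0, 2, size=5000), name='target')

# One-hot: one column per distinct value observed
n_one_hot_cols = pd.get_dummies(users).shape[1]

# Target (mean) encoding: still a single numeric column
mean_encoded = target.groupby(users).transform('mean')

print(n_one_hot_cols)      # close to 1000 columns
print(mean_encoded.shape)  # (5000,) — one column
```

With roughly a thousand one-hot columns versus one target-encoded column, it is clear why target encoding is preferred for high-cardinality features, provided leakage is handled as noted above.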

Conclusion

CatBoost’s ability to handle categorical data directly through one-hot encoding and target encoding makes it a versatile tool for machine learning tasks. One-hot encoding is suitable for features with a small number of unique values, while target encoding is effective for high-cardinality features. By leveraging these encoding techniques, CatBoost enhances model performance and generalization, making it a valuable asset in data preprocessing and machine learning.
