Implementing One-hot encoding and Target encoding in CatBoost
- Install CatBoost: If not already installed, use the command pip install catboost.
- Prepare Data: Create a pandas DataFrame with your dataset.
- Specify Categorical Features: Use the cat_features parameter to indicate which features are categorical.
- Train the Model: Initialize the CatBoost model with the necessary parameters and train it using the fit method.
- Evaluate the Model: Use the predict method to generate predictions on the validation set and print them.
1. Implementing One-Hot Encoding in CatBoost
One-Hot Encoding Example: The feature 'feature1' with categories ['Red', 'Green', 'Blue'] will be one-hot encoded because its number of unique values does not exceed the threshold set by one_hot_max_size=3. The predictions are based on the binary indicator vectors that replace the categorical feature.
import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split
# Create a small example dataset
data = {
    'feature1': ['Red', 'Green', 'Blue', 'Green', 'Red', 'Blue'],
    'feature2': [1, 2, 3, 4, 5, 6],
    'target': [0, 1, 0, 1, 0, 1]
}
# Prepare data
df = pd.DataFrame(data)
X = df[['feature1', 'feature2']]
y = df['target']
# Split the data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
# Specify categorical features
cat_features = ['feature1']
# Initialize and train CatBoost model
model = CatBoostClassifier(iterations=100, depth=3, learning_rate=0.1, cat_features=cat_features, one_hot_max_size=3)
model.fit(X_train, y_train, eval_set=(X_val, y_val), verbose=False)
# Model evaluation
predictions = model.predict(X_val)
print(predictions)
Output:
[0 1]
Here, the model is predicting the classes for the two samples in the validation set.
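To make the transformation concrete, the following is a minimal plain-Python sketch of what one-hot encoding does to 'feature1'. CatBoost performs this step internally, so the helper names one_hot_encode and uses_one_hot are hypothetical and exist only for illustration; the threshold check simply mirrors the one_hot_max_size rule described above.

```python
def one_hot_encode(values):
    """Map each category to a binary indicator vector (illustrative only)."""
    categories = sorted(set(values))  # fixed column order: alphabetical
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def uses_one_hot(values, one_hot_max_size):
    """Assumed rule: one-hot encode when the number of distinct values
    does not exceed the threshold."""
    return len(set(values)) <= one_hot_max_size


feature1 = ['Red', 'Green', 'Blue', 'Green', 'Red', 'Blue']
columns, encoded = one_hot_encode(feature1)

print(columns)
for value, row in zip(feature1, encoded):
    print(value, row)

# Three distinct values fit within a threshold of 3, but not within 2.
print(uses_one_hot(feature1, one_hot_max_size=3))
print(uses_one_hot(feature1, one_hot_max_size=2))
```

Each row contains exactly one 1, in the column that matches its category. Lowering the threshold below the number of distinct values would send the feature down the target-statistics path instead.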
2. Demonstrating Target Encoding in CatBoost
Target Encoding Example: The feature 'feature1' with categories ['A', 'B', 'C'] will use ordered target encoding, since one_hot_max_size is left at its default here. The encoding replaces each category with the mean target value for that category, computed using only the preceding data points to avoid data leakage.
import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split
# Create a small example dataset
data = {
    'feature1': ['A', 'B', 'C', 'A', 'B', 'C'],
    'feature2': [10, 20, 30, 40, 50, 60],
    'target': [1, 0, 1, 0, 1, 0]
}
# Prepare data
df = pd.DataFrame(data)
X = df[['feature1', 'feature2']]
y = df['target']
# Split the data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
# Specify categorical features
cat_features = ['feature1']
# Initialize and train CatBoost model
model = CatBoostClassifier(iterations=100, depth=3, learning_rate=0.1, cat_features=cat_features)
model.fit(X_train, y_train, eval_set=(X_val, y_val), verbose=False)
# Model evaluation
predictions = model.predict(X_val)
print(predictions)
Output:
[1 1]
In this case, the model is predicting the classes for the two samples in the validation set.
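The idea of "using only the preceding data points" can be sketched in plain Python. This is a simplified illustration, not CatBoost's actual implementation: it processes the rows in their given order (CatBoost works over random permutations of the data), and both the function name ordered_target_stats and the prior value of 0.5 are assumptions made for this example.

```python
def ordered_target_stats(categories, targets, prior=0.5):
    """Encode each row from the targets of earlier rows in the same category."""
    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running number of rows seen so far, per category
    encoded = []
    for cat, target in zip(categories, targets):
        seen_sum = sums.get(cat, 0)
        seen_count = counts.get(cat, 0)
        # Smoothed mean over preceding rows; a first occurrence gets the prior.
        encoded.append((seen_sum + prior) / (seen_count + 1))
        # Update the running statistics only after the current row is encoded,
        # so a row's own target never leaks into its encoding.
        sums[cat] = seen_sum + target
        counts[cat] = seen_count + 1
    return encoded


feature1 = ['A', 'B', 'C', 'A', 'B', 'C']
target = [1, 0, 1, 0, 1, 0]

encoded = ordered_target_stats(feature1, target)
print(encoded)  # [0.5, 0.5, 0.5, 0.75, 0.25, 0.75]
```

The first occurrence of each category receives only the prior, while the second occurrence reflects the single earlier target for that category. No row's encoding depends on its own label, which is what prevents the leakage mentioned above.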
CatBoost's Categorical Encoding: One-Hot vs. Target Encoding
CatBoost is a powerful gradient boosting algorithm that excels in handling categorical data. It incorporates unique methods for encoding categorical features, including one-hot encoding and target encoding. Understanding these encoding techniques is crucial for effectively utilizing CatBoost in machine learning tasks.
Real-world datasets often contain categorical data. The cardinality of a categorical feature, i.e. the number of distinct values it can take, varies drastically among features and datasets, from just a few values to thousands or millions. The values of a categorical feature can be distributed almost uniformly, or their frequencies can differ by orders of magnitude. CatBoost supports some traditional methods of categorical data preprocessing, such as One-hot Encoding and Frequency Encoding. However, one of the signature features of this package is its original solution for encoding categorical features.
Table of Contents
- One-Hot Encoding in CatBoost
- Target Encoding in CatBoost
- Implementing One-hot encoding and Target encoding in CatBoost
- 1. Implementing One-Hot Encoding in CatBoost
- 2. Demonstrating Target Encoding in CatBoost
- Advantages and Disadvantages of One-Hot Encoding and Target Encoding