Implementing One-hot encoding and Target encoding in CatBoost

  1. Install CatBoost: If not already installed, use the command pip install catboost.
  2. Prepare Data: Create a pandas DataFrame with your dataset.
  3. Specify Categorical Features: Use the cat_features parameter to indicate which features are categorical.
  4. Train the Model: Initialize the CatBoost model with the necessary parameters and train it using the fit method.
  5. Evaluate the Model: Use the predict method to evaluate the model on the validation set and print the predictions.

1. Implementing One-Hot Encoding in CatBoost

One-Hot Encoding Example: The feature ‘feature1’ with categories [‘Red’, ‘Green’, ‘Blue’] will be one-hot encoded because its number of unique values (3) does not exceed the threshold set by one_hot_max_size=3. The predictions are based on the transformed binary vectors for the categorical feature.

Python
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

# Toy dataset
data = {
    'feature1': ['Red', 'Green', 'Blue', 'Green', 'Red', 'Blue'],
    'feature2': [1, 2, 3, 4, 5, 6],
    'target': [0, 1, 0, 1, 0, 1]
}

# Prepare data
df = pd.DataFrame(data)
X = df[['feature1', 'feature2']]
y = df['target']

# Split the data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Specify categorical features
cat_features = ['feature1']

# Initialize and train CatBoost model
model = CatBoostClassifier(iterations=100, depth=3, learning_rate=0.1, cat_features=cat_features, one_hot_max_size=3)
model.fit(X_train, y_train, eval_set=(X_val, y_val), verbose=False)

# Model evaluation
predictions = model.predict(X_val)
print(predictions)

Output:

[0 1]

Here, the model is predicting the classes for the two samples in the validation set.

2. Demonstrating Target Encoding in CatBoost

Target Encoding Example: The feature ‘feature1’ with categories [‘A’, ‘B’, ‘C’] will use ordered target encoding. The encoding replaces each category with a statistic of the target values for that category (a smoothed mean in CatBoost), computed using only the preceding data points in a random permutation to avoid target leakage.

Python
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

# Toy dataset
data = {
    'feature1': ['A', 'B', 'C', 'A', 'B', 'C'],
    'feature2': [10, 20, 30, 40, 50, 60],
    'target': [1, 0, 1, 0, 1, 0]
}

# Prepare data
df = pd.DataFrame(data)
X = df[['feature1', 'feature2']]
y = df['target']

# Split the data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Specify categorical features
cat_features = ['feature1']

# Initialize and train CatBoost model
model = CatBoostClassifier(iterations=100, depth=3, learning_rate=0.1, cat_features=cat_features)
model.fit(X_train, y_train, eval_set=(X_val, y_val), verbose=False)

# Model evaluation
predictions = model.predict(X_val)
print(predictions)

Output:

[1 1]

In this case, the model is predicting the classes for the two samples in the validation set.

CatBoost’s Categorical Encoding: One-Hot vs. Target Encoding

CatBoost is a powerful gradient boosting algorithm that excels in handling categorical data. It incorporates unique methods for encoding categorical features, including one-hot encoding and target encoding. Understanding these encoding techniques is crucial for effectively utilizing CatBoost in machine learning tasks.

In real-world datasets, we quite often deal with categorical data. The cardinality of a categorical feature, i.e. the number of distinct values it can take, varies drastically among features and datasets, from just a few to thousands or even millions of distinct values. The values of a categorical feature may be distributed almost uniformly, or some values may occur orders of magnitude more frequently than others. CatBoost supports some traditional methods of categorical data preprocessing, such as one-hot encoding and frequency encoding. However, one of the signatures of this package is its original solution for encoding categorical features.
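Frequency encoding, mentioned above, simply replaces each category with how often it occurs in the column. A quick illustrative sketch with pandas (the toy values here are assumptions for demonstration):

```python
import pandas as pd

colors = pd.Series(['Red', 'Green', 'Red', 'Blue', 'Red', 'Green'])

# Replace each value with its relative frequency in the column:
# 'Red' appears 3/6 of the time, 'Green' 2/6, 'Blue' 1/6
freq = colors.map(colors.value_counts(normalize=True))
print(freq.tolist())
```

Unlike one-hot encoding, this keeps a single numeric column regardless of cardinality, though distinct categories with equal frequencies become indistinguishable.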

Table of Contents

  • One-Hot Encoding in CatBoost
  • Target Encoding in CatBoost
  • Implementing One-hot encoding and Target encoding in CatBoost
    • 1. Implementing One-Hot Encoding in CatBoost
    • 2. Demonstrating Target Encoding in CatBoost
  • Advantages and Disadvantages of One-Hot Encoding and Target Encoding

One-Hot Encoding in CatBoost

One-hot encoding is a common technique used to convert categorical variables into a format that can be provided to machine learning algorithms. In one-hot encoding, each category is represented as a binary vector, where only one element is “1” (indicating the presence of the category) and all other elements are “0”.
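The binary-vector format is easy to see with pandas (shown here as an illustration of the idea, not of CatBoost's internal implementation):

```python
import pandas as pd

colors = pd.Series(['Red', 'Green', 'Blue', 'Green'], name='feature1')

# Each category becomes its own 0/1 column; exactly one is 1 per row
one_hot = pd.get_dummies(colors, dtype=int)
print(one_hot)
```

Row 0 (‘Red’) gets a 1 in the Red column and 0 elsewhere, and so on for the other rows.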

Target Encoding in CatBoost

Target encoding, sometimes referred to as mean encoding, replaces each categorical value with the mean of the target variable for that category. CatBoost uses a more advanced variation known as ordered target encoding.

Advantages and Disadvantages of One-Hot Encoding and Target Encoding

One-Hot Encoding:

  • Advantage: Simple and effective for categorical features with a small number of unique values.
  • Disadvantage: Can lead to high-dimensional data and is not suitable for features with many unique values.

Target Encoding:

  • Advantage: Captures the relationship between categorical features and the target variable, and handles high-cardinality features effectively.
  • Disadvantage: Prone to overfitting if not implemented correctly; requires careful handling to avoid target leakage.
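The dimensionality trade-off is easy to see with a quick comparison (the synthetic user-ID feature here is an illustrative assumption):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# A high-cardinality categorical feature: up to 1000 distinct user IDs
users = pd.Series(rng.integers(0, 1000, size=5000).astype(str), name='user_id')
target = pd.Series(rng.integers(0, 2, size=5000), name='target')

# One-hot: one column per distinct value observed
n_one_hot_cols = pd.get_dummies(users).shape[1]

# Target (mean) encoding: still a single numeric column
mean_encoded = target.groupby(users).transform('mean')

print(n_one_hot_cols)      # close to 1000 columns
print(mean_encoded.shape)  # (5000,) — one column
```

With roughly a thousand one-hot columns versus one target-encoded column, it is clear why target encoding is preferred for high-cardinality features, provided leakage is handled as noted above.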

Conclusion

CatBoost’s ability to handle categorical data directly through one-hot encoding and target encoding makes it a versatile tool for machine learning tasks. One-hot encoding is suitable for features with a small number of unique values, while target encoding is effective for high-cardinality features. By leveraging these encoding techniques, CatBoost enhances model performance and generalization, making it a valuable asset in data preprocessing and machine learning.
