Target Encoding in CatBoost
Target encoding, sometimes referred to as mean encoding, replaces each category of a categorical feature with a statistic of the target variable, typically its mean. CatBoost uses a more advanced variant known as ordered target encoding.
Each category is replaced by the mean target value for that category.
- Example: for a binary target and a feature with categories “A”, “B”, and “C”:
- Category A: mean(target|A)
- Category B: mean(target|B)
- Category C: mean(target|C)
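As a quick sketch of plain (non-ordered) mean target encoding, the following standalone Python snippet computes these per-category means; the toy feature and target values are made up for illustration:

```python
from collections import defaultdict

def mean_target_encode(categories, targets):
    """Replace each category with the mean target value of that category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    return [means[cat] for cat in categories]

# Toy example with a binary target
cats = ["A", "B", "A", "C", "B", "A"]
ys = [1, 0, 0, 1, 1, 1]
print(mean_target_encode(cats, ys))
# category A -> 2/3, B -> 1/2, C -> 1
```

Note that this naive version uses the current row's own target when computing its category mean, which is exactly the target leakage that CatBoost's ordered variant is designed to avoid.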
CatBoost uses a variant of target encoding called “ordered encoding” to avoid target leakage. Ordered encoding calculates the target statistics for a categorical feature based on the observed history, i.e., only from the rows (observations) before the current one. This approach mimics time series data validation and helps prevent overfitting.
Steps in Ordered Encoding
- TargetCount: Sum of the target values over the preceding observations that share the current categorical feature value.
- Prior: A constant value, computed as the sum of target values in the entire dataset divided by the total number of observations.
- FeatureCount: Number of preceding observations with the same categorical feature value.
The encoded value for a category is calculated using the formula:
Encoded Value = (TargetCount + Prior) / (FeatureCount + 1)
To reduce the variance of the statistics for the first few observations in a permutation, CatBoost uses multiple random permutations of the data and averages the target statistics across these permutations.
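The steps above can be sketched in plain Python. This is a simplified illustration of the idea, not CatBoost's actual implementation; the function and variable names are my own:

```python
import random
from collections import defaultdict

def ordered_target_encode(categories, targets, n_permutations=10, seed=0):
    """Ordered target encoding: each row's statistic uses only the rows that
    precede it in a random permutation, and the result is averaged over
    several permutations to reduce variance for early rows."""
    n = len(categories)
    prior = sum(targets) / n  # global target mean used as the prior
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        target_count = defaultdict(float)  # sum of targets seen so far, per category
        feature_count = defaultdict(int)   # rows seen so far, per category
        for idx in order:
            cat = categories[idx]
            # Encoded Value = (TargetCount + Prior) / (FeatureCount + 1)
            totals[idx] += (target_count[cat] + prior) / (feature_count[cat] + 1)
            # Only now add the current row to the observed "history"
            target_count[cat] += targets[idx]
            feature_count[cat] += 1
    return [t / n_permutations for t in totals]

cats = ["A", "B", "A", "C", "B", "A"]
ys = [1, 0, 0, 1, 1, 1]
print(ordered_target_encode(cats, ys))
```

Because a row's own target never enters its statistic, the first occurrence of a category in a permutation is encoded purely from the prior, which is why averaging over permutations matters.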
CatBoost’s Categorical Encoding: One-Hot vs. Target Encoding
CatBoost is a powerful gradient boosting algorithm that excels in handling categorical data. It incorporates unique methods for encoding categorical features, including one-hot encoding and target encoding. Understanding these encoding techniques is crucial for effectively utilizing CatBoost in machine learning tasks.
In real-world datasets, we often deal with categorical data. The cardinality of a categorical feature, i.e., the number of distinct values it can take, varies drastically among features and datasets, from just a few to thousands or even millions of distinct values. The values of a categorical feature may be distributed almost uniformly, or their frequencies may differ by orders of magnitude. CatBoost supports traditional methods of categorical data preprocessing, such as one-hot encoding and frequency encoding. However, one of the signatures of this package is its original solution for categorical feature encoding.
Table of Contents
- One-Hot Encoding in CatBoost
- Target Encoding in CatBoost
- Implementing One-hot encoding and Target encoding in CatBoost
- 1. Implementing One-Hot Encoding in CatBoost
- 2. Demonstrating Target Encoding in CatBoost
- Advantages and Disadvantages of One-Hot Encoding and Target Encoding