CatBoost Classification Metrics

When it comes to machine learning, classification is a fundamental task that involves predicting a categorical label or class based on a set of input features. One of the most popular and efficient algorithms for classification is Catboost, a gradient boosting library developed by Yandex.

Catboost is known for its speed, accuracy, and ease of use, making it a favorite among data scientists and machine learning practitioners. However, to fully leverage the power of Catboost, it’s essential to understand the various metrics used to evaluate the performance of classification models.

In this article, we’ll delve into the world of Catboost classification metrics, exploring what they are, how they work, and how to interpret them.

Table of Contents

  • What are Classification Metrics?
  • Common Catboost Classification Metrics
  • How to Interpret and Implement CatBoost Classification Metrics?
  • Choosing the Right Metric
  • Best Practices for Using Catboost Classification Metrics

What are Classification Metrics?

Classification metrics are used to evaluate the performance of a classification model by comparing its predictions with the actual labels or classes. These metrics provide insights into the model’s accuracy, precision, recall, and other aspects of its performance. In CatBoost, classification metrics are calculated during the training process and can be used to tune hyperparameters, select the best model, and identify areas for improvement.

Common Catboost Classification Metrics

The most commonly used performance metrics for assessing classification models are described below:

1. Binary Log Loss

Binary log loss (also called binary cross-entropy) measures how good the predicted probabilities are in a binary classification problem by penalizing confident but wrong predictions. A lower log loss indicates better model performance. Unlike accuracy, it remains informative on imbalanced datasets because it considers the predicted probability for every sample rather than only counting correct predictions.

[Tex]\text{Binary Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_{i} \log(p_{i}) + (1 - y_{i}) \log(1 - p_{i}) \right] [/Tex]

where y_i is the true label of sample i (0 or 1), p_i is the predicted probability that sample i belongs to the positive class, and N is the number of samples.
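
As a minimal illustration (the labels and probabilities below are made up), the formula can be evaluated directly with NumPy and cross-checked against scikit-learn's log_loss:

Python

import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true labels and predicted positive-class probabilities
y_true = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Direct implementation of the binary log loss formula above
manual_log_loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# scikit-learn's implementation for comparison
print(manual_log_loss, log_loss(y_true, p))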

2. Accuracy

Accuracy is one of the most popular metrics for classification: it is the ratio of correct predictions to the total number of predictions. However, accuracy alone is not sufficient to identify all of a model's weaknesses, especially when the model is trained on imbalanced data. Its components are defined below, followed by a short computation sketch.

[Tex]Accuracy = \frac {TP + TN} {TP + TN + FP + FN} [/Tex]

where,

  • True Positives (TP): The number of positive instances that the model correctly predicted as positive.
  • True Negatives (TN): The number of negative instances that the model correctly predicted as negative.
  • False Positives (FP): The number of negative instances that the model incorrectly classified as positive (Type I errors).
  • False Negatives (FN): The number of positive instances that the model incorrectly classified as negative (Type II errors).
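
As a small sketch with made-up labels, accuracy can be computed either directly from these four counts or with scikit-learn's accuracy_score:

Python

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical binary labels and predictions
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels {0, 1}, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)

print(manual_accuracy, accuracy_score(y_true, y_pred))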

3. Precision

Precision indicates how many of the items the model labeled as positive are actually positive: it is the ratio of true positives to all positive predictions. High precision means that when the model predicts a positive, it is correct most of the time (a combined computation sketch for precision, recall, and F1-score follows the F1-score formula below).

[Tex]\text{Precision} = \frac{TP}{TP + FP} [/Tex]

4. Recall

Recall, also known as sensitivity or the true positive rate, represents the percentage of actual positive items that the model correctly identifies: it is the ratio of true positives to all actual positives. High recall means the model captures most of the positive cases.

[Tex]\text{Recall} = \frac{TP}{TP + FN} [/Tex]

5. F1-Score

The F1-score is the harmonic mean of precision and recall, balancing the two so that a model must do well on both to achieve a high score. It is given by the formula:

[Tex]F1\text{-}Score = \frac{2 \times Precision \times Recall}{Precision + Recall} [/Tex]
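
A minimal sketch (with made-up labels) showing that the precision, recall, and F1-score formulas above agree with scikit-learn's precision_score, recall_score, and f1_score:

Python

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

# Hypothetical binary labels and predictions
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)                          # TP / (TP + FP)
recall = tp / (tp + fn)                             # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))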

6. AUC-ROC

AUC-ROC measures the model's ability to separate the classes: it is the Area Under the Receiver Operating Characteristic (ROC) Curve, which plots the true positive rate (TPR) against the false positive rate (FPR) at different decision thresholds. It is given by the formula (a short computation sketch follows the interpretation list below):

[Tex]AUC = \int_{0}^{1} TPR(FPR) \, d(FPR) [/Tex]

Interpretation of AUC:

  • AUC = 1: Perfect separability; the model distinguishes every positive instance from every negative one without error.
  • AUC = 0.5: The model performs no better than random guessing; it does not differentiate between the two classes.
  • AUC < 0.5: The model performs worse than random guessing and is more likely to rank a negative instance above a positive one.
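
As a rough sketch with made-up scores, the integral above is simply the area under the ROC curve; it can be approximated from the points returned by roc_curve (using scikit-learn's trapezoidal auc helper) and compared with roc_auc_score:

Python

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, auc

# Hypothetical binary labels and predicted positive-class scores
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.3, 0.6, 0.8, 0.2, 0.55, 0.7, 0.1])

# Points on the ROC curve (FPR on the x-axis, TPR on the y-axis)
fpr, tpr, _ = roc_curve(y_true, scores)

# Trapezoidal approximation of the integral of TPR with respect to FPR
manual_auc = auc(fpr, tpr)

print(manual_auc, roc_auc_score(y_true, scores))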

7. Kappa

Cohen's kappa measures the agreement between the predicted labels and the actual labels while correcting for the agreement that would be expected by chance alone. The closer kappa is to 1, the better the agreement between the predictions and the ground-truth labels.

[Tex]\kappa = \frac{P_o - P_e}{1 - P_e} [/Tex]

where P_o is the observed agreement and P_e is the agreement expected by chance.
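
A short sketch (made-up labels) computing kappa from the observed and expected agreement and checking it against scikit-learn's cohen_kappa_score:

Python

import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Hypothetical binary labels and predictions
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)
n = cm.sum()
p_o = np.trace(cm) / n                                   # observed agreement
p_e = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / n ** 2   # agreement expected by chance
kappa = (p_o - p_e) / (1 - p_e)

print(kappa, cohen_kappa_score(y_true, y_pred))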

8. Confusion Matrix

A confusion matrix is a tool for evaluating how well a classifier's predictions match the actual labels. It is a square table whose rows correspond to the actual classes and whose columns correspond to the predicted classes; for a binary classification problem it contains the four key components described earlier (TP, TN, FP, FN).
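
A minimal sketch of building a confusion matrix with scikit-learn for a three-class problem (the labels below are made up); the diagonal holds correct predictions and the off-diagonal cells hold misclassifications:

Python

import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical 3-class labels (rows = actual class, columns = predicted class)
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 1])

print(confusion_matrix(y_true, y_pred))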

How to Interpret and Implement CatBoost Classification Metrics?

Interpreting Catboost classification metrics requires a deep understanding of the problem domain and the goals of the project. Here are some general guidelines:

  • High accuracy and F1-score indicate that the model is performing well overall.
  • High precision and low recall suggest that the model is conservative in its predictions, missing some true positives.
  • High recall and low precision indicate that the model is aggressive in its predictions, resulting in more false positives.
  • High AUC-ROC indicates that the model is good at distinguishing between positive and negative classes.
  • Low log loss (cross-entropy) indicates that the model assigns high probabilities to the correct classes, i.e., its predictions are both accurate and well calibrated.

Let's walk through an example of computing CatBoost classification metrics on the Iris dataset, which contains flower measurements for three species.

To implement Catboost classification metrics in your project, follow these steps:

  • Train a CatBoost model on your dataset.
  • Predict the target variable on held-out data using the trained model.
  • Evaluate the predictions with accuracy_score, precision_score, recall_score, f1_score, and roc_auc_score to obtain accuracy, precision, recall, F1-score, and AUC-ROC respectively (a sketch of CatBoost's built-in metric tracking follows this list).
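
Before the full example, note that CatBoost can also track metrics itself during training through its eval_metric and custom_metric parameters together with an eval_set. The sketch below is a minimal illustration of that mechanism on the Iris data; the specific metric names and hyperparameter values are just one reasonable choice, not the only way to configure it:

Python

from catboost import CatBoostClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Ask CatBoost to log extra metrics on the held-out set at every iteration
model = CatBoostClassifier(
    iterations=50,
    learning_rate=0.1,
    eval_metric='Accuracy',                 # metric used for model selection / overfitting detection
    custom_metric=['Accuracy', 'TotalF1'],  # additional metrics recorded during training
    verbose=False,
)
model.fit(X_train, y_train, eval_set=(X_test, y_test))

# Per-iteration metric values recorded by CatBoost on the validation set
print(model.get_evals_result()['validation'].keys())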

Implement the CatBoost Algorithm

Python

from sklearn.datasets import load_iris
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, cohen_kappa_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train CatBoost model
model = CatBoostClassifier(iterations=50, learning_rate=0.1, eval_metric='AUC')  # Adjust hyperparameters as needed
model.fit(X_train, y_train)

# Predict class labels and class probabilities on the test set
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)

Calculate Catboost Classification Metrics

Python

import numpy as np

# Calculate evaluation metrics
metrics = {}
metrics['Accuracy'] = accuracy_score(y_test, y_pred)
metrics['Precision'] = precision_score(y_test, y_pred, average='macro')  # Macro averaging for imbalanced data
metrics['Recall'] = recall_score(y_test, y_pred, average='macro')        # Macro averaging for imbalanced data
metrics['F1 Score'] = f1_score(y_test, y_pred, average='macro')          # Macro averaging for imbalanced data
metrics['Kappa'] = cohen_kappa_score(y_test, y_pred)

# Display metrics
print('Metrics:')
for metric, value in metrics.items():
    print(f'{metric}: {value}')

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:')
print(np.array2string(conf_matrix, suppress_small=True))

Output:

Metrics:
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0
Kappa: 1.0
Confusion Matrix:
[[10 0 0]
[ 0 9 0]
[ 0 0 11]]

Visualize AUC Graph

Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Get unique class labels
class_labels = np.unique(y)

# Plot a one-vs-rest ROC curve for each class
plt.figure(figsize=(8, 6))
for i, label in enumerate(class_labels):
    fpr, tpr, _ = roc_curve(y_test == label, y_pred_proba[:, i])
    roc_auc = roc_auc_score(y_test == label, y_pred_proba[:, i])
    plt.plot(fpr, tpr, label=f'Class {label} (AUC-ROC={roc_auc:.4f})')

plt.legend()
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('ROC Curves for Iris Flower Classification (CatBoost)')
plt.grid(True)
plt.xlim(0, 1)
plt.ylim(0, 1.05)
plt.show()

Output:

AUC-ROC graph

Choosing the Right Metric

The choice of metric depends on your problem’s specific characteristics:

  • Imbalanced Datasets: Precision, recall, and F1-score are more informative than accuracy when one class is much more frequent than others.
  • Probabilistic Predictions: Logloss is suitable when your model outputs probabilities instead of hard class labels.
  • Ranking Ability: AUC is ideal when you need to assess how well your model ranks instances.

Best Practices for Using Catboost Classification Metrics

  • Use a combination of metrics to get a comprehensive view of the model’s performance.
  • Monitor metrics during training to identify overfitting or underfitting.
  • Tune hyperparameters based on the metrics to improve the model’s performance.
  • Use metrics to select the best model from a set of candidates.
  • Interpret metrics in the context of the problem domain to ensure that the model is meeting the project’s goals.

Conclusion

CatBoost classification metrics are essential for evaluating the performance of classification models and identifying areas for improvement. By understanding the different metrics, including accuracy, precision, recall, F1-score, AUC-ROC, log loss, Cohen's kappa, and the confusion matrix, data scientists and machine learning practitioners can develop more accurate and effective models. Remember to use a combination of metrics, monitor them during training, and interpret them in the context of the problem domain to get the most out of CatBoost.


