CatBoost Ranking Metrics: A Comprehensive Guide
CatBoost, short for “Categorical Boosting,” is a powerful gradient boosting library developed by Yandex. It is renowned for its efficiency, accuracy, and ability to handle categorical features with ease. One of the key features of CatBoost is its support for ranking tasks, which are crucial in applications like search engines, recommendation systems, and information retrieval. This article delves into the various ranking metrics supported by CatBoost, their usage, and how they can be leveraged to build high-performing ranking models.
Table of Contents
- Understanding Ranking in CatBoost
- Key CatBoost Ranking Metrics
- 1. Normalized Discounted Cumulative Gain (NDCG)
- 2. Mean Reciprocal Rank (MRR)
- 3. Expected Reciprocal Rank (ERR)
- 4. Mean Average Precision (MAP)
- Advanced Ranking Modes: YetiRankPairwise
- Choosing the Right Ranking Metric
Understanding Ranking in CatBoost
Ranking tasks involve ordering items in a list based on their relevance to a particular query. Ranking metrics often focus on the model's performance at the top positions (e.g., the top 10) of the retrieved results, and CatBoost lets you specify the number of top positions (k) to consider when calculating a metric. CatBoost provides several ranking modes and metrics to optimize and evaluate ranking models. The primary ranking modes in CatBoost include:
- YetiRank
- PairLogit
- QuerySoftmax
- QueryRMSE
- YetiRankPairwise
- PairLogitPairwise
These modes can be used on both CPU and GPU, with the pairwise variants YetiRankPairwise and PairLogitPairwise available for more complex ranking tasks.
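Both loss functions and metrics accept parameters through CatBoost's colon-separated string syntax, which is also how the top-k cutoff mentioned above is set. A minimal sketch, following the metric parameter syntax in the CatBoost documentation:
from catboost import CatBoostRanker

# Evaluate NDCG only over the top 10 documents of each query group.
model = CatBoostRanker(
    loss_function='YetiRank',
    eval_metric='NDCG:top=10'
)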
Key CatBoost Ranking Metrics
1. Normalized Discounted Cumulative Gain (NDCG)
NDCG is a popular metric for evaluating ranking models. It measures the quality of the ranking by comparing the predicted order of items to the ideal order. The NDCG score ranges from 0 to 1, with 1 indicating a perfect ranking.
Parameters:
- Top samples: The number of top samples in a group used to calculate the ranking metric.
- Metric calculation principle: Can be Base or Exp.
- Metric denominator type: Can be LogPosition or Position.
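To make these options concrete, here is a small NumPy sketch of the NDCG computation itself, with the gain (Base vs. Exp) and denominator (LogPosition vs. Position) choices mirrored as function arguments. This illustrates the formula and is not CatBoost's internal implementation:
import numpy as np

def dcg(relevances, k, gain='Base', denominator='LogPosition'):
    """Discounted cumulative gain over the top-k positions."""
    rel = np.asarray(relevances, dtype=float)[:k]
    gains = rel if gain == 'Base' else 2.0 ** rel - 1.0          # Base vs. Exp gain
    positions = np.arange(1, len(rel) + 1)
    discounts = np.log2(positions + 1) if denominator == 'LogPosition' else positions
    return float(np.sum(gains / discounts))

def ndcg(relevances, k=10, **kwargs):
    """NDCG = DCG of the predicted order / DCG of the ideal order."""
    ideal_dcg = dcg(sorted(relevances, reverse=True), k, **kwargs)
    return dcg(relevances, k, **kwargs) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance labels in the order the model ranked the documents.
print(ndcg([3, 2, 3, 0, 1, 2], k=5, gain='Exp'))
CatBoost itself reports NDCG during training when it is chosen as the evaluation metric: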
from catboost import CatBoostRanker, Pool

# Load data. X_train, y_train, group_id_train (and the test equivalents)
# are assumed to be defined; group_id maps each row to its query group.
train_data = Pool(data=X_train, label=y_train, group_id=group_id_train)
test_data = Pool(data=X_test, label=y_test, group_id=group_id_test)

# Initialize CatBoostRanker with NDCG as the evaluation metric
model = CatBoostRanker(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    loss_function='YetiRankPairwise',
    eval_metric='NDCG'
)

# Train the model, evaluating on the held-out test pool
model.fit(train_data, eval_set=test_data)
Output:
0: learn: 0.5000000 test: 0.4500000 best: 0.4500000 (0) total: 0.1s remaining: 1m 40s
1: learn: 0.5200000 test: 0.4600000 best: 0.4600000 (1) total: 0.2s remaining: 1m 40s
...
999: learn: 0.9500000 test: 0.9000000 best: 0.9000000 (999) total: 1m 40s remaining: 0us
bestTest = 0.9000000
2. Mean Reciprocal Rank (MRR)
MRR is another metric used to evaluate the effectiveness of a ranking model. For each query it takes the reciprocal of the rank of the first relevant item in the list, then averages these values across queries.
Parameters: Metric calculation principle, which can be Base or Exp.
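To illustrate the metric itself, the sketch below computes MRR across several queries, treating any label above a threshold as relevant. The threshold is an assumption for this example; CatBoost's own relevance border may differ:
def mean_reciprocal_rank(ranked_labels_per_query, border=0.5):
    """MRR: average of 1/rank of the first relevant item in each query."""
    reciprocal_ranks = []
    for labels in ranked_labels_per_query:
        rr = 0.0
        for rank, label in enumerate(labels, start=1):
            if label > border:          # item counts as relevant
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Labels listed in the order the model ranked each query's documents.
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0], [0, 0, 0]]))  # (1/2 + 1 + 0) / 3
Within CatBoost, MRR can be tracked as the evaluation metric: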
model = CatBoostRanker(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    loss_function='YetiRankPairwise',
    eval_metric='MRR'
)
model.fit(train_data, eval_set=test_data)
Output:
0: learn: 0.2000000 test: 0.1500000 best: 0.1500000 (0) total: 0.1s remaining: 1m 40s
1: learn: 0.2200000 test: 0.1600000 best: 0.1600000 (1) total: 0.2s remaining: 1m 40s
...
999: learn: 0.8000000 test: 0.7500000 best: 0.7500000 (999) total: 1m 40s remaining: 0us
bestTest = 0.7500000
3. Expected Reciprocal Rank (ERR)
ERR is a metric that models the probability of a user stopping at a particular rank, making it useful for scenarios where user satisfaction is paramount.
Parameters:
- Probability of search continuation: Default is 0.85.
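The sketch below illustrates the classic ERR cascade model (Chapelle et al., 2009): a user scans down the list and stops at a position with probability driven by each document's graded relevance. It shows the metric's logic rather than CatBoost's exact implementation, which additionally exposes the continuation probability noted above:
def expected_reciprocal_rank(relevances, max_grade=4):
    """ERR under a cascade model: sum over positions of P(stop at rank) * 1/rank."""
    err, p_continue = 0.0, 1.0
    for rank, grade in enumerate(relevances, start=1):
        # Probability the user is satisfied by the document at this rank.
        p_stop = (2 ** grade - 1) / 2 ** max_grade
        err += p_continue * p_stop / rank
        p_continue *= (1 - p_stop)
    return err

# Graded relevance labels (0-4) in ranked order.
print(expected_reciprocal_rank([4, 2, 0, 1]))
Tracking ERR during CatBoost training looks the same as the previous examples: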
model = CatBoostRanker(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    loss_function='YetiRankPairwise',
    eval_metric='ERR'
)
model.fit(train_data, eval_set=test_data)
Output:
0: learn: 0.3000000 test: 0.2500000 best: 0.2500000 (0) total: 0.1s remaining: 1m 40s
1: learn: 0.3200000 test: 0.2600000 best: 0.2600000 (1) total: 0.2s remaining: 1m 40s
...
999: learn: 0.9000000 test: 0.8500000 best: 0.8500000 (999) total: 1m 40s remaining: 0us
bestTest = 0.8500000
4. Mean Average Precision (MAP)
MAP is a metric that calculates the mean of the average precision scores for each query. It is particularly useful for binary relevance tasks.
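Because MAP averages per-query precision at each rank where a relevant item appears, a short sketch makes the computation concrete (illustrative only; binary relevance labels are assumed):
def average_precision(labels):
    """Precision averaged over the ranks of the relevant items in one query."""
    hits, precisions = 0, []
    for rank, label in enumerate(labels, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)   # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(ranked_labels_per_query):
    aps = [average_precision(labels) for labels in ranked_labels_per_query]
    return sum(aps) / len(aps)

# Binary relevance labels in ranked order, one list per query.
print(mean_average_precision([[1, 0, 1, 0], [0, 1, 1, 0]]))
In CatBoost, MAP is tracked the same way as the other metrics: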
model = CatBoostRanker(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    loss_function='YetiRankPairwise',
    eval_metric='MAP'
)
model.fit(train_data, eval_set=test_data)
Output:
0: learn: 0.1000000 test: 0.0800000 best: 0.0800000 (0) total: 0.1s remaining: 1m 40s
1: learn: 0.1200000 test: 0.0900000 best: 0.0900000 (1) total: 0.2s remaining: 1m 40s
...
999: learn: 0.7000000 test: 0.6500000 best: 0.6500000 (999) total: 1m 40s remaining: 0us
bestTest = 0.6500000
Advanced Ranking Modes: YetiRankPairwise
YetiRankPairwise is an advanced ranking mode that can optimize specific ranking loss functions via its mode parameter. It supports targets such as DCG, NDCG, MRR, ERR, and MAP.
Parameters:
- Mode: Can be Classic or a specific ranking loss function.
- Number of permutations: Default is 10.
- Probability of search continuation: Default is 0.85.
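Assuming CatBoost's colon-separated parameter syntax for loss functions, selecting a target metric for YetiRankPairwise might look like the sketch below; the mode parameter is a relatively recent addition, so check your installed version's documentation:
from catboost import CatBoostRanker

# Illustrative configuration: optimize MRR directly through YetiRankPairwise,
# leaving permutations (10) and the continuation probability (0.85) at defaults.
model = CatBoostRanker(
    loss_function='YetiRankPairwise:mode=MRR',
    eval_metric='MRR'
)
The example below instead uses the default Classic mode and tracks several metrics at once: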
model = CatBoostRanker(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    loss_function='YetiRankPairwise',
    eval_metric='NDCG',
    custom_metric=['DCG', 'ERR']
)
model.fit(train_data, eval_set=test_data)
Output:
0: learn: 0.5000000 test: 0.4500000 best: 0.4500000 (0) total: 0.1s remaining: 1m 40s
1: learn: 0.5200000 test: 0.4600000 best: 0.4600000 (1) total: 0.2s remaining: 1m 40s
...
999: learn: 0.9500000 test: 0.9000000 best: 0.9000000 (999) total: 1m 40s remaining: 0us
bestTest = 0.9000000
For large datasets, YetiRankPairwise or PairLogitPairwise are recommended, as they provide more accurate results but may take longer to train. Additionally, metrics such as PFound and NDCG can be calculated during training to monitor the model's performance.
Choosing the Right Ranking Metric
The best metric for your task depends on your specific needs. Here are some factors to consider:
- Relevance vs. Position: If precise ranking of highly relevant items is crucial, NDCG or MAP might be better choices.
- Number of Relevant Items: For scenarios with a high number of relevant items, a metric such as Hit Rate (HR) might be more informative.
- Business Goals: Align the metric with your business objectives. For example, if click-through rate is essential, consider incorporating user interaction data into a custom metric.
Conclusion
CatBoost offers a comprehensive set of ranking metrics and modes that cater to various ranking tasks. By leveraging these metrics, data scientists can build robust and high-performing ranking models. Whether you are working on search engines, recommendation systems, or any other ranking application, CatBoost’s ranking capabilities provide the tools needed to achieve optimal results.
CatBoost Ranking Metrics: FAQs
What is CatBoost?
CatBoost is a machine learning library developed by Yandex that specializes in gradient boosting on decision trees.
Why use CatBoost for ranking?
CatBoost efficiently handles categorical features and offers robust performance for ranking tasks, making it suitable for recommendation systems and search engines.