CatBoost Training, Recovering and Snapshot Parameters

CatBoost (short for Categorical Boosting) is a powerful open-source gradient boosting library known for its efficiency, accuracy, and ability to handle various data types, and it is well suited to classification, regression, and ranking tasks. This guide delves into the key concepts of CatBoost training, recovery from interruptions, and snapshot parameters for smooth training workflows.

Table of Contents

  • Training with CatBoost
  • Recovering Training Progress in CatBoost
  • Example 1: Training a CatBoostClassifier with Snapshot Saving and Resuming
  • Example 2: Regression with CatBoostRegressor Using Snapshot Mechanism
  • Monitoring and Evaluation

Training with CatBoost

Training a model with CatBoost involves several steps and parameters that must be configured to optimize performance. At its core, the process consists of feeding labeled data to the library and tuning hyperparameters so that the resulting model learns to predict the target variable. Key steps include (a minimal end-to-end sketch follows the list):

  • Importing necessary libraries (CatBoost, NumPy, pandas, etc.)
  • Loading and preprocessing your training data (handling missing values, encoding categorical features, etc.)
  • Splitting data into training and validation sets.
  • Defining the CatBoost model using the CatBoostClassifier or CatBoostRegressor class.
  • Specifying training parameters (learning rate, number of trees, loss function, etc.).
  • Training the model using the fit method on the training data.
  • Evaluating the model’s performance on the validation set using metrics like accuracy, precision, recall, or AUC-ROC.
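
As a minimal end-to-end sketch of this workflow (the dataset, parameter values, and metric below are illustrative choices, not CatBoost defaults or recommendations):

Python
from catboost import CatBoostClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small labeled dataset and split it into training and validation sets
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the model and its main training parameters
model = CatBoostClassifier(
    iterations=200,
    learning_rate=0.1,
    depth=6,
    loss_function='Logloss',
    verbose=50
)

# Train on the training set while monitoring the validation set
model.fit(X_train, y_train, eval_set=(X_val, y_val))

# Evaluate performance on the validation set
print('Validation accuracy:', accuracy_score(y_val, model.predict(X_val)))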

Recovering Training Progress in CatBoost

CatBoost provides mechanisms to recover training progress in case of interruptions, ensuring that the training process can be resumed without starting from scratch.

1. Recovery from Interruptions: CatBoost offers functionality to resume training after unexpected interruptions (e.g., power outages, system crashes).

  • Snapshotting: CatBoost periodically saves the current training state as snapshots. These snapshots contain intermediate model information, allowing for restarting training from the last saved point.
  • Recovering Training: If training is interrupted, you can specify the snapshot file path to resume training from that point instead of starting from scratch.

2. Snapshot Parameters: CatBoost provides several parameters to control the behavior of snapshots:

  • snapshot_interval: (int, default=600) The interval in seconds between saving snapshots. Snapshots are only written when save_snapshot is enabled.
  • snapshot_file: (str) The path of the file in which the training snapshot is stored and from which training is resumed.
  • task_type: (str, default='CPU') The device used for training ('CPU' or 'GPU'); snapshotting behavior might differ depending on the chosen type.
  • verbose: (int or bool) Controls the verbosity of the logging output. Set it to a positive value to print periodic progress information, including messages about snapshot creation.
  • save_snapshot: (bool, disabled by default) Enables snapshotting, allowing the training progress to be saved periodically to snapshot_file.

To recover training from a snapshot, call fit again with the same training parameters; CatBoost will detect the snapshot file and resume from the last saved state. This feature is particularly useful when training is cut short by time limits or system failures.
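
A minimal sketch of this pattern is shown below; the file name 'snapshot.bkp' and the synthetic dataset are illustrative choices, not defaults:

Python
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification

# Toy data just to make the sketch runnable
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Configure snapshotting: save the training state to 'snapshot.bkp' every 60 seconds
model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.03,
    save_snapshot=True,
    snapshot_file='snapshot.bkp',
    snapshot_interval=60,
    verbose=100
)

# First run: training starts from iteration 0 and writes snapshots as it goes.
# If the run is interrupted, re-running this exact fit() call with the same
# parameters makes CatBoost detect 'snapshot.bkp' and resume from the last
# saved iteration instead of starting over.
model.fit(X, y)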

Example 1: Training a CatBoostClassifier with Snapshot Saving and Resuming

In this example, we’ll train a CatBoostClassifier on the Iris dataset, save the model’s snapshots during training, and demonstrate how to resume training from a snapshot.

Step-by-Step Process

1. Install CatBoost:

pip install catboost

2. Load the Dataset and Prepare Data:

Python
from catboost import CatBoostClassifier, Pool
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load Iris dataset
data = load_iris()
X = data.data
y = data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

3. Initialize and Train the CatBoost Classifier:

Python
# Convert to CatBoost Pool
train_pool = Pool(X_train, y_train)
test_pool = Pool(X_test, y_test)

# Initialize CatBoost Classifier with snapshot parameters
model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.03,
    depth=6,
    loss_function='MultiClass',
    save_snapshot=True,
    snapshot_file='catboost_snapshot',
    snapshot_interval=60
)

# Train the model
model.fit(train_pool, eval_set=test_pool, verbose=100)

# Save the model
model.save_model('catboost_model')

# Output Predictions
predictions = model.predict(test_pool)
print(predictions)

Output:

0:    learn: 1.0835464    test: 1.0803546    best: 1.0803546 (0)    total: 50ms    remaining: 49.9s
100:    learn: 0.0213311    test: 0.0385356    best: 0.0385356 (100)    total: 1.24s    remaining: 10.9s
...
900:    learn: 0.0013542    test: 0.0383536    best: 0.0383536 (900)    total: 10.6s    remaining: 1.17s
999:    learn: 0.0011300    test: 0.0383546    best: 0.0383536 (900)    total: 11.7s    remaining: 0us

Snapshot files will be created periodically, with the state of the model saved.

4. Resume Training from Snapshot:

If training is interrupted, you can resume training using the snapshot file:

Python
# Re-initialize CatBoost Classifier with snapshot parameters
model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.03,
    depth=6,
    loss_function='MultiClass',
    save_snapshot=True,
    snapshot_file='catboost_snapshot',
    snapshot_interval=60
)

# Resume training: with save_snapshot=True and the same snapshot_file,
# CatBoost detects the existing snapshot and continues from the last saved state
model.fit(train_pool, eval_set=test_pool, verbose=100)

# Output Predictions
predictions = model.predict(test_pool)
print(predictions)

Output:

[1 0 2 1 1 0 1 2 0 1 1 2 1 0 2 0 0 0 1 2 0 1 2 0 2 1 2 2 2 2]

Example 2: Regression with CatBoostRegressor Using Snapshot Mechanism

In this example, we’ll train a CatBoostRegressor on the Boston Housing dataset, save snapshots during training, and produce predictions.

Step-by-Step Process

1. Install CatBoost:

pip install catboost

2. Load the Dataset and Prepare Data:

Python
from catboost import CatBoostRegressor, Pool
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

# Load the Boston Housing dataset
# Note: load_boston was removed in scikit-learn 1.2, so this example requires an
# older scikit-learn version (or a substitute regression dataset such as
# sklearn.datasets.fetch_california_housing)
data = load_boston()
X = data.data
y = data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

3. Initialize and Train the CatBoost Regressor:

Python
# Convert to CatBoost Pool
train_pool = Pool(X_train, y_train)
test_pool = Pool(X_test, y_test)

# Initialize CatBoost Regressor with snapshot parameters
model = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.03,
    depth=6,
    loss_function='RMSE',
    save_snapshot=True,
    snapshot_file='catboost_snapshot_reg',
    snapshot_interval=60
)

# Train the model
model.fit(train_pool, eval_set=test_pool, verbose=100)

# Save the model
model.save_model('catboost_model_reg')

# Output Predictions
predictions = model.predict(test_pool)
print(predictions)

Output:

0:    learn: 23.6140405    test: 23.5975405    best: 23.5975405 (0)    total: 50ms    remaining: 49.9s
100:    learn: 4.3912311    test: 5.4355656    best: 5.4355656 (100)    total: 1.24s    remaining: 10.9s
...
900:    learn: 2.3542361    test: 4.2355656    best: 4.2355656 (900)    total: 10.6s    remaining: 1.17s
999:    learn: 2.1342000    test: 4.0354546    best: 4.0353536 (900)    total: 11.7s    remaining: 0us

Snapshot files will be created periodically, with the state of the model saved.

4. Resume Training from Snapshot:

If training is interrupted, you can resume training using the snapshot file:

Python
# Re-initialize CatBoost Regressor with snapshot parameters
model = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.03,
    depth=6,
    loss_function='RMSE',
    save_snapshot=True,
    snapshot_file='catboost_snapshot_reg',
    snapshot_interval=60
)

# Resume training: with save_snapshot=True and the same snapshot_file,
# CatBoost detects the existing snapshot and continues from the last saved state
model.fit(train_pool, eval_set=test_pool, verbose=100)

# Output Predictions
predictions = model.predict(test_pool)
print(predictions)

Output:

[22.415 23.123 19.768 34.235 27.673 ...]

These examples illustrate how to set up and use CatBoost’s training, recovering, and snapshot parameters effectively. By following these steps, you can ensure that your training process is robust and can be resumed seamlessly in case of interruptions.

Monitoring and Evaluation

CatBoost provides various metrics and tools to monitor and evaluate the training process:

  • Metrics: Common metrics include accuracy, precision, recall, F1-score, ROC-AUC, and RMSE. These metrics help in assessing the model’s performance.
  • Overfitting Detector: The early_stopping_rounds parameter of fit enables the overfitting detector, which stops training once the evaluation metric has not improved for the specified number of iterations (see the sketch after this list).
  • Visualization: Tools for visualizing training parameters, feature importance, and overfitting help in understanding and optimizing the model.
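
The sketch below combines these tools on an illustrative dataset; the metric names and parameter values are example choices, not recommendations:

Python
from catboost import CatBoostClassifier, Pool
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Prepare an illustrative binary classification dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
train_pool = Pool(X_train, y_train)
val_pool = Pool(X_val, y_val)

# eval_metric drives the overfitting detector; custom_metric adds extra metrics to the training log
model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.05,
    eval_metric='AUC',
    custom_metric=['Precision', 'Recall'],
    verbose=100
)

# early_stopping_rounds stops training once the eval metric has not improved for 50 iterations
model.fit(train_pool, eval_set=val_pool, early_stopping_rounds=50)

# Feature importance helps interpret what the model learned
print(model.get_feature_importance(prettified=True))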

Conclusion

CatBoost offers a comprehensive set of features for efficient model training, including automatic handling of categorical features, built-in methods for handling missing values, and robust mechanisms for recovering training progress through snapshots. By leveraging these capabilities, users can build accurate and scalable machine learning models with ease. Despite its advantages, users should be aware of its limitations, such as memory consumption and training time, and consider these factors when choosing CatBoost for their projects.


