Implementation of Regularization Parameters in CatBoost
Let’s implement CatBoost with various regularization parameters in Python.
Importing Libraries
Python3
import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
- pandas as pd: Used for data manipulation.
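- CatBoostClassifier: From catboost, this class is used to build the classification model.
- Pool: From catboost, this class is the data container that holds the features and labels used for training and evaluation.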
- train_test_split: From Scikit-Learn, this function is used to split the dataset into training and testing sets.
- accuracy_score: This function from Scikit-Learn computes the accuracy classification score, which measures the accuracy of the classification model.
Dataset Loading and Splitting
We load the diabetes prediction dataset from a CSV file. It contains 8 features (BMI, insulin level, age, etc.) and a target variable, Outcome, which indicates whether the patient has diabetes. We separate the features from the target, then split the data into training and testing sets using train_test_split, with 80% of the data used for training and 20% for testing. Setting random_state fixes the shuffle, so the split is reproducible.
Python3
# Load the dataset
df = pd.read_csv('diabetes.csv')

# Separate features and target variable
X = df.drop('Outcome', axis=1)
y = df['Outcome']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
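To make the role of test_size and random_state concrete, here is a minimal pure-Python sketch of a seeded 80/20 shuffle-and-split. The helper name seeded_split is hypothetical, and this is not scikit-learn's implementation; it only illustrates that a fixed seed yields the same partition on every run.

```python
import random

def seeded_split(rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed, then cut off a test fraction."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)      # same seed -> same order
    n_test = round(len(rows) * test_size)     # size of the held-out set
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(10))                        # stand-in for 10 data rows
train_a, test_a = seeded_split(rows)
train_b, test_b = seeded_split(rows)
print(len(train_a), len(test_a))              # 8 2
print(test_a == test_b)                       # True
```

Changing the seed changes which rows land in the test set, which is why reported scores on small datasets can shift from one seed to another.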
Creating CatBoost Pools
Python3
# Creating a CatBoost Pool for training and testing data
train_pool = Pool(data=X_train, label=y_train)
test_pool = Pool(data=X_test, label=y_test)
CatBoost operates on a data structure called Pool. Here, we create train_pool and test_pool to efficiently handle the training and testing data.
Defining CatBoost Parameters
Python3
# Defining CatBoost parameters with regularization
params = {
    'depth': 6,                     # Depth of the trees
    'learning_rate': 0.1,           # Learning rate of the model
    'l2_leaf_reg': 3,               # Coefficient of the L2 regularization term
    'rsm': 0.8,                     # Random subspace method: fraction of features per split selection
    'iterations': 100,              # Number of boosting iterations
    'loss_function': 'MultiClass',  # Loss function used for training
    'eval_metric': 'Accuracy',      # Evaluation metric
    'random_seed': 42               # Random seed for reproducibility
}
We define a dictionary params containing parameters for CatBoost.
- depth: It controls the maximum depth of the trees in the ensemble.
- learning_rate: Step size shrinkage used to prevent overfitting. Lower values make the model more robust but require more boosting rounds.
- l2_leaf_reg: The coefficient of the L2 regularization term in the cost function. It penalizes large leaf values, so higher settings produce a more conservative model.
- rsm: The random subspace method. It specifies the fraction of features considered at each split selection, with the subset drawn again at random each time.
- iterations: It represents the maximum number of trees added to the model. Increasing the number of iterations allows the model to learn more complex patterns in the data, but also raises the risk of overfitting.
- loss_function: It specifies the loss function optimized during training. ‘MultiClass’ is designed for multi-class classification problems; since Outcome has only two classes here, ‘Logloss’ is the more conventional choice for this dataset.
- eval_metric: It defines the metric used to evaluate the model’s performance during training. ‘Accuracy’ is used in this case, which measures the proportion of correctly classified instances.
- random_seed: Setting a random seed makes the random processes in the algorithm reproducible: running the same code with the same seed gives the same results.
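The effect of l2_leaf_reg and rsm can be illustrated with a small pure-Python sketch. It assumes a simplified Newton-style leaf estimate, value = -sum(gradients) / (sum(hessians) + l2_leaf_reg), and plain random sampling of feature indices. Both helpers (leaf_value, sample_features) are hypothetical names, the numbers are toy values, and the code simplifies what CatBoost does internally.

```python
import random

def leaf_value(gradients, hessians, l2_leaf_reg):
    """Simplified regularized leaf estimate: larger l2_leaf_reg shrinks it toward 0."""
    return -sum(gradients) / (sum(hessians) + l2_leaf_reg)

def sample_features(n_features, rsm, rng):
    """Pick a random subset holding roughly a fraction `rsm` of the features."""
    k = max(1, round(n_features * rsm))
    return sorted(rng.sample(range(n_features), k))

grads = [-0.5, -0.4, -0.6]   # toy per-sample gradients in one leaf
hess = [0.25, 0.24, 0.24]    # toy per-sample second derivatives

# The leaf value shrinks as the regularization coefficient grows.
for reg in (0, 3, 10):
    print(reg, round(leaf_value(grads, hess, reg), 3))

rng = random.Random(42)
subset = sample_features(n_features=8, rsm=0.8, rng=rng)
print(len(subset))           # 6 of the 8 features are considered
```

In this simplified picture, a larger l2_leaf_reg limits how far any single tree can move the prediction, while a smaller rsm decorrelates the trees by hiding some features at each split.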
Training the Model
Python3
# Training the CatBoost model
model = CatBoostClassifier(**params)
model.fit(train_pool, eval_set=test_pool, verbose=50)
A CatBoostClassifier is instantiated with the specified parameters, and it’s trained using the fit() method. The train_pool is used as the training data, and eval_set is set to test_pool for validation during training. verbose=50 specifies that training progress will be printed every 50 iterations.
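Because an eval_set is supplied, the validation metric is available at every iteration, which is what makes overfitting detection possible. The sketch below shows patience-based stopping in plain Python on a made-up list of validation losses. The helper best_iteration is a hypothetical name and the numbers are illustrative, not output from this model. CatBoost offers comparable behavior through the early_stopping_rounds argument of fit().

```python
def best_iteration(eval_losses, patience):
    """Return (best_index, stop_index) for patience-based early stopping."""
    best_idx, best_loss = 0, float("inf")
    for i, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_idx, best_loss = i, loss     # new best validation loss
        elif i - best_idx >= patience:
            return best_idx, i                # no improvement for `patience` rounds
    return best_idx, len(eval_losses) - 1     # never triggered: ran to the end

# Illustrative validation losses: improve, then drift upward (overfitting).
losses = [0.69, 0.60, 0.55, 0.52, 0.53, 0.54, 0.56, 0.58]
best, stopped = best_iteration(losses, patience=3)
print(best, stopped)                          # 3 6
```

Stopping at the best validation iteration acts as a further form of regularization, since it caps the number of trees before the model starts fitting noise.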
Evaluation of the Model
Python3
# Making predictions on the test data
predictions = model.predict(test_pool)

# Calculating accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: {:.2f}%".format(accuracy * 100))
Output:
Accuracy: 78.57%
The trained model is used to make predictions on the test data. The predictions are then compared with the actual labels, and accuracy is calculated using accuracy_score().
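As a rough illustration of what accuracy_score() computes, the following pure-Python sketch counts matching labels and divides by the total. The function name accuracy and the toy labels are assumptions made for illustration; they are not the diabetes data or scikit-learn's implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # toy labels, not the diabetes data
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print("Accuracy: {:.2f}%".format(accuracy(y_true, y_pred) * 100))  # Accuracy: 75.00%
```

Accuracy treats every error equally, so on an imbalanced medical dataset it is worth checking it alongside metrics such as precision and recall.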
CatBoost Regularization parameters
CatBoost, developed by Yandex, is a powerful open-source gradient boosting library designed to tackle categorical feature handling and deliver high-performance machine learning models. It stands out for its ability to handle categorical variables natively, without requiring extensive preprocessing. This feature simplifies the workflow and preserves valuable information, making it an attractive choice for real-world applications.