Hyperparameter Tuning to optimize Gradient Boosting Algorithm

Hyperparameters govern the learning process of a GBM, affecting its complexity, training time, and ability to generalize. Tuning them is important for good performance. We now apply three tuning methods to the Titanic dataset and compare each result against an untuned model.

Classification Model without Tuning

The following code trains a Gradient Boosting Classifier on the Titanic dataset to predict survival. It preprocesses the data, splits it into training and testing sets, and fits the model with scikit-learn's default hyperparameters, so no tuning is performed. The resulting accuracy serves as a baseline. Adjusting hyperparameters such as the learning rate, tree depth, and number of estimators could potentially improve on it.


# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load the Titanic dataset
titanic_data = pd.read_csv("train.csv")
# Let's do some basic preprocessing for simplicity
# Replace missing values and encode categorical variables
titanic_data.fillna(0, inplace=True)
titanic_data = pd.get_dummies(titanic_data, columns=['Sex', 'Embarked'], drop_first=True)
# Select features and target variable
X = titanic_data.drop(['Survived', 'PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
y = titanic_data['Survived']
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the Gradient Boosting model
gb_model = GradientBoostingClassifier()
# Fit the model to the training data
gb_model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = gb_model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
# Print the results
print(f"Accuracy: {accuracy}")


Accuracy: 0.7988826815642458
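The baseline script imports classification_report and confusion_matrix but only prints accuracy. The sketch below shows how those two metrics could be used to see which class the model gets wrong. It runs on synthetic data from make_classification as a stand-in for the preprocessed Titanic features; the names X_demo, y_demo, and baseline are illustrative and not part of the original code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data (assumption: stands in for the Titanic features)
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     n_informative=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

# Untuned model; random_state is fixed so repeated runs give the same result
baseline = GradientBoostingClassifier(random_state=42)
baseline.fit(X_tr, y_tr)
y_hat = baseline.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_hat)
acc = accuracy_score(y_te, y_hat)
print(cm)
# Per-class precision, recall, and F1
print(classification_report(y_te, y_hat))
print(f"Accuracy: {acc:.3f}")
```

The diagonal of the confusion matrix counts correct predictions, so accuracy equals the diagonal sum divided by the total number of test samples.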

Hyperparameter Tuning using Grid Search CV

In this code, GridSearchCV tunes the Gradient Boosting Classifier on the Titanic dataset. The parameter grid lists three values each for the number of estimators, the learning rate, and the maximum tree depth, giving 3 × 3 × 3 = 27 combinations. GridSearchCV evaluates every combination with 5-fold cross-validation on the training data (135 fits in total) and, by default, refits the best combination on the full training set. That refitted model is then used to make predictions on the test set.


# Import necessary libraries
from sklearn.model_selection import GridSearchCV
# Define the parameter grid for GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
}
# Initialize the Gradient Boosting model
gb_model = GradientBoostingClassifier()
# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=gb_model, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
# Fit the model to the training data using GridSearchCV
grid_search.fit(X_train, y_train)
# Get the best parameters and best model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_
# Make predictions on the test set using the best model
y_pred_best = best_model.predict(X_test)
# Evaluate the best model
accuracy_best = accuracy_score(y_test, y_pred_best)
# Print the results
print("Best Parameters:", best_params)
print(f"Best Model Accuracy: {accuracy_best}")


Best Parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200}
Best Model Accuracy: 0.8044692737430168
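Beyond the single best combination, a fitted GridSearchCV object also stores the cross-validated score of every combination in its cv_results_ attribute. The sketch below shows one way to rank them. It uses synthetic data and a deliberately small grid so that it runs quickly; X_demo, y_demo, small_grid, and search are illustrative names, not part of the original code.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption: stands in for the Titanic training split)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=5, random_state=0)

# A small 2 x 2 grid, chosen only to keep the example fast
small_grid = {"n_estimators": [25, 50], "max_depth": [2, 3]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      small_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

# One row per combination, with the mean and spread of the fold scores
results = pd.DataFrame(search.cv_results_)
ranking = results[["params", "mean_test_score", "std_test_score",
                   "rank_test_score"]].sort_values("rank_test_score")
print(ranking.to_string(index=False))
print("Best cross-validated accuracy:", search.best_score_)
```

Comparing mean_test_score with std_test_score helps judge whether the top-ranked combination is clearly better than the runners-up or only marginally so.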

Hyperparameter Tuning using Randomized Search CV

This code uses RandomizedSearchCV to tune the Gradient Boosting Classifier on the Titanic dataset. Instead of trying every combination, it draws a fixed number of random combinations (n_iter=10 here) from the candidate values for the number of estimators, learning rate, and maximum tree depth, and scores each one with 5-fold cross-validation. This lets the search cover a wider range of values than a grid for the same computational budget: the candidate values below span 5 × 10 × 5 = 250 combinations, of which only 10 are evaluated. The best parameters and the corresponding model are then used to make predictions on the test set.


# Import necessary libraries
from sklearn.model_selection import RandomizedSearchCV
import numpy as np
# Define the parameter distributions for RandomizedSearchCV
param_dist = {
    'n_estimators': np.arange(50, 251, 50),
    'learning_rate': np.linspace(0.01, 0.2, 10),
    'max_depth': np.arange(3, 8),
}
# Initialize the Gradient Boosting model
gb_model = GradientBoostingClassifier()
# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=gb_model, param_distributions=param_dist, n_iter=10,
                                   cv=5, scoring='accuracy', random_state=42, n_jobs=-1)
# Fit the model to the training data using RandomizedSearchCV
random_search.fit(X_train, y_train)
# Get the best parameters and best model
best_params_random = random_search.best_params_
best_model_random = random_search.best_estimator_
# Make predictions on the test set using the best model
y_pred_best_random = best_model_random.predict(X_test)
# Evaluate the best model
accuracy_best_random = accuracy_score(y_test, y_pred_best_random)
# Print the results
print("Best Parameters (Randomized Search):", best_params_random)
print(f"Best Model Accuracy (Randomized Search): {accuracy_best_random}")


Best Parameters (Randomized Search): {'n_estimators': 250, 'max_depth': 3, 'learning_rate': 0.09444444444444444}
Best Model Accuracy (Randomized Search): 0.8156424581005587
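The search above samples from fixed arrays of candidate values. RandomizedSearchCV also accepts distribution objects that expose an rvs method, such as those in scipy.stats, so continuous parameters like the learning rate need not be restricted to a preset list. The following is a minimal sketch of that variant on synthetic data; the ranges and the names dist and sampler_search are illustrative choices, not taken from the original code.

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (assumption: stands in for the Titanic training split)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=5, random_state=0)

dist = {
    "n_estimators": randint(20, 81),         # integers 20..80 (upper bound is exclusive)
    "learning_rate": loguniform(0.01, 0.2),  # continuous, sampled on a log scale
    "max_depth": randint(2, 5),              # integers 2..4
}
sampler_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=dist,
    n_iter=5,            # kept small so the example runs quickly
    cv=3,
    scoring="accuracy",
    random_state=42,
)
sampler_search.fit(X_demo, y_demo)
print(sampler_search.best_params_)
print(sampler_search.best_score_)
```

A log-uniform distribution gives small learning rates the same chance of being drawn as large ones, which is often a reasonable default for rate-like parameters.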

Hyperparameter Tuning using Optuna

In this code, Optuna is used to tune the Gradient Boosting Classifier on the Titanic dataset. The objective function defines the search space for the number of estimators, learning rate, and maximum depth, trains a model with early stopping (n_iter_no_change=5 on a 10% validation fraction), and returns 1 - accuracy. Because the study is created with direction='minimize', minimizing this value is equivalent to maximizing accuracy. Note that the objective scores each trial on the test set, so the accuracy reported at the end is optimistically biased; scoring trials with cross-validation on the training data would give a fairer estimate.


import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Define the objective function to be minimized
def objective(trial):
    # Define the search space for hyperparameters
    param_space = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 250, step=50),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.2),
        'max_depth': trial.suggest_int('max_depth', 3, 7),
    }
    # Initialize the Gradient Boosting model with early stopping
    gb_model = GradientBoostingClassifier(**param_space, validation_fraction=0.1, n_iter_no_change=5, random_state=42)
    # Fit the model to the training data
    gb_model.fit(X_train, y_train)
    # Make predictions on the test set
    y_pred = gb_model.predict(X_test)
    # Calculate accuracy as the objective to be minimized
    accuracy = accuracy_score(y_test, y_pred)
    return 1.0 - accuracy  # Optuna minimizes the objective, so we use 1 - accuracy
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a study and optimize the objective function
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=50)
# Get the best parameters and best model
best_params_optuna = study.best_params
best_model_optuna = GradientBoostingClassifier(**best_params_optuna, validation_fraction=0.1, n_iter_no_change=5, random_state=42)
best_model_optuna.fit(X_train, y_train)
# Make predictions on the test set using the best model
y_pred_best_optuna = best_model_optuna.predict(X_test)
# Evaluate the best model obtained through Optuna
accuracy_best_optuna = accuracy_score(y_test, y_pred_best_optuna)
print(f"Best Model Accuracy (Optuna): {accuracy_best_optuna}")


Best Model Accuracy (Optuna): 0.8324022346368715

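In conclusion, hyperparameter tuning improved the test accuracy of the Gradient Boosting classifier on the Titanic dataset, from about 0.799 for the untuned model to 0.804 with Grid Search CV, 0.816 with Randomized Search CV, and 0.832 with Optuna. These gains correspond to only a handful of additional correct predictions on a test set of 179 passengers, so they are indicative rather than conclusive.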

Gradient boosting machines (GBMs) are ensemble learning methods that perform well on a range of machine learning tasks, from regression to classification. They work by iteratively adding decision trees that correct the mistakes of their predecessors: each tree is fit to the errors left by the previous ones, gradually building a stronger collective predictor. In this article, we cover the fundamentals of gradient boosting and demonstrate how to tune the hyperparameters of a gradient boosting model.
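The idea of each tree correcting the errors of the previous ones can be sketched in a few lines for the simplest case, regression with squared error, where the errors are just the residuals. This is a simplified illustration on a made-up toy dataset, not how GradientBoostingClassifier is implemented internally; the names X_toy, y_toy, n_rounds, and mse_history are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem (assumption: noisy sine wave, chosen only for illustration)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1   # shrinks the contribution of each tree
n_rounds = 50         # number of trees to add

# Start from a constant prediction: the mean of the targets
prediction = np.full_like(y_toy, y_toy.mean())
mse_history = [np.mean((y_toy - prediction) ** 2)]
trees = []

for _ in range(n_rounds):
    # Residuals are what the current ensemble still gets wrong
    residuals = y_toy - prediction
    # Fit a shallow tree to the residuals rather than to the targets
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residuals)
    # Add a scaled-down version of the tree's correction to the ensemble
    prediction += learning_rate * tree.predict(X_toy)
    trees.append(tree)
    mse_history.append(np.mean((y_toy - prediction) ** 2))

print(f"Training MSE after 0 rounds:  {mse_history[0]:.4f}")
print(f"Training MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

The three quantities tuned throughout this article appear here directly: the number of rounds (n_estimators), the learning rate, and the depth of each tree (max_depth).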

