Cross-Validation

Cross-validation is a machine learning technique for evaluating a model's performance and ensuring that the estimate does not depend unduly on a single training-test split of the data. The dataset is divided into several subsets; the model is trained and tested on different combinations of these subsets, and the results are averaged to obtain a more reliable estimate of how the model will perform on unseen data.
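To make the idea concrete, here is a minimal sketch using scikit-learn's cross_val_score helper on the Iris dataset (the 5-fold setting and the logistic-regression model are arbitrary choices for illustration):

Python3

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Load a small benchmark dataset
X, y = load_iris(return_X_y=True)

# Train and evaluate the model on 5 different train/validation splits,
# then average the per-fold scores for a more stable estimate
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print('Fold accuracies:', scores)
print('Mean accuracy:', scores.mean())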

There are several cross-validation methods. The most popular ones are:

K-Fold Cross-Validation

K-Fold Cross-Validation is an essential method in machine learning for assessing and optimizing model performance. It helps diagnose overfitting and underfitting by systematically partitioning a dataset into K subsets, known as "folds." In each iteration, one fold serves as the validation set while the remaining K-1 folds are used as training data, so each of the K folds acts as the test set exactly once. The K scores are then averaged (or otherwise combined) to give a reliable estimate of the model's performance.

K-Fold Cross-Validation has a number of benefits. It makes the most of the available data for both training and validation, producing more accurate performance estimates, and because it assesses the model on several different data subsets, it helps reveal problems such as overfitting. On the other hand, it can be computationally demanding, especially for large datasets or high values of K. Despite this cost, K-Fold Cross-Validation remains a standard tool for checking that machine learning models generalize well to new data.
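To see the fold mechanics directly, the short sketch below prints the train/validation indices that scikit-learn's KFold produces on a toy array (the 3-fold setting and the 6-sample array are arbitrary illustration choices):

Python3

import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(6).reshape(6, 1)  # 6 dummy samples

# Each of the 3 folds serves as the validation set exactly once;
# the other folds form the training data for that round
for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=3).split(X_toy)):
    print(f'Fold {fold}: train={train_idx}, validation={val_idx}')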

Implementation of K-Fold Cross-Validation

Python3
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
 
# Load the Iris dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
 
# Number of folds
n_splits = 5
 
# Create a KFold cross-validator
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
 
# Initialize a list to store model performance metrics
metrics = []
 
# Define LightGBM hyperparameters
params = {
    'objective': 'multiclass',
    'num_class': 3,  # Number of classes in the Iris dataset
    'metric': 'multi_logloss',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
}
 
# Perform K-Fold Cross-Validation
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y[train_index], y[test_index]
 
    # Create LightGBM datasets for training and testing
    train_data = lgb.Dataset(X_train, label=y_train)
    test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)
 
    # Train a LightGBM model
    num_round = 100
    bst = lgb.train(params, train_data, num_round)
 
    # Make predictions on the test set
    y_pred = bst.predict(X_test)
 
    # Get the class with the highest predicted probability as the predicted label
    y_pred_labels = np.argmax(y_pred, axis=1)
 
    # Calculate accuracy and store it in the metrics list
    accuracy = accuracy_score(y_test, y_pred_labels)
    metrics.append(accuracy)
 
# Calculate the average accuracy across all folds
average_accuracy = np.mean(metrics)
print(f'Average Accuracy: {average_accuracy:.4f}')


Output:

Average Accuracy: 0.9600

The code above uses K-Fold cross-validation with the LightGBM framework to evaluate a multiclass classification model on the Iris dataset. The dataset is loaded and split into feature variables (X) and target labels (y), and a KFold cross-validator with five folds is created with shuffling enabled for a robust evaluation. For each fold, the model is trained on the training subset and makes predictions on the test subset; the class with the highest predicted probability is taken as the predicted label. Accuracy is computed for every fold, and the average accuracy across folds provides an overall measure of the model's classification performance on the Iris dataset.
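Note that LightGBM also provides a built-in cross-validation helper, lgb.cv, which runs the fold loop internally. The sketch below shows one way to use it for the same setup; the exact keys of the returned dictionary vary between LightGBM versions, so treat the final print as an assumption to check against your installed version:

Python3

import lightgbm as lgb
from sklearn.datasets import load_iris

data = load_iris()
train_data = lgb.Dataset(data.data, label=data.target)

params = {
    'objective': 'multiclass',
    'num_class': 3,
    'metric': 'multi_logloss',
    'verbosity': -1,
}

# lgb.cv trains and validates across nfold splits internally and
# returns the per-iteration mean/std of the chosen metric
cv_results = lgb.cv(params, train_data, num_boost_round=100, nfold=5)
print({key: values[-1] for key, values in cv_results.items()})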

Stratified K-Fold Cross-Validation

Stratified K-Fold Cross-Validation is used for classification problems. To reduce evaluation bias, it ensures that each fold has a class-label distribution close to that of the overall dataset.

Every fold in a stratified K-Fold split preserves roughly the same class distribution as the dataset as a whole. This is especially helpful in classification problems where imbalanced class distributions could otherwise bias the model assessment. Because it maintains class proportions in every fold, it yields a more accurate estimate of a model's performance and is a reliable basis for selecting hyperparameters and evaluating generalization capacity. The strategy is most frequently used when the target variable has an uneven class distribution, so that each fold remains representative of the whole dataset.

In short, Stratified K-Fold Cross-Validation is the standard variant of K-Fold for classification problems: by matching each fold's label distribution to that of the full dataset, it reduces bias and gives a more faithful evaluation of the model's performance.
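The effect is easy to verify: the sketch below prints the class counts in each validation fold produced by scikit-learn's StratifiedKFold, which stay close to the balanced 50/50/50 label split of the Iris dataset (a minimal demonstration; the 5-fold setting is arbitrary):

Python3

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)

# Each validation fold keeps roughly the same class proportions
# as the full dataset (for Iris, one third per class)
for fold, (_, val_idx) in enumerate(StratifiedKFold(n_splits=5).split(X, y)):
    print(f'Fold {fold} class counts: {np.bincount(y[val_idx])}')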

Implementation of Stratified K-Fold Cross-Validation

Python3
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
 
# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target
 
# Define hyperparameters for LightGBM
params = {
    'objective': 'multiclass',  # For multi-class classification
    'metric': 'multi_logloss',  # Logarithmic loss for multiclass
    'boosting_type': 'gbdt',
    'num_class': 3,             # Number of classes in the Iris dataset
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}
 
# Number of folds for stratified cross-validation
num_folds = 5
 
# Initialize StratifiedKFold
skf = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=42)
 
# Initialize an empty list to store cross-validation scores
cv_scores = []
 
# Perform stratified k-fold cross-validation
for train_index, val_index in skf.split(X, y):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]
     
    train_data = lgb.Dataset(X_train, label=y_train)
    val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)
     
    # Train LightGBM model with early stopping on the validation fold
    model = lgb.train(params, train_data, num_boost_round=1000,
                      valid_sets=[val_data],
                      callbacks=[lgb.early_stopping(stopping_rounds=50)])
     
    # Make predictions on the validation set
    val_pred = model.predict(X_val, num_iteration=model.best_iteration)
     
    # Convert predicted probabilities to class predictions
    val_pred_classes = np.argmax(val_pred, axis=1)
     
    # Calculate accuracy and store it in the list
    accuracy = accuracy_score(y_val, val_pred_classes)
    cv_scores.append(accuracy)
 
# Calculate the mean and standard deviation of accuracy across folds
mean_accuracy = np.mean(cv_scores)
std_accuracy = np.std(cv_scores)
 
print(f'Mean Accuracy: {mean_accuracy:.4f}')
print(f'Std Accuracy: {std_accuracy:.4f}')


Output:

Mean Accuracy: 0.9667
Std Accuracy: 0.0298

This code demonstrates Stratified K-Fold Cross-Validation using the gradient boosting framework LightGBM. First, the widely used Iris benchmark dataset is loaded and the model's multiclass classification hyperparameters are defined. The StratifiedKFold splitter divides the data into five subsets while preserving the balance of the class distribution. Inside the cross-validation loop, a LightGBM model is trained on each training set with early stopping to avoid overfitting, and accuracy is calculated from predictions on the corresponding validation set. Finally, the mean and standard deviation of the accuracy scores across folds are reported, giving a thorough assessment of the model's performance and capacity for generalization on the Iris dataset.
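For comparison, the same stratified evaluation can be written far more compactly through LightGBM's scikit-learn wrapper, LGBMClassifier, combined with cross_val_score (a minimal sketch: when given a classifier and an integer cv, scikit-learn applies stratified folds automatically):

Python3

from lightgbm import LGBMClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# With a classifier and an integer cv, cross_val_score uses
# StratifiedKFold under the hood
clf = LGBMClassifier(num_leaves=31, learning_rate=0.05, verbosity=-1)
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
print('Mean accuracy:', scores.mean())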

Leave-One-Out Cross-Validation (LOOCV)

Leave-One-Out Cross-Validation (LOOCV) is a resampling technique for evaluating how well machine learning models perform. It takes an exhaustive approach: in each round, a single data point is held out as the validation set and the remaining data is used for training, and this is repeated once per data point, so there are as many rounds as there are samples. By using every available observation for both training and validation, LOOCV provides a thorough assessment of a model's generalization. However, the computational cost can be high, particularly for large datasets. The final performance score is typically the average of the individual validation results, giving a measure of the model's predictive ability and robustness.

In other words, LOOCV is a cross-validation method in which each data point in the dataset serves as its own test set exactly once, while the model is trained on all of the remaining points. Although LOOCV offers a reliable assessment of model performance, it can be computationally costly, particularly for large datasets.
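As a quick sanity check on the mechanics, the sketch below confirms that LeaveOneOut produces one train/validation split per sample, which is equivalent to KFold with n_splits equal to the dataset size (a minimal illustration on the Iris dataset):

Python3

from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut

X, y = load_iris(return_X_y=True)

# One round per data point: 150 rounds for the 150-sample Iris dataset
loo = LeaveOneOut()
print('Number of LOOCV rounds:', loo.get_n_splits(X))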

Implementation of Leave-One-Out Cross-Validation (LOOCV)

Here’s how to use Python to implement LOOCV with LightGBM:

Python3
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneOut
 
# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target
 
# Initialize Leave-One-Out Cross-Validation
loo = LeaveOneOut()
 
# Initialize an empty list to store cross-validation scores
cv_scores = []
 
# Perform Leave-One-Out Cross-Validation
for train_index, val_index in loo.split(X):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]
     
    # Create and configure a LightGBM dataset for training
    train_data = lgb.Dataset(X_train, label=y_train)
     
    # Define hyperparameters for LightGBM
    params = {
        'objective': 'multiclass',
        'num_class': 3,
        'boosting_type': 'gbdt',
        'num_leaves': 5,
        'learning_rate': 0.05,
    }
     
    # Train LightGBM model
    model = lgb.train(params, train_data, num_boost_round=100)
     
    # Make predictions on the held-out sample (no early stopping was used,
    # so all boosting rounds are applied)
    val_pred = model.predict(X_val)
     
    # Get the predicted class (index of the highest probability)
    val_pred_class = np.argmax(val_pred, axis=1)
     
    # Calculate accuracy and store it in the list
    accuracy = accuracy_score(y_val, val_pred_class)
    cv_scores.append(accuracy)
 
# Calculate the mean and standard deviation of accuracy across folds
mean_accuracy = np.mean(cv_scores)
std_accuracy = np.std(cv_scores)
 
print(f'Mean Accuracy: {mean_accuracy:.4f}')
print(f'Std Accuracy: {std_accuracy:.4f}')


Output:

Mean Accuracy: 0.9533
Std Accuracy: 0.2109

This code sample illustrates Leave-One-Out Cross-Validation (LOOCV) with LightGBM on the Iris dataset. LOOCV iterates through the data, using each data point once as the validation set and training the model on the remaining points, with hyperparameters set for LightGBM's multiclass objective. Accuracy is recorded for each iteration, and the mean and standard deviation are computed across all rounds; since each fold contains a single sample, each fold's accuracy is either 0 or 1, so the mean is simply the fraction of correctly classified points (which also explains the relatively large standard deviation). By testing the model on every data point separately, this method ensures that each observation contributes to the evaluation and gives a thorough picture of the model's predictive ability and consistency in classifying Iris samples.

Cross-validation and Hyperparameter tuning of LightGBM Model

Machine learning models have become essential for solving challenging real-world problems in a variety of industries, including finance, healthcare, and marketing. Among the many machine learning algorithms, gradient boosting techniques have become extremely popular due to their remarkable predictive performance, and LightGBM (Light Gradient Boosting Machine) is now a first choice for many data scientists and machine learning practitioners thanks to its speed and efficiency.

This article examines LightGBM with an emphasis on cross-validation, hyperparameter tuning, and the deployment of a LightGBM-based application, using code examples throughout to clarify the ideas covered.
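Because hyperparameter tuning is itself driven by cross-validation, the two topics fit together naturally. The sketch below combines scikit-learn's GridSearchCV with LightGBM's LGBMClassifier; the parameter grid is an arbitrary illustration, not a recommended search space:

Python3

from lightgbm import LGBMClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Every parameter combination is scored with 5-fold cross-validation,
# and the best-scoring combination is refit on the full data
param_grid = {
    'num_leaves': [15, 31],
    'learning_rate': [0.05, 0.1],
}
search = GridSearchCV(LGBMClassifier(verbosity=-1), param_grid, cv=5)
search.fit(X, y)

print('Best parameters:', search.best_params_)
print('Best CV accuracy:', search.best_score_)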
