Cross-validation on Digits Dataset in Scikit-learn

In this article, we discuss cross-validation and apply it to the digits dataset, then walk through a code implementation in scikit-learn.

What is Cross-Validation?

Cross-validation on the digits dataset lets us choose hyperparameters that generalize well instead of overfitting the training data. The procedure is essentially systematic trial and error: we compute the cross-validation score for each candidate parameter value and, after evaluating all of them, choose the best one. The same approach is used in commercial workflows.

The digits dataset in scikit-learn contains a copy of the test set of the UCI ML hand-written digits datasets: 1,797 grayscale images of 8x8 pixels, each labeled with a digit from 0 to 9. It is a small classification dataset that is well suited to beginners and to trying out various machine-learning algorithms, including CNNs.

Cross-validation is a technique in which we train our model on a subset of the data set and then evaluate it on the complementary subset. The three steps involved in cross-validation are as follows:

Reserve some portion of the sample data set.
Train the model using the rest of the data set.
Test the model using the reserved portion of the data set.
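The three steps above can be sketched with a single hold-out split. This is a minimal illustration, not part of the walkthrough below: the 20% test size, the stratified split, and the default RBF-kernel SVC are assumptions chosen only for the example.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Step 1: reserve a portion of the data (here 20%, an arbitrary choice).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Step 2: train the model on the rest of the data.
clf = SVC()  # default RBF kernel, used only for illustration
clf.fit(X_train, y_train)

# Step 3: test the model on the reserved portion.
holdout_accuracy = clf.score(X_test, y_test)
print(f"Hold-out accuracy: {holdout_accuracy:.3f}")
```

A single split like this gives only one score, which depends on which samples happened to land in the reserved portion. K-fold cross-validation addresses that by repeating the split.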

K-Fold Cross Validation: In this method, we split the data set into k subsets (known as folds). We train the model on k-1 of the folds and use the remaining fold to evaluate it. We repeat this k times, reserving a different fold for testing each time, and average the k scores.
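Written out by hand, the k-fold procedure might look like the loop below. It is a sketch under assumptions: k=5, a default SVC, and the variable names are illustrative, and scikit-learn's helper functions do the same work in one call.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Train on k-1 folds ...
    clf = SVC()
    clf.fit(X[train_idx], y[train_idx])
    # ... and evaluate on the one fold that was left out.
    fold_scores.append(clf.score(X[test_idx], y[test_idx]))

print("Per-fold accuracy:", np.round(fold_scores, 3))
print("Mean accuracy:", np.mean(fold_scores))
```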

Syntax

To perform K-fold cross-validation, we can use the cross_val_score function. Here is the syntax:

cross_val_score(model, X, y, cv=5)

model: The estimator that we want to fit on the data.
X: The feature data to fit.
y: The target labels to predict.
cv: The number of folds. When an integer is given and the estimator is a classifier, scikit-learn uses StratifiedKFold; otherwise it uses KFold.
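As a quick illustration of the call above, the snippet below scores a default SVC on the digits data with 5 folds. The choice of estimator is an assumption made for the example only.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# One accuracy value per fold; cv=5 with a classifier means stratified folds.
scores = cross_val_score(SVC(), X, y, cv=5)

print("Scores per fold:", scores)
print("Mean: %.3f, std: %.3f" % (scores.mean(), scores.std()))
```

The function returns a NumPy array with one score per fold, so the mean and standard deviation summarize both the expected performance and how much it varies across folds.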

We can also use GridSearchCV, which performs an exhaustive search over a parameter grid that we specify. It takes the following parameters:

GridSearchCV(model, param_grid, cv=kf, scoring='accuracy')

model: The estimator that we want to fit on the data.
param_grid: A dictionary mapping parameter names to lists of values; every combination of the provided values is tried.
cv: The cross-validation splitting strategy, here a KFold object.
scoring: The metric used to evaluate the cross-validated model on each held-out fold.

Performing K-Fold Cross Validation on the Dataset

Step 1: Import the libraries:

We import all the libraries required for the following steps. The code in this section demonstrates how to perform a grid search for tuning the hyperparameters of a Support Vector Machine (SVM) classifier using the scikit-learn library.

Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.datasets import load_digits
from sklearn.svm import SVC

Step 2: Load the digits dataset

The load_digits(return_X_y=True) function loads the handwritten digits dataset, assigning the feature matrix to X and the associated labels to y.

Python

X, y = load_digits(return_X_y=True)

Step 3: Define the parameter grid using numpy logspace

Python

param_grid = {'C': np.logspace(-5, 5, 10)}
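To see which candidate values this grid contains, the values can be printed directly. np.logspace(-5, 5, 10) returns 10 values spaced evenly on a logarithmic scale between 10^-5 and 10^5; the variable name C_values is only for this sketch.

```python
import numpy as np

# Ten values of C, evenly spaced in the exponent from -5 to 5.
C_values = np.logspace(-5, 5, 10)

for c in C_values:
    print(f"{c:.6g}")
```

Because the exponents step by 10/9, the values are not round powers of ten, which is why the best C reported later in the article is a number such as 0.278 rather than 0.1 or 1.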

Step 4: Define the KFold object in Sklearn and create an SVM classifier

This code snippet creates a KFold cross-validation object and instantiates a Support Vector Machine (SVM) classifier with a sigmoid kernel.

Python

svm = SVC(kernel="sigmoid")
kf = KFold(n_splits=5, shuffle=True, random_state=42)
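To check what a KFold object like this one produces, the following sketch iterates over its splits and records the size of each test fold, along with how often each sample is used for testing. The bookkeeping variables test_sizes and times_tested are illustrative names, not part of the article's code.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold

X, y = load_digits(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

test_sizes = []
times_tested = np.zeros(len(X), dtype=int)
for train_idx, test_idx in kf.split(X):
    test_sizes.append(len(test_idx))
    times_tested[test_idx] += 1  # count how often each sample is held out

print("Test fold sizes:", test_sizes)
print("Every sample tested exactly once:", bool(np.all(times_tested == 1)))
```

Note that plain KFold does not stratify by class, so the proportion of each digit can differ slightly from fold to fold; shuffle=True with a fixed random_state keeps the assignment random but reproducible.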

Step 5: Perform the grid search with GridSearchCV, using the KFold cross-validation object and the SVM.

Python

# performing exhaustive search
grid_search = GridSearchCV(svm, param_grid, cv=kf, scoring='accuracy', return_train_score=True, verbose=3, n_jobs=-1)
 
grid_search.fit(X,y)
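Because return_train_score=True is set, the fitted search also stores training scores, which can be compared with the test scores to spot overfitting or underfitting. The sketch below repeats the search on a deliberately small grid (the four C values are an assumption, chosen to keep the run short) and prints both scores for each candidate.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Small illustrative grid, not the grid used in the article.
small_grid = {'C': [0.01, 0.1, 1.0, 10.0]}
kf = KFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(SVC(kernel="sigmoid"), small_grid, cv=kf,
                      scoring='accuracy', return_train_score=True)
search.fit(X, y)

results = search.cv_results_
for params, train, test in zip(results['params'],
                               results['mean_train_score'],
                               results['mean_test_score']):
    print(f"C={params['C']:<6} train={train:.3f} test={test:.3f}")

print("Best C on this small grid:", search.best_params_['C'])
```

A training score far above the test score for a given C suggests overfitting, while low scores on both suggest the model is too constrained.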

Step 6: Plot and print the results

This code snippet plots the mean cross-validated scores and their standard deviations against C, then prints the best score and the optimal value of C found by the grid search.

Python

# Mean and standard deviation of the test accuracy across folds, one entry per C
scores_avg = grid_search.cv_results_['mean_test_score']
scores_std = grid_search.cv_results_['std_test_score']
# cv_results_['param_C'] is a masked array; convert it to plain floats for plotting
param_values = np.array(grid_search.cv_results_['param_C'], dtype=float)

# Plot the mean score with one-standard-deviation bands on a log-scaled C axis
plt.figure()
plt.semilogx(param_values, scores_avg)
plt.semilogx(param_values, scores_avg + scores_std, "r--")
plt.semilogx(param_values, scores_avg - scores_std, "g--")
locs, labels = plt.yticks()
plt.yticks(locs, list(map(lambda x: "%g" % x, locs)))
plt.ylabel("CV score")
plt.xlabel("Parameter C")
plt.ylim(0, 1.1)
plt.show()

# Print the best score and parameters
print('Best score:', grid_search.best_score_)
print('Best C:', grid_search.best_params_['C'])

Output:

Best score: 0.9115242958836273
Best C: 0.2782559402207126

Interpretation: The score rises once C exceeds about 10^-2, which points to better values of C for our Support Vector Machine classifier. Past about 10^0 the score dips again and then stays almost constant. A higher CV score is better, so we select the value of C at the peak.

Advantages and Disadvantages of Cross Validation

Advantages

It gives an idea of how the model would generalize to unseen data.
It gives a more accurate estimate of the model’s predictive performance than a single train/test split.
Cross-validation helps to guard against overfitting by providing a more robust estimate of the model’s performance on unseen data.
It can be used to optimize the hyperparameters of a model.

Disadvantages

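Cross-validation increases training time because the model is trained once per fold for every parameter combination. For instance, with 5 folds and 10 parameter combinations, there are 50 model fits in total. The number of combinations is the product of the number of values tried for each parameter, so it grows exponentially with the number of parameters in the grid.
It can require substantial processing power for large datasets or large grids.
The choice of the number of folds in cross-validation affects the bias-variance tradeoff of the performance estimate: too few folds leave less data for training in each iteration, which tends to bias the estimate pessimistically, while too many folds increase the computational cost and can increase the variance of the estimate.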
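The growth in the number of model fits can be counted without training anything, using scikit-learn's ParameterGrid helper. The second grid below adds a hypothetical gamma parameter with 5 values purely to illustrate the effect; it is not part of the search performed in this article.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

n_folds = 5

# The grid used in this article: 10 values of C.
grid_one = {'C': np.logspace(-5, 5, 10)}
# Hypothetical extension: 5 values of gamma on top of the 10 values of C.
grid_two = {'C': np.logspace(-5, 5, 10), 'gamma': np.logspace(-4, 0, 5)}

fits_one = len(ParameterGrid(grid_one)) * n_folds
fits_two = len(ParameterGrid(grid_two)) * n_folds

print("Fits with one parameter:", fits_one)
print("Fits with two parameters:", fits_two)
```

Each additional parameter multiplies the number of combinations by the number of values tried for it, which is why large grids become expensive quickly.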
