Cross-validation on Digits Dataset in Scikit-learn
In this article, we will discuss cross-validation and how to apply it to the digits dataset, and then walk through a code implementation.
What is Cross-Validation?
Cross-validation on the digits dataset lets us choose hyperparameters that generalize well instead of overfitting the training data. In practice it is a trial-and-error procedure: we compute the cross-validation score for each candidate parameter value and, after evaluation, choose the value with the best score. The same procedure applies to commercial workflows.
The digits dataset in scikit-learn contains a copy of the test set of the UCI ML hand-written digits datasets. It is a classification dataset that suits beginners well and is useful for learning various machine-learning algorithms, including CNNs.
Cross-validation is a technique in which we train a model on a subset of the data set and then evaluate it on the complementary subset. The three steps involved in cross-validation are as follows:
- Reserve some portion of the sample data set.
- Train the model on the rest of the data set.
- Test the model using the reserved portion of the data set.
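The three steps above can be sketched with a single hold-out split. This is a minimal illustration, not part of the walkthrough that follows: the 80/20 split ratio, the RBF kernel, and the variable names are assumptions chosen for the example.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Step 1: reserve 20% of the samples for evaluation (ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 2: train the model on the remaining samples.
model = SVC(kernel="rbf")
model.fit(X_train, y_train)

# Step 3: test the model on the reserved portion.
holdout_accuracy = model.score(X_test, y_test)
print("Hold-out accuracy:", holdout_accuracy)
```

A single split like this gives only one estimate, which depends on which samples happen to land in the reserved portion. K-fold cross-validation repeats the procedure so that every sample is used for testing exactly once.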
K-Fold Cross Validation: In this method, we split the data set into k subsets (known as folds). We then train the model on k-1 of the folds and use the one remaining fold to evaluate the trained model. We iterate k times, reserving a different fold for testing each time, so every sample is used for testing exactly once.
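To make the fold rotation concrete, here is a small sketch that prints which samples are held out in each iteration. The ten-sample toy array and the names `X_toy` and `test_folds` are made up for illustration; shuffling is left off so that the folds are contiguous and easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(10, 1)  # 10 toy samples, one feature each
kf = KFold(n_splits=5)                # no shuffling: folds are contiguous

test_folds = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X_toy)):
    # Each iteration trains on k-1 folds (8 samples) and tests on 1 fold (2 samples).
    test_folds.append(test_idx.tolist())
    print(f"Fold {fold}: train on {len(train_idx)} samples, test on {test_idx.tolist()}")
```

With 10 samples and 5 folds, each test fold holds 2 samples and the 5 test folds together cover the whole data set.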
Syntax
To perform K-fold cross-validation, we can use the cross_val_score function. Here is the syntax:
cross_val_score(model, X, y, cv=5)
- model: It is the estimator that we want to fit on the data.
- X: It is the training data.
- y: It is the array of target labels, one for each sample in X.
- cv: When given as an integer, it states the number of folds. For a classifier, an integer selects StratifiedKFold; otherwise KFold is used.
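As a quick illustration of this syntax on the digits data, the sketch below scores an SVM with 5 folds. The RBF kernel is an assumption made for the example and is not the kernel used in the grid search later in the article.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
model = SVC(kernel="rbf")  # kernel chosen for illustration only

# One accuracy value per fold; the estimator is cloned and refitted for each fold.
scores = cross_val_score(model, X, y, cv=5)

print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
print("Standard deviation:", scores.std())
```

The function returns one score per fold, so the mean and standard deviation are usually reported together.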
We can also use GridSearchCV, which performs an exhaustive search over the parameter grid that we provide. It takes the following parameters:
GridSearchCV(model, param_grid, cv=kf, scoring='accuracy')
- model: It is the estimator that we want to fit on the data.
- param_grid: It is a dictionary that maps parameter names to lists of candidate values. The search runs on every combination of the values provided.
- cv: It is the cross-validation splitting strategy.
- scoring: It defines the strategy to evaluate the performance of the cross-validated model on the test set.
Performing K-Fold Cross Validation on the Dataset
Step 1: Import the libraries:
Import all the libraries required for the following steps. This Python code demonstrates how to perform a grid search to tune the hyperparameters of a Support Vector Machine (SVM) classifier using the scikit-learn library.
Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.datasets import load_digits
from sklearn.svm import SVC
Step 2: Load the digits dataset
The load_digits(return_X_y=True) function loads the handwritten digits dataset and assigns the feature matrix to X and the associated labels to y.
Python
X, y = load_digits(return_X_y=True)
Step 3: Define the parameter grid using numpy logspace
Python
param_grid = {'C': np.logspace(-5, 5, 10)}
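It can help to look at the values this grid actually contains. The short sketch below (the name `C_values` is introduced only for this example) prints the 10 candidates, which are spaced evenly on a log scale between 10^-5 and 10^5.

```python
import numpy as np

# 10 values whose exponents are evenly spaced between -5 and 5.
C_values = np.logspace(-5, 5, 10)

for i, c in enumerate(C_values):
    print(f"candidate {i}: C = {c:.6g}")
```

Because only 10 points cover ten orders of magnitude, neighbouring candidates differ by a factor of roughly 13. The best C reported in the output at the end of this article is the fifth of these grid points, so a finer grid around that value might be worth trying.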
Step 4: Define the KFold object in scikit-learn and create an SVM classifier
This code snippet instantiates a Support Vector Machine (SVM) classifier with a sigmoid kernel and creates a KFold cross-validation object with 5 shuffled folds.
Python
svm = SVC(kernel="sigmoid")
kf = KFold(n_splits=5, shuffle=True, random_state=42)
Step 5: Perform the grid search with GridSearchCV using the KFold object and the SVM
Python
# performing exhaustive search
grid_search = GridSearchCV(svm, param_grid, cv=kf, scoring='accuracy',
                           return_train_score=True, verbose=3, n_jobs=-1)
grid_search.fit(X, y)
Step 6: Plot and print the results
This code snippet plots the mean cross-validated score for each value of C, with dashed lines one standard deviation above and below it, and then prints the best score and the best value of C found by the grid search.
Python
scores_avg = grid_search.cv_results_['mean_test_score']
scores_std = grid_search.cv_results_['std_test_score']
param_values = grid_search.cv_results_['param_C']

# Do the plotting
plt.figure()
plt.semilogx(param_values, scores_avg)
plt.semilogx(param_values, np.array(scores_avg) + np.array(scores_std), "r--")
plt.semilogx(param_values, np.array(scores_avg) - np.array(scores_std), "g--")
locs, labels = plt.yticks()
plt.yticks(locs, list(map(lambda x: "%g" % x, locs)))
plt.ylabel("CV score")
plt.xlabel("Parameter C")
plt.ylim(0, 1.1)
plt.show()

# Print the best score and parameters
print('Best score:', grid_search.best_score_)
print('Best C:', grid_search.best_params_['C'])
Output:
Best score: 0.9115242958836273
Best C: 0.2782559402207126
Interpretation: The CV score increases once C exceeds 10^-2, so values of C in that range suit our Support Vector Machine classifier better. After 10^0 the score dips again and then stays almost constant. A higher CV score is better, so the best C is the one at the peak of the curve.
Advantages and Disadvantages of Cross Validation
Advantages
- It gives an idea of how the model would generalize to unseen data.
- It gives a more accurate estimate of the model's predictive performance than a single train/test split.
- Cross-validation helps to detect overfitting by providing a more robust estimate of the model’s performance on unseen data.
- It can be used to optimize the hyperparameters of a model.
Disadvantages
- Cross-validation takes longer to train because the model is fitted once per fold for every parameter combination. For instance, with 5 folds and 10 parameter combinations there are 50 fits in total. Each additional parameter multiplies the number of combinations by its number of candidate values, so the cost grows exponentially with the number of parameters.
- It requires considerable processing power, especially for large data sets or large parameter grids.
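- The choice of the number of folds in cross-validation can impact the bias-variance tradeoff, i.e., too few folds may give a pessimistically biased estimate because each model is trained on a smaller share of the data, while too many folds (for example, leave-one-out) may give a high-variance estimate and cost more to compute.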
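The training-time cost in the first disadvantage can be counted directly. The sketch below uses ParameterGrid to enumerate the combinations. The second parameter, `gamma`, and its 5 candidate values are hypothetical and are added only to show how the count grows.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

n_folds = 5

# The grid used in this article: 10 candidate values of C.
one_param = {"C": np.logspace(-5, 5, 10)}

# Hypothetical second parameter with 5 candidate values.
two_params = {"C": np.logspace(-5, 5, 10), "gamma": np.logspace(-4, 0, 5)}

# Number of model fits = number of combinations x number of folds.
fits_one = len(ParameterGrid(one_param)) * n_folds
fits_two = len(ParameterGrid(two_params)) * n_folds

print("Fits with one parameter:", fits_one)
print("Fits with two parameters:", fits_two)
```

Adding 5 values of a second parameter takes the count from 50 to 250 fits. GridSearchCV also performs one more fit on the full data set at the end when refit=True, which is the default.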