The Validation Set Approach in R Programming
The validation set approach is a cross-validation technique in machine learning. Cross-validation techniques are used to estimate how well a machine learning model performs on data it has not seen. In the validation set approach, the dataset used to build the model is divided randomly into two parts: a training set and a validation set (or testing set). The model is trained on the training set, and its accuracy is calculated by predicting the target variable for the data points in the validation set, which were not present during training. Splitting the data, training the model, and testing it can be tedious to do by hand, but R provides numerous libraries and built-in functions that carry out these tasks easily and efficiently.
Steps Involved in the Validation Set Approach
- The dataset is split randomly in a certain ratio (generally a 70-30 or 80-20 ratio is preferred)
- The model is trained on the training set
- The resulting model is applied to the validation set
- The model's accuracy is calculated from the prediction error using model performance metrics
This article discusses the step-by-step method of implementing the validation set approach as a cross-validation technique for both classification and regression machine learning models.
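The four steps above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the built-in mtcars dataset, the formula mpg ~ wt + hp, and the 70-30 ratio are choices made here and are not the datasets used later in this article.

```r
# Minimal base-R sketch of the validation set approach.
# Assumptions: built-in mtcars data, mpg ~ wt + hp as the model, 70-30 split.
set.seed(42)

n <- nrow(mtcars)

# Step 1: randomly pick 70% of the row indices for training
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train_set <- mtcars[train_idx, ]
valid_set <- mtcars[-train_idx, ]

# Step 2: train the model on the training set only
fit <- lm(mpg ~ wt + hp, data = train_set)

# Step 3: apply the fitted model to the validation set
pred <- predict(fit, newdata = valid_set)

# Step 4: estimate the prediction error on the held-out rows
validation_rmse <- sqrt(mean((valid_set$mpg - pred)^2))
print(validation_rmse)
```

The key point is that the rows in valid_set play no part in fitting the model, so the error computed on them approximates the error on new data.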
For Classification Machine Learning Models
This type of machine learning model is used when the target variable is categorical, such as positive/negative or diabetic/non-diabetic. The model predicts the class label of the dependent variable. Here, the logistic regression algorithm is applied to build the classification model.
Step 1: Loading the dataset and other required packages
Before any exploration or manipulation, load the required libraries and the dataset; their built-in functions make the whole process easier to carry out.
R
# loading required packages

# package to perform data manipulation
# and visualization
library(tidyverse)

# package to compute
# cross-validation methods
library(caret)

# package used to split the data
# into train and test subsets
library(caTools)

# package that provides
# the desired dataset
library(ISLR)
Step 2: Exploring the dataset
It is necessary to understand the structure and dimensions of the dataset, as this helps in building a correct model. Also, as this is a classification model, one must know the different categories present in the target variable.
R
# assigning the complete cases of the
# Smarket dataset to a variable
dataset <- Smarket[complete.cases(Smarket), ]

# display the dataset with details
# like column name and its data type
# along with values in each row
glimpse(dataset)

# checking values present
# in the Direction column
# of the dataset
table(dataset$Direction)
Output:
Rows: 1,250
Columns: 9
$ Year      <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, ...
$ Lag1      <dbl> 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0.546, -1...
$ Lag2      <dbl> -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0...
$ Lag3      <dbl> -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -...
$ Lag4      <dbl> -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, ...
$ Lag5      <dbl> 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, ...
$ Volume    <dbl> 1.19130, 1.29650, 1.41120, 1.27600, 1.20570, 1.34910, 1.44500, 1.40780, 1.16400, 1.23260, 1.30900, 1.25800, 1.09800, 1.05310, ...
$ Today     <dbl> 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0.546, -1.747, 0...
$ Direction <fct> Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, Up, Down, Up, Down, Up, Down, Down, Down, Down, Up, Down, Down, Up...

> table(dataset$Direction)

Down   Up
 602  648
According to the above information, the imported dataset has 1,250 rows and 9 columns. The column type <dbl> denotes a double-precision floating-point number (dbl is short for double). The target variable must be of the factor data type in classification models. Since the data type of the Direction column is already <fct>, there is no need to change anything.
Moreover, the response (target) variable is a binary categorical variable, as the only values in the column are Down and Up, and the two class labels occur in a proportion of approximately 1:1, which means they are balanced. If the classes were imbalanced, for example in a proportion of 1:2, the categories should first be brought to approximately equal proportion. There are many techniques for this purpose, such as:
- Down Sampling
- Up Sampling
- Hybrid Sampling using SMOTE and ROSE
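The first two techniques in the list can be sketched in base R as follows. The data frame toy, its column names, and the 30:60 class counts are made up for illustration; caret also provides downSample() and upSample() helpers for the same purpose, while SMOTE and ROSE need additional packages and are not shown here.

```r
# Sketch of down-sampling and up-sampling in base R.
# Assumption: `toy` is a made-up data frame with a 1:2 class imbalance.
set.seed(1)
toy <- data.frame(
  x     = rnorm(90),
  class = factor(rep(c("Down", "Up"), times = c(30, 60)))
)

# row indices of each class, and the smallest and largest class sizes
idx_by_class <- split(seq_len(nrow(toy)), toy$class)
n_min <- min(lengths(idx_by_class))
n_max <- max(lengths(idx_by_class))

# down-sampling: keep n_min randomly chosen rows from every class
down_idx <- unlist(lapply(idx_by_class,
                          function(i) i[sample.int(length(i), n_min)]))
down_sampled <- toy[down_idx, ]

# up-sampling: draw n_max rows per class, with replacement
up_idx <- unlist(lapply(idx_by_class,
                        function(i) i[sample.int(length(i), n_max, replace = TRUE)]))
up_sampled <- toy[up_idx, ]

table(down_sampled$class)
table(up_sampled$class)
```

Down-sampling discards rows of the larger class, while up-sampling repeats rows of the smaller class; which trade-off is acceptable depends on how much data is available.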
Step 3: Building the model and generating the validation set
This step involves randomly splitting the dataset into a training set and a validation set, and training the model. Below is the implementation.
R
# setting seed to generate a
# reproducible random sampling
set.seed(100)

# dividing the complete dataset
# into 2 parts having ratio of
# 70% and 30%
spl <- sample.split(dataset$Direction, SplitRatio = 0.7)

# selecting that part of dataset
# which belongs to the 70% of the
# dataset divided in previous step
train <- subset(dataset, spl == TRUE)

# selecting that part of dataset
# which belongs to the 30% of the
# dataset divided in previous step
test <- subset(dataset, spl == FALSE)

# checking number of rows and columns
# in training and testing dataset
print(dim(train))
print(dim(test))

# Building the model
# training the model by assigning Direction column
# as target variable and all other columns
# as independent variables
# note: "." includes Today, and Direction is the sign of Today,
# so the predictors contain the answer; drop Today
# for a realistic evaluation
model_glm <- glm(Direction ~ ., family = "binomial",
                 data = train, maxit = 100)
Output:
> print(dim(train))
[1] 875   9
> print(dim(test))
[1] 375   9
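sample.split() keeps the ratio of class labels roughly the same in both subsets. The idea of such a stratified split can be sketched in base R as below. This is not the caTools implementation, only an illustration: the vector labels is a made-up stand-in with the same class counts as dataset$Direction (602 Down, 648 Up).

```r
# Sketch of a stratified 70-30 split in base R.
# Assumption: `labels` is a made-up factor standing in for dataset$Direction.
set.seed(100)
labels <- factor(rep(c("Down", "Up"), times = c(602, 648)))

in_train <- logical(length(labels))   # starts as all FALSE
for (lev in levels(labels)) {
  idx <- which(labels == lev)
  # pick 70% of the rows of this class for the training set
  chosen <- idx[sample.int(length(idx), round(0.7 * length(idx)))]
  in_train[chosen] <- TRUE
}

# sizes of the two subsets
c(train = sum(in_train), test = sum(!in_train))

# class proportions are nearly identical in both subsets
prop.table(table(labels[in_train]))
prop.table(table(labels[!in_train]))
```

Sampling within each class separately is what prevents a purely random split from accidentally over-representing one class in the validation set.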
Step 4: Predicting the target variable
Now that the model is trained, it is time to make predictions on the unseen data. Since the target variable has only 2 possible values, the predict() function is called with type = "response" so that it returns, for each data point, the predicted probability (a score between 0 and 1) of the second factor level, which here is Up.
The next step converts these probability scores into class labels: if the score of a data point is at or above a certain threshold, it is labelled Up, otherwise it is labelled Down. Here, the probability cutoff is set to 0.5. Below is the code to implement these steps.
R
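# predictions on the validation set
predictTest <- predict(model_glm, newdata = test, type = "response")

# assigning the probability cutoff as 0.5
# (levels are taken from the actual target so that both
# classes are kept even if one is never predicted)
predicted_classes <- factor(ifelse(predictTest >= 0.5, "Up", "Down"),
                            levels = levels(test$Direction))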
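The effect of the cutoff is easiest to see on a handful of scores. In the sketch below, the vector probs and the helper classify() are hypothetical and exist only for this illustration; they are not part of the model above.

```r
# Sketch: how the probability cutoff turns scores into class labels.
# Assumption: `probs` are made-up predicted probabilities of the "Up" class.
probs <- c(0.10, 0.45, 0.50, 0.62, 0.91)

classify <- function(p, cutoff = 0.5) {
  # fixing the levels keeps both classes even if one is never predicted
  factor(ifelse(p >= cutoff, "Up", "Down"), levels = c("Down", "Up"))
}

table(classify(probs))                # cutoff 0.5: 2 Down, 3 Up
table(classify(probs, cutoff = 0.7))  # stricter cutoff: 4 Down, 1 Up
```

Raising the cutoff makes the model more reluctant to predict Up, which trades false Up predictions for false Down predictions.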
Step 5: Evaluating the accuracy of the model
A common way to judge the accuracy of a classification model is the confusion matrix. This matrix counts how many data points were predicted correctly and incorrectly for each class, using the actual values of the target variable in the testing dataset as the reference. Along with the confusion matrix, other statistics of the model such as accuracy and kappa can be calculated using the code below.
R
# generating confusion matrix and
# other details from the
# prediction made by the model
print(confusionMatrix(predicted_classes, test$Direction))
Output:
Confusion Matrix and Statistics

          Reference
Prediction Down  Up
      Down  177   5
      Up      4 189

               Accuracy : 0.976
                 95% CI : (0.9549, 0.989)
    No Information Rate : 0.5173
    P-Value [Acc > NIR] : <2e-16

                  Kappa : 0.952

 Mcnemar's Test P-Value : 1

            Sensitivity : 0.9779
            Specificity : 0.9742
         Pos Pred Value : 0.9725
         Neg Pred Value : 0.9793
             Prevalence : 0.4827
         Detection Rate : 0.4720
   Detection Prevalence : 0.4853
      Balanced Accuracy : 0.9761

       'Positive' Class : Down
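Several of the statistics in this output follow directly from the four cell counts of the confusion matrix. The sketch below re-enters those counts by hand and recomputes a few of them in base R; the variable names are chosen here for illustration.

```r
# Recomputing some reported statistics from the four cell counts above.
# Rows are predictions, columns are the reference (actual) classes.
cm <- matrix(c(177, 4, 5, 189), nrow = 2,
             dimnames = list(Prediction = c("Down", "Up"),
                             Reference  = c("Down", "Up")))

n <- sum(cm)

# share of all data points on the diagonal (correct predictions)
accuracy <- sum(diag(cm)) / n

# the positive class is Down, so sensitivity is computed for Down
sensitivity <- cm["Down", "Down"] / sum(cm[, "Down"])
specificity <- cm["Up", "Up"] / sum(cm[, "Up"])

# Cohen's kappa: agreement beyond what the row and column totals
# would produce by chance
expected   <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_stat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, kappa = kappa_stat), 4)
```

The rounded values agree with the Accuracy, Sensitivity, Specificity, and Kappa lines printed by confusionMatrix().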
For Regression Machine Learning Models
Regression models are used to predict a continuous quantity such as the price of a house or the sales of a product. In a regression problem, the target variable is a number, stored as an integer or floating-point value. The accuracy of this kind of model is measured by averaging the prediction errors over the data points. Below are the steps to implement the validation set approach for linear regression models.
Step 1: Loading the dataset and required packages
R ships with a variety of built-in datasets. Here we use the trees dataset, a built-in dataset that suits a linear regression model. Below is the code to load the required dataset and packages.
R
# loading required packages

# package to perform data manipulation
# and visualization
library(tidyverse)

# package to compute
# cross-validation methods
library(caret)

# access the data from R's datasets package
data(trees)

# look at the first several rows of the data
head(trees)
Output:
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7
So, in this dataset, there are a total of 3 columns among which Volume is the target variable. Since the variable is of continuous nature, a linear regression algorithm can be used to predict the outcome.
Step 2: Building the model and generating the validation set
In this step, the dataset is split randomly in a ratio of 80-20. 80% of the data points are used to train the model, while the remaining 20% act as the validation set on which the accuracy of the model is measured. Below is the code for the same.
R
# reproducible random sampling
set.seed(123)

# creating training data as 80% of the dataset
random_sample <- createDataPartition(trees$Volume, p = 0.8, list = FALSE)

# generating training dataset
# from the random_sample
training_dataset <- trees[random_sample, ]

# generating testing dataset
# from rows which are not
# included in random_sample
testing_dataset <- trees[-random_sample, ]

# Building the model
# training the model by assigning Volume column
# as target variable and all other columns
# as independent variables
model <- lm(Volume ~ ., data = training_dataset)
Step 3: Predict the target variable
After building and training the model, the target variable is predicted for the data points that belong to the validation set.
R
# predicting the target variable
predictions <- predict(model, testing_dataset)
Step 4: Evaluating the accuracy of the model
The statistical metrics commonly used to evaluate a linear regression model are Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R2 (the coefficient of determination). RMSE and MAE measure the typical size of the prediction error, so lower values are better; R2 measures how much of the variation in the target variable the model accounts for, so a higher value indicates a better model. Below is the code to calculate these metrics.
R
# computing model performance metrics
data.frame(R2 = R2(predictions, testing_dataset$Volume),
           RMSE = RMSE(predictions, testing_dataset$Volume),
           MAE = MAE(predictions, testing_dataset$Volume))
Output:
         R2     RMSE     MAE
1 0.9564487 5.274129 4.73567
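To make the three metrics concrete, they can be written out in base R. The vectors observed and predicted below are made-up values for illustration, not the model's output; R2 is computed here as the squared correlation between predictions and observations, which, to my understanding, is also the default used by caret's R2().

```r
# Sketch: the three regression metrics written out in base R.
# Assumption: `observed` and `predicted` are made-up values for illustration.
observed  <- c(10.3, 16.4, 18.8, 19.7, 24.2)
predicted <- c(11.0, 15.9, 19.5, 21.0, 23.0)

errors <- predicted - observed

rmse <- sqrt(mean(errors^2))        # squares penalise large errors more
mae  <- mean(abs(errors))           # average absolute error
r2   <- cor(predicted, observed)^2  # squared correlation

data.frame(R2 = r2, RMSE = rmse, MAE = mae)
```

RMSE and MAE are expressed in the units of the target variable, whereas R2 is unitless and lies between 0 and 1 when computed this way.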
Advantages of the Validation Set approach
- One of the most basic and simple techniques for evaluating a model.
- No complex steps for implementation.
Disadvantages of the Validation Set approach
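- The predictions made by the model, and therefore the estimated error, depend heavily on which observations happen to fall into the training and validation sets.
- Because the model is trained on only a subset of the data, it sees fewer observations than are available, which can bias the model and its estimated error.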
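The first disadvantage can be observed by repeating the split with different seeds and comparing the resulting validation errors. The sketch below uses the built-in trees dataset with an 80-20 split in base R; the helper split_rmse() and the choice of 20 repetitions are assumptions made for this illustration.

```r
# Sketch: validation RMSE of the same model under different random splits.
# Assumptions: built-in trees data, 80-20 split, 20 repetitions.
split_rmse <- function(seed, data = trees, p = 0.8) {
  set.seed(seed)
  # a different seed gives a different random training subset
  train_idx <- sample(seq_len(nrow(data)), size = floor(p * nrow(data)))
  fit  <- lm(Volume ~ ., data = data[train_idx, ])
  pred <- predict(fit, newdata = data[-train_idx, ])
  sqrt(mean((data$Volume[-train_idx] - pred)^2))
}

rmse_by_split <- sapply(1:20, split_rmse)

# spread of the error estimate across the 20 splits
summary(rmse_by_split)
```

The gap between the smallest and largest value in the summary shows how much a single validation estimate can move purely because of the random split.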