Testing Set

This dataset is independent of the training set but has a somewhat similar type of probability distribution of classes and is used as a benchmark to evaluate the model, used only after the training of the model is complete. Testing set is usually a properly organized dataset having all kinds of data for scenarios that the model would probably be facing when used in the real world. Often the validation and testing set combined is used as a testing set which is not considered a good practice. If the accuracy of the model on training data is greater than that on testing data then the model is said to have overfitting. This data is approximately 20-25% of the total data available for the project.

Example:

Python3

# Importing numpy & scikit-learn 
import numpy as np 
from sklearn.model_selection import train_test_split 
  
# Making a dummy array to represent x,y for example 
# Making a array for x ranging from 0-15 then 
# reshaping it to form a matrix of shape 8x2 
x = np.arange(16).reshape((8, 2)) 
  
# y is just a list of 0-7 number representing  
# target variable 
y = range(8) 
  
# Splitting dataset in 80-20 fashion .i.e. 
# Training set is 80% of total data 
# Testing set is 20% of total data 
x_train, x_test, y_train, y_test = train_test_split(x, y, 
                                                    test_size=0.2, 
                                                    random_state=42) 
  
# Testing set 
print("Testing set x: ", x_test) 
print("Testing set y: ", y_test) 

Output:

Testing set x:  [[ 2  3]
 [10 11]]
Testing set y:  [1, 5]

Explanation:

To show how the train_test_split function works we first created a dummy matrix of 8×2 shape using NumPy library to represent input x. And a list of 0 to 7 integers representing our target variable y.
Now in order to split our dataset into training and testing data, input data x with target variable y is passed as parameters to function which then divides the dataset into 2 parts on the size given in test_size i.e. if test_size=0.2 is given then the dataset will be divided in such an away that testing set will be 20% of given dataset and training set will be 80% of given dataset.
And as we specify random_state to be a positive number, train_test_split function will randomly split data.

Training vs Testing vs Validation Sets

In this article, we are going to see how to Train, Test and Validate the Sets.

The fundamental purpose for splitting the dataset is to assess how effective will the trained model be in generalizing to new data. This split can be achieved by using train_test_split function of scikit-learn.

Tags:

#python #Python Testing #Python-numpy #AI-ML-DS #Machine Learning #Machine Learning #python

Training Set

Validation Set

Testing Set

Python3

Training vs Testing vs Validation Sets

Similar Reads

Contact Us