What is a Toy Dataset – Types, Purpose, Benefits and Applications

Toy datasets are small, simple datasets commonly used in the field of machine learning for training, testing, and demonstrating algorithms. These datasets are typically clean, well-organized, and structured in a way that makes them easy to use for instructional purposes, reducing the complexities associated with real-world data processing.

What is a Toy Dataset?

A toy dataset is a small, simplified set of data used in machine learning and statistics, created or curated specifically for teaching and experimentation. These datasets are deliberately basic, helping beginners and data professionals get started while still supporting deeper exploration of core techniques.

Table of Content

  • What is a Toy Dataset?
  • Characteristics of Toy Datasets
  • Types of Toy Datasets
  • 1. Iris Plants Dataset
  • 2. Diabetes Dataset
  • 3. Optical Recognition of Handwritten Digits Dataset
  • 4. Linnerrud Dataset
  • 5. Wine Recognition Dataset
  • 6. Breast Cancer Wisconsin (Diagnostic) Dataset
  • Purpose and Benefits of Toy Dataset
  • Limitations of Toy Datasets
  • Conclusion

Characteristics of Toy Datasets

Here’s a breakdown of their key characteristics:

  • Simple and Understandable: A toy dataset is easy to comprehend and analyze because it involves only a small number of variables and observations.
  • Controlled Environment: The data is often synthetic or deliberately curated, with complications such as noise or missing values removed, which makes it possible to study a given concept in isolation.
  • Focus on Learning: They are employed primarily for pedagogical purposes, so beginners can get acquainted with data analysis, learn how to apply algorithms, and understand core machine learning concepts.

Scikit-learn ships with several small standard datasets that do not need to be downloaded from any external site.
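For instance, the available loaders can be listed directly from the sklearn.datasets module. Note that the load_* naming convention also covers a few general file-loading utilities, so the list below is slightly broader than the toy datasets alone:

```python
from sklearn import datasets

# Collect the names of all load_* functions bundled with scikit-learn
loaders = [name for name in dir(datasets) if name.startswith("load_")]
print(loaders)
```

The output includes load_iris, load_diabetes, load_digits, load_linnerud, load_wine, and load_breast_cancer, which are the six toy datasets covered below.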

Types of Toy Datasets

Some of the most popular Toy Datasets include:

  1. Iris plants dataset
  2. Diabetes dataset
  3. Optical recognition of handwritten digits dataset
  4. Linnerrud dataset
  5. Wine recognition dataset
  6. Breast cancer wisconsin (diagnostic) dataset

Let's look at each of them in turn:

1. Iris Plants Dataset

This dataset contains 150 records of iris flowers, each with measurements of sepal length, sepal width, petal length, and petal width. The task is typically to classify these records into one of three iris species.

  • Classes: 3
  • Samples per class: 50
  • Samples total: 150
  • Dimensionality: 4
  • Features: real, positive

Example: loading the Iris dataset

Python3
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset
iris = load_iris()

# Creating a DataFrame from the dataset for easier manipulation
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Print the first few rows of the DataFrame
print(iris_df.head())

Output:

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

  species  
0  setosa  
1  setosa  
2  setosa  
3  setosa  
4  setosa  
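Beyond loading the data, a few more lines turn the Iris dataset into a working classification example. The sketch below uses a DecisionTreeClassifier as one possible baseline; the choice of model and split parameters is illustrative, not part of the dataset itself:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()

# Hold out 25% of the samples for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42)

# Fit a small decision tree and score it on the held-out data
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

Because the three species are well separated in feature space, even simple models tend to score very highly here, which is exactly why Iris is a popular first classification exercise.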

2. Diabetes Dataset

The load_diabetes function from scikit-learn provides a dataset for regression analysis, featuring physiological measurements and diabetes progression indicators from 442 patients.

  • Samples total: 442
  • Dimensionality: 10
  • Features: real, -0.2 < x < 0.2
  • Targets: integers 25–346

Example: loading the Diabetes dataset

Python3
from sklearn.datasets import load_diabetes
import pandas as pd

# Load the Diabetes dataset
diabetes = load_diabetes()

# Creating a DataFrame from the dataset for easier manipulation
diabetes_df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
diabetes_df['target'] = diabetes.target

# Print the first few rows of the DataFrame
print(diabetes_df.head())

Output:

        age       sex       bmi        bp        s1        s2        s3  \
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401   
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412   
2  0.085299  0.050680  0.044451 -0.005670 -0.045599 -0.034194 -0.032356   
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038   
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142   

         s4        s5        s6  target  
0 -0.002592  0.019907 -0.017646   151.0  
1 -0.039493 -0.068332 -0.092204    75.0  
2 -0.002592  0.002861 -0.025930   141.0  
3  0.034309  0.022688 -0.009362   206.0  
4 -0.002592 -0.031988 -0.046641   135.0  
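Since this dataset is designed for regression, a natural next step is to fit a regressor to it. A minimal sketch using ordinary least squares (the split parameters are arbitrary choices for illustration):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()

# Reserve 20% of the patients for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=0)

# Fit ordinary least squares on the 10 physiological features
reg = LinearRegression()
reg.fit(X_train, y_train)

# score() returns the coefficient of determination (R^2) on held-out data
r2 = reg.score(X_test, y_test)
print(f"R^2 on test set: {r2:.2f}")
```

The modest R² typical for this dataset is a useful reminder that disease progression is only partially explained by these ten features.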

3. Optical Recognition of Handwritten Digits Dataset

The load_digits function from scikit-learn loads a dataset of 1,797 samples of 8×8 images of handwritten digits, useful for practicing image classification techniques in machine learning with 10 class labels (0-9).

  • Classes: 10
  • Samples per class: ~180
  • Samples total: 1,797
  • Dimensionality: 64
  • Features: integers 0–16

Example: loading the digits dataset

Python3
from sklearn.datasets import load_digits
import pandas as pd

# Load the digits dataset
digits = load_digits()

# Creating a DataFrame from the dataset for easier manipulation
digits_df = pd.DataFrame(data=digits.data)
digits_df['target'] = digits.target

# Adding column names for better readability
digits_df.columns = [f'pixel_{i}' for i in range(digits.data.shape[1])] + ['target']

# Print the first few rows of the DataFrame
print(digits_df.head())

Output:

   pixel_0  pixel_1  pixel_2  pixel_3  pixel_4  pixel_5  pixel_6  pixel_7  \
0      0.0      0.0      5.0     13.0      9.0      1.0      0.0      0.0   
1      0.0      0.0      0.0     12.0     13.0      5.0      0.0      0.0   
2      0.0      0.0      0.0      4.0     15.0     12.0      0.0      0.0   
3      0.0      0.0      7.0     15.0     13.0      1.0      0.0      0.0   
4      0.0      0.0      0.0      1.0     11.0      0.0      0.0      0.0   

   pixel_8  pixel_9  ...  pixel_55  pixel_56  pixel_57  pixel_58  pixel_59  \
0      0.0      0.0  ...       0.0       0.0       0.0       6.0      13.0   
1      0.0      0.0  ...       0.0       0.0       0.0       0.0      11.0   
2      0.0      0.0  ...       0.0       0.0       0.0       0.0       3.0   
3      0.0      8.0  ...       0.0       0.0       0.0       7.0      13.0   
4      0.0      0.0  ...       0.0       0.0       0.0       0.0       2.0   

   pixel_60  pixel_61  pixel_62  pixel_63  target  
0      10.0       0.0       0.0       0.0       0  
1      16.0      10.0       0.0       0.0       1  
2      11.0      16.0       9.0       0.0       2  
3      13.0       9.0       0.0       0.0       3  
4      16.0       4.0       0.0       0.0       4  

[5 rows x 65 columns]
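Because each sample is a flattened 8×8 image, any row of digits.data can be reshaped back into its two-dimensional form. The digits.images attribute stores the same pixels pre-shaped, which the sketch below uses as a cross-check:

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Each row of digits.data holds 64 pixel intensities; reshape the first
# sample back into its original 8x8 grid
first_image = digits.data[0].reshape(8, 8)
print("Label:", digits.target[0])
print(first_image)

# The reshaped row matches the pre-shaped image array provided by sklearn
assert (first_image == digits.images[0]).all()
```

Viewing the grid of intensities (0–16) makes it easy to see the digit's shape even without a plotting library.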

4. Linnerrud Dataset

The load_linnerud function in scikit-learn provides a multi-output regression dataset containing exercise and physiological measurements from twenty middle-aged men, useful for fitness-related studies.

  • Samples total: 20
  • Dimensionality: 3 (for both data and target)
  • Features: integer
  • Targets: integer

Example: loading the Linnerrud dataset

Python3
from sklearn.datasets import load_linnerud
import pandas as pd

# Load the Linnerud dataset
linnerud = load_linnerud()

# Creating DataFrames from the dataset for easier manipulation
# Features DataFrame
features_df = pd.DataFrame(data=linnerud.data, columns=linnerud.feature_names)
# Target DataFrame
targets_df = pd.DataFrame(data=linnerud.target, columns=linnerud.target_names)

# Print the first few rows of the features DataFrame
print("Features DataFrame:")
print(features_df.head())

Output:

Features DataFrame:
   Chins  Situps  Jumps
0    5.0   162.0   60.0
1    2.0   110.0   60.0
2   12.0   101.0  101.0
3   12.0   105.0   37.0
4   13.0   155.0   58.0
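Since the Linnerrud data has three targets per sample, it is a convenient playground for multi-output regression. A minimal sketch using LinearRegression, which supports multiple targets natively:

```python
from sklearn.datasets import load_linnerud
from sklearn.linear_model import LinearRegression

linnerud = load_linnerud()

# Three exercise features (Chins, Situps, Jumps) predict three
# physiological targets at once; LinearRegression handles this natively
reg = LinearRegression().fit(linnerud.data, linnerud.target)
predictions = reg.predict(linnerud.data)

print("Prediction shape:", predictions.shape)  # one row per sample, one column per target
print("Target names:", linnerud.target_names)
```

With only 20 samples, anything learned here is far too small to generalize, but the dataset is ideal for seeing the shapes involved in multi-output estimation.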

5. Wine Recognition Dataset

The load_wine function from scikit-learn offers a dataset for classification tasks, featuring chemical analyses of three different types of Italian wine.

  • Classes: 3
  • Samples per class: [59, 71, 48]
  • Samples total: 178
  • Dimensionality: 13
  • Features: real, positive

Example: loading the Wine recognition dataset

Python3
from sklearn.datasets import load_wine
import pandas as pd

# Load the wine dataset
wine = load_wine()

# Creating a DataFrame from the dataset for easier manipulation
wine_df = pd.DataFrame(data=wine.data, columns=wine.feature_names)
wine_df['target'] = wine.target

# Add a new column with target names for better readability
wine_df['target_name'] = wine_df['target'].apply(lambda x: wine.target_names[x])

# Print the first few rows of the DataFrame
print(wine_df.head())

Output:

alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
0    14.23        1.71  2.43               15.6      127.0           2.80   
1    13.20        1.78  2.14               11.2      100.0           2.65   
2    13.16        2.36  2.67               18.6      101.0           2.80   
3    14.37        1.95  2.50               16.8      113.0           3.85   
4    13.24        2.59  2.87               21.0      118.0           2.80   

   flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  \
0        3.06                  0.28             2.29             5.64  1.04   
1        2.76                  0.26             1.28             4.38  1.05   
2        3.24                  0.30             2.81             5.68  1.03   
3        3.49                  0.24             2.18             7.80  0.86   
4        2.69                  0.39             1.82             4.32  1.04   

   od280/od315_of_diluted_wines  proline  target target_name  
0                          3.92   1065.0       0     class_0  
1                          3.40   1050.0       0     class_0  
2                          3.17   1185.0       0     class_0  
3                          3.45   1480.0       0     class_0  
4                          2.93    735.0       0     class_0  
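Notice in the output above that the 13 chemical features live on very different scales (proline in the thousands, hue near 1). For distance-based or regularized models, it is usually advisable to standardize them first; a brief sketch:

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

wine = load_wine()

# Features such as proline (~1000) and hue (~1) differ by orders of
# magnitude, so rescale each column to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(wine.data)

print("Mean after scaling (first 3 features):", X_scaled.mean(axis=0)[:3].round(6))
print("Std after scaling  (first 3 features):", X_scaled.std(axis=0)[:3].round(6))
```

After scaling, no single feature dominates purely because of its units, which typically improves models such as k-nearest neighbors or SVMs on this dataset.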

6. Breast Cancer Wisconsin (Diagnostic) Dataset

The load_breast_cancer function in scikit-learn provides a dataset for binary classification between benign and malignant breast tumors based on features derived from cell nucleus images.

  • Classes: 2
  • Samples per class: 212 (malignant), 357 (benign)
  • Samples total: 569
  • Dimensionality: 30
  • Features: real, positive

Example: loading the Breast Cancer Wisconsin (Diagnostic) dataset

Python3
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load the breast cancer dataset
breast_cancer = load_breast_cancer()

# Creating a DataFrame from the dataset for easier manipulation
cancer_df = pd.DataFrame(data=breast_cancer.data, columns=breast_cancer.feature_names)
cancer_df['target'] = breast_cancer.target

# Add a new column with target names for better readability
cancer_df['diagnosis'] = cancer_df['target'].apply(lambda x: breast_cancer.target_names[x])

# Print the first few rows of the DataFrame
print(cancer_df.head())

Output:

   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst perimeter  worst area  worst smoothness  \
0                 0.07871  ...           184.60      2019.0            0.1622   
1                 0.05667  ...           158.80      1956.0            0.1238   
2                 0.05999  ...           152.50      1709.0            0.1444   
3                 0.09744  ...            98.87       567.7            0.2098   
4                 0.05883  ...           152.20      1575.0            0.1374   

   worst compactness  worst concavity  worst concave points  worst symmetry  \
0             0.6656           0.7119                0.2654          0.4601   
1             0.1866           0.2416                0.1860          0.2750   
2             0.4245           0.4504                0.2430          0.3613   
3             0.8663           0.6869                0.2575          0.6638   
4             0.2050           0.4000                0.1625          0.2364   

   worst fractal dimension  target  diagnosis  
0                  0.11890       0  malignant  
1                  0.08902       0  malignant  
2                  0.08758       0  malignant  
3                  0.17300       0  malignant  
4                  0.07678       0  malignant  

[5 rows x 32 columns]

Purpose and Benefits of Toy Dataset

  1. Educational Tools: Toy datasets serve as excellent resources for teaching and learning machine learning concepts. They allow beginners to focus on understanding algorithms and techniques without getting bogged down by the challenges of data cleaning, preprocessing, or large-scale data management.
  2. Benchmarking: These datasets provide a standardized framework for evaluating and comparing the performance of various algorithms and models. Since the results are easily reproducible, researchers and developers can benchmark their methods against established baselines.
  3. Rapid Prototyping: They are ideal for prototyping machine learning models quickly. Developers can test the viability of an algorithm or model design before applying it to more complex and larger datasets.
  4. Algorithm Development and Testing: Developers use toy datasets to test new algorithms for accuracy, efficiency, and other performance metrics. This testing can reveal fundamental strengths and weaknesses in algorithmic approaches under controlled conditions.
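As a concrete illustration of benchmarking and rapid prototyping, a few lines are enough to get a cross-validated baseline on a toy dataset. LogisticRegression is used here as just one possible baseline model, on the breast cancer dataset from above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# return_X_y gives the feature matrix and target vector directly
X, y = load_breast_cancer(return_X_y=True)

# 5-fold cross-validated accuracy of a simple baseline classifier;
# max_iter is raised because the features are unscaled
model = LogisticRegression(max_iter=5000)
scores = cross_val_score(model, X, y, cv=5)
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Because the dataset, split, and metric are all standardized, anyone can reproduce this number exactly and compare their own model against it, which is the essence of benchmarking on toy datasets.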

Limitations of Toy Datasets

While toy datasets are valuable educational tools, they do have limitations:

  1. Simplicity: Toy datasets are often too simple and fail to represent the complexity and noise found in real-world data. This can lead to overly optimistic performance estimates for models trained on these datasets.
  2. Size: Due to their small size, models trained on toy datasets might not scale well or might overfit when applied to larger, real-world datasets.
  3. Lack of Diversity: These datasets might not capture the diverse scenarios and variations found in real-world applications, which can limit the generalizability of the insights gained.

Conclusion

Toy datasets, with their simplicity and structured format, play a crucial role in the field of machine learning, particularly in education and preliminary testing. They offer an excellent starting point for beginners to understand fundamental concepts and for experts to test and benchmark new algorithms efficiently. The manageable size of these datasets allows for quick computational tasks and easy visualization, which are invaluable for instructional purposes and algorithm development.


