Create and Train the k-NN Model

Now, it’s time to create and train our k-NN model. We’ll use the `train()` function from the `caret` package with `method = "kknn"`, which fits a weighted (kernel) k-nearest neighbors classifier and tunes it by cross-validation.

R
# Load the required package and data
library(caret)
data(iris)

# Set a seed so the cross-validation folds are reproducible (any seed works)
set.seed(123)

# Create and train a k-NN model with 5-fold cross-validation
knn_spec <- train(
  Species ~ .,
  data = iris,
  method = "kknn",
  trControl = trainControl(method = "cv", number = 5, verboseIter = TRUE),
  tuneLength = 5
)


Output:

+ Fold1: kmax= 5, distance=2, kernel=optimal 
- Fold1: kmax= 5, distance=2, kernel=optimal 
+ Fold1: kmax= 7, distance=2, kernel=optimal 
- Fold1: kmax= 7, distance=2, kernel=optimal 
+ Fold1: kmax= 9, distance=2, kernel=optimal 
- Fold1: kmax= 9, distance=2, kernel=optimal 
+ Fold1: kmax=11, distance=2, kernel=optimal 
- Fold1: kmax=11, distance=2, kernel=optimal 
+ Fold1: kmax=13, distance=2, kernel=optimal 
- Fold1: kmax=13, distance=2, kernel=optimal 
+ Fold2: kmax= 5, distance=2, kernel=optimal 
- Fold2: kmax= 5, distance=2, kernel=optimal 
+ Fold2: kmax= 7, distance=2, kernel=optimal 
- Fold2: kmax= 7, distance=2, kernel=optimal 
+ Fold2: kmax= 9, distance=2, kernel=optimal 
- Fold2: kmax= 9, distance=2, kernel=optimal 
+ Fold2: kmax=11, distance=2, kernel=optimal 
- Fold2: kmax=11, distance=2, kernel=optimal 
+ Fold2: kmax=13, distance=2, kernel=optimal 
- Fold2: kmax=13, distance=2, kernel=optimal 
+ Fold3: kmax= 5, distance=2, kernel=optimal 
- Fold3: kmax= 5, distance=2, kernel=optimal 
+ Fold3: kmax= 7, distance=2, kernel=optimal 
- Fold3: kmax= 7, distance=2, kernel=optimal 
+ Fold3: kmax= 9, distance=2, kernel=optimal 
- Fold3: kmax= 9, distance=2, kernel=optimal 
+ Fold3: kmax=11, distance=2, kernel=optimal 
- Fold3: kmax=11, distance=2, kernel=optimal 
+ Fold3: kmax=13, distance=2, kernel=optimal 
- Fold3: kmax=13, distance=2, kernel=optimal 
+ Fold4: kmax= 5, distance=2, kernel=optimal 
- Fold4: kmax= 5, distance=2, kernel=optimal 
+ Fold4: kmax= 7, distance=2, kernel=optimal 
- Fold4: kmax= 7, distance=2, kernel=optimal 
+ Fold4: kmax= 9, distance=2, kernel=optimal 
- Fold4: kmax= 9, distance=2, kernel=optimal 
+ Fold4: kmax=11, distance=2, kernel=optimal 
- Fold4: kmax=11, distance=2, kernel=optimal 
+ Fold4: kmax=13, distance=2, kernel=optimal 
- Fold4: kmax=13, distance=2, kernel=optimal 
+ Fold5: kmax= 5, distance=2, kernel=optimal 
- Fold5: kmax= 5, distance=2, kernel=optimal 
+ Fold5: kmax= 7, distance=2, kernel=optimal 
- Fold5: kmax= 7, distance=2, kernel=optimal 
+ Fold5: kmax= 9, distance=2, kernel=optimal 
- Fold5: kmax= 9, distance=2, kernel=optimal 
+ Fold5: kmax=11, distance=2, kernel=optimal 
- Fold5: kmax=11, distance=2, kernel=optimal 
+ Fold5: kmax=13, distance=2, kernel=optimal 
- Fold5: kmax=13, distance=2, kernel=optimal 
Aggregating results
Selecting tuning parameters
Fitting kmax = 9, distance = 2, kernel = optimal on full training set

R
# Print the model
print(knn_spec)


Output:

k-Nearest Neighbors

150 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica'

No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 120, 120, 120, 120, 120
Resampling results across tuning parameters:

  kmax  Accuracy   Kappa
   5    0.9466667  0.92
   7    0.9533333  0.93
   9    0.9533333  0.93
  11    0.9466667  0.92
  13    0.9466667  0.92

Tuning parameter 'distance' was held constant at a value of 2
Tuning parameter 'kernel' was held constant at a value of optimal
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were kmax = 9, distance = 2 and kernel = optimal.
  • knn_spec <- train(…): This line creates a k-NN (k-nearest neighbors) model specification using the `train` function from the `caret` package. The `train` function is used for training various machine learning models.
  • Species ~ .: This formula specifies the target variable (Species) and the predictors (all other columns denoted by `.`) to be used in the model.
  • data = iris: This specifies the dataset to be used, in this case, the Iris dataset loaded using `data(iris)`.
  • method = “kknn”: Here, we specify the “kknn” method, which stands for kernel (weighted) k-nearest neighbors. This is a variation of the k-NN algorithm that uses a kernel function to weight each neighbor’s vote by its distance, so closer neighbors have more influence on the prediction.
  • trControl = trainControl(…): This part sets the control parameters for the training process. It specifies that we want to perform cross-validation (`method = “cv”`) with 5 folds (`number = 5`) and requests verbose output during the training process (`verboseIter = TRUE`).
  • tuneLength = 5: This tells `train` to evaluate five candidate values of the model’s main tuning parameter, `kmax` (the maximum number of neighbors). As the output above shows, the values tried were 5, 7, 9, 11, and 13, and the one giving the best cross-validated performance is selected.
  • print(knn_spec): Finally, we print the k-NN model specification to the console. This provides information about the model, including the method used, the tuning parameters, and other details.

This code sets up a k-NN classification model using the “kknn” method, performs cross-validation across different values of `kmax`, and prints information about the model. It’s common practice in machine learning to explore different hyperparameter values (like `k` in k-NN) to find the best model for a given problem. The resulting `knn_spec` object contains the trained and tuned k-NN model.
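As a quick, illustrative check (not part of the original output), you can call `caret`’s standard `predict()` on the fitted object. Rows 1, 51, and 101 are one flower of each species and are chosen here purely for illustration; in practice you would predict on a held-out test set rather than the training data.

R

# Hard class labels for one flower of each species
predict(knn_spec, newdata = iris[c(1, 51, 101), ])

# Class probabilities instead of labels
predict(knn_spec, newdata = iris[c(1, 51, 101), ], type = "prob")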

Predicting Multiple Outcomes with a k-NN Model Using tidymodels

When dealing with classification problems that involve multiple classes or outcomes, it’s essential to have a reliable method for making predictions. One popular algorithm for such tasks is k-Nearest Neighbors (k-NN). In this tutorial, we will walk you through the process of making predictions with multiple outcomes using a k-NN model in R, specifically with the tidymodels framework.

K-Nearest Neighbors (KNN) is a simple yet effective supervised machine learning algorithm used for classification and regression tasks. Here’s an explanation of KNN and some of its benefits:

K-Nearest Neighbors (KNN):

KNN is a non-parametric algorithm, meaning it doesn’t make any underlying assumptions about the distribution of data. It’s an instance-based or memory-based learning algorithm, which means it memorizes the entire training dataset and uses it to make predictions. The fundamental idea behind KNN is to classify a new data point by considering the majority class among its K nearest neighbors.
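To make the majority-vote idea concrete, here is a minimal base-R sketch (an illustration added for clarity; the `new_point` measurements are made up) that classifies a single new flower by hand:

R

# Classify one new observation by majority vote among its k nearest
# training points, using Euclidean distance
data(iris)
k <- 5
new_point <- c(5.0, 3.5, 1.5, 0.3)  # hypothetical sepal/petal measurements

# Distance from the new point to every row of the training data
dists <- apply(iris[, 1:4], 1, function(row) sqrt(sum((row - new_point)^2)))

# Take the k closest rows and vote on their Species labels
neighbors <- iris$Species[order(dists)[1:k]]
names(which.max(table(neighbors)))  # the predicted class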

Tidymodels

Tidymodels is a powerful and user-friendly ecosystem for modeling and machine learning in R. It provides a structured workflow for creating, tuning, and evaluating models. Before we proceed, make sure you have tidymodels and the necessary packages installed. You can install them using:
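R

# Standard CRAN installation commands (run once)
install.packages("tidymodels")
install.packages("kknn")  # the engine used by parsnip's nearest_neighbor()

Once installed, a minimal tidymodels version of the model above might look like the following sketch. The object names (`knn_tm_spec`, `knn_tm_fit`) are illustrative, and `neighbors = 9` simply reuses the value that cross-validation selected earlier; this is a sketch of the `parsnip` interface, not a tuned replacement for the `caret` workflow above.

R

library(tidymodels)

# Specify a k-NN classifier that uses the kknn engine
knn_tm_spec <- nearest_neighbor(mode = "classification", neighbors = 9) %>%
  set_engine("kknn")

# Fit on the iris data and predict one flower of each species
knn_tm_fit <- fit(knn_tm_spec, Species ~ ., data = iris)
predict(knn_tm_fit, new_data = iris[c(1, 51, 101), ])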
