Introduction to fit_resamples

fit_resamples is part of the tune package in the tidymodels framework, which complements parsnip for model evaluation. It allows users to fit a model across a variety of resampling schemes, including cross-validation and bootstrapping, and to collect performance metrics from each resample.
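
In outline, you give fit_resamples a model specification (or workflow) and an rset of resamples created with rsample. The following is a minimal sketch of that pattern on the built-in mtcars data; the object names (boots, lm_spec, boot_results) are illustrative, not part of any fixed API.

R
library(tidymodels)  # attaches parsnip, rsample, tune, yardstick, ...

# Evaluate a linear model over 25 bootstrap resamples of mtcars
boots <- bootstraps(mtcars, times = 25)
lm_spec <- linear_reg() %>% set_engine("lm")

boot_results <- fit_resamples(lm_spec, mpg ~ ., resamples = boots)
collect_metrics(boot_results)  # average RMSE and R-squared across resamples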

Using Custom Split Data

While helper functions from rsample such as vfold_cv can generate resampled data for you, you may want to supply custom splits for specific reasons, such as replicating a previous analysis or ensuring consistency across experiments. In the example below, we fit a linear regression model to a synthetic housing dataset using 5-fold cross-validation.

Here is a step-by-step guide to using fit_resamples with custom split data:

Step 1: Generate Example Dataset

Let’s create a synthetic dataset for our example. We’ll generate random values for features like the number of bedrooms, bathrooms, and square footage, as well as the corresponding house prices.

R
library(tidymodels)  # loads parsnip, rsample, tune, and yardstick
# Set seed for reproducibility
set.seed(123)

# Number of observations
n <- 100

# Generate synthetic data
bedrooms <- sample(1:5, n, replace = TRUE)
bathrooms <- sample(1:3, n, replace = TRUE)
sq_footage <- rnorm(n, mean = 2000, sd = 500)
price <- 100000 + 50000 * bedrooms + 30000 * bathrooms + 100 * sq_footage +
  rnorm(n, mean = 0, sd = 50000)

# Create dataset
house_data <- data.frame(Bedrooms = bedrooms, Bathrooms = bathrooms, 
                         SqFootage = sq_footage, Price = price)
head(house_data)

Output:

  Bedrooms Bathrooms SqFootage    Price
1        3         1  2281.495 562189.5
2        3         3  1813.781 552915.8
3        2         1  2488.487 473166.7
4        2         3  1812.710 394625.9
5        3         2  2526.356 536579.7
6        5         2  1475.411 533047.6

Step 2: Split the Data

Keep the outcome in the data frame: tidymodels expects the full dataset inside each resample, so there is no need to separate predictors from the target. Create the custom split indices for 5-fold cross-validation directly from house_data, stratifying on the outcome.

R
# Create 5-fold cross-validation splits, stratified on the outcome
# so every fold covers a similar range of prices
folds <- vfold_cv(house_data, v = 5, strata = Price)

Step 3: Define the Model

Specify the model you want to fit. In this example, we’ll use a linear regression model.

R
# Define the model
model <- linear_reg() %>%
  set_engine("lm")

Step 4: Fit the Model

Use fit_resamples to fit the model on each resample. Because we pass a bare model specification, we also supply a model formula as the preprocessor, along with the metrics used to evaluate model performance.

R
# Fit the model with custom split data
results <- fit_resamples(
  model,
  Price ~ Bedrooms + Bathrooms + SqFootage,
  resamples = folds,
  metrics = metric_set(rmse, mae)
)

Step 5: Analyze Results

Review the results, including performance metrics for each fold of the cross-validation.

R
# View results
results

Output:

# A tibble: 5 × 4
  splits          id    .metrics         .notes
  <list>          <chr> <list>           <list>
1 <split [80/20]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]>
2 <split [80/20]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]>
3 <split [80/20]> Fold3 <tibble [2 × 4]> <tibble [0 × 3]>
4 <split [80/20]> Fold4 <tibble [2 × 4]> <tibble [0 × 3]>
5 <split [80/20]> Fold5 <tibble [2 × 4]> <tibble [0 × 3]>

The results object is a tibble with one row per fold of the cross-validation. It stores the performance metrics requested above, Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE), computed on the assessment (validation) set of each fold.

  • splits: The rsplit object holding the analysis (training) and assessment (validation) row indices for each fold.
  • id: The fold identifier.
  • .metrics: A tibble containing the evaluation metrics computed on the assessment set of each fold.
  • .notes: Warnings or errors captured while fitting, if any.

By default the held-out predictions are not stored; pass control = control_resamples(save_pred = TRUE) to fit_resamples if you also want a .predictions column.
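
To roll these per-fold metrics up into a single summary, pass the results object from Step 4 to collect_metrics; a short example:

R
# Average RMSE and MAE across the five assessment sets,
# with the number of resamples and a standard error for each metric
collect_metrics(results)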

Resamples with Custom Split Data in R

To create fully custom resamples, you can build the individual rsplit objects yourself and then bundle them into an rset with manual_rset, rather than relying on vfold_cv or similar helpers. Here’s how you can do it:

R
library(tidymodels)
library(rsample)
library(dplyr)

data("mtcars")

set.seed(123)
data_split <- initial_split(mtcars, prop = 0.8)
train_data <- training(data_split)
test_data <- testing(data_split)

# Custom resample function
custom_resample <- function(data, prop = 0.8) {
  initial_split(data, prop = prop)
}

# Generate custom resamples
custom_splits <- list()
for (i in 1:5) {
  custom_splits[[i]] <- custom_resample(train_data)
}

# Create an rset object using manual_rset
custom_folds <- manual_rset(
  splits = custom_splits,
  ids = paste0("Fold", 1:5)
)

# Define model and workflow
linear_model <- linear_reg() %>%
  set_engine("lm")

mpg_recipe <- recipe(mpg ~ ., data = train_data)

mpg_workflow <- workflow() %>%
  add_model(linear_model) %>%
  add_recipe(mpg_recipe)

# Fit resamples
resample_results <- fit_resamples(
  mpg_workflow,
  resamples = custom_folds,
  metrics = metric_set(rmse, rsq)
)

# Evaluate results
resample_results %>% 
  collect_metrics()

Output:

# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>
1 rmse    standard   4.87      5   1.07  Preprocessor1_Model1
2 rsq     standard   0.534     5   0.180 Preprocessor1_Model1

  1. .metric: This column indicates the type of metric used to evaluate the model. In this case, there are two metrics:
    • rmse (Root Mean Squared Error)
    • rsq (R-squared)
  2. .estimator: This column shows the type of estimator used. Here, it is standard, indicating a standard regression estimator.
  3. mean: This column provides the mean value of the metric across all the resamples. It is the average performance of the model according to the specific metric:
    • For rmse, the mean value is 4.87. This means that, on average, the root mean squared error of the model’s predictions is 4.87. RMSE is a measure of the differences between predicted and observed values; lower values indicate better model performance.
    • For rsq, the mean value is 0.534. R-squared measures the proportion of variance in the dependent variable that is predictable from the independent variables. A value of 0.534 means that approximately 53.4% of the variance in the response variable can be explained by the model. Higher values indicate better model performance.
  4. n: This column indicates the number of resamples used in the evaluation. In this case, n = 5 for both metrics, meaning the evaluation is based on 5 resamples.
  5. std_err: This column shows the standard error of the mean for each metric. The standard error measures the variability of the metric estimate:
    • For rmse, the standard error is 1.07, indicating the variability of the RMSE across the 5 resamples.
    • For rsq, the standard error is 0.180, indicating the variability of the R-squared value across the 5 resamples.
  6. .config: This column provides a configuration identifier for the model and preprocessing steps used. Here, it is labeled as Preprocessor1_Model1, indicating the first configuration of preprocessor and model used in the workflow.
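
If you want to see the individual fold values behind these means and standard errors, collect_metrics can skip the summarizing step; for example:

R
# One row per resample and metric instead of the aggregated summary
resample_results %>%
  collect_metrics(summarize = FALSE)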

Interpreting the Results

  • Root Mean Squared Error (RMSE): With a mean RMSE of 4.87 and a standard error of 1.07, the model’s predictions deviate from the actual values by roughly 4.87 mpg on average. The standard error shows noticeable fold-to-fold variability in this error, which is expected with only five resamples of a small training set.
  • R-squared (R²): With a mean R-squared value of 0.534 and a standard error of 0.180, the model explains about 53.4% of the variance in the outcome variable. This suggests a moderate fit, indicating that while the model captures some of the variability in the data, there is still a significant portion of the variance unexplained.
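
To look at the out-of-sample predictions themselves rather than just the metrics, you can ask fit_resamples to keep them. This is a minimal sketch reusing the mpg_workflow and custom_folds objects defined above; pred_results is just an illustrative name.

R
# Refit, keeping the assessment-set predictions for each resample
pred_results <- fit_resamples(
  mpg_workflow,
  resamples = custom_folds,
  metrics = metric_set(rmse, rsq),
  control = control_resamples(save_pred = TRUE)
)

# One row per held-out observation, with .pred alongside the observed mpg
collect_predictions(pred_results)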

Conclusion

The results from the fit_resamples function provide an overview of the model’s performance using custom resamples. The RMSE indicates the average prediction error, while the R-squared value indicates the proportion of variance explained by the model. These metrics, along with their standard errors, help in assessing the consistency and reliability of the model. If the model’s performance is satisfactory, you can proceed with further analysis or model deployment; if not, you may consider improving the model by adjusting the features, model parameters, or preprocessing steps.
