Implementing K-Fold Cross-Validation on a Model
Let’s see how the model’s predictions differ when K-Fold cross-validation is used versus when it is not. For this, I will use california_housing_test.csv.
Step 1: Import Necessary Libraries
First, we need to import the relevant libraries.
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
Step 2: Load the dataset
df = pd.read_csv("/content/sample_data/california_housing_test.csv")
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
median_house_value is our target variable; the rest of the features are the input columns.
Step 3: Preprocessing the dataset
label_encoder = LabelEncoder()
# Fit and transform the "ocean_proximity" column into integer codes
df['ocean_proximity_encoded'] = label_encoder.fit_transform(df['ocean_proximity'])
# Drop the original categorical column now that it is encoded
df.drop('ocean_proximity', axis=1, inplace=True)
# Forward-fill the missing values in total_bedrooms
df['total_bedrooms'] = df['total_bedrooms'].ffill()
df.head()
Output:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity_encoded
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 3
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 3
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 3
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 3
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 3
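To confirm the preprocessing worked, a quick missing-value check can be run. This is a minimal sketch assuming the same df as above; after the forward fill, every column should report zero nulls.
# Count remaining missing values per column; all should be 0 after ffill
print(df.isnull().sum())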
Step 4: Splitting the dataset
X = df.drop("median_house_value", axis=1)
y = df["median_house_value"]
Defining the Model: Without K-Fold Cross-Validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
score = r2_score(y_test, y_pred)
print(f"R2 Score: {score}")
Output:
R2 Score: 0.6114554518898516
Defining the Model: With K-Fold Cross-Validation
This code implements K-Fold cross-validation for a linear regression model whose target variable is median_house_value.
- The number of folds k is set to 5, and a KFold object kf is initialized with 5 splits, shuffling the data and fixing the random state for reproducibility.
- Next, the code iterates over each fold with a for loop. For each fold, it splits the data into training and testing sets using the indices provided by kf.split(X).
- Finally, the code computes the average R2 score across all folds by summing the scores and dividing by the number of folds.
X = df.drop("median_house_value", axis=1)
y = df["median_house_value"]
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)
scores = []
# Iterate over the splits
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # Initialize and train the model
    model = LinearRegression()
    model.fit(X_train, y_train)
    # Evaluate the model
    y_pred = model.predict(X_test)
    score = r2_score(y_test, y_pred)
    scores.append(score)
    print(f"Fold {fold+1} R2 Score: {score}")
# Calculate the average score
average_score = sum(scores) / len(scores)
print(f"Average R2 Score: {average_score}")
Output:
Fold 1 R2 Score: 0.6114554518898566
Fold 2 R2 Score: 0.6425719794066727
Fold 3 R2 Score: 0.6382892378835952
Fold 4 R2 Score: 0.6654790505178491
Fold 5 R2 Score: 0.6057229383411187
Average R2 Score: 0.6327037316078185
With K-Fold cross-validation, we evaluate the model several times on distinct subsets of the data, which yields a more trustworthy estimate of performance and helps detect overfitting or model instability. Without cross-validation, we assess the model’s performance on only one split of the data.
In the example above, the R2 score is 0.61 without K-Fold cross-validation and 0.63 with it.
- Without cross-validation, the R2 score of 0.61 means the model explains 61% of the variance in median_house_value on that single test split.
- With cross-validation, the average R2 score of 0.63 indicates slightly better and more stable performance, suggesting the model generalizes across different data splits.
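As an aside, scikit-learn’s cross_val_score helper can perform the same fold-by-fold evaluation in far fewer lines. This is a minimal sketch assuming X, y, and kf are defined as in the code above:
from sklearn.model_selection import cross_val_score

# Fits a fresh LinearRegression on each training fold and scores
# each test fold; scoring="r2" matches the manual loop above
cv_scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring="r2")
print("Per-fold R2 scores:", cv_scores)
print("Average R2 Score:", cv_scores.mean())
Because kf carries the same shuffle and random_state, this should reproduce the fold scores from the manual loop.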
How Does K-Fold Cross-Validation Prevent Overfitting in a Model?
In machine learning, accurately assessing how well a model performs and whether it can handle new data is crucial. Yet with limited data, or when generalization is a concern, a single train/test split may not be enough. That’s where cross-validation steps in: it rigorously tests predictive models by repeatedly splitting the data, training on one part, and testing on another. Among these methods, K-Fold cross-validation stands out as a reliable and popular choice.
Next, let’s look more closely at how the K-Fold cross-validation approach helps reduce overfitting in models.
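To make the splitting concrete, here is a small illustrative sketch (the 10-sample toy array is an assumption for demonstration) showing how KFold partitions row indices so that every sample lands in the test set exactly once:
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)  # 10 toy samples
kf_demo = KFold(n_splits=5, shuffle=False)
for fold, (train_idx, test_idx) in enumerate(kf_demo.split(X_demo)):
    # Each fold trains on 8 samples and tests on the remaining 2
    print(f"Fold {fold+1}: train={train_idx}, test={test_idx}")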