Learning Curve To Identify Overfit & Underfit

A learning curve is a graphical representation of how a model’s performance improves with experience, such as more training data or more training iterations. It can also reveal whether a model is learning well, overfitting, or underfitting.

In this article, we’ll look at how to identify underfitted and overfitted models using learning curves.

Table of Contents

  • Understanding Learning Curve
  • Identifying Overfitting and Underfitting Using Learning Curves
  • Implementation of Learning Curve To Identify Overfitting and Underfitting
    • Learning Curve of a Well-fitted Model
    • Learning Curve of an Overfit Model
    • Learning Curve of an Underfit Model

Understanding Learning Curve

Learning curves are graphical representations that illustrate how a model’s performance changes with increasing experience, typically measured by the amount of training data it has processed. The x-axis of a learning curve usually represents the amount of training data or the number of training iterations, while the y-axis represents a performance metric, such as error or accuracy.

It helps in diagnosing overfitting or underfitting by showing how the model’s error changes as it learns, guiding decisions on improving model training through adjustments in complexity or training data size.
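
Scikit-learn also provides a learning_curve utility that computes cross-validated training and validation scores at increasing training-set sizes, which is often the quickest way to plot one. Here is a minimal sketch; the Ridge estimator, the size grid, and 5-fold cross-validation are illustrative choices rather than a fixed recipe.

Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

data = fetch_california_housing()
X, y = data.data, data.target

# Cross-validated scores at 10 increasing training-set sizes
sizes, train_scores, val_scores = learning_curve(
    Ridge(alpha=1.0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=5,
    scoring="neg_root_mean_squared_error")

# Scores are negated RMSE; flip the sign and average across folds
plt.plot(sizes, -train_scores.mean(axis=1), label='Training RMSE')
plt.plot(sizes, -val_scores.mean(axis=1), label='Validation RMSE', linestyle='--')
plt.xlabel('Training set size')
plt.ylabel('RMSE')
plt.legend()
plt.grid(True)
plt.show()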

Identifying Overfitting and Underfitting Using Learning Curves

Learning curves visually depict the model’s performance on both the training and validation sets over time. By analyzing these curves, we can identify overfitting and underfitting (a rough programmatic rule of thumb follows this list):

  • Overfitting:
    • The training accuracy is high and remains stable or even increases.
    • The validation accuracy is significantly lower than the training accuracy and may even decrease over time.
    • This indicates that the model is memorizing the training data instead of learning the general patterns.
  • Underfitting:
    • Both the training and validation accuracies are low and remain relatively constant.
    • This suggests that the model is unable to capture the essential features of the data, leading to poor performance on both sets.
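
These visual criteria can also be expressed as a rough programmatic rule of thumb, as the sketch below shows. The gap and score thresholds are arbitrary illustrative values; sensible choices depend on the metric and the problem.

Python
def diagnose_fit(train_score, val_score, gap_tol=0.10, low_score=0.60):
    """Rough heuristic on accuracy-like scores; thresholds are illustrative."""
    if train_score < low_score and val_score < low_score:
        return "underfitting: both scores are low"
    if train_score - val_score > gap_tol:
        return "overfitting: large train/validation gap"
    return "reasonable fit"

# Hypothetical accuracy scores
print(diagnose_fit(0.98, 0.75))  # overfitting
print(diagnose_fit(0.55, 0.52))  # underfitting
print(diagnose_fit(0.88, 0.85))  # reasonable fit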

Implementation of Learning Curve To Identify Overfitting and Underfitting

Here, we’ll demonstrate how learning curves can help identify overfitting and underfitting using the California Housing dataset, a popular dataset for regression tasks. These learning curves will visualize how the model’s performance evolves as it learns from a training set, compared to its performance on a validation set that it hasn’t seen during training. We will examine well-fitted, overfit and underfit models, focusing on training loss and validation loss to gain insights into their behaviors.

Learning Curve of a Well-fitted Model

Let’s generate a learning curve for a well-fitted model using the California Housing dataset and Ridge regression. The necessary libraries are imported for data loading, preprocessing, model creation, evaluation, and visualization. A Ridge regression model is defined with an alpha value of 1.0; alpha is a regularization parameter that controls the model’s complexity and helps prevent overfitting. Two empty lists, train_losses and val_losses, are created to store the training and validation losses throughout the training process. The model is then fitted over a range of epochs; in each epoch, it is refit to the training data with the Ridge regression algorithm. (Ridge has a closed-form solution, so every refit yields the same coefficients; the loop simply traces the losses across epochs.)

For each epoch, the model makes predictions on the training data (train_pred = model.predict(X_train_scaled)) and computes the training loss as the root mean squared error, i.e., the square root of mean_squared_error. The validation loss is computed the same way on the held-out test set.

Python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from math import sqrt
import matplotlib.pyplot as plt

# Load the California housing dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features by removing the mean and scaling to unit variance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Ridge regression model
model = Ridge(alpha=1.0)

# Lists to store training and validation losses
train_losses = []
val_losses = []

# Train the model over a range of epochs
epochs = range(1, 101)
for epoch in epochs:
    model.fit(X_train_scaled, y_train)
    
    # Predict and calculate the training loss
    train_pred = model.predict(X_train_scaled)
    train_loss = sqrt(mean_squared_error(y_train, train_pred))
    train_losses.append(train_loss)
    
    # Predict and calculate the validation loss
    val_pred = model.predict(X_test_scaled)
    val_loss = sqrt(mean_squared_error(y_test, val_pred))
    val_losses.append(val_loss)

# Plotting the learning curve
plt.figure(figsize=(10, 6))
plt.plot(epochs, train_losses, label='Training RMSE')
plt.plot(epochs, val_losses, label='Validation RMSE', linestyle='--')
plt.title('Learning Curve of a Good Fit Model on California Housing Dataset')
plt.xlabel('Epochs')
plt.ylabel('Root Mean Squared Error (RMSE)')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

Output:

[Figure: learning curve of the well-fitted Ridge model, training vs. validation RMSE over epochs]

The x-axis represents the epochs, which are the number of times the model has been trained on the data. The y-axis represents the root mean squared error (RMSE), which is a measure of how well the model is performing.

  • The graph shows that the training RMSE and the validation RMSE stay low and close to each other across epochs. Because Ridge regression has a closed-form solution, refitting on the same data each epoch changes nothing, so both curves are essentially flat; the important signal is that the model performs comparably on the data it was trained on and on unseen data.
  • The RMSE metric is used here to quantify the model’s accuracy, with lower values indicating better performance. The training RMSE reflects how well the model fits the training data, while the validation RMSE indicates how well the model generalizes to unseen data.
  • Throughout the training process, the gap between the training and validation RMSE remains relatively small, suggesting that the model achieves a good balance between fitting the training data and generalizing to new data. This balance is crucial for avoiding overfitting and underfitting, making it a well-fitted model for the California housing dataset.
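
Since alpha is what keeps this model’s complexity in check, it can be instructive to sweep it and watch the training and validation errors move. The sketch below uses scikit-learn’s validation_curve; the alpha range and 5-fold cross-validation are illustrative assumptions.

Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = fetch_california_housing()
X, y = data.data, data.target

# Sweep the regularization strength over several orders of magnitude
alphas = np.logspace(-2, 4, 13)
train_scores, val_scores = validation_curve(
    make_pipeline(StandardScaler(), Ridge()), X, y,
    param_name="ridge__alpha", param_range=alphas,
    scoring="neg_root_mean_squared_error", cv=5)

plt.semilogx(alphas, -train_scores.mean(axis=1), label='Training RMSE')
plt.semilogx(alphas, -val_scores.mean(axis=1), label='Validation RMSE', linestyle='--')
plt.xlabel('alpha')
plt.ylabel('RMSE')
plt.legend()
plt.grid(True)
plt.show()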

Learning Curve of an Overfit Model

An overfit model performs well on the training data but fails to generalize to unseen data, such as a validation set. This phenomenon occurs when the model learns the noise in the training data instead of the underlying pattern, leading to high variance. To simulate an overfit model with the California Housing dataset, we can use a complex model, like a deep neural network with many layers and neurons, or a high-degree polynomial regression without proper regularization. Training this model for many epochs or iterations can lead to overfitting.
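
As a quick illustration of the polynomial route, the sketch below fits unregularized polynomial regressions of increasing degree on a single feature (median income). The feature choice and degree range are illustrative assumptions; the point is only that training error keeps falling while validation error eventually stops improving.

Python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

data = fetch_california_housing()
X, y = data.data[:, [0]], data.target  # single feature: median income
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

degrees = range(1, 11)
train_rmse, val_rmse = [], []
for d in degrees:
    # Expand the feature into powers up to degree d, then fit without regularization
    model = make_pipeline(PolynomialFeatures(degree=d), StandardScaler(), LinearRegression())
    model.fit(X_train, y_train)
    train_rmse.append(mean_squared_error(y_train, model.predict(X_train)) ** 0.5)
    val_rmse.append(mean_squared_error(y_val, model.predict(X_val)) ** 0.5)

plt.plot(degrees, train_rmse, label='Training RMSE')
plt.plot(degrees, val_rmse, label='Validation RMSE', linestyle='--')
plt.xlabel('Polynomial degree')
plt.ylabel('RMSE')
plt.legend()
plt.show()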

Let’s plot a curve that exhibits overfitting on the California Housing dataset using a DecisionTreeRegressor. Rather than training over epochs, we increase the model’s capacity directly by growing the tree’s maximum depth. This setup is prone to overfitting because of:

  • DecisionTreeRegressor: a flexible, non-parametric model; given enough depth, it can memorize the training data almost perfectly instead of learning generalizable patterns.
  • max_depth: this parameter controls the tree’s complexity. Shallow trees are heavily constrained, while deep trees partition the feature space finely enough to fit noise in the training set.

For each depth in the range, the model is fitted on the training set, and the mean squared error is recorded on both the training and validation sets. Both errors are then plotted (as RMSE) on the same graph, with tree depth on the x-axis and RMSE on the y-axis.

Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

data = fetch_california_housing()
X, y = data.data, data.target
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

def plot_learning_curves(max_depth_range):
    train_errors, val_errors = [], []
    
    for depth in max_depth_range:
        # Train a decision tree regressor at the given depth
        model = DecisionTreeRegressor(max_depth=depth, random_state=42)
        model.fit(X_train, y_train)
        
        # Make predictions on both training and validation sets
        y_train_predict = model.predict(X_train)
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train, y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
    
    plt.plot(max_depth_range, np.sqrt(train_errors), "r-+", linewidth=2, label="Training error")
    plt.plot(max_depth_range, np.sqrt(val_errors), "b-", linewidth=3, label="Validation error")
    plt.title("Learning Curves (Decision Tree)")
    plt.xlabel("Tree Depth")
    plt.ylabel("RMSE")
    plt.legend()
    plt.show()

# Range of `max_depth` values to explore
max_depth_range = range(1, 20)
plot_learning_curves(max_depth_range)

Output:

[Figure: training vs. validation RMSE as tree depth increases]

For an overfit model, the training error decreases continuously, approaching zero as the tree grows deeper, indicating that the model is fitting the training data very closely. Initially, the validation error also decreases, reflecting genuine improvements in generalization. However, past a certain depth the model starts to overfit: the validation error stops decreasing and may even begin to rise, signaling that performance on unseen data is deteriorating.

A significant gap between the training and validation loss curves is a hallmark of overfitting. The model performs exceptionally well on the training set but much worse on the validation set.
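
Overfitting can also be watched over training epochs instead of model capacity. Below is a minimal sketch with an intentionally over-capacity MLPRegressor, trained one epoch per fit() call via max_iter=1 and warm_start=True so each call continues from the previous weights; the layer sizes and the 200-epoch budget are illustrative choices, and with enough capacity the validation curve typically flattens or turns upward while the training curve keeps falling.

Python
import warnings
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

warnings.filterwarnings("ignore")  # silence the per-epoch convergence warnings

data = fetch_california_housing()
X_train, X_val, y_train, y_val = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)

# Three hidden layers of 100 neurons: deliberately more capacity than needed
model = MLPRegressor(hidden_layer_sizes=(100, 100, 100), max_iter=1,
                     warm_start=True, random_state=42)

train_mse, val_mse = [], []
for epoch in range(200):
    model.fit(X_train, y_train)  # one more epoch thanks to warm_start
    train_mse.append(mean_squared_error(y_train, model.predict(X_train)))
    val_mse.append(mean_squared_error(y_val, model.predict(X_val)))

plt.plot(train_mse, label='Training MSE')
plt.plot(val_mse, label='Validation MSE', linestyle='--')
plt.xlabel('Epochs')
plt.ylabel('MSE')
plt.legend()
plt.show()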

Learning Curve of an Underfit Model

Conversely, an underfit model is too simple to capture the underlying pattern in the data, leading to high bias. This model performs poorly on both the training and validation sets because it cannot learn the complexity of the data.

To simulate an underfit model with the California Housing dataset, we can use a simplistic model, such as linear regression on data with non-linear relationships, or a shallow neural network with very few neurons. As the training set grows, the training error rises from near zero (a simple model fits a handful of points exactly) and plateaus at a high value, while the validation error falls and plateaus at a similarly high level. The two curves converge, but at an error that remains high, because the model lacks the complexity to capture the underlying relationships in the data.

Python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
import matplotlib.pyplot as plt

# Load the California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Model selection: A simplistic model that is likely to underfit the training data
model = LinearRegression()

# Prepare lists to store the mean squared errors for training and validation sets
train_errors, val_errors = [], []

# Train the model on increasing subsets and monitor losses
# (evaluate every 100 samples to keep the runtime manageable)
train_sizes = range(2, len(X_train), 100)
for m in train_sizes:
    model.fit(X_train[:m], y_train[:m])
    y_train_predict = model.predict(X_train[:m])
    y_val_predict = model.predict(X_val)
    train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
    val_errors.append(mean_squared_error(y_val, y_val_predict))

# Plot learning curves
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, np.sqrt(train_errors), label='Training RMSE')
plt.plot(train_sizes, np.sqrt(val_errors), label='Validation RMSE', linestyle='--')
plt.title('Learning Curves (Underfitting Scenario)')
plt.xlabel('Training set size')
plt.ylabel('RMSE')
plt.legend()
plt.show()

Output:

[Figure: learning curves of the underfit linear model, RMSE vs. training set size]

The learning curves visualize this phenomenon: the training error climbs quickly from near zero and plateaus, while the validation error falls but remains high, so the two curves converge at a high level of error. This highlights the importance of choosing appropriately complex models and evaluating their performance on unseen data to avoid underfitting.

  • For an underfit model, the training error settles at a high plateau no matter how much data is added, indicating the model’s inability to fit the training data well.
  • The validation error mirrors the training error, remaining high and showing little improvement, which indicates poor performance on unseen data as well.

Unlike overfitting, the gap between the training and validation loss curves in underfitting is not significant. Both losses are high because the model lacks the capacity to learn the data’s complexity.

Conclusion

A large gap between training and validation performance may indicate overfitting, whereas poor scores on both sets may signal underfitting. By assessing a model’s learning curve, one can make informed decisions about whether to collect more data, try a more complex or simpler model, or adjust the existing one for better performance.

Learning curves demystify a model’s behavior, guiding improvements and helping ensure the model is robust and reliable.


