Understanding PyTorch Learning Rate Scheduling

PyTorch is one of the most widely used deep learning frameworks, valued for its dynamic computational graph and approachable interface. Among the hyperparameters that shape model training, the learning rate demands particular care: it controls how far the optimizer moves the weights at every update. PyTorch provides learning rate schedulers to adjust this value as training progresses. This article explains the PyTorch learning rate scheduler, covering its syntax, its parameters, and its role in making model training more efficient and effective.

PyTorch Learning Rate Scheduler

PyTorch is an open-source machine learning library developed by Facebook’s AI Research lab (FAIR). It is popular for its dynamic computation graph and ease of use, and it has become a go-to framework for building and training deep learning models. Its flexibility makes it particularly well suited to research and experimentation, where practitioners need to iterate quickly and try new approaches.

What is a Learning Rate Scheduler?

The learning rate is the hyperparameter that controls the step size during optimization. PyTorch provides a mechanism, the learning rate scheduler, to adjust this hyperparameter as training progresses. A scheduler is not part of the optimizer; it wraps an existing optimizer and rewrites the learning rate stored in the optimizer's parameter groups according to a predefined policy. The typical pattern is to instantiate an optimizer, pass it to a scheduler, and then call scheduler.step() once per epoch (or once per batch, for schedulers designed that way) after optimizer.step(). Each scheduler class exposes its own parameters, so its behavior can be tailored to specific training requirements.
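
The instantiate-then-step pattern described above can be sketched as follows. This is a minimal illustration, not part of the article's example: toy_model is a hypothetical stand-in network, and the values lr=0.1, step_size=2 and gamma=0.1 are chosen only to make the changes easy to see.

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler

toy_model = nn.Linear(4, 1)  # hypothetical stand-in for a real network
optimizer = optim.SGD(toy_model.parameters(), lr=0.1)
scheduler = lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)

lrs = []
for epoch in range(6):
    lrs.append(optimizer.param_groups[0]["lr"])  # rate in effect this epoch
    # ... forward pass and loss.backward() would go here in real training ...
    optimizer.step()   # update the weights first
    scheduler.step()   # then advance the schedule

print(lrs)  # roughly 0.1 for two epochs, then 0.01, then 0.001
```

The scheduler never touches the model directly. It only rewrites the "lr" entry in optimizer.param_groups, which the optimizer reads on its next step.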

Parameters and their Significance

  • optimizer: The optimizer whose learning rate the scheduler adjusts. Every scheduler takes it as its first argument.
  • step_size: Used by StepLR. The number of epochs (more precisely, calls to scheduler.step()) between successive adjustments of the learning rate.
  • gamma: The multiplicative factor applied to the learning rate at each adjustment. A value below 1 decays the learning rate; for example, gamma=0.5 halves it.
  • last_epoch: The index of the last completed epoch. The default of -1 means training starts from the beginning; other values are used when resuming training from a checkpoint.
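
How step_size and gamma combine can be written as a closed form: the learning rate at a given epoch is the initial rate multiplied by gamma once for every completed block of step_size epochs. The helper below is a hypothetical illustration in plain Python, not a PyTorch function, and the numbers passed to it are illustrative.

```python
def step_lr(base_lr: float, step_size: int, gamma: float, epoch: int) -> float:
    """Closed-form step decay: base_lr * gamma ** (epoch // step_size).

    Hypothetical helper for illustration; it mirrors what a step-decay
    schedule computes but is not part of PyTorch.
    """
    return base_lr * gamma ** (epoch // step_size)


# Illustrative values: start at 0.001 and halve every 20 epochs
for epoch in (0, 19, 20, 39, 40, 49):
    print(epoch, step_lr(0.001, 20, 0.5, epoch))
```

With these values the rate stays at 0.001 for epochs 0 to 19, drops to 0.0005 for epochs 20 to 39, and to 0.00025 from epoch 40 onward.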

Need for a Learning Rate Scheduler

The need for learning rate schedulers follows from how training behaves over time. A fixed learning rate that is large enough to make fast progress early on may overshoot minima later, while one small enough to settle into a minimum makes early training slow. Schedulers address this by changing the learning rate as training proceeds. Most of them, such as StepLR, follow a predefined schedule based only on the epoch count; a few, such as ReduceLROnPlateau, react to a monitored metric like the validation loss. In both cases the goal is to avoid divergence, speed up convergence, and help the optimizer settle on good model parameters.
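
As a sketch of the metric-driven case, ReduceLROnPlateau lowers the learning rate when a monitored value stops improving. The validation losses below are made-up numbers and toy_model is a hypothetical stand-in; only the scheduler's reaction to the plateau matters here.

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler

toy_model = nn.Linear(4, 1)  # hypothetical stand-in model
optimizer = optim.SGD(toy_model.parameters(), lr=0.1)
# Halve the rate once the metric has failed to improve for more than 2 epochs
scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=2)

val_losses = [1.0, 0.8, 0.8, 0.8, 0.8, 0.7]  # made-up validation losses
lrs = []
for val_loss in val_losses:
    optimizer.step()          # a real loop would train for one epoch here
    scheduler.step(val_loss)  # this scheduler needs the monitored metric
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs)
```

The loss stalls at 0.8 for three epochs after its last improvement, which exceeds patience=2, so the rate is halved from 0.1 to 0.05 on the fifth epoch.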

Demonstrating PyTorch Learning Rate Scheduling


Importing libraries


import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset
from sklearn.preprocessing import StandardScaler

Loading dataset

Download the dataset and save it as breast-cancer.csv in the working directory.


df = pd.read_csv("breast-cancer.csv")
print(df.head())  # Show the first five rows


         id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
0    842302         M        17.99         10.38          122.80     1001.0
1    842517         M        20.57         17.77          132.90     1326.0
2  84300903         M        19.69         21.25          130.00     1203.0
3  84348301         M        11.42         20.38           77.58      386.1
4  84358402         M        20.29         14.34          135.10     1297.0

   smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
0          0.11840           0.27760          0.3001              0.14710
1          0.08474           0.07864          0.0869              0.07017
2          0.10960           0.15990          0.1974              0.12790
3          0.14250           0.28390          0.2414              0.10520
4          0.10030           0.13280          0.1980              0.10430

   ...  radius_worst  texture_worst  perimeter_worst  area_worst  \
0  ...         25.38          17.33           184.60      2019.0
1  ...         24.99          23.41           158.80      1956.0
2  ...         23.57          25.53           152.50      1709.0
3  ...         14.91          26.50            98.87       567.7
4  ...         22.54          16.67           152.20      1575.0

   smoothness_worst  compactness_worst  concavity_worst  concave points_worst  \
0            0.1622             0.6656           0.7119                0.2654
1            0.1238             0.1866           0.2416                0.1860
2            0.1444             0.4245           0.4504                0.2430
3            0.2098             0.8663           0.6869                0.2575
4            0.1374             0.2050           0.4000                0.1625

   symmetry_worst  fractal_dimension_worst
0          0.4601                  0.11890
1          0.2750                  0.08902
2          0.3613                  0.08758
3          0.6638                  0.17300
4          0.2364                  0.07678

[5 rows x 32 columns]

Data extraction and encoding

  • X is a DataFrame containing features, excluding the “diagnosis” and “id” columns from the original DataFrame df.
  • y is a Series containing the target variable “diagnosis” from the original DataFrame df.
  • The values of y are mapped to numbers: ‘M’ (malignant) becomes 1 and ‘B’ (benign) becomes 0.


X = df.drop(["diagnosis", "id"], axis=1)
y = df["diagnosis"]
y = y.map({"M": 1, "B": 0})

Train test split and standardisation

  • The train_test_split function from scikit-learn is used to split the dataset (X and y) into training and testing sets.
  • X_train and X_test are the training and testing sets of features, respectively.
  • Y_train and Y_test are the corresponding training and testing sets of target labels.
  • A StandardScaler instance is created, which is a preprocessing step to standardize the features.
  • X_train_std is obtained by fitting the scaler on X_train and then transforming it. After this step, each feature in the training data has a mean of 0 and a standard deviation of 1.
  • X_test_std is standardized using the parameters learned from the training data (X_train), ensuring consistency in the scaling process.
  • random_state=2 is set for reproducibility. This ensures that if you run the code multiple times, you get the same train-test split.


X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=2)
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)
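
The effect of fitting the scaler on the training split only can be checked on synthetic data. The sketch below uses randomly generated features (an assumption for illustration, not the article's dataset) and hypothetical names such as X_demo and demo_scaler.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic features
y_demo = rng.integers(0, 2, size=200)                     # synthetic labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2
)

demo_scaler = StandardScaler()
X_tr_std = demo_scaler.fit_transform(X_tr)  # learn mean/std from training data only
X_te_std = demo_scaler.transform(X_te)      # reuse the training statistics

print("train mean:", X_tr_std.mean(axis=0).round(6))
print("train std: ", X_tr_std.std(axis=0).round(6))
print("test mean: ", X_te_std.mean(axis=0).round(3))
```

The training features come out with mean 0 and standard deviation 1, while the test features are only approximately centred. That is expected: the test set is scaled with statistics it did not contribute to, which keeps information about the test data out of training.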

Tensor dataset and Dataloader

  • The NumPy array X_train_std and the values of the pandas Series Y_train (Y_train.values) are converted to PyTorch tensors using torch.FloatTensor.
  • Y_train_tensor is reshaped using .view(-1, 1) so that it matches the shape of the model output. The -1 tells PyTorch to infer the number of rows from the length of the array, and 1 indicates a single column.
  • The test set features (X_test_std) and target labels (Y_test) are converted and reshaped in the same way.
  • A TensorDataset is created for the training data, combining the features (X_train_std_tensor) and targets (Y_train_tensor) into a single dataset.
  • DataLoader is then used to create an iterator over the dataset with a specified batch size of 32 and shuffling the data (shuffle=True).


X_train_std_tensor = torch.FloatTensor(X_train_std)
Y_train_tensor = torch.FloatTensor(Y_train.values).view(-1, 1)
X_test_std_tensor = torch.FloatTensor(X_test_std)
Y_test_tensor = torch.FloatTensor(Y_test.values).view(-1, 1)
train_dataset = TensorDataset(X_train_std_tensor, Y_train_tensor)
train_loader = DataLoader(dataset=train_dataset, batch_size=32, shuffle=True)
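
To see what the DataLoader yields, the sketch below builds a loader over random tensors and records the shape of every batch. The data is synthetic and the names features, labels and demo_loader are hypothetical.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.randn(100, 30)                           # 100 samples, 30 features
labels = torch.randint(0, 2, (100,)).float().view(-1, 1)  # column of 0/1 targets

demo_loader = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)

batch_shapes = [(tuple(xb.shape), tuple(yb.shape)) for xb, yb in demo_loader]
print(batch_shapes)
# [((32, 30), (32, 1)), ((32, 30), (32, 1)), ((32, 30), (32, 1)), ((4, 30), (4, 1))]
```

Each batch of targets already has shape (batch_size, 1), and the last batch is smaller because 100 is not a multiple of 32. Passing drop_last=True to DataLoader would discard that partial batch.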

Model creation

  • Input Layer: 30 features.
  • Hidden Layers: Two hidden layers with 64 and 32 units, respectively.
  • Activation Functions: ReLU after each hidden layer, Sigmoid at the output.
  • Output Layer: Single unit for binary classification.


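model = nn.Sequential(
    nn.Linear(30, 64),  # Input layer with 30 features, hidden layer with 64 units
    nn.ReLU(),          # Activation after the first hidden layer
    nn.Linear(64, 32),  # Hidden layer with 32 units
    nn.ReLU(),          # Activation after the second hidden layer
    nn.Linear(32, 1),   # Output layer with 1 unit (for binary classification)
    nn.Sigmoid(),       # Squash the output to a probability in (0, 1)
)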
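
A quick sanity check of the architecture listed above is to count its parameters and run a forward pass on random input. The sketch rebuilds the network under the hypothetical name demo_model so that it runs on its own.

```python
import torch
import torch.nn as nn

demo_model = nn.Sequential(
    nn.Linear(30, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),
)

# (30*64 + 64) + (64*32 + 32) + (32*1 + 1) weights and biases
n_params = sum(p.numel() for p in demo_model.parameters())

with torch.no_grad():
    probs = demo_model(torch.randn(8, 30))  # batch of 8 random samples

print(n_params)            # 4097
print(tuple(probs.shape))  # (8, 1)
```

Because the final layer is a Sigmoid, every output lies between 0 and 1, which is what nn.BCELoss expects.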

Loss function and optimizer

  • criterion = nn.BCELoss(): Binary Cross Entropy Loss is chosen as the loss function, suitable for binary classification tasks.
  • optimizer = optim.Adam(model.parameters(), lr=0.001): Adam optimizer is used for gradient-based optimization with a learning rate of 0.001.


criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
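
To make the loss concrete, the sketch below compares nn.BCELoss with the binary cross-entropy formula computed by hand on two made-up predictions. The probabilities 0.9 and 0.2 are illustrative values only.

```python
import math

import torch
import torch.nn as nn

probs = torch.tensor([[0.9], [0.2]])    # made-up predicted probabilities
targets = torch.tensor([[1.0], [0.0]])  # true labels

loss = nn.BCELoss()(probs, targets).item()

# -[y*log(p) + (1-y)*log(1-p)], averaged over the two samples
manual = -(math.log(0.9) + math.log(1 - 0.2)) / 2

print(round(loss, 4), round(manual, 4))
```

Both values agree up to floating-point precision. A confident correct prediction (0.9 for a positive sample) contributes little to the loss, and the contribution grows as the predicted probability moves away from the true label.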

Learning Rate Scheduler

  • Learning rate is adjusted using StepLR scheduler, reducing it by a factor of 0.5 every 20 epochs.


scheduler = lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
num_epochs = 50
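
The schedule this configuration produces can be inspected without training anything. The sketch below steps an identically configured scheduler through 50 epochs and records where the learning rate changes. The model demo_model is a hypothetical placeholder, and optimizer.step() is called without gradients purely to keep the usual call order.

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler

demo_model = nn.Linear(30, 1)  # hypothetical placeholder model
demo_optimizer = optim.Adam(demo_model.parameters(), lr=0.001)
demo_scheduler = lr_scheduler.StepLR(demo_optimizer, step_size=20, gamma=0.5)

changes = {}
previous = demo_scheduler.get_last_lr()[0]
for epoch in range(1, 51):
    demo_optimizer.step()   # no gradients here, so the weights stay unchanged
    demo_scheduler.step()
    current = demo_scheduler.get_last_lr()[0]
    if current != previous:
        changes[epoch] = current  # rate that applies after this epoch
        previous = current

print(changes)
```

The rate is halved after epoch 20 (to 0.0005) and again after epoch 40 (to 0.00025), so a 50-epoch run uses three different learning rates.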

Training Loop

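  • for epoch in range(num_epochs):: Iterates over the specified number of epochs (50 in this case).
  • model.train(): Sets the model to training mode.
  • for inputs, targets in train_loader:: Loops over mini-batches from train_loader.
  • optimizer.zero_grad(): Clears the gradients left over from the previous batch.
  • outputs = model(inputs): Forward pass to obtain model predictions.
  • loss = criterion(outputs, targets): Calculates the binary cross-entropy loss. The targets already have shape (batch_size, 1) from the earlier .view(-1, 1), so no further reshaping is needed.
  • loss.backward() and optimizer.step(): Backward pass and parameter update.
  • scheduler.step(): Advances the learning rate schedule once per epoch, after the optimizer updates.


# Training loop
for epoch in range(num_epochs):
    model.train()  # Set the model to training mode
    for inputs, targets in train_loader:
        optimizer.zero_grad()  # Clear gradients from the previous batch
        outputs = model(inputs)  # Forward pass
        loss = criterion(outputs, targets)  # Targets already have shape (batch_size, 1)
        loss.backward()  # Backward pass
        optimizer.step()  # Update the model parameters
    # Adjust learning rate once per epoch
    scheduler.step()
    # Print the loss of the last batch for monitoring
    print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item()}')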


Epoch [1/50], Loss: 0.5196633338928223
Epoch [2/50], Loss: 0.29342177510261536
Epoch [3/50], Loss: 0.19762122631072998
Epoch [4/50], Loss: 0.19884507358074188
Epoch [5/50], Loss: 0.028389474377036095
Epoch [6/50], Loss: 0.007852290757000446
Epoch [7/50], Loss: 0.040723469108343124
Epoch [8/50], Loss: 0.04233770817518234
Epoch [9/50], Loss: 0.2953278720378876
Epoch [10/50], Loss: 0.020912442356348038
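
The values above are the loss of the last mini-batch in each epoch, which is why they jump around (for example, between epochs 8 and 9). A smoother signal is the average loss over all batches, logged together with the current learning rate. The sketch below shows one way to do this on synthetic data; the dataset, the smaller network and step_size=2 are assumptions made to keep the example short.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X_demo = torch.randn(128, 30)                         # synthetic features
y_demo = (X_demo[:, 0] > 0).float().view(-1, 1)       # synthetic binary labels
loader = DataLoader(TensorDataset(X_demo, y_demo), batch_size=32, shuffle=True)

net = nn.Sequential(nn.Linear(30, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
criterion = nn.BCELoss()
optimizer = optim.Adam(net.parameters(), lr=0.001)
scheduler = lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)

history = []
for epoch in range(4):
    net.train()
    running_loss, n_samples = 0.0, 0
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = criterion(net(inputs), targets)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * inputs.size(0)  # weight by batch size
        n_samples += inputs.size(0)
    epoch_lr = scheduler.get_last_lr()[0]  # rate used during this epoch
    scheduler.step()
    history.append((epoch + 1, running_loss / n_samples, epoch_lr))
    print(f"Epoch {epoch + 1}: mean loss {running_loss / n_samples:.4f}, lr {epoch_lr}")
```

Logging the learning rate next to the loss makes it easy to see whether a change in the loss curve coincides with a scheduler step.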

Evaluation metrics

  • model.eval(): Sets the model to evaluation mode.
  • with torch.no_grad():: Temporarily disables gradient computation during evaluation.
  • test_outputs = model(X_test_std_tensor): Forward pass on the test set.
  • test_predictions = (test_outputs >= 0.5).float(): Converting model probabilities to binary predictions using a threshold of 0.5.
  • accuracy = (test_predictions == Y_test_tensor).float().mean().item(): Calculating accuracy based on binary predictions and true labels.


model.eval()  # Switch the model to evaluation mode
with torch.no_grad():  # Disable gradient tracking during evaluation
    test_outputs = model(X_test_std_tensor)
    test_predictions = (test_outputs >= 0.5).float()  # Convert probabilities to binary predictions
    # Evaluation metrics (you can use appropriate metrics based on your problem)
    accuracy = (test_predictions == Y_test_tensor).float().mean().item()
    print(f'Test Accuracy: {accuracy}')


Test Accuracy: 0.9561403393745422

A test accuracy of approximately 95.6% indicates that the trained neural network performs well on the held-out test set.
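
Accuracy alone can hide how the errors are split between the two classes, which matters for a diagnosis task. The sketch below computes precision, recall and a confusion matrix with scikit-learn. The labels and probabilities are made-up values rather than the model's actual outputs.

```python
import torch
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels and predicted probabilities, shaped like the test tensors
y_true = torch.tensor([1, 0, 1, 1, 0, 0, 1, 0]).float().view(-1, 1)
probs = torch.tensor([0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.7, 0.3]).view(-1, 1)
preds = (probs >= 0.5).float()  # same 0.5 threshold as above

y_true_np = y_true.numpy().ravel().astype(int)
preds_np = preds.numpy().ravel().astype(int)

acc = accuracy_score(y_true_np, preds_np)
prec = precision_score(y_true_np, preds_np)  # of predicted malignant, how many are
rec = recall_score(y_true_np, preds_np)      # of actual malignant, how many found
cm = confusion_matrix(y_true_np, preds_np)   # rows: true class, columns: predicted

print(acc, prec, rec)  # 0.75 0.75 0.75
print(cm)
```

In this toy case one malignant sample is missed and one benign sample is flagged, so accuracy, precision and recall all equal 0.75. On real data the three usually differ, and recall on the malignant class is often the number to watch.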

Applications of PyTorch learning rate schedulers

PyTorch learning rate schedulers are useful across a wide range of training scenarios. They help when fine-tuning models for specific tasks, they can improve convergence speed, and they reduce the effort of searching for a single learning rate that works for the whole run. They are particularly relevant when the loss landscape is uneven and a fixed learning rate proves suboptimal. Typical applications include image classification, object detection, and natural language processing, where adjusting the learning rate during training often leads to better final model performance.

