Working with Data in PyTorch

Machine learning is built on data, so handling data efficiently is an essential part of learning PyTorch. This section covers two core techniques: loading data and preprocessing it.

Loading Data: Using DataLoader and Dataset

The Dataset and DataLoader classes are PyTorch's main components for loading and iterating over datasets. Of the two, Dataset is the interface for custom datasets: to create one, subclass it and implement the ‘__len__’ and ‘__getitem__’ methods.

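DataLoader, on the other hand, iterates over the dataset and groups samples into batches, optionally shuffling them and loading them in parallel worker processes. It does not move the batches to a device: the tensors it yields stay on the CPU until your training code transfers them (for example with ‘.to(device)’). The following code snippet shows a custom dataset and a DataLoader.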

Python




import torch
from torch.utils.data import DataLoader, Dataset
 
 
# Custom Dataset class
class CustomDataset(Dataset):
    def __init__(self, data, targets):
        self.data = data
        self.targets = targets
 
    def __len__(self):
        return len(self.data)
 
    def __getitem__(self, idx):
        return self.data[idx], self.targets[idx]
 
 
# Example data
data = torch.randn(100, 3, 32, 32)  # Example image data
targets = torch.randint(0, 10, (100,))  # Example target labels
 
 
# Create custom dataset
custom_dataset = CustomDataset(data, targets)
 
 
# Create DataLoader
batch_size = 32
shuffle = True
num_workers = 4  # Parallel loading; on Windows/macOS run this under `if __name__ == "__main__":`
data_loader = DataLoader(custom_dataset, batch_size=batch_size,
                         shuffle=shuffle, num_workers=num_workers)
 
 
# Iterate over batches
for batch_idx, (inputs, targets) in enumerate(data_loader):
    print(
        f"Batch {batch_idx+1}: Inputs shape: {inputs.shape}, Targets shape: {targets.shape}")


Output:

Batch 1: Inputs shape: torch.Size([32, 3, 32, 32]), Targets shape: torch.Size([32])
Batch 2: Inputs shape: torch.Size([32, 3, 32, 32]), Targets shape: torch.Size([32])
Batch 3: Inputs shape: torch.Size([32, 3, 32, 32]), Targets shape: torch.Size([32])
Batch 4: Inputs shape: torch.Size([4, 3, 32, 32]), Targets shape: torch.Size([4])
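
Because the DataLoader yields CPU tensors, moving each batch to the GPU is a separate step in the training loop. The following is a minimal sketch of that step, not part of the original example: it uses ‘TensorDataset’ as a stand-in for the custom dataset, and the names ‘device’, ‘loader’ and ‘batch_sizes’ are chosen here for illustration. It falls back to the CPU when no GPU is available.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Pick the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in data with the same shapes as the example above (assumption)
dataset = TensorDataset(torch.randn(100, 3, 32, 32),
                        torch.randint(0, 10, (100,)))

# pin_memory can speed up host-to-GPU copies; it only matters with CUDA
loader = DataLoader(dataset, batch_size=32, shuffle=True,
                    pin_memory=torch.cuda.is_available())

batch_sizes = []
for inputs, labels in loader:
    # The DataLoader yields CPU tensors; move them explicitly
    inputs = inputs.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    batch_sizes.append(inputs.shape[0])

print("Device:", device)
print("Batch sizes:", batch_sizes)
```

The model must live on the same device as the batches, which is usually arranged once with ‘model.to(device)’ before the loop.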

Preprocessing Data: Transformations and Normalization

Preprocessing brings the data into a standard format that the model can consume. The two main steps are transformation and normalization. Transformations for images include resizing, cropping, rotating, and flipping.

Normalization, on the other hand, rescales the data, typically by subtracting a mean and dividing by a standard deviation for each channel, so that it has roughly zero mean and unit variance. This helps stabilize training and can speed up convergence. The following code snippet demonstrates preprocessing.

Python




import torch
import torchvision.transforms as transforms
 
# Define transformations
transform = transforms.Compose([
    transforms.Resize(256),              # Resize so the shorter side is 256 pixels (aspect ratio kept)
    transforms.RandomCrop(224),          # Randomly crop images to 224x224
    transforms.RandomHorizontalFlip(),   # Randomly flip images horizontally
    transforms.ToTensor(),               # Convert images to PyTorch tensors
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])  # Normalize each channel (ImageNet statistics)
])
 
# Example of applying transformations to an image
example_image = transforms.ToPILImage()(
    torch.rand(3, 256, 256))  # Random image tensor with values in [0, 1]
transformed_image = transform(example_image)
 
 
print("Transformed image shape:", transformed_image.shape)


Output:

Transformed image shape: torch.Size([3, 224, 224])
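
To make the normalization step concrete, here is a small sketch that applies the same per-channel arithmetic by hand, (x - mean) / std, using only ‘torch’. It is an illustration under stated assumptions rather than the article's own method: the batch is random stand-in data, and the statistics are computed from that batch instead of using the ImageNet values from the snippet above.

```python
import torch

# Stand-in batch of 8 RGB images with values in [0, 1] (assumption)
images = torch.rand(8, 3, 32, 32)

# Per-channel statistics over the batch, height, and width dimensions
mean = images.mean(dim=(0, 2, 3))
std = images.std(dim=(0, 2, 3))

# Per-channel normalization: (x - mean) / std, broadcast over N, H, W
normalized = (images - mean[None, :, None, None]) / std[None, :, None, None]

# After normalization each channel has approximately zero mean and unit std
print("Channel means:", normalized.mean(dim=(0, 2, 3)))
print("Channel stds:", normalized.std(dim=(0, 2, 3)))
```

In practice the statistics are computed once on the training set and then reused unchanged for validation and test data.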

Handling Custom Datasets

  • Handling a custom dataset means wrapping data that has its own structure and format in a class that PyTorch can iterate over.
  • To do this, create a class that inherits from ‘torch.utils.data.Dataset’ and implements the ‘__len__’ and ‘__getitem__’ methods.
  • The ‘__len__’ method returns the total number of samples in the dataset, and the ‘__getitem__’ method returns the sample at a given index together with its target. This is shown in the following code snippet.

Python




import torch
from torch.utils.data import Dataset, DataLoader
 
 
# Define custom dataset class by subclassing torch.utils.data.Dataset
class CustomDataset(Dataset):
    def __init__(self, data, targets):
        self.data = data
        self.targets = targets
         
    def __len__(self):
        # Return the total number of samples in the dataset
        return len(self.data)
       
    def __getitem__(self, index):
        # Retrieve and return a sample and its corresponding target based on the given index
        sample = self.data[index]
        target = self.targets[index]
        return sample, target
 
 
# Example data and targets
data = torch.tensor([[1, 2], [3, 4], [5, 6], [7, 8]])
targets = torch.tensor([0, 1, 0, 1])
 
# Create instance of the custom dataset
custom_dataset = CustomDataset(data, targets)
 
# Create a data loader to iterate over the dataset in batches
batch_size = 2
data_loader = DataLoader(custom_dataset, batch_size=batch_size, shuffle=True)
 
# Iterate over the data loader to access batches of data
for batch_idx, (samples, targets) in enumerate(data_loader):
    print(f"Batch {batch_idx}:")
    print("Samples:", samples)
    print("Targets:", targets)


Output (the batch order varies between runs because shuffle=True):

Batch 0:
Samples: tensor([[5, 6],
        [3, 4]])
Targets: tensor([0, 1])
Batch 1:
Samples: tensor([[1, 2],
        [7, 8]])
Targets: tensor([0, 1])
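
A common extension of this pattern is to let the dataset apply a preprocessing step inside ‘__getitem__’, so that each sample is transformed when it is fetched. The sketch below assumes an optional ‘transform’ argument that accepts any callable; the class name ‘TransformedDataset’ and the ‘scale’ function are hypothetical and chosen only for illustration.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class TransformedDataset(Dataset):  # hypothetical name, for illustration
    def __init__(self, data, targets, transform=None):
        self.data = data
        self.targets = targets
        self.transform = transform  # any callable mapping a sample to a sample

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        sample = self.data[index]
        # Apply the transform lazily, one sample at a time
        if self.transform is not None:
            sample = self.transform(sample)
        return sample, self.targets[index]


def scale(x):
    # Example transform: convert to float and divide by 8 (the largest value)
    return x.float() / 8.0


data = torch.tensor([[1, 2], [3, 4], [5, 6], [7, 8]])
targets = torch.tensor([0, 1, 0, 1])

dataset = TransformedDataset(data, targets, transform=scale)
loader = DataLoader(dataset, batch_size=2, shuffle=False)

for samples, labels in loader:
    print("Samples:", samples)
    print("Targets:", labels)
```

Keeping the transform as a constructor argument means the same dataset class can be reused with different preprocessing for training and evaluation.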

