Implementing Reinforcement Learning using PyTorch
This example uses the CartPole environment from OpenAI's Gym and demonstrates a basic policy gradient method (REINFORCE) for training an agent. Ensure you have PyTorch and Gym installed:
pip install torch gym
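Note: the code below uses the classic Gym API, where env.reset() returns only the observation and env.step() returns four values. In gym 0.26+ (and its successor, gymnasium) reset() returns (observation, info) and step() returns five values, so if you hit unpacking errors you may need to pin an older release:

pip install torch "gym<0.26"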
Import Libraries
This code implements a simple policy gradient reinforcement learning algorithm using PyTorch, in which an agent learns to balance a pole on a cart in the CartPole environment provided by OpenAI Gym.
import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
import matplotlib.pyplot as plt
Imports the necessary libraries: gym for the environment, torch for the neural network and optimization, Categorical for sampling actions from the policy's output distribution, numpy for numerical operations, and matplotlib for plotting.
Initialize Reward Storage
episode_rewards = []
A list to store the total reward for each episode, used later to visualize the learning curve.
Define Policy Network
The policy network is a neural network that maps states (observations from the environment) to actions. It consists of two linear layers with a ReLU activation in between and a final Softmax layer that produces a probability distribution over the possible actions. Given a state as input, it outputs the probability of taking each action in that state. Sampling actions from this distribution keeps the agent exploring the action space, which lets it discover which actions are most beneficial. The policy network is the agent's "brain": it decides how to act based on its current understanding of the environment, and that understanding improves iteratively during training from the rewards the environment returns.
class PolicyNetwork(nn.Module):
    def __init__(self):
        super(PolicyNetwork, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(4, 128),   # CartPole observations have 4 dimensions
            nn.ReLU(),
            nn.Linear(128, 2),   # two discrete actions: push cart left or right
            nn.Softmax(dim=-1),  # turn outputs into a probability distribution
        )

    def forward(self, x):
        return self.fc(x)
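As a quick sanity check (not part of the original tutorial), you can pass a dummy state through the network and confirm that the output is a valid probability distribution over the two actions:

policy = PolicyNetwork()
dummy_state = torch.zeros(1, 4)   # a batch containing one 4-dimensional observation
probs = policy(dummy_state)
print(probs)              # two action probabilities, roughly 0.5 each at initialization
print(probs.sum(dim=-1))  # sums to 1, as expected after the Softmax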
Calculate Discounted Rewards
Computes the discounted return for each time step in an episode, so that rewards received sooner count more than rewards received later, and then normalizes the returns to stabilize training.
def compute_discounted_rewards(rewards, gamma=0.99):
    discounted_rewards = []
    R = 0
    # Walk backwards through the episode, accumulating the discounted return
    for r in reversed(rewards):
        R = r + gamma * R
        discounted_rewards.insert(0, R)
    discounted_rewards = torch.tensor(discounted_rewards)
    # Normalize to zero mean and unit variance to stabilize the gradient updates
    discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-5)
    return discounted_rewards
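To see what the function computes, consider a short episode with three rewards of 1 (CartPole gives one reward per step) and gamma = 0.99. This trace is an illustration added for this walkthrough, not part of the original code:

# Returns accumulated from the end of the episode, before normalization:
#   G_2 = 1
#   G_1 = 1 + 0.99 * 1    = 1.99
#   G_0 = 1 + 0.99 * 1.99 = 2.9701
print(compute_discounted_rewards([1.0, 1.0, 1.0]))
# After normalization: approximately tensor([ 0.9983,  0.0034, -1.0017])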
Training Loop
The train function is the main loop: it interacts with the environment, collects the log-probabilities of the sampled actions and the rewards they earned, and after each episode uses the policy gradient to update the network's parameters through the optimizer.
def train(env, policy, optimizer, episodes=1000):
    for episode in range(episodes):
        state = env.reset()
        log_probs = []
        rewards = []
        done = False
        while not done:
            state = torch.FloatTensor(state).unsqueeze(0)
            probs = policy(state)
            m = Categorical(probs)
            action = m.sample()  # sample an action from the policy's distribution
            state, reward, done, _ = env.step(action.item())
            log_probs.append(m.log_prob(action))
            rewards.append(reward)

            # Episode finished: compute returns and update the policy
            if done:
                episode_rewards.append(sum(rewards))
                discounted_rewards = compute_discounted_rewards(rewards)
                policy_loss = []
                for log_prob, Gt in zip(log_probs, discounted_rewards):
                    policy_loss.append(-log_prob * Gt)  # REINFORCE loss term
                optimizer.zero_grad()
                policy_loss = torch.cat(policy_loss).sum()
                policy_loss.backward()
                optimizer.step()
                if episode % 50 == 0:
                    print(f"Episode {episode}, Total Reward: {sum(rewards)}")
                break

env = gym.make('CartPole-v1')
policy = PolicyNetwork()
optimizer = optim.Adam(policy.parameters(), lr=1e-2)
train(env, policy, optimizer)
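Once training finishes, you can watch the learned policy act greedily. The sketch below is an addition to the tutorial (it assumes the same classic Gym API) and runs a single evaluation episode, always picking the most probable action instead of sampling:

# Evaluate the trained policy for one episode with greedy action selection
state = env.reset()
total_reward, done = 0.0, False
while not done:
    with torch.no_grad():  # no gradients needed at evaluation time
        probs = policy(torch.FloatTensor(state).unsqueeze(0))
    action = probs.argmax(dim=-1).item()  # choose the most probable action
    state, reward, done, _ = env.step(action)
    total_reward += reward
print(f"Evaluation reward: {total_reward}")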
Plotting the Learning Curve
plt.plot(episode_rewards)
plt.title('Training Reward Over Episodes')
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.show()
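Because per-episode rewards are noisy, a smoothed curve is often easier to read. As an optional addition (not in the original tutorial), you can overlay a moving average:

window = 50
smoothed = np.convolve(episode_rewards, np.ones(window) / window, mode='valid')
plt.plot(episode_rewards, alpha=0.4, label='Raw reward')
plt.plot(range(window - 1, len(episode_rewards)), smoothed, label=f'{window}-episode average')
plt.title('Training Reward Over Episodes')
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.legend()
plt.show()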
Output:
Episode 0, Total Reward: 15.0
Episode 50, Total Reward: 10.0
Episode 100, Total Reward: 9.0
Episode 150, Total Reward: 10.0
Episode 200, Total Reward: 10.0
Episode 250, Total Reward: 10.0
Episode 300, Total Reward: 10.0
Episode 350, Total Reward: 9.0
Episode 400, Total Reward: 10.0
Episode 450, Total Reward: 10.0
Episode 500, Total Reward: 9.0
Episode 550, Total Reward: 8.0
Episode 600, Total Reward: 10.0
Episode 650, Total Reward: 10.0
Episode 700, Total Reward: 10.0
Episode 750, Total Reward: 9.0
Episode 800, Total Reward: 9.0
Episode 850, Total Reward: 10.0
Episode 900, Total Reward: 9.0
Episode 950, Total Reward: 9.0
After training, the total rewards per episode are plotted to visualize the learning progress.
Output explanation:
The graph shows the total reward per episode across the 1,000 training episodes. The reward starts slightly higher, then drops and stays flat at around 8-10, which suggests the agent is not improving under this configuration. Vanilla policy gradient methods have high variance, and a learning rate of 1e-2 can destabilize training, so results vary considerably from run to run.
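If you experiment with this example, a smaller learning rate is a natural first adjustment to try; the value below is a suggestion, not a setting tuned for this article:

optimizer = optim.Adam(policy.parameters(), lr=1e-3)  # smaller steps for more stable policy updates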
Reinforcement Learning using PyTorch
Reinforcement learning with PyTorch lets an agent adjust its strategy dynamically, which is crucial for navigating complex environments and maximizing reward. PyTorch's dynamic computation graph and concise API make it well suited to implementing and training RL agents, and the CartPole example above shows the core loop: the agent improves iteratively by balancing exploration and exploitation to maximize its reward.