REINFORCE Algorithm
The REINFORCE algorithm is a Monte Carlo variant of policy gradient methods: it adjusts the weights of a neural network after each completed episode. This article covers the fundamentals of the REINFORCE algorithm and walks through a simple implementation.
Table of Contents
- Basics of Reinforcement Learning
- What is REINFORCE Algorithm?
- Implementation of REINFORCE Algorithm
Basics of Reinforcement Learning
Reinforcement Learning is a machine learning paradigm that trains an agent by rewarding good actions and penalizing bad ones.
Important Terms of Reinforcement Learning
Let's assume we are teaching a robot to play a game; the robot is our agent. The environment is the game world, including the characters, obstacles, and everything else the robot interacts with. The actions are the moves or decisions the robot can make, like going left or right. Gaining or losing points in the game is the reward the agent receives for its actions.
Policy Gradient
The policy gradient method focuses on learning a policy directly – a strategy guiding the agent's decision-making process. The policy is represented by a parameterized function, such as a neural network, that takes the state of the environment as input and outputs a probability distribution over the possible actions.
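As a minimal sketch of such a parameterized policy, consider a linear-softmax function (the feature sizes and names here are illustrative, not from the article): it maps a state vector to a probability distribution over actions.

```python
import numpy as np

# A hypothetical linear-softmax policy: maps a state vector to a
# probability distribution over actions via a weight matrix theta.
def policy(state, theta):
    logits = state @ theta          # one score per action
    logits -= logits.max()          # subtract max for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

theta = np.zeros((4, 2))            # 4 state features, 2 actions
state = np.array([0.1, -0.3, 0.5, 0.2])
print(policy(state, theta))         # all-zero weights give a uniform [0.5 0.5]
```

Because the output is a normalized distribution, the agent can sample actions from it, and the parameters `theta` are what the REINFORCE update will adjust.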
Monte Carlo Methods
The expected reward is estimated using Monte Carlo methods: the agent samples complete sequences of states, actions, and rewards, and uses them to update the policy.
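The Monte Carlo step can be sketched as follows: after playing out a full episode, the return G_t at each step is the discounted sum of the rewards that followed it (the reward values and discount here are illustrative).

```python
# Monte Carlo return estimation: given the sampled rewards of one full
# episode, compute the discounted return G_t backwards from the end.
def compute_returns(rewards, gamma=0.99):
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(compute_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

Iterating backwards makes the computation O(n) instead of the O(n²) of summing the tail separately for each step.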
What is REINFORCE Algorithm?
The REINFORCE algorithm was introduced by Ronald J. Williams in 1992. Its aim is to maximize the expected cumulative reward by adjusting the policy parameters. The REINFORCE algorithm is used to train agents to make sequential decisions in an environment. It is a policy gradient method that belongs to the family of Monte Carlo algorithms. In REINFORCE, a neural network is employed to represent the policy, the strategy guiding the agent's actions in different states.
The algorithm updates the neural network’s parameters based on the obtained rewards, aiming to enhance the likelihood of actions that lead to higher cumulative rewards. This is an iterative process that allows the agent to learn a policy for decision-making in the given environment.
The name REINFORCE is an acronym describing the form of its weight update:

REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility

That is, each weight changes by a non-negative learning-rate factor, times the received reward offset by a baseline, times the characteristic eligibility (the sensitivity of the log-probability of the chosen action to that weight).
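As a numeric sketch of this product form (all values below are illustrative placeholders, not quantities from a real environment):

```python
# One REINFORCE weight increment in the form
#   delta_w = nonnegative factor * offset reinforcement * characteristic eligibility
learning_rate = 0.1      # non-negative factor (alpha)
reward = 1.0             # received reinforcement (r)
baseline = 0.4           # reinforcement baseline (b)
eligibility = 0.25       # characteristic eligibility, d ln(pi) / d w

delta_w = learning_rate * (reward - baseline) * eligibility
print(delta_w)
```

The increment is positive here because the reward exceeded the baseline, so the action is made more likely in proportion to its eligibility.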
Algorithm
- Set up the policy parameters: Establish an initial policy, for example as a parametric function or a neural network.
- Collect trajectories: Execute the current policy in the environment to gather a set of trajectories — sequences of states, actions, and rewards.
- Compute returns: For each state in a trajectory, compute the return, i.e. the sum of the discounted rewards from that state onward.
- Calculate the policy gradient: Compute the gradient of the expected return with respect to the policy parameters. This requires the gradient of the log-likelihood of the chosen actions.
- Update the policy parameters: Adjust the parameters so that actions leading to greater returns become more likely. Gradient ascent is usually used for this.
- Repeat: Repeat steps 2 through 5 for many episodes.
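Step 4 is the only non-obvious piece. For a softmax policy over action preferences, the gradient of the log-probability of a chosen action has a simple closed form, sketched below for a single state (the preference vector here is illustrative):

```python
import numpy as np

# Gradient of log pi(a) for a softmax policy over action preferences theta.
# For softmax, d log pi(a) / d theta = one_hot(a) - pi.
def grad_log_pi(theta, action):
    probs = np.exp(theta - theta.max())   # stable softmax
    probs /= probs.sum()
    one_hot = np.zeros_like(theta)
    one_hot[action] = 1.0
    return one_hot - probs

theta = np.zeros(3)                 # uniform policy: pi = [1/3, 1/3, 1/3]
print(grad_log_pi(theta, 1))        # [-1/3, 2/3, -1/3]
```

Scaling this gradient by the return G_t and averaging over a trajectory gives the policy gradient used in step 5.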
REINFORCE with Baseline
The policy gradient theorem in the context of episodic scenarios states that:

∇J(θ) ∝ Σ_s μ(s) Σ_a q_π(s, a) ∇π(a | s, θ)

Here,
- the gradients are column vectors of partial derivatives with respect to the components of θ
- π denotes the policy corresponding to parameter vector θ
- the distribution μ is the on-policy distribution under π
The policy gradient theorem can be extended to incorporate a comparison between the action value and a user-defined baseline, denoted as b(s):

∇J(θ) ∝ Σ_s μ(s) Σ_a (q_π(s, a) − b(s)) ∇π(a | s, θ)

The baseline can take the form of any function, or even a random variable, as long as it remains constant across different actions (denoted as a); the equation remains accurate because the subtracted quantity is zero:

Σ_a b(s) ∇π(a | s, θ) = b(s) ∇ Σ_a π(a | s, θ) = b(s) ∇1 = 0
The policy gradient theorem incorporating a baseline can be turned into an update rule through steps analogous to those in the preceding section. The resulting update is a modified version of REINFORCE with a general baseline:

θ_{t+1} = θ_t + α (G_t − b(S_t)) ∇ ln π(A_t | S_t, θ_t)

Since the baseline can be uniformly zero, this update is a strict generalization of REINFORCE. The baseline leaves the expected value of the update unchanged, but it can significantly reduce its variance.
Implementation of REINFORCE Algorithm
Let's look at a simple scenario in which an agent learns to play a game. The policy is a function that outputs probabilities for the available actions. By playing the game, the agent gathers trajectories, computes returns, uses the policy gradient to change the policy's parameters, and then repeats the procedure.
Python

```python
import numpy as np


def initialize_policy(num_actions):
    # Random initial action preferences for a softmax policy
    return np.random.rand(num_actions)


def softmax(preferences):
    exp_p = np.exp(preferences - preferences.max())  # numerical stability
    return exp_p / exp_p.sum()


def collect_trajectories():
    # For simplicity, assume a fixed trajectory for each episode
    states = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    actions = [0, 1, 2]
    rewards = [0.1, 0.5, 0.2]
    return states, actions, rewards


def compute_returns(rewards, gamma=0.99):
    # Discounted return G_t from each time step onward
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def compute_policy_gradient(policy_parameters, actions, returns):
    # Gradient of sum_t G_t * log pi(a_t) for a state-independent
    # softmax policy (d log pi(a) / d theta = one_hot(a) - pi)
    probs = softmax(policy_parameters)
    gradient = np.zeros_like(policy_parameters)
    for action, G in zip(actions, returns):
        one_hot = np.zeros_like(policy_parameters)
        one_hot[action] = 1.0
        gradient += G * (one_hot - probs)
    return gradient


def update_policy_parameters(policy_parameters, policy_gradient, learning_rate=0.01):
    # Gradient ascent on the expected return
    return policy_parameters + learning_rate * policy_gradient


# Initialization
num_actions = 3
policy_parameters = initialize_policy(num_actions)

# Training loop
num_episodes = 1000
for episode in range(num_episodes):
    # Collect trajectories
    states, actions, rewards = collect_trajectories()
    # Compute returns
    returns = compute_returns(rewards)
    # Compute policy gradient
    policy_gradient = compute_policy_gradient(policy_parameters, actions, returns)
    # Update policy parameters
    policy_parameters = update_policy_parameters(policy_parameters, policy_gradient)

# Print final policy parameters
print("Final Policy Parameters:", policy_parameters)
```
Output:
Final Policy Parameters: [0.01854423 0.63611265 0.73294125]

(The exact values vary from run to run because of the random initialization.)