Finite Markov Decision Process

In reinforcement learning, a critical concept is the Markov property, which describes a specific characteristic of the environment and its state signal. The Markov property states that the future state of the environment depends only on the current state and the action taken by the agent, not on the sequence of states and actions that preceded it. A decision problem whose state signal satisfies this property is formalized by the Markov Decision Process (MDP) framework.

The reinforcement learning framework allows the agent to make decisions based on the state signal received from the environment. This state signal contains all the necessary information about the current situation of the environment that enables the agent to decide what action to take next.

However, the Markov property restricts the state signal to the information that is currently available; it does not necessarily include hidden or future information that might be useful for the agent's decision. For instance, an agent playing Blackjack does not have access to the next card in the deck. So even though there may be hidden state information in the environment that could be beneficial, the agent can only make its decisions based on the information it has received up to that point.

Thus, the Markov property keeps the decision-making process grounded in the present state and the agent’s immediate perceptions, providing a manageable and realistic framework for learning and decision-making in uncertain, dynamic environments.

If both the state and action spaces are finite, the MDP is referred to as a finite Markov Decision Process (finite MDP). Finite MDPs are crucial because they form the foundation for most of reinforcement learning theory and its applications.

Components of Finite MDP

A finite MDP is defined by three components:

  • State set (S): All possible states the environment can be in.
  • Action set (A): All possible actions that the agent can take.
  • One-step dynamics: The probability distribution of the next state and reward, given the current state and action (a minimal Python sketch of these components follows below).
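
To make these components concrete, here is a minimal Python sketch of a toy finite MDP. The state names, action names, rewards, and probabilities below are illustrative assumptions chosen only to show the structure, not values from any specific problem.

```python
# Toy finite MDP: states, actions, and one-step dynamics p(s', r | s, a).
# All names and numbers here are illustrative placeholders.

states = ["high", "low"]          # state set S
actions = ["search", "wait"]      # action set A

# One-step dynamics, stored as (s, a) -> list of (next_state, reward, probability)
dynamics = {
    ("high", "search"): [("high", 2.0, 0.7), ("low", 2.0, 0.3)],
    ("high", "wait"):   [("high", 1.0, 1.0)],
    ("low", "search"):  [("low", 2.0, 0.6), ("high", -3.0, 0.4)],
    ("low", "wait"):    [("low", 1.0, 1.0)],
}
```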

Dynamics of Finite MDP

For any given state s and action a, the probability of each possible combination of next state s' and reward r can be denoted as

[Tex]p(s', r \mid s, a) = \Pr\{S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a\}[/Tex] ……. Equation 1

This quantity specifies the dynamics of a finite MDP. Once the dynamics of the environment are defined by the probability distribution p(s', r | s, a), we can calculate various important characteristics of the environment, such as the expected rewards for state-action pairs, the state-transition probabilities, and the expected rewards for state-action-next-state triples.
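
Because p(s', r | s, a) is a probability distribution over next states and rewards, its values must sum to 1 for every state-action pair. A quick sanity check, reusing the toy `dynamics` dictionary sketched above (its representation is an assumption of that sketch):

```python
def check_dynamics(dynamics, tol=1e-9):
    """Check that p(s', r | s, a) sums to 1 over all (s', r) for each (s, a)."""
    for (s, a), outcomes in dynamics.items():
        total = sum(prob for _, _, prob in outcomes)
        assert abs(total - 1.0) < tol, f"p(. | {s}, {a}) sums to {total}, not 1"

check_dynamics(dynamics)  # raises AssertionError if any distribution is invalid
```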

The expected reward for a state-action pair can be denoted as

[Tex]r(s,a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{r \in R} r \sum_{s' \in S} p(s', r \mid s, a)[/Tex] ……. Equation 2

The above equation 2 represents the expected reward after taking action a in state s.
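
A direct translation of Equation 2 into Python, using the (next_state, reward, probability) representation assumed in the sketch above:

```python
def expected_reward(dynamics, s, a):
    """r(s, a): expected reward after taking action a in state s (Equation 2)."""
    return sum(r * prob for _, r, prob in dynamics[(s, a)])

# With the toy dynamics above: 2.0 * 0.7 + 2.0 * 0.3 = 2.0
print(expected_reward(dynamics, "high", "search"))
```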

The state-transition probabilities can be denoted as,

[Tex]p(s' \mid s, a) = \Pr\{S_{t+1} = s' \mid S_t = s, A_t = a\} = \sum_{r \in R} p(s', r \mid s, a)[/Tex] ……. Equation 3

The above equation 3 represents the probability of transitioning to state s' after taking action a in state s.
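
Equation 3 simply marginalizes the reward out of the one-step dynamics. In the same toy representation:

```python
def transition_prob(dynamics, s, a, s_next):
    """p(s' | s, a): probability of landing in s_next after action a in s (Equation 3)."""
    return sum(prob for nxt, _, prob in dynamics[(s, a)] if nxt == s_next)

# With the toy dynamics above this prints 0.3
print(transition_prob(dynamics, "high", "search", "low"))
```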

The expected reward for a state-action-next-state triple can be denoted as

[Tex]r(s,a,s') = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] = \frac{\sum_{r \in R} r \cdot p(s', r \mid s, a)}{p(s' \mid s, a)}[/Tex] ……. Equation 4

The above equation 4 represents the expected reward when the agent transitions from state s to state s’ after taking action a.
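
Equation 4 divides the reward-weighted probabilities by the transition probability from Equation 3. A sketch, again assuming the toy `dynamics` dictionary from earlier:

```python
def expected_reward_triple(dynamics, s, a, s_next):
    """r(s, a, s'): expected reward given the transition from s to s_next via a (Equation 4)."""
    p_next = sum(prob for nxt, _, prob in dynamics[(s, a)] if nxt == s_next)
    if p_next == 0.0:
        raise ValueError(f"Transition ({s}, {a}) -> {s_next} has zero probability")
    weighted = sum(r * prob for nxt, r, prob in dynamics[(s, a)] if nxt == s_next)
    return weighted / p_next

# Approximately -3.0 with the toy dynamics above (up to floating-point rounding)
print(expected_reward_triple(dynamics, "low", "search", "high"))
```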

Agent-Environment Interface in AI

The agent-environment interface is a fundamental concept of reinforcement learning. It encapsulates the continuous interaction between an autonomous agent and its surrounding environment, which forms the basis of how agents learn from and adapt to their experiences in order to achieve specific goals. This article explores the decision-making process of agents, the flexibility of the framework, and the critical distinction between the agent and its environment.
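
The loop below is a minimal sketch of that interaction, with a hypothetical random policy and a hand-written toy environment; the function names, state labels, and reward values are illustrative assumptions, not a specific library API.

```python
import random

def policy(state):
    """Agent side: choose an action based only on the current state signal."""
    return random.choice(["search", "wait"])

def environment_step(state, action):
    """Environment side: return (next_state, reward) in response to the action."""
    if action == "search":
        next_state = "low" if random.random() < 0.3 else "high"
        return next_state, 2.0
    return state, 1.0  # waiting keeps the state and yields a smaller reward

state = "high"
for t in range(5):                                    # a short episode of 5 time steps
    action = policy(state)                            # agent perceives the state and acts
    state, reward = environment_step(state, action)   # environment responds
    print(f"t={t}: action={action}, next_state={state}, reward={reward}")
```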

Table of Content

  • Agent-environment Interface in AI
    • Time steps and continual interaction
    • Perception, Action, and Feedback
    • Representation of State, Action and Rewards
  • Policy and Decision-making
    • Policy
    • Decision making
  • Finite Markov Decision Process
    • Components of Finite MDP
    • Dynamics of Finite MDP
  • Flexibility and Abstraction in the framework
  • Boundary between Agent and Environment
  • Conclusion

