Chandigarh Engineering College Jhanjeri
Mohali-140307
Department of Computer Science & Engineering
Markov Decision Processes: Reinforcement Learning
An MDP must satisfy the Markov Property.
Markov Property: requires that “the future is independent
of the past given the present”.
Property: a state Sₜ is Markov if and only if:

    ℙ[Sₜ₊₁ | Sₜ] = ℙ[Sₜ₊₁ | S₁, S₂, …, Sₜ]

Put simply, the state Sₜ captures all the relevant information
from the history: S₁, S₂, …, Sₜ₋₁ can be discarded and we still get the
same state transition probability to the next state Sₜ₊₁.
State Transition Probability: the state transition
probability tells us, given that we are in state s, the probability
that the next state will be s′:

    P_ss′ = ℙ[Sₜ₊₁ = s′ | Sₜ = s]

In the above equation, P without the double lines represents the state transition from
state s to state s′, while ℙ with the double lines represents the probability of going from state s to s′.
We can also define all state transitions in terms of a State Transition
Matrix P, where each row tells us the transition probabilities from one
state to all possible successor states.
State transition matrix:

    P = [ P₁₁ … P₁ₙ ]
        [  ⋮  ⋱  ⋮  ]
        [ Pₙ₁ … Pₙₙ ]

where row i lists the probabilities of moving from state i to each of the n possible successor states.
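As a quick sketch, a state transition matrix can be stored as an ordinary 2-D array; the probabilities below are made up for illustration, and the only property checked is that every row sums to 1:

    import numpy as np

    # A minimal sketch of a state transition matrix. The probabilities are
    # illustrative, not taken from any figure. Row i holds the probabilities
    # of moving from state i to every successor state.
    P = np.array([
        [0.1, 0.9, 0.0],   # from state 1
        [0.0, 0.2, 0.8],   # from state 2
        [0.0, 0.0, 1.0],   # from state 3 (absorbing)
    ])

    # Every row of a valid transition matrix sums to 1.
    assert np.allclose(P.sum(axis=1), 1.0)

    # Probability of moving from state 1 to state 2:
    print(P[0, 1])  # 0.9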
Markov Process / Markov Chain
The first and simplest of these models is the Markov process.
Markov Process / Markov Chain: A sequence of random
states S₁, S₂, … with the Markov property.
Below is an illustration of a Markov Chain where each node represents a
state, with a probability of transitioning from one state to the next,
and where Stop represents a terminal state.
Markov Chain
We can take a sample episode that goes through the chain and ends at
the terminal state. An example episode would be to go
from Stage1 to Stage2 to Win to Stop. Below are a few
sample episodes:
- Stage1 Stage2 Win Stop
- Stage1 Stage2 Teleport Stage2 Win Stop
- Stage1 Pause Stage1 Stage2 Win Stop
The above Markov Chain has the following Transition Probability
Matrix:
For each state, the sum of the transition probabilities out of that
state equals 1.
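The sample episodes above can be generated programmatically by repeatedly drawing the next state from the current state's transition probabilities. A small sketch, assuming illustrative probabilities (the exact values would be read off the transition probability matrix in the figure):

    import random

    # Sketch: sampling episodes from the Markov Chain above. The transition
    # probabilities are illustrative assumptions, not the figure's values.
    transitions = {
        "Stage1":   [("Stage2", 0.7), ("Pause", 0.3)],
        "Stage2":   [("Win", 0.6), ("Teleport", 0.4)],
        "Pause":    [("Stage1", 1.0)],
        "Teleport": [("Stage2", 0.4), ("Stage1", 0.6)],
        "Win":      [("Stop", 1.0)],
    }

    def sample_episode(start="Stage1"):
        """Follow the chain from `start` until the terminal state Stop."""
        state, episode = start, [start]
        while state != "Stop":
            successors, probs = zip(*transitions[state])
            state = random.choices(successors, weights=probs)[0]
            episode.append(state)
        return episode

    print(sample_episode())  # e.g. ['Stage1', 'Stage2', 'Win', 'Stop']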
Markov Reward Process
In the above Markov Chain we did not have a value associated with
being in a state, so there was no goal to work towards. A Markov Reward
Process is a Markov chain with reward values.
Our goal is to maximise the return. The return Gₜ is the total
discounted reward from time-step t.
Equation to calculate the return:

    Gₜ = Rₜ₊₁ + γRₜ₊₂ + γ²Rₜ₊₃ + … = Σ γᵏ Rₜ₊ₖ₊₁  (summing over k = 0, 1, 2, …)
The discount factor γ is a value (that can be chosen) between 0
and 1. A γ closer to 0 leads to short-sighted evaluation,
while a value closer to 1 favours far-sighted evaluation.
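A quick sketch of how a return is computed from a sampled reward sequence; the rewards and the two γ values below are made up purely to show the short-sighted versus far-sighted behaviour:

    def discounted_return(rewards, gamma):
        """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
        g = 0.0
        for k, r in enumerate(rewards):
            g += (gamma ** k) * r
        return g

    # Illustrative rewards received after time-step t.
    rewards = [-1, -1, 10]

    print(discounted_return(rewards, gamma=0.1))  # short-sighted: -1.0
    print(discounted_return(rewards, gamma=0.9))  # far-sighted:  ~6.2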
Markov Reward Process
Value Function for MRPs
State Value Function v(s): gives the long-term value of
state s. It is the expected return starting from state s:

    v(s) = 𝔼[Gₜ | Sₜ = s]
One way to view this: starting from state s and running through
various sample episodes from s, what is our expected return? We want to
prefer states that give more total reward.
state-values for MRP with γ=1
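A naive way to see the expectation in v(s) is to sample many episodes from s and average their returns. The sketch below does exactly that, assuming made-up transition probabilities and per-state rewards (the true values would come from the MRP figure):

    import random

    # Naive sketch: estimate v(s) by averaging the returns of many sampled
    # episodes that start in s. Transitions and rewards are illustrative.
    transitions = {"Stage1": [("Stage2", 0.7), ("Pause", 0.3)],
                   "Stage2": [("Win", 0.6), ("Teleport", 0.4)],
                   "Pause": [("Stage1", 1.0)],
                   "Teleport": [("Stage2", 0.4), ("Stage1", 0.6)],
                   "Win": [("Stop", 1.0)]}
    rewards = {"Stage1": -2, "Stage2": -2, "Pause": -1, "Teleport": -1,
               "Win": 10, "Stop": 0}

    def estimate_v(s, gamma=1.0, n_episodes=10_000):
        total = 0.0
        for _ in range(n_episodes):
            state, g, discount = s, 0.0, 1.0
            while state != "Stop":
                g += discount * rewards[state]   # reward collected in `state`
                discount *= gamma
                successors, probs = zip(*transitions[state])
                state = random.choices(successors, weights=probs)[0]
            total += g
        return total / n_episodes

    print(estimate_v("Stage1"))   # expected return from Stage1 with gamma = 1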
Bellman Equation for MRPs
The value function can be decomposed into two parts:
Immediate reward: Rₜ₊₁
Discounted value of the successor state: γ v(Sₜ₊₁)
Using the state-value function and the return defined above, we can
rewrite the state-value function as:
Updated Bellman state-value equation:

    v(s) = 𝔼[Rₜ₊₁ + γ v(Sₜ₊₁) | Sₜ = s] = R_s + γ Σ_s′ P_ss′ v(s′)

where R_s is the expected immediate reward in state s and the sum runs over all successor states s′.
Alternatively this can be written in matrix form:

    v = R + γPv
Using this equation we can calculate the state value for each state.
Since the model above (“state-values for MRP with γ=1”) is small, we
can calculate its state values by solving a set of simultaneous
equations with the updated state-value function.
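Since the matrix form is a linear system, a small MRP can be solved directly: v = R + γPv rearranges to v = (I − γP)⁻¹R. A minimal sketch, assuming made-up rewards and transition probabilities rather than the values from the figure:

    import numpy as np

    # States for the sketch: Stage1, Stage2, Win. P[i, j] is the probability
    # of moving from state i to state j and R[i] is the reward for state i;
    # both are made-up numbers, not the values from the figure.
    P = np.array([[0.0, 1.0, 0.0],
                  [0.2, 0.0, 0.8],
                  [0.0, 0.0, 1.0]])     # Win is absorbing
    R = np.array([-2.0, -2.0, 0.0])
    gamma = 0.9                         # gamma < 1 keeps (I - gamma*P) invertible

    # Bellman equation in matrix form: v = R + gamma * P v
    # Direct solution:                  v = (I - gamma * P)^-1 R
    v = np.linalg.solve(np.eye(3) - gamma * P, R)
    print(v)                            # state values for Stage1, Stage2, Win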
Solving the equation directly is simple for small MRPs but becomes
computationally expensive for large ones. In order to solve large MRPs we
require other techniques such as Dynamic Programming, Monte-Carlo
evaluation and Temporal-Difference learning, which will be discussed
later.
Markov Decision Process
A Markov Decision Process is an extension of a Markov Reward
Process, as it contains decisions that an agent must make. All states in
the environment are Markov.
In a Markov Decision Process we now have more control over which
states we go to. For example, in the MDP below, if we choose to take the
action Teleport we will end up back in state Stage2 40% of the time
and in Stage1 60% of the time. Other state transitions occur with 100%
probability when the corresponding action is selected; for example, taking the
action Advance2 from Stage2 will take us to Win.
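One way to hold the action-dependent dynamics is a table mapping (state, action) pairs to a distribution over successor states. A small sketch, assuming Teleport is taken from Stage2 (as the wording “back in state Stage2” suggests); any remaining entries would be read off the MDP diagram:

    # Action-dependent dynamics P(s' | s, a) keyed by (state, action).
    # The Teleport probabilities follow the text; the Advance2 entry is
    # the deterministic transition to Win described above.
    mdp_dynamics = {
        ("Stage2", "Teleport"): {"Stage2": 0.4, "Stage1": 0.6},
        ("Stage2", "Advance2"): {"Win": 1.0},
    }

    def next_state_distribution(state, action):
        """Look up the distribution over successor states for (state, action)."""
        return mdp_dynamics[(state, action)]

    print(next_state_distribution("Stage2", "Teleport"))  # {'Stage2': 0.4, 'Stage1': 0.6}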
Policies
A policy π is a distribution over actions given states. It fully
defines the behaviour of an agent. MDP policies depend on the
current state and not the history.
Policy function:

    π(a|s) = ℙ[Aₜ = a | Sₜ = s]

Policies give a mapping from states to action probabilities: if I am in state s, the
policy gives the probability of taking each action from that state. For example, if
we have the policy π(Chores|Stage1)=100%, this means the agent will
take the action Chores 100% of the time when in state Stage1.
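A minimal sketch of a policy as a table of per-state action probabilities; only the action names Chores, Advance2 and Teleport come from the text, and the probabilities for Stage2 are made up:

    import random

    # A stochastic policy as a per-state distribution over actions.
    policy = {
        "Stage1": {"Chores": 1.0},                     # pi(Chores | Stage1) = 100%
        "Stage2": {"Advance2": 0.8, "Teleport": 0.2},  # illustrative split
    }

    def sample_action(state):
        """Draw an action according to pi(. | state)."""
        actions, probs = zip(*policy[state].items())
        return random.choices(actions, weights=probs)[0]

    print(sample_action("Stage1"))  # always 'Chores'
    print(sample_action("Stage2"))  # 'Advance2' about 80% of the time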
Value Function for MDP
Since we now take actions, the expected return differs depending on
how we behave.
The state-value function v_π(s) of an MDP is the expected
return starting from state s, and then following policy π:

    v_π(s) = 𝔼_π[Gₜ | Sₜ = s]

The state-value function tells us how good it is to be in state s when following
policy π.
The action-value function q_π(s,a) is the expected return
starting from state s, taking action a, and then following
policy π:

    q_π(s,a) = 𝔼_π[Gₜ | Sₜ = s, Aₜ = a]

The action-value function tells us how good it is to take a particular action
from a particular state. It gives us an idea of which action we should take
in each state.
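A rough rollout sketch of the definition above: estimate q_π(s,a) by taking action a once in state s, following the policy π afterwards, and averaging the discounted returns. The dynamics, rewards, the Finish action and the deterministic policy are all assumptions made up for the sketch:

    import random

    # Rollout sketch for q_pi(s, a): force action a in state s once, then
    # follow policy pi; average the discounted returns over many rollouts.
    dynamics = {  # P(s' | s, a), illustrative
        ("Stage1", "Chores"):   {"Stage2": 1.0},
        ("Stage2", "Advance2"): {"Win": 1.0},
        ("Stage2", "Teleport"): {"Stage2": 0.4, "Stage1": 0.6},
        ("Win",    "Finish"):   {"Stop": 1.0},
    }
    rewards = {"Stage1": -2, "Stage2": -2, "Win": 10, "Stop": 0}              # R(s), made up
    policy  = {"Stage1": "Chores", "Stage2": "Advance2", "Win": "Finish"}     # deterministic pi

    def step(state, action):
        successors, probs = zip(*dynamics[(state, action)].items())
        return random.choices(successors, weights=probs)[0]

    def estimate_q(state, action, gamma=0.9, n_rollouts=5000):
        total = 0.0
        for _ in range(n_rollouts):
            s, a, g, discount = state, action, 0.0, 1.0   # first action is forced
            while s != "Stop":
                g += discount * rewards[s]
                s = step(s, a)
                discount *= gamma
                a = policy.get(s)                         # afterwards, follow pi
            total += g
        return total / n_rollouts

    print(estimate_q("Stage2", "Teleport"))  # compare with estimate_q("Stage2", "Advance2")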
The optimal state-value function v∗(s) is the maximum value
function over all policies, v∗(s) = max_π v_π(s). It tells us the maximum possible reward
you can extract from the system.
The optimal action-value function q∗(s,a) is the maximum
action-value function over all policies, q∗(s,a) = max_π q_π(s,a). It tells us the
maximum possible reward you can extract from the system starting
at state s and taking action a.
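As a closing sketch, one standard way to compute v∗ and q∗ when the MDP is small and fully known is value iteration, a Dynamic Programming method of the kind mentioned above as a later topic. The states, actions, rewards and probabilities below are illustrative assumptions, not the values from the MDP figure:

    # Value iteration sketch on a tiny, fully known MDP.
    states  = ["Stage1", "Stage2", "Win"]
    actions = {"Stage1": ["Chores"], "Stage2": ["Advance2", "Teleport"], "Win": []}
    dynamics = {  # P(s' | s, a)
        ("Stage1", "Chores"):   {"Stage2": 1.0},
        ("Stage2", "Advance2"): {"Win": 1.0},
        ("Stage2", "Teleport"): {"Stage2": 0.4, "Stage1": 0.6},
    }
    reward = {("Stage1", "Chores"): -2, ("Stage2", "Advance2"): 10,
              ("Stage2", "Teleport"): -1}          # R(s, a), made up
    gamma = 0.9

    v = {s: 0.0 for s in states}                   # v*(s), initialised to 0
    for _ in range(100):                           # sweep until (roughly) converged
        for s in states:
            if not actions[s]:
                continue                           # terminal state keeps value 0
            v[s] = max(reward[(s, a)] +
                       gamma * sum(p * v[s2] for s2, p in dynamics[(s, a)].items())
                       for a in actions[s])

    # q*(s, a) = R(s, a) + gamma * sum over s' of P(s'|s,a) * v*(s')
    q = {(s, a): reward[(s, a)] +
                 gamma * sum(p * v[s2] for s2, p in dynamics[(s, a)].items())
         for s in states for a in actions[s]}

    print(v)   # optimal state values
    print(q)   # optimal action values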