
Chandigarh Engineering College Jhanjeri

Mohali-140307
Department of Computer Science & Engineering

Markov Decision Processes: Reinforcement Learning

MDPs must satisfy the Markov property.

 Markov Property: requires that “the future is independent of the past given the present”.

 Property: our state Sₜ is Markov if and only if:

ℙ[Sₜ₊₁ | Sₜ] = ℙ[Sₜ₊₁ | S₁, …, Sₜ]

Simply put, this means that the state Sₜ captures all the relevant information from the history: S₁, S₂, …, Sₜ₋₁ can be discarded and we still get the same transition probability to the next state Sₜ₊₁.

 State Transition Probability: the state transition probability tells us, given that we are in state s, the probability that the next state will be s′:

𝒫ₛₛ′ = ℙ[Sₜ₊₁ = s′ | Sₜ = s]

𝒫 without the double lines represents the state transition itself, here from state s to state s′. ℙ with the double lines represents the probability of going from state s to s′.

We can also define all state transitions in terms of a State Transition Matrix P, where each row tells us the transition probabilities from one state to all possible successor states.

State transition matrix:

P = [ 𝒫₁₁ … 𝒫₁ₙ
       ⋮       ⋮
      𝒫ₙ₁ … 𝒫ₙₙ ]

where row s holds the probabilities of transitioning from state s to each successor state s′.
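To make this concrete, a transition matrix can be represented as a 2-D array whose rows each sum to 1. The sketch below is a minimal Python example; the states and probabilities are made up purely for illustration.

import numpy as np

# Illustrative 3-state transition matrix; states and numbers are made up.
states = ["A", "B", "C"]
P = np.array([
    [0.5, 0.3, 0.2],   # from A: P(A->A), P(A->B), P(A->C)
    [0.1, 0.6, 0.3],   # from B
    [0.0, 0.0, 1.0],   # from C: an absorbing (terminal) state
])

# Each row is a distribution over successor states, so it must sum to 1.
assert np.allclose(P.sum(axis=1), 1.0)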

 Markov Process / Markov Chain

The first and simplest of these models is the Markov process.

Markov Process / Markov Chain: a sequence of random states S₁, S₂, … with the Markov property.

Below is an illustration of a Markov chain, where each node represents a state and each edge carries the probability of transitioning from one state to the next; Stop represents a terminal state.
Chandigarh Engineering College Jhanjeri
Mohali-140307
Department of Computer Science & Engineering

[Figure: Markov Chain]

We can take a sample episode that goes through the chain and ends up at the terminal state. An example episode would be to go from Stage1 to Stage2 to Win to Stop (abbreviated below as S1 and S2). Below are a few sample episodes:

- S1 S2 Win Stop
- S1 S2 Teleport S2 Win Stop
- S1 Pause S1 S2 Win Stop

The above Markov chain has the following transition probability matrix:

[Figure: Transition Probability Matrix]

For each state, the transition probabilities out of that state sum to 1.
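As a sketch of how such episodes are generated, the Python snippet below samples walks through a chain with the same structure as the one above. The exact transition probabilities were given in the figure, so the numbers here are assumptions chosen only to make the example run.

import random

# Chain structure follows the sample episodes above; the probabilities
# are assumed for illustration (the original matrix was shown as a figure).
transitions = {
    "Stage1":   {"Stage2": 0.7, "Pause": 0.3},
    "Stage2":   {"Win": 0.8, "Teleport": 0.2},
    "Pause":    {"Stage1": 1.0},
    "Teleport": {"Stage2": 0.4, "Stage1": 0.6},
    "Win":      {"Stop": 1.0},
    "Stop":     {},  # terminal state: no outgoing transitions
}

def sample_episode(start="Stage1"):
    state, episode = start, [start]
    while transitions[state]:          # stop once the terminal state is reached
        successors = transitions[state]
        state = random.choices(list(successors), weights=successors.values())[0]
        episode.append(state)
    return episode

print(sample_episode())  # e.g. ['Stage1', 'Stage2', 'Win', 'Stop']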

 Markov Reward Process

In the Markov chain above there was no value associated with being in a state. A Markov Reward Process is a Markov chain with reward values.

Our goal is to maximise the return. The return Gₜ is the total discounted reward from time-step t.

Equation to calculate the return:

Gₜ = Rₜ₊₁ + γRₜ₊₂ + γ²Rₜ₊₃ + … = Σₖ γᵏ Rₜ₊ₖ₊₁,  summing over k = 0, 1, 2, …



 The discount factor γ is a value (that can be chosen) between 0 and 1. If gamma is closer to 0 it leads to short-sighted evaluation, while a value closer to 1 favours far-sighted evaluation.
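A quick numeric sketch of the return formula, with a made-up reward sequence, shows the effect of the discount factor:

# G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + ...  (rewards below are made up)
def discounted_return(rewards, gamma):
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1, 1, 10]                          # R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, gamma=1.0))  # 12.0 -> far-sighted
print(discounted_return(rewards, gamma=0.1))  # ~1.2 -> short-sighted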

[Figure: Markov Reward Process]

 Value Function for MRPs

 State Value Function v(s): gives the long-term value of state s. It is the expected return starting from state s:

v(s) = E[Gₜ | Sₜ = s]

One way to view this: starting from state s and running through various sample episodes, what is our expected return? We prefer states that give more total reward.

[Figure: state-values for the MRP with γ = 1]

 Bellman Equation for MRPs

The value function can be decomposed into two parts:

 Immediate reward: Rₜ₊₁

 Discounted value of successor state: γ v(Sₜ₊₁)

We can define a new equation to calculate the state-value function, using the state-value function and return function above:

Bellman state-value equation:

v(s) = E[Rₜ₊₁ + γ v(Sₜ₊₁) | Sₜ = s]

Alternatively, this can be written in matrix form:

v = R + γPv,  which rearranges to  v = (I − γP)⁻¹R

Using this equation we can calculate the state values for each state. Since the model above with the “state-values for the MRP with γ = 1” is small, we can calculate the state values by solving simultaneous equations with the updated state-value function.

Solving the above equation directly is simple for small MRPs but becomes highly complex for larger ones. To solve large MRPs we require other techniques, such as Dynamic Programming, Monte-Carlo evaluation and Temporal-Difference learning, which will be discussed later.
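As a minimal sketch of the direct solution, the snippet below solves v = (I − γP)⁻¹R with NumPy for a small, made-up MRP (the P, R and γ values are assumptions, not the ones from the figure). The matrix solve costs roughly O(n³) in the number of states, which is why it only suits small MRPs.

import numpy as np

# Direct Bellman solve: v = R + γPv  =>  (I - γP) v = R.
# P, R and gamma are illustrative values for a 3-state MRP.
P = np.array([
    [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],   # terminal state loops onto itself with reward 0
])
R = np.array([-1.0, 10.0, 0.0])   # expected immediate reward per state
gamma = 0.9

v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)  # state values: [ 7.1  10.   0. ]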

 Markov Decision Process

A Markov Decision Process is an extension of a Markov Reward Process, as it contains decisions that an agent must make. All states in the environment are Markov.

In a Markov Decision Process we now have more control over which states we go to. For example, in the MDP below, if we choose to take the action Teleport we end up back in state Stage2 40% of the time and in Stage1 60% of the time. Other state transitions occur with 100% probability when the corresponding action is selected; for example, taking the action Advance2 from Stage2 takes us to Win.

 Policies

A policy π is a distribution over actions given states. It fully defines the behaviour of an agent. MDP policies depend on the current state and not the history.

Policy function:

π(a|s) = ℙ[Aₜ = a | Sₜ = s]

Policies give the mapping from states to actions. If I am in state s, the policy gives the probability of taking each action from that state. For example, if we have the policy π(Chores|Stage1) = 100%, the agent will take the action Chores 100% of the time when in state Stage1.
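A minimal sketch of a policy as a state-to-action-distribution mapping, using the π(Chores|Stage1) = 100% example above (the Stage2 actions and their probabilities are assumed for illustration):

import random

# A policy maps each state to a probability distribution over actions.
policy = {
    "Stage1": {"Chores": 1.0},                     # π(Chores|Stage1) = 100%
    "Stage2": {"Advance2": 0.5, "Teleport": 0.5},  # assumed 50/50 split
}

def sample_action(state):
    actions = policy[state]
    return random.choices(list(actions), weights=actions.values())[0]

print(sample_action("Stage1"))  # always 'Chores'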
Chandigarh Engineering College Jhanjeri
Mohali-140307
Department of Computer Science & Engineering

 Value Function for MDP

Since we take actions, there are different expectations depending on how we behave.

 The state-value function v_π(s) of an MDP is the expected return starting from state s and then following policy π:

v_π(s) = E_π[Gₜ | Sₜ = s]

The state-value function tells us how good it is to be in state s when following policy π.

 The action-value function q_π(s,a) is the expected return starting from state s, taking action a, and then following policy π:

q_π(s,a) = E_π[Gₜ | Sₜ = s, Aₜ = a]

The action-value function tells us how good it is to take a particular action from a particular state. It gives us an idea of which action we should take at each state.
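The two functions are linked by a standard identity: the value of a state under π is the policy-weighted average of its action values, v_π(s) = Σ_a π(a|s) · q_π(s,a). A tiny sketch with assumed numbers:

# v_π(s) = Σ_a π(a|s) · q_π(s,a); probabilities and q-values are assumed.
pi_s = {"Advance2": 0.5, "Teleport": 0.5}    # π(a|s) for some state s
q_s  = {"Advance2": 10.0, "Teleport": 4.0}   # q_π(s,a) for the same state

v_s = sum(pi_s[a] * q_s[a] for a in pi_s)
print(v_s)  # 7.0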

The optimal state-value function v∗(s) is the maximum value function over all policies. It tells us the maximum possible reward you can extract from the system:

v∗(s) = max_π v_π(s)

The optimal action-value function q∗(s,a) is the maximum action-value function over all policies. It tells us the maximum possible reward you can extract from the system starting at state s and taking action a:

q∗(s,a) = max_π q_π(s,a)
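Once q∗ is known, acting optimally amounts to picking the highest-valued action in each state, π∗(s) = argmax_a q∗(s,a). The sketch below assumes some q∗ values (and a hypothetical Advance1 action) purely for illustration:

# Greedy policy extraction from q*: π*(s) = argmax_a q*(s,a).
# The q-values (and the action 'Advance1') are assumed for illustration.
q_star = {
    "Stage1": {"Chores": 2.0, "Advance1": 5.0},
    "Stage2": {"Advance2": 8.0, "Teleport": 3.0},
}

optimal_policy = {s: max(q, key=q.get) for s, q in q_star.items()}
print(optimal_policy)  # {'Stage1': 'Advance1', 'Stage2': 'Advance2'}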
