Markov Decision Processes &
Reinforcement Learning
Megan Smith
Lehigh University, Fall 2006
Outline
Stochastic Process
Markov Property
Markov Chain
Markov Decision Process
Reinforcement Learning
RL Techniques
Example Applications
Stochastic Process
Quick definition: A Random Process
Often viewed as a collection of
indexed random variables
[Image: A. A. Markov, http://en.wikipedia.org/wiki/Image:AAMarkov.jpg]
Useful to us: Set of states with
probabilities of being in those states
indexed over time
We’ll deal with discrete stochastic
processes
Stochastic Process Example
Classic: Random Walk
Start at state X_0 at time t_0
At time t_i, take a step Z_i where P(Z_i = -1) = p and P(Z_i = 1) = 1 - p
At time t_i, the state is X_i = X_0 + Z_1 + … + Z_i
[Image: http://en.wikipedia.org/wiki/Image:Random_Walk_example.png]
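A minimal simulation of this walk (a sketch; the step probability p and number of steps are illustrative choices, not from the slides):

import random

def random_walk(p=0.5, steps=20, x0=0):
    """Simulate a 1-D random walk: each step is -1 with probability p, +1 otherwise."""
    x = x0
    path = [x]
    for _ in range(steps):
        x += -1 if random.random() < p else 1
        path.append(x)
    return path

print(random_walk())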
Markov Property
Also known as the “memoryless” property
A stochastic process has the Markov property if the probability that state X_{n+1} takes any given value depends only on the current state X_n
Whether this holds depends heavily on how the states are described
Markov Property Example
Checkers:
Current State: The current configuration
of the board
Contains all information needed for
transition to next state
Thus, each configuration can be said to
have the Markov property
Markov Chain
Discrete-time stochastic process with the Markov property
Industry example: Google’s PageRank algorithm
PageRank is a probability distribution representing the likelihood that random link-following ends up on a given page
http://en.wikipedia.org/wiki/PageRank
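A sketch of the underlying idea (not Google’s actual implementation): approximate the stationary distribution of a small link graph by repeatedly applying the transition rule. The three-page graph and damping factor below are illustrative.

# Hypothetical 3-page link graph; the rank is the stationary distribution
# of a random surfer who follows links with probability d and otherwise
# jumps to a random page.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)
d = 0.85  # damping factor (illustrative)

rank = {p: 1.0 / len(pages) for p in pages}
for _ in range(50):  # power iteration
    new_rank = {p: (1 - d) / len(pages) for p in pages}
    for page, outgoing in links.items():
        for target in outgoing:
            new_rank[target] += d * rank[page] / len(outgoing)
    rank = new_rank

print(rank)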
Markov Decision Process (MDP)
Discrete time stochastic control
process
Extension of Markov chains
Differences:
Addition of actions (choice)
Addition of rewards (motivation)
If the actions are fixed, an MDP
reduces to a Markov chain
Description of MDPs
Tuple (S, A, P_a(·, ·), R(·))
S -> state space
A -> action space
P_a(s, s') = Pr(s_{t+1} = s' | s_t = s, a_t = a)
R(s) = immediate reward at state s
Goal is to maximize some cumulative
function of the rewards
Finite MDPs have finite state and
action spaces
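A minimal way to hold such a tuple in code (a sketch; the field names and layout are assumptions, not from the slides):

from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]                               # S
    actions: List[str]                              # A
    transitions: Dict[Tuple[str, str, str], float]  # P_a(s, s') keyed by (s, a, s')
    rewards: Dict[str, float]                       # R(s)
    gamma: float = 0.9                              # discount factor for cumulative reward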
Simple MDP Example
Recycling MDP Robot
Can search for a trashcan, wait for someone to bring a trashcan, or go home and recharge its battery
Has two energy levels – high and low
Searching runs down the battery, waiting does not, and a depleted battery has a very low reward
Transition Probabilities
s = s_t   s' = s_{t+1}   a = a_t    P^a_{ss'}   R^a_{ss'}
high      high           search     α           R^search
high      low            search     1 - α       R^search
low       high           search     1 - β       -3
low       low            search     β           R^search
high      high           wait       1           R^wait
high      low            wait       0           R^wait
low       high           wait       0           R^wait
low       low            wait       1           R^wait
low       high           recharge   1           0
low       low            recharge   0           0
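The table can be written directly as data (a sketch; the numeric values chosen for α, β, R^search, and R^wait are placeholders):

# (state, action) -> list of (next_state, probability, reward)
alpha, beta = 0.8, 0.6          # placeholder probabilities
r_search, r_wait = 2.0, 1.0     # placeholder rewards

transitions = {
    ("high", "search"):   [("high", alpha, r_search), ("low", 1 - alpha, r_search)],
    ("low",  "search"):   [("high", 1 - beta, -3.0), ("low", beta, r_search)],
    ("high", "wait"):     [("high", 1.0, r_wait)],
    ("low",  "wait"):     [("low", 1.0, r_wait)],
    ("low",  "recharge"): [("high", 1.0, 0.0)],
}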
Transition Graph
[Figure: transition graph of the recycling robot, drawn with state nodes and action nodes]
Solution to an MDP = Policy π
Gives the action to take from a given state, regardless of history
Computed as two arrays indexed by state:
V – the value function, the expected discounted sum of rewards obtained by following the policy
π – the action to be taken in each state (the policy)
Two basic steps, repeated until convergence:
1. π(s) := argmax_a ∑_{s'} P_a(s, s') V(s')
2. V(s) := R(s) + γ ∑_{s'} P_{π(s)}(s, s') V(s')
Variants
The two basic steps can be combined and reordered in different ways:
Value Iteration
Policy Iteration
Modified Policy Iteration
Prioritized Sweeping
Value Iteration
V(s) := R(s) + γ max_a ∑_{s'} P_a(s, s') V(s')

k   V_k(PU)   V_k(PF)   V_k(RU)   V_k(RF)
1   0         0         10        10
2   0         4.5       14.5      19
3   2.03      8.55      18.55     24.18
4   4.76      11.79     19.26     29.23
5   7.45      15.30     20.81     31.82
6   10.23     17.67     22.72     33.68
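A generic value-iteration sketch over the (state, action)-keyed transitions structure used for the recycling robot above; note that rewards here are attached to transitions, as in that table, rather than to states as in the equation (the discount factor and iteration count are illustrative):

def value_iteration(transitions, states, gamma=0.9, iters=100):
    """Repeatedly back up V(s) <- max_a sum_{s'} P_a(s,s') [R_a(s,s') + gamma V(s')]."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {
            s: max(
                sum(p * (r + gamma * V[s2]) for s2, p, r in outcomes)
                for (state, _a), outcomes in transitions.items()
                if state == s
            )
            for s in states
        }
    return V

# e.g. value_iteration(transitions, ["high", "low"]) with the recycling-robot data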
Why So Interesting?
If the transition probabilities are
known, this becomes a
straightforward computational
problem, however…
If the transition probabilities are
unknown, then this is a problem for
reinforcement learning.
Typical Agent
In reinforcement learning (RL), the
agent observes a state and takes an
action.
Afterward, the agent receives a
reward.
Mission: Optimize Reward
Rewards are calculated in the
environment
Used to teach the agent how to reach
a goal state
Must signal what we ultimately want
achieved, not necessarily subgoals
May be discounted over time
In general, seek to maximize the
expected return
Value Functions
V^π is the state-value function for policy π (How good is it to be in this state?)
V^π is the unique solution to its Bellman equation
The Bellman equation expresses the relationship between the value of a state and the values of its successor states
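For reference, the Bellman equation for V^π in the Sutton & Barto notation used above is:

V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^\pi(s') \right]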
Another Value Function
Q^π defines the value of taking action a in state s under policy π
It is the expected return starting from s, taking action a, and thereafter following policy π
Called the action-value function for policy π
[Figure: backup diagrams for (a) V^π and (b) Q^π]
Dynamic Programming
Classically, a collection of algorithms
used to compute optimal policies
given a perfect model of environment
as an MDP
The classical view is not so useful in
practice since we rarely have a
perfect environment model
Provides foundation for other
methods
Not practical for large problems
DP Continued…
Use value functions to organize and
structure the search for good policies.
Turn Bellman equations into update rules
Iterative policy evaluation using full
backups
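A sketch of iterative policy evaluation with full backups, assuming the (state, action)-keyed transitions structure used earlier and a deterministic policy given as a dict from state to action (the tolerance is illustrative):

def policy_evaluation(policy, transitions, states, gamma=0.9, theta=1e-6):
    """Apply the Bellman backup for V^pi over all states until the values stop changing."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            outcomes = transitions[(s, policy[s])]
            v_new = sum(p * (r + gamma * V[s2]) for s2, p, r in outcomes)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V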
Policy Improvement
When should we change the policy?
If picking a new action a in state s and thereafter following the current policy π is at least as good as following π from s (i.e., Q^π(s, a) >= V^π(s)), then always picking a in s yields a policy π' with V^{π'} >= V^π – at least as good overall
This results from the policy improvement theorem
Policy Iteration
Keep improving the policy π and recalculating V^π
A finite MDP has a finite number of policies, so convergence is guaranteed in a finite number of iterations
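A sketch that combines the two pieces above, using the policy_evaluation function and the transitions layout sketched earlier (the greedy improvement step is written out explicitly):

def policy_iteration(transitions, states, actions, gamma=0.9):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    # Start from an arbitrary policy: the first legal action in each state
    policy = {s: next(a for a in actions if (s, a) in transitions) for s in states}
    while True:
        V = policy_evaluation(policy, transitions, states, gamma)
        stable = True
        for s in states:
            best_a = max(
                (a for a in actions if (s, a) in transitions),
                key=lambda a: sum(p * (r + gamma * V[s2])
                                  for s2, p, r in transitions[(s, a)]),
            )
            if best_a != policy[s]:
                policy[s], stable = best_a, False
        if stable:
            return policy, V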
Remember Value Iteration?
Used to truncate policy iteration by combining
one sweep of policy evaluation and one of policy
improvement in each of its sweeps.
Monte Carlo Methods
Requires only episodic experience –
on-line or simulated
Based on averaging sample returns
Value estimates and policies only
changed at the end of each episode,
not on a step-by-step basis
Policy Evaluation
Compute average returns as the episode runs
Two methods: first-visit and every-visit
First-visit is the most widely studied
[Figure: first-visit MC method]
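A first-visit Monte Carlo evaluation sketch; the episode format – a list of (s_t, r_{t+1}) pairs – is an assumed convention, not something specified on the slides:

from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """Estimate V(s) by averaging the returns that follow the first visit to s in each episode."""
    returns = defaultdict(list)
    for episode in episodes:             # episode = [(s_t, r_{t+1}), ...]
        G = 0.0
        first_return = {}
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G            # return following time t
            first_return[s] = G          # overwritten by earlier visits, so the first visit wins
        for s, g in first_return.items():
            returns[s].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}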
Estimation of Action Values
State values are not enough without a model – we need action values as well
Q^π(s, a): the expected return when starting in state s, taking action a, and thereafter following policy π
Exploration vs. Exploitation
Exploring starts
Example Monte Carlo Algorithm
First-visit Monte Carlo assuming exploring starts
Another MC Algorithm
On-line, first-visit, ε-greedy MC without exploring starts
Temporal-Difference Learning
Central and novel to reinforcement
learning
Combines Monte Carlo and DP
methods
Can learn from experience w/o a
model – like MC
Updates estimates based on other
learned estimates (bootstraps) – like
DP
TD(0)
Simplest TD method
Uses sample backup from single successor
state or state-action pair instead of full
backup of DP methods
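The update itself is not written out on the slide; in standard notation it is V(s_t) <- V(s_t) + α[r_{t+1} + γ V(s_{t+1}) - V(s_t)]. A one-step sketch (the step size α is illustrative):

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Single TD(0) backup from one observed transition (s, r, s_next)."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V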
SARSA – On-policy Control
Uses the quintuple of events (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})
Continually estimate Q^π while changing π
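A sketch of the on-policy SARSA update for a single step, with Q assumed to be a dict keyed by (state, action) and α, γ illustrative:

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy backup: bootstrap from the action actually chosen next."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
    return Q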
Q-Learning – Off-policy Control
Learned action-value function, Q, directly
approximates Q*, the optimal action-value
function, independent of policy being
followed
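By contrast, the Q-learning update bootstraps from the best next action rather than the one the behavior policy takes (a sketch, same assumed Q layout):

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy backup: bootstrap from the greedy next action, not the action taken."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q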
Case Study
Job-shop Scheduling
Temporal and resource constraints
Find constraint-satisfying schedules of
short duration
In its general form, NP-complete
NASA Space Shuttle Payload
Processing Problem (SSPPP)
Schedule tasks required for installation and
testing of shuttle cargo bay payloads
Typical: 2-6 shuttle missions, each
requiring 34-164 tasks
Zhang and Dietterich (1995, 1996; Zhang,
1996)
First successful instance of RL applied in
plan-space
states = complete plans
actions = plan modifications
SSPPP – continued…
Each state was an entire schedule
Two types of actions:
REASSIGN-POOL operators – reassign a resource to a different pool
MOVE operators – move a task to the first earlier or later time at which its resource constraints are satisfied
Small negative reward for each step
A resource dilation factor (RDF) formula rewarded the final schedule’s duration
Even More SSPPP…
Used TD(λ) to learn the value function
Actions selected by decreasing ε-greedy
policy with one-step lookahead
Function approximation used multilayer
neural networks
Training generally took 10,000 episodes
Each resulting network represented a different scheduling algorithm – not a schedule for a specific instance!
RL and CBR
Example: CBR used to store various
policies and RL used to learn and
modify those policies
Ashwin Ram and Juan Carlos Santamaría, 1993
Autonomous Robotic Control
Job shop scheduling: RL used to
repair schedules, CBR used to
determine which repair to make
Similar methods can be used for IDSS
References
Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.
Stochastic Processes, www.hanoivn.net
http://en.wikipedia.org/wiki/PageRank
http://en.wikipedia.org/wiki/Markov_decision_process
Zeng, D. and Sycara, K. Using Case-Based Reasoning as a Reinforcement Learning Framework for Optimization with Changing Criteria, 1995.