Markov Decision Processes &
Reinforcement Learning
Megan Smith
Lehigh University, Fall 2006
Outline
Stochastic Process
Markov Property
Markov Chain
Markov Decision Process
Reinforcement Learning
RL Techniques
Example Applications
Stochastic Process
Quick definition: A Random Process
Often viewed as a collection of
indexed random variables
[Image: A. A. Markov, http://en.wikipedia.org/wiki/Image:AAMarkov.jpg]
Useful to us: Set of states with
probabilities of being in those states
indexed over time
We’ll deal with discrete stochastic
processes
Stochastic Process Example
Classic: Random Walk
Start at state X_0 at time t_0
At time t_i, take a step Z_i where P(Z_i = -1) = p and P(Z_i = 1) = 1 - p
At time t_i, the state is X_i = X_0 + Z_1 + … + Z_i
[Image: http://en.wikipedia.org/wiki/Image:Random_Walk_example.png]
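A minimal simulation of this walk (a sketch; the step probability p and number of steps are illustrative choices, not from the slides):

import random

def random_walk(p=0.5, steps=20, x0=0):
    """Simulate a 1-D random walk: each step is -1 with probability p, +1 otherwise."""
    x = x0
    path = [x]
    for _ in range(steps):
        x += -1 if random.random() < p else 1
        path.append(x)
    return path

print(random_walk())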
Markov Property
Also known as the “memoryless” property
A stochastic process has the Markov property if the probability that state X_{n+1} takes any given value depends only on the current state X_n
Whether this holds depends heavily on how the states are described
Markov Property Example
Checkers:
Current State: The current configuration
of the board
Contains all information needed for
transition to next state
Thus, each configuration can be said to
have the Markov property
Markov Chain
Discrete-time stochastic process with the Markov property
Industry example: Google’s PageRank algorithm
PageRank is a probability distribution representing the likelihood that random link-following ends up on a given page
http://en.wikipedia.org/wiki/PageRank
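A sketch of the underlying idea (not Google’s actual implementation): approximate the stationary distribution of a small link graph by repeatedly applying the transition rule. The three-page graph and damping factor below are illustrative.

# Hypothetical 3-page link graph; the rank is the stationary distribution
# of a random surfer who follows links with probability d and otherwise
# jumps to a random page.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)
d = 0.85  # damping factor (illustrative)

rank = {p: 1.0 / len(pages) for p in pages}
for _ in range(50):  # power iteration
    new_rank = {p: (1 - d) / len(pages) for p in pages}
    for page, outgoing in links.items():
        for target in outgoing:
            new_rank[target] += d * rank[page] / len(outgoing)
    rank = new_rank

print(rank)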
Markov Decision Process (MDP)
Discrete time stochastic control
process
Extension of Markov chains
Differences:
Addition of actions (choice)
Addition of rewards (motivation)
If the actions are fixed, an MDP
reduces to a Markov chain
Description of MDPs
Tuple (S, A, P_a(·, ·), R(·))
S -> state space
A -> action space
P_a(s, s') = Pr(s_{t+1} = s' | s_t = s, a_t = a)
R(s) = immediate reward at state s
Goal is to maximize some cumulative
function of the rewards
Finite MDPs have finite state and
action spaces
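A minimal way to hold such a tuple in code (a sketch; the field names and layout are assumptions, not from the slides):

from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]                               # S
    actions: List[str]                              # A
    transitions: Dict[Tuple[str, str, str], float]  # P_a(s, s') keyed by (s, a, s')
    rewards: Dict[str, float]                       # R(s)
    gamma: float = 0.9                              # discount factor for cumulative reward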
Simple MDP Example
Recycling MDP Robot
Can search for a trashcan, wait for someone to bring a trashcan, or go home and recharge its battery
Has two energy levels – high and low
Searching runs down the battery, waiting does not, and a depleted battery has a very low reward
Transition Probabilities
s = s_t   s' = s_{t+1}   a = a_t    P^a_{ss'}   R^a_{ss'}
high      high           search     α           R^search
high      low            search     1 - α       R^search
low       high           search     1 - β       -3
low       low            search     β           R^search
high      high           wait       1           R^wait
high      low            wait       0           R^wait
low       high           wait       0           R^wait
low       low            wait       1           R^wait
low       high           recharge   1           0
low       low            recharge   0           0
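The table can be written directly as data (a sketch; the numeric values chosen for α, β, R^search, and R^wait are placeholders):

# (state, action) -> list of (next_state, probability, reward)
alpha, beta = 0.8, 0.6          # placeholder probabilities
r_search, r_wait = 2.0, 1.0     # placeholder rewards

transitions = {
    ("high", "search"):   [("high", alpha, r_search), ("low", 1 - alpha, r_search)],
    ("low",  "search"):   [("high", 1 - beta, -3.0), ("low", beta, r_search)],
    ("high", "wait"):     [("high", 1.0, r_wait)],
    ("low",  "wait"):     [("low", 1.0, r_wait)],
    ("low",  "recharge"): [("high", 1.0, 0.0)],
}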
Transition Graph
[Figure: transition graph of the recycling robot, drawn with state nodes and action nodes]
Solution to an MDP = Policy π
Gives the action to take from a given state, regardless of history
Computed as two arrays indexed by state:
V – the value function, the expected discounted sum of rewards obtained by following the policy
π – the action to be taken in each state (the policy)
Two basic steps, repeated until convergence:
1. π(s) := argmax_a ∑_{s'} P_a(s, s') V(s')
2. V(s) := R(s) + γ ∑_{s'} P_{π(s)}(s, s') V(s')
Variants
The two basic steps can be combined and reordered in different ways:
Value Iteration
Policy Iteration
Modified Policy Iteration
Prioritized Sweeping
Value Iteration
V(s) := R(s) + γ max_a ∑_{s'} P_a(s, s') V(s')

k   V_k(PU)   V_k(PF)   V_k(RU)   V_k(RF)
1   0         0         10        10
2   0         4.5       14.5      19
3   2.03      8.55      18.55     24.18
4   4.76      11.79     19.26     29.23
5   7.45      15.30     20.81     31.82
6   10.23     17.67     22.72     33.68
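A generic value-iteration sketch over the (state, action)-keyed transitions structure used for the recycling robot above; note that rewards here are attached to transitions, as in that table, rather than to states as in the equation (the discount factor and iteration count are illustrative):

def value_iteration(transitions, states, gamma=0.9, iters=100):
    """Repeatedly back up V(s) <- max_a sum_{s'} P_a(s,s') [R_a(s,s') + gamma V(s')]."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {
            s: max(
                sum(p * (r + gamma * V[s2]) for s2, p, r in outcomes)
                for (state, _a), outcomes in transitions.items()
                if state == s
            )
            for s in states
        }
    return V

# e.g. value_iteration(transitions, ["high", "low"]) with the recycling-robot data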
Why So Interesting?
If the transition probabilities are
known, this becomes a
straightforward computational
problem, however…
If the transition probabilities are
unknown, then this is a problem for
reinforcement learning.
Typical Agent
In reinforcement learning (RL), the
agent observes a state and takes an
action.
Afterward, the agent receives a
reward.
Mission: Optimize Reward
Rewards are calculated in the
environment
Used to teach the agent how to reach
a goal state
Must signal what we ultimately want
achieved, not necessarily subgoals
May be discounted over time
In general, seek to maximize the
expected return
Value Functions
V^π is the state-value function for policy π (How good is it to be in this state?)
V^π is the unique solution to its Bellman equation
The Bellman equation expresses the relationship between the value of a state and the values of its successor states
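For reference, the Bellman equation for V^π in the Sutton & Barto notation used above is:

V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^\pi(s') \right]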
Another Value Function
Q^π defines the value of taking action a in state s under policy π
It is the expected return starting from s, taking action a, and thereafter following policy π
Called the action-value function for policy π
[Figure: backup diagrams for (a) V^π and (b) Q^π]
Dynamic Programming
Classically, a collection of algorithms
used to compute optimal policies
given a perfect model of environment
as an MDP
The classical view is not so useful in
practice since we rarely have a
perfect environment model
Provides foundation for other
methods
Not practical for large problems
DP Continued…
Use value functions to organize and
structure the search for good policies.
Turn Bellman equations into update rules
Iterative policy evaluation using full
backups
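A sketch of iterative policy evaluation with full backups, assuming the (state, action)-keyed transitions structure used earlier and a deterministic policy given as a dict from state to action (the tolerance is illustrative):

def policy_evaluation(policy, transitions, states, gamma=0.9, theta=1e-6):
    """Apply the Bellman backup for V^pi over all states until the values stop changing."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            outcomes = transitions[(s, policy[s])]
            v_new = sum(p * (r + gamma * V[s2]) for s2, p, r in outcomes)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V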
Policy Improvement
When should we change the policy?
If picking a new action a in state s and thereafter following the current policy π is at least as good as following π from s (i.e., Q^π(s, a) >= V^π(s)), then always picking a in s yields a policy π' with V^{π'} >= V^π – at least as good overall
This results from the policy improvement theorem
Policy Iteration
Keep improving the policy π and recalculating V^π
A finite MDP has a finite number of policies, so convergence is guaranteed in a finite number of iterations
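A sketch that combines the two pieces above, using the policy_evaluation function and the transitions layout sketched earlier (the greedy improvement step is written out explicitly):

def policy_iteration(transitions, states, actions, gamma=0.9):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    # Start from an arbitrary policy: the first legal action in each state
    policy = {s: next(a for a in actions if (s, a) in transitions) for s in states}
    while True:
        V = policy_evaluation(policy, transitions, states, gamma)
        stable = True
        for s in states:
            best_a = max(
                (a for a in actions if (s, a) in transitions),
                key=lambda a: sum(p * (r + gamma * V[s2])
                                  for s2, p, r in transitions[(s, a)]),
            )
            if best_a != policy[s]:
                policy[s], stable = best_a, False
        if stable:
            return policy, V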
Remember Value Iteration?
Used to truncate policy iteration by combining
one sweep of policy evaluation and one of policy
improvement in each of its sweeps.
Monte Carlo Methods
Requires only episodic experience –
on-line or simulated
Based on averaging sample returns
Value estimates and policies only
changed at the end of each episode,
not on a step-by-step basis
Policy Evaluation
Compute average returns as the episode runs
Two methods: first-visit and every-visit
First-visit is the most widely studied
[Figure: first-visit MC method]
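A first-visit Monte Carlo evaluation sketch; the episode format – a list of (s_t, r_{t+1}) pairs – is an assumed convention, not something specified on the slides:

from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """Estimate V(s) by averaging the returns that follow the first visit to s in each episode."""
    returns = defaultdict(list)
    for episode in episodes:             # episode = [(s_t, r_{t+1}), ...]
        G = 0.0
        first_return = {}
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G            # return following time t
            first_return[s] = G          # overwritten by earlier visits, so the first visit wins
        for s, g in first_return.items():
            returns[s].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}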
Estimation of Action Values
State values are not enough without a model – we need action values as well
Q^π(s, a): the expected return when starting in state s, taking action a, and thereafter following policy π
Exploration vs. Exploitation
Exploring starts
Example Monte Carlo Algorithm
First-visit Monte Carlo assuming exploring starts
Another MC Algorithm
On-line, first-visit, ε-greedy MC without exploring starts
Temporal-Difference Learning
Central and novel to reinforcement
learning
Combines Monte Carlo and DP
methods
Can learn from experience w/o a
model – like MC
Updates estimates based on other
learned estimates (bootstraps) – like
DP
TD(0)
Simplest TD method
Uses sample backup from single successor
state or state-action pair instead of full
backup of DP methods
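The update itself is not written out on the slide; in standard notation it is V(s_t) <- V(s_t) + α[r_{t+1} + γ V(s_{t+1}) - V(s_t)]. A one-step sketch (the step size α is illustrative):

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Single TD(0) backup from one observed transition (s, r, s_next)."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V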
SARSA – On-policy Control
Uses the quintuple of events (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})
Continually estimate Q^π while changing π
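A sketch of the on-policy SARSA update for a single step, with Q assumed to be a dict keyed by (state, action) and α, γ illustrative:

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy backup: bootstrap from the action actually chosen next."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
    return Q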
Q-Learning – Off-policy Control
Learned action-value function, Q, directly
approximates Q*, the optimal action-value
function, independent of policy being
followed
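By contrast, the Q-learning update bootstraps from the best next action rather than the one the behavior policy takes (a sketch, same assumed Q layout):

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy backup: bootstrap from the greedy next action, not the action taken."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q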
Case Study
Job-shop Scheduling
Temporal and resource constraints
Find constraint-satisfying schedules of
short duration
In its general form, NP-complete
NASA Space Shuttle Payload
Processing Problem (SSPPP)
Schedule tasks required for installation and
testing of shuttle cargo bay payloads
Typical: 2-6 shuttle missions, each
requiring 34-164 tasks
Zhang and Dietterich (1995, 1996; Zhang,
1996)
First successful instance of RL applied in
plan-space
states = complete plans
actions = plan modifications
SSPPP – continued…
Each state was an entire schedule
Two types of actions:
REASSIGN-POOL operators – reassign a resource to a different pool
MOVE operators – move a task to the first earlier or later time at which its resource constraints are satisfied
Small negative reward for each step
A resource dilation factor (RDF) formula rewarded the final schedule’s duration
Even More SSPPP…
Used TD(λ) to learn the value function
Actions selected by decreasing ε-greedy
policy with one-step lookahead
Function approximation used multilayer
neural networks
Training generally took 10,000 episodes
Each resulting network represented a different scheduling algorithm – not a schedule for a specific instance!
RL and CBR
Example: CBR used to store various
policies and RL used to learn and
modify those policies
Ashwin Ram and Juan Carlos Santamaría, 1993
Autonomous Robotic Control
Job shop scheduling: RL used to
repair schedules, CBR used to
determine which repair to make
Similar methods can be used for IDSS
References
Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.
Stochastic Processes, www.hanoivn.net
http://en.wikipedia.org/wiki/PageRank
http://en.wikipedia.org/wiki/Markov_decision_process
Zeng, D. and Sycara, K. Using Case-Based Reasoning as a Reinforcement Learning Framework for Optimization with Changing Criteria, 1995.