Chandigarh Engineering College Jhanjeri
Mohali-140307
Department of Computer Science & Engineering
Markov Decision Processes: Reinforcement Learning
An MDP must satisfy the Markov Property.
Markov Property: requires that “the future is independent
of the past given the present”.
Property: a state Sₜ is Markov if and only if:

    ℙ[Sₜ₊₁ | Sₜ] = ℙ[Sₜ₊₁ | S₁, S₂, …, Sₜ]

Put simply, the state Sₜ captures all the relevant information
from the history: S₁, S₂, …, Sₜ₋₁ can be discarded and we still get the
same state transition probability to the next state Sₜ₊₁.
State Transition Probability: the state transition
probability tells us, given that we are in state s, the probability
that the next state will be s′:

    P_ss′ = ℙ[Sₜ₊₁ = s′ | Sₜ = s]

In the above equation, P without the double lines represents the state transition from
state s to state s′, while ℙ with the double lines represents the probability of going from state s to s′.
We can also define all state transitions in terms of a State Transition
Matrix P, where each row tells us the transition probabilities from one
state to all possible successor states.
State transition matrix:

    P = [ P₁₁ … P₁ₙ ]
        [  ⋮  ⋱  ⋮  ]
        [ Pₙ₁ … Pₙₙ ]

where row i lists the probabilities of moving from state i to each of the n possible successor states.
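As a quick sketch, a state transition matrix can be stored as an ordinary 2-D array; the probabilities below are made up for illustration, and the only property checked is that every row sums to 1:

    import numpy as np

    # A minimal sketch of a state transition matrix. The probabilities are
    # illustrative, not taken from any figure. Row i holds the probabilities
    # of moving from state i to every successor state.
    P = np.array([
        [0.1, 0.9, 0.0],   # from state 1
        [0.0, 0.2, 0.8],   # from state 2
        [0.0, 0.0, 1.0],   # from state 3 (absorbing)
    ])

    # Every row of a valid transition matrix sums to 1.
    assert np.allclose(P.sum(axis=1), 1.0)

    # Probability of moving from state 1 to state 2:
    print(P[0, 1])  # 0.9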
Markov Process / Markov Chain
The first and simplest of these models is the Markov process.
Markov Process / Markov Chain: A sequence of random
states S₁, S₂, … with the Markov property.
Below is an illustration of a Markov Chain where each node represents a
state, with a probability of transitioning from one state to the next,
and where Stop represents a terminal state.
Markov Chain
We can take a sample episode that goes through the chain and ends at
the terminal state. An example episode would be to go
from Stage1 to Stage2 to Win to Stop. Below are a few
sample episodes:
- Stage1 Stage2 Win Stop
- Stage1 Stage2 Teleport Stage2 Win Stop
- Stage1 Pause Stage1 Stage2 Win Stop
The above Markov Chain has the following Transition Probability
Matrix:
For each state, the sum of the transition probabilities out of that
state equals 1.
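The sample episodes above can be generated programmatically by repeatedly drawing the next state from the current state's transition probabilities. A small sketch, assuming illustrative probabilities (the exact values would be read off the transition probability matrix in the figure):

    import random

    # Sketch: sampling episodes from the Markov Chain above. The transition
    # probabilities are illustrative assumptions, not the figure's values.
    transitions = {
        "Stage1":   [("Stage2", 0.7), ("Pause", 0.3)],
        "Stage2":   [("Win", 0.6), ("Teleport", 0.4)],
        "Pause":    [("Stage1", 1.0)],
        "Teleport": [("Stage2", 0.4), ("Stage1", 0.6)],
        "Win":      [("Stop", 1.0)],
    }

    def sample_episode(start="Stage1"):
        """Follow the chain from `start` until the terminal state Stop."""
        state, episode = start, [start]
        while state != "Stop":
            successors, probs = zip(*transitions[state])
            state = random.choices(successors, weights=probs)[0]
            episode.append(state)
        return episode

    print(sample_episode())  # e.g. ['Stage1', 'Stage2', 'Win', 'Stop']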
Markov Reward Process
In the above Markov Chain we did not have a value associated with
being in a state, so there was no goal to work towards. A Markov Reward
Process is a Markov chain with reward values.
Our goal is to maximise the return. The return Gₜ is the total
discounted reward from time-step t.
Equation to calculate the return:

    Gₜ = Rₜ₊₁ + γRₜ₊₂ + γ²Rₜ₊₃ + … = Σ γᵏ Rₜ₊ₖ₊₁  (summing over k = 0, 1, 2, …)
The discount factor γ is a value (that can be chosen) between 0
and 1. A γ closer to 0 leads to short-sighted evaluation,
while a value closer to 1 favours far-sighted evaluation.
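A quick sketch of how a return is computed from a sampled reward sequence; the rewards and the two γ values below are made up purely to show the short-sighted versus far-sighted behaviour:

    def discounted_return(rewards, gamma):
        """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
        g = 0.0
        for k, r in enumerate(rewards):
            g += (gamma ** k) * r
        return g

    # Illustrative rewards received after time-step t.
    rewards = [-1, -1, 10]

    print(discounted_return(rewards, gamma=0.1))  # short-sighted: -1.0
    print(discounted_return(rewards, gamma=0.9))  # far-sighted:  ~6.2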
Markov Reward Process
Value Function for MRPs
State Value Function v(s): gives the long-term value of
state s. It is the expected return starting from state s:

    v(s) = 𝔼[Gₜ | Sₜ = s]
One way to view this: starting from state s and running through
various sample episodes from s, what is our expected return? We want to
prefer states that give more total reward.
state-values for MRP with γ=1
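A naive way to see the expectation in v(s) is to sample many episodes from s and average their returns. The sketch below does exactly that, assuming made-up transition probabilities and per-state rewards (the true values would come from the MRP figure):

    import random

    # Naive sketch: estimate v(s) by averaging the returns of many sampled
    # episodes that start in s. Transitions and rewards are illustrative.
    transitions = {"Stage1": [("Stage2", 0.7), ("Pause", 0.3)],
                   "Stage2": [("Win", 0.6), ("Teleport", 0.4)],
                   "Pause": [("Stage1", 1.0)],
                   "Teleport": [("Stage2", 0.4), ("Stage1", 0.6)],
                   "Win": [("Stop", 1.0)]}
    rewards = {"Stage1": -2, "Stage2": -2, "Pause": -1, "Teleport": -1,
               "Win": 10, "Stop": 0}

    def estimate_v(s, gamma=1.0, n_episodes=10_000):
        total = 0.0
        for _ in range(n_episodes):
            state, g, discount = s, 0.0, 1.0
            while state != "Stop":
                g += discount * rewards[state]   # reward collected in `state`
                discount *= gamma
                successors, probs = zip(*transitions[state])
                state = random.choices(successors, weights=probs)[0]
            total += g
        return total / n_episodes

    print(estimate_v("Stage1"))   # expected return from Stage1 with gamma = 1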
Bellman Equation for MRPs
The value function can be decomposed into two parts:
Immediate reward: Rₜ₊₁
Discounted value of the successor state: γ v(Sₜ₊₁)
Using the state-value function and the return defined above, we can
rewrite the state-value function as:
Updated Bellman state-value equation:

    v(s) = 𝔼[Rₜ₊₁ + γ v(Sₜ₊₁) | Sₜ = s] = R_s + γ Σ_s′ P_ss′ v(s′)

where R_s is the expected immediate reward in state s and the sum runs over all successor states s′.
Alternatively this can be written in matrix form:

    v = R + γPv
Using this equation we can calculate the state value for each state.
Since the model above (“state-values for MRP with γ=1”) is small, we
can calculate its state values by solving a set of simultaneous
equations with the updated state-value function.
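Since the matrix form is a linear system, a small MRP can be solved directly: v = R + γPv rearranges to v = (I − γP)⁻¹R. A minimal sketch, assuming made-up rewards and transition probabilities rather than the values from the figure:

    import numpy as np

    # States for the sketch: Stage1, Stage2, Win. P[i, j] is the probability
    # of moving from state i to state j and R[i] is the reward for state i;
    # both are made-up numbers, not the values from the figure.
    P = np.array([[0.0, 1.0, 0.0],
                  [0.2, 0.0, 0.8],
                  [0.0, 0.0, 1.0]])     # Win is absorbing
    R = np.array([-2.0, -2.0, 0.0])
    gamma = 0.9                         # gamma < 1 keeps (I - gamma*P) invertible

    # Bellman equation in matrix form: v = R + gamma * P v
    # Direct solution:                  v = (I - gamma * P)^-1 R
    v = np.linalg.solve(np.eye(3) - gamma * P, R)
    print(v)                            # state values for Stage1, Stage2, Win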
Solving the equation directly is simple for small MRPs but becomes
computationally expensive for large ones. In order to solve large MRPs we
require other techniques such as Dynamic Programming, Monte-Carlo
evaluation and Temporal-Difference learning, which will be discussed
later.
Markov Decision Process
A Markov Decision Process is an extension of a Markov Reward
Process, as it contains decisions that an agent must make. All states in
the environment are Markov.
In a Markov Decision Process we now have more control over which
states we go to. For example, in the MDP below, if we choose to take the
action Teleport we will end up back in state Stage2 40% of the time
and in Stage1 60% of the time. Other state transitions occur with 100%
probability when the corresponding action is selected; for example, taking the
action Advance2 from Stage2 will take us to Win.
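One way to hold the action-dependent dynamics is a table mapping (state, action) pairs to a distribution over successor states. A small sketch, assuming Teleport is taken from Stage2 (as the wording “back in state Stage2” suggests); any remaining entries would be read off the MDP diagram:

    # Action-dependent dynamics P(s' | s, a) keyed by (state, action).
    # The Teleport probabilities follow the text; the Advance2 entry is
    # the deterministic transition to Win described above.
    mdp_dynamics = {
        ("Stage2", "Teleport"): {"Stage2": 0.4, "Stage1": 0.6},
        ("Stage2", "Advance2"): {"Win": 1.0},
    }

    def next_state_distribution(state, action):
        """Look up the distribution over successor states for (state, action)."""
        return mdp_dynamics[(state, action)]

    print(next_state_distribution("Stage2", "Teleport"))  # {'Stage2': 0.4, 'Stage1': 0.6}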
Policies
A policy π is a distribution over actions given states. It fully
defines the behaviour of an agent. MDP policies depend on the
current state and not the history.
Policy function:

    π(a|s) = ℙ[Aₜ = a | Sₜ = s]

Policies give a mapping from states to action probabilities: if I am in state s, the
policy gives the probability of taking each action from that state. For example, if
we have the policy π(Chores|Stage1)=100%, this means the agent will
take the action Chores 100% of the time when in state Stage1.
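A minimal sketch of a policy as a table of per-state action probabilities; only the action names Chores, Advance2 and Teleport come from the text, and the probabilities for Stage2 are made up:

    import random

    # A stochastic policy as a per-state distribution over actions.
    policy = {
        "Stage1": {"Chores": 1.0},                     # pi(Chores | Stage1) = 100%
        "Stage2": {"Advance2": 0.8, "Teleport": 0.2},  # illustrative split
    }

    def sample_action(state):
        """Draw an action according to pi(. | state)."""
        actions, probs = zip(*policy[state].items())
        return random.choices(actions, weights=probs)[0]

    print(sample_action("Stage1"))  # always 'Chores'
    print(sample_action("Stage2"))  # 'Advance2' about 80% of the time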
Value Function for MDP
Since we now take actions, the expected return differs depending on
how we behave.
The state-value function v_π(s) of an MDP is the expected
return starting from state s, and then following policy π:

    v_π(s) = 𝔼_π[Gₜ | Sₜ = s]

The state-value function tells us how good it is to be in state s when following
policy π.
The action-value function q_π(s,a) is the expected return
starting from state s, taking action a, and then following
policy π:

    q_π(s,a) = 𝔼_π[Gₜ | Sₜ = s, Aₜ = a]

The action-value function tells us how good it is to take a particular action
from a particular state. It gives us an idea of which action we should take
in each state.
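A rough rollout sketch of the definition above: estimate q_π(s,a) by taking action a once in state s, following the policy π afterwards, and averaging the discounted returns. The dynamics, rewards, the Finish action and the deterministic policy are all assumptions made up for the sketch:

    import random

    # Rollout sketch for q_pi(s, a): force action a in state s once, then
    # follow policy pi; average the discounted returns over many rollouts.
    dynamics = {  # P(s' | s, a), illustrative
        ("Stage1", "Chores"):   {"Stage2": 1.0},
        ("Stage2", "Advance2"): {"Win": 1.0},
        ("Stage2", "Teleport"): {"Stage2": 0.4, "Stage1": 0.6},
        ("Win",    "Finish"):   {"Stop": 1.0},
    }
    rewards = {"Stage1": -2, "Stage2": -2, "Win": 10, "Stop": 0}              # R(s), made up
    policy  = {"Stage1": "Chores", "Stage2": "Advance2", "Win": "Finish"}     # deterministic pi

    def step(state, action):
        successors, probs = zip(*dynamics[(state, action)].items())
        return random.choices(successors, weights=probs)[0]

    def estimate_q(state, action, gamma=0.9, n_rollouts=5000):
        total = 0.0
        for _ in range(n_rollouts):
            s, a, g, discount = state, action, 0.0, 1.0   # first action is forced
            while s != "Stop":
                g += discount * rewards[s]
                s = step(s, a)
                discount *= gamma
                a = policy.get(s)                         # afterwards, follow pi
            total += g
        return total / n_rollouts

    print(estimate_q("Stage2", "Teleport"))  # compare with estimate_q("Stage2", "Advance2")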
The optimal state-value function v∗(s) is the maximum value
function over all policies, v∗(s) = max_π v_π(s). It tells us the maximum possible reward
you can extract from the system.
The optimal action-value function q∗(s,a) is the maximum
action-value function over all policies, q∗(s,a) = max_π q_π(s,a). It tells us the
maximum possible reward you can extract from the system starting
at state s and taking action a.
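As a closing sketch, one standard way to compute v∗ and q∗ when the MDP is small and fully known is value iteration, a Dynamic Programming method of the kind mentioned above as a later topic. The states, actions, rewards and probabilities below are illustrative assumptions, not the values from the MDP figure:

    # Value iteration sketch on a tiny, fully known MDP.
    states  = ["Stage1", "Stage2", "Win"]
    actions = {"Stage1": ["Chores"], "Stage2": ["Advance2", "Teleport"], "Win": []}
    dynamics = {  # P(s' | s, a)
        ("Stage1", "Chores"):   {"Stage2": 1.0},
        ("Stage2", "Advance2"): {"Win": 1.0},
        ("Stage2", "Teleport"): {"Stage2": 0.4, "Stage1": 0.6},
    }
    reward = {("Stage1", "Chores"): -2, ("Stage2", "Advance2"): 10,
              ("Stage2", "Teleport"): -1}          # R(s, a), made up
    gamma = 0.9

    v = {s: 0.0 for s in states}                   # v*(s), initialised to 0
    for _ in range(100):                           # sweep until (roughly) converged
        for s in states:
            if not actions[s]:
                continue                           # terminal state keeps value 0
            v[s] = max(reward[(s, a)] +
                       gamma * sum(p * v[s2] for s2, p in dynamics[(s, a)].items())
                       for a in actions[s])

    # q*(s, a) = R(s, a) + gamma * sum over s' of P(s'|s,a) * v*(s')
    q = {(s, a): reward[(s, a)] +
                 gamma * sum(p * v[s2] for s2, p in dynamics[(s, a)].items())
         for s in states for a in actions[s]}

    print(v)   # optimal state values
    print(q)   # optimal action values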