Reinforcement Learning

2. Dynamic Programming

Thomas Bonald

2024 – 2025
Markov decision process → Model

At time t = 0, 1, 2, . . ., the agent in state st takes action at and:


▶ receives reward rt
▶ moves to state st+1
The rewards and transitions are stochastic in general.
Some states may be terminal.
Definition
A Markov decision process (MDP) is defined by:
▶ the reward distribution, p(rt |st , at )
▶ the state transition distribution, p(st+1 | st , at )

We denote by S the set of non-terminal states.


Policy → Agent

Definition
Given a Markov decision process, the policy of an agent defines
the action taken in each non-terminal state:

∀s ∈ S, π(a|s) = P(at = a| st = s)

When deterministic, we use the simpler notation a = π(s)


Gain → Objective

Definition
Given the rewards r0, r1, r2, . . ., we define the gain as:

G = r0 + γ r1 + γ² r2 + . . .

The parameter γ ∈ [0, 1] is the discount factor.
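
For intuition, a minimal Python sketch computing the gain as a discounted sum, with an arbitrary reward sequence and discount factor chosen for illustration:

def gain(rewards, gamma):
    # G = r0 + gamma * r1 + gamma^2 * r2 + ...
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Example: rewards (r0, r1, r2) = (1, 0, 5) and gamma = 0.9
# G = 1 + 0.9 * 0 + 0.81 * 5 = 5.05
print(gain([1, 0, 5], 0.9))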


Value function → Expectation

Consider some policy π.


Definition
The value function of π is the expected gain from each state:

∀s, Vπ (s) = E(G |s0 = s)

Bellman’s equation
The value function Vπ is the unique solution to the fixed-point
equation:
∀s, V (s) = E(r0 + γV (s1 )| s0 = s)
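
Since the expectation is linear in V, this fixed-point equation is a linear system and can be solved exactly. A minimal Python sketch, assuming a Gym-style transition table P[s][a] = [(prob, next_state, reward, done), ...] with states indexed 0, ..., n−1 and a stochastic policy pi[s][a] = π(a|s) (both hypothetical inputs):

import numpy as np

def evaluate_policy(P, pi, gamma):
    # Solve V = r_pi + gamma * P_pi V, i.e. V = (I - gamma * P_pi)^(-1) r_pi
    n = len(P)
    r_pi = np.zeros(n)       # expected immediate reward under pi
    P_pi = np.zeros((n, n))  # state transition matrix under pi
    for s in range(n):
        for a, p_a in enumerate(pi[s]):
            for prob, s1, reward, done in P[s][a]:
                r_pi[s] += p_a * prob * reward
                if not done:  # terminal next states contribute no future value
                    P_pi[s, s1] += p_a * prob
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

For γ < 1 the matrix I − γP_pi is invertible, so the linear solve returns the unique solution of Bellman's equation.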
Outline

▶ Optimal policy
▶ Policy iteration
▶ Value iteration
▶ Games
Optimal policy

Partial ordering
Policy π ′ is better than policy π if:

∀s, Vπ′ (s) ≥ Vπ (s)

Notation: π ′ ≥ π

Optimal policy
A policy π ⋆ is optimal if it is better than any other policy:

∀π, π⋆ ≥ π
A or B (optimal policy)

[Figure: four-state example with states A, B, C, D and rewards +1, −2, +5, −3]
Optimal value function

Recall Bellman’s equation for a policy π:

∀s ∈ S, V(s) = E(r0 + γV(s1) | s0 = s)

Proposition
There is a unique solution V⋆ to the equation:

∀s ∈ S, V(s) = max_a E(r0 + γV(s1) | s0 = s, a0 = a)

Notes:
▶ Not a linear system!
▶ Can still be solved by fixed-point iteration
A or B (optimal value function)

[Figure: the same four-state example with rewards +1, −2, +5, −3]
Quiz
Solution to Bellman’s optimality equation
Write Bellman’s equation as the fixed-point equation:

V = F ⋆ (V )

with F⋆(V)(s) = max_a E(r0 + γV(s1) | s0 = s, a0 = a) for all s ∈ S.


Proposition
If γ < 1, the solution is unique and:

∀V, lim_{k→+∞} (F⋆)^k(V) = V⋆

Proof: The mapping F ⋆ is contracting:

∀U, V , ||F ⋆ (V ) − F ⋆ (U)||∞ ≤ γ||V − U||∞

→ Banach fixed-point theorem
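
The contraction property can be checked numerically. A minimal sketch on a made-up two-state model (γ = 0.9), applying the operator F⋆ to two arbitrary value vectors:

import numpy as np

# Hypothetical toy model: P[s][a] = [(prob, next_state, reward, done), ...]
P = {
    0: {0: [(1.0, 1, 1.0, False)], 1: [(0.5, 0, 0.0, False), (0.5, 1, 2.0, False)]},
    1: {0: [(1.0, 0, -1.0, False)], 1: [(1.0, 1, 0.5, False)]},
}
gamma = 0.9

def F_star(V):
    # F*(V)(s) = max_a E(r0 + gamma * V(s1) | s0 = s, a0 = a)
    return np.array([
        max(sum(prob * (reward + (0.0 if done else gamma * V[s1]))
                for prob, s1, reward, done in P[s][a])
            for a in P[s])
        for s in P
    ])

rng = np.random.default_rng(0)
U, V = rng.normal(size=2), rng.normal(size=2)
lhs = np.max(np.abs(F_star(V) - F_star(U)))
rhs = gamma * np.max(np.abs(V - U))
print(lhs <= rhs + 1e-12)  # True: F* is a gamma-contraction in the sup norm

Iterating F⋆ from any starting point therefore converges to V⋆, which is exactly what value iteration exploits below.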


Optimal policy

An optimal policy follows from the optimal value function:

∀s ∈ S, π⋆(s) = a⋆ ∈ arg max_a E(r0 + γV⋆(s1) | s0 = s, a0 = a)

Bellman’s optimality theorem


The policy π ⋆ has value function V ⋆ and is optimal.

Note that:
▶ The optimal policy is not unique in general.
▶ There always exists a deterministic optimal policy.
▶ The optimal value function V ⋆ is unique.
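
A minimal sketch of this extraction step, again assuming a Gym-style table P[s][a] = [(prob, next_state, reward, done), ...] and an already computed optimal value function V_star (a hypothetical array indexed by state):

def greedy_policy(P, V_star, gamma):
    # pi*(s) = argmax_a E(r0 + gamma * V*(s1) | s0 = s, a0 = a)
    def q(s, a):
        return sum(prob * (reward + (0.0 if done else gamma * V_star[s1]))
                   for prob, s1, reward, done in P[s][a])
    return {s: max(P[s], key=lambda a: q(s, a)) for s in P}

Ties are broken arbitrarily, which is consistent with the fact that the optimal policy is not unique in general.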
A or B (optimal policy)

[Figure: the same four-state example with rewards +1, −2, +5, −3]
Maze (optimal policy)

[Figure: optimal value function V⋆ and optimal policy π⋆ on a maze]
Tic-Tac-Toe (optimal player, random adversary)

[Figure: eight example boards with optimal values V⋆ ≈ 0.99, 0.99, 0.96, 0.96, 0.75, 0.75, 0, 0]
Outline

▶ Optimal policy
▶ Policy iteration
▶ Value iteration
▶ Games
Policy improvement

Let Vπ be the value function of policy π.


Policy improvement
New policy π ′ defined by:

π′(s) = a⋆ ∈ arg max_a E(r0 + γVπ(s1) | s0 = s, a0 = a)

Proposition
The policy π ′ is better than π.
Maze (from the random policy)

[Figure: improved policy π′ on the maze]
Exercise

[Figure: four-state example with states A, B, C, D and rewards −1, +1, +3, +2, +3, +1]

The initial policy is random.


What is the new policy after policy improvement?
Policy iteration

Algorithm
Starting from some arbitrary policy π = π0 , iterate until
convergence:
1. Evaluate the policy (by solving Bellman’s equation)
2. Improve the policy:

∀s, π(s) ← arg max_a E(r0 + γVπ(s1) | s, a)

▶ The sequence π0, π1, π2, . . . is monotonic and converges in finite time (for finite numbers of states and actions).
▶ The limit is an optimal policy.
▶ These results assume perfect policy evaluation.
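
A minimal sketch of the full loop under the same assumptions (Gym-style table P[s][a] = [(prob, next_state, reward, done), ...], γ < 1), combining exact evaluation by a linear solve with greedy improvement; deterministic policies are stored as a dict mapping state to action:

import numpy as np

def policy_iteration(P, gamma):
    n = len(P)
    pi = {s: next(iter(P[s])) for s in P}          # arbitrary initial policy
    while True:
        # 1. Policy evaluation: solve the linear system V = r_pi + gamma * P_pi V
        r_pi, P_pi = np.zeros(n), np.zeros((n, n))
        for s in P:
            for prob, s1, reward, done in P[s][pi[s]]:
                r_pi[s] += prob * reward
                if not done:
                    P_pi[s, s1] += prob
        V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
        # 2. Policy improvement: greedy with respect to V
        def q(s, a):
            return sum(prob * (reward + (0.0 if done else gamma * V[s1]))
                       for prob, s1, reward, done in P[s][a])
        new_pi = {s: max(P[s], key=lambda a: q(s, a)) for s in P}
        if new_pi == pi:                            # policy stable -> optimal
            return pi, V
        pi = new_pi

The loop stops as soon as the improvement step leaves the policy unchanged, which happens after a finite number of iterations for finite state and action sets.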
Maze (policy iteration)

[Figure: policies obtained after iteration 1 and iteration 2 on the maze]
Practical considerations

▶ The step of policy evaluation is time-consuming (solution of Bellman's equation)
▶ Do we need the exact solution?
No, since it is used only to improve the policy!
▶ Why not directly improve the value function?
This is value iteration!
Outline

▶ Optimal policy
▶ Policy iteration
▶ Value iteration
▶ Games
Value iteration

Algorithm
Starting from some arbitrary value function V = V0 ,
iterate until convergence:

∀s, V(s) ← max_a E(r0 + γV(s1) | s, a)

▶ The sequence V0, V1, V2, . . . converges (but not in finite time in general)
▶ The limit is the optimal value function.
▶ The corresponding policy is optimal.
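
A minimal sketch under the same assumptions (Gym-style table P, γ < 1), stopping once the sup-norm update falls below a tolerance and returning the corresponding greedy policy:

import numpy as np

def value_iteration(P, gamma, tol=1e-8):
    # Iterate V(s) <- max_a E(r0 + gamma * V(s1) | s, a) until the update is small
    n = len(P)
    V = np.zeros(n)
    def q(s, a, V):
        return sum(prob * (reward + (0.0 if done else gamma * V[s1]))
                   for prob, s1, reward, done in P[s][a])
    while True:
        V_new = np.array([max(q(s, a, V) for a in P[s]) for s in P])
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    pi = {s: max(P[s], key=lambda a: q(s, a, V)) for s in P}
    return V, pi

Since convergence is only asymptotic in general, the tolerance controls how close the returned values are to V⋆.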
Maze (optimal policy)

[Figure: optimal value function V⋆ and optimal policy π⋆ on the maze]
Exercise

[Figure: the same four-state example with states A, B, C, D and rewards −1, +1, +3, +2, +3, +1]

Assume V0 = 0.
What is V1 (after one iteration)?
Outline

▶ Optimal policy
▶ Policy iteration
▶ Value iteration
▶ Games
Games

In many games (Tic-Tac-Toe, Chess, etc.), we have:


▶ 2 players
▶ No reward except in terminal states → r ∈ {−1, 0, +1}
▶ No discount → G ∈ {−1, 0, +1}
▶ Deterministic state transitions

Two approaches:
1. Learn to play against a given adversary
→ The adversary is part of the environment
2. Learn to play against a perfect adversary
→ The adversary is part of the agent
(who controls both players)
Learning to play perfectly

Players:
▶ π1 → policy of the first player (player 1)
▶ π2 → policy of the second player (player 2)

States:
▶ S1 → states where it is player 1's turn
▶ S2 → states where it is player 2's turn

Rewards:
▶ +1 if player 1 wins
▶ −1 if player 2 wins
▶ 0 otherwise (tie)
Value function

The value function of the policy π = (π1, π2) is the expected gain from each state:

∀s, Vπ (s) = E(G |s0 = s)

Bellman’s equation
The value function of π is the unique solution to:

∀s, Vπ (s) = E(r0 + Vπ (s1 )|s0 = s)

Note: For deterministic players, Vπ ∈ {−1, 0, +1}.


Optimal policy

Partial ordering
Policy π ′ = (π1′ , π2′ ) is better than policy π = (π1 , π2 ) if:

∀s ∈ S1 , Vπ′ (s) ≥ Vπ (s)

∀s ∈ S2 , Vπ′ (s) ≤ Vπ (s)

Optimal policy
A policy π ⋆ is optimal if it is better than any other policy.
Optimal value function

Value function V ⋆ of perfect players.


Bellman’s optimality equation
The optimal value function V ⋆ is the unique solution to:

∀s ∈ S1, V(s) = max_a E(r0 + V(s1) | s0 = s, a0 = a)

∀s ∈ S2, V(s) = min_a E(r0 + V(s1) | s0 = s, a0 = a)

Note: The optimal value function satisfies V ⋆ ∈ {−1, 0, +1}.


Value iteration

Algorithm
Starting from some arbitrary value function V = V0 (e.g., 0 in
non-terminal states), iterate until convergence:

∀s ∈ S1, V(s) ← max_a E(r0 + V(s1) | s, a)

∀s ∈ S2, V(s) ← min_a E(r0 + V(s1) | s, a)

▶ The sequence V0, V1, V2, . . . converges in finite time (takes values in {−1, 0, +1})
▶ The limit is the optimal value function.
▶ The corresponding policy is optimal.
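
A minimal sketch for the two-player case, assuming a deterministic game exposed through hypothetical helpers actions(s), step(s, a) (returning the next state and its reward) and is_terminal(s), together with the sets S1 and S2 of non-terminal states of each player:

def game_value_iteration(S1, S2, actions, step, is_terminal):
    # Minimax value iteration for a deterministic game with rewards in {-1, 0, +1}.
    # actions, step, is_terminal and the state sets S1, S2 are placeholders
    # for a concrete game implementation.
    V = {s: 0.0 for s in list(S1) + list(S2)}   # start from 0 in non-terminal states

    def backup(s, a):
        s1, r = step(s, a)                       # deterministic transition
        return r + (0.0 if is_terminal(s1) else V[s1])

    changed = True
    while changed:                               # sweeps stabilize in finite time
        changed = False
        for s in S1:                             # player 1 maximizes
            new = max(backup(s, a) for a in actions(s))
            changed |= (new != V[s])
            V[s] = new
        for s in S2:                             # player 2 minimizes
            new = min(backup(s, a) for a in actions(s))
            changed |= (new != V[s])
            V[s] = new
    # Greedy (minimax-optimal) policies for both players
    pi1 = {s: max(actions(s), key=lambda a: backup(s, a)) for s in S1}
    pi2 = {s: min(actions(s), key=lambda a: backup(s, a)) for s in S2}
    return V, pi1, pi2

Since the values only take the finitely many values {−1, 0, +1}, the sweeps stop changing after finitely many passes.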
Tic-Tac-Toe (perfect players)

[Figure: the same eight example boards, all with optimal value V⋆ = 0]
Summary

Dynamic programming
▶ Optimal policy
π ⋆ with V ⋆ ≥ V
▶ Bellman’s optimality equation
V ⋆ = max E(r + γV ⋆ )
▶ Policy iteration
π0 → V0 → π1 → V1 → . . . → π ⋆
▶ Value iteration
V0 → V1 → . . . → V ⋆

Next lecture
How to evaluate a policy online?
