2. Dynamic Programming
Thomas Bonald
2024 – 2025
Markov decision process → Model
Definition
Given a Markov decision process, the policy of an agent defines
the action taken in each non-terminal state:
∀s ∈ S, π(a|s) = P(a_t = a | s_t = s)
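For concreteness, a stochastic policy can be stored as a table of action probabilities per state. The following minimal Python sketch (the states, actions and probabilities are invented for illustration, not taken from the slides) samples an action a_t ∼ π(·|s_t):

```python
import random

# Minimal sketch (hypothetical states and actions): a stochastic policy π(a|s)
# stored as a dictionary of action probabilities per state.
policy = {
    "s0": {"left": 0.5, "right": 0.5},  # π(left|s0) = π(right|s0) = 0.5
    "s1": {"left": 1.0},                # deterministic in s1
}

def sample_action(pi, state):
    # draw a_t ~ π(.|s_t)
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs)[0]

print(sample_action(policy, "s0"))
```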
Definition
Given the rewards r_0, r_1, r_2, …, the gain is defined as:
G = r_0 + γ r_1 + γ^2 r_2 + …
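As a quick numerical check of this definition, here is a minimal sketch computing the gain of a finite reward sequence (the rewards and the value γ = 0.9 are arbitrary choices, not from the slides):

```python
# Minimal sketch: discounted gain G = r_0 + γ r_1 + γ^2 r_2 + ... of a finite
# reward sequence (the rewards and γ below are arbitrary).
def gain(rewards, gamma=0.9):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(gain([1.0, -2.0, 5.0]))  # 1 + 0.9*(-2) + 0.81*5 = 3.25
```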
Bellman’s equation
The value function V_π is the unique solution to the fixed-point equation:
∀s, V(s) = E(r_0 + γ V(s_1) | s_0 = s)
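To make the fixed-point view concrete, here is a minimal sketch of policy evaluation on a hypothetical 2-state MDP (its transitions, rewards and policy are invented for illustration): the value function is obtained by repeatedly applying Bellman’s equation.

```python
# Minimal sketch (hypothetical 2-state MDP, not the slides' example): policy
# evaluation by fixed-point iteration on Bellman's equation
#   V(s) = E(r_0 + γ V(s_1) | s_0 = s).
gamma = 0.9
# transitions[s][a] = list of (probability, next_state, reward)
transitions = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.9, 1, 1.0), (0.1, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
policy = {0: {"go": 1.0}, 1: {"stay": 0.5, "go": 0.5}}  # π(a|s)

V = {s: 0.0 for s in transitions}  # arbitrary initial value function
for _ in range(1000):              # fixed-point iteration
    V = {
        s: sum(
            prob_a * sum(p * (r + gamma * V[s1]) for p, s1, r in transitions[s][a])
            for a, prob_a in policy[s].items()
        )
        for s in transitions
    }
print(V)  # value of each state under the policy π
```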
Outline
▶ Optimal policy
▶ Policy iteration
▶ Value iteration
▶ Games
Optimal policy
Partial ordering
Policy π′ is better than policy π if:
∀s ∈ S, V_{π′}(s) ≥ V_π(s)
Notation: π′ ≥ π
Optimal policy
A policy π⋆ is optimal if it is better than any other policy:
∀π, π⋆ ≥ π
A or B (optimal policy)
[Figure: MDP with states A, B, C, D and rewards +1, −2, +5, −3]
Optimal value function
Proposition
There is a unique solution V⋆ to Bellman’s optimality equation:
∀s, V⋆(s) = max_a E(r_0 + γ V⋆(s_1) | s_0 = s, a_0 = a)
Notes:
▶ Not a linear system (because of the max over actions)!
▶ Can still be solved by fixed-point iteration (see the sketch below)
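The following sketch applies one step of this update on a hypothetical 2-state, 2-action MDP (invented for illustration): each state gets the best expected reward plus discounted value of the next state, and the max over actions is exactly what makes the system non-linear.

```python
# Minimal sketch (hypothetical 2-state, 2-action MDP): one application of the
# right-hand side of Bellman's optimality equation,
#   V(s) <- max_a E(r_0 + γ V(s_1) | s_0 = s, a_0 = a).
gamma = 0.9
# transitions[s][a] = list of (probability, next_state, reward)
transitions = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.9, 1, 1.0), (0.1, 0, 0.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}

def bellman_optimality_update(V):
    # non-linear because of the max over actions
    return {
        s: max(sum(p * (r + gamma * V[s1]) for p, s1, r in acts[a]) for a in acts)
        for s, acts in transitions.items()
    }

V = {0: 0.0, 1: 0.0}
print(bellman_optimality_update(V))  # one step of fixed-point iteration: {0: 0.9, 1: 2.0}
```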
A or B (optimal value function)
[Figure: the same MDP with states A, B, C, D and rewards +1, −2, +5, −3]
Quiz
Solution to Bellman’s optimality equation
Write Bellman’s optimality equation as the fixed-point equation:
V = F⋆(V)
Since F⋆ is a γ-contraction (for γ < 1), fixed-point iteration converges from any starting point (see the sketch below):
∀V, lim_{k→+∞} (F⋆)^k(V) = V⋆
Note that:
▶ The optimal policy is not unique in general.
▶ There always exists a deterministic optimal policy.
▶ The optimal value function V⋆ is unique.
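As a sanity check of this convergence claim, the sketch below iterates F⋆ from two different initial value functions on a hypothetical 2-state MDP (invented for illustration) and prints the same limit V⋆ in both cases:

```python
# Minimal sketch (hypothetical 2-state MDP): fixed-point iteration (F⋆)^k(V)
# reaches the same V⋆ whatever the initial value function V.
gamma = 0.9
transitions = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.9, 1, 1.0), (0.1, 0, 0.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}

def F_star(V):
    # Bellman optimality operator
    return {
        s: max(sum(p * (r + gamma * V[s1]) for p, s1, r in acts[a]) for a in acts)
        for s, acts in transitions.items()
    }

for V in ({0: 0.0, 1: 0.0}, {0: 100.0, 1: -100.0}):  # two arbitrary starting points
    for _ in range(200):
        V = F_star(V)
    print(V)  # both runs print (approximately) the same V⋆
```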
A or B (optimal policy)
[Figure: MDP with states A, B, C, D and rewards +1, −2, +5, −3]
Maze (optimal policy)
[Figure: maze with the optimal value function V⋆ and the optimal policy π⋆]
Tic-Tac-Toe (optimal player, random adversary)
[Figure: sample Tic-Tac-Toe positions with their optimal values against a random adversary, e.g., V⋆ ≈ 0.75 and V⋆ = 0]
Outline
▶ Optimal policy
▶ Policy iteration
▶ Value iteration
▶ Games
Policy improvement
Given a policy π with value function V_π, define the greedy policy π′ by:
∀s, π′(s) ∈ arg max_a E(r_0 + γ V_π(s_1) | s_0 = s, a_0 = a)
Proposition
The policy π′ is better than π.
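Here is a minimal sketch of the improvement step on a hypothetical 2-state MDP (its transitions, rewards and the value function V_π are invented for illustration): the greedy policy picks, in each state, an action maximizing the expected reward plus discounted value of the next state.

```python
# Minimal sketch (hypothetical 2-state MDP): policy improvement, i.e., the greedy
# policy π′(s) ∈ argmax_a E(r_0 + γ V_π(s_1) | s_0 = s, a_0 = a).
gamma = 0.9
transitions = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.9, 1, 1.0), (0.1, 0, 0.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}

def greedy_policy(V):
    # deterministic policy: a best action in each state, given the value function V
    return {
        s: max(acts, key=lambda a: sum(p * (r + gamma * V[s1]) for p, s1, r in acts[a]))
        for s, acts in transitions.items()
    }

V_pi = {0: 0.0, 1: 10.0}   # value function of some policy π (arbitrary numbers here)
print(greedy_policy(V_pi))  # improved policy π′: {0: 1, 1: 0}
```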
Maze (from the random policy)
[Figure: maze with the value function V_π of the random policy and the improved greedy policy π′]
Exercise
[Diagram: MDP with states A, B, C, D and rewards −1, +1, +3]
Policy iteration
Algorithm
Starting from some arbitrary policy π = π_0, iterate until convergence:
1. Evaluate the policy (by solving Bellman’s equation for V_π)
2. Improve the policy (take the greedy policy with respect to V_π, as in the sketch below)
[Figure: the maze example after iteration 1 and iteration 2]
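A minimal sketch of the full loop, on a hypothetical 2-state MDP (invented for illustration): policy evaluation solves the linear Bellman system exactly, improvement takes the greedy policy, and the loop stops when the policy no longer changes.

```python
import numpy as np

# Minimal sketch of policy iteration (hypothetical 2-state MDP, deterministic policies):
# alternate exact policy evaluation (a linear system) and greedy policy improvement.
gamma = 0.9
# transitions[s][a] = list of (probability, next_state, reward); states are 0, 1
transitions = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.9, 1, 1.0), (0.1, 0, 0.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}
states = sorted(transitions)

def evaluate(policy):
    # solve Bellman's equation V = r_π + γ P_π V exactly
    n = len(states)
    P, r = np.zeros((n, n)), np.zeros(n)
    for s in states:
        for p, s1, rew in transitions[s][policy[s]]:
            P[s, s1] += p
            r[s] += p * rew
    return np.linalg.solve(np.eye(n) - gamma * P, r)

def improve(V):
    # greedy policy with respect to V
    return {
        s: max(acts, key=lambda a: sum(p * (rew + gamma * V[s1]) for p, s1, rew in acts[a]))
        for s, acts in transitions.items()
    }

policy = {s: 0 for s in states}  # arbitrary initial policy π_0
while True:
    V = evaluate(policy)         # 1. evaluate the policy
    new_policy = improve(V)      # 2. improve the policy
    if new_policy == policy:     # stable policy: optimal
        break
    policy = new_policy
print(policy, V)
```

Here the evaluation step is exact; it could also be approximated by a few fixed-point iterations of Bellman’s equation.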
Practical considerations
Outline
▶ Optimal policy
▶ Policy iteration
▶ Value iteration
▶ Games
Value iteration
Algorithm
Starting from some arbitrary value function V = V_0, iterate until convergence:
∀s, V_{k+1}(s) = max_a E(r_0 + γ V_k(s_1) | s_0 = s, a_0 = a)
The optimal policy π⋆ is then the greedy policy with respect to the limit V⋆ (see the sketch below).
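A minimal sketch of value iteration on a hypothetical 2-state MDP (invented for illustration): the value function is updated by the Bellman optimality operator until the change is negligible, and the greedy policy is then read off.

```python
# Minimal sketch of value iteration (hypothetical 2-state MDP): update the value
# function with the Bellman optimality operator until it stabilizes, then read off
# the greedy policy.
gamma = 0.9
transitions = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.9, 1, 1.0), (0.1, 0, 0.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}

def q_value(V, s, a):
    # expected reward plus discounted value of the next state
    return sum(p * (r + gamma * V[s1]) for p, s1, r in transitions[s][a])

V = {s: 0.0 for s in transitions}  # arbitrary initial value function V_0
while True:
    new_V = {s: max(q_value(V, s, a) for a in acts) for s, acts in transitions.items()}
    if max(abs(new_V[s] - V[s]) for s in V) < 1e-8:  # stop when (almost) converged
        V = new_V
        break
    V = new_V
policy = {s: max(acts, key=lambda a: q_value(V, s, a)) for s, acts in transitions.items()}
print(V, policy)
```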
Exercise
[Diagram: MDP with states A, B, C, D and rewards −1, +1, +3]
Assume V_0 = 0. What is V_1 (after one iteration)?
Outline
▶ Optimal policy
▶ Policy iteration
▶ Value iteration
▶ Games
Games
Two approaches:
1. Learn to play against a given adversary
→ The adversary is part of the environment
2. Learn to play against a perfect adversary
→ The adversary is part of the agent
(who controls both players)
Learning to play perfectly
Players:
▶ π1 → policy of the first player (player 1)
▶ π2 → policy of the second player (player 2)
States:
▶ S1 → states where it is player 1’s turn
▶ S2 → states where it is player 2’s turn
Rewards:
▶ +1 if player 1 wins
▶ −1 if player 2 wins
▶ 0 otherwise (tie)
Value function
Bellman’s equation
The value function of π = (π1, π2) is the unique solution to:
∀s, V(s) = E(r_0 + γ V(s_1) | s_0 = s)
where the action in state s is sampled from π1(·|s) if s ∈ S1 and from π2(·|s) if s ∈ S2.
Partial ordering
Policy π′ = (π1′, π2′) is better than policy π = (π1, π2) if:
∀s ∈ S1, V_{π′}(s) ≥ V_π(s) and ∀s ∈ S2, V_{π′}(s) ≤ V_π(s)
Optimal policy
A policy π⋆ is optimal if it is better than any other policy.
Optimal value function
Algorithm
Starting from some arbitrary value function V = V_0 (e.g., 0 in non-terminal states), iterate until convergence (see the sketch below):
∀s ∈ S1, V_{k+1}(s) = max_a E(r_0 + γ V_k(s_1) | s_0 = s, a_0 = a)
∀s ∈ S2, V_{k+1}(s) = min_a E(r_0 + γ V_k(s_1) | s_0 = s, a_0 = a)
[Figure: the same Tic-Tac-Toe positions as before, now with V⋆ = 0 in every position shown (optimal play by both players)]
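A minimal sketch of this two-player value iteration on a tiny hypothetical game (not Tic-Tac-Toe): values are backed up with a max in player 1’s states and a min in player 2’s states; the reward (+1, −1 or 0) is attached to terminal states and γ = 1, an assumption consistent with episodic games.

```python
# Minimal sketch (tiny hypothetical game, not Tic-Tac-Toe): value iteration for a
# turn-based zero-sum game. Values are backed up with a max in player 1's states
# and a min in player 2's states; rewards (+1 / -1 / 0) sit on terminal states, γ = 1.
# game[s] = (player to move, {action: next_state}) for non-terminal states
game = {
    "root": (1, {"left": "n1", "right": "n2"}),
    "n1":   (2, {"a": "p1_wins", "b": "draw"}),
    "n2":   (2, {"a": "draw", "b": "p2_wins"}),
}
terminal_value = {"p1_wins": 1.0, "p2_wins": -1.0, "draw": 0.0}

V = dict(terminal_value)
V.update({s: 0.0 for s in game})  # V_0 = 0 in non-terminal states
for _ in range(10):  # iterate until convergence (immediate here: the game has depth 2)
    for s, (player, actions) in game.items():
        values = [V[s1] for s1 in actions.values()]
        V[s] = max(values) if player == 1 else min(values)
print(V)  # V("root") = 0: with perfect play on both sides, this mini-game is a draw
```

With γ = 1 and rewards only at the end of the game, the value of a state is simply the outcome under perfect play, as in the Tic-Tac-Toe figure above.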
Summary
Dynamic programming
▶ Optimal policy
π⋆ with V⋆ ≥ V
▶ Bellman’s optimality equation
V⋆ = max E(r + γV⋆)
▶ Policy iteration
π_0 → V_0 → π_1 → V_1 → … → π⋆
▶ Value iteration
V_0 → V_1 → … → V⋆
Next lecture
How to evaluate a policy online?