Reinforcement Learning

2. Dynamic Programming

Thomas Bonald

2024 – 2025
Markov decision process → Model

At time t = 0, 1, 2, . . ., the agent in state st takes action at and:


▶ receives reward rt
▶ moves to state st+1
The rewards and transitions are stochastic in general.
Some states may be terminal.
Definition
A Markov decision process (MDP) is defined by:
▶ the reward distribution, p(rt |st , at )
▶ the state transition distribution, p(st+1 | st , at )

We denote by S the set of non-terminal states.


Policy → Agent

Definition
Given a Markov decision process, the policy of an agent defines
the action taken in each non-terminal state:

∀s ∈ S, π(a|s) = P(at = a| st = s)

When deterministic, we use the simpler notation a = π(s)


Gain → Objective

Definition
Given the rewards r0, r1, r2, . . ., we define the gain as:

G = r0 + γ r1 + γ² r2 + . . .

The parameter γ ∈ [0, 1] is the discount factor.
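
For intuition, a minimal Python sketch computing the gain as a discounted sum, with an arbitrary reward sequence and discount factor chosen for illustration:

def gain(rewards, gamma):
    # G = r0 + gamma * r1 + gamma^2 * r2 + ...
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Example: rewards (r0, r1, r2) = (1, 0, 5) and gamma = 0.9
# G = 1 + 0.9 * 0 + 0.81 * 5 = 5.05
print(gain([1, 0, 5], 0.9))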


Value function → Expectation

Consider some policy π.


Definition
The value function of π is the expected gain from each state:

∀s, Vπ (s) = E(G |s0 = s)

Bellman’s equation
The value function Vπ is the unique solution to the fixed-point
equation:
∀s, V (s) = E(r0 + γV (s1 )| s0 = s)
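
Since the expectation is linear in V, this fixed-point equation is a linear system and can be solved exactly. A minimal Python sketch, assuming a Gym-style transition table P[s][a] = [(prob, next_state, reward, done), ...] with states indexed 0, ..., n−1 and a stochastic policy pi[s][a] = π(a|s) (both hypothetical inputs):

import numpy as np

def evaluate_policy(P, pi, gamma):
    # Solve V = r_pi + gamma * P_pi V, i.e. V = (I - gamma * P_pi)^(-1) r_pi
    n = len(P)
    r_pi = np.zeros(n)       # expected immediate reward under pi
    P_pi = np.zeros((n, n))  # state transition matrix under pi
    for s in range(n):
        for a, p_a in enumerate(pi[s]):
            for prob, s1, reward, done in P[s][a]:
                r_pi[s] += p_a * prob * reward
                if not done:  # terminal next states contribute no future value
                    P_pi[s, s1] += p_a * prob
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

For γ < 1 the matrix I − γP_pi is invertible, so the linear solve returns the unique solution of Bellman's equation.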
Outline

▶ Optimal policy
▶ Policy iteration
▶ Value iteration
▶ Games
Optimal policy

Partial ordering
Policy π ′ is better than policy π if:

∀s, Vπ′ (s) ≥ Vπ (s)

Notation: π ′ ≥ π

Optimal policy
A policy π ⋆ is optimal if it is better than any other policy:

∀π, π⋆ ≥ π
A or B (optimal policy)

[Figure: four-state example with states A, B, C, D and rewards +1, −2, +5, −3]
Optimal value function

Recall Bellman’s equation for a policy π:

∀s ∈ S, V(s) = E(r0 + γV(s1) | s0 = s)

Proposition
There is a unique solution V⋆ to the equation:

∀s ∈ S, V(s) = max_a E(r0 + γV(s1) | s0 = s, a0 = a)

Notes:
▶ Not a linear system!
▶ Can still be solved by fixed-point iteration
A or B (optimal value function)

[Figure: the same four-state example with rewards +1, −2, +5, −3]
Quiz
Solution to Bellman’s optimality equation
Write Bellman’s equation as the fixed-point equation:

V = F ⋆ (V )

with F⋆(V)(s) = max_a E(r0 + γV(s1) | s0 = s, a0 = a) for all s ∈ S.


Proposition
If γ < 1, the solution is unique and:

∀V, lim_{k→+∞} (F⋆)^k(V) = V⋆

Proof: The mapping F ⋆ is contracting:

∀U, V , ||F ⋆ (V ) − F ⋆ (U)||∞ ≤ γ||V − U||∞

→ Banach fixed-point theorem
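
The contraction property can be checked numerically. A minimal sketch on a made-up two-state model (γ = 0.9), applying the operator F⋆ to two arbitrary value vectors:

import numpy as np

# Hypothetical toy model: P[s][a] = [(prob, next_state, reward, done), ...]
P = {
    0: {0: [(1.0, 1, 1.0, False)], 1: [(0.5, 0, 0.0, False), (0.5, 1, 2.0, False)]},
    1: {0: [(1.0, 0, -1.0, False)], 1: [(1.0, 1, 0.5, False)]},
}
gamma = 0.9

def F_star(V):
    # F*(V)(s) = max_a E(r0 + gamma * V(s1) | s0 = s, a0 = a)
    return np.array([
        max(sum(prob * (reward + (0.0 if done else gamma * V[s1]))
                for prob, s1, reward, done in P[s][a])
            for a in P[s])
        for s in P
    ])

rng = np.random.default_rng(0)
U, V = rng.normal(size=2), rng.normal(size=2)
lhs = np.max(np.abs(F_star(V) - F_star(U)))
rhs = gamma * np.max(np.abs(V - U))
print(lhs <= rhs + 1e-12)  # True: F* is a gamma-contraction in the sup norm

Iterating F⋆ from any starting point therefore converges to V⋆, which is exactly what value iteration exploits below.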


Optimal policy

An optimal policy follows from the optimal value function:

∀s ∈ S, π⋆(s) = a⋆ ∈ arg max_a E(r0 + γV⋆(s1) | s0 = s, a0 = a)

Bellman’s optimality theorem


The policy π ⋆ has value function V ⋆ and is optimal.

Note that:
▶ The optimal policy is not unique in general.
▶ There always exists a deterministic optimal policy.
▶ The optimal value function V ⋆ is unique.
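
A minimal sketch of this extraction step, again assuming a Gym-style table P[s][a] = [(prob, next_state, reward, done), ...] and an already computed optimal value function V_star (a hypothetical array indexed by state):

def greedy_policy(P, V_star, gamma):
    # pi*(s) = argmax_a E(r0 + gamma * V*(s1) | s0 = s, a0 = a)
    def q(s, a):
        return sum(prob * (reward + (0.0 if done else gamma * V_star[s1]))
                   for prob, s1, reward, done in P[s][a])
    return {s: max(P[s], key=lambda a: q(s, a)) for s in P}

Ties are broken arbitrarily, which is consistent with the fact that the optimal policy is not unique in general.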
A or B (optimal policy)

[Figure: the same four-state example with rewards +1, −2, +5, −3]
Maze (optimal policy)

[Figure: optimal value function V⋆ and optimal policy π⋆ on a maze]
Tic-Tac-Toe (optimal player, random adversary)

[Figure: eight example boards with optimal values V⋆ ≈ 0.99, 0.99, 0.96, 0.96, 0.75, 0.75, 0, 0]
Outline

▶ Optimal policy
▶ Policy iteration
▶ Value iteration
▶ Games
Policy improvement

Let Vπ be the value function of policy π.


Policy improvement
New policy π ′ defined by:

π′(s) = a⋆ ∈ arg max_a E(r0 + γVπ(s1) | s0 = s, a0 = a)

Proposition
The policy π ′ is better than π.
Maze (from the random policy)

[Figure: improved policy π′ on the maze]
Exercise

[Figure: four-state example with states A, B, C, D and rewards −1, +1, +3, +2, +3, +1]

The initial policy is random.


What is the new policy after policy improvement?
Policy iteration

Algorithm
Starting from some arbitrary policy π = π0 , iterate until
convergence:
1. Evaluate the policy (by solving Bellman’s equation)
2. Improve the policy:

∀s, π(s) ← arg max_a E(r0 + γVπ(s1) | s, a)

▶ The sequence π0, π1, π2, . . . is monotonic and converges in finite time (for finite numbers of states and actions).
▶ The limit is an optimal policy.
▶ These results assume perfect policy evaluation.
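
A minimal sketch of the full loop under the same assumptions (Gym-style table P[s][a] = [(prob, next_state, reward, done), ...], γ < 1), combining exact evaluation by a linear solve with greedy improvement; deterministic policies are stored as a dict mapping state to action:

import numpy as np

def policy_iteration(P, gamma):
    n = len(P)
    pi = {s: next(iter(P[s])) for s in P}          # arbitrary initial policy
    while True:
        # 1. Policy evaluation: solve the linear system V = r_pi + gamma * P_pi V
        r_pi, P_pi = np.zeros(n), np.zeros((n, n))
        for s in P:
            for prob, s1, reward, done in P[s][pi[s]]:
                r_pi[s] += prob * reward
                if not done:
                    P_pi[s, s1] += prob
        V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
        # 2. Policy improvement: greedy with respect to V
        def q(s, a):
            return sum(prob * (reward + (0.0 if done else gamma * V[s1]))
                       for prob, s1, reward, done in P[s][a])
        new_pi = {s: max(P[s], key=lambda a: q(s, a)) for s in P}
        if new_pi == pi:                            # policy stable -> optimal
            return pi, V
        pi = new_pi

The loop stops as soon as the improvement step leaves the policy unchanged, which happens after a finite number of iterations for finite state and action sets.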
Maze (policy iteration)

[Figure: policies obtained after iteration 1 and iteration 2 on the maze]
Practical considerations

▶ The step of policy evaluation is time-consuming (solution of Bellman's equation)
▶ Do we need the exact solution?
No, since it is used only to improve the policy!
▶ Why not directly improve the value function?
This is value iteration!
Outline

▶ Optimal policy
▶ Policy iteration
▶ Value iteration
▶ Games
Value iteration

Algorithm
Starting from some arbitrary value function V = V0 ,
iterate until convergence:

∀s, V(s) ← max_a E(r0 + γV(s1) | s, a)

▶ The sequence V0, V1, V2, . . . converges (but not in finite time in general)
▶ The limit is the optimal value function.
▶ The corresponding policy is optimal.
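
A minimal sketch under the same assumptions (Gym-style table P, γ < 1), stopping once the sup-norm update falls below a tolerance and returning the corresponding greedy policy:

import numpy as np

def value_iteration(P, gamma, tol=1e-8):
    # Iterate V(s) <- max_a E(r0 + gamma * V(s1) | s, a) until the update is small
    n = len(P)
    V = np.zeros(n)
    def q(s, a, V):
        return sum(prob * (reward + (0.0 if done else gamma * V[s1]))
                   for prob, s1, reward, done in P[s][a])
    while True:
        V_new = np.array([max(q(s, a, V) for a in P[s]) for s in P])
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    pi = {s: max(P[s], key=lambda a: q(s, a, V)) for s in P}
    return V, pi

Since convergence is only asymptotic in general, the tolerance controls how close the returned values are to V⋆.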
Maze (optimal policy)

[Figure: optimal value function V⋆ and optimal policy π⋆ on the maze]
Exercise

[Figure: the same four-state example with states A, B, C, D and rewards −1, +1, +3, +2, +3, +1]

Assume V0 = 0.
What is V1 (after one iteration)?
Outline

▶ Optimal policy
▶ Policy iteration
▶ Value iteration
▶ Games
Games

In many games (Tic-Tac-Toe, Chess, etc.), we have:


▶ 2 players
▶ No reward except in terminal states → r ∈ {−1, 0, +1}
▶ No discount → G ∈ {−1, 0, +1}
▶ Deterministic state transitions

Two approaches:
1. Learn to play against a given adversary
→ The adversary is part of the environment
2. Learn to play against a perfect adversary
→ The adversary is part of the agent
(who controls both players)
Learning to play perfectly

Players:
▶ π1 → policy of the first player (player 1)
▶ π2 → policy of the second player (player 2)

States:
▶ S1 → states where it is player 1's turn
▶ S2 → states where it is player 2's turn

Rewards:
▶ +1 if player 1 wins
▶ −1 if player 2 wins
▶ 0 otherwise (tie)
Value function

The value function of the policy π = (π1, π2) is the expected gain from each state:

∀s, Vπ (s) = E(G |s0 = s)

Bellman’s equation
The value function of π is the unique solution to:

∀s, Vπ (s) = E(r0 + Vπ (s1 )|s0 = s)

Note: For deterministic players, Vπ ∈ {−1, 0, +1}.


Optimal policy

Partial ordering
Policy π ′ = (π1′ , π2′ ) is better than policy π = (π1 , π2 ) if:

∀s ∈ S1 , Vπ′ (s) ≥ Vπ (s)

∀s ∈ S2 , Vπ′ (s) ≤ Vπ (s)

Optimal policy
A policy π ⋆ is optimal if it is better than any other policy.
Optimal value function

Value function V ⋆ of perfect players.


Bellman’s optimality equation
The optimal value function V ⋆ is the unique solution to:

∀s ∈ S1, V(s) = max_a E(r0 + V(s1) | s0 = s, a0 = a)

∀s ∈ S2, V(s) = min_a E(r0 + V(s1) | s0 = s, a0 = a)

Note: The optimal value function satisfies V ⋆ ∈ {−1, 0, +1}.


Value iteration

Algorithm
Starting from some arbitrary value function V = V0 (e.g., 0 in
non-terminal states), iterate until convergence:

∀s ∈ S1, V(s) ← max_a E(r0 + V(s1) | s, a)

∀s ∈ S2, V(s) ← min_a E(r0 + V(s1) | s, a)

▶ The sequence V0, V1, V2, . . . converges in finite time (takes values in {−1, 0, +1})
▶ The limit is the optimal value function.
▶ The corresponding policy is optimal.
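
A minimal sketch for the two-player case, assuming a deterministic game exposed through hypothetical helpers actions(s), step(s, a) (returning the next state and its reward) and is_terminal(s), together with the sets S1 and S2 of non-terminal states of each player:

def game_value_iteration(S1, S2, actions, step, is_terminal):
    # Minimax value iteration for a deterministic game with rewards in {-1, 0, +1}.
    # actions, step, is_terminal and the state sets S1, S2 are placeholders
    # for a concrete game implementation.
    V = {s: 0.0 for s in list(S1) + list(S2)}   # start from 0 in non-terminal states

    def backup(s, a):
        s1, r = step(s, a)                       # deterministic transition
        return r + (0.0 if is_terminal(s1) else V[s1])

    changed = True
    while changed:                               # sweeps stabilize in finite time
        changed = False
        for s in S1:                             # player 1 maximizes
            new = max(backup(s, a) for a in actions(s))
            changed |= (new != V[s])
            V[s] = new
        for s in S2:                             # player 2 minimizes
            new = min(backup(s, a) for a in actions(s))
            changed |= (new != V[s])
            V[s] = new
    # Greedy (minimax-optimal) policies for both players
    pi1 = {s: max(actions(s), key=lambda a: backup(s, a)) for s in S1}
    pi2 = {s: min(actions(s), key=lambda a: backup(s, a)) for s in S2}
    return V, pi1, pi2

Since the values only take the finitely many values {−1, 0, +1}, the sweeps stop changing after finitely many passes.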
Tic-Tac-Toe (perfect players)

[Figure: the same eight example boards, all with optimal value V⋆ = 0]
Summary

Dynamic programming
▶ Optimal policy
π ⋆ with V ⋆ ≥ V
▶ Bellman’s optimality equation
V ⋆ = max E(r + γV ⋆ )
▶ Policy iteration
π0 → V0 → π1 → V1 → . . . → π ⋆
▶ Value iteration
V0 → V1 → . . . → V ⋆

Next lecture
How to evaluate a policy online?
