Formula Sheet
Section 1 – Deterministic Dynamic Programming
Finite-horizon Dynamic Programming
(i) Initialize the value function: $V_N(s) = r_N(s)$, $\forall s \in S_N$.
(ii) Backward recursion: For $k = N-1, \dots, 0$, compute
$V_k(s) = \max_{a \in A_k} \left\{ r_k(s,a) + V_{k+1}\big(f_k(s,a)\big) \right\}$, $\quad s \in S_k$
Optimal policy:
$\pi_k^*(s) \in \arg\max_{a \in A_k} \left\{ r_k(s,a) + V_{k+1}\big(f_k(s,a)\big) \right\}$, $\quad k = 0, \dots, N-1$
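A minimal Python sketch of this backward recursion (assuming a common state/action set across stages; the function and variable names below are illustrative, not part of the sheet):

```python
# Backward recursion for finite-horizon deterministic DP (sketch).
# Assumptions: f(k, s, a) gives the next state, r(k, s, a) the stage reward,
# r_N(s) the terminal reward, and all stages share the same states/actions.

def backward_dp(states, actions, f, r, r_N, N):
    V = {s: r_N(s) for s in states}            # V_N(s) = r_N(s)
    policy = {}
    for k in range(N - 1, -1, -1):              # k = N-1, ..., 0
        V_next, V = V, {}
        pi_k = {}
        for s in states:
            # V_k(s) = max_a { r_k(s,a) + V_{k+1}(f_k(s,a)) }
            best_a, best_v = None, float("-inf")
            for a in actions:
                v = r(k, s, a) + V_next[f(k, s, a)]
                if v > best_v:
                    best_a, best_v = a, v
            V[s], pi_k[s] = best_v, best_a
        policy[k] = pi_k
    return V, policy                             # V_0 and the greedy policies
```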
Section 2 – Planning
Markov Chains
Stationary Distribution: $\pi$ is a stationary distribution for the Markov chain if $\pi P = \pi$, namely $\sum_i \pi_i p_{ij} = \pi_j$ for all $j$.
Theorem: Recurrence of finite Markov chains: Let $(X_t)$ be an irreducible, aperiodic Markov chain over a finite state space $X$. Then the following properties hold:
1. All states are positive recurrent.
2. There exists a unique stationary distribution $\pi$.
3. Convergence to the stationary distribution: $\lim_{t\to\infty} p_{ij}^{(t)} = \pi_j$ (for all $j$).
4. Ergodicity: $\lim_{t\to\infty} \frac{1}{t}\sum_{s=0}^{t-1} f(X_s) = \sum_i \pi(i) f(i)$, for every bounded function $f$.
Theorem: Countable Markov chains: Let $(X_t)$ be an irreducible and aperiodic Markov chain over a countable state space $X$. Then:
(a) Either (i) all states are positive recurrent, or (ii) all states are null recurrent, or (iii) all
states are transient.
(b) If (i) holds, then properties (2)-(4) of the previous Theorem hold as well.
(c) Conversely, if there exists a stationary distribution then properties (1)-(4) are
satisfied.
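As an illustration of properties (2)-(3), a short Python sketch (the transition matrix below is an illustrative example, not from the sheet) that approaches the stationary distribution by iterating $\pi \leftarrow \pi P$:

```python
import numpy as np

# Power iteration toward the stationary distribution of a finite,
# irreducible, aperiodic chain: pi P = pi.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])                     # rows sum to 1

pi = np.full(P.shape[0], 1.0 / P.shape[0])     # any initial distribution
for _ in range(1000):
    pi = pi @ P
print(pi)                                      # ~[0.833, 0.167]; pi @ P == pi
```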
Contraction Operators
Norm: A norm $\|\cdot\|$ over $\mathbb{R}^n$ is a real-valued function $\|\cdot\|: \mathbb{R}^n \to \mathbb{R}$ such that, for any pair of vectors $x, y \in \mathbb{R}^n$ and scalar $a$,
(1) $\|ax\| = |a|\,\|x\|$, (2) $\|x + y\| \le \|x\| + \|y\|$, (3) $\|x\| = 0$ only if $x = 0$.
Contraction: An operator $T: \mathbb{R}^n \to \mathbb{R}^n$ is a contraction in $\|\cdot\|$ if there exists $\beta \in (0,1)$ such that
$\|T(v_1) - T(v_2)\| \le \beta\,\|v_1 - v_2\|$, for all $v_1, v_2 \in \mathbb{R}^n$.
Theorem: Banach's fixed point theorem
Let $T: \mathbb{R}^n \to \mathbb{R}^n$ be a contraction operator. Then
(C1) The equation $T(v) = v$ has a unique solution $V^* \in \mathbb{R}^n$.
(C2) For any $v_0 \in \mathbb{R}^n$, $\lim_{n\to\infty} T^n(v_0) = V^*$. In fact, $\|T^n(v_0) - V^*\| = O(\beta^n)$.
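A tiny numerical illustration of (C2) in Python, using the simple (hypothetical) contraction $T(v) = 0.5\,v + b$ with $\beta = 0.5$:

```python
import numpy as np

# Fixed-point iteration for the contraction T(v) = 0.5*v + b.
# Its unique fixed point is v* = 2*b, and the error shrinks like 0.5**n.
b = np.array([1.0, -2.0])
T = lambda v: 0.5 * v + b

v = np.zeros(2)
for n in range(30):
    v = T(v)
print(v)        # ~[2.0, -4.0] = v*, as guaranteed by Banach's theorem
```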
Finite-horizon MDPs (backward recursion):
$V_k(s) = \max_{a \in A_k}\left\{ r_k(s,a) + \sum_{s' \in S_{k+1}} p_k(s'|s,a)\, V_{k+1}(s') \right\}$, $\quad s \in S_k$
$\pi_t^*(s) \in \arg\max_{a \in A_t}\left\{ r_t(s,a) + \sum_{s' \in S_{t+1}} p_t(s'|s,a)\, V_{t+1}(s') \right\}$, $\quad s \in S_t$
Discounted MDPs
Value Function (fixed policy)
$V^\pi(s) = E^{\pi,s}\left( \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right)$
Bellman (fixed policy)
$V^\pi(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} p(s'|s, \pi(s))\, V^\pi(s')$, $\quad s \in S$
Vector form: $V^\pi = r^\pi + \gamma P^\pi V^\pi$
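From the vector form, the fixed-policy value can be computed directly as $V^\pi = (I - \gamma P^\pi)^{-1} r^\pi$; a minimal Python sketch (the numbers below are illustrative):

```python
import numpy as np

# Policy evaluation by solving the linear system (I - gamma*P_pi) V = r_pi.
gamma = 0.9
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])      # transition matrix under the fixed policy
r_pi = np.array([1.0, 0.0])        # expected reward per state under the policy

V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(V_pi)                        # satisfies V = r_pi + gamma * P_pi @ V
```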
Optimal Value Function
$V^*(s) = \sup_{\pi} E^{\pi,s}\left( \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right)$
Theorem: Bellman's Optimality Equation
(i) $V^*$ is the unique solution of the following set of (nonlinear) equations:
$V(s) = \max_{a \in A}\left\{ r(s,a) + \gamma \sum_{s' \in S} p(s'|s,a)\, V(s') \right\}$, $\quad s \in S$.
The optimal Bellman operator $T^*$ is defined by
$(T^*(V))(s) = \max_{a \in A}\left\{ r(s,a) + \gamma \sum_{s' \in S} p(s'|s,a)\, V(s') \right\}$, $\quad s \in S$
Theorem: Contraction property
(i) $T^*$ is a $\gamma$-contraction operator with respect to the max-norm, namely
$\|T^*(V_1) - T^*(V_2)\|_\infty \le \gamma\, \|V_1 - V_2\|_\infty$ for all $V_1, V_2 \in \mathbb{R}^{|S|}$
Value Iteration: starting from any $V_0$, iterate
$V_{n+1}(s) = \max_{a \in A}\left\{ r(s,a) + \gamma \sum_{s' \in S} p(s'|s,a)\, V_n(s') \right\}$, $\quad s \in S$
Then $\lim_{n\to\infty} V_n = V^*$ (component-wise). The rate of convergence is $O(\gamma^n)$.
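A compact Python sketch of value iteration for a tabular discounted MDP (the array layout, P[a, s, s'] for transitions and R[s, a] for rewards, is an illustrative assumption):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Tabular value iteration.
    P: shape (A, S, S), P[a, s, s2] = p(s2 | s, a).
    R: shape (S, A),   R[s, a]     = r(s, a)."""
    A, S, _ = P.shape
    V = np.zeros(S)
    while True:
        # Q[s, a] = r(s, a) + gamma * sum_{s2} p(s2|s,a) * V[s2]
        Q = R + gamma * np.tensordot(P, V, axes=([2], [0])).T
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:     # sup-norm stopping rule
            return V_new, Q.argmax(axis=1)       # approximate V* and a greedy policy
        V = V_new
```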
Policy Iteration
Initialization: choose some stationary policy $\pi_0$.
For $k = 0, 1, \dots$:
1. Policy evaluation: compute $V^{\pi_k}$.
2. Policy improvement: compute $\pi_{k+1}$, a greedy policy with respect to $V^{\pi_k}$:
$\pi_{k+1}(s) \in \arg\max_{a \in A}\left\{ r(s,a) + \gamma \sum_{s' \in S} p(s'|s,a)\, V^{\pi_k}(s') \right\}$.
3. Stop if $V^{\pi_{k+1}} = V^{\pi_k}$ (or if $V^{\pi_k}$ satisfies the optimality equation); else repeat.
Theorem: Policy Iteration
(i) Each policy $\pi_{k+1}$ is improving over the previous one $\pi_k$, in the sense that $V^{\pi_{k+1}} \ge V^{\pi_k}$ (component-wise).
(ii) $V^{\pi_{k+1}} = V^{\pi_k}$ if and only if $\pi_k$ is an optimal policy.
(iii) $\pi_k$ converges to the optimal policy after a finite number of steps.
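A minimal Python sketch of policy iteration, using exact policy evaluation via a linear solve (same illustrative array layout as the value-iteration sketch above):

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """P: shape (A, S, S) transition probabilities, R: shape (S, A) rewards."""
    A, S, _ = P.shape
    pi = np.zeros(S, dtype=int)                  # arbitrary initial policy pi_0
    while True:
        # 1. Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly.
        P_pi = P[pi, np.arange(S), :]            # p(s'|s, pi(s)), shape (S, S)
        r_pi = R[np.arange(S), pi]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # 2. Policy improvement: greedy policy with respect to V.
        Q = R + gamma * np.tensordot(P, V, axes=([2], [0])).T
        pi_new = Q.argmax(axis=1)
        # 3. Stop when the policy no longer changes.
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```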
Union bound: $P\left(\bigcup_i A_i\right) \le \sum_i P(A_i)$.
Exploratory MAB
PAC definition: An algorithm is an $(\epsilon, \delta)$-PAC algorithm for the multi-armed bandit with complexity $T$, if it outputs an $\epsilon$-optimal arm $a'$ with probability at least $1-\delta$ when it terminates, and the number of time steps the algorithm performs until it terminates is bounded by $T$.
Naïve Algorithm:
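The sheet's expression for the naive algorithm is not reproduced above; as a rough sketch of the standard naive strategy only (sample every arm the same fixed number of times and return the empirically best one; the sample count `ell` is left as a parameter rather than the sheet's formula):

```python
import numpy as np

def naive_mab(pull, k, ell):
    """Naive exploration: pull each of the k arms ell times and return the
    arm with the highest empirical mean. `pull(i)` returns a reward sample."""
    means = np.array([np.mean([pull(i) for _ in range(ell)]) for i in range(k)])
    return int(np.argmax(means))
```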
Median Elimination:
Arm differences: $\Delta_i = \mu^* - \mu_i$.
Theorem: UCB
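The theorem's statement is not reproduced above; for reference, a minimal sketch of the standard UCB1 index rule (the $\sqrt{2\ln t / n_i}$ bonus is the usual textbook choice, assumed here rather than taken from the sheet):

```python
import numpy as np

def ucb1(pull, k, horizon):
    """UCB1: play each arm once, then always play the arm maximizing
    empirical_mean + sqrt(2 * ln(t) / n_i)."""
    counts = np.zeros(k)
    sums = np.zeros(k)
    for i in range(k):                           # initialization: one pull per arm
        sums[i] += pull(i)
        counts[i] += 1
    for t in range(k + 1, horizon + 1):
        bonus = np.sqrt(2.0 * np.log(t) / counts)
        i = int(np.argmax(sums / counts + bonus))
        sums[i] += pull(i)
        counts[i] += 1
    return sums / counts                         # empirical means after `horizon` pulls
```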
SA Convergence Theorems:
Explicitly, we can write the projection $\Pi$ in vector notation as: $\Pi = \Phi\left(\Phi^\top D \Phi\right)^{-1}\Phi^\top D$, where $\Phi$ is the feature matrix and $D$ is the diagonal matrix of the stationary-distribution weights.
The projection is non-expansive: $\|\Pi V - \Pi \bar{V}\|_D \le \|V - \bar{V}\|_D$
Theorem: Convergence of TD(0) with Function Approximation. Let the step-sizes satisfy G1.
Also, assume that $\Phi$ is full rank, and that the stationary distribution $\pi$ exists. Then
$\lim_{t\to\infty} w_t = w^*$ with probability 1, and $\Phi w^*$ is the fixed point of $\Pi T^{(\pi)}$.
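A minimal Python sketch of the TD(0) update with linear function approximation that this theorem refers to (the feature map `phi`, the step-size sequence, and the sampled transition stream are illustrative assumptions):

```python
import numpy as np

def td0_linear(transitions, phi, d, gamma, alphas):
    """TD(0) with linear value approximation V_w(x) = phi(x) @ w.
    transitions: iterable of (x, r, x_next) sampled under the fixed policy.
    alphas: step-size sequence (e.g. satisfying the usual conditions G1)."""
    w = np.zeros(d)
    for (x, r, x_next), alpha in zip(transitions, alphas):
        delta = r + gamma * phi(x_next) @ w - phi(x) @ w   # TD error
        w = w + alpha * delta * phi(x)                      # stochastic update
    return w
```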
Error Bound:
$\|\Phi w^* - V^\pi\|_D \le \frac{1}{1-\gamma}\,\|\Pi V^\pi - V^\pi\|_D$
LSTD Algorithm:
$\hat{A} = \frac{1}{N}\sum_{i=1}^{N} \phi(s_i)\left(\phi(s_i) - \gamma\,\phi(s_i')\right)^\top$
$\hat{b} = \frac{1}{N}\sum_{i=1}^{N} r(s_i)\,\phi(s_i)$
$w = \hat{A}^{-1}\hat{b}$
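A short Python sketch of LSTD under the formulation above (the data layout, a list of feature/reward triples per transition, is an illustrative assumption):

```python
import numpy as np

def lstd(samples, gamma):
    """LSTD: samples is a list of (phi_s, r, phi_s_next) with feature vectors
    of dimension d. Returns the weight vector w = A^{-1} b."""
    d = len(samples[0][0])
    A_hat = np.zeros((d, d))
    b_hat = np.zeros(d)
    for phi_s, r, phi_next in samples:
        A_hat += np.outer(phi_s, phi_s - gamma * phi_next)   # accumulate A
        b_hat += r * phi_s                                    # accumulate b
    A_hat /= len(samples)
    b_hat /= len(samples)
    return np.linalg.solve(A_hat, b_hat)
```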
LSPI Algorithm:
Start with some arbitrary $w_0$. For $m = 0, 1, \dots$, let $\pi_m$ be the greedy policy with respect to $Q_{w_m}(s,a) = \phi(s,a)^\top w_m$, and compute
$\hat{A} = \frac{1}{N}\sum_{i=1}^{N} \phi(s_i, a_i)\left(\phi(s_i, a_i) - \gamma\,\phi\big(s_i', \pi_m(s_i')\big)\right)^\top$
$\hat{b} = \frac{1}{N}\sum_{i=1}^{N} r(s_i, a_i)\,\phi(s_i, a_i)$
$w_{m+1} = \hat{A}^{-1}\hat{b}$
Compute the gradient estimate:
$\hat{\nabla} J(\theta) = R(\tau)\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(u_t \mid x_t)$
Policy Gradient Theorem:
$\nabla J(\theta) = E\left[\sum_{t=0}^{T-1}\left(\nabla_\theta \log \pi_\theta(u_t \mid x_t)\right) Q^{\pi_\theta}(x_t, u_t)\right]$
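A minimal Python sketch of the likelihood-ratio (REINFORCE-style) estimate above, for a tabular softmax policy over discrete actions (the softmax parameterization and trajectory format are illustrative assumptions, not the sheet's):

```python
import numpy as np

def softmax_grad_log(theta, x, u):
    """Gradient of log pi_theta(u|x) for a tabular softmax policy,
    theta of shape (num_states, num_actions)."""
    probs = np.exp(theta[x] - theta[x].max())
    probs /= probs.sum()
    grad = np.zeros_like(theta)
    grad[x] = -probs                   # d/d theta[x, a'] log pi = 1{a'=u} - pi(a'|x)
    grad[x, u] += 1.0
    return grad

def pg_estimate(theta, trajectories):
    """hat_grad J = average over trajectories of R(tau) * sum_t grad log pi(u_t|x_t)."""
    g = np.zeros_like(theta)
    for states, actions, rewards in trajectories:        # one trajectory tau
        score = sum(softmax_grad_log(theta, x, u) for x, u in zip(states, actions))
        g += sum(rewards) * score                         # R(tau) * score function
    return g / len(trajectories)
```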