Learning and Planning in Dynamical Systems – 046194 Spring 2016

Formula Sheet
Section 1 – Deterministic Dynamic Programming
Finite-horizon Dynamic Programming
(i) Initialize the value function: $V_N(s) = r_N(s)$, $\forall s \in S_N$.
(ii) Backward recursion: For $k = N-1, \ldots, 0$, compute
$$V_k(s) = \max_{a \in A_k} \big\{ r_k(s,a) + V_{k+1}(f_k(s,a)) \big\}, \quad \forall s \in S_k$$
(iii) Optimal policy: Choose any control policy $\pi = (\pi_k)$ that satisfies:
$$\pi_k(s) \in \arg\max_{a \in A_k} \big\{ r_k(s,a) + V_{k+1}(f_k(s,a)) \big\}, \quad k = 0, \ldots, N-1$$
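A minimal Python sketch of this backward recursion; the state/action sets, rewards $r_k$, terminal reward $r_N$, and dynamics $f_k$ passed in are hypothetical placeholders:

```python
# Sketch of deterministic finite-horizon DP (backward recursion).
# states, actions, r(k, s, a), r_terminal(s) and f(k, s, a) are placeholders.

def finite_horizon_dp(states, actions, r, r_terminal, f, N):
    """Return value functions V[0..N] and a policy pi[0..N-1]."""
    V = [dict() for _ in range(N + 1)]
    pi = [dict() for _ in range(N)]
    V[N] = {s: r_terminal(s) for s in states}          # V_N(s) = r_N(s)
    for k in range(N - 1, -1, -1):                      # k = N-1, ..., 0
        for s in states:
            best_a, best_v = None, float("-inf")
            for a in actions:
                v = r(k, s, a) + V[k + 1][f(k, s, a)]   # r_k(s,a) + V_{k+1}(f_k(s,a))
                if v > best_v:
                    best_a, best_v = a, v
            V[k][s] = best_v
            pi[k][s] = best_a
    return V, pi
```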

Shortest Path on a Graph


DP equation (or Bellman's equation) for the shortest path problem:
$$d(u,v) = \min\big\{ w(u,u') + d(u',v) : (u,u') \in E \big\}$$
Bellman-Ford Algorithm:
Input: a weighted directed graph $G = (V, E)$ with edge weights $w$, and destination node $t$.
Initialization: $d[t] = 0$; $d[v] = \infty$ for $v \in V \setminus \{t\}$; $\pi[v] = \emptyset$ for $v \in V$.
for $i = 1$ to $|V| - 1$:
    $d'[v] \leftarrow d[v]$, $\forall v \in V \setminus \{t\}$
    for each vertex $v \in V \setminus \{t\}$:
        compute $q[v] = \min_u \{ w(v,u) + d'[u] : (v,u) \in E \}$
        if $q[v] < d[v]$:
            set $d[v] \leftarrow q[v]$, $\pi[v] \leftarrow \arg\min_u \{ w(v,u) + d'[u] : (v,u) \in E \}$
return $\{ d[v], \pi[v] \}_{v \in V}$
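A minimal Python sketch of the Bellman-Ford recursion above, assuming the graph is given as a hypothetical dict of edge weights `w[(u, v)]`:

```python
# Sketch of the Bellman-Ford recursion toward destination t.
import math

def bellman_ford(vertices, w, t):
    d = {v: math.inf for v in vertices}
    pred = {v: None for v in vertices}
    d[t] = 0.0
    for _ in range(len(vertices) - 1):
        d_prev = dict(d)                                   # d'[v] <- d[v]
        for v in vertices:
            if v == t:
                continue
            # q[v] = min_u { w(v, u) + d'[u] : (v, u) in E }
            candidates = [(w[(v, u)] + d_prev[u], u) for u in vertices if (v, u) in w]
            if candidates:
                q, u = min(candidates)
                if q < d[v]:
                    d[v], pred[v] = q, u
    return d, pred
```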
Dijkstra's Algorithm:
Input: a weighted directed graph $G = (V, E)$ with edge weights $w$, and destination node $t$.
Initialization: $d[t] = 0$; $d[v] = \infty$ for $v \in V \setminus \{t\}$; $\pi[v] = \emptyset$ for $v \in V$.
$S \leftarrow \emptyset$
while $S \neq V$:
    choose $u \in V \setminus S$ with minimal value $d[u]$, and add it to $S$
    for each vertex $v$ with $(v,u) \in E$:
        if $d[v] > w(v,u) + d[u]$:
            set $d[v] \leftarrow w(v,u) + d[u]$, $\pi[v] \leftarrow u$
return $\{ d[v], \pi[v] \}_{v \in V}$
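A corresponding sketch of Dijkstra's algorithm (non-negative edge weights assumed), using the same hypothetical edge-weight representation:

```python
# Sketch of Dijkstra's algorithm toward destination t (non-negative weights).
import math

def dijkstra(vertices, w, t):
    d = {v: math.inf for v in vertices}
    pred = {v: None for v in vertices}
    d[t] = 0.0
    S = set()
    while S != set(vertices):
        u = min((v for v in vertices if v not in S), key=lambda v: d[v])
        S.add(u)
        for v in vertices:
            if (v, u) in w and d[v] > w[(v, u)] + d[u]:    # relax edge (v, u)
                d[v] = w[(v, u)] + d[u]
                pred[v] = u
    return d, pred
```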

Section 2 – Planning
Markov Chains
Stationary Distribution: $\pi$ is a stationary distribution for the Markov chain if $\pi P = \pi$, namely $\pi_j = \sum_i \pi_i p_{ij}$ for all $j$.
Theorem: Recurrence of finite Markov chains: Let $(X_t)$ be an irreducible, aperiodic Markov chain over a finite state space $X$. Then the following properties hold:
1. All states are positive recurrent.
2. There exists a unique stationary distribution $\pi$.
3. Convergence to the stationary distribution: $\lim_{t \to \infty} p_{ij}^{(t)} = \pi_j$ $(\forall j)$.
4. Ergodicity: for any $f : X \to \mathbb{R}$, $\lim_{t \to \infty} \frac{1}{t} \sum_{s=0}^{t-1} f(X_s) = \sum_i \pi(i) f(i)$.
Theorem: Countable Markov chains: Let $(X_t)$ be an irreducible and aperiodic Markov chain over a countable state space $X$. Then:
(a) Either (i) all states are positive recurrent, or (ii) all states are null recurrent, or (iii) all states are transient.
(b) If (i) holds, then properties (2)-(4) of the previous Theorem hold as well.
(c) Conversely, if there exists a stationary distribution $\pi$, then properties (1)-(4) are satisfied.
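As an illustration of the stationary distribution defined above, a minimal numpy sketch that solves $\pi P = \pi$, $\sum_j \pi_j = 1$ for a hypothetical two-state chain:

```python
# Sketch: stationary distribution of a finite, irreducible, aperiodic chain.
# The transition matrix P is a hypothetical example.
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Solve (P^T - I) pi = 0 together with the normalization constraint sum(pi) = 1.
n = P.shape[0]
A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
b = np.concatenate([np.zeros(n), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)        # for this P: approximately [0.8333, 0.1667]
```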

Contraction Operators
Norm: A norm $\|\cdot\|$ over $\mathbb{R}^n$ is a real-valued function $\|\cdot\| : \mathbb{R}^n \to \mathbb{R}$ such that, for any pair of vectors $x, y \in \mathbb{R}^n$ and scalar $a$,
(1) $\|ax\| = |a| \cdot \|x\|$, (2) $\|x + y\| \le \|x\| + \|y\|$, (3) $\|x\| = 0$ only if $x = 0$.
Contraction: An operator $T : \mathbb{R}^n \to \mathbb{R}^n$ is a contraction in $\|\cdot\|$ if there exists $\beta \in (0,1)$ such that
$$\| T(v_1) - T(v_2) \| \le \beta \| v_1 - v_2 \|, \quad \text{for all } v_1, v_2 \in \mathbb{R}^n$$
Theorem: Banach's fixed point theorem
Let $T : \mathbb{R}^n \to \mathbb{R}^n$ be a contraction operator. Then:
(C1) The equation $T(v) = v$ has a unique solution $V^* \in \mathbb{R}^n$.
(C2) For any $v_0 \in \mathbb{R}^n$, $\lim_{n \to \infty} T^n(v_0) = V^*$. In fact, $\| T^n(v_0) - V^* \| \le O(\beta^n)$.
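A tiny numerical illustration of (C2), using the hypothetical contraction $T(v) = 0.5\,v + 1$ (so $\beta = 0.5$ and $V^* = 2$):

```python
# Sketch: fixed-point iteration for the contraction T(v) = 0.5*v + 1.
v = 100.0                      # arbitrary v_0
for n in range(50):
    v = 0.5 * v + 1.0          # v_{n+1} = T(v_n)
print(v)                       # converges to the fixed point 2.0
```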

Finite Horizon MDPs


Performance Criterion
$$J^\pi(s_0) = \mathbb{E}^{\pi, s_0} \Big[ \textstyle\sum_{t=0}^{T-1} r_t(s_t, a_t) + r_T(s_T) \Big]$$
Finite-horizon Dynamic Programming
(i) Backward recursion: Set $V_T(s) = r_T(s)$ for $s \in S_T$.
For $k = T-1, \ldots, 0$, $V_k(s)$ may be computed using the following recursion:
$$V_k(s) = \max_{a \in A_k} \Big\{ r_k(s,a) + \sum_{s' \in S_{k+1}} p_k(s' \mid s, a)\, V_{k+1}(s') \Big\}, \quad \forall s \in S_k$$
(ii) Optimal policy: Any Markov policy $\pi^*$ that satisfies, for $t = 0, \ldots, T-1$,
$$\pi_t^*(s) \in \arg\max_{a \in A_t} \Big\{ r_t(s,a) + \sum_{s' \in S_{t+1}} p_t(s' \mid s, a)\, V_{t+1}(s') \Big\}, \quad \forall s \in S_t,$$
is an optimal control policy. Furthermore, $\pi^*$ maximizes $J^\pi(s_0)$ simultaneously for every initial state $s_0 \in S_0$.

Discounted MDPs
Value Function (fixed policy)
$$V^\pi(s) = \mathbb{E}^{\pi, s} \Big( \textstyle\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \Big)$$
Bellman (fixed policy)
$$V^\pi(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} p(s' \mid s, \pi(s))\, V^\pi(s'), \quad \forall s \in S$$
Vector form: $V^\pi = r^\pi + \gamma P^\pi V^\pi$
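Since $I - \gamma P^\pi$ is invertible for $\gamma < 1$, the vector form can be solved directly for $V^\pi$; a minimal sketch with a hypothetical two-state example:

```python
# Sketch: exact policy evaluation from the vector form V = r + gamma P V,
# i.e. V = (I - gamma P)^{-1} r.  P_pi and r_pi below are hypothetical.
import numpy as np

gamma = 0.9
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])       # P^pi (row-stochastic)
r_pi = np.array([1.0, 0.0])         # r^pi

V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(V_pi)
```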
Optimal Value Function
$$V^*(s) = \sup_\pi \mathbb{E}^{\pi, s} \Big( \textstyle\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \Big)$$
Theorem: Bellman's Optimality Equation
(i) $V^*$ is the unique solution of the following set of (nonlinear) equations:
$$V(s) = \max_{a \in A} \Big\{ r(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V(s') \Big\}, \quad \forall s \in S.$$

(ii) Any stationary policy $\pi^*$ that satisfies
$$\pi^*(s) \in \arg\max_{a \in A} \Big\{ r(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V^*(s') \Big\}$$
is an optimal policy (for any initial state $s_0 \in S$).
Operator Notation
$$(T^\pi V)(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} p(s' \mid s, \pi(s))\, V(s'), \quad s \in S$$
$$(T^* V)(s) = \max_{a \in A} \Big\{ r(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V(s') \Big\}, \quad s \in S$$
Theorem: Contraction property
(i) $T^\pi$ is a $\gamma$-contraction operator with respect to the max-norm, namely
$$\| T^\pi(V_1) - T^\pi(V_2) \|_\infty \le \gamma \| V_1 - V_2 \|_\infty \quad \text{for all } V_1, V_2 \in \mathbb{R}^{|S|}$$
(ii) Similarly, $T^*$ is a $\gamma$-contraction operator with respect to the max-norm.

Value Iteration
Starting with an arbitrary $V_0(s)$, $s \in S$, compute recursively
$$V_{n+1}(s) = \max_{a \in A} \Big\{ r(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V_n(s') \Big\}, \quad s \in S$$
Then $\lim_{n \to \infty} V_n = V^*$ (component-wise). The rate of convergence is $O(\gamma^n)$.
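A minimal sketch of this recursion, assuming the MDP is given as hypothetical arrays `P[a]` (transition matrix of action `a`) and `r[s, a]`:

```python
# Sketch of value iteration for a discounted MDP.
import numpy as np

def value_iteration(P, r, gamma, n_iter=1000):
    n_states, n_actions = r.shape
    V = np.zeros(n_states)                                   # arbitrary V_0
    for _ in range(n_iter):
        # Q[s, a] = r(s, a) + gamma * sum_s' p(s'|s,a) V(s')
        Q = r + gamma * np.stack([P[a] @ V for a in range(n_actions)], axis=1)
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)                               # V ~ V*, greedy policy
```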

Policy Iteration
Initialization: choose some stationary policy $\pi_0$.
For $k = 0, 1, \ldots$:
1. Policy evaluation: compute $V^{\pi_k}$.
2. Policy improvement: compute $\pi_{k+1}$, a greedy policy with respect to $V^{\pi_k}$:
$$\pi_{k+1}(s) \in \arg\max_{a \in A} \Big\{ r(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V^{\pi_k}(s') \Big\}.$$
3. Stop if $V^{\pi_{k+1}} = V^{\pi_k}$ (or if $V^{\pi_k}$ satisfies the optimality equation), else repeat.
Theorem: Policy Iteration
 k 1
(i) Each policy  k 1 is improving over the previous one  k , in the sense that V  V k
(component-wise).
 k 1
(ii) V  V  k if and only if  k is an optimal policy.
(iii)  k converges to the optimal policy after a finite number of steps.
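A minimal sketch of policy iteration for the same hypothetical MDP representation as in the value iteration sketch:

```python
# Sketch of policy iteration with exact policy evaluation.
import numpy as np

def policy_iteration(P, r, gamma):
    n_states, n_actions = r.shape
    pi = np.zeros(n_states, dtype=int)                        # some initial policy pi_0
    while True:
        # 1. Policy evaluation: solve V = r_pi + gamma P_pi V.
        P_pi = np.stack([P[pi[s]][s] for s in range(n_states)])
        r_pi = r[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # 2. Policy improvement: greedy policy w.r.t. V.
        Q = r + gamma * np.stack([P[a] @ V for a in range(n_actions)], axis=1)
        pi_new = Q.argmax(axis=1)
        # 3. Stop when the policy no longer changes.
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```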

Section 3 – Learning - MAB


Hoeffding's Inequality: Let $X_1, \ldots, X_n$ be i.i.d. random variables with values in $[0,1]$ and mean $\mu$, and let $\hat\mu_n = \frac{1}{n} \sum_{i=1}^{n} X_i$. Then
$$\Pr\big( | \hat\mu_n - \mu | \ge \epsilon \big) \le 2 e^{-2 n \epsilon^2}.$$

Union bound: $\Pr\big( \bigcup_i A_i \big) \le \sum_i \Pr(A_i)$.

Exploratory MAB
PAC definition: An algorithm is an $(\epsilon, \delta)$-PAC algorithm for the multi-armed bandit with sample complexity $T$, if it outputs an $\epsilon$-optimal arm $a'$ with probability at least $1 - \delta$ when it terminates, and the number of time steps the algorithm performs until it terminates is bounded by $T$.
Naïve Algorithm:
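A sketch of one common form of the naive $(\epsilon, \delta)$-PAC algorithm (sample every arm equally often and return the empirical best); the constant in $\ell$ below is one standard choice and may differ from the version used in the course:

```python
# Sketch of a naive (eps, delta)-PAC algorithm for n arms.
import math

def naive_pac(arms, eps, delta):
    """`arms` is a list of callables, each returning a reward in [0, 1]."""
    n = len(arms)
    ell = math.ceil((4.0 / eps**2) * math.log(2.0 * n / delta))   # samples per arm
    means = [sum(arm() for _ in range(ell)) / ell for arm in arms]
    return max(range(n), key=lambda i: means[i])                  # eps-optimal arm w.h.p.
```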

Successive Elimination (known biases):

Successive Elimination (unknown biases):


Median Elimination:

Regret Minimization in MAB


Expected Regret definition: $\mathbb{E}[\mathrm{regret}_t] = t\, r^* - \mathbb{E}\big[ \sum_{\tau=1}^{t} r_\tau \big]$.
UCB algorithm:
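A sketch of the standard UCB1 index rule (play each arm once, then play the arm maximizing $\hat\mu_i + \sqrt{2 \ln t / n_i}$); the exploration constant is the textbook choice and may differ from the course's variant:

```python
# Sketch of UCB1: empirical mean plus confidence bonus.
import math

def ucb1(arms, horizon):
    """`arms` is a list of callables returning rewards in [0, 1]."""
    n = len(arms)
    counts = [0] * n
    sums = [0.0] * n
    for t in range(1, horizon + 1):
        if t <= n:
            i = t - 1                                  # initialization: play each arm once
        else:
            i = max(range(n),
                    key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2.0 * math.log(t) / counts[j]))
        reward = arms[i]()
        counts[i] += 1
        sums[i] += reward
    return counts, sums
```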

Arm differences: $\Delta_i = r^* - r_i$.
Theorem: UCB

Section 4 – Reinforcement Learning


Stochastic Approximation
Martingales: A sequence $(X_k)$ adapted to a filtration $(\mathcal{F}_k)$ is a martingale if $\mathbb{E}|X_k| < \infty$ and $\mathbb{E}(X_{k+1} \mid \mathcal{F}_k) = X_k$ for all $k$.
Martingale Difference: the same, but with $\mathbb{E}(X_{k+1} \mid \mathcal{F}_k) = 0$.


Stochastic Approximation Algorithm:
$$\theta_{n+1} = \theta_n + \alpha_n \big[ h(\theta_n) + \omega_n \big]$$
Assumptions:

SA Convergence Theorems:

Reinforcement Learning – Tabular Methods


Deterministic Q-learning:

Policy Evaluation: TD(0), TD($\lambda$)

Policy Evaluation: SARSA

Policy Improvement: Q-learning
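For reference, a minimal sketch of the standard tabular update rules (TD(0), SARSA, Q-learning), with step size $\alpha$ and discount $\gamma$; the dictionary-based tables are an illustrative choice:

```python
# Sketch of standard tabular update rules (one transition each).
# V and Q are dictionaries; alpha is the step size, gamma the discount factor.

def td0_update(V, s, r, s_next, alpha, gamma):
    # V(s) <- V(s) + alpha [ r + gamma V(s') - V(s) ]
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # Q(s,a) <- Q(s,a) + alpha [ r + gamma Q(s',a') - Q(s,a) ]
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def q_learning_update(Q, actions, s, a, r, s_next, alpha, gamma):
    # Q(s,a) <- Q(s,a) + alpha [ r + gamma max_a' Q(s',a') - Q(s,a) ]
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```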



Reinforcement Learning – Function Approximation


Linear Function Approximation:
$$\hat V(s) = \phi(s)^\top w \qquad \text{(in matrix form } \hat V = \Phi w \text{, where the rows of } \Phi \text{ are } \phi(s)^\top \text{)}$$
$\xi$-weighted Euclidean norm:
$$\| X \|_\xi^2 = \sum_{j=1}^{n} \xi(j)\, X(j)^2$$

Projection Operator: $\Pi_\xi : \mathbb{R}^n \to \mathbb{R}^n$ projects a vector $V \in \mathbb{R}^n$ onto the linear subspace $S_\Phi$ spanned by the $k$ features $\phi_1(s), \ldots, \phi_k(s)$, w.r.t. $\| \cdot \|_\xi$:
$$\Pi_\xi V = \Phi w^*, \qquad w^* = \arg\min_{w \in \mathbb{R}^k} \| \Phi w - V \|_\xi^2$$
Explicitly, we can write $\Pi_\xi$ in matrix notation as $\Pi_\xi = \Phi (\Phi^\top \Xi \Phi)^{-1} \Phi^\top \Xi$, where $\Xi = \mathrm{diag}(\xi)$.
The projection is non-expansive: $\| \Pi_\xi V - \Pi_\xi \bar V \|_\xi \le \| V - \bar V \|_\xi$.

Lemma: When $\xi$ is the stationary distribution of $P^\pi$: $\| P^\pi z \|_\xi^2 \le \| z \|_\xi^2$.


Theorem: T  is a contraction in the‖ ‖ norm. Its unique fixed point satisfies:
   P  I  w   r  0 .

TD(0) with Function Approximation


wt 1  wt   t t ( st )
t r ( st )   ( st 1 ) wt   ( st ) wt

Theorem: Convergence of TD(0) with Function Approximation. Let the step sizes satisfy G1. Also, assume that $\Phi$ is full rank, and that the stationary distribution $\xi$ exists. Then $\lim_{t \to \infty} w_t = w^*$ with probability 1, and $\Phi w^*$ is the fixed point of $\Pi_\xi T^\pi$.
Error Bound:
$$\| \Phi w^* - V^\pi \|_\xi \le \frac{1}{1 - \gamma} \| \Pi_\xi V^\pi - V^\pi \|_\xi$$
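A minimal sketch of one TD(0) step with linear function approximation; the feature map `phi` is a hypothetical placeholder returning a numpy vector:

```python
# Sketch of the TD(0) update with V_hat(s) = phi(s)^T w.
import numpy as np

def td0_linear_step(w, phi, s, r, s_next, alpha, gamma):
    delta = r + gamma * phi(s_next) @ w - phi(s) @ w   # TD error delta_t
    return w + alpha * delta * phi(s)                  # w_{t+1} = w_t + alpha_t delta_t phi(s_t)
```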

LSTD Algorithm:
$$\hat A = \sum_{i=1}^{N} \phi(s_i) \big( \phi(s_i) - \gamma\, \phi(s_i') \big)^\top, \qquad \hat b = \sum_{i=1}^{N} r(s_i)\, \phi(s_i), \qquad w = \hat A^{-1} \hat b$$
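A minimal sketch of LSTD on a batch of transitions $(s_i, r_i, s_i')$, following the matrices above; `phi` is a hypothetical feature map returning a $k$-dimensional vector:

```python
# Sketch of LSTD from a batch of (s, r, s') transitions.
import numpy as np

def lstd(transitions, phi, gamma, k):
    A_hat = np.zeros((k, k))
    b_hat = np.zeros(k)
    for s, r, s_next in transitions:
        A_hat += np.outer(phi(s), phi(s) - gamma * phi(s_next))
        b_hat += r * phi(s)
    return np.linalg.solve(A_hat, b_hat)               # w = A_hat^{-1} b_hat
```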
LSPI Algorithm:
start with some arbitrary w0
for $m = 0, 1, 2, \ldots$:
$$\hat A = \sum_{i=1}^{N} \phi(s_i, a_i) \big( \phi(s_i, a_i) - \gamma\, \phi(s_i', \pi_{\mathrm{greedy}}(s_i'; w_m)) \big)^\top, \qquad \hat b = \sum_{i=1}^{N} r(s_i, a_i)\, \phi(s_i, a_i), \qquad w_{m+1} = \hat A^{-1} \hat b$$
where $\pi_{\mathrm{greedy}}(s; w) = \arg\max_a \phi(s, a)^\top w$.

Reinforcement Learning – Policy Gradients


Gradient ascent scheme:
$$\theta_{k+1} = \theta_k + \alpha_k \nabla_\theta J(\theta_k).$$

Likelihood ratio formula:


J ( )   τ R( τ ) p ( τ )
  τ [ R( τ ) log p ( τ )] p ( τ )
 E  R( τ ) log p ( τ ) 

Gradient estimation algorithm:


• Simulate a single episode ("rollout") of the controlled system with policy $\pi_\theta$.
• Compute $R(\tau)$ as $R(\tau) = \sum_{t=0}^{T-1} r(x_t, u_t) + r_T(x_T)$, or directly using the observed rewards $R(\tau) = \sum_{t=0}^{T} R_t$.
• Compute $\hat\nabla_\theta J(\theta) = R(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t \mid x_t)$.
Policy Gradient Theorem:
T 1
 J ( )  E 
 ( log  (ut | xt ))Q ( xt , ut )
t 0
