Learning and Planning in Dynamical Systems – 046194 Spring 2016

Formula Sheet
Section 1 – Deterministic Dynamic Programming
Finite-horizon Dynamic Programming
(i) Initialize the value function: $V_N(s) = r_N(s)$, $\forall s \in S_N$.
(ii) Backward recursion: For $k = N-1, \ldots, 0$, compute
$$V_k(s) = \max_{a \in A_k} \big\{ r_k(s,a) + V_{k+1}(f_k(s,a)) \big\}, \quad \forall s \in S_k$$
(iii) Optimal policy: Choose any control policy $\pi = (\pi_k)$ that satisfies:
$$\pi_k(s) \in \arg\max_{a \in A_k} \big\{ r_k(s,a) + V_{k+1}(f_k(s,a)) \big\}, \quad k = 0, \ldots, N-1$$
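A minimal Python sketch of this backward recursion; the state/action sets, rewards $r_k$, terminal reward $r_N$, and dynamics $f_k$ passed in are hypothetical placeholders:

```python
# Sketch of deterministic finite-horizon DP (backward recursion).
# states, actions, r(k, s, a), r_terminal(s) and f(k, s, a) are placeholders.

def finite_horizon_dp(states, actions, r, r_terminal, f, N):
    """Return value functions V[0..N] and a policy pi[0..N-1]."""
    V = [dict() for _ in range(N + 1)]
    pi = [dict() for _ in range(N)]
    V[N] = {s: r_terminal(s) for s in states}          # V_N(s) = r_N(s)
    for k in range(N - 1, -1, -1):                      # k = N-1, ..., 0
        for s in states:
            best_a, best_v = None, float("-inf")
            for a in actions:
                v = r(k, s, a) + V[k + 1][f(k, s, a)]   # r_k(s,a) + V_{k+1}(f_k(s,a))
                if v > best_v:
                    best_a, best_v = a, v
            V[k][s] = best_v
            pi[k][s] = best_a
    return V, pi
```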

Shortest Path on a Graph


DP equation (or Bellman's equation) for the shortest path problem:
$$d(u,v) = \min\big\{ w(u,u') + d(u',v) : (u,u') \in E \big\}$$
Bellman-Ford Algorithm:
Input: a weighted directed graph $G = (V, E)$ with edge weights $w$, and destination node $t$.
Initialization: $d[t] = 0$; $d[v] = \infty$ for $v \in V \setminus \{t\}$; $\pi[v] = \emptyset$ for $v \in V$.
for $i = 1$ to $|V| - 1$:
    $d'[v] \leftarrow d[v]$, $\forall v \in V \setminus \{t\}$
    for each vertex $v \in V \setminus \{t\}$:
        compute $q[v] = \min_u \{ w(v,u) + d'[u] : (v,u) \in E \}$
        if $q[v] < d[v]$:
            set $d[v] \leftarrow q[v]$, $\pi[v] \leftarrow \arg\min_u \{ w(v,u) + d'[u] : (v,u) \in E \}$
return $\{ d[v], \pi[v] \}_{v \in V}$
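A minimal Python sketch of the Bellman-Ford recursion above, assuming the graph is given as a hypothetical dict of edge weights `w[(u, v)]`:

```python
# Sketch of the Bellman-Ford recursion toward destination t.
import math

def bellman_ford(vertices, w, t):
    d = {v: math.inf for v in vertices}
    pred = {v: None for v in vertices}
    d[t] = 0.0
    for _ in range(len(vertices) - 1):
        d_prev = dict(d)                                   # d'[v] <- d[v]
        for v in vertices:
            if v == t:
                continue
            # q[v] = min_u { w(v, u) + d'[u] : (v, u) in E }
            candidates = [(w[(v, u)] + d_prev[u], u) for u in vertices if (v, u) in w]
            if candidates:
                q, u = min(candidates)
                if q < d[v]:
                    d[v], pred[v] = q, u
    return d, pred
```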
Dijkstra's Algorithm:
Input: a weighted directed graph $G = (V, E)$ with edge weights $w$, and destination node $t$.
Initialization: $d[t] = 0$; $d[v] = \infty$ for $v \in V \setminus \{t\}$; $\pi[v] = \emptyset$ for $v \in V$.
$S \leftarrow \emptyset$
while $S \neq V$:
    choose $u \in V \setminus S$ with minimal value $d[u]$, and add it to $S$
    for each vertex $v$ with $(v,u) \in E$:
        if $d[v] > w(v,u) + d[u]$:
            set $d[v] \leftarrow w(v,u) + d[u]$, $\pi[v] \leftarrow u$
return $\{ d[v], \pi[v] \}_{v \in V}$
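A corresponding sketch of Dijkstra's algorithm (non-negative edge weights assumed), using the same hypothetical edge-weight representation:

```python
# Sketch of Dijkstra's algorithm toward destination t (non-negative weights).
import math

def dijkstra(vertices, w, t):
    d = {v: math.inf for v in vertices}
    pred = {v: None for v in vertices}
    d[t] = 0.0
    S = set()
    while S != set(vertices):
        u = min((v for v in vertices if v not in S), key=lambda v: d[v])
        S.add(u)
        for v in vertices:
            if (v, u) in w and d[v] > w[(v, u)] + d[u]:    # relax edge (v, u)
                d[v] = w[(v, u)] + d[u]
                pred[v] = u
    return d, pred
```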

Section 2 – Planning
Markov Chains
Stationary Distribution: $\pi$ is a stationary distribution for the Markov chain if $\pi P = \pi$, namely $\pi_j = \sum_i \pi_i p_{ij}$ for all $j$.
Theorem: Recurrence of finite Markov chains: Let $(X_t)$ be an irreducible, aperiodic Markov chain over a finite state space $X$. Then the following properties hold:
1. All states are positive recurrent.
2. There exists a unique stationary distribution $\pi$.
3. Convergence to the stationary distribution: $\lim_{t \to \infty} p_{ij}^{(t)} = \pi_j$ $(\forall j)$.
4. Ergodicity: for any $f : X \to \mathbb{R}$, $\lim_{t \to \infty} \frac{1}{t} \sum_{s=0}^{t-1} f(X_s) = \sum_i \pi(i) f(i)$.
Theorem: Countable Markov chains: Let $(X_t)$ be an irreducible and aperiodic Markov chain over a countable state space $X$. Then:
(a) Either (i) all states are positive recurrent, or (ii) all states are null recurrent, or (iii) all states are transient.
(b) If (i) holds, then properties (2)-(4) of the previous Theorem hold as well.
(c) Conversely, if there exists a stationary distribution $\pi$, then properties (1)-(4) are satisfied.
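As an illustration of the stationary distribution defined above, a minimal numpy sketch that solves $\pi P = \pi$, $\sum_j \pi_j = 1$ for a hypothetical two-state chain:

```python
# Sketch: stationary distribution of a finite, irreducible, aperiodic chain.
# The transition matrix P is a hypothetical example.
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Solve (P^T - I) pi = 0 together with the normalization constraint sum(pi) = 1.
n = P.shape[0]
A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
b = np.concatenate([np.zeros(n), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)        # for this P: approximately [0.8333, 0.1667]
```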

Contraction Operators
Norm: A norm $\|\cdot\|$ over $\mathbb{R}^n$ is a real-valued function $\|\cdot\| : \mathbb{R}^n \to \mathbb{R}$ such that, for any pair of vectors $x, y \in \mathbb{R}^n$ and scalar $a$,
(1) $\|ax\| = |a| \cdot \|x\|$, (2) $\|x + y\| \le \|x\| + \|y\|$, (3) $\|x\| = 0$ only if $x = 0$.
Contraction: An operator $T : \mathbb{R}^n \to \mathbb{R}^n$ is a contraction in $\|\cdot\|$ if there exists $\beta \in (0,1)$ such that
$$\| T(v_1) - T(v_2) \| \le \beta \| v_1 - v_2 \|, \quad \text{for all } v_1, v_2 \in \mathbb{R}^n$$
Theorem: Banach's fixed point theorem
Let $T : \mathbb{R}^n \to \mathbb{R}^n$ be a contraction operator. Then:
(C1) The equation $T(v) = v$ has a unique solution $V^* \in \mathbb{R}^n$.
(C2) For any $v_0 \in \mathbb{R}^n$, $\lim_{n \to \infty} T^n(v_0) = V^*$. In fact, $\| T^n(v_0) - V^* \| \le O(\beta^n)$.
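A tiny numerical illustration of (C2), using the hypothetical contraction $T(v) = 0.5\,v + 1$ (so $\beta = 0.5$ and $V^* = 2$):

```python
# Sketch: fixed-point iteration for the contraction T(v) = 0.5*v + 1.
v = 100.0                      # arbitrary v_0
for n in range(50):
    v = 0.5 * v + 1.0          # v_{n+1} = T(v_n)
print(v)                       # converges to the fixed point 2.0
```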

Finite Horizon MDPs


Performance Criterion
$$J^\pi(s_0) = \mathbb{E}^{\pi, s_0} \Big[ \textstyle\sum_{t=0}^{T-1} r_t(s_t, a_t) + r_T(s_T) \Big]$$
Finite-horizon Dynamic Programming
(i) Backward recursion: Set $V_T(s) = r_T(s)$ for $s \in S_T$.
For $k = T-1, \ldots, 0$, $V_k(s)$ may be computed using the following recursion:
$$V_k(s) = \max_{a \in A_k} \Big\{ r_k(s,a) + \sum_{s' \in S_{k+1}} p_k(s' \mid s, a)\, V_{k+1}(s') \Big\}, \quad \forall s \in S_k$$
(ii) Optimal policy: Any Markov policy $\pi^*$ that satisfies, for $t = 0, \ldots, T-1$,
$$\pi_t^*(s) \in \arg\max_{a \in A_t} \Big\{ r_t(s,a) + \sum_{s' \in S_{t+1}} p_t(s' \mid s, a)\, V_{t+1}(s') \Big\}, \quad \forall s \in S_t,$$
is an optimal control policy. Furthermore, $\pi^*$ maximizes $J^\pi(s_0)$ simultaneously for every initial state $s_0 \in S_0$.

Discounted MDPs
Value Function (fixed policy)
$$V^\pi(s) = \mathbb{E}^{\pi, s} \Big( \textstyle\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \Big)$$
Bellman (fixed policy)
$$V^\pi(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} p(s' \mid s, \pi(s))\, V^\pi(s'), \quad \forall s \in S$$
Vector form: $V^\pi = r^\pi + \gamma P^\pi V^\pi$
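Since $I - \gamma P^\pi$ is invertible for $\gamma < 1$, the vector form can be solved directly for $V^\pi$; a minimal sketch with a hypothetical two-state example:

```python
# Sketch: exact policy evaluation from the vector form V = r + gamma P V,
# i.e. V = (I - gamma P)^{-1} r.  P_pi and r_pi below are hypothetical.
import numpy as np

gamma = 0.9
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])       # P^pi (row-stochastic)
r_pi = np.array([1.0, 0.0])         # r^pi

V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(V_pi)
```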
Optimal Value Function
$$V^*(s) = \sup_\pi \mathbb{E}^{\pi, s} \Big( \textstyle\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \Big)$$
Theorem: Bellman's Optimality Equation
(i) $V^*$ is the unique solution of the following set of (nonlinear) equations:
$$V(s) = \max_{a \in A} \Big\{ r(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V(s') \Big\}, \quad \forall s \in S.$$

(ii) Any stationary policy $\pi^*$ that satisfies
$$\pi^*(s) \in \arg\max_{a \in A} \Big\{ r(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V^*(s') \Big\}$$
is an optimal policy (for any initial state $s_0 \in S$).
Operator Notation
$$(T^\pi V)(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} p(s' \mid s, \pi(s))\, V(s'), \quad s \in S$$
$$(T^* V)(s) = \max_{a \in A} \Big\{ r(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V(s') \Big\}, \quad s \in S$$
Theorem: Contraction property
(i) $T^\pi$ is a $\gamma$-contraction operator with respect to the max-norm, namely
$$\| T^\pi(V_1) - T^\pi(V_2) \|_\infty \le \gamma \| V_1 - V_2 \|_\infty \quad \text{for all } V_1, V_2 \in \mathbb{R}^{|S|}$$
(ii) Similarly, $T^*$ is a $\gamma$-contraction operator with respect to the max-norm.

Value Iteration
Starting with an arbitrary $V_0(s)$, $s \in S$, compute recursively
$$V_{n+1}(s) = \max_{a \in A} \Big\{ r(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V_n(s') \Big\}, \quad s \in S$$
Then $\lim_{n \to \infty} V_n = V^*$ (component-wise). The rate of convergence is $O(\gamma^n)$.
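A minimal sketch of this recursion, assuming the MDP is given as hypothetical arrays `P[a]` (transition matrix of action `a`) and `r[s, a]`:

```python
# Sketch of value iteration for a discounted MDP.
import numpy as np

def value_iteration(P, r, gamma, n_iter=1000):
    n_states, n_actions = r.shape
    V = np.zeros(n_states)                                   # arbitrary V_0
    for _ in range(n_iter):
        # Q[s, a] = r(s, a) + gamma * sum_s' p(s'|s,a) V(s')
        Q = r + gamma * np.stack([P[a] @ V for a in range(n_actions)], axis=1)
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)                               # V ~ V*, greedy policy
```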

Policy Iteration
Initialization: choose some stationary policy $\pi_0$.
For $k = 0, 1, \ldots$:
1. Policy evaluation: compute $V^{\pi_k}$.
2. Policy improvement: compute $\pi_{k+1}$, a greedy policy with respect to $V^{\pi_k}$:
$$\pi_{k+1}(s) \in \arg\max_{a \in A} \Big\{ r(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V^{\pi_k}(s') \Big\}.$$
3. Stop if $V^{\pi_{k+1}} = V^{\pi_k}$ (or if $V^{\pi_k}$ satisfies the optimality equation), else repeat.
Theorem: Policy Iteration
 k 1
(i) Each policy  k 1 is improving over the previous one  k , in the sense that V  V k
(component-wise).
 k 1
(ii) V  V  k if and only if  k is an optimal policy.
(iii)  k converges to the optimal policy after a finite number of steps.
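A minimal sketch of policy iteration for the same hypothetical MDP representation as in the value iteration sketch:

```python
# Sketch of policy iteration with exact policy evaluation.
import numpy as np

def policy_iteration(P, r, gamma):
    n_states, n_actions = r.shape
    pi = np.zeros(n_states, dtype=int)                        # some initial policy pi_0
    while True:
        # 1. Policy evaluation: solve V = r_pi + gamma P_pi V.
        P_pi = np.stack([P[pi[s]][s] for s in range(n_states)])
        r_pi = r[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # 2. Policy improvement: greedy policy w.r.t. V.
        Q = r + gamma * np.stack([P[a] @ V for a in range(n_actions)], axis=1)
        pi_new = Q.argmax(axis=1)
        # 3. Stop when the policy no longer changes.
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```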

Section 3 – Learning - MAB


Hoeffding's Inequality: Let $X_1, \ldots, X_n$ be i.i.d. random variables with values in $[0,1]$ and mean $\mu$, and let $\hat\mu_n = \frac{1}{n} \sum_{i=1}^{n} X_i$. Then
$$\Pr\big( | \hat\mu_n - \mu | \ge \epsilon \big) \le 2 e^{-2 n \epsilon^2}.$$

Union bound: $\Pr\big( \bigcup_i A_i \big) \le \sum_i \Pr(A_i)$.

Exploratory MAB
PAC definition: An algorithm is an $(\epsilon, \delta)$-PAC algorithm for the multi-armed bandit with sample complexity $T$, if it outputs an $\epsilon$-optimal arm $a'$ with probability at least $1 - \delta$ when it terminates, and the number of time steps the algorithm performs until it terminates is bounded by $T$.
Naïve Algorithm:
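A sketch of one common form of the naive $(\epsilon, \delta)$-PAC algorithm (sample every arm equally often and return the empirical best); the constant in $\ell$ below is one standard choice and may differ from the version used in the course:

```python
# Sketch of a naive (eps, delta)-PAC algorithm for n arms.
import math

def naive_pac(arms, eps, delta):
    """`arms` is a list of callables, each returning a reward in [0, 1]."""
    n = len(arms)
    ell = math.ceil((4.0 / eps**2) * math.log(2.0 * n / delta))   # samples per arm
    means = [sum(arm() for _ in range(ell)) / ell for arm in arms]
    return max(range(n), key=lambda i: means[i])                  # eps-optimal arm w.h.p.
```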

Successive Elimination (known biases):

Successive Elimination (unknown biases):


Median Elimination:

Regret Minimization in MAB


Expected Regret definition: $\mathbb{E}[\mathrm{regret}_t] = t\, r^* - \mathbb{E}\big[ \sum_{\tau=1}^{t} r_\tau \big]$.
UCB algorithm:
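A sketch of the standard UCB1 index rule (play each arm once, then play the arm maximizing $\hat\mu_i + \sqrt{2 \ln t / n_i}$); the exploration constant is the textbook choice and may differ from the course's variant:

```python
# Sketch of UCB1: empirical mean plus confidence bonus.
import math

def ucb1(arms, horizon):
    """`arms` is a list of callables returning rewards in [0, 1]."""
    n = len(arms)
    counts = [0] * n
    sums = [0.0] * n
    for t in range(1, horizon + 1):
        if t <= n:
            i = t - 1                                  # initialization: play each arm once
        else:
            i = max(range(n),
                    key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2.0 * math.log(t) / counts[j]))
        reward = arms[i]()
        counts[i] += 1
        sums[i] += reward
    return counts, sums
```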

Arm differences: $\Delta_i = r^* - r_i$.
Theorem: UCB

Section 4 – Reinforcement Learning


Stochastic Approximation
Martingales: A sequence $(X_k)$ adapted to a filtration $(\mathcal{F}_k)$ is a martingale if $\mathbb{E}|X_k| < \infty$ and $\mathbb{E}(X_{k+1} \mid \mathcal{F}_k) = X_k$ for all $k$.
Martingale Difference: the same, but with $\mathbb{E}(X_{k+1} \mid \mathcal{F}_k) = 0$.


Stochastic Approximation Algorithm:
$$\theta_{n+1} = \theta_n + \alpha_n \big[ h(\theta_n) + \omega_n \big]$$
Assumptions:

SA Convergence Theorems:

Reinforcement Learning – Tabular Methods


Deterministic Q-learning:

Policy Evaluation: TD(0), TD($\lambda$)

Policy Evaluation: SARSA

Policy Improvement: Q-learning
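For reference, a minimal sketch of the standard tabular update rules (TD(0), SARSA, Q-learning), with step size $\alpha$ and discount $\gamma$; the dictionary-based tables are an illustrative choice:

```python
# Sketch of standard tabular update rules (one transition each).
# V and Q are dictionaries; alpha is the step size, gamma the discount factor.

def td0_update(V, s, r, s_next, alpha, gamma):
    # V(s) <- V(s) + alpha [ r + gamma V(s') - V(s) ]
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # Q(s,a) <- Q(s,a) + alpha [ r + gamma Q(s',a') - Q(s,a) ]
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def q_learning_update(Q, actions, s, a, r, s_next, alpha, gamma):
    # Q(s,a) <- Q(s,a) + alpha [ r + gamma max_a' Q(s',a') - Q(s,a) ]
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```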



Reinforcement Learning – Function Approximation


Linear Function Approximation:
$$\hat V(s) = \phi(s)^\top w \qquad \text{(in matrix form } \hat V = \Phi w \text{, where the rows of } \Phi \text{ are } \phi(s)^\top \text{)}$$
$\xi$-weighted Euclidean norm:
$$\| X \|_\xi^2 = \sum_{j=1}^{n} \xi(j)\, X(j)^2$$

Projection Operator: $\Pi_\xi : \mathbb{R}^n \to \mathbb{R}^n$ projects a vector $V \in \mathbb{R}^n$ onto the linear subspace $S_\Phi$ spanned by the $k$ features $\phi_1(s), \ldots, \phi_k(s)$, w.r.t. $\| \cdot \|_\xi$:
$$\Pi_\xi V = \Phi w^*, \qquad w^* = \arg\min_{w \in \mathbb{R}^k} \| \Phi w - V \|_\xi^2$$
Explicitly, we can write $\Pi_\xi$ in matrix notation as $\Pi_\xi = \Phi (\Phi^\top \Xi \Phi)^{-1} \Phi^\top \Xi$, where $\Xi = \mathrm{diag}(\xi)$.
The projection is non-expansive: $\| \Pi_\xi V - \Pi_\xi \bar V \|_\xi \le \| V - \bar V \|_\xi$.

Lemma: When $\xi$ is the stationary distribution of $P^\pi$: $\| P^\pi z \|_\xi^2 \le \| z \|_\xi^2$.


Theorem: T  is a contraction in the‖ ‖ norm. Its unique fixed point satisfies:
   P  I  w   r  0 .

TD(0) with Function Approximation


wt 1  wt   t t ( st )
t r ( st )   ( st 1 ) wt   ( st ) wt

Theorem: Convergence of TD(0) with Function Approximation. Let the step sizes satisfy G1. Also, assume that $\Phi$ is full rank, and that the stationary distribution $\xi$ exists. Then $\lim_{t \to \infty} w_t = w^*$ with probability 1, and $\Phi w^*$ is the fixed point of $\Pi_\xi T^\pi$.
Error Bound:
$$\| \Phi w^* - V^\pi \|_\xi \le \frac{1}{1 - \gamma} \| \Pi_\xi V^\pi - V^\pi \|_\xi$$
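A minimal sketch of one TD(0) step with linear function approximation; the feature map `phi` is a hypothetical placeholder returning a numpy vector:

```python
# Sketch of the TD(0) update with V_hat(s) = phi(s)^T w.
import numpy as np

def td0_linear_step(w, phi, s, r, s_next, alpha, gamma):
    delta = r + gamma * phi(s_next) @ w - phi(s) @ w   # TD error delta_t
    return w + alpha * delta * phi(s)                  # w_{t+1} = w_t + alpha_t delta_t phi(s_t)
```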

LSTD Algorithm:
$$\hat A = \sum_{i=1}^{N} \phi(s_i) \big( \phi(s_i) - \gamma\, \phi(s_i') \big)^\top, \qquad \hat b = \sum_{i=1}^{N} r(s_i)\, \phi(s_i), \qquad w = \hat A^{-1} \hat b$$
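A minimal sketch of LSTD on a batch of transitions $(s_i, r_i, s_i')$, following the matrices above; `phi` is a hypothetical feature map returning a $k$-dimensional vector:

```python
# Sketch of LSTD from a batch of (s, r, s') transitions.
import numpy as np

def lstd(transitions, phi, gamma, k):
    A_hat = np.zeros((k, k))
    b_hat = np.zeros(k)
    for s, r, s_next in transitions:
        A_hat += np.outer(phi(s), phi(s) - gamma * phi(s_next))
        b_hat += r * phi(s)
    return np.linalg.solve(A_hat, b_hat)               # w = A_hat^{-1} b_hat
```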
LSPI Algorithm:
start with some arbitrary w0
for $m = 0, 1, 2, \ldots$:
$$\hat A = \sum_{i=1}^{N} \phi(s_i, a_i) \big( \phi(s_i, a_i) - \gamma\, \phi(s_i', \pi_{\mathrm{greedy}}(s_i'; w_m)) \big)^\top, \qquad \hat b = \sum_{i=1}^{N} r(s_i, a_i)\, \phi(s_i, a_i), \qquad w_{m+1} = \hat A^{-1} \hat b$$
where $\pi_{\mathrm{greedy}}(s; w) = \arg\max_a \phi(s, a)^\top w$.

Reinforcement Learning – Policy Gradients


Gradient ascent scheme:
$$\theta_{k+1} = \theta_k + \alpha_k \nabla_\theta J(\theta_k).$$

Likelihood ratio formula:


J ( )   τ R( τ ) p ( τ )
  τ [ R( τ ) log p ( τ )] p ( τ )
 E  R( τ ) log p ( τ ) 

Gradient estimation algorithm:


• Simulate a single episode ("rollout") of the controlled system with policy $\pi_\theta$.
• Compute $R(\tau)$ as $R(\tau) = \sum_{t=0}^{T-1} r(x_t, u_t) + r_T(x_T)$, or directly using the observed rewards $R(\tau) = \sum_{t=0}^{T} R_t$.
• Compute $\hat\nabla_\theta J(\theta) = R(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t \mid x_t)$.
Policy Gradient Theorem:
T 1
 J ( )  E 
 ( log  (ut | xt ))Q ( xt , ut )
t 0
