Reinforcement Learning: Foundations Exam
February 2022
This book is still a work in progress. We would be grateful for notification of errors or comments of any kind, at rlfoundationsbook@gmail.com.
Contents
1 Preface
2 Dynamic Programming
2.1 Subset sum
2.2 Colored Tree
2.3 Trajectories in Graph & Efficient Matrix Multiplication
2.4 Egg Throw & Maximal Square sub-Matrix
2.5 Two-Traversals Maximal Score
2.6 Longest Common Subsequence
2.7 All Pairs Shortest Path
2.8 Machine Maintenance
2.9 Black-Red Card Game
2.10 Blackmailer
2.11 Engineer's Job Choice Problem
3 MDPs
3.1 MDP with an Absorbing State
3.2 Worst Case Reward
3.3 Solving MDPs through Linear Programming
3.4 Simple MDP – Analysis
3.5 Solving MDPs using sampling
3.6 Bellman equation for Second Moment
3.7 Investor Planning Problem
3.8 Bellman Operators and Linearly Solvable MDPs
3.9 MDPs with Non-Linear Objectives
3.10 Belief MDP
3.11 Exploration Conscious MDPs
3.12 Average Policy Improvement Theorem
3.13 Time-Varying Bellman Operators
3.14 Contextual MDP
4 Bandits
4.1 MAB with 2 arms per round
4.2 Exploratory Multi-armed Bandits
4.3 (ϵ, δ)-Finder
4.4 Reward Swapping Arms
4.5 Median Finder
4.6 Constrained Bandits
4.7 Budgeted Bandits
4.8 Concurrent Bandits
4.9 UCB with Unknown T
4.10 Bandits with Context
5 Function Approximation
5.1 2-step Bellman Operator
5.2 Off-policy TD(0)
5.3 TD(λ) Analysis
5.4 TD(0) with Function Approximation
5.5 TD(λ) with Function Approximation
5.6 Approximate Value Iteration
5.7 Approximate Value Iteration – revisited
6 RL – misc.
6.1 GTD(0) algorithm
6.2 n-step Operators in Reinforcement Learning
6.3 q^π instead of v^π
6.4 Inverse Reinforcement Learning
6.5 Modified Policy Iteration and Generalizations
6.6 Double Q-learning
1 Preface
This booklet contains a collection of questions on RL, to accompany the textbook. These questions were given in final exams in our teaching of advanced undergraduate RL courses at Tel Aviv University and at the Technion. The questions are roughly organized according to topic. However, some questions involve topics from several chapters of study, and are intended as practice for students who have already completed studying the full material.
2 Dynamic Programming
2.1 Subset sum
We are given a list of n positive integer numbers a_1, . . . , a_n and a positive integer S. It is required to answer the following question: does there exist a subset of the a_i's that adds up to S?
a. Suppose that you are allowed to use each element ai any number of times. Give a dynamic programming
algorithm that provides an answer in O(nS) operations (addition and comparison).
b. Suppose now that you are allowed to use each element ai at most once. Repeat the previous problem (with
the same complexity requirement).
Hint: Let Aj = {a1 , . . . , aj }. Consider iterating over j.
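A minimal sketch in Python of the dynamic-programming idea for both parts, following the hint of adding the elements one at a time (function and variable names are illustrative, not part of the exam):

def subset_sum_repeats(a, S):
    # Part (a): each element may be used any number of times; O(nS) additions and comparisons.
    reach = [False] * (S + 1)
    reach[0] = True
    for x in a:
        for s in range(x, S + 1):          # forward scan allows reusing x
            if reach[s - x]:
                reach[s] = True
    return reach[S]

def subset_sum_once(a, S):
    # Part (b): each element may be used at most once; same O(nS) complexity.
    reach = [False] * (S + 1)
    reach[0] = True
    for x in a:                            # after processing a_1..a_j, reach encodes the sums of subsets of A_j
        for s in range(S, x - 1, -1):      # backward scan so x is used at most once
            if reach[s - x]:
                reach[s] = True
    return reach[S]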
2.2 Colored Tree
Consider a tree in which every edge is colored either blue or red.
a. Devise an efficient algorithm that counts how many trajectories from the source to the leaves contain only blue edges. The algorithm should not traverse any node twice.
b. Now we want to count how many trajectories from the source to the leaves contain at most one blue edge. Devise an efficient algorithm for this task.
c. Now we want to count how many trajectories from the source to the leaves contain at most k blue edges, where k > 0. Devise an efficient algorithm for this task.
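A possible single-pass DFS sketch for parts (b) and (c), assuming the tree is given as an adjacency structure children[v] of (child, edge colour) pairs rooted at the source (this input format is an assumption):

def count_paths_with_blue_budget(children, root, k):
    # Counts source-to-leaf paths that use at most k blue edges; every node is visited once.
    def dfs(v, budget):
        if not children[v]:                      # a leaf: one valid path ends here
            return 1
        total = 0
        for child, colour in children[v]:
            cost = 1 if colour == "blue" else 0
            if budget >= cost:
                total += dfs(child, budget - cost)
        return total
    return dfs(root, k)

# Part (b) is the case k = 1. For part (a) ("only blue edges") one can swap the colour roles
# and call the function with k = 0, so that any red edge immediately exhausts the budget.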
2.3 Trajectories in Graph & Efficient Matrix Multiplication
a. Given an undirected graph G = (V, E), we denote the degree (number of edges) of vertex v ∈ V by d(v).
Devise an algorithm for finding the number of possible T -length trajectories starting from a given vertex s,
and ending in a given vertex t (cycles are allowed). What is the complexity of your algorithm as a function
of |V |, T , and |E|?
Figure 1: Example. There are 12 trajectories of length 3 from F to C: FCFC, FEDC, FCDC, FCEC, FCBC, FCAC, FABC, FGAC, FGEC, FGFC, FAFC, FEFC.
b. The number of operations needed to multiply two matrices A and B where, A ∈ Rm×n , B ∈ Rn×p is
mnp operations.
For example, let A ∈ R3×5 , B ∈ R5×9 , C ∈ R9×2 , D ∈ R2×8 . Then the number of operations needed for
calculating ABCD with the following parentheses: (AB)(CD) is 3·5·9+9·2·8+3·9·8 = 495 operations.
However, the number of operations when calculating with the following parentheses: A(B(CD)) is 9·2·8 + 5·9·8 + 3·5·8 = 624 operations.
Given a multiplicative chain of N matrices, ∏_{i=1}^{N} A_i, such that A_i ∈ R^{p_i × p_{i+1}}, devise an algorithm for finding the multiplication order with the smallest number of operations. What is the complexity of your algorithm in the parameters of the problem?
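A sketch of the standard interval dynamic program for the matrix-chain question; the dimension vector p (with A_i of shape p[i-1] x p[i]) is an assumed input format:

def matrix_chain_order(p):
    # m[i][j] = minimal number of scalar multiplications needed to compute A_i ... A_j (1-indexed).
    N = len(p) - 1
    INF = float("inf")
    m = [[0] * (N + 1) for _ in range(N + 1)]
    for length in range(2, N + 1):
        for i in range(1, N - length + 2):
            j = i + length - 1
            m[i][j] = INF
            for k in range(i, j):                      # split as (A_i..A_k)(A_{k+1}..A_j)
                cost = m[i][k] + m[k + 1][j] + p[i - 1] * p[k] * p[j]
                m[i][j] = min(m[i][j], cost)
    return m[1][N]

# For the ABCD example above, matrix_chain_order([3, 5, 9, 2, 8]) returns 168, achieved by (A(BC))D;
# the dynamic program runs in O(N^3) time and O(N^2) space.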
2.4 Egg Throw & Maximal Square sub-Matrix
a. There is a tower with n floors and you have 2 identical eggs. There exists a floor m such that any egg
thrown from a higher floor crashes, while every egg thrown from floor m, or lower, survives. Devise
an efficient algorithm for finding the least number of tries needed in the worst case and find its time
complexity.
Example 1. Suppose there are 15 floors and your algorithm performs binary search with the first egg. Notice that once the first egg breaks, the second egg must be thrown from every floor you are not sure about, from the bottom to the top, since otherwise you will not know m exactly. Hence, if the first egg breaks at floor 8, you will be forced to make 8 tries in the worst case: floor 8, and then floors 1-7 one after the other. If the first egg does not break at floor 8, you know that m ≥ 8, and you will not need more than 8 tries anyhow (over floors 8-15); so the worst-case number of trials of this algorithm is 8.
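For part (a), a short memoized recursion over the number of candidate floors; this is a sketch of one possible solution, not the only one:

from functools import lru_cache

def egg_drop_two_eggs(n):
    # f(floors) = worst-case number of throws needed with 2 eggs and `floors` candidate floors.
    @lru_cache(maxsize=None)
    def f(floors):
        if floors == 0:
            return 0
        best = floors                               # throwing floor-by-floor from the bottom
        for x in range(1, floors + 1):
            # throw from the x-th remaining floor:
            #   breaks   -> x-1 floors left with one egg, checked one by one (x throws in total)
            #   survives -> floors - x floors left with two eggs
            best = min(best, max(x, 1 + f(floors - x)))
        return best
    return f(n)

# egg_drop_two_eggs(15) == 5, improving on the worst case of 8 from the binary-search strategy in Example 1.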
b. Suggest an efficient algorithm that, given a matrix consisting of 0's and 1's, finds the size of the maximum square sub-matrix consisting only of 1's. What is the time complexity of your algorithm?
Note: a sub-matrix is defined as an adjacent (contiguous) subset of rows and columns.
Example 2. Observe the following 4 × 4 matrix A:
1 1 1 1
1 1 1 1
1 0 0 0
1 1 0 1
The biggest sub-matrices of this matrix are given in A[1, 2|1, 2], A[1, 2|2, 3], A[1, 2|3, 4] and are of size 2.
The sub-matrix A[1, 2, 4|1, 2, 4] consists of only 1’s, but it is not composed of an adjacent subset of rows
and columns, and is therefore not considered.
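A sketch of the usual O(RC) dynamic program for part (b), where side[i][j] is the side of the largest all-ones square whose bottom-right corner is cell (i, j):

def max_square_of_ones(M):
    R, C = len(M), len(M[0])
    side = [[0] * C for _ in range(R)]
    best = 0
    for i in range(R):
        for j in range(C):
            if M[i][j] == 1:
                if i == 0 or j == 0:
                    side[i][j] = 1
                else:
                    side[i][j] = 1 + min(side[i - 1][j], side[i][j - 1], side[i - 1][j - 1])
                best = max(best, side[i][j])
    return best

# For the matrix A of Example 2 this returns 2.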
2.5 Two-Traversals Maximal Score
Given a matrix of dimension R × C, where every entry contains a score, how do you collect a maximal overall score using two traversals?
The two traversals are defined as follows.
1. The two traversals have to start from top row, from the left and right corners, i.e., [1, 1] and [1, C]. The
first (in the left corner) has to reach bottom left corner, i.e., [R, 1]. The second traversal should reach
bottom right corner, i.e., [R, C].
A traversal gets all points of a particular cell through which it passes. If one traversal has already collected the points of a cell, then the other traversal gets no points if it goes through that cell again.
b. Suggest a dynamic programming based algorithm to solve the problem and analyze its complexity.
2.6 Longest Common Subsequence
Given a sequence X = (x_1, . . . , x_m), we say that a sequence Z = (z_1, . . . , z_k) is a subsequence of X if there exists a strictly increasing sequence of indices (i_1, . . . , i_k) such that for all j = 1, . . . , k we have x_{i_j} = z_j. For example, Z = (B, C, D, B) is a subsequence of X = (A, B, C, B, D, A, B).
Given two sequences X, Y we say that Z is a common subsequence of X and Y if Z is a subsequence of both
X and Y . In the longest-common-subsequence (LCS) problem we are given two sequences X = (x1 , . . . , xm )
and Y = (y1 , . . . , yn ), and we need to find the maximum length common subsequence of X and Y .
a. Propose a dynamic programming solution to find the LCS of X and Y . What are its time and space
complexities?
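A sketch of the classic O(mn) table for part (a); dp[i][j] is the LCS length of the prefixes X[:i] and Y[:j]:

def lcs_length(X, Y):
    m, n = len(X), len(Y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1      # extend a common subsequence by one symbol
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

# lcs_length("ABCBDAB", "BDCABA") == 4; time and space are both O(mn).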
Let us define the k-LCS of X and Y to be their LCS where it is allowed to change at most k elements in X to
any value.
c. Propose a dynamic programming solution to find the k-LCS of any X and Y . What are its time and space
complexities?
2.7 All Pairs Shortest Path
In this question we will consider the all pairs shortest path (APSP) problem. We start with the deterministic setting on a graph.
Consider a directed weighted graph G = (V, E) with N nodes, and no negative cycles. We would like to find
the shortest distance D(v, u) from any node v ∈ V to any other node u ∈ V .
a. Suggest a naive solution that uses the single-source (or single destination) Bellman-Ford algorithm in an
inner loop. What is the complexity of this algorithm?
b. Assume that D(v, u) for the graph G is given. Now, we construct a new graph G′ by adding a new node v ′
to G, and corresponding edges between v ′ and the nodes in G. Let D′ (v, u) denote the shortest distances
on the graph G′ .
(i) Suggest a method for computing D′(v′, u) for some u ∈ V in the new graph G′, in O(|V|) computations.
(ii) Suggest a method for computing D′(u, v′) for some u ∈ V in the new graph G′, in O(|V|) computations.
(iii) Suggest a method for computing D′(v, u) ∀u, v ∈ V in the new graph G′, in O(|V|²) computations.
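One way to realize steps (i)-(iii) in code, under the assumption that the edges incident to the new node are given as a weight dictionary (missing entries meaning "no edge"); the node label "v_new" and the input format are illustrative:

def extend_apsp(D, V, w_new):
    # D[(u, v)]: shortest distances in G; w_new[(x, "v_new")] / w_new[("v_new", x)]: weights of the new edges.
    INF = float("inf")
    w = lambda x, y: w_new.get((x, y), INF)
    # (i) distances from the new node: the first step goes to some x in V, then a shortest path inside G.
    d_from = {u: min((w("v_new", x) + D[(x, u)] for x in V), default=INF) for u in V}
    # (ii) distances to the new node, symmetrically.
    d_to = {u: min((D[(u, x)] + w(x, "v_new") for x in V), default=INF) for u in V}
    # (iii) a shortest u -> v path in G' either avoids the new node or passes through it once.
    D_new = {(u, v): min(D[(u, v)], d_to[u] + d_from[v]) for u in V for v in V}
    return D_new, d_from, d_to

Steps (i) and (ii) cost O(|V|) per target, and step (iii) costs O(|V|²) overall, as required.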
d. For a fixed goal state g ∈ S, formulate the shortest distance problem above as an MDP. That is, define a
cost function c(s, a) and an optimization objective such that the optimal value function corresponds to
V (u, g).
f. Prove or give a counterexample for the following conjecture:
V(u, v) = min_{a∈A} ( c(u, a) + Σ_{s′∈S} P(s′|u, a) V(s′, v) )   ∀u ≠ v.
2.8 Machine Maintenance
A workshop manager has just bought an expensive new machine. On each day, he has two options: maintain the
machine at cost M or not maintain it. If he chooses not to maintain it, there is a risk that the machine will break
down with probability pj and cost B to fix, where j is the number of consecutive days without maintenance
since the last breakdown or maintenance (e.g. in the beginning, the probability of breakdown is p0 . After one
day with no maintenance, the probability of breakdown is p1 , after two successive days of no maintenance, the
probability of breakdown is p2 , etc). Assume pj is monotonically non-decreasing in j, and that there exists an
integer m such that pm B > M .
a. Explain, in words, the meaning behind the assumption that there exists an integer m such that pm B > M ,
and deduce that the smallest number of states that is sufficient for a full characterization of the problem is
m + 1.
b. Formulate this as an infinite horizon discounted cost MDP, with a discount factor γ ∈ (0, 1). Describe the
states, actions, costs, and transition probabilities (the discount factor is arbitrary).
c. Write the corresponding Bellman equation for the value function J(i), where i is the state.
e. Prove that the cost value is monotonically non-decreasing in the state, i.e., i > j =⇒ J(i) ≥ J(j).
Hint: use (backward) induction and the result from (d).
f. Explain why the optimal policy is a threshold policy (defined below) and find the threshold i∗ .
Threshold policy: for an MDP with actions {a1 , a2 } and states {s1 , s2 , ..., sN }, a threshold policy with
threshold i∗ is defined as follows: choose action a1 for all si such that i < i∗ , and a2 otherwise.
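A hedged value-iteration sketch for the formulation asked for in parts (b)-(c), under the assumption (consistent with part (a)) that the state is the number of consecutive unmaintained days capped at m, and that both maintenance and a breakdown reset the state to 0:

def maintenance_value_iteration(p, M, B, gamma, iters=1000):
    # p[i] = breakdown probability after i consecutive days without maintenance, i = 0..m.
    m = len(p) - 1
    J = [0.0] * (m + 1)
    for _ in range(iters):
        J_new = []
        for i in range(m + 1):
            maintain = M + gamma * J[0]
            skip = p[i] * (B + gamma * J[0]) + (1 - p[i]) * gamma * J[min(i + 1, m)]
            J_new.append(min(maintain, skip))
        J = J_new
    return J   # J[i] approximates the optimal discounted cost from state i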
2.9 Black-Red Card Game
Dima plays a game that starts with a deck with B "black" cards and R "red" cards. Dima knows B and R. At
each time-step Dima draws a random card and decides between the following two options:
(a1 ) Without looking at the card, "predict" that it is black in which case Dima wins the game if the prediction
is correct and loses if the prediction is incorrect.
(a2 ) "Discard" the card, after looking at its color, and continue the game with one card less.
Dima wins the game if at any time-step the deck has only black cards, and loses the game if the deck has only
red cards. Dima wants to find a policy that maximizes the probability of winning the game.
a. What is the maximal number of time steps the game can last?
b. We wish to formulate the problem as a finite horizon undiscounted (γ = 1) return problem, with S = {(b, r) | 0 ≤ b ≤ B, 0 ≤ r ≤ R} ∪ {T} as the state space, where T is the terminal state. Write the action space at each state, i.e., A(s) ∀s ∈ S; write the transition probabilities p(s′|s, a), ∀s′, s ∈ S, ∀a ∈ A; and suggest a suitable (and simple) reward function.
Clarification: In the case of zero red/black cards, we first move to the corresponding state with zero
red/black cards and then terminate in the next step.
d. Suggest a dynamic programming algorithm to find the exact solution of the MDP, i.e. the exact value
function J(s), ∀s ∈ S. What is its complexity?
e. Use induction to show that the optimal probability of winning when starting with b black cards and r red cards is b/(b + r).
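A sketch of the dynamic program of part (d): J(b, r) is the optimal winning probability with b black and r red cards remaining, computed over the O(BR) states:

from functools import lru_cache

def win_probability(B, R):
    @lru_cache(maxsize=None)
    def J(b, r):
        if r == 0:
            return 1.0                       # only black cards remain: Dima wins
        if b == 0:
            return 0.0                       # only red cards remain: Dima loses
        predict = b / (b + r)                # a1: predict "black" for the next (uniformly random) card
        discard = (b * J(b - 1, r) + r * J(b, r - 1)) / (b + r)   # a2: look, discard, continue
        return max(predict, discard)
    return J(B, R)

# Consistent with part (e): win_probability(B, R) evaluates to B / (B + R), e.g. win_probability(3, 2) == 0.6.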
2.10 Blackmailer
Consider a situation involving a blackmailer and a victim. At each stage, the blackmailer can:
(2) Demand a payment of $1, in which case the victim will comply with the demand (with probability p, where
0 < p < 1, independently of the past history), or they will refuse to pay and denounce the blackmailer to
the police (with probability 1 − p).
Once denounced to the police, the blackmailer loses all of the accumulated earnings and cannot blackmail again.
Also, once the blackmailer reaches accumulated earnings of $n he immediately retires. The blackmailer’s goal is
to maximize the expected (undiscounted) amount of money he earns.
a. Formulate the problem as an MDP with the states i = 0, 1, ..., n and a termination state. Specifically,
write the states, actions at each state, transition probabilities, and rewards. Note: The rewards must be
non-negative.
b. Write the corresponding Bellman equation (consider all states in 0, 1, ..., n).
c. Let J ∗ be the optimal value function of the MDP. Show that J ∗ (i) is monotonically increasing as a
function of i.
2.11 Engineer’s Job Choice Problem
An ambitious engineer, currently employed at some company at a salary c, seeks employment elsewhere. She
receives a job offer at each time step, which she may accept or reject. The offered salary takes one of n possible
values w1 , ..., wn , with given probabilities ξ1 , ..., ξn , independent of past offers.
If she accepts the offer, she keeps the job for the rest of her life. If she rejects, she continues at her current job at
salary c and continues to receive offers in future steps.
a. Formulate the engineer’s problem as a discounted MDP. Write the states, actions, transition probabilities,
and rewards. Assume a discount factor 0 < γ < 1.
Now, consider a variant of the problem where the worker can be fired from her new job at salary w_i at each time step; this happens with probability p_i. Once fired, she returns to her former company and resumes her work at salary c, and she continues to receive offers in future steps, which she may accept or reject. Also, assume she cannot be fired from her former job at salary c.
e. Assume pi = p ∀i. Show that the optimal policy is a threshold policy (i.e., only accept the offer if the
salary is greater than some threshold). Hint: assume, without loss of generality, that w1 ≤ w2 ≤ ... ≤ wn .
f. Now p_i is not independent of i. Provide a sufficient condition under which the optimal policy from the previous clause still holds.
3 MDPs
3.1 MDP with an Absorbing State
Consider an MDP, defined by (S, A, P, r). We consider the following total-return criterion over an infinite time horizon:
J^π(s) = lim_{T→∞} E^{π,s} [ Σ_{t=0}^{T} r(s_t, a_t) ],
where E^{π,s} is, as usual, the expected value given a control policy π and initial state s_0 = s. The policy π is taken to be stationary and deterministic.
We assume here that there exists a special state s* ∈ S and a number β > 0 such that, for any state-action pair (s, a),
P(s*|s, a) ≥ β.
Although the total return criterion is not well defined in general, the presence of an absorbing state makes it "well behaved". There are two approaches to show this: either a direct one, or through an equivalent discounted problem. We will consider both.
a. Prove that there exists a constant γ < 1 such that, for any policy π and t ≥ 0,
P_π(s_t ≠ s* | s_0 = s) ≤ γ^t.
b. Prove that the total return J π (s) is well defined for every policy π (i.e., the limit converges) and finite.
Hint: Find a bound on Eπ,s (r(st , at )).
c. Show that J π (s) can be written as an equivalent discounted return function. Follow the following steps:
(1) Let P̂ = (P̂(s′|s, a)) be a modified state transition matrix, defined as:
P̂(s′|s, a) = P(s′|s, a) / (1 − β)           if s′ ≠ s*,
P̂(s′|s, a) = (P(s′|s, a) − β) / (1 − β)     if s′ = s*.
Show that P̂ is indeed a state transition matrix (i.e., satisfies the relevant requirements).
(2) Let P̂π,s denote the probability distribution induced on the process by P̂ , policy π, and initial state
s0 = s. Show that, for any s′ ̸= s∗ ,
P̂_{π,s}(s_t = s′) = (1 / (1 − β)^t) · P_{π,s}(s_t = s′).
(3) Show that J π (s) can be equivalently expressed as a discounted return function for the modified
MDP (S, A, P̂ , r). What is the discount factor?
(4) Write down the Bellman optimality equation for the equivalent discounted problem, and express it
in terms of the original problem parameters. Compare to the Bellman operator defined in the next
section.
d. For a stationary and deterministic policy π, consider the (undiscounted) fixed-policy Bellman operator
T π : RS → RS , defined by
(T^π V)(s) = r(s, π(s)) + Σ_{s′∈S} P(s′|s, π(s)) V(s′).
Show that T π is a γ-contraction (for some γ < 1) over the space {V ∈ RS : V (s∗ ) = 0}. Specify γ, and
explain why we need to consider only functions with V (s∗ ) = 0.
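A small numerical sketch of step c(1): construct a kernel P with P(s*|s, a) ≥ β, build P̂ as defined above, and check that it is a valid transition matrix (the random construction is only for illustration):

import numpy as np

def modified_kernel(P, s_star, beta):
    # Divide all entries by (1 - beta) and shift the s* column down by beta / (1 - beta), as in c(1).
    P_hat = P / (1.0 - beta)
    P_hat[:, :, s_star] = (P[:, :, s_star] - beta) / (1.0 - beta)
    return P_hat

rng = np.random.default_rng(0)
S, A, beta, s_star = 4, 2, 0.2, 0
Q = rng.random((S, A, S))
Q /= Q.sum(axis=2, keepdims=True)            # arbitrary transition kernel
P = (1 - beta) * Q
P[:, :, s_star] += beta                      # guarantees P(s*|s, a) >= beta for every (s, a)
P_hat = modified_kernel(P, s_star, beta)
assert np.allclose(P_hat.sum(axis=2), 1.0)   # rows sum to one
assert (P_hat >= -1e-12).all()               # entries are non-negative because P(s*|s, a) >= beta
# For this particular construction P_hat coincides with Q.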
3.2 Worst Case Reward
Consider a finite horizon MDP with the following twist. For a stationary stochastic policy π, instead of looking at the expected cumulative reward, we look at the worst case reward until a given time T:
J^{π,T}(s) = min_{(s_0,a_0,s_1,a_1,...,s_T,a_T) : P_π(s_0,a_0,...,s_T,a_T | s_0 = s) > 0}  Σ_{t=0}^{T} r(s_t, a_t).
That is, the worst possible reward given that the initial state is s. Similarly, we define the best case reward as
J̄^{π,T}(s) = max_{(s_0,a_0,s_1,a_1,...,s_T,a_T) : P_π(s_0,a_0,...,s_T,a_T | s_0 = s) > 0}  Σ_{t=0}^{T} r(s_t, a_t).
a. Write a backward recursion (dynamic programming on T, T − 1, . . .) formula for J π,T .
b. Suppose that we want to compute J^{*,T}(s) = sup_π J^{π,T}(s). Suggest an algorithm to do so.
c. Is there an optimal deterministic strategy that attains J^{*,T}(s)? That is, does there exist a deterministic policy π such that J^{*,T}(s) = J^{π,T}(s) for all s? Prove or give a counterexample.
d. We now consider the infinite horizon case. For some γ < 1 we define
J^π(s) = lim_{T→∞} min_{(s_0,a_0,...,s_T,a_T) : P_π(s_0,a_0,...,s_T,a_T | s_0 = s) > 0}  Σ_{t=0}^{T} γ^t r(s_t, a_t).
3.3 Solving MDPs through Linear Programming
Recall the following standard Linear Programming (LP) formulation and its dual: min_x b^⊤x, subject to Ax ≥ c.
Consider an MDP, defined by (S, A, P, r, ρ, γ), with the following discounted-return criterion over an infinite time horizon:
J^π = E^{π,ρ} [ Σ_{t=0}^{∞} γ^t r(s_t, a_t) ],
where E^{π,ρ} is, as usual, the expected value given a control policy π and initial state distribution s_0 ∼ ρ. Our goal is to find the optimal value function:
V* = max_π J^π.     (1)
Recall that for a fixed policy π, the value function can be found by solving the following set of linear equations:
v(s) = Σ_{a∈A} π(a|s) [ r(s, a) + γ Σ_{s′∈S} P(s′|s, a) v(s′) ],   ∀s.
a. Write the primal LP for the optimization problem (1), where the primal variables are the state values.
b. Write the dual LP for this problem (where the dual variables are the discounted state-action frequencies).
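A sketch of the primal LP of part (a) solved with a generic LP solver; the array shapes and the use of scipy.optimize.linprog are assumptions made for illustration, not part of the exam:

import numpy as np
from scipy.optimize import linprog

def solve_mdp_primal_lp(P, r, rho, gamma):
    # P: (S, A, S) transition tensor, r: (S, A) rewards, rho: (S,) initial distribution.
    # Primal LP: minimize rho^T v subject to v(s) >= r(s, a) + gamma * sum_s' P(s'|s, a) v(s') for all (s, a).
    S, A = r.shape
    A_ub, b_ub = [], []
    for a in range(A):
        A_ub.append(gamma * P[:, a, :] - np.eye(S))   # (gamma * P_a - I) v <= -r_a
        b_ub.append(-r[:, a])
    res = linprog(c=rho, A_ub=np.vstack(A_ub), b_ub=np.concatenate(b_ub),
                  bounds=[(None, None)] * S, method="highs")
    return res.x                                      # the optimal value function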
Suppose we want to solve the MDP, but we want our solution policy to be close to a given stationary policy µ(a|s) (note that µ is stochastic in general). Specifically, fix ϵ > 0 and impose the constraint ∀s ∈ S: max_{a∈A} |π(a|s) − µ(a|s)| ≤ ϵ.
c. Show (by an example or a convincing argument) that the optimal policy in this case is not necessarily
deterministic, an example with more than one optimal policy will not be accepted. (* Bonus: show the
same applies even if µ is deterministic).
d. Rewrite the dual LP to fit the suggested constraint, your formulation should not contain non-linear
functions (like absolute values).
e. How many variables and how many constraints will there be in the primal LP now?
g. Show this policy iteration algorithm converges in a finite number of steps to the optimal policy.
3.4 Simple MDP – Analysis
Consider the n-state MDP given in Figure 1. In each state there are two actions: Left, which leads to the starting state 1, and Right, which takes us one step closer to the rightmost state n. In state n, any action leads to the leftmost state s = 1 and a reward of 1 is given (r(s = n, a) = 1). The reward for every other state and action is 0. The discount factor is γ, so a delay in reaching the goal inflicts an exponential reduction in the total reward.
We apply the uniform exploration policy π: π(a = right|s) = π(a = left|s) = 0.5.
d. Run value iteration starting from v^0(s) = 0 to find v^1(s), v^2(s), v^3(s). Write v^∞(s) using V^π(s) (see the numerical sketch after this question).
e. Starting from state s = 1, how many steps in expectation will it take to reach state s = n?
f. Starting from s = 1, under a policy of your choice, what is the minimal number of steps it will take to
visit every state-action pair at least once?
g. Let’s say we employ an optimistic exploration scheme, where every previously unvisited state-action pair
is assumed to have reward 1. Starting from s = 1, how many steps will it take to visit every state-action
pair at least once?
h. Find V π (s).
Now we want to employ a function approximation scheme to evaluate the value function.
i. Find a feature vector that will guarantee zero approximation error.
Assume we are using the constant feature vector: ϕ(s) = 1, and that n = 3.
k. Find Φw* - the fixed point of Π_π T^π, where T^π is the Bellman operator corresponding to π.
l. Find Π_π V^π.
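The numerical sketch referred to in item (d): synchronous application of the fixed-policy Bellman operator for the uniform policy on this chain MDP (the dynamics are taken from the verbal description above, since the figure is not reproduced here):

def chain_policy_evaluation(n, gamma, sweeps=3):
    # v[s] for s = 1..n; from s < n: Left -> state 1, Right -> state s + 1, reward 0;
    # from s = n: any action -> state 1, reward 1. The policy picks Left/Right with probability 0.5 each.
    v = {s: 0.0 for s in range(1, n + 1)}
    history = [dict(v)]
    for _ in range(sweeps):
        v_new = {}
        for s in range(1, n + 1):
            if s == n:
                v_new[s] = 1.0 + gamma * v[1]
            else:
                v_new[s] = 0.5 * gamma * v[1] + 0.5 * gamma * v[s + 1]
        v = v_new
        history.append(dict(v))
    return history          # history[k] is v^k of item (d); iterating further approaches V^pi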
3.5 Solving MDPs using sampling
The Phased Value Iteration algorithm solves discrete, finite-horizon MDPs, while not requiring access to their exact model. Instead, it queries a simulator G(M), to which state-action pairs (s, a) are given as input, and a next state s′ and reward r are returned as output.
The algorithm is presented below.
P̂_t(s′|s, a) = (# of times (s, a) → s′) / m
4: set
V̂_t(s) = max_{a∈A} [ r(s, a) + E_{s′∼P̂_t(·|s,a)} V̂_{t+1}(s′) ]
π̂_t(s) = arg max_{a∈A} [ r(s, a) + E_{s′∼P̂_t(·|s,a)} V̂_{t+1}(s′) ]
5: end for
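A sketch of the phased value iteration loop above, assuming a finite-horizon undiscounted model with known rewards and a simulator that returns next-state samples (the simulator interface is an assumption):

import numpy as np

def phased_value_iteration(simulator, S, A, r, T, m):
    # simulator(s, a, size) returns `size` i.i.d. next-state indices drawn from P(.|s, a).
    V = np.zeros((T + 1, S))                 # V[T] is the terminal value (zero)
    pi = np.zeros((T, S), dtype=int)
    for t in range(T - 1, -1, -1):           # backwards over the T phases
        Q = np.zeros((S, A))
        for s in range(S):
            for a in range(A):
                samples = simulator(s, a, m)                    # m simulator calls per (s, a) and phase
                P_hat = np.bincount(samples, minlength=S) / m   # empirical P_hat_t(.|s, a)
                Q[s, a] = r[s, a] + P_hat @ V[t + 1]
        V[t] = Q.max(axis=1)                 # V_hat_t
        pi[t] = Q.argmax(axis=1)             # pi_hat_t
    return pi, V
# The loop above calls the simulator |S| * |A| * T * m times.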
a. How many calls to the simulator are made during the run of Alg. (1)?
meaning the two expectations, using the original P and the estimated one, P̂, are relatively close.
c. Next, show that
V_{π̂,t−1}(s) − V̂_{t−1}(s) ≤ max_{s′} ( V_{π̂,t}(s′) − V̂_t(s′) ) + ϵ,
d. Using the two previous results (from parts (b) and (c)), show that
e. Assuming from now on the MDP is stationary, write ϵ as a function of ϵ̃ and the different terms in the
question, where ϵ̃ is defined so that
Vπ (s) ≥ V ∗ (s) − ϵ̃.
f. We now turn to justifying the assumption in Eq. (2). Show that the probability that the assumption in Eq. (2) fails is smaller than 2·|S|·|A|·T·exp(−ϵ̃²m / (2T²)).
g. Prove the following theorem for the choice m = O( (T²/ϵ̃²) log(|S||A|T/δ) ).
Theorem 1. The phased value iteration algorithm calls the simulator G(M)
O( (|S||A|T³/ϵ̃²) log(|S||A|T/δ) )
times and, with probability greater than 1 − δ, returns a policy π such that for every state s, V_π(s) ≥ V*(s) − ϵ̃.
3.6 Bellman equation for Second Moment
Consider an infinite-horizon MDP (S, A, R, P, γ) with a given, fixed policy π. We define M^π(s) as the second moment of the discounted return when starting from state s and following policy π,
M^π(s) = E^π [ ( Σ_{t=0}^{∞} γ^t r(s_t, a_t) )² | s_0 = s ].
b. Given that V π is fixed, is the operator you received a contraction? If the answer is positive, prove it. If
not, convincingly show why or find a counter example.
c. Given that V π is fixed, suggest an iterative algorithm for learning M π and prove its convergence. Assume
there is access to a simulator and that S is finite.
3.7 Investor Planning Problem
An investor has a call option to buy one share of a stock at a fixed price of p and has T days to exercise this
option. He can exercise the option each day. When the stock price is x, he gets (x − p). The investor may decide
not to exercise this option at all. Assume that the price of the stock Xt on day t is
X_t = X_0 + Σ_{k=1}^{t} W_k
a. Define the corresponding action set, model dynamics, reward function and objective function for finite
MDPs, as learned in class.
V_t(x, 1) = 0,
V_t(x, 0) = max{ x − p, E[V_{t+1}(x + W_{t+1}, 0)] }.
1. ∀t : Vt (x, 0) is increasing in x.
2. ∀t : (Vt (x, 0) − x) is decreasing in x.
3. ∀x : (Vt (x, 0) − x) is decreasing in t.
d. Let X*_t = {x : x − p ≥ E[V_{t+1}(x + W, 0)]} and x*_t = min X*_t. Show that {x*_t}_{t=1}^{T} is decreasing. Hint: use an inclusion relation.
e. Use the previous result to describe (in words and/or math) the type of optimal policy in our problem.
3.8 Bellman Operators and Linearly Solvable MDPs
In this question we will focus on a specific class of MDPs, M = {S, A, P, c, γ}, where the action space is continuous, u ∈ R^S. The 'passive' dynamics Markov chain is p_0(s′|s) and its corresponding transition matrix is p_0. Next, the dynamics of M is given by
where the notation u_s means the component of u that corresponds to the state s.
Let β > 0. The cost of the model is
and remember that DKL (q || p) ≥ 0 with equality iff p(·) = q(·) point-wise. The DKL term represents a cost
for controlling the system, i.e, when changing the ‘passive’ dynamics, p0 , by using u > 0.
The Bellman equation has the following form:
v(s) = min_{u∈U(s)} ( c(s, u) + γ Σ_{s′} p(s′|s, u) v(s′) ).     (3)
b. What is the constraint u should satisfy for p(·|s, u) to be a valid probability function? (No need to solve the constraint.)
c. Plug p(· | s, u) into the Bellman equation (3), and define the corresponding optimal Bellman operator.
Prove it is a contraction mapping in the max-norm and find its contraction coefficient.
From now on assume β = 1. Since the found Bellman operator is a contraction mapping, we can calculate the
optimal value by repeatedly applying it. However, since the action space is continuous, applying it is expected to
be problematic without solving analytically the minimization problem.
d. Solve the minimization problem defined in (c) with the constraint defined in (b). Find u∗ which solves the
constraint minimization problem (Hint: solve using Lagrange multipliers, see end of formula sheet).
e. Set γ = 1. Plug u* into the Bellman equation (3), rewrite the Bellman equation using the transformation z(s) = exp(−v(s)), and express it in the form z(s) = g(s) Σ_{s′} p_0(s′|s) z(s′). What is g(s)?
where g is defined in section (e), and p_0 is the 'passive' dynamics. Assume T is an α-contraction mapping with a corresponding fixed point z*. In this section, assume access to a simulator which is given an action, u, and returns a 3-tuple (s_n, g(s_n), s_{n+1}). You can use all the assumptions of the Q-learning algorithm.
Answer the following
1. Suggest a Stochastic Approximation algorithm that converges to z ∗ and prove its convergence. Make
sure to be clear about what is the policy with which your algorithm interacts with the environment.
2. What is the relation between z ∗ and the solution of the Bellman equation in (3), v ∗ ?
3.9 MDPs with Non-Linear Objectives
Consider an MDP M with finite state space S and finite actions space A, transitions P (s′ |s, a), discount factor
γ ∈ (0, 1), a fixed initial state sinit , and two reward functions: r1 (s) and r2 (s). In this question we will consider
objectives that are a function of both r1 and r2 .
For a Markov policy π, we denote the discounted returns J_1^π and J_2^π as:
J_1^π = E^π [ Σ_{t=0}^{∞} γ^t r_1(s_t) | s_0 = s_init ],
J_2^π = E^π [ Σ_{t=0}^{∞} γ^t r_2(s_t) | s_0 = s_init ].
Note that the initial state is fixed, and that J1π , J2π denote scalar returns and not value functions.
Let f(x, y) be some function of two variables. For some policy π we denote J^π = f(J_1^π, J_2^π). We wish to find a policy π* that maximizes J^π:
J* = max_π J^π,    π* ∈ argmax_π J^π.
a. For f(x, y) = αx + βy, propose a standard MDP with a single reward r̂ such that its optimal policy is π*. Explain.
For the rest of this question, we consider the function f(x, y) = x/y. Furthermore, we assume the following bounds on the rewards: 0 < r_min ≤ r_1(s) < r_2(s) ≤ r_max, for all s ∈ S.
b. Can the standard MDP solution approaches (value iteration, policy iteration) be used to find π ∗ in this
case? Explain (no need to prove formally).
For some ρ ∈ [0, 1], consider a standard MDP M_ρ with the same S, A, P, γ as M and reward r̂(s) = r_1(s) − ρ r_2(s). Denote the discounted reward for a policy π in M_ρ as J_ρ^π = E^π [ Σ_{t=0}^{∞} γ^t r̂(s_t) | s_0 = s_init ].
g. Let Jρ∗ denote the optimal value in Mρ . Show that Jρ∗ is monotonically decreasing in ρ.
Hint: start by showing monotonicity for a fixed policy.
h. Based on (d-g), propose an approach for finding the optimal policy π ∗ . Technically, you can assume that
Jρ∗ is continuous in ρ, and you can invoke any standard MDP solver in your solution.
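A sketch of the approach suggested by parts (g)-(h) for f(x, y) = x/y: since J*_ρ is continuous and monotonically decreasing in ρ, binary-search for the root ρ* of J*_ρ = 0 with any standard MDP solver. Here solve_mdp is a hypothetical solver returning the optimal return from s_init together with an optimal policy of M_ρ:

def solve_ratio_objective(solve_mdp, r1, r2, tol=1e-8):
    # r1, r2: per-state reward lists with 0 < r1(s) < r2(s), so J*_0 > 0 and J*_1 < 0.
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        rho = (lo + hi) / 2
        value, _ = solve_mdp([x - rho * y for x, y in zip(r1, r2)])   # solve M_rho
        if value > 0:
            lo = rho          # J*_rho is still positive: the root is to the right
        else:
            hi = rho
    return solve_mdp([x - hi * y for x, y in zip(r1, r2)])[1]         # policy of M_rho at rho close to rho*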
3.10 Belief MDP
Remark: while this question concerns a bandit, we will solve it optimally using MDP methods.
Consider a two-armed bandit, with i.i.d. reward distributions Bernoulli(p1 ) and Bernoulli(p2 ). The success
probabilities p1 and p2 are known in advance. However, we do not know which of the two arms, denoted A1 and
A2 , is associated with p1 and which is with p2 . That is - only the permutation of the arms is not known.
Let e ∈ {0, 1} be a random variable that represents the arm permutation. That is, e = 1 means that arm A1 is
assigned with p1 and A2 with p2 , and e = 0 means that arm A1 is assigned with p2 and A2 with p1 . We denote
by θ0 = P (e = 1) the permutation probability, and we assume that θ0 is known (though e is not observed).
The permutation is drawn before the agent starts playing, and does not change afterwards. The agent pulls arms
for T time steps, and we denote by at ∈ {A1 , A2 } the action taken at time t, and by Rt the resulting (stochastic,
i.i.d.) reward observed. The agent's goal is to maximize the T-horizon expected cumulative reward:
max_π E^{e,π} [ Σ_{t=0}^{T−1} R_t ].     (4)
Note that the expectation is taken with respect to the random permutation e, the random rewards, and the
(possibly) random action selection policy π.
We denote by ht = {a0 , R0 , . . . , at−1 , Rt−1 } the history up to time t. The general solution to Eq. (4) is a history
dependent policy π(ht ). In the rest of this question, however, we will show that we can construct an MDP with a
special state θ for solving Eq. (4).
d. We denote by θt (ht ) = P (e = 1|ht ) the belief about the arm permutation at time t, given the history of
rewards and actions until that time. Assume that at−1 = A1 , and a reward Rt−1 = 1 was observed. Show
that θt (ht ) can be written as a function of θt−1 (ht−1 ), p1 , p2 , as follows:
P(e = 1|h_t) = p_1 θ_{t−1}(h_{t−1}) / ( p_1 θ_{t−1}(h_{t−1}) + p_2 (1 − θ_{t−1}(h_{t−1})) ).
Hint: use Bayes rule. You can assume that given ht , the action at and e are independent (the policy only
depends on the history and not on any other variable).
e. Using θt (ht ), write down the reward probability given the history P (Rt = 1|ht ), for the case that at = A1 .
Write down an expression for the expected reward R̄(θt (ht ), at ) = E[R(at )|ht ], for the case at = A1 .
f. Given θt (ht ), and assuming that at = A1 , what are the possible values of θt+1 ? What are their probabili-
ties?
g. Define the belief MDP, where the state at time t is θt , the action space is {A1 , A2 }, the reward is R̄(θt , at ),
as calculated in (e), and where P (θt+1 |θt , at ) is as calculated in (f). Show that the optimal policy in the
belief MDP is optimal for the original problem Eq. (4).
h. Is the state space of the belief MDP discrete or continuous? Given θ0 , how many different states θT are
possible at time T ? Suggest an algorithm for solving the belief MDP. How does the algorithm complexity
depend on T ?
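A small sketch of the belief machinery from parts (d) and (f)-(h): the Bayes update of θ and an enumeration of the belief states reachable after T steps (the rounding is only used to merge numerically identical beliefs):

def belief_update(theta, arm, reward, p1, p2):
    # Posterior P(e = 1 | h_t) after pulling `arm` in {"A1", "A2"} and observing `reward` in {0, 1}.
    if arm == "A1":
        like1, like0 = (p1, p2) if reward == 1 else (1 - p1, 1 - p2)   # likelihoods under e = 1 / e = 0
    else:
        like1, like0 = (p2, p1) if reward == 1 else (1 - p2, 1 - p1)
    return like1 * theta / (like1 * theta + like0 * (1 - theta))

def reachable_beliefs(theta0, p1, p2, T):
    # The set of possible values of theta_T, relevant to part (h).
    beliefs = {round(theta0, 12)}
    for _ in range(T):
        beliefs = {round(belief_update(th, arm, rew, p1, p2), 12)
                   for th in beliefs for arm in ("A1", "A2") for rew in (0, 1)}
    return beliefs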
3.11 Exploration Conscious MDPs
In class, we have treated ϵ-greedy exploration as a means to find an optimal deterministic policy. However, the
stochastic exploration can lead the agent to bad rewards during learning. In this question, we assume that the
agent wants to obtain high rewards during learning too, and change the usual definition of optimality such that it
will take the exploration into account.
Consider a discounted MDP M = (S, A, r, P, γ) defined as usual. Assume that for any policy π and from any state s, the agent acts with probability ϵ according to a fixed, pre-determined policy π_0, and with probability 1 − ϵ according to the policy π; i.e., the effective policy is
µ_{π,ϵ,π_0}(a|s) = (1 − ϵ) π(a|s) + ϵ π_0(a|s).
For example, when setting π_0 to be the uniform policy we obtain ϵ-greedy exploration.
The value of the policy µ_{π,ϵ,π_0}(a|s) from a state s on the MDP M is defined by
V_M^{µ_{π,ϵ,π_0}}(s) = E [ Σ_{t=0}^{∞} γ^t r(s_t, a_t) | s_0 = s, µ_{π,ϵ,π_0}, P ],
in words, the expected value of the policy µπ,ϵ,π0 when the dynamics is P and the reward is r.
We define the optimal exploration-conscious policy π*_{ϵ,π_0} as the policy that maximizes this value:
π*_{ϵ,π_0} ∈ arg max_π V_M^{µ_{π,ϵ,π_0}}(s).     (5)
At first sight, the optimization in Eq. (5) might seem hard. However, as you will show in the following, it is
equivalent to solving a different MDP. Define an MDP Mϵ,π0 = (S, A, rϵ,π0 , Pϵ,π0 , γ), where:
P_{ϵ,π_0}(s′|s, a) = (1 − ϵ) P(s′|s, a) + ϵ P^{π_0}(s′|s),   and   r_{ϵ,π_0}(s, a) = (1 − ϵ) r(s, a) + ϵ r^{π_0}(s),
where P^{π_0}(s′|s) = Σ_a π_0(a|s) P(s′|s, a) and r^{π_0}(s) = Σ_a π_0(a|s) r(s, a). Furthermore, denote the (standard, non-exploration-conscious) value of a policy π on the MDP M_{ϵ,π_0} as V^π_{M_{ϵ,π_0}}.
b. Let T^π_{M_{ϵ,π_0}} and T_{M_{ϵ,π_0}} denote the fixed-policy (for some policy π) and optimal Bellman operators for the MDP M_{ϵ,π_0}. Let T^π_M and T_M denote the fixed-policy (for some policy π) and optimal Bellman operators for the MDP M. Show that
T_{M_{ϵ,π_0}} = (1 − ϵ) T_M + ϵ T^{π_0}_M   and   T^π_{M_{ϵ,π_0}} = (1 − ϵ) T^π_M + ϵ T^{π_0}_M = T^{µ_{π,ϵ,π_0}}_M.
c. Based on section (b), prove that V^π_{M_{ϵ,π_0}} = V^{µ_{π,ϵ,π_0}}_M for any policy π.
d. Based on (b) and (c), prove that π*_{ϵ,π_0} = π*_{M_{ϵ,π_0}}, where π*_{ϵ,π_0} is defined in (5) and π*_{M_{ϵ,π_0}} is the optimal policy of the MDP M_{ϵ,π_0}.
e. Assume we do not have access to P and r; however, we can interact with the environment. Suggest a Q-learning algorithm from which we can obtain µ*_{ϵ,π_0} of (5) in the limit (you can assume all standard assumptions for the convergence of Q-learning hold). Hint: define the optimal Bellman operator for the Q-function of the MDP M_{ϵ,π_0}, denoted by T_{Q,M_{ϵ,π_0}}. Then find an update rule that in expectation is equivalent to the application of T_{Q,M_{ϵ,π_0}}.
3.12 Average Policy Improvement Theorem
Consider a discounted MDP with states s ∈ S, actions a ∈ A, state dependent reward r(s), transitions P (s′ |s, a),
initial state distribution µ(s), and discount factor γ.
We define the usual value and Q functions for a Markov policy π:
V^π(s) = r(s) + γ Σ_{s′} P(s′|s, π) V^π(s′),
Q^π(s, a) = r(s) + γ Σ_{s′} P(s′|s, a) V^π(s′),
and the discounted reward performance criterion (averaged over initial states):
J^π = Σ_s µ(s) V^π(s).
Consider two Markov policies π and π′. In this question we will characterize the performance gap J^{π′} − J^π.
Define the advantage of a state s as A^{π′}_π(s) = Σ_a π′(a|s) ( Q^π(s, a) − V^π(s) ).
We define d^π(s) = Σ_{t=0}^{∞} γ^t P(s_t = s | π, s_0 ∼ µ) to be the discounted state visitation frequencies. In the next 3 clauses we will prove the following identity:
J^{π′} − J^π = Σ_s d^{π′}(s) A^{π′}_π(s).     (6)
We will next show that Eq. (6) is a generalization of the policy gradient theorem. Consider a policy parameterized
by θ (you can assume that θ is a scalar, and that πθ is differentiable). Let us define J(θ) = J πθ .
e. Use Eq. (6) to calculate ∂J(θ)/∂θ explicitly, and show that you obtain the policy gradient theorem.
3.13 Time-Varying Bellman Operators
An MDP is defined by: (S, A, P, R, γ), where S, A are the state and action spaces, P is the dynamics, R is the
reward and γ ∈ (0, 1) is the discount factor. In this question, we consider MDPs with varying reward functions.
Let R ⊆ R^{|S|·|A|} denote a set (not necessarily finite) of deterministic reward functions (i.e., R ∈ R). Let {R_i}_{i=1}^{∞} ⊆ R be some arbitrary sequence of reward functions.
a. For every R ∈ R we define a Bellman operator: (T_R V)(s) = max_{a∈A} ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) V(s′) ).
Let {T_i}_{i=1}^{∞} be the sequence of Bellman operators corresponding to the sequence of rewards above (i.e., T_i = T_{R_i}). For ease of notation, ∀i < j we denote T_{i:j} V = T_j T_{j−1} ... T_{i+1} T_i V.
(1) For any i, is T_i a γ-contraction?
(2) Prove or give a counterexample: the sequence T_{1:K} V_0 converges as K → ∞ for any V_0 and any sequence of rewards {R_i}_{i=1}^{∞}.
From now on, assume limi→∞ ||Ri − R∗ ||∞ = 0, and let (T ∗ V )(s) = (TR∗ V )(s).
In the following sections (b-d) we aim to show that iteratively applying Ti for i = 1, 2, ... converges to the fixed
point of T ∗ , for any V0 ∈ R|S| .
b. Let V_1, V_2 ∈ R^{|S|} and M > 0 be such that ||V_1 − V_2||_∞ ≤ M. Show that for any i ∈ N:
d. Use sections (b) and (c), along with the assumption that lim_{i→∞} ||R_i − R*||_∞ = 0, to prove that ∀V_0 ∈ R^{|S|} and ∀ϵ > 0, ∃K_0 ∈ N s.t. ∀K > K_0:
||T_{1:K} V_0 − (T*)^K V_0|| < ϵ.
Conclude that
lim_{K→∞} T_{1:K} V_0 = lim_{K→∞} (T*)^K V_0.
e. We now consider deterministic dynamics, i.e., s′ = f (s, a). Consider the Q-learning algorithm with
time-varying rewards:
Initialize arbitrary Q0
for t=0,1,2,... do
Observe st , at , s′t , Rt (st , at )
Qt+1 (st , at ) = Rt (st , at ) + γ maxa′ ∈A Qt (s′t , a′ )
end for
Assume that limt→∞ ||Rt − R∗ ||∞ = 0, and all states and actions are visited infinitely often. Prove
that the algorithm converges to Q∗ , the fixed point of the Bellman operator: (T ∗ Q)(s, a) = R∗ (s, a) +
γ maxa′ ∈A Q(s′ , a′ ). You can use the following result: the sequence xt+1 = γxt + ϵt converges to 0 if
γ < 1 and ϵt → 0.
3.14 Contextual MDP
Since we have no control over car B, the dynamics of the MDP might change significantly between different
drivers (e.g. some would drive faster when car A is closer and others would drive slower or even break). Assume
that you have a finite set of possible dynamics, P (the state and action are omitted for the ease of notation),
induced by the possible policies of car B.
b. We want to use P to find an optimal policy for the worst case scenario. Suggest a model (from the models
you have seen in class) for this task, formulate it, and describe a value-iteration like algorithm to compute
the optimal policy.
In reality, we might have some hints about the policy of car B (i.e., type of car, weather, etc.), and we would want to use them to produce a better policy. More formally, consider an episodic scenario, where at the beginning of each episode we observe some vector α ∈ R^{|P|}, such that α represents a probability vector over the possible dynamics P ∈ P (i.e., α ≥ 0 and Σ_{i=1}^{|P|} α_i = 1). At each time-step, the dynamics of the environment are drawn from the distribution induced by α. When one episode ends and another starts, we observe a new α.
* Note that in this setting α is fully-observable and given at the beginning of each episode.
c. Modify the state space from clause (a) so it fits the new setting. How many states are there?
Definition: Contextual MDP (CMDP) is a tuple (C, S, A, M), where C is the context-space, S is the state-space,
A is the action-space and M is a mapping, M : C → (P, R, s0 ), such that for each context c ∈ C, M(c) defines
dynamics (P c ), reward (Rc ) and an initial state (sc0 ). The main idea here is that each context defines a single
MDP, (S, A, Rc , P c , sc0 ), where the state and actions spaces are shared across all the MDPs defined by the
contexts.
d. Using the CMDP model for this problem, how would you define the context space and the state space? Is it possible to use a tabular model in this case? How?
e. (1) Think of a 'lazy' learning scheme, in which we do not learn the optimal policy in advance; instead, at the beginning of each episode, after we observe α, we compute the optimal policy using a simulator and play accordingly through the episode. Consider the models from clauses (c) and (d): how would you use each of them for this task? Which one is preferable for this setting, and why?
(2) What are the advantages/disadvantages of using the model from clause b over the models from
clauses c and d?
f. Assume you know the optimal policy for each P ∈ P separately; denote them by {π_i}_{i=1}^{|P|}. Prove or give a counterexample: for some α, the optimal policy can be computed by a linear combination of the known policies, i.e., π^α(a|s) = Σ_{i=1}^{|P|} α_i π_i(a|s) (here π(a|s) represents a probability).
4 Bandits
4.1 MAB with 2 arms per round
Consider a stochastic MAB (multi-armed bandit) problem, where at each round we need to choose two different arms (and not just one). All other assumptions are similar to the model studied in class (i.i.d. bounded reward per arm, etc.). The goal is to minimize a suitable regret measure.
Notation: Let S be the set of arms. For each arm a ∈ S, let (R^a(n), n ≥ 0) be the sequence of i.i.d. rewards associated with that arm, taking values in [0, 1] and with mean µ_a = E(R^a(n)). Further, let n^a_t be the number of times arm a has been chosen up to (and including) time t, let a(t) and b(t) denote the two arms chosen at round t, and let
r_t = r_t^{a(t)} + r_t^{b(t)}
denote the reward obtained at time t, with r_t^a = R^a(n^a_{t−1} + 1). (Both rewards r_t^{a(t)} and r_t^{b(t)} obtained at the chosen arms are observed by the learning agent, i.e., the agent knows which arm produced which reward.)
a. Propose an appropriate definition for regret(T ), the T -stage expected regret, which is suitable for this
problem. Express it precisely in mathematical terms.
b. Show that this problem can be re-formulated as a standard bandit problem, with each pair of bandits (or arms) considered a new "super-bandit". Describe precisely the obtained model.
c. Apply the UCB1 algorithm to the “super-bandit" model of (b). Describe the algorithm, and obtain an
upper bound on the expected regret (using the available results for the standard problem). Discuss briefly
the shortfalls of the super-bandit approach, and how they are reflected in your bound.
d. Propose an extension of the UCB1 algorithm which is suitable for the original problem (and does not use
the super-bandit model). Justify your suggestions verbally.
e. Prove a (non-trivial) upper bound on the expected regret of this algorithm when the expected value of the
two best arms happens to be identical. (Note attached Theorem 6 and its proof from the lecture notes.)
Discuss the improvement relative to (c).
f. [Bonus problem] Derive an upper bound on the algorithm proposed in (d), without the additional assump-
tion of (e) above.
4.2 Exploratory Multi-armed Bandits
In this question we consider the exploratory Bernoulli multi-armed bandit where, instead of finding the arm with the highest reward, we want to find the arm with the median reward. Notation:
• We look for (ϵ, δ)-PAC algorithms such that an ϵ-median arm is chosen with probability at least 1 − δ.
• Throughout the question you can ignore n's parity and assume there is one median arm a_med (if you wish, you may assume n is an odd number).
Input: ϵ > 0, δ > 0
Output: An arm
Set S_1 = A, ϵ_1 = ϵ/4, δ_1 = δ/2, ℓ = 1.
repeat
  Sample every arm a ∈ S_ℓ for (1/(ϵ_ℓ/2)²) log(6/δ_ℓ) times, and let p̂_a^ℓ denote its empirical value
  Find the upper and lower quartiles of the p̂_a^ℓ, denoted by q_ℓ^l and q_ℓ^u respectively
  S_{ℓ+1} = S_ℓ \ {a : p̂_a^ℓ < q_ℓ^l or p̂_a^ℓ > q_ℓ^u}
  ϵ_{ℓ+1} = (3/4) ϵ_ℓ; δ_{ℓ+1} = δ_ℓ/2; ℓ = ℓ + 1
until |S_ℓ| = 1
d. Bound the failure probability of the following event:
e. Let j be some non-ϵ_1-median arm, and assume the failure event from the previous clause does not happen. Bound the failure probability of the following events:
f. Let #bad be the number of arms which are not ϵ1 -optimal but empirically change relation with respect to
the median arm - if they were lower than the median then their estimate is higher, and if they were higher
than the median, then their estimate is lower. Bound the following probability:
g. Prove that the probability of a failure in the first round is lower than δ1 (a failure is the case where no
ϵ1 -optimal arm was left).
4.3 (ϵ, δ)-Finder
You are given k arms, each with an unknown Bernoulli distribution. We assume that the distribution of arm i has mean µ_i. You can assume that k is an odd number.
We say that an algorithm is (ϵ, δ)-median finder if it outputs an ϵ approximation of the median with a probability
of at least 1 − δ.
a. Suppose that we sample each arm exactly T times and output the empirical median. How large should T be so that this algorithm is an (ϵ, δ)-median finder? Prove that the algorithm is indeed an (ϵ, δ)-median finder.
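A sketch of one sufficient choice for part (a), using Hoeffding's inequality and a union bound so that every empirical mean is within ϵ/2 of its true mean (tighter constants are possible):

import math

def empirical_median_arm(pull, k, eps, delta):
    # pull(i) returns one Bernoulli sample of arm i (0-indexed); k is assumed odd.
    T = math.ceil((2.0 / eps ** 2) * math.log(2.0 * k / delta))    # per-arm sample size
    means = [sum(pull(i) for _ in range(T)) / T for i in range(k)]
    order = sorted(range(k), key=lambda i: means[i])
    return order[(k - 1) // 2]                                     # the empirical median arm

If all k estimates are ϵ/2-accurate (probability at least 1 − δ by the union bound), the arm whose empirical mean is the median has a true mean within ϵ of the true median value.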
b. Bonus: Suppose now that the values of µi are known but their identity is not. Propose an (ϵ, δ)-median
finder algorithm with improved sample complexity dependence on the actual values of µi .
We now consider the problem of finding the range: R = maxi µi − mini µi . We say that an algorithm is
(ϵ, δ)-range finder if it outputs an ϵ approximation of R with probability of at least 1 − δ.
c. Suppose that we sample each arm T times and then output the empirical range as the estimate. What should T be so that we have an (ϵ, δ)-range finder? Prove your claim.
d. Propose another algorithm with improved sample complexity in k. Prove a sample complexity bound.
Hint: you may use algorithms studied in class as subroutines.
4.4 Reward Swapping Arms
Consider a two-arms multi-armed bandit setting where the arms are Bernoulli-distributed, with expected rewards
µ+ = 0.5 + ϵ, µ− = 0.5 − ϵ, where ϵ > 0 is some fixed constant. Every T time-steps, a switch may (but does
not have to) occur, in which case the two arms swap their reward function. Let n = kT , where k is an integer.
We define the regret at time n as
R(n) = n µ_+ − E[ Σ_{t=1}^{n} µ(I_t) ],
a. Propose an algorithm for minimizing the regret when log T >> O(1/ϵ), and compute the resulting regret at time n as a function of T, k, ϵ.
b. Propose an algorithm for minimizing the regret when T is small (T = 1 or T = 2, for example), and compute the resulting regret as a function of T, k, ϵ.
c. Consider the following case: k = 2 and ϵ = 1/n (you can assume that n is large). Propose an algorithm for minimizing the regret in that case, and compute the resulting regret as a function of T, k, ϵ.
d. Suppose now that the switch may occur not more than once throughout the whole process. Argue that √n is a lower bound on the regret in this case. Hint: consider how to choose k and T such that "enough" exploration is forced.
4.5 Median Finder
We are given a population of K i.i.d. bandits. When pulling arm i, an i.i.d. reward is received with mean µ_i. You may assume that the arms are Bernoulli random variables and that K is an odd number.
Note: Recall that the median of K numbers ordered a_1 ≤ a_2 ≤ · · · ≤ a_K is the number a_{(K+1)/2}.
a. Suppose we want to find the median arm of the K arms. Propose an algorithm for finding (with probability
of at least 1 − δ) the median arm up to an ϵ-error. Prove its correctness and bound its sample complexity.
b. Suppose now that we want an algorithm for finding the median, but for ϵ = 0. That is, we want an algorithm that exactly finds the median arm. Propose an algorithm (that is correct with probability at least 1 − δ) and analyze its sample complexity in terms of the unknown expectations of the arms.
Note: You are not required to prove the sample complexity of the algorithm.
c. Suppose now that we want to repeat (a.) above but with δ = 0 (i.e., find the median arm with certainty). Is
there an algorithm that can find the median in finite time with certainty? Explain.
d. We now change the problem to asking how many arms have expectations between µmin and µmax ? That
is, we want to figure out how many of the K arms have expectation exactly between these numbers
(with probability at least 1 − δ). More precisely, given K and δ we need to give a finite time horizon T
(which may depend on K and δ) by which we want to be able to say how many of the K arms satisfy that
µmin ≤ µi ≤ µmax ? (I.e., find |{i : µmin ≤ µi ≤ µmax }|.) Does such an algorithm exist? If so, propose
it. If not then explain why.
4.6 Constrained Bandits
We are given K arms. When pulling arm i a vector (ri , ci ) is received, where ri and ci are iid random variables
bounded by 1 with (initially unknown) expectations E[ri ], E[ci ]. The algorithm is supposed to return one of the
arms which we denote by I.
a. We say that an algorithm is (ϵ, δ)-PAC constrained if for all strictly positive ϵ, C_0, δ
P( I = arg max_{i∈[K]} { E[r_i] s.t. E[c_i] ≤ C_0 + ϵ } ) ≥ 1 − δ.
Devise such an algorithm and give (but don't prove) sample complexity bounds (which may depend on E[r_i], E[c_i]).
b. Assume that always ∃i s.t. E[ci ] ≤ C0 . Does there exist a (0, δ)-PAC constrained algorithm whose
expected sample complexity (i.e., time to return I) is bounded?
c. We say that an algorithm is (ϵ, δ, η)-PAC constrained if for all strictly positive ϵ, C_0, δ, η
P( E[r_I] ≥ max_{i∈[K]} { E[r_i] s.t. E[c_i] ≤ C_0 + ϵ } − η ) ≥ 1 − δ.
Devise an algorithm that is (ϵ, δ, η)-PAC constrained and give (with proof) sample complexity bounds
that depend on E[ri ], E[ci ]. Note that we do not ask for the most efficient algorithm, but rather ask for a
correct algorithm.
4.7 Budgeted Bandits
We consider a simple bandit problem with K arms that have an unknown reward distribution on [0, 1]. We are
looking to find an ϵ optimal arm, with probability of at least 1 − δ in the usual PAC model. In the following
questions we allow some samples to be “free". We consider a modification of the sample complexity to “budget
complexity" where samples do not add to the budget if sampled from “good” arms (good is defined differently
below).
In the following, we do not ask you to prove any of the bounds, but use existing results and argue their validity.
a. Suppose that we are given a budget of T samples from non-optimal arms (i.e., every time you sample the optimal arm, it costs nothing). In the worst case, for a given ϵ, what is the smallest δ that can be guaranteed (i.e., write δ as a function of T, K and ϵ)? Explain which algorithm you would use to achieve that budget complexity.
b. Suppose now that we are given a budget of T samples from non-ϵ-optimal arms (i.e., every time we sample the optimal arm or an arm that is at most ϵ worse than the optimal arm, it costs nothing). In the worst case, for a given ϵ, what is the smallest δ that can be guaranteed (i.e., write δ as a function of T, K and ϵ)? Explain which algorithm you would use to achieve that budget complexity.
c. Suppose now that we are given a budget of T samples from non-2ϵ-optimal arms (i.e., every time we sample the optimal arm or an arm that is at most 2ϵ worse than the optimal arm, it costs nothing). In the worst case, for a given ϵ, what is the smallest δ that can be guaranteed (i.e., write δ as a function of T, K and ϵ)? Explain which algorithm you would use to achieve that budget complexity.
d. Consider the budget structure of (c) above (i.e., sampling an arm whose expected reward is less than 2ϵ worse than the optimal does not count). Propose an algorithm that takes advantage of the actual arm means (as in the successive elimination algorithm). Provide an expression for its budget complexity and explain it; there is no need to rigorously prove it.
4.8 Concurrent Bandits
Consider a multi-armed bandit problem where there are K arms. Instead of the usual setting, we have N different
multi-armed problems, each of which with K arms. You can assume that each of the N K arms is Bernoulli,
where the mean of arm i in bandit problem ℓ is piℓ . In the following questions, we ask about different variants of
this problem. In all these settings, when pulling arm i, all bandits are activated and their reward revealed one
way or another.
a. Suppose that the objective of the agent is to find ϵ-optimal arms for all problems with probability of at
least 1 − δ. Propose an algorithm for the problem when N = 2 and analyze its sample complexity (i.e.,
how many samples are needed to return an ϵ optimal for each and every arm).
b. Suppose now that N < K is general. Propose an algorithm and analyze its sample complexity. Your
algorithm should scale sublinearly in N .
c. Consider the regret minimization problem. Here, when selecting arm i, the reward obtained is the sum of the rewards for arm i in each bandit problem 1, 2, . . . , N. Define a concept of regret formally and provide a regret-minimizing strategy with sublinear regret.
d. Consider another variant of the regret minimization problem. Here, when selecting arm i, the reward obtained is the maximum of the rewards for arm i over all bandit problems 1, 2, . . . , N. Define a concept of regret formally and provide a regret-minimizing strategy.
4.9 UCB with Unknown T
In class, we analyzed the UCB algorithm for the case in which T, the number of time-steps for which we run the algorithm, is known in advance. In particular, the UCB bonus took the form √(2 log T / n_t(a)), i.e., it depends on T.
In this question, you will analyze two variants that remedy the need to know T in advance, while still obtaining
the same guarantees of the UCB algorithm for any t.
In the following, we denote µ̄_t(a) = (1/n_t(a)) Σ_{i=1}^{n_t(a)} r_i, where the array D(a) = [r_1, .., r_{n_t(a)}] contains all the rewards that were observed from arm a, and n_t(a) = |D(a)| is the number of observed rewards for arm a. Express your answers in Big-O notation. Assume that all logarithms are in base 2, and that the rewards are bounded in [0, 1].
The following bounds might be useful:
1. For a > 1, Σ_{i=1}^{T} a^i ≤ O(a^T).
2. For l ∈ N, Σ_{i=1}^{T} i^{−l} ≤ O(T^{−l+1}).
3. The expected regret of UCB is bounded by O(K + √(log(T) K T)). The first constant term K was omitted in class, and originates from the first K rounds of UCB where every arm is sampled once.
Algorithm 2 Doubling Trick UCB
Initialize: K arms, $T_s = 1$, $T_f = 2$
Initialize reward arrays by trying each arm once
for t = 1, .., T do
    if $t > T_f$ then
        $T_s = T_f$
        $T_f = 2T_f$
        Empty $D_t(a)$ for all a.
        Initialize reward arrays by trying each arm once
    end if
    Act by $a_t \in \arg\max_a \bar\mu_t(a) + \sqrt{\frac{2\log(T_f - T_s)}{n_t(a)}}$
    Observe $r(a_t) \sim P_{a_t}$ and append $r(a_t)$ to $D_{a_t}$.
end for
Part 1 – The Doubling Trick. In Algorithm 2 we describe the Doubling Trick UCB. Instead of using a bonus
term of the form $\sqrt{\frac{2\log T}{n_t(a)}}$, the algorithm assumes a horizon $T_f - T_s$ and uses it instead of the unknown T.
Once $t = T_f$, the horizon $T_f - T_s$ is doubled (hence the name), and all gathered experience is deleted. This
procedure continues until t = T.
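To make the restart-and-double structure concrete, here is a minimal Python sketch of Algorithm 2. It assumes a hypothetical sampling interface pull(a) that returns a reward in [0, 1] for arm a, and it adds a small numerical guard so the log term is positive in the first phase; it is an illustration, not a reference implementation.

import math

def doubling_trick_ucb(K, T, pull):
    """Sketch of Doubling Trick UCB; `pull(a)` returns a reward in [0, 1]."""
    t = 0
    Ts, Tf = 1, 2
    rewards = None  # per-arm reward lists, reset on every doubling

    def restart():
        # Empty D(a) for all arms and try each arm once.
        nonlocal rewards, t
        rewards = [[] for _ in range(K)]
        for a in range(K):
            rewards[a].append(pull(a))
            t += 1

    restart()
    while t < T:
        if t > Tf:
            Ts, Tf = Tf, 2 * Tf   # double the assumed horizon
            restart()             # delete all gathered experience
        # UCB index with the assumed horizon Tf - Ts in place of the unknown T
        horizon = max(Tf - Ts, 2)  # guard so the log term is positive in the first phase
        idx = [sum(rs) / len(rs) + math.sqrt(2 * math.log(horizon) / len(rs))
               for rs in rewards]
        a_t = idx.index(max(idx))
        rewards[a_t].append(pull(a_t))
        t += 1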
a. Bound the expected regret of the UCB algorithm studied in class for horizon Tf − Ts for any Ts ≤ t ≤ Tf .
b. For simplicity, assume log T is an integer. Bound the expected regret of the Doubling Trick UCB algorithm
for any t ≤ T .
Let $A^c$ denote the complement of A. Prove that $\Pr(A^c) \le O(\frac{1}{T})$ (under the assumption $K \le T$).
e. Using the previous sections, prove an expected regret bound for Horizon Independent UCB for any t ≤ T .
4.10 Bandits with Context
In this question we analyze a multi-armed bandit problem with contexts. A context describes information that is
available to the agent before making a decision. For example, in web advertising, the context can be the gender
of the user – information that is very relevant to which ad should be displayed to them.
Formally, at the beginning of each round a context is sampled i.i.d. from a finite set of contexts, $X_t \sim P_X$,
where $X_t \in \mathcal{X} = \{x_1, x_2, \dots, x_N\}$, i.e., the possible number of contexts is discrete, finite, and $|\mathcal{X}| = N$. For each
context $x \in \mathcal{X}$, the agent can choose among a fixed set of actions $A = \{1, \dots, K\}$, such that the average reward
of playing action $a \in A$ given that the context is $x \in \mathcal{X}$ is $\mu(a, x)$. As usual, the agent does not observe the average
reward, but only a noisy sample $r_{a,x} \in [0, 1]$ such that $\mathbb{E}[r_{a_t,x_t} \mid x_t = x, a_t = a] = \mu(a, x)$.
a. Assume the agent does not have access to the contexts at each round. Suggest a UCB-type algorithm
with good performance w.r.t. the following definition of regret:
$$R(T) = \mathbb{E}\left[\sum_{t=1}^{T} \mu^* - \mu(a_t)\right],$$
Now assume the agent has access to the context: at the beginning of each round the agent observes $X_t$. In the
following, we would like to minimize the context-aware regret measured w.r.t. $\mu^*(X_t) = \max_{a\in A}\mu(a, X_t)$:
$$R_X(T) = \mathbb{E}\left[\sum_{t=1}^{T} \mu^*(X_t) - \mu(a_t, X_t)\right].$$
b. Prove that $R(T) \le R_X(T)$. Show by a counterexample that the opposite does not hold. Explain in 2-3
sentences which would be the stronger result – a bound on $R_X(T)$ or the same bound on $R(T)$?
Hint: prove that $\mathbb{E}[\max_a f(x,a)] \ge \max_a \mathbb{E}[f(x,a)]$ when $x \in \mathcal{X}$ and $f(x,a)$ is finite for any x, a.
We will now show that by using the context $x_t$, we can devise a UCB algorithm with a bounded regret
for $R_X$. Let $n_{t-1}(x, a) = \sum_{t'=1}^{t-1} \mathbb{1}(x_{t'} = x, a_{t'} = a)$ denote the number of times action a was chosen
for context x until round t ($\mathbb{1}(A)$ is the indicator function). We estimate $\mu(x, a)$ by its empirical average,
$$\bar\mu_{t-1}(x, a) = \frac{\sum_{t'=1}^{t-1} r_{t'}\,\mathbb{1}(x_{t'} = x, a_{t'} = a)}{n_{t-1}(x, a)}.$$
Actions are chosen according to (C is a universal constant to be determined in the following)
$$a_t \in \arg\max_a\ \bar\mu_{t-1}(X_t, a) + C\sqrt{\frac{\log T}{n_{t-1}(X_t, a)}}. \qquad (13)$$
d. Perform the following steps. (i) Define the set of clean events such that (13) holds for every $a, x, t \le T, n_t(x,a)$
(assume $T \ge N, K$), (ii) set C in (13), (iii) show that conditioned on the clean event,
$R(T; x) \le O(\sqrt{n_T(x)\,K\log T})$, where $n_T(x) = \sum_{t=1}^{T}\mathbb{1}(x_t = x)$ is the total number of rounds in
which context x was received. You can apply results learned in class without proving them.
e. Combine c-d and prove that $R_X(T) \le O\left(\sqrt{KNT\log T}\right)$ (hint: Jensen's inequality).
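A minimal sketch of the per-context rule (13) in Python, assuming a hypothetical interface draw_context(t) for observing $X_t$ and sample_reward(x, a) for the noisy reward, and a constant C supplied by the caller; each (context, action) pair simply keeps its own empirical mean and count.

import math
from collections import defaultdict

def contextual_ucb(draw_context, sample_reward, K, T, C):
    """Run the UCB rule (13): one index per (context, action) pair."""
    sums = defaultdict(float)    # running reward sums for (x, a)
    counts = defaultdict(int)    # n_{t-1}(x, a)
    for t in range(1, T + 1):
        x = draw_context(t)      # observe X_t at the start of the round
        def index(a):
            n = counts[(x, a)]
            if n == 0:
                return float("inf")   # force one pull of each (x, a) pair
            return sums[(x, a)] / n + C * math.sqrt(math.log(T) / n)
        a = max(range(K), key=index)
        r = sample_reward(x, a)
        sums[(x, a)] += r
        counts[(x, a)] += 1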
5 Function Approximation
where $r \in \mathbb{R}^n$ and $P^\pi \in \mathbb{R}^{n\times n}$ are the reward vector and transition matrix under policy π, respectively.
Also, recall that the projection operator $\Pi : \mathbb{R}^n \to \mathbb{R}^n$ projects a vector $v \in \mathbb{R}^n$ onto a linear subspace $S \subset \mathbb{R}^n$
that is the span of k features $\phi_1(s), \dots, \phi_k(s)$, w.r.t. the ϵ-weighted Euclidean norm, where ϵ is the stationary
state distribution under policy π.
b. Show that the projected 2-step Bellman operator $\Pi T^{\pi,2}$ is a contraction.
c. We propose to approximate the value function $V^\pi$ by the fixed point of $\Pi T^{\pi,2}$, i.e., the approximate value
function $\tilde V^\pi$ satisfies the following equation:
$$\tilde V^\pi = \Pi T^{\pi,2}\,\tilde V^\pi. \qquad (14)$$
d. Derive a bound on the approximation error $\|\tilde V^\pi - V^\pi\|_\epsilon$ in terms of the 'best' approximation error
$\|\Pi V^\pi - V^\pi\|_\epsilon$.
e. Is the fixed point of $\Pi T^{\pi,2}$ a better approximation than the fixed point of $\Pi T^\pi$? Explain.
f. Suggest a modification of the TD(0) algorithm that will converge to a fixed point of $\Pi T^{\pi,2}$. Explain.
In this question we consider function-approximation TD(0) for the off-policy case. Let S and A be finite state
and action spaces, respectively; in addition, denote by γ the discount factor. Let µ be the behavior policy that
generated the samples and let π be the target policy which we want to evaluate. We denote by $P_\mu, P_\pi, R_\mu, R_\pi$
the induced transition matrices and reward vectors. The vector $\phi(s) \in \mathbb{R}^k$ is the feature vector of state $s \in S$
and $\Phi \in \mathbb{R}^{k\times|S|}$ is the state-feature matrix representing the features of all states; we approximate the value of
state $s \in S$ through $V_\pi(s) \approx w^\top\phi(s)$.
a. We start by recalling the results shown in class. You can assume the states are drawn i.i.d. according to
the limit stationary distribution induced by the behavior policy: $d_\mu(s) = \lim_{t\to\infty}\Pr\{s_t = s\}$, and denote
$D_\mu = \mathrm{diag}(d_\mu)$.
1. Write the standard TD(0) update for wt+1 = f (wt , αt , r(st ), ϕ(st ), ϕ(st+1 )), where αt is the step
size.
2. Find E[r(st )ϕ(st )|wt ], E[ϕ(st )ϕ(st+1 )⊤ |wt ], E[ϕ(st )ϕ(st )⊤ |wt ].
3. Rewrite the TD(0) update in the following stochastic approximation form:
$$w_{t+1} = w_t + \alpha_t\left(A_t^\top w_t + b_t + m_t\right), \qquad \mathbb{E}[m_t \mid w_t] = 0.$$
Find explicitly $A = \lim_{t\to\infty}\mathbb{E}_\mu[A_t]$ and $b = \lim_{t\to\infty}\mathbb{E}_\mu[b_t]$.
Reminder: The stability of TD(0) results from A having the following negative-definite property: $\forall y \ne 0,\ y^\top A y < 0$,
which is a result of $E \triangleq D_\mu(\gamma P_\mu - I)$ having the same property.
Define $\rho_t = \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)}$ to be the importance sampling ratio at time t.
b. Show that for any random variable $Z = f(S_t, A_t, S_{t+1})$, the following holds:
$$\mathbb{E}_\mu[\rho_t Z \mid S_t = s] = \mathbb{E}_\pi[Z \mid S_t = s].$$
c. Off-policy TD(0) uses the following update instead of the regular TD(0) update:
$$w_{t+1} = w_t + \alpha_t\rho_t\left(A_t^\top w_t + b_t + m_t\right).$$
Show that the new matrix $A_{\mathrm{Off}} = \lim_{t\to\infty}\mathbb{E}_\mu[\rho_t A_t]$ is equal to $\Phi^\top D_\mu(\gamma P_\pi - I)\Phi$, and define the
corresponding $E_{\mathrm{Off}} = D_\mu(\gamma P_\pi - I)$.
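For concreteness, one common way to write the same importance-sampling-corrected update is through the TD error; the sketch below performs a single off-policy TD(0) step with linear features. The argument names are placeholders, and the question's $(A_t, b_t, m_t)$ decomposition simply groups the same terms.

import numpy as np

def off_policy_td0_step(w, phi_s, phi_s_next, r, rho, alpha, gamma):
    """One off-policy TD(0) update: w <- w + alpha * rho * delta * phi(s)."""
    delta = r + gamma * w @ phi_s_next - w @ phi_s   # TD error for this sample
    return w + alpha * rho * delta * phi_s           # rho = pi(a|s) / mu(a|s)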
d. Given the following MDP (two states, two actions, reward is always zero) and state representation:
$$[P_{\mu/\pi}]_{i,j} = P(s_j \mid s_i, \mu(s_i)/\pi(s_i)), \qquad P_\mu = \begin{pmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{pmatrix}, \qquad P_\pi = \begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}, \qquad r(s,a) \equiv 0,$$
$$\gamma = 0.9, \qquad \phi(s_1) = 1, \qquad \phi(s_2) = 2.$$
1. Find $D_\pi$, $D_\mu$.
2. Show that $E_{\mathrm{Off}}$ is not negative definite.
3. (Bonus) Start from $w_0 = 10$, $\alpha_t = 0.1$, and demonstrate (there is no need for a formal proof) that off-policy
TD(0) with function approximation is unstable, i.e., $\|w_t\| \to \infty$.
5.3 TD(λ) Analysis
Consider an infinite-horizon MDP (S, A, R, P, γ) with a given, fixed policy π. In this question we analyze the
behavior of the TD(λ) algorithm. We consider the approximate case, where a weight vector $w \in \mathbb{R}^k$ is learned,
such that
$$V^\pi(s) = \mathbb{E}^\pi\left[\sum_{t=0}^{\infty}\gamma^t r(s_t)\ \middle|\ s_0 = s\right] \approx w^\top\phi(s),$$
where $e_t, \phi_t, w_t \in \mathbb{R}^k$, and $w_0, e_0$ are chosen arbitrarily. Throughout the question, for simplicity assume
$r(s) = 0\ \forall s \in S$, and that the step size is fixed, i.e., $\alpha_t = \alpha$.
a. Show that the TD(λ) algorithm can be written as a multiplicative update algorithm, i.e., wt+1 = Qt wt .
Find Qt .
b. Bound $w_T$ in the $L_2$ norm, for some finite time-step T. Use $Q_t$'s eigenvalues for the bound.
The expression should be a function of γ, λ, ϕ(·), t, α. It may contain only scalars and the norm of $w_0$.
For your convenience, you can use the following two lemmas:
1. Given a matrix $A \in \mathbb{R}^{n\times n}$, $\|A\|_2^2 = \max_i a_i$, where $a_i$, $i \in \{1,\dots,n\}$, are the eigenvalues of the
matrix $AA^\top$.
2. Let $B = uv^\top + xz^\top$, where $u, v, x, z \in \mathbb{R}^k$. Then B has $k-2$ eigenvalues equal to zero, and 2
eigenvalues:
$$b^+, b^- = \frac{u^\top v + x^\top z \pm \sqrt{(u^\top v - x^\top z)^2 + 4(u^\top z)(v^\top x)}}{2}.$$
c. Given the bound you obtained, will TD(λ) converge? If the answer is positive, prove it. If it is negative,
explain what condition has to hold for TD(λ) to converge.
a. Write an explicit expression for A and b, in terms of the parameters in (17). You can also appropriately
define matrices and vectors for this purpose.
b. Write an explicit expression for $M_n$ in terms of A, b and the parameters in (17), and show that $\mathbb{E}[M_n \mid \mathcal{F}_n] = 0$,
where $\mathcal{F}_n = \sigma\left(\{\theta_k\}_{k=1}^{n}, \{(s_{k-1}, s'_{k-1})\}_{k=1}^{n}\right)$. Is $\mathbb{E}[M_n \mid \mathcal{F}_n] = 0$ also when the samples are obtained
as a sequence $\{s_n\}$ rather than i.i.d. in pairs, i.e., $\mathcal{F}_n = \sigma(\{\theta_k\}_{k=1}^{n}, \{s_k\}_{k=1}^{n})$? Prove your answer.
We now note that the solution $\theta^*$ to the limiting ODE $\dot\theta = h(\theta)$ is $\theta^* = A^{-1}b$, and make the following assumption:
all rewards $r(s, s')$ and feature vectors $\phi(s)$ are uniformly bounded, i.e., $\|\phi(s)\| \le 1/2$ and $|r(s, s')| \le 1$,
$\forall s, s' \in S$.
c. Show that $\|M_n\| \le C_1 + C_2\|\theta_n - \theta^*\|$ and express the constants $C_1, C_2$ using the parameters in (17)
and their norms.
d. Show that $\|\theta_n\| \le C_3 \cdot n$, for some constant $C_3 > 1$. Hint: this can be proved by induction.
e. *Bonus: Show that
$$\mathbb{E}\left[\|\theta_{n+1} - \theta^*\|^2 \mid \mathcal{F}_n\right] \le (\theta_n - \theta^*)^\top\Lambda_n(\theta_n - \theta^*) + C_4\frac{1}{n^2}$$
and express $\Lambda_n$ and the constant $C_4$ using the parameters in the question (but not using $\theta_n$ or $\theta^*$).
Hint: first express $\theta_{n+1}$ as $\theta_{n+1} = f_1(n)\cdot(\theta_n - \theta^*) + f_2(n)$ and use the relation $|x| < 1 + x^2$.
In this question you will analyze the TD(λ) algorithm when using a linear function to approximate the value
function, v.
Consider an MDP with n states and m actions (per state). Let π be an arbitrary policy. We define the n-step
(fixed policy) Bellman operator $T_n^\pi : \mathbb{R}^n \to \mathbb{R}^n$ as follows: $T_n^\pi v \overset{\text{def}}{=} \sum_{t=0}^{n-1}\gamma^t(P^\pi)^t r^\pi + \gamma^n(P^\pi)^n v$, where
$r^\pi \in \mathbb{R}^n$, $P^\pi \in \mathbb{R}^{n\times n}$ are the reward vector and transition matrix under policy π, respectively. Next, define the
TD(λ) operator, $T_\lambda^\pi : \mathbb{R}^n \to \mathbb{R}^n$, as follows:
$$T_\lambda^\pi v = (1-\lambda)\sum_{t=0}^{\infty}\lambda^t T_t^\pi v = (I - \gamma\lambda P^\pi)^{-1}\left(r^\pi + \gamma(1-\lambda)P^\pi v\right) = v + (I - \gamma\lambda P^\pi)^{-1}\left(r^\pi + \gamma P^\pi v - v\right).$$
We would like to approximate the value using linear function approximation, $v = \Phi^\top w$. Let $d^\top = d^\top P^\pi$ be
the stationary distribution of $P^\pi$. We denote by Π the linear projection operator onto the span of Φ, under the
d-weighted norm.
a. Prove that $T_\lambda^\pi$ is a $\xi_\lambda \overset{\text{def}}{=} \frac{\gamma(1-\lambda)}{1-\gamma\lambda}$ contraction in the d-weighted $L_2$ norm, and answer (with proof): what is
its fixed point? (Hint: $(I - \gamma\lambda P^\pi)^{-1} = \sum_t(\gamma\lambda)^t(P^\pi)^t$.)
b. Prove that ΠTλπ is a contraction operator in the d-weighted L2 norm. What is its contraction coefficient?
c. Let $w_\lambda^*$ satisfy the fixed point equation $\Pi T_\lambda^\pi\Phi^\top w_\lambda^* = \Phi^\top w_\lambda^*$. Give a bound of the form
$$\|v^\pi - \Phi^\top w_\lambda^*\|_d \le f(\lambda, \gamma)\,\|v^\pi - \Pi v^\pi\|_d.$$
Furthermore, based on the form of $f(\lambda, \gamma)$, how would you choose λ?
d. Assume (i) access to a sampling device from d, i.e., $s \sim d$; and (ii) access to a simulator which receives an
action and returns the 3-tuple $(s_t, r_t, s_{t+1})$ (from which you can sample as you wish). Suggest a stochastic
approximation procedure that will converge to the fixed point of $\Pi T_\lambda^\pi$.
Specifically, write in a concise way the sampling procedure and the update rules of your algorithm.
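One standard stochastic-approximation scheme of this flavor is TD(λ) with an eligibility trace, sketched below in its on-trajectory form. The sampler draw_transition() is a hypothetical placeholder returning (phi(s_t), r_t, phi(s_{t+1})) from the simulator; with i.i.d. starts s ~ d one would typically roll out short trajectories. This is only meant to make one possible set of update rules concrete.

import numpy as np

def td_lambda(d_features, lam, gamma, alpha, num_steps, draw_transition):
    """TD(lambda) with linear features and an accumulating eligibility trace."""
    w = np.zeros(d_features)
    e = np.zeros(d_features)                      # eligibility trace
    for _ in range(num_steps):
        phi_s, r, phi_s_next = draw_transition()  # (phi(s_t), r_t, phi(s_{t+1}))
        e = gamma * lam * e + phi_s               # e_t = gamma*lambda*e_{t-1} + phi(s_t)
        delta = r + gamma * w @ phi_s_next - w @ phi_s
        w = w + alpha * delta * e                 # w_{t+1} = w_t + alpha * delta_t * e_t
    return w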
5.6 Approximate Value Iteration
We consider approximate value iteration (AVI) in the discounted MDP setting. In class we saw that the Bellman
operator T is a contraction in the sup-norm, but that the projection operator Π is not necessarily a contraction in
sup-norm. We thus concluded that the projected Bellman operator ΠT is not necessarily a contraction. In this
question we investigate a case in which ΠT is a contraction.
a. Consider an MDP with two actions: A = {continue, stop}. At any state s if the agent chooses continue,
we transition according to P (s′ |s), and obtain reward 0. If the agent chooses stop, we transition to a
terminal state and obtain reward r(s). Write down the Bellman operator T for this MDP.
b. Let d denote the stationary distribution of the Markov chain P (s′ |s). Show that T is a contraction in the
d-weighted Euclidean norm ∥ · ∥d .
c. Evaluating when to execute options, and estimating their expected worth (at time t = 0), is an important
problem in finance. Explain how the American 'call' option can be modeled by the MDP above.
d. In reality, the state space for the stock price is continuous. Suggest an AVI algorithm with linear function
approximation for the MDP problem above. Explain why your algorithm converges, and derive an error
bound similar to the error bound we saw in class for the projected Bellman equation. You do not have to
write the algorithm explicitly in pseudo-code, but explain what equation drives it and how it can be
implemented in practice.
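As a rough illustration of what one iteration of such a scheme could look like (not necessarily the intended answer), here is a sketch of a projected value-iteration step for a discretized version of the stopping problem. It assumes tabular states with a known transition matrix P, a stop-reward vector r, a feature matrix Phi, and a weighting vector d; all names are hypothetical.

import numpy as np

def projected_vi_step(w, Phi, P, r, d, gamma):
    """One AVI step: project T(Phi w) back onto the span of Phi in the d-weighted norm."""
    v = Phi @ w                          # current approximate value
    Tv = np.maximum(r, gamma * P @ v)    # stop: r(s); continue: gamma * E[v(s')]
    D = np.diag(d)                       # d-weighted least-squares projection
    w_new = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ Tv)
    return w_new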
We consider approximate value iteration (AVI) in the discounted MDP setting. In class we saw that the Bellman
operator T is a contraction in the sup-norm, but that the projection operator Π is not necessarily a contraction in
sup-norm. We thus concluded that the projected Bellman operator ΠT is not necessarily a contraction. In this
question we investigate a case in which ΠT is a contraction.
We consider a general MDP. Recall that we define function approximation using the projection operator Π,
which maps a function $f : S \to \mathbb{R}$ to a restricted set of functions.
where $\beta_s \ge 0$ is a state-dependent constant, and $\beta_{s,s'} \ge 0$ are state-dependent weights that satisfy
$\beta_s + \sum_{s'}\beta_{s,s'} = 1$. We will show that several well-known function approximators are in fact averagers.
(i) The 1-nearest neighbor approximation (1-NN) is defined by a set of N target states T = {t1 , . . . , tN },
and a distance function D over the state space. The approximation is given by ΠV (s) = V (t∗ ),
where t∗ = argmint∈T D(s, t). Show that 1-NN is an averager (i.e., write the weights βs , βs,s′
explicitly).
(ii) The K-nearest neighbor (K-NN) approximation is an average of the K nearest neighbors. Show that
K-NN is an averager.
(iii) Soft state aggregation function approximation partitions the state space into a set of clusters C =
{c1 , . . . , cN }, and approximates the value of a state as the average of the values in its cluster. Show
that this is also a type of averager.
b. Show that the projection operator for an averager is a non-expansion in the sup-norm.
c. Suggest a convergent batch AVI algorithm using averagers. Show that your algorithm converges and
provide an error bound on the optimal policy.
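To make the averager view in item (i) concrete, the sketch below computes the 1-NN approximation over a set of target states; written this way, the output at each state is a weighted combination of values at the target states with weight 1 on the nearest target and 0 elsewhere, matching the averager form above. The names (values indexed by state, a dist callable) are illustrative assumptions only.

import numpy as np

def one_nn_projection(values, states, targets, dist):
    """1-NN averager: Pi V(s) = V(t*), where t* = argmin_{t in targets} dist(s, t)."""
    out = np.empty(len(states))
    for i, s in enumerate(states):
        t_star = min(targets, key=lambda t: dist(s, t))  # nearest target state
        out[i] = values[t_star]   # beta_{s, t*} = 1, all other weights are 0
    return out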
6 RL – misc.
Consider a Markov reward process, where at iteration k a tuple $(\phi_k, r_k, \phi'_k)$ is obtained i.i.d. from the stationary
distribution of the process. Let $\delta_k \triangleq r_k + \gamma\theta_k^\top\phi'_k - \theta_k^\top\phi_k$. The Gradient TD(0) algorithm performs the following
iterative updates on $u_k, \theta_k \in \mathbb{R}^d$:
$$u_{k+1} = u_k + \beta_k(\delta_k\phi_k - u_k) \qquad (19)$$
and
$$\theta_{k+1} = \theta_k + \alpha_k(\phi_k - \gamma\phi'_k)\,\phi_k^\top u_k. \qquad (20)$$
Throughout this question we assume $\beta_k = \eta\alpha_k$, where $\eta \in (0,1)$; $\sum_{k=0}^{\infty}\alpha_k = \infty$; and $\sum_{k=0}^{\infty}\alpha_k^2 < \infty$. Also,
we use the notation learned in class, $A = \mathbb{E}[\phi_k(\phi_k - \gamma\phi'_k)^\top]$ and $b = \mathbb{E}[r_k\phi_k]$.
a. What is the time complexity of a single iteration ((19) followed by (20)) of the GTD(0) algorithm, using
an efficient implementation?
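As an illustration of a single iteration, here is a sketch of (19) followed by (20) with linear features; the sample interface is a hypothetical placeholder. Note that each line involves only vector operations in $\mathbb{R}^d$.

import numpy as np

def gtd0_step(theta, u, phi, phi_next, r, alpha, beta, gamma):
    """One GTD(0) iteration: update u by (19), then theta by (20)."""
    delta = r + gamma * theta @ phi_next - theta @ phi            # TD error delta_k
    u = u + beta * (delta * phi - u)                              # (19)
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ u)  # (20), phi @ u is a scalar
    return theta, u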
Let us now define a joint vector of length 2d: $\rho_k^\top = (v_k^\top, \theta_k^\top)$, where $v_k = u_k/\sqrt{\eta}$, and a reward-related vector
of length 2d: $g_{k+1}^\top = (r_k\phi_k^\top, 0^\top)$.
c. Let G ≜ E[Gk ] and g ≜ E[gk ]; write them explicitly using the definitions provided in the question. Then,
rewrite (21) as a stochastic approximation algorithm of the form ρk+1 = ρk + αk′ (h(ρk ) + Mk+1 ). Specify
αk′ , h(·) and Mk+1 (where Mk+1 is a martingale difference sequence).
d. Let the filtration $\mathcal{F}_k = \sigma(\rho_1, M_1, \dots, \rho_{k-1}, M_k)$. You can assume that: (i) $\mathbb{E}[\|\phi_k\|^2], \mathbb{E}[r_k^2], \mathbb{E}[\|\phi'_k\|^2]$ are
bounded for all k; (ii) the iterates $\rho_k$ are bounded for all k; and (iii) G is negative definite. Prove that $\rho_k \to \rho^*$ w.p. 1,
and find $\rho^*$.
6.2 n-step Operators in Reinforcement Learning
In the course, we formulated planning and online algorithms using 1-step operators, T and $T^\pi$, the optimal
Bellman operator and the fixed-policy Bellman operator, respectively. Here, you will formulate a possible form
of 'n-step'¹ optimal Bellman operators and formulate a possible use for these operators.
Let v be a value function. We define the n-step optimal Bellman operator, $T^n : \mathbb{R}^{|S|} \to \mathbb{R}^{|S|}$, $T^n v \overset{\text{def}}{=} \underbrace{T \cdot T \cdots T}_{n\ \text{times}}\,v$,
where $\forall s \in S,\ Tv(s) = \max_a \mathbb{E}_{s' \sim P(\cdot|s,a)}\left[r(s,a) + \gamma v(s') \mid s, a\right]$.
b. Consider an Approximate Value Iteration algorithm in which the value at the kth iteration satisfies,
Let $\pi_k$ be the greedy policy w.r.t. $v_k$. Provide a bound on $\|v^{\pi_k} - v^*\|_\infty$ in terms of γ, n, ϵ and $\|v_0 - v^*\|$,
where $v_0$ is the initial value function.
Reminder: The greedy policy w.r.t. a value v satisfies
$$\forall s \in S,\quad \pi(s) = \arg\max_a\ r(s,a) + \gamma\sum_{s'}p(s'\mid s,a)v(s') \qquad (22)$$
and thus, $T^\pi v = Tv$.
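For concreteness, the following sketch applies the n-step optimal Bellman operator to a tabular value vector by composing the one-step operator n times. The arrays P (shape (|S|, |A|, |S|)) and R (shape (|S|, |A|)) are hypothetical tabular inputs.

import numpy as np

def optimal_bellman(v, P, R, gamma):
    """One-step optimal operator: (Tv)(s) = max_a R[s,a] + gamma * sum_s' P[s,a,s'] v(s')."""
    q = R + gamma * np.einsum("sap,p->sa", P, v)
    return q.max(axis=1)

def n_step_optimal_bellman(v, P, R, gamma, n):
    """T^n v: n successive applications of the one-step operator."""
    for _ in range(n):
        v = optimal_bellman(v, P, R, gamma)
    return v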
We now define a set of non-stationary greedy policies with respect to a value function v, $G_n(v)$:
$$G_n(v) = \arg\max_{\pi'_0,\dots,\pi'_{n-1}} \mathbb{E}\left[\sum_{t=0}^{n-1}\gamma^t r(s_t, \pi'_t(s_t)) + \gamma^n v(s_n)\right], \qquad (23)$$
e. Let $\sigma_n$ be the non-stationary policy which repeatedly runs $\{\pi^n_0, \pi^n_1, \dots, \pi^n_{n-1}\} \in G_n(v^\pi)$, i.e., at any time
$t > 0$ we use the policy $\pi^n_{t \bmod n}$. Prove that $v^{\sigma_n} \ge v^\pi$ component-wise, with equality if and only if π is the
optimal policy.
¹ In the course, we used 'n-step' for the task of policy evaluation alone.
6.3 q^π instead of v^π
In the semester, we saw how $v^\pi(s)$, the value of a policy, can be calculated. It is occasionally more helpful
to calculate the Q-function of a policy, $q^\pi(s, a)$, occasionally denoted as the vector $q^\pi \in \mathbb{R}^{|S||A|}$. In words, it
represents the value of acting once from state s with action a, and then acting according to π (π is a stationary
policy). Formally:
$$q^\pi(s, a) = \mathbb{E}^\pi\left[\sum_{t=0}^{\infty}\gamma^t r(s_t, a_t)\ \middle|\ s_0 = s, a_0 = a\right],$$
Prove that $T_q^\pi : \mathbb{R}^{|S||A|} \to \mathbb{R}^{|S||A|}$ is a contraction operator and that its contraction coefficient is γ.
$$q^\pi = (I - \gamma A)^{-1}b,$$
We now turn to the linear approximation setting. We wish to estimate $q^\pi(s, a)$ using the features $\phi(s, a) \in \mathbb{R}^d$,
i.e., $\hat q^\pi(s, a) = \phi^\top(s, a)\,w$, with $w \in \mathbb{R}^d$.
$$\Pi_q q = \Phi w^*, \qquad w^* = \arg\min_w \|\Phi w - q\|_\epsilon^2,$$
where $\epsilon(s, a) = \pi(a \mid s)\,d^\pi(s)$, and $d^\pi$ is the stationary distribution of the Markov chain induced by π.
Show that $\Pi_q$ is a non-expansive operator.
e. Prove that $\Pi_q T_q^\pi$ is a γ-contraction in the ϵ-weighted norm $\|\cdot\|_\epsilon$. For ease, use the following notation for
$T_q^\pi$:
$$T_q^\pi q = r + (P\pi)q, \qquad \text{where}\quad (P\pi)q(s, a) = \sum_{s',a'}P(s'\mid s,a)\,\pi(a'\mid s')\,q(s', a'),$$
and the (s, a) component of r is r(s, a). Verify that this definition is consistent with (24).
6.4 Inverse Reinforcement Learning
In some applications it is desired to design decision-making agents that display similar strategies to humans. For
example, in an autonomous driving scenario, we would like to design driving algorithms that display human
driving styles such as safety and cautiousness. While human behavior can easily be demonstrated and recorded,
explicitly coding it as a policy is difficult in practice.
Inverse RL is an approach to this problem, which aims to only learn the ‘reward’ of the human, instead of its
policy. In this question we will formulate an inverse RL algorithm.
Our model is a reward-less MDP M = {S, A, P, γ}. That is, we know the MDP state space, action space,
transitions, and discount factor, but do not know the reward function r : S → R. Our goal is to learn the function
r from a demonstration trajectory τ = {s0 , a0 , s1 , a1 , . . . , sT , aT }, where we assume that actions in τ were
generated according to the optimal policy π ∗ in the MDP Mr = {S, A, P, r, γ}.
a. Consider the following grid-world MDP with $S = \{1, \dots, 28\}$, where the actions are the 4 directions
{up, down, left, right}, and transitions are deterministic movements of the agent in the corresponding
direction (we assume that moving into a wall keeps the agent in place). We assume γ = 0.99. Consider
the trajectory τ shown in the picture. Suggest a reward function r(s) such that running the optimal policy
starting in $s_0$ would result in the trajectory τ.
We now consider a probabilistic formulation for this problem. Let Q(s, a; r) denote the optimal Q-function in
$M_r$. We assume that the demonstration is generated from the softmax policy
$$\pi(a\mid s) = \frac{e^{Q(s,a;r)}}{\sum_{a'\in A} e^{Q(s,a';r)}}.$$
Note that L depends on r through Q(s, a; r) in π defined above. Furthermore, we will assume that r is given in
a parametric form r(s; θ), and we will seek the parameter θ ∈ R by gradient ascent on L (for simplicity, we
assume that θ is a scalar).
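To make the probabilistic model concrete, here is a sketch of the softmax policy and of the demonstration's objective, assuming (as the surrounding text suggests) that L denotes the log-likelihood of the demonstrated state-action pairs under this policy. Q is a tabular array of shape (|S|, |A|); all names are illustrative.

import numpy as np

def softmax_policy(Q):
    """pi(a|s) = exp(Q[s,a]) / sum_a' exp(Q[s,a']), computed row-wise."""
    z = Q - Q.max(axis=1, keepdims=True)   # stabilize the exponentials
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def log_likelihood(Q, trajectory):
    """Sum of log pi(a_t | s_t) over the demonstrated (s_t, a_t) pairs."""
    pi = softmax_policy(Q)
    return sum(np.log(pi[s, a]) for s, a in trajectory)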
d. Write an expression for $\frac{\partial}{\partial Q(s',a';r)}\log\pi(a\mid s)$. Consider separately the cases $\{s'=s, a'=a\}$, $\{s'=s, a'\ne a\}$,
and $s' \ne s$.
e. Assume that $\frac{\partial}{\partial r(s')}Q(s,a;r)$ is known (for all $s', s, a$). Write down an expression for $\frac{\partial}{\partial\theta}\log\pi(a\mid s)$.
– Consider a function f(x, z) where x is continuous and z belongs to some finite set Z. Let $g(x) = \max_{z\in Z}f(x,z)$,
and $z^*(x) = \arg\max_{z\in Z}f(x,z)$. Then $\nabla_x g(x) = \nabla_x f(x, z = z^*(x))$.
g. Based on your equation from (f), suggest an algorithm in the spirit of value iteration to calculate
$\frac{\partial}{\partial r(s')}Q(s,a;r)$.
In this question we will analyze the Modified Policy Iteration (MPI) algorithm (Algorithm 4). Denote the fixed-policy and
optimal Bellman operators by $T^\pi$ and T, respectively.
Algorithm 4 MPI
Initialize: $m \in \mathbb{N}$, $v_0 \in \mathbb{R}^S$
while some stopping criterion do
    for $s \in S$ do
        $\pi_{k+1}(s) \in \arg\max_a r(s,a) + \gamma\sum_{s'}P(s'\mid s,a)\,v_k(s')$.
    end for
    for $s \in S$ do
        $v_{k+1}(s) = \sum_{t=0}^{m-1}\gamma^t\sum_{s'}P^{\pi_{k+1}}(s_t = s'\mid s_0 = s)\,r(s', \pi_{k+1}(s')) + \gamma^m\sum_{s''}P^{\pi_{k+1}}(s_m = s''\mid s_0 = s)\,v_k(s'')$.
    end for
end while
Return π, v
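A tabular Python sketch of Algorithm 4, assuming hypothetical arrays P of shape (|S|, |A|, |S|) and R of shape (|S|, |A|) and a fixed iteration budget in place of the stopping criterion; the inner loop applies the fixed-policy backup m times, matching the vector form $v_{k+1} = (T^{\pi_{k+1}})^m v_k$ discussed in the next part.

import numpy as np

def modified_policy_iteration(P, R, gamma, m, num_iters):
    """Sketch of MPI: greedy policy step followed by m partial evaluation steps."""
    S, A, _ = P.shape
    v = np.zeros(S)
    pi = np.zeros(S, dtype=int)
    for _ in range(num_iters):          # stands in for 'some stopping criterion'
        q = R + gamma * np.einsum("sap,p->sa", P, v)
        pi = q.argmax(axis=1)           # pi_{k+1}(s) in argmax_a r(s,a) + gamma * E[v_k(s')]
        P_pi = P[np.arange(S), pi]      # transition matrix under pi_{k+1}
        r_pi = R[np.arange(S), pi]      # reward vector under pi_{k+1}
        for _ in range(m):              # v_{k+1} = (T^{pi_{k+1}})^m v_k
            v = r_pi + gamma * P_pi @ v
    return pi, v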
a. In vector notation, MPI performs the following two steps of policy and value update:
– $\pi_{k+1} \in \{\pi' : T^{\pi'}v_k = Tv_k\}$.
– $v_{k+1} = (T^{\pi_{k+1}})^m v_k$.
Prove that the two algorithms are equivalent.
b. Which algorithms are obtained when m = 1, ∞?
In the rest of the question we will analyze the convergence of Modified Policy Iteration.
d. Assume that $Tv_0 \ge v_0$. Prove that for any k, $Tv_k \ge v_k$ (component-wise) and thus
for any k. Hint: use induction, and the update rule of MPI, $v_k = (T^{\pi_k})^m v_{k-1}$.
e. Prove $\|v^* - v_k\|_\infty \le \gamma\|v^* - v_{k-1}\|_\infty$. Hint: prove $0 \le v^* - v_k \le Tv^* - Tv_{k-1}$ and then take the
max-norm.
This question considers an algorithm called Double Q-learning (DQ). Throughout the question we assume the γ-
discounted MDP setting with finite state and action spaces, and tabular Q-functions (no function approximation).
In DQ, instead of maintaining a single Q table, we maintain two tables: $Q^A(s,a)$ and $Q^B(s,a)$. At each time
step t of the algorithm we observe $s_t$, choose an action $a_t$, and observe the next state $s_{t+1}$. We then choose
which table to update by drawing a random $x_{t+1} = \begin{cases}A, & \text{w.p. } 0.5\\ B, & \text{w.p. } 0.5\end{cases}$ (independent of all other variables in the
problem).
If $x_{t+1} = A$, we set $a^* = \arg\max_a Q^A_t(s_{t+1}, a)$ and update $Q^A(s_t, a_t)$:
$$Q^A_{t+1}(s_t, a_t) = Q^A_t(s_t, a_t) + \alpha_t(s_t, a_t)\left(r(s_t, a_t) + \gamma Q^B_t(s_{t+1}, a^*) - Q^A_t(s_t, a_t)\right).$$
Similarly, if $x_{t+1} = B$, we set $b^* = \arg\max_a Q^B_t(s_{t+1}, a)$ and update $Q^B(s_t, a_t)$:
$$Q^B_{t+1}(s_t, a_t) = Q^B_t(s_t, a_t) + \alpha_t(s_t, a_t)\left(r(s_t, a_t) + \gamma Q^A_t(s_{t+1}, b^*) - Q^B_t(s_t, a_t)\right).$$
When updating QA , all elements in QB remain the same, and vice versa. Additionally, similarly to regular
Q-learning, we only update the Q-table (either A or B) for the current state and action st , at , and leave all the
other elements in the table unchanged.
The DQ algorithm is often used in practice, as it is empirically more stable than standard Q learning. In this
question, we will only focus on showing that DQ converges.
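A sketch of a single DQ update as described above, with tabular Q arrays of shape (|S|, |A|); the coin flip, the argmax on one table, and the bootstrap from the other table are the essential ingredients. The argument names and the numpy random generator are illustrative assumptions.

import numpy as np

def double_q_step(QA, QB, s, a, r, s_next, alpha, gamma, rng):
    """One Double Q-learning update; only the chosen table's (s, a) entry changes."""
    if rng.random() < 0.5:                     # x_{t+1} = A
        a_star = QA[s_next].argmax()           # a* = argmax_a Q^A(s_{t+1}, a)
        QA[s, a] += alpha * (r + gamma * QB[s_next, a_star] - QA[s, a])
    else:                                      # x_{t+1} = B
        b_star = QB[s_next].argmax()           # b* = argmax_a Q^B(s_{t+1}, a)
        QB[s, a] += alpha * (r + gamma * QA[s_next, b_star] - QB[s, a])
    return QA, QB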
We will show convergence of $Q^B_t$. By symmetry, this will hold also for $Q^A_t$. We let $\mathcal{F}_t$ denote the history of
the process until time t, i.e., $\mathcal{F}_t = \{s_0, a_0, Q^A_0, Q^B_0, x_0, \dots, s_t, a_t, Q^A_t, Q^B_t, x_t\}$.
$$Q^B_{t+1}(s_t, a_t) = (1 - \alpha_t(s_t, a_t))\,Q^B_t(s_t, a_t) + \alpha_t(s_t, a_t)\left(HQ^B_t + \omega_t + c_t\right),$$
An extension of the stochastic approximation theorem claims that if $\lim_{t\to\infty}c_t = 0$ w.p. 1, then $Q^B_t$ converges.
In the rest of the question we will show that $\lim_{t\to\infty}c_t = 0$ w.p. 1.
b. Define $\Delta_t = Q^B_t - Q^A_t$. Explain why showing that $\lim_{t\to\infty}\Delta_t = 0$ w.p. 1 implies that $\lim_{t\to\infty}c_t = 0$
w.p. 1.
c. Write down ∆t+1 (st , at ) explicitly as a function of ∆t (st , at ), xt (and other required terms).
d. We define $g_t = \mathbb{E}\left[Q^A_t(s_{t+1}, b^*) - Q^B_t(s_{t+1}, a^*) \mid \mathcal{F}_t\right]$. Show that:
$$\mathbb{E}[\Delta_{t+1}(s_t, a_t)\mid\mathcal{F}_t] = (1 - 0.5\,\alpha_t(s_t, a_t))\,\Delta_t(s_t, a_t) + 0.5\,\gamma\,\alpha_t(s_t, a_t)\,g_t.$$
e. Prove that $|g_t| \le \|\Delta_t\|_\infty$.
An extension of the SA theorem shows that the update in (25) converges to zero w.p. 1.