Reinforcement Learning: Foundations Exam
February 2022
This book is still a work in progress. We would be grateful for notification of errors or comments of any kind, at rlfoundationsbook@gmail.com.
Contents
1 Preface
2 Dynamic Programming
2.1 Subset sum
2.2 Colored Tree
2.3 Trajectories in Graph & Efficient Matrix Multiplication
2.4 Egg Throw & Maximal Square sub-Matrix
2.5 Two-Traversals Maximal Score
2.6 Longest Common Subsequence
2.7 All Pairs Shortest Path
2.8 Machine Maintenance
2.9 Black-Red Card Game
2.10 Blackmailer
2.11 Engineer's Job Choice Problem
3 MDPs
3.1 MDP with an Absorbing State
3.2 Worst Case Reward
3.3 Solving MDPs through Linear Programming
3.4 Simple MDP – Analysis
3.5 Solving MDPs using sampling
3.6 Bellman equation for Second Moment
3.7 Investor Planning Problem
3.8 Bellman Operators and Linearly Solvable MDPs
3.9 MDPs with Non-Linear Objectives
3.10 Belief MDP
3.11 Exploration Conscious MDPs
3.12 Average Policy Improvement Theorem
3.13 Time-Varying Bellman Operators
3.14 Contextual MDP
4 Bandits
4.1 MAB with 2 arms per round
4.2 Exploratory Multi-armed Bandits
4.3 (ϵ, δ)-Finder
4.4 Reward Swapping Arms
4.5 Median Finder
4.6 Constrained Bandits
4.7 Budgeted Bandits
4.8 Concurrent Bandits
4.9 UCB with Unknown T
4.10 Bandits with Context
5 Function Approximation
5.1 2-step Bellman Operator
5.2 Off-policy TD(0)
5.3 TD(λ) Analysis
5.4 TD(0) with Function Approximation
5.5 TD(λ) with Function Approximation
5.6 Approximate Value Iteration
5.7 Approximate Value Iteration – revisited
6 RL – misc.
6.1 GTD(0) algorithm
6.2 n-step Operators in Reinforcement Learning
6.3 q^π instead of v^π
6.4 Inverse Reinforcement Learning
6.5 Modified Policy Iteration and Generalizations
6.6 Double Q-learning
1 Preface
This booklet contains a collection of questions on RL, to accompany the textbook. These questions were given in final exams in our teaching of advanced undergraduate RL courses at Tel Aviv University and at the Technion. The questions are roughly organized according to topic. However, some questions involve topics from several chapters of study, and are intended as practice for students who have already completed studying the full material.
2 Dynamic Programming
2.1 Subset sum
We are given a list of n positive integer numbers a_1, . . . , a_n and a positive integer S. It is required to answer the following question: does there exist a subset of the a_i's that adds up to S?
a. Suppose that you are allowed to use each element ai any number of times. Give a dynamic programming
algorithm that provides an answer in O(nS) operations (addition and comparison).
b. Suppose now that you are allowed to use each element ai at most once. Repeat the previous problem (with
the same complexity requirement).
Hint: Let Aj = {a1 , . . . , aj }. Consider iterating over j.
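A minimal sketch in Python of the dynamic-programming idea for both parts, following the hint of adding the elements one at a time (function and variable names are illustrative, not part of the exam):

def subset_sum_repeats(a, S):
    # Part (a): each element may be used any number of times; O(nS) additions and comparisons.
    reach = [False] * (S + 1)
    reach[0] = True
    for x in a:
        for s in range(x, S + 1):          # forward scan allows reusing x
            if reach[s - x]:
                reach[s] = True
    return reach[S]

def subset_sum_once(a, S):
    # Part (b): each element may be used at most once; same O(nS) complexity.
    reach = [False] * (S + 1)
    reach[0] = True
    for x in a:                            # after processing a_1..a_j, reach encodes the sums of subsets of A_j
        for s in range(S, x - 1, -1):      # backward scan so x is used at most once
            if reach[s - x]:
                reach[s] = True
    return reach[S]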
2.2 Colored Tree
Consider a tree in which every edge is colored either blue or red.
a. Devise an efficient algorithm that counts how many trajectories from the source to the leaves contain only blue edges. The algorithm should not traverse any node twice.
b. Now we want to count how many trajectories from the source to the leaves contain at most one blue edge. Devise an efficient algorithm for this task.
c. Now we want to count how many trajectories from the source to the leaves contain at most k blue edges, where k > 0. Devise an efficient algorithm for this task.
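A possible single-pass DFS sketch for parts (b) and (c), assuming the tree is given as an adjacency structure children[v] of (child, edge colour) pairs rooted at the source (this input format is an assumption):

def count_paths_with_blue_budget(children, root, k):
    # Counts source-to-leaf paths that use at most k blue edges; every node is visited once.
    def dfs(v, budget):
        if not children[v]:                      # a leaf: one valid path ends here
            return 1
        total = 0
        for child, colour in children[v]:
            cost = 1 if colour == "blue" else 0
            if budget >= cost:
                total += dfs(child, budget - cost)
        return total
    return dfs(root, k)

# Part (b) is the case k = 1. For part (a) ("only blue edges") one can swap the colour roles
# and call the function with k = 0, so that any red edge immediately exhausts the budget.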
2.3 Trajectories in Graph & Efficient Matrix Multiplication
a. Given an undirected graph G = (V, E), we denote the degree (number of edges) of vertex v ∈ V by d(v).
Devise an algorithm for finding the number of possible T -length trajectories starting from a given vertex s,
and ending in a given vertex t (cycles are allowed). What is the complexity of your algorithm as a function
of |V |, T , and |E|?
Figure 1: Example. There are 12 trajectories of length 3 from F to C: FCFC, FEDC, FCDC, FCEC, FCBC, FCAC, FABC, FGAC, FGEC, FGFC, FAFC, FEFC.
b. The number of operations needed to multiply two matrices A and B where, A ∈ Rm×n , B ∈ Rn×p is
mnp operations.
For example, let A ∈ R3×5 , B ∈ R5×9 , C ∈ R9×2 , D ∈ R2×8 . Then the number of operations needed for
calculating ABCD with the following parentheses: (AB)(CD) is 3·5·9+9·2·8+3·9·8 = 495 operations.
However, the number of operations when calculating with the following parentheses: A(B(CD)) is 9·2·8 + 5·9·8 + 3·5·8 = 624 operations.
Given a multiplicative chain of N matrices, ∏_{i=1}^{N} A_i, such that A_i ∈ R^{p_i × p_{i+1}}, devise an algorithm for finding the multiplication order with the smallest number of operations. What is the complexity of your algorithm in the parameters of the problem?
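A sketch of the standard interval dynamic program for the matrix-chain question; the dimension vector p (with A_i of shape p[i-1] x p[i]) is an assumed input format:

def matrix_chain_order(p):
    # m[i][j] = minimal number of scalar multiplications needed to compute A_i ... A_j (1-indexed).
    N = len(p) - 1
    INF = float("inf")
    m = [[0] * (N + 1) for _ in range(N + 1)]
    for length in range(2, N + 1):
        for i in range(1, N - length + 2):
            j = i + length - 1
            m[i][j] = INF
            for k in range(i, j):                      # split as (A_i..A_k)(A_{k+1}..A_j)
                cost = m[i][k] + m[k + 1][j] + p[i - 1] * p[k] * p[j]
                m[i][j] = min(m[i][j], cost)
    return m[1][N]

# For the ABCD example above, matrix_chain_order([3, 5, 9, 2, 8]) returns 168, achieved by (A(BC))D;
# the dynamic program runs in O(N^3) time and O(N^2) space.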
2.4 Egg Throw & Maximal Square sub-Matrix
a. There is a tower with n floors and you have 2 identical eggs. There exists a floor m such that any egg
thrown from a higher floor crashes, while every egg thrown from floor m, or lower, survives. Devise
an efficient algorithm for finding the least number of tries needed in the worst case and find its time
complexity.
Example 1. Suppose there are 15 floors and your algorithm performs binary search with the first egg. Notice that once the first egg breaks, the second egg must be thrown from every floor you are not sure about, from the bottom to the top, since otherwise you will not know m exactly. Hence, if the first egg breaks at floor 8, you will be forced to make 8 tries in the worst case: floor 8, and then floors 1-7 one after the other. If the first egg does not break at floor 8, you know that m ≥ 8, and you will not need more than 8 tries anyhow (over floors 8-15); so the worst-case number of trials of this algorithm is 8.
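For part (a), a short memoized recursion over the number of candidate floors; this is a sketch of one possible solution, not the only one:

from functools import lru_cache

def egg_drop_two_eggs(n):
    # f(floors) = worst-case number of throws needed with 2 eggs and `floors` candidate floors.
    @lru_cache(maxsize=None)
    def f(floors):
        if floors == 0:
            return 0
        best = floors                               # throwing floor-by-floor from the bottom
        for x in range(1, floors + 1):
            # throw from the x-th remaining floor:
            #   breaks   -> x-1 floors left with one egg, checked one by one (x throws in total)
            #   survives -> floors - x floors left with two eggs
            best = min(best, max(x, 1 + f(floors - x)))
        return best
    return f(n)

# egg_drop_two_eggs(15) == 5, improving on the worst case of 8 from the binary-search strategy in Example 1.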
b. Suggest an efficient algorithm that, given a matrix consisting of 0's and 1's, finds the size of the maximum square sub-matrix consisting only of 1's. What is the time complexity of your algorithm?
Note: a sub-matrix is defined as an adjacent (contiguous) subset of rows and columns.
Example 2. Observe the following 4 × 4 matrix A:
1 1 1 1
1 1 1 1
1 0 0 0
1 1 0 1
The biggest sub-matrices of this matrix are given in A[1, 2|1, 2], A[1, 2|2, 3], A[1, 2|3, 4] and are of size 2.
The sub-matrix A[1, 2, 4|1, 2, 4] consists of only 1’s, but it is not composed of an adjacent subset of rows
and columns, and is therefore not considered.
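A sketch of the usual O(RC) dynamic program for part (b), where side[i][j] is the side of the largest all-ones square whose bottom-right corner is cell (i, j):

def max_square_of_ones(M):
    R, C = len(M), len(M[0])
    side = [[0] * C for _ in range(R)]
    best = 0
    for i in range(R):
        for j in range(C):
            if M[i][j] == 1:
                if i == 0 or j == 0:
                    side[i][j] = 1
                else:
                    side[i][j] = 1 + min(side[i - 1][j], side[i][j - 1], side[i - 1][j - 1])
                best = max(best, side[i][j])
    return best

# For the matrix A of Example 2 this returns 2.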
2.5 Two-Traversals Maximal Score
Given a matrix of dimension R × C, where every entry contains a score, how do you collect a maximal overall score using two traversals?
The two traversals are defined as follows.
1. The two traversals have to start from top row, from the left and right corners, i.e., [1, 1] and [1, C]. The
first (in the left corner) has to reach bottom left corner, i.e., [R, 1]. The second traversal should reach
bottom right corner, i.e., [R, C].
A traversal gets all points of a particular cell through which it passes. If one traversal has already collected the points of a cell, then the other traversal gets no points if it goes through that cell again.
b. Suggest a dynamic programming based algorithm to solve the problem and analyze its complexity.
2.6 Longest Common Subsequence
Given a sequence X = (x_1, . . . , x_m), we say that a sequence Z = (z_1, . . . , z_k) is a subsequence of X if there exists a strictly increasing sequence of indices (i_1, . . . , i_k) such that for all j = 1, . . . , k we have x_{i_j} = z_j. For example, Z = (B, C, D, B) is a subsequence of X = (A, B, C, B, D, A, B).
Given two sequences X, Y we say that Z is a common subsequence of X and Y if Z is a subsequence of both
X and Y . In the longest-common-subsequence (LCS) problem we are given two sequences X = (x1 , . . . , xm )
and Y = (y1 , . . . , yn ), and we need to find the maximum length common subsequence of X and Y .
a. Propose a dynamic programming solution to find the LCS of X and Y . What are its time and space
complexities?
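A sketch of the classic O(mn) table for part (a); dp[i][j] is the LCS length of the prefixes X[:i] and Y[:j]:

def lcs_length(X, Y):
    m, n = len(X), len(Y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1      # extend a common subsequence by one symbol
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

# lcs_length("ABCBDAB", "BDCABA") == 4; time and space are both O(mn).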
Let us define the k-LCS of X and Y to be their LCS where it is allowed to change at most k elements in X to
any value.
c. Propose a dynamic programming solution to find the k-LCS of any X and Y . What are its time and space
complexities?
2.7 All Pairs Shortest Path
In this question we will consider the all pairs shortest path (APSP) problem. We start with the deterministic setting on a graph.
Consider a directed weighted graph G = (V, E) with N nodes, and no negative cycles. We would like to find
the shortest distance D(v, u) from any node v ∈ V to any other node u ∈ V .
a. Suggest a naive solution that uses the single-source (or single destination) Bellman-Ford algorithm in an
inner loop. What is the complexity of this algorithm?
b. Assume that D(v, u) for the graph G is given. Now, we construct a new graph G′ by adding a new node v ′
to G, and corresponding edges between v ′ and the nodes in G. Let D′ (v, u) denote the shortest distances
on the graph G′ .
(i) Suggest a method for computing D′(v′, u) for some u ∈ V in the new graph G′, in O(|V|) computations.
(ii) Suggest a method for computing D′(u, v′) for some u ∈ V in the new graph G′, in O(|V|) computations.
(iii) Suggest a method for computing D′(v, u) ∀u, v ∈ V in the new graph G′, in O(|V|²) computations.
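One way to realize steps (i)-(iii) in code, under the assumption that the edges incident to the new node are given as a weight dictionary (missing entries meaning "no edge"); the node label "v_new" and the input format are illustrative:

def extend_apsp(D, V, w_new):
    # D[(u, v)]: shortest distances in G; w_new[(x, "v_new")] / w_new[("v_new", x)]: weights of the new edges.
    INF = float("inf")
    w = lambda x, y: w_new.get((x, y), INF)
    # (i) distances from the new node: the first step goes to some x in V, then a shortest path inside G.
    d_from = {u: min((w("v_new", x) + D[(x, u)] for x in V), default=INF) for u in V}
    # (ii) distances to the new node, symmetrically.
    d_to = {u: min((D[(u, x)] + w(x, "v_new") for x in V), default=INF) for u in V}
    # (iii) a shortest u -> v path in G' either avoids the new node or passes through it once.
    D_new = {(u, v): min(D[(u, v)], d_to[u] + d_from[v]) for u in V for v in V}
    return D_new, d_from, d_to

Steps (i) and (ii) cost O(|V|) per target, and step (iii) costs O(|V|²) overall, as required.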
d. For a fixed goal state g ∈ S, formulate the shortest distance problem above as an MDP. That is, define a
cost function c(s, a) and an optimization objective such that the optimal value function corresponds to
V (u, g).
f. Prove or give a counterexample for the following conjecture:
V(u, v) = min_{a∈A} ( c(u, a) + Σ_{s′∈S} P(s′|u, a) V(s′, v) )   ∀u ≠ v.
2.8 Machine Maintenance
A workshop manager has just bought an expensive new machine. On each day, he has two options: maintain the
machine at cost M or not maintain it. If he chooses not to maintain it, there is a risk that the machine will break
down with probability pj and cost B to fix, where j is the number of consecutive days without maintenance
since the last breakdown or maintenance (e.g. in the beginning, the probability of breakdown is p0 . After one
day with no maintenance, the probability of breakdown is p1 , after two successive days of no maintenance, the
probability of breakdown is p2 , etc). Assume pj is monotonically non-decreasing in j, and that there exists an
integer m such that pm B > M .
a. Explain, in words, the meaning behind the assumption that there exists an integer m such that pm B > M ,
and deduce that the smallest number of states that is sufficient for a full characterization of the problem is
m + 1.
b. Formulate this as an infinite horizon discounted cost MDP, with a discount factor γ ∈ (0, 1). Describe the
states, actions, costs, and transition probabilities (the discount factor is arbitrary).
c. Write the corresponding Bellman equation for the value function J(i), where i is the state.
e. Prove that the cost value is monotonically non-decreasing in the state, i.e., i > j =⇒ J(i) ≥ J(j).
Hint: use (backward) induction and the result from (d).
f. Explain why the optimal policy is a threshold policy (defined below) and find the threshold i∗ .
Threshold policy: for an MDP with actions {a1 , a2 } and states {s1 , s2 , ..., sN }, a threshold policy with
threshold i∗ is defined as follows: choose action a1 for all si such that i < i∗ , and a2 otherwise.
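A hedged value-iteration sketch for the formulation asked for in parts (b)-(c), under the assumption (consistent with part (a)) that the state is the number of consecutive unmaintained days capped at m, and that both maintenance and a breakdown reset the state to 0:

def maintenance_value_iteration(p, M, B, gamma, iters=1000):
    # p[i] = breakdown probability after i consecutive days without maintenance, i = 0..m.
    m = len(p) - 1
    J = [0.0] * (m + 1)
    for _ in range(iters):
        J_new = []
        for i in range(m + 1):
            maintain = M + gamma * J[0]
            skip = p[i] * (B + gamma * J[0]) + (1 - p[i]) * gamma * J[min(i + 1, m)]
            J_new.append(min(maintain, skip))
        J = J_new
    return J   # J[i] approximates the optimal discounted cost from state i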
2.9 Black-Red Card Game
Dima plays a game that starts with a deck with B "black" cards and R "red" cards. Dima knows B and R. At
each time-step Dima draws a random card and decides between the following two options:
(a1 ) Without looking at the card, "predict" that it is black in which case Dima wins the game if the prediction
is correct and loses if the prediction is incorrect.
(a2 ) "Discard" the card, after looking at its color, and continue the game with one card less.
Dima wins the game if at any time-step the deck has only black cards, and loses the game if the deck has only
red cards. Dima wants to find a policy that maximizes the probability of winning the game.
a. What is the maximal number of time steps the game can last?
b. We wish to formulate the problem as a finite horizon undiscounted (γ = 1) return problem, with S = {(b, r) | 0 ≤ b ≤ B, 0 ≤ r ≤ R} ∪ {T} as the state space, where T is the terminal state. Write the action space at each state, i.e., A(s) ∀s ∈ S; write the transition probabilities p(s′|s, a), ∀s′, s ∈ S, ∀a ∈ A; and suggest a suitable (and simple) reward function.
Clarification: In the case of zero red/black cards, we first move to the corresponding state with zero
red/black cards and then terminate in the next step.
d. Suggest a dynamic programming algorithm to find the exact solution of the MDP, i.e. the exact value
function J(s), ∀s ∈ S. What is its complexity?
e. Use induction to show that the optimal probability of winning when starting with b black cards and r red cards is b/(b + r).
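A sketch of the dynamic program of part (d): J(b, r) is the optimal winning probability with b black and r red cards remaining, computed over the O(BR) states:

from functools import lru_cache

def win_probability(B, R):
    @lru_cache(maxsize=None)
    def J(b, r):
        if r == 0:
            return 1.0                       # only black cards remain: Dima wins
        if b == 0:
            return 0.0                       # only red cards remain: Dima loses
        predict = b / (b + r)                # a1: predict "black" for the next (uniformly random) card
        discard = (b * J(b - 1, r) + r * J(b, r - 1)) / (b + r)   # a2: look, discard, continue
        return max(predict, discard)
    return J(B, R)

# Consistent with part (e): win_probability(B, R) evaluates to B / (B + R), e.g. win_probability(3, 2) == 0.6.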
2.10 Blackmailer
Consider a situation involving a blackmailer and a victim. At each stage, the blackmailer can:
(2) Demand a payment of $1, in which case the victim will comply with the demand (with probability p, where
0 < p < 1, independently of the past history), or they will refuse to pay and denounce the blackmailer to
the police (with probability 1 − p).
Once denounced to the police, the blackmailer loses all of the accumulated earnings and cannot blackmail again.
Also, once the blackmailer reaches accumulated earnings of $n he immediately retires. The blackmailer’s goal is
to maximize the expected (undiscounted) amount of money he earns.
a. Formulate the problem as an MDP with the states i = 0, 1, ..., n and a termination state. Specifically,
write the states, actions at each state, transition probabilities, and rewards. Note: The rewards must be
non-negative.
b. Write the corresponding Bellman equation (consider all states in 0, 1, ..., n).
c. Let J ∗ be the optimal value function of the MDP. Show that J ∗ (i) is monotonically increasing as a
function of i.
2.11 Engineer’s Job Choice Problem
An ambitious engineer, currently employed at some company at a salary c, seeks employment elsewhere. She
receives a job offer at each time step, which she may accept or reject. The offered salary takes one of n possible
values w1 , ..., wn , with given probabilities ξ1 , ..., ξn , independent of past offers.
If she accepts the offer, she keeps the job for the rest of her life. If she rejects, she continues at her current job at
salary c and continues to receive offers in future steps.
a. Formulate the engineer’s problem as a discounted MDP. Write the states, actions, transition probabilities,
and rewards. Assume a discount factor 0 < γ < 1.
Now, consider a variant of the problem where the worker can be fired from her new job at salary w_i at each time step; this happens with probability p_i. Once fired, she returns to her former company and resumes her work at salary c, and she continues to receive offers in future steps, which she may accept or reject. Also, assume she cannot be fired from her former job at salary c.
e. Assume pi = p ∀i. Show that the optimal policy is a threshold policy (i.e., only accept the offer if the
salary is greater than some threshold). Hint: assume, without loss of generality, that w1 ≤ w2 ≤ ... ≤ wn .
f. Now p_i is not independent of i. Provide a sufficient condition under which the optimal policy from the previous clause still holds.
3 MDPs
3.1 MDP with an Absorbing State
Consider an MDP, defined by (S, A, P, r). We consider the following total-return criterion over an infinite time horizon:
J^π(s) = lim_{T→∞} E^{π,s} [ Σ_{t=0}^{T} r(s_t, a_t) ],
where E^{π,s} is, as usual, the expected value given a control policy π and initial state s_0 = s. The policy π is taken to be stationary and deterministic.
We assume here that there exists a special state s* ∈ S and a number β > 0 such that, for any state-action pair (s, a),
P(s*|s, a) ≥ β.
Although the total return criterion is not well defined in general, the presence of an absorbing state makes it "well behaved". There are two approaches to show this: either a direct one, or through an equivalent discounted problem. We will consider both.
a. Prove that there exists a constant γ < 1 such that, for any policy π and t ≥ 0,
P_π(s_t ≠ s* | s_0 = s) ≤ γ^t.
b. Prove that the total return J π (s) is well defined for every policy π (i.e., the limit converges) and finite.
Hint: Find a bound on Eπ,s (r(st , at )).
c. Show that J π (s) can be written as an equivalent discounted return function. Follow the following steps:
(1) Let P̂ = (P̂(s′|s, a)) be a modified state transition matrix, defined as:
P̂(s′|s, a) = P(s′|s, a) / (1 − β)           if s′ ≠ s*,
P̂(s′|s, a) = (P(s′|s, a) − β) / (1 − β)     if s′ = s*.
Show that P̂ is indeed a state transition matrix (i.e., satisfies the relevant requirements).
(2) Let P̂π,s denote the probability distribution induced on the process by P̂ , policy π, and initial state
s0 = s. Show that, for any s′ ̸= s∗ ,
P̂_{π,s}(s_t = s′) = (1 / (1 − β)^t) · P_{π,s}(s_t = s′).
(3) Show that J π (s) can be equivalently expressed as a discounted return function for the modified
MDP (S, A, P̂ , r). What is the discount factor?
(4) Write down the Bellman optimality equation for the equivalent discounted problem, and express it
in terms of the original problem parameters. Compare to the Bellman operator defined in the next
section.
d. For a stationary and deterministic policy π, consider the (undiscounted) fixed-policy Bellman operator
T π : RS → RS , defined by
(T^π V)(s) = r(s, π(s)) + Σ_{s′∈S} P(s′|s, π(s)) V(s′).
Show that T π is a γ-contraction (for some γ < 1) over the space {V ∈ RS : V (s∗ ) = 0}. Specify γ, and
explain why we need to consider only functions with V (s∗ ) = 0.
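A small numerical sketch of step c(1): construct a kernel P with P(s*|s, a) ≥ β, build P̂ as defined above, and check that it is a valid transition matrix (the random construction is only for illustration):

import numpy as np

def modified_kernel(P, s_star, beta):
    # Divide all entries by (1 - beta) and shift the s* column down by beta / (1 - beta), as in c(1).
    P_hat = P / (1.0 - beta)
    P_hat[:, :, s_star] = (P[:, :, s_star] - beta) / (1.0 - beta)
    return P_hat

rng = np.random.default_rng(0)
S, A, beta, s_star = 4, 2, 0.2, 0
Q = rng.random((S, A, S))
Q /= Q.sum(axis=2, keepdims=True)            # arbitrary transition kernel
P = (1 - beta) * Q
P[:, :, s_star] += beta                      # guarantees P(s*|s, a) >= beta for every (s, a)
P_hat = modified_kernel(P, s_star, beta)
assert np.allclose(P_hat.sum(axis=2), 1.0)   # rows sum to one
assert (P_hat >= -1e-12).all()               # entries are non-negative because P(s*|s, a) >= beta
# For this particular construction P_hat coincides with Q.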
3.2 Worst Case Reward
Consider a finite horizon MDP with the following twist. For a stationary stochastic policy π, instead of looking at the expected cumulative reward, we look at the worst case reward until a given time T:
J^{π,T}(s) = min_{(s_0,a_0,s_1,a_1,...,s_T,a_T) : P_π(s_0,a_0,...,s_T,a_T | s_0 = s) > 0}  Σ_{t=0}^{T} r(s_t, a_t).
That is, the worst possible reward given that the initial state is s. Similarly, we define the best case reward as
J̄^{π,T}(s) = max_{(s_0,a_0,s_1,a_1,...,s_T,a_T) : P_π(s_0,a_0,...,s_T,a_T | s_0 = s) > 0}  Σ_{t=0}^{T} r(s_t, a_t).
a. Write a backward recursion (dynamic programming on T, T − 1, . . .) formula for J π,T .
b. Suppose that we want to compute J^{*,T}(s) = sup_π J^{π,T}(s). Suggest an algorithm to do so.
c. Is there an optimal deterministic strategy that attains J^{*,T}(s)? That is, does there exist a deterministic policy π such that J^{*,T}(s) = J^{π,T}(s) for all s? Prove or give a counterexample.
d. We now consider the infinite horizon case. For some γ < 1 we define
J^π(s) = lim_{T→∞} min_{(s_0,a_0,...,s_T,a_T) : P_π(s_0,a_0,...,s_T,a_T | s_0 = s) > 0}  Σ_{t=0}^{T} γ^t r(s_t, a_t).
3.3 Solving MDPs through Linear Programming
Recall the following standard Linear Programming (LP) formulation and its dual: min_x b^⊤x, subject to Ax ≥ c.
Consider an MDP, defined by (S, A, P, r, ρ, γ), with the following discounted-return criterion over an infinite time horizon:
J^π = E^{π,ρ} [ Σ_{t=0}^{∞} γ^t r(s_t, a_t) ],
where E^{π,ρ} is, as usual, the expected value given a control policy π and initial state distribution s_0 ∼ ρ. Our goal is to find the optimal value function:
V* = max_π J^π.     (1)
Recall that for a fixed policy π, the value function can be found by solving the following set of linear equations:
v(s) = Σ_{a∈A} π(a|s) [ r(s, a) + γ Σ_{s′∈S} P(s′|s, a) v(s′) ],   ∀s.
a. Write the primal LP for the optimization problem (1), where the primal variables are the state values.
b. Write the dual LP for this problem (where the dual variables are the discounted state-action frequencies).
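A sketch of the primal LP of part (a) solved with a generic LP solver; the array shapes and the use of scipy.optimize.linprog are assumptions made for illustration, not part of the exam:

import numpy as np
from scipy.optimize import linprog

def solve_mdp_primal_lp(P, r, rho, gamma):
    # P: (S, A, S) transition tensor, r: (S, A) rewards, rho: (S,) initial distribution.
    # Primal LP: minimize rho^T v subject to v(s) >= r(s, a) + gamma * sum_s' P(s'|s, a) v(s') for all (s, a).
    S, A = r.shape
    A_ub, b_ub = [], []
    for a in range(A):
        A_ub.append(gamma * P[:, a, :] - np.eye(S))   # (gamma * P_a - I) v <= -r_a
        b_ub.append(-r[:, a])
    res = linprog(c=rho, A_ub=np.vstack(A_ub), b_ub=np.concatenate(b_ub),
                  bounds=[(None, None)] * S, method="highs")
    return res.x                                      # the optimal value function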
Suppose we want to solve the MDP, but we want our solution policy to be close to a given stationary policy µ(a|s) (note that µ is stochastic in general). Specifically, fix ϵ > 0 and impose the constraint ∀s ∈ S: max_{a∈A} |π(a|s) − µ(a|s)| ≤ ϵ.
c. Show (by an example or a convincing argument) that the optimal policy in this case is not necessarily
deterministic, an example with more than one optimal policy will not be accepted. (* Bonus: show the
same applies even if µ is deterministic).
d. Rewrite the dual LP to fit the suggested constraint, your formulation should not contain non-linear
functions (like absolute values).
e. How many variables and how many constraints will there be in the primal LP now?
g. Show this policy iteration algorithm converges in a finite number of steps to the optimal policy.
3.4 Simple MDP – Analysis
Consider the n-state MDP given in Figure 1. In each state there are two actions: Left, which leads to the starting state 1, and Right, which takes us one step closer to the rightmost state n. In state n, any action leads to the leftmost state s = 1 and a reward of 1 is given (r(s = n, a) = 1). The reward for every other state and action is 0. The discount factor is γ, so a delay in reaching the goal inflicts an exponential reduction in the total reward.
We apply the uniform exploration policy π: π(a = right|s) = π(a = left|s) = 0.5.
d. Run value iteration starting from v^0(s) = 0 to find v^1(s), v^2(s), v^3(s). Write v^∞(s) using V^π(s) (see the numerical sketch after this question).
e. Starting from state s = 1, how many steps in expectation will it take to reach state s = n?
f. Starting from s = 1, under a policy of your choice, what is the minimal number of steps it will take to
visit every state-action pair at least once?
g. Let’s say we employ an optimistic exploration scheme, where every previously unvisited state-action pair
is assumed to have reward 1. Starting from s = 1, how many steps will it take to visit every state-action
pair at least once?
h. Find V π (s).
Now we want to employ a function approximation scheme to evaluate the value function.
i. Find a feature vector that will guarantee zero approximation error.
Assume we are using the constant feature vector: ϕ(s) = 1, and that n = 3.
k. Find Φw* - the fixed point of Π_π T^π, where T^π is the Bellman operator corresponding to π.
l. Find Π_π V^π.
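The numerical sketch referred to in item (d): synchronous application of the fixed-policy Bellman operator for the uniform policy on this chain MDP (the dynamics are taken from the verbal description above, since the figure is not reproduced here):

def chain_policy_evaluation(n, gamma, sweeps=3):
    # v[s] for s = 1..n; from s < n: Left -> state 1, Right -> state s + 1, reward 0;
    # from s = n: any action -> state 1, reward 1. The policy picks Left/Right with probability 0.5 each.
    v = {s: 0.0 for s in range(1, n + 1)}
    history = [dict(v)]
    for _ in range(sweeps):
        v_new = {}
        for s in range(1, n + 1):
            if s == n:
                v_new[s] = 1.0 + gamma * v[1]
            else:
                v_new[s] = 0.5 * gamma * v[1] + 0.5 * gamma * v[s + 1]
        v = v_new
        history.append(dict(v))
    return history          # history[k] is v^k of item (d); iterating further approaches V^pi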
3.5 Solving MDPs using sampling
The Phased Value Iteration algorithm solves discrete, finite-horizon MDPs, while not requiring access to their exact model. Instead, it queries a simulator G(M), to which state-action pairs (s, a) are given as input, and a next state s′ and reward r are returned as output.
The algorithm is presented below.
P̂_t(s′|s, a) = (# of times (s, a) → s′) / m
4: set
V̂_t(s) = max_{a∈A} [ r(s, a) + E_{s′∼P̂_t(·|s,a)} V̂_{t+1}(s′) ]
π̂_t(s) = arg max_{a∈A} [ r(s, a) + E_{s′∼P̂_t(·|s,a)} V̂_{t+1}(s′) ]
5: end for
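A sketch of the phased value iteration loop above, assuming a finite-horizon undiscounted model with known rewards and a simulator that returns next-state samples (the simulator interface is an assumption):

import numpy as np

def phased_value_iteration(simulator, S, A, r, T, m):
    # simulator(s, a, size) returns `size` i.i.d. next-state indices drawn from P(.|s, a).
    V = np.zeros((T + 1, S))                 # V[T] is the terminal value (zero)
    pi = np.zeros((T, S), dtype=int)
    for t in range(T - 1, -1, -1):           # backwards over the T phases
        Q = np.zeros((S, A))
        for s in range(S):
            for a in range(A):
                samples = simulator(s, a, m)                    # m simulator calls per (s, a) and phase
                P_hat = np.bincount(samples, minlength=S) / m   # empirical P_hat_t(.|s, a)
                Q[s, a] = r[s, a] + P_hat @ V[t + 1]
        V[t] = Q.max(axis=1)                 # V_hat_t
        pi[t] = Q.argmax(axis=1)             # pi_hat_t
    return pi, V
# The loop above calls the simulator |S| * |A| * T * m times.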
a. How many calls to the simulator are made during the run of Alg. (1)?
meaning the two expectations, using the original P and the estimated one, P̂, are relatively close.
c. Next, show that
V_{π̂,t−1}(s) − V̂_{t−1}(s) ≤ max_{s′} ( V_{π̂,t}(s′) − V̂_t(s′) ) + ϵ,
d. Using the two previous results (from parts (b) and (c)), show that
e. Assuming from now on the MDP is stationary, write ϵ as a function of ϵ̃ and the different terms in the
question, where ϵ̃ is defined so that
Vπ (s) ≥ V ∗ (s) − ϵ̃.
f. We now turn to justifying the assumption in Eq. (2). Show that the probability that the assumption in Eq. (2) fails is smaller than 2·|S|·|A|·T·exp(−ϵ̃²m / (2T²)).
g. Prove the following theorem for the choice m = O( (T²/ϵ̃²) log(|S||A|T/δ) ).
Theorem 1. The phased value iteration algorithm calls the simulator G(M)
O( (|S||A|T³/ϵ̃²) log(|S||A|T/δ) )
times and, with probability greater than 1 − δ, returns a policy π such that for every state s, V_π(s) ≥ V*(s) − ϵ̃.
3.6 Bellman equation for Second Moment
Consider an infinite-horizon MDP (S, A, R, P, γ) with a given, fixed policy π. We define M^π(s) as the second moment of the discounted return when starting from state s and following policy π,
M^π(s) = E^π [ ( Σ_{t=0}^{∞} γ^t r(s_t, a_t) )² | s_0 = s ].
b. Given that V π is fixed, is the operator you received a contraction? If the answer is positive, prove it. If
not, convincingly show why or find a counter example.
c. Given that V π is fixed, suggest an iterative algorithm for learning M π and prove its convergence. Assume
there is access to a simulator and that S is finite.
3.7 Investor Planning Problem
An investor has a call option to buy one share of a stock at a fixed price of p and has T days to exercise this
option. He can exercise the option each day. When the stock price is x, he gets (x − p). The investor may decide
not to exercise this option at all. Assume that the price of the stock Xt on day t is
X_t = X_0 + Σ_{k=1}^{t} W_k
a. Define the corresponding action set, model dynamics, reward function and objective function for finite
MDPs, as learned in class.
V_t(x, 1) = 0,
V_t(x, 0) = max{ x − p, E[V_{t+1}(x + W_{t+1}, 0)] }.
1. ∀t : Vt (x, 0) is increasing in x.
2. ∀t : (Vt (x, 0) − x) is decreasing in x.
3. ∀x : (Vt (x, 0) − x) is decreasing in t.
d. Let X*_t = {x : x − p ≥ E[V_{t+1}(x + W, 0)]} and x*_t = min X*_t. Show that {x*_t}_{t=1}^{T} is decreasing. Hint: use an inclusion relation.
e. Use the previous result to describe (in words and/or math) the type of optimal policy in our problem.
3.8 Bellman Operators and Linearly Solvable MDPs
In this question we will focus on a specific class of MDPs, M = {S, A, P, c, γ}, where the action space is continuous, u ∈ R^S. The 'passive' dynamics Markov chain is p_0(s′|s) and its corresponding transition matrix is p_0. Next, the dynamics of M is given by
where the notation u_s means the component of u that corresponds to the state s.
Let β > 0. The cost of the model is
and remember that DKL (q || p) ≥ 0 with equality iff p(·) = q(·) point-wise. The DKL term represents a cost
for controlling the system, i.e, when changing the ‘passive’ dynamics, p0 , by using u > 0.
The Bellman equation has the following form:
v(s) = min_{u∈U(s)} ( c(s, u) + γ Σ_{s′} p(s′|s, u) v(s′) ).     (3)
b. What is the constraint u should satisfy for p(·|s, u) to be a valid probability function? (No need to solve the constraint.)
c. Plug p(· | s, u) into the Bellman equation (3), and define the corresponding optimal Bellman operator.
Prove it is a contraction mapping in the max-norm and find its contraction coefficient.
From now on assume β = 1. Since the found Bellman operator is a contraction mapping, we can calculate the
optimal value by repeatedly applying it. However, since the action space is continuous, applying it is expected to
be problematic without solving analytically the minimization problem.
d. Solve the minimization problem defined in (c) with the constraint defined in (b). Find u∗ which solves the
constraint minimization problem (Hint: solve using Lagrange multipliers, see end of formula sheet).
e. Set γ = 1. Plug u* into the Bellman equation (3), rewrite the Bellman equation using the transformation z(s) = exp(−v(s)), and express it in the form z(s) = g(s) Σ_{s′} p_0(s′|s) z(s′). What is g(s)?
where g is defined in section (e), and p_0 is the 'passive' dynamics. Assume T is an α-contraction mapping with a corresponding fixed point z*. In this section, assume access to a simulator which is given an action, u, and returns a 3-tuple (s_n, g(s_n), s_{n+1}). You can use all the assumptions of the Q-learning algorithm.
Answer the following
1. Suggest a Stochastic Approximation algorithm that converges to z ∗ and prove its convergence. Make
sure to be clear about what is the policy with which your algorithm interacts with the environment.
2. What is the relation between z ∗ and the solution of the Bellman equation in (3), v ∗ ?
3.9 MDPs with Non-Linear Objectives
Consider an MDP M with finite state space S and finite actions space A, transitions P (s′ |s, a), discount factor
γ ∈ (0, 1), a fixed initial state sinit , and two reward functions: r1 (s) and r2 (s). In this question we will consider
objectives that are a function of both r1 and r2 .
For a Markov policy π, we denote the discounted returns J_1^π and J_2^π as:
J_1^π = E^π [ Σ_{t=0}^{∞} γ^t r_1(s_t) | s_0 = s_init ],
J_2^π = E^π [ Σ_{t=0}^{∞} γ^t r_2(s_t) | s_0 = s_init ].
Note that the initial state is fixed, and that J1π , J2π denote scalar returns and not value functions.
Let f(x, y) be some function of two variables. For some policy π we denote J^π = f(J_1^π, J_2^π). We wish to find a policy π* that maximizes J^π:
J* = max_π J^π,    π* ∈ argmax_π J^π.
a. For f(x, y) = αx + βy, propose a standard MDP with a single reward r̂ such that its optimal policy is π*. Explain.
For the rest of this question, we consider the function f(x, y) = x/y. Furthermore, we assume the following bounds on the rewards: 0 < r_min ≤ r_1(s) < r_2(s) ≤ r_max, for all s ∈ S.
b. Can the standard MDP solution approaches (value iteration, policy iteration) be used to find π ∗ in this
case? Explain (no need to prove formally).
For some ρ ∈ [0, 1], consider a standard MDP M_ρ with the same S, A, P, γ as M and reward r̂(s) = r_1(s) − ρ r_2(s). Denote the discounted reward for a policy π in M_ρ as J_ρ^π = E^π [ Σ_{t=0}^{∞} γ^t r̂(s_t) | s_0 = s_init ].
g. Let Jρ∗ denote the optimal value in Mρ . Show that Jρ∗ is monotonically decreasing in ρ.
Hint: start by showing monotonicity for a fixed policy.
h. Based on (d-g), propose an approach for finding the optimal policy π ∗ . Technically, you can assume that
Jρ∗ is continuous in ρ, and you can invoke any standard MDP solver in your solution.
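A sketch of the approach suggested by parts (g)-(h) for f(x, y) = x/y: since J*_ρ is continuous and monotonically decreasing in ρ, binary-search for the root ρ* of J*_ρ = 0 with any standard MDP solver. Here solve_mdp is a hypothetical solver returning the optimal return from s_init together with an optimal policy of M_ρ:

def solve_ratio_objective(solve_mdp, r1, r2, tol=1e-8):
    # r1, r2: per-state reward lists with 0 < r1(s) < r2(s), so J*_0 > 0 and J*_1 < 0.
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        rho = (lo + hi) / 2
        value, _ = solve_mdp([x - rho * y for x, y in zip(r1, r2)])   # solve M_rho
        if value > 0:
            lo = rho          # J*_rho is still positive: the root is to the right
        else:
            hi = rho
    return solve_mdp([x - hi * y for x, y in zip(r1, r2)])[1]         # policy of M_rho at rho close to rho*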
3.10 Belief MDP
Remark: while this question concerns a bandit, we will solve it optimally using MDP methods.
Consider a two-armed bandit, with i.i.d. reward distributions Bernoulli(p1 ) and Bernoulli(p2 ). The success
probabilities p1 and p2 are known in advance. However, we do not know which of the two arms, denoted A1 and
A2 , is associated with p1 and which is with p2 . That is - only the permutation of the arms is not known.
Let e ∈ {0, 1} be a random variable that represents the arm permutation. That is, e = 1 means that arm A1 is
assigned with p1 and A2 with p2 , and e = 0 means that arm A1 is assigned with p2 and A2 with p1 . We denote
by θ0 = P (e = 1) the permutation probability, and we assume that θ0 is known (though e is not observed).
The permutation is drawn before the agent starts playing, and does not change afterwards. The agent pulls arms
for T time steps, and we denote by at ∈ {A1 , A2 } the action taken at time t, and by Rt the resulting (stochastic,
i.i.d.) reward observed. The agent's goal is to maximize the T-horizon expected cumulative reward:
max_π E^{e,π} [ Σ_{t=0}^{T−1} R_t ].     (4)
Note that the expectation is taken with respect to the random permutation e, the random rewards, and the
(possibly) random action selection policy π.
We denote by ht = {a0 , R0 , . . . , at−1 , Rt−1 } the history up to time t. The general solution to Eq. (4) is a history
dependent policy π(ht ). In the rest of this question, however, we will show that we can construct an MDP with a
special state θ for solving Eq. (4).
d. We denote by θt (ht ) = P (e = 1|ht ) the belief about the arm permutation at time t, given the history of
rewards and actions until that time. Assume that at−1 = A1 , and a reward Rt−1 = 1 was observed. Show
that θt (ht ) can be written as a function of θt−1 (ht−1 ), p1 , p2 , as follows:
P(e = 1|h_t) = p_1 θ_{t−1}(h_{t−1}) / ( p_1 θ_{t−1}(h_{t−1}) + p_2 (1 − θ_{t−1}(h_{t−1})) ).
Hint: use Bayes rule. You can assume that given ht , the action at and e are independent (the policy only
depends on the history and not on any other variable).
e. Using θt (ht ), write down the reward probability given the history P (Rt = 1|ht ), for the case that at = A1 .
Write down an expression for the expected reward R̄(θt (ht ), at ) = E[R(at )|ht ], for the case at = A1 .
f. Given θt (ht ), and assuming that at = A1 , what are the possible values of θt+1 ? What are their probabili-
ties?
g. Define the belief MDP, where the state at time t is θt , the action space is {A1 , A2 }, the reward is R̄(θt , at ),
as calculated in (e), and where P (θt+1 |θt , at ) is as calculated in (f). Show that the optimal policy in the
belief MDP is optimal for the original problem Eq. (4).
h. Is the state space of the belief MDP discrete or continuous? Given θ0 , how many different states θT are
possible at time T ? Suggest an algorithm for solving the belief MDP. How does the algorithm complexity
depend on T ?
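A small sketch of the belief machinery from parts (d) and (f)-(h): the Bayes update of θ and an enumeration of the belief states reachable after T steps (the rounding is only used to merge numerically identical beliefs):

def belief_update(theta, arm, reward, p1, p2):
    # Posterior P(e = 1 | h_t) after pulling `arm` in {"A1", "A2"} and observing `reward` in {0, 1}.
    if arm == "A1":
        like1, like0 = (p1, p2) if reward == 1 else (1 - p1, 1 - p2)   # likelihoods under e = 1 / e = 0
    else:
        like1, like0 = (p2, p1) if reward == 1 else (1 - p2, 1 - p1)
    return like1 * theta / (like1 * theta + like0 * (1 - theta))

def reachable_beliefs(theta0, p1, p2, T):
    # The set of possible values of theta_T, relevant to part (h).
    beliefs = {round(theta0, 12)}
    for _ in range(T):
        beliefs = {round(belief_update(th, arm, rew, p1, p2), 12)
                   for th in beliefs for arm in ("A1", "A2") for rew in (0, 1)}
    return beliefs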
3.11 Exploration Conscious MDPs
In class, we have treated ϵ-greedy exploration as a means to find an optimal deterministic policy. However, the
stochastic exploration can lead the agent to bad rewards during learning. In this question, we assume that the
agent wants to obtain high rewards during learning too, and change the usual definition of optimality such that it
will take the exploration into account.
Consider a discounted MDP M = (S, A, r, P, γ) defined as usual. Assume that for any policy π and from any state s, the agent acts with probability ϵ according to a fixed, pre-determined policy π_0, and with probability 1 − ϵ according to the policy π; i.e., the effective policy is
µ_{π,ϵ,π_0}(a|s) = (1 − ϵ) π(a|s) + ϵ π_0(a|s).
For example, when setting π_0 to be the uniform policy we obtain ϵ-greedy exploration.
The value of the policy µ_{π,ϵ,π_0}(a|s) from a state s on the MDP M is defined by
V_M^{µ_{π,ϵ,π_0}}(s) = E [ Σ_{t=0}^{∞} γ^t r(s_t, a_t) | s_0 = s, µ_{π,ϵ,π_0}, P ],
in words, the expected value of the policy µπ,ϵ,π0 when the dynamics is P and the reward is r.
We define the optimal exploration-conscious policy π*_{ϵ,π_0} as the policy that maximizes this value:
π*_{ϵ,π_0} ∈ arg max_π V_M^{µ_{π,ϵ,π_0}}(s).     (5)
At first sight, the optimization in Eq. (5) might seem hard. However, as you will show in the following, it is
equivalent to solving a different MDP. Define an MDP Mϵ,π0 = (S, A, rϵ,π0 , Pϵ,π0 , γ), where:
P_{ϵ,π_0}(s′|s, a) = (1 − ϵ) P(s′|s, a) + ϵ P^{π_0}(s′|s),   and   r_{ϵ,π_0}(s, a) = (1 − ϵ) r(s, a) + ϵ r^{π_0}(s),
where P^{π_0}(s′|s) = Σ_a π_0(a|s) P(s′|s, a) and r^{π_0}(s) = Σ_a π_0(a|s) r(s, a). Furthermore, denote the (standard, non-exploration-conscious) value of a policy π on the MDP M_{ϵ,π_0} as V^π_{M_{ϵ,π_0}}.
b. Let T^π_{M_{ϵ,π_0}} and T_{M_{ϵ,π_0}} denote the fixed-policy (for some policy π) and optimal Bellman operators for the MDP M_{ϵ,π_0}. Let T^π_M and T_M denote the fixed-policy (for some policy π) and optimal Bellman operators for the MDP M. Show that
T_{M_{ϵ,π_0}} = (1 − ϵ) T_M + ϵ T^{π_0}_M   and   T^π_{M_{ϵ,π_0}} = (1 − ϵ) T^π_M + ϵ T^{π_0}_M = T^{µ_{π,ϵ,π_0}}_M.
c. Based on section (b), prove that V^π_{M_{ϵ,π_0}} = V^{µ_{π,ϵ,π_0}}_M for any policy π.
d. Based on (b) and (c), prove that π*_{ϵ,π_0} = π*_{M_{ϵ,π_0}}, where π*_{ϵ,π_0} is defined in (5) and π*_{M_{ϵ,π_0}} is the optimal policy of the MDP M_{ϵ,π_0}.
e. Assume we do not have access to P and r; however, we can interact with the environment. Suggest a Q-learning algorithm from which we can obtain µ*_{ϵ,π_0} of (5) in the limit (you can assume all standard assumptions for the convergence of Q-learning hold). Hint: define the optimal Bellman operator for the Q-function of the MDP M_{ϵ,π_0}, denoted by T_{Q,M_{ϵ,π_0}}. Then find an update rule that in expectation is equivalent to the application of T_{Q,M_{ϵ,π_0}}.
3.12 Average Policy Improvement Theorem
Consider a discounted MDP with states s ∈ S, actions a ∈ A, state dependent reward r(s), transitions P (s′ |s, a),
initial state distribution µ(s), and discount factor γ.
We define the usual value and Q functions for a Markov policy π:
V^π(s) = r(s) + γ Σ_{s′} P(s′|s, π) V^π(s′),
Q^π(s, a) = r(s) + γ Σ_{s′} P(s′|s, a) V^π(s′),
and the discounted reward performance criterion (averaged over initial states):
J^π = Σ_s µ(s) V^π(s).
Consider two Markov policies π and π′. In this question we will characterize the performance gap J^{π′} − J^π.
Define the advantage of a state s as A^{π′}_π(s) = Σ_a π′(a|s) ( Q^π(s, a) − V^π(s) ).
We define d^π(s) = Σ_{t=0}^{∞} γ^t P(s_t = s | π, s_0 ∼ µ) to be the discounted state visitation frequencies. In the next 3 clauses we will prove the following identity:
J^{π′} − J^π = Σ_s d^{π′}(s) A^{π′}_π(s).     (6)
We will next show that Eq. (6) is a generalization of the policy gradient theorem. Consider a policy parameterized
by θ (you can assume that θ is a scalar, and that πθ is differentiable). Let us define J(θ) = J πθ .
e. Use Eq. (6) to calculate ∂J(θ)/∂θ explicitly, and show that you obtain the policy gradient theorem.
3.13 Time-Varying Bellman Operators
An MDP is defined by: (S, A, P, R, γ), where S, A are the state and action spaces, P is the dynamics, R is the
reward and γ ∈ (0, 1) is the discount factor. In this question, we consider MDPs with varying reward functions.
Let R ⊆ R^{|S|·|A|} denote a set (not necessarily finite) of deterministic reward functions (i.e., R ∈ R). Let {R_i}_{i=1}^{∞} ⊆ R be some arbitrary sequence of reward functions.
a. For every R ∈ R we define a Bellman operator: (T_R V)(s) = max_{a∈A} ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) V(s′) ).
Let {T_i}_{i=1}^{∞} be the sequence of Bellman operators corresponding to the sequence of rewards above (i.e., T_i = T_{R_i}). For ease of notation, ∀i < j we denote T_{i:j} V = T_j T_{j−1} ... T_{i+1} T_i V.
(1) For any i, is T_i a γ-contraction?
(2) Prove or give a counterexample: the sequence T_{1:K} V_0 converges as K → ∞ for any V_0 and any sequence of rewards {R_i}_{i=1}^{∞}.
From now on, assume limi→∞ ||Ri − R∗ ||∞ = 0, and let (T ∗ V )(s) = (TR∗ V )(s).
In the following sections (b-d) we aim to show that iteratively applying Ti for i = 1, 2, ... converges to the fixed
point of T ∗ , for any V0 ∈ R|S| .
b. Let V_1, V_2 ∈ R^{|S|} and M > 0 be such that ||V_1 − V_2||_∞ ≤ M. Show that for any i ∈ N:
d. Use sections (b) and (c), along with the assumption that lim_{i→∞} ||R_i − R*||_∞ = 0, to prove that ∀V_0 ∈ R^{|S|} and ∀ϵ > 0, ∃K_0 ∈ N s.t. ∀K > K_0:
||T_{1:K} V_0 − (T*)^K V_0|| < ϵ.
Conclude that
lim_{K→∞} T_{1:K} V_0 = lim_{K→∞} (T*)^K V_0.
e. We now consider deterministic dynamics, i.e., s′ = f (s, a). Consider the Q-learning algorithm with
time-varying rewards:
Initialize arbitrary Q0
for t=0,1,2,... do
Observe st , at , s′t , Rt (st , at )
Qt+1 (st , at ) = Rt (st , at ) + γ maxa′ ∈A Qt (s′t , a′ )
end for
Assume that limt→∞ ||Rt − R∗ ||∞ = 0, and all states and actions are visited infinitely often. Prove
that the algorithm converges to Q∗ , the fixed point of the Bellman operator: (T ∗ Q)(s, a) = R∗ (s, a) +
γ maxa′ ∈A Q(s′ , a′ ). You can use the following result: the sequence xt+1 = γxt + ϵt converges to 0 if
γ < 1 and ϵt → 0.
3.14 Contextual MDP
Since we have no control over car B, the dynamics of the MDP might change significantly between different
drivers (e.g. some would drive faster when car A is closer and others would drive slower or even break). Assume
that you have a finite set of possible dynamics, P (the state and action are omitted for the ease of notation),
induced by the possible policies of car B.
b. We want to use P to find an optimal policy for the worst case scenario. Suggest a model (from the models
you have seen in class) for this task, formulate it, and describe a value-iteration like algorithm to compute
the optimal policy.
In reality, we might have some hints about the policy of car B (i.e., type of car, weather, etc.), and we would want to use them to produce a better policy. More formally, consider an episodic scenario, where at the beginning of each episode we observe some vector α ∈ R^{|P|}, such that α represents a probability vector over the possible dynamics P ∈ P (i.e., α ≥ 0 and Σ_{i=1}^{|P|} α_i = 1). At each time-step, the dynamics of the environment are drawn from the distribution induced by α. When one episode ends and another starts, we observe a new α.
* Note that in this setting α is fully-observable and given at the beginning of each episode.
c. Modify the state space from clause (a) so it fits the new setting. How many states are there?
Definition: Contextual MDP (CMDP) is a tuple (C, S, A, M), where C is the context-space, S is the state-space,
A is the action-space and M is a mapping, M : C → (P, R, s0 ), such that for each context c ∈ C, M(c) defines
dynamics (P c ), reward (Rc ) and an initial state (sc0 ). The main idea here is that each context defines a single
MDP, (S, A, Rc , P c , sc0 ), where the state and actions spaces are shared across all the MDPs defined by the
contexts.
d. Using the CMDP model for this problem, how would you define the context space and the state space? Is it possible to use a tabular model in this case? How?
e. (1) Think of a 'lazy' learning scheme, in which we do not learn the optimal policy in advance; instead, at the beginning of each episode, after we observe α, we compute the optimal policy using a simulator and play accordingly through the episode. Consider the models from clauses (c) and (d): how would you use each of them for this task? Which one is preferable for this setting, and why?
(2) What are the advantages/disadvantages of using the model from clause b over the models from
clauses c and d?
f. Assume you know the optimal policy for each P ∈ P separately; denote them by {π_i}_{i=1}^{|P|}. Prove or give a counterexample: for some α, the optimal policy can be computed by a linear combination of the known policies, i.e., π^α(a|s) = Σ_{i=1}^{|P|} α_i π_i(a|s) (here π(a|s) represents a probability).
4 Bandits
4.1 MAB with 2 arms per round
Consider a stochastic MAB (multi-armed bandit) problem, where at each round we need to choose two different arms (and not just one). All other assumptions are similar to the model studied in class (i.i.d. bounded reward per arm, etc.). The goal is to minimize a suitable regret measure.
Notation: Let S be the set of arms. For each arm a ∈ S, let (R^a(n), n ≥ 0) be the sequence of i.i.d. rewards associated with that arm, taking values in [0, 1] and with mean µ_a = E(R^a(n)). Further, let n^a_t be the number of times arm a has been chosen up to (and including) time t, let a(t) and b(t) denote the two arms chosen at round t, and let
r_t = r_t^{a(t)} + r_t^{b(t)}
denote the reward obtained at time t, with r_t^a = R^a(n^a_{t−1} + 1). (Both rewards r_t^{a(t)} and r_t^{b(t)} obtained at the chosen arms are observed by the learning agent, i.e., the agent knows which arm produced which reward.)
a. Propose an appropriate definition for regret(T ), the T -stage expected regret, which is suitable for this
problem. Express it precisely in mathematical terms.
b. Show that this problem can be re-formulated as a standard bandit problem, with each pair of bandits (or arms) considered a new "super-bandit". Describe precisely the obtained model.
c. Apply the UCB1 algorithm to the “super-bandit" model of (b). Describe the algorithm, and obtain an
upper bound on the expected regret (using the available results for the standard problem). Discuss briefly
the shortfalls of the super-bandit approach, and how they are reflected in your bound.
d. Propose an extension of the UCB1 algorithm which is suitable for the original problem (and does not use
the super-bandit model). Justify your suggestions verbally.
e. Prove a (non-trivial) upper bound on the expected regret of this algorithm when the expected value of the
two best arms happens to be identical. (Note attached Theorem 6 and its proof from the lecture notes.)
Discuss the improvement relative to (c).
f. [Bonus problem] Derive an upper bound on the algorithm proposed in (d), without the additional assump-
tion of (e) above.
4.2 Exploratory Multi-armed Bandits
In this question we consider the exploratory Bernoulli multi-armed bandit where, instead of finding the arm with the highest reward, we want to find the arm with the median reward. Notation:
• We look for (ϵ, δ)-PAC algorithms such that an ϵ-median arm is chosen with probability at least 1 − δ.
• Throughout the question you can ignore n's parity and assume there is one median arm a_med (if you wish, you may assume n is an odd number).
Input: ϵ > 0, δ > 0
Output: An arm
Set S_1 = A, ϵ_1 = ϵ/4, δ_1 = δ/2, ℓ = 1.
repeat
  Sample every arm a ∈ S_ℓ for (1/(ϵ_ℓ/2)²) log(6/δ_ℓ) times, and let p̂_a^ℓ denote its empirical value
  Find the upper and lower quartiles of the p̂_a^ℓ, denoted by q_ℓ^l and q_ℓ^u respectively
  S_{ℓ+1} = S_ℓ \ {a : p̂_a^ℓ < q_ℓ^l or p̂_a^ℓ > q_ℓ^u}
  ϵ_{ℓ+1} = (3/4) ϵ_ℓ; δ_{ℓ+1} = δ_ℓ/2; ℓ = ℓ + 1
until |S_ℓ| = 1
d. Bound the failure probability of the following event:
e. Let j be some non-ϵ_1-median arm, and assume the failure event from the previous clause does not happen. Bound the failure probability of the following events:
f. Let #bad be the number of arms which are not ϵ1 -optimal but empirically change relation with respect to
the median arm - if they were lower than the median then their estimate is higher, and if they were higher
than the median, then their estimate is lower. Bound the following probability:
g. Prove that the probability of a failure in the first round is lower than δ1 (a failure is the case where no
ϵ1 -optimal arm was left).
4.3 (ϵ, δ)-Finder
You are given k arms, each with an unknown Bernoulli distribution. We assume that the distribution of arm i has mean µ_i. You can assume that k is an odd number.
We say that an algorithm is (ϵ, δ)-median finder if it outputs an ϵ approximation of the median with a probability
of at least 1 − δ.
a. Suppose that we sample each arm exactly T times and output the empirical median. How large should T be so that this algorithm is an (ϵ, δ)-median finder? Prove that the algorithm is indeed an (ϵ, δ)-median finder.
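A sketch of one sufficient choice for part (a), using Hoeffding's inequality and a union bound so that every empirical mean is within ϵ/2 of its true mean (tighter constants are possible):

import math

def empirical_median_arm(pull, k, eps, delta):
    # pull(i) returns one Bernoulli sample of arm i (0-indexed); k is assumed odd.
    T = math.ceil((2.0 / eps ** 2) * math.log(2.0 * k / delta))    # per-arm sample size
    means = [sum(pull(i) for _ in range(T)) / T for i in range(k)]
    order = sorted(range(k), key=lambda i: means[i])
    return order[(k - 1) // 2]                                     # the empirical median arm

If all k estimates are ϵ/2-accurate (probability at least 1 − δ by the union bound), the arm whose empirical mean is the median has a true mean within ϵ of the true median value.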
b. Bonus: Suppose now that the values of µi are known but their identity is not. Propose an (ϵ, δ)-median
finder algorithm with improved sample complexity dependence on the actual values of µi .
We now consider the problem of finding the range: R = maxi µi − mini µi . We say that an algorithm is
(ϵ, δ)-range finder if it outputs an ϵ approximation of R with probability of at least 1 − δ.
c. Suppose that we sample each arm T times and then output the empirical range as the estimate. What should T be so that we have an (ϵ, δ)-range finder? Prove your claim.
d. Propose another algorithm with improved sample complexity in k. Prove a sample complexity bound.
Hint: you may use algorithms studied in class as subroutines.
4.4 Reward Swapping Arms
Consider a two-arms multi-armed bandit setting where the arms are Bernoulli-distributed, with expected rewards
µ+ = 0.5 + ϵ, µ− = 0.5 − ϵ, where ϵ > 0 is some fixed constant. Every T time-steps, a switch may (but does
not have to) occur, in which case the two arms swap their reward function. Let n = kT , where k is an integer.
We define the regret at time n as
R(n) = n µ_+ − E[ Σ_{t=1}^{n} µ(I_t) ],
a. Propose an algorithm for minimizing the regret when log T >> O(1/ϵ), and compute the resulting regret at time n as a function of T, k, ϵ.
b. Propose an algorithm for minimizing the regret when T is small (T = 1 or T = 2, for example), and compute the resulting regret as a function of T, k, ϵ.
c. Consider the following case: k = 2 and ϵ = 1/n (you can assume that n is large). Propose an algorithm for minimizing the regret in that case, and compute the resulting regret as a function of T, k, ϵ.
d. Suppose now that the switch may occur not more than once throughout the whole process. Argue that √n is a lower bound on the regret in this case. Hint: consider how to choose k and T such that "enough" exploration is forced.
4.5 Median Finder
We are given a population of K i.i.d. bandits. When pulling arm i, an i.i.d. reward is received with mean µ_i. You may assume that the arms are Bernoulli random variables and that K is an odd number.
Note: Recall that the median of K numbers ordered a_1 ≤ a_2 ≤ · · · ≤ a_K is the number a_{(K+1)/2}.
a. Suppose we want to find the median arm of the K arms. Propose an algorithm for finding (with probability
of at least 1 − δ) the median arm up to an ϵ-error. Prove its correctness and bound its sample complexity.
b. Suppose now that we want an algorithm for finding the median, but for ϵ = 0. That is, we want an algorithm that exactly finds the median arm. Propose an algorithm (that is correct with probability at least 1 − δ) and analyze its sample complexity in terms of the unknown expectations of the arms.
Note: You are not required to prove the sample complexity of the algorithm.
c. Suppose now that we want to repeat (a.) above but with δ = 0 (i.e., find the median arm with certainty). Is
there an algorithm that can find the median in finite time with certainty? Explain.
d. We now change the problem to asking how many arms have expectations between µmin and µmax ? That
is, we want to figure out how many of the K arms have expectation exactly between these numbers
(with probability at least 1 − δ). More precisely, given K and δ we need to give a finite time horizon T
(which may depend on K and δ) by which we want to be able to say how many of the K arms satisfy that
µmin ≤ µi ≤ µmax ? (I.e., find |{i : µmin ≤ µi ≤ µmax }|.) Does such an algorithm exist? If so, propose
it. If not then explain why.
4.6 Constrained Bandits
We are given K arms. When pulling arm i a vector (ri , ci ) is received, where ri and ci are iid random variables
bounded by 1 with (initially unknown) expectations E[ri ], E[ci ]. The algorithm is supposed to return one of the
arms which we denote by I.
a. We say that an algorithm is (ϵ, δ)-PAC constrained if for all strictly positive ϵ, C_0, δ
P( I = arg max_{i∈[K]} { E[r_i] s.t. E[c_i] ≤ C_0 + ϵ } ) ≥ 1 − δ.
Devise such an algorithm and give (but don't prove) sample complexity bounds (which may depend on E[r_i], E[c_i]).
b. Assume that always ∃i s.t. E[ci ] ≤ C0 . Does there exist a (0, δ)-PAC constrained algorithm whose
expected sample complexity (i.e., time to return I) is bounded?
c. We say that an algorithm is (ϵ, δ, η)-PAC constrained if for all strictly positive ϵ, C_0, δ, η
P( E[r_I] ≥ max_{i∈[K]} { E[r_i] s.t. E[c_i] ≤ C_0 + ϵ } − η ) ≥ 1 − δ.
Devise an algorithm that is (ϵ, δ, η)-PAC constrained and give (with proof) sample complexity bounds
that depend on E[ri ], E[ci ]. Note that we do not ask for the most efficient algorithm, but rather ask for a
correct algorithm.
4.7 Budgeted Bandits
We consider a simple bandit problem with K arms that have an unknown reward distribution on [0, 1]. We are
looking to find an ϵ optimal arm, with probability of at least 1 − δ in the usual PAC model. In the following
questions we allow some samples to be “free". We consider a modification of the sample complexity to “budget
complexity" where samples do not add to the budget if sampled from “good” arms (good is defined differently
below).
In the following, we do not ask you to prove any of the bounds, but use existing results and argue their validity.
a. Suppose that we are given a budget of T samples from non-optimal arms (i.e., every time you sample the optimal arm, it costs nothing). In the worst case, for a given ϵ, what is the smallest δ that can be guaranteed (i.e., write δ as a function of T, K and ϵ)? Explain which algorithm you would use to achieve that budget complexity.
b. Suppose now that we are given a budget of T samples from non-ϵ-optimal arms (i.e., every time we sample the optimal arm or an arm that is at most ϵ worse than the optimal arm, it costs nothing). In the worst case, for a given ϵ, what is the smallest δ that can be guaranteed (i.e., write δ as a function of T, K and ϵ)? Explain which algorithm you would use to achieve that budget complexity.
c. Suppose now that we are given a budget of T samples from non-2ϵ-optimal arms (i.e., every time we sample the optimal arm or an arm that is at most 2ϵ worse than the optimal arm, it costs nothing). In the worst case, for a given ϵ, what is the smallest δ that can be guaranteed (i.e., write δ as a function of T, K and ϵ)? Explain which algorithm you would use to achieve that budget complexity.
d. Consider the budget structure of (c) above (i.e., sampling an arm whose expected reward is less than 2ϵ worse than the optimal does not count). Propose an algorithm that takes advantage of the actual arm means (as in the successive elimination algorithm). Provide an expression for its budget complexity and explain it; there is no need to rigorously prove it.
4.8 Concurrent Bandits
Consider a multi-armed bandit problem where there are K arms. Instead of the usual setting, we have N different
multi-armed problems, each of which with K arms. You can assume that each of the N K arms is Bernoulli,
where the mean of arm i in bandit problem ℓ is piℓ . In the following questions, we ask about different variants of
this problem. In all these settings, when pulling arm i, all bandits are activated and their reward revealed one
way or another.
a. Suppose that the objective of the agent is to find ϵ-optimal arms for all problems with probability of at
least 1 − δ. Propose an algorithm for the problem when N = 2 and analyze its sample complexity (i.e.,
how many samples are needed to return an ϵ optimal for each and every arm).
b. Suppose now that N < K is general. Propose an algorithm and analyze its sample complexity. Your
algorithm should scale sublinearly in N .
c. Consider the regret minimization problem. Here, when selecting arm i, the reward obtained is the sum of the rewards for arm i in each bandit problem 1, 2, . . . , N. Define a concept of regret formally and provide a regret-minimizing strategy with sublinear regret.
d. Consider another variant of the regret minimization problem. Here, when selecting arm i, the reward obtained is the maximum of the rewards for arm i over all bandit problems 1, 2, . . . , N. Define a concept of regret formally and provide a regret-minimizing strategy.
4.9 UCB with Unknown T
In class, we analyzed the UCB algorithm for the case in which T, the number of time-steps for which we run the algorithm, is known in advance. In particular, the UCB bonus took the form √(2 log T / n_t(a)), i.e., it depends on T.
In this question, you will analyze two variants that remedy the need to know T in advance, while still obtaining
the same guarantees of the UCB algorithm for any t.
In the following, we denote µ̄_t(a) = (1/n_t(a)) Σ_{i=1}^{n_t(a)} r_i, where the array D(a) = [r_1, .., r_{n_t(a)}] contains all the rewards that were observed from arm a, and n_t(a) = |D(a)| is the number of observed rewards for arm a. Express your answers in Big-O notation. Assume that all logarithms are in base 2, and that the rewards are bounded in [0, 1].
The following bounds might be useful:
1. For a > 1, Σ_{i=1}^{T} a^i ≤ O(a^T).
2. For l ∈ N, Σ_{i=1}^{T} i^{−l} ≤ O(T^{−l+1}).
3. The expected regret of UCB is bounded by O(K + √(log(T) K T)). The first constant term K was omitted in class, and originates from the first K rounds of UCB where every arm is sampled once.
Algorithm 2 Doubling Trick UCB
Initialize: K arms, $T_s = 1$, $T_f = 2$
Initialize reward arrays by trying each arm once
for t = 1, .., T do
    if $t > T_f$ then
        $T_s = T_f$
        $T_f = 2T_f$
        Empty $D_t(a)$ for all a.
        Initialize reward arrays by trying each arm once
    end if
    Act by $a_t \in \arg\max_a \bar\mu_t(a) + \sqrt{\frac{2\log(T_f - T_s)}{n_t(a)}}$
    Observe $r(a_t) \sim P_{a_t}$ and append $r(a_t)$ to $D_{a_t}$.
end for
Part 1 – The Doubling Trick. In Algorithm 2 we describe the Doubling Trick UCB. Instead of using a bonus
term of the form $\sqrt{\frac{2\log T}{n_t(a)}}$, the algorithm assumes a horizon $T_f - T_s$ and uses it instead of the unknown T.
Once $t = T_f$, the horizon $T_f - T_s$ is doubled (hence the name), and all gathered experience is deleted. This
procedure continues until t = T.
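To make the restart-and-double structure concrete, here is a minimal Python sketch of Algorithm 2. It assumes a hypothetical sampling interface pull(a) that returns a reward in [0, 1] for arm a, and it adds a small numerical guard so the log term is positive in the first phase; it is an illustration, not a reference implementation.

import math

def doubling_trick_ucb(K, T, pull):
    """Sketch of Doubling Trick UCB; `pull(a)` returns a reward in [0, 1]."""
    t = 0
    Ts, Tf = 1, 2
    rewards = None  # per-arm reward lists, reset on every doubling

    def restart():
        # Empty D(a) for all arms and try each arm once.
        nonlocal rewards, t
        rewards = [[] for _ in range(K)]
        for a in range(K):
            rewards[a].append(pull(a))
            t += 1

    restart()
    while t < T:
        if t > Tf:
            Ts, Tf = Tf, 2 * Tf   # double the assumed horizon
            restart()             # delete all gathered experience
        # UCB index with the assumed horizon Tf - Ts in place of the unknown T
        horizon = max(Tf - Ts, 2)  # guard so the log term is positive in the first phase
        idx = [sum(rs) / len(rs) + math.sqrt(2 * math.log(horizon) / len(rs))
               for rs in rewards]
        a_t = idx.index(max(idx))
        rewards[a_t].append(pull(a_t))
        t += 1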
a. Bound the expected regret of the UCB algorithm studied in class for horizon Tf − Ts for any Ts ≤ t ≤ Tf .
b. For simplicity, assume log T is an integer. Bound the expected regret of the Doubling Trick UCB algorithm
for any t ≤ T .
Let $A^c$ denote the complement of A. Prove that $\Pr(A^c) \le O(\frac{1}{T})$ (under the assumption $K \le T$).
e. Using the previous sections, prove an expected regret bound for Horizon Independent UCB for any t ≤ T .
4.10 Bandits with Context
In this question we analyze a multi-armed bandit problem with contexts. A context describes information that is
available to the agent before making a decision. For example, in web advertising, the context can be the gender
of the user – information that is very relevant to which ad should be displayed to them.
Formally, at the beginning of each round a context is sampled i.i.d. from a finite set of contexts, $X_t \sim P_X$,
where $X_t \in \mathcal{X} = \{x_1, x_2, \dots, x_N\}$, i.e., the possible number of contexts is discrete, finite, and $|\mathcal{X}| = N$. For each
context $x \in \mathcal{X}$, the agent can choose among a fixed set of actions $A = \{1, \dots, K\}$, such that the average reward
of playing action $a \in A$ given that the context is $x \in \mathcal{X}$ is $\mu(a, x)$. As usual, the agent does not observe the average
reward, but only a noisy sample $r_{a,x} \in [0, 1]$ such that $\mathbb{E}[r_{a_t,x_t} \mid x_t = x, a_t = a] = \mu(a, x)$.
a. Assume the agent does not have access to the contexts at each round. Suggest a UCB-type algorithm
with good performance w.r.t. the following definition of regret:
$$R(T) = \mathbb{E}\left[\sum_{t=1}^{T} \mu^* - \mu(a_t)\right],$$
Now assume the agent has access to the context: at the beginning of each round the agent observes $X_t$. In the
following, we would like to minimize the context-aware regret measured w.r.t. $\mu^*(X_t) = \max_{a\in A}\mu(a, X_t)$:
$$R_X(T) = \mathbb{E}\left[\sum_{t=1}^{T} \mu^*(X_t) - \mu(a_t, X_t)\right].$$
b. Prove that $R(T) \le R_X(T)$. Show by a counterexample that the opposite does not hold. Explain in 2-3
sentences which would be the stronger result – a bound on $R_X(T)$ or the same bound on $R(T)$?
Hint: prove that $\mathbb{E}[\max_a f(x,a)] \ge \max_a \mathbb{E}[f(x,a)]$ when $x \in \mathcal{X}$ and $f(x,a)$ is finite for any x, a.
We will now show that by using the context $x_t$, we can devise a UCB algorithm with a bounded regret
for $R_X$. Let $n_{t-1}(x, a) = \sum_{t'=1}^{t-1} \mathbb{1}(x_{t'} = x, a_{t'} = a)$ denote the number of times action a was chosen
for context x until round t ($\mathbb{1}(A)$ is the indicator function). We estimate $\mu(x, a)$ by its empirical average,
$$\bar\mu_{t-1}(x, a) = \frac{\sum_{t'=1}^{t-1} r_{t'}\,\mathbb{1}(x_{t'} = x, a_{t'} = a)}{n_{t-1}(x, a)}.$$
Actions are chosen according to (C is a universal constant to be determined in the following)
$$a_t \in \arg\max_a\ \bar\mu_{t-1}(X_t, a) + C\sqrt{\frac{\log T}{n_{t-1}(X_t, a)}}. \qquad (13)$$
d. Perform the following steps. (i) Define the set of clean events such that (13) holds for every $a, x, t \le T, n_t(x,a)$
(assume $T \ge N, K$), (ii) set C in (13), (iii) show that conditioned on the clean event,
$R(T; x) \le O(\sqrt{n_T(x)\,K\log T})$, where $n_T(x) = \sum_{t=1}^{T}\mathbb{1}(x_t = x)$ is the total number of rounds in
which context x was received. You can apply results learned in class without proving them.
e. Combine c-d and prove that $R_X(T) \le O\left(\sqrt{KNT\log T}\right)$ (hint: Jensen's inequality).
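A minimal sketch of the per-context rule (13) in Python, assuming a hypothetical interface draw_context(t) for observing $X_t$ and sample_reward(x, a) for the noisy reward, and a constant C supplied by the caller; each (context, action) pair simply keeps its own empirical mean and count.

import math
from collections import defaultdict

def contextual_ucb(draw_context, sample_reward, K, T, C):
    """Run the UCB rule (13): one index per (context, action) pair."""
    sums = defaultdict(float)    # running reward sums for (x, a)
    counts = defaultdict(int)    # n_{t-1}(x, a)
    for t in range(1, T + 1):
        x = draw_context(t)      # observe X_t at the start of the round
        def index(a):
            n = counts[(x, a)]
            if n == 0:
                return float("inf")   # force one pull of each (x, a) pair
            return sums[(x, a)] / n + C * math.sqrt(math.log(T) / n)
        a = max(range(K), key=index)
        r = sample_reward(x, a)
        sums[(x, a)] += r
        counts[(x, a)] += 1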
5 Function Approximation
where $r \in \mathbb{R}^n$ and $P^\pi \in \mathbb{R}^{n\times n}$ are the reward vector and transition matrix under policy π, respectively.
Also, recall that the projection operator $\Pi : \mathbb{R}^n \to \mathbb{R}^n$ projects a vector $v \in \mathbb{R}^n$ onto a linear subspace $S \subset \mathbb{R}^n$
that is the span of k features $\phi_1(s), \dots, \phi_k(s)$, w.r.t. the ϵ-weighted Euclidean norm, where ϵ is the stationary
state distribution under policy π.
b. Show that the projected 2-step Bellman operator $\Pi T^{\pi,2}$ is a contraction.
c. We propose to approximate the value function $V^\pi$ by the fixed point of $\Pi T^{\pi,2}$, i.e., the approximate value
function $\tilde V^\pi$ satisfies the following equation:
$$\tilde V^\pi = \Pi T^{\pi,2}\,\tilde V^\pi. \qquad (14)$$
d. Derive a bound on the approximation error $\|\tilde V^\pi - V^\pi\|_\epsilon$ in terms of the 'best' approximation error
$\|\Pi V^\pi - V^\pi\|_\epsilon$.
e. Is the fixed point of $\Pi T^{\pi,2}$ a better approximation than the fixed point of $\Pi T^\pi$? Explain.
f. Suggest a modification of the TD(0) algorithm that will converge to a fixed point of $\Pi T^{\pi,2}$. Explain.
In this question we consider function-approximation TD(0) for the off-policy case. Let S and A be finite state
and action spaces, respectively; in addition, denote by γ the discount factor. Let µ be the behavior policy that
generated the samples and let π be the target policy which we want to evaluate. We denote by $P_\mu, P_\pi, R_\mu, R_\pi$
the induced transition matrices and reward vectors. The vector $\phi(s) \in \mathbb{R}^k$ is the feature vector of state $s \in S$
and $\Phi \in \mathbb{R}^{k\times|S|}$ is the state-feature matrix representing the features of all states; we approximate the value of
state $s \in S$ through $V_\pi(s) \approx w^\top\phi(s)$.
a. We start by recalling the results shown in class. You can assume the states are drawn i.i.d. according to
the limit stationary distribution induced by the behavior policy: $d_\mu(s) = \lim_{t\to\infty}\Pr\{s_t = s\}$, and denote
$D_\mu = \mathrm{diag}(d_\mu)$.
1. Write the standard TD(0) update for wt+1 = f (wt , αt , r(st ), ϕ(st ), ϕ(st+1 )), where αt is the step
size.
2. Find E[r(st )ϕ(st )|wt ], E[ϕ(st )ϕ(st+1 )⊤ |wt ], E[ϕ(st )ϕ(st )⊤ |wt ].
3. Rewrite the TD(0) update in the following stochastic approximation form:
$$w_{t+1} = w_t + \alpha_t\left(A_t^\top w_t + b_t + m_t\right), \qquad \mathbb{E}[m_t \mid w_t] = 0.$$
Find explicitly $A = \lim_{t\to\infty}\mathbb{E}_\mu[A_t]$ and $b = \lim_{t\to\infty}\mathbb{E}_\mu[b_t]$.
Reminder: The stability of TD(0) results from A having the following negative-definite property: $\forall y \ne 0,\ y^\top A y < 0$,
which is a result of $E \triangleq D_\mu(\gamma P_\mu - I)$ having the same property.
Define $\rho_t = \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)}$ to be the importance sampling ratio at time t.
b. Show that for any random variable $Z = f(S_t, A_t, S_{t+1})$, the following holds:
$$\mathbb{E}_\mu[\rho_t Z \mid S_t = s] = \mathbb{E}_\pi[Z \mid S_t = s].$$
c. Off-policy TD(0) uses the following update instead of the regular TD(0) update:
$$w_{t+1} = w_t + \alpha_t\rho_t\left(A_t^\top w_t + b_t + m_t\right).$$
Show that the new matrix $A_{\mathrm{Off}} = \lim_{t\to\infty}\mathbb{E}_\mu[\rho_t A_t]$ is equal to $\Phi^\top D_\mu(\gamma P_\pi - I)\Phi$, and define the
corresponding $E_{\mathrm{Off}} = D_\mu(\gamma P_\pi - I)$.
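For concreteness, one common way to write the same importance-sampling-corrected update is through the TD error; the sketch below performs a single off-policy TD(0) step with linear features. The argument names are placeholders, and the question's $(A_t, b_t, m_t)$ decomposition simply groups the same terms.

import numpy as np

def off_policy_td0_step(w, phi_s, phi_s_next, r, rho, alpha, gamma):
    """One off-policy TD(0) update: w <- w + alpha * rho * delta * phi(s)."""
    delta = r + gamma * w @ phi_s_next - w @ phi_s   # TD error for this sample
    return w + alpha * rho * delta * phi_s           # rho = pi(a|s) / mu(a|s)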
d. Given the following MDP (two states, two actions, reward is always zero) and state representation:
$$[P_{\mu/\pi}]_{i,j} = P(s_j \mid s_i, \mu(s_i)/\pi(s_i)), \qquad P_\mu = \begin{pmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{pmatrix}, \qquad P_\pi = \begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}, \qquad r(s,a) \equiv 0,$$
$$\gamma = 0.9, \qquad \phi(s_1) = 1, \qquad \phi(s_2) = 2.$$
1. Find $D_\pi$, $D_\mu$.
2. Show that $E_{\mathrm{Off}}$ is not negative definite.
3. (Bonus) Start from $w_0 = 10$, $\alpha_t = 0.1$, and demonstrate (there is no need for a formal proof) that off-policy
TD(0) with function approximation is unstable, i.e., $\|w_t\| \to \infty$.
5.3 TD(λ) Analysis
Consider an infinite-horizon MDP (S, A, R, P, γ) with a given, fixed policy π. In this question we analyze the
behavior of the TD(λ) algorithm. We consider the approximate case, where a weight vector $w \in \mathbb{R}^k$ is learned,
such that
$$V^\pi(s) = \mathbb{E}^\pi\left[\sum_{t=0}^{\infty}\gamma^t r(s_t)\ \middle|\ s_0 = s\right] \approx w^\top\phi(s),$$
where $e_t, \phi_t, w_t \in \mathbb{R}^k$, and $w_0, e_0$ are chosen arbitrarily. Throughout the question, for simplicity assume
$r(s) = 0\ \forall s \in S$, and that the step size is fixed, i.e., $\alpha_t = \alpha$.
a. Show that the TD(λ) algorithm can be written as a multiplicative update algorithm, i.e., wt+1 = Qt wt .
Find Qt .
b. Bound $w_T$ in the $L_2$ norm, for some finite time-step T. Use $Q_t$'s eigenvalues for the bound.
The expression should be a function of γ, λ, ϕ(·), t, α. It may contain only scalars and the norm of $w_0$.
For your convenience, you can use the following two lemmas:
1. Given a matrix $A \in \mathbb{R}^{n\times n}$, $\|A\|_2^2 = \max_i a_i$, where $a_i$, $i \in \{1,\dots,n\}$, are the eigenvalues of the
matrix $AA^\top$.
2. Let $B = uv^\top + xz^\top$, where $u, v, x, z \in \mathbb{R}^k$. Then B has $k-2$ eigenvalues equal to zero, and 2
eigenvalues:
$$b^+, b^- = \frac{u^\top v + x^\top z \pm \sqrt{(u^\top v - x^\top z)^2 + 4(u^\top z)(v^\top x)}}{2}.$$
c. Given the bound you obtained, will TD(λ) converge? If the answer is positive, prove it. If it is negative,
explain what condition has to hold for TD(λ) to converge.
a. Write an explicit expression for A and b, in terms of the parameters in (17). You can also appropriately
define matrices and vectors for this purpose.
b. Write an explicit expression for $M_n$ in terms of A, b and the parameters in (17), and show that $\mathbb{E}[M_n \mid \mathcal{F}_n] = 0$,
where $\mathcal{F}_n = \sigma\left(\{\theta_k\}_{k=1}^{n}, \{(s_{k-1}, s'_{k-1})\}_{k=1}^{n}\right)$. Is $\mathbb{E}[M_n \mid \mathcal{F}_n] = 0$ also when the samples are obtained
as a sequence $\{s_n\}$ rather than i.i.d. in pairs, i.e., $\mathcal{F}_n = \sigma(\{\theta_k\}_{k=1}^{n}, \{s_k\}_{k=1}^{n})$? Prove your answer.
We now note that the solution $\theta^*$ to the limiting ODE $\dot\theta = h(\theta)$ is $\theta^* = A^{-1}b$, and make the following assumption:
all rewards $r(s, s')$ and feature vectors $\phi(s)$ are uniformly bounded, i.e., $\|\phi(s)\| \le 1/2$ and $|r(s, s')| \le 1$,
$\forall s, s' \in S$.
c. Show that $\|M_n\| \le C_1 + C_2\|\theta_n - \theta^*\|$ and express the constants $C_1, C_2$ using the parameters in (17)
and their norms.
d. Show that $\|\theta_n\| \le C_3 \cdot n$, for some constant $C_3 > 1$. Hint: this can be proved by induction.
e. *Bonus: Show that
$$\mathbb{E}\left[\|\theta_{n+1} - \theta^*\|^2 \mid \mathcal{F}_n\right] \le (\theta_n - \theta^*)^\top\Lambda_n(\theta_n - \theta^*) + C_4\frac{1}{n^2}$$
and express $\Lambda_n$ and the constant $C_4$ using the parameters in the question (but not using $\theta_n$ or $\theta^*$).
Hint: first express $\theta_{n+1}$ as $\theta_{n+1} = f_1(n)\cdot(\theta_n - \theta^*) + f_2(n)$ and use the relation $|x| < 1 + x^2$.
In this question you will analyze the TD(λ) algorithm when using a linear function to approximate the value
function, v.
Consider an MDP with n states and m actions (per state). Let π be an arbitrary policy. We define the n-step
(fixed policy) Bellman operator $T_n^\pi : \mathbb{R}^n \to \mathbb{R}^n$ as follows: $T_n^\pi v \overset{\text{def}}{=} \sum_{t=0}^{n-1}\gamma^t(P^\pi)^t r^\pi + \gamma^n(P^\pi)^n v$, where
$r^\pi \in \mathbb{R}^n$, $P^\pi \in \mathbb{R}^{n\times n}$ are the reward vector and transition matrix under policy π, respectively. Next, define the
TD(λ) operator, $T_\lambda^\pi : \mathbb{R}^n \to \mathbb{R}^n$, as follows:
$$T_\lambda^\pi v = (1-\lambda)\sum_{t=0}^{\infty}\lambda^t T_t^\pi v = (I - \gamma\lambda P^\pi)^{-1}\left(r^\pi + \gamma(1-\lambda)P^\pi v\right) = v + (I - \gamma\lambda P^\pi)^{-1}\left(r^\pi + \gamma P^\pi v - v\right).$$
We would like to approximate the value using linear function approximation, $v = \Phi^\top w$. Let $d^\top = d^\top P^\pi$ be
the stationary distribution of $P^\pi$. We denote by Π the linear projection operator onto the span of Φ, under the
d-weighted norm.
a. Prove that $T_\lambda^\pi$ is a $\xi_\lambda \overset{\text{def}}{=} \frac{\gamma(1-\lambda)}{1-\gamma\lambda}$ contraction in the d-weighted $L_2$ norm, and answer (with proof): what is
its fixed point? (Hint: $(I - \gamma\lambda P^\pi)^{-1} = \sum_t(\gamma\lambda)^t(P^\pi)^t$.)
b. Prove that ΠTλπ is a contraction operator in the d-weighted L2 norm. What is its contraction coefficient?
c. Let $w_\lambda^*$ satisfy the fixed point equation $\Pi T_\lambda^\pi\Phi^\top w_\lambda^* = \Phi^\top w_\lambda^*$. Give a bound of the form
$$\|v^\pi - \Phi^\top w_\lambda^*\|_d \le f(\lambda, \gamma)\,\|v^\pi - \Pi v^\pi\|_d.$$
Furthermore, based on the form of $f(\lambda, \gamma)$, how would you choose λ?
d. Assume (i) access to a sampling device from d, i.e., $s \sim d$; and (ii) access to a simulator which receives an
action and returns the 3-tuple $(s_t, r_t, s_{t+1})$ (from which you can sample as you wish). Suggest a stochastic
approximation procedure that will converge to the fixed point of $\Pi T_\lambda^\pi$.
Specifically, write in a concise way the sampling procedure and the update rules of your algorithm.
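One standard stochastic-approximation scheme of this flavor is TD(λ) with an eligibility trace, sketched below in its on-trajectory form. The sampler draw_transition() is a hypothetical placeholder returning (phi(s_t), r_t, phi(s_{t+1})) from the simulator; with i.i.d. starts s ~ d one would typically roll out short trajectories. This is only meant to make one possible set of update rules concrete.

import numpy as np

def td_lambda(d_features, lam, gamma, alpha, num_steps, draw_transition):
    """TD(lambda) with linear features and an accumulating eligibility trace."""
    w = np.zeros(d_features)
    e = np.zeros(d_features)                      # eligibility trace
    for _ in range(num_steps):
        phi_s, r, phi_s_next = draw_transition()  # (phi(s_t), r_t, phi(s_{t+1}))
        e = gamma * lam * e + phi_s               # e_t = gamma*lambda*e_{t-1} + phi(s_t)
        delta = r + gamma * w @ phi_s_next - w @ phi_s
        w = w + alpha * delta * e                 # w_{t+1} = w_t + alpha * delta_t * e_t
    return w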
5.6 Approximate Value Iteration
We consider approximate value iteration (AVI) in the discounted MDP setting. In class we saw that the Bellman
operator T is a contraction in the sup-norm, but that the projection operator Π is not necessarily a contraction in
sup-norm. We thus concluded that the projected Bellman operator ΠT is not necessarily a contraction. In this
question we investigate a case in which ΠT is a contraction.
a. Consider an MDP with two actions: A = {continue, stop}. At any state s if the agent chooses continue,
we transition according to P (s′ |s), and obtain reward 0. If the agent chooses stop, we transition to a
terminal state and obtain reward r(s). Write down the Bellman operator T for this MDP.
b. Let d denote the stationary distribution of the Markov chain P (s′ |s). Show that T is a contraction in the
d-weighted Euclidean norm ∥ · ∥d .
c. Evaluating when to execute options, and estimating their expected worth (at time t = 0), is an important
problem in finance. Explain how the American 'call' option can be modeled by the MDP above.
d. In reality, the state space for the stock price is continuous. Suggest an AVI algorithm with linear function
approximation for the MDP problem above. Explain why your algorithm converges, and derive an error
bound similar to the error bound we saw in class for the projected Bellman equation. You do not have to
write the algorithm explicitly in pseudo-code, but explain what equation drives it and how it can be
implemented in practice.
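As a rough illustration of what one iteration of such a scheme could look like (not necessarily the intended answer), here is a sketch of a projected value-iteration step for a discretized version of the stopping problem. It assumes tabular states with a known transition matrix P, a stop-reward vector r, a feature matrix Phi, and a weighting vector d; all names are hypothetical.

import numpy as np

def projected_vi_step(w, Phi, P, r, d, gamma):
    """One AVI step: project T(Phi w) back onto the span of Phi in the d-weighted norm."""
    v = Phi @ w                          # current approximate value
    Tv = np.maximum(r, gamma * P @ v)    # stop: r(s); continue: gamma * E[v(s')]
    D = np.diag(d)                       # d-weighted least-squares projection
    w_new = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ Tv)
    return w_new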
We consider approximate value iteration (AVI) in the discounted MDP setting. In class we saw that the Bellman
operator T is a contraction in the sup-norm, but that the projection operator Π is not necessarily a contraction in
sup-norm. We thus concluded that the projected Bellman operator ΠT is not necessarily a contraction. In this
question we investigate a case in which ΠT is a contraction.
We consider a general MDP. Recall that we define function approximation using the projection operator Π,
which maps a function $f : S \to \mathbb{R}$ to a restricted set of functions.
where $\beta_s \ge 0$ is a state-dependent constant, and $\beta_{s,s'} \ge 0$ are state-dependent weights that satisfy
$\beta_s + \sum_{s'}\beta_{s,s'} = 1$. We will show that several well-known function approximators are in fact averagers.
(i) The 1-nearest neighbor approximation (1-NN) is defined by a set of N target states T = {t1 , . . . , tN },
and a distance function D over the state space. The approximation is given by ΠV (s) = V (t∗ ),
where t∗ = argmint∈T D(s, t). Show that 1-NN is an averager (i.e., write the weights βs , βs,s′
explicitly).
(ii) The K-nearest neighbor (K-NN) approximation is an average of the K nearest neighbors. Show that
K-NN is an averager.
(iii) Soft state aggregation function approximation partitions the state space into a set of clusters C =
{c1 , . . . , cN }, and approximates the value of a state as the average of the values in its cluster. Show
that this is also a type of averager.
b. Show that the projection operator for an averager is a non-expansion in the sup-norm.
c. Suggest a convergent batch AVI algorithm using averagers. Show that your algorithm converges and
provide an error bound on the optimal policy.
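To make the averager view in item (i) concrete, the sketch below computes the 1-NN approximation over a set of target states; written this way, the output at each state is a weighted combination of values at the target states with weight 1 on the nearest target and 0 elsewhere, matching the averager form above. The names (values indexed by state, a dist callable) are illustrative assumptions only.

import numpy as np

def one_nn_projection(values, states, targets, dist):
    """1-NN averager: Pi V(s) = V(t*), where t* = argmin_{t in targets} dist(s, t)."""
    out = np.empty(len(states))
    for i, s in enumerate(states):
        t_star = min(targets, key=lambda t: dist(s, t))  # nearest target state
        out[i] = values[t_star]   # beta_{s, t*} = 1, all other weights are 0
    return out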
6 RL – misc.
Consider a Markov reward process, where at iteration k a tuple $(\phi_k, r_k, \phi'_k)$ is obtained i.i.d. from the stationary
distribution of the process. Let $\delta_k \triangleq r_k + \gamma\theta_k^\top\phi'_k - \theta_k^\top\phi_k$. The Gradient TD(0) algorithm performs the following
iterative updates on $u_k, \theta_k \in \mathbb{R}^d$:
$$u_{k+1} = u_k + \beta_k(\delta_k\phi_k - u_k) \qquad (19)$$
and
$$\theta_{k+1} = \theta_k + \alpha_k(\phi_k - \gamma\phi'_k)\,\phi_k^\top u_k. \qquad (20)$$
Throughout this question we assume $\beta_k = \eta\alpha_k$, where $\eta \in (0,1)$; $\sum_{k=0}^{\infty}\alpha_k = \infty$; and $\sum_{k=0}^{\infty}\alpha_k^2 < \infty$. Also,
we use the notation learned in class, $A = \mathbb{E}[\phi_k(\phi_k - \gamma\phi'_k)^\top]$ and $b = \mathbb{E}[r_k\phi_k]$.
a. What is the time complexity of a single iteration ((19) followed by (20)) of the GTD(0) algorithm, using
an efficient implementation?
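As an illustration of a single iteration, here is a sketch of (19) followed by (20) with linear features; the sample interface is a hypothetical placeholder. Note that each line involves only vector operations in $\mathbb{R}^d$.

import numpy as np

def gtd0_step(theta, u, phi, phi_next, r, alpha, beta, gamma):
    """One GTD(0) iteration: update u by (19), then theta by (20)."""
    delta = r + gamma * theta @ phi_next - theta @ phi            # TD error delta_k
    u = u + beta * (delta * phi - u)                              # (19)
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ u)  # (20), phi @ u is a scalar
    return theta, u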
Let us now define a joint vector of length 2d: $\rho_k^\top = (v_k^\top, \theta_k^\top)$, where $v_k = u_k/\sqrt{\eta}$, and a reward-related vector
of length 2d: $g_{k+1}^\top = (r_k\phi_k^\top, 0^\top)$.
c. Let G ≜ E[Gk ] and g ≜ E[gk ]; write them explicitly using the definitions provided in the question. Then,
rewrite (21) as a stochastic approximation algorithm of the form ρk+1 = ρk + αk′ (h(ρk ) + Mk+1 ). Specify
αk′ , h(·) and Mk+1 (where Mk+1 is a martingale difference sequence).
d. Let the filtration $\mathcal{F}_k = \sigma(\rho_1, M_1, \dots, \rho_{k-1}, M_k)$. You can assume that: (i) $\mathbb{E}[\|\phi_k\|^2], \mathbb{E}[r_k^2], \mathbb{E}[\|\phi'_k\|^2]$ are
bounded for all k; (ii) the iterates $\rho_k$ are bounded for all k; and (iii) G is negative definite. Prove that $\rho_k \to \rho^*$ w.p. 1,
and find $\rho^*$.
6.2 n-step Operators in Reinforcement Learning
In the course, we formulated planning and online algorithms using 1-step operators, T and $T^\pi$, the optimal
Bellman operator and the fixed-policy Bellman operator, respectively. Here, you will formulate a possible form
of 'n-step'¹ optimal Bellman operators and formulate a possible use for these operators.
Let v be a value function. We define the n-step optimal Bellman operator, $T^n : \mathbb{R}^{|S|} \to \mathbb{R}^{|S|}$, $T^n v \overset{\text{def}}{=} \underbrace{T \cdot T \cdots T}_{n\ \text{times}}\,v$,
where $\forall s \in S,\ Tv(s) = \max_a \mathbb{E}_{s' \sim P(\cdot|s,a)}\left[r(s,a) + \gamma v(s') \mid s, a\right]$.
b. Consider an Approximate Value Iteration algorithm in which the value at the kth iteration satisfies,
Let $\pi_k$ be the greedy policy w.r.t. $v_k$. Provide a bound on $\|v^{\pi_k} - v^*\|_\infty$ in terms of γ, n, ϵ and $\|v_0 - v^*\|$,
where $v_0$ is the initial value function.
Reminder: The greedy policy w.r.t. a value v satisfies
$$\forall s \in S,\quad \pi(s) = \arg\max_a\ r(s,a) + \gamma\sum_{s'}p(s'\mid s,a)v(s') \qquad (22)$$
and thus, $T^\pi v = Tv$.
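For concreteness, the following sketch applies the n-step optimal Bellman operator to a tabular value vector by composing the one-step operator n times. The arrays P (shape (|S|, |A|, |S|)) and R (shape (|S|, |A|)) are hypothetical tabular inputs.

import numpy as np

def optimal_bellman(v, P, R, gamma):
    """One-step optimal operator: (Tv)(s) = max_a R[s,a] + gamma * sum_s' P[s,a,s'] v(s')."""
    q = R + gamma * np.einsum("sap,p->sa", P, v)
    return q.max(axis=1)

def n_step_optimal_bellman(v, P, R, gamma, n):
    """T^n v: n successive applications of the one-step operator."""
    for _ in range(n):
        v = optimal_bellman(v, P, R, gamma)
    return v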
We now define a set of non-stationary greedy policies with respect to a value function v, $G_n(v)$:
$$G_n(v) = \arg\max_{\pi'_0,\dots,\pi'_{n-1}} \mathbb{E}\left[\sum_{t=0}^{n-1}\gamma^t r(s_t, \pi'_t(s_t)) + \gamma^n v(s_n)\right], \qquad (23)$$
e. Let $\sigma_n$ be the non-stationary policy which repeatedly runs $\{\pi^n_0, \pi^n_1, \dots, \pi^n_{n-1}\} \in G_n(v^\pi)$, i.e., at any time
$t > 0$ we use the policy $\pi^n_{t \bmod n}$. Prove that $v^{\sigma_n} \ge v^\pi$ component-wise, with equality if and only if π is the
optimal policy.
¹ In the course, we used 'n-step' for the task of policy evaluation alone.
6.3 q^π instead of v^π
In the semester, we saw how $v^\pi(s)$, the value of a policy, can be calculated. It is occasionally more helpful
to calculate the Q-function of a policy, $q^\pi(s, a)$, occasionally denoted as the vector $q^\pi \in \mathbb{R}^{|S||A|}$. In words, it
represents the value of acting once from state s with action a, and then acting according to π (π is a stationary
policy). Formally:
$$q^\pi(s, a) = \mathbb{E}^\pi\left[\sum_{t=0}^{\infty}\gamma^t r(s_t, a_t)\ \middle|\ s_0 = s, a_0 = a\right],$$
Prove that $T_q^\pi : \mathbb{R}^{|S||A|} \to \mathbb{R}^{|S||A|}$ is a contraction operator and that its contraction coefficient is γ.
$$q^\pi = (I - \gamma A)^{-1}b,$$
We now turn to the linear approximation setting. We wish to estimate $q^\pi(s, a)$ using the features $\phi(s, a) \in \mathbb{R}^d$,
i.e., $\hat q^\pi(s, a) = \phi^\top(s, a)\,w$, with $w \in \mathbb{R}^d$.
$$\Pi_q q = \Phi w^*, \qquad w^* = \arg\min_w \|\Phi w - q\|_\epsilon^2,$$
where $\epsilon(s, a) = \pi(a \mid s)\,d^\pi(s)$, and $d^\pi$ is the stationary distribution of the Markov chain induced by π.
Show that $\Pi_q$ is a non-expansive operator.
e. Prove that $\Pi_q T_q^\pi$ is a γ-contraction in the ϵ-weighted norm $\|\cdot\|_\epsilon$. For ease, use the following notation for
$T_q^\pi$:
$$T_q^\pi q = r + (P\pi)q, \qquad \text{where}\quad (P\pi)q(s, a) = \sum_{s',a'}P(s'\mid s,a)\,\pi(a'\mid s')\,q(s', a'),$$
and the (s, a) component of r is r(s, a). Verify that this definition is consistent with (24).
6.4 Inverse Reinforcement Learning
In some applications it is desired to design decision-making agents that display similar strategies to humans. For
example, in an autonomous driving scenario, we would like to design driving algorithms that display human
driving styles such as safety and cautiousness. While human behavior can easily be demonstrated and recorded,
explicitly coding it as a policy is difficult in practice.
Inverse RL is an approach to this problem, which aims to only learn the ‘reward’ of the human, instead of its
policy. In this question we will formulate an inverse RL algorithm.
Our model is a reward-less MDP M = {S, A, P, γ}. That is, we know the MDP state space, action space,
transitions, and discount factor, but do not know the reward function r : S → R. Our goal is to learn the function
r from a demonstration trajectory τ = {s0 , a0 , s1 , a1 , . . . , sT , aT }, where we assume that actions in τ were
generated according to the optimal policy π ∗ in the MDP Mr = {S, A, P, r, γ}.
a. Consider the following grid-world MDP with $S = \{1, \dots, 28\}$, where the actions are the 4 directions
{up, down, left, right}, and transitions are deterministic movements of the agent in the corresponding
direction (we assume that moving into a wall keeps the agent in place). We assume γ = 0.99. Consider
the trajectory τ shown in the picture. Suggest a reward function r(s) such that running the optimal policy
starting in $s_0$ would result in the trajectory τ.
We now consider a probabilistic formulation for this problem. Let Q(s, a; r) denote the optimal Q-function in
$M_r$. We assume that the demonstration is generated from the softmax policy
$$\pi(a\mid s) = \frac{e^{Q(s,a;r)}}{\sum_{a'\in A} e^{Q(s,a';r)}}.$$
Note that L depends on r through Q(s, a; r) in π defined above. Furthermore, we will assume that r is given in
a parametric form r(s; θ), and we will seek the parameter θ ∈ R by gradient ascent on L (for simplicity, we
assume that θ is a scalar).
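To make the probabilistic model concrete, here is a sketch of the softmax policy and of the demonstration's objective, assuming (as the surrounding text suggests) that L denotes the log-likelihood of the demonstrated state-action pairs under this policy. Q is a tabular array of shape (|S|, |A|); all names are illustrative.

import numpy as np

def softmax_policy(Q):
    """pi(a|s) = exp(Q[s,a]) / sum_a' exp(Q[s,a']), computed row-wise."""
    z = Q - Q.max(axis=1, keepdims=True)   # stabilize the exponentials
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def log_likelihood(Q, trajectory):
    """Sum of log pi(a_t | s_t) over the demonstrated (s_t, a_t) pairs."""
    pi = softmax_policy(Q)
    return sum(np.log(pi[s, a]) for s, a in trajectory)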
d. Write an expression for $\frac{\partial}{\partial Q(s',a';r)}\log\pi(a\mid s)$. Consider separately the cases $\{s'=s, a'=a\}$, $\{s'=s, a'\ne a\}$,
and $s' \ne s$.
e. Assume that $\frac{\partial}{\partial r(s')}Q(s,a;r)$ is known (for all $s', s, a$). Write down an expression for $\frac{\partial}{\partial\theta}\log\pi(a\mid s)$.
– Consider a function f(x, z) where x is continuous and z belongs to some finite set Z. Let $g(x) = \max_{z\in Z}f(x,z)$,
and $z^*(x) = \arg\max_{z\in Z}f(x,z)$. Then $\nabla_x g(x) = \nabla_x f(x, z = z^*(x))$.
g. Based on your equation from (f), suggest an algorithm in the spirit of value iteration to calculate
$\frac{\partial}{\partial r(s')}Q(s,a;r)$.
In this question we will analyze the Modified Policy Iteration (MPI) algorithm (Algorithm 4). Denote the fixed-policy and
optimal Bellman operators by $T^\pi$ and T, respectively.
Algorithm 4 MPI
Initialize: $m \in \mathbb{N}$, $v_0 \in \mathbb{R}^S$
while some stopping criterion do
    for $s \in S$ do
        $\pi_{k+1}(s) \in \arg\max_a r(s,a) + \gamma\sum_{s'}P(s'\mid s,a)\,v_k(s')$.
    end for
    for $s \in S$ do
        $v_{k+1}(s) = \sum_{t=0}^{m-1}\gamma^t\sum_{s'}P^{\pi_{k+1}}(s_t = s'\mid s_0 = s)\,r(s', \pi_{k+1}(s')) + \gamma^m\sum_{s''}P^{\pi_{k+1}}(s_m = s''\mid s_0 = s)\,v_k(s'')$.
    end for
end while
Return π, v
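A tabular Python sketch of Algorithm 4, assuming hypothetical arrays P of shape (|S|, |A|, |S|) and R of shape (|S|, |A|) and a fixed iteration budget in place of the stopping criterion; the inner loop applies the fixed-policy backup m times, matching the vector form $v_{k+1} = (T^{\pi_{k+1}})^m v_k$ discussed in the next part.

import numpy as np

def modified_policy_iteration(P, R, gamma, m, num_iters):
    """Sketch of MPI: greedy policy step followed by m partial evaluation steps."""
    S, A, _ = P.shape
    v = np.zeros(S)
    pi = np.zeros(S, dtype=int)
    for _ in range(num_iters):          # stands in for 'some stopping criterion'
        q = R + gamma * np.einsum("sap,p->sa", P, v)
        pi = q.argmax(axis=1)           # pi_{k+1}(s) in argmax_a r(s,a) + gamma * E[v_k(s')]
        P_pi = P[np.arange(S), pi]      # transition matrix under pi_{k+1}
        r_pi = R[np.arange(S), pi]      # reward vector under pi_{k+1}
        for _ in range(m):              # v_{k+1} = (T^{pi_{k+1}})^m v_k
            v = r_pi + gamma * P_pi @ v
    return pi, v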
a. In vector notation, MPI performs the following two steps of policy and value update:
– $\pi_{k+1} \in \{\pi' : T^{\pi'}v_k = Tv_k\}$.
– $v_{k+1} = (T^{\pi_{k+1}})^m v_k$.
Prove that the two algorithms are equivalent.
b. Which algorithms are obtained when m = 1, ∞?
In the rest of the question we will analyze the convergence of Modified Policy Iteration.
d. Assume that $Tv_0 \ge v_0$. Prove that for any k, $Tv_k \ge v_k$ (component-wise) and thus
for any k. Hint: use induction, and the update rule of MPI, $v_k = (T^{\pi_k})^m v_{k-1}$.
e. Prove $\|v^* - v_k\|_\infty \le \gamma\|v^* - v_{k-1}\|_\infty$. Hint: prove $0 \le v^* - v_k \le Tv^* - Tv_{k-1}$ and then take the
max-norm.
This question considers an algorithm called Double Q-learning (DQ). Throughout the question we assume the γ-
discounted MDP setting with finite state and action spaces, and tabular Q-functions (no function approximation).
In DQ, instead of maintaining a single Q table, we maintain two tables: $Q^A(s,a)$ and $Q^B(s,a)$. At each time
step t of the algorithm we observe $s_t$, choose an action $a_t$, and observe the next state $s_{t+1}$. We then choose
which table to update by drawing a random $x_{t+1} = \begin{cases}A, & \text{w.p. } 0.5\\ B, & \text{w.p. } 0.5\end{cases}$ (independent of all other variables in the
problem).
If $x_{t+1} = A$, we set $a^* = \arg\max_a Q^A_t(s_{t+1}, a)$ and update $Q^A(s_t, a_t)$:
$$Q^A_{t+1}(s_t, a_t) = Q^A_t(s_t, a_t) + \alpha_t(s_t, a_t)\left(r(s_t, a_t) + \gamma Q^B_t(s_{t+1}, a^*) - Q^A_t(s_t, a_t)\right).$$
Similarly, if $x_{t+1} = B$, we set $b^* = \arg\max_a Q^B_t(s_{t+1}, a)$ and update $Q^B(s_t, a_t)$:
$$Q^B_{t+1}(s_t, a_t) = Q^B_t(s_t, a_t) + \alpha_t(s_t, a_t)\left(r(s_t, a_t) + \gamma Q^A_t(s_{t+1}, b^*) - Q^B_t(s_t, a_t)\right).$$
When updating QA , all elements in QB remain the same, and vice versa. Additionally, similarly to regular
Q-learning, we only update the Q-table (either A or B) for the current state and action st , at , and leave all the
other elements in the table unchanged.
The DQ algorithm is often used in practice, as it is empirically more stable than standard Q learning. In this
question, we will only focus on showing that DQ converges.
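A sketch of a single DQ update as described above, with tabular Q arrays of shape (|S|, |A|); the coin flip, the argmax on one table, and the bootstrap from the other table are the essential ingredients. The argument names and the numpy random generator are illustrative assumptions.

import numpy as np

def double_q_step(QA, QB, s, a, r, s_next, alpha, gamma, rng):
    """One Double Q-learning update; only the chosen table's (s, a) entry changes."""
    if rng.random() < 0.5:                     # x_{t+1} = A
        a_star = QA[s_next].argmax()           # a* = argmax_a Q^A(s_{t+1}, a)
        QA[s, a] += alpha * (r + gamma * QB[s_next, a_star] - QA[s, a])
    else:                                      # x_{t+1} = B
        b_star = QB[s_next].argmax()           # b* = argmax_a Q^B(s_{t+1}, a)
        QB[s, a] += alpha * (r + gamma * QA[s_next, b_star] - QB[s, a])
    return QA, QB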
We will show convergence of $Q^B_t$. By symmetry, this will hold also for $Q^A_t$. We let $\mathcal{F}_t$ denote the history of
the process until time t, i.e., $\mathcal{F}_t = \{s_0, a_0, Q^A_0, Q^B_0, x_0, \dots, s_t, a_t, Q^A_t, Q^B_t, x_t\}$.
$$Q^B_{t+1}(s_t, a_t) = (1 - \alpha_t(s_t, a_t))\,Q^B_t(s_t, a_t) + \alpha_t(s_t, a_t)\left(HQ^B_t + \omega_t + c_t\right),$$
An extension of the stochastic approximation theorem claims that if $\lim_{t\to\infty}c_t = 0$ w.p. 1, then $Q^B_t$ converges.
In the rest of the question we will show that $\lim_{t\to\infty}c_t = 0$ w.p. 1.
b. Define $\Delta_t = Q^B_t - Q^A_t$. Explain why showing that $\lim_{t\to\infty}\Delta_t = 0$ w.p. 1 implies that $\lim_{t\to\infty}c_t = 0$
w.p. 1.
c. Write down ∆t+1 (st , at ) explicitly as a function of ∆t (st , at ), xt (and other required terms).
d. We define $g_t = \mathbb{E}\left[Q^A_t(s_{t+1}, b^*) - Q^B_t(s_{t+1}, a^*) \mid \mathcal{F}_t\right]$. Show that:
$$\mathbb{E}[\Delta_{t+1}(s_t, a_t)\mid\mathcal{F}_t] = (1 - 0.5\,\alpha_t(s_t, a_t))\,\Delta_t(s_t, a_t) + 0.5\,\gamma\,\alpha_t(s_t, a_t)\,g_t.$$
e. Prove that $|g_t| \le \|\Delta_t\|_\infty$.
An extension of the SA theorem shows that the update in (25) converges to zero w.p. 1.