
DS 102 Data, Inference, and Decisions Spring 2020

Lecture 24: Reinforcement Learning


Lecturer: Jacob Steinhardt

Last time, we reviewed dynamic programming and Markov Decision Processes, and then saw how to
combine these two ideas into the Value Iteration algorithm. Today, we will discuss Value Iteration in
more detail, as well as another useful variant of the algorithm called Q-iteration. We will then see
how to improve these approaches to incorporate learning from data and handle large state spaces.
Finally, having discussed all of the key conceptual ideas underlying reinforcement learning (RL),
we will briefly discuss two more advanced approaches: temporal difference learning and policy
gradient.

24.1 Recap: Value Iteration

Recall that, for Value Iteration, we have a Markov Decision Process (MDP) with transition probabilities
P(s' | a, s) (note that in some places in the literature, these transition probabilities are
instead denoted as T(s', a, s)), rewards R(s', a, s), and possibly a discount parameter γ. Our goal
is to compute the value function V(s), which gives the expected reward starting from state s under
the optimal policy. We can define V recursively,
\[
V(s) = \max_{a \in A} \sum_{s' \in S} P(s' \mid a, s)\bigl(R(s', a, s) + \gamma V(s')\bigr),
\]

but to avoid computational issues with this circular definition, we also add a time component, V_t(s),
where V_0(s) = 0 and
\[
V_{t+1}(s) = \max_{a \in A} \sum_{s' \in S} P(s' \mid a, s)\bigl(R(s', a, s) + \gamma V_t(s')\bigr).
\]
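
To make the recursion concrete, here is a minimal sketch of Value Iteration for a small tabular MDP, assuming the dynamics are given as arrays P[s, a, s'] and R[s, a, s'] (the array names, shapes, and argument order are illustrative choices, not something specified in the lecture):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, n_iters=100):
    """Tabular Value Iteration sketch.

    P[s, a, s2] is the probability of landing in s2 after taking action a in state s;
    R[s, a, s2] is the reward for that transition.
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)                                   # V_0(s) = 0
    for _ in range(n_iters):
        # expected return of each (s, a): sum_{s'} P(s' | a, s) * (R + gamma * V_t(s'))
        Q = np.einsum('saz,saz->sa', P, R + gamma * V[None, None, :])
        V = Q.max(axis=1)                                    # V_{t+1}(s) = max_a Q(s, a)
    return V
```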

24.2 Q-iteration

Recall that, in addition to the value function V, we defined the Q function, where Q(s, a) is the
expected reward if we start from state s, take action a, and then follow the optimal policy thereafter.
Compared to V, Q is useful since it helps us understand which actions are optimal. The
value and Q functions are very closely related:

\[
V(s) = \max_{a \in A} Q(s, a),
\]


and
\[
\begin{aligned}
Q(s, a) &= \sum_{s' \in S} P(s' \mid a, s)\bigl(R(s', a, s) + \gamma V(s')\bigr) \\
        &= \sum_{s' \in S} P(s' \mid a, s)\Bigl(R(s', a, s) + \gamma \max_{a' \in A} Q(s', a')\Bigr).
\end{aligned}
\]

We can approach computing Q the same way we computed V in the previous section, namely by
adding a time component and using dynamic programming: Q_0(s, a) = 0 and
\[
Q_{t+1}(s, a) = \sum_{s' \in S} P(s' \mid a, s)\Bigl(R(s', a, s) + \gamma \max_{a' \in A} Q_t(s', a')\Bigr).
\]
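
A corresponding Q-iteration sketch differs only in what we store between iterations: the full table Q_t(s, a) rather than V_t(s). The same illustrative array layout is assumed as in the Value Iteration sketch above:

```python
import numpy as np

def q_iteration(P, R, gamma=0.9, n_iters=100):
    """Tabular Q-iteration sketch; P[s, a, s2] and R[s, a, s2] as in value_iteration."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))                      # Q_0(s, a) = 0
    for _ in range(n_iters):
        V = Q.max(axis=1)                                    # max_{a'} Q_t(s', a') for every s'
        # Q_{t+1}(s, a) = sum_{s'} P(s' | a, s) * (R + gamma * max_{a'} Q_t(s', a'))
        Q = np.einsum('saz,saz->sa', P, R + gamma * V[None, None, :])
    return Q
```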

24.3 Q-Learning

Value and Q-Iteration both assume that the MDP is fully known. What if the transition dynamics
P(s' | a, s) are unknown? It turns out that, even when the transition probabilities are unknown, we
can still estimate Q from data. This is called Q-learning.
Suppose we have data about several different trajectories, where a trajectory is a sequence

\[
s_0, a_0, s_1, a_1, \ldots, s_i, a_i, \ldots
\]

that comes from some policy π : S → A. After each state-action pair (s_t, a_t) in a trajectory, we can
update our estimate of Q:

\[
Q(s_t, a_t) = (1 - \alpha)\, Q_{\mathrm{old}}(s_t, a_t) + \alpha \Bigl(R(s_{t+1}, a_t, s_t) + \gamma \max_{a' \in A} Q_{\mathrm{old}}(s_{t+1}, a')\Bigr).
\]

This update is a weighted average of our previous estimate of the Q value and something that looks
similar to the update we used in Q-iteration. In fact, it turns out that the right-hand term of this
Q-learning update is equal to the Q-iteration update in expectation:
\[
\mathbb{E}_{s_{t+1}}\Bigl[R(s_{t+1}, a_t, s_t) + \gamma \max_{a' \in A} Q_{\mathrm{old}}(s_{t+1}, a')\Bigr] = \sum_{s' \in S} P(s' \mid a_t, s_t)\Bigl(R(s', a_t, s_t) + \gamma \max_{a' \in A} Q_{\mathrm{old}}(s', a')\Bigr).
\]

Thus, our Q-learning update slowly averages in new estimates of the Q-function (by way of “stochastic”
Q-iteration updates), while the weighted average with our previous estimate of the Q value
helps stabilize our estimates.
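
As a rough sketch, the update above amounts to a single in-place change to a Q table after each observed transition; the function below is an illustration of this idea (the argument names are ours, not the lecture's):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update from an observed transition.

    Q is an (n_states, n_actions) array and r plays the role of R(s_{t+1}, a_t, s_t).
    """
    target = r + gamma * Q[s_next].max()          # stochastic estimate of the Q-iteration target
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q
```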
There are convergence theorems which specify conditions under which this Q-learning algorithm
will converge to the true Q-value function for the optimal policy. These theorems have two im-
portant requirements. First, they typically require that α → 0 over time. In practice, however, it
usually works to fix α to a small value. Second, both in theory and in practice we need to make
sure to explore the state space enough. There is an analogy to the multi-armed bandit setting
from Lectures 17 and 21: just as we needed to pull every arm sufficiently often there, here we need
to visit all of the states sufficiently often. One way to
accomplish this is to explicitly define an exploration policy that does a good job of visiting all of

the states, but it can be challenging to design such a policy by hand if there are some states that
can only be reached by very specific actions/trajectories. An alternative approach is to follow the
induced policy from the current estimate of Q(s, a), which means taking the action that looks the
best given the current estimate of the Q function. However, this approach can get stuck and end
up not exploring enough; it is better to make sure that some fraction of the time we take a random
action. It often also helps to initialize Q_0(s, a) to some optimistic (i.e., very large) value, which is
similar in spirit to the UCB algorithm in that it incentivizes exploring states that have not yet been
visited very often.
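
One common way to implement "take a random action some fraction of the time" is an ε-greedy rule on top of the current Q estimate; the sketch below also shows an optimistic initialization. The specific constants and sizes are illustrative:

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1, rng=np.random.default_rng()):
    """Follow the policy induced by Q, but take a random action with probability epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))      # explore: uniformly random action
    return int(np.argmax(Q[s]))                   # exploit: best action under the current Q estimate

# Optimistic initialization (sizes are illustrative): every unvisited pair looks
# attractive until it has actually been tried, which encourages exploration.
n_states, n_actions = 20, 4
Q0 = np.full((n_states, n_actions), 10.0)
```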

24.4 Function Approximation

Q-learning allowed us to improve upon the Q-Iteration algorithm by incorporating learning, but
both of these approaches will do poorly at handling large state spaces. In particular, each Q-learning
update only updates Q(s, a) for a single state-action pair, so if we have a very large number of states
we will not be able to sufficiently explore them all. Intuitively, we would like to somehow update
Q(s, a) for all similar state-action pairs at once. We accomplish this by using function approxima-
tion: we parameterize Q as Qθ (s, a). For example, we often model Q as a neural network, and θ
are the parameters of the network.
Recall our Q-learning update,
\[
\begin{aligned}
Q(s_t, a_t) &= (1 - \alpha)\, Q_{\mathrm{old}}(s_t, a_t) + \alpha \Bigl(R(s_{t+1}, a_t, s_t) + \gamma \max_{a' \in A} Q_{\mathrm{old}}(s_{t+1}, a')\Bigr) \\
&= Q_{\mathrm{old}}(s_t, a_t) + \alpha \Bigl(R(s_{t+1}, a_t, s_t) + \gamma \max_{a' \in A} Q_{\mathrm{old}}(s_{t+1}, a') - Q_{\mathrm{old}}(s_t, a_t)\Bigr),
\end{aligned}
\]

where the second form of the update emphasizes the similarity to a stochastic gradient descent
update with step size α. Instead of updating Q directly as in Q-learning, our new update will
update θ:
 
\[
\theta = \theta_{\mathrm{old}} + \alpha \Bigl(R(s_{t+1}, a_t, s_t) + \gamma \max_{a' \in A} Q_{\theta_{\mathrm{old}}}(s_{t+1}, a') - Q_{\theta_{\mathrm{old}}}(s_t, a_t)\Bigr)\nabla_\theta Q_{\theta_{\mathrm{old}}}(s_t, a_t).
\]

This function approximation approach is an adaptation of Q-learning which is much better able to
handle large state spaces.
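
For concreteness, here is a sketch of the θ update with a simple linear parameterization Q_θ(s, a) = θᵀφ(s, a), where the feature map φ is an assumption made for illustration; with a neural network, ∇_θ Q_θ would instead come from automatic differentiation:

```python
import numpy as np

def q_update_linear(theta, phi, s, a, r, s_next, actions, alpha=0.01, gamma=0.99):
    """One Q-learning step with a linear approximation Q_theta(s, a) = theta @ phi(s, a).

    phi(s, a) is an assumed feature map returning a vector of the same length as theta.
    """
    q_sa = theta @ phi(s, a)                                 # Q_{theta_old}(s_t, a_t)
    q_next = max(theta @ phi(s_next, b) for b in actions)    # max_{a'} Q_{theta_old}(s_{t+1}, a')
    td_error = r + gamma * q_next - q_sa                     # bracketed term in the update
    # For a linear Q, the gradient grad_theta Q_theta(s, a) is just phi(s, a).
    return theta + alpha * td_error * phi(s, a)
```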
One (not entirely rigorous) way to intuit the form of the update rule for θ is to imagine that we
are performing a prediction task where we would like to predict the Q value and we are using
squared error as our loss:
\[
\Bigl(R(s_{t+1}, a_t, s_t) + \gamma \max_{a' \in A} Q_{\theta_{\mathrm{old}}}(s_{t+1}, a') - Q_{\theta_{\mathrm{old}}}(s_t, a_t)\Bigr)^2.
\]

If we imagine that we are only updating the θ_old of the last term Q_{θ_old}(s_t, a_t) and take a gradient,
we get
\[
-2\Bigl(R(s_{t+1}, a_t, s_t) + \gamma \max_{a' \in A} Q_{\theta_{\mathrm{old}}}(s_{t+1}, a') - Q_{\theta_{\mathrm{old}}}(s_t, a_t)\Bigr)\nabla_\theta Q_{\theta_{\mathrm{old}}}(s_t, a_t),
\]

so that taking one gradient step with unit step size would be equivalent to using our update rule for θ with α = 2.

24.5 Temporal Difference Learning

It is often the case that we only see a reward at the very end of a trajectory. For example, if we want
to use RL to play chess or Go, we do not see rewards until the end of the game when we either
win or lose. This leads to a credit assignment problem, wherein we have to determine which of the
actions we took along the way was really responsible for the reward obtained at the end of the trajectory.
In regular Q-learning, this will slow down convergence immensely: we need T updates for a
trajectory of length T to even get away from our initialization Q_0. Temporal Difference Learning
(TD(λ)) is a more advanced RL approach that helps deal with these situations by propagating
rewards backwards to all the previous states in the trajectory in an exponentially
decaying way (according to the parameter λ), so that each update for a trajectory is more efficient.
This approach was most famously used in TD-Gammon, an RL-trained program that learned to play
backgammon.
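
One standard way to implement this backwards, exponentially decaying propagation is with eligibility traces. The sketch below shows tabular TD(λ) for estimating the value function of a fixed policy; the data format and constants are illustrative assumptions:

```python
import numpy as np

def td_lambda(trajectories, n_states, alpha=0.1, gamma=0.99, lam=0.9):
    """Tabular TD(lambda) with (accumulating) eligibility traces for value estimation.

    Each trajectory is a list of (s, r, s_next) tuples generated by a fixed policy.
    """
    V = np.zeros(n_states)
    for traj in trajectories:
        e = np.zeros(n_states)                     # eligibility trace per state
        for s, r, s_next in traj:
            delta = r + gamma * V[s_next] - V[s]   # temporal-difference error
            e[s] += 1.0                            # mark the current state as eligible
            V += alpha * delta * e                 # credit all recently visited states
            e *= gamma * lam                       # decay traces exponentially (the lambda part)
    return V
```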

24.6 Policy Gradient

Policy gradient is an alternative way to approach RL problems. In Q-learning, we tried to build
some approximate Q function, Q_θ, and then optimize θ to get good Q values. Policy gradient takes
the alternative approach of learning a policy π_θ(a|s), a probability distribution over actions
given the state. The goal is

\[
\max_\theta \; \mathbb{E}_{\pi_\theta}\bigl[R(s_0, a_0, s_1) + \gamma R(s_1, a_1, s_2) + \cdots\bigr].
\]

To maximize this, we want the gradient of this quantity with respect to θ. Taking this gradient is a
bit tricky since θ appears in the distribution with respect to which we are taking the expectation.
To handle this, we can use the log-derivative trick, which says

\[
\nabla_\theta \, \mathbb{E}_{\pi_\theta}[R] = \mathbb{E}_{\pi_\theta}\bigl[R \, \nabla_\theta \log \pi_\theta\bigr],
\]

thereby allowing us to compute a gradient we can use to update the parameters θ.
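
As a sketch of how this gradient can be used in practice, the code below implements a REINFORCE-style update for a softmax policy with a linear score θᵀφ(s, a), using the total trajectory reward as R. The feature map φ, the data format, and the constants are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def softmax_policy(theta, phi, s, actions):
    """pi_theta(a | s) proportional to exp(theta @ phi(s, a))."""
    scores = np.array([theta @ phi(s, b) for b in actions])
    scores -= scores.max()                          # subtract max for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

def policy_gradient_step(theta, phi, trajectory, actions, alpha=0.01):
    """One REINFORCE-style update: theta += alpha * R * grad_theta log pi_theta(trajectory).

    trajectory is a list of (s, a, r) tuples sampled from the current policy.
    """
    R = sum(r for _, _, r in trajectory)            # total trajectory reward (undiscounted here)
    grad = np.zeros_like(theta)
    for s, a, _ in trajectory:
        probs = softmax_policy(theta, phi, s, actions)
        feats = np.array([phi(s, b) for b in actions])
        # For a softmax policy, grad_theta log pi_theta(a | s) = phi(s, a) - E_{b ~ pi}[phi(s, b)].
        grad += phi(s, a) - probs @ feats
    return theta + alpha * R * grad
```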
