Value Functions & Bellman Equations: UNIT-3
Introduction
Suppose you are in a new town with no map or GPS, and you need to reach downtown. You can try to assess your current position relative to your destination, as well as the effectiveness (value) of each direction you take. You can think of this as computing the value function. Or you can ask a local, and he tells you to go straight, turn left when you see a fountain, and continue until you reach downtown. He has given you a policy to follow. Naturally, in this case, following the given policy is much less complicated than computing the value function on your own.
As another example, consider that you are managing an inventory, and you decide that when the count of an item drops below a certain limit you issue a buy order to refill your stock. This policy is far simpler than studying the activity of the customers, and their buying habits and preferences, in order to predict the impact on your stock.
Sure enough, value functions can be used to determine a policy, as seen previously, but there are other methods that learn a policy that selects actions directly from a set of parameters, without consulting a value function (strictly speaking, a value function is often still used to improve accuracy). So the main idea is to be able to determine, at a state s, which action to take in order to maximize the reward.
The way to achieve this objective is to fine-tune a vector of parameters, noted 𝜽, in order to select the best action to take under policy 𝜋. The policy is noted 𝜋(a|s, 𝜽) = Pr{At = a | St = s, 𝜽t = 𝜽}, which means that the policy 𝜋 gives the probability of taking action a when in state s and the parameters are 𝜽.
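As a minimal illustrative sketch (not part of the original notes; the linear features x(s, a), their dimension, and the function name are assumptions), one common way to realize 𝜋(a|s, 𝜽) is a linear-softmax parameterization: compute a preference 𝜽ᵀx(s, a) for each action and turn the preferences into probabilities.

    import numpy as np

    def softmax_policy(theta, features):
        # pi(a|s, theta) for an assumed linear-softmax parameterization.
        # theta    : parameter vector, shape (d,)
        # features : x(s, a) for each action, shape (num_actions, d)
        # returns  : probability of each action, shape (num_actions,)
        prefs = features @ theta            # action preferences theta^T x(s, a)
        prefs -= prefs.max()                # subtract max for numerical stability
        exp_prefs = np.exp(prefs)
        return exp_prefs / exp_prefs.sum()

    # Illustrative usage: 3 actions, 4 hypothetical features per state-action pair.
    rng = np.random.default_rng(0)
    theta = rng.normal(size=4)
    x_sa = rng.normal(size=(3, 4))          # hypothetical features x(s, a)
    probs = softmax_policy(theta, x_sa)
    action = rng.choice(3, p=probs)         # sample At ~ pi(.|s, theta)

Changing 𝜽 changes the action probabilities, which is exactly the handle a policy-based method uses to improve behavior.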
Stochastic Policy
First, it is important to note that stochastic does not mean randomness in all states; the policy can be stochastic only in the states where it makes sense. Usually, maximizing reward leads to a deterministic policy. But in some cases deterministic policies are not a good fit for the problem. For example, in any two-player game, playing deterministically means the other player can come up with countermeasures and win all the time. In the Rock-Paper-Scissors game, if we play deterministically, meaning the same shape every time, then the other player can easily counter our policy and win every game.
So in this game the optimal policy is stochastic (play each shape with equal probability), which is better than any deterministic one.
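As a quick illustrative sketch (not from the original notes; the exploiting opponent, who counters the shape we have played most often, is an assumption made for the example), compare a deterministic Rock-Paper-Scissors policy with a uniformly random one:

    import random
    from collections import Counter

    SHAPES = ["rock", "paper", "scissors"]
    BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}  # BEATS[x] beats x

    def payoff(ours, theirs):
        # +1 win, 0 draw, -1 loss from our point of view.
        if ours == theirs:
            return 0
        return 1 if BEATS[theirs] == ours else -1

    def play(policy, episodes=10_000):
        # Opponent counters the shape we have played most often so far.
        seen, total = Counter(), 0
        for _ in range(episodes):
            ours = policy()
            theirs = BEATS[seen.most_common(1)[0][0]] if seen else random.choice(SHAPES)
            total += payoff(ours, theirs)
            seen[ours] += 1
        return total / episodes

    print(play(lambda: "rock"))                    # deterministic: average close to -1.0
    print(play(lambda: random.choice(SHAPES)))     # stochastic: average close to 0.0

The deterministic policy is fully exploited, while the uniformly random policy cannot be countered by any fixed strategy.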
Blueprint
Before delving into the details of the math and algorithms, it is useful to have an overview of how to proceed, a kind of blueprint:
1. Objective Function.
Find an objective function which can be used to assess the effectiveness of the policy. In other words, it tells how good the results given by the policy are.
2. Naive Algorithm.
Propose an algorithm that makes direct use of the policy in order to learn the parameters.
3. Improved Algorithms.
Find algorithms that improve on the objective function in order to maximize the effectiveness of the policy.
Value Functions
Almost all reinforcement learning algorithms are based on estimating value functions--
functions of states (or of state-action pairs) that estimate how good it is for the agent to
be in a given state (or how good it is to perform a given action in a given state). The notion
of "how good" here is defined in terms of future rewards that can be expected, or, to be
precise, in terms of expected return. Of course the rewards the agent can expect to receive
in the future depend on what actions it will take. Accordingly, value functions are defined
with respect to particular policies.
Recall that a policy, 𝜋, is a mapping from each state, s, and action, a, to the probability 𝜋(a|s) of taking action a when in state s. Informally, the value of a state s under a policy 𝜋, denoted v𝜋(s), is the expected return when starting in s and following 𝜋 thereafter. For MDPs, we can define v𝜋(s) formally as

v𝜋(s) = E𝜋[ Gt | St = s ] = E𝜋[ Σ_{k=0}^∞ γ^k R_{t+k+1} | St = s ]      (1)

where E𝜋[·] denotes the expected value given that the agent follows policy 𝜋, and t is any time step. Note that the value of the terminal state, if any, is always zero. We call the function v𝜋 the state-value function for policy 𝜋.
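As a minimal illustrative sketch of equation (1) (not part of the original text; the two-state MDP, its dynamics, and the policy are invented for the example), v𝜋(s) can be estimated by averaging many sampled returns Gt obtained while following 𝜋:

    import random

    # A tiny invented MDP: states 0 and 1, terminal state 2, two actions per state.
    # transitions[state][action] = list of (probability, next_state, reward)
    transitions = {
        0: {0: [(1.0, 1, 0.0)], 1: [(0.5, 1, 1.0), (0.5, 2, 0.0)]},
        1: {0: [(1.0, 2, 2.0)], 1: [(1.0, 0, 0.0)]},
    }
    policy = {0: [0.5, 0.5], 1: [0.9, 0.1]}   # pi(a|s) as action probabilities

    def sample_return(state, gamma=0.9):
        # One sampled return G_t starting from `state` and following pi.
        g, discount = 0.0, 1.0
        while state != 2:                                     # state 2 is terminal
            action = random.choices([0, 1], weights=policy[state])[0]
            probs, nexts, rewards = zip(*transitions[state][action])
            idx = random.choices(range(len(probs)), weights=probs)[0]
            g += discount * rewards[idx]
            discount *= gamma
            state = nexts[idx]
        return g

    # Monte Carlo estimate of v_pi(0): the average of many sampled returns, as in (1).
    v0 = sum(sample_return(0) for _ in range(20_000)) / 20_000
    print(round(v0, 2))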
Similarly, we define the value of taking action a in state s under a policy 𝜋, denoted q𝜋(s, a), as the expected return starting from s, taking the action a, and thereafter following policy 𝜋:

q𝜋(s, a) = E𝜋[ Gt | St = s, At = a ] = E𝜋[ Σ_{k=0}^∞ γ^k R_{t+k+1} | St = s, At = a ]

We call q𝜋 the action-value function for policy 𝜋. A fundamental property of value functions is that they satisfy a recursive relationship: for any policy 𝜋 and any state s, the value of s can be written in terms of the values of its possible successor states,

v𝜋(s) = Σ_a 𝜋(a|s) Σ_{s′} p(s′|s, a) [ r(s, a, s′) + γ v𝜋(s′) ]      (2)

where it is implicit that the actions, a, are taken from the set A(s), and the next states, s′, are taken from the set S, or from S+ in the case of an episodic problem. Equation (2) is the Bellman equation for v𝜋. It expresses a relationship between the value of a state and the values of its successor states. Think of looking ahead from one state to its possible successor states, as suggested by Figure 1a. Each open circle represents a state and each solid circle represents a state-action pair. Starting from state s, the root node at the top, the agent could take any of some set of actions--three are shown in Figure 1a. From each of these, the environment could respond with one of several next states, s′, along with a reward, r. The Bellman equation (2) averages over all the possibilities, weighting each by its probability of occurring. It states that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way.
The value function v𝜋 is the unique solution to its Bellman equation. We show in subsequent chapters how this Bellman equation forms the basis of a number of ways to compute, approximate, and learn v𝜋. We call diagrams like those shown in
Figure 1 backup diagrams because they diagram relationships that form the basis of the
update or backup operations that are at the heart of reinforcement learning methods.
These operations transfer value information back to a state (or a state-action pair) from
its successor states (or state-action pairs). We use backup diagrams throughout the book
to provide graphical summaries of the algorithms we discuss. (Note that unlike transition
graphs, the state nodes of backup diagrams do not necessarily represent distinct states;
for example, a state might be its own successor. We also omit explicit arrowheads because
time always flows downward in a backup diagram.)
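A minimal sketch of the backup operation just described (not from the original text; the three-state MDP, with a single action per state, is invented for illustration): repeatedly replacing v(s) by the right-hand side of equation (2), known as iterative policy evaluation, converges to v𝜋.

    # States 0 and 1, terminal state 2; one action per state to keep the sketch tiny.
    # p[s][a] = {s': probability}, r[s][a][s'] = reward for that transition.
    p = {0: {0: {1: 1.0}}, 1: {0: {0: 0.3, 2: 0.7}}}
    r = {0: {0: {1: 1.0}}, 1: {0: {0: 0.0, 2: 5.0}}}
    pi = {0: {0: 1.0}, 1: {0: 1.0}}                  # pi(a|s)
    gamma = 0.9

    v = {0: 0.0, 1: 0.0, 2: 0.0}                     # the terminal state stays at 0
    for _ in range(1000):                            # sweep until the values stop changing
        delta = 0.0
        for s in (0, 1):
            new_v = sum(pi[s][a] * sum(p[s][a][s2] * (r[s][a][s2] + gamma * v[s2])
                                       for s2 in p[s][a])
                        for a in pi[s])              # right-hand side of equation (2)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < 1e-8:
            break
    print(v)

Each sweep transfers value information back from successor states, exactly the backup operation the diagrams depict.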
Figure 1: Backup diagrams for (a) v𝜋 and (b) q𝜋.
Figure 2: Grid example: (a) exceptional reward dynamics; (b) state-value function for the equiprobable random policy.
Suppose the agent selects all four actions with equal probability in all states. Recall that in this gridworld, every action from the special state A yields a reward of +10 and takes the agent to A′, every action from B yields +5 and takes it to B′, actions that would move the agent off the grid leave its location unchanged and yield −1, and all other actions yield 0. Figure 2b shows the value function, v𝜋, for this policy, for the discounted reward case with γ = 0.9. This value function was computed by solving the system of equations (2). Notice the negative values near the lower edge; these are the result of the high probability of hitting the edge of the grid there under the random policy. State A is the best state to be in under this policy, but its expected return is less than 10, its immediate reward, because from A the agent is taken to A′, from which it is likely to run into the edge of the grid. State B, on the other hand, is valued more than 5, its immediate reward, because from B the agent is taken to B′, which has a positive value. From B′ the expected penalty (negative reward) for possibly running into an edge is more than compensated for by the expected gain for possibly stumbling onto A or B.
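As an illustrative sketch of "solving the system of equations (2)" (not part of the original text; to stay self-contained it uses a plain n×n grid with a −1 reward for bumping into an edge and 0 otherwise, omitting the special A and B transitions of the figure), the Bellman equations for the equiprobable random policy form a linear system that can be solved directly:

    import numpy as np

    def random_policy_values(n=4, gamma=0.9):
        # Solve v = r_pi + gamma * P_pi v for the equiprobable random policy
        # on an assumed n x n grid: -1 for bumping into an edge, 0 otherwise.
        num_states = n * n
        P = np.zeros((num_states, num_states))   # P_pi[s, s'] under the random policy
        r = np.zeros(num_states)                  # expected immediate reward r_pi[s]
        moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        for i in range(n):
            for j in range(n):
                s = i * n + j
                for di, dj in moves:
                    ni, nj = i + di, j + dj
                    if 0 <= ni < n and 0 <= nj < n:
                        P[s, ni * n + nj] += 0.25
                    else:                          # bumping into the edge: stay put, reward -1
                        P[s, s] += 0.25
                        r[s] += 0.25 * (-1)
        # The Bellman equations (2) in matrix form: (I - gamma * P_pi) v = r_pi
        return np.linalg.solve(np.eye(num_states) - gamma * P, r).reshape(n, n)

    print(np.round(random_policy_values(), 1))

As in the figure, the values are most negative near the edges and corners, where the random policy bumps into the boundary most often.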
Figure 3: A golf example: the state-value function for putting (above) and the optimal action-value function for using the driver (below).
Example 2: Golf To formulate playing a hole of golf as a reinforcement learning task, we count a penalty (negative reward) of −1 for each stroke until we hit the ball into the hole. The state is the location of the ball. The value of a state is the negative of the number of strokes to the hole from that location. Our actions are how we aim and swing at the ball, of course, and which club we select. Let us take the former as given and consider just the choice of club, which we assume is either a putter or a driver. The upper part of Figure 3 shows a possible state-value function, vputt(s), for the policy that always uses the putter. The terminal state in-the-hole has a value of 0. From anywhere on the green we assume we can make a putt; these states have value −1. Off the green we cannot reach the hole by putting, and the value is lower. If we can reach the green from a state by putting, then that state must have value one less than the green's value, that is, −2. For simplicity, let us assume we can putt very precisely and deterministically, but with a limited range. This gives us the sharp contour line labeled −2 in the figure; all locations between that line and the green require exactly two strokes to complete the hole. Similarly, any location within putting range of the −2 contour line must have a value of −3, and so on to get all the contour lines shown in the figure. Putting doesn't get us out of sand traps, so they have a value of −∞. Overall, it takes us six strokes to get from the tee to the hole by putting.
Exercise 1 What is the Bellman equation for action values, that is, for q𝜋? It must give the action value q𝜋(s, a) in terms of the action values, q𝜋(s′, a′), of possible successors to the state-action pair (s, a). As a hint, the backup diagram corresponding to this equation is given in Figure 1b. Show the sequence of equations analogous to (2), but for action values.
Exercise 2 The Bellman equation (2) must hold for each state for the value function v𝜋 shown in Figure 2b. As an example, show numerically that this equation holds for the center state, valued at +0.7, with respect to its four neighboring states, valued at +2.3, +0.4, −0.4, and +0.7. (These numbers are accurate only to one decimal place.)
Exercise 3 In the gridworld example, rewards are positive for goals, negative for running into the edge of the world, and zero the rest of the time. Are the signs of these rewards important, or only the intervals between them? Prove, using (3), that adding a constant c to all the rewards adds a constant, vc, to the values of all states, and thus does not affect the relative values of any states under any policies. What is vc in terms of c and γ?
Exercise 4 Now consider adding a constant to all the rewards in an episodic task, such
as maze running. Would this have any effect, or would it leave the task unchanged as in
the continuing task above? Why or why not? Give an example.
Exercise 5 The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action:
Give the equation corresponding to this intuition and diagram for the value at the root node, v𝜋(s), in terms of the value at the expected leaf node, q𝜋(s, a), given St = s. This expectation depends on the policy, 𝜋. Then give a second equation in which the expected value is written out explicitly in terms of 𝜋(a|s) such that no expected value notation appears in the equation.
Exercise 6 The value of an action, q𝜋(s, a), can be divided into two parts, the expected next reward, which does not depend on the policy 𝜋, and the expected sum of the remaining rewards, which depends on the next state and the policy. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state-action pair) and branching to the possible next states:
Give the equation corresponding to this intuition and diagram for the action value, q𝜋(s, a), in terms of the expected next reward, R_{t+1}, and the expected next state value, v𝜋(S_{t+1}), given that St = s and At = a. Then give a second equation, writing out the expected value explicitly in terms of p(s′|s, a) and r(s, a, s′), defined respectively by (6) and (7), such that no expected value notation appears in the equation.