Incremental Multi-Step Q-Learning
Jing Peng
College of Computer Science
Northeastern University
Boston, MA 02115
jp@ccs.neu.edu

Ronald J. Williams
College of Computer Science
Northeastern University
Boston, MA 02115
rjw@ccs.neu.edu
Abstract

This paper presents a novel incremental algorithm that combines Q-learning, a well-known dynamic programming-based reinforcement learning method, with the TD(λ) return estimation process, which is typically used in actor-critic learning, another well-known dynamic programming-based reinforcement learning method. The parameter λ is used to distribute credit throughout sequences of actions, leading to faster learning and also helping to alleviate the non-Markovian effect of coarse state-space quantization. The resulting algorithm, Q(λ)-learning, thus combines some of the best features of the Q-learning and actor-critic learning paradigms. The behavior of this algorithm is demonstrated through computer simulations of the standard benchmark control problem of learning to balance a pole on a cart.
1 INTRODUCTION
The incremental multi-step Q-learning (Q(λ)-learning) method is a new direct (or model-free) algorithm that extends the one-step Q-learning algorithm (Watkins 1989) by combining it with TD(λ) returns for general λ (Sutton 1988) in a natural way for delayed reinforcement learning. By allowing corrections to be made incrementally to the predictions of observations occurring in the past, the Q(λ)-learning method propagates information rapidly to where it is important. The Q(λ)-learning algorithm performs significantly better than the one-step Q-learning algorithm on a number of tasks, and its basis in the integration of one-step Q-learning and TD(λ) returns makes it possible to take advantage of some of the best features of the Q-learning and actor-critic learning paradigms and to bridge the gap between them. It can also serve as a
basis for developing various multiple time scale learning mechanisms that are essential for applications of
reinforcement learning to real world problems.
2 TD(λ) RETURNS
Direct dynamic programming-based reinforcement learning algorithms are based on updating state values or state-action values according to state transitions as they are experienced. Each such update is in turn based on the use of a particular choice of estimator for the value being updated. This section describes an important and computationally useful class of such estimators: the TD(λ) estimators (Sutton 1988, Watkins 1989).
Let the world state at time step t be x_t, and assume that the learning system then chooses action a_t. The immediate result is that a reward r_t is received by the learner and the world undergoes a transition to the next state, x_{t+1}. The objective of the learner is to choose actions maximizing discounted cumulative rewards over time. More precisely, let γ be a specified discount factor in [0, 1). The total discounted return (or simply return) received by the learner starting at time t is given by

$$r_t^\gamma = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^n r_{t+n} + \cdots$$
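For example, with γ = 0.9 and a constant reward of 1 at every time step, the return is $\sum_{n=0}^{\infty} 0.9^n = 1/(1-0.9) = 10$; the discount factor thus keeps the return finite whenever the rewards are bounded.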
The objective is to find a policy π, or rule for selecting actions, so that the expected value of the return is maximized. It is sufficient to restrict attention to policies that select actions based only on the current state (called stationary policies). For any such policy π and for any state x we define

$$V^\pi(x) = E\{\, r_0^\gamma \mid x_0 = x,\ a_i = \pi(x_i) \text{ for all } i \ge 0 \,\},$$

the expected total discounted return received when starting in state x and following policy π thereafter. If π* is an optimal policy we also use the notation V* for V^{π*}. Many dynamic programming-based reinforcement learning methods involve trying to estimate the state values V*(x) or V^π(x) for a fixed policy π.
An important class of methods for estimating V^π for a given policy π is the TD(λ) estimators, which have been investigated by Sutton (1984, 1988) and later by Watkins (1989). Following Watkins' notation, let r_t^{(n)} denote the corrected n-step truncated return for time t, given by

$$r_t^{(n)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n \hat{V}_{t+n}(x_{t+n}) \qquad (1)$$
where V̂_t is the estimate of V* at time t. If V̂ were equal to V*, then the corrected truncated returns would be unbiased estimators of V*. Watkins (1989) shows that corrected truncated returns have the error-reduction property, in that the expected value of the corrected truncated return is closer to V* than V̂ is. Sutton's TD(λ) return from time t is then

$$\begin{aligned}
r_t^\lambda &= (1-\lambda)\left[\, r_t^{(1)} + \lambda r_t^{(2)} + \lambda^2 r_t^{(3)} + \cdots \right] \\
&= (1-\lambda)(r_t + \gamma \hat{V}_t(x_{t+1})) + (1-\lambda)\lambda (r_t + \gamma r_{t+1} + \gamma^2 \hat{V}_{t+1}(x_{t+2})) + \cdots \\
&= r_t + \gamma(1-\lambda)\hat{V}_t(x_{t+1}) + \gamma\lambda \left[\, r_{t+1} + \gamma(1-\lambda)\hat{V}_{t+1}(x_{t+2}) + \gamma\lambda \left[\, r_{t+2} + \cdots \right.\right.
\end{aligned}$$

Intuitively, r_t^λ is a weighted average of corrected truncated returns in which the weight of r_t^{(n)} is proportional to λⁿ, where 0 < λ < 1. As a result, r_t^λ has the error-reduction property. In fact, it has been shown (Dayan 1992, Sutton 1988) that under certain conditions the expected value of r_t^λ converges to V^π.
The TD(λ) return can also be written recursively as

$$r_t^\lambda = r_t + \gamma(1-\lambda)\hat{V}_t(x_{t+1}) + \gamma\lambda\, r_{t+1}^\lambda \qquad (2)$$

The TD(0) return is then just

$$r_t^0 = r_t + \gamma \hat{V}_t(x_{t+1})$$

and the TD(1) return is

$$r_t^1 = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$$
which is the exact actual return. Watkins argues that, in a Markov decision problem, the choice of λ is a trade-off between bias and variance. Sutton's empirical demonstration (Sutton 1988) favors intermediate values of λ that are closer to 0. For further details, see (Sutton 1988, Watkins 1989).
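As a concrete illustration of the recursion in Equation (2), here is a minimal Python sketch (the function name and array conventions are our own assumptions, not part of the original) that computes TD(λ) returns backward over a finite episode:

```python
import numpy as np

def lambda_returns(rewards, v_next, gamma=0.95, lam=0.9):
    """Compute TD(lambda) returns for a finite episode via Equation (2):
        r_t^lam = r_t + gamma*(1-lam)*V(x_{t+1}) + gamma*lam*r_{t+1}^lam.
    rewards[t] is r_t and v_next[t] is the estimate V(x_{t+1});
    the value of the terminal state is taken to be zero."""
    T = len(rewards)
    returns = np.zeros(T)
    acc = 0.0  # r_{t+1}^lam, zero beyond the terminal state
    for t in reversed(range(T)):
        acc = rewards[t] + gamma * ((1 - lam) * v_next[t] + lam * acc)
        returns[t] = acc
    return returns
```

Setting λ = 0 reproduces the TD(0) return and λ = 1 the exact actual return, matching the limiting cases just described.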
3 ONE-STEP Q-LEARNING
One-step Q-learning of Watkins (1989), or simply Q-learning, is a simple incremental algorithm developed from the theory of dynamic programming (Ross 1983) for delayed reinforcement learning. In Q-learning, policies and the value function are represented by a two-dimensional lookup table indexed by state-action pairs. Formally, using notation consistent with that of the previous section, for each state x and action a let

$$Q^*(x, a) = E\{\, r_0 + \gamma V^*(x_1) \mid x_0 = x,\ a_0 = a \,\} \qquad (3)$$

$$= R(x, a) + \gamma \sum_y P_{xy}(a) V^*(y) \qquad (4)$$
where R(x, a) = E{r_0 | x_0 = x, a_0 = a}, and P_{xy}(a) is the probability of reaching state y as a result of taking action a in state x. It follows that

$$V^*(x) = \max_a Q^*(x, a) \qquad (5)$$
Intuitively, Equation (4) says that the state-action value Q*(x, a) is the expected total discounted return resulting from taking action a in state x and continuing with the optimal policy thereafter. More generally, the Q function can be defined with respect to an arbitrary policy π as

$$Q^\pi(x, a) = R(x, a) + \gamma \sum_y P_{xy}(a) V^\pi(y), \qquad (6)$$

and Q* is just Q^{π*} for an optimal policy π*.
The Q-learning algorithm works by maintaining an estimate of the Q* function, which we denote by Q̂, and adjusting Q̂ values (often just called Q-values) based on actions taken and rewards received. This is done using Sutton's prediction difference, or TD error (Sutton 1988): the difference between the immediate reward received plus the discounted value of the next state and the Q-value of the current state-action pair,

$$r + \gamma \hat{V}(y) - \hat{Q}(x, a) \qquad (7)$$

where r is the immediate reward, y is the next state resulting from taking action a in state x, and V̂(x) = max_a Q̂(x, a). The values of Q̂ are then adjusted according to

$$\hat{Q}(x, a) = (1 - \alpha)\hat{Q}(x, a) + \alpha (r + \gamma \hat{V}(y)) \qquad (8)$$

where α ∈ (0, 1] is a learning rate parameter. In terms of the notation described in the previous section, Equation (8) may be rewritten as

$$\hat{Q}(x, a) = (1 - \alpha)\hat{Q}(x, a) + \alpha r^0 \qquad (9)$$
That is, the Q-learning method uses the TD(0) return as its estimator of expected returns. Note that the current estimate of the Q function implicitly defines a greedy policy by π(x) = arg max_a Q̂(x, a). That is, the greedy policy selects the actions with the largest estimated Q-values.
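As a concrete illustration, a minimal tabular sketch of the update in Equation (8) might look as follows in Python (the dictionary representation and the names are our own assumptions):

```python
from collections import defaultdict

q = defaultdict(float)       # Q-hat, keyed by (state, action) pairs
ALPHA, GAMMA = 0.2, 0.95     # learning rate and discount factor

def q_update(x, a, r, y, actions):
    """One-step Q-learning update, Equation (8).  `actions` is the set
    of actions available in the next state y."""
    v_y = max(q[(y, b)] for b in actions)   # V-hat(y) = max_b Q-hat(y, b)
    q[(x, a)] = (1 - ALPHA) * q[(x, a)] + ALPHA * (r + GAMMA * v_y)
```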
It is important to note that the one-step Q-learning method does not specify what actions the agent should take at each state as it updates its estimates. In fact, the agent may take whatever actions it pleases. This means that Q-learning allows arbitrary experimentation while at the same time preserving the current best estimate of states' values. This is possible because Q-learning constructs a value function on the state-action space, instead of the state space; it constructs a value function on the state space only indirectly. Furthermore, since this function is updated according to the ostensibly optimal choice of action at the following state, it does not matter what action is actually followed at that state. For this reason, the estimated returns in Q-learning are not contaminated by "experimental" actions (Watkins 1989), so Q-learning is not experimentation-sensitive.

On the other hand, because actor-critic learning updates the state value at any state based on the action actually selected, not on what would have been the optimal choice of action, it is experimentation-sensitive. In fact, in an actor-critic learning system, experimental actions always affect the agent's estimated returns, potentially hampering the agent's learning efficiency, as we shall see in Section 5.
To find the optimal Q function eventually, however, the agent must try out each action in every state many times. It has been shown (Watkins 1989, Watkins & Dayan 1992) that if Equation (8) is repeatedly applied to all state-action pairs in any order in which each state-action pair's Q-value is updated infinitely often, then Q̂ will converge to Q* and V̂ will converge to V* with probability 1, as long as α is reduced to 0 at a suitable rate.
Finally, Watkins (1989) has also described possible extensions to the one-step Q-learning method using different value estimators, such as r^λ for 0 < λ < 1, and he has illustrated the use of r^λ returns in Q-learning in his empirical demonstrations by memorizing past experiences and calculating these returns at the end of each learning period, where a learning period specifies the number of past experiences the agent needs to store. The following section derives a novel algorithm that enables this value estimation process to be carried out incrementally.
4 Q(λ)-LEARNING
This section derives the Q(λ)-learning algorithm, which combines TD(λ) returns for general λ with Q-learning in an incremental way. Note that in terms of the notation introduced here, one-step Q-learning is simply Q(0)-learning, making it a special case.

For simplicity, in what follows we drop the superscript π in V^π and assume that the given policy π is the agent's greedy policy. Now let

$$e_t = r_t + \gamma \hat{V}_t(x_{t+1}) - \hat{V}_t(x_t) \qquad (10)$$

and

$$e'_t = r_t + \gamma \hat{V}_t(x_{t+1}) - \hat{Q}_t(x_t, a_t) \qquad (11)$$

where V̂(x) = max_a Q̂(x, a). Then, if we use Equation (11) for one step and Equation (10) thereafter, the difference between the TD(λ) return of Equation (2) and the estimated Q-value can be written as
$$\begin{aligned}
r_t^\lambda - \hat{Q}_t(x_t, a_t)
&= \{\, r_t + \gamma(1-\lambda)\hat{V}_t(x_{t+1}) \\
&\quad + \gamma\lambda [\, r_{t+1} + \gamma(1-\lambda)\hat{V}_{t+1}(x_{t+2}) \\
&\quad + \gamma\lambda [\, r_{t+2} + \gamma(1-\lambda)\hat{V}_{t+2}(x_{t+3}) + \cdots \,\} - \hat{Q}_t(x_t, a_t) \\
&= \{\, r_t + \gamma\hat{V}_t(x_{t+1}) - \hat{Q}_t(x_t, a_t) \,\} \\
&\quad + \gamma\lambda [\, r_{t+1} + \gamma\hat{V}_{t+1}(x_{t+2}) - \hat{V}_{t+1}(x_{t+1}) + \hat{V}_{t+1}(x_{t+1}) - \hat{V}_t(x_{t+1}) \\
&\quad + \gamma\lambda [\, r_{t+2} + \cdots
\end{aligned}$$

Thus,

$$r_t^\lambda - \hat{Q}_t(x_t, a_t) = e'_t + \gamma\lambda\, e_{t+1} + (\gamma\lambda)^2 e_{t+2} + (\gamma\lambda)^3 e_{t+3} + \cdots + \sum_{n=1}^{\infty} (\gamma\lambda)^n \left[ \hat{V}_{t+n}(x_{t+n}) - \hat{V}_{t+n-1}(x_{t+n}) \right]. \qquad (12)$$
If the learning rate α is small, so that Q̂ is adjusted slowly, then the second summation on the right-hand side of the above equation will be small.
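As a numerical sanity check on Equation (12), the following sketch (our own; it assumes the estimates stay fixed over a finite episode with a terminal value of zero, so the second summation vanishes exactly) verifies that the identity then holds:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, lam, T = 0.95, 0.9, 8

# Fixed (non-learning) estimates for an episode of T transitions.
rewards = rng.normal(size=T)
v = rng.normal(size=T + 1); v[T] = 0.0   # V-hat(x_t); terminal value 0
q0 = rng.normal()                        # Q-hat(x_0, a_0)

# Left-hand side: TD(lambda) return from t=0 via Equation (2), minus Q-hat.
ret = 0.0
for t in reversed(range(T)):
    ret = rewards[t] + gamma * ((1 - lam) * v[t + 1] + lam * ret)
lhs = ret - q0

# Right-hand side of Equation (12): e'_0 + sum_n (gamma*lam)^n e_n.
# With fixed estimates the correction summation in (12) is zero.
e_prime = rewards[0] + gamma * v[1] - q0
e = rewards[1:] + gamma * v[2:] - v[1:T]          # e_1, ..., e_{T-1}
rhs = e_prime + sum((gamma * lam) ** (n + 1) * e[n] for n in range(T - 1))

assert np.isclose(lhs, rhs)
```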
1. Q̂(x, a) = 0 and Tr(x, a) = 0 for all x and a
2. Do forever:
   (a) x_t = the current state
   (b) Choose an action a_t that maximizes Q̂(x_t, a) over all a
   (c) Carry out action a_t in the world. Let the short-term reward be r_t, and the new state be x_{t+1}
   (d) e'_t = r_t + γ V̂_t(x_{t+1}) − Q̂_t(x_t, a_t)
   (e) e_t = r_t + γ V̂_t(x_{t+1}) − V̂_t(x_t)
   (f) For each state-action pair (x, a), do
       Tr(x, a) = γλ Tr(x, a)
       Q̂_{t+1}(x, a) = Q̂_t(x, a) + α Tr(x, a) e_t
   (g) Q̂_{t+1}(x_t, a_t) = Q̂_{t+1}(x_t, a_t) + α e'_t
   (h) Tr(x_t, a_t) = Tr(x_t, a_t) + 1

Figure 1: The Q(λ)-Learning Algorithm.
The Q(λ)-learning algorithm is summarized in Figure 1, where Tr(x, a) is the "activity" trace of state-action pair (x, a), corresponding to the "eligibility" trace described in (Barto, Sutton & Anderson 1983). This description does not specify how actions are chosen during experimentation. One sensible approach, which we used in the simulation experiments described below, is to choose actions randomly with a bias toward the ostensibly optimal action. In particular, one can choose actions randomly according to the Boltzmann distribution, with the probability of choosing action a in state x proportional to

exp(Q̂(x, a)/T)

where T is a temperature parameter (set equal to 1 in the experiments described below). Choosing actions probabilistically ensures the necessary exploration of different actions.
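To make the procedure concrete, the following Python sketch implements one iteration of the Figure 1 loop together with Boltzmann action selection. It is a minimal rendering under our own assumptions: dictionary-based tables, the same action set in every state, and the two-learning-rate variant used in Section 5.2 (α for the current pair, β for traced pairs); the function and variable names are ours.

```python
import math, random
from collections import defaultdict

ALPHA, BETA = 0.2, 0.1      # learning rates: current pair / traced pairs
GAMMA, LAM = 0.95, 0.9      # discount factor and lambda
TEMP = 1.0                  # Boltzmann temperature

q = defaultdict(float)      # Q-hat(x, a)
trace = defaultdict(float)  # Tr(x, a), the activity trace

def choose_action(x, actions):
    """Boltzmann exploration: P(a) proportional to exp(Q-hat(x,a)/TEMP)."""
    weights = [math.exp(q[(x, a)] / TEMP) for a in actions]
    return random.choices(actions, weights=weights)[0]

def v_hat(x, actions):
    return max(q[(x, a)] for a in actions)

def qlambda_step(x, a, r, y, actions):
    """One iteration of the Figure 1 loop, after taking action a in
    state x, receiving reward r, and observing next state y."""
    e_prime = r + GAMMA * v_hat(y, actions) - q[(x, a)]       # step (d)
    e = r + GAMMA * v_hat(y, actions) - v_hat(x, actions)     # step (e)
    for xa in list(trace):                                    # step (f)
        trace[xa] *= GAMMA * LAM
        q[xa] += BETA * trace[xa] * e
    q[(x, a)] += ALPHA * e_prime                              # step (g)
    trace[(x, a)] += 1.0                                      # step (h)

# Example wiring (hypothetical environment interface):
# a = choose_action(x, ACTIONS); r, y = env_step(x, a)
# qlambda_step(x, a, r, y, ACTIONS)
```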
The main difficulty associated with Q(λ)-learning in a Markov decision process is that rewards received after a non-greedy action cannot be used to evaluate the agent's greedy policy, since that is not the policy actually followed. In other words, Q(λ)-learning is experimentation-sensitive, assuming that λ > 0 is fixed. For a discussion of some ways around this difficulty, see (Watkins 1989).

Still another difficulty is that changes in Q̂ at each time step may affect r^λ, which will in turn affect Q̂, and so on. However, these effects may not be significant for small α, since they are proportional to α² (Peng 1993).
At each time step, the Q(λ)-learning algorithm loops through a set of state-action pairs that grows linearly with time. This can raise serious concerns on a serial machine, since in the worst case the algorithm may have to enumerate the entire state-action space. However, the number of state-action pairs for which actual updating is required can be kept at a manageable level by maintaining only those state-action pairs whose activity trace (γλ)ⁿ is significant, since this quantity declines exponentially when γλ < 1. Another approach is to implement a Q(λ)-learning system on a parallel machine in which each state-action pair is mapped onto a separate processor. This corresponds directly to the kind of neural network implementation first envisioned for the actor-critic approach (Barto, Sutton & Anderson 1983). A possible alternative way to use a parallel machine, appropriate when much of the state-action space may never be experienced, is to dynamically allocate processors to only those state-action pairs actually experienced. This is particularly appealing if the system exhibits reasonably focused behavior.
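A sketch of the truncation just described, continuing the previous sketch (the cutoff value and names are our own illustrative choices):

```python
TRACE_CUTOFF = 1e-4   # drop pairs whose activity trace has decayed away

def decay_traces(trace, e, beta, gamma, lam, q):
    """Step (f) of Figure 1, visiting only pairs whose trace is still
    significant; since each trace shrinks by gamma*lam < 1 per step,
    the active set stays small."""
    for xa in list(trace):
        trace[xa] *= gamma * lam
        if trace[xa] < TRACE_CUTOFF:
            del trace[xa]        # stop updating this pair
        else:
            q[xa] += beta * trace[xa] * e
```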
Finally, it is interesting to note that both Q(λ)-learning and actor-critic learning (Sutton 1984) use TD(λ) returns as their value estimators through a trace mechanism. It therefore seems reasonable to expect the Q(λ)-learning algorithm to exhibit the beneficial performance characteristics attributable to the use of TD(λ) returns for λ > 0, as illustrated in (Barto, Sutton & Anderson 1983, Sutton 1988). At the same time, both Q(λ)-learning and one-step Q-learning construct a value function on the state-action space rather than just the state space, making them both capable of discriminating between the effects of choosing different actions in each state. Thus, while Q(λ)-learning is experimentation-sensitive, unlike one-step Q-learning, it seems reasonable to expect it to be less so than actor-critic learning. Overall, then, Q(λ)-learning appears to incorporate some of the best features of the Q-learning and actor-critic learning paradigms into a single mechanism. Furthermore, it can be viewed as a potential bridge between them. The following section illustrates this further.
5 EXPERIMENTAL
DEMONSTRATION
This section describes an experimental evaluation of the Q(λ)-learning method described above in the familiar domain of the classic cart-pole problem. For comparison, the actor-critic learning system (Barto, Sutton & Anderson 1983) and the one-step Q-learning system (Watkins 1989) were also implemented. These algorithms were chosen for comparison because they are all direct (or model-free) approaches.
5.1 CART-POLE PROBLEM
Figure 2 illustrates the cart-pole problem. The cart is free to travel left or right along a one-dimensional bounded track. The pole is hinged to the top of the cart and is free to move in the vertical plane aligned with the track. The objective is to learn to push the cart left or right so as to keep the pole balanced more or less vertically above the cart, and also to keep the cart from colliding with the ends of the track.
Figure 2: The Cart-Pole System.
One can pose this as a learning problem in several ways. The formulation of the learning problem adopted here is the same as that described in (Barto, Sutton & Anderson 1983). That is, the learner has access to the state vector (x, ẋ, θ, θ̇) at each time step and can select one of two actions, a rightward or leftward force on the cart. The cart-pole system begins with θ = 0, θ̇ = 0, x = 0, and ẋ = 0. If the pole falls more than 12 degrees from vertical, or if the cart hits a track boundary, a failure is said to occur. All immediate rewards are zero except upon failure, when a reward of negative one is delivered. When a failure occurs, the cart-pole system is reset to the initial state, and a new attempt to balance the pole begins. A trial is a complete balancing attempt.
The representation used in all experiments reported here coincides with that used in (Barto, Sutton & Anderson 1983), with the state space quantized coarsely into 162 cells. For the actor-critic experiments, then, there were 162 actor entries (representing the policy in each cell) and 162 critic values, while the Q(λ)-learning algorithm and the one-step Q-learning algorithm each required maintaining 162 × 2 = 324 numbers. Euler's method with a time step of 0.02 second was used to numerically approximate the solution of the equations of motion of the cart-pole system given in (Barto, Sutton & Anderson 1983).
5.2 SIMULATION RESULTS
A series of runs of each learning method attempting to control the cart-pole system was carried out, where each run consisted of a sequence of trials. All table entries of each method were set to zero at the beginning of each run. Also, all the trace variables were set to zero at the start of each trial. Except for the initial conditions of the random number generator, identical parameter values were used for all runs. Each run consisted of a number of trials until the cart-pole system remained balanced for more than 100000 time steps (approximately 35 minutes of simulated real time), at which point the run was terminated. The balancing duration of the system was measured by interspersing test trials, with learning turned off, with the normal activities of the learning agent.

Table 1 shows the simulation results of the Q-learning algorithm, the actor-critic learning algorithm, and the Q(λ)-learning algorithm. The numbers in Table 1 are averages of performance over ten runs. For the implementation of the actor-critic learning system, the identical parameter values published in (Barto, Sutton & Anderson 1983) were used. For the Q-learning system, the learning rate value α = 0.2 was picked because it seemed to give the best result. A slightly different version of the Q(λ)-learning algorithm of Figure 1 was simulated in the experiment, in which the learning rate parameter for the current state-action pair and the one for the state-action pairs occurring in the past were allowed to be independent. The parameter values used in producing the Q(λ)-learning results of Table 1 were α = 0.2 (the learning rate for the current state-action pair), β = 0.1 (the learning rate for the state-action pairs occurring in the past), and λ = 0.9. These values were chosen because they seemed to give the best performance among the values tried. The discount factor, γ, was set to 0.95 for all three methods. Figure 3 shows the simulation results of the three learning systems in terms of the number of actual experiences and the number of trials, respectively, averaged over 10 runs.
Table 1: Summary of Simulation Results on the Cart-Pole Problem

    ALGORITHM                 NO. EXPERIENCES    NO. TRIALS
    Q-learning                142171             478
    Actor-critic learning     461583             108
    Q(λ)-learning             30571              95
Figure 3: Performance of the three learning systems, measured against the number of trials (top) and the number of experiences (bottom).
The Q(λ)-learning system achieved much better results than the Q-learning system, both in terms of the number of actual experiences, where an experience is a state transition occurring in the world, and in terms of the number of trials. As Table 1 shows, it took, on average, 30571 actual experiences (approximately 10 minutes of simulated real time) for the Q(λ)-learning algorithm to balance the pole for more than 100000 time steps, whereas the Q-learning algorithm took about 48 minutes. Furthermore, the Q(λ)-learning system tended to solve the problem before it had experienced 100 failures. In contrast, the Q-learning system experienced a much greater number of failures before it learned to solve the problem.
The significant performance improvement of the Q(λ)-learning system over the simple Q-learning system is clearly due to the use of the TD(λ) return estimation process, which has the effect of making alterations to past predictions throughout each trial. One natural explanation for the improved learning efficiency in this case is that TD(λ) propagates information backward much faster than does TD(0). For every action taken in the world, TD(λ) provides some degree of updating for essentially all recent predictions, while TD(0) updates only the most recent one. This is the reason that λ = 0.3 was found empirically to give optimal performance in the random-walk prediction problem studied by Sutton (1988). If this were the main benefit conferred by TD(λ), one might expect model-based, multiple-update methods like priority-Dyna (Peng 1993, Peng & Williams 1993, Moore & Atkeson 1994) to perform at least as well. However, additional experiments, whose details we omit here, were carried out using such techniques and found to give results significantly worse than those obtained using Q(λ)-learning. We believe the reason is that the coarse state-space quantization used here has the effect of making the environment non-Markovian, and increasing λ makes TD(λ) less sensitive to this non-Markovian effect.
The Q(λ)-learning system performed slightly better than the actor-critic learning system in terms of the number of trials, but significantly better in terms of the number of actual experiences. As Table 1 indicates, it took, on average, about 2.5 hours of simulated real time for the actor-critic learning system to learn to solve the problem, as compared with 10 minutes for the Q(λ)-learning system. The reason may be the following. Although both Q(λ)-learning and actor-critic learning use TD(λ) returns as their estimators, the ways in which the TD(λ) returns are computed in the two systems differ. Q(λ)-learning computes the TD(λ) return estimates using the current best estimated action values in the states actually visited, while actor-critic learning computes the TD(λ) returns using essentially an average of the values of actions taken so far. This means that Q(λ)-learning is less experimentation-sensitive than actor-critic learning. We believe that the efficient use of experience reflects, in general, the degree of accuracy to which the estimated returns are computed, and is thus critical for any learning system to be successful in more realistic environments.
A close look at Figure 3 reveals that Q(λ)-learning has a strikingly similar performance characteristic to actor-critic learning in terms of the number of trials, and a very similar performance characteristic to Q-learning in terms of the number of actual experiences. That is, Q(λ)-learning achieves the best of both the actor-critic and Q-learning paradigms, which is what it was designed to do.
Obviously, we cannot make general statements about the relative merits of these systems based solely on these experiments. Nevertheless, the argument made by Werbos (1990) may help explain these results to a certain extent.
It should be pointed out that both the fixed-period learning process of Watkins (1989), for sufficiently long learning periods, and the action-replay process of Lin (1992) produce the same beneficial effects as Q(λ)-learning. However, both of these approaches operate in "batch" mode, in that they replay, backwards, the memorized sequence of experiences that the learning agent has recently had.
6 CONCLUSION
The Q(λ)-learning algorithm is of interest because of its incrementality and its relationship to Q-learning (Watkins 1989) and actor-critic learning (Barto, Sutton & Anderson 1983, Sutton 1984). The experiments reported here demonstrate that the Q(λ)-learning algorithm inherits the best qualities of both the actor-critic learning algorithm and the Q-learning algorithm. However, this algorithm, unlike the one-step Q-learning algorithm, cannot be expected to converge to the correct Q* values under an arbitrary policy that tries every action in every state (although the obvious strategies of gradually reducing λ or gradually turning down the Boltzmann temperature as learning proceeds would probably allow such convergence). In spite of this, the Q(λ)-learning algorithm has always outperformed the one-step Q-learning algorithm on all the problems we have experimented with so far. This also means, however, that an analytic understanding of this algorithm will necessarily require more complex arguments than are used to prove convergence of one-step Q-learning.
From a computational standpoint, the incrementality of Q(λ)-learning makes it more attractive than Watkins' batch-mode learning and Lin's action-replay process, since the computation can be distributed more evenly over time and thus, under many circumstances, can ease overall demands on memory and speed. Similar arguments are made in (Sutton 1988). One additional pleasing characteristic of the Q(λ)-learning method is that it achieves greater computational efficiency without having to learn and use a model of the world (Peng 1993, Peng & Williams 1993, Sutton 1990), and it is well suited to parallel implementation.
This paper has examined only the Q(λ)-learning algorithm in which the TD(λ) returns are computed by taking the maximum Q-value at each state visited. There are other possibilities, however. For example, the algorithm may estimate the TD(λ) returns by using a fixed Q-value at each state. This amounts to estimating Q^π for a particular policy π. What is interesting here is that it can be shown (Peng 1993) that for a given policy π, Q(λ)-learning converges with probability one to the correct Q^π values under appropriate conditions. See also (Jaakkola, Jordan & Singh 1993).
Finally, it is clear that in continuous-time systems, or even systems where time is discrete but very fine-grained, the use of algorithms that propagate information back one step at a time can make no sense, or at least be of little value. In these cases the use of TD(λ) methods is not a luxury but a necessity. In general, λ can be viewed as a time-scale parameter in such situations, and we argue that a better understanding of its use in this regard is an important area for future research.
Acknowledgements
We wish to thank Rich Sutton for his many valuable
suggestions and continuing encouragement. This work
was supported by Grant IRI-8921275 from the National Science Foundation.
References
Barto, A. G., Sutton, R. S. & Anderson, C. W. (1983). Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics 13:835-846.
Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning 8:117-138.
Jaakkola, T., Jordan, M. I. & Singh, S. P. (1993). On
the convergence of stochastic iterative dynamic programming algorithms. ML93 Reinforcement Learning
Workshop.
Lin, L. J. (1992). Reinforcement learning for robots
using neural networks. Ph. D. Dissertation, Carnegie
Mellon University, PA.
Moore, A. W. & Atkeson, C. G. (1994). Prioritized
sweeping: reinforcement learning with less data and
less time. Machine Learning 13(1):103-130.
Peng, J. (1993). Efficient Dynamic Programming-Based Learning for Control. Ph. D. Dissertation, Northeastern University, Boston, MA.
Peng, J. & Williams, R. J. (1993). Efficient learning
and planning within the Dyna framework. Adaptive
Behavior 1(4):437-454.
Ross, S. (1983). Introduction to Stochastic Dynamic
Programming. New York, Academic Press.
Sutton, R. S. (1990). Integrated architectures for
learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning,
216-224.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3:9-44.
Sutton, R. S. (1984). Temporal credit assignment in
reinforcement learning. Ph. D. Dissertation, University of Massachusetts, Amherst (also COINS Technical
Report 84-02).
Watkins, C. J. C. H. & Dayan, P. (1992). Q-learning.
Machine Learning 8:279-292.
Watkins, C. J. C. H. (1989). Learning from delayed
rewards. Ph. D. Dissertation, King's College, UK.
Werbos, P. J. (1990). Consistency of HDP applied
to a simple reinforcement learning problem. Neural
Networks 3:179-189.