Neural Processing Letters 9: 119–127, 1999.
© 1999 Kluwer Academic Publishers. Printed in the Netherlands.
Training Reinforcement Neurocontrollers
Using the Polytope Algorithm
ARISTIDIS LIKAS and ISAAC E. LAGARIS
Department of Computer Science, University of Ioannina, P.O. Box. 1186 – GR 45110 Ioannina,
Greece. e-mail: arly@cs.uoi.gr
Abstract. A new training algorithm is presented for delayed reinforcement learning problems that
does not assume the existence of a critic model and employs the polytope optimization algorithm to
adjust the weights of the action network so that a simple direct measure of the training performance
is maximized. Experimental results from the application of the method to the pole balancing problem indicate improved training performance compared with critic-based and genetic reinforcement
approaches.
Key words: reinforcement learning, neurocontrol, optimization, polytope algorithm, pole balancing,
genetic reinforcement
1. Introduction
In the framework of delayed reinforcement learning, a system receives input from
its environment, decides for a proper sequence of actions, executes them, and
thereafter, receives a reinforcement signal, namely a grade for the decision it made.
A system at any instant is described by its so-called state variables. The objective
of a broad class of reinforcement problems is to learn how to control a system in
such a way that its state variables remain at all times within prescribed ranges.
If at any instant the system violates this requirement, it is penalized by
receiving a ‘bad grade’ signal, and hence its policy in making further decisions is
influenced accordingly.
There are many examples of this kind of problem, such as pole balancing, teaching an autonomous robot to avoid obstacles, the ball and beam problem
[11], etc.
In general, we can distinguish two kinds of approaches that have been developed
for delayed reinforcement problems [9]: the critic-based approaches and the direct
approaches. There is also the Q-learning approach [15], which exhibits many similarities with the critic-based ones. The best-studied critic-based approach is
the Adaptive Heuristic Critic (AHC) method [2, 1, 3], which assumes two separate models: an action model, which receives the current system state and selects the
action to be taken, and an evaluation model, which provides as output a prediction
e(x) of the evaluation of the current state x. The evaluation model is usually a
feedforward neural network trained using the method of temporal differences, i.e.,
it tries to minimize the error δ = e(x) − (r + γ e(y)) where y is the new state,
r the received reinforcement and γ a discount factor [3, 1, 2]. The action model
is also a feedforward network that provides as output a vector of probabilities
upon which the action selection is based. Both networks are trained on-line through
backpropagation using the same error value δ described previously.
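To make the critic update concrete, the following minimal sketch (not part of the original paper) computes the temporal-difference error defined above; `evaluate` stands for a generic evaluation model and `failed` for the end-of-cycle condition, both of which are illustrative assumptions.

```python
def td_error(evaluate, x, y, r, gamma, failed=False):
    """Temporal-difference error delta = e(x) - (r + gamma * e(y)).

    `evaluate` is the critic's prediction e(.), `x` the previous state, `y` the
    new state, `r` the received reinforcement and `gamma` the discount factor.
    When a failure ends the cycle, the prediction for the next state is dropped,
    since no further reinforcement follows.
    """
    target = r if failed else r + gamma * evaluate(y)
    return evaluate(x) - target
```

Both the action and the evaluation networks are then updated on-line by backpropagating this same error.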
The direct approach to delayed reinforcement learning problems considers reinforcement learning as a general optimization problem with an objective function
having a straightforward formulation but being difficult to optimize [9]. In such
a case, only the action model is needed to provide the action policy, and optimization techniques must be employed to adjust the parameters of the action model
so that a stochastic integer-valued function is maximized. This function is actually
proportional to the number of successful decisions (i.e. actions that do not lead to
the receipt of a penalty signal). A previous direct approach to delayed reinforcement
problems employs real-valued genetic algorithms to perform the optimization task
[16]. In the present study we propose another optimization strategy that is based
on the polytope method with random restarts. Details concerning such an approach
are presented in the next section, while section 3 provides experimental results
from the application of the proposed method to the pole balancing problem and
compares its performance against that of the AHC method and of the evolutionary
approach.
2. The Proposed Training Algorithm
As already mentioned, the proposed method belongs to the category of direct
approaches to delayed reinforcement problems. Therefore, only an action model
is considered, which in our case has the architecture of a multilayer perceptron with
input units accepting the system state at each time instant and sigmoid output units
providing output values p_i in the range (0, 1). The decision for the action to be
taken from the values of p_i can be made either stochastically or deterministically.
For example, in the case of a single output unit, the value p may represent the probability
that the final output will be one or zero, or the final output may be obtained deterministically using the rule: if p > 0.5 the final output will be one, otherwise it will be
zero. Learning proceeds in cycles, with each cycle starting with the system placed
at a random initial position and ending with a failure signal. Since our objective
is to train the network so that the system ideally never receives a failure signal,
the number of time steps of the cycle (i.e. its length) constitutes the performance
measure to be optimized. Consequently, the training problem can be considered as
a function optimization problem with the adjustable parameters being the weights
and biases of the action network and with the function value being the length of a
cycle obtained using the current weight values. In practice, when the length of a
cycle exceeds a preset maximum number of steps, we consider that the controller
has been adequately trained. This is used as a criterion for terminating the training
process. The training also terminates if the number of unsuccessful cycles (i.e.
function evaluations without reaching maximum value) exceeds a preset upper
bound.
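As a minimal sketch of this objective function (not taken from the original text; `make_controller`, `reset_system`, `step_system` and `failure` are hypothetical hooks standing for the action network and the simulated plant), a single function evaluation could look as follows:

```python
def cycle_length(weights, make_controller, reset_system, step_system, failure,
                 max_steps=120_000):
    """Objective value for a weight vector: the number of time steps survived in
    one cycle, starting from a random initial state and ending at the first
    failure signal (or at max_steps, at which point training is complete)."""
    controller = make_controller(weights)     # hypothetical: builds the action network
    state = reset_system()                    # hypothetical: random initial state
    for t in range(max_steps):
        action = controller(state, t)         # action chosen from the network output
        state = step_system(state, action)    # hypothetical: one simulation step
        if failure(state):                    # failure signal ends the cycle
            return t + 1
    return max_steps
```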
Obviously, the function to be optimized is integer-valued, thus it is not possible
to define derivatives. Therefore, traditional gradient-based optimization techniques
cannot be employed. Moreover, the function possesses an amount of random noise
since the initial state specification as well as the action selection at the early steps
are performed at random. On one hand the incorporation of this random noise
may disrupt the optimization process. For example, if the evaluation of the same
network is radically different at different times, then the learning process will be
misled. On the other hand, the global search certainly benefits from it; hence,
the noise should be retained, but kept under control. The direct approach
has certain advantages, which we summarize in the following list.
• Instead of using an on-line update strategy for the action network, we perform
updates only at the end of each cycle. Therefore, the policy of the action
network is not affected in the midst of a cycle (during which the network
actually performs well). The continuous on-line adjustment of the weights of
the action network may lead, due to overfitting, to the corruption of correct
policies that the system has acquired so far [10].
• Several sophisticated, derivative-free, multidimensional optimization techniques
may be employed instead of the naive stochastic gradient descent.
• Stochastic action selection is not necessary (except only at the early steps of
each cycle). In fact, stochastic action selection may cause problems, since it
may lead to choices that are not suggested by the current policy [10].
• There is no need for a critic. The absence of a critic and the small number of
weight updates contribute to the increase of the training speed.
The main disadvantage of the direct approach is that its performance relies mainly
on the effectiveness of the optimization strategy used. Due to the characteristics of
the function to be optimized, one cannot be certain that an arbitrary optimization
approach will be suitable for training.
As already stated, a previous reinforcement learning approach that follows a
direct strategy employs optimization techniques based on genetic algorithms and
provides very good results in terms of training speed (required number of cycles)
[16]. In this work, we present a different optimization strategy based on the polytope algorithm [12, 8, 13], which is described next.
2.1. The Polytope Algorithm
The Polytope algorithm belongs to the class of direct search methods for non-linear
optimization. It is also known by the name Simplex; however, it should not be
confused with the well-known Simplex method of linear programming. Originally,
this algorithm was designed by Spendley et al. [14] and was refined later by Nelder
and Mead [12]. A polytope (or simplex) in R^n is a construct with (n + 1) vertices
(points in R^n) defining a volume element. For instance, in two dimensions the
simplex is a triangle, in three dimensions it is a tetrahedron, and so on. In our case
each vertex point wi = (wi1, . . . , win) describes the n parameters (weights and
thresholds) of an action network.
The input to the algorithm, apart from a few parameters of minor importance,
is an initial simplex, i.e., (n + 1) points wi. The algorithm brings the simplex into the
area of a minimum, adapts it to the local geometry, and finally shrinks it around
the minimizer. It is a derivative-free, iterative method that proceeds towards the
minimum by manipulating a population of n + 1 points (the simplex vertices) and
hence is expected to be tolerant to noise, despite its deterministic nature. The
steps taken in each iteration are described below. (We denote by f the objective
function and by wi the simplex vertices).
1. Examine the termination criteria to decide whether or not to stop.
2. Number the simplex vertices wi, so that the sequence fi = f(wi) is sorted in
   ascending order.
3. Calculate the centroid of the first n vertices: c = (1/n) Σ_{i=0}^{n−1} wi
4. Invert the ‘worst’ vertex wn as: r = c + α(c − wn) (usually α = 1)
5. If f0 ≤ f(r) ≤ fn−1 then
       set wn = r, fn = f(r), and go to step 1
   endif
6. If f(r) < f0 then
       Expand as: e = c + γ(r − c) (γ > 1, usually γ = 2)
       If f(e) < f(r) then
           set wn = e, fn = f(e)
       else
           set wn = r, fn = f(r)
       endif
       go to step 1
   endif
7. If f(r) ≥ fn−1 then
       If f(r) ≥ fn then
           contract as: k = c + β(wn − c) (β < 1, usually β = 1/2)
       else
           contract as: k = c + β(r − c)
       endif
       If f(k) < min(f(r), fn) then
           set wn = k, fn = f(k)
       else
           Shrink the whole polytope as:
           Set wi = (1/2)(w0 + wi), fi = f(wi) for i = 1, 2, . . . , n
       endif
       go to step 1
   endif
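A compact rendering of one iteration of the above listing is given below as an illustration; it assumes minimization, so for the present problem one would minimize the negative of the cycle length. The array layout and helper names are our own choices, not part of the original implementation (which used the MERLIN package, see Section 3).

```python
import numpy as np

def polytope_step(W, f_vals, f, alpha=1.0, gamma=2.0, beta=0.5):
    """One iteration of the polytope algorithm on the (n+1) x n vertex array W
    with cached objective values f_vals[i] = f(W[i]).  Minimization is assumed,
    so for cycle-length maximization one would pass f = lambda w: -cycle_length(w)."""
    W = np.asarray(W, dtype=float)
    f_vals = np.asarray(f_vals, dtype=float)
    order = np.argsort(f_vals)                   # step 2: sort vertices in ascending order
    W, f_vals = W[order], f_vals[order]
    n = W.shape[1]
    c = W[:n].mean(axis=0)                       # step 3: centroid of the first n vertices
    r = c + alpha * (c - W[n])                   # step 4: invert (reflect) the worst vertex
    fr = f(r)
    if f_vals[0] <= fr <= f_vals[n - 1]:         # step 5: accept the reflected point
        W[n], f_vals[n] = r, fr
    elif fr < f_vals[0]:                         # step 6: expansion
        e = c + gamma * (r - c)
        fe = f(e)
        W[n], f_vals[n] = (e, fe) if fe < fr else (r, fr)
    else:                                        # step 7: contraction
        k = c + beta * ((W[n] if fr >= f_vals[n] else r) - c)
        fk = f(k)
        if fk < min(fr, f_vals[n]):
            W[n], f_vals[n] = k, fk
        else:                                    # shrink the whole polytope towards the best vertex
            W[1:] = 0.5 * (W[0] + W[1:])
            f_vals[1:] = [f(w) for w in W[1:]]
    return W, f_vals
```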
In essence, the polytope algorithm considers at each step a population of (n + 1)
action networks whose weight vectors wi are properly adjusted in order to obtain an
action network with high evaluation. In this sense, the polytope algorithm, although
developed earlier, exhibits an analogy with genetic algorithms which are also based
on the recombination of a population of points.
The initial simplex may be constructed in various ways. The approach we followed was to pick the first vertex at random. The rest of the vertices were obtained
by line searches originating at the first vertex, along each of the n directions. This
initialization scheme proved to be very effective for the pole balancing problem. Other
schemes, such as random initial vertices or constrained random vertices along predefined directions, did not work well. The termination criterion relies on
comparing a measure of the polytope’s ‘error’ against a user-preset small positive
number ε. Specifically, the algorithm returns if (1/(n+1)) Σ_{i=0}^{n} |fi − f̄| ≤ ε, where
f̄ = (1/(n+1)) Σ_{i=0}^{n} fi.
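The following sketch illustrates the two ingredients just described, under the assumption of a simple coarse line search (the exact line-search routine used is not specified in the text) and of the mean-absolute-deviation termination measure:

```python
import numpy as np

def initial_simplex(f, w0, step=1.0, trials=(0.25, 0.5, 1.0, 2.0)):
    """Build the (n+1) starting vertices: the first vertex w0 is given, and each
    of the remaining n vertices is the best point found by a coarse line search
    from w0 along one coordinate direction (the trial step sizes are illustrative)."""
    w0 = np.asarray(w0, dtype=float)
    n = len(w0)
    vertices = [w0]
    for i in range(n):
        d = np.zeros(n)
        d[i] = step                               # search direction: i-th coordinate axis
        candidates = [w0 + t * d for t in trials] + [w0 - t * d for t in trials]
        vertices.append(min(candidates, key=f))   # keep the best point found on this line
    return np.array(vertices)

def polytope_converged(f_vals, eps=1e-3):
    """Termination measure: mean absolute deviation of the vertex values from
    their average, compared against a small user-preset positive number eps."""
    f_bar = np.mean(f_vals)
    return np.mean(np.abs(np.asarray(f_vals) - f_bar)) <= eps
```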
The use of the polytope algorithm has certain advantages like robustness in
the presence of noise, simple implementation and derivative-free operation. These
characteristics make this algorithm a suitable candidate for use as an optimization
tool in a direct reinforcement learning scheme. Moreover, since the method is deterministic, its effectiveness depends partly on the initial weight values. For this
reason, our training strategy employs the polytope algorithm with random restarts
as it will become clear in the application presented in the next section.
It must also be stressed that the proposed technique does not make any assumption concerning the architecture of the action network (which in the case described
here is a multilayer perceptron), and can be used with any kind of parameterized action model (e.g. the fuzzy-neural action model employed in the GARIC
architecture [4]).
3. Application to the Pole Balancing Problem
The pole balancing problem constitutes the best-studied reinforcement learning
application. It consists of a single pole hinged on a cart that may move left or right
on a horizontal track of finite length. The pole has only one degree of freedom
(rotation about the hinge point). The control objective is to push the cart either left
or right with a force so that the pole remains balanced and the cart is kept within
the track limits.
Four state variables are used to describe the status of the system at each time
instant: the horizontal position of the cart (x), the cart velocity (ẋ), the angle of
the pole (θ) and the angular velocity (θ̇ ). At each step the action network must
decide the direction and magnitude of the force F to be exerted on the cart. Details
concerning the equations of motion of the cart-pole system can be found in [2, 16,
11]. Through Euler’s approximation method we can simulate the cart-pole system
using discrete-time equations with time step Δτ = 0.02 sec. We assume that the
system’s equations of motion are not known to the controller, which perceives only
the state vector at each time step. Moreover, we assume that a failure occurs when
|θ| > 12 degrees or |x| > 2.4 m, and that training has been successfully completed
if the pole remains balanced for more than 120000 consecutive time steps. Two
versions of the problem exist concerning the magnitude of the applied force F . We
are concerned with the case where the magnitude is fixed and the controller must
decide only the direction of the force at each time step. Obviously, the control
problem is more difficult compared to the case where any value for the magnitude
is allowed. Therefore, comparisons will be presented only with fixed magnitude
approaches and we will not consider architectures like the RFALCON [11], which
are very efficient but assume continuous values for the force magnitude.
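For completeness, a simulation sketch is given below. The equations of motion are not reproduced in this paper; the code uses the standard cart-pole dynamics and constants from the literature [2] (these values should be treated as assumptions), together with the time step and failure criteria stated above.

```python
import numpy as np

# Standard cart-pole constants from the literature [2]; they are assumptions here,
# since the paper does not reproduce the equations of motion.
GRAVITY, CART_MASS, POLE_MASS, HALF_POLE_LEN = 9.8, 1.0, 0.1, 0.5
DT = 0.02                                       # Euler time step (sec)

def cartpole_step(state, force):
    """One Euler step for the state vector (x, x_dot, theta, theta_dot)."""
    x, x_dot, theta, theta_dot = state
    total_mass = CART_MASS + POLE_MASS
    sin_t, cos_t = np.sin(theta), np.cos(theta)
    temp = (force + POLE_MASS * HALF_POLE_LEN * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_POLE_LEN * (4.0 / 3.0 - POLE_MASS * cos_t ** 2 / total_mass))
    x_acc = temp - POLE_MASS * HALF_POLE_LEN * theta_acc * cos_t / total_mass
    return np.array([x + DT * x_dot,
                     x_dot + DT * x_acc,
                     theta + DT * theta_dot,
                     theta_dot + DT * theta_acc])

def failure(state):
    """Failure signal: pole angle beyond 12 degrees or cart outside the track limits."""
    x, _, theta, _ = state
    return abs(theta) > np.radians(12.0) or abs(x) > 2.4
```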
The polytope method is embedded in the MERLIN package [6, 7] for multidimensional optimization. Other derivative-free methods provided by MERLIN have
been tested on the pole-balancing example (random, roll [6]), but the results were
not satisfactory. On the contrary, the polytope algorithm was very effective, being
able to balance the pole in a relatively small number of cycles (function evaluations),
which was less than 1000 in many cases. As mentioned in the previous section,
the polytope method is deterministic, thus its effectiveness depends partly on the
initial weight values. For this reason we have employed an optimization strategy
that is based on the polytope algorithm with random restarts. Each run starts by
randomly specifying an initial point in the weight space and constructing the initial
polytope by performing line minimizations along each of the n directions. Next,
the polytope algorithm is run for up to 100 function evaluations (cycles) and the
optimization progress is monitored. If a cycle has been found to last more than
100 steps, application of the polytope algorithm continues for an additional 750
cycles, otherwise we consider that the initial polytope was not proper and a random
restart takes place. A random restart is also performed when after the additional
750 function evaluations the algorithm has not converged, i.e., a cycle has not been
encountered to last for more than 120000 steps (this maximum value is suggested in
[1, 16]). In the experiments presented in this article a maximum of 15 restarts was
allowed. The strategy was considered to have terminated unsuccessfully if 15 unsuccessful
restarts were performed or the total number of function evaluations exceeded
15000.
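A sketch of this restart schedule is shown below; it reuses hypothetical `cycle_length`, `initial_simplex` and `polytope_step` helpers such as those sketched earlier (with `cycle_length` wrapped so that it takes only a weight vector) and, for simplicity, counts one function evaluation per polytope iteration, although a shrink step actually costs n extra evaluations.

```python
import numpy as np

def train_with_restarts(n_weights, cycle_length, initial_simplex, polytope_step,
                        max_restarts=15, max_total_evals=15_000):
    """Polytope training with random restarts, following the schedule in the text:
    monitor the first 100 evaluations, continue for 750 more only if some cycle
    already lasted more than 100 steps, and declare success at 120000 steps."""
    def neg(w):                                        # the polytope minimizes, so negate
        return -cycle_length(w)
    total_evals = 0
    for _ in range(max_restarts):
        w0 = np.random.uniform(-0.5, 0.5, n_weights)   # random initial weights
        W = initial_simplex(neg, w0)                   # vertices via line searches
        f_vals = np.array([neg(w) for w in W])
        best = 0
        for it in range(100 + 750):
            W, f_vals = polytope_step(W, f_vals, neg)
            total_evals += 1                           # roughly one evaluation per iteration
            best = max(best, int(-f_vals.min()))
            if best >= 120_000:                        # a successful cycle: training done
                return W[np.argmin(f_vals)]
            if it + 1 == 100 and best <= 100:          # unpromising initial polytope: restart
                break
            if total_evals >= max_total_evals:
                return None                            # strategy terminated unsuccessfully
    return None
```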
The above strategy was implemented using the MCL programming language
[5] that is part of the MERLIN optimization environment. The initial weight values
at each restart were randomly selected in the range (−0.5, 0.5). Experiments that
considered the ranges (−1.0, 1.0) and (−2.0, 2.0) were also conducted and the
obtained results were similar, showing that the method exhibits robustness as far as
the initial weights are concerned.
Table I. Training performance in terms of required number of training cycles

                       Number of Cycles
Method        Best     Worst     Mean      SD
Polytope       217     10453     2250     1955
AHC           4123     12895     6175     2284
GA-100         886     11481     4097     2205
For comparison purposes, the action network also had the same architecture as
that reported in [16, 2]. It is a multilayer perceptron with four input
units (accepting the system state), one hidden layer with five sigmoid units and
one sigmoid unit in the output layer. There are also direct connections from the
input units to the output unit. The specification of the applied force characteristics
from the output value y ∈ (0, 1) was performed in the following way. At the first
ten steps of each cycle the specification was probabilistic, i.e., F = 10N with
probability equal to y (and F = −10N otherwise). At the remaining steps the specification was deterministic,
i.e., if y > 0.5 then F = 10N, otherwise F = −10N. In this way, a degree of
randomness is introduced in the function evaluation process that assists in escaping
from plateaus and shallow local minima.
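The following sketch illustrates the network and the force-selection rule just described; the packing of the 35 free parameters into a flat weight vector is an arbitrary illustrative choice, not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unpack(weights):
    """Split a flat vector of the 35 free parameters into the weights of the
    4-5-1 perceptron with direct input-to-output connections."""
    w = np.asarray(weights, dtype=float)
    W1, b1 = w[:20].reshape(5, 4), w[20:25]      # input-to-hidden weights and hidden biases
    w2, b2 = w[25:30], w[30]                     # hidden-to-output weights and output bias
    w_direct = w[31:35]                          # direct input-to-output connections
    return W1, b1, w2, b2, w_direct

def action_output(weights, state):
    """Forward pass: output y in (0, 1) computed from the four state variables."""
    state = np.asarray(state, dtype=float)
    W1, b1, w2, b2, w_direct = unpack(weights)
    hidden = sigmoid(W1 @ state + b1)
    return sigmoid(w2 @ hidden + b2 + w_direct @ state)

def force(weights, state, t, rng=np.random):
    """Force selection: probabilistic during the first ten steps of a cycle,
    deterministic (threshold at 0.5) afterwards; magnitude fixed at 10 N."""
    y = action_output(weights, state)
    push_right = (rng.random() < y) if t < 10 else (y > 0.5)
    return 10.0 if push_right else -10.0
```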
Experiments have been conducted to assess the performance of the proposed
training method both in terms of training speed and generalization capabilities. For
comparison purposes we have also implemented the AHC approach [1, 2], while
experimental results concerning the genetic reinforcement approach on the same
problem using the same motion equations and the same network architecture are
reported in [16]. Training speed is measured in terms of the number of cycles (function evaluations) required to achieve a successful cycle. A series of 50 experiments
were conducted using each method, each cycle starting with random initial state
variables. The obtained results are summarized in Table I, along with results from
[16] concerning the genetic reinforcement case with a population of 100 networks
(GA-100), which exhibited the best generalization performance. In accordance with
previously published results, the AHC method did not manage to find a solution in
14 of the 50 experiments (28%), so the displayed results concern values obtained
considering only the successful experiments. On the contrary, the proposed training
strategy was successful in all the experiments and exhibited significantly better
performance with respect to the AHC case in terms of the required training cycles.
From the displayed results it is also clear that the polytope method outperforms the
genetic approach, which is also better than the AHC method.
Moreover, we have tested the generalization performance of the obtained action
networks. These experiments are useful since a successful cycle starting from an
arbitrary initial position does not necessarily imply that the system will exhibit
acceptable performance when started with different initial state vectors.

Table II. Generalization performance in terms of the percentage of successful tests

                 Percentage of Successful Tests
Method        Best     Worst     Mean      SD
Polytope      88.1      2.3      47.2     16.4
AHC           62.2      9.5      38.5     10.3
GA-100        71.4      3.9      47.5     14.2

The generalization experiments were conducted following the guidelines suggested in [16]:
for each action network obtained in each of the 50 experiments, using either the
polytope method or the AHC method, a series of 5000 tests was performed
from random initial states, and we counted the percentage of the tests in which the
network was able to balance the pole for more than 1000 time steps. The same failure criteria that were used for training were also used for testing. Table II displays
average results obtained by testing the action networks produced by the polytope
and the AHC methods (in the case of successful training experiments). Moreover, it
includes the generalization results reported in [16] for the GA-100 algorithm,
using the same testing criteria. As the results indicate, the action networks obtained
by all methods exhibit comparable generalization performance. As noted in [16],
it is possible to increase the generalization performance by considering stricter
stopping criteria for the training algorithm. It must also be noted that, as far as
the polytope method is concerned, there was no connection between training time and
generalization performance, i.e., the networks that resulted from longer training times
did not necessarily exhibit better generalization capabilities.
From the above results it is clear that direct approaches to delayed reinforcement learning problems constitute a serious alternative to the more widely studied critic-based approaches. While critic-based approaches rely mainly on their elegant
formulation in terms of temporal differences and stochastic dynamic programming,
direct approaches base their success on the power of the optimization schemes they
employ. Such an optimization scheme, based on the polytope algorithm with random restarts, has been presented in this work and proved to be very successful
in dealing with the pole balancing problem. Future work will focus on the
employment of different kinds of action models (for example RBF networks) as
well as the exploration of other derivative-free optimization schemes.
Acknowledgement
One of the authors (I. E. L.) acknowledges partial support from the General Secretariat of Research and Technology under contract PENED 91 ED 959.
References
1. C.W. Anderson, Strategy learning with multilayer connectionist representations, Technical
Report TR87-509.3, GTE Labs, Waltham, MA.
2. C.W. Anderson, “Learning to control an inverted pendulum using neural networks”, IEEE
Control Systems Magazine, Vol. 2, pp. 31–37, 1989.
3. A.G. Barto, R.S. Sutton and C.W. Anderson, “Neuronlike elements that can solve difficult
control problems”, IEEE Trans. on Systems, Man and Cybernetics, Vol. 13, pp. 835–846, 1983.
4. H.R. Berenji and P. Khedkar, “Learning and tuning fuzzy logic controllers using reinforcements”, IEEE Trans. on Neural Networks, Vol. 3, pp. 724–740, 1992.
5. C.S. Chassapis, D.G. Papageorgiou and I.E. Lagaris, “MCL – Optimization oriented programming language”, Computer Physics Communications, Vol. 52, pp. 223–239, 1989.
6. G.A. Evangelakis, J.P. Rizos, I.E. Lagaris and I.N. Demetropoulos, “Merlin – A portable system
for multidimensional minimization”, Computer Physics Communications, Vol. 46, pp. 402–
412, 1987.
7. D.G. Papageorgiou, C.S. Chassapis and I.E. Lagaris, “MERLIN–2.0 – Enhanced and programmable version”, Computer Physics Communications, Vol. 52, pp. 241–247, 1989.
8. P. Gill, W. Murray and M. Wright, Practical Optimization, Academic Press, 1989.
9. L. Kaelbling, M. Littman and A. Moore, “Reinforcement learning: A survey”, Journal of
Artificial Intelligence Research, Vol. 4, pp. 237–285, 1996.
10. D. Kontoravdis, A. Likas and A. Stafylopatis, “Efficient reinforcement learning strategies for
the pole balancing problem”, in M. Marinaro and P. Morasso (eds), Proc. ICANN’94, pp. 659–
662, Springer-Verlag, 1994.
11. C-J. Lin and C-T. Lin, “Reinforcement learning for an ART-based fuzzy adaptive learning
control network”, IEEE Trans. on Neural Networks, Vol. 7, pp. 709–731, 1996.
12. J.A. Nelder and R. Mead, “A simplex method for function minimization”, Computer Journal,
Vol. 7, pp. 308–313, 1965.
13. S. Nash and A. Sofer, Linear and Nonlinear Programming, McGraw-Hill, 1996.
14. W. Spendley, G. Hext and F. Himsworth, “Sequential application of simplex designs in
optimization and evolutionary operation”, Technometrics, Vol. 4, pp. 441–461, 1962.
15. C. Watkins and P. Dayan, “Learning from delayed rewards”, Machine Learning, Vol. 8, pp. 279–
292, 1992.
16. D. Whitley, S. Dominic, R. Das and C.W. Anderson, “Genetic reinforcement learning for
neurocontrol problems”, Machine Learning, Vol. 13, pp. 259–284, 1993.