THE BELL SYSTEM TECHNICAL JOURNAL
Vol. 62, No. 8, October 1983
Printed in U.S.A.
Forecasting With Adaptive Gradient Exponential
Smoothing
By A. FEUER*
(Manuscript received December 20, 1982)
Exponential Smoothing (ES) as a forecasting technique has been extensively used since its introduction in the 1960s. It is simple, hence easy to
implement, and in many cases performs surprisingly well. However, many
phenomena require a more sophisticated forecasting technique. In this paper
we introduce a new forecasting technique, Adaptive Gradient Exponential
Smoothing (AGES). This technique extends the classical ES as used on simple
data or on data with linear trend. For data with both linear trend and seasonal
effects this extension results in a new and more general form of ES, which is
presented in this paper. The new forecasting technique is tested on simulated
data and some real data of the types mentioned above, and its performance in
all these tests is clearly superior to that of ES. It is shown by analysis and by
experimentation that for certain types of data it does in fact converge to the
optimal (in the mean square error sense) forecasts.
I. INTRODUCTION
The need for quick and reliable forecasts of various time series is
often encountered in economic and business situations. In the Bell
System, forecasting is used to help plan trunk and facilities for the
telephone network,¹⁻³ as well as to project computer workload, to
determine staffing levels for operators or service observers, and more.
Many forecasting techniques exist and different time series may
require different techniques. In general, there is a clear trade-off
between simplicity (resulting in cheaper implementation) and performance
of the forecasting technique. One of the simplest forecasting techniques,
Exponential Smoothing (ES), has surprisingly good performance. This
technique was presented originally by Winters⁴ and Brown⁵ and is described
briefly in Section II. In Ref. 6 the optimality properties of ES are studied;
we expand on these studies and use the conclusions as the basis for a new
technique we introduce here.

* Bell Laboratories.

© Copyright 1983, American Telephone & Telegraph Company. Photo reproduction for
noncommercial use is permitted without payment of royalty provided that each
reproduction is done without alteration and that the Journal reference and copyright
notice are included on the first page. The title and abstract, but no other portions, of
this paper may be copied or distributed royalty free by computer-based and other
information-service systems without further permission. Permission to reproduce or
republish any other portion of this paper must be obtained from the Editor.
In fact, these studies revealed a relationship between the ES and
the Autoregressive Integrated Moving Average (ARIMA) model-fitting-based forecasting suggested by Box and Jenkins.⁷ This is further
discussed in Section III.
The extensive use of ES clearly indicated that for time series with
nonstationary discontinuities or changes in the generating parameters,
ES performance is not satisfactory. This prompted a number of
researchers to develop the Adaptive Exponential Smoothing (AES)
idea. In these techniques the algorithm is supposedly evaluating its
own performance and correcting its parameters to obtain improved
performance. Recently, the existing AESs (see, for example, Refs. 8
through 11) were reviewed critically by Ekern.¹² One of the points
raised in Ref. 12 was that none of the existing AESs is supported by
analysis or general performance claims (e.g., optimality). In addition,
it should be pointed out that only Roberts and Reed's AES¹¹ can be
used on data with both linear trend and seasonal effects, while the
other AESs are limited to simpler data and have no natural generalization.
In this paper we present a new AES algorithm, which we call
Adaptive Gradient Exponential Smoothing (AGES). This technique
naturally generalizes to data with both linear trend and seasonal effect.
In addition, analysis of AGES for simple data and extensive simulations, using
simple as well as more general data, strongly suggest that
this technique converges to optimal performance in the mean square
error (MSE) sense.
Section II presents ES as commonly used. A new, more general form
is developed with a discussion of its optimal properties. The new
technique, AGES, is derived and presented in Section III, while the
results of experiments with this technique on both real and simulated
data are presented in Section IV.
II. EXPONENTIAL SMOOTHING AND ITS OPTIMAL PROPERTIES
First we consider ES as Winters⁴ did for three types of data: simple* (S),
with linear trend (LT), and with both linear trend and multiplicative
seasonal effects (LSM). Common to all the configurations is
* Simple data are of the form a + n(t), where a is a fixed value and n(t) is noise with
zero mean.
the following: a time series {x(t)} is measured every time interval T
(e.g., hour, day, or week), and t is an integer representing the time tT.
Then, one is interested in forecasting the value x(t + 1)* based on the
data available up to and including t, namely x(0), x(1), …, x(t).

If x̂(t + 1) denotes the forecast, carried out at time t for x(t + 1),
from Ref. 4 we have (using our own notation for consistency with the
discussions in the sequel), for S data:
x̂(t + 1) = αx(t) + (1 − α)x̂(t)   (1a)

0 ≤ α ≤ 1;   (1b)
for LT data:
x̂(t + 1) = a(t) + b(t)   (2a)

a(t) = αx(t) + (1 − α)[a(t − 1) + b(t − 1)]   (2b)

b(t) = β[a(t) − a(t − 1)] + (1 − β)b(t − 1)   (2c)

0 ≤ α, β ≤ 1;   (2d)
and for LSM data:
x̂(t + 1) = [a(t) + b(t)]c(t − L + 1)   (3a)

a(t) = α[x(t)/c(t − L)] + (1 − α)[a(t − 1) + b(t − 1)]   (3b)

b(t) = β[a(t) − a(t − 1)] + (1 − β)b(t − 1)   (3c)

c(t) = γ[x(t)/a(t)] + (1 − γ)c(t − L)   (3d)

0 ≤ α, β, γ ≤ 1,   (3e)
where L is the known periodicity of the season.
In all the equations above, the parameters α, β, and γ are called the
"smoothing coefficients."
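A minimal sketch of recursions (1) and (2) in code (the series and coefficient values used here are hypothetical illustrations; the LSM recursion (3) follows the same pattern with the seasonal factors c(t)):

```python
def es_simple(x, alpha):
    """Eq. (1): one-step forecasts for simple (S) data."""
    xhat = [x[0]]                    # initialize the forecast with the first observation
    for t in range(len(x) - 1):
        xhat.append(alpha * x[t] + (1 - alpha) * xhat[t])   # eq. (1a)
    return xhat

def es_linear_trend(x, alpha, beta):
    """Eq. (2): one-step forecasts for linear-trend (LT) data."""
    a, b = x[0], 0.0                 # current level and slope estimates
    xhat = [a + b]
    for t in range(1, len(x)):
        a_new = alpha * x[t] + (1 - alpha) * (a + b)        # eq. (2b)
        b = beta * (a_new - a) + (1 - beta) * b             # eq. (2c)
        a = a_new
        xhat.append(a + b)                                  # eq. (2a)
    return xhat
```

With α = β = 1 the LT recursion extrapolates the last two observations exactly, as expected from (7).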
Our first step is to rewrite eq. (1) and, more importantly, eq. (2).
This provides the basis for a new form of ES for LSM data, more
general than (3). The new form, which is a natural extension of (1)
and (2), suggests types of data for which the ES algorithm can result
in optimal (in the MSE sense) performance.
Equation (1) can be readily rewritten as

x̂(t + 1) = θ̂₁[x̂(t) − x(t)] + x(t),   (4a)

where clearly

θ̂₁ = 1 − α.   (4b)

* Note that we restrict our discussions to one-interval-ahead forecasting with the
understanding that it can be generalized to more time intervals ahead.
With some algebra one can show that eq. (2) is equivalent to

x̂(t + 1) = θ̂₁x̂(t) + θ̂₂x̂(t − 1) + (2 − θ̂₁)x(t) − (1 + θ̂₂)x(t − 1),   (5a)

where

θ̂₁ = 2 − α(1 + β)   (5b)

θ̂₂ = α − 1.   (5c)
The basic difference between (2) and (5) is that (5) reflects the
assumption that the noise-free part of the data x(t) is generated by
the difference equation

y(t) − 2y(t − 1) + y(t − 2) = 0,   (6)

while (2) reflects the assumption that the solution of (6) is

y(t) = a + bt.   (7)

[Note that in (2) a(t) is the current estimate of a + bt and b(t) is the
current estimate of b.]
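The equivalence of (2) and (5) can be verified numerically: run the recursions (2) and check that the resulting forecasts satisfy (5a) with θ̂₁ = 2 − α(1 + β) and θ̂₂ = α − 1. The series and coefficient values below are hypothetical:

```python
def es_lt(x, alpha, beta):
    """One-step forecasts from eq. (2); xhat[t] is the forecast of x[t]."""
    a, b = x[0], 0.0                                    # arbitrary starting level/slope
    xhat = [x[0]]
    for t in range(len(x) - 1):
        a_new = alpha * x[t] + (1 - alpha) * (a + b)    # eq. (2b)
        b = beta * (a_new - a) + (1 - beta) * b         # eq. (2c)
        a = a_new
        xhat.append(a + b)                              # eq. (2a)
    return xhat

x = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0]           # hypothetical LT-type series
alpha, beta = 0.4, 0.3
t1, t2 = 2 - alpha * (1 + beta), alpha - 1              # eqs. (5b), (5c)
xhat = es_lt(x, alpha, beta)
for t in range(1, len(x) - 1):                          # eq. (5a) holds step by step
    rhs = t1 * xhat[t] + t2 * xhat[t - 1] + (2 - t1) * x[t] - (1 + t2) * x[t - 1]
    assert abs(xhat[t + 1] - rhs) < 1e-9
```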
The ES as given in (3) for LSM data is clearly based on the
assumption that the noise-free part of the data has the form

y(t) = (a + bt)c(t),   (8a)

where

c(t + L) = c(t).   (8b)

The difference equation satisfied by (8) is

y(t) − 2y(t − L) + y(t − 2L) = 0,   (9)
and the corresponding ES is

x̂(t + 1) = Σ_{j=1}^{M} θ̂ⱼx̂(t − j + 1) − Σ_{j=1}^{M} θ̂ⱼx(t − j + 1) + 2x(t − L + 1) − x(t − 2L + 1).   (10)

The parameters θ̂ⱼ, j = 1, …, M, and the constraints they have to
satisfy are discussed later. Also, the claimed correspondence between
(9) and (10) will become more apparent in later discussion.
At this point, however, we emphasize that while (7) is the general
solution of (6), and thus (2) and (5) are equivalent, (8) is only one of
many possible solutions of (9). Hence (10) represents an ES form that
is more general than (3).
Similarly, data with linear trend and additive seasonal effects*
(LSA) have the underlying difference equation

y(t) − y(t − 1) − y(t − L) + y(t − L − 1) = 0   (11)

and the corresponding ES is

x̂(t + 1) = Σ_{j=1}^{M} θ̂ⱼx̂(t − j + 1) − Σ_{j=1}^{M} θ̂ⱼx(t − j + 1) + x(t) + x(t − L + 1) − x(t − L).   (12)
To unify and simplify the discussions ahead we introduce the
following notation. Let D be a unit delay operator, namely Dx(t) =
x(t − 1), and let A(D) be a polynomial in D such that

A(D) = 1                        for S data
       2 − D                    for LT data
       2D^(L−1) − D^(2L−1)      for LSM data
       1 + D^(L−1) − D^L        for LSA data.   (13)

With these definitions, (4), (5), (10), and (12) can be unified as

x̂(t + 1) = Σ_{j=1}^{M} θ̂ⱼD^(j−1)[x̂(t) − x(t)] + A(D)x(t),   (14)

where M = 1 will result in (4) and M = 2 in (5).

It should also be pointed out that the ES as given by eqs. (1), (2),
or (3) has an implicit assumption in it: that one (two or three)
coefficient(s) can, in fact, smooth the data. In other words, M in (14)
is equal to one (two, or three). However, the general form (14) allows
for a larger number of coefficients to get better approximations.
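The unified form (14) lends itself to a single routine. The sketch below is our own illustration; the dictionary encoding of A(D) and the crude initialization x̂(0) = x(0) are hypothetical choices:

```python
def A_poly(data_type, L=None):
    """A(D) from eq. (13), encoded as {lag k: coefficient of D^k}."""
    if data_type == "S":
        return {0: 1.0}
    if data_type == "LT":
        return {0: 2.0, 1: -1.0}
    if data_type == "LSM":
        return {L - 1: 2.0, 2 * L - 1: -1.0}
    if data_type == "LSA":
        return {0: 1.0, L - 1: 1.0, L: -1.0}
    raise ValueError(data_type)

def es_unified(x, theta, data_type, L=None):
    """One-step ES forecasts from eq. (14); M = len(theta)."""
    A = A_poly(data_type, L)
    xhat = [x[0]]                                # crude initialization of the forecast
    for t in range(len(x) - 1):
        # A(D)x(t) term; lags with no history yet are skipped
        trend = sum(c * x[t - k] for k, c in A.items() if t - k >= 0)
        # sum_j theta_j * (xhat(t - j + 1) - x(t - j + 1))
        corr = sum(theta[j] * (xhat[t - j] - x[t - j])
                   for j in range(len(theta)) if t - j >= 0)
        xhat.append(trend + corr)
    return xhat
```

With data_type = "S" and theta = [1 − α] this reproduces (1); with data_type = "LT" and theta from (5b) and (5c) it reproduces (2).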
To observe the optimal properties of the ES forecasts we define the
forecast error as

e(t) = x(t) − x̂(t)   (15)

and use as our criterion for forecast quality the mean square error
(MSE), i.e., E{e²(t)}. With this in mind, it is clear that optimal
performance is achieved if e(t) becomes a white noise sequence
(i.e., independent and identically distributed with zero mean). Namely,
the ES technique, while assuming knowledge of the generating process
for the noise-free component of the data, attempts to "whiten" the
noise component. This attempt implies an underlying assumption that
the data are generated through, or at least approximated by, the
process

[1 − DA(D)]x(t) = [1 − Σ_{i=1}^{M} θᵢD^i]ε(t),   (16)

where ε(t) is a white noise with variance σ².

* This type of data was not addressed in Ref. 4 and, as far as we know, no form of
ES applicable to it was proposed before the one here.
Substituting (14) and (16) into (15) results in

[1 − Σ_{i=1}^{M} θ̂ᵢD^i]e(t) = [1 − Σ_{i=1}^{M} θᵢD^i]ε(t).   (17)
This equation satisfied by e(t) is the basis of our claims of correspondence
between eqs. (9) and (10), and (11) and (12). Equation (17)
immediately suggests the conditions for optimal forecasting. First, to
get bounded MSE one must require:

Condition 1: All zeros of the polynomial 1 − Σ_{i=1}^{M} θ̂ᵢλ^i are outside
the unit circle.

If, in addition, we also require:

Condition 2: θ̂ᵢ = θᵢ, i = 1, 2, …, M,

then, clearly, from eq. (17), e(t) will converge to ε(t) and optimal
forecasting (in the MSE sense) is achieved.
Remark 1: As discussed here, the sufficiency of Conditions 1 and
2 is quite obvious; however, they are also necessary. This
is argued in Appendix A.

Remark 2: In Ref. 4, α and β for LT data are restricted to the interval
[0, 1], which corresponds to the set S₂ in Fig. 1. The actual
constraints follow from applying Condition 1 to the M = 2
case. This results in the set S₁ in Fig. 1, which clearly
contains S₂ and is considerably larger. Allowing a larger
constraint set for θ̂₁ and θ̂₂ (or, correspondingly, α, β) will
result in more cases for which ES could result in optimal
performance.
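The sets in Remark 2 are easy to probe in code. In this small sketch (the helper names in_S1 and s2_point are ours), every pair produced by eqs. (5b) and (5c) with 0 < α, β < 1 lands in S₁, while S₁ also contains points, such as (0.2, 0.5), that no admissible (α, β) can produce:

```python
def in_S1(theta1, theta2):
    """Constraint set S1 of Fig. 1: Condition 1 for M = 2, i.e., both zeros of
    1 - theta1*z - theta2*z^2 lie outside the unit circle."""
    return abs(theta2) < 1 and theta2 + theta1 < 1 and theta2 - theta1 < 1

def s2_point(alpha, beta):
    """Map Winters' (alpha, beta) to (theta1, theta2) via eqs. (5b) and (5c)."""
    return 2 - alpha * (1 + beta), alpha - 1

# every S2 point falls inside the larger set S1 ...
for alpha in (0.1, 0.5, 0.9):
    for beta in (0.1, 0.5, 0.9):
        assert in_S1(*s2_point(alpha, beta))

# ... but S1 also contains points S2 cannot reach: theta2 = 0.5 needs alpha = 1.5
assert in_S1(0.2, 0.5)
```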
III. ADAPTIVE GRADIENT EXPONENTIAL SMOOTHING
In the previous section we argued that for data that can be approximated by (16), forecasting with ES of the form (14) can result in
optimal performance in the MSE sense. To achieve this, Conditions 1
and 2 must be satisfied. However, while Condition 1 can be satisfied
by proper choice of the θ̂ᵢ, Condition 2 is, in general, hard to satisfy since
the values of the θᵢ in eq. (16) are not known. Basically, the ARIMA
Fig. 1-The constraint sets:
S₁ = {(θ̂₁, θ̂₂): |θ̂₂| < 1, θ̂₂ + θ̂₁ < 1, θ̂₂ − θ̂₁ < 1}
S₂ = {(θ̂₁, θ̂₂): θ̂₁ = 2 − α(1 + β), θ̂₂ = α − 1, 0 < α, β < 1}.
model-fitting-based forecasting⁷ deals with exactly this type of problem.
The θᵢ's of eq. (16) are estimated and these estimates are then
used as the θ̂ᵢ's in eq. (14) in an attempt to satisfy Condition 2. In the
ES algorithm no such attempt is made. In practice, forecasters
using ES choose some fixed values for the θ̂ⱼ that satisfy Condition
1 [or even more restrictive constraints, e.g., eq. (2d)]. These values are based on
intuition, experience, and familiarity with the data they forecast.

However, considerable differences between the underlying θᵢ's and
the chosen θ̂ᵢ's can result in significant performance degradation. This
is demonstrated in Fig. 2 for the case M = 2. The MSE for this case
was computed in closed form as a function of θ₁ and θ₂ for some
fixed θ̂₁ and θ̂₂ and graphed in the figure. Together with phenomena
like nonstationary discontinuities* and changes in the data-generating
process (i.e., the θᵢ change values), this has resulted in unsatisfactory
performance of the ES. The realization of what may cause this poor
performance brought about the idea of using adaptive schemes where

* Step-like changes in the data.
Fig. 2-The mean squared error as a function of the data-generating parameters for
M = 2. (The smoothing coefficients are fixed at θ̂₁ = −0.3, θ̂₂ = −0.3.)
the θ̂ⱼ are not fixed but are adjusted in an attempt to improve performance.

Compared to the existing Adaptive Exponential Smoothing (AES)
techniques (see, e.g., Refs. 8 through 11), the new technique we
introduce here is analytically more sound, and there are strong
indications that it converges to optimal performance in the MSE sense
for the data approximated by (16).
This new technique is based on a gradient search for the minimum
of the MSE. If the MSE were available as a function of
the θ̂ⱼ, then one could compute the gradient

∇ = ∂E{e²(t)}/∂θ̂,   (18)

where θ̂ = [θ̂₁, θ̂₂, …, θ̂_M]ᵀ, and recursively update the θ̂ⱼ through

θ̂(t + 1) = θ̂(t) − μ∇,   (19)

where μ > 0 is the adaptation constant. This is the gradient search
technique, sometimes referred to as the steepest descent technique. In
general, however, the MSE is not available as a function of the θ̂ⱼ;
hence, neither is the gradient. Instead, we use an instantaneous
estimate of this gradient. To get this estimate we replace E{e²(t)} by
e²(t) and the gradient by

∇̂ = ∂e²(t)/∂θ̂ = 2e(t) ∂e(t)/∂θ̂.   (20)
Let us denote

s(t) = ∂e(t)/∂θ̂

the "sensitivity vector," since it gives an indication of how sensitive
the error e(t) is to the values of the θ̂ⱼ.
While s(t) is not available, we can use eq. (17) to develop a means
for generating it. Let us take partial derivatives of both sides of this
equation with respect to θ̂. Since the right-hand side does not depend
explicitly on θ̂ we get

[1 − Σ_{i=1}^{M} θ̂ᵢD^i]sⱼ(t + 1) = D^(j−1)e(t),  j = 1, 2, …, M.   (21)
At this point we are ready to introduce the Adaptive Gradient
Exponential Smoothing (AGES) technique. Combining eqs. (14), (19),
(20), and (21) we get

the forecast:

x̂(t + 1) = A(D)x(t) − Σ_{i=1}^{M} θ̂ᵢ(t)e(t − i + 1)   (22)

[see the definition of A(D) in eq. (13)],

the sensitivity functions:

sⱼ(t + 1) = Σ_{i=1}^{M} θ̂ᵢ(t)sⱼ(t − i + 1) + e(t − j + 1),  j = 1, 2, …, M,   (23)

and the coefficient adjustments:

θ̂ⱼ(t + 1) = θ̂ⱼ(t) − 2μe(t)sⱼ(t),  j = 1, 2, …, M.   (24)

Recall that the error is e(t) = x(t) − x̂(t).
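Equations (22) through (24) translate directly into code. The following is our own minimal sketch for S data, where A(D) = 1; the step size μ and the zero initializations are hypothetical choices:

```python
def ages(x, M=1, mu=0.01):
    """One pass of AGES, eqs. (22)-(24), over a series of S-type data."""
    theta = [0.0] * M                        # adaptive coefficients theta_hat_j(t)
    e_hist = [0.0] * M                       # e(t), e(t-1), ..., e(t-M+1)
    s_hist = [[0.0] * M for _ in range(M)]   # s_hist[j] = [s_j(t), ..., s_j(t-M+1)]
    xhat = [x[0]]
    for t in range(len(x) - 1):
        e = x[t] - xhat[t]                   # forecast error e(t) = x(t) - xhat(t)
        e_hist = [e] + e_hist[:-1]
        # forecast, eq. (22): xhat(t+1) = A(D)x(t) - sum_i theta_i(t) e(t-i+1)
        xhat.append(x[t] - sum(theta[i] * e_hist[i] for i in range(M)))
        # coefficient adjustment, eq. (24), uses e(t) and s_j(t)
        new_theta = [theta[j] - 2 * mu * e * s_hist[j][0] for j in range(M)]
        # sensitivity update, eq. (23): s_j(t+1) = sum_i theta_i(t) s_j(t-i+1) + e(t-j+1)
        for j in range(M):
            s_new = sum(theta[i] * s_hist[j][i] for i in range(M)) + e_hist[j]
            s_hist[j] = [s_new] + s_hist[j][:-1]
        theta = new_theta
    return xhat, theta
```

For the other data types only the A(D)x(t) term of the forecast changes, per eq. (13).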
Both our simulations and our experiments (as described in the next
section) strongly indicate that AGES converges to optimal performance
through convergence of θ̂ⱼ(t) to θⱼ. Namely, the error e(t) is
adaptively whitened. Despite these indications, since the resulting
equations are quite complex, a global proof of convergence of the
AGES technique is beyond the scope of this paper. However, we
conclude this section by treating the special case M = 1 and showing
its local convergence properties.
Let M = 1; then eqs. (17), (23), and (24) become

e(t + 1) = θ̂₁(t)e(t) + ε(t + 1) − θ₁ε(t)

s₁(t + 1) = θ̂₁(t)s₁(t) + e(t),

and

θ̂₁(t + 1) = θ̂₁(t) − 2μe(t)s₁(t).

Assuming θ̂₁(t) is independent of e(t) and s₁(t) (similar assumptions
are common in convergence proofs of adaptive filters) and observing
that E{ε(t)e(t)} = σ², E{ε(t)ε(t + 1)} = 0, and E{s₁(t)ε(t)} = 0, we get

E{e²(t + 1)} = E{θ̂₁²(t)}·E{e²(t)} + σ²[1 + θ₁² − 2θ₁E{θ̂₁(t)}]

E{s₁(t + 1)e(t + 1)} = E{θ̂₁²(t)}·E{s₁(t)e(t)} + E{θ̂₁(t)}·E{e²(t)} − θ₁σ²

E{θ̂₁(t + 1)} = E{θ̂₁(t)} − 2μE{s₁(t)e(t)}.   (25)
If we assume in addition that θ̂₁(t) has a small variance, namely
E{θ̂₁²(t)} ≈ [E{θ̂₁(t)}]² (the simulation results tend to support this
assumption), defining

γ₁(t) = E{e²(t)} − σ²

γ₂(t) = E{s₁(t)e(t)}

γ₃(t) = E{θ̂₁(t)} − θ₁   (26)

and substituting in (25) results in

γ₁(t + 1) = [γ₃(t) + θ₁]²γ₁(t) + σ²[γ₃(t)]²

γ₂(t + 1) = [γ₃(t) + θ₁]²γ₂(t) + [γ₃(t) + θ₁]γ₁(t) + σ²γ₃(t)

γ₃(t + 1) = γ₃(t) − 2μγ₂(t).   (27)
Clearly, if we could prove that γ₁(t), γ₂(t), and γ₃(t) converge to the
origin globally (i.e., independently of the initial values), it would mean
[see eq. (26)] that the MSE converges to the minimum σ² and E{θ̂₁(t)}
converges to θ₁. However, despite strong indications from our simulations
that these variables do converge globally, we can prove only local
convergence. In addition, the proof provides an indication as to how
to choose the parameter μ.
Let us linearize eq. (27) around the origin to get

γ₁(t + 1) = θ₁²γ₁(t)

γ₂(t + 1) = θ₁²γ₂(t) + θ₁γ₁(t) + σ²γ₃(t)

γ₃(t + 1) = γ₃(t) − 2μγ₂(t).   (28)

The coefficient matrix is

A = [ θ₁²    0     0  ]
    [ θ₁     θ₁²   σ² ]
    [ 0     −2μ    1  ]
and to ensure convergence all eigenvalues of A must be within the unit
circle. The eigenvalues of A are

λ₁ = θ₁²

λ₂,₃ = (1/2){1 + θ₁² ± [(1 − θ₁²)² − 8μσ²]^(1/2)},

and it can be verified that choosing

μ < (1 − θ₁²)/(2σ²)   (29)

will guarantee the convergence of eq. (28).

Condition (29) implies that if |θ₁| is close to one, μ must be chosen
very small and the convergence will be slow. Again, our simulation
experiments verified this observation.
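The moment recursion (27) and the bound (29) can be probed numerically. In the sketch below, θ₁ = 0.8, σ² = 1, and the initial perturbation are hypothetical values; (29) then requires μ < (1 − 0.64)/2 = 0.18. Since (27) is nonlinear, this only probes behavior near the origin, consistent with the local proof:

```python
def iterate_27(mu, theta1=0.8, sigma2=1.0, g=(0.5, 0.0, -0.2), n=500):
    """Iterate eq. (27) from a small perturbation; return |g1| + |g2| + |g3|."""
    g1, g2, g3 = g
    for _ in range(n):
        g1, g2, g3 = ((g3 + theta1) ** 2 * g1 + sigma2 * g3 ** 2,
                      (g3 + theta1) ** 2 * g2 + (g3 + theta1) * g1 + sigma2 * g3,
                      g3 - 2 * mu * g2)
        if abs(g1) + abs(g2) + abs(g3) > 1e6:    # clearly diverging
            return float("inf")
    return abs(g1) + abs(g2) + abs(g3)
```

A step size well under the bound, e.g. μ = 0.05, drives the γᵢ to the origin, while one above it, e.g. μ = 0.5, does not.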
IV. SIMULATION RESULTS
We divide our experiments with AGES into two parts. In the first
part we applied both ES and AGES on data generated by the computer
Fig. 3-Comparison of forecasting performance between ES (β = 0.8) and AGES.
and compared the results. In the second part we applied AGES to
real data that we took from Ref. 7.

Equation (16) was used to generate data of types S, LT, and LSM by
the computer. The results of applying both ES and AGES to these
data are presented in Figs. 3, 4, and 5 and in Table I. Each point on
the curves of Fig. 3 corresponds to a complete run on a sequence of
data generated with a particular choice of the θᵢ. The resulting MSEs
for the ES and the AGES forecasts are presented, and the comparison
clearly indicates the superiority of the AGES algorithm. In addition,
we observe that AGES results, in almost all the runs, in an MSE
very close to the minimum, σ².
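The data-generation step can be sketched as follows for S data, where 1 − DA(D) = 1 − D in eq. (16); the seed and coefficient values used here are hypothetical:

```python
import random

def generate_s_data(theta, n, sigma=1.0, seed=0):
    """Draw x(t) satisfying eq. (16) for S data:
    x(t) = x(t-1) + eps(t) - sum_i theta_i * eps(t-i)."""
    rng = random.Random(seed)
    M = len(theta)
    eps = [rng.gauss(0.0, sigma) for _ in range(n + M)]   # white noise eps(t)
    x = [0.0]
    for t in range(M, n + M):
        x.append(x[-1] + eps[t] - sum(theta[i] * eps[t - 1 - i] for i in range(M)))
    return x[1:]
```

Runs for the LT and LSM types differ only in the 1 − DA(D) polynomial applied to x(t).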
In Fig. 5, we followed the variation of the θ̂ⱼ(t) with time in a number
of runs. The results clearly show that the θ̂ⱼ(t) converge to the θⱼ from
a variety of initial values; this indicates global convergence properties.
Similar results are observed in Table I for data with multiplicative
seasonal effects and linear trend. The θ̂ⱼ(t) clearly converge to the θⱼ's,
and the MSE, when AGES is applied, is again very close to the optimal
Fig. 4-Comparison of mean squared error in forecasting with ES (θ̂₁ = θ̂₂ = −0.3)
and AGES as a function of the data-generating parameters θ₁ and θ₂. (a) θ₂ = −0.9.
(b) θ₂ = −0.6.
Fig. 5-Convergence of: (a) θ̂₁(t) to the optimal value θ₁ in the AGES method. (b) θ̂₁(t)
and θ̂₂(t) from various initial conditions to θ₁ and θ₂, using AGES on data with linear
trend.
value, σ². From Ref. 7 we took data of the simple kind (no linear trend
or seasonal effects): the IBM common stock closing prices, daily,
from May 17, 1961 through November 2, 1962. To these data we applied
both ES and AGES and the results are presented in Fig. 6. Each point
on the curves corresponds to a run on the same data, each time with
a different coefficient (for the ES) and a different initial condition (for
the AGES). The further the coefficient used in the ES is from θ₁
(which in this case is equal to −0.1, as indicated in Ref. 7), the better
the relative performance of AGES.
Further experiments were conducted on monthly international airline
passengers data.⁷ These data, as Fig. 7 indicates, have both linear
trend and multiplicative seasonal effects. We applied the AGES algorithm
(with M = 3) and the results are presented in Fig. 7. In Ref. 7
it is claimed that sometimes rather than work with the actual data it
Table I-Comparison of MSE in forecasting with ES (θ̂₁ = −0.2, θ̂₂ = 0.5, θ̂₃ = 0.4)
and AGES. Twenty-six sets of data-generating coefficients θ₁, θ₂, θ₃ were used; the
value to which each adaptive coefficient θ̂ᵢ(t) converges is given in parentheses next
to the corresponding generating coefficient and in every run lies close to it (e.g.,
θ₁ = −1.3 recovered as −1.29). The AGES MSE lies between 0.9973σ² and 1.3831σ²
in all 26 runs, while the ES MSE ranges from 1.0137σ² to 20.5821σ².
is more convenient to work with the logarithm of the data. As we
argue in Appendix B, these logarithms, as data, have linear trend and
additive seasonal effects (see Fig. 8). Hence, to the logarithms we
applied AGES for linear trend and additive seasonal effects; the
results are presented in Fig. 8a (M = 3). We used the same data (the
logarithms) to see whether the performance improves with larger M.
AGES was applied with M = 13 and the results, as presented in Fig.
8b, clearly indicate that for these data M = 3 was sufficient.
V. CONCLUSIONS
In this paper we have introduced a new forecasting technique,
Adaptive Gradient Exponential Smoothing (AGES), which is based
on Exponential Smoothing (ES). We have elaborated on the optimality
properties in the MSE sense of the ES. For certain types of data, the
ES can result in optimal performance provided some coefficients are
known. In general, these coefficients are unavailable, and the AGES
shows strong indications of converging to these unknown coefficients
and providing optimal performance.
Fig. 6-Comparison of performance of forecasting with the ES (varying the coefficient
in each run) and the AGES methods.
Fig. 7-Forecasting with AGES international airline passengers data (M = 3). (Note that
these data have linear trend and multiplicative seasonal effects.)
Clearly, more extensive experiments with, and practical use of, the proposed
forecasting technique are required. A user-friendly software package can be
developed for implementation of this technique if sufficient interest is generated.
VI. ACKNOWLEDGMENT
The author wishes to thank the reviewers for their thoroughness.
Their comments were very helpful in the revision of this paper.
Fig. 8-Forecasting with AGES the logarithm of the data in Fig. 7 for: (a) M = 3.
(b) M = 13. (Note that the logarithm of the data in Fig. 7 has the form of data with
additive seasonal effects and linear trend.)
REFERENCES

1. C. D. Pack and B. A. Whitaker, "Kalman Filter Models for Network Forecasting," B.S.T.J., 61, No. 1 (January 1982), pp. 1-14.
2. J. P. Moreland, "A Robust Sequential Projection for Traffic Load Forecasting," B.S.T.J., 61, No. 1 (January 1982), pp. 16-38.
3. C. R. Szelag, "A Short-Term Forecasting Algorithm for Trunk Demand Servicing," B.S.T.J., 61, No. 1 (January 1982), pp. 67-96.
4. P. R. Winters, "Forecasting Sales by Exponentially Weighted Moving Averages," Management Science, 6, No. 3 (April 1960), pp. 324-42.
5. R. G. Brown, Smoothing, Forecasting and Prediction of Discrete Time Series, Englewood Cliffs, NJ: Prentice-Hall, 1962.
6. J. F. Muth, "Optimal Properties of Exponentially Weighted Forecasts of Time Series With Permanent and Transitory Components," J. Am. Statist. Assn., 55 (June 1960), pp. 299-306.
7. G. E. P. Box and G. M. Jenkins, Time Series Analysis, Forecasting and Control, San Francisco, CA: Holden-Day, 1970.
8. W. M. Chow, "Adaptive Control of the Exponential Smoothing Constant," J. Indust. Eng., 16, No. 6 (October 1965), pp. 314-7.
9. D. W. Trigg and A. G. Leach, "Exponential Smoothing With an Adaptive Response Rate," Oper. Res. Quart., 18, No. 1 (March 1967), pp. 53-60.
10. D. C. Whybark, "A Comparison of Adaptive Forecasting Techniques," Logist. Transport. Rev., 8 (1973), pp. 13-26.
11. S. D. Roberts and R. Reed, "The Development of a Self-Adaptive Forecasting Technique," AIIE Trans., 1 (December 1969), pp. 314-22.
12. S. Ekern, "Adaptive Exponential Smoothing Revisited," J. Oper. Res. Soc., 32, No. 9 (September 1981), pp. 775-82.
APPENDIX A

Necessity of Conditions 1 and 2 for the Convergence of e(t) to ε(t) in
Equation (17)

Condition 1 is clearly necessary (as well as sufficient) for the
convergence of E{e²(t)} to a finite value. We want to show that
Condition 2 is necessary for E{e²(t)} to converge to σ².

Let

γ_eε(τ) = E{e(t)ε(t − τ)}   (30)

and

γ_ee(τ) = E{e(t)e(t − τ)}.   (31)

Clearly,

γ_ee(τ) = γ_ee(−τ)   (32)

and, from eq. (17) and the definition of ε(t),

γ_eε(−τ) = 0,  τ > 0.   (33)
With these definitions it follows from eq. (17), after transients die
out, that

γ_eε(0) = σ²

γ_eε(1) − θ̂₁γ_eε(0) = −θ₁σ²

γ_eε(2) − θ̂₁γ_eε(1) − θ̂₂γ_eε(0) = −θ₂σ²

. . .

γ_eε(M) − θ̂₁γ_eε(M − 1) − … − θ̂_Mγ_eε(0) = −θ_Mσ²,

or, in matrix form, a lower-triangular system with ones on the diagonal
and the −θ̂ᵢ on the subdiagonals, relating [γ_eε(0), γ_eε(1), …, γ_eε(M)]ᵀ
to σ²[1, −θ₁, −θ₂, …, −θ_M]ᵀ.   (34)
Also, multiplying eq. (17) by e(t − τ) and taking expectations gives

γ_ee(0) − θ̂₁γ_ee(1) − θ̂₂γ_ee(2) − … − θ̂_Mγ_ee(M)
    = γ_eε(0) − θ₁γ_eε(1) − … − θ_Mγ_eε(M)

γ_ee(1) − θ̂₁γ_ee(0) − θ̂₂γ_ee(1) − … − θ̂_Mγ_ee(M − 1)
    = −θ₁γ_eε(0) − θ₂γ_eε(1) − … − θ_Mγ_eε(M − 1)

γ_ee(2) − θ̂₁γ_ee(1) − θ̂₂γ_ee(0) − … − θ̂_Mγ_ee(M − 2)
    = −θ₂γ_eε(0) − … − θ_Mγ_eε(M − 2)

. . .

γ_ee(M) − θ̂₁γ_ee(M − 1) − θ̂₂γ_ee(M − 2) − … − θ̂_Mγ_ee(0)
    = −θ_Mγ_eε(0),

or again a corresponding matrix form.   (35)
Now, if we claim that e(t) converges to ε(t), it means that

γ_eε(τ) = δ(τ)σ² = { σ²  for τ = 0
                     0   for τ ≠ 0

and

γ_ee(τ) = δ(τ)σ².

Then, substituting these in (34) and (35) results in

θ̂ⱼ = θⱼ  for j = 1, 2, …, M,

which is Condition 2. Hence this condition is necessary, as claimed.
APPENDIX B

Possible Transformation of Multiplicative Seasonal Effects Into Additive
Seasonal Effects

Suppose we are given data of the noise-free form

y(t) = (a + bt)c(t),  c(t + L) = c(t),   (36)

which is with linear trend and multiplicative seasonal effects.

Let

z(t) = log[y(t)].   (37)

Then substitution of (36) gives (if we assume bt ≪ a, which is true for
most real data)

z(t) = log a + log[1 + (b/a)t] + log c(t)
     ≈ log a + (b/a)t + log c(t)
     ≜ ã + b̃t + c̃(t),   (38)

where

ã = log a,  b̃ = b/a,  c̃(t) = log c(t).

Hence z(t) clearly has the form of data with linear trend and additive
seasonal effects.
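The approximation in (38) is easy to check numerically. The values a = 100, b = 0.5, and the period-4 seasonal factors below are hypothetical, chosen so that bt ≪ a over the range examined:

```python
import math

a, b, L = 100.0, 0.5, 4
c = [1.2, 0.9, 1.1, 0.8]            # multiplicative seasonal factors, c(t + L) = c(t)

for t in range(2 * L):
    z = math.log((a + b * t) * c[t % L])                       # exact log y(t)
    z_approx = math.log(a) + (b / a) * t + math.log(c[t % L])  # eq. (38)
    assert abs(z - z_approx) < 1e-3   # additive trend plus season, as claimed
```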
FORECASTING WITH AGES
2579
AUTHOR
Arie Feuer, B.Sc. (Mechanical Engineering), 1967, and M.Sc. (Mechanical
Engineering), 1973, Technion, Israel; Ph.D. (Control Systems Engineering), 1978,
Yale University; Bell Laboratories, 1978-. Since joining Bell Laboratories,
Mr. Feuer has been involved in telephone network measurement planning and
implementation. He is actively involved in research in control system theory,
signal processing, and adaptive systems, and currently looks into their possible
applications for network measurements and operations.