THE BELL SYSTEM TECHNICAL JOURNAL Vol. 62, No. 8, October 1983 Printed in U.S.A.

Forecasting With Adaptive Gradient Exponential Smoothing

By A. FEUER*

(Manuscript received December 20, 1982)

Exponential Smoothing (ES) as a forecasting technique has been extensively used since its introduction in the 1960s. It is simple, hence easy to implement, and in many cases performs surprisingly well. However, many phenomena require a more sophisticated forecasting technique. In this paper we introduce a new forecasting technique, Adaptive Gradient Exponential Smoothing (AGES). This technique extends the classical ES as used on simple data or on data with linear trend. For data with both linear trend and seasonal effects this extension results in a new and more general form of ES, which is presented in this paper. The new forecasting technique is tested on simulated data and some real data of the types mentioned above, and its performance in all these tests is clearly superior to ES. It is shown by analysis and by the experimentation that for certain types of data it does in fact converge to the optimal (in the mean square error sense) forecasts.

I. INTRODUCTION

The need for quick and reliable forecasts of various time series is often encountered in economic and business situations. In the Bell System, forecasting is used to help plan trunks and facilities for the telephone network (Refs. 1 through 3), as well as to project computer workload, to determine staffing levels for operators or service observers, and more.

Many forecasting techniques exist and different time series may require different techniques. In general, there is a clear trade-off between simplicity (resulting in cheaper implementation) and performance of the forecasting technique. One of the simplest forecasting techniques, Exponential Smoothing (ES), has surprisingly good performance. This technique was presented originally by Winters (Ref. 4) and Brown (Ref. 5) and is described briefly in Section II. In Ref. 6 the optimality properties of ES are studied; we expand on these studies and use the conclusions as the basis for the new technique we introduce here. In fact, these studies revealed a relationship between ES and the Autoregressive Integrated Moving Average (ARIMA) model-fitting-based forecasting suggested by Box and Jenkins (Ref. 7). This is further discussed in Section III.

The extensive use of ES clearly indicated that for time series with nonstationary discontinuities or changes in the generating parameters, ES performance is not satisfactory. This prompted a number of researchers to develop the Adaptive Exponential Smoothing (AES) idea. In these techniques the algorithm is supposedly evaluating its own performance and correcting its parameters to obtain improved performance. Recently, the existing AESs (see, for example, Refs. 8 through 11) were reviewed critically by Ekern (Ref. 12). One of the points raised in Ref. 12 was that none of the existing AESs is supported by analysis or general performance claims (e.g., optimality).

* Bell Laboratories.

Copyright 1983, American Telephone & Telegraph Company. Photo reproduction for noncommercial use is permitted without payment of royalty provided that each reproduction is done without alteration and that the Journal reference and copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free by computer-based and other information-service systems without further permission. Permission to reproduce or republish any other portion of this paper must be obtained from the Editor.
In addition, it should be pointed out that only Roberts' and Reed's AES (Ref. 11) can be used on data with both linear trend and seasonal effects, while the other AESs are limited to simpler data and have no natural generalization.

In this paper we present a new AES algorithm, which we call Adaptive Gradient Exponential Smoothing (AGES). This technique naturally generalizes to data with both linear trend and seasonal effects. In addition, analysis of AGES for simple data, and extensive simulations using simple as well as more general data, strongly suggest that this technique converges to optimal performance in the mean square error (MSE) sense.

Section II presents ES as commonly used. A new, more general form is developed with a discussion of its optimality properties. The new technique, AGES, is derived and presented in Section III, while the results of experiments with this technique on both real and simulated data are presented in Section IV.

II. EXPONENTIAL SMOOTHING AND ITS OPTIMAL PROPERTIES

First we consider ES as Winters (Ref. 4) did for three types of data: simple (S), with linear trend (LT), and with both linear trend and multiplicative seasonal effects (LSM). (Simple data are of the form a + n(t), where a is a fixed value and n(t) is noise with zero mean.) Common to all the configurations is the following: a time series {x(t)} is measured every time interval T (e.g., hour, day, or week), and t is an integer representing the time tT. Then, one is interested in forecasting the value x(t + 1) based on the data available up to and including t, namely x(0), x(1), ..., x(t). If x̂(t + 1) denotes the forecast, carried out at time t, of x(t + 1), then from Ref.
4 we have, using our own notation for consistency with the discussion in the sequel, the following. (Note that we restrict our discussion to one-interval-ahead forecasting, with the understanding that it can be generalized to more intervals ahead.)

For S data:

    x̂(t + 1) = αx(t) + (1 − α)x̂(t),    (1a)
    0 ≤ α ≤ 1;    (1b)

for LT data:

    x̂(t + 1) = a(t) + b(t)    (2a)
    a(t) = αx(t) + (1 − α)[a(t − 1) + b(t − 1)]    (2b)
    b(t) = β[a(t) − a(t − 1)] + (1 − β)b(t − 1)    (2c)
    0 ≤ α, β ≤ 1;    (2d)

and for LSM data:

    x̂(t + 1) = [a(t) + b(t)]c(t − L + 1)    (3a)
    a(t) = α·x(t)/c(t − L) + (1 − α)[a(t − 1) + b(t − 1)]    (3b)
    b(t) = β[a(t) − a(t − 1)] + (1 − β)b(t − 1)    (3c)
    c(t) = γ·x(t)/a(t) + (1 − γ)c(t − L)    (3d)
    0 ≤ α, β, γ ≤ 1,    (3e)

where L is the known periodicity of the season. In all the equations above, the parameters α, β, and γ are called the "smoothing coefficients."

Our first step is to rewrite eq. (1) and, more importantly, eq. (2). This provides the basis for a new form of ES for LSM data, more general than (3). The new form, which is a natural extension of (1) and (2), suggests types of data for which the ES algorithm can result in optimal (in the MSE sense) performance. Equation (1) can be readily rewritten as

    x̂(t + 1) = θ̂₁[x̂(t) − x(t)] + x(t),    (4a)

where clearly

    θ̂₁ = 1 − α.    (4b)

With some algebra one can show that eq. (2) is equivalent to

    x̂(t + 1) = θ̂₁x̂(t) + θ̂₂x̂(t − 1) + (2 − θ̂₁)x(t) − (1 + θ̂₂)x(t − 1),    (5a)

where

    θ̂₁ = 2 − α(1 + β)    (5b)
    θ̂₂ = α − 1.    (5c)

The basic difference between (2) and (5) is that (5) reflects the assumption that the noise-free part of the data x(t) is generated by the difference equation

    y(t) − 2y(t − 1) + y(t − 2) = 0,    (6)

while (2) reflects the assumption that the solution of (6) is

    y(t) = a + bt.    (7)

[Note that in (2), a(t) is the current estimate of 'a + bt' and b(t) is the current estimate of 'b.']

The ES as given in (3) for LSM data is clearly based on the assumption that the noise-free part of the data has the form

    y(t) = (a + bt)c(t),    (8a)

where

    c(t + L) = c(t).    (8b)

The difference equation satisfied by (8) is

    y(t) − 2y(t − L) + y(t − 2L) = 0,    (9)

and the corresponding ES is

    x̂(t + 1) = Σⱼ₌₁ᴹ θ̂ⱼx̂(t − j + 1) − Σⱼ₌₁ᴹ θ̂ⱼx(t − j + 1) + 2x(t − L + 1) − x(t − 2L + 1).    (10)

The parameters θ̂ⱼ, j = 1, ..., M, and the constraints they have to satisfy are discussed later. Also, the claimed correspondence between (9) and (10) will become more apparent in later discussion. At this point, however, we emphasize that while (7) is the general solution of (6), and thus (2) and (5) are equivalent, (8) is only one of many possible solutions of (9). Hence (10) represents an ES form that is more general than (3).

Similarly, data with linear trend and additive seasonal effects (LSA) have the underlying difference equation

    y(t) − y(t − 1) − y(t − L) + y(t − L − 1) = 0    (11)

and the corresponding ES is

    x̂(t + 1) = Σⱼ₌₁ᴹ θ̂ⱼx̂(t − j + 1) − Σⱼ₌₁ᴹ θ̂ⱼx(t − j + 1) + x(t) + x(t − L + 1) − x(t − L).    (12)

(This type of data was not addressed in Ref. 4 and, as far as we know, no form of ES applicable to it was proposed before the one here.)

To unify and simplify the discussions ahead we introduce the following notation. Let D be a unit delay operator, namely Dx(t) = x(t − 1), and let A(D) be a polynomial in D such that

    A(D) = 1                    for S data,
           2 − D                for LT data,
           2Dᴸ⁻¹ − D²ᴸ⁻¹        for LSM data,
           1 + Dᴸ⁻¹ − Dᴸ        for LSA data.    (13)

With these definitions, (4), (5), (10), and (12) can be unified as

    x̂(t + 1) = Σⱼ₌₁ᴹ θ̂ⱼDʲ⁻¹[x̂(t) − x(t)] + A(D)x(t),    (14)

where M = 1 will result in (4) and M = 2 in (5). It should also be pointed out that the ES as given by eqs. (1), (2), or (3) carries an implicit assumption: that one, two, or three coefficients, respectively, can in fact smooth the data. In other words, M in (14) is equal to one, two, or three. However, the general form (14) allows for a larger number of coefficients to get better approximations.
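The unified recursion (14), with A(D) taken from (13), can be sketched in a few lines of code. The Python sketch below is ours, not the paper's; the function names, the dictionary encoding of A(D), and the warm-up convention (trivial forecasts until enough history exists) are our assumptions. For M = 1 and S data it reduces, as eq. (4) promises, to the classical ES of eq. (1); the parameter `L` is required only for the seasonal (LSM, LSA) forms.

```python
import numpy as np

def ad_coeffs(kind, L=None):
    """Coefficients of A(D) from eq. (13): key j holds the weight of D^j, i.e. of x(t-j)."""
    if kind == "S":                      # simple data: A(D) = 1
        return {0: 1.0}
    if kind == "LT":                     # linear trend: A(D) = 2 - D
        return {0: 2.0, 1: -1.0}
    if kind == "LSM":                    # trend + multiplicative season: 2D^(L-1) - D^(2L-1)
        return {L - 1: 2.0, 2 * L - 1: -1.0}
    if kind == "LSA":                    # trend + additive season: 1 + D^(L-1) - D^L
        return {0: 1.0, L - 1: 1.0, L: -1.0}
    raise ValueError(kind)

def es_forecast(x, theta, kind, L=None):
    """One-step ES forecasts by eq. (14):
    xhat(t+1) = sum_j theta_j (xhat(t-j+1) - x(t-j+1)) + A(D)x(t)."""
    a = ad_coeffs(kind, L)
    M = len(theta)
    xhat = np.zeros(len(x) + 1)
    lag = max(max(a), M - 1)
    xhat[:lag + 1] = x[:lag + 1]         # warm-up: trivial forecasts until enough history
    for t in range(lag, len(x)):
        pred = sum(c * x[t - j] for j, c in a.items())                 # A(D)x(t)
        pred += sum(theta[j] * (xhat[t - j] - x[t - j]) for j in range(M))
        xhat[t + 1] = pred
    return xhat
```

For LT data with M = 2 the loop reproduces the rewritten form (5a) exactly, since the two sums are just a regrouping of its terms.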
To observe the optimal properties of the ES forecasts, we define the forecast error as

    e(t) = x(t) − x̂(t)    (15)

and use as our criterion for forecast quality the mean square error (MSE), i.e., E{e²(t)}. With this in mind, it is clear that optimal performance is achieved if e(t) becomes a white noise sequence (i.e., independent and identically distributed with zero mean). Namely, the ES technique, while assuming knowledge of the generating process for the noise-free component of the data, attempts to "whiten" the noise component. This attempt implies an underlying assumption that the data are generated through, or at least approximated by, the process

    [1 − DA(D)]x(t) = [1 − Σᵢ₌₁ᴹ θᵢDⁱ]ε(t),    (16)

where ε(t) is white noise with variance σ². Substituting (14) and (16) into (15) results in

    [1 − Σᵢ₌₁ᴹ θ̂ᵢDⁱ]e(t) = [1 − Σᵢ₌₁ᴹ θᵢDⁱ]ε(t).    (17)

This equation satisfied by e(t) is the basis of our claims for the correspondence between eqs. (9) and (10), and (11) and (12). Equation (17) immediately suggests the conditions for optimal forecasting. First, to get bounded MSE one must require:

Condition 1: All zeros of the polynomial 1 − Σᵢ₌₁ᴹ θ̂ᵢλⁱ are outside the unit circle.

If, in addition, we also require:

Condition 2: θ̂ⱼ = θⱼ, j = 1, 2, ..., M,

then, clearly, from eq. (17), e(t) will converge to ε(t) and optimal forecasting (in the MSE sense) is achieved.

Remark 1: As we discussed here, the sufficiency of Conditions 1 and 2 is quite obvious; however, they are also necessary. This is argued in Appendix A.

Remark 2: In Ref. 4, α and β for LT data are restricted to the interval [0, 1], which corresponds to the set S₂ in Fig. 1. The actual constraints follow from applying Condition 1 to the M = 2 case. This results in the set S₁ in Fig. 1, which clearly contains S₂ and is considerably larger.
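Condition 1 can be checked numerically by locating the zeros of 1 − Σθ̂ᵢλⁱ. The sketch below is our own illustration, not part of the paper; the function name is hypothetical. For M = 2 its verdict coincides with the triangular set S₁ of Fig. 1.

```python
import numpy as np

def satisfies_condition_1(theta_hat):
    """Condition 1: every zero of 1 - sum_i theta_hat[i] * lam**(i+1) lies outside the unit circle."""
    # np.roots expects coefficients ordered from the highest power down;
    # it trims leading zeros itself, so theta_hat_M = 0 degenerates gracefully.
    coeffs = [-th for th in reversed(theta_hat)] + [1.0]
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))
```

For M = 2 this reproduces S₁ = {(θ̂₁, θ̂₂): |θ̂₂| < 1, θ̂₂ + θ̂₁ < 1, θ̂₂ − θ̂₁ < 1}, which is the standard stability triangle for a second-order difference equation.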
Allowing a larger constraint set for θ̂₁ and θ̂₂ (or, correspondingly, for α and β) will result in more cases for which ES could result in optimal performance.

Fig. 1-The constraint sets: S₁ = {(θ̂₁, θ̂₂): |θ̂₂| < 1, θ̂₂ + θ̂₁ < 1, θ̂₂ − θ̂₁ < 1}; S₂ = {(θ̂₁, θ̂₂): θ̂₁ = 2 − α(1 + β), θ̂₂ = α − 1, 0 < α, β < 1}.

III. ADAPTIVE GRADIENT EXPONENTIAL SMOOTHING

In the previous section we argued that for data that can be approximated by (16), forecasting with ES of the form (14) can result in optimal performance in the MSE sense. To achieve this, Conditions 1 and 2 must be satisfied. However, while Condition 1 can be satisfied by proper choice of the θ̂ᵢ, Condition 2 is, in general, hard to satisfy since the values of θᵢ in eq. (16) are not known. Basically, the ARIMA model-fitting-based forecasting (Ref. 7) deals with exactly this type of problem. The θᵢ of eq. (16) are estimated and these estimates are then used as the θ̂ᵢ in eq. (14) in an attempt to satisfy Condition 2.

In the ES algorithm no such attempt is made. In practice, forecasters using ES choose some fixed values for the θ̂ⱼ that satisfy Condition 1 [or even more restrictive constraints, e.g., eq. (2d)]. These values are based on intuition, experience, and familiarity with the data they forecast. However, considerable differences between the underlying θᵢ and the chosen θ̂ᵢ can result in significant performance degradation. This is demonstrated in Fig. 2 for the case M = 2. The MSE for this case was computed in closed form as a function of θ₁ and θ₂ for some fixed θ̂₁ and θ̂₂ and graphed in the figure. Together with phenomena like nonstationary discontinuities (step-like changes in the data) and changes in the data-generating process (i.e., the θᵢ change values), this resulted in unsatisfactory performance of the ES. The realization of what may cause this poor performance brought about the idea of using adaptive schemes in which

Fig. 2-The mean squared error as a function of the data-generating parameters for M = 2. (The smoothing coefficients are fixed at θ̂₁ = −0.3, θ̂₂ = −0.3.)

the θ̂ⱼ are not fixed but are adjusted in an attempt to improve performance. Compared to the existing Adaptive Exponential Smoothing (AES) techniques (see, e.g., Refs. 8 through 11), the new technique we introduce here is analytically more sound, and there are strong indications that it converges to optimal performance in the MSE sense for data approximated by (16).

This new technique is based on a gradient search for the minimum of the MSE. If the MSE were available as a function of the θ̂ⱼ, one could compute the gradient

    ∇ = ∂E{e²(t)}/∂θ̂,    (18)

where θ̂ = [θ̂₁, θ̂₂, ..., θ̂_M]ᵀ, and recursively update the θ̂ⱼ through

    θ̂(t + 1) = θ̂(t) − μ∇,    (19)

where μ > 0 is the adaptation constant. This is the gradient search technique, sometimes referred to as the steepest descent technique. In general, however, the MSE is not available as a function of the θ̂ⱼ; hence, neither is the gradient. Instead, we use an instantaneous estimate of this gradient. To get this estimate we replace E{e²(t)} by e²(t) and the gradient by

    ∇̂ = ∂e²(t)/∂θ̂ = 2e(t)·∂e(t)/∂θ̂.    (20)

Let us denote s(t) = ∂e(t)/∂θ̂ as the "sensitivity vector," since it gives an indication of how sensitive the error e(t) is to the values of the θ̂ⱼ. While s(t) is not available, we can use eq. (17) to develop a means for generating it. Let us take partial derivatives of both sides of this equation with respect to θ̂. Since the right-hand side does not depend explicitly on θ̂ we get:

    (1 − Σᵢ₌₁ᴹ θ̂ᵢDⁱ)sⱼ(t + 1) = Dʲ⁻¹e(t),   j = 1, 2, ..., M.    (21)

At this point we are ready to introduce the Adaptive Gradient Exponential Smoothing (AGES) technique. Combining eqs.
(14), (19), (20), and (21) we get the forecast:

    x̂(t + 1) = A(D)x(t) − Σᵢ₌₁ᴹ θ̂ᵢ(t)e(t − i + 1)    (22)

[see the definition of A(D) in eq. (13)], the sensitivity functions:

    sⱼ(t + 1) = Σᵢ₌₁ᴹ θ̂ᵢ(t)sⱼ(t − i + 1) + e(t − j + 1),   j = 1, 2, ..., M,    (23)

and the coefficient adjustments:

    θ̂ⱼ(t + 1) = θ̂ⱼ(t) − 2μe(t)sⱼ(t),   j = 1, 2, ..., M.    (24)

Recall that the error e(t) = x(t) − x̂(t). Both our simulations and our experiments (as described in the next section) strongly indicate that AGES converges to optimal performance through convergence of the θ̂ⱼ(t) to the θⱼ. Namely, the error e(t) is adaptively whitened. Despite these indications, since the resulting equations are quite complex, a global proof of convergence of the AGES technique is beyond the scope of this paper. However, we conclude this section by treating the special case M = 1 and showing local convergence properties for it.

Let M = 1; then eqs. (17), (23), and (24) become

    e(t + 1) = θ̂₁(t)e(t) + ε(t + 1) − θ₁ε(t)
    s₁(t + 1) = θ̂₁(t)s₁(t) + e(t)

and

    θ̂₁(t + 1) = θ̂₁(t) − 2μe(t)s₁(t).

Assuming θ̂₁(t) is independent of e(t) and s₁(t) (similar assumptions are common in convergence proofs of adaptive filters) and observing that

    E{ε(t)·e(t)} = σ²,   E{ε(t + 1)·e(t)} = 0,   E{s₁(t)·ε(t)} = 0,

we get

    E{e²(t + 1)} = E{θ̂₁²(t)}·E{e²(t)} + σ²[1 + θ₁² − 2θ₁E{θ̂₁(t)}]
    E{s₁(t + 1)·e(t + 1)} = E{θ̂₁²(t)}·E{s₁(t)·e(t)} + E{θ̂₁(t)}·E{e²(t)} − θ₁σ²
    E{θ̂₁(t + 1)} = E{θ̂₁(t)} − 2μE{s₁(t)e(t)}.    (25)

If we assume in addition that θ̂₁(t) has a small variance, namely E{θ̂₁²(t)} ≈ [E{θ̂₁(t)}]² (the simulation results tend to support this assumption), then defining

    γ₁(t) = E{e²(t)} − σ²
    γ₂(t) = E{s₁(t)·e(t)}
    γ₃(t) = E{θ̂₁(t)} − θ₁    (26)

and substituting in (25) results in

    γ₁(t + 1) = [γ₃(t) + θ₁]²γ₁(t) + σ²[γ₃(t)]²
    γ₂(t + 1) = [γ₃(t) + θ₁]²γ₂(t) + [γ₃(t) + θ₁]γ₁(t) + σ²γ₃(t)
    γ₃(t + 1) = γ₃(t) − 2μγ₂(t).    (27)

Clearly, if we could prove that γ₁(t), γ₂(t), and γ₃(t) converge to the origin globally (i.e., independent of the initial values), it would mean [see eq. (26)] that the MSE converges to the minimum, σ², and E{θ̂₁(t)} converges to θ₁. However, despite strong indications from our simulations that these variables do converge globally, we can prove only local convergence. In addition, the proof provides an indication as to how to choose the parameter μ. Let us linearize eq. (27) around the origin to get

    γ₁(t + 1) = θ₁²γ₁(t)
    γ₂(t + 1) = θ₁²γ₂(t) + θ₁γ₁(t) + σ²γ₃(t)
    γ₃(t + 1) = γ₃(t) − 2μγ₂(t).    (28)

The coefficient matrix is

    A = [ θ₁²    0     0
          θ₁     θ₁²   σ²
          0     −2μ    1 ],

and to ensure convergence all eigenvalues of A must be within the unit circle. The eigenvalues of A are

    λ₁ = θ₁²,
    λ₂,₃ = (1/2){1 + θ₁² ± [(1 − θ₁²)² − 8μσ²]^(1/2)},

and it can be verified that choosing

    μ < (1 − θ₁²)/(2σ²)    (29)

will guarantee the convergence of eq. (28). Condition (29) implies that if |θ₁| is close to one, μ must be chosen very small and the convergence will be slow. Again, our simulation experiments verified this observation.

IV. SIMULATION RESULTS

We divide our experiments with AGES into two parts. In the first part we applied both ES and AGES to data generated by the computer and compared the results. In the second part we applied AGES to real data that we took from Ref. 7.

Fig. 3-Comparison of forecasting performance between ES (β = 0.8) and AGES.

Equation (16) was used to generate data of types S, LT, and LSM by the computer. The results of applying both ES and AGES to these data are presented in Figs. 3, 4, and 5 and in Table I. Each point on the curves of Fig.
3 corresponds to a complete run on a sequence of data generated with a particular choice of the θⱼ. The resulting MSEs for the ES and the AGES forecasts are presented, and the comparison clearly indicates the superiority of the AGES algorithm. In addition, we observe that AGES results, in almost all the runs, in an MSE very close to the minimum, σ². In Fig. 5, we followed the variation of the θ̂ⱼ(t) with time in a number of runs. The results clearly show that the θ̂ⱼ(t) converge to the θⱼ from a variety of initial values; this indicates global convergence properties. Similar results are observed in Table I for data with multiplicative seasonal effects and linear trend. The θ̂ⱼ(t) clearly converge to the θⱼ, and the MSE, when AGES is applied, is again very close to the optimal value, σ².

Fig. 4-Comparison of mean squared error in forecasting with ES (θ̂₁ = θ̂₂ = −0.3) and AGES as a function of the data-generating parameters θ₁ and θ₂. (a) θ₂ = −0.9. (b) θ₂ = −0.6.

Fig. 5-Convergence of: (a) θ̂₁(t) to the optimal value θ₁ in the AGES method. (b) θ̂₁(t) and θ̂₂(t) from various initial conditions to θ₁ and θ₂, using AGES on data with linear trend.

From Ref. 7 we took data of the simple kind (no linear trend or seasonal effects): the IBM common stock closing prices, daily, from May 17, 1961 through November 2, 1962. To these data we applied both ES and AGES, and the results are presented in Fig. 6.
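For the simple-data case (A(D) = 1, M = 1), the AGES recursions (22) through (24) take only a few lines, and data of the kind used for Figs. 3 through 5 can be generated directly from eq. (16). The Python sketch below is ours, not the paper's; the function names, the step size μ, the run length, and the zero initial conditions are our choices (μ is picked comfortably below the bound (29)).

```python
import numpy as np

def generate_simple(n, theta1, sigma=1.0, seed=0):
    # Simple data from eq. (16) with A(D) = 1:
    #   x(t) - x(t-1) = eps(t) - theta1 * eps(t-1)
    rng = np.random.default_rng(seed)
    eps = sigma * rng.standard_normal(n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = x[t - 1] + eps[t] - theta1 * eps[t - 1]
    return x

def ages_simple(x, mu, theta_init=0.0):
    """AGES for simple data (A(D) = 1, M = 1), eqs. (22)-(24).
    Returns the forecast errors e(t) and the trajectory of theta_hat_1(t)."""
    n = len(x)
    e = np.zeros(n)                  # e(t) = x(t) - xhat(t), eq. (15)
    th_hist = np.zeros(n)
    th, s = theta_init, 0.0          # theta_hat_1(t) and sensitivity s_1(t)
    for t in range(n - 1):
        th_hist[t] = th
        xhat_next = x[t] - th * e[t]          # forecast, eq. (22)
        th_new = th - 2 * mu * e[t] * s       # coefficient adjustment, eq. (24)
        s = th * s + e[t]                     # sensitivity update, eq. (23)
        e[t + 1] = x[t + 1] - xhat_next       # next forecast error
        th = th_new
    th_hist[-1] = th
    return e, th_hist
```

In our trial runs with θ₁ = −0.5, σ = 1, and μ = 0.002, θ̂₁(t) settles near the generating value θ₁ and the trailing mean square of e(t) approaches σ², mirroring the behavior shown in Fig. 5(a).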
Each point on the curves corresponds to a run on the same data, each time with a different coefficient (for the ES) or a different initial condition (for the AGES). The further the coefficient used in the ES is from θ₁ (which in this case is equal to −0.1, as indicated in Ref. 7), the better the performance of AGES is relative to the ES.

Further experiments were conducted on monthly international airline passengers data (Ref. 7). These data, as Fig. 7 indicates, are with linear trend and multiplicative seasonal effects. We applied the AGES algorithm (with M = 3) and the results are presented in Fig. 7. In Ref. 7 it is claimed that sometimes, rather than work with the actual data, it is more convenient to work with the logarithm of the data. As we argue in Appendix B, these logarithms, as data, have linear trend and additive seasonal effects (see Fig. 8). Hence, to the logarithms we applied AGES for linear trend and additive seasonal effects, and the results are presented in Fig. 8a (M = 3). We used the same data (the logarithms) to see whether the performance improves with larger M. AGES was applied with M = 13 and the results, as presented in Fig. 8b, clearly indicate that for these data M = 3 was sufficient.

Table I-Comparison of MSE in forecasting with ES (θ̂₁ = −0.2, θ̂₂ = 0.5, θ̂₃ = 0.4) and AGES*

    Data-Generating Coefficients (and Adaptive Coefficients)        MSE (×σ²)
    θ₁ (θ̂₁)         θ₂ (θ̂₂)         θ₃ (θ̂₃)          AGES      ES
    −1.3 (−1.29)    1.4 (1.39)      0.8 (0.75)       1.1972    11.7155
    −1.95 (−1.79)   2.1 (1.97)      0.8 (0.68)       1.3831    20.5821
    −0.6 (−0.56)    0.75 (0.69)     0.8 (0.78)       1.0749    5.1431
    −0.75 (−0.76)   0.6 (0.6)       0.8 (0.77)       1.0600    5.0256
    0.6 (0.6)       −0.75 (−0.73)   0.8 (0.79)       1.0663    1.3577
    0.0 (0.01)      0.0 (−0.04)     0.0 (0.03)       1.0040    1.5737
    −1.0 (−0.99)    1.0 (0.98)      1.0 (0.94)       1.2001    8.6414
    0.5 (0.49)      −0.2 (−0.16)    0.4 (0.4)        1.0397    1.0137
    0.25 (0.26)     −0.1 (−0.09)    0.4 (0.41)       0.9973    1.0839
    −0.9 (−0.81)    1.2 (1.19)      0.4 (0.36)       1.1044    7.1069
    −1.35 (−1.24)   1.8 (1.74)      0.4 (0.34)       1.2192    13.0215
    −0.75 (−0.75)   0.3 (0.31)      0.4 (0.38)       1.0574    3.8630
    −0.5 (−0.48)    1.0 (0.96)      0.0 (−0.04)      1.0605    4.2218
    −0.75 (−0.66)   1.5 (1.42)      0.0 (−0.05)      1.1165    6.9455
    0.0 (0.03)      0.75 (0.77)     0.0 (0.03)       1.0212    2.4687
    −0.75 (−0.74)   0.0 (−0.03)     0.0 (0.02)       1.0314    3.5907
    0.5 (0.53)      0.2 (0.19)      −0.4 (−0.43)     1.0186    1.7692
    −0.15 (−0.14)   1.2 (1.19)      −0.4 (−0.39)     1.1115    4.0338
    −1.35 (−1.22)   −1.8 (−1.7)     −0.4 (−0.36)     1.2077    13.8494
    −0.75 (−0.76)   −0.3 (−0.3)     −0.4 (−0.39)     1.0438    4.6957
    −0.3 (−0.28)    −0.75 (−0.79)   −0.4 (−0.38)     1.0062    3.9623
    0.5 (0.49)      0.4 (0.43)      −0.8 (−0.78)     1.0461    2.6126
    −0.65 (−0.62)   −0.7 (−0.73)    −0.8 (−0.8)      1.0670    6.1628
    −0.75 (−0.75)   −0.6 (−0.61)    −0.8 (−0.79)     1.0815    6.8677
    −0.6 (−0.56)    −0.75 (−0.74)   −0.8 (−0.74)     1.1065    7.0182
    −0.4 (−0.38)    −0.5 (−0.52)    −0.8 (−0.81)     1.0759    5.1676

    * The values to which the θ̂ⱼ(t) converge are given in parentheses.

V. CONCLUSIONS

In this paper we have introduced a new forecasting technique, Adaptive Gradient Exponential Smoothing (AGES), which is based on Exponential Smoothing (ES). We have elaborated on the optimality properties, in the MSE sense, of the ES. For certain types of data, the ES can result in optimal performance provided some coefficients are known. In general, these coefficients are unavailable, and the AGES shows strong indications of converging to these unknown coefficients and providing optimal performance.
Fig. 6-Comparison of performance of forecasting with the ES (varying the coefficient in each run) and the AGES methods.

Fig. 7-Forecasting with AGES the international airline passengers data (M = 3). (Note that these data have linear trend and multiplicative seasonal effects.)

Clearly, more extensive experiments and practical use of the proposed forecasting technique, the AGES, are required. A user-friendly software package can be developed for implementation of this technique if sufficient interest is generated.

VI. ACKNOWLEDGMENT

The author wishes to thank the reviewers for their thoroughness. Their comments were very helpful in the revision of this paper.

Fig. 8-Forecasting with AGES the logarithm of the data in Fig. 7 for: (a) M = 3. (b) M = 13. (Note that the logarithm of the data in Fig. 7 has the form of data with additive seasonal effects and linear trend.)

REFERENCES

1. C. D. Pack and B. A. Whitaker, "Kalman Filter Models for Network Forecasting," B.S.T.J., 61, No. 1 (January 1982), pp. 1-14.
2. J. P. Moreland, "A Robust Sequential Projection for Traffic Load Forecasting," B.S.T.J., 61, No. 1 (January 1982), pp. 16-38.
3. C. R.
Szelag, "A Short-Term Forecasting Algorithm for Trunk Demand Servicing," B.S.T.J., 61, No. 1 (January 1982), pp. 67-96.
4. P. R. Winters, "Forecasting Sales by Exponentially Weighted Moving Averages," Management Science, 6, No. 3 (April 1960), pp. 324-42.
5. R. G. Brown, Smoothing, Forecasting and Prediction of Discrete Time Series, Englewood Cliffs, NJ: Prentice-Hall, 1962.
6. J. F. Muth, "Optimal Properties of Exponentially Weighted Forecasts of Time Series With Permanent and Transitory Components," J. Am. Statist. Assn., 55 (June 1960), pp. 299-306.
7. G. E. P. Box and G. M. Jenkins, Time Series Analysis, Forecasting and Control, San Francisco, CA: Holden-Day, 1970.
8. W. M. Chow, "Adaptive Control of the Exponential Smoothing Constant," J. Indust. Eng., 16, No. 6 (October 1965), pp. 314-7.
9. D. W. Trigg and A. G. Leach, "Exponential Smoothing With an Adaptive Response Rate," Oper. Res. Quart., 18, No. 1 (March 1967), pp. 53-60.
10. D. C. Whybark, "A Comparison of Adaptive Forecasting Techniques," Logist. Transport. Rev., 8 (1973), pp. 13-26.
11. S. D. Roberts and R. Reed, "The Development of a Self-Adaptive Forecasting Technique," AIIE Trans., 1 (December 1969), pp. 314-22.
12. S. Ekern, "Adaptive Exponential Smoothing Revisited," J. Oper. Res. Soc., 32, No. 9 (September 1981), pp. 775-82.

APPENDIX A

Necessity of Conditions 1 and 2 for the Convergence of e(t) to ε(t) in Equation (17)

Condition 1 is clearly necessary (as well as sufficient) for the convergence of E{e²(t)} to a finite value. We want to show that Condition 2 is necessary for E{e²(t)} to converge to σ². Let

    γ_ee(τ) = E{e(t)·e(t − τ)}    (30)

and

    γ_eε(τ) = E{e(t)·ε(t − τ)}.    (31)

Clearly,

    γ_ee(τ) = γ_ee(−τ)    (32)

and, from eq. (17) and the definition of ε(t),

    γ_eε(−τ) = 0,   τ > 0.    (33)

With these definitions it follows from eq.
(17), after transients die, that

    γ_eε(0) = σ²
    γ_eε(1) − θ̂₁γ_eε(0) = −θ₁σ²
    γ_eε(2) − θ̂₁γ_eε(1) − θ̂₂γ_eε(0) = −θ₂σ²
    ...
    γ_eε(M) − θ̂₁γ_eε(M − 1) − ... − θ̂_Mγ_eε(0) = −θ_Mσ²,    (34)

or, in matrix form, a lower-triangular system with ones on the diagonal and the −θ̂ᵢ below it, acting on [γ_eε(0), γ_eε(1), ..., γ_eε(M)]ᵀ and equated to σ²[1, −θ₁, ..., −θ_M]ᵀ. Also,

    γ_ee(0) − θ̂₁γ_ee(1) − θ̂₂γ_ee(2) − ... − θ̂_Mγ_ee(M)
        = γ_eε(0) − θ₁γ_eε(1) − ... − θ_Mγ_eε(M)
    γ_ee(1) − θ̂₁γ_ee(0) − θ̂₂γ_ee(1) − ... − θ̂_Mγ_ee(M − 1)
        = −θ₁γ_eε(0) − θ₂γ_eε(1) − ... − θ_Mγ_eε(M − 1)
    γ_ee(2) − θ̂₁γ_ee(1) − θ̂₂γ_ee(0) − ... − θ̂_Mγ_ee(M − 2)
        = −θ₂γ_eε(0) − ... − θ_Mγ_eε(M − 2)
    ...
    γ_ee(M) − θ̂₁γ_ee(M − 1) − θ̂₂γ_ee(M − 2) − ... − θ̂_Mγ_ee(0)
        = −θ_Mγ_eε(0),    (35)

which can again be written in matrix form. Now, if we claim that e(t) converges to ε(t), it means that

    γ_eε(τ) = δ(τ)σ² = σ² for τ = 0, and 0 for τ ≠ 0,

and

    γ_ee(τ) = δ(τ)σ².

Then, substituting this in (34) or (35) results in θ̂ⱼ = θⱼ for j = 1, 2, ..., M, which is Condition 2. Hence this condition is necessary, as claimed.

APPENDIX B

Possible Transformation of Multiplicative Seasonal Effects Into Additive Seasonal Effects

Suppose we are given data of the noise-free form

    y(t) = (a + bt)c(t),   c(t + L) = c(t),    (36)

which is with linear trend and multiplicative seasonal effects. Let

    z(t) = log[y(t)].    (37)

Then substitution of (36) gives (if we assume bt << a, which is true for most real data)

    z(t) = log a + log(1 + (b/a)t) + log c(t)
         ≈ log a + (b/a)t + log c(t)
         = ā + b̄t + c̄(t),    (38)

where

    ā = log a,   b̄ = b/a,   c̄(t) = log c(t).

Hence z(t) clearly has the form of data with linear trend and additive seasonal effects.

AUTHOR

Arie Feuer, B.Sc. (Mechanical Engineering), 1967, M.Sc. (Mechanical Engineering), 1973, Technion, Israel; Ph.D.
(Control Systems Engineering), 1978, Yale University; Bell Laboratories, 1978-. Since joining Bell Laboratories Mr. Feuer has been involved in telephone network measurement planning and implementations. He is actively involved in research in control system theory, signal processing, and adaptive systems, and currently looks into their possible applications for network measurements and operations.