International Journal of Scientific and Research Publications, Volume 6, Issue 10, October 2016
ISSN 2250-3153
337
Dynamic Time Series Regression: A Panacea for
Spurious Correlations
Emmanuel Alphonsus Akpan*, Imoh Udo Moffat**
*
Department of Mathematics and Statistics, University of Uyo, Nigeria
Department of Mathematics and Statistics, University of Uyo, Nigeria
**
Abstract- The study examined that the linear relationship
between Gross Domestic Product ( ) and Money Supply ( )
from 1981 to 2014 is spurious and could be avoided by dynamic
regression modeling. The fact that spurious regression always
results in misleading correlations between two time series was a
big motivation for undertaking this study. Therefore, exploring
data from the Central Bank of Nigeria Statistical Bulletin, we
found that the linear relationship between the dependent variable
( ) and the independent variable ( ) seemed spurious as the
errors of the regression model were found to be autocorrelated.
In a bid to correct this problem of spurious regression, we
identified lags 0, -1 and -2 of
as predictors of
using cross
correlation function. Hence, the dynamic regression of current
lag and past lags 1, 2 of
as predictors of revealed that the
errors are uncorrelated and the coefficient of determination is as
low as 0.2086, indicating that and are totally unrelated.
Index Terms- autocorrelated errors; cross correlation function;
dynamic regression; prewhitening; spurious regression.
I. INTRODUCTION
I
t is always common to see in literature that most regression
models are reported with very high coefficient of
determination,
which indicates that the model is a good fit
without considering the problem posed by autocorrelated errors
(Granger and Newbold, 1974). Though in classical regression
model, the error terms are assumed to be a white noise, that is, a
sequence of independent and identically distributed random
variables. In practice, however, the error terms always appear to
be autocorrelated (Wei, 2006; Box, Jenkins and Reinsel, 2008).
But if the autocorrelated error terms are ignored, the regression
becomes spurious resulting in misleading correlations, a situation
where a significant regression can be achieved for totally
unrelated series (Pankratz, 1991; Wei, 2006; Cryer and Chan,
2008). The problem associated with spurious regressions can be
avoided by including lagged values of time independent variables
in the regression model (Pankratz, 1991; Brockwell and Davis,
2002; Fuller, 1996; Wei, 2006; and, Box, Jenkins and Reinsel,
2008). Different authors use different names to refer to
regression models in which the current value of time dependent
variable is a function of current and lagged values of time
independent variables. For instance, Fuller (1996) referred to
such regression models as transfer function models while
Pankratz, (1991) called it dynamic regression models. However,
prior studies revealed that differencing both the time dependent
variable and time independent variables is one approach that
spurious regressions could be avoided but failed to take into
consideration lagged values of time independent variables,
thereby, creating a gap in knowledge that the dependent variable
may be related to independent variables with time lags which
often results in loss of useful information about the roles of
lagged values in explaining movements in the dependent
variable. Thus, this paper contributes towards filling the gap by
analyzing the relationship between Gross Domestic Product ( )
– dependent variable and Money Supply ( ) – independent
variable.
II. METHODOLOGY
Regression Model
Rawlings, Pantula and Dickey (1998) defined a standard
regression model as
=
+ , , +
, +⋯ +
, + �
(2.1)
where
= dependent variable
� = regression parameters, i = 1,…, k
� = independent variables, i = 1,…, k
� = error term assumed to be i.i.d. N(0, � )
Thus, the dependent variable for a time series regression
model with independent variables is a linear combination of
independent variables measured in the same time frame as the
dependent variable. Estimates of the parameters of the model in
(2.1) can be obtained by Least Squares Estimation Method (see
for example Drasper and Smith, 1998; Rawlings, Pantula and
Dickey, 1998).
Dynamic Regression Model
Dynamic regression model as specified by Pankratz (1991)
is as follows:
=
+
+
− +⋯ +
− + �
(2.2)
The intuition is that equation (2.2) is built to take into
account useful information about the roles of time (past) lag
( − , … , − ) in explaining the movements in , which is not
possible with equation (2.1). The parameters of dynamic
regression models are estimated using maximum likelihood
method, see Pankratz (1991) for details.
Autoregressive Moving Average (ARMA) Processes
A natural extension of pure autoregressive and pure moving
average processes is the mixed autoregressive moving average
www.ijsrp.org
International Journal of Scientific and Research Publications, Volume 6, Issue 10, October 2016
ISSN 2250-3153
�
processes, which includes the autoregressive and
moving average as special cases (Wei, 2006).
�
, process if { }
A stochastic process { } is an
is stationary and if for every �,
=�
�
(2.3)
= −
−
−⋯ −
is the autoregressive
coefficient polynomial.
�
= − �
−�
−⋯ −�
is the moving average
coefficient polynomial.
Box, Jenkins and Reinsel (2008) considered the extension of
ARMA model in (2.3) to deal with homogenous non-stationary
time series in which , is non-stationary but its � ℎ difference is
a stationary ARMA model. Denoting the � ℎ difference of by
=�
∇� = �
�
(2.4)
where
is the nonstationary autoregressive operator
such that d of the roots of
= are unity and the remainder
lie outside the unit circle while �
is a stationary
autoregressive operator.
Thus, (2.4) is called an autoregressive integrated moving
average model and can be referred to as an
�
, �,
model.
Prewhitening and Cross Correlation Function (CCF)
According to Wei (2002), assuming that independent
variable,
follows an ARMA(p,q) process,
= �� (
�(
(2.5)
where is white noise. The series
−
=
��
�(
(2.6)
is called the prewhitened series. Applying the same
prewhitening transformation to the dependent variable, we obtain
a filtered independent series,
−
=
��
�(
(2.7)
Let Y = { } be time dependent variable, X = { } be time
independent
variable,
and
the
cross
covariance
=
� , ) for each pair of integers t and
function , ,
s. The cross correlation between X and Y at lag k can be defined
�
, =
, − ) =
by
. In general, the
√�
�
cross correlation function is not an even function since
, − ) need not equal
, + ). Moreover, the
sample cross correlation function (CCF) is useful for identifying
lags of independent variable that might be useful predictors of
dependent variable (Cryer and Chan, 2008). However, the CCF
can be obtained by prewhitening by considering a more general
regression model relating X to Y,
= ∑∞
−∞
− + �
(2.8)
where X is independent of �. Applying the filter
to both
sides of (2.8), we have
̃ = ∑∞
̃ − + �̃
−∞
(2.9)
where �̃ = � − � − − � − − ⋯
338
The prewhitening procedure thus orthogonalizes the various lags
of X in the original regression model (Cryer and Chan, 2008).
Model Selection Criteria
For a given data set, when there are multiple adequate
models, the selection criterion is normally based on summary
statistics from residuals of a fitted model (Wei, 2006).
There are several model selection criteria based on residuals
(see Wei, 2006). For the purpose of this study, we consider the
well-known Akaike’s information criterion (AIC), (Akaike,
1973) defined as
AIC = −
likelihood +
number of parameters
where the likelihood function is evaluated at the maximum
likelihood estimates. The optimal order of the model is chosen by
the value of the number of parameters, so that AIC is minimum
(Wei, 2006).
Model Diagnostic Checking
Box and Pierce (1970) proposed the Portmanteau statistics:
∗
(m)
T ∑ ̂
=
=
(2.10)
where T is the number of observations.
Ljung and Box (1978) modify the ∗ (m) statistic to increase
the power of the test in finite samples as follows:
Q(m)
=
T(T
+
(2.11)
where T is the number of observations.
2)
∑
=
̂�2
�
�−
The decision rule is to reject
if Q(m) > � , where �
denotes the 100 (1 – )th percentile of a Chi-squared distribution
with m – (p + q) degree of freedom (see for example Akpan,
Moffat and Ekpo, 2016).
III.
DATA ANALYSIS AND DISCUSSION
This study considers the Gross Domestic Product (N’
Billion) as the dependent variable and the Money Supply (N’
Billion) as the independent variable. The data were obtained
from the Central Bank of Nigeria Statistical Bulletin for a period
spanning from 1981 to 2014. Each series consists of 34
observations.
First, we regress
on
, and obtain the estimated
regression model presented in equation (3.1) below:
=
.
+
.
s.e
(294.2920)
(0.5383)
t-value
( .
(7.564)
(3.1)
p-value
(0.85)
(1.98e-09)
= 0.5709 [Excerpts from Table 1].
From the fitted model in (3.1), it is observed that the
inclusion of the
in the model is significant since the p - value
= (1.98e-09) < 0.05 level of significance, implying that there is
a very strong evidence to conclude that
has a significant linear
contribution to . The coefficient of determination ( ) indicates
www.ijsrp.org
International Journal of Scientific and Research Publications, Volume 6, Issue 10, October 2016
ISSN 2250-3153
that the
is able to explain about 57.09% of the total variation
in . If the error term is found to be autocorrelated, then the
regression model could be termed spurious.
Table 1: Output of Regression Model
Call:
lm(formula = Y ~ X)
Residuals:
Min
1Q Median 3Q Max
-3985.9 -72.3 59.2 577.8 3080.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.0350 294.2920 0.190 0.85
X
4.0717 0.5383 7.564 1.98e-09 ***
--Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1595 on 43 degrees of freedom
Multiple R-squared: 0.5709,
Adjusted R-squared: 0.5609
F-statistic: 57.21 on 1 and 43 DF, p-value: 1.984e-09
In order to check if autocorrelations exist in the residuals
obtained from the regression model in equation (3.1), we
consider the ACF [Figure 1] of the residuals from the regression
model;
339
independent variable, we examine the cross correlation function
(CCF), that is, a smooth tapering pattern in the CCF shows which
lags of independent variable we should used. Considering the
CCF for and
in [Figure 2], we noticed that it is unclear and
misleading. To identify which lags of
may predict
, we
therefore, apply the prewhitening technique to help us identify
the lags of CCF.
Figure 2: Cross Correlation Function of Money Supply and
Gross Domestic Product
By prewhitening, we mean fitting an ARIMA model to
and reducing the residuals to white noise. Thereafter, we filter
with the fitted ARIMA model to obtain the white noise residual
series. Lastly,
is filtered with the same model and then the
cross correlation function is computed using the prewhitened
and prewhitened . Now, fitting an ARIMA model to , we
allow the data to talk. The ACF and PACF of
[Figures 3and 4]
respectively, indicate a tentative ARIMA(2,0,2) model alongside
ARIMA(2,0,0) ARIMA(0,0,2) models. Both the ACF and PACF
of residuals from ARIMA(2,0,2) model, [Figures 5 and 6],
ARIMA(2,0,0) model, [Figures 7 and 8], and ARIMA(0,0,2)
model, [Figures 9 and 10] respectively, are near white noise.
Figure 1: ACF of Residuals from Regression Model
it is observed that there are significant spikes at lags 2, 4
and 6 which are more than 5% of the total lags while all other
lags fall within the confidence bounds, as such the residuals from
the regression model appear to be autocorrelated and the
regression model in (3.1) is termed spurious.
The reason for autocorrelated error term of the regression
model in (3.1) is not farfetched. The nonstationarity in
and
is more likely the cause of spurious correlations found in the
error term of the regression model. To cop-out the menace of
spurious correlations in the error term, we therefore, employ the
dynamic regression method which takes into account both the
lagged and current values of . In order to build the dynamic
regression model successfully, we make use of Box and Jenkins
three iterative methods; model identification, model estimation
and model diagnostic checking.
Model Identification
To identify the nature of the time-lagged relationship
between a dependent variable, and current and past values of an
Figure 3: ACF of Money Supply
Figure 4: PACF of Money Supply
www.ijsrp.org
International Journal of Scientific and Research Publications, Volume 6, Issue 10, October 2016
ISSN 2250-3153
340
Figure 5: ACF of the Residuals from ARIMA (2,0,2) Model
Figure 9: ACF of the Residuals from ARIMA (0,0,2) Model
Figure 6: PACF of the Residuals from ARIMA(2,0,2) Model
Figure 10: PACF of the Residuals from ARIMA (0,0,2)
Model
Since all the three tentative models seem appropriate, their
outputs as seen in Table 2, Table 3 and Table 4 for
ARIMA(2,0,2), ARIMA(2,0,0) and ARIMA(0,0,2) respectively,
Figure 7: ACF of the Residuals from ARIMA(2,0,0) Model
Figure 8: PACF of the Residuals from ARIMA(2,0,0) Model
Table 2: Output of ARIMA(2,0,2) Model
Call:
arima(x = X, order = c(2, 0, 2))
Coefficients:
ar1 ar2 ma1 ma2 intercept
-0.1704 0.5977 0.5434 0.0290 297.0807
s.e. 0.1884 0.1892 0.2383 0.2389 134.1617
sigma^2 estimated as 116088: log likelihood = -326.69, aic =
665.38
Table 3: Output of ARIMA(2,0,0) Model
Call:
arima(x = X, order = c(2, 0, 0))
Coefficients:
ar1 ar2 intercept
0.2188 0.4599 281.2463
s.e. 0.1281 0.1289 154.9215
sigma^2 estimated as 126455: log likelihood = -328.5, aic = 665
Table 4: Output of ARIMA(0,0,2) Model
Call:
arima(x = X, order = c(0, 0, 2))
Coefficients:
ma1 ma2 intercept
0.3487 0.4537 312.8221
s.e. 0.1588 0.1082 96.1965
www.ijsrp.org
International Journal of Scientific and Research Publications, Volume 6, Issue 10, October 2016
ISSN 2250-3153
sigma^2 estimated as 132106: log likelihood = -329.42, aic =
666.84
we observed that both ARIMA(2,0,2) model and ARIMA(2,0,0)
model have a smaller information criteria (aic) than
ARIMA(0,0,2) model. Parsimoniously, we chose ARIMA(2,0,0)
model over ARIMA(2,0,2) model. Thus, the estimated
ARIMA(2,0,0) model is presented in (3.2) below;
(1−
.
− .
=
�
) ∗
(3.2)
where ∗ = ( − 281.2463)
[Excerpts from Table 3]
Then, we filter the
series using the model for
as shown in
(3.3)
(1−
.
− .
)
(3.3)
Now, the cross correlation function (CCF) for the prewhitened
(which is the product of the ARIMA(2,0,0) model and its
residuals) and the prewhitened is presented in [Figure 11]. We
observed clear spikes at lags 0, -1,-2 and 1. Since we are only
interested in the past values of , the spike at lag1 is ignored.
Thus, , − and − should be included as predictors of .
Figure 11: CCF for Prewhitened Money Supply and
Prewhitened GDP
Model Estimation
The estimated dynamic regression model is presented in (3.4)
below:
=
.
− .
Xt− − .
.
s.e
(37.62991)
(0.08255)
(0.07183)
t-value
( .
(-1.673)
(-1.978)
(3.4)
p-value
(1.31e-05)
(0.1017)
(0.0545)
= 0.2086 [Excerpts from Table 5].
Xt− −
(0.12893)
(-0.241)
(0.8109)
From the dynamic regression model in (3.4), the effect of
using lagged values in overcoming spurious correlations is
clearly seen as both lagged (Xt− , Xt− ) and current ( ) values
in the model appear to show no sign of relationship with since
their p - values are all less than 5% level of significance.
Moreover, the coefficient of determination ( = .
)
indicates that there is no linear relationship between and .
Model Diagnostic Checking
341
The results from Box-Ljung test indicate that the residuals
from the model in (3.4) are uncorrelated since
= 17.46, df
=22 with corresponding p-value = 0.7375 > 0.05 level of
significance. [Excerpts from Table 6]
Table 5: Output of Dynamic Regression Model
Call:
dynlm(formula = Y ~ XLag2 + XLag1 + X)
Residuals:
Min
1Q Median 3Q Max
-158.84 -80.45 -28.63 20.12 715.01
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 185.75550 37.62991 4.936 1.31e-05 ***
XLag2
-0.13813 0.08255 -1.673 0.1017
XLag1
-0.03104 0.12893 -0.241 0.8109
X
-0.14208 0.07183 -1.978 0.0545 .
--Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 156.5 on 42 degrees of freedom
(10 observations deleted due to missingness)
Multiple R-squared: 0.2086,
Adjusted R-squared: 0.1521
F-statistic: 3.691 on 3 and 42 DF, p-value: 0.01908
Table 6: Box – Ljung Test for Residuals from Dynamic
Regression Model
Box-Ljung test
data: residuals(regmod01)
X-squared = 17.46, df = 22, p-value = 0.7375
IV. CONCLUSION
Although it is well documented in the literature that
differencing both the dependent and independent variables offers
a good solution to avoiding spurious correlations, our study takes
a look at dynamic regression as another approach to avoiding
spurious correlations. This was achieved by allowing the
dependent variable to be expressed as a function of both lagged
and current values of the independent variable. Regressing on
, we found that the errors were correlated being a clear
evidence of spurious correlation. However, the cross correlation
function of the prewhitened variables indicated that lags 0, -1 and
-2 of
should be included as predictors of . Subsequently, we
modeled a dynamic regression of
with past lags 1, 2 and
current lag of t as explanatory variables and the resulting
residual series was diagnostically checked using Ljung and Box
Q – statistic. The residuals were confirmed to be uncorrelated,
revealing that
and
are totally unrelated. Therefore, we
concluded that the problem of spurious correlations could be
avoided by modeling a series using dynamic regression. This
study could be extended to include the lagged values of the
dependent variable, and both lagged and current values of
multiple independent variables as predictors of the dependent
variable.
REFERENCES
[1]
H. Akaike, A New Look at the Statistical Model Identification,
www.ijsrp.org
International Journal of Scientific and Research Publications, Volume 6, Issue 10, October 2016
ISSN 2250-3153
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
IEEE Transactions on Automatic Control, 1973, Vol. 19, no 6, pp. 716 –
723.
E. A. Akpan, I. U Moffat and N. B. Ekpo, Arma – Arch Modeling of the
Returns of First Bank of Nigeria, European Scientific Journal, 2016,
Vol.12, no.8, pp. 257 – 266.
G.E.P. Box, G. M Jenkins and G. C. Reinsel, Time Series Analysis:
Forecasting and Control. 3rd Ed., New Jersey: Wiley and Sons, 2008, pp. 522.
G. E. P. Box and D. Pierce, Distribution of Residual Autocorrelations
in Autoregressive Integrated Moving Average Time Series Models,
Journal of the American Statistical Association, 1970, 65, pp.1509-1526.
P. J. Brockwell and R. A. Davis, Introduction to Time Series and
Forecasting, 2nd Ed,. Springer, 2002.
J. D. Cryer, and K. Chan, Time Series Analysis with Application in R, 2nd
Ed., Springer, 2008.
N. R. Drasper and H. Smith, Applied Regression Analysis, 3rd Ed., New
York, John Wiley and Sons, 1998.
W. A. Fuller, Introduction to Statistical Time Series, 2nd Ed., New York,
John Wiley and Sons, 1996.
C. W. J. Granger and P. Newbold, Spurious Regressions in Econometrics,
Journal of Econometrics, 1974, 2, pp.111 – 120.
342
[10] G. Ljung and G. C. Box, On a Measure of Lack of Fit in Time series
models, Biometrica, 1978, Vol. 2 no.66, pp. 265-270.
[11] A. Pankratz, Forecasting with Dynamic Regressions Models, 3nd Ed., New
York, John Wiley and Sons, 1991.
[12] J. O. Rawlings, S. G. Pantula and D. A. Dickey, Applied Regression
Analysis: AResearch Tool, 2nd Ed., Springer, 1998.
[13] W. W. S. Wei, Time Series Analysis Univariate and Multivariate Methods,
2nd Ed., Adison Westley, 2006.
AUTHORS
First Author – Akpan, E. A., Department of Mathematics and
Statistics, University of Uyo, eubong44@gmail.com,
+2348036200343
Second Author – Moffat, I. U. Ph. D, Department of
Mathematics and Statistics, University of Uyo,
moffitto@yahoo.com, +2348064497511
www.ijsrp.org