Theory and Algorithms for
Forecasting Non-Stationary
Time Series
VITALY KUZNETSOV (KUZNETSOV@), GOOGLE RESEARCH
MEHRYAR MOHRI (MOHRI@), COURANT INSTITUTE & GOOGLE RESEARCH
Motivation
Time series prediction:
• stock values.
• economic variables.
• weather: e.g., local and global temperature.
• sensors: Internet-of-Things.
• earthquakes.
• energy demand.
• signal processing.
• sales forecasting.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 2
Google Trends
[Three figure-only slides: Google Trends query-volume plots used as motivating examples.]
Theory and Algorithms for Forecasting Non-Stationary Time Series pages 3-5
Challenges
Standard Supervised Learning:
• IID assumption.
• Same distribution for training and test data.
• Distributions fixed over time (stationarity).
none of these assumptions holds
for time series!
Theory and Algorithms for Forecasting Non-Stationary Time Series page 6
Outline
Introduction to time series analysis.
Learning theory for forecasting non-stationary time series.
Algorithms for forecasting non-stationary time series.
Time series prediction and on-line learning.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 7
Introduction to Time
Series Analysis
Classical Framework
Postulate a particular parametric model that is assumed to generate the data.
Use the given sample to estimate the unknown parameters of the model.
Use the estimated model to make predictions.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 9
Autoregressive (AR) Models
Definition: the AR(p) model is a linear generative model based on the p-th order Markov assumption:
$$\forall t,\quad Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t$$
where
• the $\epsilon_t$ are zero-mean uncorrelated random variables with variance $\sigma^2$.
• $a_1, \ldots, a_p$ are the autoregressive coefficients.
• $Y_t$ is the observed stochastic process.
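For illustration, here is a minimal NumPy sketch (not from the slides; the coefficients, horizon, and noise scale are arbitrary choices) that simulates an AR(p) process directly from this definition:

    import numpy as np

    def simulate_ar(coeffs, T, sigma=1.0, seed=0):
        """Simulate Y_t = sum_i a_i Y_{t-i} + eps_t with zero initial conditions."""
        rng = np.random.default_rng(seed)
        p = len(coeffs)
        y = np.zeros(T + p)
        eps = rng.normal(0.0, sigma, T + p)
        for t in range(p, T + p):
            # most recent value first: [y_{t-1}, ..., y_{t-p}]
            y[t] = np.dot(coeffs, y[t - p:t][::-1]) + eps[t]
        return y[p:]

    # Example: an AR(2) process with illustrative coefficients a_1 = 0.5, a_2 = -0.25.
    series = simulate_ar([0.5, -0.25], T=500)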
Theory and Algorithms for Forecasting Non-Stationary Time Series page 10
Moving Averages (MA)
Definition: the MA(q) model is a linear generative model for the noise term based on the q-th order Markov assumption:
$$\forall t,\quad Y_t = \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j}$$
where
• $b_1, \ldots, b_q$ are the moving average coefficients.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 11
ARMA model
(Whittle, 1951; Box & Jenkins, 1971)
Definition: the ARMA(p, q) model is a linear generative model that combines the AR(p) and MA(q) models:
$$\forall t,\quad Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j}.$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 12
ARMA
Theory and Algorithms for Forecasting Non-Stationary Time Series page 13
Stationarity
Definition: a sequence of random variables $\mathbf{Z} = \{Z_t\}_{t=-\infty}^{+\infty}$ is stationary if its distribution is invariant to shifting in time:
$$(Z_t, \ldots, Z_{t+m}) \stackrel{d}{=} (Z_{t+k}, \ldots, Z_{t+m+k}) \quad \text{(same distribution)}.$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 14
Weak Stationarity
Definition: a sequence of random variables $\mathbf{Z} = \{Z_t\}_{t=-\infty}^{+\infty}$ is weakly stationary if its first and second moments are invariant to shifting in time, that is,
• $\mathbb{E}[Z_t]$ is independent of $t$.
• $\mathbb{E}[Z_t Z_{t-j}] = f(j)$ for some function $f$.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 15
Lag Operator
The lag operator $L$ is defined by $L Y_t = Y_{t-1}$.
ARMA model in terms of the lag operator:
$$\Big(1 - \sum_{i=1}^{p} a_i L^i\Big) Y_t = \Big(1 + \sum_{j=1}^{q} b_j L^j\Big)\epsilon_t.$$
The characteristic polynomial
$$P(z) = 1 - \sum_{i=1}^{p} a_i z^i$$
can be used to study properties of this stochastic process.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 16
Weak Stationarity of ARMA
Theorem: an ARMA( p, q) process is weakly stationary if the
roots of the characteristic polynomial P (z) are outside the
unit circle.
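A quick numerical check of this condition (a sketch, not from the slides): compute the roots of $P(z)$ with NumPy and verify that they all lie outside the unit circle.

    import numpy as np

    def is_weakly_stationary(ar_coeffs):
        """Check whether all roots of P(z) = 1 - a_1 z - ... - a_p z^p lie outside the unit circle."""
        # np.roots expects coefficients ordered from the highest degree down: [-a_p, ..., -a_1, 1].
        poly = np.concatenate(([-a for a in ar_coeffs[::-1]], [1.0]))
        return bool(np.all(np.abs(np.roots(poly)) > 1.0))

    print(is_weakly_stationary([0.5, -0.25]))  # True: both roots have modulus 2
    print(is_weakly_stationary([1.1]))         # False: root at z = 1/1.1 lies inside the unit circle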
Theory and Algorithms for Forecasting Non-Stationary Time Series page 17
Proof
If the roots of the characteristic polynomial are outside the unit circle, then:
$$P(z) = 1 - \sum_{i=1}^{p} a_i z^i = c\,(\lambda_1 - z)\cdots(\lambda_p - z) = c'\,(1 - \lambda_1^{-1} z)\cdots(1 - \lambda_p^{-1} z)$$
where $|\lambda_i| > 1$ for all $i = 1, \ldots, p$ and $c, c'$ are constants.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 18
Proof
Therefore, the ARMA(p, q) process
$$\Big(1 - \sum_{i=1}^{p} a_i L^i\Big) Y_t = \Big(1 + \sum_{j=1}^{q} b_j L^j\Big)\epsilon_t$$
admits an MA($\infty$) representation:
$$Y_t = \big(1 - \lambda_1^{-1} L\big)^{-1} \cdots \big(1 - \lambda_p^{-1} L\big)^{-1}\Big(1 + \sum_{j=1}^{q} b_j L^j\Big)\epsilon_t,$$
where
$$\big(1 - \lambda_i^{-1} L\big)^{-1} = \sum_{k=0}^{\infty} \big(\lambda_i^{-1} L\big)^k$$
is well-defined since $|\lambda_i^{-1}| < 1$.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 19
Proof
Therefore, it suffices to show that
$$Y_t = \sum_{j=0}^{\infty} \psi_j\, \epsilon_{t-j}$$
is weakly stationary.
The mean is constant:
$$\mathbb{E}[Y_t] = \sum_{j=0}^{\infty} \psi_j\, \mathbb{E}[\epsilon_{t-j}] = 0.$$
The covariance function $\mathbb{E}[Y_t Y_{t-l}]$ only depends on the lag $l$:
$$\mathbb{E}[Y_t Y_{t-l}] = \sum_{k=0}^{\infty}\sum_{j=0}^{\infty} \psi_k \psi_j\, \mathbb{E}[\epsilon_{t-j}\epsilon_{t-l-k}] = \sigma^2 \sum_{j=0}^{\infty} \psi_j \psi_{j+l}.$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 20
ARIMA
Non-stationary processes can be modeled using processes
whose characteristic polynomial has unit roots.
A characteristic polynomial with unit roots can be factored as
$$P(z) = R(z)\,(1 - z)^D$$
where $R(z)$ has no unit roots.
Definition: the ARIMA(p, D, q) model is an ARMA(p, q) model for $(1 - L)^D Y_t$:
$$\Big(1 - \sum_{i=1}^{p} a_i L^i\Big)(1 - L)^D\, Y_t = \Big(1 + \sum_{j=1}^{q} b_j L^j\Big)\epsilon_t.$$
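A minimal sketch of the differencing step (not from the slides): apply the operator $(1 - L)$ to the series $D$ times with NumPy; the resulting differenced series can then be modeled as ARMA(p, q).

    import numpy as np

    def difference(y, D=1):
        """Apply the differencing operator (1 - L) to the series D times."""
        for _ in range(D):
            y = np.diff(y)   # y_t - y_{t-1}
        return y

    # Example: a random walk has a unit root; one difference yields white noise.
    rng = np.random.default_rng(0)
    walk = np.cumsum(rng.normal(size=500))
    stationary = difference(walk, D=1)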
Theory and Algorithms for Forecasting Non-Stationary Time Series page 21
Other Extensions
Further variants:
• models with seasonal components (SARIMA).
• models with side information (ARIMAX).
• models with long-memory (ARFIMA).
• multi-variate time series models (VAR).
• models with time-varying coefficients.
• other non-linear models.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 23
Modeling Variance
(Engle, 1982; Bollerslev, 1986)
Definition: the generalized autoregressive conditional heteroscedasticity GARCH(p, q) model is an ARMA(p, q) model for the variance $\sigma_t^2$ of the noise term $\epsilon_t$:
$$\forall t,\quad \sigma_{t+1}^2 = \omega + \sum_{i=0}^{p-1} \alpha_i\, \sigma_{t-i}^2 + \sum_{j=0}^{q-1} \beta_j\, \epsilon_{t-j}^2$$
where
• the $\epsilon_t$ are zero-mean Gaussian random variables with variance $\sigma_t^2$ conditioned on $\{Y_{t-1}, Y_{t-2}, \ldots\}$.
• $\omega > 0$ is the mean parameter.
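An illustrative NumPy sketch (not from the slides) simulating a GARCH(1,1) volatility process; the parameter values are assumptions chosen only so that the variance recursion is stable.

    import numpy as np

    def simulate_garch11(T, omega=0.1, alpha=0.3, beta=0.6, seed=0):
        """Simulate eps_t with conditional variance sigma2_{t+1} = omega + alpha*sigma2_t + beta*eps_t^2."""
        rng = np.random.default_rng(seed)
        sigma2 = np.empty(T)
        eps = np.empty(T)
        sigma2[0] = omega / (1.0 - alpha - beta)   # unconditional variance as the starting value
        for t in range(T):
            eps[t] = rng.normal(0.0, np.sqrt(sigma2[t]))
            if t + 1 < T:
                sigma2[t + 1] = omega + alpha * sigma2[t] + beta * eps[t] ** 2
        return eps, sigma2

    noise, variances = simulate_garch11(1000)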
Theory and Algorithms for Forecasting Non-Stationary Time Series page 24
State-Space Models
Continuous state space version of Hidden Markov Models:
$$X_{t+1} = B X_t + U_t, \qquad Y_t = A X_t + \epsilon_t$$
where
• $X_t$ is an n-dimensional state vector.
• $Y_t$ is the observed stochastic process.
• $A$ and $B$ are model parameters.
• $U_t$ and $\epsilon_t$ are noise terms.
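A minimal simulation sketch (not from the slides); the dimensions, matrices, and noise scales below are illustrative assumptions.

    import numpy as np

    def simulate_state_space(A, B, T, state_noise=0.1, obs_noise=0.1, seed=0):
        """Simulate X_{t+1} = B X_t + U_t and Y_t = A X_t + eps_t with Gaussian noise."""
        rng = np.random.default_rng(seed)
        n, m = B.shape[0], A.shape[0]
        X = np.zeros((T, n))
        Y = np.zeros((T, m))
        for t in range(T):
            Y[t] = A @ X[t] + obs_noise * rng.normal(size=m)
            if t + 1 < T:
                X[t + 1] = B @ X[t] + state_noise * rng.normal(size=n)
        return X, Y

    B = np.array([[0.9, 0.1], [0.0, 0.8]])   # stable transition matrix (spectral radius < 1)
    A = np.array([[1.0, 0.0]])               # observe the first state coordinate
    states, observations = simulate_state_space(A, B, T=200)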
Theory and Algorithms for Forecasting Non-Stationary Time Series page 26
State-Space Models
Theory and Algorithms for Forecasting Non-Stationary Time Series page 27
Estimation
Different methods for estimating model parameters:
• Maximum likelihood estimation:
• Requires further parametric assumptions on the noise
distribution (e.g. Gaussian).
• Method of moments (Yule-Walker estimator).
• Conditional and unconditional least squares estimation.
• Restricted to certain models.
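As an illustration of the method-of-moments approach (a sketch under simple assumptions, not the slides' code), the Yule-Walker equations estimate AR(p) coefficients from empirical autocovariances:

    import numpy as np

    def yule_walker(y, p):
        """Estimate AR(p) coefficients by solving the Yule-Walker equations R a = gamma[1:p+1]."""
        y = np.asarray(y, dtype=float)
        y = y - y.mean()
        T = len(y)
        gamma = np.array([np.dot(y[: T - k], y[k:]) / T for k in range(p + 1)])  # autocovariances
        R = np.array([[gamma[abs(i - j)] for j in range(p)] for i in range(p)])  # Toeplitz matrix
        return np.linalg.solve(R, gamma[1 : p + 1])

    # Example: applied to the AR(2) series simulated earlier, the estimate should be
    # close to the true coefficients [0.5, -0.25]:
    # a_hat = yule_walker(series, p=2)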
Theory and Algorithms for Forecasting Non-Stationary Time Series page 28
Invertibility of ARMA
Definition: an ARMA(p, q) process is invertible if the roots of the polynomial
$$Q(z) = 1 + \sum_{j=1}^{q} b_j z^j$$
are outside the unit circle.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 29
Learning guarantee
Theorem: assume $Y_t \sim$ ARMA(p, q) is weakly stationary and invertible. Let $\widehat{a}_T$ denote the least squares estimate of $a = (a_1, \ldots, a_p)$ and assume that $p$ is known. Then, $\|\widehat{a}_T - a\|$ converges in probability to zero.
Similar results hold for other estimators and other models.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 30
Notes
Many other generative models exist.
Learning guarantees are asymptotic.
Model needs to be correctly specified.
Non-stationarity needs to be modeled explicitly.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 31
Theory
Time Series Forecasting
Training data: a finite sample realization of some stochastic process,
$$(X_1, Y_1), \ldots, (X_T, Y_T) \in \mathcal{Z} = \mathcal{X} \times \mathcal{Y}.$$
Loss function: $L : H \times \mathcal{Z} \to [0, \infty)$, where $H$ is a hypothesis set of functions mapping from $\mathcal{X}$ to $\mathcal{Y}$.
Problem: find $h \in H$ with small path-dependent expected loss,
$$\mathcal{L}(h, \mathbf{Z}_1^T) = \mathop{\mathbb{E}}_{Z_{T+1}}\big[L(h, Z_{T+1}) \,\big|\, \mathbf{Z}_1^T\big].$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 33
Standard Assumptions
Stationarity:
$$(Z_t, \ldots, Z_{t+m}) \stackrel{d}{=} (Z_{t+k}, \ldots, Z_{t+m+k}) \quad \text{(same distribution)}.$$
Mixing:
dependence between events decaying with the gap k.
[Diagram: an event B measurable up to time n and an event A measurable after time n + k.]
Theory and Algorithms for Forecasting Non-Stationary Time Series page 34
Learning Theory
Stationary and β-mixing processes: generalization bounds.
• PAC-learning preserved in that setting (Vidyasagar, 1997).
• VC-dimension bounds for binary classification (Yu, 1994).
• covering number bounds for regression (Meir, 2000).
• Rademacher complexity bounds for general loss functions (MM and Rostamizadeh, 2009).
• PAC-Bayesian bounds (Alquier et al., 2014).
Theory and Algorithms for Forecasting Non-Stationary Time Series page 35
Learning Theory
Stationarity and mixing: algorithm-dependent bounds.
• AdaBoost (Lozano et al., 2006).
• general stability bounds (MM and Rostamizadeh, 2010).
• regularized ERM (Steinwart and Christmann, 2009).
• stable on-line algorithms (Agarwal and Duchi, 2013).
Theory and Algorithms for Forecasting Non-Stationary Time Series page 36
Problem
Stationarity and mixing assumptions:
• often do not hold (think trend or periodic signals).
• not testable.
• estimating mixing parameters can be hard, even if their general functional form is known.
• the hypothesis set and loss function are ignored.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 37
Questions
Is learning with general (non-stationary, non-mixing)
stochastic processes possible?
Can we design algorithms with theoretical guarantees?
need a new tool for the analysis.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 38
Key Quantity - Fixed h
Key difference: the path-dependent losses $\mathcal{L}(h, \mathbf{Z}_1^{t-1})$, for $1 \le t \le T$, versus the target $\mathcal{L}(h, \mathbf{Z}_1^T)$ at time $T + 1$.
Key average quantity:
$$\frac{1}{T}\sum_{t=1}^{T}\Big[\mathcal{L}(h, \mathbf{Z}_1^T) - \mathcal{L}(h, \mathbf{Z}_1^{t-1})\Big].$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 39
Discrepancy
Definition:
$$\Delta = \sup_{h \in H}\bigg( \mathcal{L}(h, \mathbf{Z}_1^T) - \frac{1}{T}\sum_{t=1}^{T} \mathcal{L}(h, \mathbf{Z}_1^{t-1}) \bigg).$$
• captures the hypothesis set and the loss function.
• can be estimated from data, under mild assumptions.
• $\Delta = 0$ in the IID case or for weakly stationary processes with linear hypotheses and squared loss (K and MM, 2014).
Theory and Algorithms for Forecasting Non-Stationary Time Series page 40
Weighted Discrepancy
Definition: extension to weights $\mathbf{q} = (q_1, \ldots, q_T)$:
$$\Delta(\mathbf{q}) = \sup_{h \in H}\bigg( \mathcal{L}(h, \mathbf{Z}_1^T) - \sum_{t=1}^{T} q_t\, \mathcal{L}(h, \mathbf{Z}_1^{t-1}) \bigg).$$
• strictly extends the discrepancy definition in drifting (MM and Muñoz Medina, 2012) or domain adaptation (Mansour, MM, Rostamizadeh 2009; Cortes and MM 2011, 2014); or for binary loss (Devroye et al., 1996; Ben-David et al., 2007).
• admits upper bounds in terms of relative entropy, or in terms of β-mixing coefficients of asymptotic stationarity for an asymptotically stationary process.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 41
Estimation
Decomposition: $\Delta(\mathbf{q}) \le \Delta_0(\mathbf{q}) + \Delta_s$, with
$$\Delta(\mathbf{q}) \le \sup_{h \in H}\bigg( \frac{1}{s}\sum_{t=T-s+1}^{T} \mathcal{L}(h, \mathbf{Z}_1^{t-1}) - \sum_{t=1}^{T} q_t\, \mathcal{L}(h, \mathbf{Z}_1^{t-1}) \bigg) + \sup_{h \in H}\bigg( \mathcal{L}(h, \mathbf{Z}_1^T) - \frac{1}{s}\sum_{t=T-s+1}^{T} \mathcal{L}(h, \mathbf{Z}_1^{t-1}) \bigg).$$
[Timeline: the last s points, from T - s + 1 to T, serve as a proxy for the target time T + 1.]
Theory and Algorithms for Forecasting Non-Stationary Time Series page 42
Learning Guarantee
Theorem: for any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in H$ and all $\alpha > 0$,
$$\mathcal{L}(h, \mathbf{Z}_1^T) \le \sum_{t=1}^{T} q_t\, L(h, Z_t) + \Delta(\mathbf{q}) + 2\alpha + \|\mathbf{q}\|_2 \sqrt{2 \log \frac{\mathbb{E}_{\mathbf{z}}[\mathcal{N}_1(\alpha, G, \mathbf{z})]}{\delta}},$$
where $G = \{z \mapsto L(h, z) : h \in H\}$.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 43
Bound with Emp. Discrepancy
Corollary: for any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in H$ and all $\alpha > 0$,
$$\mathcal{L}(h, \mathbf{Z}_1^T) \le \sum_{t=1}^{T} q_t\, L(h, Z_t) + \widehat{\Delta}(\mathbf{q}) + \Delta_s + 4\alpha + \Big[\|\mathbf{q}\|_2 + \|\mathbf{q} - \mathbf{u}_s\|_2\Big]\sqrt{2 \log \frac{2\,\mathbb{E}_{\mathbf{z}}\big[\mathcal{N}_1(\alpha, G, \mathbf{z})\big]}{\delta}},$$
where
$$\widehat{\Delta}(\mathbf{q}) = \sup_{h \in H}\bigg( \frac{1}{s}\sum_{t=T-s+1}^{T} L(h, Z_t) - \sum_{t=1}^{T} q_t\, L(h, Z_t) \bigg),$$
$\mathbf{u}_s$ is the uniform distribution over the last $s$ points $\{T - s + 1, \ldots, T\}$, and $G = \{z \mapsto L(h, z) : h \in H\}$.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 44
Weighted Sequential α-Cover
(Rakhlin et al., 2010; K and MM, 2015)
Definition: let $\mathbf{z}$ be a $\mathcal{Z}$-valued full binary tree of depth T. Then, a set of trees V is an $l_1$-norm $\mathbf{q}$-weighted $\alpha$-cover of a function class G on $\mathbf{z}$ if
$$\forall g \in G,\ \forall \sigma \in \{\pm 1\}^T,\ \exists \mathbf{v} \in V:\quad \sum_{t=1}^{T} \big|v_t(\sigma) - g(z_t(\sigma))\big| \le \frac{\alpha}{\|\mathbf{q}\|_\infty}.$$
[Figure: a full binary tree with node labels $g(z_1), v_1$; $g(z_2), v_2$; $g(z_3), v_3$; ... and edges labeled −1/+1; along any path $\sigma$, the vector of deviations $\big(v_t(\sigma) - g(z_t(\sigma))\big)_t$ has $l_1$-norm at most $\alpha / \|\mathbf{q}\|_\infty$.]
Theory and Algorithms for Forecasting Non-Stationary Time Series page 45
Sequential Covering Numbers
Definitions:
• sequential covering number: $\mathcal{N}_1(\alpha, G, \mathbf{z}) = \min\{|V| : V \text{ an } l_1\text{-norm } \mathbf{q}\text{-weighted } \alpha\text{-cover of } G \text{ on } \mathbf{z}\}$.
• expected sequential covering number: $\mathbb{E}_{\mathbf{z} \sim \mathbf{Z}_T}[\mathcal{N}_1(\alpha, G, \mathbf{z})]$, where $\mathbf{Z}_T$ is the distribution over trees induced by the decoupled tangent sequence: $Z_1, Z_1' \sim D_1$; $Z_2, Z_2' \sim D_2(\cdot \mid Z_1)$ and $Z_3, Z_3' \sim D_2(\cdot \mid Z_1')$; $Z_4, Z_4' \sim D_3(\cdot \mid Z_1, Z_2)$ and $Z_5, Z_5' \sim D_3(\cdot \mid Z_1, Z_2')$; and so on.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 46
Proof
Key quantities:
$$\Phi(\mathbf{Z}_1^T) = \sup_{h \in H}\Big( \mathcal{L}(h, \mathbf{Z}_1^T) - \sum_{t=1}^{T} q_t\, L(h, Z_t) \Big), \qquad
\Delta(\mathbf{q}) = \sup_{h \in H}\Big( \mathcal{L}(h, \mathbf{Z}_1^T) - \sum_{t=1}^{T} q_t\, \mathcal{L}(h, \mathbf{Z}_1^{t-1}) \Big).$$
Chernoff technique: for any $t > 0$,
$$\begin{aligned}
\mathbb{P}\big[\Phi(\mathbf{Z}_1^T) - \Delta(\mathbf{q}) > \epsilon\big]
&\le \mathbb{P}\Big[\sup_{h \in H} \sum_{t=1}^{T} q_t\big[\mathcal{L}(h, \mathbf{Z}_1^{t-1}) - L(h, Z_t)\big] > \epsilon\Big] && \text{(sub-additivity of sup)}\\
&= \mathbb{P}\Big[\exp\Big(t \sup_{h \in H} \sum_{t=1}^{T} q_t\big[\mathcal{L}(h, \mathbf{Z}_1^{t-1}) - L(h, Z_t)\big]\Big) > e^{t\epsilon}\Big] && (t > 0)\\
&\le e^{-t\epsilon}\, \mathbb{E}\Big[\exp\Big(t \sup_{h \in H} \sum_{t=1}^{T} q_t\big[\mathcal{L}(h, \mathbf{Z}_1^{t-1}) - L(h, Z_t)\big]\Big)\Big]. && \text{(Markov's inequality)}
\end{aligned}$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 47
Symmetrization
Key tool: the decoupled tangent sequence $\mathbf{Z'}_1^T$ associated to $\mathbf{Z}_1^T$:
• $Z_t$ and $Z_t'$ are i.i.d. given $\mathbf{Z}_1^{t-1}$.
$$\begin{aligned}
\mathbb{P}\big[\Phi(\mathbf{Z}_1^T) - \Delta(\mathbf{q}) > \epsilon\big]
&\le e^{-t\epsilon}\, \mathbb{E}\Big[\exp\Big(t \sup_{h \in H}\sum_{t=1}^{T} q_t\big[\mathcal{L}(h, \mathbf{Z}_1^{t-1}) - L(h, Z_t)\big]\Big)\Big]\\
&= e^{-t\epsilon}\, \mathbb{E}\Big[\exp\Big(t \sup_{h \in H}\sum_{t=1}^{T} q_t\big[\mathbb{E}[L(h, Z_t')\,|\,\mathbf{Z}_1^{t-1}] - L(h, Z_t)\big]\Big)\Big] && \text{(tangent sequence)}\\
&= e^{-t\epsilon}\, \mathbb{E}\Big[\exp\Big(t \sup_{h \in H} \mathbb{E}\Big[\sum_{t=1}^{T} q_t\big[L(h, Z_t') - L(h, Z_t)\big]\,\Big|\,\mathbf{Z}_1^T\Big]\Big)\Big] && \text{(linearity of expectation)}\\
&\le e^{-t\epsilon}\, \mathbb{E}\Big[\exp\Big(t \sup_{h \in H}\sum_{t=1}^{T} q_t\big[L(h, Z_t') - L(h, Z_t)\big]\Big)\Big]. && \text{(Jensen's inequality)}
\end{aligned}$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 48
Symmetrization
$$\begin{aligned}
\mathbb{P}\big[\Phi(\mathbf{Z}_1^T) - \Delta(\mathbf{q}) > \epsilon\big]
&\le e^{-t\epsilon}\, \mathbb{E}\Big[\exp\Big(t \sup_{h \in H}\sum_{t=1}^{T} q_t\big[L(h, Z_t') - L(h, Z_t)\big]\Big)\Big]\\
&= e^{-t\epsilon}\, \mathop{\mathbb{E}}_{(\mathbf{z}, \mathbf{z}')}\mathop{\mathbb{E}}_{\sigma}\Big[\exp\Big(t \sup_{h \in H}\sum_{t=1}^{T} q_t \sigma_t\big[L(h, z_t'(\sigma)) - L(h, z_t(\sigma))\big]\Big)\Big] && \text{(tangent sequence property)}\\
&\le e^{-t\epsilon}\, \mathop{\mathbb{E}}_{(\mathbf{z}, \mathbf{z}')}\mathop{\mathbb{E}}_{\sigma}\Big[\exp\Big(t \sup_{h \in H}\sum_{t=1}^{T} q_t \sigma_t L(h, z_t'(\sigma)) + t \sup_{h \in H}\sum_{t=1}^{T} q_t \sigma_t L(h, z_t(\sigma))\Big)\Big] && \text{(sub-additivity of sup)}\\
&\le e^{-t\epsilon}\, \mathop{\mathbb{E}}_{(\mathbf{z}, \mathbf{z}')}\mathop{\mathbb{E}}_{\sigma}\Big[\tfrac{1}{2}\exp\Big(2t \sup_{h \in H}\sum_{t=1}^{T} q_t \sigma_t L(h, z_t'(\sigma))\Big) + \tfrac{1}{2}\exp\Big(2t \sup_{h \in H}\sum_{t=1}^{T} q_t \sigma_t L(h, z_t(\sigma))\Big)\Big] && \text{(convexity of exp)}\\
&= e^{-t\epsilon}\, \mathop{\mathbb{E}}_{\mathbf{z}}\mathop{\mathbb{E}}_{\sigma}\Big[\exp\Big(2t \sup_{h \in H}\sum_{t=1}^{T} q_t \sigma_t L(h, z_t(\sigma))\Big)\Big].
\end{aligned}$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 49
Covering Number
$$\begin{aligned}
\mathbb{P}\big[\Phi(\mathbf{Z}_1^T) - \Delta(\mathbf{q}) > \epsilon\big]
&\le e^{-t\epsilon}\, \mathop{\mathbb{E}}_{\mathbf{z}}\mathop{\mathbb{E}}_{\sigma}\Big[\exp\Big(2t \sup_{h \in H}\sum_{t=1}^{T} q_t \sigma_t L(h, z_t(\sigma))\Big)\Big]\\
&\le e^{-t\epsilon}\, \mathop{\mathbb{E}}_{\mathbf{z}}\mathop{\mathbb{E}}_{\sigma}\Big[\exp\Big(2t \Big[\max_{\mathbf{v} \in V}\sum_{t=1}^{T} q_t \sigma_t v_t(\sigma) + \alpha\Big]\Big)\Big] && (\alpha\text{-covering})\\
&\le e^{-t(\epsilon - 2\alpha)}\, \mathop{\mathbb{E}}_{\mathbf{z}}\Big[\sum_{\mathbf{v} \in V} \mathop{\mathbb{E}}_{\sigma}\Big[\exp\Big(2t \sum_{t=1}^{T} q_t \sigma_t v_t(\sigma)\Big)\Big]\Big] && \text{(monotonicity of exp)}\\
&\le e^{-t(\epsilon - 2\alpha)}\, \mathop{\mathbb{E}}_{\mathbf{z}}\Big[\sum_{\mathbf{v} \in V}\exp\Big(\frac{t^2\|\mathbf{q}\|_2^2}{2}\Big)\Big] && \text{(Hoeffding's inequality)}\\
&\le \mathop{\mathbb{E}}_{\mathbf{z}}\big[\mathcal{N}_1(\alpha, G, \mathbf{z})\big]\, \exp\Big(-t(\epsilon - 2\alpha) + \frac{t^2\|\mathbf{q}\|_2^2}{2}\Big).
\end{aligned}$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 50
Algorithms
Review
Theorem: for any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in H$ and all $\alpha > 0$,
$$\mathcal{L}(h, \mathbf{Z}_1^T) \le \sum_{t=1}^{T} q_t\, L(h, Z_t) + \widehat{\Delta}(\mathbf{q}) + \Delta_s + 4\alpha + \Big[\|\mathbf{q}\|_2 + \|\mathbf{q} - \mathbf{u}_s\|_2\Big]\sqrt{2 \log \frac{2\,\mathbb{E}_{\mathbf{z}}\big[\mathcal{N}_1(\alpha, G, \mathbf{z})\big]}{\delta}}.$$
This bound can be extended to hold uniformly over $\mathbf{q}$ at the price of the additional term
$$\widetilde{O}\Big(\|\mathbf{q} - \mathbf{u}\|_1 \sqrt{\log_2 \log_2 \big(1/\|\mathbf{q} - \mathbf{u}\|_1\big)}\Big).$$
Data-dependent learning guarantee.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 52
Discrepancy-Risk Minimization
Key Idea: directly optimize the upper bound on
generalization over q and h .
This problem can be solved efficiently for some L and H.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 53
Kernel-Based Regression
Squared loss function: $L(y, y') = (y - y')^2$.
Hypothesis set: for a PDS kernel $K$ with feature map $\Phi_K$,
$$H = \big\{x \mapsto \mathbf{w} \cdot \Phi_K(x) : \|\mathbf{w}\|_{\mathbb{H}} \le \Lambda\big\}.$$
The complexity term can be bounded by
$$O\Big((\log^{3/2} T)\, \Lambda \sup_{x} \sqrt{K(x, x)}\, \|\mathbf{q}\|_2\Big).$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 54
Instantaneous Discrepancy
The empirical discrepancy can be further upper bounded in terms of instantaneous discrepancies:
$$\widehat{\Delta}(\mathbf{q}) \le \sum_{t=1}^{T} q_t\, d_t + M\,\|\mathbf{q} - \mathbf{u}\|_1$$
where $M = \sup_{y, y'} L(y, y')$ and
$$d_t = \sup_{h \in H}\bigg( \frac{1}{s}\sum_{r=T-s+1}^{T} L(h, Z_r) - L(h, Z_t) \bigg).$$
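The next slides compute the $d_t$ exactly via DC-programming for kernel-based hypotheses. As a rough, purely illustrative sketch (an approximation of my own, not the authors' algorithm), each $d_t$ can be lower bounded by taking the supremum over a finite random sample of bounded linear hypotheses on the raw features:

    import numpy as np

    def approx_instantaneous_discrepancies(X, y, s, Lambda=1.0, n_hyp=1000, seed=0):
        """Approximate d_t = sup_h [ (1/s) * sum of the last s squared losses - squared loss at t ]
        by maximizing over n_hyp random weight vectors with norm Lambda (a lower bound on the sup)."""
        rng = np.random.default_rng(seed)
        W = rng.normal(size=(n_hyp, X.shape[1]))
        W *= Lambda / np.maximum(np.linalg.norm(W, axis=1, keepdims=True), 1e-12)
        losses = (W @ X.T - y) ** 2            # (n_hyp, T): squared loss of each hypothesis at each t
        recent = losses[:, -s:].mean(axis=1)   # average loss over the last s points, per hypothesis
        return np.max(recent[:, None] - losses, axis=0)

    # Usage (illustrative shapes): d = approx_instantaneous_discrepancies(X, y, s=20)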
Theory and Algorithms for Forecasting Non-Stationary Time Series page 55
Proof
By sub-additivity of the supremum,
$$\begin{aligned}
\widehat{\Delta}(\mathbf{q}) &= \sup_{h \in H}\bigg\{ \frac{1}{s}\sum_{t=T-s+1}^{T} L(h, Z_t) - \sum_{t=1}^{T} q_t\, L(h, Z_t) \bigg\}\\
&= \sup_{h \in H}\bigg\{ \sum_{t=1}^{T} q_t\bigg( \frac{1}{s}\sum_{r=T-s+1}^{T} L(h, Z_r) - L(h, Z_t)\bigg) + \sum_{t=1}^{T}\big(u_t - q_t\big)\,\frac{1}{s}\sum_{r=T-s+1}^{T} L(h, Z_r) \bigg\}\\
&\le \sum_{t=1}^{T} q_t\, \sup_{h \in H}\bigg( \frac{1}{s}\sum_{r=T-s+1}^{T} L(h, Z_r) - L(h, Z_t)\bigg) + M\,\|\mathbf{u} - \mathbf{q}\|_1,
\end{aligned}$$
where $\mathbf{u}$ denotes the uniform distribution, so that $\sum_t u_t = 1$.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 56
Computing Discrepancies
Instantaneous discrepancy for kernel-based hypotheses with the squared loss:
$$d_t = \sup_{\|\mathbf{w}'\| \le \Lambda}\bigg( \sum_{s=1}^{T} u_s\big(\mathbf{w}' \cdot \Phi_K(x_s) - y_s\big)^2 - \big(\mathbf{w}' \cdot \Phi_K(x_t) - y_t\big)^2 \bigg).$$
This is a difference of convex (DC) functions.
Global optimum via DC-programming (Tao and An, 1998).
Theory and Algorithms for Forecasting Non-Stationary Time Series page 57
Discrepancy-Based Forecasting
Theorem: for any $\delta > 0$, with probability at least $1 - \delta$, for all kernel-based hypotheses $h \in H$ and all $0 < \|\mathbf{q} - \mathbf{u}\|_1 \le 1$,
$$\mathcal{L}(h, \mathbf{Z}_1^T) \le \sum_{t=1}^{T} q_t\, L(h, Z_t) + \widehat{\Delta}(\mathbf{q}) + \Delta_s + \widetilde{O}\Big(\big(\log^{3/2} T\big)\,\Lambda \sup_x \sqrt{K(x, x)} + \|\mathbf{q} - \mathbf{u}\|_1\Big).$$
Corresponding optimization problem:
$$\min_{\mathbf{q} \in [0,1]^T,\, \mathbf{w}}\ \bigg\{ \sum_{t=1}^{T} q_t\big(\mathbf{w} \cdot \Phi_K(x_t) - y_t\big)^2 + \lambda_1 \sum_{t=1}^{T} q_t\, d_t + \lambda_2 \|\mathbf{w}\|_{\mathbb{H}} + \lambda_3 \|\mathbf{q} - \mathbf{u}\|_1 \bigg\}.$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 58
Convex Problem
Change of variable: $r_t = 1/q_t$.
Upper bound: $|r_t^{-1} - 1/T| \le T^{-1}|r_t - T|$.
$$\min_{\mathbf{r} \in \mathcal{D},\, \mathbf{w}}\ \bigg\{ \sum_{t=1}^{T} \frac{\big(\mathbf{w} \cdot \Phi_K(x_t) - y_t\big)^2 + \lambda_1 d_t}{r_t} + \lambda_2 \|\mathbf{w}\|_{\mathbb{H}} + \lambda_3 \sum_{t=1}^{T} |r_t - T| \bigg\}$$
• where $\mathcal{D} = \{\mathbf{r} : r_t \ge 1\}$.
• this is a convex optimization problem.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 63
Two-Stage Algorithm
Minimize the empirical discrepancy $\widehat{\Delta}(\mathbf{q})$ over $\mathbf{q}$ (convex optimization).
Solve the (weighted) kernel ridge regression problem:
$$\min_{\mathbf{w}}\ \bigg\{ \sum_{t=1}^{T} q_t^*\big(\mathbf{w} \cdot \Phi_K(x_t) - y_t\big)^2 + \lambda \|\mathbf{w}\|_{\mathbb{H}} \bigg\}$$
where $\mathbf{q}^*$ is the solution of the discrepancy minimization problem.
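A minimal sketch of the second stage (an illustration under my own assumptions, not the authors' implementation): weighted kernel ridge regression solved in dual form with a Gaussian kernel and a squared-norm regularizer, a common variant of the $\|\mathbf{w}\|_{\mathbb{H}}$ penalty above.

    import numpy as np

    def gaussian_kernel(X1, X2, gamma=1.0):
        """Gaussian kernel matrix K[i, j] = exp(-gamma * ||x1_i - x2_j||^2)."""
        sq1 = np.sum(X1 ** 2, axis=1)[:, None]
        sq2 = np.sum(X2 ** 2, axis=1)[None, :]
        return np.exp(-gamma * (sq1 + sq2 - 2 * X1 @ X2.T))

    def weighted_krr(X, y, q, lam=1.0, gamma=1.0):
        """Solve min_w sum_t q_t (w . Phi(x_t) - y_t)^2 + lam ||w||^2 in the dual:
        (diag(q) K + lam I) alpha = diag(q) y, with h(x) = sum_t alpha_t K(x_t, x)."""
        K = gaussian_kernel(X, X, gamma)
        Q = np.diag(q)
        return np.linalg.solve(Q @ K + lam * np.eye(len(y)), Q @ y)

    # Usage with the weights q_star from the first-stage discrepancy minimization:
    # alpha = weighted_krr(X, y, q_star)
    # y_next = alpha @ gaussian_kernel(X, x_next.reshape(1, -1))[:, 0]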
Theory and Algorithms for Forecasting Non-Stationary Time Series page 64
Preliminary Experiments
Artificial data sets:
Theory and Algorithms for Forecasting Non-Stationary Time Series page 65
True vs. Empirical Discrepancies
[Figure panels: Discrepancies (left) and Weights (right).]
Theory and Algorithms for Forecasting Non-Stationary Time Series page 66
Running MSE
Theory and Algorithms for Forecasting Non-Stationary Time Series page 67
Real-world Data
Commodity prices, exchange rates, temperatures &
climate.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 68
Time Series Prediction &
On-line Learning
Two Learning Scenarios
Stochastic scenario:
• distributional assumption.
• performance measure: expected loss.
• guarantees: generalization bounds.
On-line scenario:
• no distributional assumption.
• performance measure: regret.
• guarantees: regret bounds.
• active research area: (Cesa-Bianchi and Lugosi, 2006; Anava et al.
2013, 2015, 2016; Bousquet and Warmuth, 2002; Herbster and Warmuth,
1998, 2001; Koolen et al., 2015).
Theory and Algorithms for Forecasting Non-Stationary Time Series page 70
On-Line Learning Setup
Adversarial setting with hypothesis/action set H.
For t = 1 to T do
• player receives xt 2 X .
• player selects ht 2 H .
• adversary selects yt 2 Y .
• player incurs loss L(ht (xt ), yt ) .
Objective: minimize the (external) regret
$$\mathrm{Reg}_T = \sum_{t=1}^{T} L(h_t(x_t), y_t) - \min_{h \in H^*} \sum_{t=1}^{T} L(h(x_t), y_t).$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 71
Example: Exp. Weights (EW)
Expert set $H^* = \{E_1, \ldots, E_N\}$, $H = \mathrm{conv}(H^*)$.
EW({E_1, ..., E_N})
 1  for i ← 1 to N do
 2      w_{1,i} ← 1
 3  for t ← 1 to T do
 4      Receive(x_t)
 5      h_t ← (Σ_{i=1}^N w_{t,i} E_i) / (Σ_{i=1}^N w_{t,i})
 6      Receive(y_t)
 7      Incur-Loss(L(h_t(x_t), y_t))
 8      for i ← 1 to N do
 9          w_{t+1,i} ← w_{t,i} e^{-η L(E_i(x_t), y_t)}      (parameter η > 0)
10  return h_T
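A runnable Python sketch of the pseudocode above (the experts, inputs, and loss below are placeholders supplied by the caller; the loss is assumed to take values in [0, 1]):

    import numpy as np

    def exponential_weights(experts, xs, ys, loss, eta=None):
        """Exponentially weighted average forecaster: predict with the weighted average of the
        experts and multiply each weight by exp(-eta * that expert's loss)."""
        N, T = len(experts), len(xs)
        if eta is None:
            eta = np.sqrt(8 * np.log(N) / T)   # the tuning used in the regret bound on the next slide
        w = np.ones(N)
        cumulative_loss = 0.0
        for x, y in zip(xs, ys):
            preds = np.array([E(x) for E in experts])
            h_x = np.dot(w, preds) / w.sum()
            cumulative_loss += loss(h_x, y)
            w *= np.exp(-eta * np.array([loss(p, y) for p in preds]))
        return cumulative_loss, w

    # Example with two constant experts and a clipped squared loss:
    # experts = [lambda x: 0.0, lambda x: 1.0]
    # loss = lambda p, y: min((p - y) ** 2, 1.0)
    # total, weights = exponential_weights(experts, xs=list(range(100)), ys=[1.0] * 100, loss=loss)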
Theory and Algorithms for Forecasting Non-Stationary Time Series page 72
EW Guarantee
Theorem: assume that L is convex in its first argument and takes values in [0, 1]. Then, for any $\eta > 0$ and any sequence $y_1, \ldots, y_T \in \mathcal{Y}$, the regret of EW at time T satisfies
$$\mathrm{Reg}_T \le \frac{\log N}{\eta} + \frac{\eta T}{8}.$$
For $\eta = \sqrt{8 \log N / T}$,
$$\mathrm{Reg}_T \le \sqrt{(T/2)\log N}, \qquad \text{i.e.,}\qquad \frac{\mathrm{Reg}_T}{T} = O\bigg(\sqrt{\frac{\log N}{T}}\bigg).$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 73
EW - Proof
Potential: $\Phi_t = \log \sum_{i=1}^{N} w_{t,i}$.
Upper bound:
$$\begin{aligned}
\Phi_t - \Phi_{t-1} &= \log \frac{\sum_{i=1}^{N} w_{t-1,i}\, e^{-\eta L(E_i(x_t), y_t)}}{\sum_{i=1}^{N} w_{t-1,i}}
= \log \mathop{\mathbb{E}}_{w_{t-1}}\big[e^{-\eta L(E_i(x_t), y_t)}\big]\\
&= \log\Big( \mathop{\mathbb{E}}_{w_{t-1}}\Big[\exp\Big(-\eta\big(L(E_i(x_t), y_t) - \mathop{\mathbb{E}}_{w_{t-1}}[L(E_i(x_t), y_t)]\big)\Big)\Big]\Big) - \eta \mathop{\mathbb{E}}_{w_{t-1}}[L(E_i(x_t), y_t)]\\
&\le -\eta \mathop{\mathbb{E}}_{w_{t-1}}[L(E_i(x_t), y_t)] + \frac{\eta^2}{8} && \text{(Hoeffding's inequality)}\\
&\le -\eta\, L\Big(\mathop{\mathbb{E}}_{w_{t-1}}[E_i(x_t)],\, y_t\Big) + \frac{\eta^2}{8} && \text{(convexity of $L$ in its first argument)}\\
&= -\eta\, L(h_t(x_t), y_t) + \frac{\eta^2}{8}.
\end{aligned}$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 74
EW - Proof
Upper bound: summing up these inequalities yields
$$\Phi_T - \Phi_0 \le -\eta \sum_{t=1}^{T} L(h_t(x_t), y_t) + \frac{\eta^2 T}{8}.$$
Lower bound:
$$\Phi_T - \Phi_0 = \log \sum_{i=1}^{N} e^{-\eta \sum_{t=1}^{T} L(E_i(x_t), y_t)} - \log N
\ \ge\ \log \max_{i=1,\ldots,N} e^{-\eta \sum_{t=1}^{T} L(E_i(x_t), y_t)} - \log N
= -\eta \min_{i=1,\ldots,N} \sum_{t=1}^{T} L(E_i(x_t), y_t) - \log N.$$
Comparison:
$$\sum_{t=1}^{T} L(h_t(x_t), y_t) - \min_{i=1,\ldots,N} \sum_{t=1}^{T} L(E_i(x_t), y_t) \le \frac{\log N}{\eta} + \frac{\eta T}{8}.$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 75
Questions
Can we exploit both the batch and the on-line scenarios to
• design flexible algorithms for time series prediction with stochastic guarantees?
• tackle notoriously difficult time series problems, e.g., model selection and learning ensembles?
Theory and Algorithms for Forecasting Non-Stationary Time Series page 76
Model Selection
Problem: given N time series models, how should we use the sample $\mathbf{Z}_1^T$ to select a single best model?
• in i.i.d. case, cross-validation can be shown to be close to
the structural risk minimization solution.
• but, how do we select a validation set for general
stochastic processes?
• use most recent data?
• use the most distant data?
• use various splits?
• models may have been pre-trained on $\mathbf{Z}_1^T$.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 77
Learning Ensembles
Problem: given a hypothesis set H and a sample $\mathbf{Z}_1^T$, find an accurate convex combination $h = \sum_{t=1}^{T} q_t h_t$ with $\mathbf{h} \in H_A$ and $\mathbf{q} \in \Delta$.
• in the most general case, the hypotheses may have been pre-trained on $\mathbf{Z}_1^T$.
This calls for an on-line-to-batch conversion for general non-stationary, non-mixing processes.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 78
On-Line-to-Batch (OTB)
Input: a sequence of hypotheses $\mathbf{h} = (h_1, \ldots, h_T)$ returned after T rounds by an on-line algorithm $\mathcal{A}$ minimizing the general regret
$$\mathrm{Reg}_T = \sum_{t=1}^{T} L(h_t, Z_t) - \inf_{h^* \in H^*} \sum_{t=1}^{T} L(h^*, Z_t).$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 79
On-Line-to-Batch (OTB)
Problem: use $\mathbf{h} = (h_1, \ldots, h_T)$ to derive a hypothesis $h \in H$ with small path-dependent expected loss,
$$\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) = \mathop{\mathbb{E}}_{Z_{T+1}}\big[L(h, Z_{T+1}) \,\big|\, \mathbf{Z}_1^T\big].$$
• the i.i.d. problem is standard: (Littlestone, 1989), (Cesa-Bianchi et al., 2004).
• but, how do we design solutions for the general time-series scenario?
Theory and Algorithms for Forecasting Non-Stationary Time Series page 80
Questions
Is OTB with general (non-stationary, non-mixing) stochastic
processes possible?
Can we design algorithms with theoretical guarantees?
need a new tool for the analysis.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 81
Relevant Quantity
Key difference: the losses $\mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1})$, for $1 \le t \le T$, versus the target $\mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T)$ at time $T + 1$.
Average difference:
$$\frac{1}{T}\sum_{t=1}^{T}\Big[\mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T) - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1})\Big].$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 82
On-line Discrepancy
Definition:
$$\mathrm{disc}(\mathbf{q}) = \sup_{\mathbf{h} \in H_A} \sum_{t=1}^{T} q_t\Big[\mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T) - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1})\Big].$$
• $H_A$: the set of hypothesis sequences that $\mathcal{A}$ can return.
• $\mathbf{q} = (q_1, \ldots, q_T)$: an arbitrary weight vector.
• natural measure of non-stationarity or dependency.
• captures the hypothesis set and the loss function.
• can be efficiently estimated under mild assumptions.
• generalization of the definition of (Kuznetsov and MM, 2015).
Theory and Algorithms for Forecasting Non-Stationary Time Series page 83
Discrepancy Estimation
Batch discrepancy estimation method.
Alternative method:
• assume that the loss is µ-Lipschitz.
• assume that there exists an accurate hypothesis $h^*$:
$$\eta = \inf_{h^*}\ \mathbb{E}\big[L(Z_{T+1}, h^*(X_{T+1})) \,\big|\, \mathbf{Z}_1^T\big] \ll 1.$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 84
Discrepancy Estimation
Lemma: fix a sequence $\mathbf{Z}_1^T$ in $\mathcal{Z}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $\alpha > 0$:
$$\mathrm{disc}(\mathbf{q}) \le \widehat{\mathrm{disc}}_H(\mathbf{q}) + \mu\eta + 2\alpha + M\|\mathbf{q}\|_2\sqrt{2 \log \frac{\mathbb{E}_{\mathbf{z}}[\mathcal{N}_1(\alpha, G, \mathbf{z})]}{\delta}},$$
where
$$\widehat{\mathrm{disc}}_H(\mathbf{q}) = \sup_{h \in H,\, \mathbf{h} \in H_A} \sum_{t=1}^{T} q_t\Big[ L\big(h_t(X_{T+1}), h(X_{T+1})\big) - L\big(h_t, Z_t\big)\Big].$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 85
Proof Sketch
$$\begin{aligned}
\mathrm{disc}(\mathbf{q}) &= \sup_{\mathbf{h} \in H_A} \sum_{t=1}^{T} q_t\Big[\mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T) - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1})\Big]\\
&\le \sup_{\mathbf{h} \in H_A} \sum_{t=1}^{T} q_t\Big[\mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T) - \mathbb{E}\big[L(h_t(X_{T+1}), h^*(X_{T+1}))\,\big|\,\mathbf{Z}_1^T\big]\Big]\\
&\quad + \sup_{\mathbf{h} \in H_A} \sum_{t=1}^{T} q_t\Big[\mathbb{E}\big[L(h_t(X_{T+1}), h^*(X_{T+1}))\,\big|\,\mathbf{Z}_1^T\big] - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1})\Big]\\
&\le \mu \sup_{\mathbf{h} \in H_A} \sum_{t=1}^{T} q_t\, \mathbb{E}\big[L(h^*(X_{T+1}), Y_{T+1})\,\big|\,\mathbf{Z}_1^T\big]\\
&\quad + \sup_{\mathbf{h} \in H_A} \sum_{t=1}^{T} q_t\Big[\mathbb{E}\big[L(h_t(X_{T+1}), h^*(X_{T+1}))\,\big|\,\mathbf{Z}_1^T\big] - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1})\Big]\\
&= \mu\, \mathbb{E}\big[L(h^*(X_{T+1}), Y_{T+1})\,\big|\,\mathbf{Z}_1^T\big]\\
&\quad + \sup_{\mathbf{h} \in H_A} \sum_{t=1}^{T} q_t\Big[\mathbb{E}\big[L(h_t(X_{T+1}), h^*(X_{T+1}))\,\big|\,\mathbf{Z}_1^T\big] - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1})\Big].
\end{aligned}$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 86
Learning Guarantee
Lemma: let L be a convex loss bounded by M and $\mathbf{h}_1^T$ a hypothesis sequence adapted to $\mathbf{Z}_1^T$. Fix $\mathbf{q} \in \Delta$. Then, for any $\delta > 0$, the following holds with probability at least $1 - \delta$ for the hypothesis $h = \sum_{t=1}^{T} q_t h_t$:
$$\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) \le \sum_{t=1}^{T} q_t\, L(h_t, Z_t) + \mathrm{disc}(\mathbf{q}) + M\|\mathbf{q}\|_2\sqrt{2\log\frac{1}{\delta}}.$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 87
Proof
By definition of the on-line discrepancy,
$$\sum_{t=1}^{T} q_t\Big[\mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T) - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1})\Big] \le \mathrm{disc}(\mathbf{q}).$$
$A_t = q_t\big[\mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) - L(h_t, Z_t)\big]$ is a martingale difference, thus by Azuma's inequality, with high probability,
$$\sum_{t=1}^{T} q_t\, \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) \le \sum_{t=1}^{T} q_t\, L(h_t, Z_t) + \|\mathbf{q}\|_2\sqrt{2\log\frac{1}{\delta}}.$$
By convexity of the loss,
$$\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) \le \sum_{t=1}^{T} q_t\, \mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T).$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 88
Learning Guarantee
Theorem: let L be a convex loss bounded by M and let $H_A$ be a set of hypothesis sequences adapted to $\mathbf{Z}_1^T$. Fix $\mathbf{q} \in \Delta$. Then, for any $\delta > 0$, the following holds with probability at least $1 - \delta$ for the hypothesis $h = \sum_{t=1}^{T} q_t h_t$:
$$\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) \le \inf_{h^* \in H^*} \mathcal{L}_{T+1}(h^*, \mathbf{Z}_1^T) + 2\,\mathrm{disc}(\mathbf{q}) + \frac{\mathrm{Reg}_T}{T} + M\|\mathbf{q} - \mathbf{u}\|_1 + 2M\|\mathbf{q}\|_2\sqrt{2\log\frac{2}{\delta}}.$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 89
Conclusion
Time series forecasting:
• key learning problem in many important tasks.
• very challenging: theory, algorithms, applications.
• new and general data-dependent learning guarantees
for non-mixing non-stationary processes.
• algorithms with guarantees.
Time series prediction and on-line learning:
• proof for flexible solutions derived via OTB.
• application to model selection.
• application to learning ensembles.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 90
Time Series Workshop
Join us in Room 117
Friday, December 9th.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 91
References
T. M. Adams and A. B. Nobel. Uniform convergence of
Vapnik-Chervonenkis classes under ergodic sampling. The
Annals of Probability, 38(4):1345–1367, 2010.
A. Agarwal, J. Duchi. The generalization ability of online
algorithms for dependent data. Information Theory, IEEE
Transactions on, 59(1):573–587, 2013.
P. Alquier, X. Li, O. Wintenberger. Prediction of time series
by statistical learning: general losses and fast rates.
Dependence Modelling, 1:65–93, 2014.
Oren Anava, Elad Hazan, Shie Mannor, and Ohad Shamir.
Online learning for time series prediction. COLT, 2013.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 92
References
O. Anava, E. Hazan, and A. Zeevi. Online time series
prediction with missing data. ICML, 2015.
O. Anava and S. Mannor. Online Learning for
heteroscedastic sequences. ICML, 2016.
P. L. Bartlett. Learning with a slowly changing distribution.
COLT, 1992.
R. D. Barve and P. M. Long. On the complexity of learning
from drifting distributions. Information and Computation,
138(2):101–123, 1997.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 93
References
P. Berti and P. Rigo. A Glivenko-Cantelli theorem for
exchangeable random variables. Statistics & Probability
Letters, 32(4):385 – 391, 1997.
S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis
of representations for domain adaptation. In NIPS. 2007.
T. Bollerslev. Generalized autoregressive conditional
heteroskedasticity. J Econometrics, 1986.
G. E. P. Box, G. Jenkins. (1990) . Time Series Analysis,
Forecasting and Control.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 94
References
O. Bousquet and M. K. Warmuth. Tracking a small set of
experts by mixing past posteriors. COLT, 2001.
N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the
generalization ability of on-line learning algorithms. IEEE
Trans. on Inf. Theory , 50(9), 2004.
N. Cesa-Bianchi and C. Gentile. Tracking the best
hyperplane with a simple budget perceptron. COLT, 2006.
C. Cortes and M. Mohri. Domain adaptation and sample
bias correction theory and algorithm for regression.
Theoretical Computer Science, 519, 2014.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 95
References
V. H. De la Pena and E. Gine. (1999) Decoupling: from
dependence to independence: randomly stopped processes, U-
statistics and processes, martingales and beyond. Probability
and its applications. Springer, NY.
P. Doukhan. (1994) Mixing: properties and examples. Lecture
notes in statistics. Springer-Verlag, New York.
R. Engle. Autoregressive conditional heteroscedasticity with
estimates of the variance of United Kingdom inflation.
Econometrica, 50(4):987–1007, 1982.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 96
References
D. P. Helmbold and P. M. Long. Tracking drifting concepts by
minimizing disagreements. Machine Learning, 14(1): 27-46,
1994.
M. Herbster and M. K. Warmuth. Tracking the best expert.
Machine Learning, 32(2), 1998.
M. Herbster and M. K. Warmuth. Tracking the best linear
predictor. JMLR, 2001.
D. Hsu, A. Kontorovich, and C. Szepesvári. Mixing time
estimation in reversible Markov chains from a single
sample path. NIPS, 2015.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 97
References
W. M. Koolen, A. Malek, P. L. Bartlett, and Y. Abbasi.
Minimax time series prediction. NIPS, 2015.
V. Kuznetsov, M. Mohri. Generalization bounds for time
series prediction with non-stationary processes. In ALT,
2014.
V. Kuznetsov, M. Mohri. Learning theory and algorithms for
forecasting non-stationary time series. In NIPS, 2015.
V. Kuznetsov, M. Mohri. Time series prediction and on-line
learning. In COLT, 2016.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 98
References
A. C. Lozano, S. R. Kulkarni, and R. E. Schapire. Convergence
and consistency of regularized boosting algorithms with
stationary β-mixing observations. In NIPS, pages 819–826,
2006.
Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain
adaptation: Learning bounds and algorithms. In COLT.
2009.
R. Meir. Nonparametric time series prediction through
adaptive model selection. Machine Learning, pages 5–34,
2000.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 99
References
D. Modha, E. Masry. Memory-universal prediction of
stationary random processes. Information Theory, IEEE
Transactions on, 44(1):117–133, Jan 1998.
M. Mohri, A. Munoz Medina. New analysis and algorithm
for learning with drifting distributions. In ALT, 2012.
M. Mohri, A. Rostamizadeh. Rademacher complexity
bounds for non-i.i.d. processes. In NIPS, 2009.
M. Mohri, A. Rostamizadeh. Stability bounds for stationary
φ-mixing and β-mixing processes. Journal of Machine
Learning Research, 11:789–814, 2010.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 100
References
V. Pestov. Predictive PAC learnability: A paradigm for
learning from exchangeable input data. GRC, 2010.
A. Rakhlin, K. Sridharan, A. Tewari. Online learning: Random
averages, combinatorial parameters, and learnability. In
NIPS, 2010.
L. Ralaivola, M. Szafranski, G. Stempfel. Chromatic PAC-
Bayes bounds for non-iid data: Applications to ranking and
stationary beta-mixing processes. JMLR 11:1927–1956,
2010.
C. Shalizi and A. Kontorovitch. Predictive PAC-learning and
process decompositions. NIPS, 2013.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 101
References
A. Rakhlin, K. Sridharan, A. Tewari. Online learning: Random
averages, combinatorial parameters, and learnability. NIPS,
2010.
I. Steinwart, A. Christmann. Fast learning from non-i.i.d.
observations. NIPS, 2009.
M. Vidyasagar. (1997). A Theory of Learning and
Generalization: With Applications to Neural Networks and
Control Systems. Springer-Verlag New York, Inc.
V. Vovk. Competing with stationary prediction strategies.
COLT 2007.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 102
References
B. Yu. Rates of convergence for empirical processes of
stationary mixing sequences. The Annals of Probability,
22(1):94–116, 1994.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 103
β-Mixing
Definition: a sequence of random variables $\mathbf{Z} = \{Z_t\}_{t=-\infty}^{+\infty}$ is β-mixing if
$$\beta(k) = \sup_{n}\ \mathop{\mathbb{E}}_{B \in \sigma_{-\infty}^{n}}\Big[\sup_{A \in \sigma_{n+k}^{+\infty}} \big|\mathbb{P}[A \mid B] - \mathbb{P}[A]\big|\Big] \longrightarrow 0.$$
The dependence between events decays with the gap k.
[Diagram: an event B measurable up to time n and an event A measurable after time n + k.]
Theory and Algorithms for Forecasting Non-Stationary Time Series page 104