[go: up one dir, main page]

0% found this document useful (0 votes)
30 views16 pages

Dynamic Online Learning Via Frank-Wolfe Algorithm

This document discusses the application of the Frank-Wolfe (FW) algorithm in dynamic online learning and online convex optimization (OCO), focusing on its ability to handle non-stationary settings and mitigate overfitting through structured sparsity. The authors establish that the FW method achieves O(T 1/2) dynamic regret, even with noisy gradient estimates, and propose a 'Meta-Frank Wolfe' approach to enhance performance. Experimental results demonstrate that FW outperforms alternative methods in tasks such as matrix completion and background separation in video.

Uploaded by

Zineb GARROUSSI
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
30 views16 pages

Dynamic Online Learning Via Frank-Wolfe Algorithm

This document discusses the application of the Frank-Wolfe (FW) algorithm in dynamic online learning and online convex optimization (OCO), focusing on its ability to handle non-stationary settings and mitigate overfitting through structured sparsity. The authors establish that the FW method achieves O(T 1/2) dynamic regret, even with noisy gradient estimates, and propose a 'Meta-Frank Wolfe' approach to enhance performance. Experimental results demonstrate that FW outperforms alternative methods in tasks such as matrix completion and background separation in video.

Uploaded by

Zineb GARROUSSI
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 16

932 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL.

69, 2021

Dynamic Online Learning via Frank-Wolfe


Algorithm
Deepak S. Kalhan, Amrit Singh Bedi , Alec Koppel , Ketan Rajawat , Hamed Hassani ,
Abhishek K. Gupta , and Adrish Banerjee

Abstract—Online convex optimization (OCO) encapsulates su- orthodoxies in the design of optimization algorithms: finite time
pervised learning when training sets are large-scale or dynamic, performance is prioritized, updates must be memory-efficient
and has grown essential as data has proliferated. OCO decomposes despite the scale of training sets, and drift in data distributions
learning into a sequence of sub-problems, each of which must be must be mitigated. Recently, online optimization has gained
solved with limited information. To ensure safe model adaption
or to avoid overfitting, constraints are often imposed, which are
popularity as a way to meet these specifications in disparate
often addressed with projections or Lagrangian relaxation. To contexts such as nonparametric regression [5], portfolio man-
avoid this complexity incursion, we propose to study Frank-Wolfe agement [6], control in robotics [7]. The framework of online
(FW), which operates by updating in collinear directions with the optimization decomposes a complex problem into a sequence
gradient but guaranteed to be feasible. We specifically focus on its of sub-problems, which inherently arises when one operates on
use in non-stationary settings, motivated by the fact that its iterates subsets of data per step due to the sheer scale of full training
have structured sparsity that may be employed as a distribution- sets. Alternatively, in many problems, the cost is an expectation
free change-point detector. We establish performance in terms of of a collection of loss functions parameterized by data only
dynamic regret, which quantifies cost accumulation as compared
with the optimal at each individual time slot. Specifically, for convex
accessible via samples [8].
losses, we establish O(T 1/2 ) dynamic regret up to metrics of Static and Dynamic Regret: To be specific, in online convex
non-stationarity. We relax the algorithm’s required information optimization (OCO), at each time t, a learner selects an action xt
to only noisy gradient estimates, i.e., partial feedback. We also after which an arbitrary convex cost Ft is revealed. The standard
consider a mini-batching ‘Meta-Frank Wolfe’, and characterize performance metric for this setting is to compare the action
its dynamic regret. Experiments on matrix completion problem sequence {xt }Tt=1 up to some time-horizon T with a single best
and background separation in video demonstrate favorable perfor- action in hindsight [9], defined as the regret
mance of the proposed scheme. Moreover, the structured sparsity
of FW is experimentally observed to yield the sharpest tracker T
 T

of change points among alternative approaches to non-stationary RegST = Ft (xt ) − min Ft (x). (1)
online convex optimization. x∈X
t=1 t=1
Index Terms—Online learning, Frank-Wolfe algorithm, convex
optimization, gradient descent. However, whenever training data define trajectories, as is the
case in increasingly salient learning problems in dynamical
systems or reinforcement learning [10], then hypothesizing that
I. INTRODUCTION samples come from a stationary distribution is invalid. While
ANY learning problems may be formulated as complex the use of buffers experimentally sidestep this issue [11], rigor-
M data-dependent optimization problems, as in the design
of methods for speech recognition [2], perception [3], and in-
ously addressing it requires treating learning as non-stationary
stochastic optimization [12]. This setting then implies that the
creasingly, locomotion [4]. These technologies upend several optimizer continuously varies with respect to time compared to
a single best action in static settings.
Manuscript received September 27, 2019; revised August 1, 2020, October
In general, this perspective requires tuning algorithms to mix-
25, 2020, and December 14, 2020; accepted January 5, 2021. Date of publi- ing rates of the data distribution [19], which substantially impact
cation January 15, 2021; date of current version February 5, 2021. This work performance but mixing rates are typically unknown. More
was supported by the Science and Engineering Research Board (India) under specifically, in time-series and stochastic processes literature,
Grant SRG/2019/001459. The associate editor coordinating the review of this
manuscript and approving it for publication was Dr. Soummya Kar. A part of
in order to obtain convergence in mean, one typically imposes
this work was presented in ICASSP, Barcelona, Spain, 2020 [1]. (Corresponding that observations are generated from a distribution whose mix-
author: Ketan Rajawat.) ing rates (aka convergence rate to a stationary distribution) is
Deepak S. Kalhan, Ketan Rajawat, Abhishek K. Gupta, and Adrish Banerjee exponentially fast [19].
are with the Department of Electrical Engineering, IIT Kanpur, Kanpur 208016,
India (e-mail: dskalhan@iitk.ac.in; ketan@iitk.ac.in; gkrabhi@iitk.ac.in; adr-
By contrast, in the online learning (distribution-free) setting,
ish@iitk.ac.in). dynamic regret is the canonical performance metric, which quan-
Amrit Singh Bedi and Alec Koppel are with the Computational and Informa- tifies the difference between the instantaneous cost accumulation
tion Sciences Directorate, US Army Research Laboratory, Adelphi, MD 20783 and the cost of the best action at each time [17]:
USA (e-mail: amrit0714@gmail.com; akoppel@seas.upenn.edu).
Hamed Hassani is with the Department of Electrical and Systems Engi- T T
neering, University of Pennsylvania, Philadelphia, PA 19104 USA (e-mail:
 
hassani@seas.upenn.edu). RegD
T = Ft (xt ) − Ft (xt ). (2)
Digital Object Identifier 10.1109/TSP.2021.3051871 t=1 t=1

1053-587X © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: ECOLE POLYTECHNIQUE DE MONTREAL. Downloaded on June 12,2023 at 18:31:02 UTC from IEEE Xplore. Restrictions apply.
KALHAN et al.: DYNAMIC ONLINE LEARNING VIA FRANK-WOLFE ALGORITHM 933

where xt := minx∈X Ft (x). OCO concerns the design of meth- FW [22]. See Table I for a comparison of recent dynamic regret
ods such that RegD T grows sublinearly in horizon T for a given
bounds for non-stationary OCO.
sequence of loss function Ft , i.e., the average regret goes to Moreover, in practice, exact online gradients may be unavail-
null with T (no-regret [9]). Unfortunately, exactly tracking the able due to dependence on unknown distributions or latency
optimizer defined by an arbitrarily varying optimization problem required for sampling. Thus there is a great need for optimiza-
is impossible [12]. Moreover, the time-variation in the optimizer, tion tools that are robust to non-stationarity and noisy gradient
the gradient DT , the instantaneous loss VT , optimal variation CT estimates. When only estimates of the online gradient are avail-
and ST appear as fundamental metrics of non-stationarity in the able, i.e., partial feedback, augmentations of Frank-Wolfe that
literature defined as involve sampling multiple gradient estimates per time slot exist
T
[24]. In this work, we take inspiration from [24] to propose
 such methods that may operate effectively in the presence of
VT := sup | Ft (x) − Ft−1 (x) | ,
non-stationarity.
t=1 x∈X
An additional challenge of non-stationary settings is that
T
 T

      distributional drift may occur, yet existing approaches to OCO
CT := xt − xt−1  , ST := xt − xt−1 2 are oblivious to how to detect it and adjust problem parameters.
t=1 t=1 This obliviousness belies the fact that there is extensive literature
T
 on change point detection in information theory built upon se-
DT := ∇Ft (xt ) − ∇Ft−1 (xt−1 )2 . (3) quential Bayesian inference [31]. Thus, an open question is how
t=1 to translate the successes of distribution-focused approaches into
the distribution-free OCO setting. We contend that structural
In [12], [13], it is also shown that obtaining a no-regret algorithm aspects of FW iterates provide a possible answer.
is impossible without stipulations on these metrics, i.e., that their Contributions: In this work, we put forth a collection of FW
growth is required to be sublinear in time for sublinear regret to methods that obviate the need for projection and are robust to
be obtainable. This fact may be interpreted as the distribution- gradient estimation error, leveraging recently developed averag-
free analogue of the aforementioned exponentially fast mixing ing techniques [32], and characterize their performance amidst
conditions. Moreover, such conditions are satisfiable whenever non-stationarity. In particular:
the optimal trajectory corresponds to an object (in its appropriate r We generalize Frank-Wolfe method to non-stationary prob-
domain) moving with bounded acceleration. lems (Section II) and establish O(T 1/2 ) dynamic regret
Our goal in this work is the design of algorithms such that when losses are convex (Section III).
dynamic regret grows sublinearly in T up to multiplicative r We generalize the algorithm to the setting where we only
factors of the variable and gradient variations defined in (3), have access to the noisy estimates of online gradients (par-
i.e., RegD T = o(T (VT + DT )). Before continuing, we contex- tial feedback, Section II-A), and establish that its dynamic
tualize our approach with related works. regret growth is also sublinear (Section III-A).
Context: Central to online optimization is online gradient r To close the gap between partial feedback and the afore-
descent [9], whose static regret is O(T 1/2 ). Improvements are mentioned O(T 1/2 ) rate, we propose to use Meta-Frank
possible for strongly convex losses, a detailed review presented Wolfe, which allows for multiple samples per action up-
in [26]. Often, operating under constraints is required to, e.g., date. We establish that its dynamic regret can match rates
safely adapt models or avoid overfitting. Constraint satisfaction where one has exact online gradient information.
at each time slot poses challenges: methods based on Lagrangian r Moreover, we exploit the structured sparsity of the iterates
relaxation such as ADMM [27] or saddle point [28] cannot identified in [33] to employ FW as a conceptually-justified
ensure the feasibility of individual actions. In contrast, pro- distribution-free change-point detector in non-stationary
jections do so but require a quadratic problem to be solved OCO.
at each step [29]. Frank-Wolfe (FW, or conditional gradient) r Experimentally, we observe that FW attains favorable per-
method moves in a feasible direction that is collinear with the formance relative to alternatives [26] on non-stationary
gradient through the solution of a linear program (LP) [30], and matrix completion and background extraction in video
has gained attention recently as a way to avoid projections in (Section IV). In particular, Frank-Wolfe yields a signifi-
online constrained settings [22]. We build upon these successes cant reduction in the computational time while attaining
to characterize the behavior of the FW method in non-stationary comparable performance to existing approaches. Further-
settings. more, the experimental utility of its sparsity dimension as
Non-stationary learning problems cannot be solved exactly a detector of distributional shift is validated as compared
[12]: as previously mentioned, dynamic regret is an irreducible to alternatives for OCO.
of the problem dynamics such as (3) or the variable variation CT The paper is organized as follows. We describe the Frank-
(how much the optimizer changes across time). Thus, several Wolfe algorithm for the dynamic settings in SectionII. All
works characterize sublinear growth of dynamic regret up to the theoretical results are detailed in SectionIII. The proposed
factors depending on VT and CT , i.e., O(T 1/2 (1 + CT )) for algorithm is applied to the practical problems of interest and
OGD or mirror descent with convex losses [9], expressions the results are presented in Section IV. In the end, SectionV
that depend on multiple metrics of non-stationarity [12], [17], concludes the paper.
and improved rates O(1 + CT ) in strongly convex cases [13]. Recent Advances in Frank-Wolfe Online convex optimiza-
These works, however, all execute projections, which owing to tion (OCO) encapsulates supervised learning when training sets
the complexity requirements, may prohibit them from yielding are large-scale or dynamic, and has grown essential as data
solutions in a timely fashion when data drifts, in contrast to has proliferated. OCO decomposes learning into a sequence of

Authorized licensed use limited to: ECOLE POLYTECHNIQUE DE MONTREAL. Downloaded on June 12,2023 at 18:31:02 UTC from IEEE Xplore. Restrictions apply.
934 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 69, 2021

TABLE I
SUMMARY OF THE RELATED WORKS COMPARED TO THE PRESENT WORK. HERE, GO DENOTES THE GRADIENT ORACLE COMPLEXITY OR GRADIENT ORACLE
CALLS PER ITERATION t. HERE a IS A POSITIVE SCALAR 0 < a ≤ 0.5

TABLE II
SUMMARY OF RECENT RELATED RESULTS REGARDING THE CONVERGENCE OF FRANK-WOLFE. HERE, LMO REPRESENTS THE LINEAR MINIMIZATION ORACLE
CALLS PER ITERATION t AND GO DENOTES THE GRADIENT ORACLE COMPLEXITY OR GRADIENT ORACLE CALLS PER ITERATION t.
THE RANGE OF a IS 0 < a ≤ 0.5

sub-problems, each of which must be solved with limited infor- FW is of interest when either computing projection (or more
mation. To ensure safe model adaption or to avoid overfitting, broadly proximal) operators is costly [22] or the sparsity of
constraints are often imposed, which are often addressed with iterates is of interest [37]. This class of algorithms has received
projections or Lagrangian relaxation. To avoid this complex- renewed interest recently due to the fact that its iterates them-
ity incursion, we propose to study Frank-Wolfe (FW), which selves are structured, i.e., its sparse [38], [39]. Specifically, new
operates by updating in collinear directions with the gradient step-size rules and interpretations of dual certificates (“Frank-
but guaranteed to be feasible. We specifically focus on its Wolfe gap”) have been studied [33], as well as efforts to reduce
use in non-stationary settings, {motivated by the fact that its variance [40], the number of LMO calls [41], the amount of
iterates have structured sparsity that may be employed as a information such oracles require [20]. Additional recent works
distribution-free change-point detector}. We establish perfor- broadened the class of problems to which its applicable to
mance in terms of dynamic regret, which quantifies cost ac- incorporate composite structure [21] or constraints [42], [43].
cumulation as compared with the optimal at each individual These innovations, while noteworthy, are uniformly restricted
time slot. Specifically, for convex losses, we establish O(T 1/2 ) to setting where the objective is associated with data generated
dynamic regret up to metrics of non-stationarity. We relax the from a stationary distribution, or otherwise a static comparator.
algorithm’s required information to only noisy gradient esti- Thus the key departing point of this work is its emphasis on
mates, i.e., partial feedback. We also consider a mini-batching projection-free algorithms for dynamic problems, and specifi-
‘Meta-Frank Wolfe’, and characterize its dynamic regret. Exper- cally exploiting the structured sparsity of FW iterates [33]. In
iments on matrix completion problem and background separa- particular, [33] consider the following optimization problem
tion in video demonstrate favorable performance of the proposed min F (X) s.t. X∗ ≤ k. (4)
scheme. {Moreover, the structured sparsity of FW is experi- X
mentally observed to yield the sharpest tracker of change points where k is the sparsity control parameter. Note that the problem
among alternative approaches to non-stationary online convex in (4) becomes
 the low rank matrix completion for the objective
optimization. F (X) := 12 (i,j)∈Ω ([X]ij − [M]ij )2  where M denotes the
Here we expand upon recent innovations in FW-type al- observed matrix. It is stated in [33] that when Frank-Wolfe
gorithms for stationary and non-stationary optimization. As algorithm is used to solve the problem in (4), it holds that
previously mentioned, these methods’ update collinearly with
the gradient as the solution of an LP [22], [30]. Typically, rank(Xt ) ≤ t + 1 (5)
one may establish comparable convergence to projected gradi- for each iteration t. Since the original goal was to recover a
ent methods in terms of iteration complexity for convex [23], low-rank matrix, the result in (5) shows that the FW updates
[34] and non-convex settings [35], [36]. Moreover, one also result in structurally favorable iterates, particularly for small t.
quantifies performance in terms of the number of calls to a The property also motivates us to consider the FW algorithm
linear minimization oracle (LMO), i.e., an LP solver, as this in dynamic settings where the objective function undergoes
quantity distinguishes FW from methods based upon gradient sporadic changes. In the context of dynamic low-rank matrix
projections. completion, we therefore expect the rank of the FW iterates to
Authorized licensed use limited to: ECOLE POLYTECHNIQUE DE MONTREAL. Downloaded on June 12,2023 at 18:31:02 UTC from IEEE Xplore. Restrictions apply.
KALHAN et al.: DYNAMIC ONLINE LEARNING VIA FRANK-WOLFE ALGORITHM 935

mimic the rank of the time-varying low-rank matrix. Since the


Algorithm 1: Online Frank-Wolfe Algorithm (OFW).
number of iterations one needs to obtain small dynamic regret
increases whenever there is distributional drift, this provides a 1: Require step sizes 0 < ρ < 1 and 0 < γ < 1.
tight characterization of change points in the observation stream 2: Initialize t = 0, d0 = 0 and choose x0 ∈ X .
for sparsity-constrained setting. Thus, we can explicitly connect 3: for t = 1, 2,.......do.
changes in dynamic regret to changes in the sparsity dimension. 4: Update gradient estimate dt = (1 − ρ)dt−1
We define in greater detail aspects of detecting change points in + ρ∇Ft (xt )
the subsequent section. Experimentally, moreover, we illuminate 5: Compute vt = arg minv∈X dt , v
the merits of the structured sparsity dimension of the FW iterates 6: Update xt+1 = (1 − γ)xt + γvt
as a change point detector, which is entirely new. 7: end for

II. FRANK-WOLFE METHOD


form unknown model disturbance g(xt , zt ) which is environ-
We begin by deriving a standard Frank-Wolfe (conditional mentally dependent and non-stationary. In (9), zt is interpreted
gradient) algorithm adapted to the setting of online optimization. as a sensor feed at time t, and which may be used to learn
For time t, assuming that action xt has been chosen and the a model of uncertainty g(xt , zt ), i.e., training pairs take the
instantaneous cost Ft is revealed, we may evaluate the online form {zt , g(xt , zt )}. The true model disturbance Ez [g(x, z)]
gradient as ∇Ft (xt ). Based upon this information, we define a is unknown, and hence only partial feedback is available.
directional vector dt by the recursion: Under partial feedback, we require use of stochastic on-
dt = (1 − ρ)dt−1 + ρ∇Ft (xt ) (6) line gradients ∇ft (xt , zt ) rather than exact online gradients
with initial vector d0 = 0, and ρ ∈ (0, 1] is a constant momentum ∇Ft (xt ) in step 4 of Algorithm 1. The significance of parameter
parameter. The smoothing step (6) permits us to gracefully apply ρ ∈ (0, 1) will become clear in the context of establishing regret
the algorithm to the more challenging setting of partial feedback bounds of Frank-Wolfe, which is discussed in later sections.
discussed later [20]. Then, we seek a direction vt that is parallel Batching and Meta-Frank Wolfe: While noisy unbiased
to dt inside feasible set X , the source of the name conditional samples of the online gradient may be the only available, often, it
gradient. This is accomplished by solving the following linear is practical to take multiple samples of the gradient per algorithm
program (LP) iterate. This means that we could fix a batch size K and then per-
form the updated (8) processed on samples {∇ft (xt , zkt )}K k=1 ,
vt = arg mindt , v. (7) and output only the last action performed at K + 1. This is
v∈X
Subsequently, we define a query to the LP solver (7) as a linear summarized in step 7 of Algorithm 2. This improves the quality
minimization oracle E (t) which outputs vt at time t, as is often of the gradient (by reducing the variance) and, as a result, attains
considered when studying the complexity of FW algorithms improved dynamic regret, as detailed in the next section. We call
[30], [44]. Then, the action xt+1 for subsequent time t + 1 is the variant of Frank-Wolfe that incorporates two time-scale pro-
given by cedure Meta-Frank Wolfe, which is summarized in Algorithm
2. The steps in Algorithm 2 are similar to those proposed in [24]
xt+1 = (1 − γ)xt + γvt , (8) but here specified for the non-stationary setting. Next, we shift
where γ < 1 is a time-invariant step-size for the given horizon focus to establishing convergence of Algorithms 1 and 2. Before
T . In the following subsection, we discuss a generalization to doing so, a remark regarding the role of online FW as it pertains
partial feedback. The method is summarized as Algorithm 1. in change detection is due, as it is a significant motivation of the
applications of this work.
A. Partial Feedback Remark 1 (Role of Sparsity in Change Detection): In change
detection [49], we have a stochastic process {yt }, and we
To implement the Algorithm 1, the exact gradient ∇Ft (xt )
seek to specify whether its underlying generating distribution
must be computed at each iteration t. In practice, this compu-
has changed at some time n ≤ ∞. One may define a random
tation may be unavailable or prohibitively costly to obtain. For
variable Γ on the nonnegative integers as the change point,
instance, in expected risk minimization [45], ∇Ft (xt ) denotes
and P{Γ = n} := πn . The relationship to online learning is in
the full batch gradient, which, if the number N of samples
reinterpreting the sequence of convex costs Ft as a stochastic
{zn }Nn=1 in the training set is large, is costly to evaluate [46].
process and discerning as to whether they are drawn from a
Alternatively, one may simply receive only noisy samples of
latent distribution that changes. The challenge in our setting
the gradient, but not its true value, as is the case with received
is that the majority of approaches require Ft to be i.i.d., hy-
signal strength-based localization [47] or learned models of mis-
pothesize a geometric distribution πn on the change point Γ,
matched kinematics in optimal control [48]. For such situations,
and asymptotically minimize the probability of false alarm by
only a noisy estimate ∇ft (xt , zt ) of the online gradient ∇Ft (xt )
aggregating information for all time – see [50], [51].
is available such that ∇Ft (x) = E[∇ft (x, zt )]. Here zt denotes
In the distribution-free sequential nature of online learning,
a realization of random variable z that parameterizes the noisy
however, such hypotheses and computations are either inappro-
online gradient.
priate or inapplicable. Therefore, we propose employing the rank
Example 1: For instance, consider the following nonlinear
or sparsity dimension of the iterates [cf. (5)] as a distribution-free
dynamical system [48]
proxy. This alleviates the requisite for a likelihood model to com-
xt+1 = f (xt ) + g(xt , zt ) (9) pute, for instance, recursive series in terms of the instantaneous
In practice, one observes state estimates {x̂t } via on-board log-likelihoods in order to compute crossing times [52]. We
sensing. These are subtracted from known kinematics f (xt ) to investigate the empirical utility of this approach in Section IV.

Authorized licensed use limited to: ECOLE POLYTECHNIQUE DE MONTREAL. Downloaded on June 12,2023 at 18:31:02 UTC from IEEE Xplore. Restrictions apply.
936 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 69, 2021

III. DYNAMIC REGRET ANALYSIS


Algorithm 2: Meta-Frank Wolfe Algorithm.
In this section, we characterize the performance of Algorithm 1: INPUT: convex set X , time horizon T , linear
1 and 2 in the presence of non-stationarity as quantified by optimization oracles E (1) ........E (K) , step sizes
dynamic regret. First, we state some required technical assump- ρ ∈ (0, 1) and γ ∈ (0, 1), and initial point x1 .
tions. 2: OUTPUT: {xt : 1 ≤ t ≤ T }
Assumption 1: The set X is convex and compact with diam- 3: Initialize online linear optimization oracles
eter D, i.e., for all x, y ∈ X , it holds that x − y ≤ D. E (1) ........E (K) .
Assumption 2: The gradient of loss ∇Ft (·) is Lipschitz with 4: Initialize d0t = 0 and x1t = xt−1
parameter L1 , which implies that 5: for t = 1,2,.......T do.
∇Ft (x) − ∇Ft (y) ≤ L1 x − y for all t and (x, y) ∈ X . 6: vtk ← output of oracle E k in round t − 1.
(k+1)
(10) 7: xt ← (1 − γ)xkt + γvtk for k = 1, · · · , K
(K+1)
These assumptions 1–2 are standard in online learning [13], 8: Select xt = xt , then obtain ft (·, zkt ) and
[20]. Assumption 1 ensures constrained set X is compact. As- online stochastic grad. ∇ft (·, zkt ) for each k
sumption 2 bounds the loss function gradient. Next, we present (k−1)
9: dkt ← (1 − ρ)dt + ρ∇ft (xkt , zkt ) for
the dynamic regret analysis of Algorithm 1. To do so, we
k = 1, · · · , K
establish some technical lemmas that characterize descent-like
10: Feedback vtk , dkt  to E k for k = 1, · · · , K
properties.
11: end for
Lemma 1: Under Assumptions 1–2, Algorithm 1 satisfies
the following descent relations: when loss functions Ft are  
(12), and utilizing ( Tt=1 Ft−1 (xt−1 ) − Tt=1 Ft (xt )) =
convex,
(F0 (x0 ) − FT (xT )), we get
Ft (xt )−Ft (xt ) ≤ Ft,t−1
sup
+(1−γ)(Ft−1 (xt−1)−Ft−1 (xt−1)) T −1

3L1 2 γ [Ft (xt ) − Ft (xt )] + [FT (xT ) − FT (xT )]
+ Ft−1 (xt−1 ) − Ft (xt ) +γ 2
D t=1
2
T

+ γD ∇Ft (xt )−∇Ft−1 (xt−1) .
sup ≤ sup
Ft,t−1 +(1−γ) [F0 (x0 )−F0 (x0 )]+F0 (x0 )−FT (xT )
where we define := supx∈X |Ft (x) − Ft−1 (x)| as the
Ft,t−1 t=1
instantaneous maximum cost variation.
T

Our first result, Lemma 1, shows that the sub-optimality 3L1 D2
Ft (xt ) − Ft (xt ) depends on (1 − γ) times the sub-optimality + γD ∇Ft (xt )−∇Ft−1 (xt−1 )+γ 2 T . (13)
t=1
2
at previous instant and maximum variation between the two
Since γ < 1, the left hand side of (13) is lower bounded by
consecutive functions, since the remaining terms can be very 
small with proper selection of step-size γ. The detailed proof is γ Tt=1 [Ft (xt ) − Ft (xt )] and (1 − γ) < 1. We can write (13)
provided in Appendix A. as
T
Theorem 1: Under the Assumptions 1–2, for the iterates 3L1 D2
generated by Algorithm 1, which performs one linear minimiza- γRegDT ≤ sup
Ft,t−1 + [F0 (x0 ) − FT (xT )] + γ 2 T
t=1
2
tion optimization step per iteration under step-size selections
γ = √1T and ρ = 1, we have the following dynamic regret T

bound: √  + γD ∇Ft (xt ) − ∇Ft−1 (xt−1 ) . (14)
 
RegD T ≤O T 1 + V T + DT . (11) t=1
Divide both sides of (14) by γ and substitute the following
Proof: Taking summation on both sides from t = 1 to T of identity
the statement of Lemma 1, we get T

T
 ∇Ft (xt ) − ∇Ft−1 (xt−1 )
[Ft (xt ) − Ft (xt )] t=1
t=1
T
 
T
 ≤ T ∇Ft (xt ) − ∇Ft−1 (xt−1 )2 = T DT
≤ sup
Ft,t−1 + Ft−1 (xt−1 ) − Ft (xt )
t=1
t=1
T −1 (15)
 into (14) with the definition of VT (3) to obtain
+ (1 − γ) [Ft (xt )−Ft (xt )]+ F0 (x0 ) − F0 (x0 )
1 1
t=1 RegD T ≤ VT + [F0 (x0 ) − FT (xT )|]
γ γ
T
 3L1 D2 
+ γD ∇Ft (xt ) − ∇Ft−1 (xt−1 ) + γ 2 T 3L1 D2
2 + D T DT + γT . (16)
t=1 2
(12) 1
Select γ = √T and use (1 − γ) < 1 to write
sup
where Ft,t−1 is as defined in Lemma 1. Taking the √   
T −1 RegD T ≤ T V T + K1 + D D T , (17)
term (1 − γ) t=1 [Ft (xt ) − Ft (xt )|] to the left side of

Authorized licensed use limited to: ECOLE POLYTECHNIQUE DE MONTREAL. Downloaded on June 12,2023 at 18:31:02 UTC from IEEE Xplore. Restrictions apply.
KALHAN et al.: DYNAMIC ONLINE LEARNING VIA FRANK-WOLFE ALGORITHM 937

2
where K1 := [F0 (x0 ) − FT (xT )] + 3L12D , completing the decrement properties of Algorithm 1 in the convex case, which
proof.  is formalized in the following lemma.
Theorem 1 establishes convergence of Algorithm 1 for non- Lemma 3: Under Assumptions 1–2, the iterates generated by
stationary problems in terms of dynamic regret up to factors the Algorithm 1 under the partial feedback satisfy: when loss
depending on VT and DT [cf. (3)], as defined in Section I. This functions Ft are convex, it holds that :
is the first time a projection-free scheme has been demonstrated Ft (xt ) − Ft (xt )
as theoretically effective for dynamic learning problems, paving  
the way for use in applications with data drift across time. Note, ≤ (1 − γ) Ft−1 (xt−1 ) − Ft−1 (xt−1 )
however, that Algorithm 1 requires exact gradient information at sup
+ 2Ft,t−1 + γD ∇Ft (xt ) − ∇Ft−1 (xt−1 )
each step, which in applications to learning control such as (9),
may be unavailable. This motivates the partial feedback setting, 3L1 D2
which we analyze next. + γD ∇Ft−1 (xt−1 ) − dt−1  + γ 2 .
2
(21)
A. Regret Analysis Under Partial Feedback
The detailed proof of Lemma 3 is provided in Appendix
To analyze performance when feedback is partial, before D. Lemma 3 shows that the sub-optimality Ft (xt ) − Ft (xt )
proceeding, we state an additional required assumption that depends on (1 − γ) times the sub-optimality at previous instant
limits the variance of stochastic approximation error. and maximum variation between the two consecutive functions
Assumption 3: The variance of the unbiased stochastic gra- and other terms which can be made small with appropriate choice
dients ∇F̃t (x, z) is bounded above by σ 2 of averaging parameter γ.
 Theorem 2: Under the Assumptions 1–3, for the iterates gen-
E ∇ft (x, z) − ∇Ft (x)2 ≤ σ 2 , for all t. (18) erated by Algorithm 1 which performs one linear minimization
We are ready to state the convergence of Algorithm 1 in optimization step per iteration, the following expected dynamic
terms of dynamic regret. We begin by characterizing the error regret bounds hold:
associated with using partial feedback, i.e., stochastic gradients, T

rather than true gradients, in the following lemma. [E [Ft (xt )] − Ft (xt )]
Lemma 2: Let the Assumptions 1–3 hold, then the iterates t=1
generated by Algorithm 1 satisfy  √  
 5
≤ O 1 + T VT + T 6 (1 + DT ) , (22)
E ∇Ft (xt ) − dt 2
 under step-size and inertia selections γ = √1T , ρ = T 1/3
1
.
≤ ρ2 E ∇Ft (xt ) − ∇ft (xt , zt )2 Proof: Firstly, compute the total expectation on both sides of
 the statement of Lemma 3 for a given Ft−1 , we get,
(1 − ρ)
+ E ∇Ft (xt )−∇Ft−1 (xt−1 )2 E[Ft (xt )] − Ft (xt )
ρ
  
sup
≤ Ft,t−1 + (1 − γ) E[Ft−1 (xt−1 )] − Ft−1 (xt−1 )
+ (1 − ρ)E ∇Ft−1 (xt−1 ) − dt−1 2 . (19)  
The proof is provided in Appendix B. The result in Lemma 2 + E Ft−1 (xt−1 )−Ft (xt )
shows that the squared error of gradient estimation ∇Ft (xt ) − + γDE ∇Ft (xt )−∇Ft−1 (xt−1 )
dt 2 depends on (1 − ρ) times the squared error in previous
gradient estimate and variation in gradient at previous instant. 3L1 D2
+ γDE [∇Ft−1 (xt−1 ) − dt−1 ] + γ 2 (23)
The other terms are negligible with proper choice of ρ. 2
We will use this characterization of gradient estimation error Next, following the steps similar to (12) to (16), we obtain
to establish decrement properties of Algorithm 1 in the convex T
settings. Before doing so, we analyze the error accumulation 
[E [Ft (xt )] − Ft (xt )]
over time horizon T for partial feedback, as stated next.
t=1
Corollary 1: Using the result of Lemma 1, it holds that
T 1 1 
 ≤ VT + [F0 (x0 ) − FT (xT )|] +D T Dt
E [∇Ft (xt ) − dt ] γ γ
t=1 T
  3L1 D2
 +D E [∇Ft−1 (xt−1 )−dt−1 ]+γT . (24)
2 2
T (1 − ρ) T (1 − ρ) 2 2
≤ ρT σ + DT + ∇F0 (x0 ) . t=1
ρ2 ρ √
(20) √ √ result of Corollary 1 into (24), utilize u1 + u2 ≤
Using the
u1 + u2 , and the upper bound of (1 − ρ) < 1, we get
The result in Corollary 1 (proof is provided in Appendix C)
T

shows that the error in gradient estimate over the entire time
horizon is bounded asymptotically if we choose ρ properly. We [E [Ft (xt )]−Ft (xt )]
t=1
note that if we select ρ = 1, i.e., use stochastic gradients, then the
gradient estimation error over entire time horizon diverges due 1 1 √
to the variance of estimates. Now, we are ready to establish the ≤ VT + [F0 (x0 )−FT (xT )]+D ρT σ
γ γ

Authorized licensed use limited to: ECOLE POLYTECHNIQUE DE MONTREAL. Downloaded on June 12,2023 at 18:31:02 UTC from IEEE Xplore. Restrictions apply.
938 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 69, 2021

  
1 T Corollary 2: Utilizing the statement of Lemma 4, it holds that
+D +1 T DT + D ∇F0 (x0 )
ρ ρ
K
  2  L2 (1 − ρ) 2
E ∇Ft (xkt ) − dkt  ≤ ρσ 2 K + 1 2 γ KD2
ρ
3L1 D2 k=1
+ D ∇F0 (x0 ) + γT . (25)
2 (1 − ρ)  
∇Ft (x0t )2 .
+
Substituting γ = √1 and ρ = 1
for any 0 < c < 12 , we obtain ρ
T Tc
(29)
T
 √ If we use the stochastic gradient directly in place of its
T (VT +K1 ) + DT (1− 2 ) σ
c
[E [Ft (xt )]−Ft (xt )] ≤ estimate dkt by setting ρ = 1, the error in gradient approximation
t=1 over entire inner loop time horizon diverges due to non-vanishing
 variance of gradient approximation. Next, similar to the analysis
T DT + T ( )B + B
1+c
+ D (T c + 1) 2 (26) of Algorithm 1, we present a decrement property satisfied by the
where, B := (D∇F0 (x0 )), K1 = [F0 (x0 )−FT (xT ) + iterates of Algorithm 2.
3L1 D 2 Lemma 5: With all the Assumption 1–3 satisfied, for the
2 ] and we utilize the definition of DT and VT . sub-optimality Ft (xt ) − Ft (xt ) generated by the actions of
In order to ensure sublinearity of (26) up to the non- Algorithm 2, the following hold true, when loss functions Ft
stationarity parameters, we require 0 < c < 12 and with c = 13 are convex,
we have,
Ft (xk+1
t ) − Ft (xt ) ≤ (1 − γ)(Ft (xkt ) − Ft (xt ))
T

[E [Ft (xt )] − Ft (xt )] γ  
∇Ft (xkt ) − dkt 2 + γ β D2
+
t=1 2β 2
√      
T K1 + VT + D DT +T ( 6 ) D σ+ DT
5 L1 2
≤B+ + γdkt , vtk − xt  + γ 2
D . (30)
2
+ T ( 3 ) (B). Lemma 5 shows that the sub-optimality Ft (xk+1 ) − Ft (xt )
2
(27) t
decreases at each iteration by the factor (1 − γ), when the
 remaining terms are made small under proper selection of aver-
Theorem 2 establishes that the dynamic regret for Algorithm aging parameter γ and constant β > 0.
1 is sublinear despite only having access to noisy estimates With those in place, the dynamic regret performance of Meta-
of online gradients, given appropriate stepsize and averaging Frank Wolfe is presented next.
parameter selections. Theorem 3: Consider the proposed Algorithm 2, with all
1
the Assumptions 1–3 satisfied. Then, with step-size γ = K ,
1 a
B. Improved Results for Meta-Frank Wolfe inertia value ρ = T , and with K = O(T ) linear minimization
optimization steps with a > 1.5, it holds that
We may tighten the regret bounds by sampling multiple
T  
stochastic gradients between time slots, as described in Algo- [E [Ft (xt )] − Ft (xt )] ≤ O 1 + VT + T max{(2.5−a),0.5} .
rithm 2. To establish these results, several prerequisite lemmas
t=1
must be established which characterize the gradient estimation
error and decrement properties. Firstly, we analyze the noise in Here, E denotes the total expectation with respect to random-
gradient approximation at a particular instant k and provide an ness in the algorithm updates and noise realizations of z.
upper bound as stated in following Lemma 4. Theorem 3 (proof in Appendix H) establishes that one may
Lemma 4: Under Assumption, 1–2, it holds for the iterates improve performance in terms of dynamic regret by using Algo-
generated by Algorithm 2 that rithm 2, i.e., multiple gradient samples per action update, rather
than a single one (Algorithm 1), as in Theorem 2. Meta-Frank
 2   2
E ∇Ft (xkt ) − dkt  ≤ ρ2 E ∇Ft (xkt ) − ∇ft (xkt , zkt ) Wolfe (Algorithm 2) makes the use of K online linear optimiza-
tion oracles in each round, which in turn requires K gradient
(1 − ρ)  2 samples for each round and hence attains superior dynamic
+ E ∇Ft (xkt ) − ∇Ft (xk−1
t ) regret compared to its standard counterpart, which decreases
ρ
with an increase in K. In particular, it attains O(T 1/2 ) dynamic
 2
+ (1−ρ)E ∇Ft (xk−1 t )−dk−1
t
 . (28) regret as compared to the one shot algorithm, which attains a
slower rate (depending on the choice of c that parameterizes the
The result in Lemma 4 (proof in Appendix E) shows that the step-size. To do so, we require that K = O(T a ) for a > 3/2,
squared error of gradient estimation E[∇Ft (xkt ) − dkt 2 |Ft ] which is a stipulation that ensures the stochastic approximation
depends on (1 − ρ) times the squared error in previous gradient error associated with stochastic gradients attenuates at a suffi-
estimate and variation in gradient at previous instant, when the ciently fast rate with respect to time horizon T .
rest of the terms are made negligible relative to these terms under From Corollary 2 (equation (27)), one may observe that the
proper selection of averaging parameter ρ. estimation error associated with partial feedback attenuates as
We use the preceding relationship (Lemma 4) to establish the a function of the batch size K, which then in (85) implies
following corollary which characterizes the gradient estimation that performance monotonically improves with larger batch-size
error associated with partial feedback over time horizon T . K, i.e., larger values of a improve performance in terms of

Authorized licensed use limited to: ECOLE POLYTECHNIQUE DE MONTREAL. Downloaded on June 12,2023 at 18:31:02 UTC from IEEE Xplore. Restrictions apply.
KALHAN et al.: DYNAMIC ONLINE LEARNING VIA FRANK-WOLFE ALGORITHM 939

dynamic. However, this bound does not encapsulate the fact That is, we observe a favorable tradeoff between complexity
that in practice for non-stationary problems, the computation and accuracy by avoiding computationally costly projections. In
time between sampling new cost functions is fixed at a high particular, we compare Algorithm 1 (basic Frank-Wolfe) with
frequency, so only a fixed number of computations may be exe- full and partial feedback, i.e., exact and stochastic estimates of
cuted before a new cost is received. The experimental tradeoffs the online gradient, as well as Algorithm 2, Meta Frank-Wolfe,
between computations/sampling rates are investigated in detail in the full and partial (25% of the possible gradient samples)
in Section IV. feedback setting. The aforementioned approaches obviate the
With the performance of Algorithms 1–2 established, we turn need for projections, whereas online projected gradient descent
to experimental evaluation of the proposed schemes. Before (O-PGD) [9] and online projected stochastic gradient (O-PSG)
doing so, we close with a remark about the correspondence descent, which we also implement for comparison, requires
between our regret results in Table I and the number of calls projections. We shift to detailing the benchmarks.
to a linear minimization oracle, which is an alternative way of Matrix Completion: We consider the problem of online
characterizing complexity that is more commonplace in mathe- matrix completion in non-stationary environments. Here we are
matical programming. interested in completing low rank matrices whose entries are
Next, we characterize the dynamic regret performance of the noisy and time varying in nature. This problem arises in different
Meta-Frank Wolfe algorithm when the gradient at each iteration applications such as localization [54], network monitoring [55],
is deterministic in nature. and video denoising [56]. For example, the pairwise latency
Theorem 4: Consider the proposed Algorithm 2, with all matrix between the nodes in a network is approximately low
1
the Assumptions 1–2 satisfied. Then, with step-size γ = K , rank and time varying [57]. Here we seek the best possible low
a
inertia value ρ = 1, and with K = O(T ) linear minimization rank approximation of a given matrix Mt ∈ Rm×n . We denote
optimization steps with 0 < a ≤ 0.5, it holds that Xt ∈ Rm×n as the low-rank approximation of Mt . At each
  iteration t, we observe the time varying entries (i, j) ∈ I of the
RegD T ≤ O 1 + V T + T (1−a)
. matrix Mt and we seek to estimate the low rank approximation
Theorem 4 (proof in Appendix I) establishes that the perfor- denoted by Xt . The standard online convex optimization prob-
mance in terms of dynamic regret further improves if we have lem is defined in [26, Chap. 7]. The non-stationary counterpart
full gradient at each instant. of the problem is formulated as
Remark 2: We want to point out that for each step t in standard 
Frank Wolfe (Algorithm 1 in Section II), we require a single call min ([Xt ]ij − [Mt ]ij )2 s. t. Xt  ≤ k. (31)
Xt
to a linear minimization oracle (LMO) per step t, meaning that (ij)∈It
there is an exact correspondence between the time index and the
number of LMO calls. Thus, to obtain the convergence result in for all t where [Xt ]ij denotes the (i, j)th element of matrix Xt
Theorem 1, one requires T calls to a LMO. This logic also applies and k is a sparsity parameter which controls the rank of the
to reinterpret Theorem 2 in terms of the number of LMO calls; matrix. We solve (31) using Algorithms 1 and 2, and compare
except for this setting the LMO has access to a stochastic gradient performance with alternatives such as O-PGD and online con-
rather than a true gradient of the instantaneous cost. By contrast, ditional gradient (OCG) algorithm. algorithm in Fig. 1(a).
Algorithm 2 requires K calls to a LMO per step t (which also The use of Frank-Wolfe style algorithms are natural for this
only has access to a stochastic gradient per step), and hence uses domain, not only due to the low computational complexity, but
K more stochastic gradient calls per update than Algorithm 1. also because the iterates exhibit structured sparsity. Indeed, it
This means that the tightened convergence results of Meta-Frank is known that for such settings, the nuclear norm of the matrix
Wolfe in the non-stationary partial feedback setting (Theorem estimate Xt is relatively constant when data is stationary, and
3) relative to vanilla Frank Wolfe (Theorem 2) comes at the evolves whenever it is non-stationary, thus making the spar-
computational cost of a factor of K additional LMO calls. sity dimension of the iterates a valid change-point detector in
Remark 3: Observe that the regret bounds presented in this non-stationary domains [33], [37]. This motivates us to execute
subsection require a priori knowledge of the final time-horizon change detection in the rank of ground truth matrix Xt of (31)
T in order to select the step-sizes γ and ρ. This may be im- by observing abrupt changes in the rank of Xt under the con-
practical if one would like the algorithm to run in perpetuity in straint that Xt ∗ ≤ k. For the change detection experiments,
a streaming application. To alleviate this dependence, one may the ground truth matrix X∗t changes its rank over time with
employ the “doubling trick” [53, Section 2.3.1], where instead a maximum value of k. We are interested in detecting those
one employs a fixed constant step-size γt = γ > 0 but intro- changes in the rank of the ground truth matrix X∗t as soon as
duces mini-batching of computations over batches increasing possible. Therefore, the abrupt change is defined in terms of
size 2m and then runs the algorithm over outer-index m. The the change in rank of matrix X∗t . An example of such a change
result yields an algorithm with the same order of regret as the is when rank(X∗t ) = min( t/5 , k) for all t ≥ 1 which states
original algorithm but is worse by a multiplicative constant. that the rank of X∗t increases by one at every 5th iteration. In
These approaches, while developed for static regret, operate practice, this may happen if the distribution that generates Xt
similarly for preserving performance results in terms of dynamic has shifted substantially. We propose to exploit the structured
regret. sparsity of Frank-Wolfe iterations through the upper-bound
rank(Xt ) ≤ t + 1 in (5) to discern how many iterations are
required for convergence. More specifically, rank(Xt ) gives
IV. EXPERIMENTS
us a proxy for the unknowable entities in equation (3) that
In this section, we experimentally evaluate Algorithms 1–2 determines when the sequence of costs Ft is varying more/less.
on matrix completion and background subtraction in the video, We corroborate our claim of faster change detection via FW
both of which demonstrate the merits of online Frank-Wolfe. iterates in the experiments presented next.
Authorized licensed use limited to: ECOLE POLYTECHNIQUE DE MONTREAL. Downloaded on June 12,2023 at 18:31:02 UTC from IEEE Xplore. Restrictions apply.
940 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 69, 2021

Fig. 2. In the matrix completion problem (31), FW detects change points


in the observation matrix most sharply due to the fact that its iterates exhibit
structured sparsity, as compared with alternatives for non-stationary online
convex optimization.

TABLE III
SUMMARY OF COMPUTATION TIME IN SECONDS FOR BACKGROUND
EXTRACTION PROBLEM

detection feature of the proposed algorithm. We start with a


square subset of size 150 × 150 of matrix M1 making rank of
O1 as 150. The transition of rank occurs when we increase the
size of the observation matrix in a random direction, i.e., the
Fig. 1. (a) Comparison of dynamic regrets of different algorithms for matrix inclusion of some random rows or columns from Mt into Ot .
completion. (b) Runtime comparison of Frank-Wolfe and Meta Frank-Wolfe Note that the observation matrix is always a square matrix, and
compared to alternatives on matrix complication. Observe that Frank-Wolfe the rank of the matrix under observation changes in a manner
performs better than the Meta-FW and online projected gradient descent (O-
PGD). The performance is similar to the online conditional gradient (OCG), but shown by the solid curve in Fig. 2. As clear from Fig. 2, the
OCG has a higher dynamic regret [see Fig. 1(a)]. proposed approach is able to detect the change in the rank earliest
among other available techniques in the literature such as online
subgradient descent (O-Sub-D), online proximal descent (O-
In the experiments, we have included the results for both prox-D), and O-PGD.
exact as well as an inexact gradient. For OFW-inexact and Meta- Background extraction problem: In this experiment, we
FW-inexact, we have considered only 25% of the samples’ full extend the matrix completion problem on real dataset from [58],
gradient from random locations at each iteration. As presented [59]. At each instant we observe a video frame and collect it into
in Fig. 1(a), OFW performs better than OCG, a projection-free matrix Mt . The goal of the problem is to extract the background
algorithm of [26] when full gradient information is available. from the video which is conceptually the low-rank estimate Lt
We remark that the Meta-FW algorithm performs best among of the data matrix. The problem is then given as:
all the algorithms when the full gradient is available. Similar
1
behavior is observed with partial information availability too. min Mt − Lt 2F + Lt 2F such that Lt ∗ ≤ k (32)
Also Fig. 1(b) shows that O-PGD is the slowest as compared Lt 2
to all the algorithms due to the required projection. For the The results in Fig. 3 are generated using OFW with different
experiments, we start with a random matrix M1 with each samples of the gradient at different instants. Note that online
element [M1 ]ij ∼ N (0, 1) and generate subsequent matrices Frank-Wolfe yields effective performance for this application, as
Mt as Mt = a Mt−1 , where a is uniformly distributed random demonstrated in Fig. 3 – the cars are removed from the frame. We
variable between 0 and 1. We have considered m = n = 20. To summarize execution times in Table III, where we observe obvi-
implement the Meta-Frank Wolfe algorithm, we fix K = 30. The ating projections yields quicker completion. Please see the video
experiments were performed on an Intel core i3-7020 computer appended to the submission to observe Frank-Wolfe implement-
running at 2.30 GHz with 4 GB RAM and execution time was ing online background subtraction. The size of matrix Mt is
recorded using the default system clock. 240 × 320 and k = 150 for all the algorithms. The experiments
Next, in Fig. 2, we present the experimental results where we were performed on an Intel core i7-8700 computer running at
utilize the structured sparsity of the iterates to develop a sharper 3.20 GHz with RAM, and execution time was recorded using
change point detector. For this experiment, we have considered the default system clock.
m = n = 300 and generate Mt for each t as explained earlier.
To incorporate the time-varying rank nature to the underlying
V. CONCLUSION
matrix Mt , we observe a subset of entries from random locations
denoted by Ot which is used in place in Mt in (31). For the In this work, we focused on learning problems amidst non-
simulations in Fig. 2, we are interested in showing the change stationarity, which we addressed with the formalism of dynamic

Authorized licensed use limited to: ECOLE POLYTECHNIQUE DE MONTREAL. Downloaded on June 12,2023 at 18:31:02 UTC from IEEE Xplore. Restrictions apply.
KALHAN et al.: DYNAMIC ONLINE LEARNING VIA FRANK-WOLFE ALGORITHM 941

Fig. 3. Background Extraction Problem: 1st and 3rd row represents results for instant 1 of the video; the 2nd and 4th row represent instant 2 of the video, which
is clear from the 1st column. In the figure, RS denotes random sampling, and the percentage denotes how many samples from the full gradient are utilized for the
algorithm updates. The proposed algorithm performs really well for this application since the cars are effectively removed from the frame, as clear from 2nd to 5th
columns.

regret minimization. Due to the nature of non-stationary learn- From the update in Algorithm 1, we write (xt − xt−1 ) =
ing, the best one may hope to achieve is sublinear regret up γ(vt−1 − xt−1 ) and substitute into (33). Thus, we get
to fundamental metrics of non-stationarity. Existing approaches
Ft (xt ) ≤ Ft (xt−1 ) + γ∇Ft (xt−1 ), vt−1 − xt−1 
that achieve these results, such as online gradient descent, re-
quire projections to ensure constraint satisfaction, causing a L1
computational bottleneck that reduces adaptivity. To avoid this + γ2 vt−1 − xt−1 2 . (34)
2
issue, we proposed using Frank-Wolfe (conditional gradient Adding and subtracting the terms γ∇Ft−1 (xt−1 ), vt−1 −
method), which involves executing updates in directions that
xt−1  to the right hand side of (34) and after rearranging, we
are collinear with the gradient of the instantaneous cost that
are guaranteed to be feasible. These directions are found as the obtain
solution of a linear program. Ft (xt ) ≤ Ft (xt−1 )+γ∇Ft (xt−1 )−∇Ft−1 (xt−1 ), vt−1 −xt−1 
Overall, we established that Frank-Wolfe enjoys comparable
performance in terms of dynamic regret to existing methods with + γ∇Ft−1 (xt−1 ), vt−1 − xt−1 
substantially reduced complexity, thus improving adaptivity. L1
Moreover, when only noisy estimates of the stochastic gradient + γ2 vt−1 − xt−1 2 . (35)
2
are available, we established the dynamic regret of Frank-Wolfe, Next, utilizing the optimality condition i.e.
which can be improved through multiple gradient samples per
time-slot (Meta Frank-Wolfe). These conceptual results trans- xt−1 , ∇Ft−1 (xt−1 ) ≥ minv, ∇Ft−1 (xt−1 )
v∈X
lated well into experimental performance competitive with state
of the art with significant reductions in runtime. Future directions = vt−1 ,∇Ft−1 (xt−1 ) (36)
include tuning hyper-parameters to the dynamics of the learning and substituting into (35), we get
problem and addressing constraints in multi task learning [60]
that cannot be satisfied easily using projections or the solution Ft (xt )≤ Ft (xt−1 )+γ∇Ft (xt−1 )−∇Ft−1 (xt−1 ),vt−1 −xt−1 
of linear programs. + γ∇Ft−1 (xt−1 ), xt−1 − xt−1 
L1
APPENDIX A + γ2 vt−1 − xt−1 2 . (37)
PROOF OF LEMMA 1 2
Using first order convexity condition of Ft−1 (·) in (37), we get
Proof: From the L1 smoothness of the loss function Ft (As-
sumption 2), it holds that Ft (xt) ≤ Ft (xt−1 )+γ∇Ft (xt−1 )−∇Ft−1 (xt−1 ), vt−1 −xt−1 
Ft (xt ) ≤ Ft (xt−1 )+∇Ft (xt−1 ), xt −xt−1  + γ(Ft−1 (xt−1 )−Ft−1 (xt−1 ))
L1 L1
+ xt −xt−1 2 . (33) + γ2 vt−1 −xt−1 2 . (38)
2 2

Authorized licensed use limited to: ECOLE POLYTECHNIQUE DE MONTREAL. Downloaded on June 12,2023 at 18:31:02 UTC from IEEE Xplore. Restrictions apply.
942 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 69, 2021

Subtracting $F_t(x_t^\star)$ from both sides of (38) leads to
$$F_t(x_t) - F_t(x_t^\star) \le F_t(x_{t-1}) - F_t(x_t^\star) + \gamma\langle\nabla F_t(x_{t-1}) - \nabla F_{t-1}(x_{t-1}), v_{t-1} - x_{t-1}\rangle + \gamma\big(F_{t-1}(x_{t-1}^\star) - F_{t-1}(x_{t-1})\big) + \gamma^2\frac{L_1}{2}\|v_{t-1} - x_{t-1}\|^2. \tag{39}$$
Next, consider the term $F_t(x_{t-1}) - F_t(x_t^\star)$ from the right hand side of (39). We can write
$$F_t(x_{t-1}) - F_t(x_t^\star) = F_t(x_{t-1}) - F_{t-1}(x_{t-1}) + F_{t-1}(x_{t-1}) - F_{t-1}(x_{t-1}^\star) + F_{t-1}(x_{t-1}^\star) - F_t(x_t^\star). \tag{40}$$
Bounding $F_t(x_{t-1}) - F_{t-1}(x_{t-1})$ by its absolute value and then taking the supremum over $x\in\mathcal{X}$, we get
$$F_t(x_{t-1}) - F_t(x_t^\star) \le F_{t,t-1}^{\sup} + F_{t-1}(x_{t-1}) - F_{t-1}(x_{t-1}^\star) + F_{t-1}(x_{t-1}^\star) - F_t(x_t^\star), \tag{41}$$
where we use $F_{t,t-1}^{\sup} := \sup_{x\in\mathcal{X}}|F_t(x) - F_{t-1}(x)|$. Using the inequality from (41) in (39) and regrouping the terms, we get
$$F_t(x_t) - F_t(x_t^\star) \le F_{t,t-1}^{\sup} + (1-\gamma)\big(F_{t-1}(x_{t-1}) - F_{t-1}(x_{t-1}^\star)\big) + F_{t-1}(x_{t-1}^\star) - F_t(x_t^\star) + \gamma\langle\nabla F_t(x_{t-1}) - \nabla F_{t-1}(x_{t-1}), v_{t-1} - x_{t-1}\rangle + \gamma^2\frac{L_1}{2}\|v_{t-1} - x_{t-1}\|^2. \tag{42}$$
Further, the inner product term in (42) can be written as
$$\langle\nabla F_t(x_{t-1}) - \nabla F_{t-1}(x_{t-1}), v_{t-1} - x_{t-1}\rangle = \langle\nabla F_t(x_{t-1}) - \nabla F_t(x_t) + \nabla F_t(x_t) - \nabla F_{t-1}(x_{t-1}), v_{t-1} - x_{t-1}\rangle. \tag{43}$$
Substituting (43) into (42) and applying the Cauchy-Schwartz inequality, we get
$$F_t(x_t) - F_t(x_t^\star) \le F_{t,t-1}^{\sup} + (1-\gamma)\big(F_{t-1}(x_{t-1}) - F_{t-1}(x_{t-1}^\star)\big) + F_{t-1}(x_{t-1}^\star) - F_t(x_t^\star) + \gamma\|\nabla F_t(x_{t-1}) - \nabla F_t(x_t)\|\,\|v_{t-1} - x_{t-1}\| + \gamma\|\nabla F_t(x_t) - \nabla F_{t-1}(x_{t-1})\|\,\|v_{t-1} - x_{t-1}\| + \gamma^2\frac{L_1}{2}\|v_{t-1} - x_{t-1}\|^2. \tag{44}$$
Using Assumption 2 and the update for $x_t$ (cf. (8)), we get
$$F_t(x_t) - F_t(x_t^\star) \le F_{t,t-1}^{\sup} + (1-\gamma)\big(F_{t-1}(x_{t-1}) - F_{t-1}(x_{t-1}^\star)\big) + F_{t-1}(x_{t-1}^\star) - F_t(x_t^\star) + \gamma^2\frac{3L_1}{2}\|v_{t-1} - x_{t-1}\|^2 + \gamma\|\nabla F_t(x_t) - \nabla F_{t-1}(x_{t-1})\|\,\|v_{t-1} - x_{t-1}\|. \tag{45}$$
Utilizing the statement of Assumption 1 and upper bounding the right hand side of (45) provides the result in Lemma 1. ∎
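Before proceeding, note that (33) is the standard descent-type consequence of $L_1$-smoothness and is the only analytic ingredient of the proof that is not an identity. The following numerical aside (not part of the original analysis; the random quadratic test function and all constants are illustrative) checks it, using that a quadratic is smooth with constant equal to the largest Hessian eigenvalue.

import numpy as np

# Numerical aside: for f(x) = 0.5 x' A x with A symmetric PSD, f is
# L1-smooth with L1 = lambda_max(A), and (33) reads
#   f(y) <= f(x) + <grad f(x), y - x> + (L1/2) ||y - x||^2.
rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
A = M @ M.T                       # symmetric positive semidefinite
L1 = np.linalg.eigvalsh(A).max()  # smoothness constant
f = lambda x: 0.5 * x @ A @ x
for _ in range(1000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    rhs = f(x) + (A @ x) @ (y - x) + 0.5 * L1 * (y - x) @ (y - x)
    assert f(y) <= rhs + 1e-9
print("smoothness inequality (33) verified on random pairs")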
APPENDIX B
PROOF OF LEMMA 2

Proof: Let us start by analyzing the term $\|\nabla F_t(x_t) - d_t\|^2$. Using the definition of $d_t$ in (6), we can write
$$\|\nabla F_t(x_t) - d_t\|^2 = \|\nabla F_t(x_t) - \bar\rho\, d_{t-1} - \rho\nabla f_t(x_t, z_t)\|^2, \tag{46}$$
where $\bar\rho := 1-\rho$. Adding and subtracting the term $\bar\rho\nabla F_{t-1}(x_{t-1})$ inside the norm on the right hand side of (46) and rearranging, we get
$$\|\nabla F_t(x_t) - d_t\|^2 = \big\|\rho\big(\nabla F_t(x_t) - \nabla f_t(x_t, z_t)\big) + \bar\rho\big(\nabla F_t(x_t) - \nabla F_{t-1}(x_{t-1})\big) + \bar\rho\big(\nabla F_{t-1}(x_{t-1}) - d_{t-1}\big)\big\|^2. \tag{47}$$
Next, expanding the square on the right hand side, we obtain
$$\begin{aligned} \|\nabla F_t(x_t) - d_t\|^2 &= \rho^2\|\nabla F_t(x_t) - \nabla f_t(x_t, z_t)\|^2 + \bar\rho^2\|\nabla F_t(x_t) - \nabla F_{t-1}(x_{t-1})\|^2 + \bar\rho^2\|\nabla F_{t-1}(x_{t-1}) - d_{t-1}\|^2 \\ &\quad + 2\rho\bar\rho\langle\nabla F_t(x_t) - \nabla f_t(x_t, z_t), \nabla F_t(x_t) - \nabla F_{t-1}(x_{t-1})\rangle \\ &\quad + 2\bar\rho^2\langle\nabla F_t(x_t) - \nabla F_{t-1}(x_{t-1}), \nabla F_{t-1}(x_{t-1}) - d_{t-1}\rangle \\ &\quad + 2\rho\bar\rho\langle\nabla F_t(x_t) - \nabla f_t(x_t, z_t), \nabla F_{t-1}(x_{t-1}) - d_{t-1}\rangle. \end{aligned} \tag{48}$$
Let us define a sigma algebra $\mathcal{F}_t$ representing the algorithm history, which contains $\{x_k\}_{k=1}^{t}$ and $\{z_k\}_{k=1}^{t-1}$. This implies that the conditional expectation of the unbiased stochastic gradient estimate satisfies $\mathbb{E}[\nabla f_t(x_t, z_t)\,|\,\mathcal{F}_t] = \nabla F_t(x_t)$, so the cross terms involving $\nabla F_t(x_t) - \nabla f_t(x_t, z_t)$ vanish in expectation. Computing the conditional expectation $\mathbb{E}[\cdot\,|\,\mathcal{F}_t]$ on both sides of (48), we obtain
$$\begin{aligned} \mathbb{E}\big[\|\nabla F_t(x_t) - d_t\|^2\,\big|\,\mathcal{F}_t\big] &= \mathbb{E}\big[\rho^2\|\nabla F_t(x_t) - \nabla f_t(x_t, z_t)\|^2\,\big|\,\mathcal{F}_t\big] + \bar\rho^2\|\nabla F_t(x_t) - \nabla F_{t-1}(x_{t-1})\|^2 \\ &\quad + \bar\rho^2\|\nabla F_{t-1}(x_{t-1}) - d_{t-1}\|^2 + 2\bar\rho^2\langle\nabla F_t(x_t) - \nabla F_{t-1}(x_{t-1}), \nabla F_{t-1}(x_{t-1}) - d_{t-1}\rangle. \end{aligned} \tag{49}$$
From Young's inequality, it holds that
$$\langle\nabla F_t(x_t) - \nabla F_{t-1}(x_{t-1}), \nabla F_{t-1}(x_{t-1}) - d_{t-1}\rangle \le \frac{\beta}{2}\|\nabla F_{t-1}(x_{t-1}) - d_{t-1}\|^2 + \frac{1}{2\beta}\|\nabla F_t(x_t) - \nabla F_{t-1}(x_{t-1})\|^2. \tag{50}$$
Using the upper bound from (50) in (49) with $\beta = \rho$, and utilizing the facts that $(1-\rho^2) < 1$ and $(1 + 1/\rho)\bar\rho < 1/\rho$, we get
$$\mathbb{E}\big[\|\nabla F_t(x_t) - d_t\|^2\,\big|\,\mathcal{F}_t\big] \le \rho^2\,\mathbb{E}\big[\|\nabla F_t(x_t) - \nabla f_t(x_t, z_t)\|^2\,\big|\,\mathcal{F}_t\big] + (\bar\rho/\rho)\,\|\nabla F_t(x_t) - \nabla F_{t-1}(x_{t-1})\|^2 + \bar\rho\,\|\nabla F_{t-1}(x_{t-1}) - d_{t-1}\|^2. \tag{51}$$
After taking the total expectation on both sides of (51), we obtain the result stated in Lemma 2. ∎
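The passage from (49)–(50) to (51) silently absorbs constants using $\beta = \rho$. As a numerical aside (illustrative only, not from the paper), the snippet below verifies the two scalar inequalities behind that absorption, namely $\bar\rho^2(1+\rho) \le \bar\rho$ and $\bar\rho^2(1+1/\rho) \le \bar\rho/\rho$ for $\rho\in(0,1)$, where $\bar\rho = 1-\rho$.

import numpy as np

for rho in np.linspace(0.01, 0.99, 99):
    bar = 1.0 - rho
    # coefficient of ||grad F_{t-1}(x_{t-1}) - d_{t-1}||^2 in (49)+(50):
    assert bar**2 * (1.0 + rho) <= bar + 1e-12
    # coefficient of ||grad F_t(x_t) - grad F_{t-1}(x_{t-1})||^2:
    assert bar**2 * (1.0 + 1.0 / rho) <= bar / rho + 1e-12
print("coefficient simplifications behind (51) hold on the grid")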
APPENDIX C
PROOF OF COROLLARY 1

Proof: First, take the summation over $t$ on both sides of (51); second, move the term $(1-\rho)\sum_{t=1}^{T-1}\mathbb{E}[\|\nabla F_t(x_t) - d_t\|^2]$ to the left hand side; and then use $\mathbb{E}[\|\nabla F_T(x_T) - d_T\|^2] > \rho\,\mathbb{E}[\|\nabla F_T(x_T) - d_T\|^2]$, which holds since $\rho < 1$, to get
$$\rho\sum_{t=1}^{T}\mathbb{E}\big[\|\nabla F_t(x_t) - d_t\|^2\big] \le \rho^2\sum_{t=1}^{T}\mathbb{E}\big[\|\nabla F_t(x_t) - \nabla f_t(x_t, z_t)\|^2\big] + \frac{1-\rho}{\rho}\sum_{t=1}^{T}\|\nabla F_t(x_t) - \nabla F_{t-1}(x_{t-1})\|^2 + (1-\rho)\|\nabla F_0(x_0) - d_0\|^2. \tag{52}$$
Dividing both sides of (52) by $\rho$, using the initialization $d_0 = 0$, utilizing the bounded variance condition of Assumption 3, and invoking the definition of $D_T$, we obtain
$$\sum_{t=1}^{T}\mathbb{E}\big[\|\nabla F_t(x_t) - d_t\|^2\big] \le \rho T\sigma^2 + \frac{1-\rho}{\rho^2}D_T + \frac{1-\rho}{\rho}\|\nabla F_0(x_0)\|^2. \tag{53}$$
Note that
$$\sum_{t=1}^{T}\mathbb{E}\big[\|\nabla F_t(x_t) - d_t\|\big] \le \sqrt{T\sum_{t=1}^{T}\big(\mathbb{E}\big[\|\nabla F_t(x_t) - d_t\|\big]\big)^2}, \tag{54}$$
and using the inequality $(\mathbb{E}[X])^2 \le \mathbb{E}[X^2]$, we get
$$\sqrt{T\sum_{t=1}^{T}\big(\mathbb{E}\big[\|\nabla F_t(x_t) - d_t\|\big]\big)^2} \le \sqrt{T\sum_{t=1}^{T}\mathbb{E}\big[\|\nabla F_t(x_t) - d_t\|^2\big]}. \tag{55}$$
From (55), we obtain the statement of Corollary 1. ∎
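The step from (54) to (55) combines the Cauchy-Schwartz inequality with $(\mathbb{E}[X])^2 \le \mathbb{E}[X^2]$. The following snippet is an illustrative aside (random nonnegative data stands in for the expected gradient-error norms) confirming the deterministic half of that chain.

import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    a = rng.random(50)  # surrogates for E||grad F_t(x_t) - d_t||, T = 50
    # Cauchy-Schwartz: sum_t a_t <= sqrt(T * sum_t a_t^2), cf. (54)-(55)
    assert a.sum() <= np.sqrt(a.size * np.sum(a**2)) + 1e-9
print("Cauchy-Schwartz step verified on random instances")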
APPENDIX D
PROOF OF LEMMA 3

Proof: We remark that the statement of Lemma 3 is similar to that of Lemma 1, except for a $d_t$-dependent term on the right hand side. Hence, the proof of Lemma 3 closely parallels that of Lemma 1, and only the major steps that change are explained next. Adding and subtracting the term $\gamma\langle d_{t-1}, v_{t-1} - x_{t-1}\rangle$ (note that $d_{t-1}$ here plays the role of $\nabla F_{t-1}(x_{t-1})$ in the proof of Lemma 1) on the right hand side of (34) and rearranging, we obtain
$$F_t(x_t) \le F_t(x_{t-1}) + \gamma\langle\nabla F_t(x_{t-1}) - d_{t-1}, v_{t-1} - x_{t-1}\rangle + \gamma\langle d_{t-1}, v_{t-1} - x_{t-1}\rangle + \gamma^2\frac{L_1}{2}\|v_{t-1} - x_{t-1}\|^2. \tag{56}$$
Next, utilizing the optimality condition, i.e.,
$$\langle x_{t-1}^\star, d_{t-1}\rangle \ge \min_{v\in\mathcal{X}}\langle v, d_{t-1}\rangle = \langle v_{t-1}, d_{t-1}\rangle, \tag{57}$$
in (56), we get
$$F_t(x_t) \le F_t(x_{t-1}) + \gamma\langle\nabla F_t(x_{t-1}) - d_{t-1}, v_{t-1} - x_{t-1}\rangle + \gamma\langle d_{t-1}, x_{t-1}^\star - x_{t-1}\rangle + \gamma^2\frac{L_1}{2}\|v_{t-1} - x_{t-1}\|^2. \tag{58}$$
Following steps similar to those in (37) through (45), we obtain
$$F_t(x_t) - F_t(x_t^\star) \le F_{t,t-1}^{\sup} + (1-\gamma)\big(F_{t-1}(x_{t-1}) - F_{t-1}(x_{t-1}^\star)\big) + F_{t-1}(x_{t-1}^\star) - F_t(x_t^\star) + \gamma\big(L_1\|x_{t-1} - x_t\| + \|\nabla F_t(x_t) - \nabla F_{t-1}(x_{t-1})\|\big)\|v_{t-1} - x_{t-1}\| + \gamma\|\nabla F_{t-1}(x_{t-1}) - d_{t-1}\|\,\|v_{t-1} - x_{t-1}\| + \gamma^2\frac{L_1}{2}\|v_{t-1} - x_{t-1}\|^2. \tag{59}$$
Using the update step of $x_t$ (cf. (8)) and applying the upper bound from Assumption 1 in (59), we obtain the result of Lemma 3. ∎

APPENDIX E
PROOF OF LEMMA 4

Proof: Using the definition of $d_t^k$ from the update in step 9 of Algorithm 2, we can write
$$\|\nabla F_t(x_t^k) - d_t^k\|^2 = \big\|\nabla F_t(x_t^k) - (1-\rho)d_t^{k-1} - \rho\nabla f_t(x_t^k, z_t^k)\big\|^2. \tag{60}$$
Adding and subtracting the term $(1-\rho)\nabla F_t(x_t^{k-1})$ in (60) yields
$$\|\nabla F_t(x_t^k) - d_t^k\|^2 = \big\|\nabla F_t(x_t^k) - (1-\rho)\nabla F_t(x_t^{k-1}) + (1-\rho)\nabla F_t(x_t^{k-1}) - (1-\rho)d_t^{k-1} - \rho\nabla f_t(x_t^k, z_t^k)\big\|^2. \tag{61}$$
Rearranging and expanding the square, the right hand side of (61) can be written as
$$\begin{aligned} \|\nabla F_t(x_t^k) - d_t^k\|^2 &= \rho^2\|\nabla F_t(x_t^k) - \nabla f_t(x_t^k, z_t^k)\|^2 + (1-\rho)^2\|\nabla F_t(x_t^k) - \nabla F_t(x_t^{k-1})\|^2 + (1-\rho)^2\|\nabla F_t(x_t^{k-1}) - d_t^{k-1}\|^2 \\ &\quad + 2\rho(1-\rho)\langle\nabla F_t(x_t^k) - \nabla f_t(x_t^k, z_t^k), \nabla F_t(x_t^k) - \nabla F_t(x_t^{k-1})\rangle \\ &\quad + 2(1-\rho)^2\langle\nabla F_t(x_t^k) - \nabla F_t(x_t^{k-1}), \nabla F_t(x_t^{k-1}) - d_t^{k-1}\rangle \\ &\quad + 2\rho(1-\rho)\langle\nabla F_t(x_t^k) - \nabla f_t(x_t^k, z_t^k), \nabla F_t(x_t^{k-1}) - d_t^{k-1}\rangle. \end{aligned}$$
Now compute the conditional expectation $\mathbb{E}[\cdot\,|\,\mathcal{F}_t^k]$ on both sides, where $\mathcal{F}_t^k$ denotes the filtration containing the algorithm history $\{x_1^1, x_1^2, \ldots, x_1^K, \ldots, x_t^1, \ldots, x_t^k\}$ and $\{z_1^1, z_1^2, \ldots, z_1^K, \ldots, z_t^1, \ldots, z_t^{k-1}\}$.
Note that $\mathbb{E}[\nabla f_t(x_t^k, z_t^k)\,|\,\mathcal{F}_t^k] = \nabla F_t(x_t^k)$, so the cross terms involving $\nabla F_t(x_t^k) - \nabla f_t(x_t^k, z_t^k)$ vanish in expectation and we obtain
$$\begin{aligned} \mathbb{E}\big[\|\nabla F_t(x_t^k) - d_t^k\|^2\,\big|\,\mathcal{F}_t^k\big] &= \mathbb{E}\big[\rho^2\|\nabla F_t(x_t^k) - \nabla f_t(x_t^k, z_t^k)\|^2\,\big|\,\mathcal{F}_t^k\big] + (1-\rho)^2\|\nabla F_t(x_t^k) - \nabla F_t(x_t^{k-1})\|^2 \\ &\quad + (1-\rho)^2\|\nabla F_t(x_t^{k-1}) - d_t^{k-1}\|^2 + 2(1-\rho)^2\langle\nabla F_t(x_t^k) - \nabla F_t(x_t^{k-1}), \nabla F_t(x_t^{k-1}) - d_t^{k-1}\rangle. \end{aligned} \tag{62}$$
Using logic analogous to that which proceeds from (50) to (51), we may conclude Lemma 4. ∎

APPENDIX F
PROOF OF COROLLARY 2

Proof: Using Assumptions 1–3 in the statement of Lemma 4, we get
$$\mathbb{E}\big[\|\nabla F_t(x_t^k) - d_t^k\|^2\big] \le \rho^2\sigma^2 + \frac{L_1^2(1-\rho)}{\rho}\mathbb{E}\big[\|x_t^k - x_t^{k-1}\|^2\big] + (1-\rho)\mathbb{E}\big[\|\nabla F_t(x_t^{k-1}) - d_t^{k-1}\|^2\big]. \tag{63}$$
Utilizing the update in step 7 of Algorithm 2, we get
$$\mathbb{E}\big[\|\nabla F_t(x_t^k) - d_t^k\|^2\big] \le \rho^2\sigma^2 + \frac{L_1^2(1-\rho)}{\rho}\gamma^2\,\mathbb{E}\big[\|x_t^{k-1} - v_t^{k-1}\|^2\big] + (1-\rho)\mathbb{E}\big[\|\nabla F_t(x_t^{k-1}) - d_t^{k-1}\|^2\big]. \tag{64}$$
Taking the summation with respect to $k$ and performing further simplifications, we get
$$\rho\sum_{k=1}^{K}\mathbb{E}\big[\|\nabla F_t(x_t^k) - d_t^k\|^2\big] \le \rho^2\sigma^2 K + \frac{L_1^2(1-\rho)}{\rho}\gamma^2\sum_{k=1}^{K}\mathbb{E}\big[\|x_t^{k-1} - v_t^{k-1}\|^2\big] + (1-\rho)\|\nabla F_t(x_t^0)\|^2. \tag{65}$$
Using Assumption 1, i.e., the boundedness of the set $\mathcal{X}$, and taking $\rho$ to the right hand side, we get the required result. ∎
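Lemma 4 and Corollary 2 control the gradient-tracking error accumulated over the $K$ inner iterations of a single time slot. For reference, a minimal sketch of that inner loop is given below; it is written under assumptions (lmo stands for an abstract linear minimization oracle over $\mathcal{X}$, and stoch_grad for the sampling oracle of step 9) and omits the per-$k$ online linear optimization oracles of the full Algorithm 2.

import numpy as np

def meta_fw_inner_loop(x, stoch_grad, lmo, K=10, rho=0.5):
    # One time slot t of a Meta Frank-Wolfe style inner loop (a sketch).
    gamma = 1.0 / K        # step size gamma = 1/K, as chosen in Theorem 3
    d = np.zeros_like(x)   # gradient-tracking variable, d_t^0 = 0
    for _ in range(K):
        d = (1.0 - rho) * d + rho * stoch_grad(x)  # step 9: track gradient
        v = lmo(d)                                 # oracle direction
        x = x + gamma * (v - x)                    # step 7: convex step
    return x

Each inner iteration costs one linear minimization instead of a projection, which is the source of the runtime reductions noted in the conclusion.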
APPENDIX G
PROOF OF LEMMA 5

Proof: Based on the $L_1$-smoothness of the function $F_t$ (Assumption 2) at any time instant and the definition of the update, we have
$$F_t(x_t^{k+1}) = F_t\big(x_t^k + \gamma(v_t^k - x_t^k)\big) \le F_t(x_t^k) + \gamma\langle\nabla F_t(x_t^k), v_t^k - x_t^k\rangle + \gamma^2\frac{L_1}{2}\|v_t^k - x_t^k\|^2 \le F_t(x_t^k) + \gamma\langle\nabla F_t(x_t^k), v_t^k - x_t^k\rangle + \gamma^2\frac{L_1}{2}D^2. \tag{66}$$
We note that
$$\langle\nabla F_t(x_t^k), v_t^k - x_t^k\rangle = \langle\nabla F_t(x_t^k) - d_t^k, v_t^k - x_t^\star\rangle + \langle\nabla F_t(x_t^k), x_t^\star - x_t^k\rangle + \langle d_t^k, v_t^k - x_t^\star\rangle.$$
Using Young's inequality for any $\beta > 0$, it holds that
$$\langle\nabla F_t(x_t^k) - d_t^k, v_t^k - x_t^\star\rangle \le \frac{1}{2\beta}\|\nabla F_t(x_t^k) - d_t^k\|^2 + \frac{\beta}{2}\|v_t^k - x_t^\star\|^2. \tag{67}$$
From Assumption 1, i.e., the boundedness of the set $\mathcal{X}$, we get
$$\langle\nabla F_t(x_t^k) - d_t^k, v_t^k - x_t^\star\rangle \le \frac{1}{2\beta}\|\nabla F_t(x_t^k) - d_t^k\|^2 + \frac{\beta}{2}D^2. \tag{68}$$
We know that $\langle\nabla F_t(x_t^k), x_t^\star - x_t^k\rangle$ is upper bounded by $F_t(x_t^\star) - F_t(x_t^k)$ via the first-order characterization of the convexity of $F_t$. Using this upper bound together with (68) in the decomposition above, we get
$$\langle\nabla F_t(x_t^k), v_t^k - x_t^k\rangle \le \frac{1}{2\beta}\|\nabla F_t(x_t^k) - d_t^k\|^2 + \frac{\beta}{2}D^2 + F_t(x_t^\star) - F_t(x_t^k) + \langle d_t^k, v_t^k - x_t^\star\rangle. \tag{69}$$
Using the upper bound from (69) in (66), it holds that
$$F_t(x_t^{k+1}) \le F_t(x_t^k) + \frac{\gamma}{2\beta}\|\nabla F_t(x_t^k) - d_t^k\|^2 + \gamma\frac{\beta}{2}D^2 + \gamma\langle d_t^k, v_t^k - x_t^\star\rangle + \gamma F_t(x_t^\star) - \gamma F_t(x_t^k) + \gamma^2\frac{L_1}{2}D^2. \tag{70}$$
Subtracting $F_t(x_t^\star)$ from both sides, we get
$$F_t(x_t^{k+1}) - F_t(x_t^\star) \le (1-\gamma)\big(F_t(x_t^k) - F_t(x_t^\star)\big) + \frac{\gamma}{2\beta}\|\nabla F_t(x_t^k) - d_t^k\|^2 + \gamma\frac{\beta}{2}D^2 + \gamma\langle d_t^k, v_t^k - x_t^\star\rangle + \gamma^2\frac{L_1}{2}D^2, \tag{71}$$
which is as stated in Lemma 5. ∎
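Lemma 5 is a one-step contraction of the form $e_{k+1} \le (1-\gamma)e_k + c$, with $e_k := F_t(x_t^k) - F_t(x_t^\star)$ and $c$ collecting the remaining terms of (71). Unrolling it $K$ times is what produces the $(1-\gamma)^K$ factor in (73) of the next appendix; the snippet below is a numerical aside (with arbitrary illustrative constants) checking the unrolled bound $e_{K+1} \le (1-\gamma)^K e_1 + c/\gamma$.

gamma, c, K = 0.1, 0.05, 50    # illustrative constants
e = e1 = 3.0                   # arbitrary initial optimality gap
for _ in range(K):
    e = (1.0 - gamma) * e + c  # worst case: recursion met with equality
assert e <= (1.0 - gamma) ** K * e1 + c / gamma + 1e-12
print("unrolled contraction bound holds:", e)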
APPENDIX H
PROOF OF THEOREM 3

Proof: Setting $k = K$ in (71), we get
$$F_t(x_t^{K+1}) - F_t(x_t^\star) \le (1-\gamma)\big(F_t(x_t^K) - F_t(x_t^\star)\big) + \frac{\gamma}{2\beta}\|\nabla F_t(x_t^K) - d_t^K\|^2 + \gamma\frac{\beta}{2}D^2 + \gamma\langle d_t^K, v_t^K - x_t^\star\rangle + \gamma^2\frac{L_1}{2}D^2. \tag{72}$$
On recursively expanding the term $F_t(x_t^K) - F_t(x_t^\star)$ via (71), we get
$$F_t(x_t^{K+1}) - F_t(x_t^\star) \le (1-\gamma)^K\big(F_t(x_t^1) - F_t(x_t^\star)\big) + \frac{\gamma}{2\beta}\sum_{k=1}^{K}\|\nabla F_t(x_t^k) - d_t^k\|^2 + K\gamma\frac{\beta}{2}D^2 + \gamma\sum_{k=1}^{K}\langle d_t^k, v_t^k - x_t^\star\rangle + K\gamma^2\frac{L_1}{2}D^2. \tag{73}$$
Since $x_t^1 = x_{t-1}$ and $x_t^{K+1} = x_t$, we can write
$$F_t(x_t) - F_t(x_t^\star) \le (1-\gamma)^K\big(F_t(x_{t-1}) - F_t(x_t^\star)\big) + \frac{\gamma}{2\beta}\sum_{k=1}^{K}\|\nabla F_t(x_t^k) - d_t^k\|^2 + K\gamma\frac{\beta}{2}D^2 + \gamma\sum_{k=1}^{K}\langle d_t^k, v_t^k - x_t^\star\rangle + K\gamma^2\frac{L_1}{2}D^2. \tag{74}$$
Next, adding and subtracting $F_{t-1}(x_{t-1}) + F_{t-1}(x_{t-1}^\star)$ in the term $F_t(x_{t-1}) - F_t(x_t^\star)$ and rearranging, we get
$$F_t(x_{t-1}) - F_t(x_t^\star) = F_t(x_{t-1}) - F_{t-1}(x_{t-1}) + F_{t-1}(x_{t-1}) - F_{t-1}(x_{t-1}^\star) + F_{t-1}(x_{t-1}^\star) - F_t(x_t^\star). \tag{75}$$
In order to bring the right hand side of (75) into the form of $V_T$, we simplify (75) as
$$F_t(x_{t-1}) - F_t(x_t^\star) \le |F_t(x_{t-1}) - F_{t-1}(x_{t-1})| + F_{t-1}(x_{t-1}) - F_{t-1}(x_{t-1}^\star) + F_{t-1}(x_{t-1}^\star) - F_t(x_t^\star) \le F_{t,t-1}^{\sup} + F_{t-1}(x_{t-1}) - F_{t-1}(x_{t-1}^\star) + F_{t-1}(x_{t-1}^\star) - F_t(x_t^\star). \tag{76}$$
Using (76) in (74), we get
$$\begin{aligned} F_t(x_t) - F_t(x_t^\star) &\le (1-\gamma)^K\big(F_{t,t-1}^{\sup} + F_{t-1}(x_{t-1}) - F_{t-1}(x_{t-1}^\star)\big) + (1-\gamma)^K\big(F_{t-1}(x_{t-1}^\star) - F_t(x_t^\star)\big) \\ &\quad + \frac{\gamma}{2\beta}\sum_{k=1}^{K}\|\nabla F_t(x_t^k) - d_t^k\|^2 + K\gamma\frac{\beta}{2}D^2 + \gamma\sum_{k=1}^{K}\langle d_t^k, v_t^k - x_t^\star\rangle + K\gamma^2\frac{L_1}{2}D^2. \end{aligned} \tag{77}$$
Summing over $t$ in (77) and computing the total expectation, we get
$$\begin{aligned} \sum_{t=1}^{T}\mathbb{E}\big[F_t(x_t) - F_t(x_t^\star)\big] &\le (1-\gamma)^K\sum_{t=1}^{T}\Big(F_{t,t-1}^{\sup} + \mathbb{E}\big[F_{t-1}(x_{t-1}) - F_{t-1}(x_{t-1}^\star)\big]\Big) + (1-\gamma)^K\sum_{t=1}^{T}\mathbb{E}\big[F_{t-1}(x_{t-1}^\star) - F_t(x_t^\star)\big] \\ &\quad + \frac{\gamma}{2\beta}\sum_{t=1}^{T}\sum_{k=1}^{K}\mathbb{E}\big[\|\nabla F_t(x_t^k) - d_t^k\|^2\big] + KT\gamma\frac{\beta}{2}D^2 + \gamma\sum_{t=1}^{T}\sum_{k=1}^{K}\langle d_t^k, v_t^k - x_t^\star\rangle + KT\gamma^2\frac{L_1}{2}D^2. \end{aligned} \tag{78}$$
Using $\sum_{t=1}^{T}\mathbb{E}[F_{t-1}(x_{t-1}^\star) - F_t(x_t^\star)] = F_0(x_0^\star) - F_T(x_T^\star)$ and rearranging the terms, we obtain
$$\begin{aligned} \big(1-(1-\gamma)^K\big)\sum_{t=1}^{T-1}\mathbb{E}\big[F_t(x_t) - F_t(x_t^\star)\big] + \mathbb{E}\big[F_T(x_T) - F_T(x_T^\star)\big] &\le (1-\gamma)^K\Big(\sum_{t=1}^{T}F_{t,t-1}^{\sup} + F_0(x_0^\star) - F_T(x_T^\star)\Big) \\ &\quad + \frac{\gamma}{2\beta}\sum_{t=1}^{T}\sum_{k=1}^{K}\mathbb{E}\big[\|\nabla F_t(x_t^k) - d_t^k\|^2\big] + KT\gamma\frac{\beta}{2}D^2 \\ &\quad + \gamma\sum_{t=1}^{T}\sum_{k=1}^{K}\langle d_t^k, v_t^k - x_t^\star\rangle + KT\gamma^2\frac{L_1}{2}D^2. \end{aligned} \tag{79}$$
Since $1-(1-\gamma)^K < 1$, we can simplify the left hand side and, further utilizing Corollary 2, we get
$$\begin{aligned} \big(1-(1-\gamma)^K\big)\sum_{t=1}^{T}\mathbb{E}\big[F_t(x_t) - F_t(x_t^\star)\big] &\le (1-\gamma)^K\Big(\sum_{t=1}^{T}F_{t,t-1}^{\sup} + F_0(x_0^\star) - F_T(x_T^\star)\Big) \\ &\quad + \frac{\gamma}{2\beta}\sum_{t=1}^{T}\Big(\rho\sigma^2 K + \frac{L_1^2(1-\rho)}{\rho^2}\gamma^2 KD^2 + \frac{1-\rho}{\rho}\|\nabla F_t(x_t^0)\|^2\Big) \\ &\quad + KT\gamma\frac{\beta}{2}D^2 + \gamma\sum_{t=1}^{T}\sum_{k=1}^{K}\langle d_t^k, v_t^k - x_t^\star\rangle + KT\gamma^2\frac{L_1}{2}D^2. \end{aligned} \tag{80}$$
From the definition of $V_T$, further simplification results in
$$\begin{aligned} \big(1-(1-\gamma)^K\big)\sum_{t=1}^{T}\mathbb{E}\big[F_t(x_t) - F_t(x_t^\star)\big] &\le (1-\gamma)^K\big(V_T + F_0(x_0^\star) - F_T(x_T^\star)\big) + \frac{\gamma\rho}{2\beta}TK\sigma^2 + \frac{L_1^2(1-\rho)\gamma^3}{2\beta\rho^2}TKD^2 \\ &\quad + \frac{(1-\rho)\gamma}{2\beta\rho}T\|\nabla F_t(x_t^0)\|^2 + KT\gamma\frac{\beta}{2}D^2 + \gamma\sum_{t=1}^{T}\sum_{k=1}^{K}\langle d_t^k, v_t^k - x_t^\star\rangle + KT\gamma^2\frac{L_1}{2}D^2. \end{aligned} \tag{81}$$
For a fixed $k$, the sequence $\{v_t^k\}_{t=1}^{T}$ is produced by an online linear minimization oracle with regret $R_T^E$, which implies that
$$\sum_{t=1}^{T}\langle d_t^k, v_t^k - x_t^\star\rangle \le \sum_{t=1}^{T}\langle d_t^k, v_t^k\rangle - \min_{x\in\mathcal{X}}\sum_{t=1}^{T}\langle d_t^k, x\rangle \le R_T^E. \tag{82}$$
Using (82) in (81), we get
$$\begin{aligned} \big(1-(1-\gamma)^K\big)\sum_{t=1}^{T}\mathbb{E}\big[F_t(x_t) - F_t(x_t^\star)\big] &\le (1-\gamma)^K\big(V_T + F_0(x_0^\star) - F_T(x_T^\star)\big) + \frac{\gamma\rho}{2\beta}TK\sigma^2 + \frac{L_1^2(1-\rho)\gamma^3}{2\beta\rho^2}TKD^2 \\ &\quad + \frac{(1-\rho)\gamma}{2\beta\rho}T\|\nabla F_t(x_t^0)\|^2 + KT\gamma\frac{\beta}{2}D^2 + \gamma K R_T^E + KT\gamma^2\frac{L_1}{2}D^2. \end{aligned} \tag{83}$$
Let $K_1 := F_0(x_0^\star) - F_T(x_T^\star)$, $B := (1-\rho)\|\nabla F_t(x_t^0)\|^2$, and $C := L_1^2(1-\rho)D^2$. Taking $\big(1-(1-\gamma)^K\big)$ to the right hand side, we obtain
$$\sum_{t=1}^{T}\mathbb{E}\big[F_t(x_t) - F_t(x_t^\star)\big] \le \frac{1}{1-(1-\gamma)^K}\Big[(1-\gamma)^K(V_T + K_1) + \frac{\gamma\rho}{2\beta}TK\sigma^2 + \frac{\gamma^3}{2\beta\rho^2}TKC + \frac{\gamma}{2\beta\rho}TB + KT\gamma\frac{\beta}{2}D^2 + \gamma K R_T^E + KT\gamma^2\frac{L_1}{2}D^2\Big], \tag{84}$$
which is as stated in Theorem 3. Regarding the term $\frac{1}{1-(1-\gamma)^K}$: from the expansion $(1-x)^{-1} = 1 + x + x^2 + x^3 + \cdots$, we can write $\frac{1}{1-(1-\gamma)^K} = 1 + (1-\gamma)^K + (1-\gamma)^{2K} + \cdots$. The terms $(1-\gamma)^K$ are diminishing and decrease as $K$ increases; thus, as $K$ grows, $1-(1-\gamma)^K$ approaches 1. Taking $\gamma = \frac{1}{K}$, $\beta = \frac{1}{T^b}$, $\rho = \frac{1}{T^c}$, and using $\big(1-\frac{1}{K}\big)^K \le \frac{1}{e}$, we get
$$\sum_{t=1}^{T}\mathbb{E}\big[F_t(x_t) - F_t(x_t^\star)\big] \le \frac{1}{1-\frac{1}{e}}\Big[\frac{1}{e}(V_T + K_1) + \frac{T^{(1-c+b)}}{2}\sigma^2 + \frac{T^{(1+b+2c)}}{2K^2}C + \frac{T^{(1+b+c)}}{2K}B + \frac{T^{(1-b)}}{2}D^2 + R_T^E + \frac{T}{2K}L_1D^2\Big]. \tag{85}$$
For $b = 0.5$, $c = 1$, and $K = T^a$, we have
$$\sum_{t=1}^{T}\mathbb{E}\big[F_t(x_t) - F_t(x_t^\star)\big] \le \frac{1}{1-\frac{1}{e}}\Big[\frac{1}{e}(V_T + K_1) + \frac{\sqrt{T}}{2}\sigma^2 + \frac{C}{2}T^{(3.5-2a)} + \frac{T^{(2.5-a)}}{2}B + \frac{\sqrt{T}}{2}D^2 + R_T^E + \frac{L_1D^2}{2}T^{(1-a)}\Big].$$
Next, by selecting the linear oracle with regret $R_T^E = O(\sqrt{T})$, similar to [24, Appendix B], we obtain the final result presented in Theorem 3. ∎
K ) ≤ e we get
[1] D. S. Kalhan, A. S. Bedi, A. Koppel, K. Rajawat, A. Gupta, and A.
T
 Banerjee, “Projection free dynamic online learning,” in Proc. IEEE Int.
E [Ft (xt ) − Ft (xt )] Conf. Acoust., Speech, Signal Process., 2020, pp. 3957–3961.
[2] G. Hinton et al., “Deep neural networks for acoustic modeling in speech
t=1 recognition,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97,
 Nov. 2012.
1 1 T (1−c+b) 2 T (1+b+2c)
≤  (V T + K 1 ) + σ + C [3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
1 − 1e e 2 2K 2 recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016,
pp. 770–778.

T (1+b+c) T (1−b) 2 T [4] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver,
+ B+ D + γRET + L1 D 2 . (85) and D. Wierstra, “Continuous control with deep reinforcement learning,”
2K 2 2K 2016. [Online]. Available: http://arxiv.org/abs/1509.02971
[5] A. Koppel, G. Warnell, E. Stump, and A. Ribeiro, “Parsimonious online
For b = 0.5, c = 1, and K = T a , we have learning with kernels via sparse projections in function space,” J. Mach.
T
Learn. Res., vol. 20, no. 1, pp. 83–126, 2019.
 [6] A. Agarwal, E. Hazan, S. Kale, and R. E. Schapire, “Algorithms for
E [Ft (xt ) − Ft (xt )] portfolio management based on the newton method,” in Proc. 23rd Int.
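As a worked instance of the tradeoff in (88)–(89) (an illustrative remark, not an additional result): taking the largest admissible exponent $a = 0.5$, i.e., $K = \sqrt{T}$ linear-oracle calls per time slot,
$$a = 0.5,\quad K = T^{0.5} = \sqrt{T} \;\Rightarrow\; \mathrm{Reg}_T^D \le O\big(1 + V_T + T^{\max\{0.5,\,0.5\}}\big) = O\big(1 + V_T + \sqrt{T}\big),$$
so the exact-gradient Meta Frank-Wolfe matches the $O(\sqrt{T})$ regret of the linear oracle itself, up to the variation budget $V_T$.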
REFERENCES

[1] D. S. Kalhan, A. S. Bedi, A. Koppel, K. Rajawat, A. Gupta, and A. Banerjee, “Projection free dynamic online learning,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2020, pp. 3957–3961.
[2] G. Hinton et al., “Deep neural networks for acoustic modeling in speech recognition,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012.
[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[4] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” 2016. [Online]. Available: http://arxiv.org/abs/1509.02971
[5] A. Koppel, G. Warnell, E. Stump, and A. Ribeiro, “Parsimonious online learning with kernels via sparse projections in function space,” J. Mach. Learn. Res., vol. 20, no. 1, pp. 83–126, 2019.
[6] A. Agarwal, E. Hazan, S. Kale, and R. E. Schapire, “Algorithms for portfolio management based on the Newton method,” in Proc. 23rd Int. Conf. Mach. Learn., 2006, pp. 9–16.
[7] N. Wagener, C.-A. Cheng, J. Sacks, and B. Boots, “An online learning approach to model predictive control,” in Proc. Robot.: Sci. Syst., Jun. 2019, doi: 10.15607/RSS.2019.XV.033.
[8] A. Shapiro, D. Dentcheva, and D. Ruszczyński, Lectures on Stochastic Programming: Modeling and Theory. Philadelphia, PA, USA: SIAM, 2009.
[9] M. Zinkevich, “Online convex programming and generalized infinitesimal gradient ascent,” in Proc. 20th Int. Conf. Mach. Learn., Washington, DC, USA, Aug. 2003, pp. 928–936.
[10] K. J. Åström and R. M. Murray, Feedback Systems: An Introduction for Scientists and Engineers. Princeton, NJ, USA: Princeton Univ. Press, 2010.
[11] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” 2015, arXiv:1511.05952.
[12] O. Besbes, Y. Gur, and A. Zeevi, “Non-stationary stochastic optimization,” Operations Res., vol. 63, no. 5, pp. 1227–1244, 2015. [Online]. Available: https://doi.org/10.1287/opre.2015.1408
[13] A. Mokhtari, S. Shahrampour, A. Jadbabaie, and A. Ribeiro, “Online optimization in dynamic environments: Improved regret rates for strongly convex problems,” in Proc. IEEE 55th Conf. Decis. Control, 2016, pp. 7195–7201.
[14] T. Yang, L. Zhang, R. Jin, and J. Yi, “Tracking slowly moving clairvoyant: Optimal dynamic regret of online learning with true and noisy gradient,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 449–457.
[15] L. Zhang, T. Yang, J. Yi, R. Jin, and Z.-H. Zhou, “Improved dynamic regret for non-degenerate functions,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 732–741.
[16] L. Zhang, S. Lu, and Z.-H. Zhou, “Adaptive online learning in dynamic environments,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 1323–1333.
[17] A. Jadbabaie, A. Rakhlin, S. Shahrampour, and K. Sridharan, “Online optimization: Competing with dynamic comparators,” in Proc. Artif. Intell. Statist., 2015, pp. 398–406.
[18] T.-J. Chang and S. Shahrampour, “Unconstrained online optimization: Dynamic regret analysis of strongly convex and smooth problems,” 2020, arXiv:2006.03912.
[19] M. Mohri and A. Rostamizadeh, “Stability bounds for stationary φ-mixing and β-mixing processes,” J. Mach. Learn. Res., vol. 11, pp. 789–814, Feb. 2010.
[20] A. Mokhtari, H. Hassani, and A. Karbasi, “Stochastic conditional gradient methods: From convex minimization to submodular maximization,” J. Mach. Learn. Res., vol. 21, no. 105, pp. 1–49, 2020.
[21] F. Locatello, A. Yurtsever, O. Fercoq, and V. Cevher, “Stochastic Frank-Wolfe for composite convex minimization,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 14246–14256.
[22] E. Hazan and S. Kale, “Projection-free online learning,” in Proc. 29th Int. Conf. Mach. Learn., 2012, pp. 1843–1850. [Online]. Available: https://dl.acm.org/citation.cfm?id=3042573.3042808
[23] D. Garber and E. Hazan, “A linearly convergent variant of the conditional gradient algorithm under strong convexity, with applications to online and stochastic optimization,” SIAM J. Optim., vol. 26, no. 3, pp. 1493–1528, 2016.
[24] L. Chen, C. Harshaw, H. Hassani, and A. Karbasi, “Projection-free online optimization with stochastic gradient: From convexity to submodularity,” in Proc. 35th Int. Conf. Mach. Learn., Stockholm, Sweden, Jul. 2018, pp. 814–823.
[25] J. Xie, Z. Shen, C. Zhang, B. Wang, and H. Qian, “Efficient projection-free online methods with stochastic recursive gradient,” in Proc. 34th AAAI Conf. Artif. Intell., 2020, pp. 6446–6453.
[26] E. Hazan et al., “Introduction to online convex optimization,” Found. Trends Optim., vol. 2, no. 3-4, pp. 157–325, 2016.
[27] H. Wang and A. Banerjee, “Online alternating direction method,” in Proc. 29th Int. Conf. Mach. Learn., 2012, pp. 1699–1706.
[28] M. Mahdavi, R. Jin, and T. Yang, “Trading regret for efficiency: Online convex optimization with long term constraints,” J. Mach. Learn. Res., vol. 13, pp. 2503–2528, Sep. 2012.
[29] R. T. Rockafellar, “Monotone operators and the proximal point algorithm,” SIAM J. Control Optim., vol. 14, no. 5, pp. 877–898, 1976.
[30] M. Frank and P. Wolfe, “An algorithm for quadratic programming,” Nav. Res. Logistics Quart., vol. 3, no. 1-2, pp. 95–110, 1956.
[31] V. Krishnamurthy, “Quickest detection POMDPs with social learning: Interaction of local and global decision makers,” IEEE Trans. Inf. Theory, vol. 58, no. 8, pp. 5563–5587, Aug. 2012.
[32] A. Mokhtari, H. Hassani, and A. Karbasi, “Conditional gradient method for stochastic submodular maximization: Closing the gap,” in Proc. Int. Conf. Artif. Intell. Statist., 2018, pp. 1886–1895.
[33] R. M. Freund, P. Grigas, and R. Mazumder, “An extended Frank-Wolfe method with ‘in-face’ directions, and its application to low-rank matrix completion,” SIAM J. Optim., vol. 27, no. 1, pp. 319–346, 2017.
[34] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic approximation approach to stochastic programming,” SIAM J. Optim., vol. 19, no. 4, pp. 1574–1609, 2009.
[35] J. Lafond, H.-T. Wai, and E. Moulines, “On the online Frank-Wolfe algorithms for convex and non-convex optimizations,” 2015, arXiv:1510.01171.
[36] S. J. Reddi, S. Sra, B. Póczos, and A. Smola, “Stochastic Frank-Wolfe methods for nonconvex optimization,” in Proc. 54th Annu. Allerton Conf. Commun., Control, Comput., 2016, pp. 1244–1251.
[37] M. Jaggi, “Revisiting Frank-Wolfe: Projection-free sparse convex optimization,” in Proc. 30th Int. Conf. Mach. Learn., 2013, pp. 427–435.
[38] Z. Harchaoui, A. Juditsky, and A. Nemirovski, “Conditional gradient algorithms for norm-regularized smooth convex optimization,” Math. Program., vol. 152, no. 1-2, pp. 75–112, 2015.
[39] C. Mu, Y. Zhang, J. Wright, and D. Goldfarb, “Scalable robust matrix recovery: Frank-Wolfe meets proximal methods,” SIAM J. Sci. Comput., vol. 38, no. 5, pp. A3291–A3317, 2016.
[40] E. Hazan and H. Luo, “Variance-reduced and projection-free stochastic optimization,” in Proc. 33rd Int. Conf. Mach. Learn., 2016, pp. 1263–1271.
[41] G. Lan and Y. Zhou, “Conditional gradient sliding for convex optimization,” SIAM J. Optim., vol. 26, no. 2, pp. 1379–1409, 2016.
[42] W. Zhang, P. Zhao, W. Zhu, S. C. Hoi, and T. Zhang, “Projection-free distributed online learning in networks,” in Proc. 34th Int. Conf. Mach. Learn., 2017, pp. 4054–4062.
[43] A. Silveti-Falls, C. Molinari, and J. Fadili, “Inexact and stochastic generalized conditional gradient with augmented Lagrangian and proximal step,” 2020, arXiv:2005.05158.
[44] G. Lan, “The complexity of large-scale convex programming under a linear optimization oracle,” 2013, arXiv:1309.5550.
[45] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning, vol. 1. Berlin, Germany: Springer, 2001.
[46] A. Mokhtari, A. Koppel, and A. Ribeiro, “A class of parallel doubly stochastic algorithms for large-scale learning,” J. Mach. Learn. Res., vol. 21, no. 120, pp. 1–51, 2020.
[47] G. Wang, H. Chen, Y. Li, and M. Jin, “On received-signal-strength based localization with unknown transmit power and path loss exponent,” IEEE Wireless Commun. Lett., vol. 1, no. 5, pp. 536–539, Oct. 2012.
[48] T. Koller, F. Berkenkamp, M. Turchetta, and A. Krause, “Learning-based model predictive control for safe exploration,” in Proc. IEEE Conf. Decis. Control, 2018, pp. 6059–6066.
[49] H. V. Poor and O. Hadjiliadis, Quickest Detection. Cambridge, U.K.: Cambridge Univ. Press, 2008.
[50] V. V. Veeravalli and T. Banerjee, “Quickest change detection,” in Academic Press Library in Signal Processing, vol. 3. Elsevier, 2014, pp. 209–255.
[51] J. Unnikrishnan, V. V. Veeravalli, and S. P. Meyn, “Minimax robust quickest change detection,” IEEE Trans. Inf. Theory, vol. 57, no. 3, pp. 1604–1614, Mar. 2011.
[52] Y. Ritov, “Decision theoretic optimality of the CUSUM procedure,” Ann. Statist., pp. 1464–1469, 1990.
[53] S. Shalev-Shwartz, “Online learning and online convex optimization,” Found. Trends Mach. Learn., vol. 4, no. 2, pp. 107–194, Feb. 2012.
[54] S. Choudhary, N. Kumar, S. Narayanan, and U. Mitra, “Active target localization using low-rank matrix completion and unimodal regression,” 2016, arXiv:1601.07254.
[55] R. Tripathi and K. Rajawat, “Dynamic network latency prediction with adaptive matrix completion,” in Proc. Int. Conf. Signal Process. Commun., 2018, pp. 407–411.
[56] H. Ji, C. Liu, Z. Shen, and Y. Xu, “Robust video denoising using low rank matrix completion,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2010, pp. 1791–1798.
[57] R. Tripathi, B. Mohan, and K. Rajawat, “Adaptive low-rank matrix completion,” IEEE Trans. Signal Process., vol. 65, no. 14, pp. 3603–3616, Jul. 2017.
[58] N. Goyette, P.-M. Jodoin, F. Porikli, J. Konrad, and P. Ishwar, “Changedetection.net: A new change detection benchmark dataset,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops, 2012, pp. 1–8.
[59] R. Dixit, A. S. Bedi, R. Tripathi, and K. Rajawat, “Online learning with inexact proximal online gradient descent algorithms,” IEEE Trans. Signal Process., vol. 67, no. 5, pp. 1338–1352, Mar. 2019.
[60] R. Nassif, S. Vlaski, C. Richard, J. Chen, and A. H. Sayed, “Multitask learning over graphs: An approach for distributed, streaming machine learning,” IEEE Signal Process. Mag., vol. 37, no. 3, pp. 14–25, May 2020.