Multi-Target Regression via Robust Low-Rank Learning
Xiantong Zhen, Mengyang Yu, Xiaofei He, Senior Member, IEEE, and Shuo Li

Abstract—Multi-target regression has recently regained great popularity due to its capability of simultaneously learning multiple
relevant regression tasks and its wide applications in data mining, computer vision and medical image analysis, while great
challenges arise from jointly handling inter-target correlations and input-output relationships. In this paper, we propose Multi-
layer Multi-target Regression (MMR) which enables simultaneously modeling intrinsic inter-target correlations and nonlinear
input-output relationships in a general framework via robust low-rank learning. Specifically, the MMR can explicitly encode inter-
target correlations in a structure matrix by matrix elastic nets (MEN); the MMR can work in conjunction with the kernel trick to
effectively disentangle highly complex nonlinear input-output relationships; the MMR can be efficiently solved by a new alternating
optimization algorithm with guaranteed convergence. The MMR leverages the strength of kernel methods for nonlinear feature
learning and the structural advantage of multi-layer learning architectures for inter-target correlation modeling. More importantly,
it offers a new multi-layer learning paradigm for multi-target regression which is endowed with high generality, flexibility and
expressive ability. Extensive experimental evaluation on 18 diverse real-world datasets demonstrates that our MMR can achieve
consistently high performance and outperforms representative state-of-the-art algorithms, which shows its great effectiveness
and generality for multivariate prediction.

Index Terms—Robust Low-Rank Learning, Multi-Layer Learning, Multi-Target Regression, Matrix Elastic Nets.

• X. Zhen is with the Department of Medical Biophysics, The University of Western Ontario, London, Ontario, N6A 4V2, Canada. E-mail: zhenxt@gmail.com.
• M. Yu is with the Computer Vision Laboratory, ETH Zurich, Switzerland. E-mail: mengyangyu@gmail.com.
• X. He is with the State Key Lab of CAD&CG, College of Computer Science, Zhejiang University, Hangzhou, Zhejiang, China, 310058. E-mail: xiaofeihe@cad.zju.edu.cn.
• S. Li is with the Department of Medical Biophysics, The University of Western Ontario, London, Ontario, N6A 4V2, Canada. E-mail: slishuo@gmail.com.

Digital Object Identifier: … TPAMI …
© IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

1 INTRODUCTION

Multi-target regression, as an instance of multitask learning [1], has recently drawn increasing research efforts in the machine learning community due to its great capability of predicting multiple relevant targets simultaneously with improved performance. Moreover, it has started to show its great effectiveness in solving challenging problems in a broad range of applications including data mining [2], computer vision [3] and medical image analysis [4].

The major challenges of multi-target regression arise from jointly modeling inter-target correlations and nonlinear input-output relationships [5]. By exploring the shared knowledge across relevant targets to capture the inter-target correlation, multi-target regression performance can be significantly improved [6], [7]. However, the structure of inter-target correlations is usually not known a priori and varies greatly with different applications. Meanwhile, multiple targets represent higher-level semantic concepts of high-dimensional inputs [8], [9], which induces highly nonlinear relationships between inputs and targets. Although great effort has been made to improve multi-target regression in the last decades [10], the field still lacks a general framework that can tackle those two major challenges simultaneously.

To explore inter-target correlations, most existing multi-target regression models focus on designing a regularizer on the regression matrix, and rely either on linear regression models [11], [12], [13] or on specific assumptions of correlation structures with strong prior knowledge [14], [15], [16]. By building on linear regression, sparsity or low rank was simply imposed on the regression matrix to explore the correlations. However, these linear models lack the ability to handle nonlinear input-output relationships; moreover, it is nontrivial to extend them to nonlinear regression due to the non-convexity of loss functions or the non-smoothness of sparsity constraints [17], [18]. Under specific assumptions, e.g., task parameters share a common prior [19], [20], or combine a finite number of basis tasks [21], particular structures of inter-target correlations were explored. However, those assumptions would not necessarily hold or be shared across different applications due to the great diversity of inter-target correlations in different domains, resulting in models of insufficient generality to automatically extract the correlations from data for diverse applications [5].

To handle nonlinear input-output relationships, kernel methods [22], [23] were extended from single task learning to multi-task learning, which however do not provide effective ways to model the inter-target correlation. In [23], for instance, the regression matrix of multiple tasks is simply reshaped into a vector to explore inter-target correlations, which does not distinguish between inter and intra tasks, and tends to be less effective in encoding the correlations. Although kernel extension was developed in multi-task relationship learning (MTRL) [5], a matrix-variate normal distribution is required as a prior to model task structure in a covariance matrix [13].

In this paper, we propose Multi-layer Multi-target Regression (MMR) that enables simultaneously modeling intrinsic inter-target correlations and complex input-output relationships in a general framework. As illustrated in Fig. 1, the MMR accomplishes a multi-layer learning architecture composed of the input, hidden and output (target) layers.

Fig. 1: The architecture of the proposed multi-layer multi-target regression (MMR).

• The high-dimensional inputs $X$ are implicitly mapped into a high, even infinite dimensional reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ induced by some nonlinear kernel $K$; the mapping serves as a nonlinear feature extraction function, which allows disentangling highly nonlinear input-output relationships.
• The variables $Z$ in the latent space, which are obtained by a linear transformation $A$ via a representer theorem, represent higher-level concepts to build a common feature representation for multiple targets [8]; the latent space decouples inputs and targets, which allows their different noise levels to be effectively handled by $A$ and $S$, respectively [24].
• The structure matrix $S$ explicitly encodes the inter-target correlation by imposing the matrix elastic net (MEN) regularization [25], which enables robust low-rank learning of the correlation; $S$ is learned automatically from data without relying on any specific assumptions, which greatly enhances its generality.

The proposed MMR leverages the strength of kernel methods for nonlinear feature learning and the structural advantage of multi-layer architectures to capture inter-target correlations [26], and more importantly, it provides a new multi-layer learning paradigm that is endowed with high generality, flexibility and expressive ability for multi-target regression. Moreover, the MMR is highly extensible and can work on top of deep learning architectures, e.g., convolutional neural networks (CNNs) [27], for further fine-tuning [28].

The contributions of this work are summarized in four major aspects as follows:

• We propose a new multi-layer multi-target regression model, which enables simultaneously modeling intrinsic inter-target correlations and complex input-output relationships in one single framework.
• We introduce the matrix elastic nets (MEN) to multi-target regression to explore inter-target correlations, which enables explicitly encoding the correlations by a robust low-rank learning framework without specific assumptions.
• We provide a kernel extension of the learning framework, which enables effectively disentangling highly nonlinear relationships between high-dimensional inputs and multiple targets.
• We derive a new alternating optimization algorithm, which enables efficiently solving the objective with quick convergence to achieve efficient multi-target regression.

2 RELATED WORK

Multi-target regression [10] has recently regained great popularity due to its fundamental role in machine learning and widespread applications in computer vision and medical image analysis. Previous work has focused on particular aspects, e.g., simply learning features for multiple tasks [7], [14], [29], [30] or solely exploring specific task structures [6], [31], [32], [33]. Some of the regression models are designed for classification tasks [12], [34]. Most existing methods are developed mainly on linear regression models to explore inter-target correlations, while lacking the ability to simultaneously handle nonlinear relationships between high-dimensional inputs and multiple targets.

Recently, Rai et al. [13] proposed multi-output regression with output and task structures (MROTS), which generalizes the multivariate regression model with covariance estimation (MRCE) [11] and the linear multi-task relationship learning (MTRL) [5] as its special cases with improved performance. However, like the MRCE [11], the MROTS does not provide any formulation for nonlinear regression. Liu et al. [35], [17] proposed a linear regression model called calibrated multivariate regression (CMR) to tackle different noise levels of different tasks. By assuming an uncorrelated structure of noises, the CMR employs an $\ell_{2,1}$-norm based loss function to calibrate each regression task, while the assumption of uncorrelated noises does not always hold in practice. Besides, it is nontrivial to extend the CMR to kernel regression due to the $\ell_{2,1}$-norm loss function.

In order to handle nonlinear input-output relationships, the recent output kernel learning (OKL) algorithm [22], [36], which learns a semi-definite similarity matrix, i.e., the output kernel, of multiple targets, would not fully capture inter-target correlations, e.g., negative correlations [5]. By assuming that all tasks can be clustered into disjoint groups, the clustered multi-task learning (CMTL) [15] was developed to explore inter-target correlations, which learns the underlying cluster structure from the training data. However, the number of clusters needs to be specified, which is rarely available in real-world tasks. Recently, an improved version of CMTL called flexible clustered multi-task learning (FCMTL) was presented in [33]. In the FCMTL, the cluster structure is learned by identifying representative tasks. However, the assumption of the existence of representative tasks remains purely heuristic and therefore would not be shared across diverse applications.

3 MULTI-LAYER MULTI-TARGET REGRESSION

The proposed MMR accomplishes a general framework of multi-layer learning to jointly model inter-target correlations by robust low-rank learning via matrix elastic nets (MEN) (Sec. 3.2) and disentangle nonlinear input-output relationships by kernel regression (Sec. 3.3); the MMR is efficiently solved by a newly derived alternating optimization algorithm with guaranteed convergence (Sec. 3.4).

3.1 Problem Formulation

We start with the fundamental multi-target linear regression model

\[ y = Wx + b, \tag{1} \]

where $y = [y_1, \cdots, y_i, \cdots, y_Q]^\top \in \mathbb{R}^Q$ are the multivariate targets, $x \in \mathbb{R}^d$ is the input, $W = [w_1, \cdots, w_i, \cdots, w_Q]^\top \in \mathbb{R}^{Q \times d}$ is the model parameter, i.e., the regression coefficient, each $w_i \in \mathbb{R}^d$ is the predictor for $y_i$, $b \in \mathbb{R}^Q$ is the bias, and $d$ and $Q$ are the dimensionality of the input and output spaces, respectively.

Given the training set $\{(x_i, y_i)\}_{i=1}^N$, one can solve for $W$ by solving the following penalized optimization objective:

\[ W^* = \arg\min_{W} \frac{1}{N} \|Y - WX - B\|_F^2 + \lambda \|W\|_F^2, \tag{2} \]

where $X = [x_1, x_2, \cdots, x_N]$, $Y = [y_1, y_2, \cdots, y_N]$, $B = [\cdots, b, \cdots] \in \mathbb{R}^{Q \times N}$, $\|A\|_F^2$ is the squared Frobenius norm of a matrix $A$, which can be computed by $\mathrm{tr}(A^\top A)$, and $\lambda$ is the regularization parameter that controls the amount of shrinkage, i.e., the larger the value of $\lambda$, the greater the amount of shrinkage [37].

The objective function in (2) is a straightforward extension of ridge regression, one of the classical statistical learning algorithms [37], to multivariate targets. When working in a reproducing kernel Hilbert space (RKHS), the resulting model is known as kernel ridge regression (KRR) [37], [38]. Likewise, (2) can be kernelized to achieve multi-target kernel ridge regression (mKRR). We derive the proposed MMR from this fundamental formulation to ensure its generality.

The multi-target regression model in (2) decouples into several single-output problems and does not take into account inter-target correlations, resulting in suboptimal multi-target regression with inferior performance. In what follows, we introduce our MMR, which is a multi-layer learning architecture to explicitly model the correlations by robust low-rank learning with matrix elastic nets (MEN) [25].

3.2 Robust Low-Rank Learning with MEN

Rather than directly imposing the sparsity constraint on $W$ as in existing methods, we propose incorporating a latent space, from which a structure matrix $S$ is learned to explicitly encode inter-target correlations via a rank minimization:

\[ \min_{W,S} \frac{1}{N} \|Y - SZ\|_F^2 + \lambda \|W\|_F^2 + \beta\,\mathrm{Rank}(S) + \gamma \|S\|_F^2, \tag{3} \]

where $Z = [z_1, \cdots, z_i, \cdots, z_N] \in \mathbb{R}^{Q \times N}$, $z_i = W x_i + b \in \mathbb{R}^Q$ contains the latent variables in the latent space, $S \in \mathbb{R}^{Q \times Q}$ is the structure matrix that serves to explicitly model inter-target correlations, $\beta$ is the regularization parameter that controls the rank of $S$, that is, a larger $\beta$ induces lower rank, and the Frobenius norm controls the shrinkage of $S$ with the associated parameter $\gamma$. The rank minimization of the structure matrix $S$ explores the low-rank structure existing between tasks to capture the intrinsic inter-target correlation. $S$ is learned automatically from data without relying on any specific assumptions, which allows it to adaptively cater to different applications.

However, minimizing the objective function in (3) is NP-hard due to the noncontinuous and non-convex nature of the rank function [39]. The nuclear norm $\|S\|_*$ is commonly used for low-rank learning, and it has been proven to be the convex envelope of the rank function over the domain $\|S\|_2 \le 1$ [40], which provides the tightest lower bound among all convex lower bounds of the rank function $\mathrm{Rank}(S)$.

As a consequence, the combination of the nuclear norm with the Frobenius norm on $S$ gives rise to the matrix elastic net (MEN) [25] as a regularizer of (3):

\[ \min_{W,S} \frac{1}{N} \|Y - SZ\|_F^2 + \lambda \|W\|_F^2 + \beta \|S\|_* + \gamma \|S\|_F^2. \tag{4} \]
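
As a rough illustration of (4), the following sketch evaluates the objective for given $W$ and $S$ with NumPy. It is not the authors' implementation; the function name `men_objective`, the toy dimensions and the parameter values are assumptions made for this example. Both the nuclear-norm and the Frobenius-norm terms are computed from the singular values of $S$, which makes the elastic-net structure of the MEN penalty explicit.

```python
import numpy as np


def men_objective(Y, S, W, X, b, lam, beta, gamma):
    """Value of objective (4) for given W and S (illustrative helper)."""
    N = X.shape[1]
    Z = W @ X + b[:, None]                      # latent variables, Q x N
    fit = np.linalg.norm(Y - S @ Z, "fro") ** 2 / N
    ridge = lam * np.linalg.norm(W, "fro") ** 2
    sigma = np.linalg.svd(S, compute_uv=False)  # singular values of S
    nuclear = sigma.sum()                       # ||S||_*   = sum of sigma_i
    frob_sq = (sigma ** 2).sum()                # ||S||_F^2 = sum of sigma_i^2
    return fit + ridge + beta * nuclear + gamma * frob_sq


# Toy problem (all sizes and values are assumptions for the example).
rng = np.random.default_rng(0)
d, Q, N = 5, 3, 20
X = rng.standard_normal((d, N))
W = rng.standard_normal((Q, d))
b = rng.standard_normal(Q)
S = rng.standard_normal((Q, Q))
Y = S @ (W @ X + b[:, None])                    # noise-free targets

value = men_objective(Y, S, W, X, b, lam=0.1, beta=0.5, gamma=0.2)
print(value)
```

Because the toy targets are generated without noise, the data-fit term vanishes and `value` reduces to the three regularization terms.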

In (4), the nuclear norm $\|S\|_*$, also known as the trace norm, is a particular instance of the Schatten $p$-norm [41] with $p = 1$, i.e., the Schatten 1-norm. The Schatten $p$-norm of a matrix $M$ is defined as

\[ \|M\|_{S_p} = \Big( \sum_i \sigma_i^p(M) \Big)^{1/p}, \tag{5} \]

where $0 < p \le 2$ and $\sigma_i(M)$ is the $i$-th largest singular value of $M$. When $p < 2$, the Schatten $p$-norm encourages sparsity of the singular values, which achieves rank minimization. With (5), the nuclear norm of $S$ can be written as $\|S\|_* = \mathrm{tr}(\sqrt{S^\top S})$. It is interesting to mention that when $p = 0$, the Schatten 0-norm is defined as

\[ \|M\|_{S_0} = \sum_i \sigma_i^0, \tag{6} \]

which is exactly the rank of $M$.

The MEN is an analog to the elastic-net regularization [42] from compressive sensing and sparse representation [42]. It has been shown that the elastic net often outperforms the lasso [42]. In the MEN, the nuclear-norm constraint enforces the low-rank property of the solution $S$ to encode inter-target correlations, and the Frobenius-norm constraint induces a linear shrinkage on the matrix entries leading to stable solutions [25]. The MEN regularization generalizes the matrix lasso and provides improved performance over the lasso [43].

To the best of our knowledge, this is the first work that introduces the MEN to multi-target regression for robust low-rank learning, which offers a general framework to encode inter-target correlations. We highlight that the proposed multi-layer multi-target regression enjoys favorable merits:

• The latent space decouples inputs and targets from distinctive distributions, which allows them to be effectively handled by the regression coefficient $W$ and the structure matrix $S$, respectively [24], [44].
• The structure matrix, which is detached from inputs by the latent space, is able to effectively calibrate multiple targets according to their different noise levels to achieve optimal parameter estimation [35], [45].
• The matrix elastic nets (MEN) provide a general regularization network to achieve robust low-rank learning of the intrinsic inter-target correlations [25].

3.3 Kernel Extension

Due to the constraints directly imposed on the regression matrix $W$, most of the existing methods cannot be kernelized for nonlinear regression. On the contrary, thanks to the multi-layer learning architecture, the MMR is more flexible and extensible, and naturally admits the Representer Theorem [46], [18] as shown in Theorem 1, which enables kernel extension to achieve nonlinear regression.

Theorem 1. Given any fixed matrix $S$, consider the objective function in (4) w.r.t. $W$, which is defined over a Hilbert space $\mathcal{H}$. If (4) has a minimizer w.r.t. $W$, it admits a linear representer theorem of the form $W = AX^\top$, where $A \in \mathbb{R}^{Q \times N}$ is the coefficient matrix.

Remark. Theorem 1 provides important theoretical guarantees for kernel extension to achieve nonlinear multi-target regression. The proof of the theorem is straightforward and omitted here due to space limits.

Based on Theorem 1, we can derive the kernel extension in the reproducing kernel Hilbert space (RKHS) for kernel regression to handle nonlinear input-output relationships. To facilitate the derivation, we rewrite the objective function (4) in terms of traces as follows:

\[ \min_{W,S} \frac{1}{N} \mathrm{tr}\big( (Y - S(WX + B))^\top (Y - S(WX + B)) \big) + \lambda\,\mathrm{tr}(W^\top W) + \beta\,\mathrm{tr}(\sqrt{S^\top S}) + \gamma\,\mathrm{tr}(S^\top S). \tag{7} \]

According to the linear representer theorem in Theorem 1, we have

\[ W = A\Phi(X)^\top, \tag{8} \]

where $\Phi(X) = [\phi(x_1), \cdots, \phi(x_i), \cdots, \phi(x_N)]$ and $A = [\alpha_1, \cdots, \alpha_i, \cdots, \alpha_Q]^\top \in \mathbb{R}^{Q \times N}$, $\alpha_i \in \mathbb{R}^N$, and $\phi(\cdot)$ denotes the feature map of $x_i$, which maps $x_i$ to $\phi(x_i)$ in some RKHS of high, even infinite dimensionality. The mapping serves as a nonlinear feature extraction to handle complicated input-target relationships. The corresponding kernel function $k(\cdot, \cdot)$ satisfies $k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$.

Substituting (8) into (7), we obtain the following objective function:

\[ \min_{A,S} \frac{1}{N} \mathrm{tr}\big( (Y - SA\Phi(X)^\top \Phi(X))^\top (Y - SA\Phi(X)^\top \Phi(X)) \big) + \lambda\,\mathrm{tr}\big( (A\Phi(X)^\top)^\top (A\Phi(X)^\top) \big) + \beta\,\mathrm{tr}(\sqrt{S^\top S}) + \gamma\,\mathrm{tr}(S^\top S), \tag{9} \]

where the bias $B$ is omitted for simplicity since it has been proven that the bias can be absorbed into the regression coefficient $W$ by adding an additional dimension to the input features $x_i$ [47], [48].

We accomplish the kernel version of the multi-layer multi-target regression with the kernel matrix $K = \Phi(X)^\top \Phi(X)$ defined in the RKHS:

\[ \min_{A,S} \frac{1}{N} \mathrm{tr}\big( (Y - SAK)^\top (Y - SAK) \big) + \lambda\,\mathrm{tr}(AKA^\top) + \beta\,\mathrm{tr}(\sqrt{S^\top S}) + \gamma\,\mathrm{tr}(S^\top S). \tag{10} \]

In (10), the induced latent variables $Z = AK$ can extract high-level representations for multiple semantic targets, which allows disentangling the nonlinear relationship between low-level inputs and semantic-level targets [8]. The latent space with high-level features will also facilitate the efficient linear low-rank learning of $S$ to model inter-target correlations [49] to achieve more accurate multi-target prediction. The MMR in (10) leverages the strength of kernel methods for nonlinear feature extraction and the structural advantage of multi-layer architectures for inter-target correlation modeling. In contrast to existing multi-target regression models, the obtained MMR in (10) accomplishes a new multi-layer learning architecture, which is endowed with great generality, flexibility and expressive ability for diverse challenging tasks.

One of the important advantages of the proposed MMR over previous multi-target regression models is its great generality. Theoretically, the proposed MMR is highly generalized and encompasses some existing models. By setting $S = I$ and $K = X^\top X$ in (10), the MMR recovers the fundamental multi-target ridge regression on which many previous models were developed. The MMR can simultaneously encode inter-target correlations and disentangle linear/nonlinear input-output relationships by customizing kernels in one single framework. Moreover, it can also work with other convex loss functions, e.g., the ε-insensitive loss function [50], and accepts other regularization terms to satisfy desired properties [51]. Compared to the recent output kernel learning (OKL) [36], [52], the MMR employs a general low-rank regularization network without relying on specific assumptions, which allows it to fully capture more complex, e.g., positive and negative, inter-target correlations rather than only a certain aspect, e.g., similarity of multiple targets [36]. Moreover, we do not assume that all tasks are correlated and allow the existence of outlier tasks [53], which further increases the generality.

3.4 Alternating Optimization

The obtained objective function (10) is non-trivial to solve simultaneously for $A$ and $S$ due to the non-convexity of the objective function. We derive a new alternating optimization algorithm to efficiently solve the objective function. Denote by $J(A, S)$ the objective function in (10); we seek $A$ and $S$ alternately by solving $J(A, S)$ for one with the other fixed.

3.4.1 Fix S to optimize A

We calculate the gradient of the objective function with respect to $A$ as follows:

\[ \frac{\partial J}{\partial A} = -\frac{1}{N} S^\top (Y - SAK)K + \lambda AK. \tag{11} \]

Setting the derivative to 0 and cancelling one factor of $K$ on the right (assuming $K$ is invertible) gives rise to

\[ S^\top SAK + \lambda N A = S^\top Y. \tag{12} \]

Multiplying both sides by $K^{-1}$ on the right leads to

\[ S^\top SA + \lambda N A K^{-1} = S^\top Y K^{-1}, \tag{13} \]

which is a standard Sylvester equation [54] and can be solved analytically in closed form.

3.4.2 Fix A to optimize S

We propose a gradient-based iterative optimization to solve for $S$, before which we provide the following proposition to calculate the derivative of $J$ w.r.t. $S$.

Proposition 1. Assume that the singular value decomposition (SVD) of $S$ is

\[ S = U\Sigma V^\top, \tag{14} \]

where $U$ and $V$ are unitary matrices and $\Sigma$ is the diagonal matrix with real numbers on the diagonal. Then the derivative of $\|S\|_*$ w.r.t. $S$ takes the form as follows:

\[ \frac{\partial \|S\|_*}{\partial S} = U\Sigma^{-1}|\Sigma|V^\top, \tag{15} \]

where $\Sigma^{-1}$ is the Moore-Penrose pseudo-inverse of $\Sigma$.

Proof: By the definition of the nuclear norm, we have

\[ \|S\|_* = \mathrm{tr}(\sqrt{S^\top S}) = \mathrm{tr}\big(\sqrt{(U\Sigma V^\top)^\top (U\Sigma V^\top)}\big) = \mathrm{tr}\big(\sqrt{V\Sigma U^\top U\Sigma V^\top}\big) = \mathrm{tr}\big(\sqrt{V\Sigma^2 V^\top}\big). \tag{16} \]

By the circularity of the trace, we have

\[ \|S\|_* = \mathrm{tr}\big(\sqrt{V^\top V\Sigma^2}\big) = \mathrm{tr}\big(\sqrt{\Sigma^2}\big) = \mathrm{tr}(|\Sigma|), \tag{17} \]

where $|\Sigma|$ is the matrix of the element-wise absolute values of $\Sigma$. Therefore, the nuclear norm of $S$ can also be defined as the sum of the absolute values of the diagonal entries of $\Sigma$ in the SVD of $S$. Although the absolute value function is not differentiable at every point in its domain, we can find a subgradient:

\[ \frac{\partial \|S\|_*}{\partial S} = \frac{\partial\,\mathrm{tr}(|\Sigma|)}{\partial S} = \frac{\mathrm{tr}(\partial |\Sigma|)}{\partial S}. \tag{18} \]

Since $\Sigma$ is diagonal, the subdifferential set of $|\Sigma|$ is:

\[ \frac{\partial |\Sigma|}{\partial S} = |\Sigma|\Sigma^{-1} \frac{\partial \Sigma}{\partial S}. \tag{19} \]

By substituting (19) into (18), we obtain

\[ \frac{\partial \|S\|_*}{\partial S} = \frac{\mathrm{tr}(|\Sigma|\Sigma^{-1}\,\partial\Sigma)}{\partial S}. \tag{20} \]

From (14), we have

\[ \partial S = \partial U\,\Sigma V^\top + U\,\partial\Sigma\,V^\top + U\Sigma\,\partial V^\top, \tag{21} \]

which gives rise to

\[ U\,\partial\Sigma\,V^\top = \partial S - \partial U\,\Sigma V^\top - U\Sigma\,\partial V^\top. \tag{22} \]

Multiplying (22) by $U^\top$ on the left and by $V$ on the right, we have

\[ U^\top U\,\partial\Sigma\,V^\top V = U^\top\,\partial S\,V - U^\top\,\partial U\,\Sigma V^\top V - U^\top U\Sigma\,\partial V^\top V, \tag{23} \]

which leads to

\[ \partial\Sigma = U^\top\,\partial S\,V - U^\top\,\partial U\,\Sigma - \Sigma\,\partial V^\top V. \tag{24} \]
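
Identity (24) can be sanity-checked numerically before the proof continues. The remainder of the proof shows that the last two terms of (24) have zero trace, so to first order the sum of singular values should change by $\mathrm{tr}(U^\top \partial S\, V)$. The sketch below compares that prediction with the actual change in $\|S\|_*$; the random full-rank $S$, the small perturbation `dS` and the step size are hypothetical test inputs chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = 4
S = rng.standard_normal((Q, Q))           # generic (full-rank) structure matrix
dS = 1e-7 * rng.standard_normal((Q, Q))   # small perturbation of S

U, sigma, Vt = np.linalg.svd(S)           # S = U @ diag(sigma) @ Vt
V = Vt.T

# Actual change in the nuclear norm, i.e. in the sum of singular values.
actual = np.linalg.svd(S + dS, compute_uv=False).sum() - sigma.sum()

# First-order prediction from the trace of (24): tr(U^T dS V).
predicted = np.trace(U.T @ dS @ V)

gap = abs(actual - predicted)             # should be second order in dS
print(actual, predicted, gap)
```

The two quantities should agree up to terms of second order in the perturbation, which is what the trace argument in the rest of the proof formalizes.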

Note that

\[ 0 = \partial I = \partial(U^\top U) = \partial U^\top U + U^\top\,\partial U, \tag{25} \]

where $I$ is an identity matrix, and therefore $U^\top\,\partial U$ is an antisymmetric matrix. Since $\Sigma$ is a diagonal matrix, we have

\[ \mathrm{tr}(U^\top\,\partial U\,\Sigma) = \mathrm{tr}\big((U^\top\,\partial U\,\Sigma)^\top\big) = \mathrm{tr}(\Sigma^\top\,\partial U^\top U) = -\mathrm{tr}(\Sigma U^\top\,\partial U) = -\mathrm{tr}(U^\top\,\partial U\,\Sigma), \tag{26} \]

which indicates that $\mathrm{tr}(U^\top\,\partial U\,\Sigma) = 0$. Similarly, we also have $\mathrm{tr}(\Sigma\,\partial V^\top V) = 0$. Therefore, we achieve

\[ \mathrm{tr}(\partial\Sigma) = \mathrm{tr}(U^\top\,\partial S\,V). \tag{27} \]

Substituting (27) into (20), we obtain

\[ \frac{\partial \|S\|_*}{\partial S} = \frac{\mathrm{tr}(|\Sigma|\Sigma^{-1}\,\partial\Sigma)}{\partial S} = \frac{\mathrm{tr}(|\Sigma|\Sigma^{-1} U^\top\,\partial S\,V)}{\partial S} = \frac{\mathrm{tr}(V|\Sigma|\Sigma^{-1} U^\top\,\partial S)}{\partial S} = (V|\Sigma|\Sigma^{-1} U^\top)^\top, \tag{28} \]

which closes the proof.

Proposition 1, together with its proof, provides a theoretical foundation that can be directly used to solve a large and important family of optimization problems based on trace-norm minimization.

Based on Proposition 1, we have the derivative of $J$ w.r.t. $S$ as follows:

\[ \frac{\partial J}{\partial S} = -\frac{2}{N}(Y - SAK)(AK)^\top + \beta U\Sigma^{-1}|\Sigma|V^\top + 2\gamma S, \tag{29} \]

where $U$, $\Sigma$ and $V$ are obtained by the SVD in (14). Denote by $G(S)$ the gradient w.r.t. $S$ in (29). Then $S$ can be solved by an iterative optimization based on gradient descent:

\[ S^{t+1} = S^t - \eta G(S^t), \tag{30} \]

where $\eta$ is the step size, also called the learning rate, which can be adaptively chosen by line search algorithms [55]. In each iteration, $S^{t+1}$ is calculated with the current $S^t$ and its associated $U$, $\Sigma$ and $V$. Since the objective function $J(A, S)$ is convex with respect to $S$, the iteration is guaranteed to find a global minimum with respect to $S$.

Note that the size of $S$ depends only on the number $Q$ of targets, which is usually much smaller than the dimensionality $d$ of inputs. Therefore, the complexity of the singular value decomposition (SVD) of the structure matrix $S$ involved in the calculation of the derivative of the nuclear norm is $O(Q^3)$. This guarantees the efficiency of both the iterative algorithm to update $S$ and the alternating optimization algorithm (Algorithm 1).

Algorithm 1 Alternating Optimization
Input: Data matrices $X$ associated with corresponding targets $Y$, regularization parameters $\lambda$, $\beta$ and $\gamma$.
Output: The regression coefficient matrix $A$ and the structure matrix $S$.
1: Randomly initialize $S \in \mathbb{R}^{Q \times Q}$ and set $i = 1$;
2: repeat
3:   Calculate the matrix $A^{(i+1)}$ by solving the Sylvester equation in (13);
4:   Calculate $S^{(i+1)}$ using the iterative method based on gradient descent in (30);
5:   $i \leftarrow i + 1$;
6: until Convergence.

3.4.3 Proof of Convergence

The efficiency of the proposed MMR is ensured by the guaranteed convergence of the newly derived alternating optimization algorithm. We provide a theoretical analysis with a proof of the convergence of the alternating optimization.

Theorem 2. The objective function $J(A, S)$ in Sec. 3.4 is bounded from below and monotonically decreases with each optimization step for $A$ and $S$, and therefore it converges.

Proof: Since $J(A, S)$ is a summation of norms, we have $J(A, S) \ge 0$ for any $A$ and $S$. Then $J(A, S)$ is bounded from below. Denote by $A^{(t)}$ and $S^{(t)}$ the $A$ and $S$ in the $t$-th iteration, respectively. For the $t$-th step, $A^{(t)}$ is computed by $A^{(t)} \leftarrow \arg\min_A J(A, S^{(t-1)})$. We also have $J(A^{(t)}, S^{(t-1)}) \ge J(A^{(t)}, S^{(t)})$. In this way, we obtain the following inequality:

\[ \cdots \ge J(A^{(t-1)}, S^{(t-1)}) \ge J(A^{(t)}, S^{(t-1)}) \ge J(A^{(t)}, S^{(t)}) \ge \cdots. \]

Therefore, $J(A^{(t)}, S^{(t)})$ is monotonically decreasing as $t \to +\infty$, which indicates that the objective function $J(A, S)$ converges according to the monotone convergence theorem.

3.4.4 Complexity Analysis

The complexity of solving for $A$ arises from computing the inverse of the Gram matrix, which is $O(N^3)$, where $N$ is the number of training samples. The total complexity of the alternating optimization algorithm is $O(t_1 N^3) + O(t_2 Q^3 + t_1 Q N^2)$, where $t_1$ is the number of iterations of the whole alternating optimization and $t_2$ is the total number of iteration steps for updating $S$. The overall complexity is approximately $O(N^3)$ due to the fact that $Q \ll N$, $t_1 \ll N$ and $t_2 \ll N$. Therefore, the complexity of the proposed MMR is roughly the same as that of regular kernel methods, e.g., KRR.

To deal with datasets of large scale or arriving sequentially, online learning [56], [57], e.g., sequential minimal optimization (SMO), and kernel approximation methods [58] can be employed. Kernel approximation has recently attracted increasing research efforts to speed up kernel methods. When using translation invariant kernels, a large family of the most widely used kernels, e.g., the radial basis function (RBF), the feature map induced by the kernel function can be approximated by random Fourier features [58] under Bochner's Theorem [59]. Stochastic gradient and back-propagation can be used to train the model to scale up with data of large scale or arriving sequentially.

4 EXPERIMENTS AND RESULTS

We have conducted extensive experiments to evaluate the performance of the MMR on all the 18 real-world datasets and compared it with state-of-the-art algorithms. We have also investigated the convergence of the alternating optimization and show the results on 2 representative datasets with different numbers of targets.

4.1 Datasets and Settings

The 18 real-world datasets are widely used benchmarks for multi-target regression in [60], which cover a large range of multi-target prediction tasks. Inter-target correlations demonstrate diverse patterns across different datasets, which poses great challenges for multi-target regression models. The statistics of these datasets are summarized in Table 1. We follow the strategies in [60] to process the datasets with missing values in inputs, which are replaced with sample means in the datasets.

TABLE 1: The statistics of the 18 datasets. d is the dimension of inputs and Q is the number of targets.
Dataset   Samples   Input (d)   Target (Q)   #-Fold CV
EDM       154       16          2            10
SF1       323       10          3            10
SF2       1066      10          3            10
JURA      359       15          3            10
WQ        1066      16          14           10
ENB       768       8           2            10
SLUMP     103       7           3            10
ANDRO     49        30          6            10
OSCALES   639       413         12           10
SCPF      1137      23          3            10
ATP1d     337       411         6            10
ATP7d     296       411         6            10
OES97     334       263         16           10
OES10     403       298         16           10
RF1       9125      64          8            5
RF2       9125      576         8            5
SCM1d     9803      280         16           2
SCM20d    8966      61          16           2

We compare with existing representative multi-target regression models including multi-dimensional support vector regression (mSVR) [50], [61], output kernel learning (OKL) [36], adaptive k-cluster random forests (AKRF) [8], multi-task feature learning (MTFL) [7] and MROTS [13]. Note that MTRL [5] and FIRE [62] perform worse than MROTS and MORF, respectively [13], and are therefore not included for comparison. The methods in [60], including single task learning (STL), multi-objective random forests (MORF), the corrected multi-target stacking (MTSC), ensemble of regressor chains (ERC) and random linear target combinations (RLC), which have shown great performance in [60], are also included for comprehensive comparison. We follow the evaluation settings in [60] to benchmark with other algorithms. Specifically, as shown in Table 1, we use two-fold cross validation (CV) for SCM1d/SCM20d, five-fold CV for RF1/RF2 and ten-fold CV for the rest of the datasets.

4.2 Evaluation Metric

To directly benchmark with state-of-the-art algorithms, we measure the performance by the commonly used Relative Root Mean Squared Error (RRMSE) defined as:

\[ \mathrm{RRMSE} = \sqrt{ \frac{ \sum_{(x_i, y_i) \in D_{test}} (\hat{y}_i - y_i)^2 }{ \sum_{(x_i, y_i) \in D_{test}} (\hat{Y} - y_i)^2 } }, \tag{31} \]

where $(x_i, y_i)$ is the $i$-th sample $x_i$ with ground truth target $y_i$, $\hat{y}_i$ is the prediction of $y_i$ and $\hat{Y}$ is the average of the targets over the training set $D_{train}$. We take the average RRMSE (aRRMSE) across all the target variables within the test set $D_{test}$ as a single measurement. A lower aRRMSE indicates better performance. The parameters $\lambda$, $\beta$ and $\gamma$ are chosen by cross validation from a search grid of $10^{[-5:1:3]}$ on the training set by tuning one with the others fixed; they could also be selected by adaptive techniques explored in [63], [25]. We use the radial basis function (RBF) kernel for nonlinear regression.

4.3 Results

4.3.1 Performance Comparison

The proposed MMR algorithm has achieved consistently high multi-target prediction performance on all 18 datasets and substantially outperforms state-of-the-art algorithms. The multi-target prediction results of the proposed MMR and the comparison with recently proposed state-of-the-art algorithms are summarized in Table 2.

The proposed MMR substantially outperforms the best results from state-of-the-art algorithms on most of these 18 datasets, except the ANDRO dataset with only 49 samples. The great effectiveness of the MMR has been validated for a broad range of multi-target regression tasks. The proposed MMR improves over the STL and mKRR by significant margins on all the 18 datasets, which shows its effectiveness in modeling inter-target correlations. The STL and mKRR are regarded as the baseline methods that predict multiple targets independently without
8

TABLE 2: The comparison with state-of-the-art algorithms on 18 real-world datasets in terms of aRRMSE (%).
```
```Method MMR
Dataset ``` STL MTSC ERC RLC MORF mSVR AKRF MTFL MROTS OKL mKRR
EDM 74.2 74.0 74.1 73.5 73.4 73.7 74.0 85.1 81.2 74.1 83.3 71.6
SF1 113.5 106.8 108.9 116.3 128.2 102.1 111.4 111.2 115.5 105.9 110.4 95.8
SF2 114.9 105.5 108.8 122.8 142.5 104.3 113.5 112.7 120.1 100.4 116.6 98.4
JURA 58.9 59.1 59.0 59.6 59.7 61.1 61.8 60.8 62.5 59.9 63.3 58.2
WQ 90.8 90.9 90.6 90.2 89.9 89.9 91.8 96.2 91.3 89.1 92.0 88.9
ENB 11.7 12.1 11.4 12.0 12.1 22.0 23.4 31.6 25.7 13.8 26.3 11.1
SLUMP 68.8 69.5 68.9 69.0 69.4 71.1 72.9 68.1 77.8 69.9 78.9 58.7
ANDRO 60.2 57.9 56.7 57.0 51.0 62.7 62.3 80.3 63.5 55.3 63.9 52.7
OSCALES 74.8 72.6 71.3 74.1 75.3 77.8 77.5 168.2 80.0 71.8 79.9 70.9
SCPF 83.7 83.1 83.0 83.5 83.3 82.8 83.1 89.9 90.1 82.0 85.5 81.2
ATP1d 37.4 37.2 37.2 38.4 42.2 38.10 41.2 41.5 40.4 36.4 38.0 33.2
ATP7d 52.5 50.7 51.2 46.1 55.1 47.75 53.1 55.3 54.9 47.5 48.6 44.3
OES97 52.5 52.4 52.4 52.3 54.9 55.7 58.1 81.8 60.5 53.5 58.7 49.7
OES10 42.0 42.1 42.0 41.9 45.2 44.7 44.6 53.2 55.8 43.2 48.9 40.3
RF1 9.7 9.4 9.1 12.1 12.3 10.9 11.4 98.3 15.4 11.2 17.9 8.9
RF2 10.2 9.7 9.5 13.0 14.8 14.4 15.7 110.3 19.8 11.8 15.9 9.5
SCM1d 34.8 33.6 33.0 34.5 35.2 36.7 36.8 43.7 44.9 34.2 37.1 31.8
SCM20d 47.5 41.3 39.4 44.3 48.2 49.3 65.5 64.3 45.6 44.3 49.8 38.9

exploring the correlation among multiple targets. The MMR achieves a large improvement over methods including MTSC, ERC, RLC, mSVR, AKRF, MROTS and OKL, in all of which the inter-target correlation is explored in a certain way. These comparison results demonstrate the advantage of the proposed MEN in modeling inter-target correlations via robust low-rank learning by introducing a structure matrix.

4.3.2 Convergence Analysis
Fast convergence is very important for practical use of the MMR. The proposed alternating optimization in Algorithm 1 converges very quickly on all 18 datasets: the algorithm converges within a few (≤ 20) iterations on each of them. We show the convergence of the alternating optimization on two representative datasets, i.e., the EDM dataset with relatively few (2) targets and the SCM1d dataset with relatively many (16) targets. The convergence with respect to the iteration steps is plotted in Fig. 2. Both the objective function value and the average mean square error (aMSE) decrease monotonically with the alternation steps. Although we show the first 20 steps, the algorithm converges within only 10 iterations on the EDM dataset and within 15 iterations on the SCM1d dataset. The consistently quick convergence shows the efficiency of the alternating optimization and supports the practical implementation of the MMR.

[Figure 2: two panels, EDM and SCM1d; x-axis: Iteration (1 to 20); curves: J(A, S) and aMSE.]
Fig. 2: The convergence of the proposed alternating optimization algorithm on the two representative datasets, i.e., EDM and SCM1d with 2 and 16 targets, respectively. J(A, S) is the value of the objective function and aMSE is the average mean square error.

5 CONCLUSION
We have presented a multi-layer multi-target regression (MMR) framework that enables simultaneously modeling inter-target correlations and nonlinear input-output relationships. The MMR introduces a latent space to explicitly encode inter-target correlations in a structure matrix, which is learned from data by robust low-rank learning via matrix elastic nets (MEN) without relying on any specific assumptions; the MMR is flexible and can seamlessly work in conjunction with the kernel trick, which enables it to handle highly complex nonlinear relationships between high-dimensional inputs and multiple targets; the MMR can be solved efficiently by an alternating optimization algorithm with theoretically guaranteed convergence. The MMR combines the strengths of kernel methods for nonlinear feature learning and the structural advantage of multi-layer architectures to capture inter-target correlations. More importantly, it offers a new multi-layer learning paradigm for multi-target regression, which is endowed with high generality, flexibility and expressive ability. Extensive experiments have been conducted on 18 real-world datasets, which validate the effectiveness and generality of the MMR for diverse multivariate prediction.
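
As a minimal sketch of the evaluation metric in Eq. (31) of Section 4.2, the following pure-Python snippet computes the RRMSE of each target and averages the results into the aRRMSE. The function names `rrmse` and `arrmse`, the list-of-rows data layout and the toy numbers are illustrative assumptions and are not taken from the experiments reported here.

```python
import math


def rrmse(y_true, y_pred, train_mean):
    """RRMSE of one target: error of the model relative to the error
    of always predicting the training-set mean of that target."""
    num = sum((p - t) ** 2 for p, t in zip(y_pred, y_true))
    den = sum((train_mean - t) ** 2 for t in y_true)
    # Assumes den > 0, i.e. the test targets are not all equal to train_mean.
    return math.sqrt(num / den)


def arrmse(Y_true, Y_pred, train_means):
    """Average RRMSE over all targets.

    Y_true, Y_pred: lists of rows, one column per target (test set).
    train_means: per-target means computed on the training set.
    """
    n_targets = len(train_means)
    scores = []
    for j in range(n_targets):
        col_true = [row[j] for row in Y_true]
        col_pred = [row[j] for row in Y_pred]
        scores.append(rrmse(col_true, col_pred, train_means[j]))
    return sum(scores) / n_targets


# Toy data (hypothetical): two test samples, two targets.
Y_test = [[1.0, 10.0], [3.0, 14.0]]
Y_hat = [[1.0, 11.0], [3.0, 13.0]]
train_means = [2.0, 12.0]

score = arrmse(Y_test, Y_hat, train_means)
baseline = arrmse(Y_test, [train_means, train_means], train_means)
```

Under this definition, a predictor that always outputs the training mean scores exactly 1 (100% in the units of Table 2), so values below 1 indicate an improvement over that trivial baseline.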

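The parameter selection described in Section 4.2 (a grid of powers of ten from 1e-5 to 1e3, tuning one parameter while the others are fixed) can be sketched as below. This is only an illustration under stated assumptions: `cv_error` is a hypothetical callable standing in for the cross-validated aRRMSE of a trained model, and the starting values, the single sweep and the parameter names `lam`, `beta` and `gamma` are choices made for the example, not details given in the text.

```python
import math

# Search grid 10^[-5:1:3]: 1e-5, 1e-4, ..., 1e3.
GRID = [10.0 ** e for e in range(-5, 4)]


def tune_one_at_a_time(cv_error, init, grid=GRID, sweeps=1):
    """Coordinate-wise grid search: for each parameter in turn, pick the
    grid value with the lowest cv_error while the others stay fixed."""
    params = dict(init)
    for _ in range(sweeps):
        for name in params:
            best = min(grid, key=lambda v: cv_error({**params, name: v}))
            params[name] = best
    return params


def toy_cv_error(p):
    """Hypothetical stand-in for cross-validated aRRMSE, with its
    minimum at lam = 1e-2, beta = 1e1, gamma = 1e0."""
    return ((math.log10(p["lam"]) + 2) ** 2
            + (math.log10(p["beta"]) - 1) ** 2
            + math.log10(p["gamma"]) ** 2)


best = tune_one_at_a_time(toy_cv_error, {"lam": 1.0, "beta": 1.0, "gamma": 1.0})
```

With three parameters and nine grid values, one sweep needs 27 evaluations of `cv_error` instead of the 729 required by an exhaustive grid, at the cost of possibly missing interactions between the parameters.
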
ACKNOWLEDGMENT
The authors would like to thank the Associate Editor and all anonymous reviewers for their positive support and constructive comments for improving the quality of this paper. Computations were performed using the data analytics Cloud at SHARCNET (www.sharcnet.ca) provided through the Southern Ontario Smart Computing Innovation Platform (SOSCIP); the SOSCIP consortium is funded by the Ontario Government and the Federal Economic Development Agency for Southern Ontario. The authors also wish to thank Dr. Jinhui Qin for assistance with the computing environment. X. Zhen is partially sponsored by the National Science Foundation of China (Grant No. 61571147).

REFERENCES
[1] R. Caruana, “Multitask learning,” Machine learning, vol. 28, no. 1, pp. 41–75, 1997.
[2] Y. Wang, D. Wipf, Q. Ling, W. Chen, and I. Wassell, “Multi-task learning for subspace segmentation,” in International Conference on Machine Learning, 2015, pp. 1209–1217.
[3] Y. Yan, E. Ricci, R. Subramanian, G. Liu, O. Lanz, and N. Sebe, “A multi-task learning framework for head pose estimation under target motion,” IEEE transactions on pattern analysis and machine intelligence, 2015.
[4] X. Zhen, Z. Wang, A. Islam, M. Bhaduri, I. Chan, and S. Li, “Multi-scale deep networks and regression forests for direct bi-ventricular volume estimation,” Medical Image Analysis, 2015.
[5] Y. Zhang and D. Y. Yeung, “A convex formulation for learning task relationships in multi-task learning,” in 26th Conference on Uncertainty in Artificial Intelligence, UAI 2010, Catalina Island, CA, United States, 8-11 July 2010, Code 86680, 2010.
[6] R. K. Ando and T. Zhang, “A framework for learning predictive structures from multiple tasks and unlabeled data,” Journal of Machine Learning Research, vol. 6, pp. 1817–1853, 2005.
[7] A. Argyriou, T. Evgeniou, and M. Pontil, “Multi-task feature learning,” in Advances in Neural Information Processing Systems, 2006, pp. 41–48.
[8] K. Hara and R. Chellappa, “Growing regression forests by classification: Applications to object pose estimation,” in European Conference on Computer Vision. Springer, 2014, pp. 552–567.
[9] M. Long and J. Wang, “Learning transferable features with deep adaptation networks,” International Conference on Machine Learning, 2015.
[10] H. Borchani, G. Varando, C. Bielza, and P. Larrañaga, “A survey on multi-output regression,” Data Mining and Knowledge Discovery, vol. 5, no. 5, pp. 216–233, 2015.
[11] A. J. Rothman, E. Levina, and J. Zhu, “Sparse multivariate regression with covariance estimation,” Journal of Computational and Graphical Statistics, vol. 19, no. 4, pp. 947–962, 2010.
[12] L. Sun, S. Ji, and J. Ye, “Canonical correlation analysis for multilabel classification: A least-squares formulation, extensions, and analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 194–200, 2011.
[13] P. Rai, A. Kumar, and H. Daume, “Simultaneously leveraging output and task structures for multiple-output regression,” in Advances in Neural Information Processing Systems, 2012, pp. 3185–3193.
[14] A. Argyriou, T. Evgeniou, and M. Pontil, “Convex multi-task feature learning,” Machine Learning, vol. 73, no. 3, pp. 243–272, 2008.
[15] L. Jacob, J.-p. Vert, and F. R. Bach, “Clustered multi-task learning: A convex formulation,” in Advances in Neural Information Processing Systems, 2009, pp. 745–752.
[16] L. Han and Y. Zhang, “Learning tree structure in multi-task learning,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015, pp. 397–406.
[17] H. Liu, L. Wang, and T. Zhao, “Calibrated multivariate regression with application to neural semantic basis discovery,” Journal of Machine Learning Research, vol. 16, pp. 1579–1606, 2015.
[18] F. Dinuzzo and B. Schölkopf, “The representer theorem for Hilbert spaces: a necessary and sufficient condition,” in Advances in Neural Information Processing Systems, 2012, pp. 189–196.
[19] K. Yu, V. Tresp, and A. Schwaighofer, “Learning gaussian processes from multiple tasks,” in International Conference on Machine Learning, 2005, pp. 1012–1019.
[20] H. Daumé III, “Bayesian multitask learning with latent hierarchies,” in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2009, pp. 135–142.
[21] A. Kumar and H. Daume, “Learning task grouping and overlap in multi-task learning,” in International Conference on Machine Learning, 2012, pp. 1383–1390.
[22] M. Álvarez, L. Rosasco, and N. Lawrence, Kernels for Vector-Valued Functions: A Review, ser. Foundations and Trends in Machine Learning, 2012.
[23] T. Evgeniou, C. A. Micchelli, and M. Pontil, “Learning multiple tasks with kernel methods,” in Journal of Machine Learning Research, 2005, pp. 615–637.
[24] A. Bargi, R. Xu, Z. Ghahramani, and M. Piccardi, “A non-parametric conditional factor regression model for multi-dimensional input and response,” in Seventeenth International Conference on Artificial Intelligence and Statistics, 2014. JMLR, 2014.
[25] H. Li, N. Chen, and L. Li, “Error analysis for matrix elastic-net regularization algorithms,” IEEE transactions on neural networks and learning systems, vol. 23, no. 5, pp. 737–748, 2012.
[26] A. G. Wilson, D. A. Knowles, and Z. Ghahramani, “Gaussian process regression networks,” International Conference on Machine Learning, 2012.
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[28] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing, “Deep kernel learning,” in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 2016, pp. 370–378.
[29] J. Liu, S. Ji, and J. Ye, “Multi-task feature learning via efficient ℓ2,1-norm minimization,” in Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence. AUAI Press, 2009, pp. 339–348.
[30] X. Wang, J. Bi, S. Yu, and J. Sun, “On multiplicative multitask feature learning,” in Advances in Neural Information Processing Systems, 2014, pp. 2411–2419.
[31] J. Chen, J. Liu, and J. Ye, “Learning incoherent sparse and low-rank patterns from multiple tasks,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 5, no. 4, p. 22, 2012.
[32] C. Ciliberto, Y. Mroueh, T. Poggio, and L. Rosasco, “Convex learning of multiple tasks and their structure,” in International Conference on Machine Learning, 2015, pp. 1548–1557.
[33] Q. Zhou and Q. Zhao, “Flexible clustered multi-task learning by learning representative tasks,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 2, pp. 266–278, 2016.
[34] L. Xu, Z. Wang, Z. Shen, Y. Wang, and E. Chen, “Learning low-rank label correlations for multi-label classification with missing labels,” in Data Mining (ICDM), 2014 IEEE International Conference on. IEEE, 2014, pp. 1067–1072.
[35] H. Liu, L. Wang, and T. Zhao, “Multivariate regression with calibration,” in Advances in Neural Information Processing Systems, 2014, pp. 127–135.
[36] F. Dinuzzo, C. S. Ong, G. Pillonetto, and P. V. Gehler, “Learning output kernels with block coordinate descent,” in International Conference on Machine Learning, 2011, pp. 49–56.
[37] J. Friedman, T. Hastie, and R. Tibshirani, The elements of statistical learning. Springer series in statistics Springer, Berlin, 2001, vol. 1.
[38] J. Shawe-Taylor and N. Cristianini, Kernel methods for pattern analysis. Cambridge university press, 2004.
[39] X. Zhong, L. Xu, Y. Li, Z. Liu, and E. Chen, “A nonconvex relaxation approach for rank minimization problems,” in
Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. AAAI Press, 2015, pp. 1980–1986.
[40] B. Recht, M. Fazel, and P. A. Parrilo, “Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization,” SIAM review, vol. 52, no. 3, pp. 471–501, 2010.
[41] F. Nie, H. Huang, and C. Ding, “Low-rank matrix recovery via efficient schatten p-norm minimization,” in Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence. AAAI Press, 2012, pp. 655–661.
[42] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 2, pp. 301–320, 2005.
[43] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.
[44] J. Gillberg, P. Marttinen, M. Pirinen, A. J. Kangas, P. Soininen, M. Ali, A. S. Havulinna, M.-R. Järvelin, M. Ala-Korpela, and S. Kaski, “Multiple output regression with latent noise,” arXiv preprint arXiv:1410.7365, 2014.
[45] P. Gong, J. Zhou, W. Fan, and J. Ye, “Efficient multi-task feature learning with calibration,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014, pp. 761–770.
[46] G. S. Kimeldorf and G. Wahba, “A correspondence between Bayesian estimation on stochastic processes and smoothing by splines,” The Annals of Mathematical Statistics, vol. 41, no. 2, pp. 495–502, 1970.
[47] F. Nie, H. Huang, X. Cai, and C. H. Ding, “Efficient and robust feature selection via joint ℓ2,1-norms minimization,” in Advances in Neural Information Processing Systems, 2010, pp. 1813–1821.
[48] S. Zheng, X. Cai, C. Ding, F. Nie, and H. Huang, “A closed form solution to multi-view low-rank regression,” in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[49] B. Rakitsch, C. Lippert, K. Borgwardt, and O. Stegle, “It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals,” in Advances in Neural Information Processing Systems, 2013, pp. 1466–1474.
[50] M. Sánchez-Fernández, M. de Prado-Cumplido, J. Arenas-García, and F. Pérez-Cruz, “Svm multiregression for nonlinear channel estimation in multiple-input multiple-output systems,” IEEE transactions on signal processing, vol. 52, no. 8, pp. 2298–2307, 2004.
[51] C. M. Bishop, Pattern recognition and machine learning. Springer, 2006.
[52] F. Dinuzzo, “Learning output kernels for multi-task problems,” Neurocomputing, vol. 118, pp. 119–126, 2013.
[53] P. Gong, J. Ye, and C. Zhang, “Robust multi-task feature learning,” in Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2012, pp. 895–903.
[54] G. H. Golub and C. F. Van Loan, Matrix computations. JHU Press, 2012, vol. 3.
[55] L. Armijo, “Minimization of functions having lipschitz continuous first partial derivatives,” Pacific Journal of mathematics, vol. 16, no. 1, pp. 1–3, 1966.
[56] J. C. Platt, “Fast training of support vector machines using sequential minimal optimization,” Advances in kernel methods, pp. 185–208, 1999.
[57] J. Kivinen, A. J. Smola, and R. C. Williamson, “Online learning with kernels,” IEEE transactions on signal processing, vol. 52, no. 8, pp. 2165–2176, 2004.
[58] A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” in Advances in Neural Information Processing Systems, 2007, pp. 1177–1184.
[59] K. Yano, “On harmonic and killing vector fields,” Annals of Mathematics, pp. 38–45, 1952.
[60] E. Spyromitros-Xioufis, G. Tsoumakas, W. Groves, and I. Vlahavas, “Multi-target regression via input space expansion: treating targets as inputs,” Machine Learning, vol. 104, no. 1, pp. 55–98, 2016.
[61] D. Tuia, J. Verrelst, L. Alonso, F. Pérez-Cruz, and G. Camps-Valls, “Multioutput support vector regression for remote sensing biophysical parameter estimation,” IEEE Geoscience and Remote Sensing Letters, vol. 8, no. 4, pp. 804–808, 2011.
[62] T. Aho, B. Ženko, S. Džeroski, and T. Elomaa, “Multi-target regression with rule ensembles,” Journal of Machine Learning Research, vol. 13, no. 1, pp. 2367–2407, 2012.
[63] H. Zou and H. H. Zhang, “On the adaptive elastic-net with a diverging number of parameters,” Annals of statistics, vol. 37, no. 4, p. 1733, 2009.

Xiantong Zhen received the B.S. and M.E. degrees from Lanzhou University, Lanzhou, China in 2007 and 2010, respectively, and the Ph.D. degree from the Department of Electronic and Electrical Engineering, the University of Sheffield, UK in 2013. He is currently a postdoctoral fellow with the University of Western Ontario, London, Ontario, Canada. His research interests include machine learning, computer vision and medical image analysis.

Mengyang Yu received the B.S. and M.S. degrees from the School of Mathematical Sciences, Peking University, Beijing, China, in 2010 and 2013, respectively, and the Ph.D. degree from the Department of Computer Science and Digital Technologies, Northumbria University, Newcastle upon Tyne, U.K., in 2017. Currently, he is a postdoctoral researcher at the Computer Vision Laboratory, ETH Zurich, Switzerland. His research interests include computer vision, machine learning, and information retrieval.

Xiaofei He received the BS degree in Computer Science from Zhejiang University, China, in 2000 and the Ph.D. degree in Computer Science from the University of Chicago, in 2005. He is a Professor in the State Key Lab of CAD & CG at Zhejiang University, China. Prior to joining Zhejiang University, he was a Research Scientist at Yahoo! Research Labs, Burbank, CA. His research interests include machine learning, information retrieval, and computer vision. He is a senior member of IEEE.

Shuo Li received the Ph.D. degree in computer science from Concordia University, Montréal, QC, Canada, in 2006. He was a Research Scientist and a Project Manager of General Electric (GE) Healthcare, London, ON, Canada, for nine years. He is currently an Associate Professor with the Department of Medical Imaging and Medical Biophysics, University of Western Ontario, London, and a Scientist with the Lawson Health Research Institute, London. He is the Founder and has been the Director of the Digital Imaging Group of London, London, Ontario, Canada since 2006, which is a highly dynamic and interdisciplinary group. He has authored and co-authored over 100 publications and edited five Springer books. His current research interests include the development of intelligent analytic tools to help physicians and hospital administrators handle big medical data, centered on medical images. Dr. Li was the recipient of several GE internal awards. His Ph.D. thesis received the Doctoral Prize given to the most deserving graduating student in the Faculty of Engineering and Computer Science. He serves as a Guest Editor and an Associate Editor for several prestigious journals. He has served as a Program Committee Member for top conferences.
