Multi-Target Regression Via Robust Low-Rank Learning
Xiantong Zhen, Mengyang Yu, Xiaofei He, Senior Member, IEEE, and Shuo Li
Abstract—Multi-target regression has recently regained great popularity due to its capability of simultaneously learning multiple
relevant regression tasks and its wide applications in data mining, computer vision and medical image analysis, while great
challenges arise from jointly handling inter-target correlations and input-output relationships. In this paper, we propose Multi-layer
Multi-target Regression (MMR), which enables simultaneous modeling of intrinsic inter-target correlations and nonlinear
input-output relationships in a general framework via robust low-rank learning. Specifically, the MMR can explicitly encode inter-
target correlations in a structure matrix by matrix elastic nets (MEN); the MMR can work in conjunction with the kernel trick to
effectively disentangle highly complex nonlinear input-output relationships; the MMR can be efficiently solved by a new alternating
optimization algorithm with guaranteed convergence. The MMR leverages the strength of kernel methods for nonlinear feature
learning and the structural advantage of multi-layer learning architectures for inter-target correlation modeling. More importantly,
it offers a new multi-layer learning paradigm for multi-target regression which is endowed with high generality, flexibility and
expressive ability. Extensive experimental evaluation on 18 diverse real-world datasets demonstrates that our MMR achieves
consistently high performance and outperforms representative state-of-the-art algorithms, which shows its great effectiveness
and generality for multivariate prediction.
Index Terms—Robust Low-Rank Learning, Multi-Layer Learning, Multi-Target Regression, Matrix Elastic Nets.
structure of noises, the CMR employs an ℓ2,1-norm based loss function to calibrate each regression task, while the assumption of uncorrelated noises does not always hold in practice. Besides, it is nontrivial to extend the CMR to kernel regression due to the ℓ2,1-norm loss function.

In order to handle nonlinear input-output relationships, the recent output kernel learning (OKL) algorithm [22], [36], which learns a semi-definite similarity matrix, i.e., the output kernel, of multiple targets, cannot fully capture inter-target correlations, e.g., negative correlations [5]. By assuming that all tasks can be clustered into disjoint groups, clustered multi-target learning (CMTL) [15] was developed to explore inter-target correlations by learning the underlying cluster structure from the training data. However, the number of clusters needs to be specified, which is rarely available in real-world tasks. Recently, an improved version of CMTL called flexible clustered multi-target learning (FCMTL) was presented in [33]. In the FCMTL, the cluster structure is learned by identifying representative tasks. However, the assumption of the existence of representative tasks remains purely heuristic and therefore would not be shared across diverse applications.

3 MULTI-LAYER MULTI-TARGET REGRESSION

The proposed MMR accomplishes a general framework of multi-layer learning to jointly model inter-target correlations by robust low-rank learning via matrix elastic nets (MEN) (Sec. 3.2) and disentangle nonlinear input-output relationships by kernel regression (Sec. 3.3); the MMR is efficiently solved by a newly derived alternating optimization algorithm with guaranteed convergence (Sec. 3.4).

3.1 Problem Formulation

We start with the fundamental multi-target linear regression model

y = W x + b,    (1)

where y = [y_1, · · · , y_i, · · · , y_Q]^⊤ ∈ R^Q contains the multivariate targets, x ∈ R^d is the input, W = [w_1, · · · , w_i, · · · , w_Q]^⊤ ∈ R^{Q×d} is the model parameter, i.e., the regression coefficient, each w_i ∈ R^d is the predictor for y_i, b ∈ R^Q is the bias, and d and Q are the dimensionalities of the input and output spaces, respectively.

Given the training set {(x_i, y_i)}_{i=1}^N, one can solve for W via the following penalized optimization objective:

W^* = argmin_W (1/N) ||Y − W X − B||_F^2 + λ ||W||_F^2,    (2)

where X = [x_1, x_2, · · · , x_N], Y = [y_1, y_2, · · · , y_N], B = [· · · , b, · · · ] ∈ R^{Q×N}, ||A||_F^2 is the squared Frobenius norm of a matrix A, which can be computed by tr(A^⊤A), and λ is the regularization parameter that controls the amount of shrinkage, i.e., the larger the value of λ, the greater the amount of shrinkage [37].

The objective function in (2) is a straightforward extension of ridge regression, one of the classical statistical learning algorithms [37], to multivariate targets. When working in a reproducing kernel Hilbert space (RKHS), the resulting model is known as kernel ridge regression (KRR) [37], [38]. Likewise, (2) can be kernelized to achieve multi-target kernel ridge regression (mKRR). We derive the proposed MMR from this fundamental formulation to ensure its generality.

The multi-target regression model in (2) is decoupled into several single-output problems, which does not take into account inter-target correlations, resulting in suboptimal multi-target regression with inferior performance. In what follows, we introduce our MMR, which is a multi-layer learning architecture that explicitly models the correlations by robust low-rank learning with matrix elastic nets (MEN) [25].
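As a concrete reference point, the baseline in (2) admits a closed-form solution once the bias is absorbed into W by appending a constant feature. The sketch below is a minimal illustration under our own naming conventions (numpy, `multi_target_ridge`), not code from the paper; note that it mildly regularizes the absorbed bias as well.

```python
import numpy as np

def multi_target_ridge(X, Y, lam):
    """Closed-form solution of the multi-target ridge objective in Eq. (2).

    X : (d, N) inputs, Y : (Q, N) targets, lam : regularization parameter.
    The bias b is absorbed by appending a constant row of ones to X,
    so the returned W has shape (Q, d + 1).
    """
    d, N = X.shape
    X_aug = np.vstack([X, np.ones((1, N))])      # absorb the bias term
    # Setting the gradient of (1/N)||Y - W X||_F^2 + lam ||W||_F^2 to zero
    # gives W = Y X^T (X X^T + lam * N * I)^{-1}.
    G = X_aug @ X_aug.T + lam * N * np.eye(d + 1)
    return Y @ X_aug.T @ np.linalg.inv(G)

# Example: Q = 3 targets, d = 5 features, N = 100 samples.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 100))
Y = rng.standard_normal((3, 100))
W = multi_target_ridge(X, Y, lam=0.1)
Y_hat = W @ np.vstack([X, np.ones((1, 100))])    # predictions on the training inputs
```

The MMR below departs from this decoupled baseline by inserting a structure matrix S between the latent variables and the targets.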
3.2 Robust Low-Rank Learning with MEN

Rather than directly imposing sparsity constraints on W as in existing methods, we propose incorporating a latent space, from which a structure matrix S is learned to explicitly encode inter-target correlations via rank minimization:

min_{W,S} (1/N) ||Y − SZ||_F^2 + λ ||W||_F^2 + β Rank(S) + γ ||S||_F^2,    (3)

where Z = [z_1, · · · , z_i, · · · , z_N] ∈ R^{Q×N} with z_i = W x_i + b ∈ R^Q contains the latent variables in the latent space, S ∈ R^{Q×Q} is the structure matrix that serves to explicitly model inter-target correlations, β is the regularization parameter that controls the rank of S, that is, a larger β induces a lower rank, and the Frobenius norm controls the shrinkage of S with the associated parameter γ. The rank minimization of the structure matrix S explores the low-rank structure existing between tasks to capture the intrinsic inter-target correlations. S is learned automatically from data without relying on any specific assumptions, which allows the model to adapt to different applications.

However, the objective function in (3) is NP-hard due to the noncontinuous and non-convex nature of the rank function [39]. The nuclear norm ||S||_* is commonly used for low-rank learning; it has been proven to be the convex envelope of the rank function over the domain ||S||_2 ≤ 1 [40], which provides the tightest lower bound among all convex lower bounds of the rank function Rank(S).

As a consequence, the combination of the nuclear norm with the Frobenius norm on S gives rise to the matrix elastic net (MEN) [25] as a regularizer of (3):

min_{W,S} (1/N) ||Y − SZ||_F^2 + λ ||W||_F^2 + β ||S||_* + γ ||S||_F^2,    (4)

where the nuclear norm ||S||_*, also known as the trace norm, is a particular instance of the Schatten p-norm [41] with p = 1, i.e., the Schatten 1-norm. The Schatten p-norm of a matrix M is defined as

||M||_{S_p} = (Σ_i σ_i^p(M))^{1/p},    (5)

where 0 < p ≤ 2 and σ_i(M) is the i-th largest singular value of M. When p < 2, the Schatten p-norm encourages sparsity of the singular values, which achieves rank minimization. With (5), the nuclear norm of S can be written as ||S||_* = tr(√(S^⊤S)). It is interesting to mention that when p = 0, the Schatten 0-norm is defined as

||M||_{S_0} = Σ_i σ_i^0(M),    (6)

which is exactly the rank of M.
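To make the regularizer concrete, the sketch below evaluates the MEN-regularized objective in (4) for given W, b and S, computing the nuclear norm from the singular values; it is a minimal illustration under our own naming conventions, not the paper's implementation.

```python
import numpy as np

def men_objective(Y, X, W, b, S, lam, beta, gamma):
    """Value of the MEN-regularized objective in Eq. (4).

    Y : (Q, N) targets, X : (d, N) inputs, W : (Q, d), b : (Q,), S : (Q, Q).
    """
    N = X.shape[1]
    Z = W @ X + b[:, None]                     # latent variables z_i = W x_i + b
    data_term = np.sum((Y - S @ Z) ** 2) / N   # (1/N) ||Y - SZ||_F^2
    nuclear = np.linalg.norm(S, ord='nuc')     # ||S||_* = sum of singular values
    return (data_term
            + lam * np.sum(W ** 2)             # lambda ||W||_F^2
            + beta * nuclear                   # beta ||S||_*
            + gamma * np.sum(S ** 2))          # gamma ||S||_F^2
```

Replacing `beta * nuclear` with `beta * np.linalg.matrix_rank(S)` would recover the non-convex objective in (3), which is exactly the term the nuclear norm relaxes.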
The MEN is an analog of the elastic-net regularization [42] from compressive sensing and sparse representation. It has been shown that the elastic net often outperforms the lasso [42]. In the MEN, the nuclear-norm constraint enforces the low-rank property of the solution S to encode inter-target correlations, and the Frobenius-norm constraint induces a linear shrinkage on the matrix entries, leading to stable solutions [25]. The MEN regularization generalizes the matrix lasso and provides improved performance over the lasso [43].

To the best of our knowledge, this is the first work that introduces the MEN to multi-target regression for robust low-rank learning, which offers a general framework to encode inter-target correlations. We highlight that the proposed multi-layer multi-target regression enjoys the following favorable merits:

• The latent space decouples inputs and targets from distinctive distributions, which allows them to be effectively handled by the regression coefficient W and the structure matrix S, respectively [24], [44].
• The structure matrix, which is detached from the inputs by the latent space, is able to effectively calibrate multiple targets according to their different noise levels to achieve optimal parameter estimation [35], [45].
• The matrix elastic nets (MEN) provide a general regularization network to achieve robust low-rank learning of the intrinsic inter-target correlations [25].

3.3 Kernel Extension

Due to the constraints directly imposed on the regression matrix W, most existing methods cannot be kernelized for nonlinear regression. On the contrary, thanks to the multi-layer learning architecture, the MMR is more flexible and extensible, and naturally admits the representer theorem [46], [18] as shown in Theorem 1, which enables kernel extension to achieve nonlinear regression.

Theorem 1. Given any fixed matrix S, consider the objective function in (4) w.r.t. W, which is defined over a Hilbert space H. If (4) has a minimizer w.r.t. W, it admits a linear representer form W = AX^⊤, where A ∈ R^{Q×N} is the coefficient matrix.

Remark. Theorem 1 provides important theoretical guarantees for the kernel extension to achieve nonlinear multi-target regression. The proof of the theorem is straightforward and omitted here due to the space limit. Based on Theorem 1, we can derive the kernel extension in the reproducing kernel Hilbert space (RKHS) for kernel regression to handle nonlinear input-output relationships. To facilitate the derivation, we rewrite the objective function (4) in terms of traces as follows:

min_{W,S} (1/N) tr((Y − S(W X + B))^⊤(Y − S(W X + B))) + λ tr(W^⊤W) + β tr(√(S^⊤S)) + γ tr(S^⊤S).    (7)

According to the linear representer theorem in Theorem 1, we have

W = AΦ(X)^⊤,    (8)

where Φ(X) = [φ(x_1), · · · , φ(x_i), · · · , φ(x_N)], A = [α_1, · · · , α_i, · · · , α_Q]^⊤ ∈ R^{Q×N} with α_i ∈ R^N, and φ(·) denotes the feature map, which maps x_i to φ(x_i) in some RKHS of high, even infinite, dimensionality. The mapping serves as a nonlinear feature extraction to handle complicated input-target relationships. The corresponding kernel function k(·, ·) satisfies k(x_i, x_j) = φ(x_i)^⊤φ(x_j).

Substituting (8) into (7), we obtain the following objective function:

min_{A,S} (1/N) tr((Y − SAΦ(X)^⊤Φ(X))^⊤(Y − SAΦ(X)^⊤Φ(X))) + λ tr((AΦ(X)^⊤)(AΦ(X)^⊤)^⊤) + β tr(√(S^⊤S)) + γ tr(S^⊤S),    (9)

where the bias B is omitted for simplicity since it has been proven that the bias can be absorbed into the regression coefficient W by adding an additional dimension to the input features x_i [47], [48].

We accomplish the kernel version of the multi-layer multi-target regression with the kernel matrix K = Φ(X)^⊤Φ(X) defined in the RKHS:

min_{A,S} (1/N) tr((Y − SAK)^⊤(Y − SAK)) + λ tr(AKA^⊤) + β tr(√(S^⊤S)) + γ tr(S^⊤S).    (10)

In (10), the induced latent variables Z = AK extract high-level representations for the multiple semantic targets, which allows the model to disentangle the nonlinear relationship between low-level inputs and semantic-level targets [8]. The latent space with high-level features also facilitates the efficient linear low-rank learning of S to model inter-target correlations [49] and thus more accurate multi-target prediction. The MMR in (10) leverages the strength of kernel methods for nonlinear feature extraction and the structural advantage of multi-layer architectures for inter-target correlation modeling. In contrast to existing multi-target regression models, the obtained MMR in (10) accomplishes a new multi-layer learning architecture, which is endowed with great generality, flexibility and expressive ability for diverse challenging tasks.

One of the important advantages of the proposed MMR over previous multi-target regression models is its great generality. Theoretically, the proposed MMR is highly general and encompasses some existing models. By setting S = I and K = X^⊤X in (10), the MMR recovers the fundamental multi-target ridge regression on which many previous models were built. The MMR can simultaneously encode inter-target correlations and disentangle linear/nonlinear input-output relationships by customizing kernels in one single framework. Moreover, it can also work with other convex loss functions, e.g., the ε-insensitive loss function [50], and accepts other regularization terms to satisfy desired properties [51]. Compared to the recent output kernel learning (OKL) [36], [52], the MMR employs a general low-rank regularization network without relying on specific assumptions, which allows it to fully capture more complex inter-target correlations, e.g., both positive and negative correlations, rather than only a certain aspect, e.g., the similarity of multiple targets [36]. Moreover, we do not assume that all tasks are correlated and allow the existence of outlier tasks [53], which further increases the generality.
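To connect (10) to the linear baseline, the sketch below builds an RBF kernel matrix and evaluates the kernelized objective; with S set to the identity and a linear kernel K = X^⊤X it reduces to the multi-target ridge regression discussed above. Function and variable names are our own illustrative choices, not from the paper.

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) for X of shape (d, N)."""
    sq = np.sum(X ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def mmr_objective(Y, K, A, S, lam, beta, gamma):
    """Value of the kernelized MMR objective in Eq. (10).

    Y : (Q, N) targets, K : (N, N) kernel matrix, A : (Q, N), S : (Q, Q).
    """
    N = K.shape[0]
    R = Y - S @ A @ K                               # residual Y - SAK
    return (np.trace(R.T @ R) / N                   # (1/N) tr((Y - SAK)^T (Y - SAK))
            + lam * np.trace(A @ K @ A.T)           # lambda tr(A K A^T)
            + beta * np.linalg.norm(S, ord='nuc')   # beta ||S||_*
            + gamma * np.sum(S ** 2))               # gamma ||S||_F^2

# Special case mentioned in the text: S = np.eye(Q) and a linear kernel
# K = X.T @ X recover the multi-target ridge regression baseline of Eq. (2).
```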
3.4 Alternating Optimization

The obtained objective function (10) is non-trivial to solve simultaneously for A and S due to its non-convexity. We derive a new alternating optimization algorithm to efficiently solve it. Denote by J(A, S) the objective function in (10); we seek A and S alternately by solving J(A, S) for one with the other fixed.

3.4.1 Fix S to optimize A

We calculate the gradient of the objective function with respect to A as follows:

∂J/∂A = −(1/N) S^⊤(Y − SAK)K + λAK.    (11)

Setting the derivative to 0 gives rise to

S^⊤SAK + λN A = S^⊤Y.    (12)

Multiplying both sides by K^{−1} on the right leads to

S^⊤SA + λN AK^{−1} = S^⊤Y K^{−1},    (13)

which is a standard Sylvester equation [54] and can be solved analytically in closed form.
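As an illustration of this step, (13) has the form (S^⊤S)A + A(λN K^{−1}) = S^⊤Y K^{−1}, which standard Sylvester solvers handle directly. The sketch below uses scipy for this purpose and adds a small ridge to K before inversion; these are our own implementation choices, not prescribed by the paper.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def update_A(Y, K, S, lam, jitter=1e-8):
    """A-step of Sec. 3.4.1: solve the Sylvester equation (13) for A with S fixed.

    Eq. (13): (S^T S) A + A (lam * N * K^{-1}) = S^T Y K^{-1}.
    Y : (Q, N), K : (N, N), S : (Q, Q); returns A of shape (Q, N).
    """
    N = K.shape[0]
    K_inv = np.linalg.inv(K + jitter * np.eye(N))   # small jitter keeps K invertible
    left = S.T @ S                                  # coefficient multiplying A from the left
    right = lam * N * K_inv                         # coefficient multiplying A from the right
    rhs = S.T @ Y @ K_inv                           # right-hand side of Eq. (13)
    return solve_sylvester(left, right, rhs)        # solves left @ A + A @ right = rhs
```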
3.4.2 Fix A to optimize S

We propose a gradient-based iterative optimization to solve for S, before which we provide the following proposition to calculate the derivative of J w.r.t. S.

Proposition 1. Assume that the singular value decomposition (SVD) of S is

S = UΣV^⊤,    (14)

where U and V are unitary matrices and Σ is the diagonal matrix with real numbers on the diagonal. Then the derivative of ||S||_* w.r.t. S takes the following form:

∂||S||_*/∂S = UΣ^{−1}|Σ|V^⊤,    (15)

where Σ^{−1} is the Moore-Penrose pseudo-inverse of Σ.

Proof: By the definition of the nuclear norm, we have

||S||_* = tr(√(S^⊤S)) = tr(√((UΣV^⊤)^⊤(UΣV^⊤))) = tr(√(VΣU^⊤UΣV^⊤)) = tr(√(VΣ^2V^⊤)).    (16)

By the circularity property of the trace, we have

||S||_* = tr(√(V^⊤VΣ^2)) = tr(√(Σ^2)) = tr(|Σ|),    (17)

where |Σ| is the matrix of the element-wise absolute values of Σ. Therefore, the nuclear norm of S can also be defined as the sum of the absolute values of the singular values of S. Although the absolute value function is not differentiable at every point of its domain, we can find a subgradient:

∂||S||_*/∂S = ∂tr(|Σ|)/∂S = tr(∂|Σ|)/∂S.    (18)

Since Σ is diagonal, the subdifferential of |Σ| is

∂|Σ|/∂S = |Σ|Σ^{−1} ∂Σ/∂S.    (19)

By substituting (19) into (18), we obtain

∂||S||_*/∂S = tr(|Σ|Σ^{−1}∂Σ)/∂S.    (20)

From (14), we have

∂S = ∂UΣV^⊤ + U∂ΣV^⊤ + UΣ∂V^⊤,    (21)

which gives rise to

U∂ΣV^⊤ = ∂S − ∂UΣV^⊤ − UΣ∂V^⊤.    (22)

Multiplying both sides of (22) by U^⊤ on the left and V on the right, we have

U^⊤U∂ΣV^⊤V = U^⊤∂SV − U^⊤∂UΣV^⊤V − U^⊤UΣ∂V^⊤V,    (23)

which leads to

∂Σ = U^⊤∂SV − U^⊤∂UΣ − Σ∂V^⊤V.    (24)
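A compact numerical rendering of the resulting S-step is sketched below: the subgradient of the nuclear norm is formed from the SVD as in (15), and S is updated by a plain gradient step on J with A fixed. The fixed step size and all names are illustrative assumptions; constant factors of 2 are absorbed into the step size, mirroring the convention of Eq. (11).

```python
import numpy as np

def nuclear_norm_subgradient(S):
    """Subgradient of ||S||_* from Proposition 1: U pinv(Sigma) |Sigma| V^T."""
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    # For nonzero singular values, pinv(Sigma)|Sigma| has ones on the diagonal;
    # zero singular values contribute zero (a valid subgradient choice).
    d = np.where(sigma > 1e-12, 1.0, 0.0)
    return U @ np.diag(d) @ Vt

def update_S(Y, K, A, S, lam, beta, gamma, step=1e-2):
    """One gradient step of Sec. 3.4.2 on J(A, S) w.r.t. S with A fixed."""
    N = K.shape[0]
    Z = A @ K                                    # latent variables Z = AK
    grad = (-(Y - S @ Z) @ Z.T / N               # from the data-fitting term
            + beta * nuclear_norm_subgradient(S) # from beta ||S||_*
            + gamma * S)                         # from gamma ||S||_F^2
    return S - step * grad
```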
minimal optimization (SMO), and kernel approximation methods [58] can be employed. Kernel approximation has recently attracted increasing research efforts to speed up kernel methods. When using translation-invariant kernels, a large family of the most widely used kernels, e.g., the radial basis function (RBF), the feature map induced by the kernel function can be approximated by random Fourier features [58] under Bochner's theorem [59]. Stochastic gradient descent and back-propagation can then be used to train the model so that it scales to large datasets or to data arriving sequentially.
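As a pointer to how such an approximation looks in practice, the following sketch draws random Fourier features for the RBF kernel in the spirit of [58]; the feature dimension D and all names are illustrative assumptions rather than settings used in the paper.

```python
import numpy as np

def random_fourier_features(X, sigma=1.0, D=256, seed=0):
    """Approximate the RBF kernel feature map with D random Fourier features.

    X : (d, N) inputs. Returns phi of shape (D, N) such that
    phi[:, i] @ phi[:, j] ~ exp(-||x_i - x_j||^2 / (2 sigma^2)).
    """
    rng = np.random.default_rng(seed)
    d, N = X.shape
    Omega = rng.normal(scale=1.0 / sigma, size=(D, d))  # spectral samples (Bochner)
    tau = rng.uniform(0.0, 2.0 * np.pi, size=(D, 1))    # random phases
    return np.sqrt(2.0 / D) * np.cos(Omega @ X + tau)

# The kernel matrix K in Eq. (10) can then be approximated by phi.T @ phi,
# turning the kernelized problem back into a finite-dimensional linear one.
```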
4 EXPERIMENTS AND RESULTS

We have conducted extensive experiments to evaluate the performance of the MMR on 18 real-world datasets and compared it with state-of-the-art algorithms; we have also investigated the convergence of the alternating optimization and show the results on two representative datasets with different numbers of targets.

4.1 Datasets and Settings

The 18 real-world datasets are widely used benchmarks for multi-target regression from [60], which cover a large range of multi-target prediction tasks. Inter-target correlations exhibit diverse patterns across the different datasets, which poses great challenges for multi-target regression models. The statistics of these datasets are summarized in Table 1. We follow the strategies in [60] to process the datasets with missing input values, which are replaced with the sample means.

TABLE 1: The statistics of the 18 datasets. d is the dimension of inputs and Q is the number of targets.

Dataset   Samples  Input (d)  Target (Q)  #-Fold CV
EDM         154      16          2          10
SF1         323      10          3          10
SF2        1066      10          3          10
JURA        359      15          3          10
WQ         1066      16         14          10
ENB         768       8          2          10
SLUMP       103       7          3          10
ANDRO        49      30          6          10
OSALES      639     413         12          10
SCPF       1137      23          3          10
ATP1d       337     411          6          10
ATP7d       296     411          6          10
OES97       334     263         16          10
OES10       403     298         16          10
RF1        9125      64          8           5
RF2        9125     576          8           5
SCM1d      9803     280         16           2
SCM20d     8966      61         16           2

We compare with existing representative multi-target regression models including multi-dimensional support vector regression (mSVR) [50], [61], output kernel learning (OKL) [36], adaptive k-cluster random forests (AKRF) [8], multi-task feature learning (MTFL) [7] and MROTS [13]. Note that MTRL [5] and FIRE [62] perform worse than MROTS and MORF, respectively [13], and are therefore not included in the comparison. The methods in [60], including single task learning (STL), multi-objective random forests (MORF), corrected multi-target stacking (MTSC), ensembles of regressor chains (ERC) and random linear target combinations (RLC), which have shown great performance in [60], are also included for comprehensive comparison. We follow the evaluation settings in [60] to benchmark against other algorithms. Specifically, as shown in Table 1, we use two-fold cross validation (CV) for SCM1d/SCM20d, five-fold CV for RF1/RF2 and ten-fold CV for the rest of the datasets.

4.2 Evaluation Metric

To directly benchmark against state-of-the-art algorithms, we measure the performance by the commonly used Relative Root Mean Squared Error (RRMSE), defined as

RRMSE = √( Σ_{(x_i, y_i) ∈ D_test} (ŷ_i − y_i)^2 / Σ_{(x_i, y_i) ∈ D_test} (Ȳ − y_i)^2 ),    (31)

where (x_i, y_i) is the i-th sample x_i with ground-truth target y_i, ŷ_i is the prediction of y_i and Ȳ is the average of the targets over the training set D_train. We take the average RRMSE (aRRMSE) across all the target variables within the test set D_test as a single measurement. A lower aRRMSE indicates better performance. The parameters λ, β and γ are chosen by cross validation from a search grid of 10^{[−5:1:3]} on the training set by tuning one with the others fixed; they could also be selected by the adaptive techniques explored in [63], [25]. We use the radial basis function (RBF) kernel for nonlinear regression.
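The metric is easy to misread from the formula alone, so a small sketch of the per-target RRMSE and its average is given below; it assumes targets are stored row-wise, as elsewhere in this section, and is not code from the paper.

```python
import numpy as np

def arrmse(Y_pred, Y_test, Y_train):
    """Average RRMSE of Eq. (31) over all targets.

    Y_pred, Y_test : (Q, N_test); Y_train : (Q, N_train).
    For each target, RRMSE compares the model's squared error against that of
    predicting the training-set mean of that target.
    """
    y_bar = Y_train.mean(axis=1, keepdims=True)     # per-target training mean
    num = np.sum((Y_pred - Y_test) ** 2, axis=1)    # model squared error per target
    den = np.sum((y_bar - Y_test) ** 2, axis=1)     # mean-predictor squared error
    return np.mean(np.sqrt(num / den))
```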
4.3 Results

4.3.1 Performance Comparison

The proposed MMR algorithm achieves consistently high multi-target prediction performance on all 18 datasets and substantially outperforms state-of-the-art algorithms. The multi-target prediction results of the proposed MMR and the comparison with recently proposed state-of-the-art algorithms are summarized in Table 2.

TABLE 2: The comparison with state-of-the-art algorithms on 18 real-world datasets in terms of aRRMSE (%).

Dataset    STL    MTSC   ERC    RLC    MORF   mSVR   AKRF   MTFL   MROTS  OKL    mKRR   MMR
EDM        74.2   74.0   74.1   73.5   73.4   73.7   74.0   85.1   81.2   74.1   83.3   71.6
SF1       113.5  106.8  108.9  116.3  128.2  102.1  111.4  111.2  115.5  105.9  110.4   95.8
SF2       114.9  105.5  108.8  122.8  142.5  104.3  113.5  112.7  120.1  100.4  116.6   98.4
JURA       58.9   59.1   59.0   59.6   59.7   61.1   61.8   60.8   62.5   59.9   63.3   58.2
WQ         90.8   90.9   90.6   90.2   89.9   89.9   91.8   96.2   91.3   89.1   92.0   88.9
ENB        11.7   12.1   11.4   12.0   12.1   22.0   23.4   31.6   25.7   13.8   26.3   11.1
SLUMP      68.8   69.5   68.9   69.0   69.4   71.1   72.9   68.1   77.8   69.9   78.9   58.7
ANDRO      60.2   57.9   56.7   57.0   51.0   62.7   62.3   80.3   63.5   55.3   63.9   52.7
OSALES     74.8   72.6   71.3   74.1   75.3   77.8   77.5  168.2   80.0   71.8   79.9   70.9
SCPF       83.7   83.1   83.0   83.5   83.3   82.8   83.1   89.9   90.1   82.0   85.5   81.2
ATP1d      37.4   37.2   37.2   38.4   42.2   38.10  41.2   41.5   40.4   36.4   38.0   33.2
ATP7d      52.5   50.7   51.2   46.1   55.1   47.75  53.1   55.3   54.9   47.5   48.6   44.3
OES97      52.5   52.4   52.4   52.3   54.9   55.7   58.1   81.8   60.5   53.5   58.7   49.7
OES10      42.0   42.1   42.0   41.9   45.2   44.7   44.6   53.2   55.8   43.2   48.9   40.3
RF1         9.7    9.4    9.1   12.1   12.3   10.9   11.4   98.3   15.4   11.2   17.9    8.9
RF2        10.2    9.7    9.5   13.0   14.8   14.4   15.7  110.3   19.8   11.8   15.9    9.5
SCM1d      34.8   33.6   33.0   34.5   35.2   36.7   36.8   43.7   44.9   34.2   37.1   31.8
SCM20d     47.5   41.3   39.4   44.3   48.2   49.3   65.5   64.3   45.6   44.3   49.8   38.9

The proposed MMR substantially outperforms the best results of the state-of-the-art algorithms on most of these 18 datasets, except the ANDRO dataset with only 49 samples. The great effectiveness of the MMR has thus been validated for a broad range of multi-target regression tasks. The proposed MMR improves over STL and mKRR by significant margins on all 18 datasets, which shows its effectiveness in modeling inter-target correlations. STL and mKRR are regarded as the baseline methods that predict multiple targets independently without
[Fig. 2: Convergence of the alternating optimization on the EDM and SCM1d datasets; each panel plots the objective value J(A, S) and the aMSE against the iteration number (1 to 20).]

relatively small (2) targets and the SCM1d dataset with relatively larger (16) targets. The convergence with respect to the iteration steps is plotted in Fig. 2. Both the objective function value and the average mean square error (aMSE) decrease monotonically with the alternation steps. Although we show the first 20 steps, the algorithm converges within only 10 iterations on the EDM dataset and within 15 iterations on the SCM1d dataset. The consistently quick convergence shows the great efficiency of the alternating optimization and guarantees the practical implementation of the MMR.
[39] … in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. AAAI Press, 2015, pp. 1980–1986.
[40] B. Recht, M. Fazel, and P. A. Parrilo, “Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization,” SIAM Review, vol. 52, no. 3, pp. 471–501, 2010.
[41] F. Nie, H. Huang, and C. Ding, “Low-rank matrix recovery via efficient Schatten p-norm minimization,” in Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence. AAAI Press, 2012, pp. 655–661.
[42] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 2, pp. 301–320, 2005.
[43] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), pp. 267–288, 1996.
[44] J. Gillberg, P. Marttinen, M. Pirinen, A. J. Kangas, P. Soininen, M. Ali, A. S. Havulinna, M.-R. Järvelin, M. Ala-Korpela, and S. Kaski, “Multiple output regression with latent noise,” arXiv preprint arXiv:1410.7365, 2014.
[45] P. Gong, J. Zhou, W. Fan, and J. Ye, “Efficient multi-task feature learning with calibration,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014, pp. 761–770.
[46] G. S. Kimeldorf and G. Wahba, “A correspondence between Bayesian estimation on stochastic processes and smoothing by splines,” The Annals of Mathematical Statistics, vol. 41, no. 2, pp. 495–502, 1970.
[47] F. Nie, H. Huang, X. Cai, and C. H. Ding, “Efficient and robust feature selection via joint ℓ2,1-norms minimization,” in Advances in Neural Information Processing Systems, 2010, pp. 1813–1821.
[48] S. Zheng, X. Cai, C. Ding, F. Nie, and H. Huang, “A closed form solution to multi-view low-rank regression,” in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[49] B. Rakitsch, C. Lippert, K. Borgwardt, and O. Stegle, “It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals,” in Advances in Neural Information Processing Systems, 2013, pp. 1466–1474.
[50] M. Sánchez-Fernández, M. de Prado-Cumplido, J. Arenas-García, and F. Pérez-Cruz, “SVM multiregression for nonlinear channel estimation in multiple-input multiple-output systems,” IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2298–2307, 2004.
[51] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[52] F. Dinuzzo, “Learning output kernels for multi-task problems,” Neurocomputing, vol. 118, pp. 119–126, 2013.
[53] P. Gong, J. Ye, and C. Zhang, “Robust multi-task feature learning,” in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2012, pp. 895–903.
[54] G. H. Golub and C. F. Van Loan, Matrix Computations. JHU Press, 2012, vol. 3.
[55] L. Armijo, “Minimization of functions having Lipschitz continuous first partial derivatives,” Pacific Journal of Mathematics, vol. 16, no. 1, pp. 1–3, 1966.
[56] J. C. Platt, “Fast training of support vector machines using sequential minimal optimization,” Advances in Kernel Methods, pp. 185–208, 1999.
[57] J. Kivinen, A. J. Smola, and R. C. Williamson, “Online learning with kernels,” IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2165–2176, 2004.
[58] A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” in Advances in Neural Information Processing Systems, 2007, pp. 1177–1184.
[59] K. Yano, “On harmonic and Killing vector fields,” Annals of Mathematics, pp. 38–45, 1952.
[60] E. Spyromitros-Xioufis, G. Tsoumakas, W. Groves, and I. Vlahavas, “Multi-target regression via input space expansion: treating targets as inputs,” Machine Learning, vol. 104, no. 1, pp. 55–98, 2016.
[61] D. Tuia, J. Verrelst, L. Alonso, F. Pérez-Cruz, and G. Camps-Valls, “Multioutput support vector regression for remote sensing biophysical parameter estimation,” IEEE Geoscience and Remote Sensing Letters, vol. 8, no. 4, pp. 804–808, 2011.
[62] T. Aho, B. Ženko, S. Džeroski, and T. Elomaa, “Multi-target regression with rule ensembles,” Journal of Machine Learning Research, vol. 13, no. 1, pp. 2367–2407, 2012.
[63] H. Zou and H. H. Zhang, “On the adaptive elastic-net with a diverging number of parameters,” Annals of Statistics, vol. 37, no. 4, p. 1733, 2009.

Xiantong Zhen received the B.S. and M.E. degrees from Lanzhou University, Lanzhou, China, in 2007 and 2010, respectively, and the Ph.D. degree from the Department of Electronic and Electrical Engineering, the University of Sheffield, UK, in 2013. He is currently a postdoctoral fellow with the University of Western Ontario, London, Ontario, Canada. His research interests include machine learning, computer vision and medical image analysis.

Mengyang Yu received the B.S. and M.S. degrees from the School of Mathematical Sciences, Peking University, Beijing, China, in 2010 and 2013, respectively, and the Ph.D. degree from the Department of Computer Science and Digital Technologies, Northumbria University, Newcastle upon Tyne, U.K., in 2017. Currently, he is a postdoctoral researcher at the Computer Vision Laboratory, ETH Zurich, Switzerland. His research interests include computer vision, machine learning, and information retrieval.

Xiaofei He received the B.S. degree in Computer Science from Zhejiang University, China, in 2000 and the Ph.D. degree in Computer Science from the University of Chicago in 2005. He is a Professor in the State Key Lab of CAD & CG at Zhejiang University, China. Prior to joining Zhejiang University, he was a Research Scientist at Yahoo! Research Labs, Burbank, CA. His research interests include machine learning, information retrieval, and computer vision. He is a senior member of the IEEE.