Asymmetric Multi-Task Learning with Local Transference

SAULLO H. G. OLIVEIRA, School of Electrical and Computer Engineering (FEEC), University of Campinas
(Unicamp), Brazil
ANDRÉ R. GONÇALVES, Lawrence Livermore National Laboratory, USA
FERNANDO J. VON ZUBEN, School of Electrical and Computer Engineering (FEEC), University of Camp-
inas (Unicamp), Brazil
In this paper, we present the Group Asymmetric Multi-Task Learning (GAMTL) algorithm, which automatically learns from data
how tasks transfer information among themselves at the level of a subset of features. In practice, for each group of features
GAMTL extracts an asymmetric relationship supported by the tasks, instead of assuming a single structure for all features.
The additional flexibility promoted by local transference in GAMTL allows any two tasks to have multiple asymmetric
relationships. The proposed method leverages the information present in these multiple structures to bias the training of
individual tasks towards more generalizable models.
The solution to GAMTL's associated optimization problem is an alternating minimization procedure involving task
parameters and multiple asymmetric relationships, thus leading to smaller convex sub-problems. GAMTL was evaluated
on both synthetic and real datasets. To evidence GAMTL's versatility, we generated a synthetic scenario characterized by
diverse profiles of structural relationships among tasks. GAMTL was also applied to the problem of Alzheimer's Disease (AD)
progression prediction. Our experiments indicated that the proposed approach not only increased prediction performance, but
also estimated scientifically grounded relationships among multiple cognitive scores, taken here as multiple regression tasks,
and regions of interest in the brain, directly associated here with groups of features. We also employed stability selection
analysis to investigate GAMTL's robustness to data sampling rate and hyper-parameter configuration. GAMTL source code is
available on GitHub: [Link]
CCS Concepts: • Applied computing → Health informatics; • Theory of computation → Models of learning; Non-convex optimization.
Additional Key Words and Phrases: multi-task learning, structural sparsity, structural learning.

1 INTRODUCTION
Multi-Task Learning (MTL) promotes information sharing among multiple related tasks, aiming at improving the
generalization capacity of individual tasks. In the last two decades, we have witnessed a significant increase in the
number of MTL proposals as well as in the variety of applications of MTL methods [7, 33, 37]. One central question
in MTL is how information flows from task to task during training. To that end, a proper characterization of
the task relationship structure is required. Existing methods range from simple models with strong assumptions
about how tasks are related [6, 8, 15, 41] to more complex models that implement intricate learning procedures
Authors' addresses: Saullo H. G. Oliveira, shgo@[Link], School of Electrical and Computer Engineering (FEEC), University of
Campinas (Unicamp), Av. Albert Einstein, Nº 400 - Cidade Universitária, Campinas, São Paulo, Brazil, 13083-852; André R. Gonçalves,
andre@[Link], Lawrence Livermore National Laboratory, USA; Fernando J. Von Zuben, vonzuben@[Link], School of Electrical and
Computer Engineering (FEEC), University of Campinas (Unicamp), Av. Albert Einstein, Nº 400 - Cidade Universitária, Campinas, São
Paulo, Brazil, 13083-852.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first
page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy
otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from
permissions@[Link].
© 2022 Association for Computing Machinery.
1556-4681/2022/1-ART1 $15.00
[Link]

ACM Trans. Knowl. Discov. Data.


1:2 • S. H. G. Oliveira, et al.

to automatically unveil task relationships [9, 18, 38, 39]. MTL also shares similarities with Transfer Learning
methods, with the difference that it is more general in the way it models transference. In MTL we want to improve
the performance of all involved tasks, while in transfer learning we use source tasks to improve the performance
of one or more target tasks [27].
As we will see in Section 2, although current state-of-the-art MTL methods can model and learn the task
relationship information from data, two major drawbacks are present: (i) most of the methods assume that tasks
are symmetrically related, that is, task A affects task B in the same way as task B affects task A; and (ii) if two
tasks are related they must influence each other on the entire set of features, which we will call global feature
transference.
There have been attempts to encode more flexible structures into the models and to alleviate the mentioned
downsides, as we will see in Section 2, usually in two directions: enabling transference to involve subsets of
features, i.e., local feature transference; or estimating the structure of transference, even in an asymmetrical way.
But no previous attempt in the literature explores both directions simultaneously.
Recent works used MTL to model the connection between cognitive scores and the progression of Alzheimer's
Disease (AD) [21, 22, 42], helping to understand how the most common form of dementia in the world [17]
evolves both physiologically and cognitively. As people live longer and we improve methods to identify and
diagnose dementia, the number of people living with dementia is expected to more than triple by 2050 when
compared with the estimates of 2018, according to the World Alzheimer Report of 2018. Beyond the
suffering brought to a massive number of families, the worldwide financial cost of dementia was estimated to
be US$ 1 trillion for the year of 2018 and is expected to double by 2030. Notice that these estimates do not
account for the recent burden imposed by COVID-19 on health care systems worldwide. Since the neuronal
degeneration of AD proceeds for years before the full onset of the disease, medical treatment is more effective in
the early stages of the disease. Therefore, the prediction of AD progression can help to identify markers of AD
stages, as well as highlight how each region of the brain influences the outcome of such scores.
The simultaneous prediction of different cognitive scores associated with subjects at distinct stages of AD, based
on features extracted from brain images (obtained from ADNI, the Alzheimer's Disease Neuroimaging Initiative),
can benefit from MTL, especially from models that can directly encode structural a priori information, such as
grouped features and asymmetric transference. By considering multiple cognitive scores as multiple tasks, [22]
and [42] show how MTL methods can encode in the design matrix each Region of Interest (ROI) in the brain as a
group of features using the Group LASSO [36] regularization, but are not able to circumvent the global feature
transference assumption. Despite the promising results, current MTL methods still present limitations in the way
they capture the relationships among tasks in a realistic and interpretable way.
In Section 3 we present GAMTL [29], a Regularized MTL method that meets three goals:
(1) consider that features are organized in a group structure, thus enabling local feature transference;
(2) estimate a task relationship matrix for each group of related features;
(3) allow tasks to be asymmetrically related.
We compare the proposal with state-of-the-art methods in Section 4, by contrasting their results in an artificial
setting, and also on the prediction of different cognitive scores related to Alzheimer's Disease. The results highlight
the performance of the method and the interpretability of the explainable relationship structure. In order to
validate the robustness of GAMTL with respect to data sampling rate and hyper-parameter settings, a Stability
Selection procedure was performed. A clustering analysis considering the stable set of features makes it clear how
the flexibility of GAMTL allows the method to automatically capture the distinct roles features can play on related
tasks. The source code, in the Python programming language, is available at the URL: [Link]
Notation: Matrices are represented using uppercase letters, while scalars are represented by lowercase letters.
Vectors are lowercase in bold. For any matrix A, a_{i·} is the i-th row of A, and a_{·j} is the j-th column of A. Also, a_{ij} is
the scalar at row i and column j of A. The i-th element of any vector a is represented by a_i. For any two vectors
x, y the Hadamard product is denoted by (x ⊙ y)_i = x_i y_i.

2 LEARNING MULTIPLE TASKS


Let T be a set of T learning tasks. Each task t ∈ T has its own dataset, containing m_t samples from a common
feature space, X_t ∈ R^{m_t × n}, together with y_t ∈ R^{m_t} for a regression task t, or y_t ∈ {0, 1}^{m_t} for a binary classification
task t. In MTL, we train all tasks simultaneously, estimating a parameter vector w_t for each individual task
t. During the training process, we leverage information from related tasks, such that all tasks tend to have
better generalization performance when compared to a process that trains each task independently (Single-Task
Learning, STL) and does not account for similarities between the tasks.
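As a concrete illustration of this setup, the per-task datasets and the independently trained (STL) baseline can be sketched as follows. The shapes, sample sizes, and noise level below are hypothetical, chosen only for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: T = 3 regression tasks over a common feature
# space of n = 5 features; each task t has its own sample count m_t.
n, sample_sizes = 5, [30, 40, 25]
tasks = []
for m_t in sample_sizes:
    X_t = rng.standard_normal((m_t, n))      # design matrix, m_t x n
    w_true = rng.standard_normal(n)          # unknown task parameters
    y_t = X_t @ w_true + 0.1 * rng.standard_normal(m_t)
    tasks.append((X_t, y_t))

# Single-Task Learning (STL) baseline: fit each w_t independently by
# ordinary least squares, ignoring any relationship among tasks.
W_stl = np.column_stack(
    [np.linalg.lstsq(X_t, y_t, rcond=None)[0] for X_t, y_t in tasks]
)
print(W_stl.shape)  # (n, T) = (5, 3)
```

MTL methods improve on this baseline by coupling the columns of W during training, as discussed next.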

2.1 Regularized Multi-Task Learning


Regularized Multi-Task Learning is a family of MTL models that uses regularization terms to encourage information
sharing among the related tasks during training. Regularization terms can be applied directly to the parameters
of the tasks (penalizing their magnitude in the optimization process) and also as a means of encouraging
transference among multiple tasks. A canonical formulation for the Regularized Multi-Task Learning problem is
given by Equation (1):

\[
\min_{W} \; \sum_{t=1}^{T} \mathcal{L}_t(w_t) + \mathcal{R}(W), \qquad (1)
\]

where L_t : R^n → R is a suitable convex loss function for task t, and R is a regularization term over all task
parameters. When the task weights are stacked as columns of a matrix, we represent the task parameters as
W ∈ R^{n×T}.
Examples of loss functions for regression problems are Mean Squared Error and Mean Absolute Error, while
classification problems may use a Logistic Loss. Prior knowledge of the application is what drives the choice of
regularization terms, especially by leveraging sparsity properties and algebraic structures that enforce and/or
capture the relationship among tasks.
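To make Equation (1) concrete, the sketch below evaluates the objective for a given parameter matrix W, using mean-squared-error losses and, purely as one illustrative choice of R, a squared Frobenius penalty. The function name and toy data are ours:

```python
import numpy as np

def mtl_objective(W, tasks, lam):
    """Evaluate Eq. (1): sum_t L_t(w_t) + R(W), with MSE losses and,
    as one illustrative choice, R(W) = lam * ||W||_F^2."""
    data_loss = sum(
        np.mean((X_t @ W[:, t] - y_t) ** 2)
        for t, (X_t, y_t) in enumerate(tasks)
    )
    return data_loss + lam * np.sum(W ** 2)

rng = np.random.default_rng(1)
n, T = 4, 2
tasks = [(rng.standard_normal((10, n)), rng.standard_normal(10))
         for _ in range(T)]
W = np.zeros((n, T))
# With W = 0 the penalty vanishes and only the data losses remain.
obj = mtl_objective(W, tasks, lam=0.1)
print(obj)
```

The methods surveyed below differ only in the choice of R(W) and, later, in extra structural variables coupled to W.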
Evgeniou and Pontil [8] started from the general premise that the parameter vectors w_t, t = 1, · · · , T, should
be close to each other, penalizing the deviation of each parameter vector from the average vector comprising
all tasks. Elaborating further, Zhou et al. [41] considered that task parameters are grouped into K groups, K being
a hyper-parameter of the model. In this case, all tasks in the same group tend to exhibit similar values for
their parameters. Additionally, every task must be part of a single group, which in turn is a more flexible way
of handling the effects of unrelated tasks but still does not completely account for them. Both methods induce all
the parameter vectors to pursue the average behavior of their corresponding group, but if tasks are related only
through subsets of features, these methods will fail to capture the relationship and will enforce it
on features that are not supposed to be related. By relying on metric distances to measure task similarities, they
may incur the curse of dimensionality if the parameter vectors lie in a high-dimensional space. This implies
that, for interpretability purposes, even when two tasks belong to the same cluster we still have no reason to
believe that they are related, as metric distances tend to become meaningless in high-dimensional spaces.
The usage of norms as regularization terms is successfully employed in MTL. The l_1-norm as a regularization
term applied to the parameters of each task individually encourages a sparse activation of its components. As
highlighted by [11], sparse models are easier to interpret, as we have fewer active variables in the solution;
they are also computationally more convenient, usually requiring significantly less memory than their dense
counterparts. Theoretically, they also present the nice property of being able to recover the exact support (i.e.,
the set of active variables of a vector) of a given model [34, 40] if certain conditions are satisfied.


By applying it to different arrangements of parameters of the multiple tasks, the l_1-norm encourages independent
feature or task selection in the information sharing over all tasks. However, if the parameters present
some structural correlation (groups of correlated features, for example), the same support recovery guarantees
do not apply. The mixed l_{p,q}-norms are suited to enforce that groups of features are jointly active or absent. If
one feature of a given group is active, all features in the same group should be active; and if one feature is not
active, the entire group of features should be absent. The Dirty Model [15] takes advantage of l_{p,q}-norms to relate
features among all tasks. Instead of using this norm directly on the parameter matrix, the authors factorize the
parameter matrix into a sum of two matrices W = S + B, where S, B ∈ R^{n×T}. They apply a different sparsity
penalization on each factor matrix: the sum of l_1-norms (the l_{1,1}-norm) on the rows of S induces sparsity over the
parameters of all tasks; and the sum of l_∞-norms (the l_{1,∞}-norm) on the rows of B relates each feature over all tasks. The
factorization strategy adds the flexibility of sharing features between tasks when convenient, since one feature
can be active for all tasks but each task is free to avoid this feature through the regularization on the second matrix.
But we are still left with two limiting properties: (i) the l_{1,∞}-norm encourages similar values for each feature
across all tasks, implying that one feature has the same impact on the outcome of all related tasks; and (ii) the
model does not consider the case of grouped features.
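A minimal sketch of the Dirty Model's regularizer, assuming the W = S + B factorization above (the function name and toy matrices are ours):

```python
import numpy as np

def dirty_penalty(S, B, lam_s, lam_b):
    """Sketch of the Dirty Model regularizer for W = S + B:
    an entry-wise l1 penalty on S plus the sum of row-wise
    l_inf norms of B (the l_{1,inf}-norm), which ties each
    feature (row) across all tasks (columns)."""
    l11 = np.sum(np.abs(S))                    # l_{1,1}: sparsity everywhere
    l1inf = np.sum(np.max(np.abs(B), axis=1))  # l_{1,inf}: row-wise max over tasks
    return lam_s * l11 + lam_b * l1inf

S = np.array([[1.0, 0.0], [0.0, -2.0]])
B = np.array([[0.5, -0.5], [0.0, 3.0]])
print(dirty_penalty(S, B, lam_s=1.0, lam_b=1.0))  # 3.0 + (0.5 + 3.0) = 6.5
```

Note how only the row-wise maximum of B is penalized: once a feature is active for one task, the other tasks can use it at no extra cost, which is exactly the shared-feature flexibility described above.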
The Group LASSO regularization [36], also referred to as the l_{2,1} case of the mixed l_{p,q}-norms, is a regularization
that accounts for grouped features. Let the task features be partitioned into G groups of correlated features. Each
group g ∈ G = {1, · · · , G} consists of a subset of features in X_t for all tasks t ∈ T. Let X_t^g be the design matrix
restricted to the features present in a group g for task t, and w_t^g be the task parameter with the same dimension
as w_t but admitting non-null values only at locations associated with features belonging to group g, and having
null values at the remaining positions. When p = 2 and q = 1 we have:

\[
\mathcal{R}_{GL}(W) = \|W\|_{2,1} = \sum_{t \in T} \sum_{g \in G} \| w_t^g \|_2 .
\]

Notice that the partition of features into groups is the same for all tasks. In this regularization, each feature must
belong to one group, although isolated features can be put into a singleton group. As it penalizes the l_1-norm
of a vector of G l_2-norms (one per group), when one element is forced to zero, all variables of this group are
forced to zero, thus leading to w_t^g = 0 for some g ∈ G. However, when two groups overlap and only one group is
active in the final solution, the group that is not active will have all its features zeroed, even the features that are
shared with the active group. The recovered support of this norm is then the complement of the union of the
overlapping groups [14, 28].
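The group penalty above can be computed directly from the parameter matrix and a feature partition; the function name, group layout, and toy values below are ours:

```python
import numpy as np

def group_lasso_penalty(W, groups):
    """Group LASSO (l_{2,1}) penalty over a feature partition:
    sum over tasks t and groups g of ||w_t^g||_2, where w_t^g is
    the slice of column w_t indexed by group g."""
    return sum(
        np.linalg.norm(W[g, t])
        for t in range(W.shape[1])
        for g in groups
    )

# Hypothetical example: n = 4 features split into two groups, T = 2 tasks.
groups = [np.array([0, 1]), np.array([2, 3])]
W = np.array([[3.0, 0.0],
              [4.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
# Task 0: ||(3,4)||_2 + ||(0,0)||_2 = 5; task 1: 0 + 1 = 1.
print(group_lasso_penalty(W, groups))  # 6.0
```

Because each group contributes through a single l_2 term, shrinking that term to zero switches off the whole group at once, which is the all-or-nothing behavior described above.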
In order to overcome this structural bias, Jacob et al. [14] proposed an extension to the Group LASSO where
the feature vector is decomposed as a sum of representations for each group, w_t = Σ_{g∈G} w_t^g, applying the l_2-norm
(or l_∞) on each group. This regularization is called the Latent Group LASSO, or Overlapping Group LASSO, and
can be posed as:

\[
\mathcal{R}_{OGL}(W) = \sum_{t \in T} \sum_{g \in G} d_g \, \| w_t^g \|_2 ,
\]

where w_t = Σ_{g∈G} w_t^g, ∀t ∈ {1, · · · , T}, and the d_g are independent weights, accounting for the cardinality of each
group. Notice that the support for each task is a union of groups, not the complement, as a feature shared by two
groups will have its value preserved for the active group and be zeroed in the inactive group. See [14, 28] for
more details.
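The decomposition can be sketched for a single task as follows; the latent vectors, weights, and function name are our own illustrative choices:

```python
import numpy as np

def latent_group_lasso_penalty(latents, d):
    """Latent (overlapping) Group LASSO sketch for a single task:
    the parameter vector is represented as w_t = sum_g v_g, where each
    latent v_g is supported on a (possibly overlapping) group g, and
    the penalty is sum_g d_g * ||v_g||_2."""
    return sum(d_g * np.linalg.norm(v_g) for v_g, d_g in zip(latents, d))

# Two overlapping groups over n = 3 features: {0, 1} and {1, 2}.
v1 = np.array([1.0, 2.0, 0.0])   # latent supported on group {0, 1}
v2 = np.array([0.0, 0.0, 0.0])   # group {1, 2} is inactive
w_t = v1 + v2                    # recovered support: union of active groups
d = [1.0, 1.0]
print(latent_group_lasso_penalty([v1, v2], d))
```

Here feature 1 is shared by both groups, yet its value survives in w_t because the active group's latent carries it; only the inactive group's latent is zeroed.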
Both the sparsity-inducing l_1 and the l_{p,q}-norms present strong support recovery guarantees when appropriate
conditions are met. However, the performance of methods based on l_{p,q}-norms depends on how features are shared
across tasks. For the l_{1,q}-norm, Negahban and Wainwright [24] showed that if the number of tasks sharing a
group of features is less than a threshold, or even if the parameter values of features of the same group are highly
uneven, the regularization could perform worse than the l_1-norm. Ideally, each group of features should be free
to play distinct roles depending on the task, i.e., each task may have its own support (the set of non-null elements in
w_t^g). In this case, we still need a mechanism to select which tasks should transfer for each group independently.
All methods presented so far allow tasks to transfer in different ways, but exhibit several limitations, as we
need to meet strong assumptions beforehand. Since all tasks are considered to be related, they are not robust to
unrelated tasks; most of them do not account for grouped features, and when they do, all tasks must share the
same sparsity pattern, which implies that each group of features has the same influence on all task outcomes. In
the next section, we proceed to a distinct family of MTL methods that extends this setting by unveiling how tasks
are related while learning the tasks' parameters, which may allow tasks to have a more versatile transference
structure.

2.2 Modeling Task Relationships: Structure Estimation for Multi-Task Learning


In MTL, the term Structure Learning refers to the process of estimating not only all tasks' parameters but also
how transference occurs from one task to another. This is important because we can learn the task transference
patterns instead of having to assume how tasks are related in order to encode this prior knowledge into the parameters of
the model.
MTRL [38] uses a probabilistic framework and places a matrix-variate prior distribution on task coefficients to
model their relationship. They estimate a precision matrix Ω that encodes information about the task relationships.
Similarly, MSSL [9] relies on a probabilistic framework, in which a sparse precision matrix is learned from the
data to capture task relationships and to help in isolating unrelated tasks. They also use a LASSO penalty on the
task parameters for automatic feature selection. A semi-parametric copula distribution is used as prior for the
task parameter matrix, capturing non-linear correlation among tasks. On one hand, both methods are competitive
and can capture meaningful transference among tasks. On the other hand, since the transference structure is
encoded in a precision matrix, both methods share the property that transference between two tasks is symmetric.
Moreover, as the precision matrix relates two tasks over all task parameters, these methods do not account for
groups of features.
GO-MTL [18] considers a latent space where task parameters can be linearly decomposed, opening the
possibility of overlapping groups of related tasks. Let L ∈ R^{n×k}, where k is the dimension of the latent basis, and
let S ∈ R^{k×T} be a matrix with the weights of a linear combination of tasks. Assuming that W = LS, the associated
optimization problem is defined as:

\[
\min_{L, S} \; \sum_{t=1}^{T} \mathcal{L}_t(L s_t) + \lambda_1 \|S\|_1 + \lambda_2 \|L\|_F^2 ,
\]

where ∥S∥_1 is the entry-wise l_1-norm and ∥L∥_F = Tr(LL^T)^{1/2} is the Frobenius norm of a matrix.
The norm on L restrains the magnitude of task parameters, while the
sparsity term on S enforces the tasks to derive from a small subset of the latent basis L. The relationship between
tasks occurs when two tasks share components of L, based on their decomposition coded in S. If a task does
not share basis vectors in S with any other task, it may be interpreted as an outlier task. GO-MTL considers
that tasks are related in possibly overlapping groups, i.e., one task can be part of several groups. However, when
two tasks are related, all parameters are involved in this relationship, since each component of the latent basis
spans all task parameters. The method is still state-of-the-art when we consider MTL methods devoted to
linear models with structure estimation, and the strategy of decomposing the task parameters into a latent
basis is still competitive.
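The GO-MTL objective can be sketched as below; the dimensions, data, and function name are hypothetical, and we again use MSE losses for concreteness:

```python
import numpy as np

def gomtl_objective(L, S, tasks, lam1, lam2):
    """GO-MTL sketch: W = L S, entry-wise l1 penalty on S,
    squared Frobenius penalty on L; s_t is column t of S."""
    W = L @ S
    data_loss = sum(
        np.mean((X_t @ W[:, t] - y_t) ** 2)
        for t, (X_t, y_t) in enumerate(tasks)
    )
    return data_loss + lam1 * np.sum(np.abs(S)) + lam2 * np.sum(L ** 2)

rng = np.random.default_rng(2)
n, k, T = 6, 2, 3                     # k latent components shared by T tasks
L = rng.standard_normal((n, k))
S = np.zeros((k, T)); S[0, 0] = 1.0   # task 0 uses only the first basis vector
tasks = [(rng.standard_normal((8, n)), rng.standard_normal(8))
         for _ in range(T)]
print(gomtl_objective(L, S, tasks, lam1=0.1, lam2=0.1))
```

In this toy configuration, tasks 1 and 2 share no basis vectors with task 0, so task 0 would read as an outlier under the interpretation given above.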
AMTL [19] assumes that the parameters of a task t can be approximated by a sparse linear combination of the
parameters of all other tasks. In other words, w_t ≈ W b_t, where b_t ∈ R^T is a vector with the coefficients of the
linear combination. For obvious reasons, a task cannot participate in its own formulation, thus b_{tt} = 0 ∀t ∈ T.
In this case, the parameters of all tasks serve as a latent basis. The authors also use the task losses to weight
transference: relationships must flow from tasks with lower cost (easier) to tasks with higher cost (harder). Let b_t
be a column of a matrix B ∈ R^{T×T}. Each column t indicates how the parameters of the other tasks participate in
the linear combination that approximates the parameters of task t, and a row t indicates the degree to which
the parameters of a task t participate in the approximation of the parameters of other tasks. Therefore, B encodes
the relationships among tasks in an asymmetric scheme: the transference from task t to task s may not be the
same as that from task s to task t.
The related optimization problem is written as follows:

\[
\min_{W, B} \; \sum_{t=1}^{T} \left[ (1 + \lambda_1 \| b_{t\cdot} \|_1) \, \mathcal{L}_t(w_t) + \lambda_2 \| w_t - W b_t \|_2^2 \right].
\]

In the first term, the cost of task t weights the l_1-norm applied to b_{t·} (the t-th row of B), i.e., the transferences from
task t to all other tasks. (λ_1, λ_2) are regularization hyper-parameters. The asymmetric transference is encoded in
a set of variables that are distinct from the variables involved in prediction, which allows AMTL to achieve a
flexible regularization of related tasks. Nevertheless, AMTL also enforces global feature transference.
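Under the same notation, the AMTL objective can be evaluated as in the sketch below. The matrix layout (b_t as column t of B, zero diagonal) follows the description above; the function name and toy numbers are ours:

```python
import numpy as np

def amtl_objective(W, B, losses, lam1, lam2):
    """AMTL sketch: column b_t of B approximates w_t from the other
    tasks' parameters (zero diagonal); row t of B carries the outgoing
    transference of task t and is scaled by that task's loss."""
    assert np.allclose(np.diag(B), 0.0)   # a task never uses itself
    obj = 0.0
    for t in range(W.shape[1]):
        out_l1 = np.sum(np.abs(B[t, :]))              # row t (outgoing)
        recon = np.sum((W[:, t] - W @ B[:, t]) ** 2)  # column t (incoming)
        obj += (1.0 + lam1 * out_l1) * losses[t] + lam2 * recon
    return obj

# Two identical tasks: task 1 is perfectly reconstructed from task 0,
# while task 0 receives nothing and pays the full reconstruction cost.
W = np.array([[1.0, 1.0],
              [2.0, 2.0]])
B = np.array([[0.0, 1.0],
              [0.0, 0.0]])
losses = [0.5, 0.5]
print(amtl_objective(W, B, losses, lam1=1.0, lam2=1.0))
```

The coupling of each task's loss with its outgoing row of B is what makes high-cost (harder) tasks pay more for transferring out, steering information to flow from easier to harder tasks.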
Unlike transfer learning settings [25-27], in which one or more tasks are the source of transference to a
target task, in our multi-task learning framework all tasks are simultaneously sources and receptors of shared
information. We proceed now to the exposition of GAMTL, a regularized MTL method that uses group information
over the task features to provide a flexible relationship structure that allows local feature transference, admitting a
distinct interplay on each group of features, as well as asymmetric information sharing. For a more comprehensive
reading on the current status of MTL research, please refer to the recent survey in [37].

3 GAMTL: AN EXTENDED MTL MODEL


Group Asymmetric Multi-Task Learning (GAMTL) [29] is an MTL method that accounts for grouped features in
the design matrices of linear models while estimating how tasks share information. The estimated relationship
structure considers each group of features independently, enabling a bidirectional transference between any
two tasks. In this flexible formulation, tasks can transfer differently depending on how a group of features
benefits their predictions: if a group of features is relevant for some tasks, transference occurs. On the other
hand, transference refrains when the same group of features is not relevant for a different set of tasks. This work
extends [29] as follows: we expand the mathematical formulation to better emphasize the asymmetric flow
and the local transference among tasks; we compare the computational complexity with an MTL baseline; we
include more related methods in the experiments; and we provide a Stability Selection analysis [23], used to
establish how robust the transference structure estimated by GAMTL is to noise in the data and to hyper-parameter
initialization. The subsequent inspection of the most robust groups of features is of direct interpretation, as it encodes
linear relationships between tasks. We also show relations of these results with other findings in the literature.
As in Lee et al. [19], GAMTL may also consider the loss of each task as a relative measure of the task difficulty,
using this information to weight the transference. This makes tasks with higher costs less inclined to influence
tasks with smaller costs, while tasks with smaller costs are encouraged to transfer, resulting in asymmetric local
transference. Let the task features be partitioned into G = {1, · · · , G} groups of correlated features in X_t for all
tasks t ∈ T. X_t^g is the design matrix restricted to the features present in group g for task t, and w_t^g is already defined
in Section 2. We start by assuming that the task parameters can be decomposed into a sparse linear combination
of the parameters of the other tasks, considering each group of features independently. This is similar to the
approach in [19], with the addition of considering groups of features. In this case w_t^g ≈ Σ_{s∈T\t} b_{st}^g w_s^g, where
b_{st}^g is a scalar that encodes the influence of task s on task t, restricted to the group of features g. For all s, t ∈ T, the
b_{st}^g compose an intra-group relationship matrix B^g ∈ R^{T×T}, where a row b_{t·}^g encodes the influence of task t on
all other tasks at group g, and a column b_t^g encodes the influence of the other tasks on task t at group g, which
results in G intra-group relationship matrices. Each one can be seen as the adjacency matrix of a directional
graph transference structure: nodes are tasks and directional weighted edges indicate transference from one task
to another. Based on the latent representation of each task parameter vector, w_t ≈ Σ_{g∈G} W^g b_t^g, where W^g is the
task parameter matrix with values restricted to the group g, and zeros elsewhere. Eq. (2) shows the resulting MTL
optimization problem.
\[
\begin{aligned}
\min_{W,\; B^g \,\forall g \in G} \quad & \sum_{t \in T} \left[ \frac{1}{m_t} \Big( 1 + \lambda_1 \sum_{g \in G} \| b_{t\cdot}^g \|_1 \Big) \mathcal{L}(w_t)
+ \frac{\lambda_2}{2} \sum_{g \in G} \| w_t^g - W^g b_t^g \|_2^2
+ \lambda_3 \sum_{g \in G} d_g \, \| w_t^g \|_2 \right] \qquad (2) \\
\text{subject to} \quad & w_t = \sum_{g \in G} w_t^g , \\
& b_t^g \geq 0, \quad \forall g \in G \text{ and } t \in T .
\end{aligned}
\]
The first term computes the loss function of each task weighted by the number of samples. Therefore, it takes
into account sample imbalance among tasks, while also using the loss to weight transference from task t to the
other tasks. The l_1-norm applied to b_{t·}^g is used to enforce sparsity on the estimated relationships among the tasks.
This helps us prune the search space while keeping only the most relevant transferences per group of features.
The second term penalizes the difference between the parameters of a specific task t and the linear combination
of parameters from the tasks with which task t is grouped. Notice that this term considers how task t is
related to possibly different tasks for each group of features independently. Together with the equality constraint
on each w_t, the last term corresponds to the Overlapping Group LASSO regularization. The constraint on the B^g
variables restricts the way tasks relate by allowing only non-negative values in the linear combination. However,
in case this restriction is not suitable for the application, we present an optimization procedure for the more
relaxed variant (without the restriction on the B^g values). GAMTL uses the transference matrices B^g in a way that
allows us to use the Group LASSO while estimating how tasks share information, instead of forcing transference
involving all tasks on each group of features.
GAMTL contains three hyper-parameters that impact how transference occurs. When λ_1 = 0, λ_2 = 0, and
λ_3 = 0, independent linear models are recovered, following a Single Task Learning (STL) approach. If only λ_3 > 0,
we still have independent linear models per task, but regularized by the Overlapping Group LASSO. When λ_2 > 0, we
control the transference flexibility from many groups of related tasks (one per group of features) to w_t. With
λ_1 > 0, the sparsity of the transference is activated.
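As a sanity check of Eq. (2), the sketch below evaluates the objective for given per-group latents W^g and relationship matrices B^g. This is our own illustrative implementation, not the authors' released code; the data layout (a list of n x T latent matrices, one per group) is an assumption made for the example:

```python
import numpy as np

def gamtl_objective(W_g, B, tasks, d, lam1, lam2, lam3):
    """Evaluate the GAMTL objective of Eq. (2). W_g[g] is the n x T
    latent parameter matrix of group g (zero outside the group), so
    w_t = sum_g W_g[g][:, t]; B[g] is the T x T intra-group
    relationship matrix of group g."""
    G = len(W_g)
    W = sum(W_g)                           # equality constraint on each w_t
    obj = 0.0
    for t, (X_t, y_t) in enumerate(tasks):
        m_t = X_t.shape[0]
        loss = np.sum((X_t @ W[:, t] - y_t) ** 2)
        out_l1 = sum(np.sum(np.abs(B[g][t, :])) for g in range(G))
        recon = sum(np.sum((W_g[g][:, t] - W_g[g] @ B[g][:, t]) ** 2)
                    for g in range(G))
        grp = sum(d[g] * np.linalg.norm(W_g[g][:, t]) for g in range(G))
        obj += (1.0 + lam1 * out_l1) * loss / m_t \
               + 0.5 * lam2 * recon + lam3 * grp
    return obj

# Toy check: with all parameters at zero, only the data losses remain.
n, T, G = 4, 2, 2
rng = np.random.default_rng(3)
tasks = [(rng.standard_normal((6, n)), rng.standard_normal(6))
         for _ in range(T)]
W_g = [np.zeros((n, T)) for _ in range(G)]
B = [np.zeros((T, T)) for _ in range(G)]
obj = gamtl_objective(W_g, B, tasks, d=[1.0, 1.0],
                      lam1=0.1, lam2=0.1, lam3=0.1)
print(obj)
```

Setting every λ to zero in this sketch (or every latent to zero, as here) makes the regularization terms vanish and recovers the per-task losses of the STL limit described above.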
Figure 1 shows a flowchart presenting the training process for GAMTL. The input consists of a labeled training
set for each task, with the task features structured into groups. The grouped partition of features must be the
same for all tasks' design matrices. However, the partition is arbitrary, allowing non-contiguous groups of features
to overlap, despite Figure 1 inducing contiguity of features. An alternating optimization procedure performs the
training process, switching between the estimation of the task parameters and the relationships among tasks. The
relationships among tasks are encoded into G matrices, a transference structure that foments local transference and
is equivalent to a multi-digraph. In this multi-digraph, each level of the graph corresponds to a group of features
where tasks can be related. Tasks are related independently for each group of features in an asymmetrical fashion.
Finally, on the right we have the output for each task.
By representing the relationship among tasks via multiple matrices, and considering the parameters of the
tasks as a latent space for relationship, GAMTL promotes unique lexibility for the transference:
• Tasks may be related only on subsets of features.
• Groups of features can play distinct roles on diferent groups of related tasks.

ACM Trans. Knowl. Discov. Data.


1:8 • S. H. G. Oliveira, et al.

Fig. 1. On the left we see an input data representation: a design matrix and labels for each task, along with a possibly overlapping partition of the input feature set into groups, which is the same for all tasks. The training procedure is depicted in the middle, where an alternating optimization takes place. One step involves the optimization of the task parameters, so that each task is free to find its own feature sparsity pattern and the relationship between any pair of tasks is enforced locally to each group. The second step estimates how tasks are related considering each group of features. The resulting relationship matrices are shown as the adjacency matrix of a multi-digraph, where each level corresponds to a group of features, recursively used in the first step as the structural relationship among tasks, thus implementing the asymmetric local transference. The output is shown on the right, consisting of the predicted labels for each task.

• Transference is asymmetric: the influence of task t on task s may differ from the influence of task s on task t.
Using the categorization in the survey of [37], GAMTL belongs to the parameter-based transference category of MTL models. Another important aspect of our formulation is that GAMTL is designed for linear base models. The structure that encodes the relationship of the tasks is based on linear combinations that can be easily interpreted. The assumption that the parameters of one task can be decomposed as a linear combination of the parameters of other tasks on each group of features may be too restrictive for multi-layer nonlinear models, such as neural networks.
When we consider all optimization variables at the same time, Eq. (2) ends up being a non-convex optimization problem, possibly with the presence of local minima [10]. In the sequel, we derive smaller convex sub-problems that allow us to employ an alternating optimization procedure.

3.1 Solving GAMTL Formulation


Let w_t ∈ R^n and b_t^g ∈ R^T for all g ∈ G, t ∈ T partition the objective function variables. We obtain smaller convex sub-problems by considering each one of these partitions at a time while fixing the other variables, characterizing a multi-convex optimization problem.
GAMTL uses an alternating strategy in terms of each w_t, while keeping w_s ∀s ∈ T \ t and B^g ∀g ∈ G fixed, and then optimizing with respect to each b_t^g, ∀g ∈ G, t ∈ T, as shown in Algorithm 1. As commonly used in alternating optimization strategies, the procedure is carried out until there is a sufficiently small change in the values of the variables between successive iterates [3, 10, 35].

Asymmetric Multi-Task Learning with Local Transference • 1:9

3.2 Optimizing Task Parameters


Isolating Eq. (2) in terms of w_t, t ∈ T, with all remaining variables fixed, we have:

$$\min_{\mathbf{w}_t} \; \frac{1}{m_t}\Big(1 + \lambda_1\|\mathbf{b}_{\bar{t}}\|_1\Big)\mathcal{L}(\mathbf{w}_t) + \frac{\lambda_2}{2}\sum_{g\in\mathcal{G}}\big\|\mathbf{w}_t^g - \mathbf{W}^g\mathbf{b}_t^g\big\|_2^2 + \frac{\lambda_2}{2}\sum_{s\in\mathcal{T}\setminus t}\sum_{g\in\mathcal{G}}\big\|\tilde{\mathbf{w}}_s^g - \mathbf{w}_t^g\, b_{ts}^g\big\|_2^2 + \lambda_3\sum_{g\in\mathcal{G}} d_g\|\mathbf{w}_t^g\|_2, \tag{3}$$

where

$$\tilde{\mathbf{w}}_s^g = \mathbf{w}_s^g - \sum_{u\in\mathcal{T}\setminus\{s,t\}} \mathbf{w}_u^g\, b_{us}^g, \quad \forall g \in \mathcal{G}.$$
The first term is composed of a convex loss function, as the ℓ1-norm on b_t̄ has a constant value. The second term is the projection of w_t onto the other task parameters, which is also a convex term. The third term computes the interference of w_t on the projections of the other task parameters, being a sum of convex terms. The last term is the Group LASSO regularization, a convex and non-differentiable term.
Eq. (3) can be solved using an accelerated proximal method, such as FISTA [2]. Let us decompose the objective function into f : R^n → R and h : R^n → R ∪ {∞}, both closed proper convex functions, with f being L-Lipschitz continuous (L can be found with a backtracking procedure):

$$f(\mathbf{w}_t) = \frac{1}{m_t}\Big(1 + \lambda_1\|\mathbf{b}_{\bar{t}}\|_1\Big)\mathcal{L}(\mathbf{w}_t) + \frac{\lambda_2}{2}\sum_{g\in\mathcal{G}}\big\|\mathbf{w}_t^g - \mathbf{W}^g\mathbf{b}_t^g\big\|_2^2 + \frac{\lambda_2}{2}\sum_{s\in\mathcal{T}\setminus t}\sum_{g\in\mathcal{G}}\big\|\tilde{\mathbf{w}}_s^g - \mathbf{w}_t^g\, b_{ts}^g\big\|_2^2, \tag{4}$$

and h being the non-differentiable Group LASSO regularization:

$$h(\mathbf{w}_t) = \lambda_3 \sum_{g\in\mathcal{G}} d_g \|\mathbf{w}_t^g\|_2.$$

The proximal operator for the Group LASSO regularization is applied group-wise, for each g ∈ G:

$$\big[\operatorname{prox}_{\lambda h}(\mathbf{w})\big]^g = \begin{cases} \dfrac{\|\mathbf{w}^g\|_2 - \lambda d_g}{\|\mathbf{w}^g\|_2}\,\mathbf{w}^g, & \|\mathbf{w}^g\|_2 \geq \lambda d_g, \\[4pt] \mathbf{0}, & \text{otherwise.} \end{cases}$$
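In code, this block soft-thresholding operator can be sketched as follows (a minimal NumPy version for non-overlapping groups; the `groups` argument, a list of index arrays, and the function name are our own illustrative choices, not the released implementation):

```python
import numpy as np

def prox_group_lasso(w, groups, lam, d=None):
    """Block soft-thresholding: groups whose norm falls below their
    threshold lam * d_g are zeroed; the others are shrunk toward zero.
    Assumes non-overlapping groups; the overlapping case needs extra care."""
    out = np.zeros_like(w, dtype=float)
    d = np.ones(len(groups)) if d is None else d
    for g, idx in enumerate(groups):
        norm = np.linalg.norm(w[idx])
        if norm >= lam * d[g]:
            out[idx] = (norm - lam * d[g]) / norm * w[idx]
        # otherwise the whole group stays at zero (inactive)
    return out
```

For example, with w = [3, 4, 0.1, 0.1], groups {1, 2} and {3, 4}, and λ = 1, the first group (norm 5) is scaled by 4/5, while the second group (norm ≈ 0.14) is zeroed entirely.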

Algorithm 1 GAMTL Alternating Minimization

1: Initialize W ∼ N(0, I_|T|) and set B^g = 0, ∀g ∈ G
2: while convergence not reached do
3:   for t = 1, · · · , T do
4:     update w_t optimizing task parameters (Eq. 3)
5:   end for
6:   for t = 1, · · · , T do
7:     for g ∈ G do
8:       update b_t^g optimizing task relationships (Eq. 5)
9:     end for
10:  end for
11: end while
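To make step 4 concrete, the sketch below runs FISTA on a simplified single-task instance of the objective, ½∥Xw − y∥² + λ Σ_g ∥w^g∥₂, i.e., Eq. (3) stripped of the inter-task coupling terms and with unit group weights; the Lipschitz constant is taken as the largest eigenvalue of XᵀX instead of being found by backtracking. All names are illustrative:

```python
import numpy as np

def prox_gl(w, groups, lam):
    # block soft-thresholding for non-overlapping groups (unit weights d_g)
    out = np.zeros_like(w)
    for idx in groups:
        norm = np.linalg.norm(w[idx])
        if norm >= lam:
            out[idx] = (norm - lam) / norm * w[idx]
    return out

def fista_group_lasso(X, y, groups, lam, n_iter=300):
    L = np.linalg.eigvalsh(X.T @ X).max()   # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    v = w.copy()
    t = 1.0
    for _ in range(n_iter):
        grad = X.T @ (X @ v - y)                       # gradient step on f
        w_next = prox_gl(v - grad / L, groups, lam / L)  # proximal step on h
        t_next = (1 + np.sqrt(1 + 4 * t**2)) / 2
        v = w_next + (t - 1) / t_next * (w_next - w)   # momentum (acceleration)
        w, t = w_next, t_next
    return w
```

The same accelerated loop applies to the full Eq. (3); only the gradient of f changes.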


3.3 Optimizing Task Relationships


The matrices B^g, g ∈ G, encode the relationships between tasks. Since a task cannot be represented by itself, we fix b_{tt}^g = 0. The strategy used by GAMTL is to isolate Eq. (2) in terms of b_t^g, with all remaining variables fixed. Let $\tilde{\mathbf{w}}_t = \mathbf{w}_t - \sum_{\tilde{g}\in\mathcal{G}\setminus g} \mathbf{W}^{\tilde{g}}\mathbf{b}_t^{\tilde{g}}$, and let $\mathbf{W}^g = [\mathbf{w}_1^g/\mathcal{L}(\mathbf{w}_1), \cdots, \mathbf{w}_T^g/\mathcal{L}(\mathbf{w}_T)]$. The resulting problem is:

$$\min_{\mathbf{b}_t^g} \; \frac{1}{2}\big\|\tilde{\mathbf{w}}_t^g - \mathbf{W}^g\mathbf{b}_t^g\big\|_2^2 + \frac{\lambda_1}{\lambda_2}\big\|\mathbf{b}_t^g\big\|_1 \quad \text{subject to} \quad \mathbf{b}_t^g \geq 0. \tag{5}$$

This problem is similar to the Adaptive LASSO [43] and thus is convex but not differentiable at all points. Without the constraints in Eq. (5), it can be solved using any standard method for the LASSO. To handle the constraints (b_t^g ≥ 0, ∀g ∈ G, t ∈ T), GAMTL uses the Alternating Direction Method of Multipliers (ADMM) [4].
In the ADMM framework, the inequality constraint can be represented via an indicator function:

$$\begin{aligned} \min_{x,\, z_1,\, z_2} \;\; & f(x) + h_1(z_1) + h_2(z_2) \\ \text{subject to} \;\; & x = z_1, \\ & x = z_2, \end{aligned} \tag{6}$$

where $h_1 = h$, and $h_2(z_2)$ is defined as

$$h_2(z_2) = \mathbb{1}_{\mathbb{R}_+}(z_2) = \begin{cases} 0, & z_2 \geq 0, \\ +\infty, & \text{otherwise.} \end{cases}$$

The augmented Lagrangian of Eq. (6), with the shorthand $\mathcal{L}_{\rho_1,\rho_2} = \mathcal{L}_{\rho_1,\rho_2}(x, z_1, z_2, u_1, u_2)$, is then

$$\begin{aligned} \mathcal{L}_{\rho_1,\rho_2} = \; & f(x) + h_1(z_1) + h_2(z_2) \\ & + \frac{\rho_1}{2}\Big(\|x - z_1 + u_1\|_2^2 - \|u_1\|_2^2\Big) \\ & + \frac{\rho_2}{2}\Big(\|x - z_2 + u_2\|_2^2 - \|u_2\|_2^2\Big), \end{aligned}$$
resulting in the following ADMM update steps:

$$\begin{aligned} z_i^{k+1} &:= \operatorname*{argmin}_{z_i} \; h_i(z_i) + \frac{\rho_i}{2}\big\|x^k - z_i + u_i^k\big\|_2^2, \quad i \in \{1, 2\}, \\ x^{k+1} &:= \operatorname*{argmin}_{x} \; f(x) + \sum_{j=1}^{2} \frac{\rho_j}{2}\big\|x - z_j^{k+1} + u_j^k\big\|_2^2, \\ u_i^{k+1} &:= u_i^k + x^{k+1} - z_i^{k+1}, \quad i \in \{1, 2\}. \end{aligned}$$

The two z_i-update steps can run in parallel, and the same holds for the u_i updates. The z_i-update steps are solved with proximal operators: soft-thresholding, S_κ(a) = (1 − κ/|a|)_+ a, and the projection onto the non-negative orthant R_+, (a)_+ = max(0, a). The x-update step is a convex problem with a differentiable function f plus quadratic terms, which can be solved in closed form via Cholesky decomposition or by a proper gradient-based method. The GAMTL implementation, in the Python programming language, is available on GitHub 1.

1 [Link]


3.4 Computational Complexity


The existence of many transference matrices tends to offer better results and interpretability to the model, at the price of some extra computational effort. The cost of each GAMTL iteration is mostly driven by steps 4 and 8 of Algorithm 1, which involve a FISTA and an ADMM execution, respectively.
For step 4, we compute ∇f and prox_{λh}. The cost of the proximal operator is G g_max^2 n, where g_max is the size of the largest group. The derivative of Eq. (4) needs T^2 G g_max flops. The higher costs involved in the gradient computation are O(T^2 G n), with the other costs negligible. The full computation of ∇f is then O(T^2 G n), so a FISTA iteration has an overall cost of O(T^2 G n).
In step 8, we prepare w̃_t^g using GTn + n flops. For W^g, we compute the loss function of each task at a cost of O(n^2 + mn), and this value is reused for all iterations over the same g. ADMM computes a soft-thresholding operator, the projection of z, and the update of u, all with negligible costs. Solving the x-update in closed form with Cholesky decomposition uses T^3 flops, with a back-solve cost of n^2, resulting in an overall cost of Tn^2 when considering n > T. Therefore, the cost of a complete ADMM iteration is of the order of O(Tn^2).
As one iteration of GAMTL consists of T FISTA and GT ADMM executions, each with a fixed number of iterations, GAMTL presents a time complexity of order O(T^3 G n + T^2 G n^2) when considering n > T. There is an overhead in learning how tasks are related for each group of features. Computing gradients is expensive, and the relationship matrices also involve all tasks in a bi-directional way. However, as we expect tasks to have a sparse activation of their parameters, most of the computation involving the relationships of tasks can be skipped for groups of features that are not active.
Let us consider [8] as a baseline comparison. That proposal assumes that all tasks are related over all parameters by penalizing their deviation from the mean (basically clustering the tasks into a single group), and presents a time complexity of order O(T^3 m^3). Comparing with the worst-case time complexity of order O(T^3 G n + T^2 G n^2) presented by GAMTL, we can see that GAMTL does not increase the complexity related to the number of tasks T, but adds a quadratic cost on the number of features. As the method is designed to handle data of high dimensionality, the usage of sparsity both in the grouped features of the tasks and in the estimated relationships is essential to mitigate this impact. As most groups of features will be set to zero, a great portion of the computation can be skipped. Considering that GAMTL adds a detailed structure of transference among all tasks for each group of features, the additional computational burden is counterbalanced by the gain in flexibility, as demonstrated in the experiments of Section 4.

4 EXPERIMENTS
To evaluate the performance of GAMTL when looking for better generalization over multiple tasks, we show the results of two experiments: one using an artificial setting, and one on the problem of predicting Alzheimer's Disease progression. We also provide an extensive stability analysis of the support of the task parameters and task relationship variables on this problem. For all experiments, we denote the variants of GAMTL as follows:
• GAMTL - standard formulation presented in Eq. 2;
• GAMTL-nl - without considering the loss as a weighting coefficient (Appendix A.1);
• GAMTL-nr - B^g ≥ 0 ∀g ∈ G (Appendix A.2); and
• GAMTL-nlnr - B^g ≥ 0 ∀g ∈ G, without considering the loss as a weighting coefficient (Appendix A.3).

4.1 Artificial Application


To validate GAMTL and illustrate the scope of possible applications of the method, we designed an artificial and fully controlled dataset as follows. We generate 8 regression tasks with 50 attributes partitioned into two groups g1 = [1, . . . , 25] and g2 = [26, . . . , 50]. For the first two tasks, the true values of the parameters of the first group of attributes are sampled from a standard Gaussian distribution, N(0, I_25), and the second group of

parameters is set to zero. Parameters of the third and fourth tasks are generated in the same fashion, but the first group is set to zero while the second group is sampled from a standard Gaussian distribution. The last four tasks are based on the previous ones: we generated their parameters as a linear combination of the parameters of the previous tasks. The linear combination coefficients are sampled from a truncated Gaussian distribution, ensuring that all values are positive.
The design matrix of each task is sampled from a standard Gaussian distribution, with 300 samples and 50 attributes. After that, we add Gaussian noise with σ = 0.4 for the first four tasks, and with σ = 2.9 for the remaining tasks. This difference in the amount of noise is related to our assumption of asymmetric transference based on loss. We expect the transference to occur from tasks with lower costs to tasks with higher costs, recovering the transference structure among all tasks. If all tasks presented the same level of noise, all transferences would be penalized similarly and the last four tasks would be equally encouraged to transfer back to the first tasks, resulting in quasi-symmetric matrices B^g, ∀g ∈ G.
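The generative process above can be sketched as follows (our reconstruction; the random seed and the use of the absolute value of a Gaussian as a stand-in for the truncated Gaussian are illustrative assumptions, since the exact bounds are not specified):

```python
import numpy as np

rng = np.random.default_rng(42)             # seed is an arbitrary choice
n_feat, g1, g2 = 50, slice(0, 25), slice(25, 50)

W = np.zeros((n_feat, 8))                   # one column of parameters per task
W[g1, 0:2] = rng.standard_normal((25, 2))   # tasks 1-2: active on group 1 only
W[g2, 2:4] = rng.standard_normal((25, 2))   # tasks 3-4: active on group 2 only
# tasks 5-8: positive linear combinations of the parameters of tasks 1-4
C = np.abs(rng.standard_normal((4, 4)))     # stand-in for a truncated Gaussian
W[:, 4:8] = W[:, 0:4] @ C

sigmas = [0.4] * 4 + [2.9] * 4              # low noise first, high noise last
data = []
for t in range(8):
    X = rng.standard_normal((300, n_feat))  # design matrix of task t
    y = X @ W[:, t] + sigmas[t] * rng.standard_normal(300)
    data.append((X, y))
```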
The number of samples available to the models for training varied from 30 to 100, as all methods converge to similar performance beyond this range. The synthetic dataset is split so that 70% of the samples are used for training and 30% for testing. For each number of samples, we choose the hyper-parameters of all methods by a holdout procedure in which we split the training set into 70% for training and 30% for validation. The best parameters are used to train the models for 30 runs.
For this experiment, we compare the results of GAMTL with the LASSO [32] and the Group LASSO [14] as STL contenders. Hyper-parameters were chosen using the Python library Optuna [1], which, instead of using a grid search to find optimal hyper-parameters, implements a relational sampling strategy to search for the optimal values of some function in a given interval. In this case, the parameters of the search procedure are the limits of the values of each hyper-parameter and the number of trials used to update the relational sampling strategy. For each method, we sampled 200 trials in the search for hyper-parameter values, and chose the values with the best normalized mean squared error (NMSE) on the validation portion of the training data. For the LASSO, the search limits were λ ∈ [10^-5, 4], while for the Group LASSO λ ∈ [10^-5, 15]. All variants of GAMTL used λ1, λ2, λ3 ∈ [10^-5, 5]. We report the mean and standard deviation of the NMSE on the test set, over all runs.
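The NMSE used throughout can be taken as the MSE normalized by the variance of the ground-truth targets (a common definition, assumed here since the text does not spell out its normalizer):

```python
import numpy as np

def nmse(y_true, y_pred):
    # MSE divided by the variance of the ground truth:
    # 0 for a perfect prediction, roughly 1 for predicting the mean
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)
```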
Figure 2 shows the performance of all methods against the increasing number m of samples available for the training/testing procedure. When m ≤ 60, no STL method achieves reasonable NMSE values. Notice, however, that the Group LASSO outperforms the LASSO independently of the number of samples, precisely because it incorporates the group structure of the features.
All GAMTL variants improve upon STL, especially when the number of samples available for training is small, as in the interval between 30 and 70 samples. The four GAMTL variants present two distinct levels of improvement. The two variants that do not use the loss to penalize transference between tasks, i.e., GAMTL-nl and GAMTL-nlnr, improve upon the STL methods, but the best results are achieved by the two variants that consider the loss so as to transfer from tasks with a lower cost to tasks with a higher cost, i.e., GAMTL and GAMTL-nr. The improvement is more expressive when m = 30, where the scenario is highly ill-conditioned. As m increases, all methods start to have similar performance, as the number of samples provides enough information to solve the tasks.
As we are also interested in the relational structure that GAMTL provides, Figure 3 shows the relationship matrices B^g estimated by GAMTL when m = {40, 80, 90, 100}, side by side with the original B^g used in the generative process. We chose these numbers of samples to understand how precise the recovered structure is as we give more samples to the model. When m = 40, the task relationships are not similar to the true relationship matrices that we have designed. However, notice that GAMTL detects that tasks 5 to 8 are related among themselves. Since they are all based on the same two tasks, depending on the groups of attributes, they are indeed related among themselves. This explains why GAMTL can increase performance even when the number of samples is low. When m = 80, as we have more samples available, the recovered structure is sparser and less symmetrical,


Fig. 2. Normalized Mean Squared Error of all methods on the artificial dataset, with a varying number of samples available
for training. STL methods are shown using dashed lines. By leveraging the group partition information involving the features
of the tasks, the Group LASSO outperforms the LASSO. MTL methods are shown in solid lines. GAMTL variants show an
expressive gain in performance, especially when the number of training samples is low. Best viewed in color.

but the method still does not detect that all related tasks are based on tasks 1 to 4 (depending on the group of features). With m ≥ 90 samples per task, GAMTL detects the dependency of tasks 5-8 on the first four tasks, also incorporating the distinct pattern associated with each group of attributes. GAMTL shows a good approximation to the original transference scheme for both groups of features, and the asymmetric influence among tasks is fully recovered.

4.2 Predicting Cognitive Scores related to Alzheimer’s Disease Progression


Alzheimer's Disease (AD) is the most common form of dementia in the world [17]. As people live longer and we improve our capabilities of identifying and diagnosing dementia, the number of people living with dementia is expected to more than triple by 2050 when compared with the estimates of 2018, according to the World Alzheimer Report of 2018 [13]. In recognition of the need for global actions to mitigate and further investigate dementia, in May 2017 the World Health Assembly endorsed a global action plan 2 on a public health response to dementia, directed to policy-makers and international, regional, and national partners. The absence of a treatment to reverse the progression of this neurodegenerative disease fuels plenty of current research in the hope of understanding the underlying mechanisms of AD. Liu et al. [22] and Zhou et al. [42] have already shown that MTL can contribute to modeling the connection between cognitive scores (representing multiple regression tasks) and the progression of AD, considering multiple distinct Regions Of Interest (ROI) in the brain, with each ROI representing a group of features.
We test the performance of GAMTL in a real scenario on the ADNI dataset 3 . This dataset was collected
by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and was preprocessed by a team from University
2 [Link]
3 Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database
([Link]). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data
but did not participate in this experiment or in writing the subsequent analysis. A complete listing of ADNI investigators can be found at:
[Link]


Fig. 3. Original relationship matrices, followed by the relationship matrices estimated by GAMTL on the artificial dataset with different training sizes. When m ≤ 80, GAMTL obtains only a rough estimate of how tasks are related to each other. Compared to the original relationship matrices, GAMTL achieves a good approximation of the relationships among tasks when m ≥ 90, for both groups of features. The asymmetric influence among tasks is also fully recovered.

of California at San Francisco, as described in [22], which performed cortical reconstruction and volumetric segmentation with the FreeSurfer image analysis suite. It contains information from 816 subjects, the same for all tasks, divided into three stages: cognitively normal (CN) (228), with mild cognitive impairment (MCI) (399), and with Alzheimer's disease (AD) (189). There is a total of 327 features, including cortical thickness average, cortical volume, and sub-cortical volume. The groups of features in this application correspond to the features derived from many regions of interest (ROI) in the brain. Labels for this dataset include five cognitive measures: Rey Auditory Verbal Learning Test (RAVLT) Total score (TOTAL), RAVLT 30-minute delay score (T30), RAVLT recognition score (RECOG), Mini Mental State Exam score (MMSE), and Alzheimer's Disease Assessment Scale cognitive total score (ADAS). The usage of these scores is widespread, impacting drug trials, the assessment of the severity of AD symptoms, the progressive deterioration of functional ability, and deficiencies in memory, as highlighted in [22], thus evidencing the importance of this type of modeling.
In this experiment, we consider all STL and MTL methods used in the previous experiment, but add more state-of-the-art contenders. For completeness, we added AMTL [19], which is also based on using task parameters as a latent basis but does not account for groups of features; MT-SGL [22], which was proposed to handle this same problem; GO-MTL [18], which is based on a latent basis to model related tasks; MSSL [9], which accounts for unrelated tasks and estimates a precision matrix as the learning model structure for transference among tasks; MTRL [38], which uses a probabilistic framework and places a matrix-variate prior distribution on task coefficients to model their relationship; and MTFL [16], which groups tasks based on an orthogonal-complement sub-space decomposition where features are shared among tasks. As in the previous experiment, we used Optuna [1] to search for the values of the hyper-parameters of all methods, using 200 samples in the search for each method. This time we used a 5-fold cross-validation procedure, where each fold contains the same proportion of participants

Table 1. NMSE of all methods on the ADNI dataset (mean and standard deviation over 30 runs). GAMTL-nr had the best results (highlighted in bold), closely followed by the other GAMTL variants and the MTRL method. A Mann-Whitney U non-parametric test was run, confirming the significance of the score improvement when comparing GAMTL-nr with all other methods.

Method          NMSE
STL
  LASSO         0.840 (2.2·10^-16)
  Group-LASSO   0.977 (2.0·10^-1)
MTL
  GO-MTL        0.896 (1.1·10^-16)
  MSSL          0.818 (1.1·10^-16)
  MTFL          0.810 (2.2·10^-16)
  MT-SGL        0.801 (1.5·10^-13)
  MTRL          0.791 (2.2·10^-16)
  AMTL          0.898 (0.0)
  GAMTL         0.781 (1.4·10^-5)
  GAMTL-nl      0.787 (4.3·10^-5)
  GAMTL-nr      0.780 (2.2·10^-5)
  GAMTL-nlnr    0.789 (2.8·10^-5)

from the stages CN, MCI, and AD. The search limits used to tune the methods' hyper-parameters are as follows: for the LASSO, we searched λ ∈ [10^-5, 4], while for the Group LASSO λ ∈ [10^-5, 15]. For AMTL we used µ, λ ∈ [10^-5, 5]. GO-MTL had the number of groups set to 2, 3, 4, with ρ1 ∈ [10^-4, 10] and ρ2 ∈ [10^-4, 10]. MSSL had ρ1, ρ2 ∈ [10^-5, 10]. For MTFL, we had 2, 3, 4 as the number of task groups, and ρ1, ρ2 ∈ [10^-5, 10]. MT-SGL used r ∈ [10^-5, 15]. MTRL hyper-parameters were chosen as ρ1, ρ2 ∈ [10^-4, 10]. All variants of GAMTL used λ1, λ2, λ3 ∈ [10^-5, 5]. After that, we selected the hyper-parameter values with the best result in this step and trained the methods for 30 runs, to account for the initial randomness of the parameters of the tasks.
In Table 1 we see the overall performance of all methods using the NMSE metric. Values are the mean and
standard deviation of the 30 runs and the best result is highlighted in bold.
Among the STL methods, LASSO is the one presenting the best score, and most MTL methods achieved better results when compared with the LASSO. GAMTL variants achieved better results than all other methods, but presented more variation in their results than most of them. As GAMTL estimates more parameters, this is an expected outcome. We used a Mann-Whitney U test with p ≤ 0.05 and verified that the score difference between GAMTL-nr and all other methods was statistically significant.
For each task individually, we use the mean squared error (MSE) to compare the methods, with the results presented in Table 2; the mean absolute error (MAE) is reported in the appendix (Section C), in Table 3. For visual interpretation, the same information is depicted in Figure 4 as a bar plot. Each sub-figure presents a bar plot of the MSE obtained by all methods in the experiment for each task.
AMTL presented the smallest MSE for the task TOTAL, but showed poor performance for the other measurements. For the same task, the Group LASSO shows wide variance in its results. For the task T30, the LASSO presents the best result, closely followed by MT-SGL. For all other tasks, GAMTL variants had the most competitive performance. In contrast with the task TOTAL, when we consider the tasks RECOG and MMSE, AMTL shows poor performance. GO-MTL shows a similar behavior: it achieved competitive performance on some tasks, but presents poor results for the task MMSE. As for the ADAS task, the variation of performance among the methods is small. Each task benefits the most from a different strategy of transference, but still, task

Table 2. MSE of all methods per task in ADNI dataset. The best results on each task are highlighted in bold.

Method        | TOTAL              | T30                | RECOG              | MMSE               | ADAS
STL
  LASSO       | 0.857 (2.2·10^-16) | 0.617 (1.1·10^-16) | 0.962 (4.4·10^-16) | 0.618 (5.5·10^-16) | 0.556 (1.1·10^-16)
  Group-LASSO | 1.190 (9.9·10^-1)  | 0.705 (1.3·10^-1)  | 1.019 (9.6·10^-2)  | 0.736 (1.0·10^-1)  | 0.635 (7.1·10^-2)
MTL
  GO-MTL      | 0.837 (0.0)        | 0.643 (2.2·10^-16) | 0.842 (3.3·10^-16) | 0.856 (1.1·10^-16) | 0.584 (2.2·10^-16)
  MSSL        | 0.846 (2.2·10^-16) | 0.648 (3.3·10^-16) | 0.856 (0.0)        | 0.597 (4.4·10^-16) | 0.566 (1.1·10^-16)
  MTFL        | 0.851 (2.2·10^-16) | 0.648 (0.0)        | 0.839 (1.1·10^-16) | 0.588 (2.2·10^-16) | 0.554 (1.1·10^-16)
  MT-SGL      | 0.885 (5.9·10^-13) | 0.619 (5.6·10^-13) | 0.760 (7.0·10^-13) | 0.612 (3.4·10^-13) | 0.551 (4.7·10^-13)
  MTRL        | 0.848 (1.1·10^-16) | 0.674 (1.1·10^-16) | 0.786 (3.3·10^-16) | 0.579 (1.1·10^-16) | 0.520 (0.0)
  AMTL        | 0.784 (0.0)        | 0.712 (0.0)        | 1.046 (0.0)        | 0.777 (0.0)        | 0.507 (0.0)
  GAMTL       | 0.914 (2.7·10^-5)  | 0.653 (6.0·10^-5)  | 0.744 (1.5·10^-5)  | 0.560 (3.2·10^-5)  | 0.506 (1.8·10^-5)
  GAMTL-nl    | 0.860 (9.3·10^-5)  | 0.646 (6.9·10^-5)  | 0.794 (1.7·10^-4)  | 0.563 (4.5·10^-5)  | 0.528 (4.2·10^-5)
  GAMTL-nr    | 0.870 (5.0·10^-5)  | 0.654 (6.5·10^-5)  | 0.775 (6.0·10^-5)  | 0.555 (4.3·10^-5)  | 0.513 (3.3·10^-5)
  GAMTL-nlnr  | 0.857 (3.1·10^-5)  | 0.645 (5.6·10^-5)  | 0.801 (9.8·10^-5)  | 0.566 (6.1·10^-5)  | 0.531 (2.8·10^-5)

T30 could not benefit from MTL. As each method holds distinct premises for the transference among tasks, this result indicates that a single transference mechanism will not rule them all. Most importantly, when not improving performance, some MTL methods incur poorer performance.
We now focus on methods that account for grouped features, to see how GAMTL improves upon their results. Choosing the Group LASSO as the main reference, we take the difference in MSE between the Group LASSO and the GAMTL variants for each run. Results are shown in Figure 5. Positive values indicate the method had a smaller MSE than the Group LASSO (positive transference), while negative values indicate negative transference.
GAMTL variants improved the generalization performance on all tasks when compared with the Group LASSO. Strong improvements are exhibited for the RECOG, MMSE, and ADAS tasks, while not incurring negative transference for the most challenging tasks (TOTAL and T30). RECOG is the task that benefits the most from the GAMTL models.
In Figure 6 we present a heatmap of the structural sparsity produced by each method that achieved the best result on at least one task. We take the mean of the parameter values for each group of parameters, and if the value is greater than zero, we consider it an active group, represented by a darker color. LASSO (Figure 6a) is used as a reference for the STL methods. AMTL obtained the best result for the task TOTAL and is represented in Figure 6b. Notice the presence of two groups of related tasks: TOTAL, T30, RECOG, and MMSE as part of one group, while ADAS was isolated in a singleton group. It is also noticeable that when tasks belong to the same group, they show a strongly related sparsity pattern on all task features.
GAMTL variants show sparser results (Figures 6c and 6d). The ADAS task also seems unrelated to the other tasks, presenting a different sparsity behavior in the GAMTL results. Both GAMTL methods allow the tasks to relate in different ways when sharing, thus guiding to a more flexible structural sparsity pattern for related tasks. In this case, GAMTL allows the ADAS task to be related to the other tasks only in a few groups of features.
The transference scheme encoded in the B^g, ∀g ∈ G, matrices is responsible for regularizing the parameters of the tasks to fit into the estimated relationships. As these matrices present interpretable information, we perform a Stability Selection [23] procedure that accounts for noise both in the data and in the hyper-parameter settings, validating which parameters of the learning models remain active in the final solution.


Fig. 4. MSE of all methods for each of the five cognitive measures, with a blue horizontal line highlighting the best performance.
For the task TOTAL, we can see that AMTL had the best performance, while Group LASSO shows some variance in their
results. For the task T30, the LASSO presents the best result, closely followed by MT-SGL. For all other tasks, GAMTL variants
had the most competitive performance.

4.3 Stability Selection on ADNI


Besides learning the parameters of the tasks, GAMTL also estimates GT^2 parameters for the relationships among the tasks. On one hand, we have another source of information retrieved from data; on the other hand, we have three hyper-parameters to fine-tune. This raises a question: is the set of active variables robust to the hyper-parameterization process and to data noise? As in [12, 20, 22], we use Stability Selection on the ADNI dataset both to validate the robustness of GAMTL and to highlight the interpretative capabilities of the model.



Fig. 5. GAMTL outperforms STL Group LASSO for each task. For the task TOTAL the gains vary due to the unstable
performance of Group LASSO on that task. For the task T30, we can see a consistent small gain, but GAMTL variants present
an expressive gain for tasks RECOG, MMSE, and ADAS.

Meinshausen and Bühlmann [23] proposed Stability Selection as a feature selection procedure that (i) relies on a sampling procedure to alleviate the importance of hyper-parameter selection and data noise; and (ii) computes the marginal probability of a feature being active over the total number of runs in the procedure.
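A minimal sketch of this procedure, with a plain LASSO (solved by ISTA) as the base learner instead of GAMTL; the subsample ratio of 1/2, the number of runs, and all names are illustrative choices rather than the paper's exact protocol:

```python
import numpy as np

def lasso_support(X, y, lam, n_iter=200):
    # ISTA: proximal gradient for 0.5*||Xw - y||^2 + lam*||w||_1
    L = np.linalg.eigvalsh(X.T @ X).max()
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        a = w - X.T @ (X @ w - y) / L
        w = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)
    return np.abs(w) > 1e-8   # active set of this run

def stability_selection(X, y, lams, n_runs=50, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    freq = np.zeros(n)
    for _ in range(n_runs):
        idx = rng.choice(m, size=m // 2, replace=False)  # subsample w/o replacement
        lam = rng.choice(lams)                           # vary the hyper-parameter too
        freq += lasso_support(X[idx], y[idx], lam)
    return freq / n_runs     # selection frequency tau_i per variable
```

Thresholding the returned frequencies (e.g., keeping variables with frequency ≥ 0.8) yields the stable set.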
Given a set of hyper-parameter values Γ, we choose a subset of the available dataset randomly and without
replacement, then the model is trained for N times. After that, we compute the frequency that a variable was active
in the found solutions and ilter the variables with a threshold. The overall process is described in Algorithm
2. For each variable i of our problem and a certain coniguration of hyper-parameters λ = λ 1 , λ 2 , λ 3 ∈ Γ, τi



Asymmetric Multi-Task Learning with Local Transference • 1:19
[Figure 6: sparsity patterns over ROIs (rows) and the tasks TOTAL, T30, RECOG, MMSE, and ADAS (columns) for (a) LASSO, (b) AMTL, (c) GAMTL, and (d) GAMTL-nr.]

Fig. 6. Sparsity pattern estimated by the methods with the best performance on at least one task. The darker cells indicate groups of attributes where the mean of their parameters is greater than zero. All methods show a distinct sparsity pattern on the ADAS task when compared to the other tasks. The results of STL Lasso in 6a show a visual similarity involving the parameters of all but the ADAS task. AMTL, in 6b, takes advantage of the relationship among the tasks, showing a clearer shared pattern, but one that is less sparse. When comparing the results of AMTL with the LASSO, we see some groups of features that became active for the ADAS task, but play no role in the STL result. GAMTL variants (shown in 6c and 6d) present even sparser results, with the benefit of not enforcing groups to be active for ADAS when the task is not related to the others, preserving the flexibility of tasks to share only the groups of features that are valuable for transference.

represents the percentage of times that variable i was active over all runs. Let Ŝλ = {τi | i ∈ W ∪ Bg, ∀g ∈ G} be the set of percentages, and Ŝ = {Ŝλ | λ ∈ Γ} be the set of percentages over all hyper-parameter values. A variable i is considered stable when its mean percentage over Ŝ is greater than a certain threshold. A ROI (which corresponds to a group of features) is stable if the mean of the percentages of all its features is greater than the threshold.
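The τi frequencies and the ROI-level test above can be sketched directly in NumPy. This is an illustrative sketch, not the repository's actual code; the array names (`active`, `roi_index`) and their shapes are our assumptions.

```python
import numpy as np

def stability_scores(active, threshold=0.8):
    """active: boolean array of shape (n_runs, n_vars), marking which
    variables were nonzero in each run, pooled over all hyper-parameter
    configurations. Returns per-variable frequencies tau and the stable mask."""
    tau = active.mean(axis=0)   # fraction of runs in which each variable was active
    return tau, tau > threshold

def stable_rois(tau, roi_index, threshold=0.8):
    """roi_index: array of shape (n_vars,) mapping each feature to its ROI.
    A ROI is stable when the mean frequency of its features exceeds the threshold."""
    rois = np.unique(roi_index)
    means = np.array([tau[roi_index == r].mean() for r in rois])
    return rois[means > threshold]

# toy example: 4 runs, 3 features; features 0-1 belong to ROI 0, feature 2 to ROI 1
active = np.array([[1, 1, 0],
                   [1, 1, 0],
                   [1, 0, 1],
                   [1, 1, 0]], dtype=bool)
tau, stable = stability_scores(active)
print(tau)                                    # tau = 1.0, 0.75, 0.25
print(stable_rois(tau, np.array([0, 0, 1])))  # ROI 0 only: mean 0.875 > 0.8
```

The same per-ROI thresholding yields the binary matrix used in the clustering analysis described next.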
We chose the model hyper-parameters from the set Γ = {λ1, λ2, λ3 | λ1, λ2 ∈ [0.001, 5], λ3 ∈ [0.0001, 1]} and present results using a threshold of 80%, which is commonly used in the literature.
For each ROI we take the mean of the stability percentages of its features and compare the value against the pre-defined threshold of 80%, resulting in a binary matrix Wstab ∈ Z^{G×T} whose entries indicate which groups are active for which tasks. For visualization purposes, we apply a clustering analysis on Wstab. We set the number of clusters to 2 by comparing the Silhouette Score of the samples after experimenting with values in the range between 2 and 10. The Silhouette Score helps us to validate the consistency of a clustering solution by measuring how similar each object is to its assigned cluster when compared to assigning it to other clusters. Its values range from −1 to 1, where a low value indicates that the object would be better assigned to a different cluster, while a high value indicates that the sample is well suited to its assigned cluster. When the rows of Wstab are partitioned into 2 clusters, no sample shows a negative Silhouette Score. We apply a K-means procedure with 30




Algorithm 2 Stability Selection

1: Ŝ = ∅
2: for (λ1, λ2, λ3) ∈ Γ do
3:   for run = 1 to N do
4:     for t ∈ T do
5:       Subsample X̃t, ỹt from Xt, yt without replacement, generating a dataset of size mt/2
6:     end for
7:     Initialize GAMTL with λ1, λ2, λ3
8:     Train GAMTL on X̃, ỹ
9:   end for
10:  Ŝλ = {τi | i ∈ W ∪ Bg, ∀g ∈ G}
11:  Ŝ = Ŝ ∪ Ŝλ
12: end for
13: Compute the mean over Ŝ and apply the threshold.

distinct runs to alleviate the effects of the random initialization, keeping the run with the lowest within-cluster sum of squares. Figure 7 presents Wstab split into those two groups.
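The Silhouette Score has a compact definition that is easy to implement directly. The sketch below is a simplified Euclidean-distance version of the per-sample score used above to rule out negative assignments; it is our own illustration, not the code used in the experiments.

```python
import numpy as np

def silhouette_samples(X, labels):
    """Per-sample silhouette score (b - a) / max(a, b): a is the mean
    distance to the other members of the sample's own cluster, b is the
    smallest mean distance to any other cluster. Singletons score 0."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i, li in enumerate(labels):
        same = labels == li
        same[i] = False
        if not same.any():          # singleton cluster: score stays 0
            continue
        a = dist[i, same].mean()
        b = min(dist[i, labels == lj].mean() for lj in set(labels) if lj != li)
        scores[i] = (b - a) / max(a, b)
    return scores

# two well-separated groups: every sample scores close to 1
X = [[0, 0], [0, 1], [10, 10], [10, 11]]
print(silhouette_samples(X, [0, 0, 1, 1]))  # all entries above 0.9
```

Averaging the per-sample scores for each candidate number of clusters gives the comparison used to settle on 2 clusters.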
In the first cluster (left), almost all ROIs are stable on the ADAS task, while almost no ROI is stable for the other tasks. The second cluster (right) shows stable ROIs for all tasks but ADAS. However, we can see that each cluster contains a few distinct active features depending on the task, showing the flexible transference among tasks. This is a key point in GAMTL models: the distinct behavior of features is an important characteristic of the MTL problem setting. If the model does not account for the distinct roles that features can play on related tasks, negative transference may occur. We should not enforce, on a different set of features, a relationship between two tasks that is highly expressed in another set of features.
To further explore the transference among tasks, we picked two ROIs that are active for all tasks in the second cluster: the Left Caudate and the Left Inferior Temporal. Figure 8a shows the illustrative anatomical location of the Left Caudate on a template brain, and Figure 8b shows the estimated relationship among tasks considering this ROI. The task RECOG is influenced by all other tasks (RECOG column) but influences only the task ADAS (see the row for task RECOG), while all other tasks are fully connected on this ROI. The Left Inferior Temporal ROI is depicted anatomically in Figure 8c. In this case, the ADAS task is not related to any other task; TOTAL and MMSE influence all other tasks, while receiving their influence as well; and RECOG influences the TOTAL and MMSE tasks while being influenced by TOTAL, T30, and MMSE. Even when choosing ROIs that are active in the solution of all tasks, we recover a different relationship scheme among tasks, stressing the need for a flexible mechanism to learn how transference occurs.
Considering now the estimated relationship matrices, for each Bg, ∀g ∈ G, we compute the average of the stability scores and choose the 6 ROIs with the highest average value:
• Left Cerebral Cortex;
• Right Inferior Temporal;
• Left Caudate;
• Left Accumbens Area;
• Left Pars Orbitalis;
• Left Superior Parietal.
Since the Left Caudate was already explored, we skip its presentation. Figure 9a illustrates the Left Cerebral Cortex, the ROI with the most stable transference among the tasks in all directions (Fig. 9d). This is the outermost layer




[Figure 7: heatmap of Wstab split into two clusters of ROIs (Group 1 and Group 2, rows) against the tasks TOTAL, T30, RECOG, MMSE, and ADAS (columns).]

Fig. 7. ROIs clustered by similar stability among all cognitive tests (tasks). The cluster on the left shows high activity for the ADAS task, with a sparse presence on the other tasks. On the other hand, the second cluster is highly active for all tasks but ADAS, clearly showing the flexible transference possibilities of GAMTL.

surrounding the brain, which serves as a connection for several ROIs. We can see strong relationships among tasks in this analysis.
In Figure 9b we see the Right Inferior Temporal ROI also presenting stable connections among all tasks (Fig. 9e), with the only exception being the transference from ADAS to RECOG.
The Accumbens Area is a small part of the Left Caudate ROI, being depicted in Figure 9c. The relationship matrix in Figure 9f shows fewer stable connections when compared to the results of the previous ROIs. The MMSE task is not influenced by any other task, while influencing all but the ADAS task. The Left Pars Orbitalis is shown in Figure 9g. As we can see in Figure 9i, the task pairs ADAS and TOTAL, and RECOG and MMSE, do not influence each other. Notice that, coincidentally, this ROI shows a symmetric relationship among tasks. Finally, the Left Superior Parietal (Figure 9h) presents sparser relationships among the tasks.




[Figure 8: (a) Left Caudate illustrative anatomical position; (b) relationship among all tasks for the Left Caudate ROI; (c) Left Inferior Temporal illustrative anatomical position; (d) relationship among all tasks for the Left Inferior Temporal ROI. The relationship matrices have the tasks TOTAL, T30, RECOG, MMSE, and ADAS on both axes.]

Fig. 8. The Left Caudate and Left Inferior Temporal ROIs, belonging to the second cluster, were stable on all tasks. On the left we see their illustrative anatomical positions; on the right we see the task relationships produced by GAMTL. Despite being part of the same cluster, these two ROIs present distinct transference among tasks.

These results agree with findings in the literature. For example, it is known that the left gray matter suffers greater loss than its symmetric counterpart in the presence of Alzheimer's Disease [30]. It is also known that the left hemisphere as a whole is impacted by AD, especially the Temporal and Parietal areas [5, 31]. In this case, GAMTL could find a stable solution in which the ROIs with the most transference activity are known to be related to the progression of Alzheimer's Disease.
In this section, we have provided empirical results of GAMTL in two scenarios, one using an artificial dataset and the second one on the problem of predicting Alzheimer's Disease progression, with a stability analysis. By considering group-sparsity while estimating an explainable relationship matrix for each group of features, GAMTL provides a higher level of analysis when compared with related methods in the literature. Stability Selection, with its sampling procedure and wide variation of hyper-parameter values, shows which features are most likely active in solutions found with the model. This allows us to benefit from the flexible transference mechanism to improve generalization capacity, and also to gain deeper insights into the different layers of relationship among tasks.




[Figure 9: illustrative anatomical positions of (a) the Left Cerebral Cortex, (b) the Right Inferior Temporal, (c) the Left Accumbens Area, (g) the Left Pars Orbitalis, and (h) the Left Superior Parietal, with the corresponding estimated relationship matrices in (d), (e), (f), (i), and (j). Each matrix has the tasks TOTAL, T30, RECOG, MMSE, and ADAS on both axes.]

Fig. 9. ROIs with the highest stability: the Left Cerebral Cortex, the Right Inferior Temporal, the Left Accumbens Area, the Left Pars Orbitalis, and the Left Superior Parietal. Each sub-figure shows the illustrative anatomical position of the ROI, together with the respective estimated relationship matrix.




5 CONCLUSION AND FUTURE WORK


In this paper we introduced Group Asymmetric Multi-Task Learning (GAMTL), an MTL method that represents the relationship between tasks in several matrices, considering groups of features independently. This distinctive flexibility in the transference among tasks presents interesting properties, mainly:
• Tasks may be related only on a few groups of features.
• Groups of features can play distinct roles in different groups of related tasks.
• Transference is asymmetric: the influence of a task t on a task s may be different from the influence of s on t.
The method allows tasks to transfer in a highly flexible way and learns a rich set of local relationship structures among tasks. The estimated relationship structure has a matrix representation that is statistically robust, easy to interpret, and helps to understand the subtle intricacies of the transference.
We have seen how the assumptions about transference between tasks represent a major factor in the performance gains one can achieve by using Multi-Task Learning (MTL) methods. Most current methods in the Regularized MTL setting make strong assumptions about task relationships, usually assuming that all features of the tasks are included in the transference. Some methods consider grouped features for transference, usually handled with the ℓp,q-norm family, in such a way that if one feature is active for one task, all features in the same group are also active for all other tasks. Some of them use the ℓ1-norm to avoid transference involving undesired groups of features. To overcome these limitations, GAMTL leverages the strategy of structural learning, a class of MTL methods that estimates how tasks are related during the training process, to learn more specific task transferences based on groups of features.
The optimization problem for the GAMTL formulation is not convex when considering all parameters at the same time, requiring an alternating optimization strategy. We alternate between optimizing task parameters and task relationships, resulting in convex sub-problems that are solved with a stable numerical procedure. The associated Python code is available on GitHub. The overhead added by learning how tasks are related for each group of features is counterbalanced by the gain in flexibility, as could be verified in the experimental settings.
On the problem of predicting cognitive scores to estimate Alzheimer's Disease progression, each cognitive score can be taken as a task, and tasks can be independently related at the level of Regions of Interest (ROIs) in the brain. GAMTL showed the best results for predicting the majority of the scores while estimating important relationships associated with the progression of Alzheimer's Disease. The stability selection procedure applied to the GAMTL parameters highlighted statistically robust relationships among cognitive scores, conditioned on regions of the brain taken as groups of features. The results corroborate previous independent results in the field.
The ability to learn how the features are asymmetrically related to each other in groups, instead of resorting to a priori and less flexible assumptions, seems an important research direction. We believe that any further initiative toward more parsimonious structures that are still able to encode a flexible transference among tasks is a promising direction to follow. Extending the estimation of relationship structures such as those of GAMTL to non-linear and regularizable learning models can profit even more from the asymmetric formulation with local transference, as is being planned for the research sequel.

ACKNOWLEDGMENTS
We acknowledge the Brazilian National Council for Scientific and Technological Development (CNPq), the São Paulo Research Foundation (FAPESP), and the Coordination for the Improvement of Higher Education Personnel (CAPES) for financial support.




REFERENCES
[1] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[2] Amir Beck and Marc Teboulle. 2009. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM J. Img. Sci. 2, 1 (March 2009), 183–202. [Link]
[3] James C. Bezdek and Richard J. Hathaway. 2002. Some Notes on Alternating Optimization. In Advances in Soft Computing, Nikhil R. Pal and Michio Sugeno (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 288–300.
[4] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. 2011. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Found. Trends Mach. Learn. 3, 1 (Jan. 2011), 1–122. [Link]
[5] Elisa Canu, Donald G. McLaren, Michele E. Fitzgerald, Barbara B. Bendlin, Giada Zoccatelli, Franco Alessandrini, Francesca B. Pizzini, Giuseppe K. Ricciardi, Alberto Beltramello, Sterling C. Johnson, and Giovanni B. Frisoni. 2011. Mapping the structural brain changes in Alzheimer's disease: The independent contribution of two imaging modalities. Journal of Alzheimer's Disease 26, Suppl 3 (2011), 263–274. [Link]
[6] Jianhui Chen, Ji Liu, and Jieping Ye. 2012. Learning Incoherent Sparse and Low-Rank Patterns from Multiple Tasks. ACM Trans. Knowl. Discov. Data 5, 4, Article 22 (Feb. 2012), 31 pages. [Link]
[7] Michael Crawshaw. 2020. Multi-Task Learning with Deep Neural Networks: A Survey. arXiv:2009.09796 [cs, stat] (Sept. 2020). [Link]
[8] Theodoros Evgeniou and Massimiliano Pontil. 2004. Regularized Multi-Task Learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Seattle, WA, USA) (KDD '04). Association for Computing Machinery, New York, NY, USA, 109–117. [Link]
[9] André R. Gonçalves, Fernando J. Von Zuben, and Arindam Banerjee. 2016. Multi-task Sparse Structure Learning with Gaussian Copula Models. Journal of Machine Learning Research 17, 33 (2016), 1–30. [Link]
[10] Jochen Gorski, Frank Pfeuffer, and Kathrin Klamroth. 2007. Biconvex sets and optimization with biconvex functions: a survey and extensions. Mathematical Methods of Operations Research 66, 3 (2007), 373–407. [Link]
[11] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. 2015. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman & Hall/CRC.
[12] Zengyou He and Weichuan Yu. 2010. Stable feature selection for biomarker discovery. Computational Biology and Chemistry 34, 4 (2010), 215–225. [Link]
[13] Alzheimer's Disease International. 2018. World Alzheimer Report. Technical Report. Alzheimer's Disease International. https://[Link]/resource/world-alzheimer-report-2018/
[14] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. 2009. Group Lasso with Overlap and Graph Lasso. In Proceedings of the 26th Annual International Conference on Machine Learning (Montreal, Quebec, Canada) (ICML '09). ACM, New York, NY, USA, 433–440. [Link]
[15] Ali Jalali, Pradeep Ravikumar, Sujay Sanghavi, and Chao Ruan. 2010. A Dirty Model for Multi-Task Learning. In Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 1 (Vancouver, British Columbia, Canada) (NIPS'10). Curran Associates Inc., Red Hook, NY, USA, 964–972.
[16] Zhuoliang Kang, Kristen Grauman, and Fei Sha. 2011. Learning with Whom to Share in Multi-Task Feature Learning. In Proceedings of the 28th International Conference on Machine Learning (Bellevue, Washington, USA) (ICML'11). Omnipress, Madison, WI, USA, 521–528.
[17] Zaven S. Khachaturian. 1985. Diagnosis of Alzheimer's Disease. Archives of Neurology 42, 11 (1985), 1097–1105. [Link]
[18] Abhishek Kumar and Hal Daumé. 2012. Learning Task Grouping and Overlap in Multi-Task Learning. In Proceedings of the 29th International Conference on Machine Learning (Edinburgh, Scotland) (ICML'12). Omnipress, Madison, WI, USA, 1723–1730.
[19] Giwoong Lee, Eunho Yang, and Sung Ju Hwang. 2016. Asymmetric Multi-task Learning Based on Task Relatedness and Loss. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48 (New York, NY, USA) (ICML'16). [Link], 230–238. [Link]
[20] Jun Liu, Shuiwang Ji, and Jieping Ye. 2009. Multi-task Feature Learning via Efficient ℓ2,1-norm Minimization. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (Montreal, Quebec, Canada) (UAI '09). AUAI Press, Arlington, Virginia, United States, 339–348. [Link]
[21] Xiaoli Liu, Peng Cao, André R. Gonçalves, Dazhe Zhao, and Arindam Banerjee. 2018. Modeling Alzheimer's Disease Progression with Fused Laplacian Sparse Group Lasso. ACM Trans. Knowl. Discov. Data 12, 6, Article 65 (Aug. 2018), 35 pages. [Link]
[22] Xiaoli Liu, André R. Gonçalves, Peng Cao, Dazhe Zhao, and Arindam Banerjee. 2018. Modeling Alzheimer's disease cognitive scores using multi-task sparse group lasso. Computerized Medical Imaging and Graphics 66 (2018), 100–114. [Link]
[23] Nicolai Meinshausen and Peter Bühlmann. 2010. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72, 4 (2010), 417–473. [Link]
[24] Sahand Negahban and Martin J. Wainwright. 2008. Joint Support Recovery under High-Dimensional Scaling: Benefits and Perils of ℓ1,∞-Regularization. In Proceedings of the 21st International Conference on Neural Information Processing Systems (Vancouver, British Columbia, Canada) (NIPS'08). Curran Associates Inc., Red Hook, NY, USA, 1161–1168.
[25] Shuteng Niu, Yihao Hu, Jian Wang, Yongxin Liu, and Houbing Song. 2020. Feature-based Distant Domain Transfer Learning. In 2020 IEEE International Conference on Big Data (Big Data). 5164–5171. [Link]
[26] Shuteng Niu, Meryl Liu, Yongxin Liu, Jian Wang, and Houbing Song. 2021. Distant Domain Transfer Learning for Medical Imaging. IEEE Journal of Biomedical and Health Informatics 25, 10 (Oct 2021), 3784–3793. [Link]
[27] Shuteng Niu, Yongxin Liu, Jian Wang, and Houbing Song. 2020. A Decade Survey of Transfer Learning (2010–2020). IEEE Transactions on Artificial Intelligence 1 (2020), 151–166.
[28] Guillaume Obozinski. 2011. Group Lasso with Overlaps: the Latent Group Lasso Approach.
[29] Saullo H. G. Oliveira, André R. Gonçalves, and Fernando J. Von Zuben. 2019. Group LASSO with Asymmetric Structure Estimation for Multi-Task Learning. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19. International Joint Conferences on Artificial Intelligence Organization, 3202–3208. [Link]
[30] Paul M. Thompson, Kiralee M. Hayashi, Greig de Zubicaray, Andrew L. Janke, Stephen E. Rose, James Semple, David Herman, Michael S. Hong, Stephanie S. Dittmer, David M. Doddrell, and Arthur W. Toga. 2003. Dynamics of Gray Matter Loss in Alzheimer's Disease. Journal of Neuroscience 23, 3 (2003), 994–1005. [Link]
[31] Paul M. Thompson, Michael S. Mega, Roger P. Woods, Chris I. Zoumalan, Chris J. Lindshield, Rebecca E. Blanton, Jacob Moussai, Colin J. Holmes, Jeffrey L. Cummings, and Arthur W. Toga. 2001. Cortical Change in Alzheimer's Disease Detected with a Disease-specific Population-based Brain Atlas. Cerebral Cortex 11, 1 (01 2001), 1–16. [Link]
[32] Robert Tibshirani. 1996. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58, 1 (1996), 267–288. [Link]
[33] Simon Vandenhende, Stamatios Georgoulis, Wouter Van Gansbeke, Marc Proesmans, Dengxin Dai, and Luc Van Gool. 2021. Multi-Task Learning for Dense Prediction Tasks: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021), 1–1. [Link] arXiv:2004.13379.
[34] Martin J. Wainwright. 2009. Sharp Thresholds for High-Dimensional and Noisy Sparsity Recovery Using ℓ1-Constrained Quadratic Programming (Lasso). IEEE Trans. Inf. Theor. 55, 5 (May 2009), 2183–2202. [Link]
[35] Yangyang Xu and Wotao Yin. 2013. A Block Coordinate Descent Method for Regularized Multiconvex Optimization with Applications to Nonnegative Tensor Factorization and Completion. SIAM Journal on Imaging Sciences 6 (07 2013). [Link]
[36] Ming Yuan and Yi Lin. 2006. Model Selection and Estimation in Regression with Grouped Variables. Journal of the Royal Statistical Society. Series B (Statistical Methodology) 68 (2006), 49–67.
[37] Yu Zhang and Qiang Yang. 2017. A survey on multi-task learning. arXiv preprint arXiv:1707.08114 (2017).
[38] Yu Zhang and Dit-Yan Yeung. 2010. A Convex Formulation for Learning Task Relationships in Multi-task Learning. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (Catalina Island, CA) (UAI'10). AUAI Press, Arlington, Virginia, United States, 733–742. [Link]
[39] Yu Zhang and Dit-Yan Yeung. 2014. A Regularization Approach to Learning Task Relationships in Multitask Learning. ACM Trans. Knowl. Discov. Data 8, 3, Article 12 (June 2014), 31 pages. [Link]
[40] Peng Zhao and Bin Yu. 2006. On Model Selection Consistency of Lasso. J. Mach. Learn. Res. 7 (Dec. 2006), 2541–2563.
[41] Jiayu Zhou, Jianhui Chen, and Jieping Ye. 2011. Clustered Multi-Task Learning via Alternating Structure Optimization. In Proceedings of the 24th International Conference on Neural Information Processing Systems (Granada, Spain) (NIPS'11). Curran Associates Inc., Red Hook, NY, USA, 702–710.
[42] Jiayu Zhou, Lei Yuan, Jun Liu, and Jieping Ye. 2011. A Multi-Task Learning Formulation for Predicting Disease Progression. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Diego, California, USA) (KDD '11). Association for Computing Machinery, New York, NY, USA, 814–822. [Link]
[43] Hui Zou. 2006. The Adaptive Lasso and Its Oracle Properties. J. Amer. Statist. Assoc. 101, 476 (2006), 1418–1429. [Link]




6 APPENDICES
A VARIANTS OF GAMTL
We call GAMTL the formulation that uses the loss to refrain tasks with higher costs from transferring to other tasks, and that restricts the values of all Bg to be equal to or greater than zero, as shown in Eq. 2. Consider this the standard formulation. However, in Section 4 we show results of four variants of GAMTL. These variants are based on two assumptions of the overall formulation presented in Eq. 2: (i) using the loss to regularize how much a task can transfer to other tasks; and (ii) using the restriction on the elements of Bg.

A.1 GAMTL-nl: No Loss


The standard formulation of GAMTL considers the loss of each task as a weight that multiplies the regularization parameter of the task transference. In this case, the transference of a task t to all other tasks is proportionally penalized, based on the value of the loss function of task t. Conversely, tasks with a low value of the loss function are encouraged to transfer to other tasks.
We call GAMTL-nl the variant in which this behavior is not active in the formulation. Eq. 7 shows the associated problem.

$$\min_{W,\,B^g}\ \sum_{t\in\mathcal{T}} \left[ \frac{1}{m_t}\mathcal{L}(\mathbf{w}_t) + \lambda_1 \sum_{g\in\mathcal{G}} \left\|\mathbf{b}^g_{\bar{t}}\right\|_1 + \frac{\lambda_2}{2} \sum_{g\in\mathcal{G}} \left\|\mathbf{w}^g_t - \mathbf{W}^g\mathbf{b}^g_t\right\|_2^2 + \lambda_3 \sum_{g\in\mathcal{G}} d_g \left\|\mathbf{w}^g_t\right\|_2 \right] \tag{7}$$

subject to $\sum_{g\in\mathcal{G}} \mathbf{w}^g_t = \mathbf{w}_t$ and $\mathbf{b}^g_t \geq 0,\ \forall g\in\mathcal{G}$ and $t\in\mathcal{T}$.
The difference lies in the first term of Eq. 2, which is now expanded into the first two terms of Eq. 7. After expanding, we remove the product between the loss function and the ℓ1-norm regularization on the b^g_t̄ variables. Now the regularization on the variables that encode how a task transfers to the other tasks depends only on the value of the hyper-parameter λ1.

A.2 GAMTL-nr: No Restriction


The standard formulation of GAMTL restricts the values of all Bg matrices to be equal to or greater than zero. We call GAMTL-nr the variant where Bg ∈ R^{T×T}, as shown in Equation 8.
$$\min_{W,\,B^g}\ \sum_{t\in\mathcal{T}} \left[ \left(1 + \lambda_1 \sum_{g\in\mathcal{G}} \left\|\mathbf{b}^g_{\bar{t}}\right\|_1\right)\frac{1}{m_t}\mathcal{L}(\mathbf{w}_t) + \frac{\lambda_2}{2} \sum_{g\in\mathcal{G}} \left\|\mathbf{w}^g_t - \mathbf{W}^g\mathbf{b}^g_t\right\|_2^2 + \lambda_3 \sum_{g\in\mathcal{G}} d_g \left\|\mathbf{w}^g_t\right\|_2 \right] \tag{8}$$

subject to $\sum_{g\in\mathcal{G}} \mathbf{w}^g_t = \mathbf{w}_t$.

Compared to the standard version presented in Equation 2, the difference lies in the removal of the constraints on Bg.

A.3 GAMTL-nlnr: No Loss, No Restriction


Based on the choices presented above, we can also obtain another variant of GAMTL by not using the loss function to penalize the transference from tasks with higher costs, while also not considering the constraints on the values of Bg, ∀g ∈ G.

ACM Trans. Knowl. Discov. Data.


1:28 • S. H. G. Oliveira, et al.

We call this variation GAMTL-nlnr. The associated optimization problem is:

$$\min_{W,\,B^g}\ \sum_{t\in\mathcal{T}} \left[ \frac{1}{m_t}\mathcal{L}(\mathbf{w}_t) + \lambda_1 \sum_{g\in\mathcal{G}} \left\|\mathbf{b}^g_{\bar{t}}\right\|_1 + \frac{\lambda_2}{2} \sum_{g\in\mathcal{G}} \left\|\mathbf{w}^g_t - \mathbf{W}^g\mathbf{b}^g_t\right\|_2^2 + \lambda_3 \sum_{g\in\mathcal{G}} d_g \left\|\mathbf{w}^g_t\right\|_2 \right] \tag{9}$$

subject to $\sum_{g\in\mathcal{G}} \mathbf{w}^g_t = \mathbf{w}_t$.

Notice that this formulation combines the changes introduced in Equations 7 and 8.

B OBTAINING GAMTL CONVEX SUB-PROBLEMS


The standard formulation presented in Eq. 2 is not convex in general, which prevents us from solving it with a gradient-based optimization procedure without some manipulation. In the first term of the equation, the loss function depends on wt while being multiplied by b^g_t̄, ∀t ∈ T. In the second term, the parameters of each task are closely tied to the parameters of all other tasks, which also adds a non-convex relationship among the variables.
In GAMTL the problem is solved with an alternating optimization strategy. The problem is restricted to a subset of the variables to be optimized while the values of the remaining variables are kept fixed. This allows us to rewrite the objective function as a smaller convex problem.
We start from the standard formulation in Eq. 2 and isolate the problem in terms of each wt, ∀t ∈ T, and each b^g_t, ∀g ∈ G and t ∈ T. For each sub-problem, we optimize with respect to the corresponding subset of variables while keeping the values of all the remaining variables fixed.
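Before looking at the sub-problems, the overall strategy can be illustrated on a classical biconvex toy problem: rank-one factorization of a matrix is not jointly convex in (u, v), but fixing one factor makes the other a closed-form least-squares solve, just as GAMTL alternates between the task parameters and the relationship matrices. The toy problem below is our illustration, not part of GAMTL.

```python
import numpy as np

def rank_one_als(A, n_iters=50, seed=0):
    """Alternating least squares for min_{u,v} ||A - u v^T||_F^2.
    Each sub-problem is convex (a linear least-squares solve with one
    factor fixed), even though the joint problem is not."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    u = rng.standard_normal(m)
    v = rng.standard_normal(n)
    for _ in range(n_iters):
        u = A @ v / (v @ v)      # optimal u with v fixed
        v = A.T @ u / (u @ u)    # optimal v with u fixed
    return u, v

A = np.outer([1.0, 2.0, 3.0], [4.0, 5.0])   # exactly rank one
u, v = rank_one_als(A)
print(np.linalg.norm(A - np.outer(u, v)))    # residual close to 0
```

As with GAMTL, alternating between convex sub-problems drives the non-convex joint objective monotonically downward.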
Before diving into more details, we repeat Equation 2 as a reference for the next sub-sections:

$$\min_{W,\,B^g}\ \sum_{t\in\mathcal{T}} \left[ \left(1 + \lambda_1 \sum_{g\in\mathcal{G}} \left\|\mathbf{b}^g_{\bar{t}}\right\|_1\right)\frac{1}{m_t}\mathcal{L}(\mathbf{w}_t) + \frac{\lambda_2}{2} \sum_{g\in\mathcal{G}} \left\|\mathbf{w}^g_t - \mathbf{W}^g\mathbf{b}^g_t\right\|_2^2 + \lambda_3 \sum_{g\in\mathcal{G}} d_g \left\|\mathbf{w}^g_t\right\|_2 \right]$$

subject to $\sum_{g\in\mathcal{G}} \mathbf{w}^g_t = \mathbf{w}_t$ and $\mathbf{b}^g_t \geq 0,\ \forall g\in\mathcal{G}$ and $t\in\mathcal{T}$.

B.1 Isolating in terms of task parameters


Let us consider wt of a given task t as the variables of our problem, while all remaining variables have fixed values. In the first term of Eq. 2, for each task in the overall summation we have c ∈ R, a constant resulting from the computation of the terms inside the parentheses, since b^g_t̄ is fixed for all g ∈ G and t ∈ T. The value of the loss function of all other tasks is also constant, since ws, ∀s ∈ T \ t, have fixed values. Let

$$d = \sum_{s\in\mathcal{T}\setminus t} \left(1 + \lambda_1 \sum_{g\in\mathcal{G}} \left\|\mathbf{b}^g_{\bar{s}}\right\|_1\right)\frac{1}{m_s}\mathcal{L}(\mathbf{w}_s) + \lambda_3 \sum_{g\in\mathcal{G}} d_g \left\|\mathbf{w}^g_s\right\|_2,$$

and let us assume that all fixed values satisfy b^g_t ≥ 0, ∀g ∈ G and t ∈ T.




The optimization problem can be simplified to:

$$\min_{\mathbf{w}_t}\ \frac{c}{m_t}\mathcal{L}(\mathbf{w}_t) + \frac{\lambda_2}{2} \sum_{g\in\mathcal{G}} \left\|\mathbf{w}^g_t - \mathbf{W}^g\mathbf{b}^g_t\right\|_2^2 + \lambda_3 \sum_{g\in\mathcal{G}} d_g \left\|\mathbf{w}^g_t\right\|_2 + d + \sum_{s\in\mathcal{T}\setminus t} \frac{\lambda_2}{2} \sum_{g\in\mathcal{G}} \left\|\mathbf{w}^g_s - \mathbf{W}^g\mathbf{b}^g_s\right\|_2^2$$

subject to $\sum_{g\in\mathcal{G}} \mathbf{w}^g_t = \mathbf{w}_t$.

Since wt takes part in each W^g, we can expand the fourth term and rearrange the variables to isolate wt. Let

$$\tilde{\mathbf{w}}^g_s = \mathbf{w}^g_s - \sum_{u\in\mathcal{T}\setminus\{s,t\}} \mathbf{w}^g_u\, b^g_{us}.$$

The optimization problem becomes:

$$\min_{\mathbf{w}_t}\ \frac{c}{m_t}\mathcal{L}(\mathbf{w}_t) + \frac{\lambda_2}{2} \sum_{g\in\mathcal{G}} \left\|\mathbf{w}^g_t - \mathbf{W}^g\mathbf{b}^g_t\right\|_2^2 + \lambda_3 \sum_{g\in\mathcal{G}} d_g \left\|\mathbf{w}^g_t\right\|_2 + d + \frac{\lambda_2}{2} \sum_{s\in\mathcal{T}\setminus t} \sum_{g\in\mathcal{G}} \left\|\tilde{\mathbf{w}}^g_s - \mathbf{w}^g_t\, b^g_{ts}\right\|_2^2,$$

which is equivalent to Eq. 3.
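When solving this wt sub-problem with a proximal-gradient method such as FISTA [2], the group-lasso term enters only through its proximal operator, the block soft-threshold. A minimal sketch (our illustration; the group index arrays and the threshold `tau` are assumptions):

```python
import numpy as np

def group_soft_threshold(w, groups, tau):
    """Proximal operator of tau * sum_g ||w_g||_2: shrink each group of
    coordinates toward zero, zeroing groups whose norm is below tau."""
    out = np.zeros_like(w)
    for g in groups:                       # g is an index array for one group
        norm = np.linalg.norm(w[g])
        if norm > tau:
            out[g] = (1 - tau / norm) * w[g]
    return out

w = np.array([3.0, 4.0, 0.1, -0.1])
groups = [np.array([0, 1]), np.array([2, 3])]
print(group_soft_threshold(w, groups, 1.0))
# first group (norm 5) is shrunk toward zero; second (norm ~0.14) is zeroed
```

Inside a proximal-gradient loop, this operator is applied after each gradient step on the smooth part of the objective.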

B.2 Isolating in terms of task relationships


Assume now that wt, ∀t ∈ T, and all Bg, ∀g ∈ G, with the exception of b^g_t, have fixed values. The loss function of each task assumes a constant value. The third term and the restriction associated with the Group LASSO regularization also assume constant values for all tasks; since these terms do not involve the current variables of the problem, they can be discarded. We are left with the first and second terms, together with the constraints on the values of the Bg matrices. Let

$$\tilde{\mathbf{w}}^g_t = \mathbf{w}_t - \sum_{\tilde{g}\in\mathcal{G}\setminus g} \mathbf{W}^{\tilde{g}}\mathbf{b}^{\tilde{g}}_t, \qquad \overline{\mathbf{W}}^g = \left[\mathbf{w}^g_1/\mathcal{L}(\mathbf{w}_1), \cdots, \mathbf{w}^g_T/\mathcal{L}(\mathbf{w}_T)\right].$$

By rearranging the terms we are left with the following resulting problem:

$$\min_{\mathbf{b}^g_t}\ \frac{1}{2}\left\|\tilde{\mathbf{w}}^g_t - \overline{\mathbf{W}}^g\mathbf{b}^g_t\right\|_2^2 + \frac{\lambda_1}{\lambda_2}\left\|\mathbf{b}^g_t\right\|_1 \quad \text{subject to}\quad \mathbf{b}^g_t \geq 0.$$
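This non-negative ℓ1-penalized least-squares problem can be solved, for instance, with a projected ISTA iteration: the proximal step for the ℓ1 term combined with the non-negativity constraint reduces to a shifted clamp at zero. The sketch below is a generic illustration under assumed inputs, not the solver shipped with GAMTL.

```python
import numpy as np

def nonneg_lasso(W, w_tilde, lam, n_iters=500):
    """Projected ISTA for min_b 0.5*||w_tilde - W b||^2 + lam*||b||_1, b >= 0.
    The prox of lam*||.||_1 plus the b >= 0 constraint is max(0, x - eta*lam)."""
    eta = 1.0 / np.linalg.norm(W, 2) ** 2      # step size 1/L, L = ||W||_2^2
    b = np.zeros(W.shape[1])
    for _ in range(n_iters):
        grad = W.T @ (W @ b - w_tilde)
        b = np.maximum(0.0, b - eta * (grad + lam))
    return b

# toy check: recover a sparse non-negative code from noiseless data
rng = np.random.default_rng(0)
W = rng.standard_normal((30, 5))
b_true = np.array([1.0, 0.0, 0.5, 0.0, 0.0])
b_hat = nonneg_lasso(W, W @ b_true, lam=0.01)
print(b_hat)   # close to b_true, all entries >= 0
```

With a small penalty and a well-conditioned design, the iterate converges to a nearly unbiased non-negative solution.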

C MORE METRICS FOR THE ADNI EXPERIMENT


Table 3 shows the mean absolute error for each task in the ADNI experiment reported in Section 4.2. Considering the prediction of five cognitive scores (TOTAL, T30, RECOG, MMSE, and ADAS) as learning tasks, MSSL presented the smallest MAE for the task TOTAL, while showing poorer performance for the other cognitive scores. For task T30, as with the MSE metric (Table 2), the LASSO presented the best score. For the task RECOG, GAMTL had the best score, and for the remaining MMSE and ADAS tasks, GAMTL-nr had the best scores.
As in the discussion of Section 4.2, each task benefits the most from a different strategy of transference, but still, task T30 could not benefit from MTL. Focusing on methods that account for grouped features, we can choose Group LASSO as the STL reference. When compared with Group LASSO, GAMTL variants could improve the generalization performance on the tasks TOTAL, RECOG, MMSE, and ADAS. RECOG is the task that benefits the most from GAMTL models.




Table 3. Mean Absolute Error (MAE) of all methods per task on the ADNI dataset. The best results on each task are highlighted in bold.

Method       | TOTAL             | T30               | RECOG             | MMSE              | ADAS

STL:
LASSO        | 0.714 (0.0)       | 0.629 (0.0)       | 0.819 (0.0)       | 0.635 (0.0)       | 0.560 (0.0)
Group-LASSO  | 0.796 (2.4·10⁻¹)  | 0.661 (5.2·10⁻²)  | 0.831 (3.8·10⁻²)  | 0.695 (5.0·10⁻²)  | 0.599 (4.2·10⁻²)

MTL:
GO-MTL       | 0.715 (0.0)       | 0.650 (0.0)       | 0.736 (0.0)       | 0.790 (0.0)       | 0.575 (0.0)
MSSL         | 0.709 (0.0)       | 0.662 (0.0)       | 0.757 (0.0)       | 0.642 (0.0)       | 0.556 (0.0)
MTFL         | 0.711 (0.0)       | 0.665 (0.0)       | 0.748 (0.0)       | 0.639 (0.0)       | 0.549 (0.0)
MT-SGL       | 0.725 (3.4·10⁻¹⁴) | 0.654 (1.3·10⁻¹³) | 0.700 (4.3·10⁻¹³) | 0.653 (2.5·10⁻¹⁴) | 0.558 (1.6·10⁻¹³)
MTRL         | 0.720 (0.0)       | 0.676 (0.0)       | 0.712 (0.0)       | 0.640 (0.0)       | 0.537 (0.0)
AMTL         | 0.721 (0.0)       | 0.669 (0.0)       | 0.857 (0.0)       | 0.758 (0.0)       | 0.543 (0.0)
GAMTL        | 0.743 (9.2·10⁻⁶)  | 0.659 (2.3·10⁻⁵)  | 0.692 (7.7·10⁻⁶)  | 0.631 (1.8·10⁻⁵)  | 0.549 (4.6·10⁻⁵)
GAMTL-nl     | 0.720 (4.9·10⁻⁵)  | 0.669 (5.0·10⁻⁵)  | 0.723 (1.0·10⁻⁴)  | 0.626 (5.2·10⁻⁵)  | 0.537 (3.5·10⁻⁵)
GAMTL-nr     | 0.725 (2.3·10⁻⁵)  | 0.672 (3.4·10⁻⁵)  | 0.713 (1.6·10⁻⁵)  | 0.623 (4.2·10⁻⁵)  | 0.531 (2.2·10⁻⁵)
GAMTL-nlnr   | 0.718 (1.5·10⁻⁵)  | 0.668 (4.6·10⁻⁵)  | 0.727 (4.4·10⁻⁵)  | 0.627 (5.6·10⁻⁵)  | 0.539 (3.0·10⁻⁵)

