GAMTL: Asymmetric Multi-Task Learning
SAULLO H. G. OLIVEIRA, School of Electrical and Computer Engineering (FEEC), University of Campinas
(Unicamp), Brazil
ANDRÉ R. GONÇALVES, Lawrence Livermore National Laboratory, USA
FERNANDO J. VON ZUBEN, School of Electrical and Computer Engineering (FEEC), University of Camp-
inas (Unicamp), Brazil
In this paper, we present the Group Asymmetric Multi-Task Learning (GAMTL) algorithm, which automatically learns from data how tasks transfer information among themselves at the level of a subset of features. In practice, for each group of features GAMTL extracts an asymmetric relationship supported by the tasks, instead of assuming a single structure for all features. The additional flexibility promoted by local transference in GAMTL allows any two tasks to have multiple asymmetric relationships. The proposed method leverages the information present in these multiple structures to bias the training of individual tasks towards more generalizable models.
The solution to GAMTL's associated optimization problem is an alternating minimization procedure involving the task parameters and the multiple asymmetric relationships, thus leading to smaller convex sub-problems. GAMTL was evaluated on both synthetic and real datasets. To evidence GAMTL's versatility, we generated a synthetic scenario characterized by diverse profiles of structural relationships among tasks. GAMTL was also applied to the problem of Alzheimer's Disease (AD) progression prediction. Our experiments indicated that the proposed approach not only increased prediction performance, but also estimated scientifically grounded relationships among multiple cognitive scores, taken here as multiple regression tasks, and regions of interest in the brain, directly associated here with groups of features. We also employed stability selection analysis to investigate GAMTL's robustness to data sampling rate and hyper-parameter configuration. GAMTL source code is available on GitHub: [Link]
CCS Concepts: • Applied computing → Health informatics; • Theory of computation → Models of learning; Non-convex optimization.
Additional Key Words and Phrases: multi-task learning, structural sparsity, structural learning.
1 INTRODUCTION
Multi-Task Learning (MTL) promotes information sharing among multiple related tasks, aiming at improving the generalization capacity of individual tasks. In the last two decades, we have witnessed a significant increase in the number of MTL proposals as well as in the variety of applications of MTL methods [7, 33, 37]. One central question in MTL is how the information flows from task to task during training. To that end, a proper characterization of the task relationship structure is required. Existing methods range from simple models with strong assumptions about how tasks are related [6, 8, 15, 41] to more complex models that implement intricate learning procedures
Authors' addresses: Saullo H. G. Oliveira, shgo@[Link], School of Electrical and Computer Engineering (FEEC), University of Campinas (Unicamp), Av. Albert Einstein, Nº 400 - Cidade Universitária, Campinas, São Paulo, Brazil, 13083-852; André R. Gonçalves, andre@[Link], Lawrence Livermore National Laboratory, USA; Fernando J. Von Zuben, vonzuben@[Link], School of Electrical and Computer Engineering (FEEC), University of Campinas (Unicamp), Av. Albert Einstein, Nº 400 - Cidade Universitária, Campinas, São Paulo, Brazil, 13083-852.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@[Link].
© 2022 Association for Computing Machinery.
1556-4681/2022/1-ART1 $15.00
[Link]
to automatically unveil task relationships [9, 18, 38, 39]. MTL also shares similarities with Transfer Learning methods, with the difference that it is more general in the way it models transference. In MTL we want to improve the performance of all involved tasks, while in transfer learning we use source tasks to improve the performance of one or more target tasks [27].
As we will see in Section 2, although the current state-of-the-art MTL methods can model and learn the task relationship information from data, two major drawbacks are present: (i) most of the methods assume that tasks are symmetrically related, that is, task A affects task B in the same way as task B affects task A; and (ii) if two tasks are related they must influence each other on the entire set of features, which we will call global feature transference.
There have been attempts to encode more flexible structures into the models and to alleviate the mentioned downsides, as we will see in Section 2, usually in two directions: enabling transference to involve subsets of features, i.e., local feature transference; or estimating the structure of transference, possibly in an asymmetric way. But no previous attempt in the literature explores both directions simultaneously.
Recent works used MTL to model the connection between cognitive scores and the progression of Alzheimer’s
Disease (AD) [21, 22, 42], helping to understand how the most common form of dementia in the world [17]
evolves both physiologically and cognitively. As people live longer and we improve methods to identify and
diagnose dementia, the number of people living with dementia is expected to more than triple by 2050 when
compared with the estimates of 2018, according to the World Alzheimer Report of 2018. Not to mention the
suffering brought to a massive number of families, the worldwide financial cost of dementia was estimated to be US$ 1 trillion for the year of 2018 and is expected to double by 2030. Notice that these estimates do not account for the recent burden imposed by COVID-19 on health care systems worldwide. Since the neuronal degeneration of AD proceeds for years before the full onset of the disease, medical treatment is more effective in the early stages of the disease. Therefore, the prediction of AD progression can help to identify markers of AD stages, as well as highlight how each region of the brain influences the outcome of such scores.
The simultaneous prediction of different cognitive scores associated with subjects at distinct stages of AD, based on features extracted from brain images (obtained from ADNI, the Alzheimer's Disease Neuroimaging Initiative), can benefit from MTL, especially from models that can directly encode structural a priori information, such as grouped features and asymmetric transference. By considering multiple cognitive scores as multiple tasks, [22] and [42] show how MTL methods can encode in the design matrix each Region of Interest (ROI) in the brain as a group of features using the Group LASSO [36] regularization, but are not able to circumvent the global feature transference assumption. Despite the promising results, current MTL methods still present limitations in the way they capture the relationships among tasks in a realistic and interpretable way.
In Section 3 we present GAMTL [29], a Regularized MTL method that meets three goals:
(1) consider that features are organized in a group structure, thus enabling local feature transference;
(2) estimate a task relationship matrix for each group of related features;
(3) allow tasks to be asymmetrically related.
We compare the proposal with state-of-the-art methods in Section 4, by contrasting their results in an artificial setting, and also on the prediction of different cognitive scores related to Alzheimer's Disease. The results highlight the performance of the method and the interpretability of the explainable relationship structure. In order to validate the robustness of GAMTL with respect to data sampling rate and hyper-parameter settings, a Stability Selection procedure was performed. A clustering analysis considering the stable set of features makes it clear how the flexibility of GAMTL allows the method to automatically capture the distinct roles features can play on related tasks. The source code in the Python programming language is available at the URL: [Link]
Notation: Matrices are represented using uppercase letters, while scalars are represented by lowercase letters.
Vectors are lowercase in bold. For any matrix A, a_i· is the i-th row of A, and a_j is the j-th column of A. Also, a_ij is the scalar at row i and column j of A. The i-th element of any vector a is represented by a_i. For any two vectors x, y the Hadamard product is denoted by (x ⊙ y)_i = x_i y_i.
The general regularized MTL problem can be stated as

min_W Σ_{t∈T} L_t(w_t) + R(W),

where L_t : R^n → R is a suitable convex loss function for task t, and R is a regularization term over all task parameters. When the task weight vectors are stacked as columns of a matrix, we represent the task parameters as W ∈ R^{n×T}.
Examples of loss functions for regression problems are the Mean Squared Error and the Mean Absolute Error, while classification problems may use the Logistic Loss. Prior knowledge on the application is what drives the choice of regularization terms, especially by leveraging sparsity properties and algebraic structures that enforce and/or capture the relationship among tasks.
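To make the generic objective above concrete, here is a minimal NumPy sketch (the function and regularizer names are ours, purely illustrative) that evaluates a sum of per-task MSE losses plus a weighted regularizer over the stacked parameter matrix:

```python
import numpy as np

def mtl_objective(W, Xs, ys, reg, lam):
    """Generic regularized MTL objective: sum of per-task losses plus a
    regularizer over the stacked parameter matrix W (n x T).
    Xs, ys: lists with one design matrix / label vector per task.
    reg: callable regularizer; lam: its weight. (Illustrative names.)"""
    loss = 0.0
    for t in range(W.shape[1]):
        resid = Xs[t] @ W[:, t] - ys[t]
        loss += 0.5 * np.mean(resid ** 2)   # MSE loss for task t
    return loss + lam * reg(W)

# Example regularizer: l1-norm encouraging sparse task parameters.
l1 = lambda W: np.abs(W).sum()
```

Swapping `reg` is all it takes to move between the regularization schemes discussed below.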
Evgeniou and Pontil [8] started from the general premise that the parameter vectors w_t, t = 1, ..., T, should be close to each other, by penalizing the deviation of each parameter vector from the average vector comprising all tasks. Elaborating further, Zhou et al. [41] considered that task parameters are grouped into K groups, with K being a hyper-parameter of the model. In this case, all tasks in the same group tend to exhibit similar values for their parameters. Additionally, each task must belong to a single group, which in turn is a more flexible way of handling the effects of unrelated tasks, but still does not completely account for them. Both methods induce all the parameter vectors to pursue the average behavior of their corresponding group, but if tasks are related only through subsets of features, these methods will fail to capture the relationship and will enforce it on features that are not supposed to be related. By relying on metric distances to measure task similarities, they may incur the curse of dimensionality if the parameter vectors lie in a high-dimensional space. This implies that, for interpretability purposes, even when two tasks belong to the same cluster we still have no reason to believe that they are related, as metric distances tend to become meaningless in high-dimensional spaces.
The usage of norms as regularization terms is successfully employed in MTL. The l1-norm as a regularization term applied to the parameters of each task individually encourages a sparse activation of its components. As highlighted by [11], sparse models are easier to interpret, as we have fewer active variables in the solution; they are also computationally more convenient, usually requiring significantly less memory than their dense counterparts. Theoretically, they also present the nice property of being able to recover the exact support (i.e., the set of active variables of a vector) of a given model [34, 40] if certain conditions are satisfied.
By applying it to different arrangements of the parameters of the multiple tasks, the l1-norm encourages independent feature or task selection in the information sharing over all tasks. However, if the parameters present some structural correlation (groups of correlated features, for example), the same support recovery guarantees do not apply. The mixed lp,q-norms are suited to enforce that groups of features are jointly active or absent: if one feature of a given group is active, all features in the same group should be active; and if one feature is not active, the entire group of features should be absent. The Dirty Model [15] takes advantage of lp,q-norms to relate features among all tasks. Instead of using such a norm directly on the parameter matrix, the authors factorize the parameter matrix into a sum of two matrices W = S + B, where S, B ∈ R^{n×T}. They apply a different sparsity penalization on each factor matrix: the sum of l1-norms (l1,1-norm) on the rows of S induces sparsity over the parameters of all tasks; and the sum of l∞-norms (l1,∞-norm) on the rows of B relates each feature over all tasks. The factorization strategy adds the flexibility of sharing features between tasks when convenient, since one feature can be active for all tasks while each task remains free to avoid this feature through the regularization on the second matrix. But we are still left with two limiting properties: (i) the l1,∞-norm encourages similar values for each feature across all tasks, implying that one feature has the same impact on the outcome of all related tasks; and (ii) the model does not consider the case of grouped features.
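The two penalties of the Dirty Model can be sketched as follows (a toy evaluation of the regularizer only, with illustrative names; the actual estimator also fits W = S + B against the data):

```python
import numpy as np

def dirty_penalty(S, B, lam_s, lam_b):
    """Dirty-model penalty (sketch): W = S + B, with an l1,1 penalty on S
    (element-wise sparsity over all task parameters) and an l1,inf penalty
    on the rows of B (each feature shared, or not, across all tasks)."""
    l11 = np.abs(S).sum()                # sum of row l1-norms
    l1inf = np.abs(B).max(axis=1).sum()  # sum of row l-infinity norms
    return lam_s * l11 + lam_b * l1inf
```

Note how the l1,∞ term charges each row of B by its largest entry only, so once a feature is "paid for" in one task it is free for the others, which is exactly the sharing mechanism described above.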
The Group LASSO regularization [36], also referred to as the l2,1 case of the mixed lp,q-norms, is a regularization that accounts for grouped features. Let the task features be partitioned into G groups of correlated features. Each group g ∈ G = {1, ..., G} consists of a subset of features in X_t for all tasks t ∈ T. Let X_t^g be the design matrix restricted to the features present in a group g for task t, and w_t^g be the task parameter vector with the same dimension as w_t but admitting non-null values only at positions associated with features belonging to group g, and having null values at the remaining positions. When p = 2 and q = 1 we have:

R_GL(W) = ∥W∥_{2,1} = Σ_{t∈T} Σ_{g∈G} ∥w_t^g∥_2.
Notice that the partition of features into groups is the same for all tasks. In this regularization, each feature must belong to one group, although isolated features can be put into singleton groups. As it penalizes the l1-norm of a vector of G l2-norms (one per group), when one element is forced to zero, all variables of its group are forced to zero, thus leading to w_t^g = 0 for some g ∈ G. However, when two groups overlap and only one group is active in the final solution, the group that is not active will have all its features zeroed, even the features that are shared with the active group. The recovered support of this norm is then the complement of the union of the overlapping groups [14, 28].
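A direct sketch of this penalty for one task vector (assuming a non-overlapping partition given as lists of feature indices; names are ours):

```python
import numpy as np

def group_lasso_penalty(w, groups):
    """l2,1 Group LASSO penalty for one task: sum over groups of the
    l2-norm of the restricted sub-vector. `groups` is a list of index
    lists forming a (non-overlapping) partition of the features."""
    return sum(np.linalg.norm(w[idx]) for idx in groups)
```

Because the outer sum behaves like an l1-norm over the group norms, whole groups are driven to zero together, as described above.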
In order to overcome this structural bias, Jacob et al. [14] proposed an extension to the Group LASSO where the feature vector is decomposed as a sum of representations for each group, w_t = Σ_{g∈G} w_t^g, applying the l2-norm (or l∞-norm) on each group. This regularization is called the Latent Group LASSO, or Overlapping Group LASSO, and can be posed as:

R_OGL(W) = Σ_{t∈T} Σ_{g∈G} d_g ∥w_t^g∥_2,

where w_t = Σ_{g∈G} w_t^g, ∀t ∈ {1, ..., T}, and the d_g are independent weights, accounting for the cardinality of each group. Notice that the support for each task is a union of groups, not the complement, as a feature shared by two groups will have its value preserved for the active group and be zeroed in the inactive group. See [14, 28] for more details.
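Under the latent decomposition, the penalty is evaluated on the per-group representations rather than on w_t itself; a minimal sketch (illustrative names, with the group weights d_g passed explicitly):

```python
import numpy as np

def latent_group_penalty(w_parts, d):
    """Overlapping (latent) Group LASSO penalty: the task vector is the
    sum of per-group latent vectors w_parts[g], each charged its own
    weighted l2-norm d[g] * ||w_parts[g]||_2."""
    return sum(d[g] * np.linalg.norm(w_parts[g]) for g in range(len(w_parts)))

# The task parameter vector is recovered from the latent parts:
#   w_t = sum(w_parts)
```

A feature shared by two groups can thus stay non-zero in the active group's latent vector while being zeroed in the inactive one, giving the union-of-groups support.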
Both the sparsity-inducing l1-norm and the lp,q-norms present strong support recovery guarantees when appropriate conditions are met. However, the performance of methods based on lp,q-norms depends on how features are shared across tasks. For the l1,q-norm, Negahban and Wainwright [24] showed that if the number of tasks sharing a group of features is less than a threshold, or even if the parameter values of features of the same group are highly uneven, the regularization can perform worse than the l1-norm. Ideally, each group of features should be free to play distinct roles depending on the task, i.e., each task may have its own support (the set of non-null elements of w_t^g). In this case, we still need a mechanism to select which tasks should transfer for each group independently.
All methods presented so far allow tasks to transfer in different ways, but exhibit several limitations, as we need to meet strong assumptions beforehand. Since all tasks are considered to be related, they are not robust to unrelated tasks; most of them do not account for grouped features and, when they do, all tasks must share the same sparsity pattern, which implies that each group of features has the same influence on all task outcomes. In the next section, we proceed to a distinct family of MTL methods that extends this setting by unveiling how tasks are related while learning the tasks' parameters, which may allow tasks to have a more versatile transference structure.
linear combination. For obvious reasons, a task cannot participate in its own formulation, thus b_tt = 0 ∀t ∈ T. In this case, the parameters of all tasks serve as a latent basis. The authors also use the task losses to weight transference: information must flow from tasks with lower cost (easier) to tasks with higher cost (harder). Let b_t be a column of a matrix B ∈ R^{T×T}. Each column t indicates how the parameters of the other tasks participate in the linear combination that approximates the parameters of task t, and a row t indicates the degree to which the parameters of task t participate in the approximation of the parameters of the other tasks. Therefore, B encodes the relationships among tasks in an asymmetric scheme: the transference from task t to task s may not be the same as that from task s to task t.
The related optimization problem is written as follows:

min_{W, B} Σ_{t=1}^{T} (1 + λ1 ∥b_t·∥_1) L_t(w_t) + λ2 ∥w_t − W b_t∥_2^2.
In the first term, the cost of a task t weights the l1-norm applied to b_t· (the t-th row of B), i.e., the transferences from task t to all other tasks. (λ1, λ2) are regularization hyper-parameters. The asymmetric transference is encoded in a set of variables that are distinct from the variables involved in prediction, which allows AMTL to achieve a flexible regularization of related tasks. Nevertheless, AMTL also enforces global feature transference.
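Reading the AMTL objective literally, its value can be sketched as below (our naming; we assume the per-task losses L_t(w_t) were computed beforehand and that diag(B) = 0):

```python
import numpy as np

def amtl_objective(W, B, task_losses, lam1, lam2):
    """AMTL objective value (sketch): the per-task loss is inflated by the
    l1-norm of the outgoing transfers (row t of B), so costly tasks are
    discouraged from transferring out; the second term penalizes the
    latent-basis reconstruction error ||w_t - W b_t||^2."""
    T = W.shape[1]
    val = 0.0
    for t in range(T):
        out_row = np.abs(B[t, :]).sum()            # transfers leaving task t
        val += (1.0 + lam1 * out_row) * task_losses[t]
        val += lam2 * np.linalg.norm(W[:, t] - W @ B[:, t]) ** 2
    return val
```

This makes the asymmetry explicit: row t (outgoing) and column t (incoming) of B play different roles.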
Unlike transfer learning settings [25-27], in which one or more tasks are the source of transference to a target task, in our multi-task learning framework all tasks are simultaneously sources and receptors of shared information. We proceed now to the exposition of GAMTL, a regularized MTL method that uses group information over the task features to provide a flexible relationship structure that allows local feature transference, admitting a distinct interplay on each group of features, as well as asymmetric information sharing. For a more comprehensive reading on the current status of MTL research, please refer to the recent survey in [37].
Each b_st^g composes an intra-group relationship matrix B^g ∈ R^{T×T}, where a row b_t·^g encodes the influence of task t on all other tasks at group g, and a column b_t^g encodes the influence of the other tasks on task t at group g, which results in G intra-group relationship matrices. Each one can be seen as the adjacency matrix of a directed graph transference structure: nodes are tasks and directed weighted edges indicate transference from one task to another. The latent representation of each task parameter vector is w_t ≈ Σ_{g∈G} W^g b_t^g, where W^g is the task parameter matrix with values restricted to group g, and zeros elsewhere. Eq. (2) shows the resulting MTL optimization problem.
min_{W, B^g ∀g∈G}  Σ_{t∈T} (1/m_t) (1 + λ1 Σ_{g∈G} ∥b_t·^g∥_1) L(w_t) + (λ2/2) Σ_{g∈G} ∥w_t^g − W^g b_t^g∥_2^2 + λ3 Σ_{g∈G} d_g ∥w_t^g∥_2    (2)

subject to  w_t = Σ_{g∈G} w_t^g

            b_t^g ≥ 0, ∀g ∈ G and t ∈ T
The first term computes the loss function of each task weighted by the number of samples. Therefore, it takes into account sample imbalance among tasks, while also using the loss to weight transference from task t to the other tasks. The l1-norm applied to b_t·^g is used to enforce sparsity on the estimated relationship among the tasks. This helps us prune the search space while keeping only the most relevant transferences per group of features. The second term penalizes the difference between the parameters of a specific task t and the linear combination of parameters from the tasks with which task t is grouped. Notice that this term considers how task t is related to possibly different tasks for each group of features independently. Together with the equality constraint on each w_t, the last term corresponds to the Overlapping Group LASSO regularization. The constraint on the B^g variables restricts the way tasks relate by allowing only non-negative values in the linear combination. However, in case this restriction is not suitable for the application, we also present an optimization procedure for the more relaxed variant (without the restriction on the B^g values). GAMTL uses the transference matrices B^g in a way that allows us to use the Group LASSO while estimating how tasks share information, instead of forcing transference involving all tasks on each group of features.
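Putting the three terms together, the value of Eq. (2) can be sketched as follows (a NumPy reading of the objective with illustrative names; W_parts[g] holds the latent matrix W^g and B[g] the matrix B^g, and the equality constraint is enforced by summing the latent parts):

```python
import numpy as np

def gamtl_objective(W_parts, B, Xs, ys, d, lam1, lam2, lam3):
    """GAMTL objective value from Eq. (2) (sketch). W_parts[g]: n x T
    latent parameters of group g (zeros outside the group); B[g]: T x T
    intra-group relationship matrix; d[g]: group weights."""
    G, T = len(W_parts), W_parts[0].shape[1]
    val = 0.0
    for t in range(T):
        w_t = sum(W_parts[g][:, t] for g in range(G))   # w_t = sum_g w_t^g
        m_t = len(ys[t])
        resid = Xs[t] @ w_t - ys[t]
        loss = 0.5 * resid @ resid                       # squared-error loss
        out = sum(np.abs(B[g][t, :]).sum() for g in range(G))
        val += (1.0 + lam1 * out) * loss / m_t           # loss-weighted transfer
        for g in range(G):
            diff = W_parts[g][:, t] - W_parts[g] @ B[g][:, t]
            val += 0.5 * lam2 * diff @ diff              # per-group reconstruction
            val += lam3 * d[g] * np.linalg.norm(W_parts[g][:, t])
    return val
```

The per-group reconstruction term is what replaces AMTL's single global penalty, giving one relationship matrix per group of features.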
GAMTL contains three hyper-parameters that impact how transference occurs. When λ1 = 0, λ2 = 0, and λ3 = 0, independent linear models are recovered, following a Single Task Learning (STL) approach. If only λ3 > 0, we still have independent linear models per task, but regularized by the Overlapping Group LASSO. When λ2 > 0, we control the transference flexibility from many groups of related tasks (one per group of features) to w_t. With λ1 > 0, the sparsity of the transference is activated.
Figure 1 shows a flowchart presenting the training process for GAMTL. The input consists of a labeled training set for each task, with the task features structured into groups. The grouped partition of features must be the same for all task design matrices. However, the partition is arbitrary, allowing non-contiguous groups of features to overlap, despite Figure 1 inducing contiguity of features. An alternating optimization procedure performs the training process, switching between the estimation of the task parameters and the relationship among tasks. The relationship among tasks is encoded into G matrices, a transference structure that fosters local transference and is equivalent to a multi-digraph. In this multi-digraph, each level of the graph corresponds to a group of features where tasks can be related. Tasks are related independently for each group of features in an asymmetrical fashion. Finally, on the right we have the output for each task.
By representing the relationship among tasks via multiple matrices, and considering the parameters of the tasks as a latent space for relationship, GAMTL promotes unique flexibility for the transference:
• Tasks may be related only on subsets of features.
• Groups of features can play distinct roles on different groups of related tasks.
• Transference is asymmetric: the influence of task t on task s may differ from the influence of task s on task t.
Fig. 1. On the left we see an input data representation: a design matrix and labels for each task, along with a possibly overlapping partition of the input feature set into groups, which are the same for all tasks. The training procedure is depicted in the middle, where an alternating optimization takes place. One step involves the optimization of the task parameters, so that each task is free to find its own feature sparsity pattern and the relationship between any pair of tasks is enforced locally to each group. The second step estimates how tasks are related considering each group of features. The resulting relationship matrices are shown as the adjacency matrix of a multi-digraph, where each level corresponds to a group of features, recursively used at the first step as the structural relationship among tasks, thus implementing the asymmetric local transference. The output is shown on the right, consisting of the predicted labels for each task.
Using the categorization in the survey of [37], GAMTL belongs to the parameter-based transference category of
MTL models. Another important aspect of our formulation is that GAMTL is designed for linear base models. The
structure that encodes the relationship of the tasks is based on linear combinations that can be easily interpreted.
The assumption that the parameters of one task can be decomposed as a linear combination of the parameters of
other tasks on each group of features, may be too restrictive for multi-layer nonlinear models, such as neural
networks.
When we consider all optimization variables at the same time, Eq. (2) ends up being a non-convex optimization problem, possibly with the presence of local minima [10]. In what follows, we derive smaller convex sub-problems that allow us to employ an alternating optimization procedure.
min_{b_t^g}  (1/2) ∥w̃_t^g − W^g b_t^g∥_2^2 + (λ1/λ2) ∥b_t^g∥_1    (5)

subject to  b_t^g ≥ 0.
This problem is similar to the Adaptive LASSO [43] and thus is convex but not differentiable at all points. Without the constraints in Eq. (5), it could be solved using any standard method for the LASSO. To handle the constraints (b_t^g ≥ 0, ∀g ∈ G, t ∈ T), GAMTL uses the Alternating Direction Method of Multipliers (ADMM) [4]. In the ADMM framework, the inequality constraint can be represented via an indicator function:
min  f(x) + h1(z1) + h2(z2)
subject to  x = z1    (6)
            x = z2
The two z_i-update steps can run in parallel, and the same holds for the u_i updates. The z_i-update steps are solved with proximal operators: soft-thresholding, S_κ(a) = (1 − κ/|a|)_+ a, and projection onto the non-negative orthant R_+, (a)_+ = max(0, a). The x-update step is a convex problem with a differentiable function f plus quadratic terms, which can be solved in closed form via Cholesky decomposition or by a proper gradient-based method. A GAMTL implementation using the Python programming language is available on GitHub¹.
1 [Link]
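As a sketch of the procedure just described (our variable names; ρ and the iteration count are illustrative choices, not the tuned values used in the experiments), a consensus-ADMM solver for the sub-problem in Eq. (5) could look like:

```python
import numpy as np

def soft(a, k):
    """Soft-thresholding: proximal operator of the l1-norm."""
    return np.sign(a) * np.maximum(np.abs(a) - k, 0.0)

def nonneg_lasso_admm(Wg, w_tilde, kappa, rho=1.0, iters=200):
    """ADMM sketch for the b_t^g sub-problem of Eq. (5):
    min 0.5*||w_tilde - Wg @ b||^2 + kappa*||b||_1  s.t.  b >= 0,
    split as x = z1 = z2, with h1 the l1 term and h2 the indicator of
    the non-negative orthant. `kappa` plays the role of lambda1/lambda2."""
    T = Wg.shape[1]
    x = np.zeros(T); z1 = np.zeros(T); z2 = np.zeros(T)
    u1 = np.zeros(T); u2 = np.zeros(T)
    A = Wg.T @ Wg + 2.0 * rho * np.eye(T)   # x-update matrix, fixed
    for _ in range(iters):
        rhs = Wg.T @ w_tilde + rho * (z1 - u1) + rho * (z2 - u2)
        x = np.linalg.solve(A, rhs)          # quadratic x-update
        z1 = soft(x + u1, kappa / rho)       # prox of the l1 term
        z2 = np.maximum(x + u2, 0.0)         # projection onto R+
        u1 += x - z1                         # dual updates
        u2 += x - z2
    return z2
```

In practice, A can be Cholesky-factored once (as noted above) since it does not change across iterations.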
4 EXPERIMENTS
To evaluate the performance of GAMTL when looking for better generalization over multiple tasks, we show the results of two experiments: one using an artificial setting, and the problem of predicting Alzheimer's Disease progression. We also provide an extensive stability analysis on the support of the task parameters and task relationship variables on this problem. For all experiments, we denote the variants of GAMTL as follows:
• GAMTL - standard formulation presented in Eq. (2);
• GAMTL-nl - without considering the loss as a weighting coefficient (Appendix A.1);
• GAMTL-nr - without the restriction B^g ≥ 0, ∀g ∈ G (Appendix A.2); and
• GAMTL-nlnr - without the restriction B^g ≥ 0, ∀g ∈ G and without considering the loss as a weighting coefficient (Appendix A.3).
parameters is set to zero. The parameters of the third and fourth tasks are generated in the same fashion, but the first group is set to zero while the second group is sampled from a standard Gaussian distribution. The last four tasks are based on the previous ones: we generated their parameters as a linear combination of the parameters of the previous tasks. The linear combination coefficients are sampled from a truncated Gaussian distribution, ensuring that all values are positive.
The design matrix of each task is sampled from a standard Gaussian distribution with 300 samples and 50 attributes. After that, we add Gaussian noise with σ = 0.4 for the first four tasks, and with σ = 2.9 for the remaining tasks. This difference in the amount of noise is related to our assumption of asymmetric transference based on loss. We expect the transference to occur from tasks with lower costs to tasks with higher costs, recovering the transference structure among all tasks. If all tasks presented the same level of noise, all transferences would be penalized similarly and the last four tasks would be equally encouraged to transfer back to the first tasks, resulting in quasi-symmetric matrices B^g, ∀g ∈ G.
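The generative process just described can be sketched as follows (the equal 25/25 split of the 50 attributes into the two groups is our illustrative assumption, since the text does not fix the group sizes, and `np.abs` stands in for the truncated Gaussian):

```python
import numpy as np

def make_synthetic_tasks(n=300, n_feats=50, seed=0):
    """Synthetic scenario sketch: tasks 1-2 active only on the first
    feature group, tasks 3-4 only on the second, tasks 5-8 non-negative
    linear combinations of tasks 1-4, with noisier observations."""
    rng = np.random.default_rng(seed)
    g1, g2 = slice(0, 25), slice(25, 50)          # two feature groups (assumed split)
    W = np.zeros((n_feats, 8))
    W[g1, 0:2] = rng.standard_normal((25, 2))     # tasks 1-2: group 1 only
    W[g2, 2:4] = rng.standard_normal((25, 2))     # tasks 3-4: group 2 only
    combo = np.abs(rng.standard_normal((4, 4)))   # positive combination weights
    W[:, 4:8] = W[:, 0:4] @ combo                 # tasks 5-8: combinations
    Xs, ys = [], []
    for t in range(8):
        X = rng.standard_normal((n, n_feats))
        sigma = 0.4 if t < 4 else 2.9             # dependent tasks are noisier
        ys.append(X @ W[:, t] + sigma * rng.standard_normal(n))
        Xs.append(X)
    return Xs, ys, W
```

The noise asymmetry (σ = 0.4 vs. σ = 2.9) is what makes the expected transference direction identifiable, as argued above.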
The number of samples available to the models for training varied from 30 to 100, as all methods converge to similar performance from this point on. The synthetic dataset is split so that 70% of the samples are used for training and 30% for testing. For each number of samples, we chose the hyper-parameters of all methods by a holdout procedure in which we split the training set into 70% for training and 30% for validation. The best parameters are used to train the models for 30 runs.
For this experiment, we compare the results of GAMTL with the LASSO [32] and the Group LASSO [14] as STL contenders. Hyper-parameters were chosen using the Python library Optuna [1], which, instead of using a grid search approach to find optimal hyper-parameters, implements a relational sampling strategy to search for the optimal values of some function in a given interval. In this case, the parameters of the search procedure are the limits of the values of each hyper-parameter and the number of trials used to update the relational sampling strategy. For each method, we sampled 200 trials of hyper-parameter values, and chose the values with the best normalized mean squared error (NMSE) on the validation portion of the training data. For the LASSO, the search limits were λ ∈ [10⁻⁵, 4], while for the Group LASSO λ ∈ [10⁻⁵, 15]. All variants of GAMTL used λ1, λ2, λ3 ∈ [10⁻⁵, 5]. We report the mean and standard deviation of the NMSE on the test set, over all runs.
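For readers without Optuna at hand, the search loop can be approximated by a plain log-uniform random search (a stand-in sketch only; Optuna's relational sampler is more sample-efficient than this):

```python
import math
import random

def random_search(objective, bounds, n_trials=200, seed=0):
    """Hyper-parameter search stand-in: sample each parameter
    log-uniformly from its (low, high) interval, keep the best trial.
    `bounds` maps hypothetical parameter names to intervals."""
    rng = random.Random(seed)
    best_val, best_cfg = float("inf"), None
    for _ in range(n_trials):
        cfg = {k: math.exp(rng.uniform(math.log(lo), math.log(hi)))
               for k, (lo, hi) in bounds.items()}
        val = objective(cfg)
        if val < best_val:
            best_val, best_cfg = val, cfg
    return best_cfg, best_val
```

The log-uniform sampling reflects the wide, multiplicative intervals used in the experiments (e.g., λ ∈ [10⁻⁵, 4]).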
Figure 2 shows the performance of all methods against the increasing number m of samples available for the training/testing procedure. When m ≤ 60, no STL method achieves reasonable NMSE values. Notice, however, that the Group LASSO outperforms the LASSO independently of the number of samples, precisely because it incorporates the group structure of the features.
All GAMTL variants improve upon STL, especially when the number of samples available for training is small, as in the interval between 30 and 70 samples. The four GAMTL variants present two distinct levels of improvement. The two variants that do not use the loss to penalize transference between tasks, i.e., GAMTL-nl and GAMTL-nlnr, improve upon the STL methods, but the best results are achieved by the two variants that use the loss to promote transference from tasks with a lower cost to tasks with a higher cost, i.e., GAMTL and GAMTL-nr. The improvement is more expressive when m = 30, where the scenario is highly ill-conditioned. As m increases, all methods start to have similar performance, as the number of samples provides enough information to solve the tasks.
As we are also interested in the relational structure that GAMTL provides, Figure 3 shows the relationship matrices B^g estimated by GAMTL when m = {40, 80, 90, 100}, side by side with the original B^g used in the generative process. We chose these numbers of samples to understand how precise the recovered structure is as we give more samples to the model. When m = 40, the task relationships are not similar to the true relationship matrices that we designed. However, notice that GAMTL detects that tasks 5 to 8 are related among themselves. Since they are all based on the same two tasks, depending on the groups of attributes, they are indeed related among themselves. This explains why GAMTL can increase performance even when the number of samples is low. When m = 80, as more samples are available, the recovered structure is sparser and less symmetrical,
Fig. 2. Normalized Mean Squared Error of all methods on the artificial dataset, with a varying number of samples available
for training. STL methods are shown using dashed lines. By leveraging the group partition information involving the features
of the tasks, the Group LASSO outperforms the LASSO. MTL methods are shown in solid lines. GAMTL variants show an
expressive gain in performance, especially when the number of training samples is low. Best viewed in color.
but the method still does not detect that all related tasks are based on tasks 1 to 4 (depending on the group of features). With m ≥ 90 samples per task, GAMTL can detect the dependency of tasks 5-8 on the first four tasks, also incorporating the distinct pattern associated with each group of attributes. GAMTL shows a good approximation to the original transference scheme for both groups of features, and the asymmetric influence among tasks is fully recovered.
[Figure 3: heatmaps of the original B_2 matrix and the B_2 matrices estimated with m = 40, 80, 90, 100; rows indexed by source task ("From task"), columns by target task ("To task").]
Fig. 3. Original relationship matrices, followed by the relationship matrices estimated by GAMTL on the artificial dataset with different training sizes. When m ≤ 80, GAMTL obtains only a rough estimate of how tasks are related to each other. Compared to the original relationship matrices, GAMTL achieves a good approximation of the relationships among tasks when m ≥ 90, for both groups of features. The asymmetric influence among tasks is also fully recovered.
of California at San Francisco, as described in [22], who performed cortical reconstruction and volumetric segmentation with the FreeSurfer image analysis suite. It contains information from 816 subjects that are the same for all tasks and are divided into three stages: cognitively normal (CN) (228), with mild cognitive impairment (MCI) (399), and with Alzheimer's disease (AD) (189). There is a total of 327 features, including cortical thickness average, cortical volume, and sub-cortical volume. The groups of features in this application correspond to the features derived from many regions of interest (ROI) in the brain. Labels for this dataset comprise five cognitive measures: Rey Auditory Verbal Learning Test (RAVLT) Total score (TOTAL), RAVLT 30 minutes delay score (T30), RAVLT recognition score (RECOG), Mini Mental State Exam score (MMSE), and Alzheimer's Disease Assessment Scale cognitive total score (ADAS). The usage of these scores is widespread, impacting drug trials, the assessment of the severity of AD symptoms, the progressive deterioration of functional ability, and deficiencies in memory, as highlighted in [22], thus evidencing the importance of this type of modeling.
In this experiment, we consider all STL and MTL methods used in the previous experiment, but add more state-of-the-art contenders. For completeness, we added AMTL [19], which is also based on using task parameters as a latent basis but does not account for groups of features; MT-SGL [22], which was proposed to handle this same problem; GO-MTL [18], which relies on a latent basis to model related tasks; MSSL [9], which accounts for unrelated tasks and estimates a precision matrix as the learning model structure for transference among tasks; MTRL [38], which uses a probabilistic framework and places a matrix-variate prior distribution on task coefficients to model their relationship; and MTFL [16], which groups tasks based on an orthogonal-complement sub-space decomposition where features are shared among tasks. As in the previous experiment, we used Optuna [1] to search for the hyper-parameter values of all methods, drawing 200 samples for each method. This time we used a 5-fold cross-validation procedure, where each fold contains the same proportion of participants
Table 1. NMSE of all methods on the ADNI dataset (mean and standard deviation over 30 runs). GAMTL-nr had the best results (highlighted in bold), closely followed by the other GAMTL variants and the MTRL method. A Mann-Whitney U non-parametric test was run, assuring the significance of the score improvement when comparing GAMTL-nr with all other methods.

      Method        NMSE
STL   LASSO         0.840 (2.2 · 10^−16)
      Group-LASSO   0.977 (2.0 · 10^−1)
MTL   GO-MTL        0.896 (1.1 · 10^−16)
      MSSL          0.818 (1.1 · 10^−16)
      MTFL          0.810 (2.2 · 10^−16)
      MT-SGL        0.801 (1.5 · 10^−13)
      MTRL          0.791 (2.2 · 10^−16)
from the stages CN, MCI, and AD. The search limits used to tune the methods' hyper-parameters are as follows: for the LASSO we searched λ ∈ [10^−5, 4], while for the Group LASSO λ ∈ [10^−5, 15]. For AMTL we used µ, λ ∈ [10^−5, 5]. GO-MTL had the number of groups set to 2, 3, or 4, with ρ1 ∈ [10^−4, 10] and ρ2 ∈ [10^−4, 10]. MSSL had ρ1, ρ2 ∈ [10^−5, 10]. For MTFL, we considered 2, 3, or 4 task groups, and ρ1, ρ2 ∈ [10^−5, 10]. MT-SGL used r ∈ [10^−5, 15]. MTRL hyper-parameters were chosen as ρ1, ρ2 ∈ [10^−4, 10]. All variants of GAMTL used λ1, λ2, λ3 ∈ [10^−5, 5]. We then selected the hyper-parameter values with the best result in this step and trained the methods for 30 runs, to account for the random initialization of the task parameters.
Table 1 shows the overall performance of all methods using the NMSE metric. Values are the mean and standard deviation over the 30 runs, and the best result is highlighted in bold.
Among the STL methods, LASSO presents the best score, but most MTL methods achieved better results than the LASSO. GAMTL variants achieved better results than all other methods, although with more variation in the results than most of them; as GAMTL estimates more parameters, this is an expected outcome. We used a Mann-Whitney U test with p ≤ 0.05 and verified that the score difference between GAMTL-nr and all other methods was statistically significant.
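The significance check can be reproduced with SciPy's Mann-Whitney U test. The score arrays below are synthetic stand-ins for the 30 per-run NMSE values of two methods, not the paper's actual values:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Hypothetical 30-run NMSE samples for GAMTL-nr and a competitor.
gamtl_nr = rng.normal(loc=0.79, scale=0.005, size=30)
other = rng.normal(loc=0.84, scale=0.005, size=30)

# Two-sided non-parametric test, rejecting at the p <= 0.05 level.
stat, p = mannwhitneyu(gamtl_nr, other, alternative="two-sided")
print(f"U={stat:.1f}, p={p:.2e}, significant={p <= 0.05}")
```

Being rank-based, the test makes no normality assumption on the per-run scores, which is why it fits the small 30-run samples here.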
To compare the methods on each task individually, we use the mean squared error (MSE), with the results presented in Table 2; the mean absolute error (MAE) is reported in the appendix (Section C, Table 3). For visual interpretation, the same information is depicted as bar plots in Figure 4, with one sub-figure per task showing the MSE obtained by all methods in the experiment.
AMTL presented the smallest MSE for the task TOTAL but showed poor performance on the other measurements. For the same task, Group LASSO shows wide variance in its results. For the task T30, the LASSO presents the best result, closely followed by MT-SGL. For all other tasks, GAMTL variants had the most competitive performance. In contrast with the task TOTAL, AMTL shows poor performance on the tasks RECOG and MMSE. GO-MTL behaves similarly: it is competitive on some tasks but presents poor results for the task MMSE. As for the ADAS task, the variation in performance among the methods is small. Each task benefits the most from a different strategy of transference, but still, task
Table 2. MSE of all methods per task on the ADNI dataset. The best results on each task are highlighted in bold.

Method       TOTAL                T30                  RECOG                MMSE                 ADAS
Group-LASSO  1.190 (9.9 · 10^−1)  0.705 (1.3 · 10^−1)  1.019 (9.6 · 10^−2)  0.736 (1.0 · 10^−1)  0.635 (7.1 · 10^−2)
GO-MTL       0.837 (0.0)          0.643 (2.2 · 10^−16) 0.842 (3.3 · 10^−16) 0.856 (1.1 · 10^−16) 0.584 (2.2 · 10^−16)
MSSL         0.846 (2.2 · 10^−16) 0.648 (3.3 · 10^−16) 0.856 (0.0)          0.597 (4.4 · 10^−16) 0.566 (1.1 · 10^−16)
MTFL         0.851 (2.2 · 10^−16) 0.648 (0.0)          0.839 (1.1 · 10^−16) 0.588 (2.2 · 10^−16) 0.554 (1.1 · 10^−16)
MT-SGL       0.885 (5.9 · 10^−13) 0.619 (5.6 · 10^−13) 0.760 (7.0 · 10^−13) 0.612 (3.4 · 10^−13) 0.551 (4.7 · 10^−13)
MTRL         0.848 (1.1 · 10^−16) 0.674 (1.1 · 10^−16) 0.786 (3.3 · 10^−16) 0.579 (1.1 · 10^−16) 0.520 (0.0)
AMTL         0.784 (0.0)          0.712 (0.0)          1.046 (0.0)          0.777 (0.0)          0.507 (0.0)
GAMTL        0.914 (2.7 · 10^−5)  0.653 (6.0 · 10^−5)  0.744 (1.5 · 10^−5)  0.560 (3.2 · 10^−5)  0.506 (1.8 · 10^−5)
GAMTL-nl     0.860 (9.3 · 10^−5)  0.646 (6.9 · 10^−5)  0.794 (1.7 · 10^−4)  0.563 (4.5 · 10^−5)  0.528 (4.2 · 10^−5)
GAMTL-nr     0.870 (5.0 · 10^−5)  0.654 (6.5 · 10^−5)  0.775 (6.0 · 10^−5)  0.555 (4.3 · 10^−5)  0.513 (3.3 · 10^−5)
GAMTL-nlnr   0.857 (3.1 · 10^−5)  0.645 (5.6 · 10^−5)  0.801 (9.8 · 10^−5)  0.566 (6.1 · 10^−5)  0.531 (2.8 · 10^−5)
T30 could not benefit from MTL. As each method holds distinct premises for the transference among tasks, this result indicates that no single transference mechanism rules them all. Most importantly, when not improving performance, some MTL methods incur poorer performance.
We now focus on methods that account for grouped features, to see how GAMTL improves upon their results. Choosing Group LASSO as the main reference, we take the difference in MSE between Group LASSO and the GAMTL variants for each run. Results are shown in Figure 5. Positive values indicate that the method had a smaller MSE than the Group LASSO (positive transference), while negative values indicate negative transference.
GAMTL variants improved the generalization performance on all tasks when compared with Group LASSO. Strong improvements appear for the RECOG, MMSE, and ADAS tasks, without incurring negative transference on the most challenging tasks (TOTAL and T30). RECOG is the task that benefits the most from the GAMTL models.
In Figure 6 we present a heatmap of the structural sparsity produced by each method that achieved the best result on at least one task. We take the mean of the parameter values for each group of parameters and, if the value is greater than zero, consider the group active, represented by a darker color. LASSO (Figure 6a) is used as a reference for the STL methods. AMTL obtained the best result for the task TOTAL and is represented in Figure 6b. Notice the presence of two groups of related tasks: TOTAL, T30, RECOG, and MMSE form one group, while ADAS is isolated in a singleton group. It is also noticeable that when tasks belong to the same group, they show a strongly related sparsity pattern across all task features.
GAMTL variants show sparser results (Figures 6c and 6d). The ADAS task also appears unrelated to the other tasks, presenting a different sparsity behavior in the GAMTL results. Both GAMTL methods allow the tasks to relate in different ways when sharing, thus leading to a more flexible structural sparsity pattern for related tasks. In this case, GAMTL allows the ADAS task to be related to the other tasks only in a few groups of features.
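The group-activity rule above can be expressed directly. The feature grouping and the use of absolute values (rather than the raw mean, which could cancel) are illustrative assumptions:

```python
import numpy as np

def active_groups(w, groups, eps=0.0):
    """Mark a group active when the mean absolute value of its parameters
    exceeds eps; the paper phrases this as the group mean being greater
    than zero, so the use of |.| here is our (safer) assumption."""
    return np.array([np.mean(np.abs(w[idx])) > eps for idx in groups])

# Toy example: 6 features split into 3 hypothetical groups.
w = np.array([0.0, 0.0, 0.5, -0.2, 0.0, 0.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
print(active_groups(w, groups))  # [False  True False]
```

Stacking these boolean vectors over tasks yields the binary activity map that the heatmaps in Figure 6 visualize.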
The transference scheme encoded in the Bд, ∀д ∈ G matrices is responsible for regularizing the parameters of the tasks to fit the estimated relationships. As these matrices carry interpretable information, we perform a Stability Selection [23] procedure that accounts for noise both in the data and in the hyper-parameter settings, validating which parameters of the learning models remain active in the final solution.
[Figure 4: five bar plots (tasks TOTAL, T30, RECOG, MMSE, and ADAS) showing the MSE of every method: LASSO, Group LASSO, GO-MTL, MSSL, MTFL, MT-SGL, MTRL, AMTL, and the GAMTL variants.]
Fig. 4. MSE of all methods for each of the five cognitive measures, with a blue horizontal line highlighting the best performance. For the task TOTAL, AMTL had the best performance, while Group LASSO shows some variance in its results. For the task T30, the LASSO presents the best result, closely followed by MT-SGL. For all other tasks, GAMTL variants had the most competitive performance.
[Figure 5: per-run MSE difference between Group LASSO and each GAMTL variant, for the tasks TOTAL, T30, RECOG, MMSE, and ADAS; positive values indicate positive transference, negative values indicate negative transference.]
Fig. 5. GAMTL outperforms the STL Group LASSO on every task. For the task TOTAL the gains vary due to the unstable performance of Group LASSO on that task. For the task T30 we see a consistent small gain, while GAMTL variants present a substantial gain for the tasks RECOG, MMSE, and ADAS.
Meinshausen and Bühlmann [23] proposed Stability Selection as a feature selection procedure that (i) relies on a sampling procedure to alleviate the importance of hyper-parameter selection and data noise; and (ii) computes the marginal probability of a feature being active over the total number of runs in the procedure.
Given a set of hyper-parameter values Γ, we randomly choose a subset of the available dataset without replacement and train the model, repeating this N times. After that, we compute the frequency with which each variable was active in the found solutions and filter the variables with a threshold. The overall process is described in Algorithm 2. For each variable i of our problem and a given configuration of hyper-parameters λ = (λ1, λ2, λ3) ∈ Γ, τi
[Figure 6 panels: (a) LASSO, (b) AMTL, (c) GAMTL, (d) GAMTL-nr; rows indexed by ROIs, columns by the tasks TOTAL, T30, RECOG, MMSE, and ADAS.]
Fig. 6. Sparsity pattern estimated by the methods with the best performance on at least one task. Darker cells indicate groups of attributes where the mean of their parameters is greater than zero. All methods show a distinct sparsity pattern on the ADAS task when compared to the other tasks. The results of the STL LASSO in 6a show a visual similarity involving the parameters of all but the ADAS task. AMTL, in 6b, takes advantage of the relationship among the tasks, showing a clearer shared pattern, but one that is less sparse. When comparing the results of AMTL with the LASSO, we see some groups of features that became active for the ADAS task but play no role in the STL result. GAMTL variants (shown in 6c and 6d) present even sparser results, with the benefit of not forcing groups to be active for ADAS when the task is not related to the others, preserving the flexibility of tasks to share only on the groups of features that are valuable for transference.
represents the percentage of times that variable i was active over all runs. Let Ŝ_λ = {τi | i ∈ W ∪ Bд, ∀д ∈ G} be the set of percentages, and Ŝ = {Ŝ_λ | λ ∈ Γ} be the set of percentages over all hyper-parameter values. A variable i is considered stable when the mean of its percentages over all Ŝ_λ is greater than a certain threshold. A ROI (which corresponds to a group of features) is stable if the mean of the percentages of all its features is greater than the threshold.
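Algorithm 2 is not reproduced here, but the frequency computation at its core can be sketched with a Lasso as the base learner (a stand-in for GAMTL; the data, subsample fraction, and run count are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

def selection_frequencies(X, y, alpha, n_runs=50, frac=0.5, seed=0):
    """Fraction of subsample runs in which each coefficient is non-zero;
    subsampling is without replacement, as in Stability Selection [23]."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    counts = np.zeros(d)
    for _ in range(n_runs):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        model = Lasso(alpha=alpha).fit(X[idx], y[idx])
        counts += (model.coef_ != 0)
    return counts / n_runs

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)  # only feature 0 matters
freq = selection_frequencies(X, y, alpha=0.1)
print(freq)
```

Averaging these frequencies over the hyper-parameter grid Γ and thresholding (at 80% below) yields the stable set; the truly informative feature is selected in essentially every run, while noise features rarely survive the penalty.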
We chose the model hyper-parameters from the set Γ = {(λ1, λ2, λ3) | λ1, λ2 ∈ [0.001, 5], λ3 ∈ [0.0001, 1]} and present results using a threshold of 80%, which is commonly used in the literature.
For each ROI we take the mean of the stability percentages of its features and compare the value against the pre-defined threshold of 80%, resulting in a binary matrix W_stab ∈ Z^{G×T} whose entries indicate which groups are active for which tasks. For visualization purposes, we apply a clustering analysis to W_stab. We set the number of clusters to 2 by comparing the Silhouette Score of the samples after experimenting with values from 2 to 10. The Silhouette Score helps validate the consistency of a clustering solution by measuring how similar each object is to its assigned cluster compared to the other clusters. Its values range from −1 to 1, where a low value indicates that the object would be better assigned to a different cluster, while a high value indicates that the sample is well suited to its assigned cluster. When the rows of W_stab are partitioned into 2 clusters, no sample shows a negative Silhouette Score. We apply a K-means procedure with 30 distinct runs to alleviate the effects of random initialization, keeping the solution with the smallest within-cluster sum of squares. Figure 7 presents W_stab split into those two groups.
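The cluster-count choice can be sketched with scikit-learn; the binary matrix below is a random two-block stand-in for W_stab, and the block sizes are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)
# Stand-in for the binary G x T stability matrix W_stab: one block of
# ROIs active mostly on the last task, the other on the first four.
W_stab = np.vstack([
    np.hstack([rng.random((20, 4)) < 0.1, rng.random((20, 1)) < 0.9]),
    np.hstack([rng.random((20, 4)) < 0.9, rng.random((20, 1)) < 0.1]),
]).astype(int)

best_k, best_score = None, -1.0
for k in range(2, 11):  # compare k = 2..10, as in the text
    # n_init=30 mirrors the 30 K-means restarts used in the paper.
    labels = KMeans(n_clusters=k, n_init=30, random_state=0).fit_predict(W_stab)
    score = silhouette_score(W_stab, labels)
    if score > best_score:
        best_k, best_score = k, score

labels = KMeans(n_clusters=best_k, n_init=30, random_state=0).fit_predict(W_stab)
n_negative = (silhouette_samples(W_stab, labels) < 0).sum()
print(best_k, best_score, n_negative)
```

Multiple K-means restarts (`n_init`) keep the reported partition from depending on one unlucky centroid initialization, which matters on small binary matrices like W_stab.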
In the first cluster (left) almost all ROIs are stable for the ADAS task, while almost no ROI is stable for the other tasks. The second cluster (right) shows stable ROIs for all tasks but ADAS. However, each cluster contains a few distinct active features depending on the task, showing the flexible transference among tasks. This is a key point in GAMTL models: the distinct behavior of features is an important characteristic of the MTL problem setting. If the model does not account for the distinct roles that features can play on related tasks, negative transference may occur. We should not enforce, on a different set of features, a relationship between two tasks that is highly expressed only in another set of features.
To further explore the transference among tasks, we picked two ROIs that are active for all tasks in the second cluster: the Left Caudate and the Left Inferior Temporal. Figure 8a shows the illustrative anatomical location of the Left Caudate on a template brain, and Figure 8b shows the estimated relationship among tasks for this ROI. The task RECOG is influenced by all other tasks (RECOG column) but influences only the task ADAS (see the row for task RECOG), while all other tasks are fully connected on this ROI. The Left Inferior Temporal ROI is depicted anatomically in Figure 8c. In this case, the ADAS task is not related to any other task; TOTAL and MMSE influence all other tasks, while receiving their influence as well; and RECOG influences the TOTAL and MMSE tasks while being influenced by TOTAL, T30, and MMSE. Even when choosing ROIs that are active in the solution of all tasks, we recover a different relationship scheme among tasks, stressing the need for a flexible mechanism to learn how transference occurs.
Considering now the estimated relationship matrices, for each Bд, ∀д ∈ G, we compute the average of the stability scores and choose the 6 ROIs with the highest average value:
• Left Cerebral Cortex;
• Right Inferior Temporal;
• Left Caudate;
• Left Accumbens Area;
• Left Pars Orbitalis;
• Left Superior Parietal.
Since the Left Caudate was already explored, we skip its presentation. Figure 9a illustrates the Left Cerebral Cortex, the ROI with the most stable transference among the tasks in all directions (Fig. 9d). This is the outermost layer
[Figure 7: W_stab split into two clusters of ROIs (Group 1 and Group 2), columns indexed by the tasks TOTAL, T30, RECOG, MMSE, and ADAS.]
Fig. 7. ROIs clustered by similar stability across all cognitive tests (tasks). The cluster on the left shows high activity for the ADAS task, with a sparse presence on the other tasks. The second cluster, in turn, is highly active for all tasks but ADAS, clearly showing the flexible transference possibilities of GAMTL.
surrounding the brain, which serves as a connection for several ROIs. We can see strong relationships among tasks in this analysis.
In Figure 9b we see the Right Inferior Temporal ROI, also presenting stable connections among all tasks (Fig. 9e), with the only exception being the transference from ADAS to RECOG.
The Accumbens Area is a small part of the Left Caudate ROI, depicted in Figure 9c. The relationship matrix in Figure 9f shows fewer stable connections when compared to the results of the previous ROIs. The MMSE task is not influenced by any other task, while influencing all but the ADAS task. The Left Pars Orbitalis is shown in Figure 9g. As we can see in Figure 9i, the pairs of tasks ADAS and TOTAL, and RECOG and MMSE, do not influence each other. Notice that, coincidentally, this ROI shows a symmetric relationship among tasks. Finally, the Left Superior Parietal (Figure 9h) presents sparser relationships among the tasks.
[Figure 8: relationship heatmaps with rows and columns indexed by TOTAL, T30, RECOG, MMSE, and ADAS.]
(a) Left Caudate illustrative anatomical position. (b) Relationship among all tasks for the Left Caudate ROI.
(c) Left Inferior Temporal illustrative anatomical position. (d) Relationship among all tasks for the Left Inferior Temporal ROI.
Fig. 8. The Left Caudate and Left Inferior Temporal ROIs, belonging to the second cluster, were stable on all tasks. On the left we see their illustrative anatomical position; on the right, the task relationships produced by GAMTL. Despite being part of the same cluster, these two ROIs present distinct transference among tasks.
These results agree with findings in the literature. For example, it is known that the left gray matter suffers greater loss than its symmetric counterpart in the presence of Alzheimer's disease [30]. It is also known that the left hemisphere as a whole is impacted by AD, especially the Temporal and Parietal areas [5, 31]. In this case, GAMTL found a stable solution in which the ROIs with the most transference activity are known to be related to the progression of Alzheimer's Disease.
In this section, we have provided empirical results of GAMTL in two scenarios: one using an artificial dataset, and a second one on the problem of predicting Alzheimer's Disease progression, with a stability analysis. By considering group sparsity while estimating an explainable relationship matrix for each group of features, GAMTL provides a higher level of analysis when compared with related methods in the literature. The Stability Selection, with its sampling procedure and wide variation of hyper-parameter values, shows which features are most likely active in the solutions found with the model. This allows us to benefit from the flexible transference mechanism to improve generalization capacity, and also to gain deeper insights into the different layers of relationship among tasks.
[Figure 9: each anatomical panel is paired with a relationship heatmap whose rows and columns are indexed by TOTAL, T30, RECOG, MMSE, and ADAS.]
(a) Left Cerebral Cortex. (b) Right Inferior Temporal. (c) Left Accumbens Area.
(d) Left Cerebral Cortex. (e) Right Inferior Temporal. (f) Left Accumbens Area.
Fig. 9. ROIs with the highest stability: the Left Cerebral Cortex, the Right Inferior Temporal, the Left Accumbens Area, the Left Pars Orbitalis, and the Left Superior Parietal. Each sub-figure shows the illustrative anatomical position of the ROI, together with the respective estimated relationship matrix.
ACKNOWLEDGMENTS
We acknowledge the Brazilian National Council for Scientific and Technological Development (CNPq), the São Paulo Research Foundation (FAPESP), and the Coordination for the Improvement of Higher Education Personnel (CAPES) for financial support.
REFERENCES
[1] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[2] Amir Beck and Marc Teboulle. 2009. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM J. Img. Sci. 2, 1 (March 2009), 183–202. [Link]
[3] James C. Bezdek and Richard J. Hathaway. 2002. Some Notes on Alternating Optimization. In Advances in Soft Computing, Nikhil R. Pal and Michio Sugeno (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 288–300.
[4] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. 2011. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Found. Trends Mach. Learn. 3, 1 (Jan. 2011), 1–122. [Link]
[5] Elisa Canu, Donald G McLaren, Michele E Fitzgerald, Barbara B Bendlin, Giada Zoccatelli, Franco Alessandrini, Francesca B Pizzini, Giuseppe K Ricciardi, Alberto Beltramello, Sterling C Johnson, and Giovanni B Frisoni. 2011. Mapping the structural brain changes in Alzheimer's disease: The independent contribution of two imaging modalities. Journal of Alzheimer's Disease 26, Suppl 3 (2011), 263–274. [Link]
[6] Jianhui Chen, Ji Liu, and Jieping Ye. 2012. Learning Incoherent Sparse and Low-Rank Patterns from Multiple Tasks. ACM Trans. Knowl. Discov. Data 5, 4, Article 22 (Feb. 2012), 31 pages. [Link]
[7] Michael Crawshaw. 2020. Multi-Task Learning with Deep Neural Networks: A Survey. arXiv:2009.09796 [cs, stat] (Sept. 2020). [Link]
[8] Theodoros Evgeniou and Massimiliano Pontil. 2004. Regularized Multi-Task Learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Seattle, WA, USA) (KDD '04). Association for Computing Machinery, New York, NY, USA, 109–117. [Link]
[9] André R. Gonçalves, Fernando J. Von Zuben, and Arindam Banerjee. 2016. Multi-task Sparse Structure Learning with Gaussian Copula Models. Journal of Machine Learning Research 17, 33 (2016), 1–30. [Link]
[10] Jochen Gorski, Frank Pfeuffer, and Kathrin Klamroth. 2007. Biconvex sets and optimization with biconvex functions: a survey and extensions. Mathematical Methods of Operations Research 66, 3 (2007), 373–407. [Link]
[11] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. 2015. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman & Hall/CRC.
[12] Zengyou He and Weichuan Yu. 2010. Stable feature selection for biomarker discovery. Computational Biology and Chemistry 34, 4 (2010), 215–225. [Link]
[13] Alzheimer's Disease International. 2018. World Alzheimer Report. Technical Report. Alzheimer's Disease International. https://[Link]/resource/world-alzheimer-report-2018/
[14] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. 2009. Group Lasso with Overlap and Graph Lasso. In Proceedings of the 26th Annual International Conference on Machine Learning (Montreal, Quebec, Canada) (ICML '09). ACM, New York, NY, USA, 433–440. [Link]
[15] Ali Jalali, Pradeep Ravikumar, Sujay Sanghavi, and Chao Ruan. 2010. A Dirty Model for Multi-Task Learning. In Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 1 (Vancouver, British Columbia, Canada) (NIPS'10). Curran Associates Inc., Red Hook, NY, USA, 964–972.
[16] Zhuoliang Kang, Kristen Grauman, and Fei Sha. 2011. Learning with Whom to Share in Multi-Task Feature Learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning (Bellevue, Washington, USA) (ICML'11). Omnipress, Madison, WI, USA, 521–528.
[17] Zaven S. Khachaturian. 1985. Diagnosis of Alzheimer's Disease. Archives of Neurology 42, 11 (1985), 1097–1105. [Link]
[18] Abhishek Kumar and Hal Daumé. 2012. Learning Task Grouping and Overlap in Multi-Task Learning. In Proceedings of the 29th International Conference on International Conference on Machine Learning (Edinburgh, Scotland) (ICML'12). Omnipress, Madison, WI, USA, 1723–1730.
[19] Giwoong Lee, Eunho Yang, and Sung Ju Hwang. 2016. Asymmetric Multi-task Learning Based on Task Relatedness and Loss. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48 (New York, NY, USA) (ICML'16). [Link], 230–238. [Link]
[20] Jun Liu, Shuiwang Ji, and Jieping Ye. 2009. Multi-task Feature Learning via Efficient L2,1-norm Minimization. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (Montreal, Quebec, Canada) (UAI '09). AUAI Press, Arlington, Virginia, United States, 339–348. [Link]
[21] Xiaoli Liu, Peng Cao, André R. Gonçalves, Dazhe Zhao, and Arindam Banerjee. 2018. Modeling Alzheimer's Disease Progression with Fused Laplacian Sparse Group Lasso. ACM Trans. Knowl. Discov. Data 12, 6, Article 65 (Aug. 2018), 35 pages. [Link]
[22] Xiaoli Liu, André R. Gonçalves, Peng Cao, Dazhe Zhao, and Arindam Banerjee. 2018. Modeling Alzheimer's disease cognitive scores using multi-task sparse group lasso. Computerized Medical Imaging and Graphics 66 (2018), 100–114. [Link]
[23] Nicolai Meinshausen and Peter Bühlmann. 2010. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72, 4 (2010), 417–473. [Link] arXiv:[Link]
[24] Sahand Negahban and Martin J. Wainwright. 2008. Joint Support Recovery under High-Dimensional Scaling: Benefits and Perils of l1,∞-Regularization. In Proceedings of the 21st International Conference on Neural Information Processing Systems (Vancouver, British Columbia, Canada) (NIPS'08). Curran Associates Inc., Red Hook, NY, USA, 1161–1168.
[25] Shuteng Niu, Yihao Hu, Jian Wang, Yongxin Liu, and Houbing Song. 2020. Feature-based Distant Domain Transfer Learning. In 2020 IEEE International Conference on Big Data (Big Data). 5164–5171. [Link]
[26] Shuteng Niu, Meryl Liu, Yongxin Liu, Jian Wang, and Houbing Song. 2021. Distant Domain Transfer Learning for Medical Imaging. IEEE Journal of Biomedical and Health Informatics 25, 10 (Oct. 2021), 3784–3793. [Link]
[27] Shuteng Niu, Yongxin Liu, Jian Wang, and Houbing Song. 2020. A Decade Survey of Transfer Learning (2010–2020). IEEE Transactions on Artificial Intelligence 1 (2020), 151–166.
[28] Guillaume Obozinski. 2011. Group Lasso with Overlaps: the Latent Group Lasso Approach.
[29] Saullo H. G. Oliveira, André R. Gonçalves, and Fernando J. Von Zuben. 2019. Group LASSO with Asymmetric Structure Estimation for Multi-Task Learning. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19. International Joint Conferences on Artificial Intelligence Organization, 3202–3208. [Link]
[30] Paul M. Thompson, Kiralee M. Hayashi, Greig de Zubicaray, Andrew L. Janke, Stephen E. Rose, James Semple, David Her-
man, Michael S. Hong, Stephanie S. Dittmer, David M. Doddrell, and Arthur W. Toga. 2003. Dynamics of Gray Matter Loss
in Alzheimer’s Disease. Journal of Neuroscience 23, 3 (2003), 994ś1005. [Link]
arXiv:[Link]
[31] Paul M. Thompson, Michael S. Mega, Roger P. Woods, Chris I. Zoumalan, Chris J. Lindshield, Rebecca E. Blanton, Jacob Moussai, Colin J.
Holmes, Jefrey L. Cummings, and Arthur W. Toga. 2001. Cortical Change in Alzheimer’s Disease Detected with a Disease-speciic
Population-based Brain Atlas. Cerebral Cortex 11, 1 (01 2001), 1ś16. [Link] arXiv:[Link]
[32] Robert Tibshirani. 1996. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological)
58, 1 (1996), 267ś288. [Link]
[33] Simon Vandenhende, Stamatios Georgoulis, Wouter Van Gansbeke, Marc Proesmans, Dengxin Dai, and Luc Van Gool. 2021. Multi-
Task Learning for Dense Prediction Tasks: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021), 1ś
1. [Link] [Link]= vandenhende_multi-task_2021-1, vandenhende_multi-task_2021-2, vandenhende_multi-
task_2021-3, vandenhende_multi-task_2021-4 arXiv: 2004.13379.
[34] Martin J. Wainwright. 2009. Sharp Thresholds for High-Dimensional and Noisy Sparsity Recovery Using l1-Constrained Quadratic
Programming (Lasso). IEEE Trans. Inf. Theor. 55, 5 (May 2009), 2183âĂŞ2202. [Link]
[35] Yangyang Xu and Wotao Yin. 2013. A Block Coordinate Descent Method for Regularized Multiconvex Optimization with Applications
to Nonnegative Tensor Factorization and Completion. SIAM Journal on Imaging Sciences [electronic only] 6 (07 2013). [Link]
1137/120887795
[36] Ming Yuan and Yi Lin. 2006. Model Selection and Estimation in Regression with Grouped Variables. Journal of the Royal Statistical
Society. Series B (Statistical Methodology) 68 (2006), 49ś67.
[37] Yu Zhang and Qiang Yang. 2017. A survey on multi-task learning. arXiv preprint arXiv:1707.08114 (2017).
[38] Yu Zhang and Dit-Yan Yeung. 2010. A Convex Formulation for Learning Task Relationships in Multi-task Learning. In Proceedings of the
Twenty-Sixth Conference on Uncertainty in Artiicial Intelligence (Catalina Island, CA) (UAI’10). AUAI Press, Arlington, Virginia, United
States, 733ś742. [Link]
[39] Yu Zhang and Dit-Yan Yeung. 2014. A Regularization Approach to Learning Task Relationships in Multitask Learning. ACM Trans.
Knowl. Discov. Data 8, 3, Article 12 (June 2014), 31 pages. [Link]
[40] Peng Zhao and Bin Yu. 2006. On Model Selection Consistency of Lasso. J. Mach. Learn. Res. 7 (Dec. 2006), 2541âĂŞ2563.
[41] Jiayu Zhou, Jianhui Chen, and Jieping Ye. 2011. Clustered Multi-Task Learning via Alternating Structure Optimization. In Proceedings of
the 24th International Conference on Neural Information Processing Systems (Granada, Spain) (NIPS’11). Curran Associates Inc., Red Hook,
NY, USA, 702–710.
[42] Jiayu Zhou, Lei Yuan, Jun Liu, and Jieping Ye. 2011. A Multi-Task Learning Formulation for Predicting Disease Progression. In Proceedings
of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Diego, California, USA) (KDD ’11). Association for Computing Machinery, New York, NY, USA, 814–822. [Link]
[43] Hui Zou. 2006. The Adaptive Lasso and Its Oracle Properties. J. Amer. Statist. Assoc. 101, 476 (2006), 1418–1429. [Link]
6 APPENDICES
A VARIANTS OF GAMTL
We call GAMTL the formulation that uses the loss to restrain tasks with higher costs from transferring to other tasks, and that restricts the values of all B^g to be equal to or greater than zero, as shown in Eq. 2. Consider this the standard formulation. In Section 4, however, we show results for four variants of GAMTL. These variants are based on two assumptions of the overall formulation presented in Eq. 2: (i) using the loss to regularize how much a task can transfer to other tasks; and (ii) the non-negativity restriction on the elements of B^g.
$$\begin{aligned}
\min_{W,\,B^g}\ \sum_{t \in \mathcal{T}} & \frac{1}{m_t} L(w_t) + \lambda_1 \sum_{g \in \mathcal{G}} \lVert b^g_{\bar{t}} \rVert_1 + \frac{\lambda_2}{2} \sum_{g \in \mathcal{G}} \left\lVert w^g_t - W^g b^g_t \right\rVert_2^2 + \lambda_3 \sum_{g \in \mathcal{G}} d_g \lVert w^g_t \rVert_2 \\
\text{subject to}\quad & \sum_{g \in \mathcal{G}} w^g_t = w_t, \\
& b^g_t \geq 0, \ \forall g \in \mathcal{G} \text{ and } t \in \mathcal{T}
\end{aligned} \tag{7}$$
The difference lies in the first term of Eq. 2, which is now expanded into the first two terms of Eq. 7. After expanding, we remove the product between the loss function and the l1-norm regularization on the b^g_t̄ variables. The regularization on the variables that encode how much a task transfers to the other tasks now depends only on the value of the hyper-parameter λ1.
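To make the role of each term in Eq. 7 concrete, the following sketch evaluates the objective of this variant. It is a minimal illustration under stated assumptions, not the paper's implementation: L(w_t) is taken as the squared loss, d_g as the square root of the group size (a common choice for group-lasso weights), and b^g_t is stored as column t of a per-group T×T matrix B[g].

```python
import numpy as np

def gamtl_nl_objective(W, B, Xs, ys, groups, lam1, lam2, lam3):
    """Evaluate the Eq. 7 objective (variant without loss weighting).

    W      : (d, T) matrix; column t is w_t
    B      : dict g -> (T, T) matrix; column t is b^g_t (transfer weights)
    Xs, ys : per-task data; Xs[t] has shape (m_t, d)
    groups : list of index arrays partitioning the d features
    """
    d, T = W.shape
    obj = 0.0
    for t in range(T):
        m_t = Xs[t].shape[0]
        resid = Xs[t] @ W[:, t] - ys[t]
        obj += resid @ resid / (2 * m_t)           # (1/m_t) L(w_t), squared loss assumed
        for gi, g in enumerate(groups):
            b_t = B[gi][:, t]
            b_bar = np.delete(b_t, t)              # transfers excluding the task itself
            obj += lam1 * np.abs(b_bar).sum()      # l1 penalty on b^g_t̄
            Wg = W[g, :]                           # W^g: rows of W restricted to group g
            diff = Wg[:, t] - Wg @ b_t
            obj += 0.5 * lam2 * diff @ diff        # group-level reconstruction term
            d_g = np.sqrt(len(g))                  # assumed group weight d_g
            obj += lam3 * d_g * np.linalg.norm(Wg[:, t])  # group-lasso term
    return obj
```

With W and B at zero, the objective reduces to the sum of the per-task losses, which gives a quick sanity check.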
In the second variant, compared to the standard version presented in Eq. 2, the difference lies in the removal of the non-negativity constraints on B^g:

$$\begin{aligned}
\min_{W,\,B^g}\ \sum_{t \in \mathcal{T}} & \frac{1}{m_t} \left( 1 + \lambda_1 \sum_{g \in \mathcal{G}} \lVert b^g_{\bar{t}} \rVert_1 \right) L(w_t) + \frac{\lambda_2}{2} \sum_{g \in \mathcal{G}} \left\lVert w^g_t - W^g b^g_t \right\rVert_2^2 + \lambda_3 \sum_{g \in \mathcal{G}} d_g \lVert w^g_t \rVert_2 \\
\text{subject to}\quad & \sum_{g \in \mathcal{G}} w^g_t = w_t.
\end{aligned} \tag{8}$$

The last variant combines the changes introduced in Equations 7 and 8, removing both the loss weighting and the constraints on B^g:

$$\begin{aligned}
\min_{W,\,B^g}\ \sum_{t \in \mathcal{T}} & \frac{1}{m_t} L(w_t) + \lambda_1 \sum_{g \in \mathcal{G}} \lVert b^g_{\bar{t}} \rVert_1 + \frac{\lambda_2}{2} \sum_{g \in \mathcal{G}} \left\lVert w^g_t - W^g b^g_t \right\rVert_2^2 + \lambda_3 \sum_{g \in \mathcal{G}} d_g \lVert w^g_t \rVert_2 \\
\text{subject to}\quad & \sum_{g \in \mathcal{G}} w^g_t = w_t.
\end{aligned} \tag{9}$$
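In an alternating-minimization solver, the only difference the non-negativity restriction makes at the B-update is in the proximal step for a transfer-weight vector b^g_t. A hedged sketch follows (soft-thresholding handles the l1 term; the paper's exact solver may differ, and for the l1-plus-non-negativity case this composition coincides with the exact one-sided prox):

```python
import numpy as np

def prox_b(b, step, lam1, nonneg=True):
    """Proximal update for a transfer-weight vector b^g_t:
    soft-threshold for the l1 penalty, then (for standard GAMTL)
    project onto the non-negative orthant. With nonneg=False we
    obtain the unconstrained variants."""
    out = np.sign(b) * np.maximum(np.abs(b) - step * lam1, 0.0)  # soft-threshold
    if nonneg:
        out = np.maximum(out, 0.0)  # enforce b^g_t >= 0
    return out
```

For example, with a threshold of 1.0, an entry of 2.0 shrinks to 1.0 in both cases, while an entry of -2.0 becomes 0.0 under the constraint and -1.0 without it.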
As $w_t \in W^g$, we can expand the fourth term and rearrange the variables to isolate $w_t$. Let
$$\tilde{w}_s = w_s - \sum_{u \in \mathcal{T} \setminus \{s,t\}} \sum_{g \in \mathcal{G}} w^g_u \, b^g_{us}.$$
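The auxiliary vector w̃_s can be computed directly from the per-group weight blocks. A minimal NumPy sketch, assuming (hypothetically) that the group blocks W^g are stored as a dict of (d_g, T) matrices whose stacked rows form the full weight vectors, and that B[g][u, s] holds the scalar b^g_{us}:

```python
import numpy as np

def w_tilde(s, t, W_groups, B, tasks):
    """Compute w̃_s = w_s - sum over u != s,t of sum over g of w^g_u * b^g_{us}.

    W_groups : dict g -> (d_g, T) per-group weight matrix (stacking the
               group blocks of column u reconstructs w_u)
    B        : dict g -> (T, T) matrix with B[g][u, s] = b^g_{us}
    tasks    : iterable of task indices (the set T)
    """
    # w_s is the stack of per-group blocks for task s
    w_s = np.concatenate([Wg[:, s] for Wg in W_groups.values()])
    acc = np.zeros_like(w_s)
    for u in tasks:
        if u in (s, t):
            continue  # skip the tasks excluded from the sum
        acc += np.concatenate([Wg[:, u] * B[g][u, s]
                               for g, Wg in W_groups.items()])
    return w_s - acc
```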
Table 3. Mean Absolute Error (MAE) of all methods per task in the ADNI dataset (standard deviations in parentheses). The best results on each task are highlighted in bold.

| Method | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 |
|---|---|---|---|---|---|
| Group-LASSO | 0.796 (2.4 · 10^-1) | 0.661 (5.2 · 10^-2) | 0.831 (3.8 · 10^-2) | 0.695 (5.0 · 10^-2) | 0.599 (4.2 · 10^-2) |
| GO-MTL | 0.715 (0.0) | 0.650 (0.0) | 0.736 (0.0) | 0.790 (0.0) | 0.575 (0.0) |
| MSSL | 0.709 (0.0) | 0.662 (0.0) | 0.757 (0.0) | 0.642 (0.0) | 0.556 (0.0) |
| MTFL | 0.711 (0.0) | 0.665 (0.0) | 0.748 (0.0) | 0.639 (0.0) | 0.549 (0.0) |
| MT-SGL | 0.725 (3.4 · 10^-14) | 0.654 (1.3 · 10^-13) | 0.700 (4.3 · 10^-13) | 0.653 (2.5 · 10^-14) | 0.558 (1.6 · 10^-13) |
| MTRL | 0.720 (0.0) | 0.676 (0.0) | 0.712 (0.0) | 0.640 (0.0) | 0.537 (0.0) |
| AMTL | 0.721 (0.0) | 0.669 (0.0) | 0.857 (0.0) | 0.758 (0.0) | 0.543 (0.0) |
| GAMTL | 0.743 (9.2 · 10^-6) | 0.659 (2.3 · 10^-5) | 0.692 (7.7 · 10^-6) | 0.631 (1.8 · 10^-5) | 0.549 (4.6 · 10^-5) |
| GAMTL_nl | 0.720 (4.9 · 10^-5) | 0.669 (5.0 · 10^-5) | 0.723 (1.0 · 10^-4) | 0.626 (5.2 · 10^-5) | 0.537 (3.5 · 10^-5) |
| GAMTL_nr | 0.725 (2.3 · 10^-5) | 0.672 (3.4 · 10^-5) | 0.713 (1.6 · 10^-5) | 0.623 (4.2 · 10^-5) | 0.531 (2.2 · 10^-5) |
| GAMTL_nlnr | 0.718 (1.5 · 10^-5) | 0.668 (4.6 · 10^-5) | 0.727 (4.4 · 10^-5) | 0.627 (5.6 · 10^-5) | 0.539 (3.0 · 10^-5) |