
Deep Transfer via Second-Order Markov Logic

Jesse Davis jdavis@[Link]


Pedro Domingos pedrod@[Link]
Department of Computer Science & Engineering, University of Washington, Seattle, WA 98195, USA

Abstract

Standard inductive learning requires that training and test instances come from the same distribution. Transfer learning seeks to remove this restriction. In shallow transfer, test instances are from the same domain, but have a different distribution. In deep transfer, test instances are from a different domain entirely (i.e., described by different predicates). Humans routinely perform deep transfer, but few learning systems, if any, are capable of it. In this paper we propose an approach based on a form of second-order Markov logic. Our algorithm discovers structural regularities in the source domain in the form of Markov logic formulas with predicate variables, and instantiates these formulas with predicates from the target domain. Using this approach, we have successfully transferred learned knowledge among molecular biology, social network and Web domains. The discovered patterns include broadly useful properties of predicates, like symmetry and transitivity, and relations among predicates, such as various forms of homophily.

1. Introduction

Inductive learning has traditionally been defined as generalizing from training instances to test instances from the same distribution. Decades of research have produced many sophisticated techniques for solving this task. Unfortunately, in real applications, training and test data often come from different distributions. Humans are able to cope with this much better than machines. In fact, humans are even able to take knowledge learned in one domain and apply it to an entirely different one. For example, Wall Street firms often hire physicists to solve finance problems. Even though these two domains have superficially nothing in common, training as a physicist provides knowledge and skills that are highly applicable in finance (e.g., solving differential equations and performing Monte Carlo simulations). In contrast, a model learned on physics data would simply not be applicable to finance data, because the variables in the two domains are different. What is missing is the ability to discover structural regularities that apply to many different domains, irrespective of their superficial descriptions. For example, two domains may be modeled by the same type of equation, and solution techniques learned in one can be applied in the other. The inability to do this is arguably the biggest gap between current learning systems and humans.

Transfer learning addresses this by explicitly assuming that the source and target problems are different. Interest in it has grown rapidly in recent years (e.g., Baxter et al., 1995; Silver et al., 2005; Banerjee et al., 2006; Taylor et al., 2008). Work to date falls mainly into what may be termed shallow transfer: generalizing to different distributions over the same variables, or different variations of the same domain (e.g., different numbers of objects). What remains largely unaddressed is deep transfer: generalizing across domains (i.e., between domains where the types of objects and variables are different).

There is a large body of work in a related field: analogical reasoning (e.g., Falkenhainer et al., 1989). Here knowledge from one domain is applied in another by establishing a correspondence between the objects and relations in them. However, this typically requires human help in establishing the correspondences and is carried out in an ad hoc manner. Further, whatever domain-independent knowledge was used remains implicit. Our goal is to develop a well-founded, fully automatic approach to deep transfer that makes domain-independent knowledge explicit, modular, and an object of discourse in its own right.

(Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).)

An approach that meets this goal must have a number of properties. It must be relational, since only relational knowledge can be transferred across domains. It must be probabilistic, to handle the uncertainty inherent in transfer in a principled way. Lastly, it must be second-order, since expressing domain-independent knowledge requires predicate variables. These requirements rule out standard inductive logic programming and first-order probabilistic approaches. To our knowledge, the only available representation that satisfies all the criteria is the second-order extension of Markov logic introduced by Kok and Domingos (2007). We use it as the basis for our approach.

Our approach to transfer learning, called DTM (Deep Transfer via Markov logic), discerns high-level similarities between logical formulas. Given a set of first-order formulas, DTM lifts the formulas into second-order logic by replacing all predicate names with predicate variables. It then groups the second-order formulas into cliques. DTM evaluates which second-order cliques represent regularities whose probability deviates significantly from independence among subcliques. Finally, it selects the top k highest-scoring second-order cliques to transfer, and instantiates the knowledge with the types of objects and variables present in the target domain. The transferred knowledge provides a declarative bias for structure learning in the target domain.

The most similar approach to ours in the literature is Mihalkova et al.'s TAMAR algorithm (2007). TAMAR learns Markov logic clauses in the source domain, and attempts to transfer each one to the target domain by replacing its predicates by target-domain predicates in all possible ways. While simple, this approach is unlikely to scale to the large, rich domains where transfer learning is most likely to be useful. Like analogical reasoning, it does not explicitly create domain-independent knowledge.

Using our approach, we have successfully transferred learned knowledge among social network, molecular biology and Web domains. In addition to improved empirical performance, our approach discovered patterns including broadly useful properties of predicates, like symmetry and transitivity, and relations among predicates, such as homophily.

2. Markov Logic

A Markov network is a model for the joint distribution of a set of variables $X = (X_1, X_2, \ldots, X_n)$ (Della Pietra et al., 1997). It is composed of an undirected graph $G$ and a set of potential functions $\phi_k$. The graph has a node for each variable, and the model has a potential function for each clique in the graph. The joint distribution represented by a Markov network is $P(X = x) = \frac{1}{Z} \prod_k \phi_k(x_{\{k\}})$, where $x_{\{k\}}$ is the state of the $k$th clique (i.e., the state of the variables that appear in that clique), and $Z$ is a normalization constant. Markov networks are often conveniently represented as log-linear models, with each clique potential replaced by an exponentiated weighted sum of features of the state: $P(X = x) = \frac{1}{Z} \exp\left(\sum_j w_j f_j(x)\right)$. A feature $f_j(x)$ may be any real-valued function of the state.

Markov logic is a probabilistic extension of first-order logic. It combines first-order logic with Markov networks. Formally, a Markov logic network (MLN) is a set of pairs $(F_i, w_i)$, where $F_i$ is a first-order formula and $w_i \in \mathbb{R}$. MLNs soften logic by associating a weight with each formula. Worlds that violate formulas become less likely, but not impossible. MLNs provide a template for constructing Markov networks. When given a finite set of constants, the formulas from an MLN define a Markov network with one node for each ground predicate and one feature for each ground clause. The weight of a feature is the weight of the first-order clause that originated it. An MLN induces the following probability distribution: $P(X = x) = \frac{1}{Z} \exp\left(\sum_{i=1}^{F} w_i n_i(x)\right)$, where $F$ is the number of formulas in the MLN, $w_i$ is the weight of the $i$th formula, and $n_i(x)$ is the number of true groundings of $F_i$ in possible world $x$.
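To make this distribution concrete, the following minimal sketch (in Python, with illustrative predicate and formula names that are not from the paper) computes the unnormalized probability of one possible world under a one-formula MLN: count the formula's true groundings, weight the count, and exponentiate.

    import itertools
    import math

    # A toy world over constants {A, B} with one unary predicate Smokes and
    # one binary predicate Friends. (Illustrative names, not from the paper.)
    constants = ["A", "B"]
    world = {
        ("Smokes", ("A",)): True,
        ("Smokes", ("B",)): False,
        ("Friends", ("A", "B")): True,
        ("Friends", ("B", "A")): True,
        ("Friends", ("A", "A")): False,
        ("Friends", ("B", "B")): False,
    }

    def n_true_groundings(world, constants):
        """Count true groundings of Friends(x, y) ^ Smokes(x) => Smokes(y)."""
        count = 0
        for x, y in itertools.product(constants, repeat=2):
            body = world[("Friends", (x, y))] and world[("Smokes", (x,))]
            head = world[("Smokes", (y,))]
            if (not body) or head:  # the implication holds for this grounding
                count += 1
        return count

    w = 1.5  # the formula's weight
    # exp(w * n(x)); dividing by Z (this quantity summed over all possible
    # worlds) would give the normalized probability P(X = x).
    print(math.exp(w * n_true_groundings(world, constants)))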
Algorithms have been proposed for learning the weights associated with each formula (e.g., Lowd & Domingos, 2007), as well as for learning the formulas themselves (e.g., Kok & Domingos, 2005). We will focus on the algorithm of Kok and Domingos, which we will call MSL. Structure learning consists of two components: constructing clauses and evaluating clauses. Clause construction follows an inductive logic programming-style search (Dzeroski & Lavrac, 2001). When learning a network from scratch, MSL starts with an empty clause and specializes it by successively adding literals to the clause. Additionally, the algorithm can refine an existing network to correct errors in the clauses. Here, it considers both removing and adding literals to a clause as well as flipping the sign of a literal in the clause. The algorithm uses a beam search to find the current best clause and add it to the network. The search ends when no clause improves the score of the MLN. To evaluate the merit of each clause, it uses weighted pseudo-log-likelihood (WPLL). To avoid overfitting, each clause receives a penalty term proportional to the number of literals that differ between the current clause and the initial clause.

3. Deep Transfer via Markov Logic

The formulas in an MLN capture regularities that hold in the data for a given domain. However, the knowledge that the formulas encode is specific to the types of objects and predicates present in that domain. Deep transfer attempts to generalize learned knowledge across domains that have different types of objects and predicates. We call our approach Deep Transfer via Markov Logic (DTM). In order to abstract away the superficial domain description, DTM uses second-order Markov logic, where formulas contain predicate variables (Kok & Domingos, 2007) to model common structures among first-order formulas. To illustrate the intuition behind DTM, consider the formulas Complex(z, y) ∧ Interacts(x, z) ⇒ Complex(x, y) and Location(z, y) ∧ Interacts(x, z) ⇒ Location(x, y). Both are instantiations of r(z, y) ∧ s(x, z) ⇒ r(x, y), where r and s are predicate variables. Introducing predicate variables allows DTM to represent high-level structural regularities in a domain-independent fashion. This knowledge can be transferred to another problem, where the formulas are instantiated with the appropriate predicate names. The key assumption that DTM makes is that the target domain shares some second-order structure with the source domain.

Given a set of first-order formulas, DTM converts each formula into second-order logic by replacing all predicate names with predicate variables. It then groups the second-order formulas into cliques. Two second-order formulas are assigned to the same clique if and only if they are over the same set of literals. DTM evaluates which second-order cliques represent regularities whose probability deviates significantly from independence among their subcliques. It selects the top k highest-scoring second-order cliques to transfer to the target domain. The transferred knowledge provides a declarative bias for structure learning in the target domain.
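The lifting step is purely mechanical: every predicate symbol is consistently replaced by a predicate variable. A minimal sketch, assuming formulas are represented as lists of (predicate, arguments) literals; this encoding is illustrative, not the paper's actual data structure.

    def lift(formula):
        """Replace each distinct predicate name with a predicate variable
        (r, s, t, ...), using the same variable for every occurrence."""
        predicate_vars = iter("rstuvw")
        mapping = {}
        lifted = []
        for predicate, args in formula:
            if predicate not in mapping:
                mapping[predicate] = next(predicate_vars)
            lifted.append((mapping[predicate], args))
        return lifted

    # Complex(z, y) ^ Interacts(x, z) => Complex(x, y) and its Location
    # counterpart, written as flat literal lists, lift to the same
    # second-order form r(z, y) ^ s(x, z) => r(x, y).
    f1 = [("Complex", ("z", "y")), ("Interacts", ("x", "z")), ("Complex", ("x", "y"))]
    f2 = [("Location", ("z", "y")), ("Interacts", ("x", "z")), ("Location", ("x", "y"))]
    assert lift(f1) == lift(f2)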
The four key elements of DTM, introduced in the next subsections, are: (i) how to define cliques, (ii) how to search for cliques, (iii) how to score the cliques and (iv) how to apply cliques to a new problem.

3.1. Second-order Cliques

DTM uses second-order cliques to model second-order structure. It is preferable to use this representation as opposed to arbitrary second-order formulas because multiple different formulas over the same predicates can capture the same regularity. A clique groups those formulas with similar effects into one structure. A set of literals with predicate variables, such as {r(x, y), r(y, x)}, defines each second-order clique, and the states, or features, are conjunctions over the literals in the clique. It is more convenient to look at conjunctions than clauses as they do not overlap, and can be evaluated separately. The features correspond to all possible ways of negating the literals in a clique. The features for {r(x, y), r(y, x)} are {r(x, y) ∧ r(y, x)}, {r(x, y) ∧ ¬r(y, x)}, {¬r(x, y) ∧ r(y, x)} and {¬r(x, y) ∧ ¬r(y, x)}. Relational Markov networks (RMNs) (Taskar et al., 2002) are another representation language that makes use of relational cliques. However, RMNs only look at first-order relations, and are a special case of Markov logic, whereas DTM makes use of second-order logic. Furthermore, RMNs do not allow uncertainty over relations, only over attributes, and they scale poorly with clique size.

DTM imposes the following restrictions on cliques:

1. The literals in the clique are connected. That is, a path of shared variables must connect each pair of literals in the clique.

2. No cliques are the same modulo variable renaming. For example, {r(x, y), r(z, y), s(x, z)} and {r(z, y), r(x, y), s(z, x)}, where r and s are predicate variables, are equivalent as the second clique renames variable x to z and variable z to x.

The states of a clique are all possible ways of negating the literals in the clique, subject to the following constraints:

1. No two features are the same modulo variable renaming.

2. Two distinct variables are not allowed to unify. For example, r(x, y) ∧ r(y, x) really represents the following formula: r(x, y) ∧ r(y, x) ∧ x ≠ y. This constraint ensures that a possible grounding does not appear in two separate cliques. For example, without this constraint, all true groundings of r(x, y) ∧ s(x, x) would also be true of r(x, y) ∧ s(x, z) (which appears as a feature in a different clique).
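Since a clique's features are all ways of negating its literals, a clique with n literals has up to 2^n features before the two constraints above prune equivalent ones. A minimal sketch of the enumeration (it omits the modulo-renaming and distinct-variable checks; the representation is illustrative):

    from itertools import product

    def clique_features(literals):
        """Enumerate every sign assignment over a clique's literals.
        Each feature is a conjunction; True = positive, False = negated."""
        return [list(zip(literals, signs))
                for signs in product([True, False], repeat=len(literals))]

    for feature in clique_features(["r(x, y)", "r(y, x)"]):
        print(" ^ ".join(lit if pos else "!" + lit for lit, pos in feature))
    # r(x, y) ^ r(y, x)
    # r(x, y) ^ !r(y, x)
    # !r(x, y) ^ r(y, x)
    # !r(x, y) ^ !r(y, x)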
3.2. Search

DTM works with any learner that induces formulas in first-order logic. This paper evaluates two separate strategies for inducing formulas in the source domain: exhaustive search and beam search.

Exhaustive search. Given a source domain, the learner generates all first-order clauses up to a maximum clause length and a maximum number of object variables. The entire set of clauses is passed to DTM for evaluation.

Beam search. Exhaustive search does not scale well, since the number of clauses it produces is exponential in the clause length and it becomes computationally infeasible to score long clauses. Beam search is a common strategy for scaling structure learning, and is used in MSL (described in Section 2). However, transfer learning and structure learning have different objectives. In transfer learning, the goal is to derive a large, diverse set of clauses to evaluate for potential transfer to the target domain. Structure learning simply needs to induce a compact theory that accurately models the predicates in the source domain. The theories induced by MSL tend to contain very few clauses and thus are not ideal for transfer. An alternative approach is to induce a separate theory to predict each predicate in the domain. However, the resulting theory may not be very diverse, since clauses will contain only the target predicate and predicates in its Markov blanket (i.e., its neighbors in the network). A better approach is to construct models that predict sets of predicates.

Given the final set of learned models, DTM groups the clauses into second-order cliques and evaluates each clique that appears more than twice.

3.3. Second-Order Clique Evaluation

Each clique can be decomposed into pairs of subcliques, and it should capture a regularity beyond what its subcliques represent. For example, the second-order clique {r(x, y), r(z, y), s(x, z)} can be decomposed into the following three pairs of subcliques: (i) {r(x, y), r(z, y)} and {s(x, z)}, (ii) {r(x, y), s(x, z)} and {r(z, y)}, and (iii) {r(z, y), s(x, z)} and {r(x, y)}.

To score a second-order clique, each of its first-order instantiations is evaluated. To score a first-order clique, DTM checks if its probability distribution is significantly different from the product of the probabilities of each possible pair of subcliques that it can be decomposed into.

The natural way to compare these two distributions is to use the K-L divergence, $D(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$, where $p$ is the clique's probability distribution, and $q$ is the distribution it would have if the two subcliques were independent. We use Bayesian estimates of the probabilities with Dirichlet priors with all $\alpha_i = 1$.

For each first-order instantiation of a second-order clique, DTM computes its K-L divergence versus all its decompositions. Each instantiation receives the minimum K-L score over the set of its decompositions, because any single one could explain the clique's distribution. Each second-order clique receives the average score of its top m first-order instantiations, in order to favor second-order cliques that have multiple useful instantiations.
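A minimal sketch of this scoring step, assuming feature counts for a first-order clique and its decompositions are already available; with all αi = 1, the Dirichlet prior reduces to add-one smoothing of the counts. The data layout and helper names are illustrative assumptions, not the paper's implementation.

    import math

    def smoothed_dist(counts):
        """Bayesian probability estimates with a Dirichlet prior, all
        alpha_i = 1 (add-one smoothing). `counts` must list every state,
        including zero-count ones."""
        total = sum(counts.values()) + len(counts)
        return {state: (n + 1) / total for state, n in counts.items()}

    def kl(p, q):
        """K-L divergence D(p || q) over a shared set of states."""
        return sum(p[s] * math.log(p[s] / q[s]) for s in p)

    def instantiation_score(clique_counts, decompositions):
        """Score one first-order instantiation: the minimum K-L divergence
        between its distribution p and the independent product q of each
        pair of subcliques. `decompositions` holds (counts_a, counts_b,
        split) triples, where split maps a clique state to its pair of
        subclique states (a bijection, so q is normalized)."""
        p = smoothed_dist(clique_counts)
        scores = []
        for counts_a, counts_b, split in decompositions:
            pa, pb = smoothed_dist(counts_a), smoothed_dist(counts_b)
            q = {s: pa[split(s)[0]] * pb[split(s)[1]] for s in p}
            scores.append(kl(p, q))
        return min(scores)  # any single decomposition could explain p

A second-order clique's score would then be the average of this instantiation score over its top m first-order instantiations.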
3.4. Transfer Mechanism

The next question is how to make use of the transferred knowledge in the target domain. A key component of an inductive logic programming (ILP) (Dzeroski & Lavrac, 2001) system is the declarative bias. Due to the large search space of possible first-order clauses, devising a good declarative bias is crucial for achieving good results with an ILP system. In ILP, two primary methods exist for expressing a declarative bias, and both forms of bias are often used in the same system. The first method restricts the search space. Common strategies include having type constraints, forcing certain predicate arguments to contain bound variables, and setting a maximum clause length. The second method involves incorporating background knowledge. Background knowledge comes in the form of hand-crafted clauses that define additional predicates which can be added to a clause under construction. Effectively, background knowledge allows the learner to add multiple literals to a clause at once and overcome the myopia of greedy search. It is important to note that these common strategies can easily be expressed in second-order logic, and this is part of what motivates our approach. DTM can be viewed as a way to learn the declarative bias in one domain and apply it in another, as opposed to having a user hand-craft the bias for each domain.

When applying a second-order clique in a target domain, DTM decomposes the clique into a set of clauses, and transfers each clause. It uses clauses instead of conjunctions since most structure learning approaches, both in Markov logic and ILP, construct clauses. In Markov logic, a conjunction can be converted into an equivalent clause by negating each literal in it and flipping the sign of its weight.
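This conversion rests on the equivalence w · (a ∧ b) ≡ −w · (¬a ∨ ¬b): negating a feature and flipping its weight changes the sum of weighted grounding counts only by a constant, which is absorbed into Z. A minimal sketch, with literals encoded as (atom, positive?) pairs (an illustrative encoding):

    def conjunction_to_clause(literals, weight):
        """Convert a weighted conjunction into the equivalent weighted
        clause: negate every literal and flip the weight's sign."""
        return [(atom, not positive) for atom, positive in literals], -weight

    # 1.2 * (r(x, y) ^ !r(y, x))  becomes  -1.2 * (!r(x, y) v r(y, x))
    print(conjunction_to_clause([("r(x, y)", True), ("r(y, x)", False)], 1.2))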
We investigate three different ways to reapply the knowledge in the target domain:

Transfer by Seeding the Beam. In the first approach, the second-order cliques provide a declarative bias for the standard MLN structure search. DTM selects the top k cliques that have at least one true grounding in the target domain. At the beginning of each iteration of the beam search, DTM seeds the beam with the clauses obtained from each legal instantiation of a clique in the target domain (i.e., only instantiations that conform with the type constraints of the domain are considered). This strategy forces certain clauses to be evaluated in the search process that would not be scored otherwise and helps overcome some of the limitations of greedy search.

Greedy Transfer without Refinement. The second approach again picks the top k second-order cliques that have at least one true grounding in the target domain and creates all legal instantiations in the target domain. This algorithm imposes a very stringent bias by performing a structure search where it only considers including the transferred clauses. The algorithm evaluates all clauses and greedily picks the one that leads to the biggest improvement in WPLL. The search terminates when no clause addition improves the WPLL.

Greedy Transfer with Refinement. The final approach adds a refinement step to the previous algorithm. In this case, the MLN generated by the greedy procedure serves as a seed network during standard structure learning. The MLN search can now refine the clauses picked by the greedy procedure to better match the target domain. Additionally, it can induce new clauses to add to the theory.
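All three mechanisms share the same instantiation step: replace each predicate variable with a target-domain predicate, keeping only combinations whose argument types are consistent. A minimal sketch of that shared step; the predicate signatures and data layout are illustrative assumptions, not the paper's.

    from itertools import product

    # Target-domain predicates and their argument types (illustrative).
    target_preds = {
        "Linked": ("page", "page"),
        "PageClass": ("page", "class"),
        "HasWord": ("page", "word"),
    }

    def instantiations(literals):
        """Ground each predicate variable (r, s, ...) with a target-domain
        predicate (repeats allowed), keeping only type-consistent choices:
        every logical variable must get the same type wherever it appears."""
        pred_vars = sorted({p for p, _ in literals})
        for choice in product(target_preds, repeat=len(pred_vars)):
            mapping = dict(zip(pred_vars, choice))
            var_types, ok = {}, True
            for pvar, args in literals:
                for arg, typ in zip(args, target_preds[mapping[pvar]]):
                    if var_types.setdefault(arg, typ) != typ:
                        ok = False  # type conflict on a shared variable
            if ok:
                yield [(mapping[p], args) for p, args in literals]

    # Legal target instantiations of the clique {r(x, y), r(z, y), s(x, z)}:
    for inst in instantiations([("r", ("x", "y")), ("r", ("z", "y")), ("s", ("x", "z"))]):
        print(inst)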
DTM differs from previous approaches to declarative bias in two main ways: it is probabilistic, and it provides the bias entirely automatically. For example, relational clichés (Silverstein & Pazzani, 1991) are second-order formulas that define potential refinements for a candidate clause. However, the clichés are all hand-coded. DTM is also more fully automated than the Clint/Cia approach (De Raedt & Bruynooghe, 1992), which uses an oracle to determine which background knowledge predicates to construct.

A more automated approach is that of Bridewell and Todorovski (2007), which uses ILP to learn the declarative bias for process modeling domains. Their algorithm analyzes a series of models in order to discern the components that are common among accurate models, which then become the bias for new domains. This approach requires a set of previously learned models in order to derive the bias, while DTM only needs data from one source domain. Additionally, this approach would require further hand-crafting of background knowledge in order to extend it beyond process modeling problems.

4. Empirical Evaluation

In this section, we evaluate our approach on three real-world data sets. We compare the DTM algorithm to the Markov logic structure learning (MSL) algorithm described in Section 2 and to TAMAR. MSL is implemented in the publicly available Alchemy package (Kok et al., 2009). We made two modifications to MSL. First, when counting the number of true groundings of a clause, we do not permit two distinct variables to unify to the same constant. We did this to ensure that we evaluate clauses in the same manner that we evaluate cliques. Second, we modify MSL to allow learning clauses that contain constants. We made this modification since we are interested in predicting specific values of attributes, such as the class label of a Web page, for each object in a domain. We first describe the domains and characteristics of the data sets we use and then present and discuss our experimental results.

4.1. Domains

The data for the Yeast Protein task come from the MIPS (Munich Information Center for Protein Sequence) Comprehensive Yeast Genome Database as of February 2005 (Mewes et al., 2000; Davis et al., 2005). The data set includes information on protein location, function, phenotype, class, and enzymes. It also includes information about protein-protein interaction and protein complex data. The data contain information about approximately 4500 proteins that are involved in some interaction. We created four disjoint subsamples of this data that each contain around 450 proteins. To create each subsample, we randomly selected a seed set of proteins. We included all previously unselected proteins that appeared within two links (via the Interaction predicate) of the seed set. For this data set, we attempt to predict the truth value of all groundings of two predicates: Function and Interaction.

The WebKB data set consists of labeled Web pages from the computer science departments of four universities. We used the relational version of the data set from Craven and Slattery (2001), which features 4140 Web pages and 10,009 Web links, and neighborhoods around each link. Each Web page is marked with some subset of the following categories: person, student, faculty, professor, department, research project, and course. For this data set, we again try two tasks. First, we perform the "collective classification" task, and predict the class label for each Web page. Second, we perform the "link prediction" task by predicting whether a hyperlink exists between each pair of Web pages.

The Social Network data set consists of pages collected from the Facebook social networking Web site. The data consist of information about friendships among individuals and properties of individuals, such as hobbies, interests and network memberships. It has information about approximately 600 individuals, thirteen different properties of each and 24,000 friendship pairs.

4.2. High-Scoring Second-order Cliques

An important evaluation measure for transfer learning is whether it discovers and transfers relevant knowledge. In fact, DTM discovers multiple broadly useful properties. For example, among cliques of length three with up to three object variables, {r(x, y), r(z, y), s(x, z)} is ranked first in all three domains. This represents the concept of homophily, which is present when related objects (x and z) tend to have similar properties (y). The clique {r(x, y), r(y, z), r(z, x)} is ranked second in Yeast and fourth in WebKB and Facebook and represents the concept of transitivity. Both homophily and transitivity are important concepts that appear in a variety of domains. Therefore it is quite significant that DTM discovers and assigns a high ranking to these concepts.

In all three domains, the top three cliques of length two are: {r(x, y), r(z, y)}, {r(x, y), r(x, z)}, and {r(x, y), r(y, x)}. The first clique captures the fact that particular feature values are more common across objects in a domain than others. For example, a computer science department most likely has more student Web pages than faculty Web pages. The second clique captures the fact that pairs of values for the same feature, such as words in the WebKB domain, commonly co-occur in the same object. The final clique captures symmetry, another important general property of relations.

4.3. Methodology

The central question our experiments aimed to address was whether DTM improved performance in the target domain compared to learning from scratch with MSL. We tried four source-target pairs: Facebook to Yeast Protein, WebKB to Yeast Protein, Facebook to WebKB and Yeast Protein to WebKB. Each target domain was divided into four disjoint subsets, which we called mega-examples. We selected a subset of the mega-examples to train the learner on and then tested it on the remaining mega-examples. We repeated this train-test cycle for all possible subsets of the mega-examples.

Within each domain, all algorithms had the same parameter settings. Each algorithm was allotted 100 hours per database to run. In each domain, we optimized the WPLL for the two predicates we were interested in predicting. For DTM we tried two settings, k = 5 and k = 10, for the number k of second-order cliques transferred to the target domain. We tried two settings for exhaustive search. The first evaluated all clauses containing at most three literals and three object variables and the second evaluated all clauses containing up to four literals and four object variables. When running beam search in the source domain, we constructed theories for predicting pairs of predicates (a tradeoff between diversity and tractability).

In Yeast, we permitted only function constants to appear in learned clauses. We wanted to know exactly which functions (e.g., metabolism, DNA processing, etc.) each protein performed, not just that the protein had some function. In WebKB, we permitted page class to appear in clauses. Here we wanted to predict which labels (e.g., student, person, etc.) applied to each Web page, not just whether the Web page has a label. For each train-test split, we used the information gain (on the training set) to pick the top fifty words most predictive of page class. For tractability, we restricted the algorithm to only learn with these constants.

To evaluate each system, we measured the test set conditional log-likelihood (CLL) and area under the precision-recall curve (AUC) for each predicate. The advantage of the CLL is that it directly measures the quality of the probability estimates produced. The advantage of the AUC is that it is insensitive to the large number of true negatives.

4.4. Results

Figure 1 shows representative learning curves for predicting protein function when transferring cliques of up to length four from the WebKB domain into the Yeast domain. DTM consistently outperforms MSL. As expected, DTM performs better relative to MSL with less data. Table 1 compares the performance, measured by average relative difference (e.g., $(AUC_{DTM} - AUC_{MSL})/AUC_{MSL}$), of DTM using greedy transfer with refinement to MSL when transferring cliques of up to length three. In this domain, transfer greatly improves the models' ability to predict both the AUC and CLL for the Function predicate. For this predicate, in the length three setting, DTM outperforms MSL on 73% (82 out of 112) of the various different conditions. In the length four setting, it wins 75% of the cases (42 out of 56) when using WebKB as the source domain.
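Tables 1 and 2 report this average relative difference; a tiny hedged sketch of the metric, with made-up scores rather than values from the experiments:

    def avg_relative_difference(dtm_scores, msl_scores):
        """Average of (metric_DTM - metric_MSL) / metric_MSL over
        matched test conditions (illustrative helper, not from the paper)."""
        diffs = [(d - m) / m for d, m in zip(dtm_scores, msl_scores)]
        return sum(diffs) / len(diffs)

    # A table entry of 0.35 would mean DTM scored 35% higher than MSL,
    # relative to MSL, on average over the matched conditions.
    print(avg_relative_difference([0.6, 0.45], [0.5, 0.3]))  # 0.35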

[Figure 1. Learning curves (AUC and CLL vs. number of training databases) for the Function predicate when transferring second-order cliques of length four from the WebKB domain; curves shown for MSL, Seed, Greedy, and Refine.]

[Figure 2. Learning curves (AUC and CLL vs. number of training databases) for the Linked predicate when transferring second-order cliques of length four from the Facebook domain; curves shown for MSL, Seed, Greedy, and Refine.]

Using Facebook in this setting did not provide a benefit. The cliques transferred here tend to capture more information about link structure and the various properties of a single entity, whereas in the Yeast domain it is important to model similarities and differences in the properties of related entities. Neither approach does well at the difficult task of predicting the Interaction predicate; MSL outperforms DTM.

Figure 2 shows representative learning curves for predicting whether two Web pages are linked when transferring cliques of up to length four from Facebook into WebKB. As in Figure 1, DTM consistently outperforms MSL. Table 2 compares the performance of DTM using greedy transfer with refinement to MSL for WebKB when transferring cliques of up to length three. Across both predicates, transfer outperforms MSL on 68% (152 out of 224) of the various different test conditions when considering cliques of up to length three. When using Facebook as the source task, transferring cliques of up to length four provides a large benefit, winning 71% (79 out of 112) of the time. When using Yeast as the source domain, transfer only posts modest gains, winning only 51% (58 out of 112) of the various test conditions. On the Linked predicate, the benefit of transfer is greater for small amounts of data. Transfer also provides a slight improvement over MSL when predicting the label of a Web page.

Due to space limitations, we omit the tables and graphs of the results for beam search.¹ When transferring from the WebKB domain into the Yeast domain, the results are almost identical to exhaustive search. For the Facebook into Yeast pair, the results are worse than exhaustive search. When transferring from the Yeast domain into the WebKB domain, the results are comparable to exhaustive search. For the Facebook into WebKB pair, the PageClass results were slightly better than exhaustive search, but worse for the Linked predicate. In general, exhaustive search seems to yield better and more stable results than beam search. One possible explanation is that despite tweaking the beam search algorithm to make it more appropriate for transfer, it still focuses too heavily on finding a good theory as opposed to finding the best clauses for transfer.

We also compared our approach to TAMAR (Mihalkova et al., 2007). In order to have a fair comparison, we extended the system to map clauses that contain constants in them. When transferring from Yeast to WebKB, TAMAR does significantly worse than all other approaches. TAMAR was unable to map theories from WebKB into the Yeast domain in the allotted time. Introducing constants into the clauses greatly increases the number of possible mappings between clauses, making it infeasible to perform an exhaustive search to map clauses between different domains. This problem is particularly acute when mapping into the Yeast domain, due to the number of function constants.

5. Conclusion

This paper proposes an approach to deep transfer, the problem of generalizing across domains. Our DTM algorithm identifies domain-independent regularities in the form of second-order Markov logic clauses, and uses them to guide discovery of new structure in the target domain. Initial experiments in bioinformatics, Web and social network domains show that DTM outperforms standard structure learning, and discovers significant regularities like homophily and transitivity.

Directions for future work include: theoretical analysis of DTM; improved structure learning algorithms for transfer; transferring formulas instead of whole cliques; experiments in richer domains; transferring from multiple domains at once; etc.

¹ Full results are available in an online appendix: [Link]

Table 1. Experimental results comparing DTM (greedy transfer with refinement) to MSL on the Yeast domain. Each entry in the table is the average relative difference in AUC or CLL between transfer and MSL when considering second-order cliques that contain up to three predicate and object variables.

                        AUC: Function         AUC: Interaction       CLL: Function         CLL: Interaction
Source  Num. Cliques    1 DB   2 DB   3 DB    1 DB   2 DB   3 DB     1 DB   2 DB   3 DB    1 DB   2 DB   3 DB
WebKB   5               0.48   0.70   0.30    -0.31  -0.33  -0.21    0.12   0.17   0.03    0.01   0.03  -0.02
FB      5               0.29   0.13  -0.22     0.00   0.56  -0.26    0.02  -0.10  -0.23    0.00   0.06  -0.14
WebKB   10              0.47   0.52   0.21    -0.18  -0.32  -0.07    0.12   0.10   0.05    0.04   0.02   0.01
FB      10              0.63   0.51   0.10    -0.35  -0.03  -0.42    0.18   0.01  -0.16    0.01   0.07  -0.01

Table 2. Experimental results comparing DTM (greedy transfer with refinement) to MSL on the WebKB domain. Each entry in the table is the average relative difference in AUC or CLL between transfer and MSL when considering second-order cliques that contain up to three predicate and object variables.

                        AUC: Page Class       AUC: Linked            CLL: Page Class       CLL: Linked
Source  Num. Cliques    1 DB   2 DB   3 DB    1 DB   2 DB   3 DB     1 DB   2 DB   3 DB    1 DB   2 DB   3 DB
Yeast   5              -0.01   0.02   0.00    42.79   9.77  11.03   -0.12   0.04   0.03    0.12   0.04   0.19
FB      5               0.00   0.01   0.01    40.79   4.64   6.15   -0.06   0.02   0.07    0.12   0.04   0.19
Yeast   10             -0.01   0.02  -0.01    42.79  10.36  10.94   -0.12   0.03   0.00    0.12   0.04   0.19
FB      10              0.00   0.01   0.01    40.79   4.64   6.15   -0.06   0.02   0.07    0.12   0.04   0.19

Acknowledgements

This research was partly funded by ARO grant W911NF-08-1-0242, DARPA contracts FA8750-05-2-0283, FA8750-07-D-0185, HR0011-06-C-0025, HR0011-07-C-0060 and NBCH-D030010, NSF grants IIS-0534881 and IIS-0803481, and ONR grant N00014-08-1-0670. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ARO, DARPA, NSF, ONR, or the United States Government.

References

Banerjee, B., Liu, Y., & Youngblood, G. (Eds.). (2006). ICML workshop on structural knowledge transfer for machine learning.

Baxter, J., Caruana, R., Mitchell, T., Pratt, L. Y., Silver, D. L., & Thrun, S. (Eds.). (1995). NIPS workshop on learning to learn: Knowledge consolidation and transfer in inductive systems.

Bridewell, W., & Todorovski, L. (2007). Learning declarative bias. Proc. ILP'07 (pp. 63–77).

Craven, M., & Slattery, S. (2001). Relational learning with statistical predicate invention: Better models for hypertext. Machine Learning, 43, 97–119.

Davis, J., Burnside, E., Dutra, I., Page, D., & Costa, V. S. (2005). An integrated approach to learning Bayesian networks of rules. Proc. ECML'05 (pp. 84–95).

De Raedt, L., & Bruynooghe, M. (1992). Interactive concept-learning and constructive induction by analogy. Machine Learning, 8, 107–150.

Della Pietra, S., Della Pietra, V., & Lafferty, J. (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 380–392.

Dzeroski, S., & Lavrac, N. (Eds.). (2001). Relational data mining. Berlin, Germany: Springer.

Falkenhainer, B., Forbus, K. D., & Gentner, D. (1989). The structure-mapping engine: Algorithm and examples. Artificial Intelligence, 41, 1–63.

Kok, S., & Domingos, P. (2005). Learning the structure of Markov logic networks. Proc. ICML'05 (pp. 441–448).

Kok, S., & Domingos, P. (2007). Statistical predicate invention. Proc. ICML'07 (pp. 433–440).

Kok, S., Sumner, M., Richardson, M., Singla, P., Poon, H., Lowd, D., Wang, J., & Domingos, P. (2009). The Alchemy system for statistical relational AI (Technical Report). Department of Computer Science and Engineering, University of Washington, Seattle, WA. [Link]

Lowd, D., & Domingos, P. (2007). Efficient weight learning for Markov logic networks. Proc. PKDD'07 (pp. 200–211).

Mewes, H. W., Frishman, D., Gruber, C., Geier, B., Haase, D., Kaps, A., Lemcke, K., Mannhaupt, G., Pfeiffer, F., Schüller, C., Stocker, S., & Weil, B. (2000). MIPS: a database for genomes and protein sequences. Nucleic Acids Research, 28, 37–40.

Mihalkova, L., Huynh, T., & Mooney, R. J. (2007). Mapping and revising Markov logic networks for transfer learning. Proc. AAAI'07 (pp. 608–614).

Silver, D., Bakir, G., Bennett, K., Caruana, R., Pontil, M., Russell, S., & Tadepalli, P. (Eds.). (2005). NIPS workshop on inductive transfer: 10 years later.

Silverstein, G., & Pazzani, M. J. (1991). Relational clichés: Constraining constructive induction during relational learning. Proc. ICML'91 (pp. 203–207).

Taskar, B., Abbeel, P., & Koller, D. (2002). Discriminative probabilistic models for relational data. Proc. UAI'02 (pp. 485–492).

Taylor, M., Fern, A., & Driessens, K. (Eds.). (2008). AAAI workshop on transfer learning for complex tasks.
