Deep Transfer Learning with Markov Logic
An approach that meets this goal must have a number of properties. It must be relational, since only relational knowledge can be transferred across domains. It must be probabilistic, to handle the uncertainty inherent in transfer in a principled way. Lastly, it must be second-order, since expressing domain-independent knowledge requires predicate variables. These requirements rule out standard inductive logic programming and first-order probabilistic approaches. To our knowledge, the only available representation that satisfies all the criteria is the second-order extension of Markov logic introduced by Kok and Domingos (2007). We use it as the basis for our approach.

Our approach to transfer learning, called DTM (Deep Transfer via Markov logic), discerns high-level similarities between logical formulas. Given a set of first-order formulas, DTM lifts the formulas into second-order logic by replacing all predicate names with predicate variables. It then groups the second-order formulas into cliques. DTM evaluates which second-order cliques represent regularities whose probability deviates significantly from independence among subcliques. Finally, it selects the top k highest-scoring second-order cliques to transfer, and instantiates the knowledge with the types of objects and variables present in the target domain. The transferred knowledge provides a declarative bias for structure learning in the target domain.

The most similar approach to ours in the literature is Mihalkova et al.'s TAMAR algorithm (2007). TAMAR learns Markov logic clauses in the source domain, and attempts to transfer each one to the target domain by replacing its predicates with target-domain predicates in all possible ways. While simple, this approach is unlikely to scale to the large, rich domains where transfer learning is most likely to be useful. Like analogical reasoning, it does not explicitly create domain-independent knowledge.

Using our approach, we have successfully transferred learned knowledge among social network, molecular biology and Web domains. In addition to improved empirical performance, our approach discovered patterns including broadly useful properties of predicates, like symmetry and transitivity, and relations among predicates, such as homophily.

2. Markov Logic

A Markov network is a model for the joint distribution of a set of variables X = (X1, X2, ..., Xn) (Della Pietra et al., 1997). It is composed of an undirected graph G and a set of potential functions φk. The graph has a node for each variable, and the model has a potential function for each clique in the graph. The joint distribution represented by a Markov network is P(X = x) = (1/Z) ∏k φk(x{k}), where x{k} is the state of the kth clique (i.e., the state of the variables that appear in that clique), and Z is a normalization constant. Markov networks are often conveniently represented as log-linear models, with each clique potential replaced by an exponentiated weighted sum of features of the state: P(X = x) = (1/Z) exp(∑j wj fj(x)). A feature fj(x) may be any real-valued function of the state.

Markov logic is a probabilistic extension of first-order logic. It combines first-order logic with Markov networks. Formally, a Markov logic network (MLN) is a set of pairs (Fi, wi), where Fi is a first-order formula and wi ∈ R. MLNs soften logic by associating a weight with each formula. Worlds that violate formulas become less likely, but not impossible. MLNs provide a template for constructing Markov networks. Given a finite set of constants, the formulas in an MLN define a Markov network with one node for each ground predicate and one feature for each ground clause. The weight of a feature is the weight of the first-order clause that originated it. An MLN induces the probability distribution P(X = x) = (1/Z) exp(∑i=1..F wi ni(x)), where F is the number of formulas in the MLN, wi is the weight of the ith formula, and ni(x) is the number of true groundings of Fi in possible world x.

Algorithms have been proposed for learning the weights associated with each formula (e.g., Lowd & Domingos, 2007), as well as for learning the formulas themselves (e.g., Kok & Domingos, 2005). We focus on the algorithm of Kok and Domingos, which we call MSL. Structure learning consists of two components: constructing clauses and evaluating clauses. Clause construction follows an inductive logic programming-style search (Dzeroski & Lavrac, 2001). When learning a network from scratch, MSL starts with an empty clause and specializes it by successively adding literals. Additionally, the algorithm can refine an existing network to correct errors in its clauses; here, it considers removing literals from a clause, adding literals to it, and flipping the sign of a literal. The algorithm uses a beam search to find the current best clause and add it to the network. The search ends when no clause improves the score of the MLN. To evaluate the merit of each clause, it uses weighted pseudo-log-likelihood (WPLL). To avoid overfitting, each clause receives a penalty term proportional to the number of literals that differ between the current clause and the initial clause.
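To make the MLN semantics concrete, the sketch below computes P(X = x) by brute-force enumeration of possible worlds. The formula, weight, and constants are a toy example invented for illustration, not a model from this paper:

```python
import itertools
import math

# Constants and the resulting ground atoms (hypothetical toy domain).
constants = ["A", "B"]
atoms = [("Smokes", (c,)) for c in constants] + \
        [("Friends", (x, y)) for x in constants for y in constants if x != y]

def n_true_groundings(world):
    """Count true groundings of the formula Friends(x,y) ^ Smokes(x) => Smokes(y)."""
    count = 0
    for x in constants:
        for y in constants:
            if x == y:
                continue
            body = ("Friends", (x, y)) in world and ("Smokes", (x,)) in world
            head = ("Smokes", (y,)) in world
            if not body or head:  # an implication holds unless body is true and head is false
                count += 1
    return count

w = 1.5  # the single formula's weight

def unnormalized(world):
    """exp(sum_i w_i * n_i(x)); here the MLN has a single formula."""
    return math.exp(w * n_true_groundings(world))

# A possible world is a truth assignment to all ground atoms.
worlds = [frozenset(a for a, t in zip(atoms, bits) if t)
          for bits in itertools.product([False, True], repeat=len(atoms))]
Z = sum(unnormalized(wld) for wld in worlds)  # normalization constant

# Probability of the world in which every ground atom is true.
p_all_true = unnormalized(frozenset(atoms)) / Z
print(round(p_all_true, 4))
```

Worlds with more violated groundings get exponentially less weight, but none becomes impossible, which is exactly how MLNs soften hard logical constraints.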
Deep Transfer via Second-Order Markov Logic
imum clause length and a maximum number of object variables. The entire set of clauses is passed to DTM for evaluation.

Beam search. Exhaustive search does not scale well, since the number of clauses it produces is exponential in the clause length, and it becomes computationally infeasible to score long clauses. Beam search is a common strategy for scaling structure learning, and is used in MSL (described in Section 2). However, transfer learning and structure learning have different objectives. In transfer learning, the goal is to derive a large, diverse set of clauses to evaluate for potential transfer to the target domain. Structure learning simply needs to induce a compact theory that accurately models the predicates in the source domain. The theories induced by MSL tend to contain very few clauses and thus are not ideal for transfer. An alternative approach is to induce a separate theory to predict each predicate in the domain. However, the resulting theory may not be very diverse, since clauses will contain only the target predicate and predicates in its Markov blanket (i.e., its neighbors in the network). A better approach is to construct models that predict sets of predicates.

Given the final set of learned models, DTM groups the clauses into second-order cliques and evaluates each clique that appears more than twice.

3.3. Second-Order Clique Evaluation

Each clique can be decomposed into pairs of subcliques, and it should capture a regularity beyond what its subcliques represent. For example, the second-order clique {r(x, y), r(z, y), s(x, z)} can be decomposed into the following three pairs of subcliques: (i) {r(x, y), r(z, y)} and {s(x, z)}, (ii) {r(x, y), s(x, z)} and {r(z, y)}, and (iii) {r(z, y), s(x, z)} and {r(x, y)}.

To score a second-order clique, each of its first-order instantiations is evaluated. To score a first-order clique, DTM checks whether its probability distribution is significantly different from the product of the probabilities of each possible pair of subcliques that it can be decomposed into.

The natural way to compare these two distributions is the K-L divergence, D(p||q) = ∑x p(x) log(p(x)/q(x)), where p is the clique's probability distribution, and q is the distribution it would have if the two subcliques were independent. We use Bayesian estimates of the probabilities, with Dirichlet priors with all αi = 1.

For each first-order instantiation of a second-order clique, DTM computes its K-L divergence versus all its decompositions. Each instantiation receives the minimum K-L score over the set of its decompositions, because any single one could explain the clique's distribution. Each second-order clique receives the average score of its top m first-order instantiations, in order to favor second-order cliques that have multiple useful instantiations.

3.4. Transfer Mechanism

The next question is how to make use of the transferred knowledge in the target domain. A key component of an inductive logic programming (ILP) (Dzeroski & Lavrac, 2001) system is the declarative bias. Due to the large search space of possible first-order clauses, devising a good declarative bias is crucial for achieving good results with an ILP system. In ILP, two primary methods exist for expressing a declarative bias, and both are often used in the same system. The first restricts the search space; common strategies include type constraints, forcing certain predicate arguments to contain bound variables, and setting a maximum clause length. The second incorporates background knowledge, which comes in the form of hand-crafted clauses that define additional predicates which can be added to a clause under construction. Effectively, background knowledge allows the learner to add multiple literals to a clause at once and overcome the myopia of greedy search. These common strategies can easily be expressed in second-order logic, and this is part of what motivates our approach. DTM can be viewed as a way to learn the declarative bias in one domain and apply it in another, as opposed to having a user hand-craft the bias for each domain.

When applying a second-order clique in a target domain, DTM decomposes the clique into a set of clauses, and transfers each clause. It uses clauses instead of conjunctions since most structure learning approaches, both in Markov logic and in ILP, construct clauses. In Markov logic, a conjunction can be converted into an equivalent clause by negating each literal in it and flipping the sign of its weight.

We investigate three different ways to reapply the knowledge in the target domain:

Transfer by Seeding the Beam. In the first approach, the second-order cliques provide a declarative bias for the standard MLN structure search. DTM selects the top k cliques that have at least one true grounding in the target domain. At the beginning of each iteration of the beam search, DTM seeds the beam with the clauses obtained from each
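The clique-scoring procedure of Section 3.3 can be sketched as follows. The paper does not fully spell out the data representation, so this sketch treats a first-order clique as a joint distribution over the truth values of its literals; the counts, the index-set encoding of decompositions, and the helper names (`smoothed_dist`, `clique_score`) are all illustrative assumptions:

```python
import math
from itertools import product

def smoothed_dist(counts, alpha=1.0):
    """Bayesian probability estimates under a Dirichlet prior (all alpha_i = 1)."""
    total = sum(counts.values()) + alpha * len(counts)
    return {state: (c + alpha) / total for state, c in counts.items()}

def kl(p, q):
    """K-L divergence D(p||q) = sum_x p(x) log(p(x)/q(x))."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def decomposition_kl(joint_counts, split):
    """Compare the clique's distribution p with q, the distribution it would
    have if the two subcliques in `split` (index sets) were independent."""
    i1, i2 = split
    p = smoothed_dist(joint_counts)
    m1, m2 = {}, {}  # marginal counts for each subclique
    for state, c in joint_counts.items():
        s1 = tuple(state[i] for i in i1)
        s2 = tuple(state[i] for i in i2)
        m1[s1] = m1.get(s1, 0) + c
        m2[s2] = m2.get(s2, 0) + c
    p1, p2 = smoothed_dist(m1), smoothed_dist(m2)
    q = {s: p1[tuple(s[i] for i in i1)] * p2[tuple(s[i] for i in i2)]
         for s in joint_counts}
    return kl(p, q)

def clique_score(joint_counts, splits):
    """Score of one first-order instantiation: the minimum K-L divergence over
    its decompositions, since any single one could explain the data."""
    return min(decomposition_kl(joint_counts, s) for s in splits)

# Hypothetical counts over the joint truth values of three literals, e.g.
# (r(x,y), r(z,y), s(x,z)): here the literals tend to be true or false together.
counts = {s: 1 for s in product([0, 1], repeat=3)}
counts[(1, 1, 1)] = 40
counts[(0, 0, 0)] = 40
# The three subclique pairs of a length-three clique, as index sets.
splits = [((0, 1), (2,)), ((0, 2), (1,)), ((1, 2), (0,))]
score = clique_score(counts, splits)
```

A clique whose literals are distributed independently scores near zero under every decomposition, so only cliques whose joint behavior genuinely deviates from the product of their subclique distributions rank highly.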
pages.

The Social Network data set consists of pages collected from the Facebook social networking Web site. The data consist of information about friendships among individuals and properties of individuals, such as hobbies, interests and network memberships. It has information about approximately 600 individuals, thirteen different properties of each, and 24,000 friendship pairs.

4.2. High-Scoring Second-Order Cliques

An important evaluation measure for transfer learning is whether it discovers and transfers relevant knowledge. In fact, DTM discovers multiple broadly useful properties. For example, among cliques of length three with up to three object variables, {r(x, y), r(z, y), s(x, z)} is ranked first in all three domains. This represents the concept of homophily, which is present when related objects (x and z) tend to have similar properties (y). The clique {r(x, y), r(y, z), r(z, x)} is ranked second in Yeast and fourth in WebKB and Facebook, and represents the concept of transitivity. Both homophily and transitivity are important concepts that appear in a variety of domains, so it is quite significant that DTM discovers and assigns a high ranking to them.

In all three domains, the top three cliques of length two are {r(x, y), r(z, y)}, {r(x, y), r(x, z)}, and {r(x, y), r(y, x)}. The first clique captures the fact that particular feature values are more common across objects in a domain than others. For example, a computer science department most likely has more student Web pages than faculty Web pages. The second clique captures the fact that pairs of values for the same feature, such as words in the WebKB domain, commonly co-occur in the same object. The final clique captures symmetry, another important general property of relations.

4.3. Methodology

The central question our experiments aimed to address was whether DTM improved performance in the target domain compared to learning from scratch with MSL. We tried four source-target pairs: Facebook to Yeast Protein, WebKB to Yeast Protein, Facebook to WebKB, and Yeast Protein to WebKB. Each target domain was divided into four disjoint subsets, which we call mega-examples. We selected a subset of the mega-examples to train the learner on and then tested it on the remaining mega-examples. We repeated this train-test cycle for all possible subsets of the mega-examples.

Within each domain, all algorithms had the same parameter settings. Each algorithm was allotted 100 hours per database to run. In each domain, we optimized the WPLL for the two predicates we were interested in predicting. For DTM we tried two settings, k = 5 and k = 10, for the number k of second-order cliques transferred to the target domain. We tried two settings for exhaustive search: the first evaluated all clauses containing at most three literals and three object variables, and the second evaluated all clauses containing up to four literals and four object variables. When running beam search in the source domain, we constructed theories for predicting pairs of predicates (a tradeoff between diversity and tractability).

In Yeast, we permitted only function constants to appear in learned clauses. We wanted to know exactly which functions (e.g., metabolism, DNA processing, etc.) each protein performed, not just that the protein had some function. In WebKB, we permitted page class to appear in clauses. Here we wanted to predict which labels (e.g., student, person, etc.) applied to each Web page, not just whether the Web page has a label. For each train-test split, we used the information gain (on the training set) to pick the fifty words most predictive of page class. For tractability, we restricted the algorithm to learn only with these constants.

To evaluate each system, we measured the test-set conditional log-likelihood (CLL) and area under the precision-recall curve (AUC) for each predicate. The advantage of the CLL is that it directly measures the quality of the probability estimates produced. The advantage of the AUC is that it is insensitive to the large number of true negatives.

4.4. Results

Figure 1 shows representative learning curves for predicting protein function when transferring cliques of up to length four from the WebKB domain into the Yeast domain. DTM consistently outperforms MSL. As expected, DTM performs better relative to MSL with less data. Table 1 compares the performance, measured by average relative difference (e.g., (AUC_DTM − AUC_MSL)/AUC_MSL), of DTM using greedy transfer with refinement to MSL when transferring cliques of up to length three. In this domain, transfer greatly improves the models' performance on both AUC and CLL for the Function predicate. For this predicate, in the length-three setting, DTM outperforms MSL on 73% (82 out of 112) of the various conditions. In the length-four setting, it wins 75% of the cases (42 out of 56) when using WebKB as the
(Figure panels: learning curves of AUC and CLL against the number of training databases, comparing MSL, Seed, Greedy, and Refine.)
source domain. Using Facebook in this setting did not provide a benefit. The cliques transferred here tend to capture more information about link structure and the various properties of a single entity, whereas in the Yeast domain it is important to model similarities and differences in the properties of related entities. Neither approach does well at the difficult task of predicting the Interaction predicate; MSL outperforms DTM.

Figure 2 shows representative learning curves for predicting whether two Web pages are linked when transferring cliques of up to length four from Facebook into WebKB. As in Figure 1, DTM consistently outperforms MSL. Table 2 compares the performance of DTM using greedy transfer with refinement to MSL for WebKB when transferring cliques of up to length three. Across both predicates, transfer outperforms MSL on 68% (152 out of 224) of the various test conditions when considering cliques of up to length three. When using Facebook as the source task, transferring cliques of up to length four provides a large benefit, winning 71% (79 out of 112) of the time. When using Yeast as the source domain, transfer posts only modest gains, winning only 51% (58 out of 112) of the various test conditions. On the Linked predicate, the benefit of transfer is greater for small amounts of data. Transfer also provides a slight improvement over MSL when predicting the label of a Web page.

Due to space limitations, we omit the tables and graphs of the results for beam search.¹ When transferring from the WebKB domain into the Yeast domain, the results are almost identical to exhaustive search. For the Facebook into Yeast pair, the results are worse than exhaustive search. When transferring from the Yeast domain into the WebKB domain, the results are comparable to exhaustive search. For the Facebook into WebKB pair, the PageClass results were slightly better than exhaustive search, but worse for the Linked predicate. In general, exhaustive search seems to yield better and more stable results than beam search. One possible explanation is that, despite tweaking the beam search algorithm to make it more appropriate for transfer, it still focuses too heavily on finding a good theory as opposed to finding the best clauses for transfer.

We also compared our approach to TAMAR (Mihalkova et al., 2007). In order to have a fair comparison, we extended the system to map clauses that contain constants. When transferring from Yeast to WebKB, TAMAR does significantly worse than all other approaches. TAMAR was unable to map theories from WebKB into the Yeast domain in the allotted time. Introducing constants into the clauses greatly increases the number of possible mappings between clauses, making it infeasible to perform an exhaustive search to map clauses between different domains. This problem is particularly acute when mapping into the Yeast domain, due to the number of function constants.

5. Conclusion

This paper proposes an approach to deep transfer, the problem of generalizing across domains. Our DTM algorithm identifies domain-independent regularities in the form of second-order Markov logic clauses, and uses them to guide discovery of new structure in the target domain. Initial experiments in bioinformatics, Web and social network domains show that DTM outperforms standard structure learning, and discovers significant regularities like homophily and transitivity.

Directions for future work include theoretical analysis of DTM, improved structure learning algorithms for transfer, transferring formulas instead of whole cliques, experiments in richer domains, and transferring from multiple domains at once.

¹ Full results are available in an online appendix: [Link]
Table 1. Experimental results comparing DTM (greedy transfer with refinement) to MSL on the Yeast domain. Each entry in the table is the average relative difference in AUC or CLL between transfer and MSL when considering second-order cliques that contain up to three predicate and object variables.

Source   Num.     AUC: Function        AUC: Interaction       CLL: Function        CLL: Interaction
         Cliques  1 DB   2 DB   3 DB   1 DB   2 DB   3 DB     1 DB   2 DB   3 DB   1 DB   2 DB   3 DB
WebKB    5        0.48   0.70   0.30   -0.31  -0.33  -0.21    0.12   0.17   0.03   0.01   0.03  -0.02
FB       5        0.29   0.13  -0.22   0.00   0.56   -0.26    0.02  -0.10  -0.23   0.00   0.06  -0.14
WebKB    10       0.47   0.52   0.21   -0.18  -0.32  -0.07    0.12   0.10   0.05   0.04   0.02   0.01
FB       10       0.63   0.51   0.10   -0.35  -0.03  -0.42    0.18   0.01  -0.16   0.01   0.07  -0.01
Table 2. Experimental results comparing DTM (greedy transfer with refinement) to MSL on the WebKB domain. Each entry in the table is the average relative difference in AUC or CLL between transfer and MSL when considering second-order cliques that contain up to three predicate and object variables.

Source   Num.     AUC: Page Class      AUC: Linked            CLL: Page Class      CLL: Linked
         Cliques  1 DB   2 DB   3 DB   1 DB   2 DB   3 DB     1 DB   2 DB   3 DB   1 DB   2 DB   3 DB
Yeast    5        -0.01  0.02   0.00   42.79  9.77   11.03    -0.12  0.04   0.03   0.12   0.04   0.19
FB       5        0.00   0.01   0.01   40.79  4.64   6.15     -0.06  0.02   0.07   0.12   0.04   0.19
Yeast    10       -0.01  0.02   -0.01  42.79  10.36  10.94    -0.12  0.03   0.00   0.12   0.04   0.19
FB       10       0.00   0.01   0.01   40.79  4.64   6.15     -0.06  0.02   0.07   0.12   0.04   0.19
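The entries in Tables 1 and 2 are average relative differences. A minimal sketch of that computation, with made-up AUC values purely for illustration:

```python
def avg_relative_difference(dtm_scores, msl_scores):
    """Average of (score_DTM - score_MSL) / score_MSL over matched runs,
    as used for the table entries above."""
    assert len(dtm_scores) == len(msl_scores)
    rels = [(d - m) / m for d, m in zip(dtm_scores, msl_scores)]
    return sum(rels) / len(rels)

# Hypothetical AUC values from two matched runs (illustration only).
print(round(avg_relative_difference([0.60, 0.75], [0.50, 0.60]), 3))  # → 0.225
```

A positive entry thus means DTM beat MSL on average for that condition, and the magnitude is the average fractional improvement.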