Journal of Machine Learning Research 7 (2006) 191–246
Submitted 3/05; Revised 9/05; Published 2/06
Learning the Structure of
Linear Latent Variable Models
Ricardo Silva∗
RBAS @ GATSBY. UCL . AC . UK
Gatsby Computational Neuroscience Unit
University College London
London, WC1N 3AR, UK
Richard Scheines
Clark Glymour
Peter Spirtes
SCHEINES @ ANDREW. CMU . EDU
CG 09@ ANDREW. CMU . EDU
PS 7 Z @ ANDREW. CMU . EDU
Center for Automated Learning and Discovery (CALD) and Department of Philosophy
Carnegie Mellon University
Pittsburgh, PA 15213, USA
Editor: David Maxwell Chickering
Abstract
We describe anytime search procedures that (1) find disjoint subsets of recorded variables for which
the members of each subset are d-separated by a single common unrecorded cause, if such exists;
(2) return information about the causal relations among the latent factors so identified. We prove
the procedure is point-wise consistent assuming (a) the causal relations can be represented by a
directed acyclic graph (DAG) satisfying the Markov Assumption and the Faithfulness Assumption;
(b) unrecorded variables are not caused by recorded variables; and (c) dependencies are linear. We
compare the procedure with standard approaches over a variety of simulated structures and sample
sizes, and illustrate its practical value with brief studies of social science data sets. Finally, we
consider generalizations for non-linear systems.
Keywords: latent variable models, causality, graphical models
1. What We Will Show
In many empirical studies that estimate causal relationships, influential variables are unrecorded, or
“latent.” When unrecorded variables are believed to influence only one recorded variable directly,
they are commonly modeled as noise. When, however, they influence two or more measured variables directly, the intent of such studies is to identify them and their influences. In many cases, for
example in sociology, social psychology, neuropsychology, epidemiology, climate research, signal
source studies, and elsewhere, the chief aim of inquiry is in fact to identify the causal relations of
(often unknown) unrecorded variables that influence multiple recorded variables. It is often assumed
on good grounds that recorded variables do not influence unrecorded variables, although in some
cases recorded variables may influence one another.
When there is uncertainty about the number of latent variables, which measured variables they
influence, or which measured variables influence other measured variables, the investigator who
aims at a causal explanation is faced with a difficult discovery problem for which currently avail∗. This work was completed while Ricardo Silva was at the School of Computer Science, Carnegie Mellon University.
c 2006 Ricardo Silva, Richard Scheines, Clark Glymour and Peter Spirtes.
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
able methods are at best heuristic. Loehlin (2004) argues that while there are several approaches
to automatically learn causal structure, none can be seem as competitors of exploratory factor analysis: the usual focus of automated search procedures for causal Bayes nets is on relations among
observed variables. Loehlin’s comment overlooks Bayes net search procedures robust to latent variables (Spirtes et al., 2000) and heuristic approaches for learning networks with hidden nodes (Elidan
et al., 2000), but the general sense of his comment is correct. For a kind of model widely used in
applied sciences − “multiple indicator models” in which multiple observed measures are assumed
to be effects of unrecorded variables and possibly of each other − machine learning has provided
no principled alternative to factor analysis, principal components, and regression analysis of proxy
scores formed from averages or weighted averages of measured variables, the techniques most commonly used to estimate the existence and influences of variables that are unrecorded. The statistical
properties of models produced by these methods are well understood, but there are no proofs, under
any general assumptions, of convergence to features of the true causal structure. The few simulation
studies of the accuracy of these methods on finite samples with diverse causal structures are not reassuring (Glymour, 1997). The use of proxy scores with regression is demonstrably not consistent,
and systematically overestimates dependencies. Better methods are needed.
Yet the common view is that solving this problem is actually impossible, as illustrated by the
closing words of a popular textbook on latent variable modeling (Bartholomew and Knott, 1999):
When we come to models for relationships between latent variables we have reached
a point where so much has to be assumed that one might justly conclude that the limits
of scientific usefulness have been reached if not exceeded.
This view results from a commitment to factor analysis as the method to identify and measure
unrecorded common causes of recorded variables. One aim of the following work is to demonstrate
that such a commitment is unjustified, and to show that the pessimistic claim that follows from it is
false.
We describe a two part method for this problem. The method (1) finds clusters of measured
variables that are d-separated by a single unrecorded common cause, if such exists; and (2) finds
features of the Markov Equivalence class of causal models for the latent variables. Assuming only
multiple indicator structure and principles standard in Bayes net search algorithms, principles assumed satisfied in many domains, especially in the social sciences, the two procedures converge,
probability 1 in the large sample limit, to correct information. The completeness of the information
obtained about latent structure depends on how thoroughly confounded the measured variables are,
but when, for each unknown latent variable, there in fact exists at least a small number of measured
variables that are influenced only by that latent variable, the method returns the complete Markov
Equivalence class of the latent structure. To complement the theoretical results, we show by simulation studies for several latent structures and for a range of sample sizes that the method identifies
the unknown structure more accurately than does factor analysis and a published greedy search algorithm. We also illustrate and compare the procedures with applications to social science cases,
where expert opinions about measurement are reasonably firm, but are less so about causal relations
among the latent variables.
The focus is on linear models of continuous variables. Although most of our results do not
make special assumptions about the choice of a probability family, for practical purposes we further
assume in the experiments that variables are multivariate Gaussian. In the very end of the paper, we
consider possible generalizations of this approach for non-linear, non-Gaussian and discrete models.
192
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
The outline of this paper is as follows:
• Section 2: Illustrative principles describes a few examples of the techniques we use to learn
causal structure in the presence of latent variables;
• Section 3: Related work is a brief exposition of other methods used in latent variable learning. We note how the causal discovery problem cannot be reliably solved by methods created
for probabilistic modeling only;
• Section 4: Notation, assumptions and definitions contains all relevant definitions and assumptions used throughout this paper for the convenience of the reader;
• Section 5: Procedures for finding pure measurement models describes the method we
use to solve the first half of the problem, discovering which latents exist and which observed
variables measure them;
• Section 6: Learning the structure of the unobserved describes the method we use to solve
the second half of the problem, discovering the Markov equivalence class that contains the
causal graph connecting the latent variables;
• Section 7: Simulation studies and Section 8: Real data applications contain empirical
results with simulated and real data;
• Section 9: Generalizations is a brief exposition of related work describing how the methods
here introduced could be used to discover partial information in certain other classes models;
• Section 10: Conclusion summarizes the contribution of this paper and suggests several avenues of research;
Proofs of theorems and implementation details are given in the Appendix.
2. Illustrative Principles
One widely cited and applied approach to learning causal graphs rely on comparing models that
entail different conditional independence constraints in the observed marginal (Spirtes et al., 2000).
When latent variables are common causes of all observed variables, as in the domains described
in the introduction, no such constraints are expected to exist. Still, when such common causes are
direct causes of just a few variables, there is much structure that can be discovered, although not
by observable independencies. One needs instead a framework that distinguishes among different
causal graphs from other forms of constraints in the marginal distribution of the observed variables.
This section introduces the type of constraints we use through a few illustrative examples.
Consider Figure 1, where X variables are recorded and L variables (in ovals) are unrecorded
and unknown to the investigator. The latent structure, the dependencies of measured variables on
individual latent variables, and the linear dependency of the measured variables on their parents
and (unrepresented) independent noises in Figure 1 imply a pattern of constraints on the covariance
matrix among the X variables. For example, X1 , X2 , X3 have zero covariances with X7 , X8 , X9 . Less
193
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
L1
X1
X2
L3
L2
X3
X4
X5
X6
X7
X8
X9
Figure 1: A latent variable model which entails several constraints on the observed covariance matrix. Latent variables are inside ovals.
obviously, for X1 , X2 , X3 and any one of X4 , X5 , X6 , three quadratic constraints (tetrad constraints) on
the covariance matrix are implied: e.g., for X4
ρ12 ρ34 = ρ14 ρ23 = ρ13 ρ24
(1)
where ρ12 is the Pearson product moment correlation between X1 , X2 , etc. (Note that any two of the
three vanishing tetrad differences above entails the third.) The same is true for X7 , X8 , X9 and any
one of X4 , X5 , X6 ; for X4 , X5 , X6 , and any one of X1 , X2 , X3 or any one of X7 , X8 , X9 . Further, for any
two of X1 , X2 , X3 or of X7 , X8 , X9 and any two of X4 , X5 , X6 , exactly one such quadratic constraint is
implied, e.g., for X1 , X2 and X4 , X5 , the single constraint
ρ14 ρ25 = ρ15 ρ24
(2)
The constraints hold as well if covariances are substituted for correlations.
Statistical tests for vanishing tetrad differences are available for a wide family of distributions
(Wishart, 1928; Bollen, 1990). Linear and non-linear models can imply other constraints on the
correlation matrix, but general, feasible computational procedures to determine arbitrary constraints
are not available (Geiger and Meek, 1999) nor are there any available statistical tests of good power
for higher order constraints. Tetrad constraints therefore provide a practical way of distinguishing
among possible candidate models, with a history of use in heuristic search dating from the early
20th century (see, e.g., references within Glymour et al., 1987). This paper describes a principled
way of using tetrad constraints in search.
In particular, we will focus on a class of “pure” latent variable models where latents can be
arbitrarily connected in a acyclic causal graph, but where observed variables have at most one latent
parent.
Given a “pure” set of measured indicators of latent variables, as in Figure 1 − informally, a
measurement model specifying, for each latent variable, a set of measured variables influenced only
by that latent variable and individual, independent noises − the causal structure among the latent
variables can be estimated by any of a variety of methods. Standard score functions of latent variable
models (such as the chi-square test) can be used to compare models with and without a specified
edge, providing indirect tests of conditional independence among latent variables. The conditional
independence facts can then be input to a constraint based Bayes net search algorithm, such as PC
or FCI (Spirtes et al., 2000), or used to guide a greedy search algorithm such as GES (Chickering,
2002).
This is not to say that we need to assume that the true underlying graph contains only pure
measures of the latent variables. In Figure 1, the measured variables neatly cluster into disjoint sets
of variables and the variables in any one set are influenced only by a single common cause and there
194
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
L2
L1
X1
X2
X7
X8
L3
X3
X4
X5
X6
X9
L4
X 10 X 11 X 12 X 13
Figure 2: A latent variable model which entails several constraints on the observed covariance matrix. These constraints can be used to discover a submodel of the model given above.
are no influences of the measured variables on one another. In many real cases the influences on the
measured variables do not separate so simply. Some of the measured variables may influence others
(as in signal leakage between channels in spectral measurements), and some or many measured
variables may be influenced by two or more latent variables. For example, the latent structure of a
linear, Gaussian system shown in Figure 2 can be recovered by the procedures we propose by finding
a subset of the given measures that are pure measures in the true graph. Our aim in what follows is to
prove and use new results about implied constraints on the covariance matrix of measured variables
to form measurement models that enable estimation of features of the Markov Equivalence class
of the latent structure in a wide range of cases. We will develop the theory first for linear models
(mostly for problems with a joint Gaussian distribution on all variables, including latent variables),
and then consider possibilities for generalization.
3. Related Work
The traditional framework for discovering latent variables is factor analysis and its variants (see,
e.g., Bartholomew et al., 2002). A number of factors is chosen based on some criterion such as
the minimum number of factors that fit the data at a given significance level or the number that
maximizes a score such as BIC. After fitting the data, usually assuming a Gaussian distribution,
different transformations (rotations) to the latent covariance matrix are applied in order to satisfy
some criteria of simplicity. The meaning of a latent variable is determined informally based on the
magnitude of the coefficients relating each observed variable to each latent. This is, by far, the most
common method used in several applied sciences (Glymour, 2002). Social science methodology
also contains various beam searches that begin with an initial latent variable model and iteratively
add or delete dependencies in a greedy search guided by significance tests of nested models. In
simulation experiments (Glymour et al., 1987; Spirtes et al., 2000) these procedures have performed
little better than chance from data generated by true models in which some measured variables are
influenced by multiple latent varibles and by other measured variables.
In non-Gaussian cases, the usual methods are variations of independent component analysis,
such as independent factor analysis (Attias, 1999) and tree-based component analysis (Bach and
Jordan, 2003). These methods severely constrain the dependency structure among the latent vari195
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
ables. That facilitates joint density estimation or blind source separation, but it is of little use in
learning causal structure.
In a similar vein, Zhang (2004) represents latent variable models for discrete variables (both
observed and latent) with a multinomial probabilistic model. The model is constrained to be a
tree and every observed variable has one and only one (latent) parent and no child. Zhang does
not provide a search method to find variables satisfying the assumption, but assumes a priori the
variables measured satisfy it.
Elidan et al. (2000) introduces latent variables as common causes of densely connected regions
of a DAG learned through Bayesian algorithms for learning Bayesian network structures. Once one
latent is introduced as the parent of a set of nodes originally strongly connected, the same search
algorithm is applied using this modified graph as the initial graph. The process can be iterated
to introduce multiple latents. Examples are given for which this procedure, called F IND H IDDEN,
increases the fit over a latent-free graphical model, but for causal modeling the algorithm is not
known to be correct in the large sample limit. In a relevant sense, the algorithm cannot be correct,
because its output yields particular models from among an indistinguishable class of models that is
not characterized.
For instance, consider Figure 3(a), a model of two latents and four observed variables. Two
typical outputs produced by F IND H IDDEN given data generated by this model are shown in Figures
3(b) and 3(c). The choice of model is affected by the strength of the connections in the true model
and the sample size. These outputs suggest correctly that there is a single latent condition on which
all but one pair of observed variables are independent, although the suggestion of some direct causal
connection among a pair of indicators is false. The main problem of F IND H IDDEN here is that each
of these two models represents a different actual latent variable1 which is not clear from the output.
Graphs given Figures 3(b) and 3(c) are also generated by F IND H IDDEN when the true model has
the graphical structure seen in Figure 3(d). In this case, one might be led to infer that there is a
latent condition on which three of the indicators are independent, which is not true.
To report all possible structures indistinguishable by the data instead of an arbitrary one is
the fundamental difference between purely probabilistically oriented applications (as the ones that
motivate the F IND H IDDEN algorithm) and causally oriented applications, as those that motivate
this paper. Algorithms such as the ones by Elidan et al. (2000) and Zhang (2004) are designed
to effectively perform density estimation, which is a very different problem, even if good density
estimators provide one possible causal model compatible with the data.
To tackle issues of sound identifiability of causal structures, we previously developed an approach to learning measurement models (Silva et al., 2003). That procedure requires that the true
underlying graph has a “pure” submodel with three measures for each latent variable, which is a
strong and generally untestable assumption. That assumption is not needed in the procedures described here, but the output is still a pure model.
One of the reasons why we focus on pure models instead of general latent variable models
should be clear from the example in Figure 3: the equivalence class of all latent variable models
that cannot be distinguished given the likelihood function might be very large. While, for instance,
a Markov equivalence class for models with no latent variables can be neatly represented by a single
graphical object known as “pattern” (Pearl, 2000; Spirtes et al., 2000), the same is not true for latent
1. Assuming T1 in this Figure is the true latent that entails the same conditional independencies. In Figure 3(b), T1
should correspond to L2 . In Figure 3(c), to L1 . In the first case, however, the causal direction of T1 into both X1 and
X2 is wrong and cannot be correctly represented without the introduction of another latent.
196
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
L1
L3
T1
T1
L2
L1
L2
L5
L4
X1
X2
X3
(a)
X4
X1
X2
X3
X4
(b)
X1
X2
X3
(c)
X4
X1
X2
X3
X4
(d)
Figure 3: All four models above are undistinguishable in multivariate Gaussian families according
to standard algorithms, but such algorithms do not report this fact.
variable models. The models in Figure 3 differ not only in the direction of the edges, but also in
the adjacencies themselves ({X1 , X2 } adjacent in one case, but not {X3 , X4 }; {X3 , X4 } adjacent in
another case, but not {X1 , X2 }) and the role of the latent variables (ambiguity about which latent
d-separates which observed variables, how they are connected, etc.). A representation of such an
equivalence class, as illustrated by this very small example, can be cumbersome and uninformative.
4. Notation, Assumptions and Definitions
Our work is in the framework of causal graphical models. Concepts used here without explicit definition, such as d-separation and I-map, can be found in standard sources (Pearl, 1988; Spirtes et al.,
2000; Pearl, 2000). We use “variable” and “vertex/node” interchangeably, and standard kinship
terminology (“parent,” “child,” “descendant,” “ancestor”) for directed graph relationships. Sets of
variables are represented in bold, individual variables and symbols for graphs in italics. The Pearson
partial correlation of X, Y controlling for Z is denoted by ρXY.Z . We assume i.i.d. data sampled from
a subset O of the variables of a joint distribution D on variables V = O ∪ L, subject to the following
assumptions:
A1 D factors according to the local Markov assumption for a DAG G with vertex set V. That
is, any variable is independent of its non-descendants in G conditional on any values of its
parents in G.
A2 No vertex in O is an ancestor of any vertex in L. We call this property the measurement
assumption;
A3 Each variable in V is a linear function of its parents plus an additive error term of positive
finite variance;
A4 The Faithfulness Assumption: for all {X,Y, Z} ⊆ V, X is independent of Y conditional on
each assignment of values to variables in Z if and only if the Markov Assumption for G
entails such conditional independencies. For models satisfying A1-A3 with Gaussian distributions, Faithfulness is equivalent to assuming that no correlations or partial correlations
vanish because of multiple pathways whose influences perfectly cancel one another.
Definition 1 (Linear latent variable model) A model satisfying A1 − A4 is a linear latent variable
model, or for brevity, where the context makes the linearity assumption clear, a latent variable
model.
197
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
A single symbol, such as G, will be used to denote both a linear latent variable model and the
corresponding latent variable graph. Linear latent variable models are ubiquitous in econometric,
psychometric, and social scientific studies (Bollen, 1989), where they are usually known as structural equation models.
Definition 2 (Measurement model) Given a linear latent variable model G, with vertex set V, the
subgraph containing all vertices in V, and all and only those edges directed into vertices in O, is
called the measurement model of G.
Definition 3 (Structural model) Given a linear latent variable model G, the subgraph containing
all and only its latent nodes and respective edges is the structural model of G.
Definition 4 (Linear entailment) We say that a DAG G linearly entails a constraint if and only
if the constraint holds in every distribution satisfying A1 - A4 for G with covariance matrix parameterized by Θ, the set of linear coefficients and error variances that defines the conditional
expectation and variance of a vertex given its parents. We will assume without loss of generality
that all variables have zero mean.
Definition 5 (Tetrad equivalence class) Given a set C of vanishing partial correlations and vanishing tetrad differences, a tetrad equivalence class T (C) is the set of all latent variable graphs
each member of which entails all and only the tetrad constraints and vanishing partial correlations
among the measured variables entailed by C.
Definition 6 (Measurement equivalence class) An equivalence class of measurement models M (C)
for C is the union of the measurement models graphs in T (C). We introduce a graphical representation of common features of all elements of M (C), analogous to the familiar notion of a pattern
representing the Markov Equivalence class of a Bayes net.
Definition 7 (Measurement pattern) A measurement pattern, denoted M P (C), is a graph representing features of the equivalence class M (C) satisfying the following:
• there are latent and observed vertices;
• the only edges allowed in an MP are directed edges from latent variables to observed variables, and undirected edges between observed vertices;
• every observed variable in a MP has at least one latent parent;
• if two observed variables X and Y in a M P (C) do not share a common latent parent, then X
and Y do not share a common latent parent in any member of M (C);
• if observed variables X and Y are not linked by an undirected edge in M P (C), then X is not
an ancestor of Y in any member of M (C).
Definition 8 (Pure measurement model) A pure measurement model is a measurement model in
which each observed variable has only one latent parent, and no observed parent. That is, it is a
tree beneath the latents.
198
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
L
X1 X2 X3 X4
L
T
X1 X2 X3 X4
(a)
(b)
L
T
X1 X2 X3 X4
(c)
Figure 4: A linear latent variable model with any of the graphical structures above entails all possible tetrad constraints in the marginal covariance matrix of X1 − X4 .
5. Procedures for Finding Pure Measurement Models
Our goal is to find pure measurement models whenever possible, and use them to estimate the structural model. To do so, we first use properties relating graphical structure and covariance constraints
to identify a measurement pattern, and then turn the measurement pattern into a pure measurement
model.
The key to solving this problem is a graphical characterization of tetrad constraints. Consider
Figure 4(a). A single latent d-separates four observed variables. When this graphical model is
linearly parameterized as
X1
X2
X3
X4
=
=
=
=
λ1 L + ε1
λ2 L + ε2
λ3 L + ε3
λ4 L + ε4
it entails all three tetrad constraints among the observed variables. That is, any choice of values for
coefficients {λ1 , λ2 , λ3 , λ4 } and error variances implies
σX1 X2 σX3 X4
= (λ1 λ2 σ2L )(λ3 λ4 σ2L ) = (λ1 λ3 σ2L )(λ2 λ4 σ2L ) = σX1 X3 σX2 X4
= (λ1 λ2 σ2L )(λ3 λ4 σ2L ) = (λ1 λ4 σ2L )(λ2 λ3 σ2L ) = σX1 X4 σX2 X3
where σ2L is the variance of latent variable L.
While this result is straightforward, the relevant result for a structure learning algorithm is the
converse, i.e., establishing equivalence classes from observable tetrad constraints. For instance,
Figure 4(b) and (c) are different structures with the same entailed tetrad constraints that should
be accounted for. The main contribution of this paper is to provide several of such identification
results, and sound algorithms for learning causal structure based on them. Such results require
elaborate proofs that are left to the Appendix. What follows are descriptions of the most significant
lemmas and theorems, and illustrative examples. This is the core section of the paper. Section 6
complements the approach by describing an algorithm for learning structural models.
199
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
5.1 Identification Rules for Finding Substructures of Latent Variable Graphs
We start with one of the most basic lemmas, used as a building block for later results. It is basically the converse of the observation above. Let G be a linear latent variable model with observed
variables O:
Lemma 9 Let {X1 , X2 , X3 , X4 } ⊂ O be such that σX1 X2 σX3 X4 = σX1 X3 σX2 X4 = σX1 X4 σX2 X3 . If ρAB 6= 0
for all {A, B} ⊂ {X1 , X2 , X3 , X4 }, then there is a node P that d-separates all elements {X1 , X2 , X3 , X4 }
in G.
It follows that, if no observed node d-separates {X1 , X2 , X3 , X4 }, then node P must be a latent
node.
In order to learn a pure measurement model, we basically need two pieces of information: i.
which sets of nodes are d-separated by a latent; ii. which sets of nodes do not share any common
hidden parent. The first piece of information can provide possible indicators (children/descendants)
of a specific latent. However, this is not enough information, since a set S of observed variables can
be d-separated by a latent L, and yet S might contain non-descendants of L (one of the nodes might
have a common ancestor with L and not be a descendant of L, for instance). This is the reason why
we need to cluster observed variables into different sets when it is possible to show they cannot
share a common hidden parent. We will show this clustering allows us to eliminate most possible
non-descendants.
There are several possible combinations of observable tetrad constraints that allow one to identify such a clustering. Consider, for instance, the following case, in which it is determined that
certain variables do not share a common latent. Suppose we have a set of six observable variables,
X1 , X2 , X3 ,Y1 ,Y2 and Y3 such that:
1. there is some latent node that d-separates all pairs in {X1 , X2 , X3 ,Y1 } (Figure 5(a));
2. there is some latent node that d-separates all pairs in {X1 ,Y1 ,Y2 ,Y3 } (Figure 5(b));
3. there is no tetrad constraint σX1 X2 σY1Y2 − σX1Y2 σX2Y1 = 0;
4. no pairs in {X1 , . . . ,Y3 } × {X1 , . . . ,Y3 } have zero correlation;
Notice that is possible to empirically verify the first two conditions by using Lemma 9. Now
suppose, for the sake of contradiction, that X1 and Y1 have a common hidden parent L. One can show
that L should d-separate all elements in {X1 , X2 , X3 ,Y1 }, and also in {X1 ,Y1 ,Y2 ,Y3 }. With some extra
work (one has to consider the possibility of nodes in {X1 , X2 ,Y1 ,Y2 } having common parents with L,
for instance), one can show that this implies that L d-separates {X1 ,Y1 } from {X2 ,Y2 }. For instance,
Figure 5(c) illustrates a case where L d-separates all of the given observed variables.
However, this contradicts the third item in the hypothesis (such a d-separation will imply the
forbidden tetrad constraint, as we show in the formal proof) and, as a consequence, no such L
should exist. Therefore, the items above correspond to an identification rule for discovering some dseparations concerning observed and hidden variables (in this case, we show that X1 is independent
of all latent parents of Y1 given some latent ancestor of X1 ). This rule only uses constraints that can
be tested from the data.
Given such identification rules, what is needed is a principled way of combining the partial
information they provide to build classes of latent variable models of interest. The following section
explains the main rules and an algorithm for building an equivalence class of measurement models.
200
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
L
X1 X2 X3 Y1 X1 Y1 Y2 Y3 X3 X2 X1 Y1 Y2 Y3
(a)
(b)
(c)
Figure 5: If sets {X1 , X2 , X3 ,Y1 } and {X1 ,Y1 ,Y2 ,Y3 } are each d-separated by some node (e.g., as in
Figures (a) and (b) above), the existence of a common parent L for X1 and Y1 implies a
common node d-separating {X1 ,Y1 } from {X2 ,Y2 }, for instance (as exemplified in Figure
(c)).
5.2 Algorithms for Finding Equivalence Classes of Latent Variable Graphs
We start with one of the most basic lemmas, used as a building block for later results. We discover a measurement pattern as an intermediate step before learning a pure measurement model.
F IND PATTERN, given in Table 1, is an algorithm to learn a measurement pattern from an oracle for
vanishing partial correlations and vanishing tetrad differences. The algorithm uses three rules, CS1,
CS2, CS3, based on Lemmas that follow, for determining graphical structure from constraints on
the correlation matrix of observed variables.
Let C be a set of linearly entailed constraints satisfied in the observed covariance matrix. The
first stage of F IND PATTERN searches for subsets of C that will guarantee that two observed variables
do not have any latent parent in common. Let G be the latent variable graph for a linear latent
variable model with a set of observed variables O. Let O′ = {X1 , X2 , X3 ,Y1 ,Y2 ,Y3 } ⊂ O such that for
all triplets {A, B,C}, {A, B} ⊂ O′ and C ∈ O, we have ρAB 6= 0, ρAB.C 6= 0. Let τIJKL represent the
tetrad constraint σIJ σKL − σIK σJL = 0 and ¬τIJKL represent the complementary constraint σIJ σKL −
σIK σJL 6= 0. The following Lemma is a formal description of the example given earlier:
Lemma 10 (CS1 Test) If constraints {τX1Y1 X2 X3 , τX1Y1 X3 X2 , τY1 X1Y2Y3 , τY1 X1Y3Y2 , ¬τX1 X2Y2Y1 } all hold,
then X1 and Y1 do not have a common parent in G.
“CS” here stands for “constraint set,” the premises of a rule that can be used to test if two nodes
do not share a common parent. Figure 6(a) illustrates one situation where X1 and Y1 can be identified to not measure a same latent. In that Figure, some variables are specified with unexplained
correlations represented as bidirected edges between the variables (such edges could be due to independent hidden common causes, for instance). This illustrates that connections between elements
of {X2 , X3 ,Y2 ,Y3 } can occur.
Other sets of observable constraints can be used to reach the same conclusion. We call them
CS2 and CS3. To see one of the limitations of CS1, consider Figure 6(b). There is no single latent
that d-separates X1 ,Y1 and two other variables, as in CS1 cases. In Figure 6(c), there are no tetrad
201
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
Algorithm F IND PATTERN
Input: a covariance matrix Σ
1. Start with a complete undirected graph G over the observed variables.
2. Remove edges for pairs that are marginally uncorrelated or uncorrelated conditioned on a
third observed variable.
3. For every pair of nodes linked by an edge in G, test if some rule CS1, CS2 or CS3 applies.
Remove an edge between every pair corresponding to a rule that applies.
4. Let H be a graph with no edges and with nodes corresponding to the observed variables.
5. For each maximal clique in G, add a new latent to H and make it a parent to all corresponding
nodes in the clique.
6. For each pair (A, B), if there is no other pair (C, D) such that σAC σBD = σAD σBC = σAB σCD ,
add an undirected edge A − B to H.
7. Return H.
Table 1: Returns a measurement pattern corresponding to the tetrad and first order vanishing partial
correlations of Σ.
constraints simultaneously involving X1 ,Y1 and other observed variables that are children of the
same latent parent of X1 . These extra rules are not as intuitive as CS1. To fully understand how
these cases still generate useful constraints, some knowledge of the graphical implications of tetrad
constraints is necessary. To avoid interrupting the flow of the paper, we describe these properties
only in the Appendix along with formal proofs of correctness. In the next paragraphs, we just
describe rules CS2 and CS3.
Let the predicate Factor(X,Y, G) be true if and only if there exist two nodes W and Z in latent variable graph G such that τW XY Z and τW XZY are both linearly entailed by G, all variables
in {W, X,Y, Z} are correlated, and there is no observed C in G such that ρAB.C = 0 for {A, B} ⊂
{W, X,Y, Z}:
Lemma 11 (CS2 Test) If constraints {τX1Y1Y2 X2 , τX2Y1Y3Y2 , τX1 X2Y2 X3 , ¬τX1 X2Y2Y1 } all hold such that
Factor(X1 , X2 , G) = true, Factor(Y1 ,Y2 , G) = true, X1 is not an ancestor of X3 and Y1 is not an
ancestor of Y3 , then X1 and Y1 do not have a common parent in G.
Lemma 12 (CS3 Test) If constraints {τX1Y1Y2Y3 , τX1Y1Y3Y2 , τX1Y2 X2 X3 , τX1Y2 X3 X2 , τX1Y3 X2 X3 ,
τX1Y3 X3 X2 , ¬τX1 X2Y2Y3 } all hold, then X1 and Y1 do not have a common parent in G.
The rules are not redundant: only one can be applied on each situation. For instance, in Figure
6(a) the latent on the left d-separates {X1 , X2 , X3 ,Y1 }, which implies {τX1Y1Y2Y3 , τX1Y1Y3Y2 }. The latent
on the right d-separates {X1 ,Y1 ,Y2 ,Y3 }, which implies {τY1 X1Y2Y3 , τY1 X1Y3Y2 }. The constraint τX1 X2Y2Y1
can be shown not to hold given the assumptions. Therefore, this rule tells us information about the
unobserved structure: X1 and Y1 do not have any common hidden parent.
202
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
Y
X
Y1 Y 2 Y
3
X1 X2 X3
Y1 Y 2 Y
3
X1 X2 X3
(a)
(b)
Y1 Y 2 Y
3
X1 X2 X3
(c)
Figure 6: Three examples with two main latents and several independent latent common causes of
two indicators (represented by bidirected edges). In (a), CS1 applies, but not CS2 nor
CS3 (even when exchanging labels of the variables); In (b), CS2 applies (assuming the
conditions for X1 , X2 and Y1 ,Y2 ), but not CS1 nor CS3. In (c), CS3 applies, but not CS1
nor CS2.
For CS2 (Figure 6(b)), nodes X and Y are depicted as auxiliary nodes that can be used to verify
predicates Factor. For instance, Factor(X1 , X2 , G) is true because all three tetrads in the covariance
matrix of {X1 , X2 , X3 , X} hold.
Sometime it is possible to guarantee that a node is not an ancestor of another, as required, e.g.,
to apply CS2:
Lemma 13 If for some set O′ = {X1 , X2 , X3 , X4 } ⊆ O, σX1 X2 σX3 X4 = σX1 X3 σX2 X4 = σX1 X4 σX2 X3 and
for all triplets {A, B,C}, {A, B} ⊂ O′ ,C ∈ O, we have ρAB.C 6= 0 and ρAB 6= 0, then A ∈ O′ is not a
descendant in G of any element of O′ \{A}.
This follows immediately from Lemma 9 and the assumption that observed variables are not
ancestors of latent variables. For instance, in Figure 6(b) the existence of the observed node X
(linked by a dashed edge to the parent of X1 ) allows the inference that X1 is not an ancestor of X3 ,
since all three tetrad constraints hold in the covariance matrix of {X, X1 , X2 , X3 }.
We know have theoretical results that provide information concerning lack of common parents
and lack of direct connections of nodes, given a set of tetrad and vanishing partial correlation C.
The algorithm F IND PATTERN from Table 1 essentially uses the given lemmas to construct a measurement pattern, as defined in Section 4.
Theorem 14 The output of F IND PATTERN is a measurement pattern M P(C) with respect to the
tetrad and zero/first order vanishing partial correlation constraints C of Σ.
The presence of an undirected edge does not mean that adjacent vertices in the pattern are
actually adjacent in the true graph. Figure 7 illustrates this: X3 and X8 share a common parent in the
true graph, but are not adjacent. Observed variables adjacent in the output pattern always share at
least one parent in the pattern, but do not always share a common parent in the true DAG. Vertices
sharing a common parent in the pattern might not share a parent in the true graph (e.g., X1 and X8 in
Figure 7).
203
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
X1 X2 X3
X4 X5 X6 X7 X8
X1 X2 X3
(a)
X4 X5 X6 X7 X8
(b)
Figure 7: In (a), a model that generates a covariance matrix Σ. In (b), the output of F IND PATTERN
given Σ. Pairs in {X1 , X2 } × {X4 , . . . , X7 } are separated by CS2.
What is not obvious in the output of F IND PATTERN is how much more information it leaves
implicit and how to extract a (pure) model out of an equivalence class. These issues are treated in
the next section.
5.3 Completeness and Purification
The F IND PATTERN algorithm is sound, but not necessarily complete. That is, there might be graphical features shared by all members of the measurement model equivalence class that are not discovered by F IND PATTERN. For instance, there might be a CS4 rule that is not known to us. F IND PATTERN might be complete, but we conjecture it is not: we did not try to construct rules using
more than 6 variables (unlike CS1, CS2, CS3), since the more variables a rule has, the more computational expensive and the less statistically reliable it is.2 Learning a pure measurement model is
a different matter. We can find a pure measurement model with the largest number of latents in the
true graph, for instance.
A pure measurement model implies a clustering of observed variables: each cluster is a set of
observed variables that share a common (latent) parent, and the set of latents defines a partition over
the observed variables. The output of F IND PATTERN cannot, however, reliably be turned into a pure
measurement pattern in the obvious way, by removing from H all nodes that have more than one
latent parent and one of every pair of adjacent nodes, as attemped by the following algorithm:
• Algorithm T RIVIAL P URIFICATION: remove all nodes that have more than one latent parent,
and for every pair of adjacent observed nodes, remove an arbitrary node of the pair.
T RIVIAL P URIFICATION is not correct. To see this, consider Figure 8(a), where with the exception of pairs in {X3 , . . . , X7 }, every pair of nodes has more than one hidden common cause. Giving
the covariance matrix of such model to F IND PATTERN will result in a pattern with one latent only
(because no pair of nodes can be separated by CS1, CS2 or CS3), and all pairs that are connected by
a double directed edge in Figure 8(a) will be connected by an undirected edge in the output pattern.
One can verify that if we remove one node from each pair connected by an undirected edge in this
pattern, the output with the maximum number of nodes will be given by the graph in Figure 8(b).
2. Under very general conditions, there are also no rules using fewer than 6 variables, as shown by Silva (2005).
204
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
X1 X2 X3
X4 X5 X6
X7 X8 X9
X3
(a)
X4 X5 X6
X7
(b)
Figure 8: In (a), a model that generates a covariance matrix Σ. The output of F IND PATTERN given
Σ contains a single latent variable that is a parent of all observed nodes, and several observed nodes that are linked by an undirected edge. In (b), the pattern with the maximum
number of nodes that can be obtained by T RIVIAL P URIFICATION. It is still not a correct
pure measurement model for any latent in the true graph, since there is no latent that
d-separates {X3 , . . . , X7 } in the true model.
The procedure B UILD P URE C LUSTERS builds a pure measurement model using as input F IND PATTERN and an oracle for constraints. Unlike T RIVIAL P URIFICATION, variables are removed
whenever appropriate tetrad constraints are not satisfied. Table 2 presents a simplified version of
the full algorithm. The complete algorithm is given only in Appendix A to avoid interrupting the
flow of the text, since it requires the explanation of extra steps that are not of much relevance in
practice. We also describe the choices made in the algorithm (Steps 2, 4 and 5) only in the implementation given in Appendix A. The particular strategy for making such choices is not relevant to
the correctness of the algorithm.
The fundamental properties of B UILD P URE C LUSTERS are clear from Table 2: it returns a model
where each latent has at least three indicators, and such indicators are known to be d-separated
by some latent. Nodes that are children of different latents in the output graph are known not to
be children of a common latent in the true graph, as defined by the initial measurement pattern.
However, it is not immediately obvious how latents in the output graph are related to latents in the
true graph.
The informal description is: there is a labeling of latents in the output graph according to the
latents in the true graph G, and in this relabeled output graph any d-separation between a measured
node and some other node will hold in G. This is illustrated by Figure 9. Given the covariance
matrix generated by the true model in Figure 9(a), B UILD P URE C LUSTERS generates the model
shown in Figure 9(b).
Since the labeling of the latents is arbitrary, we need a formal description of the fact that latents
in the output should correspond to latents in the true model up to a relabeling. The formal graphical
properties of the output of B UILD P URE C LUSTERS (as given in Appendix A) are summarized by
the following theorem:
205
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
Algorithm B UILD P URE C LUSTERS -S IMPLIFIED
Input: a covariance matrix Σ
1. G ←F IND PATTERN(Σ).
2. Choose a set of latents in G. Remove all other latents and all observed nodes that are not
children of the remaining latents and all clusters of size 1.
3. Remove all nodes that have more than one latent parent in G.
4. For all pairs of nodes linked by an undirected edge, choose one element of each pair to be
removed.
5. If for some set of nodes {A, B,C}, all children of the same latent, there is a fourth node D in
G such that σAB σCD = σAC σBD = σAD σBC is not true, remove one of these four nodes.
6. Remove all latents with less than three children, and their respective measures;
7. if G has at least four observed variables, return G. Otherwise, return an empty model.
Table 2: A general strategy to find a pure measurement measurement model of a subset of the latents
in the true graph. As explained in the body of the text, implementation details (such as the
choices made in Steps 2 and 4) are left to Appendix A.
Theorem 15 Given a covariance matrix Σ assumed to be generated from a linear latent variable
model G with observed variables O and latent variables L, let Gout be the output of B UILD P URE C LUSTERS(Σ) with observed variables Oout ⊆ O and latent variables Lout . Then Gout is a measurement pattern, and there is an unique injective mapping M : Lout → L with the following properties:
1. Let Lout ∈ Lout . Let X be a child of Lout in Gout . Then M(Lout ) d-separates X from Oout \X in
G;
2. M(Lout ) d-separates X from every latent L in G for which M −1 (L) is defined;
3. Let O′ ⊆ Oout be such that each pair in O′ is correlated. At most one element in O′ has the
following property: (i) it is not a descendant of its respective mapped latent parent in G or
(ii) it has a hidden common cause with its respective mapped latent parent in G;
For each group of correlated observed variables, we can guaranteee that at most one edge from
a latent into an observed variable is incorrectly directed. By “incorrectly directed,” we mean the
condition defined in the third item of Theorem 15: although all observed variables are children of
latents in the output graph, one of these edges might be misleading, since in the true graph one of the
observed variables might not be a descendant of the respective latent. This is illustrated by Figure
10.
Notice also that we cannot guarantee that an observed node X with latent parent Lout in Gout
will be d-separated from the latents in G not in Gout , given M(Lout ): if X has a common cause with
M(Lout ), then X will be d-connected to any ancestor of M(Lout ) in G given M(Lout ). This is also
illustrated by Figure 10.
206
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
X11
X12
L1
L2
T1
X1 X2 X3
X7
T2
X4 X5 X6
X8
X9
X10
X1 X2 X3 X7 X11
(a)
X4 X5 X6 X9
(b)
Figure 9: Given as input the covariance matrix of the observable variables X1 − X12 connected according to the true model shown in Figure (a), the B UILD P URE C LUSTERS algorithm will
generate the graph shown in Figure (b). It is clear there is an injective mapping M(.)
from latents {T1 , T2 } to latents {L1 , L2 } such that M(T1 ) = L1 and M(T2 ) = L2 and the
properties described by Theorem 15 hold.
L1
X1
L2
X2 X3
L3
L4
X4 X5 X6
X7
(a)
T1
T2
X1 X2 X3
X4 X5 X6
(b)
Figure 10: Given as input the covariance matrix of the observable variables X1 − X7 connected
according to the true model shown in Figure (a), one of the possible outputs of B UILD P URE C LUSTERS algorithm is the graph shown in Figure (b). It is clear there is an injective mapping M(.) from latents {T1 , T2 } to latents {L1 , L2 , L3 , L4 } such that M(T1 ) = L2
and M(T2 ) = L3 . However, in (b) the edge T1 → X1 does not express the correct causal
direction of the true model. Notice also that X1 is not d-separated from L4 given
M(T1 ) = L2 in the true graph.
5.4 An Example
To illustrate B UILD P URE C LUSTERS, suppose the true graph is the one given in Figure 11(a), with
two unlabeled latents and 12 observed variables. This graph is unknown to B UILD P URE C LUSTERS,
which is given only the covariance matrix of variables {X1 , X2 , ..., X12 }. The task is to learn a
measurement pattern, and then a purified measurement model.
In the first stage of B UILD P URE C LUSTERS, the F IND PATTERN algorithm, we start with a fully
connected graph among the observed variables (Figure 11(b)), and then proceed to remove edges according to rules CS1, CS2 and CS3, giving the graph shown in Figure 11(c). There are two maximal
cliques in this graph: {X1 , X2 , X3 , X7 , X8 , X11 , X12 } and {X4 , X5 , X6 , X8 , X9 , X10 , X12 }. They are distinguished in the figure by different edge representations (dashed and solid - with the edge X8 − X12
present in both cliques). The next stage takes these maximal cliques and creates an intermediate
207
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
X11
X12
X12
X10
X11
X9
X8
X1
X1 X2 X3
X7
X4 X5 X6
X8
X9
X7
X2
X10
X3
X4
(a)
X12
X6
(b)
X10
X11
X5
X9
X8
X1
X1 X2 X3
X4 X5 X6
X7 X8 X9
X10 X11 X12
X7
X2
X3
X4
X5
X6
(c)
X1 X2 X3
(d)
X4 X5 X6
X1 X2 X3 X7 X11
X10 X11 X12
X7 X8 X9
X4 X5 X6 X9
(e)
(f)
Figure 11: A step-by-step demonstration of how a covariance matrix generated by graph in Figure
(a) will induce the pure measurement model in Figure (f).
graphical representation, as depicted in Figure 11(d). In Figure 11(e), we add the undirected edges
X7 − X8 , X8 − X12 , X9 − X10 and X11 − X12 , finalizing the measurement pattern returned by F IND PATTERN . Finally, Figure 11(f) represents a possible purified output of B UILD P URE C LUSTERS given
this pattern. Another purification with as many nodes as in the graph in Figure 11(f) substitutes
node X9 for node X10 .
208
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
There is some superficial similarity between B UID P URE C LUSTERS and the F IND H IDDEN algorithm (Elidan et al., 2000) cited in Section 3. Both algorithms select cliques (or substructures
close to a clique) and introduce a latent as a common cause of the variables in that clique. The algorithms are, however, very different: B UILD P URE C LUSTERS knows that each selected clique should
correspond to a latent,3 and creates all of its latents at the same time. F IND H IDDEN creates one
latent a time, and might backtrack if this latent is not supported by the data. More fundamentally,
there is no clear description of what F IND H IDDEN actually learns (as illustrated in Section 3), and
even if asymptotically it can always find a pure measurement submodel.4
5.5 Parameterizing the Output of B UILD P URE C LUSTERS
Recall that so far we described only an algorithm for learning measurement models. Learning the
structure among latents, as described in Section 6, requires exploring constraints in the covariance
matrix of the observed variables. Since B UILD P URE C LUSTERS returns only a marginal of the true
model, it is important to show that this marginalized graph, when parameterized as a linear model,
also represents the marginal probability distribution of the observed variables.
The following result is essential to provide an algorithm that is guaranteed to find a Markov
equivalence class for the latents in M(Lout ) using the output of B UILD P URE C LUSTERS, as in Section 6. It guarantees that one can fit a linear model using the structure given by B UILD P URE C LUS TERS and have a consistent estimator of the observed covariance matrix (for the selected variables)
in families such as Gaussian distributions. This is important, since the covariance matrix of the observed variables in the model is used to guide the search for a structure among latents, as discussed
in Section 6.
Theorem 16 Let M(Lout ) ⊆ L be the set of latents in G obtained by the mapping function M().
Let ΣOout be the population covariance matrix of Oout . Let the DAG Gaug
out be Gout augmented by
aug
connecting the elements of Lout such that the structural model of Gout is an I-map of the distribution
of M(Lout ). Then there exists a linear latent variable model using Gaug
out as the graphical structure
such that the implied covariance matrix of Oout equals ΣOout .
5.6 Computational Issues and Anytime Properties
A further reason why we do not provide details of some steps of B UILD P URE C LUSTERS at this
point is because there is no unique way of implementing it, and different purifications might be of
interest. For instance, one might be interested in the pure model that has the largest possible number of latents. Another one might be interested in the model with the largest number of observed
variables. However, some of these criteria might be computationally intractable to achieve. Con3
sider for instance the following criterion, which we denote as M P : given a measurement pattern,
decide if there is some choice of observed nodes to be removed such that the resulting graph is a
pure measurement model of all latents in the pattern and each latent has at least three children. This
problem is intractable:
3
Theorem 17 Problem M P is NP-complete.
3. Some latents might be eliminated for not having enough indicators, though.
4. This, of course, bears no fundamental implication on the ability of F IND H IDDEN to generate a model that provides a
good fit to the data, but it is a crucial limitation in causal analysis.
209
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
There is no need to solve a NP-hard problem in order to have the theoretical guarantees of
interpretability of the output given by Theorem 15. For example, there is a stage in F IND PATTERN
where it appears necessary to find all maximal cliques, but, in fact, it is not. Identifying more cliques
increases the chance of having a larger output (which is good) by the end of the algorithm, but it is
not required for the algorithms correctness. Stopping at Step 5 of F IND PATTERN before completion
will not affect Theorems 15 or 16.
Another computational concern is the O(N 5 ) loops in Step 3 of F IND PATTERN, where N is
the number of observed variables.5 Again, it is not necessary to compute this loop entirely. One
can stop Step 3 at any time at the price of losing information, but not the theoretical guarantees of
B UILD P URE C LUSTERS. This anytime property is summarized by the following corollary:
Corollary 18 The output of B UILD P URE C LUSTERS retains its guarantees even when rules CS1,
CS2 and CS3 are applied an arbitrary number of times in F IND PATTERN for any arbitrary subset
of nodes and an arbitrary number of maximal cliques is found.
It is difficult to assess how an early stopping procedure might affect the completeness of the
output. In all of our experiments, we were able to enumerate all maximal cliques in a few seconds
of computation. This is not to say that one should not design better ways of ordering the clique enumeration (using prior knowledge of which variables should not be clustered together, for instance),
or using other alternatives to an anytime stop.
In case there are possibly too many maximal cliques to be enumerated in F IND PATTERN, an
alternative to early stopping is to triangulate the graph, i.e., adding edges connecting some nonadjacent pair of nodes in a chordless cycle. This is repeated until no chordless cycles remain in the
graph G constructed at the end of Step 3 of F IND PATTERN (Table 1). Different heuristics could be
use to choose the next edge to be added, e.g., by linking the pair of nodes that is most strongly correlated. The advantage is that cliques in a triangulated graph can be found in linear time. For the same
reasons that validate Corollary 18, such a triangulation will not affect the correctness of the output,
since the purification procedure will remove all nodes that need to be removed. In general, adding
undirected edges to graph G in F IND PATTERN does not compromise correctness. As a side effect,
it might increase the robustness of the algorithm, since some edges of G are likely to be erroneously
removed in small sample studies, although more elaborated ways of adding edges back would need
to be discussed in detail and are out of the scope of this paper. Such a triangulation procedure,
however, might still cause problems, since in the worst case we will obtain a fully connected (and
uninformative) graph.6
6. Learning the Structure of the Unobserved
The real motivation for finding a pure measurement model is to obtain reliable statistical access to
the relations among the latent variables. Given a pure and correct measurement model, even one
involving a fairly small subset of the original measured variables, a variety of algorithms exist for
finding a Markov equivalence class of graphs over the set of latents in the given measurement model.
5. This immediately follows from, e.g., the definition of CS1: we have to first find a foursome {X1 , X2 ,Y1 ,Y2 } where
σX1 X2 σY1Y2 − σX1Y1 σX2Y2 6= 0, which is a O(N 4 ) loop. Conditioned on this foursome, we have to find two independent
(but distinct) X3 and Y3 . This requires two (almost) independent loops of O(N) within the O(N 4 ) loop.
6. We would like to thank an anonymous reviewer for the suggestions in this paragraph.
210
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
6.1 Constraint-Based Search
Constraint-based search algorithms rely on decisions about independence and conditional independence among a set of variables to find the Markov equivalence class over these variables. Given a
pure and correct measurement model involving at least 2 measures per latent, we can test for independence and conditional independence among the latents, and thus search for equivalence classes
of structural models among the latents, by taking advantage of the following theorem (Spirtes et al.,
2000):
Theorem 19 Let G be a pure linear latent variable model. Let L1 , L2 be two latents in G, and Q a
set of latents in G. Let X1 be a measure of L1 , X2 be a measure of L2 , and XQ be a set of measures
of Q containing at least two measures per latent. Then L1 is d-separated from L2 given Q in G
if and only if the rank of the correlation matrix of {X1 , X2 } ∪ XQ is less than or equal to |Q| with
probability 1 with respect to the Lebesgue measure over the linear coefficients and error variances
of G.
We can then use this constraint to test7 for conditional independencies among the latents. Such
conditional independence tests can then be used as an oracle for constraint-satisfaction techniques
for causality discovery in graphical models, such as the PC algorithm (Spirtes et al., 2000) or the
FCI algorithm (Spirtes et al., 2000).
We define the algorithm PC-MIMB UILD8 as the algorithm that takes as input a measurement
model satisfying the assumption of purity mentioned above and a covariance matrix, and returns
the Markov equivalence class of the structural model among the latents in the measurement model
according to the PC algorithm. A FCI-MIMB UILD algorithm is defined analogously. In the limit
of infinite data, it follows from the preceding and from the consistency of PC and FCI algorithms
(Spirtes et al., 2000) that
Theorem 20 Given a covariance matrix Σ assumed to be generated from a linear latent variable
model G, and Gout the output of B UILD P URE C LUSTERS given Σ, the output of PC-MIMB UILD or
FCI-MIMB UILD given (Σ, Gout ) returns the correct Markov equivalence class of the latents in G
corresponding to latents in Gout according to the mapping implicit in B UILD P URE C LUSTERS.
For most common families of probabilities distributions (e.g., multivariate Gaussians) the sample covariance matrix is a consistent estimator of the population covariance matrix. This fact, combined with Theorem 20, shows we have a point-wise consistent algorithm for learning a latent
variable model with a pure measurement model, up to the measurement equivalence class described
in Theorem 15 and the Markov equivalence class of the structural model.
6.2 Score-Based Search
Score-based approaches for learning the structure of Bayesian networks, such as GES (Meek, 1997;
Chickering, 2002) are usually more accurate than PC or FCI when there are no omitted common
causes, or in other terms, when the set of recorded variables is causally sufficient. We know of
7. One way to test if the rank of a covariance matrix in Gaussian models is at most q is to fit a factor analysis model
with q latents and assess its significance.
8. MIM stands for “multiple indicator model”, a term in structural equation model literature describing latent variable
models with multiple measures per latent.
211
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
no consistent scoring function for linear latent variable models that can be easily computed. This
might not be a practical issue, since any structural model with a fixed measurement model generated
by B UILD P URE C LUSTERS has an unique maximum likelihood estimator, up to the scale and sign
of the latents. That is, the set of maximum likelihood estimators is a single point, instead of a
complicated surface. This sidesteps most of the problems concerning finding the proper complexity
penalization for a candidate model (Spirtes et al., 2000).
We suggest using the Bayesian Information Criterion (BIC) function as a score function. Using
BIC with S TRUCTURAL EM (Friedman, 1998) and GES results in a computationally efficient way
of learning structural models, where the measurement model is fixed and GES is restricted to modify
edges among latents only. Assuming a Gaussian distribution, the first step of our S TRUCTURAL EM
implementation uses a fully connected structural model in order to estimate the first expected latent
covariance matrix. That is followed by a GES search. We call this algorithm GES-MIMB UILD
and use it as the structural model search component in all of the studies of simulated and empirical
data that follow.
7. Simulation Studies
In the following simulation studies, we draw samples of three different sizes from 9 different latent
variable models. We compare our algorithm against exploratory factor analysis and the DAG hillclimbing algorithm F IND H IDDEN (Elidan et al., 2000), and measure the success of each on the
following discovery tasks:
DP1. Discover the number of latents in G.
DP2. Discover which observed variables measure each latent G.
DP3. Discover as many features as possible about the causal relationships among the latents in G.
Since factor analysis addresses only tasks DP1 and DP2, we compare it directly to B UILD P URE C LUSTERS on DP1 and DP2. For DP3, we use our procedure and factor analysis to compute
measurement models, then discover as much about the features of the structural model among the
latents as possible by applying GES-MIMB UILD to the measurement models output by BPC and
factor analysis.
We hypothesized that three features of the problem would affect the performance of the algorithms compared: sample size; the complexity of the structural model; and, the complexity and
level of impurity in the generating measurement model. We use three different sample sizes for
each study: 200, 1,000, and 10,000. We constructed nine generating latent variable graphs by using
all combinations of the three structural models and three measurement models in Figure 12. For
structural model SM3, the respective measurement models are augmented accordingly.
MM1 is a pure measurement model with three indicators per latent. MM2 has five indicators
per latent, one of which is impure because its error is correlated with another indicator, and another
because it measures two latents directly. MM3 involves six indicators per latent, half of which are
impure.
SM1 entails one unconditional independence among the latents: L1 is independent L3 . SM2 entails one first order conditional independence: L1 ⊥L3 |L2 , and SM3 entails one first order conditional
independence: L2 ⊥L3 |L1 , and one second order conditional independence relation: L1 ⊥L4 |{L2 , L3 }.
212
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
SM1
SM2
SM3
X16
X1 X2 X3
X4 X5 X6
MM1
X7 X8 X9
X17
X18
X1 X2 X3
X4 X5 X6
X7 X8 X9
X1 X2 X3
X4 X5 X6
X7 X8 X9
X10 X11
X12 X13
X14 X15
X10 X11
X12 X13
X14 X15
MM2
MM3
Figure 12: The Structural and Measurement models used in our simulation studies.
Thus the statistical complexity of the structural models increases from SM1 to SM3 and the impurity
of measurement models increases from MM1 to MM3.
For each generating latent variable graph, we used the Tetrad IV program9 with the following
procedure to draw 10 multivariate normal samples of size 200, 10 at size 1,000, and 10 at size
10,000.
1. Pick coefficients for each edge in the model randomly from the interval [−1.5, −0.5]∪[0.5, 1.5].
2. Pick variances for the exogenous nodes (i.e., latents without parents and error nodes) from
the interval [1, 3].
3. Draw one pseudo-random sample of size N.
This choice of parameter values for simulations implies that, on average, half of the variance
of the indicators of an exogenous latent is due to the error term, making the problem of structure
learning more particularly difficult for at least some clusters.
We used three algorithms in our studies:
1. BPC: B UILD P URE C LUSTERS + GES-MIMB UILD
2. FA: Factor Analysis + GES-MIMB UILD
3. FH: F IND H IDDEN, using the same sort of hill-climbing procedure used by Elidan et al. (2000)
BPC is the implementation of B UILD P URE C LUSTERS and GES-MIMB UILD described in Appendix A. FA involves combining standard factor analysis to find the measurement model with
GES-MIMB UILD to find the structural model. For standard factor analysis, we used factanal
9. Available at http://www.phil.cmu.edu/projects/tetrad.
213
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
from R 1.9 with the oblique rotation promax. FA and variations are still widely used and are perhaps the most popular approach to latent variable modeling (Bartholomew et al., 2002). We choose
the number of latents by iteratively increasing its number until we get a significant fit above 0.05,
or until we have to stop due to numerical instabilities.
Our implementation of F IND H IDDEN follows closely the implementation suggested by Elidan
et al. (2000): in that implementation, a candidate latent is introduced as a common parent of the
nodes in a dense subgraph of the current graph (such a subgraph is called semiclique by Elidan
et al.). We implemented the most computational expensive version of F IND H IDDEN, where all
semicliques are used to create new candidate graphs, and a full hill-climbing procedure with tabu
search is performed to optimize each of them. The score function is BIC. The initial graph is a fully
connected DAG among observed variables.10
We also added to F IND H IDDEN the prior knowledge that all edges should be directed from
latents into observed variables, and we split the search into two main stages: first, only edges into
observed variables are modified, while keeping a fully connected structural model. After finding the
measurement model, we proceed to learn the structural model using the same type of hill-climbing
procedure suggested by Elidan et al. Without these two modifications, F IND H IDDEN results are
significantly worse.11
In order to compare the output of BPC, FA, and FH on discovery tasks DP1 (finding the correct
number of underlying latents) and DP2 (measuring these latents appropriately), we must map the
latents discovered by each algorithm to the latents in the generating model. That is, we must define
a mapping of the latents in the Gout to those in the true graph G.
We do the mapping by first fitting each model by maximum likelihood to obtain estimates for the
parameters. For each latent in the output model, we sum the absolute values of the edge coefficients
of their observed children, grouping the sum according to their true latent parents. The group with
the highest sum will define the label of the output latent. That is, for each latent Lout in the output
model, the following procedure is performed:
• for all latents L1 , . . . , Lk in the true model, let Si = 0, 1 ≤ i ≤ k
• for every child O that measures Lout in the output model with edge coefficient λLO , such that
O has a single parent Li in the true model, increase Si by |λLO |
• let M be such that SM is maximum among S1 , . . . , Sk . Label Lout as LM .
For example, let Lout be a latent node in the output graph Gout . Suppose S1 is the sum of the
absolute values of the edge coefficients of the children of Lout that measure the true latent L1 , and
S2 is the respective sum for the measures of true latent L2 . If S2 > S1 , we rename Lout as L2 . If two
output latents are mapped to the same true latent, we label only one of them as the true latent by
10. Which is the true graph among observed variables in most simulations. We chose the initialization point to save
computational costs of growing an almost fully connected DAG without hidden variables first.
11. Another important modification in our implementation was in the S TRUCTURAL EM implementation: to escape
out of bad local minima within S TRUCTURAL EM, we do the following whenever the algorithm arrives in a local
minimum: we apply the same search operators, but using the true BIC score evaluation instead of the S TRUCTURAL
EM-BIC score, which is a lower bound on the regular BIC score. This was also crucial to get better results with F IND H IDDEN, but considerably slowed down the algorithm, since computing the true score is computationally expensive
and requires an evaluation of the whole model.
214
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
choosing the one that corresponds to the highest sum of absolute loadings. The other one remains
unmapped and receives an arbitrary label.
We compute the following scores for the output model Gout from each algorithm,12 where the
true graph is labelled G:
• latent omission, the number of latents in G that do not appear in Gout divided by the total
number of true latents in G;
• latent commission, the number of latents in Gout that could not be mapped to a latent in G
divided by the total number of true latents in G;
• mismeasurement, the number of observed variables in Gout that are measuring at least one
wrong latent divided by the number of observed variables in G;
To be generous to factor analysis, we considered only latents with at least three indicators. Even
with this help, we still found several cases in which latent commission errors were more than 100%.
We eliminated from F IND H IDDEN any latent that ended up with no observed children.
Table 3 evaluates all three procedures on the first two discovery tasks: DP1 and DP2. Each
number is the average error across 10 trials with standard deviations in parentheses for sample sizes
of 200, 1000, 10,000. Over all conditions, FA has very low rates of latent omission, but very
high rates of latent commission. In particular, FA is very sensitive to the purity of the generating
measurement model. With MM2, the rate of latent commission for FA was moderate; with MM3
it was abysmal. Because indicators are given too many latent parents in FA, many indicators are
removed during purification, resulting in high indicator omission errors.
BPC does reasonably well on all measures in Tables 3 at all sample sizes and for all generating
models. Our implementation of F IND H IDDEN also does well in most cases, but has issues with
SM1.13
In the final piece of the simulation study, we applied the best causal model search algorithm we
know of, GES, modified for this purpose as GES-MIM BUILD, to the measurement models output
by BPC and FA. We evaluate FH both by 1. using its default structural model, which is obtained
by a standard hill-climbing with tabu search, and by 2. fixing its measurement model and applying
GES to re-learn the corresponding structural model.
If the output measurement model has no errors of latent omission or commission, then scoring
the result of the structural model search is fairly easy. The GES-MIM BUILD search outputs an
equivalence class, with certain adjacencies unoriented and certain adjacencies oriented. If there is
an adjacency of any sort between two latents in the output, but no such adjacency in the true graph,
then we have an error of edge commission. If there is no adjacency of any sort between two latents
in the output, but there is an edge in the true graph, then we have an error of edge omission. For
orientation, if there is an oriented edge in the output that is not oriented in the equivalence class for
12. Other types of errors, such as missing indicators that could have been preserved (in BPC) or adding edges among
indicators when they should not exist (as in F IND H IDDEN) are not directly comparable and not as important with
respect to the task of finding latents and causal relations among latents, and therefore not considered in this simulation
study.
13. One possible explanation for the difficulties with SM1 is the fact that, in the intermediate stages of the algorithm,
there will be paths connecting {X1 , X2 , X3 } and {X7 , X8 , X9 } due to latent variables, but such paths that have to amount
to zero correlation in order to reproduce the marginal covariance matrix. This might be difficult to obtain with single
edge modifications, considering that introducing an edge might cancel some correlations but increase others.
215
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
Evaluation of output measurement models
Latent omission
Latent commission
BPC
FA
FH
BPC
FA
FH
Sample
SM1 + MM1
200 0.10(.2)
1000 0.17(.2)
10000 0.07(.1)
SM1 + MM2
200 0.00(.0)
1000 0.00(.0)
10000 0.00(.0)
SM1 + MM3
200 0.00(.0)
1000 0.00(.0)
10000 0.03(.1)
SM2 + MM1
200 0.10(.2)
1000 0.03(.1)
10000 0.00(.0)
SM2 + MM2
200 0.03(.1)
1000 0.00(.0)
10000 0.00(.0)
SM2 + MM3
200 0.00(.0)
1000 0.00(.0)
10000 0.00(.0)
SM3 + MM1
200 0.12(.2)
1000 0.10(.2)
10000 0.05(.1)
SM3 + MM2
200 0.02(.1)
1000 0.02(.1)
10000 0.00(.0)
SM3 + MM3
200 0.02(.1)
1000 0.02(.1)
10000 0.00(.0)
Mismeasurements
BPC
FA
FH
0.00(.0)
0.00(.0)
0.00(.0)
0.50(.3)
0.17(.3)
0.23(.2)
0.00(.0)
0.00(.0)
0.00(.0)
0.00(.0)
0.00(.0)
0.00(.0)
0.00(.0)
0.00(.0)
0.00(.0)
0.01(.0)
0.00(.0)
0.00(.0)
0.41(.3)
0.19(.2)
0.14(.2)
0.52(.3)
0.18(.3)
0.23(.2)
0.03(.1)
0.00(.0)
0.00(.0)
0.27(.3)
0.17(.2)
0.27(.3)
0.03(.1)
0.00(.0)
0.03(.1)
0.77(.2)
0.47(.2)
0.33(.3)
0.00(.0)
0.07(.1)
0.07(.1)
0.01(.0)
0.00(.0)
0.02(.1)
0.92(.1)
0.59(.1)
0.55(.2)
0.47(.3)
0.27(.3)
0.33(.3)
0.00(.0)
0.00(.0)
0.00(.0)
0.10(.2)
0.07(.1)
0.23(.3)
0.07(.1)
0.07(.1)
0.00(.0)
1.13(.3)
0.87(.3)
0.70(.3)
0.07(.1)
0.00(.0)
0.03(.1)
0.03(.1)
0.03(.1)
0.00(.0)
0.90(.1)
0.72(.1)
0.60(.2)
0.36(.3)
0.15(.2)
0.30(.3)
0.00(.0)
0.00(.0)
0.00(.0)
0.27(.3)
0.17(.3)
0.00(.0)
0.00(.0)
0.00(.0)
0.00(.0)
0.00(.0)
0.00(.0)
0.00(.0)
0.00(.0)
0.00(.0)
0.00(.0)
0.06(.1)
0.02(.1)
0.00(.0)
0.43(.2)
0.23(.2)
0.11(.1)
0.28(.3)
0.19(.3)
0.00(.0)
0.00(.0)
0.00(.0)
0.00(.0)
0.17(.2)
0.03(.1)
0.03(.1)
0.07(.1)
0.00(.0)
0.00(.0)
0.80(.3)
0.53(.3)
0.27(.3)
0.00(.0)
0.07(.1)
0.03(.1)
0.06(.1)
0.00(.0)
0.00(.0)
0.85(.1)
0.68(.1)
0.53(.2)
0.32(.2)
0.24(.2)
0.08(.1)
0.03(.1)
0.00(.0)
0.00(.0)
0.03(.1)
0.07(.1)
0.00(.0)
0.00(.0)
0.00(.0)
0.00(.0)
1.13(.3)
0.73(.3)
0.97(.3)
0.07(.1)
0.07(.1)
0.03(.1)
0.01(.0)
0.00(.0)
0.00(.0)
0.91(.1)
0.71(.2)
0.78(.2)
0.29(.2)
0.15(.1)
0.03(.1)
0.02(.1)
0.02(.1)
0.00(.0)
0.40(.2)
0.02(.1)
0.05(.1)
0.00(.0)
0.00(.0)
0.00(.0)
0.05(.1)
0.02(.1)
0.00(.0)
0.00(.0)
0.00(.0)
0.00(.0)
0.05(.1)
0.01(.0)
0.00(.0)
0.66(.2)
0.30(.2)
0.21(.1)
0.43(.2)
0.03(.1)
0.07(.1)
0.05(.2)
0.02(.1)
0.05(.1)
0.10(.1)
0.02(.1)
0.05(.2)
0.10(.2)
0.02(.1)
0.00(.0)
0.62(.1)
0.38(.2)
0.35(.2)
0.02(.1)
0.05(.1)
0.02(.1)
0.03(.1)
0.01(.0)
0.00(.0)
0.89(.1)
0.68(.2)
0.72(.2)
0.31(.2)
0.15(.1)
0.15(.2)
0.02(.1)
0.08(.2)
0.08(.1)
0.02(.1)
0.00(.0)
0.00(.0)
0.05(.1)
0.00(.0)
0.00(.0)
0.98(.3)
0.72(.3)
0.60(.3)
0.02(.1)
0.08(.1)
0.05(.2)
0.04(.1)
0.00(.0)
0.00(.0)
0.91(.1)
0.77(.1)
0.70(.2)
0.24(.2)
0.08(.1)
0.04(.0)
Table 3: Results obtained with B UILD P URE C LUSTERS (BPC), factor analysis (FA) and FindHidden (FH) for the problem of learning measurement models. Each number is an average
over 10 trials, with standard deviation in parenthesis.
the true structural model, then we have an error of orientation commission. If there is an unoriented
edge in the output which is oriented in the equivalence class for the true model, we have an error of
orientation omission.
216
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
Sample
BPC
SM1 + MM1
200 0.05 − 09
1000 0.05 − 09
10000 0.00 − 10
SM1 + MM2
200 0.00 − 10
1000 0.00 − 10
10000 0.00 − 10
SM1 + MM3
200 0.00 − 10
1000 0.00 − 10
10000 0.00 − 10
SM2 + MM1
200 0.00 − 10
1000 0.00 − 10
10000 0.00 − 10
SM2 + MM2
200 0.00 − 10
1000 0.00 − 10
10000 0.00 − 10
SM2 + MM3
200 0.00 − 10
1000 0.00 − 10
10000 0.00 − 10
SM3 + MM1
200 0.12 − 05
1000 0.05 − 08
10000 0.05 − 08
SM3 + MM2
200 0.02 − 09
1000 0.00 − 10
10000 0.00 − 10
SM3 + MM3
200 0.02 − 09
1000 0.08 − 07
10000 0.00 − 10
Evaluation of output structural models
Edge omission
Edge commission
FA
FH
FHG
BPC
FA
FH
FHG
0.05 − 09
0.10 − 08
0.05 − 09
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.10 − 09
0.20 − 08
0.00 − 10
0.30 − 07
0.30 − 07
0.00 − 10
0.00 − 10
0.60 − 04
0.30 − 07
0.10 − 09
0.10 − 09
0.00 − 10
0.15 − 07
0.00 − 10
0.05 − 09
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.10 − 09
0.20 − 08
0.40 − 06
0.40 − 06
0.50 − 05
0.40 − 06
0.40 − 06
0.50 − 05
0.10 − 09
0.00 − 10
0.10 − 09
0.25 − 05
0.15 − 07
0.05 − 09
0.00 − 10
0.00 − 10
0.05 − 09
0.05 − 09
0.00 − 10
0.00 − 10
0.20 − 08
0.10 − 09
0.00 − 10
0.70 − 03
0.70 − 03
0.40 − 06
0.50 − 05
0.60 − 04
0.50 − 05
0.30 − 07
0.10 − 09
0.10 − 09
0.00 − 10
0.05 − 09
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.20 − 08
0.00 − 10
0.20 − 08
0.30 − 07
0.30 − 07
0.30 − 07
0.00 − 10
0.00 − 10
0.00 − 10
0.10 − 09
0.00 − 10
0.20 − 08
0.15 − 07
0.10 − 09
0.05 − 09
0.00 − 10
0.05 − 09
0.05 − 09
0.00 − 10
0.05 − 09
0.00 − 10
0.40 − 06
0.10 − 09
0.10 − 09
0.30 − 07
0.60 − 04
0.70 − 03
0.00 − 10
0.10 − 09
0.10 − 09
0.00 − 10
0.20 − 08
0.20 − 08
0.15 − 07
0.15 − 07
0.10 − 08
0.00 − 10
0.00 − 10
0.00 − 10
0.05 − 09
0.00 − 10
0.00 − 10
0.20 − 08
0.20 − 08
0.00 − 10
0.70 − 03
0.40 − 06
0.50 − 05
0.10 − 09
0.00 − 10
0.00 − 10
0.20 − 08
0.30 − 07
0.00 − 10
0.12 − 06
0.08 − 08
0.15 − 04
0.05 − 08
0.10 − 06
0.05 − 08
0.00 − 10
0.00 − 10
0.02 − 09
0.20 − 06
0.15 − 08
0.15 − 08
0.20 − 06
0.10 − 08
0.15 − 08
0.00 − 10
0.55 − 03
0.50 − 03
0.00 − 10
0.20 − 07
0.15 − 08
0.28 − 03
0.12 − 07
0.00 − 10
0.15 − 06
0.08 − 07
0.02 − 09
0.02 − 09
0.00 − 10
0.02 − 09
0.55 − 03
0.25 − 07
0.10 − 08
0.55 − 02
0.75 − 02
0.80 − 02
0.20 − 06
0.60 − 02
0.65 − 01
0.10 − 08
0.15 − 08
0.20 − 07
0.32 − 02
0.02 − 09
0.05 − 08
0.20 − 03
0.10 − 07
0.02 − 09
0.10 − 06
0.05 − 08
0.00 − 10
0.40 − 05
0.30 − 06
0.15 − 07
0.50 − 02
0.65 − 02
0.65 − 03
0.45 − 03
0.45 − 04
0.70 − 01
0.20 − 07
0.25 − 06
0.10 − 08
Table 4: Results obtained with the application of GES-MIMB UILD to the output of B UILD P URE C LUSTERS and factor analysis, plus F IND H IDDEN and F IND H IDDEN + GESMIMB UILD results, with an indication of the number of perfect solutions over these trials.
If the output measurement model has any errors of latent commission, then we simply leave out
the excess latents in the measurement model given to GES-MIM BUILD. This helps FA primarily,
as it was the only procedure of the three that had high errors of latent commission.
217
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
Sample
BPC
SM1 + MM1
200 0.10 − 09
1000 0.20 − 08
10000 0.00 − 10
SM1 + MM2
200 0.00 − 10
1000 0.10 − 09
10000 0.20 − 08
SM1 + MM3
200 0.20 − 08
1000 0.10 − 09
10000 0.00 − 10
SM2 + MM1
200
−−−
1000
−−−
10000
−−−
SM2 + MM2
200
−−−
1000
−−−
10000
−−−
SM2 + MM3
200
−−−
1000
−−−
10000
−−−
SM3 + MM1
200 0.15 − 08
1000 0.10 − 09
10000 0.05 − 09
SM3 + MM2
200 0.50 − 05
1000 0.30 − 07
10000 0.20 − 08
SM3 + MM3
200 0.50 − 04
1000 0.20 − 07
10000 0.00 − 10
Evaluation of output structural models
Orientation omission
Orientation commission
FA
FH
FHG
BPC
FA
FH
FHG
0.15 − 08
0.00 − 10
0.00 − 10
0.10 − 09
0.60 − 04
0.30 − 07
0.10 − 09
0.10 − 09
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.05 − 09
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.20 − 07
0.20 − 07
0.25 − 05
0.40 − 06
0.40 − 06
0.50 − 05
0.10 − 09
0.00 − 10
0.10 − 09
0.00 − 10
0.00 − 10
0.00 − 10
0.05 − 09
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.40 − 04
0.10 − 09
0.30 − 06
0.60 − 04
0.70 − 03
0.50 − 05
0.20 − 08
0.10 − 09
0.10 − 09
0.00 − 10
0.00 − 10
0.00 − 10
0.05 − 09
0.10 − 08
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.05 − 09
0.00 − 10
0.00 − 10
−−−
−−−
−−−
−−−
−−−
−−−
−−−
−−−
−−−
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
−−−
−−−
−−−
−−−
−−−
−−−
−−−
−−−
−−−
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.10 − 09
0.10 − 09
0.00 − 10
0.00 − 10
0.05 − 09
0.00 − 10
0.00 − 10
0.00 − 10
−−−
−−−
−−−
−−−
−−−
−−−
−−−
−−−
−−−
0.00 − 10
0.00 − 10
0.00 − 10
0.10 − 08
0.05 − 09
0.05 − 09
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.00 − 10
0.65 − 03
0.65 − 03
0.00 − 10
0.10 − 09
0.05 − 09
0.22 − 07
0.10 − 09
0.04 − 09
0.35 − 06
0.00 − 10
0.00 − 10
0.10 − 09
0.04 − 09
0.04 − 09
0.00 − 10
0.00 − 10
0.04 − 09
0.30 − 06
0.45 − 04
0.40 − 06
0.20 − 07
0.65 − 03
0.85 − 01
0.10 − 09
0.30 − 07
0.25 − 07
0.08 − 09
0.00 − 10
0.00 − 10
0.16 − 07
0.05 − 09
0.00 − 10
0.08 − 09
0.11 − 08
0.00 − 10
0.08 − 09
0.05 − 09
0.00 − 10
0.15 − 08
0.35 − 05
0.35 − 05
0.85 − 01
0.50 − 04
0.85 − 01
0.35 − 05
0.05 − 09
0.10 − 09
0.19 − 06
0.15 − 07
0.00 − 10
0.14 − 08
0.02 − 09
0.00 − 10
0.20 − 07
0.04 − 09
0.04 − 09
0.48 − 02
0.11 − 08
0.00 − 10
Table 5: Results obtained with the application of GES-MIMB UILD to the output of B UILD P URE C LUSTERS and factor analysis, plus F IND H IDDEN and F IND H IDDEN + GESMIMB UILD results, with an indication of the number of perfect solutions over these trials.
If the output measurement model has errors of latent omission, then we compare the marginal
involving the latents in the output model for the true structural model graph to the output structural
model equivalence class. For each of the structural models we selected, SM1, SM2, and SM3,
all marginals can be represented faithfully as DAGs. Our measure of successful causal discovery,
218
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
therefore, for a measurement model involving a small subset of the latents in the true graph is very
lenient. For example, if the generating model was SM3, which involves four latents, but the output
measurement model involved only two of these latents, then a perfect search result in this case
would amount to finding that the two latents are associated.
In summary then, our measures for assessing the ability of these algorithms to correctly discover
at least features of the causal relationships among the latents are as follows:
• edge omission (EO), the number of edges in the structural model of G that do not appear in
Gout divided by the possible number of edge omissions (2 in SM1 and SM2 , and 4 in SM3 , i.e.,
the number of edges in the respective structural models);
• edge commission (EC), the number of edges in the structural model of Gout that do not exist
in G divided by the possible number of edge commissions (only 1 in SM1 and SM2 , and 2 in
SM3 );
• orientation omission (OO), the number of arrows in the structural model of G that do not
appear in Gout divided by the possible number of orientation omissions in G (2 in SM1 and
SM3 , 0 in SM2 );
• orientation commission (OC), the number of arrows in the structural model of Gout that do
not exist in G divided by the number of edges in the structural model of Gout ;
Tables 4 and 5 summarize the results. Along with each average we provide the number of trials
where no errors of a specific type were made.
Factor analysis is particularly flawed. This is because FA infers so many latents, which leads
to spurious dependence paths among the latents we scored. The default F IND H IDDEN is also suboptimal in these small models, due to limitations in the hill-climbing procedure compared to GES:
SM3 has a high proportion of “compelled” edges (Chickering, 2002), i.e., edges that are oriented in
the pattern corresponding to the Markov equivalence class, which makes it harder for an algorithm
that searches among DAGs instead of equivalence classes. Therefore, we included in Tables 4
and 5 a variation of F IND H IDDEN, labeled FHG, where we fix the measurement model given by
F IND H IDDEN and learn the structural model using GES. Results are not significantly different from
BPC + GES, except at sample size of 200, where F IND H IDDEN has a tendency to miss latents,
inflating its performance in the structural model evaluation (since with fewer latents there is less
chance of committing mistakes).
Figure 13 provides a summary evaluation of all algorithms, BPC, FA and FHG with respect to
the number of perfect structural models obtained for each graphical structure (from 0 to 10). This
includes not only getting the exact number of latents, but also the correct Markov equivalence class
defined in the true model. Factor analysis is competitive when the true model is pure, but is completely ineffective otherwise. For models based on structural model SM3, FA does not get any fully
correct structure when the measurement model is impure. Moreover, it is clear that while learning
the measurement model can be reasonably performed by B UILD P URE C LUSTERS and F IND H ID DEN with sample sizes of 200, learning the structural model is not an easy task unless more data is
available.
In summary, factor analysis provides little useful information out of the given datasets that were
not generated by pure models. In contrast, the combination of B UILD P URE C LUSTERS and GES219
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
SM1 + MM1
SM1 + MM2
SM1 + MM3
10000
10000
10000
1000
1000
1000
200
200
200
0
2
4
6
8
10
0
2
4
6
8
10
12
0
10000
10000
1000
1000
1000
200
200
200
2
4
6
8
0
10
2
SM3 + MM1
4
6
8
10
0
10000
1000
1000
1000
200
200
200
2
3
4
5
6
7
8
0
2
4
6
6
8
10
4
6
8
10
8
10
0
2
4
6
8
Figure 13: A comparison of the number of perfect solutions in all synthetic data sets.
MIMB UILD largely succeeds. F IND H IDDEN (with GES, i.e., FHG) has generally good results,
although it behaves erractly with SM1.14
8. Real Data Applications
We now briefly present the results for two real data sets. Data collected from such domains may
pose significant problems for exploratory data analysis since sample sizes are usually small and
noisy, nevertheless they have a very useful property for our empirical evaluation. In particular,
data obtained by questionnaires are designed to target specific latent factors (such as “stress”, “job
satisfaction”, and so on) and a theoretical measurement model is developed by experts in the area to
measure the desired latent variables. Very generally, experts are more confident about their choice
of measures than about the structural model. Such data thus provide a basis for comparison with the
output of our algorithm. The chance that various observed variables are not pure measures of their
14. This can probably be improved by adopting other schema of search initialization and extra heuristics for escaping
local minima. However, it can also be a much slower algorithm than BPC, as discussed before.
220
12
SM3 + MM3
10000
1
2
SM3 + MM2
10000
0
4
SM2 + MM3
10000
0
2
SM2 + MM2
SM2 + MM1
10
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
Dep1
St1
St2
Dep2
+
Depression
Stress
...
...
Dep20
Coping
St21
C1
C2
...
C20
Figure 14: A theoretical model for the interaction of religious coping, stress and depression. The
signs on the edges depicts the theoretical signs of the corresponding effects.
theoretical latents is high. Measures are usually discrete, but often ordinal with a Likert-scale that
can be treated as normally distributed measures with little loss (Bollen, 1989). In the examples, we
compare our procedures with models produced by domain researchers.
8.1 Stress Religious Coping and Depression
Bongjae Lee from the University of Pittsburgh conducted a study of religious/spiritual coping and
stress in graduate students. In December of 2003, 127 students answered a questionnaire intended to
measure three main factors: stress (measured with 21 items), depression (measured with 20 items)
and religious/spiritual coping (measured with 20 items). The full questionnaire is given by Silva
and Scheines (2004). Lee’s model is shown in Figure 14.
This model fails a chi-square test: p = 0. The measurement model produced by B UILD P URE C LUSTERS is shown in Figure 15(a). Note that the variables selected automatically are
proper subsets of Lee’s substantive clustering. The full model automatically produced with GE S MIMB UILD with the prior knowledge that STRESS is not an effect of other latent variables is given
in Figure 15(b). This model passes a chi square test, p = 0.28, even though the BPC algorithm itself
does not try to directly maximize the fit of the algorithm.
Our F IND H IDDEN implementation took a couple of days to execture and did not perform produce a reasonable output if the theoretical model should be taken as the gold standard: five latents
were found to have 20 indicators altogether, but they have no correspondence to the theoretical clustering. This is not unexpected, since the sample size is very small and F IND H IDDEN tries to create a
model that includes all 61 variables. B UILD P URE C LUSTERS can be seen as a way of doing feature
selection by focusing on the easier, simpler pure models.
8.2 Test Anxiety
A survey of test anxiety indicators was administered to 335 grade 12 male students in British
Columbia. The survey consisted of 20 measures on symptoms of anxiety under test conditions. The
covariance matrix as well as a description of the variables is given by Bartholomew et al. (2002).15
15. The data are available online at http://multilevel.ioe.ac.uk/team/aimdss.html.
221
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
Dep9
St3
Dep13
St4
St16
Depression
Stress
St18
Dep19
C9
C12
C14
+
St4
St16
Stress
Dep19
Coping
St20
C15
Dep13
Depression
+
St18
Coping
St20
Dep9
St3
C9
(a)
C12
C14
C15
(b)
Figure 15: The output of BPC and GES-MIMB UILD for the coping study.
X3
X2
X4
X8
X5
X9
X 10
Emotionality
Worry
X6
X7
X 15
X 14
X 16
X 17
X 18
X 20
Figure 16: A theoretical model for psychological factors of test anxiety.
Using exploratory factor analysis, Bartholomew et al. concluded that two latent common causes
underly the variables in this data set, agreeing with previous studies. The original study identified
items {x2 , x8 , x9 , x10 , x15 , x16 , x18 } as indicators of an “emotionality” latent factor (this includes physiological symptoms such as jittery and faster heart beatting), and items {x3 , x4 , x5 , x6 , x7 , x14 , x17 , x20 }
as indicators of a more psychological type of anxiety labeled “worry” by Bartholomew et al. No
further description is given about the remaining five variables. Bartholomew et al.’s factor analysis
with oblique rotation roughly matches this model. Bartholomew et al.’s exploratory factor analysis
model for a subset of the variables is shown in Figure 16. This model is not intended to be pure.
Instead, the figure represents which of the two latents is more “strongly” connected to each indicator. The measurement model itself is not constrained. Trying to fit this model as a pure model
(i.e., using the graph in Figure 16 instead of a two-factor multivariate Gaussian model with a fully
connected measurement model) gives a p-value of zero according to a chi-square test.
BPC provides the measurement model given in 17(a).16 The labels in the latents were given
to us and should be seen as our particular interpretation. Applying GES-MIMB UILD to the this
measurement model results in the model shown in Figure 17(b). The model passes a chi-square
16. We allowed a latent with less than three indicators. It might correspond to more than one latent in the true model.
222
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
test handily, p = 0.47, even though we used constraint-satisfaction techniques that did not try to
maximize the fitness of the model directly. To summarize, BPC provided a model supported by
the data that is very close to a submodel of the theoretical model (variables X4 , X15 , X17 , X20 were
removed), except that:
• one of the latents is split in two. To see how this is supported by the data, trying to merge
latents “Cares about achieving” and ”Self-defeating” will result in a model of p-value of zero;
• variable X11 is used, which is not considered by Bartholomew et al.’s model;
What is remarkable in this case is the ability of reconstructing much of the theoretical model
without using prior knowledge. The model is very simple, i.e., each indicator measures a single
latent, while Bartholomew et al.’s model seems to artificially add edges from all latents into all
indicators to get a model that fits the data. Escaping this artificiality is one of the motivations behind
variable selection in factor analysis methods, such as the one proposed by Kano and Harada (2000):
finding a submodel that is a pure model can provide a better understanding of the causal process
being measured than allowing an impure model, whose extra edges might be no more than a patch
to account for residual correlation among indicators, without necessarily existing in the true model.
Kano and Harada’s method, however, requires an initial measurement model to be “purified,” while
BPC works from scratch.
We applied F IND H IDDEN to this data set, obtaining the model shown in Figure 18(a). To simplify presentation, we removed nodes that were not children of any latent in the output model (e.g.,
X3 was not a child of any of the latents, which results on its removal from the picture). Three latents,
labeled by us as “Emotionality 1”, “Emotionality 2” and “Worry” were generated. Both “Emotionality 1” and “Emotionality 2” seem to be a combination of some of the theoretical “Emotionality”
indicators (Figure 16) plus some indicators not included the theoretical model of Figure 16. There
are also lots of edges corresponding to impurities for which no equivalence class is known. As
discussed in Section 3, these edges might correspond to very different causal mechanisms than they
might suggest.
Since 5 of the variables are not present in the theoretical model, it is not so easy to compare
this model against the theoretical model. Therefore, we also provide the result that is obtained from
F IND H IDDEN when the data contains only the 15 indicators used in Figure 16. The result is the one
shown in Figure 18(b), where we adopted the same latent labels used in BPC’s output. The result
is, surpringly, very different. The model has now a much closer resemblance to BPC’s output,
supporting the plausability of BPC’s output. However, while it seems that BPC is able to find a
pure model among all 20 indicators, F IND H IDDEN in this case was able to find an (almost) pure
model only when variables were properly pre-selected.
9. Generalizations
In many social science studies, latent structure is represented by so called “non-recursive” structure.
In graphical terms, the dependency graph is cyclic. Richardson (1996) has developed a consistent constraint based search for cyclic graphical models of linear systems, and our procedures for
identifying measurement models can be combined with it to search for such structure.
The procedure we have described here can, however, straightforwardly be generalized to cases
with measured variables taking a small finite range of values by treating the discrete variables as
223
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
X3
X2
Cares about achieving
X8
X 10
X6
Self−defeating
X 11
X 14
X 16
X 18
X5
X7
X9
Emotionality
X 10
Cares about achieving
X8
X7
X9
X3
X2
X5
Emotionality
X6
Self−defeating
X 11
X 14
X 16
X 18
(a)
(b)
Figure 17: The output of BPC and GES-MIMB UILD for the test anxiety study.
X4
X8
X5
X9
X 10
Worry
Emotionality 1
X6
X2
X7
X 11
X 14
X 12
X8
X3
Cares about achieving
X7
X9
Emotionality 2
X 20
X 10
X4
Emotionality
X 15
X1
X2
X 13
X 15 X 16
X 18 X 19
X5
Self−defeating
X6
X 14
X 16
X 20
X 17
X 18
(a)
(b)
Figure 18: The output of F IND H IDDEN when using all 20 variables (a) and when using only the
variables present in the theoretical model (b).
224
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
L1
L2
L3
X1 X2 X3
X4 X5 X6
X7 X8 X9
Figure 19: A model with no pure submodel with three indicators per latent.
projections from a Gaussian distribution. These are sometimes called latent trait models in the
literature (Bartholomew and Knott, 1999). Much larger sample sizes are required than for linear,
Gaussian measured variables.
In previous works (Silva et al., 2003; Silva and Scheines, 2005), we developed an approach to
learn measurement models even when the functional relationships among latents are non-linear. In
practice, that generality is of limited use because there are at present no consistent search methods available for structures with continuous, non-linear variables. A modified version of B UILD P URE C LUSTERS, however, exists for the problem of learning equivalence classes of measurement
models for non-linear structural models. Some of the results here developed cannot be carried on to
the non-linear case (e.g., rule CS3). Others are substantially modified (Lemma 9). With extra prior
knowledge, however, much of the measurement model for non-linear structural models can still be
learned from data.
Finally, there are ways of reliably learning some types of impure models using the results discussed in this paper. For instance, only two of the three latents in the model in Figure 19 can be
generated by B UILD P URE C LUSTERS. A small modification of the algorithm, which would include
an equivalence class accounting for some types of impurities, would be able to reconstruct all latents
in this example. A more systematic exploration of such extensions will be treated in a future work.
10. Conclusion
This paper introduced a novel algorithm for learning causal structure in linear models which, to
the best of our knowledge, presents the first published solution for the problem of learning causal
models with latent variables in a principled way where observed conditional independencies are not
expected to exist. It has the following properties:
• it was designed to learn multiple indicator models, i.e., models where observed variables are
not causes of the hidden variables of interest, but which still encompass a large class of useful
models;
• no assumptions about the number of hidden variables and how they are connected to observed
variables are needed;
• it is a two-stage algorithm, which first learns equivalence classes of measurement models
(i.e., which latents exist and which observed children they have) and, based on a choice of
measurement model, returns an equivalence class of causal models among the latents;
225
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
• it is provably correct, in the sense that given the assumptions explicitly described in the paper and in the limit of infinite data, all causal claims made by the output graph hold in the
population;
• it provides a framework which can be partially extended to cover other types of data (discrete,
ordinal) and causal relations (non-linear, non-Gaussian);
Our experiments provide evidence that our procedures can be useful in practice, but there are
certainly classes of problems where B UILD P URE C LUSTERS will not be of practical value. For
instance, learning the causal structure of general blind source separation problems, where measures
are usually indicators of most of the latents (i.e., sources) at the same time.
A number of open problems invite further research, including these:
• completeness of the tetrad equivalence class of measurement models: can we identify all the
common features of measurement models in the same tetrad equivalence class?
• using the more generic rank constraints of covariance matrices to learn measurement models,
possibly identifying the nature of some impure relationships;
• better treatment of discrete variables. Bartholomew and Knott (1999) survey different ways
of integrating factor analysis and discrete variables that can be readily adapted, but the computational cost of this procedure is high;
• finding non-linear causal relationships among latent variables given a fixed linear measurement model, and in other families of multivariate continuous distributions besides the Gaussian;
The fundamental point is that common and appealing heuristics (e.g., factor rotation methods)
fail when the goal is structure learning with a causal interpretation. In many cases it is preferable
to model the relationships of a subset of the given variables than trying to force a bad model over
all of them (Kano and Harada, 2000). Better methods are available now, and further improvements
will surely come from machine learning research.
Acknowledgments
We thank the anonymous reviewers for their comments, which greatly improved the presentation of
this paper. Research for this paper was supported by NASA NCC 2-1377 to the University of West
Florida, NASA NRA A2-37143 to CMU and ONR contract N00014-03-01-0516 to the University
of West Florida.
Appendix A. B UILD P URE C LUSTERS: Full Algorithm and Implementation
We now introduce the complete version of B UILD P URE C LUSTERS. This version has additional
steps that deal with exceptional, but arguably less relevant, situations. This requires removing additional nodes due to vanishing correlations, as well as merging some clusters. The full algorithm is
given in Table 6.
226
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
Algorithm B UILD P URE C LUSTERS
Input: a covariance matrix Σ
1. G ←F IND PATTERN(Σ).
2. Choose a set of latents in G. Remove all other latents and all observed nodes that are not
children of the remaining latents and all clusters of size 1.
3. Remove all nodes that have more than one latent parent in G.
4. For all pairs of nodes linked by an undirected edge, choose one element of each pair to be
removed.
5. If for some set of nodes {A, B,C}, all children of the same latent, there is a fourth node D in
G such that σAB σCD = σAC σBD = σAD σBC is not true, remove one of these four nodes.
6. For every latent L with at least two children, {A, B}, if there is some node C in G such that
σAC = 0 and σBC 6= 0, split L into two latents L1 and L2 , where L1 becomes the only parent of
all children of L that are correlated with C, and L2 becomes the only parent of all children of
L that are not correlated with C;
7. Remove any cluster with exactly 3 variables {X1 , X2 , X3 } such that there is no X4 where all
three tetrads in the covariance matrix X = {X1 , X2 , X3 , X4 } hold, all variables of X are correlated and no partial correlation of a pair of elements of X is zero conditioned on some
observed variable;
8. While there is a pair of clusters with latents Li and L j , such that for all subsets {A, B,C, D} of
the union of the children of Li , L j we have σAB σCD = σAC σBD = σAD σBC , and no marginal or
conditional independencies (where the condition set is of size 1) are observed in this cluster,
set Li = L j (i.e., merge the clusters);
9. Again, verify all implied tetrad constraints and remove elements accordingly: iterate Steps
6-7-8 until no changes happen;
10. Remove all latents with less than three children, and their respective measures;
11. if G has at least four observed variables, return G. Otherwise, return an empty model.
Table 6: The complete version of B UILD P URE C LUSTERS.
227
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
L
L4
0
L1
L
Z1 Z2 Z3
L
2
W1 W2 W 3
3
L0
X1 X2 X3
L
Y1 Y2 Y3
W1
(a)
Z1
X1
0
Y1
(b)
Figure 20: The true graph in (a) will generate at some point a purified measurement pattern as in
(b). It is desirable to merge both clusters.
It might be surprising that we merge clusters of variables that we know cannot share a common
latent parent in the true graph. However, we are not guaranteed to find a large enough number
of pure indicators for each of the original latent parents, and as a consequence only a subset of
the true latents will be represented in the measurement pattern. It might be the case that, with respect to the variables present in the output, the observed variables in two different clusters might
be directly measuring some ancestor common to all variables in these two clusters. As an illustration, consider the graph in Figure 20(a), where double-directed edges represent independent hidden
common causes. Assume any sensible purification procedure will choose to eliminate all elements
in {W2 ,W3 , X2 , X3 ,Y2 ,Y3 , Z2 , Z3 } because they are directly correlated with a large number of other
observed variables (extra edges and nodes not depicted).
Meanwhile, one can verify that all three tetrad constraints hold in the covariance matrix of
{W1 , X1 ,Y1 , Z1 }, and therefore there will be no undirected edges connecting pairs of elements in this
set in the corresponding measurement pattern. Rule CS1 is able to separate W1 and X1 into two
different clusters by using {W2 ,W3 , X2 , X3 } as the support nodes, and analogously the same happens
to Y1 and Z1 , W1 and Y1 , X1 and Z1 . However, no test can separate W1 and Z1 , nor X1 and Y1 . If we do
not merge clusters, we will end up with the graph seen in Figure 20(b) as part of our output pattern.
Although this is a valid measurement pattern, and in some situations we might want to output such
a model, it is also true that W1 and Z1 measure a same latent L0 (as well as X1 and Y1 ). It would be
problematic to learn a structural model with such a measurement model. There is a deterministic
relation between the latent measured by W1 and Z1 , and the latent measured by X1 and Y1 : they
are the same latent! Probability distributions with deterministic relations are not faithful, and that
causes problems for learning algorithms.
Finally, we show examples where Steps 6 and 7 of B UILD P URE C LUSTERS are necessary. In
Figure 21(a) we have a partial view of a latent variable graph, where two of the latents are marginally
independent. Suppose that nodes X4 , X5 and X6 are correlated to many other measured nodes not in
this figure, and therefore are removed by our purification procedure. If we ignore Step 6, the result228
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
X1 X2 X3
X4 X5 X6
X7 X8 X9
(a)
X1 X2 X3
X7 X8 X9
(b)
Figure 21: Suppose (a) is our true model. If for some reason we need to remove nodes X4 , X5 and
X6 from our final pure graph, the result will be as shown in Figure (b), unless we apply
Step 6 of B UILD P URE C LUSTERS. There are several problems with (b), as explained in
the text.
ing pure submodel over {X1 , X2 , X3 , X7 , X8 , X9 } will be the one depicted in Figure 21(b) ({X1 , X2 }
are clustered apart from {X7 , X8 , X9 } because of marginal zero correlation, and X3 is clustered apart
from {X7 , X8 , X9 } because of CS1 applied to {X3 , X4 , X5 } × {X7 , X8 , X9 }). However, no linear latent
variable model can be parameterized by this graph: if we let the two latents to be correlated, this
will imply X1 and X7 being correlated. If we make the two latents uncorrelated, X3 and X7 will be
uncorrelated.
Step 7 exists to avoid rare situations where three observed variables are clustered together and
are pairwise part of some foursome entailing all three tetrad constraints with no vanishing marginal
and partial correlation, but still should be removed because they are not simultaneously in such a
foursome. They might not be detected by Step 4 if, e.g., all three of them are uncorrelated with all
other remaining observed variables.
In the rest of this section, we describe a practical implementation of B UILD P URE C LUSTERS.
The algorithm is described for a given covariance matrix to simplify the exposition. Since typically one has only a sample covariance matrix, we need a statistical decision procedure. Statistical
tests for tetrad constraints are described by Spirtes et al. (2000). Although it is known that in
practice constraint-based approaches for learning graphical model structure are outperformed on
accuracy by score-based algorithms such as GES (Chickering, 2002), we favor a combination of
a constraint-based approach and a score-based approach due mostly to computational efficiency.
A smart implementation of constraint-satisfaction algorithms can avoid many statistical shortcomings. If the experimental results are any indication of success, we can claim we provide such an
implementation.
We also describe in full detail how particular choices in B UILD P URE C LUSTERS (e.g., Step 2,
where one has to choose a set of latents from the measurement pattern) are solved in our implementation. We stress that the particularities of the implementation bear no implication on the theoretical
results given in this paper: the algorithms remain point-wise consistent. The informativeness of the
results (i.e., how much of the true structure is discovered) will vary, but in the particular examples
given in this paper, results were quite insensitive to variations of the following implementation.
229
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
A.1 Robust Purification
We do avoid a constraint-satisfaction approach for purification. At least for a fixed p-value and using
false discovery rates to control for multiplicity of tests, purification by testing tetrad constraints often
throws away many more nodes than necessary when the number of variables is relative small, and
does not eliminate many impurities when the number of variables is too large. We suggest a robust
purification approach as follows.
Suppose we are given a clustering of variables (not necessarily disjoint clusters) and a undirect
graph indicating which variables might be ancestors of each other, analogous to the undirect edges
generated in F IND PATTERN. We purify this clustering not by testing multiple tetrad constraints,
but through a greedy search that eliminates nodes from a linear measurement model that entails
tetrad constraints. This is iterated till the current model fits the data according to a chi-square test
of significance (Bollen, 1989) and a given acceptance level. Details are given in Table 7.
This implementation is used as a subroutine for a more robust implementation of B UILD P URE C LUSTERS described in the next section. However, it can be considerably slow. An alternative
is using the approximation derived by Kano and Harada (2000) to rapidly calculate the fitness of
a factor analysis model when a variable is removed. Another alternative is a greedy search over
the initial measurement model, freeing correlations of pairs of measured variables. Once we found
which variables are directly connected, we eliminate some of them till no pair is impure. Details
of this particular implementation are given by Silva and Scheines (2004). In our experiments with
synthetic data, it did not work as well as the iterative removal of variables described in Table 7.
However, we do apply this variation in the last experiment described in Section 6, because it is computationally cheaper. If the model search in ROBUST P URIFY does not fit the data after we eliminate
too many variables (i.e., when we cannot statistically test the model) we just return an empty model.
A.2 Finding a Robust Initial Clustering
The main problem of applying F IND PATTERN directly by using statistical tests of tetrad constraints
is the number of false positives: accepting a rule (CS1, CS2, or CS3) as true when it does not hold
in the population. One can see that might happen relatively often when there are large groups of
observed variables that are pure indicators of some latent: for instance, assume there is a latent L0
with 10 pure indicators. Consider applying CS1 to a group of six pure indicators of L0 . The first
two constraints of CS1 hold in the population, and so assume they are correctly identified by the
statistical test. The last constraint, σX1 X2 σY1Y2 6= σX1Y2 σX2Y1 , should not hold in the population, but
will not be rejected by the test with some probability. Since there are 10!/(6!4!) = 210 ways of CS1
being wrongly applied due to a statistical mistake, we will get many false positives in all certainty.
We can highly minimize this problem by separating groups of variables instead of pairs. Consider the test D ISJOINT G ROUP(Xi , X j , Xk ,Ya ,Yb ,Yc ; Σ):
• D ISJOINT G ROUP(Xi , X j , Xk ,Ya ,Yb ,Yc ; Σ) = true if and only if CS1 returns true for all sets
{X1 , X2 , X3 ,Y1 ,Y2 ,Y3 }, where {X1 , X2 , X3 } is a permutation of {Xi , X j , Xk } and {Y1 ,Y2 ,Y3 } is
a permutation of {Ya ,Yb ,Yc }. Also, we test an extra redundant constraint: for every pair
{X1 , X2 } ⊂ {Xi , X j , Xk } and every pair {Y1 ,Y2 } ⊂ {Ya ,Yb ,Yc } we also require that σX1Y1 σX2Y2 =
σX1Y2 σX2Y1 .
Notice it is much harder to obtain a false positive with D ISJOINT G ROUP than, say, with CS1
applied to a single pair. This test can be implemented in steps: for instance, if for no four foursome
230
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
Algorithm
Inputs:
ROBUST P URIFY
Clusters, a set of subsets of some set O;
C, an undirect graph over O;
Σ, a sample covariance matrix of O.
1. Remove all nodes that have appear in more than one set in Clusters.
2. For all pairs of nodes that belong to two different sets in Clusters and are adjacent in C, remove the
one from the largest cluster or the one from the smallest cluster if this has less than three elements.
3. Let G be a graph. For each set S ∈ Clusters, add all nodes in S to G and a new latent as the only
common parent of all nodes in S. Create an arbitrary full DAG among latents.
4. For each variable V in G, fit a graph G′ (V ) obtained from G by removing V . Update G by choosing
the graph G′ (V ) with the smallest chi-square score. If some latent ends up with less than two children,
remove it. Iterate till a significance level is achieved.
5. Do mergings if that increases the fitness. Iterate 4 and 5 till no improvement can be done.
6. Eliminate all clusters with less than three variables and return G.
Table 7: A score-based purification.
including Xi and Ya we have that all tetrad constraints hold, then we do not consider Xi and Ya in
D ISJOING G ROUP.
Based on D ISJOINT G ROUP, we propose here a modification to increase the robustness of B UILD P URE C LUSTERS, the ROBUST B UILD P URE C LUSTERS algorithm, as given in Table 8. It starts with
a first step called F IND I NITIAL S ELECTION (Table 9). The goal of F IND I NITIAL S ELECTION is
to find a pure model using only D ISJOINT G ROUP instead of CS1, CS2 or CS3. This pure model
is then used as an starting point for learning a more complete model in the remaining stages of
ROBUST B UILD P URE C LUSTERS.
In F IND I NITIAL S ELECTION, if a pair {X,Y } cannot be separated into different clusters, but
also does not participate in any successful application of D ISJOINT G ROUP, then this pair will be
connected by a GRAY or YELLOW edge: this indicates that these two nodes cannot be in a pure
submodel with three indicators per latent. Otherwise, these nodes are “compatible”, meaning that
they might be in such a pure model. This is indicated by a BLUE edge.
In F IND I NITIAL S ELECTION we then find cliques of compatible nodes (Step 8).17 Each clique
is a candidate for a one-factor model (a latent model with one latent only). We purify every clique
found to create pure one-factor models (Step 9). This avoids using clusters that are large not because
they are all unique children of the same latent, but because there was no way of separating its
elements. This adds considerably more computational cost to the whole procedure.
After we find pure one-factor models Mi , we search for a combination of compatible groups.
Step 10 first indicates which pairs of one-factor models cannot be part of a pure model with three
indicators each: if Mi and M j are not pairwise a two-factor model with three pure indicators (as
tested by D ISJOINT G ROUP), they cannot be both part of a valid solution.
C HOOSE C LUSTERING C LIQUE is a heuristic designed to find a large set of one-factor models
(nodes of H) that can be grouped into a pure model with three indicators per latent (we need a
17. Any algorithm can be used to find maximal cliques. Notice that, by the anytime properties of our approach, one does
not need to find all maximal cliques.
231
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
Algorithm ROBUST B UILD P URE C LUSTERS
Input: Σ, a sample covariance matrix of a set of variables O
1. (Selection,C,C0 ) ←F IND I NITIAL S ELECTION(Σ).
2. For every pair of nonadjacent nodes {N1 , N2 } in C where at least one of them is not in Selection and
an edge N1 − N2 exists in C0 , add a RED edge N1 − N2 to C.
3. For every pair of nodes linked by a RED edge in C, apply successively rules CS1, CS2 and CS3.
Remove an edge between every pair corresponding to a rule that applies.
4. Let H be a complete graph where each node corresponds to a maximal clique in C.
5. FinalClustering ← C HOOSE C LUSTERING C LIQUE (H).
6. Return ROBUST P URIFY(FinalClustering,C, Σ).
Table 8: A modified B UILD P URE C LUSTERS algorithm.
heuristic since finding a maximum clique in H is NP-hard). First, we define the size of a clustering
Hcandidate (a set of nodes from H) as the number of variables that remain according to the following
elimination criteria: 1. eliminate all variables that appear in more than one one-factor model inside
Hcandidate ; 2. for each pair of variables {X1 , X2 } such that X1 and X2 belong to different one-factor
models in Hcandidate , if there is an edge X1 − X2 in C, then we remove one element {X1 , X2 } from
Hcandidate (i.e., guarantee that no pair of variables from different clusters which were not shown to
have any common latent parent will exist in Hcandidate ). We eliminate the one that belongs to the
largest cluster, unless the smallest cluster has less than three elements to avoid extra fragmentation;
3. eliminate clusters that have less than three variables.
The heuristic motivation is that we expected that a model with a large size will have a large number of variables after purification. Our suggested heuristic to be implemented as C HOOSE C LUS TERING C LIQUE is trying to find a good model using a very simple hill-climbing algorithm that
starts from an arbitrary node in H and add new clusters to the current candidate according to the
one that will increase its size mostly while still forming a maximal clique in H. We stop when we
cannot increase the size of the candidate. This is calculated using each node in H as a starting point,
and the largest candidate is returned by C HOOSE C LUSTERING C LIQUE.
A.3 Clustering Refinement
The next steps in ROBUST B UILD P URE C LUSTERS are basically the F IND PATTERN algorithm of
Table 1 with a final purification. The main difference is that we do not check anymore if pairs
of nodes in the initial clustering given by Selection should be separated. The intuition explaining
the usefulness of this implementation is as follows: if there is a group of latents forming a pure
subgraph of the true graph with a large number of pure indicators for each latent, then the initial
step should identify such group. The consecutive steps will refine this solution without the risk
of splitting the large clusters of variables, which are exactly the ones most likely to produce false
positive decisions. ROBUST B UILD P URE C LUSTERS has the power of identifying the latents with
large sets of pure indicators and refining this solution with more flexible rules, covering also cases
where D ISJOINT G ROUP fails.
232
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
Algorithm F IND I NITIAL S ELECTION
Input: Σ, a sample covariance matrix of a set of variables O
1. Start with a complete graph C over O.
2. Remove edges of pairs that are marginally uncorrelated or uncorrelated conditioned on a third variable.
3. C0 ← C.
4. Color every edge of C as BLUE.
5. For all edges N1 − N2 in C, if there is no other pair {N3 , N4 } such that all three tetrads constraints hold
in the covariance matrix of {N1 , N2 , N3 , N4 }, change the color of the edge N1 − N2 to GRAY.
6. For all pairs of variables {N1 , N2 } linked by a BLUE edge in C
If there exists a pair {N3 , N4 } that forms a BLUE clique with N1 in C, and a pair
{N5 , N6 } that forms a BLUE clique with N2 in C, all six nodes form a clique in C0 and
D ISJOINT G ROUP(N1 , N3 , N4 , N2 , N5 , N6 ; Σ) = true, then remove all edges linking elements in
{N1 , N3 , N4 } to {N2 , N5 , N6 }.
Otherwise, if there is no node N3 that forms a BLUE clique with {N1 , N2 } in C,
and no BLUE clique in {N4 , N5 , N6 } such that all six nodes form a clique in C0 and
D ISJOINT G ROUP(N1 , N2 , N3 , N4 , N5 , N6 ; Σ) = true, then change the color of the edge N1 − N2
to YELLOW.
7. Remove all GRAY and YELLOW edges from C.
8. ListC ←F IND M AXIMAL C LIQUES(C).
9. Let H be a graph where each node corresponds to an element of ListC and with no edges. Let Mi
denote both a node in H and the respective set of nodes in ListC . Let Mi ← ROBUST P URIFY(Mi ,C, Σ);
10. Add an edge M1 − M2 to H only if there exists {N1 , N2 , N3 } ⊆ M1 and {N4 , N5 , N6 } ⊆ M2 such that
D ISJOINT G ROUP(N1 , N2 , N3 , N4 , N5 , N6 ; Σ) = true.
11. Hchoice ←C HOOSE C LUSTERING C LIQUE(H).
12. Let Hclusters be the corresponding set of clusters, i.e., the set of sets of observed variables, where each
set in Hclusters correspond to some Mi in Hchoice .
13. Selection ←ROBUST P URIFY(Hclusters ,C, Σ).
14. Return (Selection,C,C0 ).
Table 9: Selects an initial pure model.
Notice that the order by which tests are applied might influence the outcome of the algorithms,
since if we remove an edge X −Y in C at some point, then we are excluding the possibility of using
some tests where X and Y are required. Imposing such restriction reduces the overall computational
cost and statistical mistakes. To minimize the ordering effect, an option is to run the algorithm
multiple times and select the output with the highest number of nodes.
Appendix B. Proofs
Before we present the proofs of our results, we need a few more definitions:
233
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
T
C
A
D
C
A
E
B
(a)
D
E
B
(b)
CP
M
A
B
C
N
D
(c)
Figure 22: In (a), C is a choke point for sets {A, B} × {D, E}, since it lies on all treks connecting
nodes in {A, B} to nodes in {D, E} and lies also on the {D, E} side of all of such treks.
For instance, C is on the {D, E} side of A → C → D, where A is the source of such
a trek. Notice also that this choke point d-separates nodes in {A, B} from nodes in
{D, E}. Analogously, D is also a choke point for {A, B} × {D, E} (there is nothing on
the definition of a choke point I × J that forbids it of belonging I ∪ J). In Figure (b), C is
a choke point for sets {A, B} × {D, E} that does not d-separate such elements. In Figure
(c), CP is a node that lies on all treks connecting {A,C} and {B, D} but it is not a choke
point, since it does not lie on the {A,C} side of trek A ← M → CP → B and neither lies
on the {B, D} side of D ← N → CP → A. The same node, however, is a {A, D} × {B,C}
choke point.
• a path in a graph G is a sequence of nodes {X1 , . . . , Xn } such that Xi and Xi+1 are adjacent in
G, 1 ≤ i < n. Paths are assumed to be simple by definition, i.e., no node appears more than
once. Notice there is an unique set of edges associated with each given path. A path is into
X1 (or Xn ) if the arrow of the edge {X1 , X2 } is into X1 ({Xn−1 , Xn } into Xn );
• a collider on a path {X1 , . . . , Xn } is a node Xi , 1 < i < n, such that Xi−1 and Xi+1 are parents
of Xi ;
• a trek is a path that does not contain any collider;
• the source of a trek is the unique node in a trek to which no arrows are directed;
• the I side of a trek between nodes I and J with source X is the subpath directed from X to I.
It is possible that X = I;
• a choke point CP between two sets of nodes I and J is a node that lies on every trek between
any element of I and any element of J such that CP is either (i) on the I side of every such
trek 18 or (ii) on the J side or every such trek.
With the exception of choke points, all other concepts are well known in the literature of graphical models (Spirtes et al., 2000; Pearl, 1988, 2000). What is interesting in a choke point is that,
by definition, such a node is in all treks linking elements in two sets of nodes. Being in all treks
connecting a node Xi and a node X j is a necessary condition for a node to d-separate Xi and X j ,
although this is not a sufficient condition.
18. That is, for every {I, J} ∈ I × J, CP is on the I side of every trek T = {I, . . . , X, . . . , J}, X being the source of T .
234
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
Consider Figure 22, which illustrates several different choke points. In some cases, the choke
point will d-separate a few nodes. The relevant fact is that even when the choke point is a latent
variable, this has an implication on the observed marginal distribution, as stated by the Tetrad Representation Theorem:
Theorem 21 (The Tetrad Representation Theorem) Let G be a linear latent variable model, and
let I1 , I2 , J1 , J2 be four variables in G. Then σI1 J1 σI2 J2 = σI1 J2 σI2 J1 if and only if there is a choke point
between {I1 , I2 } and {J1 , J2 }.
Proof: The original proof was given by Spirtes et al. (2000). Shafer et al. (1993) provide an alternative and simplied proof.
Shafer et al. (1993) also provide more details on the definitions and several examples.
Therefore, unlike a partial correlation constraint obtained by conditioning on a given set of
variables, where such a set should be observable, some d-separations due to latent variables can be
inferred using tetrad constraints. We will use the Tetrad Representation Theorem to prove most of
our results. The challenge lies on choosing the right combination of tetrad constraints that allows us
to identify latents and d-separations due to latents, since the Tetrad Representation Theorem is far
from providing such results directly.
In the following proofs, we will frequently use the symbol G(O) to represent a linear latent
variable model with a set of observed nodes O. A choke point between sets I and J will be denoted
as I × J. We will first introduce a lemma that is going to be useful to prove several other results.
Lemma 9 Let G(O) be a linear latent variable model, and let {X1 , X2 , X3 , X4 } ⊂ O be such that
σX1 X2 σX3 X4 = σX1 X3 σX2 X4 = σX1 X4 σX2 X3 . If ρAB 6= 0 for all {A, B} ⊂ {X1 , X2 , X3 , X4 }, then an unique
node P entails all the given tetrad constraints, and P d-separates all elements in {X1 , X2 , X3 , X4 }.
Proof: Let P be a choke point for pairs {X1 , X2 } × {X3 , X4 }. Let Q be a choke point for pairs
{X1 , X3 } × {X2 , X4 }. We will show that P = Q by contradiction.
Assume P 6= Q. Because there is a trek that links X1 and X4 through P (since ρX1 X4 6= 0), we
have that Q should also be on that trek. Suppose T is a trek connecting X1 to X4 through P and Q,
and without loss of generality assume this trek follows an order that defines three subtreks: T0 , from
X1 to P; T1 , from P to Q; and T2 , from Q to X4 , as illustrated by Figure 23(a). In principle, T0 and T2
might be empty, i.e., we are not excluding the possibility that X1 = P or X4 = Q.
There must be at least one trek TQ2 connecting X2 and Q, since Q is on every trek between X1
and X2 and there is at least one such trek (since ρX1 X2 6= 0). We have the following cases:
Case 1: TQ2 includes P. TQ2 has to be into P, and P 6= X1 , or otherwise there will be a trek connecting
X2 to X1 through a (possibly empty) trek T0 that does not include Q, contrary to our hypothesis. For
the same reason, T0 has to be into P. This will imply that T1 is a directed path from P to Q, and T2
is a directed path from Q to X4 (Figure 23(b)).
Because there is at least one trek connecting X1 and X2 (since ρX1 X2 6= 0), and because Q is on
every such trek, Q has to be an ancestor of at least one member of {X1 , X2 }. Without loss of generality, assume Q is an ancestor of X1 . No directed path from Q to X1 can include P, since P is an
ancestor of Q and the graph is acyclic. Therefore, there is a trek connecting X1 and X4 with Q as the
235
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
X2
X1
P
Q
T0
T1
T2
X4
X1
TQ2
P
Q
T0
(a)
T1
T2
(b)
P
S
P
X2
X1
X4
X3
X4
X2
X1
X4
X3
(c)
(d)
Figure 23: In (a), a depiction of a trek T linking X1 and X4 through P and Q, creating three subtreks
labeled as T0 , T1 and T2 . Directions in such treks are left unspecified. In (b), the existence of a trek TQ2 linking X2 and Q through P will compel the directions depicted as a
consequence of the given tetrad and correlation constraints (the dotted path represents
any possible continuation of TQ2 that does not coincide with T ). The configuration in
(c) cannot happen if P is a choke point entailing all three tetrads among marginally dependent nodes {X1 , X2 , X3 , X4 }. The configuration in (d) cannot happen if P is a choke
point for {X1 , X3 } × {X2 , X4 }, since there is a trek X1 − P − X2 such that P is not on the
{X1 , X3 } side of it, and another trek X2 − S − P − X3 such that P is not on the {X2 , X4 }
side of it.
source that does not include P, contrary to our hypothesis.
Case 2: TQ2 does not include P. This case is similar to Case 1. TQ2 has to be into Q, and Q 6= X4 , or
otherwise there will be a trek connecting X2 to X4 through a (possible empty) trek T2 that does not
include P, contrary to our hypothesis. For the same reason, T2 has to be into Q. This will imply that
T1 is a directed path from Q to P, and T0 is a directed path from P to X1 . An argument analogous to
Case 1 will follow.
We will now show by that P d-separates all nodes in {X1 , X2 , X3 , X4 }. From the P = Q result,
we know that P lies on every trek between any pair of elements in {X1 , X2 , X3 , X4 }. First consider
the case where at most one element of {X1 , X2 , X3 , X4 } is linked to P through a trek that is into P.
By the Tetrad Representation Theorem, any trek connecting two elements of {X1 , X2 , X3 , X4 } goes
through P. Since P cannot be a collider on any trek, then P d-separates these two elements.
236
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
To finish the proof, we only have to show that P cannot be a collider in a path connecting any
two elements of {X1 , X2 , X3 , X4 }. We will prove that by contradiction. That is, assume without loss
of generality that there is a trek connecting X1 and P that is into P, and a trek connecting X2 and
P that is into P. We will show this either entails that ρX1 X2 = 0 or that P is not a choke point for
{X1 , X3 } × {X2 , X4 }.
Case 3: there is no trek connecting X1 and P that is out of P neither any trek connecting X2 and
P that is out of P. This implies there is no trek connecting X1 and X2 , since P is on every trek
connecting these two elements according to the Tetrad Representation Theorem. But this implies
ρX1 X2 = 0, a contradiction, as illustrated by Figure 23(c).
Case 4 (this case will be similar to the example given in Figure 22(c)): assume without loss of generality that there is also a trek out of P and into X2 . Then there is a trek connecting X1 to X2 through
P that is not on the {X1 , X3 } side of pair {X1 , X3 } × {X2 , X4 } to which P is a choke point. Therefore,
P should be on the {X2 , X4 } of every trek connecting elements pairs in {X1 , X3 } × {X2 , X4 }. Without
loss of generality, assume there is a trek out of P and into X3 (because if there is no such trek for
either X3 and X4 , we fall in the previous case by symmetry). Let S be the source of a trek into P and
X2 , which should exist since X2 is not an ancestor of P. Then there is a trek of source S connecting
X3 and X2 such that P is not on the {X2 , X4 } side of it as shown in Figure 23(d). Therefore P cannot
be a choke point for {X1 , X3 } × {X2 , X4 }. Contradiction.
Lemma 13 Let G(O) be a linear latent variable model. If for some set O′ = {X1 , X2 , X3 ,
X4 } ⊆ O, σX1 X2 σX3 X4 = σX1 X3 σX2 X4 = σX1 X4 σX2 X3 and for all triplets {A, B,C}, {A, B} ⊂ O′ ,C ∈ O,
we have ρAB.C 6= 0 and ρAB 6= 0, then no element A ∈ O′ is a descendant of an element of O′ \{A} in
G.
Proof: Without loss of generality, assume for the sake of contradiction that X1 is an ancestor of X2 .
From the given tetrad and correlation constraints and Lemma 9, there is a node P that lies on every
trek between X1 and X2 and d-separates these two nodes. Since P lies on the directed path from X1
to X2 , P is a descendant of X1 , and therefore an observed node. However, this implies ρX1 X2 .P = 0,
contrary to our hypothesis.
Lemma 10 Let G(O) be a linear latent variable model. Assume O′ = {X1 , X2 , X3 ,Y1 ,Y2 ,Y3 } ⊆ O.
If constraints {τX1Y1 X2 X3 , τX1Y1 X3 X2 , τY1 X1Y2Y3 , τY1 X1Y3Y2 , ¬τX1 X2Y2Y1 } all hold, and that for all triplets
{A, B,C}, {A, B} ⊂ O′ , C ∈ O, we have ρAB 6= 0, ρAB.C 6= 0, then X1 and Y1 do not have a common
parent in G.
Proof: We will prove this result by contradiction, by assuming that X1 and Y1 have a common parent
L in G and showing this entails τX1 X2Y2Y1 , contrary to the hypothesis.
Initially, we will show by contradiction that L is a choke point for {X1 ,Y1 } × {X2 , X3 }. Suppose
L is not a choke point for {X1 , X2 } × {Y1 , X3 } corresponding to one of the tetrad constraints given
by hypothesis. Because of the trek X1 ← L → Y1 , then either X1 or Y1 is a choke point. Without
loss of generality, assume X1 is a choke point in this case. By Lemma 9 and the given constraints,
X1 d-separates any two elements in {X2 , X3 ,Y1 } contrary to the hypothesis that ρX2 X3 .X1 6= 0. By
237
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
Y2
X2
X1
L
Y1
Y2
(a)
S
L
T3
T4
T1
X1
T
2
Y1
(b)
Figure 24: Figure (a) illustrates necessary treks among elements of {X1 , X2 ,Y1 ,Y2 , L} according to
the assumptions of Lemma 11 if we further assume that X1 is a choke point for pairs
{X1 , X2 } × {Y1 ,Y2 } (other treks might exist). Figure (b) rearranges (a) by emphasizing
that Y1 and Y2 cannot be d-separated by a single node.
symmetry, Y1 cannot be a choke point. Therefore, L is a choke point for {X1 ,Y1 } × {X2 , X3 } and by
Lemma 9, it also lies on every trek for any pair in S1 = {X1 , X2 , X3 ,Y1 }.
Analogously, L is on every trek connecting any pair from the set S2 = {X1 ,Y1 ,Y2 ,Y3 }. It follows that L is on every trek connecting any pairs in the product {X1 ,Y1 } × {X2 ,Y2 }, and it is on the
{X1 ,Y1 } side of {X1 ,Y1 } × {X2 ,Y2 }, i.e., L is a choke point that implies τX1 X2Y2Y1 . Contradiction.
Remember that predicate Factor(X,Y, G) is true if and only if there exist two nodes W and Z in
G such that τW XY Z and τW XZY are both entailed, all nodes in {W, X,Y, Z} are correlated, and there is
no observed C in G such that ρAB.C = 0 for {A, B} ⊂ {W, X,Y, Z}.
Lemma 11 Let G(O) be a linear latent variable model. Assume O′ = {X1 , X2 , X3 ,Y1 ,Y2 ,Y3 }
⊆ O, such that Factor(X1 , X2 , G) and Factor(Y1 ,Y2 , G) hold, Y1 is not an ancestor of Y3 and X1 is
not an ancestor of X3 . If constraints {τX1Y1Y2 X2 , τX2Y1Y3Y2 , τX1 X2Y2 X3 , ¬τX1 X2Y2Y1 } all hold, and that for
all triplets {A, B,C}, {A, B} ⊂ O′ ,C ∈ O, we have ρAB 6= 0, ρAB.C 6= 0, then X1 and Y1 do not have a
common parent in G.
Proof: We will prove this result by contradiction. Assume X1 and Y1 have a common parent L.
Because of the tetrad constraints given by hypothesis and the existence of the trek X1 ← L → Y1 ,
one node in {X1 , L,Y1 } should be a choke point for the pair {X1 , X2 } × {Y1 ,Y2 }. We will first show
that L has to be such a choke point, and therefore lies on every trek connecting X1 and Y2 , as well
as X2 and Y1 . We then show that L lies on every trek connecting Y1 and Y2 , as well as X1 and X2 .
Finally, we show that L is a choke point for {X1 ,Y1 } × {X2 ,Y2 }, contrary to our hypothesis.
Step 1: If there is a common parent L to X1 and Y1 , then L is a {X1 , X2 } × {Y1 ,Y2 } choke point. For
the sake of contradiction, assume X1 is a choke point in this case. By Lemma 13 and assumption
238
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
Y2
P
TPY
Y1
Y2
P
X−1
L
(a)
Y1
Y2
Y
−1
P
X−1
Y
+1
Y1
L
X2
X2
Y
+1
X2
(b)
(c)
Figure 25: In (a), a depiction of TY and TX , where edges represent treks (TX can be seen more
generally as the combination of the solid edge between X2 and P concatenated with a
dashed edge between P and Y1 representing the possibility that TY and TX might intersect
multiple times in TPY , but in principle do not need to coincide in TPY if P is not a choke
point.) In (b), a possible configurations of edges < X−1 , P > and < P,Y+1 > that do
not collide in P, and P is a choke point (and Y+1 6= Y ). In (c), the edge < Y−1 , P >
is compelled to be directed away from P because of the collider with the other two
neighbors of P.
Factor(X1 , X2 , G), we have that X1 is not an ancestor of X2 , and therefore all treks connecting X1
and X2 should be into X1 . Since ρX2Y2 6= 0 by assumption and X1 is on all treks connecting X2 and
Y2 , there must be a directed path out of X1 and into Y2 . Since ρX2Y2 .X1 6= 0 by assumption and X1 is
on all treks connecting X2 and Y2 , there must be a trek into X1 and Y2 . Because ρX2Y1 6= 0, there must
be a trek out of X1 and into Y1 . Figure 24(a) illustrates the configuration.
Since Factor(Y1 ,Y2 , G) is true, by Lemma 9 there must be a node d-separating Y1 and Y2 (neither Y1 nor Y2 can be the choke point in Factor(Y1 ,Y2 , G) because this choke point has to be latent,
according to the partial correlation conditions of Factor). However, by Figure 24(b), treks T2 − T3
and T1 − T4 cannot both be blocked by a single node. Contradiction. Therefore X1 cannot be a choke
point for {X1 , X2 } × {Y1 ,Y2 } and, by symmetry, neither can Y1 .
Step 2: L is on every trek connecting Y1 and Y2 and on every trek connecting X1 and X2 . Let L be the
choke point for pairs {X1 , X2 } × {Y1 ,Y2 }. As a consequence, all treks between Y2 and X1 go through
L. All treks between X2 and Y1 go through L. All treks between X2 and Y2 go through L. Such treks
exist, since no respective correlation vanishes.
Consider the given hypothesis σX2Y1 σY2Y3 = σX2Y3 σY2Y1 , corresponding to a choke point {X2 ,Y2 }×
{Y1 ,Y3 }. From the previous paragraph, we know there is a trek linking Y2 and L. L is a parent of Y1
by construction. That means Y2 and Y1 are connected by a trek through L.
We will show by contradiction that L is on every trek connecting Y1 and Y2 . Assume there is a
trek TY connecting Y2 and Y1 that does not contain L. Let P be the first point of intersection of TY
and a trek TX connecting X2 to Y1 , starting from X2 . If TY exists, such point should exist, since TY
should contain a choke point {X2 ,Y2 } × {Y1 ,Y3 }, and all treks connecting X2 and Y1 (including TX )
contain the same choke point.
239
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
M
L
M
Y2
X1
Y1
L
Y3 Y2 X2 X3 X1
(a)
Y1
Y3
(b)
Figure 26: In (a), Y2 and X1 cannot share a parent, and because of the given tetrad constraints, L
should d-separate M and Y3 . Y3 is not a child of L either, but there will be a trek linking
L and Y3 . In (b), an (invalid) configuration for X2 and X3 , where they share an ancestor
between M and L.
Let TPY be the subtrek of TY starting on P and ending one node before Y1 . Any choke point
{X2 ,Y2 } × {Y1 ,Y3 } should lie on TPY (Figure 25(a)). (Y1 cannot be such a choke point, since all treks
connecting Y1 and Y2 are into Y1 , and by hypothesis all treks connecting Y1 and Y3 are into Y1 . Since
all treks connecting Y2 and Y3 would need to go through Y1 by definition, then there would be no
such trek, implying ρY2Y3 = 0, contrary to our hypothesis.)
Assume first that X2 6= P and Y2 6= P. Let X−1 be the node before P in TX starting from X2 . Let
Y−1 be the node before P in TY starting from Y2 . Let Y+1 be the node after P in TY starting from Y2
(notice that it is possible that Y+1 = Y1 ). If X−1 and Y+1 do not collide on P (i.e., there is no structure
X−1 → P ← Y+1 ), then there will be a trek connecting X2 to Y1 through TPY after P. Since L is not
in TPY , L should be before P in TX . But then there will be a trek connecting X2 and Y1 that does not
intersect TPY , which is a contradiction (Figure 25(b)). If the collider does exist, we have the edge
P ← Y+1 . Since no collider Y−1 → P ← Y+1 can exist because TY is a trek, the edge between Y−1
and P is out of P. But that forms a trek connecting X2 and Y2 (Figure 25(c)), and since L is in every
trek between X2 and Y2 and TY does not contain L, then TX should contain L before P, which again
creates a trek between X2 and Y1 that does not intersect TPY .
If X2 = P, then TPY has to contain L, because every trek between X2 and Y1 contains L. Therefore,
X2 6= P. If Y2 = P, then because every trek between X2 and Y2 should contain L, we again have that
L lies in TX before P, which creates a trek between X2 and Y1 that does not intersect TPY . Therefore,
we showed by contradiction that L lies on every trek between Y2 and Y1 .
Consider now the given hypothesis σX1 X2 σX3Y2 = σX1Y2 σX3 X2 , corresponding to a choke point
{X2 ,Y2 } × {X1 , X3 }. By symmetry with the previous case, all treks between X1 and X2 go through L.
Step 3: If L exists, so does a choke point {X1 ,Y1 } × {X2 ,Y2 }. By the previous steps, L intermediates all treks between elements of the pair {X1 ,Y1 } × {X2 ,Y2 }. Because L is a common parent of
{X1 ,Y1 }, it lies on the {X1 ,Y1 } side of every trek connecting pairs of elements in {X1 ,Y1 } × {X2 ,Y2 }.
L is a choke point for this pair. This implies τX1 X2Y2Y1 . Contradiction.
Lemma 12 Let G(O) be a linear latent variable model. Let O′ = {X1 , X2 , X3 ,Y1 ,Y2 ,Y3 }
⊆ O. If constraints {τX1Y1Y2Y3 , τX1Y1Y3Y2 , τX1Y2 X2 X3 , τX1Y2 X3 X2 , τX1Y3 X2 X3 , τX1Y3 X3 X2 ,
¬τX1 X2Y2Y3 } all hold, and that for all triplets {A, B,C}, {A, B} ⊂ O′ ,C ∈ O, we have ρAB 6= 0, ρAB.C 6=
240
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
0, then X1 and Y1 do not have a common parent in G.
Proof: We will prove this result by contradiction. Suppose X1 and Y1 have a common parent L in G.
Since all three tetrads hold in the covariance matrix of {X1 ,Y1 ,Y2 ,Y3 }, by Lemma 9 the choke point
that entails these constraints d-separates the elements of {X1 ,Y1 ,Y2 ,Y3 }. The choke point should
be in the trek X1 ← L → Y1 , and since it cannot be an observed node because by hypothesis no
d-separation conditioned on a single node holds among elements of {X1 ,Y1 ,Y2 ,Y3 }, L has to be a
latent choke point for all pairs of pairs in {X1 ,Y1 ,Y2 ,Y3 }.
It is also given that {τX1Y2 X2 X3 , τX1Y2 X3 X2 , τX1Y1Y2Y3 , τX1Y1Y3Y2 } holds. Since it is the case that ¬τX1 X2Y2Y3 ,
by Lemma 10 X1 and Y2 cannot share a parent. Let TML be a trek connecting some parent M of Y2
and L. Such a trek exists because ρX1Y2 6= 0.
We will show by contradiction that there is no node in TML \L that is connected to Y3 by a trek
that does not go through L. Suppose there is such a node, and call it V . If the trek connecting V and
Y3 is into V , and since V is not a collider in TML , then V is either an ancestor of M or an ancestor
of L. If V is an ancestor of M, then there will be a trek connecting Y2 and Y3 that is not through L,
which is a contradiction. If V is an ancestor of L but not M, then both Y2 and Y3 are d-connected
to a node V is a collider at the intersection of such d-connecting treks. However, V is an ancestor
of L, which means L cannot d-separate Y2 and Y3 , a contradiction. Finally, if the trek connecting V
and Y3 is out of V , then Y2 and Y3 will be connected by a trek that does not include L, which again is
not allowed. We therefore showed there is no node with the properties of V . This configuration is
illustrated by Figure 26(a).
Since all three tetrads hold among elements of {X1 , X2 , X3 ,Y2 }, then by Lemma 9, there is a
single choke point P that entails such tetrads and d-separates elements of this set. Since TML is a
trek connecting Y2 to X1 through L, then there are three possible locations for P in G:
Case 1: P = M. We have all treks between X3 and X2 go through M but not through L, and
some trek from X1 to Y3 goes through L but not through M. No choke point can exist for pairs
{X1 , X3 }×{X2 ,Y3 }, which by the Tetrad Representation Theorem means that the tetrad σX1Y3 σX2 X3 =
σX1 X2 σY3 X3 cannot hold, contrary to our hypothesis.
Case 2: P lies between M and L in TML . This configuration is illustrated by Figure 26(b). As before,
no choke point exists for pairs {X1 , X3 } × {X2 ,Y3 }, contrary to our hypothesis.
Case 3: P = L. Because all three tetrads hold in {X1 , X2 , X3 ,Y3 } and L d-separates all pairs in
{X1 , X2 , X3 }, one can verify that L d-separates all pairs in {X1 , X2 , X3 ,Y3 }. This will imply a {X1 ,Y3 }×
{X2 ,Y2 } choke point, contrary to our hypothesis.
Theorem 14 The output of F IND PATTERN is a measurement pattern with respect to the tetrad and
vanishing partial correlation constraints of Σ
Proof: Two nodes will not share a common latent parent in a measurement pattern if and only if
they are not linked by an edge in graph C constructed by algorithm F IND PATTERN and that happens
if and only if some partial correlation vanishes or if any of rules CS1, CS2 or CS3 applies. But
then by Lemmas 10, 11, 12 and the equivalence of vanishing partial correlations and conditional
independence in linearly faithful distributions (Spirtes et al., 2000) the claim is proved. The claim
241
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
about undirected edges follows from Lemma 13.
Theorem 15 Given a covariance matrix Σ assumed to be generated from a linear latent variable
model G(O) with latent variables L, let Gout be the output of B UILD P URE C LUSTERS(Σ) with
observed variables Oout ⊆ O and latent variables Lout . Then Gout is a measurement pattern, and
there is an injective mapping M : Lout → L with the following properties:
1. Let Lout ∈ Lout . Let X be the children of Lout in Gout . Then M(Lout ) d-separates any element
X ∈ X from Oout \X in G;
2. M(Lout ) d-separates X from every latent in G for which M −1 (.) exists;
3. Let O′ ⊆ Oout be such that each pair in O′ is correlated. At most one element in O′ with latent
parent Lout in Gout is not a descendant of M(Lout ) in G, or has a hidden common cause with
it;
Proof: We will start by showing that for each cluster Cli in Gout , there exists an unique latent Li in
G that d-separates all elements of Cli . This shows the existance of an unique function from latents
in Gout to latents in G. We then proceed to prove the three claims given in the theorem, and finish
by proving that the given function is injective.
Let Cli be a cluster in a non-empty Gout . Cli has three elements X,Y and Z, and there is at least
some W in Gout such that all three tetrad constraints hold in the covariance matrix of {W, X,Y, Z},
where no pair of elements in {X,Y, Z} is marginally d-separated or d-separated by an observable
variable. By Lemma 9, it follows that there is an unique latent Li d-separating X, Y and Z. If Cli
has more than three elements, it follows that since no node other than Li can d-separate all three
elements in {X,Y, Z}, and any choke point for {W ′ , X,Y, Z}, W ′ ∈ Cli , will d-separate all elements
in {W ′ , X,Y, Z}, then there is an unique latent Li d-separating all elements in Cli . An analogous
argument concerns the d-separation of any element of Cli and observed nodes in other clusters.
Now we will show that each Li d-separates each X in Cli from all other mapped latents. As a
byproduct, we will also show the validity of the third claim of the theorem. Consider {Y, Z}, two
other elements of Cli besides X, and {A, B,C}, three elements of Cl j . Since Li and L j each d-separate
all pairs in {X,Y } × {A, B}, and no pair in {X,Y } × {A, B} has both of its elements connected to
Li (L j ) through a trek that is into Li (L j ) (since Li , or L j , d-separates then), then both Li and L j are
choke points for {X,Y } × {A, B}. According to Lemma 2.5 given by Shafer et al. (1993), any trek
connecting an element from {X,Y } to an element in {A, B} passes through both choke points in the
same order. Without loss of generality, assume the order is first Li , then L j .
If there is no trek connecting X to Li that is into Li , then Li d-separates X and L j . The same
holds for L j and A with respect to Li . If there is a trek T connecting X and Li that is into Li , and
since all three tetrad constraints hold in the covariance matrix of {X,Y, Z, A} by construction, then
there is no trek connecting A and Li that is into Li (Lemma 9). Since there are treks connecting Li
and L j , they should be all out of Li and into L j . This means that Li d-separates X and L j . But this
also creates a trek connecting X and L j that is into L j . Since all three tetrad constraints hold in the
covariance matrix of {X, A, B,C} by construction, then there is no trek connecting A and L j that is
into L j (by the d-separation implied by Lemma 9). This means that L j d-separates A from Li . This
also means that the existance of such a trek T out of X and into Li forbids the existance of any trek
connecting a variable correlated to X that is into Li (since all treks connecting Li and some L j are
out of Li ), which proves the third claim of the theorem.
242
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
We will conclude by showing that given two clusters Cli and Cl j with respective latents Li and
L j , where each cluster is of size at least three, if they are not merged, then Li 6= L j . That is, the
mapping from latents in Gout to latents in G, as defined at the beginning of the proof, is injective.
Assume Li = L j . We will show that these clusters will be merged by the algorithm, proving the
counterpositive argument. Let X and Y be elements of Cli and W , Z elements of Cl j . It immediately
follows that Li is a choke point for all pairs in {W, X,Y, Z}, since Li d-separates any pair of elements
of {W, X,Y, Z}, which means all three tetrads will hold in the covariance matrix of any subset of
size four from Cli ∪Cl j . These two clusters will then be merged by B UILD P URE C LUSTERS.
Theorem 16 Given a covariance matrix Σ assumed to be generated from a linear latent variable
model G(O) with latent variables L, let Gout be the output of B UILD P URE C LUSTERS(Σ) with observed variables Oout ⊆ O and latent variables Lout . Let M(Lout ) ⊆ L be the set of latents in G
obtained by the mapping function M(). Let ΣOout be the population covariance matrix of Oout , i.e.,
the corresponding marginal of Σ. Let the DAG Gaug
out be Gout augmented by connecting the elements
aug
of Lout such that the structural model of Gout is an I-map of the distribution of M(Lout ). Then there
exists a linear latent variable model using Gaug
out as the graphical structure such that the implied
covariance matrix of Oout equals ΣOout .
Proof: If a linear model is an I-map DAG of the true distribution of its variables, then there is a wellknown natural instantiation of the parameters of this model that will represent the true covariance
matrix (Spirtes et al., 2000). We will assume such parametrization for the structural model, and
denote as ΣL (Θ) the parameterized latent covariance matrix. Instead of showing that Gaug
out is an
I-map of the respective set of latents and observed variables and using the same argument, we will
show a valid instantion of its parameters directly.
Assume without loss of generality that all variables have zero mean. To each observed node X
with latent ancestor LX in G such that M −1 (LX ) is a parent of X in Gout , the linear model representation is:
X = λX LX + εX
For this equation, we have two associated parameters, λX and σ2εX , where σ2εX is the variance
of εX . We instantiate them by the linear regression values, i.e., λX = σXLX /σ2LX , and σ2εX is the
respective residual variance. The set {λX } ∪ {σ2εX } of all λX and σ2εX , along with the parameters
used in ΣL (Θ), is our full set of parameters Θ.
Our definition of linear latent variable model requires σεX εY = 0, σεX LX = 0 and σεX LY = 0, for all
X 6= Y . This corresponds to a covariance matrix Σ(Θ) of the observed variables with entries defined
as:
E[X 2 ](Θ) = σ2X (Θ) = λ2X σ2LX + σ2εX
E[XY ](Θ) = σXY (Θ) = λX λT σLX LY
To prove the theorem, we have to show that ΣOout = Σ(Θ) by showing that correlations between
different residuals, and residuals and latent variables, are actually zero.
The relation σεX LX = 0 follows directly from the fact that λX is defined by the regression coefficient of X on LX . Notice that if X and LX do not have a common ancestor, λX is the direct effect
243
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
of LX in X with respect to Gout . As we know, by Theorem 15, at most one variable in any set of
correlated variables will not fulfill this condition.
We have to show also that σXY = σXY (Θ) for any pair X,Y in Gout . Residuals εX and εY are
uncorrelated due to the fact that X and Y are independent given their latent ancestors in Gout , and
therefore σεX εY = 0. To verify that σεX LY = 0 is less straightforward, but one can appeal to the
graphical formulation of the problem. In a linear model, the residual εX is a function only of the
variables that are not independent of X given LX . None of this variables can be nodes in Gout , since
LX d-separates X from all such variables. Therefore, given LX none of the variables that define εX
can be dependent on LY , implying σεX LY = 0.
3
Theorem 17 Problem M P is NP-complete.
Proof: Direct reduction from the 3-SAT problem: let S be a 3-CNF formula from which we want to
decide if there is an assignment for its variables that makes the expression true. Define G as a latent
variable graph with a latent node Li for each clause Ci in M, with an arbitrary fully connected structural model. For each latent in G, add five pure children. Choose three arbitrary children of each
q
q
latent Li , naming them {Ci1 ,Ci2 ,Ci3 }. Add a bi-directed edge Cip ↔ C j for each pair Cip ,C j , i 6= j, if
and only that they represent literals over the same variable but of opposite values. As in the maximum clique problem, one can verify that there is a pure submodel of G with at least three indicators
per latent if and only if S is satisfiable.
The next corollary suggests that even an invalid measurement pattern could be used in B UILD P URE C LUSTERS instead of the output of F IND PATTERN. However, an arbitrary (invalid) measurement pattern is unlikely to be informative at all after being purified. In constrast, F IND PATTERN
can be highly informative.
Corollary 18 The output of B UILD P URE C LUSTERS retains its guarantees even when rules CS1,
CS2 and CS3 are applied an arbitrary number of times in F IND PATTERN for any arbitrary subset
of nodes and an arbitrary number of maximal cliques is found.
Proof: Independently of the choice made on Step 2 of B UILD P URE C LUSTERS and which nodes
are not separated into different cliques in F IND PATTERN, the exhaustive verification of tetrad constraints by B UILD P URE C LUSTERS provides all the necessary conditions for the proof of Theorem
15.
Corollary 20 Given a covariance matrix Σ assumed to be generated from a linear latent variable
model G, and Gout the output of B UILD P URE C LUSTERS given Σ, the output of PC-MIMB UILD or
FCI-MIMB UILD given (Σ, Gout ) returns the correct Markov equivalence class of the latents in G
corresponding to latents in Gout according to the mapping implicit in B UILD P URE C LUSTERS
Proof: By Theorem 15, each observed variable is d-separated from all other variables in Gout given
its latent parent. By Theorem 16, one can parameterize Gout as a linear model such that the observed covariance matrix as a function of the parameterized Gout equals its corresponding marginal
of Σ. By Theorem 19, the rank test using the measurement model of Gout is therefore a consistent
independence test of latent variables. The rest follows immediately from the consistency property
244
L EARNING THE S TRUCTURE OF L INEAR L ATENT VARIABLE M ODELS
of PC and FCI given a valid oracle for conditional independencies.
References
H. Attias. Independent factor analysis. Graphical Models: foundations of neural computation,
pages 207–257, 1999.
F. Bach and M. Jordan. Beyond independent components: trees and clusters. Journal of Machine
Learning Research, 4:1205–1233, 2003.
D. Bartholomew and M. Knott. Latent Variable Models and Factor Analysis. Arnold Publishers,
1999.
D. Bartholomew, F. Steele, I. Moustaki, and J. Galbraith. The Analysis and Interpretation of Multivariate Data for Social Scientists. Arnold Publishers, 2002.
K. Bollen. Structural Equation Models with Latent Variables. John Wiley & Sons, 1989.
K. Bollen. Outlier screening and a distribution-free test for vanishing tetrads. Sociological Methods
and Research, 19:80–92, 1990.
D. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning
Research, 3:507–554, 2002.
G. Elidan, N. Lotner, N. Friedman, and D. Koller. Discovering hidden variables: a structure-based
approach. Neural Information Processing Systems, 13:479–485, 2000.
N. Friedman. The Bayesian structural EM algorithm. Proceedings of 14th Conference on Uncertainty in Artificial Intelligence, 1998.
D. Geiger and C. Meek. Quantifier elimination for statistical problems. Proceedings of 15th Conference on Uncertainty in Artificial Intelligence, 1999.
C. Glymour. Social statistics and genuine inquiry: reflections on the bell curve. Intelligence, Genes
and Sucess: Scientists Respond to The Bell Curve, 1997.
C. Glymour. The Mind’s Arrow: Bayes Nets and Graphical Causal Models in Psychology. MIT
Press, 2002.
C. Glymour, Richard Scheines, Peter Spirtes, and Kevin Kelly. Discovering Causal Structure:
Artificial Intelligence, Philosophy of Science, and Statistical Modeling. Academic Press, 1987.
Y. Kano and A. Harada. Stepwise variable selection in factor analysis. Psychometrika, 65:7–22,
2000.
J. Loehlin. Latent Variable Models: An Introduction to Factor, Path and Structural Equation
Analysis. Lawrence Erlbaum, 2004.
C. Meek. Graphical Models: Selecting Causal and Statistical Models. PhD Thesis, Carnegie
Mellon University, 1997.
245
S ILVA , S CHEINES , G LYMOUR AND S PIRTES
J. Pearl. Probabilistic Reasoning in Expert Systems: Networks of Plausible Inference. Morgan
Kaufmann, 1988.
J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.
T. Richardson. A discovery algorithm for directed cyclic graphs. Proceedings of 12th Conference
on Uncertainty in Artificial Intelligence, 1996.
G. Shafer, A. Kogan, and P.Spirtes. Generalization of the tetrad representation theorem. DIMACS
Technical Report, 1993.
R. Silva. Automatic discovery of latent variable models. PhD Thesis, Carnegie Mellon University,
http://www.cs.cmu/edu/˜ rbas, 2005.
R. Silva and R. Scheines. Generalized measurement models. Technical Report CMU-CALD-04-101,
Carnegie Mellon University, 2004.
R. Silva and R. Scheines. New d-separation identification results for learning continuous latent
variable models. Proceedings of the 22nd Interational Conference in Machine Learning, 2005.
R. Silva, R. Scheines, C. Glymour, and P. Spirtes. Learning measurement models for unobserved
variables. Proceedings of 19th Conference on Uncertainty in Artificial Intelligence, pages 543–
550, 2003.
P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. Cambridge University
Press, 2000.
J. Wishart. Sampling errors in the theory of two factors. British Journal of Psychology, 19:180–187,
1928.
N. Zhang. Hierarchical latent class models for cluster analysis. Journal of Machine Learning
Research, 5:697–723, 2004.
246