Bayesian Psychometric Modeling
Chapman & Hall/CRC
Statistics in the Social and Behavioral Sciences Series
Series Editors
Jeff Gill, Washington University, USA
Steven Heeringa, University of Michigan, USA
Tom Snijders, Oxford University, UK, and University of Groningen, NL
Large and complex datasets are becoming prevalent in the social and behavioral
sciences, and statistical methods are crucial for the analysis and interpretation of such
data. This series aims to capture new developments in statistical methodology with
particular relevance to applications in the social and behavioral sciences. It seeks to
promote appropriate use of statistical, econometric and psychometric methods in
these applied sciences by publishing a broad range of reference works, textbooks and
handbooks.
Bayesian Psychometric Modeling
Roy Levy
Arizona State University
Tempe, Arizona, USA
Robert J. Mislevy
Educational Testing Service
Princeton, New Jersey, USA
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2016 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the valid-
ity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright
holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may
rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or uti-
lized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopy-
ing, microfilming, and recording, or in any information storage or retrieval system, without written permission from the
publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://
www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For
organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
To Paige, who continues to make all that I do possible, and to Ella, Adrian,
To Robbie, still my best friend after years of this kind of thing, and to Jessica and Meredith,
who must not think this kind of thing is so bad because they do it too.
Bob
Contents
Section I Foundations
Section II Psychometrics
Preface
itself in the subsequent presentations of factor analysis, item response theory, Bayes nets,
and other psychometric models that are historically seen as more different than we think
they ought to be.
properties of examinees and observational situations. (For familiar tests, e.g., the observable
variables in classical test theory and often in factor analysis are test scores; in item response
theory and latent class analysis, the observables are item scores, which usually correspond
one-to-one to test items.) We use bold typeface to represent collections, vectors, or matrices, with
subscripts indicating what the collection is specific to. For example, xij refers to the data from
examinee i for observable j, xi = (xi1, …, xiJ) is the collection of J such observables that comprise
the data for examinee i, and x = (x1, …, xn) is the full collection of data from n examinees.
When using a standard distribution, we also use notation based on its name to com-
municate as much. For example, if x has a normal distribution with mean μ and variance
σ², we write x ~ N(μ, σ²) or p(x) = N(μ, σ²). In some cases, we write the arguments of the
distribution with subscripts to indicate what the distribution refers to. We routinely highlight
the dependence structures of random variables, for example, writing the preceding
as x | μ, σ² ~ N(μ, σ²) and p(x | μ, σ²) = N(μ, σ²), respectively, to highlight that what is on the
left side of the conditioning bar depends on what is on the right side of the conditioning
bar. The entities conditioned on may be other random variables or they may be fixed quan-
tities, such as values known or chosen by the analyst.
We have attempted to use notational schemes that are conventional in the different areas
of statistics and psychometrics. Rather than advance a general notation that cuts across
these areas, we have attempted to use notational schemes that are commonly found in each
tradition, that is, we use typical regression notation in Chapter 6, factor analysis notation
in Chapter 9, item response theory notation in Chapter 11, and so on. Table 7.2 summarizes
the notation used in the psychometric models that comprise the second part of the book.
As a consequence, several notational elements are reused. For example, α is used as
a parameter in the beta and Dirichlet distributions, as the acceptance probability in
Metropolis and Metropolis-Hastings samplers, and as a transformed discrimination parameter in
item response theory. We hope that the intended interpretation is clear from the context.
A particularly acute challenge arises with the use of θ, which is commonly used to refer
to a generic parameter in statistical texts, and a latent variable (or person parameter) in
certain psychometric traditions. We attempt to participate in both traditions. We use θ to
refer to a generic parameter as we develop core Bayesian ideas in the first part of the book.
When we pivot to psychometrics beginning with Chapter 7, we use θ to refer to a latent
variable, and then use it as such in several of the subsequent chapters. Again, we hope that
the intended interpretation is clear from the context.
In addition, we use the Netica software package (Norsys, 1999-2012) to illustrate the
use of Bayesian networks in Chapter 14. In preparing the examples, we have also made
use of the R statistical programming environment (R Core Team, 2014) for certain analy-
ses, including several packages: BaM (Gill, 2012), coda (Plummer, Best, Cowles, & Vines,
2006), R2WinBUGS (Sturtz, Ligges, & Gelman, 2005), MCMCpack (Martin, Quinn, & Park,
2011), mcmcplots (Curtis, 2015), poLCA (Linzer & Lewis, 2011), pscl (Jackman, 2014), and
RNetica (Almond, 2013), as well as the code given by Nadarajah and Kotz (2006), and code
of our own writing.
Throughout the book, we present WinBUGS code in this font, commenting on fea-
tures of the code pertinent to the discussion. We used WinBUGS for our analyses, but the
code presented here should work with minimal modification in other versions of BUGS
(Lunn, Jackson, Best, Thomas, & Spiegelhalter, 2012) and the closely related JAGS software
(Plummer, 2003).
Our use of these software packages should not be taken as an endorsement of them
over others. Happily, there are many programs that can be used for fitting Bayesian psy-
chometric and statistical models, including some such as WinBUGS that are intended to
be general and so require the user to express the model, and others that have predefined
models built in.
Online Resources
Additional materials including the datasets, WinBUGS code, R code, and Netica files used
in the examples, and any errors found after the book goes to press, are available on the
book's website, www.bayespsychometrics.com. Included there is our current contact information,
where you can send us any comments.
Roy Levy
Arizona State University, Tempe, Arizona
Robert J. Mislevy
Educational Testing Service, Princeton, New Jersey
Acknowledgments
Our thinking on matters discussed in the book has been influenced by a number of schol-
ars, and these influences manifest themselves in different ways in the book. Some of those
have been direct collaborations, and several of our examples have appeared in articles and
books written with these colleagues. When introducing such examples, we cite our prior
articles and books with these coauthors. Some of Roy's work on Bayesian psychometrics
emerged from projects funded by Cisco, the National Center for Research on Evaluation,
Standards, & Student Testing (CRESST) at the University of California at Los Angeles, the
Institute of Education Sciences (IES), and Pearson. In addition to Cisco and CRESST, Bob's
work was also supported by grants from the Office of Naval Research and the Spencer
Foundation. We are grateful to John Behrens and Kristen DiCerbo for their support during
their time leading research efforts at Cisco and now at Pearson. We thank Dennis Frezzo,
Barbara Termaat, and Telethia Willis for their support while leading research efforts at
Cisco. We also thank Eva Baker and Greg Chung of CRESST for their support, and Allen
Ruby and Phill Gagné, Program Officers for IES's Statistical and Research Methodology
in Education program. The findings and opinions expressed in this book are those of the
authors and do not represent views of the IES or the US Department of Education.
Other interactions, projects, and conversations have shaped our thinking and influenced
this book in more subtle but pervasive ways. We are also grateful to a number of col-
leagues who have supported us in more general ways. We thank Russell Almond, John
Behrens, Darrell Bock, Jaehwa Choi, Kristen DiCerbo, Joanna Gorin, Sam Green, Greg
Hancock, Geneva Haertel, Charlie Lewis, Andre Rupp, Sandip Sinharay, Linda Steinberg,
Marilyn Thompson, David Williamson, and Duanli Yan for their collaboration, insight,
and collegiality.
We also thank the students, participants, and institutions that have hosted our courses
and workshops on Bayesian methods at Arizona State University, the University of
Maryland, the University of Miami, and meetings of the National Council on Measurement
in Education. To the students and participants, this book was written with you in mind.
We hope you didn't mind us piloting some of these materials on you.
We thank Arizona State University for granting Roy a sabbatical to work on this book,
and ETS for supporting Bob's work as the Frederic M. Lord Chair in Measurement and
Statistics.
We are indebted to a number of colleagues who read and provided critiques of drafts of
the chapters. The feedback provided by Jaehwa Choi, Katherine Castellano, Jean-Paul Fox,
Shelby Haberman, and Dubravka Svetina was invaluable, and we wish to thank them all
for doing us this great service. Of course, they are not responsible for any shortcomings
or problems with the book; that resides with the authors. We thank Kim Fryer for managing
the reviewing and editing process at ETS. We are indebted to Rob Calver, our editor at
Chapman & Hall/CRC Press, for his support, encouragement, and patience. We also thank
Kari Budyk, Alex Edwards, Sarah Gelson, Rachel Holt, and Saf Khan of Chapman & Hall/
CRC Press for their assistance.
Roy's interests in Bayesian psychometrics began when he was Bob's student at the
University of Maryland, and it was there over a decade ago that the first inklings of what
would become this book took shape. The collaboration on this book is but the latest in
professional and intellectual debts that Roy owes to Bob. For his mentorship, support,
friendship, and influence over all these years, in too many ways to recount here, Roy
cannot thank Bob enough. Bob, in turn, cannot express the deep satisfaction in playing
some role in the experiences that have shaped a leading voice in a new generation of
psychometrics, not to mention a treasured friend and well-matched collaborator.
We close with thanks to those closest to us, who may be the happiest to see this book
completed. Roy wishes to thank Paige, whose support is only outdone by her patience. Bob
thanks Robbie, also noting patience and support, spiced with tenacity and humor.
Section I
Foundations
The first part of the book lays out foundational material that will be leveraged in the
second part of the book. In Chapter 1, we set out background material on assessment,
psychometrics, probability, and model-based reasoning, setting the stage for develop-
ments to come. Chapters 2 and 3 provide an introduction to Bayesian inference and treat
Bernoulli and binomial models. Whereas Chapter 2 focuses on the machinery of Bayesian
inference, Chapter 3 delves into more conceptual issues. Chapter 4 reviews models for
normal distributions. Chapter 5 discusses strategies for estimating posterior distributions,
with an emphasis on Markov chain Monte Carlo methods used in the remainder of the
book. Chapter 6 treats basic regression models. The second part of the book will draw from
and build off Chapters 2 through 6. Our treatment of the material in Chapters 2 through 6 is
somewhat cursory, as our goals here are to cover things only at a depth necessary for us to
exploit them in our treatment of psychometric models later in the book. Importantly, each
of these topics could be treated in more depth, and most introductory Bayesian texts do so;
excellent accounts may be found in Bernardo and Smith (2000), Congdon (2006), Gelman
et al. (2013), Gill (2007), Jackman (2009), Kaplan (2014), Kruschke (2011), Lynch (2007), and
Marin and Robert (2007).
1
Overview of Assessment and Psychometric Modeling
* See Jones and Thissen (2007) for a historical account of psychometrics and alternative perspectives on its
origins.
Apologies to Glenn Shafer, quoted in Section 1.2.2.
Schum's (1987, 1994) account of evidentiary reasoning as reasoning from what is known to
what is unknown in the form of explanations, predictions, or conclusions. In the remain-
der of this chapter, we provide an overview of this perspective and how it pertains to psy-
chometrics; see Mislevy (1994) for a more thorough account. Much of the rest of this book
may be seen as elaborating on this overview, providing the mechanics of how to accom-
plish these activities, and discussing the concepts and implications of this perspective.
Observed data become evidence when they are deemed relevant for a desired infer-
ence through establishing relations between the data and the inferential target. We often
employ data from multiple sources of information to serve as evidence. These may be
of similar type (e.g., test questions with the same format targeting the same proficiency,
testimony from multiple witnesses) or of quite different type (e.g., an applicant's resume
in addition to her interview, a patient's family medical history and a survey of hazardous
materials in her home, crime scene photos in addition to witness testimony). Evidence may
be contradictory (e.g., conflicting testimony from witnesses, a student succeeds at a hard
task but fails at an easy one), and almost always falls short of being perfectly conclusive.
These features have two implications. First, properly marshaling and understanding the
implications of the evidence is difficult. Inference is a messy business. Second, owing to
the inconclusive nature of the evidence, we are necessarily uncertain about our inferences.
To begin to address these, the remainder of this section describes tools that aid in repre-
senting the act of inference and uncertainty.
In evidentiary reasoning, an argument is constructed to ground inferences. Toulmin
(1958) diagrams offer a visual representation of the structure of evidentiary arguments, and
we present versions of them that align with inference in assessment (Mislevy, Steinberg, &
Almond, 2003). Figure 1.1 depicts a generic diagram; Figure 1.2 depicts a simplified example
from educational assessment (a one-item test, with a highly informative item!),* where
FIGURE 1.1
Toulmin diagram for the structure of an argument. The flow of reasoning is from Data (D) to a Claim (C),
through a Warrant (W). The Warrant is a generalization that flows the other way: If a Claim holds, certain
kinds of Data usually follow. Backing (B) is empirical and theoretical support for the Warrant. It may not
hold in a particular case, for reasons expressed as alternative explanations (A), which may be supported or
weakened by Rebuttal (R) data. (Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of
educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1, 3-62, figure 2. With permission
of CRESST.)
* One-item tests are not unheard of historically, even in high-stakes circumstances (Wainer, 2000). However, we
typically do not have any one item that is so informative as to be conclusive about an examinee. As a result of
these and other considerations, assessments are typically constructed using multiple items, as developed in
Section 1.3.
Claim: Elizabeth is proficient in subtraction.
Warrant: Students proficient in subtraction will answer this item correctly; students not proficient in subtraction will answer this item incorrectly.
Alternative explanation: Elizabeth answered this item correctly by guessing.
FIGURE 1.2
Example Toulmin diagram for the structure of the assessment argument. This example shows reasoning from
Elizabeth's correct response to a claim about her proficiency.
C → W → D.
The flow from the claim to data along the direction of the warrant is deductive, in that it pro-
ceeds from the general to the particular. In this example, it is from a claim about what people
like Elizabeth, both those who are proficient and those who are not, typically do, to Elizabeth's
behavior on a particular item. When particular data are observed we must reason from these
data to a particular claim, back up through the warrant in the opposite direction.
Warrants have varying degrees of strength, reflecting the strength of the relationship
between the claims and the data. We begin with an extreme case of formal syllogisms,
which are powerful warrants (Jaynes, 2003; Mislevy et al., 2003). We will bring uncertainty
and multiple items into the picture shortly, but an initial error-free example illustrates the
basic logic and structure. In this special case, the first component of the warrant is the
major premise (P1), the claim is the minor premise (P2), and the data follow inevitably:
P1 W1: Students proficient in subtraction will answer this item correctly.
P2 C: Elizabeth is proficient in subtraction.
Dp: Elizabeth will answer this item correctly.
The notation of Dp indicates that the conclusion of this argument is framed as a predictive
or prospective account of the data, rather than the observed value. In general, deductive
arguments are well equipped to reach prospective conclusions. Given the general state-
ment about Elizabeths proficiency and the warrant, we can make predictions or inferences
about what will happen in particular instances.
Importantly, the directional flow of the inference as it actually plays out is in the opposite
direction from how the warrant is laid out here. We do not start with a premise about stu-
dent capabilities and end up with a prospective statement about what the student will do.
Rather, we start with observations of the students behaviors and seek conclusions about
their capabilities. That is, we would like to reverse the directional flow of the warrant and
reason inductively from the particular observations of Elizabeth's behaviors to reach conclusions
about her proficiency broadly construed. That is, we would like to reason as follows:
P1 W2: Students not proficient in subtraction will answer this item incorrectly.
P2 D: Elizabeth correctly answered this item.
C: It is not the case that Elizabeth is not proficient in subtraction
(i.e.,Elizabeth is proficient in subtraction).
Figure 1.3 depicts another example, where we reason from observing an incorrect response
from another student, Frank, to a claim about his proficiency. This example is similar to
that in Figure 1.2, but has different claims and explanations as it is based on different data,
namely, Frank providing an incorrect response. Note that the warrant and the backing are
the same regardless of whether the student provides a correct or incorrect answer. The
warrant is constructed to provide an inferential path regardless of the statuses of the par-
ticular claim and data. The warrant spells out expected behavior given all possible values
of proficiency. In the simple example here, a student is either proficient or is not proficient,
and the warrant specifies what is expected under both these possibilities. Note that the
warrant explicitly links different behaviors with different states of proficiency. This dif-
ferential association embodies the psychometric concept of discrimination, and is crucial
to conducting inference.
Claim: Frank is not proficient in subtraction.
Warrant: Students proficient in subtraction will answer this item correctly; students not proficient in subtraction will answer this item incorrectly.
Alternative explanation: Frank answered this item incorrectly because he didn't try.
Rebuttal data: The stakes for Frank are low; Frank answered the item very quickly; Frank has said he's focused on starring in the school play that night; Frank performed well on other subtraction items in other contexts.
Backing: Cognitive and empirical studies show the strong positive relationships between tests with this and similar items and conceptions or applications of subtraction proficiency.
Data: Frank answered the subtraction item incorrectly.
FIGURE 1.3
Example Toulmin diagram for the structure of the assessment argument. This example shows reasoning from
Frank's incorrect response to a claim about his lack of proficiency.
TABLE 1.1
A Deterministic Warrant Connecting a Student's Proficiency Status with Her Response to an Item

                  Response
Proficiency       Correct    Incorrect
Proficient        Yes        No
Not Proficient    No         Yes

Note: Rows correspond to possible proficiency states; columns correspond to possible outcomes. Reading across a row gives the response outcome conditional on that proficiency status.
The warrant is summarized in Table 1.1, which is akin to a truth table for the proposi-
tions that the student is proficient and the response is correct. The deductive flow of the
argument proceeds from the students status in terms of proficiency to her response, con-
tained in the rows of the table. The inductive flow proceeds from the students response to
her status in terms of proficiency, contained in the columns of the table.
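The two directions of reading the table can be made concrete with a small Python sketch of our own (the function and variable names are hypothetical; the book itself uses no such code). The deterministic warrant is a lookup table, read forward for deduction and backward for induction:

```python
# Table 1.1 as a deterministic warrant: each proficiency status maps to
# exactly one response outcome.
WARRANT = {
    "proficient": "correct",
    "not proficient": "incorrect",
}

def deduce(proficiency):
    # Deductive flow (rows of Table 1.1): from status to response.
    return WARRANT[proficiency]

def induce(response):
    # Inductive flow (columns of Table 1.1): from response back to status.
    # In this error-free special case the reversal is unambiguous, because
    # each response outcome is linked to exactly one proficiency status.
    matches = [status for status, r in WARRANT.items() if r == response]
    assert len(matches) == 1
    return matches[0]
```

For example, `deduce("proficient")` yields `"correct"`, and `induce("correct")` yields `"proficient"`; the reversal only works so cleanly because the warrant is deterministic, which is exactly what the next paragraphs show to be unrealistic.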
In real-world situations, however, inferences are not so simple. There are almost always
alternative explanations for the data, which in turn require their own support in the form
of rebuttal observations or theory. In educational assessment, alternative explanations
are ever-present, a point that has been articulated in a variety of ways over the years. Rasch
(1960, p. 73) warned that
Even if we know a person to be very capable, we cannot be sure that he will solve a
difficult problem, nor even a much easier one. There is always the possibility that he
fails; he may be tired or his attention led astray, or some other excuse may be given.
In the context of discussing college entrance exams, Lord and Novick (1968, p. 30) stated that
Most students taking . . . examinations are convinced that how they do on a particular
day depends to some extent on how they feel that day. A student who receives scores
which he considers surprisingly low often attributes this unfortunate circumstance to a
physical or psychological indisposition or to some more serious temporary state of affairs
not related to the fact that he is taking the test that day.
Returning to the example in Figure 1.2, one alternative explanation is that Elizabeth
correctly answered the item because she guessed, which is supported by a theoretical
understanding of the response process on selected-response (multiple-choice) items,
perhaps as well as other empirical evidence, say, how quickly she answered, or her per-
formance on other related tasks. Still other possible explanations exist (e.g., that she
cheated), which require rebuttal observations of their own (e.g., Elizabeth gave the same
answer as another examinee positioned nearby, and answered much easier subtraction
items incorrectly). In the second example in Figure 1.3, one alternative explanation is
that Frank did not try to answer the item correctly, which is supported by recognizing
that the stakes are low for Frank and that his attention is focused on another upcoming
activity.
As these examples illustrate, each alternative explanation undercuts the application of
the warrant in some way. The explanations that Elizabeth guessed or cheated stand in
opposition to the warrant's assertion that students not proficient in subtraction will answer
the item incorrectly. The explanation that Frank did not try contradicts the warrant's assertion
that students proficient in subtraction will answer the item correctly. Generally speaking,
the possibility of alternative explanations weakens the relationship between the data
and claim. Our evidence is almost always inconclusive, yielding inferences and conclu-
sions that are qualified, contingent, and uncertain (Schum, 1994).
In assessment, we are usually not interested in an examinees performance on a test in
its own right, as it is in some sense contaminated with the idiosyncrasies of performance
at that instance (i.e., on that day, at that time, under those conditions, with these particular
tasks, etc.). Rather, we are interested in a broader conception of the examinee, in terms
of what they can do in a larger class of situations, or in terms of their capabilities more
broadly construed. In an ideal case, variation in performance is due solely to variation
in these more broadly construed capabilities. Rarely, if ever, do we have such a circum-
stance. Performance is driven not just by the broader capabilities, but also by a variety of
other factors. The implication is that we must recognize that a test, no matter how well
constructed, is an imperfect measure of what is ultimately of interest and what we would
really like to know. A century ago, this disconnect between the observed performance
and the more broadly construed capabilities that are of interest began to be formalized
into the encompassing term measurement error.
Reasoning in assessment relies on the relationship between the examinees capabilities
and what we observe them say, do, or produce, which is weakened by the presence of
measurement error. Table1.1 simply will not do, as it ignores the possibility of measure-
ment error and its resulting disconnect between the proficiency of inferential interest and
the observed data used as evidence. In his review of the field of Measurement of Learning
and Mental Abilities at the 25th anniversary of the Psychometric Society in 1961, Gulliksen
described the central problem of test theory as "the relation between the ability of the
individual and his [or her] observed score on the test" (Gulliksen, 1961, p. 101). Such a
characterization holds up well today, with a suitably broad definition of "ability" and "test
score" to encapsulate the wide variety of assessment settings. Understanding and rep-
resenting the disconnect between observed performance and the broader conception of
capability is central to reasoning in assessment.
As a result of measurement error, our reasoning in assessmentin educational contexts,
reasoning from what we observe students say, do, or produce to what they know or can do
in other situations or as more broadly construedis an instance of reasoning under uncer-
tainty. Because of the imperfect nature of our measurement, we are uncertain about our
resulting inference. Reasoning from the limited (what we see a student say, do, or produce)
to the more general (student capabilities broadly construed) is necessarily uncertain, and
our inference or conclusion is necessarily possibly incorrect.
There may be an enormous number of factors eliciting [a student's] specific overt reac-
tions to a stimulus, and, therefore, it is suitable, even necessary, to handle the situation
in terms of the probabilistic relationship between the two.
This view has come to dominate modern psychometrics, and reflects the utility of
employing the language and calculus of probability theory to communicate the imper-
fection of evidence, and the resulting uncertainty in inferences regardless of discipline
(Schum, 1994). Much of this book is devoted to illuminating how adopting a probabi-
listic approach to psychometric modeling aids not only in representing the disconnect
between proficiency and performance, but also in bridging that divide by employing
10 Bayesian Psychometric Modeling
a framework within which we can reason from one to the other, all the while properly
acknowledging the distinction between them.
As discussed in Section 1.1, in deductive reasoning we proceed from the general to the
particular(s), seeing the latter as arising from well-understood relationships between
the two. Deterministic logic is a paradigmatic example of deductive reasoning, but
probabilistic reasoning can be deductive as well. In assessment, deductive reasoning
flows from the unknown or latent variable thought of as the general characterization in
the claim (say, of examinees' capabilities) to the observable variables (OVs) that represent
the particular (say, their behavior on tasks). The possibility of alternative explanations
undermines the strength of the evidentiary linkage between the claim and the
data. In these cases, the warrant no longer assures that a certain outcome follows neces-
sarily from the claim. Rather, deductive reasoning starts with examinee capability as
the given and, for each of the possible values of that capability, proceeds to arrive at the
likely behavior on tasks.
Framing the warrant probabilistically quantifies how likely the behaviors are for an
examinee with particular capabilities. In educational assessment, we employ a probability
model to quantify how likely certain behaviors on tasks are, given the students' proficiencies.
A bit more formally, letting x denote the variable corresponding to the student's
behavior on a task (e.g., response to an item) and θ denote the variable corresponding to
the student's proficiency status, the relationship is formulated as p(x | θ); that is, the probability
distribution of x conditional on the value of θ. Returning to the examples, Table 1.2
illustrates these concepts. The rows in the table represent the probabilities of the different
possible responses for the different possibilities of the student's proficiency status. If we
begin with the premise that Elizabeth is proficient at subtraction, we arrive at the prob-
ability of .70 that she correctly answers the item, and (its complement) the probability of
.30 that she incorrectly answers the item. Likewise for Frank, we begin with the premise
that he is not proficient at subtraction and arrive at the probability of .20 that he correctly
answers the item, and (its complement) the probability of .80 that he incorrectly answers
the item. Setting up the probability model relating the variables as p(x | θ), that is, where
variables corresponding to observed behavior are modeled as stochastically dependent on
variables corresponding to unobservable variables representing broadly conceived capa-
bilities, is ubiquitous throughout different psychometric modeling traditions. In Chapter7,
we explore connections between the construction of archetypical psychometric models
and Bayesian approaches to modeling. For the moment, it is sufficient to recognize that the
flow of the probability model directly supports the deductive reasoning; for example, the
rows of Table1.2 are probability distributions.
TABLE 1.2
A Probabilistic Warrant Connecting a Student's Proficiency
Status with Her Response to an Item

                      Response
Proficiency        Correct    Incorrect
Proficient           .70         .30
Not Proficient       .20         .80

Note: Rows correspond to possible proficiency states; columns correspond to possible outcomes. Reading across a row gives the probabilities for a response outcome conditional on that proficiency status.
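The deductive flow expressed by Table 1.2 can be sketched in a few lines of code; the probabilities come from the table, while the data structure and function name are ours:

```python
# Conditional probability table from Table 1.2: p(x | theta), where theta is
# the proficiency status and x is the item response.
P_X_GIVEN_THETA = {
    "Proficient":     {"Correct": 0.70, "Incorrect": 0.30},
    "Not Proficient": {"Correct": 0.20, "Incorrect": 0.80},
}

def p_response(theta, x):
    """Deductive step: begin with a proficiency status as the given and
    return the probability of a particular response."""
    return P_X_GIVEN_THETA[theta][x]

# Each row of the table is a probability distribution over responses.
for row in P_X_GIVEN_THETA.values():
    assert abs(sum(row.values()) - 1.0) < 1e-12

print(p_response("Proficient", "Correct"))      # 0.7
print(p_response("Not Proficient", "Correct"))  # 0.2
```

Note that the lookup always runs from proficiency to response, mirroring the row-wise reading of the table.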
Overview of Assessment and Psychometric Modeling 11
In inductive reasoning, we reverse this flow of reasoning and proceed from the particular(s)
to the general. In assessment, inductive reasoning starts with examinee behavior on tasks
and proceeds to arrive at likely values of examinee proficiency. How to do so is not as
straightforward as in deductive reasoning; in Table 1.2, we set up the rows as probability
distributions, but we cannot interpret the columns as probability distributions. The war-
rant is framed as a probability model with the flow from examinee proficiency to likely
behaviors. To reverse this flow, we employ probability calculus. In particular, we will
employ Bayes theorem to facilitate this reversal, supporting probability-based reasoning
from particulars to general (i.e., from performances on tasks to proficiency) as well as from
general to particulars (i.e., from proficiency to performances on tasks). But we are getting a
bit ahead of ourselves. Before turning to the mechanics of Bayesian inference in Chapter2,
we lay out more of the necessary groundwork.
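As a small preview of that reversal: given the conditional probabilities in the rows of Table 1.2 and a prior probability of proficiency, Bayes' theorem converts a correct response into a posterior probability of proficiency. The .50/.50 prior below is our assumption for illustration, not a value from the text:

```python
# Rows of Table 1.2: p(correct | theta).
p_correct = {"Proficient": 0.70, "Not Proficient": 0.20}

# Assumed prior p(theta); the .50/.50 split is purely illustrative.
prior = {"Proficient": 0.50, "Not Proficient": 0.50}

# Bayes' theorem: p(theta | correct) = p(correct | theta) p(theta) / p(correct).
joint = {th: p_correct[th] * prior[th] for th in prior}
p_x = sum(joint.values())                  # marginal probability of a correct response
posterior = {th: joint[th] / p_x for th in joint}

print(round(posterior["Proficient"], 3))   # 0.778
```

The normalization by p_x is exactly what the raw columns of Table 1.2 lack, which is why they cannot be read directly as probability distributions.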
[Figure 1.4 depicts representational forms A, B, and C (e.g., y = ax + b and its rearrangement (y − b)/a = x; p(θ|x), p(x|θ), p(θ)), mappings among representational systems, entities and relationships, and the real-world and reconceived real-world situations.]
FIGURE 1.4
Schematic of model-based reasoning. A real-world situation is depicted at the lower left. It is mapped into
the entities and relationships of the model in the narrative space in the middle plane. The lower right is a
reconception of the real-world situation, blending salient features of the situation with the structures the model
supplies. The upper boxes represent symbol systems that models may have to aid reasoning in the model plane.
Mathematical formulations, diagrams, and graphs are examples of symbol systems used in psychometrics.
(With kind permission from Taylor & Francis: Research Papers in Education, Some implications of expertise research for educational assessment. 25, 2010, 253–270, Mislevy, R. J. Figure 8.)
some quickly, some slowly. Only the correctness of her final responses is represented in
the model-space. The model-space also contains a variable for Elizabeth's proficiency, and relationships that indicate that it determines probability distributions for all the responses, relationships that are not apparent in the world.
Narrowing our focus to that of probability models, choices made in specifying such mod-
els rely on a mixture of considerations including beliefs about the problem, theoretical
conjectures, conventions, computational tractability, ease of interpretation, and commu-
nicative power (Gelman & Shalizi, 2013). In the model expressed in Table 1.2, we have
a dichotomous variable for subtraction proficiency that can take on one of two values,
Proficient or Not Proficient. This amounts to an intentionally simplified framing of the
more complicated real-world situation in light of the desired inferences and available evidence. If we had other purposes, or other evidence that informed on the examinee's proficiency in a more fine-grained manner, we may have specified a variable with more than
that it is a blend of the actual real-world situation and the structures supplied by the model. In the example from Figure 1.2, the reconceived situation still addresses only the correctness of Elizabeth's responses, but sees them as outcomes of a random process that depends on her proficiency parameter. This parameter is introduced in the model-space, and may be retained in our reconceived notion of the real-world situation. The model provides us an estimate, as well as an indication of its accuracy, and predictions for Elizabeth's likely responses on other items she was not even presented.
In summary, in probability-based reasoning we use a formal probability framework to capture imperfect relationships, represent our uncertainty, and reason through the inferential situation at hand. We begin with a real-world situation (bottom left of Figure 1.4) and through our understanding of the salient features of this situation we construct a simplified version in terms of a probability model (the middle layer of Figure 1.4). We then apply tools, including the calculus of probabilities, to the entities in the modeled version. The results represent, in the probability framework, the implications and conclusions that constitute the reconceived real-world situation (bottom right of Figure 1.4). In our development of psychometric modeling, we will routinely employ probability theory as the machinery of inference, reflecting a sentiment expressed by Shafer (as quoted in Pearl, 1988, p. 77): "Probability is not really about numbers; it is about the structure of reasoning."
different probabilities at different points in time. If you whisper to me what you observed, I might then change my beliefs and assert that the probability is also larger than .5.
That different people may assert different probabilities highlights a key aspect of our
perspective, which is that probability is not a property of the world, but rather a property
of the analyst or reasoning agent. This is summarized by this view being referred to as
one of subjective or personal probability. Our interpretation is that probability is an epistemic
rather than an ontological concept, in that it is a property of the analysts thinking about
the world. For me to initially say that the probability of heads is .5 and then to say the
probability is larger than .5 after you recount what you observed reflects a change in my
thinking about the world, not a change in the world itself.
To readers apprehensive about this interpretation, we might emphasize several related
points. First, this conception is by no means new, and dates to the origins of modern math-
ematical treatments of probability (Hacking, 1975). Second, this perspective is commen-
surate with the usual mathematical expressions and restrictions: probabilities must be
greater than or equal to 0, the probability of an event that is certain is 1, and the probability
of the union of disjoint events is equal to the sum of the probabilities associated with each
event individually. See Kadane (2011) and Bernardo and Smith (2000) for recent accounts
on the alignment between an epistemic probability interpretation and the familiar math-
ematical machinery of probability.
Third, this is the way that probability is often used in scientific investigations. For example, the Intergovernmental Panel on Climate Change (IPCC, 2014) reported that it was extremely
likely, meaning the probability was greater than .95, that humans caused more than half of
the observed increase in global average surface temperature between the years 1951 and 2010.
Fourth, this is indeed the usual way that probability is used in everyday language.
For example, consider the following statement from an account of the United States Government's reasoning on whether to pursue an assault on a compound that indeed turned out to be the location of Osama bin Laden (Woodward, 2011).
Several assessments concluded there was a 60 to 80 percent chance that bin Laden was
in the compound. Michael Leiter, the head of the National Counterterrorism Center,
was much more conservative. During one White House meeting, he put the probability
at about 40 percent. When a participant suggested that was a low chance of success,
Leiter said, "Yes, but what we've got is 38 percent better than we have ever had before."
Here, we have a statement expressing different values for the probability held by different
reasoning agents, and at different times. This reflects how a probability is an expression
of uncertainty that is situated in terms of what is believed by a particular person at a par-
ticular time.
These examples additionally highlight how, from the epistemic perspective, probability
can be used to discuss a one-time event. Indeed, it is difficult to conceive of statements about changes in global temperatures and Osama bin Laden's whereabouts as having a justification from a frequentist perspective. If probability is, by definition, a long-run frequency, we would have to object to statements such as these, not on the grounds that they are wrong, but on the grounds that they are nonsensical. Yet we suspect that these sorts of statements are quite natural to most readers.
Similarly, probability can be used to characterize beliefs about (one-time) events
in the past for which we are uncertain. These include things we may never know with
certainty, such as the causes of changes in global temperatures or who wrote the disputed
Federalist Papers (see Mosteller & Wallace, 1964, for a Bayesian analysis), and even things
we can know with certainty, such as whether Martin Van Buren was the ninth President of
the United States.*
Importantly, the notion of epistemic probability is not antithetical to long-run relative
frequencies. Relative frequencies obtained from situations deemed to approximate the
repeated trials that ground the frequentist interpretation may be excellent bases for belief.
Suppose I am interested in the probability that a given person will be a driver in a car acci-
dent in the next year and consult recent data that suggest that the proportion of drivers of
this age that get in an accident in a year is .12. I might then assert that the probability that
the person will be in an accident is .12. Of course, other possibilities exist. I might believe
this person to be a worse driver, or more reckless, than others of the same age, and might
assert that the probability is a bit higher than .12. In either case, I am making a judgment
and ultimately the probability is an expression of my beliefs (Winkler, 1972).
Long-run relative frequencies can also serve as an important metaphor for thinking
through what is ultimately an expression of belief, even if the subject is a one-time event
that is not amenable to repeated trials. Sports fans, journalists, prognosticators, commen-
tators, and talk show hosts routinely talk about the probability that one team wins an
upcoming game; such talk is all but inescapable in sports media before championship games. One mechanism for doing so is by referring to hypothetical long-run frequencies; for example, I might explain my probability that a team wins the game, which will only be played once, by saying "if they played the game 100 times, this team would win 85 of them." The notion of relative frequency over some repetition is a useful metaphor, but ultimately the probability statements are expressions of our uncertain beliefs.
Our perspective will manifest itself throughout the book in the language we use, such
as referring to probability distributions as representing our beliefs. But our perspective is just that: our perspective. Many others exist; see Barnett (1999) for a thorough review and
critique of the dominant perspectives on probability and the statistical inferential frame-
works that trade on these perspectives. More acutely, one need not subscribe to the same
perspective as we do to employ Bayesian methods (Jaynes, 2003; Novick, 1964; Senn, 2011;
Williamson, 2010).
* He was not. Van Buren was the eighth President; William Henry Harrison was the ninth.
More properly, it is a particular evaluation of a behavior. We will say more about this later, but different
aspects of the same assessment performance could be identified and evaluated for different purposes.
warrant also refers to both the examinee and the context, expressing the likely behavior for an examinee in a particular situation. In simple situations such as an examinee taking a single item, the role of the context is implicit, as it is in the quantitative expression of the warrant in Table 1.2 and even the notation p(x|θ), where x and θ are viewed as examinee variables.
However, once we move to more complicated situations, it becomes apparent that the simple framing we have developed thus far will not suffice. Perhaps the most basic extension of our example is the use of multiple items to assess proficiency. If the responses to each item have exactly the same evidentiary bearing on proficiency, it may suffice to assume that the warrant expressed in Table 1.2 holds for every item response for an examinee. More generally, we would like to allow them to differ, representing the possibility that the evidentiary bearing of our observations might vary with the situation. This is expressed in a narrative form casually by saying that some items are harder, some items are easier, and we should take that into account in our inference. Accordingly, we may specify each item to have a possibly unique conditional probability table associated with it. When we turn to these more complicated statistical models as a means to quantitatively express more complicated warrants, we will have parameters summarizing the inferential relevance of the situation in addition to the parameters summarizing the examinee. Accordingly, our notation for denoting the conditional probability for variables standing for observations will have to expand, for example, to p(x|θ, β), where β here stands for a parameter summarizing the features of the context.
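One way to picture this expansion is to give each item its own conditional probability table, so that the context parameter is simply an index into a collection of per-item warrants. The item names and all numbers below are hypothetical, chosen only to contrast one "easier" and one "harder" item:

```python
# Hypothetical per-item tables p(x | theta, beta): beta indexes the item,
# so each item carries its own evidentiary weight.
items = {
    "easy_item": {"Proficient": 0.90, "Not Proficient": 0.40},  # p(correct | theta)
    "hard_item": {"Proficient": 0.60, "Not Proficient": 0.05},
}

def p_correct(theta, item):
    """p(x = correct | theta, beta), with beta identified by the item label."""
    return items[item][theta]

# The same proficiency status yields different response probabilities
# depending on the item: the context now enters the warrant.
print(p_correct("Proficient", "easy_item"))  # 0.9
print(p_correct("Proficient", "hard_item"))  # 0.6
```

The point of the sketch is only that the conditional probability now depends on two arguments, one for the examinee and one for the situation.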
It is almost always the case that there remain potentially relevant features of the context that lurk unstated in the warrant. Returning to the example of Frank (Figure 1.3), an
alternative explanation for his incorrect response is that he did not try on account of the
stakes being low and his attention being focused elsewhere. In addition to undermining
the inferential force of the data on the inference, this calls attention to a presumed but
unstated aspect of the warrant, namely, that examinees are motivated to answer the ques-
tion correctly.
In principle, the list of possibly relevant features of the context is limitless (Are the
examinees motivated? Is the assessment administered in a language foreign to them?
Is there enough light in the room for them to read?) and choices need to be made about
what to formally include in the warrant. What may be ignorable in one case may not be in
another. Issues of motivation play no role in certain assessment arguments, such as those
based on blood tests in medicine, but a crucial role in others, such as those based on wit-
ness testimony in jurisprudence and intelligence analysis. Considerations of the language
of the test may not be important when a teacher gives a test to the students she has had in
class for months and knows well, but may be critical when she gives the test to the student
who just transferred to the school from a country whose people predominantly speak a
different language.
If previously unarticulated aspects of the context are deemed important, one option is to
expand the warrant and the data we collect. Returning to the example of Frank incorrectly
answering an item, we may expand the warrant by framing it as conditional on whether
the examinee is motivated to perform well. For example, we may say that if the examinee
is motivated to perform well, the linkage between the data and the claim as articulated
in Figure 1.3 holds; if the examinee is not motivated to perform well, another linkage is
needed. Expanding the complexity of the warrant may call for additional data to be con-
sidered, in this case data that have bearing on whether the examinee is motivated.
An expansion of the evidentiary narrative of the warrant manifests itself in a corre-
sponding expansion of the statistical model that is the quantitative expression of the
18 Bayesian Psychometric Modeling
warrant (Mislevy, Levy, Kroopnick, & Rutstein, 2008). Returning to the example, we may
specify a mixture model, where the conditional probability structure in Table 1.2 holds for
examinees who are motivated and a different conditional probability structure holds for
examinees who are not motivated, along with a probability that an examinee is motivated
and possibly a probability structure relating other data that has evidentiary bearing on
whether the examinee is motivated.
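The mixture just described can be sketched numerically. The motivated-examinee table is Table 1.2; the near-chance table for unmotivated examinees and the .80 motivation probability are invented purely for illustration:

```python
# Table 1.2 holds for motivated examinees; for unmotivated examinees we
# posit (for illustration only) near-chance responding regardless of proficiency.
p_correct_motivated   = {"Proficient": 0.70, "Not Proficient": 0.20}
p_correct_unmotivated = {"Proficient": 0.25, "Not Proficient": 0.25}  # assumed
p_motivated = 0.80                                                    # assumed

def p_correct(theta):
    """Marginal probability of a correct response under the mixture,
    averaging over whether the examinee is motivated."""
    return (p_motivated * p_correct_motivated[theta]
            + (1 - p_motivated) * p_correct_unmotivated[theta])

print(round(p_correct("Proficient"), 3))      # 0.61
print(round(p_correct("Not Proficient"), 3))  # 0.21
```

Marginalizing over motivation shrinks the gap between proficient and non-proficient response probabilities, which is exactly the weakening of evidentiary force the alternative explanation introduces.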
[We] would begin by asking what complex of knowledge, skills, or other attributes
should be assessed, presumably because they are tied to explicit or implicit objectives of
instruction or are otherwise valued by society. Next, what behaviors or performances
should reveal those constructs, and what tasks or situations should elicit those behaviors?
Thus, the nature of the construct guides the selection or construction of relevant tasks
as well as the rational development of construct-based scoring criteria and rubrics.
ECD provides a framework for working through the many details, relationships, and pieces
of machinery that make an operational assessment, including the psychometric models.
It is the grounding in the assessment argument that gives meaning to the variables and
functions in a psychometric model; modeling questions that seem to be strictly technical
on the surface can often only be answered in the context of the application at hand.
To ground such discussions when we need them, this section provides a brief overview
of ECD in terms of the layers depicted in Figure 1.5. The layering in this figure indicates
a layering in the kind of work an assessment designer is doing, from initial background
research in Domain Modeling through actual administering, scoring, and reporting the
results of an operational assessment in Assessment Delivery. The ideas and elements con-
structed in a given layer are input for work at the next layer. (The actual design process is
generally iterative, such as moving from design to implementation of a prototype and going
back to revise scoring procedures or psychometric models in the Conceptual Assessment
Framework [CAF], or even finding that more foundational research is needed.) Figure 1.5 is laid
FIGURE 1.5
Layers of ECD. The layers represent successively refined and specific kinds of work test developers do, from
studying the domain in Domain Analysis, through developing an assessment argument in Domain Modeling,
to specifying the models and forms of the elements and processes that will constitute the assessment in the
CAF, to constructing those elements in Assessment Implementation, to administering, scoring, and reporting
in an operational assessment as specified in the Assessment Delivery layer. Actual design usually requires
iterative movements up and down the layers.
out in such a way to highlight the role of the CAF. Conceptually this is a transition point,
which facilitates our pivoting from our understandings about the domain (the layers to
the left) to the actual assessment the examinees experience (the layers to the right). Most
of the material in this book concerns this layer, though as we will see in various examples,
aspects of the other layers will have bearing on the specifications here. Our focus will be
on how ECD allows us to see the role played by the psychometric models that are the focus
of this book, and we will give short shrift to many features of ECD. Complementary treat-
ments that offer greater depth on different aspects or in different contexts may be found in
Mislevy et al. (2003); Mislevy, Almond, and Lukas (2004); Mislevy and Riconscente (2006); Mislevy (2013); Behrens, Mislevy, DiCerbo, and Levy (2012); and Almond et al. (2015).
ECD was developed in the context of educational assessment, and in our treatment here
we preserve the language of ECD that reflects this origin, in particular where the examinees are students in educational environments. In Section 1.4.4, we discuss the wider application of the ideas of ECD to other assessment environments.
They are the Student, Task, and Evidence models depicted in Figure 1.6. They address
Messick's questions in a way that becomes a blueprint for jointly developing the tasks and
psychometric models.
FIGURE 1.6
Three central models of the CAF. Circular nodes in the Student Model represent SMVs, with arrows indicating
dependence relationships. Task Models consist of materials presented to the examinee, and work products col-
lected from the examinee. These work products are subjected to evidence identification rules in the Evidence
Models yielding OVs represented as boxes, which are modeled as dependent on the SMVs in the psychometric
or measurement model. (With kind permission from Taylor & Francis: Educational Assessment, Psychometric and
evidentiary advances, opportunities, and challenges for simulation-based assessment. 18, 2013, 182–207, Levy, R.)
TABLE 1.3
Taxonomy of Popular Psychometric Models

                            Latent SMVs
OVs             Continuous                Discrete
Continuous      Classical test theory     Latent profile analysis
                Factor analysis
Discrete        Item response theory      Latent class analysis
                                          Bayesian networks
                                          Diagnostic classification models

* Note that the observable variables are identified by their role as the dependent variables in the psychometric model, not by the nature of the tasks or the evaluation rules per se. We avoid ambiguity by using the terms "task" for the situations in which persons act, "work products" for what they produce, and "observables" for the pieces of evidence identified from them by whatever means as the lowest level of the psychometric model. For example, in multiple-choice tests, the word "items" is used to refer to tasks, responses, evaluated responses, and the dependent variables in item response models. This does not usually cause problems because these distinct entities are in one-to-one correspondence. Such correspondences need not hold when the performances and the evidentiary relationships are more complex, and more precise language is needed.
FIGURE 1.7
A four-process architecture of Assessment Delivery. Activities are determined in the Task Selection process. The selected task(s) is/are presented to an examinee (e.g., given a test booklet and an answer sheet to fill out, presented a simulation environment to take troubleshooting and repair actions to repair a hydraulics system in an airplane). The work product(s) are passed on to the Evidence Identification process, where values of OVs are evaluated from them. Values of the OVs are sent to the Evidence Accumulation process, where the information in the OVs is used as evidence to update beliefs about the latent variables in the Student Model, as can be done with a psychometric model. The updated beliefs about SMVs may be used in selecting further tasks, as in adaptive testing. (This is a modification of figure 1 of CSE Technical Report 543, A Sample Assessment Using the Four Process Framework, by Russell G. Almond, Linda S. Steinberg, and Robert J. Mislevy, published in 2001 by the National Center for Research on Evaluation, Standards, and Student Testing (CRESST). It is used here with permission of CRESST.)
layers in ECD, the framework provides a general account of the work that needs to be done
in developing an assessment to enact an assessment argument that facilitates our reasoning
from the observed examinee behavior to the inferential targets.
1.4.4 Summary
ECD may be seen as an effort to define what inferences about examinees are sought, what
data will be used to make those inferences, how that data will be obtained, how the data
will be used to facilitate the inferences, and why the use of that particular data in those
particular ways is warranted for the desired inferences. Working clockwise from the top
left of Figure 1.7, Assessment Delivery may be seen as the underlying architecture behind seeking the data, collecting it, processing it, and then synthesizing it to facilitate inferences. The psychometric or measurement model plays a central role in the last of these activities, evidence accumulation. As the junction point where the examinees' behaviors captured by OVs are used to inform on the SMVs that represent our beliefs about the examinees' proficiency, the measurement model is the distillation of the assessment argument. More abstractly, it dictates how features of examinee performances, which in and of themselves are innocent of any claim to relevance for our desired inferences, are to be used to conduct those inferences.
The terminology of ECD reflects its origins in educational assessment, but the notions
may be interpreted more broadly. For example, student model variables may be viewed
as aspects of inferential interest for those being assessed, such as the political persuasion
of legislators. In this example, bills play the role of tasks or items, evidence identification
amounts to characterizing the legislators votes on the bills (e.g., in favor or opposed), and
a psychometric model would connect observables capturing these characterizations to the
student model variable, namely the political persuasion of the legislator (the student).
To meet the challenges of representing and synthesizing the evidence, we turn to prob-
ability as a language for expressing uncertainty, and in particular probability model-based
reasoning for conducting inference under uncertainty. A probability model is constructed
for the entities deemed relevant to the desired inference, encoding our uncertain beliefs
about those entities and their relationships. An ECD perspective suggests that this model
is a distillation of a larger assessment argument, in that it is the junction point that relates
the observed data that will serve as evidence to the inferential targets.
Though surface features vary, modern psychometric modeling paradigms typically
structure observed variables as stochastically dependent on the latent variables. As we will
see, building such a model allows for the use of powerful mathematical rules from the
calculus of probabilitieswith Bayes theorem being a workhorseto synthesize the evi-
dence thereby facilitating the desired reasoning. Adopting a probability framework to
model these relationships allows us to characterize both what we believe and why we
believe it, in terms of probability statements that quantify our uncertainty. We develop
these ideas by discussing Bayesian approaches to most of the psychometric modeling families listed in Table 1.3. Along the way we will see instances where Bayesian approaches
represent novel, attractive ways to tackle thorny measurement problems. We will also see
instances where Bayesian approaches are entrenched and amount to doing business as
usual, even when these approaches are not framed or referred to as Bayesian.
2
Introduction to Bayesian Inference
the parameters θ. In practice, once values for x are observed, they can be entered into the conditional probability expression to induce a likelihood function, denoted as L(θ|x). This likelihood function L(θ|x) is the same expression as the conditional probability expression p(x|θ). The difference in notation reflects that when viewed as a likelihood function, the values of the data are known and the expression is viewed as varying over different possible values of θ. ML estimation then comes to finding the values of θ that maximize L(θ|x).
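As a sketch of these mechanics (our example, not one from the text): for a set of dichotomous responses with a common success probability θ, the likelihood L(θ|x) is a product of Bernoulli terms, and maximizing it (here by a simple grid search over the log-likelihood) recovers the familiar MLE, the sample proportion:

```python
import math

# Hypothetical item responses (1 = correct); 7 of 10 are correct.
x = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

def log_likelihood(theta, data):
    """log L(theta | x): the same expression as p(x | theta), viewed as a
    function of theta with the data held fixed."""
    return sum(math.log(theta if xi == 1 else 1 - theta) for xi in data)

# ML estimation: find the theta (over a fine grid on (0, 1)) maximizing L(theta | x).
grid = [i / 1000 for i in range(1, 1000)]
mle = max(grid, key=lambda t: log_likelihood(t, x))

print(mle)  # 0.7, the sample proportion of correct responses
```

Working on the log scale is a standard numerical convenience; the maximizer is unchanged because the logarithm is monotone.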
In frequentist approaches, the (ML) estimator is a function of the data, which have a
distribution. This renders the estimator to be a random variable, and a realization of this
random variable, the ML estimate (MLE), is obtained given a sample of data from the pop-
ulation. Standard errors of these estimates capture the uncertainty in the estimates, and can
be employed to construct confidence intervals. Importantly, these estimation routines for
many psychometric models typically rely on asymptotic arguments to justify the calcula-
tion of parameter estimates, standard errors, or the assumed sampling distributions of the
parameter estimates and associated test statistics.
Moreover, the interpretations of point estimates, standard errors, and confidence inter-
vals derive from the frequentist perspective in which the parameters are treated as fixed
(constant). In this perspective, it is inappropriate to discuss fixed parameters probabilisti-
cally. Distributional notions of uncertainty and variability therefore concern the param-
eter estimates, and are rooted in the treatment of the data as random. The standard error
is a measure of the variability of the parameter estimator, which is the variability of the
parameter estimates on repeated sampling of the data from the population. Likewise, the
probabilistic interpretation of a confidence interval rests on the sampling distribution of
the interval on repeated sampling of data, and applies to the process of interval estimator
construction. Importantly, these notions refer to the variability and likely values of a param-
eter estimator (be it a point or interval estimator), that is, the distribution of parameter esti-
mates on repeated sampling. This approach supports probabilistic deductive inferences in
which reasoning flows from the general to the particulars, in this case from the parameters
assumed constant over repeated samples to the data x. A frequentist perspective does not
support probabilistic inductive inferences in which reasoning flows from the particulars, here, the data x, to the general, here, the parameters θ. In frequentist inference, probabilistic statements refer to the variability and likely values of the parameter estimator, not the parameter itself.
The Bayesian approach treats the model parameters as random,* and uses distributions to
model our beliefs about them. Importantly, this conceptual distinction from how frequen-
tist inference frames parameters has implications for the use of probability-based reason-
ing for inductive inference, parameter estimation, quantifying uncertainty in estimation,
and interpreting the results from fitting models to data. Illuminating this difference and
how it plays out in psychometric modeling is a central theme of this book.
In a Bayesian analysis, the model parameters are assigned a prior distribution, a distribution specified by the analyst to possibly reflect substantive, a priori knowledge, beliefs, or assumptions about the parameters, denoted as p(θ). Bayesian inference then comes to synthesizing the prior distribution and the likelihood to yield the posterior distribution, denoted as p(θ|x). This synthesis is given by Bayes' theorem. For discrete parameters θ and datum x such that p(x) > 0, Bayes' theorem states that the posterior distribution is
p(θ|x) = p(x, θ) / p(x)

       = p(x|θ) p(θ) / p(x)          (2.1)

       = p(x|θ) p(θ) / Σθ* p(x|θ*) p(θ*),

where the summation in the denominator is taken over all possible values θ* of θ.
* Specifically, they are random in the sense that the analyst has uncertain knowledge about them.
Though this characterization applies to Bayesian inference broadly and will underscore the treatment in this
book, there are exceptions. For example, in some approaches to Bayesian modeling certain unknown entities,
such as those at the highest level of a hierarchical model specification, are not treated as random and modeled via
distributions, and instead are modeled as fixed and estimated using frequentist strategies (Carlin & Louis, 2008).
FIGURE 2.1
Natural-frequency representation and computation for the breast cancer diagnosis example: of 1,000 women, 8 have breast cancer (7 with positive mammograms, 1 negative) and 992 do not (69 with positive mammograms, 923 negative). (From Gigerenzer, G. (2002). Calculated risks: How to know when numbers deceive you. New York: Simon & Schuster, figure 4.2. With permission.)
illuminates the situation and the answer. On this description of 1,000 women, we see that 76 will have positive mammograms. Of these 76, only 7 actually have breast cancer. The proportion of women with positive mammograms who have breast cancer is 7/76 ≈ .09.
This is an instance of Bayesian reasoning. The same result is obtained by instantiating (2.1). Let θ be a variable capturing whether the patient has breast cancer, which can take on the values of Yes or No. Let x be a variable capturing the result of the mammography, which can take on the values of Positive and Negative. To work through this problem, the first step, one that is commensurate with a frequentist perspective, is to define the conditional probability of the data given the model parameter, p(x|θ). This is given in Table 2.1, where the rows represent the probability of a particular result of the mammography given the presence or absence of breast cancer.
We note that ML strictly works off of the information in Table 2.1 corresponding to the observed data. In this case, the column associated with the positive mammography result is the likelihood function. ML views the resulting values of .90 for breast cancer being present and .07 for breast cancer being absent as a likelihood function. Maximizing it then comes to recognizing that the value associated with having breast cancer (.90) is larger than the value for not having breast cancer (.07). That is, p(x = Positive|θ = Yes) is larger than p(x = Positive|θ = No). The ML estimate is therefore that θ (breast cancer) is Yes, and this is indeed what the data (mammography result) have to say about the situation.
In a Bayesian analysis, we combine the information in the data, expressed in the likeli-
hood, with the prior information about the parameter. To specify a prior distribution for
TABLE 2.1
Conditional Probability of the Mammography Result Given Breast Cancer

                         Mammography Result (x)
Breast Cancer (θ)    Positive    Negative
Yes                    .90         .10
No                     .07         .93

TABLE 2.2
Prior Probability of Breast Cancer

Breast Cancer (θ)    Probability
Yes                  .008
No                   .992
the model parameter, θ, we draw from the statement that the patient has no family history of cancer nor any symptoms, and the proportion of women in this population that have breast cancer is .008. Table 2.2 presents the prior distribution, p(θ).
With these ingredients, we can proceed with the computations in Bayes theorem. The elements in Tables 2.1 and 2.2 constitute the terms in the numerator on the right-hand side of the second through fourth lines of (2.1). Equation (2.2) illustrates the computations for the numerator of Bayes theorem, with x = Positive. The elements in the conditional probability of the data are multiplied by their corresponding elements in the prior distribution. Note that the column for Mammography Result taking on the value of Positive is involved in the computation, as that was the actual observed value; the column for Mammography Result taking on the value of Negative is not involved.
p(x = Positive|θ = Yes) × p(θ = Yes) = .90 × .008 = .00720
                                                                      (2.2)
p(x = Positive|θ = No) × p(θ = No) = .07 × .992 = .06944
The denominator in Bayes theorem is the marginal probability of the observed data under the model. For discrete parameters this is given by

p(x) = Σ_θ p(x|θ)p(θ).                                                (2.3)
For the current example, we have one data point (x) that has a value of Positive, and

p(x = Positive) = p(x = Positive|θ = Yes)p(θ = Yes) + p(x = Positive|θ = No)p(θ = No)
                = .00720 + .06944 = .07664.                           (2.4)
Continuing with Bayes theorem, we take the results on the right-hand side of (2.2) and divide those numbers by the marginal probability of .07664 from (2.4). Thus we find that the posterior distribution is, rounding to two decimal places, p(θ = Yes|x = Positive) ≈ .09, p(θ = No|x = Positive) ≈ .91, which captures our belief about the patient after observing that the test result is positive. Equation (2.5) illustrates the entire computation in a single representation.
p(θ|x = Positive) = p(x = Positive|θ)p(θ) / p(x = Positive)

p(θ = Yes|x = Positive) = (.90 × .008) / .07664 = .00720 / .07664 ≈ .09
                                                                      (2.5)
p(θ = No|x = Positive) = (.07 × .992) / .07664 = .06944 / .07664 ≈ .91
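The computation in (2.5) is easy to verify directly. The following Python sketch (illustrative only; the variable names are ours, not the book's) carries out the discrete form of Bayes theorem with the values from Tables 2.1 and 2.2:

```python
# Discrete Bayes theorem for the breast cancer example.
# Probabilities are those given in Tables 2.1 and 2.2.
prior = {"Yes": 0.008, "No": 0.992}             # p(theta)
p_positive_given = {"Yes": 0.90, "No": 0.07}    # p(x = Positive | theta)

# Numerator of Bayes theorem: p(x = Positive | theta) * p(theta)
numerator = {t: p_positive_given[t] * prior[t] for t in prior}

# Denominator: marginal probability p(x = Positive), summing over theta
marginal = sum(numerator.values())

# Posterior: p(theta | x = Positive)
posterior = {t: numerator[t] / marginal for t in numerator}

print(round(marginal, 5))          # 0.07664
print(round(posterior["Yes"], 2))  # 0.09
print(round(posterior["No"], 2))   # 0.91
```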
The marginal probability of the observed data p(x) in the denominator of (2.1) serves as a normalizing constant to ensure the resulting mass function sums to one. Importantly, as the notation and example show, p(x) does not vary with the value of θ. As the last line of (2.1) shows, dropping this term in the denominator reveals that the posterior distribution is proportional to the product of the likelihood and the prior. In the computation in (2.5), all of the information in the posterior for Breast Cancer is contained in the values of .00720 for Yes and .06944 for No. However, these are not in an interpretable probability metric. Dividing each by their sum of .07664 yields the values that are in a probability metric, namely the posterior distribution. This last step is primarily computational. The conceptual implication of the last line of (2.1) is that the key terms relevant for defining the posterior distribution are p(x|θ) and p(θ). In addition, as discussed in Chapter 5, the proportionality relationship has important implications for estimation of posterior distributions using simulation-based methods.
Equation (2.1) instantiates Bayes theorem for discrete parameters. For continuous parameters θ,

p(θ|x) = p(x, θ) / p(x)

       = p(x|θ)p(θ) / p(x)                                            (2.6)

       = p(x|θ)p(θ) / ∫ p(x|θ)p(θ)dθ

       ∝ p(x|θ)p(θ).
Similarly, for models with both discrete and continuous parameters, the marginal distribu-
tion is obtained by summing and integrating over the parameter space, as needed.
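The summation or integration in the denominator can be approximated numerically when analytic solutions are unavailable. As a simple illustration (ours, not the book's), the following Python sketch uses a grid approximation to the integral in (2.6) for a single Bernoulli observation x = 1 under a uniform prior, a case where the marginal is 1/2 and the posterior mean is 2/3 analytically:

```python
# Grid approximation to Bayes theorem for a continuous parameter, as in (2.6).
# Illustrative sketch: one Bernoulli observation x = 1 with a uniform prior.
N = 100_000
grid = [(i + 0.5) / N for i in range(N)]   # midpoints of N equal bins on (0, 1)

prior = [1.0] * N                          # uniform prior density p(theta) = 1
likelihood = [theta for theta in grid]     # p(x = 1 | theta) = theta

# Marginal p(x): approximate the integral of likelihood * prior
marginal = sum(l * p for l, p in zip(likelihood, prior)) / N

# Posterior density on the grid: likelihood * prior / marginal
posterior = [l * p / marginal for l, p in zip(likelihood, prior)]
post_mean = sum(t * q for t, q in zip(grid, posterior)) / N

print(round(marginal, 4))   # 0.5
print(round(post_mean, 4))  # 0.6667
```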
Introduction to Bayesian Inference 31
p(x|θ) = ∏_{j=1}^{J} p(x_j|θ) = ∏_{j=1}^{J} θ^(x_j) (1 − θ)^(1 − x_j).        (2.8)

Let y = Σ_{j=1}^{J} x_j denote the total number of 1s in x. Continuing with the educational assessment example, y is the total number of correct responses in the set of responses to the J tasks. The conditional probability of y, given θ and J, then follows a binomial mass function

p(y|θ, J) = Binomial(y|θ, J) = (J choose y) θ^y (1 − θ)^(J − y).              (2.9)
Suppose then that we observe y = 7 successes in J = 10 tasks. In the following two sections,
ML and Bayesian approaches to inference are developed for this example.
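Before turning to the two approaches, it may help to see (2.9) evaluated numerically. This Python sketch (illustrative only, standard library) computes the binomial likelihood for y = 7 and J = 10 over a grid of θ values and locates the grid point where it is largest:

```python
from math import comb

def binomial_likelihood(theta, y=7, J=10):
    """Binomial probability of y successes in J trials, as in (2.9)."""
    return comb(J, y) * theta**y * (1 - theta)**(J - y)

# Evaluate the likelihood on a fine grid of theta values in (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
mle = max(grid, key=binomial_likelihood)

print(mle)  # 0.7, matching y / J
```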
θ̂ = y / J.                                                            (2.10)
FIGURE 2.2
Likelihood function for a binomial model with y = 7 and J = 10. The maximum likelihood estimate (MLE) of θ is .7, and is indicated via a vertical line.
In the current example, the MLE is .7, the point where the highest value of the likelihood function occurs, as readily seen in Figure 2.2. The standard error of the ML estimator is

σ_θ̂ = √(θ(1 − θ) / J).                                               (2.11)

As θ is unknown, substituting in the MLE yields the estimate of the standard error

σ̂_θ̂ = √(θ̂(1 − θ̂) / J).                                             (2.12)
In the current example, σ̂_θ̂ ≈ .14. When J is large, a normal approximation gives the usual 95% confidence interval of θ̂ ± 1.96σ̂_θ̂, or (.42, .98). In the present example, calculating binomial probabilities for the possible values of y when θ actually is .7 tells us that 96% of the MLEs would be included in the score confidence interval [.4, .9]. This interval takes into account the finite possible values for the MLE, the bounded nature of the parameter, and the asymmetric likelihood.
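These quantities can be reproduced with a few lines of code. The following Python sketch (ours, for illustration) computes the estimated standard error in (2.12), the Wald interval, and the exact coverage of the score-type interval [.4, .9] when θ = .7:

```python
from math import comb, sqrt

y, J = 7, 10
theta_hat = y / J                      # MLE from (2.10)

# Estimated standard error of the ML estimator, as in (2.12)
se_hat = sqrt(theta_hat * (1 - theta_hat) / J)

# Normal-approximation (Wald) 95% confidence interval
lower, upper = theta_hat - 1.96 * se_hat, theta_hat + 1.96 * se_hat
print(round(se_hat, 2), (round(lower, 2), round(upper, 2)))  # 0.14 (0.42, 0.98)

# Coverage of the score-type interval [.4, .9] when theta is actually .7:
# sum the binomial probabilities of the y values whose MLE y/J lies inside it
coverage = sum(comb(J, k) * 0.7**k * 0.3**(J - k)
               for k in range(J + 1) if 0.4 <= k / J <= 0.9)
print(round(coverage, 2))  # 0.96
```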
2.3.2 Bayesian Modeling for Binomial Distributed Data: The Beta-Binomial Model
Rewriting (2.6) for the current case, Bayes theorem is given by

p(θ|y, J) = p(y|θ, J)p(θ) / p(y) = p(y|θ, J)p(θ) / ∫ p(y|θ, J)p(θ)dθ ∝ p(y|θ, J)p(θ).     (2.13)
To specify the posterior distribution, we must specify the terms on the right-hand side of
(2.13). We now treat each in turn.
The first term, p(y|θ, J), is the conditional probability for the data given the parameter. Once data are observed and it is viewed as a likelihood function of the parameter, the leading term (J choose y) can be dropped as it does not vary with the parameter. It is repeated here with this one additional expression:

p(y|θ, J) = Binomial(y|θ, J) = (J choose y) θ^y (1 − θ)^(J − y) ∝ θ^y (1 − θ)^(J − y).     (2.14)
The second term on the right-hand side of (2.13), p(θ), is the prior probability distribution for the parameter θ. From the epistemic probability perspective, the prior distribution encodes the analyst's beliefs about the parameter before having observed the data. A more spread out, less informative, diffuse prior distribution may be employed to represent vague or highly uncertain prior beliefs. More focused beliefs call for a prior distribution that is heavily concentrated at the corresponding points or regions of the support of the distribution. Once data are observed, we employ Bayes theorem to arrive at the posterior distribution, which represents our updated beliefs arrived at by incorporating the data with our prior beliefs. Section 3.2 discusses principles of specifying prior distributions in greater detail; suffice it to say that we often seek to employ distributional forms that allow us to flexibly reflect a variety of beliefs, are reasonably interpretable in terms of the problem at hand, and ease the computational burden.
In the current context of a binomial, a popular choice that meets all of these goals is a beta distribution,

p(θ) = Beta(θ|α, β) = [Γ(α + β) / (Γ(α)Γ(β))] θ^(α − 1) (1 − θ)^(β − 1) ∝ θ^(α − 1) (1 − θ)^(β − 1),     (2.15)

where the first equality indicates that the probability distribution depends on parameters α and β. The beta distribution is a particularly handy kind of prior distribution for the binomial, a conjugate prior. We will say more about this in Chapter 3 too, but for now the key ideas are that the posterior will have the same functional form as the prior and there are convenient interpretations of its parameters.
A beta distribution's so-called shape parameters α and β tell us about its shape and focus. The sum α + β indicates how informative it is. When α = β it is symmetric; when α < β it is positively skewed, and when α > β it is negatively skewed. When α < 1 there is an asymptote at 0 and when β < 1 there is an asymptote at 1. Beta(1,1) is the uniform distribution. When both α > 1 and β > 1, it has a single mode on (0,1). Figure 2.3 contains density plots for several beta distributions. We see that when used as priors, Beta(5,17) is more focused or informative than Beta(6,6), which is more diffuse or uninformative. Beta(6,6) is centered around .5, while Beta(5,17) is positively skewed, with most of its density around .2. The formulas for its mean, mode, and variance are as follows (see also Appendix B):

Mean: α / (α + β)

Mode: (α − 1) / (α + β − 2)

Variance: αβ / [(α + β)^2 (α + β + 1)].
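A small helper function (illustrative; not from the text) makes these formulas concrete for the beta distributions discussed here:

```python
def beta_summaries(a, b):
    """Mean, mode, and variance of a Beta(a, b) distribution (mode requires a, b > 1)."""
    mean = a / (a + b)
    mode = (a - 1) / (a + b - 2)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, mode, variance

# The two informative priors compared in the text
mean_66, mode_66, var_66 = beta_summaries(6, 6)      # symmetric around .5
mean_517, mode_517, var_517 = beta_summaries(5, 17)  # positively skewed, density near .2

print(mean_66, mode_66)              # 0.5 0.5
print(round(mean_517, 2), mode_517)  # 0.23 0.2
```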
FIGURE 2.3
Density plots for beta distributions, including Beta(1,1) (solid line); Beta(6,6) (dashed line); Beta(5,17) (dotted line); Beta(.5,.5) (dash-dotted line).
The form of the beta distribution expressed in the rightmost side of (2.15) suggests a useful interpretation of the parameters of the beta distribution, α and β. The rightmost sides of the likelihood expressed in (2.14) and the beta prior in (2.15) have similar forms with two components: θ raised to a certain exponent, and (1 − θ) raised to a certain exponent. For the likelihood, these exponents are the number of successes (y) and number of failures (J − y), respectively. The similarity in the forms suggests an interpretation of the exponents and hence the parameters of the beta prior distribution, namely, that the information in the prior is akin to α − 1 successes and β − 1 failures.
It can be shown that, with the chosen prior and likelihood, the denominator for (2.13) follows a beta-binomial distribution. For the current purposes, it is useful to ignore this term and work with the proportionality relationship in Bayes theorem. Putting the pieces together, the posterior distribution is

p(θ|y, J) ∝ θ^(y + α − 1) (1 − θ)^(J − y + β − 1).                    (2.16)

The form of the posterior distribution can be recognized as a beta distribution. That is,

p(θ|y, J) = Beta(θ|y + α, J − y + β).                                 (2.17)
Turning to the example, recall that y = 7 and J = 10. In this analysis, we employ a Beta(6,6) prior, depicted in Figure 2.4. Following the suggested interpretation, the use of the Beta(6,6) prior distribution is akin to modeling prior beliefs as having seen five successes and five failures. Accordingly, the distribution is symmetric around its mode of .5, embodying the beliefs that our best guess is that the value is most likely near .5, but we are not very confident, and it is just as likely to be closer to 0 as it is to be closer to 1. Figure 2.4 also plots the likelihood and the posterior distribution. Following (2.17), the posterior is a Beta(7 + 6, 10 − 7 + 6) = Beta(13,9) distribution.
FIGURE 2.4
Prior, likelihood, and posterior for the beta-binomial model with a Beta(6,6) prior for θ, y = 7, and J = 10.
* The use of α here refers to a probability, and is not to be confused with the use of α as a parameter of a beta distribution.
FIGURE 2.5
Beta(13,9) posterior, with shaded 95% highest posterior density (HPD) interval (0.39, 0.79), mode (0.60), and mean (0.59).
Importantly, the posterior distribution, and these summaries, refer to the parameter.
The posterior standard deviation characterizes our uncertainty about the parameter in
terms of variability in the distribution that reflects our belief, after incorporating evidence
from data with our initial beliefs. Similarly, the posterior intervals are interpreted as direct
probabilistic statements of our belief about the unknown parameter; that is, we express
our degree of belief that the parameter falls in this interval in terms of the probability .95.
Similarly, probabilistic statements can be asserted for ranges of parameters or among mul-
tiple parameters (e.g., the probability that one parameter exceeds a selected value, or the
probability that one parameter exceeds another).
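The conjugate updating rule in (2.17) and the resulting posterior summaries can be sketched in a few lines of Python (an illustration of the algebra, with names of our own choosing, not a replacement for the text's analysis):

```python
def beta_binomial_update(alpha, beta, y, J):
    """Conjugate update: a Beta(alpha, beta) prior with y successes in J trials
    gives a Beta(alpha + y, beta + J - y) posterior, as in (2.17)."""
    return alpha + y, beta + (J - y)

# Beta(6,6) prior with y = 7 successes in J = 10 tasks
a_post, b_post = beta_binomial_update(6, 6, y=7, J=10)
post_mean = a_post / (a_post + b_post)
post_mode = (a_post - 1) / (a_post + b_post - 2)

print((a_post, b_post))     # (13, 9)
print(round(post_mean, 2))  # 0.59
print(round(post_mode, 2))  # 0.6
```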
The graphs are directed in that all the edges are directed, represented by one-headed
arrows, so that there is a flow of dependence. Further, the graphs are acyclic in that,
when moving along paths from any node in the direction of the arrows, it is impossible
to return to that node. The graph also contains a number of plates associated with
indexes, used to efficiently represent many nodes. Following conventions from path dia-
grams in structural equation modeling (Ho, Stark, & Chernyshenko, 2012) we employ
rectangles to represent observable entities/variables and circles to represent latent or
unknown variables/entities. We further discuss and compare path diagrams to DAGs
in Chapter 9.
The structure of the graph conveys how the model structures the joint distribution. Let v denote the full collection of entities under consideration. The joint distribution p(v) may be factored according to the structure of the graph as

p(v) = ∏_{v ∈ v} p(v|pa(v)),                                          (2.18)

where pa(v) stands for the parents of v; if v has no parents, p(v|pa(v)) is taken as the unconditional (marginal) distribution of v. Equation (2.18) indicates that we can represent
the structure of the joint distribution in a model in terms of the graph. For each variable,
we examine whether the node in the graph has parents. If it does not, there is an uncondi-
tional distribution for the variable. If it does have parents, the distribution for the variable
is specified as conditional on those parents. Thus, the graph reflects key dependence and
(conditional) independence relationships in the model* (see Pearl, 2009).
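The factorization in (2.18) can be expressed programmatically. In this illustrative Python sketch (hypothetical names of our own), a DAG is stored as a mapping from each node to its parents, and the factorization is read off node by node:

```python
# Illustrative sketch: factoring a joint distribution according to a DAG,
# as in (2.18). Each node maps to its list of parents.
dag = {
    "theta": [],       # theta has no parents: unconditional (prior) distribution
    "y": ["theta"],    # y is a child of theta: conditional distribution
}

def factorization(dag):
    """Return one factor p(v|pa(v)) per node; a marginal if the node has no parents."""
    factors = []
    for node, parents in dag.items():
        if parents:
            factors.append("p({}|{})".format(node, ",".join(parents)))
        else:
            factors.append("p({})".format(node))
    return factors

print(" ".join(factorization(dag)))  # p(theta) p(y|theta)
```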
Figure 2.6 presents two versions of a DAG for the beta-binomial model. The first version depicts the graph where the node for y is modeled as a child of the node for θ, which has no parents. This reflects that the joint distribution of the random variables is constructed as p(y, θ) = p(y|θ)p(θ). The second version is an expanded graph that also contains the entities in the model that are known or fixed, communicating all the entities that are involved in specifying the distributions. More specifically, the graph reflects that

p(y, θ|J, α, β) = p(y|θ, J)p(θ|α, β).

FIGURE 2.6
Two versions of a directed acyclic graph for the beta-binomial model: (a) depicting entities in the joint distribution; (b) additionally depicting known or fixed entities.
* We note that the conditional independence relationships expressed in the graph are those explicitly stated by
the model, visible in an ordering of the variables. There may be additional conditional independence relation-
ships among the variables that are not illustrated in a given ordering of the variables, but would show up
under different orderings of variables.
FIGURE 2.7
Two versions of a directed acyclic graph for the beta-Bernoulli model: (a) depicting entities in the joint distribution; (b) additionally depicting known or fixed entities.
* J is known because that is the size of the sample in the experiment. α and β are known because we have chosen them to represent our belief about θ before we see the data.
is contained in the braces: the left-hand brace { following the model statement at the top, and the right-hand brace } toward the bottom. Within the braces, the two (noncommented) lines of code give the prior distribution for θ and the conditional distribution of y given θ. The (noncommented) line after the right-hand brace is a data statement that contains known values that are passed to WinBUGS.
-------------------------------------------------------------------------
#########################################################################
# Model Syntax
#########################################################################
model{
#########################################################################
# Prior distribution
#########################################################################
theta ~ dbeta(alpha,beta)
#########################################################################
# Conditional distribution of the data
#########################################################################
y ~ dbin(theta, J)
}
#########################################################################
# Data statement
#########################################################################
list(J=10, y=7, alpha=6, beta=6)
-------------------------------------------------------------------------
We conducted an analysis in WinBUGS for this example, running a chain for 50,000 iterations. We will have more to say about the underlying mechanisms of the estimation strategies implemented in WinBUGS in Chapter 5. For the moment, it is sufficient to state that the analysis produces an empirical approximation to the posterior distribution in that the iterations constitute draws from the posterior distribution. Table 2.3 presents summary statistics based on these draws. WinBUGS does not produce the HPD interval; these values were obtained using the coda package in R. Figure 2.8 plots a smoothed density of the 50,000 draws as produced by the coda package. A similar representation is available from WinBUGS. The empirical approximation to the posterior from WinBUGS is quite good. Note the similarity in shape of the density in Figure 2.8 to the Beta(13,9) density in Figures 2.4 and 2.5. The empirical summaries of the distribution in Table 2.3 are equal to their analytical values, up to two decimal places.
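The logic of summarizing posterior draws can be mimicked without WinBUGS: because the posterior here is a known Beta(13,9), we can draw from it directly with Python's standard library and summarize the draws, much as WinBUGS summarizes its MCMC iterations. This sketch is ours, for illustration only:

```python
import random

# Stand-in for the WinBUGS run (not the WinBUGS sampler itself): draw directly
# from the analytically known Beta(13,9) posterior and summarize the draws.
random.seed(1)
draws = [random.betavariate(13, 9) for _ in range(50_000)]

empirical_mean = sum(draws) / len(draws)
analytic_mean = 13 / 22

print(round(empirical_mean, 2))  # 0.59 (analytic mean 13/22 ≈ 0.5909)
```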
We emphasize how the mathematical expressions of the model, the DAG, and the
WinBUGS code cohere. For the beta-binomial, for example, Figure 2.9 depicts the coherence
TABLE 2.3
Summary Statistics for the Posterior for the Beta-Binomial Model with a Beta(6,6) Prior for θ, y = 7, and J = 10 from the WinBUGS Analysis

Parameter | Mean | Median | Standard Deviation | 95% Central Credibility Interval | 95% Highest Posterior Density Interval
FIGURE 2.8
Smoothed density of the draws from WinBUGS for the posterior for the beta-binomial model with a Beta(6,6) prior for θ, y = 7, and J = 10.
FIGURE 2.9
Correspondence among the mathematical expressions, DAGs, and WinBUGS code for the beta-binomial and beta-Bernoulli models. The top row communicates the prior distribution (θ ~ Beta(α, β); theta ~ dbeta(alpha,beta)). The second row communicates the conditional probability of the data based on the binomial model (y ~ Binomial(θ, J); y ~ dbin(theta, J)). The third row communicates the conditional probability of the data based on the Bernoulli model (x_j ~ Bernoulli(θ), j = 1,…, J; for (j in 1:J) { x[j] ~ dbern(theta) }).
for the prior distribution and the conditional distribution of the data in the top and middle
rows, respectively. (The bottom row shows the DAG and the WinBUGS code when we
instead use the Bernoulli specifications for the J replications individually.) Importantly,
the correspondence between the mathematical model and the DAG is a general point of
Bayesian modeling, which simplifies model construction and calculations of posterior dis-
tributions (Lauritzen & Spiegelhalter, 1988; Pearl, 1988). The correspondence between these
features and the WinBUGS code is fortuitous for understanding and writing the WinBUGS
code. In this case, the correspondence is exact. In other cases, the correspondence between
the code and the mathematical expressions/DAG will be less than exact. Nevertheless, we
find it useful in our work to conceive of models using these various representations: clarity
can be gained, and troubleshooting time can be reduced, by laying out and checking the
correspondence between the mathematical expressions, DAG, and code. For example, it is
seen that any child in the graph appears on the left-hand side of a mathematical expres-
sion and the WinBUGS code. For any such node in the graph, its parents are the terms that
appear on the right-hand side of the mathematical expression and the WinBUGS code.
The WinBUGS code for the model and a data statement for the beta-Bernoulli example
are given below.
-------------------------------------------------------------------------
#########################################################################
# Model Syntax
#########################################################################
model{
#########################################################################
# Prior distribution
#########################################################################
theta ~ dbeta(alpha,beta)
#########################################################################
# Conditional distribution of the data
#########################################################################
for(j in 1:J){
x[j] ~ dbern(theta)
}
}
#########################################################################
# Data statement
#########################################################################
list(J=10, x=c(1,0,1,0,1,1,1,1,0,1), alpha=6, beta=6)
-------------------------------------------------------------------------
There are two key differences from the beta-binomial code. First, the individual variables are modeled with a for statement, which essentially serves as a loop over the index j. Inside the for statement (i.e., the loop), each individual x_j is specified as following a Bernoulli distribution with parameter θ. The second key difference is that the data being supplied are the values of the individual xs for the 10 variables, rather than their sum.
Again, we emphasize the coherence among the different representations of the model. The last row in Figure 2.9 contains the conditional probability of the data, expressed mathematically, as a DAG, and as the WinBUGS code. Note that the repeated structure over j is communicated in each representation: with a list in the mathematical expression, as a plate in the DAG, and as a for statement in WinBUGS.
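The equivalence of the two specifications can also be verified numerically: the Bernoulli product likelihood of the individual responses is proportional to the binomial likelihood of their sum, so both lead to the same posterior. An illustrative Python check (names of our own choosing):

```python
from math import comb

x = [1, 0, 1, 0, 1, 1, 1, 1, 0, 1]    # individual responses from the data statement
J, y = len(x), sum(x)
theta = 0.6                            # any value of theta serves for the check

# Bernoulli product likelihood of the individual responses, as in (2.8)
bernoulli_lik = 1.0
for xj in x:
    bernoulli_lik *= theta**xj * (1 - theta)**(1 - xj)

# Binomial likelihood of the total, as in (2.9)
binomial_lik = comb(J, y) * theta**y * (1 - theta)**(J - y)

# The two agree up to the constant (J choose y), which drops out of the posterior
print(abs(binomial_lik - comb(J, y) * bernoulli_lik) < 1e-12)  # True
```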
Bayesian data analysis, modeling, and computation generally and in the contexts covered in
this chapter may be found in a number of texts, including those cited in the introduction to
Part I. To this, we add that Lunn et al. (2013) provided an account rooted in graphical models
introduced in this chapter, as well as details on the use of WinBUGS and its variants.
To summarize the principles of the current chapter on the mechanics of Bayesian analy-
ses, we can hardly do better than Rubin, who described a Bayesian analysis as one that
(1984, p. 1152)
treats known values [x] as observed values of random variables [via p(x|θ)], treats unknown values [θ] as unobserved random variables [via p(θ)], and calculates the conditional distribution of unknowns given knowns and model specifications [p(θ|x)] using Bayes's theorem.
1. Set up the full probability model. This is the joint distribution of all the entities, including observable entities (i.e., data x) and unobservable entities (i.e., parameters θ) in accordance with all that is known about the situation. However, specifying the joint distribution p(x, θ) may be very difficult for even fairly simple situations. Returning to the medical diagnosis example, it may be difficult to articulate the bivariate distribution for Breast Cancer and Mammography Result directly. Instead, we structure the joint distribution by specifying the marginal distribution for Breast Cancer, that is, without appeal to Mammography Result, and then specify a conditional distribution for Mammography Result given Breast Cancer. This is an instance of the general strategy, in which we factor the joint distribution as p(x, θ) = p(x|θ)p(θ).
2. Condition on the observed data (x) and calculate the conditional probability distribution for the unobservable entities (θ) of interest given the observed data. That is, we obtain the posterior distribution p(θ|x) via Bayes theorem:

p(θ|x) = p(x, θ) / p(x) = p(x|θ)p(θ) / p(x) ∝ p(x|θ)p(θ).
3. Do all the things we are used to doing with fitted statistical models; examples
include but are not limited to: examining fit, assessing tenability or sensitivity to
assumptions, evaluating the reasonableness of conclusions, respecifying the model
if warranted, and summarizing results. (Several of these activities are treated in
Chapter 10.)
Turning to the specifics of the examples, Bayesian approaches to medical diagnosis of the sort illustrated here date to Warner, Toronto, Veasey, and Stephenson (1961). The medical diagnosis model may be seen as a latent class (Chapter 13) or Bayesian network (Chapter 14) model, and principles and extensions of this model are further discussed in those chapters. Lindley and Phillips (1976) provided an account of Bayesian approaches to inference with Bernoulli and binomial models using beta prior distributions, strongly contrasting it with conventional frequentist approaches. Novick, Lewis, and Jackson (1973) discussed the extension to the case of multiple binomial processes, arguing that a Bayesian approach assuming exchangeability offers advantages in criterion-referenced assessment, particularly for short tests.
Finally, we have developed the mechanics of Bayesian inference in part by contrasting
it with frequentist inference. In doing so, we have alluded to a number of reasons why
adopting a Bayesian approach is advantageous. We take up this topic in greater detail in the next chapter. Before doing so, it is important to note that our account of frequentist inference here is far from comprehensive. For our purposes, the key idea is the reversal of the inferential direction that the Bayesian approach affords. Aside from a theoretical view of probability per se, there are a variety of methods of statistical inference that focus on the distribution of results (statistics, likelihood functions, etc.) that arise from repeated samples of the data given the fixed but unknown parameters. It is this directionality that we are referring to in the way we use "frequentist" as a contrast to the Bayesian paradigm, which requires the full probability model (including priors), where we then base inference on the posterior distribution of parameters and make probability statements about them conditional on data.
Exercises
2.1 Reconsider the breast cancer diagnosis example introduced in Section 2.2. Suppose another patient reports being symptomatic and a family history such that the proportion of women with her symptoms and family history that have breast cancer is .20. She undergoes a mammography screening, and the result is positive. What is the probability that she has breast cancer? Contrast this with the result for the original example developed in Section 2.2.
2.2 Reconsider the breast cancer diagnosis example introduced in Section 2.2, for an asymptomatic woman with no family history; the proportion of such women that have breast cancer is .008. Suppose that instead of undergoing a mammography screening, the woman is administered a different screening instrument where the probability of a positive result given the presence of breast cancer is .99, and the probability of a positive result given the absence of breast cancer is .03. Suppose the result is positive. What is the probability that she has breast cancer? Contrast this with the result for the original example developed in Section 2.2.
2.3 Revisit the beta-binomial example for a student who has y = 7 successes on J = 10
attempts.
a. Suppose you wanted to encode minimal prior information, representing beliefs
akin to having seen 0 successes and 0 failures. What prior distribution would
you use?
b. Using the prior distribution chosen in part (a), obtain the posterior distribution for θ analytically as well as the solution from WinBUGS. How does the posterior distribution compare to the prior distribution and the MLE, for example in terms of shapes, and the posterior means and modes for the prior and posterior versus the MLE?
c. Suppose you wanted to encode prior information that reflects beliefs that the
student is very capable and would likely correctly complete 90% of all such
tasks, but to express your uncertainty you want to assign the prior a weight
akin to 10 observations. That is, your prior belief is akin to having seen nine
successes and one failure. What prior distribution would you use?
d. Using the prior distribution chosen in part (c), obtain the posterior distribution for θ analytically as well as the solution from WinBUGS. How does the posterior distribution compare to the prior distribution and the MLE?
e. Suppose you wanted to encode prior information that reflects beliefs that the
student is not very capable and would likely correctly complete 10% of all such
tasks, but to express your uncertainty you want to assign the prior a weight
akin to 10 observations. That is, your prior belief is akin to having seen one
success and nine failures. What prior distribution would you use?
f. Using the prior distribution chosen in part (e), obtain the posterior distribution for θ analytically as well as the solution from WinBUGS. How does the posterior distribution compare to the prior distribution and the MLE?
2.4 Exercise 2.3(f) asked you to consider the situation where your prior beliefs were
akin to having seen one success and nine failures, and then you went on to observe
y = 7 successes on J = 10 attempts.
a. Keeping the same prior distribution you specified in Exercise 2.3(f), suppose you had observed 14 successes on 20 tasks. What is the posterior distribution for θ? Obtain an analytical solution for the posterior as well as the solution from WinBUGS.
b. Keeping the same prior distribution, suppose you had observed 70 successes
on 100 tasks. What is the posterior distribution for ? Obtain an analytical
solution for the posterior as well as the solution from WinBUGS.
c. What do your results from parts (a)(b) and Exercise2.3(f) indicate about the
influence of the prior and the data (likelihood) on the posterior?
2.5 Suppose you have a student who successfully completes all 10 tasks she is
presented.
a. What is the MLE of θ?
b. Using a prior distribution that encodes minimal prior information, representing beliefs akin to having seen 0 successes and 0 failures, obtain the posterior distribution for θ. Obtain an analytical solution as well as the solution from WinBUGS. How does the posterior compare to the prior distribution, as well as the MLE?
c. Suppose before observing the student's performance, you had heard from a colleague that she was a very good student, and the student would likely correctly complete 80% of all such tasks. Suppose you wanted to encode this information into a prior distribution, but only assigning it a weight akin to 10 observations. What prior distribution would you use?
d. Using the prior distribution chosen in part (c), obtain the posterior distribution for θ. Obtain an analytical solution as well as the solution from WinBUGS. How does the posterior compare to the prior distribution, as well as the MLE?
3
Conceptual Issues in Bayesian Inference
Chapter 2 described the mechanics of Bayesian inference, including aspects of model construction and representation, Bayes' theorem, and summarizing posterior distributions.
These are foundational procedural aspects of Bayesian inference, which will be instantiated
in a variety of settings throughout the book. This chapter treats foundational conceptual
aspects of Bayesian inference.
The chief aims of this chapter are to introduce concepts of Bayesian inference that will be drawn upon repeatedly throughout the rest of the book, and to advance the argument concerning the alignment of Bayesian approaches to statistical inference. In addition, this chapter serves to further characterize features of Bayesian approaches that are distinct from frequentist approaches, and may assuage concerns of those new to Bayesian inference. We suspect that readers steeped in frequentist traditions might not feel immediately comfortable treating parameters as random, specifying prior distributions, and using the language of probability in reference to parameters. For those readers, this chapter may serve as an introduction, albeit far from a comprehensive one, to the arguments and principles that motivate adopting a Bayesian approach to inference.
In Section 3.1, we cover how the prior and data combine to yield the posterior, highlighting the role of the relative amounts of information in each. We then discuss some principles used in specifying prior distributions in Section 3.2. In Section 3.3, we provide a high-level contrast between the Bayesian and frequentist approaches to inference. In Section 3.4, we introduce and connect the key ideas of exchangeability and conditional independence to each other, and to Bayesian modeling. In Section 3.5, we lay out reasons for adopting a Bayesian approach to conducting analyses. In Section 3.6, we describe four different perspectives on just what is going on in Bayesian modeling. We conclude the chapter in Section 3.7 with a summary and pointers to additional readings on these topics.
FIGURE 3.1
Prior, likelihood function, and posterior for the Beta-binomial model with y = 7 and J = 10 and three different cases defined by different prior distributions. Reading down the columns presents the prior, likelihood, and posterior for each case. Where applicable, the mode of the distribution and the MLE are indicated via a vertical line.
Summaries of the resulting posteriors are given in Table 3.1, which also lists the results from the preceding ML analysis and the Bayesian analysis using a Beta(6,6) prior.
As depicted in the first column of Figure 3.1, the Beta(1,1) distribution is uniform over the unit line, which is aligned with the interpretation that it encodes 0 prior successes and 0 prior failures in the information in the prior. The posterior distribution is the Beta(8,4) distribution. This posterior has exactly the same shape as the likelihood. This is because the posterior density function (i.e., the height of the curve for the posterior) is just the normalized product of the height of the curve for the likelihood and the height of the curve for the prior. Accordingly, the posterior mode is identical to the maximum likelihood estimate (MLE) (.70). The second column in Figure 3.1 displays the analysis using a Beta(10,2) prior that encodes nine prior successes and one prior failure. The prior here is concentrated toward larger values, and in conjunction with the likelihood that is peaked at .7, the posterior density is concentrated around values near .8. The third column in Figure 3.1 displays
TABLE 3.1
Summaries of Results for Frequentist and Bayesian Analyses of Binomial Models for Dichotomous Variables

Approach      y    J    Point Estimate    Variability    95% Interval
Frequentist   7    10   0.70 (MLE)        0.14 (SE)      (0.39, 0.90) (confidence interval)
the analysis using a Beta(2,10) prior that encodes one prior success and nine prior failures. Here, the prior is concentrated closer to 0, and the resulting posterior is concentrated more toward moderate values of θ.
Figure 3.1 illustrates the effect of the information in the prior, holding the information in the data (i.e., the likelihood) constant. In all cases, the posterior distribution is a synthesis of the prior and the likelihood. When there is little or no information in the prior (left column of Figure 3.1), the posterior is essentially a normalized version of the likelihood. To the extent that the prior carries information, it will influence the posterior such that the posterior will be located in between the prior and the likelihood. In the middle and right columns of Figure 3.1, this can be seen by noting that the posterior mode is in between the prior mode and the maximum of the likelihood.
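To make the synthesis concrete, the three cases in Figure 3.1 can be reproduced with the conjugate updating rule for the Beta-binomial model. The following Python sketch is ours, not the book's (the book's examples use WinBUGS), and the helper names are illustrative:

```python
# Conjugate Beta-binomial updates for the three priors in Figure 3.1
# (y = 7 successes in J = 10 attempts): posterior is Beta(y + a, J - y + b).
y, J = 7, 10

def posterior_params(a, b):
    # Beta(a, b) prior updated with y successes and J - y failures
    return a + y, b + J - y

def beta_mode(a, b):
    # Mode of a Beta(a, b) distribution with a, b > 1
    return (a - 1) / (a + b - 2)

for a, b in [(1, 1), (10, 2), (2, 10)]:
    pa, pb = posterior_params(a, b)
    print(f"Beta({a},{b}) prior -> Beta({pa},{pb}) posterior, mode = {beta_mode(pa, pb):.2f}")
```

The resulting posterior modes of .70, .80, and .40 correspond to the left, middle, and right columns of Figure 3.1.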
Viewed from a lens that focuses on the influence of the prior on the posterior, we may say that the posterior gets shrunk from the data (likelihood) toward the prior distribution. To further demonstrate this point, consider a situation in which we observe a student correctly complete all 10 tasks she attempts (Exercise 2.5). To model our prior beliefs that she is very capable, but with uncertainty akin to this being based on 10 observations, we employ the Beta(9,3) distribution, depicted as a dotted line in Figure 3.2. The dashed line in Figure 3.2 depicts the likelihood for y = 10 successes on J = 10 attempts. Following Bayes' theorem, the posterior is the Beta(19,3) distribution, depicted as a solid line in Figure 3.2. The likelihood monotonically increases with θ, yielding a maximum when θ = 1. Substantively, a value of 1 for θ means that the probability that the student will correctly complete any such task is 1.0; there is no chance that she will not correctly complete any such task. Although 1 is in fact the MLE, we do not think concluding that θ = 1 is reasonable.* In the Bayesian analysis, the Beta(19,3) posterior distribution falls in between the prior distribution and the likelihood. Relative to the point that maximizes the likelihood (1.0), the posterior mode of .90 is shrunk toward the prior mode of .80. Based on using the mode (or similarly the mean of .86) as a point summary, we would conclude that the student has a high probability of completing such tasks. Such a statement could be buttressed by an expression of uncertainty through, say, the posterior standard deviation of
FIGURE 3.2
Beta(9,3) prior, likelihood function, and Beta(19,3) posterior for the Beta-binomial example with y = 10 and J = 10.
* If you disagree, note that the same conclusion would be reached by seeing nine successes in nine attempts,
similarly eight successes in eight attempts, and so on down to one success in one attempt. If you think that
is reasonable, the first author would like to report that he just came back from the basketball court where he
took and successfully made a free throw. Please feel free to conclude that he will never miss a free throw.
Unfortunately, everyone who has ever played basketball with him knows better.
.07 or a 95% HPD interval of (.72, .98), meaning that the probability that θ is in the interval (.72, .98) is .95. We may view this Bayesian analysis as one in which the prior distribution reins in what is suggested by the data, which is subject to substantial sampling variability in small samples. In this case, the prior distribution guards against the somewhat nonsensical conclusion suggested by the data alone.
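The posterior summaries quoted above follow from closed-form properties of the beta distribution. A minimal Python sketch of ours (the HPD interval would additionally require a numerical search and is omitted here):

```python
import math

# Closed-form summaries of the Beta(19,3) posterior from the shrinkage
# example: Beta(9,3) prior updated with y = 10 successes in J = 10 attempts.
y, J = 10, 10
a, b = 9 + y, 3 + (J - y)                    # posterior is Beta(19, 3)

mode = (a - 1) / (a + b - 2)                 # 18/20 = 0.90
mean = a / (a + b)                           # 19/22, about 0.86
sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))   # about 0.07

print(f"mode = {mode:.2f}, mean = {mean:.2f}, sd = {sd:.2f}")
```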
This tempering of what is suggested strictly by the data in light of what is already believed is common in statistical inference, as in practices of denoting and handling outliers. It is likewise common in inference more broadly: prior beliefs prevent us from abandoning the laws of physics when we see David Copperfield make something disappear, much as they prevented the scientific community from doing so when it was reported that a particle was measured as traveling faster than the speed of light (Matson, 2011), much as they prevented the scientific community from concluding that precognition exists when certain analyses suggested it does (Bem, 2011; Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011). The point is not that such things are wrong (perhaps precognition does occur or particles do move faster than the speed of light); the point is that it is appropriate to reserve concluding as much when presented with finite data situated against a backdrop of alternative beliefs.
p(θ|y, J) = Beta(θ|y + α, J − y + β),

and expresses how the posterior distribution is influenced by the prior and the data.
Consider now what occurs as the amount of data increases. As J increases, the posterior distribution becomes increasingly dominated by the data, and the influence of the prior diminishes. Figure 3.3 illustrates the results for J = 10, 20, and 100, holding constant the prior distribution and the proportion of successes y/J = .7 (Exercise 2.4). Numerical summaries for these cases are contained in Table 3.1. Specifically, the prior distribution is the Beta(2,10), which concentrates most of the density toward lower values of θ, which does not align with the information in the data encoded in the likelihood. The precision of the data increases with J, represented by the likelihood becoming more peaked around .7. As a result, the posterior increasingly resembles the likelihood. This represents a general principle in Bayesian modeling: as the amount of data increases, the posterior increasingly resembles the likelihood, which represents the contribution of the data, rather than the prior.* That is, the data swamp the prior and increasingly drive the solution, reflecting Savage's principle of stable estimation (Edwards, Lindman, & Savage, 1963). The implication is that, for a large class of models, analysts who specify even wildly different prior distributions but allow for the possibility that the data may suggest otherwise will find that their resulting posterior distributions become increasingly similar with more data (Blackwell & Dubins, 1962; cf. Diaconis & Freedman, 1986).
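The swamping of the prior can be verified directly with conjugate updating. In this Python sketch of ours, the Beta(2,10) prior is held fixed while y/J = .7 and J grows (the J = 1000 case extends beyond the figure):

```python
# Holding the Beta(2,10) prior fixed while the sample grows with y/J = 0.7:
# the posterior mean moves from near the prior mean (2/12, about 0.17)
# toward the data proportion 0.7.
a, b = 2, 10
means = []
for y, J in [(7, 10), (14, 20), (70, 100), (700, 1000)]:
    pa, pb = a + y, b + J - y          # conjugate Beta-binomial update
    means.append(pa / (pa + pb))
    print(f"J = {J:4d}: posterior mean = {means[-1]:.3f}")
```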
* This is usually the case. Exceptions include situations where the support of the prior does not include the values suggested by the data. As an extreme case, if the prior states that the probability of a discrete parameter taking on a particular value θ0 is p(θ = θ0) = 0, no amount of data is going to yield a posterior where p(θ = θ0 | x) > 0.
FIGURE 3.3
Prior, likelihood function, and posterior for the Beta-binomial model with a Beta(2, 10) prior and three different cases defined by different sample sizes. Reading down the columns presents the prior, likelihood, and posterior for each case. Where applicable, the mode of the distribution and the MLE are indicated via a vertical line.
3.1.3 Summary of the Effects of Information in the Prior and the Data
As the preceding examples in Figures 3.1 and 3.3 depict, the posterior distribution is a synthesis of the information in the prior and the data (expressed as the likelihood). Generally speaking, the influence of the data may be increased by (a) increasing sample size and/or (b) using a more diffuse prior, reflecting less prior information. Symbolically,*

p(θ|x) ∝ p(x|θ)p(θ)  →  p(θ|x) ∝ p(x|θ)  as sample size → ∞ or prior information → 0. (3.1)
For the case of data modeled with a binomial distribution, the beta prior distribution
makes for a convenient way to express the relative amounts of information in each of the
* Exactly what prior information → 0 means needs further specification in any given case. How to specify minimally informative priors is a topic of much discussion in the literature (e.g., Bernardo & Smith's [2000] reference priors; see also Gelman et al. [2013] on alternative approaches to specifying minimally informative priors).
prior and the data. Specifically, a Beta(α, β) distribution carries information akin to α + β − 2 data points, where there are α − 1 successes and β − 1 failures. The ease of interpretability of the parameters is one reason for using a beta as a prior in this context, a topic we turn to next.
of a binomial, when combined with the binomial likelihood induced by observing the data, yields a posterior that is also a beta distribution. Hence, the beta distribution is a conjugate prior for this parameter. The concept of conjugacy is closely linked with the tractability of the likelihood, expressed in part by the presence of sufficient statistics; these are summaries of the data that capture all the relevant information about the parameter contained in the full data. We can identify a conjugate prior by expressing the conditional distribution for the data in terms of their sufficient statistics, and then viewing them as defining constants that govern the distribution of the parameter (Novick & Jackson, 1974). In the case of i.i.d. Bernoulli variables considered in Chapter 2, the sufficient statistics were the number of successes y and the total number of attempts J. The conditional probability for the data can be expressed in terms of these two summaries, as in (2.14). Viewing these as constants and the parameter θ as a random variable, we can recognize the form of the beta distribution in (2.15), that is, the conjugate prior. See Novick and Jackson (1974) for a detailed step-by-step treatment of this approach for this and other examples.
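The conjugacy claim can also be checked numerically: multiplying the likelihood by the beta prior and normalizing recovers the Beta(y + α, J − y + β) density. A grid-based Python sketch of ours (not an approach used in the book):

```python
# Grid check of conjugacy: the normalized product (likelihood x prior)
# matches the Beta(y + a, J - y + b) density evaluated on the same grid.
y, J, a, b = 7, 10, 2, 10
grid = [(i + 0.5) / 1000 for i in range(1000)]

prior = [t ** (a - 1) * (1 - t) ** (b - 1) for t in grid]   # Beta(a, b) kernel
lik = [t ** y * (1 - t) ** (J - y) for t in grid]           # binomial kernel
post = [p * l for p, l in zip(prior, lik)]
z = sum(post)
post = [v / z for v in post]                                # normalize on grid

conj = [t ** (y + a - 1) * (1 - t) ** (J - y + b - 1) for t in grid]
zc = sum(conj)
conj = [v / zc for v in conj]

max_diff = max(abs(u - v) for u, v in zip(post, conj))
print(f"max difference on the grid: {max_diff:.2e}")
```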
Conjugate priors are available when the conditional probability distribution for the data (i.e., the likelihood) is a member of the general exponential family (Bernardo & Smith, 2000). However, conjugate priors are often not available in complex multivariate models. This may be the case even when the model is constructed out of pieces that afford conjugate priors when taken individually but do not when assembled with the other pieces. In other cases, the use of a conjugate multivariate prior implies modeling a dependence between parameters that is not consistent with substantive beliefs or is overly difficult to work with conceptually. One strategy is then to specify priors for parameters separately, using priors that would be conjugate if all other parameters were known. Though complete conjugacy is lost, the computation of the posterior distribution is still somewhat simplified. These are sometimes referred to as conditionally conjugate, semiconjugate, and generalized conjugate priors (Gelman et al., 2013; Jackman, 2009). Importantly, the use of such priors can ease the computational burden and greatly speed up the necessary computing time of modern simulation-based approaches to estimating posterior distributions, a topic we treat in Chapter 5. As a consequence, many applications of Bayesian psychometric modeling employ such priors. For some software packages they are the default and, in some cases, the only options for prior distributions.
Historically, the use of conjugate priors was important to facilitate analytical computation of posterior distributions. In light of modern simulation-based estimation methods (Chapter 5), this issue is less critical than before, though certain choices are relatively less computationally demanding. Nevertheless, the modern estimation techniques have, in a sense, freed Bayesian analysts from a dependence on conjugate priors. Accordingly, we recommend consideration of substantive knowledge as the primary criterion. In many situations, we should gladly pay the price of computational complexity for accuracy in modeling one's belief structure. If the prior is flexible, computationally efficient, and easily interpretable, all the better. The development of the models in Chapters 4, 6, 8, 9, 11, 13, and 14 will involve priors specified in light of these considerations, and will further communicate what has typically been done in practice.
Specifying a prior comes down to specifying not just a family of distributions, but a particular distribution (i.e., a member of the family). Two broad strategies are described here. The first strategy involves modeling existing research. Returning to the Beta-binomial model for student performance on tasks, if prior research indicates that the probability of a student correctly completing these tasks is around .9, then the Beta(10,2) prior distribution would be a sensible choice. To the extent that existing research is substantive, this should be conveyed in the prior. For example, a Beta(α, β) can be thought of as akin to encoding
that there might be multiple peaks in the mountain range (i.e., multiple local maxima exist), and that the peak that one ends up climbing depends on where one starts the climb, as some peaks are hidden from one's view at a particular location (i.e., the global maximum might not be reached). Unfortunately, there is no guarantee that one will obtain the global maximum, as the stopping point of the estimation routine will depend on the actual starting value. Difficulties with local maxima and other threats to model convergence are usually exacerbated in complex models. A related difficulty is that many ML (and other frequentist) estimation routines typically involve derivatives of the likelihood function, which may be difficult to obtain in complex models. Software built for statistical modeling families, such as generalized linear latent and mixed models (Rabe-Hesketh, Skrondal, & Pickles, 2004) or structured covariance matrices (Muthén & Muthén, 1998–2013), obviates this concern for models that can be framed as a member of that family. However, for models that cannot be defined in those terms and cannot be fitted by the software, analysts are left to their own devices to obtain the derivatives and implement the estimation routines. This is likely a deterrent to using such models.
In a Bayesian analysis, the terrain of interest is a different mountain range, namely the posterior distribution. As a synthesis of the prior distribution and the likelihood, the posterior distribution falls in between its two ingredients, as seen in Figures 3.1 through 3.3. And instead of seeking the highest peak, a fully Bayesian analysis seeks to map the entire terrain of peaks, valleys, and plateaus in this mountain range. That is, in a Bayesian analysis the solution is not a point, but rather a distribution.* This difference is relevant to the conceptual foundations of Bayesian analyses of interest here, and has implications for estimation discussed in Chapter 5.
As illustrated in Figure 3.1, the mountain ranges may be fairly similar (column 2) or quite different (column 3), as they depend on the relative strength of the information in the prior and the likelihood, and the degree of agreement of that information. With large sample sizes and/or diffuse priors, the posterior distribution usually strongly resembles the likelihood. If the prior is quite diffuse, the posterior is essentially a normalized likelihood. In this case, the mode of the posterior distribution is theoretically equivalent to the MLE. Similarly, as sample size increases, the data tend to swamp the prior so that the likelihood comes to dominate the posterior (Figure 3.3); the posterior mode is then asymptotically equivalent to the MLE. More generally, it can be shown that, under fairly general conditions, the posterior distribution asymptotically converges to a normal distribution, the central tendency and variability of which can be approximated by the maximum and curvature of the likelihood (Gelman et al., 2013). As a result, the posterior
* The goal of obtaining the posterior distribution for unknowns lies at the heart of Bayesian analysis. This goal is fully and immediately achievable with conjugate priors. For models that do not enjoy the benefit of having conjugate priors at the ready, this goal has historically been aspirational rather than achievable, and analysts have relied on tools to approximate the full posterior distribution. For example, in maximum a posteriori estimation in item response theory, we apply the same computational algorithms as used in ML, only now applied to the posterior. The resulting point identified as the maximum and the analytical expression of the spread of the surface are then interpreted as features of the full posterior and used to flesh out a normal approximation of the full posterior. As we discuss in more detail in Chapter 5, breakthroughs in statistical computing and the recognition of how they can be leveraged in Bayesian modeling have opened up the possibilities of obtaining increasingly better estimates of the posterior distribution.
In many cases a uniform prior at least conceptually represents a maximally diffuse prior, and we employ such a specification to convey the ideas about a maximally diffuse prior in our context of binomial models. However, the application of uniform priors to enact a diffuse prior is not without issue, as a uniform prior on one scale need not yield a uniform prior when translated to another scale. These issues and alternative approaches to specifying diffuse or noninformative priors are discussed by Gelman et al. (2013).
mode (and mean and median) is asymptotically equivalent to the MLE, and the posterior standard deviation is asymptotically equivalent to the ML standard error. The asymptotic similarity between posterior summaries and ML estimates provides a connection between Bayesian and frequentist modeling. Frequentist analyses can often be thought of as limiting cases of Bayesian analyses, as sample size increases to infinity and/or the prior distribution is made increasingly diffuse. Bayesian estimation approaches, which map the entire mountain range, may therefore be useful for exploring, maximizing, and finding multimodality in the likelihood.
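The asymptotic agreement can be illustrated numerically. In this Python sketch of ours (the Beta(2,10) prior and the sample sizes are arbitrary choices), the posterior mode and standard deviation approach the MLE and the ML standard error sqrt(p(1 − p)/J) as J grows with y/J = .7 fixed:

```python
import math

# As J grows with y/J = 0.7 fixed, the posterior mode approaches the MLE
# (0.7) and the posterior sd approaches the ML standard error.
a, b = 2, 10                                  # Beta(2, 10) prior
for J in [10, 100, 1000, 10000]:
    y = int(0.7 * J)
    pa, pb = a + y, b + J - y                 # conjugate update
    mode = (pa - 1) / (pa + pb - 2)
    sd = math.sqrt(pa * pb / ((pa + pb) ** 2 * (pa + pb + 1)))
    se = math.sqrt(0.7 * 0.3 / J)             # ML standard error at p = 0.7
    print(f"J = {J:5d}: posterior mode = {mode:.3f}, sd = {sd:.4f}, ML SE = {se:.4f}")
```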
In finite samples, when the prior is fairly diffuse and/or the sample size is large enough, the posterior summaries are often approximately equal to the estimates from ML. As the results in Table 3.1 illustrate, it is often the case that values emerging from frequentist and Bayesian analyses are similar: the MLE is similar to the posterior mode and sometimes the mean (or any other measure of central tendency), the standard error is similar to the posterior standard deviation, and the confidence interval is similar to a posterior credibility interval. Importantly, however, even when posterior point summaries, standard deviations, and credibility intervals are numerically similar to their frequentist counterparts, they carry conceptually different interpretations. Those arising from a Bayesian analysis afford probabilistic statements and reasoning about the parameters; those arising from a frequentist analysis do not. In a frequentist analysis, distributional notions are invoked by appealing to notions of repeated sampling and apply to parameter estimators, that is, the behavior of parameter estimates on repeated sampling.
To illustrate this point, consider the situation in which we observe a student correctly complete all 10 tasks she attempts (see Exercise 2.5). However, rather than model prior beliefs that the student is very capable, let us model maximal uncertainty by using the Beta(1,1) prior distribution. The analysis is depicted in Figure 3.4. As the prior is uniform, the Beta(11,1) posterior is simply a normalization of the likelihood, and takes on exactly the same shape as the likelihood, such that in Figure 3.4 the curve for the posterior obscures that of the likelihood. Accordingly, the posterior mode (i.e., the value that maximizes the posterior density) of 1 is the same value that maximizes the likelihood. However, their interpretations are different. The posterior mode is a summary of the (posterior) distribution; the likelihood is not a distribution. In this analysis, we have encoded no prior information and found that the posterior is completely driven by the likelihood and that a summary of the posterior is numerically identical to a corresponding feature of the likelihood. But as a Bayesian analysis,
FIGURE 3.4
Beta(1,1) prior, likelihood function, and Beta(11,1) posterior for the Beta-binomial example with y = 10 and J = 10.
As the prior is uniform, the shape of the posterior is the same as, and visually obscures, the likelihood.
* Although Bayesian and likelihood solutions are numerically similar in straightforward problems, this need not be the case with more complicated models such as hierarchical models, models with covariate structures, multiple sources of evidence, or complex patterns of missingness. The Bayesian approach provides a principled way to think about them. Historically this was not the case, but now the Bayesian machinery is more flexible than the frequentist machinery for tackling such problems, and it is always the same way of thinking about them (Clark, 2005).
In the case of a binomial, an exact confidence interval for the probability of success can be obtained even when the MLE is 1 (e.g., Agresti & Coull, 1998). In complex models where approximate confidence intervals are used based on standard errors, boundaries pose problems for the construction of those intervals.
Let us return to the context where x = (x1, …, xJ) is a collection of J Bernoulli variables (e.g., coin flips). We can think about exchangeability in a few ways, with varying levels of statistical formality. At a narrative level, absent any statistical aspects, exchangeability amounts to saying that we have the same beliefs about the variables in question (prior to observing their values, of course). Do we think anything different about the first variable (coin flip) than the second? Or the third? Do we think anything different about any of the variables? If the answers to these questions are no, we may treat the variables as exchangeable.
A bit more formally, the collection of variables is exchangeable if the joint distribution p(x1, …, xJ) is invariant to any re-ordering of the subscripts. To illustrate, for the case of 1 success in 5 variables, asserting exchangeability amounts to asserting that

p(1, 0, 0, 0, 0) = p(0, 1, 0, 0, 0) = p(0, 0, 1, 0, 0) = p(0, 0, 0, 1, 0) = p(0, 0, 0, 0, 1).

This amounts to asserting that only the total number of successes is relevant; it is irrelevant where in the sequence the successes occur (Diaconis & Freedman, 1980a). More generally, a collection of random variables is exchangeable if the joint distribution is invariant to any permutation (reordering, relabeling) of the random variables.
A remarkable theorem proved by de Finetti (1931, 1937/1964), and later generalized by Hewitt and Savage (1955), states that for the current case of dichotomous variables, in the limit as J → ∞,

p(x1, …, xJ) = ∫₀¹ ∏_{j=1}^J θ^{x_j} (1 − θ)^{1 − x_j} dF(θ), (3.2)

where F(θ) is a distribution function for θ. The left-hand side of (3.2) is the joint distribution of the variables. The first term inside the integral on the right-hand side, ∏_{j=1}^J θ^{x_j} (1 − θ)^{1 − x_j}, is the joint probability for a collection of i.i.d. Bernoulli variables conditional on the parameter (see Equation 2.8). The second term inside the integral, F(θ), is a distribution function for θ. Thus, de Finetti's so-called representation theorem reveals that the joint distribution of an infinite sequence of exchangeable dichotomous variables may be expressed as a mixture of conditionally independent distributions. Joint distributions for finite sets of exchangeable variables can be approximated by conditional i.i.d. representations, with decreasing errors of approximation as J increases (Diaconis & Freedman, 1980b).
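Invariance to ordering can be checked numerically for a finite sequence when F is taken to be a Beta(α, β) distribution, in which case the mixture yields p(x1, …, xJ) = B(α + s, β + J − s)/B(α, β), with s the number of successes and B the beta function. A Python sketch of ours (the Beta(2,10) choice is arbitrary):

```python
import math
from itertools import permutations

# Marginal probability of a 0/1 sequence under a Beta(a, b) mixture of
# Bernoullis: p(x) = B(a + s, b + J - s) / B(a, b), where s = sum(x).
def seq_prob(x, a=2.0, b=10.0):
    s, J = sum(x), len(x)
    beta = lambda p, q: math.gamma(p) * math.gamma(q) / math.gamma(p + q)
    return beta(a + s, b + J - s) / beta(a, b)

# Every ordering of 1 success among 5 trials has the same probability:
probs = {perm: seq_prob(perm) for perm in set(permutations((1, 0, 0, 0, 0)))}
print(sorted(probs.values()))
```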
De Finetti's theorem has been extended to more general forms, with real-valued variables and mixtures over the space of distributions (Bernardo & Smith, 2000). In our development in the rest of this chapter, we will focus on the case where F(θ) is absolutely continuous, in which case we obtain the probability density function for θ, p(θ) = dF(θ)/dθ, and the theorem may be expressed as

p(x1, …, xJ) = ∫ ∏_{j=1}^J p(x_j|θ) dF(θ), (3.3)

or

p(x1, …, xJ) = ∫ ∏_{j=1}^J p(x_j|θ) p(θ) dθ. (3.4)
If the variables are independent, the joint distribution simplifies to the product of the marginal distributions

p(x1, x2) = p(x1)p(x2). (3.6)

The lone difference between (3.5) and (3.6) is that in the latter, the distribution for x2 is not formulated as conditional on x1. In other words, if the variables are independent then the conditional distribution for x2 given x1 is equal to the marginal distribution for x2 (without regard or reference to x1):

p(x2|x1) = p(x2). (3.7)

This reveals the essence of what it means for the two variables to be independent or unrelated: the distribution of x2 does not change depending on the value of x1. From the epistemic probability perspective, independence implies that learning the value of x1 does not change our beliefs about (i.e., the distribution for) x2.
It is very common for variables to be dependent (related) in assessment applications.
For example, consider variables derived from scoring student responses to tasks in the
same domain. Performance on one task is likely related to performance on the other; stu-
dents who tend to perform well on one task also tend to perform well on the other, while
students who struggle with one task also tend to struggle with the other. In this case,
the variables resulting from scoring the student responses will be dependent. In terms
FIGURE 3.5
Graphical representation of the right-hand side of (3.3) or (3.4), illustrating conditional independence of the x's given θ, in line with exchangeability.
Conceptually, (3.8) expresses that although x1 and x2 may be dependent (related), once θ is known they are rendered independent. More generally, a set of J variables x1, …, xJ are conditionally independent given θ if

p(x1, …, xJ|θ) = ∏_{j=1}^J p(x_j|θ).
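The contrast between marginal dependence and conditional independence given θ can be illustrated by simulation. In this Python sketch of ours (the Beta(2,2) population for θ and the grouping band are arbitrary choices), two Bernoulli variables driven by a common θ are related marginally but unrelated once θ is approximately held fixed:

```python
import random

# Simulate pairs (x1, x2) that share a common latent theta: marginally
# dependent, but conditionally independent given theta.
random.seed(7)

def draw_pair():
    theta = random.betavariate(2, 2)                 # latent parameter
    return theta, (random.random() < theta), (random.random() < theta)

pairs = [draw_pair() for _ in range(200000)]

# Marginal dependence: P(x2 = 1 | x1 = 1) exceeds P(x2 = 1)
p_x2 = sum(x2 for _, _, x2 in pairs) / len(pairs)
given_x1 = [x2 for _, x1, x2 in pairs if x1]
p_x2_given_x1 = sum(given_x1) / len(given_x1)
print(f"P(x2) = {p_x2:.3f}, P(x2 | x1 = 1) = {p_x2_given_x1:.3f}")

# Conditional independence: restricting to theta near 0.5 removes the link
mid = [(x1, x2) for theta, x1, x2 in pairs if 0.45 < theta < 0.55]
p_mid = sum(x2 for _, x2 in mid) / len(mid)
band = [x2 for x1, x2 in mid if x1]
p_mid_given = sum(band) / len(band)
print(f"Near theta = 0.5: P(x2) = {p_mid:.3f}, P(x2 | x1 = 1) = {p_mid_given:.3f}")
```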
may be. Instead, we may be willing to assert that we have the same beliefs for all units in
the treatment group, and the same beliefs for all units in the control group. That is, we
may assert partial or conditional exchangeability, where exchangeability holds conditional on
group membership. This motivates a multiple-group model, where all the units of the treatment
group are specified as following one probability model and all the units of the control
group are specified as following a different probability model. More generally, we may have
any number of covariates or explanatory variables that are deemed relevant. Once all such
variables are included in the model, (conditional) exchangeability can be asserted.
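The logic of conditional independence can be illustrated with a small simulation. The sketch below is not from the text; the latent-proficiency setup (theta plus noise) and all numeric values are illustrative assumptions. Two task scores are related marginally, but once the latent quantity they share is held fixed, the relationship vanishes.

```python
import random

random.seed(7)

# Each simulated student has a latent proficiency theta; given theta,
# the two task scores are independent draws. Marginally (across
# students), the two scores are positively related.
def simulate(n=20000):
    triples = []
    for _ in range(n):
        theta = random.gauss(0, 1)        # latent proficiency
        x1 = theta + random.gauss(0, 1)   # task 1 score
        x2 = theta + random.gauss(0, 1)   # task 2 score, independent of x1 given theta
        triples.append((theta, x1, x2))
    return triples

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

data = simulate()
marginal = corr([t[1] for t in data], [t[2] for t in data])       # near 0.5
conditional = corr([t[1] - t[0] for t in data],
                   [t[2] - t[0] for t in data])                   # near 0
```

The marginal correlation reflects the shared theta; subtracting theta (i.e., conditioning on it) leaves only the independent noise terms.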
p(θ | θ_P) = p(θ1, …, θn | θ_P) = ∏_{i=1}^{n} p(θi | θ_P). (3.10)
The situation here is one in which the parameters of the original model, θ, have a structure
in terms of other, newly introduced parameters θ_P, termed hyperparameters* (Lindley &
Smith, 1972). For example, a specification that the examinee variables are normally distributed
may be expressed as p(θi | θ_P) = N(θi | μ, σ²), where θ_P = (μ, σ²) are the hyperparameters.
If values of θ_P are known or chosen to be of a fixed value, the prior specification
is complete. If they are unknown, they of course require a prior distribution, which may
depend on other parameters. This sequence leads to a hierarchical specification, with
parameters depending on hyperparameters, which may in turn depend on other, higher-order
hyperparameters.
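A hierarchical specification can be sketched as a generative chain: draw hyperparameters from a hyperprior, draw parameters given the hyperparameters, then draw data given the parameters. The particular values below (mu_P = 0, sigma_P = 10, and the within-level standard deviations) are illustrative assumptions, not from the text.

```python
import random

random.seed(3)

# One forward pass through the hierarchy: hyperprior -> prior -> data.
mu_P, sigma_P = 0.0, 10.0                  # fixed hyper-hyperparameters (assumed)
mu = random.gauss(mu_P, sigma_P)           # hyperparameter mu drawn from its hyperprior
thetas = [random.gauss(mu, 1.0) for _ in range(10)]   # theta_i | mu ~ N(mu, 1)
scores = [random.gauss(t, 0.5) for t in thetas]       # x_i | theta_i ~ N(theta_i, 0.5^2)

# With hyperparameters fixed to known values instead, the prior for the
# theta_i is fully specified and no further level is needed.
thetas_fixed = [random.gauss(0.0, 1.0) for _ in range(10)]
```

If mu were itself given a distribution with unknown parameters, the chain would simply extend one level further, mirroring the "higher-order hyperparameters" described above.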
The implications of conditional exchangeability discussed in Section 3.4.4 hold analogously
here. If the parameters are not deemed exchangeable, it is still possible that they be
conditionally exchangeable. Returning to the example, if the examinees come from different
groups, conditional exchangeability implies that we may construct the prior distribution as

p(θ | θ_P) = ∏_{g=1}^{G} ∏_{i=1}^{n_g} p_g(θ_{ig} | θ_{P_g}), (3.11)
* The use of the subscript P in θ_P is adopted to signal that these are the parameters that govern the prior distribution
for θ. Similarly, the subscript indicates that, in a directed acyclic graph representation, the elements
of θ_P are the parents of θ.
The upshot in this approach to building prior distributions is that exchangeability sup-
ports the specification of a common prior for each element conditional on some other, pos-
sibly unknown, parameters. This has become the standard approach to specifying prior
distributions in highly parameterized psychometric models, and we will make consider-
able use of it throughout. Further, this connects with a particular conception of Bayesian
modeling as building out, described in Section 3.6. We will see in Chapter 15 that the
purpose and context of inference can be an important factor in how, and indeed whether,
we include these kinds of structures into a model. In educational testing, even with the
same test and the same examinees, it can be ethical to do so in one situation and unethical
in another; even legal in one situation and illegal in another!
* Really, they are easier-to-think-about lower-dimensional distributions. We have considered the case of univariate
distributions, but in principle they could be multivariate, say, if each p(x_j | θ) was a joint distribution of a
set of (fewer than J) entities.
The exchangeability theorem is powerful in that it provides a structure for the distribu-
tion in terms of the marginalized conditional independence expression (part 1). In and of
itself, the exchangeability representation theorem is just a piece of mathematical machin-
ery. In terms of Figure 1.4, it resides in the model-space, and it is unassailable. Importantly,
we cannot be wrong in adopting the marginalized conditional independence form, if we
truly have no distinguishing information or have chosen not to use it in the model we are
positing.
But the exchangeability theorem is vacuous in the sense that, though this expression holds, no
particular form is specified. Things get interesting when we come to actually make the
specifications for the model, for the forms of the distributions (part 2). These vary consid-
erably, and modelers face the challenge of positing models that reflect what they know
substantively about the problem at issue. A goal of the book is to show how many psycho-
metric models are variations on the theme, with different particular forms or extensions,
motivated by what we know about people, psychology, and assessment.
When we posit particular functional forms for the distributions, however, we are taking
an epistemic stance of using a more constrained probability model. Exchangeability qua
exchangeability still holds as a necessarily-true feature of our reasoning. The same can be
said about the conditional independence relationships, even when populated with par-
ticular parametric forms for the distributions. In terms of Figure 1.4, the exchangeability
theorem, the marginalized conditional independence structure, and even the particular
parametric forms reside in the model-space, and there they are unassailable as structures
that represent our belief at a given point in time. Populating the marginalized conditional
independence structure with particular parametric forms accomplishes two things. First,
it provides content to the structure justified by the exchangeability theorem. The theorem
is true, but it does not get us far enough. Particular parametric forms are needed for us
to get work done, in terms of model-based implications that guide inferences and yield
substantive conclusions. Second, it gets us to a place where we can critique the model, in
terms of its match to the real-world situation. Included among the model-based implica-
tions generated by the use of particular distributional forms are implications for the data.
In terms of Figure 1.4, when we add particular distributional specifications we can project
from the model-space back down to the real-world situation in the lower level of the
figure. And in this way the specifications can be wrong in that they yield or permit joint
probabilities that are demonstrably at odds with the data. Techniques for investigating
the extent and ways in which the implications of a model are at odds with the data are the
subject of Chapter10.
probability perspective. Prior distributions express our uncertainty about the parameters
before the data are incorporated; posterior distributions express uncertainty about the
parameters after the data are incorporated.
* A uniform prior is an actual probability distribution in the case of finite distributions, and in the absence of other
considerations is the greatest uncertainty (maximal entropy) one can express. For variables that can take
infinitely many values, a uniform distribution might have an infinite integral and thus not be an actual distribution.
A uniform prior is proper on [0, 1], but one on (−∞, ∞) is not. Sometimes the Bayesian machinery works
anyway, producing a true posterior. A more strictly correct Bayesian approach is to use a proper but extremely
diffuse prior, such as N(0, 10⁵) for a real-valued parameter, or, with prior distributions that afford them, specifying
parameter values that maximize entropy (Jaynes, 1988). Gelman et al. (2013) discussed approaches to
specifying noninformative priors.
great advantage in other modeling contexts. For example, this same machinery allows for
partial pooling among groups in multilevel modeling, which lies between the extremes
of complete pooling of all the groups or no pooling among groups (Gelman & Hill, 2007;
Novick et al., 1972).
This modeling flexibility is an important advantage, but is actually somewhat tangential
to the current point, which is that prior probability judgments and assertions are always
present, even when the analysis does not use Bayesian methods for inference. The inclusion
of certain parameters and the exclusion of others (i.e., fixing them to be 0) in a frequentist
analysis amounts to two particular probabilistic beliefs: complete uncertainty about the
former and complete certainty about the latter. That is, any and every structure of the model
amounts to an expression of prior beliefs, in ways that can be viewed as distributions. This
holds for the other so-called model assumptions or model specifications, typically thought
to be distinct from prior probabilistic beliefs. This can readily be seen by embedding a
model into a larger one. For example, modeling a variable as normally distributed (e.g.,
errors in typical regression or factor analysis models) may be viewed as modeling the variable
as a member of a larger class of distributions with all the prior probability concentrated
in such a way that yields the normal distribution (e.g., as a t distribution where the prior
probability for the degrees of freedom has all its mass at a value of infinity). Generally, all
of our model assumptions (effects/parameters that are included or excluded, functional
forms of relationships, distributional specifications, and so on) are the result of prior
probability judgments. Box tidily summarized this view (1980, p. 384), remarking that
In the past, the need for probabilities expressing prior belief has often been thought of,
not as a necessity for all scientific inference, but rather as a feature peculiar to Bayesian
inference. This seems to come from the curious idea that an outright assumption does
not count as a prior belief. … [I]t is impossible logically to distinguish between model
assumptions and the prior distribution of the parameters. The model is the prior in
the wide sense that it is a probability statement of all the assumptions currently to be
tentatively entertained a priori.
Our view is that the distinction between model specifications/assumptions, which are
central features in conventional modeling, and prior distributions, which are excluded
from such traditions, is more terminological than anything else. The distinction may be
useful for structuring and communicating modeling activities with analysts steeped in
different traditions, but too strict adherence to this distinction may obscure that what is
important is the model as a whole, and the boundaries between parts of the model may be
false. Both model specifications/assumptions and prior distributions involve subjec-
tive decisions (Winkler, 1972); analysts who wish to employ the former but prohibit the
latter on the grounds that the latter are less legitimate have their philosophical work cut
out for them. We prefer to think about model specifications as prior specifications. They
reflect our beliefs about the salient aspects of the real-world situation at hand before we
have observed data, and are subject to possible revision as we learn from data and update
our beliefs. Procedures for learning about the weaknesses (and strengths) in our model
and the theories they represent are discussed in Chapter10.
Historically, disagreement regarding the propriety of incorporating prior beliefs was the
main source of contention between Bayesian and frequentist inference (Weber, 1973). This is
less of an issue currently, partly because of the theoretical considerations such as those just
discussed, and partly because of the recognition that the critique that a Bayesian approach
with priors is subjective (used as a pejorative term in the critique) is undermined by
the recognition that frequentist approaches hold no stronger claim to objectivity (Berger &
Berry, 1988; Lindley & Philips, 1976).* It is also less of an issue because very mild
priors can be applied when we do not have information that would be brought in properly
even in the eyes of frequentists, and because the real issue is now seen not as whether to
use prior information, but as whether to move into the framework where everything
is in the same probability space, and we can make probability statements about any vari-
ables, with any conditioning or marginalizing we might want to do. In complex models
this is really the only practical way to carry out inference in problems people need to
tackle substantively (Clark, 2005; Stone, Keller, Kratzke, & Strumpfer, 2014).
That all modeled assumptions may be seen as prior probability judgments amounts to
a connection between Bayesian approaches and the whole enterprise of statistical model-
ing. But as described next, there are still deeper connections between the formation of
statistical models from first principles and a Bayesian approach that specifies parameters
distributionally.
The point is that exchangeability produces p(θ) and demonstrates the soundness of
the subsequent Bayesian manipulations. … Thus, from a single assumption of exchangeability
the Bayesian argument follows. This is one of the most beautiful and important
results in modern statistics. Beautiful, because it is so general and yet so simple.
Important, because exchangeable sequences arise so often in practice. If there are, and
we are sure there will be, readers who find p(θ) distasteful, remember it is only as distasteful
as exchangeability; and is that unreasonable?
* It may be further argued that Bayesian approaches offer something more than frequentist approaches on this
issue, namely transparency of what is subjective (Berger & Berry, 1988).
We are also not asserting that the preceding developments and de Finetti's theorem
imply that a parameter exists in any ontological sense; from exchangeability and de
Finetti's theorem, we do not conclude that there is some previously unknown parameter,
lurking out there in the world just waiting to be discovered. Viewing probability expressions
as reflections of beliefs and uncertainty sheds a particular light on the meaning of
de Finetti's theorem. The left-hand side is the joint distribution of the variables, which
amounts to an expression of our beliefs about the full collection of variables. What de
Finetti's theorem shows is that if we view the variables as exchangeable, we can parameterize
this joint distribution (indeed, our beliefs) by specifying the variables as conditionally
independent given a parameter, and a distribution for that parameter.
To summarize, probability-model-based reasoning begins with setting up the joint distribution
(Gelman et al., 2013), which represents the analyst's beliefs about the salient aspects
of the problem. If we can assert exchangeability, or once we condition on enough variables
to assert exchangeability, de Finetti's theorem permits us to accomplish this daunting
task by invoking conditional independence relationships (manufacturing entities to
accomplish this if needed), which permits us to break the problem down into smaller,
more manageable components. That is, we can specify a possibly high-dimensional joint
distribution by specifying more manageable conditional distributions given parameters,
and a distribution for the parameters.
Prior probability distributions for parameters are then just some of the building blocks
of our models used to represent real-world situations, on par with other features we
broadly refer to as specifications or assumptions common in Bayesian and frequentist mod-
eling, including those regarding the distribution of the data or likelihood (e.g., linearity,
normality), characteristics or relationships among persons (e.g., independence of persons
or clustering of persons into groups), characteristics of the parameters (e.g., discrete or con-
tinuous latent variables, the number of such variables, variances being greater than 0), and
among parameters (e.g., certain parameters are equal). What is important in the Bayesian
perspective is modeling what features of situations we think are salient and how we think
they might be related, using epistemic tools such as conditional exchangeability. In this
light, the use of prior distributions is not a big deal. It is the admission price for being able
to put all the variables in the framework of probability-based reasoning: not only from
parameters to variables, as frequency-based statistics can do, but also from variables to
parameters, or some variables to other variables, or current values of variables to future
values, and so on, all in terms of sound and natural expression of beliefs about variables
within the model framework through probability distributions.
We specify priors based on a mixture of considerations of substantive beliefs, compu-
tational demands, and ease of interpretability and communication. Importantly, the same
can be said of specifying distributions of the data that induce the likelihood function.
Those new to Bayesian methods are often quick to question the specification of prior distri-
butions, or view the use of a particular prior distribution with a skeptical eye. We support
this skepticism; like any other feature of a model, it and its influence should be questioned
and justifications for it articulated. Analysts should not specify prior distributions without
having reasons for such specifications.
In addition, sensitivity analyses, in which solutions from models using different priors
are compared, can reveal the robustness of the inferences to the priors or the unanticipated
effects of the priors. Figure 3.1 illustrates how this could be done for the beta-binomial
model, where the analyst may see how the use of different prior distributions affects the
substantive conclusions. The importance of the prior in influencing the posterior depends
also on the information in the data as the posterior represents a balancing of the contribu-
tions of the data, in terms of the likelihood, and the prior. As the relative contribution of
the data increases, the posterior typically becomes less dependent on the prior and more
closely resembles the likelihood in shape, as illustrated in Figure 3.3.
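A sensitivity analysis of the kind described can be sketched with the conjugate beta-binomial update. The data values here (7 correct responses out of 10) and the three prior settings are invented for illustration; they are not the values from Figure 3.1.

```python
# Conjugate beta-binomial update: a Beta(a, b) prior combined with
# `successes` out of `trials` yields a Beta(a + successes, b + failures)
# posterior.
def beta_posterior(a, b, successes, trials):
    return a + successes, b + (trials - successes)

def beta_mean(a, b):
    return a / (a + b)

y, J = 7, 10   # illustrative data: 7 correct out of 10
priors = {"uniform": (1, 1), "skeptical": (2, 8), "optimistic": (8, 2)}

posterior_means = {name: beta_mean(*beta_posterior(a, b, y, J))
                   for name, (a, b) in priors.items()}
# uniform -> 8/12; skeptical -> 9/20; optimistic -> 15/20
```

Comparing the three posterior means shows how much the prior moves the conclusion with only 10 observations; repeating the exercise with larger J would show the estimates converging, in line with the point about the relative contribution of the data.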
The role of the prior distribution is often the focus of criticism from those skeptical of
Bayesian inference, but the aforementioned issues of justification and examination apply
to other features of the model, including the specification of the likelihood. We disagree
with the perspective that views a prior as somehow on a different ontological or epistemo-
logical plane from the likelihood and other model assumptions. Viewed from the broader
perspective of model-based reasoning, both the prior and the likelihood have the same
status. They are convenient fictions, false but hopefully useful accounts we deploy to
make the model reflect what is believed about the real-world situation, that reside in the
model-space as part of the larger model used to enact reasoning to arrive at inferences.
This equivalence is illuminated once we recognize that elements of a likelihood may be
conceived of as elements of the prior, more broadly construed.
in psychometrics; see also Clark, 2005). One way to incorporate substantive knowledge
is via prior distributions, which can rein in estimates that, based on the likelihood alone,
may be extreme. This is particularly salient in situations with small samples and/or sparse
data, in which case sampling variability is high. Bayesian approaches, even with fairly
diffuse prior distributions, have been shown to perform as well or better than frequentist
methods with small samples (e.g., Ansari & Jedidi, 2000; Chung, 2003; Chung, Lanza, &
Loken, 2008; Depaoli, 2013; Finch & French, 2012; Hox, van de Schoot, & Matthijsse, 2012;
Kim, 2001; Lee & Song, 2004; Muthén & Asparouhov, 2012; Novick, Jackson, Thayer, &
Cole, 1972; Sinharay, Dorans, Grant, & Blew, 2009). We discuss this issue in more detail
in Chapter 11, where prior distributions can serve to regularize estimates or adjudicate
between competing sets of estimates of parameters in item response theory models.
p(θ | x1, x2) ∝ p(x1, x2 | θ)p(θ)
= p(x2 | x1, θ)p(x1 | θ)p(θ)
= p(x2 | θ)p(x1 | θ)p(θ)
∝ p(x2 | θ)p(θ | x1). (3.12)
The second line in (3.12) follows from a factorization of the conditional probability of
the data. The third line follows from the assumption that x1 and x2 are conditionally
independent given θ. The last line follows from recognizing that, from Bayes' theorem,
p(θ | x1) ∝ p(x1 | θ)p(θ). The right-hand side of the last line takes the form of the right-hand
side of Bayes' theorem in (2.1), where p(x2 | θ) is the conditional probability of the (new)
data, and p(θ | x1) plays the role of the prior distribution for θ, that is, prior to having
observed x2. But p(θ | x1) is just the posterior distribution for θ given x1. Equation (3.12)
reveals how a Bayesian approach naturally accommodates the arrival of data and updating
beliefs about unknowns. We begin with a prior distribution for the unknowns, p(θ).
Incorporating the first dataset, we have the posterior distribution p(θ | x1), which in turn
serves as the prior distribution when incorporating the second dataset, x2. At any point, our
current distribution is both a posterior distribution and a prior distribution; it is posterior
to the past data, and prior to future data. Today's posterior is just tomorrow's prior, and
the updating is facilitated by Bayes' theorem. In psychometrics, this line of thinking supports
adaptive testing discussed in Chapter 11, with θ modeled as constant over time, and
models for student learning over time discussed in Chapter 14, with θ modeled as possibly
changing over time.
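The "today's posterior is tomorrow's prior" idea can be checked numerically with a conjugate normal-mean updater. The sketch below uses invented values (a N(0, 1) prior, data precision 1, and a handful of made-up observations); the point is that two sequential updates land in exactly the same place as one joint update.

```python
def update_normal_mean(mu0, tau0, xs, tau):
    """Conjugate update for a normal mean with known data precision tau:
    prior with mean mu0 and precision tau0 -> posterior mean and precision."""
    n = len(xs)
    xbar = sum(xs) / n
    tau_post = tau0 + n * tau
    mu_post = (tau0 * mu0 + n * tau * xbar) / tau_post
    return mu_post, tau_post

# Two sequential updates: the first posterior becomes the second prior ...
m1, t1 = update_normal_mean(0.0, 1.0, [1.2, 0.7], tau=1.0)
m2, t2 = update_normal_mean(m1, t1, [0.3], tau=1.0)

# ... which matches a single update with all the data at once.
m_all, t_all = update_normal_mean(0.0, 1.0, [1.2, 0.7, 0.3], tau=1.0)
```

The agreement of `m2` with `m_all` is exactly the factorization in (3.12) at work: conditioning on the data in stages or all at once yields the same posterior.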
3.5.9 Pragmatics
One need not subscribe to the philosophy sketched here to employ Bayesian methods. One
can adopt different orientations toward probability, parameters/latent variables, measure-
ment, modeling, and inference, and still gainfully employ Bayesian methods. For those
who adopt an alternative philosophical outlook, or do not believe philosophical positions
have any bearing, there is another class of arguments that can be advanced in favor of
adopting a Bayesian approach to inference. These are more practical in their focus, and
advocate adopting a Bayesian approach to the extent that it is useful. These arguments
advance the notion that adopting a Bayesian approach allows one to fit models and reach
conclusions about questions more readily than can be done using frequentist approaches,
and allows for the construction and use of complex models that are more aligned with
features of the real world and the data (Andrews & Baguley, 2013; Clark, 2005). See Levy
(2009); Levy et al. (2011); Levy and Choi (2013); and Rupp, Dey, and Zumbo (2004) for dis-
cussions and recent reviews of Bayesian approaches to psychometrics and related models
that survey a number of applications that trade on the utility of Bayesian methods.
and procedures to those steeped in frequentist approaches, and accordingly we make use
of it throughout the book.
A fourth conceptualization is one of model building or expansion, in which model con-
struction occurs in stages. This begins not by assuming parameters, but instead by seek-
ing to specify a distribution for the data, x (Bernardo & Smith, 2000). If we can assert
exchangeability, we may structure the distribution based on parameters θ, as p(x | θ). If θ
is unknown, it requires a distribution. In simple cases, a distribution is specified as p(θ).
This perspective does not seem to offer much in simple models, but is especially advantageous
in specifying complex, multivariate models which can be framed as having multiple
levels. That is, p(θ) may be framed as conditional on parameters, say θ_P, as p(θ | θ_P)
in (3.10). If θ_P are unknown, they too require a prior distribution p(θ_P), which itself may
depend on parameters, possibly unknown, and so on. The distributions at the different
levels may have their own covariates and exchangeability structures, such as student
characteristics and school characteristics. This perspective of building out the model is
strongly aligned with strategies in multilevel modeling (e.g., Gelman & Hill, 2007), and is
often framed as a hierarchical specification of the model and prior distributions (Lindley &
Smith, 1972). This perspective has come to dominate Bayesian modeling for a wide variety
of psychometric and complex statistical models (Jackman, 2009). At various points in our
treatment of Bayesian psychometric modeling, we will adopt each of these perspectives on
modeling to bring out certain inferential issues.
their probability assessment, may be found in Almond (2010), Almond et al. (2015), Chow,
O'Leary, and Mengersen (2009), Garthwaite, Kadane, and O'Hagan (2005), Kadane and
Wolfson (1998), Novick and Jackson (1974), O'Hagan (1998), O'Hagan et al. (2006), Press
(1989), Savage (1971), and Winkler (1972). See De Leeuw and Klugkist (2012) for several
alternatives for basing prior distributions on prior research in the context of regression
modeling. We discuss strategies for specifying prior distributions based on substantive
beliefs and previous research in psychometric models in the context of item response
theory in Chapter 11.
4
Normal Distribution Models
This chapter provides a treatment of popular Bayesian approaches to working with nor-
mal distribution models. We do not attempt a comprehensive account, instead providing
a more cursory treatment that has two aims. First, it is valuable to review a number of
Bayesian modeling concepts in the context of familiar normal distributions. Second, nor-
mal distributions are widely used in statistical and psychometric modeling. As such, this
chapter provides a foundation for more complex models; in particular, the development
of regression, classical test theory, and factor analysis models will draw heavily from the
material introduced here.
As a running example, suppose we have a test scored from 0 to 100, and we are interested
in the distribution of test scores for examinees. We obtain scores from n = 10 examinees,
and let those scores be x = (91, 85, 72, 87, 71, 77, 88, 94, 84, 92), where xi is the score for examinee
i. Assuming that the scores are independently and identically normally distributed,
xi ~ N(μ, σ²), unbiased least-squares estimates of the mean and variance are 84.1 and 66.77;
the MLEs for the mean and variance are 84.1 and 60.09. In what follows, we explore a series
of Bayesian models for this setup. In Section 4.1, we model the unknown mean treating the
variance as known. In Section 4.2, we model the unknown variance treating the mean as
known. In Section 4.3, we consider the case where we model both the mean and variance
as unknown. We conclude this chapter with a brief summary in Section 4.4.
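The quoted point estimates are easy to reproduce. The short check below computes the sample mean, the unbiased variance estimate (dividing by n − 1), and the MLE of the variance (dividing by n) from the scores given in the text.

```python
# Reproducing the point estimates quoted for the running example.
x = [91, 85, 72, 87, 71, 77, 88, 94, 84, 92]
n = len(x)

xbar = sum(x) / n                          # sample mean: 84.1
ss = sum((xi - xbar) ** 2 for xi in x)     # sum of squared deviations: 600.9
var_unbiased = ss / (n - 1)                # unbiased estimate: 66.77 (to two decimals)
var_mle = ss / n                           # maximum likelihood estimate: 60.09
```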
p(x | μ, σ²) = ∏_{i=1}^{n} p(xi | μ, σ²), (4.2)

where

xi | μ, σ² ~ N(μ, σ²). (4.3)
Note that (4.2) and (4.3) are akin to those in conventional frequentist approaches.
μ ~ N(μ_μ, σ²_μ), (4.4)

where μ_μ and σ²_μ are hyperparameters. The subscript notation adopted here is used to
reflect that these are features of the distribution of μ prior to observing x, in contrast to the
posterior distribution.
where

xi | μ, σ² ~ N(μ, σ²) for i = 1, …, n

FIGURE 4.1
Directed acyclic graph for a normal distribution model with unknown mean μ and known variance σ², where
μ_μ and σ²_μ are hyperparameters for the unknown mean μ.
and

μ ~ N(μ_μ, σ²_μ).
The normal prior distribution is popular in part because it is a conjugate prior in this case.
It can be shown that the posterior distribution is then itself normal (Lindley & Smith, 1972;
see also Gelman et al., 2013; Gill, 2007; Jackman, 2009):

μ | x, σ² ~ N(μ_{μ|x}, σ²_{μ|x}), (4.6)

where

μ_{μ|x} = [(μ_μ/σ²_μ) + (nx̄/σ²)] / [(1/σ²_μ) + (n/σ²)], (4.7)

and

σ²_{μ|x} = 1 / [(1/σ²_μ) + (n/σ²)]. (4.8)
The subscript notation adopted on the right-hand side of (4.6) and in (4.7) and (4.8) is used
to reflect that these are features of the distribution of μ posterior to observing x.* The posterior
distribution also reveals that n and x̄ are sufficient statistics for the analysis; they
jointly capture all the relevant information in the data.
The model may be reparameterized in terms of the precision of the data

τ = 1/σ² (4.9)

and the prior precision

τ_μ = 1/σ²_μ. (4.10)
* As (4.7) and (4.8) reveal, these results are also conditional on the values for the hyperparameters and the
variance of the data. In the current example, these are all known values. When we turn to models that treat
the mean and variance as unknown, we will expand our notation for these expressions accordingly.
FIGURE 4.2
Directed acyclic graph for a normal distribution model with unknown mean μ in the precision parameterization,
where τ is the known precision of the data, equal to the inverse of the variance σ², and μ_μ and τ_μ (the
inverse of σ²_μ) are hyperparameters for the unknown mean μ.
where

xi | μ, τ ~ N(μ, τ) for i = 1, …, n

and

μ ~ N(μ_μ, τ_μ).
Figure 4.2 contains the DAG for the model in the precision parameterization. It differs from
the DAG in Figure 4.1 by modeling τ as a parent for xi and τ_μ as a parent for μ. These precision
terms are modeled as the children of the associated variance terms, reflecting the
deterministic relations in (4.9) and (4.10).
Under this parameterization, the posterior distribution for μ is normal,

μ | x, τ ~ N(μ_{μ|x}, τ_{μ|x}), (4.12)

where

μ_{μ|x} = (τ_μ μ_μ + nτx̄) / (τ_μ + nτ) = [τ_μ/(τ_μ + nτ)] μ_μ + [nτ/(τ_μ + nτ)] x̄ (4.13)

and

τ_{μ|x} = τ_μ + nτ. (4.14)
The precision parameterization reveals several key features of the posterior distribution.
First, (4.14) indicates that the posterior precision (τ_{μ|x}) is the sum of two components: the
precision in the prior (τ_μ) and the precision in the data (nτ). Conceptually, the variance of a
distribution is a summary of our uncertainty: a distribution with a relatively large (small)
variance indicates relatively high (low) uncertainty. In this light, the precision is a summary
of our certainty: a distribution with a relatively large (small) precision indicates
relatively high (low) certainty. In these terms, (4.14) states that our posterior certainty is the
sum of certainty from two sources, namely the prior and the data.
The posterior mean (μ_{μ|x}) in (4.13) is a weighted average of the prior mean (μ_μ) and the
mean of the data (x̄). The weight for the prior mean (τ_μ/[τ_μ + nτ]) is proportional to the prior's
contribution to the total precision. Similarly, the weight for the mean of the data (nτ/[τ_μ + nτ])
is proportional to the data's contribution to the total precision. Viewing the posterior mean
as a point summary of the posterior distribution, (4.13) also illustrates the general point that
the posterior will be a synthesis of the information in the prior and the information in the
data as expressed in the likelihood. In the current case, the relative contribution of the prior
and the data in this synthesis is governed by the relative precision in each of these sources.
μ ~ N(75, 50), (4.15)

which implies the prior precision is τ_μ = .02. This prior, depicted in Figure 4.3, expresses the
prior belief that μ is almost certainly larger than 50 and probably in the high 60s to low 80s.
The likelihood is also depicted in Figure 4.3. It takes a maximum value at 84.1, the mean
of the data. The posterior mean and precision are calculated through (4.13) and (4.14) as
follows:

μ_{μ|x} = (τ_μ μ_μ + nτx̄) / (τ_μ + nτ) = [(.02)(75) + (10)(.04)(84.1)] / [.02 + (10)(.04)] ≈ 83.67

and

τ_{μ|x} = τ_μ + nτ = .02 + (10)(.04) = .42.

This last result implies that the posterior variance is σ²_{μ|x} ≈ 2.38. The posterior distribution,
expressed in the variance parameterization, is

μ | x, σ² ~ N(83.67, 2.38). (4.16)

FIGURE 4.3
Prior distribution, likelihood, and posterior distribution for the example with an N(75, 50) prior for the unknown
mean μ, where σ² = 25, x̄ = 84.1, and n = 10.
The posterior is depicted in Figure 4.3. It may be summarized in terms of its central tendency
(the posterior mean, median, and mode is 83.67) and its variability (the posterior
standard deviation is 1.54). The 95% central and HPD interval is (80.64, 86.69).
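The computations above can be reproduced directly from (4.13) and (4.14). The interval is computed from the normal quantile 1.96, which matches the values quoted in the text.

```python
# Reproducing the posterior for the unknown mean in the precision
# parameterization: N(75, 50) prior, sigma^2 = 25 known, n = 10, xbar = 84.1.
mu_prior, var_prior = 75, 50
var_data, n, xbar = 25, 10, 84.1

tau_prior = 1 / var_prior          # prior precision: 0.02
tau_data = 1 / var_data            # data precision: 0.04

tau_post = tau_prior + n * tau_data                                  # 0.42
mu_post = (tau_prior * mu_prior + n * tau_data * xbar) / tau_post    # 83.67
sd_post = (1 / tau_post) ** 0.5                                      # 1.54

# 95% central interval from normal quantiles
lower, upper = mu_post - 1.96 * sd_post, mu_post + 1.96 * sd_post    # (80.64, 86.69)
```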
As the amount of data increases,

σ²_{μ|x} ≈ σ²/n (4.17)

and

μ_{μ|x} ≈ x̄. (4.18)

Accordingly, as n → ∞,

μ | x, σ² ~ N(x̄, σ²/n). (4.19)
Likewise, these limiting properties obtain as σ²_μ → ∞, holding the features of the data
constant.
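These limits can be checked numerically. The sketch below reuses the running example's prior and data summaries and simply increases n while holding x̄ fixed (an artificial device for illustration): the posterior mean closes in on x̄ and the posterior variance approaches σ²/n.

```python
# Numerically checking the limiting behavior in (4.17) and (4.18).
mu_prior, var_prior, var_data, xbar = 75, 50, 25, 84.1

def posterior(n):
    tau_prior, tau_data = 1 / var_prior, 1 / var_data
    tau_post = tau_prior + n * tau_data
    mu_post = (tau_prior * mu_prior + n * tau_data * xbar) / tau_post
    return mu_post, 1 / tau_post

gaps = [abs(posterior(n)[0] - xbar) for n in (10, 100, 1000)]  # shrinks toward 0
var_at_1000 = posterior(1000)[1]                               # close to 25 / 1000
```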
These results illustrate a number of the principles discussed in Section 3.3. As the relative contribution of the data increases, either through the increased amount of data (n) or the decrease in the information in the prior (larger values of σ²_μ), the results become increasingly similar to what would be obtained from a frequentist analysis. Point summaries of the posterior (i.e., mean, median, and mode) get closer to x̄, which is the MLE of μ. The posterior variance of μ gets closer to the sampling variance of x̄. And posterior credibility intervals resemble frequentist confidence intervals. Importantly, as discussed in Section 3.3, though the results may be numerically similar to their frequentist counterparts, their interpretations are different. Unlike a frequentist analysis, a Bayesian analysis yields probabilistic statements and reasoning about the parameters. The 95% central and HPD interval indicates that, according to the model, there is a .95 probability that μ is between 80.64 and 86.69.
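The precision-weighted synthesis in (4.13) and (4.14) is easy to check numerically; here is a minimal Python sketch using this example's values (the function name is ours, not the book's):

```python
import math

# Posterior for the mean of a normal with known variance:
# posterior precision = prior precision + n * data precision;
# posterior mean = precision-weighted average of prior mean and sample mean.
def posterior_for_mean(mu0, var0, sigma2, xbar, n):
    tau0, tau = 1.0 / var0, 1.0 / sigma2
    post_prec = tau0 + n * tau
    post_mean = (tau0 * mu0 + n * tau * xbar) / post_prec
    return post_mean, 1.0 / post_prec

m, v = posterior_for_mean(75, 50, 25, 84.1, 10)
sd = math.sqrt(v)
lo, hi = m - 1.96 * sd, m + 1.96 * sd   # central 95% interval
print(round(m, 2), round(v, 2), round(sd, 2), round(lo, 2), round(hi, 2))
# 83.67 2.38 1.54 80.64 86.69
```

The computed summaries reproduce the values reported above for the N(75, 50) prior with σ² = 25, x̄ = 84.1, and n = 10.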
The first term on the right-hand side is the conditional probability of the data, which is
the same as given in (4.2) and (4.3). The second term on the right-hand side is the prior
distribution for σ².
σ² ~ Inv-Gamma(α, β), (4.21)

where α and β are hyperparameters. A convenient parameterization sets α = ν₀/2 and β = ν₀σ₀²/2.
Under this parameterization, σ₀² has an interpretation akin to a best estimate for the variance and ν₀ has an interpretation akin to the degrees of freedom, or pseudo-sample size, associated with that estimate. This could come from prior research, or from subject matter expert beliefs. For example, a subject matter expert may express their beliefs by saying "I think the variance is around 30. But I'm not very confident; it's only as if that came from observing 10 subjects." This could be modeled by setting σ₀² = 30 and ν₀ = 10, yielding an Inv-Gamma(5, 150) distribution. Then the inverse-gamma distribution may be plotted and inspected as to whether it represents prior beliefs as intended.
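One way to carry out such an inspection is to evaluate the implied density numerically; a minimal Python sketch (the plotting call itself is omitted, and the helper name is ours):

```python
import math

# Inverse-gamma density: p(x) = b^a / Gamma(a) * x^(-a-1) * exp(-b/x), for x > 0
def invgamma_pdf(x, a, b):
    return b**a / math.gamma(a) * x**(-a - 1) * math.exp(-b / x)

# The Inv-Gamma(5, 150) prior implied by sigma0^2 = 30 and nu0 = 10
a, b = 5.0, 150.0
mean = b / (a - 1)   # exists when a > 1
mode = b / (a + 1)
print(mean, mode)    # 37.5 25.0

# Density over a grid, e.g., to pass to any plotting routine
grid = [x / 10 for x in range(5, 10001)]
dens = [invgamma_pdf(x, a, b) for x in grid]
```

A prior mean of 37.5 and mode of 25 for the variance can then be judged against the expert's stated belief that the variance is "around 30."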
The inverse-gamma distribution has a positively skewed shape like the more familiar χ² distribution. In fact, the χ² distribution is related to the inverse-gamma and the gamma distributions. Specifically, if σ² ~ Inv-Gamma(ν₀/2, ν₀σ₀²/2), then the quantity ν₀σ₀²/σ² ~ χ²_{ν₀}. Accordingly, some authors refer to this as a scaled inverse-χ² distribution (Gelman et al., 2013).
FIGURE 4.4
Inverse-gamma densities: Inv-Gamma(0.01, 0.01), Inv-Gamma(3, 3), Inv-Gamma(3, 6), and Inv-Gamma(15, 15).
p(σ² | x, μ) ∝ p(x | μ, σ²) p(σ²)
            = ∏_{i=1}^{n} p(x_i | μ, σ²) p(σ²), (4.23)

where

x_i | μ, σ² ~ N(μ, σ²) for i = 1, …, n

and

σ² ~ Inv-Gamma(ν₀/2, ν₀σ₀²/2).

It can be shown that the posterior distribution is

σ² | x, μ ~ Inv-Gamma((ν₀ + n)/2, (ν₀σ₀² + SS(x|μ))/2), (4.24)

where

SS(x|μ) = Σ_{i=1}^{n} (x_i - μ)² (4.25)
is the sum of squares from the data, with the conditioning notation indicating the sum
of squares is taken about the known population mean μ. The posterior distribution also
FIGURE 4.5
Directed acyclic graph for a normal distribution model with unknown variance σ² and known mean μ, where
ν₀ and σ₀² are hyperparameters for the unknown variance σ².
Normal Distribution Models 85
reveals that n and SS(x|μ) are sufficient statistics for the analysis; they jointly capture all
the relevant information in the data.
where p(τ) is the prior distribution for the precision. As discussed in Section 4.2.2, the
inverse-gamma is the conjugate prior for σ². Recognizing that τ is the inverse of σ², it is not
surprising that the gamma distribution is the conjugate prior for τ. Thus, we specify the
prior distribution for τ as

τ ~ Gamma(α, β), (4.27)

where α and β are hyperparameters that govern the distribution. Figure 4.6 depicts several
gamma distributions. Again, a convenient parameterization is one in which α = ν₀/2 and
β = ν₀σ₀²/2, with the interpretations of σ₀² and ν₀ as a prior best estimate for the variance and
pseudo-sample size, respectively.
4.2.5 Complete Model and Posterior Distribution for the Precision Parameterization
Figure 4.7 contains the DAG for the model in the precision parameterization. It differs
from the DAG in Figure 4.5 by modeling τ as the parent for x_i. The DAG also includes σ²
as a child of τ, reflecting the deterministic relation between the two. σ² is included here
to reflect that we might model things in terms of the precision, but for reporting we often
prefer to employ the more familiar variance metric.
Putting the pieces together, the posterior distribution in the precision parameterization is

p(τ | x, μ) ∝ p(x | μ, τ) p(τ) = ∏_{i=1}^{n} p(x_i | μ, τ) p(τ),
FIGURE 4.6
Gamma densities: Gamma(0.01, 0.01), Gamma(3, 3), Gamma(3, 6), and Gamma(15, 15).
FIGURE 4.7
Directed acyclic graph for a normal distribution model in the precision parameterization, with unknown precision τ (inverse of variance, τ = 1/σ²) and known mean μ, where ν₀ and σ₀² are hyperparameters for the unknown precision τ.
where

x_i | μ, τ ~ N(μ, τ⁻¹) for i = 1, …, n

and

τ ~ Gamma(ν₀/2, ν₀σ₀²/2).
The prior is depicted in Figure 4.8. The likelihood is also depicted in Figure 4.8. The posterior distribution, also depicted in
Figure 4.8, is then

σ² | x, μ ~ Inv-Gamma((ν₀ + n)/2, (ν₀σ₀² + SS(x|μ))/2) = Inv-Gamma(10, 534.5).
FIGURE 4.8
Prior distribution, likelihood, and posterior distribution for the example with an Inv-Gamma(5, 150) prior for
the unknown variance σ², where μ = 80, SS(x|μ) = 769, and n = 10.
We may summarize the posterior numerically in terms of its central tendency, variability,
and HPD interval: the posterior mean is 59.39 and the posterior mode is 48.59, the posterior
standard deviation is 21.00, and the 95% HPD interval is (27.02, 100.74).*
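These summaries can be verified from the closed form, assuming the conjugate update yields an Inv-Gamma((ν₀ + n)/2, (ν₀σ₀² + SS(x|μ))/2) = Inv-Gamma(10, 534.5) posterior here; a minimal Python sketch (the helper name is ours):

```python
import math

# Closed-form summaries of an Inv-Gamma(a, b) distribution
def invgamma_summaries(a, b):
    mean = b / (a - 1)                    # exists for a > 1
    mode = b / (a + 1)
    var = b**2 / ((a - 1)**2 * (a - 2))   # exists for a > 2
    return mean, mode, math.sqrt(var)

# Posterior implied by the Inv-Gamma(5, 150) prior with n = 10 and SS = 769
mean, mode, sd = invgamma_summaries(10.0, 534.5)
print(round(mean, 2), round(mode, 2), round(sd, 2))  # 59.39 48.59 21.0
```

The mean, mode, and standard deviation match the values reported above to two decimal places.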
The first term on the right-hand side is the conditional probability of the data, which
remains as given in (4.2) and (4.3).
The second term on the right-hand side is the joint prior distribution for μ and σ². To
specify this distribution, we first assume independence between μ and σ², implying

p(μ, σ²) = p(μ)p(σ²). (4.33)
* These values came from a calculator for the inverse-gamma distribution. We will see shortly how to approximate them empirically. We do not need to do so in this problem, but it is a general approach we can use for more complicated posterior distributions.
variance and precision parameterizations are given in Figure 4.9. The posterior distribution is then
p(μ, σ² | x) ∝ p(x | μ, σ²) p(μ) p(σ²)
            = ∏_{i=1}^{n} p(x_i | μ, σ²) p(μ) p(σ²), (4.34)

where the terms on the right-hand side of (4.34) are the conditional probability of the data
and the prior distributions:

x_i | μ, σ² ~ N(μ, σ²) for i = 1, …, n,

μ ~ N(μ_μ, σ²_μ),

and

σ² ~ Inv-Gamma(ν₀/2, ν₀σ₀²/2).
Though these prior distributions are conjugate priors in the cases where there is only one
unknown, they do not constitute a conjugate joint prior in the current context and do not
yield a closed-form joint posterior (Lunn et al., 2013). However, the posterior can be easily
approximated empirically using simulation techniques illustrated in the next section and
described more fully in Chapter 5. For the moment, it is sufficient to state that these procedures yield a series of values that, taken as a collection, constitute an empirical approximation to the posterior distribution. Following Jackman (2009), we refer to this situation as
one where the priors are conditionally conjugate. They have also been termed generalized
conjugate and semi-conjugate. All of these terms are aimed at reflecting that though this
prior distribution does not yield a closed form for the posterior distribution, it does yield a
manageable form, both conceptually and computationally, using simulation strategies that
involve conditioning.
FIGURE 4.9
Directed acyclic graphs for a normal distribution model with unknown mean μ and unknown variance σ² or
precision τ in the (a) variance parameterization and (b) precision parameterization.
p(μ, σ² | x) ∝ p(x | μ, σ²) p(μ) p(σ²)
            = ∏_{i=1}^{n} p(x_i | μ, σ²) p(μ) p(σ²),

where

x_i | μ, σ² ~ N(μ, σ²) for i = 1, …, 10,

μ ~ N(75, 50),

and

σ² ~ Inv-Gamma(5, 150).
-------------------------------------------------------------------------
#########################################################################
# Model Syntax
#########################################################################
model{
#########################################################################
# Conditional distribution for the data
#########################################################################
for(i in 1:n){
x[i] ~ dnorm(mu, tau)
}
#########################################################################
# Define the prior distributions for the unknown parameters
#########################################################################
mu ~ dnorm(mu.mu, tau.mu)
mu.mu <- 75
sigma.squared.mu <- 50
tau.mu <- 1/sigma.squared.mu
tau ~ dgamma(alpha, beta)
alpha <- nu.0/2
beta <- nu.0*sigma.squared.0/2
nu.0 <- 10
sigma.squared.0 <- 30
sigma.squared <- 1/tau
}
#########################################################################
# Data statement
#########################################################################
list(n=10, x=c(91, 85, 72, 87, 71, 77, 88, 94, 84, 92))
--------------------------------------------------------------------------
Note that WinBUGS uses the precision parameterization for the normal distribution. In
this example, the values for the observables and sample size are contained in the list statement for the data. The values for the hyperparameters are specified in the model portion
of the code. As they are known values, they could have been specified as part of the data
statement. We have specified them in the model portion to more clearly illustrate which
hyperparameters go with which parameter, and how some of the hyperparameters are
calculated as functions of others, to yield quantities used in WinBUGS's parameterization
of the normal and (inverse-)gamma distributions. Figure 4.10 depicts the joint distribution
for μ and σ² in the form of a scatterplot based on an analysis of the code above using 50,000
iterations.* As depicted there, the posterior correlation between the parameters is weak;
in these 50,000 draws, the correlation was .13. Figure 4.11 depicts empirical approximations to the marginal posterior distributions for μ and σ² produced by WinBUGS. Table 4.1
contains summary statistics for these marginal posterior distributions.
FIGURE 4.10
Scatterplot of the joint posterior distribution for the parameters of the normal distribution where both the mean
μ and the variance σ² are unknown.
* We will have more to say about choices regarding the number of iterations to use in Chapter 5. Suffice it to say
that 50,000 is more than sufficient to characterize the posterior in the current case.
FIGURE 4.11
Marginal posterior densities for the parameters of the normal distribution: (a) the unknown mean μ and (b) the
unknown variance σ².
TABLE 4.1
Summaries of the Marginal Posterior Distributions

Parameter   Mean    Median   Standard Deviation   95% Highest Posterior Density Interval
μ           83.23   83.27    2.20                 (78.80, 87.48)
σ²          53.19   49.32    19.45                (23.22, 91.34)
A fully conjugate joint prior specification is available by modeling μ and σ² dependently:

σ² ~ Inv-Gamma(α, β) (4.35)

and

μ | σ² ~ N(μ₀, σ²/κ₀), (4.36)

with κ₀ being a hyperparameter that aids in defining the prior variance for μ in terms of σ².
κ₀ may be thought of as a pseudo-sample size in that, as κ₀ → 0, the prior variance for μ gets
larger, reflecting that the prior is expressing less certainty, as would be consistent with
having fewer prior observations.
Though this approach affords conjugacy, it has seen limited application in psychometrics and latent variable models (see Lee, 2007, for examples), and we do not pursue it further.
The specification of independent priors for μ and σ² is attractive because (a) it is conceptually simpler to conceive of the mean and variance of the normal distribution as independent entities, and (b) the computation of the posterior using the independent, conditionally
conjugate priors is fairly straightforward and usually quite fast, thereby minimizing the
computational advantages of having full conjugacy.
Another choice for the prior distribution that is motivated by a desire to specify an
increasingly diffuse prior distribution is the improper prior

p(μ, σ²) ∝ σ⁻², (4.37)

which is also uniform on (μ, log σ). Provocatively, the posterior distribution in this case
aligns closely with forms taken for sampling distributions in frequentist inference
(Jackman, 2009). On reflection this is unsurprising and reinforces a key conceptual point:
to the extent that there is less information in the prior ((4.37) is a limiting case of reducing
the information in the prior), the posterior is increasingly similar to the likelihood.
4.4 Summary
This chapter has briefly described Bayesian approaches when the observable data are
modeled via a normal distribution. Key ideas introduced include the precision as the
inverse of σ², conjugate and conditionally conjugate prior distributions, and the posterior
distribution for the mean as a precision-weighted synthesis of the prior mean and the
mean of the data. Our treatment has been cursory, only going to a depth sufficient to support the development of models covered later. More comprehensive accounts can be found
in any of a number of texts on Bayesian modeling, including those cited in the introduction
to Section I. The specifications introduced here will re-emerge most prominently in our
treatments of regression (Chapter 6), classical test theory (Chapter 8), and factor analysis
(Chapter 9) models. The underlying principles will be used throughout the remaining
chapters.
Exercises
4.1 An example for an inference about a mean (μ) when the variance is known was
given in Section 4.1. A related analysis now conducting inference about the mean
and the variance was given in Section 4.3. Compare the results for the two, in
terms of the (marginal) posterior distribution for the mean. How do they differ?
What explains why they differ in these ways? Under what circumstances would
the results from such analyses yield increasingly similar results? Under what cir-
cumstances would the results from such analyses yield increasingly dissimilar
results?
4.2 An example for an inference about a variance (σ²) when the mean is known was
given in Section 4.2. A related analysis now conducting inference about the mean
and the variance was given in Section 4.3. Compare the results for the two, in terms
of the (marginal) posterior distribution for the variance. How do they differ? What
explains why they differ in these ways? Under what circumstances would the
results from such analyses yield increasingly similar results? Under what circum-
stances would the results from such analyses yield increasingly dissimilar results?
4.3 In the WinBUGS code in Section 4.3.2, add a line defining the standard deviation
sigma as the square root of the variance, and examine its posterior distribution.
What is WinBUGS doing to produce this approximation of its posterior? Is the
posterior mean of sigma equal to the square root of the posterior mean for
sigma.squared? Why or why not?
4.4 (Advanced) Compare the posterior standard deviation of sigma with an approxi-
mation based on the posterior standard deviation of sigma.squared, trans-
formed by the delta method. How close is the approximation, and why does it
differ? Repeat the exercise with larger samples of data produced by creating multi-
ple copies of the sample data, so the sample mean and dispersion remain constant
as n increases. How large does n need to be for the asymptotic approximations in
this exercise and the preceding one to be accurate?
5
Markov Chain Monte Carlo Estimation
In a Bayesian analysis, the posterior distribution is the solution obtained from fitting the
model to the data. Analytical solutions are available for simple models with conjugate
priors, such as the beta-binomial model, or a model for the mean of a normal distribution
with a known variance. These situations arise when the likelihood is a member of the gen-
eral exponential family of distributions (Bernardo & Smith, 2000), so that a special related
distribution exists that can be used as a prior, and the posterior takes the same form with
updated parameters.
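The beta-binomial case illustrates this: a Beta(α, β) prior for a proportion combines with y successes in n Bernoulli trials to give a Beta(α + y, β + n − y) posterior. A minimal sketch (the numbers are illustrative, not from the text):

```python
# Conjugate beta-binomial update: Beta(a, b) prior, y successes in n trials.
def beta_binomial_update(a, b, y, n):
    return a + y, b + (n - y)   # posterior Beta parameters

a_post, b_post = beta_binomial_update(1, 1, 7, 10)  # uniform prior, 7/10 successes
post_mean = a_post / (a_post + b_post)
print(a_post, b_post, round(post_mean, 3))  # 8 4 0.667
```

The posterior is available exactly, with no integration required; it is precisely this convenience that is lost in the more complex models discussed next.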
Many complex statistical and psychometric models do not enjoy the benefit of having conjugate priors. When there is no closed-form solution for the product of the prior
and the likelihood, evaluating the possibly high-dimensional integrals in Bayes theorem quickly becomes intractable, in practice preventing the analyst from obtaining the
full posterior distribution.
full posterior distribution. In some cases, Taylor series analytic approximations can be
derived for moments (e.g., Tierney & Kadane, 1986), or posterior modes may be estimated
via optimization techniques such as Newton-Raphson or the EM algorithm when deriva-
tives are amenable (e.g., Mislevy, 1986). These resulting estimates and complementary
estimates of the curvature of the posterior may be used to define an asymptotic approxi-
mation of the posterior. Such estimation procedures may be useful even in situations
where the posterior is of a known form, say, in estimating the mode of the posterior when
there is no closed form for it.
An attractive alternative when computing resources are available is to simulate values
from distributions in such a way that the collection of such values forms an empirical
approximation of the posterior distribution. Ideally these simulated values would be
independent (or pseudo-independent). In some cases in psychometric modeling, certain
choices for distributional forms (e.g., normality assumptions for the data and associ-
ated conjugate priors) yield posterior distributions of manageable form, in which case
an empirical approximation may be obtained by simulating values using standard pro-
cedures (Lee, 2007).
However, in models that do not employ conjugate priors and/or have complicated fea-
tures that preclude the specification of conjugate priors, the posterior distribution is often
not of a form that facilitates independent sampling. An alternative is to simulate values
that are dependent, which is sufficient if the values can be drawn throughout the poste-
rior distribution and in the correct proportions (Gilks, Richardson, & Spiegelhalter, 1996a).
A highly flexible framework for estimating distributions in this vein that supports Bayesian
analyses in such complex models is provided by Markov chain Monte Carlo (MCMC) esti-
mation (Brooks, 1998; Brooks, Gelman, Jones, & Meng, 2011; Gelfand & Smith, 1990; Gilks,
Richardson, & Spiegelhalter, 1996b; Smith & Roberts, 1993; Spiegelhalter, Thomas, Best, &
Lunn, 2007; Tierney, 1994). Although (possibly highly) dependent, if enough samples are
simulated such that the values are obtained throughout the support of the distribution in
the correct proportions, drawing such dependent samples provides an empirical approxi-
mation to the posterior distribution.
This chapter gives an overview of MCMC estimation. In Section 5.1, we give a broad
description and comparison of MCMC to frequentist estimation. Next, we describe certain
specific algorithms that have become the dominant approaches in Bayesian psychomet-
ric modeling with MCMC (Sections 5.2, 5.3, 5.5, and 5.6), and in doing so illuminate the
conceptual alignment between MCMC and Bayesian statistical modeling (Section 5.4). We
then briefly survey a number of matters pertaining to the practice of MCMC estimation
(Section 5.7). We conclude in Section 5.8 with a summary and bibliographic note.
Given that certain general conditions hold (see, e.g., Jackman, 2009; Roberts, 1996; Tierney,
1994), a properly constructed chain is guaranteed to converge to a unique stationary distri-
bution. Briefly, the transition kernel of the chain must be irreducible, meaning that from any
state it is possible to eventually reach any region of the support of the target distribution, and
positive recurrent, meaning that any such region will be visited infinitely often in the long run.
Further, the chain should be aperiodic, which means it will not just oscillate between
different states in a regular period. If these conditions hold, then the chain is said to
be ergodic, and the chain will converge to its unique stationary distribution. This is
accomplished by constructing chains that are reversible, discussed in more detail in
Section 5.5.2.
To leverage Markov chains in Bayesian analysis, we set up the chain so that the desired
posterior distribution is the chain's stationary distribution. If the chain is properly
constructed, a long run of draws from it then serves as an empirical approximation of the posterior.
1. Assign initial values for all the components, yielding the collection θ₁^(0), …, θ_R^(0),
   where the superscript of t = 0 conveys that these are initial values.
2. For r = 1, …, R, draw values for component r from its full conditional distribution
   given the observed data and the current values of all other components. In other
   words, for each component θ_r, we obtain the value of the chain at iteration t + 1 by
   drawing from its full conditional distribution p(θ_r | θ_(-r), x) using the current values
   for the remaining components θ_(-r). One complete iteration is given by sequentially
   drawing values from

   θ₁^(t+1) ~ p(θ₁ | θ₂^(t), …, θ_R^(t), x),
   θ₂^(t+1) ~ p(θ₂ | θ₁^(t+1), θ₃^(t), …, θ_R^(t), x),
   …
   θ_R^(t+1) ~ p(θ_R | θ₁^(t+1), …, θ_(R-1)^(t+1), x).

   Employing this process using the initial values (t = 0) yields the collection
   θ^(1) = (θ₁^(1), …, θ_R^(1)) that constitutes the first draw for the components.
3. Increment t, by setting t = t + 1.
4. Repeat steps 2 and 3, for some large number T iterations.
5.2.2 Example: Inference for the Mean and Variance of a Normal Distribution
We return to the situation introduced in Chapter 4 for conducting a Bayesian analysis in the
situation with data modeled as normally distributed with unknown mean and variance. For
clarity of presentation, we collect and repeat the relevant expressions here. An exchangeabil-
ity assumption regarding subjects supports the factorization of the conditional probability
of the data as
p(x | μ, σ²) = ∏_{i=1}^{n} p(x_i | μ, σ²), (5.2)

where

x_i | μ, σ² ~ N(μ, σ²). (5.3)

The joint prior distribution is specified as

p(μ, σ²) = p(μ)p(σ²), (5.4)

where

μ ~ N(μ_μ, σ²_μ) (5.5)

and

σ² ~ Inv-Gamma(ν₀/2, ν₀σ₀²/2). (5.6)
The full conditional distributions are therefore p(μ | σ², x) and p(σ² | μ, x). It can be shown
that the full conditionals are (Lunn et al., 2013)

μ | σ², x ~ N(μ_{μ|σ²,x}, σ²_{μ|σ²,x}), (5.7)

where

μ_{μ|σ²,x} = [(μ_μ/σ²_μ) + (n x̄/σ²)] / [(1/σ²_μ) + (n/σ²)], (5.8)

σ²_{μ|σ²,x} = 1 / [(1/σ²_μ) + (n/σ²)], (5.9)

and

σ² | μ, x ~ Inv-Gamma((ν₀ + n)/2, (ν₀σ₀² + SS(x|μ))/2), (5.10)

with

SS(x|μ) = Σ_{i=1}^{n} (x_i - μ)² (5.11)
being the sum of squares from the data taken about the population mean. These results
for μ are the same as those obtained in the analysis where σ² was assumed known; compare (5.7) through (5.9) to (4.6) through (4.8). Similarly, the results for σ² are the same as those obtained in
the analysis where μ was assumed known; compare (5.10) and (5.11) to (4.24) and (4.25).
Working with the full conditionals puts us back in the contexts where, for each parameter,
the other parameter is treated as known. Here lies the computational payoff of a condition-
ally conjugate prior specification. Although the joint posterior does not have a closed form,
the full conditionals do. We are still capitalizing on the conjugacy, now just localized to each
parameter.
This greatly eases the computational burden needed in the steps of Gibbs sampling.
A Gibbs sampling algorithm for the current example is as follows:

1. Assign initial values for all the parameters: μ^(0) and σ²^(0), where the superscript
   indicates that t = 0.
2. For iteration t + 1, draw

   σ²^(t+1) ~ Inv-Gamma((ν₀ + n)/2, (ν₀σ₀² + SS^(t)(x|μ))/2),

   where

   SS^(t)(x|μ) = Σ_{i=1}^{n} (x_i - μ^(t))²,

   and then draw

   μ^(t+1) ~ N( [(μ_μ/σ²_μ) + (n x̄/σ²^(t+1))] / [(1/σ²_μ) + (n/σ²^(t+1))], 1 / [(1/σ²_μ) + (n/σ²^(t+1))] ).

3. Increment t, by setting t = t + 1.
4. Repeat steps 2 and 3, for some large number T iterations.
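These steps can be sketched in a few lines of Python using only the standard library, drawing from the inverse-gamma as the reciprocal of a gamma draw; the data and hyperparameter values are those of the example in Section 4.3, but the variable names are ours:

```python
import random, math

random.seed(1)

x = [91, 85, 72, 87, 71, 77, 88, 94, 84, 92]
n, xbar = len(x), sum(x) / len(x)
mu_mu, s2_mu = 75.0, 50.0   # prior mean and variance for mu
nu0, s20 = 10.0, 30.0       # hyperparameters for sigma^2

mu, s2 = 70.0, 10.0         # initial values (t = 0)
draws_mu, draws_s2 = [], []
for t in range(20000):
    # Draw sigma^2 from its inverse-gamma full conditional:
    # Inv-Gamma((nu0 + n)/2, (nu0*s20 + SS)/2), via 1 / Gamma(shape, scale)
    ss = sum((xi - mu) ** 2 for xi in x)
    s2 = 1.0 / random.gammavariate((nu0 + n) / 2, 2.0 / (nu0 * s20 + ss))
    # Draw mu from its normal full conditional
    prec = 1.0 / s2_mu + n / s2
    m = (mu_mu / s2_mu + n * xbar / s2) / prec
    mu = random.gauss(m, math.sqrt(1.0 / prec))
    draws_mu.append(mu)
    draws_s2.append(s2)

burn = 2000   # discard early iterations
print(round(sum(draws_mu[burn:]) / len(draws_mu[burn:]), 1))
```

With enough iterations, the retained draws approximate the marginal posteriors summarized in Table 4.1 (posterior means near 83.2 for μ and 53.2 for σ²).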
We return to the example from Section 4.3 where μ_μ = 75, σ²_μ = 50, σ₀² = 30, and ν₀ = 10.
Here, we briefly illustrate the computations for a few iterations. Let μ^(0) = 70 and σ²^(0) = 10 be
the initial values for the parameters. These were arbitrarily selected; we will have more
to say about strategies for selecting initial values in Section 5.7. These values are listed in
the first row of Table 5.1 in the second and third columns, and represent the values for the
iteration listed in the first column.
To conduct the first iteration, we work with the particular inverse-gamma full conditional distribution for σ² using the initial value for μ. For the first iteration, the first parameter of the inverse-gamma is

(ν₀ + n)/2 = (10 + 10)/2 = 10.

In fact, this will be the first parameter for the inverse-gamma full conditional in every
iteration, as its ingredients do not involve μ and therefore do not vary over iterations.
However, the second parameter will vary with each iteration, as it depends on SS(x | μ),
TABLE 5.1
Computations in the Gibbs Sampler for a Normal Distribution Model with Conditionally
Conjugate Priors

            Parameters        Full Conditional for σ²                   Full Conditional for μ
Iteration   μ       σ²        (ν₀+n)/2   SS(x|μ)    (ν₀σ₀²+SS(x|μ))/2    Mean    Variance
0           70.00   10.00
1           81.19   110.89    10         2589.00    1444.50              82.45   9.08
2           82.78   31.70     10         685.76     492.88               83.56   2.98
3           82.38   43.39     10         618.42     459.21               83.37   3.99
4           81.82   51.72     10         630.32     465.16               83.25   4.69
5           83.16   76.20     10         652.69     476.34               82.90   6.61

Note: The full conditional mean and variance for μ are [(μ_μ/σ²_μ) + (n x̄/σ²)]/[(1/σ²_μ) + (n/σ²)] and
1/[(1/σ²_μ) + (n/σ²)], respectively.
which varies over iterations. We first compute the sums of squares of the data about the
current value for the population mean, μ^(0) = 70:

SS^(0)(x|μ) = Σ_{i=1}^{n} (x_i - μ^(0))² = Σ_{i=1}^{n} (x_i - 70)² = 2589.
Note that this is the value of SS that is used to define the full conditional for σ² in the current
iteration; however, it is computed based on the value for μ from the previous iteration. Using
this value for SS yields the following for the second parameter in the inverse-gamma full
conditional distribution:
conditional distribution:
= 82.45
(1 ) + ( n )
2
(1 50 ) + (10 110.89 )
2( 1)
Note the use of the just-drawn value of 110.89 for σ² in these computations. We then take
a draw from the full conditional: μ^(1) ~ N(82.45, 9.08). The drawn value was 81.19, which
is now the value for μ for iteration 1, and is reported in the second row of Table 5.1. This
completes one iteration of the Gibbs sampler.
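The by-hand arithmetic for this first iteration can be checked in a few lines of Python (the variable names are ours; the numbers are those of Table 5.1):

```python
x = [91, 85, 72, 87, 71, 77, 88, 94, 84, 92]
n, xbar = len(x), sum(x) / len(x)
nu0, s20 = 10, 30       # hyperparameters for sigma^2
mu_mu, s2_mu = 75, 50   # hyperparameters for mu

# Inverse-gamma full conditional for sigma^2, evaluated at mu^(0) = 70
ss = sum((xi - 70) ** 2 for xi in x)
a = (nu0 + n) / 2
b = (nu0 * s20 + ss) / 2
print(ss, a, b)  # 2589 10.0 1444.5

# Normal full conditional for mu, evaluated at the drawn sigma^2(1) = 110.89
s2 = 110.89
prec = 1 / s2_mu + n / s2
m = (mu_mu / s2_mu + n * xbar / s2) / prec
print(round(m, 2), round(1 / prec, 2))  # 82.45 9.08
```

The computed quantities agree with the entries in the first row of Table 5.1.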
To conduct the second iteration, we return to the full conditional distribution for σ². We
compute the sums of squares of the data about the current value for the population mean:
SS^(1)(x|μ) = Σ_{i=1}^{n} (x_i - μ^(1))² = Σ_{i=1}^{n} (x_i - 81.19)² = 685.76.

The second parameter of the inverse-gamma full conditional is then
(ν₀σ₀² + SS^(1)(x|μ))/2 = (300 + 685.76)/2 = 492.88, and the drawn value σ²^(2) was 31.70.
The mean of the full conditional for μ is

[(μ_μ/σ²_μ) + (n x̄/σ²^(2))] / [(1/σ²_μ) + (n/σ²^(2))] = [(75/50) + ((10 × 84.1)/31.70)] / [(1/50) + (10/31.70)] = 83.56
and the variance is

1 / [(1/σ²_μ) + (n/σ²^(2))] = 1 / [(1/50) + (10/31.70)] = 2.98.
Accordingly, we then take a draw from the full conditional: μ^(2) ~ N(83.56, 2.98). The drawn
value was 82.78, which is now the value for μ for iteration 2. These drawn values are listed
in Table 5.1, as are values for three more iterations of the Gibbs sampler (see Exercise 5.6).
Recalling that the goal is not to arrive at a point estimate, but rather a (posterior) distribution, five iterations are hardly sufficient to characterize the distribution. In applications,
we would typically conduct many more iterations, and summarize them via densities and
summary statistics. For this example, the results of running 50,000 iterations in WinBUGS
were given in Figures 4.10 and 4.11 and Table 4.1.
5.2.3 Discussion
Many variations and extensions on the Gibbs sampling architecture are possible. One can
work with subsets of the parameters, treating multiple parameters as a set and sampling
from multivariate full conditionals (e.g., Patz & Junker, 1999b). One need not sample every
parameter at each iteration, as long as each parameter will be visited infinitely many times in the
long run. In cases where some parameters explore the parameter space more slowly than
others, say due to poor mixing or high autocorrelation (see Section 5.7), it may be advantageous to devote more resources to sampling for those parameters, rather than others.
Note that the full conditional distribution for each parameter is formulated as condi-
tional on all the remaining parameters. In complex models with many parameters, this
potentially involves conditioning on a great many parameters. However, the full condi-
tional distributions often simplify due to conditional independence specifications. It is
here that the use of DAGs in conceptualizing the model can greatly aid in computation,
as it can be shown that the full conditional distribution for any entity depends at most on
its parents, its children, and the other parents of its children; all other parameters can be
ignored (Lunn et al., 2009; Spiegelhalter & Lauritzen, 1990). In large problems built from
simpler structures with conditional independence relationships, the simplification can be
substantial.
μ | σ², μ_μ, σ²_μ, x ~ N(μ_{μ|σ²,μ_μ,σ²_μ,x}, σ²_{μ|σ²,μ_μ,σ²_μ,x}),

where

μ_{μ|σ²,μ_μ,σ²_μ,x} = [(μ_μ/σ²_μ) + (n x̄/σ²)] / [(1/σ²_μ) + (n/σ²)],

σ²_{μ|σ²,μ_μ,σ²_μ,x} = 1 / [(1/σ²_μ) + (n/σ²)],

and rewriting (5.10) as

σ² | μ, ν₀, σ₀², x ~ Inv-Gamma((ν₀ + n)/2, (ν₀σ₀² + SS(x|μ))/2).
This expanded notation aids in the derivation of full conditionals, as it highlights all
the entities that are involved in computations. We employ this expanded notation in
Appendix A, where we develop the full conditional distributions for many of the models
described in the rest of the book. For the rest of the chapters that comprise the main text,
we use the briefer-if-not-fully-complete notation that ignores the hyperparameters on the
right side of the conditioning bar.
If the full conditional distributions are of familiar form (as in the preceding example),
sampling from them may proceed using Monte Carlo procedures. They can be particu-
larly simple when we construct a larger model from constituent models and relationships
that have conjugate or conditionally conjugate priors. However, in complex models, it
might be the case that full conditional distributions are not of known form. In these cases,
more complex sampling schemes are required. Sections 5.3, 5.5, and 5.6 describe such
schemes.
Brooks, 1998; Gilks et al., 1996b). The basic idea is that in lieu of drawing a value from the
difficult-to-sample-from posterior distribution, we draw a value from a different distribu-
tion that we can easily sample from, and then decide to accept or reject that value as the
next value in the chain. This decision rule is constructed in such a way that, after the chain
reaches its stationary distribution, a subsequent draw has the same distribution as a draw
from its posterior distribution. At this point, the frequency distribution of a large number
of draws from the sequence converges to the posterior distribution. The procedure consists
of conducting the following steps:
1. Assign an initial value θ^(0), and set t = 0.
2. Draw a candidate value θ* from a proposal distribution q, that is, θ* ~ q(θ | θ^(t)).
3. Compute the acceptance probability

   α(θ*|θ^(t)) = min[1, p(θ*|x) / p(θ^(t)|x)],

   where p(θ^(t)|x) is the probability (or ordinate of the probability density) for the
   current value θ^(t) in the chain in the posterior distribution and p(θ*|x) is the probability (or ordinate of the probability density) for the candidate value θ* in the
   posterior distribution.
4. Set θ^(t+1) = θ* with probability α(θ*|θ^(t)); otherwise, set θ^(t+1) = θ^(t).
   Operationally this is typically accomplished by drawing a random variate U ~
   Uniform(0, 1), and setting θ^(t+1) = θ* if α(θ*|θ^(t)) > U and setting θ^(t+1) = θ^(t) otherwise.
5. Increment t, by setting t = t + 1.
6. Repeat steps 2 through 5 for some large number T iterations.
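A minimal Metropolis sampler in Python illustrating these steps; for checkability the target here is the N(83.67, 2.38) posterior from Chapter 4, evaluated only up to a constant, and the proposal standard deviation of 4 is an arbitrary tuning choice of ours:

```python
import random, math

random.seed(7)

# Unnormalized target density: N(83.67, 2.38), without its normalizing constant
def target(theta):
    return math.exp(-(theta - 83.67) ** 2 / (2 * 2.38))

prop_sd = 4.0   # proposal standard deviation (tuning choice)
theta = 70.0    # arbitrary initial value
draws, accepted = [], 0
for t in range(50000):
    cand = random.gauss(theta, prop_sd)             # step 2: propose
    alpha = min(1.0, target(cand) / target(theta))  # step 3: acceptance probability
    if random.random() < alpha:                     # step 4: accept or reject
        theta = cand
        accepted += 1
    draws.append(theta)

mean = sum(draws[1000:]) / len(draws[1000:])
print(round(mean, 1), round(accepted / 50000, 2))
```

The retained draws recover the target's mean (about 83.7) despite the density never being normalized, and the running acceptance count provides the diagnostic used below to tune the proposal variance.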
The proposal distribution q in step 2 may be any symmetric distribution that is defined over
the support of the stationary distribution, which in our case is the posterior distribution.
Here, symmetry refers to a distribution being symmetric with respect to its arguments;
that is, q(θ*|θ^(t)) = q(θ^(t)|θ*). The most popular choice for q is a normal distribution centered
at the current value of the chain. To simplify the presentation, first consider the situation
where there is only one parameter so that q is a univariate normal distribution, and let σ²_*
denote the variance of this distribution. Consideration of the normal probability density
function
q(θ*|θ^(t)) = N(θ*|θ^(t), σ²_*) = (1/√(2πσ²_*)) exp( -(1/(2σ²_*)) (θ* - θ^(t))² ) (5.12)

reveals that interchanging the roles of θ* and θ^(t) yields the same resulting value. That
is, N(θ*|θ^(t), σ²_*) = N(θ^(t)|θ*, σ²_*), and as such the normal distribution is symmetric with
respect to its arguments.
Figure 5.1 illustrates the procedure for a model with a single parameter, where
the posterior distribution takes on an irregular shape. Beginning with panel (a), the current value of the chain is θ^(t). Centered at that value is the normal proposal distribution q.
Suppose that at this iteration, the value drawn from this proposal distribution is θ*. The
question then becomes whether to accept this as the next value for the chain, θ^(t+1).
FIGURE 5.1
Illustration of two iterations of the Metropolis sampler: (a) a proposed value θ* that will necessarily be accepted;
once accepted it becomes the current value θ^(t) in (b), which shows a new proposed value that will possibly be
accepted (with probability p(θ*|x)/p(θ^(t)|x)). (Modified and used with permission from Educational Testing
Service.)
The chain can thus move even if a candidate value is less likely than the current value. And it will do so in a way that
the series of values in the chain produced from this process occur in a relative frequency
dictated by the posterior distribution.
Remarkably, Metropolis sampling works to approximate a distribution for practically any proposal distribution, as long as mild regularity requirements are met. Some proposal distributions lead to more efficient estimation than others, though. Efficiency is governed by a number of factors, including the dimension of the problem and conditional independence and conjugacy relationships; proposal distributions that lead to a 30%–40% acceptance rate tend to be the most efficient (Gelman et al., 2013). As mentioned above, a common proposal distribution is a normal distribution centered at the previous value in the chain. Seeing that the rate of acceptance is too low suggests the proposal distribution is too wide, and the variance should be reduced. If the acceptance rate is too high, the variance should be increased.
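As a concrete illustration, the following sketch (ours, not from the text) implements the Metropolis steps for a single parameter, using a normal proposal centered at the current value. A standard normal stands in for the unnormalized posterior; in a real application `log_unnorm_posterior` would be the log prior plus the log likelihood:

```python
import math
import random

def log_unnorm_posterior(theta):
    # Log of the unnormalized target density. A standard normal stands in for
    # the posterior here; in practice this is log prior + log likelihood.
    return -0.5 * theta ** 2

def metropolis(n_iter, start, proposal_sd, seed=1):
    """Metropolis sampler with a normal proposal centered at the current value."""
    rng = random.Random(seed)
    current = start
    chain = [current]
    accepted = 0
    for _ in range(n_iter):
        proposal = rng.gauss(current, proposal_sd)
        # Acceptance probability min(1, p(proposal)/p(current)), on the log scale.
        log_alpha = log_unnorm_posterior(proposal) - log_unnorm_posterior(current)
        if rng.random() < math.exp(min(0.0, log_alpha)):
            current = proposal
            accepted += 1
        chain.append(current)
    return chain, accepted / n_iter

chain, rate = metropolis(n_iter=20000, start=3.0, proposal_sd=2.4)
# If rate is well below the 30%-40% range, proposal_sd is too large; if well
# above it, proposal_sd is too small.
```

Monitoring `rate` and adjusting `proposal_sd` implements the tuning advice above.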
α(θ* | θ^(t)) = min[1, p(θ* | x) / p(θ^(t) | x)]

             = min[1, (p(x | θ*)p(θ*) / ∫p(x | θ)p(θ)dθ) / (p(x | θ^(t))p(θ^(t)) / ∫p(x | θ)p(θ)dθ)]   (5.13)

             = min[1, p(x | θ*)p(θ*) / (p(x | θ^(t))p(θ^(t)))].
The second line results from applications of Bayes' theorem. The simplification from the second to the third line reveals that the denominator in Bayes' theorem does not factor into the calculations. All that is required is the numerator: the prior and the likelihood. As such, MCMC alleviates the need to perform the integration over the parameter space in the denominator of Bayes' theorem to obtain the posterior distribution. This is the key feature of MCMC estimation that permits the estimation of complex Bayesian models. Prior to the advent of MCMC, applications of Bayesian modeling were limited because of the difficulty of evaluating such integrals.
* Conjugate priors are convenient, but they are limited to relatively simple models that can be expressed in the general exponential family. Graphical methods, analytic approximations, and numerical methods to find the maxima of posteriors often require specialized solutions (Bernardo & Smith, 2000), although in given problems, they can be an efficient choice for practical work.
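Equation (5.14) does not survive in this copy; from the surrounding description it is the reversibility (detailed balance) condition, which we reconstruct as:

```latex
p(\theta^{(t)} \mid \mathbf{x})\, p(\theta^{*} \mid \theta^{(t)})
  = p(\theta^{*} \mid \mathbf{x})\, p(\theta^{(t)} \mid \theta^{*}).
\tag{5.14}
```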
The left-hand side of (5.14) is the probability of being at θ^(t) in the posterior distribution and then moving to θ*. The right-hand side is the probability of being at θ* in the posterior distribution and then moving to θ^(t). If the equality in (5.14) holds, it is said that the reversibility condition holds. This is sometimes referred to as the equation being time reversible (Brooks, 1998), or that it exhibits detailed balance (Gilks et al., 1996a). Conceptually, Equation (5.14) states that the joint distribution of sampling the two points θ* and θ^(t) is the same regardless of which comes first. One way to think of the reversibility condition is that it implies that if we ran the chain backwards, we would obtain the values we did with the same probability. If this condition is satisfied, the transition kernel will yield a sample from the distribution of interest, namely the posterior distribution p(θ | x).
The goal then becomes defining the transition kernel p(θ* | θ^(t)) that makes (5.14) true. We begin by choosing a proposal distribution q. If it is the case that

p(θ^(t) | x) q(θ* | θ^(t)) = p(θ* | x) q(θ^(t) | θ*),   (5.15)

then defining p(θ* | θ^(t)) = q(θ* | θ^(t)) makes (5.14) true. Generally, this will not be the case; in general our proposal distribution q will be such that (5.15) does not hold. This means that the unconditional probability of being at a particular point (e.g., θ^(t), without loss of generality) and moving to the other point (θ*) is greater than the unconditional probability of being at θ* and moving to θ^(t). Relatively speaking (in terms of achieving the equality in Equation 5.14), we move from θ^(t) to θ* too often and from θ* to θ^(t) too rarely. To combat this, we devise a mechanism that reduces the probability of moving from θ^(t) to θ*. More specifically, we only accept the move from θ^(t) to θ* with a certain probability and remain at θ^(t) with the complement of that probability. We denote this acceptance probability as α(θ* | θ^(t)).
The probability of moving from θ^(t) to θ* is then

p(θ* | θ^(t)) = q(θ* | θ^(t)) α(θ* | θ^(t)).

This states that the probability of moving from θ^(t) to θ* is defined as the probability of selecting θ* from a proposal distribution multiplied by the probability of accepting that value of θ*. Likewise, the probability of moving from θ* to θ^(t) is defined as

p(θ^(t) | θ*) = q(θ^(t) | θ*) α(θ^(t) | θ*).

Recall that, relatively speaking, we move from θ* to θ^(t) too rarely. To make sure that we never miss a chance to move from θ* to θ^(t), set α(θ^(t) | θ*) = 1; substituting into (5.14) then yields

α(θ* | θ^(t)) = [p(θ* | x) q(θ^(t) | θ*)] / [p(θ^(t) | x) q(θ* | θ^(t))],

which defines the probability for accepting a move from θ^(t) to θ*. If the value of α(θ* | θ^(t)) exceeds one, the move from θ^(t) to θ* is made. If the value of α(θ* | θ^(t)) is less than one, the move is made with that probability and not made with probability 1 − α(θ* | θ^(t)).
useful for understanding some of the issues that arise in the use of such software. In the authors' experience, the fidelity and quality with which software programs implement Bayesian ideas and MCMC estimation vary considerably. Analysts are better off if they are prepared to spot results that signal caution is needed before interpreting the output from software.
* The parameter in the illustration is one from an item response theory model (Chapter 11). Here, we focus on
the patterns of the behavior of the chains generally.
FIGURE 5.2
Trace plot from three chains illustrating monitoring convergence.
the same space. Because all chains eventually must converge to the same distribution, their coming together is taken as evidence, but not proof, of that convergence.
A number of approaches to diagnosing convergence have been proposed, and many can be viewed as comparing summaries of one subset of the draws to those of another. Examples include viewing trace plots as a graphical check of when the draws stabilize, and related ANOVA-like analyses of multiple chains run in parallel that examine the ratio of total variability (i.e., from all chains) to within-chain variability (Brooks & Gelman, 1998; Gelman & Rubin, 1992). Following Gelman et al. (2013), these analyses may be conducted as follows. Run some number C of chains from different starting points assumed to represent disparate locations in the posterior. For any parameter, the pooled within-chain variance obtained based on running T iterations in each of C chains is given by
W = [1/(C(T − 1))] Σ_{c=1}^{C} Σ_{t=1}^{T} (θ_c^(t) − θ̄_c)²,   (5.22)

where θ_c^(t) is the value from iteration t in chain c and θ̄_c is the mean of such values in chain c.
The between-chain variance is given by

B = [T/(C − 1)] Σ_{c=1}^{C} (θ̄_c − θ̄)²,   (5.23)

where θ̄ is the mean of the chain means. The potential scale reduction factor is then

R̂ = √{ [((T − 1)/T)W + (1/T)B] / W },   (5.24)

with values of R̂ near 1 taken as consistent with convergence.
Draws preceding the point of convergence are discarded as burn-in. If there is not sufficient evidence of convergence, more iterations can be conducted, and the resulting draws can then be inspected for evidence of convergence. In this way, the process of evaluating convergence and conducting iterations may itself be iterative, cycling between running more iterations and inspecting the resulting draws. Once there is sufficient evidence of convergence, the draws preceding the point of convergence are discarded as burn-in and the draws from subsequent iterations may be used to approximate the posterior.
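This ANOVA-like diagnostic can be sketched as follows (our illustration; variable names are ours). The function computes W, B, and the potential scale reduction factor for a set of equal-length chains:

```python
import random

def potential_scale_reduction(chains):
    """Return (W, B, R-hat) for a list of equal-length chains of draws,
    following the quantities in Equations (5.22)-(5.24)."""
    C = len(chains)
    T = len(chains[0])
    chain_means = [sum(c) / T for c in chains]
    grand_mean = sum(chain_means) / C
    # Pooled within-chain variance.
    W = sum(sum((v - m) ** 2 for v in c)
            for c, m in zip(chains, chain_means)) / (C * (T - 1))
    # Between-chain variance.
    B = (T / (C - 1)) * sum((m - grand_mean) ** 2 for m in chain_means)
    # Weighted total variability relative to within-chain variability.
    var_hat = ((T - 1) / T) * W + (1.0 / T) * B
    return W, B, (var_hat / W) ** 0.5

# Three well-mixed "chains" drawn from a common distribution: R-hat near 1.
rng = random.Random(0)
chains = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(3)]
W, B, r_hat = potential_scale_reduction(chains)
```

Chains stuck in disparate regions inflate B relative to W and drive the factor well above 1.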
FIGURE 5.3
Autocorrelation plot.
FIGURE 5.4
Autocorrelation plots when thinning by (a) 5, (b) 10, (c) 30, and (d) 40.
With independent draws, the usual Monte Carlo standard error expresses the accuracy of the estimate of the posterior mean as a function of the number of draws and the posterior standard deviation,

σ_θ̄ = σ_θ / √T,   (5.25)

where σ_θ is the posterior standard deviation for θ. But if there is autocorrelation, we have less information from a given number of draws. Borrowing from time series analysis, we can approximate the impact of autocorrelation with a corrected-for-autocorrelation standard error of the mean. If ρ is the autocorrelation in a first-order autocorrelation model, then the corrected standard error of the mean is

σ_θ̄ = (σ_θ / √T) √[(1 + ρ)/(1 − ρ)].   (5.26)
The degree of autocorrelation can be very different for different variables in the same
model, as it depends on the amount of information that data convey about a variable and
its joint relationship with other variables.
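The correction can be sketched as follows (our illustration; function names are ours, and the correction factor follows the first-order form of Equation 5.26 as reconstructed above):

```python
import random

def lag1_autocorrelation(draws):
    """Sample lag-1 autocorrelation of a sequence of draws."""
    n = len(draws)
    mean = sum(draws) / n
    num = sum((draws[t] - mean) * (draws[t + 1] - mean) for t in range(n - 1))
    den = sum((v - mean) ** 2 for v in draws)
    return num / den

def corrected_se(draws):
    """Autocorrelation-corrected standard error of the posterior mean,
    using the first-order correction sqrt((1 + rho) / (1 - rho))."""
    n = len(draws)
    mean = sum(draws) / n
    sd = (sum((v - mean) ** 2 for v in draws) / (n - 1)) ** 0.5
    rho = lag1_autocorrelation(draws)
    return (sd / n ** 0.5) * ((1 + rho) / (1 - rho)) ** 0.5

# Demo: a highly autocorrelated AR(1) chain. Its corrected standard error is
# substantially larger than the naive sd / sqrt(T).
rng = random.Random(42)
chain = [0.0]
for _ in range(5000):
    chain.append(0.9 * chain[-1] + rng.gauss(0.0, 1.0))
```

For a chain with lag-1 autocorrelation near 0.9, the correction inflates the naive standard error by roughly a factor of four, reflecting the reduced information per draw.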
One additional point about serial dependence is worth noting. Although draws may
be dependent within a chain, they are independent between chains. Running multiple
chains and then pooling the resulting iterations therefore also helps mitigate the effects
of serial dependence.
5.7.4 Mixing
Informally, mixing refers to how well the chain moves throughout the support of the distribution. That is, we do not want the chain to get stuck in some region of the distribution, or to ignore a certain area. Figure 5.2 illustrates how things look with relatively poor mixing of each of the chains for the first few hundred iterations. Each one appears
* The relationship is not as straightforward in more complex models such as hierarchical models.
to computational intractability (Robert & Casella, 2011; McGrayne, 2011). We can identify a similar pivot point for MCMC and Bayesian modeling in psychometrics about a decade later, with publications by Arminger and Muthén (1998) and Scheines, Hoijtink, and Boomsma (1999) in factor analysis and structural equation modeling, Hoijtink (1998) in latent class analysis, and particularly those of Patz and Junker (1999a, 1999b) in item response theory. Although seminal work by Albert (1992; Albert & Chib, 1993) had shown how normal-ogive item response theory parameters could be estimated via Gibbs sampling using data-augmentation strategies (Tanner & Wong, 1987), the full power of MCMC was not fully appreciated in the psychometric community until Patz and Junker (1999a, 1999b) described a Metropolis–Hastings-within-Gibbs sampling approach for the most common logistic IRT models without requiring conjugate relationships. The emergence of MCMC has precipitated the explosion in applications of Bayesian psychometrics (Levy, 2009), mirroring the effects of its recognition in other complex statistical modeling scenarios (Brooks et al., 2011; Gilks et al., 1996b). In the following chapters, we will employ MCMC estimation to empirically approximate posterior distributions, which is particularly powerful in light of psychometric models for which analytic solutions are intractable.
Exercises
5.1 Reconsider the beta-binomial model example introduced in Section 2.3. WinBUGS code for the model is given in Section 2.6. Conduct an analysis of the beta-binomial model in WinBUGS, monitoring θ.
a. List at least three possible dispersed initial values for θ that may be used gainfully in running multiple chains. How did you come up with them?
b. Using the initial values identified in (a), run three chains and monitor convergence. When do the chains appear to have converged, and what evidence do you have for that assessment?
c. Run 10,000 iterations for each chain past the point determined in (b). Obtain the density and summary statistics for θ. How does this representation of the posterior compare to the analytical solution depicted in Figure 2.4 and summarized in Section 2.6?
5.2 Conduct an analysis of the beta-Bernoulli version of the model in WinBUGS, monitoring θ.
a. Would the same initial values as used in Exercise 5.1 be sensible here? If not, why not and what values could be used instead?
b. Using the initial values identified in (a), run three chains and monitor convergence. When do the chains appear to have converged, and what evidence do you have for that assessment? How does this compare to what happened in the beta-binomial model in Exercise 5.1?
c. Run 10,000 iterations for each chain past the point determined in (b). Obtain the density and summary statistics for θ. How does this representation of the posterior compare to that from the beta-binomial model in Exercise 5.1?
5.3 Reconsider the example from Section 4.1 for inference about the mean of a normal distribution (μ) with the variance known.
a. Develop WinBUGS code for the model. (Hint: Consider modifying the code given in Section 4.3.2.)
b. List at least three possible dispersed initial values for μ that may be used gainfully in running multiple chains. How did you come up with them?
c. Using the initial values identified in (b), run three chains and monitor convergence. When do the chains appear to have converged, and what evidence do you have for that assessment?
d. Run 20,000 iterations for each chain past the point determined in (c). Obtain the density and summary statistics for μ. How does this representation of the posterior compare to the analytical solution depicted and summarized in Section 4.1?
5.4 Reconsider the example from Section 4.2 for inference about the variance of a normal distribution (σ²) with the mean known.
a. Develop WinBUGS code for the model. (Hint: Consider modifying the code given in Section 4.3.2.)
b. List at least three possible dispersed initial values for σ² that may be used gainfully in running multiple chains. How did you come up with them?
c. Using the initial values identified in (b), run three chains and monitor convergence. When do the chains appear to have converged, and what evidence do you have for that assessment?
d. Run 20,000 iterations for each chain past the point determined in (c). Obtain the density and summary statistics for σ². How does this representation of the posterior compare to the analytical solution depicted and summarized in Section 4.2?
5.5 Reconsider the example from Section 4.3 and Section 5.2.2 for inference about the mean and variance of a normal distribution. Now suppose we are interested in reporting results for the variability in terms of the standard deviation.
a. Develop WinBUGS code for the model. (Hint: Consider modifying the code given in Section 4.3.2.)
b. Would the initial values for μ from Exercise 5.3 and the initial values for σ² from Exercise 5.4 be appropriate here? If not, why not and what values could be used instead?
c. Are additional initial values for the standard deviation needed? If so, what are some possible useful values? If not, why not?
d. Using the initial values identified in (b) and possibly (c), run three chains monitoring the mean, variance, and standard deviation. When do the chains appear to have converged, and what evidence do you have for that assessment?
e. Run 20,000 iterations for each chain past the point determined in (d). Obtain the marginal densities and summary statistics for all the parameters.
f. Consider an estimate of the standard deviation based on taking the square root of the posterior mean of the variance. How does this result compare to the marginal posterior distribution of the standard deviation from (e)?
5.6 Reconsider the example from Section 5.2.2 for inference about the mean and variance of a normal distribution. The computations involved in the first two iterations of a Gibbs sampler were given in Section 5.2.2. The results for three additional iterations are given in Table 5.1. Show the computations for each of these additional iterations.
6
Regression
Letting

ε_i = y_i − E(y_i | β_0, β, x_i)   (6.2)

denote the error for subject i, then following assumptions of normality and homogeneity of variance for the εs, the regression model is given by

y_i = β_0 + Σ_{j=1}^{J} β_j x_ij + ε_i,   (6.3)

where

ε_i ~ N(0, σ²_ε)   (6.4)

and σ²_ε is the error variance. The normal distribution of error terms assumed here is typical, although alternatives such as thicker-tailed distributions and variances that depend on the xs can be useful.
We suspect that readers are familiar with this manner of expressing a regression model. That is, if asked to write down the standard regression model, most readers would write down the contents of (6.3), or some notational variant of it, possibly including the distributional statement in (6.4). One theme of our approach to statistical modeling is that it is advantageous to shift from this sort of strategy of thinking equationally to one of thinking probabilistically. Combining the deterministic expression in (6.3) with the probabilistic expression in (6.4), the standard linear regression model is expressed distributionally as
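A reconstruction of the display that belongs here (Equation 6.5, lost in this copy), combining the deterministic part of (6.3) with the error distribution in (6.4):

```latex
y_i \mid \beta_0, \boldsymbol{\beta}, \sigma^2_{\varepsilon}, \mathbf{x}_i
  \sim N\!\Bigl(\beta_0 + \sum_{j=1}^{J} \beta_j x_{ij},\; \sigma^2_{\varepsilon}\Bigr).
\tag{6.5}
```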
The first term on the right-hand side is the conditional probability of all of the data. The second term is the prior distribution for all the parameters. Assuming prior independence of (β_0, β, σ²_ε) and γ allows for the factorization p(β_0, β, σ²_ε, γ) = p(β_0, β, σ²_ε)p(γ). It can be shown (e.g., Jackman, 2009) that the posterior distribution can then be factored as

p(β_0, β, σ²_ε, γ | y, x) = p(β_0, β, σ²_ε | y, x) p(γ | x).

This implies that we can analyze the first term on the right-hand side (the elements of the standard regression model) by itself with no loss of information. As a consequence, the distinction between x being fixed or stochastic is irrelevant in the Bayesian analysis of the model (Jackman, 2009). Either way, the p(x) and γ terms drop out of the model and subsequent analysis. This highlights the utility of thinking distributionally and expressing the model in (conditional) probabilistic terms in (6.5); namely, for inference about the
p(β_0, β, σ²_ε | y, x) = p(y | β_0, β, σ²_ε, x) p(β_0, β, σ²_ε) / p(y | x) ∝ p(y | β_0, β, σ²_ε, x) p(β_0, β, σ²_ε).   (6.8)

The conditional probability of the data is

p(y | β_0, β, σ²_ε, x) = Π_{i=1}^{n} p(y_i | β_0, β, σ²_ε, x_i).   (6.9)
FIGURE 6.1
DAG fragment for the conditional probability of the outcomes (y_i) for the regression model that specifies predictors (x_ij), slopes (β_j), an intercept (β_0), and the error variance (σ²_ε).
The intercept is specified as having a normal prior distribution,

β_0 ~ N(μ_{β_0}, σ²_{β_0}),   (6.10)

and the prior for the collection of slopes is factored as

p(β) = Π_{j=1}^{J} p(β_j).   (6.11)
Of course, if we had beliefs about the values of the β_js through our knowledge of the substance of an application (even, for example, just knowing the scales that the xs are measured on), we could forego the presumption of exchangeability for some or all of the β_js, or have subsets of exchangeable groups of them, as appropriate. It is quite common to treat the coefficients as exchangeable, or to include enough conditioning variables to treat them as conditionally exchangeable. For ease of exposition, we continue with the model assuming exchangeability. Each regression slope β_j, j = 1, …, J, is specified as having a (common) normal distribution,

β_j ~ N(μ_β, σ²_β).   (6.12)

The error variance is specified as having an inverse-gamma prior distribution,

σ²_ε ~ Inv-Gamma(ν_0/2, ν_0σ²_0/2),

where σ²_0 and ν_0 are hyperparameters that may be interpreted as an estimate for the error variance and the degrees of freedom or pseudo-sample size associated with that estimate, respectively.
FIGURE 6.2
DAG fragment for the prior distribution for the regression model, with hyperparameters for the normal prior distributions for the slopes (β_j), the normal prior distribution for the intercept (β_0), and an inverse-gamma prior for the error variance (σ²_ε).
Figure 6.2 contains a fragment of the DAG corresponding to the prior distribution for the parameters. The specification of independent priors for the parameters β_0, β, and σ²_ε is reflected in the graph by the absence of directed edges connecting the nodes for the parameters. Further, the plate over j containing the β_j nodes indicates replication over the coefficients for the J predictors. Note that the hyperparameters μ_β and σ²_β lie outside the plate, indicating that the β_js have a common prior distribution, as is motivated by treating the β_js as exchangeable. As noted above, if we had beliefs about the values of the β_js through our knowledge of the substance of an application, we could forego the presumption of exchangeability for some or all of the β_js.
where

β_0 ~ N(μ_{β_0}, σ²_{β_0}),

β_j ~ N(μ_β, σ²_β),   j = 1, …, J,

and

σ²_ε ~ Inv-Gamma(ν_0/2, ν_0σ²_0/2).
FIGURE 6.3
DAG for the regression model for outcomes (y_i) regressed on predictors (x_ij), with normal prior distributions for the slopes (β_j) and intercept (β_0), and an inverse-gamma prior for the error variance (σ²_ε).
With the conditionally conjugate prior, the model is conceptually simple in that prior
distributions are specified for each parameter individually. Furthermore, these specifica-
tions yield reasonably simple steps in MCMC techniques used to empirically approximate
the posterior, to which we now turn.
A Gibbs sampler draws each parameter from its full conditional distribution in turn; the final steps of one iteration are

β_J^(t+1) ~ p(β_J | β_0^(t+1), β_1^(t+1), …, β_{J−1}^(t+1), σ²_ε^(t), y, x)

and

σ²_ε^(t+1) ~ p(σ²_ε | β_0^(t+1), β_1^(t+1), …, β_J^(t+1), y, x).
The choice of the conditionally conjugate prior distribution for the parameters renders
these full conditionals to be of known form (see Appendix A for derivations). We express
the full conditional distributions in the following equations. On the left-hand side, the
Regression 123
parameter in question is written as conditional on all the other relevant parameters and
data.* On the right-hand side of each of the following equations, we give the parametric
form for the full conditional distribution. In several places, we denote the arguments of the
distribution (e.g., mean and variance for a normal distribution) with subscripts denoting
that it refers to the full conditional distribution; the subscripts are then just the condition-
ing notation of the left-hand side.
To present all of them compactly, let x_A be the (n × [J + 1]) augmented predictor matrix obtained by appending an (n × 1) column vector of 1s to the predictor matrix x, where x has been arranged such that x_i is in the ith row of x. Let β_A be the ([J + 1] × 1) augmented vector of coefficients obtained by combining the intercept β_0 with the J coefficients in β. Let β_{A(−j)} denote the (J × 1) vector obtained by omitting β_j from β_A. Similarly, let x_{A(−j)} denote the (n × J) matrix obtained by omitting the corresponding column of values from x_A. When j > 0, x_j refers to the column vector with values for the jth predictor; when j = 0, x_j refers to the column vector of 1s.
The full conditional distribution for β_j, j = 0, …, J, is then

β_j | β_{A(−j)}, σ²_ε, y, x ~ N(μ_{β_j | β_{A(−j)}, σ²_ε, y, x}, σ²_{β_j | β_{A(−j)}, σ²_ε, y, x}),   (6.16)

where

μ_{β_j | β_{A(−j)}, σ²_ε, y, x} = [1/σ²_{β_j} + (1/σ²_ε) x_j′x_j]^{−1} [μ_{β_j}/σ²_{β_j} + (1/σ²_ε) x_j′(y − x_{A(−j)} β_{A(−j)})]   (6.17)

and

σ²_{β_j | β_{A(−j)}, σ²_ε, y, x} = [1/σ²_{β_j} + (1/σ²_ε) x_j′x_j]^{−1}.   (6.18)

When j = 0, μ_{β_j} and σ²_{β_j} in (6.17) and (6.18) refer to the prior mean and variance of the intercept, which we had denoted as μ_{β_0} and σ²_{β_0} in (6.10). When j ≥ 1, μ_{β_j} and σ²_{β_j} in (6.17) and (6.18) refer to the prior mean and variance of the coefficient for the jth predictor, which under exchangeability were denoted as μ_β and σ²_β in (6.12).
The full conditional distribution for σ²_ε is

σ²_ε | β_A, y, x ~ Inv-Gamma( (ν_0 + n)/2, (ν_0σ²_0 + SS(E))/2 ),   (6.19)

where SS(E) = (y − x_Aβ_A)′(y − x_Aβ_A) is the sum of squared errors.
A Gibbs sampler is constructed by iteratively drawing from the full conditional distribu-
tions defined by (6.16) and (6.19) using the just-drawn values for the conditioned parameters.
* We suppress the role of specified hyperparameters in this notation; see AppendixA for a presentation that
formally includes the hyperparameters.
A summary of the Gibbs sampler, for one generic iteration (t + 1) that occurs in a sequence
of many such iterations, is as follows:
1. Sample the regression coefficients. For each j = 0, …, J, compute

μ^(t+1)_{β_j | β^(current)_{A(−j)}, σ²_ε^(t), y, x} = [1/σ²_{β_j} + (1/σ²_ε^(t)) x_j′x_j]^{−1} [μ_{β_j}/σ²_{β_j} + (1/σ²_ε^(t)) x_j′(y − x_{A(−j)} β^(current)_{A(−j)})]   (6.21)

and

σ²^(t+1)_{β_j | β^(current)_{A(−j)}, σ²_ε^(t), y, x} = [1/σ²_{β_j} + (1/σ²_ε^(t)) x_j′x_j]^{−1},   (6.22)

and sample a value from (6.16) using these values for the full conditional mean and variance:

β_j^(t+1) | β^(current)_{A(−j)}, σ²_ε^(t), y, x ~ N( μ^(t+1)_{β_j | β^(current)_{A(−j)}, σ²_ε^(t), y, x}, σ²^(t+1)_{β_j | β^(current)_{A(−j)}, σ²_ε^(t), y, x} ).   (6.23)
2. Sample the error variance. Compute the sums of squares using the just-drawn values
for the augmented vector of regression coefficients,
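To make the two-step scheme concrete, here is an illustrative pure-Python Gibbs sampler for this model (our sketch; the function name, argument names, and simulated-data usage are ours, not from the text). It cycles through the normal full conditionals for the coefficients and the inverse-gamma full conditional for the error variance:

```python
import math
import random

def gibbs_regression(y, X, n_iter, prior_mean=0.0, prior_var=1000.0,
                     nu0=2.0, sigma2_0=1.0, seed=1):
    """Gibbs sampler for normal-theory regression with independent normal
    priors on the coefficients and an Inv-Gamma(nu0/2, nu0*sigma2_0/2)
    prior on the error variance. X must include a leading column of 1s."""
    rng = random.Random(seed)
    n = len(y)
    p = len(X[0])  # J + 1 coefficients, including the intercept
    beta = [0.0] * p
    sigma2 = 1.0
    draws = []
    for _ in range(n_iter):
        # Step 1: sample each coefficient from its normal full conditional
        # (cf. Equations 6.21-6.23), conditioning on the just-drawn values.
        for j in range(p):
            xjxj = sum(X[i][j] ** 2 for i in range(n))
            # Partial residuals: y minus the other coefficients' contribution.
            resid = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                     for i in range(n)]
            xjr = sum(X[i][j] * resid[i] for i in range(n))
            post_var = 1.0 / (1.0 / prior_var + xjxj / sigma2)
            post_mean = post_var * (prior_mean / prior_var + xjr / sigma2)
            beta[j] = rng.gauss(post_mean, math.sqrt(post_var))
        # Step 2: sample the error variance from its inverse-gamma full
        # conditional (cf. Equation 6.19), via the reciprocal of a gamma draw.
        sse = sum((y[i] - sum(X[i][k] * beta[k] for k in range(p))) ** 2
                  for i in range(n))
        shape = (nu0 + n) / 2.0
        scale = (nu0 * sigma2_0 + sse) / 2.0
        sigma2 = scale / rng.gammavariate(shape, 1.0)
        draws.append((list(beta), sigma2))
    return draws
```

Running the sampler on simulated data and averaging the post-burn-in draws recovers the generating coefficients, as the next section's worked example does by hand.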
The example concerns the relationships among scores from three end-of-chapter tests in a course. Specifically, scores from the tests in Chapters 1 and 2 are used to predict scores from that in Chapter 3. Test scores were formed by summing the number of correctly answered items. Table 6.1 contains summary statistics obtained from 50 students.
The posterior distribution is

p(β_0, β_1, β_2, σ²_ε | y, x) ∝ Π_{i=1}^{50} p(y_i | β_0, β, σ²_ε, x_i) p(β_0) Π_{j=1}^{2} p(β_j) p(σ²_ε),

where

y_i | β_0, β_1, β_2, x_i, σ²_ε ~ N(β_0 + β_1 x_i1 + β_2 x_i2, σ²_ε),   i = 1, …, 50,

β_0 ~ N(0, 1,000),

β_j ~ N(0, 1,000),   j = 1, 2,

and

σ²_ε ~ Inv-Gamma(1, 1).
The errors are defined as

ε_i = y_i − (β_0 + β_1 x_i1 + β_2 x_i2),   (6.26)

and the proportion of variance accounted for is

R² = 1 − var(ε)/var(y),   (6.27)

where var(y) = [1/(n − 1)] Σ_{i=1}^{n} (y_i − ȳ)² is the usual finite-sample variance formula, and analogously var(ε) = [1/(n − 1)] Σ_{i=1}^{n} (ε_i − ε̄)². The first of these is straightforward, because all ys are known and we are simply calculating a measure of their dispersion. The latter, var(ε), has
TABLE 6.1
Summary Statistics for the Three End-of-Chapter Tests for n = 50 Subjects

                      Chapter 1    Chapter 2    Chapter 3
Number of items           16           18           15
Mean                   14.10        14.34        12.22
Standard deviation      2.02         3.29         2.96

                      Chapter 1    Chapter 2
Chapter 2                .58
Chapter 3                .69          .68

Note: The bottom half of the table gives the correlations between the test scores.
a straightforward interpretation from a Bayesian perspective, but would seem an odd mix of parameter and estimate from a frequentist perspective. Once the full joint distribution of all the variables in the regression model has been constructed and the ys and xs have been observed, we have posterior distributions conditional on them for the βs and, as functions of variables in the model, for the εs, var(ε), and R² as well. If the βs were known with certainty, R² would characterize the uncertainty remaining about the ys for the realized data when we condition on the xs. The posterior distribution for R² additionally takes the uncertainty about the βs into account for this descriptor of the predictive value of the model for the data in hand. This contrasts with the frequentist calculation of a single estimate of R² using point estimates of the model parameters, and interest in its distribution in repeated samples of y.
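The draw-by-draw computation of R² can be sketched as follows (our illustration of Equations 6.26 and 6.27; the function name and toy data are ours). Applied to each saved posterior draw of the coefficients, it yields an empirical posterior distribution for R²:

```python
def r_squared_draw(y, X, beta):
    """One posterior draw of R-squared: compute the errors under the drawn
    coefficients (Equation 6.26) and then the variance ratio (Equation 6.27)."""
    n = len(y)
    errors = [y[i] - sum(X[i][k] * beta[k] for k in range(len(beta)))
              for i in range(n)]

    def var(v):
        # Usual finite-sample variance with an n - 1 denominator.
        m = sum(v) / len(v)
        return sum((u - m) ** 2 for u in v) / (len(v) - 1)

    return 1.0 - var(errors) / var(y)
```

A coefficient draw that reproduces the data exactly gives R² of 1, and an all-zero draw gives 0; intermediate draws trace out the posterior uncertainty in the model's predictive value.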
We compute

μ^(1)_{β_0 | β^(current)_{A(−0)}, σ²_ε^(0), y, x} = [1/σ²_{β_0} + (1/σ²_ε^(0)) x_0′x_0]^{−1} [μ_{β_0}/σ²_{β_0} + (1/σ²_ε^(0)) x_0′(y − x_{A(−0)} β^(current)_{A(−0)})]

= [1/σ²_{β_0} + (1/σ²_ε^(0)) x_0′x_0]^{−1} [μ_{β_0}/σ²_{β_0} + (1/σ²_ε^(0)) x_0′(y − x_{A(−0)} (β_1^(0), β_2^(0))′)]

= [1/1,000 + (1/5) 1′1]^{−1} [0/1,000 + (1/5) 1′(y − x_{A(−0)} (1, 0.5)′)] ≈ 9.05

and

σ²^(1)_{β_0 | β^(current)_{A(−0)}, σ²_ε^(0), y, x} = [1/σ²_{β_0} + (1/σ²_ε^(0)) x_0′x_0]^{−1} = [1/1,000 + (1/5) 1′1]^{−1} ≈ 0.10.
TABLE 6.2
Computations in the Gibbs Sampler for the Example Regressing Chapter 3 Test Scores on Chapter 1 Test Scores and Chapter 2 Test Scores

β_0
Iteration   Draw    Full conditional mean    Full conditional variance
0           3.00
1           9.17    9.05                     0.10
2           9.13    8.90                     0.07
3           8.89    8.69                     0.09

β_1
Iteration   Draw    Full conditional mean    Full conditional variance
0           1.00
1           0.98    1.00                     0.0005
2           0.99    0.98                     0.0004
3           0.97    1.00                     0.0004

β_2
Iteration   Draw    Full conditional mean    Full conditional variance
0           0.50
1           0.51    0.52                     0.0005
2           0.48    0.50                     0.0003
3           0.49    0.51                     0.0004

σ²_ε
Iteration   Draw    (ν_0 + n)/2    SS(E)     (ν_0σ²_0 + SS(E))/2
0           5.00
1           3.55    26.00          220.26    111.13
2           4.33    26.00          221.38    111.69
3           3.92    26.00          219.61    110.81

Note: For the coefficients, the full conditional means and variances are the ingredients computed via (6.21) and (6.22); for the error variance, the shape and scale are the ingredients of the inverse-gamma full conditional in (6.19). Iteration 0 lists the initial values.
Similarly, for β_1 we compute

μ^(1)_{β_1 | β^(current)_{A(−1)}, σ²_ε^(0), y, x} = [1/σ²_{β_1} + (1/σ²_ε^(0)) x_1′x_1]^{−1} [μ_{β_1}/σ²_{β_1} + (1/σ²_ε^(0)) x_1′(y − x_{A(−1)} β^(current)_{A(−1)})]

= [1/σ²_{β_1} + (1/σ²_ε^(0)) x_1′x_1]^{−1} [μ_{β_1}/σ²_{β_1} + (1/σ²_ε^(0)) x_1′(y − x_{A(−1)} (β_0^(1), β_2^(0))′)]

= [1/1,000 + (1/5) x_1′x_1]^{−1} [0/1,000 + (1/5) x_1′(y − x_{A(−1)} (9.17, 0.5)′)] ≈ 1.00

and

σ²^(1)_{β_1 | β^(current)_{A(−1)}, σ²_ε^(0), y, x} = [1/σ²_{β_1} + (1/σ²_ε^(0)) x_1′x_1]^{−1} = [1/1,000 + (1/5) x_1′x_1]^{−1} ≈ 0.0005.
and

σ²^(1)_{β_2 | β^(current)_{A(−2)}, σ²_ε^(0), y, x} = [1/σ²_{β_2} + (1/σ²_ε^(0)) x_2′x_2]^{−1} = [1/1,000 + (1/5) x_2′x_2]^{−1} ≈ 0.0005.
Table 6.2 lists the results from the first three iterations of a Gibbs sampler (see Exercise 6.1).
--------------------------------------------------------------------------
#########################################################################
# Model Syntax
#########################################################################
model{
#########################################################################
# Prior distributions
#########################################################################
beta.0 ~ dnorm(0, .001) # prior for the intercept
beta.1 ~ dnorm(0, .001) # prior for coefficient 1
beta.2 ~ dnorm(0, .001) # prior for coefficient 2
tau.e ~ dgamma(1, 1) # prior for the error precision
sigma.e <- 1/sqrt(tau.e) # standard deviation of the errors
#########################################################################
# Conditional distribution of the data
# Via a regression model
#########################################################################
for(i in 1:n){
y.prime[i] <- beta.0 + beta.1*x1[i] + beta.2*x2[i]
y[i] ~ dnorm(y.prime[i], tau.e)
}
#########################################################################
# Calculate R-squared
#########################################################################
for(i in 1:n){
error[i] <- y[i] - y.prime[i]
}
# A possible completion (the original listing is truncated here):
# R-squared as the complement of the error-to-outcome variance ratio.
R.squared <- 1 - pow(sd(error[]), 2)/pow(sd(y[]), 2)
}
--------------------------------------------------------------------------
Three chains were run from dispersed start values. Convergence diagnostics, including the history plots and the potential scale reduction factor (Section 5.7.2), suggested convergence within just a few iterations. The first 1,000 iterations were discarded as burn-in and then 10,000 iterations from each chain were obtained and saved, yielding 30,000 iterations that served to empirically approximate the posterior distribution.
Figure 6.4 contains density plots representing the marginal posterior distributions for the intercept (β_0), the slope for the Chapter 1 test score (β_1), the slope for the Chapter 2 test score (β_2),
FIGURE 6.4
Marginal posterior densities for the parameters and R² from the example regressing test scores on previous test scores.
the standard deviation of the errors σ_ε = √(σ²_ε), and R². The marginal distributions are unimodal and fairly symmetric, with the distribution for σ_ε having a slight positive skew and that for R² having a negative skew. Accordingly, we report the posterior median as well as the posterior mean in Table 6.3, along with posterior standard deviations and 95% HPD intervals, as summaries of the marginal posterior distributions for these parameters as well as R².
Table 6.3 also contains the results from a frequentist (ML) solution to the model. Note the similarity in the values between the Bayesian and frequentist analyses. This illustrates the general point that, with diffuse priors, the posterior distribution strongly resembles the likelihood function and the results from the two approaches will be similar. We stress that these results are numerically similar, but conceptually different. The Bayesian analysis yields direct summary and probabilistic statements about the parameters themselves, which represent our uncertain beliefs about their values. For β_1, a posterior mean of 0.66 and posterior standard deviation of 0.17 are descriptions of the distribution for the parameter, not of a parameter estimator as in frequentist analyses. Similarly, the credibility interval is
TABLE 6.3
Summary of the Results from Bayesian and Frequentist Analyses of the Example Regressing the Chapter 3 Test Scores on the Chapter 1 Test Scores and the Chapter 2 Test Scores

        Bayesian Analysis                                          Frequentist Analysis
        Posterior   Posterior   Posterior
        Mean        Median      SDa      95% HPDb Int.      Est.     SEc     95% Conf. Int.
β_0     −2.53       −2.54       1.94     (−6.43, 1.15)      −2.54    1.93    (−6.41, 1.34)
β_1      0.66        0.66       0.17     (0.34, 0.99)        0.66    0.17    (0.33, 0.99)
β_2      0.38        0.38       0.10     (0.17, 0.57)        0.38    0.10    (0.18, 0.59)
σ_ε      1.91        1.90       0.20     (1.54, 2.31)        1.95    0.28    (1.60, 2.37)
R²       0.58        0.59       0.02     (0.55, 0.60)        0.60

Source: The frequentist results are from Levy, R., & Crawford, A. V. (2009). Incorporating substantive knowledge into regression via a Bayesian approach to modeling. Multiple Linear Regression Viewpoints, 35, 49. Reproduced here with permission of the publisher.
a SD = Standard Deviation.
b HPD = Highest Posterior Density.
c SE = Standard Error.
interpreted as a probabilistic statement about the parameter; there is a .95 probability that
β1 is between 0.34 and 0.99, according to this model. As noted above, in contrast to fre-
quentist approaches, a Bayesian approach yields a distribution for R², which captures our
uncertainty regarding the predictive power of the model.
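The 95% HPD intervals reported in Table 6.3 can be obtained from the sampler's draws as the shortest interval containing 95% of them. A minimal sketch, using simulated stand-in draws for β1 rather than actual chain output:

```python
# Illustrative sketch: empirical 95% HPD interval as the shortest window
# covering 95% of the sorted posterior draws. The draws are simulated
# stand-ins (normal with mean 0.66, SD 0.17, mimicking beta_1), not real
# chain output.
import random

random.seed(3)
draws = sorted(random.gauss(0.66, 0.17) for _ in range(50_000))
n = len(draws)
k = int(0.95 * n)  # number of draws each candidate interval must span

# Slide a window across the sorted draws; the narrowest one approximates
# the HPD interval (valid for unimodal posteriors).
width, start = min((draws[i + k] - draws[i], i) for i in range(n - k))
lo, hi = draws[start], draws[start + k]
print(round(lo, 2), round(hi, 2))
```

For a symmetric posterior this essentially reproduces the equal-tailed interval; for skewed posteriors, such as those for σ and R² here, the HPD interval shifts toward the mode.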
Exercises
6.1 Reconsider the multiple regression model example introduced in Section 6.6.
The computations involved in the first iteration of a Gibbs sampler were given in
Section 6.6.2. The results for two additional iterations are given in Table 6.2. Show
the computations for each of these additional iterations.
6.2 Based on the results reported in Table 6.3, write a brief summary of the results
of fitting the model using a Bayesian approach, including interpretations for each
parameter.
6.3 The posterior implies that β0 is most likely negative. Calculate the posterior proba-
bility of this, p(β0 < 0 | y, x), by running a modified model in WinBUGS. This can be
accomplished by adding the following lines to the WinBUGS code in Section 6.6.3.
------------------------------------------------------------------
is.beta.0.greater.than.0 <- max(beta.0, 0)
probability.beta.0.less.than.0 <- equals(is.beta.0.greater.than.0, 0)
------------------------------------------------------------------
The posterior probability p(β0 < 0 | y, x) is then given by the posterior mean of
probability.beta.0.less.than.0.
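Equivalently, given the sampler's draws of β0, the posterior probability is just the proportion of draws below zero; that is, the posterior mean of an indicator variable, which is what the equals/max construction computes inside WinBUGS. A sketch with simulated stand-in draws (normal, with the Table 6.3 posterior mean -2.53 and SD 1.94):

```python
# Illustrative sketch: approximate p(beta_0 < 0 | y, x) as the mean of an
# indicator over posterior draws. The draws are simulated stand-ins (normal
# with the Table 6.3 posterior mean -2.53 and SD 1.94), not real chain output.
import random

random.seed(7)
draws = [random.gauss(-2.53, 1.94) for _ in range(100_000)]

# Indicator: 1 when beta_0 < 0, else 0; its average estimates the probability.
prob_negative = sum(1.0 if b < 0 else 0.0 for b in draws) / len(draws)
print(round(prob_negative, 2))  # about 0.90 under these stand-in draws
```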
6.4 Given the interpretation of a regression intercept in Exercise 6.2, is it sensible to
think that β0 could be negative? If not, why does this occur? What could be done
to prevent this from happening, that is, how could the model be modified to build
in the restriction that β0 should not be negative?
6.5 Fit a modified model based on Exercise 6.4 in WinBUGS, monitoring convergence,
and summarize the results. How do they compare to those for the original model
reported in Table 6.3?
Section II
Psychometrics
We now pivot from foundational issues that comprised the first half of the book to focus
on psychometrics. We begin by introducing what we refer to as the canonical psychometric
model, an abstraction of the popular psychometric models that are the foci of the second half
of the book. We develop Bayesian approaches to common psychometric activities in the con-
text of this model. We consider these approaches to be canonical as well, and they will shape
much of the rest of this book. After introducing them, we briefly discuss the extent to which
these approaches are prevalent in operational assessment. We conclude with a summary
discussion that looks ahead to the rest of the book. The treatment is abstract throughout the
chapter, and may be seen as the platform for instances and extensions of these core ideas that
make up most of the balance of the book. To facilitate the developments in the chapter, we
first introduce a distinction among different kinds of DAGs.
7
Canonical Bayesian Psychometric Modeling
FIGURE 7.1
Three kinds of directed acyclic graphs for a regression model: (a) just-persons, (b) all-unknowns, and (c) with
hyperparameters.
* In the psychometric model, observables x are the connection between the latent-variable space and per-
sons' real-world actions. Recall that the nature of observable variables and how they come to take their values
differ substantially across types of measuring instruments (e.g., multiple-choice tests, opinion-survey
responses, evaluations of dance performances). The observables in applications of classical test theory are
often test scores, while the observables in item response theory could be item scores on the same test. There
can be several stages in evaluating complex responses, as when essays are scored by computers and distinct
processes of lexical, syntactic, and semantic analyses are carried out and combined to approximate human
raters. Although these considerations are integral to the meaning of the xs in any application of a latent vari-
able model and essential to the design and interpretation of any assessment, they are not the focus of the present
volume. The reader is referred to Mislevy et al. (2003) for further discussion of these issues.
FIGURE 7.2
The just-persons directed acyclic graph for the canonical psychometric model.
observables may be seen as arising from an assumption that they are exchangeable and
may be modeled as conditionally independent given the latent student model variable θ:

p(x) = p(x_1, …, x_J) = ∫ ∏_{j=1}^{J} p(x_j | θ) p(θ) dθ.  (7.1)

For examinee i,

p(x_i) = p(x_i1, …, x_iJ) = ∫ ∏_{j=1}^{J} p(x_ij | θ_i) p(θ_i) dθ_i.  (7.2)
Recall that the representation theorem says that a marginalized conditional independence
structure exists to express beliefs about an exchangeable set of variables, but does not spec-
ify the forms of either the conditional distribution or the mixing distribution. In practical
work, these must come from the analyst's beliefs about the substance of the problem. Based
on the form of the data and theory about the nature of proficiency, the various psychomet-
ric models differ in terms of the chosen forms of the distributions and variables (e.g., con-
tinuous or discrete xs, continuous or discrete θs, the latter of which implies replacing the
integral in (7.2) with a summation) and possibly additional parameters needed to warrant
exchangeability and the conditional independence structures.
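The mixture structure in (7.2) can be seen in simulation: draw θ from its distribution, then draw observables conditionally independently given θ; marginally, mixing over θ, the observables are dependent. A sketch using the two-class, two-item numbers that reappear in the subtraction example later in this chapter (Table 7.1):

```python
# Illustrative sketch: conditional independence given theta plus mixing over
# theta induces marginal dependence among the observables.
import random

random.seed(1)
p_proficient = 0.6  # p(theta = Proficient)
cond = {True: (0.70, 0.65), False: (0.20, 0.15)}  # P(correct) for items 1, 2

n = 200_000
pairs = []
for _ in range(n):
    theta = random.random() < p_proficient        # draw the latent class
    p1, p2 = cond[theta]
    pairs.append((random.random() < p1,           # items conditionally
                  random.random() < p2))          # independent given theta

p_x1 = sum(a for a, _ in pairs) / n
p_x2 = sum(b for _, b in pairs) / n
p_both = sum(a and b for a, b in pairs) / n
print(p_both > p_x1 * p_x2)  # True: marginally, the items are associated
```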
The importance of structuring psychometric models via conditional independence relation-
ships has long been recognized for its computational benefits. We briefly rehash some of the
discussion from Section 3.5 to additionally stress its psychological role in psychometric mod-
els. As Pearl (1988) noted, conditional independence relationships are so powerful for reason-
ing that we not only seek out variables that induce them, but we also invent these variables
if they do not exist. This is in line with an epistemic lens on probability models, in which the
formal entities are abstractions defined in a model space to facilitate the use of the machin-
ery of the model, namely probability calculus. De Finetti's representation theorem does not
imply that θ necessarily exists in an ontological sense, out there in the world to be discovered.
Rather, it is introduced as a tool in the model space to organize our thinking about the world.
The implication is that using a psychometric model does not require assuming that θ
exists out there in the world. It is constructed or defined in the model space to be that
* In certain Bayesian and graphical modeling traditions the term local independence is used in a different sense
than that used here (Spiegelhalter & Lauritzen, 1990).
The position taken here is that a true number of dimensions does not exist, but that a
sufficient number of dimensions is needed to accurately represent the major relation-
ships in the . . . data. If the fine details of the relationships in the data are of interest, more
dimensions are needed to accurately show those details.
An analyst chooses the number of latent variables based on the desired purposes in con-
cert with the dependencies in the data. Our goal in building a psychometric model is not to be
correct, but to be useful for the purposes at hand. We want the observables to be as close
to conditionally independent as will balance the technical advantages of conditional inde-
pendence, the practicality of implementation and understanding, and the modeling (or
circumvention) of conditional dependence that would cause unacceptable errors in the
inferences we make through the model.
In summary, when conducting model-based reasoning in educational assessment,
we introduce latent variables as modeling devices. Latent variables lurk in the realm
of the model space that we as analysts build, and are introduced to organize our think-
ing about examinee capabilities, proficiencies, and attributes. This reflects a constructive
nature of modeling, in that models are based on certain beliefs and directed towards
achieving certain purposes. The beliefs that are being modeled are better thought of
as epistemological entities, rather than literal or ontological entities. The use of a con-
ditional independence structure is not an ontological statement about the world, but
rather an epistemological statement about our thinking about the world as it pertains to
our beliefs and reasoning about the situation at hand. Though a latent variable is often
interpreted as representing examinee capabilities, it arguably has more to do with what
is going on in our heads than in the examinees' heads. Our beliefs are informed by what
we actually do know about the situation from theory and experience, but also tailored
to the purposes we have in mind. This makes us more comfortable with using models
of different grain sizes or different qualitative structures, when they highlight different
aspects of situations that might be differentially relevant for different problems. To be
sure, the fidelity of our thinking to the real-world situation is critical. We may of course
decide that a simpler model is insufficient for desired inferences or find that it simply
does not do sufficient justice to the data. Modeling involves balancing tradeoffs among
fidelity to the real-world situation we aim to reason about, practicality of use and inter-
pretability, computational ease, and consequences of inferential errors that may result
from the model.
Suppose we have multiple observable variables derived from responses to several sub-
traction items. Specifically, in this chapter we will employ a running example where
there are two such subtraction items, yielding two observables corresponding to scored
responses to those items. Treating the observables (items) as exchangeable, we may repre-
sent the joint distribution of the observables for any examinee via (7.2), which motivates
the introduction of θ to render the observables conditionally independent. Its use in the
model does not correspond to an ontological claim regarding any such variable. Rather,
the specification of θ here is a choice made to frame and organize our thinking as tailored
to our assessment purposes and desired inferences, informed by the fit of the proposed
model to the data. Other choices could be made, reflecting interest in more and finer-
grained distinctions regarding subtraction proficiency, a point we return to in an example
in Chapter 14.
* We had a similar example in the context of the medical diagnosis model in Chapter 2, with x being the result
from a mammography screener and θ being cancer status.
in p(θ), enabling probability-based reasoning about the unknown θ. For examinee i, the
posterior distribution for the latent variable given values for the J observables is

p(θ_i | x_i) = [∏_{j=1}^{J} p(x_ij | θ_i)] p(θ_i) / ∫ [∏_{j=1}^{J} p(x_ij | θ_i)] p(θ_i) dθ_i ∝ ∏_{j=1}^{J} p(x_ij | θ_i) p(θ_i),  (7.3)
where the integral for the normalizing constant in the denominator is replaced by a sum-
mation in the case of a discrete latent variable. The posterior summarizes our beliefs about
the examinee after having observed her performance.
To complete the model specification in (7.3), we must specify p(θ_i). The form of the distribu-
tion is in part determined by the proposed nature of the latent variable. As we will see, a nor-
mal distribution is commonly specified when the latent variable is continuous, and Bernoulli
or categorical distributions are commonly specified when the latent variable is discrete. For
our current development, we let θ_P generically denote the parameters of the prior distribution
that is specified.* As examples, if θ_i is modeled as a normal random variable, θ_P contains the
mean and variance of the normal distribution; if θ_i is modeled as a Bernoulli random variable,
θ_P contains the probability that θ_i takes on one of its possible values.
To extend inference to the case of multiple examinees, an assertion of exchangeability
implies the use of a common measurement model and prior distribution for all examinees.
A hierarchical model results. Letting θ = (θ_1, …, θ_n) be the full collection of latent variables
from n examinees, the joint posterior of all the examinees' latent variables is

p(θ | x) ∝ ∏_{i=1}^{n} ∏_{j=1}^{J} p(x_ij | θ_i) p(θ_i).  (7.4)
TABLE 7.1
Conditional Probabilities for the Responses to Two Items
Response to Item 1 Response to Item 2
Proficiency Correct Incorrect Correct Incorrect
Proficient .70 .30 .65 .35
Not Proficient .20 .80 .15 .85
* Recall that the use of the subscript P in θ_P is adopted to signal that these are the parameters that govern
the prior distribution for θ. Similarly, the subscript indicates that, in a DAG representation, the elements of θ_P
are parents of the θs.
We can see that item 2 is slightly harder than item 1 by recognizing that, for each level of proficiency, the
(conditional) probability of a correct response is lower for item 2 than for item 1.
Now suppose we observe an examinee correctly answer the first item, and incorrectly
answer the second item. The posterior probability for the examinees latent variable taking
on a value of c (in the current example: Proficient or Not Proficient) is given by

p(θ_i = c | x_i) = [∏_{j=1}^{J} p(x_ij | θ_i = c)] p(θ_i = c) / ∑_g [∏_{j=1}^{J} p(x_ij | θ_i = g)] p(θ_i = g).
Working through the computations, the posterior probabilities are
(.70)(.35)(.60) .147
= = .68
(.70)(.35)(.60) + (.20)(.85)(.40) .147 + .068
and
(.20)(.85)(.40) .068
= = .32.
(.70)(.35)(.60) + (.20)(.85)(.40) .147 + .068
On the basis of the observed data, the posterior probability that the examinee is profi-
cient is .68, which is slightly higher than the prior probability (.6). This indicates that
the evidentiary bearing of the observed data serves to slightly increase our belief that
the examinee is proficient (see Exercise 7.1). Note that the posterior probability that the
examinee is not proficient is .32 = 1 − .68, which reflects the general relationship ensured
by Bayes theorem that our posterior belief is still expressed as a probability distribu-
tion; in this case, implying that p(θ_i = Not Proficient | x_i1 = Correct, x_i2 = Incorrect) = 1 −
p(θ_i = Proficient | x_i1 = Correct, x_i2 = Incorrect).
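The arithmetic of this update is easy to check directly. A small sketch, using the prior p(θ_i = Proficient) = .6 and the conditional probabilities in Table 7.1:

```python
# Verify the worked posterior computation for (item 1 correct, item 2 incorrect).
prior = {"Proficient": 0.6, "Not Proficient": 0.4}
p_correct = {"Proficient": (0.70, 0.65), "Not Proficient": (0.20, 0.15)}  # Table 7.1

def likelihood(c):
    # P(item 1 correct) * P(item 2 incorrect), conditional on class c
    p1, p2 = p_correct[c]
    return p1 * (1 - p2)

unnormalized = {c: likelihood(c) * prior[c] for c in prior}   # .147 and .068
total = sum(unnormalized.values())                            # .215
posterior = {c: u / total for c, u in unnormalized.items()}
print(round(posterior["Proficient"], 2),
      round(posterior["Not Proficient"], 2))  # 0.68 0.32
```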
When it is possible to map the salient elements of an inferential problem into the prob-
ability framework, powerful tools become available to combine explicitly the evidence
that various probans [facts offering proof] convey about probanda [facts to be proved],
as to both weight and direction of probative force . . . . A properly-structured statistical
model embodies the salient qualitative patterns in the application at hand, and spells
out, within that framework, the relationship between conjectures and evidence. It over-
lays a substantive model for the situation with a model for our knowledge of the situ-
ation, so that we may characterize and communicate what we come to believe, as to
both content and conviction, and why we believe it, as to our assumptions, our con-
jectures, our evidence, and the structure of our reasoning.
Operating in a probability model, Bayes theorem is the key tool in probability calculus for
reasoning from examinee performances, captured by observable variables, to their capa-
bilities more broadly conceived, as captured by the latent variable.
FIGURE 7.3
Directed acyclic graph for the canonical psychometric model including the measurement model parameters.

FIGURE 7.4
The all-unknowns directed acyclic graph for the canonical psychometric model including the hyperparameters.
being a term for the θs popular in item response theory traditions, or factor scores, a term
popular in factor analysis traditions.
The second process, calibration, refers to arriving at a representation for the (other) param-
eters of the model: principally, the measurement model parameters (the ω_j s), and possibly
the hyperparameters (θ_P or ω_P) if they are unknown. In the Bayesian analysis described in
Section 7.3.4, the representations come in the form of a posterior distribution. This could
of course be used to construct other representations, such as using a measure of central
tendency as a point representation. In frequentist analyses, the representations are point
estimates, say, from maximum likelihood (ML) estimation. Calibration is also commonly
referred to as estimating the model parameters, particularly in traditions that do not refer to
θ as a person parameter. To avoid possible confusion over whether estimating the model
parameters includes the θs, we avoid this terminology and refer to the distinct processes
of scoring and calibration. Further, because the psychometric models we discuss start with
observables that have already been identified, the focus here is on test scoring.
A few paradigmatic situations in operational assessment are as follows:
Scoring only: If we have values for the measurement model parameters, say, from
having set them in advance, or from some previous calibration, or from divine
revelation, we can use them to conduct scoring for any examinee. The posterior
analysis in the subtraction proficiency example in Section 7.3.1 is an instance of
scoring.
Calibration only: Conversely, if we have values for the examinees' latent vari-
ables, we can use them to conduct calibration for the tasks (observables),
obtaining representations of the measurement model parameters and possibly
hyperparameters. In the subtraction proficiency example, if we knew which
examinees were proficient and which were not, we could estimate the condi-
tional probabilities of correct response as the proportions of each group that
correctly answer the items. A similar situation arises in operational assessment
when some tasks have been previously calibrated, but others have not. Here, the
previously calibrated tasks are used to conduct examinee scoring, and then the
resulting values for the examinees' latent variables are used to conduct calibra-
tion for the new tasks.
Calibration and scoring: If we do not have values for either the examinees' latent
variables or the measurement model parameters, we need to conduct inference
for both.
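The "calibration only" case for the subtraction example can be sketched with a handful of hypothetical response records (invented here purely for illustration): with class membership known, each conditional probability is estimated as a within-group proportion.

```python
# Illustrative sketch of calibration with known latent classes: estimate
# P(correct | class) as within-group proportions. The records are hypothetical.
data = [
    # (proficient?, item 1 correct?, item 2 correct?)
    (True, 1, 1), (True, 1, 0), (True, 1, 1), (True, 0, 1), (True, 1, 0),
    (False, 0, 0), (False, 1, 0), (False, 0, 0), (False, 0, 1), (False, 0, 0),
]

estimates = {}
for group, label in ((True, "Proficient"), (False, "Not Proficient")):
    rows = [r for r in data if r[0] == group]
    estimates[label] = (sum(r[1] for r in rows) / len(rows),   # item 1
                        sum(r[2] for r in rows) / len(rows))   # item 2
    print(label, estimates[label])
```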
The first two scenarios may be viewed as special cases of the last scenario, which is typi-
cally what is meant when analysts talk about fitting psychometric models, and is the most
complicated of the three from a statistical perspective.
Equation (7.6) provides an answer to this scenario from a Bayesian perspective, yielding
a joint posterior distribution for all the unknowns. However, that is not how things are
always handled in operational assessment. More commonly, analysts employ a two-stage
strategy in which calibration is conducted first (e.g., from an initial sample of examinees)
followed by scoring (e.g., for the examinees from that initial sample, or for a large number
of subsequent examinees). We begin our development working from an ML framework,
and mention Bayesian variations along the way.
In the first stage, the joint distribution of the observable and latent variables is con-
structed, conditional on the (fixed in the frequentist framework) measurement model
parameters and hyperparameters that govern the distribution of the latent variables:
p(x, θ | ω, θ_P) = p(x | θ, ω) p(θ | θ_P)
               = ∏_{i=1}^{n} p(x_i | θ_i, ω) p(θ_i | θ_P)  (7.7)
               = ∏_{i=1}^{n} ∏_{j=1}^{J} p(x_ij | θ_i, ω_j) p(θ_i | θ_P).
To conduct calibration, the marginal distribution of the observables is obtained by inte-
grating the latent variables out
p(x | ω, θ_P) = ∫ p(x, θ | ω, θ_P) dθ
            = ∫ p(x | θ, ω) p(θ | θ_P) dθ
            = ∏_{i=1}^{n} ∫ p(x_i | θ_i, ω) p(θ_i | θ_P) dθ_i  (7.8)
            = ∏_{i=1}^{n} ∫ [∏_{j=1}^{J} p(x_ij | θ_i, ω_j)] p(θ_i | θ_P) dθ_i.
Viewed as a function of the model parameters, (7.8) is a marginal likelihood function, which
can be maximized to yield estimates for ω and θ_P. These estimates are referred to as ML
estimates in factor analysis and latent class analysis, and marginal maximum likelihood
(MML) estimates in item response theory (Bock & Aitkin, 1981).*
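For a discrete latent variable the integral in (7.8) becomes a sum over classes, and the product of these per-examinee marginal probabilities, viewed as a function of the parameters, is the function MML maximizes. A sketch of the per-examinee marginal probability, reusing the two-class subtraction numbers (Table 7.1) as stand-in parameter values:

```python
# Illustrative sketch: marginal probability of a response pattern for a
# discrete latent variable -- the sum-over-classes version of (7.8).
def marginal_prob(x, cond_p, prior):
    """sum_c [ prod_j p(x_j | theta = c) ] * p(theta = c)."""
    total = 0.0
    for c, probs in cond_p.items():
        lik = prior[c]
        for x_j, p_j in zip(x, probs):
            lik *= p_j if x_j == 1 else (1 - p_j)
        total += lik
    return total

cond_p = {"Proficient": (0.70, 0.65), "Not Proficient": (0.20, 0.15)}
prior = {"Proficient": 0.6, "Not Proficient": 0.4}

# (correct, incorrect): matches the worked example's normalizing constant
print(round(marginal_prob((1, 0), cond_p, prior), 3))  # 0.215
```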
One Bayesian variation on this theme includes the use of prior distributions for some or
all of the measurement model parameters; maximizing the resulting function yields Bayes
modal estimates (Mislevy, 1986). Another Bayesian variation conducts the fully Bayesian
analysis in (7.6), obtains the marginal posterior for each element of ω and θ_P, and then uses
a measure of central tendency (e.g., posterior mean) as a point summary or estimate. The
key point for our purposes is that this first stage yields a set of point estimates for ω and
θ_P, be they ML estimates or some summary of the posterior distribution. Let (ω̂, θ̂_P) denote
these values from the first stage.
The second stage proceeds by conducting scoring using (ω̂, θ̂_P) as the values for ω and θ_P.
An ML estimate of θ_i for each examinee may be obtained by defining a likelihood function for θ_i by
viewing
* The terminological differences between psychometric modeling traditions are no doubt due to several rea-
sons (e.g., historical development, the stronger emphasis on estimating values for examinee latent variables in
many applications of item response theory, where they are sometimes referred to as person parameters), and
we leave such pursuits to professional etymologists. At present we merely wish to highlight that despite
the differences in terminology, there are core structures common to many paradigms, and suggest that the
different terminology may be related to the more pervasive use of Bayesian approaches in some traditions as
opposed to others.
p(x_i | θ_i) = ∏_{j=1}^{J} p(x_ij | θ_i, ω = ω̂)  (7.9)

as a function of θ_i. A Bayesian variation instead carries out scoring via the posterior distribution, plugging the first-stage values into the prior:

p(θ_i | x_i) ∝ ∏_{j=1}^{J} p(x_ij | θ_i, ω = ω̂) p(θ_i | θ_P = θ̂_P),  (7.10)
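For the discrete-class running example, both flavors of second-stage scoring are short computations. A sketch, with the example's values standing in for the first-stage point estimates of the measurement model parameters and hyperparameters:

```python
# Illustrative sketch of two-stage scoring with first-stage point estimates
# treated as known. Values reuse the running subtraction example.
cond_p_hat = {"Proficient": (0.70, 0.65), "Not Proficient": (0.20, 0.15)}
prior_hat = {"Proficient": 0.6, "Not Proficient": 0.4}

def class_likelihood(x, c):
    # Plug-in likelihood, as in (7.9), for response pattern x under class c
    lik = 1.0
    for x_j, p_j in zip(x, cond_p_hat[c]):
        lik *= p_j if x_j == 1 else (1 - p_j)
    return lik

x = (1, 0)  # item 1 correct, item 2 incorrect

# ML scoring: maximize the plug-in likelihood over classes
mle = max(cond_p_hat, key=lambda c: class_likelihood(x, c))

# Bayesian scoring, as in (7.10): fold in the plug-in prior and normalize
unnorm = {c: class_likelihood(x, c) * prior_hat[c] for c in prior_hat}
total = sum(unnorm.values())
posterior = {c: u / total for c, u in unnorm.items()}
print(mle, round(posterior["Proficient"], 2))  # Proficient 0.68
```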
we will examine a simple scenario in the context of latent class models where the differ-
ences are small, and the consequences are not dire. In other scenarios, the differences and
consequences may be striking. Tsutakawa and Soltys (1988) and Tsutakawa and Johnson
(1990) investigated paradigmatic applications of item response theory models for scor-
ing, comparing the use of an approximation to the full posterior for measurement model
parameters to the use of point estimates from ML and partially Bayes procedures. They
found that the procedures that treated point estimates of measurement model parameters
as known had substantially smaller standard errors and correspondingly narrower inter-
val estimates.
In test construction, ignoring estimation error in measurement model parameters can
lead to capitalization on chance that causes items with higher magnitudes of errors to be
selected in constructing fixed and adaptive tests (Hambleton & Jones, 1994; Hambleton,
Jones, & Rogers, 1993; Patton, Cheng, Yuan, & Diao, 2013; van der Linden & Glas, 2000).
This can lead in turn to overestimates of information and precision of estimation during
scoring. Johnson and Jenkins (2005) investigated a complicated scenario in the context of
the National Assessment of Educational Progress, comparing a fully Bayesian specifica-
tion to typical operational procedures in which the measurement model parameters are
estimated at one stage, and are then treated as known with certainty in later stages that
estimate parameters of subpopulations of examinees (Mazzeo, Lazer, & Zieky, 2006; von
Davier, Sinharay, Oranje, & Beaton, 2007). Using simulated and real data, they concluded
that the procedures that treated point estimates from an earlier stage as known system-
atically underestimated posterior uncertainty for parameters estimated at a later stage. In
addition, the fully Bayesian analysis provided more stable estimates of the sampling vari-
ability as compared to the typical approach.
data are available, have complex relationships with the inferential target, and are possibly
in conflict with one another.
If measurement model parameters are unknown, our probability model expands to
accommodate them, specifying prior distributions for them, again often capitalizing on
exchangeability assumptions. They too become subject to posterior inference, now jointly
with examinees' latent variables in the full Bayesian model. Conventional practices in vari-
ous psychometric modeling paradigms vary in the extent to which Bayesian approaches
are employed. Commonly, they depart from the fully Bayesian perspective by estimating
parameters in stages, often treating point estimates from earlier stages as known in later
stages.
The model in (7.6) and Figure 7.4 serves as a cornerstone of Bayesian psychometric
modeling. It is well targeted to what we may call the standard assessment paradigm
(Mislevy, Behrens, DiCerbo & Levy, 2012), which is driven in part by twentieth century
conceptions of psychology (Mislevy, 2006; Rupp & Mislevy, 2007) and technological
capabilities (Cohen & Wollack, 2006; DiCerbo & Behrens, 2012; Dubois, 1970). Here, data
from each examinee are fairly sparse, typically in the form of a single attempt on tasks
that were chosen beforehand. Feedback does not occur during the assessment, so learn-
ing during the assessment is assumed to be negligible, and performance can then be
interpreted as information regarding an examinees static level of proficiency conceived
in trait or behaviorist psychology.
Much of the material in the following chapters describes how this model and extensions
of it play out in a variety of psychometric traditions. Along the way, we will also mention
the practices that have emerged as the conventional approaches in these psychometric
modeling paradigms. Provocatively, though many of these paradigms share the structure
illustrated in Figure 7.4, we will see that the standard conventional practices in each tra-
dition vary considerably in the degree to which they involve Bayesian approaches when
conducting scoring and calibration.
Not all psychometric models can be framed as an instance of this model, but many can,
and it is a useful starting point for building more complicated models that depart from
the standard assessment paradigm, particularly in ways that harness shifts in psychol-
ogy (Mislevy, 2006, 2008; Rupp & Mislevy, 2007) and technological capabilities to collect
and process data that vary in type, quality, and quantity (Behrens & DiCerbo, 2013, 2014;
Cohen & Wollack, 2006; Drasgow, Luecht, & Bennett, 2006). In the coming chapters, we
discuss several such examples, including: seeking inferences about multiple aspects of pro-
ficiency; acknowledging and modeling the possibility of learning during the assessment
(which may be its very intent); leveraging collateral information about examinees and
tasks and formulating exchangeability conditional on relevant covariates; and adding lay-
ers to the hierarchical specification by specifying θ_P or ω_P as dependent on other unknown
parameters, thereby borrowing strength from the collective of examinees or tasks for esti-
mating the parameters for individual members.
In our tour, we will use notation specific to each modeling paradigm. Table 7.2 lists some
key elements of the notation for the various modeling paradigms, including elements not
present in the canonical psychometric model discussed here. What is listed in the table is
not broad enough to cover all of the variations and extensions of the basic model we will
cover. It is intended to serve as a reference and illuminate how the basic model developed
in this chapter will reappear across the modeling paradigms, list some of the directions for
TABLE 7.2
Notation for the Psychometric Models in This Part of the Book

                                          Psychometric Modeling Paradigm
                                          General     Classical Test   Factor      Item Response   Latent Class Analysis,
                                                      Theory           Analysis    Theory          Bayesian Networks, Diag.
Parameter                                 (Chap. 7)   (Chap. 8)        (Chap. 9)   (Chap. 11)      Classification (Chap. 13–14)
Examinee                                  θ           T                            θ               θ
Hyperparameters for examinees             θ_P         μ_T, σ_T²                    μ_θ, σ_θ²
  (first level)
Hyperparameters for examinees
  (second level)                                      μ, σ²
Measurement model                         ω           σ_E²                         d (or b), a, c
Hyperparameters for observables           ω_P                                      μ_d, σ_d², μ_a, σ_a²
  (first level)
Hyperparameters for observables
  (second level)
extensions to come, and aid readers familiar with some but not all of these traditions and
notations in seeing the connections between them.
Exercises
7.1 Reconsider the subtraction proficiency example, where the prior probability that a
student is proficient is p(θ_i = Proficient) = .6 and the conditional probabilities of cor-
rect responses to the two items are given in Table 7.1. In Section 7.3.1, we obtained
the posterior distribution for proficiency for a student who correctly answered the
first question and incorrectly answered the second question:
a. Explain how this posterior distribution can be interpreted from the Bayes-as-
using-data-to-update-prior perspective.
7.3 a. Suppose we have an item with the following conditional distribution of
response:
Response to Item
Proficiency Correct Incorrect
Proficient .90 .10
Not Proficient .20 .80
and we observe a student responds correctly to the item. What is the posterior
for his proficiency?
b. Suppose we have an item with the following conditional distribution of
response:
Response to Item
Proficiency Correct Incorrect
Proficient .90 .10
Not Proficient .90 .10
What is the posterior distribution for the examinee's proficiency given a cor-
rect response? Given an incorrect response? (Hint: you can answer these with-
out doing any calculations.)
c. Compare the results from (a), (b), and Exercise 7.2(a). Why do the posteriors
differ? What does that indicate about the discrimination of the items?
7.4 In Section 7.4, we discussed how it is common to first conduct calibration and
then, using the results of calibration, scoring for both the examinees employed in
the calibration analysis and other examinees. Explain how the use of the results
from a calibration for scoring other examinees (i.e., not those used in the calibra-
tion) may be seen as relying on notions of exchangeability.
8
Classical Test Theory
We begin our tour of psychometric modeling paradigms with classical test theory (CTT).
Textbook treatments and overviews of CTT from a conventional perspective can be found
in Crocker and Algina (1986), Lewis (2007), Lord and Novick (1968), and McDonald (1999).
The CTT model has been conceived of in several distinct ways, differing in terms of whether
an author develops the model from single or multiple observables, and from one of sev-
eral different distributional notions of error (cf. Bollen, 1989; Haertel, 2006; Lewis, 2007;
Lord & Novick, 1968; McDonald, 1999). What follows is a cursory treatment of the model
focusing on consensus notions. Readers are referred to the above-cited works for consider-
ably more details on CTT, and for constructing assessments that lead to the scores that
our look at Bayesian modeling for CTT takes as a starting point. We begin our develop-
ment of CTT in Section 8.1 by considering a single observable (test) assuming that the
measurement model parameters and the hyperparameters are known. In Section 8.2, we
generalize to the case where there are multiple observables (tests), still assuming that the
measurement model parameters and hyperparameters are known. In Section 8.3, we treat
the case where the measurement model parameters and hyperparameters are unknown.
Section 8.4 concludes this chapter with a summary and bibliographic note.
The CTT model specifies xi as an additive combination of two components that may vary
over examinees,
xi = Ti + Ei , (8.1)
where T_i is a true score (for examinee i) with mean μ_T and variance σ_T², and E_i is an error
(for examinee i) with mean 0 and variance σ_E². Errors are uncorrelated with true scores in
the population,

σ_TE = 0,  (8.2)

which implies that the mean of the observables equals the mean of the true scores,

μ_x = μ_T,  (8.3)

and that the variance of the observables decomposes as

σ_x² = σ_T² + σ_E² + 2σ_TE
     = σ_T² + σ_E².  (8.4)

Reliability is the proportion of observable variance that is true-score variance,

ρ = σ_T² / σ_x² = σ_T² / (σ_T² + σ_E²),  (8.5)

which is also the squared correlation between observable and true scores, ρ²_xT.
True scores for examinees may be estimated via Kelley's formula, which dates to at least
Kelley (1923, p. 214). In the current notation,

T_i′ = ρ x_i + (1 − ρ) x̄
    = x̄ + ρ (x_i − x̄),  (8.6)

where x̄ is the mean of the observed scores and T_i′ is the estimated true score for examinee i.
The use of the prime notation in T_i′ to stand for an estimate is derived from noting that
the formula in (8.6) may be framed as the regression of true scores on observed scores (e.g.,
Crocker & Algina, 1986; Haertel, 2006; Lewis, 2007; Lord & Novick, 1968; the same idea
appears in hierarchical modeling as well, as in Raudenbush, 1988). The second formulation
on the right-hand side in (8.6) yields an interpretation of Kelley's formula as one in which
we get an estimate for an examinee's true score by starting with the mean, and then mov-
ing away from the mean in the direction of their observed score. We do not move all the
way; rather, the amount we move is in proportion to the reliability.
Importantly, the last interpretation reveals that Kelley's formula contradicts the
intuitive notion that we should use the examinee's observed score as the estimate of
the true score (i.e., T′_i = x_i). The reasoning for the latter approach is that x_i is obviously
related to the true score, only differing by a random error component, and that, after
all, x_i is all we have for the examinee. This line of reasoning is quite appealing on the
surface, so it is instructive to consider how Kelley's formula departs from this, and what
it reveals about assessment. Note that Kelley's formula will indeed yield the observed
score as the estimate of the true score when ρ = 1. At the other end of the spectrum,
when ρ = 0, Kelley's formula will yield μ_x as the estimate of the true score.* In this case,
Kelley's formula instructs us to ignore the actual observed score for the examinee. The
reason is that the variance of x is all due to error variance. The test is essentially worthless
as an inferential tool to differentiate examinees; we would do just as well if we did
not administer the test at all. Kelley's formula therefore instructs us to ignore the examinee's
observed score for this purpose.†
Thankfully, in practice it is highly unlikely that ρ = 0.‡ If ρ falls between the extremes
of 0 and 1, the estimated true score will fall between x_i and μ_x. To understand what this
captures, we can hardly do better than Kelley himself, who remarked (1947, p. 409):

    This is an interesting equation in that it expresses the estimate of true ability as the
    weighted sum of two separate estimates, one based upon the individual's observed
    score, X1 [x_i in the current notation], and the other based upon the mean of the group
    to which he belongs, M1 [μ_x in the current notation]. If the test is highly reliable, much
    weight is given to the test score and little to the group mean, and vice versa.
Finally, the variability associated with estimation of the true score is captured by the
standard deviation of the errors of estimation of the true score,

    σ_{T|x} = σ_T √(1 − ρ),                                             (8.7)

known as the standard error of the true score (Guilford, 1936) or the standard error of
estimation (Lord & Novick, 1968). Note that this refers to the variability of the true scores
given the observed scores, and differs from the standard error of measurement, which refers
to the variability of observed scores given true scores (Dudek, 1979).
* In the rest of this section and Section 8.2, we assume that μ_x is known. If it is unknown, a conventional approach
estimates μ_x as the mean of the test scores from a sample of examinees. We treat the situation of unknown μ_x
in Section 8.3.
† In this case, we might have reservations about estimating true scores at all, because our best estimate for
everyone is the population mean. If it comes to pass that ρ = 0, the next step is not really to estimate true scores
to facilitate interpretations and decisions, but rather to revisit and revise the assessment process.
‡ But it is about as likely as ρ = 1. If ρ were indeed 1, that would indicate that there was no measurement error,
obviating the need for much of psychometrics. In practice, many achievement tests have reliability estimates
that exceed .8.
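Kelley's formula (8.6) and the standard error of estimation (8.7) are easy to sketch numerically. The following is our illustration, not the book's: the function names are ours, and the values (μ_x = 80, σ_T = 6, σ_E = 4) anticipate the example in Section 8.1.3.

```python
# Kelley's formula (8.6) and the standard error of estimation (8.7).
# A minimal sketch; function names are ours, values anticipate Section 8.1.3.

def kelley_estimate(x_i, mu_x, rho):
    """Move from the group mean toward x_i in proportion to the reliability rho."""
    return mu_x + rho * (x_i - mu_x)

def se_estimation(sigma_T, rho):
    """Standard error of estimation, sigma_T * sqrt(1 - rho)."""
    return sigma_T * (1 - rho) ** 0.5

sigma_T, sigma_E = 6.0, 4.0
rho = sigma_T**2 / (sigma_T**2 + sigma_E**2)   # reliability, (8.5)

print(round(kelley_estimate(70, 80, rho), 2))  # 73.08
print(round(se_estimation(sigma_T, rho), 2))   # 3.33
```

When ρ = 1 the estimate collapses to the observed score, and when ρ = 0 it collapses to the group mean, mirroring the discussion above.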
The first term on the right-hand side of (8.8) is the conditional distribution of the data given
the true scores and the error variance. To structure this distribution, we assume that, given
the examinee's true score, the observed score from the examinee is conditionally independent
of all other observed scores. This assumption allows for the factorization of the
conditional distribution of the data as

    p(x | T, σ²_E) = ∏_{i=1}^n p(x_i | T_i, σ²_E).                      (8.9)
CTT, as formulated so far, defines the model in terms of the first two moments. In our
development of a Bayesian analysis, we will specify the full distributions.* For each examinee,
we now model the observed score as being normally distributed around the examinee's
true score,

    x_i | T_i, σ²_E ~ N(T_i, σ²_E).                                     (8.10)
The second term on the right-hand side of (8.8) is the prior distribution for the true scores.
An assumption of exchangeability regarding the examinees implies the use of a common
prior distribution for each, supporting the factorization of the joint prior of all n examinees'
true scores as

    p(T | μ_T, σ²_T) = ∏_{i=1}^n p(T_i | μ_T, σ²_T).                    (8.11)

For each examinee, we now also model the true score as being normally distributed,

    T_i | μ_T, σ²_T ~ N(μ_T, σ²_T).                                     (8.12)
Note that these normal distribution specifications for the observables in (8.10) and for
the true scores in (8.12) are common in conventional presentations of the model rooted in
frequentist paradigms (e.g. Bollen, 1989; Crocker & Algina, 1986) as well as in Bayesian
developments (Lindley, 1970; Novick, 1969; Novick, Jackson, & Thayer, 1971). Other
assumptions can be made, such as assuming normal distributions for errors, but estimat-
ing the distribution of true scores (Mislevy, 1984).
A DAG representation of the model is given in Figure 8.1. Each examinee's observable
is modeled with the associated true score and the error variance as parents, in accordance
with (8.10). Similarly, each true score is modeled with the hyperparameters μ_T and σ²_T as
parents, in accordance with (8.12).
Substituting into (8.8), the posterior distribution is

    p(T | x, σ²_E, μ_T, σ²_T) ∝ p(x | T, σ²_E) p(T | μ_T, σ²_T)
                              = ∏_{i=1}^n p(x_i | T_i, σ²_E) p(T_i | μ_T, σ²_T),   (8.13)
* An alternative development might involve just the first- and second-order moments, capitalizing on Bayesian
analyses of linear models without invoking a full distributional specification (Hartigan, 1969).
FIGURE 8.1
Directed acyclic graph for the classical test theory model for one observable variable, with known true score
mean, true score variance, and error variance. [The DAG shows μ_T and σ²_T as parents of each T_i, and T_i and
σ²_E as parents of each x_i, with T_i and x_i inside a plate over examinees i = 1, …, n.]
where

    x_i | T_i, σ²_E ~ N(T_i, σ²_E) for i = 1, …, n,

and

    T_i | μ_T, σ²_T ~ N(μ_T, σ²_T) for i = 1, …, n.
Note that the right-hand side of (8.13) has a form where all the elements are inside the
product operator. This reveals that the conditional independence assumptions imply
that the model for all the examinees can be viewed as a model for an individual exam-
inee, which is then repeated over all examinees. The DAG illustrates this in that the only
stochastic entities, namely the true scores and the observables, lie inside the plate for
examinees.
This implies that the model is one where, for each examinee, the data x_i are normally
distributed with unknown mean T_i and known variance σ²_E, and the unknown mean T_i is
normally distributed with known mean μ_T and variance σ²_T. This is exactly the situation
discussed in Section 4.1. The expression in (8.13) is just an instantiation of (4.5), as can be seen
by framing the notation of the former in terms of the latter: T_i plays the role of the unknown
mean, σ²_E the known data variance, μ_T the prior mean, and σ²_T the prior variance.
Thus, the posterior distribution for each T_i is an instantiation of (4.6) and is a normal
distribution,

    T_i | x_i ~ N(μ_{T_i|x_i}, σ²_{T_i|x_i}),                           (8.14)
where
    μ_{T_i|x_i} = [ (μ_T/σ²_T) + (x_i/σ²_E) ] / [ (1/σ²_T) + (1/σ²_E) ]   (8.15)

and

    σ²_{T_i|x_i} = 1 / [ (1/σ²_T) + (1/σ²_E) ].                         (8.16)
A little algebra reveals that the posterior mean can be expressed as

    μ_{T_i|x_i} = [σ²_T/(σ²_T + σ²_E)] x_i + [σ²_E/(σ²_T + σ²_E)] μ_T.  (8.17)

The coefficient for x_i is recognized as ρ. A little more algebra reveals that the coefficient for
μ_T is 1 − ρ. Thus, the posterior mean for an examinee's true score is

    μ_{T_i|x_i} = ρx_i + (1 − ρ)μ_T
                = μ_T + ρ(x_i − μ_T).                                   (8.18)
Recalling that the means of the true and observed scores are equal (see Equation 8.3),
we see from the right-hand side of (8.18) that, remarkably, the posterior mean is just an
expression of Kelley's formula as given in (8.6). In the Bayesian analysis, the posterior
mean is a precision-weighted combination of the mean of the data and the mean of the
prior. For inferences about an examinee's true score in CTT, the precision of the data is
captured by the reliability of the test, ρ, and the precision in the prior for the true score
is captured by 1 − ρ. Echoing similar statements made in Chapter 4, this reveals that to
the extent that the test is reliable (i.e., the variation in observed scores is due to variation
in true scores), the observed test score should drive the solution and our beliefs about
the examinee's true score. To the extent that the test is unreliable (i.e., the variation in
observed scores is due to something other than variation in true scores), the influence of the
data ought to be reduced. Thus, we may echo Kelley's sentiments quoted in Section 8.1.1
with some slight revisions in our phrasing: Kelley's formula, as it appears in (8.18), is
indeed an interesting equation in that it expresses a point summary of our posterior
beliefs about an examinee's true ability as the weighted sum of two separate sources of
information: one being the examinee's observed score, x_i, and the other being what was
believed about the true ability prior to observing any data, captured by μ_T. If the test is
highly reliable, much weight is given to the observed score and little is given to our prior
beliefs, and vice versa.
In Bayesian terms, Kelley's formula amounts to saying that the best estimator for an
examinee's true score is the posterior mean. For several reasons, it is important to note that
Kelley did not derive his formula in Bayesian terms, but rather as the regression of true
score on observed score (Kelley, 1923, 1947). What's more, it can be shown that the posterior
standard deviation,

    σ_{T_i|x_i} = √(σ²_{T_i|x_i}) = √( 1 / [ (1/σ²_T) + (1/σ²_E) ] ),   (8.19)

is equal to the standard error of the estimate of the true score in (8.7), which in conventional
approaches is framed as the unexplained variation of the regression of true scores
on observed scores.
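The equality of (8.19) and (8.7) is easy to verify numerically; here is a small sketch of ours with illustrative values (σ²_T = 36, σ²_E = 16, matching the example that follows).

```python
# Checking the identity behind (8.19): the posterior variance
# 1/(1/sigma_T^2 + 1/sigma_E^2) equals sigma_T^2 * (1 - rho), the squared
# standard error of estimation from (8.7). Illustrative values, our sketch.

var_T, var_E = 36.0, 16.0
rho = var_T / (var_T + var_E)                  # reliability, (8.5)
post_var = 1.0 / (1.0 / var_T + 1.0 / var_E)   # posterior variance, (8.16)
print(round(post_var, 4), round(var_T * (1 - rho), 4))  # both 11.0769
```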
The conventional framing of Kelley's formula as the regression of true scores on observed
scores, and the standard error of the estimate of true scores as the associated variation from
that regression (Haertel, 2006; Lord & Novick, 1968), brings to light the alignment between
the goals of assessment and the Bayesian approach to inference. To begin, we can recognize
that the formulation of CTT in (8.1) has the appearance of regressing the observables x
on the true scores T. The flow of the model here, and in the associated DAG in Figure 8.1,
is from the true score to the observable, supporting deductive reasoning. Kelley's formula
reverses this flow, regressing the true scores T on the observables x in support of inductive
reasoning. The model is constructed with x conditional on T, but once a value for x is
observed, the goal is to reason back from the observed score to the true score. Kelley's formula
facilitates this reasoning partially, by regressing T on x, giving the expectation E(T | x).
The standard error of the estimate of the true scores complements this partial reasoning,
providing a first inkling of variation. A Bayesian perspective makes explicit this reversal of
flow and reasoning back, recognizes Kelley's formula as the posterior mean and the standard
error of the true scores as the posterior standard deviation, and goes beyond these to
give the full posterior distribution in (8.14).
Confusion over the appropriate way to reason back from observed scores to true
scores has been with us almost as long as CTT itself (Dudek, 1979) and does not appear
to be disappearing, at least in some disciplines (McManus, 2012). Generally, a Bayesian
approach to inference constructs a joint distribution for all entities and then conditions
on what is known to yield a conditional distribution for all unknown entities. As such
it offers a framework well suited to reasoning in either direction. If true scores are
known, we have the conditional distribution for observables in (8.10). If observed scores
are known, we have the conditional distribution for true scores in (8.14). No entity has
any special status that prevents us from formulating conditional distributions for it.
Our construction of the joint distribution emerges from structuring the observables
as conditional on true scores. In the current situation with one observable, this choice
does not seem of great importance. When we expand our scope to include multiple
observables, we will model the multiple observables for each examinee as conditionally
(locally) independent given their true score, in line with the implications of exchange-
ability and the utility in organizing our beliefs around conditional independence speci-
fications. Bayes theorem is then the machinery for taking this construction of the joint
distribution and turning it into a conditional distribution for unknown true scores
given observed scores.
8.1.3 Example
Suppose we have a test scored from 0 to 100.* We are interested in the posterior distributions
of true scores for individuals, where it is known that the mean of the true scores is μ_T = 80,
the standard deviation of the true scores is σ_T = 6, and the standard deviation of the error
scores is σ_E = 4. The reliability of the test is calculated via (8.5); rounding to two decimal
places, ρ ≈ .69.
Applying (8.17), the posterior mean for each examinee's true score is

    μ_{T_i|x_i} = [σ²_T/(σ²_T + σ²_E)] x_i + [σ²_E/(σ²_T + σ²_E)] μ_T
                = [36/(36 + 16)] x_i + [16/(36 + 16)] 80
                ≈ .69x_i + 24.62.

Equivalently, applying (8.18),

    μ_{T_i|x_i} = μ_T + [σ²_T/(σ²_T + σ²_E)] (x_i − μ_T)
                = 80 + [36/(36 + 16)] (x_i − 80)
                ≈ 80 + .69(x_i − 80).

The latter expression affords the interpretation that the posterior mean for an examinee's
true score can be viewed as departing from the mean for all examinees (80) in the direction
of the examinee's score (x_i), in an amount proportional to the reliability (.69).
Again, we note that each examinee's posterior mean is exactly what is obtained by the
conventional application of Kelley's formula. The posterior variance is constant for all
examinees,

    σ²_{T_i|x_i} = 1 / [ (1/σ²_T) + (1/σ²_E) ] = 1 / [ (1/36) + (1/16) ] ≈ 11.08.
To illustrate, we compute the posterior distribution for examinees with scores 70, 80, and 96,
rounding the values obtained to two decimal places. For x_i = 70, p(T_i | x_i = 70) = N(73.08, 11.08);
for x_i = 80, p(T_i | x_i = 80) = N(80, 11.08); for x_i = 96, p(T_i | x_i = 96) = N(91.08, 11.08). Figure 8.2
depicts these results.
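These posteriors follow directly from (8.15) and (8.16); a minimal sketch of ours, using the example's known values:

```python
# Posterior mean and variance for a single observable, via the
# precision-weighted forms in (8.15) and (8.16). Our sketch with the
# example's known values (mu_T = 80, sigma_T^2 = 36, sigma_E^2 = 16).

mu_T, var_T, var_E = 80.0, 36.0, 16.0

def posterior(x_i):
    prec = 1.0 / var_T + 1.0 / var_E            # posterior precision
    mean = (mu_T / var_T + x_i / var_E) / prec  # (8.15)
    return mean, 1.0 / prec                     # mean and variance (8.16)

for x in (70, 80, 96):
    m, v = posterior(x)
    print(x, round(m, 2), round(v, 2))          # 73.08, 80.0, 91.08; var 11.08
```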
For each examinee, the posterior is a synthesis of the information in the prior and the
information in the data. In CTT, the information in the data is imperfect, as reflected
* As noted above, tests usually contain multiple items, and the test score is the aggregation over the items, say,
by summing the scored responses for the individual items. Estimates of ρ are obtained as functions of variation
among items. But for the purpose of laying out a Bayesian treatment of CTT, the starting point is a single
observed (test) score for each examinee, however obtained, such that the CTT structure applies.
FIGURE 8.2
Plotting the components of a Bayesian analysis for selected examinees' true scores (x = 70, 80, and 96), with the
prior distribution (dotted), likelihood (dashed), and posterior distribution (solid).
FIGURE 8.3
Posterior densities (solid lines) for the true score for an N(80, 36) prior (dotted line) and an observed score of 96
(dashed line) as the reliability (ρ) changes (ρ = 0.05, 0.50, 0.80, 0.95).
by the reliability being less than 1. To further illustrate this point, consider the effect of
different values for the reliability. Figure 8.3 depicts analyses where μ_T = 80, σ_T = 6, and
x_i = 96. The curves depict different posterior distributions corresponding to different
values of the reliability of the test. When the reliability is near 0, the posterior distribution
is very similar to the prior distribution. As the reliability increases, the posterior
distribution moves closer to the observed test score of 96 and becomes more narrowly
distributed.
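The pattern in Figure 8.3 can be reproduced numerically. In this sketch of ours we hold σ²_T fixed at 36 and back out σ²_E = σ²_T(1 − ρ)/ρ for each reliability, which is one way (our assumption, not spelled out in the text) to vary ρ while keeping the prior fixed:

```python
# How the posterior for a true score responds to reliability, in Figure 8.3's
# setting (mu_T = 80, sigma_T^2 = 36, x_i = 96). For fixed sigma_T^2, each rho
# implies sigma_E^2 = sigma_T^2 * (1 - rho) / rho. Our illustrative sketch.

mu_T, var_T, x = 80.0, 36.0, 96.0
for rho in (0.05, 0.50, 0.80, 0.95):
    var_E = var_T * (1 - rho) / rho
    prec = 1 / var_T + 1 / var_E
    mean = (mu_T / var_T + x / var_E) / prec
    print(rho, round(mean, 1), round(1 / prec, 1))
```

As ρ rises, the posterior mean moves from near the prior mean (80.8 at ρ = .05) toward the observed score (95.2 at ρ = .95), and the posterior variance shrinks from 34.2 to 1.8, matching the figure.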
-------------------------------------------------------------------------
#########################################################################
# Model Syntax
#########################################################################
model{

#########################################################################
# Classical Test Theory
# With Known
# True Score Mean, True Score Variance
# Error Variance
#########################################################################

#########################################################################
# Known Parameters
#########################################################################
mu.T <- 80                      # Mean of the true scores
sigma.squared.T <- 36           # Variance of the true scores
sigma.squared.E <- 16           # Variance of the errors
tau.T <- 1/sigma.squared.T      # Precision of the true scores
tau.E <- 1/sigma.squared.E      # Precision of the errors

#########################################################################
# Model for True Scores and Observables
#########################################################################
for (i in 1:n){
  T[i] ~ dnorm(mu.T, tau.T)     # Distribution of true scores
  x[i] ~ dnorm(T[i], tau.E)     # Distribution of observables
}

} # closes the model statement

#########################################################################
# Data statement
#########################################################################
list(n=3, x=c(70, 80, 96))
-------------------------------------------------------------------------
The model was run with three examinees. Three chains were run from dispersed starting
values. Convergence diagnostics, including the history plots and the potential scale reduction
factor (Section 5.7.2), indicated convergence within a few iterations. This is unsurprising
given that the posterior is of a known form (see Equation 8.14) that is recognized
as such by WinBUGS. To be conservative, we discarded the first 100 iterations as burn-in;
1,000 additional iterations were then run in each chain, yielding 3,000 iterations for use in
summarizing the posterior distribution.
Figure 8.4 contains plots of the analytical marginal posterior distributions for the true
scores (solid line, akin to the solid lines in Figure 8.2) and empirical approximations from
WinBUGS (dashed line) for the three examinees. Table 8.1 gives, for each examinee, the
posterior mean and standard deviation from the analytical solution, and the estimated
posterior mean and standard deviation from the empirical approximation from WinBUGS.
The empirical approximations come quite close to the analytical solutions in terms of
shape, location, and scale.
FIGURE 8.4
Analytical (solid) and empirical (dashed) marginal posterior densities for true scores for three examinees
(x = 70, 80, and 96). The empirical density closely mirrors the analytical density, such that it is obscured by the
latter when plotted.
TABLE 8.1
Analytical Solutions and Empirical Approximations to the
Posterior Distributions for Three Examinees

         Analytical Solution      Empirical Solution from MCMC(a)
  x_i    Posterior   Posterior    Posterior   Posterior
         Mean        SD(b)        Mean        SD(b)
  70     73.08       3.33         73.07       3.28
  80     80.00       3.33         79.96       3.34
  96     91.08       3.33         91.13       3.36

(a) MCMC = Markov chain Monte Carlo.
(b) SD = Standard deviation.
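As a sanity check on the MCMC-versus-analytical comparison, one can draw directly from the analytical posteriors and summarize the draws; the summaries should match Table 8.1 up to simulation error. A sketch of ours (ordinary Monte Carlo, not the WinBUGS run reported above; the seed and draw count are arbitrary):

```python
# Drawing from the analytical posteriors N(mu_{Ti|xi}, 11.08) and summarizing,
# as a check that simulation-based summaries recover Table 8.1. Our sketch.
import random

random.seed(1)
mu_T, var_T, var_E = 80.0, 36.0, 16.0
prec = 1 / var_T + 1 / var_E
for x in (70, 80, 96):
    mean = (mu_T / var_T + x / var_E) / prec
    draws = [random.gauss(mean, (1 / prec) ** 0.5) for _ in range(3000)]
    est_mean = sum(draws) / len(draws)
    print(x, round(est_mean, 1))  # near 73.1, 80.0, 91.1
```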
For multiple observables (tests) j = 1, …, J, the CTT model specifies

    x_ij = T_i + E_ij,                                                  (8.20)

where E_ij is now the error for examinee i on observable j. The means of the errors for all
observables are 0, and the variances of the errors for the observables are equal, denoted
by σ²_E. Letting E_j denote the error scores for observable j, the correlation between the true
scores and the errors for observable j is

    ρ_{TE_j} = 0                                                        (8.21)

for all j. The correlation between errors for observable j and observable j′ is

    ρ_{E_j E_j′} = 0                                                    (8.22)

for j ≠ j′.
Letting μ_{x_j} and σ²_{x_j} denote the mean and variance of the scores for observable j, the CTT
model implies that, for j = 1, …, J,

    μ_{x_j} = μ_T                                                       (8.23)

and

    σ²_{x_j} = σ²_T + σ²_E.                                             (8.24)

As μ_{x_j} and σ²_{x_j} are the same for all j, we drop the second subscript and use μ_x and σ²_x to
denote the mean and variance for any observable.
Under these specifications, the reliability of any observable is given by (8.5). The reliability
of an equally weighted composite of the J observables, such as their sum or mean,
is given by the Spearman-Brown prophecy formula,
    ρ_c = Jρ / [(J − 1)ρ + 1] = Jσ²_T / (Jσ²_T + σ²_E),                 (8.25)
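A quick numeric check of (8.25) of ours, using the example's single-observable reliability (ρ = 36/52 ≈ .69) and J = 5; the function name is our own:

```python
# Spearman-Brown prophecy formula (8.25): reliability of an equally
# weighted composite of J parallel observables. Our sketch with the
# chapter's example values.

def spearman_brown(rho, J):
    return J * rho / ((J - 1) * rho + 1)

rho = 36.0 / (36.0 + 16.0)                  # single-observable reliability
print(round(spearman_brown(rho, 5), 2))     # 0.92
```

Both forms in (8.25) agree: 5(36)/(5(36) + 16) = 180/196 ≈ .92.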
Following Bayes theorem, the posterior distribution for the true scores is

    p(T | x, σ²_E, μ_T, σ²_T) ∝ p(x | T, σ²_E) p(T | μ_T, σ²_T),        (8.26)

where now x is the full collection of (n × J) observables. The second term on the right-hand
side of (8.26) is the prior distribution for examinees' true scores. As before, following assumptions
of exchangeability and normality, the prior distribution is given in (8.11) and (8.12).
The first term on the right-hand side of (8.26) is the conditional distribution of the data given
the true scores and the error variance. As before, we assume conditional independence between
the values of the observables from different examinees. This supports the factorization

    p(x | T, σ²_E) = ∏_{i=1}^n p(x_i | T_i, σ²_E),                      (8.27)

where x_i = (x_i1, …, x_iJ) is the collection of J observable variables for examinee i. Assuming
that, for each examinee, the observables are conditionally (locally) independent given their
true score allows for the specification of a common distribution for each observable,
FIGURE 8.5
Directed acyclic graph for the classical test theory model for multiple observable variables, with known true
score mean, true score variance, and error variance. [The DAG adds an inner plate over observables j = 1, …, J,
with each x_ij having parents T_i and σ²_E.]
    p(x_i | T_i, σ²_E) = ∏_{j=1}^J p(x_ij | T_i, σ²_E).                 (8.28)
Combining these two expressions, we have the factorization of the conditional distribution
of the data as

    p(x | T, σ²_E) = ∏_{i=1}^n ∏_{j=1}^J p(x_ij | T_i, σ²_E).           (8.29)
For each examinee, we model the observables as being normally distributed around the
examinee's true score,

    x_ij | T_i, σ²_E ~ N(T_i, σ²_E).                                    (8.30)

Substituting into (8.26), the posterior distribution is

    p(T | x, σ²_E, μ_T, σ²_T) ∝ ∏_{i=1}^n [ ∏_{j=1}^J p(x_ij | T_i, σ²_E) ] p(T_i | μ_T, σ²_T),   (8.31)

where

    x_ij | T_i, σ²_E ~ N(T_i, σ²_E) for i = 1, …, n and j = 1, …, J,

and

    T_i | μ_T, σ²_T ~ N(μ_T, σ²_T) for i = 1, …, n.
To recap, we begin with the data and build out by invoking exchangeability
assumptions to simplify the specification of the distribution of the data. But that introduces
new unknown parameters (here, true scores) that stand in need of distributional
specification. We accomplish this by again invoking exchangeability assumptions to
simplify the specification of the distribution of the true scores.
An equivalent formulation models the mean of each examinee's observables, x̄_i, as

    x̄_i | T_i, σ²_E ~ N(T_i, σ²_E/J).                                   (8.32)

The posterior distribution is then

    p(T | x̄, σ²_E, μ_T, σ²_T) ∝ ∏_{i=1}^n p(x̄_i | T_i, σ²_E) p(T_i | μ_T, σ²_T),   (8.33)
FIGURE 8.6
Directed acyclic graph for the classical test theory model for multiple observable variables formulated using the
mean of the observables, with known true score mean, true score variance, and error variance.
where

    x̄_i | T_i, σ²_E ~ N(T_i, σ²_E/J) for i = 1, …, n,

and

    T_i | μ_T, σ²_T ~ N(μ_T, σ²_T) for i = 1, …, n.
The posterior distribution for each true score is again normal,

    T_i | x̄_i ~ N(μ_{T_i|x̄_i}, σ²_{T_i|x̄_i}),                           (8.34)

where

    μ_{T_i|x̄_i} = [ (μ_T/σ²_T) + (Jx̄_i/σ²_E) ] / [ (1/σ²_T) + (J/σ²_E) ]   (8.35)

and

    σ²_{T_i|x̄_i} = 1 / [ (1/σ²_T) + (J/σ²_E) ].                         (8.36)
A little algebra reveals that the posterior mean can be expressed as

    μ_{T_i|x̄_i} = [Jσ²_T/(Jσ²_T + σ²_E)] x̄_i + [σ²_E/(Jσ²_T + σ²_E)] μ_T.   (8.37)

The coefficient for x̄_i is recognized as the reliability ρ_c. A little more algebra reveals that the
coefficient for μ_T is 1 − ρ_c. Thus, the posterior mean for an examinee's true score is

    μ_{T_i|x̄_i} = ρ_c x̄_i + (1 − ρ_c)μ_T
                 = μ_T + ρ_c(x̄_i − μ_T).                                (8.38)
The posterior distribution can also be expressed in terms of precisions,

    T_i | x̄_i ~ N(μ_{T_i|x̄_i}, 1/τ_{T_i|x̄_i}),                          (8.39)

where τ_T = 1/σ²_T is the precision of the true scores, τ_E = 1/σ²_E is the precision of the error
scores,

    μ_{T_i|x̄_i} = [Jτ_E/(τ_T + Jτ_E)] x̄_i + [τ_T/(τ_T + Jτ_E)] μ_T,     (8.40)

and

    τ_{T_i|x̄_i} = τ_T + Jτ_E                                            (8.41)

is the posterior precision. Equation (8.41) reveals how the precision in the data, and therefore
the total precision in the posterior, increases as the number of observables J increases. This
reflects the principles of the Spearman-Brown prophecy formula in CTT and the notion of
Bayesian inference as the mechanism for the accumulation of evidence and the representation
of uncertainty as evidence accumulates. Equation (8.40) reveals that the posterior
mean is a precision-weighted average of the mean of the observables and the prior mean.
This lends the following interpretation to the reliability: namely, the proportion of the posterior
precision that is due to the precision in the observables. The complement of the reliability
is the proportion of the posterior precision due to the prior.
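The precision formulation in (8.40) and (8.41) can be sketched directly; here with the example's values (μ_T = 80, σ²_T = 36, σ²_E = 16, J = 5) and, for illustration, the first examinee's mean score from the upcoming Table 8.2 (76.6). This is our sketch, not the book's code.

```python
# Posterior precision (8.41) and precision-weighted posterior mean (8.40)
# for the mean of J observables. Our sketch with the example's values.

mu_T, tau_T, tau_E, J = 80.0, 1 / 36.0, 1 / 16.0, 5

def posterior_precision_form(xbar_i):
    tau_post = tau_T + J * tau_E                           # (8.41)
    mean = (J * tau_E * xbar_i + tau_T * mu_T) / tau_post  # (8.40)
    return mean, tau_post

m, tp = posterior_precision_form(76.6)   # first examinee's mean score
print(round(m, 2), round(1 / tp, 2))     # 76.88 2.94
```

The share of posterior precision contributed by the data, Jτ_E/(τ_T + Jτ_E), is exactly ρ_c ≈ .92 here, illustrating the interpretation of reliability just given.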
8.2.3 Example
We continue the example scenario, where we have a test scored from 0 to 100 and we are
interested in true scores for students where it is known that the mean of the true scores
is μ_T = 80, the standard deviation of the true scores is σ_T = 6, and the standard deviation
of the error scores for each observable is σ_E = 4. The reliability of each test is calculated
via (8.5), ρ ≈ .69. The reliability of an equally weighted composite is calculated via (8.25),
ρ_c ≈ .92. We consider a dataset of 10 examinees and 5 tests, given in Table 8.2.
Applying (8.37), the posterior mean for each examinee's true score is

    μ_{T_i|x̄_i} = [Jσ²_T/(Jσ²_T + σ²_E)] x̄_i + [σ²_E/(Jσ²_T + σ²_E)] μ_T
                 = [(5)36/((5)36 + 16)] x̄_i + [16/((5)36 + 16)] 80
                 ≈ .92x̄_i + 6.53.
Again, we note that each examinee's posterior mean is exactly what is obtained by the
conventional application of Kelley's formula. The posterior variance is constant for all
examinees,

    σ²_{T_i|x̄_i} = 1 / [ (1/σ²_T) + (J/σ²_E) ] = 1 / [ (1/36) + (5/16) ] ≈ 2.94.

For example, the posterior distribution for the first examinee's true score is
p(T_1 | x̄_1) = N(76.88, 2.94). Table 8.2 lists the posterior means and standard deviations for all
examinees under the columns headed by Posterior (Analytical).
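The analytical columns of Table 8.2 can be reproduced from the raw scores (which appear in the data statement below) via (8.35) and (8.36); a sketch of ours:

```python
# Reproducing the analytical posterior means and SDs in Table 8.2 from the
# raw scores (10 examinees x 5 tests, as in the data statement). Our sketch.

data = [
    [80, 77, 80, 73, 73], [83, 79, 78, 78, 77], [85, 77, 88, 81, 80],
    [76, 76, 76, 78, 67], [70, 69, 73, 71, 77], [87, 89, 92, 91, 87],
    [76, 75, 79, 80, 75], [86, 75, 80, 80, 82], [84, 79, 79, 77, 82],
    [96, 85, 91, 87, 90],
]
mu_T, var_T, var_E, J = 80.0, 36.0, 16.0, 5

prec = 1 / var_T + J / var_E                      # posterior precision
sd = (1 / prec) ** 0.5                            # common posterior SD
for row in data:
    xbar = sum(row) / J
    mean = (mu_T / var_T + J * xbar / var_E) / prec   # (8.35)
    print(round(mean, 2), round(sd, 2))
```

The first row gives the N(76.88, 2.94) posterior derived above (SD ≈ 1.71).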
TABLE 8.2
Dataset with Observed Scores from 10 Examinees to 5 Tests, and the Results from a
Bayesian Classical Test Theory Analysis. [Columns: scores on Tests 1-5 (as listed in the
data statement below), the test mean, the Posterior (Analytical) mean and SD, and the
Posterior (MCMC) mean and SD.]
-------------------------------------------------------------------------
#########################################################################
# Model Syntax
#########################################################################
model{

#########################################################################
# Classical Test Theory
# With Known
# True Score Mean, True Score Variance
# Error Variance
#########################################################################

#########################################################################
# Known Parameters
#########################################################################
mu.T <- 80                      # Mean of the true scores
sigma.squared.T <- 36           # Variance of the true scores
sigma.squared.E <- 16           # Variance of the errors
tau.T <- 1/sigma.squared.T      # Precision of the true scores
tau.E <- 1/sigma.squared.E      # Precision of the errors

#########################################################################
# Model for True Scores and Observables
#########################################################################
for (i in 1:n) {
  T[i] ~ dnorm(mu.T, tau.T)     # Distribution of true scores
  for(j in 1:J){
    x[i,j] ~ dnorm(T[i], tau.E) # Distribution of observables
  }
}

} # closes the model statement

#########################################################################
# Data statement
#########################################################################
list(n=10, J=5, x=structure(.Data= c(
80, 77, 80, 73, 73,
83, 79, 78, 78, 77,
85, 77, 88, 81, 80,
76, 76, 76, 78, 67,
70, 69, 73, 71, 77,
87, 89, 92, 91, 87,
76, 75, 79, 80, 75,
86, 75, 80, 80, 82,
84, 79, 79, 77, 82,
96, 85, 91, 87, 90), .Dim=c(10, 5))
)
-------------------------------------------------------------------------
-------------------------------------------------------------------------
The model was run with the data contained in Table 8.2. Three chains were run from
dispersed starting values. Convergence diagnostics, including the history plots and the
potential scale reduction factor (Section 5.7.2), indicated convergence within a few iterations.
This is unsurprising given that the posterior is of a known form (see Equation
8.39) that is recognized as such by WinBUGS. To be conservative, 100 iterations were
discarded as burn-in and 1,000 additional iterations were run in each chain, yielding 3,000
iterations for use in summarizing the posterior distribution. The marginal posterior distributions
for the true scores for the examinees were all approximately normal. The final two columns
in Table 8.2 report the estimated posterior mean and standard deviation for each
examinee from the MCMC estimation. It is clear that these closely approximate the analytically
derived values.
FIGURE 8.7
Directed acyclic graph for the classical test theory model for multiple observable variables, with unknown
hyperparameters: true score mean, true score variance, and error variance. [μ_T, σ²_T, and σ²_E now appear as
stochastic nodes, with their own hyperparameters (μ_{μT}, σ²_{μT}, ν_T, σ²_{T0}, ν_E, σ²_{E0}) as parents.]
The DAG for the model is given in Figure 8.7. In contrast to the DAGs for the previous
models, the mean and variance of the true scores and the error variance are depicted as
circles, reflecting that they are unknown. Accordingly, they are modeled with the higher-level
structure needed to specify their distributions. As stochastic entities, they will need to be
incorporated into the distributional specification.
Following Bayes theorem, the posterior distribution for the full collection of unknowns
given the full collection of the observed scores x is*

    p(T, μ_T, σ²_T, σ²_E | x) ∝ p(x | T, σ²_E) p(T | μ_T, σ²_T) p(μ_T, σ²_T, σ²_E).   (8.42)
The first two terms on the right-hand side of (8.42) are (a) the conditional distribution of
the data given the true scores and the error variance and (b) the prior distribution for true
scores given the mean and variance of the true scores. These terms are unchanged from
the situation in which μ_T, σ²_T, and σ²_E are assumed known, and we adopt the same distributional
specifications in (8.11), (8.12), (8.29), and (8.30).
The third term on the right-hand side of (8.42) is the prior distribution for μ_T, σ²_T, and σ²_E.
As the structure of the DAG in Figure 8.7 conveys, we proceed by assuming independence
in the prior distribution,

    p(μ_T, σ²_T, σ²_E) = p(μ_T) p(σ²_T) p(σ²_E).                        (8.43)

We specify a normal prior for the mean of the true scores,

    μ_T ~ N(μ_{μT}, σ²_{μT}),                                           (8.44)
and inverse-gamma priors for the variance components,

    σ²_T ~ Inv-Gamma(ν_T/2, ν_T σ²_{T0}/2)                              (8.45)

and

    σ²_E ~ Inv-Gamma(ν_E/2, ν_E σ²_{E0}/2).                             (8.46)

Substituting into (8.42), the posterior distribution is

    p(T, μ_T, σ²_T, σ²_E | x) ∝ ∏_{i=1}^n [ ∏_{j=1}^J p(x_ij | T_i, σ²_E) ] p(T_i | μ_T, σ²_T) p(μ_T) p(σ²_T) p(σ²_E),   (8.47)

where

    x_ij | T_i, σ²_E ~ N(T_i, σ²_E) for i = 1, …, n and j = 1, …, J,

    T_i | μ_T, σ²_T ~ N(μ_T, σ²_T) for i = 1, …, n,

    μ_T ~ N(μ_{μT}, σ²_{μT}),

    σ²_T ~ Inv-Gamma(ν_T/2, ν_T σ²_{T0}/2),

and

    σ²_E ~ Inv-Gamma(ν_E/2, ν_E σ²_{E0}/2).
* We suppress the role of specified hyperparameters in this notation; see Appendix A for a presentation that
formally includes the hyperparameters.
The full conditional for each true score is normal,

    T_i | μ_T, σ²_T, σ²_E, x_i ~ N(μ_{T_i|μ_T,σ²_T,σ²_E,x_i}, σ²_{T_i|μ_T,σ²_T,σ²_E,x_i}),   (8.48)

where

    μ_{T_i|μ_T,σ²_T,σ²_E,x_i} = [ (μ_T/σ²_T) + (Jx̄_i/σ²_E) ] / [ (1/σ²_T) + (J/σ²_E) ]

and

    σ²_{T_i|μ_T,σ²_T,σ²_E,x_i} = 1 / [ (1/σ²_T) + (J/σ²_E) ].
The full conditional for the mean of the true scores is

    μ_T | T, σ²_T ~ N(μ_{μT|T,σ²_T}, σ²_{μT|T,σ²_T}),                   (8.49)

where

    μ_{μT|T,σ²_T} = [ (μ_{μT}/σ²_{μT}) + (nT̄/σ²_T) ] / [ (1/σ²_{μT}) + (n/σ²_T) ],

T̄ is the mean of the true scores, and

    σ²_{μT|T,σ²_T} = 1 / [ (1/σ²_{μT}) + (n/σ²_T) ].
The full conditional for the variance of the true scores is

    σ²_T | T, μ_T ~ Inv-Gamma( (ν_T + n)/2 , (ν_T σ²_{T0} + SS(T))/2 ),   (8.50)

where

    SS(T) = ∑_{i=1}^n (T_i − μ_T)².
The full conditional for the error variance is

    σ²_E | T, x ~ Inv-Gamma( (ν_E + nJ)/2 , (ν_E σ²_{E0} + SS(E))/2 ),  (8.51)

where

    SS(E) = ∑_{i=1}^n ∑_{j=1}^J (x_ij − T_i)².
A Gibbs sampler can be constructed by iteratively drawing from these full conditionals,
using the just-drawn values for the conditioned parameters. We begin by setting t = 0 and
specifying initial values for the parameters: T_1^(0), …, T_n^(0), μ_T^(0), σ²_T^(0), σ²_E^(0), where the parenthetical
superscript of 0 indicates that this is the initial value. We then proceed with a Gibbs sampler,
repeating the following steps, generically written below for iteration (t + 1):

1. Sample the true scores. For each examinee i, sample a value for the true score from
   (8.48) using the current values for the remaining parameters,

       T_i^(t+1) | μ_T^(t), σ²_T^(t), σ²_E^(t), x_i ~ N( μ_{T_i^(t+1)|·}, σ²_{T_i^(t+1)|·} ),   (8.52)

   where (writing "·" for the conditioning values μ_T^(t), σ²_T^(t), σ²_E^(t), x_i)

       μ_{T_i^(t+1)|·} = [ (μ_T^(t)/σ²_T^(t)) + (Jx̄_i/σ²_E^(t)) ] / [ (1/σ²_T^(t)) + (J/σ²_E^(t)) ]

   and

       σ²_{T_i^(t+1)|·} = 1 / [ (1/σ²_T^(t)) + (J/σ²_E^(t)) ].

   Note that this uses values of the other parameters from the previous iteration (t).
2. Sample the parameters for the latent variable distribution. We conduct this by
sampling in a univariate fashion.
a. Sample a value for the mean of the true scores from (8.49) using the current
values for the remaining parameters,
μ_T^(t+1) | T^(t+1), σ²_T^(t) ~ N(μ_{μT^(t+1)|T^(t+1),σ²_T^(t)}, σ²_{μT^(t+1)|T^(t+1),σ²_T^(t)}), (8.53)

where

μ_{μT^(t+1)|T^(t+1),σ²_T^(t)} = [(μ_{μT}/σ²_{μT}) + (n T̄^(t+1)/σ²_T^(t))] / [(1/σ²_{μT}) + (n/σ²_T^(t))],

with T̄^(t+1) the mean of the just-sampled true scores,
and

σ²_{μT^(t+1)|T^(t+1),σ²_T^(t)} = 1 / [(1/σ²_{μT}) + (n/σ²_T^(t))].
Note that this includes the just-sampled values for the true scores (i.e., from
iteration t + 1) along with the values of the true score variance from the previ-
ous iteration (t).
b. Sample a value for the variance of the true scores from (8.50) using the current values for the remaining parameters,

σ²_T^(t+1) | T^(t+1), μ_T^(t+1) ~ Inv-Gamma( (ν_T + n)/2, (ν_T σ²_{T0} + SS(T^(t+1)))/2 ),

where SS(T^(t+1)) = Σ_{i=1}^{n} (T_i^(t+1) − μ_T^(t+1))². Note that this includes the just-sampled values for the true scores and the mean of the true scores (i.e., from iteration t + 1).
3. Sample the measurement model parameters. Sample a value for the error variance from (8.51) using the current values of the remaining parameters,

σ²_E^(t+1) | T^(t+1), x ~ Inv-Gamma( (ν_E + nJ)/2, (ν_E σ²_{E0} + SS(E^(t+1)))/2 ),

where

SS(E^(t+1)) = Σ_{i=1}^{n} Σ_{j=1}^{J} (x_ij − T_i^(t+1))².

Note that this includes the just-sampled values for the true scores (i.e., from iteration t + 1).
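To make the algorithm concrete, the three steps can be sketched in code. The following is our own minimal NumPy illustration, not the book's code: the variable names, the simulated stand-in data, and the number of iterations are assumptions, and the book's Table 8.2 data would replace the simulated x.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data standing in for Table 8.2: n examinees by J observables
n, J = 10, 5
x = rng.normal(80.0, 7.0, size=(n, J))
x_bar = x.mean(axis=1)                   # per-examinee means

# Hyperparameters matching the priors used in the example (Section 8.3.3)
mu_mu, sig2_mu = 80.0, 100.0             # mu_T ~ N(80, 100)
nu_T, sig2_T0 = 2.0, 36.0                # sig2_T ~ Inv-Gamma(1, 36)
nu_E, sig2_E0 = 2.0, 16.0                # sig2_E ~ Inv-Gamma(1, 16)

# Initial values (t = 0)
mu_T, sig2_T, sig2_E = 75.0, 50.0, 10.0

draws = []
for t in range(2000):
    # Step 1: sample each true score from its full conditional (8.48)/(8.52)
    prec = 1.0 / sig2_T + J / sig2_E
    T = rng.normal((mu_T / sig2_T + J * x_bar / sig2_E) / prec,
                   np.sqrt(1.0 / prec))

    # Step 2a: sample the mean of the true scores (8.49)/(8.53)
    prec_mu = 1.0 / sig2_mu + n / sig2_T
    mu_T = rng.normal((mu_mu / sig2_mu + n * T.mean() / sig2_T) / prec_mu,
                      np.sqrt(1.0 / prec_mu))

    # Step 2b: sample the true score variance (8.50);
    # an Inv-Gamma(a, b) draw is 1 / Gamma(shape=a, scale=1/b)
    SS_T = np.sum((T - mu_T) ** 2)
    sig2_T = 1.0 / rng.gamma((nu_T + n) / 2.0,
                             2.0 / (nu_T * sig2_T0 + SS_T))

    # Step 3: sample the error variance (8.51)
    SS_E = np.sum((x - T[:, None]) ** 2)
    sig2_E = 1.0 / rng.gamma((nu_E + n * J) / 2.0,
                             2.0 / (nu_E * sig2_E0 + SS_E))

    draws.append((mu_T, sig2_T, sig2_E))
```

With the actual Table 8.2 data in x, posterior summaries of the retained draws should approximate those reported later in Table 8.4.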
8.3.3 Example
8.3.3.1 Model Specification and Posterior Distribution
We continue with the example scenario and the data in Table 8.2 containing observed
values from n = 10 examinees for J = 5 observables, now with μ_T, σ²_T, and σ²_E treated as
unknown. We construct a prior distribution for the mean of the true scores based on the
prior beliefs that the mean is likely around 80, and almost certainly between 60 and 100.
Accordingly, we specify a normal prior with mean μ_{μT} = 80 and variance σ²_{μT} = 100,

μ_T ~ N(80, 100).
We construct a prior distribution for the true score variance expressing the beliefs that the variance of the true scores is likely about 36 (i.e., the standard deviation is about 6), but we are not very confident; it is as if our beliefs were based on having observed two examinees. Accordingly, we specify σ²_{T0} = 36 and ν_T = 2, yielding

σ²_T ~ Inv-Gamma(1, 36).
We construct a prior distribution for the error variance expressing the beliefs that the variance of the errors is likely about 16 (i.e., the standard deviation is about 4), but we are not very confident, as if that were based on two examinees. Accordingly, we specify σ²_{E0} = 16 and ν_E = 2, yielding

σ²_E ~ Inv-Gamma(1, 16).
Writing out the posterior distribution for the model using these chosen values for the
hyperparameters, we have
p(T, μ_T, σ²_T, σ²_E | x) ∝ ∏_{i=1}^{n} [ ∏_{j=1}^{J} p(x_ij | T_i, σ²_E) ] p(T_i | μ_T, σ²_T) p(μ_T) p(σ²_T) p(σ²_E),

where

T_i | μ_T, σ²_T ~ N(μ_T, σ²_T) for i = 1, …, 10,

μ_T ~ N(80, 100),

σ²_T ~ Inv-Gamma(1, 36),

and

σ²_E ~ Inv-Gamma(1, 16).
TABLE 8.3
Computations in the Gibbs Sampler for the Classical Test Theory Example with Unknown True Scores (T), Mean of the True Scores (μ_T), Variance of the True Scores (σ²_T), and Error Variance (σ²_E)

True scores, with the mean and variance of the full conditional for T_1 (8.52):

Iteration   T_1     T_2     T_3     T_4     T_5     T_6     T_7     T_8     T_9     T_10    Mean    Variance
0           80.00   81.00   82.00   83.00   84.00   85.00   86.00   87.00   88.00   89.00   76.54   1.92
1           77.65   78.62   81.72   71.70   74.38   89.17   75.79   82.19   76.41   89.00   76.54   1.92
2           76.31   80.28   82.03   74.94   72.78   89.21   76.92   77.15   78.12   88.70   76.63   1.76
3           77.40   78.83   82.41   74.83   71.12   86.99   76.36   79.56   79.06   85.85   76.66   1.80
4           77.90   79.24   82.17   76.21   73.26   88.37   74.51   80.27   81.96   87.62   77.04   1.85
5           76.07   79.72   82.37   70.90   74.66   85.72   78.16   79.94   80.52   89.44   76.79   2.62

Remaining parameters, with the computed parameters of the full conditionals for μ_T (8.49), σ²_T (8.50), and σ²_E (8.51):

            Parameters              Full Cond. for μ_T    Full Conditional for σ²_T              Full Conditional for σ²_E
Iteration   μ_T     σ²_T    σ²_E    Mean     Variance     (ν_T+n)/2   SS(T)    (ν_T σ²_{T0}+SS(T))/2   (ν_E+nJ)/2   SS(E)     (ν_E σ²_{E0}+SS(E))/2
0           75.00   50.00   10.00   84.29    4.76         6.00        985.00   528.50                  26.00        2625.00   1328.50
1           76.99   26.11    9.45   79.68    4.76         6.00        381.46   226.73                  26.00         641.87    336.94
2           77.23   19.96    9.88   79.66    2.54         6.00        334.38   203.19                  26.00         568.27    300.13
3           80.07   14.49   10.57   79.26    1.96         6.00        218.09   145.05                  26.00         592.83    312.42
4           79.84   45.44   13.91   80.15    1.43         6.00        232.51   152.25                  26.00         572.36    302.18
5           77.38   31.60   14.35   79.76    4.35         6.00        313.59   192.80                  26.00         647.43    339.72
8.3.3.2 Gibbs Sampler

1. Sample the latent variables for examinees. For the first examinee, we compute

μ_{T_1^(1)|μ_T^(0),σ²_T^(0),σ²_E^(0),x_1} = [(μ_T^(0)/σ²_T^(0)) + (J x̄_1/σ²_E^(0))] / [(1/σ²_T^(0)) + (J/σ²_E^(0))]
  = [(75/50) + ((5)(76.6)/10)] / [(1/50) + (5/10)] ≈ 76.54

and

σ²_{T_1^(1)|μ_T^(0),σ²_T^(0),σ²_E^(0),x_1} = 1 / [(1/50) + (5/10)] ≈ 1.92,

and draw a value from the full conditional T_1^(1) | μ_T^(0), σ²_T^(0), σ²_E^(0), x_1 ~ N(76.54, 1.92). The drawn value was 77.65. Repeat this process for the remaining examinees i = 2, …, 10.
2. Sample the parameters for the latent variable distribution.
a. Turning to the full conditional for μ_T, we compute

μ_{μT^(1)|T^(1),σ²_T^(0)} = [(μ_{μT}/σ²_{μT}) + (n T̄^(1)/σ²_T^(0))] / [(1/σ²_{μT}) + (n/σ²_T^(0))]
  = [(80/100) + ((10)(79.66)/50)] / [(1/100) + (10/50)] ≈ 79.68

and

σ²_{μT^(1)|T^(1),σ²_T^(0)} = 1 / [(1/100) + (10/50)] ≈ 4.76,

and draw a value from the full conditional μ_T^(1) | T^(1), σ²_T^(0) ~ N(79.68, 4.76). The drawn value was 76.99.
b. Turning to σ²_T, we compute the first parameter of the full conditional as (ν_T + n)/2 = (2 + 10)/2 = 6, which will not vary over iterations. To compute the second parameter for the first iteration, we compute

SS(T^(1)) = Σ_{i=1}^{10} (T_i^(1) − μ_T^(1))² = Σ_{i=1}^{10} (T_i^(1) − 76.99)² ≈ 381.46

and then

(ν_T σ²_{T0} + SS(T^(1)))/2 = ((2)(36) + 381.46)/2 ≈ 226.73.

Accordingly, we draw a value from the full conditional σ²_T^(1) | T^(1), μ_T^(1) ~ Inv-Gamma(6, 226.73). The drawn value was 26.11.

3. Sample the measurement model parameters. We compute the first parameter of the full conditional for σ²_E as (ν_E + nJ)/2 = (2 + 50)/2 = 26, which will not vary over iterations. To compute the second parameter for the first iteration, we compute

SS(E^(1)) = Σ_{i=1}^{10} Σ_{j=1}^{5} (x_ij − T_i^(1))² ≈ 641.87

and then

(ν_E σ²_{E0} + SS(E^(1)))/2 = ((2)(16) + 641.87)/2 ≈ 336.94.

Accordingly, we draw a value from the full conditional σ²_E^(1) | T^(1), x ~ Inv-Gamma(26, 336.94). The drawn value was 9.45.
Table 8.3 lists the results from five iterations of the Gibbs sampler, including the relevant computations for the parameters and the true score for the first examinee.
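The iteration-1 computations for the first examinee can be verified directly; the following is our own small sketch, using x̄_1 = 76.6 and the initial values from Table 8.3:

```python
# Iteration-0 values from Table 8.3
mu_T_0, sig2_T_0, sig2_E_0 = 75.0, 50.0, 10.0
J = 5
xbar_1 = 76.6                 # mean of the first examinee's five observables

# Full conditional for T_1 at iteration 1, per (8.52)
prec = 1 / sig2_T_0 + J / sig2_E_0                      # 0.02 + 0.5 = 0.52
mean_T1 = (mu_T_0 / sig2_T_0 + J * xbar_1 / sig2_E_0) / prec
var_T1 = 1 / prec
print(round(mean_T1, 2), round(var_T1, 2))              # 76.54 1.92
```

These match the full-conditional mean and variance reported in the text and in Table 8.3.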
8.3.3.3 WinBUGS
The WinBUGS code for the model, including the reliability of a single observable and the
reliability of the composite of the observables, is given as follows. The data statement is the
same as that given in the example in Section 8.2.3, and will not be repeated here.
--------------------------------------------------------------------------
#########################################################################
# Model Syntax
#########################################################################
model{
#########################################################################
# Classical Test Theory
# With Unknown
# True Score Mean, True Score Variance
# Error Variance
#########################################################################
#########################################################################
# Prior Distributions for Parameters
#########################################################################
mu.T ~ dnorm(80, .01)           # Mean of the true scores,
                                # in terms of its mean and precision
tau.T ~ dgamma(1, 36)           # Precision of the true scores; the gamma
sigma.squared.T <- 1/tau.T      # prior on the precision corresponds to the
                                # inverse-gamma prior on the variance
tau.E ~ dgamma(1, 16)           # Precision of the errors
sigma.squared.E <- 1/tau.E      # Error variance
#########################################################################
# Model for True Scores and Observables
#########################################################################
for (i in 1:n) {
T[i] ~ dnorm(mu.T, tau.T) # Distribution of true scores
for(j in 1:J){
x[i,j] ~ dnorm(T[i], tau.E) # Distribution of observables
}
}
#########################################################################
# Reliability
#########################################################################
reliability <- sigma.squared.T/(sigma.squared.T+sigma.squared.E)
reliability.of.composite <- J*reliability/((J-1)*reliability+1)
}
The model was run with the data contained in Table 8.2. Three chains were run from dispersed starting values. Convergence diagnostics including the history plots and the potential scale reduction factor (Section 5.7.2) suggested convergence within a few iterations. The quick convergence here is in part due to the use of the conditionally conjugate prior specification, which
yields full conditional distributions that are of known form (Section 8.3.2) and recognized as such by WinBUGS. To be conservative, 100 iterations were discarded as burn-in and 10,000 additional iterations were run per chain, yielding 30,000 iterations for use in summarizing the posterior
distribution. The marginal posterior densities for the parameters are given in Figure 8.8. The reliability of each test (ρ) and the reliability of the composite (ρ_c) are also included. Table 8.4
gives the summary statistics describing these marginal posterior distributions. The posterior
mean and median are given as measures of central tendency. These are quite similar for all
parameters, despite the skewness in the posterior distributions for the variances and the reliability terms. The results of the Bayesian approach support probabilistic expressions of our uncertainty about the parameters. For example, the 95% HPD interval for ρ is (.53, .90), indicating
that, based on our data and the model specifications, we are 95% sure that the reliability of
each test is between .53 and .90.
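The composite reliability computed in the code above follows the Spearman-Brown formula, ρ_c = Jρ / ((J − 1)ρ + 1). A quick check (our own sketch) using the single-test reliability near the posterior mean reported in Table 8.4:

```python
def composite_reliability(rho, J):
    """Spearman-Brown formula for the reliability of an equally weighted
    composite of J parallel tests, each with reliability rho."""
    return J * rho / ((J - 1) * rho + 1)

# Using the posterior mean of rho from Table 8.4 with J = 5 observables
print(round(composite_reliability(0.73, 5), 2))   # 0.93
```

Applying the formula to the posterior mean of ρ closely matches the posterior mean of ρ_c in Table 8.4, though in a fully Bayesian analysis ρ_c is computed draw by draw rather than from a point estimate.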
FIGURE 8.8
Marginal posterior densities for the classical test theory example: the 10 examinees' true scores (T_1, …, T_10), the mean of the true scores (μ_T), true score variance (σ²_T), error variance (σ²_E), reliability of each test (ρ), and the reliability of the equally weighted composite (ρ_c).
Before turning to other models, let us review the developments in light of the assessment goal of making inferences about examinees. In CTT, the inferential target is the examinee's true score, so it is worthwhile to consider the various options in estimating an examinee's true score. An obvious candidate is the observed score for the examinee. In the case of one observable, this is x_i. In the case of multiple observables, this may be the mean of the observables for the examinee, x̄_i. However, this fails to account for the uncertainty due to the (un)reliability of the tests. Kelley's formula for estimating true scores overcomes this, involving two parameters, the reliability (ρ) and the population mean of the observed scores (μ_x). (Note that if the unit of analysis is the mean over observables for an examinee, x̄_i, the second parameter is the population mean of the x̄_i, denoted as μ_x̄.
TABLE 8.4
Summaries of the Marginal Posterior Distributions for the Classical Test Theory Example

Parameter   Mean    Median   Standard Deviation   95% Highest Posterior Density Interval
T_1         76.85   76.85    1.53                 (73.87, 79.83)
T_2         79.07   79.07    1.53                 (76.00, 82.03)
T_3         82.04   82.05    1.52                 (79.04, 85.03)
T_4         75.00   75.00    1.55                 (71.97, 78.07)
T_5         72.60   72.59    1.55                 (69.51, 75.60)
T_6         88.54   88.56    1.56                 (85.50, 91.60)
T_7         77.23   77.22    1.54                 (74.23, 80.26)
T_8         80.57   80.55    1.53                 (77.48, 83.49)
T_9         80.20   80.19    1.52                 (77.18, 83.13)
T_10        89.09   89.11    1.55                 (86.02, 92.10)
μ_T         80.10   80.11    1.98                 (76.23, 84.11)
σ²_T        38.78   33.72    21.30                (10.83, 78.59)
σ²_E        12.54   12.17    2.86                 (7.59, 18.21)
ρ           0.73    0.74     0.10                 (0.53, 0.90)
ρ_c         0.93    0.93     0.04                 (0.86, 0.98)
It can easily be shown that μ_x = μ_x̄.) In applications, these parameters are unknown. A frequentist approach employs point estimates of these parameters. A point estimate of reliability may be obtained via any number of methods (Crocker & Algina, 1986; Haertel,
2006; Lewis, 2007). The population mean of the observed scores may be estimated via the
sample mean of the observables. (Similarly, if x̄_i is the unit of analysis, μ_x̄ may be estimated as the mean of the x̄_i.) However, this ignores the uncertainty in the parameters. A fully
Bayesian approach, one that models parameters as random, using distributions to represent uncertainty, incorporates the uncertainty about ρ and μ_x (alternatively, μ_x̄). (Note that in our Bayesian formulations this appears as μ_T; this is simply a change in notation, as μ_x = μ_T = μ_x̄.) This incorporation of uncertainty is accomplished by using the posterior distribution for the parameters, rather than point estimates, in the estimation of true scores. Of
course, the uncertainty in the resulting estimation of true scores is also represented by the
posterior distribution for a true score, rather than a point estimate. Finally, the same logic
applies to the other parameters of interest, such as reliability. Here again, we have the pos-
terior distribution as a synthesis of our prior beliefs and the information in the data. This
affords the expression of beliefs and uncertainty about the reliability in probabilistic terms.
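Kelley's formula shrinks an observed score toward the population mean in proportion to the unreliability, T̂_i = ρx_i + (1 − ρ)μ_x. A minimal illustration (the numeric values here are hypothetical):

```python
def kelley(x_obs, rho, mu_x):
    """Kelley's formula: shrink the observed score toward the population
    mean in proportion to the unreliability (1 - rho)."""
    return rho * x_obs + (1 - rho) * mu_x

# A hypothetical observed score of 90, reliability .8, population mean 80
print(kelley(90, 0.8, 80))   # approximately 88.0
```

Note the two limiting cases: with ρ = 1 the estimate is the observed score itself, and with ρ = 0 it collapses to the population mean.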
We close by noting that Kelley's formula is sometimes formulated using the mean of the observed scores for the group rather than a population parameter. Results of a Bayesian analysis of normal distributions using the observed mean in this way date to Lindley's
discussion of Stein (1962). A more complete development in the context of evaluating true
scores was given by Lindley and reported in Novick (1969). It is here that the connection
between Kelleys formula and Bayes theorem was recognized, culminating with Novick
et al. (1971) explicitly stating the connection. This is certainly not the only time that that
Exercises
8.1 Consider again the model in Section 8.3.1, depicted in the DAG in Figure 8.7. Based on the DAG, for each entity, list the entities that need to be conditioned on in the full conditional distribution.
8.2 Section 8.3.3 demonstrated the computations involved in the first iteration of a Gibbs sampler for the CTT model with unknown hyperparameters. The results for this and four additional iterations are given in Table 8.3. Show the computations for each of these additional iterations.
8.3 Consider again a CTT model with a test scored from 0 to 100, where σ_E = 4 is the standard deviation of the error scores. There are two groups of examinees: In Class A, μ_T(A) = 85; in Class B, μ_T(B) = 75; and in both groups the standard deviation of the true scores is σ_T = 6. Both the within-group true-score distributions and the error distributions are normal. Generate 200 observed scores for both groups, in each case by first drawing a true score from the group distribution and then drawing an error term to add to it, so you know each simulee's T and x. Round each observed score to the closest integer.
a. For each observed score between 70 and 90, calculate the mean of the true scores
of simulees who obtained that x, within each group separately and combining
across groups.
b. Calculate the posterior mean that corresponds to each observed score between 70 and 90, for each group separately and combining across groups. (Hint: Use Kelley's formula for the combined-group answers. Calculate the combined-group variance using the within-group variance and the squared difference between group means.)
c. Compare the results of (a) and (b).
d. What is the correct posterior mean for an individual with an observed score of
80? For an observed score of 70? (Hint: You may want to revisit this question
after you have answered the next one.)
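The data-generation step of Exercise 8.3 might be sketched as follows; this is our own sketch (the group labels, seed, and variable names are assumptions), and the analyses in parts (a) through (d) are left to the reader:

```python
import numpy as np

rng = np.random.default_rng(83)          # arbitrary seed
sigma_T, sigma_E = 6.0, 4.0

scores = {}
for group, mu_group in {"A": 85.0, "B": 75.0}.items():
    T = rng.normal(mu_group, sigma_T, size=200)   # draw true scores
    E = rng.normal(0.0, sigma_E, size=200)        # draw error scores
    x = np.rint(T + E)                            # rounded observed scores
    scores[group] = (T, x)
```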
8.4 Some of the students in the classes from the previous problem are going on a
field trip to the museum. There is one more place left, and the determination will
* For another instance of the issue considered here, see Efron and Morris (1977). For examples of other situations
where advances in conventional approaches may be seen as Bayesian, see Good (1965); Goldstein (1976); Clogg,
Rubin, Schenker, Schultz, and Weidman (1991); and Galindo-Garre, Vermunt, and Bergsma (2004).
* As was the case for CTT in Chapter 8, these observables may be the result of taking aggregates or other functions of other variables. And that is indeed the case for the example introduced in Section 9.3, where we have five observables, each of which is an average of the scored responses to items. For our purposes of laying out a Bayesian treatment of CFA, the starting point is a set of observed scores for each examinee, however obtained, such that the CFA structure applies.
Let x_i denote the J × 1 vector of observed values from examinee i (i.e., the contents of the ith row of x, arranged as a column vector). These observables are related to a set of latent variables via a factor analytic measurement model,

x_i = τ + Λθ_i + ε_i, (9.1)

where τ is a J × 1 vector of intercepts, Λ is a J × M matrix of loadings, θ_i is an M × 1 vector of latent variables for examinee i, and ε_i is a J × 1 vector of errors with covariance matrix Ψ. It is also usually assumed that all the sources of covariation among the observables are expressed in the latent variables and therefore that Ψ is diagonal; this assumption may easily be relaxed in ways discussed later.* Here τ, Λ, and Ψ are the measurement model parameters, as they govern the dependence of the observables on the latent variables, which is made more explicit in Section 9.2.
The expression in (9.1) is a compact way to represent a system of equations. For expository reasons, we write them out more fully as

x_i1 = τ_1 + λ_1′θ_i + ε_i1 = τ_1 + λ_11θ_i1 + λ_12θ_i2 + ⋯ + λ_1M θ_iM + ε_i1
x_i2 = τ_2 + λ_2′θ_i + ε_i2 = τ_2 + λ_21θ_i1 + λ_22θ_i2 + ⋯ + λ_2M θ_iM + ε_i2
  ⋮
x_ij = τ_j + λ_j′θ_i + ε_ij = τ_j + λ_j1θ_i1 + λ_j2θ_i2 + ⋯ + λ_jM θ_iM + ε_ij    (9.2)
  ⋮
x_iJ = τ_J + λ_J′θ_i + ε_iJ = τ_J + λ_J1θ_i1 + λ_J2θ_i2 + ⋯ + λ_JM θ_iM + ε_iJ.
EFA models commonly estimate a loading for each observable on each latent variable. CFA
models typically specify the pattern of loadings, meaning that they express which observ-
ables load on which latent variables. To communicate this, path diagrams are commonly
employed when specifying and communicating CFA models within the SEM framework (Ho et al., 2012). Figure 9.1 contains representations of one-factor and two-factor models for five observables in panels (a) and (b), respectively. The path diagrams correspond to
* The model may be viewed as simultaneously regressing the xs on the θs, with τ, Λ, and the diagonal elements of Ψ playing the roles of regression intercepts, coefficients, and error variances, respectively, only now the predictors in the regression model (the θs) are unknown. See Exercise 9.1.
FIGURE 9.1
Path diagrams for confirmatory factor analysis models for five observables with (a) one latent variable (factor)
and (b) two latent variables (factors).
For the one-factor model in panel (a), the equations are

x_i1 = τ_1 + λ_11θ_i1 + ε_i1
x_i2 = τ_2 + λ_21θ_i1 + ε_i2
x_i3 = τ_3 + λ_31θ_i1 + ε_i3    (9.3)
x_i4 = τ_4 + λ_41θ_i1 + ε_i4
x_i5 = τ_5 + λ_51θ_i1 + ε_i5.

For the two-factor model in panel (b), the equations are

x_i1 = τ_1 + λ_11θ_i1 + ε_i1
x_i2 = τ_2 + λ_21θ_i1 + ε_i2
x_i3 = τ_3 + λ_31θ_i1 + ε_i3    (9.4)
x_i4 = τ_4 + λ_42θ_i2 + ε_i4
x_i5 = τ_5 + λ_52θ_i2 + ε_i5.
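The loading patterns in (9.3) and (9.4) can be encoded as Λ matrices whose structural zeros express which observables do not load on which latent variables. A sketch with illustrative (not estimated) loading values:

```python
import numpy as np

# One-factor model (9.3): all five observables load on the single factor
Lambda_1 = np.array([[1.0],
                     [0.8],
                     [0.9],
                     [0.7],
                     [0.6]])

# Two-factor model (9.4): x1-x3 load on factor 1, x4-x5 on factor 2;
# the zeros encode the specified pattern of loadings
Lambda_2 = np.array([[1.0, 0.0],
                     [0.8, 0.0],
                     [0.9, 0.0],
                     [0.0, 1.0],
                     [0.0, 0.7]])
```

In an EFA, by contrast, every cell of Λ would typically be estimated rather than some being fixed to zero.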
For each model, the path diagram conveys the same information contained in the equations, and additionally communicates the mean, variance, and covariance terms for the latent variables (κ_1, κ_2, φ_11, φ_22, φ_21) and the variance terms for the errors (ψ_11, …, ψ_55).*
* Although not typically done, the systems of expressions in (9.3) and (9.4) could be expanded to formally include these elements. The path diagrams contain a 1 in a triangle that does not seem to appear in the equations. As discussed in Section 9.6, this is a modeling device to aid in connecting path diagrams to equations. For now, it suffices to say that this 1 is a constant, and may be seen in structural equations as the often-not-written value for which the parameter on the path emanating from it serves as a coefficient.
The observant reader may have noticed that the path diagrams in Figure 9.1 do not include subscripts for examinee variables: the xs, θs, and εs. This follows conventional use of path diagrams for CFA from within the SEM framework. Correspondingly, most presentations do not include such subscripts in the equations (e.g., Bollen, 1989). We have included them in (9.1) through (9.4) to make explicit that these entities are examinee variables and for continuity with our presentation throughout the book. It is worth more fully contrasting the structural equation and path diagrammatic representation of models with the probabilistic and DAG representation, a task we defer until Section 9.6.
The model implies first- and second-order moments for the observables in terms of the model parameters, obtained through the algebra of expectations or via path-tracing rules for path diagrams (Mulaik, 2009; Wright, 1934), which in general yields

E(x_i | τ, Λ, Ψ, κ, Φ) = τ + Λκ (9.5)

and

var(x_i | τ, Λ, Ψ, κ, Φ) = ΛΦΛ′ + Ψ. (9.6)
The notation on the left-hand side of (9.5) and (9.6) stresses conditioning variables, which
aids in contrasting these with representations that come later.
Frequentist approaches dominate conventional approaches to model fitting, which amounts to fitting these model-implied moments to data via least-squares or ML routines assuming normality of the observables to yield estimates for τ, Λ, Ψ, κ, and Φ. In the parlance of Chapter 7, estimating these parameters amounts to model calibration. Following the introduction of some necessary terms in Section 9.2, we sketch an overview of ML estimation to draw out connections to Bayesian approaches (for treatments of these and other options in the frequentist framework, see Bollen, 1989, chapter 4; Mulaik, 2009, chapter 7; Yanai & Ichikawa, 2007; Lei & Wu, 2012). As we will see, the popular normal-theory ML routines can be grounded in distributional assumptions of normality for the latent variables and errors.
Interestingly, arriving at representations of the examinees' latent variables (referred to as scoring in Chapter 7) is less common in CFA as compared to other psychometric modeling traditions. This is no doubt in part due to the purposes of those applications of CFA where values for examinees' latent variables are not of inferential interest. We suspect that it is also in part due to the so-called problem of factor score indeterminacy (not to be confused with the indeterminacies discussed earlier) that has resulted in a number of proposed ways to represent examinees' latent variables (Bartholomew et al., 2011; Yanai & Ichikawa, 2007) and considerable debate (see, e.g., Maraun, 1996, and surrounding discussion articles).
Working from the principle that the latent variables are random variables, Bartholomew (1981) argued for the use of Bayesian inference for scoring in CFA, in which the posterior distribution p(θ_i | x_i) is obtained using point estimates for the model parameters from frequentist calibration, akin to (7.10). However, this approach is hardly one of consensus. Bartholomew (1981) also noted that what are typically called regression scores for latent variables correspond to the expectation of this posterior distribution (see also Aitkin & Aitkin, 2005; Bartholomew et al., 2011). The situation here is akin to what was observed for CTT, in which Kelley's formula for arriving at a point estimate of a true score coincided with the posterior mean from a Bayesian analysis.
The unknowns include the latent variables (θ), the parameters for the distribution of the latent variables (κ and Φ), and the parameters for the measurement model relating the observables to the latent variables (τ, Λ, and Ψ). Once the data are observed, the posterior distribution following Bayes theorem is

p(θ, κ, Φ, τ, Λ, Ψ | x) ∝ p(x | θ, κ, Φ, τ, Λ, Ψ) p(θ, κ, Φ, τ, Λ, Ψ). (9.7)

The first term on the right-hand side of (9.7) is the conditional probability distribution of the observables. The second term is the prior distribution for the unknowns. Specifying the model involves specifying these terms. In the following sections, we will treat each in turn, and discuss some connections to and departures from frequentist approaches to CFA.

Beginning with the conditional distribution of the observables, a key conditional independence assumption is

p(x | θ, κ, Φ, τ, Λ, Ψ) = p(x | θ, τ, Λ, Ψ). (9.8)
This expression reflects that, given the values of the latent variables (θ) of examinees, the observables x are conditionally independent of the parameters that govern the distribution of the latent variables, namely the means (κ) and (co)variances (Φ). Put another way, once we know the values of the latent variables, knowing their means and (co)variances does not tell us anything new about the observables.
Looking at the right-hand side of (9.8), x contains the values for the observables for examinees, and θ contains the values for the latent variables for examinees. These vary over examinees. The remaining parameters are the measurement model parameters that govern the conditional distribution of the observables given the latent variables: the intercepts τ, loadings Λ, and error (co)variances Ψ. That these parameters do not vary over examinees reflects an exchangeability assumption with respect to the measurement model: we think the measurement quality of the observables is the same for each examinee. When exchangeability of measurement is not assumed, different measurement model parameters are specified for different examinees. This usually occurs by way of specifying group-specific parameters when group membership is known or unknown (as in latent class or finite mixture models; Pastor & Gagné, 2013), with applications in invariance or differential functioning analyses (Millsap, 2011; Verhagen & Fox, 2013).
Assuming exchangeability with respect to the measurement model, the joint conditional distribution for the observables may be factored into a product over examinees,

p(x | θ, τ, Λ, Ψ) = ∏_{i=1}^{n} p(x_i | θ_i, τ, Λ, Ψ). (9.9)
The assumption that Ψ is diagonal embodies the local independence assumption and reveals why local independence is best thought of with respect to the model as a whole, not just the latent variables (Levy & Svetina, 2011). To see why, consider what happens if we do not condition on the errors. If there are no error covariances (i.e., Ψ is diagonal), local independence will hold. In the more general case where error covariances are present, failing to condition on them will not render the observables independent; a nonzero error covariance indicates that the observables are dependent above and beyond that which can be accounted for by the latent variables (and other measurement model parameters).
As a result of the local independence assumption, the conditional probability of the
observables for any examinee can be further factored into the product over observables,
p(x_i | θ_i, τ, Λ, Ψ) = ∏_{j=1}^{J} p(x_ij | θ_i, τ_j, λ_j, ψ_jj). (9.10)
The form of the distribution is dictated by the factor analytic model and the distributional assumption of the errors. The system of equations in (9.1) and the assumption that the errors have expectation 0 for all variables and covariance matrix Ψ imply that the conditional expectation and variance of the observables for an examinee given the latent variables and the measurement model parameters are

E(x_i | θ_i, τ, Λ, Ψ) = τ + Λθ_i (9.11)

and

var(x_i | θ_i, τ, Λ, Ψ) = Ψ. (9.12)

Following the assumption of normality of errors, the conditional distribution of the observables for each examinee is normal. Working at the level of the vector of observables for any examinee (i.e., the right-hand side of 9.9),

x_i | θ_i, τ, Λ, Ψ ~ N(τ + Λθ_i, Ψ). (9.13)

Drilling down further to the level of each observable for any examinee (i.e., the right-hand side of 9.10),

x_ij | θ_i, τ_j, λ_j, ψ_jj ~ N(τ_j + λ_j′θ_i, ψ_jj). (9.14)
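With a diagonal Ψ, the multivariate density in (9.13) factors into the product of the univariate densities in (9.14). This can be checked numerically; the parameter values below are arbitrary choices for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

tau = np.array([1.0, 2.0, 0.5])            # intercepts
Lam = np.array([[1.0], [0.8], [0.6]])      # loadings on a single factor
Psi = np.diag([0.5, 0.4, 0.3])             # diagonal error covariance matrix
theta_i = np.array([0.7])                  # latent variable value
x_i = np.array([1.9, 2.4, 1.1])            # observables

mu = tau + Lam @ theta_i
joint = multivariate_normal(mu, Psi).logpdf(x_i)            # (9.13)
byvar = norm(mu, np.sqrt(np.diag(Psi))).logpdf(x_i).sum()   # sum of logs of (9.14)
assert np.isclose(joint, byvar)
```

Were Ψ to contain nonzero off-diagonal elements, the two computations would no longer agree, which is the numerical counterpart of the local independence discussion above.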
Collecting the preceding developments in (9.8) through (9.10), (9.13), and (9.14), we are now ready to state the conditional distribution of all the observables, expressed by drilling down to the level of each observable for each examinee.* The conditional distribution is

p(x | θ, κ, Φ, τ, Λ, Ψ) = p(x | θ, τ, Λ, Ψ) = ∏_{i=1}^{n} ∏_{j=1}^{J} p(x_ij | θ_i, τ_j, λ_j, ψ_jj), (9.15)

where

x_ij | θ_i, τ_j, λ_j, ψ_jj ~ N(τ_j + λ_j′θ_i, ψ_jj).

* As an exercise, we encourage the reader to create a DAG corresponding to the specifications described in this section.
Working at the level of the vector of observables for any examinee, coupling the deterministic expression in (9.1) with the assumption that ε_i ~ N(0, Ψ) implies the probabilistic expression in (9.13). Working at the level of any individual observable for any examinee, coupling the deterministic expression for observable j in (9.2) with the assumption that ε_ij ~ N(0, ψ_jj) implies the probabilistic expression in (9.14).
Note that we have been working with a model for the conditional distribution of the
data, structuring the data as dependent on model parameters. There is nothing particu-
larly unique to Bayesian modeling here. The conditional distribution for data given the
parameters plays a role in Bayesian inference of course, but it also plays a role in frequen-
tist inference, namely by defining the likelihood function.
The conventional approach to expressing statistical and psychometric models is equa-
tion oriented. For CFA, the standard approach is to use structural equation modeling, as in
(9.1) through (9.3); path diagrams such as those in Figure 9.1 closely correspond to those structural equations. This encourages us to think equationally. We value this perspective, and it serves
psychometrics quite well in many respects. However, we find that it is often preferable to
express models using distributional forms, as in (9.13) or (9.14). The equations of course
play a role in specifying the distribution, as they yield the parameters of the resulting
distribution in (9.11) and (9.12). Formulating the model in this latter way is an important
step towards our aim of thinking probabilistically. Formulating and explicitly expressing the
model in terms of probability distributions more naturally coheres with probability-based
reasoning to conduct inference, and probabilistic expressions of beliefs. In addition, it aids
with making conditioning explicit in notation, which allows us to more easily express local
and other conditional independence relationships. This approach is well aligned with hier-
archical strategies to building models, which we flesh out more fully in the current context
in Section9.7. The distributional form is also better connected to the substantive stories
about the data we come to see. We also suspect that this approach may better encourage
the specification of models that depart from or generalize those covered here, such as the
specification of thicker-tailed distributions in support of modeling rather than discarding
outlying observations (Lee & Xia, 2008; Zhang, Lai, Lu, & Tong, 2013).
The prior distribution for all the unknowns may be factored as

p(θ, κ, Φ, τ, Λ, Ψ) = p(θ | κ, Φ) p(κ, Φ, τ, Λ, Ψ)
                    = p(θ | κ, Φ) p(κ, Φ) p(τ, Λ, Ψ). (9.16)

The first line reflects that the prior distribution for the examinee latent variables is governed by the parameters κ and Φ, and is independent of the measurement model parameters (τ, Λ, and Ψ). The second line reflects that these parameters that govern the distribution of the latent variables (κ and Φ) are independent of the measurement model parameters (τ, Λ, and Ψ). We now step through the further specifications of the three terms on the right-hand side of the second line.
Treating the latent variables for examinees as exchangeable, the first term factors into a product over examinees,

p(θ | κ, Φ) = ∏_{i=1}^{n} p(θ_i | κ, Φ). (9.17)

Assuming normality for the latent variables, the prior for each examinee is then

θ_i | κ, Φ ~ N(κ, Φ). (9.18)
For the second term, κ and Φ are modeled as independent in the prior,

p(κ, Φ) = p(κ) p(Φ). (9.19)
The elements of κ are likewise modeled as independent in the prior,

p(κ) = ∏_{m=1}^{M} p(κ_m). (9.20)
As the κ_m are means of normal distributions, a common choice for the prior is a normal distribution,

κ_m ~ N(μ_κ, σ²_κ), (9.21)

where μ_κ and σ²_κ are parameters that govern these prior distributions, specified by the analyst.
For Φ, we employ an inverse-Wishart distribution, denoted by Inv-Wishart(·).
The prior for the measurement model parameters is factored into a product over observables,

p(τ, Λ, Ψ) = ∏_{j=1}^{J} p(τ_j, λ_j, ψ_jj). (9.23)
To specify p(τ_j, λ_j, ψ_jj), we employ conditionally conjugate priors. In this case, all the elements, including the possibly multiple loadings for each observable, are modeled as independent in the prior,

p(τ_j, λ_j, ψ_jj) = p(τ_j) [ ∏_{m=1}^{M} p(λ_jm) ] p(ψ_jj). (9.24)
Normal priors are employed for the intercepts and loadings,

τ_j ~ N(μ_τ, σ²_τ), (9.25)

and

λ_jm ~ N(μ_λ, σ²_λ), (9.26)

with the hyperparameters μ_τ, σ²_τ, μ_λ, and σ²_λ specified by the analyst.
(9.11) through (9.14). Typical frequentist presentations do not include the latent variables in
expressing the distribution of the data, instead presenting them as in (9.5) and (9.6). The dif-
ferences in formulations can be reconciled by recognizing what is being conditioned on,
and what each approach considers to be parameters.
We begin by constructing the joint distribution of the observable and latent variables:

p(x, θ | κ, Φ, τ, Λ, Ψ) = p(x | θ, τ, Λ, Ψ) p(θ | κ, Φ)
                        = ∏_{i=1}^{n} p(x_i | θ_i, τ, Λ, Ψ) p(θ_i | κ, Φ). (9.28)
The first term on the right-hand side of (9.28) is the conditional distribution of the observables
given the latent variables (and measurement model parameters), defined in (9.15). The second
term on the right-hand side of (9.28) is the distribution of the latent variables for examinees,
defined in (9.17) and (9.18). The marginal distribution of the observables, without reference
to the latent variables, is obtained by integrating (9.28) over the distribution of the latent
variables as an instance of (7.8),
p(x | κ, Φ, Λ, τ, Ψ) = ∫ p(x, ξ | κ, Φ, Λ, τ, Ψ) dξ
= ∫ p(x | ξ, Λ, τ, Ψ) p(ξ | κ, Φ) dξ
= ∏_{i=1}^{n} ∫ p(x_i | ξ_i, Λ, τ, Ψ) p(ξ_i | κ, Φ) dξ_i. (9.29)
Given normality assumptions about the conditional distribution of the data (the first term
in the second and third lines of (9.29)) and of the prior distribution for the latent variables
(the second term on these lines), the marginal distribution that results from integrating ξ_i
out is also normal:
x_i | κ, Φ, Λ, τ, Ψ ~ N(τ + Λκ, ΛΦΛ′ + Ψ). (9.30)
The resulting joint conditional distribution of all the observables on the left-hand side of
(9.29) is the collection of i.i.d. normal x_i with mean vector given by (τ + Λκ) and covariance
matrix given by (ΛΦΛ′ + Ψ). Thus, we see from (9.30) that the conventional frequentist
description of CFA in (9.5) and (9.6) may be derived from a probabilistic approach that
integrates the latent variables out. The result expresses the dependence of the observables
on the parameters that characterize the distribution of the latent variables, κ and Φ, as well
as the measurement model parameters, Λ, τ, and Ψ. Once values for the observables are
known, the conditional distribution in (9.30) induces a likelihood function for these
parameters. It is this likelihood that yields the (normal-theory) ML fit function that is optimized
in ML estimation of CFA models (e.g., Bollen, 1989).
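The marginalization in (9.29) and (9.30) can be checked by simulation: generating data in two stages (latent variables first, then observables given the latent variables) should reproduce the marginal mean τ + Λκ and covariance ΛΦΛ′ + Ψ. A sketch in Python with arbitrary made-up parameter values (not the book's example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions and parameter values (illustrative only)
n, J, M = 200_000, 5, 2
Lam = np.array([[1.0, 0.0], [0.8, 0.0], [0.5, 0.0], [0.0, 1.0], [0.0, 0.9]])  # loadings
tau = np.array([3.3, 3.9, 4.6, 3.0, 3.7])        # intercepts
Psi = np.diag([0.35, 0.18, 0.18, 0.35, 0.25])    # error covariance matrix (diagonal)
kappa = np.zeros(M)                               # latent means
Phi = np.array([[0.4, 0.2], [0.2, 0.5]])          # latent covariance matrix

# Two-stage generative scheme: xi_i ~ N(kappa, Phi), then x_i | xi_i ~ N(tau + Lam xi_i, Psi)
xi = rng.multivariate_normal(kappa, Phi, size=n)
x = tau + xi @ Lam.T + rng.multivariate_normal(np.zeros(J), Psi, size=n)

# Marginal moments implied by (9.30)
mean_implied = tau + Lam @ kappa
cov_implied = Lam @ Phi @ Lam.T + Psi

print(np.max(np.abs(x.mean(axis=0) - mean_implied)))          # small Monte Carlo error
print(np.max(np.abs(np.cov(x, rowvar=False) - cov_implied)))  # small Monte Carlo error
```

With a large simulated sample, the empirical mean and covariance of the observables match the implied marginal moments up to Monte Carlo error, without the latent variables appearing anywhere in the summaries.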
Of course, that a multivariate normal specification for a prior for the latent variables
yields a convenient marginal distribution might not have quite the same sway in a fully
Bayesian analysis as it might in a frequentist analysis. However, the use of a normal distri-
bution does aid the computational tractability of a fully Bayesian analysis (Section 9.2.4),
it is a distributional form that many analysts are familiar with, and it has a long track
record of successful use in the analysis of data in psychometric contexts generally and
specifically for continuous latent variables. What's more, like any other feature, we may
critically evaluate it using follow-up analyses such as those described in Chapter 10.
ξ_i | κ, Φ ~ N(κ, Φ) for i = 1, …, n,
κ_m ~ N(μ_κ, σ²_κ) for m = 1, …, M,
Φ ~ Inv-Wishart(Φ_0, d),
τ_j ~ N(μ_τ, σ²_τ) for j = 1, …, J,
λ_jm ~ N(μ_λ, σ²_λ) for j = 1, …, J, m = 1, …, M,
and
ψ_jj ~ Inv-Gamma(ν_0/2, ν_0ψ_0/2) for j = 1, …, J. (9.31)
Figure 9.2 contains the DAG for the model, expressing the joint distribution of all the
entities. We emphasize the correspondence between the DAG and elements of (9.31). Taken as a
whole, the model is fairly complex, relative to those covered previously in the book. However,
a modular approach recognizes the use of familiar structures developed earlier: (1) normal-
theory regression-like structures for the observables as dependent on the latent variables and
measurement model parameters, (2) associated prior distributions for those regression-like
structures, (3) a normal model for the latent variables, (4) a normal prior on the means, and (5)
an inverse-Wishart as a generalization of the inverse-gamma for the covariance matrix.
FIGURE 9.2
Directed acyclic graph for a confirmatory factor analysis model supporting multiple observables loading on
multiple latent variables.
λ_jA ~ N(μ_λA, Σ_λA). (9.32)
ξ_i | κ, Φ, Λ, τ, Ψ, x_i ~ N(μ_{ξ_i | κ, Φ, Λ, τ, Ψ, x_i}, Σ_{ξ_i | κ, Φ, Λ, τ, Ψ, x_i}), (9.33)
* We suppress the role of specified hyperparameters in this notation; see Appendix A for a presentation that
formally includes the hyperparameters.
where
μ_{ξ_i | κ, Φ, Λ, τ, Ψ, x_i} = Σ_{ξ_i | κ, Φ, Λ, τ, Ψ, x_i} (Φ⁻¹κ + Λ′Ψ⁻¹(x_i − τ)),
Σ_{ξ_i | κ, Φ, Λ, τ, Ψ, x_i} = (Φ⁻¹ + Λ′Ψ⁻¹Λ)⁻¹,
and x_i = (x_i1, …, x_iJ)′ is the (J × 1) vector of J observed values from examinee i. The subscript
of μ_{ξ_i | κ, Φ, Λ, τ, Ψ, x_i} in (9.33) is rather unwieldy, but by writing it out fully we explicitly indicate
what the mean vector and covariance matrix in the full conditional for ξ_i do and do not
depend on. They do depend on the parameters κ, Φ, Λ, τ, and Ψ. They do depend
on examinee i's vector of observed scores, x_i. Given these variables, however, they do not
depend on the latent variables or observed scores of other examinees, ξ_i* and x_i* for i* ≠ i.
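The two formulas above are the standard multivariate normal-normal conjugacy result and can be wrapped in a short helper; a sketch (the function name and the numerical values in the check are ours, not the book's):

```python
import numpy as np

def full_conditional_xi(x_i, kappa, Phi, Lam, tau, Psi):
    """Mean and covariance of the full conditional for examinee i's latent
    variables, from multivariate normal conjugacy (cf. (9.33)):
      Sigma = (Phi^-1 + Lam' Psi^-1 Lam)^-1
      mu    = Sigma (Phi^-1 kappa + Lam' Psi^-1 (x_i - tau))
    """
    Phi_inv = np.linalg.inv(Phi)
    Psi_inv = np.linalg.inv(Psi)
    Sigma = np.linalg.inv(Phi_inv + Lam.T @ Psi_inv @ Lam)
    mu = Sigma @ (Phi_inv @ kappa + Lam.T @ Psi_inv @ (x_i - tau))
    return mu, Sigma

# A one-latent-variable, one-observable check with made-up values:
# prior xi ~ N(0, 1), x | xi ~ N(xi, 1), observed x = 2
mu, Sigma = full_conditional_xi(np.array([2.0]), np.array([0.0]),
                                np.array([[1.0]]), np.array([[1.0]]),
                                np.array([0.0]), np.array([[1.0]]))
print(mu, Sigma)   # posterior mean 1.0, variance 0.5
```

The scalar check illustrates the precision-weighting at work: with equal prior and data precisions, the full conditional mean sits halfway between the prior mean (0) and the observation (2), with variance cut in half.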
The same scheme applies to interpreting the subscripts in the rest of the full conditional
distributions that follow. The full conditional for the mean vector for the latent variables is
κ | ξ, Φ ~ N(μ_{κ | ξ, Φ}, Σ_{κ | ξ, Φ}), (9.34)
where
μ_{κ | ξ, Φ} = (Σ_κ⁻¹ + nΦ⁻¹)⁻¹ (Σ_κ⁻¹μ_κ + nΦ⁻¹ξ̄),
Σ_{κ | ξ, Φ} = (Σ_κ⁻¹ + nΦ⁻¹)⁻¹,
Σ_κ is the prior covariance matrix for κ (diagonal, with elements σ²_κ), and ξ̄ is the (M × 1) vector of means of the treated-as-known latent variables over the n
examinees. The full conditional for the covariance matrix for the latent variables is
Φ | ξ, κ ~ Inv-Wishart(dΦ_0 + nS, d + n), (9.35)
where
S = (1/n) ∑_{i=1}^{n} (ξ_i − κ)(ξ_i − κ)′.
Turning to the measurement model parameters, we present the full conditionals for any
observable j. The same structure applies to all the observables. The full conditional for the
augmented loadings for observable j is
λ_jA | ξ, ψ_jj, x_j ~ N(μ_{λ_jA | ξ, ψ_jj, x_j}, Σ_{λ_jA | ξ, ψ_jj, x_j}), (9.36)
where
μ_{λ_jA | ξ, ψ_jj, x_j} = (Σ_λA⁻¹ + (1/ψ_jj) ξ_A′ξ_A)⁻¹ (Σ_λA⁻¹μ_λA + (1/ψ_jj) ξ_A′x_j),
Σ_{λ_jA | ξ, ψ_jj, x_j} = (Σ_λA⁻¹ + (1/ψ_jj) ξ_A′ξ_A)⁻¹,
ξ_A is the matrix of latent variables augmented by a column of 1s (so that the intercept is
handled alongside the loadings), and x_j = (x_1j, …, x_nj)′ is the (n × 1) vector of observed
values for the n examinees for observable j.
The full conditional distribution for the error variance for observable j is
ψ_jj | ξ, λ_jA, x_j ~ Inv-Gamma((ν_0 + n)/2, (ν_0ψ_0 + SS(E_j))/2), (9.37)
where
SS(E_j) = (x_j − ξ_A λ_jA)′(x_j − ξ_A λ_jA)
and x_j = (x_1j, …, x_nj)′ is the (n × 1) vector of observed values for the n examinees for
observable j. A Gibbs sampler is constructed by iteratively drawing from these full
conditional distributions using the just-drawn values for the conditioned parameters,
generically written below (for iteration t + 1).
and
Σ^(t+1)_{ξ_i | κ^(t), Φ^(t), Λ^(t), τ^(t), Ψ^(t), x_i} = [(Φ^(t))⁻¹ + Λ^(t)′(Ψ^(t))⁻¹Λ^(t)]⁻¹,
where x_i = (x_i1, …, x_iJ)′ is the (J × 1) vector of observed values for examinee i (i.e., the
ith row of x). Sample a value for the latent variables from
ξ_i^(t+1) ~ N(μ^(t+1)_{ξ_i | κ^(t), Φ^(t), Λ^(t), τ^(t), Ψ^(t), x_i}, Σ^(t+1)_{ξ_i | κ^(t), Φ^(t), Λ^(t), τ^(t), Ψ^(t), x_i}).
b. Using the just-sampled values for the latent variables and the means of the
latent variables (from iteration t + 1), compute
S^(t+1) = (1/n) ∑_{i=1}^{n} (ξ_i^(t+1) − κ^(t+1))(ξ_i^(t+1) − κ^(t+1))′.
Sample a value for the covariance matrix for the latent variables from
Φ^(t+1) | ξ^(t+1), κ^(t+1) ~ Inv-Wishart(dΦ_0 + nS^(t+1), d + n).
b. For each observable, j = 1, …, J, using the just-sampled values for the latent
variables and the augmented vector of loadings (from iteration t + 1), compute
SS(E_j^(t+1)) = (x_j − ξ_A^(t+1) λ_jA^(t+1))′(x_j − ξ_A^(t+1) λ_jA^(t+1)),
where x_j = (x_1j, …, x_nj)′ is the (n × 1) vector of observed values for observable j
(i.e., the jth column of x). Sample a value for the error variance from
ψ_jj^(t+1) | ξ^(t+1), λ_jA^(t+1), x_j ~ Inv-Gamma((ν_0 + n)/2, (ν_0ψ_0 + SS(E_j^(t+1)))/2).
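The steps above can be assembled into a working sampler. The sketch below implements the Gibbs scheme for a one-factor special case (M = 1), with the latent mean fixed at 0 and the first loading fixed at 1 to resolve indeterminacies, and with inverse-gamma priors on the variances. It is written in Python rather than WinBUGS, on simulated data with made-up generating values; all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# --- Simulate data from a one-factor model (hypothetical true values) ---
n, J = 500, 5
lam_true = np.array([1.0, 0.7, 0.4, 1.05, 0.95])   # first loading fixed to 1
tau_true = np.array([3.3, 3.9, 4.6, 3.0, 3.7])
psi_true = np.array([0.37, 0.18, 0.18, 0.38, 0.27])
phi_true = 0.45
xi_true = rng.normal(0.0, np.sqrt(phi_true), size=n)
x = tau_true + np.outer(xi_true, lam_true) + rng.normal(0, np.sqrt(psi_true), size=(n, J))

def inv_gamma(rng, a, b):
    # Draw from Inv-Gamma(a, b) as the reciprocal of a Gamma(a, rate=b) draw
    return 1.0 / rng.gamma(a, 1.0 / b)

# Hyperparameters loosely echoing the priors used later in the chapter (an assumption):
# tau_j ~ N(3, 10), lambda_j ~ N(1, 10) for j >= 2, psi_jj ~ Inv-Gamma(5, 10), phi ~ Inv-Gamma(5, 10)
mu_t, s2_t, mu_l, s2_l, a0, b0 = 3.0, 10.0, 1.0, 10.0, 5.0, 10.0

# --- Initial values ---
lam = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
tau = np.full(J, 0.1)
psi = np.ones(J)
phi = 1.0

T, burn = 2000, 500
draws = {"lam": np.zeros((T, J)), "phi": np.zeros(T)}

for t in range(T):
    # 1. Full conditional for each xi_i (normal; cf. (9.33), with kappa fixed at 0)
    v = 1.0 / (1.0 / phi + np.sum(lam**2 / psi))
    m = v * ((x - tau) @ (lam / psi))
    xi = rng.normal(m, np.sqrt(v))
    # 2. Full conditional for phi (inverse-gamma; the M = 1 analogue of (9.35))
    phi = inv_gamma(rng, a0 + n / 2, b0 + np.sum(xi**2) / 2)
    # 3. Full conditionals for the measurement model parameters, observable by observable
    for j in range(J):
        if j == 0:
            # loading fixed to 1: only the intercept is sampled
            p = 1.0 / s2_t + n / psi[j]
            mj = (mu_t / s2_t + np.sum(x[:, j] - xi) / psi[j]) / p
            tau[j] = rng.normal(mj, np.sqrt(1.0 / p))
        else:
            # joint draw of (tau_j, lambda_j) using the augmented design [1, xi]; cf. (9.36)
            Xa = np.column_stack([np.ones(n), xi])
            S0inv = np.diag([1.0 / s2_t, 1.0 / s2_l])
            V = np.linalg.inv(S0inv + Xa.T @ Xa / psi[j])
            mj = V @ (S0inv @ np.array([mu_t, mu_l]) + Xa.T @ x[:, j] / psi[j])
            tau[j], lam[j] = rng.multivariate_normal(mj, V)
        resid = x[:, j] - tau[j] - lam[j] * xi                       # cf. SS(E_j) in (9.37)
        psi[j] = inv_gamma(rng, a0 + n / 2, b0 + np.sum(resid**2) / 2)
    draws["lam"][t] = lam
    draws["phi"][t] = phi

print(draws["lam"][burn:].mean(axis=0))   # posterior means near the generating loadings
print(draws["phi"][burn:].mean())         # posterior mean near the generating phi
```

Each pass draws every unknown from its full conditional given the just-drawn values of the others, exactly the cycling described in the text; after burn-in, the retained draws empirically approximate the joint posterior.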
Note that the joint posterior distribution includes all unknowns, including the latent
variables. As a result, a Gibbs sampler will yield draws for the latent variables from the
posterior distribution. When working within a probability model, Bayes' theorem provides a
mechanism for arriving at a representation of the examinees' latent variables, be it scoring
using point estimates for other parameters (Bartholomew, 1981, 1996) or in a fully Bayesian
analysis such as that described here (see also Aitkin & Aitkin, 2005). Within the probability
model, the latent variables are unknowns, and Bayes' theorem provides the mechanism
for yielding a posterior distribution for all the unknowns. Mechanically, this shows up in
the Gibbs sampler taking draws for the latent variables for each examinee from the full
conditional. For each examinee, the collection of the draws is an empirical approximation
to the marginal distribution for their latent variables, which may be summarized in the
usual ways (e.g., density plots, point and interval summaries, and standard deviations).
Terenzini, 1980). The scale measures students' collegiate experiences with and perceptions
of peers, faculty, intellectual growth, and academic goals. It contains 31 five-point
Likert-type items with rating categories of strongly disagree, disagree, neutral, agree, and
strongly agree, coded as integers from 1 to 5. The items are organized into five subscales:
Peer Interaction, Faculty Interaction, Academic and Intellectual Development, Faculty Concern,
and Institutional Goal Commitment. The current analysis is based on a subset of 500
examinees from the sample analyzed in Levy (2011). For each examinee, scores on the subscales
are obtained by averaging the scores for the items in that subscale. Summary statistics for
the subscales are given in Table 9.1.
In this example we pursue a model with a single latent variable (factor, i.e., M = 1) for the
five observable subscale scores. In Section 9.4, we pursue a model with two latent variables
(French & Oakes, 2004; Mitchell, 2009), which is but one of several possible alternatives.*
The path diagram for the model is depicted in Figure 9.1a. For simplicity we can drop the
subscripting associated with multiple latent variables, denoting the latent variable for an
examinee by ξ_i, the mean of the latent variable by κ, and the variance of the latent variable
by φ. In the current illustration, we resolve the indeterminacies by fixing the latent
variable mean κ to be 0 and fixing the loading for the first observable (for Peer Interaction) λ_1
to be 1.
TABLE 9.1
Summary Statistics for the Institutional Integration Scale Subscales Data,
Including Means, Standard Deviations, and the Correlations
Subscale              PI     AD     IGC    FI     FC
IGC                   .41    .64
FI                    .56    .56    .31
FC                    .54    .63    .46    .67
Mean                  3.33   3.90   4.60   3.03   3.71
Standard Deviation    0.83   0.60   0.46   0.89   0.79
PI = Peer Interaction; AD = Academic and Intellectual Development; IGC = Institutional
Goal Commitment; FI = Faculty Interaction; FC = Faculty Concern.
* One set of alternatives analyzes the individual items rather than the subscale aggregates, and we pursue an
example analyzing the items of one subscale using the tools of item response theory in Section 11.4. This could
be extended, for instance, to simultaneously modeling all the items, using five latent variables (one for every
subscale) using the tools of multidimensional item response theory discussed in Section 11.5.
FIGURE 9.3
Two representations for the model with one latent variable with fixed latent variable mean and fixed loading. (a)
Exact directed acyclic graph representation. (b) Inexact but less cluttered representation, where the shaded box
indicates that there is a subset of the larger plate for j with a common structure.
λ_1 is now specified differently than the remaining loadings, and so resides outside the
plate for observables, which is now amended to start at j = 2. What's more, the intercept
(τ_1) and error variance (ψ_11) for the first observable must now lie outside the plate even
though they are assigned the same prior distributions as the remaining intercepts and
error variances. This is exact but somewhat unfortunate, as it produces some additional
clutter at the bottom of the graph. On the positive side, it makes clearer that there
is something importantly different going on with the loading for the first observable, but
not the intercept or error variance.
A visually simpler representation is given in Figure 9.3b. Here, there is a single plate for
all observables j = 1, …, J indicating the basic structure. The shaded plates inside that plate
also define values for the observables indexed by j. These indicate that the first loading is
fixed to a specified value and the loadings for the remaining observables are unknown
and follow a prior distribution governed by μ_λ and σ²_λ. Though technically inexact and not
a representation that directly corresponds to the probability distribution, this type of
presentation is somewhat more elegant in communicating the big idea. Generally speaking,
it combats the tendency for DAGs to get increasingly messy as additional restrictions are
placed on certain parameters.
p(ξ, λ, τ, ψ, φ | x) ∝ ∏_{i=1}^{n} [ ∏_{j=1}^{J} p(x_ij | ξ_i, λ_j, τ_j, ψ_jj) ] p(ξ_i | κ, φ) · p(φ) ∏_{j=1}^{J} p(τ_j) p(ψ_jj) ∏_{j=2}^{J} p(λ_j),
where
x_ij | ξ_i, τ_j, λ_j, ψ_jj ~ N(τ_j + λ_j ξ_i, ψ_jj) for i = 1, …, 500, j = 1, …, 5, (9.38)
ξ_i | κ, φ ~ N(κ, φ) for i = 1, …, 500,
κ = 0,
φ ~ Inv-Gamma(5, 10),
9.3.3 WinBUGS
WinBUGS code for the model and list statements for three sets of initial values are given
as follows.
-------------------------------------------------------------------------
#########################################################################
# Model Syntax
#########################################################################
model{
#########################################################################
# Specify the factor analysis measurement model for the observables
#########################################################################
for (i in 1:n){
for(j in 1:J){
mu[i,j] <- tau[j] + ksi[i]*lambda[j]
x[i,j] ~ dnorm(mu[i,j], inv.psi[j])
}
}
#########################################################################
# Specify the prior distribution for the latent variables
#########################################################################
for (i in 1:n){
ksi[i] ~ dnorm(kappa, inv.phi)
}
#########################################################################
# Specify the prior distribution for the parameters that govern
# the latent variables
#########################################################################
kappa <- 0 # Mean of factor
inv.phi ~ dgamma(5, 10) # Precision of factor
phi <- 1/inv.phi # Variance of factor
#########################################################################
# Specify the prior distribution for the measurement model parameters
#########################################################################
for(j in 1:J){
tau[j] ~ dnorm(3, .1) # Intercepts for observables
inv.psi[j] ~ dgamma(5, 10) # Precisions for observables
psi[j] <- 1/inv.psi[j] # Variances for observables
}
lambda[1] <- 1.0 # Loading for the first observable fixed to 1
for(j in 2:J){
lambda[j] ~ dnorm(1, .1) # Priors for the remaining loadings
}
#########################################################################
# Initial values for three different chains
#########################################################################
list(tau=c(.1, .1, .1, .1, .1), lambda=c(NA, 0, 0, 0, 0), inv.phi=1,
inv.psi=c(1, 1, 1, 1, 1))
A few comments about the code are warranted. First, note that the error variances are
defined in the code as psi[j]. The use of a single index of j here contrasts with the
double subscripting in ψ_jj. This double subscripting reflects that the error variances are
the diagonal of the (J × J) error covariance matrix Ψ. A matrix representation is useful
when error covariances on the off-diagonal are included, as discussed in Section 9.8. In
the current setting where Ψ is diagonal, we can conceive of the diagonal as being a vector
of error variances. This simplifies the coding of the model and the inputting of initial
values in WinBUGS. Second, we call out the use of NA for the first loading in the list
statements for initial values. The first loading is fixed to 1 and therefore not a stochastic
node. As a result, no initial value (not even 1) can be supplied to WinBUGS for this
node. Finally, the three sets of initial values were specified to contain values that were
anticipated to represent values for the parameters that would be fairly dispersed in the
posterior distribution.*
9.3.4 Results
The model was fit in WinBUGS using three chains with these start values for the
intercepts, loadings, error precisions, and the precision of the latent variable. The start values
for the latent variables were generated by WinBUGS. Based on convergence diagnostics
(Section 5.7.2), there was strong evidence of convergence within a few iterations. To be
conservative, 500 iterations were discarded as burn-in, and 5000 subsequent iterations from
each chain were used, totaling 15,000 draws used to empirically approximate the posterior
distribution. The marginal densities were unimodal and fairly symmetric. Summary
statistics are given in Table 9.2.
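The 95% highest posterior density intervals reported in the table can be computed from the retained draws with a short helper: the HPD interval is the shortest interval containing the stated proportion of the draws. A sketch (the `hpd_interval` name and the illustrative draws are ours, not the book's):

```python
import numpy as np

def hpd_interval(draws, prob=0.95):
    """Shortest interval containing `prob` of the draws: an empirical
    highest posterior density interval for a unimodal marginal."""
    d = np.sort(np.asarray(draws))
    n = len(d)
    k = int(np.ceil(prob * n))
    widths = d[k - 1:] - d[:n - k + 1]   # widths of all intervals holding k draws
    i = np.argmin(widths)                # the narrowest such interval
    return d[i], d[i + k - 1]

# Illustration with 15,000 draws from a known normal (not the chapter's output):
rng = np.random.default_rng(0)
lo, hi = hpd_interval(rng.normal(0.73, 0.04, size=15_000))
print(round(lo, 2), round(hi, 2))
```

For a symmetric unimodal marginal the HPD interval essentially coincides with the central interval; for skewed marginals (e.g., variances) the two can differ, which is why HPD summaries are often preferred.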
The loadings for Faculty Interaction and Faculty Concern are comparable to each other
and the fixed loading for Peer Interaction, and the error variances for these variables are
TABLE 9.2
Summary of the Posterior Distribution for the Single Latent Variable Model,
Including Summaries for Two Examinees' Latent Variables (ξ_1 and ξ_8)
Parameter Mean Standard Deviation 95% Highest Posterior Density Interval
λ_PI 1.00
λ_AD 0.73 0.04 (0.64, 0.82)
λ_IGC 0.42 0.04 (0.35, 0.49)
λ_FI 1.05 0.07 (0.92, 1.18)
λ_FC 0.98 0.06 (0.87, 1.10)
τ_PI 3.33 0.04 (3.25, 3.41)
τ_AD 3.90 0.03 (3.84, 3.95)
τ_IGC 4.60 0.02 (4.55, 4.64)
τ_FI 3.03 0.04 (2.95, 3.12)
τ_FC 3.71 0.04 (3.65, 3.79)
ψ_PI 0.37 0.03 (0.32, 0.43)
ψ_AD 0.18 0.01 (0.16, 0.21)
ψ_IGC 0.18 0.01 (0.16, 0.20)
ψ_FI 0.38 0.03 (0.32, 0.44)
ψ_FC 0.27 0.02 (0.22, 0.31)
φ 0.44 0.04 (0.35, 0.52)
ξ_1 -0.20 0.26 (-0.70, 0.30)
ξ_8 0.86 0.26 (0.35, 1.38)
* The data statement is not shown here for space considerations. Like all the other code used in the examples, it
is available from the website for the book.
comparable, suggesting that the observables have comparable relations to the latent
variable. The loading for Academic and Intellectual Development is a bit lower, as is its error
variance. Most notably, the loading for Institutional Goal Commitment is quite a bit lower
than the others, suggesting it has a weaker relationship to the latent variable. Table 9.2
includes results for two examinees, where it is easily seen that examinee 8 has a higher
standing on the latent variable than examinee 1. These results do not speak to the viability
of this model for this data, and we defer a treatment of techniques for evaluating models
until Chapter 10, where this example forms the basis for discussing a variety of techniques
for criticizing and comparing models.
p(ξ, λ, τ, ψ, Φ | x) ∝ ∏_{i=1}^{n} [ ∏_{j=1}^{J} p(x_ij | ξ_i, λ_j, τ_j, ψ_jj) ] p(ξ_i | κ, Φ) · p(Φ) ∏_{j=1}^{J} p(τ_j) p(ψ_jj) p(λ_21) p(λ_31) p(λ_52),
FIGURE 9.4
An inexact directed acyclic graph representation for the model with two latent variables with fixed latent
variable means and fixed loadings.
where
x_ij | ξ_i, λ_j, τ_j, ψ_jj ~ N(τ_j + λ_j1 ξ_i1, ψ_jj) for i = 1, …, 500, j = 1, 2, 3,
9.4.3 WinBUGS
WinBUGS code for the model and list statements for three sets of initial values is given as
follows.
-------------------------------------------------------------------------
#########################################################################
# Model Syntax
#########################################################################
model{
#########################################################################
# Specify the factor analysis measurement model for the observables
#########################################################################
for (i in 1:n){
for(j in 1:3){
mu[i,j] <- tau[j] + ksi[i,1]*lambda[j,1] # Observables 1-3 load on factor 1
}
for(j in 4:5){
mu[i,j] <- tau[j] + ksi[i,2]*lambda[j,2] # Observables 4-5 load on factor 2
}
for(j in 1:J){
x[i,j] ~ dnorm(mu[i,j], inv.psi[j])
}
}
#########################################################################
# Specify the prior distribution for the latent variables
#########################################################################
for (i in 1:n){
ksi[i, 1:M] ~ dmnorm(kappa[], inv.phi[,])
}
#########################################################################
# Specify the prior distribution for the parameters that govern
# the latent variables
#########################################################################
for (m in 1:M){
kappa[m] <- 0 # Means of factors
}

phi.0[1,1] <- 1
phi.0[1,2] <- .3
phi.0[2,1] <- .3
phi.0[2,2] <- 1
d <- 2

for (m in 1:M){
for (mm in 1:M){
dxphi.0[m,mm] <- d*phi.0[m,mm]
}
}

inv.phi[1:M,1:M] ~ dwish(dxphi.0[,], d) # Precision matrix of factors
phi[1:M,1:M] <- inverse(inv.phi[,]) # Covariance matrix of factors
#########################################################################
# Specify the prior distribution for the measurement model parameters
#########################################################################
for(j in 1:J){
tau[j] ~ dnorm(3, .1) # Intercepts for observables
inv.psi[j] ~ dgamma(5, 10) # Precisions for observables
psi[j] <- 1/inv.psi[j] # Variances for observables
}
lambda[1,1] <- 1 # Loading fixed to 1 for factor 1
for (j in 2:3){
lambda[j,1] ~ dnorm(1, .1) # Prior for the loadings
}
lambda[4,2] <- 1 # Loading fixed to 1 for factor 2
lambda[5,2] ~ dnorm(1, .1) # Prior for the loadings
#########################################################################
# Initial values for three different chains
#########################################################################
list(tau=c(.1, .1, .1, .1, .1), lambda= structure(.Data= c( NA, NA, 2,
NA, 2, NA, NA, NA, NA, 2), .Dim=c(5, 2)), inv.phi= structure(.Data=
c(1,0, 0, 1), .Dim=c(2, 2)), inv.psi=c(1, 1, 1, 1, 1))
Much of the code mimics code introduced earlier, including the specification of initial
values for the chains that were anticipated to represent values for the parameters that
would be fairly dispersed in the posterior distribution. We concentrate on what is new
here, specifically the portion of the code that specifies the prior distribution for the
parameters that govern the latent variables. The first section of that code specifies that the means
of the latent variables are all 0 by means of a loop over the latent variables. The loop goes
from 1 up to M, the number of latent variables. The value of M could be supplied in the
code, but in this example it is part of our data statement, much like the other constants
n and J. The rest of this section pertains to the covariance matrix of the latent variables.
The node inv.phi is the precision matrix, which is what WinBUGS
requires for specifying the multivariate normal distribution, in this case for the latent variables.
The specification of the node phi uses the inverse function in WinBUGS to transform the
precision matrix into the covariance matrix.
The prior for the precision matrix (inv.phi) is a Wishart distribution; the rest of the
code in this section details the hyperparameters of that prior distribution. First, we define
a matrix phi.0, corresponding to Φ_0, by plugging in a value for each element in the
matrix. Next we define d, corresponding to d. The next portion of code loops over the
number of latent variables twice, which defines a particular row and column of the covariance
matrix, and for each element multiplies the corresponding element of Φ_0 by d, and places
it in a corresponding place in a node dxphi.0. The double looping effectively carries out
the multiplication of the matrix Φ_0 by the scalar d. This process is conducted because the
result is used in the WinBUGS parameterization of the Wishart distribution (Lunn et al., 2013).
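The effect of passing dΦ_0 to dwish can be checked numerically. Under the Wishart prior WinBUGS places on the precision matrix, the prior mean of the precision is d(dΦ_0)⁻¹ = Φ_0⁻¹, so the prior is centered so that the implied covariance matrix is near Φ_0. A numpy sketch (using the sum-of-outer-products construction of a Wishart draw, valid for integer degrees of freedom, to avoid extra dependencies):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hyperparameter choices from the two-factor example's code
Phi0 = np.array([[1.0, 0.3], [0.3, 1.0]])
d = 2
M = Phi0.shape[0]

# WinBUGS-style dwish(R, k) puts a Wishart prior on the precision with
# E[precision] = k R^-1; here R = d*Phi0 and k = d, so E[precision] = Phi0^-1.
V = np.linalg.inv(d * Phi0)      # scale matrix in the standard parameterization
L = np.linalg.cholesky(V)

# For integer df >= M, a Wishart(df, V) draw is a sum of df outer products
# of N(0, V) vectors; draw many and average the precision draws
R_draws = 50_000
Z = rng.standard_normal((R_draws, d, M)) @ L.T    # R_draws blocks of d N(0, V) rows
prec = np.einsum('rdm,rdn->rmn', Z, Z)            # Wishart(d, V) precision draws
mean_prec = prec.mean(axis=0)

print(mean_prec)            # close to inv(Phi0)
```

The Monte Carlo mean of the precision draws lands near Φ_0⁻¹, confirming that scaling the passed matrix by d is what centers the prior at the analyst's chosen Φ_0.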
9.4.4 Results
The model was fit in WinBUGS using three chains with these start values for the
intercepts, loadings, error precisions, and the precision matrix of the latent variables. The start values
for the latent variables were generated by WinBUGS. Based on convergence diagnostics
(Section 5.7.2), there was strong evidence of fast convergence. To be conservative, 500
iterations were discarded as burn-in, and 5000 subsequent iterations from each chain were used,
totaling 15,000 draws used to empirically approximate the posterior distribution. Density
plots for the marginal distributions were unimodal and fairly symmetric. Summary
statistics for the marginal posterior distributions are given in Table 9.3.
p(x̄, S | κ, Φ, Λ, τ, Ψ) = p(x̄ | κ, Φ, Λ, τ, Ψ) p(S | Φ, Λ, Ψ), (9.39)
where
x̄ ~ N(τ + Λκ, (1/n)(ΛΦΛ′ + Ψ)) (9.40)
and
S ~ Wishart(ΛΦΛ′ + Ψ, n − 1). (9.41)
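The sampling distributions in (9.39) and (9.40) can be verified by simulation: repeatedly drawing samples of size n from the model and collecting the summary statistics should reproduce the stated mean and covariance behavior. A sketch with made-up parameter values (not the book's):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical parameter values (illustrative only)
n, J, M = 50, 4, 1
Lam = np.array([[1.0], [0.8], [0.6], [0.9]])
tau = np.array([3.0, 4.0, 4.5, 3.5])
Psi = np.diag([0.3, 0.2, 0.2, 0.3])
kappa = np.zeros(M)
Phi = np.array([[0.5]])
Sigma = Lam @ Phi @ Lam.T + Psi            # implied covariance matrix

# Repeatedly draw samples of size n and collect (x_bar, S)
R = 10_000
xbars = np.zeros((R, J))
Ss = np.zeros((R, J, J))
for r in range(R):
    xi = rng.multivariate_normal(kappa, Phi, size=n)
    x = tau + xi @ Lam.T + rng.multivariate_normal(np.zeros(J), Psi, size=n)
    xbars[r] = x.mean(axis=0)
    Ss[r] = np.cov(x, rowvar=False)        # unbiased sample covariance S

# x_bar varies around tau + Lam kappa with covariance Sigma / n,
# and S varies around Sigma (its Wishart-based expectation)
print(np.max(np.abs(np.cov(xbars, rowvar=False) - Sigma / n)))
print(np.max(np.abs(Ss.mean(axis=0) - Sigma)))
```

Both summaries behave as the summary-level likelihood asserts: the variability of the mean vector shrinks with 1/n, while S is centered on the implied covariance matrix regardless of n.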
TABLE 9.3
Summary of the Posterior Distribution for the Two Latent Variable Models,
Including Summaries for Two Examinees' Latent Variables (ξ_{1,1}, ξ_{1,2}, ξ_{8,1},
and ξ_{8,2})
Parameter Mean Standard Deviation 95% Highest Posterior Density Interval
λ_PI 1.00
λ_AD 0.77 0.05 (0.67, 0.88)
λ_IGC 0.46 0.04 (0.38, 0.54)
λ_FI 1.00
λ_FC 0.92 0.05 (0.81, 1.02)
τ_PI 3.33 0.04 (3.26, 3.41)
τ_AD 3.90 0.03 (3.84, 3.95)
τ_IGC 4.60 0.02 (4.55, 4.64)
τ_FI 3.04 0.04 (2.96, 3.12)
τ_FC 3.71 0.04 (3.64, 3.78)
ψ_PI 0.36 0.03 (0.31, 0.42)
ψ_AD 0.17 0.01 (0.14, 0.20)
ψ_IGC 0.17 0.01 (0.15, 0.20)
ψ_FI 0.34 0.03 (0.28, 0.40)
ψ_FC 0.24 0.02 (0.20, 0.28)
φ_11 0.39 0.04 (0.30, 0.47)
φ_22 0.49 0.05 (0.39, 0.59)
φ_21 0.38 0.04 (0.31, 0.45)
ξ_{1,1} -0.15 0.27 (-0.70, 0.38)
ξ_{1,2} -0.31 0.30 (-0.90, 0.27)
ξ_{8,1} 0.88 0.27 (0.36, 1.43)
ξ_{8,2} 0.84 0.30 (0.26, 1.45)
where the second term on the right-hand side is the prior distribution, which may be
specified in the same manner as described in Section 9.2.2.
This approach using summary statistics has a number of advantages over the
individual-level approach. It can be employed when summary statistics are available but the raw
data are not (e.g., in a secondary analysis of published summary statistics; Hayashi & Arav,
2006). Note that in contrast to the individual-level approach, the specification here does not
involve the examinees' values of the latent variables, as they are marginalized over in
construction of the conditional distribution of the data. This can be advantageous in a
simulation-based estimation environment, such as Markov chain Monte Carlo (MCMC), where
the computational burden and time needed increase with each additional parameter to
be estimated. Because the summary-level approach takes the mean vector and covariance
matrix as the data, the number of parameters and the associated computational burden
do not increase as sample size increases. In contrast, in the individual-level approach,
each additional examinee brings with them another M latent variables. More concretely,
in Gibbs sampling a full conditional distribution must be defined and sampled from
for each examinees latent variables at each iteration. Thus for the same model, the time
needed to fit a model using the summary-level approach is likely to be less than in the
individual-level approach, especially in large samples with many latent variables (Choi,
Levy, & Hancock, 2006). The estimation-time advantage of the summary-level approach
is likely to be diminished in the future as technological advances increase the
computational power available to analysts. A related point is that the full conditional
distributions for the examinee latent variables are normal distributions, and relatively easy to
sample from. In the authors' experience, it is usually not prohibitively time-consuming to
analyze individual-level models in WinBUGS. And in situations where it is prohibitively
time-consuming, the analyst may turn to other software that is faster than WinBUGS
for CFA and related models, such as Mplus (Muthén & Muthén, 1998–2012) and Amos
(Arbuckle, 2007).
However, the summary-level approach suffers relative to the individual-level approach
in several respects. It does not yield a posterior distribution for the values of the
examinees' latent variables. If inferences about the examinees' latent variables are desired,
auxiliary analyses are required. Options here include taking draws from the full con-
ditional distributions of the latent variables given the drawn values for the parameters
(Aitkin & Aitkin, 2005), or partially Bayesian solutions using point estimates of
parameters (Bartholomew et al., 2011) that could be applied here using a point summary of the
parameters. Importantly, the typical summary-level approach relies on the assumption of
multivariate normality of the observables, which was derived from an assumption of
conditional normality of the observables given the latent variables. Hayashi and Yuan (2003)
proposed a robust analysis to account for non-normality, though it is possible that greater
flexibility is afforded by adopting an approach that specifies non-normal distributions for
the individual-level data (e.g., Lee & Xia, 2008; Zhang et al., 2013).
The need to indicate a constant via a different shape in path diagrams brings into relief
a key distinction between path diagrams and DAGs. Path diagrams implicitly express that
the nodes (the rectangles and circles) vary over examinees. As a result, the parameters,
which do not vary over examinees, are not nodes in path diagrams. They are placed along
the paths in the diagram. For directed paths (i.e., one-headed arrows), the parameter
residing along the path serves as a coefficient for the variable at the source of the path in the
structural equation for the variable at the destination of the path. When the source of the
path is a constant (i.e., the 1 in the triangle) this just expresses that the parameter along
the path is an intercept in the structural equation. For nondirected paths (i.e., two-headed
arrows), the parameter residing along the path refers to the covariance between the
variables connected by the path; when the connection is between a variable and itself, this
is just the variance of the variable. Conventions for what are and are not represented as
nodes align well with a frequentist perspective on inference: nodes correspond to vari-
ables associated with examinees that are assigned distributions; because parameters are
treated as fixed entities in a frequentist framework, they are not depicted as nodes.
In contrast, DAGs make explicit the replication of variables over examinees, and other
replications, via plate structures. In a fully probabilistic framework, unknown
parameters are assigned distributions and appear as nodes in the graph. Path diagrams are well
aligned with a frequentist perspective in which distributions are invoked only where a
sampling concept is applied, namely for examinee variables. DAGs are well aligned with
a Bayesian perspective in which all entities are treated as random variables, with
distributions assigned to all. In this light, examinee variables do not have a distinct status; like
other entities, they appear as nodes in the graph, and so the replication over examinees must be
made explicit. This underscores a general point about a Bayesian approach: in contrast to
the conventional frequentist approach, the latent variables have the same status as other
parameters.
Importantly, path diagrams and DAGs differ in how they represent correlational struc-
tures. Path diagrams employ nondirectional (two-headed) arrows to indicate a correlation
between variables, such as the correlation between the latent variables in Figure 9.1b. DAGs
only employ directed (one-headed) arrows, and take a multivariate approach to specify-
ing correlational structures. As the DAG in Figure 9.2 illustrates, we specify the vector of
latent variables (for each examinee i) and express the correlation via the specification of a
multivariate distribution with covariance structure Φ. Path diagrams also allow for cyclic
dependences, where setting out on a path of directed arrows from a certain node, we can
return to that node. The simplest example of this is when there are two variables and between
them are two directed arrows facing in opposite directions, indicating a feedback or
reciprocal dependence (Kline, 2013).
We stress that though path diagrams are not as closely aligned with the probability-
based modeling approach as DAGs, they pose certain advantages over DAGs. Path models
are especially adept at communicating a number of key features: how many observable
and latent variables are in the model; which observables load on which latent variables;
and the means, variances, and correlations among the latent variables and possibly other
exogenous entities (indicated by a bidirectional arrow). As such they are a powerful knowl-
edge representation for analysts when formulating models and communicating the model
to others. As path diagrams expressing the dependence and conditional independence
relationships among the entities associated with examinee variables, they more easily
support the derivation of model-implied mean and covariance structures via path tracing
(Mulaik, 2009; Wright, 1934), which in turn supports other analyses such as determining
how models are related (Levy & Hancock, 2007, 2011). For these reasons, path diagrams
and DAGs may complement each other in how they express and facilitate reasoning using
models, and both may be gainfully called into service simultaneously, as warranted by the
needs of the analyst.
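For a one-factor model, path tracing gives the implied covariance between observables j and k as λ_j φ λ_k and the implied variance of observable j as λ_j² φ + ψ_jj. Plugging in the posterior means from Table 9.2 (a rough point summary, not a substitute for the full posterior) yields implied correlations that can be compared with the observed correlations in Table 9.1:

```python
import numpy as np

# Posterior means from Table 9.2 (single-factor model), in the order
# PI, AD, IGC, FI, FC: loadings, error variances, and factor variance
lam = np.array([1.00, 0.73, 0.42, 1.05, 0.98])
psi = np.array([0.37, 0.18, 0.18, 0.38, 0.27])
phi = 0.44

# Path tracing: cov(x_j, x_k) = lam_j * phi * lam_k for j != k,
# and var(x_j) = lam_j^2 * phi + psi_j
cov_implied = np.outer(lam, lam) * phi + np.diag(psi)
sd = np.sqrt(np.diag(cov_implied))
corr_implied = cov_implied / np.outer(sd, sd)

labels = ["PI", "AD", "IGC", "FI", "FC"]
for j in range(5):
    print(labels[j], np.round(corr_implied[j], 2))
```

The implied PI–IGC correlation (about .40) is close to the observed .41 in Table 9.1; comparing the remaining implied and observed correlations in the same way is exactly the kind of model-implied-moment reasoning that path tracing supports, and systematic discrepancies are what the model-criticism techniques of Chapter 10 interrogate.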
steps. If this introduces new unknown parameters, we repeat the process until we reach
the point where there are no more unknowns. Equation (9.9) expresses the first layer in the
hierarchy, where we have structured the (n × J)-variate distribution of observables in terms
of the (n × M) latent variables ξ, the (J × M) loadings Λ, the J intercepts τ, and the J error variances in
Ψ. As M is usually far less than J, this represents a considerable simplification.
We are now tasked with specifying a (prior) distribution for the newly introduced parameters, again by capitalizing on exchangeability, introducing parameters that induce conditional independence, and specifying distributions for them as needed. For the (n × M) matrix of unknowns in Ξ, (9.17) contains the next level of this hierarchy and introduces hyperparameters κ and Φ, for which (hyperprior) distributions are needed. These are specified in
the next level of the hierarchy, in (9.20) and (9.22). To specify the former, again an assumption
of exchangeability is invoked to allow for a common univariate prior for the elements of κ,
in (9.21). In short, by invoking exchangeability assumptions and approaching the problem
in a hierarchical manner, we have solved the problem of specifying an (n × M)-variate prior
distribution with the following (collecting Equations 9.18, 9.21, and 9.22):
ξ_i | κ, Φ ~ N(κ, Φ) for all i,
κ_m ~ N(μ_κ, σ²_κ) for all m,
(9.42)
and
Φ ~ Inv-Wishart(Φ_0, d).
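To make the hierarchy concrete, the following Python sketch draws once from the prior in (9.42). The hyperparameter values (μ_κ, σ²_κ, Φ_0, d) and the dimensions are illustrative stand-ins, not values from the text.

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)
n, M = 100, 2  # numbers of examinees and latent variables (illustrative sizes)

# Hyperparameter values are illustrative stand-ins, not values from the text
mu_kappa, sigma2_kappa = 0.0, 1.0  # hyperprior mean and variance for each kappa_m
Phi0, d = np.eye(M), M + 2         # inverse-Wishart scale matrix and degrees of freedom

# One draw from the hierarchical prior of Equation 9.42, top down
Phi = invwishart.rvs(df=d, scale=Phi0, random_state=rng)     # Phi ~ Inv-Wishart(Phi0, d)
kappa = rng.normal(mu_kappa, np.sqrt(sigma2_kappa), size=M)  # kappa_m ~ N(mu_kappa, sigma2_kappa)
xi = rng.multivariate_normal(kappa, Phi, size=n)             # xi_i | kappa, Phi ~ N(kappa, Phi)
```

Note how the (n × M)-variate prior for the latent variables is never specified directly; it is induced by drawing the hyperparameters first and then drawing each examinee's latent variables conditionally independently given them.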
We adopt the same strategies for the measurement model parameters. Assuming exchangeability allows for the specification of common priors for λ in (9.25), τ in (9.26), and ψ in (9.27).
Note that we could specify a hierarchical structure on the measurement model parameters,
that is, where the (hyper)parameters of their prior distributions are unknown and modeled
via their own (hyper)priors. This specification is not common in CFA, but has been used more
in the context of item response theory (Section 11.7.4).
This line of reasoning develops the model in a slightly different manner than CFA has
typically been conceived. It is distinctly probabilistic in nature in that it (a) focuses on
the specifications of (conditional) distributions and (b) for every parameter that is introduced, a (prior) distribution, itself possibly conditional on other parameters, is specified.
Combining this approach with assumptions of exchangeability allows for the efficient con-
struction not only of the distribution of the data (the first layer in the hierarchy) but of the
prior distributions as well (the subsequent layers). The process concludes when there are
no more unknown parameters in need of distributional specification.
ξ_ig | κ_g, Φ_g ~ N(κ_g, Φ_g),
κ_mg ~ N(μ_κg, σ²_κg),
(9.43)
and
Φ_g ~ Inv-Wishart(Φ_0g, d_g),
λ_jmg ~ N(μ_λg, σ²_λg),
τ_jg ~ N(μ_τg, σ²_τg),
(9.44)
and
ψ_jg ~ Inv-Gamma(a_ψg, b_ψg).
Models that assume the measurement model parameters are conditionally exchangeable
are commonly used in measurement invariance and differential item functioning analyses
where group membership is known (Millsap, 2011; Verhagen & Fox, 2013) or unknown as
in latent class or mixture models (Pastor & Gagné, 2013).
Note that in (9.43) and (9.44) each group has distinct hyperparameters, reflecting dif-
ferent a priori beliefs about the group-specific parameters. This need not be the case. For
example, we may specify the model as
λ_jmg ~ N(μ_λ, σ²_λ),
τ_jg ~ N(μ_τ, σ²_τ),
(9.45)
and
ψ_jg ~ Inv-Gamma(a_ψ, b_ψ),
which allows the measurement model parameters to vary by groups, as may be suggested
by the data, but reflects that a priori we do not believe anything different about them for
the different groups. Still other possibilities involve not only different hyperparameters
for groups, but different parametric forms as well.
form we can encode various beliefs and at various levels of certainty. This is particularly
natural if a prior can be constructed where one of its parameters captures this strength.
Using a normal prior for a loading allows for the specification of the certainty via the prior
variance. A prior variance of 0 indicates the loading is constrained to be equal to the value
of the prior mean; increasing the prior variance represents a weakening of this belief.
Working from this perspective, Muthén and Asparouhov (2012) drew upon the added
flexibility of a Bayesian approach over conventional frequentist approaches to construct a
model that better reflected their substantive theory. They argued that the constraint that a
cross-loading is 0 is in many cases overly stringent. Substantive theory typically does not
imply that the cross-loadings are exactly 0, just that they are low in magnitude, meaning an
observable's dependence on that latent variable is weak relative to its dependence on the
latent variable on which it (primarily) loads. Note however that the alternative in a conven-
tional frequentist tradition is to include the parameter and estimate it from the data, which
places the parameter on par with the primary, substantive loading. Seeking an alternative
to these extremes, Muthén and Asparouhov (2012) advocated the use of prior distributions
for cross-loadings that are normal with mean 0 and a small variance. The specification
of the variance encodes how willing the analyst is to constrain the cross-loading to be 0
versus allowing it to be dictated by the data. Smaller values of the variance reflect a firmer
belief that the cross-loading is exactly 0; larger values reflect an increasing willingness to
allow the cross-loading to be farther from 0.
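The mechanics of this trade-off can be seen in a deliberately simplified one-parameter caricature: a conjugate normal-normal update for a single cross-loading, where a hypothetical estimate and standard error summarize what the data say. This is a sketch of the shrinkage idea only, not the full CFA posterior.

```python
# Caricature of a small-variance prior: conjugate normal-normal update for one
# cross-loading with a N(prior_mean, prior_var) prior and a normal likelihood
# summarized by a (hypothetical) estimate and standard error.
def posterior_mean(mle, se, prior_var, prior_mean=0.0):
    if prior_var == 0.0:
        return prior_mean  # hard constraint: loading fixed at the prior mean
    data_prec = 1.0 / se**2
    prior_prec = 1.0 / prior_var
    # Precision-weighted average of the prior mean and the data-based estimate
    return (prior_prec * prior_mean + data_prec * mle) / (prior_prec + data_prec)

mle, se = 0.30, 0.10              # hypothetical cross-loading estimate and standard error
for v in (0.0, 0.01, 0.1, 10.0):  # increasing prior variance = weakening the constraint
    print(v, round(posterior_mean(mle, se, v), 3))
# prints: 0.0 0.0 / 0.01 0.15 / 0.1 0.273 / 10.0 0.3
```

A prior variance of 0 reproduces the hard constraint; a large variance essentially hands the parameter to the data; values in between pull the cross-loading part of the way toward 0, exactly the continuum described in the text.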
Another example concerns the associations between error terms. In our present development, Ψ was assumed to be diagonal, containing error variances on the diagonal and 0s off
the diagonal for the error covariances. This implies that the only source of association for
the observables is expressed by the loadings and the latent variables. This can be relaxed
in a number of ways. One natural approach is to model Ψ as a complete covariance matrix, like Φ, assigning a prior distribution such as an inverse-Wishart for Ψ.
Muthén and Asparouhov (2012) studied this and two other approaches. In one, the error
covariance matrix is defined as
Ψ = Δ + Ψ*, (9.46)
where Ψ* is a diagonal matrix and Δ is a covariance matrix that is not assumed to be diagonal. Ψ* may be assigned a prior using the same strategies we had previously developed for Ψ, say, using inverse-gamma priors for each of the diagonal elements. Δ is assigned an inverse-Wishart prior specified such that all the elements are relatively small. As a result, Ψ is a covariance matrix with possibly nonzero off-diagonal elements, but these elements are fairly small. In the other approach, the diagonal elements of Ψ are assigned inverse-gamma priors (as in our development) and each of the off-diagonal elements is modeled
with a univariate normal prior with mean 0 and a small variance. In this approach, condi-
tional conjugacy is not preserved as the full conditional distributions in the Gibbs sampling
scheme are not of known form; a more complex Metropolis-type sampler is utilized.
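The first of these constructions can be sketched directly: the error covariance matrix is the sum of a diagonal matrix whose elements get inverse-gamma priors and a covariance matrix drawn from an inverse-Wishart whose large degrees-of-freedom parameter keeps all its elements small. All numeric settings below are illustrative, not values from the text.

```python
import numpy as np
from scipy.stats import invwishart, invgamma

rng = np.random.default_rng(1)
J = 5  # number of observables (illustrative)

# Diagonal part: error variances with inverse-gamma priors (illustrative shape/scale)
psi_star = np.diag(invgamma.rvs(a=5, scale=10, size=J, random_state=rng))

# Non-diagonal part: an inverse-Wishart draw; the large df keeps all elements small
delta = invwishart.rvs(df=J + 50, scale=np.eye(J), random_state=rng)

# Full error covariance: nonzero but small off-diagonal covariances
psi = delta + psi_star
```

The degrees-of-freedom choice plays the role of the small-variance prior here: the larger it is, the more tightly the off-diagonal covariances are held near 0.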
For our current discussion, the salient point is that these approaches allow for nonzero
covariances among all the errors. Such a model is not identified in a frequentist approach;
however, this does not necessarily pose the same sort of problems for a Bayesian analysis
as it would for a frequentist analysis. More generally, CFA models that are not identified
in a frequentist analysis can often be specified by using a Bayesian approach (Scheines et al., 1999).
It is worth taking a step back and considering just what these sorts of developments
reveal. First, adopting a Bayesian perspective allows the analyst to specify models akin
to the two choices available in a conventional frequentist analysis: (a) let the parameter
be estimated by the data by way of using a sufficiently diffuse prior or (b) constrain the
parameter to a particular value by way of a prior with all its mass at that point. Second, a
Bayesian perspective allows for the analyst to specify models that lie in between these two
extremes, by using a prior that is in between a maximally diffuse and degenerate distri-
bution. In the situations described here, this frees the analyst from needing to adhere to
the strictness of an equality constraint, without conceding the substantive theory that the
parameters in question should be weak in magnitude. We note that the flexibility in using
distributional representations to encode a continuum of possibilities has been exploited to
considerable advantage in other modeling contexts (e.g., to facilitate partial pooling among
groups in multilevel modeling, Gelman & Hill, 2007; Novick et al., 1972). Third, on a technical level, model structures that are unidentified in a frequentist approach may be specified and fitted using a Bayesian approach.
The added flexibility of a Bayesian approach allows for a closer representation of the real-
world situation in the modeled space, which allows for a potentially closer correspondence
between the model and the substantive theory about the real-world situation. Substantive
theory typically does not dictate that the cross-loadings are exactly 0, or that all of the
reasons that observables are associated are captured by the latent variables (Muthén &
Asparouhov, 2012). A Bayesian approach allows for some flexibility here while still pre-
serving the core of the substantive theory, which is that the associations can be mainly
accounted for by the hypothesized factor structure.
More generally, a Bayesian approach allows for the easy imposition of hard constraints
(e.g., a parameter equals 0 or is equal to another parameter) and soft constraints (e.g., a
parameter is likely near 0, or slightly different from another parameter) as warranted by
various hard and soft aspects of the analyst's substantive theory and thinking that the
model is constructed to represent. The salient point is that the Bayesian approach allows
for the choices that are typically made in conventional frequentist modeling, and addition-
ally allows for choices in between these options, in ways that more closely align the model
with substantive theory and the desired inferences.
Although these examples were studied in the context of CFA, they have implications for
psychometrics more broadly. The possibility of cross-loadings depends on the presence of
multiple latent variables, which has been a feature of applications of CFA since its origin.
Historically, applications of item response and latent class models (Chapters 11 and 13)
have tended to employ one latent variable, though multiple latent variable item response
and latent class models are on the rise (Reckase, 2009; Rupp et al., 2010; see Section 11.5 and Chapter 14), and these concepts apply directly to those situations as well.
The example surrounding the residual correlations strikes at the heart of all psychomet-
ric models where observables are modeled as depending on latent variables, and is there-
fore germane to the models covered in the rest of the book. The notion of 0 off-diagonal
elements in Ψ is a manifestation of the assumption that the observables are condition-
ally (locally) independent given the latent variables and the measurement model param-
eters. Recall that these conditional independence assumptions are aligned with notions of
exchangeability (Section 7.2).
Tying these themes together, an assumption of exchangeability allows us to simplify
the construction of the joint distribution for the observables. In latent variable psycho-
metric models, this exchangeability is conditional on the latent variables and the mea-
surement model parameters; the observables are modeled as conditionally independent
given these entities. However, exchangeability assumptions and implied conditional inde-
pendence structures in the model reflect the analyst's thinking about the complexities of
the real world in terms of modeled entities to facilitate reasoning. This thinking is an
inexact account of the more complicated real-world scenario of interest. Allowing error
covariances and/or loadings to depart slightly from 0 may be seen as a recognition that the
dependence and conditional independence relationships in the model, and the analyst's thinking it represents, are possibly (if not certainly) not exactly correct. Including these
parameters with small-variance priors represents a softening of the almost-sure-to-be-
wrong-about-the-world independence structures warranted by exchangeability assump-
tions. This serves to better align the modeled scenario with the real-world scenario it is
intended to represent.
This also serves as something of an internal check on the viability of the substantive
components in the model and serves as an opportunity to learn about things that were not
expected. If an analysis with a small-variance prior around 0 for a parameter in fact yields
a posterior where the parameter is rather different from 0, this suggests that the original thinking about the weakness of the parameter may need to be revised. In this situation,
Muthén and Asparouhov (2012) advocated re-specifying the model so that the parameter
has a more diffuse prior. The broader topic of model checking to alert the analyst to weak-
nesses in the model is taken up in more detail in Chapter10.
These examples also highlight that the distinction between the prior and the likelihood
is inexact. Let us return to the case of a loading being included or excluded from the
model. Readers steeped in frequentist traditions may naturally consider this feature of the
model to be part of the likelihood, and whether a loading is specified is of course a concern
of frequentist analyses, which eschew notions of priors for parameters. On the other hand,
we can conceive of a parameter being included or excluded as being a prior distribution
specification. In the case of a loading, excluding it may be seen as specifying a prior distri-
bution concentrating all its mass at 0.
Resuming a theme introduced in Section3.5.2, we might say much the same about other
model specifications that typically are the focus of conventional modeling approaches,
like specifying which observables load on which latent variables, or whether the latent
variables or certain errors are correlated. These are just some of the issues analysts address
when specifying a CFA model in the conventional frequentist approach to CFA modeling,
estimation, and inference.*
But what are these if not prior specifications? As we have seen, they can be expressed as
prior distributional specifications. The CFA structure, as typically developed in frequentist
traditions, is itself a prior specification. More generally, all of the model specifications,
including when they are framed in frequentist light, may be seen as prior probability
specifications.
We close this section with a word of caution. We have argued that a Bayesian approach
opens up possibilities for employing models, including those that are not identified
in a conventional sense (for conventional treatments of identification in CFA, see, e.g.,
Bollen, 1989; Mulaik, 2009). The use of a small-variance prior is an example of this (Muthén & Asparouhov, 2012). However, with these new possibilities come potential
pitfalls. We suspect that the rules and guidelines that have been developed in conven-
tional approaches to modeling may not apply as forcefully when employing Bayesian
approaches. But just how different things are when operating in a Bayesian framework
* And this is to say nothing about other specifications that often go unstated, like the choice that is almost
always made to use normal distributions for the observables and latent variables in CFA and SEM.
has not been worked out in sufficient detail. Analysts would do well to heed the lessons
learned in conventional approaches to modeling, and proceed cautiously when exploring uncharted territory. These themes are echoed in the discussion of resolving indeterminacies, to which we now turn.
φ = 1
and
λ_1 ~ N(μ_λ, σ²_λ).
This specification fixes the variance of the latent variable to 1 and releases the constraint on the first loading, specifying it to have the same prior as the remaining loadings. All the other expressions remain the same. The DAG for the model is given in Figure 9.5.
FIGURE 9.5
Directed acyclic graph for the model with one latent variable with fixed latent variable mean and variance (which is denoted by the boxes for κ and φ).
#########################################################################
# Model Syntax
#########################################################################
model{
#########################################################################
# Specify the factor analysis measurement model for the observables
#########################################################################
for (i in 1:n){
for(j in 1:J){
mu[i,j] <- tau[j] + ksi[i]*lambda[j]
x[i,j] ~ dnorm(mu[i,j], inv.psi[j])
}
}
#########################################################################
# Specify the prior distribution for the latent variables
#########################################################################
for (i in 1:n){
ksi[i] ~ dnorm(kappa, inv.phi)
}
#########################################################################
# Specify the prior distribution for the parameters that govern
# the latent variables
#########################################################################
kappa <- 0
inv.phi <- 1
phi <- 1/inv.phi
#########################################################################
# Specify the prior distribution for the measurement model parameters
#########################################################################
for(j in 1:J){
tau[j] ~ dnorm(3, .1)
inv.psi[j] ~ dgamma(5, 10)
psi[j] <- 1/inv.psi[j]
}
for (j in 1:J){
lambda[j] ~ dnorm(1, .1)
}
#########################################################################
# Initial values for two different chains
#########################################################################
list(tau=c(3, 3, 3, 3, 3), lambda=c(3, 3, 3, 3, 3), inv.psi=c(2, 2, 2, 2,
2))
Two chains were run in this analysis using starting values listed above. History plots for the first 1000 iterations for the measurement model parameters for the first observable and for examinee 8's latent variable are given in Figure 9.6. They are representative of
the other results for parameters of the same type. By the standards of diagnosing convergence by way of inspection of multiple chains, the behavior of the chains' values for the intercepts and error variances looks pretty good. However, the behavior for the loadings (represented by the top plot for λ_1) and the examinees' latent variables (represented by the bottom plot for ξ_8) would seem to be a clear-cut case of a lack of convergence, or at best that
one chain has converged and the other has not (but which one?). In fact, the chains, both of them, have converged. They are just exploring different parts of the posterior distribution. Figure 9.7 contains density plots of several of the parameters. The multimodality in
the posterior for the loadings and the latent variable for examinee 8 can be connected to
the separate chains. The chain on the top in the history plots provides the values that con-
stitute the density for positive values; the chain on the bottom in the history plots provides
the values that constitute the density for negative values.
FIGURE 9.6
History plots for parameters for the model with fixed variance and estimated loadings, including the loading (λ_1), intercept (τ_1), and error variance (ψ_11) for the first observable, and the latent variable for the eighth examinee (ξ_8).
FIGURE 9.7
Density plots for parameters for the model with fixed variance and estimated loadings, including the loading (λ_1), intercept (τ_1), and error variance (ψ_11) for the first observable, and the latent variable for the eighth examinee (ξ_8).
The densities in Figure 9.7 are indeed representative of the posterior distribution. The
multimodality is a manifestation of the orientation indeterminacy associated with the
factor model, namely that the loadings can be reflected by multiplying them by −1 and still reproduce the data equally well. With one orientation, the loadings are positive; with another orientation, the loadings are negative. A similar reflection of positive and negative values takes place in the values of the latent variables. Note that there is no reflection in the
intercept or error variance.
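This reflection invariance is easy to verify numerically: the model-implied covariance matrix of a one-factor model, Σ = φλλ′ + Ψ, is identical whether the loadings are multiplied by −1 or not. The loadings and error variances below are hypothetical illustrative values.

```python
import numpy as np

# Model-implied covariance matrix for a one-factor model: Sigma = phi * lambda lambda' + Psi
lam = np.array([0.8, 0.7, 0.6, 0.9, 0.5])      # hypothetical loadings
phi = 1.0                                      # latent variance fixed at 1
psi = np.diag([0.36, 0.51, 0.64, 0.19, 0.75])  # hypothetical error variances

Sigma_pos = phi * np.outer(lam, lam) + psi
Sigma_neg = phi * np.outer(-lam, -lam) + psi   # loadings reflected by multiplying by -1

print(np.allclose(Sigma_pos, Sigma_neg))  # True: both orientations imply the same covariances
```

Because the data inform the parameters only through such model-implied moments, the likelihood cannot distinguish the two orientations, which is exactly why the posterior inherits the two modes.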
The multimodality exists in the posterior because of the multimodality in the likelihood (as discussed in Section 9.1.2, the orientation indeterminacy pertains to frequentist estimation as well) and the information in the prior is not strong enough
to steer the posterior towards one orientation as opposed to the other. As the information
in the prior diminishes, the posterior more closely resembles the likelihood. Thus, MCMC
may be a useful tool for exploring multimodality and more generally the shape of the
likelihood (see Scheines et al., 1999, for a discussion in CFA; see also Jackman, 2001, for a
discussion in item response theory).
This example has illustrated that resolving the indeterminacy in the scale of the latent
variable by fixing the variance does not resolve the indeterminacy in orientation. As illustrated in Section 9.3, fixing the first loading to 1 was sufficient to resolve this issue.
Again, we may view this as an issue of the prior specification. When no prior information
is supplied about the orientation, the posterior reflects the indeterminacy as the data are
insufficient for resolving it. As discussed in Section 9.8, our adopted approach of fixing
loadings to take on particular values may be seen as a prior specification. This approach is
standard in frequentist analyses. It is also common in Bayesian analysis, particularly if, as
was the case here, rather diffuse priors are used for the remaining parameters. However,
other possibilities exist, such as fixing values or imposing ordinal constraints on examinee
latent variables, which is most viable when there is prior knowledge that two examin-
ees are quite distinct in terms of what the latent variable is intended to capture (Bafumi,
Gelman, Park, & Kaplan, 2005; Jackman, 2001), or fitting the more open model, without constraints, and then postprocessing the resulting draws from MCMC to rescale to a desired metric (Bafumi et al., 2005; Erosheva & Curtis, 2011; Fox, 2010).
Most of the mechanisms used to resolve the indeterminacies may be seen as prior infor-
mation. Fixing the loading to a particular value (e.g., 1, as was done in the examples in
Sections9.3 and 9.4) may be seen as modeling highly specific prior information about that
loading. We may frame this issue in terms of the variance of a normal prior distribution
for the loading. A bit more formally, suppose that this loading has a N(1, σ²_j) prior distribution. If σ²_j = 0 the loading is set equal to 1, which effectively resolves the issue. If σ²_j = ∞, we are in effect supplying no prior information, and we will have the multimodality resulting from the indeterminacy.
These two situations are akin to the two choices in a conventional frequentist analysis: either fix the parameter so that the data have no bearing on its value (σ²_j = 0), or let it be estimated solely from the information in the data (σ²_j = ∞). But as was argued in
Section 9.8, in a Bayesian analysis there is room in between these extremes.
When σ²_j = 0 the issue was effectively resolved. As we saw in Figures 9.6 and 9.7, with a large but finite value of σ²_j the indeterminacy rears its head. How small σ²_j needs to be in order to resolve the indeterminacy depends on the features of the situation, in particular the separation between the modes in the likelihood, and other prior information that may be modeled with an informative prior. What we can say is that increasing the variance σ²_j moves us further along the undesirable path to the indeterminacy rearing its head
and complicating the posterior. See Jackman (2001) for examples of dwindling differences
between the modes, which increases the chances of a chain flipping to another mode. See
also Loken (2004, 2005) for discussions of the role of seemingly arbitrary constraints in
mode separation.
A few additional comments on the use of MCMC in light of the indeterminacy are
warranted. The multimodality in the posterior in Figures9.6 and 9.7 was immediately
seen in part because the two chains were run from dispersed starting points. It is
tempting to say that if we had only run one of these chains, we would not have seen it.
And this is true, for the finite iterations considered. If a chain was run sufficiently long
enough it would explore the full posterior and would reveal the multimodality. That is,
each chain would eventually flip or jump and start exploring the space associated
with the reflected values. Of course, there is no guarantee when that will happen. In
this and many other applications of CFA and item response models that employ con-
tinuous latent variables (Chapter11), the data are informative enough to yield a separa-
tion between the modes sufficient to greatly reduce the chance of an individual chain
flipping to another mode, such that it might occur only once every thousand years. That
would still be in accordance with theorems of the behavior for infinitely long chains, as
in the long run the chain would hop back and forth, covering the areas in proportion to
the right posterior densities.
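The rarity of such flips can be illustrated with a toy random-walk Metropolis sampler on a well-separated bimodal density. All settings here are hypothetical and chosen only to mimic the qualitative situation: a chain started in one mode with a modest step size essentially never visits the other mode in any finite run, even though the stationary distribution gives both modes equal mass.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy bimodal target: equal mixture of N(-3, 0.3^2) and N(+3, 0.3^2),
# standing in for the two reflected modes of the likelihood (settings hypothetical)
def log_target(x):
    return np.logaddexp(-0.5 * ((x + 3.0) / 0.3) ** 2,
                        -0.5 * ((x - 3.0) / 0.3) ** 2)

def metropolis(n_iter, step, x0=3.0):
    x, chain = x0, []
    for _ in range(n_iter):
        prop = x + rng.normal(0.0, step)                             # random-walk proposal
        if np.log(rng.random()) < log_target(prop) - log_target(x):  # accept/reject
            x = prop
        chain.append(x)
    return np.array(chain)

chain = metropolis(5000, step=0.5)  # started in the positive mode
flips = int(np.sum(np.sign(chain[1:]) != np.sign(chain[:-1])))
# With well-separated modes and a modest step, flips between modes are rare or absent
```

Running a second chain from x0 = −3.0 would sit just as stably in the negative mode, reproducing in miniature the two-chains picture of Figure 9.6.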
Exercises
9.1 Factor analysis models may be seen as simultaneously regressing the observables
on the factors.
a. In what ways are the specifications of the factor analysis models analogous to
those in regression models of Chapter 6?
b. In what ways are they different?
c. How could regression models of Chapter 6 be converted to factor analysis
models, and what changes would need to be made in the inputs to WinBUGS
for such a model?
9.2 The CTT model of Chapter 8 may be seen as an instance or special case of factor
analysis.
a. Formulate the CTT model in Section 8.3 as a CFA model with one latent variable
(factor) in terms of mathematical expressions. Alternatively, what restrictions
can be placed on a CFA model with one latent variable to obtain the CTT model?
b. Create a DAG for the model created in (a).
c. Conduct a Bayesian analysis of the model formulated in (a) and (b) in WinBUGS
using the data in Table 8.2. Compare your results in terms of convergence and the resulting posterior distribution to that reported in Section 8.3.3.
9.3 An alternative CFA model for the Institutional Integration Scale posits two latent
variables, where Peer Interaction and Faculty Interaction load on one latent vari-
able interpreted as pertaining specifically to Social Integration, and the other three
observables, Academic and Intellectual Development, Institutional Goal Commitment,
and Faculty Concern load on a second latent variable interpreted as pertaining to
Academic Integration.
a. Formulate the model in terms of mathematical expressions.
b. Specify a path diagram and DAG for the model. How do they compare to those
for the model with two latent variables in Section 9.4?
c. Conduct a Bayesian analysis of the model formulated in (a) and (b) in
WinBUGS. Interpret the results in terms of convergence and the resulting pos-
terior distribution.
9.4 Figure 9.4 contains an inexact DAG for the model with one latent variable with fixed latent variable mean and fixed loading. Create an exact DAG for the model. How does it compare to that in Figure 9.4 in terms of representation and ease of
understanding?
9.5 Reconsider the model in Section 9.9 where the rotational indeterminacy is pres-
ent. Explain how a chain sampling values near one mode could jump to sample
values of another mode in the context of a Gibbs sampler, and in the context of a
Metropolis sampler.
10
Model Evaluation
Researchers often use models to account for real-world phenomena in parsimonious and
theoretically meaningful ways to guide inference. In this chapter, we take up the task
of evaluating models. Model criticism and comparison are core activities in statistical
modeling generally, and Bayesian psychometric modeling is no exception. What's more,
these are active areas of research and debate within the Bayesian community on appro-
priate goals, procedures, and conclusions. In this chapter, we aim to survey and discuss
methods that have gained popularity in Bayesian psychometrics, and point the reader
to existing applications and research on these and other methods. The CFA model of the
previous chapter provides us with the first opportunity to illustrate the methods with a
rich psychometric model, and bring out ways that more general techniques interact with
particularly psychometric concerns.
We describe approaches to the criticism of an individual model in Sections 10.1 and 10.2. In Section 10.3, we turn to the situation in which there are multiple models under consideration and we wish to compare them. In this chapter, we return to the use of θ to generically denote all the unknowns in the model (i.e., latent variables, measurement model parameters, and parameters that govern their priors). When discussing how the methods apply to the CFA examples, we will employ CFA-specific notation, which does not utilize θ.
that indicates that such results are highly unlikely or impossible (Martin & McDonald,
1975; Mislevy, 1986). In conventional approaches, it is often recommended that this should
be addressed by the analyst specifying appropriate bounds for such parameters or using
software that automatically does this (Gorsuch, 1983; Kline, 2010), which may be seen as
incorporating this prior information into the analysis.
In the next chapter on item response theory, we discuss a related situation where the
parameters are poorly determined from the data such that there may be many sets of
measurement model parameters that are almost equally good at fitting the data, some of
which are reasonable and others of which are not. Setting aside research uses that aim for
true estimates, production uses of item response theory need to use the reasonable ones
for very practical reasons. Here again, even a mild prior accomplishes this. Knowing what
item parameter estimates should look like is a kind of prior knowledge about the parameters, but it is also akin to knowing control limits on the factory floor.
of CFA tend not to focus on the estimation of examinees' latent variables, whereas many
applications of item response theory (discussed in Chapter11) do. Accordingly, we may
be willing to tolerate certain types or amounts of error in estimating examinees' latent
variables in the former case, but less so in the latter case. The onus is on the analyst to
know what can be safely ignored, and what requires careful checking and attention. As
Box succinctly put it (1976, p. 792), "it is inappropriate to be concerned about mice when there are tigers abroad."
We advocate that Tukey's dictum quoted above applies to model checking. In our view,
asking whether the model is true is almost always the wrong question, and using sophis-
ticated statistical machinery in pursuit of this question, while an interesting exercise, may
be something of a misallocation of resources. In applied work, it is better to devote those
resources to addressing the right question, which is whether the inferences emerging from
using the model are adequate. As Tukey's dictum suggests, this question is indeed vague,
and requires an understanding of what constitutes serious threats to the viability of the
inferences and how those threats may reveal themselves in the data or the discrepancy
between the model and the data. Just what those threats are and how they manifest them-
selves, let alone what is to be done about them, is likely to vary based on the uses of the
models, the consequences of misfit, and potentially other considerations such as whether
revising the model is a viable option in light of the available resources.
We will look at a number of examples shortly, to see how the strategy can be applied to a
variety of features, how the features are suggested by the way the model is being used,
and how the impact of model-data differences can be evaluated in terms of the effect
on targeted inferences.
The above characterization speaks to the computation of residuals but belies the key
role residual analysis plays in the larger modeling enterprise. The crucial conceptual role
of residuals is perhaps best expressed in exploratory data analysis (Mosteller & Tukey,
1977; Tukey, 1977; see also Behrens, DiCerbo, Yel, & Levy, 2013), which stresses the foundational
relationship DATA = MODEL + RESIDUALS. This reminds us that a model is
a simplified representation of the complex real-world situation and is necessarily inexact,
implying that residuals will always be present. Exploratory data analysis manipulates
this equation to focus on the residuals as what remains when one subtracts the
model away from the data. The rationale is that a model being fit embodies the analyst's
story (his or her beliefs and understandings in light of purposes) of the real-world situation.
Residual analyses proceed by subtracting away the model (i.e., the analyst's story)
from the data. What remains is therefore what goes beyond that story. With that story
removed, it is easier to see just what is left over; that is, it is easier to see and seek patterns
that were previously obscured. Interactive exploration with tools such as those
pioneered by Data Desk (Data Description, 2011) makes this a real-time detective mission
(Tukey, 1969). This kind of interaction is strongly aligned with a view that modeling is
a cyclical process of specification, criticism in light of data, and revision (Box, 1976) as
opposed to the traditional, less cyclical "fitting a model to data" of confirmatory data
analysis (Tukey, 1986).
As a first example, we describe a foundational residual analysis that may be applied
across a wide variety of psychometric models. To address the first question, we identify
the value of the observables from the examinees as the feature of interest. This is easily
obtained from our data as x_ij, the value of observable j (such as a test score under
classical test theory or an item response under item response theory) for examinee i.
To address the second question, we work through the model to arrive at its implications.
We denote the model-implied value for observable j from examinee i by E(x_ij | ω).
In the context of CFA, the relevant model parameters are θ_i for examinee i
and τ_j and λ_j for observable j, so E(x_ij | ω) = E(x_ij | τ_j, λ_j, θ_i) = τ_j + λ_j θ_i. To address the third
question, a natural comparison is to form the difference; the residual for the value from
examinee i on observable j is

Residual_ij = x_ij − E(x_ij | ω),  (10.1)

with larger values indicating a larger departure of the observed data from the story of the
model.
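As a concrete sketch (all numbers below are made up for illustration), the residuals can be computed for all examinees and observables at once; here tau, lam, and theta stand in for point summaries (e.g., posterior means) of the intercepts, loadings, and latent variables:

```python
import numpy as np

# Illustrative (made-up) parameter summaries for a tiny CFA model:
# tau[j] and lam[j] are the intercept and loading for observable j,
# theta[i] is the latent variable for examinee i.
tau = np.array([2.0, 1.5])
lam = np.array([1.0, 0.8])
theta = np.array([0.5, -0.5, 0.0])

# Observed values x[i, j] for 3 examinees on 2 observables (made up)
x = np.array([[2.6, 2.0],
              [1.4, 1.0],
              [2.1, 1.6]])

expected = tau + np.outer(theta, lam)  # E(x_ij | tau_j, lam_j, theta_i)
residual = x - expected                # Residual_ij = x_ij - E(x_ij | .)
print(residual)
```

Large entries (in absolute value) flag person-observable combinations that depart from the model's story.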
Model Evaluation 235
This residual is applicable in a variety of psychometric modeling scenarios and may
be of primary interest, especially in early stages of analysis, either to catch data errors
or to identify instances where the particular situations that led to certain observations are
really coming from different mechanisms than most of the data. The overall story that a
model seeks to tell, despite variations it allows in its structure, just does not apply very
well to these instances, which may need their own special stories: persons who misunderstood
directions or followed atypical solution strategies, for example, or coding
errors in the data.
There is a value of Residual_ij for every examinee on every observable. With a large sample
size or number of observables, this may be too many to efficiently investigate one at a
time. In addition, the individual values of the observables may not be of primary interest,
so we ought to employ residuals that target what is valued or is of chief concern. CFA models
are frequently employed to reflect theories regarding the associations among variables.
Accordingly, we might prefer a residual that more directly targets associations among
the variables to characterize the adequacy of the model. To address the first question, the
covariances among the observables are a natural first choice for features of the data. Let s_jj'
be the element in row j and column j' of the observed covariance matrix S, usually calculated
in the CFA literature* by
s_jj' = [ Σ_{i=1}^{n} (x_ij − x̄_j)(x_ij' − x̄_j') ] / (n − 1).  (10.2)
To address the second question, we again work through the CFA model to arrive at the
model's implications. The model-implied covariance for observables j and j' is denoted by
Σ(ω)_jj', and may be obtained as the element in row j and column j' of the model-implied
covariance matrix:

Σ(ω) = ΛΦΛ′ + Ψ.  (10.3)
Turning to the third question, a natural comparison is to form the difference; the residual
covariance for observables j and j' is then s_jj' − Σ(ω)_jj', which corresponds to the element in
row j and column j' of the residual covariance matrix

E_cov = S − Σ(ω).  (10.4)

The covariances may be placed on a common scale by working with correlations instead. Let R
denote the observed correlation matrix, where each element r_jj' is given by

r_jj' = s_jj' / √(s_jj s_j'j').  (10.5)

Let P(Σ(ω)) denote the model-implied correlation matrix, where each element P(Σ(ω))_jj' is given by

P(Σ(ω))_jj' = Σ(ω)_jj' / √(Σ(ω)_jj Σ(ω)_j'j'),  (10.6)
* The covariance could be formulated using n in the denominator, and that may be better justified in the context
of model-data fit analyses. Nevertheless, the following form is common in CFA and the difference becomes
irrelevant as n increases.
where Σ(ω)_jj is the element in row j and column j of Σ(ω). The residual correlation matrix is
then

E_cor = R − P(Σ(ω)).  (10.7)
A non-zero residual correlation occurs when the observed correlation differs from the
model-implied value, and speaks to the presence of a conditional dependence. In the context
of the CFA model for the Institutional Integration Scale introduced in Section 9.3, the
Faculty Interaction and Faculty Concern observables both concern the role of faculty in the
students' experience, and may be more correlated than can be accounted for by a single
latent variable modeled as underlying all the observables. Similarly, scores derived from
tests taken at the same time, or with shared format, often result in higher correlations if
these contributing factors are not taken into account. It is not only the size of residuals that
matters, but perhaps patterns among them.
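To illustrate (10.3) through (10.7) numerically, the following sketch forms the residual correlation matrix from an illustrative observed covariance matrix and a hypothetical one-factor model; all values are made up for illustration and are not the book's data:

```python
import numpy as np

lam = np.array([0.8, 0.7, 0.6])          # loadings (illustrative values)
phi = 1.0                                 # latent variable variance
psi = np.diag([0.36, 0.51, 0.64])         # unique variances
Sigma = phi * np.outer(lam, lam) + psi    # model-implied covariance matrix

S = np.array([[1.00, 0.60, 0.45],         # observed covariance matrix (made up)
              [0.60, 1.00, 0.40],
              [0.45, 0.40, 1.00]])

def cov_to_cor(C):
    # rescale a covariance matrix to a correlation matrix
    d = np.sqrt(np.diag(C))
    return C / np.outer(d, d)

Ecor = cov_to_cor(S) - cov_to_cor(Sigma)  # residual correlation matrix
print(np.round(Ecor, 3))
```

In this made-up example the residual correlations are small (the largest is .04), the kind of pattern consistent with an adequately fitting covariance structure.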
Other residuals may be of central importance to applications of CFA. In the context of
mean-structure modeling, we may focus on residuals for means of the observables.
The posterior predictive distribution is given by

p(x^postpred | x) = ∫ p(x^postpred | ω) p(ω | x) dω,  (10.9)
where x^postpred denotes posterior predictive data of the same size as the observed data, which
in the current case of CFA is an (n × J) matrix. The first term on the right-hand side is the
conditional distribution of the posterior predicted data given the model parameters. The
second term on the right-hand side is the posterior distribution for the parameters given
the observed data. The integration reflects a marginalization. If we knew the values for the
model parameters, we would predict data based on those (using the first term). But we do
not know the values for the model parameters, so we marginalize over the posterior distribution
for them. The result is the posterior predictive distribution on the left-hand side.
It is posterior because it is conditional on the observed data. It is predictive because it refers
to data that were not observed. Indeed, x^postpred may be interpreted as data that could have
been observed, or data that may be observed in the future, if the same real-world processes
as captured by the model and values for ω are replicated in the future.
It is worth noting that, in general, the first term on the right-hand side of (10.9) should
be p(x^postpred | ω, x). The use of the simpler p(x^postpred | ω) reflects an assumption that the posterior
predictive data are conditionally independent of the observed data given the model
parameters. This reflects the broader notion that the model captures the salient features of
the real-world processes that give rise to the data, and so future or potential (i.e., posterior
predictive) data are independent of past (observed) data, given the model.
In practice, the posterior predictive distribution is often empirically approximated using
simulation techniques in concert with those used to empirically approximate the posterior
* This specialized use of the phrase "test statistic" differentiates it from the fit measures called "discrepancy
measures," which are functions of both the observable data and model entities (see Section 10.2.5).
distribution of the parameters. For each draw from the posterior distribution, we take the
additional step of generating a posterior predicted dataset via the conditional probability
distribution for the data. Letting (1) , , ( R) denote the R simulations of the model param-
eters from the posterior distribution, we obtain R replicate posterior predicted datasets
x postpred(1) ~ p( x| = (1) )
x postpred( 2 ) ~ p( x| = ( 2 ) )
(10.10)
x postpred( R) ~ p( x| = ( R) ).
The posterior predictive distribution serves as a reference distribution, against which the
observed data are ultimately compared in terms of the test statistic. In particular, we
obtain the posterior predictive distribution for the test statistic by computing the test statistic
in each of the posterior predicted datasets, yielding T(x^postpred(1)), …, T(x^postpred(R)). The
posterior predictive distribution constitutes the answer to our second question concerning
what the model implies about this feature of data.
Turning to the third question, we now compare the realized value of the test statistic,
T(x), to the posterior predictive distribution T(x^postpred(1)), …, T(x^postpred(R)). We recommend
the use of graphical procedures to facilitate the comparison. A simple example in the current
context involves a histogram or smoothed density for the posterior predictive distribution,
with a marker or line indicating the location of the realized value.
The results may be numerically summarized via the posterior predictive p-value,
where the probability is taken over both the posterior distribution of ω and the posterior predictive
distribution of x^postpred,

p_post = ∫∫ I[T(x^postpred) ≥ T(x)] p(x^postpred | ω) p(ω | x) dx^postpred dω,  (10.12)

where I[·] is the indicator function taking on the value of 1 when its argument is true and 0
when its argument is false. An estimate of p_post may be obtained in a simulation environment
as the proportion of draws in which the posterior predicted value T(x^postpred) exceeds
the realized value T(x).
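The full cycle can be sketched in a toy setting. The example below is not a CFA: it fakes posterior draws for the mean of a simple normal model so that the sketch is self-contained, uses the sample maximum as the test statistic T, and estimates p_post as the proportion of posterior predicted statistics at least as large as the realized value:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy observed data and a test statistic T(x): the sample maximum
x = rng.normal(0.0, 1.0, size=50)
T_real = x.max()

# Stand-in posterior draws for the mean mu of a N(mu, 1) model;
# in practice these would come from MCMC output
R = 2000
mu_draws = rng.normal(x.mean(), 1.0 / np.sqrt(len(x)), size=R)

# One posterior predicted dataset per draw, with T computed in each
T_pred = np.array([rng.normal(mu, 1.0, size=len(x)).max() for mu in mu_draws])

# Estimated p_post: proportion of draws with T(x_postpred) >= T(x)
p_post = np.mean(T_pred >= T_real)
print(round(p_post, 3))
```

Values of p_post near 0 or 1 indicate that the realized statistic falls in a tail of the reference distribution.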
It is often the case that considerable efficiencies may be gained by doing these sorts of computations
outside of WinBUGS. This is particularly so for more involved computations such as those
involved in the model-checking methods discussed in the following sections. For this reason,
we used a modified version of the code from Levy (2011) to import the draws into R
and perform the computations associated with PPMC.*
Figure 10.1 displays the densities for the posterior predicted correlations; the vertical
line in each corresponds to the realized value (i.e., the correlation in the observed data).
Consider first the results for the pairing of Peer Interaction and Faculty Concern. The realized
(observed) correlation is .54. As the vertical line in the graph in the last row and first column
of Figure 10.1 reveals, this falls near the middle of the posterior predicted distribution.
The position of the realized value in the posterior predictive distribution is summarized
numerically via the p_post value, which is .65, indicating that 65% of the 15,000 posterior predicted
values were greater than the realized value. Substantively, the correlation that was
found between these variables is generally consistent with the model's implications for
their correlation, which is evidence of adequate model-data fit.
On the other hand, consider the situation for the correlation between Faculty Interaction
and Faculty Concern, depicted in the last row and fourth column of Figure 10.1. The posterior
predictive distribution is unimodal and symmetric, centered at .56. The realized
value for the correlation is .67, which falls far out in the upper tail of the posterior predictive
distribution. The p_post value is estimated as the proportion of the 15,000 posterior
predicted values that exceeded .67, which is .0004. The correlation that was found between
these variables is considerably larger than the model's implications, which is evidence of
FIGURE 10.1
Posterior predicted densities, with vertical lines for realized values, for the correlations between the latent vari-
ables based on the confirmatory factor analysis model with one latent variable (factor) for the observables from
the Institutional Integration Scale. PI = peer interaction, FI = faculty interaction, AD = academic and intellectual
development, FC = faculty concern, IGC = institutional goal commitment.
* The R code for doing this is available on the website for the book.
model-data misfit. Whether the misfit is substantively meaningful depends on the purpose
of the analysis. In this case, where we seek to account for the associations among the
variables, the size of the discrepancy between what the model implies (as represented by
the posterior predicted distribution) and what was observed is concerning. This is particularly
so when we consider the other results for the correlations. A similar result occurred
for the pairing of Academic and Intellectual Development and Institutional Goal Commitment
(Figure 10.1, second row, second column), where the misfit is even more pronounced. Here,
the realized correlation of .64 is much larger than the model's implications as expressed in
the posterior predicted distribution (mean of posterior predicted distribution = .39; estimated
p_post = 0). In both of those situations, the realized value falls in the upper tail of the
posterior predictive distributions, indicating that the model underpredicts the associations.
We see the opposite for the pairing of Institutional Goal Commitment and Faculty Interaction
(third row, third column). Here, the realized value of .31 falls out in the lower tail (mean of
posterior predictive distribution = .39, estimated p_post = .95), indicating the model overpredicts
this association, which is also evidence of model-data misfit.
1. How discrepant are the data and the model in ways that are of interest?
2. What does the model imply about how discrepant they ought to be?
3. How does the discrepancy based on the observed data compare to what the model
implies for this discrepancy?
The key difference from PPMC using test statistics is that here we are focused on a feature of
the data and the model, rather than a feature of the data alone. Instead of focusing just on characteristics
or features of the data, we explicitly build in the comparison between the observed
data and the model's implications at the outset by focusing on the discrepancy between the
data and the model, as in the formulation of a residual. Formally, let D(x; ω) denote a discrepancy
measure of interest, where the notation reflects that it is a function of both the data and the
model parameters. Casually, if the function of interest is a function of the model parameters
as well as the data, it is a discrepancy measure; if it is just a function of the data, it is a test
statistic. Many functions that are referred to as "test statistics," "fit statistics," and "fit indices"
in the psychometric literature are considered discrepancy measures in the current treatment.
To address the first question, we define a discrepancy measure and evaluate it using
the observed data. In a simulation-based environment, this is empirically approximated
by calculating the discrepancy measure using each simulated value from the posterior
distribution for the parameters. This yields the collection D(x; ω^(1)), …, D(x; ω^(R)), which are
referred to as the realized values of the discrepancy measure. These summarize the discrepancy
between the (observed) data and the model. It is a distribution because we have a
posterior distribution, rather than just a point estimate, for the model parameters.
To address the second question, we calculate the discrepancy measure using the posterior
predicted data from (10.10), yielding an empirical approximation to the distribution of posterior
predicted values D(x^postpred(1); ω^(1)), …, D(x^postpred(R); ω^(R)). This distribution expresses what
the model implies the discrepancy ought to be for observable data if the model were correct.
Turning to the third question, we now compare the distribution of the realized values,
D(x; ω^(1)), …, D(x; ω^(R)), to the distribution of posterior predicted values, D(x^postpred(1); ω^(1)), …,
D(x^postpred(R); ω^(R)). In the current context where there is now a distribution of realized values,
a simple example involves a scatterplot of the pairs of corresponding realized and
posterior predicted values, where for each draw r, a point is plotted in the coordinate
plane as (D(x; ω^(r)), D(x^postpred(r); ω^(r))). The results may also be numerically summarized via
the posterior predictive p-value, where the probability is taken over both the posterior
distribution of ω and the posterior predictive distribution of x^postpred,

p_post = ∫∫ I[D(x^postpred; ω) ≥ D(x; ω)] p(x^postpred | ω) p(ω | x) dx^postpred dω.  (10.14)

p_post may be empirically approximated as the proportion of times that the posterior predicted
value exceeds the corresponding realized value.
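A toy sketch of the discrepancy-based version: again a simple normal model stands in for the psychometric models, with a sum-of-squares discrepancy playing the role of measures such as LR or SRMR. Each posterior draw yields a (realized, posterior predicted) pair:

```python
import numpy as np

rng = np.random.default_rng(11)

x = rng.normal(0.0, 1.0, size=50)
n = len(x)

def D(data, mu):
    # toy discrepancy for a N(mu, 1) model: sum of squared residuals
    return np.sum((data - mu) ** 2)

R = 2000
mu_draws = rng.normal(x.mean(), 1.0 / np.sqrt(n), size=R)  # stand-in posterior draws

D_real = np.empty(R)  # realized values D(x; omega^(r))
D_pred = np.empty(R)  # posterior predicted values D(x_postpred^(r); omega^(r))
for r, mu in enumerate(mu_draws):
    D_real[r] = D(x, mu)
    D_pred[r] = D(rng.normal(mu, 1.0, size=n), mu)

# Estimated p_post: proportion of pairs with D_pred >= D_real
p_post = np.mean(D_pred >= D_real)
print(round(p_post, 3))
```

Plotting the (D_real, D_pred) pairs against the unit line gives the kind of scatterplot shown in the figures that follow.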
This is commonly referred to as the chi-square or χ² statistic for the model, as the sampling
distribution of LR computed using the sample-based normal-theory ML estimates of the parameters
asymptotically follows a χ² distribution under mild regularity conditions (West, Taylor, &
Wu, 2012). The second discrepancy measure is the standardized root mean square residual
(SRMR; Bentler, 2006)

SRMR = √{ 2 Σ_{j=1}^{J} Σ_{j'=1}^{j} [(s_jj' − Σ(ω)_jj') / √(s_jj s_j'j')]² / [J(J + 1)] }.  (10.16)
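As an illustration, the SRMR can be computed directly from a pair of covariance matrices; the two matrices below are made up and are not the book's data:

```python
import numpy as np

def srmr(S, Sigma):
    # standardized root mean square residual over the lower triangle
    # (including the diagonal), cf. Eq. (10.16)
    J = S.shape[0]
    total = 0.0
    for j in range(J):
        for k in range(j + 1):
            total += ((S[j, k] - Sigma[j, k]) / np.sqrt(S[j, j] * S[k, k])) ** 2
    return np.sqrt(2.0 * total / (J * (J + 1)))

S = np.array([[1.00, 0.60], [0.60, 1.00]])      # "observed" (made up)
Sigma = np.array([[1.00, 0.50], [0.50, 1.00]])  # "model-implied" (made up)
print(round(srmr(S, Sigma), 4))
```

In PPMC, this function would be evaluated once with the observed S (the realized value) and once with each posterior predicted dataset's covariance matrix.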
Figure10.2 contains a scatterplot of the realized and posterior predicted values of LR, with
the unit line added as a reference. Observing points that appear to be randomly scattered
about the line would have indicated that the realized and posterior predicted values were
of similar magnitudes. But in Figure 10.2, it is clearly seen that the realized values are
considerably larger than the posterior predictive values. The estimated ppost value is the
proportion of draws above the line, which in this case was 0. This indicates that the dis-
crepancy between the observed and model-implied covariance structure is larger than the
discrepancy implied by the model.
The results for SRMR support this conclusion. The scatterplot for the realized and pos-
terior predicted values is given in Figure10.3 (panel a), where it is seen that the realized
values are consistently larger than their posterior predicted counterparts. The estimated
ppost value is the proportion of draws above the line, which in this case was .004.
FIGURE 10.2
Scatterplot of the realized and posterior predicted values of the LR, with the unit line, based on the confirmatory
factor analysis model with one latent variable (factor) for the observables from the Institutional Integration Scale.
FIGURE 10.3
(a) Scatterplot of the realized and posterior predicted values of the SRMR, with the unit line and (b) density of
the realized values of SRMR, based on the confirmatory factor analysis model with one latent variable (factor)
for the observables from the Institutional Integration Scale.
Synthesizing the various PPMC analyses, the results from the analysis of LR and SRMR
constitute evidence that the model suffers in explaining the overall covariance structure of
the observables. The analysis of the individual correlations drills down to a finer grained
conclusion, which is that the model suffers chiefly in terms of severely underpredicting the
association between Academic and Intellectual Development and Institutional Goal Commitment
and between Faculty Interaction and Faculty Concern while overpredicting the association
between Institutional Goal Commitment and Faculty Interaction. These results suggest that a
model with two latent variables in which the subscales are organized around one latent
variable associated with Faculty measures and one associated with Student measures,
as was examined in Section 9.4, might provide better fit (see Exercise 10.2).
and a number of discrepancy measures have been proposed to evaluate these features
(e.g., Chen & Thissen, 1997; Levy et al., 2009; Levy, Xu, Yel, & Svetina, 2015; Stout et al., 1996).
There is no shortage of residuals, test statistics, and discrepancy measures that have been
proposed for evaluating psychometric models, with justifications for broad or targeted
use. Research on the performance of these in psychometric models is an active area of
research. Note that in this chapter, we have described machinery associated with model
checking, but exactly which residuals, statistics, and discrepancy measures should be pur-
sued depends on the model and the purpose as well. This chapter used examples pertinent
to certain applications of CFA. We will further illustrate the use of PPMC in the context of
item response theory (Section 11.2.5), latent class analysis (Section 13.4.4), and Bayesian
networks (Section 14.4.8), and give an indication about how the nature of the models suggests
particular features to examine.
The machinery in PPMC is in some ways similar to that used in frequentist approaches.
In its current form, hypothesis testing from a frequentist perspective is an amalgamation
of different traditions (Gigerenzer, 1993), but may be casually summarized as follows.
The null hypothesis implies how data would vary randomly from sample to sample even
though the same (fixed) values of the parameters govern the data generating process. This
is represented by the sampling distribution of the data, which is equivalent to the first term
in the integral on the right-hand side of (10.9), p(x^postpred | ω). In frequentist approaches, the
sampling distribution is constructed by using point estimates for unknown parameters.
This serves as the reference distribution, against which the observed data are compared.
Note that relying on point estimates for defining the sampling distribution suffers in that it
ignores the uncertainty in these estimates. This has been recognized as a limitation of frequentist
approaches, and one that is not easily circumvented (Molenaar & Hoijtink, 1990;
Snijders, 2001; Stone & Hansen, 2000). In PPMC, we employ the posterior distribution for
the parameters, p(ω | x), as it appears as the second term inside the integral on the right-hand
side of (10.9), which incorporates our uncertainty about the parameters.
The connections to and departures from frequentist hypothesis testing approaches raise
the question of whether p_post values afford the same interpretations as p values in frequentist
hypothesis tests. There has been much research and discussion about when they have
frequentist interpretations as in hypothesis testing, and whether we should desire them
to behave as such, and we will not review these issues here (see Bayarri & Berger, 2000a,
2000b; Dahl, 2006; Gelman, 2003, 2007; Johnson, 2007; Meng, 1994b; Robins, van der Vaart, &
Ventura, 2000; Rubin, 1996a; Stern, 2000). In most situations, we advocate an interpretative
lens that views the results of PPMC as diagnostic pieces of evidence regarding, rather than a
hypothesis test of, (mis)fit of a model that is known a priori to be incorrect (Box, 1976; Gelman,
2003, 2007; Gelman et al., 1996; Gelman & Shalizi, 2013; MacCallum, 2003; Stern, 2000). From
this perspective, p_post values have direct interpretations as expressions of our expectation of
the extremity in future replications, conditional on the model (Gelman, 2003, 2007). Graphical
representations are often employed as summaries of the results of PPMC (e.g., Gelman, 2004),
and p_post is simply a way to summarize the results numerically. From this perspective, model-data
fit analyses using PPMC play much the same role as residual analyses in exploratory
data analysis. PPMC, graphical summaries, and p_post have less to do with the probability of
rejecting a model in the already-known-to-be-false situation where the model is correct, and
more to do with aiding the analyst in characterizing the ways in which the necessarily incorrect
but hopefully useful model is (and is not) working well, and the implications of using
such a model (Box, 1979; Box & Draper, 1987; Gelman, 2007; McDonald, 2010).
Several other Bayesian approaches to model checking that have been proposed may be
viewed as differing from PPMC in how they construct a distribution for the parameters (the
second term on the right-hand side of (10.9)). These include prior predictive model checking
(Box, 1980), partial PPMC (Bayarri & Berger, 2000a), and cross-validated PPMC (Evans,
1997, 2000). Still other strategies work to calibrate p_post emerging from PPMC to ensure it
has frequentist properties (Hjort, Dahl, & Steinbakk, 2006). The reader is referred to Levy
(2011) for descriptions and illustrations of these procedures in the context of CFA.
p(M_2 | x) / p(M_1 | x) = [p(x | M_2) / p(x | M_1)] × [p(M_2) / p(M_1)]
                        = [ ∫ p(x | ω^(2)) p(ω^(2) | M_2) dω^(2) / ∫ p(x | ω^(1)) p(ω^(1) | M_1) dω^(1) ] × [p(M_2) / p(M_1)],  (10.17)

where ω^(1) is the collection of parameters for M_1, p(M_1) is the prior probability for M_1, and
analogous definitions hold for ω^(2) and M_2. The second term on the right-hand side of (10.17)
is the ratio of prior probabilities (i.e., the prior odds). The first term on the right-hand
side is the ratio of the marginal likelihoods under each model. This term is the Bayes
factor (e.g., Kass & Raftery, 1995), in particular the Bayes factor for M_2 relative to M_1,
which we denote as BF_21. Note that the Bayes factor has the form of an LR (and in certain
cases specializes to the familiar LR; Kass & Raftery, 1995). The Bayes factor effectively
transforms the prior odds into the posterior odds via the likelihoods of the data and as
such is sometimes interpreted as the weight of evidence in favor of one model over the
other. Importantly, the Bayes factor is not limited to applications in which the models
are hierarchically related (nested). Jeffreys (1961) provided a classic set of recommendations
for interpreting the magnitudes of Bayes factors; Kass and Raftery (1995) provided
a slightly revised set of recommendations, summarized in Table 10.1. Included there are
recommendations for working in (twice) the log metric. Note that since the Bayes factor
is the ratio of marginal likelihoods, the log of the Bayes factor is equal to the difference
of the logs of the marginal likelihoods and
2 log(BF_21) = 2 log[ p(x | M_2) / p(x | M_1) ] = 2( log[p(x | M_2)] − log[p(x | M_1)] ).  (10.18)
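A quick numeric illustration of (10.18), with hypothetical (made-up) log marginal likelihoods for two models:

```python
# Hypothetical log marginal likelihoods for two models (made-up values)
log_m1 = -103.5   # log p(x | M1)
log_m2 = -100.0   # log p(x | M2)

# Eq. (10.18): twice the difference of the log marginal likelihoods
two_log_bf21 = 2.0 * (log_m2 - log_m1)
print(two_log_bf21)  # 7.0
```

Working in the 2log metric puts the result directly on the scale of the guidelines in Table 10.1.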
Bayes factors and their variants have been among the most widely used and studied model
comparison approaches in Bayesian psychometrics (Berkhof et al., 2003; Bolt, Cohen, &
Wollack, 2001; Fox, 2005b; Hoijtink, Bland, & Vermeulen, 2014; Kang & Cohen, 2007; Klein
Entink, Fox, & van der Linden, 2009; Lee & Song, 2003; Li et al., 2006; Li et al., 2009; Raftery,
1993; Sahu, 2002; Song, Lee, & Zhu, 2001; van Onna, 2002; Verhagen & Fox, 2013; Zhu &
Stone, 2012).
Bayes factors attempt to capture the evidence in favor of one model as compared to
another. When used to compare models representing null and alternative hypotheses, this
affords the possibility of concluding that the data provide evidence against the null, much
like frequentist hypothesis testing. Importantly, the use of Bayes factors also affords pos-
sibly concluding that the data provide evidence in favor of the null, or that the data are
inconclusive about which model is to be preferred. As the construction in (10.17) suggests,
implicit in the use of Bayes factors is the perspective that one or the other model could be
a good description of the data. See Gelman et al. (2013) and Gill (2007) for broader discussions
and criticisms of Bayes factors, both generally and in specific use contexts.
One set of criticisms involves the challenges in computing Bayes factors, though approximations
make Bayes factors more accessible (Bollen, Ray, Zavisca, & Harden, 2012; Carlin &
Chib, 1995; Chib, 1995; Chib & Jeliazkov, 2001; Han & Carlin, 2001; Kass & Raftery, 1995).
TABLE 10.1
Guidelines for Interpreting the Magnitude of Bayes Factors

BF21          2log(BF21)    Evidence in Favor of M2 and against M1
1 to 3        0 to 2        Not worth more than a bare mention
3 to 20       2 to 6        Positive
20 to 150     6 to 10       Strong
>150          >10           Very strong
In what follows, we adopt an approach that employs approximations to the models' marginal
likelihoods based on predictive distributions and cross-validation strategies. For the
moment, we suppress the conditioning on M as the following applies to computations for
individual models. To begin, a cross-validation strategy involves the examination of predictions
of a subset x_i of the data x, when the complement of x_i is employed to fit the model.
Following the notation in Chapter 5, let x_{-i} denote this complement, such that x = (x_i, x_{-i}).
The conditional predictive ordinate (CPO) is the value of x_i in the conditional predictive
density, which, assuming conditional independence of x_i and x_{-i} given ω, is*
CPO_i = p(x_i | x_{-i}) = ∫ p(x_i | ω) p(ω | x_{-i}) dω.  (10.19)

The product of the CPOs over examinees yields the pseudomarginal likelihood (PsML),

PsML = p̂(x) = Π_{i=1}^{n} p(x_i | x_{-i}).  (10.20)
The ratio of the PsMLs for two models yields a pseudo Bayes factor (Gelfand, 1996).
Equation (10.19) suggests that the CPOs may be estimated by running n separate model-fitting
analyses, each one holding out a single examinee and using the remaining n − 1
examinees to yield the posterior distribution and the CPO for the held-out examinee.
We employ an alternative approach that estimates the CPOs based on a single analysis
using MCMC. For a set of R iterations, we compute a Monte Carlo estimate of the CPO
(Congdon, 2007),

CPO_i ≈ { (1/R) Σ_{r=1}^{R} [p(x_i | ω^(r))]^{−1} }^{−1},  (10.21)
where ω^(r) contains the values for the unknown parameters from iteration r. We then need
to aggregate these over examinees. For numerical stability, we work with logarithms. An
estimate of the log of the PsML is given by

log(PsML) ≈ Σ_{i=1}^{n} log(CPO_i).  (10.22)
* Note that this has a similar form as the posterior predictive distribution in (10.9), and may be seen as the posterior
predictive distribution based on obtaining the posterior distribution for the parameters using x_{-i}. This
perspective supports traditional cross-validation approaches, where an observed dataset is split into a cross-validation
sample (x_i) and a calibration sample (x_{-i}) (see Bolt et al., 2001 for an application in psychometrics).
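The estimator in (10.21) and (10.22) can be sketched given a matrix of per-draw likelihood values p(x_i | ω^(r)); the likelihoods below are simulated placeholders, whereas in practice they would be assembled from MCMC output:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated per-draw likelihoods p(x_i | omega^(r)) for n examinees
# and R posterior draws (placeholder values, not a real analysis)
n, R = 100, 500
like = np.exp(rng.normal(-3.0, 0.2, size=(n, R)))

# Eq. (10.21): harmonic-mean estimate of CPO_i from the inverse likelihoods
cpo = 1.0 / np.mean(1.0 / like, axis=1)

# Eq. (10.22): log pseudomarginal likelihood
log_psml = np.sum(np.log(cpo))
print(round(log_psml, 2))
```

Twice the difference of two models' log(PsML) values then estimates 2log(BF21), as in the text's comparison of the one- and two-factor models.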
For each model, we can compute the core ingredient of the CPO, [p(x_i | ω^(r))]^{−1}, in WinBUGS
by adding the following lines of code to that given in Sections 9.3.3 and 9.4.3, respectively:
#########################################################################
# Model Syntax to Compute Ingredients for CPO
#########################################################################
for(i in 1:n){
  for(j in 1:J){
    # normal density of observable j for examinee i
    p.x[i,j] <- (1/sqrt(2*3.141593))*sqrt(inv.psi[j])*
                exp(-.5*inv.psi[j]*(x[i,j]-mu[i,j])*(x[i,j]-mu[i,j]))
    log.p.x[i,j] <- log(p.x[i,j])
  }
  # inverse of the likelihood for examinee i; its posterior mean is the
  # denominator ingredient in (10.21)
  inv.p.x[i] <- exp(-sum(log.p.x[i,1:J]))
}
The posterior mean of the node inv.p.x[i] is the denominator in (10.21). For each model, we obtained these posterior means for the n examinees, computed the estimate of CPO for each examinee using (10.21), and then the log(PsML) using (10.22). The value of log(PsML) for the single-latent-variable (factor) model was −2053.55 and the value of the log(PsML) for the two-latent-variable (factor) model was −2030.38. Twice the difference yields an estimate of 2log(BF21) = 46.34. Following the guidelines in Table 10.1, this is interpreted as strong evidence in favor of the model with two latent variables (factors) over the model with one latent variable (factor).
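The computations in (10.21) and (10.22) are easy to sketch outside of WinBUGS as well. The following Python fragment uses simulated stand-in likelihood values (not the book's actual MCMC output) to show the harmonic-mean form of the CPO estimate and its aggregation into log(PsML):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in values of p(x_i | theta^(r)): rows are R MCMC iterations,
# columns are n examinees (hypothetical values, for illustration only).
R, n = 5000, 20
p_x = np.exp(rng.normal(-3.0, 0.3, size=(R, n)))

# Equation (10.21): CPO_i is the harmonic mean of the per-iteration
# likelihood values, i.e., the reciprocal of the mean reciprocal.
cpo = 1.0 / np.mean(1.0 / p_x, axis=0)

# Equation (10.22): sum the log CPOs over examinees.
log_psml = np.sum(np.log(cpo))

# For two fitted models, 2*log(BF_21) would then be estimated by
# 2 * (log_psml_model2 - log_psml_model1).
```

Note that the harmonic mean never exceeds the arithmetic mean, so each estimated CPO is bounded above by the posterior mean of the likelihood for that examinee.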
where pD is a complexity measure defined as the difference between the posterior mean of the deviance, D̄(θ), and the deviance evaluated at the posterior mean, D(θ̄). Model selection here involves choosing the model with the smallest DIC value, a strategy that is gaining attention in Bayesian psychometrics (Kang & Cohen, 2007; Kang, Cohen, & Sung, 2009; Klein Entink et al., 2009; Lee, Song, & Cai, 2010; Lee, Song, & Tang, 2007; Li et al., 2006, 2009; Zhu & Stone, 2012).
WinBUGS provides values for the DIC and its constituent elements for models. However,
WinBUGS does not automatically guard against the use of DIC in situations where it may
not be appropriate. Note the use of the deviance evaluated at the posterior mean for the parameters, D(θ̄). This construction relies on the posterior mean being a reasonable point summary of the parameters, as may be the case in the CFA models analyzed here and the item response models of Chapter 11 (see Exercises 11.1 and 11.2). If the posterior mean is not a reasonable summary of the posterior, the DIC may not be appropriate. For example, the latent variables in latent class and Bayesian network models (Chapters 13 and 14) are discrete and often nominal in nature, and a mean is not a reasonable point summary for them. The DIC is not recommended for use in such cases.
The addition of pD in the DIC is intended to capture the model complexity and penalize the model accordingly. The rationale for this sort of construction comes from Occam's razor, which suggests that, all else equal, we should prefer models that are more parsimonious in that they posit fewer entities or are otherwise simpler. In the case of model comparison, we recognize that models that are more complex (less parsimonious) will provide better fit by virtue of their increased flexibility. When it comes to model selection, the more complex model's capability to provide better fit should therefore be penalized. This sort of rationale is present in the BIC and other information criteria in the form of adjusting the absolute fit as captured by the likelihood by adding a penalty term based on the number of estimated parameters, p. As we have seen in Section 9.8, a maximally diffuse prior is akin to a frequentist approach that has the parameter in the model to be estimated, and a prior with its mass concentrated at a single point is akin to having the parameter fixed and not estimated. Whether or not to count the parameter as part of the penalty in these cases is fairly clear. With a prior distribution for a parameter in between these two extremes, it is not as obvious. What is more, not all parameters have the same impact in terms of the capability of the model to fit data (Fan & Sivo, 2005; Preacher, 2006). Model complexity is not just about how many parameters, but which ones. pD is one approach to capturing model complexity and folding it into a model comparison process. More advanced discussions of DIC appear in Gelman et al. (2013), Gill (2007), and Plummer (2008).
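The distinction between the posterior mean of the deviance and the deviance at the posterior mean can be made concrete with a small Python sketch. The toy normal-mean model, its data, and the posterior draws below are simulated stand-ins (not values from the book's examples):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy setup: data from a normal model with unknown mean, known sd = 1.
x = rng.normal(0.3, 1.0, size=50)
# Stand-in posterior draws for the mean (approximate conjugate posterior
# under a diffuse prior).
mu_draws = rng.normal(x.mean(), 1.0 / np.sqrt(len(x)), size=4000)

def deviance(mu):
    """D(mu) = -2 log p(x | mu) for the normal likelihood with sd = 1."""
    return -2.0 * np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2)

d_bar = np.mean([deviance(m) for m in mu_draws])  # posterior mean of the deviance
d_hat = deviance(np.mean(mu_draws))               # deviance at the posterior mean
p_d = d_bar - d_hat                               # complexity measure pD
dic = d_hat + 2 * p_d                             # equivalently d_bar + p_d
```

Here pD comes out close to 1, matching the single estimated parameter; with informative priors or hierarchical structure it would generally fall below the nominal parameter count.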
where pr(xi) is the modeled probability of xi under Mr. The difference in entropy values for two models,

Ent(M1) − Ent(M2),   (10.26)

captures how much better one model predicts a new observation. The difference is positive if M2 yields a better prediction, and negative if M1 yields a better prediction. If M1 is
nested within M2, the difference is non-negative; however, the construction of the differ-
ence does not require that one model be nested within another. Additional extensions or
versions are possible. When counting the number of parameters is straightforward, divid-
ing (10.26) by the difference in the number of parameters supports an evaluation of the dif-
ference in terms of improvement per parameter. Similarly, dividing (10.26) by n supports
an interpretation in terms of the improved prediction for a single observation.
Gilula and Haberman (2001) suggested a criterion for model comparison that is analo-
gous to the proportion of variance accounted for. Letting M0 denote a baseline model, the
proportional improvement obtained by using Mr is
[Ent(M0) − Ent(Mr)] / Ent(M0).   (10.27)
The above description of entropy-based comparisons applies naturally to discrete data.
However, the same properties of entropy do not necessarily hold for continuous data. We illustrate these approaches in the context of latent class analysis for discrete data in Section 13.4.5.
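The proportional-improvement criterion in (10.27) can be sketched in a few lines of Python. The predicted probabilities below are hypothetical, and the entropy summary is taken as an average negative log-probability (an assumption here, since the exact form of the entropy definition precedes this passage; any common normalization cancels in the ratio):

```python
import numpy as np

# Hypothetical predicted probabilities each model assigns to four observed
# dichotomous responses (illustrative values only).
p_baseline = np.array([0.5, 0.5, 0.5, 0.5])  # M0: chance-level baseline
p_model = np.array([0.8, 0.7, 0.9, 0.6])     # Mr: a richer model

# Entropy-style summary: average negative log-probability of the data.
ent_0 = -np.mean(np.log(p_baseline))
ent_r = -np.mean(np.log(p_model))

# Equation (10.27): proportional improvement of Mr over the baseline M0,
# analogous to a proportion of variance accounted for.
improvement = (ent_0 - ent_r) / ent_0
```

A value near 0 indicates the richer model predicts little better than the baseline; a value near 1 indicates nearly perfect prediction relative to the baseline.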
Exercises
10.1 Reconsider the CFA model with one latent variable for the data from the
Institutional Integration Scale described in Section 9.3.
a. Recreate the Bayesian analysis reported there using WinBUGS (see Section
9.3.3) and:
1. Obtain the DIC for the model.
2. Conduct PPMC using correlations among the observables, LR, and SRMR.
Compare your results to those reported here.
3. Conduct PPMC using the means of the observables. Interpret these results
in terms of the adequacy of model-data fit. What feature of the model is
responsible for this (in)adequacy of fit?
b. Fit the model using ML estimation, and compute LR and SRMR using ML
estimates.
c. Compare the results from the frequentist approach in (b) to the Bayesian
approach in (a). How do LR and SRMR differ across the two approaches? Why
do they differ in these ways?
10.2 Reconsider the CFA model with two latent variables introduced in Section 9.4,
in which the subscales are organized around one latent variable associated with
Faculty measures and one associated with Student measures.
a. Recreate the Bayesian analysis reported there using WinBUGS (see Section 9.4.3)
and:
1. Obtain the DIC for the model.
2. Conduct PPMC using correlations among the observables, LR, and SRMR.
Interpret these results in terms of the adequacy of model-data fit.
b. Compare your results for this model to those for the model with one latent
variable in Exercise 10.1.
1. What does model comparison based on DIC suggest?
2. What does the comparison of the PPMC results from the correlations, LR,
and SRMR suggest?
11
Item Response Theory
Item response theory (IRT) models employ continuous latent variables to model dichoto-
mous or polytomous observables, as occur frequently in assessment and social and life
science settings (e.g., item responses given by students on an achievement test and scored
as correct or incorrect, personal preferences rated on a Likert scale by subjects on a survey,
votes in favor or against a bill by politicians, or presence or absence of a patient's symptoms). IRT models specify the probability for an observable taking on a particular value
as a function of the latent variable for the examinee (subject, politician, and patient) and
the measurement model parameters for that observable. The latter are often referred to as
item parameters, a term that reflects developments that occurred in the context of model-
ing response data to items in educational assessment (as does the name item response
theory). In this context, IRT was initially developed for and applied to dichotomous test
items. Such items are a special case of assessment tasks; examinees' responses are a special case of work products; and the evaluations of responses that become the dependent variables in IRT models are special cases of observable variables. Tasks, work products, and
observable variables correspond one-to-one here, and the word "item" is commonly associated with all three of these distinct entities. Ambiguities arise when the correspondence
breaks down, as occurs when questions are clustered within the same task or when dif-
ferent evaluation rules of the same multiple-choice response produce different observ-
ables (e.g., classifying responses as right/wrong, versus partial credit, versus partially
ordered categories of conceptions and misconceptions). In this chapter, we consider
only situations where the one-to-one correspondence among task, work product, and
observable holds.* Accordingly, we predominantly use the terms "items" and "responses"
from educational assessment, though the model applies to situations where this terminol-
ogy does not apply (e.g., vote for or against, symptom present or absent).
Treatments and overviews of IRT from conventional perspectives can be found in De
Ayala (2009), Embretson and Reise (2000), Hambleton and Swaminathan (1985), Lord (1980),
McDonald (1999), and van der Linden and Hambleton (1997). These and other conven-
tional sources include Bayesian elements, reflecting that Bayesian perspectives and pro-
cedures are more prevalent in IRT than in other psychometric modeling paradigms. This
is no doubt in part due to some key difficulties associated with frequentist approaches to
IRT that do not arise as forcefully in other psychometric modeling paradigms. The com-
plexities of modeling discrete data with continuous latent variables have led to a variety of
Bayesian approaches, and we endeavor to cover the main themes and procedures.
The majority of work in IRT has concerned the use of a single latent variable to model
dichotomous observables, and it is in this context that we begin in Section11.1, reviewing
conventional approaches. In Section11.2, we give a Bayesian approach to modeling dichot-
omous observables. We extend this foundation to the case of polytomous observables,
reviewing conventional approaches in Section 11.3 and turning to Bayesian approaches
* We will see examples where this correspondence breaks down: in Chapter 13, we consider an example where one observable is defined based on work products from two tasks, and in Chapter 14, we consider an example
where multiple observables are defined based on work products from a single task.
with elements described next. θi is the latent variable for examinee i. In typical educational measurement contexts, θi represents the examinee's proficiency such that higher values indicate more proficiency. dj, aj, and cj are the measurement model (item) parameters, and IRT models are often referred to by how many of these are included per observable. dj (or a transformation of it described below and denoted bj) is a location parameter for observable j, often referred to as a difficulty parameter for observable (item) j, as it is strongly related to the overall mean of the observable (which is the proportion correct on an item scored correct or incorrect). aj is a coefficient for θi capturing its relationship to observable j. aj is often referred to as a discrimination parameter for the observable (item), as it is related to the capability of the observable (item) to discriminate between examinees with lower and higher values along the latent variable. cj is a lower asymptote parameter for the observable, bounded by 0 and 1, and is the probability that an examinee will have a 1 on the observable as θi → −∞. cj is often referred to as a pseudo-guessing parameter, reflecting that examinees with low proficiency may have a nonzero probability of a correct response due to guessing, as may be prevalent on a multiple-choice achievement test. In practice, F is almost always selected to be the logistic cumulative distribution function or the normal cumulative distribution function, which yields a result for (11.1) that is in the unit interval.
The model may be written equivalently as

P(xij = 1 | θi, bj, aj, cj) = cj + (1 − cj)F[aj(θi − bj)],   (11.2)

differing from (11.1) by the relationship dj = −aj bj. The form in (11.2) is useful for viewing IRT as a scaling model that locates examinees (θs) and item difficulties (bs) on the same scale, particularly when cj = 0, in which case the probability in (11.2) is .5 when θi = bj. The version in (11.1) allows for easier extensions to certain models with multiple latent variables (Section 11.5) as well as a recognition of the connections between IRT and CFA (Section 11.8).
Versions of (11.1) and (11.2) are referred to as the item response function, and named in terms of how many of the measurement model parameters they contain. For example, choosing F in (11.1) to be the normal distribution function yields a three-parameter normal ogive (3-PNO) model,

P(xij = 1 | θi, dj, aj, cj) = cj + (1 − cj)Φ(aj θi + dj),   (11.3)

where Φ(·) is the cumulative normal distribution function. Choosing F to be a logistic distribution function would yield the three-parameter logistic (3-PL) model. Figure 11.1 illustrates the 3-PNO item response function for two items. Comparing the two, we see that the item plotted as a dashed line is easier (d = 0) than that plotted as a solid line (d = −4), in
that the probability of observing a 1 is higher for the former than for the latter. The slope of the solid line is steeper than that of the dashed line, as captured by their discrimination parameters (a = 2 and a = 1, respectively). Finally, the dashed item has a lower asymptote at c = .25, whereas the solid item has a lower asymptote at c = 0.
Setting cj = 0 yields a two-parameter (2-P) model. Adopting the logistic form, the 2-P logistic model is

P(xij = 1 | θi, dj, aj) = Ψ(aj θi + dj) = exp(aj θi + dj) / [1 + exp(aj θi + dj)],   (11.4)

where Ψ(·) is the cumulative logistic distribution function. Further setting all the aj equal
to one another yields a one-parameter model. One mechanism for resolving an indeterminacy in the latent variable (Section 11.1.2) involves fixing a discrimination parameter to
a particular value. In the case of a one-parameter model, this fixes all the discrimination
parameters to that value. Choosing that value to be 1, and continuing with the logistic
form, we have the one-parameter logistic (1-PL) or the Rasch (1960) model,
P(xij = 1 | θi, dj) = Ψ(θi + dj) = exp(θi + dj) / [1 + exp(θi + dj)].   (11.5)
[Figure: two item response curves plotting P(x = 1) against θ from −4 to 4; solid line: d = −4, a = 2, c = 0; dashed line: d = 0, a = 1, c = .25.]
FIGURE 11.1
Item response functions for two items following a three-parameter normal ogive (3-PNO) model.
The 2-PNO and 1-PNO models may be specified by using the normal distribution function rather than the logistic distribution function in (11.4) and (11.5), respectively. The choice between using a normal or logistic distribution function is in some sense arbitrary, as the logistic model may be rescaled by multiplying the exponent by 1.701 to make the resulting function nearly indistinguishable from the normal ogive version (e.g., McDonald, 1999), though one may prove more tractable for certain purposes in assessment, or under different estimation paradigms.
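The near-equivalence after rescaling by 1.701 is easy to verify numerically. This short Python check (not from the book) compares the rescaled logistic function to the normal ogive over a grid of values:

```python
import numpy as np
from math import erf

def normal_cdf(z):
    """Standard normal cumulative distribution function via erf."""
    return 0.5 * (1.0 + erf(z / np.sqrt(2.0)))

z = np.linspace(-4, 4, 801)

# Logistic function with the exponent rescaled by 1.701.
logistic_scaled = 1.0 / (1.0 + np.exp(-1.701 * z))
# Normal ogive evaluated on the same grid.
normal_ogive = np.array([normal_cdf(v) for v in z])

# The two curves differ by less than .01 everywhere on the grid.
max_abs_diff = np.max(np.abs(logistic_scaled - normal_ogive))
```

The maximum absolute difference is below .01, which is why the two forms are treated as practically interchangeable.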
11.1.2 Indeterminacies
The 2-P and 3-P models are subject to the same indeterminacies as CFA models (Sections 9.1.2 and 9.9; see also Hambleton & Swaminathan, 1985, and Lord, 1980, for treatments in unidimensional IRT). In the 1-P models, the lone indeterminacy is the location indeterminacy. In IRT the metric of each latent variable is often specified by fixing the mean and the variance of the latent variable, and the orientation of each latent variable is often specified by constraining the discriminations to be positive. As was the case with CFA, options abound; in some cases, constraints on one or more measurement model parameters may be employed to resolve the indeterminacies.
P(xij | θi, ωj) = P(xij = 1 | θi, ωj)^xij [1 − P(xij = 1 | θi, ωj)]^(1 − xij).   (11.6)
As was the case in CTT and CFA, IRT models typically invoke assumptions of independence among examinees and conditional (local) independence among observables. Letting θ = (θ1, …, θn) denote the collection of latent variables for n examinees and letting ω = (ω1, …, ωJ) denote the full collection of measurement model parameters, the joint probability of the data is then

P(x | θ, ω) = ∏_{i=1}^{n} P(xi | θi, ω) = ∏_{i=1}^{n} ∏_{j=1}^{J} P(xij | θi, ωj),   (11.7)
P(x | ω, θP) = ∫ P(x | θ, ω) p(θ | θP) dθ,   (11.8)

where in the current case P(x | θ, ω) is given by (11.7). The distribution of the latent variables is typically assumed normal, in which case θP = (μθ, σ²θ) are the mean and variance, though other specifications are possible and in general it may be empirically defined.
Once values of the data are observed, (11.8) may be treated as a likelihood function for ω and θP. Maximizing this with respect to ω and θP yields MML estimates of these parameters. Note that elements of ω and θP will drop out of the estimation if they are specified in advance
to resolve the indeterminacies. In general, there is no closed form solution, and a variety of
methods have been proposed for evaluating the integrals and maximizing the resulting mar-
ginal likelihood (Bock & Aitkin, 1981; Bock & Moustaki, 2007; Harwell, Baker, & Zwarts, 1988).
The likelihood may be sufficiently complex that the parameters may be poorly deter-
mined from the data. In particular, for 2-P and even more so for 3-P models, discrepant
sets of measurement parameter values might yield quite similar model-implied proba-
bilities, at least over the region(s) of the latent continuum where examinees are located
(Hulin, Lissak, & Drasgow, 1982). As a result, the likelihood surface is nearly flat in cer-
tain regions along one or more dimensions of the parameter space, which may give rise
to problems associated with unstable MML estimates. An implausible set of parameter
values may produce nearly as high a value for the likelihood, or perhaps even a slightly
higher value, than that resulting from a set of plausible parameter values. Moreover, MML
estimates are likely to vary considerably from sample to sample, as the sampling vari-
ability induces ever-so-slight differences in the likelihood surface. In these sorts of cases,
it may be difficult to distinguish between competing sets of values for the measurement
model parameters based on the data alone. Conventional practice often turns to the use
of a prior distribution for some or all of the measurement model parameters that places
more of its density at plausible values to mitigate these problems (Lord, 1986). In a sense,
the prior steps in where the data are equivocal to adjudicate in favor of a priori plausible
sets of parameter values over a priori implausible ones. With priors specified, application
of the same sort of estimation methods for maximizing likelihoods yields Bayes modal
estimates (Mislevy, 1986; see also Harwell & Baker, 1991). This has become fairly common
if not standard practice, and represents the area where Bayesian approaches to calibration
have penetrated operational psychometric practice the most.
Turning to scoring, a variety of approaches have been proposed for estimating examinees' latent variables, including maximum likelihood, weighted likelihood, and Bayesian approaches (Bock & Mislevy, 1982; Embretson & Reise, 2000; Warm, 1989). Of the latter, one option is to obtain the posterior distribution for an examinee's latent variable using point estimates for parameters as in (7.10). Two popular Bayesian methods involve obtaining either the mode (MAP) or the mean (EAP) of the posterior distribution for the examinee's latent variable. These offer advantages in that they yield estimates in situations that pose
challenges for frequentist methods, such as situations where a unique maximum of the
likelihood does not exist (Samejima, 1973; Yen, Burket, & Sykes, 1991), or for examinees
with so-called perfect response patterns of all 1s or 0s. The example in Section11.2.5 illus-
trates this point.
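The advantage for perfect response patterns can be illustrated with a small numerical sketch. The item parameters below are hypothetical 2-PNO values (not those of the example in Section 11.2.5): for an all-correct pattern the likelihood increases without bound in θ, yet the posterior under a N(0, 1) prior has a finite mean (EAP):

```python
import numpy as np
from math import erf

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / np.sqrt(2.0)))

# Hypothetical item parameters for five items, 2-PNO form (c_j = 0).
a = [1.0, 0.8, 1.2, 0.9, 1.1]
d = [1.5, 1.0, 0.5, 1.2, 0.8]

theta = np.linspace(-6, 6, 1201)
prior = np.exp(-0.5 * theta**2)  # N(0, 1) prior, unnormalized

# Likelihood of an all-correct pattern: product of P(x = 1) over items.
# This is monotonically increasing in theta, with no finite maximizer.
like = np.ones_like(theta)
for aj, dj in zip(a, d):
    like *= np.array([phi(aj * t + dj) for t in theta])

# Normalize the posterior on the grid and take its mean (EAP).
post = prior * like
post /= post.sum()
eap = float(np.sum(theta * post))
```

The EAP is finite and modest even though a maximum likelihood estimate for this pattern does not exist; the prior supplies the information the data cannot.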
p(x | θ, ω) = ∏_{i=1}^{n} p(xi | θi, ω) = ∏_{i=1}^{n} ∏_{j=1}^{J} p(xij | θi, ωj),   (11.9)
where P(xij = 1 | θi, ωj) is the item response function, and the ωj are the associated measurement model (item) parameters for the model.
p(θ, ω) = p(θ)p(ω),   (11.11)

which reflects an independence assumption between the examinees' latent variables and the measurement model parameters. Specifying the prior distribution comes to specifying distributions for the terms on the right-hand side of (11.11), and any hyperprior distributions for any unknown hyperparameters introduced in such specifications.
p(θ) = ∏_{i=1}^{n} p(θi | θP),   (11.12)

where θP denotes the hyperparameters for specifying the prior for the θi. Commonly, the latent variable is assumed to be normally distributed, in which case θP = (μθ, σ²θ), and the prior for each examinee's latent variable is

θi | μθ, σ²θ ~ N(μθ, σ²θ),   (11.13)
p(ω) = ∏_{j=1}^{J} p(ωj | ωP),   (11.14)

where ωP denotes the hyperparameters for specifying the prior for the ωj. If these hyperparameters are unknown, they require a hyperprior distribution, p(ωP). We consider this case in Section 11.7.4, but for the moment we focus on issues surrounding the specification of the prior for the measurement model parameters themselves.
In CFA, the two dominant approaches to specifying the prior distribution for the measurement model parameters are the conditionally conjugate priors detailed in Section 9.2
or a slightly more complex prior that preserves full conjugacy. Though other specifications
are possible, these two are popular in part because they yield easily manageable compu-
tational strategies for obtaining the posterior distribution when the data are assumed to
be conditionally normally distributed, as has long been popular in CFA. In IRT, the situ-
ation is more complex. For many formulations of the popular models, there is no conju-
gate or conditionally conjugate prior. In addition, a number of developments in Bayesian
approaches to IRT, including recommendations for prior distributions, occurred prior to
the advent of MCMC when the focus was on employing optimization routines to estimate
posterior modes and the curvature of the posterior. The model specifications that simplify
the computational burden in one of these contexts may not simplify the computational
burdens in others.
In this section, we develop and illustrate a particular set of common choices for prior distributions, introducing an MCMC estimation routine that is flexible enough to handle the situation without conditionally conjugate priors. We return to this issue and discuss alternative prior distributions and associated MCMC routines aligned with them in Sections 11.7 and 11.8.
Assuming a priori independence between the different types of measurement model
parameters, we factor the joint prior for each observables measurement model parameters
j = (dj , a j , c j ) into the product of prior distributions for each parameter individually:
The assumption of exchangeability with respect to the observables supports the specifi-
cation of a common prior for each instance of the different types of measurement model
parameters. The location parameters are continuous and unbounded, and are therefore
typically assigned a normal prior distribution,
dj ~ N ( d , 2d ), (11.16)
where d and 2d are hyperparameters that in the current development are specified by the
analyst. Normal prior distributions are similarly common for b parameters in the formula-
tions of IRT that employ the b parameterization.
260 Bayesian Psychometric Modeling
Note that the hyperparameters in (11.16)–(11.18) are not indexed by j, indicating that each instance of each type of parameter is assigned the same prior, in line with the exchangeability assumption. This can be relaxed to specify group-specific hyperparameters reflecting conditional exchangeability of items, possibly given covariates, or even a unique prior for each parameter. We discuss alternative prior structures and parametric forms in Section 11.7.
where

xij | θi, dj, aj, cj ~ Bernoulli[P(xij = 1 | θi, dj, aj, cj)] for i = 1, …, n; j = 1, …, J,

dj ~ N(μd, σ²d) for j = 1, …, J,

aj ~ N+(μa, σ²a) for j = 1, …, J,

and

cj ~ Beta(αc, βc) for j = 1, …, J,

where N+ denotes a normal distribution truncated below at 0.
[Figure: plates for examinees i = 1, …, n and observables j = 1, …, J; θi and the measurement model parameters dj, aj, and cj point to xij; the hyperparameters μd, σ²d, μa, σ²a, αc, and βc point to dj, aj, and cj, respectively.]
FIGURE 11.2
Directed acyclic graph for a three-parameter item response theory model.
and xj = (x1j, …, xnj) is the collection of n observed values for observable j. Note that the current values for the other unknowns include the just-sampled values for θ from step 1 (from iteration t + 1), and the values of aj and cj from the previous iteration (t).
b. For each observable, sample a uniform variable Uaj ~ Uniform(0, 1) and sample a candidate value for the observable's discrimination parameter aj* from a proposal distribution possibly dependent on the current value aj(t), aj* ~ qaj(aj* | aj(t)), and set aj(t+1) = aj* if αaj ≥ Uaj, where
and xj = (x1j, …, xnj) is the collection of n observed values for observable j. Note that the current values for the other unknowns include the just-sampled values for θ and dj (from iteration t + 1), and the value of cj from the previous iteration (t).
c. For each observable, sample a uniform variable Ucj ~ Uniform(0, 1) and sample a candidate value for the observable's lower asymptote parameter cj* from a proposal distribution possibly dependent on the current value cj(t), cj* ~ qcj(cj* | cj(t)), and set cj(t+1) = cj* if αcj ≥ Ucj, where
A convenient set of choices for the proposal distributions utilizes the following forms:

and

where σ²qdj, σ²qaj, and σ²qcj are parameters that govern the variability of the proposal distributions. These proposal distributions are symmetric with respect to their arguments, and the Metropolis–Hastings steps in the routine above reduce to Metropolis steps, in which case the qdj, qaj, and qcj terms drop out of the calculations of α in (11.21)–(11.23).
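A generic Metropolis step of the kind described above can be sketched in a few lines of Python. The target below is a hypothetical stand-in log posterior for a single discrimination parameter, not the full IRT complete conditional:

```python
import numpy as np

rng = np.random.default_rng(3)

def log_post(a):
    """Stand-in log posterior for a discrimination parameter (a > 0)."""
    if a <= 0:
        return -np.inf  # respects the positivity restriction on a
    return -0.5 * (a - 1.2) ** 2 / 0.1  # hypothetical target, roughly N(1.2, 0.1)

a_current = 1.0
sigma_q = 0.2  # proposal standard deviation (cf. the variance parameters above)
draws = []
for t in range(5000):
    # Symmetric normal proposal centered at the current value, so the
    # q terms cancel and the Metropolis-Hastings ratio reduces to the
    # Metropolis form.
    a_star = rng.normal(a_current, sigma_q)
    log_alpha = log_post(a_star) - log_post(a_current)
    if np.log(rng.uniform()) < log_alpha:
        a_current = a_star   # accept the candidate
    draws.append(a_current)  # otherwise retain the current value
```

Candidates at or below 0 receive log posterior −∞ and are always rejected, which is one simple way to honor the truncation; after discarding early iterations, the retained draws approximate the target distribution.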
TABLE 11.1
Frequency of the 32 Item Response Vectors for the Five-Item LSAT Data Example
Item Item
Vector ID 1 2 3 4 5 Frequency Vector ID 1 2 3 4 5 Frequency
1 0 0 0 0 0 3 17 1 0 0 0 0 10
2 0 0 0 0 1 6 18 1 0 0 0 1 29
3 0 0 0 1 0 2 19 1 0 0 1 0 14
4 0 0 0 1 1 11 20 1 0 0 1 1 81
5 0 0 1 0 0 1 21 1 0 1 0 0 3
6 0 0 1 0 1 1 22 1 0 1 0 1 28
7 0 0 1 1 0 3 23 1 0 1 1 0 15
8 0 0 1 1 1 4 24 1 0 1 1 1 80
9 0 1 0 0 0 1 25 1 1 0 0 0 16
10 0 1 0 0 1 8 26 1 1 0 0 1 56
11 0 1 0 1 0 0 27 1 1 0 1 0 21
12 0 1 0 1 1 16 28 1 1 0 1 1 173
13 0 1 1 0 0 0 29 1 1 1 0 0 11
14 0 1 1 0 1 3 30 1 1 1 0 1 61
15 0 1 1 1 0 2 31 1 1 1 1 0 28
16 0 1 1 1 1 15 32 1 1 1 1 1 298
Source: Bock, R. D., & Lieberman, M. (1970). Fitting a response model for n dichotomously scored items. Psychometrika, 35, 179–197. Used with kind permission from Springer.
where

θi ~ N(0, 1) for i = 1, …, n,

dj ~ N(0, 2) for j = 1, …, J,

aj ~ N+(1, 2) for j = 1, …, J,

and

cj ~ Beta(5, 17) for j = 1, …, J.
As noted in Section 11.2.2, the prior distribution for the latent variables is sufficient to
resolve the location and metric indeterminacies in the latent variable. Note the restriction in
the truncated normal distribution ensures that each aj > 0, embodying the assumption that
the probability of a correct response should be monotonically increasing with proficiency.
This restriction is sufficient to resolve the rotational indeterminacy in the latent variable. The
prior distributions for the d and a parameters are relatively diffuse. We discuss the particular
choice of beta prior distribution for the c parameters in more detail in Section 11.2.6. A more general discussion of specifying the forms and parameters of prior distributions is given in Chapter 3, and Section 11.7 discusses several alternatives popular in IRT.
WinBUGS code for the model and list statements for three sets of initial values for the
measurement model parameters are given as follows.
--------------------------------------------------------------------------
#########################################################################
# Model Syntax
#########################################################################
model{
#########################################################################
# Specify the item response measurement model for the observables
#########################################################################
for (i in 1:n){
for(j in 1:J){
P[i,j] <- c[j]+(1-c[j])*phi(a[j]*theta[i]+d[j])
x[i,j] ~ dbern(P[i,j])
}
}
Item Response Theory 265
#########################################################################
# Specify the prior distribution for the latent variables
#########################################################################
for (i in 1:n){
theta[i] ~ dnorm(0, 1)
}
#########################################################################
# Specify the prior distribution for the measurement model parameters
#########################################################################
for(j in 1:J){
d[j] ~ dnorm(0, .5)
a[j] ~ dnorm(1, .5) I(0,)
c[j] ~ dbeta(5,17)
}
#########################################################################
# Initial values for three chains
#########################################################################
list(d=c(3, 3, 3, 3, 3), a=c(.1, .1, .1, .1, .1), c=c(.05, .05, .05,
.05,.05))
list(d=c(-3, -3, -3, -3, -3), a=c(3, 3, 3, 3, 3), c=c(.5, .5, .5,
.5,.5))
Note that WinBUGS uses the precision as the second argument of the normal distribution. Further, the I(0,) in the line specifying the prior for the a parameters enacts the restriction that the a parameters are positive. The initial values for the measurement model parameters were chosen to represent what we anticipate to be fairly dispersed values for the parameters in the posterior distribution. WinBUGS was used to generate initial values for the latent variables. In this model, WinBUGS uses a slice sampler for the as and the cs, and the Metropolis sampler for the θs and the ds. To complete the specification of the normal proposal distributions, WinBUGS uses the first 4000 iterations as an adaptive phase to select an appropriate variance, which it then uses for the remaining iterations. Iterations from this adaptive phase should be discarded. Upon running the analysis, there was evidence of convergence (see Section 5.7.2) by 4000 iterations. To be conservative, we discarded an additional 2000 iterations as burn-in.
11.2.5.2 Results
The chains exhibited high serial dependence, with nontrivial autocorrelations for several
parameters, particularly those associated with observable (item) 3. To mitigate these effects,
we ran each chain for 20,000 iterations after burn-in, yielding 60,000 iterations for use in
summarizing the posterior distribution. The marginal posterior distributions were unimodal and mostly symmetric, with several of them exhibiting some skewness. Accordingly, we report the median in addition to the mean as summaries of the central tendency of the marginal posterior distributions for the measurement model parameters and, as an example, one of the examinees' latent variables in Table 11.2. The location (d) parameters indicate
TABLE 11.2
Summary of the Marginal Posterior Distributions for the Measurement Model Parameters and One Examinee's Latent Variable (θ1000) for the Three-Parameter Normal Ogive (3-PNO) Model of the Law School Admissions Test Data
Columns: Parameter, Mean, Median, Standard Deviation, 95% Highest Posterior Density Interval. [Table entries not reproduced.]
that the items are fairly easy for these examinees. The hardest item is item 3, which is the only item for which most of the HPD interval for the location parameter is negative. It is
also the most discriminating item. The posterior standard deviations and HPD intervals
indicate that there is considerable uncertainty about many of the parameters. A related point
is that the marginal posteriors for the c parameters are not too different from the Beta(5, 17)
prior distribution (rounding to two decimal places: mean = .23, standard deviation = .09, 95%
highest density interval of (.07, .40)), suggesting that the data are not very informative about
the lower asymptotes, or at least not strongly contradictory to the information in the prior.
This is consistent with the items being relatively easy for these examinees.
The last row in Table 11.2 and Figure 11.3 summarize the marginal posterior distribution for the latent variable for an examinee who correctly answered all the questions. For the sake of comparison, Figure 11.3 also depicts the prior distribution and the likelihood function evaluated using the posterior medians of the measurement model parameters. The likelihood function does not have a finite maximum; it increases as θ → ∞. Conceptually,
ever higher values of represent ever better accounts of this examinee, where better is
interpreted in terms of the information in the data as expressed by the likelihood function.
However, as Figure 11.3 depicts, the marginal posterior distribution does have a maxi-
mum, right around a value of .7. Conceptually, ever higher values of do not represent ever
better accounts of this examinee, where better is now interpreted in light of the informa-
tion in the data as expressed by the likelihood function and the information expressed in
the prior distribution. On the basis of the posterior distribution, we conclude that the value
of for this examinee is likely near .7, and almost certainly not greater than 3. Despite their
responding correctly to all five of the items, we are pretty sure that their proficiency is not
infinite. We suspect analysts using ML would have the same beliefs, and then would face
FIGURE 11.3
Marginal posterior distribution (solid line), prior distribution (dotted line), and likelihood function evaluated
using the posterior medians for the measurement model parameters (dashed line) for the latent variable for an
examinee with all correct responses to the items in the Law School Admissions Test example.
the prospect of reconciling the disparity between their beliefs and the MLE through some post hoc adjustment. An advantage of a Bayesian approach is that the fully probabilistic framework affords a tighter integration among our beliefs, evidence, and statistical model of them. We gain by implementing our beliefs in a statistical framework that includes formal, theory-based model-checking and sensitivity tools.
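The interplay just described can be illustrated numerically. The sketch below uses hypothetical 3-PNO parameter values (illustrative only, not the posterior medians from Table 11.2) and a N(0, 1) prior: the likelihood for an all-correct response pattern increases without bound in $\theta$, but the posterior kernel attains an interior maximum.

```python
import math

def Phi(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical 3-PNO parameter values (illustrative, not from Table 11.2)
a = [0.8, 0.7, 1.0, 0.6, 0.9]
d = [1.2, 0.9, -0.2, 1.5, 1.1]
c = [0.2] * 5

def likelihood_all_correct(theta):
    # Likelihood for an examinee who answered all five items correctly
    p = 1.0
    for aj, dj, cj in zip(a, d, c):
        p *= cj + (1.0 - cj) * Phi(aj * theta + dj)
    return p

def posterior_kernel(theta):
    # N(0, 1) prior density times the likelihood (unnormalized posterior)
    prior = math.exp(-0.5 * theta ** 2) / math.sqrt(2.0 * math.pi)
    return prior * likelihood_all_correct(theta)

grid = [i / 100.0 for i in range(-400, 401)]
# The likelihood increases monotonically in theta ...
assert likelihood_all_correct(3.0) > likelihood_all_correct(1.0) > likelihood_all_correct(-1.0)
# ... but the posterior kernel has an interior maximum.
theta_map = max(grid, key=posterior_kernel)
```

With these hypothetical values the maximizing `theta_map` sits well inside the grid, echoing the qualitative point of Figure 11.3.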
FIGURE 11.4
Observed and posterior predicted raw score distributions for the Law School Admissions Test example. At each
raw score, the box depicts the interquartile range, the notch in the middle depicts the median, and the whiskers
depict the 2.5th and 97.5th percentiles. The points indicate the frequency in the observed data.
At each raw score, the boxplot summarizes the posterior predictive distribution of the number of examinees with that particular raw score. The points indicate the number of examinees with that particular raw score in the observed data; line segments connect the points as a visual aid.
We employ a similar representation in Figure 11.5 to investigate the adequacy of item
fit (Sinharay, 2006a). Here, a boxplot depicts the posterior predictive distribution of the
proportion of examinees that respond correctly to each item at each raw score. Observed
proportions are plotted as points, with line segments connecting them. For both the distri-
bution of raw scores and the proportion correct given the raw score, the observed values are
well within the posterior predictive distributions, supporting the notion that the model fits
adequately in terms of accounting for the distributions of individual items and raw scores.
Two important and related aspects of model-data fit in IRT concern the assumptions of
local independence and dimensionality, the latter of which is typically framed as whether
the assumed number of latent variables is adequate. To pursue possibilities of local dependence, we employ the standardized model-based covariance (SMBC; Levy et al., 2015). For the pairing of observables (items) j and j′,

$$\mathrm{SMBC}_{jj'} = \frac{\sum_{i=1}^{n} \big(x_{ij} - E(x_{ij} \mid \theta_i, \omega_j)\big)\big(x_{ij'} - E(x_{ij'} \mid \theta_i, \omega_{j'})\big)\big/n}{\sqrt{\sum_{i=1}^{n} \big(x_{ij} - E(x_{ij} \mid \theta_i, \omega_j)\big)^2\big/n}\;\sqrt{\sum_{i=1}^{n} \big(x_{ij'} - E(x_{ij'} \mid \theta_i, \omega_{j'})\big)^2\big/n}},$$
where $E(x_{ij} \mid \theta_i, \omega_j)$ is the conditional expectation of the value of the observable, given by the item response function. In the current context, this is the 3-PNO, and correspondingly $\omega_j = (d_j, a_j, c_j)$. SMBC may be interpreted as a conditional correlation among the observables. To evaluate the adequacy of the assumed unidimensionality of the model, we employ the standardized generalized dimensionality discrepancy measure (SGDDM; Levy et al., 2015), which is the mean of the absolute values of SMBC over the unique pairs of observables,
FIGURE 11.5
Item fit plots for the Law School Admissions Test example with observed and posterior predicted proportion correct by raw score. At each raw score, the box depicts the interquartile range, the notch in the middle depicts the median, and the whiskers depict the 2.5th and 97.5th percentiles. The points indicate the observed proportions.
$$\mathrm{SGDDM} = \frac{\sum_{j' > j} \left|\mathrm{SMBC}_{jj'}\right|}{J(J-1)/2},$$
and is interpreted as the average magnitude of the conditional correlations among the observ-
ables. Scatterplots of the realized and posterior predicted values for SGDDM and SMBC are
given in Figures11.6 and 11.7. The results for SGDDM indicate that, overall, the associations
among the observables are well accounted for by the model. The realized values are small and
commensurate with the posterior predicted values from the model. The scatterplots for SMBC
reveal much the same at the level of the observable-pairs, with realized conditional associa-
tions varying around 0 in ways consistent with the posterior predicted values.
The overall finding that the model fits well with respect to the aspects investigated here
is unsurprising given the complexity of the 3-PNO and the small number of items.
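The computation of SMBC and SGDDM can be sketched directly from the formulas. The following is a minimal illustration (not the authors' code): it takes a response matrix `x` and a same-shaped matrix `E` of model-implied conditional expectations (e.g., evaluated at a single posterior draw of the latent variables and item parameters).

```python
def smbc(x, E, j, jp):
    # Standardized model-based covariance for observables j and jp:
    # model-based covariance divided by the model-based standard deviations.
    # Assumes residual variances are nonzero for the pair.
    n = len(x)
    num = sum((x[i][j] - E[i][j]) * (x[i][jp] - E[i][jp]) for i in range(n)) / n
    vj = sum((x[i][j] - E[i][j]) ** 2 for i in range(n)) / n
    vjp = sum((x[i][jp] - E[i][jp]) ** 2 for i in range(n)) / n
    return num / (vj ** 0.5 * vjp ** 0.5)

def sgddm(x, E):
    # Mean absolute SMBC over the unique pairs of observables
    J = len(x[0])
    pairs = [(j, jp) for j in range(J) for jp in range(j + 1, J)]
    return sum(abs(smbc(x, E, j, jp)) for j, jp in pairs) / len(pairs)
```

In a posterior predictive check, `sgddm` would be evaluated once with the realized data and once with posterior predicted data at each retained MCMC draw, producing the scatterplots of Figures 11.6 and 11.7.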
FIGURE 11.6
Scatterplot of the realized and posterior predicted values of the standardized generalized dimensionality dis-
crepancy measure (SGDDM) for the Law School Admissions Test example with the unit line.
FIGURE 11.7
Scatterplots of realized and posterior predicted values of the standardized model-based covariance (SMBC) for
item-pairs from the Law School Admissions Test example, with the unit line.
guessing on a question that has five options would be correct with probability .2, and then
build a beta distribution around that anchoring idea. A Beta(5, 17) may be interpreted as akin to having seen 5 − 1 = 4 correct guesses in 5 + 17 − 2 = 20 attempts, expressing a belief that the probability should be about .2, and weighting that belief as if it were based on 20 observations. It is fair to say that such reasoning could be accomplished even without high levels of familiarity with the features of the assessment situation such as examinee capabilities, item properties, or history of the assessment.
A related strategy does capitalize on understanding these issues, usually in the form of quantifying the beliefs of subject matter experts who have knowledge about the situation being modeled. Strategies for eliciting and quantifying such expertise are an emerging issue; references to examples in the broader Bayesian literature were given in Section 3.7. Almond (2010); Almond, Mislevy, Steinberg, Yan, and Williamson (2015); Jiang et al. (2014); Novick and Jackson (1974); and Tsutakawa (1992) presented conceptual and computational strategies for eliciting and encoding information to aid in specifying prior distributions for measurement model parameters in psychometric models.
A different strategy draws on past research. If previously calibrated items like the ones in question have tended to yield certain c values, then the current items will likely yield similar c values. In the context of our current example, if analyses of response
data of previous items from section 6 of the LSAT had yielded c values around .2, that
would be justification for using a Beta(5,17) prior. Note that there is an exchangeability
assumption lurking in saying that the current items are like other items. Indeed, there are several exchangeability assumptions lurking, for instance about the other examinees that took those other items, and the context of those interactions, such as the stakes involved. It is an open question just how similar other situations should be in order to be relevant for the current one. Or to put it another way, it is not always obvious when differences make a difference. Should we consider the results for these examinees taking other items that have appeared on section 6 of the LSAT? Other examinees taking the items of current interest? Other examinees taking other items? What about other sections of the LSAT, or other assessments? The answer to these questions is that there is no right answer. These are judgments and decisions that need to be made by the analyst, just like the judgments and decisions that are made in specifying other aspects of the model.
These strategies may be combined. If there are five response options for an item, reason
and subject matter expertise usually suggest that the chances of guessing the correct one
are probably around .2. If other items for which we have estimates of the lower asymptote
yielded values between .1 and .4, then the lower asymptote for these items is probably
between .1 and .4. These considerations suggest specifying a prior that places higher prob-
abilities for relatively lower values. A Beta(5,17) distribution captures the thrust of these
rationales, giving them weight akin to having seen four correct guesses in 20 attempts.
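The properties of the Beta(5, 17) prior invoked in these rationales are easy to verify with a quick check using only closed-form moments:

```python
import math

a, b = 5.0, 17.0  # Beta(5, 17) prior for a lower asymptote c

mean = a / (a + b)                                    # about .227
sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))  # about .087
mode = (a - 1) / (a + b - 2)                          # exactly .2

# "Pseudo-count" reading: (a - 1) successes in (a + b - 2) trials,
# i.e., 4 correct guesses in 20 attempts
successes, trials = a - 1, a + b - 2
```

The mode lands exactly on the anchoring value of .2, and the pseudo-counts recover the "4 correct guesses in 20 attempts" reading of the prior's weight.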
$$P(x_{ij} = k \mid \theta_i, d_j, a_j) = P(x_{ij} \ge k \mid \theta_i, d_{jk}, a_j) - P(x_{ij} \ge k+1 \mid \theta_i, d_{j(k+1)}, a_j), \quad (11.27)$$
where $d_j = (d_{j1}, \ldots, d_{jK_j})$ is a collection of location parameters and $P(x_{ij} \ge k \mid \theta_i, d_{jk}, a_j)$ is the conditional probability of the value being at or above a particular category k. This is given by a 2-P structure,
$$P(x_{ij} \ge k \mid \theta_i, d_{jk}, a_j) = F(a_j \theta_i + d_{jk}). \quad (11.28)$$
For each observable, the model consists of expressions in (11.27) for $k = 1, \ldots, K_j - 1$. For the highest response category, $P(x_{ij} = K_j \mid \theta_i, d_j, a_j) = P(x_{ij} \ge K_j \mid \theta_i, d_{jK_j}, a_j)$. Conceptually, the subtraction in (11.27) is not needed, as the probability of a value being at or above a (hypothetical, and in some sense counterfactual) category above the highest category is tautologically equal to 0. Similarly, the probability of a value being at or above the lowest category is tautologically equal to 1, which may be seen as setting $d_{j1} = \infty$. Non-negativity of the response probabilities is preserved by restricting $d_{jk} \ge d_{j(k+1)}$.
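As an illustration of the construction in (11.27), the short function below (an illustrative sketch, using the logistic form for F) computes the category probabilities for one observable from the cumulative probabilities. The probabilities are non-negative when the location parameters are decreasing, and they sum to 1 by the telescoping structure.

```python
import math

def grm_probs(theta, a, d):
    """Category probabilities for one observable under the (logistic) GRM.
    `d` holds the location parameters (d_2, ..., d_K), which must be
    decreasing so that the differences in (11.27) are non-negative."""
    def F(z):
        return 1.0 / (1.0 + math.exp(-z))
    K = len(d) + 1                                    # number of categories
    # P(x >= k) for k = 1..K, with P(x >= 1) = 1
    p_gte = [1.0] + [F(a * theta + dk) for dk in d]
    probs = []
    for k in range(K):
        # P(x = k) = P(x >= k) - P(x >= k+1); the subtrahend is 0 for k = K
        nxt = p_gte[k + 1] if k + 1 < K else 0.0
        probs.append(p_gte[k] - nxt)
    return probs
```

For example, `grm_probs(0.5, 1.2, [2.0, 0.5, -1.0, -2.0])` returns five non-negative probabilities that sum to 1.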
Much of what was described for conventional modeling of dichotomous observables
holds for polytomous observables. With these specifications, no additional indetermina-
cies in the latent variable are introduced* and the conventional approaches to calibration
and scoring are similarly applicable save that the conditional probability for the value of
an observable in (11.6) is
$$P(x_{ij} \mid \theta_i, \omega_j) = \prod_{k=1}^{K_j} P(x_{ij} = k \mid \theta_i, \omega_j)^{I(x_{ij} = k)}, \quad (11.29)$$
where I is the indicator function that here takes on a value of 1 when the response from
examinee i for observable j is k, and 0 otherwise.
$$p(x \mid \theta, \omega) = \prod_{i=1}^{n} p(x_i \mid \theta_i, \omega) = \prod_{i=1}^{n} \prod_{j=1}^{J} p(x_{ij} \mid \theta_i, \omega_j), \quad (11.30)$$
where $P(x_{ij} \mid \theta_i, \omega_j) = \big(P(x_{ij} = 1 \mid \theta_i, \omega_j), \ldots, P(x_{ij} = K_j \mid \theta_i, \omega_j)\big)$ is the collection of probabilities for the $K_j$ possible values for observable j, and $\omega_j$ are the associated measurement model (item) parameters. In the GRM, the $P(x_{ij} = k \mid \theta_i, \omega_j)$ are given by (11.27), which involve (11.28), and $\omega_j = (d_j, a_j)$.
* For certain polytomous IRT models, additional indeterminacies are present when considering models with multiple latent variables (Section 11.5), though we do not discuss them here (see Reckase, 2009).
and (c) a truncated normal prior distribution for the discriminations, as expressed in (11.17). Thus, the lone difference from the case of dichotomous observables is the specification of the now multiple location parameters for each observable. Again, we will employ normal distributions for the location parameters, though the key issue here is imposing the restriction that $d_{jk} \ge d_{j(k+1)}$.
Recalling that the probability of a value being at or above the lowest category is equal to 1 may be seen as setting $d_{j1} = \infty$, we begin with the prior for $d_{j2}$ and adopt a normal prior as in the case of dichotomous data,
$$d_{j2} \sim N(\mu_{d_2}, \sigma^2_{d_2}), \quad (11.33)$$
and
$$d_{jk} \sim N^{<d_{j(k-1)}}(\mu_{d_k}, \sigma^2_{d_k}) \quad \text{for } k = 3, \ldots, K_j, \quad (11.34)$$
where $N^{<d_{j(k-1)}}$ denotes the normal distribution truncated to be less than $d_{j(k-1)}$. Note that
if the b parameterization is used, the model would restrict each location parameter
to be greater than the location parameter for the preceding category (e.g., Zhu & Stone,
2011).
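Drawing from the priors in (11.33) and (11.34) can be sketched with simple rejection sampling. The prior means below are hypothetical (chosen to echo the example that follows), and the truncation guarantees the ordering of the thresholds by construction.

```python
import math
import random

def truncated_normal_below(mu, sigma, upper):
    """Draw from N(mu, sigma^2) truncated to (-inf, upper) by rejection;
    fine for illustration when the bound is not far in the tail."""
    while True:
        z = random.gauss(mu, sigma)
        if z < upper:
            return z

def draw_thresholds(mus, sigma=math.sqrt(2.0)):
    """Draw d_{j2}, ..., d_{jK} as in (11.33)-(11.34): d_{j2} unrestricted,
    each later d_{jk} truncated to lie below d_{j(k-1)}.
    `mus` are hypothetical prior means, e.g., (2, 1, -1, -2)."""
    d = [random.gauss(mus[0], sigma)]
    for mu in mus[1:]:
        d.append(truncated_normal_below(mu, sigma, d[-1]))
    return d
```

Every draw from `draw_thresholds` satisfies the decreasing-threshold restriction regardless of the prior means supplied.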
The joint posterior distribution is
$$p(\theta, d, a \mid x) \propto \prod_{i=1}^{n} \prod_{j=1}^{J} p(x_{ij} \mid \theta_i, d_j, a_j) \prod_{i=1}^{n} p(\theta_i) \prod_{j=1}^{J} \left[ p(a_j) \prod_{k=2}^{K_j} p(d_{jk}) \right], \quad (11.35)$$
FIGURE 11.8
Directed acyclic graph for the graded response model.
where
$$x_{ij} \mid \theta_i, d_j, a_j \sim \mathrm{Categorical}\big(P(x_{ij} \mid \theta_i, d_j, a_j)\big) \quad \text{for } i = 1, \ldots, n, \; j = 1, \ldots, J,$$
$$P(x_{ij} \mid \theta_i, d_j, a_j) = \big(P(x_{ij} = 1 \mid \theta_i, d_j, a_j), \ldots, P(x_{ij} = K_j \mid \theta_i, d_j, a_j)\big) \quad \text{for } i = 1, \ldots, n, \; j = 1, \ldots, J,$$
$$P(x_{ij} = k \mid \theta_i, d_j, a_j) = P(x_{ij} \ge k \mid \theta_i, d_{jk}, a_j) - P(x_{ij} \ge k+1 \mid \theta_i, d_{j(k+1)}, a_j)$$
$$\text{for } i = 1, \ldots, n, \; j = 1, \ldots, J, \; k = 1, \ldots, K_j - 1,$$
$$P(x_{ij} = K_j \mid \theta_i, d_j, a_j) = P(x_{ij} \ge K_j \mid \theta_i, d_{jK_j}, a_j) \quad \text{for } i = 1, \ldots, n, \; j = 1, \ldots, J,$$
$$P(x_{ij} \ge k \mid \theta_i, d_{jk}, a_j) = F(a_j \theta_i + d_{jk}) \quad \text{for } i = 1, \ldots, n, \; j = 1, \ldots, J, \; k = 2, \ldots, K_j,$$
$$P(x_{ij} \ge 1) = 1 \quad \text{for } i = 1, \ldots, n, \; j = 1, \ldots, J,$$
$$\theta_i \mid \mu_\theta, \sigma^2_\theta \sim N(\mu_\theta, \sigma^2_\theta) \quad \text{for } i = 1, \ldots, n,$$
$$a_j \mid \mu_a, \sigma^2_a \sim N^+(\mu_a, \sigma^2_a) \quad \text{for } j = 1, \ldots, J,$$
$$d_{j2} \sim N(\mu_{d_2}, \sigma^2_{d_2}) \quad \text{for } j = 1, \ldots, J,$$
and
$$d_{jk} \sim N^{<d_{j(k-1)}}(\mu_{d_k}, \sigma^2_{d_k}) \quad \text{for } j = 1, \ldots, J, \; k = 3, \ldots, K_j.$$
To conduct MCMC estimation of the posterior distribution, the general structure of the Metropolis-Hastings-within-Gibbs approach outlined in Section 11.2.4 may be applied. The key differences are that here we have multiple location parameters per observable, and no lower asymptote. Alternatively, we might employ a sampler that capitalizes on the latent response variable formulations outlined in Section 11.8 (see Fox, 2010, who discusses such an approach using bounded uniform priors for the location parameters). Finally, we note that WinBUGS uses an adaptive rejection sampler for this model (Gilks & Wild, 1992).
For the current example, the posterior distribution is
$$p(\theta, d, a \mid x) \propto \prod_{i=1}^{500} \prod_{j=1}^{7} p(x_{ij} \mid \theta_i, d_j, a_j) \prod_{i=1}^{500} p(\theta_i) \prod_{j=1}^{7} \left[ p(a_j) \prod_{k=2}^{5} p(d_{jk}) \right],$$
where
$$x_{ij} \mid \theta_i, d_j, a_j \sim \mathrm{Categorical}\big(P(x_{ij} \mid \theta_i, d_j, a_j)\big) \quad \text{for } i = 1, \ldots, 500, \; j = 1, \ldots, 7,$$
$$P(x_{ij} \mid \theta_i, d_j, a_j) = \big(P(x_{ij} = 1 \mid \theta_i, d_j, a_j), \ldots, P(x_{ij} = 5 \mid \theta_i, d_j, a_j)\big) \quad \text{for } i = 1, \ldots, 500, \; j = 1, \ldots, 7,$$
$$P(x_{ij} = k \mid \theta_i, d_j, a_j) = P(x_{ij} \ge k \mid \theta_i, d_{jk}, a_j) - P(x_{ij} \ge k+1 \mid \theta_i, d_{j(k+1)}, a_j)$$
$$\text{for } i = 1, \ldots, 500, \; j = 1, \ldots, 7, \; k = 1, \ldots, 4,$$
$$P(x_{ij} = 5 \mid \theta_i, d_j, a_j) = P(x_{ij} \ge 5 \mid \theta_i, d_{j5}, a_j) \quad \text{for } i = 1, \ldots, 500, \; j = 1, \ldots, 7,$$
$$P(x_{ij} \ge k \mid \theta_i, d_{jk}, a_j) = \frac{\exp(a_j \theta_i + d_{jk})}{1 + \exp(a_j \theta_i + d_{jk})} \quad \text{for } i = 1, \ldots, 500, \; j = 1, \ldots, 7, \; k = 2, \ldots, 5,$$
$$P(x_{ij} \ge 1) = 1 \quad \text{for } i = 1, \ldots, 500, \; j = 1, \ldots, 7,$$
$$\theta_i \mid \mu_\theta, \sigma^2_\theta \sim N(0, 1) \quad \text{for } i = 1, \ldots, 500,$$
$$a_j \mid \mu_a, \sigma^2_a \sim N^+(1, 2) \quad \text{for } j = 1, \ldots, 7,$$
$$d_{j2} \sim N(2, 2) \quad \text{for } j = 1, \ldots, 7,$$
$$d_{j3} \sim N^{<d_{j2}}(1, 2) \quad \text{for } j = 1, \ldots, 7,$$
$$d_{j4} \sim N^{<d_{j3}}(-1, 2) \quad \text{for } j = 1, \ldots, 7,$$
and
$$d_{j5} \sim N^{<d_{j4}}(-2, 2) \quad \text{for } j = 1, \ldots, 7.$$
WinBUGS code for the model and a list statement for three sets of initial values for the
measurement model parameters are given as follows.
--------------------------------------------------------------------------
#########################################################################
# Model Syntax
#########################################################################
model{
#########################################################################
# Specify the item response measurement model for the observables
#########################################################################
for (i in 1:n){
for(j in 1:J){
#####################################################################
# Specify the probabilities of a value being greater than or
# equal to each category
#####################################################################
for(k in 2:(K[j])){
P.gte[i,j,k] <-
exp(a[j]*theta[i]+d[j,k])/(1+exp(a[j]*theta[i]+d[j,k]))
}
P.gte[i,j,1] <- 1
#####################################################################
# Specify the probabilities of a value being equal to each
# category
#####################################################################
for(k in 1:(K[j]-1)){
P[i,j,k] <- P.gte[i,j,k]-P.gte[i,j,k+1]
}
P[i,j,K[j]] <- P.gte[i,j,K[j]]
#####################################################################
# Specify the distribution for each observable
#####################################################################
x[i,j] ~ dcat(P[i,j,1:K[j]])
}
}
#########################################################################
# Specify the prior distribution for the latent variables
#########################################################################
for (i in 1:n){
theta[i] ~ dnorm(0, 1)
}
#########################################################################
# Specify the prior distribution for the measurement model parameters
#########################################################################
for(j in 1:J){
d[j,2] ~ dnorm(2, .5)
d[j,3] ~ dnorm(1, .5) I(d[j,4],d[j,2])
d[j,4] ~ dnorm(-1, .5) I(d[j,5],d[j,3])
d[j,5] ~ dnorm(-2, .5) I( ,d[j,4])
a[j] ~ dnorm(1, .5) I(0,)
}
#########################################################################
# Initial values for three chains
#########################################################################
list(d= structure(.Data= c(
NA, 3, 1, 0, -1,
NA, 3, 1, 0, -1,
NA, 3, 1, 0, -1,
NA, 3, 1, 0, -1,
NA, 3, 1, 0, -1,
NA, 3, 1, 0, -1,
NA, 3, 1, 0, -1), .Dim=c(7, 5)),
a=c(.1, .1, .1, .1, .1, .1, .1))
list(d= structure(.Data= c(
NA, 2, 0, -1, -2,
NA, 2, 0, -1, -2,
NA, 2, 0, -1, -2,
NA, 2, 0, -1, -2,
NA, 2, 0, -1, -2,
NA, 2, 0, -1, -2,
NA, 2, 0, -1, -2), .Dim=c(7, 5)),
a=c(3, 3, 3, 3, 3, 3, 3))
list(d= structure(.Data= c(
NA, 1, -1, -2, -3,
NA, 1, -1, -2, -3,
NA, 1, -1, -2, -3,
NA, 1, -1, -2, -3,
NA, 1, -1, -2, -3,
NA, 1, -1, -2, -3,
NA, 1, -1, -2, -3), .Dim=c(7, 5)),
a=c(1, 1, 1, 1, 1, 1, 1))
--------------------------------------------------------------------------
The line for d[j,5] expresses the prior distribution for $d_{j5}$. The use of I(,d[j,4]) imposes the boundary that $d_{j5} < d_{j4}$. Note that there is no lower bound here. In the line for d[j,4], the I(d[j,5],d[j,3]) imposes the boundary. In particular, the second argument imposes the boundary that $d_{j4} < d_{j3}$, in accordance with the specification of the model and Figure 11.8. Though conceptually this is sufficient, WinBUGS also requires the first argument as a lower bound to honor the constraint that $d_{j5} < d_{j4}$, even though that is imposed in the line for d[j,5] as just discussed. The line for d[j,3] is structured similarly.
The list statements with the initial values also deserve some comment. The d parameters are arranged as a matrix of J = 7 rows and K = 5 columns. For each observable (row), the first entry is NA, as the $d_{j1}$ parameters are not really in the model (recall, they may be conceived of as all being infinite). The remaining entries for each observable are numeric, giving the initial values for $d_{j2}, \ldots, d_{j5}$.
11.4.4.2 Results
The model was run with three chains as just described using WinBUGS to generate initial
values for the latent variables. There is evidence of convergence (see Section 5.7.2) within a
few hundred iterations. To be conservative, we discarded 1,000 iterations as burn-in, and
ran each chain for 15,000 more iterations after burn-in, yielding 45,000 iterations for use in
summarizing the posterior distribution. The marginal posterior densities were unimodal
and fairly symmetric. Summary statistics for these marginal posterior distributions are reported in Table 11.3 and may be used to characterize the results. For each item, the location parameters are fairly well spread out over the latent continuum. And for each item,
TABLE 11.3
Summary of the Marginal Posterior Distribution for the Measurement Model Parameters from the Logistic Graded Response Model of the Peer Interaction Data
Parameter | Mean | SD | 95% HPD Interval
the extreme location parameters ($d_{j2}$ and $d_{j5}$) have higher posterior standard deviations than those in the middle, reflecting relatively less information in accordance with fewer examinees using the extreme response categories.
The expansion to multiple latent variables and multiple discrimination parameters per
observable necessitates an expansion of the prior specifications for these unknowns. For
the latent variables, the situation is akin to that of CFA with multiple latent variables, and
we may specify a multivariate normal prior distribution,
$$p(\theta) = \prod_{i=1}^{n} p(\theta_i), \quad \theta_i \sim N(\mu_\theta, \Sigma_\theta), \quad (11.37)$$
with normal priors for the elements of the mean vector,
$$p(\mu_\theta) = \prod_{m=1}^{M} p(\mu_{\theta_m}), \quad \mu_{\theta_m} \sim N(\mu_{\mu_\theta}, \sigma^2_{\mu_\theta}), \quad (11.38)$$
and a suitable prior (e.g., inverse-Wishart) for $\Sigma_\theta$.
For the multiple discrimination parameters, we may again draw upon the connection to the
specifications introduced for the multiple loadings in CFA and specify normal distributions,
$$p(a_j) = \prod_{m=1}^{M} p(a_{jm}), \quad a_{jm} \sim N^+(\mu_a, \sigma^2_a), \quad (11.40)$$
now with the positivity restriction as in the unidimensional IRT models. The 2-P versions of this type of MIRT model have been extended to modeling polytomous observables. A multidimensional GRM is specified by replacing (11.28) with
$$P(x_{ij} \ge k \mid \theta_i, d_{jk}, a_j) = F\left(\sum_{m=1}^{M} a_{jm} \theta_{im} + d_{jk}\right). \quad (11.41)$$
The preceding MIRT models may be referred to as compensatory models, which reflects
that high values on one latent variable may compensate for low values on other latent
variables to yield a high probability of a correct response. Conjunctive MIRT models spec-
ify the response probability in such a way that low values on one latent variable cannot
be compensated by high values on another; rather, high values on all latent variables are needed to yield a high probability of a correct response. This is typically operationalized by specifying the response function as a product of probabilities. Although we could employ the normal ogive form and/or the $a\theta + d$ parameterization, this is typically done using logistic versions of one-parameter structures with the b parameterization (Embretson, 1984, 1997; Whitely, 1980),
$$P(x_{ij} = 1 \mid \theta_i, b_j) = \prod_{m=1}^{M} \frac{\exp(\theta_{im} - b_{jm})}{1 + \exp(\theta_{im} - b_{jm})}, \quad (11.42)$$
where $b_j = (b_{j1}, \ldots, b_{jM})$ is the collection of M location parameters for observable j, and $b_{jm}$ is interpreted as the difficulty parameter for component m of the item.
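The compensatory/conjunctive contrast can be made concrete with a small sketch (hypothetical parameter values; the conjunctive form follows (11.42)): with one high and one low latent variable, the compensatory model can still yield a moderate response probability, while the conjunctive product cannot.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def compensatory(theta, a, d):
    # 2-P compensatory MIRT: a single logistic of a weighted sum plus location
    return logistic(sum(am * tm for am, tm in zip(a, theta)) + d)

def conjunctive(theta, b):
    # Conjunctive (noncompensatory) model as in (11.42): a product of
    # 1-P logistic terms, one per component
    p = 1.0
    for tm, bm in zip(theta, b):
        p *= logistic(tm - bm)
    return p

mixed = (3.0, -3.0)      # high on one dimension, low on the other
both_high = (3.0, 3.0)   # high on both dimensions
```

Here `compensatory(mixed, (1.0, 1.0), 0.0)` is 0.5 because the high value offsets the low one, whereas `conjunctive(mixed, (0.0, 0.0))` is under .05: the weak component caps the product.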
The expansion to multiple location parameters per observable calls for an expansion of
the prior specifications for these unknowns. Following an assumption of exchangeability
with respect to how difficult each item is along each dimension, we may assign a common
prior to each of the location parameters. As in previous situations, normal distributions are
a natural choice for these parameters,
$$b_{jm} \sim N(\mu_b, \sigma^2_b), \quad (11.43)$$
1. Select the best item to administer to the examinee based on beliefs about the examinee's proficiency.
2. Administer the item and capture the examinee's response/behavior (via computer).
Linden & Pashley, 2010) as well as supporting or follow-up investigations (Bradlow, Weiss, & Cho, 1998). As noted above, we can update the distribution for $\theta$ in step 4 via Bayes' theorem. We may select an item in step 1 based on Bayesian arguments, such as minimizing the expected posterior variance of $\theta$, or optimizing other criteria. And we can define a stopping rule based on when the posterior variance or standard deviation is sufficiently small.
Third, though the Bayesian machinery just mentioned is a fruitful way to enact these steps in CAT, one may adopt frequentist machinery to these ends. Instead of using Bayes' theorem in step 4, we might use ML estimation of $\theta$ once at least one correct and one incorrect response is observed. We may select the next item based on maximizing the item information function, and define a stopping rule based on the standard error being sufficiently small. And yet our CAT process will still have undertones that strongly align with Bayesian inference, namely the logic of updating what we believe about the examinee in light of what was just observed, and preparing to update that further in light of new information that is to come.
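A minimal grid-based sketch of the Bayesian CAT loop just described follows. Everything is hypothetical and illustrative (a 2-PL item pool, deterministic simulated responses for reproducibility); operational CAT systems are far more elaborate.

```python
import math

GRID = [g / 10.0 for g in range(-40, 41)]  # grid approximation for theta

def p_correct(theta, a, b):
    # 2-PL in the b parameterization
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def normalize(w):
    s = sum(w)
    return [wi / s for wi in w]

def posterior_var(post):
    m = sum(t * p for t, p in zip(GRID, post))
    return sum((t - m) ** 2 * p for t, p in zip(GRID, post))

def update(post, item, x):
    # Step 4: Bayes' theorem with the item response as evidence
    a, b = item
    like = [p_correct(t, a, b) if x == 1 else 1.0 - p_correct(t, a, b) for t in GRID]
    return normalize([p * l for p, l in zip(post, like)])

def expected_post_var(post, item):
    # Pre-posterior analysis: average the post-response variance over the
    # predictive probabilities of the two possible responses
    a, b = item
    p1 = sum(p * p_correct(t, a, b) for t, p in zip(GRID, post))
    return (p1 * posterior_var(update(post, item, 1))
            + (1.0 - p1) * posterior_var(update(post, item, 0)))

# Hypothetical item pool and simulated examinee
pool = [(1.5, b / 2.0) for b in range(-4, 5)]
true_theta = 0.8
post = normalize([math.exp(-0.5 * t * t) for t in GRID])  # N(0, 1) prior

# Steps 1-4, repeated until the posterior SD is small or the pool is empty
while posterior_var(post) ** 0.5 > 0.6 and pool:
    item = min(pool, key=lambda it: expected_post_var(post, it))     # step 1
    pool.remove(item)
    x = 1 if p_correct(true_theta, *item) > 0.5 else 0               # step 2
    post = update(post, item, x)                                     # step 4
```

The selection rule minimizes the expected posterior variance, and the stopping rule monitors the posterior standard deviation, mirroring the Bayesian arguments in the text.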
$$\theta_i \mid \gamma, \sigma^2_\theta, y_i \sim N(y_i'\gamma, \sigma^2_\theta), \quad (11.44)$$
$$d_j \mid \gamma_d, \sigma^2_d, w_j \sim N(w_j'\gamma_d, \sigma^2_d), \quad (11.45)$$
* In this and the following section, our focus is on the mechanics of incorporating collateral information about examinees into the model. The propriety of doing so is a situation-specific question that involves the purposes and values associated with the model and the desired inferences, points we take up in more detail in Section 15.2.
To accomplish this, NAEP and other large-scale surveys employ plausible value methodology (Mislevy, 1991; Mislevy et al., 1992; Mislevy, Johnson, & Muraki, 1992). For each examinee, a set of plausible values is obtained via draws from the posterior distribution for the examinee's latent variable(s). For simplicity, we consider the case of a single latent variable. For each examinee, we construct the posterior distribution for the latent variable conditional on the observables for the examinee $x_i$ and collateral information $y_i$ if the latter are available,
$$p(\theta_i \mid x_i, y_i) \propto p(x_i \mid \theta_i, y_i)\,p(\theta_i \mid y_i) = p(x_i \mid \theta_i)\,p(\theta_i \mid y_i), \quad (11.46)$$
where $p(\theta_i \mid y_i)$ is the prior distribution for $\theta_i$, here formulated as conditional on the collateral variables $y_i$, and the second equality follows from the conditional independence assumption. Plausible value methodology involves taking draws from such posterior distributions.
Analyzing the draws supports inferences at the group level; details can be found in the cited sources. For our purposes, we call out a few key points. On the computational front, the draws for the values of the $\theta$s in any one iteration of an MCMC sampler represent a set of plausible values for the examinees' $\theta$s. Multiple draws, from multiple iterations of the MCMC sampler, constitute multiple plausible values for the examinees' $\theta$s. Conceptually, the draws (plausible values) are taken from the posterior distributions for individuals' $\theta$s, building in everything we know, believe, and have estimated from data about their relationships with all the collateral variables in the model, in addition to the observables. Finally, we note that, with respect to the group-level distribution of latent variables, the individuals' latent variables may be conceived as missing data. Cast in this light, plausible value methodology is an instance of using Rubin's (1987) multiple imputation approach to missing data analysis (Mislevy et al., 1992), a topic we take up in more detail in Chapter 12.
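A grid-based sketch of drawing plausible values for one examinee can make the mechanics concrete. This assumes a 2-PL likelihood with known item parameters and a latent-regression prior as in (11.44); all values are hypothetical, and operational programs are considerably more elaborate.

```python
import math
import random

def pv_draws(x, items, y, gamma, sigma, n_draws=5, grid_step=0.05):
    """Plausible values for one examinee: draws from the grid-approximated
    posterior p(theta | x, y) of (11.46), with a latent-regression prior
    N(y'gamma, sigma^2) and a 2-PL likelihood treating item parameters
    (a, b pairs in `items`) as known."""
    grid = [g * grid_step for g in range(-120, 121)]
    def prior(t):
        mu = sum(yk * gk for yk, gk in zip(y, gamma))
        return math.exp(-0.5 * ((t - mu) / sigma) ** 2)
    def like(t):
        L = 1.0
        for xj, (a, b) in zip(x, items):
            p = 1.0 / (1.0 + math.exp(-a * (t - b)))
            L *= p if xj == 1 else 1.0 - p
        return L
    w = [prior(t) * like(t) for t in grid]            # posterior kernel
    total = sum(w)
    probs = [wi / total for wi in w]
    return random.choices(grid, weights=probs, k=n_draws)
```

Each call returns a set of plausible values; in an MCMC implementation, one plausible value per examinee falls out of each retained iteration instead.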
Note that in (11.46), $p(x_i \mid \theta_i)$ is the likelihood function for the examinee's latent variable induced by the observed values in $x_i$, treating the measurement model parameters as known. Similarly, $p(\theta_i \mid y_i)$ is the prior distribution treating the parameters that govern the dependence of the latent variables on the covariates (e.g., the parameters $\gamma$ and $\sigma^2_\theta$ in the latent regression model in (11.44)) as known. This reflects that operational demands necessitate practical solutions to addressing the complexities of large-scale assessment surveys, such as drawing on (a) multiple imputation analyses that account for the sampling design, but use point estimates of the parameters of the IRT model and the latent regression model (Beaton, 1987), (b) estimates of the IRT parameters that ignore the complex sampling (Scott & Ip, 2002), and (c) superpopulation-based analyses for educational surveys with hierarchical structures assuming error-free dependent variables (Longford, 1995; Raudenbush, Fotiu, & Cheong, 1999). As noted in Section 7.4, Johnson and Jenkins
(2005) developed a fully Bayesian model that allows for the estimation of a joint model
addressing all of these design features. Using simulated and real data from operational
NAEP assessments, they found that both the standard analysis with its piece-wise approxi-
mations and their unified model provided consistent estimates of subpopulation features,
but the unified model more appropriately captured the variance of those estimates. By treating IRT item parameters and population variances as known, the standard analysis tended to underestimate posterior uncertainty by about 10%. What's more, the unified
model provided more stable estimates of sampling variance than the standard procedures.
$$a_j \sim \log N(\mu_a, \sigma^2_a), \quad (11.47)$$
or model the logit of the lower asymptote,
$$\mathrm{logit}(c_j) = \Psi^{-1}(c_j) = \log\left(\frac{c_j}{1 - c_j}\right),$$
as normally distributed,
$$\Psi^{-1}(c_j) \sim N(\mu_c, \sigma^2_c). \quad (11.48)$$
Similarly, the probit transformation $\Phi^{-1}(c_j)$ may be employed, with the result modeled as a normal distribution (Fox, 2010).
The deviation terms are then specified as following a distribution such that they are non-negative. A natural choice would be a uniform distribution bounded below by 0. Setting the maximum to be, say, 10 or larger enacts a diffuse prior over the latent continuum in most applications of IRT. An alternative strategy that does not require the analyst to specify a maximum of the deviation capitalizes on the log transformation of a normal distribution over the real line:
$$s_{jk} = \exp(l_{jk}), \quad \text{where} \quad l_{jk} \sim N(\mu_l, \sigma^2_l). \quad (11.50)$$
A generalization of this model specifies the distribution for $l_{jk}$ to vary over j or k. A third strategy specifies unrestricted versions of all the location parameters included in the model,
$$d^*_{jk} \sim N(\mu_d, \sigma^2_d) \quad \text{for } k = 2, \ldots, K_j, \quad (11.51)$$
and then sets the location parameters in terms of order statistics of these unrestricted versions (Curtis, 2010; Plummer, 2010). For observable j,
$$d_{j2} = d^*_{j,[K_j]}, \quad d_{j3} = d^*_{j,[K_j - 1]}, \quad \ldots, \quad d_{jK_j} = d^*_{j,[2]}. \quad (11.52)$$
The reverse ordering of the order statistics with the location parameters (i.e., as we go up from $d_{j2}$ to $d_{jK_j}$, we go down from the $K_j$th to the second order statistic) is due to the construction of the GRM in (11.27) in terms of probabilities of being at or above a category. This strategy of using order statistics is a bit more straightforward in versions of the GRM that work with cumulative probabilities, in which we model the probabilities of being at or below a certain category. In that case, the kth location parameter corresponds to the kth order statistic from the unrestricted versions (Curtis, 2010).
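The order-statistics mapping in (11.52) amounts to sorting the unrestricted draws in decreasing order, as in this one-line sketch:

```python
def thresholds_from_unrestricted(d_star):
    """Map unrestricted draws d*_{j2}, ..., d*_{jK} to ordered GRM location
    parameters via order statistics, reversing as in (11.52): d_{j2} gets
    the largest draw and d_{jK} the smallest."""
    return sorted(d_star, reverse=True)
```

For example, `thresholds_from_unrestricted([0.3, -1.2, 2.0, 0.7])` returns `[2.0, 0.7, 0.3, -1.2]`, which satisfies the decreasing restriction $d_{jk} \ge d_{j(k+1)}$ by construction.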
$$\omega_j \sim N(\mu_\omega, \Sigma_\omega)\, I_{\mathbb{R}^+}(a_j), \quad (11.53)$$
where $\mu_\omega$ and $\Sigma_\omega$ are hyperparameters and $I_{\mathbb{R}^+}(a_j)$ is the indicator function that is equal to 1 when $a_j$ is in the region defined by the positive real line, and 0 otherwise.*
In software environments where the truncated multivariate normal is difficult to imple-
ment (e.g., WinBUGS), or if a 3-P model is of interest, an alternative strategy involves using
some of the alternative forms presented in this section. The multivariate normal prior may
be specified without the truncation; a judicious choice for the prior mean and variance
for the a parameters may help to resolve the rotational indeterminacy. Alternatively, to
preserve the positivity restriction on the a parameters, we may employ a log transforma-
tion. The idea here is to take the a parameters, which we wish to bound below by 0, and
express them on the real line. Along the same lines, we can define a function that maps
from the restricted space of the c parameters, which we bound by 0 and 1, to the real line
via the logit or probit transformation. That is, we work with a reparameterization that uses the identity transformation for the locations, the log transformation for the discriminations, and the logit transformation for the lower asymptotes:
$$\delta_j = d_j, \quad \alpha_j = \log(a_j), \quad \text{and} \quad \gamma_j = \Psi^{-1}(c_j) = \log\left(\frac{c_j}{1 - c_j}\right).$$
* To point out a connection with the simpler, more common notation used earlier in the univariate case, note that (11.17) can be written in indicator notation as $a_j \sim N(\mu_a, \sigma^2_a)\, I_{\mathbb{R}^+}(a_j)$.
We can then specify an (unrestricted) multivariate normal prior on the transformed set of
parameters ξ_j = (δ_j, α_j, γ_j),

ξ_j ~ N(μ, Σ). (11.54)

The original parameters can be recovered by taking the inverses of the transformations
involved in the reparameterizations,

d_j = δ_j,

a_j = exp(α_j),

and

c_j = logit⁻¹(γ_j) = exp(γ_j) / (1 + exp(γ_j)).
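As a quick numerical check (our sketch, with arbitrary parameter values), the transformations and their inverses round-trip as expected:

```python
import math

# Arbitrary 3-P parameter values satisfying a > 0 and 0 < c < 1
d, a, c = -0.4, 1.2, 0.2

# Forward transformations to the unrestricted (real-line) scale
delta = d                            # identity for the location
alpha = math.log(a)                  # log for the discrimination
gamma = math.log(c / (1.0 - c))      # logit for the lower asymptote

# Inverse transformations recover the original parameters
d_back = delta
a_back = math.exp(alpha)
c_back = math.exp(gamma) / (1.0 + math.exp(gamma))

assert abs(d_back - d) < 1e-12
assert abs(a_back - a) < 1e-12
assert abs(c_back - c) < 1e-12
```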
The hyperparameters μ and Σ may be specified by the user. As thinking in the log and
inverse-logit parameterizations is not natural to most, it may be useful to use the original
parameterization as a guide for defining the hyperparameter values and, roughly speaking,
apply the corresponding transformations to these values. In the example in Section 11.2.5,
we employed an N(0, 2) prior for the d parameters. As the δs are obtained by the identity
transformation, a comparable specification using the multivariate approach would simply
specify μ_δ = 0 and σ²_δ = 2. For the a parameters, we employed an N+(1, 2) prior. Taking
the log of the mode of this distribution (here, 1) yields a suggested value of μ_α = 0. If the
geometric average of the slopes were 1, then choosing σ²_α = .25 yields a prior where the a
parameters are highly likely to be between about 1/3 and 3 (Mislevy, 1986), which is in line
with the N+(1, 2) prior. For the c parameters, we employed a Beta(5, 17) prior, which has a
mean of 5/22 ≈ .227. Applying the logit function to that value yields a suggested value
of μ_γ = −1.225, and using σ²_γ = .25 yields a distribution of the cs that closely matches the
Beta(5, 17). These means may be collected into μ and these variances may be collected into
the diagonal of Σ.
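The hyperparameter suggestions above can be reproduced in a few lines (our sketch, using the same numbers as the text; note that applying the logit to the exact mean 5/22 gives about −1.224, while the text's −1.225 comes from the rounded value .227):

```python
import math

# Location: identity transformation, so the N(0, 2) choice carries over directly
mu_delta, sigma2_delta = 0.0, 2.0

# Discrimination: log of the N+(1, 2) prior's mode (here, 1) suggests mu_alpha = 0
mu_alpha = math.log(1.0)
sigma_alpha = math.sqrt(0.25)                 # sigma2_alpha = .25
# A +/- 2 SD interval on the log scale maps back to roughly (1/3, 3) on the a scale
lo = math.exp(mu_alpha - 2 * sigma_alpha)     # ~ 0.37
hi = math.exp(mu_alpha + 2 * sigma_alpha)     # ~ 2.72

# Lower asymptote: logit of the Beta(5, 17) mean 5/22 suggests mu_gamma ~ -1.22
mu_gamma = math.log((5 / 22) / (1 - 5 / 22))

assert abs(mu_alpha) < 1e-12
assert 0.3 < lo < 0.4 and 2.6 < hi < 2.8
assert abs(mu_gamma - (-1.2238)) < 1e-3
```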
The off-diagonal elements of Σ capture the dependency among the items' measurement
model parameters. One source of information about their dependence comes in the form
of analyses of the items with other examinees or from analyses of similar items. The example in Section 11.2.5 specified a prior distribution reflecting a priori independence among
the measurement model parameters; however, they were correlated in the posterior. These
correlations, suitably transformed to reflect the reparameterization, could be the basis for
a future analysis of either (a) these items with different examinees, (b) different items with
these examinees, or (c) different items with different examinees. (We would not use these
posterior correlations as the prior in a reanalysis of the same items and same examinees;
see Exercise 11.9). Note the implicit exchangeability assumptions inherent in using the
results of this analysis in any of scenarios (a) through (c): The results from the current analysis
would be relevant for an analysis of (a) different examinees if the two sets of examinees
were exchangeable with respect to the current items. They would be relevant for an analy-
sis of (b) different items if the sets of items were exchangeable with respect to the cur-
rent examinees. They would be relevant for an analysis of (c) different items with different
examinees if both exchangeability assumptions are warranted. To the extent that these
exchangeability assumptions are not warranted, the use of the current strategies may be
suspect. Next, we describe a mechanism that potentially accommodates these situations.
d_j ~ N(μ_d, σ²_d). (11.55)
However, instead of specifying values for these hyperparameters, we now specify a hyper-
prior for these parameters. As these are parameters of a normal distribution, a condition-
ally conjugate specification is given by
μ_d ~ N(μ_{μ_d}, σ²_{μ_d}) (11.56)

and

σ²_d ~ Inv-Gamma(ν_d/2, ν_d σ²_{d0}/2), (11.57)

where μ_{μ_d}, σ²_{μ_d}, ν_d, and σ²_{d0} are hyperparameters specified by the analyst. Viewed as a hierarchy, (11.55) represents the first level of the hierarchy, and (11.56) and (11.57) represent the second level of the hierarchy.
First consider the mean of the location parameters, μ_d, which is a parameter in the first
level of the hierarchy. μ_d captures the central tendency of the prior distribution for the ds.
The value of μ_d may be thought of as an answer to the question "What should we shrink
the estimate of each d_j towards?" In the single-level prior specification for the ds, the value
of μ_d is chosen by the analyst, which may be problematic if chosen poorly. Our current
focus is to examine how the hierarchical prior structure offers a mechanism to, in a sense,
relieve the pressure placed on the analyst to make such a choice. To delve into how, consider the full conditional distribution for μ_d given by
μ_d | d, σ²_d ~ N(μ_{μ_d|d,σ²_d}, σ²_{μ_d|d,σ²_d}), (11.58)

where

μ_{μ_d|d,σ²_d} = (μ_{μ_d}/σ²_{μ_d} + J d̄/σ²_d) / (1/σ²_{μ_d} + J/σ²_d),

σ²_{μ_d|d,σ²_d} = 1 / (1/σ²_{μ_d} + J/σ²_d),

d = (d_1, …, d_J),

and

d̄ = (1/J) ∑_{j=1}^J d_j.
The prior variance σ²_{μ_d} may be seen as controlling the degree to which the hierarchical prior
structure departs from the single-level prior with known hyperparameters. For small values of σ²_{μ_d}, the mean of the location parameters (i.e., μ_d, which is a parameter at level 1 in the
hierarchy) is more strongly influenced by the prior mean at the second level of the hierarchy. In the limit, as σ²_{μ_d} → 0, μ_{μ_d|d,σ²_d} → μ_{μ_d} and the two-level hierarchical prior degenerates into the single-level prior with known hyperparameters. For large values of σ²_{μ_d}, the
mean of the location parameters (i.e., μ_d, which is a parameter at level 1 in the hierarchy) is
more strongly influenced by the values of the location parameters suggested by the data. In
the limit, μ_{μ_d|d,σ²_d} → d̄ as σ²_{μ_d} → ∞. Casually, this corresponds to answering the question
"What should we shrink the estimate of each d_j towards?" with the mean of the ds as suggested by the data. This same limit for the posterior mean, μ_{μ_d|d,σ²_d} → d̄, obtains as J → ∞.
More generally, the hierarchical prior structure has the attractive feature of retaining
the core goal of specifying a prior distribution for the posterior distribution to be shrunk
towards (relative to the possibly unstable likelihood), but allows where that prior distribution is located to depend to some degree on the information in the data. How much
it depends on the data is governed principally by the relative amounts of information in the
prior, captured by σ²_{μ_d}, and in the data, captured by J.
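To see these limits numerically, the full conditional mean and variance in (11.58) can be computed for a toy set of location parameters (a sketch with arbitrary values; the function name is ours):

```python
def mu_d_full_conditional(d, sigma2_d, mu_mu_d, sigma2_mu_d):
    """Precision-weighted mean and variance of mu_d | d, sigma2_d (cf. 11.58)."""
    J = len(d)
    d_bar = sum(d) / J
    precision = 1.0 / sigma2_mu_d + J / sigma2_d
    mean = (mu_mu_d / sigma2_mu_d + J * d_bar / sigma2_d) / precision
    return mean, 1.0 / precision

d = [-0.5, 0.1, 0.4, 0.8]        # arbitrary location parameters
d_bar = sum(d) / len(d)          # = 0.2

# Small sigma2_mu_d: the full conditional collapses to the level-2 prior mean
m_small, _ = mu_d_full_conditional(d, sigma2_d=1.0, mu_mu_d=0.0, sigma2_mu_d=1e-8)
# Large sigma2_mu_d: the full conditional is pulled to the data mean d_bar
m_large, _ = mu_d_full_conditional(d, sigma2_d=1.0, mu_mu_d=0.0, sigma2_mu_d=1e8)

print(round(m_small, 4), round(m_large, 4))  # prints: 0.0 0.2
```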
μ_d governs where we should shrink the data-based estimate of d_j towards. The other
parameter in the first level of the prior distribution for the ds in (11.55), σ²_d, governs how
much we should shrink the data-based estimate. If σ²_d is large, there will be relatively less
shrinkage, and the posterior for each d_j will be more strongly influenced by the data. If σ²_d
is small, there will be relatively more shrinkage. The limiting case of this has σ²_d → 0, in
which case d_j → μ_d and the data become irrelevant. In the single-level model, the analyst's
choice of σ²_d expresses the variability of the d parameters and constitutes an answer to
the question: "How much should we shrink each data-based estimate?" By building a
hierarchical prior structure as in (11.57), we again have a mechanism for relieving the pressure on the analyst to specify a value, and enact a balancing of the analyst's prior beliefs,
expressed in (11.57), with the information in the data. Generally speaking, the information
in the data becomes more important for σ²_d (and therefore our answer to the question
"How much should we shrink each data-based estimate?") as ν_d decreases or J increases.
A hierarchical prior structure provides an elegant mechanism to address any uneasy
feelings the analyst may have about specifying values for the hyperparameters of prior
distributions. It allows the influence of the analyst's prior beliefs to be softened a bit.
Instead of being represented in the first (and only) level of a prior distribution, which more
directly influences the posterior distribution for the parameters of interest, they are bumped
up to a second level and are further away from the parameters of interest.
It is also very much a Bayesian mechanism. To the extent that the analyst is unsure about
what to specify as the hyperparameters, we do what we always do in Bayesian modeling
when confronted with things we are not sure of: assign a distribution! After conditioning
on the observed data, the posterior distributions for these parameters (the hyperparameters of the first level of the prior distribution) will themselves be a mix of the analyst's
prior beliefs (expressed in the second level of the hierarchy) and the data.
A similar logic applies for a hierarchical prior structure on the other measurement model
parameters. For the discrimination parameters, the first level of the prior is
a_j ~ N+(μ_a, σ²_a), (11.59)
and instead of specifying values for the hyperparameters, we again specify normal and
inverse-gamma hyperprior distributions as the second level in the hierarchy,
μ_a ~ N(μ_{μ_a}, σ²_{μ_a}) (11.60)

and

σ²_a ~ Inv-Gamma(ν_a/2, ν_a σ²_{a0}/2), (11.61)

where μ_{μ_a}, σ²_{μ_a}, ν_a, and σ²_{a0} are hyperparameters specified by the analyst. Normal and
inverse-gamma hyperprior distributions may be similarly used if the lognormal prior for
discriminations in (11.47) is employed.
For the lower asymptote parameters, the first level of the prior is
c_j ~ Beta(α_c, β_c), (11.62)
and instead of specifying values for the hyperparameters, we specify a second level of the
hierarchy. An interpretation of the parameters of the Beta distribution in terms of counts
of successes and failures suggests the use of a Poisson distribution at the second level of
the hierarchy (Levy, 2014):
α_c ~ Poisson(λ_{α_c}) (11.63)

and

β_c ~ Poisson(λ_{β_c}), (11.64)

where λ_{α_c} and λ_{β_c} are hyperparameters specified by the analyst.

A hierarchical structure may similarly be placed on the multivariate normal prior for the
collected measurement model parameters in (11.54),

ξ_j ~ N(μ, Σ).

Rather than specify values for these hyperparameters, they may be assigned hyperprior
distributions. Desires for conditional conjugacy imply a multivariate normal hyperprior
for μ and an inverse-Wishart hyperprior for Σ. In addition to allowing where and by how
much the data-suggested estimates should be shrunk, this approach allows for learning
about the dependence among the measurement model parameters, as expressed by the
posterior distribution for the off-diagonal elements of Σ (see Exercises 11.7 and 11.8).
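A draw from the two-level prior in (11.55) through (11.57) can be sketched as follows (our illustration with arbitrary hyperparameter values; the inverse-gamma draw is taken as the reciprocal of a gamma draw):

```python
import random

random.seed(11)

# Arbitrary second-level hyperparameter values (our choices, for illustration)
mu_mu_d, sigma2_mu_d = 0.0, 1.0    # normal hyperprior for mu_d
nu_d, sigma2_d0 = 4.0, 0.5         # inverse-gamma hyperprior for sigma2_d

def draw_from_hierarchical_prior(J=6):
    """One draw of (mu_d, sigma2_d, d_1, ..., d_J) from the two-level prior."""
    mu_d = random.gauss(mu_mu_d, sigma2_mu_d ** 0.5)
    # Inv-Gamma(nu/2, nu * s0 / 2) as the reciprocal of a Gamma draw
    sigma2_d = 1.0 / random.gammavariate(nu_d / 2.0, 2.0 / (nu_d * sigma2_d0))
    d = [random.gauss(mu_d, sigma2_d ** 0.5) for _ in range(J)]
    return mu_d, sigma2_d, d

mu_d, sigma2_d, d = draw_from_hierarchical_prior()
assert sigma2_d > 0 and len(d) == 6
```

Repeating such draws traces out the prior predictive distribution of the location parameters implied by the analyst's second-level choices.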
x*_ij = a_j θ_i + d_j + ε_ij, (11.65)

x*_ij ~ N(a_j θ_i + d_j, σ²_j). (11.66)
The probability for an observable taking on a value of 1 is then modeled as the probability
that the latent response variable is greater than or equal to a threshold for the observable, τ_j,

P(x_ij = 1 | θ_i, a_j, d_j, τ_j) = P(x*_ij ≥ τ_j) = Φ((a_j θ_i + d_j − τ_j)/σ_j). (11.67)
It is clear from the expression on the right-hand side of (11.67) that a location indetermi-
nacy is present, as a constant can be added to both dj and j and the probability will be
preserved. An (arbitrary) constraint is needed to resolve this. Interestingly, CFA strategies
rooted in covariance structure analysis typically assume that the location parameters in
the latent response variable equation (here, the ds) are 0, and the thresholds (τs) remain in
the model. In IRT traditions, it is more common to assume that the thresholds are 0 and the
location parameters remain in the model.
It is less obvious that an indeterminacy in scale exists as well. This should not be surprising: latent variables do not have inherent metrics, so the introduction of a latent response
variable brings with it the need to resolve these indeterminacies. For each observable, this
can be resolved by setting σ²_j equal to a constant; the value of 1 is often chosen for
convenience (see Kamata & Bauer, 2008, for details on this and other alternatives). This
renders ε_ij ~ N(0, 1), which implies that (11.67) may be expressed as

P(x_ij = 1 | θ_i, a_j, d_j, τ_j) = Φ(a_j θ_i + d_j − τ_j).
Making the now familiar assumptions of a priori independence between the latent
variables and the measurement model parameters, exchangeability of examinees, and
exchangeability of the observables, the posterior distribution is then

p(θ, ξ, x* | x) ∝ ∏_{i=1}^n ∏_{j=1}^J p(x_ij | x*_ij) p(x*_ij | θ_i, d_j, a_j) ∏_{i=1}^n p(θ_i) ∏_{j=1}^J p(ξ_j),

where

x_ij | x*_ij = 1 if x*_ij ≥ 0 and x_ij | x*_ij = 0 if x*_ij < 0, for i = 1, …, n, j = 1, …, J,

and

x*_ij | θ_i, d_j, a_j ~ N(a_j θ_i + d_j, 1) for i = 1, …, n, j = 1, …, J.
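The equivalence between thresholding a latent response at 0 and the normal-ogive response probability can be checked by simulation (our sketch, with arbitrary parameter values):

```python
import math
import random

random.seed(7)

def Phi(z):                      # standard normal CDF
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

a_j, d_j, theta_i = 1.3, -0.5, 0.8            # arbitrary values
reps = 200_000

# Latent responses x* = a*theta + d + e with e ~ N(0, 1), thresholded at 0
hits = sum(1 for _ in range(reps)
           if a_j * theta_i + d_j + random.gauss(0.0, 1.0) >= 0.0)
empirical = hits / reps
model = Phi(a_j * theta_i + d_j)              # normal-ogive probability

assert abs(empirical - model) < 0.01
```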
Employing the normal prior for examinee latent variables in (11.13) and the truncated mul-
tivariate normal prior for the measurement model parameters in (11.53), the full conditional
distributions are expressed in the following equations (see Appendix A for derivations).
On the left-hand side, the parameter in question is written as conditional on all the other
relevant parameters and data.* On the right-hand side of each of the following equations, we
give the parametric form for the full conditional distribution. In several places we denote
the arguments of the distribution (e.g., mean and variance for a normal distribution) with
subscripts denoting that it refers to the full conditional distribution; the subscripts are
then just the conditioning notation of the left-hand side.
For the examinee latent variables, introducing the latent response variables makes the
model akin to a CFA model, with the continuous latent response variables playing the role
of the observables in CFA. The full conditional distribution for examinee i is
θ_i | d, a, x*_i ~ N(μ_{θ_i|d,a,x*_i}, σ²_{θ_i|d,a,x*_i}),

where

μ_{θ_i|d,a,x*_i} = (μ/σ² + ∑_j a_j(x*_ij − d_j)) / (1/σ² + ∑_j a_j²),

σ²_{θ_i|d,a,x*_i} = 1 / (1/σ² + ∑_j a_j²),

μ and σ² are the mean and variance of the normal prior for the latent variables in (11.13),
d = (d_1, …, d_J), a = (a_1, …, a_J), and x*_i = (x*_i1, …, x*_iJ). The same structure applies to all the
examinees.
For each examinee i and observable j, the associated latent response variable has a normal distribution given by its prior, now truncated to be positive or negative depending on
whether the corresponding observable was 1 or 0,

x*_ij | θ_i, d_j, a_j, x_ij ~ N(a_j θ_i + d_j, 1) I(x*_ij ≥ 0) if x_ij = 1,
x*_ij | θ_i, d_j, a_j, x_ij ~ N(a_j θ_i + d_j, 1) I(x*_ij < 0) if x_ij = 0. (11.71)

For the measurement model parameters of observable j, the full conditional distribution
is a multivariate normal, truncated in accordance with the prior,

ξ_j | θ, x*_j ~ N(μ_{ξ_j|θ,x*_j}, Σ_{ξ_j|θ,x*_j}) I_{R+}(a_j),

where

μ_{ξ_j|θ,x*_j} = (Σ⁻¹ + A′A)⁻¹(Σ⁻¹μ + A′x*_j),

Σ_{ξ_j|θ,x*_j} = (Σ⁻¹ + A′A)⁻¹,

and A is the n × 2 matrix with ith row (θ_i, 1).
* We suppress the role of specified hyperparameters in this notation; see Appendix A for a presentation that
formally includes the hyperparameters.
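Collecting the full conditionals, a minimal Gibbs sampler for the 2-PNO model can be sketched as follows. This is our illustrative implementation, not the book's code: it uses a N(0, 1) prior for each θ_i and, for simplicity, independent N(0, 1) priors for d_j and for a_j (restricted to be positive) in place of the truncated multivariate normal prior in (11.53); all function names are ours.

```python
import math
import random
from statistics import NormalDist

ND = NormalDist()  # standard normal helpers: cdf, inv_cdf

def rtruncnorm(mean, lower=None, upper=None):
    """Draw from N(mean, 1) truncated to [lower, inf) or (-inf, upper]."""
    lo = ND.cdf(lower - mean) if lower is not None else 0.0
    hi = ND.cdf(upper - mean) if upper is not None else 1.0
    u = random.uniform(max(lo, 1e-12), min(hi, 1.0 - 1e-12))
    return mean + ND.inv_cdf(u)

def gibbs_2pno(x, iters=200, seed=1):
    """Gibbs sampler for the 2-PNO via latent responses (illustrative priors:
    theta_i ~ N(0, 1); (a_j, d_j) ~ N(0, I) with a_j restricted positive)."""
    random.seed(seed)
    n, J = len(x), len(x[0])
    theta, a, d = [0.0] * n, [1.0] * J, [0.0] * J
    for _ in range(iters):
        # 1. Latent responses: truncated normal given the observed 0/1 response
        xs = [[rtruncnorm(a[j] * theta[i] + d[j],
                          lower=0.0 if x[i][j] == 1 else None,
                          upper=None if x[i][j] == 1 else 0.0)
               for j in range(J)] for i in range(n)]
        # 2. Latent variables: normal full conditional (precision-weighted mean)
        for i in range(n):
            prec = 1.0 + sum(aj * aj for aj in a)
            m = sum(a[j] * (xs[i][j] - d[j]) for j in range(J)) / prec
            theta[i] = random.gauss(m, math.sqrt(1.0 / prec))
        # 3. Item parameters: bivariate normal regression of x*_j on (theta, 1)
        stt = 1.0 + sum(t * t for t in theta)   # posterior precision I + A'A
        st = sum(theta)
        snn = 1.0 + n
        det = stt * snn - st * st
        for j in range(J):
            b1 = sum(theta[i] * xs[i][j] for i in range(n))
            b2 = sum(xs[i][j] for i in range(n))
            m_a = (snn * b1 - st * b2) / det    # posterior mean (I + A'A)^-1 A'x*
            m_d = (-st * b1 + stt * b2) / det
            l11 = math.sqrt(snn / det)          # Cholesky factor of the covariance
            l21 = -st / det / l11
            l22 = math.sqrt(stt / det - l21 * l21)
            for _ in range(100):                # rejection to keep a_j > 0
                z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
                cand_a = m_a + l11 * z1
                if cand_a > 0:
                    a[j] = cand_a
                    d[j] = m_d + l21 * z1 + l22 * z2
                    break
    return theta, a, d

# Tiny demonstration data set (5 examinees, 3 items)
theta, a, d = gibbs_2pno(
    [[1, 0, 1], [0, 0, 1], [1, 1, 1], [0, 1, 0], [1, 0, 0]], iters=30)
```

A real analysis would of course retain and summarize the full chains rather than the final draw, and would use the prior structure actually specified for the model.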
This line of work was extended by Patz and Junker (1999a, 1999b), who introduced a more flexible Metropolis-Hastings-
within-Gibbs approach demonstrating how logistic IRT models for dichotomous data,
polytomous data, missing data, and rater effects could be handled without requiring par-
ticular forms of prior distributions. The flexibility of MCMC estimation has allowed the
expansion to more complex IRT models.
We have confined our treatment to commonly used parametric models where the latent
variable is monotonically related to the probabilities of dichotomous or ordered polytomous
responses. Fox (2010) provides a textbook length treatment of such Bayesian IRT models
emphasizing hierarchical formulations. Various Bayesian approaches to compensatory
and conjunctive MIRT for dichotomous and polytomous variables have been discussed by
Babcock (2011), Béguin and Glas (2001), Bolt and Lall (2003), Clinton, Jackman, and Rivers
(2004), Edwards (2010), Jackman (2001), Kim et al. (2013), Maris and Maris (2002), Reckase
(2009), Sheng and Wikle (2008), Yao and Boughton (2007), and Yao and Schwarz (2006).
Bayesian approaches to other models for dichotomous and polytomous data include
other specifications for ordinal and nominal data (Dunson, 2000; Wollack, Bolt, Cohen, &
Lee, 2002; Yao & Schwarz, 2006) as well as for dichotomous data representing alternative
response processes (Bolfarine & Bazan, 2010; Bolt, Deng, & Lee, 2014; Bolt, Wollack, & Suh,
2012; Jin & Wang, 2014; Loken & Rulison, 2010), including randomized response admin-
istrations (Fox, 2005a; Fox, Klein Entink, & Avetisyan, 2014). Bayesian models have also
been developed for semiparametric (Miyazaki & Hoshino, 2009) and nonparametric mod-
els (Duncan & MacEachern, 2008; Fujimoto & Karabatsos, 2014; Karabatsos & Sheu, 2004;
Karabatsos & Walker, in press) as well as unfolding IRT models that posit a nonmonotonic
relationship between an item response and the latent variable as may be prevalent in psy-
chological assessment (de la Torre, Stark, & Chernyshenko, 2006; Johnson & Junker, 2003;
Roberts & Thompson, 2011; Wang, Liu, & Wu, 2013).
Procedures for accommodating ignorable and nonignorable missingness have also
been developed (Fu, Tao, & Shi, 2010; Maier 2002; Patz & Junker, 1999a), and we discuss
these topics in greater detail in the next chapter. Chang, Tsai, and Hsu (2014) described a
Bayesian approach to modeling a related situation in which the time limit of an assessment
affects the response process.
Section11.7.4 described basic ideas of modeling latent variables and measurement model
parameters with a hierarchical structure. Applications and extensions of this include
models for rater effects (Patz, Junker, Johnson, & Mariano, 2002), testlet effects (Wainer,
Bradlow,& Wang, 2007), differential functioning of items in subpopulations (Frederickx,
Tuerlinckx, De Boeck, & Magis, 2010; Fukuhara & Kamata, 2011; Soares, Gonçalves, &
Gamerman, 2009; Verhagen & Fox, 2013), items organized around areas of the domain
(Janssen, Tuerlinckx, Meulders, & De Boeck, 2000), and families of items with members
designed to reflect a common pattern (Geerlings, Glas, & van der Linden, 2011; Johnson &
Sinharay, 2005; Sinharay, Johnson, & Williamson, 2003). Similarly, we may gainfully
employ multilevel structures for the organization of examinees based on groupings or
covariates (Fox, 2005b; Fox & Glas, 2001; Huang & Wang, 2014a; Jiao & Zhang, 2015; Maier,
2002; Natesan, Limbers, & Varni, 2010) or modeling change over time (Segawa, 2005). See
Fox (2010) for a treatment of a number of these models.
Many of these models may be seen as instances of incorporating collateral information
in the form of covariates or conditioning variables for model specifications. Finite mixture
IRT models extend this idea to where the conditioning variables are discrete and latent
(Bolt, Cohen, & Wollack, 2001; Cho, Cohen, & Kim, 2014; Finch & French, 2012; Meyer, 2010)
and possibly dependent on covariates (Dai, 2013), which has been gainfully employed in
the analysis of differential functioning (Cohen & Bolt, 2005; Samuelsen, 2008) including
Item Response Theory 297
in situations recognizing the hierarchical organization of examinees (Cho & Cohen, 2010).
Collateral information about items may be used to model the measurement model parame-
ters, and shed interpretative light on differences among latent groups (Choi & Wilson, 2015).
An alternative source of collateral information may come in the form of other aspects
of the examineeitem interaction, such as performance on items measuring other latent
variables in MIRT (de la Torre, 2009; de la Torre & Song, 2009), the selection of distractors
(Bolt et al., 2012), and response times (Klein Entink et al., 2009; Meyer, 2010; van der Linden,
2007; Wang, Fan, Chang, & Douglas, 2013).
The majority of adaptive testing has focused on adapting the items presented to the
examinee in support of scoring. The same principles may be applied in support of opti-
mizing calibration of measurement model parameters (van der Linden & Ren, 2014). The
principles of adaptive testing are more general than the particular forms of IRT models
and apply equally to other models (Vos, 1999) including those that employ discrete latent
variables as in latent class and Bayesian network models discussed in Chapters 13 and 14
(for focused treatments of adaptive testing using models with discrete latent variables,
see Cheng, 2009; Collins, Greer, & Huang, 1996; Jones, 2014; Macready & Dayton, 1992;
Marshall, 1981; Rudner, 2009; Welch & Frick, 1993). See Almond and Mislevy (1999) for a
general account of adaptive testing via graphical models, including the incorporation of
collateral information.
Exercises
11.1 Recreate the 3-PNO analysis of the LSAT data in Section 11.2.5. Verify your results
with those reported here and obtain the DIC.
11.2 Compare the results from the 3-PNO to those from 1-PNO and 2-PNO models of
the LSAT data.
a. For each of the 1-PNO and 2-PNO models, obtain the posterior distribution for
the measurement model parameters, conduct PPMC using the statistics and
discrepancy measures reported in Section 11.2.5, and obtain the DIC.
b. Compare the results for the measurement model parameters to those in
Table 11.2. How are they similar, how are they different, and why?
c. Compare the models in terms of their fit.
11.3 Consider what occurs if there are fewer items. Fit the 3-PNO model to the first
four items of the LSAT data.
a. Compare the results for the measurement model parameters to those in
Table 11.2. How are they similar, how are they different, and why?
b. Compare the result for examinee 1,000 to that in Table 11.2. How are they similar, how are they different, and why?
11.4 Fit a multidimensional GRM-L model for the item response data from the Peer
Interaction and Academic and Intellectual Development subscales of the Institutional
Integration Scale.
a. Specify and write out the DAG for a model with two latent variables, where
the items associated with each subscale load on one latent variable. Be sure to
consider the resolution of the indeterminacies in the model.
b. Interpret the results for the measurement model parameters and the param-
eters that govern the distribution of the latent variables. Compare the results
for the correlation between the latent variables to the correlation between the
observed subscale scores of .57.
11.5 In the description of CAT, it was suggested that ML could be used to estimate θ
after each response if the examinee has at least one correct and at least one incorrect response. Why is this needed for ML estimation? Why is it not needed for
Bayesian estimation?
11.6 Consider a 1-PL model that specifies collateral information for examinees and
the items in the forms of (11.44) and (11.45).
a. Write out the DAG for the model.
b. For each parameter, write out what other entities need to be conditioned on in
the full conditional distribution.
11.7 Analyze a 3-PNO model for the LSAT data in Table 11.1 using a prior distribution
that formally models the dependence among the measurement model parameters.
a. Write out the DAG for the model.
b. How do the results for the measurement model parameters compare to those
for the model that specifies a priori independence among the parameters
reported in Table 11.2?
11.8 Consider what occurs if data arrive sequentially.
a. Using the same prior as you specified in Exercise 11.7, analyze a 3-PNO model
for a random subset of 500 examinees from the LSAT data. How do the results
differ from those in Exercise 11.6?
b. Use the information in the posterior from part (a) to specify a new prior distri-
bution for the measurement model parameters.
c. Obtain the posterior distribution now using the remaining 500 examinees.
How do the results differ from those in Exercise 11.6 and part (a)?
11.9 In Section 11.7.3, we described how the results for an analysis of data from examinee responses to a set of items could be used as the basis for the prior distribution
of an analysis of (a) the same items with different examinees, (b) different items
with the same examinees, or (c) different items with different examinees, but not
(d) a reanalysis of the same items and the same examinees. Why would this latter
analysis be dubious?
11.10 Show that, under the stated assumptions and model specifications, the full conditional distribution for μ_d in the hierarchical prior is given by (11.58).
12
Missing Data Modeling
Missing values for potentially observable variables are ubiquitous in psychometric and
related social science scenarios for a variety of reasons, including those that may or may
not be planned, anticipated, or controlled by the assessor. This chapter covers Bayesian
perspectives on missing data as they play out in psychometrics through the lens of Rubin's
(1976) missing-data framework. The key insight is this: when there is a hole in the data,
a Bayesian perspective tells us to build a predictive distribution for what we might have
seen, given all the data we did see, our model for the data had it all been observed, and a
model that expresses our belief about the mechanisms that caused part of the data to be
missing (Rubin, 1977).
In Section 12.1, we review foundational concepts pertaining to missing data and
introduce a running example that we will revisit in several of the following sections. In
Section 12.2, we focus on conducting inference when the missingness is ignorable, and in
Section 12.3, we focus on inference when the missingness is nonignorable. In Section 12.4,
we review multiple imputation, drawing connections to the fully Bayesian analyses of
missing data. In Section 12.5, we discuss connections and departures among missing
data, latent variables, and model parameters. Section 12.6 concludes the chapter with a
summary and bibliographic note.
model parameters, and parameters that govern the distributions of those entities. In our
examples below, we primarily focus on scoring in unidimensional IRT, and so in most of
the cases we discuss, θ just refers to the latent variable in IRT. When we consider examples
that expand the situation such that θ includes measurement model parameters, we will
note it as such.
The joint distribution for x and I may be constructed as

p(x, I | θ, φ) = p(x | θ) p(I | x, φ). (12.1)

The first term on the right-hand side of (12.1) is the familiar conditional distribution of x
given parameters θ. What is new is the second term, the conditional distribution of the
missingness given φ and x, which may be interpreted as a model for the missingness.
In most cases, the joint parameter space of θ and φ can be factored into a θ-space and a
φ-space. In some cases, such as the intentional-omitting example we will be considering,
missingness can depend directly on θ. We will then either consider θ an element of φ, or
denote the dependence explicitly as p(I | x, θ, φ).
The role of x in the conditioning here is crucial. Equation (12.1) states that the missingness mechanism may depend on x, including potentially both the observed values x_obs and
the missing values x_mis. We briefly cover the various ways this plays out.
Missing data are missing completely at random (MCAR) if the probability of missingness
does not depend on either the observed or unobserved values:

p(I | x, φ) = p(I | φ). (12.2)

Missing data are missing at random (MAR) if the probability of missingness depends on the
observed values x_obs, but once they are conditioned on, the probability of missingness is
independent of the actual unobserved values x_mis:

p(I | x, φ) = p(I | x_obs, φ). (12.3)

Missing data are missing not at random (MNAR) if the probability of missingness depends
on the unobserved values x_mis, even when the observed values x_obs are taken into account:

p(I | x, φ) = p(I | x_obs, x_mis, φ). (12.4)

As x = (x_obs, x_mis), (12.4) is tautological. Nevertheless, it is useful for communicating that x_mis
is relevant to the missingness mechanisms, and as a foil for the previous two characterizations of missingness mechanisms. Note that MAR is a special case of MNAR, and MCAR
is a special case of MAR.
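These three mechanisms can be illustrated with a toy simulation (our construction, not from the text): two independent dichotomous responses are generated, and the second is deleted under rules that depend on nothing, on the observed first response, or on the value of the second response itself.

```python
import random

random.seed(3)

# Two independent dichotomous responses per examinee; Item 2 may go missing
data = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(50_000)]

def observed_mean_item2(miss_prob):
    """Mean of Item 2 among cases where it is observed, under a missingness
    rule miss_prob(x1, x2) giving the probability that x2 is missing."""
    kept = [x2 for x1, x2 in data if random.random() >= miss_prob(x1, x2)]
    return sum(kept) / len(kept)

mcar = observed_mean_item2(lambda x1, x2: 0.5)                      # constant
mar = observed_mean_item2(lambda x1, x2: 0.7 if x1 == 1 else 0.3)   # depends on x_obs
mnar = observed_mean_item2(lambda x1, x2: 0.7 if x2 == 1 else 0.3)  # depends on x_mis

# With independent items, MCAR and MAR deletion leave the observed mean of
# Item 2 near its complete-data value (0.5), while MNAR deletion biases it
# toward 0.3 because responses of 1 are preferentially removed.
assert abs(mcar - 0.5) < 0.02 and abs(mar - 0.5) < 0.02
assert abs(mnar - 0.3) < 0.02
```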
Example
We will use a simple running example from IRT to illustrate the key ideas of modeling
missing data. We will focus our attention on inference about θ for a single examinee,
and can therefore omit the subscripting by i to simplify notation. There are three test
items, with dichotomous responses x_j. The 1-PL (Rasch) model is assumed to hold for
the item responses, and the item difficulty parameters expressed as b_j are known to be
0, 1, and −1, respectively. In the examinee population, θ ~ f(θ). The missingness indicator I_j is 1 if a response is observed and 0 if it is missing. We will consider inference
about θ for a single examinee, Sally, for whom I = (1, 1, 0) and x = (1, 0, *). That is, we have
observed her responses to only the first two items, so x_obs = (x_1, x_2) = (1, 0) and x_mis = (x_3).
We will see that even when the Rasch model is assumed to be wholly sufficient for the
process of how item responses are produced, our inference about θ depends materially
on the process of how the response x_3 came to be missing. We will consider several
situations: using multiple test forms, adaptive testing, Sally running out of time, and
Sally intentionally omitting a response.
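Sally's situation can be made concrete with a small grid computation (our sketch; we assume a standard normal population distribution f(θ) for illustration). Ignoring the missing third response, the likelihood from x_obs = (1, 0) has its maximum at θ = 0.5, and a Bayesian analysis shrinks this toward the prior mean:

```python
import math

b = [0.0, 1.0, -1.0]          # known Rasch difficulties from the example
x_obs = [1, 0]                # Sally's responses to Items 1 and 2

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(theta):        # based only on x_obs, ignoring the missingness
    p1 = sigmoid(theta - b[0])
    p2 = sigmoid(theta - b[1])
    return p1 * (1.0 - p2)

grid = [i / 100 for i in range(-400, 401)]
like = [likelihood(t) for t in grid]
mle = grid[like.index(max(like))]              # = 0.5 (where p1 + p2 = 1)

prior = [math.exp(-t * t / 2) for t in grid]   # N(0, 1) population distribution
post = [l * pr for l, pr in zip(like, prior)]
post_mode = grid[post.index(max(post))]        # shrunk from 0.5 toward 0
```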
p(x_obs, I | θ, φ) = ∫ p(x, I | θ, φ) dx_mis

  = ∫ p(x | θ) p(I | x, φ) dx_mis (12.5)

  = ∫ p(x_obs, x_mis | θ) p(I | x_obs, x_mis, φ) dx_mis.

The observed-data likelihood may be recognized as an average over all of the possible
complete-data likelihoods p(x_obs, x_mis | θ) that accord with the observed data x_obs, each
weighted by the probability of the (known) missingness pattern (I) given the observed
data (x_obs) and the possible missing data (x_mis).
A likelihood analysis that ignores the missing data mechanisms works with a likelihood
based only on x_obs, namely

p(x_obs | θ) = ∫ p(x | θ) dx_mis = ∫ p(x_obs, x_mis | θ) dx_mis. (12.6)

This forms the basis of full-information ML approaches to modeling missing data (Enders,
2010).
When is it safe to work with the likelihood in (12.6) rather than the one in (12.5)? Rubin
(1976, Theorem 7.1) showed that in addition to the assumption of the factorable parameter
space, when the missing data are MAR (or its special case, MCAR), the missingness mechanisms are considered ignorable for likelihood-based inference. This means that an analysis
can safely ignore the reasons for missingness when conducting inference about θ through
the likelihood. That is, the likelihood based on x_obs in (12.6) yields the same inferences
about θ as does the likelihood based on (x_obs, I) in (12.5). Ignoring the missingness mechanism essentially treats the realized values of I as fixed, ignoring its stochastic process. If the
data are MAR (or MCAR), nothing is lost here for inferences based on just likelihoods. In
contrast, mechanisms that are MNAR are nonignorable; the pattern of missingness, I, contains information about θ beyond the information contained in x_obs. Inference for θ depends
on the model for the missingness, captured in part by the model for φ.
Example (continued)
Multiple test forms. Suppose that there are two forms of the example test, the first
consisting of Items 1 and 2, the second consisting of Items 1 and 3. Sally was assigned
to Test Form 1 by a coin flip with outcomes denoted as H and T for the two sides of the
coin. Letting φ represent the outcome, the assignment rule was to administer Form 1 if
φ = H and Form 2 if φ = T.* The missingness process p(I | x_obs, x_mis, φ) can be written as

p(I = (1, 1, 0) | x, φ) = 1 if φ = H,
p(I = (1, 0, 1) | x, φ) = 1 if φ = T,
p(I = (i_1, i_2, i_3) | x, φ) = 0 if (i_1, i_2, i_3) ∉ {(1, 1, 0), (1, 0, 1)}.

Because the coin flip does not depend on what Sally's responses were, or would have been
had she also responded to Item 3, it follows that p(I = (1, 1, 0) | x, φ) = p(I = (1, 1, 0) | φ)
and p(I = (1, 0, 1) | x, φ) = p(I = (1, 0, 1) | φ), so the missingness is MCAR. The observed-data likelihood simplifies to p(x_obs | θ), which is the likelihood ignoring the
missingness process.
Adaptive testing. Suppose Sally took a computerized adaptive test (CAT), where
the middle-difficulty item, Item 1, is presented first. If the examinee's response is right,
the more difficult Item 2 is administered, and if it is wrong, the easier Item 3 is administered. In this case,

p(I = (1, 1, 0) | x) = 1 if x_1 = 1,
p(I = (1, 0, 1) | x) = 1 if x_1 = 0,
p(I = (i_1, i_2, i_3) | x) = 0 if (i_1, i_2, i_3) ∉ {(1, 1, 0), (1, 0, 1)}.

The missingness process now depends on Sally's response to Item 1, which is part of
x_obs, but not on her response to Item 3, which is part of x_mis. This missingness is not
MCAR, but it is still MAR, so again the likelihood ignoring the missingness is appropriate under likelihood inference.
Running out of time. Suppose Sally was administered a test form with all three items,
and worked sequentially from the beginning. Sally answered the first two, producing
x_obs = (1, 0), but ran out of time before looking at Item 3. We have assumed that the
Rasch model accounts for response probabilities, but there is a speed factor φ at work as
well. It may well be the case that φ is correlated with θ (say, fast examinees are generally
more proficient than slow examinees) so that p(θ, φ) = p(θ | φ)p(φ). Here the missingness
process depends on φ and not x, so MAR holds; and φ is distinct from θ in the sense of
separate parameter spaces. The missingness is again ignorable under likelihood inference, and the MLE of Sally's θ depends only on her first two responses.
* This is a simplified form administration process, but relates to more complicated procedures in which the
order of forms is predetermined and administered in a spiraled manner in a classroom (e.g., Mazzeo
et al., 2006).
Missing Data Modeling 303
p(θ, φ | I, xobs) = ∫ p(θ, φ, xmis | I, xobs) dxmis
 ∝ ∫ p(xobs, xmis, I | θ, φ)p(θ, φ) dxmis (12.7)
 = ∫ p(xobs, xmis | θ)p(I | xobs, xmis, φ) dxmis p(θ, φ).
The first line in (12.7) marginalizes with respect to xmis. The second line follows from
Bayes' theorem, and the third line substitutes in the observed-data likelihood from (12.5).
This suggests that the observed-data posterior distribution may be recognized as an average
over all of the complete-data posterior distributions that accord with the observed data
xobs, with weights given by the probability of the (known) missingness pattern (I) given the
observed data (xobs) and the possible missing data (xmis).
A Bayesian analysis that ignores the missing data mechanisms works with a posterior
distribution for θ based only on xobs, namely

p(θ | xobs) ∝ p(xobs | θ)p(θ). (12.8)

When is it safe to work with the posterior distribution in (12.8) rather than the one in
(12.7)? Rubin (1976, Theorem 8.1) showed that if the missing data are MAR (or its special
case, MCAR), the parameter space can be factored into a θ-space and a φ-space, and these
parameters are independent in the prior,

p(θ, φ) = p(θ)p(φ), (12.9)
then the missingness mechanisms are ignorable for Bayesian inference about θ. This can
be understood by simplifying the general characterization of the joint posterior for θ and
φ in (12.7) in accordance with these assumptions,
p(θ, φ | I, xobs) ∝ ∫ p(xobs, xmis | θ)p(I | xobs, φ) dxmis p(θ, φ)
(12.10)
 = {p(θ) ∫ p(xobs, xmis | θ) dxmis}{p(φ)p(I | xobs, φ)}.
The first line in (12.10) follows from the general characterization of the joint posterior for θ
and φ in (12.7) and the MAR assumption, in that the distribution of I may depend on
xobs, but not xmis. The second line follows from the a priori independence of the parameters
in (12.9). This last line reveals that the joint posterior may be factored into two terms
in braces: one for θ and one for φ. Focusing on the former, we have the simpler posterior
for θ based only on xobs in (12.8). This means that, under the stated assumptions,
an analysis can safely ignore the reasons for missingness when conducting Bayesian
inference about θ. That is, the posterior based on xobs in (12.8) yields the same inferences
about θ as does the posterior based on (xobs, I) in (12.10). This essentially treats the realized
values of I as fixed, ignoring its stochastic process. If the data are MAR (or MCAR),
nothing is lost here.
304 Bayesian Psychometric Modeling
In contrast, there are two cases when missingness mechanisms are nonignorable under
Bayesian inference:
1. Nonfactorable parameter space. When the missingness process is MAR and θ and φ
are distinct parameter spaces, ignorability holds for likelihood inference but not
for Bayesian inference if p(θ, φ) does not factor as in (12.9). (For example, in an
application of IRT where θ is proficiency and φ is tendency to omit items, suppose
we believe that low-proficiency examinees are more likely to omit items; then
p(θ, φ) ≠ p(θ)p(φ).) Here, inference for θ depends not only on xobs but on indirect
information about θ through I, in the form of

p(θ | I) = ∫ p(θ | φ)p(φ | I) dφ.
θ ~ p(θ | xobs, xmis)
and (12.11)
xmis ~ p(xmis | xobs, θ).
This level of generality is sufficient for our purposes, which is to highlight that there is a
full conditional for xmis. In practice, we may employ variations on this theme. For example,
a univariate approach to Gibbs sampling would work with the univariate full conditionals,
where each parameter in θ is conditioned on the remaining parameters in θ in addition
to xobs and xmis, and each missing value in xmis is conditioned on the remaining missing
values in xmis in addition to θ and xobs. Of course, the full conditional distributions may
simplify in light of the conditional independence assumptions in the model. In particular,
* MAR and distinctness are sufficient but not necessary conditions for ignorability. One can construct cases
in which a missingness process is MNAR and missingness is nonignorable under likelihood inference, yet a
specially built p(θ, φ) cancels out the confounding and ignorability obtains under Bayesian inference.
the local independence assumption in psychometric models implies that, given the model
parameters, the observables are conditionally independent. Under local independence, the
full conditional for the missing data simplifies to

p(xmis | xobs, θ) = p(xmis | θ). (12.12)
Gibbs sampling proceeds by iteratively drawing from the full conditionals. The marginal
distribution for θ from a chain of draws approximates the posterior distribution for θ in (12.8).
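As a sketch of these mechanics (an illustrative implementation, not the authors' code), suppose a Rasch model for the three items of the running example with assumed difficulties b = (0, 1, −1), a standard normal prior on θ, and Sally's responses x1 = 1, x2 = 0 with x3 missing. Metropolis-within-Gibbs alternates an update of θ given the completed data with a Bernoulli draw of the missing response given θ, which by local independence is just the item response probability:

```python
import numpy as np

rng = np.random.default_rng(1)
b = np.array([0.0, 1.0, -1.0])     # assumed item difficulties (illustrative)
mis_idx = 2                        # Item 3 is missing; Items 1-2 observed
x = np.array([1.0, 0.0, 0.0])      # x3 initialized arbitrarily

def irf(theta):
    """Rasch response probabilities for all three items."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def log_post(theta, x):
    """Complete-data log posterior for theta: Bernoulli likelihood + N(0, 1) prior."""
    p = irf(theta)
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p)) - 0.5 * theta**2

theta, draws = 0.0, []
for _ in range(5000):
    # Metropolis update of theta given (xobs, xmis)
    prop = theta + rng.normal(scale=1.0)
    if np.log(rng.uniform()) < log_post(prop, x) - log_post(theta, x):
        theta = prop
    # Draw xmis given theta; by local independence, xobs drops out
    x[mis_idx] = rng.uniform() < irf(theta)[mis_idx]
    draws.append(theta)

print(round(float(np.mean(draws[500:])), 2))   # approximate posterior mean of theta
```

Discarding the draws of x3 and keeping those of θ approximates the posterior in (12.8); keeping the draws of x3 instead yields imputations, a point the chapter returns to.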
Example (continued)
The previous block of examples showed that under IRT assumptions, the cases of
multiple test forms, adaptive testing, and running out of time were ignorable under
likelihood inference. We revisit them here, and see that Bayesian ignorability still
holds if (12.9) holds. If (12.9) does not hold, the missingness is nonignorable for Bayesian
inference, of the Nonfactorable parameter space type.
Multiple test forms. The missingness process for multiple test forms is still MAR, because
it does not depend on an examinee's responses to the items not presented. We saw that the
missingness was ignorable under likelihood inference. It is ignorable under Bayesian inference
as well, as long as form assignment does not depend on θ. Such is the case if the form is
assigned by a coin flip, the result of which has nothing to do with the examinee's θ. Instead,
suppose eighth-grade students are assigned to the harder Test Form 1 with probability
.67 and the easier Test Form 2 with probability .33; fourth-grade students are assigned
randomly to the forms with probabilities .25 and .75, respectively. Again φ is a coin
flip, but with differently biased coins at the two grades. If p(θ | Grade = 4) ≠ p(θ | Grade = 8),
then p(θ, φ | Grade = k) = p(θ | Grade = k)p(φ | Grade = k) but p(θ, φ) ≠ p(θ)p(φ). Ignorability
holds under Bayesian inference only given grade, and the correct conditional-ignorability
posterior is p(θ | xobs, Grade = k) ∝ p(xobs | θ)p(θ | Grade = k). If we know that Sally
is a fourth grader, we should calculate her posterior distribution using the Grade 4 prior
p(θ | Grade = 4). (If we know she was assigned Form 1 under the grade-dependent
administration scheme but we do not know her grade, the prior we should use is a mixture of
the Grade 4 and 8 distributions with weights proportional to .25 and .67, the probabilities
of assignment to Form 1 for students in the two grades, respectively.)
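The mixture weights in the parenthetical can be computed directly; this sketch assumes, purely for illustration, equal numbers of fourth and eighth graders in the population:

```python
# Probability of being assigned Form 1, by grade (from the example),
# assuming (illustratively) equal numbers of students in the two grades
p_form1 = {"grade4": 0.25, "grade8": 0.67}

# Normalize to obtain the mixture weights on the two grade-specific priors
total = sum(p_form1.values())
weights = {g: p / total for g, p in p_form1.items()}

print({g: round(w, 3) for g, w in weights.items()})
# {'grade4': 0.272, 'grade8': 0.728}
```

The prior for a student known only to have received Form 1 is then the corresponding mixture of p(θ | Grade = 4) and p(θ | Grade = 8).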
Adaptive testing. Similar results hold for ignorability under Bayesian inference, with
the exception that if the initial item or the item selection probabilities depend on
some examinee covariate Z, it must be conditioned on in both the prior and posterior:
p(θ | xobs, Z = z) ∝ p(xobs | θ)p(θ | Z = z).
Running out of time. Recall that xmis is MAR if the responses are missing because the
examinee ran out of time before getting to them, and speed φ is assumed to be a separate
parameter from θ. Ignorability holds under likelihood inference whether or not (12.9)
holds. If (12.9) does not hold, θ and φ are not distinct in the required Bayesian sense and
the missingness is not ignorable under Bayesian inference. Inference must proceed with
a marginal posterior for θ that is proportional to the IRT likelihood times a prior that
takes into account how many items the examinee has reached, or p(xobs | θ)p(θ | I), where

p(θ | I) ∝ ∫ p(θ | φ)p(I | φ)p(φ) dφ.

What this means for Sally is that her likelihood function based on Items 1 and 2 remains
the same, but a prior for θ has also been incorporated that reflects the distribution of θ among
examinees who, like Sally, did not reach Item 3. If this distribution is lower than the
undifferentiated population, for example, Sally's posterior distribution will be shifted downward
and her Bayes modal (MAP) estimate will be lower than her MLE would have been.
p(θ, φ, xmis | I, xobs) ∝ p(I | xobs, xmis, θ, φ)p(xobs, xmis | θ)p(φ | θ)p(θ). (12.13)
The key component is the first term on the right-hand side, which is the model for the
missingness pattern given the observed data, missing data, parameters that govern the dis-
tribution of the data, and additional parameters that govern the missingness mechanism.
Given a structure for the missingness process, we may employ MCMC procedures to
empirically approximate the posterior distribution. A Gibbs sampler would iteratively
draw from the full conditionals for the unknowns θ, φ, and xmis. What needs to be conditioned
on and the actual form of each full conditional will depend on the dependence and
conditional independence structures in the model (e.g., does I depend on some or all of the
other entities in the model?).
The marginal distribution for θ from the chain of draws approximates the marginal
posterior distribution for θ,

p(θ | I, xobs) = ∫∫ p(θ, φ, xmis | I, xobs) dφ dxmis. (12.14)
Similarly, MCMC may be used to approximate the marginal distribution of φ or its
dependence on θ if it is of inferential interest.
Example (continued)
Intentional Omits. Suppose Sally chooses not to respond to a dichotomously scored item
on the three-item test if her probability of getting the item correct is less than .5. Note that
her probability depends on both her proficiency variable and the item's parameters, both
included in θ in the following equations for convenience. The missingness model is then

p(I | xobs, xmis, θ, φ) = p(I | θ), where
Ij = 1 if p(correct | θ) ≥ .5
(12.15)
Ij = 0 if p(correct | θ) < .5.

p(I | xobs, xmis, θ, φ) = p(I | θ, φ), where
Ij = 1 if p(believe correct | θ, φ) ≥ .5
(12.16)
Ij = 0 if p(believe correct | θ, φ) < .5.
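Under a Rasch model, p(correct | θ) ≥ .5 exactly when θ is at least the item difficulty, so the deterministic rule in (12.15) can be sketched as follows (item difficulties are assumed for illustration):

```python
def omit_indicator(theta, difficulties):
    """Missingness indicators under (12.15): I_j = 1 (respond) if the Rasch
    probability of a correct response is at least .5, i.e., theta >= b_j;
    I_j = 0 (omit) otherwise."""
    return [1 if theta >= b else 0 for b in difficulties]

b = [0.0, 1.0, -1.0]              # assumed difficulties for the three items
print(omit_indicator(0.5, b))     # [1, 0, 1]: Item 2 is omitted
print(omit_indicator(2.0, b))     # [1, 1, 1]: nothing omitted
```

Because the indicator depends on θ rather than on xobs alone, this missingness process is MNAR and is not ignorable.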
conducted for each of the just-constructed complete datasets. The third phase is a pool-
ing phase in which the results from the analyses in the second phase are synthesized.
The following sections detail these phases, and then revisit the plausible values approach
to estimating distributions of latent variables introduced in the previous chapter. It is a
straightforward application of multiple imputation for missing data.
p(xmis | I, xobs) = ∫∫ p(θ, φ, xmis | I, xobs) dθ dφ. (12.17)
Under the general case of MNAR, this amounts to the marginalization of the joint poste-
rior in (12.13). From a Bayesian perspective, the posterior predictive distribution in (12.17)
is the answer to the question of how to impute missing values: what we believe about each
missing value is fully represented by its predictive distribution given the assumed model,
the observations, and the prior distributions for the unknown parameters (Little & Rubin,
2002; Rubin, 1977).
Multiple imputation is typically employed using the observed-data likelihood reflecting
an assumption of MAR, and we focus on this case in the following development.
Owing to the factorization of the posterior in (12.10), the joint posterior distribution for the
unknowns θ and xmis given xobs is

p(θ, xmis | xobs) = p(xmis | xobs, θ)p(θ | xobs)
 = p(xmis | θ)p(θ | xobs), (12.18)
where the last line follows from local independence of observables in psychometric
models. Marginalizing over the parameters yields the posterior predictive distribution for
the missing data
p(xmis | xobs) = ∫ p(θ, xmis | xobs) dθ
(12.19)
 = ∫ p(xmis | θ)p(θ | xobs) dθ.
A value drawn from this posterior predictive distribution constitutes a single imputation.
Multiple imputations are obtained by repeatedly drawing from this distribution. This is
typically carried out by cycling through two steps, which correspond to an iteration of the
Gibbs sampler that draws from the full conditional distributions for the parameters and
the missing data. The full conditional distributions are:
θ ~ p(θ | xobs, xmis),
(12.20)
xmis ~ p(xmis | xobs, θ).
Note that p(xmis | xobs, θ) = p(xmis | θ) under local independence. First, we simulate a value for
θ from its full conditional distribution given the observed data xobs and initial values for the
missing data xmis. Next, we simulate a value for xmis from its full conditional distribution
given xobs and the just-simulated value for θ. To generate the next imputation, we simulate a
new value for θ from its full conditional distribution, using the just-simulated value for xmis,
and then use that value to simulate xmis from its full conditional. Upon convergence of the
chain, the distribution of the draws for (θ, xmis) approximates the joint posterior distribution.
Accordingly, the marginal distribution for xmis from a chain of draws approximates the
posterior predictive distribution for xmis in (12.19). Each draw for xmis constitutes an imputation.
Importantly, the process for obtaining multiple imputations using MCMC is the same
as the process for conducting Bayesian inference about θ using MCMC. Both involve
obtaining draws from the joint posterior distribution of the parameters and the missing
data given the observed data. The full conditionals in (12.20) are the same as those
in (12.11). In Section 12.2.2, where we focused on inference, we marginalized over the
missing data to yield the marginal posterior distribution for the parameters. In a simulation
environment, this amounted to simulating from the joint posterior distribution and
then paying attention only to the values for the parameters. In this treatment focusing
on imputations, we marginalize over the parameters to yield the posterior predictive
distribution for the missing data to yield imputations. In a simulation environment, this
amounts to simulating from the joint posterior distribution and then paying attention
only to the values for the missing data. The central point is that the procedures are the
same. The mechanisms for obtaining multiple imputations are the same mechanisms for
conducting Bayesian inference. Producing imputations also yields Bayesian inference
for the parameters. Conducting Bayesian inference for the parameters also produces
imputations. Instead of discarding the draws for the parameters in multiple imputation,
they could be retained and used to conduct inference for the parameters. Instead
of discarding the draws for the missing data in conducting inference, they could be
retained and viewed as imputations. The same holds for Bayesian inference under the
more general case of nonignorability. The draws for xmis when simulating values from
the joint posterior distribution in (12.13) amount to multiple imputations.
Under this approach, when we formulate a single model there is no need to distinguish
between an imputation phase and an analysis phase. We conduct inference for the param-
eters at the same time as we obtain imputations. In fact, as noted in Section 12.2.2, we
do not even need to pay attention to the imputations to perform inference. The payoff of
multiple imputation strategies to missing data analysis, and in particular, the distinction
between imputation and analysis phases, comes in situations when a dataset is intended to
be used in different contexts (Rubin, 1996b): by different analysts, at different times, with
different models; or when auxiliary variables not of inferential interest may be useful in
predicting the missing values (Meng, 1994a; Rubin, 1996b; Schafer, 2003). Essentially, the
imputation phase handles the missingness and the resulting multiple completed datasets
are turned over as a set to other analysts for whatever analyses they have in mind in the
analysis phase, with guidelines for how to combine the results obtained from the multiple
datasets in the pooling phase.
With this distinction between imputation and analysis phases, the model and parameters
used for imputations need not be the same as those used in the analysis phase. To make
this explicit, in the imputation phase we specify an imputation model with its own parameters,
which specifies the conditional distribution for the data given those parameters, along with a
prior distribution for them. In general, the imputation-model parameters need not equal θ.
Once imputations are obtained, we proceed to the analysis phase for inference about θ.
Example (continued)
Sally's missing response is MAR in the cases of multiple forms, adaptive testing, and
running out of time. Obtaining a draw for x3 is therefore accomplished using the results
of the ignorability analysis under Bayesian inference. If no covariate Z is involved in
form assignment or in item selection algorithms, then an imputation for x3 is drawn in
these two cases from

p(x3 | x1 = 1, x2 = 0) = ∫ p(x3 | θ)p(θ | x1 = 1, x2 = 0) dθ
 ∝ ∫ p(x3 | θ)p(x1 = 1, x2 = 0 | θ)p(θ) dθ.

That is, first the likelihood for θ that is induced by observing x1 and x2, p(x1 = 1, x2 = 0 | θ), is
combined with the prior p(θ) to produce a distribution for θ given what we
have learned from xobs. Then, we draw a value of θ from it. Finally, we draw a value of x3
from the IRT function p(x3 | θ) evaluated with that draw.
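These steps can be sketched with a simple grid approximation (an illustrative implementation: Rasch items with assumed difficulties and a standard normal prior on θ, none of which are specified in the text):

```python
import numpy as np

rng = np.random.default_rng(7)
grid = np.linspace(-4, 4, 401)        # grid over theta
b = np.array([0.0, 1.0, -1.0])        # assumed item difficulties (illustrative)

def irf(theta, bj):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-(theta - bj)))

# Likelihood induced by x1 = 1, x2 = 0, combined with a N(0, 1) prior
like = irf(grid, b[0]) * (1.0 - irf(grid, b[1]))
prior = np.exp(-0.5 * grid**2)
post = like * prior
post /= post.sum()                    # normalized over the grid

theta_draw = rng.choice(grid, p=post)  # draw theta given what xobs taught us
x3_imputed = int(rng.uniform() < irf(theta_draw, b[2]))  # draw x3 from p(x3 | theta)
print(x3_imputed)                     # a single imputation for Sally's x3
```

Repeating the final two lines with fresh draws yields multiple imputations.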
Because of MAR, the procedure is the same in running out of time, except that the
information about θ contained in the fact that Sally did not reach Item 3 must be taken into
account. Let φ again represent a speed parameter, which may be correlated with θ. We see
the difference in the imputation model in terms of the predictive distribution for θ:

p(x3 | x1 = 1, x2 = 0, I) ∝ ∫∫ p(x3 | θ)p(x1 = 1, x2 = 0 | θ)p(θ | φ)p(φ | I) dθ dφ.

The key difference is that the distribution for θ is now conditional on φ, and information
about φ is contained in its posterior p(φ | I = (1, 1, 0)) ∝ p(I = (1, 1, 0) | φ)p(φ). In practice,
a value for φ is first drawn from its predictive distribution given I = (1, 1, 0). Then a value
for θ is drawn from p(θ | φ) evaluated with this draw. Finally we draw a value of x3 from
the IRT function p(x3 | θ) evaluated with that draw.
The multiple-imputation point estimate θ̄ is the average of the K completed-data estimates θ̂k,

θ̄ = (1/K) Σ_{k=1}^{K} θ̂k. (12.21)

The total variability associated with this estimate is constructed from two components,

V = W + ((K + 1)/K)B, (12.22)

where W is the average of the within-imputation variances Wk,

W = (1/K) Σ_{k=1}^{K} Wk, (12.23)

and B is the between-imputation variance of the completed-data estimates,

B = (1/(K − 1)) Σ_{k=1}^{K} (θ̂k − θ̄)². (12.24)
If the previous phase involved a Bayesian analysis simulating values from the posterior
(for each completed dataset), the pooling phase amounts to combining the K sets of simulations
from the analyses of the completed datasets, effectively creating a single set of simulations
(Gelman et al., 2013). This combined set of simulations may then be summarized as
we would any individual set of simulations in the usual ways (e.g., using graphical
summaries or point or interval summaries).
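A minimal sketch of these pooling formulas, applied to hypothetical completed-data estimates and variances:

```python
def pool(estimates, within_vars):
    """Rubin's rules: pooled point estimate (12.21), average within-imputation
    variance (12.23), between-imputation variance (12.24), and total
    variability (12.22) across K completed-data analyses."""
    K = len(estimates)
    theta_bar = sum(estimates) / K
    W = sum(within_vars) / K
    B = sum((e - theta_bar) ** 2 for e in estimates) / (K - 1)
    V = W + (K + 1) / K * B
    return theta_bar, V

# Hypothetical estimates and variances from K = 5 completed datasets
est = [0.42, 0.51, 0.39, 0.47, 0.46]
var = [0.10, 0.11, 0.09, 0.10, 0.10]
theta_bar, V = pool(est, var)
print(round(theta_bar, 3), round(V, 4))  # 0.45 0.1026
```

The total variability V exceeds the average within-imputation variance W, reflecting the extra uncertainty due to the missing data.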
in common: (1) they are entities in a probability model constructed by an analyst to aid
in reasoning and (2) they are not known with certainty. From a high-level perspective, in
Bayesian inference we place entities into one of two big buckets, one labeled KNOWNS
and the other labeled UNKNOWNS. Missing data, latent variables, and parameters all get
placed in the latter and have the same status: they are unknowns in a probability model,
and through the machinery of Bayes' theorem we can condition on the knowns in the
model to obtain the posterior distribution for them. This synthesis is most sensible in
the context of both elements (1) and (2) just mentioned. The point is that once inside the
probability model, we can use the same machinery (Bayes' theorem, setting up and drawing
from full conditionals, and so on) to coherently tackle all of them.
Importantly, this does not suggest that we should discontinue the use of separate terms,
as they convey potentially crucial differences. We gain by understanding what kinds of
distinctions among these various entities are useful and which are meaningless for various
purposes. We may accrue considerable benefits from capitalizing on the connections among
these entities when suitable. The distinctions are somewhat artificial when it comes to
estimation, and ignoring them may be quite effective in being able to "trick" computer
programs into doing what we want. In Section 13.5.2, we discuss how treating latent variables
as missing data opens up possibilities for resolving estimation difficulties in the context of
latent class analyses. Seeing past the distinctions may also facilitate understanding
connections among different domains. In this respect, Little and Rubin's (2002) book Statistical
Analysis with Missing Data exploits this power to great effect; we see that missing data
in surveys, experimental design, latent variable modeling, semisupervised learning, and
resolving mixtures may all be seen as variations of the same basic ideas.
On the other hand, we may draw conceptual benefits from having distinct terms for
different entities. To see this, first consider the various entities in the bucket labeled
KNOWNS, which includes things like data, fixed values of parameters, and specified values
of hyperparameters in the highest level of a hierarchical prior specification. Again,
the high-level perspective views Bayesian inference as conditioning on the knowns in the
model to obtain the posterior distribution for unknowns. We flesh out the twin emphases
on "knowns" and "in the model." Here, the emphasis on "knowns" calls out that this is a
technical point about statistical machinery: the knowns are those things that are conditioned
on in Bayes' theorem. However, lumping them all together may obscure key distinctions
among them. The emphasis on "in the model" calls out that they are knowns in
the context of our reasoning about some situation of interest. Data are things we observe
or collect from the world. On the other hand, fixing parameters may be based on the need
to resolve indeterminacies. Similarly, hyperparameters in the highest level of hierarchical
priors are not known because they are observed, but because we stipulate them as what
we know from previous experience, expectations, and awareness of how their values
affect what happens in estimation, usually in a manner vaguely explicated if at all. Thus,
the grounding, conceptual meaning, and role played by the various entities that all get
tossed into the KNOWNS bucket vary considerably.
Much the same can be said for latent variables, missing data, and parameters that occupy
the UNKNOWNS bucket. They may be treated the same by the statistical machinery, but
they may reflect differences in grounding, conceptual meaning, and the roles they play.
An important distinction between missing data and latent variables may be seen by
reconsidering our characterization of the role of latent variables. From a measurement
perspective, latent variables are employed in psychometric models where what is ultimately of
inferential interest is not directly observable. From a statistical perspective, latent variables
are used in structuring the distribution of variables we can observe. De Finetti's
representation theorem for exchangeability is a powerful tool for structuring high-dimensional
joint distributions of observables in terms of simpler conditional distributions given
latent variables that induce conditional independence. Pearl (1988) pointed out that
conditional independence is so central a tool in building and reasoning through models that,
when the variables that induce conditional independence are unknown or unavailable,
we create them. This reveals a key distinction between latent variables and missing data.
Whereas missing data are things that we do not observe but potentially could have under
different circumstances or choices, this is not the case for latent variables, at least as
they are employed in psychometrics. We do not choose whether to observe them; we choose
whether to introduce them when constructing a model. It is not that they are entities out
there in the world that we happened to not have observed but could have under different
choices or circumstances; there are no values out there in the world to observe. The
terminology may represent an interpretative shorthand: we often use the term "missing
data" to stand for values that in some sense we conceivably could have observed, "latent
variable" to stand for things we could not, and "parameter" or "person parameter" to
highlight that it is of inferential interest.*
On this account, latent variables are entities introduced for and used in a modeling
context to organize our thinking and reason through the messy real-world problem of
making sense of examinees' behaviors. The behaviors are situated, and our reasoning
task is one of interpreting certain behaviors by certain examinees in certain contexts.
Our thinking and reasoning are likewise situated, against a backdrop of many things we
know or believe with some degree of confidence, and a potentially related backdrop of
purposes and constraints. And it is in this context of a complex reasoning task that we
employ latent variables to summarize and capture our thinking about the examinees.
Adopting the role of the assessor or analyst, the nature and values of latent variables have
more to do with what is going on in our heads than what is going on in the examinees'
heads. A latent variable's purpose (to organize our thinking in the context of a model
constructed to reason from situated data to situated ends) and its role as a parameter on
which other things depend (namely, observable variables that are rendered conditionally
independent) reveal that a latent variable is quite a different beast than what is usually
denoted by the term "missing data."
To summarize, the distinctions among these entities are often not relevant for doing work within
the probability model, but are indeed typically relevant for how they arrive in the model
and what we think a model and model-based reasoning has to say about them.
* Without advancing a causal account, we suspect that the relative paucity of use of the term "person parameter"
in CFA is related to the relative lack of applications in which inferences about the values of the person
parameters/latent variables are sought. The relatively more frequent use of "person parameter" in IRT is
related to the relative plethora of applications in which inferences about the values of the person parameters/
latent variables are sought.
case, correct inferences result from analyses that take the pattern of missingness as given,
and simply use p(xobs | θ) as the likelihood. Most statistical packages now offer this option.
However, if the missingness is nonignorable, inference about the parameters requires
including a model for missingness. A brief review of multiple imputation was proffered, and it
was emphasized that the imputation phase is explicitly Bayesian. A Bayesian perspective on
modeling brings to light connections among missing data, latent variables, and parameters.
In addition to the foundational work of Rubin (1978, 1987, 1996b), excellent didactic
treatments of multiple imputation can be found in Enders (2010), Schafer and Graham
(2002), and Sinharay, Stern, and Russell (2001). In psychometrics, procedures and examples
for modeling ignorable and nonignorable missingness have also been developed in CFA
(Cai & Song, 2010; Lee, 2006; Song & Lee, 2002a) and IRT (Fu, Tao, & Shi, 2010; Maier, 2002;
Patz & Junker, 1999a). Additional examples, analyses, and notable discussion can be found
in Mislevy (in press) in the context of IRT and Lee (2007) in the context of CFA and SEM.
Exercises
12.1 Beginning with the joint posterior distribution for θ, φ, and xmis, show that the
posterior distribution for θ and φ is separable under the assumptions of factorization
of the parameter space into a θ-space and a φ-space, MAR, and a priori
independence of θ and φ.
12.2 You receive six values x from a population in which x ~ N(μ, 1):
0.99, 0.68, 0.23, 0.89, 1.16, 2.27.
Use the prior μ ~ N(0, 1). Obtain the posterior for μ, including the posterior mean
and variance, assuming the six values are a random sample from the population.
12.3 Use the same population distribution, prior, and six observations from Exercise
12.2, but now assume that the six values are xobs from a random sample of 10 from
the population, where the missing values are MCAR.
a. Use multiple imputation to create five imputed data sets.
b. Obtain the posterior distribution for μ in each completed data set separately.
c. Obtain the posterior distribution for μ under Rubin's formulas for pooling the
information across multiple imputations.
d. Compare the results of Exercises 12.2, 12.3(b), and 12.3(c). How do they differ,
and why?
12.4 Analysts who are new to multiple imputation are sometimes suspicious that
answers are too accurate because we are "making up data." Do you agree or
disagree? Explain your answer.
12.5 Build an imputation model for Sally's response for x3 under random assignment
to the two multiple forms.
12.6 Build an imputation model for Sally's response for x3 under random assignment
to the two forms, but with probabilities .67 and .33 to Forms 1 and 2 for eighth
graders and .25 and .75 for fourth graders, and p(θ | Grade = 4) = N(−1, 1) and
p(θ | Grade = 8) = N(1, 1), under the following states of knowledge:
a. You know Sally is in Grade 4.
Latent class analysis (LCA) departs from the psychometric models previously presented
in that it employs discrete latent variables. We focus on the case of discrete observables as
well. Analysts employ discrete latent variables when they want to organize their thinking
about examinees in terms of categories or groupings, as opposed to organizing their
thinking about examinees along a continuum by using continuous latent variables.* We
have in fact already encountered simple latent class models in the subtraction proficiency
example in Chapters 1 and 7 and the medical diagnosis example in Chapter 2. In those
treatments we confined our analyses to the context of scoring examinees, where the
measurement model parameters and the parameters governing the distribution of the latent
variables were known. Here, we expand our treatment and conduct inference for these
parameters of the model as well as for examinees. We will see a number of the themes
present in CTT, CFA, and IRT, including conditional independence assumptions and
hierarchical specifications.
Treatments and overviews of LCA from conventional perspectives can be found in
Collins and Lanza (2010), Dayton (1999), Dayton and Macready (2007), Lazarsfeld and
Henry (1968), and McCutcheon (1987). These and other conventional sources emphasize
Bayesian elements for some aspects of inference, but not others. LCA is often employed
in an exploratory manner, where the goal is to determine the number of latent classes
(i.e., levels of the latent variable) and examine the pattern of dependence of the observables
on those classes. This chapter treats more confirmatory flavors of LCA, in which a
particular number of latent classes is specified, possibly with constraints on the conditional
probabilities that capture the dependence structure of the observables (Croon,
1990; Dayton & Macready, 2007; Goodman, 1974; Hoijtink, 1998; Lindsay, Clogg, & Grego,
1991; van Onna, 2002).
In Section 13.1, we review conventional approaches to LCA. In Section 13.2, we present
a Bayesian analysis for the general case of polytomous observable and latent variables.
In Section 13.3, we present a Bayesian analysis for the special case of dichotomous
observable and latent variables. An example is given in Section 13.4. In Section 13.5,
we discuss strategies for resolving indeterminacies associated with the use of discrete
latent variables. We conclude this chapter in Section 13.6 with a summary and
bibliographic note.
* The distinction between continuous and discrete latent variables is not always sharp, as models that posit
differences here may be exactly or approximately equivalent (e.g., Haertel, 1990; Heinen, 1996; von Davier,
2008).
Σ_{k=1}^{K} πcjk = Σ_{k=1}^{K} P(xij = k | θi = c) = 1.

A traditional formulation of a C-class model specifies the probability that examinee i has
a value of k on observable j as

P(xij = k) = Σ_{c=1}^{C} P(xij = k | θi = c, πj)γc, (13.1)
TABLE 13.1
Conditional Probability Table for Observable j with K = 3 Categories, Given
Membership in One of C = 2 Classes (θ), Where πcjk Denotes the Probability
of Responding to Observable j in Category k Given Membership in Class c

Latent Class (θ)    Category k = 1              k = 2    k = 3
1                   π1j1 = 1 − (π1j2 + π1j3)    π1j2     π1j3
2                   π2j1 = 1 − (π2j2 + π2j3)    π2j2     π2j3
Latent Class Analysis 319
where γ_c = P(θ_i = c) is the proportion of examinees in class c, also referred to as the size of
class c, subject to the restriction that

Σ_{c=1}^{C} γ_c = Σ_{c=1}^{C} P(θ_i = c) = 1.
In other words, (13.1) says that the marginal probability of the response xij = k is the weighted
average of the probability of this response from examinees in each class c, weighted by the
class proportions.
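The weighted average in (13.1) can be sketched numerically. The following Python fragment is illustrative only; the two π rows and the class proportions γ are hypothetical values in the style of Table 13.1, not estimates from the book.

```python
# Illustrative sketch of (13.1): C = 2 classes, one observable j with K = 3
# categories. pi[c][k] is the conditional probability of category k given
# class c (rows of a Table 13.1-style table); gamma[c] is the class proportion.
pi = [
    [0.7, 0.2, 0.1],  # hypothetical class 1 row
    [0.2, 0.3, 0.5],  # hypothetical class 2 row
]
gamma = [0.6, 0.4]

def marginal_prob(k):
    """P(x_ij = k): class-conditional probabilities weighted by class sizes."""
    return sum(gamma[c] * pi[c][k] for c in range(len(gamma)))

marginals = [marginal_prob(k) for k in range(3)]
print(marginals)  # each entry is a weighted average of the two class rows
```

Because each class row sums to one and the class proportions sum to one, the marginal probabilities also sum to one.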
The framework can also accommodate multiple discrete latent variables by constructing a
single discrete latent variable that captures all the possible patterns generated from the mul-
tiple discrete variables. Consider a situation in which we conceive of three dichotomous latent
variables, each of which can take a value of 1 or 2. There are 23 = 8 possible profiles correspond-
ing to combinations of values on these variables: [1,1,1], [1,1,2], [1,2,1], [1,2,2], [2,1,1], [2,1,2], [2,2,1],
and [2,2,2], where the sequence of three numbers refers to the levels of the three variables
(e.g., [2,1,2] refers to an examinee being in the second class on latent variable 1, the first class
on latent variable 2, and the second class on latent variable 3). We could reconceive the situa-
tion as one in which there is a single latent variable with eight classes, each corresponding to
one of the profiles. In this way, we can naturally fold the situation of multiple discrete latent
variables into one in which there is a single discrete latent variable, and we are squarely back
to the model described in (13.1). We therefore restrict our attention in the rest of this chapter
to the situation in which there is a single discrete latent variable. In Chapter14, we revisit this
situation and describe approaches to modeling multiple distinct discrete latent variables.
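The folding of three dichotomous latent variables into one eight-class latent variable can be made concrete with a short sketch; the enumeration below is illustrative, not code from the book.

```python
# Sketch: folding three dichotomous latent variables (levels coded 1 and 2)
# into a single discrete latent variable whose classes are the 2**3 = 8 profiles.
from itertools import product

profiles = list(product([1, 2], repeat=3))
print(len(profiles))   # 8 classes for the single folded latent variable
print(profiles[5])     # (2, 1, 2): class 2 on variables 1 and 3, class 1 on variable 2
```

Each profile becomes one class of the single latent variable, and the model in (13.1) applies unchanged with C = 8.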
P(x | θ, π) = Π_{i=1}^{n} P(x_i | θ_i = c, π) = Π_{i=1}^{n} Π_{j=1}^{J} P(x_ij | θ_i = c, π_j),  (13.2)

where

P(x_ij | θ_i = c, π_j) = Π_{k=1}^{K} π_cjk^{I(x_ij = k)},  (13.3)
and I is the indicator function that here takes a value of 1 when the response from exam-
inee i for observable j is k, and 0 otherwise. The marginal probability of the observables is
a version of (7.8), instantiated with a discrete latent variable:
P(x | π, γ) = Π_{i=1}^{n} Σ_{c=1}^{C} p(x_i | θ_i = c, π) p(θ_i = c | γ).  (13.4)
The distribution of the latent variables is a categorical distribution with category probabilities
or proportions contained in γ. Once values for the data are observed, (13.4) may
be treated as a likelihood function for π and γ. In general, there is no closed-form solution,
and numerical methods are typically employed (Dempster, Laird, & Rubin, 1977;
Goodman, 1974).
Turning to scoring, the consensus approach is to use Bayes' theorem to obtain the
posterior distribution for an examinee's latent variable using point estimates for the
response-probability and class-proportion parameters, as in (7.10). The resulting posterior
distribution may be reported as such or summarized via the MAP, which amounts
to estimating the value of the latent variable as the class with the highest posterior
probability.
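The scoring step can be sketched for the dichotomous-observable case. All numbers below (the class proportions, the conditional probabilities, and the response vector) are hypothetical point values chosen for illustration.

```python
# Sketch of scoring via Bayes' theorem with point estimates:
# p(theta_i = c | x_i) is proportional to gamma_c * prod_j pi_cj^x_ij (1 - pi_cj)^(1 - x_ij),
# followed by the MAP class. Dichotomous observables for simplicity.
gamma = [0.85, 0.15]                 # hypothetical class proportions
pi = [[0.05, 0.05, 0.05, 0.20],      # hypothetical P(x_j = 1 | class 1)
      [0.60, 0.60, 0.25, 0.40]]      # hypothetical P(x_j = 1 | class 2)
x = [1, 1, 0, 0]                     # observed response vector

unnorm = []
for c in range(2):
    lik = 1.0
    for j, xj in enumerate(x):
        lik *= pi[c][j] if xj == 1 else 1 - pi[c][j]
    unnorm.append(gamma[c] * lik)

posterior = [u / sum(unnorm) for u in unnorm]
map_class = max(range(2), key=lambda c: posterior[c])  # class with highest posterior
print(posterior, map_class)
```

With these values, endorsing the first two observables makes class 2 (index 1) far more probable despite its small prior proportion.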
p(x | θ, π) = Π_{i=1}^{n} p(x_i | θ_i, π) = Π_{i=1}^{n} Π_{j=1}^{J} p(x_ij | θ_i = c, π_j),  (13.5)

and π_cj = (π_cj1, …, π_cjK) are the conditional probabilities for observable j associated with
class c, captured in row c of tables like that in Table 13.1.
p(θ) = Π_{i=1}^{n} p(θ_i | η_P),  (13.8)

where η_P denotes the hyperparameters for specifying the prior for each θ_i. In the current
context, the latent variable is assumed to be categorically distributed, in which case
η_P = γ = (γ_1, …, γ_C) are the latent class proportions and

θ_i | γ ~ Categorical(γ).  (13.9)
p(π) = Π_{j=1}^{J} p(π_j | η_π) = Π_{j=1}^{J} Π_{c=1}^{C} p(π_cj | η_π),  (13.11)
where η_π denotes the hyperparameters spelled out in more detail below. Each π_j is composed
of C vectors of length K that are the conditional probabilities for the observable taking
a value of k given membership in class c, for k = 1, …, K, and c = 1, …, C. More formally,
π_j = (π_1j, …, π_Cj), where each π_cj = (π_cj1, …, π_cjK). The right-hand side of (13.11) reflects a prior
independence assumption among the π_cj. As each π_cj is a vector of conditional probabilities
for a categorically distributed variable, a natural choice is to again employ Dirichlet prior
distributions,

π_cj ~ Dirichlet(α_c),  (13.12)

where α_c = (α_c1, …, α_cK) are hyperparameters that govern the Dirichlet distribution for
the conditional probabilities given membership in latent class c. The exchangeability
assumption is reflected in the absence of the j in the subscript for α_c. This may be relaxed
if exchangeability is only partial or conditional on other variables; in the limit, unique
prior distributions may be specified for each observable. Continuing the development
assuming exchangeability, the full collection of hyperparameters for the measurement
model parameters is η_π = (α_1, …, α_C).
FIGURE 13.1
Directed acyclic graph for a latent class model for J observables from each of n examinees, with a single discrete
latent variable with C classes. (Node labels: θ_i within the plate over examinees i = 1, …, n; x_ij and π_cj within
the plates over observables j = 1, …, J and classes c = 1, …, C; γ as a parent of θ_i.)
p(θ, γ, π | x) ∝ p(x | θ, γ, π) p(θ, γ, π)
= [Π_{i=1}^{n} Π_{j=1}^{J} p(x_ij | θ_i, π_j) p(θ_i | γ)] p(γ) Π_{c=1}^{C} Π_{j=1}^{J} p(π_cj),

where

(x_ij | θ_i = c, π_j) ~ Categorical(π_cj) for i = 1, …, n, j = 1, …, J,
θ_i | γ ~ Categorical(γ) for i = 1, …, n,
γ ~ Dirichlet(α_γ),
and
π_cj ~ Dirichlet(α_c) for c = 1, …, C, j = 1, …, J.
This specification is fairly open in the sense that it does not include any constraints that
may be imposed to resolve the indeterminacies and prevent label switching or to reflect
substantive theory. For ease of exposition, we continue with the current specification when
describing the full conditional distributions and then MCMC approaches next. We return
to this issue in the context of an example in Section 13.4.
θ_i | γ, π, x_i ~ Categorical(s_i),  (13.14)

where x_i = (x_i1, …, x_iJ) refers to the collection of J observed values from examinee i,
s_i = (s_i1, …, s_iC),

s_ic = [Π_{j=1}^{J} Π_{k=1}^{K} π_cjk^{I(x_ij = k)}] γ_c / Σ_{g=1}^{C} [Π_{j=1}^{J} Π_{k=1}^{K} π_gjk^{I(x_ij = k)}] γ_g,  (13.15)

and I is the indicator function that here takes a value of 1 when the response from examinee
i for observable j is k, and 0 otherwise.
* We suppress the role of specified hyperparameters in this notation; see Appendix A for a presentation that
formally includes the hyperparameters.
γ | θ ~ Dirichlet(α_γ1 + r_1, …, α_γC + r_C),  (13.16)

where r_1, …, r_C are counts of the number of examinees for whom θ_i = c (i.e., who are members of
class c).
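The conjugate update in (13.16) is simple to compute: the Dirichlet full conditional adds the class-membership counts to the hyperparameters. The values below are hypothetical.

```python
# Sketch of the conjugate update in (13.16) with hypothetical values:
# C = 3 classes and a uniform Dirichlet prior alpha_gamma = (1, 1, 1).
alpha_gamma = [1, 1, 1]
theta = [1, 2, 2, 3, 1, 2, 1, 1]              # sampled class memberships

counts = [theta.count(c) for c in (1, 2, 3)]  # r_1, r_2, r_3
posterior_alpha = [a + r for a, r in zip(alpha_gamma, counts)]
print(posterior_alpha)  # parameters of the Dirichlet full conditional for gamma
```

With four examinees in class 1, three in class 2, and one in class 3, the full conditional is Dirichlet(5, 4, 2).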
Turning to the measurement model parameters, we present the full conditional for π_cj,
which is the collection of conditional probabilities for the possible values of observable
j given the examinee is a member of latent class c. The same structure applies to all the
observables, for all classes. The full conditional distribution for each π_cj is

π_cj | θ, x_j ~ Dirichlet(α_c1 + r_cj1, …, α_cK + r_cjK),  (13.17)

where x_j = (x_1j, …, x_nj) refers to the collection of n observed values for observable j and
r_cj1, …, r_cjK are, for k = 1, …, K, the counts of the number of examinees for whom θ_i = c (i.e., who are members of
class c) and who have a value of k for observable j. Derivations for these full conditional distributions
are given in Appendix A.
A Gibbs sampler is constructed by iteratively drawing from these full conditional dis-
tributions using the just-drawn values for the conditioned parameters, generically written
below (for iteration t + 1).
13.2.4.1 Gibbs Sampling Routine for Polytomous Latent and Observable Variables
1. Sample the latent variables for examinees. For each examinee i = 1, …, n, use the values
of the measurement model parameters and the parameters that govern the latent
variable distribution from the previous iteration, π^(t) and γ^(t), to
compute s_i^(t) = (s_i1^(t), …, s_iC^(t)), where each

s_ic^(t) = [Π_{j=1}^{J} Π_{k=1}^{K} (π_cjk^(t))^{I(x_ij = k)}] γ_c^(t) / Σ_{g=1}^{C} [Π_{j=1}^{J} Π_{k=1}^{K} (π_gjk^(t))^{I(x_ij = k)}] γ_g^(t),

and then sample a value for the latent variable from

θ_i^(t+1) | γ^(t), π^(t), x_i ~ Categorical(s_i^(t)).  (13.18)

2. Sample the parameters for the latent variable distribution. Count the number of examinees
in each class among the just-sampled latent variables, θ^(t+1), yielding r_1^(t+1), …, r_C^(t+1),
and sample a value for the latent class proportions from

γ^(t+1) | θ^(t+1) ~ Dirichlet(α_γ1 + r_1^(t+1), …, α_γC + r_C^(t+1)).  (13.19)
3. Sample the measurement model parameters. For each observable j = 1, …, J, use the
values of the latent variables from step (1), θ^(t+1), and count the number of examinees
in each class that had a value of k for the observable. These counts are the
class- and observable-specific frequencies of values in each category of the observable,
(r_cj1^(t+1), …, r_cjK^(t+1)). For each observable j and each latent class c, use these values
for (r_cj1, …, r_cjK) and sample a value for the conditional probabilities of observing a
value in the categories from

π_cj^(t+1) | θ^(t+1), x_j ~ Dirichlet(α_c1 + r_cj1^(t+1), …, α_cK + r_cjK^(t+1)).  (13.20)
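The full routine can be sketched end to end in Python using only the standard library (a Dirichlet draw is built from gamma variates). Everything below is an illustrative sketch, not the book's WinBUGS code: the data are randomly generated, the hyperparameters are hypothetical, and no constraint is imposed to resolve the label-switching indeterminacy discussed in the text.

```python
# Minimal Gibbs sampler sketch for the polytomous latent class routine above.
import random

random.seed(13)

n, J, K, C = 30, 4, 3, 2
alpha_gamma = [1.0] * C                  # Dirichlet prior on gamma (hypothetical)
alpha = [[1.0] * K for _ in range(C)]    # Dirichlet prior on each pi_cj (hypothetical)

# hypothetical responses, categories coded 1..K
x = [[random.randint(1, K) for _ in range(J)] for _ in range(n)]

def rdirichlet(a):
    """Draw from Dirichlet(a) via normalized gamma variates."""
    g = [random.gammavariate(ai, 1.0) for ai in a]
    s = sum(g)
    return [gi / s for gi in g]

def rcategorical(p):
    """Draw a class label 1..len(p) from category probabilities p."""
    u, cum = random.random(), 0.0
    for c, pc in enumerate(p, start=1):
        cum += pc
        if u <= cum:
            return c
    return len(p)

gamma = [1.0 / C] * C                    # start values
pi = [[[1.0 / K] * K for _ in range(J)] for _ in range(C)]
theta = [1] * n

for t in range(200):
    # step 1: sample latent variables from their full conditionals (13.14)-(13.15)
    for i in range(n):
        w = []
        for c in range(C):
            lik = gamma[c]
            for j in range(J):
                lik *= pi[c][j][x[i][j] - 1]
            w.append(lik)
        s = sum(w)
        theta[i] = rcategorical([wi / s for wi in w])
    # step 2: sample the class proportions from (13.16)
    r = [theta.count(c + 1) for c in range(C)]
    gamma = rdirichlet([a + rc for a, rc in zip(alpha_gamma, r)])
    # step 3: sample each pi_cj from its Dirichlet full conditional, using
    # class- and category-specific counts
    for c in range(C):
        for j in range(J):
            rcjk = [sum(1 for i in range(n)
                        if theta[i] == c + 1 and x[i][j] == k + 1)
                    for k in range(K)]
            pi[c][j] = rdirichlet([a + rk for a, rk in zip(alpha[c], rcjk)])

print(sum(gamma))  # gamma remains a probability vector at every iteration
```

Because no identifying constraint is included, a long run of this sketch can exhibit the label switching discussed later in the chapter.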
where π_cj is the element in π_j that captures the conditional probability of x_ij being a value
of 1 given an examinee is in class c.
θ_i | γ ~ Bernoulli(γ).  (13.22)
We note that under the usual specification of Bernoulli variables, the two classes of the
latent variable are coded as 0 and 1, with γ being the proportion of examinees in the class
coded as 1. The proportion of examinees in the class coded as 0 is 1 − γ. In some situations,
this coding may be desirable. In the next chapter, we will use this coding for a latent vari-
able capturing whether an examinee possesses a skill (coded as 1) or does not possess the
skill (coded as 0). Alternatively, we may prefer coding the latent variable as 1 and 2. This
is usually the case when programming a model in WinBUGS (or other software) in terms
of tables of conditional probabilities, such as that in Table 13.1. Here, we need to refer-
ence the rows of the table defined by the possible values of the latent variable, and it is
sometimes difficult or impossible to reference the 0th row of a table; referring to the rows
(latent classes) as 1 and 2 rather than 0 and 1 is more convenient. One option is to move
to the categorical specification in (13.9); in the current case of a dichotomous latent variable,
the collection of probabilities in the categorical specification reduces to (1 − γ, γ).
A second option is to recode the result of a Bernoulli specification from its natural 0/1 coding
to a 1/2 coding. This could be done formally via the introduction of a new parameter, but
we prefer the following expression, intended to communicate that θ_i is the result of taking
a Bernoulli random variable and adding 1, effectively converting 0/1 to 1/2 coding:

θ_i | γ ~ Bernoulli(γ) + 1.  (13.23)
Either way, we are left with a single parameter γ that governs the distribution of the dichotomous
latent variable, in effect capturing the probability of being in a particular class,
regardless of whether it is coded as 1 in the 0/1 coding or 2 in the 1/2 coding. We therefore
specify a beta prior for γ:

γ ~ Beta(α_γ, β_γ).  (13.24)

Similarly, beta priors may be specified for the measurement model parameters:

π_cj ~ Beta(α_c, β_c).  (13.25)
p(θ, γ, π | x) ∝ [Π_{i=1}^{n} Π_{j=1}^{J} p(x_ij | θ_i, π_j) p(θ_i | γ)] p(γ) Π_{c=1}^{C} Π_{j=1}^{J} p(π_cj),  (13.26)

where

(x_ij | θ_i = c, π_j) ~ Bernoulli(π_cj) for i = 1, …, n, j = 1, …, J,
θ_i | γ ~ Bernoulli(γ) + 1 for i = 1, …, n,
γ ~ Beta(α_γ, β_γ),
and
π_cj ~ Beta(α_c, β_c) for c = 1, 2, j = 1, …, J.
the class coded as 2. Beginning with the examinees' latent variables, we present the full
conditional for any examinee i; the same structure applies to all the examinees. The full
conditional distribution for the latent variable for examinee i is

θ_i | γ, π, x_i ~ Bernoulli(s_i) + 1,  (13.27)

where

s_i = p(x_i | θ_i = 2, π) p(θ_i = 2 | γ) / p(x_i)
    = [Π_{j=1}^{J} π_2j^{x_ij} (1 − π_2j)^{1−x_ij}] γ / ( [Π_{j=1}^{J} π_2j^{x_ij} (1 − π_2j)^{1−x_ij}] γ + [Π_{j=1}^{J} π_1j^{x_ij} (1 − π_1j)^{1−x_ij}] (1 − γ) ).
The full conditional for the proportion of examinees in latent class 2 is

γ | θ ~ Beta(α_γ + r_2, β_γ + r_1),  (13.28)

where r_c is the number of examinees for whom θ_i = c (i.e., who are members of class c), and
r_1 + r_2 = n.
Turning to the measurement model parameters, we present the full conditional for π_cj,
the conditional probability that the value of observable j is 1 given the examinee is a member
of latent class c. The same structure applies to all the observables, for all classes. The
full conditional for π_cj is

π_cj | θ, x_j ~ Beta(α_c + r_cj, β_c + r_c − r_cj),  (13.29)

where r_cj is the number of examinees in class c with a value of 1 for observable j.
13.3.4.1 Gibbs Sampling Routine for Dichotomous Latent and Observable Variables
1. Sample the latent variables for examinees. For each examinee i = 1, …, n, use the values
of the measurement model parameters and the parameters that govern the latent
variable distribution from the previous iteration, π^(t) and γ^(t), to
compute

s_i^(t) = p(x_i | θ_i = 2, π^(t)) p(θ_i = 2 | γ^(t)) / p(x_i)
        = [Π_{j=1}^{J} (π_2j^(t))^{x_ij} (1 − π_2j^(t))^{1−x_ij}] γ^(t) / ( [Π_{j=1}^{J} (π_2j^(t))^{x_ij} (1 − π_2j^(t))^{1−x_ij}] γ^(t) + [Π_{j=1}^{J} (π_1j^(t))^{x_ij} (1 − π_1j^(t))^{1−x_ij}] (1 − γ^(t)) ).
π_cj^(t+1) | θ^(t+1), x_j ~ Beta(α_c + r_cj^(t+1), β_c + r_c^(t+1) − r_cj^(t+1)),  (13.32)

where x_j = (x_1j, …, x_nj) is the collection of n observed values for observable j.
TABLE 13.2
Frequency of Response Vectors for the Cheating Data
Observable
Vector ID 1 2 3 4 Frequency
1 0 0 0 0 207
2 1 0 0 0 10
3 0 1 0 0 13
4 1 1 0 0 11
5 0 0 1 0 7
6 1 0 1 0 1
7 0 1 1 0 1
8 1 1 1 0 1
9 0 0 0 1 46
10 1 0 0 1 3
11 0 1 0 1 4
12 1 1 0 1 4
13 0 0 1 1 5
14 1 0 1 1 2
15 0 1 1 1 2
16 1 1 1 1 2
Source: Dayton, C. M. (1999). Latent class scaling analysis. Thousand Oaks, CA:
SAGE Publications. With permission.
p(θ, γ, π | x) ∝ [Π_{i=1}^{n} Π_{j=1}^{J} p(x_ij | θ_i, π_j) p(θ_i | γ)] p(γ) Π_{c=1}^{C} Π_{j=1}^{J} p(π_cj),  (13.33)

where

(x_ij | θ_i = c, π_j) ~ Bernoulli(π_cj) for i = 1, …, n, j = 1, …, J,
θ_i | γ ~ Categorical(γ) for i = 1, …, n,
γ ~ Dirichlet(α_γ),
π_cj ~ Beta(α_c, β_c) for c = 1, 2, j = 1, …, J − 1,
π_14 ~ Beta(α_1, β_1) truncated from above at π_24, and
π_24 ~ Beta(α_2, β_2) truncated from below at π_14.
FIGURE 13.2
Directed acyclic graph for a 2-class latent class model for the academic cheating example. The nodes for the
values for the fourth observable (x_i4), the conditional probabilities of the value for the fourth observable (π_14 and
π_24), and the hyperparameters that govern their prior distribution (α_1, β_1, α_2, and β_2) lie outside the plates over
the remaining observables.
The priors treat the conditional probabilities for the last observable distinctly, subject to the constraint that the conditional
probability for class 2 is larger than that for class 1. This is sufficient to resolve the
indeterminacy in the latent variable. Conceptually, this constraint renders class 2
the class that was more likely to have copied answers. The distinct treatment of the last
observable manifests itself in terms of the observable and the conditional probabilities
(π_14 and π_24) lying outside the plate over the remaining observables. Additionally, the DAG
expresses that π_24 depends on π_14, as implied by the last line of specifications in (13.33). The
constraint amounts to specifying truncated beta prior distributions for these parameters
and implies that the corresponding full conditional distributions are truncated beta distributions.
Analogously, the full conditional distribution for the probability vector for a
polytomous variable so constrained is a truncated Dirichlet distribution (Gelfand, Smith,
& Lee, 1992; Nadarajah & Kotz, 2006; Sedransk, Monahan, & Chiu, 1985).
One salient feature of the model in this example is the constraint that π_24 > π_14. As a
result, the full conditionals for these parameters will be truncated beta distributions.
For π_14,

π_14 | θ, π_24, x_4 ~ Beta_{<π_24}(α_1 + r_14, β_1 + r_1 − r_14),  (13.34)

where Beta_{<π_24} denotes the beta distribution truncated to be less than π_24, r_1 is the number of
examinees for whom θ_i = 1 (i.e., who are members of class 1), and r_14 is the number of examinees
for whom θ_i = 1 (i.e., who are members of class 1) that have a value of 1 for observable 4.
For π_24,

π_24 | θ, π_14, x_4 ~ Beta_{>π_14}(α_2 + r_24, β_2 + r_2 − r_24),  (13.35)

where Beta_{>π_14} denotes the beta distribution truncated to be greater than π_14, r_2 is the number
of examinees for whom θ_i = 2 (i.e., who are members of class 2), and r_24 is the number of examinees
for whom θ_i = 2 (i.e., who are members of class 2) that have a value of 1 for observable 4.
The Gibbs sampler is modified accordingly. Suppose in each iteration we draw a value
for π_14 before we draw a value for π_24. The instantiations of (13.32) then become

π_14^(t+1) | θ^(t+1), π_24^(t), x_4 ~ Beta_{<π_24^(t)}(α_1 + r_14^(t+1), β_1 + r_1^(t+1) − r_14^(t+1))  (13.36)

and

π_24^(t+1) | θ^(t+1), π_14^(t+1), x_4 ~ Beta_{>π_14^(t+1)}(α_2 + r_24^(t+1), β_2 + r_2^(t+1) − r_24^(t+1)).  (13.37)
To illustrate, we step through the computations of the first iteration of the Gibbs sampler.
We begin by specifying initial values. We will say more shortly about choosing start values
in Section 13.4.3, where we fit the model in WinBUGS using multiple sets of dispersed start
values. The ones used here were arbitrarily chosen for the sake of illustration. For the conditional
probabilities for the observables, we set π_11^(0) = .4, π_21^(0) = .8, π_12^(0) = .3, π_22^(0) = .7, π_13^(0) = .8,
π_23^(0) = .6, π_14^(0) = .1, and π_24^(0) = .5. Note that these last two honor the constraint that π_24 > π_14.
For the latent variables, θ_1^(0), …, θ_219^(0) are set to 1 and θ_220^(0), …, θ_319^(0) are set to 2. Finally, we set
the initial value for the proportion of examinees in latent class 2 to be .2. In the Dirichlet-categorical
specification adopted here, this amounts to specifying γ^(0) = (γ_1^(0), γ_2^(0)) = (.8, .2).
The first iteration of the Gibbs sampler is composed of the following steps.
1. Sample the latent variables for examinees. For the first examinee, the response vector was
(0,0,0,0). We use these values to compute the ingredients of s_ic^(0) based on each class:

[Π_{j=1}^{J} (π_1j^(0))^{x_ij} (1 − π_1j^(0))^{1−x_ij}] γ_1^(0) = (.4)^0(.6)^1 (.3)^0(.7)^1 (.8)^0(.2)^1 (.1)^0(.9)^1 × .8 = .06048,

[Π_{j=1}^{J} (π_2j^(0))^{x_ij} (1 − π_2j^(0))^{1−x_ij}] γ_2^(0) = (.8)^0(.2)^1 (.7)^0(.3)^1 (.6)^0(.4)^1 (.5)^0(.5)^1 × .2 = .0024,
and then

s_i1^(0) = [Π_{j=1}^{J} (π_1j^(0))^{x_ij} (1 − π_1j^(0))^{1−x_ij}] γ_1^(0) / ( [Π_{j=1}^{J} (π_2j^(0))^{x_ij} (1 − π_2j^(0))^{1−x_ij}] γ_2^(0) + [Π_{j=1}^{J} (π_1j^(0))^{x_ij} (1 − π_1j^(0))^{1−x_ij}] γ_1^(0) )
       = .06048 / (.0024 + .06048) ≈ .96

and

s_i2^(0) = [Π_{j=1}^{J} (π_2j^(0))^{x_ij} (1 − π_2j^(0))^{1−x_ij}] γ_2^(0) / ( [Π_{j=1}^{J} (π_2j^(0))^{x_ij} (1 − π_2j^(0))^{1−x_ij}] γ_2^(0) + [Π_{j=1}^{J} (π_1j^(0))^{x_ij} (1 − π_1j^(0))^{1−x_ij}] γ_1^(0) )
       = .0024 / (.0024 + .06048) ≈ .04.

We then sample a value for the latent variable from (13.18) using these values to
define the distribution,

θ_i^(1) | γ^(0), π^(0), x_i ~ Categorical(.96, .04).

This process is repeated for each examinee, yielding a value of 1 or 2 for each
examinee.
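The step-1 arithmetic can be checked directly. The short Python sketch below is illustrative; the π^(0) entries are the initial values consistent with the products .06048 and .0024 reported above.

```python
# Checking the first Gibbs step for the response vector (0, 0, 0, 0).
pi0 = [[.4, .3, .8, .1],   # class 1: pi_11, pi_12, pi_13, pi_14 initial values
       [.8, .7, .6, .5]]   # class 2: pi_21, pi_22, pi_23, pi_24 initial values
gamma0 = [.8, .2]          # initial class proportions
x = [0, 0, 0, 0]

num = []
for c in range(2):
    val = gamma0[c]
    for j in range(4):
        val *= pi0[c][j] if x[j] == 1 else 1 - pi0[c][j]
    num.append(val)

s = [v / sum(num) for v in num]
print(num, [round(si, 2) for si in s])  # products ≈ .06048 and .0024; s ≈ (.96, .04)
```

The normalized probabilities reproduce the Categorical(.96, .04) distribution used to sample the first examinee's latent variable.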
2. Sample the parameters for the latent variable distribution. Count the number
of examinees in class 2 in the just-sampled values of the latent variables,
θ^(1) = (θ_1^(1), …, θ_n^(1)). This frequency is r_2^(1), which in the current case was 17. Compute
r_1^(1) = n − r_2^(1) = 319 − 17 = 302 and sample a value for the latent class proportions
from (13.19),

γ^(1) | θ^(1) ~ Dirichlet(α_γ1 + 302, α_γ2 + 17).

3. Sample the measurement model parameters. For observable 1, sample a value for π_11 as in (13.29) using r_1^(1) = 302 and r_11^(1) = 27,

π_11^(1) | θ^(1), x_1 ~ Beta(α_1 + 27, β_1 + 302 − 27).
TABLE 13.3
Class- and Observable-Specific Counts from the First Iteration of a Gibbs Sampler
for a 2-Class Latent Class Model for the Cheating Data Example

Class   Observable 1    Observable 2    Observable 3    Observable 4
1       r_11^(1) = 27   r_12^(1) = 30   r_13^(1) = 14   r_14^(1) = 55
2       r_21^(1) = 7    r_22^(1) = 8    r_23^(1) = 7    r_24^(1) = 13
The drawn value was .08. Continuing with observable 1, we sample a value for π_21 as in
(13.29) using r_2^(1) = 17 and r_21^(1) = 7,

π_21^(1) | θ^(1), x_1 ~ Beta(α_2 + 7, β_2 + 17 − 7).

The drawn value was .29. Likewise for observables 2 and 3, we have

π_12^(1) | θ^(1), x_2 ~ Beta(α_1 + 30, β_1 + 302 − 30),  π_22^(1) | θ^(1), x_2 ~ Beta(α_2 + 8, β_2 + 17 − 8),

and

π_13^(1) | θ^(1), x_3 ~ Beta(α_1 + 14, β_1 + 302 − 14),  π_23^(1) | θ^(1), x_3 ~ Beta(α_2 + 7, β_2 + 17 − 7),

yielding drawn values of π_12^(1) = .09, π_22^(1) = .76, π_13^(1) = .05, and π_23^(1) = .41. For observable 4, we
sample a value for π_14 as in (13.36), noting that π_24^(0) = .5,

π_14^(1) | θ^(1), π_24^(0), x_4 ~ Beta_{<.5}(α_1 + 55, β_1 + 302 − 55),

yielding a value π_14^(1) = .22. This value is then used in sampling from the full conditional for
π_24 as in (13.37),

π_24^(1) | θ^(1), π_14^(1), x_4 ~ Beta_{>.22}(14, 5).
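Drawing from a truncated beta full conditional such as Beta_{>.22}(14, 5) can be sketched by rejection sampling with only the standard library. This is an illustrative sketch, not how WinBUGS implements truncation (WinBUGS uses the I(,) construct shown later); rejection is efficient here because Beta(14, 5) places almost no mass below .22, but it can be slow when the truncation region has low probability.

```python
# Sketch: rejection sampling from a truncated beta distribution.
import random

random.seed(1)

def truncated_beta(a, b, lower=0.0, upper=1.0):
    """Redraw from Beta(a, b) until the value falls in (lower, upper)."""
    while True:
        v = random.betavariate(a, b)
        if lower < v < upper:
            return v

draw = truncated_beta(14, 5, lower=0.22)
print(draw > 0.22)  # True by construction
```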
13.4.3 WinBUGS
The WinBUGS code and three sets of initial values are given as follows.
--------------------------------------------------------------------------
#########################################################################
# Model Syntax
#########################################################################
model{
#########################################################################
# Conditional probability of the observables via a latent class model
#########################################################################
for (i in 1:n){
for(j in 1:J){
x[i,j] ~ dbern(pi[theta[i],j])
}
}
#########################################################################
# Prior distribution for the latent variables
#########################################################################
for(i in 1:n){
theta[i] ~ dcat(gamma[])
}
#########################################################################
# Prior distribution for the parameters that govern the distribution
# of the latent variables
#########################################################################
gamma[1:C] ~ ddirch(alpha_gamma[])
for(c in 1:C){
alpha_gamma[c] <- 1
}
#########################################################################
# Prior distribution for the measurement model parameters
#########################################################################
for(c in 1:C){
for(j in 1:(J-1)){
pi[c,j] ~ dbeta(1,1)
}
}
# Constrained priors for the last observable to resolve the indeterminacy
pi[1,J] ~ dbeta(1,1)I(,pi[2,J])
pi[2,J] ~ dbeta(1,1)I(pi[1,J],)
}
#########################################################################
# Initial values for three chains
#########################################################################
list(gamma=c(.9, .1),
pi= structure(.Data= c(
.37, .20, .06, .04,
.41, .47, .32, .19)
, .Dim=c(2, 4)))
list(gamma=c(.1, .9),
pi= structure(.Data= c(
.58, .62, .69, .77,
.81, .84, .90, .88)
, .Dim=c(2, 4)))
list(gamma=c(.5, .5),
pi= structure(.Data= c(
.32, .49, .29, .61,
.48, .54, .44, .70)
, .Dim=c(2, 4)))
-------------------------------------------------------------------------
The final lines of the model express the prior distribution for the conditional probabilities
of a value of 1 for the last observable. The use of I(pi[1,J], ) in the last line ensures
that π_24 > π_14. Though conceptually this is sufficient, WinBUGS also requires the construction
in the previous line, where I( ,pi[2,J]) ensures that π_14 < π_24.
The model was fit in WinBUGS using three chains from dispersed starting values for
the parameters listed above in the code, using WinBUGS to generate values for the latent
variables. There was evidence of fast convergence (see Section 5.7.2). To be conservative, for
each chain we discarded the first 1,000 iterations and ran an additional 20,000 iterations
for use in inference.
The marginal posterior densities for the conditional probabilities and latent class proportions
are plotted in Figure 13.3 and numerically summarized in Table 13.4. The summaries
reported here suggest that, for all the observables, the conditional probability of a
value of 1 given the examinee is in latent class 2 is higher than the conditional probability in
latent class 1 (i.e., π_2j > π_1j for all j). This was imposed via the prior distribution for the last
observable, but not the others. This lends credence to the desired interpretation of class 2
FIGURE 13.3
Marginal posterior densities for the 2-class model for the academic cheating example, where γ_c is the proportion
of examinees in class c and π_cj is the conditional probability of observing a value of 1 on observable j given the
examinee is a member of class c.
TABLE 13.4
Summary of the Posterior Distribution for a 2-Class Latent Class Model for the Academic
Cheating Data

Parameter   Median   Standard Deviation   95% Highest Posterior Density Interval
π_11        0.03     0.02                 (<.01, 0.06)
π_21        0.61     0.14                 (0.36, 0.90)
π_12        0.04     0.02                 (<.01, 0.08)
π_22        0.63     0.13                 (0.40, 0.90)
π_13        0.04     0.01                 (0.01, 0.07)
π_23        0.24     0.08                 (0.10, 0.41)
π_14        0.19     0.03                 (0.14, 0.24)
π_24        0.40     0.09                 (0.23, 0.58)
γ_1         0.86     0.04                 (0.77, 0.94)
γ_2         0.14     0.04                 (0.06, 0.23)
as those who are more inclined to cheat. The marginal posterior for γ_2 indicates that the
proportion of students in the cheater class is around .14, and we are 95% certain that the
proportion is between .06 and .23. Table 13.5 reports the marginal posterior probability of
membership in class 2 for each response vector under the column labeled Full Posterior.
Positive responses to the first two observables are more strongly indicative of membership
in class 2 (i.e., the cheating class). A positive response to either yields a posterior probability
of membership in latent class 2 that exceeds .5.

TABLE 13.5
Posterior Probability of Membership in Class 2 (the Nominally Labeled Cheaters Class) for
Each Response Vector in the Cheating Data Example, Computed in Two Ways: p(θ = 2 | x)
where P_ij = E(x_ij | θ_i, π_j) is the probability of x_ij = 1. In the current context, P_ij is equal to π_cj for
examinees for whom θ_i = c (i.e., who are members of class c). A discrepancy measure targeting
OF for observable j is then given by

OF_j(x_j, θ, π_j) = [ (1/n) Σ_{i=1}^{n} V_ij(x_ij, θ_i, π_j)² ]^{1/2},  (13.39)

which is a root mean square error taken with respect to the values for the observable,
where x_j = (x_1j, …, x_nj) is the collection of observed values for observable j from the n examinees.
Taking the root mean square with respect to each examinee yields a discrepancy
measure for PF. For examinee i,

PF_i(x_i, θ_i, π) = [ (1/J) Σ_{j=1}^{J} V_ij(x_ij, θ_i, π_j)² ]^{1/2},  (13.40)

where x_i = (x_i1, …, x_iJ) is the collection of J observed values from examinee i.
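The two root mean squares in (13.39) and (13.40) differ only in the direction of averaging, over examinees versus over observables. The sketch below is illustrative; the residual matrix V is made up, standing in for whatever V_ij terms the analysis defines.

```python
# Sketch of (13.39) and (13.40): root mean squares of generic residual terms
# V_ij, over examinees (OF, per observable) and over observables (PF, per
# examinee). The V matrix here is hypothetical: n = 3 examinees, J = 4 observables.
V = [
    [0.5, -0.2, 0.1, 0.0],
    [-0.3, 0.4, 0.0, 0.2],
    [0.1, 0.1, -0.5, 0.3],
]
n, J = len(V), len(V[0])

def of_j(j):
    """(13.39): RMS of V_ij over the n examinees for observable j."""
    return (sum(V[i][j] ** 2 for i in range(n)) / n) ** 0.5

def pf_i(i):
    """(13.40): RMS of V_ij over the J observables for examinee i."""
    return (sum(V[i][j] ** 2 for j in range(J)) / J) ** 0.5

print([round(of_j(j), 3) for j in range(J)])  # one OF value per observable
print([round(pf_i(i), 3) for i in range(n)])  # one PF value per examinee
```

In a PPMC analysis these would be computed for realized and posterior-predicted data at each MCMC iteration, with ppost summarizing the comparison.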
We conducted PPMC employing (13.39) and (13.40), using 15,000 iterations from MCMC
(5,000 from each of the three chains). We computed ppost as a numerical summary of the
results. The values of ppost for the analyses of (13.39) for the four observables were .54, .54, .53,
and .49, suggesting that the model performs well in terms of accounting for the univariate
distributions of the observables.
Turning to examinees, we confine our discussion here to a select set of results. Examinees
with a response pattern of (0,0,0,0) had ppost values of .29, which is evidence of adequate fit.
This is unsurprising as this was the most common response pattern in the data and there-
fore the posterior is attuned to this pattern. Examinees with a response pattern of (1,1,0,0)
had ppost values of .68, which is evidence of adequate fit. This result is sensible in that it conforms
to the model's implication that the first two items are endorsed more often than the
others. Examinees with a response pattern of (0,0,1,1) had ppost values of .03, which were the
smallest in the data. This is evidence of a lack of fit for these examinees. This reflects that
the model's implications are that the items forming the last two observables are endorsed
rarely, and with less frequency than the first two items. The examinees in question do not
behave in accordance with these implications, resulting in poorer PF. This lack of PF may
undermine the inferences for the examinees in question. Generally, poor PF suggests that
the inferential argument expressed in the psychometric model may be dubious, at least for
the examinees in question.
Note however that in the current case, the structure of the model does not imply that
having a value of 1 on the last two observables must be accompanied by having a value of
1 on the first two, as would be in the case of latent class models for deterministic versions
of Guttman scaling (Dayton & Macready, 1976; Guttman, 1947; Proctor, 1970). This model
allows for more patterns, though the results of fitting the model suggest that a pattern of
(0,0,1,1) should occur rarely. And that is indeed the case, as only 5 of the 319 examinees
exhibited this pattern. The results from PPMC for these examinees amount to a red flag
indicating that additional considerations may be needed to understand the viability of the
model or alternative explanations for these examinees.
This being positive indicates that the 2-class model is expected to yield better predictions
for a new observation than the 1-class model. Following (10.27), the proportional improve-
ment obtained by using the 2-class model is, rounding to two decimal places,
This indicates that the modeling of two separate classes improves prediction a bit, a little
under 10%. Under Gilula and Haberman's (2001) suggestion of thinking of this index as
analogous to a proportion of variance explained, the improvement seems worthwhile.
13.4.6 Comparing the Use of the Full Posterior with Point Summaries in Scoring
It is instructive to consider how the inferences for examinees would be affected by using
point values for the γ and π parameters rather than the full posterior distribution. In conventional
approaches to LCA, (frequentist) point estimates of γ and π are obtained first
and then used to obtain the posterior distribution for latent class membership. In the present
example, we compare the use of the full posterior with an approach that uses the
p(θ_i | x_i) = ∫∫ p(θ_i | x_i, γ, π) p(γ, π | x) dγ dπ,

which will be more diffuse to the extent that there is variability in γ and π. Conceptually,
by using the full posterior distribution for γ and π, our uncertainty in those parameters
is acknowledged and induces some additional uncertainty about the examinees. To the
extent that there is considerable uncertainty about γ and π, ignoring that uncertainty by
the use of a point value (e.g., a point summary of the posterior, or an MLE) may be deleterious
in inferences regarding examinees.
FIGURE 13.4
History plots and density plots for parameters for the 2-class model without constraints to resolve the indeterminacy
for the academic cheating example, including the proportion of examinees in class 1 (γ_1), the proportion
of examinees in class 2 (γ_2), the conditional probability of observing a value of 1 on the first observable given
membership in class 1 (π_11), and the conditional probability of observing a value of 1 on the first observable
given membership in class 2 (π_21).
The densities in Figure 13.4 are indeed representative of the posterior distribution. The
multimodality is a manifestation of the indeterminacy associated with the discrete latent
variable, namely that the roles of the classes can be interchanged: the latent class proportion
and conditional probabilities associated with one class could be associated with
another class (and vice versa) and still reproduce the data equally well. The multimodality
in the posterior exists because of the multimodality in the likelihood (as discussed
in Section 13.1.2, the indeterminacy and associated label switching pertain to frequentist
estimation as well), and the information in the prior is not strong enough to steer the
posterior toward one orientation as opposed to the other. As the prior is quite diffuse, the
posterior closely resembles the likelihood. This example illustrates how MCMC may be
a useful tool for exploring multimodality and the shape of the likelihood more broadly.
case of polytomous observables, in which case π_CjK > … > π_1jK labels the classes in order corresponding
to increases in the conditional probabilities of the highest value of a particular
observable (j). Any unique ordering would be sufficient (e.g., ordering on another observable,
or on another response category).
Another option involves a specification of the latent class proportions. Setting γ_1 < … < γ_C
renders the latent classes to be labeled according to their size. Again, any unique ordering
would be sufficient (e.g., reverse order of their size, γ_1 > … > γ_C).
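The size-ordering idea can also be applied after the fact, by relabeling posterior draws so every draw uses the same class ordering. The sketch below is illustrative; the draws are hypothetical, with the second draw exhibiting switched labels.

```python
# Sketch: relabel each posterior draw so that gamma_1 < ... < gamma_C,
# carrying the matching rows of conditional probabilities along.
draws = [
    {"gamma": [0.85, 0.15], "pi": [[0.04, 0.05], [0.60, 0.62]]},
    {"gamma": [0.14, 0.86], "pi": [[0.61, 0.63], [0.03, 0.04]]},  # switched labels
]

def relabel_by_size(draw):
    """Reorder classes within one draw by increasing class proportion."""
    order = sorted(range(len(draw["gamma"])), key=lambda c: draw["gamma"][c])
    return {"gamma": [draw["gamma"][c] for c in order],
            "pi": [draw["pi"][c] for c in order]}

relabeled = [relabel_by_size(d) for d in draws]
print([d["gamma"] for d in relabeled])  # every draw now ordered smallest-first
```

After relabeling, marginal posterior summaries can be computed class by class without mixing the two labelings.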
Still another option involves assigning one or more examinees to each of the classes. For
computational and interpretative ease, we may select examinees with the most extreme
response patterns. In the case of the cheating data example, we may take one or more
examinees with a response pattern of (0,0,0,0) and assign them to a particular class, say,
class 1, and take one or more examinees with a response pattern of (1,1,1,1) and assign
them to the other particular class, here class 2. This strategy is similar to setting the val-
ues of the latent variables for certain examinees to resolve the indeterminacies associated
with continuous latent variables in IRT and CFA. This can be accomplished by capitalizing
on computational strategies that treat latent variables as missing data. In WinBUGS, we
would supply values of the latent variable as part of the data statement, with numerical
values for those examinees assigned to classes, and NA for the remaining examinees.
This approach suffers from certain drawbacks. First, assigning examinees to classes
is a fairly strong assumption. On the other hand, when working with a 2-class model for
the cheating data, it stands to reason that examinees with the response patterns of (0,0,0,0)
and (1,1,1,1) ought to be in different classes; any other account would strain credulity.
Still, this approach is limited in that posterior inference and expressions of posterior
uncertainty are not available for those examinees assigned to classes; they are treated as
belonging to the classes with certainty.
All of the strategies just described may also be adopted in frequentist approaches to
LCA modeling and estimation. A Bayesian approach offers an additional variation on each
of these in the form of alternative prior distributions. Instead of imposing a constraint such as π_24 > π_14, we may express the desire to have the second class have the higher conditional probability of a value of 1 on the last item via the hyperparameters of the prior distribution. For example, specifying π_14 ~ Beta(2, 5) and π_24 ~ Beta(5, 2) would orient the posterior toward a class labeling where the second class has higher probabilities of a positive response to the last item. But it does not constrain the posterior to honor this notion.
An example in the next chapter illustrates this approach. Similarly, instead of constraining
the latent class proportions to be ordered in size, the prior distribution may be specified
using hyperparameters to orient the posterior toward a particular class labeling. Likewise,
instead of assigning examinees to particular classes with certainty, we may specify a dif-
ferent prior distribution that orients the solution as such.
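The difference between a hard constraint and a prior that merely orients the posterior can be seen by simulating from the priors described above (an illustrative Python sketch; the seed and number of draws are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed
n_draws = 100_000

# Priors intended to orient the labeling so that class 2 has the higher
# conditional probability of a positive response on the last item
pi_14 = rng.beta(2, 5, n_draws)  # class 1, item 4
pi_24 = rng.beta(5, 2, n_draws)  # class 2, item 4

# The ordering pi_24 > pi_14 is probable a priori, but not guaranteed --
# in contrast to the hard constraint, which imposes it with certainty
prob_ordered = (pi_24 > pi_14).mean()
print(f"P(pi_24 > pi_14) under the priors: {prob_ordered:.2f}")
```

The simulated probability is high but strictly less than 1, which is exactly the sense in which these priors orient, rather than constrain, the labeling.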
Each approach may in fact be fairly weak in supplying information to resolve the indeterminacy if the model does not accord with the imposed constraints. For example, if two classes have comparable sizes, the strategy of ordering them by class size may not perform well in resolving the indeterminacy when fitting the model. This applies even more so to prior distributions that orient, but do not constrain, the posterior toward just one labeling scheme.
Another strategy involves analyzing a model without the constraints needed to resolve the indeterminacy and then relabeling based on inspection of the results. In our cheating data example, we may examine the trace plots in Figure 13.4, recognize the situation for what it is, and simply relabel the classes in one chain to accord with those of another. This would address the situation here, where multiple chains are exploring different regions
of the (multimodal) posterior. Things are more complicated when label switching occurs within a chain, as is possible and indeed guaranteed to happen for an infinitely long chain. This calls for more complicated algorithms for postprocessing or relabeling each successive iteration from MCMC (Rodríguez & Walker, 2014; Stephens, 2000).
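For the simpler, between-chain case, post-hoc relabeling can be sketched as follows (illustrative Python with assumed array layouts; within-chain switching calls for the cited algorithms):

```python
import numpy as np

def relabel_by_class_size(gamma_draws, pi_draws):
    """Relabel each MCMC iteration so classes are ordered by size.

    gamma_draws: (iterations, C) array of class-proportion draws.
    pi_draws:    (iterations, C, J) array of conditional-probability draws.
    Returns relabeled copies in which the class proportions are
    nondecreasing within each iteration, with the same permutation
    applied to the measurement model draws.
    """
    gamma_out = np.empty_like(gamma_draws)
    pi_out = np.empty_like(pi_draws)
    for t in range(gamma_draws.shape[0]):
        perm = np.argsort(gamma_draws[t])  # order classes by size
        gamma_out[t] = gamma_draws[t][perm]
        pi_out[t] = pi_draws[t][perm]      # permute the class dimension
    return gamma_out, pi_out
```

Applied chain by chain, this enforces a size-based labeling after the fact; any of the other identification strategies above could supply the permutation instead.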
13.5.3 The Blurred Line between the Prior and the Likelihood
The preceding discussion underscores two larger points. First, echoing themes from our
treatments of CFA and IRT, we can recognize that imposing constraints on the parameters
is itself a type of prior specification. Specifying π_14 ~ Beta(2, 5) and π_24 ~ Beta(5, 2) constitutes a prior under which π_24 is probably larger than π_14. We can control how likely that is by manipulating the forms and/or hyperparameters of these prior distributions. The constraint that π_24 > π_14 may be seen as a limiting case of these prior expectations, in which π_24 is certainly larger than π_14. Second, though the indeterminacy exists in the likelihood, we have been discussing resolutions in terms of prior distributions. Should a specification such as π_24 > π_14 be viewed as a feature of the likelihood, or of the prior? Echoing the
discussions in Sections 3.5, 3.6, and 9.8, one perspective is that such a question is largely
irrelevant. Distinctions between the likelihood and the prior may be useful for certain
purposes, but may be unnecessary, or worse, inhibitive to understanding.
Exercises
13.1 The first step in the Gibbs sampler described in Section 13.2.4 involves sam-
pling the values of the latent variables for examinees. In Section 13.4.2, we showed
the computation for the first iteration for examinees with the response vector of
(0,0,0,0) in the cheating data example. Show the computations for the remaining
response vectors.
13.2 The third step in the Gibbs sampler described in Section 13.2.4 involves sampling
the values for the measurement model parameters for dichotomous observables.
In Section 13.4.2, we showed the computation for the first iteration for π_11 in the
cheating data example. Show the computations for the remaining measurement
model parameters.
13.3 Replicate the analysis of the 2-class model for the cheating data in Section 13.4
using WinBUGS.
a. Verify your results for the measurement model parameters with those reported
here.
b. When fitting the model, compute the probabilities that the conditional prob-
abilities of a positive response are higher in class 2 than class 1 for each observ-
able. (Hint: recall the WinBUGS code in Exercise 6.3).
c. Interpret the probabilities from part (b).
13.4 Conduct an analysis of a 3-class model for the cheating data.
a. Specify a model that yields an ordinal interpretation of the classes in terms of
their proclivity for cheating.
b. Obtain and interpret the posterior distribution for the measurement model
parameters and the latent class proportions.
13.5 The chains for the model in the example in Section 13.5 explored different parts
of the posterior distribution, corresponding to different labels of the classes.
a. Explain why label switching will eventually happen within a chain.
b. Explain how this could occur in the context of a Gibbs sampler, and in the
context of a Metropolis sampler.
14
Bayesian Networks
Any ordering is possible; for example, we might specify a reversal of this ordering and
specify a distribution for Z, the conditional distribution for Y given Z, and so on. Certain
orderings may be preferable according to certain criteria, including interpretability and
We can now characterize a BN as a model that structures the joint distribution. For a set of
variables generically represented as v, a BN structures the joint distribution as
p(v) = ∏_{v∈v} p(v | pa(v)),  (14.3)
where pa(v) are the parents of variable v, namely, the variables on which v directly depends.
If v has no parents, p(v | pa(v)) is taken as the unconditional (marginal) distribution of v.
FIGURE 14.1
Directed acyclic graph for four variables in the model in (14.2).
plays a key role here too: in addition to structuring the joint distribution of a system of
variables, the DAG structures the computations involved in obtaining the posterior distri-
bution (Lauritzen & Spiegelhalter, 1988; Pearl, 1988).
The elements on the right-hand side of (14.4) are given in Tables 2.1 and 2.2, repeated here as Tables 14.1 and 14.2 for p(M | C) and p(C), respectively.
FIGURE 14.2
Directed acyclic graph for the medical diagnosis example, where the mammogram result (M) is modeled as
dependent on cancer status (C).
TABLE 14.1
Conditional Probability of the Mammography Result (M) Given Breast Cancer (C)

                     Mammography Result
Breast Cancer      Positive     Negative
Yes                  .90          .10
No                   .07          .93

TABLE 14.2
Prior Probability of Breast Cancer (C)

Breast Cancer      Probability
Yes                  .008
No                   .992
Once the value of the observable variable M is known, the posterior distribution for C is obtained via Bayes theorem. Revisiting the example in Section 2.2, once it is observed that M = Positive, the posterior probability for a patient's cancer status, rounding to two decimal places, is p(C = Yes | M = Positive) ≈ .09 and p(C = No | M = Positive) ≈ .91.
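These posterior probabilities can be reproduced directly from the tables (a Python sketch of the computation):

```python
# Reproducing the posterior for cancer status given a positive mammogram
# from the entries of Tables 14.1 and 14.2
prior = {"Yes": 0.008, "No": 0.992}          # p(C), Table 14.2
p_pos_given_c = {"Yes": 0.90, "No": 0.07}    # p(M = Positive | C), Table 14.1

# Marginal probability of a positive result: sum over cancer status
p_pos = sum(prior[c] * p_pos_given_c[c] for c in prior)

# Bayes theorem: posterior proportional to prior times conditional
posterior = {c: prior[c] * p_pos_given_c[c] / p_pos for c in prior}
print(round(posterior["Yes"], 2), round(posterior["No"], 2))  # 0.09 0.91
```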
The medical diagnosis model is a simple LCA model with one dichotomous observable. The more general LCA model expressed in Chapter 13 may be seen as expanding the situation to one where there are J possibly polytomous observables modeled as dependent on, and rendered conditionally independent by, a possibly polytomous latent variable. As the variables are all discrete, the LCA model may be seen as a BN. The just-persons DAG for the model is given in Figure 14.3, which is just the structure of the canonical psychometric model characterized in Chapter 7.
Following the structure of the DAG, the joint distribution for any examinee (suppressing
the subscripting by i) is given by
p(x_1, …, x_J, θ) = ∏_{j=1}^{J} p(x_j | θ) p(θ).  (14.5)
FIGURE 14.3
Directed acyclic graph for the latent class model as a Bayesian network.
FIGURE 14.4
Directed acyclic graph for a Bayesian network with three latent variables, illustrating factorially simple and
factorially complex observables, and dependence among multiple latent variables.
for the multiple latent variables. In the current context with multiple discrete latent variables, we model the associations among the latent variables by recursively specifying distributions for the latent variables in a univariate fashion, conditioning on the previously specified latent variables to reflect the dependence. As expressed by the directed edges in Figure 14.4, we will specify a distribution for θ_1, a distribution for θ_2 conditional on its parent, θ_1, and a distribution for θ_3 conditional on its parents, θ_1 and θ_2. As we will see, the forms of the conditional distributions for latent variables mimic those of the conditional distributions for the observables.*
Turning to the observables in Figure 14.4, the first three observables (x_1, x_2, x_3) and the last three observables (x_6, x_7, x_8) are modeled as factorially simple, each having one latent parent variable, and may be modeled like the (unidimensional) LCA models in Chapter 13. Observables x_4 and x_5 are modeled as factorially complex, having both θ_1 and θ_2 as parents. For each combination of the parent variables, there is a conditional distribution for the possible values of the observable. Table 14.3 illustrates this, with a conditional distribution for a dichotomous observable coded as 0 and 1 that depends on two dichotomous latent variables, each coded as taking on values of 1 or 2. The directed edges connecting the latent variables indicate a dependence structure. As all the latent variables are discrete, the dependence may also be expressed in terms of conditional probability tables.
TABLE 14.3
Conditional Probability Table for Observable j That Depends on Two Latent Variables, Where π_(ab)jk Is the Probability of Observing a Value of k for Observable j When θ_1 = a and θ_2 = b

                        p(x_j | θ_1, θ_2)
θ_1   θ_2            0                        1
1     1     π_(11)j0 = 1 − π_(11)j1      π_(11)j1
1     2     π_(12)j0 = 1 − π_(12)j1      π_(12)j1
2     1     π_(21)j0 = 1 − π_(21)j1      π_(21)j1
2     2     π_(22)j0 = 1 − π_(22)j1      π_(22)j1
* We note that the same approach may be taken with continuous latent variables, where the associations among the latent variables are modeled with direct effects. As is the case here, not every latent variable needs to be associated with every other latent variable. This approach makes the model a member of the larger family of structural equation models. See Levy and Choi (2013) for a treatment of Bayesian structural equation modeling in line with the perspectives adopted here. For additional treatments, see Dunson, Palomo, and Bollen (2005); Kaplan and Depaoli (2012); and Lee (2007).
p(θ, π, γ | x) ∝ p(x | θ, π, γ) p(θ, π, γ) = ∏_{i=1}^{n} ∏_{j=1}^{J} p(x_ij | θ_i, π_j) ∏_{i=1}^{n} p(θ_i | γ) ∏_{j=1}^{J} p(π_j) p(γ),  (14.7)
where p(x_ij | θ_i, π_j) is the conditional probability distribution for the value of observable j for examinee i. As in the LCA model, this is given by the measurement model parameters associated with observable j and examinee i's values for the latent variables, and p(π_j) is the prior distribution for the measurement model parameters. Next, p(θ_i | γ) = p(θ_i1 | γ_1) p(θ_i2 | θ_i1, γ_2) p(θ_i3 | θ_i1, θ_i2, γ_3) is the prior distribution for the multiple latent variables for examinee i, and p(γ) is the prior distribution for the parameters that govern the distribution of the latent variables.
* This is the semantically confusing circumstance in which the conditional probabilities in a Bayesian network
are estimated via frequentist methods.
This differs from the posterior distribution for the LCA model in (13.13) in two respects.
First, we now have multiple latent variables for each examinee. Second, as a result of the
possibly different patterns of dependence of the observables on the latent variables, the prior
distribution for the measurement model parameters for each observable j is not further fac-
tored. In any particular application, common dependence structures and exchangeability
assumptions may support further simplifications of the model in (14.7). As the forms of the
distributions presented here are the same as those in LCA, namely categorical (or Bernoulli)
distributions for latent and observable variables and Dirichlet (or beta) prior distributions
for the probabilities that govern those distributions, the full conditional distributions may be
derived in a straightforward extension of those for LCA in Chapter 13 (see also Appendix A).
The development so far has characterized BNs in a fairly open manner. Commonly, BNs
are specified in ways that may be seen as constrained versions of these models. Such con-
straints may be employed for a variety of reasons including resolving indeterminacies in latent
variables as discussed in Chapter 13, specifying the model to reflect substantive theory, and
reducing the parameterization. The latter issue frequently arises in BNs when variables have
many possible categories and/or many parent variables that result in many possible combi-
nations of the values of the parent variables. Letting Kj denote the number of categories for
observable j and letting Cm denote the number of categories for latent variable m, then assum-
ing the observables are conditionally (locally) independent given the latent variables, there are
∑_{j=1}^{J} (K_j − 1) ∏_{m: θ_m ∈ pa(x_j)} C_m
conditional probabilities altogether for the observables. In these cases, the conditional probability tables become large, increasing the number of entities in need of specification and estimation. This can be difficult if the data are sparse with respect to the parameters. Impositions made to reduce the parameterization should be in line with substantive theory. Of course, constraints motivated by substantive theory often yield a reduced parameterization. In the rest of this chapter, we illustrate these ideas in discussing three examples of BNs as psychometric models.
and perhaps versions of, a number of different types of models (Rupp et al., 2010). BNs may be seen as enacting diagnostic measurement akin to DCMs (Almond, DiBello, Moulder, & Zapata-Rivera, 2007; Rupp et al., 2010). Further, many DCMs may be cast as BNs that model the conditional probabilities in the BN via parameters that both express the theory of cognition and simplify the conditional probabilities that define the BN.
Tatsuoka (1984) reported on analyses of student responses to test items designed to illuminate which method they were using, and characterized the sorts of errors they were making with the various procedures. Mislevy (1995) described BN models for middle school students' performances on mixed-number subtraction tasks based on Tatsuoka's (1987, 1990) data and cognitive analyses. Model-fitting strategies and model-data fit analyses for portions of this model that we describe here have additionally been discussed by Almond et al. (2015); Mislevy, Almond, Yan, and Steinberg (1999); Sinharay and Almond (2007); Yan, Almond, and Mislevy (2004); and Yan et al. (2003), and we draw from their work in the current presentation. Related DCM and cognitive diagnostic modeling of data from these tasks have been investigated by de la Torre and Douglas (2004, 2008); DeCarlo (2011, 2012); Henson, Templin, and Willse (2009); and Tatsuoka (2002).
We confine ourselves to a model for Method B and observables associated with 15 items for which it is not necessary to find a common denominator; extensions to this model are briefly discussed in Section 14.4.8. Following Mislevy (1995), the BN model involves linking the observables to five latent variables representing the components of executing Method B for these tasks. Following the parlance of DCMs for these sorts of situations, we refer to these latent variables as the attributes or skills:
We specify a dichotomous latent variable for each skill, intended to capture our beliefs about whether the examinee possesses or does not possess each skill. Inference centers on diagnosing which skills are possessed by which examinees. In principle, there are 2^5 = 32 possible skill profiles defined by combinations of the values of the five dichotomous latent variables.
Picking up on the final point in Section 14.3, even with the assumption that the observables are conditionally (locally) independent given the latent variables, there are 511 (conditional) probabilities in need of specification: 31 are needed to model the joint distribution of the five latent variables;* and 480 are needed to model the dependence of the observables on the latent variables (as for each of 15 observables there is a conditional probability table with 32 conditional probabilities, one for each skill profile). Happily, substantive theory and cognitive analyses of examinee responses to these types of tasks (Klein et al., 1981; Tatsuoka, 1984, 1987, 1990, 2009) imply several simplifications in both portions of the model, as described in the next two subsections.
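The count of 511 follows from the general formula in the preceding section, as a quick computation confirms (illustrative Python):

```python
# Parameter count for the unconstrained BN: five dichotomous skills,
# 15 dichotomous observables each depending on all five skills
n_skills = 5
n_items = 15
K = 2  # response categories per observable

latent = 2 ** n_skills - 1            # joint distribution of skill profiles
per_item = (K - 1) * 2 ** n_skills    # one probability per parent combination
observables = n_items * per_item

print(latent, observables, latent + observables)  # 31 480 511
```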
FIGURE 14.5
Directed acyclic graph fragment for the latent variables in the mixed-number subtraction example.
* These may be formulated in a number of ways. One way is to think of the 31 being needed to uniquely define
the joint distribution of the 32 skill profiles, the last probability being constrained by the requirement that they
sum to 1.
TABLE 14.4
Probability Tables for θ_1 and θ_2 in the Mixed-Number Subtraction Example, Where γ_1 Is the Probability That θ_1 = 1 and γ_2c Is the Probability That θ_2 = 1 Given the Value θ_1 = c

                          p(θ_2 | θ_1)
θ_1    Probability       0           1
0      1 − γ_1       1 − γ_20      γ_20
1      γ_1           1 − γ_21      γ_21
TABLE 14.5
Conditional Probability Table for θ_5 in the Mixed-Number Subtraction Example, Where γ_5z Is the Probability That θ_5 = 1 Given the Sum of θ_1 and θ_2, Denoted by z

                     p(θ_5 | θ_1, θ_2)
θ_1    θ_2    z         0           1
0      0      0     1 − γ_50      γ_50
0      1      1     1 − γ_51      γ_51
1      0      1     1 − γ_51      γ_51
1      1      2     1 − γ_52      γ_52
In contrast, the model hypothesizes a hard prerequisite relationship between Skill 3 and Skill 4. Examinees must possess Skill 3 to possess Skill 4; not possessing Skill 3 implies not possessing Skill 4. The implication is that of the four logically possible combinations of values of θ_3 and θ_4, only three are deemed possible in the model; the situation where θ_3 = 0 and θ_4 = 1 is excluded. We could specify this portion of the model in terms of a conditional distribution for θ_3 given θ_1, θ_2, and θ_5, and then a conditional distribution for θ_4 additionally conditioning on θ_3 = 1 (see Exercise 14.2). Following Mislevy et al. (1999) and Almond et al. (2015), we take an alternative approach and introduce an intermediary polytomous variable that enacts the hard prerequisite relationship. This is expressed in the DAG as θ_MN, standing for mixed number skills, and represents the count of Skill 3 and Skill 4 possessed by an examinee. This is specified as a categorical variable taking on values of 0, 1, or 2 corresponding to possessing neither, one, or both of Skill 3 and Skill 4. As we did with θ_5, we adopt a simplification based on specifying a conditional distribution given the count of the parent skills the examinee possesses, with the count taking on values of 0, 1, 2, or 3 corresponding to whether the examinee possesses none, one, two, or all three of its parents, Skill 1, Skill 2, and Skill 5. Formally, we have

p(θ_MN | θ_1, θ_2, θ_5) = p(θ_MN | θ_1 + θ_2 + θ_5).  (14.12)
Table 14.6 lays out the conditional probability structure for θ_MN. The restriction in (14.12) implies that there are in principle 16 free elements in Table 14.6. The choice to model the conditional distribution of θ_MN given the number of preceding skills possessed reduces this to eight parameters, and reflects the substantive belief that possessing Skill 3 and Skill 4 depends on how many of Skill 1, Skill 2, and Skill 5 the examinee possesses, not which ones.
TABLE 14.6
Conditional Probability Table for θ_MN in the Mixed-Number Subtraction Example, Where γ_MN,z,a Is the Probability That θ_MN = a Given the Sum of θ_1, θ_2, and θ_5, Denoted by z

                       p(θ_MN | θ_1, θ_2, θ_5)
θ_1   θ_2   θ_5   z       0           1           2
0     0     0     0    γ_MN,0,0    γ_MN,0,1    γ_MN,0,2
0     0     1     1    γ_MN,1,0    γ_MN,1,1    γ_MN,1,2
0     1     0     1    γ_MN,1,0    γ_MN,1,1    γ_MN,1,2
0     1     1     2    γ_MN,2,0    γ_MN,2,1    γ_MN,2,2
1     0     0     1    γ_MN,1,0    γ_MN,1,1    γ_MN,1,2
1     0     1     2    γ_MN,2,0    γ_MN,2,1    γ_MN,2,2
1     1     0     2    γ_MN,2,0    γ_MN,2,1    γ_MN,2,2
1     1     1     3    γ_MN,3,0    γ_MN,3,1    γ_MN,3,2
So far, this specifies that the number of mixed number skills (i.e., Skill 3 and Skill 4) an examinee possesses depends on possessing the other skills, but this does not encode the prerequisite relationship between Skill 3 and Skill 4. To accomplish this, we include the following deterministic dependence structures for θ_3 and θ_4 given θ_MN:

θ_i3 | θ_iMN = 0 if θ_iMN = 0; 1 if θ_iMN = 1 or 2,  (14.13)

and

θ_i4 | θ_iMN = 0 if θ_iMN = 0 or 1; 1 if θ_iMN = 2.  (14.14)
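These deterministic structures amount to a small lookup, which also makes plain why the profile with Skill 4 but not Skill 3 cannot arise (illustrative Python):

```python
def skills_from_mn(theta_mn):
    """Deterministic mappings (14.13)-(14.14): recover (theta_3, theta_4)
    from theta_MN, the count of mixed-number skills possessed (0, 1, or 2)."""
    theta_3 = 1 if theta_mn >= 1 else 0
    theta_4 = 1 if theta_mn == 2 else 0
    return theta_3, theta_4

# The excluded profile theta_3 = 0, theta_4 = 1 can never arise:
print([skills_from_mn(z) for z in (0, 1, 2)])  # [(0, 0), (1, 0), (1, 1)]
```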
Tying this all together, the distribution for the examinees' latent variables is

p(θ | γ) = ∏_{i=1}^{n} p(θ_i | γ)
         = ∏_{i=1}^{n} p(θ_i1 | γ_1) p(θ_i2 | θ_i1, γ_2) p(θ_i5 | θ_i1, θ_i2, γ_5) p(θ_iMN | θ_i1, θ_i2, θ_i5, γ_MN) p(θ_i3 | θ_iMN) p(θ_i4 | θ_iMN),  (14.15)

where γ = (γ_1, γ_2, γ_5, γ_MN).
* As the Q-matrix captures which observables depend on which latent variables, it is similar to the matrix of
loadings in CFA and the matrix of discrimination parameters in compensatory MIRT.
TABLE 14.7
Q-matrix and Examinee Records for Two Examinees for the Mixed-Number Subtraction Example

                                  Skill            Examinee
Item   Text                  1  2  3  4  5        527    171
6      6/7 − 4/7             X                     1      0
8      2/3 − 2/3             X                     1      0
12     11/8 − 1/8            X  X                  1      1
9      3 7/8 − 2             X     X               1      0
14     3 4/5 − 3 2/5         X     X               1      0
16     4 5/7 − 1 4/7         X     X               1      1
4      3 1/2 − 2 3/2         X     X  X            1      0
11     4 1/3 − 2 4/3         X     X  X            0      0
17     7 3/5 − 4/5           X     X  X            0      1
18     4 1/10 − 2 8/10       X     X  X            0      1
20     4 1/3 − 1 5/3         X     X  X            0      1
7      3 − 2 1/5             X     X  X  X         0      0
15     2 − 1/3               X     X  X  X         1      0
19     4 − 1 4/3             X     X  X  X         0      0
10     4 4/12 − 2 7/12       X  X  X  X            0      0

Source: A portion of this is based on Table 4 of Probability-Based Inference in Cognitive Diagnosis by Robert J. Mislevy, published in 1994 by ETS as Research Report RR-94-3-ONR, and is used with permission from ETS.
possess all of the relevant skills should not successfully complete the task. This is an example of what may be called a conjunctive condensation rule (Klein et al., 1981; Rupp et al., 2010). The result of the conjunction of the relevant latent variables for observable j can be expressed as

δ_ij = ∏_{m=1}^{M} θ_im^{q_jm}
     = 1 if examinee i has mastered all of the skills required for task j, and 0 otherwise.  (14.16)
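The conjunction can be computed directly from a skill profile and a row of the Q-matrix; the profile and rows below are hypothetical, chosen only for illustration (Python):

```python
import numpy as np

def delta(theta_i, q_j):
    """Conjunctive condensation (14.16): equals 1 only if examinee i
    possesses every skill the Q-matrix row flags for item j.
    Relies on the convention that 0**0 == 1 for unrequired skills."""
    theta_i, q_j = np.asarray(theta_i), np.asarray(q_j)
    return int(np.prod(theta_i ** q_j))

# Hypothetical skill profile: possesses every skill except Skill 2
theta = [1, 0, 1, 1, 1]
print(delta(theta, [1, 0, 1, 0, 0]))  # 1: Skills 1 and 3 both possessed
print(delta(theta, [1, 1, 0, 0, 0]))  # 0: Skill 2 required but absent
```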
FIGURE 14.6
Two versions of the directed acyclic graph for the mixed-number subtraction example: (a) just-persons version
highlighting which observables depend on which latent variables and (b) an expanded version that highlights
the role of the Q-matrix and the measurement model parameters.
The model then structures the conditional distribution of an observable given the latent variables in terms of just two parameters, π_j1 and π_j0, further effecting a simplification. We note in passing that many DCMs refer to π_j0 as a guessing parameter, as it captures
the probability of a correct response given the examinee does not possess the requisite skills. Such DCMs typically use a parameterization that defines π_j1 = 1 − s_j, where s_j is a slip parameter capturing the probability that an examinee who does have the requisite skills will respond incorrectly.
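The two-parameter structure, and its relation to the slip/guess parameterization, can be sketched as follows (illustrative Python; the slip and guessing values are hypothetical):

```python
def p_correct(delta_ij, pi_j1, pi_j0):
    """Conditional probability of a correct response given delta:
    pi_j1 applies when all requisite skills are possessed (delta = 1),
    pi_j0 otherwise; equivalently pi_j1**delta * pi_j0**(1 - delta)."""
    return pi_j1 ** delta_ij * pi_j0 ** (1 - delta_ij)

# Hypothetical slip (s_j) and guessing values under the DCM
# parameterization pi_j1 = 1 - s_j
s_j, pi_j0 = 0.10, 0.20
print(p_correct(1, 1 - s_j, pi_j0))  # 0.9
print(p_correct(0, 1 - s_j, pi_j0))  # 0.2
```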
Tying this all together, the conditional probability of the observables is given by

p(x | θ, π) = ∏_{i=1}^{n} p(x_i | θ_i, π) = ∏_{i=1}^{n} ∏_{j=1}^{J} p(x_ij | θ_i, π_j),  (14.18)
where p(x_ij | θ_i, π_j) is given by (14.17). Figure 14.6 contains two versions of the DAG for the model that now includes the conditional distribution of the data. The first is a just-persons version that only contains examinee variables. It is distinguished by the presence of multiple nodes for the observables, with plates indicating replication over sets of observables. It highlights which observables depend on which latent variables, essentially communicating the information contained in the Q-matrix. This is masked in the second version, which collapses the multiple nodes for observables and more strongly emphasizes the parent-child relationships in the probability model. This version highlights the role of the Q-matrix, which, in combination with the latent student model variables, dictates whether the examinee has mastered all the requisite skills for a particular observable (δ_ij). It also includes the measurement model parameters, π_j1 and π_j0, which, in combination with the δs, govern the distribution of the observables.
This reflects a belief that the probability that an examinee possesses Skill 1 is akin to having seen 21 − 1 = 20 examinees who have mastered Skill 1 and 6 − 1 = 5 examinees who have not. Put another way, it reflects a belief that the proportion of examinees who have mastered Skill 1 is about .8 (i.e., 20/25), and this is given a weight akin to 25 observations.* Similarly,
we specify
* The choice to weight the prior belief in this way may reflect the expression of confidence by subject matter
experts, analyses of previous data, or a desire to have our prior beliefs influence the posterior in a certain
amount relative to the data at hand.
γ_21 ~ Beta(21, 6),  (14.20)

which reflects a belief that if an examinee possesses Skill 1, the probability that she also possesses Skill 2 is about .8, and this is given a weight akin to 25 observations. Conversely, if a student does not possess Skill 1, we would not expect them to possess Skill 2. This is embodied by specifying

γ_20 ~ Beta(6, 21),  (14.21)
which expresses that the probability is likely around .2, again afforded a weight akin to 25 observations. Turning to the parameters that govern the conditional distribution for Skill 5, a similar line of reasoning leads to expressing that the probability of an examinee possessing Skill 5 is about .8 if she possesses both Skill 1 and Skill 2, about .5 if she possesses one of Skill 1 and Skill 2, and about .2 if she possesses neither Skill 1 nor Skill 2. The following priors encode these beliefs, again affording a weight of 25 observations to each:

γ_50 ~ Beta(6, 21),  (14.22)

γ_51 ~ Beta(13.5, 13.5),  (14.23)

and

γ_52 ~ Beta(21, 6).  (14.24)
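The pseudo-count reading of these Beta hyperparameters can be made concrete (illustrative Python, using Beta(21, 6)):

```python
# Reading Beta(alpha, beta) as prior pseudo-data: alpha - 1 prior
# "successes" and beta - 1 prior "failures", illustrated with the
# Beta(21, 6) prior used for several of the gamma parameters
alpha, beta = 21, 6
pseudo_mastery = alpha - 1        # examinees seen to possess the skill
pseudo_nonmastery = beta - 1      # examinees seen to lack the skill
weight = pseudo_mastery + pseudo_nonmastery
prior_mean = alpha / (alpha + beta)

print(pseudo_mastery, pseudo_nonmastery, weight)  # 20 5 25
print(round(prior_mean, 2))  # 0.78
```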
As θ_MN has three categories, we employ Dirichlet prior distributions for the conditional probabilities. Letting γ_MN,z = (γ_MN,z,0, γ_MN,z,1, γ_MN,z,2) denote the collection of probabilities for the three categories of θ_MN given that the number of Skills 1, 2, and 5 possessed is z, we specify

γ_MN,0 ~ Dirichlet(16, 8, 6),  (14.25)

γ_MN,1 ~ Dirichlet(12, 10, 8),  (14.26)

γ_MN,2 ~ Dirichlet(8, 10, 12),  (14.27)

and

γ_MN,3 ~ Dirichlet(6, 8, 16).  (14.28)

These reflect the belief that the more skills among Skills 1, 2, and 5 that an examinee possesses, the more likely she is to possess Skills 3 and 4. The weight afforded to the prior in each case is akin to having observed these patterns in 27 observations, a slight departure from the weight of 25 afforded to the other parameters, but one that more easily allows for the division into three values.
Turning to the measurement model parameters, for each item j we specify

π_j1 ~ Beta(23.5, 3.5),  (14.29)

and

π_j0 ~ Beta(3.5, 23.5).  (14.30)
These choices reflect the beliefs that examinees who possess the requisite skills are highly
likely to correctly complete the task (probability of about .90) and that examinees who do
not possess all the requisite skills are highly unlikely to correctly complete the task (prob-
ability of about .10), with weights on these beliefs akin to having seen 25 observations of
each situation.
p(θ, γ, π | x) ∝ ∏_{i=1}^{n} ∏_{j=1}^{J} p(x_ij | θ_i, π_j) ∏_{i=1}^{n} p(θ_i | γ) p(γ) ∏_{j=1}^{J} ∏_{c=0}^{1} p(π_jc).  (14.31)
FIGURE 14.7
Directed acyclic graph for the mixed-number subtraction example including hyperparameters.
Note the similarity between the structure of (14.31) and the posterior distribution for the LCA model in (13.13). The key difference is that we now have multiple latent variables for each examinee. This results in a multivariate distribution of the latent variables for each examinee:

p(θ_i | γ) = p(θ_i1 | γ_1) p(θ_i2 | θ_i1, γ_2) p(θ_i5 | θ_i1, θ_i2, γ_5) p(θ_iMN | θ_i1, θ_i2, θ_i5, γ_MN) p(θ_i3 | θ_iMN) p(θ_i4 | θ_iMN).
The model additionally specifies

δ_ij = ∏_{m=1}^{M} θ_im^{q_jm} for i = 1, …, n, j = 1, …, 15,

θ_i3 | θ_iMN = 0 if θ_iMN = 0; 1 if θ_iMN = 1 or 2, for i = 1, …, n,

θ_i4 | θ_iMN = 0 if θ_iMN = 0 or 1; 1 if θ_iMN = 2, for i = 1, …, n,

along with the prior distributions

γ_1 ~ Beta(21, 6),
γ_20 ~ Beta(6, 21),
γ_21 ~ Beta(21, 6),
γ_50 ~ Beta(6, 21),
γ_51 ~ Beta(13.5, 13.5),
γ_52 ~ Beta(21, 6),
γ_MN,0 ~ Dirichlet(16, 8, 6),
γ_MN,1 ~ Dirichlet(12, 10, 8),
γ_MN,2 ~ Dirichlet(8, 10, 12),
γ_MN,3 ~ Dirichlet(6, 8, 16),
π_j1 ~ Beta(23.5, 3.5) for j = 1, …, 15,
and
π_j0 ~ Beta(3.5, 23.5) for j = 1, …, 15.
14.4.6 WinBUGS
WinBUGS code for the model is given as follows, making use of the pow function, which raises the first argument of the function to the power of the second argument (e.g., pow(3,2) corresponds to 3^2).
-------------------------------------------------------------------------
#########################################################################
# Model Syntax
#########################################################################
model{
#########################################################################
# Conditional probability of the observables
#########################################################################
for (i in 1:n){
for(j in 4:4){
delta[i,j] <- pow(theta[i,1], Q[j,1])*pow(theta[i,2],
Q[j,2])*pow(theta[i,3], Q[j,3])*pow(theta[i,4],
Q[j,4])*pow(theta[i,5], Q[j,5])
delta_plus_1[i,j] <- delta[i,j] + 1
x[i,j] ~ dbern(pi_plus_1[delta_plus_1[i,j],j])
}
for(j in 6:12){
delta[i,j] <- pow(theta[i,1], Q[j,1])*pow(theta[i,2],
Q[j,2])*pow(theta[i,3], Q[j,3])*pow(theta[i,4],
Q[j,4])*pow(theta[i,5], Q[j,5])
delta_plus_1[i,j] <- delta[i,j] + 1
x[i,j] ~ dbern(pi_plus_1[delta_plus_1[i,j],j])
}
for(j in 14:20){
delta[i,j] <- pow(theta[i,1], Q[j,1])*pow(theta[i,2],
Q[j,2])*pow(theta[i,3], Q[j,3])*pow(theta[i,4],
Q[j,4])*pow(theta[i,5], Q[j,5])
delta_plus_1[i,j] <- delta[i,j] + 1
x[i,j] ~ dbern(pi_plus_1[delta_plus_1[i,j],j])
}
}
#########################################################################
# Prior distribution for the latent variables
#########################################################################
for(i in 1:n){
theta[i,1] ~ dbern(gamma_1)
# The remaining latent variables follow (14.15); the name theta_MN and the
# index arithmetic below are reconstructed, not from the original listing.
# dcat codes theta_MN as 1, 2, 3, corresponding to counts 0, 1, 2.
theta[i,2] ~ dbern(gamma_2[theta[i,1] + 1])
theta[i,5] ~ dbern(gamma_5[theta[i,1] + theta[i,2] + 1])
theta_MN[i] ~ dcat(gamma_MN[theta[i,1] + theta[i,2] + theta[i,5] + 1, 1:3])
theta[i,3] <- step(theta_MN[i] - 2)  # 1 if the count is 1 or 2
theta[i,4] <- step(theta_MN[i] - 3)  # 1 only if the count is 2
}
#########################################################################
# Prior distribution for the parameters
# that govern the distribution of the latent variables
#########################################################################
gamma_1 ~ dbeta(21,6)
gamma_2[1] ~ dbeta(6,21)
gamma_2[2] ~ dbeta(21,6)
gamma_5[1] ~ dbeta(6,21)
gamma_5[2] ~ dbeta(13.5,13.5)
gamma_5[3] ~ dbeta(21,6)
gamma_MN[1,1:3] ~ ddirch(alpha_gamma_MN[1, ])
alpha_gamma_MN[1,1] <- 16
alpha_gamma_MN[1,2] <- 8
alpha_gamma_MN[1,3] <- 6
gamma_MN[2,1:3] ~ ddirch(alpha_gamma_MN[2, ])
alpha_gamma_MN[2,1] <- 12
alpha_gamma_MN[2,2] <- 10
alpha_gamma_MN[2,3] <- 8
gamma_MN[3,1:3] ~ ddirch(alpha_gamma_MN[3, ])
alpha_gamma_MN[3,1] <- 8
alpha_gamma_MN[3,2] <- 10
alpha_gamma_MN[3,3] <- 12
gamma_MN[4,1:3] ~ ddirch(alpha_gamma_MN[4, ])
alpha_gamma_MN[4,1] <- 6
alpha_gamma_MN[4,2] <- 8
alpha_gamma_MN[4,3] <- 16
#########################################################################
# Prior distribution for the measurement model parameters
#########################################################################
for(j in 1:J){
pi_plus_1[1,j] ~ dbeta(3.5,23.5)
pi_plus_1[2,j] ~ dbeta(23.5,3.5)
pi_0[j] <- pi_plus_1[1,j]
pi_1[j] <- pi_plus_1[2,j]
}
}
For our illustration, we take the publicly available data published by Tatsuoka (2002),
which contains dichotomous scores coded as 1/0 corresponding to correct/incorrect
responses from 536 middle school students. The model was fit in WinBUGS using three
chains from dispersed starting values, and appeared to converge within 100 iterations (see
Section5.7.2). To be conservative, for each chain we discarded the first 1,000 iterations, and
ran an additional 9,000 iterations (3,000 per chain), sufficient to get a stable portrait of the
posterior, for use in inference. The marginal posterior densities were unimodal and fairly
symmetric, with departures from symmetry occurring when the densities for the param-
eters were located near a boundary of 0 or 1. Table14.8 contains numerical summaries of
the marginal posterior distributions.
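The multiple-chain convergence check described above can be scripted outside WinBUGS. Below is a minimal Python sketch (ours, not code from the book) of one common diagnostic, the Gelman–Rubin potential scale reduction factor, computed for a single parameter from several chains of post-burn-in draws.

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor for one parameter.

    chains: array of shape (m, n) holding m chains of n post-burn-in draws.
    Values near 1 are consistent with convergence.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)           # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # average within-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_hat / W)
```

In practice the diagnostic would be applied to each of the π and γ parameters (and monitored over iterations), with trace plots inspected alongside it.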
Beginning with the measurement model parameters, the conditional probabilities for a correct response given that the examinee possesses the requisite skills are all fairly high. For example, for item 4, the probability of a correct response from an examinee who possesses all the requisite skills (π_{4,1}) has a marginal posterior distribution centered at .89 (and a 95% HPD interval of (.85, .92)). Conversely, the conditional probabilities for a correct response given that the examinee does not possess all of the requisite skills are fairly low. For item 4, the probability of a correct response from an examinee who does not possess all the requisite skills (π_{4,0}) has a marginal posterior distribution centered at .20 (95% HPD of (.16, .25)). An exception is item 8, which only depends on Skill 1. The item requires the examinee to subtract a quantity from itself. Such tasks might be solved correctly by reasoning that anything minus itself is 0, which does not really involve proficiency with respect to fractions per se.
Turning to the parameters for the relationships among the latent variables, we see that the posterior reflects that examinees have higher probabilities of possessing skills if they possess the parent skills on which they depend, an idea that was weakly encoded in the prior distribution, and borne out or at least not strongly contradicted by the data. For example, the posterior indicates that the probability of an examinee possessing Skill 5 if they possess neither Skill 1 nor Skill 2 is fairly low (posterior median for γ_{5,0} of .21, 95% HPD interval of (.08, .37)), the probability if they possess exactly one of Skill 1 or Skill 2 is higher (posterior median for γ_{5,1} of .45, 95% HPD interval of (.27, .62)), and the probability if they possess both Skill 1 and Skill 2 is higher still (posterior median for γ_{5,2} of .73, 95% HPD interval of (.64, .82)).
TABLE 14.8
Summary of the Posterior Distribution for the Mixed-Number Subtraction Example

Parameter   Median  SD^a  95% HPD^b Interval    Parameter     Median  SD^a  95% HPD^b Interval
π_{4,1}     .89     .02   (.85, .92)            γ_1           .80     .02   (.76, .84)
π_{4,0}     .20     .02   (.16, .25)            γ_{2,0}       .21     .08   (.08, .38)
π_{6,1}     .95     .01   (.93, .97)            γ_{2,1}       .90     .02   (.86, .94)
π_{6,0}     .13     .04   (.05, .22)            γ_{5,0}       .21     .08   (.08, .37)
π_{7,1}     .87     .03   (.82, .92)            γ_{5,1}       .45     .09   (.27, .62)
π_{7,0}     .13     .02   (.09, .16)            γ_{5,2}       .73     .05   (.64, .82)
π_{8,1}     .83     .02   (.79, .87)            γ_{MN,0,0}    .53     .09   (.36, .71)
π_{8,0}     .37     .04   (.29, .46)            γ_{MN,0,1}    .26     .08   (.12, .43)
π_{9,1}     .77     .02   (.72, .80)            γ_{MN,0,2}    .19     .07   (.06, .33)
π_{9,0}     .30     .04   (.23, .37)            γ_{MN,1,0}    .40     .08   (.25, .57)
π_{10,1}    .80     .03   (.75, .85)            γ_{MN,1,1}    .40     .09   (.24, .57)
π_{10,0}    .04     .01   (.02, .07)            γ_{MN,1,2}    .20     .06   (.08, .33)
π_{11,1}    .92     .02   (.89, .95)            γ_{MN,2,0}    .14     .04   (.06, .23)
π_{11,0}    .08     .02   (.05, .11)            γ_{MN,2,1}    .41     .08   (.26, .55)
π_{12,1}    .95     .02   (.92, .98)            γ_{MN,2,2}    .45     .07   (.32, .58)
π_{12,0}    .12     .03   (.06, .19)            γ_{MN,3,0}    .05     .02   (.02, .09)
π_{14,1}    .94     .01   (.91, .96)            γ_{MN,3,1}    .28     .05   (.19, .37)
π_{14,0}    .10     .03   (.04, .17)            γ_{MN,3,2}    .66     .04   (.58, .76)
π_{15,1}    .92     .02   (.88, .96)
π_{15,0}    .17     .02   (.13, .21)
π_{16,1}    .90     .02   (.87, .93)
π_{16,0}    .11     .03   (.06, .17)
π_{17,1}    .87     .02   (.83, .91)
π_{17,0}    .05     .01   (.03, .08)
π_{18,1}    .85     .02   (.80, .89)
π_{18,0}    .14     .02   (.10, .17)
π_{19,1}    .81     .03   (.75, .86)
π_{19,0}    .03     .01   (.01, .05)
π_{20,1}    .83     .02   (.78, .87)
π_{20,0}    .02     .01   (.01, .04)

a SD = Standard Deviation.
b HPD = Highest Posterior Density.
A screenshot of the model formulated in Netica (Norsys, 1999–2012), prior to having any values for the observables, is given in Figure 14.8. Each node is depicted with a bar capturing the probability for the node being in that state, which is also represented numerically as a percentage.
We use Examinee 527 as a case study for illustration. Table 14.7 lists the response vector from this examinee. It can be seen that examinee 527 performed well on the items that required combinations of Skills 1, 2, and 3, but struggled with the items that additionally required Skill 4, including those that additionally required Skill 5. Entering the observed values for this examinee into the BN yields the posterior distribution represented in Figure 14.9. On the basis of the model, we are nearly certain the examinee possesses Skills 1, 2, and 3, and nearly certain the examinee does not possess Skill 4. These reflect the broad patterns identified in the examinee's response vector. Note, however, that our posterior probability for the examinee possessing Skill 5 is .65, reflecting considerable uncertainty. This can be understood by recognizing that we only have three items (items 7, 15, and 19) that depend on Skill 5, and they also depend on Skill 4, which the examinee almost certainly does not possess. As a consequence, we have little in the way of evidence about Skill 5 for this examinee.
p(y_4 = l | y_3 = k), for k = 0, …, J_3 and l = 0, …, J_4. (14.32)
[Figure: Netica screenshot. Each node is shown with bars and percentages for its states prior to any observations, e.g., Theta1: Possess 80.0, NotPossess 20.0; ThetaMN: Zero 16.9, One 32.4, Two 50.7; Theta4: Possess 50.7, NotPossess 49.3.]

FIGURE 14.8
Netica representation of the mixed-number subtraction example prior to observing any values. The slightly darker shading for the nodes for Theta3 and Theta4 reflects that they are deterministically related to their parent, ThetaMN.
[Figure: Netica screenshot with examinee 527's responses entered as evidence, e.g., Theta2: Possess 98.2; Theta5: Possess 64.6; ThetaMN: One 99.0; Theta4: Possess 0.27.]

FIGURE 14.9
Netica representation of the posterior distribution for the mixed-number subtraction example for examinee 527. The slightly darker shading for the nodes for Theta3 and Theta4 reflects that they are deterministically related to their parent, ThetaMN. The darker shading for the nodes for the Xs reflects that their values are known with certainty.
If the prerequisite structure holds, we would expect low frequencies of occurrences for l being large when k is small, meaning that it is unlikely that a student would correctly complete the tasks requiring Skill 4 given that they incorrectly completed the tasks requiring Skill 3.
In the current example this approach is complicated by the fact that we do not have tasks that measure Skill 3 or Skill 4 in isolation. As the Q-matrix in Table 14.7 expresses, the situation is somewhat messy for our purposes of homing in on performances that depend on Skill 3 and Skill 4. First, all the items require Skill 1. Items 9, 14, and 16 are the purest evaluation of Skill 3 in the sense that they depend on Skill 3 and Skill 1 only. We do not have any items that only depend on Skill 4 in addition to Skill 1. Items 4, 11, 17, 18, and 20 require Skill 4 in addition to Skill 1 and Skill 3. These are the purest evaluation of Skill 4, as the other items that depend on Skill 4 (i.e., items 7, 15, 19, and 10) throw either Skill 2 or Skill 5 into the mix. Several options for moving forward with subsets of items are possible. For the following illustration, we take the subsets that are the purest in the sense used above. We use items 9, 14, and 16 for Skill 3, even though they additionally depend on Skill 1. We use items 4, 11, 17, 18, and 20 for Skill 4, even though they additionally depend on Skill 1 and Skill 3. Extending the notation in (14.32), let the sums of the scores from these subsets of items be y13 and y134, respectively. Even though these values are clouded reflections of Skill 3 and Skill 4, the underlying logic still holds: the prerequisite relationship between Skill 3 and Skill 4 suggests that examinees with low values of y13 ought not to have high values of y134.
Owing to the uncertainty associated with the dependence of performance on skills, we know the model allows for some deviations from the expected pattern, but how many? And how does the fact that we do not have tasks that only measure Skill 3 or Skill 4 in isolation complicate things? To work out the details of the model's implications, we again turn to simulation, and through the use of PPMC we can articulate the model's implications and the correspondence, or lack thereof, between those implications and the observed data.
We conducted PPMC compiling the posterior predictive frequencies that y134 takes on any of its possible values 0, …, 5 given that y13 takes on any of its possible values 0, …, 3. The results for the combinations where y134 was 3, 4, or 5 (the three highest values) and y13 was 0 or 1 (the two lowest values) are depicted in Figure 14.10. For each combination of y134 and y13, the plot depicts the histogram of the 3,000 posterior predicted values for the frequencies, and the vertical line is placed at the realized value. For y13 = 0, the realized frequencies of y134 = 3, 4, and 5 are all 0, which obviously accords with the substance of the hypothesis, and the model-based expectations expressed in the densities. A similar picture emerges for y13 = 1. Here, the realized frequencies are 3, 2, and 5 for y134 = 3, 4, and 5, respectively. The posterior predictive distributions suggest that these are reasonably in accordance with the model-based expectations expressed in the posterior predicted densities. Overall, these results amount to support for this aspect of the model's performance, and correspondingly for the hypothesis that Skill 3 is a prerequisite for Skill 4.
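The frequency compilation just described is straightforward to script. Below is a minimal Python sketch (ours, not code from the book) of the tabulation applied to a single 0/1 response matrix; in PPMC it would be applied to the observed data and to each of the 3,000 posterior predicted datasets. The 0-based column positions for the two item subsets are assumptions about how the data matrix is laid out.

```python
import numpy as np

# Assumed 0-based column positions of items 9, 14, 16 (the Skill 3 subset)
# and items 4, 11, 17, 18, 20 (the Skill 4 subset) in the response matrix.
SKILL3_COLS = [8, 13, 15]
SKILL4_COLS = [3, 10, 16, 17, 19]

def freq_table(x, cols3=SKILL3_COLS, cols4=SKILL4_COLS):
    """Frequencies of y134 (rows 0..5) by y13 (columns 0..3) for a 0/1 matrix x."""
    x = np.asarray(x)
    y13 = x[:, cols3].sum(axis=1)     # sum score over the Skill 3 subset
    y134 = x[:, cols4].sum(axis=1)    # sum score over the Skill 4 subset
    table = np.zeros((len(cols4) + 1, len(cols3) + 1), dtype=int)
    for a, b in zip(y134, y13):
        table[a, b] += 1
    return table
```

The realized table computed from the observed data is then compared, cell by cell, against the distribution of tables computed from the posterior predicted datasets.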
[Figure: six histograms of 3,000 posterior predicted frequencies, one for each combination of y134 = 3, 4, 5 (columns) and y13 = 0, 1 (rows), with a vertical line at each realized value.]

FIGURE 14.10
Posterior predictive model checking results for the conditional distribution of y134 given low and high values of y13 in the mixed-number subtraction example.
V_ij(x_ij, θ_i, π_j) = (x_ij − P_ij)² / [P_ij (1 − P_ij)], (14.33)

where P_ij = E(x_ij | θ_i, π_j) is the probability of a correct response, which in the current case is π_{j1} if ∏_{m=1}^{M} θ_im^{q_jm} = 1 and π_{j0} if ∏_{m=1}^{M} θ_im^{q_jm} = 0. A person fit discrepancy measure is then given by the root mean square taken with respect to these values. For examinee i,

PF_i(x_i, θ_i, π) = [(1/J) ∑_{j=1}^{J} V_ij(x_ij, θ_i, π_j)]^{1/2}, (14.34)

where x_i = (x_i1, …, x_iJ) is the collection of J observed values from examinee i. We conducted
PPMC using (14.34), computing 3,000 realized values by employing the observed data for x along with the 3,000 draws for the parameters, and 3,000 posterior predicted values by using the 3,000 posterior predicted datasets along with the 3,000 draws for the parameters. For each examinee, we computed p_post as a numerical summary of the results. We confine our discussion here to the results for two examinees whose response vectors are given in Table 14.7.
Examinee 527 exhibited solid fit (p_post = .501), with a response vector that affords an interpretation that coheres with the model, as the examinee performed well on tasks requiring just Skill 1, Skill 2, or Skill 3, but struggled with tasks additionally requiring Skill 4 and Skill 5. In contrast, we see poor fit for the model when it comes to examinee 171 (p_post = .003), who incorrectly answered the two simplest tasks that only required Skill 1, but correctly answered some of the more complex tasks that required Skill 1 as well as additional skills. This pattern does not cohere with the assumed model structure; this discord manifests itself in the lack of person fit.
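The discrepancy measure and its posterior predictive p-value are easy to compute outside WinBUGS. The sketch below is our own Python rendering of (14.33) and (14.34); the function names and argument layout are ours, not the book's.

```python
import numpy as np

def person_fit(x_i, p_i):
    """PF_i of (14.34): root mean square of the V_ij in (14.33).

    x_i : 0/1 responses for one examinee (length J)
    p_i : model-implied probabilities P_ij = E(x_ij | theta_i, pi_j)
    """
    x_i = np.asarray(x_i, dtype=float)
    p_i = np.asarray(p_i, dtype=float)
    v = (x_i - p_i) ** 2 / (p_i * (1 - p_i))  # V_ij of (14.33)
    return np.sqrt(v.mean())

def p_post(realized, predicted):
    """Posterior predictive p-value: proportion of MCMC draws in which the
    predicted discrepancy exceeds the realized discrepancy."""
    return np.mean(np.asarray(predicted) > np.asarray(realized))
```

In the analysis above, person_fit would be evaluated twice per draw, once with the observed responses and once with a posterior predicted response vector, and p_post applied to the 3,000 resulting pairs.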
A lack of person fit may undermine the inferences for the examinee(s) in question, and suggests that the inferential argument expressed in the psychometric model may be dubious, at least for those examinees. In the current context, it may be that examinees are using Method A in working through the tasks, which involves a different set of skills. The account provided by the BN for Method B would then be a poor basis for making inferences about these examinees. We might employ a BN developed to represent our reasoning for examinees using Method A (Mislevy, 1995), and then employ the appropriate BN for whichever method the examinee uses. Of course, we may not know which method an examinee is using. To address this, Mislevy (1995) proposed the use of an expanded BN that contains BN fragments for each method, and an additional latent variable indicating which method the examinee is using. The conditional distribution for each observable would then depend on this variable as well as method-specific skills and parameters. If there is evidence that examinees use multiple methods, say, based on perceived features of the task, this could be further extended to include a latent variable capturing which method is being used for each task.
Formally, the expanded model may be seen as a mixture of the method-specific BNs, with the latent variable(s) for which method is being used defining the components of the mixture. This represents an expanded narrative regarding the situation, and a corresponding expansion to the argument used in reasoning from observed performances to what is of inferential interest (Mislevy et al., 2008).
In what follows we present a portion of the fuller analysis reported in Levy and Mislevy (2004). Much of the reasoning in developing the full model and analysis was similar to that displayed in the mixed-number fraction subtraction example from Section 14.4, so the goal here is to illustrate some additional issues and how they are handled. In particular, we highlight the use of parametric forms to smooth the elements of a conditional probability table in a BN, while also encoding substantive beliefs about relationships among the variables.
[Figure: DAG with nodes Networking Disciplinary Knowledge, Network Modeling, and Network Proficiency; Networking Disciplinary Knowledge is a parent of Network Modeling, and both are parents of Network Proficiency.]

FIGURE 14.11
Directed acyclic graph fragment for the latent variables in the student model for the NetPASS example.
TABLE 14.9
Conditional Probability Table for Design in NetPASS, Where θ_NP Is the Value of the Latent Variable for Network Proficiency and θ**_Design Is the Effective Theta Obtained from (14.37) Using Values for the Parameters Based on Subject Matter Expert Beliefs

p(Design | Network Proficiency)
Network Proficiency   θ_NP   θ**_Design   Novice   Semester 1   Semester 2   Semester 3   Semester 4
Novice                1      −3.8         .690     .253         .049         .007         .001
Semester 1            2      −1.8         .231     .458         .253         .049         .008
Semester 2            3       0.2         .039     .192         .458         .253         .057
Semester 3            4       2.2         .005     .034         .192         .458         .310
Semester 4            5       4.2         .001     .005         .034         .192         .769

Source: Levy, R., & Mislevy, R. J. (2004). Specifying and refining a measurement model for a computer-based interactive assessment. International Journal of Testing, 4, 333–369.
for k = 2, …, 5. For k = 5, the greater-than-or-equal relationship is one of equality. The effective theta method fixes the values for a and the bs in (14.36). This in turn shifts the dependence of Design on Network Proficiency to the effective theta, defined as

θ**_{i,Design} = c_{Design,NP} θ_{i,NP} + d_{Design,NP}, (14.37)
where the parameters c_{Design,NP} and d_{Design,NP} now govern the dependence of Design on Network Proficiency. This modeling approach effects a parametric smoothing of the conditional probabilities in the table, yielding a structure that is specified in terms of just two parameters (here c_{Design,NP} and d_{Design,NP}) regardless of the number of levels of the parent and child variables. Table 14.9 displays the probabilities resulting from fixing a = 1 and b = (−3, −1, 1, 3) in (14.36) using c_{Design,NP} = 2 and d_{Design,NP} = −5.8. These latter two values were chosen because they best captured the prior beliefs of subject matter experts; likewise for the choices of the parameters of the effective theta model for the remaining conditional probability tables.
WinBUGS code for this portion of the model is given as follows. To begin, the code for
the computation of the effective theta for Design instantiates (14.37) for the five possible
values of Network Proficiency:
-------------------------------------------------------------------------
#########################################################################
# general form: etDesign<- cDesign*NP+dDesign
#########################################################################
etDesign[1]<- cDesign*1+dDesign
etDesign[2]<- cDesign*2+dDesign
etDesign[3]<- cDesign*3+dDesign
etDesign[4]<- cDesign*4+dDesign
etDesign[5]<- cDesign*5+dDesign
-------------------------------------------------------------------------
This takes the five possible values of Network Proficiency and yields the associated effective
theta. This effective theta is then entered into the GRM. Code for (14.36) is given as follows.
-------------------------------------------------------------------------
for (aa in 1:5){ # index for which effective theta
for (bb in 1:4){ # index for which category boundary
des.p.greater[aa,bb] <- 1/(1+exp((-a)*((etDesign[aa])-(smb[bb]))))
}
}
-------------------------------------------------------------------------
The loop over the index aa refers to the computation for each possible value of the effective theta, and the loop over the index bb refers to the computation for each of the category boundaries. a refers to the a parameter of (14.36). smb refers to the collection of b parameters in (14.36). The first two letters in this name, sm, indicate that this is the collection of b parameters for the Student Model variables, in contrast to the b parameters used in the modeling of the observable variables (discussed in the following section). In the effective theta method, these are fixed. Code for doing so is as follows.
-------------------------------------------------------------------------
a<- 1
smb[1]<- -3.0; smb[2]<- -1; smb[3]<- 1; smb[4]<- 3;
-------------------------------------------------------------------------
The object des.p.greater contains the probabilities for Design exceeding the associated
category boundary. To complete the GRM and obtain the probabilities for Design being
equal to a particular value, the code for the computations in (14.35) is as follows.
-------------------------------------------------------------------------
for (aa in 1:5){
des.p[aa,1] <- 1- des.p.greater [aa,1]
des.p [aa,2] <- des.p.greater [aa,1]- des.p.greater [aa,2]
des.p [aa,3] <- des.p.greater [aa,2]- des.p.greater [aa,3]
des.p [aa,4] <- des.p.greater [aa,3]- des.p.greater [aa,4]
des.p [aa,5] <- des.p.greater [aa,4]
}
-------------------------------------------------------------------------
des.p is the conditional probability table for the Design variable, akin to Table 14.9. Finally, the distributional specification for the Design variable for examinee i is a categorical distribution with probabilities given by the row of this table corresponding to the examinee's value for Network Proficiency, which in the code below is denoted NP.
-------------------------------------------------------------------------
for (i in 1:n){
Design[i] ~ dcat(des.p[NP[i], ])
}
-------------------------------------------------------------------------
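The same computation can be checked outside WinBUGS. The Python sketch below is our own translation, not code from the book: it builds the conditional probability table for Design by running the effective thetas of (14.37) through the GRM boundary probabilities with a = 1 and b = (−3, −1, 1, 3), then differencing adjacent boundaries as in (14.35). With c = 2 and d = −5.8, its first row rounds to .690, .253, .049, .007, .001, matching the Novice row of Table 14.9.

```python
import numpy as np

A = 1.0                                  # fixed GRM discrimination
SMB = np.array([-3.0, -1.0, 1.0, 3.0])   # fixed category boundaries (smb in the WinBUGS code)

def design_cpt(c=2.0, d=-5.8, levels=5):
    """Conditional probability table for Design given Network Proficiency."""
    theta = c * np.arange(1, levels + 1) + d                            # effective thetas, (14.37)
    p_greater = 1 / (1 + np.exp(-A * (theta[:, None] - SMB[None, :])))  # GRM boundary probabilities
    # Difference adjacent boundary probabilities into category probabilities, (14.35)
    upper = np.hstack([np.ones((levels, 1)), p_greater])
    lower = np.hstack([p_greater, np.zeros((levels, 1))])
    return upper - lower
```

Each row of the result is a proper probability distribution over the five levels of Design.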
Structures analogous to that for Design were specified for the conditional probabilities for
Implement and Troubleshoot.
A similar structure was specified for the conditional probability table for Network
Modeling given Network Disciplinary Knowledge. However, subject matter experts indi-
cated that a students level of Network Modeling could not be higher than their level for
376 Bayesian Psychometric Modeling
TABLE 14.10
Conditional Probability Table for Network Modeling in NetPASS, with Values Based
on Subject Matter Expert Beliefs
p(Network Modeling | Network Disciplinary Knowledge)
Network Disciplinary
Knowledge Novice Semester 1 Semester 2 Semester 3 Semester 4
Novice 1 0 0 0 0
Semester 1 .768 .233 0 0 0
Semester 2 .282 .485 .233 0 0
Semester 3 .050 .233 .485 .233 0
Semester 4 .007 .041 .222 .462 .269
Source: Levy, R., & Mislevy, R. J. (2004). Specifying and refining a measurement model for a computer-based interactive assessment. International Journal of Testing, 4, 333–369.
Network Disciplinary Knowledge. This constraint was imposed by forcing the resulting probabilities for levels of Network Modeling that were higher than Network Disciplinary Knowledge to 0 and renormalizing. In effect, Network Disciplinary Knowledge acts as a ceiling for Network Modeling. Table 14.10 illustrates this structure using c_{NM,NDK} = 2 and d_{NM,NDK} = −8, where it can be seen that p(Network Modeling > Network Disciplinary Knowledge) = 0.
The WinBUGS code for specifying the distribution of Network Modeling is similar to that
given for Design, with the addition of the renormalization. Code mimicking that given
above for Design is given as follows.
-------------------------------------------------------------------------
#########################################################################
# general form: etNM[NDK]<- cNM*NDK+dNM
#########################################################################
etNM[1]<- cNM*1+dNM
etNM[2]<- cNM*2+dNM
etNM[3]<- cNM*3+dNM
etNM[4]<- cNM*4+dNM
etNM[5]<- cNM*5+dNM
----------------------------------------------------------------------
nm.p.ren[1,1] <- 1
nm.p.ren[1,2] <- 0
nm.p.ren[1,3] <- 0
nm.p.ren[1,4] <- 0
nm.p.ren[1,5] <- 0
----------------------------------------------------------------------
for (i in 1:n){
NM[i] ~ dcat(nm.p.ren[NDK[i], ])
}
----------------------------------------------------------------------
and Network Modeling. To develop the model, an initial effective theta was defined as a function of Network Disciplinary Knowledge. For each examinee this is defined as

θ*_{i,NP} = c_{NP,baseline} θ_{i,NDK} + d_{NP,baseline}, (14.38)

where θ_{i,NDK} is the value of Network Disciplinary Knowledge for examinee i. The final effective theta then modifies the initial effective theta, as

θ**_{i,NP} = θ*_{i,NP} + c_{NP,compensatory} [θ_{i,NM} − (θ_{i,NDK} − 1)]. (14.39)
This allows the examinee's value of Network Modeling (θ_{i,NM}) to partially compensate for Network Disciplinary Knowledge. When Network Modeling is one level below Network Disciplinary Knowledge, as it was expected to be based on conversations with subject matter experts, the contribution of Network Modeling is zero. When Network Modeling is equal to Network Disciplinary Knowledge, the contribution is c_{NP,compensatory}. On the other hand, when Network Modeling is two or more levels below Network Disciplinary Knowledge, the contribution is negative: the number of levels below, less one, multiplied by c_{NP,compensatory}. In principle, there are 25 combinations of the parent variables Network Disciplinary Knowledge and Network Modeling. However, some of these are rendered impossible by the ceiling relationship (as was the case in Table 14.10). Table 14.11 illustrates this structure for Network Proficiency using c_{NP,baseline} = 2, c_{NP,compensatory} = 1, and d_{NP,baseline} = −6.
WinBUGS code for this specification is given below, making use of aa as an index for
possible values of Network Disciplinary Knowledge and bb as an index for possible values of
Network Proficiency.
-------------------------------------------------------------------------
for (aa in 1:5){
etNP.1[aa]<- c1NP*(aa)+dNP
for (bb in 1:5){
etNP.2[aa, bb]<- etNP.1[aa]+c2NP*(bb-(aa-1))
}
}
-------------------------------------------------------------------------
The line for etNP.1 corresponds to (14.38) and the line for etNP.2 corresponds to (14.39).
etNP.2 is the effective theta that can then be subjected to the GRM and renormalization
strategies discussed above.
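To make the arithmetic concrete, here is a Python sketch (ours, not the book's code) of (14.38), (14.39), the GRM step, and the ceiling renormalization. With c_{NP,baseline} = 2, c_{NP,compensatory} = 1, and d_{NP,baseline} = −6, the row for Network Disciplinary Knowledge at Semester 1 and Network Modeling at Novice rounds to (.368, .632, 0, 0, 0), the corresponding row of Table 14.11.

```python
import numpy as np

A = 1.0
SMB = np.array([-3.0, -1.0, 1.0, 3.0])  # fixed GRM boundaries

def grm_probs(theta):
    """Category probabilities for a five-level variable under the GRM."""
    p_greater = 1 / (1 + np.exp(-A * (theta - SMB)))
    bounds = np.concatenate(([1.0], p_greater, [0.0]))
    return bounds[:-1] - bounds[1:]

def np_row(ndk, nm, c_base=2.0, c_comp=1.0, d_base=-6.0):
    """CPT row for Network Proficiency given NDK and NM (levels 1..5)."""
    theta1 = c_base * ndk + d_base                # initial effective theta, (14.38)
    theta2 = theta1 + c_comp * (nm - (ndk - 1))   # compensatory adjustment, (14.39)
    p = grm_probs(theta2)
    p[ndk:] = 0.0                                 # ceiling: NP cannot exceed NDK
    return p / p.sum()                            # renormalize
```

The same zero-out-and-renormalize step, applied without the compensatory term, produces the ceiling structure of Table 14.10.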
To complete the model for the latent variables, the θ_{i,NDK} are specified as following a categorical distribution in ways discussed previously.
TABLE 14.11
Conditional Probability Table for Network Proficiency in NetPASS, with Values Based on
Subject Matter Expert Beliefs
p(Network Proficiency | Network Disciplinary Knowledge, Network Modeling)
NDK^a   NM^b   Novice   Semester 1   Semester 2   Semester 3   Semester 4
Novice Novice 1 0 0 0 0
Semester 1 Novice .368 .632 0 0 0
Semester 1 .238 .762 0 0 0
Semester 2 Novice .135 .432 .432 0 0
Semester 1 .065 .303 .632 0 0
Semester 2 .036 .202 .762 0 0
Semester 3 Novice .05 .233 .485 .233 0
Semester 1 .02 .115 .432 .432 0
Semester 2 .009 .056 .303 .632 0
Semester 3 .005 .031 .202 .762 0
Semester 4 Novice .018 .101 .381 .381 .119
Semester 1 .007 .041 .222 .462 .269
Semester 2 .003 .016 .101 .381 .500
Semester 3 .001 .006 .041 .222 .731
Semester 4 <.001 .002 .016 .101 .881
Source: Levy, R., & Mislevy, R. J. (2004). Specifying and refining a measurement model for a computer-based interactive assessment. International Journal of Testing, 4, 333–369.
a NDK = Network Disciplinary Knowledge.
b NM = Network Modeling.
[Figure: DAG fragment with latent variables Networking Disciplinary Knowledge and Design as parents of the intermediary variable DK and Design 1, which, along with Design Context 1, is a parent of the observables Correctness of Outcome Design 1 and Quality of Rationale Design 1.]

FIGURE 14.12
Directed acyclic graph fragment for the observables from the first task targeting Design in the NetPASS example.
these latter variables are not of inferential interest, but are relevant for representing our
beliefs about the situation in the psychometric model.
Conversations with subject matter experts indicated that a conjunctive condensation
rule was useful for combining Network Disciplinary Knowledge and Design to specify the
intermediary variable DK and Design 1, which can take on any of the same five values of
its parents (i.e., Novice, …, Semester 4). We expand on the conjunctive models presented in Section 14.4.3 and define a leaky conjunction relationship for polytomous variables. For polytomous variables, the conjunction is enacted by taking the minimum value. We therefore begin by defining

θ*_{i,DKandDesign1} = min(θ_{i,NDK}, θ_{i,Design}) (14.40)

to initially represent the conjunctive relationship. We could use this value as an effective theta, and then zero out possible values of DK and Design 1 that exceed this minimum, and renormalize. This would effect a leaky conjunction, in which the probabilities are allowed to leak below the level that represents the conjunction, but not above. Levy and Mislevy (2004) employed an expanded form of this, allowing the leak to vary depending on which of the two parent variables was the minimum, specifying the effective theta for DK and Design 1 as

θ**_{i,DKandDesign1} = c_{1,DKandDesign1} θ*_{i,DKandDesign1} + d_{DKandDesign1} + c_{2,DKandDesign1} (θ_{i,NDK} − θ*_{i,DKandDesign1}) + c_{3,DKandDesign1} (θ_{i,Design} − θ*_{i,DKandDesign1}), (14.41)
-------------------------------------------------------------------------
for (aa in 1:5){
for (bb in 1:5){
etNDKDesign1[aa, bb] <- c1NDKDesign1*(min(aa, bb))
+ dNDKDesign1
+ c2NDKDesign1*(aa-min(aa, bb))
+ c3NDKDesign1*(bb-min(aa, bb))
}
}
-------------------------------------------------------------------------
etNDKDesign1 is the effective theta that can then be subjected to the GRM and renormalization strategies discussed above.
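To see how the leak terms in the code operate, the following Python sketch (ours) tabulates the effective theta over all 25 parent combinations. The values of c1, d, c2, and c3 here are illustrative placeholders, not the NetPASS values.

```python
import numpy as np

def leaky_conjunction_theta(c1=2.0, d=-6.0, c2=0.2, c3=0.4, levels=5):
    """Effective thetas for DK and Design 1 over all parent combinations.

    The conjunction is the minimum of the parents; c2 and c3 let probability
    'leak' upward when NDK or Design, respectively, exceeds that minimum.
    Parameter values are illustrative only.
    """
    et = np.empty((levels, levels))
    for aa in range(1, levels + 1):        # Network Disciplinary Knowledge
        for bb in range(1, levels + 1):    # Design
            m = min(aa, bb)                # the conjunction, as in (14.40)
            et[aa - 1, bb - 1] = c1 * m + d + c2 * (aa - m) + c3 * (bb - m)
    return et
```

When the two parents are equal, both leak terms vanish and the effective theta reduces to c1 times the minimum plus d.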
The Design Context 1 variable is introduced to account for the shared dependencies
among the observables derived from performances obtained from the same task (namely,
the first Design task), akin to parameters in testlet and bi-factor models (Rijmen, 2010). In
the current context of a BN, it is defined as having two values, High and Low. As Design
Context 1 has no parents it may be specified as a Bernoulli distribution in ways previously
discussed, with a slight modification discussed next.
Formally, the context variable enters the model for the observable via a compensatory specification of the effective theta, akin to a CFA or compensatory MIRT model. For observable j, the effective theta is defined as

θ**_{ij} = c_{j,DKandDesign1} θ_{i,DKandDesign1} + c_{j,DesignContext1} θ_{i,DesignContext1} + d_{j,DKandDesign1}, (14.42)

where θ_{i,DesignContext1} takes on a value of −1 or 1. The context variable has the effect of raising or lowering the effective theta by the amount of c_{j,DesignContext1}. The idea here is that there may be a contextual effect that raises or lowers the level of performance as captured by the multiple observables derived from a single task.
TABLE 14.12
Conditional Probability Table for DK and Design 1 in NetPASS Example, with Values Based
on Subject Matter Expert Beliefs
p(DK and Design 1 | Network Disciplinary Knowledge, Design)
NDK^a   Design   Novice   Semester 1   Semester 2   Semester 3   Semester 4
Novice Novice 1 0 0 0 0
Semester 1 1 0 0 0 0
Semester 2 1 0 0 0 0
Semester 3 1 0 0 0 0
Semester 4 1 0 0 0 0
Semester 1 Novice 1 0 0 0 0
Semester 1 0.368 0.632 0 0 0
Semester 2 0.306 0.694 0 0 0
Semester 3 0.258 0.742 0 0 0
Semester 4 0.222 0.778 0 0 0
Semester 2 Novice 1 0 0 0 0
Semester 1 0.335 0.665 0 0 0
Semester 2 0.065 0.303 0.632 0 0
Semester 3 0.050 0.256 0.694 0 0
Semester 4 0.040 0.218 0.742 0 0
Semester 3 Novice 1 0 0 0 0
Semester 1 0.306 0.694 0 0 0
Semester 2 0.057 0.279 0.665 0 0
Semester 3 0.009 0.056 0.303 0.632 0
Semester 4 0.007 0.043 0.256 0.694 0
Semester 4 Novice 1 0 0 0 0
Semester 1 0.281 0.719 0 0 0
Semester 2 0.050 0.256 0.694 0 0
Semester 3 0.008 0.049 0.279 0.665 0
Semester 4 0.001 0.006 0.041 0.222 0.731
Source: Levy, R., & Mislevy, R. J. (2004). Specifying and refining a measurement model for a computer-based interactive assessment. International Journal of Testing, 4, 333–369.
a NDK = Network Disciplinary Knowledge.
This effective theta is then entered into a version of the GRM that specifies the conditional distribution for the observable taking on any of its three possible values. Table 14.13 contains conditional probabilities from setting a = 1 and b = (−2, 2) in (14.36) and c_{j,DKandDesign1} = 2, c_{j,DesignContext1} = .4, and d_{j,DKandDesign1} = −5.
WinBUGS code for this portion of the model begins by specifying the Design Context 1 variable as a Bernoulli, and then converting it from its usual 0/1 coding to a −1/1 coding:
----------------------------------------------------------------------
for (i in 1:n){
DesignC1[i] ~ dbern(.5)
DesignContext1[i] <- 2*DesignC1[i]-1
}
----------------------------------------------------------------------
TABLE 14.13
Conditional Probability Table for an Observable in the NetPASS, with Values Based
on Subject Matter Expert Beliefs
p(Observable | DK and Design 1, Context)
DK and Design 1 Context Low Medium High
Novice Low 0.802 0.193 0.004
High 0.646 0.344 0.010
Semester 1 Low 0.354 0.613 0.032
High 0.198 0.733 0.069
Semester 2 Low 0.069 0.733 0.198
High 0.032 0.613 0.354
Semester 3 Low 0.010 0.344 0.646
High 0.004 0.193 0.802
Semester 4 Low 0.001 0.068 0.931
High 0.001 0.032 0.968
Source: Levy, R., & Mislevy, R. J. (2004). Specifying and refining a measurement model for a computer-based interactive assessment. International Journal of Testing, 4, 333–369.
-------------------------------------------------------------------------
for (i in 1:n){
etDesign1[i, j] <- c1Design1[j]*NDKDesign1[i]
+ c2Design1[j]*DesignContext1[i]
+ dDesign1[j]
}
-------------------------------------------------------------------------
The effective theta is then entered into the GRM, with code as follows, here using the WinBUGS logit function to enact (14.36):
-------------------------------------------------------------------------
for (i in 1:n){
for (k in 1:K){
logit(p.greater[i, j, k])<- a*(etDesign1[i, j] - emb[j, k])
}
}
-------------------------------------------------------------------------
Here, a refers to the a parameter of (14.36). emb refers to the collection of b parameters in (14.36). The first two letters in this name, em, indicate that this is the collection of b parameters for the Evidence Model variables, in contrast to the b parameters used in the modeling of the Student Model variables (discussed in the preceding section). In the effective theta method, these are fixed. Code for doing so is as follows.
-------------------------------------------------------------------------
emb[j, 1] <- -2
emb[j, 2] <- 2
-------------------------------------------------------------------------
Bayesian Networks 383
The object p.greater contains the probabilities for the observable exceeding the asso-
ciated category boundary. To complete the GRM and obtain the probabilities for the
observable being equal to a particular value, the code for the computations in (14.35) is
as follows.
-------------------------------------------------------------------------
for (i in 1:n){
p[i, j, 1] <- 1-p.greater[i, j, 1]
p[i, j, 2] <- p.greater[i, j, 1] - p.greater[i, j, 2]
p[i, j, 3] <- p.greater[i, j, 2]
}
-------------------------------------------------------------------------
p is the collection of conditional probabilities for examinee i for observable j. Finally, the
distributional specification for the observable itself is a categorical distribution with prob-
abilities given by p for the examinee and observable:
-------------------------------------------------------------------------
for (i in 1:n){
x[i, j] ~ dcat(p[i, j, ])
}
-------------------------------------------------------------------------
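For readers who want to trace the computation numerically, the chain above (effective theta, boundary probabilities, adjacent differences, categorical distribution) can be sketched in Python. This is an illustrative translation, not code from the text; the values of c1, c2, d, and a are hypothetical stand-ins, not estimates from the NetPASS application.

```python
import numpy as np

def logistic(z):
    """Inverse logit, matching the WinBUGS logit(p) <- z specification."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values for a single observable j (illustrative only)
c1, c2, d = 1.0, 0.3, 0.0      # slopes and intercept for the effective theta
emb = np.array([-2.0, 2.0])    # fixed category boundaries, as in the text
a = 1.0                        # hypothetical discrimination parameter of (14.37)

def category_probs(ndk, design_context):
    """GRM probabilities for a three-category observable."""
    et = c1 * ndk + c2 * design_context + d   # effective theta
    p_greater = logistic(a * (et - emb))      # boundary probabilities, (14.37)
    # Adjacent differences recover the category probabilities, (14.35)
    return np.array([1.0 - p_greater[0],
                     p_greater[0] - p_greater[1],
                     p_greater[1]])

p = category_probs(ndk=0.5, design_context=1.0)
```

The resulting probabilities sum to one and can be used to draw the observable from a categorical distribution, mirroring the dcat specification above.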
Analogous specifications were made for the other observables from this and the two other tasks targeting Design, which had DAGs akin to that in Figure 14.12, as well as the three tasks targeting Implement, and the three tasks targeting Troubleshooting. See Levy and Mislevy (2004) for complete details on these specifications, prior distributions based on subject matter experts' expectations, MCMC estimation, and resulting inferences about the tasks and examinees.
FIGURE 14.13
Directed acyclic graph for a dynamic Bayesian network.
The first is a within-time component in which the observables at each time are modeled as dependent on the latent variables within each time slice. The second component is a between-time or transition component in which the latent variable at a particular time is modeled as dependent on the latent variable and observable at the previous time.
To simplify the situation, suppose we have a model for one dichotomous latent variable that represents possessing a skill and one dichotomous observable variable that corresponds to the correctness of a response to a task. The within-time component is then a latent class model, discussed previously. We focus our attention on the transition component that models the latent variable at any time as dependent on the latent and observable variable at the previous time. Let θit denote the value of the latent variable for examinee i at time t, taking on a value of 1 or 0 corresponding to examinee i possessing or not possessing the skill at time t. Similarly, let xit denote the observable coded as 1 or 0 for a correct or incorrect response from examinee i at time t. The conditional probability table for the transition component is given in Table 14.14, where τkl denotes the conditional probability that an examinee with values of θit = k and xit = l at time t will possess the skill at time t + 1 (i.e., θi(t+1) = 1).
τ11 and τ10 represent the conditional probabilities that an examinee will possess the skill at time t + 1 given that they possessed the skill at time t, accompanied by a correct and incorrect response at time t, respectively. A common simplifying assumption sets these values at 1, reflecting that if an examinee possesses a skill, they do so at the next time period(s). τ01 and τ00 represent the transition probabilities from not possessing a skill to possessing a skill following a correct or incorrect response, respectively.
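In code, the transition component amounts to a table lookup plus an average over the current state. The Python sketch below is illustrative only: it fixes τ11 = τ10 = 1 per the simplifying assumption just described, and the values for τ01 and τ00 are hypothetical.

```python
# Transition probabilities tau[(theta_t, x_t)] = P(theta_{t+1} = 1 | theta_t, x_t).
# The first two entries encode the simplifying assumption that a possessed
# skill is retained; the values for tau_01 and tau_00 are hypothetical.
tau = {(1, 1): 1.0, (1, 0): 1.0,
       (0, 1): 0.4, (0, 0): 0.1}

def next_skill_prob(p_possess_t, x_t):
    """P(skill possessed at time t + 1 | x_t): average the transition
    probabilities over the current distribution of the latent variable."""
    return (p_possess_t * tau[(1, x_t)]
            + (1.0 - p_possess_t) * tau[(0, x_t)])
```

For example, with P(θt = 1) = .5 and a correct response, next_skill_prob(0.5, 1) yields .5(1) + .5(.4) = .70.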
TABLE 14.14
Conditional Probability Table for the Transition Component in a Dynamic Bayesian Network

θt    xt    p(θ(t+1) = 1 | θt, xt)
1     1     τ11
1     0     τ10
0     1     τ01
0     0     τ00

A number of simplifications or extensions of this general structure are possible. For example, Levy (2014) constructed a DBN for Save Patch, an educational game targeting rational number addition (Chung et al., 2010) in which:
• Players (examinees) complete levels (tasks) by using math skills to navigate from the beginning of the level to the end;
• Feedback occurs in that successfully completing a level leads to being presented a new level, while unsuccessfully completing a level leads to being presented the same level for another attempt;
• The performance of each player on each level of the game was evaluated as representing a correct solution or errors of various kinds, defining a polytomous observable for each attempt at each level;
• Multiple latent variables representing a variety of skills of inferential interest and possible misconceptions were specified;
• The conditional probabilities in the within-time component were smoothed via an IRT parameterization.
See Levy (2014) for complete details, as well as specifications of prior distributions for the
unknown parameters of the parametric forms used to enact smoothing, and the results of
fitting the model using MCMC. For our current purposes of illustrating DBNs, we present
a simplified account of a portion of the model corresponding to performance on Level 19 of
the game, making the simplifying assumption that the player does not possess the miscon-
ception measured on that level and using point summaries of the posterior distribution
obtained from model-fitting.
Table 14.15 presents the conditional probability table for the observable corresponding to performance on Level 19 given the targeted skill, Adding Unit Fractions. This structure is assumed to hold, with the same conditional probabilities, over all attempts (time points) at the level.
If a player (examinee) possesses the Adding Unit Fractions skill, they will almost certainly
complete the level with the Standard Solution, meaning they successfully complete the
level in a way that reflects the skill. There is only a small probability that they will provide
an Incomplete Solution, or commit an error, possibly one identified as evidencing certain
misconceptions (Wrong Numerator Error) or one not associated with any misconceptions
(Unknown Error). In contrast, a player who does not possess the skill is much less likely to
successfully complete the level and is more likely to make an error.
Table 14.16 presents the conditional probabilities of possessing the Adding Unit Fractions skill at time t + 1 given Adding Unit Fractions and the observable for performance on the level at time t. The first row reflects the previously mentioned assumption that once
a player (examinee) acquires a skill, she will retain it. The second row gives the probabili-
ties of transitioning from a state of not possessing to a state of possessing the skill, which
vary based on the performance on the level.
TABLE 14.15
Conditional Probability Table for Performance on Level 19 in Save Patch

                 p(Observable for Level 19 | Adding Unit Fractions)
Adding Unit    Standard   Alternate   Incomplete   Wrong Numerator   Unknown
Fractions      Solution   Solution    Solution     Error             Error
Possess        0.95       0.00        0.01         0.03              0.01
Not Possess    0.58       0.02        0.02         0.25              0.13
TABLE 14.16
Conditional Probabilities for Possessing Adding Unit Fractions at Time t + 1 on Level 19
in Save Patch

                                  Observable for Level 19 at Time t
Adding Unit           Standard   Alternate   Incomplete   Wrong Numerator   Unknown
Fractions at Time t   Solution   Solution    Solution     Error             Error
Possess               1          1           1            1                 1
Not possess           .38        .17         .19          .20               .09
FIGURE 14.14
States of a portion of the dynamic Bayesian network over multiple time points for performance on Level 19 in
Save Patch: (a) prior to observing any attempts, (b) after observing an unknown error on the first attempt,
(c) after observing a wrong numerator error on the second attempt, and (d) after observing a standard solution
on the third attempt.
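The arithmetic behind the first observation-and-transition cycle just described can be sketched in Python, using the values in Tables 14.15 and 14.16 and taking .70 as the prior probability of possessing the skill at Time 1 (the value used in Exercise 14.4). This is an illustrative sketch, not code from Levy (2014).

```python
# Conditional probabilities from Table 14.15: p(observable | skill state),
# ordered (Standard, Alternate, Incomplete, Wrong Numerator, Unknown).
p_obs_given_possess     = [0.95, 0.00, 0.01, 0.03, 0.01]
p_obs_given_not_possess = [0.58, 0.02, 0.02, 0.25, 0.13]

# Transition probabilities from Table 14.16 for a player who does not
# possess the skill, indexed by the observable at time t; a possessed
# skill is retained with probability 1.
tau_not_possess = [0.38, 0.17, 0.19, 0.20, 0.09]

UNKNOWN_ERROR = 4  # index of the Unknown Error category

def observe(prior_possess, obs):
    """Bayes update of P(skill) after observing a response category."""
    num = prior_possess * p_obs_given_possess[obs]
    den = num + (1.0 - prior_possess) * p_obs_given_not_possess[obs]
    return num / den

def transition(post_possess, obs):
    """P(skill at t + 1) given the updated state and the observable at t."""
    return post_possess + (1.0 - post_possess) * tau_not_possess[obs]

prior_t1 = 0.70
post_t1 = observe(prior_t1, UNKNOWN_ERROR)      # skill at Time 1 given the error
prior_t2 = transition(post_t1, UNKNOWN_ERROR)   # implied prior for Time 2
```

Observing an Unknown Error on the first attempt drops the probability of possessing the skill at Time 1 from .70 to about .15, which in turn implies a probability of about .23 of possessing the skill at Time 2.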
Cast in the current light, DCMs that employ discrete latent and observable variables
may be seen as BNs, underscoring the appeal of BNs as a powerful approach to diagnostic assessment (Almond et al., 2007). Bayesian approaches have also been popular for extensions of such models. One line of research incorporates continuous latent variables in several different ways, including as an aspect of proficiency on which observables depend
(Bradshaw & Templin, 2014; Hong, Wang, Lim, & Douglas, 2015; Roussos et al., 2007),
and as a higher order latent variable to model the dependence among the discrete latent
skills (de la Torre & Douglas, 2004). See also Hoijtink, Bland, and Vermeulen (2014) for a
Bayesian treatment of DCMs and cognitive diagnostic models of other types.
Another line of research expands DCMs to model differential attractiveness of response
options to partially correct examinee understandings (DiBello, Henson, & Stout, 2015) or
models slip and guessing parameters as a function of examinees (Huang & Wang, 2014b).
Another extension comes in the form of specifying prior distributions for the entries of
the Q-matrix, reflecting the uncertainty regarding whether correctly completing a task
depends on possessing a particular skill (DeCarlo, 2012). Similarly, a mixture of BNs could
be specified to reflect that multiple solution strategies are possible, with different skills
being invoked under different strategies, possibly with examinees adopting or switching
strategies across tasks (de la Torre & Douglas, 2008; Mislevy, 1995).
A more complete description of construction, MCMC estimation, and use of the BN for the NetPASS example in Section 14.5 is given by Levy and Mislevy (2004). See also Almond et al. (2001) and Mislevy et al. (2002) for descriptions of functional forms that may be used to model the conditional probabilities in BNs in complex assessments, and Almond, Mulder, Hemat, and Yan (2009), who focused on options for models of dependence that can arise in different ways when multiple observables are derived from complex assessments. This flexibility of BNs has made them an attractive option for complex assessments, such as in the works just cited, VanLehn and Martin (1997), Rupp et al. (2012), and Quellmalz et al. (2012).
The use of discrete variables makes the updating in BNs computationally fast, supporting real-time or near real-time updating as examinees engage with assessments. Accordingly, BNs have become popular for use in adaptive assessment (Almond & Mislevy, 1999) and related systems that adapt what they do in light of the examinee's performance. Examples of BNs used in such capacities may be found in applications of intelligent tutoring systems (Chung, Delacruz, Dionne, & Bewley, 2003; Conati, Gertner, VanLehn, & Druzdzel, 1997; Mislevy & Gitomer, 1996; Reye, 2004; Sao Pedro, Baker, Gobert, Montalvo, & Nakama, 2013; VanLehn, 2008) and games (Iseli, Koenig, Lee, & Wainess, 2010; Rowe & Lester, 2010; Shute, 2011; Shute, Ventura, Bauer, & Zapata-Rivera, 2009). Reye (2004) demonstrated how preceding lines of work on modeling longitudinal patterns assuming learning could be framed as DBNs, paving the way for their applications in tutoring systems (Reye, 2004; VanLehn, 2008) and game-based assessments of learning or change (Iseli et al., 2010; Levy, 2014). See Levy (2014) for a description of the fully Bayesian specification and estimation of the complete DBN for the game-based assessment example in Section 14.6.
BNs may be specified in advance, or partially constructed on the fly in light of examinee performance. The DBN in Section 14.6 represents a simple example of where the BN could be constructed on the fly. Here, the network adds nodes for the next time slice for the current level if the player (examinee) is unsuccessful on the previous attempt, or adds nodes for the next level if the player is successful on the previous attempt. Mislevy and Gitomer (1996) described a more complicated scenario in the context of HYDRIVE, a simulation-based assessment of troubleshooting for the F-15 aircraft's hydraulics systems. The open-ended nature of the assessment implies that there are an enormous number of possible states to potentially monitor, which poses challenges to developing psychometric models (Levy, 2013). Rather than model all possibilities, HYDRIVE was designed to recognize when the examinee had worked her or his way into a situation that could be characterized as providing evidence regarding one of the proficiencies of interest. Such situations could occur frequently, sparingly, or not at all depending on the examinee's chosen actions. Once a situation was recognized as such, the BN for the examinee could be augmented to include the appropriate variables.
Exercises
14.1 Reconsider the just-persons DAG for the BN in Figure 14.4.
a. Assuming each observable and latent variable is dichotomous and the (condi-
tional) probabilities are unknown parameters, write out the model in terms of
the distributions, including conditionally conjugate prior distributions on the
unknown parameters.
b. Write out a DAG for the model in (a) that expands Figure 14.4 to include all the
entities.
c. For each unknown entity list out the relevant entities that need to be con-
ditioned on in the full conditional distribution. (Hint: This can be done just
using the DAG from b.)
d. For each unknown entity, write out the full conditional distribution.
e. Repeat (a)–(d), now assuming each observable and latent variable is polytomous.
14.2 The BN for the mixed-number subtraction example developed in Section 14.4 introduced θMN to structure the conditional distributions of θ3 and θ4 on the remaining latent variables honoring the substantive theory, which states that (a) how many of Skill 1, Skill 2, and Skill 5 the examinee possesses is relevant, not which ones; and (b) Skill 3 is a hard prerequisite for Skill 4.
a. Specify a model that honors these substantive beliefs, but (i) expresses the conditional probability of possessing Skill 3 given Skill 1, Skill 2, and Skill 5, p(θ3 | θ1, θ2, θ5), in terms of the parameters of the model, and (ii) expresses the conditional probability of possessing Skill 4 given Skill 1, Skill 2, Skill 5, and Skill 3, p(θ4 | θ1, θ2, θ5, θ3), in terms of the parameters of the model.
b. How could the conditional probabilities p(θ3 | θ1, θ2, θ5) and p(θ4 | θ1, θ2, θ5, θ3) be monitored in WinBUGS?
c. Create an alternative model for the latent variables that honors the substantive theory but does not involve the intermediary variable θMN. That is, create a model that specifies θ3 and θ4 as directly dependent on θ1, θ2, and θ5 (and possibly each other).
14.3 Write out the joint distribution for the entities depicted in Figure 14.13 in terms of recursive conditional distributions given other entities.
14.4 Compute the marginal distributions for the entities in the DBN for the Save Patch example in Figure 14.14 in the following situations.
a. Using the prior probability of possessing the Adding Unit Fractions skill at Time 1 of .70 and the conditional probability tables in Tables 14.15 and 14.16, compute the marginal distributions for all the entities. Verify your results with Figure 14.14 panel (a).
b. Suppose the examinee exhibits an Unknown Error on their first attempt. Compute the posterior distribution for the Adding Unit Fractions skill at Time 1, and the posterior predictive distribution for the remaining entities. Verify your results with Figure 14.14 panel (b).
c. Suppose the examinee next exhibits a Wrong Numerator Error on their second attempt. Compute the posterior distribution for the Adding Unit Fractions skill at Time 1 and at Time 2, and the posterior predictive distribution for the remaining entities. Verify your results with Figure 14.14 panel (c).
d. Suppose the examinee next completes the level with a Standard Solution on their third attempt. Compute the posterior distribution for the Adding Unit Fractions skill at Time 1, Time 2, and Time 3. Verify your results with Figure 14.14 panel (d).
e. Given the observations of these three attempts, what is the probability that the
examinee possesses the Adding Unit Fractions skill at Time 4?
15
Conclusion
The specifics of these ideas and how their extensions play out in particular psychometric modeling families are instantiations of a general approach described in Chapter 7. We organized our presentation in terms of the psychometric modeling families mainly for didactic reasons. We resist the notion that there are sharp boundaries between these families or
reasons. We resist the notion that there are sharp boundaries between these families or
that analysts must choose from among them. Instead, we advocate a modular approach to
modeling (Rupp, 2002) that views the models we have covered as some of the options in a
menu of structures that we can call into service, combine, modify, and extend as needed.
These are tools to help us achieve some of the inferential goals we have when we employ models, and serve as useful answers to the questions and needs that we have in assessment.
Want to characterize examinees in terms of a continuum of proficiency? Use a continuous latent variable (Chapters 8, 9, and 11). Want to also characterize examinees in terms of finite distinct strategies they use? Use a discrete latent variable for that (Chapters 13 and 14). Want to pursue learning during the assessment? Build a transition component for change over time (Chapter 14). Are examinees organized in terms of meaningful clusters, or can tasks be grouped in terms of common design features? Add a multilevel structure with collateral information to render them conditionally exchangeable (Sections 9.7 and 11.7.4). And so on. In this text, we have mentioned some of these situations, with pointers
to examples of applications that blur the boundaries between the modeling families as we
have organized them.
We advance that Bayesian approaches allow us to better build and reason with statistical
models. We have examined how such approaches play out in several different psychometric
modeling paradigms, as well as when it comes to statistical issues that transcend psycho-
metric applications such as model-data fit and accommodating missing data. Our view is
that a Bayesian perspective provides an attractive framework for accomplishing reasoning
activities. It allows us to derive things as seemingly disparate as the posterior predictive
distribution for missing data on the one hand and Kelley's formula for estimating true scores in CTT on the other hand. It has the technical machinery that allows us to carry
our uncertainty from one stage of analysis forward to a later stage, and the technical and
conceptual flexibility to build models with different degrees of firmness of constraints to
reflect the different firmness of our substantive beliefs or theories. Though Bayes is not the only way to think about the sorts of things that go on in psychometrics, it is a powerful and useful way to approach psychometric challenges, old and new.
Throughout the second part of the book, we have described a number of psychometric
activities and applications that lend themselves to, or benefit from, adopting a Bayesian
perspective. There are more, many of which may be characterized as capitalizing on one
or more of the key advantages of Bayesian inference, such as leveraging prior information,
properly accounting for uncertainty, borrowing strength across hierarchical structures,
and updating beliefs as information arrives. Examples in assessment include modeling the
use of raters (Cao, Stokes, & Zhang, 2010; Johnson, 1996; Patz, Junker, Johnson, & Mariano,
2002), equating (Baldwin, 2011; Karabatsos & Walker, 2009; Liu, Schulz, & Yu, 2008;
Mislevy, Sheehan, & Wingersky, 1993), vertical scaling (Patz & Yao, 2007), determining test
length (Hambleton, 1984; Novick & Lewis, 1974), obtaining scores based on subsets of the
assessment (de la Torre & Patz, 2005; Edwards & Vevea, 2006), evidence identification with
complex work products (Johnson, 1996; Rudner & Liang, 2002), modeling in the absence
of predetermined evidence identification rules (Karabatsos & Batchelder, 2003; Oravecz,
Anders, & Batchelder, 2013), and instrument development and validation (Gajewski, Price,
Coffland, Boyle, & Bott, 2013; Jiang et al., 2014).
Bayesian approaches have also been advanced for examining the viability of a psycho-
metric model and our understandings of the real-world assessment situation, including
detecting aberrant responses (Bradlow & Weiss, 2001; Bradlow, Weiss, & Cho, 1998) or
response times (Marianti, Fox, Avetisyan, Veldkamp, & Tijmstra, 2014; van der Linden &
Guo, 2008), item preknowledge (McLeod, Lewis, & Thissen, 2003; Segall, 2002), cheating
(van der Linden & Lewis, 2015), sensitivity to instruction (Naumann, Hochweber, & Hartig,
2014), and measurement noninvariance or differential functioning (Frederickx, Tuerlinckx,
De Boeck, & Magis, 2010; Fukuhara & Kamata, 2011; Sinharay, Dorans, Grant, & Blew, 2009;
Soares, Goncalves, & Gamerman, 2009; Verhagen & Fox, 2013; Wang, Bradlow, Wainer, &
Muller, 2008; Zwick, Thayer, & Lewis, 1999). The last of these topics is an example of where
Bayesian methods have been advanced because they support the formal inclusion of loss
or utility functions (Zwick, Thayer, & Lewis, 2000), and because they afford improving
our understanding over time as data accrue (Sinharay et al., 2009; Zwick, Ye, & Isham, 2012).
Bayesian perspectives have also allowed for the resolution of paradoxes and deeper under-
standings of commonalities that cut across latent variable modeling families (van Rijn &
Rijmen, 2015).
We can employ latent variable models to account for the presence of measurement error
in larger models that focus on relationships among constructs, sampling, or other aspects
of designs. Again, a Bayesian approach has much to recommend it conceptually and in
terms of overcoming difficulties associated with conventional approaches. Plausible values
methodology is one instance. Others include models that acknowledge measurement error
in predictors and outcomes (Fox & Glas, 2003) as in structural equation models (Kaplan
& Depaoli, 2012; Lee, 2007; Levy & Choi, 2013; Song & Lee, 2012) and generalized linear
latent and mixed models (Hsieh, von Eye, Maier, Hsieh, & Chen, 2013; Segawa et al., 2008),
and the integration of latent variable measurement models of the sort described here with
generalizability theory (Briggs & Wilson, 2007).
Recall Kelley's formula for the EAP estimate of an examinee's true score:

μ_{Ti|xi} = ρxi + (1 − ρ)μ_T = μ_T + ρ(xi − μ_T).
Suppose we have examinees from two groups that differ in terms of their true score mean on some educational test. The application of Kelley's formula here could then be with respect to the group-specific true score mean (Kelley, 1947). Letting μ_{T(g)} denote the true score mean for group g and assuming that the reliability ρ is constant across groups and 0 < ρ < 1, the EAP estimate for examinee i that is a member of group g is

μ_{Ti(g)|xi} = ρxi + (1 − ρ)μ_{T(g)} = μ_{T(g)} + ρ(xi − μ_{T(g)}).   (15.1)
This value may differ for examinees that have the same values of x but belong to different groups. Suppose examinees in group A are on average higher scoring than those in group B; that is, μ_{T(A)} > μ_{T(B)}. An examinee who is a member of group B will have a lower EAP estimate than a member of group A, even though their scores on the test are the same. Moreover, the direction that the observed score is shrunk may differ. Let x0 be a particular test score where μ_{T(A)} > x0 > μ_{T(B)}. An examinee in group A with a score of x0 on the test will have her/his true score estimate shrunk up, toward μ_{T(A)}, while an examinee in group B with a score of x0 on the test will have her/his true score estimate shrunk down, toward μ_{T(B)}. Exercise 8.4 illustrated this and contained a more extreme example where an examinee who was a member of group B had a higher observed score than an examinee from group A, but had a lower estimated true score.
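The two directions of shrinkage can be illustrated with a short sketch; the reliability, group means, and observed score below are hypothetical values chosen for illustration, not quantities from Exercise 8.4.

```python
def kelley_eap(x, rho, mu_group):
    """Group-specific Kelley estimate, as in (15.1): shrink the observed
    score toward the group's true score mean by the reliability rho."""
    return mu_group + rho * (x - mu_group)

rho = 0.8                   # hypothetical reliability
mu_a, mu_b = 110.0, 90.0    # hypothetical group true score means, mu_a > mu_b
x0 = 100.0                  # an observed score lying between the group means

eap_a = kelley_eap(x0, rho, mu_a)   # shrunk up, toward mu_a
eap_b = kelley_eap(x0, rho, mu_b)   # shrunk down, toward mu_b
```

With these values, two examinees with the same observed score of 100 receive estimates of 102 and 98, shrunk in opposite directions toward their respective group means.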
Such an analysis is perfectly in line with Bayesian reasoning. If there are known or believed differences between the groups' true score means, building a model that includes all salient aspects of the real-world situation would yield a multiple-group model with varying μ_{T(g)} for the different groups. Put another way, if groups differ in ways that matter, we should condition on group membership and incorporate that in our analysis. If an examinee is a member of a lower scoring group, our estimate of their proficiency should be downgraded, relative to an examinee that performed the same but is a member of a higher scoring group.
But in many cases, doing so runs afoul of our sense of fairness and would violate some of the spirit if not the letter of the laws and professional standards regarding assessment in education and related settings (AERA, APA, NCME, 2014; Camilli, 2006; Phillips & Camara, 2006). How to reconcile this seemingly nefarious implication of adopting a Bayesian approach to scoring? The answer lies in recognizing that modeling is purpose oriented and our models should be built to reflect the purposes of inferences as well as our beliefs. We can understand the implications of purposes for psychometric modeling in the current context by distinguishing between using tests as measurements versus tests as contests (Holland, 1994; Wainer & Thissen, 1994). When a test is conceived of as a measurement, accuracy is paramount. If group-specific applications of Kelley's formula yield more accurate estimates, so be it.*
One area where the test-as-measurement metaphor typically works well is medical diag-
nosis. Two patients with the same symptoms or results from medical tests may have dif-
fering posterior probabilities of a disease based on, say, gender, race, or family history of
disease. We are perfectly content for doctors to take such group membership into account
when forming beliefs about our physical condition. We want them to take such things into
account on the grounds that it leads to more accurate diagnoses.
What is different about assessment in education and similar settings that would make
many of us recoil at the notion that examinees receive different proficiency estimates
based on their gender, or race, or family history? The answer, at least partially, is that in
education and related settings, assessments are often used as contests with consequences
for examinees that are linked to the consequences for the other examinees. Unlike medical diagnosis, where the consequences of one patient's diagnosis are unrelated to those for
other patients, in assessments-as-contests, we have notions of someone winning (a student
is selected for a reward as in Exercise8.4, an applicant is selected for admission or hired,
* Interestingly, some have advocated that if we want a more accurate portrait of an examinee's capabilities, we should not only ignore what Kelley's formula implies, but in fact should shift examinees' scores in the opposite direction than what Kelley's formula implies. This has generally not borne out when evaluated for accuracy (Wainer & Brown, 2007).
a student or employee is promoted to the next grade or job) and a corresponding notion of
others losing. Accuracy remains important of course, but now principles of fairness take precedence, which dictate that the same performance should yield the same score and
assignment of rewards. A seemingly paradigmatic example comes from athletic competi-
tions. At the moment the starter pistol is fired, we might believe that one Olympic mara-
thon runner is faster than another runner based on what country they are from or the past
performances of these runners. And even though we might still have these same beliefs about the runners' capabilities broadly construed when the former loses to the latter, we
should nevertheless award the medals based on the order they finish the race that day and
nothing else.*
On this view, the distasteful nature of the notion that examinees with the same observed performance on an educational test receive different EAP estimates based on their group membership is the result of viewing the test as a contest rather than a measurement. A proper treatment of the test as a contest would preclude the differential EAPs for students with the same observed score on the basis of group membership. The solution here is not to cast aside Kelley's formula or principles of Bayesian modeling on the grounds that using such models that are based on our substantive beliefs leads to unacceptable situations. Rather, we recognize that our models should be built in accordance not only with what we believe about the situation, but our purposes as well. When our beliefs and purposes are in conflict, some aspects of the former may need to be excluded from the model to preserve some aspects of the latter. What models we use should reflect our purposes, and we might intentionally use a model that does not fully capture our beliefs about the world if the situation calls for it. In the situation at hand, we would be perfectly content to build a model that does not distinguish between groups, in effect conducting a single group analysis, again using Kelley's formula or Bayesian inference more generally. Importantly, a Bayesian approach to modeling calls for the analyst to make more explicit what aspects of the real-world situation are built into the model.
* Provocatively, some have argued that when the observations of athletic performances are prone to measure-
ment error, the Bayesian approach that yields higher accuracy at the price of fairness as we have presented it
is to be preferred, even on the considerations of fairness (Moore, 2009).
Appendix A: Full Conditional Distributions
We develop the full conditional distributions for regression (Chapter 6), CTT (Chapter 8), CFA (Chapter 9), IRT (Chapter 11), and LCA (Chapter 13) models in Sections A.1 through A.5,
respectively. For each model, we restate the posterior distribution and then pursue the
full conditionals. For completeness and transparency in the full conditionals, we expand
beyond what was presented in the chapters to include conditioning on the hyperparam-
eters. When the hyperparameters are specified in advance, they are not random variables
in the sense of other model parameters and the data. However, including them serves to
highlight that they are involved in the computations and aligns with the use of expanded
DAGs that include them, as the full conditional distribution for any unknown is con-
structed by conditioning on its parents, children, and other parents of its children. Known
hyperparameters fall into the first of these categories. In several places, we denote the
arguments of the distribution (e.g., mean vector and covariance matrix for a normal dis-
tribution) with subscripts denoting that it refers to the full conditional distribution; the
subscripts are then just the conditioning notation of the left-hand side.
Many of the specifications involve normal distributions that afford conditional conju-
gacy relationships through the application of standard Bayesian results for models assum-
ing normality; see Lindley and Smith (1972) and Rowe (2003) for extensive treatments of
these results.
A.1 Regression
In regression, the unknowns include the intercept (β0), the coefficients (β = (β1, …, βJ)) for the predictors (x), and the error variance (σ²). We seek the posterior distribution for these unknowns given the predictors and the outcomes (y).
We first develop the full conditional distributions for an approach where univariate prior distributions for the parameters are specified. We then develop the full conditional distributions for a more general version of the model, which illustrates certain features of Bayesian inference.
The posterior distribution is

p(β0, β, σ² | y, x) ∝ [∏_{i=1}^{n} p(yi | β0, β, σ², xi)] p(β0) [∏_{j=1}^{J} p(βj)] p(σ²),   (A.1)

where

β0 ~ N(μ_{β0}, σ²_{β0}),   (A.3)

βj ~ N(μ_{β}, σ²_{β}) for j = 1, …, J,   (A.4)

and p(σ²) is the prior distribution for the error variance.
The full conditional distribution for the intercept is

p(β0 | β1, …, βJ, σ², μ_{β0}, σ²_{β0}, y, x) ∝ p(y | β0, β1, …, βJ, σ², x) p(β0 | μ_{β0}, σ²_{β0}).   (A.6)

Working with the first term on the right-hand side, we can represent the conditional distribution of the outcomes as

y | β0, β1, …, βJ, σ², x ~ N(1β0 + x1β1 + ⋯ + xJβJ, σ²I),   (A.7)

where 1 is an (n × 1) vector of 1s, xj is the (n × 1) vector of values for individuals on the jth predictor, and I is an (n × n) identity matrix. In the full conditional distribution for β0, the coefficients β1, …, βJ are known. We may therefore rewrite (A.7) as

y − (x1β1 + ⋯ + xJβJ) | β0, σ², x ~ N(1β0, σ²I).   (A.8)

Using this as the first term on the right-hand side of (A.6) and the prior distribution in (A.3) as the second term, then by standard results of Bayes' theorem it can be shown that the full conditional distribution is (Lindley & Smith, 1972)
where

μ_{β0|β_1,…,β_J,σ²,μ_{β0},σ²_{β0},y,x} = (1/σ²_{β0} + n/σ²)^{-1} (μ_{β0}/σ²_{β0} + 1′[y − (x_1β_1 + ⋯ + x_Jβ_J)]/σ²)

and

σ²_{β0|β_1,…,β_J,σ²,μ_{β0},σ²_{β0},y,x} = (1/σ²_{β0} + n/σ²)^{-1}.
The full conditional distribution for β_1 is

p(β_1 | β_0, β_2, …, β_J, σ², μ_{β1}, σ²_{β1}, y, x) ∝ p(y | β_0, β, σ², x) p(β_1).  (A.10)

In the full conditional distribution for β_1, the coefficients β_0, β_2, …, β_J are known. We may therefore rewrite (A.7) as

[y − (1β_0 + x_2β_2 + ⋯ + x_Jβ_J)] | β_1, σ² ~ N(x_1β_1, σ²I).  (A.11)

Using this as the first term on the right-hand side of (A.10) and the prior distribution in (A.4) as the second term, then by standard results of Bayes' theorem it can be shown that the full conditional distribution is (Lindley & Smith, 1972)

β_1 | β_0, β_2, …, β_J, σ², μ_{β1}, σ²_{β1}, y, x ~ N(μ_{β1|β_0,β_2,…,β_J,σ²,μ_{β1},σ²_{β1},y,x}, σ²_{β1|β_0,β_2,…,β_J,σ²,μ_{β1},σ²_{β1},y,x}),  (A.12)

where

μ_{β1|β_0,β_2,…,β_J,σ²,μ_{β1},σ²_{β1},y,x} = (1/σ²_{β1} + x_1′x_1/σ²)^{-1} (μ_{β1}/σ²_{β1} + x_1′[y − (1β_0 + x_2β_2 + ⋯ + x_Jβ_J)]/σ²)

and

σ²_{β1|β_0,β_2,…,β_J,σ²,μ_{β1},σ²_{β1},y,x} = (1/σ²_{β1} + x_1′x_1/σ²)^{-1}.
The same process can be applied for each regression coefficient. To present all of them compactly, we work with a matrix representation of the regression model,

y | β_A, σ², x ~ N(x_Aβ_A, σ²I),  (A.13)

where β_A = (β_0, β_1, …, β_J)′ collects the intercept and coefficients and x_A is the (n × [J + 1]) design matrix formed by augmenting the predictors with a leading column of 1s. Letting β_{A(j)} and x_{A(j)} denote β_A and x_A with the element (column) for coefficient j removed, the full conditional distribution for each element of β_A is

β_j | β_{A(j)}, σ², μ_{βj}, σ²_{βj}, y, x ~ N(μ_{βj|β_{A(j)},σ²,μ_{βj},σ²_{βj},y,x}, σ²_{βj|β_{A(j)},σ²,μ_{βj},σ²_{βj},y,x}),  (A.14)
where

μ_{βj|β_{A(j)},σ²,μ_{βj},σ²_{βj},y,x} = (1/σ²_{βj} + x_j′x_j/σ²)^{-1} (μ_{βj}/σ²_{βj} + x_j′(y − x_{A(j)}β_{A(j)})/σ²)  (A.15)

and

σ²_{βj|β_{A(j)},σ²,μ_{βj},σ²_{βj},y,x} = (1/σ²_{βj} + x_j′x_j/σ²)^{-1}.  (A.16)
When j = 0, μ_{βj} and σ²_{βj} in (A.15) and (A.16) refer to the prior mean and variance of the intercept, which we had denoted μ_{β0} and σ²_{β0} in (A.3). When j ≥ 1, μ_{βj} and σ²_{βj} in (A.15) and (A.16) refer to the prior mean and variance of the coefficient for the jth predictor, which under exchangeability was denoted by μ_β and σ²_β in (A.4).
Turning to σ², the full conditional treats β_A as known, and the situation is akin to that of the full conditional distribution for the variance in a univariate normal distribution in (5.10) and (5.11). In the current context of regression, the full conditional distribution for σ² can be shown to be (Rowe, 2003)
σ² | β_A, ν_0, σ_0², y, x ~ Inv-Gamma((ν_0 + n)/2, (ν_0σ_0² + SS(E))/2),  (A.17)

where

SS(E) = (y − x_Aβ_A)′(y − x_Aβ_A).
Note the similarity between (A.17) and the full conditional for the variance in a model for a univariate normal distribution in (5.10) and (5.11). In that context, the sum of squares was taken with respect to the model-implied mean of the data. In the current context, each individual has their own model-implied mean, given by the regression model. Hence, the sum of squares is constructed as (y − x_Aβ_A)′(y − x_Aβ_A).
We now treat the intercept and coefficients jointly. The full conditional distribution for the augmented coefficient vector is

p(β_A | σ², y, x) ∝ p(y | β_A, σ², x) p(β_A).

The first term on the right-hand side is the conditional distribution of the data given by (A.7). We specify a multivariate normal prior for β_A,

β_A ~ N(μ_{βA}, Σ_{βA}),  (A.19)

where μ_{βA} is a ([J + 1] × 1) prior mean vector and Σ_{βA} is a ([J + 1] × [J + 1]) prior covariance matrix. To complete the model, we specify the inverse-gamma prior distribution for σ² as in (A.5).
It can be shown that the full conditional distribution for β_A is (Lindley & Smith, 1972)

β_A | σ², μ_{βA}, Σ_{βA}, y, x ~ N(μ_{βA|σ²,μ_{βA},Σ_{βA},y,x}, Σ_{βA|σ²,μ_{βA},Σ_{βA},y,x}),  (A.20)

where

μ_{βA|σ²,μ_{βA},Σ_{βA},y,x} = (Σ_{βA}^{-1} + x_A′x_A/σ²)^{-1} (Σ_{βA}^{-1}μ_{βA} + x_A′y/σ²)

and

Σ_{βA|σ²,μ_{βA},Σ_{βA},y,x} = (Σ_{βA}^{-1} + x_A′x_A/σ²)^{-1}.

Two features of this result warrant attention. First, the posterior mean may be rewritten as

μ_{βA|σ²,μ_{βA},Σ_{βA},y,x} = (Σ_{βA}^{-1} + x_A′x_A/σ²)^{-1} (Σ_{βA}^{-1}μ_{βA} + (x_A′x_A/σ²)β̂_A),
where

β̂_A = (x_A′x_A)^{-1} x_A′y

is the point estimate for the augmented regression coefficients from an ML or least-squares solution. The posterior mean μ_{βA|σ²,μ_{βA},Σ_{βA},y,x} is seen to be a precision-weighted average of the prior mean μ_{βA} and the point estimate based on the data alone, β̂_A, where the weights are given by the prior precision Σ_{βA}^{-1} (i.e., the inverse of the prior covariance matrix) and the precision in the data, x_A′x_A/σ² (i.e., the inverse of the sampling covariance matrix of β̂_A).
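As a numerical check of this precision-weighting identity, the following sketch computes the posterior mean both ways and confirms they agree; the simulated data, prior settings, and variable names are illustrative rather than taken from the text.

```python
import numpy as np

# Illustrative data: n = 50 cases, J = 2 predictors (all values are made up).
rng = np.random.default_rng(0)
n, J = 50, 2
xA = np.column_stack([np.ones(n), rng.normal(size=(n, J))])  # augmented design matrix
y = xA @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.8, size=n)

sigma2 = 0.64                            # error variance, treated as known here
mu_prior = np.zeros(J + 1)               # prior mean vector for the coefficients
Sigma_prior = np.eye(J + 1) * 10.0       # diagonal prior covariance matrix

prior_prec = np.linalg.inv(Sigma_prior)  # prior precision
data_prec = xA.T @ xA / sigma2           # precision in the data
beta_hat = np.linalg.solve(xA.T @ xA, xA.T @ y)  # ML/least-squares estimate

post_cov = np.linalg.inv(prior_prec + data_prec)
m1 = post_cov @ (prior_prec @ mu_prior + xA.T @ y / sigma2)     # direct form
m2 = post_cov @ (prior_prec @ mu_prior + data_prec @ beta_hat)  # weighted-average form
print(np.allclose(m1, m2))  # prints True
```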
Second, the model with univariate priors for the intercept and coefficients reflects an assumption of a priori independence among these parameters. In the current representation, this manifests as Σ_{βA} being diagonal. However, inspection of Σ_{βA|σ²,μ_{βA},Σ_{βA},y,x} reveals that its off-diagonal elements will not necessarily be 0: specifying a priori independence does not force the parameters to be independent in the posterior. If the data imply that they are dependent, the posterior distribution will reflect that.
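Cycling through the full conditionals (A.14) through (A.17) yields a Gibbs sampler for the regression model. The following is a minimal sketch on simulated data; all data-generating values, hyperparameters, and chain settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, J = 200, 2
xA = np.column_stack([np.ones(n), rng.normal(size=(n, J))])
beta_true = np.array([1.0, 2.0, -1.0])
y = xA @ beta_true + rng.normal(scale=0.5, size=n)

mu_b = np.zeros(J + 1)             # prior means for beta_0, ..., beta_J
s2_b = np.full(J + 1, 100.0)       # prior variances
nu0, s2_0 = 1.0, 1.0               # hyperparameters for sigma^2, as in (A.5)

beta, sigma2 = np.zeros(J + 1), 1.0
draws = []
for it in range(2000):
    # Update each coefficient from its normal full conditional, (A.14)-(A.16)
    for j in range(J + 1):
        xj = xA[:, j]
        partial = y - xA @ beta + xj * beta[j]   # y - x_A(j) beta_A(j)
        v = 1.0 / (1.0 / s2_b[j] + xj @ xj / sigma2)
        m = v * (mu_b[j] / s2_b[j] + xj @ partial / sigma2)
        beta[j] = rng.normal(m, np.sqrt(v))
    # Update the error variance from its inverse-gamma full conditional, (A.17)
    e = y - xA @ beta
    sigma2 = 1.0 / rng.gamma((nu0 + n) / 2.0, 2.0 / (nu0 * s2_0 + e @ e))
    if it >= 500:                                # discard burn-in draws
        draws.append(np.append(beta, sigma2))

post = np.mean(draws, axis=0)
print(post.round(2))  # posterior means of (beta_0, beta_1, beta_2, sigma^2)
```

The inverse-gamma draw is obtained by sampling the precision 1/σ² from the corresponding gamma distribution, the usual trick when only a gamma sampler is available.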
A.2 Classical Test Theory

In the classical test theory model, with observables x_ij for examinees i = 1, …, n and measures j = 1, …, J, the unknowns include the true scores T = (T_1, …, T_n), the mean and variance of the true scores (μ_T and σ²_T), and the error variance σ²_E. The posterior distribution is

p(T, μ_T, σ²_T, σ²_E | x) ∝ ∏_{i=1}^n ∏_{j=1}^J p(x_ij | T_i, σ²_E) ∏_{i=1}^n p(T_i | μ_T, σ²_T) p(μ_T) p(σ²_T) p(σ²_E),  (A.21)

where

x_ij | T_i, σ²_E ~ N(T_i, σ²_E) for i = 1, …, n, j = 1, …, J,  (A.22)

T_i | μ_T, σ²_T ~ N(μ_T, σ²_T) for i = 1, …, n,  (A.23)

μ_T ~ N(μ_{μT}, σ²_{μT}),  (A.24)

σ²_T ~ Inv-Gamma(ν_T/2, ν_Tσ²_{T0}/2),  (A.25)

and

σ²_E ~ Inv-Gamma(ν_E/2, ν_Eσ²_{E0}/2).  (A.26)
In the full conditional distribution for T_i, the parameters μ_T, σ²_T, and σ²_E are treated as known, which was the case discussed in Section 8.2, where we noted that this may be viewed as the situation where a sufficient statistic for the observables is normally distributed. Repeating (8.32),

x̄_i | T_i, σ²_E ~ N(T_i, σ²_E/J).

The prior for the unknown mean T_i is normal with hyperparameters μ_T and σ²_T that are treated as known here. This is therefore an instance of a posterior for the unknown mean of a normal distribution, treating the variance and hyperparameters as known. The full conditional is

T_i | μ_T, σ²_T, σ²_E, x_i ~ N(μ_{Ti|μ_T,σ²_T,σ²_E,x_i}, σ²_{Ti|μ_T,σ²_T,σ²_E,x_i}),  (A.27)
where

μ_{Ti|μ_T,σ²_T,σ²_E,x_i} = [(μ_T/σ²_T) + (Jx̄_i/σ²_E)] / [(1/σ²_T) + (J/σ²_E)]

and

σ²_{Ti|μ_T,σ²_T,σ²_E,x_i} = [(1/σ²_T) + (J/σ²_E)]^{-1}.
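As a small numerical illustration of the shrinkage these expressions induce, suppose (illustratively) J = 5, x̄_i = 80, μ_T = 70, σ²_T = 100, and σ²_E = 225; the posterior mean falls between the prior mean and the observed mean score:

```python
# Posterior mean and variance of T_i from the expressions above
mu_T, s2_T = 70.0, 100.0     # prior (population) mean and variance of true scores
s2_E = 225.0                 # error variance
J, xbar_i = 5, 80.0          # number of measures and examinee's mean score

post_var = 1.0 / (1.0 / s2_T + J / s2_E)
post_mean = post_var * (mu_T / s2_T + J * xbar_i / s2_E)
print(round(post_mean, 2), round(post_var, 2))  # prints 76.9 31.03
```

The posterior mean (about 76.9) is pulled from the observed mean of 80 toward the population mean of 70, and the posterior variance (about 31) is smaller than both the prior variance and the error variance of the mean score, σ²_E/J = 45.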
Turning to μ_T, the full conditional conditions on the true scores,

p(μ_T | T, σ²_T, μ_{μT}, σ²_{μT}) ∝ ∏_{i=1}^n p(T_i | μ_T, σ²_T) p(μ_T),

where the prior distribution p(μ_T) is given in (A.24). Again, this is a situation where variables (here, T) are normally distributed with a known variance (σ²_T) and unknown mean (μ_T), which is normally distributed with known hyperparameters (μ_{μT} and σ²_{μT}). The full conditional distribution for μ_T is therefore

μ_T | T, σ²_T, μ_{μT}, σ²_{μT} ~ N(μ_{μT|T,σ²_T,μ_{μT},σ²_{μT}}, σ²_{μT|T,σ²_T,μ_{μT},σ²_{μT}}),  (A.28)
where

μ_{μT|T,σ²_T,μ_{μT},σ²_{μT}} = [(μ_{μT}/σ²_{μT}) + (nT̄/σ²_T)] / [(1/σ²_{μT}) + (n/σ²_T)],

with T̄ denoting the mean of the true scores,
and

σ²_{μT|T,σ²_T,μ_{μT},σ²_{μT}} = [(1/σ²_{μT}) + (n/σ²_T)]^{-1}.
Given the true scores T, the variance of the true scores σ²_T is conditionally independent of the observables x and the error variance σ²_E. The full conditional for the variance of the true scores σ²_T can therefore be simplified as

p(σ²_T | T, μ_T, σ²_E, ν_T, σ²_{T0}, x) = p(σ²_T | T, μ_T, ν_T, σ²_{T0}) ∝ ∏_{i=1}^n p(T_i | μ_T, σ²_T) p(σ²_T | ν_T, σ²_{T0}),

where the prior distribution p(σ²_T) is given in (A.25). This situation is the same as was encountered in Section 4.2, where we have a set of variables (T) that are normally distributed with a known mean (μ_T) and unknown variance (σ²_T), which has an inverse-gamma prior distribution with known hyperparameters (ν_T and σ²_{T0}). The full conditional distribution for σ²_T is therefore
σ²_T | T, μ_T, ν_T, σ²_{T0} ~ Inv-Gamma((ν_T + n)/2, (ν_Tσ²_{T0} + SS(T))/2),  (A.29)

where

SS(T) = Σ_{i=1}^n (T_i − μ_T)².
Similarly, the full conditional for the error variance simplifies as

p(σ²_E | T, μ_T, σ²_T, ν_E, σ²_{E0}, x) = p(σ²_E | T, ν_E, σ²_{E0}, x) ∝ ∏_{i=1}^n p(x_i | T_i, σ²_E) p(σ²_E | ν_E, σ²_{E0}),
where the prior distribution for σ²_E is given in (A.26). We have for each examinee i a collection of variables (x_i) that is normally distributed with known mean (T_i) and unknown variance σ²_E that is constant across examinees and has an inverse-gamma prior distribution with known hyperparameters (ν_E and σ²_{E0}). This is actually a special case of the result derived for the error variance in regression in (A.17). There, the subject-specific mean was given by the regression function. Here, the examinee-specific mean is given by the true score for the examinee, akin to an intercept-only regression model. The full conditional distribution for σ²_E is therefore

σ²_E | T, ν_E, σ²_{E0}, x ~ Inv-Gamma((ν_E + nJ)/2, (ν_Eσ²_{E0} + SS(E))/2),  (A.30)
where

SS(E) = Σ_{i=1}^n Σ_{j=1}^J (x_ij − T_i)².
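The four full conditionals above can be cycled through in a Gibbs sampler for the CTT model. A minimal sketch on simulated data follows; all data-generating values and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, J = 100, 4
T_true = rng.normal(50.0, 8.0, size=n)
x = T_true[:, None] + rng.normal(0.0, 5.0, size=(n, J))   # x_ij = T_i + E_ij

mu_mu, s2_mu = 50.0, 100.0     # hyperparameters for mu_T, as in (A.24)
nu_T, s2_T0 = 1.0, 50.0        # hyperparameters for sigma^2_T, as in (A.25)
nu_E, s2_E0 = 1.0, 25.0        # hyperparameters for sigma^2_E, as in (A.26)

mu_T, s2_T, s2_E = 0.0, 50.0, 25.0
xbar = x.mean(axis=1)
keep = []
for it in range(2000):
    # T_i | rest, (A.27)
    v = 1.0 / (1.0 / s2_T + J / s2_E)
    T = rng.normal(v * (mu_T / s2_T + J * xbar / s2_E), np.sqrt(v))
    # mu_T | rest, (A.28)
    v = 1.0 / (1.0 / s2_mu + n / s2_T)
    mu_T = rng.normal(v * (mu_mu / s2_mu + n * T.mean() / s2_T), np.sqrt(v))
    # sigma^2_T | rest, (A.29)
    s2_T = 1.0 / rng.gamma((nu_T + n) / 2.0,
                           2.0 / (nu_T * s2_T0 + np.sum((T - mu_T) ** 2)))
    # sigma^2_E | rest, (A.30)
    s2_E = 1.0 / rng.gamma((nu_E + n * J) / 2.0,
                           2.0 / (nu_E * s2_E0 + np.sum((x - T[:, None]) ** 2)))
    if it >= 500:
        keep.append((mu_T, s2_T, s2_E))

post = np.mean(keep, axis=0)
print(post.round(1))  # posterior means of (mu_T, sigma^2_T, sigma^2_E)
```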
A.3 Confirmatory Factor Analysis

In confirmatory factor analysis with M latent variables, the unknowns include the latent variables in Θ, their means μ_θ and covariance matrix Φ, the intercepts in τ, the loadings in Λ, and the error variances in Ψ. The posterior distribution is

p(Θ, μ_θ, Φ, τ, Λ, Ψ | x) ∝ ∏_{i=1}^n ∏_{j=1}^J p(x_ij | θ_i, τ_j, λ_j, ψ_jj) ∏_{i=1}^n p(θ_i | μ_θ, Φ) p(μ_θ) p(Φ) ∏_{j=1}^J p(τ_j) ∏_{j=1}^J ∏_{m=1}^M p(λ_jm) ∏_{j=1}^J p(ψ_jj),  (A.31)

where

x_ij | θ_i, τ_j, λ_j, ψ_jj ~ N(τ_j + θ_i′λ_j, ψ_jj) for i = 1, …, n, j = 1, …, J,  (A.32)

θ_i | μ_θ, Φ ~ N(μ_θ, Φ) for i = 1, …, n,  (A.33)

μ_θm ~ N(μ_μ, σ²_μ) for m = 1, …, M,  (A.34)

Φ ~ Inv-Wishart(Φ_0, d),  (A.35)

τ_j ~ N(μ_τ, σ²_τ) for j = 1, …, J,  (A.36)

λ_jm ~ N(μ_λ, σ²_λ) for j = 1, …, J, m = 1, …, M,  (A.37)

and

ψ_jj ~ Inv-Gamma(ν/2, νψ_0/2).  (A.38)
Given the latent variables, the measurement model parameters for any one observable are conditionally independent of the values of the other observables and the measurement model parameters for those other observables. Suppressing the notation for the hyperparameters for the moment,

p(τ, Λ, Ψ | Θ, μ_θ, Φ, x) = p(τ, Λ, Ψ | Θ, x) = ∏_{j=1}^J p(τ_j, λ_j, ψ_jj | Θ, x_j),

and we may proceed with the measurement model parameters one observable at a time.
For the jth observable, let λ_jA = (τ_j, λ_j1, …, λ_jM)′ collect the intercept and loadings, let Θ_A denote the matrix of latent variables augmented with a leading column of 1s, let θ_m denote the mth column of Θ_A, and let A(m) denote removal of the mth element (column). The full conditional for each element is normal,

λ_jm | Θ, λ_{jA(m)}, ψ_jj, μ_{λjm}, σ²_{λjm}, x_j ~ N(μ_{λjm|Θ,λ_{jA(m)},ψ_jj,μ_{λjm},σ²_{λjm},x_j}, σ²_{λjm|Θ,λ_{jA(m)},ψ_jj,μ_{λjm},σ²_{λjm},x_j}),  (A.39)

where

μ_{λjm|Θ,λ_{jA(m)},ψ_jj,μ_{λjm},σ²_{λjm},x_j} = (1/σ²_{λjm} + θ_m′θ_m/ψ_jj)^{-1} (μ_{λjm}/σ²_{λjm} + θ_m′(x_j − Θ_{A(m)}λ_{jA(m)})/ψ_jj)

and

σ²_{λjm|Θ,λ_{jA(m)},ψ_jj,μ_{λjm},σ²_{λjm},x_j} = (1/σ²_{λjm} + θ_m′θ_m/ψ_jj)^{-1}.
Analogous to (A.17), the full conditional for the error variance is

ψ_jj | Θ, λ_jA, ν, ψ_0, x_j ~ Inv-Gamma((ν + n)/2, (νψ_0 + SS(E_j))/2),  (A.40)

where

SS(E_j) = (x_j − Θ_Aλ_jA)′(x_j − Θ_Aλ_jA).
Turning to the latent variable distribution parameters, the full conditional for the means is

p(μ_θ | Θ, Φ, μ_μ, Σ_μ) ∝ p(Θ | μ_θ, Φ) p(μ_θ | μ_μ, Σ_μ).  (A.41)

Equation (A.33) implies that the first term on the right-hand side is the multivariate normal conditional distribution of the treated-as-known latent variables given the unknown means and the treated-as-known covariance matrix. The second term is a multivariate normal prior distribution

μ_θ ~ N(μ_μ, Σ_μ).

This is a generalization of the prior in (A.34); the specification in (A.34) obtains when Σ_μ is diagonal. It can be shown (Lindley & Smith, 1972) that the full conditional distribution is

μ_θ | Θ, Φ, μ_μ, Σ_μ ~ N(μ_{μθ|Θ,Φ,μ_μ,Σ_μ}, Σ_{μθ|Θ,Φ,μ_μ,Σ_μ}),  (A.42)

where

μ_{μθ|Θ,Φ,μ_μ,Σ_μ} = (Σ_μ^{-1} + nΦ^{-1})^{-1} (Σ_μ^{-1}μ_μ + nΦ^{-1}θ̄),

Σ_{μθ|Θ,Φ,μ_μ,Σ_μ} = (Σ_μ^{-1} + nΦ^{-1})^{-1},

and θ̄ is the (M × 1) vector of means of the treated-as-known latent variables over the n examinees.
Given the latent variables, the full conditional distribution for the covariance matrix of the latent variables is independent of the observables and the measurement model parameters,

p(Φ | Θ, μ_θ, Φ_0, d) ∝ p(Θ | μ_θ, Φ) p(Φ | Φ_0, d).

Equation (A.33) implies that the first term on the right-hand side is the multivariate normal conditional distribution of the treated-as-known latent variables given the treated-as-known means and the unknown covariance matrix, as in (A.41). The second term is the inverse-Wishart prior distribution in (A.35). Treating the latent variables as known, this is a situation where variables (Θ) are normally distributed with known means (μ_θ) and an unknown covariance matrix. In this case, the inverse-Wishart prior distribution with known hyperparameters (Φ_0 and d) is conjugate, and by standard analyses (Rowe, 2003) the full conditional distribution is

Φ | Θ, μ_θ, Φ_0, d ~ Inv-Wishart([Φ_0^{-1} + nS_θ]^{-1}, d + n),  (A.43)

where

S_θ = (1/n) Σ_{i=1}^n (θ_i − μ_θ)(θ_i − μ_θ)′.
Given the measurement model and latent variable distribution parameters, the latent variables are conditionally independent across examinees:

p(Θ | μ_θ, Φ, τ, Λ, Ψ, x) = ∏_{i=1}^n p(θ_i | μ_θ, Φ, τ, Λ, Ψ, x_i) ∝ ∏_{i=1}^n p(x_i | θ_i, τ, Λ, Ψ) p(θ_i | μ_θ, Φ),

where x_i = (x_i1, …, x_iJ)′ is the collection of J observed values from examinee i. The second term on the right-hand side is the prior distribution for the examinee's latent variables, given by (A.33). The first term on the right-hand side is the conditional distribution of the observables for the examinee, given by (9.13) and restated here:

x_i | θ_i, τ, Λ, Ψ ~ N(τ + Λθ_i, Ψ).
Recognizing that τ is a constant in the full conditional distribution, we may express this as

(x_i − τ) | θ_i, Λ, Ψ ~ N(Λθ_i, Ψ),

which has a form amenable to standard Bayesian analysis (Lindley & Smith, 1972), such that the full conditional distribution is

θ_i | μ_θ, Φ, τ, Λ, Ψ, x_i ~ N(μ_{θi|μ_θ,Φ,τ,Λ,Ψ,x_i}, Σ_{θi|μ_θ,Φ,τ,Λ,Ψ,x_i}),  (A.44)

where

μ_{θi|μ_θ,Φ,τ,Λ,Ψ,x_i} = (Φ^{-1} + Λ′Ψ^{-1}Λ)^{-1} (Φ^{-1}μ_θ + Λ′Ψ^{-1}(x_i − τ))

and

Σ_{θi|μ_θ,Φ,τ,Λ,Ψ,x_i} = (Φ^{-1} + Λ′Ψ^{-1}Λ)^{-1}.
As in regression, a multivariate normal prior may instead be specified for the augmented vector λ_jA, λ_jA ~ N(μ_{λjA}, Σ_{λjA}), in which case the full conditional distribution is normal,

λ_jA | Θ, ψ_jj, μ_{λjA}, Σ_{λjA}, x_j ~ N(μ_{λjA|Θ,ψ_jj,μ_{λjA},Σ_{λjA},x_j}, Σ_{λjA|Θ,ψ_jj,μ_{λjA},Σ_{λjA},x_j}),  (A.46)

where

μ_{λjA|Θ,ψ_jj,μ_{λjA},Σ_{λjA},x_j} = (Σ_{λjA}^{-1} + Θ_A′Θ_A/ψ_jj)^{-1} (Σ_{λjA}^{-1}μ_{λjA} + Θ_A′x_j/ψ_jj),

Σ_{λjA|Θ,ψ_jj,μ_{λjA},Σ_{λjA},x_j} = (Σ_{λjA}^{-1} + Θ_A′Θ_A/ψ_jj)^{-1},

and x_j = (x_1j, …, x_nj)′ is the (n × 1) vector of observed values for observable j. The other full conditional distributions remain the same.
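To illustrate the full conditional in (A.44), the following sketch draws one examinee's latent variables given treated-as-known measurement model and latent variable distribution parameters; all numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
M, J = 2, 6
Lam = np.array([[1.0, 0.0], [0.8, 0.0], [0.6, 0.0],
                [0.0, 1.0], [0.0, 0.9], [0.0, 0.7]])   # loadings (simple structure)
tau = np.zeros(J)                                       # intercepts
Psi = np.diag(np.full(J, 0.5))                          # diagonal error covariance
mu_theta = np.zeros(M)                                  # latent means
Phi = np.array([[1.0, 0.3], [0.3, 1.0]])                # latent covariance matrix

x_i = np.array([1.2, 0.9, 0.8, -0.4, -0.5, -0.2])       # one examinee's observables

Phi_inv = np.linalg.inv(Phi)
Psi_inv = np.linalg.inv(Psi)
# (A.44): posterior precision combines prior precision and data precision
Sigma_post = np.linalg.inv(Phi_inv + Lam.T @ Psi_inv @ Lam)
mu_post = Sigma_post @ (Phi_inv @ mu_theta + Lam.T @ Psi_inv @ (x_i - tau))

theta_draw = rng.multivariate_normal(mu_post, Sigma_post)
print(mu_post.round(2), theta_draw.round(2))
```

With these values the posterior mean is positive on the first latent variable (its indicators are high) and negative on the second (its indicators are low), as expected.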
A.4 Item Response Theory

For the normal-ogive IRT model, we work with the latent response variable formulation. The posterior distribution is

p(θ, d, a, x* | x) ∝ ∏_{i=1}^n ∏_{j=1}^J p(x_ij | x*_ij) p(x*_ij | θ_i, d_j, a_j) ∏_{i=1}^n p(θ_i) ∏_{j=1}^J p(d_j) p(a_j),  (A.47)

where

x_ij | x*_ij = 1 if x*_ij ≥ 0 and x_ij | x*_ij = 0 if x*_ij < 0, for i = 1, …, n, j = 1, …, J,  (A.48)

x*_ij | θ_i, d_j, a_j ~ N(a_jθ_i + d_j, 1) for i = 1, …, n, j = 1, …, J,  (A.49)

θ_i ~ N(μ_θ, σ²_θ) for i = 1, …, n,  (A.50)

d_j ~ N(μ_d, σ²_d) for j = 1, …, J,  (A.51)

and

a_j ~ N⁺(μ_a, σ²_a) for j = 1, …, J,  (A.52)

where N⁺ denotes a normal distribution truncated below at 0, restricting the discrimination parameters to be positive.
The situation is analogous to that of CFA not only in conditional independence relationships, but in distributional forms as well, where the latent response variables x*_j play the role of the observables x_j in CFA, the item parameters ξ_j = (d_j, a_j)′ play the role of λ_jA in CFA, θ plays the role of Θ in CFA, and the error variances in CFA (i.e., the ψ_jj's) are equal to 1. The full conditional for ξ_j therefore takes the same form as that for λ_jA in (A.46), subject to the positivity restriction for each a_j. Reintroducing the hyperparameters into the notation,

ξ_j | θ, x*_j, μ_ξ, Σ_ξ ~ N(μ_{ξj|θ,x*_j,μ_ξ,Σ_ξ}, Σ_{ξj|θ,x*_j,μ_ξ,Σ_ξ}) I(a_j > 0),  (A.53)

where

μ_{ξj|θ,x*_j,μ_ξ,Σ_ξ} = (Σ_ξ^{-1} + θ_A′θ_A)^{-1} (Σ_ξ^{-1}μ_ξ + θ_A′x*_j),

Σ_{ξj|θ,x*_j,μ_ξ,Σ_ξ} = (Σ_ξ^{-1} + θ_A′θ_A)^{-1},

and θ_A denotes the latent variables augmented with a column of 1s.
Given the latent response variables and the item parameters, the full conditional for the latent variables does not depend on the observables:

p(θ | d, a, x*, μ_θ, σ²_θ, x) = p(θ | d, a, x*, μ_θ, σ²_θ) = ∏_{i=1}^n p(θ_i | d, a, x*_i, μ_θ, σ²_θ) ∝ ∏_{i=1}^n p(x*_i | θ_i, d, a) p(θ_i | μ_θ, σ²_θ),
where x*_i is the collection of J latent response values for examinee i. Equation (A.49) implies that the first term on the right-hand side is the multivariate normal conditional distribution of the latent response variables for the examinee. The second term on the right-hand side is the prior distribution for the examinee's latent variable, given by (A.50).

The similarity of the latent response variable formulation of IRT to CFA manifests itself here as well, and the full conditional distribution for the latent variables can be recognized as a simpler version of the case of latent variables in CFA. Connecting the notational elements, θ_i here plays the role of θ_i in CFA, d_j plays the role of τ_j in CFA, and a_j plays the role of λ_j in CFA. What was ψ_jj in CFA is now the error variance of the latent response variables, which is set to be 1. As the formulation of the model may be thought of as a CFA model with error variances fixed to 1, and as we are confining our treatment to the unidimensional model with fixed values of the parameters that govern the distribution of the latent variables, the full conditional for each examinee's latent variable is a somewhat simpler version than that developed for CFA in Section A.3:

θ_i | d, a, x*_i, μ_θ, σ²_θ ~ N(μ_{θi|d,a,x*_i,μ_θ,σ²_θ}, σ²_{θi|d,a,x*_i,μ_θ,σ²_θ}),  (A.54)
where

μ_{θi|d,a,x*_i,μ_θ,σ²_θ} = [(μ_θ/σ²_θ) + Σ_j a_j(x*_ij − d_j)] / [(1/σ²_θ) + Σ_j a_j²]

and

σ²_{θi|d,a,x*_i,μ_θ,σ²_θ} = [(1/σ²_θ) + Σ_j a_j²]^{-1}.
The full conditional for the latent response variables factors over examinees and observables:

p(x* | θ, d, a, x) = ∏_{i=1}^n ∏_{j=1}^J p(x*_ij | θ_i, d_j, a_j, x_ij) ∝ ∏_{i=1}^n ∏_{j=1}^J p(x_ij | x*_ij) p(x*_ij | θ_i, d_j, a_j).

The conditional relationship of x_ij given x*_ij is deterministic. Given x*_ij, x_ij is known with certainty, as formulated in (A.48): if x*_ij ≥ 0, then x_ij = 1; if x*_ij < 0, then x_ij = 0. Reversing the direction of this relationship, knowing x_ij completely determines the sign of x*_ij and nothing else: if x_ij = 1, then x*_ij ≥ 0; if x_ij = 0, then x*_ij < 0. The full conditional for x*_ij therefore follows the form of its prior specification p(x*_ij | θ_i, d_j, a_j), subject to a truncation dictated by x_ij:

x*_ij | θ_i, d_j, a_j, x_ij ~ N(a_jθ_i + d_j, 1) I(x*_ij ≥ 0) if x_ij = 1, and
x*_ij | θ_i, d_j, a_j, x_ij ~ N(a_jθ_i + d_j, 1) I(x*_ij < 0) if x_ij = 0.  (A.55)
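One cycle of these data-augmentation steps can be sketched as follows: the latent response variables are drawn from the truncated normals in (A.55) via the inverse-CDF method, and each θ_i is then drawn from (A.54). All data values and hyperparameters are illustrative, and the normal-CDF inverse is a simple bisection that is adequate for a sketch.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(4)

def norm_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def norm_ppf(p):
    # bisection inverse of the standard normal CDF (adequate for a sketch)
    lo, hi = -8.0, 8.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

n, J = 4, 3
a = np.array([1.0, 1.2, 0.8])        # discriminations (positive)
d = np.array([0.0, -0.5, 0.5])       # intercepts
x = np.array([[1, 1, 0],
              [0, 0, 0],
              [1, 0, 1],
              [1, 1, 1]])            # illustrative response patterns
theta = np.zeros(n)
mu_theta, s2_theta = 0.0, 1.0

# (A.55): draw x*_ij from N(a_j theta_i + d_j, 1), truncated according to x_ij
m = theta[:, None] * a + d
x_star = np.empty((n, J))
for i in range(n):
    for j in range(J):
        p_below = norm_cdf(-m[i, j])               # P(x*_ij < 0)
        u = rng.uniform()
        if x[i, j] == 1:                           # truncate to [0, infinity)
            x_star[i, j] = m[i, j] + norm_ppf(p_below + u * (1.0 - p_below))
        else:                                      # truncate to (-infinity, 0)
            x_star[i, j] = m[i, j] + norm_ppf(u * p_below)

# (A.54): draw each theta_i given the latent responses and item parameters
post_var = 1.0 / (1.0 / s2_theta + np.sum(a ** 2))
post_mean = post_var * (mu_theta / s2_theta + (x_star - d) @ a)
theta = rng.normal(post_mean, np.sqrt(post_var))
print(theta.round(2))
```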
A.5 Latent Class Analysis

In latent class analysis with C latent classes and observables with K response categories, the posterior distribution is

p(θ, γ, π | x) ∝ ∏_{i=1}^n [∏_{j=1}^J p(x_ij | θ_i, π_j)] p(θ_i | γ) p(γ) ∏_{c=1}^C ∏_{j=1}^J p(π_cj),  (A.56)

where

(x_ij | θ_i = c, π_j) ~ Categorical(π_cj) for i = 1, …, n, j = 1, …, J,  (A.57)

θ_i | γ ~ Categorical(γ) for i = 1, …, n,  (A.58)

γ ~ Dirichlet(α_γ),  (A.59)

and

π_cj ~ Dirichlet(α_c) for c = 1, …, C, j = 1, …, J.  (A.60)
Given the latent class memberships, the full conditional for the measurement model parameters factors over observables and classes:

p(π | θ, γ, α, x) = p(π | θ, α, x) = ∏_{j=1}^J p(π_j | θ, α, x_j) = ∏_{j=1}^J ∏_{c=1}^C p(π_cj | θ, α_c, x_j),

with

p(π_cj | θ, α_c, x_j) ∝ p(x_j | θ, π_cj) p(π_cj | α_c) = ∏_{i=1}^{r_c} p(x_ij | θ_i = c, π_cj) p(π_cj | α_c),

where r_c is the number of examinees in latent class c. This, along with conditioning on θ_i = c, expresses that we are restricting attention to examinees in class c. Each of the observables is categorically distributed (given membership in latent class c), as expressed in (A.57).
Recall that we conducted inference for the unknown parameter of repeated Bernoulli trials by modeling the counts of successes as a binomial. Similarly, we can formulate the model for the current situation in terms of counts. For each observable, for each latent class, we count the number of examinees who respond in each category of the observable. Let r_cjk denote the number of examinees in class c who have a value of k for observable j. Let r_cj = (r_cj1, …, r_cjK) denote the collection of these counts for class c and observable j, which is sufficient for inference about π_cj. We can re-express the full conditional distribution as

p(π_cj | θ, α_c, x_j) = p(π_cj | α_c, r_cj) ∝ p(r_cj | π_cj) p(π_cj | α_c).

The first term on the right-hand side is a multinomial and the second term is the Dirichlet prior. Accordingly, the full conditional distribution is a Dirichlet distribution,

π_cj | θ, α_c, x_j ~ Dirichlet(α_c1 + r_cj1, …, α_cK + r_cjK).  (A.61)
Turning to the latent class memberships, the full conditional for each examinee's latent variable is

p(θ_i | γ, π, x_i) ∝ p(x_i | θ_i, π) p(θ_i | γ).

The first term on the right-hand side is the conditional distribution of the observables for the examinee,

p(x_i | θ_i, π) = ∏_{j=1}^J p(x_ij | θ_i, π_j),  (A.62)

where p(x_ij | θ_i = c, π_j) is given by (A.57). The second term is the prior distribution for the examinee's latent variable, given by (A.58). This is an instance of Bayes' theorem for discrete variables. The posterior distribution is a categorical distribution,

θ_i | γ, π, x_i ~ Categorical(s_i),  (A.63)
where s_i = (s_i1, …, s_iC) and

s_ic = p(x_i | θ_i = c, π) p(θ_i = c | γ) / p(x_i)

     = ∏_{j=1}^J p(x_ij | θ_i = c, π_j) p(θ_i = c | γ) / [Σ_{g=1}^C ∏_{j=1}^J p(x_ij | θ_i = g, π_j) p(θ_i = g | γ)]

     = [∏_{j=1}^J ∏_{k=1}^K (π_cjk)^{I(x_ij = k)}] γ_c / [Σ_{g=1}^C ∏_{j=1}^J ∏_{k=1}^K (π_gjk)^{I(x_ij = k)} γ_g],

where I(x_ij = k) is an indicator that takes the value 1 when x_ij = k and 0 otherwise.
Finally, the full conditional for the class proportions is

p(γ | θ, α_γ) ∝ p(θ | γ) p(γ | α_γ).

The first term on the right-hand side is the distribution of the treated-as-known latent variables given the unknown latent class proportions, which for each examinee is a categorical distribution. Again, we can work with counts of the latent variables and employ a multinomial distribution. Let r_c denote the number of examinees for whom θ = c, and let r = (r_1, …, r_C) denote the collection of these counts. We can re-express the full conditional distribution as

p(γ | θ, α_γ) = p(γ | r, α_γ) ∝ p(r | γ) p(γ | α_γ),

where the first term on the right-hand side is a multinomial and the second term is the Dirichlet prior in (A.59). Accordingly, the full conditional distribution is a Dirichlet distribution,

γ | θ, α_γ ~ Dirichlet(α_γ1 + r_1, …, α_γC + r_C).  (A.64)
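One cycle of the resulting Gibbs sampler, drawing class memberships from (A.63) and then the Dirichlet updates (A.61) and (A.64), can be sketched as follows; the dimensions, simulated data, and prior hyperparameters are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n, J, C, K = 40, 3, 2, 2
alpha_gamma = np.ones(C)       # Dirichlet hyperparameters for gamma, (A.59)
alpha_c = np.ones(K)           # Dirichlet hyperparameters for each pi_cj, (A.60)

# Simulate illustrative data from a "true" two-class structure
true_pi = np.array([[[0.8, 0.2]] * J,
                    [[0.2, 0.8]] * J])   # true_pi[c, j, k]
true_class = rng.integers(0, C, size=n)
x = np.array([[rng.choice(K, p=true_pi[true_class[i], j]) for j in range(J)]
              for i in range(n)])

gamma = np.full(C, 1.0 / C)
pi = np.full((C, J, K), 1.0 / K)

# (A.63): draw each class membership from its categorical full conditional
like = np.ones((n, C))
for c in range(C):
    for j in range(J):
        like[:, c] *= pi[c, j, x[:, j]]
probs = like * gamma
probs /= probs.sum(axis=1, keepdims=True)
theta = np.array([rng.choice(C, p=probs[i]) for i in range(n)])

# (A.61): Dirichlet update for each class-by-observable probability vector
for c in range(C):
    for j in range(J):
        r_cj = np.bincount(x[theta == c, j], minlength=K)
        pi[c, j] = rng.dirichlet(alpha_c + r_cj)

# (A.64): Dirichlet update for the class proportions
r = np.bincount(theta, minlength=C)
gamma = rng.dirichlet(alpha_gamma + r)
print(gamma.round(2))
```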
For the special case of C = 2 latent classes and dichotomous observables, the posterior distribution is

p(θ, γ, π | x) ∝ ∏_{i=1}^n [∏_{j=1}^J p(x_ij | θ_i, π_j)] p(θ_i | γ) p(γ) ∏_{c=1}^2 ∏_{j=1}^J p(π_cj),  (A.65)

where

x_ij | θ_i = c, π_j ~ Bernoulli(π_cj) for i = 1, …, n, j = 1, …, J,  (A.66)

(θ_i − 1) | γ ~ Bernoulli(γ) for i = 1, …, n, so that γ is the probability of membership in class 2,  (A.67)

γ ~ Beta(α_γ, β_γ),  (A.68)

and

π_cj ~ Beta(α_c, β_c) for c = 1, 2, j = 1, …, J.  (A.69)
Paralleling the development above, the full conditional for each π_cj restricts attention to the r_c examinees in latent class c. Letting r_cj denote the number of those examinees with x_ij = 1,

p(π_cj | θ, α_c, β_c, x_j) = p(π_cj | α_c, β_c, r_cj) ∝ p(r_cj | π_cj) p(π_cj | α_c, β_c).

The first term on the right-hand side is a binomial and the second term is the beta prior in (A.69). Accordingly, the full conditional distribution is a beta distribution,

π_cj | θ, α_c, β_c, x_j ~ Beta(α_c + r_cj, β_c + r_c − r_cj).  (A.70)
Turning to the latent variables,

p(θ | γ, π, x) = ∏_{i=1}^n p(θ_i | γ, π, x_i) ∝ ∏_{i=1}^n p(x_i | θ_i, π) p(θ_i | γ).

The first term on the right-hand side is the conditional distribution of the observables for the examinee, given by (A.62), where p(x_ij | θ_i = c, π_j) is given by (A.66). The second term is the prior distribution for the examinee's latent variable, given by (A.67). This is an instance of Bayes' theorem for discrete variables. The posterior distribution is a (shifted) Bernoulli distribution,

θ_i | γ, π, x_i ~ Bernoulli(s_i) + 1,  (A.71)
where s_i is the posterior probability that examinee i belongs to class 2,

s_i = p(x_i | θ_i = 2, π) p(θ_i = 2 | γ) / p(x_i)

    = [∏_{j=1}^J (π_2j)^{x_ij} (1 − π_2j)^{1−x_ij}] γ / {[∏_{j=1}^J (π_2j)^{x_ij} (1 − π_2j)^{1−x_ij}] γ + [∏_{j=1}^J (π_1j)^{x_ij} (1 − π_1j)^{1−x_ij}] (1 − γ)}.
Finally, for γ, the first term on the right-hand side is the distribution of the treated-as-known latent variables given the unknown probability of being in latent class 2, which for each examinee is a Bernoulli distribution. Again, we can work with counts of the latent variables and employ a binomial distribution. Let r denote the number of examinees in class 2. We can re-express the full conditional distribution as

p(γ | θ, α_γ, β_γ) = p(γ | r, α_γ, β_γ) ∝ p(r | γ) p(γ | α_γ, β_γ),

where the first term on the right-hand side is a binomial and the second term is the beta prior in (A.68). Accordingly, the full conditional distribution is a beta distribution,

γ | θ, α_γ, β_γ ~ Beta(α_γ + r, β_γ + n − r).  (A.72)
Appendix B

This appendix presents and briefly discusses features of some standard probability distributions, as an aid to understanding their use throughout the book. In each case, we present the notation, probability mass or density function, and summary quantities such as the mean, variance, and mode (where applicable). In addition, we discuss common usage of these distributions. We use the notation x for a random variable and x for a random vector, except in the cases of the Wishart and inverse-Wishart, where we use W and W^{-1} to stand for the random matrix, respectively. For deeper discussions of these and other distributions, see Casella and Berger (2008), Gelman et al. (2013), and Jackman (2009). See Lunn et al. (2013) for details on their implementations in WinBUGS and its variants.
B.1.2 Categorical
The categorical distribution is a generalization of the Bernoulli distribution to the case of K > 2 categories, typically assigned values 1, …, K. If x has a categorical distribution with parameter vector γ, denoted by x ~ Categorical(γ), it has probability mass function

p(x = k) = γ_k

with

γ = (γ_1, …, γ_K),
γ_k ∈ [0, 1] for k = 1, …, K, and
Σ_{k=1}^K γ_k = 1.
The categorical distribution may be thought of as a special case of the multinomial (which is described in more detail below) where the number of trials J = 1. The conjugate prior for γ is the Dirichlet distribution.
B.1.3 Binomial
A random variable x ∈ {0, 1, …, J} has a binomial distribution with parameter π, denoted by x ~ Binomial(π, J), if it has probability mass function

p(x) = [J! / (x!(J − x)!)] π^x (1 − π)^{J−x}

with

π ∈ [0, 1]
E(x) = Jπ
var(x) = Jπ(1 − π).

x is commonly viewed as the sum of J independent Bernoulli variables, each with success probability π. When J = 1, the binomial specializes to the Bernoulli distribution. The conjugate prior for π is the beta distribution.
B.1.4 Multinomial
The multinomial generalizes the binomial to the situation of more than two categories. If x = (x_1, …, x_K), where x_k ∈ {0, 1, …, J} for k = 1, …, K and Σ_{k=1}^K x_k = J, then x has a multinomial distribution with parameter vector γ = (γ_1, …, γ_K), denoted x ~ Multinomial(γ, J), if it has probability mass function

p(x) = [J! / (x_1! ⋯ x_K!)] γ_1^{x_1} ⋯ γ_K^{x_K}

with

E(x_k) = Jγ_k
var(x_k) = Jγ_k(1 − γ_k)
cov(x_k, x_k′) = −Jγ_kγ_k′.

The conjugate prior for γ is the Dirichlet distribution.
B.1.5 Poisson
A random variable x ∈ {0, 1, 2, …} has a Poisson distribution with parameter λ, denoted by x ~ Poisson(λ), if it has probability mass function

p(x) = (1/x!) λ^x exp(−λ)

with

λ > 0
E(x) = λ
var(x) = λ
mode(x) = ⌊λ⌋.
The conjugate prior for λ is the gamma distribution.
B.2.2 Beta
A random variable x ∈ [0, 1] has a beta density with parameters α > 0 and β > 0, denoted as x ~ Beta(α, β), if it has probability density function

p(x) = [Γ(α + β) / (Γ(α)Γ(β))] x^{α−1} (1 − x)^{β−1},

where Γ(t) is the gamma function, Γ(t) = ∫_0^∞ x^{t−1} e^{−x} dx, with:

E(x) = α/(α + β)
var(x) = αβ/[(α + β)²(α + β + 1)]
a unique mode at (α − 1)/(α + β − 2) when α, β > 1.

When α = β = 1, the beta specializes to the uniform distribution, that is, x ~ Uniform(0, 1). The beta distribution is the conjugate prior for the probability parameter in the Bernoulli and binomial distributions.
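As a quick illustration of this conjugacy: a Beta(α, β) prior combined with x successes in J Bernoulli trials yields a Beta(α + x, β + J − x) posterior. The numbers below are illustrative.

```python
alpha, beta = 2.0, 2.0    # illustrative prior
J, x = 10, 7              # 7 successes in 10 trials

post_a, post_b = alpha + x, beta + (J - x)
post_mean = post_a / (post_a + post_b)
post_mode = (post_a - 1) / (post_a + post_b - 2)
print(post_a, post_b, round(post_mean, 3), round(post_mode, 3))  # 9.0 5.0 0.643 0.667
```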
B.2.3 Dirichlet
A random vector x = (x_1, …, x_K) has the Dirichlet distribution with parameter vector α = (α_1, …, α_K), denoted x ~ Dirichlet(α), if it has probability density function

p(x) = [Γ(α_1 + ⋯ + α_K) / (Γ(α_1) ⋯ Γ(α_K))] x_1^{α_1−1} ⋯ x_K^{α_K−1}

with

x_k ≥ 0 for k = 1, …, K and Σ_{k=1}^K x_k = 1
α = (α_1, …, α_K) with α_k > 0 for k = 1, …, K.

Letting S = Σ_{k=1}^K α_k,

E(x_k) = α_k/S
var(x_k) = α_k(S − α_k)/[S²(S + 1)]
cov(x_k, x_k′) = −α_kα_k′/[S²(S + 1)].

The Dirichlet distribution may be seen as a multivariate generalization of the beta distribution, or the beta may be seen as a special case of the Dirichlet where K = 2. That is, specifying x ~ Beta(a, b) is equivalent to specifying (x, 1 − x) ~ Dirichlet(a, b). The Dirichlet distribution is the conjugate prior for the probability parameter in categorical and multinomial distributions.
B.2.4 Normal
A random variable x has a normal distribution, denoted as x ~ N(μ, σ²), if it has probability density function

p(x) = (2πσ²)^{−1/2} exp[−(x − μ)²/(2σ²)]

with

σ² > 0
E(x) = mode(x) = μ
var(x) = σ².
The normal distribution is the conjugate prior for the mean of a normal distribution.
B.2.5 Lognormal
If log x ~ N(μ, σ²), then x has a lognormal distribution, denoted as x ~ log-N(μ, σ²), with probability density function

p(x) = [1/(x√(2πσ²))] exp[−(log(x) − μ)²/(2σ²)]

with

x > 0
σ² > 0
E(x) = exp(μ + σ²/2)
var(x) = exp(2μ + σ²)(exp(σ²) − 1)
mode(x) = exp(μ − σ²).
B.2.6 Logit-Normal
If logit(x) ~ N(μ, σ²), then x has a logit-normal distribution, denoted as x ~ logit-N(μ, σ²), with probability density function

p(x) = [1/(x(1 − x)√(2πσ²))] exp[−(logit(x) − μ)²/(2σ²)]

with

x ∈ (0, 1)
σ² > 0.
B.2.7 Multivariate Normal
If a random vector x = (x_1, …, x_K) has a multivariate normal distribution with parameters μ and positive-definite Σ, denoted as x ~ N(μ, Σ), then it has probability density function

p(x) = (2π)^{−K/2} |Σ|^{−1/2} exp[−(x − μ)′Σ^{−1}(x − μ)/2]

with

E(x) = μ
var(x) = Σ.

If x is partitioned as x = (x_1, x_2) with means (μ_1, μ_2) and covariance matrix blocks Σ_11, Σ_12, Σ_21, and Σ_22, then the marginal distribution of x_1 is N(μ_1, Σ_11), and the conditional distribution of x_1 given x_2 is

x_1 | x_2 ~ N(μ_1 + Σ_12Σ_22^{−1}(x_2 − μ_2), Σ_11 − Σ_12Σ_22^{−1}Σ_21).
The multivariate normal distribution is the conjugate prior for the mean vector of a multi-
variate normal distribution.
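The partitioning results can be computed directly; the following sketch evaluates the conditional mean and variance of x_1 given an observed x_2 for an illustrative bivariate case.

```python
import numpy as np

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x2 = 2.0   # observed value of the second component

cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2 - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1] * Sigma[1, 0] / Sigma[1, 1]
print(cond_mean, round(cond_var, 2))  # prints 0.6 1.64
```

Observing x_2 above its mean pulls the conditional mean of x_1 upward (positive covariance) and shrinks its variance from 2.0 to 1.64.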
B.2.8 Gamma
A random variable x has a gamma distribution, denoted as x ~ Gamma(α, β), if it has probability density function

p(x) = [β^α/Γ(α)] x^{α−1} exp(−βx)

with

x > 0
α > 0 and β > 0
E(x) = α/β
var(x) = α/β²
mode(x) = (α − 1)/β for α ≥ 1.
B.2.9 Inverse-Gamma
A random variable x has an inverse-gamma distribution with parameters α and β, denoted as x ~ Inv-Gamma(α, β), if it has probability density function

p(x) = [β^α/Γ(α)] x^{−(α+1)} exp(−β/x)

with

x > 0
α > 0 and β > 0
E(x) = β/(α − 1) if α > 1
var(x) = β²/[(α − 1)²(α − 2)] if α > 2
mode(x) = β/(α + 1).
B.2.10 Wishart
A random matrix W has a Wishart distribution, denoted W ~ Wishart(S, ν), if it has probability density function

p(W) = |W|^{(ν−J−1)/2} exp[−tr(S^{−1}W)/2] / [2^{νJ/2} |S|^{ν/2} Γ_J(ν/2)],

where

W is a (J × J) symmetric positive definite matrix
S is a (J × J) symmetric positive definite matrix
ν > 0 is the degrees of freedom
tr(·) is the trace operator
Γ_J is the multivariate gamma function, Γ_J(ν/2) = π^{J(J−1)/4} ∏_{j=1}^J Γ((ν + 1 − j)/2)
E(W) = νS.
A special case obtains when J = 1 and S = 1, in which case the Wishart reduces to a χ² density with ν degrees of freedom. If W ~ Wishart(S, ν), then its inverse has an inverse-Wishart distribution, W^{-1} ~ Inv-Wishart(S, ν). The Wishart generalizes the gamma and χ² distributions for precisions to the case of a precision matrix (i.e., the inverse of the covariance matrix), as commonly arises when working with multivariate normal distributions. The Wishart density is the conjugate prior for a precision matrix (inverse of the covariance matrix) of a multivariate normal distribution.
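For integer degrees of freedom, a Wishart(S, ν) draw can be constructed as the sum of ν outer products of N(0, S) vectors, which also gives a quick Monte Carlo check of E(W) = νS; the scale matrix and ν below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
S = np.array([[1.0, 0.4],
              [0.4, 2.0]])    # illustrative scale matrix
nu, draws = 5, 20000

# Each draw: W = sum of nu outer products z z' with z ~ N(0, S)
z = rng.multivariate_normal(np.zeros(2), S, size=(draws, nu))  # (draws, nu, 2)
W = np.einsum('dni,dnj->dij', z, z)                            # one Wishart draw per d
W_bar = W.mean(axis=0)
print(np.round(W_bar, 1))  # close to nu * S = [[5, 2], [2, 10]]
```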
B.2.11 Inverse-Wishart
A random matrix W has an inverse-Wishart distribution, denoted by W ~ Inv-Wishart(S, ν), if it has probability density function

p(W) = |W|^{−(ν+J+1)/2} exp[−tr(S^{−1}W^{−1})/2] / [2^{νJ/2} |S|^{ν/2} Γ_J(ν/2)],

where

W is a (J × J) symmetric positive definite matrix
S is a (J × J) symmetric positive definite matrix
ν > 0 is the degrees of freedom
tr(·) is the trace operator
Γ_J is the multivariate gamma function, Γ_J(ν/2) = π^{J(J−1)/4} ∏_{j=1}^J Γ((ν + 1 − j)/2)
E(W) = (ν − J − 1)^{−1} S^{−1} for ν > J + 1.
Adams, R. J., Wilson, M., & Wang, W. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1–23.
AERA, APA, & NCME. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Agresti, A., & Coull, B. A. (1998). Approximate is better than "exact" for interval estimation of binomial proportions. The American Statistician, 52, 119–126.
Aiken, L. S., West, S. G., & Millsap, R. E. (2008). Doctoral training in statistics, measurement, and methodology in psychology: Replication and extension of Aiken, West, Sechrest, and Reno's (1990) survey of PhD programs in North America. American Psychologist, 63, 32–50.
Aitkin, M., & Aitkin, I. (2005). Bayesian inference for factor scores. In A. Maydeu-Olivares & J. J. McArdle (Eds.), Contemporary psychometrics: A Festschrift to Roderick P. McDonald (pp. 207–222). Mahwah, NJ: Lawrence Erlbaum Associates.
Albert, J., & Chib, S. (1995). Bayesian residual analysis for binary response regression models. Biometrika, 82, 747–759.
Albert, J. H. (1992). Bayesian estimation of normal ogive item response curves using Gibbs sampling. Journal of Educational and Behavioral Statistics, 17, 251–269.
Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88, 669.
Almond, R. G. (2010). I can name that Bayesian network in two matrixes. International Journal of Approximate Reasoning, 51, 167–178.
Almond, R. G. (2013). RNetica: R interface to Netica(R) Bayesian Network Engine (Version 0.3-4). Retrieved from http://ralmond.net/RNetica.
Almond, R. G., DiBello, L. V., Jenkins, F., Senturk, D., Mislevy, R. J., Steinberg, L. S., & Yan, D. (2001). Models for conditional probability tables in educational assessment. In Artificial Intelligence and Statistics 2001: Proceedings of the 8th International Workshop (pp. 137–143). San Francisco, CA: Morgan Kaufmann.
Almond, R. G., DiBello, L. V., Moulder, B., & Zapata-Rivera, J.-D. (2007). Modeling diagnostic assessments with Bayesian networks. Journal of Educational Measurement, 44, 341–359.
Almond, R. G., & Mislevy, R. J. (1999). Graphical models and computerized adaptive testing. Applied Psychological Measurement, 23, 223–237.
Almond, R. G., Mislevy, R. J., Steinberg, L. S., Yan, D., & Williamson, D. M. (2015). Bayesian networks in educational assessment. New York: Springer.
Almond, R. G., Mulder, J., Hemat, L. A., & Yan, D. (2009). Bayesian network models for local dependence among observable outcome variables. Journal of Educational and Behavioral Statistics, 34, 491–521.
Almond, R. G., Steinberg, L. S., & Mislevy, R. J. (2002). Enhancing the design and delivery of assessment systems: A four-process architecture. The Journal of Technology, Learning and Assessment, 1. Retrieved from https://ejournals.bc.edu/ojs/index.php/jtla/article/view/1671.
Andrews, M., & Baguley, T. (2013). Prior approval: The growth of Bayesian methods in psychology. British Journal of Mathematical and Statistical Psychology, 66, 1–7.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–573.
Ansari, A., & Jedidi, K. (2000). Bayesian factor analysis for multilevel binary observations. Psychometrika, 65, 475–496.
Ansari, A., Jedidi, K., & Dubé, L. (2002). Heterogeneous factor analysis models: A Bayesian approach. Psychometrika, 67, 49–77.
Arbuckle, J. L. (2007). AMOS 16.0 user's guide. Chicago, IL: SPSS Inc.
Arminger, G., & Muthén, B. O. (1998). A Bayesian approach to nonlinear latent variable models using the Gibbs sampler and the Metropolis-Hastings algorithm. Psychometrika, 63, 271–300.
Babcock, B. (2011). Estimating a noncompensatory IRT model using Metropolis within Gibbs sampling. Applied Psychological Measurement, 35, 317–329.
Bafumi, J., Gelman, A., Park, D. K., & Kaplan, N. (2005). Practical issues in implementing and understanding Bayesian ideal point estimation. Political Analysis, 13, 171–187.
Baldwin, P. (2011). A strategy for developing a common metric in item response theory when parameter posterior distributions are known. Journal of Educational Measurement, 48, 1–11.
Barnett, V. (1999). Comparative statistical inference (3rd ed.). Chichester, UK: Wiley.
Bartholomew, D. (1996). Comment on: Metaphor taken as math: Indeterminacy in the factor model. Multivariate Behavioral Research, 31, 551–554.
Bartholomew, D. J. (1981). Posterior analysis of the factor model. British Journal of Mathematical and Statistical Psychology, 34, 93–99.
Bartholomew, D. J., Knott, M.,& Moustaki, I. (2011). Latent variable models and factor analysis: A unified
approach (3rd ed.). Chichester, UK: Wiley.
Bayarri, M. J.,& Berger, J. O. (2000a). Asymptotic distribution of p values in composite null models:
Rejoinder. Journal of the American Statistical Association, 95, 1168.
Bayarri, M. J., & Berger, J. O. (2000b). P values for composite null models. Journal of the American
Statistical Association, 95, 1127.
Beaton, A. E. (1987). The NAEP 1983-1984 Technical Report. Princeton, NJ: ETS.
Béguin, A. A., & Glas, C. A. (2001). MCMC estimation and some model-fit analysis of multidimensional IRT models. Psychometrika, 66, 541–561.
Behrens, J. T., & DiCerbo, K. E. (2013). Technological implications for assessment ecosystems: Opportunities for digital technology to advance assessment. Princeton, NJ: The Gordon Commission on the Future of Assessment. Retrieved from http://researchnetwork.pearson.com/wp-content/uploads/behrens_dicerbo_technlogical_implications_assessments.pdf.
Behrens, J. T., & DiCerbo, K. E. (2014). Harnessing the currents of the digital ocean. In J. A. Larusson & B. White (Eds.), Learning analytics: From research to practice (pp. 39–60). New York: Springer.
Behrens, J. T., DiCerbo, K. E., Yel, N., & Levy, R. (2013). Exploratory data analysis. In Handbook of psychology, Volume 2: Research methods in psychology (2nd ed., pp. 33–64). Hoboken, NJ: Wiley.
Behrens, J. T., Mislevy, R. J., DiCerbo, K. E., & Levy, R. (2012). Evidence centered design for learning and assessment in the digital world. In M. Mayrath, J. Clarke-Midura, & D. H. Robinson (Eds.), Technology-based assessments for 21st century skills: Theoretical and practical implications from modern research (pp. 13–53). Charlotte, NC: Information Age Publishing.
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407–425.
Bentler, P. M. (2006). EQS 6 structural equations program manual. Encino, CA: Multivariate Software,
Inc.
Berger, J. (2006). The case for objective Bayesian analysis. Bayesian Analysis, 1, 385402.
Berger, J. O.,& Berry, D. A. (1988). Statistical analysis and the illusion of objectivity. American Scientist,
76, 159165.
Berkhof, J., Van Mechelen, I.,& Gelman, A. (2003). A Bayesian approach to the selection and testing
of mixture models. Statistica Sinica, 13, 423442.
Bernardo, J. M.,& Smith, A. F. M. (2000). Bayesian theory. Chichester, UK: Wiley.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal
Statistical Society Series B, 36, 192236.
Blackwell, D.,& Dubins, L. (1962). Merging of opinions with increasing information. The Annals of
Mathematical Statistics, 33, 882886.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two
or more nominal categories. Psychometrika, 37, 2951.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters:
Application of an EM algorithm. Psychometrika, 46, 443459.
Bock, R. D., Gibbons, R., & Muraki, E. (1988). Full-information item factor analysis. Applied
Psychological Measurement, 12, 261280.
Bock, R. D., & Lieberman, M. (1970). Fitting a response model for n dichotomously scored items. Psychometrika, 35, 179–197.
Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6, 431–444.
Bock, R. D., & Moustaki, I. (2007). Item response theory in a general framework. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics, Volume 26: Psychometrics (pp. 469–514). Amsterdam, the Netherlands: North-Holland/Elsevier.
Bolfarine, H., & Bazán, J. L. (2010). Bayesian estimation of the logistic positive exponent IRT model. Journal of Educational and Behavioral Statistics, 35, 693–713.
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
Bollen, K. A. (2002). Latent variables in psychology and the social sciences. Annual Review of Psychology, 53, 605–634.
Bollen, K. A., & Curran, P. J. (2006). Latent curve models: A structural equation perspective. Hoboken, NJ: Wiley.
Bollen, K. A., Ray, S., Zavisca, J., & Harden, J. J. (2012). A comparison of Bayes factor approximation methods including two new methods. Sociological Methods & Research, 41, 294–324.
Bolt, D. M., Cohen, A. S., & Wollack, J. A. (2001). A mixture item response model for multiple-choice data. Journal of Educational and Behavioral Statistics, 26, 381–409.
Bolt, D. M., Deng, S., & Lee, S. (2014). IRT model misspecification and measurement of growth in vertical scaling. Journal of Educational Measurement, 51, 141–162.
Bolt, D. M., & Lall, V. F. (2003). Estimation of compensatory and noncompensatory multidimensional item response models using Markov chain Monte Carlo. Applied Psychological Measurement, 27, 395–414.
Bolt, D. M., Wollack, J. A., & Suh, Y. (2012). Application of a multidimensional nested logit model to multiple-choice test items. Psychometrika, 77, 339–357.
Box, G. E. P. (1976). Science and statistics. Journal of the American Statistical Association, 71, 791–799.
Box, G. E. P. (1979). Some problems of statistics and everyday life. Journal of the American Statistical Association, 74, 1–4.
Box, G. E. P. (1980). Sampling and Bayes inference in scientific modelling and robustness. Journal of the Royal Statistical Society. Series A (General), 143, 383.
Box, G. E. P. (1983). An apology for ecumenism in statistics. In G. E. P. Box, T. Leonard, & D. F. J. Wu (Eds.), Scientific inference, data analysis, and robustness (pp. 51–84). New York: Academic Press.
Box, G. E. P., & Draper, N. R. (1987). Empirical model-building and response surfaces. New York: Wiley.
Box, G. E. P., & Tiao, G. C. (1973). Bayesian inference in statistical analysis. Reading, MA: Addison-Wesley.
Bradlow, E. T., & Weiss, R. E. (2001). Outlier measures and norming methods for computerized adaptive tests. Journal of Educational and Behavioral Statistics, 26, 85–104.
Bradlow, E. T., Weiss, R. E., & Cho, M. (1998). Bayesian identification of outliers in computerized adaptive tests. Journal of the American Statistical Association, 93, 910–919.
Bradshaw, L., & Templin, J. (2014). Combining item response theory and diagnostic classification models: A psychometric model for scaling ability and diagnosing misconceptions. Psychometrika, 79, 403–425.
Briggs, D. C., & Wilson, M. (2007). Generalizability in item response modeling. Journal of Educational Measurement, 44, 131–155.
Brooks, S., Gelman, A., Jones, G., & Meng, X.-L. (Eds.). (2011). Handbook of Markov chain Monte Carlo. Boca Raton, FL: Chapman & Hall/CRC Press.
Brooks, S. P. (1998). Markov chain Monte Carlo method and its application. The Statistician, 47, 69–100.
Brooks, S. P., & Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7, 434–455.
Brown, T. A., & Moore, M. T. (2012). Confirmatory factor analysis. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (pp. 43–55). New York: Guilford Press.
Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). New York: Springer.
Cai, J.-H., & Song, X.-Y. (2010). Bayesian analysis of mixtures in structural equation models with nonignorable missing data. British Journal of Mathematical and Statistical Psychology, 63, 491–508.
Cai, L. (2010). Metropolis-Hastings Robbins-Monro algorithm for confirmatory item factor analysis. Journal of Educational and Behavioral Statistics, 35, 307–335.
Camilli, G. (2006). Test fairness. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 220–256). Westport, CT: Praeger.
Cao, J., Stokes, S. L., & Zhang, S. (2010). A Bayesian approach to ranking and rater evaluation: An application to grant reviews. Journal of Educational and Behavioral Statistics, 35, 194–214.
Carlin, B. P., & Chib, S. (1995). Bayesian model choice via Markov chain Monte Carlo methods. Journal of the Royal Statistical Society. Series B (Methodological), 57, 473–484.
Carlin, B. P., & Louis, T. A. (2008). Bayesian methods for data analysis (3rd ed.). Boca Raton, FL: Chapman & Hall/CRC Press.
Carroll, J. B. (1945). The effect of difficulty and chance success on correlations between items or between tests. Psychometrika, 10, 1–19.
Casella, G., & Berger, R. L. (2008). Statistical inference. Pacific Grove, CA: Thomson Press.
Casella, G., & George, E. I. (1992). Explaining the Gibbs sampler. The American Statistician, 46, 167–174.
Chang, Y.-W., Tsai, R.-C., & Hsu, N.-J. (2014). A speeded item response model: Leave the harder till later. Psychometrika, 79, 255–274.
Chen, W.-H., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22, 265–289.
Cheng, Y. (2009). When cognitive diagnosis meets computerized adaptive testing: CD-CAT. Psychometrika, 74, 619–632.
Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90, 1313–1321.
Chib, S., & Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. The American Statistician, 49, 327.
Chib, S., & Jeliazkov, I. (2001). Marginal likelihood from the Metropolis-Hastings output. Journal of the American Statistical Association, 96, 270–281.
Cho, S.-J., & Cohen, A. S. (2010). A multilevel mixture IRT model with an application to DIF. Journal of Educational and Behavioral Statistics, 35, 336–370.
Cho, S.-J., Cohen, A. S., & Kim, S.-H. (2014). A mixture group bifactor model for binary responses. Structural Equation Modeling: A Multidisciplinary Journal, 21, 375–395.
Choi, I.-H., & Wilson, M. (2015). Multidimensional classification of examinees using the mixture random weights linear logistic test model. Educational and Psychological Measurement, 75, 78–101.
Choi, J., Levy, R., & Hancock, G. R. (2006). Markov chain Monte Carlo estimation method with covariance data for structural equation modeling. Presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
Chow, S. L., O'Leary, R., & Mengersen, K. (2009). Elicitation by design in ecology: Using expert opinion to inform priors for Bayesian statistical models. Ecology, 90, 265–277.
Chow, S.-M., Tang, N., Yuan, Y., Song, X., & Zhu, H. (2011). Bayesian estimation of semiparametric nonlinear dynamic factor analysis models using the Dirichlet process prior. British Journal of Mathematical and Statistical Psychology, 64, 69–106.
Chung, G., Delacruz, G. C., Dionne, G. B., & Bewley, W. L. (2003). Linking assessment and instruction using ontologies. Proc. I/ITSEC, 25, 1811–1822.
Chung, G. K. W. K., Baker, E. L., Vendlinski, T. P., Buschang, R. E., Delacruz, G. C., Michiuye, J. K., & Bittick, S. J. (2010). Testing instructional design variations in a prototype math game. Presented at the annual meeting of the American Educational Research Association, Denver, CO.
Chung, H. (2003). Latent-class modeling with covariates. State College, PA: Pennsylvania State University.
Chung, H., & Anthony, J. C. (2013). A Bayesian approach to a multiple-group latent class-profile analysis: The timing of drinking onset and subsequent drinking behaviors among U.S. adolescents. Structural Equation Modeling: A Multidisciplinary Journal, 20, 658–680.
Chung, H., Flaherty, B. P., & Schafer, J. L. (2006). Latent class logistic regression: Application to marijuana use and attitudes among high school seniors. Journal of the Royal Statistical Society: Series A (Statistics in Society), 169, 723–743.
Chung, H., Lanza, S. T., & Loken, E. (2008). Latent transition analysis: Inference and estimation. Statistics in Medicine, 27, 1834–1854.
Chung, H., Loken, E., & Schafer, J. L. (2004). Difficulties in drawing inferences with finite-mixture models: A simple example with a simple solution. The American Statistician, 58, 152–158.
Chung, H., Walls, T. A., & Park, Y. (2007). A latent transition model with logistic regression. Psychometrika, 72, 413–435.
Chung, Y., Rabe-Hesketh, S., Dorie, V., Gelman, A., & Liu, J. (2013). A nondegenerate penalized likelihood estimator for variance parameters in multilevel models. Psychometrika, 78, 685–709.
Clark, J. S. (2005). Why environmental scientists are becoming Bayesians: Modelling with Bayes. Ecology Letters, 8, 2–14.
Clauser, B. E., Margolis, M. J., & Case, S. M. (2006). Testing for licensure in the professions. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 701–731). Westport, CT: Praeger.
Clinton, J., Jackman, S., & Rivers, D. (2004). The statistical analysis of roll call data. The American Political Science Review, 98, 355–370.
Clogg, C. C., Rubin, D. B., Schenker, N., Schultz, B., & Weidman, L. (1991). Multiple imputation of industry and occupation codes in census public-use samples using Bayesian logistic regression. Journal of the American Statistical Association, 86, 68–78.
Cohen, A. S., & Bolt, D. M. (2005). A mixture model analysis of differential item functioning. Journal of Educational Measurement, 42, 133–148.
Cohen, A. S., & Wollack, J. A. (2006). Test administration, security, scoring, and reporting. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 355–386). Westport, CT: Praeger.
Collins, J. A., Greer, J. E., & Huang, S. X. (1996). Adaptive assessment using granularity hierarchies and Bayesian nets. In C. Frasson, G. Gauthier, & A. Lesgold (Eds.), Intelligent tutoring systems (pp. 569–577). Berlin, Germany: Springer.
Collins, L. M., & Lanza, S. T. (2010). Latent class and latent transition analysis: With applications in the social, behavioral, and health sciences. Hoboken, NJ: Wiley.
Conati, C., Gertner, A. S., VanLehn, K., & Druzdzel, M. J. (1997). On-line student modeling for coached problem solving using Bayesian networks. In User modeling: Proceedings of the 6th International Conference, UM97 (pp. 231–242). Berlin, Germany: Springer.
Congdon, P. (2006). Bayesian statistical modelling (2nd ed.). Chichester, UK: Wiley.
Cowles, M. K., & Carlin, B. P. (1996). Markov chain Monte Carlo convergence diagnostics: A comparative review. Journal of the American Statistical Association, 91, 883–904.
Crawford, A. V. (2014). Posterior predictive model checking in Bayesian networks. Unpublished doctoral dissertation. Arizona State University, Tempe, AZ.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Cengage Learning.
Croon, M. (1990). Latent class analysis with ordered latent classes. British Journal of Mathematical and Statistical Psychology, 43, 171–192.
Curtis, S. M. (2010). BUGS code for item response theory. Journal of Statistical Software, 36, 1–34.
Curtis, S. M. (2015). mcmcplots: Create plots from MCMC output (Version 0.4.2) [Computer software].
Dahl, F. A. (2006). On the conservativeness of posterior predictive p-values. Statistics & Probability Letters, 76, 1170–1174.
Dai, Y. (2013). A mixture Rasch model with a covariate: A simulation study via Bayesian Markov chain Monte Carlo estimation. Applied Psychological Measurement, 37, 375–396.
Data Description, Inc. (2011). Data Desk 6.3. Ithaca, NY: Data Description.
Davey, T., Oshima, T. C., & Lee, K. (1996). Linking multidimensional item calibrations. Applied Psychological Measurement, 20, 405–416.
Dayton, C. M. (1999). Latent class scaling analysis. Thousand Oaks, CA: SAGE Publications.
Dayton, C. M., & Macready, G. B. (1976). A probabilistic model for validation of behavioral hierarchies. Psychometrika, 41, 189–204.
Dayton, C. M., & Macready, G. B. (2007). Latent class analysis in psychometrics. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics, Volume 26: Psychometrics (pp. 421–446). Amsterdam, the Netherlands: North-Holland/Elsevier.
De Ayala, R. J. (2009). The theory and practice of item response theory. New York: Guilford Press.
De Boeck, P., & Wilson, M. (Eds.). (2004). Explanatory item response models: A generalized linear and nonlinear approach. New York: Springer.
DeCarlo, L. T. (2011). On the analysis of fraction subtraction data: The DINA model, classification, latent class sizes, and the Q-matrix. Applied Psychological Measurement, 35, 8–26.
DeCarlo, L. T. (2012). Recognizing uncertainty in the Q-matrix via a Bayesian extension of the DINA model. Applied Psychological Measurement, 36, 447–468.
Deely, J. J., & Lindley, D. V. (1981). Bayes empirical Bayes. Journal of the American Statistical Association, 76, 833–841.
De Finetti, B. (1931). Funzione caratteristica di un fenomeno aleatorio. Atti della R. Accademia Nazionale dei Lincei, Serie 6. Memorie, Classe di Scienze Fisiche, Matematiche e Naturali, 4, 251–299.
De Finetti, B. (1937/1964). La prévision: Ses lois logiques, ses sources subjectives. In Annales de l'Institut Henri Poincaré 7 (pp. 1–68). Translated by Kyburg and Smokler, eds. (1964). Studies in subjective probability (pp. 93–158). New York: Wiley.
De Finetti, B. (1974). Theory of probability, Volume 1. New York: Wiley.
De la Torre, J. (2009). Improving the quality of ability estimates through multidimensional scoring and incorporation of ancillary variables. Applied Psychological Measurement, 33, 465–485.
De la Torre, J., & Douglas, J. A. (2004). Higher-order latent trait models for cognitive diagnosis. Psychometrika, 69, 333–353.
De la Torre, J., & Douglas, J. A. (2008). Model evaluation and multiple strategies in cognitive diagnosis: An analysis of fraction subtraction data. Psychometrika, 73, 595–624.
De la Torre, J., & Patz, R. J. (2005). Making the most of what we have: A practical application of multidimensional item response theory in test scoring. Journal of Educational and Behavioral Statistics, 30, 295–311.
De la Torre, J., & Song, H. (2009). Simultaneous estimation of overall and domain abilities: A higher-order IRT model approach. Applied Psychological Measurement, 33, 620–639.
De la Torre, J., Stark, S., & Chernyshenko, O. S. (2006). Markov chain Monte Carlo estimation of item parameters for the generalized graded unfolding model. Applied Psychological Measurement, 30, 216–232.
De Leeuw, C., & Klugkist, I. (2012). Augmenting data with published results in Bayesian linear regression. Multivariate Behavioral Research, 47, 369–391.
DeMark, S. F., & Behrens, J. T. (2004). Using statistical natural language processing for understanding complex responses to free-response tasks. International Journal of Testing, 4, 371–390.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39, 1–38.
Depaoli, S. (2012). The ability for posterior predictive checking to identify model misspecification in Bayesian growth mixture modeling. Structural Equation Modeling: A Multidisciplinary Journal, 19, 534–560.
Depaoli, S. (2013). Mixture class recovery in GMM under varying degrees of class separation: Frequentist versus Bayesian estimation. Psychological Methods, 18, 186–219.
Diaconis, P., & Freedman, D. (1980a). de Finetti's generalizations of exchangeability. In R. C. Jeffrey (Ed.), Studies in inductive logic and probability (Vol. 2, pp. 233–249). Berkeley, CA: University of California Press.
Diaconis, P., & Freedman, D. (1980b). Exchangeable sequences. The Annals of Probability, 8, 745–764.
Diaconis, P., & Freedman, D. (1986). On the consistency of Bayes estimates. The Annals of Statistics, 14, 1–26.
DiBello, L. V., Henson, R. A., & Stout, W. F. (2015). A family of generalized diagnostic classification models for multiple choice option-based scoring. Applied Psychological Measurement, 39, 62–79.
DiBello, L. V., Roussos, L., & Stout, W. (2007). Review of cognitively diagnostic assessment and a summary of psychometric models. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics, Volume 26: Psychometrics (pp. 979–1030). Amsterdam, the Netherlands: North-Holland/Elsevier.
Dicerbo, K. E., & Behrens, J. T. (2012). Implications of the digital ocean on current and future assessment. In R. W. Lissitz & H. Jiao (Eds.), Computers and their impact on state assessment: Recent history and predictions for the future (pp. 273–306). Charlotte, NC: Information Age Publishing.
Drasgow, F., Luecht, R. M., & Bennett, R. E. (2006). Technology and testing. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 471–515). Westport, CT: Praeger.
DuBois, P. H. (1970). A history of psychological testing. Needham Heights, MA: Allyn & Bacon.
Dudek, F. J. (1979). The continuing misinterpretation of the standard error of measurement. Psychological Bulletin, 86, 335–337.
Duncan, K. A., & MacEachern, S. N. (2008). Nonparametric Bayesian modelling for item response. Statistical Modelling, 8, 41–66.
Dunson, D. B. (2000). Bayesian latent variable models for clustered mixed outcomes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62, 355–366.
Dunson, D. B., Palomo, J., & Bollen, K. (2005). Bayesian structural equation modeling (No. 2005-5). Research Triangle Park, NC: Statistical and Applied Mathematical Sciences Institute.
Edwards, M. C. (2010). A Markov chain Monte Carlo approach to confirmatory item factor analysis. Psychometrika, 75, 474–497.
Edwards, M. C. (2013). Purple unicorns, true models, and other things I've never seen. Measurement: Interdisciplinary Research & Perspective, 11, 107–111.
Edwards, M. C., & Vevea, J. L. (2006). An empirical Bayes approach to subscore augmentation: How much strength can we borrow? Journal of Educational and Behavioral Statistics, 31, 241–259.
Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70, 193–242.
Efron, B., & Morris, C. (1977). Stein's paradox in statistics. Scientific American, 236, 119–127.
Embretson, S. E. (1997). Multicomponent response models. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 305–321). New York: Springer.
Embretson, S. (1984). A general latent trait model for response processes. Psychometrika, 49, 175–186.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Psychology Press.
Emons, W. H. M., Glas, C. A. W., Meijer, R. R., & Sijtsma, K. (2003). Person fit in order-restricted latent class models. Applied Psychological Measurement, 27, 459–478.
Enders, C. K. (2010). Applied missing data analysis. New York: Guilford Press.
Erosheva, E. A., & Curtis, S. M. (2011). Dealing with rotational invariance in Bayesian confirmatory factor analysis (Technical Report no. 589). Seattle, WA: University of Washington. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.300.8292&rep=rep1&type=pdf.
Evans, M. (1997). Bayesian inference procedures derived via the concept of relative surprise. Communications in Statistics–Theory and Methods, 26, 1125–1143.
Evans, M. (2000). Asymptotic distribution of p values in composite null models: Comment. Journal of the American Statistical Association, 95, 1160.
Evans, M. J., Gilula, Z., & Guttman, I. (1989). Latent class analysis of two-way contingency tables by Bayesian methods. Biometrika, 76, 557–563.
Fahrmeir, L., & Raach, A. (2007). A Bayesian semiparametric latent variable model for mixed responses. Psychometrika, 72, 327–346.
Fan, X., & Sivo, S. A. (2005). Sensitivity of fit indexes to misspecified structural or measurement model components: Rationale of two-index strategy revisited. Structural Equation Modeling: A Multidisciplinary Journal, 12, 343–367.
Fan, X., & Sivo, S. A. (2007). Sensitivity of fit indices to model misspecification and model types. Multivariate Behavioral Research, 42, 509–529.
Finch, W. H., & French, B. F. (2012). Parameter estimation with mixture item response theory models: A Monte Carlo comparison of maximum likelihood and Bayesian methods. Journal of Modern Applied Statistical Methods, 11, 167–178.
Formann, A. K. (1985). Constrained latent class models: Theory and applications. British Journal of Mathematical and Statistical Psychology, 38, 87–111.
Formann, A. K. (1992). Linear logistic latent class analysis for polytomous data. Journal of the American Statistical Association, 87, 476–486.
Fox, J.-P. (2003). Stochastic EM for estimating the parameters of a multilevel IRT model. British Journal of Mathematical and Statistical Psychology, 56, 65–81.
Fox, J.-P. (2005a). Multilevel IRT using dichotomous and polytomous response data. British Journal of Mathematical and Statistical Psychology, 58, 145–172.
Fox, J.-P. (2005b). Randomized item response theory models. Journal of Educational and Behavioral Statistics, 30, 189–212.
Fox, J.-P. (2010). Bayesian item response modeling: Theory and applications. Springer.
Fox, J.-P., Klein Entink, R., & Avetisyan, M. (2014). Compensatory and non-compensatory multidimensional randomized item response models. British Journal of Mathematical and Statistical Psychology, 67, 133–152.
Fox, J.-P., & Glas, C. A. (2001). Bayesian estimation of a multilevel IRT model using Gibbs sampling. Psychometrika, 66, 271–288.
Fox, J.-P., & Glas, C. A. (2003). Bayesian modeling of measurement error in predictor variables using item response theory. Psychometrika, 68, 169–191.
Frederickx, S., Tuerlinckx, F., De Boeck, P., & Magis, D. (2010). RIM: A random item mixture model to detect differential item functioning. Journal of Educational Measurement, 47, 432–457.
Freedman, D. A. (1987). As others see us: A case study in path analysis. Journal of Educational Statistics, 12, 101–128.
French, B. F., & Oakes, W. (2004). Reliability and validity evidence for the institutional integration scale. Educational and Psychological Measurement, 64, 88–98.
Fu, Z.-H., Tao, J., & Shi, N.-Z. (2009). Bayesian estimation in the multidimensional three-parameter logistic model. Journal of Statistical Computation and Simulation, 79, 819–835.
Fu, Z.-H., Tao, J., & Shi, N.-Z. (2010). Bayesian estimation of the multidimensional graded response model with nonignorable missing data. Journal of Statistical Computation and Simulation, 80, 1237–1252.
Fujimoto, K. A., & Karabatsos, G. (2014). Dependent Dirichlet process rating model. Applied Psychological Measurement, 38, 217–228.
Fukuhara, H., & Kamata, A. (2011). A bifactor multidimensional item response theory model for differential item functioning analysis on testlet-based items. Applied Psychological Measurement, 35, 604–622.
Gajewski, B. J., Price, L. R., Coffland, V., Boyle, D. K., & Bott, M. J. (2013). Integrated analysis of content and construct validity of psychometric instruments. Quality & Quantity, 47, 57–78.
Galindo-Garre, F., Vermunt, J. K., & Bergsma, W. P. (2004). Bayesian posterior estimation of logit parameters with small samples. Sociological Methods & Research, 33, 88–117.
Garrett, E. S., Eaton, W. W., & Zeger, S. (2002). Methods for evaluating the performance of diagnostic tests in the absence of a gold standard: A latent class model approach. Statistics in Medicine, 21, 1289–1307.
Garrett, E. S., & Zeger, S. L. (2000). Latent class model diagnosis. Biometrics, 56, 1055–1067.
Garthwaite, P. H., Kadane, J. B., & O'Hagan, A. (2005). Statistical methods for eliciting probability distributions. Journal of the American Statistical Association, 100, 680–701.
Geerlings, H., Glas, C. A., & van der Linden, W. J. (2011). Modeling rule-based item generation. Psychometrika, 76, 337–359.
Geisser, S., & Eddy, W. F. (1979). A predictive approach to model selection. Journal of the American Statistical Association, 74, 153–160.
Gelfand, A. E. (1996). Model determination using sampling based methods. In W. R. Gilks, S. Richardson, & D. J. Spiegelhalter (Eds.), Markov chain Monte Carlo in practice (pp. 145–161). London: Chapman & Hall/CRC Press.
Gilula, Z., & Haberman, S. J. (2001). Analysis of categorical response profiles by informative summaries. Sociological Methodology, 31, 129–187.
Glas, C. A. W., & Meijer, R. R. (2003). A Bayesian approach to person fit analysis in item response theory models. Applied Psychological Measurement, 27, 217–233.
Goldstein, H., & Browne, W. (2002). Multilevel factor analysis modelling using Markov chain Monte Carlo estimation. In G. A. Marcoulides & I. Moustaki (Eds.), Latent variable and latent structure models (pp. 225–243). London: Lawrence Erlbaum Associates.
Goldstein, H., & Browne, W. (2005). Multilevel factor analysis models for continuous and discrete data. In A. Maydeu-Olivares & J. J. McArdle (Eds.), Contemporary psychometrics: A Festschrift to Roderick P. McDonald (pp. 453–475). Mahwah, NJ: Lawrence Erlbaum Associates.
Goldstein, M. (1976). Bayesian analysis of regression problems. Biometrika, 63, 51–58.
Good, I. J. (1965). The estimation of probabilities: An essay on modern Bayesian methods. Cambridge, MA: MIT Press.
Good, I. J. (1971). 46656 varieties of Bayesians. American Statistician, 25, 62–63.
Goodman, L. A. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61, 215–231.
Goodman, S. (2008). A dirty dozen: Twelve p-value misconceptions. Seminars in Hematology, 45, 135–140.
Gorsuch, R. L. (1983). Factor analysis (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Guilford, J. P. (1936). Psychometric methods. New York: McGraw-Hill.
Gulliksen, H. (1961). Measurement of learning and mental abilities. Psychometrika, 26, 93–107.
Guttman, I. (1967). The use of the concept of a future observation in goodness-of-fit problems. Journal of the Royal Statistical Society. Series B (Methodological), 29, 83–100.
Guttman, L. (1947). On Festinger's evaluation of scale analysis. Psychological Bulletin, 44, 451–465.
Hacking, I. (1975). The emergence of probability. Cambridge: Cambridge University Press.
Haertel, E. H. (1990). Continuous and discrete latent structure models for item response data. Psychometrika, 55, 477–494.
Haertel, E. H. (2006). Reliability. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 65–110). Westport, CT: Praeger.
Hambleton, R. K. (1984). Determining suitable test lengths. In R. Berk (Ed.), A guide to criterion-referenced test construction (pp. 144–168). Baltimore, MD: The Johns Hopkins University Press.
Hambleton, R. K., & Han, N. (2005). Assessing the fit of IRT models to educational and psychological test data: A five-step plan and several graphical displays. In W. R. Lenderking & D. A. Revicki (Eds.), Advancing health outcomes research methods and clinical applications (pp. 57–77). Washington, DC: Degnon Associates.
Hambleton, R. K., & Jones, R. W. (1994). Item parameter estimation errors and their influence on test information functions. Applied Measurement in Education, 7, 171–186.
Hambleton, R. K., Jones, R. W., & Rogers, H. J. (1993). Influence of item parameter estimation errors in test development. Journal of Educational Measurement, 30, 143–155.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Hingham, MA: Springer.
Hambleton, R. K., Swaminathan, H., Cook, L. L., Eignor, D. R., & Gifford, J. A. (1978). Developments in latent trait theory: Models, technical issues, and applications. Review of Educational Research, 48, 467–510.
Han, C., & Carlin, B. P. (2001). Markov chain Monte Carlo methods for computing Bayes factors: A comparative review. Journal of the American Statistical Association, 96, 1122–1132.
Harring, J. R., Weiss, B. A., & Hsu, J.-C. (2012). A comparison of methods for estimating quadratic effects in nonlinear structural equation models. Psychological Methods, 17, 193–214.
Hartigan, J. A. (1969). Linear Bayesian methods. Journal of the Royal Statistical Society. Series B (Methodological), 31, 446–454.
Harwell, M. R., & Baker, F. B. (1991). The use of prior distributions in marginalized Bayesian item parameter estimation: A didactic. Applied Psychological Measurement, 15, 375–389.
Harwell, M. R., Baker, F. B., & Zwarts, M. (1988). Item parameter estimation via marginal maximum likelihood and an EM algorithm: A didactic. Journal of Educational Statistics, 13, 243–271.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109.
Hayashi, K., & Arav, M. (2006). Bayesian factor analysis when only a sample covariance matrix is available. Educational and Psychological Measurement, 66, 272–284.
Hayashi, K., & Yuan, K.-H. (2003). Robust Bayesian factor analysis. Structural Equation Modeling: A Multidisciplinary Journal, 10, 525–533.
Heinen, T. (1996). Latent class and discrete latent trait models: Similarities and differences. Thousand Oaks, CA: SAGE Publications.
Henson, R. A., Templin, J. L., & Willse, J. T. (2009). Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika, 74, 191–210.
Hewitt, E., & Savage, L. J. (1955). Symmetric measures on Cartesian products. Transactions of the American Mathematical Society, 80, 470–501.
Hjort, N. L., Dahl, F. A., & Steinbakk, G. H. (2006). Post-processing posterior predictive p values. Journal of the American Statistical Association, 101, 1157–1174.
Ho, M. R., Stark, S., & Chernyshenko, O. S. (2012). Graphical representation of structural equation models using path diagrams. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (pp. 43–55). New York: Guilford Press.
Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science, 382–401.
Hoijtink, H. (1998). Constrained latent class analysis using the Gibbs sampler and posterior predictive p-values: Applications to educational testing. Statistica Sinica, 8, 691–711.
Hoijtink, H., Béland, S., & Vermeulen, J. A. (2014). Cognitive diagnostic assessment via Bayesian evaluation of informative diagnostic hypotheses. Psychological Methods, 19, 21–38.
Hoijtink, H., & Molenaar, I. W. (1997). A multidimensional item response model: Constrained latent class analysis using the Gibbs sampler and posterior predictive checks. Psychometrika, 62, 171–189.
Holland, P. W. (1990). On the sampling theory foundations of item response theory models. Psychometrika, 55, 577–601.
Holland, P. W. (1994). Measurements or contests? Comments on Zwick, Bond and Allen/Donoghue. In Proceedings of the Social Statistics Section of the American Statistical Association (pp. 27–29). Alexandria, VA: American Statistical Association.
Hong, H., Wang, C., Lim, Y. S., & Douglas, J. (2015). Efficient models for cognitive diagnosis with continuous and mixed-type latent variables. Applied Psychological Measurement, 39, 31–43.
Hox, J., van de Schoot, R., & Matthijsse, S. (2012). How few countries will do? Comparative survey analysis from a Bayesian perspective. Survey Research Methods, 6, 87–93.
Hsieh, C.-A., von Eye, A., Maier, K., Hsieh, H.-J., & Chen, S.-H. (2013). A unified latent growth curve model. Structural Equation Modeling: A Multidisciplinary Journal, 20, 592–615.
Huang, H.-Y., & Wang, W.-C. (2014a). Multilevel higher-order item response theory models. Educational and Psychological Measurement, 74, 495–515.
Huang, H.-Y., & Wang, W.-C. (2014b). The random-effect DINA model. Journal of Educational Measurement, 51, 75–97.
Hughes, R. I. (1997). Models and representation. Philosophy of Science, 64, S325–S336.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6, 1–55.
Hulin, C. L., Lissak, R. I., & Drasgow, F. (1982). Recovery of two- and three-parameter logistic item characteristic curves: A Monte Carlo study. Applied Psychological Measurement, 6, 249–260.
IPCC. (2014). Climate change 2013: The physical science basis. Contribution of Working Group I to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change. Cambridge: Cambridge University Press.
Iseli, M. R., Koenig, A. D., Lee, J. J., & Wainess, R. (2010). Automated assessment of complex task performance in games and simulations (No. 775). Los Angeles, CA: University of California, National Center for Research on Evaluation, Standards, Student Testing (CRESST). Retrieved from https://www.cse.ucla.edu/products/reports/R775.pdf.
Jackman, S. (2001). Multidimensional analysis of roll call data via Bayesian simulation: Identification, estimation, inference, and model checking. Political Analysis, 9, 227–241.
Jackman, S. (2009). Bayesian analysis for the social sciences. Chichester, UK: Wiley.
Jackman, S. (2014). pscl: Classes and Methods for R developed in the Political Science Computational Laboratory, Stanford University (Version 1.4.6). Stanford, CA: Department of Political Science, Stanford University. Retrieved from http://pscl.stanford.edu/.
Jackson, P. H., Novick, M. R., & Thayer, D. T. (1971). Estimating regressions in m groups. British Journal of Mathematical and Statistical Psychology, 24, 129–153.
James, W., & Stein, C. (1961). Estimation with quadratic loss. In Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1, pp. 361–379). Berkeley and Los Angeles, CA: University of California Press.
Janssen, R., Tuerlinckx, F., Meulders, M., & De Boeck, P. (2000). A hierarchical IRT model for criterion-referenced measurement. Journal of Educational and Behavioral Statistics, 25, 285–306.
Jaynes, E. T. (1988). The relation of Bayesian and maximum entropy methods. In G. J. Erickson & C. R. Smith (Eds.), Maximum-entropy and Bayesian methods in science and engineering (Vol. 1, pp. 25–29). Dordrecht, the Netherlands: Kluwer.
Jaynes, E. T. (2003). Probability theory: The logic of science. (G. L. Bretthorst, Ed.). Cambridge: Cambridge University Press.
Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford, UK: Clarendon Press.
Jensen, F. V. (1996). Introduction to Bayesian networks. New York: Springer.
Jensen, F. V. (2001). Bayesian networks and decision graphs. New York: Springer.
Jiang, Y., Boyle, D. K., Bott, M. J., Wick, J. A., Yu, Q., & Gajewski, B. J. (2014). Expediting clinical and translational research via Bayesian instrument development. Applied Psychological Measurement, 38, 296–310.
Jiao, H., & Zhang, Y. (2015). Polytomous multilevel testlet models for testlet-based assessments with complex sampling designs. British Journal of Mathematical and Statistical Psychology, 68, 65–83.
Jin, K.-Y., & Wang, W.-C. (2014). Generalized IRT models for extreme response style. Educational and Psychological Measurement, 74, 116–138.
Johnson, M. S., & Jenkins, F. (2005). A Bayesian hierarchical model for large-scale educational surveys: An application to the National Assessment of Educational Progress (Research Report No. RR-04-38). Princeton, NJ: ETS. Retrieved from http://onlinelibrary.wiley.com/doi/10.1002/j.2333-8504.2004.tb01965.x/abstract.
Johnson, M. S., & Junker, B. W. (2003). Using data augmentation and Markov chain Monte Carlo for the estimation of unfolding response models. Journal of Educational and Behavioral Statistics, 28, 195–230.
Johnson, M. S., & Sinharay, S. (2005). Calibration of polytomous item families using Bayesian hierarchical modeling. Applied Psychological Measurement, 29, 369–400.
Johnson, V. E. (1996). On Bayesian analysis of multirater ordinal data: An application to automated essay grading. Journal of the American Statistical Association, 91, 42–51.
Johnson, V. E. (2007). Bayesian model assessment using pivotal quantities. Bayesian Analysis, 2, 719–733.
Jones, L. V., & Thissen, D. (2007). A history and overview of psychometrics. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics, Volume 26: Psychometrics (pp. 1–27). Amsterdam, the Netherlands: North-Holland/Elsevier.
Jones, W. P. (2014). Enhancing a short measure of big five personality traits with Bayesian scaling. Educational and Psychological Measurement, 74, 1049–1066.
Kadane, J. B. (2011). Principles of uncertainty. Boca Raton, FL: Chapman & Hall/CRC Press.
Kadane, J. B., & Wolfson, L. J. (1998). Experiences in elicitation. The Statistician, 47, 3–19.
Kamata, A., & Bauer, D. J. (2008). A note on the relation between factor analytic and item response theory models. Structural Equation Modeling: A Multidisciplinary Journal, 15, 136–153.
Kang, T., & Cohen, A. S. (2007). IRT model selection methods for dichotomous items. Applied Psychological Measurement, 31, 331–358.
Kang, T., Cohen, A. S., & Sung, H.-J. (2009). Model selection indices for polytomous items. Applied Psychological Measurement, 33, 499–518.
Kaplan, D. (2014). Bayesian statistics for the social sciences. New York: Guilford Press.
Kaplan, D., & Depaoli, S. (2012). Bayesian structural equation modeling. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (pp. 650–673). New York: Guilford Press.
Karabatsos, G., & Batchelder, W. H. (2003). Markov chain estimation for test theory without an answer key. Psychometrika, 68, 373–389.
Karabatsos, G., & Sheu, C.-F. (2004). Order-constrained Bayes inference for dichotomous models of unidimensional nonparametric IRT. Applied Psychological Measurement, 28, 110–125.
Karabatsos, G. (2016). Bayesian nonparametric IRT. In W. J. van der Linden (Ed.), Handbook of item response theory: Models, statistical tools, and applications, volume 1. New York: Chapman & Hall/CRC Press.
Karabatsos, G., & Walker, S. G. (2009). A Bayesian nonparametric approach to test equating. Psychometrika, 74, 211–232.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.
Kelley, T. L. (1923). Statistical method. New York: Macmillan.
Kelley, T. L. (1947). Fundamentals of statistics. Cambridge, MA: Harvard University Press.
Kim, S.-H. (2001). An evaluation of a Markov chain Monte Carlo method for the Rasch model. Applied Psychological Measurement, 25, 163–176.
Kim, S.-Y., Suh, Y., Kim, J.-S., Albanese, M. A., & Langer, M. M. (2013). Single and multiple ability estimation in the SEM framework: A noninformative Bayesian estimation approach. Multivariate Behavioral Research, 48, 563–591.
Klein, M. F., Birnbaum, M., Standiford, S. N., & Tatsuoka, K. K. (1981). Logical error analysis and construction of tests to diagnose student bugs in addition and subtraction of fractions (Research Report No. 81-6). Urbana, IL: University of Illinois, Computer-Based Education Research Laboratory.
Klein Entink, R. H., Fox, J.-P., & van der Linden, W. J. (2009). A multivariate multilevel approach to the modeling of accuracy and speed of test takers. Psychometrika, 74, 21–48.
Kline, R. B. (2010). Principles and practice of structural equation modeling (3rd ed.). New York: Guilford Press.
Koopman, R. F. (1978). On Bayesian estimation in unrestricted factor analysis. Psychometrika, 43, 109–110.
Kruschke, J. K. (2010). Doing Bayesian data analysis: A tutorial with R and BUGS. Burlington, MA: Academic Press.
Kruschke, J. K., Aguinis, H., & Joo, H. (2012). The time has come: Bayesian methods for data analysis in the organizational sciences. Organizational Research Methods, 15, 722–752.
Kyburg, H. E., & Smokler, H. E. (Eds.). (1964). Studies in subjective probability. New York: Wiley.
Lanza, S. T., Collins, L. M., Schafer, J. L., & Flaherty, B. P. (2005). Using data augmentation to obtain standard errors and conduct hypothesis tests in latent class and latent transition analysis. Psychological Methods, 10, 84–100.
Lauritzen, S. L., & Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society. Series B (Methodological), 50, 157–224.
Lazarsfeld, P. F., & Henry, N. W. (1968). Latent structure analysis. Boston, MA: Houghton Mifflin.
Lee, S. E., & Press, S. J. (1998). Robustness of Bayesian factor analysis estimates. Communications in Statistics: Theory and Methods, 27, 1871–1893.
Lee, S.-Y. (1981). A Bayesian approach to confirmatory factor analysis. Psychometrika, 46, 153–160.
Lee, S.-Y. (1992). Bayesian analysis of stochastic constraints in structural equation models. British Journal of Mathematical and Statistical Psychology, 45, 93–107.
Lee, S.-Y. (2006). Bayesian analysis of nonlinear structural equation models with nonignorable missing data. Psychometrika, 71, 541–564.
Lee, S.-Y. (2007). Structural equation modeling: A Bayesian approach. Chichester, UK: Wiley.
Lee, S.-Y., Lu, B., & Song, X.-Y. (2008). Semiparametric Bayesian analysis of structural equation models with fixed covariates. Statistics in Medicine, 27, 2341–2360.
Lee, S.-Y., & Song, X.-Y. (2003). Bayesian model selection for mixtures of structural equation models with an unknown number of components. British Journal of Mathematical and Statistical Psychology, 56, 145–165.
Lee, S.-Y., & Song, X.-Y. (2004). Evaluation of the Bayesian and maximum likelihood approaches in analyzing structural equation models with small sample sizes. Multivariate Behavioral Research, 39, 653–686.
Lee, S.-Y., Song, X.-Y., & Cai, J.-H. (2010). A Bayesian approach for nonlinear structural equation models with dichotomous variables using logit and probit links. Structural Equation Modeling: A Multidisciplinary Journal, 17, 280–302.
Lee, S.-Y., Song, X.-Y., & Tang, N.-S. (2007). Bayesian methods for analyzing structural equation models with covariates, interaction, and quadratic latent variables. Structural Equation Modeling: A Multidisciplinary Journal, 14, 404–434.
Lee, S.-Y., & Xia, Y.-M. (2008). A robust Bayesian approach for structural equation models with missing data. Psychometrika, 73, 343–364.
Lee, S.-Y., & Zhu, H.-T. (2000). Statistical analysis of nonlinear structural equation models with continuous and polytomous data. British Journal of Mathematical and Statistical Psychology, 53, 209–232.
Lee, S.-Y., & Zhu, H.-T. (2002). Maximum likelihood estimation of nonlinear structural equation models. Psychometrika, 67, 189–210.
Lei, P.-W., & Wu, Q. (2012). Estimation in structural equation modeling. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (pp. 164–180). New York: Guilford Press.
Levy, R. (2006). Posterior predictive model checking for multidimensionality in item response theory and Bayesian networks. Unpublished doctoral dissertation. University of Maryland, College Park, MD.
Levy, R. (2009). The rise of Markov chain Monte Carlo estimation for psychometric modeling. Journal of Probability and Statistics, 2009, 1–18.
Levy, R. (2011). Bayesian data-model fit assessment for structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 18, 663–685.
Levy, R. (2013). Psychometric and evidentiary advances, opportunities, and challenges for simulation-based assessment. Educational Assessment, 18, 182–207.
Levy, R. (2014). Dynamic Bayesian network modeling of game based diagnostic assessments (No. 837). Los Angeles, CA: University of California, National Center for Research on Evaluation, Standards, Student Testing (CRESST). Retrieved from http://www.cse.ucla.edu/products/reports/R837.pdf.
Levy, R., Behrens, J. T., & Mislevy, R. J. (2006). Variations in adaptive testing and their on-line leverage points. In D. D. Williams, S. L. Howell, & M. Hricko (Eds.), Online assessment, measurement and evaluation: Emerging practices (pp. 180–202). Hershey, PA: Information Science Publishing.
Levy, R., & Choi, J. (2013). Bayesian structural equation modeling. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling: A second course (2nd ed., pp. 563–623). Charlotte, NC: Information Age Publishing.
Levy, R., & Crawford, A. V. (2009). Incorporating substantive knowledge into regression via a Bayesian approach to modeling. Multiple Linear Regression Viewpoints, 35, 4–9.
Levy, R., & Hancock, G. R. (2007). A framework of statistical tests for comparing mean and covariance structure models. Multivariate Behavioral Research, 42, 33–66.
Levy, R., & Hancock, G. R. (2011). An extended model comparison framework for covariance and mean structure models, accommodating multiple groups and latent mixtures. Sociological Methods & Research, 40, 256–278.
Levy, R., & Mislevy, R. J. (2004). Specifying and refining a measurement model for a computer-based interactive assessment. International Journal of Testing, 4, 333–369.
Levy, R., Mislevy, R. J., & Behrens, J. T. (2011). MCMC in educational research. In S. Brooks, A. Gelman, G. L. Jones, & X.-L. Meng (Eds.), Handbook of Markov chain Monte Carlo: Methods and applications (pp. 531–545). London: Chapman & Hall/CRC Press.
Levy, R., Mislevy, R. J., & Sinharay, S. (2009). Posterior predictive model checking for multidimensionality in item response theory. Applied Psychological Measurement, 33, 519–537.
Levy, R., & Svetina, D. (2011). A generalized dimensionality discrepancy measure for dimensionality assessment in multidimensional item response theory. British Journal of Mathematical and Statistical Psychology, 64, 208–232.
Levy, R., Xu, Y., Yel, N., & Svetina, D. (2015). A standardized generalized dimensionality discrepancy measure and a standardized model-based covariance for dimensionality assessment for multidimensional models. Journal of Educational Measurement, 52, 144–158.
Lewis, C. (1986). Test theory and Psychometrika: The past twenty-five years. Psychometrika, 51, 11–22.
Lewis, C. (2007). Selected topics in classical test theory. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics, Volume 26: Psychometrics (pp. 29–43). Amsterdam, the Netherlands: North-Holland/Elsevier.
Lewis, C., & Sheehan, K. (1990). Using Bayesian decision theory to design a computerized mastery test. Applied Psychological Measurement, 14, 367–386.
Li, F., Cohen, A. S., Kim, S.-H., & Cho, S.-J. (2009). Model selection methods for mixture dichotomous IRT models. Applied Psychological Measurement, 33, 353–373.
Li, Y., Bolt, D. M., & Fu, J. (2006). A comparison of alternative models for testlets. Applied Psychological Measurement, 30, 3–21.
Lindley, D. V. (1970). A Bayesian solution for some educational prediction problems (No. RB-70-33). Princeton, NJ: ETS.
Lindley, D. V. (1971). The estimation of many parameters. In V. P. Godambe & D. A. Sprott (Eds.), Foundations of statistical inference (pp. 435–455). Toronto, Ontario, Canada: Holt, Rinehart & Winston.
Lindley, D. V., & Novick, M. R. (1981). The role of exchangeability in inference. The Annals of Statistics, 9, 45–58.
Lindley, D. V., & Phillips, L. D. (1976). Inference for a Bernoulli process (a Bayesian view). The American Statistician, 30, 112–119.
Lindley, D. V., & Smith, A. F. M. (1972). Bayes estimates for the linear model. Journal of the Royal Statistical Society. Series B, 34, 1–41.
Lindsay, B., Clogg, C. C., & Grego, J. (1991). Semiparametric estimation in the Rasch model and related exponential response models, including a simple latent class model for item analysis. Journal of the American Statistical Association, 86, 96–107.
Linzer, D. A., & Lewis, J. B. (2011). poLCA: An R package for polytomous variable latent class analysis. Journal of Statistical Software, 42, 1–29.
Liu, Y., Schulz, E. M., & Yu, L. (2008). Standard error estimation of 3PL IRT true score equating with an MCMC method. Journal of Educational and Behavioral Statistics, 33, 257–278.
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). New York: Wiley.
Loken, E. (2004). Multimodality in mixture models and factor analysis. In Applied Bayesian modeling and causal inference from incomplete-data perspectives: An essential journey with Donald Rubin's statistical family (pp. 203–213). Chichester, UK: Wiley.
Loken, E. (2005). Identification constraints and inference in factor models. Structural Equation Modeling: A Multidisciplinary Journal, 12, 232–244.
Loken, E., & Rulison, K. L. (2010). Estimation of a four-parameter item response theory model. British Journal of Mathematical and Statistical Psychology, 63, 509–525.
Longford, N. T. (1995). Model-based methods for analysis of data from 1990 NAEP trial state assessment (No. 95-696). Washington, DC: National Center for Education Statistics. Retrieved from http://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=95696.
Lopes, H. F., & West, M. (2004). Bayesian model assessment in factor analysis. Statistica Sinica, 14, 41–68.
Lord, F. M. (1974). Estimation of latent ability and item parameters when there are omitted responses. Psychometrika, 39, 247–264.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
Lord, F. M. (1986). Maximum likelihood and Bayesian parameter estimation in item response theory. Journal of Educational Measurement, 23, 157–162.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Lucke, J. F. (2005). The α and the ω of congeneric test theory: An extension of reliability and internal consistency to heterogeneous tests. Applied Psychological Measurement, 29, 65–81.
Lunn, D., Jackson, C., Best, N., Thomas, A., & Spiegelhalter, D. (2013). The BUGS book: A practical introduction to Bayesian analysis. Boca Raton, FL: Chapman & Hall/CRC Press.
Lunn, D., Spiegelhalter, D., Thomas, A., & Best, N. (2009). The BUGS project: Evolution, critique and future directions. Statistics in Medicine, 28, 3049–3067.
Lynch, S. M. (2007). Introduction to applied Bayesian statistics and estimation for social scientists. New York: Springer.
MacCallum, R. C. (2003). Working with imperfect models. Multivariate Behavioral Research, 38, 113–139.
Macready, G. B., & Dayton, C. M. (1992). The application of latent class models in adaptive testing. Psychometrika, 57, 71–88.
Maier, K. S. (2002). Modeling incomplete scaled questionnaire data with a partial credit hierarchical measurement model. Journal of Educational and Behavioral Statistics, 27, 271–289.
Maraun, M. D. (1996). Metaphor taken as math: Indeterminacy in the factor analysis model. Multivariate Behavioral Research, 31, 517–538.
Marianti, S., Fox, J.-P., Avetisyan, M., Veldkamp, B. P., & Tijmstra, J. (2014). Testing for aberrant behavior in response time modeling. Journal of Educational and Behavioral Statistics, 39, 426–451.
Marin, J.-M., & Robert, C. (2007). Bayesian core: A practical approach to computational Bayesian statistics. New York: Springer.
Maris, E. (1999). Estimating multiple classification latent class models. Psychometrika, 64, 187–212.
Maris, G., & Maris, E. (2002). A MCMC-method for models with continuous latent responses. Psychometrika, 67, 335–350.
Markman, A. B. (1999). Knowledge representation. Mahwah, NJ: Psychology Press.
Marshall, S. P. (1981). Sequential item selection: Optimal and heuristic policies. Journal of Mathematical Psychology, 23, 134–152.
Marsh, H. W., Hau, K.-T., & Wen, Z. (2004). In search of golden rules: Comment on hypothesis-testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing Hu and Bentler's (1999) findings. Structural Equation Modeling: A Multidisciplinary Journal, 11, 320–341.
Martin, A. D., Quinn, K. M., & Park, J. H. (2011). MCMCpack: Markov chain Monte Carlo in R. Journal of Statistical Software, 42, 1–21.
Martin, J. K., & McDonald, R. P. (1975). Bayesian estimation in unrestricted factor analysis: A treatment for Heywood cases. Psychometrika, 40, 505–517.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
Matson, J. (2011, September 26). Faster-than-light neutrinos? Physics luminaries voice doubts. Retrieved from http://www.scientificamerican.com/article/ftl-neutrinos/.
Mavridis, D., & Ntzoufras, I. (2014). Stochastic search item selection for factor analytic models. British Journal of Mathematical and Statistical Psychology, 67, 284–303.
Maydeu-Olivares, A. (2013). Goodness-of-fit assessment of item response theory models. Measurement: Interdisciplinary Research & Perspective, 11, 71–101.
Maydeu-Olivares, A., Drasgow, F., & Mead, A. D. (1994). Distinguishing among parametric item response models for polychotomous ordered data. Applied Psychological Measurement, 18, 245–256.
Mazzeo, J., Lazer, S., & Zieky, M. J. (2006). Monitoring educational progress with group-score assessments. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 681–699). Westport, CT: Praeger.
McCutcheon, A. L. (1987). Latent class analysis. Newbury Park, CA: SAGE Publications.
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum Associates.
McDonald, R. P. (2010). Structural models and the art of approximation. Perspectives on Psychological Science, 5, 675–686.
McGrayne, S. B. (2011). The theory that would not die: How Bayes' rule cracked the enigma code, hunted down Russian submarines, and emerged triumphant from two centuries of controversy. New Haven, CT: Yale University Press.
McLeod, L., Lewis, C., & Thissen, D. (2003). A Bayesian method for the detection of item preknowledge in computerized adaptive testing. Applied Psychological Measurement, 27, 121–137.
McManus, I. C. (2012). The misinterpretation of the standard error of measurement in medical education: A primer on the problems, pitfalls and peculiarities of the three different standard errors of measurement. Medical Teacher, 34, 569–576.
Meng, X.-L. (1994a). Multiple-imputation inferences with uncongenial sources of input. Statistical Science, 9, 538–558.
Meng, X.-L. (1994b). Posterior predictive p-values. The Annals of Statistics, 22, 1142–1160.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23, 13–23.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21, 1087–1092.
Meyer, J. P. (2010). A mixture Rasch model with item response time components. Applied Psychological Measurement, 34, 521–538.
Millsap, R. E. (2011). Statistical approaches to measurement invariance. New York: Routledge.
Mislevy, R. J. (1984). Estimating latent distributions. Psychometrika, 49, 359–381.
Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177–195.
Mislevy, R. J. (1991). Randomization-based inference about latent variables from complex samples. Psychometrika, 56, 177–196.
Mislevy, R. J. (1994). Evidence and inference in educational assessment. Psychometrika, 59, 439–483.
Mislevy, R. J. (1995). Probability-based inference in cognitive diagnosis. In P. Nichols & R. Brennan (Eds.), Cognitively diagnostic assessment (pp. 43–71). Hillsdale, NJ: Lawrence Erlbaum Associates.
Mislevy, R. J. (2006). Cognitive psychology and educational assessment. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 257–305). Westport, CT: Praeger.
Mislevy, R. J. (2008). How cognitive science challenges the educational measurement tradition. Measurement: Interdisciplinary Research & Perspective, 6, 1–24.
Mislevy, R. J. (2010). Some implications of expertise research for educational assessment. Research Papers in Education, 25, 253–270.
Mislevy, R. J. (2013). Evidence-centered design for simulation-based assessment. Military Medicine, 105, 107–114.
Mislevy, R. J. (2016). Missing responses in item response theory. In W. J. van der Linden (Ed.), Handbook of item response theory: Models, statistical tools, and applications, volume 2 (pp. 171–194). New York: Chapman & Hall/CRC Press.
Mislevy, R. J., Almond, R. G., & Lukas, J. F. (2004). A brief introduction to evidence-centered design (No. CSE Report 632). Los Angeles, CA: National Center for Research on Evaluation, Standards, and Student Testing (CRESST) Center for the Study of Evaluation (CSE). Retrieved from http://files.eric.ed.gov/fulltext/ED483399.pdf.
Mislevy, R., Almond, R., Yan, D., & Steinberg, L. S. (1999). Bayes nets in educational assessment: Where do the numbers come from? In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (pp. 437–446).
Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement, 29, 133–161.
Mislevy, R. J., Behrens, J. T., Bennett, R. E., Demark, S. F., Frezzo, D. C., Levy, R., … Winters, F. I. (2010). On the roles of external knowledge representations in assessment design. The Journal of Technology, Learning and Assessment, 8. Retrieved from http://napoleon.bc.edu/ojs/index.php/jtla/article/view/1621.
Mislevy, R. J., Behrens, J. T., Dicerbo, K. E., & Levy, R. (2012). Design and discovery in educational assessment: Evidence-centered design, psychometrics, and educational data mining. JEDM-Journal of Educational Data Mining, 4, 11–48.
Mislevy, R. J., & Gitomer, D. H. (1996). The role of probability-based inference in an intelligent tutoring system. User Modeling and User-Adapted Interaction, 5, 253–282.
Mislevy, R. J., Johnson, E. G., & Muraki, E. (1992). Scaling procedures in NAEP. Journal of Educational Statistics, 17, 131–154.
Mislevy, R. J., Levy, R., Kroopnick, M., & Rutstein, D. (2008). Evidentiary foundations of mixture item response theory models. In G. R. Hancock & K. M. Samuelsen (Eds.), Advances in latent variable mixture models (pp. 149–175). Charlotte, NC: Information Age Publishing.
Mislevy, R. J., & Riconscente, M. M. (2006). Evidence-centered assessment design. In S. Downing & T. Haladyna (Eds.), Handbook of test development (pp. 61–90). Mahwah, NJ: Lawrence Erlbaum Associates.
Mislevy, R. J., Senturk, D., Almond, R. G., Dibello, L. V., Jenkins, F., Steinberg, L. S., & Yan, D. (2002). Modeling conditional probabilities in complex educational assessments (No. CSE Technical Report 580). Los Angeles, CA: University of California, National Center for Research on Evaluation, Standards, Student Testing (CRESST). Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.322.4516&rep=rep1&type=pdf.
Mislevy, R. J., Sheehan, K. M., & Wingersky, M. (1993). How to equate tests with little or no data. Journal of Educational Measurement, 30, 55–78.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1, 3–62.
Mitchell, L. (2009). Examining the structural properties and competing models for the institutional integration scale. Unpublished thesis. Arizona State University, Tempe, AZ.
Molenaar, I. W., & Hoijtink, H. (1990). The many null distributions of person fit indices. Psychometrika, 55, 75–106.
Moore, C. (2009, February 7). Bayesian umpires. Retrieved from http://baseballanalysts.com/archives/2009/12/bayesian_umpire.php.
Morey, R. D., Romeijn, J.-W., & Rouder, J. N. (2013). The humble Bayesian: Model checking from a fully Bayesian perspective. British Journal of Mathematical and Statistical Psychology, 66, 68–75.
Mosteller, F., & Tukey, J. W. (1977). Data analysis and regression: A second course in statistics. Reading, MA: Pearson.
Mosteller, F., & Wallace, D. L. (1964). Inference and disputed authorship: The Federalist. Reading, MA: Addison-Wesley.
Moustaki, I., & Knott, M. (2000). Weighting for item non-response in attitude scales by using latent variable models with covariates. Journal of the Royal Statistical Society: Series A (Statistics in Society), 163, 445–459.
Mulaik, S. A. (2009). Linear causal modeling with structural equations. Boca Raton, FL: Chapman & Hall/CRC Press.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176.
Muthén, B., & Asparouhov, T. (2012). Bayesian structural equation modeling: A more flexible representation of substantive theory. Psychological Methods, 17, 313–335.
Muthén, B. O., & Muthén, L. K. (1998). Mplus user's guide (7th ed.). Los Angeles, CA: Muthén & Muthén.
Nadarajah, S., & Kotz, S. (2006). R programs for truncated distributions. Journal of Statistical Software, 16. Retrieved from http://www.jstatsoft.org/v16/c02.
Natesan, P., Limbers, C., & Varni, J. W. (2010). Bayesian estimation of graded response multilevel models using Gibbs sampling: Formulation and illustration. Educational and Psychological Measurement, 70, 420–439.
Naumann, A., Hochweber, J., & Hartig, J. (2014). Modeling instructional sensitivity using a longitudinal multilevel differential item functioning approach. Journal of Educational Measurement, 51, 381–399.
Neapolitan, R. E. (2004). Learning Bayesian networks. Upper Saddle River, NJ: Prentice Hall.
Norsys Software Corporation. (1999). Netica manual. Vancouver, BC: Author.
Novick, M. R. (1964). On Bayesian logical probability (No. RB-64-22). Retrieved from http://onlinelibrary.wiley.com/doi/10.1002/j.2333-8504.1964.tb00330.x/abstract.
Novick, M. R. (1969). Multiparameter Bayesian indifference procedures. Journal of the Royal Statistical Society. Series B, 31, 29–64.
Novick, M. R., & Jackson, P. (1974). Statistical methods for educational and psychological research. New York: McGraw-Hill.
Novick, M. R., Jackson, P. H., & Thayer, D. T. (1971). Bayesian inference and the classical test theory model: Reliability and true scores. Psychometrika, 36, 261–288.
Novick, M. R., Jackson, P. H., Thayer, D. T., & Cole, N. S. (1972). Estimating multiple regressions in m groups: A cross-validation study. British Journal of Mathematical and Statistical Psychology, 25, 33–50.
Novick, M. R., & Lewis, C. (1974). Prescribing test length for criterion-referenced measurement. In C. W. Harris, M. C. Alkin, & W. J. Popham (Eds.), Problems in criterion-referenced measurement (pp. 139–158). Los Angeles, CA: Center for the Study of Evaluation, University of California, Los Angeles.
Novick, M. R., Lewis, C., & Jackson, P. H. (1973). The estimation of proportions in m groups. Psychometrika, 38, 19–46.
Nye, C. D., & Drasgow, F. (2011). Assessing goodness of fit: Simple rules of thumb simply do not work. Organizational Research Methods, 14, 548–570.
O'Hagan, A. (1998). Eliciting expert beliefs in substantial practical applications. Journal of the Royal Statistical Society: Series D (The Statistician), 47, 21–35.
O'Hagan, A., Buck, C. E., Daneshkhah, A., Eiser, J. R., Garthwaite, P. H., Jenkinson, D. J., Oakley, J. E., & Rakow, T. (2006). Uncertain judgements: Eliciting experts' probabilities. London: Wiley.
O'Neill, B. (2009). Exchangeability, correlation, and Bayes' effect. International Statistical Review, 77, 241–250.
Oravecz, Z., Anders, R.,& Batchelder, W. H. (2013). Hierarchical Bayesian modeling for test theory
without an answer key. Psychometrika, 124.
Owen, R. J. (1969). Tailored testing (No. 6992). Princeton, NJ: ETS.
Owen, R. J. (1975). A Bayesian sequential procedure for quantal response in the context of adaptive
mental testing. Journal of the American Statistical Association, 70, 351356.
Pan, J.-C.,& Huang, G.-H. (2014). Bayesian inferences of latent class models with an unknown num-
ber of classes. Psychometrika, 79, 621646.
Pascarella, E. T.,& Terenzini, P. T. (1980). Predicting freshman persistence and voluntary dropout
decisions from a theoretical model. The Journal of Higher Education, 51, 6075.
Pastor, D. A.,& Gagn, P. (2013). Mean and covariance structure mixture models. In G. R. Hancock&
R. O. Mueller (Eds.), Structural equation modeling: A second course (2nd ed., pp. 343393).
Greenwich, CT: Information Age Publishing.
Patton, J. M., Cheng, Y., Yuan, K.-H., & Diao, Q. (2013). The influence of item calibration error on
variable-length computerized adaptive testing. Applied Psychological Measurement, 37, 2440.
Patz, R. J.,& Junker, B. W. (1999a). Applications and extensions of MCMC in IRT: Multiple item types,
missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24, 342366.
Patz, R. J.,& Junker, B. W. (1999b). A straightforward approach to Markov chain Monte Carlo meth-
ods for item response models. Journal of Educational and Behavioral Statistics, 24, 146178.
Patz, R. J., Junker, B. W., Johnson, M. S.,& Mariano, L. T. (2002). The hierarchical rater model for rated
test items and its application to large-scale educational assessment data. Journal of Educational
and Behavioral Statistics, 27, 341384.
Patz, R. J.,& Yao, L. (2007). Methods and models for vertical scaling. In N. J. Dorans, M. Pommerich,&
P. W. Holland (Eds.), Linking and aligning scores and scales (pp.253272). New York: Springer.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco,
CA: Morgan Kaufmann.
Pearl, J. (2009). Causality: Models, reasoning and inference (2nd ed.). Cambridge: Cambridge University
Press.
Phillips, S. E., & Camara, W. J. (2006). Legal and ethical issues. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 734–755). Westport, CT: Praeger.
Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing (pp. 20–22). Vienna, Austria: Technische Universität Wien.
Plummer, M. (2008). Penalized loss functions for Bayesian model comparison. Biostatistics, 9, 523–539.
Plummer, M. (2010). JAGS version 2.0.0 user manual. Lyon, France. Retrieved from http://www-fis.iarc.fr/~martyn/software/jags/.
Plummer, M., Best, N., Cowles, K., & Vines, K. (2006). CODA: Convergence diagnosis and output analysis for MCMC. R News, 6, 7–11.
Preacher, K. J. (2006). Quantifying parsimony in structural equation modeling. Multivariate Behavioral Research, 41, 227–259.
Press, S. J. (1989). Bayesian statistics: Principles, models, and applications. New York: Wiley.
Press, S. J., & Shigemasu, K. (1997). Bayesian inference in factor analysis (revised) (No. 243). Riverside, CA: University of California, Riverside. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.6968&rep=rep1&type=pdf.
Proctor, C. H. (1970). A probabilistic formulation and statistical analysis of Guttman scaling. Psychometrika, 35, 73–78.
Quellmalz, E., Timms, M., Buckley, B., Levy, R., Davenport, J., Loveland, M., & Silberglitt, M. (2012). 21st century dynamic assessment. In J. Clarke-Midura, M. Mayrath, & D. H. Robinson (Eds.), Technology-based assessments for 21st century skills: Theoretical and practical implications from modern research (pp. 55–90). Charlotte, NC: Information Age Publishing.
Rabe-Hesketh, S., Skrondal, A., & Pickles, A. (2004). GLLAMM manual (2nd ed.). U.C. Berkeley Division of Biostatistics Working Paper Series. Berkeley, CA: University of California, Berkeley.
Raftery, A. E. (1993). Bayesian model selection in structural equation models. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 163–180). Newbury Park, CA: SAGE Publications.
Ramsay, J. O. (1991). Kernel smoothing approaches to nonparametric item characteristic curve estimation. Psychometrika, 56, 611–630.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: Danish Institute for Educational Research.
Raudenbush, S. W. (1988). Educational applications of hierarchical linear models: A review. Journal of Educational Statistics, 13, 85–116.
Raudenbush, S. W., Fotiu, R. P., & Cheong, Y. F. (1999). Synthesizing results from the trial state assessment. Journal of Educational and Behavioral Statistics, 24, 413–438.
R Core Team. (2014). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org/.
Reckase, M. D. (2009). Multidimensional item response theory. New York: Springer.
Reese, W. J. (2013). Testing wars in the public schools: A forgotten history. Cambridge, MA: Harvard University Press.
Reye, J. (2004). Student modelling based on belief networks. International Journal of Artificial Intelligence in Education, 14, 63–96.
Richardson, S., & Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society. Series B (Methodological), 59, 731–792.
Rijmen, F. (2010). Formal relations and an empirical comparison among the bi-factor, the testlet, and a second-order multidimensional IRT model. Journal of Educational Measurement, 47, 361–372.
Rijmen, F., & De Boeck, P. (2002). The random weights linear logistic test model. Applied Psychological Measurement, 26, 271–285.
Robert, C., & Casella, G. (2011). A short history of Markov chain Monte Carlo: Subjective recollections from incomplete data. Statistical Science, 26, 102–115.
Roberts, G. O. (1996). Markov chain concepts related to sampling algorithms. In W. R. Gilks, S. Richardson, & D. J. Spiegelhalter (Eds.), Markov chain Monte Carlo in practice (pp. 45–57). London: Chapman & Hall/CRC Press.
Roberts, J. S., & Thompson, V. M. (2011). Marginal maximum a posteriori item parameter estimation for the generalized graded unfolding model. Applied Psychological Measurement, 35, 259–279.
Robins, J. M., van der Vaart, A., & Ventura, V. (2000). Asymptotic distribution of p values in composite null models. Journal of the American Statistical Association, 95, 1143–1156.
Rodgers, J. L. (2010). The epistemology of mathematical and statistical modeling: A quiet methodological revolution. American Psychologist, 65, 1–12.
Rodríguez, C. E., & Walker, S. G. (2014). Label switching in Bayesian mixture models: Deterministic relabeling strategies. Journal of Computational and Graphical Statistics, 23, 25–45.
Roussos, L. A., DiBello, L. V., Stout, W., Hartz, S. M., Henson, R. A., & Templin, J. L. (2007). The fusion model skills diagnosis system. In J. P. Leighton & M. J. Gierl (Eds.), Cognitive diagnostic assessment for education: Theory and applications (pp. 275–318). New York: Cambridge University Press.
Rowe, D. B. (2003). Multivariate Bayesian statistics: Models for source separation and signal unmixing. Boca Raton, FL: Chapman & Hall/CRC Press.
Rowe, J. P., & Lester, J. C. (2010). Modeling user knowledge with dynamic Bayesian networks in interactive narrative environments. In G. M. Youngblood & V. Bulitko (Eds.), AIIDE. Retrieved from http://aaai.org/ocs/index.php/AIIDE/AIIDE10/paper/view/2149.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.
Rubin, D. B. (1977). Formalizing subjective notions about the effect of nonrespondents in sample surveys. Journal of the American Statistical Association, 72, 538–543.
Rubin, D. B. (1978). Multiple imputations in sample surveys – a phenomenological Bayesian approach to nonresponse. In Proceedings of the Survey Research Methods Section of the American Statistical Association, 1, 20–34.
Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics, 12, 1151–1172.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.
Rubin, D. B. (1988). Using the SIR algorithm to simulate posterior distributions. In J. M. Bernardo, M. H. DeGroot, D. V. Lindley, & A. F. M. Smith (Eds.), Bayesian statistics 3 (pp. 395–402). Oxford, UK: Oxford University Press.
Rubin, D. B. (1996a). Comment: On posterior predictive p-values. Statistica Sinica, 6, 787–792.
Rubin, D. B. (1996b). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91, 473–489.
Rubin, D. B., & Stern, H. S. (1994). Testing in latent class models using a posterior predictive check distribution. In A. von Eye & C. C. Clogg (Eds.), Latent variable analysis: Applications for developmental research (pp. 420–438). Thousand Oaks, CA: SAGE Publications.
Rudner, L. M. (2009). Scoring and classifying examinees using measurement decision theory. Practical Assessment, Research & Evaluation, 14. Retrieved from http://pareonline.net/getvn.asp?v=14&n=8.
Rudner, L. M., & Liang, T. (2002). Automated essay scoring using Bayes' theorem. The Journal of Technology, Learning and Assessment, 1. Retrieved from http://napoleon.bc.edu/ojs/index.php/jtla/article/view/1668.
Rupp, A. A. (2002). Feature selection for choosing and assembling measurement models: A building-block-based organization. International Journal of Testing, 2, 311–360.
Rupp, A. A., Dey, D. K., & Zumbo, B. D. (2004). To Bayes or not to Bayes, from whether to when: Applications of Bayesian methodology to modeling. Structural Equation Modeling: A Multidisciplinary Journal, 11, 424–451.
Rupp, A. A., Levy, R., DiCerbo, K. E., Sweet, S., Crawford, A. V., Calico, T., & Behrens, J. T. (2012). Putting ECD into practice: The interplay of theory and data in evidence models within a digital learning environment. Journal of Educational Data Mining, 4, 49–110.
Rupp, A. A., & Mislevy, R. J. (2007). Cognitive foundations of structured item response models. In J. P. Leighton & M. J. Gierl (Eds.), Cognitive diagnostic assessment for education: Theory and applications (pp. 205–241). New York: Cambridge University Press.
Rupp, A. A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and applications. New York: Guilford Press.
Sahu, S. K. (2002). Bayesian estimation and model choice in item response models. Journal of Statistical Computation and Simulation, 72, 217–232.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, No. 17. Richmond, VA: The William Byrd Press. Retrieved from https://www.psychometricsociety.org/sites/default/files/pdf/MN17.pdf.
Samejima, F. (1973). A comment on Birnbaum's three-parameter logistic model in the latent trait theory. Psychometrika, 38, 221–233.
Samejima, F. (1983). Some methods and approaches of estimating the operation characteristics of discrete item responses. In H. Wainer & S. Messick (Eds.), Principals of modern psychological measurement: A Festschrift for Frederic M. Lord (pp. 154–182). Hillsdale, NJ: Lawrence Erlbaum Associates.
Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 85–100). New York: Springer-Verlag.
Samuelsen, K. M. (2008). Examining differential item function from a latent class perspective. In G. R. Hancock & K. M. Samuelsen (Eds.), Advances in latent variable mixture models (pp. 177–197). Charlotte, NC: Information Age Publishing.
Sao Pedro, M. A., Baker, R. S. J. de, Gobert, J. D., Montalvo, O., & Nakama, A. (2011). Leveraging machine-learned detectors of systematic inquiry behavior to estimate and predict transfer of inquiry skill. User Modeling and User-Adapted Interaction, 23(1), 1–39.
Savage, L. J. (1971). Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66, 783–801.
Schafer, J. L. (2003). Multiple imputation in multivariate problems when the imputation and analysis models differ. Statistica Neerlandica, 57, 19–35.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147–177.
Scheines, R., Hoijtink, H., & Boomsma, A. (1999). Bayesian estimation and testing of structural equation models. Psychometrika, 64, 37–52.
Schum, D. A. (1987). Evidence and inference for the intelligence analyst (2nd ed.). Lanham, MD: University Press of America.
Schum, D. A. (1994). The evidential foundations of probabilistic reasoning. New York: Wiley.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.
Scott, S. L., & Ip, E. H. (2002). Empirical Bayes and item-clustering effects in a latent variable hierarchical model: A case study from the National Assessment of Educational Progress. Journal of the American Statistical Association, 97, 409–419.
Sedransk, J., Monahan, J., & Chiu, H. Y. (1985). Bayesian estimation of finite population parameters in categorical data models incorporating order restrictions. Journal of the Royal Statistical Society. Series B (Methodological), 47, 519–527.
Segall, D. O. (1996). Multidimensional adaptive testing. Psychometrika, 61, 331–354.
Segall, D. O. (2002). An item response model for characterizing test compromise. Journal of Educational and Behavioral Statistics, 27, 163–179.
Segawa, E. (2005). A growth model for multilevel ordinal data. Journal of Educational and Behavioral Statistics, 30, 369–396.
Segawa, E., Emery, S., & Curry, S. J. (2008). Extended generalized linear latent and mixed model. Journal of Educational and Behavioral Statistics, 33, 464–484.
Senn, S. (2011). You may believe you are a Bayesian but you are probably wrong. Rationality, Markets and Morals, 2, 48–66.
Sheng, Y., & Wikle, C. K. (2007). Comparing multiunidimensional and unidimensional item response theory models. Educational and Psychological Measurement, 67, 899–919.
Sheng, Y., & Wikle, C. K. (2008). Bayesian multidimensional IRT models with a hierarchical structure. Educational and Psychological Measurement, 68, 413–430.
Sheriffs, A. C., & Boomer, D. S. (1954). Who is penalized by the penalty for guessing? Journal of Educational Psychology, 45, 81–90.
Song, X.-Y., Lee, S.-Y., & Hser, Y.-I. (2009). Bayesian analysis of multivariate latent curve models with nonlinear longitudinal latent effects. Structural Equation Modeling: A Multidisciplinary Journal, 16, 245–266.
Song, X.-Y., Lee, S.-Y., & Zhu, H.-T. (2001). Model selection in structural equation models with continuous and polytomous data. Structural Equation Modeling: A Multidisciplinary Journal, 8, 378–396.
Song, X.-Y., Lu, Z.-H., Cai, J.-H., & Ip, E. H.-S. (2013). A Bayesian modeling approach for generalized semiparametric structural equation models. Psychometrika, 78, 624–647.
Song, X.-Y., Xia, Y.-M., Pan, J.-H., & Lee, S.-Y. (2011). Model comparison of Bayesian semiparametric and parametric structural equation models. Structural Equation Modeling: A Multidisciplinary Journal, 18, 55–72.
Spearman, C. (1904). "General intelligence," objectively determined and measured. The American Journal of Psychology, 15, 201–292.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & Van Der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64, 583–639.
Spiegelhalter, D. J., & Lauritzen, S. L. (1990). Sequential updating of conditional probabilities on directed graphical structures. Networks, 20, 579–605.
Spiegelhalter, D. J., Thomas, A., Best, N. G., & Lunn, D. (2007). WinBUGS user manual: Version 1.4.3. Cambridge: MRC Biostatistics Unit.
Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the 3rd Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics (pp. 197–206). Berkeley, CA: University of California Press.
Stein, C. M. (1962). Confidence sets for the mean of a multivariate normal distribution. Journal of the Royal Statistical Society. Series B, 24, 265–296.
Stephens, M. (2000). Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62, 795–809.
Stern, H. S. (2000). Comment. Journal of the American Statistical Association, 95, 1157–1159.
Stone, C. A., & Hansen, M. A. (2000). The effect of errors in estimating ability on goodness-of-fit tests for IRT models. Educational and Psychological Measurement, 60, 974–991.
Stone, L. D., Keller, C. M., Kratzke, T. M., & Strumpfer, J. P. (2014). Search for the wreckage of Air France Flight AF 447. Statistical Science, 29, 69–80.
Stout, W., Habing, B., Douglas, J., Kim, H. R., Roussos, L., & Zhang, J. (1996). Conditional covariance-based nonparametric multidimensionality assessment. Applied Psychological Measurement, 20, 331–354.
Sturtz, S., Ligges, U., & Gelman, A. E. (2005). R2WinBUGS: A package for running WinBUGS from R. Journal of Statistical Software, 12, 1–16.
Süli, E., & Mayers, D. F. (2003). An introduction to numerical analysis. Cambridge: Cambridge University Press.
Swaminathan, H., & Gifford, J. A. (1982). Bayesian estimation in the Rasch model. Journal of Educational Statistics, 7, 175–191.
Swaminathan, H., & Gifford, J. A. (1985). Bayesian estimation in the two-parameter logistic model. Psychometrika, 50, 349–364.
Swaminathan, H., & Gifford, J. A. (1986). Bayesian estimation in the three-parameter logistic model. Psychometrika, 51, 589–601.
Swaminathan, H., Hambleton, R. K., & Rogers, H. J. (2007). Assessing the fit of item response models. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics, Volume 26: Psychometrics (pp. 683–718). Amsterdam, the Netherlands: North-Holland/Elsevier.
Takane, Y., & De Leeuw, J. (1987). On the relationship between item response theory and factor analysis of discretized variables. Psychometrika, 52, 393–408.
Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82, 528–540.
Tatsuoka, C. (2002). Data analytic methods for latent partially ordered classification models. Journal of the Royal Statistical Society: Series C (Applied Statistics), 51, 337–350.
Tatsuoka, K. K. (1984). Analysis of errors in fraction addition and subtraction problems (NIE Final Rep. for Grant No. NIE-G-81-002). Urbana, IL: University of Illinois, Computer-Based Education Research Laboratory. Retrieved from http://eric.ed.gov/?id=ED257665.
Tatsuoka, K. K. (1987). Validation of cognitive sensitivity for item response curves. Journal of Educational Measurement, 24, 233–245.
Tatsuoka, K. K. (1990). Toward an integration of item response theory and cognitive error diagnosis. In N. Frederiksen, R. Glaser, A. Lesgold, & M. G. Shafto (Eds.), Diagnostic monitoring of skill and knowledge acquisition (pp. 453–488). Hillsdale, NJ: Lawrence Erlbaum Associates.
Tatsuoka, K. K. (2009). Cognitive assessment: An introduction to the rule space method. New York: Routledge.
Thissen, D. (2001). Psychometric engineering as art. Psychometrika, 66, 473–485.
Thompson, M. S., & Green, S. B. (2013). Evaluating between-group differences in latent variable means. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling: A second course (2nd ed., pp. 163–218). Greenwich, CT: Information Age Publishing.
Thurstone, L. L. (1947). Multiple-factor analysis: A development & expansion of the vectors of mind. Chicago, IL: The University of Chicago Press.
Tierney, L. (1994). Markov chains for exploring posterior distributions. The Annals of Statistics, 22, 1701–1728.
Tierney, L., & Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81, 82–86.
Toulmin, S. E. (1958). The uses of argument. Cambridge, UK: Cambridge University Press.
Tsutakawa, R. K. (1992). Prior distribution for item response curves. British Journal of Mathematical and Statistical Psychology, 45, 51–74.
Tsutakawa, R. K., & Johnson, J. C. (1990). The effect of uncertainty of item parameter estimation on ability estimates. Psychometrika, 55, 371–390.
Tsutakawa, R. K., & Lin, H. Y. (1986). Bayesian estimation of item response curves. Psychometrika, 51, 251–267.
Tsutakawa, R. K., & Soltys, M. J. (1988). Approximation for Bayesian ability estimation. Journal of Educational Statistics, 13, 117–130.
Tukey, J. W. (1962). The future of data analysis. The Annals of Mathematical Statistics, 33, 1–67.
Tukey, J. W. (1969). Analyzing data: Sanctification or detective work? American Psychologist, 24, 83–91.
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Pearson.
Tukey, J. W. (1986). Exploratory data analysis as part of a larger whole. In The collected works of John W. Tukey: Vol. IV. Philosophy and principles of data analysis: 1965–1986 (pp. 793–803). Pacific Grove, CA: Wadsworth.
Van der Linden, W. J. (1998). Bayesian item selection criteria for adaptive testing. Psychometrika, 63, 201–216.
Van der Linden, W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72, 287–308.
Van der Linden, W. J., & Glas, C. A. (2000). Capitalization on item calibration error in adaptive testing. Applied Measurement in Education, 13, 35–53.
Van der Linden, W. J., & Glas, C. A. W. (Eds.). (2010). Elements of adaptive testing. New York: Springer.
Van der Linden, W. J., & Guo, F. (2008). Bayesian procedures for identifying aberrant response-time patterns in adaptive testing. Psychometrika, 73, 365–384.
Van der Linden, W. J., & Hambleton, R. K. (Eds.). (1997). Handbook of modern item response theory. New York: Springer.
Van der Linden, W. J., & Lewis, C. (2015). Bayesian checks on cheating on tests. Psychometrika, 80, 689–706.
Van der Linden, W. J., & Pashley, P. J. (2010). Item selection and ability estimation in adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Elements of adaptive testing (pp. 3–30). New York: Springer.
Van der Linden, W. J., & Ren, H. (2014). Optimal Bayesian adaptive design for test-item calibration. Psychometrika, 80, 1–26.
VanLehn, K. (2008). Intelligent tutoring systems for continuous, embedded assessment. In C. Dwyer (Ed.), The future of assessment: Shaping teaching and learning (pp. 113–138). Mahwah, NJ: Lawrence Erlbaum.
VanLehn, K., & Martin, J. (1997). Evaluation of an assessment system based on Bayesian student modeling. International Journal of Artificial Intelligence in Education, 8, 179–221.
Van Onna, M. J. H. (2002). Bayesian estimation and model selection in ordered latent class models for polytomous items. Psychometrika, 67, 519–538.
Van Rijn, P. W., & Rijmen, F. (2012). A note on explaining away and paradoxical results in multidimensional item response theory (No. ETS RR-12-13). Princeton, NJ: ETS. Retrieved from http://onlinelibrary.wiley.com/doi/10.1002/j.2333-8504.2012.tb02295.x/abstract.
Van Rijn, P., & Rijmen, F. (2015). On the explaining-away phenomenon in multivariate latent variable models. British Journal of Mathematical and Statistical Psychology, 68, 1–22.
Verhagen, A. J., & Fox, J. P. (2013). Bayesian tests of measurement invariance. British Journal of Mathematical and Statistical Psychology, 66, 383–401.
Von Davier, M. (2008). A general diagnostic model applied to language testing data. British Journal of Mathematical & Statistical Psychology, 61, 287–307.
Von Davier, M., Sinharay, S., Oranje, A., & Beaton, A. (2007). The statistical procedures used in National Assessment of Educational Progress: Recent developments and future directions. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics, Volume 26: Psychometrics (pp. 1039–1055). Amsterdam, the Netherlands: North-Holland/Elsevier.
Vos, H. J. (1999). Applications of Bayesian decision theory to sequential mastery testing. Journal of Educational and Behavioral Statistics, 24, 271–292.
Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14, 779–804.
Wagenmakers, E., Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (2011). Why psychologists must change the way they analyze their data: The case of psi: Comment on Bem (2011). Journal of Personality and Social Psychology, 100, 426–432.
Wainer, H. (2000). Introduction and history. In H. Wainer, N. J. Dorans, R. Flaugher, B. F. Green, R. J. Mislevy, L. Steinberg, & D. Thissen (Eds.), Computerized adaptive testing: A primer (2nd ed.). Mahwah, NJ: Routledge.
Wainer, H., Bradlow, E. T., & Wang, X. (2007). Testlet response theory and its applications. New York: Cambridge University Press.
Wainer, H., & Brown, L. M. (2007). Three statistical paradoxes in the interpretation of group differences: Illustrated with medical school admission and licensing data. In Handbook of statistics (Vol. 26, pp. 893–918). Amsterdam, the Netherlands: Elsevier.
Wainer, H., Dorans, N. J., Eignor, D., Flaugher, R., Green, B. F., Mislevy, R. J., Steinberg, L., & Thissen, D. (2000). Computerized adaptive testing: A primer (2nd ed.). Mahwah, NJ: Routledge.
Wainer, H., & Thissen, D. (1994). On examinee choice in educational testing. Review of Educational Research, 64, 159–195.
Wall, M. M. (2009). Maximum likelihood and Bayesian estimation for nonlinear structural equation models. In R. E. Millsap & A. Maydeu-Olivares (Eds.), The SAGE handbook of quantitative methods in psychology (pp. 540–567). London: SAGE Publications.
Wang, C., Fan, Z., Chang, H.-H., & Douglas, J. A. (2013). A semiparametric model for jointly analyzing response times and accuracy in computerized testing. Journal of Educational and Behavioral Statistics, 38, 381–417.
Wang, W.-C., Liu, C.-W., & Wu, S.-L. (2013). The random-threshold generalized unfolding model and its application of computerized adaptive testing. Applied Psychological Measurement, 37, 179–200.
Wang, X., Bradlow, E. T., Wainer, H., & Muller, E. S. (2008). A Bayesian method for studying DIF: A cautionary tale filled with surprises and delights. Journal of Educational and Behavioral Statistics, 33, 363–384.
Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54, 427–450.
Warner, H. R., Toronto, A. F., Veasey, L. G., & Stephenson, R. (1961). A mathematical approach to medical diagnosis: Application to congenital heart disease. Journal of the American Medical Association, 177, 177–183.
Warner, S. L. (1965). Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60, 63–69.
Weber, J. E. (1973). Historical aspects of the Bayesian controversy: With comprehensive bibliography. Tucson, AZ: Division of Economic and Business Research, University of Arizona.
Welch, R. E., & Frick, T. W. (1993). Computerized adaptive testing in instructional settings. Educational Technology Research and Development, 41, 47–62.
West, S. G., Taylor, A. B., & Wu, W. (2012). Model fit and model selection in structural equation modeling. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (pp. 209–231). New York: Guilford Press.
Whitely, S. E. (1980). Multicomponent latent trait models for ability tests. Psychometrika, 45, 479–494.
Williamson, D. M., Almond, R. G., Mislevy, R. J., & Levy, R. (2006). An application of Bayesian networks in automated scoring of computerized simulation tasks. In D. M. Williamson, R. J. Mislevy, & I. I. Bejar (Eds.), Automated scoring of complex tasks in computer-based testing (pp. 201–257). Mahwah, NJ: Lawrence Erlbaum Associates.
Williamson, D. M., Bauer, M., Steinberg, L. S., Mislevy, R. J., Behrens, J. T., & DeMark, S. F. (2004). Design rationale for a complex performance assessment. International Journal of Testing, 4, 303–332.
Williamson, J. (2010). In defence of objective Bayesianism. Oxford: Oxford University Press.
Winkler, R. L. (1972). An introduction to Bayesian inference and decision. New York: Holt McDougal.
Wirth, R. J., & Edwards, M. C. (2007). Item factor analysis: Current approaches and future directions. Psychological Methods, 12, 58–79.
Wise, S. L., Plake, B. S., Johnson, P. L., & Roos, L. L. (1992). A comparison of self-adapted and computerized adaptive tests. Journal of Educational Measurement, 29, 329–339.
Wollack, J. A., Bolt, D. M., Cohen, A. S., & Lee, Y.-S. (2002). Recovery of item parameters in the nominal response model: A comparison of marginal maximum likelihood estimation and Markov chain Monte Carlo estimation. Applied Psychological Measurement, 26, 339–352.
Woodward, B. (2011, May 12). Death of Osama bin Laden: Phone call pointed U.S. to compound and to the pacer. Retrieved January 20, 2015, from http://www.washingtonpost.com/world/national-security/death-of-osama-bin-laden-phone-call-pointed-us-to-compound--and-to-the-pacer/2011/05/06/AFnSVaCG_story.html.
Wright, S. (1934). The method of path coefficients. The Annals of Mathematical Statistics, 5, 161–215.
Yanai, H., & Ichikawa, M. (2007). Factor analysis. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics, Volume 26: Psychometrics (pp. 257–296). Amsterdam, the Netherlands: North-Holland/Elsevier.
Yan, D., Almond, R., & Mislevy, R. (2004). A comparison of two models for cognitive diagnosis (No. RR-04-02). Princeton, NJ: ETS. Retrieved from http://onlinelibrary.wiley.com/doi/10.1002/j.2333-8504.2004.tb01929.x/abstract.
Yan, D., Mislevy, R. J., & Almond, R. G. (2003). Design and analysis in a cognitive assessment (No. RR-03-32). Princeton, NJ: ETS.
Yang, M., Dunson, D. B., & Baird, D. (2010). Semiparametric Bayes hierarchical models with mean and variance constraints. Computational Statistics & Data Analysis, 54, 2172–2186.
Yao, L., & Boughton, K. A. (2007). A multidimensional item response modeling approach for improving subscale proficiency estimation and classification. Applied Psychological Measurement, 31, 83–105.
Yao, L., & Schwarz, R. D. (2006). A multidimensional partial credit model with associated item and test statistics: An application to mixed-format tests. Applied Psychological Measurement, 30, 469–492.
Yen, W. M., Burket, G. R., & Sykes, R. C. (1991). Nonunique solutions to the likelihood equation for the three-parameter logistic model. Psychometrika, 56, 39–54.
Zellner, A. (1971). An introduction to Bayesian inference in econometrics. New York: Wiley.
Zhang, Z., Lai, K., Lu, Z., & Tong, X. (2013). Bayesian inference and application of robust growth curve models using Student's t distribution. Structural Equation Modeling: A Multidisciplinary Journal, 20, 47–78.
Zhu, H.-T., & Lee, S.-Y. (2001). A Bayesian analysis of finite mixtures in the LISREL model. Psychometrika, 66, 133–152.
Zhu, X., & Stone, C. A. (2011). Assessing fit of unidimensional graded response models using Bayesian methods. Journal of Educational Measurement, 48, 81–97.
Zhu, X., & Stone, C. A. (2012). Bayesian comparison of alternative graded response models for performance assessment applications. Educational and Psychological Measurement, 72, 774–799.
Zwick, R. (2006). Higher education admissions testing. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 647–679). Westport, CT: Praeger.
Zwick, R., Thayer, D. T., & Lewis, C. (1999). An empirical Bayes approach to Mantel-Haenszel DIF analysis. Journal of Educational Measurement, 36, 1–28.
Zwick, R., Thayer, D. T., & Lewis, C. (2000). Using loss functions for DIF detection: An empirical Bayes approach. Journal of Educational and Behavioral Statistics, 25, 225–247.
Zwick, R., Ye, L., & Isham, S. (2012). Improving Mantel-Haenszel DIF estimation through Bayesian updating. Journal of Educational and Behavioral Statistics, 37, 601–629.