Categorical Data Analysis
Second Edition
ALAN AGRESTI
University of Florida
Gainesville, Florida
This book is printed on acid-free paper.
Copyright © 2002 John Wiley & Sons, Inc., Hoboken, New Jersey. All rights reserved.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any
form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise,
except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without
either the prior written permission of the Publisher, or authorization through payment of the
appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA
01923, (978) 750-8400, fax (978) 750-4744. Requests to the Publisher for permission should be
addressed to the Permissions Department, John Wiley & Sons, Inc., 605 Third Avenue, New
York, NY 10158-0012, (212) 850-6011, fax (212) 850-6008, E-Mail: PERMREQ@WILEY.COM.
For ordering and customer service, call 1-800-CALL-WILEY.
Library of Congress Cataloging-in-Publication Data Is Available
ISBN 0-471-36093-7
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
To Jacki
Contents
Preface, xiii

1. Introduction: Distributions and Inference for Categorical Data, 1
   1.1 Categorical Response Data, 1
   1.2 Distributions for Categorical Data, 5
   1.3 Statistical Inference for Categorical Data, 9
   1.4 Statistical Inference for Binomial Parameters, 14
   1.5 Statistical Inference for Multinomial Parameters, 21
   Notes, 26
   Problems, 28
2. Describing Contingency Tables, 36
   2.1 Probability Structure for Contingency Tables, 36
   2.2 Comparing Two Proportions, 43
   2.3 Partial Association in Stratified 2 × 2 Tables, 47
   2.4 Extensions for I × J Tables, 54
   Notes, 59
   Problems, 60
3. Inference for Contingency Tables, 70
   3.1 Confidence Intervals for Association Parameters, 70
   3.2 Testing Independence in Two-Way Contingency Tables, 78
   3.3 Following-Up Chi-Squared Tests, 80
   3.4 Two-Way Tables with Ordered Classifications, 86
   3.5 Small-Sample Tests of Independence, 91
   3.6 Small-Sample Confidence Intervals for 2 × 2 Tables,* 98
   3.7 Extensions for Multiway Tables and Nontabulated Responses, 101
   Notes, 102
   Problems, 104
4. Introduction to Generalized Linear Models, 115
   4.1 Generalized Linear Model, 116
   4.2 Generalized Linear Models for Binary Data, 120
   4.3 Generalized Linear Models for Counts, 125
   4.4 Moments and Likelihood for Generalized Linear Models,* 132
   4.5 Inference for Generalized Linear Models, 139
   4.6 Fitting Generalized Linear Models, 143
   4.7 Quasi-likelihood and Generalized Linear Models,* 149
   4.8 Generalized Additive Models,* 153
   Notes, 155
   Problems, 156
5. Logistic Regression, 165
   5.1 Interpreting Parameters in Logistic Regression, 166
   5.2 Inference for Logistic Regression, 172
   5.3 Logit Models with Categorical Predictors, 177
   5.4 Multiple Logistic Regression, 182
   5.5 Fitting Logistic Regression Models, 192
   Notes, 196
   Problems, 197
6. Building and Applying Logistic Regression Models, 211
   6.1 Strategies in Model Selection, 211
   6.2 Logistic Regression Diagnostics, 219
   6.3 Inference About Conditional Associations in 2 × 2 × K Tables, 230
   6.4 Using Models to Improve Inferential Power, 236
   6.5 Sample Size and Power Considerations,* 240
   6.6 Probit and Complementary Log-Log Models,* 245
   6.7 Conditional Logistic Regression and Exact Distributions,* 250
   Notes, 257
   Problems, 259

*Sections marked with an asterisk are less important for an overview.
7. Logit Models for Multinomial Responses, 267
   7.1 Nominal Responses: Baseline-Category Logit Models, 267
   7.2 Ordinal Responses: Cumulative Logit Models, 274
   7.3 Ordinal Responses: Cumulative Link Models, 282
   7.4 Alternative Models for Ordinal Responses,* 286
   7.5 Testing Conditional Independence in I × J × K Tables,* 293
   7.6 Discrete-Choice Multinomial Logit Models,* 298
   Notes, 302
   Problems, 302
8. Loglinear Models for Contingency Tables, 314
   8.1 Loglinear Models for Two-Way Tables, 314
   8.2 Loglinear Models for Independence and Interaction in Three-Way Tables, 318
   8.3 Inference for Loglinear Models, 324
   8.4 Loglinear Models for Higher Dimensions, 326
   8.5 The Loglinear–Logit Model Connection, 330
   8.6 Loglinear Model Fitting: Likelihood Equations and Asymptotic Distributions,* 333
   8.7 Loglinear Model Fitting: Iterative Methods and Their Application,* 342
   Notes, 346
   Problems, 347
9. Building and Extending Loglinear/Logit Models, 357
   9.1 Association Graphs and Collapsibility, 357
   9.2 Model Selection and Comparison, 360
   9.3 Diagnostics for Checking Models, 366
   9.4 Modeling Ordinal Associations, 367
   9.5 Association Models,* 373
   9.6 Association Models, Correlation Models, and Correspondence Analysis,* 379
   9.7 Poisson Regression for Rates, 385
   9.8 Empty Cells and Sparseness in Modeling Contingency Tables, 391
   Notes, 398
   Problems, 400
10. Models for Matched Pairs, 409
   10.1 Comparing Dependent Proportions, 410
   10.2 Conditional Logistic Regression for Binary Matched Pairs, 414
   10.3 Marginal Models for Square Contingency Tables, 420
   10.4 Symmetry, Quasi-symmetry, and Quasi-independence, 423
   10.5 Measuring Agreement Between Observers, 431
   10.6 Bradley–Terry Model for Paired Preferences, 436
   10.7 Marginal Models and Quasi-symmetry Models for Matched Sets,* 439
   Notes, 442
   Problems, 444
11. Analyzing Repeated Categorical Response Data, 455
   11.1 Comparing Marginal Distributions: Multiple Responses, 456
   11.2 Marginal Modeling: Maximum Likelihood Approach, 459
   11.3 Marginal Modeling: Generalized Estimating Equations Approach, 466
   11.4 Quasi-likelihood and Its GEE Multivariate Extension: Details,* 470
   11.5 Markov Chains: Transitional Modeling, 476
   Notes, 481
   Problems, 482
12. Random Effects: Generalized Linear Mixed Models for Categorical Responses, 491
   12.1 Random Effects Modeling of Clustered Categorical Data, 492
   12.2 Binary Responses: Logistic-Normal Model, 496
   12.3 Examples of Random Effects Models for Binary Data, 502
   12.4 Random Effects Models for Multinomial Data, 513
   12.5 Multivariate Random Effects Models for Binary Data, 516
   12.6 GLMM Fitting, Inference, and Prediction, 520
   Notes, 526
   Problems, 527
13. Other Mixture Models for Categorical Data,* 538
   13.1 Latent Class Models, 538
   13.2 Nonparametric Random Effects Models, 545
   13.3 Beta-Binomial Models, 553
   13.4 Negative Binomial Regression, 559
   13.5 Poisson Regression with Random Effects, 563
   Notes, 565
   Problems, 566
14. Asymptotic Theory for Parametric Models, 576
   14.1 Delta Method, 577
   14.2 Asymptotic Distributions of Estimators of Model Parameters and Cell Probabilities, 582
   14.3 Asymptotic Distributions of Residuals and Goodness-of-Fit Statistics, 587
   14.4 Asymptotic Distributions for Logit/Loglinear Models, 592
   Notes, 594
   Problems, 595
15. Alternative Estimation Theory for Parametric Models, 600
   15.1 Weighted Least Squares for Categorical Data, 600
   15.2 Bayesian Inference for Categorical Data, 604
   15.3 Other Methods of Estimation, 611
   Notes, 615
   Problems, 616
16. Historical Tour of Categorical Data Analysis,* 619
   16.1 Pearson–Yule Association Controversy, 619
   16.2 R. A. Fisher's Contributions, 622
   16.3 Logistic Regression, 624
   16.4 Multiway Contingency Tables and Loglinear Models, 625
   16.5 Recent (and Future?) Developments, 629

Appendix A. Using Computer Software to Analyze Categorical Data, 632
   A.1 Software for Categorical Data Analysis, 632
   A.2 Examples of SAS Code by Chapter, 634

Appendix B. Chi-Squared Distribution Values, 654

References, 655
Examples Index, 689
Author Index, 693
Subject Index, 701
Preface
The explosion in the development of methods for analyzing categorical data
that began in the 1960s has continued apace in recent years. This book
provides an overview of these methods, as well as older, now standard,
methods. It gives special emphasis to generalized linear modeling techniques,
which extend linear model methods for continuous variables, and their
extensions for multivariate responses.
Today, because of this development and the ubiquity of categorical data in
applications, most statistics and biostatistics departments offer courses on
categorical data analysis. This book can be used as a text for such courses.
The material in Chapters 1–7 forms the heart of most courses. Chapters 1–3
cover distributions for categorical responses and traditional methods for
two-way contingency tables. Chapters 4–7 introduce logistic regression and
related logit models for binary and multicategory response variables.
Chapters 8 and 9 cover loglinear models for contingency tables. Over time,
this model class seems to have lost importance, and this edition somewhat
reduces its discussion of them and expands its focus on logistic regression.
In the past decade, the major area of new research has been the development
of methods for repeated measurement and other forms of clustered
categorical data. Chapters 10–13 present these methods, including marginal
models and generalized linear mixed models with random effects. Chapters
14 and 15 present theoretical foundations as well as alternatives to the
maximum likelihood paradigm that this text adopts. Chapter 16 is devoted to
a historical overview of the development of the methods. It examines
contributions of noted statisticians, such as Pearson and Fisher, whose
pioneering efforts, and sometimes vocal debates, broke the ground for this
evolution.
Every chapter of the first edition has been extensively rewritten, and some
substantial additions and changes have occurred. The major differences are:
• A new Chapter 1 that introduces distributions and methods of inference
  for categorical data.
• A unified presentation of models as special cases of generalized linear
  models, starting in Chapter 4 and then throughout the text.
• Greater emphasis on logistic regression for binary response variables
  and extensions for multicategory responses, with Chapters 4–7
  introducing models and Chapters 10–13 extending them for clustered data.
• Three new chapters on methods for clustered, correlated categorical
  data, increasingly important in applications.
• A new chapter on the historical development of the methods.
• More discussion of "exact" small-sample procedures and of conditional
  logistic regression.
In this text, I interpret categorical data analysis to refer to methods for
categorical response variables. For most methods, explanatory variables can
be qualitative or quantitative, as in ordinary regression. Thus, the focus is
intended to be more general than contingency table analysis, although for
simplicity of data presentation, most examples use contingency tables. These
examples are often simplistic, but should help readers focus on understanding the methods themselves and make it easier for them to replicate results
with their favorite software.
Special features of the text include:
• More than 100 analyses of "real" data sets.
• More than 600 exercises at the end of the chapters, some directed
  towards theory and methods and some towards applications and data
  analysis.
• An appendix that shows, by chapter, the use of SAS for performing
  analyses presented in this book.
• Notes at the end of each chapter that provide references for recent
  research and many topics not covered in the text.
Appendix A summarizes statistical software needed to use the methods
described in this text. It shows how to use SAS for analyses included in the
text and refers to a web site (www.stat.ufl.edu/~aa/cda/cda.html) that
contains (1) information on the use of other software (such as R, S-Plus,
SPSS, and Stata), (2) data sets for examples in the form of complete SAS
programs for conducting the analyses, (3) short answers for many of the
odd-numbered exercises, (4) corrections of errors in early printings of the
book, and (5) extra exercises. I recommend that readers refer to this
appendix or specialized manuals while reading the text, as an aid to
implementing the methods.
I intend this book to be accessible to the diverse mix of students who take
graduate-level courses in categorical data analysis. But I have also written it
with practicing statisticians and biostatisticians in mind. I hope it enables
them to catch up with recent advances and learn about methods that
sometimes receive inadequate attention in the traditional statistics curriculum.
The development of new methods has influenced, and been influenced by,
the increasing availability of data sets with categorical responses in the
social, behavioral, and biomedical sciences, as well as in public health, human
genetics, ecology, education, marketing, and industrial quality control. And
so, although this book is directed mainly to statisticians and biostatisticians, I
also aim for it to be helpful to methodologists in these fields.
Readers should possess a background that includes regression and analysis
of variance models, as well as maximum likelihood methods of statistical
theory. Those not having much theory background should be able to follow
most methodological discussions. Sections and subsections marked with an
asterisk are less important for an overview. Readers with mainly applied
interests can skip most of Chapter 4 on the theory of generalized linear
models and proceed to other chapters. However, the book has a distinctly
higher technical level and is more thorough and complete than my lower-level
text, An Introduction to Categorical Data Analysis (Wiley, 1996).
I thank those who commented on parts of the manuscript or provided help
of some type. Special thanks to Bernhard Klingenberg, who read several
chapters carefully and made many helpful suggestions, Yongyi Min, who
constructed many of the figures and helped with some software, and Brian
Caffo, who helped with some examples. Many thanks to Rosyln Stone and
Brian Marx for each reviewing half the manuscript and Brian Caffo, I-Ming
Liu, and Yongyi Min for giving insightful comments on several chapters.
Thanks to Constantine Gatsonis and his students for using a draft in a course
at Brown University and providing suggestions. Others who provided comments on chapters or help of some type include Patricia Altham, Wicher
Bergsma, Jane Brockmann, Brent Coull, Al DeMaris, Regina Dittrich,
Jianping Dong, Herwig Friedl, Ralitza Gueorguieva, James Hobert, Walter
Katzenbeisser, Harry Khamis, Svend Kreiner, Joseph Lang, Jason Liao,
Mojtaba Ganjali, Jane Pendergast, Michael Radelet, Kenneth Small, Maura
Stokes, Tom Ten Have, and Rongling Wu. I thank my co-authors on various
projects, especially Brent Coull, Joseph Lang, James Booth, James Hobert,
Brian Caffo, and Ranjini Natarajan, for permission to use material from
those articles. Thanks to the many who reviewed material or suggested
examples for the first edition, mentioned in the Preface of that edition.
Thanks also to Wiley Executive Editor Steve Quigley for his steadfast
encouragement and facilitation of this project. Finally, thanks to my wife
Jacki Levine for continuing support of all kinds, despite the many days this
work has taken from our time together.
ALAN AGRESTI
Gainesville, Florida
November 2001
CHAPTER 1
Introduction: Distributions and
Inference for Categorical Data
From helping to assess the value of new medical treatments to evaluating the
factors that affect our opinions and behaviors, analysts today are finding
myriad uses for categorical data methods. In this book we introduce these
methods and the theory behind them.
Statistical methods for categorical responses were late in gaining the level
of sophistication achieved early in the twentieth century by methods for
continuous responses. Despite influential work around 1900 by the British
statistician Karl Pearson, relatively little development of models for categorical responses occurred until the 1960s. In this book we describe the early
fundamental work that still has importance today but place primary emphasis
on more recent modeling approaches. Before outlining the topics covered, we
describe the major types of categorical data.
1.1 CATEGORICAL RESPONSE DATA
A categorical variable has a measurement scale consisting of a set of
categories. For instance, political philosophy is often measured as liberal,
moderate, or conservative. Diagnoses regarding breast cancer based on a
mammogram use the categories normal, benign, probably benign, suspicious,
and malignant.
The development of methods for categorical variables was stimulated by
research studies in the social and biomedical sciences. Categorical scales are
pervasive in the social sciences for measuring attitudes and opinions. Categorical scales in biomedical sciences measure outcomes such as whether a
medical treatment is successful.
Although categorical data are common in the social and biomedical
sciences, they are by no means restricted to those areas. They frequently
occur in the behavioral sciences (e.g., type of mental illness, with the
categories schizophrenia, depression, neurosis), epidemiology and public
health (e.g., contraceptive method at last intercourse, with the categories
none, condom, pill, IUD, other), genetics (type of allele inherited by an
offspring), zoology (e.g., alligators' primary food preference, with the
categories fish, invertebrate, reptile), education (e.g., student responses
to an exam question, with the categories correct and incorrect), and
marketing (e.g., consumer preference among leading brands of a product, with
the categories brand A, brand B, and brand C). They even occur in highly
quantitative fields such as engineering sciences and industrial quality
control. Examples are the classification of items according to whether they
conform to certain standards, and subjective evaluation of some
characteristic: how soft to the touch a certain fabric is, how good a
particular food product tastes, or how easy to perform a worker finds a
certain task to be.
Categorical variables are of many types. In this section we provide ways of
classifying them and other variables.
1.1.1 Response–Explanatory Variable Distinction
Most statistical analyses distinguish between response (or dependent)
variables and explanatory (or independent) variables. For instance,
regression models
describe how the mean of a response variable, such as the selling price of a
house, changes according to the values of explanatory variables, such as
square footage and location. In this book we focus on methods for categorical
response variables. As in ordinary regression, explanatory variables can be of
any type.
1.1.2 Nominal–Ordinal Scale Distinction
Categorical variables have two primary types of scales. Variables having
categories without a natural ordering are called nominal. Examples are
religious affiliation (with the categories Catholic, Protestant, Jewish,
Muslim, other), mode of transportation to work (automobile, bicycle, bus,
subway, walk), favorite type of music (classical, country, folk, jazz,
rock), and choice of residence (apartment, condominium, house, other). For
nominal variables, the order of listing the categories is irrelevant. The
statistical analysis does not depend on that ordering.
Many categorical variables do have ordered categories. Such variables are
called ordinal. Examples are size of automobile (subcompact, compact,
midsize, large), social class (upper, middle, lower), political philosophy
(liberal, moderate, conservative), and patient condition (good, fair,
serious, critical). Ordinal variables have ordered categories, but distances
between categories are unknown. Although a person categorized as moderate is
more liberal than a person categorized as conservative, no numerical value
describes how much more liberal that person is. Methods for ordinal
variables utilize the category ordering.
An interval variable is one that does have numerical distances between any
two values. For example, blood pressure level, functional life length of a
television set, length of prison term, and annual income are interval
variables. (An interval variable is sometimes called a ratio variable if
ratios of values are also valid.)
The way that a variable is measured determines its classification. For
example, "education" is only nominal when measured as public school or
private school; it is ordinal when measured by highest degree attained,
using the categories none, high school, bachelor's, master's, and doctorate;
it is interval when measured by number of years of education, using the
integers 0, 1, 2, . . . .
A variable’s measurement scale determines which statistical methods are
appropriate. In the measurement hierarchy, interval variables are highest,
ordinal variables are next, and nominal variables are lowest. Statistical
methods for variables of one type can also be used with variables at higher
levels but not at lower levels. For instance, statistical methods for nominal
variables can be used with ordinal variables by ignoring the ordering of
categories. Methods for ordinal variables cannot, however, be used with
nominal variables, since their categories have no meaningful ordering. It is
usually best to apply methods appropriate for the actual scale.
Since this book deals with categorical responses, we discuss the analysis of
nominal and ordinal variables. The methods also apply to interval variables
having a small number of distinct values (e.g., number of times married) or
for which the values are grouped into ordered categories (e.g., education
measured as < 10 years, 10–12 years, > 12 years).
1.1.3 Continuous–Discrete Variable Distinction
Variables are classified as continuous or discrete, according to the number
of values they can take. Actual measurement of all variables occurs in a
discrete manner, due to precision limitations in measuring instruments. The
continuous–discrete classification, in practice, distinguishes between
variables that take lots of values and variables that take few values. For
instance, statisticians often treat discrete interval variables having a
large number of values (such as test scores) as continuous, using them in
methods for continuous responses.
This book deals with certain types of discretely measured responses: (1)
nominal variables, (2) ordinal variables, (3) discrete interval variables
having relatively few values, and (4) continuous variables grouped into a
small number of categories.
1.1.4 Quantitative–Qualitative Variable Distinction
Nominal variables are qualitative: distinct categories differ in quality,
not in quantity. Interval variables are quantitative: distinct levels have
differing amounts of the characteristic of interest. The position of ordinal
variables in
the quantitative–qualitative classification is fuzzy. Analysts often treat
them as qualitative, using methods for nominal variables. But in many
respects, ordinal variables more closely resemble interval variables than
they resemble nominal variables. They possess important quantitative
features: each category has a greater or smaller magnitude of the
characteristic than another category; and although not possible to measure,
an underlying continuous variable is usually present. The political
philosophy classification (liberal, moderate, conservative) crudely measures
an inherently continuous characteristic.
Analysts often utilize the quantitative nature of ordinal variables by
assigning numerical scores to categories or assuming an underlying continuous distribution. This requires good judgment and guidance from researchers
who use the scale, but it provides benefits in the variety of methods available
for data analysis.
1.1.5 Organization of This Book
The models for categorical response variables discussed in this book
resemble regression models for continuous response variables; however, they
assume binomial, multinomial, or Poisson response distributions instead of
normality. Two types of models receive special attention, logistic
regression and loglinear models. Ordinary logistic regression models, also
called logit models, apply with binary (i.e., two-category) responses and
assume a binomial distribution. Generalizations of logistic regression apply
with multicategory responses and assume a multinomial distribution.
Loglinear models apply with count data and assume a Poisson distribution.
Certain equivalences exist between logistic regression and loglinear models.
The book has four main units. In the first, Chapters 1 through 3, we
summarize descriptive and inferential methods for univariate and bivariate
categorical data. These chapters cover discrete distributions, methods of
inference, and analyses for measures of association. They summarize the
non-model-based methods developed prior to about 1960.
In the second and primary unit, Chapters 4 through 9, we introduce
models for categorical responses. In Chapter 4 we describe a class of
generalized linear models having models of this text as special cases. We focus
on models for binary and count response variables. Chapters 5 and 6 cover
the most important model for binary responses, logistic regression. In Chapter 7 we present generalizations of that model for nominal and ordinal
multicategory response variables. In Chapter 8 we introduce the modeling of
multivariate categorical response data and show how to represent association
and interaction patterns by loglinear models for counts in the table that
cross-classifies those responses. In Chapter 9 we discuss model building with
loglinear and related logistic models and present some related models.
In the third unit, Chapters 10 through 13, we discuss models for handling
repeated measurement and other forms of clustering. In Chapter 10 we
present models for a categorical response with matched pairs; these apply,
for instance, with a categorical response measured for the same subjects at
two times. Chapter 11 covers models for more general types of repeated
categorical data, such as longitudinal data from several times with explanatory variables. In Chapter 12 we present a broad class of models, generalized
linear mixed models, that use random effects to account for dependence with
such data. In Chapter 13 further extensions and applications of the models
from Chapters 10 through 12 are described.
The fourth and final unit is more theoretical. In Chapter 14 we develop
asymptotic theory for categorical data models. This theory is the basis for
large-sample behavior of model parameter estimators and goodness-of-fit
statistics. Maximum likelihood estimation receives primary attention here
and throughout the book, but Chapter 15 covers alternative methods of
estimation, such as the Bayesian paradigm. Chapter 16 stands alone from the
others, being a historical overview of the development of categorical data
methods.
Most categorical data methods require extensive computations, and
statistical software is necessary for their effective use. In Appendix A we
discuss software that can perform the analyses in this book and show the use
of SAS for text examples. See the Web site www.stat.ufl.edu/~aa/cda/cda.html
to download sample programs and data sets and find information about other
software.
Chapter 1 provides background material. In Section 1.2 we review the key
distributions for categorical data: the binomial, multinomial, and Poisson. In
Section 1.3 we review the primary mechanisms for statistical inference, using
maximum likelihood. In Sections 1.4 and 1.5 we illustrate these by presenting
significance tests and confidence intervals for binomial and multinomial
parameters.
1.2 DISTRIBUTIONS FOR CATEGORICAL DATA
Inferential data analyses require assumptions about the random mechanism
that generated the data. For regression models with continuous responses,
the normal distribution plays the central role. In this section we review the
three key distributions for categorical responses: binomial, multinomial, and
Poisson.
1.2.1 Binomial Distribution
Many applications refer to a fixed number n of binary observations. Let
y_1, y_2, ..., y_n denote responses for n independent and identical trials
such that P(Y_i = 1) = π and P(Y_i = 0) = 1 − π. We use the generic labels
"success" and "failure" for outcomes 1 and 0. Identical trials means that
the probability of success π is the same for each trial. Independent trials
means that the {Y_i} are independent random variables. These are often
called Bernoulli trials. The total number of successes, Y = Σ_{i=1}^n Y_i,
has the binomial distribution with index n and parameter π, denoted by
bin(n, π).
The probability mass function for the possible outcomes y for Y is

    p(y) = (n choose y) π^y (1 − π)^(n−y),    y = 0, 1, 2, ..., n,    (1.1)

where the binomial coefficient (n choose y) = n! / [y!(n − y)!]. Since
E(Y_i) = E(Y_i²) = 1 × π + 0 × (1 − π) = π,

    E(Y_i) = π   and   var(Y_i) = π(1 − π).

The binomial distribution for Y = Σ_i Y_i has mean and variance

    μ = E(Y) = nπ   and   σ² = var(Y) = nπ(1 − π).

The skewness is described by E(Y − μ)³/σ³ = (1 − 2π)/√(nπ(1 − π)). The
distribution converges to normality as n increases, for fixed π.
There is no guarantee that successive binary observations are independent
or identical. Thus, occasionally, we will utilize other distributions. One such
case is sampling binary outcomes without replacement from a finite population, such as observations on gender for 10 students sampled from a class of
size 20. The hypergeometric distribution, studied in Section 3.5.1, is then
relevant. In Section 1.2.4 we mention another case that violates these
binomial assumptions.
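The binomial moment formulas above can be checked numerically. The following short sketch is ours, not from the text (the book's own software examples use SAS; here we use plain Python): it evaluates the pmf (1.1) on its support and confirms that the moments computed directly from the pmf match the closed forms μ = nπ, σ² = nπ(1 − π), and skewness (1 − 2π)/√(nπ(1 − π)).

```python
from math import comb, sqrt, isclose

def binomial_pmf(y, n, pi):
    """Formula (1.1): p(y) = C(n, y) * pi^y * (1 - pi)^(n - y)."""
    return comb(n, y) * pi ** y * (1 - pi) ** (n - y)

n, pi = 10, 0.3
support = range(n + 1)
pmf = [binomial_pmf(y, n, pi) for y in support]

# The pmf sums to 1 over y = 0, 1, ..., n.
assert isclose(sum(pmf), 1.0)

# Moments computed directly from the pmf match the closed-form expressions.
mu = sum(y * p for y, p in zip(support, pmf))
var = sum((y - mu) ** 2 * p for y, p in zip(support, pmf))
skew = sum((y - mu) ** 3 * p for y, p in zip(support, pmf)) / var ** 1.5

assert isclose(mu, n * pi)                               # mu = n*pi
assert isclose(var, n * pi * (1 - pi))                   # sigma^2 = n*pi*(1-pi)
assert isclose(skew, (1 - 2 * pi) / sqrt(n * pi * (1 - pi)))
print(mu, var, skew)
```

Increasing n (for fixed π) drives the skewness toward 0, in line with the convergence to normality noted above.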
1.2.2 Multinomial Distribution
Some trials have more than two possible outcomes. Suppose that each of n
independent, identical trials can have outcome in any of c categories. Let
y_ij = 1 if trial i has outcome in category j and y_ij = 0 otherwise. Then
y_i = (y_i1, y_i2, ..., y_ic) represents a multinomial trial, with
Σ_j y_ij = 1; for instance, (0, 0, 1, 0) denotes outcome in category 3 of
four possible categories. Note that y_ic is redundant, being linearly
dependent on the others. Let n_j = Σ_i y_ij denote the number of trials
having outcome in category j. The counts (n_1, n_2, ..., n_c) have the
multinomial distribution.
Let π_j = P(Y_ij = 1) denote the probability of outcome in category j for
each trial. The multinomial probability mass function is

    p(n_1, n_2, ..., n_{c−1}) = [n! / (n_1! n_2! ⋯ n_c!)] π_1^{n_1} π_2^{n_2} ⋯ π_c^{n_c}.    (1.2)

Since Σ_j n_j = n, this is (c − 1)-dimensional, with
n_c = n − (n_1 + ⋯ + n_{c−1}). The binomial distribution is the special
case with c = 2.
For the multinomial distribution,

    E(n_j) = nπ_j,    var(n_j) = nπ_j(1 − π_j),    cov(n_j, n_k) = −nπ_j π_k.    (1.3)

We derive the covariance in Section 14.1.4. The marginal distribution of
each n_j is binomial.
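These multinomial facts can also be verified by brute force for a small case. The sketch below (our own illustration, with c = 3 categories and n = 5) enumerates every count vector with n_1 + n_2 + n_3 = n, checks that the pmf (1.2) sums to 1, that the marginal distribution of n_1 is bin(n, π_1), and that cov(n_1, n_2) = −nπ_1π_2 as in (1.3).

```python
from math import factorial, comb, isclose
from itertools import product

def multinomial_pmf(counts, probs):
    """Formula (1.2): [n! / (n1! ... nc!)] * prod_j pi_j^(n_j)."""
    coef = factorial(sum(counts))
    for nj in counts:
        coef //= factorial(nj)  # exact integer division at each step
    p = float(coef)
    for nj, pj in zip(counts, probs):
        p *= pj ** nj
    return p

n, probs = 5, (0.2, 0.3, 0.5)  # c = 3 categories
cells = [c for c in product(range(n + 1), repeat=3) if sum(c) == n]

# The pmf sums to 1 over all count vectors with n1 + n2 + n3 = n.
assert isclose(sum(multinomial_pmf(c, probs) for c in cells), 1.0)

# Marginal distribution of n_1 is binomial bin(n, pi_1).
marginal = [sum(multinomial_pmf(c, probs) for c in cells if c[0] == k)
            for k in range(n + 1)]
binom = [comb(n, k) * probs[0] ** k * (1 - probs[0]) ** (n - k)
         for k in range(n + 1)]
assert all(isclose(a, b) for a, b in zip(marginal, binom))

# Covariance (1.3): cov(n_j, n_k) = -n * pi_j * pi_k for j != k.
e1 = sum(c[0] * multinomial_pmf(c, probs) for c in cells)
e2 = sum(c[1] * multinomial_pmf(c, probs) for c in cells)
e12 = sum(c[0] * c[1] * multinomial_pmf(c, probs) for c in cells)
assert isclose(e12 - e1 * e2, -n * probs[0] * probs[1])
```

The negative covariance reflects the sum-to-n constraint: a high count in one category forces lower counts elsewhere.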
1.2.3 Poisson Distribution
Sometimes, count data do not result from a fixed number of trials. For
instance, if y = number of deaths due to automobile accidents on motorways
in Italy during this coming week, there is no fixed upper limit n for y (as
you are aware if you have driven in Italy). Since y must be a nonnegative
integer, its distribution should place its mass on that range. The simplest such
distribution is the Poisson. Its probabilities depend on a single parameter,
the mean μ. The Poisson probability mass function (Poisson 1837, p. 206) is

    p(y) = e^(−μ) μ^y / y!,    y = 0, 1, 2, ...,    (1.4)

It satisfies E(Y) = var(Y) = μ. It is unimodal with mode equal to the
integer part of μ. Its skewness is described by E(Y − μ)³/σ³ = 1/√μ. The
distribution approaches normality as μ increases.
The Poisson distribution is used for counts of events that occur randomly
over time or space, when outcomes in disjoint periods or regions are independent. It also applies as an approximation for the binomial when $n$ is large
and $\pi$ is small, with $\mu = n\pi$. So if each of the 50 million people driving in
Italy next week is an independent trial with probability 0.000002 of dying in a
fatal accident that week, the number of deaths $Y$ is a bin(50000000, 0.000002)
variate, or approximately Poisson with $\mu = n\pi = 50{,}000{,}000(0.000002) = 100$.
A key feature of the Poisson distribution is that its variance equals its
mean. Sample counts vary more when their mean is higher. When the mean
number of weekly fatal accidents equals 100, greater variability occurs in the
weekly counts than when the mean equals 10.
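The quality of the Poisson approximation in the motorway example can be checked directly. The sketch below compares the bin(50000000, 0.000002) and Poisson($\mu = 100$) probabilities at $y = 100$; the binomial pmf is computed on the log scale only to avoid overflow, a numerical detail not discussed in the text.

```python
from math import exp, factorial, lgamma, log, log1p

# Sketch: compare binomial and Poisson probabilities for the text's
# motorway example, bin(50000000, 0.000002) vs. Poisson(mu = 100).
n, p = 50_000_000, 0.000002
mu = n * p  # 100.0

def poisson_pmf(y, mu):
    return exp(-mu) * mu**y / factorial(y)

def binom_pmf(y, n, p):
    # log scale avoids overflow in the binomial coefficient
    log_pmf = (lgamma(n + 1) - lgamma(y + 1) - lgamma(n - y + 1)
               + y * log(p) + (n - y) * log1p(-p))
    return exp(log_pmf)

diff = abs(binom_pmf(100, n, p) - poisson_pmf(100, mu))
```

With $n$ this large and $\pi$ this small, the two probabilities agree to several decimal places.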
1.2.4   Overdispersion
In practice, count observations often exhibit variability exceeding that predicted by the binomial or Poisson. This phenomenon is called overdispersion.
We assumed above that each person has the same probability of dying in a
fatal accident in the next week. More realistically, these probabilities vary,
due to factors such as amount of time spent driving, whether the person
wears a seat belt, and geographical location. Such variation causes fatality
counts to display more variation than predicted by the Poisson model.
Suppose that $Y$ is a random variable with variance $\mathrm{var}(Y \mid \lambda)$ for given $\lambda$,
but $\lambda$ itself varies because of unmeasured factors such as those just described. Let $\mu = E(\lambda)$. Then unconditionally,

$$E(Y) = E[E(Y \mid \lambda)], \qquad \mathrm{var}(Y) = E[\mathrm{var}(Y \mid \lambda)] + \mathrm{var}[E(Y \mid \lambda)].$$

When $Y$ is conditionally Poisson (given $\lambda$), for instance, then $E(Y) = E(\lambda) = \mu$
and $\mathrm{var}(Y) = E(\lambda) + \mathrm{var}(\lambda) = \mu + \mathrm{var}(\lambda) > \mu$.
Assuming a Poisson distribution for a count variable is often too simplistic,
because of factors that cause overdispersion. The negative binomial is a
related distribution for count data that permits the variance to exceed the
mean. We introduce it in Section 4.3.4.
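The decomposition above can be seen in simulation. The sketch below lets the Poisson rate $\lambda$ itself be random (here gamma-distributed, an illustrative choice; the gamma-mixed Poisson is in fact the negative binomial mentioned above) and checks that the counts show variance $\mu + \mathrm{var}(\lambda)$ rather than $\mu$.

```python
import numpy as np

# Sketch: a Poisson rate lambda that varies (here gamma-distributed, an
# illustrative choice) yields counts with variance mu + var(lambda) > mu.
rng = np.random.default_rng(1)
shape, scale = 4.0, 2.5                  # E(lambda) = 10, var(lambda) = 25
lam = rng.gamma(shape, scale, size=500_000)
y = rng.poisson(lam)                     # conditionally Poisson given lambda

mu = y.mean()                            # near E(lambda) = 10
overdispersion = y.var() - mu            # near var(lambda) = 25
```

A plain Poisson fit to such data would understate the variability by a factor of about 3.5 here.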
Analyses assuming binomial (or multinomial) distributions are also sometimes invalid because of overdispersion. This might happen because the true
distribution is a mixture of different binomial distributions, with the parameter $\pi$ varying because of unmeasured variables. To illustrate, suppose that an
experiment exposes pregnant mice to a toxin and then after a week observes
the number of fetuses in each mouse's litter that show signs of malformation.
Let $n_i$ denote the number of fetuses in the litter for mouse $i$. The mice also
vary according to other factors that may not be measured, such as their
weight, overall health, and genetic makeup. Extra variation then occurs
because of the variability from litter to litter in the probability of malformation. The distribution of the number of fetuses per litter showing malformations might cluster near 0 and near $n_i$, showing more dispersion than
expected for binomial sampling with a single value of $\pi$. Overdispersion
could also occur when $\pi$ varies among fetuses in a litter according to some
distribution (Problem 1.12). In Chapters 4, 12, and 13 we introduce methods
for data that are overdispersed relative to binomial and Poisson assumptions.
1.2.5   Connection between Poisson and Multinomial Distributions
In Italy this next week, let $y_1$ = number of people who die in automobile
accidents, $y_2$ = number who die in airplane accidents, and $y_3$ = number who
die in railway accidents. A Poisson model for $(Y_1, Y_2, Y_3)$ treats these as
independent Poisson random variables, with parameters $(\mu_1, \mu_2, \mu_3)$. The
joint probability mass function for $\{Y_i\}$ is the product of the three mass
functions of form (1.4). The total $n = \sum Y_i$ also has a Poisson distribution,
with parameter $\sum \mu_i$.
With Poisson sampling the total count $n$ is random rather than fixed. If we
assume a Poisson model but condition on $n$, $\{Y_i\}$ no longer have Poisson
distributions, since each $Y_i$ cannot exceed $n$. Given $n$, $\{Y_i\}$ are also no longer
independent, since the value of one affects the possible range for the others.
For $c$ independent Poisson variates, with $E(Y_i) = \mu_i$, let's derive their
conditional distribution given that $\sum Y_i = n$. The conditional probability of a
set of counts $\{n_i\}$ satisfying this condition is

$$P\left(Y_1 = n_1, Y_2 = n_2, \ldots, Y_c = n_c \,\Big|\, \sum Y_j = n\right)
= \frac{P(Y_1 = n_1, Y_2 = n_2, \ldots, Y_c = n_c)}{P\left(\sum Y_j = n\right)}
= \frac{\prod_i \exp(-\mu_i)\,\mu_i^{n_i}/n_i!}{\exp\left(-\sum\mu_j\right)\left(\sum\mu_j\right)^{n}/n!}
= \frac{n!}{\prod_i n_i!}\prod_i \pi_i^{n_i}, \qquad (1.5)$$

where $\pi_i = \mu_i/\left(\sum_j \mu_j\right)$. This is the multinomial $(n, \{\pi_i\})$ distribution, characterized by the sample size $n$ and the probabilities $\{\pi_i\}$.
Many categorical data analyses assume a multinomial distribution. Such
analyses usually have the same parameter estimates as those of analyses
assuming a Poisson distribution, because of the similarity in the likelihood
functions.
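Result (1.5) can be illustrated by simulation: among independent Poisson draws whose total happens to equal $n$, the category counts behave like one multinomial draw with probabilities $\pi_i = \mu_i/\sum_j \mu_j$. The rates and conditioning total below are illustrative choices.

```python
import numpy as np

# Sketch: independent Poisson counts, conditioned on their total, behave
# like a multinomial draw, as in (1.5). Rates are illustrative.
rng = np.random.default_rng(2)
mu = np.array([3.0, 1.0, 2.0])
pi = mu / mu.sum()                       # (0.5, 1/6, 1/3)

draws = rng.poisson(mu, size=(400_000, 3))
totals = draws.sum(axis=1)
conditioned = draws[totals == 6]         # keep samples with total n = 6

# Among draws with total 6, category proportions should match pi.
prop = conditioned.mean(axis=0) / 6.0
```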
1.3   STATISTICAL INFERENCE FOR CATEGORICAL DATA
The choice of distribution for the response variable is but one step of data
analysis. In practice, that distribution has unknown parameter values. In this
section we review methods of using sample data to make inferences about the
parameters. Sections 1.4 and 1.5 cover binomial and multinomial parameters.
1.3.1   Likelihood Functions and Maximum Likelihood Estimation
In this book we use maximum likelihood for parameter estimation. Under
weak regularity conditions, such as the parameter space having fixed dimension with true value falling in its interior, maximum likelihood estimators
have desirable properties: They have large-sample normal distributions; they
are asymptotically consistent, converging to the parameter as n increases;
and they are asymptotically efficient, producing large-sample standard errors
no greater than those from other estimation methods.
Given the data, for a chosen probability distribution the likelihood function
is the probability of those data, treated as a function of the unknown
parameter. The maximum likelihood (ML) estimate is the parameter value
that maximizes this function. This is the parameter value under which the
data observed have the highest probability of occurrence. The parameter
value that maximizes the likelihood function also maximizes the log of that
function. It is simpler to maximize the log likelihood since it is a sum rather
than a product of terms.
We denote a parameter for a generic problem by $\beta$ and its ML estimate
by $\hat{\beta}$. The likelihood function is $l(\beta)$ and the log-likelihood function is
$L(\beta) = \log[l(\beta)]$. For many models, $L(\beta)$ has concave shape and $\hat{\beta}$ is the
point at which the derivative equals 0. The ML estimate is then the solution
of the likelihood equation, $\partial L(\beta)/\partial\beta = 0$. Often, $\beta$ is multidimensional,
denoted by $\boldsymbol{\beta}$, and $\hat{\boldsymbol{\beta}}$ is the solution of a set of likelihood equations.
Let $\mathrm{SE}$ denote the standard error of $\hat{\beta}$, and let $\mathrm{cov}(\hat{\boldsymbol{\beta}})$ denote the
asymptotic covariance matrix of $\hat{\boldsymbol{\beta}}$. Under regularity conditions (Rao 1973,
p. 364), $\mathrm{cov}(\hat{\boldsymbol{\beta}})$ is the inverse of the information matrix. The $(j, k)$ element of
the information matrix is

$$-E\left(\frac{\partial^2 L(\boldsymbol{\beta})}{\partial\beta_j\,\partial\beta_k}\right). \qquad (1.6)$$

The standard errors are the square roots of the diagonal elements of the
inverse information matrix. The greater the curvature of the log likelihood,
the smaller the standard errors. This is reasonable, since large curvature
implies that the log likelihood drops quickly as $\beta$ moves away from $\hat{\beta}$; hence,
the data would have been much more likely to occur if $\beta$ took a value near $\hat{\beta}$
rather than a value far from $\hat{\beta}$.
1.3.2   Likelihood Function and ML Estimate for Binomial Parameter
The part of a likelihood function involving the parameters is called the
kernel. Since the maximization of the likelihood is with respect to the
parameters, the rest is irrelevant.
To illustrate, consider the binomial distribution (1.1). The binomial coefficient $\binom{n}{y}$ has no influence on where the maximum occurs with respect to $\pi$.
Thus, we ignore it and treat the kernel as the likelihood function. The
binomial log likelihood is then

$$L(\pi) = \log\left[\pi^{y}(1 - \pi)^{n-y}\right] = y\log(\pi) + (n - y)\log(1 - \pi). \qquad (1.7)$$

Differentiating with respect to $\pi$ yields

$$\frac{\partial L(\pi)}{\partial\pi} = \frac{y}{\pi} - \frac{n - y}{1 - \pi} = \frac{y - n\pi}{\pi(1 - \pi)}. \qquad (1.8)$$

Equating this to 0 gives the likelihood equation, which has solution $\hat{\pi} = y/n$,
the sample proportion of successes for the $n$ trials.
Calculating $\partial^2 L(\pi)/\partial\pi^2$, taking the expectation, and combining terms,
we get

$$-E\left[\frac{\partial^2 L(\pi)}{\partial\pi^2}\right] = E\left[\frac{y}{\pi^2} + \frac{n - y}{(1 - \pi)^2}\right] = \frac{n}{\pi(1 - \pi)}. \qquad (1.9)$$
Thus, the asymptotic variance of $\hat{\pi}$ is $\pi(1 - \pi)/n$. This is no surprise. Since
$E(Y) = n\pi$ and $\mathrm{var}(Y) = n\pi(1 - \pi)$, the distribution of $\hat{\pi} = Y/n$ has mean
and standard error

$$E(\hat{\pi}) = \pi, \qquad \sigma(\hat{\pi}) = \sqrt{\frac{\pi(1 - \pi)}{n}}.$$
1.3.3   Wald–Likelihood Ratio–Score Test Triad
Three standard ways exist to use the likelihood function to perform
large-sample inference. We introduce these for a significance test of a null
hypothesis $H_0$: $\beta = \beta_0$ and then discuss their relation to interval estimation.
They all exploit the large-sample normality of ML estimators.
With nonnull standard error $\mathrm{SE}$ of $\hat{\beta}$, the test statistic

$$z = (\hat{\beta} - \beta_0)/\mathrm{SE}$$

has an approximate standard normal distribution when $\beta = \beta_0$. One refers $z$
to the standard normal table to obtain one- or two-sided $P$-values. Equivalently, for the two-sided alternative, $z^2$ has a chi-squared null distribution
with 1 degree of freedom (df); the $P$-value is then the right-tailed chi-squared
probability above the observed value. This type of statistic, using the nonnull
standard error, is called a Wald statistic (Wald 1943).
The multivariate extension for the Wald test of $H_0$: $\boldsymbol{\beta} = \boldsymbol{\beta}_0$ has test
statistic

$$W = (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0)'\left[\mathrm{cov}(\hat{\boldsymbol{\beta}})\right]^{-1}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0).$$

(The prime on a vector or matrix denotes the transpose.) The nonnull
covariance is based on the curvature (1.6) of the log likelihood at $\hat{\boldsymbol{\beta}}$. The
asymptotic multivariate normal distribution for $\hat{\boldsymbol{\beta}}$ implies an asymptotic
chi-squared distribution for $W$. The df equal the rank of $\mathrm{cov}(\hat{\boldsymbol{\beta}})$, which is the
number of nonredundant parameters in $\boldsymbol{\beta}$.
A second general-purpose method uses the likelihood function through
the ratio of two maximizations: (1) the maximum over the possible parameter
values under $H_0$, and (2) the maximum over the larger set of parameter
values permitting $H_0$ or an alternative $H_a$ to be true. Let $l_0$ denote the
maximized value of the likelihood function under $H_0$, and let $l_1$ denote the
maximized value generally (i.e., under $H_0 \cup H_a$). For instance, for parameter
vector $\boldsymbol{\beta} = (\boldsymbol{\beta}_0, \boldsymbol{\beta}_1)$ and $H_0$: $\boldsymbol{\beta}_0 = \mathbf{0}$, $l_1$ is the likelihood function calculated
at the $\boldsymbol{\beta}$ value for which the data would have been most likely; $l_0$ is the
likelihood function calculated at the $\boldsymbol{\beta}_1$ value for which the data would
have been most likely, when $\boldsymbol{\beta}_0 = \mathbf{0}$. Then $l_1$ is always at least as large as
$l_0$, since $l_0$ results from maximizing over a restricted set of the parameter
values.
The ratio $\Lambda = l_0/l_1$ of the maximized likelihoods cannot exceed 1. Wilks
(1935, 1938) showed that $-2\log\Lambda$ has a limiting null chi-squared distribution, as $n \to \infty$. The df equal the difference in the dimensions of the
parameter spaces under $H_0 \cup H_a$ and under $H_0$. The likelihood-ratio test
statistic equals

$$-2\log\Lambda = -2\log(l_0/l_1) = -2(L_0 - L_1),$$

where $L_0$ and $L_1$ denote the maximized log-likelihood functions.
The third method uses the score statistic, due to R. A. Fisher and C. R.
Rao. The score test is based on the slope and expected curvature of the
log-likelihood function $L(\beta)$ at the null value $\beta_0$. It utilizes the size of the
score function

$$u(\beta) = \partial L(\beta)/\partial\beta,$$

evaluated at $\beta_0$. The value $u(\beta_0)$ tends to be larger in absolute value when $\hat{\beta}$
is farther from $\beta_0$. Denote $-E[\partial^2 L(\beta)/\partial\beta^2]$ (i.e., the information) evaluated at $\beta_0$ by $\iota(\beta_0)$. The score statistic is the ratio of $u(\beta_0)$ to its null $\mathrm{SE}$,
which is $[\iota(\beta_0)]^{1/2}$. This has an approximate standard normal null distribution. The chi-squared form of the score statistic is

$$\frac{[u(\beta_0)]^2}{\iota(\beta_0)} = \frac{\left[\partial L(\beta)/\partial\beta_0\right]^2}{-E\left[\partial^2 L(\beta)/\partial\beta_0^2\right]},$$

where the partial derivative notation reflects derivatives with respect to $\beta$
that are evaluated at $\beta_0$. In the multiparameter case, the score statistic is a
quadratic form based on the vector of partial derivatives of the log likelihood
with respect to $\boldsymbol{\beta}$ and the inverse information matrix, both evaluated at the
$H_0$ estimates (i.e., assuming that $\boldsymbol{\beta} = \boldsymbol{\beta}_0$).
Figure 1.1 is a generic plot of a log-likelihood $L(\beta)$ for the univariate
case. It illustrates the three tests of $H_0$: $\beta = 0$. The Wald test uses the
behavior of $L(\beta)$ at the ML estimate $\hat{\beta}$, having chi-squared form $(\hat{\beta}/\mathrm{SE})^2$.
The $\mathrm{SE}$ of $\hat{\beta}$ depends on the curvature of $L(\beta)$ at $\hat{\beta}$. The score test is based
on the slope and curvature of $L(\beta)$ at $\beta = 0$. The likelihood-ratio test
combines information about $L(\beta)$ at both $\hat{\beta}$ and $\beta_0 = 0$. It compares the
log-likelihood values $L_1$ at $\hat{\beta}$ and $L_0$ at $\beta_0 = 0$ using the chi-squared
statistic $-2(L_0 - L_1)$. In Figure 1.1, this statistic is twice the vertical distance between values of $L(\beta)$ at $\hat{\beta}$ and at 0. In a sense, this statistic uses the
most information of the three types of test statistic and is the most versatile.
As $n \to \infty$, the Wald, likelihood-ratio, and score tests have certain asymptotic equivalences (Cox and Hinkley 1974, Sec. 9.3). For small to moderate
sample sizes, the likelihood-ratio test is usually more reliable than the Wald
test.
FIGURE 1.1   Log-likelihood function and information used in three tests of $H_0$: $\beta = 0$.

1.3.4   Constructing Confidence Intervals
In practice, it is more informative to construct confidence intervals for
parameters than to test hypotheses about their values. For any of the three
test methods, a confidence interval results from inverting the test. For
instance, a 95% confidence interval for $\beta$ is the set of $\beta_0$ for which the test
of $H_0$: $\beta = \beta_0$ has a $P$-value exceeding 0.05.
Let $z_a$ denote the $z$-score from the standard normal distribution having
right-tailed probability $a$; this is the $100(1 - a)$ percentile of that distribution.
Let $\chi_{\mathrm{df}}^2(a)$ denote the $100(1 - a)$ percentile of the chi-squared distribution
with degrees of freedom df. $100(1 - \alpha)\%$ confidence intervals based on
asymptotic normality use $z_{\alpha/2}$, for instance $z_{0.025} = 1.96$ for 95% confidence.
The Wald confidence interval is the set of $\beta_0$ for which $|\hat{\beta} - \beta_0|/\mathrm{SE} < z_{\alpha/2}$.
This gives the interval $\hat{\beta} \pm z_{\alpha/2}(\mathrm{SE})$. The likelihood-ratio-based confidence
interval is the set of $\beta_0$ for which $-2[L(\beta_0) - L(\hat{\beta})] < \chi_1^2(\alpha)$. [Recall
that $\chi_1^2(\alpha) = z_{\alpha/2}^2$.]
When $\hat{\beta}$ has a normal distribution, the log-likelihood function has a
parabolic shape (i.e., a second-degree polynomial). For small samples with
categorical data, $\hat{\beta}$ may be far from normality and the log-likelihood function
can be far from a symmetric, parabolic-shaped curve. This can also happen
with moderate to large samples when a model contains many parameters. In
such cases, inference based on asymptotic normality of $\hat{\beta}$ may have inadequate performance. A marked divergence in results of Wald and likelihood-ratio inference indicates that the distribution of $\hat{\beta}$ may not be close to
normality. The example in Section 1.4.3 illustrates this with quite different
confidence intervals for different methods. In many such cases, inference can
instead utilize an exact small-sample distribution or "higher-order" asymptotic methods that improve on simple normality (e.g., Pierce and Peters
1992).
The Wald confidence interval is most common in practice because it is
simple to construct using ML estimates and standard errors reported by
statistical software. The likelihood-ratio-based interval is becoming more
widely available in software and is preferable for categorical data with small
to moderate n. For the best known statistical model, regression for a normal
response, the three types of inference necessarily provide identical results.
1.4   STATISTICAL INFERENCE FOR BINOMIAL PARAMETERS
In this section we illustrate inference methods for categorical data by
presenting tests and confidence intervals for the binomial parameter $\pi$,
based on $y$ successes in $n$ independent trials. In Section 1.3.2 we obtained
the likelihood function and ML estimator $\hat{\pi} = y/n$ of $\pi$.
1.4.1   Tests about a Binomial Parameter
Consider $H_0$: $\pi = \pi_0$. Since $H_0$ has a single parameter, we use the normal
rather than chi-squared forms of Wald and score test statistics. They permit
tests against one-sided as well as two-sided alternatives. The Wald statistic is

$$z_W = \frac{\hat{\pi} - \pi_0}{\mathrm{SE}} = \frac{\hat{\pi} - \pi_0}{\sqrt{\hat{\pi}(1 - \hat{\pi})/n}}. \qquad (1.10)$$

Evaluating the binomial score (1.8) and information (1.9) at $\pi_0$ yields

$$u(\pi_0) = \frac{y}{\pi_0} - \frac{n - y}{1 - \pi_0}, \qquad \iota(\pi_0) = \frac{n}{\pi_0(1 - \pi_0)}.$$
The normal form of the score statistic simplifies to

$$z_S = \frac{u(\pi_0)}{[\iota(\pi_0)]^{1/2}} = \frac{y - n\pi_0}{\sqrt{n\pi_0(1 - \pi_0)}} = \frac{\hat{\pi} - \pi_0}{\sqrt{\pi_0(1 - \pi_0)/n}}. \qquad (1.11)$$

Whereas the Wald statistic $z_W$ uses the standard error evaluated at $\hat{\pi}$, the
score statistic $z_S$ uses it evaluated at $\pi_0$. The score statistic is preferable, as
it uses the actual null $\mathrm{SE}$ rather than an estimate. Its null sampling distribution is closer to standard normal than that of the Wald statistic.
The binomial log-likelihood function (1.7) equals $L_0 = y\log\pi_0 + (n - y)\log(1 - \pi_0)$ under $H_0$ and $L_1 = y\log\hat{\pi} + (n - y)\log(1 - \hat{\pi})$ more
generally. The likelihood-ratio test statistic simplifies to

$$-2(L_0 - L_1) = 2\left[y\log\frac{\hat{\pi}}{\pi_0} + (n - y)\log\frac{1 - \hat{\pi}}{1 - \pi_0}\right].$$
Expressed as

$$-2(L_0 - L_1) = 2\left[y\log\frac{y}{n\pi_0} + (n - y)\log\frac{n - y}{n - n\pi_0}\right],$$

it compares observed success and failure counts to fitted (i.e., null) counts by

$$2\sum \mathrm{observed}\,\log\left(\frac{\mathrm{observed}}{\mathrm{fitted}}\right). \qquad (1.12)$$

We'll see that this formula also holds for tests about Poisson and multinomial
parameters. Since no unknown parameters occur under $H_0$ and one occurs
under $H_a$, (1.12) has an asymptotic chi-squared distribution with df $= 1$.
1.4.2   Confidence Intervals for a Binomial Parameter
A significance test merely indicates whether a particular value (such as
$\pi = 0.5$) is plausible. We learn more by using a confidence interval to
determine the range of plausible values.
Inverting the Wald test statistic gives the interval of $\pi_0$ values for which
$|z_W| < z_{\alpha/2}$, or

$$\hat{\pi} \pm z_{\alpha/2}\sqrt{\frac{\hat{\pi}(1 - \hat{\pi})}{n}}. \qquad (1.13)$$

Historically, this was one of the first confidence intervals used for any
parameter (Laplace 1812, p. 283). Unfortunately, it performs poorly unless $n$
is very large (e.g., Brown et al. 2001). The actual coverage probability usually
falls below the nominal confidence coefficient, much below when $\pi$ is near 0
or 1. A simple adjustment that adds $\frac{1}{2}z_{\alpha/2}^2$ observations of each type to the
sample before using this formula performs much better (Problem 1.24).
The score confidence interval contains $\pi_0$ values for which $|z_S| < z_{\alpha/2}$.
Its endpoints are the $\pi_0$ solutions to the equations

$$\frac{\hat{\pi} - \pi_0}{\sqrt{\pi_0(1 - \pi_0)/n}} = \pm z_{\alpha/2}.$$
These are quadratic in $\pi_0$. First discussed by E. B. Wilson (1927), this
interval is

$$\hat{\pi}\left(\frac{n}{n + z_{\alpha/2}^2}\right) + \frac{1}{2}\left(\frac{z_{\alpha/2}^2}{n + z_{\alpha/2}^2}\right)
\pm z_{\alpha/2}\sqrt{\frac{1}{n + z_{\alpha/2}^2}\left[\hat{\pi}(1 - \hat{\pi})\left(\frac{n}{n + z_{\alpha/2}^2}\right) + \left(\frac{1}{2}\right)\left(\frac{1}{2}\right)\left(\frac{z_{\alpha/2}^2}{n + z_{\alpha/2}^2}\right)\right]}.$$

The midpoint $\tilde{\pi}$ of the interval is a weighted average of $\hat{\pi}$ and $\frac{1}{2}$, where the
weight $n/(n + z_{\alpha/2}^2)$ given $\hat{\pi}$ increases as $n$ increases. Combining terms, this
midpoint equals $\tilde{\pi} = (y + z_{\alpha/2}^2/2)/(n + z_{\alpha/2}^2)$. This is the sample proportion
for an adjusted sample that adds $z_{\alpha/2}^2$ observations, half of each type. The
square of the coefficient of $z_{\alpha/2}$ in this formula is a weighted average of the
variance of a sample proportion when $\pi = \hat{\pi}$ and the variance of a sample
proportion when $\pi = \frac{1}{2}$, using the adjusted sample size $n + z_{\alpha/2}^2$ in place
of $n$. This interval has much better performance than the Wald interval.
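The Wilson formula above simplifies algebraically to a midpoint plus or minus $\bigl(z_{\alpha/2}/(n + z_{\alpha/2}^2)\bigr)\sqrt{\hat{\pi}(1-\hat{\pi})n + z_{\alpha/2}^2/4}$, which the sketch below implements. The data ($y = 9$, $n = 25$) and the 95% level are illustrative choices.

```python
import math

# Sketch: the Wilson score interval, obtained by solving |z_S| = z_{a/2}
# for pi_0. Illustrative data: y = 9, n = 25, 95% level.
y, n, z = 9, 25, 1.96
pi_hat = y / n

mid = (y + z**2 / 2) / (n + z**2)          # midpoint: adjusted proportion
half_width = (z / (n + z**2)) * math.sqrt(pi_hat * (1 - pi_hat) * n + z**2 / 4)
lower, upper = mid - half_width, mid + half_width
```

Note the midpoint 0.379 is pulled from $\hat{\pi} = 0.36$ toward $\frac{1}{2}$, as described above.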
The likelihood-ratio-based confidence interval is more complex computationally, but simple in principle. It is the set of $\pi_0$ for which the likelihood-ratio test has a $P$-value exceeding $\alpha$. Equivalently, it is the set of $\pi_0$ for
which double the log likelihood drops by less than $\chi_1^2(\alpha)$ from its value at the
ML estimate $\hat{\pi} = y/n$.
1.4.3   Proportion of Vegetarians Example
To collect data in an introductory statistics course, recently I gave the
students a questionnaire. One question asked each student whether he or
she was a vegetarian. Of $n = 25$ students, $y = 0$ answered "yes." They were
not a random sample of a particular population, but we use these data to
illustrate 95% confidence intervals for a binomial parameter $\pi$.
Since $y = 0$, $\hat{\pi} = 0/25 = 0$. Using the Wald approach, the 95% confidence interval for $\pi$ is

$$0 \pm 1.96\sqrt{(0.0 \times 1.0)/25}, \quad \text{or} \quad (0, 0).$$

When the observation falls at the boundary of the sample space, often Wald
methods do not provide sensible answers.
By contrast, the 95% score interval equals (0.0, 0.133). This is a more
believable inference. For $H_0$: $\pi = 0.5$, for instance, the score test statistic is
$z_S = (0 - 0.5)/\sqrt{(0.5 \times 0.5)/25} = -5.0$, so 0.5 does not fall in the interval.
By contrast, for $H_0$: $\pi = 0.10$, $z_S = (0 - 0.10)/\sqrt{(0.10 \times 0.90)/25} = -1.67$,
so 0.10 falls in the interval.
When $y = 0$ and $n = 25$, the kernel of the likelihood function is $l(\pi) = \pi^{0}(1 - \pi)^{25} = (1 - \pi)^{25}$. The log likelihood (1.7) is $L(\pi) = 25\log(1 - \pi)$.
Note that $L(\hat{\pi}) = L(0) = 0$. The 95% likelihood-ratio confidence interval is
the set of $\pi_0$ for which the likelihood-ratio statistic

$$-2(L_0 - L_1) = -2\left[L(\pi_0) - L(\hat{\pi})\right] = -50\log(1 - \pi_0) \le \chi_1^2(0.05) = 3.84.$$

The upper bound is $1 - \exp(-3.84/50) = 0.074$, and the confidence interval
equals (0.0, 0.074). [In this book, we use the natural logarithm throughout, so
its inverse is the exponential function $\exp(x) = e^x$.] Figure 1.2 shows the
likelihood and log-likelihood functions and the corresponding confidence
region for $\pi$.
The three large-sample methods yield quite different results. When $\pi$ is
near 0, the sampling distribution of $\hat{\pi}$ is highly skewed to the right for small
$n$. It is worth considering alternative methods not requiring asymptotic
approximations.
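The three intervals for the vegetarian data can be reproduced in a few lines. With $y = 0$ the score equation solves in closed form, giving upper endpoint $z_{\alpha/2}^2/(n + z_{\alpha/2}^2)$; the likelihood-ratio endpoint comes from the inequality just derived.

```python
import math

# Sketch: the three 95% intervals for the text's vegetarian data
# (y = 0 of n = 25): Wald (0, 0), score upper 0.133, LR upper 0.074.
n, z = 25, 1.96
pi_hat = 0.0

# Wald: degenerate at (0, 0) since the estimated SE is 0.
wald = (pi_hat, pi_hat)

# Score: with y = 0, solving |z_S| = z gives upper endpoint z^2/(n + z^2).
score_upper = z**2 / (n + z**2)

# Likelihood ratio: -50 log(1 - pi0) <= 3.84 gives the upper endpoint.
lr_upper = 1 - math.exp(-3.84 / 50)
```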
FIGURE 1.2   Binomial likelihood and log likelihood when $y = 0$ in $n = 25$ trials, and confidence interval for $\pi$.
1.4.4   Exact Small-Sample Inference*¹
With modern computational power, it is not necessary to rely on large-sample approximations for the distribution of statistics such as $\hat{\pi}$. Tests and
confidence intervals can use the binomial distribution directly rather than its
normal approximation. Such inferences occur naturally for small samples, but
apply for any $n$.
We illustrate by testing $H_0$: $\pi = 0.5$ against $H_a$: $\pi \ne 0.5$ for the survey
results on vegetarianism, $y = 0$ with $n = 25$. We noted that the score statistic
equals $z_S = -5.0$. The exact $P$-value for this statistic, based on the null
bin(25, 0.5) distribution, is

$$P(|z| \ge 5.0) = P(Y = 0 \text{ or } Y = 25) = 0.5^{25} + 0.5^{25} = 0.00000006.$$
$100(1 - \alpha)\%$ confidence intervals consist of all $\pi_0$ for which $P$-values
exceed $\alpha$ in exact binomial tests. The best known interval (Clopper and
Pearson 1934) uses the tail method for forming confidence intervals. It
requires each one-sided $P$-value to exceed $\alpha/2$. The lower and upper
endpoints are the solutions in $\pi_0$ to the equations

$$\sum_{k=y}^{n}\binom{n}{k}\pi_0^{k}(1 - \pi_0)^{n-k} = \alpha/2 \quad\text{and}\quad \sum_{k=0}^{y}\binom{n}{k}\pi_0^{k}(1 - \pi_0)^{n-k} = \alpha/2,$$

except that the lower bound is 0 when $y = 0$ and the upper bound is 1 when
$y = n$. When $y = 1, 2, \ldots, n - 1$, from connections between binomial sums
and the incomplete beta function and related cumulative distribution functions (cdf's) of beta and $F$ distributions, the confidence interval equals

$$\left[1 + \frac{n - y + 1}{y\,F_{2y,\,2(n-y+1)}(1 - \alpha/2)}\right]^{-1} < \pi < \left[1 + \frac{n - y}{(y + 1)\,F_{2(y+1),\,2(n-y)}(\alpha/2)}\right]^{-1},$$

where $F_{a,b}(c)$ denotes the $1 - c$ quantile from the $F$ distribution with
degrees of freedom $a$ and $b$. When $y = 0$ with $n = 25$, the Clopper–Pearson
95% confidence interval for $\pi$ is (0.0, 0.137).
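The same endpoints can be written through beta-distribution quantiles, an equivalent form of the binomial tail equations above. The sketch below uses that form and reproduces the (0.0, 0.137) interval for the $y = 0$, $n = 25$ data.

```python
from scipy.stats import beta

# Sketch: Clopper-Pearson 95% limits via the beta-quantile form of the
# binomial tail equations, applied to the text's y = 0, n = 25 case.
def clopper_pearson(y, n, alpha=0.05):
    lower = 0.0 if y == 0 else beta.ppf(alpha / 2, y, n - y + 1)
    upper = 1.0 if y == n else beta.ppf(1 - alpha / 2, y + 1, n - y)
    return lower, upper

lo, hi = clopper_pearson(0, 25)   # text reports (0.0, 0.137)
```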
In principle this approach seems ideal. However, there is a serious
complication. Because of discreteness, the actual coverage probability for any
$\pi$ is at least as large as the nominal confidence level (Casella and Berger
2001, p. 434; Neyman 1935), and it can be much greater. Similarly, for a test
of $H_0$: $\pi = \pi_0$ at a fixed desired size such as 0.05, it is not usually possible
to achieve that size. There is a finite number of possible samples, and hence
a finite number of possible $P$-values, of which 0.05 may not be one. In testing
$H_0$ with fixed $\pi_0$, one can pick a particular $\alpha$ that can occur as a $P$-value.

¹Sections marked with an asterisk are less important for an overview.
FIGURE 1.3   Plot of coverage probabilities for nominal 95% confidence intervals for binomial
parameter $\pi$ when $n = 25$.
For interval estimation, however, this is not an option. This is because
constructing the interval corresponds to inverting an entire range of $\pi_0$
values in $H_0$: $\pi = \pi_0$, and each distinct $\pi_0$ value can have its own set of
possible $P$-values; that is, there is not a single null parameter value $\pi_0$ as in
one test.
For any fixed parameter value, the actual coverage probability can be
much larger than the nominal confidence level. When $n = 25$, Figure 1.3
plots the coverage probabilities as a function of $\pi$ for the Clopper–Pearson
method, the score method, and the Wald method. At a fixed $\pi$ value with a
given method, the coverage probability is the sum of the binomial probabilities of all those samples for which the resulting interval contains that $\pi$.
There are 26 possible samples and 26 corresponding confidence intervals, so
the coverage probability is a sum of somewhere between 0 and 26 binomial
probabilities. As $\pi$ moves from 0 to 1, this coverage probability jumps up or
down whenever $\pi$ moves into or out of one of these intervals. Figure 1.3
shows that coverage probabilities are too low for the Wald method, whereas
the Clopper–Pearson method errs in the opposite direction. The score
method behaves well, except for some $\pi$ values close to 0 or 1. Its coverage
probabilities tend to be near the nominal level, not being consistently
conservative or liberal. This is a good method unless $\pi$ is very close to 0 or 1
(Problem 1.23).
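The coverage curves in Figure 1.3 are computed exactly as described: sum the binomial probabilities of the samples whose intervals cover the given $\pi$. The sketch below does this for the Wald interval with $n = 25$; the two evaluation points $\pi = 0.5$ and $\pi = 0.05$ are illustrative.

```python
from math import comb

# Sketch: actual coverage probability of the nominal 95% Wald interval
# at a given pi, computed by the Figure 1.3 construction (n = 25).
def wald_interval(y, n, z=1.96):
    p = y / n
    half = z * (p * (1 - p) / n) ** 0.5
    return p - half, p + half

def coverage(pi, n=25):
    total = 0.0
    for y in range(n + 1):
        lo, hi = wald_interval(y, n)
        if lo <= pi <= hi:  # does the interval from this sample cover pi?
            total += comb(n, y) * pi**y * (1 - pi) ** (n - y)
    return total

cov_mid = coverage(0.5)    # near the nominal level in the middle...
cov_tail = coverage(0.05)  # ...but far below 0.95 near the boundary
```

Swapping in the score or Clopper–Pearson interval function reproduces the other two curves.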
In discrete problems using small-sample distributions, shorter confidence
intervals usually result from inverting a single two-sided test rather than two
one-sided tests. The interval is then the set of parameter values for which the
$P$-value of a two-sided test exceeds $\alpha$. For the binomial parameter, see
Blaker (2000), Blyth and Still (1983), and Sterne (1954) for methods. For
observed outcome $y_o$, with Blaker's approach the $P$-value is the minimum of
the two one-tailed binomial probabilities $P(Y \ge y_o)$ and $P(Y \le y_o)$ plus an
attainable probability in the other tail that is as close as possible to, but not
greater than, that one-tailed probability. The interval is computationally
more complex, although available in software (Blaker gave S-Plus functions).
The result is still conservative, but less so than the Clopper–Pearson interval.
For the vegetarianism example, the 95% confidence interval using the Blaker
exact method is (0.0, 0.128), compared to the Clopper–Pearson interval of
(0.0, 0.137).
1.4.5   Inference Based on the Mid-P-Value*
To adjust for discreteness in small-sample distributions, one can base inference on the mid-P-value (Lancaster 1961). For a test statistic $T$ with observed
value $t_o$ and one-sided $H_a$ such that large $T$ contradicts $H_0$,

$$\text{mid-}P\text{-value} = \tfrac{1}{2}P(T = t_o) + P(T > t_o),$$

with probabilities calculated from the null distribution. Thus, the mid-P-value
is less than the ordinary P-value by half the probability of the observed
result. Compared to the ordinary P-value, the mid-P-value behaves more like
the P-value for a test statistic having a continuous distribution. The sum of
its two one-sided P-values equals 1.0. Although discrete, under $H_0$ its null
distribution is more like the uniform distribution that occurs in the continuous case. For instance, it has a null expected value of 0.5, whereas this
expected value exceeds 0.5 for the ordinary P-value for a discrete test
statistic.
Unlike an exact test with ordinary P-value, a test using the mid-P-value
does not guarantee that the probability of type I error is no greater than a
nominal value (Problem 1.19). However, it usually performs well, typically
being a bit conservative. It is less conservative than the ordinary exact test.
Similarly, one can form less conservative confidence intervals by inverting
tests using the exact distribution with the mid-P-value (e.g., the 95% confidence interval is the set of parameter values for which the mid-P-value
exceeds 0.05).
For testing $H_0$: $\pi = 0.5$ against $H_a$: $\pi \ne 0.5$ in the example about the
proportion of vegetarians, with $y = 0$ for $n = 25$, the result observed is the
most extreme possible. Thus the mid-P-value is half the ordinary P-value, or
0.00000003. Using the Clopper–Pearson inversion of the exact binomial test
but with the mid-P-value yields a 95% confidence interval of (0.000, 0.113)
for $\pi$, compared to (0.000, 0.137) for the ordinary Clopper–Pearson interval.
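For the $y = 0$ case the mid-P upper limit solves in closed form: the one-sided mid-P-value is $\frac{1}{2}P(Y = 0) = \frac{1}{2}(1 - \pi)^{25}$, and setting it equal to $\alpha/2$ gives the endpoint, as the sketch below checks.

```python
from math import exp, log

# Sketch: with y = 0 of n = 25, the one-sided mid-P-value is
# (1/2)(1 - pi)^25; setting it to alpha/2 = 0.025 gives the upper limit
# of the 95% mid-P interval reported in the text.
n, alpha = 25, 0.05
# Solve 0.5 * (1 - pi)^n = alpha / 2  =>  pi = 1 - alpha**(1/n)
upper_mid_p = 1 - exp(log(alpha) / n)
```

The result, about 0.113, sits between the likelihood-ratio limit 0.074 and the conservative Clopper–Pearson limit 0.137.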
The mid-P-value seems a sensible compromise between having overly
conservative inference and using irrelevant randomization to eliminate problems from discreteness. We recommend it both for tests and confidence
intervals with highly discrete distributions.
1.5   STATISTICAL INFERENCE FOR MULTINOMIAL PARAMETERS
We now present inference for multinomial parameters $\{\pi_j\}$. Of $n$ observations, $n_j$ occur in category $j$, $j = 1, \ldots, c$.
1.5.1   Estimation of Multinomial Parameters
First, we obtain ML estimates of $\{\pi_j\}$. As a function of $\{\pi_j\}$, the multinomial
probability mass function (1.2) is proportional to the kernel

$$\prod_j \pi_j^{n_j} \qquad \text{where all } \pi_j \ge 0 \text{ and } \sum_j \pi_j = 1. \qquad (1.14)$$

The ML estimates are the $\{\pi_j\}$ that maximize (1.14).
The multinomial log-likelihood function is

$$L(\boldsymbol{\pi}) = \sum_j n_j \log \pi_j.$$

To eliminate redundancies, we treat $L$ as a function of $(\pi_1, \ldots, \pi_{c-1})$, since
$\pi_c = 1 - (\pi_1 + \cdots + \pi_{c-1})$. Thus, $\partial\pi_c/\partial\pi_j = -1$, $j = 1, \ldots, c - 1$.
Since

$$\frac{\partial \log \pi_c}{\partial \pi_j} = \frac{1}{\pi_c}\,\frac{\partial \pi_c}{\partial \pi_j} = -\frac{1}{\pi_c},$$

differentiating $L(\boldsymbol{\pi})$ with respect to $\pi_j$ gives the likelihood equation

$$\frac{\partial L(\boldsymbol{\pi})}{\partial \pi_j} = \frac{n_j}{\pi_j} - \frac{n_c}{\pi_c} = 0.$$

The ML solution satisfies $\hat{\pi}_j/\hat{\pi}_c = n_j/n_c$. Now

$$\sum_j \hat{\pi}_j = 1 = \frac{\hat{\pi}_c}{n_c}\left(\sum_j n_j\right) = \frac{\hat{\pi}_c\, n}{n_c},$$

so $\hat{\pi}_c = n_c/n$ and then $\hat{\pi}_j = n_j/n$. From general results presented later in
the book (Section 8.6), this solution does maximize the likelihood. Thus, the
ML estimates of $\{\pi_j\}$ are the sample proportions.
1.5.2   Pearson Statistic for Testing a Specified Multinomial
In 1900 the eminent British statistician Karl Pearson introduced a hypothesis
test that was one of the first inferential methods. It had a revolutionary
impact on categorical data analysis, which had focused on describing associations. Pearson's test evaluates whether multinomial parameters equal certain
specified values. His original motivation in developing this test was to analyze
whether possible outcomes on a particular Monte Carlo roulette wheel were
equally likely (Stigler 1986).
Consider $H_0$: $\pi_j = \pi_{j0}$, $j = 1, \ldots, c$, where $\sum_j \pi_{j0} = 1$. When $H_0$ is true,
the expected values of $\{n_j\}$, called expected frequencies, are $\mu_j = n\pi_{j0}$, $j = 1, \ldots, c$. Pearson proposed the test statistic

$$X^2 = \sum_j \frac{(n_j - \mu_j)^2}{\mu_j}. \qquad (1.15)$$

Greater differences $\{n_j - \mu_j\}$ produce greater $X^2$ values, for fixed $n$. Let $X_o^2$
denote the observed value of $X^2$. The $P$-value is the null value of $P(X^2 \ge X_o^2)$. This equals the sum of the null multinomial probabilities of all count
arrays (having a sum of $n$) with $X^2 \ge X_o^2$.
For large samples, $X^2$ has approximately a chi-squared distribution with
df $= c - 1$. The $P$-value is approximated by $P(\chi_{c-1}^2 \ge X_o^2)$, where $\chi_{c-1}^2$
denotes a chi-squared random variable with df $= c - 1$. Statistic (1.15) is
called the Pearson chi-squared statistic.
1.5.3   Example: Testing Mendel's Theories
Among its many applications, Pearson's test was used in genetics to test
Mendel's theories of natural inheritance. Mendel crossed pea plants of pure
yellow strain with plants of pure green strain. He predicted that second-generation hybrid seeds would be 75% yellow and 25% green, yellow being the
dominant strain. One experiment produced $n = 8023$ seeds, of which $n_1 = 6022$ were yellow and $n_2 = 2001$ were green. The expected frequencies for
$H_0$: $\pi_{10} = 0.75$, $\pi_{20} = 0.25$ are $\mu_1 = 8023(0.75) = 6017.25$ and $\mu_2 = 2005.75$.
The Pearson statistic $X^2 = 0.015$ (df $= 1$) has a $P$-value of $P = 0.90$. This
does not contradict Mendel's hypothesis.
Mendel performed several experiments of this type. In 1936, R. A. Fisher
summarized Mendel's results. He used the reproductive property of chi-squared:
If X²_1, ..., X²_k are independent chi-squared statistics with degrees
of freedom ν_1, ..., ν_k, then Σ_i X²_i has a chi-squared distribution with
df = Σ_i ν_i. Fisher obtained a summary chi-squared statistic equal to 42, with
df = 84. A chi-squared distribution with df = 84 has mean 84 and standard
deviation (2 × 84)^{1/2} = 13.0, and the right-tailed probability above 42 is
P = 0.99996. In other words, the chi-squared statistic was so small that the fit
seemed too good.
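Fisher's pooled figure is easy to check numerically. This sketch uses only the stated chi-squared moments and a normal approximation to the tail probability (reasonable at df = 84, though the exact right-tail value is 0.99996):

```python
import math

# By the reproductive property, the pooled statistic from Mendel's
# experiments is chi-squared with df = sum of the individual df's.
x2_sum, df_sum = 42.0, 84            # Fisher's totals

mean = df_sum                        # chi-squared mean equals df
sd = math.sqrt(2 * df_sum)           # standard deviation = (2 df)^(1/2)

# Normal approximation to P(chi-squared_84 >= 42).
z = (x2_sum - mean) / sd
p_approx = 0.5 * math.erfc(z / math.sqrt(2))
print(round(sd, 2), round(p_approx, 4))
```

The observed total sits more than three standard deviations below its null mean, so almost all of the null distribution lies above it.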
Fisher commented: "The general level of agreement between Mendel's
expectations and his reported results shows that it is closer than would be
expected in the best of several thousand repetitions . . . . I have no doubt
that Mendel was deceived by a gardening assistant, who knew only too well
what his principal expected from each trial made." In a letter written at the
time (see Box 1978, p. 297), he stated: "Now, when data have been faked,
I know very well how generally people underestimate the frequency of wide
chance deviations, so that the tendency is always to make them agree too well
with expectations." In summary, goodness-of-fit tests can reveal not only
when a fit is inadequate, but also when it is better than random fluctuations
would have us expect. [R. A. Fisher's daughter, Joan Fisher Box (1978,
pp. 295–300), and Freedman et al. (1978, pp. 420–428, 478) discussed
Fisher's analysis of Mendel's data and the accompanying controversy. Despite
possible difficulties with Mendel's data, subsequent work led to general
acceptance of his theories.]
1.5.4  Chi-Squared Theoretical Justification*
We now outline why Pearson's statistic has a limiting chi-squared distribution.
For a multinomial sample (n_1, ..., n_c) of size n, the marginal distribution
of n_j is the bin(n, π_j) distribution. For large n, by the normal
approximation to the binomial, n_j (and π̂_j = n_j/n) have approximate normal
distributions. More generally, by the central limit theorem, the sample proportions
π̂ = (n_1/n, ..., n_{c−1}/n)′ have an approximate multivariate normal
distribution (Section 14.1.4). Let Σ_0 denote the null covariance matrix of √n π̂,
and let π_0 = (π_{10}, ..., π_{c−1,0})′. Under H_0, since √n(π̂ − π_0) converges to a
N(0, Σ_0) distribution, the quadratic form

    n(π̂ − π_0)′ Σ_0⁻¹ (π̂ − π_0)    (1.16)

has distribution converging to chi-squared with df = c − 1.
In Section 14.1.4 we show that the covariance matrix of √n π̂ has elements

    σ_jk = −π_j π_k if j ≠ k,    σ_jj = π_j(1 − π_j).

The matrix Σ_0⁻¹ has (j, k)th element 1/π_{c0} when j ≠ k and (1/π_{j0} + 1/π_{c0})
when j = k. (You can verify this by showing that Σ_0 Σ_0⁻¹ equals the identity
matrix.) With this substitution, direct calculation (with appropriate combining
of terms) shows that (1.16) simplifies to X². In Section 14.3 we provide a
formal proof in a more general setting.
This argument is similar to Pearson's in 1900. R. A. Fisher (1922) gave a
simpler justification, the gist of which follows: Suppose that (n_1, ..., n_c) are
independent Poisson random variables with means (μ_1, ..., μ_c). For large
{μ_j}, the standardized values {z_j = (n_j − μ_j)/√μ_j} have approximate
standard normal distributions. Thus, Σ_j z_j² = X² has an approximate chi-squared
distribution with c degrees of freedom. Adding the single linear constraint
Σ_j (n_j − μ_j) = 0, thus converting the Poisson distributions to a multinomial,
we lose a degree of freedom.
When c = 2, Pearson's X² simplifies to the square of the normal score
statistic (1.11). For Mendel's data, π̂_1 = 6022/8023, π_{10} = 0.75, n = 8023,
and z_S = 0.123, for which X² = (0.123)² = 0.015. In fact, for general c the
Pearson test is the score test about multinomial parameters.
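This equivalence for c = 2 is easy to check numerically with Mendel's counts (a sketch; z_S is the binomial score statistic of (1.11)):

```python
import math

# Binomial score statistic for H0: pi = pi0, and Pearson's X^2 for the
# corresponding two-category multinomial. Algebraically z_S^2 = X^2.
n, y, pi0 = 8023, 6022, 0.75                 # Mendel's yellow-seed data
pi_hat = y / n
z_s = (pi_hat - pi0) / math.sqrt(pi0 * (1 - pi0) / n)

mu = [n * pi0, n * (1 - pi0)]                # null expected frequencies
x2 = sum((o - m) ** 2 / m for o, m in zip([y, n - y], mu))

print(round(z_s ** 2, 3), round(x2, 3))      # 0.015 0.015
```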
1.5.5  Likelihood-Ratio Chi-Squared
An alternative test for multinomial parameters uses the likelihood-ratio test.
The kernel of the multinomial likelihood is (1.14). Under H_0 the likelihood is
maximized when π̂_j = π_{j0}. In the general case, it is maximized when π̂_j =
n_j/n. The ratio of the likelihoods equals

    Λ = Π_j (π_{j0})^{n_j} / Π_j (n_j/n)^{n_j}.

Thus, the likelihood-ratio statistic, denoted by G², is

    G² = −2 log Λ = 2 Σ_j n_j log(n_j/nπ_{j0}).    (1.17)
This statistic, which has form (1.12), is called the likelihood-ratio chi-squared
statistic. The larger the value of G², the greater the evidence against H_0.
In the general case, the parameter space consists of {π_j} subject to
Σ_j π_j = 1, so the dimensionality is c − 1. Under H_0, the {π_j} are specified
completely, so the dimension is 0. The difference in these dimensions equals
(c − 1). For large n, G² has a chi-squared null distribution with df = c − 1.
When H_0 holds, the Pearson X² and the likelihood ratio G² both have
asymptotic chi-squared distributions with df = c − 1. In fact, they are asymptotically
equivalent in that case; specifically, X² − G² converges in probability
to zero (Section 14.3.4). When H_0 is false, they tend to grow proportionally
to n; they need not take similar values, however, even for very large n.
For fixed c, as n increases the distribution of X² usually converges to
chi-squared more quickly than that of G². The chi-squared approximation
is usually poor for G² when n/c < 5. When c is large, it can be decent for
X² for n/c as small as 1 if the table does not contain both very small and
moderately large expected frequencies. We provide further guidelines in
Section 9.8.4. Alternatively, one can use the multinomial probabilities
to generate exact distributions of these test statistics (Good et al. 1970).
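A quick numerical comparison of (1.15) and (1.17) on Mendel's data (a sketch; both statistics are tiny here, consistent with the very good fit):

```python
import math

def pearson_x2(counts, probs):
    """Pearson statistic, equation (1.15)."""
    n = sum(counts)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(counts, probs))

def lr_g2(counts, probs):
    """G^2 = 2 * sum_j n_j * log(n_j / (n * pi_j0)), equation (1.17)."""
    n = sum(counts)
    return 2 * sum(o * math.log(o / (n * p)) for o, p in zip(counts, probs))

counts, probs = [6022, 2001], [0.75, 0.25]   # Mendel's seed counts
x2, g2 = pearson_x2(counts, probs), lr_g2(counts, probs)
print(round(x2, 3), round(g2, 3))            # 0.015 0.015
```

With expected frequencies this large and a fit this close, the two statistics essentially coincide, as the asymptotic equivalence under H_0 suggests.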
1.5.6  Testing with Estimated Expected Frequencies
Pearson's X² (1.15) compares a sample distribution to a hypothetical one
{π_{j0}}. In some applications, {π_{j0} = π_{j0}(θ)} are functions of a smaller set of
unknown parameters θ. ML estimates θ̂ of θ determine ML estimates
{π_{j0}(θ̂)} of {π_{j0}} and hence ML estimates {μ̂_j = nπ_{j0}(θ̂)} of expected
frequencies in X². Replacing {μ_j} by estimates {μ̂_j} affects the distribution
of X². When dim(θ) = p, the true df = (c − 1) − p (Section 14.3.3). Pearson
failed to realize this (Section 16.2).
We now illustrate a goodness-of-fit test with estimated expected frequencies.
A sample of 156 dairy calves born in Okeechobee County, Florida, was
classified according to whether the calves caught pneumonia within 60 days of
birth. Calves that got a pneumonia infection were also classified according to
whether they got a secondary infection within 2 weeks after the first infection
cleared up. Table 1.1 shows the data. Calves that did not get a primary
infection could not get a secondary infection, so no observations can fall in
the category for "no" primary infection and "yes" secondary infection. That
combination is called a structural zero.
A goal of this study was to test whether the probability of primary
infection was the same as the conditional probability of secondary infection,
given that the calf got the primary infection. In other words, if π_{ab} denotes
the probability that a calf is classified in row a and column b of this table,
the null hypothesis is

    H_0: π_{11} + π_{12} = π_{11}/(π_{11} + π_{12}),

or π_{11} = (π_{11} + π_{12})². Let π = π_{11} + π_{12} denote the probability of primary
infection. The null hypothesis states that the probabilities satisfy the structure
that Table 1.2 shows; that is, probabilities in a trinomial for the
categories (yes–yes, yes–no, no–no) for primary–secondary infection equal
(π², π(1 − π), 1 − π).
Let n_{ab} denote the number of observations in category (a, b). The ML
estimate of π is the value maximizing the kernel of the multinomial likelihood

    (π²)^{n_{11}} (π − π²)^{n_{12}} (1 − π)^{n_{22}}.

TABLE 1.1    Primary and Secondary Pneumonia Infections in Calves
                        Secondary Infection^a
Primary Infection       Yes            No
Yes                     30 (38.1)      63 (39.0)
No                      0 (–)          63 (78.9)

Source: Data courtesy of Thang Tran and G. A. Donovan, College of Veterinary Medicine,
University of Florida.
^a Values in parentheses are estimated expected frequencies.
TABLE 1.2    Probability Structure for Hypothesis

                        Secondary Infection
Primary Infection       Yes       No            Total
Yes                     π²        π(1 − π)      π
No                      –         1 − π         1 − π
The log likelihood is

    L(π) = n_{11} log π² + n_{12} log(π − π²) + n_{22} log(1 − π).

Differentiation with respect to π gives the likelihood equation

    2n_{11}/π + n_{12}/π − n_{12}/(1 − π) − n_{22}/(1 − π) = 0.

The solution is

    π̂ = (2n_{11} + n_{12})/(2n_{11} + 2n_{12} + n_{22}).
For Table 1.1, π̂ = 0.494. Since n = 156, the estimated expected frequencies
are μ̂_{11} = nπ̂² = 38.1, μ̂_{12} = n(π̂ − π̂²) = 39.0, and μ̂_{22} = n(1 − π̂) =
78.9. Table 1.1 shows them. Pearson's statistic is X² = 19.7. Since the c = 3
possible responses have p = 1 parameter (π) determining the expected
frequencies, df = (3 − 1) − 1 = 1. There is strong evidence against H_0 (P =
0.00001). Inspection of Table 1.1 reveals that many more calves got a primary
infection but not a secondary infection than H_0 predicts. The researchers
concluded that the primary infection had an immunizing effect that reduced
the likelihood of a secondary infection.
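The calculation just described can be reproduced in a few lines (a sketch; with df = 1 the chi-squared right-tail probability reduces to a complementary error function):

```python
import math

# Calf counts from Table 1.1: n11 = primary & secondary infection,
# n12 = primary only, n22 = neither. H0 says the trinomial
# probabilities are (pi^2, pi*(1 - pi), 1 - pi).
n11, n12, n22 = 30, 63, 63
n = n11 + n12 + n22

# ML estimate of pi from the likelihood equation's solution.
pi_hat = (2 * n11 + n12) / (2 * n11 + 2 * n12 + n22)

# Estimated expected frequencies under H0.
mu = [n * pi_hat ** 2, n * (pi_hat - pi_hat ** 2), n * (1 - pi_hat)]

x2 = sum((o - m) ** 2 / m for o, m in zip([n11, n12, n22], mu))
p_value = math.erfc(math.sqrt(x2 / 2))       # chi-squared tail, df = 1
print(round(pi_hat, 3), [round(m, 1) for m in mu], round(x2, 1))
```

This reproduces π̂ = 0.494, the expected frequencies 38.1, 39.0, 78.9, and X² = 19.7 with a P-value of about 0.00001.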
NOTES
Section 1.1: Categorical Response Data
1.1. Stevens (1951) defined (nominal, ordinal, interval) scales of measurement. Other scales
result from mixtures of these types. For instance, partially ordered scales occur when
subjects respond to questions having categories ordered except for don't know or undecided
categories.
Section 1.3: Statistical Inference for Categorical Data
1.2. The score method does not use β̂. Thus, when β is a model parameter, one can usually
compute the score statistic for testing H_0: β = 0 without fitting the model. This is
advantageous when fitting several models in an exploratory analysis and model fitting is
computationally intensive. An advantage of the score and likelihood-ratio methods is that
they apply even when |β̂| = ∞. In that case, one cannot compute the Wald statistic.
Another disadvantage of the Wald method is that its results depend on the parameterization;
inference based on β̂ and its SE is not equivalent to inference based on a nonlinear
function of it, such as log β̂ and its SE.
Section 1.4: Statistical Inference for Binomial Parameters
1.3. Among others, Agresti and Coull (1998), Blyth and Still (1983), Brown et al. (2001), Ghosh
(1979), and Newcombe (1998a) showed the superiority of the score interval to the Wald
interval for π. Of the "exact" methods, Blaker's (2000) has particularly good properties. It is
contained in the Clopper–Pearson interval and has a nestedness property whereby an
interval of higher nominal confidence level necessarily contains one of lower level.
1.4. Using continuity corrections with large-sample methods provides approximations to exact
small-sample methods. Thus, they tend to behave conservatively. We do not present them,
since if one prefers an exact method, with modern computational power it can be used
directly rather than approximated.
1.5. In theory, one can eliminate problems with discreteness in tests by performing a supplementary
randomization on the boundary of a critical region (see Problem 1.19). In rejecting the
null at the boundary with a certain probability, one can obtain a fixed overall type I error
probability even when it is not an achievable P-value. For such randomization, the
one-sided P-value is

    randomized P-value = U × P(T = t_o) + P(T > t_o),

where U denotes a uniform(0, 1) random variable (Stevens 1950). In practice, this is not
used, as it is absurd to let this random number influence a decision. The mid P-value
replaces the arbitrary uniform multiple U × P(T = t_o) by its expected value.
Section 1.5: Statistical Inference for Multinomial Parameters
1.6. The chi-squared distribution has mean df, variance 2 df, and skewness (8/df)^{1/2}. It is
approximately normal when df is large. Greenwood and Nikulin (1996), Kendall and Stuart
(1979), and Lancaster (1969) presented other properties. Cochran (1952) presented a
historical survey of chi-squared tests of fit. See also Cressie and Read (1989), Koch and
Bhapkar (1982), Koehler (1998), and Moore (1986b).
PROBLEMS
Applications
1.1
Identify each variable as nominal, ordinal, or interval.
a. UK political party preference (Labour, Conservative, Social Democrat)
b. Anxiety rating (none, mild, moderate, severe, very severe)
c. Patient survival (in number of months)
d. Clinic location (London, Boston, Madison, Rochester, Montreal)
e. Response of tumor to chemotherapy (complete elimination, partial
reduction, stable, growth progression)
f. Favorite beverage (water, juice, milk, soft drink, beer, wine)
g. Appraisal of company's inventory level (too low, about right, too
high)
1.2
Each of 100 multiple-choice questions on an exam has four possible
answers, one of which is correct. For each question, a student guesses
by selecting an answer randomly.
a. Specify the distribution of X = the student's number of correct answers.
b. Find the mean and standard deviation of that distribution. Would it
be surprising if the student made at least 50 correct responses?
Why?
c. Specify the distribution of (n_1, n_2, n_3, n_4), where n_j is the number
of times the student picked choice j.
d. Find E(n_j), var(n_j), cov(n_j, n_k), and corr(n_j, n_k).
1.3
An experiment studies the number of insects that survive a certain
dose of an insecticide, using several batches of insects of size n each.
The insects are sensitive to factors that vary among batches during the
experiment but were not measured, such as temperature level. Explain
why the distribution of the number of insects per batch surviving the
experiment might show overdispersion relative to a bin(n, π) distribution.
1.4
In his autobiography A Sort of Life, British author Graham Greene
described a period of severe mental depression during which he played
Russian Roulette. This ‘‘game’’ consists of putting a bullet in one of
the six chambers of a pistol, spinning the chambers to select one at
random, and then firing the pistol once at one’s head.
a. Greene played this game six times and was lucky that none of them
resulted in a bullet firing. Find the probability of this outcome.
b. Suppose that he had kept playing this game until the bullet fired.
Let Y denote the number of the game on which it fires. Show the
probability mass function for Y, and justify.
1.5
Consider the statement, ‘‘Please tell me whether or not you think it
should be possible for a pregnant woman to obtain a legal abortion if
she is married and does not want any more children.’’ For the 1996
General Social Survey, conducted by the National Opinion Research
Center (NORC), 842 replied "yes" and 982 replied "no." Let π denote
the population proportion who would reply "yes." Find the P-value for
testing H_0: π = 0.5 using the score test, and construct a 95% confidence
interval for π. Interpret the results.
1.6
Refer to the vegetarianism example in Section 1.4.3. For testing
H_0: π = 0.5 against H_a: π ≠ 0.5, show that:
a. The likelihood-ratio statistic equals 2[25 log(25/12.5)] = 34.7.
b. The chi-squared form of the score statistic equals 25.0.
c. The Wald z or chi-squared statistic is infinite.
1.7
In a crossover trial comparing a new drug to a standard, π denotes the
probability that the new one is judged better. It is desired to estimate
π and test H_0: π = 0.5 against H_a: π ≠ 0.5. In 20 independent
observations, the new drug is better each time.
a. Find and sketch the likelihood function. Give the ML estimate of π.
b. Conduct a Wald test and construct a 95% Wald confidence interval
for π. Are these sensible?
c. Conduct a score test, reporting the P-value. Construct a 95% score
confidence interval. Interpret.
d. Conduct a likelihood-ratio test and construct a likelihood-based
95% confidence interval. Interpret.
e. Construct an exact binomial test and 95% confidence interval.
Interpret.
f. Suppose that researchers wanted a sufficiently large sample to
estimate the probability of preferring the new drug to within 0.05,
with confidence 0.95. If the true probability is 0.90, about how large
a sample is needed?
1.8
In an experiment on chlorophyll inheritance in maize, for 1103 seedlings
of self-fertilized heterozygous green plants, 854 seedlings were green
and 249 were yellow. Theory predicts the ratio of green to yellow is 3:1.
Test the hypothesis that 3:1 is the true ratio. Report the P-value, and
interpret.
1.9
Table 1.3 contains Ladislaus von Bortkiewicz's data on deaths of
soldiers in the Prussian army from kicks by army mules (Fisher 1934;
Quine and Seneta 1987). The data refer to 10 army corps, each
observed for 20 years. In 109 corps-years of exposure, there were no
deaths, in 65 corps-years there was one death, and so on. Estimate the
mean and test whether probabilities of occurrences in these five
categories follow a Poisson distribution (truncated for 4 and above).
TABLE 1.3    Data for Problem 1.9

Number of Deaths    Number of Corps-Years
0                   109
1                    65
2                    22
3                     3
4                     1
≥5                    0
1.10 A sample of 100 women suffer from dysmenorrhea. A new analgesic is
claimed to provide greater relief than a standard one. After using each
analgesic in a crossover experiment, 40 reported greater relief with the
standard analgesic and 60 reported greater relief with the new one.
Analyze these data.
Theory and Methods
1.11 Why is it easier to get a precise estimate of the binomial parameter π
when it is near 0 or 1 than when it is near 1/2?
1.12 Suppose that P(Y_i = 1) = 1 − P(Y_i = 0) = π, i = 1, ..., n, where {Y_i}
are independent. Let Y = Σ_i Y_i.
a. What are var(Y) and the distribution of Y?
b. When {Y_i} instead have pairwise correlation ρ > 0, show that
var(Y) > nπ(1 − π), overdispersion relative to the binomial. [Altham
(1978) discussed generalizations of the binomial that allow
correlated trials.]
c. Suppose that heterogeneity exists: P(Y_i = 1 | π) = π for all i, but π
is a random variable with density function g(·) on [0, 1] having mean
ρ and positive variance. Show that var(Y) > nρ(1 − ρ). (When π
has a beta distribution, Y has the beta-binomial distribution of
Section 13.3.)
d. Suppose that P(Y_i = 1 | π_i) = π_i, i = 1, ..., n, where {π_i} are independent
from g(·). Explain why Y has a bin(n, ρ) distribution
unconditionally but not conditionally on {π_i}. (Hint: In each case, is
Y a sum of independent, identical Bernoulli trials?)
1.13 For a sequence of independent Bernoulli trials, Y is the number of
successes before the kth failure. Explain why its probability mass
function is the negative binomial,

    p(y) = [(y + k − 1)!/(y!(k − 1)!)] π^y (1 − π)^k,    y = 0, 1, 2, ... .

[For it, E(Y) = kπ/(1 − π) and var(Y) = kπ/(1 − π)², so var(Y) >
E(Y); the Poisson is the limit as k → ∞ and π → 0 with kπ = μ fixed.]
1.14 For the multinomial distribution, show that

    corr(n_j, n_k) = −π_j π_k / √[π_j(1 − π_j) π_k(1 − π_k)].

Show that corr(n_1, n_2) = −1 when c = 2.
1.15 Show that the moment generating function (mgf) for the binomial
distribution is m(t) = (1 − π + πe^t)^n, and use it to obtain the first
two moments. Show that the mgf for the Poisson distribution is
m(t) = exp{μ[exp(t) − 1]}, and use it to obtain the first two moments.
1.16 A likelihood-ratio statistic equals t_o. At the ML estimates, show that
the data are exp(t_o/2) times more likely under H_a than under H_0.
1.17 Assume that y_1, y_2, ..., y_n are independent from a Poisson distribution.
a. Obtain the likelihood function. Show that the ML estimator μ̂ = ȳ.
b. Construct a large-sample test statistic for H_0: μ = μ_0 using (i) the
Wald method, (ii) the score method, and (iii) the likelihood-ratio
method.
c. Construct a large-sample confidence interval for μ using (i) the
Wald method, (ii) the score method, and (iii) the likelihood-ratio
method.
1.18 Inference for Poisson parameters can often be based on connections
with binomial and multinomial distributions. Show how to test
H_0: μ_1 = μ_2 for two populations based on independent Poisson counts
(y_1, y_2), using a corresponding test about a binomial parameter π.
[Hint: Condition on n = y_1 + y_2 and identify π = μ_1/(μ_1 + μ_2).]
How can one construct a confidence interval for μ_1/μ_2 based on one
for π?
1.19 A researcher routinely tests using a nominal P(type I error) = 0.05,
rejecting H_0 if the P-value ≤ 0.05. An exact test using test statistic T
has null distribution P(T = 0) = 0.30, P(T = 1) = 0.62, and P(T = 2)
= 0.08, where a higher T provides more evidence against the null.
a. With the usual P-value, show that the actual P(type I error) = 0.
b. With the mid-P-value, show that the actual P(type I error) = 0.08.
c. Find P(type I error) in parts (a) and (b) when P(T = 0) = 0.30,
P(T = 1) = 0.66, P(T = 2) = 0.04. Note that the test with mid-P-value
can be conservative or liberal. The exact test with ordinary
P-value cannot be liberal.
d. In part (a), a randomized-decision test generates a uniform random
variable U from [0, 1] and rejects H_0 when T = 2 and U ≤ 5/8. Show
that the actual P(type I error) = 0.05. Is this a sensible test?
1.20 For a binomial parameter π, show how the inversion process for
constructing a confidence interval works with (a) the Wald test, and (b)
the score test.
1.21 For a flip of a coin, let π denote the probability of a head. An
experiment tests H_0: π = 0.5 against H_a: π ≠ 0.5, using n = 5 independent
flips.
a. Show that the true null probability of rejecting H_0 at the 0.05
significance level is 0.0 for the exact binomial test and 1/16 using the
large-sample score test.
b. Suppose that truly π = 0.5. Explain why the probability that the
95% Clopper–Pearson confidence interval contains π equals 1.0.
(Hint: Is there any possible y for which both one-sided tests of
H_0: π = 0.5 have P-value ≤ 0.025?)
1.22 Consider the Wald confidence interval for a binomial parameter π.
Since it is degenerate when π̂ = 0 or 1, argue that for 0 < π < 1 the
probability the interval covers π cannot exceed [1 − π^n − (1 − π)^n];
hence, the infimum of the coverage probability over 0 < π < 1 equals
0, regardless of n.
1.23 Consider the 95% binomial score confidence interval for π. When
y = 1, show that the lower limit is approximately 0.18/n; in fact, π with
0 < π < 0.18/n then falls in an interval only when y = 0. Argue that
for large n and π just barely below 0.18/n or just barely above
1 − 0.18/n, the actual coverage probability is about e^{−0.18} = 0.84.
Hence, even as n → ∞, this method is not guaranteed to have coverage
probability ≥ 0.95 (Agresti and Coull 1998; Blyth and Still 1983).
1.24 From Section 1.4.2 the midpoint π̃ of the score confidence interval for
π is the sample proportion for an adjusted data set that adds z²_{α/2}/2
observations of each type to the sample. This motivates an adjusted
Wald interval,

    π̃ ± z_{α/2} √[π̃(1 − π̃)/n*],

where n* = n + z²_{α/2}. Show that the variance π̃(1 − π̃)/n* at the
weighted average is at least as large as the weighted average of the
variances that appears under the square root sign in the score interval
(Hint: Use Jensen's inequality). Thus, this interval contains the score
interval. [Agresti and Coull (1998) and Brown et al. (2001) showed that
it performs much better than the Wald interval. It does not have the
score interval's disadvantage (Problem 1.23) of poor coverage near 0
and 1.]
1.25 A binomial sample of size n has y = 0 successes.
a. Show that the confidence interval for π based on the likelihood
function is [0.0, 1 − exp(−z²_{α/2}/2n)]. For α = 0.05, use the expansion
of an exponential function to show that this is approximately
[0, 2/n].
b. For the score method, show that the confidence interval is
[0, z²_{α/2}/(n + z²_{α/2})], or approximately [0, 4/(n + 4)] when α = 0.05.
c. For the Clopper–Pearson approach, show that the upper bound is
1 − (α/2)^{1/n}, or approximately −log(0.025)/n = 3.69/n when
α = 0.05.
d. For the adaptation of the Clopper–Pearson approach using the
mid-P-value, show that the upper bound is 1 − α^{1/n}, or approximately
−log(0.05)/n = 3/n when α = 0.05.
1.26 For the geometric distribution p(y) = π^y(1 − π), y = 0, 1, 2, ...,
show that the tail method for constructing a confidence interval
[i.e., equating P(Y ≥ y) and P(Y ≤ y) to α/2] yields [(α/2)^{1/y},
(1 − α/2)^{1/(y+1)}]. Show that all π between 0 and 1 − α/2 never fall
above a confidence interval, and hence the actual coverage probability
exceeds 1 − α/2 over this region.
1.27 A statistic T has discrete distribution with cdf F(t). Show that F(T) is
stochastically larger than uniform over [0, 1]; that is, its cdf is everywhere
no greater than that of the uniform (Casella and Berger 2001,
pp. 77, 434). Explain why an implication is that a P-value based on T
has null distribution that is stochastically larger than uniform.
1.28 Suppose that P(T = t_j) = π_j, j = 1, 2, ... . Show that E(mid-P-value) =
0.5. [Hint: Show that Σ_j π_j(π_j/2 + π_{j+1} + ···) = (Σ_j π_j)²/2.]
1.29 For a statistic T with cdf F(t) and p(t) = P(T = t), the mid-distribution
function is F_mid(t) = F(t) − 0.5p(t) (Parzen 1997). Given T = t_o,
show that the mid-P-value equals 1 − F_mid(t_o). (It also satisfies
E[F_mid(T)] = 0.5 and var[F_mid(T)] = (1/12){1 − E[p²(T)]}.)
1.30 Genotypes AA, Aa, and aa occur with probabilities [θ², 2θ(1 − θ),
(1 − θ)²]. A multinomial sample of size n has frequencies (n_1, n_2, n_3)
of these three genotypes.
a. Form the log likelihood. Show that θ̂ = (2n_1 + n_2)/(2n_1 + 2n_2 +
2n_3).
b. Show that −∂²L(θ)/∂θ² = [(2n_1 + n_2)/θ²] + [(n_2 + 2n_3)/
(1 − θ)²] and that its expectation is 2n/θ(1 − θ). Use this to
obtain an asymptotic standard error of θ̂.
c. Explain how to test whether the probabilities truly have this
pattern.
1.31 Refer to Section 1.5.6. Using the likelihood function to obtain the
information, find the approximate standard error of π̂.
1.32 Refer to Section 1.5.6. Let a denote the number of calves that got a
primary, secondary, and tertiary infection, b the number that received
a primary and secondary but not a tertiary infection, c the number that
received a primary but not a secondary infection, and d the number
that did not receive a primary infection. Let π be the probability of a
primary infection. Consider the hypothesis that the probability of
infection at time t, given infection at times 1, ..., t − 1, is also π, for
t = 2, 3. Show that π̂ = (3a + 2b + c)/(3a + 3b + 2c + d).
1.33 Refer to quadratic form (1.16).
a. Verify that the matrix quoted in the text for Σ_0⁻¹ is the inverse of
Σ_0.
b. Show that (1.16) simplifies to Pearson's statistic (1.15).
c. For the z_S statistic (1.11), show that z_S² = X² for c = 2.
1.34 For testing H_0: π_j = π_{j0}, j = 1, ..., c, using sample multinomial
proportions {π̂_j}, the likelihood-ratio statistic (1.17) is

    G² = −2n Σ_j π̂_j log(π_{j0}/π̂_j).

Show that G² ≥ 0, with equality if and only if π̂_j = π_{j0} for all j. (Hint:
Apply Jensen's inequality to E(−2n log X), where X equals π_{j0}/π̂_j
with probability π̂_j.)
1.35 The chi-squared mgf with df = ν is m(t) = (1 − 2t)^{−ν/2}, for |t| < 1/2.
Use it to prove the reproductive property of the chi-squared distribution.
1.36 For the multinomial (n, {π_j}) distribution with c > 2, confidence limits
for π_j are the solutions of

    (π̂_j − π_j)² = z²_{α/2c} π_j(1 − π_j)/n,    j = 1, ..., c.

a. Using the Bonferroni inequality, argue that these c intervals simultaneously
contain all {π_j} (for large samples) with probability at
least 1 − α.
b. Show that the standard deviation of π̂_j − π̂_k is {[π_j + π_k − (π_j −
π_k)²]/n}^{1/2}. For large n, explain why the probability is at least 1 − α
that the Wald confidence intervals

    (π̂_j − π̂_k) ± z_{α/2a} {[π̂_j + π̂_k − (π̂_j − π̂_k)²]/n}^{1/2}

simultaneously contain the a = c(c − 1)/2 differences {π_j − π_k}
(see Fitzpatrick and Scott 1987; Goodman 1965).
CHAPTER 2
Describing Contingency Tables
In this chapter we introduce tables that display relationships between
categorical variables. We also define parameters that summarize their association. Parameters in Section 2.2 are used to compare groups on the proportions of responses in the outcome categories. The odds ratio has special
importance, appearing as a parameter in models discussed later. In Section
2.3 we extend the scope by controlling for a third variable. The association
can change dramatically under a control. The chapter’s primary focus is
binary variables, which have only two categories, but in Section 2.4 we
present parameters for nominal and ordinal multicategory variables. First, in
Section 2.1, we introduce basic terminology and notation.
2.1  PROBABILITY STRUCTURE FOR CONTINGENCY TABLES
The joint distribution between two categorical variables determines their
relationship. This distribution also determines the marginal and conditional
distributions.
2.1.1  Contingency Tables and Their Distributions
Let X and Y denote two categorical response variables, X with I categories
and Y with J categories. Classifications of subjects on both variables have IJ
possible combinations. The responses Ž X, Y . of a subject chosen randomly
from some population have a probability distribution. A rectangular table
having I rows for categories of X and J columns for categories of Y displays
this distribution. The cells of the table represent the IJ possible outcomes.
When the cells contain frequency counts of outcomes for a sample, the table
is called a contingency table, a term introduced by Karl Pearson (1904).
Another name is cross-classification table. A contingency table with I rows
and J columns is called an I × J (or I-by-J) table.
TABLE 2.1    Cross-Classification of Aspirin Use and Myocardial Infarction

                    Myocardial Infarction
            Fatal Attack    Nonfatal Attack    No Attack
Placebo     18              171                10,845
Aspirin     5               99                 10,933

Source: Preliminary report: Findings from the aspirin component of the ongoing Physicians'
Health Study. New Engl. J. Med. 318: 262–264 (1988).
Table 2.1, a 2 × 3 contingency table, is from a report on the relationship
between aspirin use and heart attacks by the Physicians’ Health Study
Research Group at Harvard Medical School. The Physicians’ Health Study
was a 5-year randomized study of whether regular aspirin intake reduces
mortality from cardiovascular disease. Every other day, physicians participating in the study took either one aspirin tablet or a placebo. The study was
blind: those in the study did not know whether they were taking aspirin or a
placebo. Of the 11,034 physicians taking a placebo, 18 suffered fatal heart
attacks over the course of the study, whereas of the 11,037 taking aspirin, 5
had fatal heart attacks.
Let π_ij denote the probability that (X, Y) occurs in the cell in row i and
column j. The probability distribution {π_ij} is the joint distribution of X and
Y. The marginal distributions are the row and column totals that result from
summing the joint probabilities. We denote these by {π_{i+}} for the row
variable and {π_{+j}} for the column variable, where the subscript "+" denotes
the sum over that index; that is,

    π_{i+} = Σ_j π_ij    and    π_{+j} = Σ_i π_ij.

These satisfy Σ_i π_{i+} = Σ_j π_{+j} = Σ_i Σ_j π_ij = 1.0. The marginal distributions
provide single-variable information.
In most contingency tables (such as Table 2.1), one variable, say Y, is a
response variable and the other (X) is an explanatory variable. When X is
fixed rather than random, the notion of a joint distribution for X and Y is no
longer meaningful. However, for a fixed category of X, Y has a probability
distribution. It is germane to study how this distribution changes as the
category of X changes. Given that a subject is classified in row i of X, π_{j|i}
denotes the probability of classification in column j of Y, j = 1, ..., J. Note
that Σ_j π_{j|i} = 1. The probabilities {π_{1|i}, ..., π_{J|i}} form the conditional
distribution of Y at category i of X. A principal aim of many studies is to compare
conditional distributions of Y at various levels of explanatory variables.
DESCRIBING CONTINGENCY TABLES
TABLE 2.2 Estimated Conditional Distributions for Breast Cancer Diagnoses

                    Diagnosis of Test
Breast Cancer   Positive   Negative   Total
Yes               0.82       0.18      1.0
No                0.01       0.99      1.0

Source: Data from W. Lawrence et al., J. Natl. Cancer Inst. 90: 1792–1800 (1998).
2.1.2 Sensitivity and Specificity
The results in Table 2.2 are from a recent article about various methods of
attempting to diagnose breast cancer. Based on a literature survey, the
authors reported these results for the impact of using mammography together
with clinical breast examination. Let X = true disease status (i.e.,
whether a woman truly has breast cancer) and let Y = diagnosis
(positive, negative), where a positive outcome predicts that a woman has
breast cancer. The probabilities estimated in Table 2.2 are conditional
probabilities of Y given X.
With diagnostic tests for a disease, the two correct diagnoses are a positive
test outcome when the subject has the disease and a negative test outcome
when a subject does not have it. Given that the subject has the disease, the
conditional probability that the diagnostic test is positive is called the
sensitivity; given that the subject does not have the disease, the conditional
probability that the test is negative is called the specificity (Yerushalmy 1947).
Ideally, these are both high.
For a 2 × 2 table with the format of Table 2.2, sensitivity is π_{1|1} and
specificity is π_{2|2}. In Table 2.2, the estimated sensitivity of combined
mammography and clinical examination is 0.82. Of women with breast cancer,
82% are diagnosed correctly. The estimated specificity is 0.99. Of women not
having breast cancer, 99% were diagnosed correctly.
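These definitions translate directly into code. The sketch below (an illustration, with hypothetical counts chosen only so that the proportions match Table 2.2) computes sensitivity and specificity from a 2 × 2 diagnostic table:

```python
# Sensitivity and specificity from a 2 x 2 diagnostic table of counts.
# Rows: true status (disease, no disease); columns: test (positive, negative).
def sensitivity_specificity(table):
    (tp, fn), (fp, tn) = table
    sensitivity = tp / (tp + fn)   # P(test positive | disease)
    specificity = tn / (fp + tn)   # P(test negative | no disease)
    return sensitivity, specificity

# Hypothetical counts per 100 women in each true-status group:
sens, spec = sensitivity_specificity([[82, 18], [1, 99]])
print(sens, spec)   # 0.82 0.99
```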
2.1.3 Independence of Categorical Variables
When both variables are response variables, descriptions of the association
can use their joint distribution, the conditional distribution of Y given X, or
the conditional distribution of X given Y. The conditional distribution of Y
given X relates to the joint distribution by

    π_{j|i} = π_ij / π_{i+}   for all i and j.

Two categorical response variables are defined to be independent if all
joint probabilities equal the product of their marginal probabilities,

    π_ij = π_{i+} π_{+j}   for i = 1, . . . , I and j = 1, . . . , J.   (2.1)
TABLE 2.3 Notation for Joint, Conditional, and Marginal Probabilities

                       Column
Row         1               2              Total
1       π11 (π_{1|1})   π12 (π_{2|1})   π_{1+} (1.0)
2       π21 (π_{1|2})   π22 (π_{2|2})   π_{2+} (1.0)
Total   π_{+1}          π_{+2}          1.0
When X and Y are independent,

    π_{j|i} = π_ij / π_{i+} = (π_{i+} π_{+j}) / π_{i+} = π_{+j}   for i = 1, . . . , I.

Each conditional distribution of Y is identical to the marginal distribution of
Y. Thus, two variables are independent when {π_{j|1} = ⋅⋅⋅ = π_{j|I}, for j =
1, . . . , J}; that is, the probability of any given column response is the same in
each row. When Y is a response and X is an explanatory variable, this is a
more natural way to define independence than (2.1). Independence is then
often referred to as homogeneity of the conditional distributions.
Table 2.3 displays notation for joint, conditional, and marginal distributions
for the 2 × 2 case. Sample distributions use similar notation, with p or
π̂ in place of π. For instance, {p_ij} denotes the sample joint distribution. The
cell frequencies are denoted {n_ij}, and n = Σ_i Σ_j n_ij is the total sample size.
Thus,

    p_ij = n_ij / n.

The sample proportion of times that subjects in row i made response j is

    p_{j|i} = p_ij / p_{i+} = n_ij / n_{i+},

where n_{i+} = n p_{i+} = Σ_j n_ij.
2.1.4 Poisson, Binomial, and Multinomial Sampling
The probability distributions introduced in Section 1.2 extend to cell counts
in contingency tables. For instance, a Poisson sampling model treats cell
counts {Y_ij} as independent Poisson random variables with parameters {μ_ij}.
The joint probability mass function for potential outcomes {n_ij} is then the
product of the Poisson probabilities P(Y_ij = n_ij) for the IJ cells, or

    Π_i Π_j exp(−μ_ij) μ_ij^{n_ij} / n_ij!.
When the total sample size n is fixed but the row and column totals are
not, a multinomial sampling model applies. The IJ cells are the possible
outcomes. The probability mass function of the cell counts has the multinomial
form

    [n! / (n_11! ⋅⋅⋅ n_IJ!)] Π_i Π_j π_ij^{n_ij}.
Often, observations on a response Y occur separately at each setting of an
explanatory variable X. This case normally treats row totals as fixed, and for
simplicity, we use the notation n_i = n_{i+}. Suppose that the n_i observations on
Y at setting i of X are independent, each with probability distribution
{π_{1|i}, . . . , π_{J|i}}. The counts {n_ij, j = 1, . . . , J} satisfying Σ_j n_ij = n_i then have
the multinomial form

    (n_i! / Π_j n_ij!) Π_j π_{j|i}^{n_ij}.   (2.2)
When samples at different settings of X are independent, the joint probability
function for the entire data set is the product of the multinomial functions
(2.2) from the various settings. This sampling scheme is independent multinomial
sampling, also called product multinomial sampling.
Independent multinomial sampling also results under the following conditions:
Suppose that {n_ij} result from either independent Poisson sampling
with means {μ_ij} or multinomial sampling over the IJ cells with probabilities
{π_ij = μ_ij / n}. When X is an explanatory variable, it is sensible to perform
statistical inference conditional on the totals {n_i = Σ_j n_ij} even when their
values are not fixed by the sampling design. Conditional on {n_i}, the cell
counts {n_ij, j = 1, . . . , J} have the multinomial distribution (2.2) with response
probabilities {π_{j|i} = μ_ij / μ_{i+}, j = 1, . . . , J}, and cell counts from different
rows are independent. With this conditioning, we treat the row totals
as fixed and analyze the data as if they formed separate independent
samples.
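The differences among these sampling schemes show up in which margins are fixed by design. The sketch below (an illustration using NumPy; the means {μ_ij} are hypothetical, not from the text) simulates Poisson, multinomial, and independent multinomial sampling for a 2 × 2 table:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([[30.0, 70.0],      # hypothetical Poisson means mu_ij
               [45.0, 55.0]])

# 1. Poisson sampling: each cell count is an independent Poisson variate;
#    neither the total n nor the margins are fixed.
poisson_table = rng.poisson(mu)

# 2. Multinomial sampling: total n fixed; the IJ cells have probabilities
#    pi_ij = mu_ij / n.
n = int(mu.sum())
pi = (mu / mu.sum()).ravel()
multinomial_table = rng.multinomial(n, pi).reshape(mu.shape)

# 3. Independent multinomial sampling: row totals n_i fixed; each row is a
#    separate multinomial with probabilities pi_{j|i} = mu_ij / mu_{i+}.
row_totals = mu.sum(axis=1).astype(int)
indep_table = np.vstack([rng.multinomial(ni, row / row.sum())
                         for ni, row in zip(row_totals, mu)])

print(multinomial_table.sum())   # 200: total fixed by design
print(indep_table.sum(axis=1))   # [100 100]: row totals fixed by design
```

Only the Poisson scheme leaves every margin random; conditioning it on the row totals recovers scheme 3, as described above.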
Sometimes both row and column margins are naturally fixed. The appropriate sampling distribution is then the hypergeometric. In Section 3.5.1 we
discuss this case, which is less common.
2.1.5 Seat Belt Example
Researchers in the Massachusetts Highway Department plan to study the
relationship between seat-belt use (yes, no) and outcome of an automobile
crash (fatality, nonfatality) for drivers involved in accidents on the
Massachusetts Turnpike. They will summarize results in the format shown in
Table 2.4. They plan to catalog all accidents on the turnpike for the next
year, classifying each according to these variables. The total sample size is
then a random variable. They might treat the numbers of observations at the
four combinations of seat-belt use and outcome of crash as independent
Poisson random variables with unknown means {μ11, μ12, μ21, μ22}.

TABLE 2.4 Seat-Belt Use and Results of Automobile Crashes

                      Result of Crash
Seat-Belt Use   Fatality   Nonfatality
Yes
No
Suppose, instead, that the researchers randomly sample 200 police records
of crashes on the turnpike in the past year and classify each according to
seat-belt use and outcome of crash. For this study, the total sample size n
is fixed. They might then treat the four cell counts as a multinomial
random variable with n = 200 trials and unknown joint probabilities
{π11, π12, π21, π22}.
Suppose, instead, that police records for accidents involving fatalities were
filed separately from the others. The researchers might instead randomly
sample 100 records of accidents with a fatality and randomly sample 100
records of accidents with no fatality. This approach fixes the column totals in
Table 2.4 at 100. They might then regard each column of Table 2.4 as an
independent binomial sample. Yet another approach, the traditional experimental
design, takes 200 subjects and randomly assigns 100 of them to wear
seat belts; the 200 then all are forced to have an accident. The recorded
results would then be independent binomial samples in each row, with fixed
row totals of 100 each. (Obviously, traditional designs common in some
experimental science may not be ethical for humans. This is especially true in
medical studies.)
2.1.6 Types of Studies
Table 2.5 comes from one of the first studies of the link between lung cancer
and smoking, by Richard Doll and A. Bradford Hill. In 20 hospitals in
London, England, patients admitted with lung cancer in the preceding year
were queried about their smoking behavior. For each of the 709 patients
admitted, researchers studied the smoking behavior of a noncancer patient at
the same hospital of the same gender and within the same 5-year grouping on
age. The 709 cases in the first column of Table 2.5 are those having lung
cancer and the 709 controls in the second column are those not having it. A
smoker was defined as a person who had smoked at least one cigarette a day
for at least a year.
Normally, whether lung cancer occurs is a response variable and smoking
behavior is an explanatory variable. In this study, however, the marginal
TABLE 2.5 Cross-Classification of Smoking by Lung Cancer

               Lung Cancer
Smoker     Cases    Controls
Yes         688       650
No           21        59
Total       709       709

Source: Based on data reported in Table IV, R. Doll and A. B. Hill, British Med. J., Sept. 30, 1950, pp. 739–748.
distribution of lung cancer is fixed by the sampling design, and the outcome
measured is whether the subject ever was a smoker. The study, which uses a
retrospective design to "look into the past," is called a case-control study. Such
studies are common in health-related applications. Often, the two samples
are matched, as in this study. Sometimes the samples of cases and controls
are independent rather than matched. For instance, another early case-control
study on lung cancer and smoking sampled subjects by sending letters to
the estates of physicians who had died of some type of cancer in 1950 or
1951, and observations were cross-classified on type of cancer and the
subject's smoking behavior (see, e.g., Cornfield 1956).
One might want to compare smokers with nonsmokers in terms of the
proportion who suffered lung cancer. These proportions refer to the conditional
distribution of lung cancer, given smoking behavior. Instead, case-control
studies provide proportions in the reverse direction, for the conditional
distribution of smoking behavior, given lung cancer status. For those in Table
2.5 with lung cancer, the proportion who were smokers was 688/709 = 0.970,
while it was 650/709 = 0.917 for the controls.
When we know the proportion of the population having lung cancer, we
can use Bayes' theorem to compute sample conditional distributions in the
direction of main interest (Problem 2.21). Otherwise, using a retrospective
sample, we cannot estimate the probability of lung cancer at each category of
smoking behavior. For Table 2.5 we do not know the population prevalence
of lung cancer, and the patients suffering it were probably sampled at a rate
far in excess of their occurrence in the general population.
By contrast, imagine a study that samples subjects from the population of
teenagers and then 60 years later measures the rates of lung cancer for the
smokers and nonsmokers. Such a sampling design is prospective. There are
two types of prospective studies. Clinical trials randomly allocate subjects to
the groups who will be smokers and nonsmokers. In cohort studies, subjects
make their own choice about whether to smoke, and the study observes in
future time who develops lung cancer. Yet another approach, a cross-sectional
design, samples subjects and classifies them simultaneously on both
variables.
Prospective studies usually condition on the totals {n_i = Σ_j n_ij} for categories
of X and regard each row of J counts as an independent multinomial
sample on Y. Retrospective studies usually treat the totals {n_{+j}} for Y as fixed
and regard each column of I counts as a multinomial sample on X. In
cross-sectional studies, the total sample size is fixed but not the row or column
totals, and the IJ cell counts are a multinomial sample.
Case-control, cohort, and cross-sectional studies are called observational
studies. They simply observe who chooses each group and who has the
outcome of interest. By contrast, a clinical trial is an experimental study, the
investigator having the advantage of experimental control over which subjects
receive each treatment. Such studies can use the power of randomization to
make the groups balance roughly on other variables that may be associated
with the response. Observational studies are common but have more potential
for biases of various types.
2.2 COMPARING TWO PROPORTIONS
Many studies are designed to compare groups on a binary response variable.
Then Y has only two categories, such as (success, failure) for outcome of a
medical treatment. With two groups, a 2 × 2 contingency table displays the
results. The rows are the groups and the columns are the categories of Y.
This section presents parameters for comparing the groups.
2.2.1 Difference of Proportions
For subjects in row i, π_{1|i} is the probability that the response has outcome in
category 1 ("success"). With only two possible outcomes, π_{2|i} = 1 − π_{1|i}, and
we use the simpler notation π_i for π_{1|i}. The difference of proportions of
successes, π_1 − π_2, is a basic comparison of the two rows. Comparison
on failures is equivalent to comparison on successes, since

    (1 − π_1) − (1 − π_2) = π_2 − π_1.

The difference of proportions falls between −1.0 and +1.0. It equals zero
when the rows have identical conditional distributions. The response Y is
statistically independent of the row classification when π_1 − π_2 = 0.
When both variables are responses, conditional distributions apply in
either direction. One can also compare the two columns, such as by the
difference between the proportions in row 1. This usually is not equal to the
difference π_1 − π_2 comparing the rows.
2.2.2 Relative Risk
A value π_1 − π_2 of fixed size may have greater importance when both π_i
are close to 0 or 1 than when they are not. For a study comparing two
treatments on the proportion of subjects who die, the difference between
0.010 and 0.001 may be more noteworthy than the difference between 0.410
and 0.401, even though both are 0.009. In such cases, the ratio of proportions
is also informative.
The relative risk is defined to be the ratio

    π_1 / π_2.   (2.3)

It can be any nonnegative real number. A relative risk of 1.0 corresponds to
independence. For the proportions just given, the relative risks are
0.010/0.001 = 10.0 and 0.410/0.401 = 1.02. Comparing the rows on the
second response category gives a different relative risk, (1 − π_1)/(1 − π_2).
2.2.3 Odds Ratio
For a probability π of success, the odds are defined to be

    Ω = π / (1 − π).

The odds are nonnegative, with Ω > 1.0 when a success is more likely than a
failure. When π = 0.75, for instance, then Ω = 0.75/0.25 = 3.0; a success is
three times as likely as a failure, and we expect about three successes for
every one failure. When Ω = 1/3, a failure is three times as likely as a success.
Inversely,

    π = Ω / (Ω + 1).

For instance, when Ω = 1/3, then π = 0.25.
Refer again to a 2 × 2 table. Within row i, the odds of success instead of
failure are Ω_i = π_i / (1 − π_i). The ratio of the odds Ω_1 and Ω_2 in the two
rows,

    θ = Ω_1 / Ω_2 = [π_1 / (1 − π_1)] / [π_2 / (1 − π_2)],   (2.4)

is called the odds ratio.
For joint distributions with cell probabilities {π_ij}, the equivalent definition
for the odds in row i is Ω_i = π_{i1} / π_{i2}, i = 1, 2. Then the odds ratio is

    θ = (π11 / π12) / (π21 / π22) = (π11 π22) / (π12 π21).   (2.5)

An alternative name for θ is the cross-product ratio, since it equals the ratio
of the products π11 π22 and π12 π21 of probabilities from diagonally opposite
cells (Yule 1900, 1912).
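The definitions Ω = π/(1 − π), π = Ω/(Ω + 1), and θ = Ω_1/Ω_2 translate into a few lines of code. The sketch below (an illustration with arbitrary probabilities) reproduces the numerical examples above:

```python
# Odds, the inverse map back to a probability, and the odds ratio of (2.4).
def odds(pi):
    return pi / (1 - pi)

def odds_ratio(pi1, pi2):
    return odds(pi1) / odds(pi2)

print(odds(0.75))                      # 3.0: success three times as likely as failure

omega = 1 / 3
print(round(omega / (omega + 1), 10))  # 0.25: pi recovered from the odds

# theta for two rows with success probabilities 0.75 and 0.5:
print(odds_ratio(0.75, 0.5))           # 3.0
```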
2.2.4 Properties of the Odds Ratio
The odds ratio can equal any nonnegative number. The condition Ω_1 = Ω_2
and hence (when all cell probabilities are positive) θ = 1 corresponds to
independence of X and Y. When 1 < θ < ∞, subjects in row 1 are more
likely to have a success than are subjects in row 2; that is, π_1 > π_2. For
instance, when θ = 4, the odds of success in row 1 are four times the odds in
row 2. This does not mean that the probability π_1 = 4π_2; that is the
interpretation of a relative risk of 4.0. When 0 < θ < 1, π_1 < π_2. When one
cell has zero probability, θ equals 0 or ∞.
Values of θ farther from 1.0 in a given direction represent stronger
association. Two values represent the same association, but in opposite
directions, when one is the inverse of the other. For instance, when θ = 0.25,
the odds of success in row 1 are 0.25 times the odds in row 2, or equivalently,
the odds of success in row 2 are 1/0.25 = 4.0 times the odds in row 1. When
the order of the rows is reversed or the order of the columns is reversed, the
new value for θ is the inverse of the original value.
For inference, we shall see it is convenient to use log θ. Independence
corresponds to log θ = 0. The log odds ratio is symmetric about this value:
reversal of rows or of columns results in a change in its sign. Two values for
log θ that are the same except for sign, such as log 4 = 1.39 and log 0.25 =
−1.39, represent the same strength of association.
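The sign symmetry of the log odds ratio is easy to verify numerically. A sketch, using arbitrary illustrative counts:

```python
import math

# Log odds ratio from a 2 x 2 table of counts; reversing the rows inverts
# theta and so flips the sign of log(theta).
def log_odds_ratio(n11, n12, n21, n22):
    return math.log((n11 * n22) / (n12 * n21))

print(round(math.log(4), 2), round(math.log(0.25), 2))   # 1.39 -1.39

a = log_odds_ratio(10, 20, 5, 40)   # theta = (10*40)/(20*5) = 4
b = log_odds_ratio(5, 40, 10, 20)   # rows reversed: theta = 1/4
print(round(abs(a + b), 12))        # 0.0: equal magnitude, opposite sign
```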
The odds ratio does not change value when the orientation of the table
reverses so that the rows become the columns and the columns become the
rows. This is clear from the symmetric form of (2.5). It is unnecessary to
identify one classification as the response variable in order to use θ. In fact,
although (2.4) defined it in terms of odds using π_i = P(Y = 1 | X = i), one
could just as well define it using reverse conditional probabilities. With a
joint distribution, conditional distributions exist in each direction, and

    θ = (π11 π22) / (π12 π21)
      = [P(Y = 1 | X = 1) / P(Y = 2 | X = 1)] / [P(Y = 1 | X = 2) / P(Y = 2 | X = 2)]
      = [P(X = 1 | Y = 1) / P(X = 2 | Y = 1)] / [P(X = 1 | Y = 2) / P(X = 2 | Y = 2)].   (2.6)
In fact, the odds ratio is equally valid for prospective, retrospective, or
cross-sectional sampling designs. The sample odds ratio estimates the same
parameter in each case.
For cell counts {n_ij}, the sample odds ratio is

    θ̂ = n11 n22 / (n12 n21).

This does not change when both cell counts within any row are multiplied by
a nonzero constant or when both cell counts within any column are multiplied
by a nonzero constant. An implication is that the sample odds ratio
estimates the same characteristic (θ) even when the sample is disproportionately
large or small from marginal categories of a variable. For a retrospective study
of the association between vaccination and catching a certain strain
of flu, the sample odds ratio estimates the same characteristic with a random
sample of (1) 100 people who got the flu and 100 people who did not, or (2)
40 people who got the flu and 160 people who did not. The sample versions
of the difference of proportions and relative risk (2.3) are invariant to
multiplication of counts within rows by a constant, but they change with
multiplication within columns or with row-column interchange.
2.2.5 Aspirin and Heart Attacks Revisited
We illustrate the three association measures with Table 2.1 on aspirin use
and heart attacks. The table differentiates between fatal and nonfatal heart
attacks, but we combine these outcomes for now. Of the 11,034 physicians
taking placebo, 189 suffered heart attacks, a proportion of 189/11,034 =
0.0171. Of the 11,037 taking aspirin, 104 had heart attacks, a proportion of
0.0094. The sample difference of proportions is 0.0171 − 0.0094 = 0.0077.
The relative risk is 0.0171/0.0094 = 1.82. The proportion suffering heart
attacks of those taking placebo was 1.82 times the proportion suffering heart
attacks of those taking aspirin. The sample odds ratio is (189 × 10,933)/
(10,845 × 104) = 1.83. The odds of heart attack for those taking placebo was
1.83 times the odds for those taking aspirin.
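The three computations just described can be reproduced in a few lines. A sketch, using the counts of Table 2.1 with fatal and nonfatal attacks combined:

```python
# Difference of proportions, relative risk, and sample odds ratio for
# Table 2.1, combining fatal and nonfatal heart attacks.
placebo_attack, placebo_n = 189, 11034
aspirin_attack, aspirin_n = 104, 11037

p1 = placebo_attack / placebo_n   # 0.0171
p2 = aspirin_attack / aspirin_n   # 0.0094

diff = p1 - p2
rr = p1 / p2
or_hat = (189 * 10933) / (10845 * 104)   # cross-product ratio n11*n22 / (n12*n21)

print(round(diff, 4), round(rr, 2), round(or_hat, 2))   # 0.0077 1.82 1.83
```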
2.2.6 Case–Control Studies and the Odds Ratio
With retrospective sampling designs, such as case-control studies, it is
possible to estimate conditional probabilities of form P(X = i | Y = j). It is
usually not possible to estimate the probability P(Y = j | X = i) of an outcome
of interest or the difference of proportions or relative risk for that
outcome. It is possible to estimate the odds ratio, however, since by (2.6) it is
determined by conditional probabilities in either direction.
To illustrate, we revisit Table 2.5 on X = smoking behavior and Y = lung
cancer. The data were two binomial samples on X at fixed levels of Y. Thus,
we can estimate the probability a subject was a smoker, given the outcome on
whether the subject had lung cancer; this was 688/709 for the cases and
650/709 for the controls. We cannot estimate the probability of lung cancer,
given whether one smoked, which is more relevant. Thus, we cannot estimate
differences or ratios of probabilities of lung cancer. The difference of
proportions and relative risk are limited to comparisons of the probabilities
of being a smoker. However, we can compute the odds ratio using the sample
analog of (2.6),

    θ̂ = [(688/709) / (21/709)] / [(650/709) / (59/709)] = (688 × 59) / (650 × 21) = 3.0.
Moreover, by (2.6), interpretations can use the direction of interest, even
though the study was retrospective: The estimated odds of lung cancer for
smokers were 3.0 times the estimated odds for nonsmokers.
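The invariance in (2.6) is easy to check for Table 2.5: the odds ratio computed from the retrospective conditional distributions equals the cross-product ratio of the counts. A sketch:

```python
# Sample odds ratio for Table 2.5, computed two equivalent ways.
cases = {"smoker": 688, "nonsmoker": 21}      # lung cancer cases
controls = {"smoker": 650, "nonsmoker": 59}   # controls

# Odds of smoking among cases vs. controls (the direction actually sampled):
theta_retro = (cases["smoker"] / cases["nonsmoker"]) / (
    controls["smoker"] / controls["nonsmoker"])

# Cross-product form n11*n22 / (n12*n21):
theta_cross = (688 * 59) / (650 * 21)

print(round(theta_retro, 1), round(theta_cross, 1))   # 3.0 3.0
```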
2.2.7 Relationship between Odds Ratio and Relative Risk
From definitions (2.3) and (2.4),

    odds ratio = relative risk × [(1 − π_2) / (1 − π_1)].

Their magnitudes are similar whenever the probability π_i of the outcome of
interest is close to zero for both groups. We saw this similarity in Section
2.2.5 for the aspirin study, where the heart attack proportion was less than
0.02 for each group. The relative risk was 1.82 and the odds ratio was 1.83.
Because of this similarity, when each π_i is small, the odds ratio provides a
rough indication of the relative risk when it is not directly estimable, such as
in case-control studies (Cornfield 1951). For instance, for Table 2.5, if the
probability of lung cancer is small regardless of smoking behavior, 3.0 is also
a rough estimate of the relative risk; that is, smokers had about 3.0 times the
relative frequency of lung cancer as nonsmokers.
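The conversion formula can be checked numerically. A sketch, using the aspirin-study proportions from Section 2.2.5 and the (0.410, 0.401) pair from Section 2.2.2:

```python
# odds ratio = relative risk * (1 - pi2) / (1 - pi1); the two nearly
# coincide when both probabilities are near zero.
def rr_and_or(pi1, pi2):
    rr = pi1 / pi2
    theta = rr * (1 - pi2) / (1 - pi1)
    return rr, theta

rr, theta = rr_and_or(189 / 11034, 104 / 11037)   # aspirin study: small pi_i
print(round(rr, 2), round(theta, 2))              # 1.82 1.83: nearly equal

rr2, theta2 = rr_and_or(0.410, 0.401)             # larger probabilities
print(round(rr2, 2), round(theta2, 2))            # 1.02 1.04: they diverge
```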
2.3 PARTIAL ASSOCIATION IN STRATIFIED 2 × 2 TABLES
An important part of most studies, especially observational studies, is the
choice of control variables. In studying the effect of X on Y, one should
control any covariate that can influence that relationship. This involves using
some mechanism to hold the covariate constant. Otherwise, an observed
effect of X on Y may actually reflect effects of that covariate on both X and
Y. The relationship between X and Y then shows confounding. Experimental
studies can remove effects of confounding covariates by randomly assigning
subjects to different levels of X, but this is not possible with observational
studies.
Suppose that a study considers effects of passive smoking, the effects on a
nonsmoker of living with a smoker. To analyze whether passive smoking is
associated with lung cancer, a cross-sectional study might compare lung
cancer rates between nonsmokers whose spouses smoke and nonsmokers
whose spouses do not smoke. The study should attempt to control for age,
socioeconomic status, or other factors that might relate both to spouse
smoking and to developing lung cancer. Otherwise, results will have limited
usefulness. Spouses of nonsmokers may tend to be younger than spouses of
smokers, and younger people are less likely to have lung cancer. Then a
lower proportion of lung cancer cases among spouses of nonsmokers may
merely reflect their lower average age.
In this section we discuss the analysis of the association between categorical variables X and Y while controlling for a possibly confounding variable
Z. For simplicity, the examples refer to a single control variable. In later
chapters we treat more general cases and discuss the use of models to
perform statistical control.
2.3.1 Partial Tables
We control for Z by studying the XY relationship at fixed levels of Z.
Two-way cross-sectional slices of the three-way contingency table cross classify X and Y at separate categories of Z. These cross sections are called
partial tables. They display the XY relationship while removing the effect of
Z by holding its value constant.
The two-way contingency table obtained by combining the partial tables is
called the XY marginal table. Each cell count in the marginal table is a sum of
counts from the same location in the partial tables. The marginal table,
rather than controlling Z, ignores it. The marginal table contains no information about Z. It is simply a two-way table relating X and Y but may reflect
the effects of Z on X and Y.
The associations in partial tables are called conditional associations, because they refer to the effect of X on Y conditional on fixing Z at some
level. Conditional associations in partial tables can be quite different from
associations in marginal tables. In fact, it can be misleading to analyze only
marginal tables of a multiway contingency table. The following example
illustrates.
2.3.2 Death Penalty Example
Table 2.6 is a 2 × 2 × 2 contingency table (two rows, two columns, and two
layers) from an article that studied effects of racial characteristics on whether
persons convicted of homicide received the death penalty. The 674 subjects
classified in Table 2.6 were the defendants in indictments involving cases
TABLE 2.6 Death Penalty Verdict by Defendant's Race and Victims' Race

Victims'   Defendant's     Death Penalty     Percent
Race       Race            Yes       No      Yes
White      White            53      414      11.3
           Black            11       37      22.9
Black      White             0       16       0.0
           Black             4      139       2.8
Total      White            53      430      11.0
           Black            15      176       7.9

Source: M. L. Radelet and G. L. Pierce, Florida Law Rev. 43: 1–34 (1991). Reprinted with permission from the Florida Law Review.
FIGURE 2.1 Percent receiving death penalty.
with multiple murders in Florida between 1976 and 1987. The variables in
Table 2.6 are Y = death penalty verdict, having the categories (yes, no),
X = race of defendant, and Z = race of victims, each having the categories
(white, black). We study the effect of defendant's race on the death penalty
verdict, treating victims' race as a control variable. Table 2.6 has a 2 × 2
partial table relating defendant's race and the death penalty verdict at each
category of victims' race.
For each combination of defendant's race and victims' race, Table 2.6 lists
and Figure 2.1 displays the percentage of defendants who received the death
penalty. These describe the conditional associations. When the victims were
white, the death penalty was imposed 22.9% − 11.3% = 11.6% more often
for black defendants than for white defendants. When the victims were black,
the death penalty was imposed 2.8% more often for black defendants than
for white defendants. Controlling for victims' race by keeping it fixed, the
death penalty was imposed more often on black defendants than on white
defendants.
The bottom portion of Table 2.6 displays the marginal table. It results
from summing the cell counts in Table 2.6 over the two categories of victims'
race, thus combining the two partial tables (e.g., 11 + 4 = 15). Overall,
11.0% of white defendants and 7.9% of black defendants received the death
penalty. Ignoring victims' race, the death penalty was imposed less often on
black defendants than on white defendants. The association reverses direction
compared to the partial tables.
FIGURE 2.2 Proportion receiving death penalty by defendant's race, controlling and ignoring victims' race.

Why does the association change so much when we ignore versus control
victims' race? This relates to the nature of the association between victims'
race and each of the other variables. First, the association between victims'
race and defendant's race is extremely strong. The marginal table relating
these variables has odds ratio (467 × 143)/(48 × 16) = 87.0. Second, Table
2.6 shows that, regardless of defendant’s race, the death penalty was much
more likely when the victims were white than when the victims were black. So
whites are tending to kill whites, and killing whites is more likely to result in
the death penalty. This suggests that the marginal association should show a
greater tendency than the conditional associations for white defendants to
receive the death penalty. In fact, Table 2.6 has this pattern.
Figure 2.2 illustrates why the marginal association differs so from the
conditional associations. For each defendant’s race, the figure plots the
proportion receiving the death penalty at each category of victims’ race. Each
proportion is labeled by a letter symbol giving the category of victims’ race.
Surrounding each observation is a circle having area proportional to the
number of observations at that combination of defendant’s race and victims’
race. For instance, the W in the largest circle represents a proportion of
0.113 receiving the death penalty for cases with white defendants and white
victims. That circle is largest because the number of cases at that combination Ž53 q 414 s 467. is largest. The next-largest circle relates to cases in
which blacks kill blacks.
We control for victims’ race by comparing circles having the same victims’
race letter at their centers. The line connecting the two W circles has a
positive slope, as does the line connecting the two B circles. Controlling for
victims’ race, this reflects the death penalty being more likely for black
defendants than for white defendants. When we add results across victims’
race to get a summary result for the marginal effect of defendant’s race on
the death penalty verdict, the larger circles, having the greater number of
cases, have greater influence. Thus, the summary proportions for each
defendant’s race, marked on the figure by periods, fall closer to the center of
the larger circles than to the center of the smaller circles. A line connecting
the summary marginal proportions has negative slope, indicating that overall
the death penalty was more likely for white defendants than for black
defendants.
The result that a marginal association can have a different direction from
each conditional association is called Simpson's paradox (Simpson 1951; Yule
1903). It applies to quantitative as well as categorical variables. Statisticians
commonly use it to caution against imputing causal effects from an association
of X with Y. For instance, when doctors started to observe strong odds
ratios between smoking and lung cancer, statisticians such as R. A. Fisher
warned that some variable (e.g., a genetic factor) could exist such that the
association would disappear under the relevant control. However, other
statisticians (such as J. Cornfield) showed that with a very strong XY
association, a very strong association must exist between the confounding
variable Z and both X and Y in order for the effect to disappear or change
under the control (Breslow and Day 1980, Sec. 3.4).
2.3.3 Conditional and Marginal Odds Ratios
Odds ratios can describe marginal and conditional associations. We illustrate
for 2 × 2 × K tables, where K denotes the number of categories of a control
variable, Z. Let {μ_ijk} denote cell expected frequencies for some sampling
model, such as binomial, multinomial, or Poisson sampling.
Within a fixed category k of Z, the odds ratio

    θ_{XY(k)} = (μ_{11k} μ_{22k}) / (μ_{12k} μ_{21k})   (2.7)

describes conditional XY association in partial table k. The odds ratios for
the K partial tables are called XY conditional odds ratios. These can be quite
different from marginal odds ratios. The XY marginal table has expected
frequencies {μ_{ij+} = Σ_k μ_ijk}. The XY marginal odds ratio is

    θ_{XY} = (μ_{11+} μ_{22+}) / (μ_{12+} μ_{21+}).
Sample values of θ_{XY(k)} and θ_{XY} use similar formulas with cell counts substituted for expected frequencies. We illustrate for the association between defendant’s race and the death penalty in Table 2.6. In the first partial table, victims’ race is white and

    θ̂_{XY(1)} = (53 × 37) / (414 × 11) = 0.43.

The sample odds for white defendants receiving the death penalty were 43% of the sample odds for black defendants. In the second partial table, victims’ race is black and the estimated odds ratio equals θ̂_{XY(2)} = (0 × 139)/(16 × 4) = 0.0, since the death penalty was never given to white defendants with black victims.
Estimation of the marginal odds ratio uses the 2 × 2 marginal table within Table 2.6, collapsing over victims’ race: (53 × 176)/(430 × 15) = 1.45. The sample odds of the death penalty were 45% higher for white defendants than for black defendants. Yet within each victims’ race category, those odds were smaller for white defendants. This reversal in the association after controlling for victims’ race illustrates Simpson’s paradox.
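The conditional and marginal odds ratios for Table 2.6 can be checked directly from the counts quoted above. The following minimal Python sketch (the nested-list layout of the tables is our own, not the book's) reproduces the reversal:

```python
# Conditional vs. marginal odds ratios for Table 2.6 (death penalty data).
# Within each victims'-race stratum, rows are defendant's race (white, black)
# and columns are death penalty verdict (yes, no).
partial_tables = {
    "white victims": [[53, 414], [11, 37]],
    "black victims": [[0, 16], [4, 139]],
}

def odds_ratio(t):
    """Sample odds ratio n11*n22 / (n12*n21) for a 2x2 table of counts."""
    return (t[0][0] * t[1][1]) / (t[0][1] * t[1][0])

for name, t in partial_tables.items():
    print(name, round(odds_ratio(t), 2))   # 0.43 and 0.0

# Collapsing over victims' race gives the marginal 2x2 table.
marginal = [[sum(cells) for cells in zip(r1, r2)]
            for r1, r2 in zip(*partial_tables.values())]
print(round(odds_ratio(marginal), 2))      # 1.45: the direction reverses
```

Both conditional odds ratios are below 1 while the marginal odds ratio exceeds 1, which is exactly the Simpson's-paradox pattern described in the text.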
2.3.4 Marginal versus Conditional Independence
More generally, X may have I categories and Y may have J categories. An I × J × K table describes the relationship between X and Y, controlling for Z. If X and Y are independent in partial table k, then X and Y are called conditionally independent at level k of Z. When Y is a response, this means that

    P(Y = j | X = i, Z = k) = P(Y = j | Z = k)   for all i, j.        (2.8)

More generally, X and Y are said to be conditionally independent given Z when they are conditionally independent at every level of Z, that is, when (2.8) holds for all k. Then, given Z, Y does not depend on X.
Suppose that a single multinomial applies to the entire three-way table, with joint probabilities {π_{ijk} = P(X = i, Y = j, Z = k)}. Then

    π_{ijk} = P(X = i, Z = k) P(Y = j | X = i, Z = k),

which under conditional independence of X and Y, given Z, equals

    π_{i+k} P(Y = j | Z = k) = π_{i+k} P(Y = j, Z = k) / P(Z = k).

Thus, conditional independence is then equivalent to

    π_{ijk} = π_{i+k} π_{+jk} / π_{++k}   for all i, j, and k.        (2.9)
TABLE 2.7 Expected Frequencies Showing That Conditional Independence Does Not Imply Marginal Independence

                               Response
    Clinic    Treatment    Success    Failure
    1         A               18         12
              B               12          8
    2         A                2          8
              B                8         32
    Total     A               20         20
              B               20         40
Conditional independence does not imply marginal independence (Yule 1903). For instance, summing (2.9) over k on both sides yields

    π_{ij+} = Σ_k (π_{i+k} π_{+jk} / π_{++k}).

All three terms in the summation involve k, and this does not simplify to π_{ij+} = π_{i++} π_{+j+}, marginal independence.
For 2 × 2 × K tables, X and Y are conditionally independent when the odds ratio between X and Y equals 1 at each category of Z. The expected frequencies {μ_{ijk}} in Table 2.7 illustrate this relation for Y = response (success, failure), X = drug treatment (A, B), and Z = clinic (1, 2). From (2.7), the conditional XY odds ratios are

    θ_{XY(1)} = (18 × 8)/(12 × 12) = 1.0,    θ_{XY(2)} = (2 × 32)/(8 × 8) = 1.0.

Given the clinic, response and treatment are conditionally independent. The marginal table combines the tables for the two clinics. Its odds ratio is θ_{XY} = (20 × 40)/(20 × 20) = 2.0, so the variables are not marginally independent.
Ignoring the clinic, why are the odds of a success for treatment A twice those for treatment B? The conditional XZ and YZ odds ratios give a clue. The odds ratio between Z and either X or Y, at each fixed category of the other variable, equals 6.0. For instance, the XZ odds ratio at the first category of Y equals (18 × 8)/(12 × 2) = 6.0. The conditional odds (given response) of receiving treatment A at clinic 1 are six times those at clinic 2, and the conditional odds (given treatment) of success at clinic 1 are six times those at clinic 2. Clinic 1 tends to use treatment A more often, and clinic 1 also tends to have more successes. For instance, if patients at clinic 1 tended to be younger and in better health than those at clinic 2, perhaps they had a better success rate regardless of the treatment received.
It is misleading to study only the marginal table, concluding that successes
are more likely with treatment A. Subjects within a particular clinic are likely
to be more homogeneous than the overall sample, and response is independent of treatment in each clinic.
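Table 2.7's contrast between conditional and marginal association is easy to verify numerically. A minimal sketch (our own encoding of the table as nested lists):

```python
# Table 2.7: expected frequencies by clinic (Z), treatment (X), response (Y).
# Within each clinic, rows are treatments A, B; columns are (success, failure).
clinic1 = [[18, 12], [12, 8]]
clinic2 = [[2, 8], [8, 32]]

def odds_ratio(t):
    """Odds ratio t11*t22 / (t12*t21) for a 2x2 table."""
    return (t[0][0] * t[1][1]) / (t[0][1] * t[1][0])

print(odds_ratio(clinic1), odds_ratio(clinic2))  # 1.0 1.0: conditionally independent

# Collapsing over clinics gives the marginal treatment-by-response table.
marginal = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(clinic1, clinic2)]
print(marginal, odds_ratio(marginal))            # [[20, 20], [20, 40]] 2.0
```

Both conditional odds ratios equal 1.0, yet the collapsed table has odds ratio 2.0, matching the discussion above.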
2.3.5 Homogeneous Association
A 2 × 2 × K table has homogeneous XY association when

    θ_{XY(1)} = θ_{XY(2)} = ··· = θ_{XY(K)}.

Then the effect of X on Y is the same at each category of Z. Conditional independence of X and Y is the special case in which each θ_{XY(k)} = 1.0.
Under homogeneous XY association, homogeneity also holds for the other
associations. For instance, the conditional odds ratio between two categories
of X and two categories of Z is identical at each category of Y. For the odds
ratio, homogeneous association is a symmetric property. It applies to any pair
of variables viewed across the categories of the third. When it occurs, there is
said to be no interaction between two variables in their effects on the other
variable.
When interaction exists, the conditional odds ratio for any pair of variables changes across categories of the third. For X = smoking (yes, no), Y = lung cancer (yes, no), and Z = age (<45, 45–65, >65), suppose that θ_{XY(1)} = 1.2, θ_{XY(2)} = 3.9, and θ_{XY(3)} = 8.8. Then smoking has a weak effect on lung cancer for young people, but the effect strengthens considerably with age. Age is called an effect modifier; the effect of smoking is modified depending on its value.
For the death penalty data (Table 2.6), θ̂_{XY(1)} = 0.43 and θ̂_{XY(2)} = 0.0. The values are not close, but the second estimate is unstable because of the zero cell count. Adding 1/2 to each cell count gives θ̂_{XY(2)} = 0.94. Because θ̂_{XY(2)} is unstable and because further variation occurs from sampling variability, these partial tables do not necessarily contradict homogeneous association in a population. In Section 6.3 we show how to analyze whether sample data are consistent with homogeneous association or conditional independence.
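The effect of adding 1/2 to each cell of the unstable second partial table can be checked with a one-liner (a sketch; the table encoding is ours):

```python
# Black-victims partial table of Table 2.6: rows are defendant's race
# (white, black), columns are death penalty (yes, no). The zero cell makes
# the sample odds ratio 0; adding 1/2 to every count stabilizes the estimate.
t = [[0, 16], [4, 139]]
smoothed = [[n + 0.5 for n in row] for row in t]
or_smoothed = (smoothed[0][0] * smoothed[1][1]) / (smoothed[0][1] * smoothed[1][0])
print(round(or_smoothed, 2))   # 0.94, much closer to the first partial table's 0.43
```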
2.4 EXTENSIONS FOR I × J TABLES
For 2 × 2 tables, a single number such as the odds ratio can summarize the association. For I × J tables, it is rarely possible to summarize association by a single number without some loss of information. However, a set of odds ratios or another summary index can describe certain features of the association.
2.4.1 Odds Ratios in I × J Tables
Odds ratios can use each of the (I choose 2) = I(I − 1)/2 pairs of rows in combination with each of the (J choose 2) = J(J − 1)/2 pairs of columns. For rows a and b and columns c and d, the odds ratio (π_{ac} π_{bd})/(π_{bc} π_{ad}) uses four cells in a rectangular pattern. There are (I choose 2)(J choose 2) odds ratios of this type. This set of odds ratios contains much redundant information.
Consider the subset of (I − 1)(J − 1) local odds ratios

    θ_{ij} = π_{ij} π_{i+1,j+1} / (π_{i,j+1} π_{i+1,j}),    i = 1, ..., I − 1,  j = 1, ..., J − 1.        (2.10)
Figure 2.3 shows that local odds ratios use cells in adjacent rows and adjacent columns. These (I − 1)(J − 1) odds ratios determine all odds ratios formed from pairs of rows and pairs of columns. To illustrate, in Table 2.1, the sample local odds ratio is 2.08 for the first two columns and 1.74 for the second and third columns. In each case, the more serious outcome was more prevalent for the placebo group. The product of these two odds ratios is 3.63, which is the odds ratio for the first and third columns.

FIGURE 2.3 Odds ratios for I × J tables.
Construction (2.10) for a minimal set of odds ratios is not unique. Another basic set is

    α_{ij} = π_{ij} π_{IJ} / (π_{Ij} π_{iJ}),    i = 1, ..., I − 1,  j = 1, ..., J − 1.        (2.11)

This uses the rectangular pattern of cells determined by the cell in row i and column j and the cell in the last row and last column. Figure 2.3 illustrates.
Given the marginal distributions {π_{i+}} and {π_{+j}}, when {π_{ij} > 0}, conversion of the probabilities into the set of odds ratios (2.10) or (2.11) does not discard information. The cell probabilities determine the odds ratios, and given the marginals, the odds ratios determine the cell probabilities. In this sense, (I − 1)(J − 1) parameters can describe any association in an I × J table. Independence is equivalent to all (I − 1)(J − 1) odds ratios equaling 1.0.
For three-way I × J × K tables, sets of odds ratios in the partial tables describe the conditional association. Homogeneous XY association means that any conditional odds ratio formed using two categories of X and two categories of Y is the same at each category of Z.
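The claim that local odds ratios determine the others can be illustrated numerically. In the sketch below the 2 × 3 probability table is hypothetical (Table 2.1's counts are not reproduced in this passage), but the multiplicative relation it checks holds for any positive table:

```python
# Local odds ratios (2.10) for a table of probabilities pi[i][j], and a check
# that adjacent local odds ratios multiply to give the odds ratio for
# non-adjacent columns. The table here is hypothetical.
pi = [[0.20, 0.15, 0.10],
      [0.10, 0.15, 0.30]]

def local_or(pi, i, j):
    """theta_ij = pi[i][j] * pi[i+1][j+1] / (pi[i][j+1] * pi[i+1][j])."""
    return (pi[i][j] * pi[i + 1][j + 1]) / (pi[i][j + 1] * pi[i + 1][j])

t11, t12 = local_or(pi, 0, 0), local_or(pi, 0, 1)
# Odds ratio for columns 1 and 3 computed directly from the four corner cells:
t13 = (pi[0][0] * pi[1][2]) / (pi[0][2] * pi[1][0])
assert abs(t11 * t12 - t13) < 1e-12   # product of adjacent local odds ratios
print(t11, t12, t13)
```

This mirrors the Table 2.1 illustration above, where 2.08 × 1.74 = 3.63.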
2.4.2 Summary Measures of Association
An alternative way to describe association uses a single summary index. We discuss this first for nominal variables and then ordinal variables. The most interpretable indices for nominal variables have the same structure as R-squared for interval variables. It and the more general intraclass correlation coefficient and correlation ratio (Kendall and Stuart 1979) describe the proportional reduction in variance from the marginal distribution of the response Y to the conditional distributions of Y given an explanatory variable X.
Let V(Y) denote a measure of variation for the marginal distribution {π_{+j}} of Y, and let V(Y | i) denote this measure computed for the conditional distribution {π_{1|i}, ..., π_{J|i}} of Y at the ith setting of X. A proportional reduction in variation measure has the form

    [V(Y) − E[V(Y | X)]] / V(Y),        (2.12)

where E[V(Y | X)] is the expectation of the conditional variation taken with respect to the distribution of X. For the marginal distribution {π_{i+}} of X, E[V(Y | X)] = Σ_i π_{i+} V(Y | i).
For a nominal response, Theil (1970) proposed an index using the variation measure V(Y) = −Σ_j π_{+j} log π_{+j}, called the entropy. For contingency tables, the proportional reduction in entropy equals

    U = − [Σ_i Σ_j π_{ij} log(π_{ij} / (π_{i+} π_{+j}))] / [Σ_j π_{+j} log π_{+j}],        (2.13)

called the uncertainty coefficient. This measure is well defined when more than one π_{+j} > 0. It takes values between 0 and 1: U = 0 is equivalent to independence of X and Y; U = 1 is equivalent to a lack of conditional variation, in the sense that for each i, π_{j|i} = 1 for some j.
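Formula (2.13) translates directly into code, which makes the two boundary cases easy to verify. The following sketch uses two hypothetical tables of our own: an exactly independent table (U = 0) and a perfectly predictive one (U = 1):

```python
from math import log

# Uncertainty coefficient (2.13) for a table of joint probabilities pi[i][j].
def uncertainty_coefficient(pi):
    row = [sum(r) for r in pi]            # marginal pi_{i+}
    col = [sum(c) for c in zip(*pi)]      # marginal pi_{+j}
    num = -sum(p * log(p / (row[i] * col[j]))
               for i, r in enumerate(pi) for j, p in enumerate(r) if p > 0)
    den = sum(c * log(c) for c in col if c > 0)
    return num / den

# Independent table (pi_ij = pi_{i+} * pi_{+j}): U should be 0.
indep = [[0.6 * 0.3, 0.6 * 0.7], [0.4 * 0.3, 0.4 * 0.7]]
assert abs(uncertainty_coefficient(indep)) < 1e-9

# Perfect prediction (each row concentrated in one column): U should be 1.
perfect = [[0.5, 0.0], [0.0, 0.5]]
assert abs(uncertainty_coefficient(perfect) - 1.0) < 1e-12
```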
Various measures of form (2.12) describe association in I × J tables (e.g., Problems 2.38 and 2.39). A difficulty with them is developing intuition for how large a value constitutes a strong association. What does it mean, for instance, to say that there is a 30% reduction in entropy? Summary measures seem easier to interpret and more useful when both classifications are ordinal, as discussed next.
2.4.3 Ordinal Trends: Concordant and Discordant Pairs
In Table 2.8 the variables are income and job satisfaction, measured for the black males in a national (U.S.) sample. Both classifications are ordinal, job satisfaction with the categories very dissatisfied (VD), little dissatisfied (LD), moderately satisfied (MS), and very satisfied (VS).
When X and Y are ordinal, a monotone trend association is common. As
the level of X increases, responses on Y tend to increase toward higher
levels, or responses on Y tend to decrease toward lower levels. For instance,
perhaps job satisfaction tends to increase as income does. A single parameter
can describe this trend. Measures analogous to the correlation describe the
degree to which the relationship is monotone. Some measures are based on
classifying each pair of subjects as concordant or discordant. A pair is
concordant if the subject ranked higher on X also ranks higher on Y. The pair is discordant if the subject ranking higher on X ranks lower on Y. The pair is tied if the subjects have the same classification on X and/or Y.

TABLE 2.8 Cross-Classification of Job Satisfaction by Income

                              Job Satisfaction
    Income           Very          Little        Moderately    Very
    (dollars)        Dissatisfied  Dissatisfied  Satisfied     Satisfied
    < 15,000              1             3            10            6
    15,000–25,000         2             3            10            7
    25,000–40,000         1             6            14           12
    > 40,000              0             1             9           11

Source: 1996 General Social Survey, National Opinion Research Center.
We illustrate for Table 2.8. Consider a pair of subjects, one in the cell (<15, VD) and the other in the cell (15–25, LD). This pair is concordant, since the second subject ranks higher than the first both on income and on job satisfaction. The subject in cell (<15, VD) forms concordant pairs when matched with each of the three subjects classified (15–25, LD), so these two cells provide 1 × 3 = 3 concordant pairs. The subject in the cell (<15, VD) is also part of a concordant pair when matched with each of the other (10 + 7 + 6 + 14 + 12 + 1 + 9 + 11) subjects ranked higher on both variables. Similarly, the three subjects in the (<15, LD) cell are part of concordant pairs when matched with the (10 + 7 + 14 + 12 + 9 + 11) subjects ranked higher on both variables.
The total number of concordant pairs, denoted by C, equals

    C = 1(3 + 10 + 7 + 6 + 14 + 12 + 1 + 9 + 11)
        + 3(10 + 7 + 14 + 12 + 9 + 11) + 10(7 + 12 + 11)
        + 2(6 + 14 + 12 + 1 + 9 + 11) + 3(14 + 12 + 9 + 11)
        + 10(12 + 11) + 1(1 + 9 + 11) + 6(9 + 11) + 14(11) = 1331.

The total number of discordant pairs of observations is

    D = 3(2 + 1 + 0) + 10(2 + 3 + 1 + 6 + 0 + 1) + ··· + 12(0 + 1 + 9) = 849.

In this example, C > D, suggesting a tendency for low income to occur with low job satisfaction and high income with high job satisfaction.
Consider two independent observations from a joint probability distribution {π_{ij}}. For that pair, the probabilities of concordance and discordance are

    Π_c = 2 Σ_i Σ_j π_{ij} ( Σ_{h>i} Σ_{k>j} π_{hk} ),
    Π_d = 2 Σ_i Σ_j π_{ij} ( Σ_{h>i} Σ_{k<j} π_{hk} ).

Here i and j are fixed in the inner summations, and the factor of 2 occurs because the first observation could be in cell (i, j) and the second in cell (h, k), or vice versa. Several association measures for ordinal variables utilize the difference Π_c − Π_d.
2.4.4 Ordinal Measure of Association: Gamma
Given that a pair is untied on both variables, Π_c/(Π_c + Π_d) is the probability of concordance and Π_d/(Π_c + Π_d) is the probability of discordance. The difference between these probabilities is

    γ = (Π_c − Π_d) / (Π_c + Π_d),        (2.14)

called gamma (Goodman and Kruskal 1954). The sample version is γ̂ = (C − D)/(C + D).
Like the correlation, gamma treats the variables symmetrically: it is unnecessary to identify one classification as a response variable. Also like the correlation, gamma has range −1 ≤ γ ≤ 1. A reversal in the category orderings of one variable causes a change in the sign of γ. Whereas the absolute value of the correlation is 1 when the relationship between X and Y is perfectly linear, only monotonicity is required for |γ| = 1, with γ = 1 if Π_d = 0 and γ = −1 if Π_c = 0. Independence implies that γ = 0, but the converse is not true. For instance, a U-shaped joint distribution can have Π_c = Π_d and hence γ = 0.
2.4.5 Gamma for Job Satisfaction Example
For Table 2.8, C = 1331 and D = 849. Hence,

    γ̂ = (1331 − 849) / (1331 + 849) = 0.221.

Only a weak tendency exists for job satisfaction to increase as income increases. Of the untied pairs, the proportion of concordant pairs is 0.221 higher than the proportion of discordant pairs.
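The pair counts C and D and the sample gamma for Table 2.8 can be reproduced with a short script. This is a sketch of the counting rule described above, applied to our own nested-list encoding of the table:

```python
# Concordant (C) and discordant (D) pair counts and sample gamma for
# Table 2.8 (income by job satisfaction), rows ordered low-to-high income,
# columns ordered VD, LD, MS, VS.
n = [[1, 3, 10, 6],
     [2, 3, 10, 7],
     [1, 6, 14, 12],
     [0, 1, 9, 11]]

C = D = 0
for i, row in enumerate(n):
    for j, nij in enumerate(row):
        for h in range(i + 1, len(n)):      # only pairs with h > i
            for k, nhk in enumerate(n[h]):
                if k > j:
                    C += nij * nhk          # concordant: higher on both
                elif k < j:
                    D += nij * nhk          # discordant: higher on X, lower on Y
gamma = (C - D) / (C + D)
print(C, D, round(gamma, 3))   # 1331 849 0.221
```

The output matches the values derived by hand in Sections 2.4.3 and 2.4.5.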
NOTES
Section 2.2: Comparing Two Proportions
2.1. Breslow (1996) presented an interesting overview of the development of methods for case–control studies.

2.2. For 2 × 2 tables, Edwards (1963) showed that functions of the odds ratio are the only statistics that are invariant both to row–column interchange and to multiplication within rows or within columns by a constant. For I × J tables, Altham (1970) gave related results. Yule (1912, p. 587) had argued that multiplicative invariance is a desirable property for measures of association, especially when proportions sampled in various marginal categories are arbitrary. Goodman (2000) showed five ways of viewing association in a 2 × 2 table and proposed a general measure that includes all five.
Section 2.3: Partial Association in Stratified 2 × 2 Tables
2.3. Paik (1985) proposed circle diagrams of the type in Figure 2.2 to summarize three-way tables. Friendly (2000) discussed graphical presentation of categorical data. For more on Simpson’s paradox and when it can happen, see Blyth (1972), Davis (1989), Dong (1998), Samuels (1993), and Simpson (1951). Good and Mittal (1989) extended it to an amalgamation paradox, whereby a marginal measure is greater than the maximum or less than the minimum of the partial table measures.
Section 2.4: Extensions for I × J Tables
2.4. For continuous variables, samples can be fully ranked (i.e., no ties occur), so C + D = (n choose 2) and γ̂ = (C − D)/(n choose 2). This is Kendall’s tau. Agresti (1984, Chaps. 9 and 10) and Kruskal (1958) surveyed ordinal measures of association. These also apply when one variable is ordinal and the other is binary. When Y is ordinal and X is nominal with I > 2, no measure presented in Section 2.4 is very helpful. Ordinal modeling approaches (Section 7.2) use a parameter for each category of X; comparing parameters compares the ordinal response for pairs of categories of X.
PROBLEMS
Applications
2.1 An article in the New York Times (Feb. 17, 1999) about the PSA blood test for detecting prostate cancer stated: ‘‘The test fails to detect prostate cancer in 1 in 4 men who have the disease (false-negative results), and as many as two-thirds of the men tested receive false-positive results.’’ Let C (C̄) denote the event of having (not having) prostate cancer, and let + (−) denote a positive (negative) test result. Which is true: P(− | C) = 1/4 or P(C | −) = 1/4? P(C | +) = 2/3 or P(+ | C) = 2/3? Determine the sensitivity and specificity.
2.2 A diagnostic test has sensitivity = specificity = 0.80. Find the odds ratio between true disease status and the diagnostic test result.
2.3 Table 2.9 is based on records of accidents in 1988 compiled by the Department of Highway Safety and Motor Vehicles in Florida. Identify the response variable, and find and interpret the difference of proportions, relative risk, and odds ratio. Why are the relative risk and odds ratio approximately equal?
TABLE 2.9 Data for Problem 2.3

                           Injury
    Safety Equipment
    in Use            Fatal    Nonfatal
    None              1601     162,527
    Seat belt          510     412,368

Source: Florida Department of Highway Safety and Motor Vehicles.
2.4 Consider the following two studies reported in the New York Times.
a. A British study reported (Dec. 3, 1998) that of smokers who get lung cancer, ‘‘women were 1.7 times more vulnerable than men to get small-cell lung cancer.’’ Is 1.7 the odds ratio or the relative risk?
b. A National Cancer Institute study about tamoxifen and breast cancer reported (Apr. 7, 1998) that the women taking the drug were 45% less likely to experience invasive breast cancer than were women taking placebo. Find the relative risk for (i) those taking the drug compared to those taking placebo, and (ii) those taking placebo compared to those taking the drug.
2.5 A study (E. G. Krug et al., Internat. J. Epidemiol., 27: 214–221, 1998) reported that the number of gun-related deaths per 100,000 people in 1994 was 14.24 in the United States, 4.31 in Canada, 2.65 in Australia, 1.24 in Germany, and 0.41 in England and Wales. Use the relative risk to compare the United States with the other countries. Interpret.
2.6 A newspaper article preceding the 1994 World Cup semifinal match between Italy and Bulgaria stated that ‘‘Italy is favored 10–11 to beat Bulgaria, which is rated at 10–3 to reach the final.’’ Suppose that this means that the odds that Italy wins are 11/10 and the odds that Bulgaria wins are 3/10. Find the probability that each team wins, and comment.
2.7 In the United States, the estimated annual probability that a woman over the age of 35 dies of lung cancer equals 0.001304 for current smokers and 0.000121 for nonsmokers (M. Pagano and K. Gauvreau, Principles of Biostatistics, Duxbury Press, Pacific Grove, CA, 1993, p. 134).
a. Find and interpret the difference of proportions and the relative risk. Which measure is more informative for these data? Why?
b. Find and interpret the odds ratio. Explain why the relative risk and odds ratio take similar values.
2.8 For adults who sailed on the Titanic on its fateful voyage, the odds ratio between gender (female, male) and survival (yes, no) was 11.4. (For data, see R. J. M. Dawson, J. Statist. Ed. 3, 1995.)
a. What is wrong with the interpretation, ‘‘The probability of survival for females was 11.4 times that for males’’? Give the correct interpretation. When would the quoted interpretation be approximately correct?
b. The odds of survival for females equaled 2.9. For each gender, find the proportion who survived.
2.9 In an article about crime in the United States, Newsweek (Jan. 10, 1994) quoted FBI statistics for 1992 stating that of blacks slain, 94% were slain by blacks, and of whites slain, 83% were slain by whites. Let Y = race of victim and X = race of murderer. Which conditional distribution do these statistics refer to, Y | X or X | Y? What additional information would you need to estimate the probability that the victim was white given that a murderer was white? Find and interpret the odds ratio.
2.10 A research study estimated that under a certain condition, the probability that a subject would be referred for heart catheterization was 0.906 for whites and 0.847 for blacks.
a. A press release about the study stated that the odds of referral for cardiac catheterization for blacks are 60% of the odds for whites. Explain how they obtained 60% (more accurately, 57%).
b. An Associated Press story later described the study and said ‘‘Doctors were only 60% as likely to order cardiac catheterization for blacks as for whites.’’ Explain what is wrong with this interpretation. Give the correct percentage for this interpretation. (In stating results to the general public, it is better to use the relative risk than the odds ratio. It is simpler to understand and less likely to be misinterpreted. For details, see New Engl. J. Med. 341: 279–283, 1999.)
2.11 A 20-year cohort study of British male physicians (R. Doll and R. Peto, British Med. J. 2: 1525–1536, 1976) noted that the proportion per year who died from lung cancer was 0.00140 for cigarette smokers and 0.00010 for nonsmokers. The proportion who died from coronary heart disease was 0.00669 for smokers and 0.00413 for nonsmokers.
a. Describe the association of smoking with each of lung cancer and heart disease, using the difference of proportions, relative risk, and odds ratio. Interpret.
b. Which response is more strongly related to cigarette smoking, in terms of the reduction in number of deaths that would occur with elimination of cigarettes? Explain.
2.12 Table 2.10 refers to applicants to graduate school at the University of California at Berkeley, for fall 1973. It presents admissions decisions by gender of applicant for the six largest graduate departments. Denote the three variables by A = whether admitted, G = gender, and D = department. Find the sample AG conditional odds ratios and the marginal odds ratio. Interpret, and explain why they give such different indications of the AG association.
TABLE 2.10 Data for Problem 2.12

                       Whether Admitted
                    Male              Female
    Department    Yes     No       Yes     No
    A             512     313       89      19
    B             353     207       17       8
    C             120     205      202     391
    D             138     279      131     244
    E              53     138       94     299
    F              22     351       24     317
    Total        1198    1493      557    1278

Source: Data from Freedman et al. (1978, p. 14). See also P. Bickel et al., Science 187: 398–403 (1975).
2.13 State three ‘‘real-world’’ variables X, Y, and Z for which you expect a
marginal association between X and Y but conditional independence
controlling for Z.
2.14 Based on 1987 murder rates in the United States, an Associated Press
story reported that the probability that a newborn child has of eventually being a murder victim is 0.0263 for nonwhite males, 0.0049 for
white males, 0.0072 for nonwhite females, and 0.0023 for white females.
a. Find the conditional odds ratios between race and whether a
murder victim, given the gender. Interpret. Do these variables
exhibit homogeneous association?
b. Half the newborns are of each gender, for each race. Find the
marginal odds ratio between race and whether a murder victim.
2.15 At each age level, the death rate is higher in South Carolina than in Maine, but overall, the death rate is higher in Maine. Explain how this could be possible. (For data, see H. Wainer, Chance 12: 44, 1999.)
2.16 A study of the death penalty for cases in Kentucky between 1976 and 1991 (T. Keil and G. Vito, Amer. J. Criminal Justice 20: 17–36, 1995) indicated that the defendant received the death penalty in 8% of the 391 cases in which a white killed a white, in 2% of the 108 cases in which a black killed a black, in 12% of the 57 cases in which a black killed a white, and in 0% of the 18 cases in which a white killed a black. Form the three-way contingency table, obtain the conditional odds ratios between the defendant’s race and the death penalty verdict, interpret those associations, study whether Simpson’s paradox occurs, and explain why the marginal association is so different from the conditional associations.
2.17 An estimated odds ratio for adult females between the presence of squamous cell carcinoma (yes, no) and smoking behavior (smoker, nonsmoker) equals 11.7 when the smoker category has subjects whose smoking level s is 0 < s < 20 cigarettes per day; it is 26.1 for smokers with s ≥ 20 cigarettes per day (R. C. Brownson et al., Epidemiology 3: 61–64, 1992). Show that the estimated odds ratio between carcinoma (yes, no) and the smoking levels (s ≥ 20, 0 < s < 20) equals 2.2.
2.18 Table 2.11 refers to a retrospective study of lung cancer and tobacco smoking among patients in several English hospitals. The table compares male lung cancer patients with control patients having other diseases, according to the average number of cigarettes smoked daily over a 10-year period preceding the onset of the disease.
a. Find the sample odds of lung cancer at each smoking level and the five odds ratios that pair each level of smoking with no smoking. As smoking increases, is there a trend? Interpret.
b. If the log odds of lung cancer is linearly related to smoking level, the log odds in row i satisfies log(odds_i) = α + βi. Show that this implies that the local odds ratios are identical.
c. Using these data, can you estimate the probability of lung cancer at each level of smoking? Are the estimated odds ratios in part (a) meaningful? Explain.
d. Show that the disease groups are stochastically ordered with respect to their distributions on smoking of cigarettes (see Problem 2.34 and Section 7.3.4). Interpret.
TABLE 2.11 Data for Problem 2.18

                              Disease Group
    Daily Average            Lung Cancer    Control
    Number of Cigarettes     Patients       Patients
    None                        7              61
    < 5                        55             129
    5–14                      489             570
    15–24                     475             431
    25–49                     293             154
    50+                        38              12

Source: Reprinted with permission from R. Doll and A. B. Hill, British Med. J. 2: 1271–1286 (1952).
TABLE 2.12 Data for Problem 2.19

                              Wife’s Rating of Sexual Fun
                             Never or       Fairly    Very     Almost
    Husband’s Rating         Occasionally   Often     Often    Always
    Never or occasionally        7            7         2         3
    Fairly often                 2            8         3         7
    Very often                   1            5         4         9
    Almost always                2            8         9        14

Source: Reprinted with permission from Hout et al. (1987).
2.19 Table 2.12 summarizes responses of 91 married couples in Arizona to a
question about how often sex is fun. Find and interpret a measure of
association between wife’s response and husband’s response.
2.20 Table 2.13 is from an early study on the death penalty in Florida.
Analyze these data and show that Simpson’s paradox occurs.
TABLE 2.13 Data for Problem 2.20

                                  Death Penalty
    Victim’s    Defendant’s
    Race        Race            Yes     No
    White       White            19    132
                Black            11     52
    Black       White             0      9
                Black             6     97

Source: Reprinted with permission from M. L. Radelet, Amer. Sociol. Rev. 46: 918–927 (1981).
Theory and Methods
2.21 For a diagnostic test of a certain disease, π_1 denotes the probability that the diagnosis is positive given that a subject has the disease, and π_2 denotes the probability that the diagnosis is positive given that a subject does not have it. Let ρ denote the probability that a subject does have the disease.
a. Given that the diagnosis is positive, show that the probability that a subject does have the disease is

    π_1 ρ / [π_1 ρ + π_2 (1 − ρ)].
b. Suppose that a diagnostic test for HIV+ status has both sensitivity and specificity equal to 0.95, and ρ = 0.005. Find the probability that a subject is truly HIV+, given that the diagnostic test is positive. To better understand this answer, find the joint probabilities relating diagnosis to actual disease status, and discuss their relative sizes.
2.22 Binomial parameters for two groups are graphed, with π_1 on the horizontal axis and π_2 on the vertical axis. Plot the locus of points for a 2 × 2 table having (a) relative risk = 0.5, (b) odds ratio = 0.5, and (c) difference of proportions = −0.5.
2.23 Let D denote having a certain disease and E denote having exposure to a certain risk factor. The attributable risk (AR) is the proportion of disease cases attributable to that exposure (see Benichou 1998).
a. Let P(Ē) = 1 − P(E). Explain why

    AR = [P(D) − P(D | Ē)] / P(D).

b. Show that AR relates to the relative risk RR by

    AR = P(E)(RR − 1) / [1 + P(E)(RR − 1)].
2.24 For a 2 × 2 table of counts {n_{ij}}, show that the odds ratio is invariant to (a) interchanging rows with columns, and (b) multiplication of cell counts within rows or within columns by c ≠ 0. Show that the difference of proportions and the relative risk do not have these properties.
2.25 For given π_1 and π_2, show that the relative risk cannot be farther than the odds ratio from their independence value of 1.0.
2.26 Explain why for three events E_1, E_2, and E_3 and their complements, it is possible that P(E_1 | E_2) > P(E_1 | Ē_2) even if both P(E_1 | E_2 E_3) < P(E_1 | Ē_2 E_3) and P(E_1 | E_2 Ē_3) < P(E_1 | Ē_2 Ē_3). (Hint: Use Simpson’s paradox for a three-way table.)
2.27 Let π_{ij|k} = P(X = i, Y = j | Z = k). Explain why XY conditional independence is π_{ij|k} = π_{i+|k} π_{+j|k} for all i, j, and k.
2.28 For a 2 × 2 × 2 table, show that homogeneous association is a symmetric property, by showing that equal XY conditional odds ratios is equivalent to equal YZ conditional odds ratios.
2.29 Smith and Jones are baseball players. Smith has a higher batting average than Jones in each of K years. Is it possible that for the combined data from the K years, Jones has the higher batting average? Explain, using an example to illustrate.
2.30 When X and Y are conditionally dependent at each level of Z yet marginally independent, Z is called a suppressor variable. Specify joint probabilities for a 2 × 2 × 2 table to show that this can happen (a) when there is homogeneous association, and (b) when the association has opposite direction in the partial tables.
2.31 Show that the {α_{ij}} in (2.11) determine (a) all (I choose 2)(J choose 2) odds ratios formed from pairs of rows and pairs of columns, and (b) all {θ_{ij}} in (2.10), and vice versa.
2.32 Refer to Problem 2.31. When all rows and columns have positive probability, show that independence is equivalent to all {α_ij = 1}.
2.33 For I × J contingency tables, explain why the variables are independent when the (I − 1)(J − 1) differences π_j|i − π_j|I = 0, i = 1, . . . , I − 1, j = 1, . . . , J − 1.
2.34 A 2 × J table has ordinal response. Let F_j|i = π_1|i + ⋯ + π_j|i. When F_j|2 ≤ F_j|1 for j = 1, . . . , J, the conditional distribution in row 2 is stochastically higher than the one in row 1. Consider the cumulative odds ratios

        θ_j = [F_j|1/(1 − F_j|1)] / [F_j|2/(1 − F_j|2)],   j = 1, . . . , J − 1.

a. Show that log θ_j ≥ 0 for all j is equivalent to row 2 being stochastically higher than row 1. Explain why row 2 is then more likely than row 1 to have observations at the high end of the ordinal scale.
b. If all local log odds ratios are nonnegative, log θ_j ≥ 0 for 1 ≤ j ≤ J − 1 (Lehmann 1966). Show by counterexample that the converse is not true.
2.35 Suppose that {Y_ij} are independent Poisson variates with means {μ_ij}. Show that {P(Y_ij = n_ij) for all i, j}, conditional on {Y_i+ = n_i}, satisfy independent multinomial sampling [i.e., the product of (2.2) for all i] within the rows.
2.36 For 2 × 2 tables, Yule (1900, 1912) introduced

        Q = (π11π22 − π12π21)/(π11π22 + π12π21),

which he labeled Q in honor of the Belgian statistician Quetelet. It is now called Yule's Q.
a. Show that for 2 × 2 tables, Goodman and Kruskal's γ = Q.
b. Show that Q falls between −1 and +1.
c. State conditions under which Q = −1 or Q = +1.
d. Show that Q relates to the odds ratio by Q = (θ − 1)/(θ + 1), a monotone transformation of θ from the [0, ∞] scale onto the [−1, +1] scale.
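As a quick numeric illustration of parts (b) and (d), Yule's Q and the odds ratio can be compared directly (this sketch and the counts in it are ours, not from the text):

```python
# Numeric check of Problem 2.36(d): Yule's Q equals (theta - 1)/(theta + 1),
# where theta is the sample odds ratio. The 2 x 2 counts are hypothetical.
n11, n12, n21, n22 = 50.0, 20.0, 15.0, 40.0

q = (n11 * n22 - n12 * n21) / (n11 * n22 + n12 * n21)   # Yule's Q
theta = (n11 * n22) / (n12 * n21)                       # sample odds ratio

# The monotone-transformation identity of part (d):
assert abs(q - (theta - 1) / (theta + 1)) < 1e-12
print(round(q, 4), round(theta, 4))
```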
2.37 When X and Y are ordinal with counts {n_ij}:
a. Explain why the (n choose 2) pairs of observations partition into C + D + T_X + T_Y − T_XY, where T_X = Σ n_i+(n_i+ − 1)/2 pairs are tied on X, T_Y pairs are tied on Y, and T_XY pairs are tied on X and Y.
b. For each ordered pair of observations (X_a, Y_a) and (X_b, Y_b), let X_ab = sign(X_a − X_b) and Y_ab = sign(Y_a − Y_b). Show that the sample correlation for the n(n − 1) distinct (X_ab, Y_ab) pairs is

        τ̂_b = (C − D) / {[(n choose 2) − T_X][(n choose 2) − T_Y]}^{1/2}.

This ordinal measure, called Kendall's tau-b (Kendall 1945), is less sensitive than gamma to the choice of response categories.
c. Let d = (C − D)/[(n choose 2) − T_X]. Explain why d is the difference between the proportions of concordant and discordant pairs out of those pairs untied on X (Somers 1962). (For 2 × 2 tables, d equals the difference of proportions, and tau-b equals the correlation between X and Y.)
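The identity in part (b) can be checked by brute force on a small table (this sketch, including the hypothetical table of counts, is ours, not the text's):

```python
from math import comb, sqrt

# Brute-force check of Problem 2.37(b) on a small hypothetical 2 x 2 table:
# the correlation of the sign pairs (X_ab, Y_ab) over all n(n-1) ordered
# pairs equals tau-b computed from C, D, and the tie counts.
counts = {(0, 0): 3, (0, 1): 1, (1, 0): 2, (1, 1): 4}

# One (x, y) record per observation.
data = [xy for xy, k in counts.items() for _ in range(k)]
n = len(data)

def sign(v):
    return (v > 0) - (v < 0)

# Correlation of the sign pairs; both means are 0 by antisymmetry.
sxy = sxx = syy = 0
for a in range(n):
    for b in range(n):
        if a != b:
            xs = sign(data[a][0] - data[b][0])
            ys = sign(data[a][1] - data[b][1])
            sxy += xs * ys
            sxx += xs * xs
            syy += ys * ys
corr = sxy / sqrt(sxx * syy)

# Concordant/discordant pairs and ties, computed from the table directly.
C = sum(k1 * k2 for (i1, j1), k1 in counts.items()
        for (i2, j2), k2 in counts.items() if i1 < i2 and j1 < j2)
D = sum(k1 * k2 for (i1, j1), k1 in counts.items()
        for (i2, j2), k2 in counts.items() if i1 < i2 and j1 > j2)
TX = sum(comb(sum(k for (i, _), k in counts.items() if i == r), 2)
         for r in {i for i, _ in counts})
TY = sum(comb(sum(k for (_, j), k in counts.items() if j == c), 2)
         for c in {j for _, j in counts})
tau_b = (C - D) / sqrt((comb(n, 2) - TX) * (comb(n, 2) - TY))

assert abs(corr - tau_b) < 1e-12
print(round(tau_b, 4))
```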
2.38 Goodman and Kruskal (1954) proposed an association measure (tau) for nominal variables based on variation measure

        V(Y) = Σ_j π_+j(1 − π_+j) = 1 − Σ_j π²_+j.

a. Show that V(Y) is the probability that two independent observations on Y fall in different categories (called the Gini concentration index). Show that V(Y) = 0 when π_+j = 1 for some j and V(Y) takes maximum value (J − 1)/J when π_+j = 1/J for all j.
b. For the proportional reduction in variation, show that E[V(Y | X)] = 1 − Σ_i Σ_j π²_ij/π_i+. [The resulting measure (2.12) is called the concentration coefficient. Like U, τ = 0 is equivalent to independence. Haberman (1982) presented generalized concentration and uncertainty coefficients.]
2.39 The measure of association lambda for nominal variables (Goodman and Kruskal 1954) has V(Y) = 1 − max{π_+j} and V(Y | i) = 1 − max_j{π_j|i}. Interpret lambda as a proportional reduction in prediction error for predictions which select the response category that is most likely. Show that independence implies λ = 0 but that the converse is not true.
Categorical Data Analysis, Second Edition. Alan Agresti
Copyright © 2002 John Wiley & Sons, Inc.
ISBN: 0-471-36093-7
CHAPTER 3
Inference for Contingency Tables
In this chapter we introduce inferential methods for contingency tables.
Many of these methods also play a vital role in analyses of later chapters for
which categorical data need not have contingency table form. The methods
assume Poisson, multinomial, or independent binomial sampling.
In Section 3.1 we present confidence intervals for measures of association for 2 × 2 tables, such as the odds ratio. Section 3.2 covers chi-squared tests of the hypothesis of independence between two categorical variables. Like any significance test, these have limited usefulness. In Section 3.3 we show how to follow up the test using residuals or the partitioning property of chi-squared
to extract components that describe the evidence about the association. In
Section 3.4 we present more powerful inference applicable with ordered
categories. The methods of Sections 3.1 through 3.4 assume large samples. In
Sections 3.5 and 3.6 we introduce small-sample methods.
3.1 CONFIDENCE INTERVALS FOR ASSOCIATION PARAMETERS
The accuracy of estimators of association parameters is characterized by
standard errors of their sampling distributions. In this section we present
large-sample standard errors and confidence intervals.
3.1.1 Interval Estimation of Odds Ratios
The sample odds ratio θ̂ = n11n22/n12n21 for a 2 × 2 table equals 0 or ∞ if any n_ij = 0, and it is undefined if both entries in a row or column are zero. Since these outcomes have positive probabilities, the expected value and variance of θ̂ and log θ̂ do not exist. (In fact, this is also true for ML estimators of model parameters presented in later chapters.) In terms of bias and mean-squared error, Gart and Zweifel (1967) and Haldane (1956)
showed that the amended estimators

        θ̃ = [(n11 + 0.5)(n22 + 0.5)] / [(n12 + 0.5)(n21 + 0.5)]

and log θ̃ behave well (Problem 14.4).
The estimators θ̂ and θ̃ have the same asymptotic normal distribution around θ. Unless n is quite large, however, their distributions are highly skewed. When θ = 1, for instance, θ̂ cannot be much smaller than θ (since θ̂ ≥ 0), but it could be much larger with nonnegligible probability. The log transform, having an additive rather than multiplicative structure, converges more rapidly to normality. An estimated standard error for log θ̂ is

        σ̂(log θ̂) = (1/n11 + 1/n12 + 1/n21 + 1/n22)^{1/2}.        (3.1)
We derive this formula in Section 3.1.7.
By the large-sample normality of log θ̂,

        log θ̂ ± z_α/2 σ̂(log θ̂)        (3.2)

is a Wald confidence interval for log θ. Exponentiating (taking antilogs of) its endpoints provides a confidence interval for θ. Woolf (1955) proposed this interval. It works quite well, usually being a bit conservative (i.e., actual coverage probability higher than the nominal level).
When θ̂ = 0 or ∞, Woolf's interval does not exist. When θ̂ = 0, one should take 0 as the lower limit, and when θ̂ = ∞, one should take ∞ as the upper limit. The other bound can use the Woolf formula following some adjustment, such as Gart's (1966), which replaces {n_ij} by {n_ij + 0.5} in the estimator and standard error. A less ad hoc approach forms the interval by inverting score tests (Cornfield 1956) or likelihood-ratio tests for θ, as we discuss in Section 3.1.8.
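The computation in (3.1)-(3.2) can be sketched in a few lines of code (the function below is our own illustration, not from the text); applied to the aspirin counts of Table 3.1, it reproduces the results reported in Section 3.1.2:

```python
from math import exp, log, sqrt

# A sketch of the Woolf/Wald interval (3.1)-(3.2) for the odds ratio.
# The function name is our own; n11, n12, n21, n22 are the 2 x 2 cell counts.
def odds_ratio_ci(n11, n12, n21, n22, z=1.96):
    theta = (n11 * n22) / (n12 * n21)             # sample odds ratio
    se = sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)      # SE (3.1) of log odds ratio
    return theta, exp(log(theta) - z * se), exp(log(theta) + z * se)

# Aspirin data of Table 3.1: reproduces odds ratio 1.56, 95% CI (0.85, 2.85).
theta, lo, hi = odds_ratio_ci(28, 656, 18, 658)
print(round(theta, 2), round(lo, 2), round(hi, 2))
```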
3.1.2 Aspirin and Myocardial Infarction Example
We illustrate inference for the odds ratio with Table 3.1, based on a Swedish study of the association between aspirin use and myocardial infarction similar to that described in Section 2.2.5. The study randomly assigned 1360 patients who had already suffered a stroke to an aspirin treatment (one low-dose tablet a day) or to a placebo treatment. Table 3.1 reports the number of deaths due to myocardial infarction during a follow-up period of about 3 years.
The sample odds ratio θ̂ = 1.56 is close to θ̃ = 1.55, since no cell count is especially small. The standard error (3.1) of log θ̂ = 0.445 is σ̂(log θ̂) = 0.307.
TABLE 3.1 Swedish Study on Aspirin Use and Myocardial Infarction

                Myocardial Infarction
             Yes     No     Total
Placebo       28    656      684
Aspirin       18    658      676

Source: Based on results described in Lancet 338: 1345-1349 (1991).
A 95% confidence interval for log θ in the population this sample represents is 0.445 ± 1.96(0.307), or (−0.157, 1.047). The corresponding interval for θ is [exp(−0.157), exp(1.047)], or (0.85, 2.85). The estimate of the true odds ratio is rather imprecise.
Since the confidence interval for θ contains 1.0, it is plausible that the true odds of death due to myocardial infarction are equal for aspirin and placebo. If there truly is a beneficial effect of aspirin but the odds ratio is not large, it may require a large sample size to show that benefit because of the relatively small number of myocardial infarction cases (Problem 3.21).
3.1.3 Interval Estimation of Difference of Proportions
The difference of proportions and the relative risk compare conditional distributions of a response variable for two groups. For these measures, we treat the samples as independent binomials. For group i, y_i has a binomial distribution with sample size n_i and a probability π_i of a ''success'' response. The sample proportion π̂_i = y_i/n_i has expectation π_i and variance π_i(1 − π_i)/n_i. Since π̂1 and π̂2 are independent, their difference has

        E(π̂1 − π̂2) = π1 − π2

and standard error

        σ(π̂1 − π̂2) = [π1(1 − π1)/n1 + π2(1 − π2)/n2]^{1/2}.        (3.3)

The estimate σ̂(π̂1 − π̂2) uses formula (3.3) with π_i replaced by π̂_i. Then

        (π̂1 − π̂2) ± z_α/2 σ̂(π̂1 − π̂2)        (3.4)

is a Wald confidence interval for π1 − π2. Like the Wald interval (1.13) for a single proportion, it usually has true coverage probability less than the nominal confidence coefficient, especially when π1 and π2 are near 0 or 1. More complex but better methods are cited in Section 3.1.8, Note 3.2, and Problem 3.23.
3.1.4 Interval Estimation of Relative Risk
The sample relative risk is r = π̂1/π̂2. Like the odds ratio, it converges to normality faster on the log scale. The asymptotic standard error of log r is

        σ(log r) = [(1 − π1)/(π1n1) + (1 − π2)/(π2n2)]^{1/2}.        (3.5)

The Wald interval exponentiates endpoints of log r ± z_α/2 σ̂(log r). It works well but can be somewhat conservative. We discuss an alternative method in Section 3.1.8.
For Table 3.1, the sample proportion of myocardial infarction deaths was 0.0409 for subjects taking placebo and 0.0266 for subjects taking aspirin. The sample relative risk is 0.0409/0.0266 = 1.54. The 95% confidence interval for the log relative risk of log(1.54) ± 1.96(0.297) translates to (0.86, 2.75) for the relative risk. We infer that the death rate for those taking placebo was between 0.86 and 2.75 times that for those taking aspirin. The Wald 95% confidence interval for π1 − π2 is 0.014 ± 1.96(0.0098), or (−0.005, 0.033). According to either measure, substantial public health benefits could result from taking aspirin, but no effect or a slight negative effect is also plausible. Results for the larger study described in Section 2.2.5 do show a benefit.
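A minimal sketch of these Wald computations for the Table 3.1 data (our code, reproducing the values quoted above):

```python
from math import exp, log, sqrt

# Wald intervals (3.3)-(3.5) applied to Table 3.1 (our sketch; placebo:
# 28 deaths out of 684, aspirin: 18 deaths out of 676).
y1, n1 = 28, 684   # placebo
y2, n2 = 18, 676   # aspirin
p1, p2 = y1 / n1, y2 / n2

# Difference of proportions: interval (3.4) with estimated SE (3.3).
se_diff = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
diff_ci = (p1 - p2 - 1.96 * se_diff, p1 - p2 + 1.96 * se_diff)

# Relative risk: Wald interval on the log scale with SE (3.5).
r = p1 / p2
se_logr = sqrt((1 - p1) / (p1 * n1) + (1 - p2) / (p2 * n2))
rr_ci = (exp(log(r) - 1.96 * se_logr), exp(log(r) + 1.96 * se_logr))

print(round(r, 2), [round(v, 3) for v in diff_ci], [round(v, 2) for v in rr_ci])
```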
3.1.5 Deriving Standard Errors with the Delta Method*
A simple and useful method exists for deriving standard errors for large-sample inferences. Let T_n denote a statistic that is asymptotically normally distributed about a parameter θ, the subscript n expressing its dependence on sample size. Suppose that an estimator is a function g(T_n) of T_n. Then, under mild conditions, g(T_n) itself has a large-sample normal distribution. The standard error depends on how fast g(t) changes for t near θ.
Specifically, for large n, suppose that T_n is normally distributed about θ with standard error σ/√n. That is, as n → ∞, the cdf of √n(T_n − θ) converges to the cdf of a normal random variable with mean 0 and variance σ². This limiting behavior is an example of convergence in distribution, denoted by

        √n(T_n − θ) →d N(0, σ²).

Let g be a function that is at least twice differentiable at θ. Using the Taylor series expansion for g(t) in a neighborhood of t = θ, in Section 14.1.2 we show

        √n [g(T_n) − g(θ)] ≈ √n (T_n − θ) g′(θ)
FIGURE 3.1 Depiction of delta method.
for large n, where g′(θ) = ∂g/∂t evaluated at t = θ. Recall that if a variate Y ∼ N(0, σ²), then cY ∼ N(0, c²σ²). Thus,

        √n [g(T_n) − g(θ)] →d N(0, [g′(θ)]²σ²).        (3.6)

In other words, g(T_n) is approximately normal around g(θ) with variance [g′(θ)]²σ²/n.
Figure 3.1 portrays this result. Locally around θ, g(t) is approximately linear, with slope g′(θ). Then g(T_n) is approximately normal, since linear transformations of normal random variables are themselves normal. The dispersion of g(T_n) values about g(θ) is about |g′(θ)| times the dispersion of T_n values about θ. If the slope of g at θ is 1/2, then g maps a region of T_n values into a region of g(T_n) values only about half as wide.
Result (3.6) is called the delta method. Since g′(θ) and σ² = σ²(θ) usually depend on the unknown parameter θ, the asymptotic variance is unknown. Confidence intervals and tests substitute T_n for θ and use the result that √n[g(T_n) − g(θ)]/|g′(T_n)|σ(T_n) is asymptotically standard normal. For instance,

        g(T_n) ± 1.96 |g′(T_n)|σ(T_n)/√n

is a large-sample Wald 95% confidence interval for g(θ).
3.1.6 Delta Method Applied to Sample Logit*
We illustrate the delta method for a function of the ML estimator T_n = π̂ = y/n of the binomial parameter π, for y successes in n trials. Since E(Y) = nπ and var(Y) = nπ(1 − π), E(π̂) = π and var(π̂) = π(1 − π)/n. Also, π̂
has a large-sample normal distribution by the central limit theorem. So do many functions of π̂.
The log odds function of π̂,

        g(π̂) = log[π̂/(1 − π̂)],

is called the sample logit. Evaluated at π, its derivative equals 1/[π(1 − π)]. By the delta method, the asymptotic variance of the sample logit is π(1 − π)/n (the variance of π̂) multiplied by the square of 1/[π(1 − π)]. That is,

        √n ( log[π̂/(1 − π̂)] − log[π/(1 − π)] ) →d N( 0, 1/[π(1 − π)] ).

The asymptotic normality of π̂ propagates to asymptotic normality of log[π̂/(1 − π̂)].
The asymptotic variance is the variance of the normal distribution that approximates the true distribution, for large n. It is not an approximation for the variance of the true distribution. For 0 < π < 1, the asymptotic variance [nπ(1 − π)]^{−1} of the sample logit is finite. By contrast, the true variance does not exist: Since π̂ = 0 or 1 with positive probability, the logit can equal −∞ or ∞ with positive probability. The probability of an infinite logit converges to zero rapidly as n increases. For large n, the distribution of the sample logit looks essentially normal with mean log[π/(1 − π)] and standard deviation [nπ(1 − π)]^{−1/2}. Thus, for the logit, the asymptotic variance actually has greater use than the true variance. Incidentally, related to this, the bootstrap is not helpful for approximating standard errors for many discrete measures, because it mimics the true rather than the more relevant asymptotic standard error.
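A small check of this calculation (the sample size and π below are illustrative choices of ours, not from the text): the delta-method standard error 1/√[nπ(1 − π)] can be compared with the simulated dispersion of the sample logit:

```python
import random
from math import log, sqrt

# Check of the delta-method variance for the sample logit: the derivative of
# g(p) = log[p/(1-p)] at pi is 1/[pi(1-pi)], so the asymptotic SE should be
# |g'(pi)| * sqrt(pi(1-pi)/n) = 1/sqrt(n*pi*(1-pi)).
n, pi = 500, 0.3
se_delta = (1 / (pi * (1 - pi))) * sqrt(pi * (1 - pi) / n)
assert abs(se_delta - 1 / sqrt(n * pi * (1 - pi))) < 1e-12

# Small Monte Carlo comparison; n is large enough that an infinite logit is
# vanishingly rare, and any such sample would be skipped.
random.seed(1)
logits = []
for _ in range(1000):
    y = sum(random.random() < pi for _ in range(n))
    if 0 < y < n:
        logits.append(log(y / (n - y)))
m = sum(logits) / len(logits)
sd = sqrt(sum((v - m) ** 2 for v in logits) / len(logits))
print(round(se_delta, 4), round(sd, 4))   # the two should be close
```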
3.1.7 Delta Method for Log Odds Ratio*
Standard errors for the log odds ratio and the log relative risk result from a multiparameter version of the delta method. Suppose that {n_i, i = 1, . . . , c} have a multinomial(n, {π_i}) distribution. The sample proportion π̂_i = n_i/n has mean and variance

        E(π̂_i) = π_i   and   var(π̂_i) = π_i(1 − π_i)/n.        (3.7)

In Section 14.1.4 we show that for i ≠ j, π̂_i and π̂_j have covariance

        cov(π̂_i, π̂_j) = −π_iπ_j/n.        (3.8)

The sample proportions (π̂1, π̂2, . . . , π̂_{c−1}) have a large-sample multivariate normal distribution. For functions of them, the delta method implies the
following result, proved in Section 14.1.4:

  Let g(π) denote a differentiable function of {π_i}, with sample value g(π̂) for a multinomial sample. Let

        φ_i = ∂g(π)/∂π_i,   i = 1, . . . , c.

  Then as n → ∞, the distribution of √n[g(π̂) − g(π)]/σ converges to standard normal, where

        σ² = Σ_i π_iφ²_i − (Σ_i π_iφ_i)².        (3.9)

The asymptotic variance depends on {π_i} and the partial derivatives of the measure with respect to {π_i}. In practice, replacing {π_i} and {φ_i} in (3.9) by their sample values yields an ML estimate σ̂² of σ². Then σ̂/√n is an estimated standard error for g(π̂). A large-sample Wald confidence interval for g(π) is

        g(π̂) ± z_α/2 σ̂/√n.
With the substitution of σ̂ for σ in (3.9), the limiting distribution is still standard normal, but convergence is slower. The equivalence in the large-sample distribution is justified as follows: The sample proportions converge in probability to {π_i}, by the weak law of large numbers. Since σ̂ is a continuous function of the sample proportions, it converges in probability to σ, and σ/σ̂ converges in probability to 1. Now

        √n [g(π̂) − g(π)]/σ̂ = { √n [g(π̂) − g(π)]/σ } (σ/σ̂).

The first term on the right-hand side converges in distribution to standard normal, by (3.9), and the second term converges in probability to 1. Thus, their product also has a limiting standard normal distribution.
We now apply the delta method to the log odds ratio, taking g(π) = log θ = log π11 + log π22 − log π12 − log π21. Since

        φ11 = ∂(log θ)/∂π11 = 1/π11,   φ12 = −1/π12,   φ21 = −1/π21,   φ22 = 1/π22,

Σ_i Σ_j π_ijφ_ij = 0 and σ² = Σ_i Σ_j π_ijφ²_ij = Σ_i Σ_j (1/π_ij). The asymptotic standard error of log θ̂ for a multinomial sample {n_ij} is

        σ(log θ̂) = σ/√n = [Σ_i Σ_j 1/(nπ_ij)]^{1/2}.

Since nπ̂_ij = n_ij, the estimated standard error is (3.1).
The delta method also applies directly with θ to obtain σ̂(θ̂) and a Wald confidence interval θ̂ ± z_α/2 σ̂(θ̂). This is not recommended; θ̂ converges more slowly than log θ̂ to normality, this interval could contain negative values, and it does not give results equivalent to those obtained with the Wald interval using 1/θ̂ and its standard error.
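The result can be verified numerically (a check of ours, using the aspirin counts of Table 3.1): evaluating σ² in (3.9) at the sample proportions reproduces (3.1):

```python
from math import sqrt

# Numeric check that the multinomial delta method, with sigma^2 from (3.9)
# and phi_ij = +/- 1/pi_ij, reproduces the standard error (3.1) for the log
# odds ratio. Counts are the aspirin data of Table 3.1.
counts = {(1, 1): 28, (1, 2): 656, (2, 1): 18, (2, 2): 658}
n = sum(counts.values())
pi = {c: k / n for c, k in counts.items()}

sgn = {(1, 1): 1, (1, 2): -1, (2, 1): -1, (2, 2): 1}
phi = {c: sgn[c] / pi[c] for c in pi}      # partial derivatives of log(theta)

mean = sum(pi[c] * phi[c] for c in pi)     # equals 0 for the log odds ratio
var = sum(pi[c] * phi[c] ** 2 for c in pi) - mean ** 2
se = sqrt(var / n)                         # sigma-hat / sqrt(n)

assert abs(mean) < 1e-9
# Identical (up to rounding) to sqrt(sum of 1/n_ij), formula (3.1):
assert abs(se - sqrt(sum(1 / k for k in counts.values()))) < 1e-12
print(round(se, 4))   # 0.3071, as in Section 3.1.2
```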
3.1.8 Score and Profile Likelihood Confidence Intervals*
Standard errors obtained with the delta method appear in Wald confidence intervals. However, intervals based on inverting Wald tests sometimes work poorly for small to moderate n. Alternative intervals result from inverting likelihood-ratio or score tests. Although computationally more complex, these methods often perform better.
We illustrate first with the score method for the difference of proportions. The score test (Mee 1984; Miettinen and Nurminen 1985) of H0: π1 − π2 = Δ has the test statistic

        z(Δ) = [(π̂1 − π̂2) − Δ] / { π̂1(Δ)[1 − π̂1(Δ)]/n1 + π̂2(Δ)[1 − π̂2(Δ)]/n2 }^{1/2},

where π̂_i(Δ) denotes the ML estimate of π_i subject to the constraint π1 − π2 = Δ. That is, π̂1(Δ) and π̂2(Δ) are the values of π1 and π2 satisfying π1 − π2 = Δ that maximize the product of the two binomial probability mass functions. These values do not have closed-form expressions and are determined using numerical methods. The score confidence interval is the set of Δ such that |z(Δ)| < z_α/2. Computations for such intervals require iteration (Nurminen 1986).
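A crude sketch of this inversion (ours, not the Mee or Miettinen-Nurminen algorithm; a grid search stands in for a proper iterative constrained maximization), using the aspirin counts of Table 3.1:

```python
from math import log, sqrt

# Rough illustration of the score interval for pi1 - pi2: the constrained
# ML estimates have no closed form, so here they are found by grid search.
y1, n1, y2, n2 = 28, 684, 18, 676   # Table 3.1 (placebo, aspirin)

def loglik(p1, p2):
    return (y1 * log(p1) + (n1 - y1) * log(1 - p1)
            + y2 * log(p2) + (n2 - y2) * log(1 - p2))

def z_score(delta, grid=2000):
    # Maximize the constrained likelihood over p1, with p2 = p1 - delta.
    lo = max(1e-9, delta + 1e-9)
    hi = min(1 - 1e-9, 1 + delta - 1e-9)
    p1 = max((lo + (hi - lo) * k / grid for k in range(1, grid)),
             key=lambda p: loglik(p, p - delta))
    p2 = p1 - delta
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (y1 / n1 - y2 / n2 - delta) / se

# The 95% score interval is the set of delta values with |z(delta)| < 1.96;
# scan a coarse grid of delta values.
inside = [d / 1000 for d in range(-40, 60) if abs(z_score(d / 1000)) < 1.96]
print(min(inside), max(inside))
```

The interval obtained this way is close to the Wald interval for these data, since no count is small; the methods differ more for sparse tables.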
For the relative risk also, slightly better performance results with an interval using the score method (Bedrick 1987; Gart and Nam 1988; Koopman 1984; Miettinen and Nurminen 1985; Nurminen 1986). Cornfield (1956) and Miettinen and Nurminen (1985) showed the score interval for the odds ratio. We prefer not to use a continuity or finite-sampling correction with these intervals, as then performance is too conservative. The fact that the score intervals are computationally more complex than Wald intervals should not be an impediment to their use in this modern era of computing, as the principle behind them is simple. However, currently they are not available in standard software.
For a confidence interval based on the likelihood-ratio test, we illustrate with the odds ratio. The multinomial likelihood for a 2 × 2 table is a function of {π11, π12, π21}. Equivalently, it can be expressed in terms of {θ, π1+, π+1} (recall Section 2.4.1). Thus, in inverting a likelihood-ratio test of H0: θ = θ0 to check whether θ0 belongs in the confidence interval, there are two nuisance parameters. Their null ML estimates π̂1+(θ0) and π̂+1(θ0) that maximize the likelihood under the null vary as θ0 does.
The profile log-likelihood function is L(θ0, π̂1+(θ0), π̂+1(θ0)), viewed as a function of θ0. For each θ0 this function gives the maximum of the ordinary log likelihood subject to the constraint θ = θ0. Evaluated at θ0 = θ̂, this is the maximized log likelihood L(θ̂, π̂1+, π̂+1), which occurs at the sample proportions π̂1+ = n1+/n and π̂+1 = n+1/n. The profile likelihood confidence interval for θ is the set of θ0 for which

        −2[ L(θ0, π̂1+(θ0), π̂+1(θ0)) − L(θ̂, π̂1+, π̂+1) ] < χ²1(α).

This contains all θ0 not rejected in likelihood-ratio tests of nominal size α. The profile likelihood approach is available with some software (e.g., for SAS, see Table A.2 in Appendix A). A related approach, discussed in Section 6.7.1, uses a conditional likelihood function that eliminates the nuisance parameters by conditioning on their sufficient statistics. This is beneficial when there are many nuisance parameters. An advantage of score and likelihood-based intervals is that unlike the Wald, they are not adversely affected when the sample relative risk or odds ratio is 0 or ∞.
In this section we have discussed interval estimation. Significance tests normally refer to a null hypothesis value of 0.0 for the log odds ratio, log relative risk, and difference of proportions. These are special cases of independence applied to 2 × 2 tables. In the next section we present tests of independence for two-way contingency tables.
3.2 TESTING INDEPENDENCE IN TWO-WAY CONTINGENCY TABLES
For multinomial sampling with probabilities {π_ij} in an I × J contingency table, the null hypothesis of statistical independence is H0: π_ij = π_i+π_+j for all i and j. For independent multinomial samples in the I rows, independence corresponds to homogeneity of each outcome probability among the rows. Our discussion refers to a single multinomial sample, but the same tests apply with independent multinomial samples.
3.2.1 Pearson and Likelihood-Ratio Chi-Squared Tests
In Section 1.5.2 we introduced the Pearson X² statistic (1.15) for tests about multinomial probabilities. A test of H0: independence uses X² with n_ij in place of n_i and with μ_ij = nπ_i+π_+j in place of μ_i. Here μ_ij = E(n_ij) under H0. Usually, {π_i+} and {π_+j} are unknown. Their ML estimates are the sample marginal proportions π̂_i+ = n_i+/n and π̂_+j = n_+j/n, so estimated expected frequencies are {μ̂_ij = nπ̂_i+π̂_+j = n_i+n_+j/n}. Then X² equals

        X² = Σ_i Σ_j (n_ij − μ̂_ij)²/μ̂_ij.        (3.10)
Pearson (1900, 1904, 1922) claimed that replacing {μ_ij} by estimates {μ̂_ij} would not affect the distribution of X². Since the contingency table has IJ categories, he argued that X² is asymptotically chi-squared with df = IJ − 1. On the contrary, since {μ̂_ij} require estimating {π_i+} and {π_+j}, by Section 1.5.6,

        df = (IJ − 1) − (I − 1) − (J − 1) = (I − 1)(J − 1).

The dimensions of {π_i+} and {π_+j} reflect the constraints Σ_i π_i+ = Σ_j π_+j = 1. R. A. Fisher (1922) corrected Pearson's error (see Section 16.2). His article introduced the notion of degrees of freedom. (Pearson had dealt with an indexed family of chi-squared distributions but had not dealt explicitly with "degrees of freedom.")
The score test produces the X² statistic. The likelihood-ratio test produces a different one. For multinomial sampling, the kernel of the likelihood is

        Π_i Π_j π_ij^{n_ij},   where all π_ij ≥ 0 and Σ_i Σ_j π_ij = 1.

Under H0: independence, π̂_ij = π̂_i+π̂_+j = n_i+n_+j/n². In the general case, π̂_ij = n_ij/n. The ratio of the likelihoods equals

        Λ = Π_i Π_j (n_i+n_+j)^{n_ij} / [ n^n Π_i Π_j n_ij^{n_ij} ].

The likelihood-ratio chi-squared statistic is −2 log Λ. Denoted by G², it equals

        G² = −2 log Λ = 2 Σ_i Σ_j n_ij log(n_ij/μ̂_ij),        (3.11)

where {μ̂_ij = n_i+n_+j/n}. The larger the values of G² and X², the more evidence exists against independence.
In the general case, the parameter space consists of {π_ij} subject to the linear restriction Σ_i Σ_j π_ij = 1, so the dimension is IJ − 1. Under H0, {π_ij} are determined by {π_i+} and {π_+j}, so the dimension is (I − 1) + (J − 1). The difference in these dimensions equals (I − 1)(J − 1). For large samples, G² has a chi-squared null distribution with df = (I − 1)(J − 1). So G² and X² have the same limiting null chi-squared distribution. In fact, they are then asymptotically equivalent; X² − G² converges in probability to zero (Section 14.3.4). The limiting results for multinomial sampling also hold with other sampling schemes (Roy and Mitra 1956; Watson 1959).
These results apply as n grows, and hence {μ_ij = nπ_ij} grow, for a fixed number of cells. As they grow, the multinomial distribution for {n_ij} is better
approximated by a multivariate normal, and X² and G² have more nearly chi-squared distributions. The convergence to chi-squared is quicker for X² than G². The approximation is usually poor for G² when n/IJ < 5. When I or J is large, it can be decent for X² when some expected frequencies are as small as 1 but most exceed 5. In Section 9.8.4 we provide further guidelines. Small-sample methods (Section 3.5) are available whenever it is doubtful whether n is sufficiently large.
3.2.2 Education and Religious Fundamentalism Example
Table 3.2 cross-classifies the degree of fundamentalism of subjects' religious beliefs by their highest degree of education. The table also contains the estimated expected frequencies for H0: independence. For instance, μ̂11 = n1+n+1/n = (424 × 886)/2726 = 137.8. The chi-squared statistics are X² = 69.2 and G² = 69.8, with df = (3 − 1)(3 − 1) = 4. The P-values are < 0.0001. These statistics provide extremely strong evidence of an association.
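The computations of X² in (3.10) and G² in (3.11) can be sketched as follows (our code; the counts are those of Table 3.2):

```python
from math import log

# X^2 (3.10) and G^2 (3.11) for Table 3.2 (rows: education level; columns:
# fundamentalist, moderate, liberal).
table = [[178, 138, 108],
         [570, 648, 442],
         [138, 252, 252]]

row = [sum(r) for r in table]
col = [sum(c) for c in zip(*table)]
n = sum(row)
# Estimated expected frequencies under independence: n_{i+} n_{+j} / n.
mu = [[row[i] * col[j] / n for j in range(len(col))] for i in range(len(row))]

X2 = sum((table[i][j] - mu[i][j]) ** 2 / mu[i][j]
         for i in range(len(row)) for j in range(len(col)))
G2 = 2 * sum(table[i][j] * log(table[i][j] / mu[i][j])
             for i in range(len(row)) for j in range(len(col)))
df = (len(row) - 1) * (len(col) - 1)
print(round(X2, 1), round(G2, 1), df)   # approx 69.2, 69.8, df = 4
```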
3.3 FOLLOWING-UP CHI-SQUARED TESTS
Like any significance test, chi-squared tests of independence have limited usefulness. A small P-value indicates strong evidence of association but provides little information about the nature or strength of the association. Statisticians have long warned about dangers of relying solely on results of chi-squared tests rather than studying the nature of the association (e.g., Berkson 1938; Cochran 1954). In this section we discuss ways to follow up the tests to learn more about the association.
TABLE 3.2 Education and Religious Beliefs

                                         Religious Beliefs
Highest Degree                   Fundamentalist   Moderate   Liberal   Total
Less than high school                 178            138       108       424
                                    (137.8)¹      (161.5)   (124.7)
                                     (4.5)²        (−2.6)    (−1.9)
High school or junior college         570            648       442      1660
                                    (539.5)       (632.1)   (488.4)
                                     (2.6)          (1.3)    (−4.0)
Bachelor or graduate                  138            252       252       642
                                    (208.7)       (244.5)   (188.9)
                                     (−6.8)         (0.7)     (6.3)
Total                                 886           1038       802      2726

Source: 1996 General Social Survey, National Opinion Research Center.
¹Estimated expected frequencies for testing independence; ²standardized Pearson residuals.
3.3.1 Pearson and Standardized Residuals
A cell-by-cell comparison of observed and estimated expected frequencies helps show the nature of the dependence. Under H0, larger differences (n_ij − μ̂_ij) tend to occur in cells with larger μ_ij. For Poisson sampling, for instance, the standard deviation of n_ij and hence of (n_ij − μ_ij) is √μ_ij; the standard deviation of (n_ij − μ̂_ij) is less than that of (n_ij − μ_ij) but is proportional to √μ_ij. Thus, this raw difference is insufficient. The Pearson residual, defined for a cell by

        e_ij = (n_ij − μ̂_ij)/μ̂_ij^{1/2},        (3.12)

attempts to adjust for this. Pearson residuals relate to the Pearson statistic by Σ_i Σ_j e²_ij = X².
Under H0, {e_ij} are asymptotically normal with mean 0. However, in Section 14.3.2 we show that their asymptotic variances are less than 1.0, averaging [(I − 1)(J − 1)]/(number of cells). Comparing Pearson residuals to standard normal percentage points provides conservative indications of cells having lack of fit.
A standardized Pearson residual that is asymptotically standard normal results from dividing e_ij by its standard error (Haberman 1973a; see also Section 14.3.2). For H0: independence, this is

        (n_ij − μ̂_ij)/[ μ̂_ij(1 − p_i+)(1 − p_+j) ]^{1/2}.        (3.13)

A standardized Pearson residual that exceeds about 2 or 3 in absolute value indicates lack of fit of H0 in that cell. Larger values are more relevant when df is larger, since it then becomes more likely that at least one is large simply by chance.
3.3.2 Education and Religious Fundamentalism Revisited
Table 3.2 also shows standardized Pearson residuals for testing independence. For instance, n11 = 178 and μ̂11 = 137.8. The relevant marginal proportions equal p1+ = 424/2726 = 0.156 and p+1 = 886/2726 = 0.325. The standardized Pearson residual (3.13) for this cell equals

        (178 − 137.8)/[ 137.8(1 − 0.156)(1 − 0.325) ]^{1/2} = 4.5.

This cell shows a much greater discrepancy between n11 and μ̂11 than expected if the variables were truly independent.
Table 3.2 shows large positive residuals for subjects with less than a high school education and fundamentalist views and for subjects with a bachelor's
or graduate degree and liberal views. This means that significantly more subjects were at these combinations than H0: independence predicts. Similarly, there were fewer subjects with high levels of education and fundamentalist views, and with low levels of education and liberal views, than independence predicts.
Odds ratios describe this trend. The 2 × 2 table constructed from the first and last rows and the first and last columns of Table 3.2 has a sample odds ratio of (178 × 252)/(108 × 138) = 3.0. For those with a bachelor's or graduate degree, the estimated odds of selecting liberal instead of fundamentalist were 3.0 times the estimated odds for those with less than a high school education.
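A short check of these calculations (our code, using the Table 3.2 counts quoted above):

```python
from math import sqrt

# Standardized Pearson residual (3.13) for the (1,1) cell of Table 3.2, and
# the corner odds ratio discussed above.
n11, n = 178, 2726
row1, col1 = 424, 886
mu11 = row1 * col1 / n               # estimated expected frequency, 137.8
p_row, p_col = row1 / n, col1 / n    # marginal proportions

r11 = (n11 - mu11) / sqrt(mu11 * (1 - p_row) * (1 - p_col))
theta = (178 * 252) / (108 * 138)    # first/last rows x first/last columns

print(round(r11, 1), round(theta, 1))   # 4.5 and 3.0
```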
3.3.3 Partitioning Chi-Squared
Let Z denote a standard normal random variable. Then Z² has a chi-squared distribution with df = 1. A chi-squared random variable with df = ν has representation Z1² + ⋯ + Zν², where Z1, . . . , Zν are independent standard normal variables. Thus, a chi-squared statistic having df = ν has partitionings into independent chi-squared components, for example, into ν components each having df = 1. Conversely, if X1² and X2² are independent chi-squared random variables having degrees of freedom ν1 and ν2, then X² = X1² + X2² has a chi-squared distribution with df = ν1 + ν2. Another supplement to a chi-squared test partitions its test statistic so that the components represent certain aspects of the effects. A partitioning may show that an association reflects primarily differences between certain categories or groupings of categories.
We begin with a partitioning for the test of independence in 2 × J tables. We partition G², which has df = (J − 1), into J − 1 components. The jth component is G² for a 2 × 2 table where the first column combines columns 1 through j of the full table and the second column is column j + 1. That is, G² for testing independence in a 2 × J table equals a statistic that compares the first two columns, plus a statistic that combines the first two columns and compares them to the third column, and so on, up to a statistic that combines the first J − 1 columns and compares them to the last column. (In Section 9.2.4 we justify this partitioning.) Each component statistic has df = 1.
It might seem more natural to compute G² for the (J − 1) separate 2 × 2 tables that pair each column with a particular one, say the last. However, these component statistics are not independent and do not sum to G² for the full table. (This is beyond our scope at this stage but relates to the contrasts of log probabilities that form the log odds ratios for the two tables not being orthogonal.)
For an I × J table, independent chi-squared components result from comparing columns 1 and 2 and then combining them and comparing them to column 3, and so on. Each of the J − 1 statistics has df = I − 1. More refined partitions contain (I − 1)(J − 1) statistics, each having df = 1. One such partition (Lancaster 1949) applies to the (I − 1)(J − 1) separate 2 × 2 tables

        | Σ_{a<i} Σ_{b<j} n_ab    Σ_{a<i} n_aj |
        | Σ_{b<j} n_ib            n_ij         |        (3.14)

for i = 2, . . . , I and j = 2, . . . , J. For others, see Gilula and Haberman (1998) and Goodman (1969a, 1971b).
3.3.4 Origin of Schizophrenia Example

Table 3.3 classifies a sample of psychiatrists by their school of psychiatric thought and by their opinion on the origin of schizophrenia. Here G^2 = 23.04 with df = 4. To understand this association better, we partition G^2 into four independent components. The partitioning (3.14) applies to the subtables shown in Table 3.4.

The first subtable compares the eclectic and medical schools of psychiatric thought on whether the origin of schizophrenia is biogenic or environmental, given that the classification was in one of these two categories. For this subtable, G^2 = 0.29, with df = 1. The second subtable compares these two schools on the proportion of times the origin was ascribed to be a combination, rather than biogenic or environmental. This subtable has G^2 = 1.36,
TABLE 3.3  Most Influential School of Psychiatric Thought and Ascribed Origin of Schizophrenia

                                  Origin of Schizophrenia
School of
Psychiatric Thought    Biogenic    Environmental    Combination
Eclectic                  90            12               78
Medical                   13             1                6
Psychoanalytic            19            13               50

Source: Reprinted with permission, based on data from B. J. Gallagher III, B. J. Jones, and L. P. Barakat, J. Clin. Psychol. 43: 438–443 (1987).
TABLE 3.4  Subtables Used in Partitioning Chi-Squared for Table 3.3^a

          Bio    Env                  Bio+Env    Com
Ecl        90     12        Ecl         102       78
Med        13      1        Med          14        6

          Bio    Env                  Bio+Env    Com
Ecl+Med   103     13        Ecl+Med     116       84
Psy        19     13        Psy          32       50

^a Bio, biogenic; Com, combination; Ecl, eclectic; Env, environmental; Psy, psychoanalytic.
with df = 1. The sum of these two components equals G^2 for testing independence with the first two rows of Table 3.3. There is little evidence of a difference between the eclectic and medical schools of thought on the ascribed origin of schizophrenia.

Next we combine the eclectic and medical schools and compare them to the psychoanalytic school. The third subtable in Table 3.4 compares them for the (biogenic, environmental) classification, giving G^2 = 12.95 with df = 1. The fourth subtable compares them for the (biogenic or environmental, combination) split, giving G^2 = 8.43 with df = 1.

The psychoanalytic school seems more likely than the other schools to ascribe the origins of schizophrenia as being a combination. Of those who chose either the biogenic or environmental origin, members of the psychoanalytic school were somewhat more likely than the other schools to choose the environmental origin. The sum of these four G^2 components equals the value of 23.04 for testing independence in the full table.
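As a numerical check, this sketch (illustrative Python, not from the text) forms the four subtables (3.14) for Table 3.3 and reproduces the component values reported above:

```python
from math import log

def g2(table):
    """G^2 = 2 * sum n_ij * log(n_ij / mu_ij) for a two-way table."""
    n = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    return 2 * sum(x * log(x * n / (rows[i] * cols[j]))
                   for i, row in enumerate(table)
                   for j, x in enumerate(row) if x > 0)

def lancaster(t, i, j):
    """2 x 2 collapsed subtable (3.14); i, j are 0-based indices >= 1."""
    tl = sum(t[r][c] for r in range(i) for c in range(j))
    tr = sum(t[r][j] for r in range(i))
    bl = sum(t[i][c] for c in range(j))
    return [[tl, tr], [bl, t[i][j]]]

# Table 3.3: rows Eclectic, Medical, Psychoanalytic;
# columns Biogenic, Environmental, Combination
t33 = [[90, 12, 78], [13, 1, 6], [19, 13, 50]]
comps = [g2(lancaster(t33, i, j)) for i in (1, 2) for j in (1, 2)]
print([round(c, 2) for c in comps], round(sum(comps), 2))
```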
3.3.5 Rules for Partitioning

Goodman (1968, 1969a, 1971b) and Lancaster (1949, 1969) gave rules for determining independent components of chi-squared. For forming subtables, among the necessary conditions are the following:

1. The df for the subtables must sum to df for the full table.
2. Each cell count in the full table must be a cell count in one and only one subtable.
3. Each marginal total of the full table must be a marginal total for one and only one subtable.

For a certain partitioning, when the subtable df values sum properly but the G^2 values do not, the components are not independent.

For the G^2 statistic, exact partitionings occur. The Pearson X^2 need not equal the sum of the X^2 values for the subtables. It is valid to use the X^2 statistics for the separate subtables; they simply need not provide an exact algebraic partitioning of X^2 for the full table. When the null hypotheses all hold, X^2 does have an asymptotic equivalence with G^2, however. In addition, when the table has small counts, in large-sample chi-squared tests it is safer to use X^2 to study the subtables.
3.3.6 Limitations of Chi-Squared Tests

Chi-squared tests of independence merely indicate the degree of evidence of association. They are rarely adequate for answering all questions about a data set. Rather than relying solely on results of these tests, investigate the nature of the association: Study residuals, decompose chi-squared into components, and estimate parameters such as odds ratios that describe the strength of association.

The chi-squared tests also have limitations in the types of data to which they apply. For instance, they require large samples. Also, the estimated expected frequencies {μ̂_ij = n_i+ n_+j / n} used in X^2 and G^2 depend on the marginal totals but not on the order of listing the rows and columns. Thus, X^2 and G^2 do not change value with arbitrary reorderings of rows or of columns. This implies that they treat both classifications as nominal. When at least one variable is ordinal, test statistics that utilize the ordinality are usually more appropriate. We present such tests in Section 3.4.
3.3.7 Why Consider Independence?

Any idealized structure such as independence is unlikely to hold in any given practical situation. With large samples such as in Table 3.2 it is not surprising to obtain a small P-value. Given this and the limitations just mentioned, why even bother to consider independence as a possible representation for a joint distribution? One reason refers to the benefits of model parsimony. If the independence model approximates the true probabilities well, then unless n is very large, the model-based estimates {π̂_ij = n_i+ n_+j / n^2} of cell probabilities tend to be better than the sample proportions {p_ij = n_ij / n}. The independence ML estimates smooth the sample counts, somewhat damping the random sampling fluctuations.

The mean-squared error (MSE) formula

MSE = variance + (bias)^2

explains why the independence estimators can have smaller MSE. Although they may be biased, they have smaller variance because they are based on estimating fewer parameters ({π_i+} and {π_+j} instead of {π_ij}). Hence, MSE can be smaller unless n is so large that the bias term dominates the variance.
We illustrate using Table 3.5, which has π_ij = π_i+ π_+j [1 + δ(i − 2)(j − 2)] for π_i+ = π_+j = 1/3. Here −1 < δ < 1, with δ = 0 equivalent to independence. Independence approximates the relationship well when δ is close to zero. The total MSE values of the two estimators are

$$
\mathrm{MSE}(\{p_{ij}\}) = \sum_i \sum_j E(p_{ij} - \pi_{ij})^2 = \sum_i \sum_j \mathrm{var}(p_{ij})
= \sum_i \sum_j \pi_{ij}(1 - \pi_{ij})/n = \Bigl(1 - \sum_i \sum_j \pi_{ij}^2\Bigr)\Big/ n,
$$

$$
\mathrm{MSE}(\{\hat{\pi}_{ij}\}) = \sum_i \sum_j E(\hat{\pi}_{ij} - \pi_{ij})^2.
$$

TABLE 3.5  Cell Probabilities for Comparison of Estimators

(1 + δ)/9      1/9      (1 − δ)/9
   1/9         1/9         1/9
(1 − δ)/9      1/9      (1 + δ)/9
TABLE 3.6  Comparison of Total MSE (×10,000) for Sample Proportion and Independence Estimators

         δ = 0       δ = 0.1     δ = 0.2     δ = 0.6     δ = 1.0
  n      p    π̂     p    π̂     p    π̂     p    π̂     p    π̂
 10    889  489   888  493   887  505   871  634   840  893
 50    178   91   178   95   177  110   174  261   168  565
100     89   45    89   50    89   65    87  220    84  529
500     18    9    18   14    18   28    17  186    17  500
 ∞       0    0     0    5     0   20     0  178     0  494
For Table 3.5,

$$
\mathrm{MSE}(\{p_{ij}\}) = \frac{1}{n}\left\{\frac{8}{9} - \frac{4\delta^2}{81}\right\}
$$

and rather tedious calculations yield

$$
\mathrm{MSE}(\{\hat{\pi}_{ij}\}) = \frac{1}{n}\left\{\frac{4}{9} + \frac{4}{9n}\right\}
+ \frac{4\delta^2}{81}\left\{1 - \frac{2}{n} + \frac{2}{n^2} - \frac{2}{n^3}\right\}.
$$
Table 3.6 lists the total MSE values for various δ and n. When δ = 0, MSE({p_ij}) = 8/9n, whereas MSE({π̂_ij}) ≈ 4/9n for large n. The independence estimator is then much better than the sample proportions. When the table is close to independence (δ ≈ 0) and n is not large, MSE is only about half as large for the independence estimator. When δ ≠ 0, the inconsistency of {π̂_ij} is reflected by MSE({π̂_ij}) → 4δ^2/81 [whereas MSE({p_ij}) → 0] as n → ∞. When the table is close to independence, however, the independence estimator has a lower total MSE even for moderately large n (e.g., for n = 500 when δ = 0.1).
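Both closed-form expressions can be checked against Table 3.6. A sketch (illustrative Python, not from the text), reporting values on the ×10,000 scale of the table:

```python
def mse_p(n, d):
    """Total MSE of the sample proportions {p_ij} for Table 3.5."""
    return (8/9 - 4*d**2/81) / n

def mse_indep(n, d):
    """Total MSE of the independence estimators {pi_hat_ij} for Table 3.5."""
    return (4/9 + 4/(9*n)) / n + (4*d**2/81) * (1 - 2/n + 2/n**2 - 2/n**3)

# reproduce the delta = 0.1 column of Table 3.6
for n in (10, 50, 100, 500):
    print(n, round(10000 * mse_p(n, 0.1)), round(10000 * mse_indep(n, 0.1)))
```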
3.4 TWO-WAY TABLES WITH ORDERED CLASSIFICATIONS

The X^2 and G^2 chi-squared tests ignore some information when used to test independence between ordinal classifications. When rows and/or columns are ordered, more powerful tests usually exist.
3.4.1 Linear Trend Alternative to Independence

When the row variable X and the column variable Y are ordinal, a positive or negative trend in the association is common. One approach to inference, described later in this section, uses an ordinal measure of monotone trend. A more popular analysis assigns scores to categories and summarizes the linear trend.

A test statistic that is sensitive to positive or negative linear trends utilizes correlation information. Let u_1 ≤ u_2 ≤ ··· ≤ u_I denote scores for the rows, and let v_1 ≤ v_2 ≤ ··· ≤ v_J denote column scores. The scores have the same ordering as the categories. They assign distances between categories and actually treat the measurement scale as interval, with greater distances between categories that are farther apart.

The sum Σ_i Σ_j u_i v_j n_ij weights cross-products of scores by their frequency. It relates to the covariation of X and Y. For the scores chosen, the correlation r between X and Y equals the standardization of this sum to the −1 to +1 scale (in fact, r equals this sum when both sets of scores are linearly transformed for the n subjects to have a mean of 0 and standard deviation of 1). The larger the correlation is in absolute value, the farther the data fall from independence in this linear dimension.

A statistic for testing independence against the two-sided alternative of nonzero true correlation is

$$
M^2 = (n - 1) r^2. \qquad (3.15)
$$

This statistic increases as |r| or n do. For large samples, it is approximately chi-squared with df = 1 (Mantel 1963). Large values contradict independence, so as with X^2 and G^2, the P-value is the right-tailed probability above the value observed. A small P-value does not imply that the association is linear, merely that searching for a linear component to the association helped to build power against H_0. The test treats the variables symmetrically.
3.4.2 Job Satisfaction Example Revisited

Table 2.8 showed job satisfaction and income for 96 subjects. The ordinary chi-squared statistics for testing independence are X^2 = 6.0 and G^2 = 6.8 with df = 9 (P-values = 0.74 and 0.66). These statistics show little evidence of association, but they ignore the ordering of rows and columns. With scores (1, 2, 3, 4) for job satisfaction and scores {7.5, 20, 32.5, 60} for income that approximate midpoints of categories in thousands of dollars, the correlation is r = 0.200. The linear trend test statistic is M^2 = (96 − 1)(0.200)^2 = 3.81. This shows some evidence of association (P = 0.051). The evidence is stronger for the one-sided (positive trend) alternative, using M = √(n − 1) · r = 1.95 (P = 0.026).

The nontrivial evidence of positive association may be surprising, since X^2 and G^2 have such unimpressive values. When a positive or negative trend exists, analyses designed to detect that trend can provide much smaller P-values than analyses that ignore it.
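The arithmetic is a one-liner. A sketch (illustrative Python, not from the text; with the rounded r = 0.200 the statistic comes out 3.80 rather than the 3.81 obtained from the unrounded correlation), using the identity P(χ²_1 > x) = erfc(√(x/2)):

```python
from math import erfc, sqrt

n, r = 96, 0.200           # n and (rounded) correlation from Table 2.8
M2 = (n - 1) * r**2        # M^2 = (n - 1) r^2, approx. chi-squared, df = 1
# P(chi^2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
p_two = erfc(sqrt(M2 / 2))
p_one = p_two / 2          # one-sided (positive trend), since r > 0
print(round(M2, 2), round(p_two, 3), round(p_one, 3))
```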
3.4.3 Monotone Trend Alternatives to Independence
Ordinal variables do not have a specified metric. Detecting a linear trend alternative to independence requires assigning scores to X and Y, treating them as interval variables. Alternatively, a strict ordinal analysis with the weaker alternative of monotonicity uses an ordinal measure of association, such as gamma (Section 2.4.4).

For large random samples, sample gamma has approximately a normal sampling distribution. The standard error (SE) follows from the delta method (Problem 3.27). Gamma is the basis of an ordinal test of independence using test statistic z = γ̂/SE. A confidence interval describes the strength of positive or negative monotone association.

For Table 2.8 on income and job satisfaction, in Section 2.4.5 we showed that γ̂ = 0.221. The sample has a weak tendency for job satisfaction to be higher at higher income levels. Software (e.g., PROC FREQ in SAS) reports a standard error of 0.117 for gamma. There is some evidence that γ > 0, since z = 0.221/0.117 = 1.89 (P = 0.03 for the one-sided alternative). An approximate 95% confidence interval for γ is 0.221 ± 1.96(0.117), or (−0.01, 0.45). The true association between income and job satisfaction is at best moderately positive.
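These calculations can be sketched directly (illustrative Python, not from the text; γ̂ and its SE are taken as given from the software output quoted above):

```python
from math import erfc, sqrt

gamma_hat, se = 0.221, 0.117
z = gamma_hat / se
p_one = 0.5 * erfc(z / sqrt(2))   # one-sided P(Z > z)
ci = (gamma_hat - 1.96 * se, gamma_hat + 1.96 * se)
print(round(z, 2), round(p_one, 2), tuple(round(x, 2) for x in ci))
```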
3.4.4 Extra Power with Ordinal Tests

For testing independence, X^2 and G^2 refer to the most general alternative, whereby cell probabilities exhibit any type of statistical dependence. Their df value of (I − 1)(J − 1) reflects an alternative hypothesis that has (I − 1)(J − 1) more parameters than the null hypothesis: the nonredundant odds ratios that describe the association [such as (2.10)]. These statistics are designed to detect any pattern for these parameters. In achieving this generality, they sacrifice sensitivity for detecting particular patterns.

By contrast, the analyses for ordinal row and column variables attempt to describe the association using a single parameter. For instance, M^2 uses the correlation. When a chi-squared test statistic refers to a single parameter [as M^2 and (γ̂/SE)^2 do], it has df = 1. When the association truly has a positive or negative trend, an ordinal test has a power advantage over the tests using X^2 or G^2. Since df equals the mean of the chi-squared distribution, a relatively large M^2 value with df = 1 falls farther out in its right-hand tail than a comparable value of X^2 or G^2 with df = (I − 1)(J − 1); falling farther out in the tail produces a smaller P-value. The potential discrepancy in power increases as I and J increase. In Section 6.4 we present the theory behind such a power comparison.
3.4.5 Choice of Scores

Often, it is unclear how to assign scores to statistics that require them, such as M^2. Cochran (1954) noted that "any set of scores gives a valid test, provided that they are constructed without consulting the results of the experiment. If the set of scores is poor, in that it badly distorts a numerical scale that really does underlie the ordered classification, the test will not be sensitive. The scores should therefore embody the best insight available about the way in which the classification was constructed and used." Ideally, the scale is chosen by a consensus of experts, and subsequent interpretations use that same scale.

How sensitive are analyses to the choice of scores? There is no simple answer, but different scoring systems can give quite different results (e.g., Graubard and Korn 1987). For most data sets, different choices of monotone scores give similar results. Scores that are linear transforms of each other, such as (1, 2, 3, 4) and (0, 2, 4, 6), have the same absolute correlation and hence the same M^2. Results may depend on the scores, however, when the data are highly unbalanced, with some categories having many more observations than others.
Table 3.7 illustrates the potential dependence. It refers to a prospective study of maternal drinking and congenital malformations. After the first three months of pregnancy, the women in the sample completed a questionnaire about alcohol consumption. Following childbirth, observations were recorded on the presence or absence of congenital sex organ malformations. When a variable is nominal but has only two categories, statistics that treat it as ordinal are still valid. For instance, we can artificially regard malformation as ordinal, treating "present" as "high" and "absent" as "low." With only two rows, any set of distinct row scores is a linear transformation of any other set and gives the same M^2 value. Alcohol consumption, measured as the average number of drinks per day, is an ordinal explanatory variable. This groups a naturally continuous variable, and we first use the scores {v_1 = 0, v_2 = 0.5, v_3 = 1.5, v_4 = 4.0, v_5 = 7.0}, the last score being somewhat arbitrary. For this choice, M^2 = 6.57, for which the P-value is 0.010. By contrast, for the equally spaced scores (1, 2, 3, 4, 5), M^2 = 1.83, giving a much weaker conclusion (P = 0.18).

An alternative approach uses the data to form the scores automatically, by using ranks as the category scores. All subjects in a category receive the average of the ranks that would apply for a complete ranking of the sample from 1 to n. These are called midranks. The 17,114 subjects at level 0 for
TABLE 3.7  Example for which Results Depend on Choice of Scores

                 Alcohol Consumption (average number of drinks per day)
Malformation       0        <1       1–2      3–5      ≥6
Absent          17,066   14,464      788      126      37
Present             48       38        5        1       1

Source: Reprinted with permission from the Biometric Society (Graubard and Korn 1987).
alcohol consumption share ranks 1 through 17,114. Each receives the average of these ranks, which is the midrank (1 + 17,114)/2 = 8557.5. Similarly, the midranks for the last four categories are 24,365.5, 32,013, 32,473, and 32,555.5. These scores yield M^2 = 0.35 and a weaker conclusion yet (P = 0.55).
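The score sensitivity is easy to reproduce. This sketch (illustrative Python, not from the text) computes M^2 for Table 3.7 under the three scorings, deriving the midranks from the column totals; row scores (0, 1) suffice, since any distinct pair gives the same M^2:

```python
from math import sqrt

# Table 3.7: malformation (absent, present) by alcohol consumption
absent  = [17066, 14464, 788, 126, 37]
present = [48, 38, 5, 1, 1]
table = [absent, present]

def m2(u, v, table):
    """M^2 = (n - 1) r^2 for row scores u and column scores v."""
    n = sum(map(sum, table))
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    ubar = sum(u[i] * rows[i] for i in range(len(u))) / n
    vbar = sum(v[j] * cols[j] for j in range(len(v))) / n
    sxy = sum((u[i] - ubar) * (v[j] - vbar) * table[i][j]
              for i in range(len(u)) for j in range(len(v)))
    sxx = sum((u[i] - ubar) ** 2 * rows[i] for i in range(len(u)))
    syy = sum((v[j] - vbar) ** 2 * cols[j] for j in range(len(v)))
    r = sxy / sqrt(sxx * syy)
    return (n - 1) * r ** 2

# midranks: average rank within each alcohol category
cols = [a + p for a, p in zip(absent, present)]
midranks, start = [], 0
for c in cols:
    midranks.append(start + (1 + c) / 2)
    start += c

print(round(m2((0, 1), (0, 0.5, 1.5, 4, 7), table), 2))  # book: 6.57
print(round(m2((0, 1), (1, 2, 3, 4, 5), table), 2))      # book: 1.83
print(round(m2((0, 1), midranks, table), 2))             # book: 0.35
```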
Why does this happen? Adjacent categories having relatively few observations necessarily have similar midranks. The midranks are similar for the final three categories, since those categories have few observations compared with the first two categories. This scoring scheme treats alcohol consumption level 1–2 drinks (category 3) as much closer to consumption level ≥6 drinks (category 5) than to consumption level 0 drinks (category 1). This seems inappropriate. It is usually better to select scores that reflect distances between categories. When uncertain about this choice, perform a sensitivity analysis: select two or three sensible choices and check whether the results are similar. Equally spaced scores often provide a reasonable compromise when the category labels do not suggest obvious choices, such as the categories (liberal, moderate, conservative) for political philosophy.

When X and Y are both ordinal and M^2 uses midrank scores, the correlation on which M^2 is based is called Spearman's rho.
3.4.6 Trend Tests for I × 2 and 2 × J Tables

When I or J equals 2, the tests based on linear or monotone trend simplify to well-established procedures. With binary X, 2 × J tables occur in comparisons of two groups, such as when the rows represent two treatments. Using scores {u_1 = 0, u_2 = 1} for the levels of X, the covariation measure Σ_i Σ_j u_i v_j n_ij in M^2 simplifies to Σ_j v_j n_2j. This term sums the scores on Y for all subjects in row 2. Divided by the number of subjects in row 2, it gives the mean score for that row. In fact, M^2 is then directed toward detecting differences between the two row means of the scores on Y.

With midrank scores for Y, the test using M^2 for 2 × J tables is sensitive to differences in mean ranks for the two rows. This test is called the Wilcoxon or Mann–Whitney test. Most nonparametric statistics textbooks present this test for fully ranked response data, whereas the 2 × J table is an extended case in which sets of subjects in the same category of Y are tied and use midranks. The large-sample version of that nonparametric test uses a standard normal z statistic. The square of the statistic is equivalent to M^2, using arbitrary row scores and midranks for the columns. It is also asymptotically equivalent to test statistics based on the numbers of concordant and discordant pairs, such as the one using gamma.

When Y has two levels, the table has size I × 2. The linear trend statistic then refers to a linear trend in the probability of either response category, such as the probability of malformation as a function of alcohol consumption. The test in that case, often called the Cochran–Armitage trend test, is presented in Section 5.3.5.
3.4.7 Nominal–Ordinal Tables

The tests using the correlation or gamma are appropriate when both classifications are ordinal. When one is nominal with more than two categories, other statistics are needed. One is based on summarizing the variation among means on the ordinal variable in the various categories of the nominal variable. We defer discussion of this case to Section 7.5.3, Note 3.6, and Problem 3.28.
3.5 SMALL-SAMPLE TESTS OF INDEPENDENCE

The inferential methods of the preceding four sections are large-sample methods. When n is small, alternative methods use exact small-sample distributions rather than large-sample approximations. In this section we describe small-sample tests of independence, starting with one that R. A. Fisher proposed for 2 × 2 tables.
3.5.1 Fisher's Exact Test for 2 × 2 Tables

In Section 3.5.7 we show that a distribution not depending on unknown parameters results from conditioning on the marginal totals of the contingency table. These are usually not naturally fixed. For Poisson sampling nothing is fixed, for multinomial sampling only n is fixed, and for independent binomial sampling in the two rows only the row marginal totals are fixed. In any of these cases, under H_0: independence, conditioning on both sets of marginal totals yields the hypergeometric distribution

$$
p(t) = P(n_{11} = t) = \frac{\dbinom{n_{1+}}{t}\dbinom{n_{2+}}{n_{+1} - t}}{\dbinom{n}{n_{+1}}}. \qquad (3.16)
$$

This formula expresses the distribution of {n_ij} in terms of only n_11. Given the marginal totals, n_11 determines the other three cell counts. The range of possible values for n_11 is m_− ≤ n_11 ≤ m_+, where m_− = max(0, n_1+ + n_+1 − n) and m_+ = min(n_1+, n_+1).

For 2 × 2 tables, independence is equivalent to the odds ratio θ = 1. To test H_0: θ = 1, the P-value is the sum of certain hypergeometric probabilities. To illustrate, consider H_a: θ > 1. For the given marginal totals, tables having larger n_11 have larger sample odds ratios and hence stronger evidence in favor of H_a. Thus, the P-value equals P(n_11 ≥ t_o), where t_o denotes the observed value of n_11. This test for 2 × 2 tables is called Fisher's exact test (Fisher 1934, 1935a,c; Irwin 1935; Yates 1934).
3.5.2 Fisher's Tea Drinker

R. A. Fisher (1935a) described the following experiment from his days at Rothamsted Experiment Station, an agricultural research lab north of London. Muriel Bristol, a colleague of Fisher's, claimed that when drinking tea she could distinguish whether milk or tea was added to the cup first (she preferred milk first). To test her claim, Fisher asked her to taste eight cups of tea, four of which had milk added first and four of which had tea added first. She knew there were four cups of each type and had to predict which four had the milk added first. The order of presenting the cups to her was randomized.

Table 3.8 shows a possible result. Distinguishing the order of pouring better than with pure guessing corresponds to θ > 1, reflecting a positive association between order of pouring and the prediction. We conduct Fisher's exact test of H_0: θ = 1 against H_a: θ > 1.

The experimental design fixed both marginal distributions, since Dr. Bristol had to predict which four cups had milk added first. Thus, the hypergeometric applies naturally for the null distribution of n_11. The P-value for Fisher's exact test is the null probability of Table 3.8 and of tables having even more evidence in favor of her claim. The observed table, t_o = 3 correct choices of the cups having milk added first, has null probability

$$
\frac{\dbinom{4}{3}\dbinom{4}{1}}{\dbinom{8}{4}} = 0.229.
$$

The only table that is more extreme in the direction of H_a has n_11 = 4 correct. It has a probability of 0.014. The P-value is P(n_11 ≥ 3) = 0.243. This result does not establish an association between the actual order of pouring and her predictions. It is difficult to do so with such a small sample. According to Fisher's daughter (Box 1978, p. 134), in reality Bristol did convince Fisher of her ability.
TABLE 3.8  Fisher's Tea Tasting Experiment

                  Guess Poured First
Poured First     Milk    Tea    Total
Milk               3      1       4
Tea                1      3       4
Total              4      4

Source: Based on experiment described by Fisher (1935a).
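The hypergeometric computation (3.16) can be sketched in a few lines (illustrative Python, not from the text):

```python
from math import comb

def hyper_p(t, n1p, n2p, np1):
    """Null P(n11 = t) from the hypergeometric formula (3.16)."""
    return comb(n1p, t) * comb(n2p, np1 - t) / comb(n1p + n2p, np1)

# Tea-tasting data: row totals 4 and 4, first-column total 4
p3 = hyper_p(3, 4, 4, 4)
p4 = hyper_p(4, 4, 4, 4)
print(round(p3, 3), round(p4, 3), round(p3 + p4, 3))  # 0.229 0.014 0.243
```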
3.5.3 Two-Sided P-Values for Fisher's Exact Test

For the one-sided alternative, the same P-value results using tables ordered according to larger n_11, larger odds ratio, or larger difference of proportions (Davis 1986a). For the two-sided alternative, different criteria can have different P-values.

For a two-sided P-value, a popular approach sums P(n_11 = t) in (3.16) for counts t such that p(t) ≤ p(t_o); that is, the P-value is P = P[p(n_11) ≤ p(t_o)] for the observed value t_o. Another possibility sums p(t) for tables that are farther from H_0; that is,

$$
P = P\bigl[\,|n_{11} - E(n_{11})| \ge |t_o - E(n_{11})|\,\bigr],
$$

where the hypergeometric mean is E(n_11) = n_1+ n_+1 / n. This is identical to P(X^2 ≥ X_o^2) for observed Pearson statistic X_o^2. A third approach takes P = 2 min[P(n_11 ≥ t_o), P(n_11 ≤ t_o)], but this can exceed 1. A fourth approach takes P = min[P(n_11 ≥ t_o), P(n_11 ≤ t_o)] plus an attainable probability in the other tail that is as close as possible to, but not greater than, that one-tailed probability.

Each approach has advantages and disadvantages (Blaker 2000; Davis 1986a; Dupont 1986; Lloyd 1988b; Mantel 1987b; Yates and discussants 1984). They can provide different results because of the discreteness and potential skewness. The approach of ordering tables by a distance measure from H_0, such as X^2, extends naturally to I × J tables.

In practice, two-sided tests are much more common than one-sided. Partly this is so that researchers can avoid charges of bias in giving evidence that supports their predicted direction for an effect. To conduct a test of size 0.05 when one truly believes that the effect has a particular direction, it is safest to conduct the one-sided test at the 0.025 level to guard against criticism. For instance, in the 1998 document Biostatistical Principles for Clinical Trials, the International Conference on Harmonization (ICH E9) stated: "The approach of setting type I errors for one-sided tests at half the conventional type I error used in two-sided tests is preferable in regulatory settings. This promotes consistency with two-sided confidence intervals that are generally appropriate for estimating the possible size of the difference between two treatments."
3.5.4 Discreteness and Conservatism Issues

The hypergeometric distribution (3.16) is highly discrete for small samples, as n_11 and hence the P-value can assume relatively few values. It is usually not possible to achieve a fixed significance level (size) such as 0.05.

In the tea-tasting experiment, for instance, n_11 can equal only 4, 3, 2, 1, or 0. The one-sided P-values are restricted to 0.014, 0.243, 0.757, 0.986, and 1.0. If one rejects H_0 when the P-value does not exceed 0.05, then 0.05 is not the probability of type I error. Only the P-value of 0.014 does not exceed 0.05; thus, when H_0 is true, the probability of falsely rejecting it is 0.014, not 0.05. In this sense, the traditional approach to hypothesis testing is conservative: The true probability of type I error is less than the nominal level.

It is possible to achieve any fixed significance level by data-unrelated randomization on the boundary of the critical region, in deciding whether to reject H_0. For the tea-tasting experiment, suppose that we reject H_0 when n_11 = 4, we reject H_0 with probability 0.157 when n_11 = 3, and we do not reject H_0 otherwise; that is, when n_11 = 3, we generate a uniform random variable U over [0, 1] and reject H_0 if U < 0.157. For expectation taken with respect to the null hypergeometric distribution of n_11, the significance level equals

$$
P(\text{reject } H_0) = E\bigl[P(\text{reject } H_0 \mid n_{11})\bigr]
= 1.0(0.014) + 0.157(0.229) + 0.0 \times P(n_{11} \le 2) = 0.05.
$$

With the randomization extension, Tocher (1950) showed that Fisher's test is uniformly most powerful unbiased (UMPU).

In practice, randomization having nothing to do with the data is unacceptable. We recommend simply reporting the P-value. To reduce conservativeness, report the mid-P-value (Section 1.4.5). The test is no longer guaranteed to have true P(type I error) no greater than the nominal value, but in practice it is rarely much greater. For the one-sided test with the tea-tasting data,

$$
\text{mid-}P\text{-value} = \tfrac{1}{2}P(n_{11} = 3) + P(n_{11} > 3) = 0.129.
$$
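Both quantities follow from the same hypergeometric probabilities (illustrative Python, not from the text):

```python
from math import comb

def hyper_p(t):
    """Null P(n11 = t) for the tea-tasting margins (all totals equal 4)."""
    return comb(4, t) * comb(4, 4 - t) / comb(8, 4)

# size of the randomized test: always reject at n11 = 4,
# reject with probability 0.157 at n11 = 3, never otherwise
size = 1.0 * hyper_p(4) + 0.157 * hyper_p(3)
mid_p = 0.5 * hyper_p(3) + hyper_p(4)
print(round(size, 2), round(mid_p, 3))  # 0.05 0.129
```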
3.5.5 Small-Sample Unconditional Test of Independence*

A common sampling assumption for analyses comparing two groups on a binary response is that the rows are independent binomial samples. Then, only {n_i+} are naturally fixed. For Poisson and multinomial sampling schemes, neither marginal distribution is fixed. For such cases it may seem artificial to condition on both sets of marginal counts. An alternative small-sample test, designed for independent binomial samples, conditions on only the row totals.

Under binomial sampling with parameter π_i in row i, consider testing H_0: π_1 = π_2 using some test statistic T, such as the Pearson X^2. For fixed {n_i+}, T can take a discrete set of values, one of which is the observed value t_o. Given π_1 = π_2 = π, the P-value is P_π(T ≥ t_o), calculated using the product of the two binomial probability mass functions. This is the sum of the product binomial probabilities for those pairs of binomial samples that have T ≥ t_o. Since π is unknown, the actual P-value is defined as

$$
P = \sup_{0 \le \pi \le 1} P_\pi(T \ge t_o).
$$
This is an unconditional small-sample test of independence. Like Fisher's exact test, the true size is no greater than the nominal value (e.g., if we reject when P ≤ 0.05, the actual P(type I error) is no greater than 0.05).

We illustrate using test statistic X^2 for the 2 × 2 table having entries (3, 0 / 0, 3), by row, with fixed row totals (3, 3) as binomial sample sizes. The sample X^2 = 6.0. This X^2 value for the observed table and for table (0, 3 / 3, 0) is the maximum possible. For a given value π for π_1 = π_2, the probability of the first table is [π^3(1 − π)^0][π^0(1 − π)^3] = π^3(1 − π)^3 (3 successes and 0 failures in the first row and 0 successes and 3 failures in the second), the product of two binomial probabilities. Similarly, the probability of the second table is (1 − π)^3 π^3. Thus, the P-value is P(X^2 ≥ 6) = 2π^3(1 − π)^3, the sum of the product binomial probabilities for those two tables. The supremum of this over 0 ≤ π ≤ 1 occurs at π = 1/2, giving overall P-value equal to 2(0.5)^3(0.5)^3 = 0.031. By contrast, the two-sided Fisher's exact test has P-value equal to $2\binom{3}{0}\binom{3}{3}\big/\binom{6}{3} = 0.100$.

Barnard (1945, 1947) first proposed an unconditional test comparing binomial parameters, although he later (1949) refuted it in favor of Fisher's exact test. Several authors have since proposed related tests (e.g., Haber 1986; Suissa and Shuster 1985).
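The supremum can be located by a grid search over the nuisance parameter (a sketch in illustrative Python; for this table the P-value function 2π^3(1 − π)^3 is known in closed form, so the grid merely confirms the maximum at π = 1/2):

```python
from math import comb

# table (3, 0 / 0, 3): under H0, P(X^2 >= 6) = 2 * pi^3 * (1 - pi)^3
def pvalue(pi):
    return 2 * pi**3 * (1 - pi)**3

sup_p = max(pvalue(i / 10000) for i in range(10001))

# two-sided Fisher's exact test for the same table
fisher_p = 2 * comb(3, 0) * comb(3, 3) / comb(6, 3)
print(round(sup_p, 3), round(fisher_p, 3))  # 0.031 0.1
```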
3.5.6 Conditional versus Unconditional Tests*

Since Barnard introduced the unconditional test, statisticians have debated the proper way to conduct small-sample analyses of 2 × 2 tables. Fisher criticized the unconditional approach, arguing that possible samples with quite different numbers of successes than observed were not relevant. In Fisher's (1945) view, ". . . the existence of these less informative possibilities should not affect our judgment of significance based on the series actually observed. . . . The fact that such an unhelpful outcome as these might occur . . . is surely no reason for enhancing our judgment of significance in cases where it has not occurred; . . . it is only the sampling distribution of samples of the same type that can supply a rational test of significance." Sprott (2000, Sec. 6.4.4) recently provided a similar argument.

An adaptation of the unconditional approach by Berger and Boos (1994) addresses this criticism somewhat. They took the supremum for the P-value over a confidence interval of values for the nuisance parameter rather than over all possible values. Their unconditional P-value is

$$
P = \sup_{\pi \in C_\gamma} P_\pi(T \ge t_o) + \gamma,
$$

where C_γ is a 100(1 − γ)% confidence interval for π. Here, γ is taken to be very small (e.g., 0.001), and the test maintains the guaranteed upper bound on size.
INFERENCE FOR CONTINGENCY TABLES
Other arguments in favor of conditioning on both sets of marginal totals are that the conditional approach provides a simple way to eliminate nuisance parameters in a variety of problems (e.g., generalizing to other contingency table problems), and that the margins contain little information about the association (Haber 1989; Yates 1984). Zhu and Reid (1994) noted that some information loss occurs in conditioning on the margins except when $\theta = 1$.
Arguments against conditioning partly concern the increased discreteness that occurs. The few possible values for $n_{11}$ make it difficult to obtain a small P-value. In repeated use with a nominal significance level, the actual type I error probability may be much smaller than the nominal value and the power may suffer. Finally, for inference about nonnull values (e.g., confidence intervals), we will see that the conditional approach applies only with the odds ratio and not other measures.
The conservatism problem is partly unavoidable. Statistics having discrete
distributions are necessarily conservative in terms of achieving nominal
significance levels. Because an unconditional test fixes only one margin,
however, it has many more tables in the reference set for its sampling
distribution. That distribution is less discrete, and a richer array of possible
P-values occurs than with Fisher’s exact test. An unconditional test tends to
be less conservative and more powerful than Fisher’s exact test. A disadvantage is that computations are very intensive for more complex problems, such
as larger tables.
If a table truly has two independent binomial samples, the unconditional approach seems sensible. See Kempthorne (1979) for a cogent argument. The conditional approach is useful for other cases. In a randomized clinical trial, a convenience sample of n subjects is randomly allocated to two treatments. The samples are not binomials, as they are not random samples from two populations of interest. One could focus on the sample alone and consider the probability of a result at least as extreme as observed if there truly is no treatment effect. For instance, out of all possible ways of choosing $n_{1+}$ of the n subjects for treatment 1, for what proportion would $n_{11}$ be at least as large as observed? Under the null hypothesis of no treatment effect, the same overall response distribution $(n_{+1}, n_{+2})$ of successes and failures occurs regardless of the allocation of subjects to treatments. Thus, the column margin is also naturally fixed. This argument leads to hypergeometric null probabilities and Fisher's exact test (Greenland 1981). This argument does not extend, however, to nonnull effect values and hence to confidence intervals.
When both sets of marginal totals are naturally fixed, such as in Table 3.8,
the high degree of discreteness is unavoidable and Fisher’s exact test is the
best procedure. Regardless of which margins are naturally fixed, using the
mid-P-value helps reduce conservative effects of discreteness.
3.5.7 Derivation of Exact Conditional Distribution*
We now show how the conditional test for independence yields the hypergeometric distribution. We do this for I × J tables, since we next discuss extensions of Fisher's exact test for them. We assume independent multinomial sampling within rows, as often applies in comparing I treatment groups. Then row totals $\{n_{i+}\}$ are fixed, and we estimate the I conditional distributions $\{\pi_{j|i},\ j = 1, \ldots, J\}$. Under $H_0$: independence, $\pi_{j|1} = \pi_{j|2} = \cdots = \pi_{j|I} = \pi_{+j}$ for $j = 1, \ldots, J$. The product of the I multinomial probability functions then simplifies to
$$\prod_i \left( \frac{n_{i+}!}{\prod_j n_{ij}!} \prod_j \pi_{j|i}^{n_{ij}} \right) = \frac{\left(\prod_i n_{i+}!\right)\left(\prod_j \pi_{+j}^{n_{+j}}\right)}{\prod_i \prod_j n_{ij}!}. \qquad (3.17)$$
This distribution for $\{n_{ij}\}$ depends on $\{\pi_{+j}\}$. These are nuisance parameters, since they do not describe the association. Fisher introduced the standard way of eliminating nuisance parameters, by conditioning on their sufficient statistics. From the definition of sufficiency, the resulting conditional distribution does not depend on those parameters.

The contribution of $\{\pi_{+j}\}$ to the product multinomial distribution (3.17) depends on the data only through $\{n_{+j}\}$, which are their sufficient statistics. The $\{n_{+j}\}$ have the multinomial $(n, \{\pi_{+j}\})$ distribution, namely
$$\frac{n!}{\prod_j n_{+j}!} \prod_j \pi_{+j}^{n_{+j}}. \qquad (3.18)$$
The joint probability function of $\{n_{ij}\}$ and $\{n_{+j}\}$ is identical to the probability function of $\{n_{ij}\}$, since $\{n_{ij}\}$ determines $\{n_{+j}\}$. Thus, the probability function of $\{n_{ij}\}$, conditional on $\{n_{+j}\}$, equals the probability function (3.17) of $\{n_{ij}\}$ divided by the probability function (3.18) evaluated at $\{n_{+j}\}$, or
$$\frac{\left(\prod_i n_{i+}!\right)\left(\prod_j n_{+j}!\right)}{n!\,\prod_i \prod_j n_{ij}!}. \qquad (3.19)$$
This is the multiple hypergeometric distribution. It applies to the set of $\{n_{ij}\}$ having the same $\{n_{i+}\}$ and $\{n_{+j}\}$ as the observed table. For 2 × 2 tables, it is the hypergeometric distribution (3.16).
When a table has a single multinomial sample, the unknown parameters are $\{\pi_{ij}\}$. For testing independence ($\pi_{ij} = \pi_{i+}\pi_{+j}$ for all i and j), distribution (3.19) results from conditioning on the row and column totals. These are sufficient statistics for $\{\pi_{i+}\}$ and $\{\pi_{+j}\}$, which determine the null distribution. For either sampling model, both sets of margins are fixed after the conditioning. The end result (3.19) does not depend on unknown parameters and thus permits exact probability calculations.
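As a numerical illustration of (3.19), the following Python sketch (our addition, not part of the text) computes the multiple hypergeometric probability of a table from its margins and checks that the probabilities of all tables sharing given margins sum to 1:

```python
from math import factorial, prod

def multiple_hypergeom(table):
    """Probability (3.19) of an I x J table {n_ij} given its margins:
    (prod_i n_i+!)(prod_j n_+j!) / (n! prod_i prod_j n_ij!)."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    num = prod(factorial(r) for r in row) * prod(factorial(c) for c in col)
    den = factorial(n) * prod(factorial(x) for r in table for x in r)
    return num / den

# Sanity check: for 2 x 2 tables with margins (3, 3) and (3, 3),
# n11 ranges over 0..3 and the probabilities must sum to 1.
tables = [[[t, 3 - t], [3 - t, t]] for t in range(4)]
total = sum(multiple_hypergeom(tb) for tb in tables)
print(total)   # should be (numerically) 1
```

For a 2 × 2 table this reproduces the ordinary hypergeometric probabilities of Fisher's exact test.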
3.5.8 Exact Tests of Independence for I × J Tables*
Exact tests for I × J tables utilize the multiple hypergeometric distribution. Freeman and Halton (1951) defined the P-value as the probability of the set of tables with the given margins that are no more likely to occur than the table observed. Other exact tests order the tables using a statistic describing distance from $H_0$. Yates (1934) used $X^2$. The P-value is then the null value of $P(X^2 \ge X_o^2)$ for observed value $X_o^2$. When classifications have ordered categories, an ordinal statistic is more relevant. For the alternative hypothesis of a positive association, we could use $P(T \ge t_o)$, where T is the correlation or gamma and where $t_o$ denotes its observed value.

TABLE 3.9 Example for Exact Conditional Test

Smoking Level                        Myocardial
(cigarettes/day)       Control       Infarction
0                         25               0
1-24                      25               1
>= 25                     12               3

Source: Reprinted with permission, based on Table 5 in S. Shapiro et al., Lancet 743-746 (1979).

We illustrate an exact test for ordered categories with Table 3.9, which cross-classifies level of smoking and myocardial infarction for a sample of young women in a case-control study. The second row contains small counts, and large-sample tests may be inappropriate. Given the marginal counts, the only table having greater evidence of positive association between smoking and myocardial infarction has counts (25, 26, 11) for row 1 and (0, 0, 4) in row 2. Conditional on both sets of margins, the null probability of the observed table and this more extreme table [based on formula (3.19)] equals 0.018. Although the sample contains only four myocardial infarction patients, evidence exists of a positive association. The evidence is stronger than using $X^2$, which ignores the ordering of categories. The exact $P(X^2 \ge X_o^2) = P(X^2 \ge 6.96) = 0.052$.
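The 0.018 figure can be verified directly from (3.19). A small Python check (ours; Table 3.9 is transposed so that rows are control/infarction, matching the discussion):

```python
from math import factorial, prod

def cond_prob(table):
    """Multiple hypergeometric probability (3.19) of a table given its margins."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    num = prod(factorial(x) for x in row + col)
    den = factorial(n) * prod(factorial(x) for r in table for x in r)
    return num / den

# Table 3.9: row 1 = control, row 2 = myocardial infarction;
# columns = smoking level (0, 1-24, >=25)
observed = [[25, 25, 12],
            [0,  1,  3]]
# The only table with the same margins showing stronger positive association
more_extreme = [[25, 26, 11],
                [0,  0,  4]]
p_value = cond_prob(observed) + cond_prob(more_extreme)
print(round(p_value, 3))   # 0.018
```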
Special algorithms and software for computing exact tests for I × J tables are widely available (e.g., Mehta and Patel 1983; see also Appendix A). We recommend these tests when asymptotic approximations may be invalid. Computing time increases exponentially as n, I, or J increases. However, one can use Monte Carlo methods to sample randomly from the set of tables with the given margins. The estimated P-value is then the sample proportion of tables having test statistic value at least as large as the value observed.

As I and/or J increases, the number of possible values for any test statistic T tends to increase. Thus, the conservativeness issue for conditional tests becomes less problematic.
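A minimal Python sketch of the Monte Carlo idea (ours, not from the text), applied for concreteness to Fisher's one-sided test for the tea-tasting table of Section 3.5, where permuting the column labels is equivalent to sampling tables with the given margins:

```python
import random
from math import comb
random.seed(1)

# Tea-tasting table (3,1 / 1,3): 8 cups, 4 in each row and column.
# Permute column labels and count how often n11 >= observed value.
n11_obs = 3
pool = [1] * 4 + [0] * 4      # column labels: four "column 1", four "column 2"
B = 20000
count = 0
for _ in range(B):
    random.shuffle(pool)
    n11 = sum(pool[:4])       # first four positions form row 1
    if n11 >= n11_obs:
        count += 1
p_hat = count / B

# Exact one-sided Fisher P-value for comparison
p_exact = (comb(4, 3) * comb(4, 1) + comb(4, 4) * comb(4, 0)) / comb(8, 4)
print(round(p_hat, 3), round(p_exact, 3))   # both near 0.243
```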
3.6 SMALL-SAMPLE CONFIDENCE INTERVALS FOR 2 × 2 TABLES*
Small-sample methods also apply to estimation. Exact distributions depending only on the parameter of interest result from the same arguments. These distributions are the basis of confidence intervals for measures such as the odds ratio.
3.6.1 Small-Sample Inference for the Odds Ratio
For multinomial sampling, the distribution of $\{n_{ij}\}$ depends on n and cell probabilities $\{\pi_{ij}\}$. For 2 × 2 tables, the odds ratio is
$$\theta = \frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}} = \frac{\pi_{11}(1 - \pi_{1+} - \pi_{+1} + \pi_{11})}{(\pi_{1+} - \pi_{11})(\pi_{+1} - \pi_{11})}.$$
Hence, $\pi_{11}$ is a function of $\theta$ and $\{\pi_{1+}, \pi_{+1}\}$. The same argument applies to any $\pi_{ij}$, so the multinomial distribution of $\{n_{ij}\}$ can use parameters $\{\theta, \pi_{1+}, \pi_{+1}\}$. Conditional on $\{n_{1+}, n_{+1}\}$, the distribution of $\{n_{ij}\}$ depends only on $\theta$. Since $n_{11}$ determines all other cell counts, given the marginal totals, the conditional distribution of $\{n_{ij}\}$ is specified by some function $P(n_{11} = t) = f(t; n_{1+}, n_{+1}, n, \theta)$. This distribution (Fisher 1935c) is the noncentral hypergeometric,
$$f(t; n_{1+}, n_{+1}, n, \theta) = \frac{\dbinom{n_{1+}}{t}\dbinom{n - n_{1+}}{n_{+1} - t}\,\theta^t}{\sum_{u=m_-}^{m_+} \dbinom{n_{1+}}{u}\dbinom{n - n_{1+}}{n_{+1} - u}\,\theta^u} \qquad (3.20)$$
for $m_- \le t \le m_+$, where $m_- = \max(0,\, n_{1+} + n_{+1} - n)$ and $m_+ = \min(n_{1+}, n_{+1})$.
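For reference, a direct Python transcription of (3.20) (ours, not from the text), with a check that at $\theta = 1$ it reduces to the central hypergeometric of Fisher's exact test:

```python
from math import comb

def noncentral_hypergeom(t, n1p, np1, n, theta):
    """P(n11 = t) from (3.20) for a 2 x 2 table with row margins
    (n1p, n - n1p), column margins (np1, n - np1), and odds ratio theta."""
    lo = max(0, n1p + np1 - n)
    hi = min(n1p, np1)
    if not lo <= t <= hi:
        return 0.0
    denom = sum(comb(n1p, u) * comb(n - n1p, np1 - u) * theta**u
                for u in range(lo, hi + 1))
    return comb(n1p, t) * comb(n - n1p, np1 - t) * theta**t / denom

# At theta = 1, this is the central hypergeometric: for the tea-tasting
# margins (4, 4) with n = 8, P(n11 = 3) = C(4,3)C(4,1)/C(8,4) = 16/70
p = noncentral_hypergeom(3, 4, 4, 8, 1.0)
print(round(p, 4))   # 0.2286
```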
A confidence interval for $\theta$ results from inverting the test of $H_0\colon \theta = \theta_0$, having observed $n_{11} = t_o$. For $H_a\colon \theta > \theta_0$, the P-value is
$$P = \sum_{t \ge t_o} f(t; n_{1+}, n_{+1}, n, \theta_0).$$
For testing against $H_a\colon \theta < \theta_0$,
$$P = \sum_{t \le t_o} f(t; n_{1+}, n_{+1}, n, \theta_0).$$
When $\theta_0 = 1$, these are one-sided Fisher's exact tests. Cornfield (1956) constructed a confidence interval using the tail method. The lower endpoint is the $\theta_0$ for which $P = \alpha/2$ in testing against $H_a\colon \theta > \theta_0$. The upper endpoint is the $\theta_0$ for which $P = \alpha/2$ for $H_a\colon \theta < \theta_0$. The interval is the set of $\theta_0$ for which both one-sided P-values are $\ge \alpha/2$.
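The tail method is straightforward to implement by bisection, since $P(T \ge t_o; \theta)$ is increasing and $P(T \le t_o; \theta)$ is decreasing in $\theta$. A Python sketch (ours, not from the text), applied to the tea-tasting data of Table 3.8:

```python
from math import comb

def pmf(t, n1p, np1, n, theta):
    # noncentral hypergeometric probability, formula (3.20)
    lo, hi = max(0, n1p + np1 - n), min(n1p, np1)
    denom = sum(comb(n1p, u) * comb(n - n1p, np1 - u) * theta**u
                for u in range(lo, hi + 1))
    return comb(n1p, t) * comb(n - n1p, np1 - t) * theta**t / denom

def tail_interval(t_o, n1p, np1, n, alpha=0.05):
    lo, hi = max(0, n1p + np1 - n), min(n1p, np1)
    upper_tail = lambda th: sum(pmf(t, n1p, np1, n, th) for t in range(t_o, hi + 1))
    lower_tail = lambda th: sum(pmf(t, n1p, np1, n, th) for t in range(lo, t_o + 1))

    def solve(f, increasing):
        # bisection on a log scale for the theta with f(theta) = alpha/2
        a, b = 1e-6, 1e6
        for _ in range(100):
            m = (a * b) ** 0.5
            if (f(m) < alpha / 2) == increasing:
                a = m
            else:
                b = m
        return m

    # P(T >= t_o) increases in theta; P(T <= t_o) decreases in theta
    return solve(upper_tail, True), solve(lower_tail, False)

low, high = tail_interval(3, 4, 4, 8)   # tea-tasting data, n11 = 3
print(round(low, 2), round(high, 1))    # roughly (0.2, 626), as in Section 3.6.2
```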
As in Fisher's exact test, the conditional approach to interval estimation is necessarily conservative because of discreteness. The actual confidence coefficient, defined as the infimum of the coverage probabilities for all possible $\theta$, has the nominal confidence level as a lower bound. Less conservative behavior and shorter intervals result from inverting a single two-sided test rather than inverting two one-sided tests (Agresti and Min 2001; Baptista and Pike 1977). An alternative approach with independent binomial samples inverts nonnull unconditional small-sample tests. Because of the reduced discreteness, such intervals are also usually shorter.
The conditional ML estimate of $\theta$ is the value of $\theta$ that maximizes probability (3.20). Differentiating the log likelihood with respect to $\theta$ shows that this estimate satisfies the equation $n_{11} = E(n_{11})$, where the expectation refers to distribution (3.20). This equation has a unique solution $\hat\theta$ and is solved using iterative methods (Cornfield 1956). This estimator differs from the unconditional ML estimator $\hat\theta = n_{11}n_{22}/n_{12}n_{21}$, which uses the ML estimates of $\{\pi_{ij}\}$ for the multinomial distribution of $\{n_{ij}\}$. Using statistical software, we can calculate conditional ML estimates and small-sample confidence intervals for odds ratios (e.g., for SAS, see Table A.2).
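A sketch of the iterative solution (ours, not from the text), again for the tea-tasting data, where the unconditional ML estimate is $3 \times 3/(1 \times 1) = 9$:

```python
from math import comb

def pmf(t, n1p, np1, n, theta):
    # noncentral hypergeometric probability, formula (3.20)
    lo, hi = max(0, n1p + np1 - n), min(n1p, np1)
    denom = sum(comb(n1p, u) * comb(n - n1p, np1 - u) * theta**u
                for u in range(lo, hi + 1))
    return comb(n1p, t) * comb(n - n1p, np1 - t) * theta**t / denom

def conditional_mle(t_o, n1p, np1, n):
    """Solve n11 = E(n11; theta) for theta, where E refers to (3.20).
    E(n11; theta) is increasing in theta, so bisection applies."""
    lo, hi = max(0, n1p + np1 - n), min(n1p, np1)
    expected = lambda th: sum(t * pmf(t, n1p, np1, n, th)
                              for t in range(lo, hi + 1))
    a, b = 1e-6, 1e6
    for _ in range(100):
        m = (a * b) ** 0.5
        if expected(m) < t_o:
            a = m
        else:
            b = m
    return m

theta_cond = conditional_mle(3, 4, 4, 8)   # tea-tasting data
theta_uncond = (3 * 3) / (1 * 1)           # n11 n22 / (n12 n21)
print(round(theta_cond, 1), theta_uncond)  # conditional 6.4 vs. unconditional 9.0
```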
3.6.2 Tea Tasting Example
We illustrate with Table 3.8 from Fisher's tea-tasting experiment. The conditional ML estimate of $\theta$ is 6.4. Software provides the Cornfield tail-method interval (0.2, 626.2), with confidence coefficient guaranteed to be at least 0.95. Not surprisingly, it is very wide because of the small sample. Inverting a family of two-sided "exact" conditional score tests gives a more precise interval, (0.3, 306.2). The unconditional approach is not appropriate here because of the sampling design. [If the table were two binomial samples, that approach gives interval (0.4, 234.4) by inverting "exact" unconditional score tests.]
3.6.3 Impact of Discreteness on Exact Confidence Intervals
Small-sample inference is "exact" in the sense that the conditional distribution is free of nuisance parameters. Confidence intervals and tests use exact probability calculations rather than approximate ones. However, their operating characteristics are conservative because of discreteness.
Large-sample methods do not have the guarantee of bounds on error
probabilities. They can be conservative or liberal, and thus their results can
appear quite different from exact methods. For example, for the tea-tasting data (Table 3.8), the P-value for the Pearson chi-squared test equals 0.157, compared to 0.486 for the two-sided exact test. The 95% large-sample confidence interval (3.2) for the odds ratio is (0.4, 220.9), compared to Cornfield's exact interval of (0.2, 626.2). Normally, one would prefer an exact
method over an approximate one. When the conditional distribution is highly
discrete, however, the choice is not so obvious. Exact methods then can be
quite conservative, especially with small samples.
For highly discrete data, it seems sensible to use adjustments of exact methods based on the mid-P-value. Confidence intervals with the conditional approach then invert hypergeometric tests of $\theta = \theta_0$ using the mid-P-value. Although not guaranteed to have error probabilities no greater than the nominal level, this method usually comes closer than the exact method to the desired level. Compared to large-sample methods, it has the advantage of working well as the degree of discreteness diminishes, since it then is essentially the same as the corresponding exact method using an ordinary P-value.
Inference based on the mid-P-value compromises between the conservativeness of exact methods and the uncertain adequacy of large-sample methods. For interval estimation of the odds ratio, this method tends to be a bit conservative, but for small samples it can yield much shorter intervals than the Cornfield exact interval. For the tea-tasting data, for instance, the 95% confidence interval based on inverting two one-sided hypergeometric tests using the mid-P-value is (0.31, 309), compared to the Cornfield interval of (0.21, 626).
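For a one-sided test, the mid-P-value replaces $P(T \ge t_o)$ by $P(T > t_o) + \tfrac{1}{2}P(T = t_o)$. A quick Python check (ours, not from the text) for the tea-tasting data:

```python
from math import comb

def hyper(t, n1p, np1, n):
    """Central hypergeometric probability P(n11 = t) given the margins."""
    return comb(n1p, t) * comb(n - n1p, np1 - t) / comb(n, np1)

t_o, n1p, np1, n = 3, 4, 4, 8
ordinary = sum(hyper(t, n1p, np1, n) for t in range(t_o, min(n1p, np1) + 1))
mid_p = ordinary - 0.5 * hyper(t_o, n1p, np1, n)
print(round(ordinary, 3), round(mid_p, 3))   # 0.243 0.129
```

Here the ordinary P-value 17/70 = 0.243 drops to the mid-P-value 9/70 = 0.129.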
3.6.4 Small-Sample Inference for Difference of Proportions
The conditional approach to eliminating nuisance parameters works when those parameters have sufficient statistics. However, we'll see (Section 6.7.9) that reduced sufficient statistics occur only for certain models. For binary data, such models must have odds ratios as parameters. For 2 × 2 tables, the conditional approach cannot yield confidence intervals for differences or ratios of proportions. The unconditional approach is more complex but does not require sufficient statistics. We used it in Section 3.5.5 for testing $H_0\colon \pi_1 - \pi_2 = 0$ with independent binomial samples.
A small-sample confidence interval inverts the corresponding unconditional test of $H_0\colon \pi_1 - \pi_2 = \delta_0$, for any fixed $-1 < \delta_0 < 1$. The probability function for the table is the product of bin$(n_1, \pi_1)$ and bin$(n_2, \pi_2)$ mass functions. One can express this in terms of $\delta = \pi_1 - \pi_2$ and a nuisance parameter $\phi$. For instance, if $\phi = \pi_1 + \pi_2$, one substitutes $\pi_1 = (\phi + \delta)/2$ and $\pi_2 = (\phi - \delta)/2$. For $\delta = \delta_0$ and a fixed value of $\phi$, one then uses this binomial product to calculate the probability that the test statistic is at least as large as observed. The P-value is the supremum of such probabilities calculated over all possible values for $\phi$. This provides a family of tests for the various values of $\delta_0$. The confidence interval for $\pi_1 - \pi_2$ is the set of $\delta_0$ for which this P-value exceeds $\alpha$.
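A brute-force Python sketch of this construction (ours, not from the text; the difference of sample proportions as test statistic and a grid for the nuisance parameter $\phi$ are both illustrative choices):

```python
from math import comb

def binom(k, n, p):
    """Binomial probability mass function."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def p_value(y1, n1, y2, n2, delta0, grid=200):
    """One-sided unconditional P-value for H0: pi1 - pi2 = delta0,
    maximizing over phi = pi1 + pi2 on a grid."""
    t_o = y1 / n1 - y2 / n2
    best = 0.0
    lo, hi = abs(delta0), 2 - abs(delta0)   # keeps pi1, pi2 in [0, 1]
    for i in range(grid + 1):
        phi = lo + (hi - lo) * i / grid
        p1, p2 = (phi + delta0) / 2, (phi - delta0) / 2
        prob = sum(binom(a, n1, p1) * binom(b, n2, p2)
                   for a in range(n1 + 1) for b in range(n2 + 1)
                   if a / n1 - b / n2 >= t_o - 1e-12)
        best = max(best, prob)
    return best

# H0: pi1 - pi2 = 0 for the table with rows (3,0) and (0,3);
# this is the one-sided version of the test of Section 3.5.5
p = p_value(3, 3, 0, 3, 0.0)
print(round(p, 3))
```

Repeating this over a grid of $\delta_0$ values and collecting those with P-value exceeding $\alpha$ gives the interval; as the text notes, this can be quite conservative.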
This approach can be quite conservative. For details regarding various test statistics, see Agresti and Min (2001), Coe and Tamhane (1993), Santner and Snell (1980), and Santner and Yamagami (1993). It is better to invert a single two-sided test, as in Coe and Tamhane (1993), than to invert two separate one-sided tests.
3.7 EXTENSIONS FOR MULTIWAY TABLES AND
NONTABULATED RESPONSES
The methods of this chapter extend to multiway contingency tables. For instance, tests of independence for two-way tables extend to tests of conditional independence in three-way tables. In future chapters we present such methods with models that provide a basis for defining relevant parameters and their statistical inferences. The methods then apply in a greater variety of situations, such as when some explanatory variables are continuous rather than categorical.
3.7.1 Categorical Data Need Not Be Contingency Tables
Examples so far have presented categorical data in the format of contingency tables. However, this book has a broader focus than contingency table analysis. Models for categorical response variables can have continuous as well as categorical explanatory variables. Even when all or most variables are categorical, source data files are usually not contingency tables but have the form of a line of data for each subject. The first three lines in a data file containing responses to a survey of subjects measuring gender, race, education (1 = less than high school, 2 = high school or some college, 3 = college graduate), and opinion about homosexuality (1 = tolerant, 2 = homophobic) might be:
subject    gender    race    education    opinion
      1         f       w            2          1
      2         m       b            3          1
      3         m       w            1          2
Software can read data files of this type and then conduct analyses that may
involve forming contingency tables.
In the next chapter we introduce the modeling framework used in the rest
of the book. All the methods that we’ve studied in this chapter result from
inferences for parameters in simple versions of these models.
NOTES
Section 3.1: Confidence Intervals for Association Parameters
3.1. Adaptations of Woolf's interval (3.2) for log $\theta$ to handle zero cell counts include Agresti (1999) and Gart (1966, 1971). Goodman (1964a) presented simultaneous confidence intervals for all odds ratios in an I × J table. Brown and Benedetti (1977) and Goodman and Kruskal (1963, 1972) provided standard errors for many association measures. Goodman and Kruskal (1963, 1972) extended (3.9) for independent multinomial sampling.

3.2. Agresti and Caffo (2000) showed that, as in the single-sample case (Problem 1.24), the Wald interval (3.4) for $\pi_1 - \pi_2$ behaves much better after adding two pseudo-observations of each type (one of each type in each sample).
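A Python sketch of the adjusted interval (ours, not from the text; the data of Problem 3.14 are an arbitrary illustrative choice):

```python
from math import sqrt

def agresti_caffo(y1, n1, y2, n2, z=1.96):
    """Adjusted Wald interval for pi1 - pi2: add one pseudo-observation
    of each type to each sample (Agresti and Caffo 2000)."""
    p1 = (y1 + 1) / (n1 + 2)
    p2 = (y2 + 1) / (n2 + 2)
    se = sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
    return p1 - p2 - z * se, p1 - p2 + z * se

# Hypercalcaemia data of Problem 3.14: 7/15 treated, 0/15 control
low, high = agresti_caffo(7, 15, 0, 15)
print(round(low, 3), round(high, 3))
```

Note how the pseudo-observations keep the interval sensible despite the zero count in the control group.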
Section 3.2: Testing Independence in Two-Way Contingency Tables
3.3. For hypergeometric sampling, $\{\hat\mu_{ij}\}$ in tests of independence are exact (rather than estimated) expected values. Specifically,
$$E(n_{11}) = \frac{n_{1+}n_{+1}}{n} \qquad\text{and}\qquad \mathrm{var}(n_{11}) = \frac{n_{1+}n_{+1}n_{2+}n_{+2}}{n^2(n-1)}.$$
Haldane (1940) derived $E(X^2) = (I-1)(J-1)n/(n-1)$ and a complex formula for $\mathrm{var}(X^2)$; Dawson (1954) provided a simplified expression. Lewis et al. (1984) derived the third central moment. Watson (1959) showed that the conditional distribution of $X^2$ also has the limiting chi-squared distribution.
3.4. Diaconis and Efron (1985) presented inference based on a uniform distribution over all possible tables of the same I, J, and n; their volume test considers the proportion of such tables having $X^2 \le X_o^2$.
3.5. Specialized methods are necessary for complex sampling designs. Sequential methods are useful in biomedical applications (Jennison and Turnbull 2000, Chap. 12). Social science applications often incorporate clustering and/or stratification. LaVange et al. (2001) and Rao and Thomas (1988) surveyed analyses of categorical data for complex sampling methods. Gleser and Moore (1985) showed that positive dependence causes null distributions of Pearson statistics to stochastically increase. See also Bedrick (1983), Clogg and Eliason (1987), Fay (1985), Holt et al. (1980), Koehler and Wilson (1986), Rao and Scott (1987), Scott and Wild (2001), Shuster and Downing (1976), Tavaré and Altham (1983), and methods of Chapter 12.

Other modifications are necessary when some data are missing. Watson (1956) was perhaps the first to study this. Lipsitz and Fitzmaurice (1996) derived score tests of independence and conditional independence for contingency tables, assuming ignorable nonresponse, and showed that the test statistics have the usual asymptotic chi-squared null distributions. See Schafer (1997, Chap. 7) for a survey of methods.
Section 3.4: Two-Way Tables with Ordered Classifications
3.6. Bhapkar (1968) and Yates (1948) proposed statistics similar to $M^2$ and also proposed statistics for singly ordered tables. Graubard and Korn (1987) listed 14 tests for 2 × J tables that utilize a correlation-type statistic. See also Nair (1987) and Williams (1952). Cohen and Sackrowitz (1991, 1992) evaluated decision-theoretic aspects, such as admissibility, of tests based on gamma and local log odds ratios. Rayner and Best (2001) considered nonparametric methods in a contingency table format.
Section 3.5: Small-Sample Tests of Independence
3.7. Yates (1934) mentioned that Fisher suggested the hypergeometric to him for an exact test. He proposed a continuity-corrected version of $X^2$,
$$X_c^2 = \sum\sum \frac{\left(|n_{ij} - \hat\mu_{ij}| - 0.5\right)^2}{\hat\mu_{ij}},$$
to approximate the exact test. Haber (1980, 1982), Plackett (1964), and Yates (1984) discussed its appropriateness. Since software now makes Fisher's exact test feasible even with large samples, this correction is no longer needed.
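A quick Python check of the corrected statistic (ours, not from the text), using the tea-tasting table (3, 1 / 1, 3), for which all $\hat\mu_{ij} = 2$:

```python
# Ordinary and continuity-corrected Pearson statistics for a 2 x 2 table
table = [[3, 1], [1, 3]]
row = [sum(r) for r in table]
col = [sum(c) for c in zip(*table)]
n = sum(row)
mu = [[row[i] * col[j] / n for j in range(2)] for i in range(2)]
x2 = sum((table[i][j] - mu[i][j]) ** 2 / mu[i][j]
         for i in range(2) for j in range(2))
x2c = sum((abs(table[i][j] - mu[i][j]) - 0.5) ** 2 / mu[i][j]
          for i in range(2) for j in range(2))
print(x2, x2c)   # 2.0 0.5
```

The correction pulls the statistic sharply toward 0, illustrating why it approximates the conservative exact test.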
3.8. The UMPU property of Fisher's exact test follows from conditioning on a sufficient statistic that is complete and has distribution in the exponential family (Lehmann 1986, Secs. 4.5-4.7). Fleiss (1981), Gail and Gart (1973), and Suissa and Shuster (1985) studied sample size for obtaining fixed power in Fisher's test. The controversy over conditioning includes Barnard (1945, 1947, 1949, 1979), Berkson (1978), Fisher (1956), Howard (1998), Kempthorne (1979), Lloyd (1988a), Pearson (1947), Rice (1988), Routledge (1992), Suissa and Shuster (1984, 1985), and Yates (1984). Yates and discussants also addressed the choice of two-sided P-value. Discussion of unconditional methods includes Chan (1998), Martín Andrés and Silva Mato (1994), and Röhmel and Mansmann (1999). Altham (1969) and Howard (1998) discussed Bayesian analyses for 2 × 2 tables (see Section 15.2.3). Agresti (1992, 2001) surveyed small-sample methods.
3.9. For discussion of inference using the mid-P-value, see Berry and Armitage (1995), Hirji (1991), Hwang and Wells (2002), Hwang and Yang (2001), Mehta and Walsh (1992), and Routledge (1994). Similar benefits can accrue from alternative proposed P-values. One approach, useful when several tables have the same value for a test statistic, uses the table probability to create a more finely partitioned sample space; for tables having the observed test statistic value, only those contribute to the P-value that are no more likely than the observed table (Cohen and Sackrowitz 1992; Kim and Agresti 1995). This depends on more than the sufficient statistic, and in some cases a Rao-Blackwellized version is the mid-P-value (Hwang and Wells 2002). Ordinary P-values obtained with higher-order asymptotic methods without continuity corrections for discreteness yield performance similar to that of the mid-P-value (Pierce and Peters 1999; Strawderman and Wells 1998).
3.10. For exact treatment of I × J tables, see Mehta and Patel (1983). For ordered categories, see also Agresti et al. (1990). For Monte Carlo estimation of exact P-values, see Agresti et al. (1979), Booth and Butler (1999), Diaconis and Sturmfels (1998), Forster et al. (1996), Mehta et al. (1988), and Patefield (1982). Gail and Mantel (1977) and Good (1976) gave approximate formulas for the number of tables having certain fixed margins. Freidlin and Gastwirth (1999) extended the unconditional approach to a test for trend in I × 2 tables and a test of conditional independence with several 2 × 2 tables.
Section 3.6: Small-Sample Confidence Intervals for 2 × 2 Tables
3.11. Suppose that $(\theta, \lambda)$ has minimal sufficient statistic $(T, U)$, where $\lambda$ is a nuisance parameter. Cox and Hinkley (1974, p. 35) defined U to be ancillary for $\theta$ if its distribution depends only on $\lambda$, and the distribution of T given U depends only on $\theta$. For 2 × 2 tables with odds ratio $\theta$ and $\lambda = (\pi_{1+}, \pi_{+1})$, let $T = n_{11}$ and $U = (n_{1+}, n_{+1})$. Then U is not ancillary, because its distribution depends on $\theta$ as well as $\lambda$. Using a definition due to Godambe, Bhapkar (1989) referred to the marginals U as partial ancillary for $\theta$. This means that the distribution of the data, given U, depends only on $\theta$, and that for fixed $\theta$, the family of distributions of U for various $\lambda$ is complete. Liang (1984) gave an alternative definition referring to conditional and unconditional inference being equally efficient.
PROBLEMS
Applications
3.1 Refer to Table 2.9. Construct and interpret a 95% confidence interval for the population (a) odds ratio, (b) difference of proportions, and (c) relative risk between seat-belt use and type of injury.
3.2
Refer to Table 2.5 on lung cancer and smoking. Construct a confidence
interval for a relevant measure of association. Interpret.
3.3 In professional basketball games during 1980-1982, when Larry Bird of the Boston Celtics shot a pair of free throws, 5 times he missed both, 251 times he made both, 34 times he made only the first, and 48 times he made only the second (Wardrop 1995). Is it plausible that the successive free throws are independent?
3.4 Refer to Table 3.10.
a. Using $X^2$ and $G^2$, test the hypothesis of independence between party identification and race. Report the P-values and interpret.
b. Use residuals to describe the evidence of association.
c. Partition chi-squared into components regarding the choice between Democrat and Independent and between these two combined and Republican. Interpret.
d. Summarize association by constructing a 95% confidence interval for the odds ratio between race and whether a Democrat or Republican. Interpret.
TABLE 3.10 Data for Problem 3.4

                   Party Identification
Race       Democrat    Independent    Republican
Black           103             15            11
White           341            105           405

Source: 1991 General Social Survey, National Opinion Research Center.
3.5 Refer to Table 3.10. In the same survey, gender was cross-classified with party identification. Table 3.11 shows some results. Explain how to interpret all the results on this printout.
3.6 In a study of the relationship between stage of breast cancer at diagnosis (local or advanced) and a woman's living arrangement, of 144 women living alone, 41.0% had an advanced case; of 209 living with spouse, 52.2% were advanced; of 89 living with others, 59.6% were advanced. The authors reported the P-value for the relationship as 0.02 (D. J. Moritz and W. A. Satariano, J. Clin. Epidemiol. 46: 443-454, 1993). Reconstruct the analysis performed to obtain this P-value.
TABLE 3.11 Results for Problem 3.5

Frequency
Expected              dem        indep        repub
female                279           73          225
                   261.42       70.653       244.93
male                  165           47          191
                   182.58       49.347       171.07

Statistic                        DF     Value     Prob
Chi-Square                        2    7.0095   0.0301
Likelihood Ratio Chi-Square       2    7.0026   0.0302

Observ     Resraw     Reschi    StReschi
1          17.584      1.088       2.293
2           2.347      0.279       0.465
3         -19.931     -1.274      -2.618
4         -17.584     -1.301      -2.293
5          -2.347     -0.334      -0.464
6          19.931      1.524       2.618
3.7 Refer to Table 2.1. Partition $G^2$ for testing whether the incidence of heart attacks is independent of aspirin intake into two components. Interpret.
3.8 Project Blue Book: Analysis of Reports of Unidentified Aerial Objects was published by the U.S. Air Force (Air Technical Intelligence Center at Wright-Patterson Air Force Base) in May 1955 to analyze reports of unidentified flying objects (UFOs). In its Table II, the report classified 1765 sightings later regarded as known objects and 434 sightings later regarded as unknown, according to the object color (nine categories). The report states: "The chi-square test is applicable only to distributions which have the same number of elements," so the investigators multiplied all counts in the known category by (434/1765), so each row has 434 observations, before computing $X^2$. They reported $X^2 = 26.15$ with df = 8. Explain why this is incorrect. What should $X^2$ equal? (Hint: For their adjusted table, first show that the contribution to $X^2$ is the same for each cell in a column, and then show the effect on those contributions of multiplying each count in one row by a constant.)
3.9 Table 3.12 classifies a sample of psychiatric patients by their diagnosis and by whether their treatment prescribed drugs.
a. Obtain standardized Pearson residuals for independence, and interpret.
b. Partition chi-squared into three components to describe differences and similarities among the diagnoses, by comparing (i) the first two rows, (ii) the third and fourth rows, and (iii) the last row to the first and second rows combined and the third and fourth rows combined.
TABLE 3.12 Data for Problem 3.9

Diagnosis                 Drugs    No Drugs
Schizophrenia               105           8
Affective disorder           12           2
Neurosis                     18          19
Personality disorder         47          52
Special symptoms              0          13

Source: Reprinted with permission from E. Helmes and G. C. Fekken, J. Clin. Psychol. 42: 569-576 (1986).
3.10 Refer to Table 7.8. For the combined data for the two genders, yielding a single 4 × 4 table, $X^2 = 11.5$ ($P = 0.24$), whereas using row scores (3, 10, 20, 35) and column scores (1, 3, 4, 5), $M^2 = 7.04$ ($P = 0.008$). Explain why the results are so different.
3.11 A study on educational aspirations of high school students (S. Crysdale, Internat. J. Compar. Sociol. 16: 19-36, 1975) measured aspirations with the scale (some high school, high school graduate, some college, college graduate). The student counts in these categories were (11, 52, 23, 22) when family income was low, (9, 44, 13, 10) when family income was middle, and (9, 41, 12, 27) when family income was high.
a. Test independence of educational aspirations and family income using $X^2$ or $G^2$. Explain the deficiency of this test for these data.
b. Find the standardized Pearson residuals. Do they suggest any association pattern?
c. Conduct an alternative test that may be more powerful. Interpret.
3.12 Refer to Table 8.15. Obtain a 95% confidence interval for gamma. Interpret the association between schooling and attitude toward abortion.
3.13 Table 3.13 shows the results of a retrospective study comparing radiation therapy with surgery in treating cancer of the larynx. The response

TABLE 3.13 Data for Problem 3.13

                        Cancer        Cancer Not
                        Controlled    Controlled
Surgery                     21             2
Radiation therapy           15             3

Source: Reprinted with permission from W. M. Mendenhall, R. R. Million, D. E. Sharkey, and N. J. Cassisi, Internat. J. Radiat. Oncol. Biol. Phys. 10: 357-363 (1984), Pergamon Press plc.
TABLE 3.14 SAS Output for Problem 3.13

Fisher's Exact Test
Cell (1,1) Frequency (F)         21
Left-sided Pr <= F           0.8947
Right-sided Pr >= F          0.3808
Table Probability (P)        0.2755
Two-sided Pr <= P            0.6384

Odds Ratio                   2.1000
Asymptotic Conf Limits
  95% Lower Conf Limit       0.3116
  95% Upper Conf Limit      14.1523
Exact Conf Limits
  95% Lower Conf Limit       0.2089
  95% Upper Conf Limit      27.5522
indicates whether the cancer was controlled for at least two years following treatment. Table 3.14 shows SAS output.
a. Report and interpret the P-value for Fisher's exact test with (i) $H_a\colon \theta > 1$, and (ii) $H_a\colon \theta \ne 1$. Explain how the P-values are calculated.
b. Interpret the confidence intervals for $\theta$. Explain the difference between them and how they were calculated.
c. Find and interpret the one-sided mid-P-value. Give advantages and disadvantages of this type of P-value.
3.14 A study considered the effect of prednisolone on severe hypercalcaemia in women with metastatic breast cancer (B. Kristensen et al., J. Intern. Med. 232: 237–245, 1992). Of 30 patients, 15 were randomly selected to receive prednisolone. The other 15 formed a control group. Normalization in their level of serum-ionized calcium was achieved by 7 of the treated patients and none of the control group. Analyze whether results were significantly better for treatment than for control. Interpret.
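One natural analysis for these data (a sketch, using SciPy's `fisher_exact`; the zero cell makes large-sample methods unreliable here) is a one-sided Fisher exact test of whether the treatment success probability exceeds the control success probability:

```python
from scipy.stats import fisher_exact

# 2x2 table: rows = prednisolone, control; columns = normalized, not normalized
table = [[7, 8], [0, 15]]
_, p_value = fisher_exact(table, alternative="greater")
print(p_value)  # roughly 0.003: strong evidence that treatment is better
```

The P-value here is a single hypergeometric term, C(15,7)/C(30,7), since 7 is the largest possible first-cell count given the margins.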
3.15 For Problem 3.14, obtain a 95% confidence interval for the odds ratio using (a) the Woolf (i.e., Wald) interval, (b) Cornfield's "exact" approach, and (c) the profile likelihood. In each case, note the effect of the zero cell count. Summarize advantages and disadvantages of each approach.
3.16 Refer to the tea-tasting data (Table 3.8). Construct the null distributions of the ordinary P-value and the mid-P-value for Fisher's exact test with Ha: θ > 1. Find and compare their expected values.
3.17 Consider a 3 × 3 table having entries, by row, of (4, 2, 0 / 2, 2, 2 / 0, 2, 4). Conduct an exact test of independence, using X². Assuming ordered rows and columns and using equally spaced scores, conduct an ordinal exact test. Explain why results differ so much.
3.18 An advertisement by Schering Corp. in 1999 for the allergy drug Claritin mentioned that in a pediatric randomized clinical trial, symptoms of nervousness were shown by 4 of 188 patients on loratadine (Claritin), 2 of 262 patients taking placebo, and 2 of 170 patients on chlorpheniramine. In each part below, explain which method you used, and why.
a. Is there inferential evidence that nervousness depends on drug?
b. For the Claritin and placebo groups, construct and interpret a 95% confidence interval for the (i) odds ratio and (ii) difference of proportions suffering nervousness.
3.19 Refer to Problem 2.19 on sexual fun. Analyze these data. Present a
short report summarizing results and interpretations.
Theory and Methods
3.20 Is θ̂ the midpoint of large- and small-sample confidence intervals for θ? Why or why not?
3.21 For comparing two binomial samples, show that the standard error (3.1) of a log odds ratio increases as the absolute difference of proportions of successes and failures for a given sample increases.
3.22 Using the delta method, show that the Wald confidence interval for the logit of a binomial parameter π is

    log[π̂/(1 − π̂)] ± z_{α/2} / √[nπ̂(1 − π̂)].

Explain how to use this interval to obtain one for π itself. [Newcombe (2001) noted that the sample logit is also the midpoint of the score interval for π, on the logit scale. He showed that this logit interval contains the score interval.]
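The second step of the exercise — transforming the endpoints of the logit interval back to the probability scale — can be sketched as follows (a minimal illustration; the function name is hypothetical):

```python
import math

def logit_wald_interval(y, n, z=1.96):
    """Wald interval for logit(pi), back-transformed to an interval for pi."""
    p = y / n
    logit = math.log(p / (1 - p))
    se = 1.0 / math.sqrt(n * p * (1 - p))   # delta-method SE of the sample logit
    expit = lambda t: math.exp(t) / (1 + math.exp(t))
    return expit(logit - z * se), expit(logit + z * se)

lo, hi = logit_wald_interval(9, 10)
print(lo, hi)  # endpoints always fall inside (0, 1)
```

Unlike the ordinary Wald interval for π, the back-transformed interval cannot overshoot 0 or 1, since the expit maps the real line into (0, 1).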
3.23 For two parameters θ_i, a confidence interval for θ1 − θ2 based on single-sample estimates θ̂_i and intervals (l_i, u_i) for θ_i, i = 1, 2, is

    ( θ̂1 − θ̂2 − √[(θ̂1 − l1)² + (u2 − θ̂2)²],  θ̂1 − θ̂2 + √[(u1 − θ̂1)² + (θ̂2 − l2)²] ).
Newcombe (1998b) proposed an interval for π1 − π2 using the score interval (l_i, u_i) for π_i that performs much better than the Wald interval (3.4). It is (π̂1 − π̂2 − z_{α/2}s_L, π̂1 − π̂2 + z_{α/2}s_U), with

    s_L = √[l1(1 − l1)/n1 + u2(1 − u2)/n2],    s_U = √[u1(1 − u1)/n1 + l2(1 − l2)/n2].

Show that it has the general form above of an interval for θ1 − θ2.
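The hybrid interval is simple to implement once the single-sample score (Wilson) interval is available; this sketch (function names are illustrative) applies it to the data of Problem 3.14, where the Wald interval struggles with the zero count:

```python
import math

def wilson_interval(y, n, z=1.96):
    """Score (Wilson) interval for a single binomial proportion."""
    p = y / n
    center = (p + z * z / (2 * n)) / (1 + z * z / n)
    half = (z / (1 + z * z / n)) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

def newcombe_interval(y1, n1, y2, n2, z=1.96):
    """Newcombe (1998b) hybrid score interval for pi1 - pi2."""
    p1, p2 = y1 / n1, y2 / n2
    l1, u1 = wilson_interval(y1, n1, z)
    l2, u2 = wilson_interval(y2, n2, z)
    sL = math.sqrt(l1 * (1 - l1) / n1 + u2 * (1 - u2) / n2)
    sU = math.sqrt(u1 * (1 - u1) / n1 + l2 * (1 - l2) / n2)
    return p1 - p2 - z * sL, p1 - p2 + z * sU

lo, hi = newcombe_interval(7, 15, 0, 15)
print(lo, hi)  # well-behaved even though the second sample has zero successes
```

Note that the zero count does not collapse the interval, because the score endpoints l2 and u2, not the degenerate p2(1 − p2) = 0, enter the standard errors.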
3.24 For multinomial sampling, use the asymptotic variance of log θ̂ to show that for Yule's Q (Problem 3.26) the asymptotic variance of √n(Q̂ − Q) is σ² = (Σ_i Σ_j 1/π_ij)(1 − Q²)²/4 (Yule 1900, 1912).
3.25 Refer to Problem 2.23. For multinomial sampling, show how to obtain a confidence interval for AR by first finding one for log(1 − AR) (Fleiss 1981, p. 76).
3.26 For multinomial probabilities π = (π1, π2, . . .) with a contingency table of arbitrary dimensions, suppose that a measure g(π) = ν/δ. Show that the asymptotic variance of √n[g(π̂) − g(π)] is σ² = [Σ_i π_i φ_i² − (Σ_i π_i φ_i)²]/δ⁴, where φ_i = δ(∂ν/∂π_i) − ν(∂δ/∂π_i) (Goodman and Kruskal, 1972).
3.27 For ordinal variables, consider gamma (2.14). Let

    π_ij^(c) = Σ_{a<i} Σ_{b<j} π_ab + Σ_{a>i} Σ_{b>j} π_ab,
    π_ij^(d) = Σ_{a<i} Σ_{b>j} π_ab + Σ_{a>i} Σ_{b<j} π_ab,

where i and j are fixed in the summations. Show that Π_c = Σ_i Σ_j π_ij π_ij^(c) and Π_d = Σ_i Σ_j π_ij π_ij^(d). Use the delta method to show that the large-sample normality (3.9) applies for γ̂, with (Goodman and Kruskal 1963)

    φ_ij = 4[Π_d π_ij^(c) − Π_c π_ij^(d)]/(Π_c + Π_d)²,    Σ_i Σ_j π_ij φ_ij = 0,

    σ² = [16/(Π_c + Π_d)⁴] Σ_i Σ_j π_ij [Π_d π_ij^(c) − Π_c π_ij^(d)]².
3.28 An I × J table has ordered columns and unordered rows. Ridits (Bross 1958) are data-based column scores. The jth sample ridit is the average cumulative proportion within category j,

    r̂_j = Σ_{k=1}^{j−1} p_{+k} + (1/2)p_{+j}.

The sample mean ridit in row i is R̂_i = Σ_j r̂_j p_{j|i}. Show that Σ_j p_{+j} r̂_j = 0.50 and Σ_i p_{i+} R̂_i = 0.50. [For ridit analyses, see Agresti (1984, Secs. 9.3 and 10.2), Bross (1958), Fleiss (1981, Sec. 9.4), and Landis et al. (1978).]
3.29 Show that X² = nΣΣ(p_ij − p_{i+}p_{+j})²/p_{i+}p_{+j}. Thus, X² can be large when n is large, regardless of whether the association is practically important. Explain why this test, like other tests, simply indicates the degree of evidence against H0 and does not describe strength of association. ("Like fire, the chi-square test is an excellent servant and a bad master," Sir Austin Bradford Hill, Proc. Roy. Soc. Med. 58: 295–300, 1965.)
3.30 For testing H0: π1 = π2 using independent binomial variates y1 and y2 with n1 and n2 trials, the score statistic is

    z = (π̂1 − π̂2) / √[π̂(1 − π̂)(1/n1 + 1/n2)],

where π̂ = (y1 + y2)/(n1 + n2) is the pooled estimate of π1 = π2 under H0. Show that z² = X².
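The identity z² = X² is easy to confirm numerically; this sketch computes both statistics from scratch for an arbitrary pair of binomial samples (function names are illustrative):

```python
import math

def score_z(y1, n1, y2, n2):
    """Pooled two-proportion score statistic."""
    p1, p2 = y1 / n1, y2 / n2
    p = (y1 + y2) / (n1 + n2)
    return (p1 - p2) / math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

def pearson_x2(y1, n1, y2, n2):
    """Pearson X^2 for the 2x2 table [[y1, n1-y1], [y2, n2-y2]]."""
    n = n1 + n2
    obs = [y1, n1 - y1, y2, n2 - y2]
    col1 = y1 + y2
    exp = [n1 * col1 / n, n1 * (n - col1) / n,
           n2 * col1 / n, n2 * (n - col1) / n]
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

z = score_z(21, 23, 15, 18)
x2 = pearson_x2(21, 23, 15, 18)
print(z * z, x2)  # the two values agree
```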
3.31 For a 2 × 2 table, consider H0: π11 = θ², π12 = π21 = θ(1 − θ), π22 = (1 − θ)².
a. Show that the marginal distributions are identical and that independence holds.
b. For a multinomial sample, under H0 show that θ̂ = (p1+ + p+1)/2.
c. Explain how to test H0. Show that df = 2 for the test statistic.
d. Refer to Problem 3.3. Are Larry Bird's pairs of free throws plausibly independent and identically distributed?
3.32 For a 2 × 2 table, show that:
a. The four Pearson residuals may take different values.
b. All four standardized Pearson residuals have the same absolute value. (This is sensible, since df = 1.)
c. The square of each standardized Pearson residual equals X².
[Note: X² = n(n11 n22 − n12 n21)²/(n1+ n2+ n+1 n+2) for 2 × 2 tables. See Mirkin (2001) for alternative X² formulas for I × J tables.]
3.33 For testing independence, show that X² ≤ n min(I − 1, J − 1). Hence V² = X²/[n min(I − 1, J − 1)] falls between 0 and 1 (Cramér 1946). For 2 × 2 tables, X²/n is often called phi-squared; it equals Goodman and Kruskal's tau (Problem 2.38). Other measures based on X² include the contingency coefficient [X²/(X² + n)]^{1/2} (Pearson 1904).
3.34 For counts {n_i}, the power divergence statistic for testing goodness of fit (Cressie and Read 1984; Read and Cressie 1988) is

    [2/λ(λ + 1)] Σ_i n_i [(n_i/μ̂_i)^λ − 1]    for −∞ < λ < ∞.

a. For λ = 1, show that this equals X².
b. As λ → 0, show that it converges to G². [Hint: log t = lim_{h→0}(t^h − 1)/h.]
c. As λ → −1, show that it converges to 2Σ μ̂_i log(μ̂_i/n_i), the minimum discrimination information statistic (Gokhale and Kullback 1978).
d. For λ = −2, show that it equals Σ(n_i − μ̂_i)²/n_i, the Neyman modified chi-squared statistic (Neyman 1949).
e. For λ = −1/2, show that it equals 4Σ(√n_i − √μ̂_i)², the Freeman–Tukey statistic (Freeman and Tukey 1950).
[Under regularity conditions, their asymptotic distributions are identical (see Drost et al. 1989). The chi-squared null approximation works best for λ near 2/3.]
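The family is straightforward to compute directly; this sketch (with the λ = 0 limiting case handled separately as G²) lets the special cases in parts (a) and (b) be checked numerically. SciPy's `scipy.stats.power_divergence` offers the same family.

```python
import math

def power_divergence(counts, fitted, lam):
    """Cressie-Read power divergence; lam=1 gives Pearson X^2, lam->0 gives G^2."""
    if lam == 0:  # limiting case: likelihood-ratio statistic G^2
        return 2 * sum(n * math.log(n / m)
                       for n, m in zip(counts, fitted) if n > 0)
    return (2 / (lam * (lam + 1))) * sum(
        n * ((n / m) ** lam - 1) for n, m in zip(counts, fitted))

counts, fitted = [30, 20, 10], [20.0, 20.0, 20.0]
x2 = power_divergence(counts, fitted, 1)       # Pearson X^2 = 10.0 here
g2 = power_divergence(counts, fitted, 0)       # G^2
near_zero = power_divergence(counts, fitted, 1e-8)
print(x2, g2, near_zero)
```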
3.35 Use a partitioning argument to explain why G² for testing independence cannot increase after combining two rows (or two columns) of a contingency table. (Hint: Argue that G² for full table = G² for collapsed table + G² for table of the two rows that are combined in the collapsed table.)
3.36 Motivate partitioning (3.14) by showing that the multiple hypergeometric distribution (3.19) for {n_ij} factors as the product of hypergeometric distributions for the separate component tables (Lancaster, 1949).
3.37 Explain why {n+j} are sufficient for {π+j} in (3.17).
3.38 Assume independence, and let p_ij = n_ij/n and π̂_ij = p_{i+}p_{+j}.
a. Show that p_ij and π̂_ij are unbiased for π_ij = π_{i+}π_{+j}.
b. Show that var(p_ij) = π_{i+}π_{+j}(1 − π_{i+}π_{+j})/n.
c. Using E(p_{i+}p_{+j})² = E(p²_{i+})E(p²_{+j}) and E(p²_{i+}) = var(p_{i+}) + [E(p_{i+})]², show that

    var(π̂_ij) = π_{i+}π_{+j}[π_{i+}(1 − π_{+j}) + π_{+j}(1 − π_{i+})]/n + π_{i+}(1 − π_{i+})π_{+j}(1 − π_{+j})/n².

d. As n → ∞, show that lim var(√n π̂_ij) ≤ lim var(√n p_ij), with equality only if π_ij = 1 or 0. Hence, if the model holds or if it nearly holds, the model estimator is better than the sample proportion.
3.39 Show that the sample value of the uncertainty coefficient (2.13) satisfies Û = −G²/[2n(Σ_j p_{+j} log p_{+j})]. [Haberman (1982) gave its standard error.]
3.40 When a test statistic has a continuous distribution, the P-value has a null uniform distribution, P(P-value ≤ α) = α for 0 < α < 1. For Fisher's exact test, explain why under the null, P(P-value ≤ α) ≤ α for 0 < α < 1. [Hint: P(P-value ≤ α) = E[P(P-value ≤ α | n1+, n+1, n)].]
3.41 Refer to Note 3.3 about moments of the hypergeometric distribution (3.16). Letting π = n+1/n, show that n11 has the same mean as a binomial random variable for n1+ trials with success probability π, and that it has its variance multiplied by a finite population correction factor (n − n1+)/(n − 1). (The hypergeometric is similar to the binomial when n1+ is small compared to n.)
3.42 A contingency table for two independent binomial variables has counts (3, 0 / 0, 3) by row. For H0: π1 = π2 and Ha: π1 > π2, show that the P-value equals 1/64 for the exact unconditional test and 1/20 for Fisher's exact test. [For discussion of this example, see Little (1989), G. Barnard's remarks at the end of Yates (1984), and Sprott (2000, Sec. 6.4.4).]
3.43 Refer to Problem 3.42 and exact tests using X² with Ha: π1 ≠ π2. Explain why the unconditional P-value, evaluated at π = 0.5, is related to Fisher conditional P-values for various tables by

    P(X² ≥ 6) = Σ_{k=0}^{6} P(X² ≥ 6 | n+1 = k) P(n+1 = k).

Thus, the unconditional P-value of 1/32 is a weighted average of the Fisher P-value for the observed column margins and P-values of 0 corresponding to the impossibility of getting results as extreme as observed if other margins had occurred [i.e., 1/32 = 0.10 × C(6,3)(1/2)⁶]. The Fisher quote in Section 3.5.6 gave his view about this.
3.44 Consider exact tests of independence, given the marginals, for the I × I table having n_ii = 1 for i = 1, . . . , I, and n_ij = 0 otherwise. Show that (a) tests that order tables by their probabilities, X², or G² have P-value = 1.0, and (b) the one-sided test that orders tables by an ordinal statistic such as r or C − D has P-value = 1/I!.
3.45 A Monte Carlo scheme randomly samples M separate I × J tables having the observed margins to approximate P_o = P(X² ≥ X²_o) for an exact test. Let P̂ be the sample proportion of the M tables with X² ≥ X²_o. Show that P(|P̂ − P_o| ≤ B) = 1 − α requires that M ≈ z²_{α/2} P_o(1 − P_o)/B².
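The sample-size formula is a one-liner in practice; a sketch (the function name is illustrative) showing how many simulated tables are needed to pin down a P-value near 0.05 to within ±0.001:

```python
from scipy.stats import norm

def monte_carlo_size(p0, bound, alpha=0.05):
    """Number of simulated tables M so that P(|Phat - P0| <= bound) = 1 - alpha."""
    z = norm.ppf(1 - alpha / 2)
    return int(round(z ** 2 * p0 * (1 - p0) / bound ** 2))

m = monte_carlo_size(0.05, 0.001)
print(m)  # roughly 182,000 tables
```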
3.46 Show that the conditional ML estimate of θ satisfies n11 = E(n11) for distribution (3.18).
CHAPTER 4
Introduction to Generalized
Linear Models
In Chapters 2 and 3 we focused on methods for two-way contingency tables.
Most studies, however, have several explanatory variables, and they may be
continuous as well as categorical. The goal is usually to describe their effects
on response variables. Modeling the effects helps us do this efficiently. A
good-fitting model evaluates effects, includes relevant interactions, and provides smoothed estimates of response probabilities.
The rest of the book focuses on model building for categorical response
variables. In this chapter we introduce a family of generalized linear models
that contains the most important models for categorical responses as well as
standard models for continuous responses. Section 4.1 covers three components common to all generalized linear models. Section 4.2 illustrates with
models for binary responses. The most important case is logistic regression, a
linear model for the logit transformation of a binomial parameter. In Chapters 5 through 7 we study these models in detail.
In Section 4.3 we present generalized linear models for counts. A Poisson
regression model called a loglinear model is a linear model for the log of a
Poisson mean. In Chapters 8 and 9 we study them for modeling counts in
contingency tables.
Sections 4.4 through 4.8 are more technical. Readers wanting mainly an
overview of methods can skip them or read them lightly. For generalized
linear models, Section 4.4 covers likelihood equations and the asymptotic
covariance matrix of ML model parameter estimates, and Section 4.5 summarizes inferential methods. Methods of solving the likelihood equations are
presented in Section 4.6. In the final two sections we introduce generalizations, quasi-likelihood and generalized additive models, that further extend the
scope of models.
4.1 GENERALIZED LINEAR MODEL
Generalized linear models ŽGLMs. extend ordinary regression models to
encompass nonnormal response distributions and modeling functions of the
mean. Three components specify a generalized linear model: A random
component identifies the response variable Y and its probability distribution;
a systematic component specifies explanatory variables used in a linear
predictor function; and a link function specifies the function of E Ž Y . that the
model equates to the systematic component. Nelder and Wedderburn Ž1972.
introduced the class of GLMs, although many models in the class were well
established by then.
4.1.1 Components of Generalized Linear Models
The random component of a GLM consists of a response variable Y with independent observations (y1, . . . , yN) from a distribution in the natural exponential family. This family has probability density function or mass function of form

    f(y_i; θ_i) = a(θ_i) b(y_i) exp[y_i Q(θ_i)].    (4.1)

Several important distributions are special cases, including the Poisson and binomial. The value of the parameter θ_i may vary for i = 1, . . . , N, depending on values of explanatory variables. The term Q(θ) is called the natural parameter. In Section 4.4 we present a more general formula that also has a dispersion parameter, but (4.1) is sufficient for basic discrete data models.
The systematic component of a GLM relates a vector (η1, . . . , ηN) to the explanatory variables through a linear model. Let x_ij denote the value of predictor j (j = 1, 2, . . . , p) for subject i. Then

    η_i = Σ_j β_j x_ij,    i = 1, . . . , N.

This linear combination of explanatory variables is called the linear predictor. Usually, one x_ij = 1 for all i, for the coefficient of an intercept (often denoted by α) in the model.
The third component of a GLM is a link function that connects the random and systematic components. Let μ_i = E(Y_i), i = 1, . . . , N. The model links μ_i to η_i by η_i = g(μ_i), where the link function g is a monotonic, differentiable function. Thus, g links E(Y_i) to the explanatory variables through the formula

    g(μ_i) = Σ_j β_j x_ij,    i = 1, . . . , N.    (4.2)
117
GENERALIZED LINEAR MODEL
The link function g(μ) = μ, called the identity link, has η_i = μ_i. It specifies a linear model for the mean itself. This is the link function for ordinary regression with normally distributed Y. The link function that transforms the mean to the natural parameter is called the canonical link. For it, g(μ_i) = Q(θ_i), and Q(θ_i) = Σ_j β_j x_ij. The following subsections show examples.
In summary, a GLM is a linear model for a transformed mean of a response variable that has distribution in the natural exponential family. We now illustrate the three components by introducing the key GLMs for discrete response variables.
4.1.2 Binomial Logit Models for Binary Data
Many response variables are binary. Represent the success and failure outcomes by 1 and 0. The Bernoulli distribution for this Bernoulli trial specifies probabilities P(Y = 1) = π and P(Y = 0) = 1 − π, for which E(Y) = π. This is the special case of the binomial (1.1) with n = 1. The probability mass function is

    f(y; π) = π^y (1 − π)^{1−y} = (1 − π)[π/(1 − π)]^y = (1 − π) exp[y log(π/(1 − π))]    (4.3)

for y = 0 and 1. This is in the natural exponential family (4.1), identifying θ with π, a(π) = 1 − π, b(y) = 1, and Q(π) = log[π/(1 − π)]. The natural parameter log[π/(1 − π)] is the log odds of response 1, the logit of π. This is the canonical link. GLMs using the logit link are often called logit models.
4.1.3 Poisson Loglinear Models for Count Data
Some response variables have counts as their possible outcomes. For a sample of silicon wafers used in manufacturing computer chips, each observation might be the number of imperfections on a wafer. Counts also occur as entries in contingency tables.
The simplest distribution for count data is the Poisson. Like counts, Poisson variates can take any nonnegative integer value. Let Y denote a count and let μ = E(Y). The Poisson probability mass function (1.4) for Y is

    f(y; μ) = e^{−μ} μ^y / y! = exp(−μ)(1/y!)exp(y log μ),    y = 0, 1, 2, . . . .

This has natural exponential form (4.1) with θ = μ, a(μ) = exp(−μ), b(y) = 1/y!, and Q(μ) = log μ. The natural parameter is log μ, so the canonical
TABLE 4.1  Types of Generalized Linear Models for Statistical Analysis

Random                       Systematic
Component    Link            Component     Model                     Chapters
Normal       Identity        Continuous    Regression
Normal       Identity        Categorical   Analysis of variance
Normal       Identity        Mixed         Analysis of covariance
Binomial     Logit           Mixed         Logistic regression       5 and 6
Poisson      Log             Mixed         Loglinear                 8 and 9
Multinomial  Generalized     Mixed         Multinomial response      7
             logit
link function is the log link, η = log μ. The model using this link is

    log μ_i = Σ_j β_j x_ij,    i = 1, . . . , N.    (4.4)

This model is called a Poisson loglinear model.
4.1.4 Generalized Linear Models for Continuous Responses
The class of GLMs also includes models for continuous responses. The
normal distribution is in a natural exponential family that includes dispersion
parameters. Its natural parameter is the mean. Therefore, an ordinary
regression model for E Ž Y . is a GLM using the identity link. Table 4.1 lists
this and other standard models for a normal random component. The table
also lists GLMs for discrete responses that are presented in the next six
chapters.
A traditional way to analyze data transforms Y so that it has approximately a normal distribution with constant variance; then, ordinary least-squares regression is applicable. With GLMs, by contrast, the choice of link
function is separate from the choice of random component. If a link is useful
in the sense that a linear model for the predictors is plausible for that link, it
is not necessary that it also stabilizes variance or produces normality. This is
because the fitting process maximizes the likelihood for the choice of distribution for Y, and that choice is not restricted to normality.
4.1.5 Deviance
For a particular GLM for observations y = (y1, . . . , yN), let L(μ; y) denote the log-likelihood function expressed in terms of the means μ = (μ1, . . . , μN). Let L(μ̂; y) denote the maximum of the log likelihood for the model. Considered for all possible models, the maximum achievable log likelihood is L(y; y). This occurs for the most general model, having a separate parameter for each observation and the perfect fit μ̂ = y. Such a model is called the saturated model. This model is not useful, since it does not provide data reduction. However, it serves as a baseline for comparison with other model fits.
The deviance of a Poisson or binomial GLM is defined to be

    −2[L(μ̂; y) − L(y; y)].
This is the likelihood-ratio statistic for testing the null hypothesis that the model holds against the general alternative (i.e., the saturated model). For some Poisson and binomial GLMs, the number of observations N stays fixed as the individual counts increase in size. Then the deviance has a chi-squared asymptotic null distribution. The df = N − p, where p is the number of model parameters; that is, df equals the difference between the numbers of parameters in the saturated and unsaturated models. The deviance then provides a test of model fit.
An example is binomial counts at N fixed settings of predictors when the number of trials at each setting increases. Let Y_i be bin(n_i, π_i), i = 1, . . . , N. Consider the simple model of homogeneity, π_i = α for all i. It has p = 1 parameter. The saturated model makes no assumption about {π_i}, letting them be any N values between 0 and 1.0. It has N parameters. The deviance for the homogeneity model has df = N − 1. In fact, it equals the G² likelihood-ratio statistic (3.11) for testing independence in the N × 2 table that these samples form. Under independence, it has approximately a chi-squared distribution as the {n_i} increase, for fixed N.
We use the deviance throughout the book for model checking and for
inferential comparisons of models. Components of the deviance are residual
measures of lack of fit. Methods for analyzing the deviance generalize
analysis of variance methods for normal linear models.
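For the homogeneity example just described, the deviance has a closed form, since the fitted means are n_i α̂ with α̂ the pooled proportion; a sketch (the function name is illustrative):

```python
import math

def deviance_homogeneity(successes, trials):
    """Deviance of the binomial homogeneity model (pi_i = alpha for all i).

    Equals the G^2 statistic for independence in the N x 2 table of
    successes and failures."""
    alpha = sum(successes) / sum(trials)
    dev = 0.0
    for y, n in zip(successes, trials):
        for obs, fit in ((y, n * alpha), (n - y, n * (1 - alpha))):
            if obs > 0:
                dev += 2 * obs * math.log(obs / fit)
    return dev

dev = deviance_homogeneity([10, 20], [50, 50])
print(dev)  # G^2 for independence in the 2x2 table [[10,40],[20,30]]
```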
4.1.6 Advantages of the GLM Formulation
GLMs provide a unified theory of modeling that encompasses the most
important models for continuous and discrete variables. Models studied in
this text are GLMs with binomial or Poisson random component, or multivariate extensions of GLMs. The ML parameter estimates are computed with
an algorithm, presented in Section 4.6, that iteratively uses a weighted
version of least squares. The reason for restricting GLMs to the exponential
family of distributions for Y is that the same algorithm applies to this entire
family, for any choice of link function.
Most statistical software has the facility to fit GLMs. Appendix A gives
details.
4.2 GENERALIZED LINEAR MODELS FOR BINARY DATA
Let Y denote a binary response variable. For instance, Y might indicate vote in a British election (Labour, Conservative), choice of automobile (domestic, import), or diagnosis of breast cancer (present, absent). Each observation has one of two outcomes, denoted by 0 and 1, binomial for a single trial. The mean E(Y) = P(Y = 1). We denote P(Y = 1) by π(x), reflecting its dependence on values x = (x1, . . . , x_p) of predictors. The variance of Y is

    var(Y) = π(x)[1 − π(x)],

the binomial variance for one trial. In introducing GLMs for binary data, for simplicity we use a single explanatory variable.
4.2.1 Linear Probability Model
For a binary response, the regression model

    π(x) = α + βx    (4.5)

is called a linear probability model. With independent observations it is a GLM with binomial random component and identity link function.
The linear probability model has a major structural defect. Probabilities fall between 0 and 1, but linear functions take values over the entire real line. Model (4.5) has π(x) < 0 and π(x) > 1 for sufficiently large or small x values. For its extension with multiple predictors, difficulties often occur in fitting this model because during the fitting process, π̂(x) falls outside the [0, 1] range for some subjects' x values. The model can be valid over a restricted range of x values. When it is plausible, an advantage is its simple interpretation: β is the change in π(x) for a one-unit increase in x.
We defer to Section 4.6 the technical details of fitting this and other GLMs. One should assume a binomial distribution for Y and use maximum likelihood (ML) rather than ordinary least squares. Least squares is ML for a normal distribution with constant variance. For binary responses, the constant variance condition that makes least squares estimators optimal (i.e., minimum variance in the class of linear unbiased estimators) is not satisfied. Since var(Y) = π(x)[1 − π(x)], the variance depends on x through its influence on π(x). As π(x) moves toward 0 or 1, the distribution of Y is more nearly concentrated at a single point, and the variance moves toward 0. Because of the nonconstant variance, the binomial ML estimator is more efficient than least squares. Also Y, being binary, is very far from normally distributed. Thus, the usual sampling distributions for the least squares estimators do not apply. The estimates and standard errors for ML and least squares are usually similar, however, when π̂(x) for the sample x values falls in the range within which the variance is relatively stable (about 0.3 to 0.7).
TABLE 4.2  Relationship between Snoring and Heart Disease

                      Heart Disease    Proportion   Linear   Logit
Snoring               Yes      No      Yes          Fit^a    Fit^a
Never                  24    1355      0.017        0.017    0.021
Occasionally           35     603      0.055        0.057    0.044
Nearly every night     21     192      0.099        0.096    0.093
Every night            30     224      0.118        0.116    0.132

^a Model fits refer to proportion of yes responses.
Source: P. G. Norton and E. V. Dunn, British Med. J. 291: 630–632 (1985), BMJ Publishing Group.
4.2.2 Snoring and Heart Disease Example
We illustrate the linear probability model with Table 4.2, from an epidemiological survey of 2484 subjects to investigate snoring as a risk factor for heart disease. Those surveyed were classified according to their spouses' report of how much they snored. The model states that the probability of heart disease is linearly related to the level of snoring x. We treat the rows of the table as independent binomial samples. No obvious choice of scores exists for categories of x. We used (0, 2, 4, 5), treating the last two levels as closer than the other adjacent pairs (Problem 4.4 uses equally spaced scores). ML estimates and standard errors are the same if we use a data file of 2484 binary observations or if we enter the four binomial totals of yes and no responses listed in Table 4.2.
Software (see, e.g., Table A.3 for SAS) reports the ML fit, π̂(x) = 0.0172 + 0.0198x, with a standard error SE = 0.0028 for β̂ = 0.0198. For nonsnorers (x = 0), the estimated proportion of subjects having heart disease is 0.0172. We refer to the estimated values of E(Y) for a GLM as fitted values. Table 4.2 shows the sample proportions and the fitted values for this model. Figure 4.1 graphs the sample and fitted values. The table and graph suggest that the model fits well. (In Section 5.2.3 we discuss formal goodness-of-fit analyses for binary-response GLMs.) The model interpretation is simple. The estimated probability of heart disease is about 0.02 for nonsnorers; it increases 2(0.0198) = 0.04 for occasional snorers, another 0.04 for those who snore nearly every night, and another 0.02 for those who always snore.
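The ML fit reported above can be reproduced by maximizing the binomial log likelihood directly; this is a sketch using a general-purpose optimizer rather than the iterative weighted least squares algorithm of Section 4.6, and it should recover α̂ ≈ 0.0172 and β̂ ≈ 0.0198.

```python
import math
from scipy.optimize import minimize

# Snoring data (Table 4.2): scores, yes counts, group sizes
x = [0, 2, 4, 5]
yes = [24, 35, 21, 30]
n = [1379, 638, 213, 254]

def neg_loglik(params):
    a, b = params
    ll = 0.0
    for xi, yi, ni in zip(x, yes, n):
        pi = a + b * xi                      # identity link
        if pi <= 0 or pi >= 1:               # keep probabilities in range
            return float("inf")
        ll += yi * math.log(pi) + (ni - yi) * math.log(1 - pi)
    return -ll

res = minimize(neg_loglik, [0.02, 0.02], method="Nelder-Mead",
               options={"xatol": 1e-10, "fatol": 1e-12})
a_hat, b_hat = res.x
print(round(a_hat, 4), round(b_hat, 4))  # 0.0172, 0.0198
```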
4.2.3 Logistic Regression Model
FIGURE 4.1  Predicted probabilities for linear probability and logistic regression models.

Usually, binary data result from a nonlinear relationship between π(x) and x. A fixed change in x often has less impact when π(x) is near 0 or 1 than when π(x) is near 0.5. In the purchase of an automobile, consider the choice between buying new or used. Let π(x) denote the probability of selecting new when annual family income = x. An increase of $50,000 in annual income would have less effect when x = $1,000,000 [for which π(x) is near 1] than when x = $50,000.
In practice, nonlinear relationships between π(x) and x are often monotonic, with π(x) increasing continuously or π(x) decreasing continuously as x increases. The S-shaped curves in Figure 4.2 are typical. The most important curve with this shape has the model formula

    π(x) = exp(α + βx) / [1 + exp(α + βx)].    (4.6)

This is the logistic regression model. As x → ∞, π(x) ↓ 0 when β < 0 and π(x) ↑ 1 when β > 0.
Let's find the link function for which logistic regression is a GLM. For (4.6) the odds are

    π(x) / [1 − π(x)] = exp(α + βx).

The log odds has the linear relationship

    log{π(x) / [1 − π(x)]} = α + βx.    (4.7)
FIGURE 4.2  Logistic regression functions.
Thus, the appropriate link is the log odds transformation, the logit. Logistic regression models are GLMs with binomial random component and logit link function. Logistic regression models are also called logit models.
The logit is the natural parameter of the binomial distribution, so the logit link is its canonical link. Whereas π(x) must fall in the (0, 1) range, the logit can be any real number. The real numbers are also the range for linear predictors (such as α + βx) that form the systematic component of a GLM. So this model does not have the structural problem that the linear probability model has.
For the snoring data in Table 4.2, software reports the logistic regression ML fit

    logit[π̂(x)] = −3.87 + 0.40x.

The positive β̂ = 0.40 reflects the increased incidence of heart disease at higher snoring levels. In Chapters 5 and 6 we study logistic regression in detail and interpret such equations. Estimated probabilities result from substituting x values into the estimate of probability formula (4.6). Table 4.2 also reports these fitted values. Figure 4.1 displays the fit. The fit is close to linear over this narrow range of estimated probabilities, and results are similar to those for the linear probability model.
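The logistic fit can be checked the same way as the linear probability fit, replacing the identity link by the logit link in the log likelihood; this sketch should recover estimates near the reported −3.87 and 0.40.

```python
import math
from scipy.optimize import minimize

# Snoring data (Table 4.2): scores, yes counts, group sizes
x = [0, 2, 4, 5]
yes = [24, 35, 21, 30]
n = [1379, 638, 213, 254]

def neg_loglik(params):
    a, b = params
    ll = 0.0
    for xi, yi, ni in zip(x, yes, n):
        pi = 1 / (1 + math.exp(-(a + b * xi)))   # logit link
        ll += yi * math.log(pi) + (ni - yi) * math.log(1 - pi)
    return -ll

res = minimize(neg_loglik, [-3.0, 0.3], method="Nelder-Mead",
               options={"xatol": 1e-10, "fatol": 1e-12})
a_hat, b_hat = res.x
print(round(a_hat, 2), round(b_hat, 2))  # -3.87, 0.40
```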
4.2.4 Binomial GLM for 2 × 2 Contingency Tables
Among the simplest GLMs for a binary response is the one having a single explanatory variable X that is also binary. Label its values by 0 and 1. For a given link function, the GLM

    link[π(x)] = α + βx

has the effect of X described by

    β = link[π(1)] − link[π(0)].

For the identity link, β = π(1) − π(0) is the difference between proportions. For the log link, β = log[π(1)] − log[π(0)] = log[π(1)/π(0)] is the log relative risk. For the logit link,

    β = logit[π(1)] − logit[π(0)]
      = log{π(1)/[1 − π(1)]} − log{π(0)/[1 − π(0)]}
      = log( {π(1)/[1 − π(1)]} / {π(0)/[1 − π(0)]} )

is the log odds ratio. Measures of association for 2 × 2 tables are effect parameters in GLMs for binary data.
4.2.5 Probit and Inverse CDF Link Functions*
A monotone regression curve such as the first one in Figure 4.2 has the shape of a cumulative distribution function (cdf) for a continuous random variable. This suggests a model for a binary response having form π(x) = F(x) for some cdf F.
Using an entire class of location-scale cdf's, such as normal cdf's with their variety of means and variances, permits the curve π(x) = F(x) to have flexibility in the rate of increase and in the location where most of that increase occurs. Let Φ(·) denote the standard cdf of the class, such as the N(0, 1) cdf. Using Φ but writing the model as

    π(x) = Φ(α + βx)    (4.8)

provides the same flexibility. Shapes of different cdf's in the class occur as α and β vary. Replacing x by βx permits the curve to increase at a different rate than the standard cdf (or even to decrease if β < 0); varying α moves the curve to the left or right.
When Φ is strictly increasing over the entire real line, its inverse function Φ⁻¹(·) exists and (4.8) is, equivalently,

    Φ⁻¹[π(x)] = α + βx.    (4.9)
125
GENERALIZED LINEAR MODELS FOR COUNTS
For this class of cdf shapes, the link function for the GLM is Φ⁻¹. The link
function maps the (0, 1) range of probabilities onto (−∞, ∞), the range of
linear predictors. The curve has the shape of a normal cdf when Φ is the
standard normal cdf. Model (4.9) is then called the probit model. This curve
has an appearance similar to the logistic regression curve. Probit models are
discussed in Section 6.6.

When β > 0, the logistic regression curve (4.6) is a cdf for the logistic
distribution. When β < 0, the curve for 1 − π(x), the probability that Y = 0, has
that appearance. The cdf of the logistic distribution with mean μ and
dispersion parameter τ > 0 is

F(x) = exp[(x − μ)/τ] / {1 + exp[(x − μ)/τ]},    −∞ < x < ∞.

The corresponding probability density function is symmetric and bell-shaped,
with standard deviation τπ/√3 (here, π is the mathematical constant
3.14 . . .). It looks much like the normal density with the same mean and
standard deviation but with slightly thicker tails. (Its kurtosis equals that of a
t distribution with df = 9.)

The standardized form of the logistic cdf has μ = 0 and τ = 1, so
Φ(x) = eˣ/(1 + eˣ). For that function, the logistic regression curve (4.6) has
form π(x) = Φ(α + βx). By (4.9) the logit transformation is simply the
inverse function for the standard logistic cdf; that is, when Φ(x) = π(x) =
eˣ/(1 + eˣ), then x = Φ⁻¹[π(x)] = log{π(x)/[1 − π(x)]}.
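This inverse relationship is easy to check numerically. The sketch below is plain Python (no GLM library); it verifies that applying the logit to the standard logistic cdf Φ(x) = eˣ/(1 + eˣ) recovers x, and evaluates the standard deviation π/√3 mentioned above.

```python
import math

def logistic_cdf(x):
    """Standard logistic cdf: Phi(x) = e^x / (1 + e^x)."""
    return math.exp(x) / (1.0 + math.exp(x))

def logit(p):
    """Inverse of the standard logistic cdf: log[p / (1 - p)]."""
    return math.log(p / (1.0 - p))

# The logit is the inverse function of the standard logistic cdf:
for x in [-2.0, -0.5, 0.0, 1.3, 3.0]:
    assert abs(logit(logistic_cdf(x)) - x) < 1e-12

# Standard deviation of the standard logistic density is pi / sqrt(3):
print(math.pi / math.sqrt(3))
```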
4.3 GENERALIZED LINEAR MODELS FOR COUNTS
The best known GLMs for count data assume a Poisson distribution for Y.
We introduced this distribution in Section 1.2.3. In Chapters 8 and 9 we
present Poisson GLMs for counts in contingency tables with categorical
response variables. In this section we introduce Poisson GLMs using an
alternative application: modeling count or rate data for a single discrete
response variable.
4.3.1 Poisson Loglinear Models
The Poisson distribution has a positive mean μ. Although a GLM can model
a positive mean using the identity link, it is more common to model the log of
the mean. Like the linear predictor α + βx, the log mean can take any real
value. The log mean is the natural parameter for the Poisson distribution,
and the log link is the canonical link for a Poisson GLM. A Poisson loglinear
GLM assumes a Poisson distribution for Y and uses the log link.
The Poisson loglinear model with explanatory variable X is

log μ = α + βx.    (4.10)
For this model, the mean satisfies the exponential relationship

μ = exp(α + βx) = e^α (e^β)^x.    (4.11)

A 1-unit increase in x has a multiplicative impact of e^β on μ: the mean at
x + 1 equals the mean at x multiplied by e^β.
4.3.2 Horseshoe Crab Mating Example
We illustrate Poisson GLMs for Table 4.3 from a study of nesting horseshoe
crabs. Each female horseshoe crab had a male crab resident in her nest. The
study investigated factors affecting whether the female crab had any other
males, called satellites, residing nearby. Explanatory variables are the female
crab’s color, spine condition, weight, and carapace width. The response
outcome for each female crab is her number of satellites. For now, we use
width alone as a predictor. Table 4.3 lists width in centimeters. The sample
mean width equals 26.3 and the standard deviation equals 2.1.
Figure 4.3 plots the response counts of satellites against width, with
numbered symbols indicating the number of observations at each point. The
substantial variability makes it difficult to discern a clear trend. To get a
clearer picture, we grouped the female crabs into width categories (≤ 23.25,
23.25–24.25, 24.25–25.25, 25.25–26.25, 26.25–27.25, 27.25–28.25, 28.25–29.25,
> 29.25) and calculated the sample mean number of satellites for female
crabs in each category. Figure 4.4 plots these sample means against the
sample mean width for crabs in each category.
More sophisticated ways of portraying the trend smooth the data without
grouping the width values or assuming a particular functional relationship.
Figure 4.4 also shows a smoothed curve based on an extension of the GLM
introduced in Section 4.8. The sample means and the smoothed curve both
show a strong increasing trend. (The means tend to fall above the curve,
since the response counts in a category tend to be skewed to the right; the
smoothed curve is less susceptible to outlying observations.) The trend seems
approximately linear, and next we discuss models for the ungrouped data for
which the mean or the log of the mean is linear in width.
For a female crab, let μ be the expected number of satellites and
x = width. From GLM software (e.g., for SAS, see Table A.4), the ML fit of
the Poisson loglinear model (4.10) is

log μ̂ = α̂ + β̂x = −3.305 + 0.164x.

The effect β̂ = 0.164 of width is positive, with SE = 0.020. The model fitted
value at any width level is an estimated mean number of satellites μ̂. For
instance, the fitted value at the mean width of x = 26.3 is

μ̂ = exp(α̂ + β̂x) = exp[−3.305 + 0.164(26.3)] = 2.74.
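Given the coefficients reported above, fitted means can be reproduced directly; a minimal sketch (plugging the reported estimates into the fitted equation, not refitting the model):

```python
import math

alpha_hat, beta_hat = -3.305, 0.164  # ML estimates reported in the text

def fitted_mean(width):
    """Fitted mean number of satellites under log(mu) = alpha + beta * x."""
    return math.exp(alpha_hat + beta_hat * width)

print(round(fitted_mean(26.3), 2))  # 2.74, the fitted value at the mean width
```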
TABLE 4.3 Number of Crab Satellites by Female's Characteristics^a

[Table entries for the 173 female crabs omitted; the columns are C, S, W, Wt, and Sa, defined in the footnote below.]

^a C, color (1, light medium; 2, medium; 3, dark medium; 4, dark); S, spine condition (1, both
good; 2, one worn or broken; 3, both worn or broken); W, carapace width (cm); Wt, weight (kg);
Sa, number of satellites.
Source: Data courtesy of Jane Brockmann, Zoology Department, University of Florida; study
described in Ethology 102:1–21 (1996).
FIGURE 4.3 Number of satellites by width of female crab.
For this model, exp(β̂) = exp(0.164) = 1.18 is the multiplicative effect on μ̂
for a 1-cm increase in x. For instance, the fitted value at x = 27.3 = 26.3 + 1
is exp[−3.305 + 0.164(27.3)] = 3.23, which equals 1.18 × 2.74. A 1-cm increase
in width yields an 18% increase in the estimated mean.
Figure 4.4 shows that E(Y) may grow approximately linearly with width.
This suggests the Poisson GLM with identity link. Its ML fit is

μ̂ = α̂ + β̂x = −11.53 + 0.55x.

This model has an additive rather than a multiplicative effect of X on μ.
A 1-cm increase in x has an estimated increase of β̂ = 0.55 in μ̂. The fitted
values are positive at all sampled x, and the model describes the effect
simply: On the average, about a 2-cm increase in width is associated with an
extra satellite.
Figure 4.5 plots μ̂ against width for the models with log link and identity
link. Although they diverge somewhat for relatively small and large widths,
they provide similar predictions over the width range in which most
observations occur. We now study whether either model fits adequately.
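The two fits can be compared directly from their reported coefficients; a minimal sketch (the width values chosen are arbitrary points in the observed range):

```python
import math

def mu_log(x):
    """Poisson GLM with log link: log(mu) = -3.305 + 0.164 x."""
    return math.exp(-3.305 + 0.164 * x)

def mu_identity(x):
    """Poisson GLM with identity link: mu = -11.53 + 0.55 x."""
    return -11.53 + 0.55 * x

# Multiplicative effect of a 1-cm width increase under the log link:
print(round(mu_log(27.3) / mu_log(26.3), 2))  # 1.18

# Near the center of the width distribution the two fits are close:
for x in (24.0, 26.3, 28.0):
    print(x, round(mu_log(x), 2), round(mu_identity(x), 2))
```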
FIGURE 4.4 Smoothings of horseshoe crab counts.

TABLE 4.4 Sample Mean and Variance of Number of Satellites

Width (cm)     Number of Cases   Number of Satellites   Sample Mean   Sample Variance
< 23.25              14                  14                 1.00            2.77
23.25–24.25          14                  20                 1.43            8.88
24.25–25.25          28                  67                 2.39            6.54
25.25–26.25          39                 105                 2.69           11.38
26.25–27.25          22                  63                 2.86            6.88
27.25–28.25          24                  93                 3.87            8.81
28.25–29.25          18                  71                 3.94           16.88
> 29.25              14                  72                 5.14            8.29
FIGURE 4.5 Estimated mean number of satellites for log and identity links.

4.3.3 Overdispersion for Poisson GLMs
In Section 1.2.4 we noted that count data often show greater variability than
the Poisson allows. For the grouped horseshoe crab data, Table 4.4 shows
the sample mean and variance for the counts of number of satellites for the
female crabs in each width category. The variances are much larger than the
means, whereas Poisson distributions have identical mean and variance.
The greater variability than predicted by the GLM random component
reflects overdispersion.
A common cause of overdispersion is subject heterogeneity. For instance,
suppose that width, weight, color, and spine condition are the four predictors
that affect a female crab’s number of satellites. Suppose that Y has a Poisson
distribution at each fixed combination of those predictors. Our model uses
width alone as a predictor. Crabs having a certain width are then a mixture of
crabs of various weights, colors, and spine conditions. Thus, the population of
crabs having that width is a mixture of several Poisson populations, each
having its own mean for the response. This heterogeneity results in an overall
response distribution at that width having greater variation than the Poisson
predicts. If the variance equals the mean when all relevant variables are
controlled, it exceeds the mean when only one is controlled.
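The heterogeneity argument above can be illustrated by simulation: mixing Poisson populations with different means inflates the variance above the overall mean. A small sketch, where the gamma mixing distribution is an arbitrary choice for illustration (not fitted to the crab data):

```python
import math
import random

def poisson_draw(lam, rng):
    """Poisson sample via Knuth's method (adequate for small lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

rng = random.Random(1)
ys = []
for _ in range(20000):
    # Each subject first draws its own mean, then a Poisson count:
    lam = rng.gammavariate(2.0, 1.5)   # subject-specific mean; E(lam) = 3
    ys.append(poisson_draw(lam, rng))

n = len(ys)
mean = sum(ys) / n
var = sum((y - mean) ** 2 for y in ys) / (n - 1)
print(round(mean, 2), round(var, 2))
assert var > 1.5 * mean   # overdispersion: mixing inflates the variance
```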
Overdispersion is not an issue in ordinary regression with normally distributed Y, because that distribution has a separate parameter (the variance)
to describe variability. For binomial and Poisson distributions, however,
the variance is a function of the mean. Overdispersion is common in the
modeling of counts. When the model for the mean is correct but the true
distribution is not Poisson, the ML estimates of model parameters are still
consistent but standard errors are incorrect. We next introduce an extension
of the Poisson GLM that has an extra parameter and accounts better for
overdispersion. In Section 4.7 we present another approach for this, quasi-likelihood inference.
4.3.4 Negative Binomial GLMs
The negative binomial distribution has probability mass function

f(y; k, μ) = [Γ(y + k) / (Γ(k) Γ(y + 1))] [k/(μ + k)]^k [1 − k/(μ + k)]^y,    y = 0, 1, 2, . . . ,    (4.12)

where k and μ are parameters. This distribution has

E(Y) = μ,    var(Y) = μ + μ²/k.

The index k⁻¹ is called a dispersion parameter. As k⁻¹ → 0, var(Y) → μ and
the negative binomial distribution converges to the Poisson (Cameron and
Trivedi 1998, p. 75). Usually, k⁻¹ is unknown. Estimating it helps summarize
the extent of overdispersion.
For k fixed, one can express (4.12) in natural exponential family form
(4.1). Then, a model with negative binomial random component is a GLM.
For simplicity, such models let k be the same constant for all observations
but treat it as unknown. As in GLMs for binary data, a variety of link
functions are possible. Most common is the log link, as in Poisson loglinear
models, but sometimes the identity link is adequate.
In Section 13.4 we discuss negative binomial GLMs. We illustrate them here
for the crab data analyzed above with Poisson GLMs. With the identity link
and width as predictor, the Poisson GLM has μ̂ = −11.53 + 0.55x (SE =
0.06 for β̂). For the negative binomial GLM, μ̂ = −11.15 + 0.53x (SE =
0.11). Moreover, k̂⁻¹ = 0.98, so at a predicted μ̂, the estimated variance is
roughly μ̂ + μ̂², compared to μ̂ for the Poisson GLM. Although fitted values
are similar, the greater SE for β̂ and the greater estimated variance in the
negative binomial model reflect the overdispersion uncaptured with the
Poisson GLM.
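The moment formulas for (4.12) can be verified by direct summation of the probability mass function; a minimal sketch using only the standard library (computing the pmf on the log scale to avoid overflow in the gamma functions):

```python
import math

def nb_pmf(y, k, mu):
    """Negative binomial pmf (4.12), computed on the log scale for stability."""
    p = k / (mu + k)
    log_pmf = (math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
               + k * math.log(p) + y * math.log(1.0 - p))
    return math.exp(log_pmf)

k, mu = 1.02, 3.0   # dispersion k^-1 near the 0.98 estimated for the crab data
mean = sum(y * nb_pmf(y, k, mu) for y in range(400))
var = sum((y - mean) ** 2 * nb_pmf(y, k, mu) for y in range(400))

assert abs(mean - mu) < 1e-6                   # E(Y) = mu
assert abs(var - (mu + mu ** 2 / k)) < 1e-4    # var(Y) = mu + mu^2/k
```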
4.3.5 Poisson Regression for Rates
When events of a certain type occur over time, space, or some other index of
size, it is usually more relevant to model the rate at which they occur than the
number of them. For instance, a study of homicides in a given year for a
sample of cities might model the homicide rate, defined for a city as its
number of homicides that year divided by its population size. The model
might describe how the rate depends on the city’s unemployment rate, its
residents’ median income, and the percentage of residents having completed
high school. In Section 9.7 we discuss Poisson regression for modeling rates.
4.3.6 Poisson GLM of Independence in I × J Contingency Tables
One use of Poisson loglinear models is in modeling counts in contingency
tables. We illustrate for two-way tables with independent counts {Yij} having
Poisson distributions with means {μij}. Suppose that {μij} satisfy

μij = μ αi βj,

where {αi} and {βj} are positive constants satisfying Σi αi = Σj βj = 1. This is
a multiplicative model, but a linear predictor for a GLM results using the log
link,

log μij = λ + αi* + βj*,    (4.13)

where λ = log μ, αi* = log αi, and βj* = log βj. This Poisson loglinear model
has additive main effects of the two classifications but no interaction.

Since the {Yij} are independent, the total sample size Σi Σj Yij has a
Poisson distribution with mean Σi Σj μij = μ. Conditional on Σi Σj Yij = n,
the cell counts have a multinomial distribution with probabilities {πij =
μij/μ = αi βj}. Similarly, you can check that conditional on n, the row totals
{Yi+} have a multinomial distribution with probabilities {πi+ = αi} and
the column totals {Y+j} have a multinomial distribution with probabilities
{π+j = βj}.

Conditional on n, the model is a multinomial one that satisfies πij = αi βj
= πi+ π+j. This is independence of the two classifications. In fact, in Poisson
form independence is the loglinear model (4.13). The inferences conducted
in Chapter 3 about independence in two-way contingency tables relate to
GLMs, either Poisson loglinear models or corresponding multinomial models
that fix n or the row or column totals. In Chapters 8 and 9 we present more
complex loglinear models for contingency tables.
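The factorization just described can be checked numerically for a small table; a sketch with made-up constants satisfying the stated constraints (a hypothetical 2 × 3 example, not from the text):

```python
# Independence model mu_ij = mu * a_i * b_j for a 2 x 3 table
mu = 100.0
a = [0.4, 0.6]            # the alpha_i, summing to 1
b = [0.2, 0.3, 0.5]       # the beta_j, summing to 1

m = [[mu * a[i] * b[j] for j in range(3)] for i in range(2)]

total = sum(sum(row) for row in m)
assert abs(total - mu) < 1e-9   # sum_ij mu_ij = mu

# Conditional on the total, cell probabilities factor into row x column:
pi = [[m[i][j] / total for j in range(3)] for i in range(2)]
row = [sum(pi[i]) for i in range(2)]                       # pi_{i+} = a_i
col = [sum(pi[i][j] for i in range(2)) for j in range(3)]  # pi_{+j} = b_j
for i in range(2):
    for j in range(3):
        assert abs(pi[i][j] - row[i] * col[j]) < 1e-12     # independence
```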
4.4 MOMENTS AND LIKELIHOOD FOR GENERALIZED LINEAR MODELS*
Having introduced GLMs for binary and count data, we now turn our
attention to details such as likelihood equations and methods for fitting
them. The remainder of this chapter is somewhat technical, providing general
results applying to most modeling methods presented in subsequent chapters.
See McCullagh and Nelder (1989) for further details.
It is helpful to extend the notation for a GLM so that it can handle many
distributions that have a second parameter. The random component of the
GLM specifies that the N observations (y1, . . . , yN) on Y are independent,
with probability mass or density function for yi of form

f(yi; θi, φ) = exp{ [yi θi − b(θi)]/a(φ) + c(yi, φ) }.    (4.14)

This is called the exponential dispersion family, and φ is called the dispersion
parameter (Jørgensen 1987). The parameter θi is the natural parameter.
When φ is known, (4.14) simplifies to the form (4.1) for the natural
exponential family, which is

f(yi; θi) = a(θi) b(yi) exp[yi Q(θi)].

We identify Q(θ) here with θ/a(φ) in (4.14), a(θ) with exp[−b(θ)/a(φ)] in
(4.14), and b(y) with exp[c(y, φ)] in (4.14). The more general formula (4.14)
is not needed for one-parameter families such as the binomial and Poisson.
Usually, a(φ) has form a(φ) = φ/ωi for a known weight ωi. For instance,
when yi is a mean of ni independent readings, such as a sample proportion
for ni Bernoulli trials, ωi = ni (Section 4.4.2).
4.4.1 Mean and Variance Functions for the Random Component
General expressions for E(Yi) and var(Yi) use terms in (4.14). Let Li =
log f(yi; θi, φ) denote the contribution of yi to the log likelihood; that is, the
log-likelihood function is L = Σi Li. Then, from (4.14),

Li = [yi θi − b(θi)]/a(φ) + c(yi, φ).    (4.15)

Therefore,

∂Li/∂θi = [yi − b′(θi)]/a(φ),    ∂²Li/∂θi² = −b″(θi)/a(φ),

where b′(θi) and b″(θi) denote the first two derivatives of b(·) evaluated
at θi. We now apply the general likelihood results

E(∂L/∂θ) = 0    and    −E(∂²L/∂θ²) = E[(∂L/∂θ)²],

which hold under regularity conditions satisfied by the exponential family
(Cox and Hinkley 1974, Sec. 4.8). From the first formula applied with a single
observation, E[Yi − b′(θi)]/a(φ) = 0, or

μi = E(Yi) = b′(θi).    (4.16)
From the second formula,

b″(θi)/a(φ) = E[Yi − b′(θi)]²/[a(φ)]² = var(Yi)/[a(φ)]²,

so that

var(Yi) = b″(θi) a(φ).    (4.17)

In summary, the function b(·) in (4.14) determines moments of Yi.
4.4.2 Mean and Variance Functions for Poisson and Binomial
We illustrate the mean and variance expressions for Poisson and binomial
distributions. When Yi is Poisson,

f(yi; μi) = e^{−μi} μi^{yi} / yi! = exp(yi log μi − μi − log yi!) = exp[yi θi − exp(θi) − log yi!],

where θi = log μi. This has exponential dispersion form (4.14) with b(θi) =
exp(θi), a(φ) = 1, and c(yi, φ) = −log yi!. The natural parameter is θi =
log μi. From (4.16) and (4.17),

E(Yi) = b′(θi) = exp(θi) = μi,    var(Yi) = b″(θi) = exp(θi) = μi.
Next, suppose that ni Yi has a bin(ni, πi) distribution; that is, here yi is
the sample proportion (rather than number) of successes, so E(Yi) is
independent of ni. Let θi = log[πi/(1 − πi)]. Then πi = exp(θi)/[1 + exp(θi)] and
log(1 − πi) = −log[1 + exp(θi)]. Extending (4.3), one can show that

f(yi; πi, ni) = C(ni, ni yi) πi^{ni yi} (1 − πi)^{ni − ni yi}
             = exp{ [yi θi − log(1 + exp(θi))]/(1/ni) + log C(ni, ni yi) },    (4.18)

where C(ni, ni yi) denotes the binomial coefficient. This has exponential
dispersion form (4.14) with b(θi) = log[1 + exp(θi)], a(φ) = 1/ni, and
c(yi, φ) = log C(ni, ni yi). The natural parameter is the logit,
θi = log[πi/(1 − πi)]. From (4.16) and (4.17),

E(Yi) = b′(θi) = exp(θi)/[1 + exp(θi)] = πi,
var(Yi) = b″(θi) a(φ) = {exp(θi)/[1 + exp(θi)]²}/ni = πi(1 − πi)/ni.
4.4.3 Systematic Component and Link Function
Let (xi1, . . . , xip) denote values of explanatory variables for observation i.
The systematic component of a GLM relates parameters {ηi} to these
variables using a linear predictor

ηi = Σj βj xij,    i = 1, . . . , N.

In matrix form,

η = Xβ,

where η = (η1, . . . , ηN)′, β = (β1, . . . , βp)′ is a column vector of model
parameters, and X is the N × p matrix of values of the explanatory variables
for the N subjects. In ordinary linear models, X is called the design matrix. It
need not refer to an experimental design, however, and the GLM literature
calls it the model matrix.

The GLM links ηi to μi = E(Yi) by a link function g(·). Thus, μi relates to
the explanatory variables by

ηi = g(μi) = Σj βj xij,    i = 1, . . . , N.

The link function g for which g(μi) = θi in (4.14) is the canonical link. For
it, the direct relationship

θi = Σj βj xij

occurs between the natural parameter and the linear predictor.
Since μi = b′(θi), the natural parameter is the function of the mean,
θi = (b′)⁻¹(μi), where (b′)⁻¹(·) denotes the inverse function to b′. Thus, the
canonical link is the inverse of b′. In the Poisson case, for instance, b(θi) =
exp(θi), so b′(θi) = exp(θi) = μi. Thus, (b′)⁻¹(·) is the inverse of the
exponential function, which is the log function (i.e., θi = log μi). The canonical
link is the log link.
4.4.4 Likelihood Equations for a GLM
For N independent observations, from (4.15) the log likelihood is

L(β) = Σi Li = Σi log f(yi; θi, φ) = Σi [yi θi − b(θi)]/a(φ) + Σi c(yi, φ).    (4.19)

The notation L(β) reflects the dependence of θ on the model parameters β.
The likelihood equations are

∂L(β)/∂βj = Σi ∂Li/∂βj = 0

for all j. To differentiate the log likelihood (4.19), we use the chain rule,

∂Li/∂βj = (∂Li/∂θi)(∂θi/∂μi)(∂μi/∂ηi)(∂ηi/∂βj).    (4.20)

Since ∂Li/∂θi = [yi − b′(θi)]/a(φ), and since μi = b′(θi) and var(Yi) =
b″(θi) a(φ) from (4.16) and (4.17),

∂Li/∂θi = (yi − μi)/a(φ),    ∂μi/∂θi = b″(θi) = var(Yi)/a(φ).

Also, since ηi = Σj βj xij,

∂ηi/∂βj = xij.

Finally, since ηi = g(μi), ∂μi/∂ηi depends on the link function for the model.
In summary, substituting into (4.20) gives us

∂Li/∂βj = [(yi − μi)/a(φ)] [a(φ)/var(Yi)] (∂μi/∂ηi) xij = [(yi − μi)xij/var(Yi)] (∂μi/∂ηi).    (4.21)

The likelihood equations are

Σ_{i=1}^{N} [(yi − μi)xij/var(Yi)] (∂μi/∂ηi) = 0,    j = 1, . . . , p.    (4.22)

Although β does not appear in these equations, it is there implicitly through
μi, since μi = g⁻¹(Σj βj xij). Different link functions yield different sets of
equations.

Interestingly, the likelihood equations (4.22) depend on the distribution of
Yi only through μi and var(Yi). The variance itself depends on the mean
through a particular functional form

var(Yi) = v(μi)

for some function v, such as v(μi) = μi for the Poisson, v(μi) = μi(1 − μi)
for the Bernoulli, and v(μi) = σ² (i.e., constant) for the normal. When Yi
has distribution in the natural exponential family, the relationship between
the mean and the variance characterizes the distribution (Jørgensen 1987).
For instance, if Yi has distribution in the natural exponential family and if
v(μi) = μi, then necessarily Yi has the Poisson distribution.
4.4.5 Likelihood Equations for Binomial GLMs
Using notation from Section 4.4.2, suppose that ni Yi has a bin(ni, πi)
distribution. Then yi is a sample proportion of successes for ni trials. The
binomial GLM (4.8) for a single predictor extends with several predictors to

πi = Φ(Σj βj xij),    (4.23)

where Φ is the standard cdf of some class of continuous distributions. Since
μi = πi = Φ(ηi) with ηi = Σj βj xij,

∂μi/∂ηi = φ(ηi) = φ(Σj βj xij),

where φ(u) = ∂Φ(u)/∂u (i.e., the probability density function corresponding
to the cdf Φ). Since var(Yi) = πi(1 − πi)/ni, the likelihood equations (4.22)
simplify to

Σi [ni(yi − πi)xij / πi(1 − πi)] φ(Σj βj xij) = 0,    (4.24)

where πi = Φ(Σj βj xij). These depend on the link function Φ⁻¹ through the
derivative of its inverse.

For the logit link, ηi = log[πi/(1 − πi)], so ∂ηi/∂πi = 1/[πi(1 − πi)] and
∂μi/∂ηi = ∂πi/∂ηi = πi(1 − πi). Then the likelihood equations (4.22) and
(4.24) simplify to

Σi ni(yi − πi)xij = 0,    (4.25)

where πi satisfies (4.23) with Φ the standard logistic cdf.
4.4.6 Asymptotic Covariance Matrix of Model Parameter Estimators
The likelihood function for the GLM also determines the asymptotic
covariance matrix of the ML estimator β̂. This matrix is the inverse of the
information matrix 𝓘, which has elements E[−∂²L(β)/∂βh ∂βj]. To find
this, for the contribution Li to the log likelihood we use the helpful result

E(∂²Li/∂βh ∂βj) = −E[(∂Li/∂βh)(∂Li/∂βj)],

which holds for exponential families (Cox and Hinkley 1974, Sec. 4.8). Thus,

E(∂²Li/∂βh ∂βj) = −E{ [(Yi − μi)xih/var(Yi)](∂μi/∂ηi) × [(Yi − μi)xij/var(Yi)](∂μi/∂ηi) }    [from (4.21)]
                = −[xih xij/var(Yi)] (∂μi/∂ηi)².

Since L(β) = Σi Li,

E[−∂²L(β)/∂βh ∂βj] = Σ_{i=1}^{N} [xih xij/var(Yi)] (∂μi/∂ηi)².

Generalizing from this typical element to the entire matrix, the information
matrix has the form

𝓘 = X′WX,    (4.26)

where W is the diagonal matrix with main-diagonal elements

wi = (∂μi/∂ηi)²/var(Yi).    (4.27)

The asymptotic covariance matrix of β̂ is estimated by

côv(β̂) = 𝓘̂⁻¹ = (X′ŴX)⁻¹,    (4.28)

where Ŵ is W evaluated at β̂. From (4.27), the form of W also depends on
the link function. We'll see an example for Poisson GLMs next and for
binomial GLMs in Section 5.5.
4.4.7 Likelihood Equations and Covariance Matrix for Poisson Loglinear Model
The general Poisson loglinear model (4.4) has the matrix form

log μ = Xβ.

For the log link, ηi = log μi, so μi = exp(ηi) and ∂μi/∂ηi = exp(ηi) = μi.
Since var(Yi) = μi, the likelihood equations (4.22) simplify to

Σi (yi − μi)xij = 0.    (4.29)

These equate the sufficient statistics Σi yi xij for β to their expected values.
Also, since

wi = (∂μi/∂ηi)²/var(Yi) = μi,

the estimated covariance matrix (4.28) of β̂ is (X′ŴX)⁻¹, where Ŵ is the
diagonal matrix with elements of μ̂ on the main diagonal.
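These pieces — the likelihood equations (4.29), the weights wi = μi, and the covariance estimate (X′ŴX)⁻¹ — suffice to fit a Poisson loglinear model by Fisher scoring. A self-contained sketch on made-up toy data (not the crab data), with the 2 × 2 linear algebra written out by hand:

```python
import math

def fit_poisson_loglinear(x, y, n_iter=30):
    """Fisher scoring for the Poisson loglinear model log(mu_i) = b0 + b1*x_i.
    Returns ML estimates and standard errors from (X'WX)^{-1}, as in (4.28)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0      # safe starting values

    def info(mu):
        # Information matrix X'WX with weights w_i = mu_i for the log link
        return (sum(mu),
                sum(m * xi for m, xi in zip(mu, x)),
                sum(m * xi * xi for m, xi in zip(mu, x)))

    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: the likelihood equations (4.29)
        s0 = sum(yi - m for yi, m in zip(y, mu))
        s1 = sum((yi - m) * xi for yi, m, xi in zip(y, mu, x))
        i00, i01, i11 = info(mu)
        det = i00 * i11 - i01 * i01
        b0 += (i11 * s0 - i01 * s1) / det        # Fisher-scoring step
        b1 += (-i01 * s0 + i00 * s1) / det

    mu = [math.exp(b0 + b1 * xi) for xi in x]
    i00, i01, i11 = info(mu)
    det = i00 * i11 - i01 * i01
    return b0, b1, math.sqrt(i11 / det), math.sqrt(i00 / det)

# Hypothetical toy data (not the crab data)
x = [1, 2, 3, 4, 5, 6]
y = [2, 3, 6, 7, 8, 9]
b0, b1, se0, se1 = fit_poisson_loglinear(x, y)

# At the ML fit the likelihood equations (4.29) hold exactly:
mu = [math.exp(b0 + b1 * xi) for xi in x]
assert abs(sum(y) - sum(mu)) < 1e-8
assert abs(sum((yi - m) * xi for yi, m, xi in zip(y, mu, x))) < 1e-8
print(round(b0, 3), round(b1, 3), round(se0, 3), round(se1, 3))
```

For the canonical log link, Fisher scoring coincides with Newton-Raphson, which is why no step-halving is needed here.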
4.5 INFERENCE FOR GENERALIZED LINEAR MODELS
For most GLMs the likelihood equations (4.22) are nonlinear functions of β.
For now, we put off details about solving them for the ML estimator β̂ and
focus instead on using the fit for statistical inference.
The Wald, score, and likelihood-ratio methods introduced in Section 1.3.3
for significance testing and interval estimation apply to any GLM. In this
section we concentrate on likelihood-ratio inference, through the deviance of
the GLM.
4.5.1 Deviance and Goodness of Fit
From Section 4.1.5, the saturated GLM has a separate parameter for each
observation. It gives a perfect fit. This sounds good, but it is not a helpful
model. It does not smooth the data or have the advantages that a simpler
model has, such as parsimony. Nonetheless, it serves as a baseline for other
models, such as for checking model fit. A saturated model explains all
variation by the systematic component of the model.

Let θ̃ denote the estimate of θ for the saturated model, corresponding to
estimated means μ̃i = yi for all i. For a particular unsaturated model, denote
the corresponding ML estimates by θ̂ and μ̂i. For maximized log likelihood
L(μ̂; y) for that model and maximized log likelihood L(y; y) in the saturated
case,

−2 log [ (maximum likelihood for model) / (maximum likelihood for saturated model) ] = −2[L(μ̂; y) − L(y; y)]

describes lack of fit. It is the likelihood-ratio statistic for testing the null
hypothesis that the model holds against the alternative that a more general
model holds. From (4.19),

−2[L(μ̂; y) − L(y; y)] = 2 Σi [yi θ̃i − b(θ̃i)]/a(φ) − 2 Σi [yi θ̂i − b(θ̂i)]/a(φ).
Usually, a(φ) in (4.14) has the form a(φ) = φ/ωi, and this statistic then equals

2 Σi ωi [yi(θ̃i − θ̂i) − b(θ̃i) + b(θ̂i)]/φ = D(y; μ̂)/φ.    (4.30)

This is called the scaled deviance, and D(y; μ̂) is called the deviance. The
greater the scaled deviance, the poorer the fit. For some GLMs the scaled
deviance has an approximate chi-squared distribution.
4.5.2 Deviance for Poisson Models
For Poisson GLMs, by Section 4.4.2, θ̂i = log μ̂i and b(θ̂i) = exp(θ̂i) = μ̂i.
Similarly, θ̃i = log yi and b(θ̃i) = yi for the saturated model. Also a(φ) = 1,
so the deviance and scaled deviance (4.30) equal

D(y; μ̂) = 2 Σi [yi log(yi/μ̂i) − yi + μ̂i].    (4.31)

When a model with log link contains an intercept term, the likelihood
equation (4.29) implied by that parameter is Σ yi = Σ μ̂i. Then the deviance
simplifies to

D(y; μ̂) = 2 Σi yi log(yi/μ̂i).    (4.32)

For two-way contingency tables, this reduces to the G² statistic (3.11) in
Section 3.2.1, substituting cell count nij for yi and the independence fitted
value μ̂ij for μ̂i. For a Poisson or multinomial model applied to a contingency
table with a fixed number of cells N, we will see in Section 14.3 that
the deviance has an approximate chi-squared distribution for large {μi}.
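For an intercept-only Poisson fit (μ̂i = ȳ for all i, so Σ yi = Σ μ̂i holds), formula (4.32) is easy to check by hand; a minimal sketch with toy counts:

```python
import math

y = [2, 3]
mu_hat = [2.5, 2.5]   # intercept-only fit: mu_hat_i = mean(y)

# Deviance (4.32), valid because the fit satisfies sum(y) = sum(mu_hat):
D = 2 * sum(yi * math.log(yi / mi) for yi, mi in zip(y, mu_hat))
print(round(D, 4))  # 0.2014
```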
4.5.3 Deviance for Binomial Models: Grouped and Ungrouped Data
Now consider binomial GLMs with sample proportions {yi} based on {ni}
trials. By Section 4.4.2, θ̂i = log[π̂i/(1 − π̂i)] and b(θ̂i) = log[1 + exp(θ̂i)] =
−log(1 − π̂i). Similarly, θ̃i = log[yi/(1 − yi)] and b(θ̃i) = −log(1 − yi) for
the saturated model. Also, a(φ) = 1/ni, so φ = 1 and ωi = ni. The deviance
(4.30) equals

2 Σi ni { yi [log(yi/(1 − yi)) − log(π̂i/(1 − π̂i))] + log(1 − yi) − log(1 − π̂i) }
  = 2 Σi ni yi log(yi/π̂i) + 2 Σi ni(1 − yi) log[(1 − yi)/(1 − π̂i)]
  = 2 Σi ni yi log(ni yi / ni π̂i) + 2 Σi (ni − ni yi) log[(ni − ni yi)/(ni − ni π̂i)].
At setting i, ni yi is the number of successes and (ni − ni yi) is the number of
failures, i = 1, . . . , N. Thus, the deviance is a sum over the 2N cells of
successes and failures and has the same form,

D(y; μ̂) = 2 Σ observed × log(observed/fitted),    (4.33)

as the deviance (4.32) for Poisson loglinear models with intercept term.
With binomial responses, it is possible to construct the data file as
expressed here, with the counts of successes and failures at each setting for
the predictors, or with the individual Bernoulli 0–1 observations at the
subject level. The deviance differs in the two cases. In the first case the
saturated model has a parameter at each setting for the predictors, whereas
in the second case it has a parameter for each subject. We refer to these as
the grouped-data and ungrouped-data cases. The approximate chi-squared
distribution for the deviance occurs for grouped data but not for ungrouped
data (see Problems 4.22 and 5.37). With grouped data, the sample size
increases for a fixed number of settings of the predictors and hence a fixed
number of parameters for the saturated model.
4.5.4 Likelihood-Ratio Model Comparison Using the Deviance
For a Poisson or binomial model M, φ = 1, so the deviance (4.30) equals

D(y; μ̂) = −2[L(μ̂; y) − L(y; y)].    (4.34)

Consider two models, M0 with fitted values μ̂0 and M1 with fitted values μ̂1,
with M0 a special case of M1. Model M0 is said to be nested within M1.
Since M0 is simpler than M1, a smaller set of parameter values satisfies
M0 than satisfies M1. Maximizing the log likelihood over a smaller space
cannot yield a larger maximum. Thus, L(μ̂0; y) ≤ L(μ̂1; y), and it follows
from (4.34) with the same L(y; y) for each model that

D(y; μ̂1) ≤ D(y; μ̂0).

Simpler models have larger deviances. Assuming that model M1 holds, the
likelihood-ratio test of the hypothesis that M0 holds uses the test statistic

−2[L(μ̂0; y) − L(μ̂1; y)] = −2[L(μ̂0; y) − L(y; y)] − {−2[L(μ̂1; y) − L(y; y)]}
                         = D(y; μ̂0) − D(y; μ̂1).

The likelihood-ratio statistic comparing the two models is simply the
difference between the deviances. This statistic is large when M0 fits poorly
compared to M1.
In fact, since the part in Ž4.30. involving the saturated model cancels, the
difference between deviances,
ž
/
D Ž y;
ˆ 0 . y D Ž y;
ˆ 1 . s 2 Ý i yi ˆ1 i y ˆ0 i y b Ž ˆ1 i . q b Ž ˆ0 i . ,
also has the form of the deviance. Under regularity conditions, this difference
has approximately a chi-squared null distribution with df equal to the
difference between the numbers of parameters in the two models.
For binomial GLMs and Poisson loglinear GLMs with intercept, from expression (4.33) for the deviance, the difference in deviances uses the observed counts and the two sets of fitted values in the form
$$D(\mathbf{y}; \hat{\boldsymbol{\mu}}_0) - D(\mathbf{y}; \hat{\boldsymbol{\mu}}_1) = 2\sum \text{observed} \times \log\left(\text{fitted}_1 / \text{fitted}_0\right).$$
With binomial responses, the test comparing models does not depend on
whether the data file has grouped or ungrouped form. The saturated model
differs in the two cases, but its log likelihood cancels when one forms the
difference between the deviances.
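As a numerical check on this identity, here is a minimal Python sketch using hypothetical counts and two nested Poisson models whose ML fits have closed form ($M_0$: a common mean; $M_1$: a separate mean per group):

```python
import math

# Hypothetical counts in two groups; both models have closed-form ML fits.
y = [3, 5, 8, 12]
group = [0, 0, 1, 1]

mu0 = [sum(y) / len(y)] * len(y)  # M0: common mean (intercept only)
means = {g: sum(yi for yi, gi in zip(y, group) if gi == g) / group.count(g)
         for g in set(group)}
mu1 = [means[g] for g in group]   # M1: separate mean per group

def poisson_deviance(y, mu):
    # D(y; mu-hat) = 2 * sum[ y*log(y/mu) - y + mu ]
    return 2 * sum(yi * math.log(yi / mi) - yi + mi for yi, mi in zip(y, mu))

D0, D1 = poisson_deviance(y, mu0), poisson_deviance(y, mu1)

# Difference of deviances = 2 * sum( observed * log(fitted1 / fitted0) )
direct = 2 * sum(yi * math.log(m1 / m0) for yi, m0, m1 in zip(y, mu0, mu1))

assert abs((D0 - D1) - direct) < 1e-9
assert D1 <= D0  # the simpler model has the larger deviance
```

The two formulas agree exactly here because each ML fit preserves the total count, so the $\sum(\mu_0 - \mu_1)$ term vanishes.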
4.5.5 Residuals for GLMs
When a GLM fits poorly according to an overall goodness-of-fit test, examination of residuals highlights where the fit is poor. One type of residual uses components of the deviance. In (4.30) let $D(\mathbf{y}; \hat{\boldsymbol{\mu}}) = \sum_i d_i$, where
$$d_i = 2\left[ y_i\left(\tilde{\theta}_i - \hat{\theta}_i\right) - b\left(\tilde{\theta}_i\right) + b\left(\hat{\theta}_i\right) \right].$$
The deviance residual for observation $i$ is
$$\sqrt{d_i} \times \operatorname{sign}\left(y_i - \hat{\mu}_i\right). \qquad (4.35)$$
An alternative is the Pearson residual,
$$e_i = \frac{y_i - \hat{\mu}_i}{\left[\widehat{\operatorname{var}}(Y_i)\right]^{1/2}}. \qquad (4.36)$$
For instance, for a Poisson GLM, $\operatorname{var}(Y_i) = \mu_i$ and the Pearson residual is $e_i = (y_i - \hat{\mu}_i)/\sqrt{\hat{\mu}_i}$.
For two-way contingency tables, identifying $y_i$ with cell count $n_{ij}$ and $\hat{\mu}_i$ with the independence fitted value $\hat{\mu}_{ij}$, this has the form (3.12); then $\sum e_{ij}^2 = X^2$, the Pearson $X^2$ statistic. Similarly, the sum of squared deviance residuals $\sum d_{ij} = G^2$, the likelihood-ratio statistic for testing independence.
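This connection can be verified directly. Here is a small Python sketch with a hypothetical $2 \times 2$ table (the cell counts are invented for illustration):

```python
import math

# Hypothetical 2x2 table; independence fitted values mu_ij = row_i * col_j / n.
n = [[20, 10], [15, 25]]
row = [sum(r) for r in n]
col = [sum(c) for c in zip(*n)]
total = sum(row)
mu = [[row[i] * col[j] / total for j in range(2)] for i in range(2)]

# Pearson residuals e_ij; their squares sum to X^2.
e = [[(n[i][j] - mu[i][j]) / math.sqrt(mu[i][j]) for j in range(2)]
     for i in range(2)]
X2 = sum((n[i][j] - mu[i][j]) ** 2 / mu[i][j]
         for i in range(2) for j in range(2))
assert abs(sum(eij ** 2 for r in e for eij in r) - X2) < 1e-9

# Squared deviance residuals d_ij; they sum to G^2.
d = [[2 * (n[i][j] * math.log(n[i][j] / mu[i][j]) - n[i][j] + mu[i][j])
      for j in range(2)] for i in range(2)]
G2 = 2 * sum(n[i][j] * math.log(n[i][j] / mu[i][j])
             for i in range(2) for j in range(2))
assert abs(sum(dij for r in d for dij in r) - G2) < 1e-9
```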
When the model holds, Pearson and deviance residuals are less variable than standard normal because they compare $y_i$ to the fitted means rather than the true mean (e.g., the denominator of (4.36) estimates $[\operatorname{var}(Y_i)]^{1/2} = [\operatorname{var}(Y_i - \mu_i)]^{1/2}$ rather than $[\operatorname{var}(Y_i - \hat{\mu}_i)]^{1/2}$). Standardized residuals divide the ordinary residuals by their asymptotic standard errors. For GLMs the asymptotic covariance matrix of the vector of raw residuals $\{y_i - \hat{\mu}_i\}$ is
$$\operatorname{cov}(\mathbf{Y} - \hat{\boldsymbol{\mu}}) = \operatorname{cov}(\mathbf{Y})\left[\mathbf{I} - \mathrm{Hat}\right].$$
Here, $\mathbf{I}$ is the identity matrix and $\mathrm{Hat}$ is the hat matrix,
$$\mathrm{Hat} = \mathbf{W}^{1/2}\mathbf{X}\left(\mathbf{X}'\mathbf{W}\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{W}^{1/2}, \qquad (4.37)$$
where $\mathbf{W}$ is the diagonal matrix with elements (4.27) (Pregibon 1981). Let $\hat{h}_i$ denote the estimated diagonal element of $\mathrm{Hat}$ for observation $i$, called its leverage. Then, standardizing $y_i - \hat{\mu}_i$ by dividing by its estimated SE yields the standardized Pearson residual
$$r_i = \frac{y_i - \hat{\mu}_i}{\left\{\widehat{\operatorname{var}}(Y_i)\left(1 - \hat{h}_i\right)\right\}^{1/2}} = \frac{e_i}{\sqrt{1 - \hat{h}_i}}. \qquad (4.38)$$
For Poisson GLMs, for instance, $r_i = (y_i - \hat{\mu}_i)/\sqrt{\hat{\mu}_i(1 - \hat{h}_i)}$. Pierce and Schafer (1986) presented standardized deviance residuals.
In linear models the hat matrix is so named because $\mathrm{Hat} \times \mathbf{y}$ projects the data to the fitted values $\hat{\boldsymbol{\mu}} =$ "mu-hat." For GLMs, applying the estimated hat matrix to a linearized approximation for $g(\mathbf{y})$ yields $\hat{\boldsymbol{\eta}} = g(\hat{\boldsymbol{\mu}})$, the model's estimated linear predictor values. The greater an observation's leverage, the greater its potential influence on the fit. As in ordinary regression, the leverages fall between 0 and 1 and sum to the number of model parameters. Unlike ordinary regression, the hat values depend on the fit as well as the model matrix, and points that have extreme predictor values need not have high leverage.
4.6 FITTING GENERALIZED LINEAR MODELS
Finally, we study how to find the ML estimators $\hat{\boldsymbol{\beta}}$ of GLM parameters. The likelihood equations (4.22) are usually nonlinear in $\hat{\boldsymbol{\beta}}$. We describe a general-purpose iterative method for solving nonlinear equations and apply it two ways to determine the maximum of a likelihood function.
4.6.1 Newton–Raphson Method
The Newton–Raphson method is an iterative method for solving nonlinear equations, such as equations whose solution determines the point at which a function takes its maximum. It begins with an initial guess for the solution. It
obtains a second guess by approximating the function to be maximized in a neighborhood of the initial guess by a second-degree polynomial and then finding the location of that polynomial's maximum value. It then approximates the function in a neighborhood of the second guess by another second-degree polynomial, and the third guess is the location of its maximum. In this manner, the method generates a sequence of guesses. These converge to the location of the maximum when the function is suitable and/or the initial guess is good.
In more detail, here's how Newton–Raphson determines the value $\hat{\boldsymbol{\beta}}$ at which a function $L(\boldsymbol{\beta})$ is maximized. Let $\mathbf{u} = \left(\partial L(\boldsymbol{\beta})/\partial\beta_1, \partial L(\boldsymbol{\beta})/\partial\beta_2, \ldots\right)'$. Let $\mathbf{H}$ denote the matrix having entries $h_{ab} = \partial^2 L(\boldsymbol{\beta})/\partial\beta_a\,\partial\beta_b$, called the Hessian matrix. Let $\mathbf{u}^{(t)}$ and $\mathbf{H}^{(t)}$ be $\mathbf{u}$ and $\mathbf{H}$ evaluated at $\boldsymbol{\beta}^{(t)}$, guess $t$ for $\hat{\boldsymbol{\beta}}$. Step $t$ in the iterative process ($t = 0, 1, 2, \ldots$) approximates $L(\boldsymbol{\beta})$ near $\boldsymbol{\beta}^{(t)}$ by the terms up to second order in its Taylor series expansion,
$$L(\boldsymbol{\beta}) \approx L\left(\boldsymbol{\beta}^{(t)}\right) + \mathbf{u}^{(t)\prime}\left(\boldsymbol{\beta} - \boldsymbol{\beta}^{(t)}\right) + \tfrac{1}{2}\left(\boldsymbol{\beta} - \boldsymbol{\beta}^{(t)}\right)'\mathbf{H}^{(t)}\left(\boldsymbol{\beta} - \boldsymbol{\beta}^{(t)}\right).$$
Solving $\partial L(\boldsymbol{\beta})/\partial\boldsymbol{\beta} \approx \mathbf{u}^{(t)} + \mathbf{H}^{(t)}\left(\boldsymbol{\beta} - \boldsymbol{\beta}^{(t)}\right) = \mathbf{0}$ for $\boldsymbol{\beta}$ yields the next guess. That guess can be expressed as
$$\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} - \left(\mathbf{H}^{(t)}\right)^{-1}\mathbf{u}^{(t)}, \qquad (4.39)$$
assuming that $\mathbf{H}^{(t)}$ is nonsingular. (However, computing routines use standard methods for solving the linear equations rather than explicitly calculating the inverse.)
Iterations proceed until changes in $L(\boldsymbol{\beta}^{(t)})$ between successive cycles are sufficiently small. The ML estimator is the limit of $\boldsymbol{\beta}^{(t)}$ as $t \to \infty$; however, this need not happen if $L(\boldsymbol{\beta})$ has other local maxima at which the derivative of $L(\boldsymbol{\beta})$ equals 0. In that case, a good initial estimate is crucial. To help understand the Newton–Raphson process, work through these steps when $\boldsymbol{\beta}$ has a single element (Problem 4.34). Then, Figure 4.6 illustrates a cycle of the method, showing the parabolic (second-order) approximation at a given step.
In the next chapter we use Newton–Raphson for logistic regression models. For now, we illustrate it with a simpler problem for which we know the answer: maximizing the log likelihood based on an observation $y$ from a bin($n, \pi$) distribution. From Section 1.3.2, the first two derivatives of $L(\pi) = y\log\pi + (n - y)\log(1 - \pi)$ are
$$u = \frac{y - n\pi}{\pi(1 - \pi)}, \qquad H = -\frac{y}{\pi^2} - \frac{n - y}{(1 - \pi)^2}.$$
Each Newton–Raphson step has the form
$$\pi^{(t+1)} = \pi^{(t)} + \left[\frac{y}{\left(\pi^{(t)}\right)^2} + \frac{n - y}{\left(1 - \pi^{(t)}\right)^2}\right]^{-1}\frac{y - n\pi^{(t)}}{\pi^{(t)}\left(1 - \pi^{(t)}\right)}.$$
FIGURE 4.6 Cycle of Newton–Raphson method.
This adjusts $\pi^{(t)}$ up if $y/n > \pi^{(t)}$ and down if $y/n < \pi^{(t)}$. For instance, with $\pi^{(0)} = \tfrac{1}{2}$, you can check that $\pi^{(1)} = y/n$. When $\pi^{(t)} = y/n$, no adjustment occurs and $\pi^{(t+1)} = y/n$, which is the correct answer for $\hat{\pi}$. For starting values other than $\tfrac{1}{2}$, adequate convergence usually takes four or five iterations.
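The iteration above is short enough to run directly. Here is a minimal Python sketch with hypothetical values $y = 6$, $n = 10$:

```python
# Newton-Raphson for the binomial log likelihood: y successes in n trials.
y, n = 6, 10
pi = 0.5  # initial guess pi^(0)

for t in range(6):
    u = (y - n * pi) / (pi * (1 - pi))        # score (first derivative)
    H = -(y / pi**2 + (n - y) / (1 - pi)**2)  # Hessian (second derivative)
    pi = pi - u / H                           # pi^(t+1) = pi^(t) - u/H
    if t == 0:
        # From pi^(0) = 1/2, a single step already reaches y/n.
        assert abs(pi - y / n) < 1e-12

assert abs(pi - y / n) < 1e-12  # the ML estimate is the sample proportion
```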
The convergence of $\boldsymbol{\beta}^{(t)}$ to $\hat{\boldsymbol{\beta}}$ for the Newton–Raphson method is usually fast. For large $t$, the convergence satisfies, for each $j$,
$$\left|\beta_j^{(t+1)} - \hat{\beta}_j\right| \le c\left|\beta_j^{(t)} - \hat{\beta}_j\right|^2 \quad \text{for some } c > 0,$$
and is referred to as second-order. This implies that the number of correct decimals in the approximation roughly doubles after sufficiently many iterations. In practice, it often takes relatively few iterations for satisfactory convergence.
4.6.2 Fisher Scoring Method
Fisher scoring is an alternative iterative method for solving likelihood equations. It resembles the Newton–Raphson method, the distinction being with the Hessian matrix. Fisher scoring uses the expected value of this matrix, called the expected information, whereas Newton–Raphson uses the matrix itself, called the observed information.
Let $\boldsymbol{\mathcal{I}}^{(t)}$ denote approximation $t$ for the ML estimate of the expected information matrix; that is, $\boldsymbol{\mathcal{I}}^{(t)}$ has elements $-E\left(\partial^2 L(\boldsymbol{\beta})/\partial\beta_a\,\partial\beta_b\right)$, evaluated at $\boldsymbol{\beta}^{(t)}$. The formula for Fisher scoring is
$$\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + \left(\boldsymbol{\mathcal{I}}^{(t)}\right)^{-1}\mathbf{u}^{(t)} \qquad \text{or} \qquad \boldsymbol{\mathcal{I}}^{(t)}\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\mathcal{I}}^{(t)}\boldsymbol{\beta}^{(t)} + \mathbf{u}^{(t)}. \qquad (4.40)$$
For estimating a binomial parameter, from Section 1.3.2 the information is $n/[\pi(1 - \pi)]$. A step of Fisher scoring gives
$$\pi^{(t+1)} = \pi^{(t)} + \left[\frac{n}{\pi^{(t)}\left(1 - \pi^{(t)}\right)}\right]^{-1}\frac{y - n\pi^{(t)}}{\pi^{(t)}\left(1 - \pi^{(t)}\right)} = \pi^{(t)} + \frac{y - n\pi^{(t)}}{n} = \frac{y}{n}.$$
This gives the answer $\hat{\pi}$ after a single iteration and stays at that value in successive iterations.
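A corresponding sketch for Fisher scoring (hypothetical $y = 7$, $n = 10$) confirms the one-step convergence from any starting value:

```python
# Fisher scoring for a binomial parameter: expected information n/[pi(1-pi)].
y, n = 7, 10
pi = 0.2  # any starting value in (0, 1)

u = (y - n * pi) / (pi * (1 - pi))  # score
info = n / (pi * (1 - pi))          # expected information
pi_next = pi + u / info             # = pi + (y - n*pi)/n = y/n

assert abs(pi_next - y / n) < 1e-12  # exact answer after a single iteration
```

Unlike Newton–Raphson, whose one-step exactness here required the special start $\pi^{(0)} = \tfrac{1}{2}$, the scoring update collapses algebraically to $y/n$ regardless of the start.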
Formula (4.26) showed that $\boldsymbol{\mathcal{I}} = \mathbf{X}'\mathbf{W}\mathbf{X}$. Similarly, $\boldsymbol{\mathcal{I}}^{(t)} = \mathbf{X}'\mathbf{W}^{(t)}\mathbf{X}$, where $\mathbf{W}^{(t)}$ is $\mathbf{W}$ [see (4.27)] evaluated at $\boldsymbol{\beta}^{(t)}$. The estimated asymptotic covariance matrix $\hat{\boldsymbol{\mathcal{I}}}^{-1}$ of $\hat{\boldsymbol{\beta}}$ [see (4.28)] occurs as a by-product of this algorithm as $\left(\boldsymbol{\mathcal{I}}^{(t)}\right)^{-1}$ for $t$ at which convergence is adequate. From (4.22), for both Fisher scoring and Newton–Raphson, $\mathbf{u}$ has elements
$$u_j = \frac{\partial L(\boldsymbol{\beta})}{\partial\beta_j} = \sum_{i=1}^{N} \frac{(y_i - \mu_i)x_{ij}}{\operatorname{var}(Y_i)}\,\frac{\partial\mu_i}{\partial\eta_i}. \qquad (4.41)$$
For GLMs with a canonical link, we'll see (Section 4.6.4) that the observed and expected information are the same. For noncanonical-link models, Fisher scoring has the advantages that it produces the asymptotic covariance matrix as a by-product, the expected information is necessarily nonnegative definite, and, as seen next, it is closely related to weighted least squares methods for ordinary linear models. However, it need not have second-order convergence, and for complex models the observed information is often easier to calculate. Efron and Hinkley (1978), developing arguments of R. A. Fisher, gave reasons for preferring observed information. They argued that its variance estimates better approximate a relevant conditional variance (conditional on statistics not relevant to the parameter being estimated), it is "closer to the data," and it tends to agree more closely with Bayesian analyses.
4.6.3 ML as Iterative Reweighted Least Squares*
A relation exists between weighted least squares estimation and using Fisher scoring to find ML estimates. We refer here to the general linear model of form
$$\mathbf{z} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}.$$
When the covariance matrix of $\boldsymbol{\epsilon}$ is $\mathbf{V}$, the weighted least squares (WLS) estimator of $\boldsymbol{\beta}$ is
$$\left(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{z}.$$
From $\boldsymbol{\mathcal{I}} = \mathbf{X}'\mathbf{W}\mathbf{X}$, expression (4.41) for elements of $\mathbf{u}$, and since diagonal elements of $\mathbf{W}$ are $w_i = (\partial\mu_i/\partial\eta_i)^2/\operatorname{var}(Y_i)$, it follows that in (4.40),
$$\boldsymbol{\mathcal{I}}^{(t)}\boldsymbol{\beta}^{(t)} + \mathbf{u}^{(t)} = \mathbf{X}'\mathbf{W}^{(t)}\mathbf{z}^{(t)},$$
where $\mathbf{z}^{(t)}$ has elements
$$z_i^{(t)} = \sum_j x_{ij}\beta_j^{(t)} + \left(y_i - \mu_i^{(t)}\right)\frac{\partial\eta_i^{(t)}}{\partial\mu_i^{(t)}} = \eta_i^{(t)} + \left(y_i - \mu_i^{(t)}\right)\frac{\partial\eta_i^{(t)}}{\partial\mu_i^{(t)}}.$$
Equations (4.40) for Fisher scoring then have the form
$$\left(\mathbf{X}'\mathbf{W}^{(t)}\mathbf{X}\right)\boldsymbol{\beta}^{(t+1)} = \mathbf{X}'\mathbf{W}^{(t)}\mathbf{z}^{(t)}.$$
These are the normal equations for using weighted least squares to fit a linear model for a response variable $\mathbf{z}^{(t)}$, when the model matrix is $\mathbf{X}$ and the inverse of the covariance matrix is $\mathbf{W}^{(t)}$. The equations have solution
$$\boldsymbol{\beta}^{(t+1)} = \left(\mathbf{X}'\mathbf{W}^{(t)}\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{W}^{(t)}\mathbf{z}^{(t)}.$$
The vector $\mathbf{z}$ in this formulation is a linearized form of the link function $g$, evaluated at $\mathbf{y}$,
$$g(y_i) \approx g(\mu_i) + (y_i - \mu_i)g'(\mu_i) = \eta_i + (y_i - \mu_i)\left(\partial\eta_i/\partial\mu_i\right) = z_i. \qquad (4.42)$$
This adjusted (or "working") response variable $\mathbf{z}$ has element $i$ approximated by $z_i^{(t)}$ for cycle $t$ of the iterative scheme. That cycle regresses $\mathbf{z}^{(t)}$ on $\mathbf{X}$ with weight (i.e., inverse covariance) $\mathbf{W}^{(t)}$ to obtain a new estimate $\boldsymbol{\beta}^{(t+1)}$. This estimate yields a new linear predictor value $\boldsymbol{\eta}^{(t+1)} = \mathbf{X}\boldsymbol{\beta}^{(t+1)}$ and a new adjusted response value $\mathbf{z}^{(t+1)}$ for the next cycle. The ML estimator results from iterative use of weighted least squares, in which the weight matrix changes at each cycle. The process is called iterative reweighted least squares.
A simple way to begin the iterative process uses the data $\mathbf{y}$ as the initial estimate of $\boldsymbol{\mu}$. This determines the first estimate of the weight matrix $\mathbf{W}$ and
hence the initial estimate of $\boldsymbol{\beta}$. It may be necessary to alter some observations slightly for this first cycle only so that $g(\mathbf{y})$, the initial value of $\mathbf{z}$, is finite. For instance, when $g$ is the log link applied to counts, a count of $y_i = 0$ is problematic, so one could set $y_i = \tfrac{1}{2}$. This is not a problem with the model itself, since the log applies to the mean, and fitted means are usually strictly positive in successive iterations.
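The cycle just described can be sketched in a few lines. The following Python code is an illustrative implementation only, fitting a Poisson loglinear model $\log\mu_i = \beta_0 + \beta_1 x_i$ to small hypothetical data and checking that the likelihood equations hold at convergence:

```python
import math

# Iterative reweighted least squares for log(mu_i) = b0 + b1*x_i (Poisson).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 6.0, 8.0]     # hypothetical counts
b0, b1 = math.log(sum(y) / len(y)), 0.0  # crude starting values

for _ in range(25):
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    w = mu  # w_i = (dmu/deta)^2 / var(Y_i) = mu_i^2 / mu_i = mu_i
    # adjusted response z_i = eta_i + (y_i - mu_i)*(deta/dmu) = eta + (y-mu)/mu
    z = [(b0 + b1 * xi) + (yi - mi) / mi for xi, yi, mi in zip(x, y, mu)]
    # weighted least squares: solve (X'WX) b = X'Wz for X = [1, x]
    s0 = sum(w)
    s1 = sum(wi * xi for wi, xi in zip(w, x))
    s2 = sum(wi * xi * xi for wi, xi in zip(w, x))
    t0 = sum(wi * zi for wi, zi in zip(w, z))
    t1 = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
    det = s0 * s2 - s1 * s1
    b0, b1 = (s2 * t0 - s1 * t1) / det, (s0 * t1 - s1 * t0) / det

# At convergence the likelihood equations (4.44) hold:
mu = [math.exp(b0 + b1 * xi) for xi in x]
assert abs(sum(y) - sum(mu)) < 1e-8
assert abs(sum(xi * yi for xi, yi in zip(x, y)) -
           sum(xi * mi for xi, mi in zip(x, mu))) < 1e-8
```

The weight vector changes each cycle as the fitted means change, which is exactly what "reweighted" refers to.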
4.6.4 Simplifications for Canonical Links*
Certain simplifications result for GLMs using the canonical link. For that link,
$$\theta_i = \eta_i = \sum_j \beta_j x_{ij}.$$
Often, $a(\phi)$ in the density or mass function (4.14) is identical for all observations, such as for Poisson GLMs [$a(\phi) = 1$] and binomial GLMs with each $n_i = 1$ [for which $a(\phi) = 1/n_i = 1$]. Then the part of the log likelihood (4.19) involving both parameters and data is $\sum_i y_i\theta_i$, which simplifies to
$$\sum_i y_i\left(\sum_j \beta_j x_{ij}\right) = \sum_j \beta_j\left(\sum_i y_i x_{ij}\right).$$
Sufficient statistics for estimating $\boldsymbol{\beta}$ in the GLM are then
$$\sum_i y_i x_{ij}, \qquad j = 1, \ldots, p.$$
For the canonical link,
$$\partial\mu_i/\partial\eta_i = \partial\mu_i/\partial\theta_i = \partial b'(\theta_i)/\partial\theta_i = b''(\theta_i).$$
Thus, the contribution (4.21) to the likelihood equation for $\beta_j$ simplifies to
$$\frac{\partial L_i}{\partial\beta_j} = \frac{y_i - \mu_i}{\operatorname{var}(Y_i)}\,b''(\theta_i)\,x_{ij} = \frac{(y_i - \mu_i)x_{ij}}{a(\phi)}. \qquad (4.43)$$
When $a(\phi)$ is identical for all observations, the likelihood equations are
$$\sum_i x_{ij}y_i = \sum_i x_{ij}\mu_i, \qquad j = 1, \ldots, p. \qquad (4.44)$$
These equations equate the sufficient statistics for the model parameters to their expected values (Nelder and Wedderburn 1972). For a normal distribution with identity link, these are the normal equations. We obtained these for Poisson loglinear models in (4.29) and for binomial logistic regression models (when each $n_i = 1$) in (4.25).
From expression (4.43) for $\partial L_i/\partial\beta_j$, with the canonical link the second derivatives of the log likelihood have components
$$\frac{\partial^2 L_i}{\partial\beta_j\,\partial\beta_h} = -\frac{x_{ij}}{a(\phi)}\left(\frac{\partial\mu_i}{\partial\beta_h}\right).$$
This does not depend on the observation $y_i$, so
$$\partial^2 L(\boldsymbol{\beta})/\partial\beta_h\,\partial\beta_j = E\left[\partial^2 L(\boldsymbol{\beta})/\partial\beta_h\,\partial\beta_j\right].$$
That is, $\mathbf{H} = -\boldsymbol{\mathcal{I}}$, and the Newton–Raphson and Fisher scoring algorithms are identical for canonical-link models (Nelder and Wedderburn 1972).
4.7 QUASI-LIKELIHOOD AND GENERALIZED LINEAR MODELS*
A GLM $g(\mu_i) = \sum_j \beta_j x_{ij}$ specifies $\mu_i$ using a link function $g$ and linear predictor. From (4.22) and (4.41), the ML estimates $\hat{\boldsymbol{\beta}}$ are the solutions of the likelihood equations
$$u_j(\boldsymbol{\beta}) = \sum_{i=1}^{N} \frac{(y_i - \mu_i)x_{ij}}{v(\mu_i)}\left(\frac{\partial\mu_i}{\partial\eta_i}\right) = 0, \qquad j = 1, \ldots, p, \qquad (4.45)$$
where $\mu_i = g^{-1}\left(\sum_j \beta_j x_{ij}\right)$ and $v(\mu_i) = \operatorname{var}(Y_i)$. These equations set the score functions $\{u_j(\boldsymbol{\beta})\}$, which are derivatives of the log likelihood with respect to $\{\beta_j\}$, equal to 0. As we noted in Section 4.4.4, the likelihood equations depend on the assumed distribution for $Y_i$ only through $\mu_i$ and $v(\mu_i)$. The choice of distribution determines the mean–variance relationship $v(\mu_i)$.
4.7.1 Mean–Variance Relationship Determines Quasi-likelihood Estimates
Wedderburn (1974) proposed an alternative approach, quasi-likelihood estimation, which assumes only a mean–variance relationship rather than a specific distribution for $Y_i$. It has a link function and linear predictor of the usual GLM form, but instead of assuming a distributional type for $Y_i$ it assumes only
$$\operatorname{var}(Y_i) = v(\mu_i)$$
for some chosen variance function $v$. The equations that determine quasi-likelihood estimates are the same as the likelihood equations (4.45) for GLMs. They are not likelihood equations, however, without the additional assumption that $\{Y_i\}$ have distribution in the natural exponential family.
To illustrate, suppose we assume that the $\{Y_i\}$ are independent with
$$v(\mu_i) = \mu_i.$$
The quasi-likelihood (QL) estimates are the solution of (4.45) with $v(\mu_i)$ replaced by $\mu_i$. Under the additional assumption that $\{Y_i\}$ have distribution in the exponential dispersion family (4.14), these estimates are also ML estimates. That case is simply the Poisson distribution. Thus, for $v(\mu) = \mu$, quasi-likelihood estimates are also ML estimates when the random component has a Poisson distribution.
Wedderburn suggested using the estimating equations (4.45) for any variance function, even if it does not occur for a member of the natural exponential family. In fact, the purpose of the quasi-likelihood method was to encompass a greater variety of cases, such as those discussed in Section 4.7.2. The QL estimates have asymptotic covariance matrix of the same form (4.28) as in GLMs, namely $(\mathbf{X}'\hat{\mathbf{W}}\mathbf{X})^{-1}$ with $w_i = (\partial\mu_i/\partial\eta_i)^2/\operatorname{var}(Y_i)$.
4.7.2 Overdispersion for Poisson GLMs and Quasi-likelihood
For count data, we've seen (Section 4.3.3) that the Poisson assumption is often unrealistic because of overdispersion: the variance exceeds the mean. One cause of this is heterogeneity among subjects. This suggests an alternative to a Poisson GLM in which the mean–variance relationship has the form
$$v(\mu_i) = \phi\mu_i$$
for some constant $\phi$. The case $\phi > 1$ represents overdispersion for the Poisson model.
In the estimating equations (4.45) with $v(\mu_i) = \phi\mu_i$, $\phi$ drops out. Thus, the equations are identical to the likelihood equations for Poisson models, and model parameter estimates are also identical. Also,
$$w_i = \left(\partial\mu_i/\partial\eta_i\right)^2/\operatorname{var}(Y_i) = \left(\partial\mu_i/\partial\eta_i\right)^2/(\phi\mu_i),$$
so the estimated $\operatorname{cov}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}'\hat{\mathbf{W}}\mathbf{X})^{-1}$ is $\phi$ times that for the Poisson model.
When a variance function has the form $v(\mu_i) = \phi v^*(\mu_i)$, $\phi$ is usually also unknown. However, $\phi$ is not in the estimating equations. Let $X^2 = \sum_i (y_i - \hat{\mu}_i)^2/v^*(\hat{\mu}_i)$, a Pearson-type statistic for the simpler model with $\phi = 1$. Then $X^2/\phi$ is a sum of squares of $N$ standardized terms. When $X^2/\phi$ is approximately chi-squared or when $\mu_i$ is approximately linear in $\boldsymbol{\beta}$ with $v^*(\hat{\mu}_i)$ close to $v^*(\mu_i)$, then $E(X^2/\phi) \approx N - p$, the number of observations minus the number of model parameters $p$. Hence, $E[X^2/(N - p)] \approx \phi$. Using the motivation of moment estimation, Wedderburn (1974) suggested taking $\hat{\phi} = X^2/(N - p)$ as the estimated multiple of the covariance matrix.
In summary, this quasi-likelihood approach for count data is simple: Fit the ordinary Poisson model and use its $p$ parameter estimates. Multiply the ordinary standard error estimates by $\sqrt{X^2/(N - p)}$.
We illustrate for the horseshoe crab data analyzed with Poisson GLMs in Section 4.3.2. With the log link, the fit using width to predict number of
satellites was $\log\hat{\mu} = -3.305 + 0.164x$, with SE $= 0.020$ for $\hat{\beta} = 0.164$. To improve the adequacy of using a chi-squared statistic to summarize fit, we use the satellite totals and fit for all female crabs at a given width, to increase the counts and fitted values relative to those for individual female crabs. The $N = 66$ distinct width levels each have a total count $y_i$ for the number of satellites and a fitted total $\hat{\mu}_i$. The Pearson statistic comparing these is $X^2 = 174.3$. The quasi-likelihood adjustment for standard errors equals $\sqrt{174.3/(66 - 2)} = 1.65$. Thus, SE $= 1.65(0.020) = 0.033$ is a more plausible standard error for $\hat{\beta} = 0.164$ in this prediction equation.
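In code, the adjustment amounts to one line. A sketch using the figures quoted above:

```python
import math

# Quasi-likelihood inflation of Poisson standard errors,
# using the horseshoe crab figures quoted in the text.
X2, N, p = 174.3, 66, 2           # Pearson statistic, settings, parameters
se_poisson = 0.020                # ML standard error for beta-hat = 0.164

factor = math.sqrt(X2 / (N - p))  # sqrt(174.3 / 64)
se_adjusted = factor * se_poisson

assert round(factor, 2) == 1.65
assert round(se_adjusted, 3) == 0.033
```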
Alternative ways of handling overdispersion include mixture models that allow heterogeneity in the mean at fixed settings of predictors. For count data these include Poisson GLMs having random effects (Section 13.5) and negative binomial GLMs that result when a Poisson parameter itself has a gamma distribution (Sections 4.3.4 and 13.4).
4.7.3 Overdispersion for Binomial GLMs and Quasi-likelihood
The quasi-likelihood approach can also handle overdispersion for counts based on binary data. When $y_i$ is the sample mean of $n_i$ independent binary observations with parameter $\pi_i$, $i = 1, \ldots, N$, binomial sampling has $E(Y_i) = \pi_i$ and $\operatorname{var}(Y_i) = \pi_i(1 - \pi_i)/n_i$. A simple quasi-likelihood approach uses the alternative variance function
$$v(\pi_i) = \phi\pi_i(1 - \pi_i)/n_i. \qquad (4.46)$$
Overdispersion occurs when $\phi > 1$. The quasi-likelihood estimates are the same as the ML estimates for the binomial model, since $\phi$ drops out of the estimating equations (4.45). As in the overdispersed Poisson case, $\phi$ enters the denominator of $w_i$. Thus, the asymptotic covariance matrix multiplies by $\phi$, and standard errors multiply by $\sqrt{\phi}$. An estimate of $\phi$ using the $X^2$ fit statistic for the ordinary binomial model is $X^2/(N - p)$ (Finney 1947).
Methods like these that use estimates from ordinary models but inflate their standard errors are appropriate only if the chosen model describes well the structural relationship between the mean of $Y$ and the predictors. If a large goodness-of-fit statistic is due to some other type of lack of fit, such as failure to include a relevant interaction term, adjusting for overdispersion will not address the inadequacy.
For counts with binary data, alternative mechanisms for handling overdispersion include mixture models such as binomial GLMs with random effects (Section 12.3) and models in which a binomial parameter itself has a beta distribution (Section 13.3).
4.7.4 Teratology Overdispersion Example
Table 4.5 shows results of a teratology experiment in which female rats on
iron-deficient diets were assigned to four groups. Rats in group 1 were given
placebo injections, and rats in other groups were given injections of an iron
TABLE 4.5 Response Counts of (Litter Size, Number Dead) for 58 Litters of Rats in Low-Iron Teratology Study

Group 1: Untreated (low iron)
(10, 1) (11, 4) (12, 9) (4, 4) (10, 10) (11, 9) (9, 9) (11, 11) (10, 10) (10, 7) (12, 12)
(10, 9) (8, 8) (11, 9) (6, 4) (9, 7) (14, 14) (12, 7) (11, 9) (13, 8) (14, 5) (10, 10)
(12, 10) (13, 8) (10, 10) (14, 3) (13, 13) (4, 3) (8, 8) (13, 5) (12, 12)

Group 2: Injections days 7 and 10
(10, 1) (3, 1) (13, 1) (12, 0) (14, 4) (9, 2) (13, 2) (16, 1) (11, 0) (4, 0) (1, 0) (12, 0)

Group 3: Injections days 0 and 7
(8, 0) (11, 1) (14, 0) (14, 1) (11, 0)

Group 4: Injections weekly
(3, 0) (13, 0) (9, 2) (17, 2) (15, 0) (2, 0) (14, 1) (8, 0) (6, 0) (17, 0)

Source: Moore and Tsiatis (1991).
supplement; this was done weekly in group 4, only on days 7 and 10 in group 2, and only on days 0 and 7 in group 3. The 58 rats were made pregnant, sacrificed after three weeks, and then the total number of dead fetuses was counted in each litter. In teratology experiments, due to unmeasured covariates and genetic variability, the probability of death may vary from litter to litter within a particular treatment group.
Let $y_i^{(g)}$ denote the proportion of dead fetuses among the $n_i^{(g)}$ in litter $i$ of treatment group $g$. Let $\pi_i^{(g)}$ denote the probability of death for a fetus in that litter. Consider the model with $n_i^{(g)}y_i^{(g)}$ a bin($n_i^{(g)}, \pi_i^{(g)}$) variate, where
$$\pi_i^{(g)} = \pi_g, \qquad g = 1, 2, 3, 4.$$
That is, the model treats all litters in a particular group $g$ as having the same probability of death $\pi_g$. The ML fit has estimate $\hat{\pi}_g$ equal to the sample proportion of deaths for all fetuses from litters in that group. These equal $\hat{\pi}_1 = 0.758$ (SE $= 0.024$), $\hat{\pi}_2 = 0.102$ (SE $= 0.028$), $\hat{\pi}_3 = 0.034$ (SE $= 0.024$), and $\hat{\pi}_4 = 0.048$ (SE $= 0.021$), where for group $g$, SE $= \sqrt{\hat{\pi}_g(1 - \hat{\pi}_g)/\left(\sum_i n_i^{(g)}\right)}$. The estimated probability of death is considerably higher for the placebo group.
For litter $i$ in group $g$, $n_i^{(g)}\hat{\pi}_g$ is a fitted number of deaths and $n_i^{(g)}(1 - \hat{\pi}_g)$ is a fitted number of nondeaths. Comparing these fitted values to the observed counts of deaths and nondeaths in the $N = 58$ litters using the Pearson statistic gives $X^2 = 154.7$ with df $= 58 - 4 = 54$. There is considerable evidence of overdispersion. With the quasi-likelihood approach, $\{\hat{\pi}_g\}$ are the same as the binomial ML estimates; however, $\hat{\phi} = X^2/(N - p) = 154.7/(58 - 4) = 2.86$, so standard errors multiply by $\hat{\phi}^{1/2} = 1.69$.
Even with this adjustment for overdispersion, strong evidence remains that the probability of death is substantially higher for the placebo group. For instance, a 95% confidence interval for $\pi_1 - \pi_2$ is
$$(0.758 - 0.102) \pm 1.96\left[(1.69 \times 0.024)^2 + (1.69 \times 0.028)^2\right]^{1/2}, \qquad \text{or} \qquad (0.54, 0.78).$$
This is wider, however, than the Wald interval of $(0.59, 0.73)$ for comparing independent proportions, which ignores the overdispersion.
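These calculations can be reproduced from Table 4.5. Here is a Python sketch (data transcribed from the table; the tolerances allow for rounding in the reported figures):

```python
import math

# Litter data (litter size, number dead) transcribed from Table 4.5.
litters = {
    1: [(10,1),(11,4),(12,9),(4,4),(10,10),(11,9),(9,9),(11,11),(10,10),
        (10,7),(12,12),(10,9),(8,8),(11,9),(6,4),(9,7),(14,14),(12,7),
        (11,9),(13,8),(14,5),(10,10),(12,10),(13,8),(10,10),(14,3),
        (13,13),(4,3),(8,8),(13,5),(12,12)],
    2: [(10,1),(3,1),(13,1),(12,0),(14,4),(9,2),(13,2),(16,1),(11,0),
        (4,0),(1,0),(12,0)],
    3: [(8,0),(11,1),(14,0),(14,1),(11,0)],
    4: [(3,0),(13,0),(9,2),(17,2),(15,0),(2,0),(14,1),(8,0),(6,0),(17,0)],
}

# Binomial ML estimate per group: overall proportion of deaths.
pi = {g: sum(d for _, d in lit) / sum(n for n, _ in lit)
      for g, lit in litters.items()}
assert round(pi[1], 3) == 0.758 and round(pi[2], 3) == 0.102

# Pearson X^2 comparing litter death/nondeath counts to fitted counts.
X2 = sum((d - n * pi[g]) ** 2 / (n * pi[g] * (1 - pi[g]))
         for g, lit in litters.items() for n, d in lit)
N, p = 58, 4
phi = X2 / (N - p)
assert abs(X2 - 154.7) < 0.1 and abs(phi - 2.86) < 0.01

# Overdispersion-adjusted 95% CI for pi_1 - pi_2.
se = {g: math.sqrt(pi[g] * (1 - pi[g]) / sum(n for n, _ in litters[g]))
      for g in (1, 2)}
half = 1.96 * math.sqrt(phi * (se[1] ** 2 + se[2] ** 2))
lo, hi = pi[1] - pi[2] - half, pi[1] - pi[2] + half
assert round(lo, 2) == 0.54 and round(hi, 2) == 0.78
```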
4.8 GENERALIZED ADDITIVE MODELS*
The GLM generalizes the ordinary linear model to permit nonnormal distributions and modeling functions of the mean. Quasi-likelihood provides a
further generalization, specifying how the variance depends on the mean
without assuming a given distribution. Another generalization replaces the
linear predictor by smooth functions of the predictors.
4.8.1 Smoothing Data
The GLM structure $g(\mu_i) = \sum_j \beta_j x_{ij}$ generalizes to
$$g(\mu_i) = \sum_j s_j(x_{ij}),$$
where $s_j(\cdot)$ is an unspecified smooth function of predictor $j$. A useful smooth function is the cubic spline. It has separate cubic polynomials over sets of disjoint intervals, joined together smoothly at boundaries of those intervals. Like GLMs, this model specifies a distribution for the random component and a link function $g$. The resulting model is called a generalized additive model, symbolized by GAM (Hastie and Tibshirani 1990). The GLM is the special case in which each $s_j$ is a linear function. Also possible is taking some $s_j$ as smooth functions and others as linear functions or as dummy variables for qualitative predictors.
The details for fitting GAMs are beyond our scope. The fitting algorithm employs a generalization of the Newton–Raphson method that utilizes local smoothing. This corresponds to subtracting from the log-likelihood function a penalty function that increases as the smooth function gets more wiggly. The model fit assigns a deviance and an approximate df value to each $s_j$ in the additive predictor, enabling inference about those terms. For instance, a smooth function having df $= 5$ is similar in overall complexity to a fourth-degree polynomial, which has five parameters. One's choice of a df value (or smoothing parameter) determines how smooth the resulting GAM fit looks. It is usually worth trying a variety of degrees of smoothing to find one that smooths the data sufficiently so that the trend is not too irregular but does
not smooth so much that it suppresses interesting patterns. This approach
may suggest that a linear model is adequate with a particular link or suggest
ways to improve on linearity. Some software packages that do not have
GAMs can smooth the data by employing a type of regression that gives
greater weight to nearby observations in predicting the value at a given point;
such locally weighted least squares regression is often referred to as lowess. We
prefer GAMs because they recognize explicitly the form of the response. For
instance, with a binary response, lowess can give predicted values below 0 or
above 1, which cannot happen with a GAM.
Even when one plans to use GLMs, a GAM can be helpful for exploratory
analysis. For instance, for continuous X with continuous responses, scatter
diagrams provide visual information about the dependence of Y on X. For
binary responses, the following example shows that such diagrams are not
very informative. Plotting the fitted smooth function for a predictor may
reveal a general trend without assuming a particular functional relationship.
FIGURE 4.7 Whether satellites are present (1, yes; 0, no), by width of female crab, with smoothing fit of generalized additive model.
4.8.2 GAMs for Horseshoe Crab Example
In Section 4.3.2, Figure 4.4 showed the trend relating number of satellites for
horseshoe crabs to their width. This smooth curve is the fit of a generalized
additive model, assuming a Poisson distribution and using the log link.
In the next chapter we'll use logistic regression to model the probability that a crab has at least one satellite. For crab $i$, let $y_i = 1$ if she has at least one satellite and $y_i = 0$ otherwise. Figure 4.7 plots these data against $x =$ crab width. It consists of a set of points with $y_i = 1$ and a second set of points with $y_i = 0$. The numbered symbols indicate the number of observations at each point. It appears that $y_i = 1$ tends to occur relatively more often at higher $x$ values. Figure 4.7 also shows a curve based on smoothing the data using a GAM, assuming a binomial response and logit link. This curve shows a roughly increasing trend and is more informative than viewing the binary data alone. It suggests that an S-shaped regression function may describe this relationship relatively well.
NOTES
Section 4.1: Generalized Linear Model
4.1. Distribution (4.1) is called a natural (or linear) exponential family to distinguish it from a more general exponential family that replaces $y$ by $r(y)$ in the exponential term. For other generalizations, see Jørgensen (1987). Books on GLMs and related models, in approximate order of technical level from highest to lowest, are McCullagh and Nelder (1989), Fahrmeir and Tutz (2001), Aitkin et al. (1989), Dobson (2002), and Gill (2000). See also Firth (1991).
Section 4.3: Generalized Linear Models for Counts
4.2. For further discussion of Poisson regression and related models for count data, see Breslow (1984), Cameron and Trivedi (1998), Frome (1983), Hinde (1982), Lawless (1987), and Seeber (1998) and references therein.
Section 4.4: Moments and Likelihood for Generalized Linear Models
4.3. The function $b(\cdot)$ in (4.14) is called the cumulant function, since when $a(\phi) = 1$ its derivatives yield the cumulants of the distribution (Jørgensen 1987).
For many GLMs, including Poisson models with log link and binary models with logit link, with full-rank model matrix the Hessian is negative definite and the log likelihood is a strictly concave function. Then ML estimates of model parameters exist and are unique under quite general conditions (Wedderburn 1976).
Section 4.5: Inference for Generalized Linear Models
4.4. The matrix $\mathbf{W}$ used in $\operatorname{cov}(\hat{\boldsymbol{\beta}})$ [see (4.28)], in the hat matrix for standardized Pearson residuals [see (4.38)], and in Fisher scoring [see (4.40)] is the inverse of the covariance matrix of the linearized form of $g(\mathbf{y})$ (see Section 4.6.3).
McCullagh and Nelder (1989, Chap. 12) discussed model checking for GLMs. For discussions of residuals, see also Green (1984), Pierce and Schafer (1986), Pregibon (1980, 1981), and Williams (1987). Pregibon (1982) showed that the squared standardized Pearson residual is the score statistic for testing whether an observation is an outlier. Davison and Hinkley (1997, Sec. 7.2) discussed bootstrapping in GLMs.
Section 4.6: Fitting Generalized Linear Models
4.5. Fisher (1935b) introduced the Fisher scoring method to calculate ML estimates for probit models. For further discussion of GLM model fitting and the relationship between iterative reweighted least squares and ML estimation, see Green (1984), Jørgensen (1983), McCullagh and Nelder (1989), and Nelder and Wedderburn (1972). Green (1984), Jørgensen (1983), and Palmgren and Ekholm (1987) also discussed this relation for exponential family nonlinear models.
Section 4.7: Quasi-likelihood and Generalized Linear Models
4.6. For more on quasi-likelihood, see Sections 11.4, 12.6.4, and 13.3, Breslow (1984), Cox (1983), Firth (1987), Hinde and Demétrio (1998), McCullagh (1983), McCullagh and Nelder (1989), Nelder and Pregibon (1987), and Wedderburn (1974, 1976). See Heyde (1997) for a theoretical perspective.
Section 4.8: Generalized Additive Models
4.7. Besides GAMs, other nonparametric smoothing methods can describe the dependence of a binary response on a predictor. For instance, see Copas (1983), Lloyd (1999, Chap. 5), and Section 15.3.3 for kernel smoothing, and Kauermann and Tutz (2001) for models with random effects.
PROBLEMS
Applications
4.1 In the 2000 U.S. presidential election, Palm Beach County in Florida was the focus of unusual voting patterns (including a large number of illegal double votes) apparently caused by a confusing "butterfly ballot." Many voters claimed that they voted mistakenly for the Reform Party candidate, Pat Buchanan, when they intended to vote for Al Gore. Figure 4.8 shows the total number of votes for Buchanan plotted against the number of votes for the Reform Party candidate in 1996 (Ross Perot), by county in Florida. (For details, see A. Agresti and B. Presnell, J. Law Public Policy, Volume 13, Fall 2001, 117–134.)
a. In county $i$, let $\pi_i$ denote the proportion of the vote for Buchanan and let $x_i$ denote the proportion of the vote for Perot in 1996. For the linear probability model fitted to all counties except Palm Beach County, $\hat{\pi}_i = -0.0003 + 0.0304x_i$. Give the value of $P$ in the
FIGURE 4.8 Total vote, by county in Florida, for Reform Party candidates Buchanan in 2000 and Perot in 1996.
interpretation: The estimated proportion vote for Buchanan in 2000
was roughly P% of that for Perot in 1996.
b. For Palm Beach County, i s 0.0079 and x i s 0.0774. Does this
result appear to be an outlier? Explain.
c. For logistic regression, log w
ˆ irŽ1 y ˆ i .x s y7.164 q 12.219 x i . Find
ˆ i in Palm Beach County. Is that county an outlier for this model?
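As a numerical check on parts (b) and (c), both reported fits can be evaluated at Palm Beach County's 1996 Perot proportion, x = 0.0774 (a minimal sketch; the full county-level data are not reproduced here):

```python
import math

def linear_prob(x):
    # linear probability fit from part (a), excluding Palm Beach County
    return -0.0003 + 0.0304 * x

def logistic_prob(x):
    # logistic fit from part (c): log[pi/(1 - pi)] = -7.164 + 12.219x
    eta = -7.164 + 12.219 * x
    return math.exp(eta) / (1 + math.exp(eta))

x_pb = 0.0774        # Perot proportion for Palm Beach County, 1996
pi_pb_obs = 0.0079   # observed Buchanan proportion, 2000

print(linear_prob(x_pb))                 # roughly 0.002
print(logistic_prob(x_pb))               # roughly 0.002
print(pi_pb_obs / logistic_prob(x_pb))   # observed is about 4 times the fit
```

Both models predict a Buchanan share near 0.002, about one-fourth of the 0.0079 actually observed, which is the sense in which Palm Beach County stands out as an outlier.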
4.2 For games in baseball’s National League during nine decades, Table 4.6 shows the percentage of times that the starting pitcher pitched a complete game.
TABLE 4.6 Data for Problem 4.2

Decade       Percent Complete
1900–1909    72.7
1910–1919    63.4
1920–1929    50.0
1930–1939    44.3
1940–1949    41.6
1950–1959    32.8
1960–1969    27.2
1970–1979    22.5
1980–1989    13.3

Source: Data from George Will, Newsweek, Apr. 10, 1989.
a. Treating the number of games as the same in each decade, the ML fit of the linear probability model is π̂ = 0.7578 − 0.0694x, where x = decade (x = 1, 2, . . . , 9). Interpret 0.7578 and −0.0694.
b. Substituting x = 10, 11, 12, predict the percentages of complete games for the next three decades. Are these predictions plausible? Why?
c. The ML fit with logistic regression is π̂ = exp(1.148 − 0.315x)/[1 + exp(1.148 − 0.315x)]. Obtain π̂ for x = 10, 11, 12. Are these more plausible?
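A quick computation for parts (b) and (c) shows why the linear extrapolations fail while the logistic ones remain plausible:

```python
import math

def linear_fit(x):
    # ML linear probability fit from part (a)
    return 0.7578 - 0.0694 * x

def logistic_fit(x):
    # ML logistic fit from part (c)
    eta = 1.148 - 0.315 * x
    return math.exp(eta) / (1 + math.exp(eta))

for x in (10, 11, 12):  # the next three decades
    print(x, round(linear_fit(x), 4), round(logistic_fit(x), 4))
```

The linear fit gives a negative proportion already at x = 11, whereas the logistic fit stays between 0 and 1 for every x.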
4.3 For Table 3.7 with scores (0, 0.5, 1.5, 4.0, 7.0) for alcohol consumption, ML fitting of the linear probability model for malformation gives the output:

Parameter   Estimate   Std Error   Wald 95% Conf Limits
Intercept   0.0025     0.0003      0.0019    0.0032
Alcohol     0.0011     0.0007      −0.0003   0.0025

Interpret the model fit. Use it to estimate the relative risk of malformation for alcohol consumption levels 0 and 7.0.
4.4 For Table 4.2, refit the linear probability model or the logistic regression model using the scores (a) (0, 2, 4, 6), (b) (0, 1, 2, 3), and (c) (1, 2, 3, 4). Compare β̂ for the three choices. Compare fitted values. Summarize the effect of linear transformations of scores, which preserve relative sizes of spacings between scores.
4.5 For Table 4.3, let Y = 1 if a crab has at least one satellite, and Y = 0 otherwise. Using x = weight, fit the linear probability model.
a. Use ordinary least squares. Interpret the parameter estimates. Find the estimated probability at the highest observed weight (5.20 kg). Comment.
b. Try to fit the model using ML, treating Y as binomial. [The failure is due to a fitted probability falling outside the (0, 1) range. The fit in part (a) is ML for a normal random component, for which fitted values outside this range are permissible.]
c. Fit the logistic regression model. Show that the fitted probability at a weight of 5.20 kg equals 0.9968.
d. Fit the probit model. Find the fitted probability at 5.20 kg.
4.6 An experiment analyzes imperfection rates for two processes used to fabricate silicon wafers for computer chips. For treatment A applied to 10 wafers, the numbers of imperfections are 8, 7, 6, 6, 3, 4, 7, 2, 3, 4. Treatment B applied to 10 other wafers has 9, 9, 8, 14, 8, 13, 11, 5, 7, 6 imperfections. Treat the counts as independent Poisson variates having means μ_A and μ_B.
a. Fit the model log μ = α + βx, where x = 1 for treatment B and x = 0 for treatment A. Show that exp(β) = μ_B/μ_A, and interpret its estimate.
b. Test H_0: μ_A = μ_B with the Wald or likelihood-ratio test of H_0: β = 0. Interpret.
c. Construct a 95% confidence interval for μ_B/μ_A. (Hint: First construct one for β.)
d. Test H_0: μ_A = μ_B based on this result: If Y_1 and Y_2 are independent Poisson with means μ_1 and μ_2, then (Y_1 | Y_1 + Y_2) is binomial with n = Y_1 + Y_2 and π = μ_1/(μ_1 + μ_2).
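Parts (a)–(c) can be checked in closed form: for this model the fitted means equal the sample means (compare Problem 4.23), so β̂ = log(ȳ_B/ȳ_A), and the standard large-sample variance of β̂ is 1/Σy_A + 1/Σy_B. A sketch:

```python
import math

y_A = [8, 7, 6, 6, 3, 4, 7, 2, 3, 4]
y_B = [9, 9, 8, 14, 8, 13, 11, 5, 7, 6]

mean_A = sum(y_A) / len(y_A)           # ML estimate of mu_A
mean_B = sum(y_B) / len(y_B)           # ML estimate of mu_B
beta_hat = math.log(mean_B / mean_A)   # estimated log rate ratio

# large-sample Wald SE of beta-hat for independent Poisson totals
se = math.sqrt(1 / sum(y_A) + 1 / sum(y_B))
ci_beta = (beta_hat - 1.96 * se, beta_hat + 1.96 * se)
ci_ratio = tuple(math.exp(b) for b in ci_beta)   # interval for mu_B/mu_A

print(beta_hat, se)   # about 0.588 and 0.176
print(ci_ratio)       # about (1.27, 2.54)
```

The interval for μ_B/μ_A excludes 1, consistent with rejecting H_0: μ_A = μ_B in part (b).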
4.7 For Table 4.3, Table 4.7 shows SAS output for a Poisson loglinear model fit using X = weight and Y = number of satellites.
a. Estimate E(Y) for female crabs of average weight, 2.44 kg.
b. Use β̂ to describe the weight effect. Show how to construct the reported confidence interval.
c. Construct a Wald test that Y is independent of X. Interpret.
d. Can you conduct a likelihood-ratio test of this hypothesis? If not, what else do you need?
e. Is there evidence of overdispersion? If necessary, adjust standard errors and interpret.
TABLE 4.7 SAS Output for Problem 4.7

Criterion            DF     Value
Deviance             171    560.8664
Pearson Chi-Square   171    535.8957
Log Likelihood              71.9524

Parameter   Estimate   Std Error   Wald 95% Conf Limits   Chi-Sq   Pr > ChiSq
Intercept   −0.4284    0.1789      −0.7791   −0.0777      5.73     0.0167
weight      0.5893     0.0650      0.4619    0.7167       82.15    <.0001
4.8 Refer to Problem 4.7. Using the identity link with x = weight, μ̂ = −2.60 + 2.264x, where β̂ = 2.264 has SE = 0.228. Repeat parts (a) through (c).
4.9 Refer to Table 4.3.
a. Fit a Poisson loglinear model using both W = weight and C = color to predict Y = number of satellites. Assigning dummy variables, treat C as a nominal factor. Interpret parameter estimates.
b. Estimate E(Y) for female crabs of average weight (2.44 kg) that are (i) medium light, and (ii) dark.
c. Test whether color is needed in the model. (Hint: From Section 4.5.4, the likelihood-ratio statistic comparing models is the difference in deviances.)
d. The estimated color effects are monotone across the four categories. Fit a simpler model that treats C as quantitative and assumes a linear effect. Interpret its color effect and repeat the analyses of parts (b) and (c). Compare the fit to the model in part (a). Interpret.
e. Add width to the model. What effect does the strong positive correlation between width and weight have? Are both needed in the model?
4.10 In Section 4.3.2, refer to the Poisson model with identity link. The fit using least squares is μ̂ = −10.42 + 0.51x (SE = 0.11). Explain why the parameter estimates differ and why the SE values are so different.
4.11 For the negative binomial model fitted to the crab satellite counts with log link and width predictor, α̂ = −4.05, β̂ = 0.192 (SE = 0.048), and k̂⁻¹ = 1.106 (SE = 0.197). Interpret. Why is the SE for β̂ so different from SE = 0.020 for the corresponding Poisson GLM in Section 4.3.2? Which is more appropriate? Why?
4.12 Refer to Problem 4.6. The sample mean and variance are 5.0 and 4.2 for treatment A and 9.0 and 8.4 for treatment B.
a. Is there evidence of overdispersion for the Poisson model having a dummy variable for treatment? Explain.
b. Fit the negative binomial loglinear model. Note that the estimated dispersion parameter is 0 and that estimates of treatment means and standard errors are the same as with the Poisson loglinear GLM.
c. For the overall sample of 20 observations, the sample mean and variance are 7.0 and 10.2. Fit the loglinear model having only an intercept term under Poisson and negative binomial assumptions. Compare results, and compare confidence intervals for the overall mean response. Why do they differ? (Note: This shows how the Poisson model can deteriorate when an important covariate is unmeasured.)
4.13 Table 4.8 shows the free-throw shooting, by game, of Shaq O’Neal of the Los Angeles Lakers during the 2000 NBA (basketball) playoffs. Commentators remarked that his shooting varied dramatically from game to game. In game i, suppose that Y_i = number of free throws
TABLE 4.8 Data for Problem 4.13

Game   Made   Attempts     Game   Made   Attempts     Game   Made   Attempts
1      4      5            9      4      12           17     8      12
2      5      11           10     1      4            18     1      6
3      5      14           11     13     27           19     18     39
4      5      12           12     5      17           20     3      13
5      2      7            13     6      12           21     10     17
6      7      10           14     9      9            22     1      6
7      6      14           15     7      12           23     3      12
8      9      15           16     3      10

Source: www.nba.com.
made out of n_i attempts is a bin(n_i, π_i) variate and the {Y_i} are independent.
a. Fit the model π_i = α, and find and interpret α̂ and its standard error. Does the model appear to fit adequately? (Note: You could check this with a small-sample test of independence of the 23 × 2 table of game and the binary outcome.)
b. Adjust the standard error for overdispersion. Using the original SE and its correction, find and compare 95% confidence intervals for α. Interpret.
4.14 Refer to Table 13.6. Fit a loglinear model with a dummy variable for race, (a) assuming a Poisson distribution, and (b) allowing overdispersion with a quasi-likelihood approach. Compare results.
4.15 Refer to Problem 4.6. The wafers are also classified by thickness of silicon coating (z = 0, low; z = 1, high). The first five imperfection counts reported for each treatment refer to z = 0 and the last five refer to z = 1. Analyze these data.
4.16 Refer to Table 13.9 on frequency of sexual intercourse. Analyze these data.
Theory and Methods
4.17 Describe the purpose of the link function of a GLM. What is the
identity link? Explain why it is not often used with binomial or Poisson
responses.
4.18 For known k, show that the negative binomial distribution (4.12) has exponential family form (4.1) with natural parameter log[μ/(μ + k)].
4.19 For binary data, define a GLM using the log link. Show that effects refer to the relative risk. Why do you think this link is not often used? (Hint: What happens if the linear predictor takes a positive value?)
4.20 For the logistic regression model (4.6) with β > 0, show that (a) as x → ∞, π(x) is monotone increasing, and (b) the curve for π(x) is the cdf of a logistic distribution having mean −α/β and standard deviation π/(β√3).
4.21 Show representation (4.18) for the binomial distribution.
4.22 Let Y_i be a bin(n_i, π_i) variate for group i, i = 1, . . . , N, with {Y_i} independent. Consider the model that π_1 = ⋯ = π_N. Denote that common value by π. For observations {y_i}, show that π̂ = (Σy_i)/(Σn_i). When all n_i = 1, for testing this model’s fit in the N × 2 table, show that X² = n. Thus, goodness-of-fit statistics can be completely uninformative for ungrouped data. (See also Problem 5.37.)
4.23 Suppose that Y_i is Poisson with g(μ_i) = α + βx_i, where x_i = 1 for i = 1, . . . , n_A from group A and x_i = 0 for i = n_A + 1, . . . , n_A + n_B from group B. Show that for any link function g, the likelihood equations (4.22) imply that the fitted means μ̂_A and μ̂_B equal the sample means.
4.24 For binary data with sample proportion y_i based on n_i trials, we use quasi-likelihood to fit a model using variance function (4.46). Show that parameter estimates are the same as for the binomial GLM but that the covariance matrix is multiplied by φ.
4.25 A binomial GLM π_i = Φ(Σ_j β_j x_ij) with arbitrary inverse link function Φ assumes that n_i Y_i has a bin(n_i, π_i) distribution. Find w_i in (4.27) and hence cov(β̂). For logistic regression, show that w_i = n_i π_i(1 − π_i).
4.26 A GLM has parameter β with sufficient statistic S. A goodness-of-fit test statistic T has observed value t_o. If β were known, a P-value is P = P(T ≥ t_o; β). Explain why P(T ≥ t_o | S) is the uniform minimum variance unbiased estimator of P.
4.27 Let y_ij be observation j of a count variable for group i, i = 1, . . . , I, j = 1, . . . , n_i. Suppose that {Y_ij} are independent Poisson with E(Y_ij) = μ_i.
a. Show that the ML estimate of μ_i is μ̂_i = ȳ_i = Σ_j y_ij/n_i.
b. Simplify the expression for the deviance for this model. [For testing this model, it follows from Fisher (1970, p. 58; originally published in 1925) that the deviance and the Pearson statistic Σ_i Σ_j (y_ij − ȳ_i)²/ȳ_i have approximate chi-squared distributions with df = Σ_i(n_i − 1). For a single group, Cochran (1954) referred to Σ_j(y_1j − ȳ_1)²/ȳ_1 as the variance test for the fit of a Poisson distribution, since it compares the sample variance to the estimated Poisson variance ȳ_1.]
4.28 Conditional on λ, Y has a Poisson distribution with mean λ. Values of λ vary according to the gamma density (13.12), which has E(λ) = μ and var(λ) = μ²/k. Show that marginally Y has the negative binomial distribution (4.12). Explain why the negative binomial model is a way to handle overdispersion for the Poisson.
4.29 Consider the class of binary models (4.8) and (4.9). Suppose that the standard cdf Φ corresponds to a probability density function that is symmetric around 0.
a. Show that the x at which π(x) = 0.5 is x = −α/β.
b. Show that the rate of change in π(x) when π(x) = 0.5 is βφ(0), where φ denotes the density corresponding to Φ. Show this is 0.25β for the logit link and β/√(2π) (where π = 3.14 . . .) for the probit link.
c. Show that the probit regression curve has the shape of a normal cdf with mean −α/β and standard deviation 1/|β|.
4.30 Show that the normal distribution N(μ, σ²) with fixed σ satisfies family (4.1), and identify the components. Formulate the ordinary regression model as a GLM.
4.31 In Problem 4.30, when σ is also a parameter, show that it satisfies the exponential dispersion family (4.14).
4.32 For binary observations, consider the model π(x) = 1/2 + (1/π)tan⁻¹(α + βx). Which distribution has a cdf of this form? Explain when a GLM using this curve might be more appropriate than logistic regression.
4.33 Find the form of the deviance residual (4.35) for an observation in a (a) binomial GLM, and (b) Poisson GLM. Illustrate part (b) for a cell count in a two-way contingency table for the model of independence.
4.34 Consider the value β̂ that maximizes a function L(β). Let β^(0) denote an initial guess.
a. Using L′(β̂) = L′(β^(0)) + (β̂ − β^(0))L″(β^(0)) + ⋯, argue that for β^(0) close to β̂, approximately 0 = L′(β^(0)) + (β̂ − β^(0))L″(β^(0)). Solve this equation to obtain an approximation β^(1) for β̂.
b. Let β^(t) denote approximation t for β̂, t = 0, 1, 2, . . . . Justify that the next approximation is

β^(t+1) = β^(t) − L′(β^(t))/L″(β^(t)).
4.35 For n independent observations from a Poisson distribution, show that Fisher scoring gives μ^(t+1) = ȳ for all t > 0. By contrast, what happens with Newton–Raphson?
4.36 Write a computer program using the Newton–Raphson algorithm to maximize the likelihood for a binomial sample. For π̂ = 0.3 based on n = 10, print out results of the first six iterations when the starting value π^(0) is (a) 0.1, (b) 0.2, . . . , (i) 0.9. Summarize the effects of the starting value on speed of convergence. What happens if it is 0 or 1?
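A minimal version of such a program, applying the Newton–Raphson update of Problem 4.34 to the binomial log likelihood L(π) = y log π + (n − y)log(1 − π) (shown here in Python; any language works):

```python
def newton_raphson_binomial(y, n, pi0, iters=6):
    """Newton-Raphson for L(pi) = y*log(pi) + (n - y)*log(1 - pi)."""
    pi = pi0
    history = [pi]
    for _ in range(iters):
        score = y / pi - (n - y) / (1 - pi)            # L'(pi)
        hessian = -y / pi**2 - (n - y) / (1 - pi)**2   # L''(pi)
        pi = pi - score / hessian                      # Newton-Raphson step
        history.append(pi)
    return history

# y = 3 successes in n = 10 trials, so the ML estimate is pi-hat = 0.3
for start in (0.1, 0.2, 0.5, 0.9):
    print(start, [round(p, 6) for p in newton_raphson_binomial(3, 10, start)])
```

Starting values near 0.3 converge within two or three iterations; starting at exactly 0 or 1 fails immediately, since the score and its derivative divide by π and 1 − π.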
4.37 In a GLM, suppose that var(Y) = v(μ) for μ = E(Y). Show that the link g satisfying g′(μ) = [v(μ)]^(−1/2) has the same weight matrix W^(t) at each cycle. Show that this link for a Poisson random component is g(μ) = 2√μ.
4.38 For noncanonical links in a GLM, show that the observed information
matrix may depend on the data and hence differs from the expected
information. Illustrate using the probit model.
CHAPTER 5
Logistic Regression
In introducing generalized linear models for binary data in Chapter 4 we
highlighted logistic regression. This is the most important model for categorical response data. It is used increasingly in a wide variety of applications.
Early uses were in biomedical studies but the past 20 years have also seen
much use in social science research and marketing.
Recently, logistic regression has become a popular tool in business applications. Some credit-scoring applications use logistic regression to model the probability that a subject is creditworthy.
subject pays a bill on time may use predictors such as the size of the bill,
annual income, occupation, mortgage and debt obligations, percentage of
bills paid on time in the past, and other aspects of an applicant’s credit
history. A company that relies on catalog sales may determine whether to
send a catalog to a potential customer by modeling the probability of a sale
as a function of indices of past buying behavior.
Another area of increasing application is genetics. For instance, one recent article (J. M. Henshall and M. E. Goddard, Genetics 151:885–894, 1999) used logistic regression to estimate quantitative trait loci effects, modeling the probability that an offspring inherits an allele of one type instead of another type as a function of phenotypic values on various traits for that offspring. Another recent article (D. F. Levinson et al., Amer. J. Hum. Genet. 67:652–663, 2000) used logistic regression for analysis of the genotype data of affected sibling pairs (ASPs) and their parents from several research centers. The model studied the probability that ASPs have identity-by-descent allele sharing and tested its heterogeneity among the centers.
In this chapter we study logistic regression more closely. Section 5.1 covers
parameter interpretation. In Section 5.2 we present inferential methods for
those parameters. Sections 5.3 and 5.4 generalize to multiple predictors,
some of which may be qualitative. Finally, in Section 5.5 we apply GLM
model-fitting methods to determine and solve likelihood equations for logistic regression.
5.1 INTERPRETING PARAMETERS IN LOGISTIC REGRESSION
For a binary response variable Y and an explanatory variable X, let π(x) = P(Y = 1 | X = x) = 1 − P(Y = 0 | X = x). The logistic regression model is

π(x) = exp(α + βx) / [1 + exp(α + βx)].    (5.1)

Equivalently, the log odds, called the logit, has the linear relationship

logit[π(x)] = log{π(x)/[1 − π(x)]} = α + βx.    (5.2)

This equates the logit link function to the linear predictor.
5.1.1 Interpreting β: Odds, Probabilities, and Linear Approximations
How can we interpret β in (5.2)? Its sign determines whether π(x) is increasing or decreasing as x increases. The rate of climb or descent increases as |β| increases; as β → 0 the curve flattens to a horizontal straight line. When β = 0, Y is independent of X. For quantitative x with β > 0, the curve for π(x) has the shape of the cdf of the logistic distribution (recall Section 4.2.5). Since the logistic density is symmetric, π(x) approaches 1 at the same rate that it approaches 0.

Exponentiating both sides of (5.2) shows that the odds are an exponential function of x. This provides a basic interpretation for the magnitude of β: The odds increase multiplicatively by e^β for every 1-unit increase in x. In other words, e^β is an odds ratio, the odds at X = x + 1 divided by the odds at X = x.

Most scientists are not familiar with odds or logits, so the interpretation of a multiplicative effect of e^β on the odds scale or an additive effect of β on the logit scale is not helpful to them. A simpler, although approximate, slope interpretation uses a linearization argument (Berkson 1951). Since it has a curved rather than a linear appearance, the logistic regression function (5.1) implies that the rate of change in π(x) per unit change in x varies. A straight line drawn tangent to the curve at a particular x value, shown in Figure 5.1, describes the rate of change at that point. Calculating ∂π(x)/∂x using (5.1) yields a fairly complex function of the parameters and x, but it simplifies to the form βπ(x)[1 − π(x)].

For instance, the line tangent to the curve at the x for which π(x) = 1/2 has slope β(1/2)(1/2) = β/4; when π(x) = 0.9 or 0.1, it has slope 0.09β. The slope approaches 0 as π(x) approaches 1.0 or 0. The steepest slope occurs at the x for which π(x) = 1/2; that x value is x = −α/β. [To check that π(x) = 1/2 at this
FIGURE 5.1 Linear approximation to logistic regression curve.
point, substitute −α/β for x in (5.1), or substitute π(x) = 1/2 in (5.2) and solve for x.] This x value is sometimes called the median effective level and denoted EL₅₀. In toxicology studies it is called LD₅₀ (LD = lethal dose), the dose with a 50% chance of a lethal result.

From this linear approximation, near the x where π(x) = 1/2, a change in x of 1/β corresponds to a change in π(x) of roughly (1/β)(β/4) = 1/4; that is, 1/β approximates the distance between the x values where π(x) = 0.25 or 0.75 (in reality, 0.27 and 0.73) and where π(x) = 0.50. The linear approximation works better for smaller changes in x, however.
An alternative way to interpret the effect reports the values of π(x) at certain x values, such as their quartiles. This entails substituting those quartiles for x into formula (5.1) for π(x). The change in π(x) over the middle half of x values, from the lower quartile to the upper quartile of x, then describes the effect. It can be compared to the corresponding change over the middle half of values of other predictors.

The intercept parameter α is not usually of particular interest. However, by centering the predictor about 0 [i.e., replacing x by (x − x̄)], α becomes the logit at the predictor mean x̄, and thus e^α/(1 + e^α) = π(x̄). (As in ordinary regression, centering is also helpful in complex models containing quadratic or interaction terms, to reduce correlations among model parameter estimates.)
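These interpretations are easy to verify numerically. The sketch below uses arbitrary illustrative values α = −2, β = 0.4 (not a fit to any data set in this chapter) to confirm that e^β is the odds ratio for a 1-unit increase in x, that the tangent slope at any x is βπ(x)[1 − π(x)], and that π(x) = 1/2 at x = −α/β:

```python
import math

alpha, beta = -2.0, 0.4   # illustrative values only

def pi(x):
    # logistic regression function (5.1)
    return math.exp(alpha + beta * x) / (1 + math.exp(alpha + beta * x))

def odds(x):
    return pi(x) / (1 - pi(x))

# e^beta is the odds ratio comparing x + 1 to x, at any x
print(odds(3.0) / odds(2.0), math.exp(beta))

# numerical derivative of pi(x) matches beta * pi(x) * [1 - pi(x)]
x0, h = 1.5, 1e-6
numeric_slope = (pi(x0 + h) - pi(x0 - h)) / (2 * h)
print(numeric_slope, beta * pi(x0) * (1 - pi(x0)))

# the steepest slope, beta/4, occurs where pi(x) = 1/2, at x = -alpha/beta
print(pi(-alpha / beta), beta / 4)
```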
5.1.2 Looking at the Data
In practice, these interpretations use formula (5.1) with ML estimates substituted for parameters. Before fitting the model and making such interpretations, look at the data to check that the logistic regression model is appropriate. Since Y takes only values 0 and 1, it is difficult to check this by plotting Y against x.

It can be helpful to plot sample proportions or logits against x. Let n_i denote the number of observations at setting i of x. Of them, let y_i denote the number of ‘‘1’’ outcomes, with p_i = y_i/n_i. Sample logit i is log[p_i/(1 − p_i)] = log[y_i/(n_i − y_i)]. This is not finite when y_i = 0 or n_i. An ad hoc adjustment adds a positive constant to the number of outcomes of the two types. The adjustment

log[(y_i + 1/2) / (n_i − y_i + 1/2)]

is the least-biased estimator of this form of the true logit (Note 5.2). The plot of sample logits should be roughly linear.
When X is continuous and all n i s 1, or when it is essentially continuous
and all n i are small, this is unsatisfactory. One could group the data with
nearby x values into categories before calculating sample proportions and
sample logits. A better approach that does not require choosing arbitrary
categories uses a smoothing mechanism to reveal trends. One such smoothing
approach fits a generalized additive model ŽSection 4.8., which replaces the
linear predictor of a GLM by a smooth function. Inspect a plot of the fit
to see if severe discrepancies occur from the S-shaped trend predicted
by logistic regression.
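The adjusted sample logit above is simple to compute. This sketch applies it to a small invented set of (y_i, n_i) pairs (hypothetical, for illustration only), including a boundary case y_i = 0 for which the unadjusted logit would be infinite:

```python
import math

def adjusted_logit(y, n):
    # log[(y + 1/2) / (n - y + 1/2)]: finite even when y = 0 or y = n
    return math.log((y + 0.5) / (n - y + 0.5))

# hypothetical grouped binary data: (number of '1' outcomes, group size)
data = [(0, 10), (2, 12), (5, 11), (9, 12), (10, 11)]
for y, n in data:
    print(y, n, round(adjusted_logit(y, n), 3))
```

Plotting these adjusted logits against the x settings gives a quick visual check of the linearity assumed in (5.2).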
5.1.3 Horseshoe Crabs Revisited
To illustrate logistic regression, we reanalyze the horseshoe crab data introduced in Section 4.3.2. The binary response is whether a female crab has any male crabs residing nearby (satellites): Y = 1 if she has at least one satellite, and Y = 0 if she has none. We first use as a predictor the female crab’s width.

Figure 4.7 plotted the data and showed the smoothed prediction of the mean provided by a generalized additive model (GAM), assuming a binomial response and logit link. The logistic regression model appears to be adequate. This is also suggested by the grouping of the data used to investigate the adequacy of Poisson regression models in Section 4.3.2 (Table 4.4). In each of the eight width categories, we computed the sample proportion of crabs having satellites and the mean width for the crabs in that category.

FIGURE 5.2 Observed and fitted proportions of satellites by width of female crab.

Figure 5.2 shows eight dots representing the sample proportions of female crabs having satellites plotted against the mean widths for the eight categories. The eight plotted sample proportions and the GAM smoothing curve both show a roughly increasing trend, so we proceed with fitting the logistic regression model with linear width predictor.
We defer to Section 5.5 details about ML fitting. Software (e.g., for SAS, see Table A.8) reports output such as Table 5.1 exhibits. For the ungrouped data from Table 4.3, let π(x) denote the probability that a female horseshoe crab of width x has a satellite. The ML fit is

π̂(x) = exp(−12.351 + 0.497x) / [1 + exp(−12.351 + 0.497x)].

TABLE 5.1 Computer Output for Logistic Regression Model with Horseshoe Crab Data
Criteria For Assessing Goodness Of Fit
Criterion            DF     Value
Deviance             171    194.4527
Pearson Chi-Square   171    165.1434
Log Likelihood              −97.2263

Parameter   Estimate    Std Error   Likelihood-Ratio 95% Conf Limits   Wald Chi-Sq   P > ChiSq
Intercept   −12.3508    2.6287      −17.8097   −7.4573                 22.07         <.0001
width       0.4972      0.1017      0.3084     0.7090                  23.89         <.0001
Substituting x = 26.3 cm, the mean width level in this sample, π̂(x) = 0.674. The estimated probability equals 1/2 when x = −α̂/β̂ = 12.351/0.497 = 24.8. Figure 5.2 plots π̂(x) against width.

The estimated odds of a satellite multiply by exp(β̂) = exp(0.497) = 1.64 for each 1-cm increase in width; that is, there is a 64% increase. To convey the effect less technically, we could report the incremental rate of change in the probability of a satellite. At the mean width, π̂(x) = 0.674, and π̂(x) increases by about β̂[π̂(x)(1 − π̂(x))] = 0.497(0.674)(0.326) = 0.11 for a 1-cm increase in width. Or, we could report π̂(x) at the quartiles of x. The lower quartile, median, and upper quartile for width are 24.9, 26.1, and 27.7; π̂(x) at those values equals 0.51, 0.65, and 0.81, increasing by 0.30 over the x values for the middle half of the sample.

The latter summary is useful for comparing the effects of predictors having different units. For instance, with crab weight as the predictor, logit[π̂(x)] = −3.695 + 1.815x. A 1-kg increase in weight is not comparable to a 1-cm increase in width, so β̂ = 0.497 for x = width is not comparable to β̂ = 1.815 for x = weight. The quartiles for weight are 2.00, 2.35, and 2.85; π̂(x) at those values are 0.48, 0.64, and 0.81, increasing by 0.33 over the middle half of the sampled weights. The effect is similar to that of width.
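The reported probabilities follow directly from the ML estimates in Table 5.1; a quick check:

```python
import math

alpha_hat, beta_hat = -12.3508, 0.4972   # ML estimates from Table 5.1

def pi_hat(width):
    eta = alpha_hat + beta_hat * width
    return math.exp(eta) / (1 + math.exp(eta))

print(round(pi_hat(26.3), 3))            # 0.674 at the mean width
for w in (24.9, 26.1, 27.7):             # width quartiles and median
    print(w, round(pi_hat(w), 2))        # 0.51, 0.65, 0.81
print(round(-alpha_hat / beta_hat, 1))   # width where pi-hat = 1/2: 24.8
```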
5.1.4 Logistic Regression with Retrospective Studies
Another property of logistic regression relates to situations in which the explanatory variable X rather than the response variable Y is random. This occurs with retrospective sampling designs, such as case–control biomedical studies (Section 2.1.6). For samples of subjects having Y = 1 (cases) and having Y = 0 (controls), the value of X is observed. Evidence exists of an association if the distribution of X values differs between cases and controls. In retrospective studies, one can estimate odds ratios (Section 2.2.4). Effects in the logistic regression model refer to odds ratios. Thus, one can fit such models and estimate effects in case–control studies.
Here is a justification for this. Let Z indicate whether a subject is sampled (1 = yes, 0 = no). Let ρ_1 = P(Z = 1 | y = 1) denote the probability of sampling a case, and let ρ_0 = P(Z = 1 | y = 0) denote the probability of sampling a control. Even though the conditional distribution of Y given X = x is not sampled, we need a model for P(Y = 1 | z = 1, x), assuming that P(Y = 1 | x) follows the logistic model. By Bayes’ theorem,

P(Y = 1 | z = 1, x) = P(Z = 1 | y = 1, x)P(Y = 1 | x) / Σ_{j=0}^{1} P(Z = 1 | y = j, x)P(Y = j | x).    (5.3)

Now, suppose that P(Z = 1 | y, x) = P(Z = 1 | y) for y = 0 and 1; that is, for each y, the sampling probabilities do not depend on x. For instance, often x refers to exposure of some type, such as whether someone has been a smoker. Then, for cases and for controls, the probability of being sampled is the same for smokers and nonsmokers. Under this assumption, substituting ρ_1 and ρ_0 in (5.3) and dividing numerator and denominator by P(Y = 0 | x), (5.3) simplifies to

P(Y = 1 | z = 1, x) = ρ_1 exp(α + βx) / [ρ_0 + ρ_1 exp(α + βx)].

Then, dividing numerator and denominator by ρ_0 and using ρ_1/ρ_0 = exp[log(ρ_1/ρ_0)] yields

logit[P(Y = 1 | z = 1, x)] = α* + βx    with α* = α + log(ρ_1/ρ_0).

Thus, the logistic regression model holds with the same effect parameter β as in the model for P(Y = 1 | x). If the sampling rate for cases is 10 times that for controls, the intercept estimated is log(10) = 2.3 larger than the one estimated with a prospective study. For related comments, see Anderson (1972), Breslow and Day (1980, p. 203), Breslow and Powers (1978), Carroll et al. (1995), Farewell (1979), Mantel (1973), Prentice (1976a), and Prentice and Pyke (1979).
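The derivation can be checked numerically. This sketch uses arbitrary illustrative values of α, β, ρ_1, and ρ_0 (not from any real study) and confirms that the retrospective logit computed from (5.3) equals [α + log(ρ_1/ρ_0)] + βx exactly:

```python
import math

alpha, beta = -1.2, 0.8    # illustrative prospective-model parameters
rho1, rho0 = 0.30, 0.03    # case/control sampling probabilities (ratio 10)

def logistic(eta):
    return math.exp(eta) / (1 + math.exp(eta))

def p_case_given_sampled(x):
    # Bayes' theorem (5.3), with sampling independent of x given y
    p1 = logistic(alpha + beta * x)   # P(Y = 1 | x)
    return rho1 * p1 / (rho1 * p1 + rho0 * (1 - p1))

# the retrospective logit should be [alpha + log(rho1/rho0)] + beta*x
alpha_star = alpha + math.log(rho1 / rho0)
for x in (-1.0, 0.0, 2.5):
    p = p_case_given_sampled(x)
    print(x, math.log(p / (1 - p)), alpha_star + beta * x)
```

The slope β is recovered exactly; only the intercept shifts, here by log(10) ≈ 2.3.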
With case–control studies, one cannot estimate β in other binary-response models. Unlike the odds ratio, the effect for the conditional distribution of X given Y does not then equal that for Y given X. This is an important advantage of the logit link and is a major reason why logit models have surpassed other models in popularity in biomedical studies.

Many case–control studies employ matching. Each case is matched with one or more control subjects. The controls are like the case on key characteristics such as age. The model and subsequent analysis should take the matching into account. In Section 10.2.5 we discuss logistic regression for matched case–control studies.
Regardless of the sampling mechanism, logistic regression may or may not describe a relationship well. In one special case, it necessarily holds. Given that Y = i, suppose that X has an N(μ_i, σ²) distribution, i = 0, 1. Then, by Bayes’ theorem, P(Y = 1 | X = x) equals (5.1) with β = (μ_1 − μ_0)/σ² (Cornfield 1962). When a population is a mixture of two types of subjects, one type with Y = 1 that is approximately normally distributed on X and the other type with Y = 0 that is approximately normal on X with similar variance, the logistic regression function (5.1) approximates well the curve for π(x). If the distributions are normal but with different variances, the model applies also having a quadratic term (Anderson 1975). In that case, the relationship is nonmonotone, with π(x) increasing and then decreasing, or the reverse (Problem 5.33).
5.2 INFERENCE FOR LOGISTIC REGRESSION
By Wald’s (1943) asymptotic results for ML estimators, parameter estimators in logistic regression models have large-sample normal distributions. Thus, inference can use the (Wald, likelihood-ratio, score) triad of methods (Section 1.3.3).
5.2.1 Types of Inference
For the model with a single predictor,

logit[π(x)] = α + βx,

significance tests focus on H_0: β = 0, the hypothesis of independence. The Wald test uses the log likelihood at β̂, with test statistic z = β̂/SE or its square; under H_0, z² is asymptotically chi-squared with df = 1. The likelihood-ratio test uses twice the difference between the maximized log likelihoods at β̂ and at β = 0 and also has an asymptotic chi-squared null distribution with df = 1. The score test uses the log likelihood at β = 0 through the derivative of the log likelihood (i.e., the score function) at that point. The test statistic compares the sufficient statistic for β to its null expected value, suitably standardized [N(0, 1) or chi-squared]. In Section 5.3.5 we present this test of H_0: β = 0.

For large samples, the three tests usually give similar results. The likelihood-ratio test is preferred over the Wald test. It uses more information, since it incorporates the log likelihood at H_0 as well as at β̂. When |β| is relatively large, the Wald test is not as powerful as the likelihood-ratio test and can even show aberrant behavior [see Hauck and Donner (1977) and Problem 5.38].

Confidence intervals are more informative than tests. An interval for β results from inverting a test of H_0: β = β_0. The interval is the set of β_0 for which the chi-squared test statistic is no greater than χ²_1(α) = z²_{α/2}. For the Wald approach, this means [(β̂ − β_0)/SE]² ≤ z²_{α/2}; the interval is β̂ ± z_{α/2}(SE).
For summarizing the relationship, other characteristics may have greater importance than β, such as π(x) at various x values. For fixed x = x_0, logit[π̂(x_0)] = α̂ + β̂x_0 has a large-sample SE given by the estimated square root of

var(α̂ + β̂x_0) = var(α̂) + x_0² var(β̂) + 2x_0 cov(α̂, β̂).

A 95% confidence interval for logit[π(x_0)] is (α̂ + β̂x_0) ± 1.96 SE. Substituting each endpoint into the inverse transformation π(x_0) = exp(logit)/[1 + exp(logit)] gives a corresponding interval for π(x_0).
Each method of inference can also produce small-sample confidence
intervals and tests. We defer discussion of this until Section 6.7.
5.2.2 Inference for Horseshoe Crab Data
We illustrate logistic regression inferences with the model for the probability that a horseshoe crab has a satellite, with width as the predictor. Table 5.1 showed the fit and standard errors. The statistic z = β̂/SE = 0.497/0.102 = 4.9 provides strong evidence of a positive width effect (P < 0.0001). The equivalent Wald chi-squared statistic, z² = 23.9, has df = 1. The maximized log likelihoods equal −112.88 under H₀: β = 0 and −97.23 for the full model. The likelihood-ratio statistic equals −2[−112.88 − (−97.23)] = 31.3, with df = 1. This provides even stronger evidence than the Wald test.

The Wald 95% confidence interval for β is 0.497 ± 1.96(0.102), or (0.298, 0.697). Table 5.1 reports a likelihood-ratio confidence interval of (0.308, 0.709), based on the profile likelihood function. The corresponding confidence interval for the multiplicative effect on the odds per 1-cm increase in width equals (e^0.308, e^0.709) = (1.36, 2.03). We infer that a 1-cm increase in width multiplies the odds of a satellite by at least 1.36 (a 36% increase) and at most 2.03 (roughly a doubling).
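These Wald and likelihood-ratio computations can be reproduced directly from the reported estimates (a quick arithmetic check using the rounded values quoted above, not a refit of the raw data):

```python
import math

beta_hat, se = 0.497, 0.102          # reported width estimate and its SE
z = beta_hat / se                    # Wald statistic, z ~ 4.9
lr = -2 * (-112.88 - (-97.23))       # likelihood-ratio statistic = 31.3
wald_ci = (beta_hat - 1.96 * se, beta_hat + 1.96 * se)   # ~ (0.298, 0.697)
odds_ci = (math.exp(0.308), math.exp(0.709))             # profile CI on odds scale
```

Exponentiating the profile likelihood endpoints (0.308, 0.709) gives the (1.36, 2.03) interval for the odds multiplier per 1-cm increase in width.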
Most software for logistic regression also reports estimates and confidence intervals for π(x) (e.g., PROC GENMOD in SAS with the OBSTATS option). Consider this for crabs of width x = 26.5, near the mean width. The estimated logit is −12.351 + 0.497(26.5) = 0.825, and π̂(x) = 0.695. Software reports the estimated values

var(α̂) = 6.910,  var(β̂) = 0.01035,  cov(α̂, β̂) = −0.2668,

from which

var{logit[π̂(x)]} = 6.910 + x²(0.01035) + 2x(−0.2668).
At x = 26.5 this is 0.038, so the 95% confidence interval for logit[π(26.5)] equals 0.825 ± 1.96√0.038, or (0.44, 1.21). This translates to the interval (0.61, 0.77) for the probability of satellites (e.g., exp(0.44)/[1 + exp(0.44)] = 0.61). (Alternatively, for the model fit using predictor x* = x − 26.5, α̂ and its SE are the estimated logit and its SE.) Figure 5.3 plots the confidence bands around the prediction equation for π(x) as a function of x. Hauck (1983) gave alternative bands for which the confidence coefficient applies simultaneously to all possible predictor values.
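The interval for π(26.5) can be reproduced from the reported variance and covariance estimates (a sketch using the rounded values above; tiny discrepancies from the text reflect rounding of α̂ and β̂):

```python
import math

x0 = 26.5
logit_hat = -12.351 + 0.497 * x0                       # estimated logit at x0
var_logit = 6.910 + x0**2 * 0.01035 + 2 * x0 * (-0.2668)
lo = logit_hat - 1.96 * math.sqrt(var_logit)           # endpoints on the logit scale
hi = logit_hat + 1.96 * math.sqrt(var_logit)
inv_logit = lambda L: math.exp(L) / (1 + math.exp(L))  # back-transform to probability
prob_ci = (inv_logit(lo), inv_logit(hi))               # ~ (0.61, 0.77)
```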
One could ignore the model fit and simply use sample proportions (i.e., the saturated model) to estimate such probabilities. Six female crabs in the sample had x = 26.5, and four of them had satellites. The sample proportion estimate at x = 26.5 is π̂ = 4/6 = 0.67, similar to the model-based estimate. The 95% score confidence interval (Section 1.4.2) based on these six observations alone equals (0.30, 0.90).

When the logistic regression model truly holds, the model-based estimator of a probability is considerably better than the sample proportion. The model
has only two parameters to estimate, whereas the saturated model has a separate parameter for every distinct value of x.

FIGURE 5.3 Prediction equation and 95% confidence bands for probability of satellite as a function of width.

For instance, at x = 26.5, software reports SE = 0.04 for the model-based estimate 0.695, whereas the SE is √[π̂(1 − π̂)/n] = √[(0.67)(0.33)/6] = 0.19 for the sample proportion of 0.67 with only 6 observations. The 95% confidence intervals are (0.61, 0.77) using the model versus (0.30, 0.90) using the sample proportion. Instead of using only 6 observations, the model uses the information that all 173 observations provide in estimating the two model parameters. The result is a much more precise estimate.
Reality is a bit more complicated. In practice, the model is not exactly the true relationship between π(x) and x. However, if it approximates the true probabilities decently, its estimator still tends to be closer than the sample proportion to the true value. The model smooths the sample data, somewhat dampening the observed variability. The resulting estimators tend to be better unless each sample proportion is based on an extremely large sample. Section 6.4.5 discusses this advantage of using models.
5.2.3 Checking Goodness of Fit: Ungrouped and Grouped Data
In practice, there is no guarantee that a given logistic regression model fits the data well. For any type of binary data, one way to detect lack of fit uses a likelihood-ratio test to compare the model to more complex ones. A more complex model might contain a nonlinear effect, such as a quadratic term. Models with multiple predictors would also consider interaction terms. If more complex models do not fit better, this provides some assurance that the model chosen is reasonable.
Other approaches to detecting lack of fit search for any way that the model fails. This is simplest when the explanatory variables are solely categorical, as we'll illustrate in Section 5.4.3. At each setting of x, one can multiply the estimated probabilities of the two outcomes by the number of subjects at that setting to obtain estimated expected frequencies for y = 0 and y = 1. These are fitted values. The test of the model compares the observed counts and fitted values using a Pearson X² or likelihood-ratio G² statistic. For a fixed number of settings, as the fitted counts increase, X² and G² have limiting chi-squared null distributions. The degrees of freedom, called the residual df for the model, subtract the number of parameters in the model from the number of parameters in the saturated model (i.e., the number of settings of x).
The reason for the restriction to categorical predictors for a global test of fit relates to the distinction, mentioned in Section 4.5.3, between grouped and ungrouped data for binomial models. The saturated model differs in the two cases. An asymptotic chi-squared distribution for the deviance results as n → ∞ with a fixed number of parameters in that model and hence a fixed number of settings of predictor values.
5.2.4 Goodness of Fit of Model for Horseshoe Crabs
We illustrate with a goodness-of-fit analysis for the model using x = width to predict the probability that a female crab has a satellite. One way to check the model compares it to a more complex one, such as the model containing a quadratic term. With width centered at 0 by subtracting its mean of 26.3, that model has fit

logit[π̂(x)] = 0.618 + 0.533x + 0.040x².

The quadratic estimate has SE = 0.046. There is not much evidence to support adding that term: the likelihood-ratio statistic for testing that the true coefficient of x² is 0 equals 0.83 (df = 1).
We next consider overall goodness of fit. Width takes 66 distinct values for the 173 crabs, with few observations at most widths. One can view the data as a 66 × 2 contingency table. The two cells in each row count the number of crabs with satellites and the number of crabs without satellites, at that width. The chi-squared theory for X² and G² applies when the number of levels of x is fixed and the number of observations at each level grows. Although we grouped the data using the distinct width values rather than using 173 separate binary responses, this theory is violated here in two ways. First, most fitted counts are very small. Second, when more data are collected, additional width values would occur, so the contingency table would contain more cells rather than a fixed number. Because of this, X² and G² for logistic regression models with continuous or nearly continuous predictors do not have approximate chi-squared distributions. (Normal approximations can be
TABLE 5.2 Grouping of Observed and Fitted Values for Fit of Logistic Regression Model to Horseshoe Crab Data

Width (cm)      Number Yes   Number No   Fitted Yes   Fitted No
<23.25               5           9          3.64        10.36
23.25-24.25          4          10          5.31         8.69
24.25-25.25         17          11         13.78        14.22
25.25-26.25         21          18         24.23        14.77
26.25-27.25         15           7         15.94         6.06
27.25-28.25         20           4         19.38         4.62
28.25-29.25         15           3         15.65         2.35
>29.25              14           0         13.08         0.92
more appropriate, but no single method has received much attention; see Section 9.8.6 for references.)
One could use X² and G² to compare the observed and fitted values in grouped form. Table 5.2 uses the groupings of Table 4.4, giving an 8 × 2 table. In each width category, the fitted value for a yes response is the sum of the estimated probabilities π̂(x) for all crabs having width in that category; the fitted value for a no response is the sum of 1 − π̂(x) for those crabs. The fitted values are then much larger. Then, X² and G² have better validity, although the chi-squared theory still is not perfect, since π(x) is not constant in each category. Their values are X² = 5.3 and G² = 6.2. Table 5.2 has eight binomial samples, one for each width setting; the model has two parameters, so df = 8 − 2 = 6. Neither X² nor G² shows evidence of lack of fit (P > 0.4). Thus, we can feel more comfortable about using the model for the original ungrouped data.
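One can reproduce these grouped fit statistics directly from the counts in Table 5.2 (an arithmetic check using the table's fitted values, not a refit of the model):

```python
import math

obs_yes = [5, 4, 17, 21, 15, 20, 15, 14]
obs_no  = [9, 10, 11, 18, 7, 4, 3, 0]
fit_yes = [3.64, 5.31, 13.78, 24.23, 15.94, 19.38, 15.65, 13.08]
fit_no  = [10.36, 8.69, 14.22, 14.77, 6.06, 4.62, 2.35, 0.92]

# Pearson statistic: sum of (observed - fitted)^2 / fitted over all 16 cells
X2 = sum((o - f) ** 2 / f
         for o, f in zip(obs_yes + obs_no, fit_yes + fit_no))
# Likelihood-ratio statistic: 2 * sum of observed * log(observed / fitted)
G2 = 2 * sum(o * math.log(o / f)
             for o, f in zip(obs_yes + obs_no, fit_yes + fit_no)
             if o > 0)                   # a zero count contributes 0
df = 8 - 2                               # eight binomials, two model parameters
```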
5.2.5 Checking Goodness of Fit with Ungrouped Data by Grouping
As just noted, with ungrouped data or with continuous or nearly continuous
predictors, X 2 and G 2 do not have limiting chi-squared distributions. They
are still useful for comparing models, as done above for checking a quadratic
term and as we will discuss in Sections 5.4.3 and 9.8.5. Also, as just noted,
one can apply them in an approximate manner to grouped observed and
fitted values for a partition of the space of x values. As the number of
explanatory variables increases, however, simultaneous grouping of values for
each variable can produce a contingency table with a large number of cells,
most of which have small counts.
Regardless of the number of predictors, one can partition observed and
fitted values according to the estimated probabilities of success using the
original ungrouped data. One common approach forms the groups in the
partition so they have approximately equal size. With 10 groups, the first pair
of observed counts and corresponding fitted counts refers to the n/10 observations having the highest estimated probabilities, the next pair refers to the n/10 observations having the second decile of estimated probabilities, and so on. Each group has an observed count of subjects with each outcome and a fitted value for each outcome. The fitted value for an outcome is the sum of the estimated probabilities for that outcome for all observations in that group.
This construction is the basis of a test due to Hosmer and Lemeshow (1980). They proposed a Pearson statistic comparing the observed and fitted counts for this partition. Let y_ij denote the binary outcome for observation j in group i of the partition, i = 1, ..., g, j = 1, ..., nᵢ. Let π̂_ij denote the corresponding fitted probability for the model fitted to the ungrouped data. Their statistic equals

Σᵢ₌₁^g (Σⱼ y_ij − Σⱼ π̂_ij)² / {(Σⱼ π̂_ij)[1 − (Σⱼ π̂_ij)/nᵢ]}.

When many observations have the same estimated probability, there is some arbitrariness in forming the groups, and different software may report somewhat different values. This statistic does not have a limiting chi-squared distribution, because the observations in a group are not identical trials, since they do not share a common success probability. However, Hosmer and Lemeshow noted that when the number of distinct patterns of covariate values equals the sample size, the null distribution is approximated by chi-squared with df = g − 2.
For the logistic regression fit to the horseshoe crab data with continuous width predictor, the Hosmer–Lemeshow statistic with g = 10 groups equals 3.5, with df = 8. It also indicates a decent fit.
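A minimal sketch of the statistic follows. The function name and the equal-size grouping rule are illustrative; as noted above, software differs in how it forms groups and breaks ties, and this version assumes g divides n evenly:

```python
def hosmer_lemeshow(y, pi_hat, g=10):
    """Hosmer-Lemeshow statistic: order observations by fitted probability,
    split them into g equal-size groups, and compare observed and fitted
    success counts within each group."""
    pairs = sorted(zip(pi_hat, y), reverse=True)   # highest probabilities first
    size = len(pairs) // g
    stat = 0.0
    for i in range(g):
        group = pairs[i * size:(i + 1) * size]
        n_i = len(group)
        fitted = sum(p for p, _ in group)          # expected success count
        observed = sum(obs for _, obs in group)    # observed success count
        stat += (observed - fitted) ** 2 / (fitted * (1 - fitted / n_i))
    return stat

# Tiny artificial illustration with g = 2 groups of two observations each
hl = hosmer_lemeshow([1, 1, 0, 1], [0.9, 0.8, 0.2, 0.1], g=2)
```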
Unfortunately, like other proposed global fit statistics, the Hosmer–Lemeshow statistic does not have good power for detecting particular types of lack of fit (Hosmer et al. 1997). In any case, a large value of a global fit statistic merely indicates some lack of fit but provides no insight about its nature. The approach of comparing the working model to a more complex one is more useful from a scientific perspective, since it searches for lack of fit of a particular type. For either approach, when the fit is poor, diagnostic measures describe the influence of individual observations on the model fit and highlight reasons for the inadequacy. We discuss these in Section 6.2.1.
5.3 LOGIT MODELS WITH CATEGORICAL PREDICTORS

Like ordinary regression, logistic regression extends to include qualitative explanatory variables, often called factors. In this section we use dummy variables to do this.

5.3.1 ANOVA-Type Representation of Factors
For simplicity, we first consider a single factor X, with I categories. In row i of the I × 2 table, yᵢ is the number of outcomes in the first column (successes) out of nᵢ trials. We treat yᵢ as binomial with parameter πᵢ. The logit model with a factor is

log[πᵢ/(1 − πᵢ)] = α + βᵢ.    (5.4)

The higher βᵢ is, the higher the value of πᵢ. The right-hand side of (5.4) resembles the model formula for cell means in one-way ANOVA. As in ANOVA, the factor has as many parameters {βᵢ} as categories, but one is redundant. With I categories, X has I − 1 nonredundant parameters. One parameter can be set to 0, say β_I = 0. If the values do not satisfy this, we can recode so that it is true. For instance, set β̃ᵢ = βᵢ − β_I and α̃ = α + β_I, which satisfy β̃_I = 0. Then

logit(πᵢ) = α + βᵢ = (α̃ − β_I) + (β̃ᵢ + β_I) = α̃ + β̃ᵢ,

where the newly defined parameters satisfy the constraint. When β_I = 0, α equals the logit in row I, and βᵢ is the difference between the logits in rows i and I. Thus, βᵢ equals the log odds ratio for that pair of rows.

For any {πᵢ > 0}, {βᵢ} exist such that model (5.4) holds. The model has as many parameters (I) as binomial observations and is saturated. When a factor has no effect, β₁ = β₂ = ··· = β_I. Since this is equivalent to π₁ = ··· = π_I, this model with only an intercept term specifies statistical independence of X and Y.
5.3.2 Dummy Variables in Logit Models

An equivalent expression of model (5.4) uses dummy variables. Let xᵢ = 1 for observations in row i and xᵢ = 0 otherwise, i = 1, ..., I − 1. The model is

logit(πᵢ) = α + β₁x₁ + β₂x₂ + ··· + β_{I−1}x_{I−1}.

This accounts for parameter redundancy by not forming a dummy variable for category I. The constraint β_I = 0 in (5.4) corresponds to this form of dummy variable. The choice of category to exclude for the dummy variable is arbitrary. Some software sets β₁ = 0; this corresponds to a model with dummy variables for categories 2 through I, but not category 1.

Another way to impose constraints sets Σᵢ βᵢ = 0. Suppose that X has I = 2 categories, so β₁ = −β₂. This results from effect coding for a dummy variable, with x = 1 in category 1 and x = −1 in category 2.
The same substantive results occur for any coding scheme. For model (5.4), regardless of the constraint for {βᵢ}, {α̂ + β̂ᵢ} and hence {π̂ᵢ} are the same. The differences β̂_a − β̂_b for pairs (a, b) of categories of X are identical and represent estimated log odds ratios. Thus, exp(β̂_a − β̂_b) is the estimated odds of success in category a of X divided by the estimated odds of success in category b of X. Reparameterizing a model may change parameter estimates but does not change the model fit or the effects of interest.

The value βᵢ or β̂ᵢ for a single category is irrelevant, since different constraint systems yield different values. For a binary predictor, for instance, using dummy variables with reference value β₂ = 0, the log odds ratio equals β₁ − β₂ = β₁; by contrast, for effect coding with a ±1 dummy variable and hence β₁ + β₂ = 0, the log odds ratio equals β₁ − β₂ = β₁ − (−β₁) = 2β₁. A parameter or its estimate makes sense only by comparison with one for another category.
5.3.3 Alcohol and Infant Malformation Example Revisited

We return now to Table 3.7 from the study of maternal alcohol consumption and child's congenital malformations, shown again in Table 5.3. For model (5.4), we treat malformations as the response and alcohol consumption as an explanatory factor. Regardless of the constraint for {βᵢ}, {α̂ + β̂ᵢ} are the sample logits, reported in Table 5.3. For instance,

logit(π̂₁) = α̂ + β̂₁ = log(48/17,066) = −5.87.

For the coding that constrains β₅ = 0, α̂ = −3.61 and β̂₁ = −2.26. For the coding with β₁ = 0, α̂ = −5.87. Table 5.3 shows that except for the slight reversal between the first and second categories of alcohol consumption, the logits and hence the sample proportions of malformation cases increase as alcohol consumption increases.

The simpler model with all βᵢ = 0 specifies independence. For it, α̂ equals the logit for the overall sample proportion of malformations, or log(93/32,481) = −5.86. To test H₀: independence (df = 4), the Pearson
TABLE 5.3 Logits and Proportion of Malformation for Table 3.7

Alcohol                                               Proportion Malformed
Consumption    Present     Absent      Logit        Observed      Fitted
0                 48       17,066      -5.87         0.0028       0.0026
<1                38       14,464      -5.94         0.0026       0.0030
1-2                5          788      -5.06         0.0063       0.0041
3-5                1          126      -4.84         0.0079       0.0091
>=6                1           37      -3.61         0.0263       0.0231
statistic (3.10) is X² = 12.1 (P = 0.02), and the likelihood-ratio statistic (3.11) is G² = 6.2 (P = 0.19). These provide mixed signals. Table 5.3 has a mixture of very small, moderate, and extremely large counts. Even though n = 32,574, the null sampling distributions of X² and G² may not be close to chi-squared. The P-values using the exact conditional distributions of X² and G² are 0.03 and 0.13. These are closer, but still give differing evidence. In any case, these statistics ignore the ordinality of alcohol consumption. The sample suggests that malformations may tend to be more likely with higher alcohol consumption. However, the first two percentages are similar and the next two are also similar, and any of the last three percentages changes substantially with the addition or deletion of one malformation case.
5.3.4 Linear Logit Model for I × 2 Tables

Model (5.4) treats the explanatory factor as nominal, since it is invariant to the ordering of categories. For ordered factor categories, other models are more parsimonious than this, yet more complex than the independence model. For instance, let scores {x₁, x₂, ..., x_I} describe distances between categories of X. When one expects a monotone effect of X on Y, it is natural to fit the linear logit model

logit(πᵢ) = α + βxᵢ.    (5.5)

The independence model is the special case β = 0.

The near-monotone increase in sample logits in Table 5.3 indicates that the linear logit model (5.5) may fit better than the independence model. As measured, alcohol consumption groups a naturally continuous variable. With scores {x₁ = 0, x₂ = 0.5, x₃ = 1.5, x₄ = 4.0, x₅ = 7.0}, the last score being somewhat arbitrary, Table 5.4 shows results. The estimated multiplicative
TABLE 5.4 Computer Output for Logistic Regression Model with Infant Malformation Data

       Criteria For Assessing Goodness Of Fit
       Criterion             DF       Value
       Deviance               3      1.9487
       Pearson Chi-Square     3      2.0523
       Log Likelihood              -635.5968

                                      Likelihood-Ratio        Wald
       Parameter   Estimate  Std Error  95% Conf Limits      Chi-Sq  Pr>ChiSq
       Intercept    -5.9605    0.1154  -6.1930  -5.7397     2666.41    <.0001
       alcohol       0.3166    0.1254   0.0187   0.5236        6.37    0.0116
effect of a unit increase in daily alcohol consumption on the odds of malformation is exp(0.317) = 1.37. Table 5.3 shows the observed and fitted proportions of malformation. The model seems to fit well, as the statistics comparing observed and fitted counts are G² = 1.95 and X² = 2.05, with df = 3.
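The fit in Table 5.4 can be reproduced by Newton–Raphson iteration on the binomial log likelihood (a sketch, not the software's exact algorithm; the row totals 17,114, ..., 38 are the present-plus-absent counts from Table 5.3):

```python
import math

y = [48, 38, 5, 1, 1]                      # malformation counts
n = [17114, 14502, 793, 127, 38]           # row totals (present + absent)
x = [0.0, 0.5, 1.5, 4.0, 7.0]              # alcohol scores

a, b = math.log(93 / 32481), 0.0           # start from the independence fit
for _ in range(25):                        # Newton-Raphson updates
    pi = [1 / (1 + math.exp(-(a + b * xi))) for xi in x]
    u0 = sum(yi - ni * pii for yi, ni, pii in zip(y, n, pi))   # score for alpha
    u1 = sum((yi - ni * pii) * xi for yi, ni, pii, xi in zip(y, n, pi, x))
    w = [ni * pii * (1 - pii) for ni, pii in zip(n, pi)]       # binomial weights
    i00 = sum(w)                                               # information matrix
    i01 = sum(wi * xi for wi, xi in zip(w, x))
    i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    det = i00 * i11 - i01 * i01
    a += (i11 * u0 - i01 * u1) / det       # step = inverse(information) @ score
    b += (i00 * u1 - i01 * u0) / det

se_b = math.sqrt(i00 / det)                # SE(beta) from the inverse information
```

The converged values match the output shown in Table 5.4: α̂ ≈ −5.9605, β̂ ≈ 0.3166, with SE(β̂) ≈ 0.1254.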
5.3.5 Cochran–Armitage Trend Test

Armitage (1955) and Cochran (1954) were among the first to emphasize the importance of utilizing ordered categories in a contingency table. For I × 2 tables with ordered rows and I independent bin(nᵢ, πᵢ) variates {yᵢ}, they proposed a trend statistic for testing independence by partitioning the Pearson statistic for that hypothesis. They used the linear probability model

πᵢ = α + βxᵢ,    (5.6)
fitted by ordinary least squares. For this model, the null hypothesis of independence is H₀: β = 0. Let x̄ = Σᵢ nᵢxᵢ/n, let pᵢ = yᵢ/nᵢ, and let p = (Σᵢ yᵢ)/n denote the overall proportion of successes. The prediction equation is

π̂ᵢ = p + b(xᵢ − x̄),

where

b = [Σᵢ nᵢ(pᵢ − p)(xᵢ − x̄)] / [Σᵢ nᵢ(xᵢ − x̄)²].
Denote the Pearson statistic for testing independence by X²(I). For I × 2 tables with ordered rows, it satisfies

X²(I) = [1/p(1 − p)] Σᵢ nᵢ(pᵢ − p)² = z² + X²(L),

where

X²(L) = [1/p(1 − p)] Σᵢ nᵢ(pᵢ − π̂ᵢ)²

and

z² = [b²/p(1 − p)] Σᵢ nᵢ(xᵢ − x̄)² = [Σᵢ (xᵢ − x̄)yᵢ]² / [p(1 − p) Σᵢ nᵢ(xᵢ − x̄)²].    (5.7)

When the linear probability model holds, X²(L) is asymptotically chi-squared with df = I − 2. It tests the fit of the model. The statistic z², with df = 1,
tests H₀: β = 0 for the linear trend in the proportions (5.6). The test of independence using this statistic is called the Cochran–Armitage trend test.

This analysis seems unrelated to the linear logit model. However, the Cochran–Armitage statistic is equivalent to the score statistic for testing H₀: β = 0 in that model. Moreover, this statistic relates to the statistic M² in (3.15) used to test for a linear trend in an I × J table; namely, it equals M² applied when J = 2, except with (n − 1) replaced by n. When I = 2, X²(L) = 0 and z² = X²(I).
For Table 5.3 on alcohol consumption and malformation, X²(I) = 12.1. Using the same scores as in the linear logit model, the Cochran–Armitage trend test has z² = 6.6 (P-value = 0.010). The test suggests strong evidence of a positive slope. In addition,

X²(I) = 12.1 = 6.6 + 5.5,

where X²(L) = 5.5 (df = 3) shows only slight evidence of departure of the proportions from linearity. The trend test agrees with M² for the sample correlation of r = 0.014 for n = 32,573 (Section 3.4.5). For the chosen scores, the correlation seems weak. However, r has limited use as a descriptive measure for tables that are highly discrete and unbalanced.
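The partition can be verified numerically from Table 5.3 (a sketch using the same counts and scores as in the linear logit fit above):

```python
y = [48, 38, 5, 1, 1]                      # malformation cases
n = [17114, 14502, 793, 127, 38]           # row totals
x = [0.0, 0.5, 1.5, 4.0, 7.0]              # scores
N = sum(n)
p = sum(y) / N                             # overall success proportion
xbar = sum(ni * xi for ni, xi in zip(n, x)) / N

num = sum((xi - xbar) * yi for xi, yi in zip(x, y))
den = p * (1 - p) * sum(ni * (xi - xbar) ** 2 for ni, xi in zip(n, x))
z2 = num ** 2 / den                        # Cochran-Armitage trend statistic, df = 1
X2_I = sum(ni * (yi / ni - p) ** 2 for ni, yi in zip(n, y)) / (p * (1 - p))
X2_L = X2_I - z2                           # lack-of-fit component, df = I - 2
```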
The Cochran–Armitage trend test (i.e., the score test) usually gives results similar to the Wald or likelihood-ratio test of H₀: β = 0 in the linear logit model. The asymptotics work well even for quite small n when the {nᵢ} are equal and the {xᵢ} are equally spaced. With Table 5.3, the Wald statistic equals (β̂/SE)² = (0.317/0.125)² = 6.4 (P = 0.012) and the likelihood-ratio statistic equals 4.25 (P = 0.039). The highly unbalanced counts suggest that it is safest to use the likelihood function through the likelihood-ratio approach. This is also true for estimation: the profile likelihood 95% confidence interval of (0.02, 0.52) for β reported in Table 5.4 is preferable to the Wald interval of 0.317 ± 1.96(0.125) = (0.07, 0.56). Even though n is very large, exact inference based on the small-sample methods presented in Section 6.7.4 is relevant here.
5.4 MULTIPLE LOGISTIC REGRESSION

Like ordinary regression, logistic regression extends to models with multiple explanatory variables. For instance, the model for π(x) = P(Y = 1) at values x = (x₁, ..., x_p) of p predictors is

logit[π(x)] = α + β₁x₁ + β₂x₂ + ··· + β_p x_p.    (5.8)

The alternative formula, directly specifying π(x), is

π(x) = exp(α + β₁x₁ + β₂x₂ + ··· + β_p x_p) / [1 + exp(α + β₁x₁ + β₂x₂ + ··· + β_p x_p)].    (5.9)
The parameter βᵢ refers to the effect of xᵢ on the log odds that Y = 1, controlling for the other x_j. For instance, exp(βᵢ) is the multiplicative effect on the odds of a 1-unit increase in xᵢ, at fixed levels of the other x_j. An explanatory variable can be qualitative, using dummy variables for its categories.
5.4.1 Logit Models for Multiway Contingency Tables

When all variables are categorical, a multiway contingency table displays the data. We illustrate ideas with binary predictors X and Z. We treat the sample size at given combinations (i, k) of X and Z as fixed and regard the two counts on Y at each setting as binomial, with different binomials treated as independent. Denote the two categories for each variable by (0, 1), and let dummy variables for X and Z have x₁ = z₁ = 1 and x₂ = z₂ = 0. The model

logit[P(Y = 1)] = α + β₁xᵢ + β₂z_k    (5.10)

has main effects for X and Z but assumes an absence of interaction. The effect of one factor is the same at each level of the other.
At a fixed level z_k of Z, the effect on the logit of changing categories of X is

[α + β₁(1) + β₂z_k] − [α + β₁(0) + β₂z_k] = β₁.
This logit difference equals the difference of log odds, which is the log odds ratio between X and Y, fixing Z. Thus, exp(β₁) is the conditional odds ratio between X and Y. Controlling for Z, the odds of success when X = 1 equal exp(β₁) times the odds when X = 0. This conditional odds ratio is the same at each level of Z; that is, there is homogeneous XY association (Section 2.3.5). The lack of an interaction term in (5.10) implies a common odds ratio for the partial tables. When β₁ = 0, that common odds ratio equals 1. Then X and Y are independent in each partial table, or conditionally independent, given Z (Section 2.3.4).
Additivity on the logit scale is the generally accepted definition of no
interaction for categorical variables. However, one could, instead, define it as
additivity on some other scale, such as with probit or identity link. Significant
interaction can occur on one scale when there is none on another scale. In
some applications, a particular definition may be natural. For instance,
theory might assume an underlying normal distribution and predict that the
probit is an additive function of predictor effects.
A factor with I categories needs I − 1 dummy variables, as we showed in Section 5.3.2. An alternative representation of such factors resembles the way that ANOVA models often express them. The model formula

logit[P(Y = 1)] = α + β_i^X + β_k^Z    (5.11)

represents effects of X with parameters {β_i^X} and effects of Z with parameters {β_k^Z}. (The X and Z superscripts are merely labels and do not represent powers.) Model form (5.11) applies for any number of categories for X and Z. The parameter β_i^X denotes the effect on the logit of classification in category i of X. Conditional independence between X and Y, given Z, corresponds to β_1^X = β_2^X = ··· = β_I^X, whereby P(Y = 1) does not change as i changes.

For each factor, one parameter in (5.11) is redundant. Fixing one at 0, such as β_I^X = β_K^Z = 0, represents the category not having its own dummy variable. When X and Z have two categories, the parameterization in model (5.11) then corresponds to that in model (5.10) with β_1^X = β₁ and β_2^X = 0, and with β_1^Z = β₂ and β_2^Z = 0.
5.4.2 AIDS and AZT Example
Table 5.5 is from a study on the effects of AZT in slowing the development
of AIDS symptoms. In the study, 338 veterans whose immune systems were
beginning to falter after infection with the AIDS virus were randomly
assigned either to receive AZT immediately or to wait until their T cells
showed severe immune weakness. Table 5.5 cross-classifies the veterans’ race,
whether they received AZT immediately, and whether they developed AIDS
symptoms during the 3-year study.
In model (5.10), we identify X with AZT treatment (x₁ = 1 for immediate AZT use, x₂ = 0 otherwise) and Z with race (z₁ = 1 for whites, z₂ = 0 for blacks), for predicting the probability that AIDS symptoms developed. Thus, α is the log odds of developing AIDS symptoms for black subjects without immediate AZT use, β₁ is the increment to the log odds for those with immediate AZT use, and β₂ is the increment to the log odds for white
TABLE 5.5 Development of AIDS Symptoms by AZT Use and Race

                            Symptoms
Race       AZT Use       Yes        No
White      Yes            14        93
           No             32        81
Black      Yes            11        52
           No             12        43

Source: New York Times, Feb. 15, 1991.
TABLE 5.6 Computer Output for Logit Model with AIDS Symptoms Data

       Goodness-of-Fit Statistics
       Criterion    DF     Value     Pr > ChiSq
       Deviance      1     1.3835      0.2395
       Pearson       1     1.3910      0.2382

       Analysis of Maximum Likelihood Estimates
       Parameter    Estimate    Std Error    Wald Chi-Square    Pr > ChiSq
       Intercept    -1.0736      0.2629         16.6705           <.0001
       azt          -0.7195      0.2790          6.6507           0.0099
       race          0.0555      0.2886          0.0370           0.8476

       Odds Ratio Estimates
       Effect    Estimate    95% Wald Confidence Limits
       azt        0.487          0.282    0.841
       race       1.057          0.600    1.861

       Profile Likelihood Confidence Interval for Odds Ratios
       Effect    Estimate    95% Confidence Limits
       azt        0.487          0.279    0.835
       race       1.057          0.605    1.884

       Obs   race   azt    y      n      pi hat     lower      upper
       1      1      1    14     107    0.14962    0.09897    0.21987
       2      1      0    32     113    0.26540    0.19668    0.34774
       3      0      1    11      63    0.14270    0.08704    0.22519
       4      0      0    12      55    0.25472    0.16953    0.36396
subjects. Table 5.6 shows output. The estimated odds ratio between immediate AZT use and development of AIDS symptoms equals exp(−0.7195) = 0.487. For each race, the estimated odds of symptoms are half as high for those who took AZT immediately. The Wald confidence interval for this effect is exp[−0.720 ± 1.96(0.279)] = (0.28, 0.84). Similar results occur for the likelihood-based interval.
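The Wald interval for the AZT odds ratio is obtained by exponentiating the endpoints on the log odds scale (an arithmetic check using the values in Table 5.6):

```python
import math

beta_hat, se = -0.7195, 0.2790             # azt estimate and SE from Table 5.6
or_hat = math.exp(beta_hat)                # estimated conditional odds ratio
ci = (math.exp(beta_hat - 1.96 * se),      # Wald limits on the odds-ratio scale
      math.exp(beta_hat + 1.96 * se))
```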
The hypothesis of conditional independence of AZT treatment and development of AIDS symptoms, controlling for race, is H₀: β₁ = 0 in (5.10). The likelihood-ratio statistic comparing model (5.10) with the simpler model having β₁ = 0 equals 6.9 (df = 1), showing evidence of association (P = 0.01). The Wald statistic (β̂₁/SE)² = (−0.720/0.279)² = 6.65 provides similar results.
Table 5.7 shows parameter estimates for three ways of defining factor parameters in (5.11): (1) setting the last parameter equal to 0, (2) setting the first parameter equal to 0, and (3) having the parameters sum to zero. For each coding scheme, at a given combination of AZT use and race, the estimated probability of developing AIDS symptoms is the same. For instance, the intercept estimate plus the estimate for immediate AZT use plus the estimate for being white is −1.738 for each scheme, so the estimated probability
TABLE 5.7 Parameter Estimates for Logit Model Fitted to Table 5.5

                         Definition of Parameters
Parameter         Last = Zero    First = Zero    Sum = Zero
Intercept           -1.074         -1.738         -1.406
AZT    Yes          -0.720          0.000         -0.360
       No            0.000          0.720          0.360
Race   White         0.055          0.000          0.028
       Black         0.000         -0.055         -0.028
FIGURE 5.4 Estimated effects of AZT use and race on probability of developing AIDS symptoms (dots are sample proportions).
that white veterans with immediate AZT use develop AIDS symptoms equals exp(−1.738)/[1 + exp(−1.738)] = 0.15. The bottom of Table 5.6 shows point and interval estimates of the probabilities. Figure 5.4 shows a graphical representation of the sample proportions (the four dots) and the point estimates enclosed in 95% confidence intervals.

Similarly, for each coding scheme, β̂_1^X − β̂_2^X is identical and represents the conditional log odds ratio of X with the response, given Z. Here, exp(β̂_1^X − β̂_2^X) = exp(−0.720) = 0.49 estimates the common odds ratio between immediate AZT use and AIDS symptoms, for each race.
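The equivalence of the three coding schemes in Table 5.7 is easy to verify for the white, immediate-AZT cell (an arithmetic check on the rounded estimates, so the schemes agree only to within rounding):

```python
import math

# alpha-hat + beta-hat(AZT yes) + beta-hat(white) under each scheme
last_zero  = -1.074 + (-0.720) + 0.055
first_zero = -1.738 + 0.000 + 0.000
sum_zero   = -1.406 + (-0.360) + 0.028
prob = math.exp(first_zero) / (1 + math.exp(first_zero))   # estimated probability
```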
5.4.3 Goodness of Fit as a Likelihood-Ratio Test

The likelihood-ratio statistic −2(L₀ − L₁) tests whether certain model parameters are zero by comparing the maximized log likelihood L₁ for the fitted model M₁ with L₀ for a simpler model M₀. Denote this statistic for testing M₀, given that M₁ holds, by G²(M₀ | M₁). The goodness-of-fit statistic G²(M) is a special case in which M₀ = M and M₁ is the saturated model. In testing whether M fits, we test whether all parameters in the saturated model but not in M equal zero. The asymptotic df is the difference in the number of parameters in the two models, which is the number of binomials modeled minus the number of parameters in M.
We illustrate by checking the fit of model (5.10) for the AIDS data. For its
fit, white veterans with immediate AZT use had estimated probability 0.150
of developing AIDS symptoms during the study. Since 107 white veterans
took AZT, the fitted value is 107(0.150) = 16.0 for developing symptoms and
107(0.850) = 91.0 for not developing them. Similarly, one can obtain fitted
values for all eight cells in Table 5.5. The goodness-of-fit statistics comparing
these with the cell counts are G² = 1.38 and X² = 1.39. The model has four
binomials, one at each combination of AZT use and race. Since it has three
parameters, residual df = 4 − 3 = 1. The small G² and X² values suggest
that the model fits decently (P > 0.2).
For model (5.10), the odds ratio between X and Y is the same at each
level of Z. The goodness-of-fit test checks this structure. That is, the test also
provides a test of homogeneous odds ratios. For Table 5.5, homogeneity is
plausible. Since residual df = 1, the more complex model that adds an
interaction term and permits the two odds ratios to differ is saturated.
Let L_S denote the maximized log likelihood for the saturated model. As
discussed in Section 4.5.4, the likelihood-ratio statistic for comparing models
M₁ and M₀ is

    G²(M₀ | M₁) = −2(L₀ − L₁)
                = −2(L₀ − L_S) − [−2(L₁ − L_S)]
                = G²(M₀) − G²(M₁).

The test statistic comparing two models is identical to the difference in G²
goodness-of-fit statistics (deviances) for the two models. To illustrate, consider H₀: β₂ = 0 for the race effect with the AIDS data. The likelihood-ratio
statistic equals 0.04, suggesting that the simpler model is adequate. But this
equals G²(M₀) − G²(M₁) = 1.42 − 1.38, where M₀ is the simpler model
with β₂ = 0.
The model comparison statistic often has an approximate chi-squared null
distribution even when the separate G²(M_i) do not. For instance, when a
predictor is continuous or a contingency table has very small fitted values, the
sampling distribution of G²(M_i) may be far from chi-squared. Nonetheless, if
df for the comparison statistic is modest (as in comparing two models that
differ by a few parameters), the null distribution of G²(M₀ | M₁) is approximately chi-squared.
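A minimal sketch of this comparison for the AIDS data, using the deviances quoted above (for df = 1 the chi-squared tail probability reduces to a complementary error function, so no statistics library is needed):

```python
import math

G2_M0 = 1.42   # deviance of the simpler model (race effect dropped)
G2_M1 = 1.38   # deviance of model (5.10)

lr_stat = round(G2_M0 - G2_M1, 2)             # G2(M0 | M1) = 0.04, df = 1
# Chi-squared right-tail probability for df = 1: P(X > x) = erfc(sqrt(x/2)).
p_value = math.erfc(math.sqrt(lr_stat / 2))
print(lr_stat, round(p_value, 2))             # 0.04 0.84
```

The large P-value agrees with the conclusion in the text that the simpler model is adequate.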
5.4.4 Horseshoe Crab Example Revisited
Like ordinary regression, logistic regression can have a mixture of quantitative and qualitative predictors. We illustrate with the horseshoe crab data
(Section 5.1.3), using the female crab's width and color as predictors. Color
has five categories: light, medium light, medium, medium dark, dark. It is a
surrogate for age, older crabs tending to be darker. The sample contained no
light crabs, so our models use only the other four categories.
We first treat color as qualitative. The four categories use three dummy
variables. The model is
    logit(π) = α + β₁c₁ + β₂c₂ + β₃c₃ + β₄x,        (5.12)

where π = P(Y = 1), x = width in centimeters, and

    c₁ = 1 for medium-light color, and 0 otherwise,
    c₂ = 1 for medium color, and 0 otherwise,
    c₃ = 1 for medium-dark color, and 0 otherwise.
The crab color is dark (category 4) when c₁ = c₂ = c₃ = 0. Table 5.8 shows
the ML parameter estimates. For instance, for dark crabs, logit(π̂) =
−12.715 + 0.468x; by contrast, for medium-light crabs, c₁ = 1, and logit(π̂)
= (−12.715 + 1.330) + 0.468x = −11.385 + 0.468x. At the average width
of 26.3 cm, π̂ = 0.399 for dark crabs and 0.715 for medium-light crabs.
The model assumes a lack of interaction between color and width in their
effects. Width has the same coefficient (0.468) for all colors, so the shapes of
the curves relating width to π are identical. For each color, a 1-cm increase
in width has a multiplicative effect of exp(0.468) = 1.60 on the odds that
Y = 1. Figure 5.5 displays the fitted model. Any one curve equals any other
TABLE 5.8  Computer Output for Model with Width and Color Predictors

            Criteria For Assessing Goodness Of Fit
    Criterion             DF       Value
    Deviance             168    187.4570
    Pearson Chi-Square   168    168.6590
    Log Likelihood               −93.7285

                          Standard   Likelihood-Ratio 95%
    Parameter  Estimate   Error      Confidence Limits      Chi-Square   Pr>ChiSq
    intercept  −12.7151   2.7618     −18.4564    −7.5788      21.20      <.0001
    c1           1.3299   0.8525      −0.2738     3.1354       2.43      0.1188
    c2           1.4023   0.5484       0.3527     2.5260       6.54      0.0106
    c3           1.1061   0.5921      −0.0279     2.3138       3.49      0.0617
    width        0.4680   0.1055       0.2713     0.6870      19.66      <.0001
FIGURE 5.5 Logistic regression model using width and color predictors of satellite presence for horseshoe crabs.
curve shifted to the right or left. The parallelism of curves in the horizontal
dimension implies that any two curves never cross. At all width values, color
4 (dark) has a lower estimated probability of a satellite than the other colors.
There is a noticeable positive effect of width.
The exponentiated difference between two color parameter estimates is an
odds ratio comparing those colors. For instance, the difference for medium-light crabs and dark crabs equals 1.330. At any given width, the estimated
odds that a medium-light crab has a satellite are exp(1.330) = 3.8 times the
estimated odds for a dark crab. At width x = 26.3, the odds equal
0.715/0.285 = 2.51 for a medium-light crab and 0.399/0.601 = 0.66 for a
dark crab, for which 2.51/0.66 = 3.8.
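A brief sketch (our own variable names) that reproduces these numbers from the Table 5.8 estimates:

```python
import math

def expit(t):
    """Inverse logit transform."""
    return math.exp(t) / (1 + math.exp(t))

# Table 5.8 estimates for model (5.12); dummies c1, c2, c3 are the
# medium-light, medium, and medium-dark indicators (all 0 for dark crabs).
alpha, b1, b2, b3, b4 = -12.715, 1.330, 1.402, 1.106, 0.468

def prob_satellite(width, c1=0, c2=0, c3=0):
    return expit(alpha + b1 * c1 + b2 * c2 + b3 * c3 + b4 * width)

pi_dark = prob_satellite(26.3)          # about 0.40 at the average width
pi_ml = prob_satellite(26.3, c1=1)      # about 0.72 for medium-light crabs
odds_ratio = (pi_ml / (1 - pi_ml)) / (pi_dark / (1 - pi_dark))
print(round(odds_ratio, 1))             # 3.8, which equals exp(1.330)
```

Because the model has no interaction, the same odds ratio results at every width.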
5.4.5 Model Comparison
To test whether color contributes significantly to model (5.12), we test
H₀: β₁ = β₂ = β₃ = 0. This states that controlling for width, the probability
of a satellite is independent of color. We compare the maximized log likelihood L₁ for the full model (5.12) to L₀ for the simpler model. The test
statistic −2(L₀ − L₁) = 7.0 has df = 3, the difference between the numbers
of parameters in the two models. The chi-squared P-value of 0.07 provides
slight evidence of a color effect.
The more complex model allowing color × width interaction has three
additional terms, the cross-products of width with the color dummy variables.
Fitting this model is equivalent to fitting the logistic regression with width
predictor separately for crabs of each color. Each color then has a different-shaped curve relating width to P(Y = 1), so a comparison of two colors
varies according to the width value. The likelihood-ratio statistic comparing
the models with and without the interaction terms equals 4.4, with df = 3.
The evidence of interaction is weak (P = 0.22).
5.4.6 Quantitative Treatment of Ordinal Predictor
Color has ordered categories, from lightest to darkest. A simpler model yet
treats this predictor as quantitative. Color may have a linear effect, for a set
of monotone scores. To illustrate, for scores {1, 2, 3, 4} for the color
categories, the model

    logit(π) = α + β₁c + β₂x        (5.13)

has β̂₁ = −0.509 (SE = 0.224) and β̂₂ = 0.458 (SE = 0.104). This shows
strong evidence of an effect for each. At a given width, for every one-category
increase in color darkness, the estimated odds of a satellite multiply by
exp(−0.509) = 0.60.
The likelihood-ratio statistic comparing this fit to the more complex model
(5.12) having a separate parameter for each color equals 1.7 (df = 2). This
statistic tests that the simpler model (5.13) is adequate, given that model
(5.12) holds. It tests that when plotted against the color scores, the color
parameters in (5.12) follow a linear trend. The simplification seems permissible (P = 0.44).
The color parameter estimates in the qualitative-color model (5.12) are
(1.33, 1.40, 1.11, 0), the 0 value for the dark category reflecting its lack of a
dummy variable. Although these values do not depart significantly from a
linear trend, the first three are quite similar compared to the last one. Thus,
another potential color scoring for model (5.13) is {1, 1, 1, 0}; that is, score = 0
for dark-colored crabs, and score = 1 otherwise. The likelihood-ratio statistic
comparing model (5.13) with these binary scores to model (5.12) equals 0.5
(df = 2), showing that this simpler model is also adequate. Its fit is

    logit(π̂) = −12.980 + 1.300c + 0.478x,        (5.14)

with standard errors 0.526 and 0.104. At a given width, the estimated odds
that a lighter-colored crab has a satellite are exp(1.300) = 3.7 times the
estimated odds for a dark crab.
In summary, the qualitative-color model, the quantitative-color model with
scores {1, 2, 3, 4}, and the model with binary color scores {1, 1, 1, 0} all suggest
that dark crabs are least likely to have satellites. A much larger sample is
needed to determine which color scoring is most appropriate. It is advantageous to treat ordinal predictors in a quantitative manner when such models
fit well. The model is simpler and easier to interpret, and tests of the
predictor effect are more powerful when it has a single parameter rather
than several parameters. In Section 6.4 we discuss this issue further.
5.4.7 Standardized and Probability-Based Interpretations
To compare effects of quantitative predictors having different units, it can be
helpful to report standardized coefficients. One approach fits the model to
standardized predictors, replacing each x_j by (x_j − x̄_j)/s_{x_j}. Then, each
regression coefficient represents the effect of a standard deviation change in
a predictor, controlling for the other variables. Equivalently, for each j one
can multiply the unstandardized estimate β̂_j by s_{x_j} (see also Note 5.9).
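A sketch of the second approach (the width values here are hypothetical, not the actual crab measurements):

```python
import statistics

beta_width = 0.478      # unstandardized width estimate from fit (5.14)
widths = [23.1, 25.2, 26.0, 26.3, 27.1, 28.4, 29.0]   # hypothetical sample

# Multiplying the estimate by the predictor's sample standard deviation
# gives the log-odds effect of a one-SD change in width.
s_x = statistics.stdev(widths)
beta_standardized = beta_width * s_x
print(round(beta_standardized, 2))
```

With several quantitative predictors, repeating this for each one puts their effects on a common scale.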
Regardless of the units, many find it difficult to understand odds or odds
ratio effects. The simpler interpretation of the approximate change in the
probability based on a linearization of the model (Section 5.1.1) applies
also to multiple predictors. Consider a setting of predictors at which
P̂(Y = 1) = π̂. Then, controlling for the other predictors, a 1-unit increase in
x_j corresponds approximately to a β̂_j π̂(1 − π̂) change in π̂. For instance, at
predictor settings at which π̂ = 0.5 for fit (5.14), the approximate effect of
a 1-cm increase in width is (0.478)(0.5)(0.5) = 0.12. This is considerable,
since a 1-cm change in width is less than half a standard deviation.
This linear approximation deteriorates as the change in the predictor
increases. More precise interpretations use the probability formula directly.
To describe the effect of x_j, one could set the other predictors at their
sample means and compute the estimated probabilities at the smallest and
largest x_j values. These are sensitive to outliers, however. It is often more
sensible to use the quartiles.
For fit (5.14), the sample means are 26.3 for x and 0.873 for c. The lower
and upper quartiles of x are 24.9 and 27.7. At x = 24.9 and c = c̄, π̂ = 0.51.
At x = 27.7 and c = c̄, π̂ = 0.80. The change in π̂ from 0.51 to 0.80 over the
middle 50% of the range of width values reflects a strong width effect. Since
c takes only values 0 and 1, one could instead report this effect separately for
each. Also, when an explanatory variable is a dummy, it makes sense to
report the estimated probabilities at its two values rather than at quartiles,
which could be identical. At x = 26.3, π̂ = 0.40 when c = 0 and π̂ = 0.71
when c = 1. This color effect, differentiating dark crabs from others, is also
substantial.
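These quartile and dummy-variable summaries can be reproduced directly from fit (5.14); a minimal sketch:

```python
import math

def expit(t):
    return math.exp(t) / (1 + math.exp(t))

# Fit (5.14): logit(pi-hat) = -12.980 + 1.300 c + 0.478 x.
alpha, beta_c, beta_x = -12.980, 1.300, 0.478
c_bar = 0.873                     # sample mean of the binary color score
LQ, UQ = 24.9, 27.7               # lower and upper quartiles of width

pi_lq = expit(alpha + beta_c * c_bar + beta_x * LQ)   # about 0.51
pi_uq = expit(alpha + beta_c * c_bar + beta_x * UQ)   # about 0.80

# Color effect at the mean width, comparing c = 1 to c = 0:
pi_c0 = expit(alpha + beta_x * 26.3)                  # about 0.40
pi_c1 = expit(alpha + beta_c + beta_x * 26.3)         # about 0.71
print(round(pi_uq - pi_lq, 2), round(pi_c1 - pi_c0, 2))   # 0.29 0.31
```

The two printed changes in probability are the quantities reported in Table 5.9 for the no-interaction model.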
Table 5.9 shows a way to present effects that can be understandable to
those not familiar with odds ratios. It also shows results of the extension of
model (5.14), permitting interaction. The estimated width effect is then
greater for the lighter-colored crabs. However, the interaction is not significant.
TABLE 5.9  Summary of Effects in Model (5.14) with Crab Width and Color
as Predictors of Presence of Satellites

    Variable                   Estimate    SE      Comparison           Change in Probability
    No interaction model
      Intercept               −12.980     2.727
      Color (0 = dark,
        1 = other)              1.300     0.526    (1, 0) at x̄          0.31 = 0.71 − 0.40
      Width, x (cm)             0.478     0.104    (UQ, LQ) at c̄        0.29 = 0.80 − 0.51
    Interaction model
      Intercept                −5.854     6.694
      Color (0 = dark,
        1 = other)             −6.958     7.318
      Width, x (cm)             0.200     0.262    (UQ, LQ) at c = 0    0.13 = 0.43 − 0.30
      Width × color             0.322     0.286    (UQ, LQ) at c = 1    0.29 = 0.84 − 0.55

5.5 FITTING LOGISTIC REGRESSION MODELS
The mechanics of ML estimation and model fitting for logistic regression are
special cases of the GLM fitting results of Section 4.6. With n subjects, we
treat the n binary responses as independent. Let x_i = (x_{i1}, ..., x_{ip}) denote
setting i of values of p explanatory variables, i = 1, ..., N. When explanatory variables are continuous, a different setting may occur for each subject,
in which case N = n. The logistic regression model (5.8), regarding α as a
regression parameter with unit coefficient, is

    π(x_i) = exp(Σ_{j=1}^p β_j x_{ij}) / [1 + exp(Σ_{j=1}^p β_j x_{ij})].        (5.15)

5.5.1 Likelihood Equations
When more than one observation occurs at a fixed x_i value, it is sufficient to
record the number of observations n_i and the number of successes. We then
let y_i refer to this success count rather than to an individual binary response.
Then {Y₁, ..., Y_N} are independent binomials with E(Y_i) = n_i π(x_i), where
n₁ + ··· + n_N = n. Their joint probability mass function is proportional to the
product of N binomial functions,

    Π_{i=1}^N π(x_i)^{y_i} [1 − π(x_i)]^{n_i − y_i}
        = { Π_{i=1}^N exp[ y_i log( π(x_i)/(1 − π(x_i)) ) ] } { Π_{i=1}^N [1 − π(x_i)]^{n_i} }
        = exp{ Σ_i y_i log[ π(x_i)/(1 − π(x_i)) ] } Π_{i=1}^N [1 − π(x_i)]^{n_i}.
For model (5.15), the ith logit is Σ_j β_j x_{ij}, so the exponential term in the
last expression equals exp[Σ_i y_i (Σ_j β_j x_{ij})] = exp[Σ_j (Σ_i y_i x_{ij}) β_j]. Also, since
[1 − π(x_i)] = [1 + exp(Σ_j β_j x_{ij})]⁻¹, the log likelihood equals

    L(β) = Σ_j (Σ_i y_i x_{ij}) β_j − Σ_i n_i log[1 + exp(Σ_j β_j x_{ij})].        (5.16)

This depends on the binomial counts only through the sufficient statistics
{Σ_i y_i x_{ij}, j = 1, ..., p}.
The likelihood equations result from setting ∂L(β)/∂β = 0. Since

    ∂L(β)/∂β_j = Σ_i y_i x_{ij} − Σ_i n_i x_{ij} exp(Σ_k β_k x_{ik}) / [1 + exp(Σ_k β_k x_{ik})],

the likelihood equations are

    Σ_i y_i x_{ij} − Σ_i n_i π̂_i x_{ij} = 0,    j = 1, ..., p,        (5.17)

where π̂_i = exp(Σ_k β̂_k x_{ik}) / [1 + exp(Σ_k β̂_k x_{ik})] is the ML estimate of π(x_i).
We observed these equations as a special case of those for binomial GLMs in
(4.25) (but there y_i is the proportion of successes). The equations are
nonlinear and require iterative solution.
Let X denote the N × p matrix of values of {x_{ij}}. The likelihood equations
(5.17) have the form

    X′y = X′μ̂,        (5.18)

where μ̂_i = n_i π̂_i. This equation illustrates a fundamental result: for GLMs
with canonical link, the likelihood equations equate the sufficient statistics to
the estimates of their expected values. Equation (4.44) showed this result in
the GLM context, and (5.18) are the normal equations in ordinary regression.
5.5.2 Asymptotic Covariance Matrix of Parameter Estimators

The ML estimators β̂ have a large-sample normal distribution with covariance matrix equal to the inverse of the information matrix. The observed
information matrix has elements

    −∂²L(β)/∂β_a ∂β_b = Σ_i x_{ia} x_{ib} n_i exp(Σ_j β_j x_{ij}) / [1 + exp(Σ_j β_j x_{ij})]²
                      = Σ_i x_{ia} x_{ib} n_i π_i (1 − π_i).        (5.19)

This is not a function of {y_i}, so the observed and expected information are
identical. This happens for all GLMs that use canonical links (Section 4.6.4).
The estimated covariance matrix is the inverse of the matrix having
elements (5.19), substituting β̂. This has the form

    ĉov(β̂) = {X′ diag[n_i π̂_i (1 − π̂_i)] X}⁻¹,        (5.20)

where diag[n_i π̂_i (1 − π̂_i)] denotes the N × N diagonal matrix having
{n_i π̂_i (1 − π̂_i)} on the main diagonal. This is the special case of the GLM
covariance matrix (4.28) with estimated diagonal weight matrix Ŵ having
elements ŵ_i = n_i π̂_i (1 − π̂_i). The square roots of the main diagonal elements
of (5.20) are estimated standard errors of β̂.
5.5.3 Distribution of Probability Estimators

Using ĉov(β̂), one can conduct inference about β and related effects such as
odds ratios. One can also construct confidence intervals for response probabilities π(x) at particular settings x.
The estimated variance of logit[π̂(x)] = xβ̂ is x ĉov(β̂) x′. For large samples, logit[π̂(x)] ± z_{α/2} [x ĉov(β̂) x′]^{1/2} is a confidence interval for the true logit.
The endpoints invert to a corresponding interval for π(x) using the transform
π = exp(logit)/[1 + exp(logit)].
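As an illustration, the interval can be computed from a fitted β̂ and its estimated covariance matrix; the numbers below are taken from the remission-data output shown later in Table 5.11 (Problem 5.1), at the setting LI = 8:

```python
import math

beta_hat = [-3.7771, 0.1449]            # (intercept, LI) estimates
cov = [[1.900616, -0.07653],            # estimated covariance matrix of beta-hat
       [-0.07653, 0.003521]]
x = [1.0, 8.0]                          # setting: constant term, LI = 8
z = 1.96                                # normal quantile for a 95% interval

# Estimated logit and its variance x cov(beta-hat) x'.
logit_hat = sum(xi * bi for xi, bi in zip(x, beta_hat))
var_logit = sum(x[a] * cov[a][b] * x[b] for a in range(2) for b in range(2))
half_width = z * math.sqrt(var_logit)

# Invert the interval endpoints from the logit scale to the probability scale.
expit = lambda t: math.exp(t) / (1 + math.exp(t))
lo = expit(logit_hat - half_width)
hi = expit(logit_hat + half_width)
print(round(lo, 3), round(hi, 3))       # about (0.011, 0.319), as in Table 5.11
```

The interval is asymmetric about π̂(x) = 0.068, which is typical when the estimated probability is near 0 or 1.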
5.5.4 Newton-Raphson Method Applied to Logistic Regression

We refer back to Section 4.6.1 for the Newton-Raphson iterative method.
Let

    u_j^{(t)} = ∂L(β)/∂β_j |_{β^{(t)}} = Σ_i (y_i − n_i π_i^{(t)}) x_{ij},

    h_{ab}^{(t)} = ∂²L(β)/∂β_a ∂β_b |_{β^{(t)}} = −Σ_i x_{ia} x_{ib} n_i π_i^{(t)} (1 − π_i^{(t)}).

Here π^{(t)}, approximation t for π̂, is obtained from β^{(t)} through

    π_i^{(t)} = exp(Σ_{j=1}^p β_j^{(t)} x_{ij}) / [1 + exp(Σ_{j=1}^p β_j^{(t)} x_{ij})].        (5.21)

We use u^{(t)} and H^{(t)} with formula (4.39) to obtain the next value β^{(t+1)}, which
in this context is

    β^{(t+1)} = β^{(t)} + {X′ diag[n_i π_i^{(t)} (1 − π_i^{(t)})] X}⁻¹ X′(y − μ^{(t)}),        (5.22)

where μ_i^{(t)} = n_i π_i^{(t)}. This is used to obtain π^{(t+1)}, and so forth.
With an initial guess β^{(0)}, (5.21) yields π^{(0)}, and for t > 0 the iterations
proceed as just described using (5.22) and (5.21). In the limit, π^{(t)} and β^{(t)}
converge to the ML estimates π̂ and β̂ (Walker and Duncan 1967). The H^{(t)}
matrices converge to Ĥ = −X′ diag[n_i π̂_i (1 − π̂_i)] X. By (5.20) the estimated
asymptotic covariance matrix of β̂ is a by-product of the Newton-Raphson
method, namely −Ĥ⁻¹.
From the argument in Section 4.6.3, β^{(t+1)} has the iterative reweighted
least squares form (X′V_t⁻¹X)⁻¹ X′V_t⁻¹ z^{(t)}, where z^{(t)} has elements

    z_i^{(t)} = log[ π_i^{(t)} / (1 − π_i^{(t)}) ] + (y_i − n_i π_i^{(t)}) / [n_i π_i^{(t)} (1 − π_i^{(t)})],        (5.23)

and where V_t is a diagonal matrix with elements {1/[n_i π_i^{(t)} (1 − π_i^{(t)})]}. In this
expression, z^{(t)} is the linearized form of the logit link function for the sample
data, evaluated at π^{(t)} [see (4.42)]. From Section 3.1.6 the elements of V_t are
estimated asymptotic variances of the sample logits. The ML estimate is the
limit of a sequence of weighted least squares estimates, where the weight
matrix changes at each cycle.
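The whole algorithm fits in a few lines of matrix code. The sketch below (our own implementation, not from the text) iterates (5.21)-(5.22) and returns the covariance matrix (5.20) as a by-product; as a check, it is applied to the remission data of Table 5.10, whose fitted model appears in Table 5.11:

```python
import numpy as np

def fit_logistic(X, y, n, tol=1e-8, max_iter=25):
    """Newton-Raphson for binomial logistic regression.

    X: N x p design matrix; y: success counts; n: binomial sample sizes.
    Returns (beta_hat, cov_hat), with cov_hat given by (5.20).
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi = 1 / (1 + np.exp(-X @ beta))                          # (5.21)
        W = np.diag(n * pi * (1 - pi))
        step = np.linalg.solve(X.T @ W @ X, X.T @ (y - n * pi))   # (5.22)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    pi = 1 / (1 + np.exp(-X @ beta))
    cov = np.linalg.inv(X.T @ np.diag(n * pi * (1 - pi)) @ X)     # (5.20)
    return beta, cov

# Check against the remission data (Table 5.10): LI, cases, remissions.
LI = np.array([8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 32, 34, 38.0])
n = np.array([2, 2, 3, 3, 3, 1, 3, 2, 1, 1, 1, 1, 1, 3.0])
y = np.array([0, 0, 0, 0, 0, 1, 2, 1, 0, 1, 1, 0, 1, 2.0])
X = np.column_stack([np.ones_like(LI), LI])

beta, cov = fit_logistic(X, y, n)
print(np.round(beta, 4))                    # roughly [-3.7771  0.1449], as in Table 5.11
print(np.round(np.sqrt(np.diag(cov)), 4))   # roughly [1.3786  0.0593]
```

For these data the iterations converge in a handful of cycles; the standard errors come from the by-product covariance matrix, as described above.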
5.5.5 Convergence and Existence of Finite Estimates
The log-likelihood function for logistic regression models is strictly concave.
ML estimates exist and are unique except in certain boundary cases (Haberman 1974a; Wedderburn 1976; Albert and Anderson 1984). Estimates do not
exist or may be infinite when there is no overlap in the sets of explanatory
variable values having y = 0 and having y = 1; that is, when a hyperplane
can pass through the space of predictor values such that on one side of that
hyperplane y = 0 for all observations, whereas on the other side, y = 1
always. There is then perfect discrimination, as one can predict the sample
outcomes perfectly by knowing the predictor values (except possibly at a
boundary point). When there is overlap, ML estimates exist and are unique.
Similar results occur for the probit and some other links (Silvapulle 1981).
Figure 5.6 illustrates for a single explanatory variable. Here, y = 0 at
x = 10, 20, 30, 40, and y = 1 at x = 60, 70, 80, 90. An ideal fit has π̂ = 0 for
x ≤ 40 and π̂ = 1 for x ≥ 60. By letting β̂ → ∞ and, for fixed β̂, letting
α̂ = −β̂(50) so that π̂ = 0.5 at x = 50, one generates a sequence with
ever-increasing value of the likelihood that comes successively closer to a
perfect fit.
In practice, most software fails to recognize that β̂ = ∞. After a few cycles
of iterative fitting, the log likelihood looks flat at the working estimate, and
convergence criteria are satisfied. Because the log likelihood is so flat and
because variances come from the inverse of the matrix of negative second
derivatives, software typically reports huge standard errors. For these data,
for instance, PROC GENMOD in SAS reports logit(π̂) = −192.2 + 3.8x
with standard errors of 8.0 × 10⁸ and 1.5 × 10⁷.
FIGURE 5.6 Perfect discrimination resulting in infinite logistic regression parameter estimate.
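One can watch this happen by running a few Newton-Raphson cycles by hand on the Figure 5.6 data; this sketch (our own code) shows the slope estimate growing without bound while the log likelihood climbs toward its supremum of 0:

```python
import numpy as np

# Perfectly separated data of Figure 5.6: y = 0 below x = 50, y = 1 above.
x = np.array([10, 20, 30, 40, 60, 70, 80, 90.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1.0])
X = np.column_stack([np.ones_like(x), x])

beta = np.zeros(2)
slopes, logliks = [], []
for t in range(5):
    pi = 1 / (1 + np.exp(-X @ beta))
    logliks.append(float(np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))))
    W = np.diag(pi * (1 - pi))
    beta = beta + np.linalg.solve(X.T @ W @ X, X.T @ (y - pi))
    slopes.append(float(beta[1]))

print(slopes)    # strictly increasing: the "estimate" diverges to infinity
print(logliks)   # strictly increasing toward 0: the likelihood has no maximum
```

By the symmetry of these data, every iterate satisfies α^{(t)} = −50 β^{(t)}, so the fitted curve always passes through π = 0.5 at x = 50 while steepening at each cycle, exactly the divergent sequence described above.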
NOTES
Section 5.1: Interpreting Parameters in Logistic Regression
5.1. Books focusing on applied logistic regression include Collett (1991) and Hosmer and
Lemeshow (2000). Books having major components on logistic regression include Christensen (1997), Cox and Snell (1989), and Morgan (1992). Prentice (1976b) and Stukel
(1988) extended the scope by introducing shape parameters that modify the behavior of
the curve in extreme probability regions and allow for asymmetric treatment of the two
tails.
5.2. Haldane (1956) recommended adding 1/2 to the numerator and denominator of the
sample logit. With this modification, the bias is on the order of only 1/n_i², for large n_i
(see Firth 1993a and Problem 14.4).
5.3. The Cornfield (1962) result about normal distributions for (X | Y = i) implying the
logistic curve for P(Y = 1 | x) suggests that logistic regression is useful in discrimination
and classification problems. These use a subject's x value to predict to which of two
populations they belong. Anderson (1975), Bull and Donner (1987), Efron (1975), and
Press and Wilson (1978) compared logistic regression favorably to discriminant analysis,
which assumes that explanatory variables have a normal distribution at each level of Y.
5.4. Rosenbaum and Rubin (1983) used logistic regression to adjust for bias in comparing
two groups in observational studies. They defined the propensity as the probability of
being in one group, for a given setting of the explanatory variables x, and they used
logistic regression to estimate how propensity depends on x. In comparing the groups on
the response variable, they showed that one can control for differing distributions of the
groups on x by adjusting for the estimated propensity. This is done by using the
propensity to match samples from the groups, or to subclassify subjects into several strata
consisting of intervals of propensity scores, or to adjust directly by entering the propensity in the model. See D'Agostino (1998) for a tutorial.
5.5. Abdelbasit and Plackett (1983), Chaloner and Larntz (1988), Minkin (1987), and Wu
(1985) discussed design problems for binary response experiments, such as choosing
settings for a predictor to optimize a criterion for estimating parameter values or
estimating the setting at which the response probability equals some fixed value. The
nonconstant variance makes this challenging.
Section 5.2: Inference for Logistic Regression
5.6. Albert and Anderson (1984), Berkson (1951, 1953, 1955), Cox (1958a), Hodges (1958),
and Walker and Duncan (1967) discussed ML estimation for logistic regression. For
adjustments with complex sample surveys, see Hosmer and Lemeshow (2000, Sec. 6.4)
and LaVange et al. (2001). Scott and Wild (2001) discussed the analyses of case-control
studies with complex sampling designs.
5.7. Tsiatis (1980) suggested an alternative goodness-of-fit test that partitions values for the
explanatory variables into a set of regions and adds a dummy variable to the model for
each region. The test statistic compares the fit of this model to the simpler one, testing
that the extra parameters are not needed. The idea of grouping values to check model fit
by comparing observed and fitted counts extends to any GLM (Pregibon 1982). Hosmer
et al. (1997) compared various ways of doing this.
Section 5.3: Logit Models with Categorical Predictors
5.8. The Cochran-Armitage trend test is locally asymptotically efficient for both linear and
logistic alternatives for P(Y = 1). Its efficiency against linear alternatives follows from
the approximate normality of the sample proportions, with constant Bernoulli variance
when β = 0. For the linear logit model (5.5), its efficiency follows from its equivalence
with the score test. See Problem 9.35 and Cox (1958a) for related remarks. Tarone and
Gart (1980) showed that the score test for a binary linear trend model does not depend
on the link function. Gross (1981) noted that for the linear logit model, the local
asymptotic relative efficiency for testing independence using the statistic with an
incorrect set of scores equals the square of the Pearson correlation between the true and
incorrect scores. Simon (1978) gave related asymptotic results. Corcoran et al. (2001),
Mantel (1963), and Podgor et al. (1996) extended the trend test.
Section 5.4: Multiple Logistic Regression
5.9. Since the standardized logistic cdf has standard deviation π/√3, some software (e.g.,
PROC LOGISTIC in SAS) defines a standardized estimate by multiplying the unstandardized estimate by s_{x_j}√3/π.
PROBLEMS
Applications
5.1 For a study using logistic regression to determine characteristics associated with remission in cancer patients, Table 5.10 shows the most
important explanatory variable, a labeling index (LI). This index measures proliferative activity of cells after a patient receives an injection
of tritiated thymidine, representing the percentage of cells that are
"labeled." The response Y measured whether the patient achieved
remission (1 = yes). Software reports Table 5.11 for a logistic regression model using LI to predict the probability of remission.
TABLE 5.10 Data for Problem 5.1

         Number    Number of         Number    Number of         Number    Number of
    LI   of Cases  Remissions   LI   of Cases  Remissions   LI   of Cases  Remissions
     8      2         0         18      1         1         28      1         1
    10      2         0         20      3         2         32      1         0
    12      3         0         22      2         1         34      1         1
    14      3         0         24      1         0         38      3         2
    16      3         0         26      1         1

Source: Data reprinted with permission from E. T. Lee, Comput. Prog. Biomed. 4: 80-92 (1974).
TABLE 5.11 Computer Output for Problem 5.1

                     Intercept    Intercept and
    Criterion        Only         Covariates
    - 2 Log L        34.372       26.073

            Testing Global Null Hypothesis: BETA = 0
    Test                Chi-Square    DF    Pr > ChiSq
    Likelihood Ratio      8.2988       1      0.0040
    Score                 7.9311       1      0.0049
    Wald                  5.9594       1      0.0146

    Parameter    Estimate    Standard Error    Chi-Square    Pr > ChiSq
    Intercept    -3.7771        1.3786           7.5064        0.0061
    li            0.1449        0.0593           5.9594        0.0146

                  Odds Ratio Estimates
    Effect    Point Estimate    95% Wald Confidence Limits
    li             1.156              1.029      1.298

            Estimated Covariance Matrix
    Variable      Intercept         li
    Intercept     1.900616      -0.07653
    li           -0.07653        0.003521

    Obs    li    remiss    n    pi_hat     lower      upper
     1      8      0       2    0.06797    0.01121    0.31925
     2     10      0       2    0.08879    0.01809    0.34010
a. Show how software obtained π̂ = 0.068 when LI = 8.
b. Show that π̂ = 0.5 when LI = 26.0.
c. Show that the rate of change in π̂ is 0.009 when LI = 8 and 0.036
   when LI = 26.
d. The lower quartile and upper quartile for LI are 14 and 28. Show
   that π̂ increases by 0.42, from 0.15 to 0.57, between those values.
e. For a unit change in LI, show that the estimated odds of remission
   multiply by 1.16.
f. Explain how to obtain the confidence interval reported for the odds
   ratio. Interpret.
g. Construct a Wald test for the effect. Interpret.
h. Conduct a likelihood-ratio test for the effect, showing how to
   construct the test statistic using the -2 Log L values reported.
i. Show how software obtained the confidence interval for π reported
   at LI = 8. (Hint: Use the reported covariance matrix.)
TABLE 5.12 Data for Problem 5.2ᵃ

    Ft  Temp  TD    Ft  Temp  TD    Ft  Temp  TD    Ft  Temp  TD    Ft  Temp  TD
     1   66    0     2   70    1     3   69    0     4   68    0     5   67    0
     6   72    0     7   73    0     8   70    0     9   57    1    10   63    1
    11   70    1    12   78    0    13   67    0    14   53    1    15   67    0
    16   75    0    17   70    0    18   81    0    19   76    0    20   79    0
    21   75    1    22   76    0    23   58    1

ᵃFt, flight number; Temp, temperature (°F); TD, thermal distress (1, yes; 0, no).
Source: Data based on Table 1 in J. Amer. Statist. Assoc. 84: 945-957 (1989), by S. R. Dalal,
E. B. Fowlkes, and B. Hoadley. Reprinted with permission from the Journal of the American
Statistical Association.
5.2 For the 23 space shuttle flights before the Challenger mission disaster
in 1986, Table 5.12 shows the temperature at the time of the flight and
whether at least one primary O-ring suffered thermal distress.
a. Use logistic regression to model the effect of temperature on the
   probability of thermal distress. Plot a figure of the fitted model, and
   interpret.
b. Estimate the probability of thermal distress at 31°F, the temperature at the place and time of the Challenger flight.
c. Construct a confidence interval for the effect of temperature on the
   odds of thermal distress, and test the statistical significance of the
   effect.
d. Check the model fit by comparing it to a more complex model.
5.3 Refer to Table 4.2. Using scores {0, 2, 4, 5} for snoring, fit the logistic
regression model. Interpret using fitted probabilities, linear approximations, and effects on the odds. Analyze the goodness of fit.
5.4 Hastie and Tibshirani (1990, p. 282) described a study to determine
risk factors for kyphosis, severe forward flexion of the spine following
corrective spinal surgery. The ages in months at the time of the
operation for the 18 subjects for whom kyphosis was present were 12,
15, 42, 52, 59, 73, 82, 91, 96, 105, 114, 120, 121, 128, 130, 139, 139, 157
and for 22 of the subjects for whom kyphosis was absent were 1, 1, 2, 8,
11, 18, 22, 31, 37, 61, 72, 81, 97, 112, 118, 127, 131, 140, 151, 159, 177,
206.
a. Fit a logistic regression model using age as a predictor of whether
   kyphosis is present. Test whether age has a significant effect.
b. Plot the data. Note the difference in dispersion on age at the two
   levels of kyphosis. Fit the model logit[π(x)] = α + β₁x + β₂x².
   Test the significance of the squared age term, plot the fit, and
   interpret. (Note also Problem 5.33.)
5.5 Refer to Table 6.11. The Pearson test of independence has X²(I) =
6.88 (P = 0.14). For equally spaced scores, the Cochran-Armitage
trend test has z² = 6.67 (P = 0.01). Interpret, and explain why results
differ so. Analyze the data using a linear logit model. Test independence using the Wald and likelihood-ratio tests, and compare results
to the Cochran-Armitage test. Check the fit of the model, and interpret.
5.6 For Table 5.3, conduct the trend test using alcohol consumption scores
(1, 2, 3, 4, 5) instead of (0.0, 0.5, 1.5, 4.0, 7.0). Compare results, noting
the sensitivity to the choice of scores for highly unbalanced data.
5.7 Refer to Table 2.11. Using scores (0, 3, 9.5, 19.5, 37, 55) for cigarette
smoking, analyze these data using a logit model. Is the intercept
estimate meaningful? Explain.
5.8 A study used the 1998 Behavioral Risk Factors Social Survey to
consider factors associated with women's use of oral contraceptives in
the United States. Table 5.13 summarizes effects for a logistic regression model for the probability of using oral contraceptives. Each
predictor uses a dummy variable, and the table lists the category
having dummy outcome 1. Interpret effects. Construct and interpret a
confidence interval for the conditional odds ratio between contraceptive use and education.
TABLE 5.13 Data for Problem 5.8

    Variable         Coding = 1 if:       Estimate     SE
    Age              35 or younger        -1.320      0.087
    Race             White                 0.622      0.098
    Education        ≥ 1 year college      0.501      0.077
    Marital status   Married              -0.460      0.073

Source: Data courtesy of Debbie Wilson, College of Pharmacy, University of Florida.
TABLE 5.14 Computer Output for Problem 5.9

            Criteria For Assessing Goodness Of Fit
    Criterion            DF      Value
    Deviance              1      0.3798
    Pearson Chi-Square    1      0.1978
    Log Likelihood             -209.4783

                            Standard    Likelihood Ratio
    Parameter   Estimate    Error       95% Conf Limits       Chi-Square
    Intercept   -3.5961     0.5069      -4.7754    -2.7349      50.33
    def         -0.8678     0.3671      -1.5633    -0.1140       5.59
    vic          2.4044     0.6006       1.3068     3.7175      16.03

               LR Statistics
    Source    DF    Chi-Square    Pr > ChiSq
    def        1       5.01         0.0251
    vic        1      20.35         <.0001

5.9
Refer to Table 2.6. Table 5.14 shows the results of fitting a logit model,
treating death penalty as the response Ž1 s yes. and defendant’s race
Ž1 s white. and victims’ race Ž1 s white. as dummy predictors.
a. Interpret parameter estimates. Which group is most likely to have
the yes response? Find the estimated probability in that case.
b. Interpret 95% confidence intervals for conditional odds ratios.
c. Test the effect of defendant’s race, controlling for victims’ race,
using a Ži. Wald test, and Žii. likelihood-ratio test. Interpret.
d. Test the goodness of fit. Interpret.
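The estimated probabilities in part (a) follow directly from the prediction equation with the Table 5.14 estimates; a short sketch evaluating all four defendant-victim combinations:

```python
import math

# Parameter estimates from Table 5.14 (1 = white for both dummies)
intercept, b_def, b_vic = -3.5961, -0.8678, 2.4044

probs = {}
for d in (0, 1):        # defendant's race
    for v in (0, 1):    # victims' race
        eta = intercept + b_def * d + b_vic * v
        probs[(d, v)] = 1.0 / (1.0 + math.exp(-eta))

best = max(probs, key=probs.get)
print(best, round(probs[best], 3))  # (0, 1): black defendant, white victims, about 0.233
```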
5.10 Model the effects of victim’s race and defendant’s race for Table 2.13.
Interpret.
5.11 Table 5.15 appeared in a national study of 15- and 16-year-old adolescents. The event of interest is ever having sexual intercourse. Analyze,

TABLE 5.15  Data for Problem 5.11

                        Intercourse
Race     Gender        Yes      No
White    Male           43     134
         Female         26     149
Black    Male           29      23
         Female         22      36

Source: S. P. Morgan and J. D. Teachman, J. Marriage Fam. 50: 929–936 (1988). Reprinted with permission from the National Council on Family Relations.
including description and inference about the effects of gender and
race, goodness of fit, and summary interpretations.
5.12 According to the Independent newspaper (London, Mar. 8, 1994), the Metropolitan Police in London reported 30,475 people as missing in the year ending March 1993. For those of age 13 or less, 33 of 3271 missing males and 38 of 2486 missing females were still missing a year later. For ages 14 to 18, the values were 63 of 7256 males and 108 of 8877 females; for ages 19 and above, the values were 157 of 5065 males and 159 of 3520 females. Analyze and interpret. (Thanks to Pat Altham for showing me these data.)
5.13 The National Collegiate Athletic Association studied graduation rates for freshman student athletes during the 1984–1985 academic year. The (sample size, number graduated) totals were (796, 498) for white females, (1625, 878) for white males, (143, 54) for black females, and (660, 197) for black males (J. J. McArdle and F. Hamagami, J. Amer. Statist. Assoc. 89: 1107–1123, 1994). Analyze and interpret.
5.14 In a study designed to evaluate whether an educational program makes sexually active adolescents more likely to obtain condoms, adolescents were randomly assigned to two experimental groups. The educational program, involving a lecture and videotape about transmission of the HIV virus, was provided to one group but not the other. Table 5.16 summarizes results of a logistic regression model for factors observed to influence teenagers to obtain condoms.
a. Find the parameter estimates for the fitted model, using (1, 0) dummy variables for the first three predictors. Based on the corresponding confidence interval for the log odds ratio, determine the standard error for the group effect.
b. Explain why either the estimate of 1.38 for the odds ratio for gender or the corresponding confidence interval is incorrect. Show that if the reported interval is correct, 1.38 is actually the log odds ratio, and the estimated odds ratio equals 3.98.
TABLE 5.16  Data for Problem 5.14

Variable                       Odds Ratio   95% Confidence Interval
Group (education vs. none)        4.04          (1.17, 13.9)
Gender (males vs. females)        1.38          (1.23, 12.88)
SES (high vs. low)                5.82          (1.87, 18.28)
Lifetime number of partners       3.22          (1.08, 11.31)

Source: V. I. Rickert et al., Clin. Pediatr. 31: 205–210 (1992).
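A quick numerical check of part (b): a Wald interval is symmetric on the log scale, so the log-midpoint of a correctly reported interval recovers the point estimate of the log odds ratio:

```python
import math

# Reported gender effect: odds ratio 1.38, 95% CI (1.23, 12.88). A Wald CI is
# symmetric on the log scale, so the log-midpoint recovers the log odds ratio.
log_mid = (math.log(1.23) + math.log(12.88)) / 2
print(round(log_mid, 2), round(math.exp(log_mid), 2))  # 1.38 and 3.98
```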
TABLE 5.17  Data for Problem 5.15

Variable         Effect    P-value
Intercept         -7.00     < 0.01
Alcohol use        0.10       0.03
Smoking            1.20     < 0.01
Race               0.30       0.02
Race × smoking     0.20       0.04
5.15 Table 5.17 shows estimated effects for a logistic regression model with squamous cell esophageal cancer (Y = 1, yes; Y = 0, no) as the response. Smoking status (S) equals 1 for at least one pack per day and 0 otherwise, alcohol consumption (A) equals the average number of alcoholic drinks consumed per day, and race (R) equals 1 for blacks and 0 for whites. To describe the race × smoking interaction, construct the prediction equation when R = 1 and again when R = 0. Find the fitted Y–S conditional odds ratio for each case. Similarly, construct the prediction equation when S = 1 and again when S = 0. Find the fitted Y–R conditional odds ratios. Note that for each association, the coefficient of the cross-product term is the difference between the log odds ratios at the two fixed levels for the other variable. Explain why the coefficient of S represents the log odds ratio between Y and S for whites. To what hypotheses do the P-values for R and S refer?
5.16 A survey of high school students on Y = whether the subject has driven a motor vehicle after consuming a substantial amount of alcohol (1 = yes), s = gender (1 = female), r = race (1 = black; 0 = white), and g = grade (g1 = 1, grade 9; g2 = 1, grade 10; g3 = 1, grade 11; g1 = g2 = g3 = 0, grade 12) has prediction equation

logit P̂(Y = 1) = -0.88 - 0.40s - 0.72r - 2.22g1 - 1.43g2 - 0.58g3
                 + 0.74rg1 + 0.38rg2 + 0.01rg3.

a. Carefully interpret effects. Explain the interaction by describing the race effect at each grade and the grade effect for each race.
b. Replace r above by r1 (1 = black, 0 = other). The study also measured r2 (1 = Hispanic, 0 = other), with r1 = r2 = 0 for white. Suppose that the prediction equation is as above but with additional terms -0.29r2 + 0.53r2g1 + 0.25r2g2 - 0.06r2g3. Interpret the effects.
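The conditional race effects requested in part (a) come from adding the r coefficient to the relevant interaction term; a minimal sketch using the values in the prediction equation above:

```python
# Conditional log odds ratio between Y and race r at each grade: the r
# coefficient plus the matching r x grade interaction term.
b_r = -0.72
rg = {"grade 9": 0.74, "grade 10": 0.38, "grade 11": 0.01, "grade 12": 0.0}

race_effect = {g: round(b_r + term, 2) for g, term in rg.items()}
print(race_effect)  # 0.02, -0.34, -0.71, -0.72 across grades 9-12
```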
TABLE 5.18  Data for Problem 5.17

Patient   D    T   Y     Patient   D    T   Y     Patient   D    T   Y
   1      45   0   0       13      50   1   0       25      20   1   0
   2      15   0   0       14      75   1   1       26      45   0   1
   3      40   0   1       15      30   0   0       27      15   1   0
   4      83   1   1       16      25   0   1       28      25   0   1
   5      90   1   1       17      20   1   0       29      15   1   0
   6      25   1   1       18      60   1   1       30      30   0   1
   7      35   0   1       19      70   1   1       31      40   0   1
   8      65   0   1       20      30   0   1       32      15   1   0
   9      95   0   1       21      60   0   1       33     135   1   1
  10      35   0   1       22      61   0   0       34      20   1   0
  11      75   0   1       23      65   0   1       35      40   1   0
  12      45   1   1       24      15   1   0

Source: Data from D. Collett, in Encyclopedia of Biostatistics (New York: Wiley, 1998), pp. 350–358.
5.17 Table 5.18 shows the results of a study about Y = whether a patient having surgery with general anesthesia experienced a sore throat on waking (0 = no, 1 = yes) as a function of D = the duration of the surgery (in minutes) and T = the type of device used to secure the airway (0 = laryngeal mask airway, 1 = tracheal tube). Fit a logit model using these predictors, interpret parameter estimates, and conduct inference about the effects.
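One way to fit the requested logit model is Newton-Raphson maximization of the Bernoulli log likelihood. The sketch below is a bare-bones pure-Python implementation (not the output of any packaged routine); the tuples transcribe (D, T, Y) from Table 5.18:

```python
import math

# Sore-throat data transcribed from Table 5.18: (D = duration in minutes,
# T = airway device, Y = sore throat on waking)
data = [(45,0,0),(15,0,0),(40,0,1),(83,1,1),(90,1,1),(25,1,1),(35,0,1),
        (65,0,1),(95,0,1),(35,0,1),(75,0,1),(45,1,1),(50,1,0),(75,1,1),
        (30,0,0),(25,0,1),(20,1,0),(60,1,1),(70,1,1),(30,0,1),(60,0,1),
        (61,0,0),(65,0,1),(15,1,0),(20,1,0),(45,0,1),(15,1,0),(25,0,1),
        (15,1,0),(30,0,1),(40,0,1),(15,1,0),(135,1,1),(20,1,0),(40,1,0)]

def solve3(a, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]

beta = [0.0, 0.0, 0.0]          # (intercept, D effect, T effect)
for _ in range(25):             # Newton-Raphson iterations
    score = [0.0] * 3
    info = [[0.0] * 3 for _ in range(3)]
    for d, t, y in data:
        x = (1.0, float(d), float(t))
        eta = max(-30.0, min(30.0, sum(b * xi for b, xi in zip(beta, x))))
        p = 1.0 / (1.0 + math.exp(-eta))
        for i in range(3):
            score[i] += (y - p) * x[i]                    # gradient of log likelihood
            for j in range(3):
                info[i][j] += p * (1 - p) * x[i] * x[j]   # information matrix
    step = solve3(info, score)
    beta = [b + s for b, s in zip(beta, step)]

print([round(b, 4) for b in beta])  # duration effect positive, device effect negative
```

Longer surgeries raise the estimated odds of a sore throat, and the tracheal tube (T = 1) lowers them relative to the laryngeal mask, given duration.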
5.18 Refer to model (5.2) for the horseshoe crabs using x = width.
a. Show that (i) at the mean width (26.3), the estimated odds of a satellite equal 2.07; (ii) at x = 27.3, the estimated odds equal 3.40; and (iii) since exp(β̂) = 1.64, 3.40 = (1.64)(2.07), and the odds increase by 64%.
b. Based on the 95% confidence interval for β, show that for x near where π̂ = 0.5, the rate of increase in the probability of a satellite per 1-cm increase in x falls between about 0.07 and 0.17.
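The arithmetic in this problem can be checked directly; the slope of π(x) at a point is βπ(x)[1 − π(x)], which equals β/4 where π = 0.5:

```python
import math

beta_hat = math.log(1.64)   # exp(beta-hat) = 1.64 per 1-cm width increase
odds_mean = 2.07            # estimated odds at the mean width 26.3

print(round(odds_mean * 1.64, 2))   # 3.39, i.e., 3.40 up to rounding of 1.64 and 2.07
print(round(beta_hat / 4, 3))       # 0.124: the slope at pi = 0.5 is beta/4
```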
5.19 For Table 4.3, fit a logistic regression model for the probability of a satellite, using color alone as the predictor.
a. Treat color as nominal. Explain why this model is saturated. Express its parameter estimates in terms of the sample logits for each color.
b. Conduct a likelihood-ratio test that color has no effect.
c. Fit a model that treats color as quantitative. Interpret the fit, and test that color has no effect.
d. Test the goodness of fit of the model in part (c). Interpret.
5.20 Refer to model (5.14). Describe the effect of width by finding the estimated probabilities of a satellite at its lower and upper quartiles, separately for c = 1 and c = 0.
5.21 Refer to the prediction equation logit(π̂) = -10.071 - 0.509c + 0.458x for model (5.13). The means and standard deviations are c̄ = 2.44 and s = 0.80 for color, and x̄ = 26.30 and s = 2.11 for width. For standardized predictors [e.g., x = (width - 26.3)/2.11], explain why the estimated coefficients of c and x equal -0.41 and 0.97. Interpret these by comparing the partial effects on the odds of a 1 standard deviation increase in each predictor. Describe the color effect by estimating the change in π̂ between the first and last color categories at the mean score for width.
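Part of this problem is simple arithmetic: standardizing a predictor multiplies its coefficient by that predictor's standard deviation:

```python
# Standardizing a predictor to (x - mean)/s multiplies its coefficient by s.
b_color, s_color = -0.509, 0.80
b_width, s_width = 0.458, 2.11

print(round(b_color * s_color, 2), round(b_width * s_width, 2))  # -0.41 0.97
```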
5.22 Refer to model (5.12).
a. Fit the model using x = weight. Interpret effects of weight and color.
b. Does the model permitting interaction provide an improved fit? Interpret.
c. For part (b), construct a confidence interval for a difference between the slope parameters for medium-light and dark crabs. Interpret.
d. Using models that treat color as quantitative, repeat the analyses in parts (a) to (c).
5.23 Fowlkes et al. (1988) reported results of a survey of employees of a large national corporation to determine how satisfaction depends on race, gender, age, and regional location. The data are at the book's Web site (www.stat.ufl.edu/~aa/cda/cda.html). Fit a logit model to these data and carefully interpret the parameter estimates. Fowlkes et al. (1988) reported "The least-satisfied employees are less than 35 years of age, female, other (race), and work in the Northeast; . . . The most satisfied group is greater than 44 years of age, male, other, and working in the Pacific or Mid-Atlantic regions; the odds of such employees being satisfied are about 3.5 to 1." Show how these interpretations result from the fit of this model.
5.24 Let Y denote a subject's opinion about current laws legalizing abortion (1 = support), for gender h (h = 1, female; h = 2, male), religious affiliation i (i = 1, Protestant; i = 2, Catholic; i = 3, Jewish), and political party affiliation j (j = 1, Democrat; j = 2, Republican; j = 3, Independent). For survey data, software for fitting the model

logit P(Y = 1) = α + βh^G + βi^R + βj^P

reports α̂ = 0.62, β̂1^G = 0.08, β̂2^G = -0.08, β̂1^R = -0.16, β̂2^R = -0.25, β̂3^R = 0.41, β̂1^P = 0.87, β̂2^P = -1.27, β̂3^P = 0.40.
a. Interpret how the odds of support depend on religion.
b. Estimate the probability of support for the group most (least) likely to support current laws.
c. If, instead, the parameters used constraints β1^G = β1^R = β1^P = 0, report the estimates.
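For part (b), the most (least) likely group combines the largest (smallest) reported estimate from each factor; a short check:

```python
import math

# Reported estimates for the model above
alpha = 0.62
gender = {"female": 0.08, "male": -0.08}
religion = {"Protestant": -0.16, "Catholic": -0.25, "Jewish": 0.41}
party = {"Democrat": 0.87, "Republican": -1.27, "Independent": 0.40}

def prob(g, r, p):
    eta = alpha + gender[g] + religion[r] + party[p]
    return 1.0 / (1.0 + math.exp(-eta))

most = prob("female", "Jewish", "Democrat")      # largest estimate in each factor
least = prob("male", "Catholic", "Republican")   # smallest estimate in each factor
print(round(most, 3), round(least, 3))           # about 0.879 and 0.273
```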
5.25 Table 5.19 refers to a sample of subjects randomly selected for an Italian study on the relation between income and whether one possesses a travel credit card. At each level of annual income in millions of lira, the table indicates the number of subjects sampled and the number possessing at least one travel credit card. Analyze these data.
TABLE 5.19  Data for Problem 5.25

Income     Number   Credit    Income     Number   Credit    Income     Number   Credit
(millions    of     Cards    (millions     of     Cards    (millions     of     Cards
of lira)   Cases             of lira)    Cases             of lira)    Cases
   24        1        0         39         2        0         65         6        6
   27        1        0         40         5        0         68         3        3
   28        5        2         41         2        0         70         5        3
   29        3        0         42         2        0         79         1        0
   30        9        1         45         1        1         80         1        0
   31        5        1         48         1        0         84         1        0
   32        8        0         49         1        0         94         1        0
   33        1        0         50        10        2        120         6        6
   34        7        1         52         1        0        130         1        1
   35        1        1         59         1        0
   38        3        1         60         5        2

Source: Categorical Data Analysis, Quaderni del Corso Estivo di Statistica e Calcolo delle Probabilità, n. 4, Istituto di Metodi Quantitativi, Università Luigi Bocconi, by R. Piccarreta.
5.26 Refer to Table 9.1, treating marijuana use as the response variable.
Analyze these data.
5.27 The book's Web site (www.stat.ufl.edu/~aa/cda/cda.html) contains a five-way table relating occupational aspirations (high, low) to gender, residence, IQ, and socioeconomic status. Analyze these data.
Theory and Methods
5.28 For model (5.1), show that ∂π(x)/∂x = βπ(x)[1 − π(x)].
5.29 For model (5.1), when π(x) is small, explain why you can interpret exp(β) approximately as π(x + 1)/π(x).
5.30 Prove that the logistic regression curve (5.1) has the steepest slope where π(x) = 1/2. Generalize to model (5.8).
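A numerical illustration of Problems 5.28 and 5.30, using assumed illustrative values of α and β (not from the text):

```python
import math

# Assumed illustrative values (not from the text)
alpha, beta = -2.0, 0.5
pi = lambda x: 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

h = 1e-6
x0 = 4.0                                        # alpha + beta*x0 = 0, so pi(x0) = 1/2
num_slope = (pi(x0 + h) - pi(x0 - h)) / (2 * h)
print(abs(num_slope - beta * pi(x0) * (1 - pi(x0))) < 1e-8)   # True: slope = beta*pi*(1-pi)
print(num_slope > (pi(5.0 + h) - pi(5.0 - h)) / (2 * h))      # True: steepest at pi = 1/2
```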
5.31 The calibration problem is that of estimating x at which π(x) = π0. For the linear logit model, argue that a confidence interval is the set of x values for which

|α̂ + β̂x − logit(π0)| / [v̂ar(α̂) + x² v̂ar(β̂) + 2x côv(α̂, β̂)]^(1/2) < z_(α/2).

[Morgan (1992, Sec. 2.7) surveyed other approaches.]
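Squaring the inequality gives a quadratic in x whose roots are the endpoints of the confidence set. The numbers below are purely illustrative (assumed estimates and covariances, not from any data set in the text):

```python
import math

# Assumed illustrative estimates (not from any data set in the text)
a_hat, b_hat = -3.0, 0.5            # alpha-hat, beta-hat
va, vb, cab = 0.16, 0.01, -0.035    # var(alpha-hat), var(beta-hat), cov(alpha-hat, beta-hat)
pi0, z = 0.5, 1.96
L = math.log(pi0 / (1 - pi0))       # logit(pi0)

# Squaring |a + b*x - L| < z*sqrt(va + x^2 vb + 2x cab) gives A x^2 + B x + C < 0
A = b_hat ** 2 - z ** 2 * vb
B = 2 * (b_hat * (a_hat - L) - z ** 2 * cab)
C = (a_hat - L) ** 2 - z ** 2 * va
disc = math.sqrt(B * B - 4 * A * C)
lo, hi = (-B - disc) / (2 * A), (-B + disc) / (2 * A)
print(round(lo, 2), round(hi, 2))   # interval around the point estimate (L - a_hat)/b_hat = 6
```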
5.32 A study for several professional sports of the effect of a player's draft position d (d = 1, 2, 3, . . .) of selection from the pool of potential players in a given year on the probability of eventually being named an all star used the model logit(π) = α + β log d (S. M. Berry, Chance 14: 53–57, 2001).
a. Show that π/(1 − π) = e^α d^β. Show that e^α = the odds for the first draft pick.
b. In the United States, Berry reported α̂ = 2.3 and β̂ = −1.1 for pro basketball and α̂ = 0.7 and β̂ = −0.6 for pro baseball. This suggests that in basketball a first draft pick is more crucial and picks with high d are relatively less likely to be all-stars. Explain why.
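Part (b) can be illustrated numerically with Berry's reported estimates:

```python
import math

def all_star_prob(alpha, beta, d):
    odds = math.exp(alpha) * d ** beta          # pi/(1 - pi) = e^alpha * d^beta
    return odds / (1 + odds)

# Berry's reported estimates
for sport, (a, b) in [("basketball", (2.3, -1.1)), ("baseball", (0.7, -0.6))]:
    print(sport, round(all_star_prob(a, b, 1), 2), round(all_star_prob(a, b, 10), 2))
# basketball drops from about 0.91 at d = 1 to 0.44 at d = 10;
# baseball only from about 0.67 to 0.34, so the first pick matters more in basketball
```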
5.33 For the population of subjects having Y = j, X has a N(μj, σ²) distribution, j = 0, 1.
a. Using Bayes' theorem, show that P(Y = 1 | x) satisfies the logistic regression model with β = (μ1 − μ0)/σ².
b. Suppose that (X | Y = j) is N(μj, σj²) with σ0 ≠ σ1. Show that the logistic model holds with a quadratic term (Anderson 1975). [Problem 5.4 showed that a quadratic term is helpful when x values have quite different dispersion at y = 0 and y = 1. This result also suggests that to test equality of means of normal distributions when the variances differ, one can fit a quadratic logistic regression with the two groups as the response and test the quadratic term; see O'Brien (1988).]
c. Suppose that (X | Y = j) has exponential dispersion family density f(x; θj) = exp{[xθj − b(θj)]/a(φ) + c(x, φ)}. Find the relevant logistic model.
d. For multiple predictors, suppose that (X | Y = j) has a multivariate N(μj, Σ) distribution, j = 0, 1. Show that P(Y = 1 | x) satisfies logistic regression with effect parameters Σ⁻¹(μ1 − μ0) (Cornfield 1962).
5.34 Suppose that π(x) = F(x) for some strictly increasing cdf F. Explain why a monotone transformation of x exists such that the logistic regression model holds. Generalize to alternative link functions.
5.35 For an I × 2 contingency table, consider logit model (5.4).
a. Given {πi > 0}, show how to find {βi} satisfying βI = 0.
b. Prove that β1 = β2 = ··· = βI is the independence model. Find its likelihood equation, and show that α̂ = logit[(Σi yi)/(Σi ni)].
5.36 Construct the log-likelihood function for the model logit[π(x)] = α + βx with independent binomial outcomes of y0 successes in n0 trials at x = 0 and y1 successes in n1 trials at x = 1. Derive the likelihood equations, and show that β̂ is the sample log odds ratio.
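Because the data sit at only two x values, the model is saturated and the MLE reproduces the sample logits, so β̂ equals the sample log odds ratio; a check with illustrative counts (assumed, not from the text):

```python
import math

# Illustrative counts (assumed, not from the text)
y0, n0 = 12, 40    # successes at x = 0
y1, n1 = 25, 50    # successes at x = 1

logit = lambda p: math.log(p / (1 - p))
beta_hat = logit(y1 / n1) - logit(y0 / n0)              # MLE: difference of sample logits
log_or = math.log(y1 * (n0 - y0) / ((n1 - y1) * y0))    # sample log odds ratio
print(abs(beta_hat - log_or) < 1e-12)                   # True
```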
5.37 A study has ni independent binary observations {yi1, . . . , yin_i} when X = xi, i = 1, . . . , N, with n = Σi ni. Consider the model logit(πi) = α + βxi, where πi = P(Yij = 1).
a. Show that the kernel of the likelihood function is the same treating the data as n Bernoulli observations or N binomial observations.
b. For the saturated model, explain why the likelihood function is different for these two data forms. (Hint: The number of parameters differs.) Hence, the deviance reported by software depends on the form of data entry.
c. Explain why the difference between deviances for two unsaturated models does not depend on the form of data entry.
d. Suppose that each ni = 1. Show that the deviance depends on π̂i but not yi. Hence, it is not useful for checking model fit (see also Problem 4.22).
5.38 Suppose that Y has a bin(n, π) distribution. For the model logit(π) = α, consider testing H0: α = 0 (i.e., π = 0.5). Let π̂ = y/n.
a. From Section 3.1.6, the asymptotic variance of α̂ = logit(π̂) is [nπ(1 − π)]⁻¹. Compare the estimated SE for the Wald test and the SE using the null value of π, using test statistic [logit(π̂)/SE]². Show that the ratio of the Wald statistic to the statistic with null SE equals 4π̂(1 − π̂). What is the implication about performance of the Wald test if |α| is large and π̂ tends to be near 0 or 1?
b. Wald inference depends on the parameterization. How does the comparison of tests change with the scale [(π̂ − 0.5)/SE]², where SE is now the estimated or null SE of π̂?
c. Suppose that y = 0 or y = n. Show that the Wald test in part (a) cannot reject H0: π = π0 for any 0 < π0 < 1, whereas the Wald test in part (b) rejects every such π0. [Note: Analogous results apply for inference about the Poisson mean versus the log mean; see Mantel (1987a).]
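The ratio in part (a) can be verified numerically; the algebra reduces it to 4π̂(1 − π̂) exactly:

```python
import math

y, n = 2, 25       # illustrative counts (assumed, not from the text)
p = y / n
L = math.log(p / (1 - p))          # logit(pi-hat)

wald = L * L * n * p * (1 - p)     # statistic with SE estimated at pi-hat
null = L * L * n * 0.25            # statistic with SE at the null pi = 0.5
print(abs(wald / null - 4 * p * (1 - p)) < 1e-12)   # True: ratio is 4*p*(1-p)
# a pi-hat near 0 or 1 shrinks this ratio, so the Wald test loses power there
```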
5.39 Find the likelihood equations for model (5.10). Show that they imply that the fitted values and the sample values are identical in the marginal two-way tables.
5.40 Consider the linear logit model (5.5) for an I × 2 table, with yi a bin(ni, πi) variate.
a. Show that the log likelihood is

   L(α, β) = Σi yi(α + βxi) − Σi ni log[1 + exp(α + βxi)],

   with both sums over i = 1, . . . , I.
b. Show that the sufficient statistic for β is Σi yi xi, and explain why this is essentially the variable utilized in the Cochran–Armitage test. (Hence that test is a score test of H0: β = 0.)
c. Letting S = Σi yi, show that the likelihood equations are

   S = Σi ni exp(α + βxi)/[1 + exp(α + βxi)],
   Σi yi xi = Σi ni xi exp(α + βxi)/[1 + exp(α + βxi)].

d. Let {μ̂i = ni π̂i}. Explain why Σi μ̂i = Σi yi and

   Σi xi (yi/S) = Σi xi (μ̂i / Σa μ̂a).

   Explain why this implies that the mean score on x across the rows in the first column is the same for the model fit as for the observed data. They are also identical for the second column.
5.41 Let Yi be bin(ni, πi) at xi, and let pi = yi/ni. For binomial GLMs with logit link:
a. For pi near πi, show that

   log[pi/(1 − pi)] ≈ log[πi/(1 − πi)] + (pi − πi)/[πi(1 − πi)].

b. Show that zi^(t) in (5.23) is a linearized version of the ith sample logit, evaluated at the approximation πi^(t) for π̂i.
c. Verify formula (5.20) for côv(β̂).
5.42 Using graphs or tables, explain what is meant by no interaction in modeling response Y and explanatory X and Z when:
a. All variables are continuous (multiple regression).
b. Y and X are continuous, Z is categorical (analysis of covariance).
c. Y is continuous, X and Z are categorical (two-way ANOVA).
d. Y is binary, X and Z are categorical (logit model).
CHAPTER 6
Building and Applying Logistic
Regression Models
Having studied the basics of fitting and interpreting logistic regression
models, we now turn our attention to building and applying them. With
several explanatory variables, there are many potential models. In Section 6.1
we discuss strategies for model selection. After choosing a preliminary model,
model checking addresses whether systematic lack of fit exists. Section 6.2
covers diagnostics, such as residuals, for model checking.
In practice, a common application compares two groups on a binary
response, with data stratified by control variables. In Section 6.3 we present
logit-related analyses of such data. In Section 6.4 we show the advantages of
a well-chosen model in enhancing inferential power for detecting and estimating associations. Section 6.5 covers power and sample size determination
for logistic regression. Although the logit is the most popular link function
for probabilities, other links are sometimes more appropriate. In Section 6.6
we present models using the probit link and links making a double log
transform.
For small samples or models with many parameters, ordinary large-sample
ML inference may perform poorly. In Section 6.7 we discuss conditional
logistic regression. Like small-sample methods for 2 × 2 tables, this uses
conditioning arguments to eliminate nuisance parameters.
6.1  STRATEGIES IN MODEL SELECTION
Model selection for logistic regression faces the same issues as for ordinary
regression. The selection process becomes harder as the number of explanatory variables increases, because of the rapid increase in possible effects and
interactions. There are two competing goals: The model should be complex
enough to fit the data well. On the other hand, it should be simple to
interpret, smoothing rather than overfitting the data.
Most studies are designed to answer certain questions. Those questions
guide the choice of model terms. Confirmatory analyses then use a restricted
set of models. For instance, a study hypothesis about an effect may be tested
by comparing models with and without that effect. For studies that are
exploratory rather than confirmatory, a search among possible models may
provide clues about the dependence structure and raise questions for future
research.
In either case, it is helpful first to study the effect on Y of each predictor
by itself using graphics (incorporating smoothing) for a continuous predictor or a contingency table for a discrete predictor. This gives a "feel" for the
marginal effects. Unbalanced data, with relatively few responses of one type,
limit the number of predictors for the model. One guideline suggests at least
10 outcomes of each type should occur for every predictor (Peduzzi et al. 1996). If y = 1 only 30 times out of n = 1000, for instance, the model should
contain no more than about three x terms. Such guidelines are approximate,
and this does not mean that if you have 500 outcomes of each type you are
well served by a model with 50 predictors.
Many model selection procedures exist, no one of which is always best.
Cautions that apply to ordinary regression hold for any generalized linear
model. For instance, a model with several predictors may suffer from multicollinearity: correlations among predictors making it seem that no one variable is important when all the others are in the model. A variable may seem
to have little effect because it overlaps considerably with other predictors in
the model, itself being predicted well by the other predictors. Deleting such a
redundant predictor can be helpful, for instance to reduce standard errors of
other estimated effects.
6.1.1  Horseshoe Crab Example Revisited
The horseshoe crab data set in Table 4.3 has four predictors: color (four categories), spine condition (three categories), weight, and width of the carapace shell. We now fit a logistic regression model using all these to predict whether the female crab has satellites (y = 1).
We start by fitting a model containing main effects,
logit P(Y = 1) = α + β1 weight + β2 width + β3 c1 + β4 c2 + β5 c3 + β6 s1 + β7 s2,
treating color (ci) and spine condition (sj) as qualitative (factors), with
dummy variables for the first three colors and the first two spine conditions.
Table 6.1 shows results. A likelihood-ratio test that Y is jointly independent
of these predictors simultaneously tests H0: β1 = ··· = β7 = 0. The test statistic equals 40.6 with df = 7 (P < 0.0001). This shows extremely strong
evidence that at least one predictor has an effect.
TABLE 6.1  Computer Output from Fitting Model with All Main Effects to Horseshoe Crab Data

        Testing Global Null Hypothesis: BETA = 0
Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio      40.5565      7      <.0001

         Analysis of Maximum Likelihood Estimates
Parameter   Estimate   Std Error   Chi-Square   Pr > ChiSq
Intercept    -9.2734     3.8378      5.8386       0.0157
weight        0.8258     0.7038      1.3765       0.2407
width         0.2631     0.1953      1.8152       0.1779
color 1       1.6087     0.9355      2.9567       0.0855
color 2       1.5058     0.5667      7.0607       0.0079
color 3       1.1198     0.5933      3.5624       0.0591
spine 1      -0.4003     0.5027      0.6340       0.4259
spine 2      -0.4963     0.6292      0.6222       0.4302
Although the overall test is highly significant, the Table 6.1 results are
discouraging. The estimates for weight and width are only slightly larger than
their SE values. The estimates for the factors compare each category to the
final one as a baseline. For color, the largest difference is less than two
standard errors; for spine condition, the largest difference is less than a
standard error.
The small P-value for the overall test, yet the lack of significance for
individual effects, is a warning sign of multicollinearity. In Section 5.2.2 we
showed strong evidence of a width effect. Controlling for weight, color, and
spine condition, little evidence remains of a partial width effect. However,
weight and width have a strong correlation (0.887). For practical purposes they are equally good predictors, but it is nearly redundant to use them both. Our further analysis uses width (W) with color (C) and spine condition (S) as
predictors. For simplicity, we symbolize models by their highest-order terms,
regarding C and S as factors. For instance, (C + S + W) denotes a model with main effects, whereas (C + S*W) denotes a model that has those main effects plus an S × W interaction. It is not usually sensible to consider a
model with interaction but not the main effects that make up that interaction.
6.1.2  Stepwise Procedures
In exploratory studies, an algorithmic method for searching among models
can be informative if we use results cautiously. Goodman (1971a) proposed methods analogous to forward selection and backward elimination in ordinary regression.
Forward selection adds terms sequentially until further additions do not
improve the fit. At each stage it selects the term giving the greatest improve-
ment in fit. The minimum P-value for testing the term in the model is a
sensible criterion, since reductions in deviance for different terms may have
different df values. A stepwise variation of this procedure retests, at each
stage, terms added at previous stages to see if they are still significant.
Backward elimination begins with a complex model and sequentially
removes terms. At each stage, it selects the term for which its removal has
the least damaging effect on the model (e.g., largest P-value). The process
stops when any further deletion leads to a significantly poorer fit. With either
approach, for qualitative predictors with more than two categories, the
process should consider the entire variable at any stage rather than just
individual dummy variables. Add or drop the entire variable rather than
just one of its dummies. Otherwise, the result depends on the coding. The
same remark applies to interactions containing that variable.
Many statisticians prefer backward elimination over forward selection,
feeling it safer to delete terms from an overly complex model than to add
terms to an overly simple one. Forward selection can stop prematurely
because a particular test in the sequence has low power. Neither strategy
necessarily yields a meaningful model. Use variable selection procedures with
caution! When you evaluate many terms, one or two that are not important
may look impressive simply due to chance. For instance, when all the true
effects are weak, the largest sample effect may substantially overestimate its
true effect. See Westfall and Wolfinger (1997) and Westfall and Young (1993) for ways to adjust P-values to take multiple tests into account.
Some software has additional options for selecting a model. One approach
attempts to determine the best model with some fixed number of terms,
according to some criterion. If such a method and backward and forward
selection procedures yield quite different models, this is an indication that
such results are of dubious use. Another such indication would be when a
quite different model results from applying a given procedure to a bootstrap
sample of the same size from the sample distribution.
Finally, statistical significance should not be the sole criterion for inclusion
of a term in a model. It is sensible to include a variable that is central to the
purposes of the study and report its estimated effect even if it is not
statistically significant. Keeping it in the model may help reduce bias in
estimated effects of other predictors and may make it possible to compare
results with other studies where the effect is significant (perhaps because of a larger sample size). Algorithmic selection procedures are no substitute for
careful thought in guiding the formulation of models.
6.1.3  Backward Elimination for Horseshoe Crab Example
Table 6.2 summarizes results of fitting and comparing several logit models to
the horseshoe crab data with predictors width, color, and spine condition.
The deviance (G²) test of fit compares the model to the saturated model. As noted in Sections 5.2.4 and 5.2.5, this is not approximately chi-squared when a predictor is continuous, as width is. However, the difference of deviances
TABLE 6.2  Results of Fitting Several Logistic Regression Models to Horseshoe Crab Data

Model   Predictors*             G²      df    AIC    Models Compared   Deviance Difference   Corr. r(y, μ̂)
1       (C*S*W)               170.44   152   212.4         —                    —
2       (C*S + C*W + S*W)     173.68   155   209.7     (2)–(1)           3.2 (df = 3)
3a      (C*S + S*W)           177.34   158   207.3     (3a)–(2)          3.7 (df = 3)
3b      (C*W + S*W)           181.56   161   205.6     (3b)–(2)          7.9 (df = 6)
3c      (C*S + C*W)           173.69   157   205.7     (3c)–(2)          0.0 (df = 2)
4a      (S + C*W)             181.64   163   201.6     (4a)–(3c)         8.0 (df = 6)
4b      (W + C*S)             177.61   160   203.6     (4b)–(3c)         3.9 (df = 3)
5       (C + S + W)           186.61   166   200.6     (5)–(4b)          9.0 (df = 6)          0.452
6a      (C + S)               208.83   167   220.8     (6a)–(5)         22.2 (df = 1)          0.285
6b      (S + W)               194.42   169   202.4     (6b)–(5)          7.8 (df = 3)          0.402
6c      (C + W)               187.46   168   197.5     (6c)–(5)          0.8 (df = 2)          0.447
7a      (C)                   212.06   169   220.1     (7a)–(6c)        24.5 (df = 1)
7b      (W)                   194.45   171   198.5     (7b)–(6c)         7.0 (df = 3)
8       (C = dark + W)        187.96   170   194.0     (8)–(6c)          0.5 (df = 2)
9       None                  225.76   172   227.8     (9)–(8)          37.8 (df = 2)          0.000

*C, color; S, spine condition; W, width.
between two models that differ by a modest number of parameters is
relevant. That difference is the likelihood-ratio statistic −2(L0 − L1) comparing the models, and it has an approximate null chi-squared distribution.
To select a model, we use backward elimination. We test only the
highest-order terms for each variable. It is inappropriate, for instance, to
remove a main effect term if the model has interactions involving that term.
We begin with the most complex model, symbolized by (C*S*W), model 1 in Table 6.2. This model uses main effects for each term as well as the three two-factor interactions and the three-factor interaction. It allows a separate width effect at each CS combination. (In fact, at some of those combinations y outcomes of only one type occur, so effects are not estimable.) The likelihood-ratio statistic comparing this model to the simpler model (C*S + C*W + S*W) removing the three-factor interaction term equals 3.2 (df = 3). This suggests that the three-factor term is not needed (P = 0.36), thank goodness, so we continue the simplification process.
In the next stage we consider the three models that remove a two-factor
interaction. Of these, (C*S + C*W) gives essentially the same fit as the more complex model, so we drop the S × W interaction. Next, we consider dropping one of the other two-factor interactions. The model (S + C*W), dropping the C × S interaction, has an increased deviance of 8.0 on df = 6 (P = 0.24); the model (W + C*S), dropping the C × W interaction, has an increased deviance of 3.9 on df = 3 (P = 0.27). Neither increase is important, suggesting that we can drop either and proceed. In either case, dropping next the remaining interaction also seems permissible. For instance,
dropping the C × S interaction from model (W + C*S), leaving model (C + S + W), increases the deviance by 9.0 on df = 6 (P = 0.17).
The working model now has the main effects alone. In the next stage we
consider dropping one of them. Table 6.2 shows little consequence of
removing S. Both remaining variables (C and W) then have nonnegligible effects. For instance, removing C increases the deviance (comparing models 7b and 6c) by 7.0 on df = 3 (P = 0.07). The analysis in Section 5.4.6 revealed
a noticeable difference between dark crabs Žcategory 4. and the others. The
simpler model that has a single dummy variable for color, equaling 0 for dark
crabs and 1 otherwise, fits essentially as well. (The deviance difference between models 8 and 6c equals 0.5, with df = 2.) Further simplification
results in large increases in deviance and is unjustified.
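The informal tests above compare each deviance difference from Table 6.2 with a chi-squared critical value; a compact summary (the upper 5% chi-squared points are the standard tabled values):

```python
# Chi-squared upper 5% points (standard tabled values) and the deviance
# differences used informally in the backward elimination above
chi2_05 = {1: 3.84, 2: 5.99, 3: 7.81, 6: 12.59}

steps = [("(2)-(1): drop C*S*W", 3.2, 3),
         ("(4b)-(3c): drop C*W", 3.9, 3),
         ("(5)-(4b): drop C*S", 9.0, 6),
         ("(7b)-(6c): drop C", 7.0, 3)]

for label, diff, df in steps:
    verdict = "retain simpler model" if diff < chi2_05[df] else "keep the term"
    print(label, verdict)   # every step here retains the simpler model at the 0.05 level
```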
6.1.4  AIC, Model Selection, and the Correct Model
In selecting a model, we are mistaken if we think that we have found the true
one. Any model is a simplification of reality. For instance, width does not
exactly have a linear effect on the probability of satellites, whether we use the
logit link or the identity link.
What is the logic of testing the fit of a model when we know that it does
not truly hold? A simple model that fits adequately has the advantages of
model parsimony. If a model has relatively little bias, describing reality well,
it tends to provide more accurate estimates of the quantities of interest. This
was discussed in Sections 3.3.7 and 5.2.2 and is examined further in Section
6.4.5.
Other criteria besides significance tests can help select a good model in
terms of estimating quantities of interest. The best known is the Akaike
information criterion (AIC). It judges a model by how close its fitted values
tend to be to the true values, in terms of a certain expected value. Even
though a simple model is farther from the true model than is a more complex
model, it may be preferred because it tends to provide better estimates of
certain characteristics of the true model, such as cell probabilities. Thus, the
optimal model is the one that tends to have fit closest to reality. Given a
sample, Akaike showed that this criterion selects the model that minimizes

    AIC = -2(maximized log likelihood - number of parameters in model).

This penalizes a model for having many parameters. With models for
categorical Y, this ordering is equivalent to one based on an adjustment of
the deviance, [G² - 2(df)], by twice its residual df. For cogent arguments
supporting this criterion, see Burnham and Anderson (1998).
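As an arithmetic illustration of this definition, the short Python sketch below (helper names are ours, and the log likelihoods are made-up illustrative numbers, not values from Table 6.2) computes AIC for two hypothetical nested models; the more complex model has the higher log likelihood, yet AIC prefers the simpler one.

```python
def aic(max_log_lik, n_params):
    """AIC = -2 * (maximized log likelihood - number of parameters)."""
    return -2.0 * (max_log_lik - n_params)

# Hypothetical pair of nested models fitted to the same data set.
ll_simple, p_simple = -97.0, 2     # illustrative numbers only
ll_complex, p_complex = -96.5, 5   # higher log likelihood, more parameters

print(aic(ll_simple, p_simple))    # 198.0
print(aic(ll_complex, p_complex))  # 203.0: the simpler model is preferred
```

Because the saturated model's log likelihood is the same for every model fitted to a given data set, ranking models by AIC is the same as ranking them by the penalized deviance G² - 2(residual df) mentioned above.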
We illustrate AIC for model selection using the models that Table 6.2 lists.
That table also shows the AIC values. Of models using the three basic
variables, AIC is smallest (AIC = 197.5) for C + W, having main effects of
color and width. The simpler model having a dummy variable for whether a
crab is dark fares better yet (AIC = 194.0). Either model seems reasonable.
STRATEGIES IN MODEL SELECTION
We should balance the lower AIC for the simpler model against its having
been suggested by the fit of C + W.
6.1.5   Using Causal Hypotheses to Guide Model Building
Although selection procedures are helpful exploratory tools, the model-
building process should utilize theory and common sense. Often, a time
ordering among the variables suggests possible causal relationships.
Analyzing a certain sequence of models helps to investigate those
relationships (Goodman 1973).
We illustrate with Table 6.3, from a British study. A sample of men and
women who had petitioned for divorce and a similar number of married
people were asked: (a) "Before you married your (former) husband/wife,
had you ever made love with anyone else?"; (b) "During your (former)
marriage, (did you have) have you had any affairs or brief sexual encounters
with another man/woman?" The 2 × 2 × 2 × 2 table has variables G =
gender, E = whether reported extramarital sex, P = whether reported
premarital sex, and M = marital status.
The time points at which responses on the four variables occur suggest
the following ordering of the variables:

    G          →   P            →   E              →   M
    gender         premarital       extramarital       marital
                   sex              sex                status
Any of these is an explanatory variable when a variable listed to its right is
the response. Figure 6.1 shows one possible causal structure. In this figure, a
variable at the tip of an arrow is a response for a model at some stage. The
explanatory variables have arrows pointing to the response, directly or
indirectly.
We first treat P as a response. Figure 6.1 predicts that G has a direct
effect on P, so the model of independence of these variables is inadequate.
TABLE 6.3 Marital Status by Report of Pre- and Extramarital Sex (PMS and EMS)

                         Women                      Men
PMS:               Yes         No           Yes          No
EMS:             Yes   No    Yes   No     Yes   No    Yes   No
Divorced          17   54     36   214     28   60     17    68
Still married      4   25      4   322     11   42      4   130

Source: G. N. Gilbert, Modelling Society (London: George Allen & Unwin, 1981).
Reprinted with permission from Unwin Hyman Ltd.
FIGURE 6.1 Causal diagram for Table 6.3.
At the second stage, E is the response. Figure 6.1 predicts that P and G
have direct effects on E. It also suggests that G has an indirect effect on E,
through its effect on P. These effects on E can be analyzed using the logit
model for E with additive G and P effects. If G has only an indirect effect
on E, the model with P alone as a predictor is adequate; that is, controlling
for P, E and G are conditionally independent. At the third stage, M is the
response. Figure 6.1 predicts that E has a direct effect on M, P has direct
effects and indirect effects through its effects on E, and G has indirect
effects through its effects on P and E. This suggests the logit model for M
having additive E and P effects. For this model, G and M are independent,
given P and E.
Table 6.4 shows results. The first stage, having P as the response, shows
strong evidence of a GP association. The sample odds ratio for their
marginal table is 0.27; the estimated odds of premarital sex for females are
0.27 times that for males. The second stage has E as the response. Only weak
evidence occurs that G had a direct as well as an indirect effect on E, as G²
drops by 2.9 (df = 1) after adding G to a model already containing P as a
predictor. For this model, the estimated EP conditional odds ratio is 4.0.
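The 0.27 figure can be verified directly from the counts in Table 6.3 by collapsing over extramarital sex and marital status to get the gender-by-premarital-sex marginal table; a small Python check:

```python
# Collapse Table 6.3 over EMS and marital status to the G-by-P marginal table.
# Each total sums the four cells (EMS yes/no, divorced/still married).
women_pms_yes = 17 + 54 + 4 + 25
women_pms_no = 36 + 214 + 4 + 322
men_pms_yes = 28 + 60 + 11 + 42
men_pms_no = 17 + 68 + 4 + 130

odds_ratio = (women_pms_yes * men_pms_no) / (women_pms_no * men_pms_yes)
print(round(odds_ratio, 2))  # 0.27: women's odds of reporting premarital sex
                             # are about 0.27 times the men's odds
```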
The third stage has M as the response. Figure 6.1 specifies the logit model
with main effects of E and P, but it fits poorly.

TABLE 6.4 Goodness of Fit of Various Models for Table 6.3 a

                  Potential        Actual
Stage  Response   Explanatory      Explanatory      G²     df
1         P          G             None            75.3     1
                                   (G)              0.0     0
2         E          G, P          None            48.9     3
                                   (P)              2.9     2
                                   (G + P)          0.0     1
3         M          G, P, E       (E + P)         18.2     5
                                   (E*P)            5.2     4
                                   (E*P + G)        0.7     3

a P, premarital sex; E, extramarital sex; M, marital status; G, gender.

The model that allows an
E × P interaction in their effects on M but assumes conditional independence
of G and M fits much better (G² decrease of 13.0, df = 1). The model
that also has a main effect for G fits slightly better yet. Either model is more
complicated than Figure 6.1 predicted, since the effects of E on M vary
according to the level of P. However, some preliminary thought about causal
relationships suggested a model similar to one giving a good fit. We leave it
to the reader to estimate and interpret effects for the third stage.
6.1.6   New Model-Building Strategies for Data Mining
As computing power continues to explode, enormous data sets are more
common. A financial institution that markets credit cards may have observations for millions of subjects to whom they sent advertising, on whether they
applied for a card. For their customers, they have monthly data on whether
they paid their bill on time plus information on many variables measured on
the credit card application. The analysis of huge data sets is called data
mining.
Model building for huge data sets is challenging. There is currently
considerable study of alternatives to traditional statistical methods, including
automated algorithms that ignore concepts such as sampling error or modeling. Significance tests are usually irrelevant, as nearly any variable has a
significant effect if n is sufficiently large. Model-building strategies view
some models as useful for prediction even if they have complex structure.
Nonetheless, a point of diminishing returns still occurs in adding predictors
to models. After a point, new predictors tend to be so correlated with a linear
combination of ones already in the model that they do not improve predictive
power. For large n, inference is less relevant than summary measures of
predictive power. This is a topic of the next section.
6.2   LOGISTIC REGRESSION DIAGNOSTICS
In Section 5.2.3 we introduced statistics for checking model fit in a global
sense. After selecting a preliminary model, we obtain further insight by
switching to a microscopic mode of analysis. In contingency tables, for
instance, the pattern of lack of fit revealed in cell-by-cell comparisons of
observed and fitted counts may suggest a better model. For continuous
predictors, graphical displays are also helpful. Such diagnostic analyses may
suggest a reason for the lack of fit, such as nonlinearity in the effect of an
explanatory variable.
6.2.1   Pearson, Deviance, and Standardized Residuals
With categorical predictors, it is useful to form residuals to compare
observed and fitted counts. Let y_i denote the binomial variate for n_i trials
at setting i of the explanatory variables, i = 1, . . . , N. Let π̂_i denote the
model estimate of P(Y = 1). Then n_i π̂_i is the fitted number of successes.
For a GLM with binomial random component, the Pearson residual (4.36) for
this fit is

    e_i = (y_i - n_i π̂_i) / [var̂(Y_i)]^{1/2}
        = (y_i - n_i π̂_i) / √[n_i π̂_i (1 - π̂_i)].          (6.1)

This divides the raw residual (y_i - n_i π̂_i) by the estimated binomial
standard deviation of y_i. The Pearson statistic for testing the model fit
satisfies

    X² = Σ_{i=1}^{N} e_i².

Each squared Pearson residual is a component of X².
With π̂_i replaced by π_i in the numerator of (6.1), e_i is the difference
between a binomial random variable and its expectation, divided by its
estimated standard deviation. For large n_i, e_i then has an approximate
N(0, 1) distribution, when the model holds. Since π_i is estimated by π̂_i and
the {π̂_i} depend on {y_i}, however, {y_i - n_i π̂_i} tend to be smaller than
{y_i - n_i π_i} and the {e_i} are less variable than N(0, 1). If X² has df = ν,
X² = Σ_i e_i² is asymptotically comparable to the sum of squares of ν (rather
than N) independent standard normal random variables. Thus, when the
model holds, E(Σ_i e_i²)/N ≈ ν/N < 1.
The standardized Pearson residual is slightly larger in absolute value and
is approximately N(0, 1) when the model holds. In Section 4.5.5 we showed
that the adjustment uses the leverage from an estimated hat matrix. For
observation i with leverage ĥ_i, the standardized residual is

    r_i = e_i / √(1 - ĥ_i)
        = (y_i - n_i π̂_i) / √[n_i π̂_i (1 - π̂_i)(1 - ĥ_i)].

Absolute values larger than roughly 2 or 3 provide evidence of lack of fit.
An alternative residual uses components of the G² fit statistic. These are
the deviance residuals, introduced for GLMs in (4.35). The deviance residual
for observation i is

    √d_i × sign(y_i - n_i π̂_i),          (6.2)

where

    d_i = 2[ y_i log( y_i / (n_i π̂_i) )
           + (n_i - y_i) log( (n_i - y_i) / (n_i - n_i π̂_i) ) ].

This also tends to be less variable than N(0, 1) and can be standardized.
Plots of residuals against explanatory variables or linear predictor values
may detect a type of lack of fit. When fitted values are very small, however,
just as X² and G² lose relevance, so do residuals. When explanatory
variables are continuous, often n_i = 1 at each setting. Then y_i can equal
only 0 or 1, and e_i can assume only two values. One must then be cautious
about regarding either outcome as extreme, and a single residual is usually
uninformative. Plots of residuals also then have limited use, consisting
simply of two parallel lines of dots. The deviance itself is then completely
uninformative (Problem 5.37). When data can be grouped into sets of
observations having common predictor values, it is better to compute
residuals for the grouped data than for individual subjects.
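The three residuals of this section can be coded directly from their formulas. Below is a minimal NumPy sketch (function names are ours); the leverages come from the estimated hat matrix of Section 4.5.5, with W = diag{n_i π̂_i(1 - π̂_i)}.

```python
import numpy as np

def pearson_residuals(y, n, pihat):
    # e_i = (y_i - n_i pihat_i) / sqrt(n_i pihat_i (1 - pihat_i)),  eq. (6.1)
    return (y - n * pihat) / np.sqrt(n * pihat * (1 - pihat))

def deviance_residuals(y, n, pihat):
    # sqrt(d_i) * sign(y_i - n_i pihat_i),  eq. (6.2); 0*log(0) taken as 0
    t1 = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / (n * pihat)), 0.0)
    t2 = np.where(n - y > 0,
                  (n - y) * np.log(np.where(n - y > 0, n - y, 1.0)
                                   / (n - n * pihat)),
                  0.0)
    return np.sign(y - n * pihat) * np.sqrt(2.0 * (t1 + t2))

def standardized_residuals(y, n, pihat, X):
    # Divide e_i by sqrt(1 - h_i), with h_i the leverage from the estimated
    # hat matrix W^(1/2) X (X'WX)^(-1) X' W^(1/2).
    w = n * pihat * (1 - pihat)
    WX = np.sqrt(w)[:, None] * X
    H = WX @ np.linalg.solve(X.T @ (w[:, None] * X), WX.T)
    return pearson_residuals(y, n, pihat) / np.sqrt(1 - np.diag(H))
```

For instance, with two settings having y = (1, 3), n = (4, 4) and an intercept-only fit (π̂_i = 0.5), the Pearson residuals are ±1, each leverage is 0.5, and the standardized residuals are ±1.41.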
6.2.2   Heart Disease Example
A sample of male residents of Framingham, Massachusetts, aged 40 through
59, were classified on several factors, including blood pressure (Table 6.5).
The response variable is whether they developed coronary heart disease
during a six-year follow-up period.
    Let π_i be the probability of heart disease for blood pressure category i.
The table shows the fit and the standardized Pearson residuals for two
logistic regression models. The first model,

    logit(π_i) = α,

treats the response as independent of blood pressure. Some residuals for that
model are large. This is not surprising, since the model fits poorly
(G² = 30.0, X² = 33.4, df = 7).
TABLE 6.5 Standardized Pearson Residuals for Logit Models Fitted to
Data on Blood Pressure and Heart Disease

                      Observed        Fitted              Residual
Blood      Sample      Heart      Indep.   Linear     Indep.   Linear
Pressure    Size      Disease     Model    Logit      Model    Logit
< 117        156         3         10.8      5.2      -2.62    -1.11
117-126      252        17         17.4     10.6      -0.12     2.37
127-136      284        12         19.7     15.1      -2.02    -0.95
137-146      271        16         18.8     18.1      -0.74    -0.57
147-156      139        12          9.6     11.6       0.84     0.13
157-166       85         8          5.9      8.9       0.93    -0.33
167-186       99        16          6.9     14.2       3.76     0.65
> 186         43         8          3.0      8.4       3.07    -0.18

Source: Data from Cornfield (1962).
TABLE 6.6 Residuals Reported in SAS for Heart Disease Data of Table 6.5 a

Observ  disease     n     blood     Reschi     Resdev    StReschi
1          3       156    111.5    -0.9794    -1.0617    -1.1058
2         17       252    121.5     2.0057     1.8501     2.3746
3         12       284    131.5    -0.8133    -0.8420    -0.9453
4         16       271    141.5    -0.5067    -0.5162    -0.5727
5         12       139    151.5     0.1176     0.1170     0.1261
6          8        85    161.5    -0.3042    -0.3088    -0.3261
7         16        99    176.5     0.5135     0.5050     0.6520
8          8        43    191.5    -0.1395    -0.1402    -0.1773

a Reschi, Pearson residual; StReschi, adjusted residual.
A plot of the residuals shows an increasing trend. This suggests the linear
logit model,

    logit(π_i) = α + βx_i,

with scores {x_i} for blood pressure level. We used scores (111.5, 121.5,
131.5, 141.5, 151.5, 161.5, 176.5, 191.5). The nonextreme scores are
midpoints for the intervals of blood pressure. The trend in residuals
disappears for this model, and only the second category shows some evidence
of lack of fit.
Table 6.6 reports residuals for the linear logit model, as reported by SAS.
The Pearson residuals (Reschi), deviance residuals (Resdev), and
standardized Pearson residuals (StReschi) show similar results. Each is
somewhat large in the second category. One relatively large residual is not
surprising, however. With many residuals, some may be large purely by
chance. Here the overall fit statistics (G² = 5.9, X² = 6.3 with df = 6) do
not indicate problems. In analyzing residual patterns, we should be cautious
about attributing patterns to what might be chance variation from a model.

FIGURE 6.2 Observed and predicted proportions of heart disease for linear logit model.
Another useful graphical display for showing lack of fit compares observed
and fitted proportions by plotting them against each other or by plotting both
of them against explanatory variables. For the linear logit model, Figure 6.2
plots both the observed proportions and the estimated probabilities of heart
disease against blood pressure. The fit seems decent.
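The linear logit fit can be reproduced from the grouped counts of Table 6.5 with a few lines of Newton-Raphson. The sketch below is our own code, not the SAS run reported above; blood pressure is centered purely for numerical stability, which changes only the intercept. It recovers an overall Pearson statistic of about 6.3.

```python
import numpy as np

# Table 6.5 grouped data: blood pressure scores, heart disease cases, sample sizes
x = np.array([111.5, 121.5, 131.5, 141.5, 151.5, 161.5, 176.5, 191.5])
y = np.array([3., 17., 12., 16., 12., 8., 16., 8.])
n = np.array([156., 252., 284., 271., 139., 85., 99., 43.])

X = np.column_stack([np.ones_like(x), x - x.mean()])  # centered for stability
beta = np.zeros(2)
for _ in range(50):  # Newton-Raphson for the grouped binomial log likelihood
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    W = n * p * (1.0 - p)
    step = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - n * p))
    beta += step
    if np.max(np.abs(step)) < 1e-10:
        break

p = 1.0 / (1.0 + np.exp(-X @ beta))
e = (y - n * p) / np.sqrt(n * p * (1.0 - p))  # Pearson residuals
X2 = np.sum(e**2)
print(round(X2, 1))  # about 6.3, in line with the fit statistics quoted above
```

The Pearson residual for the second category comes out near the 2.0057 that SAS reports in Table 6.6.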
Studying residuals helps us understand either why a model fits poorly or
where there is lack of fit in a generally good-fitting model. The next example
illustrates the second case.
6.2.3   Graduate Admissions Example
Table 6.7 refers to graduate school applications to the 23 departments in the
College of Liberal Arts and Sciences at the University of Florida during the
1997-1998 academic year. It cross-classifies applicant's gender (G), whether
admitted (A), and department (D) to which the prospective students applied.
We consider logit models with A as the response variable. Let y_{ik} denote
the number admitted and let π_{ik} denote the probability of admission for
gender i in department k. We treat {Y_{ik}} as independent bin(n_{ik},
π_{ik}). Other things being equal, one would hope the admissions decision is
independent of gender. However, the model with no gender effect, given the
department,

    logit(π_{ik}) = α + β_k^D,

fits rather poorly (G² = 44.7, X² = 40.9, df = 23).
TABLE 6.7 Data Relating Admission to Gender and Department
for Model with No Gender Effect

            Females        Males       Std. Res.
Dept       Yes    No     Yes    No    (Fem, Yes)
anth        32    81      21    41      -0.76
astr         6     0       3     8       2.87
chem        12    43      34   110      -0.27
clas         3     1       4     0      -1.07
comm        52   149       5    10      -0.63
comp         8     7       6    12       1.16
engl        35   100      30   112       0.94
geog         9     1      11    11       2.17
geol         6     3      15     6      -0.26
germ        17     0       4     1       1.89
hist         9     9      21    19      -0.18
lati        26     7      25    16       1.65
ling        21    10       7     8       1.37
math        25    18      31    37       1.29
phil         3     0       9     6       1.34
phys        10    11      25    53       1.32
poli        25    34      39    49      -0.23
psyc         2   123       4    41      -2.27
reli         3     3       0     2       1.26
roma        29    13       6     3       0.14
soci        16    33       7    17       0.30
stat        23     9      36    14      -0.01
zool         4    62      10    54      -1.76

Source: Data courtesy of James Booth.
Table 6.7 also reports standardized Pearson residuals for the number of
females who were admitted for this model. For instance, the astronomy
department admitted 6 females, which was 2.87 standard deviations higher
than the model predicted. Each department has only a single nonredundant
standardized residual, because of marginal constraints for the model. The
model has fit π̂_{ik} = (y_{1k} + y_{2k})/n_{+k}, corresponding to an
independence fit (π̂_{1k} = π̂_{2k}) in each partial table. Now,

    y_{1k} - n_{1k}π̂_{1k} = y_{1k} - n_{1k}(y_{1k} + y_{2k})/n_{+k}
                          = (n_{2k}/n_{+k})y_{1k} - (n_{1k}/n_{+k})y_{2k}
                          = -(y_{2k} - n_{2k}π̂_{2k}).

Thus, standard errors of (y_{1k} - n_{1k}π̂_{1k}) and (y_{2k} - n_{2k}π̂_{2k})
are identical. The standardized residuals are identical in absolute value for
males and females but of different sign. Astronomy admitted 3 males, and
their standardized residual was -2.87; the number admitted was 2.87
standard deviations fewer than predicted. This is another advantage of
standardized over ordinary Pearson residuals. The model of independence in
a partial table has df = 1. Only one bit of information exists about how the
data depart from independence, yet the ordinary Pearson residual for males
need not equal the ordinary Pearson residual for females.
Departments with large standardized Pearson residuals reveal the reason
for the lack of fit. Significantly more females were admitted than the model
predicts in the astronomy and geography departments, and fewer in the
psychology department. Without these three departments, the model fits
reasonably well (G² = 24.4, X² = 22.8, df = 20).
For the complete data, adding a gender effect to the model does not
provide an improved fit (G² = 42.4, X² = 39.0, df = 22), because the
departments just described have associations in different directions and of
greater magnitude than other departments. This model has an ML estimate
of 1.19 for the GA conditional odds ratio, the odds of admission being 19%
higher for females than males, given department. By contrast, the marginal
table collapsed over department has a GA sample odds ratio of 0.94, the
overall odds of admission being 6% lower for females. This illustrates
Simpson's paradox (Section 2.3.2), the conditional association having a
different direction than the marginal association.
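The 0.94 marginal odds ratio can be checked by collapsing Table 6.7 over department; the sketch below does just that. (The conditional estimate of 1.19 requires fitting the logit model with a gender effect, which we do not repeat here.)

```python
import numpy as np

# Table 6.7 counts, one row per department:
# [females yes, females no, males yes, males no]
counts = np.array([
    [32,  81, 21,  41], [ 6,   0,  3,   8], [12,  43, 34, 110],
    [ 3,   1,  4,   0], [52, 149,  5,  10], [ 8,   7,  6,  12],
    [35, 100, 30, 112], [ 9,   1, 11,  11], [ 6,   3, 15,   6],
    [17,   0,  4,   1], [ 9,   9, 21,  19], [26,   7, 25,  16],
    [21,  10,  7,   8], [25,  18, 31,  37], [ 3,   0,  9,   6],
    [10,  11, 25,  53], [25,  34, 39,  49], [ 2, 123,  4,  41],
    [ 3,   3,  0,   2], [29,  13,  6,   3], [16,  33,  7,  17],
    [23,   9, 36,  14], [ 4,  62, 10,  54],
])

fy, fn, my, mn = counts.sum(axis=0)  # marginal table collapsed over department
marginal_or = (fy * mn) / (fn * my)
print(round(marginal_or, 2))  # 0.94: overall odds of admission lower for females
```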
6.2.4   Influence Diagnostics for Logistic Regression
Other regression diagnostic tools are also helpful in assessing fit. These
include plots of ordered residuals against normal percentiles (Haberman
1973a) and analyses that describe an observation's influence on parameter
estimates and fit statistics. Whenever a residual indicates that a model fits an
observation poorly, it can be informative to delete the observation and refit
the model to remaining ones. This is equivalent to adding a parameter to the
model for that observation, forcing a perfect fit for it.
As in ordinary regression, an observation may be relatively influential in
determining parameter estimates. The greater an observation’s leverage,
the greater its potential influence. The fit could be quite different if an
observation that appears to be an outlier on y and has large leverage is
deleted. However, a single observation can have a more exorbitant influence
in ordinary regression than a single binary observation in logistic regression,
since there is no bound on the distance of y_i from its expected value. Also,
in Section 4.5.5 we observed that the GLM estimated hat matrix

    Ĥat = Ŵ^{1/2} X (X'ŴX)^{-1} X'Ŵ^{1/2}

depends on the fit as well as the model matrix X. For logistic regression, in
Section 5.5.2 we showed that the weight matrix Ŵ is diagonal with element
ŵ_i = n_i π̂_i(1 - π̂_i) for the n_i observations at setting i of predictors.
Points that have extreme predictor values need not have high leverage. In
fact, the leverage can be small if π̂_i is close to 0 or 1.
Several measures that describe the effect on parameter estimates and fit
statistics of removing an observation from the data set are related
algebraically to the observation's leverage (Pregibon 1981; Williams 1987).
In logistic regression, the observation could be a single binary response or a
binomial response for a set of subjects all having the same predictor values.
Influence measures for each observation include:

1. For each model parameter, the change in the parameter estimate when
   the observation is deleted. This change, divided by its standard error,
   is called Dfbeta.
2. A measure of the change in a joint confidence interval for the
   parameters produced by deleting the observation. This confidence
   interval displacement diagnostic is denoted by c.
3. The change in X² or G² goodness-of-fit statistics when the observation
   is deleted.
For each measure, the larger the value, the greater the influence. We
illustrate them using the linear logit model with blood pressure as a
predictor for heart disease in Table 6.5. Table 6.8 contains simple
approximations (due to Pregibon 1981) for the Dfbeta measure for the
coefficient of blood pressure, the confidence interval diagnostic c, the change
in G², and the change in X². (This is the square of the standardized Pearson
residual, r_i².) All their values show that deleting the second observation has
the greatest effect. This is not surprising, as that observation has the only
relatively large residual. By contrast, Table 6.8 also contains the changes in
X² and G² for deleting observations in fitting the independence model. At
the low and high ends of the blood pressure values, several changes are very
large. However, these all relate to removing an entire binomial sample at a
blood pressure level instead of removing a single subject's binary
observation. Such subject-level deletions have little effect even for this
model.
TABLE 6.8 Diagnostic Measures for Logistic Regression Models Fitted
to Heart Disease Data

Blood                     Pearson   Likelihood-Ratio   Pearson      Likelihood-Ratio
Pressure  Dfbeta    c     X² Diff.     G² Diff.        X² Diff. a      G² Diff. a
111.5      0.49   0.34      1.22         1.39            6.86            9.13
121.5     -1.14   2.26      5.64         5.04            0.02            0.02
131.5      0.33   0.31      0.89         0.94            4.08            4.56
141.5      0.08   0.09      0.33         0.34            0.55            0.57
151.5      0.01   0.00      0.02         0.02            0.70            0.66
161.5     -0.07   0.02      0.11         0.11            0.87            0.80
176.5      0.40   0.26      0.42         0.42           14.17           10.83
191.5     -0.12   0.02      0.03         0.03            9.41            6.73

a Independence model; other values refer to model with blood pressure predictor.
Source: Data from Cornfield (1962).

With continuous or multiple predictors, it can be informative to plot these
diagnostics, for instance against the estimated probabilities. See Cook and
Weisberg (1999, Chap. 22), Fowlkes (1987), and Landwehr et al. (1984) for
examples of useful diagnostic plots.
6.2.5   Summarizing Predictive Power: R and R-Squared Measures
In ordinary regression, R² describes the proportional reduction in variation
in comparing the conditional variation of the response to the marginal
variation. It and the multiple correlation R describe the power of the
explanatory variables to predict the response, with R = 1 for perfect
prediction. Despite various attempts to define analogs for categorical
response models, no proposed measure is as widely useful as R and R². We
present a few proposed measures in this section.
For any GLM, the correlation r(y, μ̂) between the observed responses {y_i}
and the model's fitted values {μ̂_i} measures predictive power. For least
squares regression, this is the multiple correlation between Y and the
predictors. An advantage of the correlation relative to its square is the
appeal of working on the original scale and its approximate proportionality
to effect size: For a small effect with a single predictor, doubling the slope
corresponds roughly to doubling the correlation. This measure can be useful
for comparing fits of different models to the same data set.
    In logistic regression, μ̂_i for a particular model is the estimated
probability π̂_i for binary observation i. Table 6.2 shows r(y, μ̂) for a few
models fitted to the horseshoe crab data. Width alone has r = 0.402, and
adding color to the model increases r to 0.452. The simpler model that uses
color merely to indicate whether a crab is dark does essentially as well, with
r = 0.447. The complex model containing color, spine condition, width, and
all their two- and three-way interactions has r = 0.526. This seems
considerably higher, but with multiple predictors the r estimates become
more highly biased in estimating the true correlation. It can be misleading
to compare r values for models with greatly different df values. After a
jackknife adjustment designed
to reduce bias, there is little difference between r for this overly complex
model and the simpler model (Zheng and Agresti 2000). Little is lost and
much is gained by using the simpler model.
Another way to measure the association between the binary responses {y_i}
and their fitted values {π̂_i} uses the proportional reduction in squared error

    1 - Σ_i (y_i - π̂_i)² / Σ_i (y_i - ȳ)²,

obtained by using π̂_i instead of ȳ = Σ y_i/n as a predictor of y_i (Efron
1978). Amemiya (1981) suggested a related measure that weights squared
deviations by inverse predicted variances. For logistic regression, unlike
normal GLMs, these and r(y, μ̂) need not be nondecreasing as the model gets
more complex. Like any correlation-type measure, they can depend strongly
on the range of observed values of explanatory variables.
Other measures directly use the likelihood function. Denote the maximized
log likelihood by L_M for a given model, L_S for the saturated model, and
L_0 for the null model containing only an intercept term. Probabilities are
no greater than 1.0, so log likelihoods are nonpositive. As the model
complexity increases, the parameter space expands, so the maximized log
likelihood increases. Thus, L_0 ≤ L_M ≤ L_S ≤ 0. The measure

    (L_M - L_0) / (L_S - L_0)          (6.3)

falls between 0 and 1. It equals 0 when the model provides no improvement
in fit over the null model, and it equals 1 when the model fits as well as the
saturated model. A weakness is that the log likelihood is not an easily
interpretable scale. Interpreting the numerical value is difficult, other than
in a comparative sense for different models.
For n independent Bernoulli observations, the maximized log likelihood is

    log Π_{i=1}^{n} π̂_i^{y_i} (1 - π̂_i)^{1-y_i}
        = Σ_{i=1}^{n} [ y_i log π̂_i + (1 - y_i) log(1 - π̂_i) ].

The null model gives π̂_i = (Σ y_i)/n = ȳ, so that

    L_0 = n[ ȳ(log ȳ) + (1 - ȳ)log(1 - ȳ) ].

The saturated model has a parameter for each subject and implies that
π̂_i = y_i for all i. Thus, L_S = 0 and (6.3) simplifies to

    D = (L_0 - L_M) / L_0 .

McFadden (1974) proposed this measure.
With multiple observations at each setting of explanatory variables, the
data file can take the grouped-data form of N binomial counts rather than n
Bernoulli indicators. The saturated model then has a parameter for each
count. It gives N fitted proportions equal to the N sample proportions of
success. Then L_S is nonzero and (6.3) takes a different value than when
calculated using individual subjects. For N binomial counts, the maximized
likelihoods are related to the G² goodness-of-fit statistic by G²(M) =
-2(L_M - L_S), so (6.3) becomes

    D* = [G²(0) - G²(M)] / G²(0).

Goodman (1971a) and Theil (1970) discussed this and related partial
association measures.
With grouped data, D* can be large even when predictive power is weak at
the subject level. For instance, a model can fit much better than the null
model even though fitted probabilities are close to 0.5 for the entire sample.
In particular, D* = 1 when it fits perfectly, regardless of how well one can
predict individual subjects' responses on Y with that model. Also, suppose
that the population satisfies the given model, but not the null model. As the
sample size n increases with the number of settings N fixed, G²(M) behaves
like a chi-squared random variable but G²(0) grows unboundedly. Thus,
D* → 1 as n → ∞, and its magnitude tends to depend on n. This measure
confounds model goodness of fit with predictive power. Similar behavior
occurs for R² in regression analyses when calculated using means of Y values
(rather than individual subjects) at N different x settings. It is more
sensible to use D for binary, ungrouped data.
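McFadden's D is straightforward to code for ungrouped binary data; a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def mcfadden_D(y, pihat):
    """D = (L0 - LM)/L0 for ungrouped binary data (McFadden 1974)."""
    y = np.asarray(y, float)
    pihat = np.asarray(pihat, float)
    ybar = y.mean()
    L0 = len(y) * (ybar * np.log(ybar) + (1 - ybar) * np.log(1 - ybar))
    LM = np.sum(y * np.log(pihat) + (1 - y) * np.log(1 - pihat))
    return (L0 - LM) / L0
```

Fitted probabilities all equal to ȳ give D = 0, while probabilities concentrating near the observed 0's and 1's push D toward 1.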
6.2.6   Summarizing Predictive Power: Classification Tables and ROC Curves
A classification table cross-classifies the binary response y with a prediction
ŷ of whether y = 0 or 1. The prediction is ŷ = 1 when π̂_i > π_0 and ŷ = 0
when π̂_i ≤ π_0, for some cutoff π_0. Most classification tables use π_0 = 0.5
and summarize predictive power by

    sensitivity = P(ŷ = 1 | y = 1)   and   specificity = P(ŷ = 0 | y = 0)
(Recall Section 2.1.2.) Limitations of this table are that it collapses
continuous predictive values π̂ into binary ones, the choice of π_0 is
arbitrary, and it is highly sensitive to the relative numbers of times y = 1
and y = 0.

FIGURE 6.3 ROC curve for logistic regression model with horseshoe crab data.
A receiver operating characteristic (ROC) curve is a plot of sensitivity as a
function of (1 - specificity) for the possible cutoffs π_0. This curve usually
has a concave shape connecting the points (0, 0) and (1, 1). The higher the
area under the curve, the better the predictions. The ROC curve is more
informative than the classification table, since it summarizes predictive
power for all possible π_0. Figure 6.3 shows how PROC LOGISTIC in SAS
reports the ROC curve for the model for the horseshoe crabs using width and
color as predictors.
The area under a ROC curve is identical to the value of another measure
of predictive power, the concordance index. Consider all pairs of
observations (i, j) such that y_i = 1 and y_j = 0. The concordance index c
estimates the probability that the predictions and the outcomes are
concordant, the observation with the larger y also having the larger π̂
(Harrell et al. 1982). A value c = 0.5 means predictions were no better than
random guessing. This corresponds to a model having only an intercept term
and an ROC curve that is a straight line connecting the points (0, 0) and
(1, 1). For the horseshoe crab data, c = 0.639 with color alone as a
predictor, 0.742 with width alone, 0.771 with width and color, and 0.772
with width and a dummy for whether a crab has dark color.
ROC curves are a popular way of evaluating diagnostic tests. Sometimes
such tests have J > 2 ordered response categories rather than (positive,
negative). The ROC curve then refers to the various possible cutoffs for
defining a result to be positive. It plots sensitivity against 1 - specificity
for the possible collapsings of the J categories to a (positive, negative)
scale [see Toledano and Gatsonis (1996)].
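The concordance index itself needs only the fitted probabilities and the binary outcomes; a NumPy sketch (the function name is ours), counting tied pairs as 1/2:

```python
import numpy as np

def concordance_index(y, pihat):
    """Estimated P(pihat_i > pihat_j) over pairs with y_i = 1, y_j = 0.

    Ties in pihat count 1/2; this equals the area under the ROC curve.
    """
    y = np.asarray(y)
    pihat = np.asarray(pihat, float)
    pos = pihat[y == 1]
    neg = pihat[y == 0]
    diff = pos[:, None] - neg[None, :]  # all (y=1, y=0) pair comparisons
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))
```

For an intercept-only model every π̂_i ties, so c = 0.5, matching the random-guessing baseline described above.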
6.3   INFERENCE ABOUT CONDITIONAL ASSOCIATIONS IN 2 × 2 × K TABLES
The analysis of the graduate admissions data in Section 6.2.3 used the model
of conditional independence. This model is an important one in biomedical
studies that investigate whether an association exists between a treatment
variable and a disease outcome after controlling for a possibly confounding
variable that might influence that association. In this section we review the
test of conditional independence as a logit model analysis for a 2 × 2 × K
contingency table. We also present a test (Mantel and Haenszel 1959) that
seems non-model-based but relates to the logit model.
We illustrate using Table 6.9, showing results of a clinical trial with eight
centers. The study compared two cream preparations, an active drug and a
TABLE 6.9  Clinical Trial Relating Treatment to Response for Eight Centers

                           Response
Center   Treatment    Success   Failure    Odds Ratio   μ_{11k}   var(n_{11k})
  1      Drug            11        25         1.19       10.36        3.79
         Control         10        27
  2      Drug            16         4         1.82       14.62        2.47
         Control         22        10
  3      Drug            14         5         4.80       10.50        2.41
         Control          7        12
  4      Drug             2        14         2.29        1.45        0.70
         Control          1        16
  5      Drug             6        11          ∞          3.52        1.20
         Control          0        12
  6      Drug             1        10          ∞          0.52        0.25
         Control          0        10
  7      Drug             1         4         2.0         0.71        0.42
         Control          1         8
  8      Drug             4         2         0.33        4.62        0.62
         Control          6         1

Source: Beitler and Landis (1985).
control, on their success in curing an infection. This table illustrates a common pharmaceutical application: comparing two treatments on a binary response with observations from several strata. The strata are often medical centers or clinics; or they may be levels of age or severity of the condition being treated, or combinations of levels of several control variables; or they may be different studies of the same sort evaluated in a meta-analysis.
6.3.1  Using Logit Models to Test Conditional Independence

For a binary response Y, we study the effect of a binary predictor X, controlling for a qualitative covariate Z. Let π_{ik} = P(Y = 1 | X = i, Z = k). Consider the model

    logit(π_{ik}) = α + βx_i + β_k^Z,    i = 1, 2,    k = 1, ..., K,        (6.4)

where x_1 = 1 and x_2 = 0. This model assumes that the XY conditional odds ratio is the same at each category of Z, namely exp(β). The null hypothesis of XY conditional independence is H_0: β = 0. The Wald statistic is (β̂/SE)². The likelihood-ratio statistic is the difference between G² statistics for the reduced model

    logit(π_{ik}) = α + β_k^Z        (6.5)

and the full model. These tests are sensible when X has a similar effect at each category of Z. They have df = 1.
Alternatively, since the reduced model (6.5) is equivalent to conditional independence of X and Y, one could test conditional independence using a goodness-of-fit test of that model. That test has df = K when X is binary. This corresponds to comparing model (6.5) and the saturated model, which permits β ≠ 0 and contains XZ interaction parameters. When no interaction exists, or when interaction exists but has minor substantive importance, it follows from results to be presented in Section 6.4.2 that this approach is less powerful, especially when K is large. However, when the direction of the XY association varies among categories of Z, it can be more powerful.
6.3.2  Cochran–Mantel–Haenszel Test of Conditional Independence

Mantel and Haenszel (1959) proposed a non-model-based test of H_0: conditional independence in 2 × 2 × K tables. Focusing on retrospective studies of disease, they treated the response (column) marginal totals as fixed. Thus, in each partial table k of cell counts {n_{ijk}}, their analysis conditions on both the predictor totals {n_{1+k}, n_{2+k}} and the response outcome totals {n_{+1k}, n_{+2k}}. The usual sampling schemes then yield a hypergeometric distribution (3.16) for the first cell count n_{11k} in each partial table. That count determines {n_{12k}, n_{21k}, n_{22k}}, given the marginal totals.
Under H_0, the hypergeometric mean and variance of n_{11k} are

    μ_{11k} = E(n_{11k}) = n_{1+k} n_{+1k} / n_{++k},

    var(n_{11k}) = n_{1+k} n_{2+k} n_{+1k} n_{+2k} / [n_{++k}² (n_{++k} − 1)].

Cell counts from different partial tables are independent. The test statistic combines information from the K tables by comparing Σ_k n_{11k} to its null expected value. It equals

    CMH = [Σ_k (n_{11k} − μ_{11k})]² / Σ_k var(n_{11k}).        (6.6)
This statistic has a large-sample chi-squared null distribution with df = 1. When the odds ratio θ_{XY(k)} > 1 in partial table k, we expect that (n_{11k} − μ_{11k}) > 0. When θ_{XY(k)} > 1 in every partial table or θ_{XY(k)} < 1 in each table, Σ_k (n_{11k} − μ_{11k}) tends to be relatively large in absolute value. This test works best when the XY association is similar in each partial table. In this sense it is similar to the tests of H_0: β = 0 in logit model (6.4). When the sample sizes in the strata are moderately large, this test usually gives similar results. In fact, it is a score test (Section 1.3.3) of H_0: β = 0 in that model (Day and Byar 1979).
Cochran (1954) proposed a similar statistic. He treated the rows in each 2 × 2 table as two independent binomials rather than as a hypergeometric. Cochran's statistic is (6.6) with var(n_{11k}) replaced by

    var(n_{11k}) = n_{1+k} n_{2+k} n_{+1k} n_{+2k} / n_{++k}³.

Because of the similarity in their approaches, we call (6.6) the Cochran–Mantel–Haenszel (CMH) statistic. The Mantel and Haenszel approach using the hypergeometric is more general in that it also applies to some cases in which the rows are not independent binomial samples from two populations. Examples are retrospective studies and randomized clinical trials with the available subjects randomly allocated to two treatments. In the first case the column totals are naturally fixed. In the second, under the null hypothesis the column margins are the same regardless of how subjects were assigned to treatments, and randomization arguments lead to the hypergeometric in each 2 × 2 table.
Mantel and Haenszel (1959) proposed (6.6) with a continuity correction. The P-value from the test then better approximates an exact conditional test (Section 6.7.5), but it tends to be conservative. The CMH statistic generalizes to I × J × K tables (Section 7.5.3).
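The CMH statistic is straightforward to compute from the partial-table counts. A minimal sketch, entering each 2 × 2 stratum of Table 6.9 as (n_{11k}, n_{12k}, n_{21k}, n_{22k}) (the function name is ours):

```python
def cmh_statistic(strata):
    """Cochran-Mantel-Haenszel statistic (6.6) for a 2x2xK table.
    Each stratum is a tuple (n11, n12, n21, n22)."""
    diff_sum = 0.0   # sum_k (n11k - mu_11k)
    var_sum = 0.0    # sum_k var(n11k)
    for n11, n12, n21, n22 in strata:
        n = n11 + n12 + n21 + n22
        mu = (n11 + n12) * (n11 + n21) / n          # hypergeometric mean
        var = ((n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22)
               / (n * n * (n - 1)))                 # hypergeometric variance
        diff_sum += n11 - mu
        var_sum += var
    return diff_sum ** 2 / var_sum

# Table 6.9: (drug success, drug failure, control success, control failure)
centers = [(11, 25, 10, 27), (16, 4, 22, 10), (14, 5, 7, 12),
           (2, 14, 1, 16), (6, 11, 0, 12), (1, 10, 0, 10),
           (1, 4, 1, 8), (4, 2, 6, 1)]
print(round(cmh_statistic(centers), 2))   # 6.38
```

Referring the value to a chi-squared distribution with df = 1 gives the P-value reported in the example of Section 6.3.3.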
6.3.3  Multicenter Clinical Trial Example

For the multicenter clinical trial, Table 6.9 reports the sample odds ratio for each table and the expected value and variance of the number of successes for the drug treatment (n_{11k}) under H_0: conditional independence. In each table except the last, the sample odds ratio shows a positive association. Thus, it makes sense to combine results, with CMH = 6.38 and df = 1. There is considerable evidence against H_0 (P = 0.012).

Similar results occur in testing H_0: β = 0 in logit model (6.4). The model fit has β̂ = 0.777 with SE = 0.307. The Wald statistic is (0.777/0.307)² = 6.42 (P = 0.011). The likelihood-ratio statistic equals 6.67 (P = 0.010).
6.3.4
CMH Test and Sparse Data*
In summary, for logit model Ž6.4., CMH is the score statistic alternative to
the likelihood-ratio or Wald test of H0 :  s 0. As n ™ ⬁ with fixed K, the
tests have the same asymptotic chi-squared behavior under H0 . An advantage
of CMH is that its chi-squared limit also applies with an alternative asymptotic scheme in which K ™ ⬁ as n ™ ⬁. The asymptotic theory for likelihood-ratio and Wald tests requires the number of parameters Žand hence K .
to be fixed, so it does not apply to this scheme. An application of this type is
when each stratum has a single matched pair of subjects, one in each group.
With strata of matched pairs, n1qk s n 2qk s 1 for each k. Then n s 2 K,
so K ™ ⬁ as n ™ ⬁. Table 6.10 shows the data layout for this situation.
When both subjects in stratum k make the same response Žas in the first case
in Table 6.10., nq1 k s 0 or nq2 k s 0. Given the marginal counts, the
internal counts are then completely determined, and 11 k s n11 k and
var Ž n11 k . s 0. When the subjects make differing responses Žas in the second
case., nq1 k s nq2 k s 1, so that 11 k s 0.5 and var Ž n11 k . s 0.25. Thus, a
matched pair contributes to the CMH statistic only when the two subjects’
responses differ. Let K * denote the number of the K tables that satisfy this.
Although each n11 k can take only two values, the central limit theorem
implies that Ý k n11 k is approximately normal for large K *. Thus, the distribution of CMH is approximately chi-squared.
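This reduction can be checked numerically. In the sketch below (hypothetical pairs, not data from the text), applying the general formula (6.6) to matched-pair strata reproduces (b − c)²/(b + c), where b and c count the two kinds of discordant pairs — the form of McNemar's statistic for binary matched pairs; concordant pairs contribute nothing:

```python
def cmh_statistic(strata):
    """CMH statistic (6.6); each stratum is (n11, n12, n21, n22)."""
    diff_sum = var_sum = 0.0
    for n11, n12, n21, n22 in strata:
        n = n11 + n12 + n21 + n22
        diff_sum += n11 - (n11 + n12) * (n11 + n21) / n
        var_sum += ((n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22)
                    / (n * n * (n - 1)))
    return diff_sum ** 2 / var_sum

# Hypothetical matched pairs; each stratum has rows (first, second member).
b, c = 8, 3                            # discordant pairs of the two types
strata = ([(1, 0, 0, 1)] * b           # first success, second failure
          + [(0, 1, 1, 0)] * c         # first failure, second success
          + [(1, 0, 1, 0)] * 5         # both success: contributes nothing
          + [(0, 1, 0, 1)] * 4)        # both failure: contributes nothing
print(round(cmh_statistic(strata), 4),
      round((b - c) ** 2 / (b + c), 4))   # 2.2727 2.2727
```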
Usually, when K grows with n, each stratum has few observations. There may be more than two observations, as in case–control studies that match several controls with each case. Contingency tables with relatively few observations are referred to as sparse. The nonstandard setting in which K → ∞ as n → ∞ is called sparse-data asymptotics. Ordinary ML estimation then breaks down because the number of parameters is not fixed, instead having the same order as the sample size. In particular, an approximate chi-squared distribution holds for the likelihood-ratio and Wald statistics for testing conditional
TABLE 6.10  Stratum Containing a Matched Pair

                        Response                Response
Element of Pair    Success    Failure      Success    Failure
First                 1          0            1          0
Second                1          0            0          1
independence only when the strata marginal totals generally exceed about 5
to 10 and K is fixed and small relative to n.
6.3.5  Estimation of Common Odds Ratio

It is more informative to estimate the strength of association than to test hypotheses about it. When the association seems stable among partial tables, it is helpful to combine the K sample odds ratios into a summary measure of conditional association. The logit model (6.4) implies homogeneous association, θ_{XY(1)} = ··· = θ_{XY(K)} = exp(β). The ML estimate of the common odds ratio is exp(β̂).
Other estimators of a common odds ratio are not model-based. Woolf (1955) proposed an exponentiated weighted average of the K sample log odds ratios. Mantel and Haenszel (1959) proposed

    θ̂_MH = Σ_k (n_{11k} n_{22k} / n_{++k}) / Σ_k (n_{12k} n_{21k} / n_{++k})
          = Σ_k p_{11|k} p_{22|k} n_{++k} / Σ_k p_{12|k} p_{21|k} n_{++k},        (6.7)

where p_{ij|k} = n_{ijk} / n_{++k}. This gives more weight to strata with larger sample sizes. It is preferred over the ML estimator when K is large and the data are sparse. The ML estimator β̂ of the log odds ratio then tends to be too large in absolute value. For sparse-data asymptotics with only a single matched pair in each stratum, for instance, β̂ → 2β in probability. [This convergence in probability means that for any ε > 0, P(|β̂ − 2β| < ε) → 1 as n → ∞; see Problem 10.24.]
Hauck (1979) gave an asymptotic variance for log(θ̂_MH) that applies for a fixed number of strata. In that case log(θ̂_MH) is slightly less efficient than the ML estimator β̂ unless β = 0 (Tarone et al. 1983). Robins et al. (1986) derived an estimated variance that applies both for these standard asymptotics with large n and fixed K and for sparse asymptotics in which K is also large. Expressing θ̂_MH = R/S = (Σ_k R_k)/(Σ_k S_k) with R_k = n_{11k} n_{22k}/n_{++k} and S_k = n_{12k} n_{21k}/n_{++k}, their derivation showed that (log θ̂_MH − log θ) is approximately proportional to (R − S). They also showed that E(R − S) = 0 and derived the variance of (R − S). Their result is

    σ̂²[log θ̂_MH] = Σ_k n_{++k}⁻¹ (n_{11k} + n_{22k}) R_k / (2R²)
                    + Σ_k n_{++k}⁻¹ [(n_{11k} + n_{22k}) S_k + (n_{12k} + n_{21k}) R_k] / (2RS)
                    + Σ_k n_{++k}⁻¹ (n_{12k} + n_{21k}) S_k / (2S²).
For the eight-center clinical trial summarized by Table 6.9,

    θ̂_MH = [(11 × 27)/73 + ··· + (4 × 1)/13] / [(25 × 10)/73 + ··· + (2 × 6)/13] = 2.13.

For log θ̂_MH = 0.758, σ̂[log θ̂_MH] = 0.303. A 95% confidence interval for the common odds ratio is exp(0.758 ± 1.96 × 0.303), or (1.18, 3.87). Similar results occur using model (6.4). The 95% confidence interval for exp(β) is exp(0.777 ± 1.96 × 0.307), or (1.19, 3.97), using the Wald interval, and (1.20, 4.02) using the likelihood-ratio interval. Although the evidence of an effect is considerable, inference about its size is rather imprecise. The odds of success may be as little as 20% higher with the drug, or they may be as much as four times as high.
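The estimate, its standard error, and the confidence interval can be reproduced directly from the Table 6.9 counts. A minimal sketch, implementing (6.7) and the Robins et al. (1986) variance above (the function name is ours):

```python
from math import exp, log

def mh_odds_ratio(strata):
    """Mantel-Haenszel estimate (6.7) and the Robins et al. (1986)
    standard error of its log. Each stratum is (n11, n12, n21, n22)."""
    R = S = t1 = t2 = t3 = 0.0
    for n11, n12, n21, n22 in strata:
        n = n11 + n12 + n21 + n22
        Rk, Sk = n11 * n22 / n, n12 * n21 / n
        P, Q = (n11 + n22) / n, (n12 + n21) / n
        R += Rk
        S += Sk
        t1 += P * Rk                   # terms of the three sums in the
        t2 += P * Sk + Q * Rk          # variance formula
        t3 += Q * Sk
    se_log = (t1 / (2 * R * R) + t2 / (2 * R * S) + t3 / (2 * S * S)) ** 0.5
    return R / S, se_log

centers = [(11, 25, 10, 27), (16, 4, 22, 10), (14, 5, 7, 12),
           (2, 14, 1, 16), (6, 11, 0, 12), (1, 10, 0, 10),
           (1, 4, 1, 8), (4, 2, 6, 1)]
theta, se = mh_odds_ratio(centers)
ci = (exp(log(theta) - 1.96 * se), exp(log(theta) + 1.96 * se))
print(round(theta, 2), round(se, 3))      # 2.13 0.303
print(tuple(round(v, 2) for v in ci))     # (1.18, 3.87)
```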
If the true odds ratios are not identical but do not vary drastically, θ̂_MH is still a useful summary of the conditional associations. Similarly, the CMH test is a powerful summary of evidence against H_0: conditional independence, as long as the sample associations fall primarily in a single direction. It is not necessary to assume equality of odds ratios to use the CMH test.
6.3.6  Testing Homogeneity of Odds Ratios

The homogeneous association condition θ_{XY(1)} = ··· = θ_{XY(K)} for 2 × 2 × K tables is equivalent to logit model (6.4). A test of homogeneous association is implicitly a goodness-of-fit test of this model. The usual G² and X² test statistics provide this, with df = K − 1. They test that the K − 1 parameters in the saturated model that are the coefficients of interaction terms [cross products of the dummy variable for x with (K − 1) dummy variables for categories of Z] all equal 0. Breslow and Day (1980, p. 142) proposed an alternative large-sample test (Note 6.5).
For the eight-center clinical trial data in Table 6.9, G² = 9.7 and X² = 8.0 (df = 7) do not contradict the hypothesis of equal odds ratios. It is reasonable to summarize the conditional association by a single odds ratio (e.g., θ̂_MH = 2.1) for all eight partial tables. In fact, even with a small P-value in a test of homogeneous association, if the variability in the sample odds ratios is not substantial, a summary measure such as θ̂_MH is useful. A test of homogeneity is not a prerequisite for this measure or for testing conditional independence.
6.3.7  Summarizing Heterogeneity in Odds Ratios
In practice, a predictor effect is often similar from stratum to stratum. In
multicenter clinical trials comparing a new drug to a standard, for example, if
the new drug is truly more beneficial, the true effect is usually positive in
each stratum.
In strict terms, however, a model with homogeneous effects is unrealistic. First, we rarely expect the true odds ratio to be exactly the same in each stratum, because of unmeasured covariates that affect it. Breslow (1976) discussed modeling of the log odds ratio using a set of explanatory variables. Second, the model regards the strata effects {β_k^Z} as fixed effects, treating them as the only strata of interest. Often the strata are merely a sampling of the possible ones. Multicenter clinical trials have data for certain centers, but many other centers could have been used. Scientists would like their conclusions to apply to all such centers, not only those in the study.
A somewhat different logit model treats the true log odds ratios in the partial tables as a random sample from a N(μ, σ²) distribution. Fitting the model yields an estimated mean log odds ratio and an estimated variability about that mean. The inference applies to the population of strata rather than only those sampled. This type of model uses random effects in the linear predictor to induce this extra type of variability. In Chapter 12 we discuss GLMs with random effects, and in Section 12.3.4 we fit such a model to Table 6.9.
6.4  USING MODELS TO IMPROVE INFERENTIAL POWER

In Section 3.4 we showed that when contingency tables have ordered categories, tests that utilize the ordering can have improved power. Testing independence against a linear trend alternative in a linear logit model (Sections 5.3.4 and 5.4.6) is one way to do this. In this section we present the reason for these power improvements.
6.4.1  Directed Alternatives

Consider an I × 2 contingency table for I binomial variates with parameters {π_i}. H_0: independence states that

    logit(π_i) = α.

The ordinary X² and G² statistics of Section 3.2.1 refer to the general alternative,

    logit(π_i) = α + β_i,

which is saturated. They test H_0: β_1 = β_2 = ··· = β_I = 0 in that model, with df = I − 1. Their general alternative treats both classifications as nominal. Denote these test statistics by G²(I) and X²(I). Recall that G²(I) is the likelihood-ratio statistic G²(M_0 | M_1) = −2(L_0 − L_1) for comparing the saturated model M_1 with the independence (I) model M_0.
Ordinal test statistics refer to narrower, usually more relevant, alternatives. With ordered rows, an example is a test of H_0: β = 0 in the linear logit model, logit(π_i) = α + βx_i. The likelihood-ratio statistic G²(I | L) = G²(I) − G²(L) compares the linear logit model and the independence model. When a test statistic focuses on a single parameter, such as β in that model, it has df = 1. Now, df equals the mean of the chi-squared distribution. A large test statistic with df = 1 falls farther out in its right-hand tail than a comparable value of X²(I) or G²(I) with df = I − 1. Thus, it has a smaller P-value.
6.4.2  Noncentral Chi-Squared Distribution

To compare the power of G²(I | L) and G²(I), it is necessary to compare their nonnull sampling distributions. When H_0 is false, their distributions are approximately noncentral chi-squared. This distribution, introduced by R. A. Fisher in 1928, arises from the following construction: if Z_i ~ N(μ_i, 1), i = 1, ..., ν, and Z_1, ..., Z_ν are independent, then ΣZ_i² has the noncentral chi-squared distribution with df = ν and noncentrality parameter λ = Σμ_i². Its mean is ν + λ and its variance is 2(ν + 2λ). The ordinary (central) chi-squared distribution, which occurs when H_0 is true, has λ = 0.

Let X²_{ν,λ} denote a noncentral chi-squared random variable with df = ν and noncentrality λ. A fundamental result for chi-squared analyses is that, for fixed λ,

    P[X²_{ν,λ} > χ²_ν(α)] increases as ν decreases.

That is, the power for rejecting H_0 at a fixed α-level increases as the df of the test decreases (e.g., Das Gupta and Perlman 1974). For fixed ν, the power equals α when λ = 0, and it increases as λ increases. The inverse relation between power and df suggests that focusing the noncentrality on a statistic having a small df value can improve power.
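The tail probability P[X²_{ν,λ} > χ²_ν(α)] is available in standard software. A sketch using SciPy's noncentral chi-squared distribution (assuming SciPy is installed; the function name is ours):

```python
from scipy.stats import chi2, ncx2

def chi_squared_power(lam, df, alpha=0.05):
    """Power P[X^2_{df,lam} > chi^2_df(alpha)] of a chi-squared test."""
    critical = chi2.ppf(1 - alpha, df)    # central critical value chi^2_df(alpha)
    return ncx2.sf(critical, df, lam)     # noncentral upper-tail probability

# Power rises with lam and falls with df (cf. Table 6.12 in Section 6.5.4).
print(f"{chi_squared_power(1.0, 1):.3f}")    # 0.170
print(f"{chi_squared_power(1.0, 7):.3f}")
print(f"{chi_squared_power(10.0, 1):.3f}")   # 0.885
```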
6.4.3  Increased Power for Narrower Alternatives

Suppose that X has, at least approximately, a linear effect on logit[P(Y = 1)]. To test independence, it is then sensible to use a statistic having strong power for that effect. This is the purpose of the tests based on the linear logit model, using the likelihood-ratio statistic G²(I | L), the Wald statistic z = β̂/SE, and the Cochran–Armitage (score) statistic.

When is G²(I | L) more powerful than G²(I)? The statistics satisfy

    G²(I) = G²(I | L) + G²(L),

where G²(L) tests the goodness of fit of the linear logit model. When the linear logit model holds, G²(L) has an asymptotic chi-squared distribution with df = I − 2; then if β ≠ 0, G²(I) and G²(I | L) both have approximate noncentral chi-squared distributions with the same noncentrality. Whereas df = I − 1 for G²(I), df = 1 for G²(I | L). Thus, G²(I | L) is more powerful, since it uses fewer degrees of freedom.
When the linear logit model does not hold, G²(I) has greater noncentrality than G²(I | L), the discrepancy increasing as the model fits more poorly. However, when the model approximates reality fairly well, usually G²(I | L) is still more powerful. That test's df value of 1 more than compensates for its loss in noncentrality. The closer the true relationship is to the linear logit, the more nearly G²(I | L) captures the same noncentrality as G²(I), and the more powerful it is compared to G²(I). To illustrate, Figure 6.4 plots power as a function of noncentrality when df = 1 and 7. When the noncentrality of a test having df = 1 is at least about half that of a test having df = 7, the test with df = 1 is more powerful. The linear logit model then helps detect a key component of an association. As Mantel (1963) argued in a similar context, "that a linear regression is being tested does not mean that an assumption of linearity is being made. Rather it is that test of a linear component of regression provides power for detecting any progressive association which may exist."

The improved power results from sacrificing power in other cases. The G²(I) test can have greater power than G²(I | L) when the linear logit model describes reality very poorly.
The remark about the desirability of focusing noncentrality holds for nominal variables also. For instance, consider testing conditional independence in 2 × 2 × K tables. One approach tests β = 0 in model (6.4), using df = 1. Another approach tests the goodness of fit of model (6.5), using df = K

FIGURE 6.4  Power and noncentrality, for df = 1 and df = 7, when α = 0.05.
TABLE 6.11  Change in Clinical Condition by Degree of Infiltration

                          Degree of Infiltration
Clinical Change             High       Low       Proportion High
Worse                         1         11            0.08
Stationary                   13         53            0.20
Slight improvement           16         42            0.28
Moderate improvement         15         27            0.36
Marked improvement            7         11            0.39

Source: Reprinted with permission from the Biometric Society (Cochran 1954).
(Section 6.3.1). When model (6.4) holds, both tests have the same noncentrality. Thus, the test of β = 0 is more powerful, since it has fewer degrees of freedom.
6.4.4  Treatment of Leprosy Example

Table 6.11 refers to an experiment on the use of sulfones and streptomycin drugs in the treatment of leprosy. The degree of infiltration at the start of the experiment measures a type of skin damage. The response is the change in the overall clinical condition of the patient after 48 weeks of treatment. We use response scores {−1, 0, 1, 2, 3}. The question of interest is whether subjects with high infiltration changed differently from those with low infiltration.

Here, the clinical change response variable is ordinal. It seems natural to compare the mean change for the two infiltration levels. Cochran (1954) and Yates (1948) noted that this analysis is identical to a trend test treating the binary variable as the response. That test is sensitive to linearity between clinical change and the proportion of cases with high infiltration.
The test G²(I) = 7.28 (df = 4) does not show much evidence of association (P = 0.12), but it ignores the row ordering. The sample proportion of high infiltration increases monotonically as the clinical change improves. The test of H_0: β = 0 in the linear logit model has G²(I | L) = 6.65 with df = 1 (P = 0.01). It gives strong evidence of more positive clinical change at the higher level of infiltration. Using the ordering by decreasing df from 4 to 1 pays a strong dividend. In addition, G²(L) = 0.63 with df = 3 suggests that the linear trend model fits well.
6.4.5  Model Smoothing Improves Precision of Estimation

Using directed alternatives can improve not only test power but also estimation of cell probabilities and summary measures. In generic form, let π denote the true cell probabilities in a contingency table, let p denote the sample proportions, and let π̂ denote model-based ML estimates of π.

When π satisfy a certain model, both π̂ for that model and p are consistent estimators of π. The model-based estimator π̂ is better, as its true asymptotic standard error cannot exceed that of p. This happens because of model parsimony: the unsaturated model, on which π̂ is based, has fewer parameters than the saturated model, on which p is based. In fact, model-based estimators are also more efficient in estimating functions g(π) of cell probabilities. For any differentiable function g,

    asymp. var[√n g(π̂)] ≤ asymp. var[√n g(p)].

In Section 14.2.2 we prove this result. It holds more generally than for categorical data models (Altham 1984). This is one reason that statisticians prefer parsimonious models.

In reality, of course, a chosen model is unlikely to hold exactly. However, when the model approximates π well, unless n is extremely large, π̂ is still better than p. Although π̂_i is biased, it has smaller variance than p_i, and MSE(π̂_i) < MSE(p_i) when its variance plus squared bias is smaller than var(p_i). In Section 3.3.7 we showed that in two-way tables, independence-model estimates of cell probabilities can be better than sample proportions even when that model does not hold.
6.5  SAMPLE SIZE AND POWER CONSIDERATIONS*
In any statistical procedure, the sample size n influences the results. Strong
effects are likely to be detected even when n is small. By contrast, detection
of weak effects requires large n. A study design should reflect the sample size
needed to provide good power for detecting the effect.
6.5.1  Sample Size and Power for Comparing Two Proportions
For test statistics having large-sample normal distributions, power calculations can use ordinary methods. To illustrate, consider a test comparing binomial parameters π_1 and π_2 for two medical treatments. An experiment plans independent samples of size n_i = n/2 receiving each treatment. The researchers expect π_i ≈ 0.6 for each, and a difference of at least 0.10 is important. In testing H_0: π_1 = π_2, the variance of the difference π̂_1 − π̂_2 in sample proportions is π_1(1 − π_1)/(n/2) + π_2(1 − π_2)/(n/2) ≈ 0.6 × 0.4 × (4/n) = 0.96/n. In particular,

    z = [(π̂_1 − π̂_2) − (π_1 − π_2)] / (0.96/n)^{1/2}

has approximately a standard normal distribution for π_1 and π_2 near 0.6.
The power of an α-level test of H_0 is approximately

    P[ |π̂_1 − π̂_2| / (0.96/n)^{1/2} ≥ z_{α/2} ].

When π_1 − π_2 = 0.10, for α = 0.05, this equals

    P[ ((π̂_1 − π̂_2) − 0.10) / (0.96/n)^{1/2} > 1.96 − 0.10(n/0.96)^{1/2} ]
        + P[ ((π̂_1 − π̂_2) − 0.10) / (0.96/n)^{1/2} < −1.96 − 0.10(n/0.96)^{1/2} ]
    = P[ z > 1.96 − 0.10(n/0.96)^{1/2} ] + P[ z < −1.96 − 0.10(n/0.96)^{1/2} ]
    = 1 − Φ[1.96 − 0.10(n/0.96)^{1/2}] + Φ[−1.96 − 0.10(n/0.96)^{1/2}],

where Φ is the standard normal cdf. The power is approximately 0.11 when n = 50 and 0.30 when n = 200. It is not easy to attain significance when effects are small and the sample is not very large. Figure 6.5 shows how the power increases in n when π_1 − π_2 = 0.1. By contrast, it shows how the power improves when π_1 − π_2 = 0.2.
For a given P(type I error) = α and P(type II error) = β (and hence power = 1 − β), one can determine the sample size needed to attain those values.

FIGURE 6.5  Approximate power for testing equality of proportions, with true values near middle of range and α = 0.05.

A study using n_1 = n_2 requires approximately

    n_1 = n_2 = (z_{α/2} + z_β)² [π_1(1 − π_1) + π_2(1 − π_2)] / (π_1 − π_2)².

For a test with α = 0.05 and β = 0.10 when π_1 and π_2 are truly about 0.60 and 0.70, n_1 = n_2 = 473. This formula also provides the sample sizes needed for a comparable confidence interval for π_1 − π_2. With about 473 subjects in each group, a 95% confidence interval has only a 0.10 chance of containing 0 when actually π_1 = 0.60 and π_2 = 0.70.

This sample-size formula is approximate and may slightly underestimate the actual values required. It is adequate for most practical work, though, in which only rough conjectures are available for π_1 and π_2. Fleiss (1981) showed more precise formulas.
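The power approximation and sample-size formula above need only the standard normal cdf and quantiles. A minimal sketch using Python's standard library (function names are ours):

```python
from math import ceil, sqrt
from statistics import NormalDist

Phi = NormalDist().cdf                 # standard normal cdf
z = NormalDist().inv_cdf               # standard normal quantiles

def power_two_props(n, diff, var_total=0.96, alpha=0.05):
    """Approximate power of the alpha-level test of H0: pi1 = pi2 when
    pi1 - pi2 = diff, with n/2 per group; var_total/n is the variance
    of the difference (0.96/n for proportions near 0.6)."""
    shift = diff * sqrt(n / var_total)
    za = z(1 - alpha / 2)
    return (1 - Phi(za - shift)) + Phi(-za - shift)

def n_per_group(pi1, pi2, alpha=0.05, beta=0.10):
    """Sample size per group for comparing two proportions."""
    num = ((z(1 - alpha / 2) + z(1 - beta)) ** 2
           * (pi1 * (1 - pi1) + pi2 * (1 - pi2)))
    return ceil(num / (pi1 - pi2) ** 2)

print(f"{power_two_props(50, 0.10):.2f}")    # 0.11
print(f"{power_two_props(200, 0.10):.2f}")   # 0.30
print(n_per_group(0.60, 0.70))               # 473
```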
6.5.2  Sample Size Determination in Logistic Regression

Consider now the model logit[π(x_i)] = α + γx_i, i = 1, ..., n, in which x is quantitative. [We use γ for the effect so as not to confuse it with β = P(type II error).] The sample size needed to achieve a certain power for testing H_0: γ = 0 depends on the variance of γ̂. This depends on {π(x_i)}, and formulas for n use a guess for π̃ = π(x̄) and the distribution of X. The effect size is the log odds ratio λ comparing π(x̄) to π(x̄ + s_x), the probability for a standard deviation above the mean of x. For a one-sided test when X is approximately normal, Hsieh (1989) derived

    n = [z_α + z_β exp(−λ²/4)]² (1 + 2π̃δ) / (π̃λ²),

where

    δ = [1 + (1 + λ²) exp(5λ²/4)] / [1 + exp(−λ²/4)].

The value n decreases as π̃ → 0.5 and as |λ| increases.

We illustrate for modeling the effect of x = cholesterol level on the probability of severe heart disease, for a population for which that probability at an average level of cholesterol is about 0.08. Researchers want the test to be sensitive to a 50% increase in this probability for a standard deviation increase in cholesterol. The odds of severe heart disease at the mean cholesterol level equal 0.08/0.92 = 0.087, and the odds one standard deviation above the mean equal 0.12/0.88 = 0.136. The odds ratio equals 0.136/0.087 = 1.57, and λ = log(1.57) = 0.450. For α = 0.05 and β = 0.10, δ = 1.306 and n = 612.
6.5.3  Sample Size in Multiple Logistic Regression

A multiple logistic regression model requires larger n to detect effects. Let R denote the multiple correlation between the predictor X of interest and the others in the model. The formula for n above divides by (1 − R²). In that formula, π̃ is evaluated at the mean of all the explanatory variables, and the odds ratio refers to the effect of X at the mean level of the other predictors. Consider the example in Section 6.5.2 when blood pressure is also a predictor. If the correlation between cholesterol and blood pressure is 0.40, we need n ≈ 612/[1 − (0.40)²] = 729.

These formulas provide, at best, rough indications of sample size. Most applications have only a crude guess for π̃ and R, and X may be far from normally distributed. For other work on this problem, see Hsieh et al. (1998) and Whittemore (1981).
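Hsieh's formula, including the (1 − R²) adjustment of this section, is easy to evaluate. A minimal sketch reproducing the two illustrations (function name is ours):

```python
from math import ceil, exp
from statistics import NormalDist

def hsieh_n(pi_bar, lam, alpha=0.05, beta=0.10, R=0.0):
    """Hsieh (1989) sample size for a one-sided test of H0: gamma = 0,
    with X approximately normal.  pi_bar: P(Y=1) at the mean of x;
    lam: log odds ratio for a one-SD increase in x; R: multiple
    correlation of X with the other predictors (0 if none)."""
    z = NormalDist().inv_cdf
    za, zb = z(1 - alpha), z(1 - beta)
    delta = ((1 + (1 + lam ** 2) * exp(5 * lam ** 2 / 4))
             / (1 + exp(-lam ** 2 / 4)))
    n = ((za + zb * exp(-lam ** 2 / 4)) ** 2
         * (1 + 2 * pi_bar * delta) / (pi_bar * lam ** 2))
    return ceil(n / (1 - R ** 2))

lam = 0.450                            # log(1.57), as in the text
print(hsieh_n(0.08, lam))              # 612
print(hsieh_n(0.08, lam, R=0.40))      # 729
```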
6.5.4  Power for Chi-Squared Tests in Contingency Tables

When hypotheses are false, squared normal, X², and G² statistics have large-sample noncentral chi-squared distributions (Section 6.4.2). Suppose that H_0 is equivalent to model M for a contingency table. Let π_i denote the true probability in cell i, and let π_i(M) denote the value to which the ML estimate π̂_i for model M converges, where Σπ_i = Σπ_i(M) = 1. For a multinomial sample of size n, the noncentrality parameter for X² equals

    λ = n Σ_i [π_i − π_i(M)]² / π_i(M).        (6.8)

This has the same form as X², with π_i in place of the sample proportion p_i and π_i(M) in place of π̂_i. The noncentrality parameter for G² equals

    λ = 2n Σ_i π_i log[π_i / π_i(M)].        (6.9)
TABLE 6.12  Power of Chi-Squared Test for α = 0.05

       Noncentrality λ
df   0.0   0.2   0.4   0.6   0.8   1.0   2.0   3.0   4.0   5.0   7.0  10.0  15.0  25.0
 1  .050  .073  .097  .121  .146  .170  .293  .410  .516  .609  .754  .885  .972  .998
 2  .050  .065  .081  .098  .115  .133  .226  .322  .415  .504  .655  .815  .944  .996
 3  .050  .062  .075  .088  .102  .116  .192  .275  .358  .440  .590  .761  .917  .993
 4  .050  .060  .071  .082  .093  .106  .172  .244  .320  .396  .540  .716  .891  .989
 6  .050  .058  .066  .075  .084  .094  .146  .206  .270  .336  .468  .644  .843  .980
 8  .050  .057  .064  .071  .079  .087  .131  .182  .238  .296  .417  .588  .799  .968
10  .050  .056  .062  .068  .075  .082  .121  .166  .215  .268  .379  .542  .760  .956
20  .050  .053  .056  .060  .063  .066  .096  .125  .158  .193  .273  .402  .611  .883
50  .050  .052  .054  .056  .059  .061  .076  .092  .110  .129  .173  .250  .398  .687

Source: Reprinted with permission from G. E. Haynam, Z. Govindarajulu, and F. C. Leone, in Selected Tables in Mathematical Statistics, eds. H. L. Harter and D. B. Owen (Chicago: Markham, 1970).
When H_0 is true, all π_i = π_i(M). Then, for either statistic, λ = 0 and the central chi-squared distribution applies.

To determine the approximate power for a chi-squared test with df = ν: (1) choose a hypothetical set of true values {π_i}; (2) calculate {π_i(M)} by fitting to {π_i} the model M for H_0; (3) calculate the noncentrality parameter λ; and (4) calculate P[X²_{ν,λ} > χ²_ν(α)]. Table 6.12 shows an excerpt from a table of noncentral chi-squared probabilities for step 4 with α = 0.05.
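Steps 1 to 3 are simple arithmetic once {π_i} and {π_i(M)} are in hand. A minimal sketch of the noncentrality calculations (6.8) and (6.9), using a hypothetical 2 × 2 independence example rather than data from the text:

```python
from math import log

def noncentrality_x2_g2(pi, pi_m, n):
    """Noncentrality parameters (6.8) for X^2 and (6.9) for G^2, given
    true cell probabilities pi and model-M limits pi_m (each sums to 1)."""
    lam_x2 = n * sum((p - m) ** 2 / m for p, m in zip(pi, pi_m))
    lam_g2 = 2 * n * sum(p * log(p / m) for p, m in zip(pi, pi_m) if p > 0)
    return lam_x2, lam_g2

# Hypothetical 2x2 table with uniform 0.5 margins; the independence
# model M then converges to pi_m = 0.25 in each cell.
pi = [0.30, 0.20, 0.20, 0.30]
pi_m = [0.25, 0.25, 0.25, 0.25]
lam_x2, lam_g2 = noncentrality_x2_g2(pi, pi_m, n=100)
print(round(lam_x2, 2), round(lam_g2, 2))   # 4.0 4.03
```

With df = 1 and λ ≈ 4, Table 6.12 gives power of about 0.52 at α = 0.05.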
6.5.5  Power for Testing Conditional Independence

We use an example based on one in O'Brien (1986). A standard fetal heart rate monitoring test predicts whether a fetus will require nonroutine care following delivery. The standard test has categories (worrisome, reassuring). The response Y is whether the newborn required some nonroutine medical care during the first week after birth (1 = yes, 0 = no). A new fetal heart rate monitoring test is developed, having categories (very worrisome, somewhat worrisome, reassuring). A physician plans to study whether this new test can help make predictions about the outcome; that is, given the result of the standard test, is there an association between the response and the result of the new test? A relevant statistic tests the effect of the new monitoring test in the logit model having the new test (N) and standard test (S) as qualitative predictors.

To help select n, a statistician asks the physician to conjecture about the joint distribution of the explanatory variables, with questions such as "What proportion of the cases do you think will be scored 'reassuring' by both tests?" For each NS combination, the physician also guessed P(Y = 1). Table 6.13 shows one scenario for marginal and conditional probabilities. These yield a joint distribution {π_ijk} from their product, such as 0.04 × 0.40 = 0.016 for the proportion of cases judged worrisome by the standard test and very worrisome by the new test and requiring nonroutine medical care. These joint probabilities yield fitted probabilities π(M_0) and π(M_1) for the null and alternative logit models. (One can get these by entering {π_ijk} in
TABLE 6.13 Scenario for Power Computation

Standard     New                  Joint Probability   P(nonroutine care)
Worrisome    Very worrisome       0.04                0.40
             Somewhat worrisome   0.08                0.32
             Reassuring           0.04                0.27
Reassuring   Very worrisome       0.02                0.30
             Somewhat worrisome   0.18                0.22
             Reassuring           0.64                0.15

Source: Reprinted with permission from O'Brien (1986).
percentage form as counts in software for logistic regression, fit the
relevant model, and divide the fitted counts by 100 to get the fitted joint
probabilities.)
The likelihood-ratio test comparing these models has noncentrality (6.9) with
π(M_1) playing the role of π and π(M_0) playing the role of π(M).
For the scenario in Table 6.13, the noncentrality equals 0.00816n, with
df = 2. For n = 400, 600, and 1000, the approximate powers when α = 0.05
are 0.35, 0.49, and 0.73. This scenario predicts 64% of the observations to
occur at only one combination of the factors. The lack of dispersion for the
factors weakens the power.
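The power values quoted for this scenario can be reproduced with standard noncentral chi-squared routines. A minimal sketch in Python using SciPy, taking the noncentrality 0.00816n and df = 2 from the text:

```python
from scipy.stats import chi2, ncx2

alpha, df = 0.05, 2
crit = chi2.ppf(1 - alpha, df)        # chi-squared critical value (5.99 for df = 2)

for n in (400, 600, 1000):
    lam = 0.00816 * n                 # noncentrality for this scenario
    power = ncx2.sf(crit, df, lam)    # P[chi^2(df, lam) > crit], step 4 of the recipe
    print(n, round(power, 2))
```

The same two lines (a central quantile, then a noncentral tail probability) implement steps 3 and 4 of the general power recipe for any df and noncentrality.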
6.5.6 Effects of Sample Size on Model Selection and Inference
The effects of sample size suggest some cautions for model selection. For
small n, the most parsimonious model accepted in a goodness-of-fit test may
be quite simple. By contrast, larger samples usually require more complex
models to pass goodness-of-fit tests. Then, some effects that are statistically
significant may be weak and substantively unimportant. With large n it may
be adequate to use a model that is simpler than models that pass
goodness-of-fit tests. An analysis that focuses solely on goodness-of-fit
tests is incomplete. It is also necessary to estimate model parameters and
describe strengths of effects.
These remarks merely reflect limitations of significance testing. Null
hypotheses are rarely true. With large enough n, they will be rejected. A
more relevant concern is whether the difference between true parameter
values and null hypothesis values is sufficient to be important. Many
methodologists overemphasize testing and underutilize estimation methods such
as confidence intervals. When the P-value is small, a confidence interval
specifies the extent to which H_0 may be false, thus helping us determine
whether rejecting it has practical importance. When the P-value is not small,
the confidence interval indicates whether some plausible parameter values are
far from H_0. A wide confidence interval containing the H_0 value indicates
that the test had weak power at important alternatives.
6.6 PROBIT AND COMPLEMENTARY LOG-LOG MODELS*
For binary responses, in this section we discuss two alternatives to logit
models. Like the logit model, these models have form (4.8),

    π(x) = Φ(α + βx)        (6.10)

for a continuous cdf Φ. The following argument motivates this class.
6.6.1 Tolerance Motivation for Binary Response Models
In toxicology, binary response models describe the effect of dosage of a toxin
on whether a subject dies. The tolerance distribution provides justification for
model (6.10). Let x denote the dosage level. For a randomly selected subject,
let Y = 1 if the subject dies. Suppose that the subject has tolerance T for
the dosage, with (Y = 1) equivalent to (T ≤ x). For instance, an insect
survives if the dosage x is less than T and dies if the dosage is at least T.
Tolerances vary among subjects; let F(t) = P(T ≤ t). For fixed dosage x, the
probability a randomly selected subject dies is

    π(x) = P(Y = 1 | X = x) = P(T ≤ x) = F(x).

That is, the appropriate binary model is the one having the shape of the cdf
F of the tolerance distribution. Let Φ denote the standard cdf for the family
to which F belongs. A common standardization uses the mean μ and standard
deviation σ of T, so that

    π(x) = F(x) = Φ[(x − μ)/σ].

Then the model has form π(x) = Φ(α + βx).
6.6.2 Probit Models
Toxicological experiments often measure dosage as the log concentration
(Bliss 1935). Often, the tolerance distribution for the dosage is
approximately N(μ, σ²) for unknown μ and σ. If F is the N(μ, σ²) cdf, then
π(x) has the form π(x) = Φ(α + βx), where Φ is the standard normal cdf,
α = −μ/σ, and β = 1/σ. In GLM form,

    Φ⁻¹[π(x)] = α + βx        (6.11)

is the probit model. The probit link function is Φ⁻¹(·). Whereas the cdf maps
the real line onto the (0, 1) probability scale, the inverse cdf maps the
(0, 1) scale for π(x) onto the real-line values for linear predictors in
binary response models.
The response curve for π(x) [or for 1 − π(x), when β < 0] has the
appearance of the normal cdf with mean μ = −α/β and standard deviation
σ = 1/|β|. Since 68% of the normal density falls within a standard deviation
of the mean, 1/|β| is the distance between the x values where π(x) = 0.16 or
0.84 and where π(x) = 0.50. The rate of change in π(x) is
∂π(x)/∂x = βφ(α + βx), where φ(·) is the standard normal density function.
The rate is highest when α + βx = 0 (i.e., at x = −α/β), where it equals
β/(2π)^{1/2} = 0.40β (for π = 3.14...). At that point, π(x) = 1/2.
By comparison, in logistic regression with parameter β, the curve for π(x)
is a logistic cdf with standard deviation π/(|β|√3). Its rate of change in
π(x) at x = −α/β is 0.25β. The rates of change where π(x) = 1/2 are the same
for the cdf's corresponding to the probit and logistic curves when the
logistic β is 0.40/0.25 = 1.6 times the probit β. The standard deviations are
the same when the logistic β is π/√3 = 1.8 times the probit β. When both
models fit well, parameter estimates in logistic regression are about 1.6 to 1.8
times those in probit models.
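The two scaling constants relating logistic and probit coefficients are easy to verify numerically; a small sketch in plain Python, using only the densities just described:

```python
import math

# maximum slopes of pi(x), attained where pi(x) = 1/2, per unit beta
probit_peak = 1 / math.sqrt(2 * math.pi)   # standard normal density at 0, ~0.40
logit_peak = 0.25                          # standard logistic density at 0

print(probit_peak / logit_peak)            # betas matching peak slopes: ~1.6
print(math.pi / math.sqrt(3))              # betas matching standard deviations: ~1.81
```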
The likelihood equations that (4.24) showed for binomial regression models
apply to probit models (see also Problem 6.32). One can solve them using
the Fisher scoring algorithm for GLMs (Bliss 1935; Fisher 1935b).
Newton-Raphson yields the same ML estimates but slightly different standard
errors. For the information matrix inverted to obtain the asymptotic
covariance matrix, Newton-Raphson uses observed information, whereas
Fisher scoring uses expected information. These differ for binary links other
than the logit.
6.6.3 Beetle Mortality Example
Table 6.14 reports the number of beetles killed after 5 hours of exposure to
gaseous carbon disulfide at various concentrations. Figure 6.6 plots (as
dots) the proportion killed against the log concentration. The proportion
jumps up at about x = 1.8, and it is close to 1 above there.
The ML fit of the probit model is

    Φ⁻¹[π̂(x)] = −34.96 + 19.74x.

For this fit, π̂(x) = 0.5 at x = 34.96/19.74 = 1.77. The fit corresponds to a
normal tolerance distribution with μ = 1.77 and σ = 1/19.74 = 0.05. The
curve for π̂(x) is that of a N(1.77, 0.05²) cdf.
At dosage x_i with n_i beetles, n_i π̂(x_i) is the fitted count for death,
i = 1, ..., 8. Table 6.14 reports the fitted values and Figure 6.6 shows the
fit. The table also shows fitted values for the linear logit model. These
models fit similarly and rather poorly. The G² goodness-of-fit statistic
equals 11.1 for the logit model and 10.0 for the probit model, with df = 6.
TABLE 6.14 Beetles Killed after Exposure to Carbon Disulfide

                                              Fitted Values
Log     Number of   Number   -------------------------------------
Dose    Beetles     Killed   Comp. Log-Log   Probit   Logit
1.691   59          6        5.7             3.4      3.5
1.724   60          13       11.3            10.7     9.8
1.755   62          18       20.9            23.4     22.4
1.784   56          28       30.3            33.8     33.9
1.811   63          52       47.7            49.6     50.0
1.837   59          53       54.2            53.4     53.3
1.861   62          61       61.1            59.7     59.2
1.884   60          60       59.9            59.2     58.8

Source: Data reprinted with permission from Bliss (1935).
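The probit fit reported above can be reproduced by maximizing the binomial log likelihood directly. A sketch in Python with SciPy, using the data of Table 6.14 (the optimizer and starting values are arbitrary choices, not from the text):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Table 6.14: log dose, number of beetles, number killed
x = np.array([1.691, 1.724, 1.755, 1.784, 1.811, 1.837, 1.861, 1.884])
n = np.array([59, 60, 62, 56, 63, 59, 62, 60])
y = np.array([6, 13, 18, 28, 52, 53, 61, 60])

def neg_loglik(theta):
    a, b = theta
    # probit model: pi(x) = Phi(a + b x); clip to keep the logs finite
    p = np.clip(norm.cdf(a + b * x), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (n - y) * np.log(1 - p))

fit = minimize(neg_loglik, x0=[-30.0, 17.0], method="Nelder-Mead")
alpha_hat, beta_hat = fit.x   # close to the reported -34.96 and 19.74
```

Since the probit log likelihood is concave (see Note 6.7), any reasonable optimizer reaches the same maximum; Fisher scoring, as in GLM software, would converge in a few iterations.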
FIGURE 6.6 Proportion of beetles killed versus log dosage, with fits of probit and complementary log-log models.
6.6.4 Complementary Log-Log Link Models
The logit and probit links are symmetric about 0.5, in the sense that

    link[π(x)] = −link[1 − π(x)].

To illustrate,

    logit[π(x)] = log{π(x)/[1 − π(x)]}
                = −log{[1 − π(x)]/π(x)} = −logit[1 − π(x)].

This means that the response curve for π(x) has a symmetric appearance
about the point where π(x) = 0.5, so π(x) approaches 0 at the same rate it
approaches 1. Logit and probit models are inappropriate when this is badly
violated.
The response curve

    π(x) = 1 − exp[−exp(α + βx)]        (6.12)

has the shape shown in Figure 6.7. It is asymmetric, with π(x) approaching 0
fairly slowly but approaching 1 quite sharply. For this model,

    log{−log[1 − π(x)]} = α + βx.

The link for this GLM is called the complementary log-log link, since the
log-log link applies to the complement of π(x).
FIGURE 6.7 Model with complementary log-log link.
To interpret model (6.12), we note that at x_1 and x_2,

    log{−log[1 − π(x_2)]} − log{−log[1 − π(x_1)]} = β(x_2 − x_1),

so that

    log[1 − π(x_2)] / log[1 − π(x_1)] = exp[β(x_2 − x_1)]

and

    1 − π(x_2) = [1 − π(x_1)]^{exp[β(x_2 − x_1)]}.

For x_2 − x_1 = 1, the complement probability at x_2 equals the complement
probability at x_1 raised to the power exp(β).
A model related to (6.12) is

    π(x) = exp[−exp(α + βx)].        (6.13)

For it, π(x) approaches 0 sharply but approaches 1 slowly. As x increases,
the curve is monotone decreasing when β > 0 and monotone increasing when
β < 0. In GLM form it uses the log-log link

    log{−log[π(x)]} = α + βx.

When the complementary log-log model holds for the probability of a
success, the log-log model holds for the probability of a failure.
Model (6.13) with log-log link is the special case of (6.10) with the cdf of
the extreme value (or Gumbel) distribution. The cdf equals

    F(x) = exp{−exp[−(x − a)/b]}
for parameters b > 0 and −∞ < a < ∞. It has mean a + 0.577b and
standard deviation bπ/√6. Models with log-log links can be fitted using the
Fisher scoring algorithm for GLMs.
6.6.5 Beetle Mortality Example Revisited
For the beetle mortality data (Table 6.14), the complementary log-log model
has ML estimates α̂ = −39.52 and β̂ = 22.01. At dosage x = 1.7, the fitted
probability of survival is
1 − π̂(x) = exp{−exp[−39.52 + 22.01(1.7)]} = 0.885, whereas at x = 1.8 it is
0.332 and at x = 1.9 it is 5 × 10⁻⁵. The probability of survival at dosage
x + 0.1 equals the probability at dosage x raised to the power
exp(22.01 × 0.1) = 9.03. For instance, 0.332 = (0.885)^9.03.
Table 6.14 shows the fitted values and Figure 6.6 shows the fit. They are
close to the observed death counts (G² = 3.5, df = 6). The fit seems
adequate. Aranda-Ordaz (1981) and Stukel (1988) discussed these data further.
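The arithmetic in this example is easy to check directly; a few lines of plain Python, using only the reported estimates −39.52 and 22.01:

```python
import math

a, b = -39.52, 22.01

def survive(x):
    # 1 - pi(x) under the complementary log-log model (6.12)
    return math.exp(-math.exp(a + b * x))

print(round(survive(1.7), 3))      # 0.885
print(round(survive(1.8), 3))      # 0.332
print(round(math.exp(0.1 * b), 2)) # 9.03: survival at x + 0.1 is
                                   # survival at x raised to this power
```

The power relation holds exactly under the model: survive(1.7) ** exp(0.1 * b) reproduces survive(1.8) up to floating-point rounding.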
6.7 CONDITIONAL LOGISTIC REGRESSION AND EXACT DISTRIBUTIONS*
ML estimators of logistic model parameters work best when the sample size
n is large compared to the number of parameters. When n is small or when
the number of parameters grows as n does, improved inference results using
conditional maximum likelihood. In this section we present this approach and
in Section 10.2 apply it with matched case-control studies.
6.7.1 Conditional Likelihood
This conditional likelihood approach eliminates nuisance parameters by
conditioning on their sufficient statistics. This generalizes Fisher's method
for 2 × 2 tables (Section 3.5). The conditional likelihood refers to a
conditional distribution defined for potential samples that provide the same
information about the nuisance parameters as occurs in the observed sample.
We begin with a general exposition and then discuss special cases. Let y_i
denote the binary response for subject i, i = 1, ..., N. (For now, each y_i
refers to a single trial, so n_i = 1.) Let x_ij be the value of predictor j
for that subject, j = 1, ..., p. The model is

    P(Y_i = y_i) = exp[y_i(α + Σ_{j=1}^p β_j x_ij)]
                   / [1 + exp(α + Σ_{j=1}^p β_j x_ij)],        (6.14)

where substituting y_i = 1 gives the usual expression, such as (5.15). Here,
we explicitly separate the intercept from the coefficients of the p
predictors. For N independent observations,

    P(Y_1 = y_1, ..., Y_N = y_N)
        = exp[(Σ_i y_i)α + Σ_{j=1}^p (Σ_i y_i x_ij)β_j]
          / Π_i [1 + exp(α + Σ_{j=1}^p β_j x_ij)].        (6.15)
From this likelihood function, the sufficient statistic for β_j is
Σ_i y_i x_ij, j = 1, ..., p. The sufficient statistic for α is Σ_i y_i, the
total number of successes.
Usually, some parameters refer to effects of primary interest. Others may
be there to adjust for relevant effects, but their values are not of special
interest. We can eliminate the latter parameters from the likelihood by
conditioning on their sufficient statistics. We illustrate by eliminating α.
(In Section 10.2.5 we show that for models for matched case-control studies,
intercept terms cause difficulties with inference about the primary
parameters, so it can be helpful to eliminate them.) Since the sufficient
statistic for α is Σ_i y_i, we condition on Σ_i y_i. Suppose that
Σ_i y_i = t. Denote the conditional reference set of samples having the same
value of Σ_i y_i as observed by

    S(t) = {(y_1*, ..., y_N*): Σ_i y_i* = t}.
With {y_i} such that Σ_i y_i = t, the conditional likelihood function equals

    P(Y_1 = y_1, ..., Y_N = y_N | Σ_i y_i = t)
      = P(Y_1 = y_1, ..., Y_N = y_N) / Σ_{S(t)} P(Y_1 = y_1*, ..., Y_N = y_N*)
      = {exp[tα + Σ_{j=1}^p (Σ_i y_i x_ij)β_j]
             / Π_i [1 + exp(α + Σ_{j=1}^p β_j x_ij)]}
        / {Σ_{S(t)} exp[tα + Σ_{j=1}^p (Σ_i y_i* x_ij)β_j]
             / Π_i [1 + exp(α + Σ_{j=1}^p β_j x_ij)]}
      = exp[Σ_{j=1}^p (Σ_i y_i x_ij)β_j]
        / Σ_{S(t)} exp[Σ_{j=1}^p (Σ_i y_i* x_ij)β_j].

This does not depend on α.
A conditional likelihood is used just like an ordinary likelihood. For the
parameters in it, their conditional ML estimates are the values maximizing it.
Calculated using iterative methods, the estimators are asymptotically normal
with covariance matrix equal to the negative inverse of the matrix of second
partial derivatives of the conditional log likelihood.
6.7.2 Small-Sample Conditional Inference for Logistic Regression
For small samples, inference for a parameter uses the conditional
distribution after eliminating all other parameters. With it, one can
calculate probabilities such as P-values exactly rather than with crude
approximations (Cox 1970).
For instance, suppose that inference focuses on β_p in model (6.14). To
eliminate the other parameters, we condition on their sufficient statistics
T_j = Σ_i y_i x_ij, j = 0, ..., p − 1 (where x_i0 = 1). With an argument like
that
just shown, one obtains the conditional distribution

    P(Y_1 = y_1, ..., Y_N = y_N | T_j = t_j, j = 0, ..., p − 1)
      = exp[(Σ_i y_i x_ip)β_p]
        / Σ_{S(t_0, ..., t_{p−1})} exp[(Σ_i y_i* x_ip)β_p]
      = exp(t_p β_p) / Σ_{S(t_0, ..., t_{p−1})} exp(t_p* β_p),

where

    S(t_0, ..., t_{p−1})
      = {(y_1*, ..., y_N*): Σ_i y_i* x_ij = t_j, j = 0, ..., p − 1}.

This depends only on β_p. Inference for β_p uses the conditional distribution
of its sufficient statistic, T_p = Σ_i y_i x_ip, given the others. Let
c(t_0, ..., t_{p−1}, t) denote the number of data vectors in
S(t_0, ..., t_{p−1}) for which T_p = t. The conditional distribution of T_p is

    P(T_p = t | T_j = t_j, j = 0, ..., p − 1)
      = c(t_0, ..., t_{p−1}, t) exp(tβ_p)
        / Σ_u c(t_0, ..., t_{p−1}, u) exp(uβ_p),        (6.16)

where the denominator summation refers to the possible values u of T_p.
For testing H_0: β_p = 0, the conditional distribution simplifies. For
H_a: β_p > 0 and observed T_p = t_obs, the exact conditional P-value is

    Σ_{t ≥ t_obs} P(T_p = t | T_j = t_j, j = 0, ..., p − 1)
      = Σ_{t ≥ t_obs} c(t_0, ..., t_{p−1}, t) / Σ_u c(t_0, ..., t_{p−1}, u),

the proportion of data configurations in the conditional set that have the
sufficient statistic for β_p at least as large as observed. Implementing this
inference requires calculating {c(t_0, ..., t_{p−1}, u)}. For all but the
simplest problems, computations are intensive and require specialized
software (e.g., LogXact of Cytel Software or PROC LOGISTIC in SAS). In the
remainder of this section we consider special cases for small-sample
inference.
6.7.3 Small-Sample Conditional Inference for 2 × 2 Contingency Tables
First, consider logistic regression with a single predictor x,

    logit[P(Y_i = 1)] = α + βx_i,    i = 1, ..., N,        (6.17)

when x_i takes only two values. The model applies to 2 × 2 tables, where
x_i = 1 denotes row 1 and x_i = 0 denotes row 2. The sufficient statistic for
α is Σ_i y_i, which is the first column total. The sufficient statistic for β
is T = Σ_i y_i x_i, which simplifies to the number of successes in the first
row. Equivalently, the sufficient statistics for the model are the numbers of
successes in the two rows. Let s_1 and s_2 denote these binomial variates.
The row totals n_1 and n_2 are their indices.
To eliminate α, we condition on s = s_1 + s_2, the first column total. Since
N = n_1 + n_2 is fixed, so then is the other column marginal total. Fixing
both sets of marginal totals yields hypergeometric probabilities for s_1 that
depend only on β [see (3.20), identifying ψ = exp(β)]. In that case the
conditional distribution satisfies (6.16) with

    c(t_0, t) = (n_1 choose t)(N − n_1 choose t_0 − t)

and with t_0 = s and t = s_1. The resulting exact conditional test that β = 0
is Fisher's exact test for 2 × 2 tables (Section 3.5.1).
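This equivalence is easy to see computationally: the one-sided Fisher exact P-value is just a hypergeometric tail probability for s_1. A sketch with SciPy on a hypothetical 2 × 2 table (the counts are invented for illustration):

```python
from scipy.stats import fisher_exact, hypergeom

# hypothetical 2 x 2 table: rows are x = 1, 0; columns are success, failure
table = [[7, 3], [2, 8]]
_, p_one = fisher_exact(table, alternative="greater")

# same P-value from the hypergeometric distribution of s1, fixing
# the grand total N, first column total s, and row-1 total n1
N, s, n1 = 20, 9, 10
p_tail = hypergeom.sf(7 - 1, N, s, n1)   # P(s1 >= 7)
```

Both computations fix all four margins and sum P(s_1 = t) over t at least as large as observed, so p_one and p_tail agree to machine precision.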
6.7.4 Small-Sample Conditional Inference for Linear Logit Model
The linear logit model, logit(π_i) = α + βx_i, applies to I × 2 tables with
ordered rows. We discussed this model in Section 5.3.4. For it, the data
{y_i} are I independent bin(n_i, π_i) counts, with fixed row totals {n_i}.
Conditioning on s = Σ y_i, and hence the column totals, yields a conditional
likelihood free of α. Exact inference about β uses its sufficient statistic,
T = Σ x_i y_i. From (6.16) its distribution has the form

    P(T = t | Σ_i y_i = s; β) = c(s, t)e^{βt} / Σ_u c(s, u)e^{βu}.     (6.18)

Here, c(s, u) equals the sum of Π_i (n_i choose y_i) for all tables with the
given marginal totals that have T = u.
When β = 0, the cell counts have the multiple hypergeometric distribution
(3.19). To test this, ordering the tables with the given margins by T is
equivalent to ordering them by the Cochran-Armitage statistic (Section
5.3.5). Thus, this test for the linear logit model is an exact trend test.
In Section 5.3.5 we applied the Cochran-Armitage test to Table 5.3 on
maternal alcohol consumption and infant malformation. Even though
n = 32,573, the table is highly unbalanced, with both very small and very
large counts. It is safer to use small-sample methods. For the exact
conditional trend test with the same scores, the one-sided P-value for
H_a: β > 0 is 0.0168. The two-sided P-value is 0.0172, reflecting asymmetry
of the conditional distribution, given the marginal counts. This is not much
different from the two-sided P-value of 0.010 obtained with the large-sample
Cochran-Armitage test.
6.7.5 Small-Sample Tests of Conditional Independence in 2 × 2 × K Tables
For 2 × 2 × K tables {n_ijk}, the Cochran-Mantel-Haenszel test uses
Σ_k n_11k. For logit model (6.4), this is the sufficient statistic for β, the
effect of X. To conduct a small-sample test of β = 0, one needs to eliminate
the other model parameters. Constructing the likelihood reveals that the
sufficient statistics for {β_k^Z} are the column marginal totals {n_+jk} in
each partial table. When X and Z are predictors, it is natural to treat the
numbers of trials {n_i+k} at each combination of XZ values as fixed. Thus,
exact inference about β conditions on the row and column totals in each
stratum.
Conditional on the strata margins, an exact test uses Σ_k n_11k.
Hypergeometric probabilities occur in each partial table for the independent
null distributions of {n_11k, k = 1, ..., K}. The product of the K mass
functions gives the null joint distribution of {n_11k, k = 1, ..., K}. [This
is (6.19) below, setting ψ = 1.] This determines the null distribution of
Σ_k n_11k. For H_a: β > 0, the P-value is the null probability that Σ_k n_11k
is at least as large as observed, for the fixed strata marginal totals. Mehta
et al. (1985) presented a fast algorithm. The test simplifies to Fisher's
exact test when K = 1.
6.7.6 Promotion Discrimination Example
Table 6.15 refers to U.S. government computer specialists of similar
seniority considered for promotion. The table cross-classifies promotion
decision by employee's race, considered for three separate months. We test
conditional independence of promotion decision and race, or H_0: β = 0, in
model (6.4). The table contains several small counts. The overall sample size
is not small (n = 74), but one marginal count (collapsing over month of
decision) equals zero, so we might be wary of using the CMH test.
For H_a: β < 0 (i.e., odds ratio < 1), the probability of promotion was
lower for black employees than for white employees. For the margins of the
partial tables in Table 6.15, n_111 can range between 0 and 4, n_112 can
range between 0 and 4, and n_113 can range between 0 and 2. The total
Σ_k n_11k can range between 0 and 10. The sample data are the most extreme
possible
TABLE 6.15 Promotion Decisions by Race and by Month

          July Promotions   August Promotions   September Promotions
Race      Yes      No       Yes      No         Yes      No
Black     0        7        0        7          0        8
White     4        16       4        13         2        13

Source: J. Gastwirth, Statistical Reasoning in Law and Public Policy (San
Diego, CA: Academic Press, 1988), p. 266.
result in each case. The observed Σ_k n_11k = 0, and the P-value is the null
probability of this outcome. Software provides P = 0.026. A two-sided
P-value, based on summing the probabilities of all tables no more likely than
the observed table, equals 0.056.
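The one-sided P-value of 0.026 can be reproduced by hand, multiplying one hypergeometric probability per stratum; a sketch in Python, with the margins taken from Table 6.15:

```python
from math import comb

# per month: (black row total n1, promotions column total s, overall total N)
strata = [(7, 4, 27), (7, 4, 24), (8, 2, 23)]

# P(sum_k n11k = 0) = product over strata of P(n11k = 0),
# i.e., every promotion drawn from the white row in every month
p = 1.0
for n1, s, N in strata:
    p *= comb(N - n1, s) / comb(N, s)

print(round(p, 3))   # 0.026
```

Since 0 is the smallest possible value of Σ_k n_11k here, the corresponding one-sided mid-P-value would be half this probability, about 0.013.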
6.7.7 Exact Conditional Estimation and Comparison of Odds Ratios
For model (6.4) of homogeneous association in 2 × 2 × K tables, the
ordinary ML estimator of the odds ratio ψ = exp(β) behaves poorly under
sparse-data asymptotics. The conditional ML estimator maximizes the
conditional likelihood function after reducing the parameter space by
conditioning on sufficient statistics for the other parameters (Andersen
1970; Birch 1964b).
For cell counts {n_ijk}, given {n_i+k, n_+jk} for all k, the conditional
probability mass function that (n_111 = t_1, ..., n_11K = t_K) is the product
of the functions (3.20) from the separate strata, or

    Π_k P(n_11k = t_k | n_1+k, n_+1k, n_++k; ψ)
      = Π_k [(n_1+k choose t_k)(n_++k − n_1+k choose n_+1k − t_k) ψ^{t_k}]
            / [Σ_u (n_1+k choose u)(n_++k − n_1+k choose n_+1k − u) ψ^u].
                                                                    (6.19)
The conditional ML estimator ψ̂ maximizes (6.19). Like the Mantel-Haenszel
estimator ψ̂_MH, it has good properties for both standard and sparse-data
asymptotic cases (Andersen 1970; Breslow 1981), since the number of
parameters does not change as K does. It can be slightly more efficient than
ψ̂_MH, except when ψ = 1.0, where they are equally efficient, or for matched
pairs, where they are identical (Breslow 1981).
The conditional distribution (6.19) induces one for Σ_k n_11k, which is
used to test H_0: ψ = ψ_0 for an arbitrary value ψ_0. Then, a 95% confidence
interval for ψ consists of all ψ_0 for which the P-value exceeds 0.05. Such
an interval is guaranteed to have at least the nominal coverage probability
(Gart 1970; Kim and Agresti 1995; Mehta et al. 1985). This extends the
interval for a single 2 × 2 table (Section 3.6.1). For the promotion
discrimination case (Table 6.15), Σ_k n_11k = 0, so the lower bound of any
confidence interval for ψ should be 0. For the generalization to several
strata of Cornfield's tail-method interval, StatXact reports a 95% confidence
interval of (0, 1.01).
Zelen (1971) presented a small-sample test of homogeneity of the odds
ratios. See Agresti (1992) for discussion of this and other small-sample
methods for contingency tables.
TABLE 6.16 Example for Exact Conditional Logistic Regression

Cephalexin^a   Age^a   Length of Stay^a   Cases of Diarrhea   Sample Size
0              0       0                  0                   385
0              0       1                  5                   233
0              1       0                  3                   789
0              1       1                  47                  1081
1              1       1                  5                   5

^a See the text for an explanation of 0 and 1.
Source: Based on study by E. Jaffe and V. Chang, Cornell Medical Center,
reported in the Manual for LogXact (Cambridge, MA: CYTEL Software, 1999),
p. 259.
6.7.8 Diarrhea Example
The final example deals with a larger number of variables. Table 6.16 refers
to 2493 patients having stays in a hospital. The response is whether they
suffered an acute form of diarrhea during their stay. The three predictors
are age (1 for over 50 years old, 0 for under 50), length of stay in hospital
(1 for more than 1 week, 0 for less than 1 week), and exposure to an
antibiotic called Cephalexin (1 for yes, 0 for no). We discuss estimation of
the effect of Cephalexin, controlling for age and length of stay, using a
model containing only main-effect terms.
The sample size is large, yet relatively few cases of acute diarrhea
occurred. Moreover, all subjects having exposure to Cephalexin were also
diarrhea cases. Such boundary outcomes, in which none or all responses fall
in one category, cause infinite ML estimates of some model parameters. An
ML estimate of ∞ for the Cephalexin effect means that the likelihood
function increases continually as the parameter estimate for Cephalexin
increases indefinitely.
To study the Cephalexin effect, we use an exact distribution, conditioning
on sufficient statistics for the other predictors. Although the estimate of
the log-odds-ratio parameter for the effect of Cephalexin is infinite, it is
possible to construct a confidence interval by inverting the family of tests
for the parameter, using the conditional distribution. Doing this, a 95%
confidence interval is (19, ∞) for the odds ratio. Assuming that the
main-effects model is valid, Cephalexin appears to have a strong effect.
Similarly, P < 0.0001 for testing that the log odds ratio equals zero.
Results must be qualified somewhat because no Cephalexin cases occurred
at the first three combinations of levels of age and length of stay. In fact,
the first three rows of Table 6.16 make no contribution to the analysis
(Problem 6.18). The data actually provide evidence about the effect of
Cephalexin only for older subjects having a long stay.
6.7.9 Complications from Discreteness
Like Fisher's exact test, exact conditional inference for contingency tables
is conservative because of discreteness. This is especially true when n is
small or the data are unbalanced, with most observations falling in a single
column or row. Using mid-P-values, or P-values based on a finer partitioning
of the sample space (Note 3.9), in tests and related confidence intervals
reduces the conservativeness. For the promotion discrimination data (Table
6.15), we reported a 95% confidence interval for the common odds ratio of
(0, 1.01). Inverting exact tests of H_0: ψ = ψ_0 with the mid-P-value yields
the interval (0, 0.78). However, this approach cannot guarantee that the
actual coverage probability is bounded below by 0.95.
A particular problem occurs when no other set of {y_i*} values has the same
value of a given sufficient statistic Σ_i y_i x_ij as the observed data. In
that case the conditional distribution of the sufficient statistic for the
parameter of interest is degenerate. The P-value for the exact test then
equals 1.0. This commonly happens when at least one explanatory variable x_j
whose effect is conditioned out for the inference is continuous, with
unequally spaced observed values.
Finally, a limitation of the conditional approach is that it requires
reduced sufficient statistics for the nuisance parameters. These exist only
with GLMs that use the canonical link. Thus, for instance, the conditional
approach works for logit models but not probit models.
NOTES
Section 6.1: Strategies in Model Selection
6.1. A Bayesian argument motivates the Bayesian information criterion
BIC = [G² − (log n)(df)], an alternative to AIC. It takes sample size into
account. Compared to AIC, BIC gravitates less quickly toward more complex
models as n increases. For details and critiques, see Raftery (1986) and the
February 1999 issue of Sociological Methods and Research.
6.2. Tree-structured methods such as CART are alternatives to logistic
regression that formalize a decision process using a sequential set of
questions that branch in different directions depending on a subject's
responses. An example is deciding whether a subject with chest pains may be
suffering a heart attack. Zhang et al. (1998) surveyed such methods.
Section 6.2: Logistic Regression Diagnostics
6.3. For logistic regression diagnostics, see Copas (1988), Fowlkes (1987),
Hosmer and Lemeshow (2000, Chap. 5), Johnson (1985), Landwehr et al. (1984),
and Pregibon (1981). Separate diagnostics are useful for checking the
adequacy of each component of a GLM (McCullagh and Nelder 1989, Chap. 12).
For a family g(π; γ) of link functions indexed
by parameter γ, Pregibon (1980) showed how to estimate the γ giving the link
with best fit and how to check the adequacy of a given link g(π; γ_0).
6.4. Amemiya (1981), Efron (1978), Maddala (1983), and Zheng and Agresti
(2000), and references therein, reviewed R² measures for binary regression.
Hosmer and Lemeshow (2000, Sec. 5.2.3) discussed classification tables and
their limitations. Pepe (2000) and references therein surveyed ROC
methodology.
Section 6.3: Inference about Conditional Associations in 2 × 2 × K Tables
6.5. Analogs of ψ̂_MH summarize differences of proportions or relative risks
from several strata (Greenland and Robins 1985). Breslow and Day (1980,
p. 142) proposed an alternative large-sample test of homogeneity of odds
ratios. In each partial table let {μ̂_ijk} have the same marginals as the
data observed, yet have odds ratio equal to ψ̂_MH. Their test statistic has
the Pearson form comparing {n_ijk} to {μ̂_ijk}. Tarone (1985) showed that
because of the inefficiency of ψ̂_MH, one must adjust the Breslow-Day
statistic for it to have a limiting chi-squared null distribution with
df = K − 1. This adjustment is usually minor. Jones et al. (1989) reviewed
and compared several tests of homogeneity in sparse and nonsparse settings.
Other work on comparing odds ratios and estimating a common value includes
Breslow and Day (1980, Sec. 4.4), Donner and Hauck (1986), Gart (1970), and
Liang and Self (1985). For modeling the odds ratio, see Breslow (1976),
Breslow and Day (1980, Sec. 7.5), and Prentice (1976a). Breslow emphasized
retrospective studies, in which the conditional approach is natural since the
outcome totals are fixed.
Section 6.5: Sample Size and Power Considerations
6.6. For sample-size determination for comparing proportions, Fleiss (1981,
Sec. 3.2) provided tables. See Lachin (1977) for the I × J case. Chapman and
Meng (1966), Drost et al. (1989), Haberman (1974a, pp. 109-112), Harkness and
Katz (1964), Mitra (1958), and Patnaik (1949) derived theory for the
asymptotic nonnull behavior of chi-squared statistics; see also Section
14.3.5. O'Brien's (1986) simulation results suggested that the noncentral
chi-squared approximation for G² holds well for a wide range of powers. Read
and Cressie (1988, pp. 147-148) listed other articles that studied the
nonnull behavior of X² and G².
Section 6.6: Probit and Complementary Log-Log Models
6.7. Finney (1971) is the standard reference on probit modeling. Chambers and Cox (1967)
showed that it is difficult to distinguish between probit and logit models unless n is
extremely large. Ashford and Sowden (1970) generalized the probit model for multivariate
binary responses; see also Lesaffre and Molenberghs (1991) and Ochi and Prentice
(1984). Wedderburn (1976) showed that the log likelihood is concave for probit and
complementary log-log links.
Section 6.7: Conditional Logistic Regression
6.8. For details about conditional logistic regression, see Section 10.2, Breslow and Day
(1980, Chap. 7), Cox (1970), and Hosmer and Lemeshow (2000, Chap. 5). Liang (1984)
showed that conditional ML estimators and conditional score tests are asymptotically
equivalent to their unconditional counterparts under sampling from exponential families.
For exact inference using the conditional likelihood, see Hirji et al. (1987), Mehta
and Patel (1995), and the LogXact manual (Cytel Software). Mehta et al. (2000)
discussed Monte Carlo approximations.
PROBLEMS
Applications
6.1
For the horseshoe crab data, fit a model using weight and width as
predictors. Conduct (a) a likelihood-ratio test of H_0: β_1 = β_2 = 0, and
(b) separate tests for the partial effects. Why does neither test in part
(b) show evidence of an effect when the test in part (a) shows strong
evidence?
6.2
Refer to the data for Problem 8.13. Treating opinion about premarital
sex as the response variable, use backward elimination to select a
model. Interpret.
6.3
Refer to Table 6.4. Fit the stage 3 model denoted there by (E*P + G).
Use parameter estimates to interpret the G effect and the dependence
of the E effect on P.
6.4
Discern the reasons that Simpson’s paradox occurs for Table 6.7.
6.5
Refer to Problem 2.12.
a. Fit the model with G and D main effects. Using it, estimate the
AG conditional odds ratio. Compare to the marginal odds ratio, and
explain why they are so different. Test its goodness of fit.
b. Fit the model of no G effect, given the department. Use X² to test
fit. Obtain residuals, and interpret the lack of fit. (Each department
has a single nonredundant standardized Pearson residual. They
satisfy Σ_{i=1}^{6} r_i² = X², their squares giving six df = 1 components.)
c. Fit the two models excluding department A. Again consider lack of
fit, and interpret.
6.6
Conduct a residual analysis for the independence model with Table
6.11. What type of lack of fit is indicated?
6.7
Table 6.17 refers to the effectiveness of immediately injected or
1½-hour-delayed penicillin in protecting rabbits against lethal injection
with β-hemolytic streptococci.
a. Let X = delay, Y = whether cured, and Z = penicillin level. Fit
the logit model (6.4). Argue that the pattern of 0 cell counts
suggests that (with no intercept) β̂_1^Z = −∞ and β̂_5^Z = ∞. What does
your software report?
b. Using the logit model, conduct the likelihood-ratio test of XY
conditional independence. Interpret.
BUILDING AND APPLYING LOGISTIC REGRESSION MODELS
TABLE 6.17 Data for Problem 6.7

Penicillin                   Response
Level        Delay       Cured    Died
1/8          None           0       6
             1½ h           0       5
1/4          None           3       3
             1½ h           0       6
1/2          None           6       0
             1½ h           2       4
1            None           5       1
             1½ h           6       0
4            None           2       0
             1½ h           5       0

Source: Reprinted with permission from Mantel (1963).
c. Test XY conditional independence using the Cochran–Mantel–
Haenszel test. Interpret.
d. Estimate the XY conditional odds ratio using (i) ML with the logit
model, and (ii) the Mantel–Haenszel estimate. Interpret.
e. The small cell counts make large-sample analyses questionable.
Conduct small-sample inference, and interpret.
6.8
Refer to Table 2.6. Use the CMH statistic to test independence of
death penalty verdict and victim’s race, controlling for defendant’s
race. Show another test of this hypothesis, and compare results.
6.9
Treatments A and B were compared on a binary response for 40 pairs
of subjects matched on relevant covariates. For each pair, treatments
were assigned to the subjects randomly. Twenty pairs of subjects made
the same response for each treatment. Six pairs had a success for the
subject receiving A and a failure for the subject receiving B, whereas
the other 14 pairs had a success for B and a failure for A. Use the
Cochran–Mantel–Haenszel procedure to test independence of response
and treatment. (In Section 10.1 we present an equivalent test,
McNemar's test.)
6.10 Refer to Section 6.5.1. Suppose that π_1 = 0.7 and π_2 = 0.6. What
sample size is needed for the test to have approximate power 0.80,
when α = 0.05, for (a) H_a: π_1 ≠ π_2, and (b) H_a: π_1 > π_2?
6.11 Refer to Section 6.5.1. Suppose that π_1 = 0.63 and π_2 = 0.57. When
treatment sample sizes are equal, explain why the joint probabilities in
the 2 × 2 table are 0.315 and 0.185 in the row for treatment A and
0.285 and 0.215 in the row for treatment B. For the model of independence,
explain why the fitted joint probabilities are 0.30 for success
and 0.20 for failure, in each row. Show that X² has noncentrality
parameter 0.00375n and df = 1. For n = 200 and α = 0.05, find the
power.
6.12 In an experiment designed to compare two treatments on a three-category
response, a researcher expects the conditional distributions to be
approximately (0.2, 0.2, 0.6) and (0.3, 0.3, 0.4).
a. With α = 0.05, find the approximate power using (i) X² and (ii)
G² to compare the distributions with 100 observations for each
treatment. Compare results.
b. What sample size is needed for each treatment for the tests in part
(a) to have approximate power 0.90?
6.13 The horseshoe crab width values in Table 4.3 have x̄ = 26.3 and
s_x = 2.1. If the true relationship were similar to the fitted equation in
Section 5.1.3, about how large a sample yields P(type II error) = 0.10,
with α = 0.05, for testing H_0: β = 0 against H_a: β > 0?
6.14 Refer to Problem 5.1. Table 6.18 shows output for fitting a probit
model. Interpret the parameter estimates (a) using characteristics of
the normal cdf response curve, (b) finding the estimated rate of change
in the probability of remission where it equals 0.5, and (c) finding the
difference between the estimated probabilities of remission at the
lower and upper quartiles of the labeling index, 14 and 28.
TABLE 6.18 Data for Problem 6.14

                                   Likelihood Ratio 95%
Parameter   Estimate   Standard Error   Confidence Limits    ChiSquare   Pr > ChiSq
Intercept   −2.3178       0.7795       (−4.0114, −0.9084)      8.84        0.0029
LI           0.0878       0.0328       ( 0.0275,  0.1575)      7.19        0.0073
6.15 Use probit models to describe the effects of width and color on the
probability of a satellite for Table 4.3. Interpret.
6.16 Refer to Table 6.14. Fit the model having log-log link rather than
complementary log-log. Test the fit. Why does it fit so poorly?
6.17 For the linear logit model with Table 3.9 and scores (0, 15, 30),
conduct the exact test of H_0: β = 0 and find a point and interval
estimate of β using the conditional likelihood. Interpret.
6.18 Refer to Table 6.16. Apply conditional logistic regression to the model
discussed in Section 6.7.8.
a. Obtain an exact P-value for testing no C effect against the alternative
of a positive effect. Construct a 95% confidence interval for the
conditional CD odds ratio.
b. Construct the partial tables relating C to D for the combinations of
levels of (A, L). Note that three tables have no data when C = 1.
For the sole partial table having data at both C levels, find a 95%
exact confidence interval for the odds ratio and find an exact
one-sided P-value. Compare to results using the entire data set.
Comment about the contribution to inference of tables having only
a single positive row total or a single positive column total.
c. Obtain the ordinary ML fit of the logistic regression model. To
investigate the sensitivity of the estimated C effect, find the change
in the estimate and SE after adding one observation to the data set,
a case with no diarrhea when (C, A, L) = (1, 1, 1).
6.19 Consider Table 6.19, from a study of nonmetastatic osteosarcoma
(A. M. Goorin, J. Clin. Oncol. 5: 1178–1184, 1987, and the manual
for LogXact). The response is whether the subject achieved a three-year
disease-free interval.
a. Show that each predictor has a significant effect when used individually
without the others.
b. Try to fit a main-effects logistic regression model containing all
three predictors. Explain why the ML estimate for the effect of
lymphocytic infiltration is infinite.
TABLE 6.19 Data for Problem 6.19

Lymphocytic              Osteoblastic     Disease-Free
Infiltration   Gender    Pathology        Yes    No
High           Female    No                3      0
                         Yes               2      0
               Male      No                4      0
                         Yes               1      0
Low            Female    No                5      0
                         Yes               3      2
               Male      No                5      4
                         Yes               6     11

Source: LogXact 4 for Windows (Cambridge, MA: CYTEL Software, 1999).
c. Using conditional logistic regression, (i) conduct an exact test for
the effect of lymphocytic infiltration, controlling for the other
variables; and (ii) find a 95% confidence interval for the effect.
Interpret results.
6.20 Use the methods discussed in this chapter to select a model for Table
5.5.
6.21 Logistic regression is applied increasingly to large financial databases,
such as for credit scoring to model the influence of predictors on
whether a consumer is creditworthy. The data archive found under the
index at www.stat.uni-muenchen.de contains such a data set that
includes 20 covariates for 1000 observations. Build a model for
creditworthiness using the predictors running account, duration of credit,
payment of previous credits, intended use, gender, and marital status.
Theory and Methods
6.22 For a sequence of s nested models M_1, . . . , M_s, model M_s is the most
complex. Let d denote the difference in residual df between M_1 and
M_s.
a. Explain why for j < k, G²(M_j | M_k) ≤ G²(M_j | M_s).
b. Assume model M_j, so that M_k also holds when k > j. For all k > j,
as n → ∞, P[G²(M_j | M_k) > χ²_d(α)] ≤ α. Explain why.
c. Gabriel (1966) suggested a simultaneous testing procedure in which,
for each pair of models, the critical value for differences between
G² values is χ²_d(α). The final model accepted must be more
complex than any model rejected in a pairwise comparison. Since
part (b) is true for all j < k, argue that Gabriel's procedure has type
I error probability no greater than α.
6.23 Prove that the Pearson residuals for the linear logit model applied to
an I × 2 contingency table satisfy X² = Σ_{i=1}^{I} e_i². Note that this holds for a
binomial GLM with any link.
6.24 Refer to logit model (6.4) for a 2 × 2 × K contingency table {n_ijk}.
a. Using dummy variables, write the log-likelihood function. Identify
the sufficient statistics for the various parameters. Explain how to
conduct exact conditional inference about the effect of X, controlling
for Z.
b. Using a basic result for testing in exponential families, explain why
uniformly most powerful unbiased tests of conditional XY independence
are based on Σ_k n_11k (Birch 1964b; Lehmann 1986, Sec. 4.8).
6.25 Suppose that {π_ijk} in a 2 × 2 × 2 table are, by row, (0.15, 0.10 / 0.10,
0.15) when Z = 1 and (0.10, 0.15 / 0.15, 0.10) when Z = 2. For testing
conditional XY independence with logit models having Y as a response,
explain why the likelihood-ratio test comparing models X + Z
and Z is not consistent but the likelihood-ratio test of fit of the XY
conditional independence model is.
6.26 Refer to Section 6.4.1. When Y is N(μ_i, σ²), consider the comparison
of (μ_1, . . . , μ_I) based on independent samples at the I categories of X.
When approximately μ_i = α + βx_i, explain why the t or F test of H_0:
β = 0 is more powerful than the one-way ANOVA F test. Describe a
pattern for {μ_i} for which the ANOVA test would be more powerful.
6.27 For a multinomial distribution, let γ = Σ_i b_i π_i, and suppose that π_i =
f_i(θ) > 0, i = 1, . . . , I. For sample proportions {p_i}, let S = Σ_i b_i p_i.
Let T = Σ_i b_i π̂_i, where π̂_i = f_i(θ̂), for the ML estimator θ̂ of θ.
a. Show that var(S) = [Σ_i b_i² π_i − (Σ_i b_i π_i)²]/n.
b. Using the delta method, show var(T) ≈ [var(θ̂)][Σ_i b_i f_i′(θ)]².
c. By computing the information for L(θ) = Σ_i n_i log[f_i(θ)], show that
var(θ̂) is approximately [n Σ_i (f_i′(θ))²/f_i(θ)]^{−1}.
d. Asymptotically, show that var[√n(T − γ)] ≤ var[√n(S − γ)]. [Hint:
Show that var(T)/var(S) is a squared correlation between two
random variables, where with probability π_i the first equals b_i and
the second equals f_i′(θ)/f_i(θ).]
6.28 A threshold model can also motivate the probit model. For it, there is
an unobserved continuous response Y* such that the observed y_i = 0
if y_i* ≤ τ and y_i = 1 if y_i* > τ. Suppose that y_i* = μ_i + ε_i, where
μ_i = α + βx_i and where {ε_i} are independent from a N(0, σ²) distribution.
For identifiability one can set σ = 1 and the threshold τ = 0.
Show that the probit model holds, and explain why β represents the
expected change in Y*, in standard deviations, for a 1-unit
increase in x.
6.29 Consider the choice between two options, such as two product brands.
Let U_0 denote the utility of outcome y = 0 and U_1 the utility of y = 1.
For y = 0 and 1, suppose that U_y = α_y + β_y x + ε_y, using a scale such
that ε_y has some standardized distribution. A subject selects y = 1 if
U_1 > U_0 for that subject.
a. If ε_0 and ε_1 are independent N(0, 1) random variables, show that
P(Y = 1) satisfies the probit model.
b. If ε_y are independent extreme-value random variables, with cdf
F(ε) = exp[−exp(−ε)], show that P(Y = 1) satisfies the logistic
regression model (Maddala 1983, p. 60; McFadden 1974).
6.30 Consider model (6.12) with complementary log-log link.
a. Find x at which π(x) = 1/2.
b. Show that the greatest rate of change of π(x) occurs at x = −α/β.
What does π(x) equal at that point? Give the corresponding result
for the model with log-log link, and compare to the logit and probit
models.
6.31 Suppose that log-log model (6.13) holds. Explain how to interpret β.
6.32 Let y_i, i = 1, . . . , n, denote n independent binary random variables.
a. Derive the log likelihood for the probit model Φ^{−1}[π(x_i)] = Σ_j β_j x_ij.
b. Show that the likelihood equations for the logistic and probit
regression models are

Σ_i (y_i − π̂_i) z_i x_ij = 0,   j = 0, . . . , p,

where z_i = 1 for the logistic case and z_i = φ(Σ_j β̂_j x_ij) / [π̂_i(1 − π̂_i)]
for the probit case, with φ the standard normal density function.
(When the link is not canonical, there is no
reduction of the data in sufficient statistics.)
6.33 Sometimes, sample proportions are continuous rather than of the
binomial form (number of successes)/(number of trials). Each observation
is any real number between 0 and 1, such as the proportion of a
tooth surface that is covered with plaque. For independent responses
{y_i}, Aitchison and Shen (1980) and Bartlett (1937) modeled logit(Y_i) ~
N(μ_i, σ²). Then Y_i itself is said to have a logistic-normal distribution.
a. Expressing a N(μ, σ²) variate as μ + σZ, where Z is standard
normal, show that Y_i = exp(μ_i + σZ)/[1 + exp(μ_i + σZ)].
b. Show that for small σ,

Y_i = e^{μ_i}/(1 + e^{μ_i}) + [e^{μ_i}/(1 + e^{μ_i})²] σZ
      + [e^{μ_i}(1 − e^{μ_i})/2(1 + e^{μ_i})³] σ²Z² + ⋯.

c. Letting ω_i = e^{μ_i}/(1 + e^{μ_i}), when σ is close to 0 show that

E(Y_i) ≈ ω_i,   var(Y_i) ≈ [ω_i(1 − ω_i)]² σ².

d. For independent continuous proportions {y_i}, let μ_i = E(Y_i). For a
GLM, it is sensible to use an inverse cdf link for μ_i, but it is unclear
how to choose a distribution for Y_i. The approximate moments for
the logistic-normal motivate a quasi-likelihood approach (Wedderburn
1974) with variance function v(μ_i) = [μ_i(1 − μ_i)]² for unknown
dispersion φ. Explain why this provides similar results as fitting a
normal regression model to the sample logits assuming constant
variance. (The QL approach has the advantage of not requiring
adjustment of 0 or 1 observations, for which sample logits don't
exist.)
e. Wedderburn (1974) gave an example with response the proportion
of a leaf showing a type of blotch. Envision an approximation of
binomial form based on cutting each leaf into a large number of
small regions of the same size and observing for each region
whether it is mostly covered with blotch. Explain why this suggests
that v(μ_i) = μ_i(1 − μ_i). What violation of the binomial assumptions
might make this questionable? [The parametric family of
beta distributions has variance function of this form (see Section
13.3.1). Barndorff-Nielsen and Jørgensen (1991) proposed a distribution
having v(μ_i) = [μ_i(1 − μ_i)]³; see also Cox (1996).]
6.34 For independent binomial sampling, construct the log likelihood and
identify the sufficient statistics to be conditioned out to perform exact
inference about β in model (6.4).
6.35 Let π̂_(−) = (π̂_(−1), . . . , π̂_(−n)), where π̂_(−i) denotes the estimate of
E(Y_i) for binary observation i after fitting the model without that
observation. Cross-validation declares a model to have good predictive
power if corr(π̂_(−), y) is high. Consider the model logit(π_i) = α for all
i. Show that π̂_i = ȳ and hence π̂_(−i) = [n/(n − 1)][ȳ − (1/n)y_i], and
hence corr(π̂_(−), y) = −1 regardless of how well the model fits. Thus,
cross-validation can be misleading with binary data (Zheng and Agresti
2000).
CHAPTER 7

Logit Models for Multinomial Responses
In Chapters 5 and 6 we discussed modeling binary response variables with
binomial GLMs. Multicategory responses use multinomial GLMs. In this
chapter we generalize logistic regression for multinomial (nominal and ordinal)
response variables.
In Section 7.1 we present a model for nominal responses that uses a
separate binary logit model for each pair of response categories. In Section
7.2 we present a model for ordinal responses that uses logits of cumulative
response probabilities. In Section 7.3 we use other link functions for those
cumulative probabilities. Section 7.4 covers alternative ordinal-response models.
In Section 7.5 we discuss tests of conditional independence with multinomial
responses using models and using generalizations of the Cochran–Mantel–Haenszel
statistic. In the final section we introduce a multinomial
logit model for discrete-choice modeling of a subject's choice from one of
several options when values of predictors may depend on the option.
7.1 NOMINAL RESPONSES: BASELINE-CATEGORY LOGIT MODELS
Let Y be a categorical response with J categories. Multicategory (also called
polytomous) logit models for nominal response variables simultaneously describe
log odds for all J(J − 1)/2 pairs of categories. Given a certain choice of J − 1
of these, the rest are redundant.
7.1.1 Baseline-Category Logits
Let π_j(x) = P(Y = j | x) at a fixed setting x for explanatory variables, with
Σ_j π_j(x) = 1. For observations at that setting, we treat the counts at the J
categories of Y as multinomial with probabilities {π_1(x), . . . , π_J(x)}.
LOGIT MODELS FOR MULTINOMIAL RESPONSES
Logit models pair each response category with a baseline category, often
the last one or the most common one. The model
log[π_j(x)/π_J(x)] = α_j + β_j′x,   j = 1, . . . , J − 1,     (7.1)

simultaneously describes the effects of x on these J − 1 logits. The effects
vary according to the response paired with the baseline. These J − 1 equations
determine parameters for logits with other pairs of response categories,
since

log[π_a(x)/π_b(x)] = log[π_a(x)/π_J(x)] − log[π_b(x)/π_J(x)].
With categorical predictors, X² and G² goodness-of-fit statistics provide a
model check when data are not sparse. When an explanatory variable is
continuous or the data are sparse, such statistics are still valid for comparing
nested models differing by relatively few terms (Haberman 1974a, pp.
372–373; 1977a).
7.1.2 Alligator Food Choice Example
Table 7.1 is from a study of factors influencing the primary food choice of
alligators. It used 219 alligators captured in four Florida lakes. The nominal
response variable is the primary food type, in volume, found in an alligator’s
stomach. This had five categories: fish, invertebrate, reptile, bird, other. The
invertebrates included apple snails, aquatic insects, and crayfish. The reptiles
were primarily turtles, although one stomach contained the tags of 23 baby
alligators released in the lake the previous year! The ‘‘other’’ category
consisted of amphibian, mammal, plant material, stones or other debris, or
no food or dominant type. Table 7.1 also classifies the alligators according to
L = lake of capture (Hancock, Oklawaha, Trafford, George), G = gender
(male, female), and S = size (≤ 2.3 meters long, > 2.3 meters long).
Baseline-category logit models can investigate the effects of L, G, and S
on primary food type. Table 7.2 contains fit statistics for several models. We
denote a model by its predictors: for instance, (L + S) having additive lake
and size effects and ( ) having no predictors. The data are sparse, 219
observations scattered among 80 cells. Thus, G² is more reliable for comparing
models than for testing fit. The statistics G²[( ) | (G)] = 2.1 and
G²[(L + S) | (G + L + S)] = 2.2, each based on df = 4, suggest simplifying by
collapsing the table over gender. (Other analyses, not presented here, show
that adding interaction terms including G does not improve the fit significantly.)
The G² and X² values for the collapsed table indicate that both L and S
have effects. Table 7.3 exhibits fitted values for model (L + S) for the
TABLE 7.1 Primary Food Choice of Alligators

                               Primary Food Choice
Lake      Gender   Size (m)   Fish  Invertebrate  Reptile  Bird  Other
Hancock   Male     ≤ 2.3        7        1           0       0     5
                   > 2.3        4        0           0       1     2
          Female   ≤ 2.3       16        3           2       2     3
                   > 2.3        3        0           1       2     3
Oklawaha  Male     ≤ 2.3        2        2           0       0     1
                   > 2.3       13        7           6       0     0
          Female   ≤ 2.3        3        9           1       0     2
                   > 2.3        0        1           0       1     0
Trafford  Male     ≤ 2.3        3        7           1       0     1
                   > 2.3        8        6           6       3     5
          Female   ≤ 2.3        2        4           1       1     4
                   > 2.3        0        1           0       0     0
George    Male     ≤ 2.3       13       10           0       2     2
                   > 2.3        9        0           0       1     2
          Female   ≤ 2.3        3        9           1       0     1
                   > 2.3        8        1           0       0     1

Source: Data courtesy of Clint Moore, from an unpublished manuscript by M. F. Delaney and C.
T. Moore.
TABLE 7.2 Goodness of Fit of Baseline-Category
Logit Models for Table 7.1

Model^a           G²      X²     df
( )             116.8   106.5    60
(G)             114.7   101.2    56
(S)             101.6    86.9    56
(L)              73.6    79.6    48
(L + S)          52.5    58.0    44
(G + L + S)      50.3    52.6    40
Collapsed over G
( )              81.4    73.1    28
(S)              66.2    54.3    24
(L)              38.2    32.7    16
(L + S)          17.1    15.0    12

^a G, gender; S, size; L, lake of capture. See the text for
details.
TABLE 7.3 Observed and Fitted Values for Study of Alligator's Primary Food Choice

            Size of                      Primary Food Choice
Lake        Alligator (m)   Fish        Invertebrate   Reptile    Bird       Other
Hancock     ≤ 2.3           23 (20.9)    4 (3.6)       2 (1.9)    2 (2.7)    8 (9.9)
            > 2.3            7 (9.1)     0 (0.4)       1 (1.1)    3 (2.3)    5 (3.1)
Oklawaha    ≤ 2.3            5 (5.2)    11 (12.0)      1 (1.5)    0 (0.2)    3 (1.1)
            > 2.3           13 (12.8)    8 (7.0)       6 (5.5)    1 (0.8)    0 (1.9)
Trafford    ≤ 2.3            5 (4.4)    11 (12.4)      2 (2.1)    1 (0.9)    5 (4.2)
            > 2.3            8 (8.6)     7 (5.6)       6 (5.9)    3 (3.1)    5 (5.8)
George      ≤ 2.3           16 (18.5)   19 (16.9)      1 (0.5)    2 (1.2)    3 (3.8)
            > 2.3           17 (14.5)    1 (3.1)       0 (0.5)    1 (1.8)    3 (2.2)
collapsed table. Absolute values of standardized Pearson residuals comparing
observed and fitted values exceed 2 in only two of the 40 cells and exceed 3 in
none of the cells. The fit seems adequate.
Fish was the most common food choice. We now estimate the effects of
lake and size on the odds that alligators select other primary food types
instead of fish. With fish as the baseline category, Table 7.4 contains ML
estimates of effect parameters. These result from models using dummy
variables for the first three lakes and for size. The table uses letter subscripts
to denote the food choice categories. For example, the prediction equation
for the log odds of selecting invertebrates instead of fish is
log(π̂_I/π̂_F) = −1.55 + 1.46s − 1.66z_H + 0.94z_O + 1.12z_T,
TABLE 7.4 Estimated Parameters in Logit Model for Alligator Food Choice,
Based on Dummy Variable for First Size Category and Each Lake Except
Lake George^a

                                              Lake
Logit^b        Intercept   Size ≤ 2.3    Hancock        Oklawaha       Trafford
log(π_I/π_F)    −1.55      1.46 (0.40)   −1.66 (0.61)    0.94 (0.47)   1.12 (0.49)
log(π_R/π_F)    −3.31     −0.35 (0.58)    1.24 (1.19)    2.46 (1.12)   2.94 (1.12)
log(π_B/π_F)    −2.09     −0.63 (0.64)    0.70 (0.78)   −0.65 (1.20)   1.09 (0.84)
log(π_O/π_F)    −1.90      0.33 (0.45)    0.83 (0.56)    0.01 (0.78)   1.52 (0.62)

^a SE values in parentheses.
^b I, invertebrate; R, reptile; B, bird; O, other; F, fish.
where s = 1 for size ≤ 2.3 meters and 0 otherwise, z_H is a dummy variable
for Lake Hancock (z_H = 1 for alligators in that lake and 0 otherwise), and z_O
and z_T are dummy variables for Lakes Oklawaha and Trafford. Size of
alligator has a noticeable effect. For a given lake, for small alligators the
estimated odds that primary food choice was invertebrates instead of fish are
exp(1.46) = 4.3 times the estimated odds for large alligators; the Wald 95%
confidence interval is exp[1.46 ± 1.96(0.396)] = (2.0, 9.3). The lake effects
indicate that the estimated odds that the primary food choice was invertebrates
instead of fish are relatively higher at Lakes Trafford and Oklawaha
and relatively lower at Lake Hancock than they are at Lake George.
The equations in Table 7.4 determine those for other food-choice pairs.
For instance, for (invertebrate, other),

log(π̂_I/π̂_O) = log(π̂_I/π̂_F) − log(π̂_O/π̂_F)
             = (−1.55 + 1.46s − 1.66z_H + 0.94z_O + 1.12z_T)
               − (−1.90 + 0.33s + 0.83z_H + 0.01z_O + 1.52z_T)
             = 0.35 + 1.13s − 2.48z_H + 0.93z_O − 0.39z_T.
7.1.3 Estimating Response Probabilities

The equation that expresses multinomial logit models directly in terms of
response probabilities {π_j(x)} is

π_j(x) = exp(α_j + β_j′x) / [1 + Σ_{h=1}^{J−1} exp(α_h + β_h′x)]     (7.2)

with α_J = 0 and β_J = 0. This follows from (7.1), using the fact that (7.1) also
holds with j = J by setting α_J = 0 and β_J = 0. (Also, the parameters equal
zero for a baseline category for identifiability reasons; see Problem 7.26.) The
denominator of (7.2) is the same for each j. The numerators for various j
sum to the denominator, so Σ_j π_j(x) = 1. For J = 2, (7.2) simplifies to the
formula of type (5.1) used for binary logistic regression.
From Table 7.4 the estimated probability that a large alligator in Lake
Hancock has invertebrates as the primary food choice is

π̂_I = e^{−1.55−1.66} / (1 + e^{−1.55−1.66} + e^{−3.31+1.24} + e^{−2.09+0.70} + e^{−1.90+0.83}) = 0.023.

The estimated probabilities for reptile, bird, other, and fish are 0.072, 0.141,
0.194, and 0.570.
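The calculation above can be reproduced from (7.2) using the rounded estimates in Table 7.4 (fish is the baseline, with all parameters equal to 0); small discrepancies from the text's values, such as 0.569 versus 0.570 for fish, reflect rounding of the coefficients. A sketch:

```python
from math import exp

# (intercept, size <= 2.3, Hancock, Oklawaha, Trafford), rounded from Table 7.4
coefs = {
    "invertebrate": (-1.55, 1.46, -1.66, 0.94, 1.12),
    "reptile":      (-3.31, -0.35, 1.24, 2.46, 2.94),
    "bird":         (-2.09, -0.63, 0.70, -0.65, 1.09),
    "other":        (-1.90, 0.33, 0.83, 0.01, 1.52),
}

def food_probs(x):
    """Response probabilities from (7.2); x = (s, z_H, z_O, z_T)."""
    lin = {j: c[0] + sum(b * xi for b, xi in zip(c[1:], x))
           for j, c in coefs.items()}
    denom = 1.0 + sum(exp(v) for v in lin.values())
    p = {j: exp(v) / denom for j, v in lin.items()}
    p["fish"] = 1.0 / denom   # baseline: numerator exp(0) = 1
    return p

# Large alligator (s = 0) in Lake Hancock
p = food_probs((0, 1, 0, 0))   # p["invertebrate"] ≈ 0.023
```

Because every category shares the same denominator, the returned probabilities automatically sum to 1.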
This example used qualitative predictors. Multinomial logit models can
also contain quantitative predictors. In this study, the biologists used the size
dummy variable to distinguish between adult and subadult alligators. However, the alligators’ actual length was measured and is quantitative. With
quantitative predictors, it is informative to plot the estimated probabilities.
FIGURE 7.1 Estimated probabilities for primary food choice.
To illustrate, for alligators at one lake, Figure 7.1 plots the estimated
probabilities that primary food choice is fish, invertebrate, or other (which
combines the other, bird, and reptile categories) as a function of length. With
more than two response categories, the probability for a given category need
not continuously increase or decrease (Problem 7.27).
7.1.4 Fitting of Baseline-Category Logit Models*

ML fitting of multinomial logit models maximizes the likelihood subject to
{π_j(x)} simultaneously satisfying the J − 1 equations that specify the model.
For i = 1, . . . , n, let y_i = (y_i1, . . . , y_iJ) represent the multinomial trial for
subject i, where y_ij = 1 when the response is in category j and y_ij = 0
otherwise. Thus, Σ_j y_ij = 1. Let x_i = (x_i1, . . . , x_ip)′ denote explanatory variable
values for subject i. Let β_j = (β_j1, . . . , β_jp)′ denote parameters for the
jth logit.
Since π_J = 1 − (π_1 + ⋯ + π_{J−1}) and y_iJ = 1 − (y_i1 + ⋯ + y_i,J−1), the
contribution to the log likelihood by subject i is

log Π_{j=1}^{J} π_j(x_i)^{y_ij}
  = Σ_{j=1}^{J−1} y_ij log π_j(x_i) + [1 − Σ_{j=1}^{J−1} y_ij] log[1 − Σ_{j=1}^{J−1} π_j(x_i)]
  = Σ_{j=1}^{J−1} y_ij log {π_j(x_i) / [1 − Σ_{j=1}^{J−1} π_j(x_i)]} + log[1 − Σ_{j=1}^{J−1} π_j(x_i)].
Thus, the baseline-category logits are the natural parameters for the multinomial distribution.
Now assume n independent observations. In the last expression above,
substituting α_j + β_j′x_i for the logit in the first term and π_J(x_i) =
1/[1 + Σ_{j=1}^{J−1} exp(α_j + β_j′x_i)] in the second term, the log likelihood is

log Π_{i=1}^{n} Π_{j=1}^{J} π_j(x_i)^{y_ij}
  = Σ_{i=1}^{n} {Σ_{j=1}^{J−1} y_ij(α_j + β_j′x_i) − log[1 + Σ_{j=1}^{J−1} exp(α_j + β_j′x_i)]}
  = Σ_{j=1}^{J−1} [α_j (Σ_{i=1}^{n} y_ij) + Σ_{k=1}^{p} β_jk (Σ_{i=1}^{n} x_ik y_ij)]
    − Σ_{i=1}^{n} log[1 + Σ_{j=1}^{J−1} exp(α_j + β_j′x_i)].

The sufficient statistic for β_jk is Σ_i x_ik y_ij, j = 1, . . . , J − 1, k = 1, . . . , p. The
sufficient statistic for α_j is Σ_i y_ij = Σ_i x_i0 y_ij for x_i0 = 1; this is the total
number of outcomes in category j.
The likelihood equations equate the sufficient statistics to their expected
values. The log likelihood is concave, and the Newton–Raphson method
yields the ML parameter estimates. The estimators have large-sample normal
distributions. Their asymptotic standard errors are square roots of diagonal
elements of the inverse information matrix.
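Because the log likelihood is concave and the score is simply "sufficient statistics minus their expected values," even naive gradient ascent locates the ML estimates; Newton–Raphson merely gets there faster. A sketch on a small hypothetical data set (not from the text), with J = 3 categories, one predictor, and category 3 as baseline:

```python
from math import exp

# Hypothetical (x, y) observations; y in {1, 2, 3}, category 3 is baseline.
data = [(0, 1), (0, 1), (0, 2), (0, 3), (1, 1), (1, 2),
        (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (2, 3)]

def cat_probs(theta, x):
    """pi_j(x) from (7.2); theta = [alpha1, beta1, alpha2, beta2]."""
    e = [exp(theta[0] + theta[1] * x), exp(theta[2] + theta[3] * x), 1.0]
    s = sum(e)
    return [v / s for v in e]

def score(theta):
    """Score vector: sufficient statistics minus fitted expected values."""
    g = [0.0] * 4
    for x, y in data:
        p = cat_probs(theta, x)
        for j in (0, 1):
            resid = (1.0 if y == j + 1 else 0.0) - p[j]
            g[2 * j] += resid          # component for alpha_{j+1}
            g[2 * j + 1] += x * resid  # component for beta_{j+1}
    return g

theta = [0.0] * 4
for _ in range(20000):  # small fixed step on a concave log likelihood
    theta = [t + 0.05 * g for t, g in zip(theta, score(theta))]
```

At convergence the score is (numerically) zero, i.e., each sufficient statistic equals its fitted expected value, which is exactly the likelihood-equations statement above.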
Most statistical software can fit multinomial logit models, but some can fit
only binary logistic regression models. An alternative fitting approach fits
binary logit models separately for the J − 1 pairings of responses: model
(7.1) for j = 1 alone, using only observations in category 1 or J of the
response variable to obtain estimates of α_1 and β_1; model (7.1) using only
categories 2 and J to obtain estimates of α_2 and β_2; in this manner,
obtaining J − 1 separate fits of logit models. A logit model fitted using data
from only two response categories is the same as a regular logit model fitted
conditional on classification into one of those categories. For instance, the jth
baseline-category logit is a logit of conditional probabilities:

log {[π_j(x)/(π_j(x) + π_J(x))] / [π_J(x)/(π_j(x) + π_J(x))]} = log[π_j(x)/π_J(x)].

The separate-fitting estimates differ from the ML estimates for simultaneous
fitting of the J − 1 logits. They are less efficient, tending to have larger
standard errors. However, Begg and Gray (1984) showed that the efficiency
loss is minor when the response category having highest prevalence is the
baseline. To illustrate this approach, we used the data for the categories
invertebrate and fish alone. The fit is log(π̂_I/π̂_F) = −1.69 + 1.66s −
1.78z_H + 1.05z_O + 1.22z_T, with standard errors (0.43, 0.62, 0.49, 0.52) for
the effects. The effects are similar to those from simultaneous fitting with all
five response categories (see the first row of Table 7.4). The estimated
standard errors are only slightly larger, since 155 of the 219 observations
were in the fish or invertebrate categories of food type.
7.1.5 Multicategory Logit Model as Multivariate GLM*

For a univariate response variable in the natural exponential family, a GLM
has form g(μ_i) = x_i′β for a link function g, expected response μ_i = E(Y_i),
vector of values x_i of p explanatory variables for observation i, and parameter
vector β = (β_1, . . . , β_p)′. This extends to a multivariate GLM for distributions
in the multivariate exponential family (Problem 7.24), such as the
multinomial.
Let y_i = (y_i1, y_i2, . . .)′ be a vector response for subject i, with μ_i = E(Y_i).
Let g be a vector of link functions. The multivariate GLM has the form

g(μ_i) = X_i β,     (7.3)

where row h of the model matrix X_i for observation i contains values of
explanatory variables for y_ih. For details, see Fahrmeir and Tutz (2001,
Chap. 3).
The baseline-category logit model is a multivariate GLM. Here y_i =
(y_i1, . . . , y_i,J−1)′, since y_iJ is redundant. Then μ_i = (π_1(x_i), . . . , π_{J−1}(x_i))′
and

g_j(μ_i) = log {μ_ij / [1 − (μ_i1 + ⋯ + μ_i,J−1)]}.

The model matrix for observation i is block diagonal,

X_i = diag[(1, x_i′), (1, x_i′), . . . , (1, x_i′)],

with one block (1, x_i′) for each of the J − 1 logits and 0 entries in other
locations, and β = (α_1, β_1′, . . . , α_{J−1}, β_{J−1}′)′. One can
also formulate it for grouped data using sample proportions in the categories.
7.2 ORDINAL RESPONSES: CUMULATIVE LOGIT MODELS
In Section 6.4.1 we showed the benefits of utilizing the ordinality of a
variable by focusing inferences on a single parameter. These benefits extend
to models for ordinal responses. Models with terms that reflect ordinal
characteristics such as monotone trend have improved model parsimony and
power. In this section we introduce the most popular logit model for ordinal
responses.
7.2.1 Cumulative Logits
One way to use category ordering forms logits of cumulative probabilities,

P(Y ≤ j | x) = π_1(x) + ··· + π_j(x),   j = 1, ..., J.

The cumulative logits are defined as

logit[P(Y ≤ j | x)] = log{ P(Y ≤ j | x) / [1 − P(Y ≤ j | x)] }
   = log{ [π_1(x) + ··· + π_j(x)] / [π_{j+1}(x) + ··· + π_J(x)] },   j = 1, ..., J − 1.   (7.4)
Each cumulative logit uses all J response categories.
A model for logit[P(Y ≤ j)] alone is an ordinary logit model for a binary response in which categories 1 to j form one outcome and categories j + 1 to J form the second. Better, models can use all J − 1 cumulative logits in a single parsimonious model.
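As a quick numerical sketch of (7.4), the cumulative logits can be computed directly from a set of category probabilities; the probabilities below are hypothetical, not from the chapter's data analyses.

```python
import math

def cumulative_logits(probs):
    """Compute logit[P(Y <= j)] for j = 1, ..., J-1 from category probabilities."""
    logits = []
    cum = 0.0
    for j in range(len(probs) - 1):
        cum += probs[j]                              # P(Y <= j)
        logits.append(math.log(cum / (1.0 - cum)))   # log[P(Y <= j)/P(Y > j)]
    return logits

# Hypothetical probabilities for a J = 4 ordinal response
print(cumulative_logits([0.1, 0.3, 0.4, 0.2]))
```

Note that each logit pools categories on both sides of the cutpoint, so every one of the J categories enters every cumulative logit.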
7.2.2 Proportional Odds Model
A model that simultaneously uses all cumulative logits is

logit[P(Y ≤ j | x)] = α_j + β′x,   j = 1, ..., J − 1.   (7.5)

Each cumulative logit has its own intercept. The {α_j} are increasing in j, since P(Y ≤ j | x) increases in j for fixed x, and the logit is an increasing function of this probability.
This model has the same effects β for each logit. For a continuous predictor x, Figure 7.2 depicts the model when J = 4.

FIGURE 7.2 Cumulative logit model with effect independent of cutpoint.

For fixed j, the response curve is a logistic regression curve for a binary response with outcomes Y ≤ j and Y > j. The response curves for j = 1, 2, and 3 have the same shape. They share exactly the same rate of increase or decrease but are horizontally displaced from each other. For j < k, the curve for P(Y ≤ k) is the curve for P(Y ≤ j) translated by (α_k − α_j)/β units in the x direction; that is,

P(Y ≤ k | X = x) = P(Y ≤ j | X = x + (α_k − α_j)/β).

Figure 7.3 portrays the curves for the category probabilities.

FIGURE 7.3 Category probabilities in cumulative logit model.
The cumulative logit model (7.5) satisfies

logit[P(Y ≤ j | x_1)] − logit[P(Y ≤ j | x_2)]
   = log{ [P(Y ≤ j | x_1)/P(Y > j | x_1)] / [P(Y ≤ j | x_2)/P(Y > j | x_2)] } = β′(x_1 − x_2).

An odds ratio of cumulative probabilities is called a cumulative odds ratio. The odds of making response ≤ j at x = x_1 are exp[β′(x_1 − x_2)] times the odds at x = x_2. The log cumulative odds ratio is proportional to the distance between x_1 and x_2. The same proportionality constant applies to each logit. Because of this property, McCullagh (1980) called (7.5) a proportional odds model.
FIGURE 7.4 Uniform odds ratios AD/BC whenever x_1 − x_2 = 1, for all response cutpoints with proportional odds model.
With a single predictor, the cumulative odds ratio equals e^β whenever x_1 − x_2 = 1. Figure 7.4 illustrates the constant cumulative odds ratio this model then implies for all j. It shows the J-category response collapsed into the binary outcome (≤ j, > j) and shows the sets of cells that determine the cumulative odds ratio AD/BC that takes the same value e^β for each such collapsing.
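The constancy of the cumulative odds ratio under (7.5) can be verified numerically; the intercepts and slope in this sketch are hypothetical.

```python
import math

def cum_prob(alpha, beta, x):
    """P(Y <= j | x) under the proportional odds model (7.5), one predictor."""
    return [math.exp(a + beta * x) / (1 + math.exp(a + beta * x)) for a in alpha]

alpha = [-1.0, 0.0, 1.5]   # hypothetical cutpoints, increasing in j (J = 4)
beta = 0.7
p1 = cum_prob(alpha, beta, x=3.0)
p2 = cum_prob(alpha, beta, x=2.0)
for j in range(3):
    odds1 = p1[j] / (1 - p1[j])
    odds2 = p2[j] / (1 - p2[j])
    print(round(odds1 / odds2, 6))   # same value e^beta at every cutpoint j
```

Since the odds at cutpoint j equal exp(α_j + βx), the ratio between x values one unit apart is exp(β) exactly, whichever j is used.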
Model (7.5) constrains the J − 1 response curves to have the same shape. Thus, its fit is not the same as fitting separate logit models for each j. Again let (y_i1, ..., y_iJ) be binary indicators of the response for subject i. The likelihood function is

∏_{i=1}^{n} ∏_{j=1}^{J} π_j(x_i)^{y_ij}
   = ∏_{i=1}^{n} ∏_{j=1}^{J} [P(Y ≤ j | x_i) − P(Y ≤ j − 1 | x_i)]^{y_ij}
   = ∏_{i=1}^{n} ∏_{j=1}^{J} [ exp(α_j + β′x_i)/(1 + exp(α_j + β′x_i)) − exp(α_{j−1} + β′x_i)/(1 + exp(α_{j−1} + β′x_i)) ]^{y_ij},   (7.6)

viewed as a function of ({α_j}, β). McCullagh (1980) and Walker and Duncan (1967) used Fisher scoring algorithms to obtain ML estimates.
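The factors in (7.6) are differences of adjacent model cdf values, so the category probabilities π_j(x) implied by (7.5) can be computed as such differences. A minimal sketch with hypothetical parameter values:

```python
import math

def category_probs(alpha, beta, x):
    """pi_j(x) as differences of cumulative probabilities, as in the factors of (7.6)."""
    def F(j):                        # P(Y <= j | x), with F(0) = 0 and F(J) = 1
        if j == 0:
            return 0.0
        if j == len(alpha) + 1:
            return 1.0
        eta = alpha[j - 1] + beta * x
        return math.exp(eta) / (1 + math.exp(eta))
    J = len(alpha) + 1
    return [F(j) - F(j - 1) for j in range(1, J + 1)]

# Hypothetical parameters (alpha increasing in j), J = 4
probs = category_probs([-1.0, 0.5, 2.0], beta=-0.3, x=1.0)
print(probs, sum(probs))   # nonnegative probabilities summing to 1
```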
7.2.3 Latent Variable Motivation*
A regression model for a continuous variable assumed to underlie Y motivates the common effect β for different j in the proportional odds model (Anderson and Philips 1981). Let Y* denote this underlying variable. In statistics, such an unobserved variable is called a latent variable. Suppose that it has cdf G(y* − η), where values of y* vary around a location parameter η (such as a mean) that depends on x through η(x) = β′x. Suppose that −∞ = α_0 < α_1 < ··· < α_J = ∞ are cutpoints of the continuous scale such that the observed response Y satisfies

Y = j   if α_{j−1} < Y* ≤ α_j.

That is, Y falls in category j when the latent variable falls in the jth interval of values (Figure 7.5). Then

P(Y ≤ j | x) = P(Y* ≤ α_j | x) = G(α_j − β′x).

FIGURE 7.5 Ordinal measurement and underlying regression model for a latent variable.
The appropriate model for Y implies that the link G^{−1}, the inverse of the cdf for Y*, applies to P(Y ≤ j | x). If Y* = β′x + ε, where the cdf G of ε is the logistic (Section 4.2.5), then G^{−1} is the logit link and a proportional odds model results. Normality for ε implies a probit link for cumulative probabilities (Section 7.3.1).
In this derivation, the same parameters β occur for the effects on Y regardless of how the cutpoints {α_j} chop up the scale for the latent variable. The effect parameters are invariant to the choice of categories for Y. If a continuous variable measuring political philosophy has a linear regression with some predictor variables, then the same effect parameters apply to a discrete version of political philosophy with the categories (liberal, moderate, conservative) or (very liberal, slightly liberal, moderate, slightly conservative, very conservative). This feature makes it possible to compare estimates from studies using different response scales.
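This invariance can be checked numerically: with a standard logistic latent variable, logit P(Y ≤ j | x) = α_j − βx exactly, so the change in each cumulative logit per unit of x is −β for any cutpoint set. A sketch with hypothetical values:

```python
import math

# Latent-variable sketch: Y* = beta*x + eps, eps standard logistic (hypothetical beta).
beta = 0.8

def cumulative_logit(cutpoint, x):
    p = 1 / (1 + math.exp(-(cutpoint - beta * x)))   # P(Y* <= cutpoint | x)
    return math.log(p / (1 - p))

# Coarse (3-category) vs. fine (5-category) discretizations of the same latent scale
for cutpoints in ([-1.0, 1.0], [-2.0, -0.5, 0.5, 2.0]):
    for a in cutpoints:
        slope = cumulative_logit(a, 1.0) - cumulative_logit(a, 0.0)
        print(round(slope, 6))   # always -beta, regardless of the cutpoints
```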
Note that the use of a cdf of form G(y* − η) for the latent variable results in linear predictor α_j − β′x rather than α_j + β′x. When β > 0, as x increases each cumulative logit then decreases, so each cumulative probability decreases and relatively less probability mass falls at the low end of the Y scale. Thus, Y tends to be larger at higher values of x. With this parameterization the sign of β has the usual meaning. However, most software (e.g., SAS) uses form (7.5).
7.2.4 Mental Impairment Example
Table 7.5 comes from a study of mental health for a random sample of adult residents of Alachua County, Florida. It relates mental impairment to two explanatory variables. Mental impairment is an ordinal response, with categories (well, mild symptom formation, moderate symptom formation, impaired). The life events index x_1 is a composite measure of the number and severity of important life events, such as birth of a child, new job, divorce, or death in the family, that occurred to the subject within the past 3 years. Socioeconomic status (x_2 = SES) is measured here as binary (1 = high, 0 = low).
TABLE 7.5 Mental Impairment by SES and Life Events

         Mental      SES^a  Life Events          Mental      SES^a  Life Events
Subject  Impairment  x_2    x_1        Subject   Impairment  x_2    x_1
 1       Well        1      1            21      Mild        1      9
 2       Well        1      9            22      Mild        0      3
 3       Well        1      4            23      Mild        1      3
 4       Well        1      3            24      Mild        1      1
 5       Well        0      2            25      Moderate    0      0
 6       Well        1      0            26      Moderate    1      4
 7       Well        0      1            27      Moderate    0      3
 8       Well        1      3            28      Moderate    0      9
 9       Well        1      3            29      Moderate    1      6
10       Well        1      7            30      Moderate    0      4
11       Well        0      1            31      Moderate    0      3
12       Well        0      2            32      Impaired    1      8
13       Mild        1      5            33      Impaired    1      2
14       Mild        0      6            34      Impaired    1      7
15       Mild        1      3            35      Impaired    0      5
16       Mild        0      1            36      Impaired    0      4
17       Mild        1      8            37      Impaired    0      4
18       Mild        1      2            38      Impaired    1      8
19       Mild        0      5            39      Impaired    0      8
20       Mild        1      5            40      Impaired    0      9

^a 0, low; 1, high.
TABLE 7.6 Output for Fitting Cumulative Logit Model to Table 7.5

Score Test for the Proportional Odds Assumption
Chi-Square    DF    Pr > ChiSq
  2.3255       4      0.6761

                           Std      Like. Ratio 95%
Parameter    Estimate     Error     Conf Limits           Chi-Square   Pr > ChiSq
Intercept1   -0.2819      0.6423    -1.5615    0.9839        0.19        0.6607
Intercept2    1.2128      0.6607    -0.0507    2.5656        3.37        0.0664
Intercept3    2.2094      0.7210     0.8590    3.7123        9.39        0.0022
life         -0.3189      0.1210    -0.5718   -0.0920        6.95        0.0084
ses           1.1112      0.6109    -0.0641    2.3471        3.31        0.0689
The main-effects model of form (7.5) is

logit[P(Y ≤ j | x)] = α_j + β_1 x_1 + β_2 x_2.

Table 7.6 shows output. With J = 4 response categories, the model has three intercepts {α_j}. Usually, these are not of interest except for computing response probabilities. The parameter estimates yield estimated logits and hence estimates of P(Y ≤ j), P(Y > j), or P(Y = j). We illustrate for subjects at the mean life events score of x_1 = 4.275 with low SES (x_2 = 0). Since α̂_1 = −0.282, the estimated probability of response well is

P̂(Y = 1) = P̂(Y ≤ 1) = exp[−0.282 − 0.319(4.275)] / {1 + exp[−0.282 − 0.319(4.275)]} = 0.16.
Figure 7.6 plots P̂(Y > 2) as a function of the life events index, at the two levels of SES.

FIGURE 7.6 Estimated values of P(Y > 2) for Table 7.5.
The effect estimates β̂_1 = −0.319 and β̂_2 = 1.111 suggest that the cumulative probability starting at the well end of the scale decreases as the life events score increases and increases at the higher level of SES. Given the life events score, at the high SES level the estimated odds of mental impairment below any fixed level are e^{1.111} = 3.0 times the estimated odds at the low SES level.
Descriptions of effects can compare cumulative probabilities rather than use odds ratios. These can be easier to understand. We describe effects of quantitative variables by comparing probabilities at their quartiles. We describe effects of qualitative variables by comparing probabilities for different categories. We control for quantitative variables by setting them at their mean. We control for qualitative variables by fixing the category, unless there are several, in which case we can set each dummy variable at its mean. We illustrate again with P(Y = 1), the well outcome. First, we describe the SES effect. At the mean life events score of 4.275, P̂(Y = 1) = 0.37 at high SES (i.e., x_2 = 1) and 0.16 at low SES (x_2 = 0). Next, we describe the life events effect. The lower and upper quartiles of the life events score are 2.0 and 6.5. For high SES, P̂(Y = 1) changes from 0.55 to 0.22 between these quartiles; for low SES, it changes from 0.28 to 0.09. (Note that comparing 0.55 to 0.28 at the lower quartile and 0.22 to 0.09 at the upper quartile provides further information about the SES effect.) The sample effect is substantial for both predictors.
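The probabilities quoted above can be reproduced from the Table 7.6 estimates (α̂_1 = −0.282, β̂_1 = −0.319, β̂_2 = 1.111); the helper function here is ours, not part of the SAS output.

```python
import math

def p_well(life, ses):
    """Estimated P(Y = 1) = P(Y <= 1) for the mental impairment model of Table 7.6."""
    logit = -0.282 - 0.319 * life + 1.111 * ses
    return math.exp(logit) / (1 + math.exp(logit))

# SES effect at the mean life events score 4.275
print(round(p_well(4.275, 1), 2), round(p_well(4.275, 0), 2))   # 0.37 and 0.16
# Life events effect between its quartiles 2.0 and 6.5
print(round(p_well(2.0, 1), 2), round(p_well(6.5, 1), 2))       # 0.55 and 0.22 (high SES)
print(round(p_well(2.0, 0), 2), round(p_well(6.5, 0), 2))       # 0.28 and 0.09 (low SES)
```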
The output in Table 7.6, taken from SAS, also presents a score test of the proportional odds property. This tests whether the effects are the same for each cumulative logit against the alternative of separate effects. It compares the model with one parameter for x_1 and one for x_2 to a more complex model with three parameters for each, allowing different effects for logit[P(Y ≤ 1)], logit[P(Y ≤ 2)], and logit[P(Y ≤ 3)]. Here, the score statistic equals 2.33. It has df = 4, since the more complex model has four additional parameters. The more complex model does not fit significantly better (P = 0.68).
7.2.5 More Complex Models
More complex cumulative logit models are formulated as in ordinary logistic regression. They simply require a set of intercept parameters rather than a single one. In the previous example, for instance, permitting interaction yields a model with ML fit

logit[P̂(Y ≤ j | x)] = α̂_j − 0.420x_1 + 0.371x_2 + 0.181x_1x_2,

where the coefficient of x_1x_2 has SE = 0.238. The estimated effect of life events on the cumulative logit is −0.420 for the low SES group and (−0.420 + 0.181) = −0.239 for the high SES group. The impact of life
events seems more severe for the low SES group, but the difference in effects is not significant.
Models in this section used the proportional odds assumption of the same effects for different cumulative logits. An advantage is that effects are simple to summarize and interpret, requiring only a single parameter for each predictor. The models generalize to include separate effects, replacing β in (7.5) by β_j. This implies nonparallelism of curves for different logits. However, curves for different cumulative probabilities then cross for some x values. Such models violate the proper order among the cumulative probabilities.
Even if such a model fits better over the observed range of x, for reasons of parsimony the simple model might be preferable. One case is when effects {β̂_j} with different logits are not substantially different in practical terms. Then the significance in a test of proportional odds may reflect primarily a large value of n. Even with smaller n, although effect estimators using the simple model are biased, they may have smaller MSE than estimators from a more complex model having many more parameters. So even if a test of proportional odds has a small P-value, don't discard this model automatically.
If a proportional odds model fits poorly in terms of practical as well as statistical significance, alternative strategies exist. These include (1) trying a link function for which the response curve is nonsymmetric (e.g., complementary log-log); (2) adding additional terms, such as interactions, to the linear predictor; (3) adding dispersion parameters; (4) permitting separate effects for each logit for some but not all predictors (i.e., partial proportional odds); and (5) fitting baseline-category logit models and using the ordinality in an informal way in interpreting the associations. For approach (4), see Peterson and Harrell (1990), Stokes et al. (2000, Sec. 15.13), and criticism by Cox (1995). In the next section we generalize the cumulative logit model to permit extensions (1) and (3).
7.3 ORDINAL RESPONSES: CUMULATIVE LINK MODELS
Cumulative logit models use the logit link. As in univariate GLMs, other link functions are possible. Let G^{−1} denote a link function that is the inverse of the continuous cdf G (recall Section 4.2.5). The cumulative link model

G^{−1}[P(Y ≤ j | x)] = α_j + β′x   (7.7)

links the cumulative probabilities to the linear predictor. The logit link function G^{−1}(u) = log[u/(1 − u)] is the inverse of the standard logistic cdf.
As in the proportional odds model (7.5), effects of x in (7.7) are assumed the same for each cutpoint, j = 1, ..., J − 1. In Section 7.2.3 we showed that this assumption holds when a linear regression for a latent variable Y* has
standardized cdf G. Model (7.7) results from discrete measurement of Y* from a location-parameter family having cdf G(y* − β′x). The parameters {α_j} are category cutpoints on a standardized version of the latent scale. In this sense, cumulative link models are regression models, using a linear predictor β′x to describe effects of explanatory variables on crude ordinal measurement of Y*. Using − rather than + in the linear predictor merely results in a change of sign of β̂. Most software (e.g., GENMOD and LOGISTIC in SAS) fits it in + form.
7.3.1 Types of Cumulative Links
Use of the standard normal cdf Φ for G gives the cumulative probit model. This generalizes the binary probit model (Section 6.6) to ordinal responses. It is appropriate when the distribution for Y* is normal. Parameters in probit models can be interpreted in terms of the latent variable Y*. For instance, consider the model Φ^{−1}[P(Y ≤ j)] = α_j − βx. From Section 7.2.3, since Y* = βx + ε where ε ~ N(0, 1) has cdf Φ, β has the interpretation that a 1-unit increase in x corresponds to a β increase in E(Y*). When ε need not be in standardized form with σ = 1, a 1-unit increase in x corresponds to a β standard deviation increase in E(Y*). Cumulative logit models provide fits similar to those for cumulative probit models, and their parameter interpretation is simpler.
An underlying extreme value distribution for Y* implies a model of the form

log{ −log[1 − P(Y ≤ j | x)] } = α_j + β′x.

In Section 6.6 we introduced this complementary log-log link for binary data. The ordinal model using this link is sometimes called a proportional hazards model, since it results from a generalization of the proportional hazards model for survival data to handle grouped survival times (Prentice and Gloeckler 1978). It has the property

P(Y > j | x_1) = [P(Y > j | x_2)]^{exp[β′(x_1 − x_2)]}.

With this link, P(Y ≤ j) approaches 1.0 at a faster rate than it approaches 0.0. The related log-log link log{ −log[P(Y ≤ j)] } is appropriate when the complementary log-log link holds for the categories listed in reverse order.
7.3.2 Estimation for Cumulative Link Models
McCullagh (1980) and Thompson and Baker (1981) treated cumulative link models as multivariate GLMs. McCullagh presented a Fisher scoring algorithm for ML estimation, expressing the likelihood in the form (7.6) using cumulative probabilities. McCullagh showed that sufficiently large n guarantees a unique maximum of the likelihood. Burridge (1981) and Pratt (1981) showed that the log likelihood is concave for many cumulative link models, including the logit, probit, and complementary log-log. Iterative algorithms usually converge rapidly to the ML estimates.
7.3.3 Life Table Example
Table 7.7 shows the life-length distribution for U.S. residents in 1981, by race and gender. Life length uses five ordered categories. The underlying continuous cdf of life length increases slowly at small to moderate ages but increases sharply at older ages. This suggests the complementary log-log link. This link also results from assuming that the hazard rate increases exponentially with age, which happens for an extreme value distribution (the Gompertz).
For gender G (1 = female; 0 = male), race R (1 = black; 0 = white), and life length Y, Table 7.7 contains fitted distributions for the model

log{ −log[1 − P(Y ≤ j | G = g, R = r)] } = α_j + β_1 g + β_2 r.

Goodness-of-fit statistics are irrelevant, since the table contains population distributions. The model describes the four distributions well. Its parameter values are β_1 = −0.658 and β_2 = 0.626. The fitted cdf's satisfy

P(Y > j | G = 0, R = r) = [P(Y > j | G = 1, R = r)]^{exp(0.658)}.

Given race, the proportion of men living longer than a fixed time equaled the proportion for women raised to the exp(0.658) = 1.93 power. Given gender, the proportion of blacks living longer than a fixed time equaled the proportion for whites raised to the exp(0.626) = 1.87 power. The β_1 and β_2 values indicate that white men and black women had similar distributions, that white women tended to have the longest lives, and that black men tended to have the shortest lives. If the probability of living longer than some fixed time equaled ρ for white women, that probability was about ρ² for white men and black women and ρ⁴ for black men.
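As a numerical check of this power relationship, the fitted white male proportion living past 65 in Table 7.7 follows directly from the white female proportion:

```python
import math

# Complementary log-log model property: P(Y > j | male) = P(Y > j | female)^exp(0.658).
# Using the white female proportion over 65 from Table 7.7:
p_female_over65 = 0.849
p_male_over65 = p_female_over65 ** math.exp(0.658)
print(round(p_male_over65, 3))   # 0.729, matching the white male value in Table 7.7
```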
TABLE 7.7 Life-Length Distribution of U.S. Residents (Percent),^a 1981

                       Males                         Females
Life Length     White          Black          White          Black
0-20             2.4 (2.4)      3.6 (4.4)      1.6 (1.2)      2.7 (2.3)
20-40            3.4 (3.5)      7.5 (6.4)      1.4 (1.9)      2.9 (3.4)
40-50            3.8 (4.4)      8.3 (7.7)      2.2 (2.4)      4.4 (4.3)
50-65           17.5 (16.7)    25.0 (26.1)     9.9 (9.6)     16.3 (16.3)
Over 65         72.9 (73.0)    55.6 (55.4)    84.9 (84.9)    73.7 (73.7)

^a Values in parentheses are fit of proportional hazards (i.e., complementary log-log link) model.
Source: Data from Statistical Abstract of the United States (Washington, DC: U.S. Bureau of the Census, 1984), p. 69.
7.3.4 Incorporating Dispersion Effects*
For cumulative link models, settings of the explanatory variables are stochastically ordered on the response: For any pair x_1 and x_2, either P(Y ≤ j | x_1) ≤ P(Y ≤ j | x_2) for all j or P(Y ≤ j | x_1) ≥ P(Y ≤ j | x_2) for all j. Figure 7.7a illustrates for underlying continuous density functions and cdf's at two settings of x. When this is violated and such models fit poorly, often it is because the dispersion also varies with x. For instance, perhaps responses tend to concentrate around the same location but more dispersion occurs at x_1 than at x_2. Then perhaps P(Y ≤ j | x_1) > P(Y ≤ j | x_2) for small j but P(Y ≤ j | x_1) < P(Y ≤ j | x_2) for large j. In other words, at x_1 the responses concentrate more at the extreme categories than at x_2. Figure 7.7b illustrates for underlying continuous distributions.

FIGURE 7.7 (a) Distribution 1 stochastically higher than distribution 2; (b) distributions not stochastically ordered.

A cumulative link model that incorporates dispersion effects is

G^{−1}[P(Y ≤ j | x)] = (α_j + β′x) / exp(γ′x).   (7.8)

(Again, one can replace + by − to more closely mimic a location-scale family for an underlying continuous variable.) The denominator contains scale parameters γ that describe the dispersion's dependence on x. The ordinary model (7.7) is the special case γ = 0. Otherwise, the cumulative probabilities tend to shrink toward each other when γ′x > 0. This creates higher probabilities in the end categories and overall greater dispersion. The cumulative probabilities tend to move apart (creating less dispersion) when γ′x < 0.
To illustrate, we use this model to compare two groups on an ordinal scale. Suppose that x is a dummy variable with x = 1 for the first group. With cumulative logits, model (7.8) is

logit[P(Y ≤ j)] = α_j,   x = 0,
logit[P(Y ≤ j)] = (α_j + β)/exp(γ),   x = 1.

The case γ = 0 is the usual model, in which β is a location shift that determines a common cumulative log odds ratio for all 2 × 2 collapsings of the 2 × J table. When γ ≠ 0, the difference between the logits for the two groups, and hence the cumulative odds ratio, varies with j. When γ > 0, responses at x = 1 tend to be more disperse than at x = 0. See Cox (1995) and McCullagh (1980) for model fitting and examples.
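A sketch of how γ > 0 in (7.8) shifts mass toward the extreme categories, using hypothetical cutpoints for this two-group comparison:

```python
import math

def cum_probs(alphas, beta, gamma, x):
    """P(Y <= j) under dispersion model (7.8) with cumulative logits, binary x."""
    out = []
    for a in alphas:
        logit = (a + beta * x) / math.exp(gamma * x)
        out.append(math.exp(logit) / (1 + math.exp(logit)))
    return out

alphas = [-2.0, 0.0, 2.0]                           # hypothetical cutpoints, J = 4
g0 = cum_probs(alphas, beta=0.0, gamma=0.0, x=0)    # reference group
g1 = cum_probs(alphas, beta=0.0, gamma=0.7, x=1)    # gamma > 0 shrinks the logits
print([round(p, 3) for p in g0])
print([round(p, 3) for p in g1])
# End categories gain probability in group 1: both P(Y = 1) and P(Y = 4) increase.
print(g1[0] > g0[0], (1 - g1[2]) > (1 - g0[2]))
```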
7.4 ALTERNATIVE MODELS FOR ORDINAL RESPONSES*
Models for ordinal responses need not use cumulative probabilities. In this section we discuss alternative logit models and a simpler model that resembles ordinary regression.
7.4.1 Adjacent-Categories Logits
The adjacent-categories logits are

logit[P(Y = j | Y = j or j + 1)] = log(π_j / π_{j+1}),   j = 1, ..., J − 1.   (7.9)

These logits are a basic set equivalent to the baseline-category logits. The connections are

log(π_j/π_J) = log(π_j/π_{j+1}) + log(π_{j+1}/π_{j+2}) + ··· + log(π_{J−1}/π_J),

and

log(π_j/π_{j+1}) = log(π_j/π_J) − log(π_{j+1}/π_J),   j = 1, ..., J − 1.   (7.10)

Either set determines logits for all J(J − 1)/2 pairs of response categories.
Models using adjacent-categories logits can be expressed as baseline-category logit models. For instance, consider the adjacent-categories logit model

log[π_j(x)/π_{j+1}(x)] = α_j + β′x,   j = 1, ..., J − 1,   (7.11)

with common effect β. From adding (J − j) terms as in (7.10), the equivalent baseline-category logit model is

log[π_j(x)/π_J(x)] = Σ_{k=j}^{J−1} α_k + β′(J − j)x,   j = 1, ..., J − 1
                   = α_j* + β′u_j,   j = 1, ..., J − 1,

with u_j = (J − j)x. The adjacent-categories logit model corresponds to a baseline-category logit model with adjusted model matrix but also a single parameter for each predictor. With some software one can fit model (7.11) by fitting the equivalent baseline-category logit model.
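The telescoping identity behind this equivalence is easy to verify numerically; the parameter values here are hypothetical:

```python
# Adjacent-categories model: log(pi_j/pi_{j+1}) = alpha_j + beta*x (hypothetical values).
alpha = [0.5, -0.2, 0.3]   # j = 1, 2, 3 (so J = 4)
beta = 0.4

def baseline_logit(j, x):
    """log(pi_j/pi_J) by summing adjacent-categories logits, as in (7.10)."""
    return sum(alpha[k] + beta * x for k in range(j - 1, len(alpha)))

# Closed form from the text: sum of alphas from j to J-1, plus beta*(J - j)*x
J, x = 4, 2.0
for j in range(1, J):
    direct = baseline_logit(j, x)
    closed = sum(alpha[j - 1:]) + beta * (J - j) * x
    assert abs(direct - closed) < 1e-12
    print(j, round(direct, 3))
```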
The construction of the adjacent-categories logits recognizes the ordering of Y categories. To benefit from this in model parsimony requires appropriate specification of the linear predictor. For instance, if an explanatory variable has a similar effect for each logit, advantages accrue from having a single parameter instead of (J − 1) parameters describing that effect. When used with this proportional odds form, model (7.11) with adjacent-categories logits fits well in similar situations as model (7.5) with cumulative logits. They both imply stochastically ordered distributions for Y at different predictor values.
The choice of model should depend less on goodness of fit than on whether one prefers effects to refer to individual response categories, as the adjacent-categories logits provide, or instead to groupings of categories using the entire scale or an underlying latent variable, which cumulative logits provide. Since effects in cumulative logit models refer to the entire scale, they are usually larger. The ratio of estimate to standard error, however, is usually similar for the two model types. An advantage of the cumulative logit model is the approximate invariance of effect estimates to the choice and number of response categories. This does not happen with the adjacent-categories logits.
7.4.2 Job Satisfaction Example
Table 7.8 refers to the relationship between job satisfaction (Y) and income, stratified by gender, for black Americans. For simplicity, we use income scores (1, 2, 3, 4). For income x and gender g (1 = female, 0 = male), consider the model

log(π_j/π_{j+1}) = α_j + β_1 x + β_2 g,   j = 1, 2, 3.
TABLE 7.8 Job Satisfaction and Income, Controlling for Gender

                                        Job Satisfaction
         Income           Very          A Little     Moderately   Very
Gender   (dollars)        Dissatisfied  Satisfied    Satisfied    Satisfied
Female   < 5000                1            3           11            2
         5000-15,000           2            3           17            3
         15,000-25,000         0            1            8            5
         > 25,000              0            2            4            2
Male     < 5000                1            1            2            1
         5000-15,000           0            3            5            1
         15,000-25,000         0            0            7            3
         > 25,000              0            1            9            6

Source: 1991 General Social Survey, National Opinion Research Center.
It describes the odds of being very dissatisfied instead of a little satisfied, a little instead of moderately satisfied, and moderately instead of very satisfied. This model is equivalent to the baseline-category logit model

log(π_j/π_4) = α_j* + β_1(4 − j)x + β_2(4 − j)g,   j = 1, 2, 3.

The value of the first predictor in this model is set equal to 3x in the equation for log(π_1/π_4), 2x in the equation for log(π_2/π_4), and x in the equation for log(π_3/π_4). Some software (e.g., PROC CATMOD in SAS; see Table A.12) allows one to enter a row of a model matrix for each baseline-category logit at a given setting of predictors. Then, after fitting the baseline-category logit model that constrains the effects to be the same for each logit, the estimated regression parameters are the ML estimates of parameters for the adjacent-categories logit model. The ML fit gives β̂_1 = −0.389 (SE = 0.155) and β̂_2 = 0.045 (SE = 0.314). For this parameterization, β̂_1 < 0 means the odds of lower job satisfaction decrease as income increases. Given gender, the estimated odds of response in the lower of two adjacent categories multiplies by exp(−0.389) = 0.68 for each category increase in income. The model describes 24 logits (three for each income × gender combination) with five parameters. Its deviance G² = 12.6 with df = 19. This model with a linear trend for the income effect and a lack of interaction between income and gender seems adequate.
Similar substantive results occur with a cumulative logit model. Its deviance G² = 13.3 with df = 19. The income effect is larger (β̂_1 = −0.51, SE = 0.20), since it refers to the entire response scale rather than adjacent categories. However, significance is similar, with β̂_1/SE ≈ −2.5 for each model.
7.4.3 Continuation-Ratio Logits
Continuation-ratio logits are defined as

log[π_j / (π_{j+1} + ··· + π_J)],   j = 1, ..., J − 1   (7.12)

or as

log[π_{j+1} / (π_1 + ··· + π_j)],   j = 1, ..., J − 1.   (7.13)
The continuation-ratio logit model form is useful when a sequential mechanism, such as survival through various age periods, determines the response outcome (e.g., Tutz 1991). Let ω_j = P(Y = j | Y ≥ j). With explanatory variables,

ω_j(x) = π_j(x) / [π_j(x) + ··· + π_J(x)],   j = 1, ..., J − 1.   (7.14)

The continuation-ratio logits (7.12) are ordinary logits of these conditional probabilities: namely, log{ω_j(x)/[1 − ω_j(x)]}.
At the ith setting x_i of x, let {y_ij, j = 1, ..., J} denote the response counts, with n_i = Σ_j y_ij. When n_i = 1, y_ij indicates whether the response is in category j, as in Section 7.1.4. Let b(n, y; ω) denote the binomial probability of y successes in n trials with parameter ω for each trial. By expressing the multinomial probability of (y_i1, ..., y_iJ) in the form p(y_i1)p(y_i2 | y_i1) ··· p(y_iJ | y_i1, ..., y_i,J−1), one can show that the multinomial mass function has factorization

b[n_i, y_i1; ω_1(x_i)] b[n_i − y_i1, y_i2; ω_2(x_i)] ···
   b[n_i − y_i1 − ··· − y_i,J−2, y_i,J−1; ω_{J−1}(x_i)].   (7.15)

The full likelihood is the product of multinomial mass functions from the different x_i values. Thus, the log likelihood is a sum of terms such that different ω_j enter into different terms. When parameters in the model specification for logit(ω_j) are distinct from those for logit(ω_k) whenever j ≠ k, maximizing each term separately maximizes the full log likelihood. Thus, separate fitting of models for different continuation-ratio logits gives the same results as simultaneous fitting. The sum of the J − 1 separate G² statistics provides an overall goodness-of-fit statistic pertaining to the simultaneous fitting of J − 1 models.
Because these logits refer to a binary response in which one category combines levels of the original scale, separate fitting can use methods for binary logit models. Similar remarks apply to continuation-ratio logits (7.13), although those logits and the subsequent analysis do not give equivalent results. Sometimes, simpler models with the same effects for each logit are plausible (McCullagh and Nelder 1989, p. 164; Tutz 1991).
7.4.4 Developmental Toxicity Study with Pregnant Mice
We illustrate continuation-ratio logits using Table 7.9 from a developmental toxicity study. Such experiments with rodents test substances posing potential danger to developing fetuses. Diethylene glycol dimethyl ether (diEGdiME), one such substance, is an industrial solvent used in the manufacture of protective coatings such as lacquers and metal coatings.
This study administered diEGdiME in distilled water to pregnant mice. Each mouse was exposed to one of five concentration levels for 10 days early in the pregnancy. The mice exposed to level 0 formed a control group. Two days later, the uterine contents of the pregnant mice were examined for defects. Each fetus had three possible outcomes (nonlive, malformation, normal). The outcomes are ordered, with nonlive the least desirable result. We use continuation-ratio logits to model (1) the probability π_1 of a nonlive fetus, and (2) the conditional probability π_2/(π_2 + π_3) of a malformed fetus, given that the fetus was live.
We fitted the continuation-ratio logit models

log[π_1(x_i) / (π_2(x_i) + π_3(x_i))] = α_1 + β_1 x_i,
log[π_2(x_i) / π_3(x_i)] = α_2 + β_2 x_i,

using x_i scores {0, 62.5, 125, 250, 500} for concentration level. The ML estimates are β̂_1 = 0.0064 (SE = 0.0004) and β̂_2 = 0.0174 (SE = 0.0012). In each case, the less desirable outcome is more likely as the concentration increases. For instance, given that a fetus was live, the estimated odds that it was malformed rather than normal multiplies by exp(1.74) = 5.7 for every 100-unit increase in the concentration of diEGdiME. The likelihood-ratio fit
TABLE 7.9 Outcomes for Pregnant Mice in Developmental Toxicity Study^a

Concentration                       Response
(mg/kg per day)      Nonlive     Malformation     Normal
0 (controls)            15             1            281
62.5                    17             0            225
125                     22             7            283
250                     38            59            202
500                    144           132              9

^a Based on results in C. J. Price et al., Fund. Appl. Toxicol. 8:115-126 (1987). I thank Louise Ryan for showing me these data.
statistics are G² = 5.78 for j = 1 and G² = 6.06 for j = 2, each based on
df = 3. Their sum, G² = 11.84 (or similarly X² = 9.76), with df = 6,
summarizes the fit.
This analysis treats pregnancy outcomes for different fetuses as
independent, identical observations. In fact, each pregnant mouse had a litter of
fetuses, and statistical dependence may exist among different fetuses in the
same litter. Different litters at a given concentration level may also have
different response probabilities. Heterogeneity of various sorts among the
litters (e.g., due to varying physical characteristics among different pregnant
mice) would cause these probabilities to vary somewhat. Either statistical
dependence or heterogeneous probabilities violates the binomial assumption
and causes overdispersion. At a fixed concentration level, the number of
fetuses in a litter that die may vary among pregnant mice more than if the
counts were independent and identical binomial variates. The total G² shows
some evidence of lack of fit (P = 0.07) but may reflect overdispersion caused
by these factors rather than an inappropriate choice of response curve.
To account for overdispersion, we could adjust standard errors using the
quasi-likelihood approach (Section 4.7). This multiplies standard errors by
√(X²/df) = √(9.76/6) = 1.28. For each logit, strong evidence remains that
β_j > 0. In Chapters 12 and 13 we present other methods that account for the
clustering of fetuses in litters.
7.4.5 Mean Response Models for Ordered Response
We now present a model that resembles ordinary regression for a continuous
response variable. For scores v₁ ≤ v₂ ≤ ⋯ ≤ v_J, let

M(x) = Σ_j v_j π_j(x)

denote the mean response. The model

M(x) = α + β′x   (7.16)

assumes a linear relationship between the mean and the explanatory
variables. With J = 2, it is the linear probability model (Section 4.2.1). With
J > 2, it does not structurally specify the response probabilities but merely
describes the dependence of the mean on x.
Assuming independent multinomial sampling at different x_i, Bhapkar
(1968), Grizzle et al. (1969), and Williams and Grizzle (1972) presented
weighted least squares (WLS) fits for mean response models. The WLS
approach, described in Section 15.1, applies when all explanatory variables
are categorical. The ML approach for maximizing the product multinomial
likelihood applies for categorical or continuous explanatory variables. Haber
(1985) and Lipsitz (1992) presented algorithms for ML fitting of a family,
including mean response models. This is somewhat complex, since the
probabilities in the multinomial likelihood are not direct functions of the
parameters in (7.16). Specialized software is available (see Appendix A).
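The WLS idea can be illustrated without specialized software by regressing the sample mean responses on the predictor, weighting each mean inversely by its estimated variance. With one mean per independent multinomial sample, the means are uncorrelated, so a diagonal weight matrix suffices. The scores and counts below are hypothetical, not data from the text.

```python
import numpy as np

# Response scores v_j and multinomial counts at three settings of x
# (illustrative data, not from the text)
v = np.array([1.0, 2.0, 3.0, 4.0])
counts = np.array([[10, 20, 30, 15],
                   [5, 15, 35, 25],
                   [3, 10, 30, 40]])
x = np.array([1.0, 2.0, 3.0])

n = counts.sum(axis=1)
p = counts / n[:, None]              # sample proportions at each setting
m = p @ v                            # sample mean responses M-hat(x_i)
var_m = (p @ v**2 - m**2) / n        # estimated variance of each sample mean

# Weighted least squares fit of M(x) = alpha + beta * x
W = np.diag(1.0 / var_m)
X = np.column_stack([np.ones_like(x), x])
alpha_hat, beta_hat = np.linalg.solve(X.T @ W @ X, X.T @ W @ m)
print(alpha_hat, beta_hat)
```

The full WLS machinery of Section 15.1 extends this idea to several response functions per multinomial, where the covariances no longer vanish.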
7.4.6 Job Satisfaction Example Revisited
We illustrate for Table 7.8, modeling the mean of Y = job satisfaction using
income x and gender g (1 = females, 0 = males). For simplicity, we use job
satisfaction scores and income scores (1, 2, 3, 4). The model has ML fit

M̂ = 2.59 + 0.181x − 0.030g,

with SE = 0.069 for income and 0.145 for gender. Given gender, the
estimated increase in mean job satisfaction is about 0.2 response category for
each category increase of income. Although the evidence of a positive effect
is strong [e.g., Wald statistic (0.181/0.069)² = 6.8, df = 1, P = 0.009],
the strength of the effect is weak. Job satisfaction at the highest income level
is estimated to average about half a category higher than at the lowest
income level, since 3(0.181) = 0.54. Similar results occur with the WLS
solution, for which the estimated income effect of 0.182 has SE = 0.068.
(Table A.12 shows the use of CATMOD in SAS.)
The deviance for testing the model fit equals 5.1. Since means occur at
eight income × gender settings and the model has three parameters, residual
df = 5. The fit seems adequate.
7.4.7 Advantages and Disadvantages of Mean Response Models
Treating ordinal variables in a quantitative manner is sensible if their
categorical nature reflects crude measurement of an inherently continuous
variable. Mean response models have the advantage of closely resembling
ordinary regression.
With J = 2, in Section 4.2.1 we noted that linear probability models have
a structural difficulty because of the restriction of probabilities to (0, 1). A
similar difficulty occurs here, since a linear model can have predicted means
outside the range of assigned scores. This happens less frequently when J is
large and reasonable dispersion of responses occurs throughout the domain
of interest for the explanatory variables. The notion of an underlying latent
variable makes more sense for an ordinal variable than for a strictly binary
response, so this difficulty has less relevance here.
Unlike logit models, mean response models do not uniquely determine cell
probabilities. Thus, mean response models do not specify structural aspects
such as stochastic orderings. These models do not represent the categorical
response structure as fully as do models for probabilities, and conditions such
as independence do not occur as special cases. However, they provide
simpler descriptions than odds ratios or summaries from cumulative link
models. As J increases, they also interface with ordinary regression models.
For large J, they are a simple mechanism for approximating results for a
regression model we would use if we could measure Y continuously.
7.5 TESTING CONDITIONAL INDEPENDENCE IN I × J × K TABLES*
In Section 6.3.2 we introduced the Cochran–Mantel–Haenszel (CMH) test of
conditional independence for 2 × 2 × K tables. This section presents related
tests with multicategory responses for I × J × K tables. Likelihood-ratio
tests compare the fit of a model specifying X–Y conditional independence
with a model having dependence. Alternatively, generalizations of the CMH
statistic are score statistics for certain models.
7.5.1 Using Multinomial Models to Test Conditional Independence
Treating Z as a nominal control factor, we discuss four cases with (Y, X) as
(ordinal, ordinal), (ordinal, nominal), (nominal, ordinal), and (nominal, nominal).
For ordinal Y we use cumulative logit models, but other ordinal links yield
analogous tests. As we noted in Section 6.3.2, when the X–Y association is
similar in the partial tables, the power benefits from basing a test statistic on
a model of homogeneous association.
1. Y ordinal, X ordinal. Let {x_i} be ordered scores. The model

logit P(Y ≤ j | X = i, Z = k) = α_j + βx_i + β_k^Z   (7.17)

has the same linear trend for the X effect in each partial table. For it,
X–Y conditional independence is H₀: β = 0. Likelihood-ratio, score, or
Wald statistics for H₀ provide large-sample chi-squared tests with
df = 1 that are sensitive to the trend alternative.
2. Y ordinal, X nominal. An alternative to conditional independence that
treats X as a factor is

logit P(Y ≤ j | X = i, Z = k) = α_j + τ_i + β_k^Z,

with a constraint such as τ_I = 0. For this model, X–Y conditional
independence is H₀: τ₁ = ⋯ = τ_I. Large-sample chi-squared tests have
df = I − 1.
3. Y nominal, X ordinal. When Y is nominal, analogous tests use
baseline-category logit models. The model of X–Y conditional independence is

log[P(Y = j | X = i, Z = k) / P(Y = J | X = i, Z = k)] = α_jk.   (7.18)
For ordered scores {x_i}, a test that is sensitive to the same linear trend
alternatives in each partial table compares this model to

log[P(Y = j | X = i, Z = k) / P(Y = J | X = i, Z = k)] = α_jk + β_j x_i.

Conditional independence is H₀: β₁ = ⋯ = β_{J−1} = 0. Large-sample
chi-squared tests have df = J − 1.
4. Y nominal, X nominal. An alternative to X–Y conditional independence
that treats X as a factor is

log[P(Y = j | X = i, Z = k) / P(Y = J | X = i, Z = k)] = α_jk + τ_ij   (7.19)

with a constraint such as τ_Ij = 0 for each j. For each j, X and Z have
additive effects of the form α_k + τ_i. Conditional independence is H₀:
τ_1j = ⋯ = τ_Ij for j = 1, …, J − 1. Large-sample chi-squared tests
have df = (I − 1)(J − 1).
Table 7.10 summarizes the four tests. They work well when the model
describes at least a major component of the departure from conditional
independence. This does not mean that one must test the fit of the model to
use the test (see the remarks at the end of Section 6.3.2).
Occasionally, the association may change dramatically across the K partial
tables. When Z is ordinal, an alternative by which a log odds ratio changes
linearly across levels of Z is sometimes of use. For instance, when Z = age
of subject, the association between a risk factor X (e.g., level of smoking) and
a response Y (e.g., severity of heart disease) may tend to increase with Z.
When Z is nominal, one can test the conditional independence models
TABLE 7.10 Summary of Models for Testing Conditional Independence

Y–X       Model                                      Conditional Independence   df
Ord–Ord   logit[P(Y ≤ j)] = α_j + βx_i + β_k^Z       β = 0                      1
Ord–Nom   logit[P(Y ≤ j)] = α_j + τ_i + β_k^Z        τ₁ = ⋯ = τ_I               I − 1
Nom–Ord   log[P(Y = j)/P(Y = J)] = α_jk + β_j x_i    β₁ = ⋯ = β_{J−1} = 0       J − 1
Nom–Nom   log[P(Y = j)/P(Y = J)] = α_jk + τ_ij       all τ_ij = 0               (I − 1)(J − 1)
against a more general alternative with separate effect parameters at each
level of Z. Allowing effects to vary across levels of Z, however, results in the
test df being multiplied by K, which handicaps power.
7.5.2 Job Satisfaction Example Revisited
We now revisit the job satisfaction data (Table 7.8). Table 7.11 summarizes
the fit of several models. The model treating income as an ordinal predictor
uses scores {3, 10, 20, 35}, approximate midpoints of categories in thousands
of dollars. Each likelihood-ratio test compares a given model to the model
deleting the income effect, controlling for gender.
Testing conditional independence with the cumulative logit model (7.17)
yields likelihood-ratio statistic 19.62 − 13.95 = 5.7 with df = 20 − 19 = 1,
strong evidence of an effect. Models that treat either or both variables as
nominal do not provide such strong evidence. Focusing the test on a linear
trend alternative yields a smaller P-value. However, we learn more from
estimating parameters than from significance tests, as in Sections 7.4.2 and
7.4.6.
7.5.3 Generalized Cochran–Mantel–Haenszel Tests for I × J × K Tables
Birch (1965), Landis et al. (1978), and Mantel and Byar (1978) generalized
the CMH statistic (Section 6.3.2). The tests treat X and Y symmetrically, so
the three cases correspond to treating both as nominal, both as ordinal, or
one of each. Conditional on row and column totals, each stratum has
(I − 1)(J − 1) nonredundant cell counts. Let

n_k = (n_{11k}, n_{12k}, …, n_{1,J−1,k}, …, n_{I−1,J−1,k})′.
TABLE 7.11 Summary of Model-Based Likelihood-Ratio Tests of
Conditional Independence for Table 7.8

Satisfaction   Income         G² Fit   df   Test Statistic   df   P-value
Ordinal        Ordinal        13.95    19        5.7          1    0.017
Ordinal        Nominal        10.51    17        9.1          3    0.028
Ordinal        Not in model   19.62    20         –           –      –
Nominal        Ordinal        11.74    15        7.6          3    0.054
Nominal        Nominal         7.09     9       12.3          9    0.198
Nominal        Not in model   19.37    18         –           –      –
Let μ_k = E(n_k) under H₀: conditional independence, namely

μ_k = (n_{1+k} n_{+1k}, n_{1+k} n_{+2k}, …, n_{I−1,+,k} n_{+,J−1,k})′ / n_{++k}.

Let V_k denote the null covariance matrix of n_k, where

cov(n_{ijk}, n_{i′j′k}) = n_{i+k}(δ_{ii′} n_{++k} − n_{i′+k}) n_{+jk}(δ_{jj′} n_{++k} − n_{+j′k}) / [n²_{++k}(n_{++k} − 1)],

with δ_{ab} = 1 when a = b and δ_{ab} = 0 otherwise.
The most general statistic treats rows and columns as unordered. Summing
over the K strata, let

n = Σ_k n_k,   μ = Σ_k μ_k,   V = Σ_k V_k.

The generalized CMH statistic for nominal X and Y is

CMH = (n − μ)′ V⁻¹ (n − μ).   (7.20)
Its large-sample chi-squared distribution has df = (I − 1)(J − 1). The df
value equals that for the statistics comparing logit models (7.18) and (7.19).
Both statistics are sensitive to detecting a conditional association that is
similar in each stratum. For K = 1 stratum with n observations, CMH =
[(n − 1)/n] X², where X² is the Pearson statistic (3.10).
Mantel (1963) introduced a generalized statistic for ordinal X and Y.
Using ordered scores {u_i} and {v_j}, it is sensitive to a correlation of common
sign in each stratum. Evidence of a positive trend occurs if in each stratum
T_k = Σ_i Σ_j u_i v_j n_{ijk} exceeds its null expectation. Given the marginal totals in
each stratum, under conditional independence

E(T_k) = (Σ_i u_i n_{i+k})(Σ_j v_j n_{+jk}) / n_{++k},

var(T_k) = [1/(n_{++k} − 1)] [Σ_i u_i² n_{i+k} − (Σ_i u_i n_{i+k})² / n_{++k}]
             × [Σ_j v_j² n_{+jk} − (Σ_j v_j n_{+jk})² / n_{++k}].
The statistic [T_k − E(T_k)] / [var(T_k)]^{1/2} equals the correlation between X and
Y in stratum k multiplied by √(n_{++k} − 1). To summarize across the K strata,
Mantel (1963) proposed

M² = [Σ_k {T_k − E(T_k)}]² / Σ_k var(T_k),   (7.21)

with T_k = Σ_i Σ_j u_i v_j n_{ijk} as above. This has an approximate χ²₁ null
distribution, the same as for testing H₀: β = 0 in ordinal model (7.17). For
K = 1, this is the M² statistic (3.15).
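Statistic (7.21) is likewise a few lines of array arithmetic. A sketch (assuming NumPy; the function name is ours):

```python
import numpy as np

def mantel_m2(tables, u, v):
    """Mantel's correlation statistic M^2 (7.21) for row scores u and
    column scores v across a list of K stratum tables."""
    num = 0.0
    den = 0.0
    for t in tables:
        n = t.sum()
        r = t.sum(axis=1)                  # row margins
        c = t.sum(axis=0)                  # column margins
        T = u @ t @ v                      # sum_ij u_i v_j n_ijk
        ET = (u @ r) * (v @ c) / n
        varT = ((u**2 @ r - (u @ r)**2 / n)
                * (v**2 @ c - (v @ c)**2 / n)) / (n - 1)
        num += T - ET
        den += varT
    return num**2 / den
```

For a single stratum this reduces to (n − 1)r², where r is the sample correlation between the row and column scores.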
Landis et al. (1978) presented a statistic that has (7.20) and (7.21) as
special cases. Their statistic also can treat X as nominal and Y as ordinal,
summarizing information about how I row means compare to their null
expected values, with df = I − 1 (see Note 7.7).
7.5.4 Job Satisfaction Example Revisited
Table 7.12 shows output from conducting generalized CMH tests for Table
7.8. Statistics treating a variable as ordinal used scores {3, 10, 20, 35} for
income and scores {1, 3, 4, 5} for job satisfaction. (Table A.12 shows the use of
PROC FREQ in SAS, but with different scores.)
The general association alternative treats X and Y as nominal and uses
(7.20). It is sensitive to any association that is similar in each level of Z. The
row mean scores differ alternative treats rows as nominal and columns as
ordinal. It is sensitive to variation among the I row mean scores on Y, when
that variation is similar in each level of Z. Finally, the nonzero correlation
alternative treats X and Y as ordinal and uses (7.21). It is sensitive to a
similar linear trend in each level of Z. As in the model-based analyses that
Table 7.11 summarized, the evidence is stronger using the df = 1 ordinal test.
7.5.5 Related Score Tests for Multinomial Logit Models
The generalized CMH tests seem to be non-model-based alternatives to
those of Section 7.5.1 using multinomial logit models. However, a close
connection exists between them. For various multinomial logit models, the
generalized CMH tests are score tests.
TABLE 7.12 Output for Generalized Cochran–Mantel–Haenszel Tests
with Job Satisfaction and Income Data

Summary Statistics for income by satisf
Controlling for gender

Cochran-Mantel-Haenszel Statistics (Based on Table Scores)

Statistic   Alternative Hypothesis    DF     Value     Prob
    1       Nonzero Correlation        1     6.1563    0.0131
    2       Row Mean Scores Differ     3     9.0342    0.0288
    3       General Association        9    10.2001    0.3345
The generalized CMH test (7.20) that treats X and Y as nominal is the
score test that the (I − 1)(J − 1) {τ_ij} parameters in logit model (7.19) equal
0. The generalized CMH test using M² that treats X and Y as ordinal is the
score test of β = 0 in model (7.17). For the cumulative logit model, the
equivalence has the same {x_i} scores in the model as in M², and the {v_j}
scores in M² are average rank scores. For the adjacent-categories logit
model analog of (7.17), the {v_j} scores in M² are any equally spaced scores.
With large samples in each stratum, the generalized CMH tests give
similar results to likelihood-ratio tests comparing the relevant models. An
advantage of the model-based approach is that it provides estimates of effects. An
advantage of the generalized CMH tests is that they maintain good performance
under sparse asymptotics whereby K grows as n does. Remarks in Section
6.3.4 apply here also.
7.5.6 Exact Tests of Conditional Independence
In principle, exact tests of conditional independence can use the generalized
CMH statistics, generalizing Section 6.7.5 for 2 × 2 × K tables. To eliminate
nuisance parameters, one conditions on row and column totals in each
stratum. The distribution of counts in each stratum is the multiple
hypergeometric (Section 3.5.7), and this propagates an exact conditional distribution
for the statistic of interest. The P-value is the probability of those tables
having the same strata margins as observed but test statistic at least as large
as observed (see Birch 1965; Kim and Agresti 1997; Mehta et al. 1988).
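A Monte Carlo version of such an exact test is easy to sketch: permuting the column classification against the row classification separately within each stratum fixes both sets of margins, so the permuted tables follow the multiple hypergeometric null. The sketch below (function name and setup are ours, assuming NumPy) uses the ordinal statistic T = Σ_k Σ_i Σ_j u_i v_j n_{ijk} and, for simplicity, a one-sided alternative of positive trend; the exact methods cited above instead enumerate or sample the conditional distribution of a CMH-type statistic.

```python
import numpy as np

def mc_exact_trend_pvalue(tables, u, v, n_sim=10000, seed=0):
    """Monte Carlo approximation of the exact conditional P-value for a
    positive trend, conditioning on row and column margins in each
    stratum via within-stratum permutation of column scores."""
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    for t in tables:
        rows, cols = np.indices(t.shape)
        # expand each stratum into one (row score, column score) pair
        # per subject; shuffling ys within a stratum fixes both margins
        xs.append(np.repeat(u[rows.ravel()], t.ravel()))
        ys.append(np.repeat(v[cols.ravel()], t.ravel()))
    t_obs = sum(x @ y for x, y in zip(xs, ys))
    hits = 0
    for _ in range(n_sim):
        t_sim = sum(x @ rng.permutation(y) for x, y in zip(xs, ys))
        hits += t_sim >= t_obs
    return hits / n_sim
```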
7.6 DISCRETE-CHOICE MULTINOMIAL LOGIT MODELS*
An important application of multinomial logit models is determining effects
of explanatory variables on a subject's choice from a discrete set of
options: for instance, the choice of transportation system to take to work
(drive, bus, subway, walk, bicycle), housing (buy house, buy condominium,
rent), primary shopping location (downtown, mall, catalogs, Internet), or
product brand. Models for response variables consisting of a discrete set of
choices are called discrete-choice models.
7.6.1 Discrete-Choice Modeling
In many discrete-choice applications, an explanatory variable takes different
values for different response choices. As predictors of choice of transportation system, cost and time to reach destination take different values for each
option. As a predictor of choice of product brand, price varies according to
the option. Explanatory variables of this type are characteristics of the choices.
They differ from the usual ones, for which values remain constant across the
choice set. Such variables, characteristics of the chooser, include income,
education, and other demographic characteristics.
McFadden (1974) proposed a discrete-choice model for explanatory
variables that are characteristics of the choices. His model also permits the
choice set to vary among subjects. For instance, some subjects may not have
the subway as an option for travel to work. For subject i and response choice
j, let x_ij = (x_{ij1}, …, x_{ijp})′ denote the values of the p explanatory variables,
and let x_i = (x_{i1}, …, x_{iJ}). Conditional on the choice set C_i for subject i, the
model for the probability of selecting option j is

π_j(x_i) = exp(β′x_ij) / Σ_{h ∈ C_i} exp(β′x_ih).   (7.22)

For each pair of choices a and b, this model has the logit form

log[π_a(x_i) / π_b(x_i)] = β′(x_ia − x_ib).   (7.23)
Conditional on the choice being a or b, a variable's influence depends on the
distance between the subject's values of that variable for those choices. If the
values are the same, the model asserts that the variable has no influence on
the choice between a and b. Reflecting this property, McFadden originally
referred to model (7.22) as a conditional logit model.
From (7.23), the odds of choosing a over b do not depend on the other
alternatives in the choice set or on their values of the explanatory variables.
Luce (1959) called this property independence from irrelevant alternatives. It is
unrealistic in some applications. For instance, for travel options auto and red
bus, suppose that 80% choose auto, an odds of 4.0. Now suppose that the
options are auto, red bus, and blue bus. According to (7.23), the odds are still
4.0 of choosing auto instead of red bus, but intuitively, we expect them to be
about 8.0 (10% choosing each bus option). McFadden (1974) stated: "Application
of the model should be limited to situations where the alternatives can
plausibly be assumed to be distinct and weighed independently in the eyes of
each decision-maker."
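The red bus-blue bus property can be seen numerically from (7.22): the model is a softmax over the choice set, and the auto vs. red-bus odds do not move when an identical blue-bus option is added. The covariates and coefficients below are hypothetical.

```python
import numpy as np

def choice_probs(beta, x_choices):
    """Discrete-choice probabilities (7.22): softmax of beta'x_ij over
    the rows of x_choices (one row of covariates per available option)."""
    eta = x_choices @ beta
    eta = eta - eta.max()          # stabilize the exponentials
    p = np.exp(eta)
    return p / p.sum()

beta = np.array([-0.5, -0.1])      # hypothetical cost and time effects
x = np.array([[2.0, 30.0],         # auto: (cost, time)
              [1.0, 40.0],         # red bus
              [1.0, 40.0]])        # blue bus, identical to red bus

p3 = choice_probs(beta, x)         # full choice set
p2 = choice_probs(beta, x[:2])     # choice set without the blue bus
print(p3[0] / p3[1], p2[0] / p2[1])  # equal: independence from
                                     # irrelevant alternatives
```

The pairwise odds also match (7.23) directly: p2[0]/p2[1] equals exp(β′(x_auto − x_redbus)).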
7.6.2 Discrete-Choice and Multinomial Logit Models
Model (7.22) can also incorporate explanatory variables that are
characteristics of the chooser. This may seem surprising, since (7.22) has a single
parameter for each explanatory variable; that is, the parameter vector is the
same for each pair of choices. However, multinomial logit model (7.2) has
discrete-choice form (7.22) after replacing such an explanatory variable by J
artificial variables; the jth is the product of the explanatory variable with a
dummy variable that equals 1 when the response choice is j. For instance, for
a single explanatory variable, let x_i denote its value for subject i. For
j = 1, …, J, let δ_jk equal 1 when k = j and 0 otherwise, and let

z_ij = (δ_j1, …, δ_jJ, δ_j1 x_i, …, δ_jJ x_i)′.
Let β = (α₁, …, α_J, β₁, …, β_J)′. Then β′z_ij = α_j + β_j x_i, and (7.2) is (with
α_J = β_J = 0 for identifiability)

π_j(x_i) = exp(α_j + β_j x_i) / [exp(α₁ + β₁x_i) + ⋯ + exp(α_J + β_J x_i)]
         = exp(β′z_ij) / [exp(β′z_i1) + ⋯ + exp(β′z_iJ)].

This has form (7.22).
With this approach, discrete-choice models can contain characteristics of
the chooser and the choices. Thus, model (7.22) is very general. The ordinary
multinomial logit model (7.2) using baseline-category logits is a special case.
7.6.3 Shopping Choice Example
McFadden (1974) used multinomial logit models to describe how residents of
Pittsburgh, Pennsylvania chose a shopping destination. The five possible
destinations were different city zones. One explanatory variable measured
shopping opportunities, defined to be the retail employment in the zone as a
percentage of total retail employment in the region. The other explanatory
variable was price of the trip, defined from a separate analysis using auto
in-vehicle time and auto operating cost.
The ML estimates of model parameters were −1.06 (SE = 0.28) for price
of trip and 0.84 (SE = 0.23) for shopping opportunity. From (7.23),

log(π̂_a / π̂_b) = −1.06(P_a − P_b) + 0.84(S_a − S_b),

where P = price and S = shopping opportunity. Not surprisingly, a destination
is relatively more attractive as the trip price decreases and as the
shopping opportunity increases. Given values of P and S for each destination,
the sample analog of (7.22) provides estimated probabilities of choosing
each destination.
NOTES
Section 7.1: Nominal Responses: Baseline-Category Logit Models
7.1. Multicategory models derive from latent variable constructions that generalize those for
binary responses. One approach uses the principle of selecting the category having
maximum utility (Problem 6.29). Fahrmeir and Tutz (2001, Chap. 3) gave discussion and
references. Baseline-category logit models were developed in Bock (1970), Haberman
(1974a, pp. 352–373), Mantel (1966), Nerlove and Press (1973), and Theil (1969, 1970).
Lesaffre and Albert (1989) presented regression diagnostics. Amemiya (1981), Haberman
(1982), and Theil (1970) presented R-squared measures.
Section 7.2: Ordinal Responses: Cumulative Logit Models
7.2. Early uses of cumulative logit models include Bock and Jones (1968), Simon (1974), Snell
(1964), Walker and Duncan (1967), and Williams and Grizzle (1972). McCullagh (1980)
popularized the proportional odds case. Later articles include Agresti and Lang (1993a),
Hastie and Tibshirani (1987), Peterson and Harrell (1990), and Tutz (1989). See also
Section 11.3.3, Note 11.3, and Section 12.4.1. McCullagh and Nelder (1989, Sec. 5.6)
suggested using cumulative totals in forming residuals.
7.3. McCullagh (1980) noted that score tests for model (7.5) are equivalent to nonparametric
tests using average ranks. For instance, for 2 × J tables assume that logit[P(Y ≤ j)] =
α_j + βx, with x an indicator. The score test of H₀: β = 0 is equivalent to a discrete
version of the Wilcoxon–Mann–Whitney test. Whitehead (1993) gave sample size
formulas for this case. The sample size n_J needed for a certain power decreases as J
increases: When response categories have equal probabilities, n_J ≈ 0.75 n₂/(1 − 1/J²).
Thus, for large J, n_J ≈ 0.75 n₂, and 1 − 1/J² is a type of efficiency measure of using J
categories instead of a continuous response. The efficiency loss is minor with J ≈ 5, but
major in collapsing to J = 2. Edwardes (1997) innovatively adapted the test by treating
the cutpoints as random. This relates to random effects models of Section 12.4.1.
Section 7.3: Ordinal Responses: Cumulative Link Models
7.4. Aitchison and Silvey (1957) and Bock and Jones (1968, Chap. 8) studied cumulative
probit models. Farewell (1982) generalized the complementary log-log model to allow
variation among the sample in the category boundaries for the underlying scale; this
relates to random effects models (Section 12.4). Genter and Farewell (1985) introduced
a generalized link function that permits comparison of fits provided by probit,
complementary log-log, and other links. Yee and Wild (1996) defined generalized additive
models for nominal and ordinal responses. Hamada and Wu (1990) and Nair (1987)
presented alternatives to model (7.8) for detecting dispersion effects.
7.5. Some authors have considered inference relating generally to stochastic ordering; see,
for instance, Dardanoni and Forcina (1998) and survey articles in a 2002 issue of J.
Statist. Plann. Inference (Vol. 107, Nos. 1–2).
Section 7.4: Alternative Models for Ordinal Responses
7.6. The ratio of a pdf to the complement of the cdf is the hazard function (Section 9.7.3).
For discrete variables, this is the ratio found in continuation-ratio logits. Hence,
continuation-ratio logits are sometimes interpreted as log hazards. Thompson (1977)
used them in modeling discrete survival-time data. When lengths of time intervals
approach 0, his model converges to the Cox proportional hazards model. Other
applications of continuation-ratio logits include Läärä and Matthews (1985) and Tutz (1991).
Section 7.5: Testing Conditional Independence in I × J × K Tables
7.7. Let B_k = u_k ⊗ v_k denote a matrix of constants based on row scores u_k and column
scores v_k for stratum k, where ⊗ denotes the Kronecker product. The Landis et al.
(1978) generalized statistic is

L² = [Σ_k B_k(n_k − μ_k)]′ [Σ_k B_k V_k B_k′]⁻¹ [Σ_k B_k(n_k − μ_k)].

When u_k = (u₁, …, u_I) and v_k = (v₁, …, v_J) for all strata, L² = M². When u_k is an
(I − 1) × I matrix (I, −1), where I is an identity matrix of size I − 1 and 1 denotes a
column vector of I − 1 ones, and v_k is the analogous matrix of size (J − 1) × J, L²
simplifies to (7.20) with df = (I − 1)(J − 1). With this u_k and v_k = (v₁, …, v_J), L² sums
over the strata information about how I row means compare to their null expected
values, and it has df = I − 1. Rank score versions are analogs for ordered categorical
responses of strata-adjusted Spearman correlation and Kruskal–Wallis tests. Landis et
al. (1998) and Stokes et al. (2000) reviewed CMH methods. Koch et al. (1982) reviewed
related methods.
Section 7.6: Discrete-Choice Multinomial Logit Models
7.8. McFadden's model relates to models proposed by Bradley and Terry (1952) (see Section
10.6) and Luce (1959). See Train (1986) for a text treatment. McFadden (1982) discussed
hierarchical models having a nesting of choices in a tree-like structure. For other
discussion, see Maddala (1983) and Small (1987). Models that do not assume independence
from irrelevant alternatives result with a probit link (Amemiya 1981) or with the
logit link but including random effects (Brownstone and Train 1999). Methods in Section
12.6 for random effects models are useful for fitting such models. These include Monte
Carlo methods for approximating integrals that determine the likelihood function. See
Stern (1997) for a review.
PROBLEMS
Applications
7.1 For Table 7.13, let Y = belief in life after death, x₁ = gender (1 =
females, 0 = males), and x₂ = race (1 = whites, 0 = blacks). Table
7.14 shows the fit of the model

log(π_j / π₃) = α_j + β_j^G x₁ + β_j^R x₂,   j = 1, 2,

with SE values in parentheses.
TABLE 7.13 Data for Problem 7.1

                       Belief in Afterlife
Race    Gender    Yes   Undecided   No
White   Female    371      49       74
White   Male      250      45       71
Black   Female     64       9       15
Black   Male       25       5       13

Source: 1991 General Social Survey, National Opinion
Research Center.
TABLE 7.14 Fit of Model for Problem 7.1

                 Belief Categories for Logit
Parameter      Yes/No            Undecided/No
Intercept      0.883 (0.243)     −0.758 (0.361)
Gender         0.419 (0.171)      0.105 (0.246)
Race           0.342 (0.237)      0.271 (0.354)
a. Find the prediction equation for log(π₁/π₂).
b. Using the yes and no response categories, interpret the conditional
gender effect using a 95% confidence interval for an odds ratio.
c. Show that for white females, π̂₁ = P̂(Y = yes) = 0.76.
d. Without calculating estimated probabilities, explain why the
intercept estimates indicate that for black males π̂₁ > π̂₃ > π̂₂. Use
the intercept and gender estimates to show that the same ordering
applies for black females.
e. Without calculating estimated probabilities, explain why the
estimates in the gender and race rows indicate that π̂₃ is highest for
black males.
f. For this fit, G² = 0.9. Explain why residual df = 2. Deleting the
gender effect, G² = 8.0. Test whether opinion is independent of
gender, given race. Interpret.
7.2 A model fit predicting preference for U.S. President (Democrat,
Republican, Independent) using x = annual income (in $10,000) is

log(π̂_D / π̂_I) = 3.3 − 0.2x   and   log(π̂_R / π̂_I) = 1.0 + 0.3x.

a. Find the prediction equation for log(π̂_R / π̂_D) and interpret the
slope. For what range of x is π̂_R > π̂_D?
b. Find the prediction equation for π̂_I.
c. Plot π̂_D, π̂_I, and π̂_R for x between 0 and 10, and interpret.
7.3 Table 7.15 refers to the effect on political party identification of
gender and race. Find a baseline-category logit model that fits well.
TABLE 7.15 Data for Problem 7.3

                       Party Identification
Gender   Race    Democrat   Republican   Independent
Male     White     132         176           127
Male     Black      42           6            12
Female   White     172         129           130
Female   Black      56           4            15
Interpret estimated effects on the odds that party identification is
Democrat instead of Republican.
TABLE 7.16 Data for Problem 7.4^a

Males:
1.30 I   1.32 F   1.32 F   1.40 F   1.42 I   1.42 F   1.47 I   1.47 F
1.50 I   1.52 I   1.63 I   1.65 O   1.65 O   1.65 I   1.65 F   1.68 F
1.70 I   1.73 O   1.78 F   1.78 O   1.80 F   1.85 F   1.93 I   1.93 F
1.98 I   2.03 F   2.03 F   2.31 F   2.36 F   2.46 F   3.25 O   3.28 O
3.33 F   3.56 F   3.58 F   3.66 F   3.68 O   3.71 F   3.89 F

Females:
1.24 I   1.30 I   1.45 I   1.45 O   1.55 I   1.60 I   1.60 I   1.65 F
1.78 I   1.78 O   1.80 I   1.88 I   2.16 F   2.26 F   2.31 F   2.36 F
2.39 F   2.41 F   2.44 F   2.56 O   2.67 F   2.72 I   2.79 F   2.84 F

^a Each entry gives length (m) followed by primary food choice:
I, invertebrates; F, fish; O, other.
7.4 For 63 alligators caught in Lake George, Florida, Table 7.16 classifies
primary food choice as (fish, invertebrate, other) and shows length in
meters. Alligators are called subadults if length < 1.83 meters (6 feet)
and adults if length > 1.83 meters.
a. Measuring length as (adult, subadult), find a model that adequately
describes effects of gender and length on food choice. Interpret the
effects. For adult females, find the estimated probabilities of the
food-choice categories.
b. Using only observations for which primary food choice was fish or
invertebrate, find a model that adequately describes effects of
gender and binary length. Compare parameter estimates and
standard errors for this separate-fitting approach to those obtained with
simultaneous fitting, including the other category.
c. Treating length as binary loses information. Adapt the model in
part (a) to use the continuous measurements. Interpret, explaining
how the estimated outcome probabilities vary with length. Find the
estimated length at which the invertebrate and other categories are
equally likely.
7.5  For recent data from a General Social Survey, the cumulative logit model (7.5) with Y = political ideology (very liberal, slightly liberal, moderate, slightly conservative, very conservative) and x = 1 for the 428 Democrats and x = 0 for the 407 Republicans has β̂ = 0.975 (SE = 0.129) and α̂_1 = −2.469. Interpret β̂. Find the estimated probability of a very liberal response for each group.
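The last part of the problem is direct arithmetic on the quoted estimates. A minimal sketch (assuming, as in model (7.5), that logit[P(Y ≤ j)] = α_j + βx, so that very liberal is the first response category):

```python
import math

def cum_logit_prob(alpha1, beta, x):
    # P(Y <= 1): inverse logit of alpha_1 + beta * x
    eta = alpha1 + beta * x
    return 1.0 / (1.0 + math.exp(-eta))

alpha1_hat, beta_hat = -2.469, 0.975             # estimates quoted above
p_dem = cum_logit_prob(alpha1_hat, beta_hat, 1)  # Democrats (x = 1)
p_rep = cum_logit_prob(alpha1_hat, beta_hat, 0)  # Republicans (x = 0)
print(round(p_dem, 3), round(p_rep, 3))          # 0.183 0.078
```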
7.6  Refer to Problem 7.5. With adjacent-categories logits, β̂ = 0.435. Interpret using odds ratios for adjacent categories and for the (very liberal, very conservative) pair of categories.
7.7  Table 7.17 is an expanded version of a data set analyzed in Section 8.4.2. The response categories are (1) not injured, (2) injured but not transported by emergency medical services, (3) injured and transported by emergency medical services but not hospitalized, (4) injured and hospitalized but did not die, and (5) injured and died. Table 7.18 shows output for a model of form (7.5), using dummy variables for predictors.
a. Why are there four intercepts? Explain how they determine the estimated response distribution for males in urban areas wearing seat belts.
b. Construct a confidence interval for the effect of gender, given seat-belt use and location. Interpret.
c. Find the estimated cumulative odds ratio between the response and seat-belt use for those in rural locations and for those in urban locations, given gender. Based on this, explain how the effect of seat-belt use varies by region, and explain how to interpret the interaction estimate, −0.1244.
TABLE 7.17  Data for Problem 7.7

                                           Response
Gender    Location    Seat Belt       1       2      3      4     5
Female    Urban       No           7,287    175    720     91    10
                      Yes         11,587    126    577     48     8
          Rural       No           3,246     73    710    159    31
                      Yes          6,134     94    564     82    17
Male      Urban       No          10,381    136    566     96    14
                      Yes         10,969     83    259     37     1
          Rural       No           6,123    141    710    188    45
                      Yes          6,693     74    353     74    12

Source: Data courtesy of Cristanna Cook, Medical Care Development, Augusta, Maine.
TABLE 7.18  Output for Problem 7.7

Parameter                        DF    Estimate    Std Error
Intercept1                        1      3.3074       0.0351
Intercept2                        1      3.4818       0.0355
Intercept3                        1      5.3494       0.0470
Intercept4                        1      7.2563       0.0914
gender             female         1     −0.5463       0.0272
gender             male           0      0.0000       0.0000
location           rural          1     −0.6988       0.0424
location           urban          0      0.0000       0.0000
seatbelt           no             1     −0.7602       0.0393
seatbelt           yes            0      0.0000       0.0000
location*seatbelt  rural  no      1     −0.1244       0.0548
location*seatbelt  rural  yes     0      0.0000       0.0000
location*seatbelt  urban  no      0      0.0000       0.0000
location*seatbelt  urban  yes     0      0.0000       0.0000
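For part (c) of Problem 7.7, the reported coefficients can be combined directly. A minimal sketch, reading the output's dummy coding with urban and yes as the baseline levels:

```python
import math

beta_no = -0.7602        # seatbelt = no, main effect (urban baseline)
beta_rural_no = -0.1244  # location*seatbelt interaction for (rural, no)

# estimated log cumulative odds ratios (no vs. yes) at each location
log_or_urban = beta_no
log_or_rural = beta_no + beta_rural_no
print(round(math.exp(log_or_urban), 3), round(math.exp(log_or_rural), 3))  # 0.468 0.413
```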
7.8  Refer to the cumulative logit model for Table 7.8.
a. Compare the estimated income effect β̂_1 = −0.510 to the estimate after collapsing the response to three categories by combining categories (i) very satisfied and moderately satisfied, and (ii) very dissatisfied and a little satisfied. What property of the model does this reflect?
b. Compare β̂_1/SE using the full scale to β̂_1/SE for the collapsing in part (a)(i). Usually, a disadvantage of collapsing multinomial responses is that the significance of effects diminishes.
c. Check whether an improved model results from permitting interaction between income and gender. Interpret.
7.9
Table 7.19 refers to a clinical trial for the treatment of small-cell lung
cancer. Patients were randomly assigned to two treatment groups. The
sequential therapy administered the same combination of chemotherapeutic agents in each treatment cycle; the alternating therapy had
three different combinations, alternating from cycle to cycle.
TABLE 7.19  Data for Problem 7.9

                                 Response to Chemotherapy
Therapy        Gender    Progressive    No        Partial      Complete
                         Disease        Change    Remission    Remission
Sequential     Male          28           45          29           26
               Female         4           12           5            2
Alternating    Male          41           44          20           20
               Female        12            7           3            1

Source: W. Holtbrugge and M. Schumacher, Appl. Statist. 40: 249–259 (1991).
a. Fit a cumulative logit model with main effects for treatment and gender. Interpret.
b. Fit the model that also contains an interaction term. Interpret. Does it fit better? Explain why it is equivalent to using the four gender × treatment combinations as levels of a single factor.

7.10  Refer to Table 7.13. Treating belief in an afterlife as ordinal, fit and interpret an ordinal model.
7.11  Table 9.7 displays associations among smoking status (S), breathing test results (B), and age (A) for workers in certain industrial plants. Treat B as a response.
a. Specify a baseline-category logit model with additive factor effects of S and A. This model has deviance G² = 25.9. Show that df = 4, and explain why this model treats all variables as nominal.
b. Treat B as ordinal and S as ordinal in terms of how recently one was a smoker, with scores {s_i}. Consider the model

    log [P(B = k + 1 | S = i, A = j) / P(B = k | S = i, A = j)] = α_k + β_1 s_i + β_2 a_j + β_3 s_i a_j

with a_1 = 0 and a_2 = 1. Show that this assumes a linear effect of S with slope β_1 for age < 40 and β_1 + β_3 for age 40–59. Using {s_i = i}, β̂_1 = 0.115, β̂_2 = 0.311, and β̂_3 = 0.663 (SE = 0.164). Interpret the interaction.
c. From part (b), for age 40–59 show that the estimated odds of abnormal rather than borderline breathing for current smokers are 2.18 times those for former smokers and exp(2 × 0.778) = 4.74 times those for never smokers. Explain why the squares of these values are estimated odds of abnormal rather than normal breathing.
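The numbers quoted in part (c) follow from the part (b) estimates; a quick check:

```python
import math

beta1_hat, beta3_hat = 0.115, 0.663   # estimates from part (b)
slope_40_59 = beta1_hat + beta3_hat   # linear effect of S per unit score, age 40-59
print(round(math.exp(slope_40_59), 2))       # current vs. former smokers: 2.18
print(round(math.exp(2 * slope_40_59), 2))   # current vs. never smokers: 4.74
```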
7.12  The book's Web site (www.stat.ufl.edu/~aa/cda/cda.html) has a 7 × 2 table that refers to subjects who graduated from high school in 1965. They were classified as protestors if they took part in at least one demonstration, protest march, or sit-in, and classified according to their party identification in 1982. Analyze the data, using response (a) party identification, (b) whether a protestor. Compare interpretations.
7.13  For Table 7.5, the cumulative probit model has fit Φ^{−1}[P̂(Y ≤ j)] = α̂_j − 0.195x_1 + 0.683x_2, with α̂_1 = −0.161, α̂_2 = 0.746, and α̂_3 = 1.339. Find the means and standard deviation for the two normal cdf's that provide the curves for P̂(Y > 2) as a function of x_1 = life events index, at the two levels of x_2 = SES. Interpret effects.
7.14 Analyze Table 7.8 with a cumulative probit model. Compare interpretations to those in the text with other ordinal models.
7.15 Fit a model with complementary log-log link to Table 7.20, which
shows family income distributions by percent for families in the northeast U.S. Interpret the difference between the income distributions.
TABLE 7.20  Data for Problem 7.15

                            Income ($1000)
Year     0–3    3–5    5–7    7–10    10–12    12–15    15+
1960     6.5    8.2   11.3    23.5     15.6     12.7    22.2
1970     4.3    6.0    7.7    13.2     10.5     16.3    42.1

Source: Reproduced with permission from the Royal Statistical Society, London (McCullagh 1980).
7.16  Table 7.21 shows results of fitting the mean response model to Table 7.8 using scores {3, 10, 20, 35} for income and {1, 3, 4, 5} for job satisfaction. Interpret the income effect, provide a confidence interval for the difference in mean satisfaction at income levels 35 and 3, controlling for gender, and check the model fit.
TABLE 7.21  Results for Problem 7.16

Source       DF    Chi-Square    Pr > ChiSq
Residual      5       6.99         0.2211

Analysis of Weighted Least Squares Estimates
Effect       Parameter    Estimate    Std Error    Chi-Square    Pr > ChiSq
Intercept        1          3.8076      0.1796       449.47        <.0001
gender           2         −0.0687      0.1419         0.23        0.6283
income           3          0.0160      0.0066         5.97        0.0146
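Part of Problem 7.16 is simple arithmetic on this output. A sketch of the interval for the difference in mean satisfaction between income scores 35 and 3 (the 1.96 multiplier assumes a 95% Wald interval is intended):

```python
beta_hat, se = 0.0160, 0.0066   # income effect and its SE from Table 7.21
gap = 35 - 3                    # difference in income scores
est = gap * beta_hat            # estimated difference in mean satisfaction
half_width = 1.96 * gap * se
ci = (round(est - half_width, 3), round(est + half_width, 3))
print(round(est, 3), ci)        # 0.512 (0.098, 0.926)
```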
7.17  The book's Web site (www.stat.ufl.edu/~aa/cda/cda.html) has a 3 × 4 × 4 table that cross-classifies dumping severity (Y) and operation (X) for four hospitals (H). The four operations refer to treatments for duodenal ulcer patients and have a natural ordering. Dumping severity describes a possible undesirable side effect of the operation. Its three categories are also ordered.
a. Table 7.22 shows results of generalized CMH tests. Interpret, explaining how one test can be much more significant than the others.
TABLE 7.22  Results for Problem 7.17

Summary Statistics for dumping by operate
Controlling for hospital

Statistic    Alternative Hypothesis     DF     Value      Prob
    1        Nonzero Correlation         1     6.3404    0.0118
    2        Row Mean Scores Differ      3     6.5901    0.0862
    3        General Association         6    10.5983    0.1016
b. Let {x_i = i}. Fit the model

    logit[P(Y ≤ j | H = h, X = i)] = α_j + β_h^H + βx_i.

Test conditional independence of X and Y using it, and interpret β̂. Which generalized CMH test has the same spirit as this?
c. Does an improved fit result from allowing the operation effect to vary by hospital? Interpret.
d. Find a mean response model that fits well. Interpret.
7.18  Table 7.23 refers to a study that randomly assigned subjects to a control or treatment group. Daily during the study, treatment subjects ate cereal containing psyllium. The study analyzed the effect on LDL cholesterol.
a. Model the ending cholesterol level as a function of treatment, using the beginning level as a covariate. Interpret the treatment effect.
b. Repeat part (a), now treating the beginning level as qualitative. Compare results.
c. An alternative to part (b) uses a generalized CMH test relating treatment to the ending response for partial tables defined by beginning cholesterol level. Apply such a test, taking into account the response ordering, to compare treatments. Interpret, and compare to part (b).
TABLE 7.23  Data for Problem 7.18

                                Ending LDL Cholesterol Level
                        Control                           Treatment
Beginning    ≤3.4   3.4–4.1   4.1–4.9   >4.9     ≤3.4   3.4–4.1   4.1–4.9   >4.9
≤3.4           18        8         0      0        21        4         2      0
3.4–4.1        16       30        13      2        17       25         6      0
4.1–4.9         0       14        28      7        11       35        36      6
>4.9            0        2        15     22         1        5        14     12

Source: Data courtesy of Sallee Anderson, Kellogg Co.
7.19 Analyze Table 7.5 with each type of model studied in this chapter.
Write a report summarizing results and advantages and disadvantages
of each modeling strategy.
7.20  The book's Web site (www.stat.ufl.edu/~aa/cda/cda.html) has a 4 × 4 × 5 table that cross-classifies assessment of cognitive impairment, Alzheimer's disease, and age. Analyze these data, treating (a) Alzheimer's disease, and (b) cognitive impairment, as the response variable.
7.21  Analyze Table 9.5 using logit models that treat (a) party affiliation, and (b) ideology, as the response variable.
7.22  The book's Web site (www.stat.ufl.edu/~aa/cda/cda.html) has a 4 × 2 × 3 × 3 table that refers to a sample of residents of Copenhagen. The variables are type of housing (H), degree of contact with other residents (C), feeling of influence on apartment management (I), and satisfaction with housing conditions (S). Treating S as the response variable, analyze these data.
7.23 Refer to Table 7.17. Analyze these data.
Theory and Methods
7.24  A multivariate generalization of the exponential dispersion family (4.14) is

    f(y_i; θ_i, φ) = exp{[y_i′θ_i − b(θ_i)]/a(φ) + c(y_i, φ)},

where θ_i is the natural parameter. Show that the multinomial variate y_i defined in Section 7.1.5 for a single trial with parameters {π_j, j = 1, . . . , J − 1} is in the (J − 1)-parameter exponential family, with baseline-category logits as natural parameters.
7.25  Cell counts {y_ij} in an I × J contingency table have a multinomial(n; {π_ij}) distribution. Show that P({Y_ij = n_ij, i = 1, . . . , I, j = 1, . . . , J}) can be expressed as

    d^n n! [Π_i Π_j (n_ij!)]^{−1} exp[ Σ_{i=1}^{I−1} Σ_{j=1}^{J−1} n_ij log(α_ij)
        + Σ_{i=1}^{I−1} n_{i+} log(π_{iJ}/π_{IJ}) + Σ_{j=1}^{J−1} n_{+j} log(π_{Ij}/π_{IJ}) ],

where α_ij = π_ij π_IJ / (π_iJ π_Ij) and d is a constant independent of the data. Find an alternative expression using local odds ratios {θ_ij}, by showing that

    Σ_i Σ_j n_ij log α_ij = Σ_i Σ_j s_ij log θ_ij,    where s_ij = Σ_{a≤i} Σ_{b≤j} n_ab.
7.26  Suppose that we express (7.2) as

    π_j(x) = exp(α_j + β_j′x) / Σ_{h=1}^{J} exp(α_h + β_h′x).

Show that dividing numerator and denominator by exp(α_J + β_J′x) yields new parameters α_j* = α_j − α_J and β_j* = β_j − β_J that satisfy α_J* = 0 and β_J* = 0. Thus, without loss of generality, one can take α_J = 0 and β_J = 0.
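The identifiability argument in Problem 7.26 is easy to verify numerically. A sketch with arbitrary parameter values (the dimensions and seed are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
J, p = 4, 2
alpha = rng.normal(size=J)       # arbitrary alpha_j
beta = rng.normal(size=(J, p))   # arbitrary beta_j (one row per category)
x = rng.normal(size=p)

def pi(a, b):
    # baseline-category model probabilities pi_j(x)
    eta = a + b @ x
    e = np.exp(eta - eta.max())  # stabilized softmax
    return e / e.sum()

# subtracting the J-th category's parameters leaves every pi_j(x) unchanged
same = np.allclose(pi(alpha, beta), pi(alpha - alpha[-1], beta - beta[-1]))
print(same)  # True
```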
7.27  When J = 3, suppose that

    π_j(x) = exp(α_j + β_j x) / [1 + exp(α_1 + β_1 x) + exp(α_2 + β_2 x)],

j = 1, 2. Show that π_3(x) is (a) decreasing in x if β_1 > 0 and β_2 > 0, (b) increasing in x if β_1 < 0 and β_2 < 0, and (c) nonmonotone when β_1 and β_2 have different signs.
7.28  Refer to the log-likelihood function for the baseline-category logit model (Section 7.1.4). Denote the sufficient statistics by np_j = Σ_i y_ij and S_jk = Σ_i x_ik y_ij, j = 1, . . . , J − 1, k = 1, . . . , t. Let S = (S_11, . . . , S_1t, . . . , S_J1, . . . , S_Jt)′. Condition on Σ_i y_ij, j = 1, . . . , J. Under the null hypothesis that explanatory variables have no effect, show that

    E(S) = n(p ⊗ m),    var(S) = n(V ⊗ Σ),

where p = (p_1, . . . , p_J)′; m = (x̄_1, . . . , x̄_t)′, where x̄_k = (Σ_i x_ik)/n; Σ has elements (s_{kv}²), where s_{kv}² = [Σ_i (x_ik − x̄_k)(x_iv − x̄_v)]/(n − 1); V has elements v_ii = p_i(1 − p_i) and v_ij = −p_i p_j; and ⊗ denotes the Kronecker product (Zelen 1991).
7.29 Is the proportional odds model a special case of a baseline-category
logit model? Explain why or why not.
7.30  Prove factorization (7.15) for the multinomial distribution.
7.31  Show that for the model logit[P(Y ≤ j)] = α_j + β_j x, cumulative probabilities may be misordered for some x values.
7.32  For an I × J contingency table with ordinal Y and scores {x_i = i} for X, consider the model

    logit[P(Y ≤ j | X = x_i)] = α_j + βx_i.    (7.24)

a. Show that logit[P(Y ≤ j | X = x_{i+1})] − logit[P(Y ≤ j | X = x_i)] = β. Show that this difference in logits is a log cumulative odds ratio for the 2 × 2 table consisting of rows i and i + 1 and the binary response having cutpoint following category j. Thus, (7.24) is a uniform association model in cumulative odds ratios.
b. Show that residual df = IJ − I − J.
c. Show that independence of X and Y is the special case β = 0.
d. Using the same linear predictor but with adjacent-categories logits, show that uniform association applies to the local odds ratios (2.10).
e. A generalization of (7.24) replaces {βx_i} by unordered parameters {μ_i}, hence treating X as nominal. For rows a and b, show that the log cumulative odds ratio equals μ_a − μ_b for all J − 1 cutpoints.
7.33  Suppose that model (7.24) holds for a 2 × J table with J > 2, and let x_2 − x_1 = 1. Explain why local log odds ratios are typically smaller in absolute value than the cumulative log odds ratio β. [In fact, on p. 122 of their first edition, McCullagh and Nelder (1989) noted that local odds ratios {θ_1j} relate to β by

    log θ_1j = β[P(Y ≤ j + 1) − P(Y ≤ j − 1)] + o(β),    j = 1, . . . , J − 1,

where o(β)/β → 0 as β → 0.]
7.34  A response scale has the categories (strongly agree, mildly agree, mildly disagree, strongly disagree, don't know). One way to model such a scale uses a logit model for the probability of a don't know response and uses a separate ordinal model for the ordered categories, conditional on response in one of those categories. Explain how to construct a likelihood to do this simultaneously.
7.35  For the cumulative probit model Φ^{−1}[P(Y ≤ j)] = α_j − β′x, explain why a 1-unit increase in x_i corresponds to a β_i standard deviation increase in the expected underlying latent response, controlling for other predictors.
7.36  For cumulative link model (7.7), show that for 1 ≤ j < k ≤ J − 1, P(Y ≤ k | x) = P(Y ≤ j | x*), where x* is obtained by increasing the ith component of x by (α_k − α_j)/β_i. Interpret.
7.37  A cumulative link model for an I × J contingency table with a qualitative predictor is

    G^{−1}[P(Y ≤ j)] = α_j + μ_i,    i = 1, . . . , I,  j = 1, . . . , J − 1.

a. Show that the residual df = (I − 1)(J − 2).
b. When this model holds, show that independence corresponds to μ_1 = ··· = μ_I and the test of independence has df = I − 1.
c. When this model holds, show that the rows are stochastically ordered on Y.
7.38  F_1(y) = 1 − exp(−λy) for y > 0 is a negative exponential cdf with parameter λ, and F_2(y) = 1 − exp(−μy) for y > 0. Show that the difference between the cdf's on a complementary log-log scale is identical for all y. Give implications for categorical data analysis.
7.39  Consider the model link[ω_j(x)] = α_j + β_j′x, where ω_j(x) is (7.14).
a. Explain why this model can be fitted separately for j = 1, . . . , J − 1.
b. For the complementary log-log link, show that this model is equivalent to one using the same link for cumulative probabilities (Läärä and Matthews 1985).
7.40 Why is it not optimal to fit mean response models for ordinal responses using ordinary least squares as is done for normal regression?
7.41  When X and Y are ordinal, explain how to test conditional independence by allowing a different trend in each partial table. [Hint: Generalize model (7.17) by replacing β by β_k.]
7.42  A café has four entrées: chicken, beef, fish, vegetarian. Specify a model of form (7.22) for the selection of an entrée using x = gender (1 = female, 0 = male) and u = cost of entrée, which is a characteristic of the choices. Interpret the model parameters.
CHAPTER 8
Loglinear Models for
Contingency Tables
In Section 4.3 we introduced loglinear models as generalized linear models (GLMs) using the log link function with a Poisson response. A common use is modeling cell counts in contingency tables. The models specify how the expected count depends on levels of the categorical variables for that cell as well as associations and interactions among those variables. The purpose of loglinear modeling is the analysis of association and interaction patterns.
In Section 8.1 we introduce loglinear models for two-way contingency tables. In Sections 8.2 and 8.3 we extend them to three-way tables, and in Section 8.4 we discuss models for multiway tables. Loglinear models are of use
primarily when at least two variables are response variables. With a single
categorical response, it is simpler and more natural to use logit models.
When one variable is treated as a response and the others as explanatory
variables, logit models for that response variable are equivalent to certain
loglinear models. Section 8.5 covers this connection. In Sections 8.6 and 8.7
we discuss ML loglinear model fitting.
8.1
LOGLINEAR MODELS FOR TWO-WAY TABLES
Consider an I × J contingency table that cross-classifies a multinomial sample of n subjects on two categorical responses. The cell probabilities are {π_ij} and the expected frequencies are {μ_ij = nπ_ij}. Loglinear model formulas use {μ_ij} rather than {π_ij}, so they also apply with Poisson sampling for N = IJ independent cell counts {Y_ij} having {μ_ij = E(Y_ij)}. In either case we denote the observed cell counts by {n_ij}.
8.1.1  Independence Model

Under statistical independence, in Section 4.3.6 we noted that the {μ_ij} have the structure

    μ_ij = α_i β_j.

For multinomial sampling, for instance, μ_ij = nπ_{i+}π_{+j}. Denote the row variable by X and the column variable by Y. The formula expressing independence is multiplicative. Thus, log μ_ij has additive form

    log μ_ij = λ + λ_i^X + λ_j^Y    (8.1)

for a row effect λ_i^X and a column effect λ_j^Y. This is the loglinear model of independence. As usual, identifiability requires constraints such as λ_I^X = λ_J^Y = 0.
The ML fitted values are {μ̂_ij = n_{i+} n_{+j}/n}, the estimated expected frequencies for chi-squared tests of independence. The tests using X² and G² (Section 3.2.1) are also goodness-of-fit tests of this loglinear model.
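As a small numerical sketch (the 2 × 2 counts are made up for illustration), the fitted values and both goodness-of-fit statistics can be computed directly:

```python
import numpy as np

n_ij = np.array([[30., 10.],
                 [15., 45.]])   # hypothetical observed counts
n = n_ij.sum()
mu_hat = np.outer(n_ij.sum(axis=1), n_ij.sum(axis=0)) / n   # {n_i+ n_+j / n}

X2 = ((n_ij - mu_hat) ** 2 / mu_hat).sum()      # Pearson statistic
G2 = 2 * (n_ij * np.log(n_ij / mu_hat)).sum()   # likelihood-ratio statistic
print(mu_hat.tolist())              # [[18.0, 22.0], [27.0, 33.0]]
print(round(X2, 2), round(G2, 2))   # 24.24 25.16
```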
8.1.2  Interpretation of Parameters

Loglinear models for contingency tables are GLMs that treat the N cell counts as independent observations of a Poisson random component. Loglinear GLMs identify the data as the N cell counts rather than the individual classifications of the n subjects. The expected cell counts link to the explanatory terms using the log link. As (8.1) illustrates, the model does not distinguish between response and explanatory variables among the cross-classified variables. It treats both jointly as responses, modeling {μ_ij} for combinations of their levels. To interpret parameters, however, it is helpful to treat the variables asymmetrically.
We illustrate with the independence model for I × 2 tables. In row i, the logit equals

    logit[P(Y = 1 | X = i)] = log [P(Y = 1 | X = i) / P(Y = 2 | X = i)] = log(μ_i1/μ_i2)
                            = log μ_i1 − log μ_i2
                            = (λ + λ_i^X + λ_1^Y) − (λ + λ_i^X + λ_2^Y) = λ_1^Y − λ_2^Y.

The final term does not depend on i; that is, logit[P(Y = 1 | X = i)] is identical at each level of X. Thus, independence implies a model of form logit[P(Y = 1 | X = i)] = α. In each row, the odds of response in column 1 equal exp(α) = exp(λ_1^Y − λ_2^Y).
An analogous property holds when J > 2. Differences between two parameters for a given variable relate to the log odds of making one response, relative to the other, on that variable. Of course, with a single response variable, logit models apply directly and loglinear models are unneeded.
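The constant-logit property of the independence fit can be seen numerically; a sketch with an arbitrary I × 2 table:

```python
import numpy as np

n_ij = np.array([[30., 10.],
                 [15., 45.],
                 [20., 20.]])   # any 3x2 table of counts
mu_hat = np.outer(n_ij.sum(axis=1), n_ij.sum(axis=0)) / n_ij.sum()

# logit[P(Y = 1 | X = i)] under the fitted independence model, for each row i
logits = np.log(mu_hat[:, 0] / mu_hat[:, 1])
print(np.allclose(logits, logits[0]))  # True: identical in every row
```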
8.1.3  Saturated Model
Statistically dependent variables satisfy a more complex loglinear model,

    log μ_ij = λ + λ_i^X + λ_j^Y + λ_{ij}^{XY}.    (8.2)

The {λ_{ij}^{XY}} are association terms that reflect deviations from independence. The right-hand side of (8.2) resembles the formula for cell means in two-way ANOVA, allowing interaction. The {λ_{ij}^{XY}} represent interactions between X and Y, whereby the effect of one variable on μ_ij depends on the level of the other. The independence model (8.1) results when all λ_{ij}^{XY} = 0.
With constraints λ_I^X = λ_J^Y = 0 in (8.1) and (8.2), {λ_i^X} and {λ_j^Y} are, equivalently, coefficients of dummy variables for the first (I − 1) categories of X and the first (J − 1) categories of Y. Thus, λ_{ij}^{XY} is the coefficient of the product of dummy variables for λ_i^X and λ_j^Y. Since there are (I − 1)(J − 1) such cross products, λ_{Ij}^{XY} = λ_{iJ}^{XY} = 0, and only (I − 1)(J − 1) of these parameters are nonredundant. Tests of independence analyze whether these (I − 1)(J − 1) parameters equal zero, so they have residual df = (I − 1)(J − 1).
The number of parameters in model (8.2) equals 1 + (I − 1) + (J − 1) + (I − 1)(J − 1) = IJ, the number of cells. Hence, this model describes perfectly any {μ_ij > 0} (see Problem 8.16). It is the most general model for two-way contingency tables, the saturated model. For it, direct relationships exist between log odds ratios and {λ_{ij}^{XY}}. For instance, for 2 × 2 tables,

    log θ = log (μ_11 μ_22 / μ_12 μ_21)
          = log μ_11 + log μ_22 − log μ_12 − log μ_21
          = (λ + λ_1^X + λ_1^Y + λ_{11}^{XY}) + (λ + λ_2^X + λ_2^Y + λ_{22}^{XY})
            − (λ + λ_1^X + λ_2^Y + λ_{12}^{XY}) − (λ + λ_2^X + λ_1^Y + λ_{21}^{XY})
          = λ_{11}^{XY} + λ_{22}^{XY} − λ_{12}^{XY} − λ_{21}^{XY}.    (8.3)

Thus, {λ_{ij}^{XY}} determine the association.
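A quick numerical check of (8.3): for a 2 × 2 table the saturated model fits the data exactly, so with last-category-zero constraints the contrast of λ^{XY} terms reproduces the sample log odds ratio (counts are illustrative):

```python
import numpy as np

n = np.array([[30., 10.],
              [15., 45.]])
log_mu = np.log(n)   # saturated model: fitted counts equal observed counts

# the contrast in (8.3), which equals lambda_11^XY under baseline constraints
contrast = log_mu[0, 0] + log_mu[1, 1] - log_mu[0, 1] - log_mu[1, 0]
print(np.isclose(contrast, np.log(30 * 45 / (10 * 15))))  # True: both equal log 9
```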
In practice, unsaturated models are preferable, since their fit smooths the sample data and has simpler interpretations. For tables with at least three variables, unsaturated models can include association terms. Then, loglinear models are more commonly used to describe associations (through two-factor terms) than to describe odds (through single-factor terms).
Like others in this book, model (8.2) is hierarchical. This means that the model includes all lower-order terms composed from variables contained in a higher-order model term. When the model contains λ_{ij}^{XY}, it also contains λ_i^X and λ_j^Y. A reason for including lower-order terms is that, otherwise, the statistical significance and the interpretation of a higher-order term depend on how variables are coded. This is undesirable, and with hierarchical models the same results occur no matter how variables are coded.
An example of a nonhierarchical model is

    log μ_ij = λ + λ_i^X + λ_{ij}^{XY}.

This model permits association but forces unnatural behavior of expected frequencies, with the pattern depending on constraints used for parameters. For instance, with constraints whereby parameters are zero at the last level, log μ_Ij = λ in every column. Nonhierarchical models are rarely sensible in practice. Using them is analogous to using ANOVA or regression models with interaction terms but without the corresponding main effects.
When a model has two-factor terms, interpretations focus on them rather than on the single-factor terms. By analogy with two-way ANOVA with two-factor interaction, it can be misleading to report main effects. The estimates of the main-effect terms depend on the coding scheme used for the higher-order effects, and the interpretation also depends on that scheme (see Problem 8.16). Normally, we restrict our attention to the highest-order terms for a variable, as we illustrate in Section 8.2.
8.1.4  Alternative Parameter Constraints

As with the independence model, the parameter constraints for the saturated model are arbitrary. Instead of setting all λ_{Ij}^{XY} = λ_{iJ}^{XY} = 0, one could set Σ_i λ_{ij}^{XY} = Σ_j λ_{ij}^{XY} = 0 for all i and j. Different software uses different constraints. What is unique are contrasts such as λ_{11}^{XY} + λ_{22}^{XY} − λ_{12}^{XY} − λ_{21}^{XY} in (8.3) that determine odds ratios.
For instance, suppose that a log odds ratio equals 2.0 in a 2 × 2 table. With the first set of constraints, 2.0 is the coefficient of a product of a dummy variable indicating the first category of X and a dummy variable indicating the first category of Y. With it, λ_{11}^{XY} = 2.0 and λ_{12}^{XY} = λ_{21}^{XY} = λ_{22}^{XY} = 0. For sum-to-zero constraints, λ_{11}^{XY} = λ_{22}^{XY} = 0.5 and λ_{12}^{XY} = λ_{21}^{XY} = −0.5. For either set, the log odds ratio (8.3) equals 2.0. For a set of parameters, an advantage of setting a baseline parameter equal to 0 instead of the sum equal to 0 is that some parameters in a set can have infinite estimates.
8.1.5  Multinomial Models for Cell Probabilities

Conditional on the sum n of the cell counts, Poisson loglinear models for {μ_ij} become multinomial models for cell probabilities {π_ij = μ_ij/(ΣΣ μ_ab)}. To illustrate, for the saturated model,

    π_ij = exp(λ + λ_i^X + λ_j^Y + λ_{ij}^{XY}) / Σ_a Σ_b exp(λ + λ_a^X + λ_b^Y + λ_{ab}^{XY}).    (8.4)

This representation implies the usual constraints for probabilities, {π_ij ≥ 0} and Σ_i Σ_j π_ij = 1. The intercept parameter λ cancels in the multinomial model (8.4). This parameter relates purely to the total sample size, which is random in the Poisson model but not in the multinomial model.
8.2 LOGLINEAR MODELS FOR INDEPENDENCE AND
INTERACTION IN THREE-WAY TABLES
In Section 2.3 we introduced three-way contingency tables and related
structure such as conditional independence and homogeneous association.
Loglinear models for three-way tables describe their independence and
association patterns.
8.2.1  Types of Independence

A three-way I × J × K cross-classification of response variables X, Y, and Z has several potential types of independence. We assume a multinomial distribution with cell probabilities {π_ijk}, and Σ_i Σ_j Σ_k π_ijk = 1.0. The models also apply to Poisson sampling with means {μ_ijk}.
The three variables are mutually independent when

    π_ijk = π_{i++} π_{+j+} π_{++k}    for all i, j, and k.    (8.5)

For expected frequencies {μ_ijk}, mutual independence has loglinear form

    log μ_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z.    (8.6)

Variable Y is jointly independent of X and Z when

    π_ijk = π_{i+k} π_{+j+}    for all i, j, and k.    (8.7)

This is ordinary two-way independence between Y and a variable composed of the IK combinations of levels of X and Z. The loglinear model is

    log μ_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_{ik}^{XZ}.    (8.8)

Similarly, X could be jointly independent of Y and Z, or Z could be jointly independent of X and Y. Mutual independence (8.5) implies joint independence of any one variable from the others.
From Section 2.3, X and Y are conditionally independent, given Z, when independence holds for each partial table within which Z is fixed. That is, if π_{ij|k} = P(X = i, Y = j | Z = k), then

    π_{ij|k} = π_{i+|k} π_{+j|k}    for all i, j, and k.

For joint probabilities over the entire table, equivalently

    π_ijk = π_{i+k} π_{+jk} / π_{++k}    for all i, j, and k.    (8.9)

Conditional independence of X and Y, given Z, is the loglinear model

    log μ_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_{ik}^{XZ} + λ_{jk}^{YZ}.    (8.10)

This is a weaker condition than mutual or joint independence. Mutual independence implies that Y is jointly independent of X and Z, which itself implies that X and Y are conditionally independent. Table 8.1 summarizes these three types of independence.
In Section 2.3.2 we showed that partial associations can be quite different from marginal associations. For instance, conditional independence does not imply marginal independence. Conditional independence and marginal independence both hold when one of the stronger types of independence studied above applies. Figure 8.1 summarizes relationships among the four types of independence.
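Form (8.9) is easy to verify numerically: build a table satisfying it and check that every conditional XY odds ratio is 1. A sketch (dimensions and random seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
joint = rng.random((3, 4, 2))
joint /= joint.sum()    # an arbitrary joint distribution pi_ijk

# impose (8.9): pi_ijk = pi_{i+k} pi_{+jk} / pi_{++k}
pi = (joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
      / joint.sum(axis=(0, 1), keepdims=True))

ok = True
for k in range(pi.shape[2]):    # check each partial table Z = k
    s = pi[:, :, k]
    ors = (s[:-1, :-1] * s[1:, 1:]) / (s[:-1, 1:] * s[1:, :-1])
    ok = ok and np.allclose(ors, 1.0)
print(ok, round(pi.sum(), 10))  # True 1.0
```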
8.2.2  Homogeneous Association and Three-Factor Interaction

TABLE 8.1  Summary of Loglinear Independence Models

Model     Probabilistic Form           Association Terms             Interpretation
          for π_ijk                    in Loglinear Model
(8.6)     π_{i++} π_{+j+} π_{++k}      None                          Variables mutually independent
(8.8)     π_{i+k} π_{+j+}              λ_{ik}^{XZ}                   Y independent of X and Z
(8.10)    π_{i+k} π_{+jk} / π_{++k}    λ_{ik}^{XZ}, λ_{jk}^{YZ}      X and Y independent, given Z

FIGURE 8.1  Relationships among types of XY independence.

Loglinear models (8.6), (8.8), and (8.10) have three, two, and one pair of conditionally independent variables, respectively. In the latter two models,
the doubly subscripted terms (such as λ_{ij}^{XY}) pertain to conditionally dependent variables. A model that permits all three pairs to be conditionally dependent is

    log μ_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_{ij}^{XY} + λ_{ik}^{XZ} + λ_{jk}^{YZ}.    (8.11)

From exponentiating both sides, the cell probabilities have form

    π_ijk = ψ_ij φ_jk ω_ik

for functions ψ, φ, ω depending only on the indicated pairs of indices. No closed-form expression exists for the three components in terms of margins of {π_ijk} except in certain special cases (see Note 9.2).
For this model, in the next section we show that conditional odds ratios between any two variables are identical at each category of the third variable. That is, each pair has homogeneous association (Section 2.3.5). Model (8.11) is called the loglinear model of homogeneous association or of no three-factor interaction.
The general loglinear model for a three-way table is

    log μ_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_{ij}^{XY} + λ_{ik}^{XZ} + λ_{jk}^{YZ} + λ_{ijk}^{XYZ}.    (8.12)

With dummy variables, λ_{ijk}^{XYZ} is the coefficient of the product of the ith dummy variable for X, jth dummy variable for Y, and kth dummy variable for Z. The total number of nonredundant parameters is

    1 + (I − 1) + (J − 1) + (K − 1) + (I − 1)(J − 1) + (I − 1)(K − 1) + (J − 1)(K − 1) + (I − 1)(J − 1)(K − 1) = IJK,

the total number of cell counts. This model has as many parameters as observations and is saturated. It describes all possible positive {π_ijk}. Each pair of variables may be conditionally dependent, and an odds ratio for any pair may vary across categories of the third variable.
Setting certain parameters equal to zero in (8.12) yields the models introduced previously. Table 8.2 lists some of these models. To ease referring to models, Table 8.2 assigns to each model a symbol that lists the highest-order term(s) for each variable. For instance, the model (8.10) of conditional independence between X and Y has symbol (XZ, YZ), since its highest-order terms are λ_{ik}^{XZ} and λ_{jk}^{YZ}. In the notation we used for logit models in Sections 6.1 and 7.1.2 this stands for (X*Z + Y*Z), which is itself shorthand for the notation (X + Y + Z + X × Z + Y × Z) that has the main effects as well as interactions.

TABLE 8.2  Loglinear Models for Three-Dimensional Tables

Loglinear Model                                                                                    Symbol
log μ_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z                                                              (X, Y, Z)
log μ_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_{ij}^{XY}                                                (XY, Z)
log μ_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_{ij}^{XY} + λ_{jk}^{YZ}                                  (XY, YZ)
log μ_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_{ij}^{XY} + λ_{jk}^{YZ} + λ_{ik}^{XZ}                    (XY, YZ, XZ)
log μ_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_{ij}^{XY} + λ_{jk}^{YZ} + λ_{ik}^{XZ} + λ_{ijk}^{XYZ}    (XYZ)
8.2.3  Interpreting Model Parameters

Interpretations of loglinear model parameters use their highest-order terms. For instance, interpretations for model (8.11) use the two-factor terms to describe conditional odds ratios. At a fixed level k of Z, the conditional association between X and Y uses (I − 1)(J − 1) odds ratios, such as the local odds ratios

    θ_{ij(k)} = (μ_ijk μ_{i+1,j+1,k}) / (μ_{i,j+1,k} μ_{i+1,j,k}),    1 ≤ i ≤ I − 1,  1 ≤ j ≤ J − 1.    (8.13)

Similarly, (I − 1)(K − 1) odds ratios {θ_{i(j)k}} describe XZ conditional association, and (J − 1)(K − 1) odds ratios {θ_{(i)jk}} describe YZ conditional association. Loglinear models have characterizations using constraints on conditional odds ratios. For instance, conditional independence of X and Y is equivalent to {θ_{ij(k)} = 1, i = 1, . . . , I − 1, j = 1, . . . , J − 1, k = 1, . . . , K}.
The two-factor parameters relate directly to the conditional odds ratios. To illustrate, substituting (8.11) for model (XY, XZ, YZ) into log θ_{ij(k)} yields

    log θ_{ij(k)} = log [(μ_ijk μ_{i+1,j+1,k}) / (μ_{i+1,j,k} μ_{i,j+1,k})]
                  = λ_{ij}^{XY} + λ_{i+1,j+1}^{XY} − λ_{i,j+1}^{XY} − λ_{i+1,j}^{XY}.    (8.14)

Since the right-hand side is the same for all k, an absence of three-factor interaction is equivalent to

    θ_{ij(1)} = θ_{ij(2)} = ··· = θ_{ij(K)}    for all i and j.

The same argument for the other conditional odds ratios shows that model (XY, XZ, YZ) is also equivalent to

    θ_{i(1)k} = θ_{i(2)k} = ··· = θ_{i(J)k}    for all i and k,

and to

    θ_{(1)jk} = θ_{(2)jk} = ··· = θ_{(I)jk}    for all j and k.

Any model not having the three-factor interaction term has a homogeneous association for each pair of variables.
When X and Y have two categories, only one nonredundant λ_ij^XY parameter occurs. Thus, expression (8.14) is simplified, depending on the constraints. By the same argument as in Section 8.1.3 for 2 × 2 tables, the conditional log odds ratio simplifies to λ_11^XY with dummy-variable constraints setting parameters at the second level of X or Y equal to 0.
The λ_ijk^XYZ term in the general model (8.12) refers to three-factor interaction. It describes how the odds ratio between two variables changes across categories of the third. We illustrate for 2 × 2 × 2 tables. By direct substitution of the general model formula,

    log [θ_11(1) / θ_11(2)] = log { [(μ_111 μ_221)/(μ_121 μ_211)] / [(μ_112 μ_222)/(μ_122 μ_212)] }
        = (λ_111^XYZ + λ_221^XYZ − λ_121^XYZ − λ_211^XYZ)
          − (λ_112^XYZ + λ_222^XYZ − λ_122^XYZ − λ_212^XYZ).

Only one parameter is nonredundant. For constraints setting the second-category parameters equal to 0, this log ratio of odds ratios equals λ_111^XYZ. When λ_111^XYZ = 0, θ_11(1) = θ_11(2), giving homogeneous XY association.
8.2.4 Alcohol, Cigarette, and Marijuana Use Example
Table 8.3 refers to a 1992 survey by the Wright State University School of Medicine and the United Health Services in Dayton, Ohio. The survey asked 2276 students in their final year of high school in a nonurban area near Dayton, Ohio, whether they had ever used alcohol, cigarettes, or marijuana. Denote the variables in this 2 × 2 × 2 table by A for alcohol use, C for cigarette use, and M for marijuana use.
Section 8.7 covers the fitting of loglinear models. For now, we emphasize interpretation. Table 8.4 shows fitted values for several loglinear models. The
TABLE 8.3 Alcohol, Cigarette, and Marijuana Use for High School Seniors

                              Marijuana Use
Alcohol Use   Cigarette Use   Yes     No
Yes           Yes             911     538
              No               44     456
No            Yes               3      43
              No                2     279

Source: Data courtesy of Harry Khamis, Wright State University.
TABLE 8.4 Fitted Values for Loglinear Models Applied to Table 8.3^a

Alcohol  Cigarette  Marijuana                    Loglinear Model
Use      Use        Use        (A, C, M)  (AC, M)  (AM, CM)  (AC, AM, CM)  (ACM)
Yes      Yes        Yes          540.0     611.2    909.24      910.4       911
                    No           740.2     837.8    438.84      538.6       538
         No         Yes          282.1     210.9     45.76       44.6        44
                    No           386.7     289.1    555.16      455.4       456
No       Yes        Yes           90.6      19.4      4.76        3.6         3
                    No           124.2      26.6    142.16       42.4        43
         No         Yes           47.3     118.5      0.24        1.4         2
                    No            64.9     162.5    179.84      279.6       279

^a A, alcohol use; C, cigarette use; M, marijuana use.
fit for model (AC, AM, CM) is close to the observed data, which are the fitted values for the saturated model (ACM). The other models fit poorly.
Table 8.5 illustrates model association patterns by presenting estimated conditional and marginal odds ratios. For example, the entry 1.0 for the AC conditional association for the model (AM, CM) of AC conditional independence is the common value of the AC fitted odds ratios at the two levels of M,

    1.0 = (909.24 × 0.24)/(45.76 × 4.76) = (438.84 × 179.84)/(555.16 × 142.16).

The entry 2.7 for the AC marginal association for this model is the odds ratio for the marginal AC fitted table. The odds ratios for the observed data are those reported for the saturated model (ACM).
Table 8.5 shows that estimated conditional odds ratios equal 1.0 for each pairwise term not appearing in a model, such as the AC association in model (AM, CM). For that model, the estimated marginal AC odds ratio differs from 1.0, since conditional independence does not imply marginal independence. Some models have conditional associations that are necessarily the
TABLE 8.5 Estimated Odds Ratios for Loglinear Models in Table 8.4

                    Conditional Association      Marginal Association
Model               AC      AM      CM           AC      AM      CM
(A, C, M)           1.0     1.0     1.0          1.0     1.0     1.0
(AC, M)             17.7    1.0     1.0          17.7    1.0     1.0
(AM, CM)            1.0     61.9    25.1         2.7     61.9    25.1
(AC, AM, CM)        7.8     19.8    17.3         17.7    61.9    25.1
(ACM) level 1       13.8    24.3    17.5         17.7    61.9    25.1
(ACM) level 2       7.7     13.5    9.7
same as the corresponding marginal associations. In Section 9.1.2 we present a condition guaranteeing this.
Model (AC, AM, CM) permits all pairwise associations but maintains homogeneous odds ratios between two variables at each level of the third. The AC fitted conditional odds ratios for this model equal 7.8. One can calculate this odds ratio using the model's fitted values at either level of M, or [from (8.14)] using exp(λ̂_11^AC + λ̂_22^AC − λ̂_12^AC − λ̂_21^AC).
Table 8.5 shows that estimated odds ratios are very dependent on the model. This highlights the importance of good model selection. An estimate from this table is informative only to the extent that its model fits well. In the next section we discuss goodness of fit.
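The contrast between conditional and marginal odds ratios in Table 8.5 can be reproduced directly from the fitted values of Table 8.4. A sketch using the (AM, CM) fitted values (the array layout is an assumption of this illustration):

```python
import numpy as np

# Fitted values for model (AM, CM) from Table 8.4, indexed [A][C][M]
# with level 0 = yes, 1 = no
fit = np.array([[[909.24, 438.84], [45.76, 555.16]],
                [[4.76, 142.16], [0.24, 179.84]]])

def odds_ratio(t):
    """Odds ratio of a 2 x 2 table."""
    return t[0, 0] * t[1, 1] / (t[0, 1] * t[1, 0])

# AC conditional odds ratios at each level of M: both equal 1.0,
# since (AM, CM) asserts AC conditional independence
cond = [odds_ratio(fit[:, :, k]) for k in (0, 1)]

# AC marginal odds ratio, collapsing over M: about 2.7, showing that
# conditional independence does not imply marginal independence
marg = odds_ratio(fit.sum(axis=2))
```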
8.3 INFERENCE FOR LOGLINEAR MODELS
A good-fitting loglinear model provides a basis for describing and making
inferences about associations among categorical responses. Standard methods apply for checking fit and making inference about model parameters.
8.3.1 Chi-Squared Goodness-of-Fit Tests
As usual, X² and G² test whether a model holds by comparing cell fitted values to observed counts. Here df equals the number of cell counts minus the number of model parameters.
For the student survey (Table 8.3), Table 8.6 shows results of testing fit for several loglinear models. Models that lack any association term fit poorly. The model (AC, AM, CM) that has all pairwise associations fits well (P = 0.54). It is suggested by other criteria also, such as minimizing

    AIC = −2(maximized log likelihood − number of parameters in model)

or, equivalently, minimizing [G² − 2(df)].
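As a sketch of these computations, G² = 2 Σ n log(n/μ̂) and X² = Σ (n − μ̂)²/μ̂ can be evaluated for model (AC, AM, CM) from the observed counts of Table 8.3 and the fitted values of Table 8.4. Because the published fitted values are rounded, the results only approximate the deviance 0.374 and Pearson statistic 0.401 reported later in Table 8.7:

```python
import math

observed = [911, 538, 44, 456, 3, 43, 2, 279]                   # Table 8.3
fitted = [910.4, 538.6, 44.6, 455.4, 3.6, 42.4, 1.4, 279.6]     # (AC, AM, CM)

# Likelihood-ratio statistic G^2 = 2 * sum n * log(n / mu_hat)
G2 = 2 * sum(n * math.log(n / mu) for n, mu in zip(observed, fitted))
# Pearson statistic X^2 = sum (n - mu_hat)^2 / mu_hat
X2 = sum((n - mu) ** 2 / mu for n, mu in zip(observed, fitted))
```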
TABLE 8.6 Goodness-of-Fit Tests for Loglinear Models in Table 8.4

Model             G²       X²       df   P-value^a
(A, C, M)         1286.0   1411.4   4    < 0.001
(A, CM)            534.2    505.6   3    < 0.001
(C, AM)            939.6    824.2   3    < 0.001
(M, AC)            843.8    704.9   3    < 0.001
(AC, AM)           497.4    443.8   2    < 0.001
(AC, CM)            92.0     80.8   2    < 0.001
(AM, CM)           187.8    177.6   2    < 0.001
(AC, AM, CM)         0.4      0.4   1    0.54
(ACM)                0.0      0.0   0    —

^a P-value for G² statistic.
8.3.2 Inference about Conditional Associations
Tests about conditional associations compare loglinear models. The likelihood-ratio statistic −2(L₀ − L₁) is identical to the difference G²(M₀ | M₁) = G²(M₀) − G²(M₁) between deviances for models without that term and with it. For model (XY, XZ, YZ), consider the hypothesis of XY conditional independence. This is H₀: λ_ij^XY = 0 for the (I − 1)(J − 1) XY association parameters. The test statistic is G²(XZ, YZ) − G²(XY, XZ, YZ), with df = (I − 1)(J − 1). This has the same purpose as the generalized CMH and model-based tests for nominal variables presented in Section 7.5.
For instance, the test of conditional independence between alcohol use and cigarette smoking compares model (AM, CM) with the alternative (AC, AM, CM). The test statistic is

    G²[(AM, CM) | (AC, AM, CM)] = 187.8 − 0.4 = 187.4,

with df = 2 − 1 = 1 (P < 0.001). The statistics comparing (AC, CM) and (AC, AM) with (AC, AM, CM) also provide strong evidence of AM and CM conditional associations. Further analyses of Table 8.3 use model (AC, AM, CM).
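A sketch of this comparison in code, using the G² values from Table 8.6; the chi-squared (df = 1) right-tail probability is obtained from the complementary error function, since P(χ²₁ > x) = erfc(√(x/2)):

```python
import math

G2_reduced = 187.8   # model (AM, CM), Table 8.6
G2_full = 0.4        # model (AC, AM, CM), Table 8.6

# Likelihood-ratio test of AC conditional independence, df = 2 - 1 = 1
lr = G2_reduced - G2_full                # 187.4
p_value = math.erfc(math.sqrt(lr / 2))   # chi-squared df=1 right tail
# p_value is far below 0.001: strong evidence of AC conditional association
```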
With large sample sizes, statistically significant effects can be weak and unimportant. A more relevant concern is whether the associations are strong enough to be important. Confidence intervals are more useful than tests for assessing this. Table 8.7 shows output from fitting model (AC, AM, CM) with
TABLE 8.7 Output for Fitting Loglinear Model to Table 8.3

    Criteria For Assessing Goodness Of Fit
    Criterion            DF   Value    Value/DF
    Deviance             1    0.3740   0.3740
    Pearson Chi-Square   1    0.4011   0.4011

    Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr>ChiSq
    Intercept   1     5.6334        0.0597           8903.96        <.0001
    a           1     0.4877        0.0758             41.44        <.0001
    c           1    -1.8867        0.1627            134.47        <.0001
    m           1    -5.3090        0.4752            124.82        <.0001
    a*m         1     2.9860        0.4647             41.29        <.0001
    a*c         1     2.0545        0.1741            139.32        <.0001
    c*m         1     2.8479        0.1638            302.14        <.0001

    LR Statistics
    Source   DF   Chi-Square   Pr>ChiSq
    a*m      1       91.64      <.0001
    a*c      1      187.38      <.0001
    c*m      1      497.00      <.0001
parameters in the last row and in the last column equal to zero, such as by using (1, 0) dummy variables for each classification. Consider the conditional AC odds ratio, assuming model (AC, AM, CM). Table 8.7 reports λ̂_11^AC = 2.054, with SE = 0.174. For these constraints, this is the estimated conditional log odds ratio. A 95% Wald confidence interval for the true conditional AC odds ratio is exp[2.054 ± 1.96(0.174)], or (5.5, 11.0). Strong positive association exists between cigarette use and alcohol use, both for users and nonusers of marijuana.
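The Wald interval computation is a one-liner; a sketch using the estimate and standard error from Table 8.7:

```python
import math

est, se = 2.0545, 0.1741     # conditional AC log odds ratio (Table 8.7)
z = 1.96                     # normal quantile for 95% confidence

# Interval on the log scale, then exponentiate to the odds-ratio scale
lo = math.exp(est - z * se)
hi = math.exp(est + z * se)
print(round(lo, 1), round(hi, 1))   # 5.5 11.0
```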
For model (AC, AM, CM), the 95% Wald confidence intervals are (8.0, 49.2) for the AM conditional odds ratio and (12.5, 23.8) for the CM conditional odds ratio. The intervals are wide, but these associations also are strong. Table 8.5 shows that estimated marginal associations are even stronger. Controlling for outcome on one response moderates the association somewhat between the other two.
The analyses in this section pertain to associations. A different analysis pertains to comparing single-variable marginal distributions, for instance to determine if students used cigarettes more than alcohol or marijuana. That type of analysis is presented in Section 10.1.
8.4 LOGLINEAR MODELS FOR HIGHER DIMENSIONS
Loglinear models for three-way tables are more complex than for two-way tables, because of the variety of potential association terms. Loglinear models for three-way tables extend readily, however, to multiway tables. As the number of dimensions increases, some complications arise. One is the increase in the number of possible association and interaction terms, making model selection more difficult. Another is the increase in the number of cells. In Section 9.8 we show that this can cause difficulties with existence of estimates and appropriateness of asymptotic theory.
8.4.1 Four-Way Contingency Tables
We illustrate models for higher dimensions using a four-way table with variables W, X, Y, and Z. Interpretations are simplest when the model has no three-factor interaction terms. Such models are special cases of

    log μ_hijk = λ + λ_h^W + λ_i^X + λ_j^Y + λ_k^Z
               + λ_hi^WX + λ_hj^WY + λ_hk^WZ + λ_ij^XY + λ_ik^XZ + λ_jk^YZ,

denoted by (WX, WY, WZ, XY, XZ, YZ). Each pair of variables is conditionally dependent, with the same odds ratios at each combination of categories of the other two variables. An absence of a two-factor term implies conditional independence, given the other two variables.
A variety of models exhibit three-factor interaction. A model could contain any of the WXY, WXZ, WYZ, or XYZ terms. For model (WXY, WZ, XZ, YZ), each pair of variables is conditionally dependent, but at each level of Z the WX association, the WY association, and the XY association may vary across categories of the remaining variable. The conditional association between Z and another variable is homogeneous. The saturated model contains all the three-factor terms plus a four-factor interaction term.
8.4.2 Automobile Accident Example
Table 8.8 summarizes observations of 68,694 passengers in autos and light trucks involved in accidents in the state of Maine in 1991. The table classifies passengers by gender (G), location of accident (L), seat-belt use (S), and injury (I). Table 8.8 reports the sample proportion of passengers who were injured. For each GL combination, the proportion of injuries was about halved for passengers wearing seat belts.
Table 8.9 displays tests of fit for several loglinear models. To investigate the complexity of model needed, we consider models (G, I, L, S),
TABLE 8.8 Loglinear Models for Injury, Seat-Belt Use, Gender, and Location^a

                            Observed Injury     (GI, GL, GS, IL, IS, LS)    (GLS, GI, IL, IS)     Sample
Gender  Location  Seat Belt   No       Yes         No         Yes             No        Yes       Proportion Yes
Female  Urban     No          7,287     996      7,166.4      993.0         7,273.2   1,009.8     0.12
                  Yes        11,587     759     11,748.3      721.3        11,632.6     713.4     0.06
        Rural     No          3,246     973      3,353.8      988.8         3,254.7     964.3     0.23
                  Yes         6,134     757      5,985.5      781.9         6,093.5     797.5     0.11
Male    Urban     No         10,381     812     10,471.5      845.1        10,358.9     834.1     0.07
                  Yes        10,969     380     10,837.8      387.6        10,959.2     389.8     0.03
        Rural     No          6,123   1,084      6,045.3    1,038.1         6,150.2   1,056.8     0.15
                  Yes         6,693     513      6,811.4      518.2         6,697.6     508.4     0.07

^a G, gender; I, injury; L, location; S, seat-belt use.
Source: Data courtesy of Cristanna Cook, Medical Care Development, Augusta, Maine.

TABLE 8.9 Goodness-of-Fit Tests for Loglinear Models in Table 8.8

Model                      G²       df   P-Value
(G, I, L, S)               2792.8   11   < 0.0001
(GI, GL, GS, IL, IS, LS)     23.4    5   < 0.001
(GIL, GIS, GLS, ILS)          1.3    1   0.25
(GIL, GS, IS, LS)            18.6    4   0.001
(GIS, GL, IL, LS)            22.8    4   < 0.001
(GLS, GI, IL, IS)             7.5    4   0.11
(ILS, GI, GL, GS)            20.6    4   < 0.001
TABLE 8.10 Estimated Conditional Odds Ratios for Models of Table 8.8

                              Loglinear Model
Odds Ratio          (GI, GL, GS, IL, IS, LS)   (GLS, GI, IL, IS)
GI                  0.58                       0.58
IL                  2.13                       2.13
IS                  0.44                       0.44
GL   S = no         1.23                       1.33
     S = yes        1.23                       1.17
GS   L = urban      0.63                       0.66
     L = rural      0.63                       0.58
LS   G = female     1.09                       1.17
     G = male       1.09                       1.03
(GI, GL, GS, IL, IS, LS), and (GIL, GIS, GLS, ILS), having terms of varying complexity. Model (G, I, L, S) of mutual independence fits very poorly. Model (GI, GL, GS, IL, IS, LS) fits much better but still has a lack of fit (P < 0.001). Model (GIL, GIS, GLS, ILS) fits well (G² = 1.3, df = 1) but is complex and difficult to interpret. This suggests studying models more complex than (GI, GL, GS, IL, IS, LS) but simpler than (GIL, GIS, GLS, ILS).
First, however, we analyze model (GI, GL, GS, IL, IS, LS), which focuses on pairwise associations. Table 8.8 displays its fitted values. Table 8.10 reports the model-based estimated conditional odds ratios. One can obtain them directly using the fitted values for partial tables relating two variables at any combination of levels of the other two. They also follow directly from parameter estimates; for instance, 0.44 = exp(λ̂_11^IS + λ̂_22^IS − λ̂_12^IS − λ̂_21^IS).
Since the sample size is large, the estimates of odds ratios are quite precise. For instance, the standard error of the estimated IS conditional log odds ratio of −0.814 is 0.028. A 95% Wald confidence interval for the true odds ratio is exp[−0.814 ± 1.96(0.028)], or (0.42, 0.47). This model estimates that the odds of injury for passengers wearing seat belts were less than half the odds for passengers not wearing them, at each gender–location combination. The fitted odds ratios in Table 8.10 also suggest that, other factors being fixed, injury was more likely in rural than urban accidents and more likely for females than for males. The estimated odds that males used seat belts were only 0.63 times the estimated odds for females.
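Because model (GI, GL, GS, IL, IS, LS) has no three-factor terms, the fitted IS odds ratio is the same in every gender–location panel. A sketch verifying this from the fitted values in Table 8.8:

```python
def is_odds_ratio(no_belt_no_inj, no_belt_inj, belt_no_inj, belt_inj):
    """IS odds ratio: odds of injury with a seat belt divided by
    odds of injury without one, from four fitted values."""
    return (belt_inj / belt_no_inj) / (no_belt_inj / no_belt_no_inj)

# Fitted values for (GI, GL, GS, IL, IS, LS), Table 8.8
female_urban = is_odds_ratio(7166.4, 993.0, 11748.3, 721.3)
male_rural = is_odds_ratio(6045.3, 1038.1, 6811.4, 518.2)

# Both panels give the same conditional odds ratio, about 0.44
```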
Interpretations are more complex for models containing three-factor interaction terms. Table 8.9 shows results of adding a single three-factor term to model (GI, GL, GS, IL, IS, LS). Of the four possible models, (GLS, GI, IL, IS) appears to fit best. Table 8.8 also displays its fit. Given the large sample size, its G² value suggests that it fits quite well.
For model (GLS, GI, IL, IS), each pair of variables is conditionally dependent, and at each category of I the association between any two of the others varies across categories of the remaining variable. For this model, it is inappropriate to interpret the GL, GS, and LS two-factor terms on their own. Since I does not occur in a three-factor interaction, the conditional odds ratio between I and each variable (see the top portion of Table 8.10) is the same at each combination of categories of the other two variables.
When a model has a three-factor interaction term but no term of higher order than that, one can study the interaction by calculating fitted odds ratios between two variables at each level of the third. One can do this at any levels of remaining variables not involved in the interaction. The bottom portion of Table 8.10 illustrates this for model (GLS, GI, IL, IS). For instance, the fitted GS odds ratio of 0.66 for (L = urban) refers to four fitted values for urban accidents, either the four with (injury = no) or the four with (injury = yes); for example, 0.66 = (7273.2 × 10,959.2)/(11,632.6 × 10,358.9).
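A sketch of that calculation, showing how the GS fitted odds ratio under (GLS, GI, IL, IS) changes with location (values from the injury = no panel of Table 8.8):

```python
# Fitted values (injury = no) for model (GLS, GI, IL, IS), Table 8.8:
# rows = gender (female, male), cols = seat belt (no, yes)
urban = [[7273.2, 11632.6], [10358.9, 10959.2]]
rural = [[3254.7, 6093.5], [6150.2, 6697.6]]

def odds_ratio(t):
    return t[0][0] * t[1][1] / (t[0][1] * t[1][0])

# The GS association differs by location: the three-factor GLS term at work
print(round(odds_ratio(urban), 2))   # 0.66
print(round(odds_ratio(rural), 2))   # 0.58
```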
8.4.3 Large Samples and Statistical versus Practical Significance
Model (GLS, GI, IL, IS) seems to fit much better than (GI, GL, GS, IL, IS, LS). The difference in G² values of 23.4 − 7.5 = 15.9 has df = 5 − 4 = 1 (P = 0.0001). Table 8.10 indicates, however, that the degree of three-factor interaction is weak. The fitted odds ratio between any two of G, L, and S is similar at both levels of the third variable. The significantly better fit of model (GLS, GI, IL, IS) reflects mainly the enormous sample size.
As in any test, a statistically significant effect need not be practically important. With huge samples, it is crucial to focus on estimation rather than hypothesis testing. For instance, a comparison of fitted odds ratios for the two models in Table 8.10 suggests that the simpler model (GI, GL, GS, IL, IS, LS) is adequate for most purposes.
8.4.4 Dissimilarity Index
For a table of arbitrary dimension with cell counts {n_i = n p_i} and fitted values {μ̂_i = n π̂_i}, one can summarize the closeness of a model fit to the data by the dissimilarity index (Gini 1914),

    Δ̂ = Σ_i |n_i − μ̂_i| / 2n = Σ_i |p_i − π̂_i| / 2.

This index falls between 0 and 1, with smaller values representing a better fit. It represents the proportion of sample cases that must move to different cells for the model to fit perfectly.
The dissimilarity index Δ̂ estimates a corresponding population index Δ describing model lack of fit. The value Δ = 0 occurs when the model holds perfectly. In practice, this is unrealistic for unsaturated models, and Δ > 0. The estimator Δ̂ helps study whether the lack of fit is important in a practical sense. When Δ̂ < 0.02 or 0.03, the sample data follow the model pattern quite closely, even though the model is not perfect. When Δ is near 0, Δ̂ tends to overestimate Δ, substantially so for small n. Firth and Kuha (2000) provided an approximate variance for Δ̂ and studied ways to reduce its estimation bias.
For Table 8.8, model (GI, GL, GS, IL, IS, LS) has Δ̂ = 0.008, and model (GLS, GI, IL, IS) has Δ̂ = 0.003. For either model, moving less than 1% of the data yields a perfect fit. The relatively large G² value for (GI, GL, GS, IL, IS, LS) indicated that it does not truly hold. Nevertheless, the small Δ̂ value suggests that, in practical terms, it fits decently.
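A sketch of the dissimilarity index for model (GI, GL, GS, IL, IS, LS), using the observed counts and fitted values of Table 8.8:

```python
# Observed counts and fitted values for (GI, GL, GS, IL, IS, LS), Table 8.8,
# listed cell by cell (injury = no counts first, then injury = yes counts)
observed = [7287, 11587, 3246, 6134, 10381, 10969, 6123, 6693,
            996, 759, 973, 757, 812, 380, 1084, 513]
fitted = [7166.4, 11748.3, 3353.8, 5985.5, 10471.5, 10837.8, 6045.3, 6811.4,
          993.0, 721.3, 988.8, 781.9, 845.1, 387.6, 1038.1, 518.2]

n = sum(observed)  # 68,694 passengers
# Dissimilarity index: sum |n_i - mu_hat_i| / 2n
delta = sum(abs(o - f) for o, f in zip(observed, fitted)) / (2 * n)
print(round(delta, 3))   # 0.008
```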
8.5 LOGLINEAR–LOGIT MODEL CONNECTION
Loglinear models treat categorical response variables symmetrically, focusing
on associations and interactions in their joint distribution. Logit models, by
contrast, describe how a single categorical response depends on explanatory
variables. The model types seem distinct, but connections exist between
them. For a loglinear model, forming logits on one response helps to
interpret the model. Moreover, logit models with categorical explanatory
variables have equivalent loglinear models.
8.5.1 Using Logit Models to Interpret Loglinear Models
To understand implications of a loglinear model formula, it can help to form a logit on one variable. We illustrate with the loglinear model (XY, XZ, YZ). When Y is binary, its logit is

    log [P(Y = 1 | X = i, Z = k) / P(Y = 2 | X = i, Z = k)]
      = log(μ_i1k / μ_i2k) = log μ_i1k − log μ_i2k
      = (λ + λ_i^X + λ_1^Y + λ_k^Z + λ_i1^XY + λ_ik^XZ + λ_1k^YZ)
        − (λ + λ_i^X + λ_2^Y + λ_k^Z + λ_i2^XY + λ_ik^XZ + λ_2k^YZ)
      = (λ_1^Y − λ_2^Y) + (λ_i1^XY − λ_i2^XY) + (λ_1k^YZ − λ_2k^YZ).

The first parenthetical term is a constant, not depending on i or k. The second parenthetical term depends on the category i of X. The third parenthetical term depends on the category k of Z. This logit has the additive form

    logit P(Y = 1 | X = i, Z = k) = α + β_i^X + β_k^Z.    (8.15)

Using the notation summarizing logit models by their predictors, we denote it by (X + Z).
In Section 5.4.1 we discussed this logit model. When Y is binary, the loglinear model (XY, XZ, YZ) is equivalent to it. The λ_ik^XZ terms for association among explanatory variables cancel in the difference of logarithms that the logit defines. The logit model does not study this association.
8.5.2 Auto Accident Example Revisited
For the Maine auto accidents (Table 8.8), in Section 8.4.2 we showed that the loglinear model (GLS, GI, IL, IS),

    log μ_gils = λ + λ_g^G + λ_i^I + λ_l^L + λ_s^S + λ_gi^GI + λ_gl^GL + λ_gs^GS
               + λ_il^IL + λ_is^IS + λ_ls^LS + λ_gls^GLS,

fits well. It is natural to treat injury (I) as a response variable and gender (G), location (L), and seat-belt use (S) as explanatory variables, or perhaps S as a response with G and L as explanatory. One can show that this loglinear model is equivalent to logit model (G + L + S),

    logit P(I = 1 | G = g, L = l, S = s) = α + β_g^G + β_l^L + β_s^S.    (8.16)

For instance, the seat-belt effects in the two models satisfy β_s^S = λ_1s^IS − λ_2s^IS. In the logit calculation, all terms in the loglinear model not having the injury index i cancel. Fitted values, goodness-of-fit statistics, residual df, and standardized Pearson residuals for the logit model are identical to those for the loglinear model.
Odds ratios describing effects on I relate to two-factor loglinear parameters and main-effect logit parameters. In the logit model, the log odds ratio for the effect of S on I equals β_1^S − β_2^S. This equals λ_11^IS + λ_22^IS − λ_12^IS − λ_21^IS in the loglinear model. Their estimates are the same no matter how software sets up constraints. For Table 8.8, β̂_1^S − β̂_2^S = −0.817 for the logit model, and λ̂_11^IS + λ̂_22^IS − λ̂_12^IS − λ̂_21^IS = −0.817 for the loglinear model.
Loglinear models are GLMs that treat the 16 cell counts in Table 8.8 as 16 independent Poisson variates. Logit models are GLMs that treat the table as binomial counts. Logit models with I as the response treat the marginal GLS table {n_{g+ls}} as fixed and regard {n_{g1ls}} as eight independent binomial variates on that response. Although the sampling models differ, the results from fits of corresponding models are identical.
8.5.3 Correspondence between Loglinear and Logit Models
In the derivation of the logit model (X + Z) [see (8.15)] from loglinear model (XY, XZ, YZ), the λ_ik^XZ term cancels. It might seem as if the model (XY, YZ) omitting this term is also equivalent to that logit model. Indeed, forming the logit on Y for (XY, YZ) results in the same logit formula. The loglinear
TABLE 8.11 Equivalent Loglinear and Logit Models for a Three-Way Table with Binary Response Variable Y

Loglinear Symbol   Logit Model                      Logit Symbol
(Y, XZ)            α                                (—)
(XY, XZ)           α + β_i^X                        (X)
(YZ, XZ)           α + β_k^Z                        (Z)
(XY, YZ, XZ)       α + β_i^X + β_k^Z                (X + Z)
(XYZ)              α + β_i^X + β_k^Z + β_ik^XZ      (X*Z)
model that has the same fit as the logit model, however, contains a general interaction term for relationships among the explanatory variables. The logit model does not assume anything about relationships among explanatory variables, so it allows an arbitrary interaction pattern for them.
Table 8.11 summarizes equivalent logit and loglinear models for three-way
tables when Y is a binary response. Each loglinear model contains the XZ
association term relating the explanatory variables in the logit models. The
simple loglinear model Ž Y, XZ . states that Y is jointly independent of both
X and Z, and is equivalent to the logit model having only an intercept.
The saturated loglinear model Ž XYZ . contains the three-factor interaction
term. When Y is a binary response, this model is equivalent to a logit model
with an interaction between the predictors X and Z. For instance, the effect
of X on Y depends on Z, meaning that the XY odds ratio varies across its
categories. That logit model is also saturated.
Analogous correspondences hold when Y has several categories, using baseline-category logit models. An advantage of the loglinear approach is its generality. It applies when more than one response variable exists. The alcohol–cigarette–marijuana example in Section 8.2.4, for instance, used loglinear models to study association patterns among three response variables. Loglinear models are most natural when at least two variables are response variables. When only one is a response, it is more sensible to use logit models directly.
8.5.4 Generalized Loglinear Model*
Let n = (n₁, ..., n_N)′ and μ = (μ₁, ..., μ_N)′ denote column vectors of observed and expected counts for the N cells of a contingency table, with n = Σ_i n_i. For simplicity we use a single index, but the table may be multidimensional. Loglinear models for positive Poisson means have the form

    log μ = Xβ    (8.17)

for model matrix X and column vector β of model parameters.
We illustrate with the independence model, log μ_ij = λ + λ_i^X + λ_j^Y, for a 2 × 2 table. With constraints λ_2^X = λ_2^Y = 0, it is

    [log μ_11]   [1  1  1]
    [log μ_12] = [1  1  0] [λ  λ_1^X  λ_1^Y]′.
    [log μ_21]   [1  0  1]
    [log μ_22]   [1  0  0]
A generalization of (8.17) allows many additional models. This generalized loglinear model is

    C log(Aμ) = Xβ    (8.18)

for matrices C and A. The ordinary loglinear model (8.17) results when C and A are identity matrices. Other special cases include logit models for binary or multicategory responses.
For instance, the loglinear model of independence for a 2 × 2 table is equivalent to a model by which the logit for Y is the same in each row of X (see Section 8.1.2). That logit model has form (8.18): A is a 4 × 4 identity matrix, so Aμ is the 4 × 1 vector μ = (μ_11, μ_12, μ_21, μ_22)′; the product C log(Aμ) forms the logit in row 1 and the logit in row 2 using

    C = [1  −1  0   0]
        [0   0  1  −1];

then X = (1, 1)′ is a 2 × 1 matrix, and β is a single constant α, so Xβ forms a common value for those two logits.
In Chapters 10 and 11 we use the generalized loglinear model for models outside the classes of GLMs studied thus far. An example is modeling marginal distributions of multivariate responses.
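A small numeric sketch of (8.18) for the 2 × 2 logit representation just described, using a hypothetical expected-count vector μ chosen to satisfy independence:

```python
import math
import numpy as np

# Hypothetical expected counts (mu11, mu12, mu21, mu22) for a 2 x 2 table;
# chosen so the odds of (Y = 1 vs Y = 2) are 3:1 in each row of X
mu = np.array([30.0, 10.0, 15.0, 5.0])

A = np.eye(4)                         # A mu is just mu itself
C = np.array([[1.0, -1.0, 0.0, 0.0],  # logit in row 1 of X
              [0.0, 0.0, 1.0, -1.0]]) # logit in row 2 of X

logits = C @ np.log(A @ mu)
# Both logits equal log 3, so X beta = (alpha, alpha)' with alpha = log 3
```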
8.6 LOGLINEAR MODEL FITTING: LIKELIHOOD EQUATIONS AND ASYMPTOTIC DISTRIBUTIONS*
In discussing the fitting of loglinear models, we first derive sufficient statistics
and likelihood equations. We then present large-sample normal distributions
for ML estimators of model parameters and cell probabilities. We illustrate
results with models for three-way tables. For simplicity, derivations use the
Poisson sampling model, which does not require a constraint on parameters
such as the multinomial does.
8.6.1 Minimal Sufficient Statistics
For three-way tables, the joint Poisson probability that cell counts {Y_ijk = n_ijk} is

    Π_i Π_j Π_k  e^{−μ_ijk} μ_ijk^{n_ijk} / n_ijk! ,

where the product refers to all cells of the table. The kernel of the log likelihood is

    L(μ) = Σ_i Σ_j Σ_k n_ijk log μ_ijk − Σ_i Σ_j Σ_k μ_ijk.    (8.19)
For the general loglinear model (8.12), this simplifies to

    L(μ) = nλ + Σ_i n_{i++} λ_i^X + Σ_j n_{+j+} λ_j^Y + Σ_k n_{++k} λ_k^Z
         + Σ_i Σ_j n_{ij+} λ_ij^XY + Σ_i Σ_k n_{i+k} λ_ik^XZ + Σ_j Σ_k n_{+jk} λ_jk^YZ
         + Σ_i Σ_j Σ_k n_ijk λ_ijk^XYZ − Σ_i Σ_j Σ_k exp(λ + ⋯ + λ_ijk^XYZ).    (8.20)

Since the Poisson distribution is in the exponential family, coefficients of the parameters are sufficient statistics. For this saturated model, {n_ijk} are coefficients of {λ_ijk^XYZ}, so there is no reduction of the data. For simpler models, certain parameters are zero and (8.20) simplifies. For instance, for the model (X, Y, Z) of mutual independence, sufficient statistics are the coefficients in (8.20) of {λ_i^X}, {λ_j^Y}, and {λ_k^Z}. These are {n_{i++}}, {n_{+j+}}, and {n_{++k}}.
Table 8.12 lists minimal sufficient statistics for several loglinear models. Each one is the coefficient of the highest-order term(s) in which a variable appears. In fact, they are the marginal distributions for terms in the model symbol. Simpler models use more condensed sample information. For instance, whereas (X, Y, Z) uses only the single-factor marginal distributions, (XY, XZ, YZ) uses the two-way marginal tables.

TABLE 8.12 Minimal Sufficient Statistics for Fitting Loglinear Models

Model            Minimal Sufficient Statistics
(X, Y, Z)        {n_{i++}}, {n_{+j+}}, {n_{++k}}
(XY, Z)          {n_{ij+}}, {n_{++k}}
(XY, YZ)         {n_{ij+}}, {n_{+jk}}
(XY, XZ, YZ)     {n_{ij+}}, {n_{i+k}}, {n_{+jk}}
8.6.2 Likelihood Equations for Loglinear Models
The fitted values for a model are solutions to the likelihood equations. We derive likelihood equations using general representation (8.17) for a loglinear model. For a vector of counts n with μ = E(n), the model is log μ = Xβ, for which log(μ_i) = Σ_j x_ij β_j for all i.
Extending (8.19), for Poisson sampling the log likelihood is

    L(β) = Σ_i n_i log μ_i − Σ_i μ_i
         = Σ_i n_i (Σ_j x_ij β_j) − Σ_i exp(Σ_j x_ij β_j).    (8.21)

The sufficient statistic for β_j is its coefficient, Σ_i n_i x_ij. Since

    ∂/∂β_j exp(Σ_j x_ij β_j) = x_ij exp(Σ_j x_ij β_j) = x_ij μ_i,

    ∂L(β)/∂β_j = Σ_i n_i x_ij − Σ_i μ_i x_ij,   j = 1, 2, ..., p.

The likelihood equations equate these derivatives to zero. They have the form

    X′n = X′μ̂.    (8.22)

These equations equate the sufficient statistics to their expected values, a result obtained with GLM theory in (4.29). For models considered so far, these sufficient statistics are the marginal tables in the model symbol.
To illustrate, consider model (XZ, YZ). Its log likelihood is (8.20) with λ^XY = λ^XYZ = 0. The log-likelihood derivatives

    ∂L/∂λ_ik^XZ = n_{i+k} − μ_{i+k}   and   ∂L/∂λ_jk^YZ = n_{+jk} − μ_{+jk}

yield the likelihood equations

    μ̂_{i+k} = n_{i+k}   for all i and k,    (8.23)
    μ̂_{+jk} = n_{+jk}   for all j and k.    (8.24)

Derivatives with respect to lower-order terms yield equations implied by these (Problem 8.30). For model (XZ, YZ), the fitted values have the same XZ and YZ marginal totals as the observed data.
8.6.3 Birch's Results for Loglinear Models
For model (XZ, YZ), from (8.23), (8.24), and Table 8.12, the minimal sufficient statistics are the ML estimates of the corresponding marginal distributions of expected frequencies. Equation (8.22) gives the corresponding result for any loglinear model. Birch (1963) showed that likelihood equations for loglinear models match minimal sufficient statistics to their expected values. Poisson GLM theory implied this result in (4.29) and (4.44). Thus, fitted values for loglinear models are smoothed versions of the cell counts that match them in certain marginal distributions but have associations and interactions satisfying the model-implied patterns.
Birch showed that a unique set of fitted values both satisfy the model and match the data in the minimal sufficient statistics. Hence, if we find such a solution, it must be the ML solution. To illustrate, the independence model for a two-way table

    log μ_ij = λ + λ_i^X + λ_j^Y

has minimal sufficient statistics {n_{i+}} and {n_{+j}}. The likelihood equations are

    μ̂_{i+} = n_{i+},   μ̂_{+j} = n_{+j},   for all i and j.

The fitted values {μ̂_ij = n_{i+} n_{+j}/n} satisfy these equations and also satisfy the model. Birch's result implies that they are the ML estimates.
8.6.4
Direct versus Iterative Calculation of Fitted Values
To illustrate how to solve likelihood equations, we continue the analysis of
model Ž XZ, YZ .. From Ž8.9., the model satisfies
i jk s
iqk qj k
qqk
for all i , j, and k.
For Poisson sampling, the related formula uses expected frequencies. Setting
i jk s i jkrn, this is i jk s iqk qj krqqk 4 . The likelihood equations
Ž8.23. and Ž8.24. specify that ML estimates satisfy
ˆ iqk s n iqk and
ˆqj k s
nqj k and thus also
ˆqqk s nqqk . Since ML estimates of functions of parameters are the same functions of the ML estimates of those parameters,
ˆ i jk s
ˆ iqk
ˆqj k
ˆqqk
s
n iqk nqj k
nqqk
.
This solution satisfies the model and matches the data in the sufficient
statistics. Thus, it is the unique ML solution.
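The direct formula is easy to verify numerically. The sketch below (with an arbitrary random table, not data from the text) checks that $\hat{\mu}_{ijk} = n_{i+k}n_{+jk}/n_{++k}$ matches the sufficient statistics and exhibits conditional independence of X and Y at each level of Z:

```python
import numpy as np

rng = np.random.default_rng(1)
n = rng.integers(1, 30, size=(2, 3, 2)).astype(float)  # an I x J x K table

# Direct ML fit for model (XZ, YZ): mu_ijk = n_{i+k} n_{+jk} / n_{++k}
n_ipk = n.sum(axis=1)        # XZ margin, shape (I, K)
n_pjk = n.sum(axis=0)        # YZ margin, shape (J, K)
n_ppk = n.sum(axis=(0, 1))   # Z margin, shape (K,)
mu = n_ipk[:, None, :] * n_pjk[None, :, :] / n_ppk

# Likelihood equations (8.23)-(8.24): fitted XZ and YZ margins match the data
print(np.allclose(mu.sum(axis=1), n_ipk))  # True
print(np.allclose(mu.sum(axis=0), n_pjk))  # True

# Within each level k of Z, the fitted slice factors into its row and column
# margins, i.e., X and Y are conditionally independent given Z
for k in range(n.shape[2]):
    slab = mu[:, :, k] / mu[:, :, k].sum()
    assert np.allclose(slab, np.outer(slab.sum(axis=1), slab.sum(axis=0)))
```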
MODEL FITTING: LIKELIHOOD EQUATIONS AND ASYMPTOTICS

TABLE 8.13 Fitted Values for Loglinear Models in Three-Way Tables

Model^a           Probabilistic Form                            Fitted Value
(X, Y, Z)         $\pi_{ijk} = \pi_{i++}\pi_{+j+}\pi_{++k}$     $\hat{\mu}_{ijk} = n_{i++}n_{+j+}n_{++k}/n^2$
(XY, Z)           $\pi_{ijk} = \pi_{ij+}\pi_{++k}$              $\hat{\mu}_{ijk} = n_{ij+}n_{++k}/n$
(XY, XZ)          $\pi_{ijk} = \pi_{ij+}\pi_{i+k}/\pi_{i++}$    $\hat{\mu}_{ijk} = n_{ij+}n_{i+k}/n_{i++}$
(XY, XZ, YZ)      $\pi_{ijk} = \psi_{ij}\phi_{jk}\omega_{ik}$   Iterative methods (Section 8.7)
(XYZ)             No restriction                                $\hat{\mu}_{ijk} = n_{ijk}$

^a Formulas for models not listed are obtained by symmetry; for example, for (XZ, Y), $\hat{\mu}_{ijk} = n_{i+k}n_{+j+}/n$.
Similar reasoning produces $\{\hat{\mu}_{ijk}\}$ for all except one model in Table 8.12. Table 8.13 shows the formulas. That table also expresses $\{\pi_{ijk}\}$ in terms of marginal probabilities. These expressions and the likelihood equations determine the ML formulas, using the approach just described.

For models having explicit formulas for $\hat{\mu}_{ijk}$, the estimates are said to be direct. Many loglinear models do not have direct estimates. ML estimation then requires iterative methods. Of the models in Tables 8.12 and 8.13, the only one not having direct estimates is (XY, XZ, YZ). Although the two-way marginal tables are its minimal sufficient statistics, it is not possible to express $\{\pi_{ijk}\}$ directly in terms of $\{\pi_{ij+}\}$, $\{\pi_{i+k}\}$, and $\{\pi_{+jk}\}$. Direct estimates do not exist for unsaturated models containing all two-factor associations. In practice, it is not essential to know which models have direct estimates. Iterative methods for models not having direct estimates also apply to models that have direct estimates. Statistical software for loglinear models uses such iterative methods in all cases.
8.6.5 Chi-Squared Goodness-of-Fit Tests

Model goodness-of-fit statistics compare fitted cell counts to sample counts. For Poisson GLMs, in Section 4.5.2 we showed that for models with an intercept term, the deviance equals the $G^2$ statistic. With a fixed number of cells, $G^2$ and $X^2$ have approximate chi-squared null distributions when expected frequencies are large. The df equal the difference in dimension between the alternative and null hypotheses. This equals the difference between the number of parameters in the general case and when the model holds.

We illustrate with model (X, Y, Z), for multinomial sampling with probabilities $\{\pi_{ijk}\}$. In the general case, the only constraint is $\sum_i\sum_j\sum_k \pi_{ijk} = 1$, so there are $IJK - 1$ parameters. For model (X, Y, Z), $\{\pi_{ijk} = \pi_{i++}\pi_{+j+}\pi_{++k}\}$ are determined by $I - 1$ of $\{\pi_{i++}\}$ (since $\sum_i \pi_{i++} = 1$), $J - 1$ of $\{\pi_{+j+}\}$, and $K - 1$ of $\{\pi_{++k}\}$. Thus,

$$\text{df} = (IJK - 1) - [(I - 1) + (J - 1) + (K - 1)] = IJK - I - J - K + 2.$$
TABLE 8.14 Residual Degrees of Freedom for Loglinear Models for Three-Way Tables

Model            Degrees of Freedom
(X, Y, Z)        $IJK - I - J - K + 2$
(XY, Z)          $(K - 1)(IJ - 1)$
(XZ, Y)          $(J - 1)(IK - 1)$
(YZ, X)          $(I - 1)(JK - 1)$
(XY, YZ)         $J(I - 1)(K - 1)$
(XZ, YZ)         $K(I - 1)(J - 1)$
(XY, XZ)         $I(J - 1)(K - 1)$
(XY, XZ, YZ)     $(I - 1)(J - 1)(K - 1)$
(XYZ)            0
The same df formula applies for Poisson sampling. Then, the general case has $IJK$ parameters $\{\mu_{ijk}\}$. The residual df equal the number of cells in the table minus the number of parameters in the Poisson loglinear model for $\{\mu_{ijk}\}$. For instance, model (X, Y, Z) has residual df $= IJK - [1 + (I - 1) + (J - 1) + (K - 1)]$, reflecting the single intercept parameter and constraints such as $\lambda_I^X = \lambda_J^Y = \lambda_K^Z = 0$. This equals the number of linearly independent parameters equated to zero in the saturated model to obtain the given model. Table 8.14 shows df formulas for testing three-way loglinear models.
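The df bookkeeping above is mechanical enough to script. The helper below (a sketch; the function name and term encoding are my own) counts one intercept plus $\prod(\text{levels} - 1)$ free parameters per term and subtracts from the number of cells, reproducing two entries of Table 8.14:

```python
# Residual df for a three-way loglinear model: number of cells minus the
# number of linearly independent parameters. Each term is encoded as a tuple
# of factor sizes, e.g. (I,) for a main effect, (I, K) for a two-factor term.
def residual_df(I, J, K, terms):
    cells = I * J * K
    params = 1  # intercept
    for sizes in terms:
        p = 1
        for s in sizes:
            p *= s - 1  # each factor contributes (levels - 1) free parameters
        params += p
    return cells - params

I, J, K = 3, 4, 5
# Model (X, Y, Z): main effects only
df1 = residual_df(I, J, K, [(I,), (J,), (K,)])
# Model (XZ, YZ): main effects plus the XZ and YZ association terms
df2 = residual_df(I, J, K, [(I,), (J,), (K,), (I, K), (J, K)])
print(df1, I * J * K - I - J - K + 2)   # both 50, matching Table 8.14
print(df2, K * (I - 1) * (J - 1))       # both 30, matching Table 8.14
```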
8.6.6 Covariance Matrix of ML Parameter Estimators

To present large-sample distributions of ML parameter estimators, we return to the general expression $\log \mu_i = \sum_j x_{ij}\beta_j$, from which we obtained the log-likelihood derivatives

$$\frac{\partial L(\boldsymbol{\mu})}{\partial \beta_j} = \sum_i n_i x_{ij} - \sum_i \mu_i x_{ij}, \qquad j = 1, 2, \ldots, p.$$

The Hessian matrix of second partial derivatives has elements

$$\frac{\partial^2 L(\boldsymbol{\mu})}{\partial \beta_j\,\partial \beta_k} = -\sum_i x_{ij}\frac{\partial \mu_i}{\partial \beta_k} = -\sum_i x_{ij}\left\{ \frac{\partial}{\partial \beta_k}\exp\left( \sum_h x_{ih}\beta_h \right) \right\} = -\sum_i x_{ij}x_{ik}\mu_i.$$

Like logistic regression models, loglinear models are GLMs using the canonical link; thus this matrix does not depend on the observed data.
The information matrix, the negative of this matrix, is

$$\mathbf{I} = \mathbf{X}'\,\text{diag}(\boldsymbol{\mu})\,\mathbf{X},$$

where $\text{diag}(\boldsymbol{\mu})$ has the elements of $\boldsymbol{\mu}$ on the main diagonal. For a fixed number of cells, as $n \to \infty$, the ML estimator $\hat{\boldsymbol{\beta}}$ is asymptotically normal with mean $\boldsymbol{\beta}$ and covariance matrix $\mathbf{I}^{-1}$. Thus, for Poisson sampling, the asymptotic covariance matrix is

$$\text{cov}(\hat{\boldsymbol{\beta}}) = [\mathbf{X}'\,\text{diag}(\boldsymbol{\mu})\,\mathbf{X}]^{-1}. \qquad (8.25)$$

Substituting ML fitted values and then taking square roots of diagonal elements yields standard errors for $\hat{\boldsymbol{\beta}}$. This also follows from the general expression (4.28) for GLMs, as noted in Section 4.4.7.
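Expression (8.25) translates directly into code. The sketch below evaluates the information matrix at the fitted values of the independence model for an illustrative 2 × 3 table (the counts and dummy coding are arbitrary choices) and extracts standard errors:

```python
import numpy as np

n = np.array([20.0, 10, 5, 30, 15, 25])          # illustrative counts
N = n.reshape(2, 3)
X = np.array([[1, 1, 1, 0], [1, 1, 0, 1], [1, 1, 0, 0],
              [1, 0, 1, 0], [1, 0, 0, 1], [1, 0, 0, 0]], dtype=float)

# ML fitted values for independence, then the information matrix at the fit
mu = np.outer(N.sum(axis=1), N.sum(axis=0)).ravel() / N.sum()
info = X.T @ np.diag(mu) @ X          # X' diag(mu_hat) X
cov = np.linalg.inv(info)             # estimated asymptotic cov, as in (8.25)
se = np.sqrt(np.diag(cov))            # standard errors for beta_hat
print(se.round(4))
```

The information matrix is symmetric positive definite here, so the inverse exists and all standard errors are positive, consistent with the full-rank argument in Section 8.6.9.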
8.6.7 Connection between Multinomial and Poisson Loglinear Models

Similar asymptotic results hold with multinomial sampling. When $\{Y_i,\ i = 1, \ldots, N\}$ are independent Poisson random variables, the conditional distribution of $\{Y_i\}$ given $n = \sum_i Y_i$ is multinomial with parameters $\{\pi_i = \mu_i/(\sum_a \mu_a)\}$. Birch (1963) showed that ML estimates of loglinear model parameters are the same for multinomial sampling as for independent Poisson sampling. He showed that the estimates are also the same for independent multinomial sampling, as long as the model contains a term for the marginal distribution fixed by the sampling design. To illustrate, suppose that at each combination of categories of X and Z, an independent multinomial sample occurs on Y. Then, $\{n_{i+k}\}$ are fixed. The model must contain $\lambda_{ik}^{XZ}$, so the fitted values satisfy $\{\hat{\mu}_{i+k} = n_{i+k}\}$.
That separate inferential theory is unnecessary for multinomial loglinear models follows from the following argument. Express the Poisson loglinear model for $\{\mu_i\}$ as

$$\log \mu_i = \alpha + \mathbf{x}_i\boldsymbol{\beta},$$

where $(1, \mathbf{x}_i)$ is row $i$ of the model matrix $\mathbf{X}$ and $(\alpha, \boldsymbol{\beta}')'$ is the model parameter vector. The Poisson log likelihood is

$$L = L(\alpha, \boldsymbol{\beta}) = \sum_i n_i \log \mu_i - \sum_i \mu_i = \sum_i n_i(\alpha + \mathbf{x}_i\boldsymbol{\beta}) - \sum_i \exp(\alpha + \mathbf{x}_i\boldsymbol{\beta}) = n\alpha + \sum_i n_i\mathbf{x}_i\boldsymbol{\beta} - \tau,$$

where $\tau = \sum_i \mu_i = \sum_i \exp(\alpha + \mathbf{x}_i\boldsymbol{\beta})$. Since $\log \tau = \alpha + \log[\sum_i \exp(\mathbf{x}_i\boldsymbol{\beta})]$, this log likelihood has the form

$$L = L(\alpha, \boldsymbol{\beta}) = \left\{ \sum_i n_i\mathbf{x}_i\boldsymbol{\beta} - n\log\left[ \sum_i \exp(\mathbf{x}_i\boldsymbol{\beta}) \right] \right\} + (n\log\tau - \tau). \qquad (8.26)$$

Now $\pi_i = \mu_i/(\sum_a \mu_a) = \exp(\alpha + \mathbf{x}_i\boldsymbol{\beta})/[\sum_a \exp(\alpha + \mathbf{x}_a\boldsymbol{\beta})]$, and $\exp(\alpha)$ cancels in the numerator and denominator. Thus, the first term (in braces) on the right-hand side in (8.26) is $\sum n_i \log \pi_i$, which is the multinomial log likelihood, conditional on the total cell count $n$. Unconditionally, $n = \sum_i n_i$ has a Poisson distribution with expectation $\sum_i \mu_i = \tau$, so the second term in (8.26) is the Poisson log likelihood for $n$. Since $\boldsymbol{\beta}$ enters only in the first term, the ML estimator $\hat{\boldsymbol{\beta}}$ and its covariance matrix for the Poisson log likelihood $L(\alpha, \boldsymbol{\beta})$ are identical to those for the multinomial log likelihood. The Poisson loglinear model has one more parameter (i.e., $\alpha$) than the multinomial loglinear model because of the random sample size. See Birch (1963), Lang (1996c), McCullagh and Nelder (1989, p. 211), and Palmgren (1981) for details.
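The decomposition (8.26) can be checked numerically for any positive expected frequencies. In the sketch below (arbitrary illustrative numbers; kernels only, dropping constants such as $\log n_i!$), the Poisson log likelihood equals the multinomial term plus the Poisson term for $n$:

```python
import numpy as np

# Numerical check of the decomposition (8.26): the Poisson log likelihood
# splits into a multinomial part plus a Poisson log likelihood for n.
n_i = np.array([12.0, 7, 30, 11])
mu_i = np.array([10.0, 9, 28, 13])   # any positive expected frequencies

n = n_i.sum()
tau = mu_i.sum()
pi_i = mu_i / tau

poisson_ll = np.sum(n_i * np.log(mu_i)) - tau   # L = sum n_i log mu_i - sum mu_i
multinomial_ll = np.sum(n_i * np.log(pi_i))     # first term (in braces) of (8.26)
poisson_ll_for_n = n * np.log(tau) - tau        # second term of (8.26)

print(np.isclose(poisson_ll, multinomial_ll + poisson_ll_for_n))  # True
```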
For a multinomial sample, we show in Section 14.4.1 that the estimated covariance matrix of loglinear parameter estimators is

$$\widehat{\text{cov}}(\hat{\boldsymbol{\beta}}) = \{\mathbf{X}'[\text{diag}(\hat{\boldsymbol{\mu}}) - \hat{\boldsymbol{\mu}}\hat{\boldsymbol{\mu}}'/n]\mathbf{X}\}^{-1}. \qquad (8.27)$$

The intercept from the Poisson model is not relevant, and $\mathbf{X}$ for the multinomial model deletes the column of $\mathbf{X}$ pertaining to it in the Poisson model.

A similar argument applies with several independent multinomial samples. Each log-likelihood term is a sum of components from different samples, but the Poisson log likelihood again decomposes into two parts. One part is a Poisson log likelihood for the independent sample sizes, and the other part is the sum of the independent multinomial log likelihoods. Palmgren (1981) showed that conditional on observed marginal totals for explanatory variables, the asymptotic covariances for estimators of parameters involving the response are the same as for Poisson sampling. For a single multinomial sample, Palmgren's result implies that (8.27) is identical to (8.25) with the row and column referring to $\alpha$ deleted. Birch (1963) and Goodman (1970) gave related results. Lang (1996c) gave an elegant discussion of connections between multinomial and Poisson models. His results imply that the asymptotic variance of any linear contrast of estimated log means within a covariate level is identical for the two models.
8.6.8 Distribution of Probability Estimators

For multinomial sampling, the ML estimates of cell probabilities are $\hat{\boldsymbol{\pi}} = \hat{\boldsymbol{\mu}}/n$. We next give the asymptotic $\text{cov}(\hat{\boldsymbol{\pi}})$. Lang (1996c) showed the asymptotic covariance matrix of $\hat{\boldsymbol{\mu}}$ for Poisson sampling and its connection with $\text{cov}(\hat{\boldsymbol{\pi}})$.

The saturated model has $\hat{\boldsymbol{\pi}} = \mathbf{p}$, the sample proportions. Under multinomial sampling, from (3.7) and (3.8), their covariance matrix is

$$\text{cov}(\mathbf{p}) = [\text{diag}(\boldsymbol{\pi}) - \boldsymbol{\pi}\boldsymbol{\pi}']/n. \qquad (8.28)$$

With $I$ independent multinomial samples on a response variable with $J$ categories, $\boldsymbol{\pi}$ and $\mathbf{p}$ consist of $I$ sets of proportions, each having $J - 1$ nonredundant elements. Then, $\text{cov}(\mathbf{p})$ is a block diagonal matrix. Each of the independent samples has a $(J - 1) \times (J - 1)$ block of form (8.28), and the matrix contains zeros off the main diagonal of blocks.
Now assume an unsaturated model. Using the delta method, we show in Sections 14.2.2 and 14.4.1 that $\hat{\boldsymbol{\pi}}$ has an asymptotic normal distribution about $\boldsymbol{\pi}$. The estimated covariance matrix equals

$$\widehat{\text{cov}}(\hat{\boldsymbol{\pi}}) = \widehat{\text{cov}}(\mathbf{p})\,\mathbf{X}\,[\mathbf{X}'\,\widehat{\text{cov}}(\mathbf{p})\,\mathbf{X}]^{-1}\,\mathbf{X}'\,\widehat{\text{cov}}(\mathbf{p}).$$

For a single multinomial sample, this expression equals

$$\widehat{\text{cov}}(\hat{\boldsymbol{\pi}}) = \left\{ [\text{diag}(\hat{\boldsymbol{\pi}}) - \hat{\boldsymbol{\pi}}\hat{\boldsymbol{\pi}}']\,\mathbf{X}\,[\mathbf{X}'(\text{diag}(\hat{\boldsymbol{\pi}}) - \hat{\boldsymbol{\pi}}\hat{\boldsymbol{\pi}}')\mathbf{X}]^{-1}\,\mathbf{X}'[\text{diag}(\hat{\boldsymbol{\pi}}) - \hat{\boldsymbol{\pi}}\hat{\boldsymbol{\pi}}'] \right\}/n.$$

(Here $\mathbf{X}$ is the multinomial model matrix, with the column for the intercept deleted.) For tables with many cells, it is not unusual to have a sample proportion of 0 in a cell. In this case the ordinary standard error is 0, which is unappealing. An advantage of fitting a model is that it typically has a positive fitted probability and standard error.
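To confirm that the two displays above agree, the sketch below computes both for the independence fit of an illustrative 2 × 3 multinomial table (counts and dummy coding are arbitrary assumptions). Following the text, the model matrix omits the intercept column:

```python
import numpy as np

n = np.array([20.0, 10, 5, 30, 15, 25])
N = n.reshape(2, 3)
ntot = N.sum()

# Multinomial model matrix for independence: row dummy plus two column
# dummies, intercept column deleted (dummy coding is an arbitrary choice)
X = np.array([[1, 1, 0], [1, 0, 1], [1, 0, 0],
              [0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=float)

pi = np.outer(N.sum(axis=1), N.sum(axis=0)).ravel() / ntot**2  # fitted probabilities
D = np.diag(pi) - np.outer(pi, pi)   # diag(pi_hat) - pi_hat pi_hat'
cov_p = D / ntot                     # estimated cov(p), form (8.28)

# First display: cov(p) X [X' cov(p) X]^{-1} X' cov(p)
form1 = cov_p @ X @ np.linalg.inv(X.T @ cov_p @ X) @ X.T @ cov_p
# Second display: {D X [X' D X]^{-1} X' D} / n
form2 = D @ X @ np.linalg.inv(X.T @ D @ X) @ X.T @ D / ntot

print(np.allclose(form1, form2))  # True
```

The agreement is exact: the factors of $n$ cancel when $\widehat{\text{cov}}(\mathbf{p}) = [\text{diag}(\hat{\boldsymbol{\pi}}) - \hat{\boldsymbol{\pi}}\hat{\boldsymbol{\pi}}']/n$ is substituted into the first display.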
8.6.9 Uniqueness of ML Estimates

When all $\{n_i > 0\}$, the ML estimates exist and are unique. To show this, for simplicity we use Poisson sampling. Suppose that the model is parameterized so that $\mathbf{X}$ has full rank. Birch (1963) showed that the likelihood equations are soluble, by noting that the kernel of the Poisson log likelihood

$$L(\boldsymbol{\mu}) = \sum_i (n_i \log \mu_i - \mu_i)$$

has individual terms converging to $-\infty$ as $\log \mu_i \to \pm\infty$; thus, the log likelihood is bounded above and attains its maximum at finite values of the model parameters. It is stationary at this maximum, since it has continuous first partial derivatives.

Birch showed that the likelihood equations have a unique solution, and the likelihood is maximized at that point. He proved this by showing that the matrix of values $\{-\partial^2 L/\partial\beta_h\,\partial\beta_j\}$ [i.e., the information matrix $\mathbf{X}'\,\text{diag}(\boldsymbol{\mu})\,\mathbf{X}$] is nonsingular and nonnegative definite, and hence positive definite. Nonsingularity follows from $\mathbf{X}$ having full rank and the diagonal matrix having positive elements $\{\mu_i\}$. Any quadratic form $\mathbf{c}'\mathbf{X}'\,\text{diag}(\boldsymbol{\mu})\,\mathbf{X}\mathbf{c} = \sum_i \mu_i(\sum_j x_{ij}c_j)^2 \geq 0$, so the matrix is also nonnegative definite.
8.7 LOGLINEAR MODEL FITTING: ITERATIVE METHODS AND THEIR APPLICATION*

When a loglinear model does not have direct estimates, iterative algorithms such as Newton–Raphson can solve the likelihood equations. In this section we also present a simpler but more limited method, iterative proportional fitting.

8.7.1 Newton–Raphson Method

In Section 4.6.1 we introduced the Newton–Raphson method. Referring to the notation there, we identify $L(\boldsymbol{\beta})$ as the log likelihood for Poisson loglinear models. From (8.21), let

$$L(\boldsymbol{\beta}) = \left( \sum_i n_i \sum_h x_{ih}\beta_h \right) - \sum_i \exp\left( \sum_h x_{ih}\beta_h \right).$$
Then

$$u_j = \frac{\partial L(\boldsymbol{\beta})}{\partial \beta_j} = \sum_i n_i x_{ij} - \sum_i \mu_i x_{ij}, \qquad h_{jk} = \frac{\partial^2 L(\boldsymbol{\beta})}{\partial \beta_j\,\partial \beta_k} = -\sum_i \mu_i x_{ij}x_{ik},$$

so that

$$u_j^{(t)} = \sum_i (n_i - \mu_i^{(t)})x_{ij} \qquad\text{and}\qquad h_{jk}^{(t)} = -\sum_i \mu_i^{(t)} x_{ij}x_{ik}.$$

The $t$th approximation $\boldsymbol{\beta}^{(t)}$ for $\hat{\boldsymbol{\beta}}$ yields $\boldsymbol{\mu}^{(t)}$ through $\boldsymbol{\mu}^{(t)} = \exp(\mathbf{X}\boldsymbol{\beta}^{(t)})$. It generates the next value $\boldsymbol{\beta}^{(t+1)}$ using (4.39), which in this context is

$$\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + [\mathbf{X}'\,\text{diag}(\boldsymbol{\mu}^{(t)})\,\mathbf{X}]^{-1}\,\mathbf{X}'(\mathbf{n} - \boldsymbol{\mu}^{(t)}).$$

This in turn produces $\boldsymbol{\mu}^{(t+1)}$, and so on. Alternatively, $\boldsymbol{\beta}^{(t+1)}$ can be expressed as

$$\boldsymbol{\beta}^{(t+1)} = -(\mathbf{H}^{(t)})^{-1}\mathbf{r}^{(t)}, \qquad (8.29)$$

where $r_j^{(t)} = \sum_i \mu_i^{(t)} x_{ij}\left[ \log \mu_i^{(t)} + (n_i - \mu_i^{(t)})/\mu_i^{(t)} \right]$. The expression in brackets is the first-order Taylor series expansion of $\log n_i$ at $\log \mu_i^{(t)}$.
The iterative process begins with all $\mu_i^{(0)} = n_i$, or with an adjustment such as $\mu_i^{(0)} = n_i + \tfrac{1}{2}$ if any $n_i = 0$. Then (8.29) produces $\boldsymbol{\beta}^{(1)}$, and for $t > 0$ the iterations proceed as just described with $\{n_i\}$. For loglinear models, $L(\boldsymbol{\beta})$ is concave, and $\boldsymbol{\beta}^{(t)}$ and $\boldsymbol{\mu}^{(t)}$ usually converge rapidly to the ML estimates $\hat{\boldsymbol{\beta}}$ and $\hat{\boldsymbol{\mu}}$ as $t$ increases. The $\mathbf{H}^{(t)}$ matrix converges to $\hat{\mathbf{H}} = -\mathbf{X}'\,\text{diag}(\hat{\boldsymbol{\mu}})\,\mathbf{X}$. By (8.25), the estimated large-sample covariance matrix of $\hat{\boldsymbol{\beta}}$ is $-\hat{\mathbf{H}}^{-1}$, a by-product of the method.

As we discussed in Section 4.6.3 for GLMs, (8.29) has the iterative reweighted least squares form

$$\boldsymbol{\beta}^{(t+1)} = (\mathbf{X}'\hat{\mathbf{V}}_t^{-1}\mathbf{X})^{-1}\,\mathbf{X}'\hat{\mathbf{V}}_t^{-1}\mathbf{z}^{(t)}.$$

Here, $\mathbf{z}^{(t)}$ has elements $z_i^{(t)} = \log \mu_i^{(t)} + (n_i - \mu_i^{(t)})/\mu_i^{(t)}$ and $\hat{\mathbf{V}}_t = [\text{diag}(\boldsymbol{\mu}^{(t)})]^{-1}$. Thus, $\boldsymbol{\beta}^{(t+1)}$ is the weighted least squares solution for the model

$$\mathbf{z}^{(t)} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon},$$

where $\{\epsilon_i\}$ are uncorrelated with variances $\{1/\mu_i^{(t)}\}$. With $\{\mu_i^{(0)} = n_i\}$, $\boldsymbol{\beta}^{(1)}$ is the weighted least squares estimate for the model $\log(\mathbf{n}) = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$.
8.7.2 Iterative Proportional Fitting

The iterative proportional fitting (IPF) algorithm is a simple method for calculating $\{\hat{\mu}_i\}$ for loglinear models. Introduced by Deming and Stephan (1940), it has the following steps:

1. Start with $\{\mu_i^{(0)}\}$ satisfying a model no more complex than the one being fitted. For instance, $\{\mu_i^{(0)} \equiv 1.0\}$ are trivially adequate.
2. By multiplying by appropriate factors, adjust $\{\mu_i^{(0)}\}$ successively to match each marginal table in the set of minimal sufficient statistics.
3. Continue until the maximum difference between the sufficient statistics and their fitted values is sufficiently close to zero.

We illustrate using model (XY, XZ, YZ). Its minimal sufficient statistics are $\{n_{ij+}\}$, $\{n_{i+k}\}$, and $\{n_{+jk}\}$. Initial estimates must satisfy the model. The first cycle of the IPF algorithm has three steps:

$$\mu_{ijk}^{(1)} = \mu_{ijk}^{(0)}\,\frac{n_{ij+}}{\mu_{ij+}^{(0)}}, \qquad \mu_{ijk}^{(2)} = \mu_{ijk}^{(1)}\,\frac{n_{i+k}}{\mu_{i+k}^{(1)}}, \qquad \mu_{ijk}^{(3)} = \mu_{ijk}^{(2)}\,\frac{n_{+jk}}{\mu_{+jk}^{(2)}}.$$

Summing both sides of the first expression over $k$ shows that $\mu_{ij+}^{(1)} = n_{ij+}$ for all $i$ and $j$. After step 1, observed and fitted values match in the XY marginal table. After step 2, all $\mu_{i+k}^{(2)} = n_{i+k}$, but the XY marginal tables no longer match. After step 3, all $\mu_{+jk}^{(3)} = n_{+jk}$, but the XY and XZ marginal tables no
longer match. A new cycle begins by again matching the XY marginal tables, using $\mu_{ijk}^{(4)} = \mu_{ijk}^{(3)}(n_{ij+}/\mu_{ij+}^{(3)})$, and so on.

At each step, the updated estimates continue to satisfy the model. For instance, step 1 uses the same adjustment factor $(n_{ij+}/\mu_{ij+}^{(0)})$ at different levels $k$ of Z. Thus, XY odds ratios from different levels of Z have ratio equal to 1, and the homogeneous association pattern continues at each step.

As the cycles progress, the $G^2$ statistic comparing cell counts to the updated fit is monotone decreasing, and the process must converge (Fienberg 1970a; Haberman 1974a). The IPF algorithm produces ML estimates because it generates a sequence of fitted values converging to a solution that both satisfies the model and matches the sufficient statistics. By Birch's results (Section 8.6.3), only one such solution exists, and it is ML.
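A minimal IPF implementation for model (XY, XZ, YZ) cycles through the three margins exactly as in the displayed steps; the function name, the fixed cycle count, and the random test table are arbitrary choices:

```python
import numpy as np

# Iterative proportional fitting for model (XY, XZ, YZ): scale the current
# fit to match each two-way margin of the data in turn.
def ipf_no_three_factor(n, n_cycles=200):
    mu = np.ones_like(n, dtype=float)   # satisfies the model trivially
    for _ in range(n_cycles):
        mu = mu * (n.sum(axis=2) / mu.sum(axis=2))[:, :, None]  # match XY margin
        mu = mu * (n.sum(axis=1) / mu.sum(axis=1))[:, None, :]  # match XZ margin
        mu = mu * (n.sum(axis=0) / mu.sum(axis=0))[None, :, :]  # match YZ margin
    return mu

rng = np.random.default_rng(2)
n = rng.integers(5, 40, size=(2, 2, 2)).astype(float)
mu = ipf_no_three_factor(n)

print(np.allclose(mu.sum(axis=2), n.sum(axis=2)))  # XY margins match at convergence
# Homogeneous association: the XY odds ratio is the same at both levels of Z
odds = mu[0, 0, :] * mu[1, 1, :] / (mu[0, 1, :] * mu[1, 0, :])
print(np.isclose(odds[0], odds[1]))
```

Each scaling step multiplies whole rows, columns, or layers by constants, so the no-three-factor-interaction structure is preserved throughout, and by Birch's results the limit is the ML fit.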
The IPF method works even for models having direct estimates. Then, IPF normally yields ML estimates within one cycle (Haberman 1974a, p. 197). We illustrate with the model of independence. The minimal sufficient statistics are $\{n_{i+}\}$ and $\{n_{+j}\}$. With $\{\mu_{ij}^{(0)} \equiv 1.0\}$, the first cycle gives

$$\mu_{ij}^{(1)} = \mu_{ij}^{(0)}\,\frac{n_{i+}}{\mu_{i+}^{(0)}} = \frac{n_{i+}}{J}, \qquad \mu_{ij}^{(2)} = \mu_{ij}^{(1)}\,\frac{n_{+j}}{\mu_{+j}^{(1)}} = \frac{n_{i+}n_{+j}}{n}.$$

The IPF algorithm then gives $\hat{\mu}_{ij}^{(t)} = n_{i+}n_{+j}/n$ for all $t \geq 2$.
8.7.3 Comparison of Iterative Methods

The IPF algorithm is simple and easy to implement. It converges to the ML fit even when the likelihood is poorly behaved, for instance with zero fitted counts and estimates on the boundary of the parameter space. The Newton–Raphson method is more complex, requiring the solution of a system of equations at each step. Newton–Raphson is sometimes not feasible when the model is of high dimensionality, for instance, when the contingency table and parameter vector are huge.

However, IPF has disadvantages. It is applicable primarily to models for which the likelihood equations equate observed and fitted counts in marginal tables. By contrast, Newton–Raphson is a general-purpose method that can solve more complex likelihood equations. IPF sometimes converges slowly compared to Newton–Raphson. Unlike Newton–Raphson, IPF does not produce the model parameter estimates and their estimated covariance matrix as a by-product. The fitted values that IPF produces can generate this information: model parameter estimates are contrasts of $\{\log \hat{\mu}_i\}$ (see Problems 8.16 and 8.17), and substituting fitted values into (8.25) yields $\widehat{\text{cov}}(\hat{\boldsymbol{\beta}})$.

Because Newton–Raphson applies to a wide variety of models and also yields standard errors, it is the fitting routine used by most software for loglinear models. IPF is increasingly viewed as primarily of historical interest. However, for some applications the analysis is more transparent using IPF, as the next example illustrates.
8.7.4 Contingency Table Standardization

Table 8.15 relates education and attitudes toward legalized abortion, using a General Social Survey conducted by the National Opinion Research Center. To make patterns of association clearer, Smith (1976) standardized the table so that all row and column marginal totals equal 100 while maintaining the sample odds ratio structure.

The IPF routine to standardize with margins of 100 sets $\mu_{ij}^{(0)} = n_{ij}$ and then, for $t = 1, 3, 5, \ldots$,

$$\mu_{ij}^{(t)} = \mu_{ij}^{(t-1)}\,\frac{100}{\mu_{i+}^{(t-1)}}, \qquad \mu_{ij}^{(t+1)} = \mu_{ij}^{(t)}\,\frac{100}{\mu_{+j}^{(t)}}.$$

At the end of each odd-numbered step, all row totals equal 100. At the end of each even-numbered step, all column totals equal 100. Odds ratios do not change at any odd (even) step, since all counts in a given row (column) multiply by the same constant.

The IPF algorithm converges to the entries in parentheses in Table 8.15. The association is clearer in this standardized table. A ridge appears down the main diagonal, with higher levels of education having more favorable attitudes about abortion. The other counts fall away smoothly on both sides.
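The standardization routine is the same IPF scaling written for a two-way table. The sketch below rakes the observed counts of Table 8.15 to row and column totals of 100; the result should agree (to rounding) with the parenthesized entries in the table, and raking leaves the odds ratios unchanged:

```python
import numpy as np

# Rake a two-way table so every row and column total equals 100, alternating
# row and column adjustments as in the IPF standardization routine.
def rake(n, row_target=100.0, col_target=100.0, n_cycles=100):
    mu = n.astype(float).copy()
    for _ in range(n_cycles):
        mu = mu * (row_target / mu.sum(axis=1))[:, None]  # odd steps: fix rows
        mu = mu * (col_target / mu.sum(axis=0))[None, :]  # even steps: fix columns
    return mu

# Observed counts from Table 8.15 (schooling by attitude toward abortion)
n = np.array([[209.0, 151, 16],
              [101, 126, 21],
              [237, 426, 138]])
std = rake(n)
print(std.round(1))  # rows and columns now each total 100

# Odds ratios are preserved by raking:
orig_or = n[0, 0] * n[1, 1] / (n[0, 1] * n[1, 0])
std_or = std[0, 0] * std[1, 1] / (std[0, 1] * std[1, 0])
print(np.isclose(orig_or, std_or))  # True
```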
TABLE 8.15 Marginal Standardization of Attitudes toward Abortion by Years of Schooling

                              Attitude toward Legalized Abortion
Schooling                 Generally Disapprove   Middle Position   Generally Approve   Total
Less than high school     209 (49.4)             151 (32.8)         16 (17.8)          (100)
High school               101 (32.0)             126 (36.6)         21 (31.3)          (100)
More than high school     237 (18.6)             426 (30.6)        138 (50.9)          (100)
Total                        (100)                  (100)              (100)

Source: Smith (1976).

Table standardization is useful for comparing tables having different marginal structures. Mosteller (1968) compared intergenerational occupational mobility tables from Britain and Denmark. Yule (1912) compared three hospitals on vaccination and recovery for smallpox patients. A modern application is adjusting sample data to match marginal distributions specified by census results.
The process of table standardization is called raking the table. Imrey et al. (1981) and Little and Wu (1991) derived the asymptotic covariance matrix for raked sample proportions. For sample counts $\{n_{ij}\}$ with $\{\mu_{ij} = E(n_{ij})\}$, let $\{E_{ij}\}$ denote expected frequencies for the standardized table and $\{\hat{E}_{ij}\}$ fitted values in the standardized table. The standardization process corresponds to fitting the model

$$\log(E_{ij}/\mu_{ij}) = \lambda + \lambda_i^E + \lambda_j^A.$$

That is, maintaining the odds ratios means that the two-way tables of $\{E_{ij}/\mu_{ij}\}$ and of $\{\hat{E}_{ij}/n_{ij}\}$ satisfy independence. The fitted values $\{\hat{E}_{ij}\}$ in the standardized table satisfy

$$\log \hat{E}_{ij} - \log n_{ij} = \hat{\lambda} + \hat{\lambda}_i^E + \hat{\lambda}_j^A.$$

The adjustment term, $-\log n_{ij}$, to the log link of the fit is called an offset. The fit corresponds to using $\log n_{ij}$ as a predictor on the right-hand side and forcing its coefficient to equal 1.0. Standard GLM software can fit models having offsets. To rake a table, one enters as sample data pseudo-values that satisfy independence and have the desired margins, taking $\log n_{ij}$ as an offset. (For SAS, see Table A.14.) In Section 9.7.1 we discuss further the use of model offsets.
NOTES

Section 8.2: Loglinear Models for Independence and Interaction in Three-Way Tables

8.1. Roy and Mitra (1956) discussed types of independence for three-way tables and their large-sample tests. Birch's (1963) article on ML estimation for loglinear models was part of substantial research on loglinear models in the 1960s, much of it due to L. A. Goodman (see Section 16.4). Haberman (1974a) presented an influential theoretical study of loglinear models.

Section 8.3: Inference for Loglinear Models

8.2. Goodman (1970, 1971b), Haberman (1974a, Chap. 5), Lauritzen (1996), Sundberg (1975), and Whittaker (1990, Sec. 12.4) discussed families of loglinear models that have direct ML estimates and interpretations in terms of independence, conditional independence, or equiprobability. Such models are called decomposable, since expected frequencies decompose into products and ratios of expected marginal sufficient statistics. Haberman proved conditions under which loglinear models have direct estimates. Baglivo et al. (1992), Forster et al. (1996), and Morgan and Blumenstein (1991) discussed exact inference.

8.3. For methods that allow for misclassification error, see Kuha and Skinner (1997) and Kuha et al. (1998) and references therein. For treatment of missing data, see Little (1998), Schafer (1997, Chap. 8), and their references.

Section 8.7: Loglinear Model Fitting: Iterative Methods and Their Application

8.4. Deming (1964, Chap. VII) described early work on IPF by Deming and Stephan. Darroch (1962) used IPF to obtain ML estimates in contingency tables. Bishop et al. (1975), Fienberg (1970a), and Speed (1998) presented other applications of IPF. Darroch and Ratcliff (1972) generalized IPF for models in which the sufficient statistics are more complex than marginal tables.

8.5. For further discussion of table raking, see Bishop et al. (1975, pp. 76–102), Fleiss (1981, Chap. 14), Haberman (1979, Chap. 9), Hoem (1987), and Little and Wu (1991).
PROBLEMS
Applications
8.1 The 1988 General Social Survey compiled by the National Opinion Research Center asked: "Do you support or oppose the following measures to deal with AIDS? (1) Have the government pay all of the health care costs of AIDS patients; (2) Develop a government information program to promote safe sex practices, such as the use of condoms." Table 8.16 summarizes opinions about health care costs (H) and the information program (I), classified also by the respondent's gender (G).
a. Fit loglinear models (GH, GI), (GH, HI), (GI, HI), and (GH, GI, HI). Show that models that lack the HI term fit poorly.
b. For model (GH, GI, HI), show that 95% Wald confidence intervals equal (0.55, 1.10) for the GH conditional odds ratio and (0.99, 2.55) for the GI conditional odds ratio. Interpret. Is it plausible that gender has no effect on opinion for these issues?
TABLE 8.16 Data for Problem 8.1

                                 Information Opinion
Gender     Health Opinion        Support      Oppose
Male       Support                 76            6
           Oppose                 114           11
Female     Support                160           25
           Oppose                 181           48

Source: 1988 General Social Survey, National Opinion Research Center.
TABLE 8.17 Data for Problem 8.2 a
Home
President
Busing
1
2
3
1
1
2
3
1
2
3
1
2
3
41
71
1
2
3
1
0
0
0
65
157
17
5
44
0
3
10
0
0
1
0
0
0
0
1
0
1
2
3
a: 1, yes; 2, no; 3, don't know.
Source: 1991 General Social Survey, National Opinion Research
Center.
8.2 Refer to Table 8.17 from the 1991 General Social Survey. White subjects were asked: (B) "Do you favor busing of (Negro/Black) and white school children from one school district to another?"; (P) "If your party nominated a (Negro/Black) for President, would you vote for him if he were qualified for the job?"; (D) "During the last few years, has anyone in your family brought a friend who was a (Negro/Black) home for dinner?" The response scale for each item was (yes, no, don't know). Fit model (BD, BP, DP).
a. Using the yes and no categories, estimate the conditional odds ratio for each pair of variables. Interpret.
b. Analyze the model's goodness of fit. Interpret.
c. Conduct inference for the BP conditional association using a Wald or likelihood-ratio confidence interval and test. Interpret.
8.3 Refer to Section 8.3.2. Explain why software for which parameters sum to zero across the levels of each index reports $\hat{\lambda}_{11}^{AC} = \hat{\lambda}_{22}^{AC} = 0.514$ and $\hat{\lambda}_{12}^{AC} = \hat{\lambda}_{21}^{AC} = -0.514$, with SE = 0.044 for each term.
8.4 Refer to Table 2.6. Let D = defendant's race, V = victims' race, and P = death penalty verdict. Fit the loglinear model (DV, DP, PV).
a. Using the fitted values, estimate and interpret the odds ratio between D and P at each level of V. Note the common odds ratio property.
b. Calculate the marginal odds ratio between D and P, (i) using the fitted values, and (ii) using the sample data. Why are they equal? Contrast the odds ratio with that of part (a). Explain why Simpson's paradox occurs.
TABLE 8.18 Data for Problem 8.5

Safety Equipment     Whether        Injury
in Use               Ejected        Nonfatal     Fatal
Seat belt            Yes              1,105         14
                     No             411,111        483
None                 Yes              4,624        497
                     No             157,342      1,008

Source: Florida Department of Highway Safety and Motor Vehicles.
c. Fit the corresponding logit model, treating P as the response. Show the correspondence between parameter estimates and fit statistics.
d. Is there a simpler model that fits well? Interpret, and show the logit–loglinear connection.

8.5 Table 8.18 refers to automobile accident records in Florida in 1988.
a. Find a loglinear model that describes the data well. Interpret the associations.
b. Treating whether killed as the response, fit an equivalent logit model. Interpret the effects.
c. Since n is large, goodness-of-fit statistics are large unless the model fits very well. Calculate the dissimilarity index for the model in part (a), and interpret.
8.6 Refer to Table 8.19. Subjects were asked their opinions about government spending on the environment (E), health (H), assistance to big cities (C), and law enforcement (L).
TABLE 8.19 Data for Problem 8.6^a

                            Cities:      1                 2                 3
                            Law:     1    2    3       1    2    3       1    2    3
Environment   Health
1             1             62   17    5      90   42    3      74   31   11
1             2             11    7    0      22   18    1      19   14    3
1             3              2    3    1       2    0    1       1    3    1
2             1             11    3    0      21   13    2      20    8    3
2             2              1    4    0       6    9    0       6    5    2
2             3              1    0    1       2    1    1       4    3    1
3             1              3    0    0       2    1    0       9    2    1
3             2              1    0    0       2    1    0       4    2    0
3             3              1    0    0       0    0    0       1    2    3

^a 1, Too little; 2, about right; 3, too much.
Source: 1989 General Social Survey, National Opinion Research Center.
TABLE 8.20 Output for Fitting Model to Table 8.19

Criteria For Assessing Goodness Of Fit
Criterion              DF     Value        Value/DF
Deviance               48     31.6695      0.6598
Pearson Chi-Square     48     26.5224      0.5526
Log Likelihood              1284.9404

                                 Standard     Wald 95%             Chi-
Parameter     DF    Estimate     Error        Confidence Limits    Square
e*h  1 1       1     2.1425      0.5566       1.0515    3.2335     14.81
e*h  1 2       1     1.4221      0.6034       0.2394    2.6049      5.55
e*h  2 1       1     0.7294      0.5667      -0.3813    1.8402      1.66
e*h  2 2       1     0.3183      0.6211      -0.8991    1.5356      0.26
e*l  1 1       1    -0.1328      0.6378      -1.3829    1.1172      0.04
e*l  1 2       1     0.3739      0.6975      -0.9931    1.7410      0.29
e*l  2 1       1    -0.2630      0.6796      -1.5949    1.0689      0.15
e*l  2 2       1     0.4250      0.7361      -1.0178    1.8678      0.33
e*c  1 1       1     1.2000      0.5177       0.1854    2.2147      5.37
e*c  1 2       1     1.3896      0.4774       0.4540    2.3253      8.47
e*c  2 1       1     0.6917      0.5605      -0.4068    1.7902      1.52
e*c  2 2       1     1.3767      0.5024       0.3921    2.3614      7.51
h*c  1 1       1    -0.1865      0.4547      -1.0777    0.7048      0.17
h*c  1 2       1     0.7464      0.4808      -0.1959    1.6886      2.41
h*c  2 1       1    -0.4675      0.4978      -1.4431    0.5081      0.88
h*c  2 2       1     0.7293      0.5023      -0.2553    1.7138      2.11
h*l  1 1       1     1.8741      0.5079       0.8786    2.8696     13.61
h*l  1 2       1     1.0366      0.5262       0.0052    2.0680      3.88
h*l  2 1       1     1.9371      0.6226       0.7168    3.1574      9.68
h*l  2 2       1     1.8230      0.6355       0.5775    3.0686      8.23
c*l  1 1       1     0.8735      0.4604      -0.0289    1.7760      3.60
c*l  1 2       1     0.5707      0.4863      -0.3824    1.5239      1.38
c*l  2 1       1     1.0793      0.4326       0.2314    1.9271      6.23
c*l  2 2       1     1.2058      0.4462       0.3312    2.0804      7.30
a. Table 8.20 shows some results, including the two-factor estimates, for the homogeneous association model. Check the fit, and interpret.
b. All estimates at category 3 of each variable equal 0. Report the estimated conditional odds ratios using the too much and too little categories for each pair of variables. Summarize the associations. Based on these results, which term(s) might you consider dropping from the model? Why?
c. Table 8.21 reports $\{\hat{\lambda}_{eh}^{EH}\}$ when parameters sum to zero within rows and within columns, and when parameters are zero in the first row and first column. Show how these yield the estimated EH conditional odds ratio for the too much and too little categories. Compare to part (b). Construct a confidence interval for that odds ratio. Interpret.
TABLE 8.21 Parameter Estimates for Problem 8.6

            Sum-to-Zero Constraints            Zero for First Level
                      H                                  H
E             1        2        3              1        2        3
1          0.509   -0.065   -0.445             0        0        0
2          0.166   -0.099   -0.068             0    0.309    0.720
3         -0.676    0.163    0.513             0    1.413    2.142
8.7 Refer to the loglinear models for Table 8.8.
a. Explain why the fitted odds ratios in Table 8.10 for model (GI, GL, GS, IL, IS, LS) suggest that the most likely accident case for injury is females not wearing seat belts in rural locations.
b. Fit model (GLS, GI, IL, IS). Using the model parameter estimates, show that the fitted IS conditional odds ratio equals 0.44. Show that for each injury level, the estimated conditional LS odds ratio is 1.17 for G = female and 1.03 for G = male. How can you get these using the model parameter estimates?
8.8 Consider the following two-stage model for Table 8.8. The first stage is a logit model with S as the response for the three-way GLS table. The second stage is a logit model with these three variables as predictors for I in the four-way table. Explain why this composite model is sensible, fit the models, and interpret the results.
8.9 Refer to the logit model in Problem 5.24. Let A = opinion on abortion.
a. Give the symbol for the loglinear model that is equivalent to this logit model.
b. Which logit model corresponds to loglinear model (AR, AP, GRP)?
c. State the equivalent loglinear and logit models for which (i) A is jointly independent of G, R, and P; (ii) there are main effects of R on A, but A is conditionally independent of G and P, given R; (iii) there is interaction between P and R in their effects on A, and G has main effects.
8.10 For a multiway contingency table, when is a logit model more appropriate than a loglinear model? When is a loglinear model more
appropriate?
8.11 Using software, conduct the analyses described in this chapter for the
student survey data (Table 8.3).
LOGLINEAR MODELS FOR CONTINGENCY TABLES
8.12 Standardize Table 10.6. Describe the migration patterns.
8.13 The book’s Web site (www.stat.ufl.edu/~aa/cda/cda.html) has a 2 ×
3 × 2 × 2 table relating responses on frequency of attending religious
services, political views, opinion on making birth control available to
teenagers, and opinion about a man and woman having sexual relations before marriage. Analyze these data using loglinear models.
Theory and Methods
8.14 Suppose that {μ_ij = nπ_ij} satisfy the independence model (8.1).
a. Show that λ_a^Y − λ_b^Y = log(π_{+a}/π_{+b}).
b. Show that {all λ_j^Y = 0} is equivalent to {π_{+j} = 1/J for all j}.
8.15 Refer to the independence model, μ_ij = μα_iβ_j. For the corresponding
loglinear model (8.1):
a. Show that one can constrain Σ_i λ_i^X = Σ_j λ_j^Y = 0 by setting

    λ_i^X = log α_i − (Σ_h log α_h)/I,    λ_j^Y = log β_j − (Σ_h log β_h)/J,

    λ = log μ + (Σ_h log α_h)/I + (Σ_h log β_h)/J.

b. Show that one can constrain λ_1^X = λ_1^Y = 0 by defining λ_i^X = log α_i
− log α_1 and λ_j^Y = log β_j − log β_1. Then, what does λ equal?
8.16 For an I × J table, let η_ij = log μ_ij, and let a dot subscript denote the
mean for that index (e.g., η_{i·} = Σ_j η_ij/J). Then, let λ = η_{··}, λ_i^X = η_{i·} −
η_{··}, λ_j^Y = η_{·j} − η_{··}, and λ_ij^XY = η_ij − η_{i·} − η_{·j} + η_{··}.
a. Show that log μ_ij = λ + λ_i^X + λ_j^Y + λ_ij^XY. Hence, any set of positive
{μ_ij} satisfies the saturated model.
b. Show that Σ_i λ_i^X = Σ_j λ_j^Y = Σ_i λ_ij^XY = Σ_j λ_ij^XY = 0.
c. For 2 × 2 tables, show that log θ = 4λ_11^XY.
d. For 2 × J tables, show that λ_11^XY = (Σ_j log α_j)/2J, where α_j =
μ_11 μ_2j / (μ_21 μ_1j), j = 2, ..., J.
e. Alternative constraints have other odds ratio formulas. Let λ = η_11,
λ_i^X = η_i1 − η_11, λ_j^Y = η_1j − η_11, and λ_ij^XY = η_ij − η_i1 − η_1j + η_11.
Then, show that the saturated model holds with λ_1^X = λ_1^Y = λ_1j^XY =
λ_i1^XY = 0 for all i and j, and λ_ij^XY = log(μ_11 μ_ij / μ_1j μ_i1).
8.17 Suppose that all μ_ijk > 0. Let η_ijk = log μ_ijk, and consider model
parameters with zero-sum constraints.
a. For the general loglinear model (8.12), define parameters in the
fashion of Problem 8.16 (e.g., λ_ij^XY = η_{ij·} − η_{i··} − η_{·j·} + η_{···}).
b. For model (XY, XZ, YZ) with a 2 × 2 × 2 table, show that λ_11^XY
= (1/4) log θ_11(k).
c. For (XYZ) with a 2 × 2 × 2 table, show that

    λ_111^XYZ = (1/8) log[θ_11(1)/θ_11(2)].

Thus, λ_ijk^XYZ = 0 is equivalent to θ_11(1) = θ_11(2).
8.18 Two balanced coins are flipped, independently. Let X = whether the
first flip resulted in a head (yes, no), Y = whether the second flip
resulted in a head, and Z = whether both flips had the same result.
Using this example, show that marginal independence for each pair of
three variables does not imply that the variables are mutually independent.
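Problem 8.18’s construction is easy to verify by enumerating the joint distribution. A minimal sketch in Python (our illustration, not part of the text):

```python
from itertools import product

# Joint distribution of (X, Y, Z): two independent fair coin flips,
# with Z = whether both flips gave the same result.
joint = {}
for x, y in product([0, 1], repeat=2):
    z = int(x == y)
    joint[(x, y, z)] = joint.get((x, y, z), 0.0) + 0.25

def marg(positions):
    """Marginal distribution over the given coordinate positions."""
    out = {}
    for cell, p in joint.items():
        key = tuple(cell[i] for i in positions)
        out[key] = out.get(key, 0.0) + p
    return out

# Each pair of variables is marginally independent ...
for pair in [(0, 1), (0, 2), (1, 2)]:
    m2, ma, mb = marg(pair), marg(pair[:1]), marg(pair[1:])
    assert all(abs(m2[(a, b)] - ma[(a,)] * mb[(b,)]) < 1e-12 for (a, b) in m2)

# ... yet mutual independence fails:
# P(X=1, Y=1, Z=1) = 1/4, while the product of the three marginals is 1/8.
print(joint[(1, 1, 1)])  # 0.25
```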
8.19 For three categorical variables X, Y, and Z:
a. When Y is jointly independent of X and Z, show that X and Y are
conditionally independent, given Z.
b. Prove that mutual independence of X, Y, and Z implies that X and
Y are both marginally and conditionally independent.
c. When X is independent of Y and Y is independent of Z, does it
follow that X is independent of Z? Explain.
d. When any pair of variables is conditionally independent, explain
why there is no three-factor interaction.
8.20 Suppose that X and Y are conditionally independent, given Z, and X
and Z are marginally independent.
a. Show that X is jointly independent of Y and Z.
b. Show X and Y are marginally independent.
c. Show that if X and Z are conditionally (rather than marginally)
independent, then X and Y are still marginally independent.
8.21 A 2 × 2 × 2 table satisfies π_{i++} = π_{+j+} = π_{++k} = 1/2 for all i, j, k. Give
an example of {π_ijk} that satisfies model (a) (X, Y, Z), (b) (XY, Z),
(c) (XY, YZ), (d) (XY, XZ, YZ), and (e) (XYZ), but in each case not a
simpler model.
8.22 Suppose that model (XY, XZ, YZ) holds in a 2 × 2 × 2 table, and the
common XY conditional log odds ratio at the two levels of Z is
positive. If the XZ and YZ conditional log odds ratios are both
positive or both negative, show that the XY marginal odds ratio is
larger than the XY conditional odds ratio. Hence, Simpson’s paradox
cannot occur for the XY association.
8.23 Show that the general loglinear model in T dimensions has 2^T terms.
[Hint: It has an intercept, (T choose 1) single-factor terms, (T choose 2)
two-factor terms, ... .]
8.24 Each of T responses is binary. For dummy variables {z_1, ..., z_T}, the
loglinear model of mutual independence has the form

    log μ_{z_1, ..., z_T} = λ + λ_1 z_1 + ··· + λ_T z_T.

Show how to express the general loglinear model (Cox 1972).
8.25 Consider a cross-classification of W, X, Y, Z.
a. Explain why (WXZ, WYZ) is the most general loglinear model for
which X and Y are conditionally independent.
b. State the model symbol for which X and Y are conditionally
independent and there is no three-factor interaction.
8.26 For a four-way table with binary response Y, give the equivalent
loglinear and logit models that have:
a. Main effects of A, B, and C on Y.
b. Interaction between A and B in their effects on Y, and C has main
effects.
c. Repeat part (a) for a nominal response Y with a baseline-category
logit model.
8.27 For a 3 × 3 table with ordered rows having scores {x_i}, identify all
terms in the generalized loglinear model (8.18) for models (a)
logit[P(Y ≤ j)] = α_j + βx_i, and (b) log[P(Y = j)/P(Y = 3)] = α_j +
β_j x_i.
8.28 For the independence model for a two-way table, derive minimal
sufficient statistics, likelihood equations, fitted values, and residual df.
8.29 For the loglinear model for an I × J table, log μ_ij = λ + λ_i^X, show
that μ̂_ij = n_{i+}/J and residual df = I(J − 1).
8.30 Write the log likelihood L for model (XZ, YZ). Calculate ∂L/∂λ and
show that it implies μ̂_{+++} = n. Show that ∂L/∂λ_i^X = n_{i++} − μ_{i++}.
Similarly, differentiate with respect to each parameter to obtain likelihood equations. Show that (8.23) and (8.24) imply the other equations, so
those equations determine the ML estimates.
8.31 For model (XY, Z), derive (a) minimal sufficient statistics, (b) likelihood equations, (c) fitted values, and (d) residual df for tests of fit.
8.32 Consider the loglinear model with symbol (XZ, YZ).
a. For fixed k, show that {μ̂_ijk} equal the fitted values for testing
independence between X and Y within level k of Z.
b. Show that the Pearson and likelihood-ratio statistics for testing this
model’s fit have form X² = Σ_k X_k², where X_k² tests independence
between X and Y at level k of Z.
8.33 Verify the df values shown in Table 8.14 for models (XY, Z), (XY, YZ),
and (XY, XZ, YZ).
8.34 Verify that loglinear model (GLS, GI, LI, IS) implies logit model (8.16).
Show that the conditional log odds ratio for the effect of S on I equals
β_1^S − β_2^S in the logit model and λ_11^IS + λ_22^IS − λ_12^IS − λ_21^IS in the loglinear
model.
8.35 Table 8.22 shows fitted values for models for four-way tables that have
direct estimates.
a. Use Birch’s results to verify that the entry is correct for (W, X, Y, Z).
Verify its residual df.
b. Motivate the estimate and df formulas for (WX, YZ), (WXY, Z),
(WXY, WZ), and (WXY, WXZ) using composite variables and the
corresponding results for two-way tables [e.g., for (WXY, WZ), given
W, Z is independent of the composite XY variable].
TABLE 8.22  Data for Problem 8.35 a

Model            Expected Frequency Estimate                        Residual DF
(W, X, Y, Z)     n_{h+++} n_{+i++} n_{++j+} n_{+++k} / n³           HIJK − H − I − J − K + 3
(WX, Y, Z)       n_{hi++} n_{++j+} n_{+++k} / n²                    HIJK − HI − J − K + 2
(WX, WY, Z)      n_{hi++} n_{h+j+} n_{+++k} / (n_{h+++} n)          HIJK − HI − HJ − K + H + 1
(WX, YZ)         n_{hi++} n_{++jk} / n                              (HI − 1)(JK − 1)
(WX, WY, XZ)     n_{hi++} n_{h+j+} n_{+i+k} / (n_{h+++} n_{+i++})   HIJK − HI − HJ − IK + H + I
(WX, WY, WZ)     n_{hi++} n_{h+j+} n_{h++k} / (n_{h+++})²           HIJK − HI − HJ − HK + 2H
(WXY, Z)         n_{hij+} n_{+++k} / n                              (HIJ − 1)(K − 1)
(WXY, WZ)        n_{hij+} n_{h++k} / n_{h+++}                       H(IJ − 1)(K − 1)
(WXY, WXZ)       n_{hij+} n_{hi+k} / n_{hi++}                       HI(J − 1)(K − 1)

a Number of levels of W, X, Y, Z, denoted by H, I, J, K. Estimates for other models of each type
are obtained by symmetry.
8.36 A T-dimensional table {n_{ab...t}} has I_i categories in dimension i.
a. Find minimal sufficient statistics, ML estimates of cell probabilities,
and residual df for the mutual independence model.
b. Find the minimal sufficient statistics and residual df for the hierarchical model having all two-factor associations but no three-factor
interactions.
8.37 Consider loglinear model (X, Y, Z) for a 2 × 2 × 2 table.
a. Express the model in the form log μ = Xβ.
b. Show that the likelihood equations X′n = X′μ̂ equate {n_ijk} and
{μ̂_ijk} in the one-dimensional margins.
8.38 Apply IPF to models (a) (X, YZ) and (b) (XZ, YZ). Show that the ML
estimates result within one cycle.
8.39 Given target row totals {r_i > 0} and column totals {c_j > 0}:
a. Explain how to use IPF to adjust sample proportions {p_ij} to have
these totals but maintain the sample odds ratios.
b. Show how to find cell proportions that have these totals and for
which all local odds ratios equal θ > 0. (Hint: Take initial values of
1.0 in all cells in the first row and in the first column. This
determines all other initial cell entries such that all local odds ratios
equal θ.)
c. Explain how cell proportions are determined by the marginal proportions and the local odds ratios.
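The adjustment in part (a) is the classical iterative proportional fitting computation. A minimal sketch in Python (our illustration; the sample proportions and target margins are invented):

```python
import numpy as np

def ipf(table, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Rescale rows, then columns, until both sets of margins match the
    targets. Each rescaling multiplies whole rows or columns, so every
    local odds ratio of `table` is preserved at every step."""
    p = table.astype(float).copy()
    for _ in range(max_iter):
        p *= (row_targets / p.sum(axis=1))[:, None]
        p *= col_targets / p.sum(axis=0)
        if np.allclose(p.sum(axis=1), row_targets, atol=tol):
            break
    return p

p = np.array([[0.2, 0.1], [0.3, 0.4]])          # sample proportions (invented)
adj = ipf(p, np.array([0.5, 0.5]), np.array([0.6, 0.4]))

# Margins now match the targets, and the odds ratio is unchanged:
odds = lambda t: t[0, 0] * t[1, 1] / (t[0, 1] * t[1, 0])
print(adj.sum(axis=1), adj.sum(axis=0), odds(p), odds(adj))
```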
8.40 Refer to Birch’s results in Section 8.6.3. Show that L has individual
terms converging to −∞ as log μ_i → ±∞. Explain why positive definiteness of the information matrix implies that the solution of the
likelihood equations is unique, with likelihood maximized at that point.
CHAPTER 9
Building and Extending
Loglinear/Logit Models
In Chapters 5 through 7 we presented logistic regression models, which use
the logit link for binomial or multinomial responses. In Chapter 8 we
presented loglinear models for contingency tables, which use the log link for
Poisson cell counts. Equivalences between them were discussed in Section
8.5.3. In this chapter we discuss building and extending these models with
contingency tables.
In Section 9.1 we present graphs that show a model’s association and
conditional independence patterns. In Section 9.2 we discuss selection and
comparison of loglinear models. Diagnostics for checking models, such as
residuals, are presented in Section 9.3.
The loglinear models of Chapter 8 treat all variables as nominal. In
Section 9.4 we present loglinear models of association between ordinal
variables. In Sections 9.5 and 9.6 we present generalizations that replace
fixed scores by parameters. In the final section we discuss complications that
occur with sparse contingency tables.
9.1
ASSOCIATION GRAPHS AND COLLAPSIBILITY
A graphical representation for associations in loglinear models indicates the
pairs of conditionally independent variables. This representation helps reveal
implications of models. Our presentation derives partly from Darroch et al.
(1980), who used mathematical graph theory to represent certain loglinear
models (called graphical models) having a conditional independence structure.
9.1.1
Association Graphs
An association graph has a set of vertices, each vertex representing a variable.
An edge connecting two variables represents a conditional association between them. For instance, loglinear model (WX, WY, WZ, YZ) lacks XY and
XZ terms. It assumes independence between X and Y and between X and
Z, conditional on the remaining two variables. Figure 9.1 portrays this
model’s association graph. The four variables form the vertices. The four
edges represent pairwise conditional associations. Edges do not connect X
and Y or X and Z, the conditionally independent pairs.

FIGURE 9.1  Association graph for model (WX, WY, WZ, YZ).
Two loglinear models with the same pairwise associations have the same
association graph. For instance, this association graph is also the one for
model (WX, WYZ), which adds a three-factor WYZ interaction.
A path in an association graph is a sequence of edges leading from one
variable to another. Two variables X and Y are said to be separated by a
subset of variables if all paths connecting X and Y intersect that subset. For
instance, in Figure 9.1, W separates X and Y, since any path connecting X
and Y goes through W. The subset {W, Z} also separates X and Y.
A fundamental result states that two variables are conditionally independent
given any subset of variables that separates them (Kreiner 1987; Whittaker
1990, p. 67). Thus, not only are X and Y conditionally independent given W
and Z, but also given W alone. Similarly, X and Z are conditionally
independent given W alone.
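The separation criterion is purely graph-theoretic, so it can be checked mechanically. A minimal sketch in Python (our illustration; the function name and edge-list encoding are ours), applied to the graph of Figure 9.1:

```python
from collections import deque

def separates(edges, u, v, subset):
    """True if every path from u to v intersects `subset`, i.e., removing
    the subset's vertices disconnects u from v."""
    blocked = set(subset)
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, queue = {u}, deque([u])
    while queue:                       # breadth-first search avoiding `blocked`
        node = queue.popleft()
        for nbr in adj.get(node, ()):
            if nbr in blocked or nbr in seen:
                continue
            if nbr == v:
                return False           # found a path that dodges the subset
            seen.add(nbr)
            queue.append(nbr)
    return True

# Model (WX, WY, WZ, YZ): the edges of its association graph
edges = [("W", "X"), ("W", "Y"), ("W", "Z"), ("Y", "Z")]
print(separates(edges, "X", "Y", {"W"}))        # True: W alone separates X and Y
print(separates(edges, "X", "Y", {"W", "Z"}))   # True: so does {W, Z}
print(separates(edges, "X", "Y", {"Z"}))        # False: Z alone does not
```

Removing the separating subset and testing reachability is exactly the definition: if no path remains from X to Y, every original path intersected the subset.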
9.1.2
Collapsibility in Three-Way Contingency Tables
In Section 2.3.3 we showed that conditional associations in partial tables
usually differ from marginal associations. Under certain collapsibility conditions, however, they are the same.
For three-way tables, XY marginal and conditional odds ratios are identical if
either Z and X are conditionally independent or if Z and Y are conditionally
independent.
The conditions state that the variable treated as the control (Z) is conditionally independent of X or Y, or both. These conditions occur for loglinear
models (XY, YZ) and (XY, XZ). Thus, the fitted XY odds ratio is identical
in the partial tables and the marginal table for models with association
graphs

X --- Y --- Z    and    Y --- X --- Z
or even simpler models, but not for the model with graph
X --- Z --- Y
in which an edge connects Z to both X and Y. The proof follows directly
from the formulas for models (XY, YZ) and (XY, XZ) (Problem 9.26).
We illustrate for the student survey (Table 8.3) from Section 8.2.4, with
A = alcohol use, C = cigarette use, and M = marijuana use. Model
(AM, CM) specifies AC conditional independence, given M. It has association graph

A --- M --- C
Consider the AM association. Since C is conditionally independent of A, the
AM fitted conditional odds ratios are the same as the AM fitted marginal
odds ratio collapsed over C. From Table 8.5, both equal 61.9. Similarly, the
CM association is collapsible. The AC association is not, because M is
conditionally dependent with both A and C in model (AM, CM). Thus, A
and C may be marginally dependent, even though they are conditionally
independent. In fact, from Table 8.5, the fitted AC marginal odds ratio for
this model is 2.7.
For model (AC, AM, CM), no pair is conditionally independent. No
collapsibility conditions are fulfilled. Table 8.5 showed that each pair has
quite different fitted marginal and conditional associations for this model.
When a model contains all two-factor effects, effects may change after
collapsing over any variable.
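Model (AM, CM) has direct ML estimates, μ̂_{acm} = n_{a+m} n_{+cm} / n_{++m}, so the collapsibility claim can be verified numerically. A sketch using the A-C-M counts of Table 8.3 (our encoding; index 0 = yes, 1 = no):

```python
import numpy as np

# n[a, c, m]: alcohol-cigarette-marijuana counts from Table 8.3
n = np.array([[[911, 538],
               [44, 456]],
              [[3, 43],
               [2, 279]]], dtype=float)

# Direct ML fitted values for model (AM, CM): mu = n_{a+m} n_{+cm} / n_{++m}
n_am = n.sum(axis=1)            # A-M margin
n_cm = n.sum(axis=0)            # C-M margin
n_m = n.sum(axis=(0, 1))        # M margin
mu = n_am[:, None, :] * n_cm[None, :, :] / n_m[None, None, :]

def odds_ratio(t):
    return t[0, 0] * t[1, 1] / (t[0, 1] * t[1, 0])

# The fitted AM conditional odds ratios (one per level of C) equal the
# AM marginal odds ratio, as the collapsibility condition implies.
print(odds_ratio(mu[:, 0, :]), odds_ratio(mu[:, 1, :]),
      odds_ratio(n.sum(axis=1)))   # all approximately 61.9
```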
9.1.3
Collapsibility and Logit Models
The collapsibility conditions apply also to logit models. For instance, suppose
that a clinical trial studies the association between a binary treatment
variable X (x_1 = 1, x_2 = 0) and a binary response Y, using data from K
centers (Z). The logit model

logit[P(Y = 1 | X = i, Z = k)] = α + βx_i + β_k^Z

has the same treatment effect β for each center. Since this model corresponds to loglinear model (XY, XZ, YZ), this effect may differ after collapsing the 2 × 2 × K table over centers. The estimated XY conditional odds
ratio, exp(β̂), typically differs from the sample odds ratio in the marginal
2 × 2 table.
Next, consider the simpler model that lacks center effects,

logit[P(Y = 1 | X = i, Z = k)] = α + βx_i.
For a given treatment, the success probability is identical for each center.
The model satisfies a collapsibility condition, because it states that Z is
conditionally independent of Y, given X. This logit model is equivalent to
loglinear model (XY, XZ), for which the XY association is collapsible. So,
when center effects are negligible and the simpler model fits nearly as well,
the estimated treatment effect is approximately the marginal XY odds ratio.
9.1.4
Collapsibility and Association Graphs for Multiway Tables
Bishop et al. (1975, p. 47) provided a parametric collapsibility condition with
multiway tables:
Suppose that a model for a multiway table partitions variables into three
mutually exclusive subsets, A, B, C, such that B separates A and C. After
collapsing the table over the variables in C, parameters relating variables in A and
parameters relating variables in A to variables in B are unchanged.
We illustrate using model (WX, WY, WZ, YZ) (Figure 9.1). Let A = {X},
B = {W}, and C = {Y, Z}. Since the XY and XZ terms do not appear, all
parameters linking set A with set C equal zero, and B separates A and C. If
we collapse over Y and Z, the WX association is unchanged. Next, identify
A = {Y, Z}, B = {W}, C = {X}. Then, conditional associations among W, Y,
and Z remain the same after collapsing over X.
This result also implies that when any variable is independent of all other
variables, collapsing over it does not affect any other model terms. For
instance, associations among W, X, and Y in model (WX, WY, XY, Z) are
the same as in (WX, WY, XY).
When set B contains more than one variable, although parameter values
are unchanged in collapsing over set C, the ML estimates of those parameters may differ slightly. A stronger collapsibility definition also requires that
the estimates be identical. This condition of commutativity of fitting and
collapsing holds if the model contains the highest-order term relating variables in B to each other. Asmussen and Edwards (1983) discussed this
property, which relates to decomposability of tables (Note 8.2).
9.2
MODEL SELECTION AND COMPARISON
Strategies for selecting and comparing loglinear models are similar to those
for logistic regression discussed in Section 6.1. A model should be complex
enough to fit well but also relatively simple to interpret, smoothing rather
than overfitting the data.
9.2.1
Considerations in Model Selection
The potentially useful models are usually a small subset of the possible
models. A study designed to answer certain questions through confirmatory
analyses may plan to compare models that differ only by the inclusion of
certain terms. Also, models should recognize distinctions between response
and explanatory variables. The modeling process should concentrate on
terms linking responses and terms linking explanatory variables to responses.
The model should contain the most general interaction term relating the
explanatory variables. From the likelihood equations, this has the effect of
equating the fitted totals to the sample totals at combinations of their levels.
This is natural, since one normally treats such totals as fixed. Related to this,
certain marginal totals are often fixed by the sampling design. Any potential
model should include those totals as sufficient statistics, so likelihood equations equate them to the fitted totals.
Consider Table 8.8 with I = automobile injury and S = seat-belt use as
responses and G = gender and L = location as explanatory variables. Then
we treat {n_{g+l+}} as fixed at each combination for G and L. For example,
20,629 women had accidents in urban locations, so the fitted counts should
have 20,629 women in urban locations. To ensure this, a loglinear model
should contain the GL term, which implies from its likelihood equations that
{μ̂_{g+l+} = n_{g+l+}}. Thus, the model should be at least as complex as (GL, S, I)
and focus on the effects of G and L on S and I as well as the SI association.
If S is also explanatory and only I is a response, {n_{g+ls}} should be fixed.
With a single categorical response, relevant loglinear models correspond to
logit models for that response. One should then use logit rather than
loglinear models, when the main focus is describing effects on that response.
For exploratory studies, a search among potential models may provide
clues about associations and interactions. One approach first fits the model
having single-factor terms, then the model having two-factor and single-factor
terms, then the model having three-factor and lower terms, and so on. Fitting
such models often reveals a restricted range of good-fitting models. In
Section 8.4.2 we used this strategy with the automobile injury data set.
Automatic search mechanisms among possible models, such as backward
elimination, may also be useful but should be used with care and skepticism.
Such a strategy need not yield a meaningful model.
9.2.2
Model Building for the Dayton Student Survey
In Sections 8.2.4 and 8.3.2 we analyzed the use of alcohol (A), cigarettes (C),
and marijuana (M) by a sample of high school seniors. The study also
classified students by gender (G) and race (R). Table 9.1 shows the five-dimensional contingency table. In selecting a model, we treat A, C, and M as
responses and G and R as explanatory. Thus, a model should contain the
GR term, which forces the GR fitted marginal totals to equal the sample
marginal totals.
Table 9.2 displays goodness-of-fit tests for several models. Because many
cell counts are small, the chi-squared approximation for G² may be poor, but
this index is useful for comparing models. The first model listed contains only
the GR association and assumes conditional independence for the other nine
pairs of associations. It fits horribly, which is no surprise. Model 2, with all
two-factor terms, on the other hand, seems to fit well. Model 3, containing all
TABLE 9.1  Alcohol, Cigarette, and Marijuana Use for High School Seniors

                                    Marijuana Use
                        Race = White                Race = Other
Alcohol  Cigarette    Female        Male          Female       Male
Use      Use         Yes   No    Yes   No       Yes   No    Yes   No
Yes      Yes         405   268   453   228       23   23     30   19
         No           13   218    28   201        2   19      1   18
No       Yes           1    17     1    17        0    1      1    8
         No            1   117     1   133        0   12      0   17

Source: Harry Khamis, Wright State University.
TABLE 9.2  Goodness-of-Fit Tests for Loglinear Models for Table 9.1 a

Model                                          G²      df
1.  Mutual independence + GR                 1325.1    25
2.  Homogeneous association                    15.3    16
3.  All three-factor terms                      5.3     6
4a. (2) − AC                                  201.2    17
4b. (2) − AM                                  107.0    17
4c. (2) − CM                                  513.5    17
4d. (2) − AG                                   18.7    17
4e. (2) − AR                                   20.3    17
4f. (2) − CG                                   16.3    17
4g. (2) − CR                                   15.8    17
4h. (2) − GM                                   25.2    17
4i. (2) − MR                                   18.9    17
5.  (AC, AM, CM, AG, AR, GM, GR, MR)           16.7    18
6.  (AC, AM, CM, AG, AR, GM, GR)               19.9    19
7.  (AC, AM, CM, AG, AR, GR)                   28.8    20

a G, gender; R, race; A, alcohol use; C, cigarette use; M, marijuana use.
the three-factor interaction terms, also fits well, but the improvement in fit is
not great (difference in G² of 15.3 − 5.3 = 10.0, based on df = 16 − 6 = 10).
Thus, we consider models without three-factor terms. Beginning with model
2, we eliminate two-factor terms. We use backward elimination, sequentially
taking out terms for which the resulting increase in G² is smallest when
refitting the model.
Table 9.2 shows the start of this process. Nine pairwise associations are
candidates for removal from model 2 (all except GR), shown in models 4a
through 4i. The smallest increase in G², compared to model 2, occurs in
removing the CR term (i.e., model 4g). The increase is 15.8 − 15.3 = 0.5,
with df = 17 − 16 = 1, so this elimination seems sensible. After removing it,
the smallest additional increase results from removing the CG term (model
5), resulting in G² = 16.7 with df = 18, and a change in G² of 0.9 based on
df = 1. Removing next the MR term (model 6) yields G² = 19.9 with
df = 19, a change in G² of 3.2 based on df = 1.
Further removals have a more severe effect. For instance, removing the
AG term increases G² by 5.3, with df = 1, for a P-value of 0.02. One cannot
take such P-values literally, since the data suggested these tests, but it seems
safest not to drop additional terms. [See Westfall and Wolfinger (1997) and
Westfall and Young (1993) for methods of adjusting P-values to account for
multiple tests.] Model 6, denoted by (AC, AM, CM, AG, AR, GM, GR), has
association graph
[Association graph: vertices A, C, G, M, R, with edges AC, AM, CM, AG, GM, AR, GR]
Every path between C and {G, R} involves a variable in {A, M}. Given the
outcome on alcohol use and marijuana use, the model states that cigarette
use is independent of both gender and race. Collapsing over the explanatory
variables race and gender, the conditional associations between C and A and
between C and M are the same as with the model (AC, AM, CM) fitted in
Section 8.2.4.
Removing the GM term from this model yields model 7 in Table 9.2. Its
association graph reveals that A separates {G, R} from {C, M}. Thus, all
pairwise conditional associations among A, C, and M in model 7 are
identical to those in model (AC, AM, CM), collapsing over G and R. In fact,
model 7 does not fit poorly (G² = 28.8 with df = 20) considering the large
sample size. (Its sample dissimilarity index is Δ̂ = 0.036.) Hence, one might
collapse over gender and race in studying associations among the primary
variables. An advantage of the full five-variable model is that it estimates
effects of gender and race on these responses, in particular the effects of race
and gender on alcohol use and the effect of gender on marijuana use.
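ML fits such as those summarized in Table 9.2 can be computed for any hierarchical loglinear model by iterative proportional fitting. A minimal sketch (our illustration, not the book's software), fitting model (AC, AM, CM) to the A-C-M table obtained by collapsing Table 9.1 over G and R:

```python
import numpy as np

def ipf_fit(n, margins, n_iter=200):
    """Fit a hierarchical loglinear model by iterative proportional
    fitting: cycle through the model's sufficient margins, rescaling the
    fitted table to match each observed margin in turn."""
    mu = np.full(n.shape, n.sum() / n.size)
    for _ in range(n_iter):
        for axes in margins:
            other = tuple(i for i in range(n.ndim) if i not in axes)
            obs = n.sum(axis=other, keepdims=True)
            fit = mu.sum(axis=other, keepdims=True)
            mu = mu * obs / fit
    return mu

# A-C-M counts (Table 9.1 collapsed over gender and race); index 0 = yes
n = np.array([[[911, 538], [44, 456]],
              [[3, 43], [2, 279]]], dtype=float)

# Model (AC, AM, CM): sufficient statistics are the three two-way margins
mu = ipf_fit(n, margins=[(0, 1), (0, 2), (1, 2)])
G2 = 2 * np.sum(n * np.log(n / mu))
print(round(G2, 2))   # small G² on df = 1: the model fits well
```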
9.2.3
Loglinear Model Comparison Statistics
Consider two loglinear models, M_1 and M_0, with M_0 a special case of M_1. By
Sections 4.5.4 and 5.4.3, the likelihood-ratio statistic for testing M_0 against
M_1 is G²(M_0 | M_1) = G²(M_0) − G²(M_1). We used this statistic above in
comparing pairs of models.
Let n denote a column vector of the observed cell counts {n_i}. Let μ̂_0 and
μ̂_1 denote vectors of the fitted values {μ̂_{0i}} and {μ̂_{1i}} for M_0 and M_1. The
deviance G²(M_0) for the simpler model partitions into

G²(M_0) = G²(M_1) + G²(M_0 | M_1).    (9.1)
Just as G²(M) measures the distance of fitted values for M from n,
G²(M_0 | M_1) measures the distance of fit μ̂_0 from fit μ̂_1. In this sense,
decomposition (9.1) expresses a certain orthogonality: The distance of n from
μ̂_0 equals the distance of n from μ̂_1 plus the distance of μ̂_1 from μ̂_0.
The model comparison statistic equals

G²(M_0 | M_1) = 2 Σ_i n_i log(n_i/μ̂_{0i}) − 2 Σ_i n_i log(n_i/μ̂_{1i})
             = 2 Σ_i n_i log(μ̂_{1i}/μ̂_{0i}).    (9.2)
The two loglinear models have the matrix form (8.17), or

log μ_0 = X_0 β_0    and    log μ_1 = X_1 β_1.

Since M_0 is simpler than M_1, one can express log μ_0 = X_0 β_0 = X_1 β_1*, where
β_1* equals β_0 with 0 elements appended corresponding to the extra parameters in β_1 but not in β_0. Then, from (9.2),

G²(M_0 | M_1) = 2n′(log μ̂_1 − log μ̂_0) = 2n′(X_1 β̂_1 − X_1 β̂_1*)
             = 2μ̂_1′(X_1 β̂_1 − X_1 β̂_1*) = 2μ̂_1′(log μ̂_1 − log μ̂_0)
             = 2 Σ_i μ̂_{1i} log(μ̂_{1i}/μ̂_{0i}),    (9.3)

where the replacement of n by μ̂_1 follows from the likelihood equations
n′X_1 = μ̂_1′X_1 for M_1 [recall (8.22)]. Statistic (9.3) has the same form as
G²(M_0), but with {μ̂_{1i}} playing the role of the observed data. Note that
G²(M_0) is the special case of G²(M_0 | M_1) with M_1 saturated.
The Pearson difference X²(M_0) − X²(M_1) does not have Pearson form.
It is not even necessarily nonnegative. A more appropriate Pearson statistic
for comparing models is

X²(M_0 | M_1) = Σ_i (μ̂_{1i} − μ̂_{0i})²/μ̂_{0i}.    (9.4)

This has the usual form with {μ̂_{1i}} in place of {n_i}. Statistics (9.3) and (9.4)
depend on the data only through the fitted values and thus only through
sufficient statistics for M_1.
When M_0 holds, G²(M_0) and G²(M_1) have asymptotic chi-squared distributions, and G²(M_0 | M_1) is asymptotically chi-squared with df equal to the
difference between df for M_0 and M_1. Haberman (1977a) showed that
G²(M_0 | M_1) and X²(M_0 | M_1) have the same null large-sample behavior, even
for fairly sparse tables. (Under certain conditions, their difference converges
in probability to 0 as n increases.) When M_1 holds but M_0 does not, G²(M_1)
still has its asymptotic chi-squared distribution, but the other two statistics
tend to grow unboundedly as n increases.
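Given the two sets of fitted values, statistics (9.2)-(9.4) are one-liners to compute. A sketch comparing M_0 = (X, Y, Z) with M_1 = (XY, Z), both of which have direct estimates (the 2 × 2 × 2 counts are invented):

```python
import numpy as np

n = np.array([[[25., 15.], [10., 20.]],
              [[30., 10.], [5., 35.]]])
N = n.sum()

# Direct fitted values: M0 = (X, Y, Z) and M1 = (XY, Z)
nx, ny, nz = n.sum(axis=(1, 2)), n.sum(axis=(0, 2)), n.sum(axis=(0, 1))
nxy = n.sum(axis=2)
mu0 = nx[:, None, None] * ny[None, :, None] * nz[None, None, :] / N**2
mu1 = nxy[:, :, None] * nz[None, None, :] / N

G2 = lambda obs, fit: 2 * np.sum(obs * np.log(obs / fit))

# (9.1): G²(M0) = G²(M1) + G²(M0 | M1), where the comparison term is
# computable from the fitted values alone, as in (9.3)
g_cmp = G2(mu1, mu0)
print(G2(n, mu0), G2(n, mu1) + g_cmp)   # equal

# Pearson comparison statistic (9.4)
X2_cmp = np.sum((mu1 - mu0) ** 2 / mu0)
```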
9.2.4
Partitioning Chi-Squared with Model Comparisons
Equation (9.1) utilizes the property by which a chi-squared statistic with
df > 1 partitions into components. We used such partitionings in tests for
trend with ordinal predictors in linear logit or linear probability models
(Section 5.3.5) and with ordinal responses in cumulative logit models (Section
7.2). More generally, this property applies with a set of nested models to test
a sequence of hypotheses. The separate tests for comparing pairs of models
are asymptotically independent.
For example, a chi-squared decomposition with J − 1 models justifies the
partitioning of G² stated in Section 3.3.3 for 2 × J tables. For j = 2, ..., J,
let M_j denote the model that satisfies

θ_i = (π_{1i} π_{2,i+1})/(π_{1,i+1} π_{2i}) = 1,    i = 1, ..., j − 1.

For M_j, the 2 × j table consisting of columns 1 through j satisfies independence. Model M_J is independence in the complete 2 × J table. Model M_h is
a special case of M_j whenever h > j. By (9.2),

G²(M_J) = G²(M_J | M_{J−1}) + G²(M_{J−1})
        = G²(M_J | M_{J−1}) + G²(M_{J−1} | M_{J−2}) + G²(M_{J−2})
        = ··· = G²(M_J | M_{J−1}) + ··· + G²(M_3 | M_2) + G²(M_2).

From (9.3), G²(M_j | M_{j−1}) has the G² form with the fitted values for model
M_{j−1} playing the role of the observed data. Substitution of fitted values for
the two models into (9.3) shows that G²(M_j | M_{j−1}) is identical to G² for
testing independence in a 2 × 2 table; the first column combines columns 1
through j − 1 of the original table, and the second column is column j of the
original table.
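The exactness of this partitioning is easy to confirm numerically. A sketch for a 2 × 4 table (counts invented):

```python
import numpy as np

def G2_indep(t):
    """Likelihood-ratio statistic for independence in a two-way table."""
    fit = np.outer(t.sum(axis=1), t.sum(axis=0)) / t.sum()
    return 2 * np.sum(t * np.log(t / fit))

n = np.array([[20., 15., 30., 10.],
              [10., 25., 20., 40.]])
J = n.shape[1]

# Component j compares columns 1..j-1 (combined) against column j
components = []
for j in range(1, J):
    sub = np.column_stack([n[:, :j].sum(axis=1), n[:, j]])
    components.append(G2_indep(sub))

# The J-1 components sum exactly to G² for the full 2 x J table
print(sum(components), G2_indep(n))
```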
With several preplanned comparisons, simultaneous test procedures lessen
the probability of attributing importance to sample effects that simply reflect
chance variation. These procedures use adjusted significance levels. For a set
of s tests for nested models, when each test has level 1 − (1 − α)^{1/s}, the
overall asymptotic P(type I error) ≤ α (Goodman 1969a). For instance,
suppose that we test the fit of (WXZ, WY, XY, ZY), compare that model to
(WX, WZ, XZ, WY, XY, ZY), and compare that model to (WX, WZ, XZ,
WY, ZY). To ensure overall α = 0.05 for the s = 3 tests, use level 1 −
(0.95)^{1/3} = 0.017 for each.
9.2.5
Identical Marginal and Conditional Tests of Independence
A test using G²(M_0 | M_1) simplifies dramatically when both models have
direct estimates. In that case, the models have independence linkages necessary to ensure collapsibility. A test of conditional independence has the same
result as the test of independence applied to the marginal table. Sundberg
(1975) proved the following: When two direct models M_0 and M_1 are
identical except for a pairwise association term, G²(M_0 | M_1) is identical to
G² for testing independence in the marginal table for that pair of variables.
Bishop (1971) and Goodman (1970, 1971b) have related discussion.
For instance, G²[(X, Y, Z) | (XY, Z)] tests λ^{XY} = 0 in model (XY, Z).
Thus, it tests XY conditional independence under the assumption that X
and Y are jointly independent of Z. Using the two sets of fitted values, from
(9.3), it equals

2 Σ_i Σ_j Σ_k (n_{ij+} n_{++k}/n) log[(n_{ij+} n_{++k}/n)/(n_{i++} n_{+j+} n_{++k}/n²)]
    = 2 Σ_i Σ_j n_{ij+} log[n_{ij+}/(n_{i++} n_{+j+}/n)],

which equals G²[(X, Y)] for testing independence in the marginal XY table.
This is not surprising. The collapsibility conditions imply that for model
(XY, Z), the marginal XY association is the same as the conditional XY
association.
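Sundberg's result can also be checked directly for M_0 = (X, Y, Z) and M_1 = (XY, Z): the comparison statistic (9.3) matches the independence G² in the marginal XY table. A sketch (counts invented):

```python
import numpy as np

n = np.array([[[12., 18.], [25., 5.]],
              [[20., 10.], [8., 22.]]])
N = n.sum()

# Fitted values for M0 = (X, Y, Z) and M1 = (XY, Z)
nx, ny, nz = n.sum(axis=(1, 2)), n.sum(axis=(0, 2)), n.sum(axis=(0, 1))
nxy = n.sum(axis=2)
mu0 = nx[:, None, None] * ny[None, :, None] * nz[None, None, :] / N**2
mu1 = nxy[:, :, None] * nz[None, None, :] / N

# Comparison statistic (9.3) for testing lambda^XY = 0 in (XY, Z)
g_conditional = 2 * np.sum(mu1 * np.log(mu1 / mu0))

# Independence G² in the marginal X-Y table
fit_xy = np.outer(nx, ny) / N
g_marginal = 2 * np.sum(nxy * np.log(nxy / fit_xy))

print(g_conditional, g_marginal)   # identical
```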
9.3 DIAGNOSTICS FOR CHECKING MODELS

The model comparison test using G²(M0 | M1) is useful for detecting whether an extra term improves a model fit. Cell residuals provide a cell-specific indication of model lack of fit.
9.3.1 Residuals for Loglinear Models

In Section 4.5.5 we noted that residuals for the independence model (Section 3.3.1) extend to any Poisson GLM. For cell i in a contingency table with observed count n_i and fitted value μ̂_i, the Pearson residual is

    e_i = (n_i − μ̂_i) / √μ̂_i .        (9.5)

These relate to the Pearson statistic by Σ e_i² = X². Like the Pearson residual (6.1) for binomial models, the asymptotic variances of {e_i} are less than 1.0. They average (residual df)/(number of cells). Haberman (1973a) defined the standardized Pearson residual,

    r_i = e_i / √(1 − ĥ_i) ,

where the leverage ĥ_i is a diagonal element of the estimated hat matrix (Section 4.5.5). This has an asymptotic standard normal distribution and is preferable to the Pearson residual. A closed-form expression applies for loglinear models having direct estimates (Haberman 1978, p. 275). Alternative residuals use components of the deviance (Section 4.5.5).
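Both residuals are simple to compute directly. The following sketch (pure NumPy, with an arbitrary illustrative 2 × 3 table) obtains the Pearson residuals (9.5) for the independence model and then standardizes them with leverages from the estimated hat matrix H = W^(1/2) X (X'WX)⁻¹ X' W^(1/2), where W = diag(μ̂):

```python
import numpy as np

n = np.array([[25., 10., 15.],
              [10., 20., 30.]])          # illustrative 2x3 table
I, J = n.shape
N = n.sum()

mu = np.outer(n.sum(1), n.sum(0)) / N    # independence fitted values
y, m = n.ravel(), mu.ravel()

e = (y - m) / np.sqrt(m)                 # Pearson residuals (9.5)

# Design matrix for the independence loglinear model:
# intercept + (I-1) row dummies + (J-1) column dummies
rows, cols = np.divmod(np.arange(I * J), J)
X = np.column_stack([np.ones(I * J)]
                    + [(rows == i).astype(float) for i in range(I - 1)]
                    + [(cols == j).astype(float) for j in range(J - 1)])

W = np.diag(m)
H = np.sqrt(W) @ X @ np.linalg.inv(X.T @ W @ X) @ X.T @ np.sqrt(W)
h = np.diag(H)                           # leverages h_i

r = e / np.sqrt(1 - h)                   # standardized Pearson residuals
```

Since 0 < 1 − ĥ_i ≤ 1, each |r_i| is at least as large as |e_i|, consistent with the Pearson residual understating the departure from the model.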
9.3.2 Student Survey Example Revisited

For Table 9.1 cross-classifying alcohol, cigarette, and marijuana use by gender and race, we suggested in Section 9.2.2 that the model with all two-factor associations is plausible. For it, the only large standardized Pearson residual equals 3.2, resulting from a fitted value of 3.1 in the cell having a count of 8. Further comparisons suggested that the simpler model (AC, AM, CM, AG, AR, GM, GR) is adequate. Its only large standardized residual equals 3.3, referring to a fitted value of 2.9 in that cell. The number of nonwhite males who did not use alcohol or marijuana but who smoked cigarettes is somewhat greater than either model predicts. The standardized Pearson residuals do not suggest problems with either model, considering the large sample size and many cells studied.
9.3.3 Correspondence between Loglinear and Logit Residuals

In Section 8.5 we showed that logit models in contingency tables are equivalent to certain loglinear models. However, a Pearson residual for a logit model differs from a Pearson residual for a loglinear model. The numerators comparing the ith observed and fitted binomial or Poisson count are the same, since the model fitted values are the same. However, the logit model uses a fitted binomial standard deviation in the denominator [see (6.1)], whereas the loglinear model uses a fitted Poisson standard deviation [see (9.5)]. Thus, the logit Pearson residual exceeds the loglinear Pearson residual (9.5).

Once standardized by dividing by estimated standard errors, the standardized Pearson residuals are identical for the two models. This is another reason for preferring standardized residuals over ordinary Pearson residuals.
9.4 MODELING ORDINAL ASSOCIATIONS

The loglinear models presented so far have a serious limitation: they treat all classifications as nominal. If the order of a variable's categories changes in
TABLE 9.3 Opinions about Premarital Sex and Availability of Teenage Birth Control

                                        Teenage Birth Control^a
Premarital Sex          Strongly Disagree   Disagree     Agree        Strongly Agree
Always wrong            81                  68           60           38
                        (42.4)^1            (51.2)       (86.4)       (67.0)
                        7.6^2               3.1          -4.1         -4.8
                        (80.9)^3            (67.6)       (69.4)       (29.1)
Almost always wrong     24                  26           29           14
                        (16.0)              (19.3)       (32.5)       (25.2)
                        2.3                 1.8          -0.8         -2.8
                        (20.8)              (23.1)       (31.5)       (17.6)
Wrong only sometimes    18                  41           74           42
                        (30.1)              (36.3)       (61.2)       (47.4)
                        -2.7                1.0          2.2          -1.0
                        (24.4)              (36.1)       (65.7)       (48.8)
Not wrong at all        36                  57           161          157
                        (70.6)              (85.2)       (143.8)      (111.4)
                        -6.1                -4.6         2.4          6.8
                        (33.0)              (65.1)       (157.4)      (155.5)

^a 1, independence model fit; 2, standardized Pearson residual for the independence model fit; 3, linear-by-linear association model fit.
Source: 1991 General Social Survey, National Opinion Research Center.
any way, the fit is the same. For ordinal classifications, these models ignore important information.

Refer to Table 9.3. Subjects were asked their opinion about a man and woman having sexual relations before marriage (always wrong, almost always wrong, wrong only sometimes, not wrong at all). They were also asked whether methods of birth control should be available to teenagers between the ages of 14 and 16 (strongly disagree, disagree, agree, strongly agree). For the loglinear model of independence, denoted by I, G²(I) = 127.6 with df = 9. The model fits poorly. Yet, adding the ordinary association term λ_ij^XY makes it saturated and unhelpful.

Table 9.3 also contains fitted values and standardized residuals for independence. The residuals in the corners stand out. Sample counts are much larger than independence predicts where both responses are the most negative possible or the most positive possible. By contrast, the counts are much smaller than fitted values where one response is the most positive and the other is the most negative. Cross-classifications of ordinal variables often exhibit their greatest deviations from independence in the corner cells. This pattern for Table 9.3 indicates lack of fit in the form of a positive trend.
Subjects who are more willing to make birth control available to teenagers
also tend to feel more tolerant about premarital sex.
Models for ordinal variables use association terms that permit trends. The
models are more complex than the independence model, yet unsaturated.
Models with association and interaction terms exist in situations in which
nominal models are saturated. Tests with ordinal models have improved
power for detecting trends.
9.4.1 Linear-by-Linear Association in Two-Way Tables

For two-way tables, a simple model for two ordinal variables assigns ordered row scores u_1 ≤ u_2 ≤ ⋯ ≤ u_I and column scores v_1 ≤ v_2 ≤ ⋯ ≤ v_J. The model is

    log μ_ij = λ + λ_i^X + λ_j^Y + β u_i v_j ,        (9.6)

with constraints such as λ_I^X = λ_J^Y = 0. This is the special case of the saturated model (8.2) in which λ_ij^XY = β u_i v_j. It requires only one parameter to describe association, whereas the saturated model requires (I − 1)(J − 1).

Independence occurs when β = 0. The term β u_i v_j represents the deviation of log μ_ij from independence. The deviation is linear in the Y scores at a fixed level of X and linear in the X scores at a fixed level of Y. In column j, for instance, the deviation is a linear function of X, having form (slope) × (score for X), with slope β v_j. Because of this property, (9.6) is called the linear-by-linear association model (abbreviated L × L). The model has its greatest departures from independence in the corners of the table. Birch (1965), Goodman (1979a), and Haberman (1974b) introduced special cases.
The direction and strength of the association depend on β. When β > 0, Y tends to increase as X increases. Expected frequencies are larger than expected (under independence) in cells where X and Y are both high or both low. When β < 0, Y tends to decrease as X increases. When the data display a positive or negative trend, the L × L model usually fits much better than the independence model.

For the 2 × 2 table using the cells intersecting rows a and c with columns b and d, direct substitution shows that the model has

    log[ (μ_ab μ_cd) / (μ_ad μ_cb) ] = β (u_c − u_a)(v_d − v_b).        (9.7)

This log odds ratio is stronger as |β| increases and for pairs of categories that are farther apart. Simple interpretations result when u_2 − u_1 = ⋯ = u_I − u_{I−1} and v_2 − v_1 = ⋯ = v_J − v_{J−1}. When {u_i = i} and {v_j = j}, for instance, the local odds ratios (2.10) for adjacent rows and adjacent columns have common value e^β. Goodman (1979a) called this case uniform association. Figure 9.2 portrays local odds ratios having uniform value.
FIGURE 9.2 Constant odds ratio implied by uniform association model. (Note: β = the constant log odds ratio for adjacent rows and adjacent columns.)
The choice of scores affects the interpretation of β. Often, the response scale discretizes an inherently continuous scale. It is sensible to choose scores that approximate distances between midpoints of categories for the underlying scale, such as we did in measuring alcohol consumption for a linear logit model in Section 3.4.5. It is sometimes useful to standardize the scores, subtracting the mean and dividing by the standard deviation, so that

    Σ_i u_i π_i+ = Σ_j v_j π_+j = 0,    Σ_i u_i² π_i+ = Σ_j v_j² π_+j = 1.

Then β represents the log odds ratio for standard deviation distances in the X and Y directions. The L × L model tends to fit well when an underlying continuous distribution is approximately bivariate normal. For standardized scores, β is then comparable to ρ/(1 − ρ²), where ρ is the underlying correlation. For weak associations, β ≈ ρ (see Becker 1989b; Goodman 1981a, b, 1985).
9.4.2 Corresponding Logit Model for Adjacent Responses

A logit formulation of the L × L model treats Y as a response and X as explanatory. Let π_{j|i} = P(Y = j | X = i). Using logits for adjacent response categories (Section 7.4.1),

    log( π_{j+1|i} / π_{j|i} ) = log( μ_{i,j+1} / μ_ij ) = (λ_{j+1}^Y − λ_j^Y) + β (v_{j+1} − v_j) u_i .

For unit-spaced {v_j}, this simplifies to

    log( π_{j+1|i} / π_{j|i} ) = α_j + β u_i ,

where α_j = λ_{j+1}^Y − λ_j^Y. The same linear logit effect β applies simultaneously for all (J − 1) pairs of adjacent response categories: The odds that Y = j + 1 instead of Y = j multiply by e^β for each unit change in X. In using equal-interval response scores, we implicitly assume that the effect of X is the same on each of the J − 1 adjacent-categories logits for Y.
9.4.3 Likelihood Equations and Model Fitting

The Poisson log likelihood L(μ) = Σ_i Σ_j n_ij log μ_ij − Σ_i Σ_j μ_ij simplifies for the L × L model (9.6) to

    L(μ) = nλ + Σ_i n_i+ λ_i^X + Σ_j n_+j λ_j^Y + β Σ_i Σ_j u_i v_j n_ij
               − Σ_i Σ_j exp(λ + λ_i^X + λ_j^Y + β u_i v_j).

Differentiating L(μ) with respect to (λ_i^X, λ_j^Y, β) and setting the partial derivatives equal to zero yields likelihood equations

    μ̂_i+ = n_i+ ,  i = 1, . . . , I,
    μ̂_+j = n_+j ,  j = 1, . . . , J,
    Σ_i Σ_j u_i v_j μ̂_ij = Σ_i Σ_j u_i v_j n_ij .

Iterative methods such as Newton–Raphson yield the ML fit.

Let p_ij = n_ij/n and π̂_ij = μ̂_ij/n. The third likelihood equation implies that

    Σ_i Σ_j u_i v_j π̂_ij = Σ_i Σ_j u_i v_j p_ij .

Since marginal distributions, and hence marginal means and variances, are identical for the fitted and observed distributions, the third equation implies that the correlation between the scores for X and Y is the same for both distributions. The fitted counts display the same positive or negative trend as the data.

Since {u_i} and {v_j} are fixed, the L × L model (9.6) has only one more parameter (β) than the independence model. Its residual

    df = IJ − [1 + (I − 1) + (J − 1) + 1] = IJ − I − J,

unsaturated for all but 2 × 2 tables.
9.4.4 Sex Opinions Example

Table 9.3 also reports fitted values for the linear-by-linear association model, using scores {1, 2, 3, 4} for rows and columns. Table 9.4
TABLE 9.4 Output for Fitting Linear-by-Linear Association Model to Table 9.3

 Criteria For Assessing Goodness Of Fit
 Criterion              DF      Value
 Deviance                8    11.5337
 Pearson Chi-Square      8    11.5085

                            Standard    Wald 95% Conf.
 Parameter      Estimate    Error       Limits               Chi-Square   Pr > ChiSq
 Intercept       0.4735     0.4339     -0.3769   1.3239         1.19       0.2751
 premar 1        1.7537     0.2343      1.2944   2.2129        56.01       <.0001
 premar 2        0.1077     0.1988     -0.2820   0.4974         0.29       0.5880
 premar 3       -0.0163     0.1264     -0.2641   0.2314         0.02       0.8972
 premar 4        0.0000     0.0000      0.0000   0.0000          .           .
 birth 1         1.8797     0.2491      1.3914   2.3679        56.94       <.0001
 birth 2         1.4156     0.1996      1.0243   1.8068        50.29       <.0001
 birth 3         1.1551     0.1291      0.9021   1.4082        80.07       <.0001
 birth 4         0.0000     0.0000      0.0000   0.0000          .           .
 linlin          0.2858     0.0282      0.2305   0.3412       102.46       <.0001

 LR Statistics
 Source     DF    Chi-Square    Pr > ChiSq
 linlin      1      116.12        <.0001
shows software output. To get this, we added a variable (denoted ''linlin'') to the independence model, having values equal to the product of the row and column numbers. Compared to the independence model, for which G²(I) = 127.6 with df = 9, the L × L model fits dramatically better [G²(L × L) = 11.5, df = 8]. This is especially noticeable in the corners, where it predicts the greatest departures from independence.
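The fit is straightforward to reproduce with any Poisson loglinear routine. The minimal NumPy sketch below fits model (9.6) to the Table 9.3 counts by iteratively reweighted least squares, with the same corner-constrained dummy coding as the output above; the assertions check against the reported deviance and β̂:

```python
import numpy as np

# Table 9.3 counts (rows: premarital sex; columns: teenage birth control)
n = np.array([[81., 68., 60., 38.],
              [24., 26., 29., 14.],
              [18., 41., 74., 42.],
              [36., 57., 161., 157.]])
y = n.ravel()
rows, cols = np.divmod(np.arange(16), 4)
u, v = rows + 1.0, cols + 1.0            # scores {1, 2, 3, 4}

# Design: intercept, row dummies 1-3, column dummies 1-3, linlin = u*v
X = np.column_stack([np.ones(16)]
                    + [(rows == i).astype(float) for i in range(3)]
                    + [(cols == j).astype(float) for j in range(3)]
                    + [u * v])

# IRLS for the Poisson loglinear model
mu = y.copy()
eta = np.log(mu)
for _ in range(25):
    z = eta + (y - mu) / mu              # working response
    WX = X * mu[:, None]
    beta = np.linalg.solve(X.T @ WX, X.T @ (mu * z))
    eta = X @ beta
    mu = np.exp(eta)

deviance = 2 * (y * np.log(y / mu) - (y - mu)).sum()
print(beta[-1], deviance)                # beta-hat ~ 0.286, G^2 ~ 11.53
```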
The ML estimate β̂ = 0.286 (SE = 0.028) indicates that subjects having more favorable attitudes about teen birth control also tend to have more tolerant attitudes about premarital sex. The estimated local odds ratio is exp(β̂) = exp(0.286) = 1.33. A 95% Wald confidence interval is exp(0.286 ± 1.96 × 0.028), or (1.26, 1.41). The strength of association seems weak. From (9.7), however, nonlocal odds ratios are stronger. The estimated odds ratio for the four corner cells equals

    exp[ β̂ (u_4 − u_1)(v_4 − v_1) ] = exp[ 0.286 (4 − 1)(4 − 1) ] = 13.1.

This also results from the corner fitted values, (80.9 × 155.5)/(29.1 × 33.0) = 13.1.
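A quick numeric check of (9.7), using the estimate and the corner fitted values shown in Table 9.3:

```python
import math

beta_hat = 0.286
model_or = math.exp(beta_hat * (4 - 1) * (4 - 1))   # nonlocal odds ratio via (9.7)
fitted_or = (80.9 * 155.5) / (29.1 * 33.0)          # from the corner fitted values
print(round(model_or, 1), round(fitted_or, 1))      # both 13.1
```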
Two sets of scores having the same spacings yield the same β̂ and the same fit. Any other sets of equally spaced scores yield the same fit but an appropriately rescaled β̂. For instance, using row scores {2, 4, 6, 8} with {v_j = j} also yields G² = 11.5, but β̂ = 0.143 with SE = 0.014 (both half as large). For Table 9.3, one might regard categories 2 and 3 as farther apart than categories 1 and 2, or categories 3 and 4. Scores such as {1, 2, 4, 5} for rows and columns recognize this. The L × L model then has G² = 8.8 (df = 8) and β̂ = 0.146 (SE = 0.014).
One need not regard the scores as approximations for distances between categories or as reasonable scalings of ordinal variables in order for the models to be valid. They simply imply a certain pattern for the odds ratios. If the L × L model fits well with equally spaced row and column scores, the uniform local odds ratio describes the association regardless of whether the scores are sensible indexes of true distances between categories.

For scores {u_i = i} with Table 9.3, the marginal mean and standard deviation for premarital sex are 2.81 and 1.26. The standardized scores are {(i − 2.81)/1.26}, or (−1.44, −0.65, 0.15, 0.95). The standardized equal-interval scores for birth control are (−1.65, −0.69, 0.27, 1.23). For these scores, β̂ = 0.374. By solving β̂ = ρ̂/(1 − ρ̂²) for ρ̂, ρ̂ = 0.333. If there is an underlying bivariate normal distribution, we estimate the correlation to be 0.333.
9.4.5 Directed Ordinal Test of Independence

For the linear-by-linear association model, H0: independence is H0: β = 0. The likelihood-ratio test statistic equals

    G²(I | L × L) = G²(I) − G²(L × L).

Designed to detect positive or negative trends, it has df = 1. For Table 9.3, G²(I | L × L) = 127.6 − 11.5 = 116.1. This has P < 0.0001, extremely strong evidence of an association. The Wald statistic z² = (β̂/SE)² = (0.286/0.0282)² = 102.5 (df = 1) also shows strong evidence. The correlation statistic (3.15) presented in Section 3.4.1 for testing independence is the score statistic for H0: β = 0 in this model. It equals 112.6 (df = 1).
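The likelihood-ratio and score statistics can be reproduced directly from the table. The sketch below computes G²(I) and the correlation statistic (3.15), M² = (n − 1)r², with equally spaced scores (the assertions check against the values quoted in the text):

```python
import numpy as np

# Table 9.3 counts
n = np.array([[81., 68., 60., 38.],
              [24., 26., 29., 14.],
              [18., 41., 74., 42.],
              [36., 57., 161., 157.]])
N = n.sum()
u = np.array([1., 2., 3., 4.])            # row scores
v = np.array([1., 2., 3., 4.])            # column scores

# G^2 for the independence model
fit = np.outer(n.sum(1), n.sum(0)) / N
G2_I = 2 * (n * np.log(n / fit)).sum()

# Sample correlation r between the scores, then M^2 = (N - 1) r^2
p = n / N
ubar, vbar = u @ p.sum(1), v @ p.sum(0)
cov = (np.outer(u - ubar, v - vbar) * p).sum()
r = cov / np.sqrt(((u - ubar) ** 2 @ p.sum(1)) * ((v - vbar) ** 2 @ p.sum(0)))
M2 = (N - 1) * r ** 2
print(round(G2_I, 1), round(M2, 1))       # approximately 127.6 and 112.6
```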
When the L × L model holds, the ordinal test using G²(I | L × L) is asymptotically more powerful than the test using G²(I). This is true for the same reason given in Section 6.4.2 for the linear logit model. The power of a chi-squared test increases when df decrease, for fixed noncentrality. When the L × L model holds, the noncentrality is the same for G²(I | L × L) and G²(I); thus G²(I | L × L) is more powerful, since its df = 1 compared to (I − 1)(J − 1) for G²(I). The power advantage increases as I and J increase, since the noncentrality remains focused on df = 1 for G²(I | L × L) but df also increases for G²(I).
9.5 ASSOCIATION MODELS*

Generalizations of the linear-by-linear association model apply to multiway tables or treat scores as parameters rather than as fixed. The models are called association models, because they focus on the association structure.
9.5.1 Row and Column Effects Models

We first present a model that treats X as nominal and Y as ordinal. It is appropriate for two-way tables with ordered columns, using scores v_1 ≤ v_2 ≤ ⋯ ≤ v_J. Since the rows are unordered, they do not have scores. Replacing the ordered values {u_i} in the linear-by-linear term β u_i v_j in model (9.6) by unordered parameters {τ_i} gives

    log μ_ij = λ + λ_i^X + λ_j^Y + τ_i v_j .        (9.8)

Constraints are needed, such as λ_I^X = λ_J^Y = τ_I = 0. The {τ_i} are called row effects, and the model is called the row effects model.

Model (9.8) has I − 1 more parameters (the {τ_i}) than the independence model. Independence is the special case τ_1 = ⋯ = τ_I. A corresponding column effects model has association term u_i ρ_j. It treats X as ordinal with scores {u_i} and Y as nominal with parameters {ρ_j}. The row effects and column effects models were developed by Goodman (1979a), Haberman (1974b), and Simon (1974).
9.5.2 Logit Model for Adjacent Responses

With {v_{j+1} − v_j = 1}, the row effects model has adjacent-categories logit form

    log[ P(Y = j + 1 | X = i) / P(Y = j | X = i) ] = α_j + τ_i .        (9.9)

The effect in row i is identical for each pair of adjacent responses. Plots of these logits against i (i = 1, . . . , I) for different j are parallel. Goodman (1983) referred to model (9.9) as the parallel odds model.

Differences among {τ_i} compare rows with respect to their conditional distributions on Y. When τ_i = τ_h, rows h and i have identical conditional distributions. If τ_i > τ_h, Y is stochastically higher in row i than in row h.

The likelihood equations for the row effects model (9.8) are

    {μ̂_i+ = n_i+},  {μ̂_+j = n_+j},  and  Σ_j v_j μ̂_ij = Σ_j v_j n_ij ,  i = 1, . . . , I.

Let π̂_{j|i} = μ̂_ij/μ̂_i+ and p_{j|i} = n_ij/n_i+. Since μ̂_i+ = n_i+, the third likelihood equation is Σ_j v_j π̂_{j|i} = Σ_j v_j p_{j|i}. For the conditional distribution within each row, the mean column score is the same for the fitted and sample distributions. The likelihood equations are solved using iterative methods.
TABLE 9.5 Observed Frequencies and Fitted Values for Political Ideology Data

                                Political Ideology^a
Party Affiliation    Liberal      Moderate     Conservative   Total
Democrat             143          156          100            399
                     (102.0)^1    (161.4)      (135.6)
                     (136.6)^2    (168.7)      (93.6)
Independent          119          210          141            470
                     (120.2)      (190.1)      (159.7)
                     (123.8)      (200.4)      (145.8)
Republican           15           72           127            214
                     (54.7)       (86.6)       (72.7)
                     (16.6)       (68.9)       (128.6)

^a 1, independence model; 2, row effects model.
Source: Based on data in R. D. Hedlund, Public Opinion Quart. 41: 498-514 (1978).
9.5.3 Political Ideology Example

Table 9.5 displays the relationship between political ideology and political party affiliation for a sample of voters in a presidential primary in Wisconsin. The table shows fitted values for the independence (I) model and the row effects (R) model with {v_j = j}.

Table 9.6 shows output. Goodness-of-fit tests show that independence is inadequate. Adding the row effects parameters improves the fit greatly (G²(I) = 105.7, df = 4; G²(R) = 2.8, df = 2). Also, testing H0: τ_1 = τ_2 = τ_3 using G²(I | R) = 102.9 (df = 2) shows very strong evidence of an association. In Table 9.5, the improved fit is especially noticeable at the ends of the ordinal scale, where the model has its greatest deviation from independence.

The output uses dummy variables for the first two categories of each classification. The interaction term equals the product of the score for ideology and a parameter for party. Thus, the row effect estimates satisfy τ̂_3 = 0, and the other two estimates contrast the first two parties with Republicans. The estimates are τ̂_1 = −1.213 and τ̂_2 = −0.943. The further τ̂_i falls in the negative direction, the greater the tendency for party i to locate at the liberal end of the ideology scale, relative to Republicans. In this sample the Republicans are much more conservative than the other two groups, and the Democrats (row 1) are the most liberal. From (9.9) the model predicts constant odds ratios for adjacent columns of political ideology. For instance, since τ̂_3 − τ̂_1 = 1.213, the estimated odds that Republicans were conservative instead of moderate, or moderate instead of liberal, were exp(1.213) = 3.36 times the corresponding estimated odds for Democrats. Figure 9.3 shows the parallelism of the estimated logits for the row effects model.
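The output in Table 9.6 can be reproduced with the same IRLS scheme used for any loglinear model. A NumPy sketch, using the corner constraints from the output (Republican and ideology category 3 as baselines); the assertions check against the reported deviance and row effect estimates:

```python
import numpy as np

# Table 9.5: party (Democrat, Independent, Republican) x ideology (Lib, Mod, Cons)
n = np.array([[143., 156., 100.],
              [119., 210., 141.],
              [15., 72., 127.]])
y = n.ravel()
party, ideol = np.divmod(np.arange(9), 3)
v = ideol + 1.0                          # ideology scores {1, 2, 3}

# Intercept, party dummies (Dem, Ind), ideology dummies (1, 2),
# and row effects: score * party dummy (Republican effect fixed at 0)
X = np.column_stack([np.ones(9),
                     (party == 0).astype(float), (party == 1).astype(float),
                     (ideol == 0).astype(float), (ideol == 1).astype(float),
                     v * (party == 0), v * (party == 1)])

mu, eta = y.copy(), np.log(y)
for _ in range(25):
    z = eta + (y - mu) / mu
    WX = X * mu[:, None]
    beta = np.linalg.solve(X.T @ WX, X.T @ (mu * z))
    eta = X @ beta
    mu = np.exp(eta)

deviance = 2 * (y * np.log(y / mu) - (y - mu)).sum()
tau1, tau2 = beta[5], beta[6]
print(deviance, tau1, tau2)              # ~2.81, ~-1.213, ~-0.943
```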
The loglinear model does not distinguish between response and explanatory variables. Instead, one could use a cumulative logit model to describe
TABLE 9.6 Output for Fitting Row Effects Model to Table 9.5

 Criteria For Assessing Goodness Of Fit
 Criterion              DF     Value
 Deviance                2    2.8149
 Pearson Chi-Square      2    2.8039

                                  Std       Wald 95% Conf.
 Parameter           Estimate     Error     Limits               Chi-Square  Pr > ChiSq
 Intercept            4.8565     0.0858     4.6883   5.0246       3204.02     <.0001
 party Democ          3.3230     0.3188     2.6981   3.9479        108.63     <.0001
 party Indep          2.9536     0.3149     2.3364   3.5707         87.98     <.0001
 party Repub          0.0000     0.0000     0.0000   0.0000          .          .
 ideology 1          -2.0488     0.2216    -2.4831  -1.6145         85.50     <.0001
 ideology 2          -0.6244     0.1139    -0.8476  -0.4013         30.08     <.0001
 ideology 3           0.0000     0.0000     0.0000   0.0000          .          .
 score*party Democ   -1.2134     0.1304    -1.4690  -0.9577         86.56     <.0001
 score*party Indep   -0.9426     0.1260    -1.1896  -0.6956         55.95     <.0001
 score*party Repub    0.0000     0.0000     0.0000   0.0000          .          .

 LR Statistics
 Source          DF    Chi-Square    Pr > ChiSq
 score*party      2      102.85        <.0001

FIGURE 9.3 Observed and predicted logits for adjacent response categories.
the effects of party affiliation on ideology, or a baseline-category logit model
to describe linear effects of ideology on party affiliation.
9.5.4 Ordinal Variables in Models for Multiway Tables

Multidimensional tables with ordinal responses can use generalizations of association models. In three dimensions, the rich collection of models includes (1) association models that are more parsimonious than the nominal model (XY, XZ, YZ), and (2) models permitting heterogeneous association that, unlike model (XYZ), are unsaturated.

Models for association that are special cases of (XY, XZ, YZ) replace association terms by structured terms that account for ordinality. For instance, when both X and Y are ordinal, alternatives to λ_ij^XY are a linear-by-linear term β u_i v_j, a row effects term τ_i v_j, or a column effects term u_i ρ_j; these provide a stochastic ordering of conditional distributions within rows and within columns, or just within rows, or just within columns. With a linear-by-linear term, the model is

    log μ_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + β u_i v_j + λ_ik^XZ + λ_jk^YZ .        (9.10)

The conditional local odds ratios (8.13) then satisfy

    log θ_ij(k) = β (u_{i+1} − u_i)(v_{j+1} − v_j)    for all k.

The association is the same in different partial tables, with homogeneous linear-by-linear XY association.

When the association is heterogeneous, structured terms for ordinal variables make effects simpler to interpret than in the saturated model. For instance, the heterogeneous linear-by-linear XY association model

    log μ_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + β_k u_i v_j + λ_ik^XZ + λ_jk^YZ        (9.11)

allows the XY association to change across levels of Z. With unit-spaced scores,

    log θ_ij(k) = β_k    for all i and j.

It has uniform association within each level of Z, but heterogeneity among levels of Z in the strength of association. Fitting it corresponds to fitting the L × L model (9.6) separately at each level of Z.
9.5.5 Air Pollution and Breathing Examples

Table 9.7 displays associations among smoking status (S), breathing test results (B), and age (A) for workers in certain industrial plants in Houston,
TABLE 9.7 Cross-Classification of Industrial Workers by Breathing Test Results

                                 Breathing Test Results
Age      Smoking Status     Normal    Borderline    Abnormal
< 40     Never smoked         577         27            7
         Former smoker        192         20            3
         Current smoker       682         46           11
40-59    Never smoked         164          4            0
         Former smoker        145         15            7
         Current smoker       245         47           27

Source: From p. 21 of Public Program Analysis by R. N. Forthofer and R. G. Lehnen. Copyright © 1981 by Lifetime Learning Publications, Belmont, CA 94002, a division of Wadsworth, Inc. Reprinted by permission of Van Nostrand Reinhold. All rights reserved.
Texas. The loglinear model (SA, SB, BA) fits poorly (G² = 25.9, df = 4). Thus, simpler models, such as homogeneous linear-by-linear SB association, are not plausible (G² = 29.1, df = 7, using equally spaced scores). The heterogeneous linear-by-linear SB association model fits much better with only one additional parameter (G² = 10.8, df = 6). With integer scores for S and B, β̂_1 = 0.115 for the younger group and β̂_2 = 0.781 for the older group, with SE = 0.167 for the difference. The effect of smoking seems much stronger for the older group, with estimated local odds ratio of exp(0.781) = 2.18 compared to exp(0.115) = 1.12 for the younger group. Here, it may be more natural to use logit models with B as the response variable (Problem 7.11).
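Since model (9.11) factorizes over the strata, it can be fit by running the L × L model separately within each age group of Table 9.7 and summing the deviances. A NumPy sketch (the assertions check against the estimates quoted above):

```python
import numpy as np

# Table 9.7: smoking (never, former, current) x breathing test
# (normal, borderline, abnormal), one 3x3 layer per age group
young = np.array([[577., 27., 7.], [192., 20., 3.], [682., 46., 11.]])
old = np.array([[164., 4., 0.], [145., 15., 7.], [245., 47., 27.]])

def fit_LxL(table):
    """Fit the linear-by-linear model (9.6) with integer scores by IRLS."""
    y = table.ravel()
    rows, cols = np.divmod(np.arange(9), 3)
    X = np.column_stack([np.ones(9)]
                        + [(rows == i).astype(float) for i in range(2)]
                        + [(cols == j).astype(float) for j in range(2)]
                        + [(rows + 1.0) * (cols + 1.0)])
    mu = np.clip(y, 0.5, None)           # safe start (one cell count is 0)
    eta = np.log(mu)
    for _ in range(50):
        z = eta + (y - mu) / mu
        WX = X * mu[:, None]
        beta = np.linalg.solve(X.T @ WX, X.T @ (mu * z))
        eta = X @ beta
        mu = np.exp(eta)
    # deviance, with the convention 0 * log(0) = 0
    term = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu), 0.0)
    return beta[-1], 2 * (term - (y - mu)).sum()

b1, d1 = fit_LxL(young)
b2, d2 = fit_LxL(old)
print(b1, b2, d1 + d2)    # beta1 ~ 0.115, beta2 ~ 0.781, total G^2 ~ 10.8
```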
When strata are ordered, roughly a linear trend may exist across strata in certain log odds ratios, as Table 9.8 illustrates. The data refer to a sample of coal miners, measured on B = breathlessness, W = wheeze, and A = age, where B and W are response variables. One could use a separate logit model to describe effects of age on each response. To study whether the BW association varies by age, we fit model (BW, AB, AW). It has residual G² = 26.7, with df = 8. Table 9.8 reports the standardized Pearson residuals. They show a decreasing tendency as age increases.

This suggests the model

    log μ_ijk = (BW, AB, AW) + δ k I(i = j = 1),        (9.12)

where (BW, AB, AW) denotes the linear predictor for that model and I(·) is the indicator function. It amends the homogeneous association model by adding δ in the cell for μ_111, . . . , 9δ in the cell for μ_119. Then the BW log odds ratio changes linearly in the age category. The model fit has δ̂ = −0.131 (SE = 0.029). The estimated BW log odds ratio at level k of age is 3.676 − 0.131k, decreasing from 3.55 to 2.50. The model has residual G² = 6.80 (df = 7). McCullagh and Nelder (1989, Sec. 6.6) showed other analyses.
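Model (9.12) is an ordinary loglinear model with one added covariate, k·I(i = j = 1), so it too can be fit by IRLS. A NumPy sketch using the counts in Table 9.8 (the assertions check against δ̂ and G² reported above):

```python
import numpy as np

# Table 9.8 counts; columns are (B=yes,W=yes), (yes,no), (no,yes), (no,no)
counts = np.array([
    [9, 7, 95, 1841], [23, 9, 105, 1654], [54, 19, 177, 1863],
    [121, 48, 257, 2357], [169, 54, 273, 1778], [269, 88, 324, 1712],
    [404, 117, 245, 1324], [406, 152, 225, 967], [372, 106, 132, 526]],
    dtype=float)

K = 9
y = counts.ravel()
age = np.repeat(np.arange(K), 4)                 # age stratum 0..8
b = np.tile(np.array([1., 1., 0., 0.]), K)       # I(B = yes)
w = np.tile(np.array([1., 0., 1., 0.]), K)       # I(W = yes)
k = age + 1.0                                    # age score 1..9

# Columns: intercept, B, W, BW, and the trend term delta * k * I(i=j=1);
# then age dummies plus AB and AW interaction terms
cols = [np.ones(4 * K), b, w, b * w, b * w * k]
for a in range(K - 1):
    d = (age == a).astype(float)
    cols += [d, d * b, d * w]
X = np.column_stack(cols)

mu = y.copy()
eta = np.log(mu)
for _ in range(50):
    z = eta + (y - mu) / mu
    WX = X * mu[:, None]
    beta = np.linalg.solve(X.T @ WX, X.T @ (mu * z))
    eta = X @ beta
    mu = np.exp(eta)

lam_bw, delta = beta[3], beta[4]
deviance = 2 * (y * np.log(y / mu) - (y - mu)).sum()
print(lam_bw, delta, deviance)    # ~3.676, ~-0.131, ~6.80
```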
TABLE 9.8 Coal Miners Classified by Breathlessness, Wheeze, and Age

               Breathlessness: Yes        Breathlessness: No
Age          Wheeze: Yes  Wheeze: No    Wheeze: Yes  Wheeze: No    Std. Pearson Residual^a
20-24             9            7             95          1841            0.75
25-29            23            9            105          1654            2.20
30-34            54           19            177          1863            2.10
35-39           121           48            257          2357            1.77
40-44           169           54            273          1778            1.13
45-49           269           88            324          1712           -0.42
50-54           404          117            245          1324            0.81
55-59           406          152            225           967           -3.65
60-64           372          106            132           526           -1.44

^a Residual refers to yes-yes and no-no cells; reverse sign for yes-no and no-yes cells.
Source: Reprinted with permission from Ashford and Sowden (1970).
9.5.6 Other Ordinal Tests of Conditional Independence

Tests of conditional independence of ordinal classifications can generalize G²(I | L × L). For instance, one can compare the XY conditional independence model (XZ, YZ) to the homogeneous linear-by-linear XY association model (9.10). It tests β = 0 in that model, with df = 1. This is an alternative to the ordinal test of conditional independence in Section 7.5.3. Like Mantel's score statistic (7.21), this statistic uses correlation information, since Σ_k (Σ_i Σ_j u_i v_j n_ijk) is the sufficient statistic for β in model (9.10). In fact, the Mantel statistic provides the score test of H0: β = 0 in that model.

Exact, small-sample tests can use likelihood-ratio, score, or Wald statistics for such models. Computations require special algorithms (Agresti et al. 1990; Kim and Agresti 1997).
9.6 ASSOCIATION MODELS, CORRELATION MODELS, AND CORRESPONDENCE ANALYSIS*

The linear-by-linear association (L × L) model is a special case of the row effects (R) model, which has parameter row scores, and of the column effects (C) model, which has parameter column scores. These models are special cases of a more general model with both row and column parameter scores.
9.6.1 Multiplicative Row and Column Effects Model

Replacing {u_i} and {v_j} in the L × L model (9.6) by parameters yields the row and column effects (RC) model (Goodman 1979a)

    log μ_ij = λ + λ_i^X + λ_j^Y + β μ_i ν_j .        (9.13)

Identifiability requires location and scale constraints on {μ_i} and {ν_j}. The residual df = (I − 2)(J − 2). This model is not loglinear, because the predictor is a multiplicative (rather than linear) function of the parameters μ_i and ν_j. It treats classifications as nominal; the same fit results from a permutation of rows or columns. Parameter interpretation is simplest when at least one variable is ordinal, through the local log odds ratios

    log θ_ij = β (μ_{i+1} − μ_i)(ν_{j+1} − ν_j).

Although it may seem appealing to use parameters instead of arbitrary scores, the RC model presents complications that do not occur with loglinear models. The likelihood may not be concave and may have local maxima. Independence is a special case, but it is awkward to test independence using the RC model. Haberman (1981) showed that the null distribution of G²(I) − G²(RC) is not chi-squared but rather that of the maximum eigenvalue from a Wishart matrix.

When one set of parameter scores is fixed, the RC model simplifies to the R or C model. Goodman (1979a) suggested an iterative model-fitting algorithm that exploits this. A cycle of the algorithm has two steps. First, for some initial guess of {ν_j}, it estimates the row scores as in the R model. Then, treating the estimated row scores from the first step as fixed, it estimates the column scores as in the C model. Those estimates serve as fixed column scores in the first step of the next cycle, for reestimating the row scores in the R model. There is no guarantee of convergence to ML estimates, but this seems to happen when the model fits well. Haberman (1995) provided more sophisticated fitting methods for association models.
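Goodman's alternating algorithm is straightforward to sketch: each half-step is an ordinary loglinear fit with one set of scores held fixed. The code below is illustrative only (the stopping rule and score normalizations are our choices); it is applied to the mental health data of Table 9.9, for which the RC model is reported to fit with G² ≈ 3.6:

```python
import numpy as np

# Table 9.9: parents' SES (A-F) x mental health status (4 categories)
n = np.array([[64., 94., 58., 46.], [57., 94., 54., 40.],
              [57., 105., 65., 60.], [72., 141., 77., 94.],
              [36., 97., 54., 78.], [21., 71., 54., 71.]])
I, J = n.shape
y = n.ravel()
rows, cols = np.divmod(np.arange(I * J), J)

def poisson_fit(X, y):
    """IRLS for a Poisson loglinear model; returns (beta, fitted values)."""
    mu = y.copy()
    eta = np.log(mu)
    for _ in range(50):
        z = eta + (y - mu) / mu
        WX = X * mu[:, None]
        beta = np.linalg.solve(X.T @ WX, X.T @ (mu * z))
        eta = X @ beta
        mu = np.exp(eta)
    return beta, mu

base = ([np.ones(I * J)]
        + [(rows == i).astype(float) for i in range(I - 1)]
        + [(cols == j).astype(float) for j in range(J - 1)])

v = cols + 1.0                       # initial column scores 1..J, per cell
for _ in range(20):
    # Row step: with column scores fixed, estimate row scores (R model)
    X = np.column_stack(base + [v * (rows == i) for i in range(I - 1)])
    beta, mu = poisson_fit(X, y)
    u = np.append(beta[-(I - 1):], 0.0)[rows]   # row I's score fixed at 0
    # Column step: with row scores fixed, estimate column scores (C model)
    X = np.column_stack(base + [u * (cols == j) for j in range(J - 1)])
    beta, mu = poisson_fit(X, y)
    v = np.append(beta[-(J - 1):], 0.0)[cols]   # column J's score fixed at 0

deviance = 2 * (y * np.log(y / mu) - (y - mu)).sum()
print(deviance)   # approaches the RC-model deviance, about 3.6
```

Because each half-step can reproduce the previous iterate's fit, the deviance is nonincreasing across cycles; for this well-fitting table it settles near the RC value.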
Goodman (1985) expressed the association term in the saturated model in a form that generalizes the β μ_i ν_j term in the RC model, namely,

    λ_ij^XY = Σ_{k=1}^{M} β_k μ_ik ν_jk ,        (9.14)

where M = min(I − 1, J − 1). The parameters satisfy constraints such as

    Σ_i μ_ik π_i+ = Σ_j ν_jk π_+j = 0    for all k,
    Σ_i μ_ik² π_i+ = Σ_j ν_jk² π_+j = 1    for all k,        (9.15)
    Σ_i μ_ik μ_ih π_i+ = Σ_j ν_jk ν_jh π_+j = 0    for all k ≠ h.

When β_k = 0 for k > M*, model (9.14) is called the RC(M*) model. See Becker (1990) for ML model fitting. The RC model (9.13) is the case M* = 1.
TABLE 9.9 Cross-Classification of Mental Health Status and Socioeconomic Status

                              Mental Health Status
Parents'                       Mild          Moderate
Socioeconomic                  Symptom       Symptom
Status            Well         Formation     Formation     Impaired
A (high)           64           94            58            46
B                  57           94            54            40
C                  57          105            65            60
D                  72          141            77            94
E                  36           97            54            78
F (low)            21           71            54            71

Source: Reprinted with permission from L. Srole et al., Mental Health in the Metropolis: The Midtown Manhattan Study (New York: NYU Press, 1978), p. 289.
9.6.2 Mental Health Status Example

Table 9.9 describes the relationship between child's mental impairment and parents' socioeconomic status for a sample of residents of Manhattan (Goodman 1979a). The RC model fits well (G² = 3.6, df = 8). For scaling (9.15), the ML estimates are (−1.11, −1.12, −0.37, 0.03, 1.01, 1.82) for the row scores, (−1.68, −0.14, 0.14, 1.41) for the column scores, and β̂ = 0.17. Nearly all estimated local log odds ratios are positive, indicating a tendency for mental health to be better at higher levels of parents' SES.

Ordinal loglinear models also fit well. For equal-interval scores, G²(L × L) = 9.9 (df = 14). The statistic G²(L × L | RC) = 6.3 (df = 6) tests that the row and column scores in the RC model are equal-interval. The parameter scores do not provide a significantly better fit. It is sufficient to use a uniform local odds ratio to describe the table. For unit-spaced scores, β̂ = 0.091 (SE = 0.015), so the fitted local odds ratio is exp(0.091) = 1.09. There is strong evidence of positive association, but the degree of association is rather weak, at least locally.
9.6.3 Correlation Models

A correlation model for two-way tables has many features in common with the RC model (Goodman 1985). In its simplest form, it is

    π_ij = π_i+ π_+j (1 + ρ μ_i ν_j),        (9.16)

where {μ_i} and {ν_j} are score parameters satisfying

    Σ_i μ_i π_i+ = Σ_j ν_j π_+j = 0    and    Σ_i μ_i² π_i+ = Σ_j ν_j² π_+j = 1.
BUILDING AND EXTENDING LOGLINEAR/LOGIT MODELS
The parameter λ is the correlation between the scores for joint distribution (9.16). The correlation model is also called the canonical correlation model, because ML estimates of the scores maximize the correlation for (9.16). The general canonical correlation model is

\pi_{ij} = \pi_{i+}\,\pi_{+j}\Bigl(1 + \sum_{k=1}^{M} \lambda_k \mu_{ik} \nu_{jk}\Bigr),

where 0 ≤ λ_M ≤ ⋯ ≤ λ_1 ≤ 1 and with constraints such as in (9.15). The parameter λ_k is the correlation between {μ_ik, i = 1, ..., I} and {ν_jk, j = 1, ..., J}. The {μ_i1} and {ν_j1} are standardized scores that maximize the correlation λ_1 for the joint distribution; {μ_i2} and {ν_j2} are standardized scores that maximize the correlation λ_2, subject to {μ_i1} and {μ_i2} being uncorrelated and {ν_j1} and {ν_j2} being uncorrelated; and so on.

Unsaturated models result from replacing M by M* < min(I − 1, J − 1). Gilula and Haberman (1986) and Goodman (1985) discussed ML fitting. When λ is close to zero in (9.16), Goodman (1981a, 1985, 1986) noted that ML estimates of λ and the score parameters are similar to those of β and the score parameters in the RC model. Correlation models can also use fixed scores instead of parameter scores.

Goodman discussed advantages of association models over correlation models. The correlation model is not defined for all possible combinations of score values because of the constraint 0 ≤ π_ij ≤ 1, ML fitted values do not have the same marginal totals as the observed data, and the model is not simply generalizable to multiway tables. Gilula and Haberman (1988) analyzed multiway tables with correlation models by treating explanatory variables as a single variable and response variables as a second variable.
9.6.4 Correspondence Analysis

Correspondence analysis is a graphical way to represent associations in two-way contingency tables. The rows and columns are represented by points on a graph, the positions of which indicate associations. Goodman (1985, 1986) noted that coordinates of the points are reparameterizations of {μ_ik} and {ν_jk} in the general canonical correlation model. Correspondence analysis uses adjusted scores

x_{ik} = \lambda_k \mu_{ik}, \qquad y_{jk} = \lambda_k \nu_{jk}.

These are close to zero for dimensions k in which the correlation λ_k is close to zero. A correspondence analysis graph uses the first two dimensions, plotting (x_i1, x_i2) for each row and (y_j1, y_j2) for each column.
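These quantities are computable directly from the data. As a sketch (not from the book; it assumes numpy is available), the canonical correlations λ̂_k are the singular values of the matrix of standardized residuals of the table, and their squares can be compared with the squared correlations reported for Table 9.9:

```python
import numpy as np

# Table 9.9: parents' SES (rows A-F) by mental health status (columns)
N = np.array([[64, 94, 58, 46],
              [57, 94, 54, 40],
              [57, 105, 65, 60],
              [72, 141, 77, 94],
              [36, 97, 54, 78],
              [21, 71, 54, 71]], dtype=float)

P = N / N.sum()                      # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)  # row and column masses

# Standardized residuals; their singular values are the canonical correlations
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, lam, Vt = np.linalg.svd(S, full_matrices=False)

print(np.round(lam[:3] ** 2, 4))     # squared correlation for each dimension
```

The last singular value is numerically zero because subtracting the independence fit removes the trivial dimension; the three nonzero squared singular values sum to X²/n, the total squared correlation.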
TABLE 9.10 Scores from Correspondence Analysis Applied to Table 9.9

                          Dimension
Column Score        1         2         3
1                 0.260     0.012     0.023
2                 0.030     0.024    −0.019
3                −0.013    −0.069    −0.002
4                −0.236     0.019     0.016

Row Score
1                 0.181    −0.018     0.028
2                 0.185    −0.011    −0.026
3                 0.059    −0.021    −0.010
4                −0.008     0.042     0.011
5                −0.164     0.044    −0.009
6                −0.287    −0.061     0.005

Source: Reprinted with permission from the Institute of Mathematical Statistics, based on Goodman (1985).
Goodman (1985, 1986) used Table 9.9 to illustrate the similarities of correspondence analysis to analyses using correlation models and association models. For the general canonical correlation model, M = min(I − 1, J − 1) = 3. Its estimated squared correlations are (0.0260, 0.0014, 0.0003). The association is rather weak. Table 9.10 contains estimated row and column scores for the correspondence analysis of these three dimensions. Both sets of scores in the first dimension fall in a monotone decreasing pattern, except for a slight discrepancy between the first two row scores. This indicates an overall positive association. The scores for the second and third dimensions are close to zero, reflecting the relatively small λ̂₂ and λ̂₃.

Figure 9.4 exhibits the results of the correspondence analysis. The horizontal axis has estimates for the first dimension, and the vertical axis has estimates for the second dimension. Six points (circles) represent the six rows, with point i giving (x̂_i1, x̂_i2). Similarly, four points (squares) display the estimates (ŷ_j1, ŷ_j2). Both sets of points lie close to the horizontal axis, since the first dimension is more important than the second.

FIGURE 9.4 Graphical display of scores from first two dimensions of correspondence analysis. [Based on Escoufier (1982); reprinted with permission.]
Row points that are close together represent rows with similar conditional distributions across the columns. Close column points represent columns with similar conditional distributions across the rows. Row points close to column points represent combinations that are more likely than expected under independence. Figure 9.4 shows a tendency for subjects at the high end of one scale to be at the high end of the other, and for subjects at the low end of one to be at the low end of the other.

Correspondence analysis is used mainly as a descriptive tool. Goodman (1986) developed inferential methods for it. For Table 9.9, inferential analysis reveals that the first dimension, accounting for 94% of the total squared correlation, is adequate for describing the association. Goodman argued for choosing the unsaturated model employing only one dimension and having graphics display fitted scores for that dimension alone. Correspondence analysis is then equivalent to an ML analysis using correlation model (9.16). The estimated scores for that model are (−1.09, −1.17, −0.37, 0.05, 1.01, 1.80) for the rows and (−1.60, −0.19, 0.09, 1.48) for the columns. The model fits well (G² = 2.75, df = 8). The quality of fit and the estimated scores are similar to those we saw in Section 9.6.2 for the RC model. More parsimonious correlation models, such as ones using equally spaced scores, also fit these data well.

All analyses of Table 9.9 have yielded similar conclusions about the association. They all neglect, however, that mental health is a natural response variable; it may make more sense to use an ordinal logit model. Like correlation models, correspondence analysis has the severe limitation that it does not generalize simply to multiway tables. Greenacre (1993) showed displays of several pairwise associations in a single plot.
9.6.5 Model Selection and Score Choice for Ordinal Variables

The past three sections showed several ways to use category orderings in model building. With allowance for ordinal effects, the variety of potential models is much greater than with standard loglinear models. To choose among models, one approach uses the standard models for guidance: if a standard model fits well, simplify it by replacing some parameters with structured terms for the ordinal classifications.

Association, correlation, and correspondence analysis models have scores for categories of ordinal variables. Parameter interpretations are simplest for equally spaced scores. With parameter scores, the resulting ML estimates of the scores need not be monotone. Constrained versions of the models force monotonicity by maximizing the likelihood subject to order restrictions (e.g., Agresti et al. 1987; Ritov and Gilula 1991). There are disadvantages, however, to treating scores as parameters: the model becomes less parsimonious, and tests of effects may be less powerful because of a greater df value (recall Section 6.4.3). When one variable alone is a response, cumulative link models (Sections 7.2 and 7.3) for that response do not require preassigned or parameter scores.
9.7 POISSON REGRESSION FOR RATES

Loglinear models need not refer to contingency tables. In Section 4.3 we introduced Poisson regression for modeling counts. When outcomes occur over time, space, or some other index of size, it is more relevant to model their rate of occurrence than their raw number.

9.7.1 Analyzing Rates Using Loglinear Models with Offsets

When a response count n_i has index equal to t_i, the sample rate is n_i/t_i. Its expected value is μ_i/t_i. With an explanatory variable x, a loglinear model for the expected rate has form

\log(\mu_i / t_i) = \alpha + \beta x_i.        (9.17)

This model has the equivalent representation

\log \mu_i - \log t_i = \alpha + \beta x_i.

As noted in Section 8.7.4, the adjustment term, −log t_i, to the log link of the mean is called an offset. The fit corresponds to using log t_i as a predictor on the right-hand side and forcing its coefficient to equal 1.0.

For model (9.17), the expected response count satisfies

\mu_i = t_i \exp(\alpha + \beta x_i).

The mean is proportional to the index, with proportionality constant depending on the value of x. The identity link is also sometimes useful. The model is then

\mu_i / t_i = \alpha + \beta x_i,   or   \mu_i = \alpha t_i + \beta x_i t_i.

This does not require an offset. It corresponds to an ordinary Poisson GLM using the identity link with t_i and x_i t_i as explanatory variables and no intercept. It provides additive, rather than multiplicative, predictor effects. It is less useful with many predictors, as the fitting process may fail because of negative fitted counts at some iteration.
9.7.2 Modeling Death Rates for Heart Valve Operations

Laird and Olivier (1981) analyzed patient survival after heart valve replacement operations. A sample of 109 patients was classified by type of heart valve (aortic, mitral) and by age (<55, ≥55). Follow-up observations occurred until the patient died or the study ended. Operations occurred throughout the study period, and follow-up observations covered lengths of time varying from 3 to 97 months. The response was whether the subject died and the follow-up time. For subjects who died, this is the time after the operation until death; for the others, it is the time until the study ended or the subject withdrew from it.

TABLE 9.11 Data on Heart Valve Replacement Operations

                            Type of Heart Valve
Age                        Aortic        Mitral
<55    Deaths                 4             1
       Time at risk        1259          2082
       Death rate          0.0032        0.0005
55+    Deaths                 7             9
       Time at risk        1417          1647
       Death rate          0.0049        0.0055

Source: Reprinted with permission, based on data in Laird and Olivier (1981).

Table 9.11 lists the numbers of deaths during the follow-up period, by valve type and age. These counts are the first layer of a three-way contingency table that classifies valve type, age, and whether the patient died (yes, no). The subjects not tabulated in Table 9.11 were not observed to die. They are censored, since we know only a lower bound for how long they lived after the operation. It is inappropriate to analyze that 2 × 2 × 2 table using binary GLMs for the probability of death, since subjects had differing times at risk; it is not sensible to treat a subject who could be observed for 3 months and a subject who could be observed for 97 months as identical trials with the same probability. To use age and valve type as predictors in a model for frequency of death, the proper baseline is not the number of subjects but rather the total time that subjects were at risk. Thus, we model the rate of death.

The time at risk for a subject is that subject's follow-up time of observation. For a given age and valve type, the total time at risk is the sum of the times at risk for all subjects in that cell (those who died and those censored). Table 9.11 lists those total times in months. The sample rate, also shown in that table, divides the number of deaths by the total time at risk. For instance, 4 deaths in 1259 months of observation occurred for younger subjects with aortic valve replacement, so their sample rate is 4/1259 = 0.0032.

We now model effects of age and valve type on the rate. Let a be a dummy variable for age, with a₁ = 0 for the younger age group and a₂ = 1 for the older group. Let v be a dummy variable for valve type, with v₁ = 0 for aortic and v₂ = 1 for mitral. Let n_ij denote the number of deaths for age a_i and valve type v_j, with expected value μ_ij for total time at risk t_ij.
TABLE 9.12 Fit to Table 9.11 for Poisson Regression Models

                                Log Link               Identity Link
Age                          Aortic    Mitral        Aortic    Mitral
<55   Number of deaths        2.28      2.72          3.16      1.19
      Death rate             0.0018    0.0013        0.0025    0.0006
55+   Number of deaths        8.72      7.28          9.17      7.48
      Death rate             0.0062    0.0044        0.0065    0.0046
Given t_ij, the expected rate is μ_ij/t_ij. The model

\log(\mu_{ij}/t_{ij}) = \alpha + \beta_1 a_i + \beta_2 v_j        (9.18)

assumes a lack of interaction in the effects.

Model fitting uses standard iterative methods, treating {n_ij} as independent Poisson variates with means {μ_ij}. This is done conditional on {t_ij}. Table 9.12 presents the fitted death counts and estimated rates. The estimated effects are

\hat\beta_1 = 1.221 \;(\mathrm{SE} = 0.514), \qquad \hat\beta_2 = -0.330 \;(\mathrm{SE} = 0.438).

There is evidence of an age effect. Given valve type, the estimated rate for the older age group is exp(1.221) = 3.4 times that for the younger age group. The 95% Wald confidence interval for β₁ of 1.221 ± 1.96(0.514) translates to (1.2, 9.3) for the true multiplicative effect exp(β₁). (The likelihood-ratio confidence interval is (1.3, 10.4).) The study contains much censored data: of the 109 patients, only 21 died during the study period. Both effect estimates are imprecise. Note, though, that the analysis uses all 109 patients through their contributions to the times at risk.

Goodness-of-fit statistics comparing {n_ij} to fitted values {μ̂_ij} are G² = 3.2 and X² = 3.1. The residual df = 1, since the model has three parameters for the four response counts. The mild evidence of lack of fit corresponds to evidence of interaction between valve type and age. However, the model without valve-type effects [i.e., β₂ = 0 in (9.18)] fits nearly as well, with G² = 3.8 and X² = 3.8 (df = 2). Models omitting age effects fit poorly.

The corresponding model with identity link,

\mu_{ij} = \alpha t_{ij} + \beta_1 a_i t_{ij} + \beta_2 v_j t_{ij},

shows a good fit, with G² = 1.1 and X² = 1.1 (df = 1). Table 9.12 shows the fit. Substantive conclusions are similar. The estimate β̂₁ = 0.0040 (SE = 0.0014) then represents an estimated difference in death rates between the older and younger age groups for each valve type.
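As an illustrative sketch (not part of the book's text; it assumes numpy is available), model (9.18) can be fitted to the four counts of Table 9.11 by Newton-Raphson, with the log time at risk as the offset:

```python
import numpy as np

# Table 9.11 cells ordered (<55 aortic), (<55 mitral), (55+ aortic), (55+ mitral)
deaths = np.array([4.0, 1.0, 7.0, 9.0])
t = np.array([1259.0, 2082.0, 1417.0, 1647.0])   # months at risk
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])  # columns: intercept, age (1 = 55+), valve (1 = mitral)
offset = np.log(t)

# Newton-Raphson for log(mu) = offset + X @ beta, i.e., model (9.18);
# start the intercept at the log of the overall death rate
beta = np.array([np.log(deaths.sum() / t.sum()), 0.0, 0.0])
for _ in range(25):
    mu = np.exp(offset + X @ beta)
    beta += np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (deaths - mu))

mu = np.exp(offset + X @ beta)
print(np.round(beta[1:], 3), np.round(mu, 2))  # age and valve effects; fitted counts
```

The two effect estimates agree with β̂₁ = 1.221 and β̂₂ = −0.330, and the fitted counts match the log-link columns of Table 9.12.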
9.7.3 Modeling Survival Times*
A method for modeling survival times relates to the Poisson loglinear model for rates. This method focuses on times until death rather than on numbers of deaths. Let T denote the time to some event, such as death, or product failure in a reliability study. Let f(t) denote the probability density function (pdf) and F(t) the cdf of T. A connection exists between ML estimation using a Poisson likelihood for numbers of events and a negative exponential likelihood for T (Aitkin and Clayton 1980).

A subject having T = t contributes f(t) to the likelihood. For a subject whose censoring time equals t, we know only that T > t; thus, this subject contributes P(T > t) = 1 − F(t). Using the indicator w_i = 1 for death and 0 for censoring for subject i, the survival-time likelihood for n independent observations is

\prod_{i=1}^{n} \bigl[f(t_i)\bigr]^{w_i} \bigl[1 - F(t_i)\bigr]^{1 - w_i}.

The log likelihood equals

\sum_i w_i \log f(t_i) + \sum_i (1 - w_i) \log\bigl[1 - F(t_i)\bigr].        (9.19)

Further analysis requires a parametric form for f and a model for the dependence of its parameters on explanatory variables.

Most survival models focus on the rate at which death occurs rather than on E(T). The hazard function

h(t) = \frac{f(t)}{1 - F(t)} = \lim_{\epsilon \downarrow 0} \frac{P(t < T < t + \epsilon \mid T > t)}{\epsilon}

represents the instantaneous rate of death for subjects who have survived to time t. A simple density for survival modeling is the negative exponential. The pdf is

f(t) = \lambda e^{-\lambda t}, \qquad t > 0.

The cdf is F(t) = 1 − e^{−λt} for t > 0, and E(T) = λ⁻¹. The hazard function is

h(t) = \lambda, \qquad t > 0,

constant for all t.

Now we include explanatory variables x. Suppose that the hazard function for a negative exponential survival distribution is

h(t; \mathbf{x}) = \lambda \exp(\boldsymbol{\beta}'\mathbf{x}).        (9.20)
That is, the distribution for T has parameter depending on x through (9.20). The choice of functional form (9.20) for explanatory variable effects ensures that the hazard is nonnegative at all x. For instance, loglinear model (9.18) corresponds to a multiplicative model of type (9.20) for the rate itself.

Now, consider the log likelihood (9.19) with f(t) equal to the negative exponential density with parameter λ exp(β′x). For subject i, let

\mu_i = t_i \lambda \exp(\boldsymbol{\beta}'\mathbf{x}_i).

With this substitution, the log likelihood simplifies to

\sum_i w_i \log \mu_i - \sum_i \mu_i - \sum_i w_i \log t_i.

The first two terms involve the parameters. This part is identical to the log likelihood for independent Poisson variates {w_i} with expected values {μ_i}. In this application {w_i} are binary rather than Poisson, but that is irrelevant to the process of maximizing with respect to β. This process is equivalent to maximizing the likelihood for the Poisson loglinear model

\log \mu_i - \log t_i = \log \lambda + \boldsymbol{\beta}'\mathbf{x}_i

with offset log(t_i), using observations {w_i}. When we sum terms in the log likelihood for subjects having a common value of x, the observed data are the numbers of deaths (Σw_i) at each setting of x, and the offset is the log of (Σt_i) at each setting.

The assumption of constant hazard over time is often not sensible. As products wear out, their failure rate increases. A generalization divides the time scale into disjoint intervals and assumes constant hazard in each, namely,

h(t; \mathbf{x}) = \lambda_k \exp(\boldsymbol{\beta}'\mathbf{x})

for t in interval k, k = 1, .... A separate hazard rate applies to each piece of the time scale. Consider the contingency table for numbers of deaths, in which one dimension is a discrete time scale and other dimensions represent categorical explanatory variables. Holford (1980) and Laird and Olivier (1981) showed that Poisson loglinear models and likelihoods for this table are equivalent to loglinear hazard models and likelihoods that assume piecewise exponential hazards for the survival times.

For short time intervals, the piecewise exponential approach is essentially nonparametric, making no assumption about the dependence of the hazard on time. This suggests the generalization of model (9.20) that replaces λ by an unspecified function λ(t), so that

h(t; \mathbf{x}) = \lambda(t) \exp(\boldsymbol{\beta}'\mathbf{x}).

This is the Cox proportional hazards model. Its ratio of hazards

h(t; \mathbf{x}_1)/h(t; \mathbf{x}_2) = \exp\bigl[\boldsymbol{\beta}'(\mathbf{x}_1 - \mathbf{x}_2)\bigr]

is the same for all t.
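The equivalence is easy to check numerically in the no-covariate case, where maximizing (9.19) for the exponential density gives the closed form λ̂ = Σ_i w_i / Σ_i t_i, events divided by total exposure (exactly the "deaths / time at risk" rates of Table 9.11). A minimal sketch, using simulated data with hypothetical parameter values and assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical censored sample: exponential survival times, fixed censoring at c
lam_true, c, n = 0.02, 60.0, 500
T = rng.exponential(1 / lam_true, size=n)
t = np.minimum(T, c)             # observed follow-up time
w = (T <= c).astype(float)       # 1 = death observed, 0 = censored

# Closed-form exponential MLE from log likelihood (9.19): events / exposure
lam_hat = w.sum() / t.sum()

# Same estimate via the Poisson representation: treat w_i as Poisson with
# mean mu_i = t_i * lam; one-parameter Newton iteration on eta = log(lam)
eta = np.log(w.sum() / t.sum()) + 0.5    # deliberately off-target start
for _ in range(50):
    mu = t * np.exp(eta)
    eta += (w.sum() - mu.sum()) / mu.sum()   # Newton step
lam_poisson = np.exp(eta)

print(lam_hat, lam_poisson)   # the two routes agree
```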
TABLE 9.13 Number of Deaths from Lung Cancer a

Follow-up              Histology I                Histology II               Histology III
Time Interval     Disease Stage:             Disease Stage:             Disease Stage:
(months)            1       2       3          1       2       3          1       2       3
0-2               9(157) 12(134) 42(212)     5(77)   4(71) 28(130)      1(21)   1(22) 19(101)
2-4               2(139)  7(110) 26(136)     2(68)   3(63) 19(72)       1(17)   1(18) 11(63)
4-6               9(126)  5(96)  12(90)      3(63)   5(58) 10(42)       1(14)   3(14)  7(43)
6-8              10(102) 10(86)  10(64)      2(55)   4(42)  5(21)       1(12)   1(10)  6(32)
8-10              1(88)   4(66)   5(47)      2(50)   2(35)  0(14)       0(10)   0(8)   3(21)
10-12             3(82)   3(59)   4(39)      2(45)   1(32)  3(13)       1(8)    0(8)   3(14)
12+               1(76)   4(51)   1(29)      2(42)   4(28)  2(7)        0(6)    2(6)   3(10)

a Values in parentheses represent total follow-up.
Source: Reprinted with permission from the Biometric Society, based on Holford (1980).
9.7.4 Lung Cancer Survival Example*

Table 9.13 describes survival for 539 males diagnosed with lung cancer. The prognostic factors are histology (H) and stage of disease (S). For a piecewise exponential hazard approach, the time scale for follow-up (T) was divided into two-month intervals.

Let μ_ijk denote the expected number of deaths and t_ijk the total time at risk for histology i and stage of disease j, in follow-up time interval k. The model

\log(\mu_{ijk}/t_{ijk}) = \lambda + \lambda_i^H + \lambda_j^S + \lambda_k^T        (9.21)

has residual G² = 43.9 (df = 52). All models assuming no interaction between follow-up time interval and either prognostic factor are proportional hazards models, since they have the same effects of histology and stage of disease for each time interval. Table 9.14 summarizes results of fitting several such models. Although stage of disease is an important prognostic factor, histology did not contribute significant additional information.

For model (9.21), the effects of stage of disease satisfy

\hat\lambda_2^S - \hat\lambda_1^S = 0.470 \;(\mathrm{SE} = 0.174), \qquad \hat\lambda_3^S - \hat\lambda_1^S = 1.324 \;(\mathrm{SE} = 0.152).
TABLE 9.14 Results for Poisson Regression Models of Proportional Hazards Form with Table 9.13

Effects a                    G²      df
T                          170.7     56
T + H                      143.1     54
T + S                       45.8     54
T + S + H                   43.9     52
T + S + H + S × H           41.5     48

a T, time scale for follow-up; H, histology; S, disease stage.
For instance, at a fixed follow-up time for a given histology, the estimated death rate at the third stage of disease is exp(1.324) = 3.8 times that at the first stage. Adding interaction terms between stage and time does not significantly improve the fit (change in G² = 14.9, change in df = 12). The {λ̂_j^S} are very similar for the simpler model without the histology effects.
9.7.5 Analyzing Weighted Data*

The process of fitting a loglinear model with an offset is also useful in other applications. For expected frequencies {μ_i} and fixed constants {t_i}, consider a model

\log(\mu_i/t_i) = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots.

Standard loglinear models have {t_i = 1}. The general form is useful for the analysis of categorical data with sampling designs more complex than simple random sampling.

Many surveys have sampling designs employing stratification and/or clustering. Case weights inflate or deflate the influence of each observation according to features of that design. Adding the case weights for subjects in a particular cell i gives a total weighted frequency for that cell. The average cell weight z_i is defined to be the total weighted frequency divided by the cell count. Conditional on {z_i}, loglinear models for the weighted expected frequencies {z_i μ_i = μ_i/t_i} with t_i = z_i⁻¹ express the model as a standard loglinear model for {log μ_i}, with offset {log t_i = −log z_i}. Fitting this model provides appropriate parameter estimates and standard errors (Clogg and Eliason 1987).
9.8 EMPTY CELLS AND SPARSENESS IN MODELING CONTINGENCY TABLES

Contingency tables having small cell counts are said to be sparse. We end this chapter by discussing effects of sparse tables on model fitting. Sparse tables occur when the sample size n is small. They also occur when n is large but so is the number of cells. Sparseness is common in tables with many variables. The following discussion refers to a generic contingency table and model, with cell counts {n_i} and expected frequencies {μ_i} for n observations in N cells.
9.8.1 Empty Cells: Sampling versus Structural Zeros

Sparse tables usually contain cells with n_i = 0. These empty cells are of two types: sampling zeros and structural zeros. In most cases, even though n_i = 0, μ_i > 0: it is possible to have observations in the cell, and n_i > 0 occurs with sufficiently large n. Such an empty cell is called a sampling zero. The empty cells in Table 9.1 for the student survey are sampling zeros.

An empty cell in which observations are impossible is called a structural zero. For such cells, μ_i = 0 and necessarily μ̂_i = 0 and n_i = 0 regardless of n. For a table that cross-classifies cancer patients by their gender, race, and type of cancer, some cancers (e.g., prostate cancer, ovarian cancer) are gender specific, so certain cells have structural zeros. Contingency tables with structural zeros are called incomplete tables.

Sampling zeros are part of the data set. A count of 0 is a permissible outcome for a Poisson or multinomial variate; it contributes to the likelihood function and to model fitting. A structural zero, on the other hand, is not an observation and is not part of the data. Sampling zeros are much more common than structural zeros, and the remaining discussion refers to them.
9.8.2 Existence of Estimates in Loglinear/Logit Models

Sampling zeros can affect the existence of finite ML estimates of loglinear and logit model parameters. Haberman (1973b, 1974a), generalizing work by Birch (1963) and Fienberg (1970b), studied this. Let n denote the vector of cell counts and μ their expected values. Haberman showed results 1 through 5 for Poisson sampling, but by result 6 they apply also to multinomial sampling.

1. The log-likelihood function is a strictly concave function of log μ.
2. If an ML estimate of μ exists, it is unique and satisfies the likelihood equations X′n = X′μ̂. Conversely, if μ̂ satisfies the model and also the likelihood equations, it is the ML estimate of μ.
3. If all n_i > 0, ML estimates of loglinear model parameters exist.
4. Suppose that ML parameter estimates exist for a loglinear model that equates observed and fitted counts in certain marginal tables. Then those marginal tables have uniformly positive counts.
5. If ML estimates exist for a model M, they also exist for any special case of M.
6. For any loglinear model, the ML estimates μ̂ are identical for multinomial and independent Poisson sampling, and those estimates exist in the same situations.
To illustrate, consider the saturated model. By results 2 and 3, when all n_i > 0, the ML estimate of μ is n. By result 4, parameter estimates do not exist when any n_i = 0. Model parameter estimates are contrasts of {log μ̂_i}, and since μ̂ = n for the saturated model, the estimates are finite only when all n_i > 0.

For unsaturated models, by results 3 and 4, ML estimates exist when all n_i > 0 and do not exist when any count is zero in the set of sufficient marginal tables. Suppose that at least one n_i = 0 but the sufficient marginal counts are all positive. For hierarchical loglinear models, Glonek et al. (1988) showed that the positivity of the sufficient counts implies the existence of ML estimates if and only if the model is decomposable (Note 8.2), which includes the conditional independence models. Models having all pairs of variables associated, however, are more complex. For model (XY, XZ, YZ), for instance, ML estimates exist when only one n_i = 0 but may not exist when at least two cells are empty. For instance, ML estimates do not exist for Table 9.15, even though all sufficient statistics (the two-way marginal totals) are positive (Problem 9.47).

Haberman showed that the supremum of the likelihood function is finite. This motivated him to define extended ML estimators of μ. These always exist but may equal 0 and, falling on the boundary, need not have the same properties as regular ML estimators [see also Baker et al. (1985)]. A sequence of estimates satisfying the model that converges to the extended estimate has log likelihood approaching its supremum. In this extended sense, μ̂_i = 0 is the ML estimate of μ_i for the saturated model when n_i = 0, and one can have infinite loglinear parameter estimates.

When a sufficient marginal count for a factor equals zero, infinite estimates occur for that term. For instance, when an XY marginal total equals zero, infinite estimates occur among {λ̂_ij^XY} for loglinear models such as (XY, XZ, YZ), and infinite estimates occur among {β̂_i^X} for the effect of X on Y in logit models. Sometimes, however, not even infinite estimates exist. An example is estimating the log odds ratio when both entries in a row or column of a 2 × 2 table equal 0.
TABLE 9.15 Data for Which ML Estimates Do Not Exist for Model (XY, XZ, YZ) a

            Z:       1               2
                Y:   1     2         1     2
X     1              0     *         *     *
      2              *     *         *     0

a Cells containing * may contain any positive numbers.
A value of ∞ (or −∞) for an ML parameter estimate implies that ML fitted values equal 0 in some cells, and some odds ratio estimates equal ∞ or 0. One potential indicator is failure of the iterative fitting process to converge, typically because an estimate keeps increasing from cycle to cycle. Most software, however, is fooled after a certain point in the iterative process by the nearly flat likelihood. It reports convergence, but because of the very slight curvature of the log likelihood, the estimated standard errors (based on inverting the information matrix of second partial derivatives) are extremely large and numerically unstable. Slight changes in the data then often cause dramatic changes in the estimates and their standard errors. A danger with sparse data is that one might not realize that a true estimated effect is infinite and, as a consequence, report estimated effects and results of statistical inferences that are invalid and highly unstable.

Many ML analyses are unharmed by empty cells. Even when a parameter estimate is infinite, this is not fatal to data analysis. The likelihood-ratio confidence interval for the true log odds ratio has one endpoint that is finite. For instance, when n₁₁ = 0 but the other n_ij > 0 in a 2 × 2 table, log θ̂ = −∞ and a confidence interval has form (−∞, U) for some finite upper bound U. When the pattern of empty cells forces certain fitted values for a model to equal 0, this affects the df for testing model fit (Haslett 1990).
9.8.3 Clinical Trials Example

Table 9.16 shows results of a clinical trial conducted at five centers. The purpose was to compare an active drug to placebo for treating fungal infections, with a binary (success, failure) response. For these data, let Y = response, X = treatment (x₁ = 1 for active drug and x₂ = 0 for placebo), and Z = center.

Centers 1 and 3 had no successes. Thus, the 5 × 2 marginal table relating response to center, collapsed over treatment, contains zero counts. The last two columns of Table 9.16 show this marginal table. Infinite ML estimates occur for terms in loglinear or logit models containing the YZ association. An example is the logit model

\mathrm{logit}\bigl[P(Y = 1 \mid X = i,\, Z = k)\bigr] = \beta x_i + \beta_k^Z.

(We omit the intercept, so the {β_k^Z} need no constraint; they then refer to center effects rather than contrasts between centers and a baseline center.) The likelihood function increases continually as β₁^Z and β₃^Z decrease toward −∞; that is, as the logit decreases toward −∞, the fitted probability of success decreases toward the ML estimate of 0 for those centers.

The counts in the 2 × 2 marginal table relating response to treatment, shown in the bottom panel of Table 9.16, are all positive. The empty cells in Table 9.16 affect the center estimates, but not the treatment estimate, for this logit model. In the limit as the log likelihood increases, the fitted values have a log odds ratio β̂ = 1.55 (SE = 0.70). Most software reports this, but
TABLE 9.16 Clinical Trial Relating Treatment to Response with XY and YZ Marginal Tables a

                                Response             YZ Marginal
Center     Treatment        Success  Failure      Success  Failure
1          Active drug         0        5            0       14
           Placebo             0        9
2          Active drug         1       12            1       22
           Placebo             0       10
3          Active drug         0        7            0       12
           Placebo             0        5
4          Active drug         6        3            8        9
           Placebo             2        6
5          Active drug         5        9            7       21
           Placebo             2       12

XY         Active drug        12       36
marginal   Placebo             4       42

a X, treatment; Y, response; Z, center.
Source: Data courtesy of Diane Connell, Sandoz Pharmaceuticals Corporation.
instead of β̂₁^Z = β̂₃^Z = −∞ reports large numbers with extremely large standard errors. For instance, PROC GENMOD in SAS reports values of about −26 for β̂₁^Z and β̂₃^Z, with standard errors of about 200,000.
The treatment estimate β̂ = 1.55 also results from deleting centers 1 and 3 from the analysis. When a center contains responses of only one type, it provides no information about this odds ratio. (It does provide information about the size of some other measures, such as the difference of proportions.) In fact, such tables also make no contribution to standard tests of conditional independence, such as the Cochran–Mantel–Haenszel test (Section 6.3.2) and the exact test (Section 6.7.5).
An alternative strategy in multicenter analyses combines centers of a similar type. Then, if each resulting partial table has responses with both outcomes, the inferences use all data. For Table 9.16, perhaps centers 1 and 3 are similar to center 2, since the success rate is very low for that center. Combining these three centers and refitting the model to this table and the tables for the other two centers yields β̂ = 1.56 (SE = 0.70). Usually, this strategy produces results similar to deleting tables with no outcomes of a particular type.
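This claim about all-failure strata can be verified numerically. Below is a minimal sketch in Python (the helper name `cmh_components` is ours, and the statistic is computed without continuity correction) showing that the Cochran–Mantel–Haenszel statistic for the counts of Table 9.16 is unchanged when centers 1 and 3 are dropped:

```python
def cmh_components(stratum):
    """One 2x2 stratum's contribution to the CMH statistic:
    (n11 - E(n11), null variance), without continuity correction."""
    (a, b), (c, d) = stratum
    n = a + b + c + d
    e = (a + b) * (a + c) / n
    v = (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    return a - e, v

# Table 9.16: (success, failure) for active drug / placebo in each center.
centers = [
    [(0, 5), (0, 9)],     # center 1: all failures
    [(1, 12), (0, 10)],
    [(0, 7), (0, 5)],     # center 3: all failures
    [(6, 3), (2, 6)],
    [(5, 9), (2, 12)],
]

for subset in (centers, [centers[i] for i in (1, 3, 4)]):
    diffs, variances = zip(*(cmh_components(s) for s in subset))
    cmh = sum(diffs) ** 2 / sum(variances)
    print(round(cmh, 4))
```

Both printed values agree, because a stratum whose success column is empty has n₁₁ − E(n₁₁) = 0 and null variance 0, so it adds nothing to either sum.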
9.8.4 Effect of Small Samples on X² and G²
Although empty cells and sparse tables need not affect parameter estimates
of interest, they can cause sampling distributions of goodness-of-fit statistics
to be far from chi-squared. The true sampling distributions converge to
BUILDING AND EXTENDING LOGLINEAR/LOGIT MODELS
chi-squared as n → ∞, for a fixed number of cells N. The adequacy of the chi-squared approximation depends on both n and N.
Cochran studied the chi-squared approximation for X² in several articles. In 1954, he suggested that to test independence with df > 1, a minimum expected value μᵢ of about 1 is permissible as long as no more than about 20% of the μᵢ are less than 5. Koehler (1986), Koehler and Larntz (1980), and Larntz (1978) showed that the chi-squared approximation holds for X² with smaller n and sparser tables than for G². The distribution of G² is usually poorly approximated by chi-squared when n/N is less than 5. Depending on the sparseness, P-values based on referring G² to a chi-squared distribution can be too large or too small. When most μᵢ are smaller than 0.5, treating G² as chi-squared gives a highly conservative test; when H₀ is true, reported P-values tend to be much larger than true ones. When most μᵢ are between 0.5 and 4, G² tends to be too liberal; the reported P-value tends to be too small.
The size of n/N that produces adequate approximations for X² tends to decrease as N increases (Koehler and Larntz 1980). However, the approximation tends to be poor for sparse tables containing both small and moderately large μᵢ (Haberman 1988). It is difficult to give a guideline that covers all cases. For other discussion, see Cressie and Read (1989) and Lawal (1984).
For fixed n and N, the chi-squared approximation is better for tests with smaller df. For instance, in testing conditional independence in I × J × K tables, G²[(XZ, YZ) | (XY, XZ, YZ)] (with df = (I − 1)(J − 1)) is closer to chi-squared than G²(XZ, YZ) (with df = K(I − 1)(J − 1)). The ordinal test of H₀: β = 0 with the homogeneous linear-by-linear XY association model (9.10) has df = 1 and behaves even better.
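The qualitative behavior described in this subsection is easy to check by simulation. Here is a minimal sketch in Python with NumPy (the cell count N, the sample size n, and the number of replications are arbitrary choices of ours) that estimates the true size of nominal 0.05-level tests based on X² and G² when each expected count is 2:

```python
import numpy as np

rng = np.random.default_rng(42)

# Sparse setting: N = 20 cells, n = 40 observations, so each expected
# count is n/N = 2 (the range where the text says G^2 is too liberal).
N, n = 20, 40
probs = np.full(N, 1.0 / N)
expected = n * probs

def x2_g2(counts):
    """Pearson X^2 and likelihood-ratio G^2 against the uniform multinomial."""
    x2 = np.sum((counts - expected) ** 2 / expected)
    pos = counts > 0                       # 0 * log 0 = 0 by convention
    g2 = 2 * np.sum(counts[pos] * np.log(counts[pos] / expected[pos]))
    return x2, g2

sims = rng.multinomial(n, probs, size=2000)
stats = np.array([x2_g2(row) for row in sims])

crit = 30.144  # chi-squared 0.95 quantile with df = N - 1 = 19
x2_rate = np.mean(stats[:, 0] > crit)
g2_rate = np.mean(stats[:, 1] > crit)
print("estimated size: X^2 =", x2_rate, " G^2 =", g2_rate)
```

Comparing the two printed rejection rates with the nominal 0.05 shows how well each approximation holds in this design.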
9.8.5 Model-Based Tests and Sparseness
From (9.3) and (9.4), the model-based statistics G²(M₀ | M₁) and X²(M₀ | M₁) depend on the data only through the fitted values, and hence only through minimal sufficient statistics for the more complex model. These statistics have null distributions converging to chi-squared as the expected values of the minimal sufficient statistics grow. For most loglinear models, these sufficient statistics refer to marginal tables. Marginal totals are more nearly normally distributed than are single cell counts. Thus, G²(M₀ | M₁) and X²(M₀ | M₁) converge to their limiting chi-squared distributions more quickly than do G²(M₀) and X²(M₀), which depend also on individual cell counts. When {μ̂ᵢ} are small but the sufficient marginal totals for M₁ are mostly in at least the range 5 to 10, the chi-squared approximation is usually adequate for model comparison statistics. Haberman (1977a) provided theoretical justification.
9.8.6 Alternative Asymptotics and Alternative Statistics
When large-sample approximations are inadequate, exact small-sample methods are an alternative. When they are infeasible, it is often possible to approximate exact distributions precisely using Monte Carlo methods (e.g., Booth and Butler 1999; Forster et al. 1996; Kim and Agresti 1997; Mehta et al. 1988).
An alternative approach uses sparse asymptotic approximations that apply when the number of cells N increases as n increases. For this approach, {μᵢ} need not increase, as they must in the usual (fixed N, n → ∞) large-sample theory. For goodness-of-fit testing of a specified multinomial, Koehler and Larntz (1980) showed that a standardized version of G² has an approximate normal distribution for very sparse tables. Koehler (1986) presented limiting normal distributions for G² for use in testing models having direct ML estimates. McCullagh (1986) reviewed ways of handling sparse tables and presented an alternative approximation for G². Zelterman (1987) gave normal approximations for X² and proposed an alternative statistic.
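For a simple null hypothesis such as a fully specified multinomial, the Monte Carlo idea is straightforward: simulate tables under H₀ and count how often the simulated G² is at least as large as the observed value. A minimal sketch in Python with NumPy (the observed counts are invented for illustration; this is far simpler than the algorithms in the papers cited above, which handle conditional and model-based exact distributions):

```python
import numpy as np

rng = np.random.default_rng(0)

def g2(counts, expected):
    """Likelihood-ratio statistic; empty cells contribute 0."""
    pos = counts > 0
    return 2 * np.sum(counts[pos] * np.log(counts[pos] / expected[pos]))

observed = np.array([4, 0, 1, 0, 3, 0, 2, 0, 0, 2])  # hypothetical sparse data
n, N = observed.sum(), observed.size
probs = np.full(N, 1.0 / N)                # H0: uniform multinomial
expected = n * probs

g2_obs = g2(observed, expected)
sim = rng.multinomial(n, probs, size=5000)
p_mc = np.mean([g2(row, expected) >= g2_obs for row in sim])
print("G^2 =", round(g2_obs, 2), " Monte Carlo P-value:", p_mc)
```

The Monte Carlo P-value avoids the chi-squared approximation entirely, at the cost of simulation error that shrinks as the number of replications grows.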
9.8.7 Adding Constants to Cells of a Contingency Table
Empty cells and sparse tables can cause problems with the existence of estimates of loglinear model parameters, estimation of odds ratios, performance of computational algorithms, and asymptotic approximations for chi-squared statistics. However, they need not be problematic. The likelihood can still be maximized, a point estimate of ∞ for an effect still usually has a finite lower bound for a likelihood-based confidence interval, and one can use small-sample inferential methods rather than asymptotic ones.
One way to obtain finite estimates of all effects and ensure convergence of fitting algorithms is to add a small constant to cell counts. Some algorithms add ½ to each cell, as Goodman (1964b, 1970, 1971a) recommended for saturated models. An example of the beneficial effect of this for a saturated model is bias reduction in estimating an odds ratio in a 2 × 2 table (Gart 1966; Gart and Zweifel 1967). Adding ½ to each cell before fitting an unsaturated model smooths the data too much, however, causing havoc with sampling distributions. This operation has too conservative an influence on estimated effects and test statistics. The effect is very severe with a large number of cells.
Even for a saturated model, adding ½ to each cell is not a panacea for all purposes. When the ordinary ML estimate of an odds ratio is infinite, the estimate after adding ½ to each cell is finite, as are the endpoints of any confidence interval. However, it is more sensible to use an upper bound of ∞ for the odds ratio, since no sample evidence suggests that the odds ratio falls below any given value.
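To make the ½-correction concrete for a single 2 × 2 table: with an empty cell the sample log odds ratio is infinite, while adding ½ to every cell yields a finite estimate and standard error. A sketch in Python (the table counts are invented for illustration, and the helper name is ours):

```python
import math

def log_odds_ratio(n11, n12, n21, n22, add=0.0):
    """Sample log odds ratio and its SE, optionally adding a constant
    to each cell (add=0.5 gives the correction discussed in the text)."""
    a, b, c, d = (x + add for x in (n11, n12, n21, n22))
    est = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return est, se

table = (0, 5, 2, 6)          # empty cell: ordinary ML estimate is -infinity
try:
    log_odds_ratio(*table)
except ValueError:            # math.log(0) raises ValueError
    print("ordinary estimate: log odds ratio = -infinity")

est, se = log_odds_ratio(*table, add=0.5)
print("with 1/2 added: estimate %.2f, SE %.2f" % (est, se))
```

The corrected endpoints are finite even though, as the text argues, an upper bound of ∞ would be more sensible here, since the data place no upper limit on the odds ratio's plausible values.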
When in doubt about the effect of sparse data, one should perform a sensitivity analysis. For example, for each possibly influential observation, delete it or move it to another cell to see how results vary with small perturbations to the data. Influence diagnostics for GLMs (Williams 1987) are also useful for this purpose. Often, some associations are not affected by empty cells and give stable results for the various analyses, whereas others that are affected are highly unstable. Use caution in drawing conclusions about an association if small changes in the data are influential.
Later chapters show ways to smooth data in a less ad hoc manner than adding arbitrary constants to cells. These include random effects models (Section 12.3) and Bayesian methods (Section 15.2).
NOTES
Section 9.1: Association Graphs and Collapsibility
9.1. Darroch et al. (1980) defined a class of graphical models that contains the family of decomposable models (see Note 8.2). For expositions on graphical models and their relevant independence graphs, which show the conditional independence structure, see also Anderson and Böckenholt (2000), Edwards (2000), Edwards and Kreiner (1983), Kreiner (1998), Lauritzen (1996), and Whittaker (1990). Whittaker (1990, Sec. 12.5) summarized connections with various definitions of collapsibility.
9.2. For I × J × 2 tables, the collapsibility conditions (Section 9.1.2) are necessary as well as sufficient (Simpson 1951; Whittemore 1978). For I × J × K tables, Ducharme and Lepage (1986) showed the conditions are necessary and sufficient for the odds ratios to remain the same no matter how the levels of Z are pooled (i.e., no matter how Z is partially collapsed).
Darroch (1962) defined a perfect table as one for which, for all i, j, k,

  Σᵢ πᵢⱼ₊πᵢ₊ₖ/πᵢ₊₊ = π₊ⱼ₊π₊₊ₖ,  Σⱼ π₊ⱼₖπᵢⱼ₊/π₊ⱼ₊ = πᵢ₊₊π₊₊ₖ,  Σₖ πᵢ₊ₖπ₊ⱼₖ/π₊₊ₖ = πᵢ₊₊π₊ⱼ₊.

For perfect tables, homogeneous association implies that

  {πᵢⱼₖ = πᵢⱼ₊πᵢ₊ₖπ₊ⱼₖ/(πᵢ₊₊π₊ⱼ₊π₊₊ₖ)}

and conditional odds ratios are identical to marginal odds ratios. Whittemore (1978) used perfect tables to illustrate that for I × J × K tables with K > 2, conditional and marginal odds ratios can be identical even when no pair of variables is conditionally independent. See also Davis (1986b).
Suppose that the difference of proportions or relative risk, computed for a binary response Y and predictor X, is the same at every level of Z. If Z is independent of X in the marginal XZ table or if Z is conditionally independent of Y given X, the measure has the same value in the marginal XY table (Shapiro 1982). Thus, for factorial designs with the same number of observations at each combination of levels, the difference of proportions and relative risk are collapsible. See also Wermuth (1987).
Section 9.2: Model Selection and Comparison
9.3. Articles on loglinear model selection include Aitkin (1979, 1980), Benedetti and Brown (1978), Brown (1976), Goodman (1970, 1971a), Wermuth (1976), and Whittaker and Aitkin (1978). When a certain model holds, G²/df has an asymptotic mean of 1. Goodman (1971a) recommended this index for comparing fits. Smaller values represent better fits.
9.4. Kullback et al. (1962) and Lancaster (1951) were among the first to partition chi-squared statistics in multiway tables. Goodman (1970) and Plackett (1962) noted difficulties with their approaches. When observations have a distribution in the natural exponential family, Simon (1973) showed that G²(M₀ | M₁) = 2Σᵢ μ̂₁ᵢ log(μ̂₁ᵢ/μ̂₀ᵢ) whenever models are linear in the natural parameters. See Lang (1996b) for partitionings for more complex models.
Section 9.4: Modeling Ordinal Associations
9.5. Goodman (1979a) stimulated research on loglinear models for ordinal data. His work extended Haberman (1974b), who expressed the XY association term with an expansion in orthogonal polynomials. For more general ordinal models for multiway tables, see Agresti (1984), Becker (1989a), Becker and Clogg (1989), and Goodman (1986).
Section 9.6: Association Models, Correlation Models, and Correspondence Analysis
9.6. Early articles on the RC model include Goodman (1979a, 1981a, b) and Andersen (1980, pp. 210–216), apparently partly motivated by earlier work of G. Rasch (see Andersen 1995). Anderson and Böckenholt (2000), Becker (1989a, b, 1990), Becker and Clogg (1989), Chuang et al. (1985), and Goodman (1985, 1986, 1996) discussed generalizations for multiway tables. Anderson (1984) discussed a related model. Anderson and Vermunt (2000) showed that RC and related association models arise when observed variables are conditionally independent given a latent variable that is conditionally normal, given the observed variables. Their work generalizes results in Lauritzen and Wermuth (1989) and discussion by Whittaker of van der Heijden et al. (1989). See also de Falguerolles et al. (1995). Clogg and Shihadeh (1994) surveyed association models and related correlation models.
9.7. Kendall and Stuart (1979, Chap. 33) surveyed basic canonical correlation methods for contingency tables. See also Williams (1952), who discussed earlier work by R. A. Fisher and others. Karl Pearson often analyzed tables by assuming an underlying bivariate normal distribution (Section 16.1). For estimating that distribution's correlation, see Becker (1989b), Goodman (1981b), Kendall and Stuart (1979, Chaps. 26 and 33), Lancaster (1969, Chap. X), the Pearson (1904) tetrachoric correlation for 2 × 2 tables, and the Lancaster and Hamdan (1964) polychoric correlation for I × J tables.
9.8. Correspondence analysis gained popularity in France under the influence of Benzécri (see, e.g., 1973). Goodman (1996) attributed its origins to H. O. Hartley, publishing under his original German name (Hirschfeld, 1935). Greenacre (1993) related it to the singular value decomposition of a matrix. For other discussion, see Escoufier (1982), Friendly (2000, Chap. 5), Goodman (1986, 1996, 2000), Michailidis and de Leeuw (1998), van der Heijden and de Leeuw (1985), and van der Heijden et al. (1989). Gabriel (1971) discussed related work on biplots.
Section 9.7: Poisson Regression for Rates
9.9. Another application using offsets is table standardization (Section 8.7.4). For analyses of rate data, see Breslow and Day (1987, Sec. 4.5), Freeman and Holford (1980), Frome (1983), and Hoem (1987). Articles dealing with grouped survival data, particularly loglinear and logit models for survival probabilities, include Aranda-Ordaz (1983), Larson (1984), Prentice and Gloeckler (1978), Schluchter and Jackson (1989), Stokes et al. (2000, Chap. 17), and Thompson (1977). Aitkin and Clayton (1980) discussed exponential survival models and also presented similar models having hazard functions for Weibull or extreme-value survival distributions. Log likelihood (9.19) actually applies only for noninformative censoring mechanisms. It does not make sense if subjects tend to withdraw from the study because of factors related to it, perhaps because of health effects related to one of the treatments.
9.10. Lindsey and Mersch (1992) showed a clever way to use loglinear models to fit exponential family distributions f(y; θ) of form (4.14) with the dispersion parameter known. One breaks the response scale into intervals {(yₖ − Δₖ/2, yₖ + Δₖ/2)}. Counts in those intervals follow a multinomial with probabilities approximated by {f(yₖ; θ)Δₖ}. The log expected count approximations are linear in θ with an offset.
PROBLEMS
Applications
9.1 Use odds ratios in Table 8.3 to illustrate the collapsibility conditions.
a. For (A, C, M), all conditional odds ratios equal 1.0. Explain why all reported marginal odds ratios equal 1.0.
b. For (AC, M), explain why (i) all conditional odds ratios are the same as the marginal odds ratios, and (ii) all μ̂_{ac+} = n_{ac+}.
c. For (AM, CM), explain why (i) the AC conditional odds ratios of 1.0 need not be the same as the AC marginal odds ratio, (ii) the AM and CM conditional odds ratios are the same as the marginal odds ratios, and (iii) all μ̂_{a+m} = n_{a+m} and μ̂_{+cm} = n_{+cm}.
d. For (AC, AM, CM), explain why (i) no conditional odds ratios need be the same as the related marginal odds ratios, and (ii) the fitted marginal odds ratios must equal the sample marginal odds ratios.
9.2 Table 9.17 summarizes a study with variables age of mother (A), length of gestation (G) in days, infant survival (I), and number of cigarettes smoked per day during the prenatal period (S). Treat G and I as response variables and A and S as explanatory.
a. Explain why a loglinear model should include the AS term.
b. Fit the models (AGIS), (AGI, AIS, AGS, GIS), (AG, AI, AS, GI, GS, IS), and (AS, G, I). Identify a subset of models nested between two of these that may fit well. Select one such model.
c. Use (i) forward selection, and (ii) backward elimination to build a model. Compare the results of the strategies, and interpret the models chosen.
9.3 Refer to Table 2.13. Consider the nested set {(DVP), (DP, VP, DV), (VP, DV), (P, DV), (D, V, P)}. Partition chi-squared to compare the four pairs, ensuring that the overall type I error probability for the four comparisons does not exceed α = 0.10. Which model would you select, using a backward comparison starting with (DVP)? Show that the final
TABLE 9.17 Data for Problem 9.2

                                       Infant Survival
Age     Smoking   Gestation          No          Yes
< 30    < 5       ≤ 260              50          315
                  > 260              24         4012
        5+        ≤ 260               9           40
                  > 260               6          459
30+     < 5       ≤ 260              41          147
                  > 260              14         1594
        5+        ≤ 260               4           11
                  > 260               1          124

Source: N. Wermuth, pp. 279–295 in Proc. 9th International Biometrics Conference, Vol. 1 (1976). Reprinted with permission from the Biometric Society.
model selected depends on the choice of nested set, by repeating the analysis with {(DP, VP, DV), (DP, DV), (P, DV), (D, V, P)}.
9.4 Consider the loglinear model selection for Table 6.3.
a. Why is it not sensible to consider models omitting the GM term?
b. Using forward selection starting with (GM, E, P), show that model (GM, GP, EG, EMP) seems reasonable.
c. Using backward elimination, show that (GM, GP, EMP) or (GM, GP, EG, EMP) seems reasonable.
d. The EMP interaction seems vital. To describe it, show that the effect of extramarital sex on divorce is greater for subjects who had no premarital sex.
e. Use residuals to describe the lack of fit of model (GM, EMP).
9.5 For model (AC, AM, CM) with Table 8.3, the standardized Pearson residual in each cell equals ±0.63. Interpret, and explain why each one has the same absolute value. By contrast, model (AM, CM) has standardized Pearson residual ±3.70 in each cell where M = yes (e.g., +3.70 when A = C = yes) and ±12.80 in each cell where M = no (e.g., +12.80 when A = C = yes). Interpret.
9.6 Refer to Table 8.8. Conduct a residual analysis with the model of no three-factor interaction to describe the nature of the interaction.
9.7 Perform a residual analysis for the independence model with Table 3.2. Explain why it suggests that the linear-by-linear association model may fit better. Fit it, compare to the independence model, and interpret.
9.8 Refer to Problem 9.7.
a. Using standardized scores, find β̂. Comment on the strength of association.
b. Fit a model in which job satisfaction scores are parameters. Interpret the estimated scores, and compare the fit to the L × L model.
9.9 Refer to Table 9.3.
a. For the linear-by-linear association model, construct a 95% confidence interval for the odds ratio using the four corner cells. Interpret.
b. Fit the column effects model. Compare estimated column scores to the equal-interval scores in part (a). Test that the true column scores are equal-interval, given that the model holds. Interpret. Construct a 95% confidence interval for the odds ratio using the four corner cells. Compare to part (a).
9.10 A weak local association may be substantively important for nonlocal categories. Illustrate with the L × L model for Table 9.9, showing how the estimated odds ratio for the four corner cells compares to the estimated local odds ratio.
9.11 Refer to Table 7.8. Fit the homogeneous linear-by-linear association model, and interpret. Test conditional independence between income (I) and job satisfaction (S), controlling for gender (G), using (a) that model, and (b) model (IS, IG, SG). Explain why the results are so different.
9.12 Fit the RC model to Table 9.3. Interpret the estimated scores. Does it fit better than the uniform association model?
9.13 Replicate the results in Section 9.6 for the correlation and correspondence models with Table 9.9.
9.14 One hundred leukemia patients were randomly assigned to two treatments. During the study, 10 subjects on treatment A died and 18 subjects on treatment B died. The total time at risk was 170.4 years for treatment A and 147.3 years for treatment B. Test whether the two treatments have the same death rates. Compare the rates with a confidence interval.
9.15 For Table 9.11, fit a model in which the death rate depends only on age. Interpret the age effect.
9.16 Consider model (9.18). What is the effect on the model parameter estimates, their standard errors, and the goodness-of-fit statistics when (a) the times at risk are doubled, but the numbers of deaths stay the same; (b) the times at risk stay the same, but the numbers of deaths double; and (c) the times at risk and the numbers of deaths both double?
9.17 Consider Table 9.13. Explain how one could analyze whether the hazard depends on time.
9.18 An article by W. A. Ray et al. (Amer. J. Epidemiol. 132: 873–884, 1992) dealt with motor vehicle accident rates for 16,262 subjects aged 65–84 years, with data on each for up to 4 years. In 17.3 thousand years of observation, the women had 175 accidents in which an injury occurred. In 21.4 thousand years, men had 320 injurious accidents.
a. Find a 95% confidence interval for the true overall rate of injurious accidents.
b. Using a model, compare the rates for men and women.
9.19 A table at the text's Web site (www.stat.ufl.edu/~aa/cda/cda.html) shows the number of train miles (in millions) and the number of collisions involving British Rail passenger trains between 1970 and 1984. A Poisson model assuming a constant log rate α over the 14-year period has α̂ = −4.177 (SE = 0.1325) and X² = 14.8 (df = 13). Interpret.
9.20 Table 9.18 lists total attendance (in thousands) and the total number of arrests in the 1987–1988 season for soccer teams in the Second Division of the British football league. Let Y = number of arrests for a team, and let t = total attendance. Explain why the model E(Y) = μt
TABLE 9.18 Data for Problem 9.20

Team               Attendance (thousands)   Arrests
Aston Villa                404                308
Bradford City              286                197
Leeds United               443                184
Bournemouth                169                149
West Brom                  222                132
Huddersfield               150                126
Middlesbro                 321                110
Birmingham                 189                101
Ipswich Town               258                 99
Leicester City             223                 81
Blackburn                  211                 79
Crystal Palace             215                 78
Shrewsbury                 108                 68
Swindon Town               210                 67
Sheffield Utd.             224                 60
Stoke City                 211                 57
Barnsley                   168                 55
Millwall                   185                 44
Hull City                  158                 38
Manchester City            429                 35
Plymouth                   226                 29
Reading                    150                 20
Oldham                     148                 19

Source: The Independent (London), Dec. 21, 1988. Thanks to P. M. E. Altham for showing me these data.
might be plausible. Assuming Poisson sampling, fit it and interpret. Plot arrests against attendance, and overlay the prediction equation. Use residuals to identify teams that had arrest counts much different from expected.
TABLE 9.19 Data for Problem 9.21

              Person-Years            Coronary Deaths
Age       Nonsmokers   Smokers    Nonsmokers   Smokers
35–44       18,793      52,407        2           32
45–54       10,673      43,248       12          104
55–64        5,710      28,612       28          206
65–74        2,585      12,663       28          186
75–84        1,462       5,317       31          102

Source: R. Doll and A. B. Hill, Natl. Cancer Inst. Monogr. 19: 205–268 (1966). See also N. R. Breslow in A Celebration of Statistics, ed. A. C. Atkinson and S. E. Fienberg (New York: Springer-Verlag, 1985).
9.21 Table 9.19 is based on a study with British doctors.
a. For each age, find the sample coronary death rates per 1000 person-years for nonsmokers and smokers. To compare them, take their ratio and describe its dependence on age.
b. Fit a main-effects model for the log rates having four parameters for age and one for smoking. In discussing lack of fit, show that this model assumes a constant ratio of nonsmokers' to smokers' coronary death rates over age.
c. From part (a), explain why it is sensible to add a quantitative interaction of age and smoking. For this model, show that the log ratio of coronary death rates changes linearly with age. Assign scores to age, fit the model, and interpret.
9.22 Analyze Table 9.9 using ordinal logit models. Interpret, and discuss advantages/disadvantages compared to loglinear analyses.
9.23 Refer to Problem 8.6. Analyze these data, using methods of this chapter.
Theory and Methods
9.24 In a 2 × 2 × K table, the true XY conditional odds ratios are identical, but different from the XY marginal odds ratio. Is there three-factor interaction? Is Z conditionally independent of X or Y? Explain.
9.25 Consider loglinear model (WX, XY, YZ). Explain why W and Z are independent given X alone or given Y alone or given both X and Y. When are W and Y conditionally independent? When are X and Z conditionally independent?
9.26 Suppose that loglinear model (XY, XZ) holds.
a. Find μᵢⱼ₊ and log μᵢⱼ₊. Show that the loglinear model for the XY marginal table has the same association parameters as {λᵢⱼ^XY} in (XY, XZ). Deduce that odds ratios are the same in the XY marginal table as in the partial tables. Using an analogous result for model (XY, YZ), deduce the collapsibility conditions in Section 9.1.2.
b. Calculate log μᵢⱼ₊ for model (XY, XZ, YZ), and explain why marginal associations need not equal conditional associations.
9.27 For a four-way table, is the WX conditional association the same as the WX marginal association for the loglinear model (a) (WX, XYZ)? and (b) (WX, WZ, XY, YZ)? Why?
9.28 Loglinear model M₀ is a special case of loglinear model M₁.
a. Explain why the fitted values for the two models are identical in the sufficient marginal distributions for M₀.
b. Haberman (1974a) showed that when {μᵢ} satisfy any model that is a special case of M₀, Σᵢ μ̂₁ᵢ log μᵢ = Σᵢ μ̂₀ᵢ log μᵢ. Thus, μ̂₀ is the orthogonal projection of μ̂₁ onto the linear manifold of {log μ} satisfying M₀. Using this, show that G²(M₀) − G²(M₁) = 2Σᵢ μ̂₁ᵢ log(μ̂₁ᵢ/μ̂₀ᵢ).
9.29 Refer to Section 9.2.4. Show that G²(Mⱼ | Mⱼ₋₁) equals G² for independence in the 2 × 2 table comparing columns 1 through j − 1 with column j.
9.30 For T categorical variables X₁, …, X_T, explain why:
a. G²(X₁, X₂, …, X_T) = G²(X₁, X₂) + G²(X₁X₂, X₃) + ⋯ + G²(X₁X₂⋯X_{T−1}, X_T).
b. G²(X₁⋯X_{T−1}, X_T) = G²(X₁, X_T) + G²(X₁X_T, X₁X₂) + ⋯ + G²(X₁X₂⋯X_{T−1}, X₁X₂⋯X_{T−2}X_T).
9.31 For I × 2 contingency tables, explain why the linear-by-linear association model is equivalent to the linear logit model (5.5).
9.32 Consider the L × L model (9.6) with {vⱼ = j} replaced by {vⱼ = 2j}. Explain why β̂ is halved but {μ̂ᵢⱼ}, {θ̂ᵢⱼ}, and G² are unchanged.
9.33 Lehmann (1966) defined (X, Y) to be positively likelihood-ratio dependent if their joint density satisfies f(x₁, y₁)f(x₂, y₂) ≥ f(x₁, y₂)f(x₂, y₁) whenever x₁ < x₂ and y₁ < y₂. Then, the conditional distribution of Y (X) stochastically increases as X (Y) increases (Goodman 1981a).
a. For the L × L model, show that the conditional distributions of Y and of X are stochastically ordered. What is their nature if β > 0?
b. In row effects model (9.8), if μᵢ > μₕ, show that the conditional distribution of Y is stochastically higher in row i than in row h. Explain why μ₁ = ⋯ = μ_I is equivalent to the equality of the I conditional distributions within rows.
9.34 Yule (1906) defined a table to be isotropic if an ordering of rows and of columns exists such that the local log odds ratios are all nonnegative [see also Goodman (1981a)].
a. Show that a table is isotropic if it satisfies (i) the linear-by-linear association model, (ii) the row effects model, or (iii) the RC model.
b. Explain why a table that is isotropic for a certain ordering is still isotropic when adjacent rows or columns are combined.
9.35 Consider the log likelihood for the linear-by-linear association model.
a. Differentiating with respect to β and evaluating at β = 0 and null estimates of parameters, show that the score function is proportional to
Σᵢ Σⱼ uᵢvⱼ(pᵢⱼ − pᵢ₊p₊ⱼ).
b. Use the delta method to show that its null SE is
{[Σᵢ uᵢ²pᵢ₊ − (Σᵢ uᵢpᵢ₊)²][Σⱼ vⱼ²p₊ⱼ − (Σⱼ vⱼp₊ⱼ)²]/n}^{1/2}.
c. Construct a score statistic for testing independence. Show that it is essentially the correlation test (3.15). [Hirotsu (1982) discussed a family of score tests for ordered alternatives.]
9.36 Given the parenthetical result in Problem 7.33, show that if cumulative logit model (7.24) holds and |β| is small, the linear-by-linear association model should fit well with row scores {xᵢ} and "ridit" column scores {vⱼ = [P(Y ≤ j − 1) + P(Y ≤ j)]/2}, with its β parameter about twice β for model (7.24).
9.37 Consider the row effects model (9.8).
a. Show that no loss of generality occurs in letting λ_I^X = λ_J^Y = μ_I = 0.
b. Show that minimal sufficient statistics are {nᵢ₊}, {n₊ⱼ}, and {Σⱼ vⱼnᵢⱼ, i = 1, …, I}, and derive the likelihood equations.
9.38 Show that the column effects model corresponds to a baseline-category
logit model for Y that is linear in scores for X, with slope depending
on the paired response categories.
9.39 Refer to the homogeneous linear-by-linear association model (9.10).
a. Show that the likelihood equations are, for all i, j, and k,
μ̂ᵢ₊ₖ = nᵢ₊ₖ,  μ̂₊ⱼₖ = n₊ⱼₖ,  Σᵢ Σⱼ uᵢvⱼ μ̂ᵢⱼ₊ = Σᵢ Σⱼ uᵢvⱼ nᵢⱼ₊.
b. Show that residual df = K(I − 1)(J − 1) − 1.
c. When I = J = 2, explain why it is equivalent to (XY, XZ, YZ).
d. Show how the last likelihood equation above changes for heterogeneous linear-by-linear XY association (9.11). Explain why, in each stratum, the fitted XY correlation equals the sample correlation.
9.40 When model (XY, XZ, YZ) is inadequate and variables are ordinal, useful models are nested between it and (XYZ). For ordered scores {uᵢ}, {vⱼ}, and {wₖ}, consider
log μᵢⱼₖ = λ + λᵢ^X + λⱼ^Y + λₖ^Z + λᵢⱼ^XY + λᵢₖ^XZ + λⱼₖ^YZ + βuᵢvⱼwₖ.  (9.22)
a. Define τᵢⱼₖ = θᵢⱼ₍ₖ₊₁₎/θᵢⱼ₍ₖ₎ = θᵢ₍ⱼ₊₁₎ₖ/θᵢ₍ⱼ₎ₖ = θ₍ᵢ₊₁₎ⱼₖ/θ₍ᵢ₎ⱼₖ. For unit-spaced scores, show that log τᵢⱼₖ = β. Goodman (1979a) called this the uniform interaction model.
b. Show that log odds ratios for any two variables change linearly across levels of the third variable.
c. Show that the likelihood equations are those for model (XY, XZ, YZ) plus
Σᵢ Σⱼ Σₖ uᵢvⱼwₖ μ̂ᵢⱼₖ = Σᵢ Σⱼ Σₖ uᵢvⱼwₖ nᵢⱼₖ.
d. Explain why model (9.12) is a special case of model (9.22).
9.41 Construct a model having general XZ and YZ associations, but row effects for the XY association that are (a) homogeneous, and (b) heterogeneous across levels of Z. Interpret.
9.42 Explain why the RC model requires scale constraints for the scores. Show that the residual df = (I − 2)(J − 2). Find and interpret the likelihood equations. Explain why the fit is invariant to category orderings.
9.43 Refer to correlation model (9.16) (Goodman 1985, 1986).
a. Show that λ is the correlation between the scores.
b. If this model holds, show that Σᵢ μᵢ(πᵢⱼ/π₊ⱼ) = λνⱼ and Σⱼ νⱼ(πᵢⱼ/πᵢ₊) = λμᵢ. Interpret.
c. With λ close to zero, show that log(πᵢⱼ) has form γᵢ + δⱼ + λμᵢνⱼ + o(λ), where o(λ)/λ → 0 as λ → 0. Thus, when the association is weak, the correlation model is similar to the linear-by-linear association model with β = λ and scores {uᵢ = μᵢ} and {vⱼ = νⱼ}.
9.44 For the general canonical correlation model, show that Σₖ λₖ² = Σᵢ Σⱼ (πᵢⱼ − πᵢ₊π₊ⱼ)²/(πᵢ₊π₊ⱼ). Thus, the squared correlations partition a dependence measure that is the noncentrality (6.8) of X² for the independence model with n = 1. [Goodman (1986) stated other partitionings.]
9.45 Refer to model (9.18). Given the times at risk {tᵢⱼ}, show that sufficient statistics are {nᵢ₊} and {n₊ⱼ}.
9.46 Refer to Section 9.7.3. Let T = Σᵢ tᵢ and W = Σᵢ wᵢ. Suppose that survival times have a negative exponential distribution with parameter λ.
a. Using log likelihood (9.19), show that λ̂ = W/T.
b. Conditional on T, show that W has a Poisson distribution with mean λT. Using the Poisson likelihood, show that λ̂ = W/T.
9.47 Show that ML estimates do not exist for Table 9.15. [Hint: Haberman (1973b; 1974a, p. 398): If μ̂₁₁₁ = c > 0, then the marginal constraints that the model satisfies imply that μ̂₂₂₂ = −c.]
9.48 For a loglinear model, explain heuristically why the ML estimate of a parameter is infinite when its sufficient statistic takes its maximum or minimum possible value, for given values of other sufficient statistics.
CHAPTER 10
Models for Matched Pairs
We next introduce methods for comparing categorical responses for two
samples when each observation in one sample pairs with an observation in
the other. Such matched-pairs data commonly occur in studies with repeated
measurement of subjects, such as longitudinal studies that observe subjects
over time. Because of the matching, the responses in the two samples are
statistically dependent. This is the first of four chapters on special methods
for handling such dependence.
Table 10.1 illustrates matched-pairs data. For a poll of a random sample
of 1600 voting-age British citizens, 944 indicated approval of the Prime
Minister’s performance in office. Six months later, of these same 1600 people,
880 indicated approval. The two cells with identical row and column response
form the main diagonal of the table. These subjects had the same opinion at
both surveys. They compose most of the sample, since relatively few people
changed opinion. A strong association exists between opinions six months apart, the sample odds ratio being (794 × 570)/(150 × 86) = 35.1.
For matched pairs with a categorical response, a two-way contingency
table with the same row and column categories summarizes the data. The
table is square. In this chapter we present analyses of square tables. In
Section 10.1 we describe methods for comparing proportions with a binary
response. In Section 10.2 we discuss logistic regression analyses of such data.
For multicategory responses, Section 10.3 covers nominal and ordinal logit
TABLE 10.1 Rating of Performance of Prime Minister

                          Second Survey
First Survey       Approve    Disapprove    Total
Approve              794         150          944
Disapprove            86         570          656
Total                880         720         1600
models for comparing the response distributions. In Section 10.4 we introduce loglinear models for square tables. In Sections 10.5 and 10.6 we discuss
two matched-pairs applications for which models for square tables are useful:
analyzing agreement between two observers who rate a common set of
subjects, and evaluating preferences between treatments based on pairwise comparisons.
Section 10.7 extends the models of Sections 10.2 through 10.4 to multiway
tables that result from matched sets of observations. In Chapter 11 we extend
them further to incorporate explanatory variables.
10.1 COMPARING DEPENDENT PROPORTIONS
For each of n matched pairs, let π_ab denote the probability of outcome a for the first observation and outcome b for the second. Let n_ab count the number of such pairs, with p_ab = n_ab/n the sample proportion. We treat {n_ab} as a sample from a multinomial (n; {π_ab}) distribution. Then p_a+ is the proportion in category a for observation 1, and p_+a is the corresponding proportion for observation 2. We compare samples by comparing marginal proportions {p_a+} with {p_+a}. With matched samples, these proportions are correlated, and methods for independent samples are inappropriate.

In this section we consider binary outcomes. When π_1+ = π_+1, then π_2+ = π_+2 also, and there is marginal homogeneity. Since

    π_1+ − π_+1 = (π_11 + π_12) − (π_11 + π_21) = π_12 − π_21,

marginal homogeneity in 2 × 2 tables is equivalent to π_12 = π_21. The table then shows symmetry across the main diagonal.
10.1.1 Inference for Dependent Proportions
One comparison of the marginal distributions uses δ = π_+1 − π_1+. Let

    d = p_+1 − p_1+ = p_2+ − p_+2.

From formula (1.3) for multinomial covariances, cov(p_+1, p_1+) = cov(p_11 + p_21, p_11 + p_12) simplifies to (π_11 π_22 − π_12 π_21)/n. Thus,

    var(√n d) = π_1+(1 − π_1+) + π_+1(1 − π_+1) − 2(π_11 π_22 − π_12 π_21).    (10.1)

For large samples, d has approximately a normal sampling distribution. A confidence interval for δ = π_+1 − π_1+ is then

    d ± z_{α/2} σ̂(d),

where

    σ̂²(d) = [p_1+(1 − p_1+) + p_+1(1 − p_+1) − 2(p_11 p_22 − p_12 p_21)]/n
           = [(p_12 + p_21) − (p_12 − p_21)²]/n,    (10.2)

with the second formula following after substitution and some algebra. Inverting the score test of H_0: δ = δ_0 is more complex but provides coverage probabilities closer to the nominal values (Tango 1998), as does adding 1 to each cell before computing d and σ̂(d).
The hypothesis of marginal homogeneity is H_0: π_1+ = π_+1 (i.e., δ = 0). The ratio z = d/σ̂(d) or its square is a Wald test statistic. Under H_0, an alternative estimated variance is

    σ̂_0²(d) = (p_12 + p_21)/n = (n_12 + n_21)/n².    (10.3)

The score test statistic z_0 = d/σ̂_0(d) simplifies to

    z_0 = (n_21 − n_12)/(n_21 + n_12)^{1/2}.    (10.4)

The square of z_0 is a chi-squared statistic with df = 1. The test using it is called McNemar's test (McNemar 1947).
The McNemar statistic depends only on cases classified in different categories for the two observations. The n_11 + n_22 cases on the main diagonal are irrelevant to inference about whether π_1+ and π_+1 differ. This may seem surprising, but all cases contribute to inference about how much π_1+ and π_+1 differ: for instance, to estimating δ and the standard error.
10.1.2 Prime Minister Approval Rating Example
For Table 10.1, the sample proportions of approval of the prime minister's performance are p_1+ = 944/1600 = 0.59 for the first survey and p_+1 = 880/1600 = 0.55 for the second. Using (10.2), a 95% confidence interval for π_+1 − π_1+ is (0.55 − 0.59) ± 1.96(0.0095), or (−0.06, −0.02). The approval rating appears to have dropped between 2 and 6 percentage points.

For testing marginal homogeneity, the test statistic (10.4) using the null variance is

    z_0 = (86 − 150)/(86 + 150)^{1/2} = −4.17.

It shows strong evidence of a drop in the approval rating.
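The interval and test statistic above can be reproduced in a few lines; a sketch in Python using the counts of Table 10.1:

```python
import math

# Table 10.1: rows = first survey, columns = second survey
n11, n12, n21, n22 = 794, 150, 86, 570
n = n11 + n12 + n21 + n22  # 1600

p12, p21 = n12 / n, n21 / n
d = (n11 + n21) / n - (n11 + n12) / n  # p_+1 - p_1+

# Estimated variance (10.2) and 95% Wald confidence interval for delta
se = math.sqrt(((p12 + p21) - (p12 - p21) ** 2) / n)
ci = (d - 1.96 * se, d + 1.96 * se)

# McNemar score statistic (10.4), using only the off-diagonal counts
z0 = (n21 - n12) / math.sqrt(n21 + n12)
```

Here se ≈ 0.0095 and z_0 ≈ −4.17, matching the text.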
10.1.3 Increased Precision with Dependent Samples
The final term of formula (10.1), based on cov(p_+1, p_1+), reflects the dependence between the marginal proportions. By contrast, for independent samples of size n each to estimate binomial probabilities π_1 and π_2, the covariance of the sample proportions is zero, and

    var[√n (difference of sample proportions)] = π_1(1 − π_1) + π_2(1 − π_2).

Dependent samples usually exhibit positive dependence, with log θ = log[π_11 π_22/π_12 π_21] > 0; that is, π_11 π_22 > π_12 π_21. From (10.1), positive dependence implies that var(d) is smaller than when the samples are independent. A study design using dependent samples can help improve the precision of statistical inferences for within-subject effects. (By contrast, standard errors tend to be larger, per given number of observations, for between-subject group comparisons.) The improvement is substantial when the samples are highly correlated. To illustrate, Table 10.1 with dependent samples of size 1600 each has a standard error of 0.0095 for d = 0.55 − 0.59. The two observations have strong association, the sample odds ratio being 35.1. Independent samples of size 1600 each with π̂_1 − π̂_2 = 0.55 − 0.59 have a standard error of 0.0175 for the difference, nearly twice as large.
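The near-doubling of the standard error under independence is a one-line comparison; a short sketch:

```python
import math

n = 1600
p1, p2 = 0.59, 0.55  # marginal approval proportions from Table 10.1

# Standard error if the two samples were independent
se_indep = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)

# Standard error (10.2) for the actual dependent samples
p12, p21 = 150 / n, 86 / n
se_dep = math.sqrt(((p12 + p21) - (p12 - p21) ** 2) / n)
```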
10.1.4 Small-Sample Test Comparing Matched Proportions
The null hypothesis of marginal homogeneity for binary matched pairs is, equivalently, H_0: π_12 = π_21 or π_21/(π_21 + π_12) = 0.5. For small samples, an exact test conditions on n* = n_21 + n_12 (Mosteller 1952). Under H_0, n_21 has a binomial (n*, ½) distribution, for which E(n_21) = ½n*. The P-value for the test is a binomial tail probability.

For instance, for Table 10.1, consider H_a: π_+1 < π_1+, or equivalently, H_a: π_21 < π_12. Since n* = 86 + 150 = 236, the reference distribution is bin(236, ½). The P-value is the probability of at least 150 successes out of 236 trials, which equals 0.00002. The P-value for H_a: π_+1 ≠ π_1+ doubles this.
When n* > 10, the reference binomial distribution is approximately normal with mean ½n* and variance n*(½)(½). The standardized normal test statistic equals

    z = (n_21 − ½n*)/[n*(½)(½)]^{1/2} = (n_21 − n_12)/(n_21 + n_12)^{1/2}.

This is identical to the McNemar statistic (10.4).
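The exact binomial tail probability needs only the standard library; a sketch for the Table 10.1 counts:

```python
from math import comb

# Discordant pairs from Table 10.1: n21 = 86, n12 = 150
n_star = 86 + 150

# Exact one-sided P-value: P(at least 150 "successes") for bin(236, 1/2)
p_value = sum(comb(n_star, j) for j in range(150, n_star + 1)) / 2 ** n_star
p_two_sided = 2 * p_value
```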
10.1.5 Connection between McNemar and Cochran–Mantel–Haenszel Tests
An alternative representation of binary responses for n matched pairs presents the data in n partial tables, one 2 × 2 table for each pair. It has columns that are the two possible outcomes for each measurement. Row 1 shows the outcome of the first observation, and row 2 shows the outcome of the second.

Table 10.2 shows the four possible partial tables in this representation. For Table 10.1, the full three-way table has 1600 partial tables: 794 look like the one for subject 1 (i.e., "approve" at both surveys), 570 who disapproved at each survey have tables like the one for subject 2, 86 have tables like the one for subject 3, and 150 have tables like the one for subject 4. The 1600 subjects from Table 10.1 provide 3200 observations in a 2 × 2 × 1600 contingency table. Collapsing this table over the 1600 partial tables yields a 2 × 2 table with first row equal to (944, 656) and second row equal to (880, 720). These are the total numbers of (approve, disapprove) responses for the two surveys. They form the marginal counts in Table 10.1.
For each subject, suppose that the probability of approval is identical in each survey. Then, conditional independence exists between the opinion outcome and the survey time, controlling for subject. The probability of approval is then also the same for each survey in the marginal table collapsed over the subjects. But this implies that the true probabilities for Table 10.1 satisfy marginal homogeneity. Thus, a test of conditional independence in the 2 × 2 × 1600 table provides a test of marginal homogeneity for Table 10.1.

To test conditional independence in this three-way table, one can use the Cochran–Mantel–Haenszel (CMH) statistic (6.6). That chi-squared statistic is algebraically identical to the square of McNemar's statistic, namely (n_21 − n_12)²/(n_12 + n_21) for tables of the form of Table 10.1. McNemar's test is a special case of the CMH test applied to the binary responses of n matched pairs displayed in n partial tables. This connection is not helpful for computational purposes, since the McNemar statistic is simple. But it does suggest
TABLE 10.2 Representation of Four Types of Matched Pairs Contributing to Counts in Table 10.1

                              Response
Subject     Survey      Approve    Disapprove
1           First          1           0
            Second         1           0
2           First          0           1
            Second         0           1
3           First          0           1
            Second         1           0
4           First          1           0
            Second         0           1
ways of handling more complex matched data. With several outcome categories or several observations, one can test marginal homogeneity by applying the generalized CMH tests (Section 7.5) using a single stratum for each subject, with each row representing a particular observation (Darroch 1981; Mantel and Byar 1978).

Coming sections refer to the 2 × 2 × n table representation of matched-pairs data as the subject-specific table. They refer to the 2 × 2 table of the form of Table 10.1 as the population-averaged table, since its margins provide direct estimates of population marginal proportions.
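This equivalence is easy to verify numerically; a sketch that accumulates the CMH components of (6.6), without continuity correction, over the four pair types of Table 10.2 (counts from Table 10.1):

```python
# Counts of the four pair types: (first outcome, second outcome) -> count
pair_counts = {(1, 1): 794, (0, 0): 570, (0, 1): 86, (1, 0): 150}

num, var = 0.0, 0.0
for (y1, y2), count in pair_counts.items():
    # Partial 2x2 table: rows = surveys 1 and 2, columns = (approve, disapprove)
    n11 = y1                      # approvals in row 1
    col1 = y1 + y2                # approve-column total; row totals are both 1
    e = col1 / 2                  # E(n11k) = row1 * col1 / total
    v = col1 * (2 - col1) / 4     # hypergeometric variance (total - 1 = 1)
    num += count * (n11 - e)
    var += count * v

cmh = num ** 2 / var
mcnemar_sq = (86 - 150) ** 2 / (86 + 150)
```

Only the discordant pair types contribute: concordant tables have zero hypergeometric variance, mirroring the earlier observation that the main-diagonal counts are irrelevant to the test.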
10.2 CONDITIONAL LOGISTIC REGRESSION FOR BINARY MATCHED PAIRS
In Section 6.7 we introduced conditional logistic regression for eliminating
nuisance parameters from an analysis. We now study this for binary
matched-pairs data. The models refer to subject-specific tables.
10.2.1 Marginal versus Conditional Models for Matched Pairs
The analyses of Section 10.1 occur in the context of models. Let (Y_1, Y_2) denote the pair of observations for a randomly selected subject, where a "1" outcome denotes category 1 (success) and "0" denotes category 2. The difference δ = P(Y_2 = 1) − P(Y_1 = 1) between marginal probabilities occurs as a parameter in

    P(Y_t = 1) = α + δx_t,    (10.5)

where x_1 = 0 and x_2 = 1; then P(Y_1 = 1) = α and P(Y_2 = 1) = α + δ. Alternatively, the logit link yields

    logit[P(Y_t = 1)] = α + βx_t.    (10.6)

The parameter β is a log odds ratio formed with the marginal distributions.

Models (10.5) and (10.6) are marginal models: They focus on the marginal distributions of responses for the two observations. For instance, in terms of the population-averaged table, the ML estimate of β in (10.6) is the log odds ratio of marginal proportions, β̂ = log[p_+1 p_2+/p_+2 p_1+]. See Problem 10.26 for its asymptotic variance.

By contrast, the subject-specific table having strata like Table 10.2 implicitly allows probabilities to vary by subject. Let (Y_i1, Y_i2) denote the ith pair of observations, i = 1, . . . , n. A model then has the form

    link[P(Y_it = 1)] = α_i + βx_t.    (10.7)
This is called a conditional model, since the effect β is defined conditional on the subject. Its estimate describes conditional association for the three-way table stratified by subject. The effect is subject-specific, since it is defined at the subject level. By contrast, the effects in marginal models (10.5) and (10.6) are population-averaged, since they refer to averaging over the entire population rather than to individual subjects.

For the identity link, subject-specific and population-averaged effects are identical. For instance, for the conditional model (10.7) with identity link, β = P(Y_i2 = 1) − P(Y_i1 = 1) for all i, and averaging this over subjects in the population equates β to the δ parameter in model (10.5). For nonlinear links, however, the effects differ. For model (10.7) with the logit link, for instance,

    P(Y_it = 1) = exp(α_i + βx_t)/[1 + exp(α_i + βx_t)].

The average of this for the population does not have the form exp(α + βx_t)/[1 + exp(α + βx_t)] corresponding to the marginal logit model (10.6). We now take a closer look at the conditional model with logit link.
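The gap between subject-specific and population-averaged logit effects is easy to see by simulation. A sketch (the normal distribution for the α_i and the values β = 1, σ = 2 are illustrative assumptions, not from the text):

```python
import math
import random

# Illustrative values (assumed): subject-specific effect beta = 1 and
# heterogeneity sigma = 2 for alpha_i ~ N(0, sigma^2)
beta, sigma = 1.0, 2.0
random.seed(1)

def expit(u):
    return 1 / (1 + math.exp(-u))

alphas = [random.gauss(0, sigma) for _ in range(200_000)]
p1 = sum(expit(a) for a in alphas) / len(alphas)          # average P(Y_i1 = 1)
p2 = sum(expit(a + beta) for a in alphas) / len(alphas)   # average P(Y_i2 = 1)

# Log odds ratio of the *averaged* probabilities: attenuated relative to beta
beta_marginal = math.log(p2 / (1 - p2)) - math.log(p1 / (1 - p1))
```

The population-averaged log odds ratio comes out noticeably smaller than the subject-specific β, illustrating why the average of subject-specific logistic curves is not itself a logistic curve with the same effect.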
10.2.2 A Logit Model with Subject-Specific Probabilities
Model (10.7) differs from models in earlier chapters by permitting subjects to have their own probability distributions. Cox (1958b, 1970) and Rasch (1961) presented this model with logit link. This model for Y_it, observation t for subject i, is

    logit[P(Y_it = 1)] = α_i + βx_t,    (10.8)

where x_1 = 0 and x_2 = 1. Although permitting subject-specific distributions, it assumes a common effect β. For subject i,

    P(Y_i1 = 1) = exp(α_i)/[1 + exp(α_i)],    P(Y_i2 = 1) = exp(α_i + β)/[1 + exp(α_i + β)].

The parameter β compares the response distributions. For each subject, the odds of success for observation 2 are exp(β) times the odds for observation 1.
Given the parameters, with model (10.8) one normally assumes independence of responses for different subjects and for the two observations on the same subject. However, averaged over all subjects, the responses are nonnegatively associated. Suppose that |β| is small compared to |α_i|. A subject with a large positive α_i has high P(Y_it = 1) for each t and is likely to have a success each time; a subject with a large negative α_i has low P(Y_it = 1) for each t and is likely to have a failure each time. The greater the variability in {α_i}, the greater the overall positive association between responses, successes (failures) for observation 1 tending to occur with successes (failures) for observation 2. This is true for any β. The positive association reflects the shared value of α_i for each observation in a pair. No association occurs only when the {α_i} are identical. Thus, the model does account for the dependence in matched pairs. Fitting it takes into account nonnegative association through the structure of the model.
For this model, the large number of {α_i} causes difficulties with the fitting process and with the properties of ordinary ML estimators (Problem 10.24). The remedy of conditional ML treats them as nuisance parameters and maximizes the likelihood function for a conditional distribution that eliminates them. A note on terminology: We've referred to model (10.8) as a conditional model, meaning that its effect β is subject-specific, conditional on the subject. The analyses described below for such models are examples of conditional logistic regression; but here the term conditional refers to the ML analysis that is performed conditional on sufficient statistics for nuisance parameters, to eliminate those parameters from the likelihood.
10.2.3 Conditional ML Inference for Binary Matched Pairs
For model (10.8), assuming independence of responses for different subjects and for the two observations on the same subject, the joint mass function for {(y_11, y_12), . . . , (y_n1, y_n2)} is

    ∏_{i=1}^{n} [exp(α_i)/(1 + exp(α_i))]^{y_i1} [1/(1 + exp(α_i))]^{1−y_i1}
              × [exp(α_i + β)/(1 + exp(α_i + β))]^{y_i2} [1/(1 + exp(α_i + β))]^{1−y_i2}.

In terms of the data, this is proportional to

    exp[Σ_i α_i(y_i1 + y_i2) + β Σ_i y_i2].
To eliminate {α_i}, we condition on their sufficient statistics, the pairwise success totals {S_i = y_i1 + y_i2}. Given S_i = 0, P(Y_i1 = Y_i2 = 0) = 1, and given S_i = 2, P(Y_i1 = Y_i2 = 1) = 1. The distribution of (Y_i1, Y_i2) depends on β only when S_i = 1; that is, only when outcomes differ for the two responses. Given y_i1 + y_i2 = 1, the conditional distribution is

    P(Y_i1 = y_i1, Y_i2 = y_i2 | S_i = 1)
        = P(Y_i1 = y_i1, Y_i2 = y_i2)/[P(Y_i1 = 1, Y_i2 = 0) + P(Y_i1 = 0, Y_i2 = 1)],

with numerator

    [exp(α_i)/(1 + exp(α_i))]^{y_i1} [1/(1 + exp(α_i))]^{1−y_i1}
        × [exp(α_i + β)/(1 + exp(α_i + β))]^{y_i2} [1/(1 + exp(α_i + β))]^{1−y_i2}

and denominator

    exp(α_i)/[(1 + exp(α_i))(1 + exp(α_i + β))] + exp(α_i + β)/[(1 + exp(α_i))(1 + exp(α_i + β))].

This simplifies to

    exp(β)/[1 + exp(β)],    y_i1 = 0, y_i2 = 1,
    1/[1 + exp(β)],         y_i1 = 1, y_i2 = 0.
Again, let {n_ab} denote the counts for the four possible sequences. For subjects having S_i = 1, Σ_i y_i1 = n_12, the number of subjects having success for observation 1 and failure for observation 2. Similarly, for those subjects, Σ_i y_i2 = n_21 and Σ_i S_i = n* = n_12 + n_21. Since n_21 is the sum of n* independent, identical Bernoulli variates, its conditional distribution is binomial with parameter exp(β)/[1 + exp(β)]. For testing marginal homogeneity (β = 0), the parameter equals ½. In summary, the conditional analysis for the logit model implies that pairs in which y_i1 = y_i2 are irrelevant to inference about β. When this model is realistic, it provides justification for comparing marginal distributions using only the n_12 + n_21 pairings having outcomes in different categories at the two observations.
Conditional on the {S_i = 1}, the joint distribution of these matched pairs is

    ∏_{S_i=1} [1/(1 + exp(β))]^{y_i1} [exp(β)/(1 + exp(β))]^{y_i2} = exp(β)^{n_21}/[1 + exp(β)]^{n*},    (10.9)

where the product refers to all pairs having S_i = 1. Differentiating the log of this conditional likelihood, equating to 0, and solving yields the conditional ML estimator of β in model (10.8). You can check that it and its standard error are

    β̂ = log(n_21/n_12),    SE = (1/n_21 + 1/n_12)^{1/2}.    (10.10)
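Formula (10.10) can be verified numerically by maximizing the conditional log likelihood from (10.9) directly; a sketch with the discordant counts of Table 10.1:

```python
import math

n21, n12 = 86, 150
n_star = n21 + n12

# Closed-form conditional ML estimate and standard error (10.10)
beta_hat = math.log(n21 / n12)
se = math.sqrt(1 / n21 + 1 / n12)

# Independent check: maximize n21*beta - n*log(1 + exp(beta)) by Newton's method
beta = 0.0
for _ in range(50):
    prob = math.exp(beta) / (1 + math.exp(beta))
    score = n21 - n_star * prob                   # first derivative
    beta += score / (n_star * prob * (1 - prob))  # divide by information
```

Both routes give β̂ = log(86/150) ≈ −0.556 with SE ≈ 0.135.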
10.2.4 Random Effects in Binary Matched-Pairs Model
An alternative remedy to handling the huge number of nuisance parameters in logit model (10.8) treats {α_i} as random effects. This regards {α_i} as an unobserved random sample from a probability distribution, usually assumed to be N(μ, σ²) with unknown μ and σ. It eliminates {α_i} by averaging with respect to their distribution, yielding a marginal distribution. The likelihood function then depends on β as well as the N(μ, σ²) parameters. It has only three parameters and is more manageable. For matched pairs with nonnegative sample log odds ratio, this approach also yields β̂ = log(n_21/n_12) (Neuhaus et al. 1994). This model is an example of a generalized linear mixed model, containing both random effects and the fixed effect β. Its analysis is presented in Chapter 12.

Model (10.8) implies that the true odds ratio for each of the n subject-specific partial tables equals exp(β). In Section 6.3.5 we presented the Mantel–Haenszel estimate of a common odds ratio for several 2 × 2 tables. In fact, that estimator applied to subject-specific tables of the form shown in Table 10.2 is algebraically identical to n_21/n_12 for tables of the form shown in Table 10.1. (Recall that partial tables with responses in only one column do not contribute to the CMH test or Mantel–Haenszel estimate.) In summary, the Mantel–Haenszel estimate, the conditional ML estimate, and (with nonnegative log odds ratio) the ML estimate for the random effects version of logit model (10.8) yield exp(β̂) = n_21/n_12.
10.2.5 Logistic Regression for Matched Case–Control Studies
The two observations (y_i1, y_i2) in a matched pair need not refer to the same subject. For instance, case–control studies that match a single control with each case yield matched-pairs data. For a binary response Y, each case (Y = 1) is matched with a control (Y = 0) according to criteria that could affect the response. Subjects in the matched pairs are measured on the predictor variable(s) of interest, X, and the XY association is analyzed.

Table 10.3 illustrates. A case–control study of acute myocardial infarction (MI) among Navajo Indians matched 144 victims of MI according to age and gender with 144 people free of heart disease. Subjects were asked whether they had ever been diagnosed as having diabetes (x = 0, no; x = 1, yes). Table 10.3 has the same form as Table 10.1 except that the levels of X rather than the levels of Y form the rows and the columns.

One can display the data for each matched case–control pair using a partial table of the form shown in Table 10.2, but reversing the roles of X and Y. The X values have four possible patterns, shown in Table 10.4. There are 37 partial tables of type a, since for 37 pairs the case had diabetes and the control did not, 16 partial tables of type b, 9 of type c, and 82 of type d.

Now, for subject t in matched pair i, consider the model

    logit[P(Y_it = 1)] = α_i + βx_it.    (10.11)
TABLE 10.3 Previous Diagnoses of Diabetes for Myocardial Infarction (MI) Case–Control Pairs

                             MI Controls
MI Cases          Diabetes    No Diabetes    Total
Diabetes              9           16           25
No diabetes          37           82          119
Total                46           98          144

Source: J. L. Coulehan et al., Amer. J. Public Health 76: 412–414 (1986), reprinted with permission from the American Public Health Association.
TABLE 10.4 Possible Case–Control Pairs for Table 10.3

                  a                  b                  c                  d
Diabetes    Case   Control     Case   Control     Case   Control     Case   Control
Yes           1       0          0       1          1       1          0       0
No            0       1          1       0          0       0          1       1
The probabilities modeled refer to the distribution of Y given X, but the retrospective study provides information only about the distribution of X given Y. One can estimate the odds ratio exp(β), however, since it refers to the XY odds ratio, which relates to both conditional distributions (Sections 2.2.4, 5.1.4). Even though this study reverses the roles of X and Y in terms of which is fixed and which is random, the conditional ML estimate of exp(β) is simply n_21/n_12 = 37/16 = 2.3.
10.2.6 Conditional ML for Matched Pairs with Multiple Predictors
When the binary response has p predictors for case–control or subject-specific matched pairs, the model generalizes to

    logit[P(Y_it = 1)] = α_i + β_1 x_1it + β_2 x_2it + ··· + β_p x_pit,    (10.12)

where x_hit denotes the value of predictor h for observation t in pair i, t = 1, 2. Typically, one predictor is an explanatory variable of interest, such as diabetes status. The others are covariates being controlled, in addition to those already controlled by virtue of using them to form the matched pairs. The conditional ML approach to estimating {β_j} conditions on sufficient statistics for {α_i} to eliminate them from the likelihood.

Let x_it = (x_1it, . . . , x_pit)′ and β = (β_1, . . . , β_p)′. A generalization of the derivation in Section 10.2.3 shows that

    P(Y_i1 = 0, Y_i2 = 1 | S_i = 1) = exp(x′_i2 β)/[exp(x′_i1 β) + exp(x′_i2 β)],
    P(Y_i1 = 1, Y_i2 = 0 | S_i = 1) = exp(x′_i1 β)/[exp(x′_i1 β) + exp(x′_i2 β)].    (10.13)

Dividing numerator and denominator by exp(x′_i1 β) shows that the first equation has the form of logistic regression with no intercept and with predictor values x*_i = x_i2 − x_i1. In fact, one can obtain conditional ML estimates for model (10.12) by fitting a logistic regression model to those pairs alone, using artificial response y*_i = 1 when (y_i1 = 0, y_i2 = 1), y*_i = 0 when (y_i1 = 1, y_i2 = 0), no intercept, and predictor values x*_i. This maximizes the same likelihood as the conditional likelihood (Breslow et al. 1978; Chamberlain 1980).
To illustrate, for model (10.11) with Table 10.3, let y*_i = y_i2 − y_i1 and x*_i = x_i2 − x_i1. If t = 1 refers to the control and t = 2 to the case, then y*_i = 1 always. Since x_it = 1 represents "yes" for diabetes and x_it = 0 represents "no," (y*_i = 1, x*_i = −1) for 16 observations, (y*_i = 1, x*_i = 0) for 9 + 82 = 91 observations, and (y*_i = 1, x*_i = +1) for 37 observations. The logit model that forces α̂ = 0 has β̂ = 0.84. With a single binary predictor, the estimate is identical to log(n_21/n_12).
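The no-intercept fit described above takes only a few lines of Newton–Raphson; a sketch for this single-predictor illustration (16 pairs with x* = −1, 91 with x* = 0, 37 with x* = +1, every y* = 1):

```python
import math

# (x_star, count): differences x_i2 - x_i1 for the 144 case-control pairs
data = [(-1, 16), (0, 91), (1, 37)]

# Fit logit[P(y* = 1)] = beta * x_star with no intercept; all responses are 1
beta = 0.0
for _ in range(50):
    grad, info = 0.0, 0.0
    for x, count in data:
        prob = 1 / (1 + math.exp(-beta * x))
        grad += count * x * (1 - prob)            # score contribution
        info += count * x * x * prob * (1 - prob) # information contribution
    beta += grad / info
```

The pairs with x* = 0 contribute nothing to the score, and β̂ converges to log(37/16) ≈ 0.84.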
10.2.7 Marginal Models and Conditional Models: Extensions
For binary matched-pairs data, Section 10.1 presented analyses for a marginal (i.e., population-averaged) model, and this section presented analyses for a conditional (i.e., subject-specific) model. These models generalize to multinomial responses and to matched sets. For instance, Chamberlain (1980) discussed conditional ML for matched pairs on a multinomial response. For binary responses, model (10.12) applies when α_i refers to a set of repeated measurements on subject i. Or, it could refer to a matched set that is a cluster of subjects, such as children from family i or fetuses from litter i.

With extensions of the conditional model to matched-set clusters, the conditional ML approach is restricted to estimating β_j that are within-cluster effects, such as occur in case–control and crossover studies. For these, the explanatory variable varies in t for each i. Conditional ML cannot estimate a between-cluster effect. Statistics providing information about such an effect use subject totals at different levels of the relevant explanatory variable; however, those totals sum the sufficient statistics for {α_i}, so they are themselves fixed and have degenerate distributions after conditioning on the sufficient statistics. An explanatory variable that is constant in t for each i cancels out of the conditional likelihood. [You can observe this for matched pairs with (10.13) for any j for which x_ji1 = x_ji2 for all i.] For such a variable, at best one can stratify by its levels and fit a model estimating within-cluster effects separately at each level. An advantage of using the random effects approach instead of conditional ML with the conditional model is that it is not restricted to estimating within-cluster effects.

In the remainder of this chapter we emphasize marginal models for matched pairs with multinomial responses. In the following chapter we deal with marginal model extensions allowing matched sets and explanatory variables. Conditional models using a random effects approach have extra computational complexities. We mention briefly some multinomial conditional models in this chapter, but we defer most discussion to Chapter 12.
10.3 MARGINAL MODELS FOR SQUARE CONTINGENCY TABLES
Matched-pairs analyses generalize from binary to I > 2 outcome categories. A square I × I table {n_ab} shows counts of possible sequences (a, b) of outcomes for (Y_1, Y_2). Let π_ab = P(Y_1 = a, Y_2 = b). Marginal homogeneity is P(Y_1 = a) = P(Y_2 = a) for a = 1, . . . , I. Marginal models compare {P(Y_1 = a)} and {P(Y_2 = a)}.
10.3.1 Marginal Models for Ordinal Classifications
For ordered categories, marginal model (10.6) for binary matched pairs extends using ordinal logits. With cumulative logits,

    logit[P(Y_t ≤ j)] = α_j + βx_t,    t = 1, 2,  j = 1, . . . , I − 1,    (10.14)

where x_1 = 0 and x_2 = 1. This model has proportional odds structure (Section 7.2.2). The odds of outcome Y_2 ≤ j equal exp(β) times the odds of outcome Y_1 ≤ j. The model implies stochastically ordered marginal distributions, with β > 0 meaning that Y_1 tends to be higher than Y_2. Marginal homogeneity corresponds to β = 0.

Model fitting treats (Y_1, Y_2) as dependent. The ML approach maximizes the multinomial likelihood for {π_ab}. This is not simple. Since the model refers to the marginal probabilities {P(Y_1 = a) = π_a+} and {P(Y_2 = b) = π_+b}, one cannot substitute the model formula in the kernel Σ_a Σ_b n_ab log π_ab of the log likelihood, which has joint probabilities. We defer discussion of ML fitting of marginal models to Section 11.2.5. Model (10.14) describes the 2(I − 1) marginal probabilities by I parameters, so df = I − 2 for testing fit. Alternatively, one can compare margins using summaries such as a difference in means for chosen category scores (Problem 10.38).
10.3.2 Premarital and Extramarital Sex Example
Refer to Table 10.5. For a General Social Survey, subjects gave their opinion about premarital sex (a couple having sex before marriage) and extramarital sex (a married person having sex with someone other than the marriage partner). The response categories are 1 = always wrong, 2 = almost always wrong, 3 = wrong only sometimes, 4 = not wrong at all.

The sample cumulative marginal proportions are (0.307, 0.389, 0.611) for premarital sex and (0.815, 0.918, 0.987) for extramarital sex. This suggests that responses on premarital sex tended to be higher on the ordinal scale than those on extramarital sex. With scores (1, 2, 3, 4), the mean for premarital sex is 2.69, closest to the "wrong only sometimes" score, and the mean response for extramarital sex is 1.28, closest to the "always wrong" score.

The cumulative logit model (10.14) has β̂ = 2.51 (SE = 0.13). There is strong evidence that population responses are more positive on premarital than on extramarital sex. The fit of the marginal homogeneity model has G² = 348.1 (df = 3), and the fit of model (10.14) has G² = 35.1 (df = 2). The ordinal model does not fit well, but it fits much better than the marginal homogeneity model. Models to be considered in Section 10.4.7 fit better yet.
TABLE 10.5 Opinions on Premarital Sex and Extramarital Sex

                         Extramarital Sex
Premarital Sex        1      2      3      4    Total
1                   144      2      0      0      146
2                    33      4      2      0       39
3                    84     14      6      1      105
4                   126     29     25      5      185
Total               387     49     33      6      475

Source: 1989 General Social Survey, National Opinion Research Center.
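The marginal summaries quoted in the example come straight from the table; a sketch:

```python
# Table 10.5 counts: rows = premarital sex opinion, columns = extramarital sex
table = [
    [144, 2, 0, 0],
    [33, 4, 2, 0],
    [84, 14, 6, 1],
    [126, 29, 25, 5],
]
n = sum(sum(row) for row in table)  # 475

row_tot = [sum(row) for row in table]
col_tot = [sum(table[i][j] for i in range(4)) for j in range(4)]

# Cumulative marginal proportions for j = 1, 2, 3
cum_pre = [sum(row_tot[: j + 1]) / n for j in range(3)]
cum_ext = [sum(col_tot[: j + 1]) / n for j in range(3)]

# Marginal means with scores (1, 2, 3, 4)
mean_pre = sum((j + 1) * row_tot[j] for j in range(4)) / n
mean_ext = sum((j + 1) * col_tot[j] for j in range(4)) / n
```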
10.3.3 Marginal Models for Nominal Classifications
With nominal responses, it is not sensible to assume the same effect for each logit. A baseline-category logit model has the form

    log[P(Y_t = j)/P(Y_t = I)] = α_j + β_j x_t,    t = 1, 2,  j = 1, . . . , I − 1,    (10.15)

where x_1 = 0 and x_2 = 1. This model has 2(I − 1) parameters for the 2(I − 1) marginal probabilities. It is saturated.

Marginal homogeneity is the special case β_1 = ··· = β_{I−1} = 0. To fit it, Lipsitz et al. (1990) and Madansky (1963) maximized the multinomial likelihood for {n_ab} subject to these constraints. Iterative methods produce fitted values {π̂_ab}. Comparing these to {n_ab} using G² or X² tests marginal homogeneity, with df = I − 1.

Bhapkar (1966) tested marginal homogeneity by exploiting the asymptotic normality of marginal proportions. Let d_a = p_+a − p_a+, and let d′ = (d_1, . . . , d_{I−1}). It is redundant to include d_I, since Σ_a d_a = 0. The sample covariance matrix V̂ of √n d has elements

    v̂_ab = −(p_ab + p_ba) − (p_+a − p_a+)(p_+b − p_b+)    for a ≠ b,
    v̂_aa = p_+a + p_a+ − 2p_aa − (p_+a − p_a+)².

Now √n[d − E(d)] has an asymptotic multivariate normal distribution with estimated covariance matrix V̂. Under marginal homogeneity, E(d) = 0, and

    W = n d′V̂⁻¹d    (10.16)

is asymptotically chi-squared with df = I − 1. This is a Wald test for parameters in the analog of model (10.15) using the identity link. Stuart (1955) proposed W₀ = n d′V̂₀⁻¹d, which uses the sample null covariance matrix V̂₀ and is the score test. This has

    v̂_ab0 = −(p_ab + p_ba)    for a ≠ b,
    v̂_aa0 = p_+a + p_a+ − 2p_aa.

Ireland et al. (1969) noted that W = W₀/(1 − W₀/n). For I = 2, W₀ is McNemar's statistic, the square of (10.4).

These tests use all I − 1 degrees of freedom available for comparisons of I pairs of marginal proportions. With ordered categories, when I is large and the dependence between classifications is strong, ordinal tests (with df = 1) can be much more powerful (Agresti 1984, p. 209).
TABLE 10.6 Migration from 1980 to 1985, with Fit of Marginal Homogeneity Model

                                Residence in 1985
Residence
in 1980       Northeast       Midwest         South           West            Total
Northeast     11,607          100             366             124             12,197
              (11,607)        (98.1)          (265.7)         (94.0)          (12,064.7)
Midwest       87              13,677          515             302             14,581
              (88.7)          (13,677)        (379.1)         (323.3)         (14,377.1)
South         172             225             17,819          270             18,486
              (276.5)         (350.8)         (17,819)        (287.3)         (18,733.5)
West          63              176             286             10,192          10,717
              (92.5)          (251.3)         (269.8)         (10,192)        (10,805.6)
Total         11,929          14,178          18,986          10,888          55,981
              (12,064.7)      (14,377.1)      (18,733.5)      (10,805.6)

Source: Data based on Table 12 of U.S. Bureau of the Census, Current Population Reports, Series P-20, No. 420, Geographical Mobility: 1985 (Washington, DC: U.S. Government Printing Office), 1987.
10.3.4 Migration Example

For a sample of U.S. residents, Table 10.6 compares region of residence in
1985 with that in 1980. Relatively few people changed region, 95% of the
observations falling on the main diagonal. The ML fit of marginal homogeneity,
shown in Table 10.6, gives G² = 240.8 (df = 3). Statistics using differences
in sample marginal proportions give similar results. For instance, Bhapkar's
statistic (10.16) is W = 236.5 (df = 3).
The sample marginal proportions for the four regions were
(0.218, 0.260, 0.330, 0.191) in 1980 and (0.213, 0.253, 0.339, 0.194) in
1985. Little change occurred over such a short time period. The large test
statistics reflect the huge sample size. To estimate the change for a given
region, we apply (10.2) to the collapsed 2 × 2 table that combines the other
regions. A 95% confidence interval for π_{+1} − π_{1+} is
(0.2131 − 0.2179) ± 1.96(0.00054), or −0.005 ± 0.001. Similarly, a 95%
confidence interval for π_{+2} − π_{2+} is −0.007 ± 0.001, for π_{+3} − π_{3+}
is 0.009 ± 0.001, and for π_{+4} − π_{4+} is 0.003 ± 0.001. Although strong
evidence of change occurs for all four regions, the changes were small.
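As a computational check, the following sketch (in Python with NumPy; not part
of the original analysis) computes Bhapkar's W and Stuart's W₀ for the counts
of Table 10.6, and confirms the identity W = W₀/(1 − W₀/n) noted above.

```python
import numpy as np

# Observed counts from Table 10.6 (rows: residence in 1980; columns: 1985).
n = np.array([[11607,   100,   366,   124],
              [   87, 13677,   515,   302],
              [  172,   225, 17819,   270],
              [   63,   176,   286, 10192]], dtype=float)

N = n.sum()                          # 55,981
p = n / N                            # sample cell proportions
d = (p.sum(axis=0) - p.sum(axis=1))[:-1]   # d_a = p_{+a} - p_{a+}; drop d_I

I = n.shape[0]
V = np.empty((I - 1, I - 1))         # Bhapkar's estimated covariance of sqrt(n) d
V0 = np.empty_like(V)                # Stuart's null covariance
for a in range(I - 1):
    for b in range(I - 1):
        if a == b:
            V0[a, a] = p[:, a].sum() + p[a, :].sum() - 2 * p[a, a]
            V[a, a] = V0[a, a] - d[a] ** 2
        else:
            V0[a, b] = -(p[a, b] + p[b, a])
            V[a, b] = V0[a, b] - d[a] * d[b]

W = N * d @ np.linalg.solve(V, d)    # Bhapkar's Wald statistic (10.16)
W0 = N * d @ np.linalg.solve(V0, d)  # Stuart's score statistic
print(f"W = {W:.1f}, W0 = {W0:.1f}, df = {I - 1}")
```

The identity follows from V̂ = V̂₀ − dd′ and the Sherman–Morrison formula, so
it holds exactly, not just asymptotically.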
10.4 SYMMETRY, QUASI-SYMMETRY, AND QUASI-INDEPENDENCE
An alternative analysis of square contingency tables directly models the joint
distribution using logit or loglinear models. Some models have marginal
homogeneity as a special case.
MODELS FOR MATCHED PAIRS
An I × I joint distribution {π_ab} satisfies symmetry if

    π_ab = π_ba   whenever a ≠ b.   (10.17)

Under symmetry, π_{a+} = Σ_b π_ab = Σ_b π_ba = π_{+a} for all a, so marginal
homogeneity occurs. For I = 2, symmetry is equivalent to marginal homogeneity,
but for I > 2, marginal homogeneity can occur without symmetry.
10.4.1 Symmetry as Logit and Loglinear Models
When all π_ab > 0, symmetry is a logit and a loglinear model. In logit form,
it is trivially

    log(π_ab/π_ba) = 0   for all a < b.

For expected frequencies {μ_ab = nπ_ab}, it has the loglinear form

    log μ_ab = λ + λ_a + λ_b + λ_ab,   (10.18)

where all λ_ab = λ_ba. Both classifications have the same single-factor
parameters {λ_a}, so log μ_ab = log μ_ba. Identifiability requires
constraints. A simpler expression is log μ_ab = λ_ab, with all λ_ab = λ_ba.
For Poisson or multinomial cell counts {n_ab}, the likelihood equations are
μ̂_ab + μ̂_ba = n_ab + n_ba for all a < b and μ̂_aa = n_aa for all a.
The main diagonal has perfect fit. The solution that satisfies symmetry is

    μ̂_ab = (n_ab + n_ba)/2   for all a, b.
The logit symmetry model has no parameters for the I(I − 1)/2 binomial pairs
{(n_ab, n_ba)} with a < b, so its residual df = I(I − 1)/2. Equivalently, the
loglinear symmetry model log μ_ab = λ_ab (λ_ab = λ_ba) for the I² Poisson
counts {n_ab} has I(I − 1)/2 {λ_ab} with a < b and I {λ_aa}, so
df = I² − [I + I(I − 1)/2] = I(I − 1)/2. For testing symmetry, Bowker (1948)
showed that X² simplifies to

    X² = Σ_{a<b} (n_ab − n_ba)² / (n_ab + n_ba).
For I = 2 this is McNemar's statistic, the square of (10.4). The standardized
Pearson residuals equal

    r_ab = (n_ab − n_ba) / (n_ab + n_ba)^{1/2}.
Only one residual for each pair of categories is nonredundant, since
r_ab = −r_ba. They satisfy Σ_{a<b} r²_ab = X².
The symmetry model is very simple. Except for a few specialized applications,
such as describing intraobserver agreement for pairs of measurements by an
observer, it rarely fits well. When the marginal distributions differ
substantially, it fits poorly.
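Bowker's statistic requires only the cell counts, so it is easy to compute
directly. The following minimal Python sketch (not from the original text)
applies it to the migration counts of Table 10.6; for I = 2 it reduces to
McNemar's statistic.

```python
def bowker(n):
    """Bowker's X2 statistic for symmetry of a square table of counts."""
    I = len(n)
    X2, df = 0.0, 0
    for a in range(I):
        for b in range(a + 1, I):
            tot = n[a][b] + n[b][a]
            if tot > 0:            # pairs with n_ab + n_ba = 0 add nothing
                X2 += (n[a][b] - n[b][a]) ** 2 / tot
                df += 1
    return X2, df

# Migration counts of Table 10.6
n = [[11607, 100, 366, 124],
     [87, 13677, 515, 302],
     [172, 225, 17819, 270],
     [63, 176, 286, 10192]]
X2, df = bowker(n)
print(f"X2 = {X2:.1f}, df = {df}")
```

Here df counts the pairs with n_ab + n_ba > 0, following the advice later in
this chapter to drop empty pairs from the df calculation.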
10.4.2 Quasi-symmetry
One can accommodate marginal heterogeneity by permitting the main-effect
terms in the symmetry model (10.18) to differ. The resulting loglinear model,
called quasi-symmetry, is

    log μ_ab = λ + λ_a^X + λ_b^Y + λ_ab,   (10.19)

where λ_ab = λ_ba for all a < b (Caussinus 1966). Symmetry is the special
case λ_a^X = λ_a^Y for a = 1, ..., I, and independence is the special case in
which all λ_ab = 0.
The likelihood equations for quasi-symmetry are

    μ̂_{a+} = n_{a+},   a = 1, ..., I,
    μ̂_{+b} = n_{+b},   b = 1, ..., I,   (10.20)
    μ̂_ab + μ̂_ba = n_ab + n_ba   for a ≤ b.

Only one of the first two sets of equations is needed; the other is
redundant, given the other two. The residual df = (I − 1)(I − 2)/2. From
(10.20), μ̂_aa = n_aa for a = 1, ..., I. Otherwise, the likelihood equations
do not have a direct solution. They are solved using iterative methods such
as Newton–Raphson and IPF (Caussinus 1966).
The quasi-symmetry model has multiplicative form

    μ_ab = α_a β_b γ_ab,   where γ_ab = γ_ba for all a < b,   (10.21)

and all parameters are positive. The symmetry model is (10.21) with α_a = β_a
for all a. This equation indicates that a table satisfying quasi-symmetry is
the cellwise product of a table satisfying independence with one satisfying
symmetry. The association symmetry implies that odds ratios on one side of
the main diagonal are identical to corresponding odds ratios on the other
side. In fact, the model can be defined by properties such as

    (μ_ab μ_II)/(μ_aI μ_Ib) = (μ_ba μ_II)/(μ_bI μ_Ia)   for all a < b,   (10.22)

or θ_ab = θ_ba for local odds ratios. Goodman (1979a) referred to it as the
symmetric association model.
The meaning of quasi-symmetry is less obvious than symmetry. However, it
usually fits much better and has greater scope. One way to interpret its
parameters relates to subject-specific logit models. For such models having
additivity of subject terms and occasion terms, of which model (10.8) is the
simplest case, the joint distribution in the corresponding population-averaged
table necessarily satisfies quasi-symmetry (see Darroch 1981; Section 13.2.7
shows this). Consider the generalization of baseline-category logit model
(10.15) to a subject-specific model

    log[P(Y_it = j)/P(Y_it = I)] = α_ij + β_j x_t,   t = 1, 2,   j = 1, ..., I − 1.

This has the additive form of (10.8) for each j. The model implies, averaging
over subjects, that the quasi-symmetry model (10.19) holds for the I × I
population-averaged table with {β_j = λ_j^Y − λ_j^X}, when one constrains
λ_I^X = λ_I^Y = 0. In fact, for the conditional ML analysis that conditions
out {α_ij}, the conditional ML estimates {β̂_j} relate to the ordinary ML fit
of quasi-symmetry by β̂_j = λ̂_j^Y − λ̂_j^X (Conaway 1989). This provides an
interpretation for the main-effect terms in quasi-symmetry.
Related results hold for multiple occasions using a multivariate form (10.33)
of quasi-symmetry (e.g., Agresti 1997; Conaway 1989; Darroch 1981; Tjur 1982;
see also Section 13.2.7). In addition, quasi-symmetry contains other useful
models as special cases, including the ones in Sections 10.4.3 and 10.6.3.
10.4.3 Quasi-independence
Square tables usually exhibit positive dependence, manifested by larger
counts on the main diagonal than the independence model predicts. Conditional
on the event that a matched pair falls off the main diagonal, though, the
relationship may have a simple structure.
A square contingency table satisfies quasi-independence when the variables
are independent, given that the row and column outcomes differ. This has the
loglinear form

    log μ_ab = λ + λ_a^X + λ_b^Y + δ_a I(a = b),   (10.23)

where I(·) is the indicator function,

    I(a = b) = 1 if a = b,   0 if a ≠ b.

This adds a parameter to the independence model for each cell on the main
diagonal. The first three terms in (10.23) specify independence, and {δ_a}
permit {π_aa} to depart from this pattern and have arbitrary positive values.
When δ_a > 0, π_aa is larger than under independence.
The likelihood equations for quasi-independence are

    μ̂_{a+} = n_{a+},   μ̂_{+a} = n_{+a},   μ̂_aa = n_aa,   a = 1, ..., I.
A perfect fit occurs on the main diagonal, but independence holds for the
remaining cells. The model implies that odds ratios equal 1.0 for all
rectangularly formed 2 × 2 tables in which all cells fall off the main
diagonal. One can fit the model using Newton–Raphson or IPF. The model has I
more parameters than the independence model, so its residual
df = (I − 1)² − I. It applies to tables with I ≥ 3.
Quasi-independence is the special case of quasi-symmetry (10.21) in which
{γ_ab for a ≠ b} are identical. Caussinus (1966, p. 146) showed that they are
equivalent when I = 3.
10.4.4 Migration Revisited
We now return to Table 10.6 on migration patterns. Not surprisingly, the
independence model fits terribly, with G² = 125,923 and X² = 146,929. (The
maximum possible value of X² is 3n = 167,943; see Problem 3.33.) The symmetry
model is also unpromising. For instance, 124 people moved from the northeast
to the west, but only 63 people made the reverse move. The deviance for
testing symmetry is G² = 243.6 (df = 6).
Quasi-independence states that for people who moved, residence in 1985 is
independent of region in 1980. Table 10.7 contains its fitted values, for
which G² = 69.5 (df = 5). This model fits much better than the independence
model, primarily because it forces a perfect fit on the main diagonal, where
most observations occur. However, lack of fit is apparent off that diagonal.
Many more people moved from the northeast to the south and many fewer moved
from the west to the south than quasi-independence predicts.
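Both fits can be computed with a few lines of IPF, since each scaling step
preserves the form of the corresponding model. The sketch below (Python with
NumPy; an illustration, not the book's own program) fits quasi-independence
and quasi-symmetry to Table 10.6 and reports the deviances.

```python
import numpy as np

# Observed counts from Table 10.6 (rows: 1980 residence; columns: 1985).
n = np.array([[11607, 100, 366, 124],
              [87, 13677, 515, 302],
              [172, 225, 17819, 270],
              [63, 176, 286, 10192]], dtype=float)
I = n.shape[0]
off = ~np.eye(I, dtype=bool)                      # off-diagonal mask

def G2(obs, fit):
    """Likelihood-ratio statistic; cells with obs = 0 contribute zero."""
    mask = obs > 0
    return 2 * (obs[mask] * np.log(obs[mask] / fit[mask])).sum()

# Quasi-independence: fix the diagonal at n_aa and scale the off-diagonal
# cells (which keep the product form alpha_a * beta_b) to the margins.
m = np.where(off, 1.0, n)
for _ in range(2000):
    rowfac = (n.sum(1) - np.diag(n)) / (m.sum(1) - np.diag(m))
    m = np.where(off, m * rowfac[:, None], m)
    colfac = (n.sum(0) - np.diag(n)) / (m.sum(0) - np.diag(m))
    m = np.where(off, m * colfac[None, :], m)
m_qi = m

# Quasi-symmetry: cycle through the three sets of likelihood equations
# (10.20); each step preserves the QS form, so IPF converges to the ML fit.
m = np.ones_like(n)
for _ in range(2000):
    m *= (n.sum(1) / m.sum(1))[:, None]           # row margins
    m *= (n.sum(0) / m.sum(0))[None, :]           # column margins
    m *= (n + n.T) / (m + m.T)                    # symmetric sums
m_qs = m

print(f"QI: G2 = {G2(n, m_qi):.1f} (df = {(I - 1) ** 2 - I})")
print(f"QS: G2 = {G2(n, m_qs):.1f} (df = {(I - 1) * (I - 2) // 2})")
```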
TABLE 10.7 Fit of Models to Table 10.6

Residence                          Residence in 1985ᵃ
in 1980      Northeast          Midwest            South              West               Total
Northeast    11,607             100 (126.6; 95.8)  366 (312.9; 370.4) 124 (150.5; 123.8) 12,197
Midwest      87 (117.4; 91.2)   13,677             515 (531.1; 501.7) 302 (255.5; 311.1) 14,581
South        172 (133.2; 167.6) 225 (243.8; 238.3) 17,819             270 (290.0; 261.1) 18,486
West         63 (71.4; 63.2)    176 (130.6; 166.9) 286 (323.0; 294.9) 10,192             10,717
Total        11,929             14,178             18,986             10,888             55,981

ᵃ Each cell shows the observed count followed by (quasi-independence fit;
quasi-symmetry fit); both models give a perfect fit on the main diagonal.
The quasi-symmetry model has G² = 3.0, with df = 3. Table 10.7 displays its
fit, which is much better than that of quasi-independence. The lack of
symmetry in cell probabilities reflects slight marginal heterogeneity. The
subject-specific effects can be described using the model's parameter
estimates, {λ̂_1^Y − λ̂_1^X = −0.672, λ̂_2^Y − λ̂_2^X = −0.623,
λ̂_3^Y − λ̂_3^X = 0.122}. For instance, for a given subject the estimated
odds of living in the south instead of the west in 1985 were
exp(0.122) = 1.13 times the odds in 1980. We'll see in Chapter 12 that such
subject-specific effects tend to be stronger than those in corresponding
marginal models, especially in tables like this with strong association.
A related application with matched samples is the study of occupational
mobility. Each observation pairs parent's occupation with child's occupation
(Goodman 1979b; Hout et al. 1987).
10.4.5 Marginal Homogeneity and Quasi-symmetry
Marginal homogeneity is not equivalent to a loglinear model. However,
quasi-symmetry is a useful model for studying marginal homogeneity. Caussinus
(1966) showed that symmetry is equivalent to quasi-symmetry and marginal
homogeneity holding simultaneously. We have seen that symmetry implies both
quasi-symmetry and marginal homogeneity. Now we give Caussinus's argument for
the converse, that the joint occurrence of quasi-symmetry and marginal
homogeneity implies symmetry.
From (10.21), if quasi-symmetry holds, μ_ab = α_a β_b γ_ab, where
γ_ab = γ_ba > 0 for all a < b. Equivalently,

    μ_ab = φ_a δ_ab,

where φ_a = α_a/β_a and δ_ab = β_a β_b γ_ab also satisfies δ_ab = δ_ba > 0
for all a < b. If there is also marginal homogeneity, then

    μ_{j+} = φ_j Σ_b δ_jb = Σ_a φ_a δ_aj = μ_{+j},

or

    φ_j = (Σ_a φ_a δ_aj) / (Σ_b δ_jb) = (Σ_a φ_a δ_aj) / (Σ_b δ_bj),   j = 1, ..., I.

Thus, each φ_j is a weighted average of {φ_a}, with weights
{δ_aj / Σ_b δ_bj > 0, a = 1, ..., I}. Any set {φ_a} satisfying this must be
identical. Otherwise, there would be a φ_j that is no greater than any φ_a
but smaller than at least one, and hence it could not be a positive weighted
average of all of them. But since {φ_a} are identical,
μ_ab = φ_a δ_ab = φ_b δ_ab = φ_b δ_ba = μ_ba, so symmetry holds. Thus, a
table that satisfies both quasi-symmetry and marginal homogeneity also
satisfies symmetry. Since the converse holds,

    quasi-symmetry + marginal homogeneity = symmetry.   (10.24)
It follows that when quasi-symmetry (QS) holds, marginal homogeneity (MH) is
equivalent to symmetry (S), which is {λ_a^X = λ_a^Y, a = 1, ..., I} in the
QS model. Thus, conditional on quasi-symmetry, testing marginal homogeneity
is equivalent to testing symmetry. A test of marginal homogeneity compares
fit statistics for the symmetry and quasi-symmetry models,

    G²(S | QS) = G²(S) − G²(QS),   (10.25)

with df = I − 1. This is an alternative to approaches using marginal models
discussed in Section 10.3.3.
Table 10.6 on migration from 1980 to 1985 has G²(S) = 243.6 and
G²(QS) = 3.0. The difference G²(S | QS) = 240.6 (df = 3) shows extremely
strong evidence of marginal heterogeneity. Results are similar to those
quoted in Section 10.3.4 for the likelihood-ratio test based on model
(10.15), for which G² = 240.8, and the Wald test, for which W = 236.5 (both
with df = 3).
10.4.6 Ordinal Quasi-symmetry Model
The loglinear models presented so far for square tables treat classifications
as nominal. With ordered categories, more parsimonious models are useful. Let
u_1 ≤ ··· ≤ u_I denote ordered scores for both the rows and columns. An
ordinal quasi-symmetry model is

    log μ_ab = λ + λ_a + λ_b + β u_b + λ_ab,   (10.26)

where λ_ab = λ_ba for all a < b. It is the special case of the quasi-symmetry
model (10.19) in which

    λ_b^Y − λ_b^X = β u_b

has a linear trend. Symmetry is the special case β = 0.
This model has the logit representation

    log(π_ab/π_ba) = β(u_b − u_a)   for a ≤ b.   (10.27)

This is the special case of the linear logit model, logit(π) = α + βx, with
α = 0, x = u_b − u_a, and π equal to the conditional probability of cell
(a, b), given response sequence (a, b) or (b, a). The greater the value of
|β|, the greater the difference between π_ab and π_ba and hence between the
marginal distributions.
The likelihood equations for ordinal quasi-symmetry are

    Σ_a u_a μ̂_{a+} = Σ_a u_a n_{a+},
    Σ_b u_b μ̂_{+b} = Σ_b u_b n_{+b},
    μ̂_ab + μ̂_ba = n_ab + n_ba   for a < b.

The fitted marginal counts need not equal the observed marginal counts.
However, dividing the first two equations by n shows that the fitted and
observed marginal distributions have the same means.
When β ≠ 0, this model implies stochastically ordered margins. When β > 0
(β < 0), responses have a higher mean in the column (row) distribution. Like
the ordinal marginal models (Section 10.3.1), this model concentrates the
marginal effect on df = 1. A test of marginal homogeneity (H₀: β = 0) uses

    ordinal quasi-symmetry + marginal homogeneity = symmetry.

The likelihood-ratio test statistic compares the deviances for symmetry and
ordinal quasi-symmetry.
One can fit this model by fitting (10.27) with logit model software: Identify
(n_ab, n_ba) as binomial with n_ab + n_ba trials, and fit a logit model with
no intercept and predictor x = u_b − u_a. One can also fit (10.26) using
iterative methods for loglinear models.
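Because (10.27) has a single parameter and no intercept, Newton–Raphson
fitting takes only a few lines. The sketch below uses hypothetical counts
(invented here for illustration; they are not the data of Table 10.5), with
scores u = (1, 2, 3, 4).

```python
import math

# Hypothetical counts n_ab for a 4 x 4 table; the diagonal plays no role.
u = [1, 2, 3, 4]
n = [[0, 35, 12, 5],
     [20, 0, 28, 9],
     [6, 14, 0, 30],
     [2, 4, 18, 0]]

# For each pair a < b: y = n_ab successes out of N = n_ab + n_ba trials,
# predictor x = u_b - u_a, no intercept, as in (10.27).
pairs = [(u[b] - u[a], n[a][b], n[a][b] + n[b][a])
         for a in range(4) for b in range(a + 1, 4)]

beta = 0.0
for _ in range(50):                      # Newton-Raphson for one parameter
    g = h = 0.0
    for x, y, N in pairs:
        prob = 1.0 / (1.0 + math.exp(-beta * x))
        g += x * (y - N * prob)          # score function
        h += x * x * N * prob * (1 - prob)   # observed = expected information
    beta += g / h
print(f"beta-hat = {beta:.3f}")
```

At convergence the score is zero, which is exactly the first two likelihood
equations above expressed on the logit scale.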
10.4.7 Premarital and Extramarital Sex Revisited
For Table 10.5 on attitudes toward premarital and extramarital sex, a cursory
glance at the data reveals that the symmetry model is inadequate
(G² = 402.2, df = 6). By comparison, quasi-symmetry fits well (G² = 1.4,
df = 3). The simpler model of ordinal quasi-symmetry also fits well: With
scores {1, 2, 3, 4}, G² = 2.1 (df = 5).
The ML estimate is β̂ = −2.86. From (10.27), the estimated probability that
the outcome on premarital sex is x categories more positive than the outcome
on extramarital sex equals exp(2.86x) times the reverse probability. For
instance, the estimated probability that premarital sex is judged almost
always wrong and extramarital sex is always wrong equals exp(2.86) = 17.4
times the estimated probability that premarital sex is always wrong and
extramarital sex is almost always wrong.
10.4.8 Other Ordinal Models for Square Tables
For ordered classifications, when symmetry does not hold, often either
π_ab > π_ba for all a < b, or π_ab < π_ba for all a < b. A generalization of
symmetry with this property is the logit model

    log(π_ab/π_ba) = τ   for a < b.   (10.28)

It implies that for all a < b,

    P(Y_i1 = a, Y_i2 = b | Y_i1 < Y_i2) = P(Y_i1 = b, Y_i2 = a | Y_i1 > Y_i2).

The pattern of probabilities for cells above the main diagonal is a mirror
image of the pattern for cells below it. This property is called conditional
symmetry (McCullagh 1978). Problem 10.35 shows the corresponding loglinear
model and its fit. Symmetry is the special case τ = 0.
Another model generalizes quasi-independence. Let {u_a} be ordered scores.
The model

    log μ_ab = λ + λ_a^X + λ_b^Y + β u_a u_b + δ_a I(a = b)   (10.29)

permits linear-by-linear association [see (9.6)] off the main diagonal. It is
a special case of quasi-symmetry, and quasi-independence is the special case
β = 0. For equal-interval scores, it implies uniform local association, given
that responses differ. Goodman (1979a) called it quasi-uniform association.
For Table 10.5 on opinions about premarital and extramarital sex, the
conditional symmetry model has τ̂ = −4.130 (SE = 0.451). The estimated
probability that extramarital sex is considered more wrong is
exp(4.13) = 62.2 times the estimated probability that premarital sex is
considered more wrong. The quasi-uniform association model has β̂ = 0.632
(SE = 0.106). Off the main diagonal, the estimated local odds ratio equals
exp(0.632) = 1.88.
10.5 MEASURING AGREEMENT BETWEEN OBSERVERS
We now discuss an application that uses matched-pairs models: analyzing
agreement between two observers. We illustrate with Table 10.8, which shows
ratings by two pathologists, labeled A and B, who separately classified 118
slides regarding the presence and extent of carcinoma of the uterine cervix.
The rating scale has the ordered categories (1) negative, (2) atypical
squamous hyperplasia, (3) carcinoma in situ, (4) squamous or invasive
carcinoma.
10.5.1 Agreement: Departures from Independence
Let π_ab denote the probability that observer A classifies a slide in
category a and observer B classifies it in category b. Then π_aa is the
probability that they both choose category a, and Σ_a π_aa is the total
probability of agreement. Perfect agreement occurs when Σ_a π_aa = 1.
With subjective scales, agreement is less than perfect. Analyses focus on
describing strength of agreement and detecting patterns of disagreement.
TABLE 10.8 Diagnoses of Carcinoma

                         Pathologist Bᵃ
Pathologist A   1          2          3           4          Total
1               22 (8.5)   2 (−0.5)   2 (−5.9)    0 (−1.8)   26
2               5 (−0.5)   7 (3.2)    14 (−0.5)   0 (−1.8)   26
3               0 (−4.1)   2 (−1.2)   36 (5.5)    0 (−2.3)   38
4               0 (−3.3)   1 (−1.3)   17 (0.3)    10 (5.9)   28
Total           27         12         69          10         118

ᵃ Values in parentheses are standardized Pearson residuals for the
independence model.
Source: N. S. Holmquist, C. A. McMahon, and O. D. Williams, Arch. Pathol. 84:
334–345 (1967); reprinted with permission from the American Medical
Association. See also Landis and Koch (1977).
Agreement and association are distinct facets of the joint distribution.
Strong agreement requires strong association, but strong association can
exist without strong agreement. If observer A consistently rates subjects one
category higher than observer B, the strength of agreement is poor even
though the association is strong.
Evaluations of agreement compare {n_ab} to the values {n_{a+} n_{+b}/n}
predicted under independence. That model is a baseline, showing the agreement
expected if no association existed between ratings. Normally, it fits poorly
if even mild agreement exists, but its cell standardized residuals (Section
3.3.1) show patterns of agreement and disagreement. Ideally, standardized
residuals are large and positive on the main diagonal and large and negative
off that diagonal. The sizes are influenced by the sample size n, however,
larger values tending to occur as n increases.
The independence model fits Table 10.8 poorly (G² = 118.0, df = 9). That
table reports the standardized Pearson residuals in parentheses. The large
positive residuals on the main diagonal indicate that agreement for each
category is greater than expected by chance, especially for the first
category. Off the main diagonal they are primarily negative. Disagreements
occurred less often than expected under independence, although the evidence
of this is weaker for categories closer together. The most common
disagreements were observer B choosing category 3 and observer A instead
choosing category 2 or 4.
10.5.2 Using Quasi-independence to Analyze Agreement
More complex models add components that relate to agreement beyond that
expected under independence. A useful generalization is quasi-independence
TABLE 10.9 Fitted Values for Carcinoma Diagnoses of Table 10.8

                              Pathologist Bᵃ
Pathologist A   1               2               3                 4
1               22 (22; 22)     2 (0.7; 2.4)    2 (3.3; 1.6)      0 (0.0; 0.0)
2               5 (2.4; 4.6)    7 (7; 7)        14 (16.6; 14.4)   0 (0.0; 0.0)
3               0 (0.8; 0.4)    2 (1.2; 1.6)    36 (36; 36)       0 (0.0; 0.0)
4               0 (1.9; 0.0)    1 (3.0; 1.0)    17 (13.1; 17.0)   10 (10; 10)

ᵃ Each cell shows the observed count followed by (quasi-independence fit;
quasi-symmetry fit).
(10.23), which adds main-diagonal parameters {δ_a}. For Table 10.8, this
model has G² = 13.2 (df = 5). It fits much better than independence, but some
lack of fit remains. Table 10.9 shows the fit.
For two subjects, suppose that each observer classifies one in category a and
one in category b. The odds that the observers agree rather than disagree on
which is in category a and which is in category b equal

    τ_ab = (π_aa π_bb)/(π_ab π_ba) = (μ_aa μ_bb)/(μ_ab μ_ba).   (10.30)

As τ_ab increases, the observers are more likely to agree for that pair of
categories. Under quasi-independence,

    τ_ab = exp(δ_a + δ_b).

Larger {δ_a} represent stronger agreement. For instance, for Table 10.8,
δ̂_2 = 0.6 and δ̂_3 = 1.9, and τ̂_23 = 12.3. The degree of agreement also
seems fairly strong for other pairs of categories.
10.5.3 Quasi-symmetry and Agreement Modeling
For Table 10.8, the quasi-independence model shows some lack of fit. Given
that the pathologists disagree, some association remains between ratings. For
observer agreement tables, this is common. Quasi-symmetry (10.19) often fits
much better, because it permits association. For Table 10.8, it has G² = 1.0
(df = 2). Table 10.9 displays the fit. It is not unusual for tables to have
many empty cells. When n_ab + n_ba = 0 for any pair (such as categories 1 and
4 in Table 10.8), the ML fitted values for quasi-symmetry in those cells must
also be zero, since one of its likelihood equations is
μ̂_ab + μ̂_ba = n_ab + n_ba. One should eliminate those cells from the
fitting process to get the proper residual df value.
Under quasi-symmetry, τ̂_ab = exp(λ̂_aa + λ̂_bb − λ̂_ab − λ̂_ba), where
λ̂_ab = λ̂_ba. For categories 2 and 3 of Table 10.8, for instance,
τ̂_23 = 10.7.
Loglinear models directly address the association component of agreement.
The quasi-symmetry model also yields information about similarity of
marginal distributions. The simpler symmetry model that forces the margins
to be identical fits Table 10.8 poorly (G² = 39.2, df = 5). The statistic
G²(S | QS) = 39.2 − 1.0 = 38.2 (df = 3) provides strong evidence of marginal
heterogeneity. In Table 10.8, differences in marginal proportions are
substantial in each category but the first. The marginal heterogeneity is one
reason that the agreement is not stronger.
Models for agreement can take the ordering of categories into account.
Conditional on observer disagreement, a tendency usually remains for high
(low) ratings by one observer to occur with relatively high (low) ratings by
the other observer (see Problem 10.41).
10.5.4 Kappa Measure of Agreement
An alternative approach summarizes agreement with a single index. For nominal
scales, the most popular measure is Cohen's kappa (Cohen 1960). It compares
the probability of agreement, Σ_a π_aa, to that expected if the ratings were
independent, Σ_a π_{a+} π_{+a}, by

    κ = (Σ_a π_aa − Σ_a π_{a+} π_{+a}) / (1 − Σ_a π_{a+} π_{+a}).

The denominator equals the numerator with Σ_a π_aa replaced by its maximum
possible value of 1, corresponding to perfect agreement. Kappa equals 0 when
the agreement merely equals that expected under independence, and it equals
1.0 when perfect agreement occurs. The stronger the agreement, the higher the
value of κ, for given marginal distributions. Negative values occur when
agreement is weaker than expected by chance, but this rarely happens.
For multinomial sampling, the sample value κ̂ has a large-sample normal
distribution. Its estimated asymptotic variance (Fleiss et al. 1969) is

    σ̂²(κ̂) = (1/n) { P_o(1 − P_o)/(1 − P_e)²
             + 2(1 − P_o)[2 P_o P_e − Σ_a p_aa(p_{a+} + p_{+a})]/(1 − P_e)³
             + (1 − P_o)²[Σ_a Σ_b p_ab(p_{b+} + p_{+a})² − 4P_e²]/(1 − P_e)⁴ },
where P_o = Σ_a p_aa and P_e = Σ_a p_{a+} p_{+a}. It is rarely plausible that
agreement is no better than expected by chance. Thus, rather than testing
H₀: κ = 0, it is more relevant to estimate the strength of agreement by
interval estimation of κ.
For Table 10.8, P_o = 0.636 and P_e = 0.281. Sample kappa equals
(0.636 − 0.281)/(1 − 0.281) = 0.493. The difference between observed
agreement and that expected under independence is about 50% of the maximum
possible difference. The estimated standard error is 0.057, so κ apparently
falls roughly between 0.4 and 0.6, moderately strong agreement.
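The following sketch (Python with NumPy; an illustration, not the book's own
computation) evaluates κ̂ and the Fleiss et al. (1969) standard error for
Table 10.8.

```python
import numpy as np

n = np.array([[22, 2, 2, 0],
              [5, 7, 14, 0],
              [0, 2, 36, 0],
              [0, 1, 17, 10]], dtype=float)    # Table 10.8 counts
p = n / n.sum()
prow, pcol = p.sum(axis=1), p.sum(axis=0)

Po = np.diag(p).sum()                          # observed agreement
Pe = (prow * pcol).sum()                       # chance agreement
kappa = (Po - Pe) / (1 - Pe)

# Estimated asymptotic variance of kappa-hat (Fleiss et al. 1969).
t1 = Po * (1 - Po) / (1 - Pe) ** 2
t2 = (2 * (1 - Po)
      * (2 * Po * Pe - (np.diag(p) * (prow + pcol)).sum())
      / (1 - Pe) ** 3)
# sq[a, b] multiplies p_ab by (p_{b+} + p_{+a})^2
sq = (p * np.add.outer(pcol, prow) ** 2).sum()
t3 = (1 - Po) ** 2 * (sq - 4 * Pe ** 2) / (1 - Pe) ** 4
se = np.sqrt((t1 + t2 + t3) / n.sum())
print(f"kappa = {kappa:.3f}, SE = {se:.3f}")
```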
10.5.5 Weighted Kappa: Quantifying Disagreement
Kappa treats classifications as nominal. When categories are ordered, the
seriousness of a disagreement depends on the difference between the ratings.
For nominal classifications also, some disagreements may be considered more
severe than others. The measure weighted kappa (Spitzer et al. 1967) uses
weights {w_ab} satisfying 0 ≤ w_ab ≤ 1, with all w_aa = 1 and all
w_ab = w_ba, to describe closeness of agreement. One possibility is
{w_ab = 1 − |a − b|/(I − 1)}, for which agreement is greater for cells nearer
the main diagonal. Fleiss and Cohen (1973) suggested
{w_ab = 1 − (a − b)²/(I − 1)²}. The weighted agreement is Σ_a Σ_b w_ab π_ab,
and weighted kappa is

    κ_w = (Σ_a Σ_b w_ab π_ab − Σ_a Σ_b w_ab π_{a+} π_{+b}) / (1 − Σ_a Σ_b w_ab π_{a+} π_{+b}).
Controversy surrounds the utility of kappa and weighted kappa, partly because
their values depend strongly on the marginal distributions. The same
diagnostic rating process can yield quite different values, depending on the
proportions of cases of the various types (Problem 10.40). In summarizing a
contingency table by a single number, the reduction in information can be
severe. It is helpful to construct models providing a more detailed
investigation of the agreement and disagreement structure rather than to
depend solely on a summary index.
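Weighted kappa is a one-line generalization of the unweighted version. A
minimal sketch (Python with NumPy), applied to Table 10.8 with the linear
weights {w_ab = 1 − |a − b|/(I − 1)}; with identity weights (w_ab = 1 only
when a = b) it reduces to ordinary kappa.

```python
import numpy as np

def weighted_kappa(n, w):
    """Weighted kappa for a square table of counts n and weight matrix w."""
    n = np.asarray(n, dtype=float)
    w = np.asarray(w, dtype=float)
    p = n / n.sum()
    prow, pcol = p.sum(axis=1), p.sum(axis=0)
    obs = (w * p).sum()                       # weighted agreement
    exp = (w * np.outer(prow, pcol)).sum()    # weighted chance agreement
    return (obs - exp) / (1 - exp)

n = [[22, 2, 2, 0], [5, 7, 14, 0], [0, 2, 36, 0], [0, 1, 17, 10]]
I = 4
linear = np.array([[1 - abs(a - b) / (I - 1) for b in range(I)]
                   for a in range(I)])        # w_ab = 1 - |a-b|/(I-1)
print(round(weighted_kappa(n, linear), 3))
```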
10.5.6 Extensions to Multiple Observers
With several observers, ordinary loglinear models are not usually relevant.
Their description of agreement and association between two observers is
conditional on ratings by the others. It is more relevant to study this
marginally, without conditioning on the other ratings. Hence, for R
observers, modeling the pairwise agreement and association structure
simultaneously requires studying the R(R − 1)/2 pairs of two-way marginal
distributions (Becker and Agresti 1992).
Other approaches have also been used. For instance, generalizations of kappa
summarize pairwise agreements or multiple agreements (Fleiss 1981, Sec. 13.2;
Landis and Koch 1977). Or, it may make sense to use a mixture model that
assumes latent classes of subjects for whom the observers agree and subjects
for whom they disagree. Such an analysis is shown in Section 13.1.2.
10.6 BRADLEY–TERRY MODEL FOR PAIRED PREFERENCES
Sometimes, categorical outcomes result from pairwise evaluations. A common
example is athletic competition, where the outcome for a team or player
consists of the categories (win, lose). Another example is pairwise
comparison of product brands, such as two brands of wine of some type. When a
wine critic rates I brands of sauvignon blanc, it might be difficult to
establish an outright ranking, especially if I is large. However, for any
given pair, the critic could probably state a preference after tasting them
on the same occasion. An overall ranking of the wines could then be based on
the pairwise preferences. We present a model for this in this section.
10.6.1 Bradley–Terry Model
Bradley and Terry (1952) proposed a logit model for paired evaluations. Let
Π_ab denote the probability that a is preferred to b. Suppose that
Π_ab + Π_ba = 1 for all pairs; that is, a tie cannot occur. The Bradley–Terry
model is

    log(Π_ab/Π_ba) = β_a − β_b.   (10.31)

Alternatively,

    Π_ab = exp(β_a)/[exp(β_a) + exp(β_b)].

Thus, Π_ab = 1/2 when β_a = β_b and Π_ab > 1/2 when β_a > β_b.
Identifiability requires a constraint such as β_I = 0 or Σ_a exp(β_a) = 1.
Since the model describes the I(I − 1)/2 probabilities {Π_ab for a < b} by
(I − 1) parameters, residual df = I(I − 1)/2 − (I − 1).
For a < b, let N_ab denote the sample number of evaluations, with a preferred
n_ab times and b preferred n_ba = N_ab − n_ab times. A square contingency
table with empty cells on the main diagonal summarizes the results. When the
N_ab comparisons are independent with probability Π_ab for each, n_ab has a
bin(N_ab, Π_ab) distribution. If evaluations for different pairs are also
independent, ordinary methods for logit models apply for fitting the model.
TABLE 10.10 Results of 1987 Season for American League Baseball Teams

                                      Losing Teamᵃ
Winning
Team        Milwaukee  Detroit   Toronto   New York  Boston    Cleveland  Baltimore
Milwaukee   —          7 (7.0)   9 (7.4)   7 (7.6)   7 (8.0)   9 (9.2)    11 (10.8)
Detroit     6 (6.0)    —         7 (7.0)   5 (7.1)   11 (7.6)  9 (8.8)    9 (10.5)
Toronto     4 (5.6)    6 (6.0)   —         7 (6.7)   7 (7.1)   8 (8.4)    12 (10.2)
New York    6 (5.4)    8 (5.9)   6 (6.3)   —         6 (7.0)   7 (8.3)    10 (10.1)
Boston      6 (5.0)    2 (5.4)   6 (5.9)   7 (6.0)   —         7 (7.9)    12 (9.8)
Cleveland   4 (3.8)    4 (4.2)   5 (4.6)   6 (4.7)   6 (5.1)   —          6 (8.6)
Baltimore   2 (2.2)    4 (2.5)   1 (2.8)   3 (2.9)   1 (3.2)   7 (4.4)    —

ᵃ Values in parentheses represent the fit of the Bradley–Terry model.
Source: American League Red Book, 1988 (St. Louis, MO: Sporting News
Publishing Co.).
10.6.2 Home Team Advantage in Baseball
Table 10.10 shows results of the 1987 season for the seven baseball teams in
the Eastern Division of the American League. For instance, of the games
between Boston and New York, Boston won 7 and New York won 6. Table 10.10
shows the population of regular-season games. We regard this as a sample
estimate of a conceptual distribution representing the long-run performance
of the teams as constituted in 1987.
We fitted the Bradley–Terry model as a logit model for the 21 independent
binomial samples, using an appropriate model matrix and no intercept (e.g.,
for SAS, see Table A.19). The model fits adequately (G² = 15.7, df = 15).
Table 10.10 contains the fitted values {μ̂_ab}. Table 10.11 displays the
sample proportion of games each team won and the model estimates {β̂_a}
(setting β̂_7 = 0) and {exp(β̂_a)} [setting Σ_a exp(β̂_a) = 1]. When Boston
played New York, the estimated probability that Boston won is

    Π̂_54 = exp(β̂_5)/[exp(β̂_5) + exp(β̂_4)] = 0.46.

The standard error of each β̂_a and of each β̂_a − β̂_b is about 0.3, so not
much evidence exists of a difference among the top five teams.
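The ML fit can also be obtained without logit-model software, using
Zermelo's classical iterative (MM) algorithm for the Bradley–Terry model. The
sketch below (Python with NumPy; an illustration, not the book's SAS fit)
uses the win counts of Table 10.10.

```python
import numpy as np

# wins[a, b] = number of times row team a beat column team b (Table 10.10).
wins = np.array([
    [0, 7, 9, 7, 7, 9, 11],      # Milwaukee
    [6, 0, 7, 5, 11, 9, 9],      # Detroit
    [4, 6, 0, 7, 7, 8, 12],      # Toronto
    [6, 8, 6, 0, 6, 7, 10],      # New York
    [6, 2, 6, 7, 0, 7, 12],      # Boston
    [4, 4, 5, 6, 6, 0, 6],       # Cleveland
    [2, 4, 1, 3, 1, 7, 0]],      # Baltimore
    dtype=float)
games = wins + wins.T            # N_ab = 13 for every pair
w = wins.sum(axis=1)             # total wins for each team

# Zermelo/MM iteration for the ML estimates of p_a = exp(beta_a),
# normalized so that sum_a p_a = 1.  The diagonal contributes nothing,
# since games[a, a] = 0.
p = np.full(7, 1 / 7)
for _ in range(2000):
    p = w / (games / np.add.outer(p, p)).sum(axis=1)
    p /= p.sum()

prob_boston_beats_ny = p[4] / (p[4] + p[3])
print(np.round(p, 3), round(prob_boston_beats_ny, 2))
```

At convergence the fit satisfies the likelihood equations: each team's
expected number of wins equals its observed number.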
TABLE 10.11 Results of Fitting Bradley–Terry Models to Baseball Data

             Winning      β̂_i        exp(β̂_i)    exp(β̂_i)
Team         Percentage   (10.31)     (10.31)      (10.32)
Milwaukee    64.1         1.58        0.218        0.220
Detroit      60.2         1.44        0.189        0.190
Toronto      56.4         1.29        0.164        0.164
New York     55.1         1.25        0.158        0.157
Boston       51.3         1.11        0.136        0.137
Cleveland    39.7         0.68        0.089        0.088
Baltimore    23.1         0.00        0.045        0.044
438
MODELS FOR MATCHED PAIRS
TABLE 10.12 Wins/Losses by Home and Away Team, 1987

                                   Away Team
Home Team    Milwaukee  Detroit  Toronto  New York  Boston  Cleveland  Baltimore
Milwaukee       —         4-3      4-2      4-3      6-1      4-2        6-0
Detroit        3-3         —       4-2      4-3      6-0      6-1        4-3
Toronto        2-5        4-3       —       2-4      4-3      4-2        6-0
New York       3-3        5-1      2-5       —       4-3      4-2        6-1
Boston         5-1        2-5      3-3      4-2       —       5-2        6-0
Cleveland      2-5        3-3      3-4      4-3      4-2       —         2-4
Baltimore      2-5        1-5      1-6      2-4      1-6      3-4         —

Source: American League Red Book, 1988 (St. Louis, MO: Sporting News Publishing Co.).
This model does not recognize which team is the home team. Most sports have a home field advantage: A team is more likely to win when it plays in its home city. Table 10.12 contains results for the 1987 season according to the (home team, away team) classification. For instance, when Boston was the home team, it beat New York 4 times and lost 2 times; when New York was the home team, it beat Boston 4 times and lost 3 times. Now for all a ≠ b, let Π*_ab denote the probability that team a beats team b, when a is the home team. Consider the logit model

    log[Π*_ab / (1 − Π*_ab)] = α + (β_a − β_b).    (10.32)

When α > 0, a home field advantage exists. The home team of two evenly matched teams has probability exp(α)/[1 + exp(α)] of winning.

For Table 10.12, model (10.32) describes 42 binomial distributions with 7 parameters. It has G² = 38.6 (df = 35). Table 10.11 displays {exp(β̂_a)}, which are similar to those obtained previously. The estimate of the home-field parameter is α̂ = 0.302. For two evenly matched teams, the home team had estimated probability 0.575 of winning. When Boston played New York, the estimated probability of a Boston win was 0.54 at Boston and 0.39 at New York.

Model (10.32) is a useful generalization of the Bradley–Terry model whenever an order effect exists. For instance, in pairwise taste evaluations, the product tasted first may have a slight advantage.
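The probabilities quoted for model (10.32) follow directly from the reported estimates. This sketch reproduces them using α̂ = 0.302 and the exp(β̂) values read off the (10.32) column of Table 10.11:

```python
import math

# Sketch: reproducing the home-advantage probabilities from the reported
# estimates for model (10.32): alpha-hat = 0.302, exp(beta-hat) from Table 10.11.
alpha = 0.302
exp_beta = {"Boston": 0.137, "New York": 0.157}

def home_win_prob(alpha, exp_beta_home, exp_beta_away):
    """P(home team wins) = logistic(alpha + beta_home - beta_away)."""
    logit = alpha + math.log(exp_beta_home) - math.log(exp_beta_away)
    return 1.0 / (1.0 + math.exp(-logit))

# Two evenly matched teams: P(home win) = exp(alpha)/[1 + exp(alpha)]
even = 1.0 / (1.0 + math.exp(-alpha))

# Boston vs. New York, at each site:
p_boston_at_home = home_win_prob(alpha, exp_beta["Boston"], exp_beta["New York"])
p_boston_away = 1.0 - home_win_prob(alpha, exp_beta["New York"], exp_beta["Boston"])
```

The three results round to the 0.575, 0.54, and 0.39 quoted in the text.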
10.6.3 Bradley–Terry Model and Quasi-symmetry
Fienberg and Larntz (1976) showed that the Bradley–Terry model is a logit formulation of the quasi-symmetry model (10.19). For quasi-symmetry, given that an observation is in cell (a, b) or (b, a), the logit of the conditional probability of cell (a, b) equals

    log(μ_ab / μ_ba) = (λ + λ^X_a + λ^Y_b + λ^XY_ab) − (λ + λ^X_b + λ^Y_a + λ^XY_ba)
                     = (λ^X_a − λ^Y_a) − (λ^X_b − λ^Y_b) = β_a − β_b,

where β_a = λ^X_a − λ^Y_a. Estimates {λ̂^X_a} and {λ̂^Y_a} for quasi-symmetry yield {β̂_a} for the Bradley–Terry model.
10.6.4 Extensions to Ties and Ordinal Evaluations
The Bradley–Terry model extends to ordinal comparisons, such as the evaluation scale (much better, slightly better, the same, slightly worse, much worse) in comparing two products. With cumulative logits and an I-category evaluation scale, let Y_ab denote the response for a comparison of a with b. The model is

    logit[P(Y_ab ≤ j)] = α_j + (β_a − β_b).

Since P(Y_ab ≤ j) = P(Y_ba > I − j) = 1 − P(Y_ba ≤ I − j), it follows that logit[P(Y_ab ≤ j)] = −logit[P(Y_ba ≤ I − j)]. Thus, necessarily, α_j = −α_{I−j}. The most common ordered preference scale is (win, tie, lose). Then, α_1 = −α_2.
10.7 MARGINAL AND QUASI-SYMMETRY MODELS FOR MATCHED SETS*
Methods for matched pairs extend to matched sets. Here we present mainly
the loglinear modeling approach; in Chapters 11 and 12 we present extensions of the marginal and conditional logit modeling approaches.
10.7.1 Marginal Homogeneity, Complete Symmetry, and Quasi-symmetry
Let (Y_1, Y_2, ..., Y_T) denote the T responses in each matched set. With I response categories, a contingency table with I^T cells summarizes the possible outcomes. Let i = (i_1, ..., i_T) denote the cell having Y_t = i_t, t = 1, ..., T. Let π_i = P(Y_t = i_t, t = 1, ..., T), and let μ_i = nπ_i. Then

    P(Y_t = j) = π_{+···+j+···+},

where the j subscript is in position t, and {P(Y_t = j), j = 1, ..., I} is the marginal distribution for Y_t.
This T-way table satisfies marginal homogeneity if

    P(Y_1 = j) = P(Y_2 = j) = ··· = P(Y_T = j)    for j = 1, ..., I.

It satisfies complete symmetry if

    π_i = π_j

for any permutation j = (j_1, ..., j_T) of i = (i_1, ..., i_T). Complete symmetry implies marginal homogeneity, but the converse does not hold except when T = I = 2.
Complete symmetry is a loglinear model. One representation is

    log μ_i = λ_{ab...m},

where a is the minimum of (i_1, ..., i_T), b is the next smallest, ..., and m is the maximum. In a three-way table, for instance, log μ_122 = log μ_212 = log μ_221 = λ_122. The number of {λ_{ab...m}} parameters is the number of ways of selecting T out of I items with replacement, which is C(I + T − 1, T). Thus,

    residual df = I^T − C(I + T − 1, T)

(Haberman 1978, p. 518).
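The parameter count and residual df can be checked with a one-line computation, sketched here:

```python
from math import comb

# Sketch: the complete symmetry model has C(I + T - 1, T) parameters
# (T items chosen from I with replacement), so residual df = I**T - C(I+T-1, T).
def symmetry_residual_df(I, T):
    return I**T - comb(I + T - 1, T)

# A 2x2x2 table (I = 2, T = 3): 4 symmetry parameters, residual df = 8 - 4 = 4.
df_2cubed = symmetry_residual_df(2, 3)
```

For T = 2 this reduces to the familiar I(I − 1)/2 residual df of the two-way symmetry model (e.g., df = 3 when I = 3).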
An I^T table satisfies quasi-symmetry if

    log μ_i = λ_{1,i_1} + λ_{2,i_2} + ··· + λ_{T,i_T} + λ_{ab...m},    (10.33)

where λ_{ab...m} is defined as in the complete symmetry model. It has symmetric association and higher-order interaction terms, but permits each single-factor marginal distribution to have its own parameters. Identifiability requires constraints such as λ_{t,I} = 0 for each t. One set of main-effect terms is redundant (Problem 10.31). This model has (I − 1)(T − 1) more parameters than complete symmetry. It is fitted using iterative methods.
For ordinal responses, a simpler model with quantitative main effects uses ordered scores {u_a}. The ordinal quasi-symmetry model is

    log μ_i = β_1 u_{i_1} + β_2 u_{i_2} + ··· + β_T u_{i_T} + λ_{ab...m},

where one can set β_T = 0. Complete symmetry is the special case β_1 = ··· = β_T.
When quasi-symmetry (10.33) or ordinal quasi-symmetry holds, marginal homogeneity is equivalent to complete symmetry. Marginal heterogeneity occurs if quasi-symmetry (QS) holds but complete symmetry (S) does not. The statistic

    G²(S | QS) = G²(S) − G²(QS)

tests marginal homogeneity. Under complete symmetry, it is asymptotically chi-squared with df = (I − 1)(T − 1). The corresponding test for the ordinal quasi-symmetry model has df = (T − 1).
10.7.2 Attitudes toward Legalized Abortion Example
Refer to Table 10.13. Subjects indicated whether they support legalized abortion in three situations: (1) if the family has a very low income and cannot afford any more children, (2) when the woman is not married and does not want to marry the man, and (3) when the woman wants it for any reason. The table also classifies subjects by gender, resulting in a 2⁴ table.

Let μ_{ghij} denote the expected frequency for gender g (1 = female; 0 = male) with response sequence (h, i, j) for the three questions. Consider the model

    log μ_{ghij} = β_g + λ_{abc},

where the interaction term is λ_111 when (h, i, j) = (1, 1, 1), λ_112 when (h, i, j) = (1, 1, 2) or (1, 2, 1) or (2, 1, 1), λ_122 when (h, i, j) = (1, 2, 2) or (2, 1, 2) or (2, 2, 1), and λ_222 when (h, i, j) = (2, 2, 2). This model implies the same complete symmetry pattern of probabilities for each gender. Its fit has G² = 39.2 with df = 11.

Adding main-effect terms for the three issues implies the same quasi-symmetric pattern for each gender. It fits much better, having G² = 10.2 with df = 9. Thus, it seems plausible to assume a symmetric association structure. In fact, the loglinear model with only two-factor association terms has fitted log odds ratios of 3.2 for items 1 and 2, 2.6 for items 1 and 3, and 3.3 for items 2 and 3.

One can test marginal homogeneity, given gender, by the likelihood-ratio statistic 39.2 − 10.2 = 29.0, with df = 2. An analysis of the main-effect terms in the quasi-symmetry model shows greater support for legalized abortion when the family has a low income and cannot afford any more children than in the other two instances.
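The likelihood-ratio comparison above can be sketched numerically; for df = 2 the chi-squared survival function has the closed form exp(−x/2), so no statistical library is needed:

```python
import math

# Sketch of the likelihood-ratio comparison in the abortion example:
# complete symmetry by gender: G2 = 39.2 (df = 11);
# quasi-symmetry by gender:    G2 = 10.2 (df = 9).
g2_symmetry, df_symmetry = 39.2, 11
g2_quasi, df_quasi = 10.2, 9

lr_stat = g2_symmetry - g2_quasi     # tests marginal homogeneity given gender
lr_df = df_symmetry - df_quasi       # 11 - 9 = 2

# For df = 2 only, the chi-squared survival function is exactly exp(-x/2).
p_value = math.exp(-lr_stat / 2)
```

The tiny p-value reflects the strong evidence of marginal heterogeneity among the three items.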
TABLE 10.13 Support for Legalizing Abortion in Three Situations, by Gender

                Sequence of Responses on the Three Items^a
Gender   (1,1,1)  (1,1,2)  (2,1,1)  (2,1,2)  (1,2,1)  (1,2,2)  (2,2,1)  (2,2,2)
Male       342      26        6       21       11       32       19       356
Female     440      25       14       18       14       47       22       457

^a Items are (1) if the family has a very low income and cannot afford any more children, (2) when the woman is not married and does not want to marry the man, and (3) when the woman wants it for any reason. 1, yes; 2, no.
Source: Data from 1994 General Social Survey, National Opinion Research Center.
10.7.3 Types of Marginal Symmetry
A general type of symmetry for I^T tables has marginal homogeneity and complete symmetry as special cases. For an I^T table, P(Y_{t_1} = j_1, ..., Y_{t_h} = j_h), where h is between 1 and T, is an h-dimensional marginal probability, h = 1 giving single-variable marginal probabilities. There is hth-order marginal symmetry if for all h-tuples j = (j_1, ..., j_h), this probability is the same for each permutation of j and for all combinations t = (t_1, ..., t_h) of h of the T responses.

For h = 1, first-order marginal symmetry is marginal homogeneity. Second-order marginal symmetry occurs if for all t and u, P(Y_t = a, Y_u = b) is the same and the equality holds for all pairs of outcomes (a, b). In other words, the two-way marginal tables exhibit symmetry, and they are identical. Tth-order marginal symmetry in an I^T table is complete symmetry.
When hth-order symmetry holds, ith-order marginal symmetry holds for any i < h. For instance, complete symmetry implies second-order marginal symmetry, which itself implies marginal homogeneity. Although this hierarchy is mathematically attractive, the higher-order symmetries are usually too restrictive to fit well in practice.
10.7.4 Marginal Models: Multiway Tables
In practice, usually the form of the joint distribution is of secondary interest.
Research questions pertain instead to the marginal distributions. The
marginal models of Section 10.3 for matched pairs extend to matched sets.
For instance, with ordinal classifications, a cumulative logit model is

    logit[P(Y_t ≤ j)] = α_j + β_t,    j = 1, ..., I − 1,  t = 1, ..., T.    (10.34)
In the next chapter we study marginal models in more general contexts,
extending the analyses of this chapter to incorporate matched sets and
explanatory variables.
NOTES
Section 10.1: Comparing Dependent Proportions
10.1. Miettinen (1969) generalized the McNemar test to case–control sets having several controls per case. The Table 10.2 representation is then useful. Each of n matched sets forms a stratum of a 2 × 2 × n table with one observation in column 1 (the case) and several observations in column 2 (the controls).
    Altham (1971) and Ghosh et al. (2000) presented Bayesian analyses for binary matched pairs. Copas (1973), Gart (1969), Kenward and Jones (1994), and Miettinen (1969) studied generalizations of matched-pairs designs. With some approaches (Ghosh et al. 2000; Liang and Zeger 1988; Suissa and Shuster 1991), inferences about marginal homogeneity also use the main-diagonal observations.
Section 10.4: Symmetry, Quasi-symmetry, and Quasi-independence

10.2. For other discussion of quasi-symmetry, see Darroch (1981) and McCullagh (1982). The term quasi-independence originated in Goodman (1968). A more general definition of it is μ_ab = α_a β_b for some fixed set of cells. See Caussinus (1966), Fienberg (1970b, 1972), and Goodman (1968). Caussinus used the concept to analyze tables that deleted a certain set of cells from consideration, and Goodman used it in earlier analyses of social mobility. Altham (1975) used it with triangular tables, for which observations occur only above or only below the main diagonal. Stigler (1999, Chap. 19) summarized early uses, including Karl Pearson's handling in 1913 of a triangular array. Booth and Butler (1999) and Smith et al. (1996) discussed exact tests for square-table models.
10.3. The effect β in ordinal quasi-symmetry relates to the occasion effect in a subject-specific adjacent-categories-logit model (Agresti 1993). Conditional symmetry is a special case of diagonals-parameter symmetry,

    log(π_ab / π_ba) = δ_{b−a},    a < b.

See Goodman (1979b, 1985) and Hout et al. (1987).
10.4. In some applications a table is a priori symmetric or independent, but one can observe only the pair (i, j) rather than their order, thus leading to an upper-triangular table. See Khamis (1983) for examples and ML fitting of models for such three-way tables that are symmetric within layers.
Section 10.5: Measuring Agreement between Observers

10.5. Kappa and weighted kappa relate to the intraclass correlation, a measure of interrater reliability for interval scales (Fleiss 1981; Fleiss and Cohen 1973; Kraemer 1979). Banerjee et al. (1999) and Fleiss (1981, Chap. 13) reviewed kappa and its generalizations. See Becker and Agresti (1992), Goodman (1979b), Tanner and Young (1985), and Problem 10.41 for examples of modeling agreement with loglinear models. Darroch and McCloud (1986) showed that quasi-symmetry has an important role in agreement modeling.
Section 10.6: Bradley–Terry Model for Paired Preferences

10.6. Zermelo (1929) proposed a model that is equivalent to the Bradley–Terry model. Luce (1959) provided an axiomatic basis for it. Mosteller (1951) and Thurstone (1927) proposed an analogous model with probit link. An interesting interview of Ralph Bradley by M. Hollander (Stat. Sci. 16: 75–100, 2001) discussed food-tasting applications that motivated its development. For extensions, see Bradley (1976). Fienberg and Larntz (1976) and Imrey et al. (1976) related it to quasi-independence. Dittrich et al. (1998) allowed covariates. Matthews and Morris (1995) gave an application with a factorial design, ties, and allowance for dependence among judgments. Böckenholt and Dillon (1997) modeled dependence with ordinal preferences. David (1988) and Imrey (1998) surveyed paired preference methods.
TABLE 10.14 Data for Problem 10.1

           Let Patient Die
Suicide    Yes      No
Yes        1097     90
No          203     435

Source: 1994 General Social Survey, National Opinion Research Center.
PROBLEMS
Applications
10.1 Table 10.14 shows results when subjects were asked "Do you think a person has the right to end his or her own life if this person has an incurable disease?" and "When a person has a disease that cannot be cured, do you think doctors should be allowed to end the patient's life by some painless means if the patient and his family request it?" The table refers to these variables as "suicide" and "let patient die."
a. Compare the marginal proportions using a confidence interval.
b. Perform McNemar's test, and interpret.
c. Find the conditional ML estimate of β for model (10.8). Interpret.
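A sketch of the computations this problem asks for, using the Table 10.14 counts and the standard closed forms (McNemar's statistic depends only on the discordant counts, and the conditional ML estimate is log(n_21/n_12)):

```python
import math

# Table 10.14 counts: rows = suicide (yes, no), columns = let patient die (yes, no)
n11, n12, n21, n22 = 1097, 90, 203, 435
n = n11 + n12 + n21 + n22

# (a) marginal proportions answering "yes" to each item
p_suicide = (n11 + n12) / n        # row margin: supports suicide
p_let_die = (n11 + n21) / n        # column margin: supports letting patient die

# (b) McNemar statistic: uses only the discordant counts n12 and n21
mcnemar = (n21 - n12) ** 2 / (n21 + n12)

# (c) conditional ML estimate of beta in the subject-specific model (10.8)
beta_hat = math.log(n21 / n12)
se_beta = math.sqrt(1 / n21 + 1 / n12)
```

Support for letting the patient die exceeds support for suicide, and the McNemar statistic is far beyond any chi-squared critical value with df = 1.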
10.2 Refer to Table 8.16 and Problem 8.1. Treat the data as matched pairs on opinion, stratified by gender. Testing independence for the 2 × 2 table using entries (6, 160) in row 1 and (11, 181) in row 2 tests equality of β for logit model (10.8) for each gender. Explain why.
10.3 A crossover experiment with 100 subjects compares two drugs for treating migraine headaches. The response scale is success (1) or failure (0). Half the study subjects, randomly selected, used drug A the first time they had a headache and drug B the next time. For them, 6 had outcomes (1, 1) for (A, B), 25 had outcomes (1, 0), 10 had outcomes (0, 1), and 9 had outcomes (0, 0). For the 50 subjects who took the drugs in the reverse order, 10 were (1, 1) for (A, B), 20 were (1, 0), 12 were (0, 1), and 8 were (0, 0).
a. Ignoring treatment order, compare the success probabilities for the two drugs. Interpret.
b. McNemar's test uses only the pairs of outcomes that differ. For this study, Table 10.15 shows such data from both treatment orders. Testing independence for this table tests whether success rates are identical for the treatments (Gart 1969). Explain why. Analyze these data, and interpret.
TABLE 10.15 Data for Problem 10.3

Treatment     Treatment That Is Better
Order         First    Second
A, then B      25       10
B, then A      12       20
10.4 A case–control study has 8 pairs of subjects. The cases have colon cancer, and the controls are matched with the cases on gender and age. A possible explanatory variable is the extent of red meat in a subject's diet, measured as "1 = high" or "0 = low." The (case, control) observations on this were (1, 1) for 3 pairs, (0, 0) for 1 pair, (1, 0) for 3 pairs, and (0, 1) for 1 pair.
a. Cross-classify the 8 pairs in terms of diet (1 or 0) for the case against diet (1 or 0) for the control. Call this Table A. Display the 2 × 2 × 8 table with eight partial tables relating diet (1 or 0) to response (case or control) for the 8 pairs. Call this Table B.
b. Calculate the McNemar z² for Table A and the CMH statistic for Table B. Compare.
c. Show that the Mantel–Haenszel estimate of a common odds ratio for Table B is identical to n_12/n_21 for Table A.
d. For Table B with pairs deleted in which the case and the control had the same diet, show that the CMH statistic and the Mantel–Haenszel odds ratio estimate do not change.
e. This sample size is small for large-sample tests. Use the binomial distribution with Table A to find the exact P-value for testing marginal homogeneity against the alternative hypothesis of a higher incidence of colon cancer for the high-red-meat diet.
10.5 Each week Variety magazine summarizes reviews of new movies by
critics in several cities. Each review is categorized as pro, con, or
mixed, according to whether the overall evaluation is positive, negative, or a mixture of the two. Table 10.16 summarizes the ratings from
TABLE 10.16 Data for Problem 10.5

                 Ebert
Siskel    Con    Mixed    Pro
Con        24      8       13
Mixed       8     13       11
Pro        10      9       64

Source: A. Agresti and L. Winner, CHANCE 10: 10–14 (1997), reprinted with permission, copyright 1997 by the American Statistical Association.
April 1995 through September 1996 for Chicago film critics Gene Siskel and Roger Ebert.
a. Fit the symmetry model, quasi-independence model, and quasi-symmetry model. Interpret.
b. Test marginal homogeneity using models, and interpret.
c. Analyze these data using agreement models and/or measures of agreement.
10.6 Refer to Table 10.5. Fit the ordinal quasi-symmetry model using u_1 = 1 and u_4 = 4 and picking u_2 and u_3 that are unequally spaced but represent sensible choices. Compare results and interpretations to those in Sections 10.3.2 and 10.4.7.
10.7 Refer to all four items in Table 8.19.
a. Fit the complete symmetry and quasi-symmetry models. Test
marginal homogeneity. Interpret.
b. Fit the ordinal quasi-symmetry model. Test marginal homogeneity.
Interpret the effects.
10.8 Table 10.17 shows subjects’ purchase choice of instant decaffeinated
coffee at two times.
a. Fit the symmetry model and use residuals to analyze changes.
b. Test marginal homogeneity. Show that the small P-value reflects a
decrease in the proportion choosing High Point and an increase in
the proportion choosing Sanka, with no evidence of change for the
other coffees.
c. Show that quasi-independence has G² = 13.8 (df = 11). Interpret, and suggest other analyses that might be useful.
TABLE 10.17 Data for Problem 10.8

                        Second Purchase
First             High    Taster's
Purchase          Point   Choice    Sanka   Nescafe   Brim
High Point          93      17        44       7       10
Taster's Choice      9      46        11       0        9
Sanka               17      11       155       9       12
Nescafe              6       4         9      15        2
Brim                10       4        12       2       27

Source: Based on data from R. Grover and V. Srinivasan, J. Market. Res. 24: 139–153 (1987). Reprinted with permission from the American Marketing Association.
TABLE 10.18 Data for Problem 10.9

Father's                Son's Status
Status       1      2      3      4      5     Total
1           50     45      8     18      8      129
2           28    174     84    154     55      495
3           11     78    110    223     96      518
4           14    150    185    714    447     1510
5            3     42     72    320    411      848
Total      106    489    459   1429   1017     3500

Source: Reprinted with permission from D. V. Glass (ed.), Social Mobility in Britain, Glencoe, IL: Free Press (1954).
10.9 Table 10.18 relates father's and son's occupational status for a British sample. Analyze these data, using models of (a) symmetry, (b) quasi-symmetry, (c) ordinal quasi-symmetry, (d) conditional symmetry, (e) marginal homogeneity, (f) quasi-independence, and (g) quasi-uniform association. Interpret using their fit and lack of fit.
10.10 For Table 10.18, use kappa to describe agreement. Interpret.
10.11 Table 10.19 displays multiple sclerosis diagnoses for two neurologists who classified patients in two sites, Winnipeg and New Orleans. The diagnostic classes are (1) certain; (2) probable; (3) possible; and (4) doubtful, unlikely, or definitely not. For the New Orleans patients, study the agreement using (a) the independence model and residuals, (b) more complex models, and (c) kappa. Interpret each.
TABLE 10.19 Data for Problem 10.11

                       Winnipeg Neurologist
New Orleans     Winnipeg Patients      New Orleans Patients
Neurologist     1    2    3    4       1    2    3    4
1              38    5    0    1       5    3    0    0
2              33   11    3    0       3   11    4    0
3              10   14    5    6       2   13    3    4
4               3    7    3   10       1    2    4   14

Source: J. R. Landis and G. G. Koch, Biometrics 33: 159–174 (1977). Reprinted with permission from the Biometric Society.
10.12 For Problem 10.11, construct a model that describes agreement
between neurologists for the two sites simultaneously.
10.13 Calculate kappa for a 4 × 4 table having n_ii = 5 for all i, n_{i,i+1} = 15, i = 1, 2, 3, n_41 = 15, and n_ij = 0 otherwise. Explain why strong association does not imply strong agreement.
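A quick numerical check of this exercise's point: in the table described, every row concentrates its mass on one column (strong association), yet observed agreement exactly matches chance agreement, so kappa is zero. A sketch:

```python
# Build the 4x4 table of Problem 10.13: n_ii = 5, n_{i,i+1} = 15 (i = 1, 2, 3),
# n_41 = 15, all other cells 0.
n = [[0.0] * 4 for _ in range(4)]
for i in range(4):
    n[i][i] = 5.0
for i in range(3):
    n[i][i + 1] = 15.0
n[3][0] = 15.0

total = sum(map(sum, n))
p = [[x / total for x in row] for row in n]
p_row = [sum(row) for row in p]
p_col = [sum(p[i][j] for i in range(4)) for j in range(4)]

p_obs = sum(p[i][i] for i in range(4))                 # observed agreement
p_exp = sum(p_row[i] * p_col[i] for i in range(4))     # chance agreement
kappa = (p_obs - p_exp) / (1 - p_exp)
```

Here p_obs = p_exp = 0.25, so kappa = 0 despite the strong (off-diagonal) association.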
10.14 Refer to Table 10.8. Based on the reported standardized residuals, explain why the linear-by-linear association model (9.6) might fit well. Fit it and describe the association.
10.15 In 1990, a sample of psychology graduate students at the University of Florida made blind, pairwise preference tests of three cola drinks. For 49 comparisons of Coke and Pepsi, Coke was preferred 29 times. For 47 comparisons of Classic Coke and Pepsi, Classic Coke was preferred 19 times. For 50 comparisons of Coke and Classic Coke, Coke was preferred 31 times. Comparisons resulting in ties are not reported.
a. Fit the Bradley–Terry model, analyze the quality of fit, and rank the drinks. Is there sufficient evidence to conclude a preference for one drink?
b. Estimate the probability that Coke is preferred to Pepsi, using the model, and compare to the sample proportion.
10.16 Table 10.20 refers to journal citations among four statistics journals during 1987–1989. The more often articles in a particular journal are cited, the more prestige that journal accrues. For citations involving pair A and B, view it as a victory for A if it is cited by B and a defeat for A if it cites B. Fit the Bradley–Terry model. Interpret the fit, and give a prestige ranking of the journals. For citations involving Commun. Stat. and JRSS-B, estimate the probability that the Commun. Stat. article cites the JRSS-B article.
TABLE 10.20 Data for Problem 10.16

                          Cited Journal
Citing Journal    Biometrika   Commun. Stat.   JASA   JRSS-B
Biometrika            714            33         320     284
Commun. Stat.         730           425         813     276
JASA                  498            68        1072     325
JRSS-B                221            17         142     188

Source: Stigler (1994). Reprinted with permission from the Institute of Mathematical Statistics.
TABLE 10.21 Data for Problem 10.17

                                Loser
Winner         Seles   Graf   Sabatini   Navratilova   Sanchez
Seles            —       2        1           3           2
Graf             3       —        6           3           7
Sabatini         0       3        —           1           3
Navratilova      3       0        2           —           3
Sanchez          0       1        2           1           —
10.17 Table 10.21 refers to matches for several women tennis players during 1989 and 1990.
a. Fit the Bradley–Terry model. Interpret, and rank the players.
b. Estimate the probability of Seles beating Graf. Compare the model estimate to the sample proportion. Construct a 90% confidence interval for the probability.
c. Which pairs of players are significantly different according to an 80% simultaneous Bonferroni comparison?
10.18 Refer to Problem 3.3 on basketball free-throw shooting. Analyze
these data.
10.19 Refer to Table 2.12 and Problem 2.19. Using models, describe the
relationship between husband’s and wife’s sexual fun.
10.20 Refer to Table 8.19. The two-way table relating responses for the environment (as rows) and cities (as columns) has cell counts, by row, (108, 179, 157 / 21, 55, 52 / 5, 6, 24). Analyze these data.
Theory and Methods
10.21 Explain the following analogy: McNemar’s test is to binary data as the
paired difference t test is to normally distributed data.
10.22 For a 2 × 2 table, derive cov(p_+1, p_1+), and show that var[√n(p_+1 − p_1+)] equals (10.1).
10.23 Refer to the subject-specific model (10.8) for binary matched pairs.
a. Show that exp(β) is a conditional odds ratio between observation and outcome. Explain the distinction between it and the odds ratio exp(β) for model (10.6).
b. Using the conditional distribution (10.9), show that β̂ = log(n_21/n_12).
c. For a random sample of n pairs, explain why

    E(n_21/n) = (1/n) Σ_{i=1}^n exp(α_i + β) / {[1 + exp(α_i)][1 + exp(α_i + β)]}.

Similarly, state E(n_12/n). Using their ratio, explain why, as n → ∞, n_21/n_12 converges in probability to exp(β). (Hint: Apply the law of large numbers due to A. A. Markov for independent but not identically distributed random variables, or use Chebyshev's inequality.)
d. Show that the Mantel–Haenszel estimator (6.7) of a common odds ratio in the 2 × 2 × n form of the data simplifies to exp(β̂) = n_21/n_12.
e. Use the delta method to show (10.10) for the SE of β̂.
f. For a table of the form shown in Table 10.2, show that the CMH statistic (6.6) is algebraically identical to the McNemar statistic (n_21 − n_12)²/(n_21 + n_12) for tables of Table 10.1 type.
6
10.24 Refer to Problem 10.23. Unlike the conditional ML estimator of β, the unconditional ML estimator is inconsistent (Andersen 1980, pp. 244–245; first shown by him in 1973). Show this as follows:
a. Assuming independence of responses for different subjects and different observations by the same subject, find the log likelihood. Show that the likelihood equations are y_+t = Σ_i P(Y_it = 1) and y_i+ = Σ_t P(Y_it = 1).
b. Substituting exp(α_i)/[1 + exp(α_i)] + exp(α_i + β)/[1 + exp(α_i + β)] in the second likelihood equation, show that α̂_i = −∞ for the n_22 subjects with y_i+ = 0, α̂_i = ∞ for the n_11 subjects with y_i+ = 2, and α̂_i = −β̂/2 for the n_21 + n_12 subjects with y_i+ = 1.
c. By breaking Σ_i P(Y_it = 1) into components for the sets of subjects having y_i+ = 0, y_i+ = 2, and y_i+ = 1, show that the first likelihood equation is, for t = 1, y_+1 = n_22(0) + n_11(1) + (n_21 + n_12)exp(−β̂/2)/[1 + exp(−β̂/2)]. Explain why y_+1 = n_11 + n_12, and solve the first likelihood equation to show that β̂ = 2 log(n_21/n_12). Hence, as a result of Problem 10.23, β̂ converges in probability to 2β.
10.25 Consider marginal model (10.6) when Y_1 and Y_2 are independent and conditional model (10.8) when {α_i} are identical. Explain why they are equivalent.
10.26 Let β̂_M = log(p_+1 p_2+ / p_+2 p_1+) refer to marginal model (10.6) and β̂_C = log(n_21/n_12) to conditional model (10.8). Using the delta method, show that the asymptotic variance of √n(β̂_M − β_M) is

    (π_1+ π_2+)^−1 + (π_+1 π_+2)^−1 − 2(π_11 π_22 − π_12 π_21) / (π_1+ π_2+ π_+1 π_+2).

Under the independence condition of the previous problem, β_M = β_C. In that case, show that the asymptotic variances satisfy

    var[√n(β̂_M)] = (π_1+ π_2+)^−1 + (π_+1 π_+2)^−1
                 ≤ (π_1+ π_+2)^−1 + (π_+1 π_2+)^−1 = π_12^−1 + π_21^−1 = var[√n(β̂_C)].
10.27 Refer to model (10.12) for a matched-pairs study. For the conditional ML approach, show that the conditional distribution satisfies (10.13) and does not depend on β when S_i = 0 or 2. Show what happens to β_j in the conditional distribution for a predictor for which x_ji1 = x_ji2 for all i.
10.28 Consider model (10.12) for a study with matched sets of T observations rather than matched pairs. Explain how (10.13) generalizes, and construct the form of the conditional likelihood.
10.29 Give an example illustrating that when I > 2, marginal homogeneity does not imply symmetry.
10.30 Derive the likelihood equations and residual df for (a) symmetry, (b) quasi-symmetry, (c) quasi-independence, and (d) ordinal quasi-symmetry.
10.31 For the quasi-symmetry model (10.19), let β_a = λ^X_a − λ^Y_a. Show that one can express it equivalently as log μ_ab = λ + β_a + λ*_ab, with λ*_ab = λ*_ba. Hence, one needs only one set of main-effect parameters.
10.32 Show that quasi-symmetry is equivalent (Caussinus 1966) to

    (π_ab π_bc π_ca) / (π_ba π_cb π_ac) = 1    for all a, b, and c.
10.33 Derive the covariance matrix (10.16) for the difference vector d.
10.34 Construct the loglinear model satisfying both marginal homogeneity and statistical independence. Show that π̂_ab = (p_+a + p_a+)(p_+b + p_b+)/4 and residual df = I(I − 1).
10.35 Consider the conditional symmetry (CS) model (10.28).
a. Show that it has the loglinear representation

    log μ_ab = λ_{min(a,b), max(a,b)} + τ I(a < b),

where I(·) is an indicator (see also Bishop et al. 1975, pp. 285–286).
b. Show that the likelihood equations are

    μ̂_ab + μ̂_ba = n_ab + n_ba for all a ≤ b,    ΣΣ_{a<b} μ̂_ab = ΣΣ_{a<b} n_ab.

c. Show that τ̂ = log[(ΣΣ_{a<b} n_ab)/(ΣΣ_{a>b} n_ab)], μ̂_aa = n_aa, a = 1, ..., I, and μ̂_ab = exp[τ̂ I(a < b)](n_ab + n_ba)/[exp(τ̂) + 1] for a ≠ b.
d. Show that the estimated asymptotic variance of τ̂ is

    (ΣΣ_{a<b} n_ab)^−1 + (ΣΣ_{a>b} n_ab)^−1.

e. Show that residual df = (I + 1)(I − 2)/2.
f. Show that conditional symmetry + marginal homogeneity = symmetry. Explain why G²(S | CS) tests marginal homogeneity (df = 1). When the model holds, G²(S | CS) is more powerful asymptotically than G²(S | QS). Why?
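The closed-form estimates in parts (b)–(d) can be checked numerically. This sketch uses a small hypothetical 3 × 3 table (the counts are invented for illustration) and verifies that the closed-form fitted values satisfy the likelihood equations:

```python
import math

# Hypothetical 3x3 table for checking the conditional-symmetry closed forms.
n = [[10, 5, 2],
     [3, 8, 4],
     [1, 2, 6]]
I = len(n)

above = sum(n[a][b] for a in range(I) for b in range(I) if a < b)   # cells above diagonal
below = sum(n[a][b] for a in range(I) for b in range(I) if a > b)   # cells below diagonal
tau_hat = math.log(above / below)                                   # part (c)

def mu_hat(a, b):
    """Fitted value from part (c): diagonal cells are fitted perfectly."""
    if a == b:
        return float(n[a][a])
    t = math.exp(tau_hat)
    return (t if a < b else 1.0) * (n[a][b] + n[b][a]) / (t + 1.0)

# Likelihood equation of part (b): mu_ab + mu_ba = n_ab + n_ba for a <= b.
check_pair = mu_hat(0, 1) + mu_hat(1, 0)        # should equal n[0][1] + n[1][0]
var_tau = 1.0 / above + 1.0 / below             # estimated asymptotic variance, part (d)
```

The check confirms that each fitted off-diagonal pair reproduces its observed pair total, splitting it in the ratio exp(τ̂) : 1.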
10.36 Identify loglinear models that correspond to the logit models, for a < b, log(π_ab/π_ba) = (a) 0, (b) τ, (c) α_a − α_b, and (d) β(b − a).
10.37 A nonmodel-based ordinal measure of marginal heterogeneity is

    Δ̂ = ΣΣ_{a>b} p_a+ p_+b − ΣΣ_{a<b} p_a+ p_+b.

Show that Δ̂ estimates Δ = P(Y_1 > Y_2) − P(Y_2 > Y_1), where Y_1 has distribution {π_a+} and Y_2 is independent from {π_+b}. Show that marginal homogeneity implies that Δ = 0. Show that the estimated asymptotic variance of Δ̂ is

    [ΣΣ_ab τ̂²_ab p_ab − (ΣΣ_ab τ̂_ab p_ab)²] / n,

where τ̂_ab = F̂_b1 + F̂_{b−1,1} − F̂_a2 − F̂_{a−1,2} with F̂_a1 = (p_1+ + ··· + p_a+) and F̂_a2 = (p_+1 + ··· + p_+a) (Agresti 1984, pp. 208–209).
10.38 For ordered scores {u_a}, let ȳ_1 = Σ_a u_a p_a+ and ȳ_2 = Σ_a u_a p_+a. Show that marginal homogeneity implies that E(Y_1) = E(Y_2) and that

    [ΣΣ_ab (u_a − u_b)² p_ab − (ȳ_1 − ȳ_2)²] / n

estimates var(ȳ_1 − ȳ_2). Construct a test of marginal homogeneity (Bhapkar 1968).
10.39 Consider the multiplicative model for a square table,

    π_ab = α_a α_b (1 − κ),            a ≠ b,
    π_aa = α_a² + κ α_a (1 − α_a),     a = b.

a. Show that the model satisfies (i) symmetry, (ii) marginal homogeneity, (iii) quasi-symmetry, (iv) quasi-independence.
b. Show that α_a = π_a+ = π_+a, a = 1, ..., I.
c. Show that κ = Cohen's kappa, and interpret κ = 0 and κ = 1 for this model.
10.40 A 2 × 2 table has a true odds ratio of 10. Find the cell probabilities for which (a) π_1+ = π_+1 = 0.5, (b) π_1+ = π_+1 = 0.3, and (c) π_1+ = π_+1 = 0.1. Find the value of kappa for each. (This shows that for a given association, kappa depends strongly on the marginal probabilities; see also Sprott 2000, p. 59.)
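A sketch of the computation behind this problem: with equal margins π_1+ = π_+1 = m and odds ratio θ, the cell π_11 solves a quadratic in closed form (the quadratic setup is my own algebra, not from the text), after which kappa follows directly:

```python
import math

# For a 2x2 table with pi_{1+} = pi_{+1} = m and odds ratio theta,
# pi_11 solves (theta - 1) x^2 - [2*theta*m + 1 - 2*m] x + theta*m^2 = 0;
# the smaller root keeps all four cells positive.
def kappa_for_margin(m, theta=10.0):
    a = theta - 1.0
    b = -(2.0 * theta * m + 1.0 - 2.0 * m)
    c = theta * m * m
    pi11 = (-b - math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    pi12 = pi21 = m - pi11
    pi22 = 1.0 - pi11 - pi12 - pi21
    p_obs = pi11 + pi22                       # observed agreement
    p_exp = m * m + (1.0 - m) ** 2            # chance agreement (equal margins)
    return (p_obs - p_exp) / (1.0 - p_exp)

kappas = [kappa_for_margin(m) for m in (0.5, 0.3, 0.1)]
```

Although the odds ratio is 10 in every case, kappa shrinks steadily as the margins become more skewed, which is the point of the exercise.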
10.41 A model for agreement on an ordinal response partitions beyond-chance agreement into that due to a baseline association and a main-diagonal increment (A. Agresti, Biometrics 44: 539–548, 1988). For ordered scores {u_a}, the model is

    log μ_ab = λ + λ^A_a + λ^B_b + β u_a u_b + δ I(a = b).    (10.35)

a. Show that this is a special case of quasi-symmetry and of quasi-association (10.29).
b. For agreement odds (10.30), show that log τ_ab = β(u_b − u_a)² + 2δ. For unit-spaced scores, show the local odds ratios have log θ_ab = β when none of the four cells falls on the main diagonal.
c. Find the likelihood equations and show that {μ̂_ab} and {n_ab} share the same marginal distributions, correlation, and prevalence of exact agreement.
d. For Table 10.8 using {u_a = a}, show that (10.35) has G² = 4.8 (df = 7), with δ̂ = 0.842 (SE = 0.427) and β̂ = 1.316 (SE = 0.420). Interpret using μ̂_{a,a+1} and θ̂_ab for |a − b| > 1.
10.42 Refer to the Bradley–Terry model.
a. Show that log(Π_ac/Π_ca) = log(Π_ab/Π_ba) + log(Π_bc/Π_cb).
b. With this model, is it possible that a could be preferred to b (i.e., Π_ab > Π_ba) and b could be preferred to c, yet c could be preferred to a? Explain.
c. Explain why {β_a} are not identifiable without a constraint such as β_I = 0. (Hint: Show the model holds when {β*_a = β_a − c} for any c.)
10.43 Refer to model (10.32).
a. Construct a more general model having home-team parameters {H_i} and away-team parameters {A_i}, such that the probability that team i beats team j when i is the home team is exp(H_i)/[exp(H_i) + exp(A_j)], where A_I = 0 but H_I is unrestricted.
b. Interpret the case {H_i = A_i + c}, when (i) c = 0, and (ii) c > 0.
c. Fit the model to Table 10.12. Compare the fit to model (10.32). Compare {Ĥ_i} and {Â_i} to describe how teams play at home and away.
10.44 Find the log likelihood for the Bradley–Terry model. From the kernel, show that (given {N_ab}) the minimal sufficient statistics are {n_{a+}}. Thus, explain how "victory totals" determine the estimated ranking.
10.45 Explain how to fit the complete symmetry model in T dimensions.
10.46 Prove that if kth-order marginal symmetry holds, jth-order marginal symmetry holds for any j < k.
10.47 Suppose that quasi-symmetry holds for an I^T table. When the table is collapsed over a variable, show that the model holds for the I^(T−1) table with the same main effects.
CHAPTER 11

Analyzing Repeated Categorical Response Data
Many studies observe the response variable for each subject repeatedly, at
several times or under various conditions. Repeated categorical response
data occur commonly in health-related applications, especially in longitudinal
studies. For example, a physician might evaluate patients at weekly intervals
regarding whether a new drug treatment is successful. In some cases explanatory variables may also vary over time. But the repeated responses need not
refer to different times. A dental study might measure whether there is decay
for each tooth in a subject’s mouth.
Often, the responses refer to matched sets, or clusters, of subjects. An
example is a (survival, nonsurvival) response for each fetus in a litter, for a
sample of pregnant mice exposed to various dosages of a toxin. A multistage
sample to study factors affecting obesity in children may regard children from
the same family as a cluster. Observations within a cluster tend to be more
alike than observations from different clusters. Ordinary analyses that ignore
this may be badly inappropriate.
In this chapter we generalize methods of Chapter 10, which referred to
matched pairs. In Section 11.1 we compare marginal distributions in T-way
tables. The remaining sections extend models to include explanatory variables. For instance, many studies compare the repeated measurements for
different groups or treatments. In Section 11.2 we use ML methods for fitting
marginal models. In Section 11.3 we use generalized estimating equations
(GEE), a multivariate version of quasi-likelihood that is computationally
simpler than ML. Section 11.4 covers technical details about the GEE
approach. In the final section we introduce a transitional approach that
models observations in terms of previous outcomes.
11.1 COMPARING MARGINAL DISTRIBUTIONS: MULTIPLE RESPONSES
Usually, the multivariate dependence among repeated responses is of less interest than their marginal distributions. For instance, in treating a chronic condition (such as a phobia) with some treatment, the primary goal might be to study whether the probability of success increases over the T weeks of a treatment period. The T success probabilities refer to the T first-order marginal distributions. In Sections 10.2.1 and 10.3 we compared marginal distributions for matched pairs (T = 2) using models that apply directly to the marginal distributions. In this section we extend this approach to T > 2.
11.1.1 Binary Marginal Models and Marginal Homogeneity
Denote T binary responses by (Y_1, Y_2, ..., Y_T). The marginal logit model (10.6) for matched pairs extends to

    logit P(Y_t = 1) = α + β_t,   t = 1, ..., T,    (11.1)

with a constraint such as β_T = 0 or α = 0. For a possible sequence of outcomes i = (i_1, i_2, ..., i_T), where each i_t = 0 or 1, let

    π_i = P(Y_1 = i_1, Y_2 = i_2, ..., Y_T = i_T).

Let π denote the vector of these probabilities for the possible i. They refer to a 2^T table that cross-classifies the T responses and describes the joint distribution of (Y_1, ..., Y_T). The sample cell proportions are the ML estimates of π, and the sample proportion with y_t = 1 is the ML estimate of P(Y_t = 1).

Model (11.1) is saturated, describing T marginal probabilities by T parameters. Marginal homogeneity, for which P(Y_1 = 1) = ⋯ = P(Y_T = 1), is the special case β_1 = ⋯ = β_T. Even though this case has only one parameter, ML fitting is not simple. The multinomial likelihood refers to the 2^T joint cell probabilities rather than the T marginal probabilities {P(Y_t = 1)}. Fitting methods are described in Section 11.2.5.

Let n_i denote the sample count in cell i. The kernel of the log likelihood L(π) is Σ_i n_i log π_i. Let L(p) denote the log likelihood evaluated at the sample proportions {p_i = n_i/n}, the ML fit of model (11.1). Let L(π̂_MH) denote the maximized log likelihood assuming marginal homogeneity. The likelihood-ratio test of marginal homogeneity (Lipsitz et al. 1990; Madansky 1963) uses

    −2[L(π̂_MH) − L(p)] = 2 Σ_i n_i log(p_i / π̂_i^MH).    (11.2)
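To make (11.2) concrete: for matched pairs (T = 2), the fit under marginal homogeneity has a simple closed form, keeping the diagonal proportions and averaging the two off-diagonal proportions, so the statistic can be computed directly. The sketch below, in Python, uses the 2 × 2 table for drugs A and C obtained by collapsing Table 11.1 over drug B; this closed form applies only to the T = 2 case, not to the general algorithm of Section 11.2.5.

```python
import math

# 2x2 matched-pairs counts (drug A vs. drug C, collapsed over drug B
# in Table 11.1): cells (1,1), (1,2), (2,1), (2,2)
counts = {(1, 1): 8, (1, 2): 20, (2, 1): 8, (2, 2): 10}
N = sum(counts.values())

# For T = 2 the ML fit under marginal homogeneity keeps the diagonal
# proportions and averages the off-diagonal ones.
off = (counts[(1, 2)] + counts[(2, 1)]) / (2 * N)
pi_mh = {(1, 1): counts[(1, 1)] / N, (2, 2): counts[(2, 2)] / N,
         (1, 2): off, (2, 1): off}

# Likelihood-ratio statistic (11.2): 2 * sum_i n_i log(p_i / pi_i^MH),
# with df = T - 1 = 1
lr = 2 * sum(c * math.log((c / N) / pi_mh[cell])
             for cell, c in counts.items())
print(round(lr, 3))
```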
TABLE 11.1 Responses to Three Drugs in a Crossover Study

                        Drug A Favorable             Drug A Unfavorable
                   B Favorable  B Unfavorable    B Favorable  B Unfavorable
C Favorable             6             2               2             6
C Unfavorable          16             4               4             6

Source: Reprinted with permission from the Biometric Society (Grizzle et al. 1969).
The asymptotic null chi-squared distribution has df = T − 1, since the general model (11.1) has T − 1 more parameters than marginal homogeneity.
11.1.2 Crossover Drug Comparison Example
Table 11.1 comes from a crossover study in which each subject used each of three drugs for treatment of a chronic condition at three times. The response measured the reaction as favorable or unfavorable. The 2³ table gives the (favorable, unfavorable) classification for reaction to drug A in the first dimension, drug B in the second, and drug C in the third. We assume that the drugs have no carryover effects and that the severity of the condition remained stable for each subject throughout the experiment. These assumptions are reasonable for many chronic conditions, such as migraine headache.

The sample proportion favorable was (0.61, 0.61, 0.35) for drugs (A, B, C). The likelihood-ratio statistic for testing marginal homogeneity is 5.95 (df = 2), for a P-value of 0.05. For simultaneous confidence intervals comparing pairs of treatments with overall error probability no greater than 0.05, the Bonferroni method uses confidence coefficient (1 − 0.05/3) = 0.9833 for each. For instance, from formula (10.1), the estimate 0.261 = 0.609 − 0.348 of the difference between drugs A and C has an estimated standard error of 0.108. The confidence interval for the true difference is 0.261 ± 2.39(0.108), or (0.002, 0.520). The same interval holds for comparison of drugs B and C. There is some evidence that the proportion of favorable responses is lower for drug C.
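The difference and its standard error can be verified numerically. The following sketch applies the matched-pairs standard error of the form used in (10.1) to the drug A versus drug C margin of Table 11.1 (collapsed over drug B); the value 2.39 is the approximate normal critical value for the 0.9833 confidence level.

```python
import math

# 2x2 table for drug A (rows) vs. drug C (columns), favorable first,
# collapsed over drug B from Table 11.1; n = 46 subjects
p11, p12, p21, p22 = 8/46, 20/46, 8/46, 10/46
n = 46

pA = p11 + p12            # P(A favorable), about 0.609
pC = p11 + p21            # P(C favorable), about 0.348
diff = pA - pC            # about 0.261

# Matched-pairs standard error of the difference of proportions
se = math.sqrt((pA*(1 - pA) + pC*(1 - pC) - 2*(p11*p22 - p12*p21)) / n)

# Bonferroni interval for 3 comparisons: level 1 - 0.05/3 = 0.9833,
# two-sided normal critical value approximately 2.39
lo, hi = diff - 2.39*se, diff + 2.39*se
print(round(diff, 3), round(se, 3), (round(lo, 3), round(hi, 3)))
```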
The sample size is not large, however, so we view these results with caution. For each pair of drugs, a 2 × 2 table relates the two responses. An exact binomial test (Section 10.4.1) uses its off-diagonal counts. These yield P-values of 1.0 for comparing drugs A and B, and 0.036 for comparing A with C and for comparing B with C.
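A short script can reproduce these exact P-values from the off-diagonal counts. The sketch below doubles the binomial tail probability under p = 0.5 and truncates at 1.0, which is one common convention for a two-sided exact test of this kind.

```python
from math import comb

def exact_mcnemar_p(n12, n21):
    """Two-sided exact binomial P-value from the off-diagonal counts,
    doubling the smaller tail under p = 0.5 and truncating at 1."""
    m = n12 + n21
    k = min(n12, n21)
    tail = sum(comb(m, i) for i in range(k + 1)) / 2 ** m
    return min(1.0, 2 * tail)

# Off-diagonal counts from the pairwise 2x2 margins of Table 11.1
print(round(exact_mcnemar_p(6, 6), 3))   # drugs A vs. B
print(round(exact_mcnemar_p(20, 8), 3))  # drugs A vs. C (B vs. C is the same)
```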
11.1.3 Modeling Margins of a Multicategory Response
The binary marginal model (11.1) extends to multinomial responses. With baseline-category logits for I outcome categories, the saturated model is

    log[P(Y_t = j)/P(Y_t = I)] = β_tj,   t = 1, ..., T,  j = 1, ..., I − 1.    (11.3)
Marginal homogeneity, whereby P(Y_1 = j) = ⋯ = P(Y_T = j) for j = 1, ..., I − 1, is the special case in which

    β_1j = β_2j = ⋯ = β_Tj,   j = 1, ..., I − 1.

The likelihood-ratio test of marginal homogeneity comparing the two models has form (11.2) and df = (T − 1)(I − 1).

For an ordinal response, an unsaturated model that is more complex than marginal homogeneity focuses on shifts up and down in the T margins. One such model is

    logit P(Y_t ≤ j) = α_j + β_t,   t = 1, ..., T,  j = 1, ..., I − 1,    (11.4)

with a constraint such as β_T = 0. Marginal homogeneity is the special case β_1 = ⋯ = β_T. Its test has df = T − 1. The {α_j} satisfy α_1 < ⋯ < α_{I−1} because of the ordering of the cumulative probabilities. These models can be fitted using ML methodology presented in Section 11.2.5.
11.1.4 Wald and Generalized CMH Score Tests of Marginal Homogeneity
In this chapter we focus on modeling the marginal distributions rather than merely testing marginal homogeneity. However, a variety of tests are available besides the likelihood ratio, so we briefly summarize a couple of them. Let p_j(t) denote the sample proportion in category j for response Y_t, let

    p̄_j = Σ_t p_j(t)/T,   d_j(t) = p_j(t) − p̄_j,

and let d denote the vector of {d_j(t), t = 1, ..., T − 1, j = 1, ..., I − 1}. Let V̂ denote the estimated covariance matrix of √n d. Bhapkar (1973) proposed the Wald statistic

    W = n d′ V̂⁻¹ d    (11.5)

for the general alternative. This generalizes (10.16) and has a large-sample chi-squared distribution with df = (I − 1)(T − 1).

Other statistics are special cases of the generalized Cochran–Mantel–Haenszel (CMH) statistic (Section 7.5.3). Recall that for the binary case (I = 2) with matched pairs (T = 2), the CMH statistic applies to a three-way table (see, e.g., Table 10.2) in which each stratum shows the two outcomes for a given subject. A generalization of Table 10.2 provides n strata of T × I tables. The kth stratum gives the T outcomes for subject k. Row t in a stratum has a 1 in the column that is the outcome for observation t, and 0 in all other columns (or 0 in every column if that observation is missing).
Probability distributions for the subject-stratified setup naturally relate to
subject-specific models such as logit model (10.8), rather than to marginal models. However, conditional independence in this three-way table (given subject) corresponds to an exchangeability among variables in the I^T table that implies marginal homogeneity. A generalized CMH test of conditional independence in the T × I × n table also tests marginal homogeneity, using a sampling distribution generated under the stronger exchangeability condition (Darroch 1981). For an ordinal response with fixed scores, the generalized CMH statistic for detecting variability among T means is appropriate.

When I = 2 and T = 2, this CMH approach is equivalent to McNemar's statistic. When I = 2 but T > 2, the generalized CMH statistic treating the T responses as unordered is identical to a statistic Cochran (1950) proposed. His statistic, called Cochran's Q, has df = T − 1 (Problem 11.22).
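As an illustration, Cochran's Q can be computed directly from the joint response patterns. The sketch below applies the usual formula Q = T(T − 1) Σ_t (C_t − N/T)² / (TN − Σ_k R_k²), with C_t the success total for treatment t and R_k the success total for subject k, to the crossover data of Table 11.1; the pattern encoding is an assumption of this sketch, not notation from the text.

```python
# Cochran's Q for the T = 3 binary responses of Table 11.1, from the
# joint response patterns (1 = favorable) with their counts.
patterns = {
    (1, 1, 1): 6, (1, 1, 0): 16, (1, 0, 1): 2, (1, 0, 0): 4,
    (0, 1, 1): 2, (0, 1, 0): 4, (0, 0, 1): 6, (0, 0, 0): 6,
}
T = 3

# Success totals per treatment (drugs A, B, C) and their grand total
col = [sum(p[t] * c for p, c in patterns.items()) for t in range(T)]
N = sum(col)
# Sum over subjects of the squared per-subject success totals
sum_r2 = sum(c * sum(p) ** 2 for p, c in patterns.items())

# Q = T(T-1) * sum_t (C_t - N/T)^2 / (T*N - sum_k R_k^2), df = T - 1
Q = T * (T - 1) * sum((ct - N / T) ** 2 for ct in col) / (T * N - sum_r2)
print(col, round(Q, 2))
```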
11.2 MARGINAL MODELING: MAXIMUM LIKELIHOOD APPROACH
Analyses above compared marginal distributions, but without accounting for
explanatory variables. We now include such predictors. In this section we use
ML, but we defer model fitting details to the end of the section.
11.2.1 Longitudinal Mental Depression Example
We use Table 11.2 to illustrate a variety of analyses in this and the next chapter. It refers to a longitudinal study comparing a new drug with a standard drug for treatment of subjects suffering mental depression (Koch et al. 1977). Subjects were classified into two initial diagnosis groups according to whether severity of depression was mild or severe. In each group, subjects were randomly assigned to one of the two drugs. Following 1 week, 2 weeks, and 4 weeks of treatment, each subject's suffering from mental depression was classified as normal or abnormal.
TABLE 11.2 Cross-Classification of Responses on Depression at Three Times by Diagnosis and Treatment

                             Response at Three Times^a
Diagnosis  Treatment     NNN  NNA  NAN  NAA  ANN  ANA  AAN  AAA
Mild       Standard       16   13    9    3   14    4   15    6
           New drug       31    0    6    0   22    2    9    0
Severe     Standard        2    2    8    9    9   15   27   28
           New drug        7    2    5    2   31    5   32    6

^a N, normal; A, abnormal.
Source: Reprinted with permission from the Biometric Society (Koch et al. 1977).
Table 11.2 shows four groups, the combinations of categories of the two explanatory variables: treatment type and severity of initial diagnosis. Since the study observed the binary response (depression assessment) at T = 3 occasions, Table 11.2 shows a 2³ table for each group. The three depression assessments form a multivariate response variable with three components, with Y_t = 1 for normal and 0 for abnormal. The 12 marginal distributions result from three repeated observations for each of the four groups.

Let s denote the severity of the initial diagnosis, with s = 1 for severe and s = 0 for mild. Let d denote the drug, with d = 1 for new and d = 0 for standard. Let t denote the time of measurement. Koch et al. (1977) noted that if the time metric reflects cumulative drug dosage, a logit scale often has a linear effect for the logarithm of time. They used scores (0, 1, 2), the logs to base 2 of the week numbers (1, 2, and 4), for time.

Table 11.3 shows sample proportions of normal responses (i.e., y_t = 1) for the 12 marginal distributions. For instance, from Table 11.2, the sample proportion of normal responses after week 1 for subjects with mild initial diagnosis using the standard drug was (16 + 13 + 9 + 3)/(16 + 13 + 9 + 3 + 14 + 4 + 15 + 6) = 0.51. The sample proportion of normal responses (1) increased over time for each group; (2) increased at a faster rate for the new drug than the standard, for each fixed initial diagnosis; and (3) was higher for the mild than the severe initial diagnosis, for each treatment at each occasion. In such a study the company that developed the new drug would hope to show that patients have a significantly higher rate of improvement with it.
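All 12 marginal proportions of Table 11.3 can be recovered from the joint counts of Table 11.2 by summing the cells whose pattern has N in the relevant position, as the week-1 computation above illustrates. A sketch:

```python
# Recover the Table 11.3 marginal proportions from the Table 11.2 counts.
# Pattern position t holds the response at occasion t (N = normal).
patterns = ["NNN", "NNA", "NAN", "NAA", "ANN", "ANA", "AAN", "AAA"]
counts = {
    ("Mild", "Standard"):   [16, 13, 9, 3, 14, 4, 15, 6],
    ("Mild", "New drug"):   [31, 0, 6, 0, 22, 2, 9, 0],
    ("Severe", "Standard"): [2, 2, 8, 9, 9, 15, 27, 28],
    ("Severe", "New drug"): [7, 2, 5, 2, 31, 5, 32, 6],
}

marg = {}
for group, row in counts.items():
    n = sum(row)
    marg[group] = [sum(c for p, c in zip(patterns, row) if p[t] == "N") / n
                   for t in range(3)]
    print(group, [round(x, 2) for x in marg[group]])
```

The printed values match Table 11.3 up to rounding.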
The marginal logit model

    logit P(Y_t = 1) = α + β_1 s + β_2 d + β_3 t

has the main effects of the explanatory variables (severity of initial diagnosis and drug) and of the variable (time) that specifies the different components of the multivariate response. Its linear time effect β_3 is the same for each group.
The natural sampling assumption is multinomial for the eight cells in the
2³ cross-classification of the three responses, independently for the four
TABLE 11.3 Sample Marginal Proportions of Normal Response for Depression Data of Table 11.2

                            Sample Proportion
Diagnosis  Treatment     Week 1   Week 2   Week 4
Mild       Standard        0.51     0.59     0.68
           New drug        0.53     0.79     0.97
Severe     Standard        0.21     0.28     0.46
           New drug        0.18     0.50     0.83
groups. However, the model refers to 12 marginal probabilities (for 2 drug treatments × 2 initial severity diagnoses × 3 time points) rather than the 4 × 2³ = 32 cell probabilities in the product multinomial likelihood function. The three marginal binomial variates for each group are dependent. ML estimation requires an iterative routine for maximizing the product multinomial likelihood, subject to the constraint that the marginal probabilities satisfy the model. An algorithm for this is given in Section 11.2.5.

A check of model fit compares the 32 cell counts in Table 11.2 to their ML fitted values. Since the model describes 12 marginal logits using four parameters, residual df = 8. The deviance G² = 34.6. The poor fit is not surprising. The model assumes a common rate of improvement β_3, but the sample shows a higher rate for the new drug.

A more realistic model permits the time effect to differ by drug,

    logit P(Y_t = 1) = α + β_1 s + β_2 d + β_3 t + β_4 dt.

Its time effect estimate is β̂_3 = 0.48 (SE = 0.12) for the standard drug (d = 0) and β̂_3 + β̂_4 = 1.49 (SE = 0.14) for the new one (d = 1). For the new drug, the slope is β̂_4 = 1.01 (SE = 0.18) higher than for the standard, giving strong evidence of faster improvement. This model fits much better, with G² = 4.2 (df = 7). The G² decrease of 34.6 − 4.2 = 30.4 compared to the simpler model is the likelihood-ratio statistic for testing H_0: β_4 = 0, a common time effect for each drug.

The severity of initial diagnosis estimate is β̂_1 = −1.29 (SE = 0.14); for each drug–time combination, the estimated odds of a normal response when the initial diagnosis was severe equal exp(−1.29) = 0.27 times the estimated odds when the initial diagnosis was mild. The estimate β̂_2 = −0.06 (SE = 0.22) indicates an insignificant difference between the drugs after 1 week (for which t = 0). At time t, the estimated odds of normal response with the new drug are exp(−0.06 + 1.01t) times the estimated odds for the standard drug, for each initial diagnosis level. In summary, severity of initial diagnosis, drug treatment, and time all have substantial effects on the probability of a normal response.
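The drug-by-time interaction can be tabulated directly from the reported estimates. This numerical sketch simply exponentiates the fitted log odds ratios at each time score; it is not a refit of the model.

```python
import math

# Estimated odds ratios implied by the reported fit of
# logit P(Y_t = 1) = alpha + b1*s + b2*d + b3*t + b4*d*t,
# comparing new drug (d = 1) to standard (d = 0) at each time score.
b2, b4 = -0.06, 1.01   # reported estimates

for t in (0, 1, 2):    # time scores for weeks 1, 2, 4
    print("t =", t, "estimated odds ratio =", round(math.exp(b2 + b4 * t), 2))
```

The ratio grows from near 1 at week 1 to about 7 by week 4, which is the faster-improvement effect described above.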
11.2.2 Modeling a Repeated Multinomial Response
Models for marginal distributions of a repeated binary response generalize to multicategory responses. At observation t, the marginal response distribution has I − 1 logits. With nominal responses, baseline-category logit models describe the odds of each outcome relative to a baseline. For ordinal responses, one might use cumulative logit models.

For a particular marginal logit, a model has the form

    logit_j(t) = α_j + β_j′ x_t,   j = 1, ..., I − 1,  t = 1, ... .
For an ordinal response, perhaps logit_j(t) = logit[P(Y_t ≤ j)]. Then β_j may simplify to β, in which case the model takes the proportional odds form with the same effects for each logit. Some parameters in β may refer to the variable subscripted by t (e.g., time) that indexes the repeated measurements. One can then compare marginal distributions at particular settings of x or evaluate effects of x on the response. In either case, checking for interaction is crucial. For instance, are the effects of x the same at each t?
11.2.3 Insomnia Example
Table 11.4 shows results of a randomized, double-blind clinical trial comparing an active hypnotic drug with a placebo in patients who have insomnia problems. The response is the patient's reported time (in minutes) to fall asleep after going to bed. Patients responded before and following a two-week treatment period. The two treatments, active and placebo, form a binary explanatory variable. The subjects receiving the two treatments were independent samples.

Table 11.5 displays sample marginal distributions for the four treatment–occasion combinations. From the initial to follow-up occasion, time to falling asleep seems to shift downward for both treatments. The degree of shift seems greater for the active treatment, indicating possible interaction. The response variable is a discrete version of a continuous variable, so by the derivation in Section 7.2.3 a cumulative link model is natural. The proportional odds model

    logit P(Y_t ≤ j) = α_j + β_1 t + β_2 x + β_3 tx    (11.6)

permits interaction between t = occasion (0 = initial, 1 = follow-up) and
TABLE 11.4 Time to Falling Asleep, by Treatment and Occasion

                       Follow-up Time to Falling Asleep
Treatment  Initial      <20    20–30    30–60    >60
Active     <20            7       4        1       0
           20–30         11       5        2       2
           30–60         13      23        3       1
           >60            9      17       13       8
Placebo    <20            7       4        2       1
           20–30         14       5        1       0
           30–60          6       9       18       2
           >60            4      11       14      22

Source: From S. F. Francom, C. Chuang-Stein, and J. R. Landis, Statist. Med. 8: 571–582 (1989). Reprinted with permission from John Wiley & Sons Ltd.
TABLE 11.5 Sample Marginal Distributions of Table 11.4

                               Response
Treatment  Occasion       <20    20–30    30–60     >60
Active     Initial      0.101    0.168    0.336   0.395
           Follow-up    0.336    0.412    0.160   0.092
Placebo    Initial      0.117    0.167    0.292   0.425
           Follow-up    0.258    0.242    0.292   0.208
x = treatment (0 = placebo, 1 = active), but assumes the same effects for each response cutpoint.

For ML model fitting, G² = 8.0 (df = 6) for comparing observed to fitted cell counts in modeling the 12 marginal logits using these six parameters. The ML estimates are β̂_1 = 1.074 (SE = 0.162), β̂_2 = 0.046 (SE = 0.236), and β̂_3 = 0.662 (SE = 0.244). This shows evidence of interaction. At the initial observation, the estimated odds that time to falling asleep for the active treatment is below any fixed level equal exp(0.046) = 1.04 times the estimated odds for the placebo treatment; at the follow-up observation, the effect is exp(0.046 + 0.662) = 2.03. In other words, initially the two groups had similar distributions, but at the follow-up those with the active treatment tended to fall asleep more quickly.

For simpler interpretation, it can be helpful to report sample marginal means and their differences. With response scores {10, 25, 45, 75} for time to fall asleep, the initial means were 50.0 for the active group and 50.3 for the placebo. The difference in means between the initial and follow-up responses was 22.2 for the active group and 13.0 for the placebo. The difference between these differences of means equals 9.2, with SE = 3.0, indicating that the change was significantly greater for the active group.
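These summary means follow directly from the joint counts of Table 11.4 and the quoted category scores {10, 25, 45, 75}. A sketch:

```python
# Sample marginal means of time to fall asleep from Table 11.4,
# using the category scores {10, 25, 45, 75} quoted in the text.
scores = [10, 25, 45, 75]

# Rows index the initial category, columns the follow-up category
active = [[7, 4, 1, 0], [11, 5, 2, 2], [13, 23, 3, 1], [9, 17, 13, 8]]
placebo = [[7, 4, 2, 1], [14, 5, 1, 0], [6, 9, 18, 2], [4, 11, 14, 22]]

def marginal_means(table):
    n = sum(sum(row) for row in table)
    init = sum(scores[i] * sum(row) for i, row in enumerate(table)) / n
    follow = sum(scores[j] * sum(row[j] for row in table)
                 for j in range(4)) / n
    return init, follow

results = {}
for name, table in [("active", active), ("placebo", placebo)]:
    init, follow = marginal_means(table)
    results[name] = (init, follow)
    print(name, round(init, 1), round(follow, 1), round(init - follow, 1))
```

The printed initial means (50.0 and 50.3) and shifts (22.2 and 13.0) reproduce the values quoted above.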
11.2.4 Comparisons That Control for Initial Response
For data such as Table 11.4, suppose that the marginal distributions for initial response are identical for the treatment groups. This is true, apart from sampling error, with random assignment of subjects to the groups. Suppose also that, conditional on the initial response, the follow-up response distribution is identical for the treatment groups. Then, the follow-up marginal distributions are also identical.

If the initial marginal distributions are not identical, however, the difference between follow-up and initial marginal distributions may differ between treatment groups, even though their conditional distributions for follow-up response are identical. In such cases, although marginal models can be useful, they may not tell the entire story. It may be more informative to construct models that compare the follow-up responses while controlling for the initial response.
Let Y_2 denote the follow-up response, for treatment x with initial response y_1. In the model

    logit P(Y_2 ≤ j) = α_j + β_1 x + β_2 y_1,    (11.7)

β_1 compares the follow-up distributions for the treatments, controlling for the initial observation. This is an analog of an analysis-of-covariance model, with an ordinal rather than continuous response. This cumulative logit model refers to a univariate response (Y_2) rather than marginal distributions of a multivariate response (Y_1, Y_2). It is an example of a transitional model, discussed in the final section of this chapter.
11.2.5 ML Fitting of Marginal Logit Models*
ML fitting of marginal logit models is awkward. For T observations on an I-category response, at each setting of predictors the likelihood refers to I^T multinomial joint probabilities, but the model applies to T sets of marginal multinomial parameters {P(Y_t = k), k = 1, ..., I}. The marginal multinomial variates are not independent.

Let π denote the complete set of multinomial joint probabilities for all settings of predictors. Marginal logit models have the generalized loglinear model form

    C log(Aπ) = Xβ    (11.8)

introduced in Section 8.5.4. In the binary case, the matrix A applied to π forms the T marginal probabilities {P(Y_t = 1)} and their complements at each setting of predictors. The matrix C applied to the log marginal probabilities forms the T marginal logits for each setting; each row of C has 1 in the position multiplied by the log numerator probability for a given marginal logit, −1 in the position multiplied by the log denominator probability, and 0 elsewhere.

For instance, for the model of marginal homogeneity in a 2^T table with no covariates, β is a single parameter, denoted by α in (11.1). For T = 2, π has four elements, and this model is
four elements, and this model is
1
0
y1
0
0
1
1
0
0
log
y1
1
0
1
0
0
1
0
1
1
0
0
1
0
1
11
12
21
22
s
1
␣,
1
which sets both logit Ž 11 q 12 . s logit w P Ž Y1 s 1.x and logit Ž 11 q 21 . s
logit w P Ž Y2 s 1.x equal to ␣ .
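The pieces of this display are easy to verify numerically. The sketch below multiplies out C log(Aπ) in pure Python for a joint distribution whose two margins agree, so both marginal logits coincide as claimed; the particular cell probabilities are an illustrative assumption.

```python
import math

# The marginal-homogeneity example of (11.8) written out for T = 2.
A = [[1, 1, 0, 0],   # P(Y1 = 1) = pi_11 + pi_12
     [0, 0, 1, 1],   # P(Y1 = 0)
     [1, 0, 1, 0],   # P(Y2 = 1) = pi_11 + pi_21
     [0, 1, 0, 1]]   # P(Y2 = 0)
C = [[1, -1, 0, 0],  # logit P(Y1 = 1)
     [0, 0, 1, -1]]  # logit P(Y2 = 1)

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# A joint distribution (pi_11, pi_12, pi_21, pi_22) whose two margins
# are both 0.5, so marginal homogeneity holds
pi = [0.3, 0.2, 0.2, 0.3]
logits = matvec(C, [math.log(x) for x in matvec(A, pi)])
print(logits)                    # both entries equal logit(0.5) = 0

# With U' = (1, -1), the model is equivalent to the single constraint
# U'C log(A pi) = 0, i.e., equal marginal logits
print(logits[0] - logits[1])
```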
The likelihood function l(π) for a marginal logit model is the product of the multinomial mass functions from the various predictor settings. One approach for ML fitting views the model as a set of constraints and uses methods for maximizing a function subject to constraints. In model (11.8), let U denote a full column rank matrix such that the space spanned by the columns of U is the orthogonal complement of the space spanned by the columns of X. Then U′X = 0, and the model has the equivalent constraint form

    U′C log(Aπ) = 0.

For instance, for marginal homogeneity in a 2 × 2 table with (11.8) as expressed above, U′ = (1, −1). Then U′ applied to C log(Aπ) sets the difference between the row and column marginal logits equal to 0.
This method of maximizing the likelihood incorporates these model constraints as well as identifiability constraints, which constrain the response probabilities at each predictor setting to sum to 1. We express this collection of model constraints U′C log(Aπ) = 0 and identifiability constraints as f(π) = 0. The method introduces Lagrange multipliers corresponding to these constraints and solves the Lagrangian likelihood equations using a Newton–Raphson algorithm (Aitchison and Silvey 1958; Haber 1985). Let γ be a vector having elements π and the Lagrange multipliers λ. The Lagrangian likelihood equations have the form h(γ) = 0, where

    h(γ) = h(π, λ) = ( f(π)′, [∂ log l(π)/∂π + λ′ ∂f(π)/∂π]′ )′

is a vector with terms involving the contrasts in marginal logits that the model specifies as constraints, as well as log-likelihood derivatives. The Newton–Raphson method then iterates

    γ^(t+1) = γ^(t) − [∂h(γ^(t))/∂γ]⁻¹ h(γ^(t)),   t = 1, 2, ... .
This can be computationally intensive because the derivative matrix inverted has dimensions larger than the number of elements in γ. A refinement (Lang 1996a; Lang and Agresti 1994) uses an asymptotic approximation to a reparameterized derivative matrix that has a much simpler form, requiring inverting only a diagonal matrix and a symmetric positive definite matrix.

This ML marginal fitting method is available in specialized software (Appendix A mentions an S-Plus function). It makes no assumption about the model that describes the joint distribution π. Thus, when the marginal model holds, the ML estimate of β in (11.8) is consistent regardless of the dependence structure for that distribution. Several alternative fitting approaches have been considered. Lang and Agresti (1994) simultaneously fitted a marginal model and an unsaturated loglinear model for π. The complete model can be specified as a special case of (11.8) and fitted using the constraint approach with Lagrange multipliers just described. In standard cases, the marginal and joint model parameters are orthogonal. If the
marginal model holds, the ML estimator of the marginal model parameters is consistent even if the model for the joint distribution is incorrect.

Fitzmaurice and Laird (1993) gave a related ML approach. A one-to-one correspondence holds between π and the parameters of the saturated loglinear model. They used a further one-to-one correspondence between the main-effect and higher-order parameters of that loglinear model with the marginal probabilities and those same higher-order loglinear parameters. Models were then specified separately for the marginal probabilities and the higher-order (conditional) loglinear parameters. The likelihood is then maximized in terms of the two sets of model parameters. Again, the two sets of parameters are orthogonal, so the ML estimator of marginal model parameters is consistent when the marginal model holds. This mixed-parameter approach is also available in specialized software (Kastner et al. 1997; see also Appendix A).

Yet another ML approach uses a one-to-one correspondence between π and parameters that describe the marginal distributions, the bivariate distributions, the trivariate distributions, and so on (e.g., Glonek and McCullagh 1995; Molenberghs and Lesaffre 1994). Multivariate logistic models then apply to the component distributions, although some higher-order effects may be assumed to vanish, for simplicity. Glonek (1996) proposed a hybrid of this and the Fitzmaurice and Laird (1993) approach.
11.3 MARGINAL MODELING: GENERALIZED ESTIMATING EQUATIONS (GEE) APPROACH
At each combination of predictor values, ML fitting assumes a multinomial distribution over the I^T cells for the T observations on an I-category response. As the number of predictors increases, the number of multinomial probabilities increases dramatically. Currently, none of the ML approaches described above is practical when T is large or there are many predictors, especially when some are continuous. Compared to the continuous-response case using the multivariate normal, marginal modeling of multivariate categorical responses is also hindered by the lack of a simple multivariate distribution for describing correlations among the T responses. For instance, with T means and a common variance and correlation, the multivariate normal has only T + 2 parameters, compared with the I^T − 1 parameters for the multinomial.

An alternative to ML fitting uses a multivariate generalization of quasi-likelihood (Section 4.7). Rather than assuming a particular distribution for Y, the quasi-likelihood method specifies only the first two moments; it links the mean to a linear predictor and also specifies how the variance depends on the mean. The estimates are solutions of estimating equations that are likelihood equations under the further assumption of a distribution in the exponential family with that mean and variance (Wedderburn 1974).
11.3.1 Generalized Estimating Equation Methodology: Basic Ideas
Repeated measurement provides a multivariate response Ž Y1 , Y2 , . . . , YT .,
where T sometimes varies by subject. As in the univariate case, the quasilikelihood method specifies a model for s E Ž Y . and specifies a variance
function ®Ž . describing how varŽ Y . depends on . Now, though, that model
applies to the marginal distribution for each Yt . The method also requires a
working guess for the correlation structure among Yt 4 . The estimates are
solutions of quasi-likelihood equations called generalized estimating equations.
The method is often referred to as the GEE method. Liang and Zeger Ž1986.
proposed it for marginal modeling with GLMs. Their work built on related
material in the econometrics literature Že.g., Gourieroux et al. 1984; Hansen
1982; White 1982.. We outline concepts here and give more details in Section
11.4.
The GEE approach utilizes an assumed covariance structure for (Y1, Y2, ..., YT), specifying a variance function and a pairwise correlation pattern, without assuming a particular multivariate distribution. The GEE estimates of model parameters are valid even if one misspecifies the covariance structure. Consistency (i.e., estimates converging in probability to the true parameters) depends on the first moment but not the second. Specifically, suppose that the model is correct in the sense that the chosen link function and linear predictor truly describe how E(Yt) depends on the predictors, t = 1, ..., T. Then the GEE model parameter estimators are consistent.
In practice, a chosen model is never exactly correct. This result is useful, however, for suggesting that the correlation structure need not adversely affect the quality of estimates for whatever model one uses. Often, no a priori information is available about this structure, and the correlation is regarded as a nuisance. A simple implementation of the GEE method naively treats {Yt} as pairwise independent. Although parameter estimates are usually fine under this naive assumption, standard errors are not. More appropriate standard errors result from an adjustment the GEE method makes using the empirical dependence the data exhibit. The naive standard errors based on the independence assumption are updated using the information the data provide about the actual dependence structure to yield more appropriate (robust) standard errors.
As an alternative to estimates that treat {Yt} as pairwise independent, the GEE method can use a working guess about the correlation structure but again empirically adjust the standard error. The exchangeable working correlation structure treats corr(Yt, Ys) as identical for all s and t. This is more flexible and realistic than the naive independence assumption. Even more realistic is an unstructured working correlation that permits a separate correlation for each pair. When T is large, however, this approach suffers some efficiency loss because of the many additional parameters.
In theory, choosing the working correlation wisely can pay benefits of improved efficiency of estimation. However, Liang and Zeger (1986) noted that estimators based on independence working correlation can have surprisingly good efficiency when the actual correlation is weak to moderate. One can check the sensitivity to the selection by comparing results for different working correlation assumptions. In our experience, when the correlations are modest, all working correlation structures yield similar GEE estimates and standard errors, as the empirical dependence has a large impact on adjusting the naive standard errors. (If they differed substantially, a more careful study of the correlation structure would be necessary.) Unless one expects dramatic differences among the correlations, we recommend the exchangeable working correlation structure. This recognizes the dependence at the cost of only one extra parameter.
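The working correlation structures just described can be written down directly. The sketch below is an illustration only (the function name and default α are our own, not tied to any particular software); it builds R(α) for the independence, exchangeable, and first-order autoregressive choices.

```python
import numpy as np

def working_correlation(T, structure="exchangeable", alpha=0.3):
    """Working correlation matrix R(alpha) for T repeated responses.

    'independence' : R = I (the naive choice)
    'exchangeable' : corr(Yt, Ys) = alpha for all t != s
    'ar1'          : corr(Yt, Ys) = alpha**|t - s|
    """
    if structure == "independence":
        return np.eye(T)
    if structure == "exchangeable":
        # alpha off the diagonal, 1 on the diagonal
        return np.full((T, T), alpha) + (1 - alpha) * np.eye(T)
    if structure == "ar1":
        idx = np.arange(T)
        return alpha ** np.abs(idx[:, None] - idx[None, :])
    raise ValueError(structure)

print(working_correlation(3, "exchangeable", 0.2))
```

The exchangeable choice spends one parameter on the dependence, the AR(1) choice lets correlation decay with separation in time, and the unstructured alternative (not shown) would estimate every pair separately.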
The GEE approach is appealing for categorical data because of its computational simplicity compared to ML. Advantages include not requiring a multivariate distribution and the consistency of estimation even with misspecified correlation structure. However, it has limitations. Since the GEE approach does not completely specify the joint distribution, it does not have a likelihood function. Likelihood-based methods are not available for testing fit, comparing models, and conducting inference about parameters. Instead, inference uses Wald statistics constructed with the asymptotic normality of the estimators together with their estimated covariance matrix. However, unless the sample size is quite large, the empirically based standard errors tend to underestimate the true ones (e.g., Firth 1993b). As estimators, those standard errors can also show more variability than parametric estimators (Kauermann and Carroll 2001). Boos (1992) and Rotnitzky and Jewell (1990) proposed analogs of score tests for effects of predictors, using quasi-log-likelihood, that may be more trustworthy than Wald tests. Some statisticians (e.g., Lindsey 1999) are critical of the GEE approach because of the lack of likelihood. Others do not find this problematic, as they regard GEE as an estimation method rather than a model.
11.3.2 Longitudinal Mental Depression Example
For Table 11.2 comparing two treatments for mental depression, ML fitting of a logit model with drug × time interaction was used in Section 11.2.1. The GEE analysis provides similar results, regardless of the choice of working correlation structure. With the exchangeable structure, the GEE estimated slope (on the logit scale) for the standard drug is β̂3 = 0.48 (SE = 0.12). For the new drug the slope increases by β̂4 = 1.02 (SE = 0.19). Table 11.6 shows results using the independence working correlations. Estimates are the same to two decimal places. The initial estimates and standard errors there are those that apply if the repeated responses are truly independent. They equal those obtained by using ordinary logistic regression with 3 × 340 = 1020 independent observations rather than treating the data as three dependent observations for each of 340 subjects. The empirical standard errors incorporate the sample dependence to adjust the independence-based standard errors.
TABLE 11.6 Output from Using GEE to Fit Logit Model to Table 11.2

Initial Parameter Estimates
Parameter     Estimate    Std Error
Intercept      -0.0280      0.1639
diagnose       -1.3139      0.1464
drug           -0.0596      0.2222
time            0.4824      0.1148
drug*time       1.0174      0.1888

GEE Parameter Estimates
(Empirical Std Error Estimates)
Parameter     Estimate    Std Error
Intercept      -0.0280      0.1742
diagnose       -1.3139      0.1460
drug           -0.0596      0.2285
time            0.4824      0.1199
drug*time       1.0174      0.1877

Working Correlation Matrix
          Col1      Col2      Col3
Row1    1.0000    0.0000    0.0000
Row2    0.0000    1.0000    0.0000
Row3    0.0000    0.0000    1.0000
With exchangeable correlation structure, the estimated common correlation between pairs of the three responses is −0.003. The successive observations apparently have pairwise appearance like independent observations. This is quite unusual for repeated measurement data. For this reason, similar results occur from fitting the model assuming the three observations for a subject actually come from three separate subjects (i.e., assuming 1020 independent observations).
11.3.3 GEE Approach for Multinomial Responses: Insomnia Example
Liang and Zeger (1986) originally specified the GEE methodology for modeling univariate marginal distributions, such as the binomial and Poisson. It extends to marginal modeling of multinomial responses. Lipsitz et al. (1994) outlined a GEE approach for cumulative logit models with repeated ordinal responses. With this approach, for each pair of outcome categories one selects a working correlation matrix for the pairs of repeated observations. Each multinomial response at a fixed observation uses the (I − 1) × (I − 1) multinomial covariance matrix. Section 11.4.4 has details.

We illustrate for the insomnia data of Table 11.4. In Section 11.2.3 we used ML to fit the marginal model

logit[P(Yt ≤ j)] = αj + β1 t + β2 x + β3 tx

for Yt = time to fall asleep with treatment x at occasion t. With independence working correlation structure, the GEE estimates are β̂1 = 1.038 (SE = 0.168), β̂2 = 0.034 (SE = 0.238), and β̂3 = 0.708 (SE = 0.244). The estimates are similar to the ML estimates, and the substantive conclusions are the same. Considerable evidence exists that the distribution of time to fall asleep decreased more for the treatment group than for the placebo group.
11.4 QUASI-LIKELIHOOD AND ITS GEE MULTIVARIATE EXTENSION: DETAILS*
A GLM assumes a certain distribution for the response variable. Sometimes it is unclear how to select it. However, often there is a plausible relationship between the mean and variance, such as v(μi) = μi for count data. Then, an alternative to ML estimation is quasi-likelihood estimation (Section 4.7). We next present some details about this method and its GEE extension for marginal modeling of multivariate responses.
We begin with models for a single response and later discuss marginal models for a multivariate response. For subject i, i = 1, ..., n, let yi be the outcome on Y with μi = E(Yi) and variance function v(μi), and let xij be the value of explanatory variable j. For link function g, the linear predictor is ηi = g(μi) = Σj βj xij = xi′β. The quasi-likelihood (QL) parameter estimates β̂ are the solutions of quasi-score equations

u(β) = Σi (∂μi/∂β)′ [v(μi)]⁻¹ (yi − μi) = 0,    (11.9)
where μi = g⁻¹(xi′β). These estimating equations are the same as the likelihood equations (4.22) for GLMs when we substitute

∂μi/∂βj = (∂μi/∂ηi)(∂ηi/∂βj) = (∂μi/∂ηi) xij.
They are not likelihood equations, however, without the extra assumption that {yi} has distribution in the natural exponential family. Under that assumption, v(μi) characterizes the distribution within the natural exponential family (Jørgensen 1987). Another motivation for equations (11.9) is that with v(μi) replaced by known variance vi, they result from the weighted least squares problem of minimizing Σi (yi − μi)²/vi.

The likelihood equations (4.22) for a GLM depend only on the mean and variance of {yi} and the link function g, which determines ∂μi/∂ηi. Thus, Wedderburn (1974) suggested using them as estimating equations for any link and variance function, even if they do not correspond to a particular member of the natural exponential family.
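As a concrete illustration of Wedderburn's point, for a log link with variance function v(μ) = μ the quasi-score equations reduce to Σi xi(yi − μi) = 0, which are also the Poisson likelihood equations. A minimal Fisher scoring solver (simulated data; the sample size and parameter values are our own) is:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ beta_true))

beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ beta)
    # u(beta) = sum_i (dmu_i/dbeta)' v(mu_i)^{-1} (y_i - mu_i); here it collapses to X'(y - mu)
    score = X.T @ (y - mu)
    info = X.T @ (X * mu[:, None])   # expected information
    step = np.linalg.solve(info, score)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

print(beta)  # close to beta_true
```

The same loop would serve any mean-variance pair; only `mu`, `score`, and `info` change with the link and variance function.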
11.4.1 Properties of Quasi-likelihood Estimators
In the quasi-likelihood (QL) method, the quasi-score function u(β) in (11.9) is called an unbiased estimating function; this term refers to any function h(y; β) of y and β such that E[h(Y; β)] = 0 for all β. The equations (11.9) that determine β̂ are called estimating equations.
The quasi-likelihood method treats the quasi-score function as the derivative of a function called the quasi-log-likelihood. This function may not be a proper log-likelihood function. Nonetheless, McCullagh (1983) showed that QL estimators have properties similar to those of ML estimators. For instance, the QL estimators β̂ are asymptotically normal with covariance matrix approximated by

V = [ Σi (∂μi/∂β)′ [v(μi)]⁻¹ (∂μi/∂β) ]⁻¹.    (11.10)

This is equivalent to the formula for the large-sample covariance matrix of the ML estimator in a GLM [which is estimated by (4.28)].
A key result is that the QL estimator β̂ is consistent for β (i.e., β̂ →p β) even if the variance function is misspecified, as long as the specification is correct for the link function and linear predictor. That is, assuming that the model form g(μi) = Σj βj xij is correct, the consistency of β̂ holds even if the true variance function is not v(μi). We now give a heuristic explanation for this.

When truly μi = g⁻¹(Σj βj xij), then from (11.9), E[uj(β)] = 0 for all j. From (11.9), u(β)/n is a vector of sample means. By a law of large numbers, it converges in probability to its expected value of 0. The solution β̂ of the quasi-score equations is a continuous function of these sample means, so it converges to β, since β̂ is the value of β for which the sum is exactly equal to 0. The consistency also follows from general results for unbiased estimating functions (Liang and Zeger 1995).
11.4.2 Sandwich Covariance Adjustment for Variance Misspecification
If one assumes that var(Yi) = v(μi) but the true var(Yi) ≠ v(μi), then the actual asymptotic covariance matrix of the QL estimator β̂ is not V as given in (11.10). Instead, it is (Diggle et al. 2001; White 1982)

V [ Σi (∂μi/∂β)′ [v(μi)]⁻¹ var(Yi) [v(μi)]⁻¹ (∂μi/∂β) ] V.    (11.11)

Even though the variances are scalar, we express the matrices in this form to motivate the GEE multivariate extension discussed below. Matrix (11.11) simplifies to V if var(Yi) = v(μi). In practice, the true variance function is unknown. A consistent estimator of (11.11) is a sample analog, replacing μi by μ̂i and var(Yi) by (yi − μ̂i)² (Liang and Zeger 1986). The estimated covariance matrix is valid regardless of whether the variance specification v(μi) is correct. This estimated covariance matrix is called a sandwich estimator, because the empirical evidence is sandwiched between the model-driven covariance matrices.
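In the scalar case the sandwich (11.11) is easy to compute directly. The sketch below is illustrative only: it simulates overdispersed counts (true variance well above the working choice v(μ) = μ; the gamma mixing and all settings are our own) and compares the naive model-based standard errors with the sandwich standard errors.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
mu_true = np.exp(X @ np.array([0.2, 0.5]))
# Gamma-mixed Poisson: mean mu, variance mu + 2*mu**2 (overdispersed)
y = rng.poisson(mu_true * rng.gamma(shape=0.5, scale=2.0, size=n))

# Fisher scoring under the working assumption v(mu) = mu, log link
beta = np.zeros(2)
for _ in range(50):
    mu = np.exp(X @ beta)
    step = np.linalg.solve(X.T @ (X * mu[:, None]), X.T @ (y - mu))
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

mu = np.exp(X @ beta)
bread = np.linalg.inv(X.T @ (X * mu[:, None]))           # V of (11.10)
meat = X.T @ (X * ((y - mu) ** 2)[:, None])              # empirical middle term
V_sandwich = bread @ meat @ bread                        # sample analog of (11.11)
print(np.sqrt(np.diag(bread)))        # naive SEs: too small here
print(np.sqrt(np.diag(V_sandwich)))   # sandwich SEs
```

With this much overdispersion the sandwich standard errors come out noticeably larger than the naive ones, while the point estimate of β remains close to the truth, illustrating both halves of the summary above.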
In summary, even with incorrect specification of the variance function, one can still consistently estimate β, and one can estimate the asymptotic variance of β̂ by estimating the sandwich adjustment (11.11). However, some efficiency loss occurs when the variance chosen, v(μi), is wildly inaccurate. Also, the number of clusters n may need to be large for the sample version of (11.11) to work well; otherwise, it can be biased downward. Of course, a modeling process never gets anything exactly correct. Just as the variance function chosen only approximates the true one (hopefully, closely), so is the specification for the mean only approximate.
11.4.3 GEE Methodology: Technical Details
Now we consider the generalized estimating equations (GEE) multivariate generalization of QL. For subject i, let yi = (yi1, ..., yi,Ti)′ and μi = (μi1, ..., μi,Ti)′, where μit = E(Yit). The number Ti of responses may vary by cluster. Let xit denote a p × 1 vector of explanatory variable values for yit. The notation allows for cases where explanatory variables also vary across the repeated measurements. The linear predictor of the model is ηit = g(μit) = xit′β for link function g. The model refers to the marginal distribution at each t rather than the joint distribution. Let Xi be the Ti × p matrix of predictor values for cluster (or subject) i, for which row t is xit′.
We assume that yit has probability mass function of form

f(yit; θit, φ) = exp{ [yit θit − b(θit)]/φ + c(yit, φ) }.

When φ is known, this is the natural exponential family with natural parameter θit. From Section 4.4.1,

μit = E(Yit) = b′(θit),    v(μit) = var(Yit) = b″(θit)φ.
The GEE method also assumes a working correlation matrix R(α) for Yi, depending on parameters α. The exchangeable working correlation has corr(Yit, Yis) = α for each pair in Yi. Let bi(θ) = (b(θi1), ..., b(θi,Ti)), and let Bi denote a diagonal matrix with main diagonal elements b″(θit). Then the working covariance matrix for Yi is

Vi = Bi^{1/2} R(α) Bi^{1/2} φ.    (11.12)

Note that Vi = cov(Yi) if R is the true correlation matrix for Yi.
Now let Δi be the diagonal matrix with elements ∂θit/∂ηit on the main diagonal for t = 1, ..., Ti. (For the canonical link, this is the identity matrix.) Let Di = ∂μi/∂β = BiΔiXi be a Ti × p matrix with typical element expressing ∂μit/∂βj in the form (∂μit/∂θit)(∂θit/∂ηit)(∂ηit/∂βj). From (11.9), for univariate GLMs the quasi-likelihood estimating equations have the form

Σi (∂μi/∂β)′ [v(μi)]⁻¹ [yi − μi(β)] = 0,

where μi = μi(β) = g⁻¹(xi′β). The analog of this in the multivariate case is the set of generalized estimating equations

Σi Di′Vi⁻¹[yi − μi(β)] = 0.

The GEE estimator β̂ is the solution of these equations.
The naive approach, which sets R(α) = I, treats pairs of responses as independent. In that case, (11.12) simplifies to Vi = Biφ, and the generalized estimating equations simplify to

Σi Di′Vi⁻¹[yi − μi(β)] = Σi Xi′ΔiBiVi⁻¹[yi − μi(β)] = (1/φ) Σi Xi′Δi[yi − μi(β)] = 0,

or Σi Xi′Δi[yi − μi(β)] = 0. The solution β̂ is then the same as the ordinary estimator for a GLM with the chosen link function and variance function, treating (yi1, ..., yi,Ti) as independent observations.
Normally, one selects a working correlation matrix permitting dependence, such as the exchangeable structure. For time-series data, also popular is the autoregressive structure, corr(Yit, Yis) = α^|t−s|, which treats observations farther apart in time as more weakly correlated. Liang and Zeger (1986) suggested computing the GEE estimates by iterating between a modified Fisher scoring algorithm for solving the generalized estimating equations for β (given current estimates of α and φ) and using residuals for moment estimation of α and φ (based on the current estimate of β). They suggested estimates of R(α) for a variety of correlation structures. Alternative algorithms simultaneously solve estimating equations for β and for association parameters (e.g., Liang et al. 1992; see also Note 11.8). GEE algorithms need not converge, but often one iteration gives adequate results (Lipsitz et al. 1991).
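The iteration just described can be sketched compactly for a marginal logit model with exchangeable working correlation. The code below is a simplified illustration on simulated clustered binary data (the shared-random-effect data mechanism and every setting are our own, and φ = 1 for binary data); it is not Liang and Zeger's production algorithm, but it alternates a Fisher scoring step for β with a Pearson-residual moment update of α in the same spirit.

```python
import numpy as np

rng = np.random.default_rng(2)
n, T, p = 200, 4, 2

# Simulated clustered binary data: a shared cluster effect induces positive correlation.
X = rng.normal(size=(n, T, p))
beta_true = np.array([0.7, -0.4])
u = rng.normal(scale=0.8, size=(n, 1))
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta_true + u))))

beta, alpha = np.zeros(p), 0.0
for _ in range(100):
    mu = 1 / (1 + np.exp(-(X @ beta)))      # (n, T) marginal means
    var = mu * (1 - mu)                     # Bernoulli variance function
    R = np.full((T, T), alpha) + (1 - alpha) * np.eye(T)
    Rinv = np.linalg.inv(R)
    U, H = np.zeros(p), np.zeros((p, p))
    for i in range(n):                      # accumulate the estimating equations
        s = np.sqrt(var[i])
        Vinv = Rinv / np.outer(s, s)        # V_i^{-1} = B^{-1/2} R^{-1} B^{-1/2}
        D = var[i][:, None] * X[i]          # D_i = B_i Delta_i X_i (Delta = I for logit)
        U += D.T @ Vinv @ (y[i] - mu[i])
        H += D.T @ Vinv @ D
    step = np.linalg.solve(H, U)            # modified Fisher scoring step for beta
    beta = beta + step
    # Moment update of the exchangeable alpha from Pearson residuals:
    r = (y - mu) / np.sqrt(var)
    alpha = np.mean([np.mean(r[:, t] * r[:, s2]) for t in range(T) for s2 in range(t + 1, T)])
    if np.max(np.abs(step)) < 1e-8:
        break

print(beta, alpha)
```

With these settings the estimated α comes out small and positive, and β̂ lands near (though, as usual for marginal models with a latent cluster effect, slightly attenuated from) the conditional values used to simulate.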
Liang and Zeger (1986) showed asymptotic normality and consistency as the number of clusters n increases. Under certain regularity conditions,

√n (β̂ − β) →d N(0, V_G).

Here, generalizing (11.11), V_G = lim_{n→∞} V_{G,n} with

V_{G,n} = n [ Σi Di′Vi⁻¹Di ]⁻¹ [ Σi Di′Vi⁻¹ cov(Yi) Vi⁻¹Di ] [ Σi Di′Vi⁻¹Di ]⁻¹.

The estimated covariance matrix V̂_{G,n}/n of β̂ replaces β with β̂, φ with φ̂, α with α̂, and cov(Yi) by [yi − μi(β̂)][yi − μi(β̂)]′. The purpose of the sandwich estimator is to use the data's empirical evidence about covariation to adjust the standard errors in case the true covariance differs substantially from the working guess.
When the working correlation structure is the true one and cov(Yi) = Vi, the asymptotic covariance matrix V_{G,n}/n simplifies to (Σi Di′Vi⁻¹Di)⁻¹. This is the relevant covariance if we put complete faith in our guess about the correlation structure.
With binary data, the correlation may not be the best way to express the within-cluster association. The marginal probabilities constrain the possible correlation values, since the range of possible values for E(YitYis) = P(Yit = 1, Yis = 1) depends on P(Yit = 1) and P(Yis = 1). An alternative approach uses the odds ratio, for instance by modeling the log odds ratios for pairs in a cluster as exchangeable. This has the advantage that the association parameters are distinct from the means. See Fitzmaurice et al. (1993) and Lipsitz et al. (1991). Carey et al. (1993) suggested an iterative alternating logistic regressions algorithm. It alternates between a GEE step for the regression parameters in the model for the mean and a step for an association model for the log odds ratio. This is useful when the structure of the association is itself a major focus rather than a nuisance.
11.4.4 GEE Approach: Multinomial Responses
We now briefly describe the Lipsitz et al. (1994) GEE approach for marginal modeling with a multinomial response. This is appropriate, for instance, with cumulative logit models. Let yit(j) = 1 if observation t in cluster i has outcome j (j = 1, ..., I − 1). Let yi be the Ti(I − 1) binary indicators for cluster i. Then, one selects a [Ti(I − 1)] × [Ti(I − 1)] working covariance matrix Vi for yi, specifying a pattern for corr(Yit(j), Yis(k)) for each pair of outcome categories (j, k) and each pair (t, s). The (I − 1) × (I − 1) block of Vi for (yit(1), ..., yit(I − 1)) is a multinomial covariance matrix with vit(j) = P(Yit(j) = 1)[1 − P(Yit(j) = 1)] on the main diagonal and −P(Yit(j) = 1)P(Yit(k) = 1) off it. The remaining elements of Vi contain elements cov(Yit(j), Yis(k)). For instance, one possibility is the exchangeable structure, corr(Yit(j), Yis(k)) = αjk for all t and s.
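The (I − 1) × (I − 1) block just described is simply the covariance matrix of the category indicators of one multinomial trial, which can be formed in one line (the function name and example probabilities are ours):

```python
import numpy as np

def multinomial_block(p):
    """Covariance of the I-1 category indicators of a single multinomial trial:
    p_j(1 - p_j) on the main diagonal and -p_j p_k off it."""
    p = np.asarray(p, dtype=float)
    return np.diag(p) - np.outer(p, p)

# Example: I = 4 categories, with P(Y(j) = 1) = 0.2, 0.3, 0.1 for j = 1, 2, 3
print(multinomial_block([0.2, 0.3, 0.1]))
```

Stacking one such block per (i, t), with the chosen corr(Yit(j), Yis(k)) pattern filling the remaining cells, yields the working covariance matrix Vi.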
In this approach the generalized estimating equations for β again have the form

u(β) = Σi Di′Vi⁻¹(yi − μi) = 0,

where μi is the vector of probabilities associated with yi, Di′ = ∂μi′/∂β, and the parameters are evaluated at their current estimates. Lipsitz et al. suggested a Fisher scoring algorithm for solving these equations and a method of moments update for estimating {αjk} at each step of the iteration. An empirically adjusted sandwich covariance matrix of β̂ is again

[ Σi Di′Vi⁻¹Di ]⁻¹ [ Σi Di′Vi⁻¹ cov(Yi) Vi⁻¹Di ] [ Σi Di′Vi⁻¹Di ]⁻¹.

This is estimated by substituting μ̂i from the model fit and replacing cov(Yi) by the empirical covariance matrix of yi.
11.4.5 Dealing with Missing Data
Unfortunately, studies with repeated measurement often have cases for
which at least one response in a cluster is missing. In a longitudinal study, for
instance, some subjects may drop out before its conclusion. When data are
missing, analyzing the observed data alone as if no data are missing can result
in biased estimates.
An advantage of the GEE method is that different clusters can have
different numbers of observations. The data input file has a separate line for
each observation, and for longitudinal studies, computations use those times
for which a subject has an observation. However, bias can arise in GEE
estimates unless one can make certain assumptions about why the data are
missing.
Let Y^(o) denote the observed responses, Y^(m) the missing responses, and Y their union. Let M denote a missing data indicator that equals 1 when an observation is missing and 0 otherwise. Little and Rubin (1987) called the data missing completely at random if M is statistically independent of Y; that is, the probability that an observation is missing is independent of that observation's value, although it may depend on the explanatory variables. Less restrictively, they called the data missing at random if the distribution of (M | Y) equals that of (M | Y^(o)); that is, missingness depends only on Y^(o) and not on the missing values.

When either of these is plausible, with a likelihood-based analysis it is not necessary to model the missingness mechanism. An analysis using only Y^(o) is not systematically biased. The same is true with GEE methods when estimating equations can be weighted by response probabilities (Robins et al. 1995). Otherwise, however, with non-likelihood-based methods such as GEE, the missingness process can be ignored only when data are missing completely at random. Kenward et al. (1994) illustrated the breakdown in GEE estimates when the data are not missing completely at random.
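A small simulation illustrates why the distinction matters for complete-case analyses (the distribution, sample size, and missingness rates are hypothetical): deleting incomplete cases is harmless when values are missing completely at random, but biased when missingness depends on the value itself.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
y = rng.normal(loc=5.0, size=n)   # true mean is 5.0

# MCAR: whether a value is missing has nothing to do with y
mcar = rng.random(n) < 0.3

# Missingness depending on the value itself (not even missing at random):
# larger responses are more likely to be missing
mnar = rng.random(n) < 1 / (1 + np.exp(-(y - 5)))

print(y[~mcar].mean())   # close to 5.0: complete cases are fine under MCAR
print(y[~mnar].mean())   # well below 5.0: complete-case analysis is biased
```

In the second case the observed cases systematically underrepresent large responses, so their mean underestimates the true mean, which is the breakdown described above.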
Often, missingness depends on the missing values. For instance, in a longitudinal study measuring pain, perhaps a subject dropped out when the pain got above some threshold. Then, more complex analyses are needed that model the joint distribution of Y and M (Little 1998). Let f(·) denote a generic probability mass function, which also depends on explanatory variables x and parameters. Selection models factor the joint distribution of Y and M as

f(y, M; x, β, ψ) = f(y; x, β) f(M | y; x, ψ),

where f(y; x, β) is the model in the absence of missing values and f(M | y; x, ψ) is the model for the missing-data mechanism. Pattern mixture models use the alternative factorization

f(y, M; x, γ, δ) = f(y | M; x, γ) f(M; x, δ),

which conditions the distribution of Y on the missing-data pattern. The two specifications are equivalent when M is independent of Y, with β = γ and ψ = δ. For discussion of advantages of each modeling approach and details on ways of modeling missingness, see Little (1998) and references in Note 11.9. See Stokes et al. (2000, p. 524) for an example of building the missingness pattern into a model to check whether it is associated with the response or interacts with effects of explanatory variables.
Analyses in the presence of much missingness should be made with
caution. Typically, little is known about the missing data mechanism, and
assumptions about it cannot be checked. Since inferences may not be robust,
a sensitivity study is necessary to check how results depend on specification
of that mechanism. In the absence of a model for the missingness, one should
at least compare results of the analysis using all available cases for all clusters
to the analysis using only clusters having no missing observations. If results
differ substantially, conclusions should be very tentative until the reasons for
missingness can be studied.
11.5 MARKOV CHAINS: TRANSITIONAL MODELING
When Yt denotes the response at time t, t = 0, 1, 2, ..., the indexed family of random variables (Y0, Y1, Y2, ...) is a stochastic process. The state space of the process is the set of possible values for Yt. The value Y0 is the initial state. When the state space is categorical and observations occur at a discrete set of times, {Yt} has discrete state space and discrete time.
11.5.1 Transitional Models
The main focus is usually on the dependence of Yt on the responses {y0, y1, ..., yt−1} observed previously as well as any explanatory variables. Models of this type are called transitional models. Let f(y0, ..., yT) denote the joint probability mass function of (Y0, ..., YT) (ignoring, for now, explanatory variables). Transitional models use the factorization

f(y0, ..., yT) = f(y0) f(y1 | y0) f(y2 | y0, y1) ··· f(yT | y0, y1, ..., yT−1).

Unlike the marginal models in the other sections of this chapter, this modeling is conditional on previous responses.

In this section we introduce discrete-time Markov chains, a simple stochastic process having discrete state space. Many transitional models have Markov chain structure for at least part of the model.
11.5.2 First-Order Markov Chains
A Markov chain is a stochastic process for which, for all t, the conditional distribution of Yt+1, given Y0, ..., Yt, is identical to the conditional distribution of Yt+1 given Yt alone. That is, given Yt, Yt+1 is conditionally independent of Y0, ..., Yt−1. Knowing the present state of a Markov chain, information about past states does not help us predict the future. For Markov chains,

f(y0, ..., yT) = f(y0) f(y1 | y0) f(y2 | y1) ··· f(yT | yT−1).    (11.13)

A stochastic process is a kth-order Markov chain if, for all t, the conditional distribution of Yt+1, given Y0, ..., Yt, is identical to the conditional distribution of Yt+1, given (Yt, ..., Yt−k+1). Given the states at the previous k times, the future behavior of the chain is independent of past behavior before those k times. Our discussion here focuses mainly on ordinary Markov chains as in (11.13), which are first order (k = 1).

Denote the conditional probability P(Yt = j | Yt−1 = i) by πj|i(t). The {πj|i(t)}, which satisfy Σj πj|i(t) = 1, are called transition probabilities. The I × I matrix {πj|i(t), i = 1, ..., I, j = 1, ..., I} is a transition probability matrix. It is called one-step, to distinguish it from the matrix of probabilities for k-step transitions from time t − k to time t.
From (11.13), the joint distribution for a Markov chain depends only on one-step transition probabilities and the marginal distribution for the initial state. It also follows that the joint distribution satisfies loglinear model

(Y0Y1, Y1Y2, ..., YT−1YT).

For a sample of realizations of a stochastic process, a contingency table displays counts of the possible sequences. A test of fit of this loglinear model checks whether the process plausibly satisfies the Markov property.
Statistical inference for Markov chains uses standard methods of categorical data analysis. For example, consider ML estimation of transition probabilities. Let nij(t) denote the number of transitions from state i at time t − 1 to state j at time t. For fixed t, {nij(t)} form the two-way marginal table for dimensions t − 1 and t of an I^(T+1) contingency table. For the ni+(t) subjects in category i at time t − 1, suppose that {nij(t), j = 1, ..., I} have a multinomial distribution with parameters {πj|i(t)}. Let {ni0} denote the initial counts. Suppose that they also have a multinomial distribution, with parameters {πi0}. If subjects behave independently, from (11.13) the likelihood function is proportional to

( Π_{i=1}^{I} πi0^{ni0} ) { Π_{t=1}^{T} Π_{i=1}^{I} Π_{j=1}^{I} [πj|i(t)]^{nij(t)} }.    (11.14)

The transition probabilities are parameters of IT independent multinomial distributions. From Anderson and Goodman (1957), the ML estimates are

π̂j|i(t) = nij(t)/ni+(t).
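For a time-homogeneous chain (transition probabilities not depending on t), pooling the transition counts over time gives the analogous ML estimates π̂(j|i) = nij/ni+. A minimal sketch (the function name and toy sequences are ours):

```python
import numpy as np

def transition_mle(sequences, I):
    """ML transition probability estimates pi_hat(j|i) = n_ij / n_i+,
    pooling counts over time (time-homogeneous case)."""
    counts = np.zeros((I, I))
    for seq in sequences:
        for a, b in zip(seq[:-1], seq[1:]):   # successive (state, next state) pairs
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

chains = [[0, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
print(transition_mle(chains, 2))
```

Each row of the resulting matrix is a conditional distribution over the next state and so sums to 1.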
11.5.3 Respiratory Illness Example
Table 11.7 refers to a longitudinal study at Harvard of effects of air pollution on respiratory illness in children. The children were examined annually at ages 9 through 12 and classified according to the presence or absence of wheeze.

Denote the binary response (wheeze, no wheeze) by Yt at age t, t = 9, 10, 11, 12. The loglinear model (Y9Y10, Y10Y11, Y11Y12) represents a first-order Markov chain. It fits poorly, with G² = 122.9 (df = 8). Given the state at time t, classification at time t + 1 depends on states at times previous to time t. The model (Y9Y10Y11, Y10Y11Y12) represents a second-order Markov chain, satisfying conditional independence at ages 9 and 12, given states at ages 10 and 11. This model also fits poorly, with G² = 23.9 (df = 4). The poor fits may partly reflect subject heterogeneity, since these analyses ignore possibly relevant covariates such as parental smoking behavior.
The loglinear model (Y9Y10, Y9Y11, Y9Y12, Y10Y11, Y10Y12, Y11Y12) that permits association at each pair of ages fits well, with G² = 1.5 (df = 5). Table 11.8 shows its ML estimates of pairwise conditional log odds ratios. The association seems similar for pairs of ages 1 year apart, and somewhat weaker for pairs of ages more than 1 year apart. The simpler model in which

λij^{Y9Y10} = λij^{Y10Y11} = λij^{Y11Y12}    and    λij^{Y9Y11} = λij^{Y9Y12} = λij^{Y10Y12}

fits well, with G² = 2.3 (df = 9). The estimated log odds ratios are 1.75 in the first case and 1.04 in the second.

TABLE 11.7 Results of Breath Test at Four Ages a

Y9  Y10  Y11  Y12  Count      Y9  Y10  Y11  Y12  Count
 1   1    1    1     94        2   1    1    1     19
 1   1    1    2     30        2   1    1    2     15
 1   1    2    1     15        2   1    2    1     10
 1   1    2    2     28        2   1    2    2     44
 1   2    1    1     14        2   2    1    1     17
 1   2    1    2      9        2   2    1    2     42
 1   2    2    1     12        2   2    2    1     35
 1   2    2    2     63        2   2    2    2    572

a 1, wheeze; 2, no wheeze.
Source: Ware et al. (1988).

TABLE 11.8 Estimated Conditional Log Odds Ratios for Table 11.7

Association    Estimate    Simpler Structure
Y9Y10            1.81           1.75
Y10Y11           1.65           1.75
Y11Y12           1.85           1.75
Y9Y11            0.95           1.04
Y9Y12            1.05           1.04
Y10Y12           1.07           1.04
11.5.4 Transitional Models with Explanatory Variables
Transitional models usually also include explanatory variables x. The joint mass function of T sequential responses is then

f(y1, ..., yT; x) = f(y1; x) f(y2 | y1; x) f(y3 | y1, y2; x) ··· f(yT | y1, y2, ..., yT−1; x).

With binary y, for instance, one might specify a logistic regression model for each term in this factorization,

f(yt | y1, ..., yt−1; xt) = exp[yt(α + β1 y1 + ··· + βt−1 yt−1 + β′xt)] / {1 + exp(α + β1 y1 + ··· + βt−1 yt−1 + β′xt)},    yt = 0, 1.

Here, the predictor xt may take a different value for each component. The model treats previous responses as explanatory variables. It is called a regressive logistic model (Bonney 1987).
The interpretation and magnitude of β̂ depend on how many previous observations are in the model. Within-cluster effects may diminish markedly by conditioning on previous responses. This is an important difference from marginal models, for which the interpretation does not depend on the specification of the dependence structure. In the special case of first-order Markov structure, the coefficients of {y1, ..., yt−2} equal 0 in the model for yt (e.g., Azzalini 1994; Bonney 1987). It may help to allow interaction between xt and yt−1 in their effects on yt.

For a given subject, the product of the conditional mass functions determines that subject's contribution to the likelihood function. (One usually ignores the contribution of the marginal distribution for the first term.) That is, given the predictor, the model treats repeated transitions by a subject as independent. Thus, one can fit the model with ordinary GLM software, treating each transition as a separate observation (Bonney 1986).
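Concretely, fitting a regressive logistic model with ordinary GLM software amounts to reshaping each subject's series into one record per transition, carrying the previous response along as a predictor. A sketch with hypothetical records (the toy responses and covariate are ours):

```python
import numpy as np

# Hypothetical repeated binary responses, one row per subject over T = 4 occasions,
# plus a subject-level covariate s.
subjects = np.array([
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
])
s = np.array([0, 1, 1])

rows = []
T = subjects.shape[1]
for i, yseq in enumerate(subjects):
    for t in range(1, T):
        # One "independent" record per transition:
        # (outcome y_t, covariate s, time t, previous response y_{t-1})
        rows.append((yseq[t], s[i], t, yseq[t - 1]))

for r in rows:
    print(r)   # feed records like these to ordinary logistic regression software
```

Each subject with T observed occasions contributes T − 1 such records, matching the product-of-conditionals likelihood above.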
11.5.5 Child's Respiratory Illness and Maternal Smoking
Table 11.9 is also from the Harvard study of air pollution and health. At ages
7 through 10, children were evaluated annually on the presence of respiratory
illness. A predictor is maternal smoking at the start of the study, where s = 1
for smoking regularly and s = 0 otherwise. Let y_t denote the response at age
t (t = 7, 8, 9, 10). We consider the regressive logistic model

logit P(Y_t = 1) = α + β_1 s + β_2 t + β_3 y_{t-1},   t = 8, 9, 10.

Each subject contributes three observations to the model fitting. The data
set consists of 12 binomials, for the 2 × 3 × 2 combinations of (s, t, y_{t-1}).
For instance, for the combination (0, 8, 0), y_8 = 0 for 237 + 10 + 15 + 4 =
TABLE 11.9 Child's Respiratory Illness by Age and Maternal Smoking

                                  No Maternal       Maternal
Child's Respiratory Illness         Smoking          Smoking
                                    Age 10           Age 10
Age 7   Age 8   Age 9             No     Yes       No     Yes
No      No      No               237      10      118       6
                Yes               15       4        8       2
        Yes     No                16       2       11       1
                Yes                7       3        6       4
Yes     No      No                24       3        7       3
                Yes                3       2        3       1
        Yes     No                 6       2        4       2
                Yes                5      11        4       7

Source: Data courtesy of James Ware.
266 subjects and y_8 = 1 for 16 + 2 + 7 + 3 = 28 subjects. The ML fit is

logit P̂(Y_t = 1) = −0.293 + 0.296s − 0.243t + 2.211y_{t-1},

with SE values (0.846, 0.156, 0.095, 0.158). Not surprisingly, the previous
observation has a strong effect. Given that and the child's age, there is slight
evidence of a positive effect of maternal smoking: The likelihood-ratio
statistic for H_0: β_1 = 0 is 3.55 (df = 1, P = 0.06). The model itself does not
show any evidence of lack of fit (G² = 3.1, df = 8).
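As a quick check on such a fit, one can reproduce fitted probabilities from the reported coefficients (−0.293, 0.296, −0.243, 2.211). For the 294 children of nonsmoking mothers with no illness at age 7, the sample proportion ill at age 8 is 28/294 ≈ 0.095, and the fitted value should be close. A minimal sketch:

```python
import math

def fitted_prob(s, t, ylag, coef=(-0.293, 0.296, -0.243, 2.211)):
    """P(Y_t = 1) from the regressive logistic fit:
    logit = a + b1*s + b2*t + b3*y_{t-1}. Coefficients transcribed
    from the fit reported above."""
    a, b1, b2, b3 = coef
    eta = a + b1 * s + b2 * t + b3 * ylag
    return 1.0 / (1.0 + math.exp(-eta))

# Child of a nonsmoking mother, age 8, no illness at age 7:
p0 = fitted_prob(s=0, t=8, ylag=0)   # roughly 0.096, near 28/294
# Same child but with illness at age 7 -- the lag effect is large:
p1 = fitted_prob(s=0, t=8, ylag=1)
```

The jump from p0 to p1 illustrates the strong effect of the previous observation noted in the text.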
NOTES
Section 11.1: Comparing Marginal Distributions: Multiple Responses
11.1. Darroch (1981) surveyed thoroughly the relationships among statistics for testing
marginal homogeneity and their connections with generalized CMH analyses. See also
Mantel and Byar (1978) and White et al. (1982). Croon et al. (2000) studied a variety
of hypotheses for longitudinal data in the context of the generalized loglinear model.
Section 11.2: Marginal Modeling: Maximum Likelihood Approach
11.2. For other work on ML fitting of marginal models, see Bergsma and Rudas (2002),
Ekholm et al. (2000), Fitzmaurice et al. (1993), and Lang et al. (1999).
Section 11.3: Marginal Modeling: Generalized Estimating Equations Approach
11.3. Liang et al. (1992) discussed GEE methods for categorical (primarily binary) responses.
For multinomial responses, see Heagerty and Zeger (1996), Lipsitz et al.
(1994), Miller et al. (1993), and references in Agresti and Natarajan (2001). More
general models with ordinal responses allow for dispersion parameters that also
depend on covariates (Toledano and Gatsonis 1996).
11.4. LaVange et al. (2001) used GEE methods to adjust for clustered sampling in surveys
and clinical trials. Boos (1992) discussed generalized score tests that incorporate
empirical variance estimates, illustrating with tests for trend and lack of fit in binary
regression.
11.5. Koch et al. (1977) used weighted least squares (WLS) to fit marginal models to Table
11.2. WLS for categorical modeling is described in Section 15.1. It has severe
limitations (e.g., covariates must be categorical and marginal tables cannot be sparse)
but led naturally to the GEE approach.
Section 11.4: Quasi-likelihood and Its GEE Multivariate Extension: Details
11.6. Firth (1993b) provided a useful overview of quasi-likelihood methods. McCullagh
(1983) showed that under correct specification of the mean and the variance function,
quasi-likelihood estimators are asymptotically efficient among estimators that are
locally linear in {y_i}. His result generalizes the Gauss–Markov theorem, although in an
asymptotic rather than exact manner. See also Heyde (1997) and Liang and Zeger
(1995) for discussions of unbiased estimating functions and their connections with
asymptotic consistency and efficiency. Godambe showed in 1960 that ML estimators
are optimal solutions with an unbiased estimating function. When quasi-likelihood
estimators are not ML, Cox (1983) and Firth (1987) suggested that they still retain
good efficiency when the departure from the natural exponential family is at most
moderate, such as modest overdispersion relative to such a family.
11.7. The generalized estimating equations are likelihood equations, and hence the GEE
estimates are also ML, in certain cases. Examples are multivariate normal data or
binary data when the working covariance is correct (Fitzmaurice et al. 1993). Results
about effects of model misspecification arise in a variety of model-building contexts.
For general theory, see Gourieroux et al. (1984), Hansen (1982), Liang and Zeger
(1995), and White (1982).
11.8. A GEE2 analysis adds estimating equations for the correlation structure (Prentice and
Zhao 1991). This has the potential to increase efficiency. A disadvantage is that,
unlike with ordinary GEE, β̂ is no longer consistent if this part of the model is
misspecified. Qu et al. (2000) showed how to increase efficiency by representing the
working correlation matrix by a linear combination of basis matrices.
11.9. For surveys of ways to handle missing data, see Little (1998), Little and Rubin (1987,
Chap. 9), Schafer (1997), and Verbeke and Molenberghs (2000). See also Baker and
Laird (1988), Fay (1986), Fitzmaurice et al. (1994), Forster and Smith (1998), Fuchs
(1982), Molenberghs and Goetghebeur (1997), Molenberghs et al. (1997), Park and
Brown (1994), and Stokes et al. (2000).
Section 11.5: Markov Chains: Transitional Modeling
11.10. For statistical inference with Markov chains, see Andersen (1980, Sec. 7.7), Anderson
and Goodman (1957), Billingsley (1961), Bishop et al. (1975, Chap. 7), and Kalbfleisch
and Lawless (1985). See Conaway (1989), Stiratelli et al. (1984), and Ware et al. (1988)
for other analyses focusing on the conditional dependence structure.
PROBLEMS
Applications
11.1 Refer to Table 8.3. Viewing the table as matched triplets, construct
the marginal distribution for each substance. Find the sample proportions of students who used marijuana, alcohol, and cigarettes. Test
the hypothesis of marginal homogeneity. Interpret results.
11.2 Refer to Table 9.1. Fit a marginal model to describe main effects of
race, gender, and substance type (marijuana, alcohol, cigarettes) on
whether a subject had used that substance. Summarize effects.
11.3 Refer to Problem 11.2. Further study shows evidence of an interaction
between gender and substance type. Using GEE with exchangeable
working correlation, the model fit for the probability π of using
a particular substance is

logit(π̂) = −0.57 + 1.93S_1 + 0.86S_2 + 0.38R − 0.20G + 0.37(G × S_1) + 0.22(G × S_2),

where R, G, S_1, S_2 are dummy variables for race (1 = white), gender
(1 = female), and substance type (S_1 = 1, S_2 = 0 for alcohol; S_1 =
0, S_2 = 1 for cigarettes; S_1 = S_2 = 0 for marijuana). Show that:
a. The estimated odds a nonwhite male has used marijuana are
exp(−0.57) = 0.57.
b. Given gender, the estimated odds a white subject used a given
substance are 1.46 times the estimated odds for a black subject.
c. Given race, the estimated odds a female has used alcohol are 1.19
times the estimated odds for males; for cigarettes and for marijuana,
the estimated odds ratios are 1.02 and 0.82.
d. Given race, the estimated odds a female has used alcohol
(cigarettes) are 9.97 (2.94) times the estimated odds she has used
marijuana.
e. Given race, the estimated odds a male has used alcohol (cigarettes)
are 6.89 (2.36) times the estimated odds he has used marijuana.
Interpret the interaction.
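Parts (a)–(e) all follow by exponentiating the relevant linear combinations of the fitted coefficients. A short check, with the coefficients transcribed from the fitted equation:

```python
import math

# GEE coefficients from the fitted model in Problem 11.3
b = {"int": -0.57, "S1": 1.93, "S2": 0.86, "R": 0.38,
     "G": -0.20, "GS1": 0.37, "GS2": 0.22}

# (a) nonwhite male, marijuana (S1 = S2 = R = G = 0): odds = exp(intercept)
odds_a = math.exp(b["int"])

# (c) female vs. male odds ratio for alcohol: exp(G + G*S1)
or_alcohol = math.exp(b["G"] + b["GS1"])

# (d) female, alcohol vs. marijuana odds ratio: exp(S1 + G*S1)
or_d = math.exp(b["S1"] + b["GS1"])
```

Each value rounds to the number quoted in the problem (0.57, 1.19, 9.97).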
11.4 Refer to Table 11.2. Analyze the data using the scores (1, 2, 4) for the
week number, using ML or GEE. Interpret estimates and compare
substantive results to those in the text with scores (0, 1, 2).
11.5 Analyze Table 11.9 using a marginal logit model with age and
maternal smoking as predictors. Compare interpretations to the
Markov model of Section 11.5.5.
11.6 Table 11.10 refers to a three-period crossover trial to compare placebo
(treatment A) with a low-dose analgesic (treatment B) and high-dose
analgesic (treatment C) for relief of primary dysmenorrhea. Subjects
in the study were divided randomly into six groups, the possible
sequences for administering the treatments. At the end of each
period, each subject rated the treatment as giving no relief (0) or
some relief (1). Let y_{i(k)t} = 1 denote relief for subject i using treatment
t (t = A, B, C), where subject i is nested in treatment sequence
k (k = 1, ..., 6). Assuming common treatment effects for each sequence,
and setting β_A = 0, obtain and interpret {β̂_t} (using ML or
GEE) for the model

logit P(Y_{i(k)t} = 1) = α_k + β_t.

How would you order the drugs, taking significance into account?
TABLE 11.10 Data for Problem 11.6

Treatment        Response Pattern for Treatments (A, B, C)
Sequence      000   001   010   011   100   101   110   111
A B C           0     2     2     9     0     0     1     1
A C B           2     0     0     9     1     0     0     4
B A C           0     1     1     8     1     3     0     1
B C A           0     1     1     8     1     0     0     1
C A B           3     0     0     7     0     1     2     1
C B A           1     5     0     4     0     3     1     0

Source: Jones and Kenward (1987).
11.7 Table 11.11 is from a Kansas State University survey of 262 pig
farmers. For the question "What are your primary sources of veterinary
information?," the categories were (A) professional consultant,
(B) veterinarian, (C) state or local extension service, (D) magazines,
and (E) feed companies and reps. Farmers sampled were asked to
select all relevant categories. The 2⁵ × 2 × 4 table shows the (yes, no)
counts for each of these five sources cross-classified with the farmers'
education (whether they had at least some college education) and size
of farm (number of pigs marketed annually, in thousands).
TABLE 11.11 Data for Problem 11.7

Entries are counts of farmers; within each combination of responses on
A, B, and C, the two columns give the response on D (Y = yes, N = no).

                            A = yes                       A = no
                    B = yes        B = no         B = yes        B = no
                  C=yes  C=no   C=yes  C=no     C=yes  C=no   C=yes  C=no
Educ  Pigs   E    Y  N   Y  N   Y  N   Y  N     Y  N   Y  N   Y  N   Y  N
No    <1     Y    1  0   0  0   0  0   0  0     2  1   1  2   1  1   5  3
             N    0  0   0  0   0  0   0  1     1  0   0  5   4  7   7  0
      1–2    Y    2  0   0  0   0  0   0  0     4  0   0  4   1  0   0  4
             N    0  0   0  0   0  0   0  0     0  0   0  5   0  3   4  0
      2–5    Y    3  0   0  0   0  0   0  0     3  0   0  1   2  0   1  1
             N    1  0   0  0   0  0   0  3     0  0   0  2   0  1   4  0
      >5     Y    2  0   0  0   0  0   0  0     1  0   1  0   0  1   0  2
             N    1  0   0  2   1  0   1  6     0  1   1  1   0  0   6  0
Some  <1     Y    3  0   0  0   0  0   0  0     4  0   1  1   0  0   2 11
             N    0  0   0  0   0  0   0  0     4  0   1  2   4  6  14  0
      1–2    Y    0  0   0  0   0  0   0  0     2  0   0  1   0  0   1  6
             N    0  0   0  0   1  0   0  1     2  1   0  4   2  7  14  0
      2–5    Y    0  0   0  0   0  0   0  0     1  0   0  0   0  1   1  3
             N    1  0   0  0   0  0   0  0     0  0   0  5   0  4   4  0
      >5     Y    1  0   0  0   0  0   0  0     0  0   1  1   0  0   0  2
             N    1  1   0  0   0  1   0 10     0  0   0  4   1  2   4  0

Source: Data courtesy of Tom Loughin, Kansas State University.
a. Explain why it is not proper to analyze the data by fitting a
multinomial model to the counts in the 2 × 4 × 5 contingency
table cross-classifying education by size of farm by the source of
veterinary information, treating source as the response variable.
(This table contains 453 positive responses of sources from the 262
farmers.)
b. For a farmer with education i and size of farm s, let π_j(is) denote
the probability of responding "yes" on the jth source. Table 11.12
shows output for using GEE with exchangeable working correlation
to estimate parameters in the model lacking an education
effect,

logit π_j(is) = α_j + β_j s,   s = 1, 2, 3, 4.

Explain how to interpret the working correlation matrix. Explain
why the results suggest a strong positive size of farm effect for
source A and perhaps a weak negative size effect of similar
magnitude for C, D, and E.
c. Constraining β_3 = β_4 = β_5, the ML estimate of the common slope
is −0.184 (SE = 0.063). Explain why it is advantageous to fit the
marginal model simultaneously for all sources rather than separately
to each. [Agresti and Liu (1999) and Loughin and Scherer
(1998) discussed analyses for data of this form.]
TABLE 11.12 Output for Problem 11.7
Row1
Row2
Row3
Row4
Row5
Col1
1.0000
0.0997
0.0997
0.0997
0.0997
Parameter
source
source
source
source
source
size*source
size*source
size*source
size*source
size*source
Working Correlation Matrix
Col2
Col3
Col4
0.0997
0.0997
0.0997
1.0000
0.0997
0.0997
0.0997
1.0000
0.0997
0.0997
0.0997
1.0000
0.0997
0.0997
0.0997
Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates
Estimate
Std Error
Z
1
y4.4994
0.6457
y6.97
2
y0.8279
0.2809
y2.95
3
y0.1526
0.2744
y0.56
4
0.4875
0.2698
1.81
5
y0.0808
0.2738
y0.30
1
1.0812
0.1979
5.46
2
0.0792
0.1105
0.72
3
y0.1894
0.1121
y1.69
4
y0.2206
0.1081
y2.04
5
y0.2387
0.1126
y2.12
Col5
0.0997
0.0997
0.0997
0.0997
1.0000
Pr> <Z <
<.0001
0.0032
0.5780
0.0708
0.7680
<.0001
0.4738
0.0912
0.0412
0.0341
TABLE 11.13 Output for Problem 11.8

       Working Correlation Matrix
          Col1     Col2     Col3
Row1    1.0000   0.8173   0.8173
Row2    0.8173   1.0000   0.8173
Row3    0.8173   0.8173   1.0000

    Analysis Of GEE Parameter Estimates
     Empirical Standard Error Estimates
Parameter     Estimate   Std Error       Z   Pr>|Z|
Intercept      -0.1253      0.0676   -1.85   0.0637
question 1      0.1493      0.0297    5.02   <.0001
question 2      0.0520      0.0270    1.92   0.0544
question 3      0.0000      0.0000       .        .
female          0.0034      0.0878    0.04   0.9688
11.8 Refer to Table 11.13 on attitudes toward legalized abortion. For the
response Y_t (1 = support legalization, 0 = oppose) for question t
(t = 1, 2, 3) and for gender g (1 = female, 0 = male), consider the
model logit[P(Y_t = 1)] = α + γg + β_t with β_3 = 0.
a. A GEE analysis using unstructured working correlation gives correlation
estimates 0.826 for questions 1 and 2, 0.797 for 1 and 3,
and 0.832 for 2 and 3. What does this suggest about a reasonable
working correlation structure?
b. Table 11.13 shows a GEE analysis with exchangeable working
correlation. Interpret effects.
c. Treating the three responses for each subject as independent
observations and performing ordinary logistic regression, β̂_1 =
0.149 (SE = 0.066), β̂_2 = 0.052 (SE = 0.066), and γ̂ = 0.004 (SE
= 0.054). Give a heuristic explanation of why within-subject standard
errors are much larger than with GEE, yet the between-subject
standard error is smaller.
11.9 Refer to the air pollution data in Table 11.7. Using ML or GEE, fit
marginal logit models that assume (a) marginal homogeneity, (b) a
linear effect of time, and (c) no pattern. Interpret and compare.
11.10 Refer to the clinical trials data in Table 12.5, analyzed with random
effects models in Section 12.3.4. Use GEE methods to analyze them,
treating each center as a correlated cluster.
11.11 Refer to Table 10.5. Using GEE methods with cumulative logits,
compare the two marginal distributions. Compare results to those
using ML in Section 10.3.2.
11.12 Refer to the 3⁴ table on government spending in Table 8.19. Analyze
these data with a marginal cumulative logit model. Interpret effects.
487
PROBLEMS
11.13 Refer to Table 11.4.
a. To compare effects while controlling for initial response, fit model
(11.7), using scores {10, 25, 45, 75} for time to falling asleep. Also fit
the interaction model, and describe the lack of fit. (Note that for
the first two baseline levels, the active and placebo treatments
have similar sample response distributions at the follow-up; at
higher baseline levels, the active treatment seems more successful.)
b. Fit the interaction model

logit P(Y_2 ≤ j) = α_j + β_1 x + β_2 y_1 + β_3 xy_1

that constrains the effects {β_1 x + β_2 y_1 + β_3 xy_1} to follow the pattern
(β, β, β + δ, β) for the active group and (β, β, β, 0) for the placebo
group. Interpret δ̂.
11.14 Find a marginal model with another type of logit that fits the
insomnia data of Table 11.4 well. Interpret parameter estimates, and
compare conclusions to those using cumulative logits.
11.15 Refer to Table 11.9. Combine the data for the two levels of maternal
smoking. Does a first-order Markov chain model these data adequately? Find a loglinear model that does fit adequately.
11.16 Analyze Table 11.9 using a transitional model with two previous
responses. Does it fit better than the first-order model of Section
11.5.5? Interpret.
11.17 Analyze Table 11.2 using a first-order transitional model. Compare
interpretations to those in this chapter using marginal models.
11.18 Table 11.14 is from a longitudinal study of coronary risk factors in
schoolchildren (Woolson and Clarke 1984). A sample of children aged
11–13 in 1977 were classified by gender and by relative weight (obese,
not obese) in 1977, 1979, and 1981. Analyze these data.
TABLE 11.14 Data for Problem 11.18

            Responses^a
Gender    NNN  NNO  NON  NOO  ONN  ONO  OON  OOO
Male      119    7    8    3   13    4   11   16
Female    129    8    7    9    6    2    7   14

^a NNN indicates not obese in 1977, 1979, and 1981; NNO indicates not obese in 1977 and 1979
but obese in 1981; and so on.
Source: Reproduced with permission from the Royal Statistical Society, London (Woolson and
Clarke 1984).
11.19 Refer to the pig farmer survey of Problem 11.7 (Table 11.11). Analyze
these data using marginal models with all the variables.
11.20 Refer to the cereal diet and cholesterol study of Problem 7.18 (Table
7.23). Analyze these data with marginal models.
Theory and Methods
11.21 Refer to Problem 11.1. Suppose that we expressed the data with a
3 × 2 partial table of drug-by-response for each subject, to use a
generalized CMH procedure to test marginal homogeneity. Explain
why the 911 + 279 subjects who make the same response for every
drug have no effect on the test.
11.22 Let y_it = 1 or 0 for observation t on subject i, i = 1, ..., n, t =
1, ..., T. Let ȳ_·t = Σ_i y_it / n, ȳ_i· = Σ_t y_it / T, and ȳ_·· = Σ_i Σ_t y_it / nT.
a. Regard {y_i+} as fixed. Suppose that each way to allocate the y_i+
"successes" to y_i+ of the T observations is equally likely. Show that
E(Y_it) = ȳ_i·, var(Y_it) = ȳ_i·(1 − ȳ_i·), and cov(Y_it, Y_ik) = −ȳ_i·(1 −
ȳ_i·)/(T − 1) for t ≠ k. [Hint: The covariance is the same for any
pair of cells in the same row, and var(Σ_t Y_it) = 0 since y_i+ is fixed.]
b. Refer to part (a). For large n with independent subjects, explain
why (Ȳ_·1, ..., Ȳ_·T) is approximately multivariate normal with pairwise
correlation ρ = −1/(T − 1). Conclude that Cochran's Q
statistic (Cochran 1950)

Q = n²(T − 1) Σ_{t=1}^T (ȳ_·t − ȳ_··)² / [T Σ_{i=1}^n ȳ_i·(1 − ȳ_i·)]

is approximately chi-squared with df = (T − 1). [One way notes
that if (X_1, ..., X_T) is multivariate normal with common mean and
common variance σ² and common correlation ρ for pairs (X_t, X_k),
then Σ_t (X_t − X̄)²/σ²(1 − ρ) is chi-squared with df = (T − 1). See
Bhapkar and Somes (1977) for slightly weaker conditions for a
chi-squared limiting distribution for Q than those in part (a).]
c. Show that Q is unaffected by deleting cases in which y_i1 = ··· = y_iT.
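Cochran's Q is simple to compute directly. The sketch below uses the algebraically equivalent count form Q = T(T − 1) Σ_t (C_t − C̄)² / (T Σ_i R_i − Σ_i R_i²), with C_t the occasion totals and R_i the subject totals, and illustrates part (c) on hypothetical data: rows that are all 0's or all 1's do not change Q.

```python
def cochran_q(y):
    """Cochran's Q for an n x T binary matrix y (list of subject rows)."""
    T = len(y[0])
    col = [sum(row[t] for row in y) for t in range(T)]  # occasion totals C_t
    row = [sum(r) for r in y]                           # subject totals R_i
    cbar = sum(col) / T
    num = T * (T - 1) * sum((c - cbar) ** 2 for c in col)
    den = T * sum(row) - sum(r * r for r in row)
    return num / den

# Hypothetical data: 4 subjects, T = 3 occasions; subject 3 responds 1 everywhere
y = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [1, 0, 1]]
q_full = cochran_q(y)
q_drop = cochran_q([r for r in y if 0 < sum(r) < len(r)])  # drop constant rows
# q_full == q_drop, illustrating part (c)
```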
11.23 Consider the model μ_i = β, i = 1, ..., n, assuming that v(μ_i) = μ_i.
Suppose that actually var(Y_i) = μ_i². Using the univariate version of
GEE described in Section 11.4, show that u(β) = Σ_i (y_i − β)/β and
β̂ = ȳ. Show that V in (11.10) equals β/n, the actual asymptotic
variance (11.11) simplifies to β²/n, and its consistent estimate is
Σ_i (y_i − ȳ)²/n².
11.24 Repeat Problem 11.23 assuming that v(μ_i) = σ² when actually
var(Y_i) = μ_i.
11.25 Consider the model μ_i = μ, i = 1, ..., n, for independent Poisson
observations. For μ̂ = ȳ, show that the model-based asymptotic variance
estimate is ȳ/n, whereas the robust estimate of the asymptotic
variance is Σ_i (y_i − ȳ)²/n². Which would you expect to be better (a) if
the Poisson model holds, and (b) if there is severe overdispersion?
11.26 Show that (11.10) is equivalent to the formula for the large-sample
covariance of the ML estimator in a GLM, estimated by (4.28).
11.27 a. For a univariate response, how is quasi-likelihood (QL) inference
different from ML inference? When are they equivalent?
b. Explain the sense in which GEE methodology is a multivariate
version of QL.
c. Summarize the advantages and disadvantages of the QL approach.
d. Describe conditions under which GEE parameter estimators are
consistent and conditions under which they are not. For conditions
in which they are consistent, explain why.
11.28 Formulate a model using adjacent-categories logits or continuation-ratio
logits that is analogous to (11.4). Interpret parameters.
11.29 Refer to the analysis of mean time to falling asleep at the end of
Section 11.2.3. Explain how to calculate the SE for the difference
between the two differences of means reported there. (Note that one
difference uses paired samples and the other uses independent samples.)
11.30 What is wrong with this statement?: "For a first-order Markov chain,
Y_t is independent of Y_{t-2}."
11.31 Suppose that loglinear model (Y_0, Y_1, ..., Y_T) holds. Is this a Markov
chain?
11.32 Gamblers A and B have a total of I dollars. They play games of pool
repeatedly. Each game they each bet $1, and the winner takes the
other's dollar. The outcomes of the games are statistically independent,
and A has probability π and B has probability 1 − π of winning
any game. Play stops when one player has all the money. Let Y_t
denote A's monetary total after t games.
a. Show that {Y_t} is a first-order Markov chain.
b. State the transition probability matrix. (For this gambler's ruin
problem, 0 and I are absorbing states. Eventually, the chain enters
one of these and stays. The other states are transient.)
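For part (b), the one-step transition matrix is easy to write down. A small sketch with hypothetical values I = 4 dollars and win probability π = 0.6:

```python
def ruin_matrix(I, pi):
    """(I+1) x (I+1) transition matrix for A's fortune in the gambler's
    ruin chain: states 0 and I are absorbing; from 0 < i < I the chain
    moves up with probability pi and down with probability 1 - pi."""
    P = [[0.0] * (I + 1) for _ in range(I + 1)]
    P[0][0] = P[I][I] = 1.0          # absorbing states
    for i in range(1, I):
        P[i][i + 1] = pi
        P[i][i - 1] = 1.0 - pi
    return P

P = ruin_matrix(4, 0.6)
# Every row sums to 1, as a transition probability matrix must.
```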
11.33 A first-order Markov chain has stationary (or time-homogeneous)
transition probabilities if the one-step transition probability matrices
are identical, that is, if for all i and j,

π_{j|i}(1) = π_{j|i}(2) = ··· = π_{j|i}(T) = π_{j|i}.

Let X, Y, and Z denote the classifications for the I × I × T table
consisting of {n_ij(t), i = 1, ..., I, j = 1, ..., I, t = 1, ..., T}.
a. Explain why all transition probabilities are stationary if expected
frequencies for this table satisfy loglinear model (XY, XZ). [Thus,
the likelihood-ratio statistic for testing stationary transition probabilities
equals G² for testing fit of model (XY, XZ).]
b. Let n_ij = Σ_t n_ij(t). Under the assumption of stationary transition
probabilities, show how the likelihood in (11.14) simplifies, and
show that the ML estimators are π̂_{j|i} = n_ij / n_{i+}.
c. For a Markov chain with stationary transition probabilities, let y_ijk
denote the number of transitions from i to j to k over two
successive steps. For {y_ijk}, argue that the goodness of fit of
loglinear model (Y_1Y_2, Y_2Y_3) tests that the chain is first order
against the alternative that it is second order (Anderson and
Goodman 1957).
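The ML estimators in part (b) are just row proportions of the pooled transition counts n_ij = Σ_t n_ij(t). A minimal sketch on hypothetical 2 × 2 transition counts:

```python
def stationary_mle(counts):
    """counts: list over t of I x I transition count matrices n_ij(t).
    Returns ML estimates pi_hat[i][j] = n_ij / n_i+ under stationarity."""
    I = len(counts[0])
    pooled = [[sum(n[i][j] for n in counts) for j in range(I)] for i in range(I)]
    return [[pooled[i][j] / sum(pooled[i]) for j in range(I)] for i in range(I)]

# Two time steps of 2 x 2 transition counts (hypothetical data):
n1 = [[20, 5], [4, 11]]
n2 = [[18, 7], [6, 9]]
pi_hat = stationary_mle([n1, n2])
# First row pools to (38, 12), giving pi_hat[0] = (0.76, 0.24).
```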
CHAPTER 12

Random Effects: Generalized Linear Mixed Models for Categorical Responses
In Chapter 11 we noted that observations often occur in clusters. For
instance, cluster i might consist of repeated measurements on subject i or
observations for all subjects in family i. Observations within a cluster tend to
be more alike than observations from different clusters. Thus, they are
usually positively correlated. Ordinary analyses that ignore the correlation
and treat within-cluster observations the same as between-cluster observations produce invalid standard errors.
In Chapter 11 we focused on modeling the marginal distributions of
clustered responses, treating the joint dependence structure as a nuisance. In
this chapter we present an alternative approach using cluster-level terms
in the model. These terms take the same value for each observation in
a cluster but different values for different clusters. They are unobserved
and, when treated as varying randomly among clusters, are called random
effects. In Section 10.2.4 we introduced this approach in a model for matched
pairs. The models have conditional interpretations, referred to as subject-specific
when each cluster is a subject. This contrasts with marginal models,
which have population-averaged interpretations.
Random effects models for normal responses are well established. By
contrast, only recently have random effects been used much in models for
categorical data. In this chapter we extend generalized linear models to
include random effects. In Section 12.1 we introduce this extension, the
generalized linear mixed model. In Section 12.2 we discuss an important special
case for binary data, the logistic-normal model. Several examples are shown in
Section 12.3. Section 12.4 covers extensions for multinomial responses, and
Section 12.5 covers models with multivariate random effects. In Section 12.6
we discuss model fitting, assuming normality for the random effects. Parts of
this chapter are from Agresti et al. (2000).
12.1 RANDOM EFFECTS MODELING OF CLUSTERED CATEGORICAL DATA
Parameters that describe a factor’s effects in ordinary linear models are
called fixed effects. They apply to all categories of interest, such as genders,
age groupings, or treatments. By contrast, random effects usually apply to a
sample. For a study using a sample of clinics, for example, the model treats
observations from a given clinic as a cluster, and it has a random effect for
each clinic.
GLMs extend ordinary regression by allowing nonnormal responses and a
link function of the mean. The generalized linear mixed model (GLMM) is a
further extension that permits random effects as well as fixed effects in the
linear predictor.
12.1.1 Generalized Linear Mixed Model
Let y_it denote observation t in cluster i, t = 1, ..., T_i. As in the GEE
analyses in Chapter 11, the number of observations may vary by cluster. In a
longitudinal study, even if clusters have equal size, many of them may have
missing observations. Let x_it denote a column vector of values of explanatory
variables, for fixed effect model parameters β. Let u_i denote the vector of
random effect values for cluster i. This is common to all observations in the
cluster. Let z_it denote a column vector of their explanatory variables. Often,
the random effect is univariate.
Conditional on u_i, a GLMM resembles an ordinary GLM. Let μ_it =
E(Y_it | u_i). The linear predictor for a GLMM has the form

g(μ_it) = x'_it β + z'_it u_i    (12.1)

for link function g(·). The random effect vector u_i is assumed to have a
multivariate normal distribution N(0, Σ). The covariance matrix Σ depends
on unknown variance components and possibly also correlation parameters.
Denote var(Y_it | u_i) = φ_it v(μ_it), where the variance function v(·) describes
how the (conditional) variance depends on the mean. As in Section 4.4, often
φ_it = 1 or φ_it = φ/ω_it, where ω_it is a known weight (e.g., number of trials
for a binomial count) and φ is an unknown dispersion parameter. Conditional
on u_i, the model treats {y_it} as independent over i and t. As discussed
in Section 10.2.2, the variability among {u_i} induces a nonnegative association
among the responses, for the marginal distribution averaged over the subjects.
This is caused by the shared random effect u_i for each observation in a
cluster.
In (12.1), the random effect enters the model on the same scale as the
predictor terms. This is convenient but also natural for many applications.
For instance, random effects sometimes represent heterogeneity caused by
omitting certain explanatory variables. Consider the special case with univariate
random effect and z_it = 1. With u_i replaced by σu*_i where {u*_i} are
N(0, 1), the GLMM has the form

g(μ_it) = x'_it β + σu*_i.

This has the form of an ordinary GLM with unobserved values {u*_i} of a
particular covariate. Thus, random effects models relate to methods of
dealing with unmeasured predictors and other forms of missing data. The
random effects part of the linear predictor reflects terms that would be in the
fixed effects part if those explanatory variables had been included. Random
effects also sometimes represent random measurement error in the explanatory
variables. If we replace a particular predictor x_it by x*_it + ε_i, with x*_it
the true value and ε_i the measurement error, then ε_i times the regression
parameter can be absorbed in the random effects term. Related to these
motivations, random effects also provide a mechanism for explaining overdispersion
in basic models not having those effects (Breslow and Clayton 1993).
12.1.2 Logit GLMM for Binary Matched Pairs
We illustrate the GLMM expression (12.1) using a simple case, that of binary
matched pairs. The data form two dependent binomial samples (Section
10.1). Cluster i consists of the responses (y_i1, y_i2) for matched pair i.
Observation t in cluster i has y_it = 1 (a success) or 0 (a failure), t = 1, 2.
In Section 10.2.2 we introduced the model (Cox 1958b; Rasch 1961)

logit P(Y_it = 1) = α_i + βx_t    (12.2)

where x_1 = 0 and x_2 = 1. For it, β is a cluster-specific log odds ratio. That
section treated α_i as a fixed effect and eliminated it using conditional ML.
An equivalent representation of (12.2) is

logit P(Y_i1 = 1 | u_i) = α + u_i,
logit P(Y_i2 = 1 | u_i) = α + β + u_i,    (12.3)

where u_i = α_i − α for some constant α. Now, we treat u_i as a random effect
for cluster i, with {u_i} independent from a N(0, σ²) distribution with σ
unknown. Conditionally on u_i, we assume that y_i1 and y_i2 are independent.
Model (12.3) is the special case of (12.1) in which μ_it = P(Y_it = 1 | u_i), g(·)
is the logit link, β' = (α, β), x'_i1 = (1, 0) and x'_i2 = (1, 1) for all i, and z_it = 1
for all i and t. The univariate random effect adjusts the intercept but does
not modify the fixed effect. A GLMM with random effect of this form is
called a random intercept model. Instead of the usual fixed intercept α, it has
a random intercept α + u_i.
Let Y_1 = Σ_i y_i1 and Y_2 = Σ_i y_i2. Marginally, Y_1 is binomial with n trials
and parameter E{exp(α + U)/[1 + exp(α + U)]}, and Y_2 is binomial with
parameter E{exp(α + β + U)/[1 + exp(α + β + U)]}. The expectations
refer to U, a N(0, σ²) random variable. The model implies a nonnegative
correlation between Y_1 and Y_2, with greater association resulting from
greater heterogeneity (i.e., larger σ). Clusters with a large positive u_i have a
relatively large P(Y_it = 1 | u_i) for each t, whereas clusters with a large
negative u_i have a relatively small P(Y_it = 1 | u_i) for each. For this model, Y_1
and Y_2 are independent only if σ = 0.
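The marginal parameter E{exp(α + U)/[1 + exp(α + U)]} has no closed form under normal U, but it is easy to approximate by simulation. A sketch with hypothetical values α = −0.5 and σ = 2; heterogeneity pulls the marginal probability from expit(α) toward 1/2:

```python
import math
import random

def marginal_prob(alpha, sigma, nsim=100_000, seed=1):
    """Monte Carlo estimate of E[expit(alpha + U)] for U ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    expit = lambda x: 1.0 / (1.0 + math.exp(-x))
    return sum(expit(alpha + rng.gauss(0.0, sigma)) for _ in range(nsim)) / nsim

p = marginal_prob(-0.5, 2.0)
# With sigma = 0 the expectation reduces to expit(alpha) exactly.
```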
A 2 × 2 population-averaged table with (success, failure) for both the row
and column categories summarizes the number of observations for which
(y_i1, y_i2) = (1, 1), (1, 0), (0, 1), or (0, 0). Let {n_ab} denote these counts. Table
12.1, analyzed first in Section 10.1, is an example. Let {μ̂_ab} denote marginal
fitted values for model (12.3). We defer discussion of model fitting until
Section 12.6. However, model (12.3) is a rare instance in which the fixed
effect in a random effects model has a closed-form ML estimate,

β̂ = log(μ̂_21 / μ̂_12).

When the sample log odds ratio log(n_11 n_22 / n_12 n_21) ≥ 0, then {μ̂_ab = n_ab}
and β̂ = log(n_21 / n_12). This is the same as the conditional ML estimate
(Section 10.2.3). Neuhaus et al. (1994) showed that this is true for any
parametric choice of random effects distribution for which the model (12.3)
can generate {n_ab} as fitted values. Lindsay et al. (1991) showed that this
estimate also results with a nonparametric approach discussed in Section
13.2.4. The model implies that the true log odds ratio for this 2 × 2 table
is at least 0. When log(n_11 n_22 / n_12 n_21) < 0, however, then σ̂ = 0 and the
fitted values {μ̂_ab = n_{a+} n_{+b} / n} satisfy independence. Then, β̂ is identical
to the estimate for the marginal model (10.6) by which β is the difference
between logits for the two marginal distributions, namely β̂ =
log[(n_{2+} n_{+1})/(n_{1+} n_{+2})].
12.1.3 Ratings of Prime Minister Revisited
For Table 12.1, the ML fit of model (12.3), treating {u_i} as normal, yields
β̂ = log(86/150) = −0.556 (SE = 0.135), with σ̂ = 5.16. This is identical to
the conditional ML estimate (10.10), with standard error [(1/86) +
(1/150)]^{1/2}. For a given subject, the estimated odds of approval at the second
survey equal exp(−0.556) = 0.57 times those at the first survey. The large σ̂
reflects the very strong association between the two responses, with sample
odds ratio 35.1.

TABLE 12.1 Rating of Performance of Prime Minister

                            Second Survey
First Survey      Approve    Disapprove    Total
Approve             794          150         944
Disapprove           86          570         656
Total               880          720        1600
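Since β̂ has a closed form here, the numbers quoted above can be verified directly from the discordant counts of Table 12.1. A minimal sketch (the counts come from the table; the standard-error formula is the one stated in the text):

```python
from math import exp, log, sqrt

# Discordant counts from Table 12.1 (first vs. second survey ratings).
n12 = 150  # approve at first survey, disapprove at second
n21 = 86   # disapprove at first survey, approve at second

# Closed-form ML estimate of the subject-specific effect in model (12.3),
# identical to the conditional ML estimate of Section 10.2.3.
beta_hat = log(n21 / n12)            # about -0.556
se = sqrt(1 / n21 + 1 / n12)         # about 0.135

# Subject-specific odds multiplier for approval at the second survey.
odds_multiplier = exp(beta_hat)      # about 0.57
```
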
12.1.4 Extension: Rasch Model and Item Response Models
An extension of the logit matched-pairs model (12.3) allows T > 2 observations
in each cluster. The random intercept model then has form

    logit P(Y_it = 1 | u_i) = u_i + β_t,                    (12.4)

where {u_i} are independent N(0, σ²). Equivalently, the model can add an
intercept α or let E(u_i) = α, but then identifiability requires a constraint
such as β_T = 0.
Early applications of this GLMM were in psychometrics. The model
describes responses to a battery of T questions on an exam. The probability
P(Y_it = 1 | u_i) that subject i makes the correct response on question t
depends on the overall ability of subject i, characterized by u_i, and the
easiness of question t, characterized by β_t. Such models are called item-response
models. The logit form (12.4) is called the Rasch model (Rasch 1961).
In estimating {β_t}, Rasch treated {u_i} as fixed effects and used conditional
ML, as outlined in Section 10.2.3 for matched pairs. Later authors used the
normal random effects approach for this model and the model with probit
link (e.g., Bock and Aitkin 1981).
The {β_t} in the Rasch model differ from parameters in corresponding
marginal models such as (11.1), since the effects are subject-specific. The
Rasch model refers to a T × 2 × n table of observation by outcome by
subject, whereas the marginal model refers to the T × 2 observation-by-outcome
table of the T marginal distributions, collapsed over subjects. For
observations s and t for a given subject i with model (12.4),

    β_s − β_t = logit P(Y_is = 1 | u_i) − logit P(Y_it = 1 | u_i),

which is a log odds ratio conditional on the subject. By contrast, the
corresponding population-averaged effect in marginal model (11.1) is

    β_s − β_t = logit P(Y_hs = 1) − logit P(Y_it = 1),

with subject h randomly selected for observation s and subject i randomly
selected for observation t (i.e., h and i are independent observations).
12.1.5 Random Effects versus Conditional ML Approaches
Suppose that one treated {u_i} in model (12.4) as fixed effects instead of
random effects. Then, consider ordinary ML estimation of {β_t} and {u_i}. As n
increases, so does the number of parameters, since each subject has a u_i.
Even though the number of {β_t} does not increase as n does, the ordinary
ML estimators {β̂_t} are not consistent. This happens in many models when
the number of parameters has an order similar to that of the number of
subjects. Asymptotic optimality properties of ML estimators, such as consistency,
require the number of parameters to be fixed as n increases. For
model (12.4), ML estimators of {β_t} have bias of order T/(T − 1) (Andersen
1980, pp. 244–245). For the matched-pairs model (12.2), for instance, β̂ → 2β
in probability (Problem 10.24).
For this reason, the preferable approach for the fixed effects model is
conditional ML. One eliminates {u_i} by conditioning on their sufficient
statistics {S_i = Σ_t y_it, i = 1, ..., n}. In the item response context, these are
the numbers of correct responses for each subject. Conditional on {S_i}, the
distribution of {y_it} is independent of {u_i}. Maximizing the resulting likelihood
then yields consistent estimators of {β_t}. The analysis generalizes the
one in Section 10.2.3 for the subject-specific logistic model (10.8) for matched
pairs. See Andersen (1980) for details.
Compared with the random effects approach, the conditional ML approach
has certain advantages. One does not need to assume a parametric
distribution for {u_i}. It is difficult to check this assumption in the random
effects approach. Conditional ML is also appropriate with retrospective
sampling. In that case, bias can occur with a random effects approach
because the clusters are not randomly sampled (Neuhaus and Jewell 1990b).

However, the conditional ML approach has severe disadvantages. It is
restricted to the canonical link (the logit), for which reduced sufficient
statistics exist for {u_i}. More important, as discussed in Section 10.2.7, it is
restricted to inference about within-cluster fixed effects. The conditioning
removes the source of variability needed for estimating between-cluster
effects in models with explanatory variables such as those considered next.
Also, this approach does not provide information about {u_i}, such as predictions
of their values and estimates of their variability or of the probabilities
they determine. Finally, in more general models with covariates, conditional
ML can be less efficient than the random effects approach for estimating the
fixed effects (see Note 12.2).
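For matched pairs, the inconsistency of the fixed-effects (unconditional) ML approach has a well-known closed form: the unconditional ML estimator equals exactly twice the conditional ML estimator, which is the β̂ → 2β behavior cited above. A sketch, reusing the discordant counts of Table 12.1 for illustration:

```python
from math import log

# Discordant counts for a matched-pairs table (values from Table 12.1,
# used here purely for illustration).
n12, n21 = 150, 86

# Conditional ML: eliminate the u_i by conditioning on S_i = y_i1 + y_i2.
beta_conditional = log(n21 / n12)

# Ordinary ML treating each u_i as a fixed effect: for matched pairs the
# estimator is exactly twice the conditional one, illustrating the
# inconsistency (beta_hat -> 2*beta as n grows).
beta_unconditional = 2 * log(n21 / n12)
```
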
12.2 BINARY RESPONSES: LOGISTIC-NORMAL MODEL
The item response model (12.4) with random intercept is a special case of an
important class of random effects models for binary data called logistic-normal
models. With univariate random effect, the model form is

    logit P(Y_it = 1 | u_i) = x_it′β + u_i                  (12.5)

where {u_i} are independent N(0, σ²) variates. This is the special case of the
GLMM (12.1) in which g(·) is the logit link and the random effects structure
simplifies to a random intercept. The logistic-normal model has a long
history, dating at least to Cox (1970, Prob. 20 in that text) for the matched-pairs
model (12.3) and Pierce and Sands (1975).

More generally, the link function in model (12.5) can be an arbitrary
inverse cdf. For such models, Y_is and Y_it are treated conditionally (given u_i)
as independent but are marginally nonnegatively correlated. Let F denote
the cdf that is the inverse link function. Then, for s ≠ t,

    cov(Y_is, Y_it) = E[cov(Y_is, Y_it | u_i)] + cov[E(Y_is | u_i), E(Y_it | u_i)]
                    = 0 + cov[F(x_is′β + u_i), F(x_it′β + u_i)].        (12.6)

The functions in the last covariance term are both monotone increasing in u_i,
and hence are nonnegatively correlated. For common predictor value x at
each t, the joint distribution for the model is exchangeable. This is often
plausible for clustered data. In longitudinal studies, however, observations
closer together in time may tend to be more highly correlated.
Usually, the main focus in using a GLMM is inference about the fixed
effects. The random effects part of the model is a mechanism for representing
how the positive correlation occurs between observations within a cluster.
Parameters pertaining to the random effects may themselves be of interest,
however. For instance, the estimate σ̂ of the standard deviation of a random
intercept may be a useful summary of the degree of heterogeneity of a
population.
12.2.1 Interpreting Heterogeneity in Logistic-Normal Models
When σ = 0, the logistic-normal model (12.5) simplifies to the ordinary
logistic regression model treating all observations as independent. When
σ > 0, how can we interpret the variability in effects this model implies?
Consider observation y_it at setting x_it of predictors and observation y_hs at
setting x_hs. Their log odds ratio is

    logit P(Y_it = 1 | u_i) − logit P(Y_hs = 1 | u_h) = (x_it − x_hs)′β + (u_i − u_h).

We cannot observe (u_i − u_h), which has a N(0, 2σ²) distribution. However,
100(1 − α)% of those log odds ratios fall within

    (x_it − x_hs)′β ± z_{α/2} √2 σ.                         (12.7)

When σ = 0, (x_it − x_hs)′β is the usual form of log odds ratio for a model
without random effects. When σ > 0, (x_it − x_hs)′β is the log odds ratio for
two observations in the same cluster (h = i) or with the same random effect
value. Suppose that x_it = x_hs for observations from different clusters. Then,
using z_{0.25} = 0.674, the middle 50% of the log odds ratios fall within
±0.674√2 σ = ±0.95σ. Hence, the median odds ratio between the observation
with the higher random effect and the observation with the lower random effect
equals exp(0.95σ). With a single predictor and x_it − x_hs = 1, the median
such odds ratio equals exp(β + 0.95σ). Larsen et al. (2000) presented
related interpretations.
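The median odds ratio interpretation above follows directly from the normal quantile z_{0.25}; a minimal sketch using Python's standard library, with σ̂ = 5.16 taken from the prime minister fit in Section 12.1.3:

```python
from math import exp, sqrt
from statistics import NormalDist

sigma = 5.16  # estimated random-intercept SD from the Section 12.1.3 fit

# u_i - u_h ~ N(0, 2*sigma^2), so the middle 50% of between-subject log
# odds ratios (at identical covariate values) fall within +/- z_{.25}*sqrt(2)*sigma.
z25 = NormalDist().inv_cdf(0.75)      # about 0.674
half_width = z25 * sqrt(2) * sigma    # about 0.95 * sigma

# Median odds ratio between the subject with the higher random effect
# and the subject with the lower one.
median_or = exp(half_width)
```
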
12.2.2 Connections between Conditional Models and Marginal Models
The fixed effects parameters β in GLMMs have conditional interpretations,
given the random effect. Those fixed effects are of two types. First, consider
an explanatory variable that varies in value among observations in a cluster.
For instance, in a crossover study comparing T drugs, for each subject the
drug taken varies from observation to observation in that subject's cluster of
T observations. For such an explanatory variable, its coefficient in the model
refers to the effect on the response of a within-cluster (e.g., subject-specific)
1-unit increase of that predictor. The random effect as well as other explanatory
variables in the model are held constant while that predictor increases by 1.
The effect of that explanatory variable is a "within-cluster" or "within-subject" one.
Second, consider an explanatory variable with constant value among
observations in a cluster. An example is gender when each subject forms a
cluster. For such an explanatory variable, its coefficient refers to the effect on
the response of a "between-cluster" 1-unit increase of that predictor. An
example is a comparison of females and males using a dummy variable and
its coefficient. However, this fixed effect in the GLMM applies only when the
random effect (as well as other explanatory variables in the model) takes the
same value in both groups: for instance, a male and a female with the same
value for their random effects.
It is in this sense that random effects models are conditional models, as
both within- and between-cluster effects apply conditional on the random
effect value. By contrast, effects in marginal models are averaged over all
clusters (i.e., population averaged), so those effects do not refer to a comparison
at a fixed value of a random effect. In fact, a fundamental difference
between the two model types is that when the link function is nonlinear, such
as the logit, the population-averaged effects of marginal models are often
smaller than the cluster-specific effects of GLMMs.
Specifically, the GLMM (12.1) refers to the conditional mean, μ_it =
E(Y_it | u_i). By inverting the link function,

    E(Y_it | u_i) = g⁻¹(x_it′β + z_it′u_i).

Marginally, averaging over the random effects, the mean is

    E(Y_it) = E[E(Y_it | u_i)] = ∫ g⁻¹(x_it′β + z_it′u_i) f(u_i; Σ) du_i,

where f(u; Σ) is the N(0, Σ) density function for the random effects. For the
identity link,

    E(Y_it) = ∫ (x_it′β + z_it′u_i) f(u_i; Σ) du_i = x_it′β.

The marginal model has the same model form and effects β. This is not true
for other links. For instance, for the logistic-normal model (12.5),

    E(Y_it) = E[ exp(x_it′β + u_i) / (1 + exp(x_it′β + u_i)) ].

This expectation does not have form exp(x_it′β)/[1 + exp(x_it′β)] except when
u_i has a degenerate distribution (σ = 0).
Approximate relationships exist between estimates from the two model
types. In the logistic-normal case with effect β and small σ, Zeger et al.
(1988) showed that

    E(Y_it) ≈ exp(c x_it′β) / [1 + exp(c x_it′β)],          (12.8)

where c = [1 + 0.6σ²]^{−1/2}. Since the effect in the marginal model multiplies
that of the conditional model by about c, it is typically smaller in absolute
value. The discrepancy increases as σ increases. For ρ near 0, Neuhaus et
al. (1991) showed that the marginal model effect is approximately β(1 − ρ),
where ρ = corr(Y_it, Y_is) at β = 0. Again, the discrepancy increases as σ
increases, since ρ increases with σ.

For Table 12.1 on ratings of the prime minister, the ML estimate for
model (12.3) is β̂ = −0.556, with σ̂ = 5.16 for variability of {u_i}. Approximation
(12.8) suggests that β̂ = −0.556 with σ̂ = 5.16 corresponds to a marginal
estimate of about [1 + 0.6(5.16)²]^{−1/2}(−0.556) = −0.135. The actual
marginal estimate is the log odds ratio for the sample marginal distributions,
equaling

    log[(880/720)/(944/656)] = −0.163.

In fact, the marginal effect is much smaller than the conditional effect, but
this approximation connecting the two estimates works better for smaller σ̂.
At β = 0, the fit of the model is that of the symmetry model, for which
μ̂_12 = μ̂_21 = (n_12 + n_21)/2. The correlation for that 2 × 2 table equals 0.699,
from which the conditional estimate of −0.556 suggests a marginal estimate
of −0.556(1 − 0.699) = −0.167, very close to the actual value of −0.163.
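Both attenuation approximations are one-line computations; a minimal check using the estimates quoted above:

```python
from math import sqrt

beta_cond = -0.556   # conditional (subject-specific) ML estimate
sigma = 5.16         # estimated random-intercept SD

# Zeger et al. (1988): attenuation factor c = [1 + 0.6*sigma^2]^{-1/2}.
c = 1 / sqrt(1 + 0.6 * sigma ** 2)
marg_zeger = c * beta_cond            # about -0.135

# Neuhaus et al. (1991): marginal effect ~ beta*(1 - rho), using the
# correlation 0.699 from the symmetry fit quoted in the text.
rho = 0.699
marg_neuhaus = beta_cond * (1 - rho)  # about -0.167, near the actual -0.163
```
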
Figure 12.1 illustrates why the marginal effect is smaller than the conditional
effect. For a single explanatory variable x, the figure shows subject-specific
curves for P(Y_it = 1 | u_i) for several subjects when considerable heterogeneity
exists. This corresponds to a relatively large σ for the random effects.
FIGURE 12.1 Logistic random-intercept model, showing the conditional (subject-specific)
curves and the marginal (population-averaged) curve averaging over these.
At any fixed value of x, variability occurs in the conditional means, E(Y_it | u_i)
= P(Y_it = 1 | u_i). The average of these is the marginal mean, E(Y_it). These
averages for various x values yield the superimposed curve. It has a shallower
slope. In fact, it does not exactly follow the logistic formula. Similar
remarks apply to other GLMMs. For the probit link with binary data,
however, the conditional probit model with normal random effect does imply
a marginal model of probit form (Problem 12.29). With univariate random
intercept, the marginal effect equals the conditional effect multiplied by
[1 + σ²]^{−1/2} (Zeger et al. 1988). In Section 13.5.1 we explore the
conditional–marginal connection for loglinear GLMMs.
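The averaging that produces the flatter marginal curve in Figure 12.1 is easy to reproduce numerically. The sketch below uses hypothetical values α = 0, β = 1, σ = 2 and a simple normalized grid integration over u; it confirms that the implied marginal log odds ratio per unit of x is attenuated relative to the conditional β:

```python
from math import exp, log
from statistics import NormalDist

def expit(z):
    return 1 / (1 + exp(-z))

alpha, beta, sigma = 0.0, 1.0, 2.0   # hypothetical conditional model
nd = NormalDist(0.0, sigma)

def marginal_prob(x, grid=2001, lo=-8.0, hi=8.0):
    """Average P(Y=1|u) = expit(alpha + beta*x + u) over u ~ N(0, sigma^2)
    by normalized grid integration."""
    step = (hi - lo) / (grid - 1)
    num = den = 0.0
    for k in range(grid):
        u = lo + k * step
        w = nd.pdf(u)
        num += w * expit(alpha + beta * x + u)
        den += w
    return num / den

# Marginal log odds ratio for a unit increase in x:
p0, p1 = marginal_prob(0.0), marginal_prob(1.0)
marginal_effect = log(p1 / (1 - p1)) - log(p0 / (1 - p0))
```

The value of `marginal_effect` is strictly between 0 and β, consistent with the attenuation factor [1 + 0.6σ²]^{−1/2} of (12.8).
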
12.2.3 Comments about Conditional versus Marginal Models
Random effects models describe conditional (subject-specific) effects, whereas
marginal models describe population-averaged effects. Some statisticians
prefer one of these types, but most feel that both are useful, depending on
the application.

The conditional modeling approach is preferable if one wants to specify a
mechanism that could generate positive association among clustered observations,
estimate cluster-specific effects, estimate their variability, or model the
joint distribution. Latent variable constructions used to motivate model forms
(e.g., the tolerance motivation for binary models of Section 6.6.1 and the
related threshold motivation in Problem 6.28 and utility motivation in
Problem 6.29) usually apply more naturally at the cluster level than at the
marginal level. Given a conditional model, one can recover information about
marginal distributions. That is, a conditional model implies a marginal model,
but a marginal model does not itself imply a conditional model (although see
Note 12.10 for an implicit connection).
In many surveys or epidemiological studies, a goal is to compare the
relative frequency of occurrence of some outcome for different groups in a
population. Then, quantities of primary interest include between-group odds
ratios among marginal probabilities for the different groups. That is, effects
of interest are between-cluster rather than within-cluster. When marginal
effects are the main focus, it is usually simpler and may be preferable to
model the margins directly. One can then parameterize the model so that
regression parameters have a direct marginal interpretation. Developing a
more detailed model of the joint distribution that generates those margins, as
a random effects model does, provides greater opportunity for misspecification. For instance, with longitudinal data the assumption that observations
are independent, given the random effect, need not be realistic. With the
marginal model approach, we showed in Chapter 11 that ML is sometimes
possible but that the GEE approach is computationally simpler and more
versatile. A drawback of the GEE approach is that it does not explicitly
model random effects and therefore does not allow these effects to be
estimated. In addition, likelihood-based inferences are not possible because
the joint distribution of the responses is not specified.
In Section 12.2.2 it was noted that conditional effects are usually larger
than marginal effects, and increase as variance components increase. Usually,
though, the significance of an effect (e.g., as measured by the ratio of
estimate to standard error) is similar in the two model types. If one effect
seems more important than another in a conditional model, the same is
usually true with a marginal model. So the choice of the model is usually not
crucial to inferential conclusions.
This statement requires a caveat, however, since sizes of effects in marginal
models depend on the degree of heterogeneity in conditional models. In
comparing effects for two groups or two variables that have quite different
variance components, relative sizes of effects will differ for marginal and
conditional models. From (12.8), with binary data the attenuation from the
conditional to the marginal effect will tend to be greater for the group having
the larger variance component. For instance, suppose that two groups, one
young in age and the other elderly, both show the same conditional effect in
a crossover study comparing two drugs. If the elderly group has more
heterogeneity on the response, their marginal effect may be smaller than that
for the younger group. The marginal effects differ even though the conditional effects are the same, because of the greater variance component for
the elderly. In such cases, the conditional effect (appropriately modeled) may
have more relevance.
Finally, with either marginal or conditional models, missing data are a
common problem with multivariate responses. Unless data are missing at
random, potential bias occurs in ML inference. GEE methods usually require
the stronger condition that data are missing completely at random (Section
11.4.5). Thus, modeling missingness or conducting a sensitivity study to
discern its potential effects can be an important component of an analysis.
Regardless of the choice of paradigm, it is a challenge for statisticians
even to explain to practitioners why marginal and conditional effects differ
with a nonlinear link function. Graphics such as Figure 12.1 can help.
Neuhaus (1992) and Pendergast et al. (1996) surveyed ways of analyzing
clustered binary data, including conditional and marginal models. Agresti
and Natarajan (2001) surveyed conditional and marginal modeling of clustered
ordinal data.
12.3 EXAMPLES OF RANDOM EFFECTS MODELS FOR BINARY DATA
In the next three sections we present a variety of examples of random effects
models. In this section we consider binary responses.
12.3.1 Small-Area Estimation of Binomial Proportions
Small-area estimation refers to estimation of parameters for a large number of
geographical areas when each has relatively few observations. For instance,
one might want county-specific estimates of characteristics such as the
unemployment rate or the proportion of families having health insurance
coverage. With a national or statewide survey, some counties may have few
observations. Then, sample proportions in the counties may poorly estimate
the true countywide proportions. Random effects models that treat each
county as a cluster can provide improved estimates. In assuming that the true
proportions vary according to some distribution, the fitting process "borrows
from the whole": it uses data from all the counties to estimate the proportion
in any given one.
Let π_i denote the true proportion in area i, i = 1, ..., n. These areas may
be all the ones of interest, or only a sample. Let {y_i} denote independent
bin(T_i, π_i) variates; that is, y_i = Σ_{t=1}^{T_i} y_it, where {y_it, t = 1, ..., T_i} are
independent with P(Y_it = 1) = π_i and P(Y_it = 0) = 1 − π_i. The sample
proportions {p_i = y_i/T_i} are ML estimates of {π_i} for the fixed-effects model

    logit(π_i) = α + β_i,    i = 1, ..., n.

This model is saturated, having n nonredundant parameters (with a constraint
such as Σ_i β_i = 0) for the n binomial observations.

For small {T_i}, {p_i} have large standard errors. Thus, {p_i} may display much
more variability than {π_i}, especially when {π_i} are similar. Then, it is helpful
to shrink {p_i} toward their overall mean. One can accomplish this with the
random effects model

    logit P(Y_it = 1 | u_i) = α + u_i,                      (12.9)

where {u_i} are independent N(0, σ²) variates. This model is a logit analog of
one-way random effects ANOVA. When σ = 0, all π_i are identical.
For this model,

    π̂_i = exp(α̂ + û_i) / [1 + exp(α̂ + û_i)].

This estimate differs from the sample proportion p_i. If σ̂ = 0, then all
û_i = 0. Then, the random effects estimate of each π_i is (Σ_{i=1}^{n} Σ_{t=1}^{T_i} y_it)/(Σ_i T_i),
the overall sample proportion after pooling all n samples. When truly all π_i
are equal, this is a much better estimator of that common value than the
sample proportion from a single sample.

Generally, the random effects model estimators shrink the separate sample
proportions toward the overall sample proportion. The amount of shrinkage
decreases as σ̂ increases. The shrinkage also decreases as the {T_i} grow;
as each sample has more data, we put more trust in the separate sample
proportions. The predicted random effect û_i is the estimated mean of the
distribution of u_i, given the data (see Section 12.6.7). This prediction
depends on all the data, not just data from area i. A benefit is potential
reduction in the mean-squared error of the estimates around the true values.
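The prediction π̂_i = expit(α̂ + û_i), with û_i the posterior mean of u_i given the data, can be sketched with a simple grid integration. The fitted values α̂ = 0.163 and σ̂ = 0.29 below come from the election example that follows; the success counts are reconstructed from that example's p_i and T_i, and the grid integration is our sketch of the computation that software such as PROC NLMIXED performs:

```python
from math import exp
from statistics import NormalDist

def expit(z):
    return 1 / (1 + exp(-z))

def shrunk_estimate(y, T, alpha=0.163, sigma=0.29, grid=2001, c=6.0):
    """Predicted area proportion expit(alpha + u_hat), where u_hat is the
    posterior mean of u_i given y successes in T trials under model (12.9),
    computed by grid integration over u in [-c*sigma, c*sigma]."""
    lo, hi = -c * sigma, c * sigma
    step = (hi - lo) / (grid - 1)
    nd = NormalDist(0.0, sigma)
    num = den = 0.0
    for k in range(grid):
        u = lo + k * step
        p = expit(alpha + u)
        like = (p ** y) * ((1 - p) ** (T - y)) * nd.pdf(u)
        num += u * like
        den += like
    return expit(alpha + num / den)

# DC: 4 of 4 in the sample, yet the estimate shrinks far from 1.0
# toward the overall proportion expit(0.163) = 0.541.
pi_dc = shrunk_estimate(y=4, T=4)      # near the 0.576 reported in Table 12.2
pi_tx = shrunk_estimate(y=64, T=144)   # near the 0.468 reported for Texas
```
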
We illustrate model (12.9) with a simulated sample of size 2000 to mimic a
poll taken before the 1996 U.S. presidential election. For T_i observations in
state i (i = 1, ..., 51, where i = 51 is DC = District of Columbia), y_i is
bin(T_i, π_i), where π_i is the actual proportion of votes in state i for Bill
Clinton in the 1996 election, conditional on voting for Clinton or the
Republican candidate, Bob Dole. Here, T_i is proportional to the state's
population size, subject to Σ_i T_i = 2000. Table 12.2 shows {T_i}, {π_i}, and
{p_i = y_i/T_i}.

For the ML fit of model (12.9), α̂ = 0.163 and σ̂ = 0.29. The predicted
random effect values (obtained using PROC NLMIXED in SAS) yield the
proportion estimates {π̂_i}, also shown in Table 12.2. Since {T_i} are mostly
small and since σ̂ is relatively small, considerable shrinkage of these estimates
occurs from the sample proportions toward the overall proportion
supporting Clinton, which was 0.548. The {π̂_i} vary only between 0.468 (for
TX = Texas) and 0.696 (for NY = New York), whereas the sample proportions
vary between 0.111 (for Idaho) and 1.0 (for DC). Sample proportions
based on fewer observations, such as DC, tended to shrink more. Although
the estimates incorporating random effects are relatively homogeneous, they
tend to be closer than the sample proportions to the true values.
TABLE 12.2 Estimates of Proportion of Vote for Clinton, Conditional on Voting
for Clinton or Dole in 1996 U.S. Presidential Election^a

State   T_i    π_i     p_i     π̂_i       State   T_i    π_i     p_i     π̂_i
AK        5   0.394   0.200   0.508      MT        7   0.483   0.429   0.526
AL       32   0.463   0.500   0.524      NC       55   0.475   0.455   0.494
AR       19   0.594   0.526   0.537      ND        5   0.461   0.600   0.546
AZ       34   0.512   0.618   0.573      NE       13   0.395   0.462   0.524
CA      240   0.572   0.538   0.538      NH        9   0.567   0.556   0.543
CO       29   0.492   0.586   0.558      NJ       60   0.600   0.667   0.611
CT       25   0.604   0.720   0.602      NM       13   0.540   0.462   0.524
DC        4   0.903   1.000   0.576      NV       12   0.506   0.500   0.533
DE        5   0.586   0.400   0.527      NY      137   0.660   0.752   0.696
FL      108   0.532   0.602   0.583      OH       84   0.536   0.488   0.507
GA       56   0.494   0.554   0.548      OK       23   0.456   0.478   0.520
HI        9   0.643   0.556   0.543      OR       24   0.547   0.625   0.569
IA       22   0.557   0.500   0.528      PA       90   0.552   0.567   0.558
ID        9   0.391   0.111   0.472      RI        7   0.689   0.571   0.545
IL       89   0.596   0.539   0.540      SC       28   0.469   0.571   0.552
IN       44   0.468   0.432   0.488      SD        6   0.479   0.667   0.555
KS       19   0.400   0.316   0.477      TN       40   0.513   0.500   0.522
KY       29   0.506   0.448   0.506      TX      144   0.473   0.444   0.468
LA       33   0.566   0.667   0.592      UT       15   0.380   0.333   0.490
MA       46   0.686   0.739   0.637      VA       51   0.489   0.412   0.473
MD       38   0.586   0.474   0.511      VT        4   0.633   0.500   0.538
ME        9   0.627   0.778   0.578      WA       42   0.572   0.619   0.578
MI       73   0.573   0.589   0.570      WI       39   0.559   0.487   0.517
MN       35   0.594   0.571   0.554      WV       14   0.584   0.571   0.548
MO       41   0.535   0.561   0.550      WY        4   0.426   0.250   0.518
MS       21   0.472   0.333   0.477

^a π_i, true; p_i, sample; π̂_i, estimate using random effects model.
12.3.2 Modeling Repeated Binary Responses
In Section 12.1.4 we introduced a random effects version of the Rasch model
for repeated binary measurement. This model extends to incorporate covariates.

We illustrate using Table 10.13, first analyzed in Section 10.7.2. The
subjects indicated whether they supported legalizing abortion in each of
three situations. Table 10.13 also classified the subjects by gender. Let y_it
denote the response for subject i on item t, with y_it = 1 representing
support. Consider the model

    logit P(Y_it = 1 | u_i) = u_i + β_t + γx_i,             (12.10)
where x_i = 1 for females and 0 for males, and where {u_i} are independent
N(0, σ²). (Equivalently, one could place a constraint on {β_t} and allow an
intercept α.) Here, the gender effect γ is assumed the same for each item,
and the {β_t} refer to the items.
Since model (12.10) implies nonnegative association among responses on
the items, one should use items and scales for which this should occur. For
opinions about legalized abortion with scale (yes, no), it would not be
appropriate for one question to ask "Do you agree that abortion should be
legal when a woman is not married?" and another to ask "Do you agree that
abortion should be illegal during the last three months of pregnancy?"
Table 12.3 summarizes ML fitting results. The contrasts of {β̂_t} indicate
greater support for legalized abortion with item 1 (when the family has a low
income and cannot afford any more children) than with the other two. There
is slight evidence of greater support with item 2 (when the woman is not
married and does not want to marry the man) than with item 3 (when the
woman wants the abortion for any reason). The fixed effects estimates have
log odds ratio interpretations. For a given subject of either gender, for
instance, the estimated odds of supporting legalized abortion for item 1 equal
exp(0.83) = 2.3 times the estimated odds for item 3. Since γ̂ = 0.01, for each
item the estimated probability of supporting legalized abortion is similar for
females and males with similar random effect values.

For these data, subjects are highly heterogeneous (σ̂ = 8.6). Thus, strong
associations exist among responses on the three items. This is reflected by
1595 of the 1850 subjects making the same response on all three items: that
is, response patterns (0, 0, 0) and (1, 1, 1). It implies tremendous variability in
between-subject odds ratios. From (12.7), for different subjects of a given
gender, the middle 50% of odds ratios comparing items 1 and 3 are estimated
to vary between about exp(0.83 − 0.95 × 8.6) and exp(0.83 + 0.95 × 8.6).
For contingency tables, one can obtain cell fitted values. To do this, one
must integrate over the estimated random effects distribution to obtain
estimated marginal probabilities of any particular sequence of responses. For
the ML parameter estimates, the probability of a particular sequence of
responses (y_i1, ..., y_iT) for a given u_i is the appropriate product of conditional
probabilities, Π_t P(Y_it = y_it | u_i), since the responses are independent
given u_i. Integrating this product probability with respect to u_i for the
TABLE 12.3 Summary of ML Estimates for Random Effects Model (12.10)
and ML and GEE Estimates for Corresponding Marginal Model

                               GLMM ML        Marginal Model ML   Marginal Model GEE
Effect       Parameter     Estimate    SE      Estimate    SE      Estimate    SE
Abortion     β_1 − β_3       0.83     0.16      0.148     0.030     0.149     0.030
             β_1 − β_2       0.54     0.16      0.098     0.027     0.097     0.028
             β_2 − β_3       0.29     0.16      0.049     0.027     0.052     0.027
Gender       γ               0.01     0.48      0.005     0.088     0.003     0.088
√var(u_i)    σ               8.6      0.54
N(0, σ̂²) distribution estimates the marginal probability for a given cell
(averaged over subjects). This requires numerical integration methods described
in Section 12.6. Multiplying this marginal probability of a given
sequence by the sample size for that multinomial gives a fitted value.

Not surprisingly, for these data, the response patterns (0, 0, 0) and (1, 1, 1)
also have the largest fitted values for the multinomial for each gender. For
instance, for females 440 indicated support under all three circumstances
(457 under none of the three), and the fitted value was 436.5 (459.3). Overall
chi-squared statistics comparing the 16 observed and fitted counts are G² =
23.2 and X² = 27.8 (df = 9). These are not that large considering the very
large sample size and the few parameters (β_1, β_2, β_3, γ, σ) used to describe
the 14 multinomial cell probabilities (8 − 1 = 7 for each gender) in Table
10.13. Here, df = 9 since we are modeling 14 multinomial parameters using
five GLMM parameters.

An extended model allows interaction between gender and item. It has
different {β_t} for men and women. However, it does not fit better. The
likelihood-ratio statistic equals 1.0 (df = 2) for testing that the extra parameters
equal 0.
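The integration over the random effects distribution described above can be sketched directly. The item parameters and σ below are hypothetical (the fit in Table 12.3 reports only contrasts of the β_t, not their absolute values), and the simple grid rule stands in for the quadrature methods of Section 12.6:

```python
from math import exp
from itertools import product
from statistics import NormalDist

def expit(z):
    return 1 / (1 + exp(-z))

def pattern_prob(pattern, betas, sigma, grid=801, c=6.0):
    """Marginal probability of a response pattern (y_1,...,y_T) under
    logit P(Y_t = 1 | u) = u + beta_t with u ~ N(0, sigma^2),
    computed by grid integration over u."""
    lo, hi = -c * sigma, c * sigma
    step = (hi - lo) / (grid - 1)
    nd = NormalDist(0.0, sigma)
    total = 0.0
    for k in range(grid):
        u = lo + k * step
        prob = 1.0  # product of conditional probabilities, given u
        for y, b in zip(pattern, betas):
            p = expit(u + b)
            prob *= p if y == 1 else 1 - p
        total += prob * nd.pdf(u) * step
    return total

betas = (0.5, 0.2, -0.3)   # hypothetical item parameters
sigma = 2.0                # hypothetical heterogeneity
probs = {pat: pattern_prob(pat, betas, sigma)
         for pat in product((0, 1), repeat=3)}
```

Multiplying each of the eight pattern probabilities by the multinomial sample size would give the fitted values compared against the observed counts above.
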
An alternative analysis of these data focuses on the marginal distributions,
treating the dependence as a nuisance. A marginal model analog of (12.10) is

    logit P(Y_t = 1) = β_t + γx.

For it, Table 12.3 also shows GEE estimates for the exchangeable working
correlation structure and ML estimates. The marginal model fits well, with
G² = 1.1; here, df = 2 since the model describes six marginal probabilities
(three for each gender) using four parameters. These population-averaged
{β̂_t} are much smaller than the subject-specific {β̂_t} from the GLMM. This
reflects the very large GLMM heterogeneity (σ̂ = 8.6) and the corresponding
strong correlations among the three responses. For instance, the GEE
analysis estimates a common correlation of 0.82 between pairs of responses.
Although the GLMM {β̂_t} are about five to six times the marginal model {β̂_t},
so are the standard errors. The two approaches provide similar substantive
interpretations and conclusions.
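The "five to six times" statement, and its consistency with the Neuhaus et al. approximation of Section 12.2.2, can be checked from the Table 12.3 estimates (this check is ours, not part of the original analysis):

```python
# Item contrasts from Table 12.3: GLMM ML vs. marginal-model ML estimates.
glmm = {"b13": 0.83, "b12": 0.54, "b23": 0.29}
marg_ml = {"b13": 0.148, "b12": 0.098, "b23": 0.049}

# Subject-specific estimates are roughly 5-6 times the marginal ones.
ratios = {k: glmm[k] / marg_ml[k] for k in glmm}

# Neuhaus et al. (1991) approximation beta*(1 - rho), using the GEE
# common correlation estimate 0.82 quoted above.
rho = 0.82
approx_marginal_b13 = glmm["b13"] * (1 - rho)   # about 0.149, near 0.148
```
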
12.3.3 Longitudinal Mental Depression Study Revisited
We now revisit Table 11.2 from a longitudinal study to compare a new drug
with a standard one for treating subjects suffering mental depression. In Section
11.2.1 we analyzed the data using marginal models. The response y_t for
measurement t on mental depression equals 1 for normal and 0 for abnormal.
For severity of initial diagnosis s (1 = severe, 0 = mild), drug treatment
d (1 = new, 0 = standard), and time of measurement t, we used the model

    logit P(Y_t = 1) = α + β_1 s + β_2 d + β_3 t + β_4 dt

to evaluate the marginal distributions.
TABLE 12.4 Model Parameter Estimates for Marginal and Conditional Logit
Models Fitted to Table 11.2

                 ML Marginal          GEE Marginal        Random Effects ML
Parameter     Estimate  Std. Error  Estimate  Std. Error  Estimate  Std. Error
Diagnosis      −1.29      0.14       −1.31      0.15       −1.32      0.15
Drug           −0.06      0.22       −0.06      0.23       −0.06      0.22
Time            0.48      0.12        0.48      0.12        0.48      0.12
Drug × Time     1.01      0.18        1.02      0.19        1.02      0.19
Now let y_it denote observation t for subject i. The model

    logit P(Y_it = 1 | u_i) = α + β_1 s + β_2 d + β_3 t + β_4 dt + u_i

has subject-specific rather than population-averaged effects. Table 12.4 shows
the ML estimates. The time trend estimates are β̂_3 = 0.48 for the standard
drug and β̂_3 + β̂_4 = 1.50 for the new one. These are nearly identical to the
ML and GEE estimates for the corresponding marginal model, also shown in
the table (these are discussed in Sections 11.2.1 and 11.3.2). The reason is
that the repeated observations do not exhibit much correlation, as the GEE
analysis observed. Here, this is reflected by σ̂ = 0.07, showing little heterogeneity
among subjects.
Based on the model fit, integrating over the N(0, 0.07²) random effects distribution yields marginal fitted values for the possible response sequences. Comparing these to the sample counts in Table 11.2 indicates a relatively good fit. The model describes the 28 multinomial cell probabilities (seven for the trivariate response at each of the four severity × drug combinations) using six parameters. The usual fit statistics comparing the observed cell counts to their fitted values are G² = 22.0 and X² = 20.8 (df = 28 − 6 = 22).
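Marginal fitted probabilities of this kind are typically computed by Gauss–Hermite quadrature (Section 12.6 discusses this). The sketch below is a generic illustration of the integration step, not the book's exact computation:

```python
import numpy as np

def marginal_prob(eta, sigma, n_nodes=40):
    """Approximate P(Y=1) = integral of expit(eta + u) dN(u; 0, sigma^2)
    by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    u = np.sqrt(2.0) * sigma * nodes          # change of variables
    probs = 1.0 / (1.0 + np.exp(-(eta + u)))  # expit at each node
    return float(np.sum(weights * probs) / np.sqrt(np.pi))

# With sigma-hat = 0.07 the random effect is nearly degenerate, so the
# marginal probability is essentially the conditional one at u = 0,
# close to expit(0.5) = 0.6225.
print(marginal_prob(0.5, 0.07))
```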
The deviance increases by only 0.001 when one assumes that σ = 0. From results to be discussed in Section 12.6.6, the P-value for comparing the models is half what one gets by treating the deviance as chi-squared with df = 1, or P = 0.49. This simpler model, which gives nearly identical effect estimates and SE values, is adequate. This is also suggested by AIC values (e.g., PROC NLMIXED in SAS reports 1173.9 for the random effects model and 1171.9 for the simpler model with σ = 0).
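The halving of the P-value arises because σ = 0 lies on the boundary of the parameter space, making the null distribution of the likelihood-ratio statistic a 50:50 mixture of a point mass at 0 and a χ²₁ (Section 12.6.6). A quick check of the quoted P = 0.49, using only the standard-library error function:

```python
import math

def chi2_sf_df1(x):
    """Survival function of chi-squared with df = 1:
    P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))

deviance_change = 0.001
naive_p = chi2_sf_df1(deviance_change)  # treats chi-squared df = 1 as exact
boundary_p = naive_p / 2.0              # 50:50 mixture correction
print(round(boundary_p, 2))             # 0.49
```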
12.3.4 Modeling Heterogeneity among Multicenter Clinical Trials
Many applications compare two groups on a response for data stratified on a third variable. With binary outcomes, the data form several 2 × 2 contingency tables. The main focus is studying the association in the 2 × 2 tables and whether and how it varies among the strata.
RANDOM EFFECTS: GENERALIZED LINEAR MIXED MODELS
The strata are sometimes themselves a sample, such as schools or medical
clinics. A random effects approach is then natural. With a random sampling
of strata, it enables inferences to extend to the population of strata. The fit of
the random effects model provides a simple summary such as an estimated
mean and standard deviation of log odds ratios for the population of strata.
In each stratum it also provides a predicted log odds ratio that shrinks the
sample value toward the mean. This is especially useful when the sample size
in a stratum is small and the ordinary sample odds ratio has large standard
error. Even when the strata are not a random sample, or not even a sample at all, so that a random effects approach is less natural, the model remains useful for these purposes.
We illustrate using Table 12.5, previously analyzed in Section 6.3, showing the results of a clinical trial at eight centers. The purpose was to compare an active drug and a control for curing an infection. For a subject in center i using treatment t (1 = active drug; 2 = control), let y_it = 1 denote success. One possible model is the logistic-normal,

logit P(Y_i1 = 1 | u_i) = α + β/2 + u_i
logit P(Y_i2 = 1 | u_i) = α − β/2 + u_i,                    (12.11)
TABLE 12.5 Clinical Trial Relating Treatment to Response for Eight Centers

                        Response              Sample        Fitted
Center   Treatment   Success   Failure     Odds Ratio    Odds Ratio
1        Drug           11        25          1.19          2.02
         Control        10        27
2        Drug           16         4          1.82          2.09
         Control        22        10
3        Drug           14         5          4.80          2.19
         Control         7        12
4        Drug            2        14          2.29          2.11
         Control         1        16
5        Drug            6        11          ∞             2.18
         Control         0        12
6        Drug            1        10          ∞             2.12
         Control         0        10
7        Drug            1         4          2.00          2.11
         Control         1         8
8        Drug            4         2          0.33          2.06
         Control         6         1

Source: Beitler and Landis (1985).
where {u_i} are independent N(0, σ²) variates. This model assumes that the log odds ratio β between treatment and response is constant over centers. The parameter σ summarizes center heterogeneity in the success probabilities.

A logistic-normal model permitting treatment-by-center interaction is

logit P(Y_i1 = 1 | u_i, b_i) = α + (β + b_i)/2 + u_i,
logit P(Y_i2 = 1 | u_i, b_i) = α − (β + b_i)/2 + u_i,                    (12.12)

where {u_i} are independent N(0, σ_a²), {b_i} are independent N(0, σ_b²), and {u_i} are independent of {b_i}. The log odds ratio equals β + b_i in center i. These vary among centers according to a N(β, σ_b²) distribution. That is, β is the expected center-specific log odds ratio between treatment and response, and σ_b describes variability in those log odds ratios. The model parameters are (α, β, σ_a, σ_b).
In Table 12.5 the sample success rates vary markedly among centers, both for the control and the drug treatment, but in all except the last center that rate is higher for the drug treatment. In using models with random center and possibly random treatment effects, it is preferable to have more than eight centers, since it is difficult to get reliable variance component estimates with so few. Keeping this in mind, we use these data to illustrate the models. With a large number of centers it would also be sensible to allow correlation between b_i and u_i, but we shall not attempt that here. The treatment effect estimates are β̂ = 0.739 (SE = 0.300) for the model (12.11) of no interaction and β̂ = 0.746 (SE = 0.325) for the model (12.12) permitting interaction. There is considerable evidence of a drug effect. With such a small sample, however, it is unclear whether that effect is weak or moderate.
The evidence of association is weaker for the model permitting interaction. The Wald statistics are (0.739/0.300)² = 6.0 for the no-interaction model and (0.746/0.325)² = 5.3 for the interaction model. The corresponding likelihood-ratio statistics are 6.3 and 4.6 (df = 1). The extra variance component in the interaction model pertains to variability in the log odds ratios. As its estimate σ̂_b increases, the standard error of the estimated treatment effect β̂ tends to increase as well. In this example, σ̂_b = 0.15 is relatively small, and the standard errors of β̂ are not very different in the two models. When σ̂_b = 0, the standard errors and the model fits are the same.
To show the effect of a larger σ̂_b on the standard error of the mean treatment effect estimate β̂, we alter Table 12.5 slightly: we change three failures to successes for drug in center 3 and three successes to failures for drug in center 8. With these changes, the estimated variability of the treatment effects increases from σ̂_b = 0.15 to σ̂_b = 1.4. The ML estimates of the mean treatment effect are then β̂ = 0.722 (SE = 0.299) for the no-interaction model (12.11) and β̂ = 0.767 (SE = 0.623) for the interaction model. The Wald statistics are 5.8 and 1.5. The evidence of a treatment
effect is then dramatically weaker for the interaction model (12.12). Not surprisingly, when the treatment effect varies substantially among centers, it is more difficult to estimate the mean of that effect.
For the actual data in Table 12.5, because σ̂_b = 0.15 for model (12.12) is relatively small, the model shrinks the sample odds ratios considerably. Table 12.5 shows the sample values and the model's predicted values. These are based on predicting the random effects (to be explained in Section 12.6) and substituting them and the ML estimates of fixed effects into the model formula to estimate the two response probabilities for each treatment in each center. The sample odds ratios vary from 0.33 to ∞; their random effects model counterparts (computed with PROC NLMIXED in SAS) vary only between 2.0 and 2.2. The smoothed estimates are much less variable and do not have the same ordering as the sample values. For instance, the smoothed estimate of 2.2 for center 3 is greater than the estimate of 2.1 for center 6, even though the sample value is infinite for the latter. This partly reflects the greater shrinkage that occurs when sample sizes are smaller. When σ̂_b = 0, model (12.12) provides the same fit as model (12.11), and the estimated odds ratios are identical in each center.
For related analyses permitting heterogeneity in odds ratios with several 2 × 2 tables, see Liu and Pierce (1993) and Skene and Wakefield (1990).
12.3.5 Alternative Formulations of Random Effects Models
There are other ways to express the models. For instance, an equivalent expression for the interaction model (12.12) is

logit P(Y_it = 1 | u_i, b_it) = α + βx_t + b_it + u_i,

where x_t is a treatment dummy variable (x₁ = 1, x₂ = 0), {u_i} are independent N(0, σ_a²), and {b_i1} and {b_i2} are independent N(0, σ²). Here, b_i1 − b_i2 corresponds to b_i in parameterization (12.12), and 2σ² corresponds to σ_b².

Formulating a random effects model requires care about the implications of the model expression and the random effects correlation structure. Suppose that one expressed the interaction model (12.12) as

logit P(Y_it = 1 | u_i, b_i) = α + (β + b_i)x_t + u_i,                    (12.13)

with {b_i} from N(0, σ_b²). This is inappropriate, since the model then imposes greater variability on the logit for the first treatment than for the second, since x₂ = 0 and {u_i} and {b_i} are uncorrelated. Also, the model should not depend on the definition of the dummy variable x_t. Note, however, that if z_t = x_t + c for some constant c, then model (12.13) is equivalently

logit P(Y_it = 1 | u_i, b_i) = α + (β + b_i)(z_t − c) + u_i = α′ + (β + b_i)z_t + v_i,
where α′ = α − βc and v_i = u_i − cb_i. Thus, (v_i, b_i) are correlated even if (u_i, b_i) are not. In fact, expression (12.13) is sensible only with correlated random effects. It is then equivalent to (12.12) with correlated random effects. See Agresti and Hartzel (2000) for further discussion.
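The correspondence between the two parameterizations (b_i1 − b_i2 playing the role of b_i, with 2σ² playing the role of σ_b²) is easy to confirm by simulation. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
n = 200_000

b1 = rng.normal(0.0, sigma, size=n)  # independent N(0, sigma^2) draws
b2 = rng.normal(0.0, sigma, size=n)

# Var(b1 - b2) = 2 * sigma^2 for independent components.
print(np.var(b1 - b2))  # close to 2.0
```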
12.3.6 Capture–Recapture Modeling to Predict Population Size
Capture–recapture experiments use a series of samples to estimate the size of a population. Such methods have traditionally been used to estimate animal abundance in some habitat. At each sampling occasion, animals are captured and marked in some manner. The animals captured in any given sample are freed, and all animals are candidates for recapture in a later sample. With T sampling occasions, a 2^T contingency table displays the data, with scale (captured, not captured) at each occasion. The count n_{22⋯2} is missing for the cell corresponding to noncapture at every occasion. If we knew this cell count, adding it to the others would yield the population size. Models specified for this 2^T table use the 2^T − 1 observed counts to fit the model. The fit refers to those 2^T − 1 cells, but extrapolating it yields an estimated count for the unobserved cell. Adding that to the total of the 2^T − 1 observed counts yields an estimate of the population size.
To illustrate, suppose that T = 2. We observe n₁₁ animals at both occasions, n₁₂ at the first but not the second occasion, and n₂₁ at the second but not the first. We do not know the number n₂₂ not captured either time. If we assumed independence in the 2 × 2 table, the prediction n̂₂₂ would be the value giving an odds ratio of 1.0; but (n₁₁n̂₂₂)/(n₁₂n₂₁) = 1 implies that n̂₂₂ = n₁₂n₂₁/n₁₁. This yields a population size prediction (Sekar and Deming 1949) of

N̂ = n₁₁ + n₁₂ + n₂₁ + n₁₂n₂₁/n₁₁ = n₁₊n₊₁/n₁₁

with estimated variance

v̂ar(N̂) = n₁₊n₊₁n₁₂n₂₁ / n₁₁³.

The assumption of independence is usually unrealistic, however. With additional sampling occasions, one can try more complex models.
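The Sekar–Deming prediction and its variance are simple to compute. A sketch with purely illustrative counts (not data from the text):

```python
def sekar_deming(n11, n12, n21):
    """Lincoln-Petersen / Sekar-Deming estimate of population size for
    two capture occasions under independence, with its estimated variance."""
    n1_plus = n11 + n12  # total captured at occasion 1
    n_plus1 = n11 + n21  # total captured at occasion 2
    n_hat = n1_plus * n_plus1 / n11
    var_hat = n1_plus * n_plus1 * n12 * n21 / n11 ** 3
    return n_hat, var_hat

# Hypothetical counts: 10 caught both times, 5 only first, 4 only second.
n_hat, var_hat = sekar_deming(10, 5, 4)
print(n_hat, var_hat)  # 21.0 4.2
```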
Table 12.6, analyzed by Cormack (1989) and others, refers to a study having T = 6 consecutive trapping days for a population of snowshoe hares.
The study observed 68 hares. For instance, Table 12.6 indicates that 3 hares
were observed on the first day but on none of the other days. For simplicity,
models for studies over a brief time period assume that no deaths, births, or
immigration into the population occurred during the study period. This is
called a closed population.
Most capture–recapture methods treat the probability of capture at a given occasion as identical for each subject (e.g., animal). This is usually
TABLE 12.6 Results of Capture–Recapture of Snowshoe Hares

Capture 3,                      (Capture 6, Capture 5, Capture 4)
Capture 2,     000       001       010       011       100       101       110       111
Capture 1ᵃ
000          — (24.0)  3 (2.3)   6 (5.4)   0 (0.9)   5 (3.2)   1 (0.5)   0 (1.2)   0 (0.3)
001          3 (4.8)   2 (0.8)   3 (1.8)   0 (0.5)   0 (1.1)   1 (0.3)   0 (0.6)   0 (0.3)
010          4 (3.9)   2 (0.6)   3 (1.5)   1 (0.4)   0 (0.9)   1 (0.2)   0 (0.5)   0 (0.2)
011          1 (1.3)   0 (0.3)   0 (0.8)   0 (0.3)   0 (0.5)   0 (0.2)   0 (0.4)   0 (0.3)
100          4 (6.8)   1 (1.1)   1 (2.6)   1 (0.6)   2 (1.5)   0 (0.4)   2 (0.9)   0 (0.4)
101          4 (2.3)   0 (0.6)   3 (1.3)   0 (0.5)   1 (0.8)   0 (0.3)   2 (0.7)   0 (0.4)
110          2 (1.9)   0 (0.5)   1 (1.1)   0 (0.4)   1 (0.7)   0 (0.3)   1 (0.6)   0 (0.4)
111          1 (1.0)   1 (0.4)   1 (0.9)   0 (0.5)   0 (0.5)   0 (0.3)   1 (0.7)   2 (0.7)

ᵃColumn labels give (capture 6, capture 5, capture 4); row labels give (capture 3, capture 2, capture 1). Values in parentheses represent the fit of the logistic-normal model.
Source: A. Agresti, Biometrics 50: 494–500 (1994).
unrealistic. One way to allow heterogeneous capture probabilities uses a logit model having subject random effects. For subject i, i = 1, . . . , N, with N unknown, let y_i′ = (y_i1, . . . , y_iT), where y_it = 1 denotes capture in sample t and y_it = 0 denotes noncapture. Lacking explanatory variables, one might use the Rasch-type model

logit P(Y_it = 1 | u_i) = u_i + β_t,

where {u_i} are independent N(0, σ²). The larger the value of β_t, the greater the capture probability at occasion t. The larger σ is, the more heterogeneous the capture probabilities. When σ = 0, this logistic-normal model simplifies to mutual independence [i.e., loglinear model (8.6)] for the 2^T table.
As with other random effects models, integrating the random effect out of the probability mass function of (y_i | u_i) yields the likelihood function (as discussed in Section 12.6). One can consider this likelihood function and the resulting ML estimates of {β_t} and σ for each possible count in the unobserved cell. A profile likelihood function views the maximized likelihood as a function of the unobserved cell count. The ML prediction for that unobserved cell count is the value that maximizes this profile likelihood. Lacking specialized software, one can fit the random effects model repeatedly with various counts in the unobserved cell to determine, by trial and error, the count that maximizes the likelihood function.
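For the simple T = 2 independence model introduced earlier in this section, this trial-and-error profiling can be carried out directly: the profile log-likelihood of the completed table peaks at (or within one count of) the Sekar–Deming value n₁₂n₂₁/n₁₁. The sketch below uses hypothetical counts; the T = 6 logistic-normal fit additionally requires numerical integration over u_i and is beyond a few lines:

```python
from math import lgamma, log

def profile_loglik(n11, n12, n21, n22):
    """Maximized multinomial log-likelihood of the completed 2x2 table
    under independence, as a function of the unobserved count n22."""
    counts = [n11, n12, n21, n22]
    N = sum(counts)
    row = [n11 + n12, n21 + n22]  # occasion-1 margins (captured, not)
    col = [n11 + n21, n12 + n22]  # occasion-2 margins (captured, not)
    ll = lgamma(N + 1) - sum(lgamma(c + 1) for c in counts)
    probs = [row[i] * col[j] / N**2 for i in (0, 1) for j in (0, 1)]
    ll += sum(c * log(p) for c, p in zip(counts, probs) if c > 0)
    return ll

n11, n12, n21 = 20, 10, 8  # hypothetical observed counts
best = max(range(100), key=lambda n22: profile_loglik(n11, n12, n21, n22))
print(best)  # close to n12 * n21 / n11 = 4
```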
ML fitting of this model to Table 12.6 yields a prediction of 24 for the unobserved cell count. Since the study observed 68 hares, the population size estimate is N̂ = 92. For this fit, σ̂ = 1.0.
Methods for obtaining a confidence interval for N include using the profile likelihood function or a nonparametric bootstrap. With the profile likelihood approach, the interval for the missing cell count consists of the possible counts for that cell such that the G² fit statistic increases by less than χ₁²(α) from its value at the ML estimate. Adding the number of subjects observed in the samples to the endpoints of this interval gives the corresponding interval for N. For the snowshoe hares, a 95% profile-likelihood confidence interval for N is (75, 154). It is common for N̂ to be nearer the low end of the interval. See Coull and Agresti (1999) for details.
The greater the heterogeneity, as reflected by a larger σ̂, the larger N̂ tends to be and the wider the confidence interval tends to be. A large σ̂ causes difficulties in estimation, since it results in a relatively flat likelihood surface. This implies imprecise estimates of N. In particular, the upper limit of the profile-likelihood confidence interval for N is essentially infinite when the likelihood function gets sufficiently flat. Also, the ML estimator is then often unstable, with small changes in the data yielding large changes in N̂. Difficulties can also arise when probabilities of capture are small. Evidence of this occurs when most captured subjects appear in only one sample. When this happens, or when σ̂ is large, it is unrealistic to expect narrow confidence intervals for N.
Alternative models are discussed in Section 13.1.3. Models that ignore likely heterogeneity can give unrealistically narrow confidence intervals for N. Although traditionally used for animal populations, capture–recapture applications also include estimating the size of human populations, such as the prevalence of injecting drug use and HIV infection. Darroch et al. (1993) considered census population estimation, and Chao et al. (2001) estimated the number of people infected during a hepatitis outbreak (Problem 12.21). An interesting application is estimating the number of files on the World Wide Web relating to some subject, by taking samples using several search engines (Fienberg et al. 1999).
12.4 RANDOM EFFECTS MODELS FOR MULTINOMIAL DATA

Random effects models for binary responses extend to multicategory responses. For the multicategory models of Chapter 7, a multinomial observation with I categories is a vector of I − 1 indicators, the jth of which is 1 when the observation falls in category j and 0 otherwise. In Section 7.1.5 we defined a multivariate GLM by applying a vector of link functions to this multivariate response. Adding random effects extends this multivariate GLM and the GLMM (12.1) to a multivariate GLMM (Hartzel et al. 2001b; Tutz and Hennevogl 1996). This class includes models for nominal and ordinal responses.
12.4.1 Cumulative Logit Model with Random Intercept

Modeling is simpler with ordinal than with nominal responses, since often the same random effect and the same fixed effect can apply to each logit. With cumulative logits, this is the proportional odds structure (Section 7.2.2). Denote the possible outcomes for y_it, observation t in cluster i, by 1, 2, . . . , I. A GLMM for the cumulative logits has the form

logit P(Y_it ≤ j | u_i) = α_j + x_it′β + z_it′u_i,    j = 1, . . . , I − 1.                    (12.14)

Hedeker and Gibbons (1994) discussed model fitting, primarily with u_i as multivariate normal.
For cumulative logit and probit random intercept models, the same relationship exists between their effects and those in marginal models as presented in Section 12.2.2 for binary-response models. Marginal effects tend to be smaller, increasingly so as σ increases. Also, the same predictor structure as in (12.14) holds with other links for which a common effect for each logit is plausible. For instance, Hartzel et al. (2001a, b) used it with adjacent-categories logits.
12.4.2 Insomnia Study Revisited

Table 11.4 showed results of a clinical trial at two occasions comparing a drug with placebo in treating insomnia patients. In Sections 11.2.3 and 11.3.3 the data were analyzed with marginal models. For y_t = time to fall asleep at occasion t, the marginal model

logit P(Y_t ≤ j) = α_j + β₁t + β₂x + β₃tx

permitted interaction between t = occasion (0 = initial, 1 = follow-up) and x = treatment (1 = active, 0 = placebo). Table 12.7 shows the ML and GEE estimates.

Now, let y_it denote the response for subject i at occasion t. Table 12.7 also shows results of fitting the random-intercept model

logit P(Y_it ≤ j | u_i) = u_i + α_j + β₁t + β₂x + β₃tx.
TABLE 12.7 Fits of Cumulative Logit Models to Table 11.4ᵃ

                         Marginal         Marginal         Random Effects
Effect                   ML               GEE              (GLMM) ML
Treatment                0.046 (0.236)    0.034 (0.238)    0.058 (0.366)
Occasion                 1.074 (0.162)    1.038 (0.168)    1.602 (0.283)
Treatment × occasion     0.662 (0.244)    0.708 (0.244)    1.081 (0.380)

ᵃValues in parentheses represent standard errors.
Results are substantively similar to those for the marginal model, but the estimates and standard errors are about 50% larger. This reflects the relatively large heterogeneity (σ̂ = 1.90) and the resultant strong association between the responses at the two occasions.
12.4.3 Cluster Sampling
With surveys that use cluster sampling, standard methods based on simple random sampling (e.g., for a single multinomial sample) require adjustment. Ordinary standard errors are too small. The usual chi-squared test statistics no longer have chi-squared null distributions, but rather those of weighted sums of chi-squared variates. Rao and Thomas (1988) surveyed ways of adjusting standard inferences to take into account complex sampling methods in the analysis and modeling of categorical data.

When the sampling scheme randomly samples clusters, one can account for the clustering using cluster random effects. We illustrate using data from Brier (1980), who reported 96 observations taken from 20 neighborhoods (the clusters) on Y = satisfaction with home and X = satisfaction with the neighborhood as a whole. Each variable was measured with the ordinal scale (unsatisfied, satisfied, very satisfied). Brier's analysis adjusted for clustering by reducing the Pearson statistic for testing independence in the 3 × 3 contingency table relating X and Y from 17.9 to 15.7 (df = 4).
Consider the model for y_it, observation t in cluster i,

logit P(Y_it ≤ j | u_i) = u_i + α_j + βx_it,                    (12.15)

with scores (1, 2, 3) for the satisfaction levels of x_it. With a N(0, σ²) distribution assumed for u_i, the ML effect estimate is β̂ = −1.201 (SE = 0.407), with σ̂ = 0.92. By contrast, treating the 96 observations as a random sample corresponds to fitting this model with σ = 0; it has β̂ = −1.226 (SE = 0.370). A slight reduction in significance results from adjusting for the clustering.
12.4.4 Baseline-Category Logit Models with Random Effects

For nominal response variables, one can formulate a binary model that pairs each category with a baseline and fit these models simultaneously while allowing separate effects. This requires using a vector of cluster-specific random effects u_ij, one for each logit. The general form of the baseline-category logit model with random effects is

log [P(Y_it = j) / P(Y_it = I)] = α_j + x_it′β_j + z_it′u_ij,    j = 1, . . . , I − 1.

The fixed effects β_j and the random effects u_ij depend on j, since the baseline category is arbitrary. With nominal responses there is no reason to expect effects to be similar for different j.
Cluster i has a vector u_i′ = (u_i1′, . . . , u_i,I−1′) of random effects. The usual approach treats {u_i} as independent multivariate normal variates. We recommend an unspecified covariance matrix for u_i. For instance, it is sensible to allow different variances for random effects that apply to different logits. With a common variance, that variance would not be the same as the one for the implied random effect for a logit pairing an arbitrary pair of categories, log[P(Y_it = j)/P(Y_it = k)]. With unspecified covariance, the model is structurally the same regardless of the choice of baseline category. See Hartzel et al. (2001b) for an example.
12.5 MULTIVARIATE RANDOM EFFECTS MODELS FOR BINARY DATA

In practice, random effects are often univariate, taking the form of random intercepts. However, we've seen that nominal responses require multivariate random effects and that bivariate random effects are helpful for describing heterogeneity in multicenter clinical trials. In this section we present other examples in which multivariate random effects are natural.
12.5.1 Matched Pairs with a Bivariate Binary Response

Leo Goodman analyzed Table 12.8 in several articles (e.g., Goodman 1974). A sample of schoolboys was interviewed twice, several months apart, and asked about their self-perceived membership in the "leading crowd" and about whether they sometimes needed to go against their principles to belong to that group. Thus, there are two binary response variables, which we refer to as membership and attitude, measured at two interview times for each subject. Table 12.8 labels the categories for attitude as (positive, negative), where "positive" refers to disagreeing with the statement that one must go against one's principles.
TABLE 12.8 Membership and Attitude Toward the "Leading Crowd"

(M, A) for                        (M, A) for Second Interviewᵃ
First Interview    (Yes, Positive)  (Yes, Negative)  (No, Positive)  (No, Negative)
Yes, positive           458              140              110              49
Yes, negative           171              182               56              87
No, positive            184               75              531             281
No, negative             85               97              338             554

ᵃM, membership; A, attitude.
Source: J. S. Coleman, Introduction to Mathematical Sociology (London: Free Press of Glencoe, 1964), p. 170.
For subject i, let y_itv be the response at interview time t on variable v, where v = M for membership and v = A for attitude. The logit model

logit P(Y_itv = 1 | u_iv) = β_tv + u_iv                    (12.16)

is a multivariate form of the Rasch-type model (12.4). It has additive item and subject effects for each variable v. Here, (u_iM, u_iA) is a bivariate random effect that describes subject heterogeneity for (membership, attitude). We assume that the {(u_iM, u_iA)} are independent draws from a bivariate normal distribution, N(0, Σ), with possibly different variances and a nonzero correlation.

The ML fit yields β̂_2M − β̂_1M = 0.379 (SE = 0.075) and β̂_2A − β̂_1A = 0.176 (SE = 0.058). For both variables, the probability of the first outcome category is higher at the second interview. For instance, for a given subject, the odds of self-perceived membership in the leading crowd at interview 2 are estimated to be exp(0.379) = 1.46 times the odds at interview 1.
The estimated correlation between the random effects is 0.30. Their estimated standard deviations are σ̂₁ = 3.1 for {u_iM} and σ̂₂ = 1.5 for {u_iA}. Since these are quite different, the relative sizes of the membership and attitude effects differ between the marginal and conditional models (recall the caveat in Section 12.2.3). The marginal effect is attenuated more for membership. For this conditional model, the ratio of estimated odds ratios is exp(0.379)/exp(0.176) = 1.46/1.19 = 1.22. For the marginal model, the estimated odds ratios use the marginal distributions of each variable at each time [e.g., this is (1392/2006)/(1253/2145) = 1.188 for membership], and the ratio of estimated odds ratios is 1.188/1.133 = 1.05.
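The marginal odds ratios quoted here follow directly from the margins of Table 12.8. A sketch that reproduces the 1.188, 1.133, and 1.05 figures:

```python
# Table 12.8 counts: rows = (M, A) at interview 1, columns = (M, A) at
# interview 2, both in the order (Yes,Pos), (Yes,Neg), (No,Pos), (No,Neg).
table = [
    [458, 140, 110, 49],
    [171, 182, 56, 87],
    [184, 75, 531, 281],
    [85, 97, 338, 554],
]
n = sum(sum(row) for row in table)  # 3398 subjects

def odds(p):
    return p / (1 - p)

# Membership "yes" = categories 0, 1; attitude "positive" = categories 0, 2.
mem1 = sum(sum(table[i]) for i in (0, 1)) / n                   # yes, time 1
mem2 = sum(table[i][j] for i in range(4) for j in (0, 1)) / n   # yes, time 2
att1 = sum(sum(table[i]) for i in (0, 2)) / n
att2 = sum(table[i][j] for i in range(4) for j in (0, 2)) / n

or_mem = odds(mem2) / odds(mem1)  # ~1.188
or_att = odds(att2) / odds(att1)  # ~1.133
print(round(or_mem, 3), round(or_att, 3), round(or_mem / or_att, 2))
```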
Integrating over the estimated random effects distribution yields fitted values for the 16 possible sequences of responses in Table 12.8. The deviance G² = 5.5 (df = 8) compares the 16 observed counts to their fitted values. The model, which describes 15 multinomial probabilities with seven parameters, fits well. The model constraining the random effects to be uncorrelated fits poorly (G² = 97.5, df = 9). The model constraining the random effects to be perfectly correlated is equivalent to having a single random effect u_i for each subject; the model is then a Rasch-type model with four items that are the combinations of interviews and variables. That model fits very poorly (G² = 655.5, df = 10). Agresti et al. (2000) gave further details.
12.5.2 Continuation-Ratio Logits for Clustered Ordinal Outcomes: Toxicity Study

For continuation-ratio logit models with ordinal responses, the logits refer to independent binomial variates (Section 7.4.3). Thus, binary logit random effects models apply to clustered ordinal responses using continuation-ratio logits (Ten Have and Uttal 1994). For observation t in cluster i, let ω_ij = P(Y_it = j | Y_it ≥ j, u_ij). (More generally, this probability could also depend on t, but that generality is not needed for the example below.) The continuation-ratio logits are {logit(ω_ij), j = 1, . . . , I − 1}.
518
RANDOM EFFECTS: GENERALIZED LINEAR MIXED MODELS
Let n_ij be the number of subjects in cluster i making response j, and let n_i = Σ_{j=1}^I n_ij. For a given cluster in a continuation-ratio logit model, treating (n_i1, . . . , n_i,I−1) as multinomial is equivalent to treating them as a sequential set of independent binomial variates, where n_ij is bin(n_i − Σ_{h<j} n_ih, ω_ij), j = 1, . . . , I − 1.
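This equivalence is exact, including the combinatorial coefficients: the multinomial log-likelihood equals the sum of the sequential binomial log-likelihoods. A numerical check with illustrative counts and probabilities:

```python
from math import lgamma, log

def log_binom(n, k, p):
    """Binomial log-likelihood for k successes in n trials."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

# Illustrative cluster: counts and probabilities for I = 3 ordered outcomes.
n1, n2, n3 = 2, 5, 9
p1, p2, p3 = 0.1, 0.3, 0.6
n = n1 + n2 + n3

# Multinomial log-likelihood.
ll_multinom = (lgamma(n + 1) - lgamma(n1 + 1) - lgamma(n2 + 1) - lgamma(n3 + 1)
               + n1 * log(p1) + n2 * log(p2) + n3 * log(p3))

# Sequential binomials with continuation-ratio probabilities
# omega_1 = p1 and omega_2 = p2 / (1 - p1).
ll_sequential = log_binom(n, n1, p1) + log_binom(n - n1, n2, p2 / (1 - p1))

print(abs(ll_multinom - ll_sequential))  # ~0 up to rounding error
```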
We illustrate with a developmental toxicity study conducted under the U.S. National Toxicology Program. This study examined the developmental effects of ethylene glycol (EG) by administering one of four dosages (0, 0.75, 1.50, 3.00 g/kg) to pregnant rodents. The four dose groups had (25, 24, 22, 23) pregnant rodents. The clusters are litters of mice. The three possible outcomes (dead/resorption, malformation, normal) for each fetus are ordered, normal being the most desirable result. Table 12.9 shows the data. The continuation-ratio logit is natural here since the categories are hierarchically related; an animal must survive before a malformation can occur. The following analyses are from Coull and Agresti (2000).
For litter i in dose group d, let logit(ω_i(d)1) be the continuation-ratio logit for the probability of death and logit(ω_i(d)2) the continuation-ratio logit for the conditional probability of malformation, given survival. [The notation i(d) represents litter i nested within dose d.] Let x_d be the dosage for group d. We account for the litter effect using litter-specific random effects u_i(d) = (u_i(d)1, u_i(d)2)′ sampled from N(0, Σ_d). This bivariate random effect allows for differing amounts of overdispersion for the probability of death and for the probability of malformation, given survival. A model also permitting different fixed effects for each is

logit(ω_i(d)j) = u_i(d)j + α_j + β_j x_d.                    (12.17)
TABLE 12.9 Response Counts for 94 Litters of Mice on (Number Dead, Number Malformed, Number Normal)

Dose = 0.00 g/kg:
(1,0,7), (0,0,14), (0,0,13), (0,0,10), (0,1,15), (1,0,14), (1,0,10), (0,0,12), (0,0,11), (0,0,8), (1,0,6), (0,0,15), (0,0,12), (0,0,12), (0,0,13), (0,0,10), (0,0,10), (1,0,11), (0,0,12), (0,0,13), (1,0,14), (0,0,13), (0,0,13), (1,0,14), (0,0,14)

Dose = 0.75 g/kg:
(0,3,7), (1,3,11), (0,2,9), (0,0,12), (0,1,11), (0,3,10), (0,0,15), (0,0,11), (2,0,8), (0,1,10), (0,0,10), (0,1,13), (0,1,9), (0,0,14), (1,1,11), (0,1,9), (0,1,10), (0,0,15), (0,0,15), (0,3,10), (0,2,5), (0,1,11), (0,1,6), (1,1,8)

Dose = 1.50 g/kg:
(0,8,2), (0,6,5), (0,5,7), (0,11,2), (1,6,3), (0,7,6), (0,0,1), (0,3,8), (0,8,3), (0,2,12), (0,1,12), (0,10,5), (0,5,6), (0,1,11), (0,3,10), (0,0,13), (0,6,1), (0,2,6), (0,1,2), (0,0,7), (0,4,6), (0,0,12)

Dose = 3.00 g/kg:
(0,4,3), (1,9,1), (0,4,8), (1,11,0), (0,7,3), (0,9,1), (0,3,1), (0,7,0), (0,1,3), (0,12,0), (2,12,0), (0,11,3), (0,5,6), (0,4,8), (0,5,7), (2,3,9), (0,9,1), (0,0,9), (0,5,4), (0,2,5), (1,3,9), (0,2,5), (0,1,11)

Source: Study described by C. J. Price, C. A. Kimmel, R. W. Tyl, and M. C. Marr, Toxicol. Appl. Pharmacol. 81: 113–127 (1985).
TABLE 12.10 Comparisons of Log Likelihoods for Multivariate Random Effects Models for Developmental Toxicity Study

Model                       Number of      Change in      Change in
                            Parameters     Parameters     Log Likelihood
Dose-specific Σ_d               16             —               —
Σ_d; common α, β                14             2              28.4
Common Σ                         7             9               7.4
Common Σ, ρ = 0                  6            10               7.4
Univariate σ²                    5            11              16.7
Table 12.10 reports the change in the maximized log likelihood from fitting four special cases of this model:

1. Common intercept and slope for the two logits: α₁ = α₂ and β₁ = β₂
2. Common covariance matrix for the four doses: Σ₁ = Σ₂ = Σ₃ = Σ₄
3. Common covariance matrix and uncorrelated random effects
4. Univariate common variance component across doses: u_i(d)1 = u_i(d)2, with variance σ² for each d

Tests of the first three special cases against the general model (12.17) can use ordinary likelihood-ratio tests. Little seems to be lost by using the simpler model having uncorrelated random effects with a homogeneous covariance structure (i.e., the fourth model listed in Table 12.10), as the likelihood-ratio statistic comparing it to model (12.17) equals 2(7.4) = 14.8 (df = 10). That model provides a separate univariate logistic-normal model for each conditional binomial outcome, specifying that the proportion of dead pups and the proportion of malformed pups (given survival) are independent, both within litters and marginally.
The univariate model in Table 12.10 is the special case of the third model listed in which the variances are common for the two logits and the random effects are perfectly correlated. Hence, it reduces to a univariate random effects model. Comparing the univariate model to a multivariate counterpart involves testing that correlation parameters fall on the boundary. Ordinary chi-squared asymptotic theory for likelihood-ratio tests applies only when the parameter falls in the interior of the parameter space. Tests when a null model has a correlation of 1 or a variance component of 0 are complex and beyond our scope here (see Section 12.6.6). However, an informal analysis of change in log likelihoods suggests that the univariate model is inadequate.
The ML estimated effects for the separate univariate logistic-normal model for each conditional binomial outcome are β̂₁ = 0.08 (SE = 0.21) and β̂₂ = 1.79 (SE = 0.22). For a given cluster, there is no evidence of a dose effect on the death rate, but the estimated odds of malformation, given survival, multiply by exp(1.79) = 6.0 for every additional g/kg of ethylene glycol. The variance component estimates suggest a stronger litter effect for the malformation outcome given survival (σ̂₂ = 1.6) than for death (σ̂₁ = 0.5).
12.5.3 Hierarchical (Multilevel) Modeling
Hierarchical data structures, with units grouped at different levels, are
common in education. A statewide study of factors that affect student
performance might measure students’ scores on a battery of exams but use a
model that takes into account the student, the school or school district, and
the county. Just as two observations on the same student might tend to be
more alike than observations on different students, so might two students in
the same school tend to be more alike than students from different schools.
Student, school, and county terms might be treated as random effects,
with different ones referring to different levels of the model. For instance,
a model might have students at level 1, schools at level 2, and counties at
level 3. GLMMs for data having a hierarchical grouping of this sort are called
multilevel models. Random effects enter the model at each level of the
hierarchy.
We illustrate with a two-level model. Let π_{i(j)t} denote the probability that student i in school j passes test t in a battery of tests. A multilevel model with random effects for student and school and fixed effects for explanatory variables has the form

logit[π_{i(j)t}] = x′_{i(j)t} β + u_j + v_{i(j)}.
Here, the explanatory variables x might include one that identifies the test in
the battery. The random effects u_j for schools and v_{i(j)} for students within schools are independent with different variance components. The level 1 random effects {v_{i(j)}} account for variability among students in ability or parents' socioeconomic status or other characteristics not measured by x. When they have a relatively large variance component, there is a strong correlation among the test results for students. The level 2 random effects {u_j} account for variability among schools due to possibly unmeasured factors such as per-capita expenditure in the school's budget.
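The two-level structure just described is easy to simulate directly. The following Python sketch generates pass/fail indicators from logit[π_{i(j)t}] = x′β + u_j + v_{i(j)} with a school effect u_j and a student-within-school effect v_{i(j)}; all dimensions, fixed effects, and variance components below are invented for illustration and do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dimensions and variance components (illustrative only)
n_schools, n_students, n_tests = 30, 20, 4
sigma_school, sigma_student = 0.6, 1.0    # level 2 and level 1 standard deviations
beta = np.array([-0.2, 0.5, 0.0, 0.3])    # assumed fixed effect for each test

u = rng.normal(0, sigma_school, n_schools)                  # u_j, one per school
v = rng.normal(0, sigma_student, (n_schools, n_students))   # v_i(j), one per student

# logit of pass probability: test effect + school effect + student effect
eta = beta[None, None, :] + u[:, None, None] + v[:, :, None]
prob = 1.0 / (1.0 + np.exp(-eta))
y = (rng.random(prob.shape) < prob).astype(int)   # simulated pass/fail indicators
```

Two students in the same school share u_j, so their results are more alike than results from different schools, which is exactly the dependence pattern the model is designed to capture.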
For examples of the use of multivariate random effects in multilevel modeling, see Aitkin et al. (1981), Anderson and Aitkin (1985), Gibbons and Hedeker (1997), Goldstein (1995), Goldstein and Rasbash (1996), and Longford (1993).
12.6 GLMM FITTING, INFERENCE, AND PREDICTION
Model fitting is rather complex for GLMMs. The main difficulty is that the likelihood function does not have a closed form. Numerical methods for approximating it can be computationally intensive for models with multivariate random effects. In this section we outline the basic ideas of ML fitting of GLMMs. Some ML methods are available in software (e.g., PROC NLMIXED in SAS).
12.6.1 Marginal Likelihood and Maximum Likelihood Fitting
The GLMM is a two-stage model. At the first stage, conditional on the random effects, observations are assumed to follow a GLM. That is, observation y_it in cluster i has distribution in the exponential family with expected value μ_it linked to a linear predictor,

g(μ_it) = x′_it β + z′_it u_i.

Then, z′_it u_i is a known offset and observations in a cluster are independent. At the second stage, the random effects {u_i} are assumed independent from a N(0, Σ) distribution.
For a discrete variable, denote the vector of all the observations by y and the vector of all the random effects by u. Let f(y | u; β) denote the conditional mass function of y, given u. Let f(u; Σ) denote the normal density function for u. The likelihood function ℓ(β, Σ; y) for a GLMM is the probability mass function f(y; β, Σ) of y, viewed as a function of β and Σ. This mass function refers to the marginal distribution of y after integrating out the random effects,

ℓ(β, Σ; y) = f(y; β, Σ) = ∫ f(y | u; β) f(u; Σ) du.    (12.18)
It is often called a marginal likelihood. For example, the likelihood function ℓ(β, σ²; y) for the logistic-normal model (12.5) (absorbing α into β) is

∏_i ∫_{−∞}^{∞} { ∏_t [ exp(x′_it β + u_i) / (1 + exp(x′_it β + u_i)) ]^{y_it} [ 1 / (1 + exp(x′_it β + u_i)) ]^{1−y_it} } f(u_i; σ²) du_i.
The likelihood function is evaluated numerically and maximized as a function of β and Σ. Many methods have been developed to do this. We next discuss a few of the most popular.
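As a concrete illustration of evaluating and maximizing the marginal likelihood numerically, the following Python sketch fits a random-intercept logistic model with a single fixed intercept to simulated data. It evaluates the one-dimensional integral by Gauss–Hermite quadrature (anticipating Section 12.6.2) and maximizes with a general-purpose optimizer; the data, sample sizes, and starting values are all invented for the illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Simulated clustered data: logit P(y_it = 1 | u_i) = beta + u_i, u_i ~ N(0, sigma^2)
n, T = 150, 6
u_i = rng.normal(0, 0.8, n)                  # true sigma = 0.8 (invented)
y = (rng.random((n, T)) < 1/(1 + np.exp(-(0.4 + u_i[:, None])))).astype(int)
s = y.sum(axis=1)                            # success count per cluster

nodes, wts = np.polynomial.hermite.hermgauss(40)

def neg_loglik(par):
    beta, log_sigma = par
    sigma = np.exp(log_sigma)                # keeps sigma > 0
    uq = np.sqrt(2)*sigma*nodes              # change of variable u = sqrt(2)*sigma*v
    p = 1/(1 + np.exp(-(beta + uq)))
    lik = (p[None, :]**s[:, None]) * ((1 - p)[None, :]**(T - s)[:, None])
    cluster = (lik*wts).sum(axis=1)/np.sqrt(np.pi)   # per-cluster marginal likelihood
    return -np.log(cluster).sum()

fit = minimize(neg_loglik, x0=[0.0, 0.0], method="Nelder-Mead")
beta_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])
```

The change of variable u = √2 σv converts the N(0, σ²) density into the exp(−v²) weight that Gauss–Hermite quadrature handles, which is why the 1/√π factor appears.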
12.6.2 Gauss–Hermite Quadrature Methods
The integral determining the likelihood function has dimension that depends on the random effects structure. When the dimension is small, as in the one-dimensional integral above for the logistic-normal model (12.5), standard numerical integration methods can approximate the likelihood function.
Gauss–Hermite quadrature is a method for approximating the integral of a function f(·) multiplied by another function having the shape of a normal density. The approximation is a finite weighted sum that evaluates the function at certain points. In the univariate normal random effects case, the approximation has the form

∫_{−∞}^{∞} f(u) exp(−u²) du ≈ ∑_{k=1}^{q} c_k f(s_k),

with weights {c_k} and quadrature points {s_k} that are tabulated. The approximation improves as q, the number of quadrature points, increases.
The approximated likelihood can be maximized with standard algorithms such as Newton–Raphson, yielding ML estimates β̂ and Σ̂. Inverting an approximation for the observed information matrix provides standard errors for the ML estimates. For complex models, second partial derivatives for the Hessian may be computed numerically rather than analytically. Adequate approximation usually requires larger q for standard errors than for β̂. We recommend sequentially increasing q until the changes are negligible in both the estimates and standard errors.
An adaptive version of Gauss–Hermite quadrature (e.g., Liu and Pierce 1994) centers the quadrature points with respect to the mode of the function being integrated and scales them according to the estimated curvature at the mode. This improves efficiency, dramatically reducing the number of quadrature points needed to approximate the integrals effectively. Lesaffre and Spiessens (2001) showed comparisons and warned against using too few points.
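A minimal sketch of the weighted-sum approximation, using the tabulated nodes and weights supplied by NumPy. The test integrand cos(u) is arbitrary, chosen only because its integral against exp(−u²) has the known closed form √π e^{−1/4}, so the improvement with q is visible directly.

```python
import numpy as np

def gauss_hermite(f, q):
    """Approximate the integral of f(u) * exp(-u^2) over the real line."""
    s, c = np.polynomial.hermite.hermgauss(q)   # tabulated points s_k and weights c_k
    return float(np.sum(c * f(s)))

# Known case: integral of cos(u) exp(-u^2) du equals sqrt(pi) * exp(-1/4)
exact = np.sqrt(np.pi) * np.exp(-0.25)
approx = {q: gauss_hermite(np.cos, q) for q in (2, 5, 10, 20)}
```

Even q = 10 is essentially exact here; integrands arising from binary-data likelihoods are less smooth, which is why the text recommends increasing q until estimates and standard errors stabilize.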
12.6.3 Monte Carlo Methods
Multivariate forms of Gauss–Hermite quadrature handle multivariate, correlated random effects. Adequate approximation becomes more difficult, however, when the dimension of the integral exceeds roughly 5. Then, Monte Carlo methods are more feasible computationally than numerical integration. Various Monte Carlo approaches have been studied (e.g., McCulloch 1997), including Monte Carlo in combination with Newton–Raphson, Monte Carlo in combination with the EM algorithm, and simulating the likelihood directly. Here, we briefly describe a Monte Carlo EM (MCEM) algorithm.

The EM algorithm is a popular iterative method of finding ML estimates when data are missing or when filling in some "missing" data simplifies a likelihood (Dempster et al. 1977) [see Laird (1998) for a useful review]. In each cycle an E-step takes an expectation over the missing data to approximate the likelihood function and an M-step maximizes the likelihood given the working values of the parameter estimates. In GLMMs, one regards the random effects u as missing data. Then, h(y, u; β, Σ) = f(y | u; β) f(u; Σ) specifies the joint distribution of the complete data. The E-step in iteration r
of the EM algorithm calculates

E{log h(y, u; β, Σ) | y; β^(r), Σ^(r)}.

The expectation is with respect to the distribution of (u | y) with parameter values equal to β^(r) and Σ^(r), the working estimates for iteration r. The distribution of (u | y) follows from those of (y | u) and u in the GLMM via Bayes' theorem. The M-step then maximizes the result with respect to β and Σ to obtain β^(r+1) and Σ^(r+1).
The MCEM algorithm approximates the expectation in the E-step using Monte Carlo methods. Possible ways of doing this include using independent simulations from the distribution of (u | y) at the current parameter estimates, or using Markov chain Monte Carlo (MCMC). For details, including the issue of choosing an appropriate Monte Carlo sample size, see Booth and Hobert (1999), Chan and Kuk (1997), and McCulloch (1994, 1997).
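The following Python sketch implements a bare-bones MCEM cycle for a random-intercept logistic model on simulated data. It approximates the E-step by drawing from the current N(0, σ²) distribution and reweighting the draws by f(y | u), one simple Monte Carlo option (self-normalized importance sampling); the single Newton step in the M-step makes this a generalized EM variant rather than a full maximization. All data and tuning constants are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data from logit P(y_it = 1 | u_i) = beta + u_i, u_i ~ N(0, sigma^2);
# the true values below are invented
n, T = 200, 5
beta_true, sigma_true = 0.5, 1.0
u_i = rng.normal(0, sigma_true, n)
y = (rng.random((n, T)) < 1/(1 + np.exp(-(beta_true + u_i[:, None])))).astype(int)
s = y.sum(axis=1)                       # success count per cluster

beta, sigma, M = 0.0, 0.5, 2000
for _ in range(30):
    # E-step: draw u from the current N(0, sigma^2) distribution and weight each
    # draw by f(y | u); the normalized weights target the distribution of (u | y)
    U = rng.normal(0, sigma, (n, M))
    p = 1/(1 + np.exp(-(beta + U)))
    logw = s[:, None]*np.log(p) + (T - s)[:, None]*np.log(1 - p)
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    # M-step for sigma: complete-data ML estimate of the variance component
    sigma = np.sqrt((w * U**2).sum(axis=1).mean())
    # M-step for beta: one Newton step on the expected complete-data log likelihood
    score = (w * (s[:, None] - T*p)).sum()
    info = (w * T*p*(1 - p)).sum()
    beta += score/info
```

With many clusters and a large Monte Carlo sample M, the iterates settle near the ML estimates; choosing M well is the delicate issue the references above address.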
12.6.4 Penalized Quasi-likelihood Approximation
The Gauss–Hermite and Monte Carlo integration methods provide likelihood approximations such that resulting parameter estimates converge to the ML estimates as they are applied more finely (i.e., as the number of quadrature points increases for numerical integration and as the Monte Carlo sample size increases in the MCEM method). This contrasts with other approximate methods that are simpler but need not yield estimates near the ML estimates. These methods maximize an analytical approximation of the likelihood function.
Recall that the likelihood function (12.18) results from integrating out the random effects u from the joint distribution of y and u. Using the exponential family representation of each component of that joint distribution, the integrand of (12.18) is an exponential function of u. One approach approximates that function using a second-order Taylor series expansion of its exponent around a point ũ at which the first-order term equals 0. [That point ũ ≈ E(u | y).] The approximating function for the integrand is then exponential with quadratic exponent in (u − ũ) and has the form of a constant multiple of a multivariate normal density. Thus, its integral has closed form. This type of integral approximation is called a Laplace approximation. The approximation for integral (12.18) is then treated as a likelihood and maximized with respect to β and Σ.

For one such method (Breslow and Clayton 1993), the integral approximation yields a function approximating the log likelihood that has the form

q(β, y) − (1/2) ũ′ Σ⁻¹ ũ,
where q(β, y) resembles a quasi-log-likelihood function for the GLM conditional on u = ũ. Thus, the approximation results in a penalty for the quasi-log likelihood, with the penalty increasing as elements of ũ increase in absolute value. This approach is called penalized quasi-likelihood (PQL). The calculations for maximizing the penalized quasi-likelihood use methods for linear mixed models with a normal response. This treats a linearization of the logit as a working response and entails iterative solution of sets of likelihood-like equations in β and u. PQL methods do not require numerical or Monte Carlo integration and so are simpler than ML methods. They are computationally feasible for large data sets and models with complex random effects structure.
Unfortunately, PQL methods can perform poorly relative to ML (McCulloch 1997). For instance, for the abortion example in Section 12.3.2, the PQL approximations to the ML estimates (obtained using the GLIMMIX macro in SAS) are decent for {β̂_t}, but the standard errors and the estimate of σ are only about half what they should be (e.g., PQL gives σ̂ = 4.3, compared to the ML estimate of 8.6). When true variance components are large, PQL ordinarily tends to produce variance component estimates with substantial negative bias (Breslow and Lin 1995). The PQL estimators also behave poorly when the response distribution is far from normal (e.g., binary). Adjustments have been developed for some cases to lessen the bias (e.g., Goldstein and Rasbash 1996), but where possible we recommend using ML rather than PQL.
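Since PQL rests on a Laplace approximation, it may help to see that approximation in isolation. The sketch below approximates one cluster's integral ∫ f(y | u) φ(u; 0, σ²) du by finding the mode ũ of the exponent with Newton's method and matching a normal density there, then compares the result with a 40-point Gauss–Hermite evaluation of the same integral. The data y, linear predictors, and σ are arbitrary illustrative values.

```python
import numpy as np

# One cluster: y_t ~ Bernoulli with logit p_t = eta_t + u, u ~ N(0, sigma^2);
# y, eta, and sigma are invented for the illustration
y = np.array([1, 1, 0, 1, 0])
eta = np.array([0.2, -0.1, 0.4, 0.0, 0.3])
sigma = 1.2

def log_integrand(u):
    p = 1/(1 + np.exp(-(eta + u)))
    return (np.sum(y*np.log(p) + (1 - y)*np.log(1 - p))
            - u**2/(2*sigma**2) - 0.5*np.log(2*np.pi*sigma**2))

# Newton's method for the mode u~ of the exponent (first derivative set to 0)
u = 0.0
for _ in range(25):
    p = 1/(1 + np.exp(-(eta + u)))
    grad = np.sum(y - p) - u/sigma**2
    hess = -np.sum(p*(1 - p)) - 1/sigma**2
    u -= grad/hess
p = 1/(1 + np.exp(-(eta + u)))
hess = -np.sum(p*(1 - p)) - 1/sigma**2       # curvature at the mode

# Laplace: replace the integrand by the matched normal; its integral is closed form
laplace = np.exp(log_integrand(u)) * np.sqrt(2*np.pi/(-hess))

# Reference value by 40-point Gauss-Hermite quadrature of the same integral
nodes, wts = np.polynomial.hermite.hermgauss(40)
uq = np.sqrt(2)*sigma*nodes
f_y_u = np.array([np.exp(log_integrand(x) + x**2/(2*sigma**2)
                         + 0.5*np.log(2*np.pi*sigma**2)) for x in uq])
gh = float(np.sum(wts*f_y_u)/np.sqrt(np.pi))
```

For a binary cluster this small, the Laplace value is close to, but not equal to, the quadrature value; that gap, accumulated over many clusters, is one source of the PQL biases described above.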
12.6.5 Bayesian Approaches
Another approach to fitting GLMMs is Bayesian. With it, the distinction between fixed and random effects no longer occurs, as every effect has a probability distribution. Use of a flat prior distribution yields a posterior that is a constant multiple of the likelihood function. Then, Markov chain Monte Carlo (MCMC) methods for approximating intractable posterior distributions can approximate the likelihood function (Zeger and Karim 1991). For instance, an approximation for the mode of the posterior distribution approximates the ML estimate.
A danger is that improper prior distributions have improper posteriors for many models for categorical data (Natarajan and McCulloch 1995). In using MCMC, one may fail to realize that the posterior is improper. It is safer to use a proper but relatively diffuse prior. However, the posterior mode need not be close to the ML estimate, and Markov chains may converge slowly (Natarajan and McCulloch 1998). This is currently an active area of research, not just as a way of approximating ML results but also as an approach preferred over ML by those who adopt the Bayesian paradigm. See, for instance, Daniels and Gatsonis (1999) for multilevel modeling of geographic and temporal trends with clustered longitudinal binary data, which built on earlier hierarchical modeling by Wong and Mason (1985).
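A minimal illustration of the MCMC idea: random-walk Metropolis on (β, log σ) with a flat prior, so the posterior is proportional to the marginal likelihood (evaluated here by quadrature) and the posterior summaries approximate ML results. The flat prior is used purely for illustration; the caution above about diffuse and improper priors applies, and all data and tuning constants are invented.

```python
import numpy as np

rng = np.random.default_rng(4)

# Small simulated data set: logit P(y_it = 1 | u_i) = beta + u_i, u_i ~ N(0, sigma^2)
n, T = 100, 5
u_i = rng.normal(0, 1.0, n)
y = (rng.random((n, T)) < 1/(1 + np.exp(-(0.5 + u_i[:, None])))).astype(int)
s = y.sum(axis=1)

nodes, wts = np.polynomial.hermite.hermgauss(30)

def loglik(beta, sigma):
    uq = np.sqrt(2)*sigma*nodes
    p = 1/(1 + np.exp(-(beta + uq)))
    lik = (p[None, :]**s[:, None]) * ((1 - p)[None, :]**(T - s)[:, None])
    return float(np.log((lik*wts).sum(axis=1)/np.sqrt(np.pi)).sum())

# Random-walk Metropolis on (beta, log sigma); with the flat prior the
# log posterior equals the log marginal likelihood up to a constant
theta = np.array([0.0, 0.0])
cur = loglik(theta[0], np.exp(theta[1]))
draws, accept = [], 0
for _ in range(3000):
    prop = theta + rng.normal(0, 0.15, 2)
    new = loglik(prop[0], np.exp(prop[1]))
    if np.log(rng.random()) < new - cur:
        theta, cur = prop, new
        accept += 1
    draws.append(theta.copy())
draws = np.array(draws[1000:])               # discard burn-in
beta_post = draws[:, 0].mean()
sigma_post = np.exp(draws[:, 1]).mean()
```

In a genuinely Bayesian fit, one would also sample the random effects {u_i} themselves rather than integrating them out; the integrated version here is only meant to show the posterior-approximates-likelihood connection.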
12.6.6 Inference for Model Parameters
After fitting the model, inference about fixed effects proceeds in the usual way. For instance, likelihood-ratio tests can compare nested models. Asymptotics for GLMMs apply as the number of clusters increases, rather than as the numbers of observations within the clusters increase. Similarly, resampling methods such as the bootstrap using a large number of clusters should sample clusters rather than individual observations within clusters, to preserve the within-cluster dependence.

Inference about random effects (e.g., their variance components) is more complex. For instance, sometimes one model is a special case of another in which a variance component equals 0. The simpler model then falls on the boundary of the parameter space relative to the more complex model, so ordinary likelihood-based inference does not apply. The asymptotic distribution of the likelihood-ratio statistic is known for the most common situation, testing H₀: σ² = 0 against Hₐ: σ² > 0 for a model containing a single variance component. The null distribution is an equal mixture of χ²₀ (i.e., degenerate at 0) and χ²₁ random variables (Self and Liang 1987). The value of 0 occurs when σ̂ = 0, in which case the maximized likelihoods are identical under H₀ and Hₐ. When σ̂ > 0 and the observed test statistic equals t, the P-value for this large-sample test is (1/2)P(χ²₁ > t), half the P-value that applies for χ²₁ asymptotic tests. For testing more than one variance component, the mixture distribution becomes more complex, and it is simpler to use a score test (Lin 1997).
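The mixture null distribution translates into a one-line adjustment of the usual chi-squared P-value, sketched here with SciPy:

```python
from scipy.stats import chi2

def boundary_pvalue(t):
    """P-value for LR statistic t testing H0: sigma^2 = 0 (single variance component).

    The null distribution is an equal mixture of chi^2_0 (a point mass at 0) and
    chi^2_1, so for observed t > 0 the P-value is half the ordinary chi^2_1 tail
    probability; when sigma-hat = 0 the statistic is 0 and the P-value is 1.
    """
    return 1.0 if t == 0 else 0.5 * chi2.sf(t, df=1)
```

One practical consequence: the 0.05 critical value drops from 3.84 (the χ²₁ value) to about 2.71, so the naive χ²₁ test is conservative at the boundary.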
12.6.7 Prediction Using Random Effects
The use of random effects in a model implies heterogeneity of certain effects of interest, such as odds ratios. Estimated effects of interest are often then linear combinations of fixed and random effects. For example, in the clinical trial comparing two treatments with random effects for centers (Section 12.3.4), one can predict the probability of success for each treatment in each center and odds ratios in those centers.

Given the data, the conditional distribution of (u | y) contains the information about the random effects u. A prediction for u is E(u | y), its posterior mean given the data. Calculation of E(u | y) itself requires numerical integration or Monte Carlo approximation. The expectation depends on β and Σ, so in practice one substitutes β̂ and Σ̂ in the approximation. The standard error of the predictor of the random effect u_i is the standard deviation of the distribution of (u_i | y). When one substitutes β̂ and Σ̂ in E(u | y), however, the standard error does not account for the sampling variability in those estimates. Hence, the true standard error tends to be underestimated (Booth and Hobert 1998).
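A small sketch of computing the posterior mean E(u_i | y_i) by Gauss–Hermite quadrature for a random-intercept logistic model with an exchangeable cluster, so only the success count s matters. The plug-in values β̂ and σ̂ below are hypothetical, standing in for estimates from a fitted model.

```python
import numpy as np

# Hypothetical plug-in estimates for a model in which every observation in a
# cluster shares the linear predictor beta + u_i
beta_hat, sigma_hat = 0.3, 1.0
T = 6

nodes, wts = np.polynomial.hermite.hermgauss(30)
uq = np.sqrt(2)*sigma_hat*nodes

def posterior_mean_u(s):
    """E(u | y) for a cluster with s successes in T trials, by quadrature."""
    p = 1/(1 + np.exp(-(beta_hat + uq)))
    lik = p**s * (1 - p)**(T - s)            # f(y | u) at the quadrature points
    return float(np.sum(wts*uq*lik)/np.sum(wts*lik))

predictions = {s: posterior_mean_u(s) for s in range(T + 1)}
```

The predictions increase with s but stay well inside the range that cluster-by-cluster ML estimates would give, which is the shrinkage phenomenon discussed next.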
This approach to prediction using posterior means of random effects provides effect estimates that exhibit shrinkage relative to estimates using only data in the specific cluster. In this sense the results are similar to those using an empirical Bayes approach (Ten Have and Localio 1999). This adapts an ordinary Bayesian analysis by using the sample data to estimate parameters of the prior distribution. For a vector of mean parameters, this approach yields an estimate of a particular mean that is a weighted average of the sample mean and the overall mean of the sample means. Thus, it shrinks the sample mean toward the overall mean. Shrinkage estimators can be far superior to sample values when the sample size for estimating each parameter is small, when there are many parameters to estimate, or when the true parameter values are roughly equal. The empirical Bayes paradigm has been in use for some time: for instance, for estimating a vector of means or binomial proportions (Efron and Morris 1975).
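The weighted-average form of the shrinkage estimate is easy to sketch for binomial proportions. Here the prior "sample size" m is simply fixed at an illustrative value; a genuine empirical Bayes analysis would estimate it from the data, and all counts below are invented.

```python
import numpy as np

# Hypothetical success/attempt counts for several groups (illustrative values)
successes = np.array([3, 10, 7, 18, 5, 12])
attempts  = np.array([5, 20, 10, 25, 10, 15])
p_hat = successes / attempts                  # raw sample proportions

# Shrink each sample proportion toward the overall mean; m acts as a prior
# "sample size" (fixed here, though empirical Bayes would estimate it)
p_bar = successes.sum() / attempts.sum()
m = 10.0
p_shrunk = (successes + m * p_bar) / (attempts + m)
```

Each shrunken value is a convex combination of the group's own proportion and the overall mean, with more shrinkage for groups with fewer attempts, in line with the conditions listed above under which shrinkage helps most.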
Although random effects models are natural in many applications, further
work is needed. Work continues on the development of methodology for
model-fitting and inference with complex GLMMs. In addition, research is
needed on model checking and diagnostics. Nonetheless, we believe that
GLMMs provide a very useful extension of ordinary GLMs.
NOTES
Section 12.1: Random Effects Modeling of Clustered Categorical Data
12.1. For further discussion of the Rasch model and ways of estimating its parameters, see Andersen (1980, Sec. 6.4) and Fischer and Molenaar (1995). Haberman (1977b) showed ML estimators can achieve consistency when both n and T grow at suitable rates. For multinomial Rasch extensions, see Andersen (1980, pp. 272–284; 1995) and Conaway (1989). Early work on random effects models for a categorical response includes Anderson and Aitkin (1985), Bartholomew (1980), Bock and Aitkin (1981), Chamberlain (1980), Gilmour et al. (1985), Pierce and Sands (1975), and Stiratelli et al. (1984).
12.2. In models with covariates, Neuhaus and Lesperance (1996) noted that conditional ML may lose considerable efficiency compared to the random effects approach when cluster sizes are small and covariates have strong positive within-cluster correlation. As that correlation approaches +1, the covariate effect resembles a between-cluster one, which the conditional ML approach cannot estimate. The matched-pairs case referred to in Section 12.1.2 in which the conditional ML estimate equals the random effects estimate has within-cluster covariate correlation = −1, as depending on the order of viewing the observations, x_t changes from 0 to 1 or from 1 to 0; then, no efficiency loss occurs.
Section 12.3: Examples of Random Effects Models for Binary Data
12.3. For further discussion of modeling capture–recapture data, see Bishop et al. (1975, Chap. 6), Chao et al. (2001), Cormack (1989), Coull and Agresti (1999), Darroch et al. (1993), Fienberg et al. (1999), and Hook and Regal (1995). Similarities exist between this problem and the related problem of estimating the binomial index n when observing independent bin(n, π) counts with unknown n and π; see Aitkin and Stasinopoulos (1989) and references therein. Relatively flat log likelihoods also occur with other models that permit capture heterogeneity (Burnham and Overton 1978), such as a beta-binomial model.
12.4. King (1997) used random effects models as part of a solution for analyzing aggregated categorical data, the problem of ecological inference. Chambers and Steel (2001) discussed early work by Leo Goodman on this problem and proposed a simpler semiparametric approach.
Section 12.4: Random Effects Models for Multinomial Data
12.5. With the complementary log-log link, the likelihood function has closed form with a log gamma random effects distribution (Crouchley 1995, Farewell 1982, Ten Have 1996).
12.6. Chen and Kuo (2001) discussed nominal responses, including discrete choice models (Sec. 7.6) with random effects. See also Brownstone and Train (1999) for discrete choice GLMMs.
Section 12.5: Multivariate Random Effects Models for Binary Data
12.7. Rabe-Hesketh and Skrondal (2001) showed that careful attention must be paid to parameter identification in models with multivariate random effects. Their factor model contains many multivariate random effects models as special cases.

12.8. For longitudinal bivariate binary responses, Ten Have and Morabia (1999) simultaneously modeled bivariate log odds ratios and univariate logits. Multivariate responses sometimes have both continuous and categorical components. For random effects modeling of such data, see Catalano and Ryan (1992) and Gueorguieva and Agresti (2001).
Section 12.6: GLMM Fitting, Inference, and Prediction
12.9. See Fahrmeir and Tutz (2001, Chap. 7) and McCulloch and Searle (2001) for more details on the fitting of GLMMs. Just as the likelihood function for a GLMM is an integral, so do likelihood equations have the form of integral equations (McCulloch and Searle 2001, p. 227). Wolfinger and O'Connell (1993) described a fitting method related to PQL, also motivated by a Laplace approximation.
12.10. A GLMM determines the marginal relationship (averaged over random effects) between the mean response and explanatory variables. Conversely, Heagerty (1999) noted that a marginal model for the mean implicitly determines the form of the fixed portion of the linear predictor in a conditional model. The conditional GLMM (12.1) has linear predictor x′_it β + z′_it u_i. A more general form Δ_it + z′_it u_i implies a particular marginal model. Here, Δ_it is a function of the marginal linear predictor and the random effects distribution. It is implicitly defined by the integral equation that links the marginal and conditional means.
PROBLEMS
Applications
12.1 Refer to the matched-pairs data of Table 10.14 and Problem 10.1.
a. Fit model (12.3). Interpret β̂. If your software uses numerical integration, report β̂, σ̂, and their standard errors for 5, 10, 25, 100, and 200 quadrature points, and comment on convergence.
b. Compare β̂ and its SE for this approach to the conditional ML approach.
12.2 Refer to Table 4.8 on the free-throw shooting of Shaq O'Neal. In game i, suppose that y_i = number made out of n_i attempts is a bin(n_i, π_i) variate and {y_i} are independent.
a. Fit the model logit(π_i) = α. Find and interpret π̂_i. Does the model appear to fit adequately?
b. Fit the model logit(π_i) = α + u_i, where {u_i} are independent N(0, σ²). Use α̂ and σ̂ to summarize O'Neal's free-throw shooting.
c. Explain how the model in part (a) is a special case of that in part (b). Is there evidence that the one in part (b) fits better?
12.3 For Table 8.3, let y_it = 1 when subject i used substance t. Table 12.11 shows output for the logistic-normal model

logit[P(Y_it = 1 | u_i)] = u_i + β_t.

Interpret. Illustrate by comparing use of cigarettes and marijuana.

TABLE 12.11 Output for Problem 12.3

Description            Value
Subjects               2276
Max Obs Per Subject    3
Parameters             4
Quadrature Points      200
Log Likelihood         −3311

Parameter   Estimate   Std Error   t Value
beta1       4.2227     0.1824      23.15
beta2       1.6209     0.1207      13.43
beta3       −0.7751    0.1061      −7.31
sigma       3.5496     0.1627      21.82
12.4 How is the focus different for the model in Problem 12.3 than for the loglinear model (AC, AM, CM) used in Section 8.2.4? If σ̂ = 0, which loglinear model has the same fit as the GLMM?
12.5 For the student survey in Table 9.1, (a) analyze using GLMMs, and (b) compare results and interpretations to those with marginal models in Problem 11.2.
12.6 Fit model (12.10) to the responses on abortion. If your software uses Gauss–Hermite quadrature, report the approximate number of quadrature points needed for parameter estimates to converge and the number needed for standard error estimates to converge. (This example has large σ̂ and requires many points.)
12.7 For the crossover study in Table 11.10 (Problem 11.6), fit the model

logit[P(Y_{i(k)t} = 1 | u_{i(k)})] = α_k + β_t + u_{i(k)},    (12.19)

where {u_{i(k)}} are independent N(0, σ²). Interpret {β̂_t} and σ̂.
12.8 For Problem 12.7, compare estimates of β_B − β_A and β_C − β_A and SE values to those using (a) a marginal model (Problem 11.6), and (b) conditional logistic regression (Section 10.2), treating subject terms in model (12.19) as fixed effects.
12.9 For Problem 12.7, fit the more general GLMM having treatment effects {β_{tk}} that vary by sequence. Test whether the fit is better. One could also consider period or carryover effects. Add two period effects to model (12.19) (e.g., the first-period-effect parameter adds to the model when t = A and k = 1, 2; t = B and k = 3, 4; and t = C and k = 5, 6). Check whether the fit improves. Interpret.
12.10 Consider the logistic-normal model (12.10) for the abortion opinion data, under the constraint σ = 0.
a. Explain why the fit is the same as an ordinary logit model treating the three responses for each subject as if they were independent responses for three separate subjects.
b. Explain why the model fit is the same as an ordinary loglinear model (GI₁, GI₂, GI₃) of mutual independence of responses on the three items (I₁, I₂, I₃), given G = gender.
c. Fit the model. Interpret, and explain why {β̂_t − β̂_u} are quite different from those in Section 12.3.2 allowing σ > 0.
12.11 For Table 6.7 on admissions decisions for graduate school applicants, let y_ig = 1 denote a subject in department i of gender g (1 = females, 0 = males) being admitted.
a. For the fixed effects model logit[P(Y_ig = 1)] = α + βg + β_i^D, β̂ = 0.173 (SE = 0.112). Interpret.
b. The corresponding model (12.12) in which departments are a normal random effect has β̂ = 0.163 (SE = 0.111). Interpret.
c. The model of form (12.12) allowing the gender effect to vary by department has β̂ = 0.176 (SE = 0.132), with σ̂_b = 0.20. Interpret. Explain why the standard error of β̂ is slightly larger than with the other analyses.
d. The marginal sample log odds ratio between gender and whether admitted equals −0.07. How could this take different sign from β̂ in these models?
e. The sample conditional odds ratios between gender and whether admitted vary between 0 and ∞. By contrast, predicted odds ratios for the interaction random effects model do not vary much. Explain why results can be so different.
12.12 For the clinical trial in Table 9.16, let π_it = P(Y_it = 1 | u_i) denote the probability of success for treatment t in center i.
a. The random intercept model (12.11) has β̂ = 1.52 (SE = 0.70) and σ̂ = 1.9. Interpret.
b. From Section 9.8.3, the fixed effects analog of this model (replacing α + u_i by α_i) has α̂₁ = α̂₃ = −∞, corresponding to π̂_{1t} = π̂_{3t} = 0 for each treatment. By contrast, the random effects model has α̂ + û₁ = −3.78 (using NLMIXED in SAS) and π̂₁₁ = 0.047 and π̂₁₂ = 0.011 in center 1. Explain how this model can have π̂_it > 0 in centers having no successes.
12.13 Refer to the subject-specific model in Section 12.3.3. Verify that the estimated difference in time effect slopes between the new and standard drugs for treating depression is (a) 1.018 (SE = 0.192) with the GLMM approach, and (b) 1.156 (SE = 0.222) with conditional ML.
12.14 For marginal model (10.14) for Table 10.5 on premarital and extramarital sex, Table 12.12 shows results of fitting a corresponding random intercept model. Interpret β̂. Compare estimates of β and inferences about β to those in Section 10.3.2 for the marginal model.

TABLE 12.12 Output for Problem 12.14

Description            Value
Subjects               475
Max Obs Per Subject    2
Parameters             5
Quadrature Points      100
Log Likelihood         −890.1

Parameter   Estimate   Std Error   t Value
inter1      −1.5422    0.1826      −8.45
inter2      −0.6682    0.1578      −4.24
inter3      0.9273     0.1673      5.54
beta        4.1342     0.3296      12.54
sigma       2.0757     0.2487      8.35
12.15 A data set from the 1994 General Social Survey contains subjects' opinions on four items (the environment, health, law enforcement, education): whether they believed government spending on each item should increase, stay the same, or decrease. Subjects were also classified by their gender and race. For subject i, let G_i = 1 for females and 0 for males, let R_{1i} = 1 for whites and 0 otherwise, R_{2i} = 1 for blacks and 0 otherwise, and R_{1i} = R_{2i} = 0 for the other category of race. Let y_it denote the response for subject i on spending item t, where outcomes (1, 2, 3) represent (increase, stay the same, decrease).
a. With constraint β₄ = 0, the random-intercept model

logit[P(Y_it ≤ j | u_i)] = α_j + β_t + β_g G_i + β_{r1} R_{1i} + β_{r2} R_{2i} + u_i,    j = 1, 2,

has β̂₁ = −0.55, β̂₂ = −0.60, β̂₃ = −0.49, with σ̂ = 1.03. These estimates are greater than five standard errors in absolute value. Interpret.
b. Table 12.13 shows results with a race-by-item interaction. Interpret.
TABLE 12.13 Results for Problem 12.15 a

Variable        Estimate   SE
Intercept-1     1.065      0.391
Intercept-2     1.919      0.051
Gender          0.409      0.088
Race1-w         −0.055     0.397
Race2-b         0.434      0.452
Item1-envir     −0.357     0.539
Item2-health    −0.319     0.493
Item3-crime     −0.585     0.480
Race1*Item1     −0.170     0.549
Race1*Item2     −0.387     0.503
Race1*Item3     0.197      0.491
Race2*Item1     −0.452     0.606
Race2*Item2     0.454      0.598
Race2*Item3     −0.518     0.560

a Coding 0 for item 4 (education) and race 3 (other).
12.16 Refer to Problem 11.12 for Table 8.19 on government spending. Analyze these data using a cumulative logit model with random effects. Interpret. Compare results to those with a marginal model (Problem 11.12).
12.17 For the insomnia example in Section 12.4.2, according to SAS the maximized log likelihood equals −593.0, compared to −621.0 for the simpler model forcing σ = 0. Compare models, using either a likelihood-ratio test or AIC. What do you conclude?
TABLE 12.14 Results for Problem 12.18

Observer Effect   GEE              Random Effects
A                 −0.451 (0.108)   −1.201 (0.300)
B                 −0.391 (0.093)   −0.919 (0.299)
C                 0.319 (0.118)    0.558 (0.301)
D                 0.632 (0.105)    1.545 (0.313)
E                 −0.491 (0.098)   −1.379 (0.300)
F                 1.252 (0.161)    2.907 (0.344)
12.18 Landis and Koch (1977) showed ratings by seven pathologists who separately classified 118 slides regarding the presence and extent of carcinoma of the uterine cervix, using a five-point ordinal scale. (Table 13.1 is a collapsing of their table that combines the first two categories and the last three categories.) For slide i with rater t, Table 12.14 shows results of fitting the model

logit[P(Y_it ≤ j | u_i)] = u_i + α_j + β_t

to the ordinal table (with β̂_G = 0), assuming that the {u_i} are independent N(0, σ²). It also shows GEE estimates, using independence working equations, for the corresponding marginal model. Interpret β̂_F for each model. Explain why estimates using the random effects model, for which σ̂ = 3.8, tend to be much larger in absolute value. Discuss the differences in assumptions and interpretations for the two models.
12.19 Refer to Section 12.5.1 on boys' attitudes toward the leading crowd. Table 12.15 shows results for a sample of schoolgirls. Fit model (12.16) and interpret. Summarize the estimated variability and correlation of random effects.

TABLE 12.15 Data for Problem 12.19

(M, A) for          (M, A) for Second Interview a
First Interview     (Yes, Positive)   (Yes, Negative)   (No, Positive)   (No, Negative)
Yes, positive       484               93                107              32
Yes, negative       112               110               30               46
No, positive        129               40                768              321
No, negative        74                75                303              536

a M, membership; A, attitude.
Source: J. S. Coleman, Introduction to Mathematical Sociology (London: Free Press of Glencoe, 1964), p. 168.
PROBLEMS
12.20 Generalize model (12.16) to apply simultaneously to Tables 12.8 and 12.15, using a gender main effect but the same membership effect and the same attitude effect for each gender. Fit the model. Use the maximized log likelihood to compare with a more general model having different membership effects and different attitude effects for each gender. Interpret.
12.21 Table 12.16 reports results from a study to estimate the number N of people infected during a 1995 hepatitis A outbreak in Taiwan. The 271 observed cases were reported from records based on a serum test taken by the Institute of Preventive Medicine of Taiwan (P), records reported by the National Quarantine Service (Q), and records based on questionnaires administered by epidemiologists (E). Estimating N is difficult, because many subjects had only one capture.
a. Find N̂ if you observed only (i) P and Q, (ii) P and E, (iii) Q and E.
b. Find N̂ using the model of mutual independence with P, Q, and E.
c. Find a 95% profile likelihood interval for N using the model in part (b).
d. The random effects model of Section 12.3.6 has fit shown in Table 12.16, for which σ̂ = 2.9. The log-likelihood is relatively flat, and N̂ = 4551 with a 95% profile likelihood interval of (758, ∞) (Coull and Agresti 1999). Explain why this model may provide imprecise estimates of N. Since the interval in part (c) is much narrower, is it necessarily more reliable?
TABLE 12.16 Data for Problem 12.21

PQE    Observed Count    Logistic-Normal ML Fit
000         —                (487, ∞)
001         63                61.0
010         55                58.0
011         18                17.0
100         69                68.0
101         17                20.0
110         21                19.0
111         28                28.0

Source: Data from Chao et al. (2001).
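For the two-source estimates in part (a), one standard choice (not named in the problem) is the Lincoln–Petersen estimator N̂ = n₁n₂/m, where n₁ and n₂ are the two lists' totals and m their overlap. A sketch using the Table 12.16 counts; the function names are mine:

```python
# Observed counts from Table 12.16, keyed by capture pattern (P, Q, E)
counts = {(0, 0, 1): 63, (0, 1, 0): 55, (0, 1, 1): 18,
          (1, 0, 0): 69, (1, 0, 1): 17, (1, 1, 0): 21, (1, 1, 1): 28}

def total(source):
    # number of cases appearing on the given list (0 = P, 1 = Q, 2 = E)
    return sum(n for pat, n in counts.items() if pat[source] == 1)

def overlap(s1, s2):
    # number of cases appearing on both lists
    return sum(n for pat, n in counts.items() if pat[s1] == 1 and pat[s2] == 1)

def lincoln_petersen(s1, s2):
    # N-hat = n1 * n2 / m for a two-source capture-recapture estimate
    return total(s1) * total(s2) / overlap(s1, s2)

n_pq = lincoln_petersen(0, 1)   # using P and Q only
n_pe = lincoln_petersen(0, 2)   # using P and E only
n_qe = lincoln_petersen(1, 2)   # using Q and E only
```

All three pairwise estimates fall in roughly the 330 to 380 range, far below the logistic-normal N̂ = 4551, one way to see how strongly the conclusions depend on the heterogeneity assumption.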
12.22 Analyze the crossover data of Table 11.1 using a random effects
approach. Interpret, and compare results to those in Section 11.1.2.
12.23 The analyses in Section 12.3.2 comparing opinions on some topic extend to ordinal responses. Using an ordinal random effects model, analyze the 4 × 3 table in Agresti (1993), found also at the book's Web site, www.stat.ufl.edu/~aa/cda/cda.html.
12.24 The analyses in Section 12.3.4 describing heterogeneity in multicenter clinical trials extend to ordinal responses. Using random effects models, analyze the 2 × 3 × 8 table in Hartzel et al. (2001a).
12.25 You are a statistical consultant asked to analyze Table 4 in B. Efron, Statistical Science 13: 95–122 (1998), which shows 2 × 2 tables from a clinical trial in 41 cities. Analyze, and write a report summarizing your analysis.
12.26 Analyze Table 11.9 with age and maternal smoking as predictors using a (a) logistic-normal model, (b) marginal model, and (c) transitional model. Explain how the interpretation of the maternal smoking effect differs for the three approaches.
Theory and Methods
12.27 Refer to Section 12.3.1. Using supplementary information improves predictions. Let q_i denote the true proportion of votes for Clinton in state i in the 1992 election, conditional on voting for him or Bush. Consider the model

logit P(Y_it = 1 | u_i) = logit(q_i) + α + u_i,

where {q_i} are known and {u_i} are independent N(0, σ²). When σ̂ = 0, show that π̂_i = q_i exp(α̂)/[1 − q_i + q_i exp(α̂)]. Compared to {q_i}, explain how π̂_i then shifts up or down depending on how the overall Democratic vote in the current poll compares to the previous election (i.e., depending on α̂). When also α̂ = 0, show that π̂_i = q_i.
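The σ̂ = 0 formula is easy to verify numerically: on the logit scale it simply adds α̂ to logit(q_i). A small sketch (function name is mine):

```python
from math import exp, log

def shifted_estimate(q_i, alpha_hat):
    """pi-hat_i = q_i exp(alpha-hat) / [1 - q_i + q_i exp(alpha-hat)],
    the sigma-hat = 0 special case of the model above. On the logit
    scale this is logit(q_i) + alpha-hat."""
    num = q_i * exp(alpha_hat)
    return num / (1.0 - q_i + num)

pi = shifted_estimate(0.3, 0.7)
# difference of logits recovers alpha-hat exactly
check = log(pi / (1.0 - pi)) - log(0.3 / 0.7)
```

A positive α̂ (Democratic vote up relative to 1992) shifts every π̂_i above q_i; a negative α̂ shifts every one below.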
12.28 For a binary response, consider the random effects model

logit P(Y_it = 1 | u_i) = α + β_t + u_i,    t = 1, ..., T,

where {u_i} are independent N(0, σ²), and the marginal model

logit P(Y_t = 1) = α + β_t*,    t = 1, ..., T.

For identifiability, β_T = β_T* = 0. Explain why all β_t = 0 implies that all β_t* = 0. Is the converse true?
12.29 The GLMM for binary data using the probit link function is

Φ⁻¹[P(Y_it = 1 | u_i)] = x′_it β + z′_it u_i,

where Φ is the N(0, 1) cdf and u_i has N(0, Σ) pdf, f(u_i; Σ).
a. Show that the marginal mean is

P(Y_it = 1) = ∫ P(Z − z′_it u_i ≤ x′_it β) f(u_i; Σ) du_i,

where Z is a standard normal variate that is independent of u_i.
b. Since Z − z′_it u_i has a N(0, 1 + z′_it Σ z_it) distribution, deduce that

P(Y_it = 1) = Φ[x′_it β (1 + z′_it Σ z_it)^(−1/2)].

Hence, the marginal model is a probit model with attenuated effect. In the univariate random intercept case, show that the marginal effect equals that from the GLMM divided by √(1 + σ²).
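The attenuation result in part (b) can be checked numerically for the univariate random intercept case by integrating Φ(x′β + u) against the N(0, σ²) density and comparing with the closed form (a sketch using the standard library only; all names are mine):

```python
from math import erf, exp, pi, sqrt

def Phi(z):
    # standard normal cdf via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def phi(z):
    # standard normal density
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def marginal_prob_numeric(lin_pred, sigma, grid=4000, lim=10.0):
    """Trapezoidal approximation of the marginal probability
    integral Phi(lin_pred + u) f(u; sigma^2) du, u ~ N(0, sigma^2)."""
    a, b = -lim * sigma, lim * sigma
    h = (b - a) / grid
    total = 0.0
    for k in range(grid + 1):
        u = a + k * h
        w = 0.5 if k in (0, grid) else 1.0
        total += w * Phi(lin_pred + u) * phi(u / sigma) / sigma
    return total * h

def marginal_prob_closed(lin_pred, sigma):
    # attenuation result of part (b) for a univariate random intercept
    return Phi(lin_pred / sqrt(1.0 + sigma * sigma))

num = marginal_prob_numeric(0.8, 1.5)
closed = marginal_prob_closed(0.8, 1.5)
```

The two values agree to several decimal places, confirming that marginalizing a probit GLMM shrinks the linear predictor by the factor (1 + σ²)^(−1/2).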
12.30 In the Rasch model, logit[P(Y_it = 1)] = α_i + β_t, where α_i is a fixed effect.
a. Assuming independence of responses for different subjects and for different observations on the same subject, show that the log likelihood is

Σ_i Σ_t α_i y_it + Σ_i Σ_t β_t y_it − Σ_i Σ_t log[1 + exp(α_i + β_t)].

b. Show that the likelihood equations are y_+t = Σ_i P(Y_it = 1) and y_i+ = Σ_t P(Y_it = 1) for all i and t. Explain why conditioning on {y_i+} yields a distribution that does not depend on {α_i}.
c. Discuss advantages and disadvantages of, instead, treating α_i as random.
12.31 Consider the matched-pairs random effects model (12.3). For given β₀, let c₀ be such that μ̂₁₂ = n₁₂ + c₀ and μ̂₂₁ = n₂₁ − c₀ satisfy log(μ̂₂₁/μ̂₁₂) = β₀. Suppose that {μ̂_ij} has nonnegative log odds ratio. Explain why:
a. This is the fit of the model assuming β = β₀.
b. The likelihood-ratio statistic for testing H₀: β = β₀ in this model equals

2[n₁₂ log(n₁₂/(n₁₂ + c₀)) + n₂₁ log(n₂₁/(n₂₁ − c₀))].

c. The likelihood-ratio test of H₀: β = 0 is the test of symmetry.
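The statistic in part (b) is simple to compute once c₀ is solved from log((n₂₁ − c₀)/(n₁₂ + c₀)) = β₀, which gives c₀ = (n₂₁ − e^{β₀} n₁₂)/(1 + e^{β₀}). A sketch (names are mine); for β₀ = 0 this reduces to the likelihood-ratio test of symmetry of part (c):

```python
from math import exp, log

def lr_statistic(n12, n21, beta0=0.0):
    """Likelihood-ratio statistic for H0: beta = beta0 in the
    matched-pairs model, following the fit described in the problem.
    Assumes n12 > 0 and n21 > 0."""
    c0 = (n21 - exp(beta0) * n12) / (1.0 + exp(beta0))
    fit12, fit21 = n12 + c0, n21 - c0   # fitted off-diagonal counts
    return 2.0 * (n12 * log(n12 / fit12) + n21 * log(n21 / fit21))
```

With β₀ = 0, c₀ = (n₂₁ − n₁₂)/2 and both fitted counts equal (n₁₂ + n₂₁)/2, the familiar symmetry fit.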
536
RANDOM EFFECTS: GENERALIZED LINEAR MIXED MODELS
12.32 Explain why the logistic-normal model is not helpful for capture–recapture experiments with only two captures.
12.33 Refer to the crossover study in Problem 12.7. Kenward and Jones (1991) reported results using the ordinal response scale (none, moderate, complete) for relief. Explain how to formulate an ordinal logit random effects model for these data analogous to model (12.19).
12.34 Formulate a model using adjacent-categories logits that is analogous to model (12.14) for cumulative logits. Interpret the parameters.
12.35 For ordinal square I × I tables of counts {n_ab}, model (12.3) for binary matched-pairs responses (Y_i1, Y_i2) for subject i extends to

logit P(Y_it ≤ j | u_i) = α_j + βx_t + u_i

with {u_i} independent N(0, σ²) variates and x₁ = 0 and x₂ = 1.
a. Explain how to interpret β, and compare to the interpretation of β in the corresponding marginal model (10.14).
b. This model implies model (12.3) for each 2 × 2 collapsing that combines categories 1 through j for one outcome and categories j + 1 through I for the other. Use the form of the conditional ML (or random effects ML) estimator for binary matched pairs to explain why

log[(Σ_{a>j} Σ_{b≤j} n_ab) / (Σ_{a≤j} Σ_{b>j} n_ab)]

is a consistent estimator of β.
c. Treat these (I − 1) collapsed 2 × 2 tables naively as if they were independent samples. Show that adding the numerators and adding the denominators of the separate estimates of e^β motivates the summary estimator of β,

β̃ = log[Σ_{a>b} (a − b) n_ab / Σ_{b>a} (b − a) n_ab].

Explain why β̃ is consistent for β even recognizing the actual dependence.
d. A standard error for β̃ that treats the collapsed tables in part (c) as independent is inappropriate. Treating {n_ab} as a multinomial sample, show that an estimated asymptotic variance of β̃ is (Agresti and Lang 1993a)

Σ_{b>a} (b − a)² n_ab / [Σ_{b>a} (b − a) n_ab]² + Σ_{a>b} (a − b)² n_ab / [Σ_{a>b} (a − b) n_ab]².
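The summary estimator of part (c) and the variance of part (d) can be computed directly from a square table. A sketch (function name is mine; the table is 0-indexed, which does not change the differences a − b):

```python
from math import log, sqrt

def beta_tilde_with_se(n):
    """Summary estimator beta-tilde and its multinomial-based asymptotic
    standard error, for a square table n[a][b] of counts."""
    I = len(n)
    cells = [(a, b) for a in range(I) for b in range(I)]
    above = sum((a - b) * n[a][b] for a, b in cells if a > b)  # numerator
    below = sum((b - a) * n[a][b] for a, b in cells if b > a)  # denominator
    est = log(above / below)
    var = (sum((b - a) ** 2 * n[a][b] for a, b in cells if b > a) / below ** 2
           + sum((a - b) ** 2 * n[a][b] for a, b in cells if a > b) / above ** 2)
    return est, sqrt(var)

# Hypothetical 3x3 illustration
est, se = beta_tilde_with_se([[10, 4, 2], [8, 10, 5], [6, 9, 10]])
```

For a symmetric table the numerator and denominator weights coincide, so β̃ = 0, as the symmetry interpretation of part (c) requires.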
12.36 Summarize advantages and disadvantages of using a GLMM approach compared to a marginal model approach. Describe conditions under which parameter estimators are consistent for (a) marginal models using GEE, (b) marginal models using ML, (c) GLMMs using PQL, and (d) GLMMs using ML.
CHAPTER 13

Other Mixture Models for Categorical Data*
In Chapters 10 through 12 we introduced ways of handling correlated observations due to repeated measurement and other forms of clustering. The generalized linear mixed models (GLMMs) of Chapter 12 assume normal random effects. They describe heterogeneity by replacing the linear predictor by a normally distributed mixture of linear predictors. In this chapter we present additional models having connections with GLMMs. Except for one case, these models use nonnormal mixture distributions.

In Section 13.1 we present latent class models. These treat a contingency table as a mixture of unobserved tables at categories of a qualitative latent (unobserved) variable. In Section 13.2 we discuss a related nonparametric approach to fitting GLMMs that uses an unspecified discrete quantitative distribution for the random effects.

In Section 13.3 we model clustered binomial responses using the beta distribution to describe heterogeneity of binomial parameters. The resulting beta-binomial distribution has a variance function for which quasi-likelihood methods are also available. In Section 13.4 we model count responses using the gamma distribution to describe heterogeneity of Poisson parameters. The resulting negative binomial regression model corresponds to a Poisson GLMM having a log-gamma distributed random effect. It is an alternative to the GLMM for Poisson responses with normal random effects, a model discussed in Section 13.5.
13.1 LATENT CLASS MODELS
GLMMs create a mixture of linear predictor values using a latent variable,
the unobserved random effect vector, having a normal distribution. By
contrast, latent class models use a mixture distribution that is qualitative
rather than quantitative. The basic model assumes existence of a latent categorical variable such that the observed response variables are conditionally independent, given that variable.

FIGURE 13.1 Association graph for latent class model.
For categorical response variables (Y₁, Y₂, ..., Y_T), the latent class model assumes a latent categorical variable Z such that, for each possible sequence of response outcomes (y₁, ..., y_T) and each category z of Z,

P(Y₁ = y₁, ..., Y_T = y_T | Z = z) = P(Y₁ = y₁ | Z = z) ··· P(Y_T = y_T | Z = z).

Figure 13.1 shows the association graph for the model. A latent class model summarizes probabilities of classification P(Z = z) in the latent classes as well as conditional probabilities P(Y_t = y_t | Z = z) of outcomes for each Y_t within each latent class. These are the model parameters. More generally, the latent variable Z can be multivariate. The model is an analog, for categorical responses and latent variables, of the factor analysis model for multivariate normal responses.
The latent class model is sometimes plausible when the observed variables
are several indicators of some concept, such as prejudice, religiosity, or
opinion about an issue. An example is Table 10.13, in which subjects gave
their opinions about whether abortion should be legal in various situations.
Perhaps an underlying latent variable describes one’s basic attitude toward
legalized abortion, such that given the value of that latent variable, responses
on the observed variables are conditionally independent. For instance, the
latent variable may be a qualitative variable with three categories: One class
for those who always oppose legalized abortion regardless of the situation,
one for those who always favor it, and one for those whose response depends
on the situation.
The T-dimensional contingency table cross-classifying (Y₁, ..., Y_T) is observed. The (T + 1)-dimensional table that cross-classifies it with the latent variable is unobserved. Denote the number of categories of each Y_t by I and the number of latent classes of Z by q. For the observed table, let π_{y₁,...,y_T} = P(Y₁ = y₁, ..., Y_T = y_T). The model assumes a multinomial distribution over its I^T cells. For a given cell,

π_{y₁,...,y_T} = Σ_{z=1}^q P(Y₁ = y₁, ..., Y_T = y_T | Z = z) P(Z = z).
The conditional independence factorization for the latent class model states that

π_{y₁,...,y_T} = Σ_{z=1}^q [Π_{t=1}^T P(Y_t = y_t | Z = z)] P(Z = z).    (13.1)

This is a nonlinear model for the I^T multinomial probabilities.
13.1.1 Fitting Latent Class Models
Denote the counts in the observed table by {n_{y₁,...,y_T}}. Summing over the I^T cells in that table, the kernel of the multinomial log likelihood is

Σ n_{y₁,...,y_T} log π_{y₁,...,y_T}.    (13.2)

Substituting parameters from (13.1), one can maximize (13.2) with respect to those parameters using Newton–Raphson (Haberman 1979, Chap. 10) or the EM algorithm (Goodman 1974). It is helpful to note that the latent class model states that the loglinear model symbolized by (Y₁Z, Y₂Z, ..., Y_T Z) holds for the unobserved table. The model makes no assumption about the {Y_t Z} associations but assumes that the {Y_t} are mutually independent within each category of Z.
The EM algorithm has two steps in each iteration. The E (expectation) step in iteration s calculates pseudo-counts {n^{(s)}_{y₁,...,y_T,z}} for the unobserved table using {n_{y₁,...,y_T}} and a working conditional distribution for (Z | Y₁, ..., Y_T) described shortly. The M (maximization) step treats {n^{(s)}_{y₁,...,y_T,z}} as data and applies an algorithm such as iterative reweighted least squares or IPF for fitting the model (i.e., the loglinear model (Y₁Z, Y₂Z, ..., Y_T Z)). The fit {μ̂^{(s)}_{y₁,...,y_T,z}} of that model in the unobserved table then determines the new working conditional distribution of (Z | Y₁, ..., Y_T) to apply to {n_{y₁,...,y_T}} for the E-step of the next iteration. This allocates the observed data to pseudo-counts in the unobserved cells in proportion to this fit, using

n^{(s+1)}_{y₁,...,y_T,z} = n_{y₁,...,y_T} μ̂^{(s)}_{y₁,...,y_T,z} / Σ_{k=1}^q μ̂^{(s)}_{y₁,...,y_T,k}.

These are entries in the unobserved table for iteration (s + 1). They are used as pseudo-data for the M-step of iteration (s + 1).
Eventually, the algorithm converges to fitted values for the unobserved table that provide fitted probabilities satisfying mutual independence within each latent class, and such that the corresponding fitted probabilities in the observed table (i.e., added over the latent categories) maximize the likelihood (13.2). The fitted probabilities in the unobserved table are an estimated joint distribution for (Y₁, ..., Y_T, Z). One can use them to calculate the ML estimates of the latent class model parameters {P(Y_t = y_t | Z = z)} and {P(Z = z)}.
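The E- and M-steps above can be sketched directly for binary items. This is a minimal illustration, not the loglinear/IPF implementation described in the text: it parameterizes the model by P(Z = z) and P(Y_t = 1 | Z = z) and updates them from the posterior allocations. The toy data and all names are hypothetical:

```python
import random
from math import log, prod

def pattern_prob(y, pz, p):
    # pi_y of (13.1): sum over classes of P(Z=z) * prod_t P(Y_t = y_t | Z=z)
    return sum(pz[z] * prod(p[z][t] if y[t] else 1.0 - p[z][t]
                            for t in range(len(y)))
               for z in range(len(pz)))

def fit_latent_class(counts, T, q, iters=500, seed=1):
    """EM for a latent class model with T binary items and q classes.
    counts maps a response pattern (tuple of 0/1) to its observed count."""
    rng = random.Random(seed)
    pz = [1.0 / q] * q                                  # P(Z = z)
    p = [[0.2 + 0.6 * rng.random() for _ in range(T)]   # P(Y_t = 1 | Z = z)
         for _ in range(q)]
    n = sum(counts.values())
    for _ in range(iters):
        # E-step: posterior allocation P(Z = z | y) for each observed pattern
        post = {}
        for y in counts:
            joint = [pz[z] * prod(p[z][t] if y[t] else 1.0 - p[z][t]
                                  for t in range(T)) for z in range(q)]
            s = sum(joint)
            post[y] = [j / s for j in joint]
        # M-step: re-estimate class and item probabilities from pseudo-counts
        for z in range(q):
            nz = sum(c * post[y][z] for y, c in counts.items())
            pz[z] = nz / n
            p[z] = [sum(c * post[y][z] * y[t] for y, c in counts.items()) / nz
                    for t in range(T)]
    ll = sum(c * log(pattern_prob(y, pz, p)) for y, c in counts.items())
    return pz, p, ll

# Toy 2x2 example: strong agreement between two binary items
toy = {(0, 0): 40, (0, 1): 10, (1, 0): 10, (1, 1): 40}
pz, p, ll = fit_latent_class(toy, T=2, q=2)
```

For these toy counts a two-class model essentially saturates the table, so the final log likelihood should be far above the mutual independence value of 100 log(0.25) ≈ −138.6.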
The EM algorithm is computationally simple and relatively stable. Each iteration increases the likelihood. However, its convergence can be slow. See Laird (1998) for a review. The log likelihood for a latent class model may have local maxima. Thus, with either the Newton–Raphson or the EM algorithm, it is advisable to perform the fitting process a few times with different starting guesses for the parameter values. The EM algorithm tends to be less sensitive to the choice of starting values. Thus, some software begins with the EM algorithm and then switches to the Newton–Raphson algorithm as it approaches the ML estimates, to speed the process. As q increases, multiple local maxima are more likely and the danger of nonidentifiability increases.
Standard errors for model parameter estimates result from inverting the model's estimated information matrix. This is a by-product of the Newton–Raphson algorithm but not of the EM algorithm. One way to obtain standard errors with EM applies a useful formula of Louis (1982) for the observed information. It equals the expected value of the observed information for the loglinear model for the unobserved table minus the expected value of the information for the conditional distribution of Z given the observed data. Baker (1992) and Lang (1992) gave related results.
Chi-squared statistics comparing observed cell counts to fitted values test the model fit. The residual df = I^T − qT(I − 1) − q. This follows since multinomial model (13.1) describes I^T − 1 multinomial probabilities using (I − 1) parameters {P(Y_t = y_t | Z = z), y_t = 1, ..., I − 1} at each of the qT combinations of z and t, and q − 1 parameters {P(Z = z)}. Often, the nature of the variables suggests a value for q, usually quite small (2 to 4). Otherwise, the usual procedure starts with q = 2; if the fit is inadequate, q increases by steps of 1 as long as the fit shows substantive improvement. Specialized software exists for such models (Appendix A).
13.1.2 Latent Class Model for Rater Agreement
Table 13.1 is an expanded data set of the example in Section 10.5. Seven pathologists classified each of 118 slides on the presence or absence of carcinoma in the uterine cervix. For modeling interobserver agreement, the conditional independence assumption of the latent class model is often plausible. With a blind rating scheme, ratings of a given subject or unit by different pathologists are independent. If subjects having true rating in a given category are relatively homogeneous, then ratings by different pathologists may be nearly independent within a given true rating class. Thus, one might posit a latent class model with q = 2 classes, one for subjects whose true rating is positive and one for subjects whose true rating is negative. This
542
OTHER MIXTURE MODELS FOR CATEGORICAL DATA
TABLE 13.1 Diagnoses of Carcinoma and Fits of Latent Class Modelsᵃ

            Pathologist                              Fit
A   B   C   D   E   F   G     Count     q = 1    q = 2    q = 3
0   0   0   0   0   0   0       34       1.1     23.0     33.8
0   0   0   0   1   0   0        2       1.6      6.6      2.0
0   1   0   0   0   0   0        6       2.2     12.7      6.3
0   1   0   0   0   0   1        1       2.8      1.7      1.5
0   1   0   0   1   0   0        4       3.3      3.6      3.0
0   1   0   0   1   0   1        5       4.2      0.5      4.7
1   0   0   0   0   0   0        2       1.4      3.0      2.1
1   0   1   0   1   0   1        1       1.6      0.2      0.2
1   1   0   0   0   0   0        2       2.8      1.7      1.3
1   1   0   0   0   0   1        1       3.5      0.3      1.6
1   1   0   0   1   0   0        2       4.2      0.5      2.9
1   1   0   0   1   0   1        7       5.3      3.7      6.5
1   1   0   0   1   1   1        1       1.4      2.6      1.4
1   1   0   1   0   0   1        1       1.3      0.1      0.1
1   1   0   1   1   0   1        2       2.0      4.3      2.6
1   1   0   1   1   1   1        3       0.5      3.1      2.0
1   1   1   0   1   0   1       13       3.3     11.5      9.6
1   1   1   0   1   1   1        5       0.9      8.4      8.7
1   1   1   1   1   0   1       10       1.2     13.5     13.6
1   1   1   1   1   1   1       16       0.3      9.9     12.3

ᵃ Fits obtained with Latent Gold (Statistical Innovations, Belmont, MA). 1, yes; 0, no.
Source: Based on data in Landis and Koch (1977), not showing empty cells.
model expresses the 2^7 joint distribution of the seven ratings as a mixture of two 2^7 distributions, one for each true rating class.

Table 13.2 shows results of fitting some latent class models (including a mixture model studied in Section 13.2.4). Because the observed table is sparse, the deviance is mainly useful for comparing models. This is an informal comparison, though, since the chi-squared distribution does not apply for comparing deviances of models with different numbers of latent classes. A model with q classes is a special case of a model with q* > q classes in which P(Z = z) = 0 for z > q and hence falls on the boundary of the parameter space. Ordinary chi-squared likelihood-ratio tests require parameters to fall in the interior of the parameter space (i.e., 0 < P(Z = z) < 1 for z = 1, ..., q*).

Table 13.1 also shows the fitted values for latent class models with q = 1, 2, 3, for the cells having positive counts. (Each empty cell also has a fitted value, not shown here.) The model with q = 1 latent class is the model of mutual independence of the seven ratings. Equivalently, it is the loglinear model (Y₁, Y₂, ..., Y₇). It fits poorly, as one would expect. With q = 2, considerable evidence remains of lack of fit. For instance, the fitted count for
TABLE 13.2 Likelihood-Ratio Statistics for Latent Class Models Fitted to Table 13.1ᵃ

Number of
Latent Classes    Model                               Deviance (G²)    df
1                 Mutual independence                     476.8        120
2                 Latent class                             62.4        112
                  Rasch mixture                            67.6        118
3                 Latent class                             15.3        104
                  Rasch mixture                            27.5        116
4                 Latent class                              6.4         96
                  Rasch mixture (quasi-symmetry)           23.7        114

ᵃ Models fitted with Latent Gold (Statistical Innovations, Belmont, MA).
a negative rating by each pathologist is 23.0, compared to an observed count of 34. (The small G² that Table 13.2 reports for this model does not imply a good fit; in Section 9.8.4 we noted that G² tends to be highly conservative when most fitted values are very close to 0.) The model with q = 3 seems to fit adequately.

Studying the estimated probability P(Y_t = 1 | Z = z) of a carcinoma diagnosis for each pathologist, conditional on a given latent class z, helps illuminate the nature of these classes. Table 13.3 reports these for the three-class model. They suggest that (1) the first latent class refers to cases that all pathologists (except occasionally B) agree show no carcinoma; (2) the third latent class refers to cases that A, B, E, and G agree show carcinoma and C and D usually agree; and (3) the second latent class refers to cases of strong disagreement, whereby C, D, and F rarely diagnose carcinoma but B, E, and G usually do. The estimated proportions in the three latent classes are P̂(Z = 1) = 0.37, P̂(Z = 2) = 0.18, and P̂(Z = 3) = 0.45. The model estimates that 18% of the cases fall in the problematic class.
TABLE 13.3 Estimated Probabilities of Diagnosing Carcinoma, for Latent Class Model and Rasch Mixture Model with Three Classesᵃ

                  Latent               Pathologist
Model             Class    A      B      C      D      E      F      G
Latent class      1        0.057  0.138  0.000  0.000  0.055  0.000  0.000
                  2        0.513  1.00   0.000  0.058  0.751  0.000  0.631
                  3        1.000  0.981  0.858  0.586  1.000  0.476  1.000
Rasch mixture     1        0.022  0.150  0.001  0.000  0.047  0.000  0.022
                  2        0.611  0.923  0.052  0.015  0.774  0.009  0.611
                  3        0.994  0.999  0.853  0.617  0.997  0.483  0.994

ᵃ Results obtained with Latent Gold (Statistical Innovations, Belmont, MA).
A danger with latent variable models, shared by factor analysis for continuous responses, is the temptation to interpret latent variables too literally. In this example it is tempting to treat latent class 1 (latent class 3) as cases truly without carcinoma (with carcinoma). Thus, it is tempting to treat a rating of no carcinoma (a rating of carcinoma), given that the subject falls in latent class 1 (latent class 3), as necessarily being a correct judgment. One should realize the tentative nature of the latent variable. Be careful not to make the error of reification: treating an abstract construction as if it has actual existence (Gould 1981).

Using the model parameter estimates and Bayes' theorem, one can also estimate P(Z = z | Y_t = y_t) and P(Z = z | Y₁ = y₁, ..., Y_T = y_T). If a pathologist makes a "yes" rating, for instance, what is the estimated probability that the subject is in the latent class for which agreement on a positive rating usually occurs? We perform further analysis in Section 13.2.5 after studying a simpler model.
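The Bayes computation is direct. The following sketch uses pathologist A's estimated conditional probabilities and the class proportions from the three-class fit above; treating these rounded values as fixed inputs, and the function name itself, are my illustration choices:

```python
# Three-class fit: class proportions and P(Y_A = 1 | Z = z) from Table 13.3
pz = [0.37, 0.18, 0.45]
p_yes_A = [0.057, 0.513, 1.000]

def posterior_given_rating(y, pz, p_yes):
    """P(Z = z | Y = y) by Bayes' theorem for a single rater:
    proportional to P(Z = z) * P(Y = y | Z = z)."""
    like = [p if y == 1 else 1.0 - p for p in p_yes]
    joint = [pz[z] * like[z] for z in range(len(pz))]
    s = sum(joint)
    return [j / s for j in joint]

post = posterior_given_rating(1, pz, p_yes_A)
# post[2] estimates P(subject is in the agreed-carcinoma class | A rates yes)
```

For the full conditional P(Z = z | Y₁, ..., Y_T), one would multiply the conditional probabilities across all seven raters before normalizing, using the conditional independence assumption.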
Espeland and Handelman (1989), Uebersax (1993), Uebersax and Grove (1990, 1993), and Yang and Becker (1997) presented various latent variable models for rater agreement and diagnostic accuracy. One could also use methods of Chapters 11 and 12, such as a model with a continuous rather than qualitative latent variable. A logistic-normal random intercept model, for instance, yields subject-specific comparisons of P(Y_t = 1) for various t.
13.1.3 Latent Class Models for Capture–Recapture
We next apply latent class models to capture–recapture modeling for estimating population size. In Section 12.3.6 a logistic-normal GLMM was used for this. With T sampling occasions, a 2^T contingency table displays the data, with scale (captured, not captured) at each occasion. A prediction of the population size equals the prediction for the missing cell count, representing subjects not captured on any occasion, added to the counts in the other cells.

With two classes, the latent class model treats the population as a mixture of two types, perhaps determined by genetic or environmental factors. Homogeneity of capture probabilities occurs for subjects within each type, but the type of any given subject is unknown. This model represents a compromise between the mutual independence model, which assumes a single latent class and complete homogeneity, and the logistic-normal GLMM, which assumes a continuous mixture of capture probabilities rather than two classes.

We illustrate with the T = 6-capture data set on snowshoe hares in Table 12.6. The model of mutual independence predicts N̂ = 75. Its 95% profile-likelihood confidence interval for N is (70, 83). The latent class model with two classes has N̂ = 85 and a profile-likelihood confidence interval of (74, 106). The latent class model with three classes gives similar results. Since the logistic-normal GLMM in Section 12.3.6 gave the interval (75, 154), these intervals seem too short to be trusted. This simple latent class model may not capture all the existing heterogeneity. It is more plausible to assume a continuous latent variable than a discrete one with a couple of classes. We'll analyze these data further with related models in the next section.
13.2 NONPARAMETRIC RANDOM EFFECTS MODELS
In spite of its popularity and attractive features, the normality assumption for random effects in ordinary GLMMs can rarely be closely checked. For instance, in studying normal GLMMs, Verbeke and Lesaffre (1996) noted that under a normality assumption for random effects, the predicted values often appear normally distributed even when the true values are generated from a highly nonnormal distribution. An obvious concern with this or any parametric assumption for the random effects is the possibly harmful effect of misspecification. To check sensitivity to this assumption, one can fit GLMMs using alternative or more general random effects assumptions.
13.2.1 Logit Models with Unspecified Random Effects Distribution
A nonparametric approach (e.g., Aitkin 1999) guards against possibly harmful misspecification effects. It uses an unspecified random effects distribution on a finite set of mass points. The locations of the mass points and their probabilities are parameters. The number of mass points can be fixed. When this number is itself unknown, one treats it as fixed in the estimation process but increases it sequentially until the likelihood is maximized. The maximization usually requires relatively few mass points. Even allowing a continuous mixture distribution, the nonparametric estimate of that distribution takes a finite number of points (e.g., Lindsay et al. 1991). In fact, fitting a model having only two mass points often results in fixed effects estimates quite similar to those with the full maximization. This approach is useful primarily when the random effects distribution is not itself of direct interest, since the nonparametric estimate of that distribution tends to be poor even for very large samples.
Model fitting is actually simpler than for models with normal random effects, since the integral that determines the likelihood function simplifies to a finite sum. In Section 13.2.4 we discuss this point with a Rasch-type model. Specialized software can fit nonparametric mixture models (Appendix A). However, this approach also has disadvantages. For instance, with multivariate random effects it cannot provide a simple correlation structure as the normal can. Standard inference does not apply for comparing models with different numbers of mass points, since one model is on the boundary of the parameter space compared to the other. Also, the ML estimate of the random effects distribution often places some weight at ±∞. Although this can be useful with binary data for identifying a subsample for which the estimated response probability equals 1 or equals 0 for all observations in a cluster, it is not then possible to describe heterogeneity with an estimated variance component.
To illustrate this approach, we reanalyze Table 10.13 on attitudes about legalized abortion. In Section 12.3.2 we fitted the logistic-normal model (12.10),

logit P(Y_it = 1 | u_i) = u_i + β_t + γx,    (13.3)

with x = gender and parameters {β_t} representing three conditions under which abortion might be legal. Treating u_i instead nonparametrically, the likelihood maximizes with a two-point mixture distribution. Estimated abortion item effects are β̂₁ − β̂₃ = 0.83 (SE = 0.16), β̂₂ − β̂₃ = 0.30 (SE = 0.16), and β̂₁ − β̂₂ = 0.52 (SE = 0.16). Results are similar to those that Table 12.3 shows for the normal random effects approach (Section 12.3.2).
13.2.2 Nonparametric Mixing of Logistic Regression
Follmann and Lambert (1989) presented an example with a prespecified number of mass points. They analyzed the effect of the dosage of a poison on the probability of death of a protozoan of a particular genus. Table 13.4 shows the data. They assumed two unobserved types of that genus.

Let π_i(x) denote the probability of death at log dose level x for genus type i, i = 1, 2. Let ρ denote the probability that a protozoan belongs to genus type 1. Their model specifies

π(x) = ρπ₁(x) + (1 − ρ)π₂(x),

where

logit π_i(x) = α_i + βx,

with unknown ρ. The curve for π(x) is a weighted average of two curves having the same shapes but different intercepts.
The ordinary logistic regression model is the special case ρ = 1. Its fit, logit[π̂(x)] = −68.4 + 42.1x (with SE = 3.8 for β̂ = 42.1), is poor, with deviance G² = 24.7 (df = 6). The fit of the mixture model is

π̂(x) = 0.34π̂₁(x) + 0.66π̂₂(x),

with

logit π̂₁(x) = −196.2 + 124.8x,
logit π̂₂(x) = −205.7 + 124.8x,
TABLE 13.4 Number of Protozoa Exposed to Poison Dose and Number That Died

Poison Dose    Exposed    Dead
4.7               55        0
4.8               49        8
4.9               60       18
5.0               55       18
5.1               53       22
5.2               53       37
5.3               51       47
5.4               50       50

Source: Follmann and Lambert (1989). Reprinted with permission from the Journal of the American Statistical Association.
FIGURE 13.2 Fit of binary mixture of logistic regressions to Table 13.4 [model fitted using Latent Gold (Statistical Innovations, Belmont, MA)].
and SE = 25.2 for β̂ = 124.8. Figure 13.2 shows the fit. This is much better, with G² = 3.4 (df = 4); that is, double the maximized log-likelihood increases by 24.7 − 3.4 = 21.3 by adding two parameters: an additional intercept and the probability for the mixture. Follmann and Lambert noted that with eight dose levels, at most two mixture points are identifiable for this model.

The ordinary GLMM assumes a normal mixture of logistic curves. It gives a deviance reduction of only 1.7 compared to the ordinary logistic model with ρ = 1.
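The quoted deviances can be checked from Table 13.4 by plugging the published estimates into the binomial deviance formula. A sketch (names are mine; x = log of the tabled dose, and the values will not match 24.7 and 3.4 exactly because the published coefficients are rounded):

```python
from math import exp, log

# Table 13.4 data: (dose, number exposed, number dead)
data = [(4.7, 55, 0), (4.8, 49, 8), (4.9, 60, 18), (5.0, 55, 18),
        (5.1, 53, 22), (5.2, 53, 37), (5.3, 51, 47), (5.4, 50, 50)]

def logistic(eta):
    return 1.0 / (1.0 + exp(-eta))

def pi_ordinary(x):
    # published rho = 1 (ordinary logistic) fit
    return logistic(-68.4 + 42.1 * x)

def pi_mixture(x):
    # published two-point mixture fit
    return (0.34 * logistic(-196.2 + 124.8 * x)
            + 0.66 * logistic(-205.7 + 124.8 * x))

def deviance(pi_fn):
    """Binomial deviance G^2 = 2 sum[y log(y/mu) + (n-y) log((n-y)/(n-mu))],
    with 0 log 0 taken as 0."""
    g2 = 0.0
    for dose, n, y in data:
        mu = n * pi_fn(log(dose))
        if y > 0:
            g2 += 2.0 * y * log(y / mu)
        if n - y > 0:
            g2 += 2.0 * (n - y) * log((n - y) / (n - mu))
    return g2

g2_ordinary = deviance(pi_ordinary)
g2_mixture = deviance(pi_mixture)
```

The computed values land near the reported G² = 24.7 and G² = 3.4, reproducing the dramatic improvement from the two-point mixture.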
13.2.3 Is Misspecification a Serious Problem?
Is it worth the trouble to consider alternatives to the normality assumption for random effects in GLMMs, whether parametric or nonparametric? Not much work exists investigating misspecification effects. For logistic random intercept models, different assumptions for the random effects distribution often provide similar results for estimating the regression effects. Choosing an incorrect random effects distribution does not tend to bias estimators of those effects. If the true random effects distribution is skewed, however, the normal-assumption estimator of the intercept can have some bias (Neuhaus et al. 1992). The choice of random effects distribution also usually has little impact on the efficiency of estimation.

When the true random effects distribution is dramatically far from normal, there can be some efficiency loss for the logistic-normal estimator. This can
happen when the true distribution is a two-point mixture with large variance
component. B. Caffo and I studied this with various models, such as a simple
one-way random effects model. In cluster i, let yit be a Bernoulli variate
satisfying
logit P Ž Yit s 1 < u i . s ␣ q u i ,
i s 1, . . . , n,
t s 1, . . . , T , Ž 13.4 .
where varŽ u i . s 2 . Simulated samples from this model used various n, T,
␣ , and , and various true distributions for u i including normal, uniform,
exponential, and binary. Usually, assuming normality does not hurt when the
true distribution is nonnormal. Also, using a nonparametric approach when
the true distribution is normal does not result in much efficiency loss
wNeuhaus and Lesperance Ž1996. noted this for a related model.x However,
when the true distribution is a two-point mixture, the normal approach loses
efficiency in estimating i s P Ž Yit s 1 < u i .4 as and T increase. For
example, when n s T s 30, ␣ s 0, and the mixture has probability 0.5 at
each point, the expected value of
ˆ i y i is Ž0.06, 0.05. for the
Žnormal, nonparametric . approach when s 0.5, Ž0.06, 0.02. when s 1.0,
and Ž0.04, 0.01. when s 2.0. Differences for estimating ␣ are less dramatic.
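As a concrete illustration of the setup in this simulation study, the sketch below generates data from a model of form (13.4) with a two-point mixture for u_i. The function name and parameter values here are ours, chosen for illustration, not taken from the study itself.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_clusters(n, T, alpha, sigma, rng):
    """Simulate binary data from logit P(Y_it = 1 | u_i) = alpha + u_i,
    with u_i drawn from a two-point mixture: +/- sigma with probability
    0.5 each, so that E(u_i) = 0 and var(u_i) = sigma**2."""
    u = rng.choice([-sigma, sigma], size=n)          # cluster random effects
    p = 1.0 / (1.0 + np.exp(-(alpha + u)))           # P(Y_it = 1 | u_i)
    y = rng.binomial(1, p[:, None], size=(n, T))     # T Bernoulli trials per cluster
    return u, p, y

u, p, y = simulate_clusters(n=30, T=30, alpha=0.0, sigma=2.0, rng=rng)
# With sigma = 2, the conditional probabilities cluster at the two values
# expit(-2) = 0.12 and expit(+2) = 0.88, far from any normal mixture.
print(np.unique(np.round(p, 2)))
```

With such strongly bimodal conditional probabilities, a fitted normal mixing distribution smooths over the two modes, which is the source of the efficiency loss described above.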
The example from Follmann and Lambert (1989) discussed in Section 13.2.2,
which has a covariate but T = 1, illustrates the potential efficiency loss
with the logistic-normal GLMM. The two-point mixture model has β̂ = 124.8
with SE = 25.2, for which β̂/SE = 4.9. The normal mixture model has β̂ = 65.5
with SE = 19.5, for which β̂/SE = 3.4.
Our study suggested that the random effects distribution has to be rather
extremely nonnormal before the normal GLMM suffers in bias or efficiency.
However, Heagerty and Zeger (2000) (see also McCulloch 1997) noted that
other types of misspecification can be more crucial. Regarding bias, they
argued that sensitivity to the random effects assumption is greater for
estimating regression parameters in random effects models than for estimating
their counterparts in corresponding marginal models. They illustrated this
with a model violation in which the variance of the random effects depends
on values of covariates. They concluded that between-cluster effects may be
more sensitive to correct specification of the random effects distribution than
within-cluster effects. This is an advantage of using marginal models for
between-cluster effects.
13.2.4 Rasch Mixture Model

From Section 12.1.4, for subject i with item t the Rasch model for a binary
response is

    logit P(Y_it = 1 | u_i) = u_i + β_t,   t = 1, ..., T.   (13.5)
The GLMM treats {u_i} as normal random effects. Lindsay et al. (1991)
studied this model when u_i instead can assume only a finite number q of
values. Denote the distribution of the latent variable u_i, which is the same
for all i, by

    P(U = a_k) = ρ_k,   k = 1, ..., q,

for unknown {a_k} and {ρ_k}. For identifiability one can place a constraint
either on this distribution, such as Σ_k ρ_k a_k = 0, or on {β_t}. This model
is called a Rasch mixture model.
Like other random effects models, the Rasch mixture model is a latent
variable model. The random effect u_i is unobserved, and the T responses are
assumed conditionally independent at each fixed u_i value. It differs from the
ordinary latent class model for binary responses having q latent classes
(Section 13.1), since it assumes structure (13.5) for P(Y_it = 1 | u_i), whereas
latent class model (13.1) assumes no structure for P(Y_t = y_t | Z = z).
This model is simpler to fit than GLMMs with normal random effects
because the GLMM's intractable integral that determines the likelihood
function is replaced by a finite sum. The marginal probability of a sequence
of responses (y_1, ..., y_T) is

    π(y_1, ..., y_T) = Σ_{k=1}^{q} ρ_k Π_{t=1}^{T} exp[y_t(a_k + β_t)] / [1 + exp(a_k + β_t)].

Substituting this in the multinomial log likelihood (13.2), ML estimation of
{a_k, ρ_k} and {β_t} can proceed using Newton-Raphson or EM algorithms. As
q increases, the maximized likelihood increases and the fit improves. However,
Lindsay et al. (1991) showed that with T items, the likelihood no longer
changes once q = (T + 1)/2. Then, the model gives the same fit to the 2^T
observed table as the quasi-symmetry model (10.33). Thus, this simpler latent
class model has a symmetric conditional association structure among the
observed variables. Arminger et al. (2000) extended the Rasch mixture model
to incorporate covariates.
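The finite sum above is straightforward to compute directly. The sketch below evaluates this marginal probability for hypothetical values of {a_k}, {ρ_k}, and {β_t} (illustrative numbers, not fitted values from any example in this chapter) and checks that the probabilities over all 2^T sequences sum to 1.

```python
import numpy as np
from itertools import product

def rasch_mixture_prob(y, a, rho, beta):
    """Marginal probability of a binary response sequence y = (y_1,...,y_T):
    a finite sum over the q support points a_k, weighted by rho_k, of
    products of logistic terms with logit P(Y_t = 1 | u = a_k) = a_k + beta_t."""
    y, a, rho, beta = (np.asarray(v, float) for v in (y, a, rho, beta))
    eta = a[:, None] + beta[None, :]                       # q x T linear predictors
    term = np.exp(y[None, :] * eta) / (1.0 + np.exp(eta))  # exp(y_t eta)/(1 + exp eta)
    return float(rho @ term.prod(axis=1))

# Hypothetical parameter values with q = 2 support points and T = 3 items:
a, rho, beta = [-2.0, 1.5], [0.4, 0.6], [0.0, 0.5, -0.5]
total = sum(rasch_mixture_prob(y, a, rho, beta) for y in product([0, 1], repeat=3))
print(total)   # probabilities over the 2^T sequences sum to 1
```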
13.2.5 Modeling Rater Agreement
For the ratings of carcinoma by seven pathologists (Table 13.1), Table 13.2
also summarizes the fit of Rasch mixture models. Here, P(Y_it = 1 | u_i) in
(13.5) denotes the probability of a carcinoma diagnosis for pathologist t
evaluating slide i. With q = 3 (i.e., u_i can take 3 values), it does not fit
significantly more poorly than the latent class model. With T = 7 raters, the
discrete mixture can take at most (T + 1)/2 = 4 points. The model with
q = 4 is equivalently the quasi-symmetry model. It does not seem to fit better
than with q = 3.
550
OTHER MIXTURE MODELS FOR CATEGORICAL DATA
FIGURE 13.3 Pathologist estimates for Rasch mixture model and results of 90% Bonferroni
simultaneous comparison.

Figure 13.3 shows {β̂_t} for the Rasch mixture model with q = 3, setting
Σ_t β̂_t = 0. These describe variation among the pathologists' response
distributions at each latent level. For a given latent class, for instance, the
estimated odds of a carcinoma diagnosis for pathologist B are
exp(3.52 − 1.48) = 7.7 times the estimated odds for pathologist A.
Pathologist B tends to make a carcinoma diagnosis most often, and D and F the
least. The figure also shows results of a 90% Bonferroni comparison of the 21
pairs of pathologists, based on standard errors of the pairwise differences
β̂_t − β̂_s.
For pathologist t, conditional on latent level k for a slide,

    exp(â_k + β̂_t) / [1 + exp(â_k + β̂_t)]

estimates the probability of a carcinoma diagnosis. Table 13.3 reports these,
which use â_1 = −5.25, â_2 = −1.02, and â_3 = 3.63. They are similar to the
estimates for the ordinary latent class model but a bit smoother, with fewer
estimates at the boundary. Again, at latent level 1 pathologists tend not to
diagnose carcinoma, at level 2 many disagreements occur, and at level 3
pathologists tend to diagnose carcinoma. The estimated latent class proportions
are ρ̂_1 = 0.37, ρ̂_2 = 0.19, and ρ̂_3 = 0.43, with 19% of cases falling in
the problematic class.
Model (13.5) implies that the association between each Y_t and U has log
odds ratio (a_k − a_l) for levels k and l of U. For instance, in the third
latent class the estimated odds that a pathologist diagnoses carcinoma are
exp[3.63 − (−5.25)] > 7000 times those in the first latent class. In terms of
the estimated probabilities in Table 13.3, for pathologist A this is the odds
ratio (0.994/0.006)/(0.022/0.978). The large {â_k − â_l} suggest strong
association between each pathologist's rating and the latent variable. This
induces strong association between pairs of pathologist ratings. (The
model-fitted odds ratios between pairs of raters vary between about 7 and
400.) However, the quite varied {β̂_t} suggest that substantial marginal
heterogeneity exists among the seven ratings. This causes heterogeneity in
pairwise levels of agreement.
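To see how entries of this form arise, the sketch below plugs the reported â_k into the conditional-probability formula together with β̂_A = 1.48, pathologist A's estimate as read from the comparison with pathologist B quoted above (so the per-rater value is inferred from the text, not from Table 13.3 directly).

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Support-point estimates reported in the text, and pathologist A's estimate:
a_hat = np.array([-5.25, -1.02, 3.63])
beta_A = 1.48

p = expit(a_hat + beta_A)   # P(carcinoma diagnosis | latent level k) for rater A
print(np.round(p, 3))       # compare with pathologist A's row of Table 13.3

# The log odds ratio between latent levels 3 and 1 is a_3 - a_1 for every rater:
print(np.exp(a_hat[2] - a_hat[0]))   # > 7000
```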
The mutual independence model is the special case of the Rasch mixture
model with q = 1; that is, ρ_1 = 1. For Table 13.1 the Rasch mixture model
with q = 3 has only four more parameters than the mutual independence
model (i.e., ρ_k and a_k, k = 1, 2). Yet it fits well and has simple
interpretations. See Agresti and Lang (1993b) for further details and for a
simpler model that sets a_1 − a_2 = a_2 − a_3.
13.2.6 Other Models for Capture-Recapture

In Section 13.1.3, latent class models were used for capture-recapture
experiments. Alternatively, one could use the Rasch mixture model. Model (13.5)
with two classes gives N̂ = 77 and a 95% profile-likelihood confidence
interval of (71, 87). This seems too short to trust. It is more realistic to
allow a continuous distribution for capture probabilities. Model (13.5)
treating u_i as normal rather than binary does this, and in Section 12.3.6 we
used it for these data.
So, which models might be used other than a parametric random effects
model? One possibility is a loglinear model (Cormack 1989). This is a
marginal model, applying to probabilities averaged over subjects. Let Y_t
denote the binary capture variable for a randomly selected subject at occasion
t, with categories (captured, not captured). The simplest model, denoted
by (Y_1, Y_2, ..., Y_T), assumes that capture events are mutually independent.
This is equivalent to the logistic-normal model (13.5) with σ = 0 and to
latent class model (13.1) with q = 1. A more plausible model allows an
association between pairs of capture variables. This is equivalently the
loglinear model denoted (Y_1Y_2, Y_1Y_3, ..., Y_{T−1}Y_T). Alternatively, a
model with Markov structure such as (Y_1Y_2, Y_2Y_3, ..., Y_{T−1}Y_T) may be
useful. Usually, insufficient data exist to warrant using very complex
loglinear models. For any such model, its fit for the 2^T − 1 observed cells
projects to the remaining cell to predict the number unobserved at every
occasion.
A connection exists between the nonparametric random effects and loglinear
approaches. In Section 13.2.7 we show that assuming model (13.5) but using a
nonparametric treatment of u_i implies a loglinear model of quasi-symmetric
form for the marginal model. The quasi-symmetry model (10.33) itself is not
useful for this problem, because any count in the missing cell is consistent
with it. The model has an interaction parameter pertaining to that cell alone,
which results in a likelihood equation equating that cell count to its fitted
value. So, information in other cells does not help in estimating the
expected frequency in that cell. However, special cases of quasi-symmetry
are useful (Darroch et al. 1993). An example is the loglinear model with the
same association for each pair of occasions. Like the logistic-normal model,
this model of exchangeable association has only one more parameter than the
mutual independence model.
For the snowshoe hare data of Table 12.6, the model with exchangeable
two-factor association has N̂ = 90.5 and a confidence interval of (75, 125).
This interval and the one of (71, 87) for the Rasch mixture model with q = 2
are substantially narrower than the interval (75, 154) for the logistic-normal
model (Section 12.3.6). In capture-recapture experiments, N̂ and the
confidence interval for N depend strongly on the choice of model. The problem
is inherently one of prediction. Estimating N requires extrapolating from the
observed numbers of subjects having 1, 2, ..., T captures to the number of
subjects with 0 captures. Standard goodness-of-fit criteria are of limited
help. Two models can fit the data well, yet yield quite different estimates
for the unobserved count. For instance, for the snowshoe hare data, the
loglinear models of mutual independence and of two-factor association both fit
the observed cells relatively well (G² = 58.3, df = 56 for mutual independence
and G² = 32.4, df = 41 for the two-factor model); however, their N̂ values
are 75 and 105.
Simpler models usually give narrower confidence intervals for N, through
the usual benefits of model parsimony. This is not necessarily good. A narrow
confidence interval for N is desirable, but not at the expense of a severe
sacrifice in the actual confidence level. Intervals based on a possibly
unrealistic assumption of subject homogeneity may be overly optimistic.
Simulations suggest that actual coverage probabilities are often well below
nominal levels when even slight model misspecification occurs. Allowing for
heterogeneity among subjects results in wider intervals. Severe population
heterogeneity makes reaching useful conclusions difficult, as intervals can be
very wide (Burnham and Overton 1978; Coull and Agresti 1999).
13.2.7 Nonparametric Mixtures and Quasi-symmetry

A distribution-free approach for u_i with the Rasch form of model (13.5)
implies the quasi-symmetry loglinear model marginally (Darroch 1981; Tjur
1982). We now show this result, to which we alluded in Section 10.4.2.
Let Y_i denote the sequence of T responses for subject i. For possible
outcomes y = (y_1, ..., y_T), where each y_t = 1 or 0,

    P(Y_i = y | u_i) = Π_t [exp(u_i + β_t) / (1 + exp(u_i + β_t))]^{y_t} [1 / (1 + exp(u_i + β_t))]^{1−y_t}
                     = exp[u_i(Σ_t y_t) + Σ_t y_t β_t] / Π_t [1 + exp(u_i + β_t)].

Let F denote the cdf of u_i. The marginal probability of sequence y for a
randomly selected subject is (suppressing the subject label)

    π(y_1, ..., y_T) = E_U P(Y = y | U) = exp(Σ_t y_t β_t) ∫ { exp[u(Σ_t y_t)] / Π_t [1 + exp(u + β_t)] } dF(u).
This probability contributes to the log likelihood, which is (13.2) for a
multinomial distribution over the 2^T cells for possible y. Regardless of the
choice for F, the integral is complex. However, it depends on the data only
through Σ_t y_t. A more general model replaces this integral by a separate
parameter for each value of Σ_t y_t. This model has form

    log π(y_1, ..., y_T) = Σ_t y_t β_t + λ_{y_1 + ⋯ + y_T}.   (13.6)

The final term represents a separate parameter at each value of Σ_t y_t.
The implied marginal model (13.6) has an interaction term that is invariant
to permutations of the response outcomes y, since each such permutation
yields the same sum, Σ_t y_t. Thus, it is the loglinear model of
quasi-symmetry (10.33). No matter what form F takes, the marginal model has
the same main effect structure, and it has an interaction term that is a
special case of the one in (13.6). Thus, one can consistently estimate {β_t}
using the ordinary ML estimates for the loglinear model. In fact, Tjur (1982)
showed that these estimates are also the conditional ML estimates, treating
{u_i} as fixed effects and conditioning on their sufficient statistics. The
interaction parameters in model (13.6) reflect the dependence among the
responses that results from heterogeneity in {u_i}.
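The key step, that the integral depends on y only through Σ_t y_t, can be verified numerically. The sketch below uses hypothetical item parameters β_t and a normal choice of F (any density evaluated on the grid behaves the same way) and checks that log π(y) − Σ_t y_t β_t is constant within each value of the total.

```python
import numpy as np
from itertools import product

beta = np.array([0.3, -0.2, 0.8, 0.0])   # hypothetical item parameters, T = 4

# Numerical integration grid for the mixing distribution F (here standard
# normal, but the conclusion holds for any mixing density on the grid).
u = np.linspace(-8.0, 8.0, 4001)
w = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi) * (u[1] - u[0])

def log_pi(y):
    """log of the marginal probability pi(y) under the Rasch form (13.5)."""
    y = np.asarray(y, float)
    eta = u[:, None] + beta[None, :]
    p_cond = np.prod(np.exp(eta * y) / (1.0 + np.exp(eta)), axis=1)  # P(Y=y|u)
    return np.log(np.sum(w * p_cond))

# Within each total s = sum_t y_t, log pi(y) - sum_t y_t beta_t is constant,
# so the interaction term depends on y only through its sum (quasi-symmetry).
spreads = []
for s in range(5):
    vals = [log_pi(y) - np.asarray(y) @ beta
            for y in product([0, 1], repeat=4) if sum(y) == s]
    spreads.append(max(vals) - min(vals))
print(spreads)   # essentially zero within each total
```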
We illustrate with the opinions about legalized abortion analyzed in
Sections 10.7.2 and 12.3.2 and, with a nonparametric random effects approach,
in Section 13.2.1. For model (13.3), estimated within-subject comparisons
β_t − β_s of items result from fitting a quasi-symmetric loglinear model. Let
μ_g(y_1, y_2, y_3) denote the expected frequency for gender g making response
y_t to item t, t = 1, 2, 3, where for item t, y_t = 1 for approval of
legalized abortion and 0 for disapproval. The loglinear model is

    log μ_g(y_1, y_2, y_3) = β_1 y_1 + β_2 y_2 + β_3 y_3 + γ_g + λ_{y_1 + y_2 + y_3}.   (13.7)

For y_1 + y_2 + y_3 = k, λ_k refers to all cells in which subjects voiced
approval for k of the three items, k = 0, 1, 2, 3. The ML fit, which has
G² = 10.2 with df = 9, yields β̂_1 − β̂_2 = 0.521 (SE = 0.154),
β̂_1 − β̂_3 = 0.828 (SE = 0.160), and β̂_2 − β̂_3 = 0.307 (SE = 0.161). These
are similar to the normal random effects estimates (Table 12.3) and the
nonparametric random effects estimates in Section 13.2.1. They also are the
conditional ML estimates for model (13.3), treating {u_i} as fixed. With this
approach or conditional ML, however, one cannot estimate between-groups
effects, such as the gender effect in model (13.7). [The γ parameter in model
(13.7) refers to the relative sample sizes of males and females and is not the
same as the gender effect in (13.3).]
13.3 BETA-BINOMIAL MODELS

The beta-binomial model is a parametric mixture model that is another
alternative to binary GLMMs with normal random effects. As with other
mixture models that assume a binomial distribution at a fixed parameter
value, the marginal distribution permits more variation than the binomial.
Thus, a model using the beta-binomial is a way to handle the overdispersion
that occurs with ordinary binomial models.
13.3.1 Beta-Binomial Distribution

The beta-binomial distribution results from a beta distribution mixture of
binomials. Suppose that (a) given π, Y has a binomial distribution, bin(n, π),
and (b) π has a beta distribution.
The beta probability density function is

    f(π; α, β) = [Γ(α + β) / (Γ(α)Γ(β))] π^{α−1} (1 − π)^{β−1},   0 ≤ π ≤ 1,   (13.8)

with parameters α > 0 and β > 0, for the gamma function Γ(·). Let

    μ = α / (α + β),   θ = 1 / (α + β).

The beta distribution for π has mean and variance

    E(π) = μ,   var(π) = μ(1 − μ)θ / (1 + θ).

When α and β exceed 1.0, the distribution is unimodal, with skew to the
right when α < β, skew to the left when α > β, and symmetry when α = β.
It simplifies to the uniform distribution when α = β = 1.
Marginally, averaging with respect to the beta distribution for π, Y has
the beta-binomial distribution. Its mass function is

    p(y; α, β) = (n choose y) B(α + y, n + β − y) / B(α, β),   y = 0, 1, ..., n.

In terms of μ and θ, the beta-binomial mass function is

    p(y; μ, θ) = (n choose y) [Π_{k=0}^{y−1} (μ + kθ)] [Π_{k=0}^{n−y−1} (1 − μ + kθ)] / [Π_{k=0}^{n−1} (1 + kθ)].   (13.9)

It is easier to understand the nature of this distribution from its moments
than from its mass function. The first two moments are

    E(Y) = nμ,   var(Y) = nμ(1 − μ)[1 + (n − 1)θ / (1 + θ)].

As θ → 0 in the beta distribution, var(π) → 0 and that distribution converges
to a degenerate distribution at μ. Then var(Y) → nμ(1 − μ) and the
beta-binomial distribution converges to the bin(n, μ).
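The equivalence of the (α, β) and (μ, θ) forms, and the moment formulas, are easy to check numerically. The sketch below does so for illustrative values α = 2, β = 3, n = 10 (so μ = 0.4 and θ = 0.2).

```python
import numpy as np
from math import comb, lgamma, exp

def log_beta_fn(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def pmf_alpha_beta(y, n, a, b):
    """Beta-binomial pmf in the (alpha, beta) form: C(n,y) B(a+y, n+b-y)/B(a,b)."""
    return comb(n, y) * exp(log_beta_fn(a + y, n + b - y) - log_beta_fn(a, b))

def pmf_mu_theta(y, n, mu, theta):
    """Equivalent (mu, theta) form (13.9), with mu = a/(a+b), theta = 1/(a+b)."""
    num = np.prod([mu + k * theta for k in range(y)]) * \
          np.prod([1.0 - mu + k * theta for k in range(n - y)])
    den = np.prod([1.0 + k * theta for k in range(n)])
    return comb(n, y) * num / den

a, b, n = 2.0, 3.0, 10
mu, theta = a / (a + b), 1.0 / (a + b)
ys = np.arange(n + 1)
p1 = np.array([pmf_alpha_beta(y, n, a, b) for y in ys])
p2 = np.array([pmf_mu_theta(y, n, mu, theta) for y in ys])
print(np.allclose(p1, p2))   # the two parameterizations agree

mean = (ys * p1).sum()
var = (ys**2 * p1).sum() - mean**2
print(mean, n * mu)          # E(Y) = n mu
print(var, n * mu * (1 - mu) * (1 + (n - 1) * theta / (1 + theta)))
```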
13.3.2 Models Using the Beta-Binomial Distribution

Models using the beta-binomial distribution permit μ [and hence E(Y)] to
depend on explanatory variables. The simplest models let θ be the same
unknown constant for all observations. [Prentice (1986) considered extensions
in which it could also depend on covariates.] Like GLMs, the models can use
various link functions, but the logit is most common. For observation i with
n_i trials, assuming that y_i has a beta-binomial distribution with index n_i
and parameters (μ_i, θ), the model links μ_i to predictors by

    logit(μ_i) = α + β′x_i.

The beta-binomial is not in the natural exponential family, even for known
θ. Articles using beta-binomial models have employed a variety of fitting
methods (Note 13.4). Crowder (1978) discussed the likelihood behavior for an
ANOVA-type model. Hinde and Demétrio (1998) obtained the ML fit by
iterating between solving the likelihood equations for the regression
parameters β, for fixed θ, and solving the likelihood equation for θ, for
fixed β. Each part can use Newton-Raphson. McCulloch and Searle (2001, p. 61)
showed the asymptotic covariance matrix of (μ̂, θ̂) and of (α̂, θ̂) for
independent observations from a single beta-binomial distribution.
independent observations from a single beta-binomial distribution.
A related but simpler approach for overdispersed binary counts uses
quasi-likelihood with similar variance function as the beta-binomial. The
quasi-likelihood variance function is
® Ž i . s n i i Ž 1 y i . 1 q Ž n i y 1 .
Ž 13.10 .
with < < F 1. Although motivated by the beta-binomial model, this variance
function results merely from assuming that i has a distribution with
varŽ i . s i Ž1 y i .. It also results from assuming a common correlation
between each pair of the n i individual binary random variables that sum to yi
ŽAltham 1978.. The ordinary binomial variance results when s 0. Overdispersion occurs when ) 0.
For this quasi-likelihood approach, Williams Ž1982. gave an iterative
routine for estimating  and the overdispersion parameter . He let ˆ be
such that the resulting Pearson X 2 that sums the squared Pearson residuals
for this variance function equals the residual df for the model. This requires
an iterative two-step process of Ž1. solving the quasi-likelihood equations for
ˆ solving for ˆ in the
 for a given ˆ, and then Ž2. using the updated ,
ˆ and ˆ. to its df.
equation that equates X 2 Žwhich depends on 
An alternative quasi-likelihood approach uses the simpler variance function

    v(μ_i) = φ n_i μ_i(1 − μ_i)   (13.11)

introduced in Section 4.7.3. The ordinary binomial variance has φ = 1.0, and
overdispersion has φ > 1. With this approach, β̂ is the same as its ML
estimate for the ordinary binomial model. Commonly, φ̂ = X²/df, where X² is
the Pearson fit statistic for the binomial model (Finney 1947). The standard
errors for this overdispersion approach multiply those for the binomial model
by φ̂^{1/2}.
Liang and McCullagh (1993) showed several examples using these two
variance functions. A plot of the standardized residuals for the ordinary
binomial model against the indices {n_i} can provide insight about which is
more appropriate. When the residuals show an increasing trend in their
spread as n_i increases, the beta-binomial-type variance function may be more
appropriate. This is because when the beta-binomial variance holds, the
residuals from an ordinary binomial model have a denominator that is
progressively too small as n_i increases. The two quasi-likelihood approaches
are equivalent when the {n_i} are identical. Only when the indices vary
considerably might results differ much. Because the variance function
v(μ_i) = φ n_i μ_i(1 − μ_i) has a structural problem when n_i = 1
(Problem 13.33) and has less direct motivation, we prefer quasi-likelihood
with the beta-binomial variance function.
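A small sketch contrasting the two variance functions makes the comparison concrete; the values of μ, ρ, and φ below are purely illustrative.

```python
def v_betabinom_type(n, mu, rho):
    """Quasi-likelihood variance (13.10): n mu (1 - mu) [1 + rho (n - 1)]."""
    return n * mu * (1.0 - mu) * (1.0 + rho * (n - 1))

def v_inflated(n, mu, phi):
    """Quasi-likelihood variance (13.11): phi n mu (1 - mu)."""
    return phi * n * mu * (1.0 - mu)

mu, rho = 0.3, 0.2   # illustrative values

# With a common index n the two coincide, taking phi = 1 + rho (n - 1):
n = 10
print(v_betabinom_type(n, mu, rho), v_inflated(n, mu, 1 + rho * (n - 1)))

# At n_i = 1, (13.10) reduces to the Bernoulli variance mu (1 - mu), while
# (13.11) still multiplies it by phi -- the structural problem noted above.
print(v_betabinom_type(1, mu, rho), v_inflated(1, mu, 2.0))
```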
13.3.3 Teratology Overdispersion Example Revisited

Refer back to Table 4.5 on results of a teratology experiment analyzed by
Liang and McCullagh (1993) and Moore and Tsiatis (1991). Female rats on
iron-deficient diets were assigned to four groups. Group 1 received only
placebo injections. The other groups received injections of an iron
supplement according to various schedules. The rats were made pregnant and
then sacrificed after 3 weeks. For each fetus in each rat's litter, the
response was whether the fetus was dead. Because of unmeasured covariates, it
is natural to permit the probability of death to vary from litter to litter
within a particular treatment group.
Let y_i denote the number dead out of the n_i fetuses in litter i. Let π_it
denote the probability of death for fetus t in litter i. First, suppose that
y_i is a bin(n_i, π_it) variate, with

    logit(π_it) = α + β_2 z_{2i} + β_3 z_{3i} + β_4 z_{4i},

where z_{gi} = 1 if litter i is in group g and 0 otherwise. This model treats
all litters in a group g as having the same probability of death,
exp(α + β_g)/[1 + exp(α + β_g)], where β_1 = 0. However, there is evidence of
overdispersion,
TABLE 13.5 Estimates for Several Logit Models Fitted to Table 4.5

Parameter       Binomial ML       QL(1)             QL(2)             GEE               GLMM
Intercept        1.144 (0.129)     1.212 (0.223)     1.144 (0.219)     1.144 (0.276)     1.802 (0.362)
Group 2         −3.322 (0.331)    −3.370 (0.563)    −3.322 (0.560)    −3.322 (0.440)    −4.515 (0.736)
Group 3         −4.476 (0.731)    −4.585 (1.303)    −4.476 (1.238)    −4.476 (0.610)    −5.855 (1.190)
Group 4         −4.130 (0.476)    −4.250 (0.848)    −4.130 (0.806)    −4.130 (0.576)    −5.594 (0.919)
Overdispersion   None              ρ̂ = 0.192        φ̂ = 2.86         ρ̂ = 0.185        σ̂ = 1.53

Note: Binomial ML assumes no overdispersion; QL(1) is quasi-likelihood with
beta-binomial-type variance; QL(2) is quasi-likelihood with inflated binomial
variance. QL(2) and GEE (independence working equations) estimates are the
same as the binomial ML estimates. Values in parentheses are standard errors.
with X² = 154.7 and G² = 173.5 (df = 54). Table 13.5 shows the ML estimates
and standard errors.
Table 13.5 also shows results for the two quasi-likelihood approaches.
Estimates and standard errors are qualitatively similar for the two. For
variance function v(μ_i) = φ n_i μ_i(1 − μ_i), the estimates equal the
binomial ML estimates, but the standard errors are multiplied by
φ̂^{1/2} = (X²/df)^{1/2} = (154.7/54)^{1/2} = 1.69. For the beta-binomial-type
variance function, ρ̂ = 0.192. This fit treats the variance of Y_i as

    n_i μ_i(1 − μ_i)[1 + 0.192(n_i − 1)].

This corresponds roughly to a doubling of the variance, relative to the
binomial, for a litter size of 6 and a tripling for n_i = 11. Even with these
adjustments for overdispersion, Table 13.5 shows that strong evidence remains
that the probability of death is substantially lower for each treatment
group than for the placebo group.
Figure 13.4 plots the standardized Pearson residuals against litter size for
the binomial logit model. The apparent increase in their variability as litter
size increases suggests that the beta-binomial variance function is plausible.
The ρ term in that variance function corresponds to θ/(1 + θ) in the
variance of the beta-binomial distribution. For that distribution or more
generally, ρ̂ = 0.192 means that the probabilities of death for litters in a
particular group have estimated standard deviation [0.192 μ_i(1 − μ_i)]^{1/2}.
This equals 0.22 when the mean is 0.5 and 0.13 when the mean is 0.1 or 0.9,
which is considerable heterogeneity. More generally, a model could let ρ vary
by treatment group or be different for the placebo group than for the others.
We leave this to the reader.
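The quoted variance inflations and standard deviations can be reproduced directly from ρ̂ = 0.192:

```python
import numpy as np

rho_hat = 0.192

# Variance inflation factor 1 + rho (n_i - 1) at the litter sizes cited:
for n_i in (6, 11):
    print(n_i, round(1 + rho_hat * (n_i - 1), 2))   # ~2 at n_i = 6, ~3 at n_i = 11

# Heterogeneity of litter death probabilities: SD = sqrt(rho mu (1 - mu)):
for mu in (0.1, 0.5, 0.9):
    print(mu, round(float(np.sqrt(rho_hat * mu * (1 - mu))), 2))
```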
For comparison, Table 13.5 also shows results with the GEE approach to
fitting the logit model, assuming an independence working correlation
structure for observations within a litter. The estimates are the same as the
ML estimates for the binomial logit model, but the empirical adjustment
increases the standard errors. Similar results occur with an exchangeable
working correlation structure. For it, the estimated within-litter correlation
between the binary responses is 0.185. This is comparable to the value of
0.192 that yields the quasi-likelihood results with the beta-binomial variance
function. The GEE standard errors are somewhat different from those of the
quasi-likelihood approach. It may be that the sample size is insufficient
for the GEE sandwich adjustment, which tends to underestimate standard
errors unless the number of clusters is quite large. Or, this may simply
reflect the different variance function of the GEE approach.

FIGURE 13.4 Standardized Pearson residuals for binomial logit model fitted to Table 4.5.

Finally, Table 13.5 also shows results for the GLMM that adds a normal
random intercept u_i for litter i to the binomial logit model. Results are
again similar in terms of the significance of the treatment groups relative to
placebo. The estimated effects are larger for this logistic-normal model,
since they are subject-specific (i.e., litter-specific) rather than
population-averaged.
13.3.4 Conjugate Mixture Models

The beta-binomial model is an example of a conjugate mixture model. These
are models for which the marginal distribution has closed form. The data
have a particular distribution, conditional on a parameter, and the parameter
in turn has its own distribution, chosen such that the marginal distribution
has closed form.
Similarly, in Bayesian methods the conjugate prior distribution is a
distribution that, when combined with the likelihood, gives a closed form for
the posterior distribution. For instance, for observations from a binomial
distribution with a beta prior distribution for the binomial parameter, the
posterior distribution of that parameter is also beta. Conjugate models were
the primary method of conducting Bayesian analysis before the development of
computationally intensive methods, such as Markov chain Monte Carlo, for
evaluating the integral that determines the posterior distribution.
The beta-binomial conjugate mixture model applies to totals from binary
trials. In the next section we study a conjugate mixture model for count
data. It uses a gamma distribution to mix the Poisson parameter. A
disadvantage of the conjugate mixture approach is its lack of generality and
flexibility, requiring a different mixture distribution for each type of
problem. In addition, the extra variability need not enter on the same scale
as the ordinary predictors, and it can be difficult to have a multivariate
random effects structure. Lee and Nelder (1996) discussed this approach and
considered a variety of hierarchical models of GLMM form in which the random
effect need not be normal.
13.4 NEGATIVE BINOMIAL REGRESSION
The negative binomial is a conjugate mixture distribution for count data. It is
useful when overdispersion occurs with Poisson GLMs.
13.4.1 Negative Binomial as Gamma Mixture of Poisson Distributions

In Section 4.3.3 we noted that a severe limitation of Poisson models is that
the variance of Y must equal the mean. Hence, at a fixed mean the variance
cannot decrease as additional predictors enter the model. Count data often
show overdispersion, with the variance exceeding the mean. This might
happen, for instance, because some relevant explanatory variables are not in
the model. A mixture model is a flexible way to account for overdispersion.
At a fixed setting of the predictors used, given the mean the distribution of
Y is Poisson, but the mean itself varies according to some distribution.
Suppose that (1) given λ, Y has a Poisson distribution with mean λ, and
(2) λ has a gamma distribution, G(k, μ). The gamma probability density
function for λ is

    f(λ; k, μ) = [(k/μ)^k / Γ(k)] exp(−kλ/μ) λ^{k−1},   λ ≥ 0.   (13.12)

This gamma distribution has

    E(λ) = μ,   var(λ) = μ²/k.
The parameter k > 0 describes the shape. The density is skewed to the right,
but the degree of skewness decreases as k increases.
Marginally, the gamma mixture of the Poisson distributions yields the
negative binomial distribution for Y. Its probability mass function is

    p(y; k, μ) = [Γ(y + k) / (Γ(k)Γ(y + 1))] [k/(μ + k)]^k [1 − k/(μ + k)]^y,   y = 0, 1, 2, ....   (13.13)

This negative binomial distribution has

    E(Y) = μ,   var(Y) = μ + μ²/k.

The index k^{−1} is called the dispersion parameter. As k^{−1} → 0, the gamma
distribution has var(λ) → 0 and converges to a degenerate distribution at μ;
similarly, the negative binomial distribution then has var(Y) → μ and
converges to the Poisson distribution with mean μ.
For given k^{−1}, the negative binomial is in the natural exponential
family. The natural parameter is log[μ/(μ + k)]. Usually, though, the
dispersion parameter k^{−1} is itself unknown. Estimating it helps to
summarize the extent of overdispersion. The greater k^{−1} is, the greater the
overdispersion compared to the ordinary Poisson GLM. For independent
observations, the ML estimate of μ is the sample mean, but ML estimation of
k^{−1} requires iterative methods (R. A. Fisher showed this in an appendix of
a 1953 Biometrics article by C. Bliss). Problem 13.40 shows an alternative
gamma parameterization that implies a linear rather than quadratic variance
function for the negative binomial.
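The gamma mixture construction can be checked by simulation; μ and k below are illustrative values, not fitted to any data in this section.

```python
import numpy as np

rng = np.random.default_rng(7)
mu, k = 2.5, 1.2   # illustrative mean and shape

# lambda ~ gamma with mean mu and variance mu^2/k (shape k, scale mu/k),
# then Y | lambda ~ Poisson(lambda); marginally Y is negative binomial.
lam = rng.gamma(shape=k, scale=mu / k, size=1_000_000)
y = rng.poisson(lam)

print(y.mean())          # ~ mu = 2.5
print(y.var())           # ~ mu + mu^2/k, about 7.71
print((y == 0).mean())   # ~ (k/(mu+k))^k, the P(Y = 0) term of (13.13)
```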
13.4.2 Negative Binomial Regression Modeling

Negative binomial models for counts permit μ to depend on explanatory
variables (Lawless 1987). Such models normally take k^{−1} to be the same for
all observations. This corresponds to a constant coefficient of variation in
the gamma mixing distribution, [var(λ)]^{1/2}/E(λ) = 1/√k, with the standard
deviation increasing as the mean does. Most common is the log link, as in
Poisson loglinear models. Sometimes the identity link is adequate. One such
case is with a single predictor that is a factor.
For fixed k, a negative binomial model is a GLM. Thus, the likelihood
equations for the regression parameters β are special cases of those [see
(4.22)] for an ordinary GLM with variance function v(μ) = μ + μ²/k. The
usual iteratively reweighted least squares algorithm applies for ML model
fitting. When k is unknown, ML fitting can use a Newton-Raphson routine
on all the parameters simultaneously. Or, one can evaluate the profile
likelihood for various fixed k (Lawless 1987). Another approach alternates
between (1) using iteratively reweighted least squares to solve the equations
for β̂, for fixed k, and (2) for fixed β̂, using Newton-Raphson to estimate k,
iterating between the two until convergence.
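A minimal sketch of step (1), the IRLS iteration for fixed k with a log link, is below. The simulated data and coefficient values are hypothetical, and the function is a bare-bones illustration rather than a production fitting routine.

```python
import numpy as np

def nb_irls(X, y, k, n_iter=50):
    """Iteratively reweighted least squares for a negative binomial GLM
    with log link and fixed k, using variance function v(mu) = mu + mu**2/k.
    Weights are (d mu / d eta)**2 / v(mu) = mu**2 / v(mu) for the log link."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        v = mu + mu**2 / k
        w = mu**2 / v               # IRLS weights
        z = eta + (y - mu) / mu     # working response
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta

# Check on simulated gamma-Poisson (negative binomial) data:
rng = np.random.default_rng(3)
n = 20000
x = rng.uniform(-1, 1, n)
X = np.column_stack([np.ones(n), x])
beta_true, k = np.array([0.5, 1.0]), 2.0
mu = np.exp(X @ beta_true)
y = rng.poisson(rng.gamma(shape=k, scale=mu / k))
print(nb_irls(X, y, k))   # close to (0.5, 1.0)
```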
The full log likelihood L(β, k; y) for a negative binomial model satisfies

    ∂²L/∂β_j ∂k = Σ_i { (y_i − μ_i) / [(k + μ_i)² g′(μ_i)] } x_ij.

Thus, E(∂²L/∂β_j ∂k) = 0 for each j. Similarly, the inverse of the expected
information matrix has 0 elements connecting k with each β_j. Since this is
the asymptotic covariance matrix, β̂ and k̂ are asymptotically independent.
It follows that the standard errors for β̂ obtained from part (1) of the
iterative scheme above are correct. Cameron and Trivedi (1998, p. 72) showed
the asymptotic covariance matrix. They [and Lawless (1987)] considered a
moment estimator for k^{−1} and studied robustness properties of the
estimators. They noted that β̂ from this model is consistent if the model for
the mean is correctly specified, even if the true distribution is not
negative binomial.
13.4.3 Frequency of Knowing Homicide Victims Example
Table 13.6 summarizes responses of 1308 subjects to the question: Within the
past 12 months, how many people have you known personally that were
victims of homicide? The table shows responses by race, for those who
identified their race as white or as black. The sample mean for the 159 blacks
was 0.522, with a variance of 1.150. The sample mean for the 1149 whites was
0.092, with a variance of 0.155.
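As a small check (not part of the original text), these summaries can be reproduced from the counts in Table 13.6:

```python
# Counts of subjects reporting y = 0,...,6 homicide victims (Table 13.6).
counts = {"black": [119, 16, 12, 7, 3, 2, 0],
          "white": [1070, 60, 14, 4, 0, 0, 1]}

def mean_var(freqs):
    """Sample mean and variance (denominator n - 1) from frequencies for y = 0, 1, 2, ..."""
    n = sum(freqs)
    s = sum(y * f for y, f in enumerate(freqs))
    ss = sum(y * y * f for y, f in enumerate(freqs))
    mean = s / n
    var = (ss - n * mean ** 2) / (n - 1)
    return mean, var

for race, freqs in counts.items():
    m, v = mean_var(freqs)
    print(race, round(m, 3), round(v, 3))   # black: 0.522 1.15; white: 0.092 0.155
```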
A natural first choice for modeling count data is a Poisson GLM, such as a
loglinear model with a dummy predictor for race. Let y_it denote the response
for subject t of race i. For μ_it = E(Y_it), this model is

    log μ_it = α + β x_it,
TABLE 13.6 Number of Victims of Murder Known in Past Year, by Race,
with Fit of Poisson and Negative Binomial Models

               Data         Poisson GLM      Neg. Bin. GLM     Poisson GLMM
Response   Black  White    Black   White     Black   White     Black   White
   0        119   1070     94.3   1047.7     122.8  1064.9     116.7  1068.3
   1         16     60     49.2     96.7      17.9    67.5      24.5    65.3
   2         12     14     12.9      4.5       7.8    12.7       8.1    10.1
   3          7      4      2.2      0.1       4.1     2.9       3.6     2.8
   4          3      0      0.3      0.0       2.4     0.7       1.9     1.1
   5          2      0      0.0      0.0       1.4     0.2       1.1     0.5
   6          0      1      0.0      0.0       0.9     0.1       0.7     0.3

Source: 1990 General Social Survey, National Opinion Research Center.
OTHER MIXTURE MODELS FOR CATEGORICAL DATA
with x_1t = 1 (blacks) and x_2t = 0 (whites). This model has fit log μ̂_it = −2.38
+ 1.733 x_it. The estimated expected responses are exp(−2.38 + 1.733) =
0.522 for blacks and exp(−2.38) = 0.092 for whites, the sample means. For
any link function for this model, the likelihood equations imply that the fitted
means equal the sample means. Since β̂ = 1.733 (SE = 0.147) is the difference between the log means for blacks and whites, the ratio of sample means
is exp(1.733) = 5.7 = 0.522/0.092. However, for each race the sample variance is roughly double the mean. Table 13.6 also shows the fit of this model.
The evidence of overdispersion is reflected by the higher observed counts at
y = 0 and at large y values than the Poisson GLM predicts.
An alternative is the same model form but assuming a negative binomial
response. A mixture model does seem plausible. Due to various demographic
factors, heterogeneity probably occurs among subjects of a given race in the
distribution of Y. For ML fitting, the deviance decreases by 122.2 compared
to the ordinary Poisson GLM that is the special case with k⁻¹ = 0. Table
13.6 also shows this model fit. It is dramatically better at y = 0 and 1.
Table 13.7 shows parameter estimates for the negative binomial and
Poisson GLMs. For both, β̂ = 1.733, since both models provide fitted means
equal to the sample means. However, the estimated standard error of β̂
increases from 0.147 for the Poisson GLM to 0.238 for the negative binomial
model. The Wald 95% confidence interval for the ratio of means for blacks
and whites goes from exp[1.733 ± 1.96(0.147)] = (4.2, 7.5) for the Poisson
GLM to exp[1.733 ± 1.96(0.238)] = (3.5, 9.0) for the negative binomial. In
accounting for the overdispersion, we obtain results that are not as precise as
the more naive model suggests.
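These intervals follow directly from β̂ and its standard error; a quick numerical check (mine, not the book's):

```python
import math

def wald_ratio_ci(beta_hat, se, z=1.96):
    """Wald confidence interval for the ratio of means, exp(beta_hat +/- z*SE)."""
    return (math.exp(beta_hat - z * se), math.exp(beta_hat + z * se))

lo, hi = wald_ratio_ci(1.733, 0.147)   # Poisson GLM
print(round(lo, 1), round(hi, 1))      # 4.2 7.5
lo, hi = wald_ratio_ci(1.733, 0.238)   # negative binomial GLM
print(round(lo, 1), round(hi, 1))      # 3.5 9.0
```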
The negative binomial model has k̂⁻¹ = 4.94 (SE = 1.00). This shows
strong evidence that k⁻¹ > 0, indicating that the negative binomial model is
more appropriate than the Poisson GLM. The estimated variance of Y is
μ̂ + μ̂²/k̂ = μ̂ + 4.94 μ̂², which is 0.13 for whites and 1.87 for blacks, much
closer to the sample values than the Poisson model provides.
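The quadratic variance estimate is easy to verify numerically (an illustrative check, not from the text):

```python
def nb_variance(mu, k_inv):
    """Estimated negative binomial variance, mu + mu^2/k = mu + k_inv * mu^2."""
    return mu + k_inv * mu ** 2

print(round(nb_variance(0.092, 4.94), 2))  # 0.13 (whites)
print(round(nb_variance(0.522, 4.94), 2))  # 1.87 (blacks)
```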
Table 13.7 also shows results for negative binomial and Poisson models
using the identity link. The fits μ̂_it = 0.092 + 0.430 x_it reproduce the sample
means. Now β̂ refers to the difference in means rather than their log ratio.
The estimated difference β̂ = 0.430 has SE = 0.058 for the Poisson model
and SE = 0.109 for the negative binomial. Results are more imprecise but
TABLE 13.7 Parameter Estimates for Models Fitted to Homicide Data

               Models with Log Link                 Models with Identity Link
         Neg. Binom.   Poisson    Poisson         Neg. Binom.    Poisson
Term         GLM         GLM       GLMM               GLM          GLM
α           −2.38       −2.38      −3.69              0.092        0.092
β           1.733       1.733      1.897              0.430        0.430
SE(β̂)       0.238       0.147      0.246              0.109        0.058
more realistic with the negative binomial model. For this link also, the
estimated dispersion parameter is k̂⁻¹ = 4.94.
13.5 POISSON REGRESSION WITH RANDOM EFFECTS
The GLMMs introduced in Chapter 12 referred to categorical responses.
GLMMs are also useful for other types of discrete responses, such as counts.
This section illustrates with Poisson regression modeling of count data.
We've seen that a flexible way to account for overdispersion is with a
mixture model. In Section 13.4 we mixed the Poisson using the gamma
distribution, yielding the negative binomial marginally. Breslow (1984) and
Hinde (1982) suggested the GLMM structure (12.1) with the log link and a
normal random intercept. The model for the mean for observation t in
cluster i is

    log E(Y_it | u_i) = x′_it β + u_i,        (13.14)

where {u_i} are independent N(0, σ²). Conditional on u_i, y_it has a Poisson
distribution. Marginally, the distribution has variance greater than the mean
whenever σ > 0.
Applications of Poisson GLMMs include the analysis of maps of cancer
rates in epidemiology (Breslow and Clayton 1993) and modeling variability in
bacteria counts (Aitchison and Ho 1989). Although links other than the log
are possible, the identity link (and any other link having range only the
positive real line) has a structural problem. With a normal random effect
with σ > 0, a positive probability exists that the linear predictor is negative,
but the Poisson mean must be nonnegative.
The negative binomial model (for fixed k) is a GLMM with a nonnormal
random effect. With the log link, it results from a loglinear model of form
(13.14) with random intercept, where exp(u_i) has a gamma distribution with
mean 1 and variance k⁻¹. With the identity link, negative binomial models
usually work better than Poisson GLMMs. With the gamma mixture
distribution, which is concentrated on positive values, the resulting marginal
mean is nonnegative for the negative binomial.
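This gamma-mixture representation can be checked by simulation. The parameter values below are my own choices for illustration: exp(u_i) is drawn as a gamma variate with mean 1 and variance 1/k, and the resulting counts should have the negative binomial moments μ and μ + μ²/k.

```python
import numpy as np

rng = np.random.default_rng(0)
k, mu, n = 2.0, 2.0, 400_000

# Multiplicative gamma random effect with mean 1 and variance 1/k:
g = rng.gamma(shape=k, scale=1.0 / k, size=n)
y = rng.poisson(mu * g)

# Marginally Y is negative binomial: E(Y) = mu, var(Y) = mu + mu^2/k.
print(y.mean())   # close to 2.0
print(y.var())    # close to 2.0 + 4.0/2.0 = 4.0
```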
13.5.1 Marginal Model Implied by Poisson GLMM
The Poisson GLMM (13.14) implies a relatively simple marginal model,
averaging out the random effect. The mean of the marginal distribution is

    E(Y_it) = E[E(Y_it | u_i)] = E[exp(x′_it β + u_i)] = exp(x′_it β + σ²/2).
Here E[exp(u_i)] = exp(σ²/2) because a N(0, σ²) variate u_i has moment
generating function E[exp(tu_i)] = exp(t²σ²/2). So, for the Poisson GLMM
the log of the mean conditionally equals x′_it β + u_i and marginally equals
x′_it β + σ²/2. A loglinear model still applies. The marginal effects of the
explanatory variables are the same as the cluster-specific effects. Thus, the
ratio of means at two different settings of x_it is the same conditionally
and marginally. However, marginally the intercept is offset. (Note that
Jensen's inequality applies, since the link is not linear.)
The variance of the marginal distribution is

    var(Y_it) = E[var(Y_it | u_i)] + var[E(Y_it | u_i)]
              = E[exp(x′_it β + u_i)] + exp(2x′_it β) var(exp(u_i))
              = exp(x′_it β + σ²/2) + exp(2x′_it β)[exp(2σ²) − exp(σ²)]
              = E(Y_it) + [E(Y_it)]²[exp(σ²) − 1].

Here, var(e^{u_i}) = E(e^{2u_i}) − [E(e^{u_i})]² = exp(2σ²) − exp(σ²), by evaluating the moment
generating function at t = 2 and t = 1. As in the negative binomial model,
the marginal variance is a quadratic function of the marginal mean. It
exceeds the marginal mean when σ > 0. The ordinary Poisson model results
when σ = 0. When σ > 0 the marginal distribution is not Poisson, and the
extent to which the variance exceeds the mean increases as σ increases.
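These moment formulas can be verified by simulation; the values of σ and x′β below are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, xb, n = 0.5, 0.5, 400_000

# Poisson counts with a normal random intercept on the log scale:
u = rng.normal(0.0, sigma, size=n)
y = rng.poisson(np.exp(xb + u))

mean_theory = np.exp(xb + sigma**2 / 2)                          # exp(x'b + s^2/2)
var_theory = mean_theory + mean_theory**2 * (np.exp(sigma**2) - 1)
print(y.mean(), mean_theory)   # both near 1.87
print(y.var(), var_theory)     # both near 2.86
```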
As in binary GLMMs, Y_it and Y_is are independent given u_i but are
marginally nonnegatively correlated. For t ≠ s,

    cov(Y_it, Y_is) = E[cov(Y_it, Y_is | u_i)] + cov[E(Y_it | u_i), E(Y_is | u_i)]
                    = 0 + cov[exp(x′_it β + u_i), exp(x′_is β + u_i)].        (13.15)

The functions in the last covariance term are both monotone increasing
functions of u_i, and hence are nonnegatively correlated (Problem 13.44).
13.5.2 Frequency of Knowing Homicide Victims Example
We now return to Table 13.6 on responses, classified by race, of the number
of victims of homicide within the past 12 months that subjects knew personally. Models permitting subject heterogeneity are sensible. For the response
y_it for subject t of race i, the Poisson GLMM is

    log E(Y_it | u_it) = α + β x_it + u_it,

where {u_it} are independent N(0, σ²). The log means vary according to a
N(α, σ²) distribution for whites and a N(α + β, σ²) distribution for blacks.
Given u_it, y_it has a Poisson distribution.
Table 13.6 also shows this model fit, and Table 13.7 shows estimates. The
random effects have σ̂ = 1.63 (SE = 0.15). The deviance decreases by 116.6
compared to the Poisson GLM, indicating a better fit from allowing heterogeneity. For subjects at the means of the random effects distributions (u_it = 0), the
estimated expected responses are exp(−3.69 + 1.90) = 0.167 for blacks and
exp(−3.69) = 0.025 for whites. The fitted marginal mean is exp(α̂ + β̂ x_it +
σ̂²/2), or 0.63 for blacks and 0.09 for whites. The fitted marginal variances
are 5.78 for blacks and 0.21 for whites. These are somewhat larger than the
sample means and variances, perhaps because the fitted distribution has
nonnegligible mass above the largest observed response of 6.
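Using the rounded estimates above (α̂ = −3.69, β̂ = 1.90, σ̂ = 1.63), the fitted marginal moments can be reproduced; because the inputs are rounded, the results only approximate the quoted values:

```python
import math

alpha, beta, sigma = -3.69, 1.90, 1.63

def marginal_moments(x):
    """Marginal mean and variance implied by the Poisson GLMM (13.14)."""
    m = math.exp(alpha + beta * x + sigma**2 / 2)
    v = m + m**2 * (math.exp(sigma**2) - 1)
    return m, v

m_black, v_black = marginal_moments(1)   # roughly 0.63 and 5.9
m_white, v_white = marginal_moments(0)   # roughly 0.09 and 0.21
```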
13.5.3 Negative Binomial Models versus Poisson GLMMs
The Poisson GLMM with normal random effects has the advantage, relative
to the negative binomial GLM, of easily permitting multivariate random
effects and multilevel models. However, the negative binomial has properties
that can make interpretation simpler. We've seen that the identity link is
valid for it, which is useful for simple examples such as the preceding one
with a factor predictor. With any link and a factor predictor, its ML fitted
means equal the sample means. This is not the case for the Poisson GLMM.
Besides the Poisson GLMM and the negative binomial model, an alternative way of accounting for overdispersion with count data is quasi-likelihood
with variance function

    v(μ_i) = φ μ_i,

for some constant φ. This is often adequate for exploratory analyses.
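As an illustration of this idea (my own computation; the text does not report this value), a moment estimate of φ divides the Pearson statistic for the Poisson fit to the homicide data by its residual df:

```python
# Counts of subjects reporting y = 0,...,6 homicide victims (Table 13.6).
counts = {"black": [119, 16, 12, 7, 3, 2, 0],
          "white": [1070, 60, 14, 4, 0, 0, 1]}

pearson, n_total = 0.0, 0
for freqs in counts.values():
    n = sum(freqs)
    mu = sum(y * f for y, f in enumerate(freqs)) / n      # fitted mean = sample mean
    pearson += sum(f * (y - mu) ** 2 / mu for y, f in enumerate(freqs))
    n_total += n

phi_hat = pearson / (n_total - 2)    # df = n minus the 2 model parameters
print(round(phi_hat, 2))             # about 1.75, i.e., variance ~1.75 times the mean
```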
NOTES
Section 13.1: Latent Class Models
13.1. Aitkin et al. (1981), Bartholomew and Knott (1999), Clogg (1995), Clogg and Goodman
(1984), Goodman (1974), Haberman (1979, Chap. 10), Hagenaars (1998), Heinen
(1996), and Lazarsfeld and Henry (1968) discussed fitting and interpretation of latent
class and related latent variable models.
13.2. Rudas et al. (1994) proposed a clever mixture method for summarizing goodness of fit.
For a model M for a contingency table with true probabilities π, they used the mixture
π = (1 − ρ)π₁ + ρπ₂, with π₁ the model-based probabilities and π₂ unconstrained.
Their index of lack of fit is the smallest ρ possible for which this holds. It is the
fraction of the population that cannot be described by the model. This recognizes that
any given model does not truly hold but is useful if ρ is close to 0. The mixture
contrasts with the latent class model in which both π₁ and π₂ correspond to
independence.
Section 13.2: Nonparametric Random Effects Models
13.3. For connections between Rasch-type models and quasi-symmetry models, see Agresti
(1993), Conaway (1989), Darroch (1981), Darroch et al. (1993), Hatzinger (1989), and
Kelderman (1984). For the matched-pairs random effects model (12.16), a nonparametric or conditional ML treatment of (u_i1, u_i2) implies a multivariate quasi-symmetry
model (Agresti 1997). Model (12.16) with correlated normal random effects is a
continuous analog to discrete latent class models that Goodman (1974) proposed, based
on two associated binary latent variables.
Section 13.3: Beta-Binomial Models
13.4. Skellam (1948) introduced the beta-binomial distribution and discussed parameter
estimation. For modeling using this distribution or related quasi-likelihood approaches,
see Brooks et al. (1997), Crowder (1978), Hinde (1996), Lee and Nelder (1996), Liang
and Hanfelt (1994), Liang and McCullagh (1993), Lindsey and Altham (1998), Moore
(1986a), Moore and Tsiatis (1991), Nelder and Pregibon (1987), Prentice (1986), Rosner
(1984, 1989) [with critique by Neuhaus and Jewell (1990a)], Slaton et al. (2000), and
Williams (1975, 1982). For beta-binomial type variance, Ryan (1995) and Williams
(1988) showed advantages of the quasi-likelihood approach over ML. Often, it helps to
permit the quasi-likelihood scale parameter (or the related parameter ρ in the
beta-binomial) to vary among groups.
The beta-binomial generalizes to a Dirichlet-multinomial. Conditional on the probabilities, the distribution is multinomial. The probabilities themselves have a Dirichlet
distribution, which is a generalization of the beta defined on vectors of probabilities
that sum to 1. See Mosimann (1962) and Paul et al. (1989).
13.5. Kupper et al. (1986) and Ryan (1992) discussed modeling overdispersion caused by
litter effects in developmental toxicity studies. See Follman and Lambert (1989),
Kupper and Haseman (1978), and Lefkopoulou et al. (1989) for related material.
Section 13.4: Negative Binomial Regression
13.6. Greenwood and Yule (1920) derived the negative binomial as a gamma mixture of
Poissons. Johnson et al. (1992) summarized its properties. Biggeri (1998), Cameron and
Trivedi (1998), Hinde and Demétrio (1998), and Lawless (1987) discussed modeling
using it.
PROBLEMS
Applications
13.1 For the 2³ table of opinions about legalized abortion (Table 10.13)
collapsed over gender, fit a latent class model with two classes. Show
that it is saturated. For each latent class, report the estimated
probability of supporting legalized abortion in each of the three
situations. Give a tentative interpretation for the classes.
13.2 Analyze Table 8.3 using a latent class model with q = 2.
a. For a subject in the first latent class, estimate the probability of
having used (i) marijuana, (ii) alcohol, (iii) cigarettes, (iv) all three,
and (v) none of them.
b. Estimate the probability a subject is in the first latent class, given
they have used (i) marijuana, (ii) alcohol, (iii) cigarettes, (iv) all
three, and (v) none of them.
13.3 Analyze Table 8.19 on government spending using latent class models.
13.4 For capture–recapture experiments, Coull and Agresti (1999) used a
loglinear model with exchangeable association and no higher-order
terms. Explain why the model expected frequencies satisfy

    log μ(y_1, ..., y_T) = α + β_1 y_1 + ⋯ + β_T y_T + λ(y_1 y_2 + y_1 y_3 + ⋯ + y_{T−1} y_T).

Show that the fit of this model to Table 12.6 yields N̂ = 90.5 and a
95% profile-likelihood confidence interval for N of (75, 125).
13.5 Use or write software to replicate the analyses of the opinions about
abortion data in Section 13.2 using (a) nonparametric random effects
fitting of logit model (13.3), and (b) the quasi-symmetry model.
13.6 A data set on pregnancy rates among girls under 18 years of age in 13
north central Florida counties has information on a 3-year total for
each county i on n_i = number of births and y_i = number of those for
which the mother had age under 18 (see J. Booth, in Statistical Modelling:
Lecture Notes in Statistics, 104, Springer, 43–52, 1995).
a. A beta-binomial model states that given {π_i}, {Y_i} are independent
bin(n_i, π_i) variates, and {π_i} are independent from a beta(α, β)
distribution. The ML estimated parameters are α̂ = 9.9 and β̂ =
240.8 (thanks to J. Booth for this analysis). Use the mean and
variance to describe the estimated beta distribution and the estimated marginal distribution of Y_i (as a function of n_i).
b. Quasi-likelihood using variance function (13.10) for the model
logit(π_i) = α has α̂ = −3.18 and ρ̂ = 0.005. Describe the estimated mean and variance of Y_i.
c. Quasi-likelihood using variance (13.11) for the model logit(π_i) = α
has α̂ = −3.35 and φ̂ = 8.3. Describe the estimated mean and
variance of Y_i.
d. The logistic-normal GLMM, logit(π_i) = α + u_i, yields α̂ = −3.24
and σ̂ = 0.33. Describe the estimated mean of Y_i [recall (12.8)].
13.7 In Problem 12.2 about Shaq O'Neal's free-throw shooting, the simple
binomial model, π_i = α, has lack of fit. Fit the beta-binomial model,
or use the quasi-likelihood approach with that variance structure. Use
the fit to summarize his free-throw shooting, by giving an estimated
mean and standard deviation for π_i.
13.8 For the toxicity study of Table 12.9, collapsing to a binary response,
consider linear logit models for the probability that a fetus is normal.
a. Does the ordinary binomial model show evidence of overdispersion?
b. Fit the linear logit model using the quasi-likelihood approach with
inflated binomial variance. How do the standard errors change?
c. Fit the linear logit model using quasi-likelihood with beta-binomial
variance. Interpret and compare with previous results.
d. Fit the linear logit model using a GEE approach with exchangeable working correlation among fetuses in the same litter. Interpret and compare with previous results, including comparing the
estimated GEE correlation with the estimate ρ̂ from part (c).
e. Fit the linear logit GLMM after adding a litter-specific normal
random effect. Interpret and compare with previous results.
13.9 Extend the various analyses of the teratology data (Table 4.5) in
Section 13.3.3 as follows:
a. Include a predictor for litter size (as well as group). Interpret, and
compare results to those without this predictor.
b. Fit a model with beta-binomial variance (13.10) in which ρ varies
by treatment group. Use results to motivate a model that allows
overdispersion only in the placebo group. Interpret and compare
results to those with common ρ for each group.
13.10 Table 13.8 reports the results of a study of fish hatching under three
environments. Eggs from seven clutches were randomly assigned to
three treatments, and the response was whether an egg hatched by
day 10. The three treatments were (1) carbon dioxide and oxygen
removed, (2) carbon dioxide only removed, and (3) neither removed.
TABLE 13.8 Data for Problem 13.10

            Treatment 1          Treatment 2          Treatment 3
           Number                Number               Number
Clutch     Hatched   Total      Hatched   Total      Hatched   Total
  1           0        6           3        6           0         6
  2           0       13           0       13           0        13
  3           0       10           8       10           6         9
  4           0       16          10       16           9        16
  5           0       32          25       28          23        30
  6           0        7           7        7           5         7
  7           0       21          10       20           4        20

Source: Data courtesy of Becca Hale, Zoology Department, University of Florida.
a. Let π_it denote the probability of hatching for an egg from clutch i
in treatment t. Assuming independent binomial observations, fit
the model

    logit(π_it) = β_1 z_1 + β_2 z_2 + β_3 z_3,

where z_t = 1 for treatment t and 0 otherwise. What does your
software report for β̂_1, and what should it be? (Hint: Note that
treatment 1 has no successes.)
b. Analyze these data using an approach that allows overdispersion.
Interpret. Indicate whether evidence of overdispersion occurs for
treatments 2 and 3.
13.11 For the train accidents in Problem 9.19, a negative binomial model
assuming a constant log rate over the 14-year period has estimate
−4.177 (SE = 0.153) and estimated dispersion parameter 0.012. Interpret.
13.12 One question in the 1990 General Social Survey asked subjects how
many times they had sexual intercourse in the preceding month.
Table 13.9 shows responses, classified by gender.
a. The sample means were 5.9 for males and 4.3 for females; the
sample variances were 54.8 and 34.4. The mode for each gender
was 0. Does an ordinary Poisson GLM seem appropriate? Explain.
b. The Poisson GLM with log link and a dummy variable for gender
(1 = males, 0 = females) has gender estimate 0.308 (SE = 0.038).
Explain why this implies a ratio of 1.36 for the fitted means. (This
is also the ratio of sample means, since this model has fitted means
equal to sample means.) Show that the Wald 95% confidence
interval for the ratio of means for males and females is (1.26, 1.47).
TABLE 13.9 Data for Problem 13.12

Response  Male  Female    Response  Male  Female    Response  Male  Female
   0       65    128          9       2      2         20       7      6
   1       11     17         10      24     13         22       0      1
   2       13     23         12       6     10         23       0      1
   3       14     16         13       3      3         24       1      0
   4       26     19         14       0      1         25       1      3
   5       13     17         15       3     10         27       0      1
   6       15     17         16       3      1         30       3      1
   7        7      3         17       0      1         50       1      0
   8       21     15         18       0      1         60       1      0

Source: 1990 General Social Survey, National Opinion Research Center.
c. For the negative binomial model, the log likelihood increases by
248.7 (deviance decreases by 497.3). The estimated difference
between the log means is also 0.308, but now SE = 0.127. Show
that the 95% confidence interval for the ratio of means is (1.06,
1.75). Compare to the Poisson GLM, and interpret.
d. The mode for the Poisson distribution is the integer part of the
mean, rather than 0. Argue that a possibly more realistic mixture
model assumes for gender i a proportion ρ_i that has a Poisson
distribution with mean 0 and a proportion 1 − ρ_i that has a distribution that is a gamma mixture of Poissons. Explain why the corresponding marginal distribution for each gender is a mixture of a
degenerate distribution at 0 and a negative binomial distribution.
13.13 Refer to Problem 13.12. Fit the Poisson and negative binomial GLMs
using the identity link. Show that the estimated differences in means
between males and females are identical for the two GLMs but the
SE values are very different. Explain why. Use the more appropriate
one to form a confidence interval for the true difference in means.
13.14 For the counts of horseshoe-crab satellites in Table 4.3, Table 13.10
shows the results of ML fitting of the negative binomial model using
width as the predictor, with the identity link.
a. State and interpret the prediction equation.
b. Show that at a predicted μ̂, the estimated variance is roughly
μ̂ + μ̂².
c. The corresponding Poisson GLM has fit μ̂ = −11.53 + 0.55x
(SE = 0.06). Compare 95% confidence intervals for the slopes for
the two models. Interpret, and indicate whether overdispersion
seems to exist relative to the Poisson GLM.
TABLE 13.10 Results for Problem 13.14

                          Standard     Wald 95%
Parameter     Estimate    Error        Confidence Limits     Chi-Square
Intercept    −11.1471     2.8275      −16.6890   −5.6052       15.54
width          0.5308     0.1132        0.3089    0.7528       21.97
Dispersion     0.9843     0.1822        0.6847    1.4149
13.15 Refer to Problem 13.14.
a. Fit a negative binomial model with log link. Interpret. Plot the
counts against width and indicate which link seems more appropriate.
b. Fit a Poisson GLMM with log link, using width as predictor. Interpret.
c. Compare results for the various models, including those in Section
4.3.2 for a Poisson GLM. Indicate your preferred model. Justify.
13.16 Refer to Problems 13.14 and 13.15. Using width and qualitative color
as predictors, fit a (a) negative binomial GLM, and (b) Poisson
GLMM, checking for interaction and interpreting the final model.
13.17 Refer to Table 13.6. For those with race classified as ''other,'' the
sample counts for (0, 1, 2, 3, 4, 5, 6) homicides were (55, 5, 1, 0, 1, 0, 0).
Fit an appropriate model simultaneously to these data and those for
the white and black race categories. Interpret by making pairwise comparisons of the three pairs of means.
13.18 Use a quasi-likelihood approach to analyze Table 13.6 on counts of
murder victims.
13.19 Conduct the analyses of Problem 4.6 on defects in the fabrication of
computer chips, but use a negative binomial GLM. Compare results
to those for the Poisson GLM. Indicate why results are similar.
13.20 With data at the book's Web site (www.stat.ufl.edu/~aa/cda/cda.html),
use methods of this chapter to analyze how the countywide
vote for the Reform Party candidate Pat Buchanan in the 2000
presidential election related to the vote for Reform Party candidate
Ross Perot in the 1996 presidential election. Note that Palm Beach
County is an enormous outlier (apparently mainly reflecting votes
intended for Al Gore but cast for Buchanan because of a confusing
ballot). Model with and without that observation and compare results.
13.21 Conduct a latent class analysis of the data in Espeland and Handelman (1989).
13.22 Refer to the teratology study in Liang and Hanfelt (1994). Analyze
these data using at least two different approaches for overdispersed
binary data. Compare results and interpret.
13.23 Refer to Problem 13.14. Using an appropriate subset of width, weight,
color, and spine condition as predictors, find and interpret a reasonable model for predicting the number of satellites.
Theory and Methods
13.24 Derive the residual df for a latent class model with q latent classes. When
I = 2, for q ≥ 2 show that one needs T ≥ 4 for the model to be unsaturated. Then, find the maximum value for q when T = 4, 5. For an I²
table, show that one needs q < I²/(2I − 1).
13.25 Express the log likelihood for latent class model (13.1) in terms of the
model parameters. Derive the likelihood equations (Goodman 1974;
Haberman 1979).
13.26 Let π denote an I × J matrix of cell probabilities for the joint
distribution of X and Y. Suppose that there exist I × 1 column
vectors π_1k and J × 1 column vectors π_2k of probabilities, k =
1, ..., q, and a set of probabilities {ρ_k} such that

    π = Σ_{k=1}^{q} ρ_k π_1k π′_2k.

Explain why this implies that there is a latent variable Z such that X
and Y are conditionally independent, given Z.
13.27 In Section 13.2.2, under the null hypothesis that the ordinary logistic regression
model holds, explain why it is inappropriate to treat the difference
between the deviances for that model and the mixture of two logistic
regressions as a chi-squared statistic.
13.28 Refer to Problem 12.7. Let μ_k(a, b, c) denote the expected frequency
of outcomes (a, b, c) for treatments (A, B, C) under treatment sequence k, where outcome 1 = relief and 0 = nonrelief. With a nonparametric random effects approach, show that one can estimate
treatment effects in model (12.19) by fitting the quasi-symmetry
model

    log μ_k(a, b, c) = aβ_A + bβ_B + cβ_C + λ_k(a, b, c),

where λ_k(a, b, c) = λ_k(a, c, b) = λ_k(b, a, c) = λ_k(b, c, a) =
λ_k(c, a, b) = λ_k(c, b, a). Fit the model, and show that β̂_B − β̂_A = 1.64
(SE = 0.34), β̂_C − β̂_A = 2.23 (SE = 0.39), and β̂_C − β̂_B = 0.59 (SE =
0.39). Interpret. Compare results with Problem 12.7 for model (12.19).
13.29 Show that the beta-binomial distribution (13.9) simplifies to the
binomial when ρ = 0.
13.30 Express the numerator of the beta density in terms of μ and θ. Using
this, show that the density is (a) unimodal when θ < min(μ, 1 − μ), and (b) the
uniform density when μ = θ = 1/2.
13.31 Suppose that π_i = P(Y_it = 1) = 1 − P(Y_it = 0) for t = 1, ..., n_i, and
corr(Y_it, Y_is) = ρ for t ≠ s. Show that var(Y_it) = π_i(1 − π_i),
cov(Y_it, Y_is) = ρπ_i(1 − π_i), and

    var(Σ_t Y_it) = n_i π_i(1 − π_i)[1 + ρ(n_i − 1)].
13.32 When n = 1, show that the beta-binomial distribution is no different
from the binomial (i.e., Bernoulli). Explain why overdispersion cannot
occur when n = 1.
13.33 When y_i is the sum of n_i binary responses each having mean π_i, refer
to the quasi-likelihood approach with v(μ_i) = φ n_i π_i(1 − π_i). Explain
why this variance function has a structural problem, with only φ = 1
making sense when n_i = 1.
13.34 Liang and Hanfelt (1994) described a teratology study comparing
control and treatment groups in which the ML estimate of the
treatment effect in a beta-binomial model differs by a factor of 2
depending on whether one assumes the same overdispersion parameter for each group. By contrast, with variance function (13.11), the
quasi-likelihood estimate of the treatment effect is the same whether
one assumes the same or different φ for the two groups. Explain why,
and discuss whether this is an advantage or disadvantage of that
method.
13.35 Consider the logistic-normal model, logit(π_i) = α + x′_i β + u_i. For
small σ, show that it corresponds approximately to a mixture model
for which the mixture distribution has var(π_i) = [π_i(1 − π_i)]²σ².
(Hint: See Problem 6.33.)
13.36 Altham (1978) introduced the discrete distribution

    f(y; π, θ) = c(π, θ) (n choose y) π^y (1 − π)^{n−y} exp[θ y(n − y)],    y = 0, 1, ..., n,

where c(π, θ) is a normalizing constant. Show that this is in the
exponential family. Show that the binomial occurs when θ = 0.
[Altham noted that overdispersion occurs when θ < 0. Corcoran
et al. (2001) and Lindsey and Altham (1998) used this as the basis of
an alternative model to the beta-binomial.]
13.37 When y_1, ..., y_N are independent from the negative binomial distribution (13.13) with k fixed, show that μ̂ = ȳ.
13.38 Using E(Y) = E[E(Y|X)] and var(Y) = E[var(Y|X)] +
var[E(Y|X)], derive the mean and variance of the (a) beta-binomial
distribution, and (b) negative binomial distribution.
13.39 Suppose that given u, Y is Poisson with E(Y|u) = uμ, where μ may
depend on predictors. Suppose that u is a positive random variable
with E(u) = 1 and var(u) = τ. Show that E(Y) = μ and var(Y) =
μ + τμ². Explain how negative binomial GLMs and Poisson GLMMs
with log link can follow as special cases.
13.40 An alternative negative binomial parameterization results from the
gamma density formula

    f(λ; k, μ) = [k^{μk}/Γ(μk)] exp(−kλ) λ^{μk−1},    λ ≥ 0,

for which E(λ) = μ and var(λ) = μ/k. Show that this gamma mixture of
Poissons yields a negative binomial with

    E(Y) = μ,    var(Y) = μ(1 + k)/k.

For what limiting value of k does this reduce to the Poisson? [See
Nelder and Lee (1996) for ML model fitting. Cameron and Trivedi
(1998, p. 75) pointed out that, unlike with the quadratic variance function, consistency does not occur for parameter estimators when the model for the
mean holds but the true distribution is not negative binomial.]
13.41 The negative binomial distribution is unimodal with a mode at the
integer part of μ(k − 1)/k (Johnson et al. 1992, pp. 208–209). Show
that the mode is 0 when μ ≤ 1, and that when μ > 1 the mode is still
0 if k < μ/(μ − 1). (This gives greater scope than the Poisson, since
its mode equals the integer part of the mean.)
13.42 Consider the loglinear random effects model

    log E(Y_it | u_i) = x′_it β + z′_it u_i,

where {u_i} are independent N(0, Σ). Show that this implies the
marginal loglinear model

    log E(Y_it) − ½ z′_it Σ z_it = x′_it β,

with the same fixed effects but with an offset term. For the random-intercept case, indicate the role of σ on the size of the offset. Explain
what happens when σ = 0.
13.43 In Section 13.5.1 and Problem 13.42 we saw that for Poisson GLMMs,
the marginal effects are the same as the cluster-specific effects. This
does not imply that ML estimates of effects are the same for a
Poisson GLMM and a Poisson GLM. Explain why. (Hint: For the
GLMM, is the marginal distribution Poisson?)
13.44 For the Poisson GLMM (13.14), use the normal mgf to show that for
t ≠ s,

    cov(Y_it, Y_is) = exp[(x_it + x_is)′β] exp(σ²)[exp(σ²) − 1].

Hence, find corr(Y_it, Y_is).
13.45 Consider a Poisson GLMM using the identity link. Relate the marginal
mean and variance to the conditional mean and variance. Explain the
structural problem that this model has.
CHAPTER 14

Asymptotic Theory for Parametric Models
This chapter has a more theoretical flavor than others. It presents asymptotic
theory for parametric models for categorical data, with emphasis on multinomial models for contingency tables. In Section 14.1 we review and extend the
delta method. This is used to derive large-sample normal distributions for
many statistics. In Section 14.2 we apply the delta method to ML estimation
of parameters in models for contingency tables, later illustrated in Section
14.4 for logit and loglinear models. In Section 14.3 we derive asymptotic
distributions of cell residuals and the X² and G² goodness-of-fit statistics.
The results in this chapter have a long history. Pearson (1900) derived the
asymptotic chi-squared distribution of X² for testing a specified multinomial
distribution. Fisher (1922, 1924) showed the adjustment in degrees of freedom when multinomial probabilities are functions of unknown parameters.
Cramér (1946, pp. 424–434) formally proved this result, under the assumption that ML estimators of the parameters are consistent. Rao (1957) proved
consistency of the ML estimators under general conditions. He also gave the
asymptotic distribution of the ML estimators, although the primary emphasis
of his articles was on proving consistency. Birch (1964a) proved these results
under weaker conditions. Andersen (1980), Bishop et al. (1975), Cox (1984),
Haberman (1974a), and Watson (1959) provided other proofs or considered
related cases.
As in Cramér's and Rao's proofs, our derivation regards the ML estimator
as a point in the parameter space where the derivative of the log likelihood
function is zero. Birch regarded it as a point at which the likelihood takes
value arbitrarily near its supremum. Although his approach is more powerful,
the proofs are more complex. We avoid a formal ''theorem–proof'' style of
exposition. Instead, we show that powerful results follow from simple mathematical ideas, such as Taylor series expansions.
14.1 DELTA METHOD
Suppose that a statistic used as an estimator of a parameter has a large-sample normal distribution. In this section we show that many functions of that statistic are then also asymptotically normal.

14.1.1 O, o Rates of Convergence

Big O and little o notation is useful for describing limiting behavior of sequences. For real numbers $\{z_n\}$, the little o notation $o(z_n)$ represents a term that has smaller order than $z_n$ as $n \to \infty$, in the sense that $o(z_n)/z_n \to 0$ as $n \to \infty$. For instance, $\sqrt{n}$ is $o(n)$ as $n \to \infty$, since $\sqrt{n}/n \to 0$ as $n \to \infty$. A sequence that is $o(1)$ satisfies $o(1)/1 = o(1) \to 0$; for instance, $n^{-1/2}$ is $o(1)$ as $n \to \infty$.

The big O notation $O(z_n)$ represents terms that have the same order of magnitude as $z_n$, in the sense that $O(z_n)/z_n$ is bounded as $n \to \infty$. For instance, $(3/n) + (8/n^2)$ is $O(n^{-1})$ as $n \to \infty$; dividing it by $n^{-1}$ gives a ratio that takes value close to 3 as n increases.
Similar notation applies to sequences of random variables. This notation uses a subscript p to indicate that the sequence has probabilistic rather than deterministic behavior. The symbol $o_p(z_n)$ denotes a random variable of smaller order than $z_n$ for large n, in the sense that $o_p(z_n)/z_n$ converges in probability to 0; that is, for any fixed $\epsilon > 0$, $P(|o_p(z_n)/z_n| \le \epsilon) \to 1$ as $n \to \infty$. The notation $O_p(z_n)$ represents a random variable such that for every $\epsilon > 0$, there is a constant K and an integer $n_0$ such that $P[|O_p(z_n)/z_n| < K] > 1 - \epsilon$ for all $n > n_0$.

To illustrate, let $\bar Y_n$ denote the sample mean of n independent observations $Y_1, \ldots, Y_n$ from a distribution having $E(Y_i) = \mu$. Then $(\bar Y_n - \mu) = o_p(1)$, since $(\bar Y_n - \mu)/1$ converges in probability to zero as $n \to \infty$ by the law of large numbers. By Tchebychev's inequality, the difference between a random variable and its expected value has the same order of magnitude as the standard deviation of that random variable. Since $\bar Y_n - \mu$ has standard deviation $\sigma/\sqrt{n}$, $(\bar Y_n - \mu) = O_p(n^{-1/2})$.

A random variable that is $O_p(n^{-1/2})$ is also $o_p(1)$. An example is $(\bar Y_n - \mu)$. Multiplication affects the order in the way one expects intuitively (Problem 14.1). For instance, $\sqrt{n}(\bar Y_n - \mu) = n^{1/2}O_p(n^{-1/2}) = O_p(n^{1/2}n^{-1/2}) = O_p(1)$. If the difference between two random variables is $o_p(1)$ as $n \to \infty$, Slutzky's theorem states that those random variables have the same limiting distribution.
14.1.2 Delta Method for Function of Random Variable
Let $T_n$ denote a statistic, the subscript expressing its dependence on the sample size n. For large samples, suppose that $T_n$ is approximately normally distributed about $\theta$, with approximate standard error $\sigma/\sqrt{n}$. More precisely, as $n \to \infty$, suppose that the cdf of $\sqrt{n}(T_n - \theta)$ converges to a $N(0, \sigma^2)$ cdf. This limiting behavior is an example of convergence in distribution, denoted
\[ \sqrt{n}(T_n - \theta) \xrightarrow{d} N(0, \sigma^2). \tag{14.1} \]
For a function g, we now derive the limiting distribution of $g(T_n)$. Suppose that g is at least twice differentiable at $\theta$. We use the Taylor series expansion for $g(t)$ in a neighborhood of $\theta$. For some $\theta^*$ between t and $\theta$,
\[ g(t) = g(\theta) + (t - \theta)g'(\theta) + (t - \theta)^2 g''(\theta^*)/2 = g(\theta) + (t - \theta)g'(\theta) + O((t - \theta)^2). \]
Substituting the random variable $T_n$ for t, we have
\[ \sqrt{n}\,[g(T_n) - g(\theta)] = \sqrt{n}(T_n - \theta)g'(\theta) + \sqrt{n}\,O((T_n - \theta)^2) = \sqrt{n}(T_n - \theta)g'(\theta) + O_p(n^{-1/2}) \tag{14.2} \]
since
\[ \sqrt{n}\,O((T_n - \theta)^2) = \sqrt{n}\,O_p(n^{-1}) = O_p(n^{-1/2}). \]
Since the $O_p(n^{-1/2})$ term is asymptotically negligible, $\sqrt{n}[g(T_n) - g(\theta)]$ has the same limiting distribution as $\sqrt{n}(T_n - \theta)g'(\theta)$; that is, $g(T_n) - g(\theta)$ behaves like the constant multiple $g'(\theta)$ of $(T_n - \theta)$. Now, $(T_n - \theta)$ is approximately normal with variance $\sigma^2/n$. Thus, $g(T_n) - g(\theta)$ is approximately normal with variance $\sigma^2[g'(\theta)]^2/n$. More precisely,
\[ \sqrt{n}\,[g(T_n) - g(\theta)] \xrightarrow{d} N(0, \sigma^2[g'(\theta)]^2). \tag{14.3} \]
Figure 3.1 illustrated this result, and in Section 3.1.6 it was applied to the sample logit.

Result (14.3) is called the delta method for obtaining asymptotic distributions. Since $\sigma^2 = \sigma^2(\theta)$ and $g'(\theta)$ usually depends on $\theta$, the asymptotic variance is unknown. Let $\sigma^2(T_n)$ and $g'(T_n)$ denote these terms evaluated at the sample estimator $T_n$ of $\theta$. When $g'(\cdot)$ and $\sigma(\cdot)$ are continuous at $\theta$, $\sigma(T_n)g'(T_n)$ is a consistent estimator of $\sigma(\theta)g'(\theta)$. Thus, confidence intervals and tests use the result that $\sqrt{n}[g(T_n) - g(\theta)]/[\sigma(T_n)g'(T_n)]$ is asymptotically standard normal. For instance,
\[ g(T_n) \pm z_{\alpha/2}\,\sigma(T_n)g'(T_n)/\sqrt{n} \]
is a large-sample $100(1 - \alpha)\%$ confidence interval for $g(\theta)$.
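As a numerical sketch of (14.3), assuming NumPy is available, the following Monte Carlo check compares the delta-method variance of the sample logit (the application from Section 3.1.6) with simulation; the sample size and probability below are hypothetical.

```python
import numpy as np

# Monte Carlo check of the delta method (14.3) for the sample logit:
# with T_n = p_hat for binomial(n, pi), g(t) = log(t/(1-t)) has
# g'(pi) = 1/[pi(1-pi)] and sigma^2 = pi(1-pi), so
# var[g(T_n)] ~ sigma^2 [g'(pi)]^2 / n = 1 / [n pi (1 - pi)].
rng = np.random.default_rng(1)
n, pi = 500, 0.3                         # hypothetical values
p_hat = rng.binomial(n, pi, size=200_000) / n
logit = np.log(p_hat / (1 - p_hat))      # n large enough that p_hat avoids 0, 1
mc_var = logit.var()
delta_var = 1 / (n * pi * (1 - pi))
assert abs(mc_var / delta_var - 1) < 0.05
```

The agreement improves as n grows, consistent with the $O_p(n^{-1/2})$ remainder in (14.2).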
When $g'(\theta) = 0$, (14.3) is uninformative because the limiting variance equals 0. In that case, $\sqrt{n}[g(T_n) - g(\theta)] = o_p(1)$, and higher-order terms in the Taylor series expansion yield the asymptotic distribution (see Note 14.1).
14.1.3 Delta Method for Function of Random Vector

The delta method generalizes to functions of random vectors. Suppose that $T_n = (T_{n1}, \ldots, T_{nN})'$ is asymptotically multivariate normal with mean $\theta = (\theta_1, \ldots, \theta_N)'$ and covariance matrix $\Sigma/n$. Suppose that $g(t_1, \ldots, t_N)$ has a nonzero differential $\phi = (\phi_1, \ldots, \phi_N)'$ at $\theta$, where
\[ \phi_i = \left. \frac{\partial g}{\partial t_i} \right|_{t = \theta}. \]
Then,
\[ \sqrt{n}\,[g(T_n) - g(\theta)] \xrightarrow{d} N(0, \phi'\Sigma\phi). \tag{14.4} \]
For large n, $g(T_n)$ has distribution similar to the normal with mean $g(\theta)$ and variance $\phi'\Sigma\phi/n$.

The proof of (14.4) follows from the expansion
\[ g(T_n) - g(\theta) = \phi'(T_n - \theta) + o(\|T_n - \theta\|), \]
where $\|z\| = (\sum_i z_i^2)^{1/2}$ denotes the length of the vector z. For large n, $g(T_n) - g(\theta)$ behaves like a linear function of the approximately normal random vector $(T_n - \theta)$. Thus, it itself is approximately normal.
14.1.4 Asymptotic Normality of Functions of Multinomial Counts

The delta method for random vectors implies asymptotic normality of many functions of cell counts in contingency tables. Suppose that cell counts $(n_1, \ldots, n_N)$ have a multinomial distribution with cell probabilities $\pi = (\pi_1, \ldots, \pi_N)'$. Let $n = n_1 + \cdots + n_N$, and let $p = (p_1, \ldots, p_N)'$ denote the sample proportions, where $p_i = n_i/n$.

Denote observation i of the n cross-classified in the contingency table by $Y_i = (Y_{i1}, \ldots, Y_{iN})$, where $Y_{ij} = 1$ if it falls in cell j, and $Y_{ij} = 0$ otherwise, $i = 1, \ldots, n$. For instance, $Y_6 = (0, 0, 1, 0, 0, \ldots, 0)$ means that observation 6 is in the third cell of the table. Now, since each observation falls in just one cell, $\sum_j Y_{ij} = 1$ and $Y_{ij}Y_{ik} = 0$ when $j \ne k$. Also, $p_j = \sum_i Y_{ij}/n$, and
\[ E(Y_{ij}) = P(Y_{ij} = 1) = \pi_j = E(Y_{ij}^2), \qquad E(Y_{ij}Y_{ik}) = 0 \text{ if } j \ne k. \]
It follows that
\[ E(Y_i) = \pi \quad \text{and} \quad \mathrm{cov}(Y_i) = \Sigma, \qquad i = 1, \ldots, n, \]
where $\Sigma = (\sigma_{jk})$ with
\[ \sigma_{jj} = \mathrm{var}(Y_{ij}) = E(Y_{ij}^2) - [E(Y_{ij})]^2 = \pi_j(1 - \pi_j), \]
\[ \sigma_{jk} = \mathrm{cov}(Y_{ij}, Y_{ik}) = E(Y_{ij}Y_{ik}) - E(Y_{ij})E(Y_{ik}) = -\pi_j\pi_k \quad \text{for } j \ne k. \]
The matrix $\Sigma$ has form
\[ \Sigma = \mathrm{diag}(\pi) - \pi\pi', \]
where $\mathrm{diag}(\pi)$ is the diagonal matrix with the elements of $\pi$ on the main diagonal.

Since p is a sample mean of n independent observations, namely
\[ p = \frac{\sum_{i=1}^n Y_i}{n}, \]
\[ \mathrm{cov}(p) = [\mathrm{diag}(\pi) - \pi\pi']/n. \tag{14.5} \]
This covariance matrix is singular, because of the linear dependence $\sum_i p_i = 1$. The multivariate central limit theorem (Rao 1973, p. 128) implies
\[ \sqrt{n}\,(p - \pi) \xrightarrow{d} N(0, \mathrm{diag}(\pi) - \pi\pi'). \tag{14.6} \]
By the delta method, functions of p having nonzero differential at $\pi$ are also asymptotically normal. Let $g(t_1, \ldots, t_N)$ be a differentiable function, and let $\phi_i$ denote $\partial g/\partial t_i$ evaluated at $t = \pi$, $i = 1, \ldots, N$. By the delta method (14.4),
\[ \sqrt{n}\,[g(p) - g(\pi)] \xrightarrow{d} N(0, \phi'[\mathrm{diag}(\pi) - \pi\pi']\phi). \tag{14.7} \]
The asymptotic variance equals
\[ \phi'\,\mathrm{diag}(\pi)\,\phi - (\phi'\pi)^2 = \sum_i \phi_i^2\pi_i - \left( \sum_i \phi_i\pi_i \right)^2. \]
In Section 3.1.7 we used this formula to derive the large-sample variance of the sample log odds ratio.
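As a numerical check of (14.7), assuming NumPy is available, the sketch below applies the formula to a hypothetical four-cell multinomial and the sample entropy, a smooth function of p, and compares with simulation.

```python
import numpy as np

# Check of (14.7): for g(p) = -sum_i p_i log p_i (sample entropy),
# phi_i = -(log pi_i + 1), so the asymptotic variance of sqrt(n)[g(p) - g(pi)]
# is sum_i phi_i^2 pi_i - (sum_i phi_i pi_i)^2.
rng = np.random.default_rng(2)
pi = np.array([0.1, 0.2, 0.3, 0.4])      # hypothetical cell probabilities
phi = -(np.log(pi) + 1)
asym_var = (phi**2 * pi).sum() - (phi * pi).sum()**2
n = 2000
p = rng.multinomial(n, pi, size=100_000) / n
p = np.clip(p, 1e-12, None)              # guard the (practically impossible) zero count
entropy = -(p * np.log(p)).sum(axis=1)
mc_var = n * entropy.var()
assert abs(mc_var / asym_var - 1) < 0.05
```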
14.1.5 Delta Method for Vector Function of Random Vector

The delta method generalizes further to a vector of functions of an asymptotically normal random vector. Let $g(t) = (g_1(t), \ldots, g_q(t))'$ and let $(\partial g/\partial\theta)$ denote the $q \times N$ Jacobian matrix for which the entry in row i and column j is $\partial g_i(t)/\partial t_j$ evaluated at $t = \theta$. Then,
\[ \sqrt{n}\,[g(T_n) - g(\theta)] \xrightarrow{d} N(0, (\partial g/\partial\theta)\,\Sigma\,(\partial g/\partial\theta)'). \tag{14.8} \]
The rank of the limiting normal distribution equals the rank of the asymptotic covariance matrix.

Expression (14.8) is useful for finding large-sample joint distributions. For instance, from (14.6), (14.7), and (14.8), the asymptotic distribution of several functions of multinomial proportions has covariance matrix of the form
\[ \mathrm{asymp.\ cov}\{\sqrt{n}\,[g(p) - g(\pi)]\} = J\,[\mathrm{diag}(\pi) - \pi\pi']\,J', \]
where J is the Jacobian $(\partial g/\partial\pi)$.
14.1.6 Joint Asymptotic Normality of Log Odds Ratios

We illustrate formula (14.8) by finding the joint asymptotic distribution of a set of log odds ratios in a contingency table. We use the log scale because convergence to normality is more rapid for it.

Let $g(\pi) = \log(\pi)$ denote the vector of natural logs of cell probabilities, for which
\[ \partial g/\partial\pi = \mathrm{diag}(\pi)^{-1}. \]
The covariance of the asymptotic distribution of $\sqrt{n}\,[\log(p) - \log(\pi)]$ is
\[ \mathrm{diag}(\pi)^{-1}[\mathrm{diag}(\pi) - \pi\pi']\,\mathrm{diag}(\pi)^{-1} = \mathrm{diag}(\pi)^{-1} - \mathbf{1}\mathbf{1}', \]
where $\mathbf{1}$ is an $N \times 1$ vector of 1 elements. For a $q \times N$ matrix of constants C, it follows that
\[ \sqrt{n}\,C[\log(p) - \log(\pi)] \xrightarrow{d} N(0, C\,\mathrm{diag}(\pi)^{-1}C' - C\mathbf{1}\mathbf{1}'C'). \tag{14.9} \]
Now, suppose that $C\log(p)$ is a set of sample log odds ratios. Then, each row of C contains zeros except for two +1 elements and two −1 elements in the positions multiplied by the relevant elements of $\log(p)$ to form the given log odds ratio. The second term in the covariance matrix in (14.9) is then zero. If a particular odds ratio uses the cells numbered h, i, j, and k, the variance of the asymptotic distribution is
\[ \mathrm{asymp.\ var}[\sqrt{n}\,(\text{sample log odds ratio})] = \pi_h^{-1} + \pi_i^{-1} + \pi_j^{-1} + \pi_k^{-1}. \]
When two log odds ratios have no cells in common, their asymptotic covariance in the limiting normal distribution equals zero.
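A small numerical sketch of (14.9), assuming NumPy, for a hypothetical 2x3 table whose two local log odds ratios share two cells:

```python
import numpy as np

# Check of (14.9): joint asymptotic covariance of sample log odds ratios
# C log(p) equals C diag(pi)^{-1} C' (the C 1 1' C' term vanishes because
# each row of C sums to zero). Hypothetical 2x3 table, cells vectorized
# by row: (11, 12, 13, 21, 22, 23).
rng = np.random.default_rng(3)
pi = np.array([0.10, 0.15, 0.20, 0.25, 0.05, 0.25])
# Local log odds ratios: log(pi11 pi22 / pi12 pi21), log(pi12 pi23 / pi13 pi22)
C = np.array([[1.0, -1, 0, -1, 1, 0],
              [0, 1, -1, 0, -1, 1]])
asym_cov = C @ np.diag(1 / pi) @ C.T
n = 5000
counts = rng.multinomial(n, pi, size=100_000)
sample_lor = np.log(counts / n) @ C.T
mc_cov = n * np.cov(sample_lor.T)
assert np.allclose(mc_cov, asym_cov, rtol=0.1)
```

The diagonal entries reproduce $\pi_h^{-1} + \pi_i^{-1} + \pi_j^{-1} + \pi_k^{-1}$; the off-diagonal entry is nonzero here because the two odds ratios share cells.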
14.2 ASYMPTOTIC DISTRIBUTIONS OF ESTIMATORS OF MODEL PARAMETERS AND CELL PROBABILITIES

We now derive basic results of large-sample model-based inference for contingency tables. The delta method is the key tool. The derivations apply to a single multinomial distribution. They extend directly to products of multinomials, when the parameter space stays fixed as the sample size increases.

The observations are counts $(n_1, \ldots, n_N)'$ in N cells of a contingency table. The asymptotics regard N as fixed and let $n = \sum_i n_i \to \infty$. We assume that the vector of counts $np$ has a multinomial distribution with probabilities $\pi = (\pi_1, \ldots, \pi_N)'$. The model is
\[ \pi = \pi(\theta), \]
where $\pi(\theta)$ denotes a function that relates $\pi$ to a smaller number of parameters $\theta = (\theta_1, \ldots, \theta_q)'$.

As $\theta$ ranges over its parameter space, $\pi(\theta)$ ranges over a subset of the space of $\pi$ for N probabilities. Adding components to $\theta$, the model becomes more complex and the space of $\pi$ that satisfy the model is larger. We use $\theta$ and $\pi$ to denote generic parameter and probability values, and $\theta_0 = (\theta_{10}, \ldots, \theta_{q0})'$ and $\pi_0 = (\pi_{10}, \ldots, \pi_{N0})' = \pi(\theta_0)$ to denote true values for a particular application. When the model does not hold, no $\theta_0$ exists for which $\pi(\theta_0) = \pi_0$; that is, $\pi_0$ falls outside the subset of $\pi$ values that is the range of $\pi(\theta)$ for the space of possible $\theta$. We consider this case in Section 14.3.5.

We first derive the asymptotic distribution of the ML estimator $\hat\theta$ of $\theta$. We use that to derive the asymptotic distribution of the model-based ML estimator $\hat\pi = \pi(\hat\theta)$ of $\pi$. The approach follows Rao (1973, Sec. 5e) and Bishop et al. (1975, Secs. 14.7 and 14.8). The assumed regularity conditions are:

1. $\theta_0$ is not on the boundary of the parameter space.
2. All $\pi_{i0} > 0$.
3. $\pi(\theta)$ has continuous first-order partial derivatives in a neighborhood of $\theta_0$.
4. The Jacobian matrix $(\partial\pi/\partial\theta)$ has full rank q at $\theta_0$.

These conditions ensure that $\pi(\theta)$ is locally smooth and one-to-one at $\theta_0$ and Taylor series expansions exist in neighborhoods around $\theta_0$ and $\pi_0$. When the Jacobian does not have full rank, often it does with reformulation of the model using fewer parameters.
14.2.1 Distribution of Model Parameter Estimator

The key to deriving the asymptotic distribution of $\hat\theta$ is to express $\hat\theta$ as a linearized function of p. Then the delta method applies, using the asymptotic normality of p. The linearization has two steps, first relating p to $\hat\pi$ and then $\hat\pi$ to $\hat\theta$.

The kernel of the multinomial log likelihood is
\[ L(\theta) = \log \prod_{i=1}^N \pi_i(\theta)^{n_i} = n\sum_{i=1}^N p_i \log \pi_i(\theta). \]
The likelihood equations are
\[ \frac{\partial L(\theta)}{\partial\theta_j} = n\sum_i \frac{p_i}{\pi_i(\theta)} \frac{\partial\pi_i(\theta)}{\partial\theta_j} = 0, \qquad j = 1, \ldots, q. \tag{14.10} \]
These depend on the functional form $\pi(\theta)$ used in the model. Note that
\[ \sum_i \frac{\partial\pi_i(\theta)}{\partial\theta_j} = \frac{\partial\left[ \sum_i \pi_i(\theta) \right]}{\partial\theta_j} = \frac{\partial(1)}{\partial\theta_j} = 0. \tag{14.11} \]
Let $\partial\hat\pi_i/\partial\hat\theta_j$ represent $\partial\pi_i/\partial\theta_j$ evaluated at $\hat\theta$. Subtracting a common term from both sides of the jth likelihood equation (14.10),
\[ \sum_i n(p_i - \pi_{i0})\,\frac{1}{\hat\pi_i}\frac{\partial\hat\pi_i}{\partial\hat\theta_j} = \sum_i n(\hat\pi_i - \pi_{i0})\,\frac{1}{\hat\pi_i}\frac{\partial\hat\pi_i}{\partial\hat\theta_j}, \tag{14.12} \]
since, from (14.11), $\sum_i n\hat\pi_i(1/\hat\pi_i)(\partial\hat\pi_i/\partial\hat\theta_j) = n\sum_i \partial\hat\pi_i/\partial\hat\theta_j = 0$.

Next we express $\hat\pi$ in terms of $\hat\theta$ using
\[ \hat\pi_i - \pi_{i0} = \sum_k (\hat\theta_k - \theta_{k0})\,\frac{\partial\pi_i^*}{\partial\theta_k}, \]
where $\partial\pi_i^*/\partial\theta_k$ represents $\partial\pi_i/\partial\theta_k$ evaluated at some point falling between $\hat\theta$ and $\theta_0$. Substitution of this into the right-hand side of (14.12) and division of both sides by $\sqrt{n}$ yields, for each j,
\[ \sum_i \sqrt{n}\,(p_i - \pi_{i0})\,\frac{1}{\hat\pi_i}\frac{\partial\hat\pi_i}{\partial\hat\theta_j} = \sum_k \sqrt{n}\,(\hat\theta_k - \theta_{k0}) \left( \sum_i \frac{1}{\hat\pi_i} \frac{\partial\hat\pi_i}{\partial\hat\theta_j} \frac{\partial\pi_i^*}{\partial\theta_k} \right). \tag{14.13} \]
Some notation lets us express more simply the dependence of $\hat\theta$ on p. Let A denote the $N \times q$ matrix having elements
\[ a_{ij} = \pi_{i0}^{-1/2}\,\frac{\partial\pi_i(\theta)}{\partial\theta_{j0}}. \]
The matrix expression for A is
\[ A = \mathrm{diag}(\pi_0)^{-1/2}(\partial\pi/\partial\theta_0), \tag{14.14} \]
where $(\partial\pi/\partial\theta_0)$ denotes the Jacobian $(\partial\pi/\partial\theta)$ evaluated at $\theta_0$. As $\hat\theta$ converges to $\theta_0$, the term in brackets on the right-hand side of (14.13) converges to the element in row j and column k of $A'A$. As $\hat\theta \to \theta_0$, the set of equations (14.13) has the form
\[ A'\,\mathrm{diag}(\pi_0)^{-1/2}\,\sqrt{n}\,(p - \pi_0) = (A'A)\,\sqrt{n}\,(\hat\theta - \theta_0) + o_p(1). \]
Since the Jacobian has full rank at $\theta_0$, $A'A$ is nonsingular. Thus,
\[ \sqrt{n}\,(\hat\theta - \theta_0) = (A'A)^{-1}A'\,\mathrm{diag}(\pi_0)^{-1/2}\,\sqrt{n}\,(p - \pi_0) + o_p(1). \tag{14.15} \]
Now, the asymptotic distribution of p determines that of $\hat\theta$. From (14.6), $\sqrt{n}(p - \pi_0)$ is asymptotically normal, with covariance matrix $[\mathrm{diag}(\pi_0) - \pi_0\pi_0']$. By the delta method, $\sqrt{n}(\hat\theta - \theta_0)$ is also asymptotically normal, with asymptotic covariance matrix
\[ (A'A)^{-1}A'\,\mathrm{diag}(\pi_0)^{-1/2} \times [\mathrm{diag}(\pi_0) - \pi_0\pi_0'] \times \mathrm{diag}(\pi_0)^{-1/2}A\,(A'A)^{-1}. \]
Using (14.11) and (14.14), the term subtracted in this expression disappears because
\[ \pi_0'\,\mathrm{diag}(\pi_0)^{-1/2}A = \pi_0'\,\mathrm{diag}(\pi_0)^{-1/2}\,\mathrm{diag}(\pi_0)^{-1/2}(\partial\pi/\partial\theta_0) = \mathbf{1}'(\partial\pi/\partial\theta_0) = \left( \sum_i \partial\pi_i/\partial\theta_0 \right)' = \mathbf{0}'. \]
Thus, this asymptotic covariance expression for $\sqrt{n}(\hat\theta - \theta_0)$ simplifies to $(A'A)^{-1}$.

In summary, this argument establishes the general result
\[ \sqrt{n}\,(\hat\theta - \theta_0) \xrightarrow{d} N(0, (A'A)^{-1}). \tag{14.16} \]
The asymptotic covariance matrix of $\hat\theta$ depends on $(\partial\pi/\partial\theta_0)$ and hence on the function for modeling $\pi$ in terms of $\theta$. Let $\hat A$ denote A evaluated at the ML estimate $\hat\theta$. The estimated covariance matrix is
\[ \widehat{\mathrm{cov}}(\hat\theta) = (\hat A'\hat A)^{-1}/n. \]
The asymptotic normality and covariance of $\hat\theta$ follow more simply from general results for ML estimators. However, those results require stronger regularity conditions (Rao 1973, p. 364) than the ones assumed here. Suppose that observations are independent from $f(y; \theta)$, some probability mass function. The ML estimator $\hat\theta$ is efficient, in the sense that
\[ \sqrt{n}\,(\hat\theta - \theta) \xrightarrow{d} N(0, \mathcal{I}^{-1}), \]
where $\mathcal{I}$ is the information matrix for a single observation. The (j, k) element of $\mathcal{I}$ is
\[ -E\left( \frac{\partial^2 \log f(y; \theta)}{\partial\theta_j\,\partial\theta_k} \right) = E\left[ \frac{\partial \log f(y; \theta)}{\partial\theta_j}\,\frac{\partial \log f(y; \theta)}{\partial\theta_k} \right]. \]
When f is the probability of a single observation having multinomial probabilities $\{\pi_1(\theta), \ldots, \pi_N(\theta)\}$, this element of $\mathcal{I}$ equals
\[ \sum_{i=1}^N \frac{\partial \log \pi_i(\theta)}{\partial\theta_j}\,\frac{\partial \log \pi_i(\theta)}{\partial\theta_k}\,\pi_i(\theta) = \sum_{i=1}^N \frac{\partial\pi_i(\theta)}{\partial\theta_j}\,\frac{\partial\pi_i(\theta)}{\partial\theta_k}\,\frac{1}{\pi_i(\theta)}. \]
This is the (j, k) element of $A'A$. Thus the asymptotic covariance is $\mathcal{I}^{-1} = (A'A)^{-1}$.

For the results of this section to apply, an ML estimator of $\theta$ must exist and be a solution of the likelihood equations. This requires the following strong identifiability condition: For every $\epsilon > 0$, there exists a $\delta > 0$ such that if $\|\theta - \theta_0\| \ge \epsilon$, then $\|\pi(\theta) - \pi_0\| \ge \delta$. This condition implies a weaker one that two $\theta$ values cannot have the same $\pi$ value. When strong identifiability and the other regularity conditions hold, the probability that an ML estimator is a root of the likelihood equations converges to 1 as $n \to \infty$. That estimator has the asymptotic properties given above of a solution of the likelihood equations. For proofs, see Birch (1964a) and Rao (1973, pp. 360-362).
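A minimal sketch of (14.14)-(14.16), assuming NumPy, for a one-parameter multinomial model; the Hardy-Weinberg model and the genotype counts below are illustrative, not from the text.

```python
import numpy as np

# Sketch of (14.16) for pi(theta) = (theta^2, 2 theta(1-theta), (1-theta)^2).
# The ML estimator is theta_hat = (2 n1 + n2)/(2n), with asymptotic
# variance (A'A)^{-1}/n, and A'A = sum_i (d pi_i/d theta)^2 / pi_i.
counts = np.array([250.0, 500.0, 250.0])          # hypothetical genotype counts
n = counts.sum()
theta_hat = (2 * counts[0] + counts[1]) / (2 * n)
pi = np.array([theta_hat**2,
               2 * theta_hat * (1 - theta_hat),
               (1 - theta_hat)**2])
dpi = np.array([2 * theta_hat,
                2 - 4 * theta_hat,
                -2 * (1 - theta_hat)])             # d pi / d theta
A = dpi / np.sqrt(pi)                              # (14.14), here an N x 1 column
var_theta_hat = 1 / (A @ A) / n                    # (A'A)^{-1}/n
# A'A simplifies algebraically to 2/[theta(1-theta)]:
assert np.isclose(var_theta_hat, theta_hat * (1 - theta_hat) / (2 * n))
```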
14.2.2 Asymptotic Distribution of Cell Probability Estimators

The asymptotic distribution of the model-based estimator $\hat\pi$ follows from the Taylor-series expansion
\[ \hat\pi = \pi(\hat\theta) = \pi(\theta_0) + (\partial\pi/\partial\theta_0)(\hat\theta - \theta_0) + o_p(n^{-1/2}). \tag{14.17} \]
The size of the remainder term follows from $(\hat\theta - \theta_0) = O_p(n^{-1/2})$. Now $\pi(\theta_0) = \pi_0$, and $\sqrt{n}(\hat\theta - \theta_0)$ is asymptotically normal with asymptotic covariance $(A'A)^{-1}$. By the delta method,
\[ \sqrt{n}\,(\hat\pi - \pi_0) \xrightarrow{d} N(0, (\partial\pi/\partial\theta_0)(A'A)^{-1}(\partial\pi/\partial\theta_0)'). \tag{14.18} \]
When the model holds with $\theta$ having $q < N - 1$ elements, $\hat\pi = \pi(\hat\theta)$ is more efficient than the sample proportion p for estimating $\pi$. More generally, for estimating a smooth function $g(\pi)$ of $\pi$, $g(\hat\pi)$ has smaller asymptotic variance than $g(p)$. We next derive this result, discussed in Section 6.4.5. The derivation deletes the Nth component from p and $\hat\pi$, so their covariance matrices are positive definite (Problem 14.16). The Nth proportion is linearly dependent on the first $N - 1$, since they sum to 1. Let $\Sigma = \mathrm{diag}(\pi) - \pi\pi'$ denote the $(N-1) \times (N-1)$ covariance matrix of $\sqrt{n}\,p$. The inverse of $\Sigma$ is
\[ \Sigma^{-1} = \mathrm{diag}(\pi)^{-1} + \frac{\mathbf{1}\mathbf{1}'}{\pi_N}, \tag{14.19} \]
which can be verified by evaluating $\Sigma\Sigma^{-1}$ and showing that it equals the identity matrix.

Let $(\partial g/\partial\pi_0) = (\partial g/\partial\pi_1, \ldots, \partial g/\partial\pi_{N-1})'$, evaluated at $\pi = \pi_0$. By the delta method,
\[ \mathrm{asymp.\ var}[\sqrt{n}\,g(p)] = \left( \frac{\partial g}{\partial\pi_0} \right)'\,\mathrm{cov}(\sqrt{n}\,p)\,\left( \frac{\partial g}{\partial\pi_0} \right) = \left( \frac{\partial g}{\partial\pi_0} \right)'\,\Sigma\,\left( \frac{\partial g}{\partial\pi_0} \right) \]
and
\[ \mathrm{asymp.\ var}[\sqrt{n}\,g(\hat\pi)] = \left( \frac{\partial g}{\partial\pi_0} \right)' \left( \frac{\partial\pi}{\partial\theta_0} \right)\,\mathrm{asymp.\ cov}(\sqrt{n}\,\hat\theta)\,\left( \frac{\partial\pi}{\partial\theta_0} \right)' \left( \frac{\partial g}{\partial\pi_0} \right). \]
Using (14.11) and (14.19) yields
\[ \mathrm{asymp.\ cov}(\sqrt{n}\,\hat\theta) = (A'A)^{-1} = \left[ (\partial\pi/\partial\theta_0)'\,\mathrm{diag}(\pi_0)^{-1}\,(\partial\pi/\partial\theta_0) \right]^{-1} = \left[ (\partial\pi/\partial\theta_0)'\,\Sigma^{-1}\,(\partial\pi/\partial\theta_0) \right]^{-1}. \]
Since $\Sigma$ is positive definite and $(\partial\pi/\partial\theta_0)$ has rank q, $\Sigma^{-1}$ and $[(\partial\pi/\partial\theta_0)'\Sigma^{-1}(\partial\pi/\partial\theta_0)]^{-1}$ are also positive definite.
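The inverse formula (14.19) is easy to verify numerically; a minimal sketch, assuming NumPy and a hypothetical four-cell $\pi$:

```python
import numpy as np

# Check of (14.19): for the first N-1 components of a multinomial,
# Sigma = diag(pi) - pi pi' has inverse diag(pi)^{-1} + (1 1')/pi_N.
pi_full = np.array([0.2, 0.3, 0.4, 0.1])   # hypothetical, N = 4
pi = pi_full[:-1]                           # drop the linearly dependent Nth cell
Sigma = np.diag(pi) - np.outer(pi, pi)
Sigma_inv = np.diag(1 / pi) + np.ones((3, 3)) / pi_full[-1]
assert np.allclose(Sigma @ Sigma_inv, np.eye(3))
```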
To show that $\mathrm{asymp.\ var}[\sqrt{n}\,g(p)] \ge \mathrm{asymp.\ var}[\sqrt{n}\,g(\hat\pi)]$, we show that
\[ \left( \frac{\partial g}{\partial\pi_0} \right)' \left\{ \Sigma - \left( \frac{\partial\pi}{\partial\theta_0} \right) \left[ \left( \frac{\partial\pi}{\partial\theta_0} \right)'\Sigma^{-1}\left( \frac{\partial\pi}{\partial\theta_0} \right) \right]^{-1} \left( \frac{\partial\pi}{\partial\theta_0} \right)' \right\} \left( \frac{\partial g}{\partial\pi_0} \right) \ge 0. \]
But this quadratic form is identical to
\[ (Y - B\lambda)'\,\Sigma^{-1}\,(Y - B\lambda), \]
where $Y = \Sigma(\partial g/\partial\pi_0)$, $B = (\partial\pi/\partial\theta_0)$, and $\lambda = (B'\Sigma^{-1}B)^{-1}B'\Sigma^{-1}Y$. The result then follows from the positive definiteness of $\Sigma^{-1}$.

This proof is based on one given by Altham (1984). Her proof uses standard properties of ML estimators. It applies whenever regularity conditions hold that guarantee those properties. The proof applies not only to categorical data but to any situation in which a model describes the dependence of a set of parameters $\pi$ on some smaller set $\theta$.
14.3 ASYMPTOTIC DISTRIBUTIONS OF RESIDUALS AND GOODNESS-OF-FIT STATISTICS

We next study the distribution of the Pearson $X^2$ and likelihood-ratio $G^2$ goodness-of-fit statistics for the multinomial model $\pi = \pi(\theta)$. We first derive the asymptotic joint distribution of the sample proportions p and the model-based estimator $\hat\pi$. This distribution determines large-sample distributions of statistics that depend on both p and $\hat\pi$. For instance, it determines the asymptotic joint distribution of the Pearson residuals, which compare p with $\hat\pi$. Deriving the large-sample chi-squared distribution for $X^2$, which is the sum of squared Pearson residuals, is then straightforward. We also show that $X^2$ and $G^2$ are asymptotically equivalent when the model holds. Our presentation borrows from Bishop et al. (1975, Chap. 14), Cox (1984), Cramér (1946, pp. 432-433), and Rao (1973, Sec. 6b).
14.3.1 Joint Asymptotic Normality of p and $\hat\pi$

We first express the joint dependence of p and $\hat\pi$ on p, in order to show the joint asymptotic normality of p and $\hat\pi$. Let
\[ D = \mathrm{diag}(\pi_0)^{1/2}A(A'A)^{-1}A'\,\mathrm{diag}(\pi_0)^{-1/2}. \]
From (14.15) and (14.17),
\[ \hat\pi - \pi_0 = (\partial\pi/\partial\theta_0)(\hat\theta - \theta_0) + o_p(n^{-1/2}) = D(p - \pi_0) + o_p(n^{-1/2}). \]
Therefore,
\[ \sqrt{n}\begin{pmatrix} p - \pi_0 \\ \hat\pi - \pi_0 \end{pmatrix} = \begin{pmatrix} I \\ D \end{pmatrix}\sqrt{n}\,(p - \pi_0) + o_p(1), \]
where I is an $N \times N$ identity matrix. By the delta method,
\[ \sqrt{n}\begin{pmatrix} p - \pi_0 \\ \hat\pi - \pi_0 \end{pmatrix} \xrightarrow{d} N(0, \Sigma^*), \tag{14.20} \]
where
\[ \Sigma^* = \begin{pmatrix} \mathrm{diag}(\pi_0) - \pi_0\pi_0' & [\mathrm{diag}(\pi_0) - \pi_0\pi_0']D' \\ D[\mathrm{diag}(\pi_0) - \pi_0\pi_0'] & D[\mathrm{diag}(\pi_0) - \pi_0\pi_0']D' \end{pmatrix}. \tag{14.21} \]
The two matrix blocks on the main diagonal of $\Sigma^*$ are $\mathrm{cov}(\sqrt{n}\,p)$ and $\mathrm{asymp.\ cov}(\sqrt{n}\,\hat\pi)$, derived previously. The new information here is that $\mathrm{asymp.\ cov}(\sqrt{n}\,p, \sqrt{n}\,\hat\pi) = [\mathrm{diag}(\pi_0) - \pi_0\pi_0']D'$.
14.3.2 Asymptotic Distribution of Pearson and Standardized Residuals

For cell counts $\{n_i\}$ the Pearson statistic is $X^2 = \sum_i e_i^2$, where
\[ e_i = \frac{n_i - n\hat\pi_i}{(n\hat\pi_i)^{1/2}} = \frac{\sqrt{n}\,(p_i - \hat\pi_i)}{\hat\pi_i^{1/2}}. \]
We next derive the asymptotic distribution of $e = (e_1, \ldots, e_N)'$, which is a diagnostic measure of lack of fit. For Poisson models it is the Pearson residual. Dividing it by its standard error gives the standardized residual. The distribution of e is also helpful in deriving the distribution of $X^2$.

The residuals e are functions of p and $\hat\pi$, which are jointly asymptotically normal from (14.20). To use the delta method, we calculate
\[ \partial e_i/\partial p_i = \sqrt{n}\,\hat\pi_i^{-1/2}, \qquad \partial e_i/\partial\hat\pi_i = -\sqrt{n}\,(p_i + \hat\pi_i)/2\hat\pi_i^{3/2}, \]
\[ \partial e_i/\partial p_j = \partial e_i/\partial\hat\pi_j = 0 \quad \text{for } i \ne j. \]
That is,
\[ \frac{\partial e}{\partial p} = \sqrt{n}\,\mathrm{diag}(\hat\pi)^{-1/2} \quad \text{and} \quad \frac{\partial e}{\partial\hat\pi} = -\tfrac{1}{2}\sqrt{n}\,[\mathrm{diag}(p) + \mathrm{diag}(\hat\pi)]\,\mathrm{diag}(\hat\pi)^{-3/2}. \tag{14.22} \]
Evaluated at $p = \pi_0$ and $\hat\pi = \pi_0$, these matrices equal $\sqrt{n}\,\mathrm{diag}(\pi_0)^{-1/2}$ and $-\sqrt{n}\,\mathrm{diag}(\pi_0)^{-1/2}$. Using (14.21), (14.22), and $A'\pi_0^{1/2} = 0$ [which follows from (14.11)], the delta method implies that
\[ e \xrightarrow{d} N(0, I - \pi_0^{1/2}(\pi_0^{1/2})' - A(A'A)^{-1}A'). \tag{14.23} \]
The limiting distribution has form $N(0, I - \mathrm{Hat})$, where Hat is the hat matrix (Section 4.5.5). Although asymptotically normal, e behaves less variably than standard normal random variables. The standardized Pearson residual (Haberman 1973a) divides $e_i$ by its estimated standard error. This statistic, which is asymptotically standard normal, equals
\[ r_i = \frac{e_i}{\left[ 1 - \hat\pi_i - \sum_j \sum_k (1/\hat\pi_i)(\partial\hat\pi_i/\partial\theta_j)\,\hat v_{jk}\,(\partial\hat\pi_i/\partial\theta_k) \right]^{1/2}}, \tag{14.24} \]
where $\hat v_{jk}$ denotes the element in row j and column k of $(\hat A'\hat A)^{-1}$. The denominator of $r_i$ is $\sqrt{1 - \hat h_i}$, where the leverage $\hat h_i$ for observation i estimates the ith diagonal element of the hat matrix. This simplifies to (3.13) for testing independence in two-way tables.
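As a numerical sketch, assuming NumPy, the following computes (14.24) for the independence model in a hypothetical 2x2 table, building the leverage from the hat matrix of (14.23); the parameterization (free row and column probabilities) is one choice among several that give the same hat matrix.

```python
import numpy as np

def standardized_residuals(table):
    """Standardized Pearson residuals (14.24) for the independence model,
    with leverages from the hat matrix sqrt(pi)sqrt(pi)' + A(A'A)^{-1}A'."""
    n = table.sum()
    p = table / n
    r, c = p.sum(axis=1), p.sum(axis=0)      # row/column margins = ML estimates
    pihat = np.outer(r, c)                   # fitted cell probabilities
    I, J = table.shape
    q = (I - 1) + (J - 1)
    # Jacobian of vec(pi) w.r.t. free parameters (alpha_1..alpha_{I-1},
    # beta_1..beta_{J-1}), where pi_ij = alpha_i beta_j and alpha_I, beta_J
    # are determined by the sum-to-1 constraints
    Jac = np.zeros((I * J, q))
    for i in range(I):
        for j in range(J):
            row = i * J + j
            for k in range(I - 1):
                Jac[row, k] = ((i == k) - (i == I - 1)) * c[j]
            for m in range(J - 1):
                Jac[row, (I - 1) + m] = ((j == m) - (j == J - 1)) * r[i]
    A = Jac / np.sqrt(pihat.reshape(-1, 1))  # (14.14)
    root = np.sqrt(pihat.ravel())
    hat = np.outer(root, root) + A @ np.linalg.solve(A.T @ A, A.T)
    lev = np.diag(hat).reshape(I, J)
    e = (table - n * pihat) / np.sqrt(n * pihat)   # Pearson residuals
    return e / np.sqrt(1 - lev), lev

table = np.array([[25.0, 15.0], [10.0, 30.0]])     # hypothetical counts
r_std, lev = standardized_residuals(table)
```

For independence, the leverage satisfies $1 - \hat h_{ij} = (1 - p_{i+})(1 - p_{+j})$, reproducing (3.13).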
14.3.3 Asymptotic Distribution of Pearson Statistic

The proof that the Pearson $X^2$ statistic has an asymptotic chi-squared distribution uses the following relationship between normal and chi-squared distributions (Rao 1973, p. 188):

Let X be multivariate normal with mean $\mu$ and covariance matrix B. A necessary and sufficient condition for $(X - \mu)'C(X - \mu)$ to have a chi-squared distribution is BCBCB = BCB. The degrees of freedom equal the rank of CB. When B is nonsingular, the condition simplifies to CBC = C.

The Pearson statistic relates to e by $X^2 = e'e$, so we apply this result by identifying X with e, $\mu = 0$, C = I, and $B = \{I - \pi_0^{1/2}(\pi_0^{1/2})' - A(A'A)^{-1}A'\}$. Since C = I, the condition for $(X - \mu)'C(X - \mu) = e'e = X^2$ to have a chi-squared distribution simplifies to BBB = BB. A direct computation using $A'\pi_0^{1/2} = 0$ shows that B is idempotent, so the condition holds. Since e is asymptotically multivariate normal, $X^2$ is asymptotically chi-squared.

For symmetric idempotent matrices, the rank equals the trace. The trace of I is N; the trace of $\pi_0^{1/2}(\pi_0^{1/2})'$ equals the trace of $(\pi_0^{1/2})'\pi_0^{1/2} = \sum_i \pi_{i0} = 1$, which is 1; the trace of $A(A'A)^{-1}A'$ equals the trace of $(A'A)^{-1}(A'A)$ = identity matrix of size $q \times q$, which is q. Thus, the rank of B = CB is $N - q - 1$, and the asymptotic chi-squared distribution has df = $N - q - 1$.

This result, due to Fisher (1922), is remarkably simple. When the sample size is large, the distribution of $X^2$ does not depend on $\pi_0$ or the model form. It depends only on the difference between the dimension of $\pi$, which is $N - 1$, and the dimension of $\theta$. With q = 0 parameters, $X^2$ is Pearson's (1900) statistic (1.15) for testing that multinomial probabilities equal certain specified values, and df = $N - 1$ as Pearson claimed. Watson (1959) showed that the same result holds for the asymptotic conditional distribution, given a sufficient statistic for nuisance parameters.
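A quick Monte Carlo illustration, assuming NumPy and hypothetical probabilities: for q = 0 (a fully specified multinomial), the mean of $X^2$ should be near df = $N - 1$ (in fact the exact mean of $X^2$ is $N - 1$ here).

```python
import numpy as np

# X^2 for testing specified multinomial probabilities (q = 0 parameters)
# is asymptotically chi-squared with df = N - 1, so its mean is near N - 1.
rng = np.random.default_rng(7)
pi0 = np.array([0.1, 0.2, 0.3, 0.4])               # N = 4 specified probabilities
n = 1000
counts = rng.multinomial(n, pi0, size=50_000)
X2 = ((counts - n * pi0)**2 / (n * pi0)).sum(axis=1)
assert abs(X2.mean() - 3) < 0.1                    # df = N - 1 = 3
```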
14.3.4 Asymptotic Distribution of Likelihood-Ratio Statistic

When the model holds, the likelihood-ratio statistic $G^2$ is asymptotically equivalent to $X^2$ as $n \to \infty$. To show this, we express
\[ G^2 = 2\sum_i n_i \log\frac{n_i}{n\hat\pi_i} = 2n\sum_i p_i \log\left( 1 + \frac{p_i - \hat\pi_i}{\hat\pi_i} \right) \]
and apply the expansion
\[ \log(1 + x) = x - x^2/2 + x^3/3 - \cdots \quad \text{for } |x| < 1. \]
We identify x with $(p_i - \hat\pi_i)/\hat\pi_i$, which converges in probability to 0 when the model holds. For large n,
\[ G^2 = 2n\sum_i [\hat\pi_i + (p_i - \hat\pi_i)] \left[ \frac{p_i - \hat\pi_i}{\hat\pi_i} - \frac{1}{2}\frac{(p_i - \hat\pi_i)^2}{\hat\pi_i^2} + O_p(|p_i - \hat\pi_i|^3) \right] \]
\[ = 2n\sum_i \left[ (p_i - \hat\pi_i) - \frac{1}{2}\frac{(p_i - \hat\pi_i)^2}{\hat\pi_i} + \frac{(p_i - \hat\pi_i)^2}{\hat\pi_i} \right] + 2nO_p(n^{-3/2}) \]
\[ = n\sum_i \frac{(p_i - \hat\pi_i)^2}{\hat\pi_i} + O_p(n^{-1/2}) = X^2 + o_p(1), \]
since $\sum_i (p_i - \hat\pi_i) = 0$ and $(p_i - \hat\pi_i) = (p_i - \pi_i) - (\hat\pi_i - \pi_i)$, both of which are $O_p(n^{-1/2})$. Thus, when the model holds, the difference between $X^2$ and $G^2$ converges in probability to 0. As a consequence, $G^2$, like $X^2$, has an asymptotic chi-squared distribution with df = $N - q - 1$.
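A numerical illustration of the equivalence, assuming NumPy and a hypothetical specified multinomial: for large n the two statistics nearly agree, with a gap of order $n^{-1/2}$.

```python
import numpy as np

# When the model holds, G^2 = X^2 + o_p(1): compare the two statistics
# for one large sample from the hypothesized multinomial.
rng = np.random.default_rng(8)
pi0 = np.array([0.1, 0.2, 0.3, 0.4])
n = 100_000
counts = rng.multinomial(n, pi0)
fitted = n * pi0
X2 = ((counts - fitted)**2 / fitted).sum()
G2 = 2 * (counts * np.log(counts / fitted)).sum()
assert abs(G2 - X2) < 0.1
```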
The parameter value that maximizes the likelihood is the one that minimizes $G^2$. To show this, we let
\[ G^2(\pi; p) = 2n\sum_i p_i \log(p_i/\pi_i). \]
The kernel of the multinomial log likelihood is
\[ L(\theta) = n\sum_i p_i \log \pi_i(\theta) = -n\sum_i p_i \log\frac{p_i}{\pi_i(\theta)} + n\sum_i p_i \log p_i = -\tfrac{1}{2}G^2(\pi(\theta); p) + n\sum_i p_i \log p_i. \]
The second term in the last expression does not depend on $\theta$, so maximizing $L(\theta)$ is equivalent to minimizing $G^2$ with respect to $\theta$.

A fundamental result for $G^2$ concerns comparisons of nested models. Suppose that model $M_0$ is a special case of model $M_1$. Let $q_0$ and $q_1$ denote the numbers of parameters in the two models. Let $\{\hat\pi_{0i}\}$ and $\{\hat\pi_{1i}\}$ denote ML estimators of cell probabilities for the two models. Then
\[ G^2(M_0) - G^2(M_1) = 2n\sum_i p_i \log(\hat\pi_{1i}/\hat\pi_{0i}) \]
has the form of −2(log likelihood ratio) for testing that $M_0$ holds against the alternative that $M_1$ holds. Theory for likelihood-ratio tests suggests that when the simpler model holds, the asymptotic distribution of $G^2(M_0) - G^2(M_1)$ is chi-squared with $q_1 - q_0$ degrees of freedom. For details, see Bishop et al. (1975, pp. 525-526), Haberman (1974a, p. 108), and Rao (1973, pp. 418-419). The statistic $X^2(M_0 \mid M_1)$ defined in (9.4) is a quadratic approximation for the $G^2$ difference. Haberman (1977a) noted that these tests can perform well even for large, sparse tables, as long as $q_1 - q_0$ is small compared to the sample size and no expected frequency has larger order of magnitude than the others.
14.3.5 Asymptotic Noncentral Distributions

Results in this chapter assume that a certain parametric model holds. In practice, any unsaturated model almost surely does not hold perfectly, so one might question the scope of these results. This is not problematic if we regard models merely as convenient approximations for reality. For instance, the ML estimator $\hat\theta$ converges to a value $\theta_0$ that describes the best fit of the chosen model to reality. In this sense, inferences for $\theta$ give us information about a useful approximation for reality. Similarly, model-based inferences about cell probabilities are inconsistent for the true probabilities when the model does not hold; nevertheless, those inferences are consistent for describing a useful smoothing of reality.

For goodness-of-fit statistics, a relevant distinction exists between limiting behavior when the model holds and when it does not hold. When the model holds, we've seen that $X^2$ and $G^2$ have a limiting chi-squared distribution, and the difference between them disappears as n increases. When the model does not hold, $X^2$ and $G^2$ tend to grow unboundedly as n increases, and $X^2 - G^2$ need not go to zero. One method for obtaining proper limiting distributions considers a sequence of situations $\{\pi_n\}$ for which the lack of fit diminishes as n increases. Specifically, the model is $\pi = f(\theta)$, but in reality
\[ \pi_n = f(\theta) + \delta/\sqrt{n}. \tag{14.25} \]
The best fit of the model to the population has ith probability equal to $f_i(\theta)$, but the true value differs from that by $\delta_i/\sqrt{n}$.

For this representation, Mitra (1958) showed that the Pearson $X^2$ has a limiting noncentral chi-squared distribution, with df = $N - q - 1$ and noncentrality parameter
\[ \lambda_n = n\sum_{i=1}^N \frac{[\pi_{ni} - f_i(\theta)]^2}{f_i(\theta)}. \]
This has the form of $X^2$, with the sample values $p_i$ and $\hat\pi_i$ replaced by population values $\pi_{ni}$ and $f_i(\theta)$. Similarly, the noncentrality of the likelihood-ratio statistic has the form of $G^2$, with the same substitution. Haberman (1974a, pp. 109-112) showed that under certain conditions $G^2$ and $X^2$ have the same limiting distribution; that is, their noncentrality values converge to a common value as $n \to \infty$.

Representation (14.25) means that for large n, the noncentral chi-squared approximation is valid when the model is just barely incorrect. In practice, it is often reasonable to adopt (14.25) for fixed, finite n to approximate the distribution of $X^2$, even though (14.25) would not be plausible as we obtain more data. The alternative representation
\[ \pi = f(\theta) + \delta, \tag{14.26} \]
in which $\pi$ differs from $f(\theta)$ by a fixed amount as $n \to \infty$, may seem more natural. In fact, this is more appropriate than (14.25) for proving the test to be consistent (i.e., for convergence to 1 of the probability of rejecting the hypothesis that the model holds). For (14.26), however, the noncentrality parameter grows unboundedly as $n \to \infty$, and a proper limiting distribution does not result for $X^2$ and $G^2$.

When the model holds, $\delta = 0$ in either representation (14.25) or (14.26). That is, $f(\theta) = \pi(\theta)$, $\lambda = 0$, and the results in Sections 14.3.3 and 14.3.4 apply.
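The noncentral representation underlies power approximations for chi-squared tests. A minimal sketch, assuming NumPy and SciPy, with hypothetical null and true probabilities (here q = 0, so df = $N - 1$):

```python
import numpy as np
from scipy.stats import chi2, ncx2

# Power approximation via the noncentral chi-squared distribution:
# noncentrality lambda = n * sum_i (pi_i - f_i)^2 / f_i, df = N - 1 when
# the null probabilities f are fully specified.
f = np.array([0.25, 0.25, 0.25, 0.25])    # hypothesized probabilities
pi = np.array([0.30, 0.25, 0.25, 0.20])   # hypothetical true probabilities
n = 500
df = len(f) - 1
lam = n * ((pi - f)**2 / f).sum()
crit = chi2.ppf(0.95, df)                 # alpha = 0.05 critical value
power = ncx2.sf(crit, df, lam)            # approximate P(reject | pi)
```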
14.4 ASYMPTOTIC DISTRIBUTIONS FOR LOGIT/LOGLINEAR MODELS

For loglinear models, formulas in Section 8.6 for the asymptotic covariance matrices of β̂ and π̂ are special cases of ones derived in Section 14.2. We present these for the multinomial form of the models, which relates directly to that section. Then we discuss the connection to Poisson loglinear models.

To constrain probabilities to sum to 1, we express loglinear models for multinomial sampling as

π = exp(Xβ) / 1′exp(Xβ),    (14.27)

where X is a model matrix and 1′ = (1, . . . , 1). Letting x_i denote row i of X,

π_i = π_i(β) = exp(x_i β) / Σ_k exp(x_k β).
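A direct numerical check of this construction may help. The sketch below (a minimal illustration; the model matrix and parameter values are made up, not taken from the text) computes the cell probabilities π_i(β) = exp(x_i β)/Σ_k exp(x_k β):

```python
import numpy as np

def multinomial_probs(X, beta):
    """Cell probabilities pi_i = exp(x_i' beta) / sum_k exp(x_k' beta),
    as in the multinomial loglinear model (14.27)."""
    eta = X @ beta
    eta = eta - eta.max()        # subtract the max for numerical stability
    w = np.exp(eta)
    return w / w.sum()

# Illustrative 4-cell model with two columns (hypothetical design)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.0, 0.0]])
beta = np.array([0.3, -0.4])
pi = multinomial_probs(X, beta)   # probabilities sum to 1 by construction
```

Subtracting the maximum linear predictor before exponentiating leaves the ratio unchanged but avoids overflow for large |x_i β|.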
14.4.1 Asymptotic Covariance Matrices

A model affects covariance matrices through the Jacobian. Since

∂π_i/∂β_j = [exp(x_i β) x_{ij} Σ_k exp(x_k β) − exp(x_i β) Σ_k x_{kj} exp(x_k β)] / [Σ_k exp(x_k β)]²
          = π_i (x_{ij} − Σ_k π_k x_{kj}),

the matrix of these elements has the form

∂π/∂β = [diag(π) − ππ′]X.

Using this with (14.14) and (14.16), the information matrix at β₀ is

A′A = (∂π/∂β₀)′ [diag(π₀)]⁻¹ (∂π/∂β₀)
    = X′[diag(π₀) − π₀π₀′][diag(π₀)]⁻¹[diag(π₀) − π₀π₀′]X
    = X′[diag(π₀) − π₀π₀′]X.
Thus, for multinomial loglinear models, β̂ is asymptotically normally distributed with estimated covariance matrix

côv(β̂) = {X′[diag(π̂) − π̂π̂′]X}⁻¹ / n.    (14.28)
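Formula (14.28) is easy to evaluate numerically. The following sketch (with a hypothetical π̂ and X; note the model column is linearly independent of 1, a requirement discussed in Section 14.4.2) assembles {X′[diag(π̂) − π̂π̂′]X}⁻¹/n:

```python
import numpy as np

def loglinear_cov(X, pi_hat, n):
    """Estimated asymptotic covariance (14.28) of beta-hat:
    {X'[diag(pi) - pi pi']X}^{-1} / n."""
    V = np.diag(pi_hat) - np.outer(pi_hat, pi_hat)   # multinomial covariance
    return np.linalg.inv(X.T @ V @ X) / n

# Hypothetical 3-cell example with a single model column
X = np.array([[1.0], [0.0], [-1.0]])
pi_hat = np.array([0.2, 0.5, 0.3])
cov_beta = loglinear_cov(X, pi_hat, n=100)
```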
Similarly, from (14.23) the estimated asymptotic covariance matrix of π̂ is

côv(π̂) = [diag(π̂) − π̂π̂′]X {X′[diag(π̂) − π̂π̂′]X}⁻¹ X′[diag(π̂) − π̂π̂′] / n.

From (14.23), the Pearson residuals e are asymptotically normal with

asymp. cov(e) = I − π₀^{1/2}(π₀^{1/2})′ − A(A′A)⁻¹A′
             = I − π₀^{1/2}(π₀^{1/2})′ − [diag(π₀)]^{−1/2}[diag(π₀) − π₀π₀′]X
               × {X′[diag(π₀) − π₀π₀′]X}⁻¹ X′[diag(π₀) − π₀π₀′][diag(π₀)]^{−1/2}.

14.4.2 Connection with Poisson Loglinear Models

This book expressed loglinear models in terms of Poisson expected cell frequencies μ = (μ₁, . . . , μ_N)′, using formulas of the form

log μ = X_a β_a.    (14.29)

The model matrix X_a and parameter vector β_a in this formula are slightly different from X and β in multinomial model (14.27). The Poisson expression (14.29) does not have constraints on μ. For multinomial model (14.27), Σ_i μ_i = n is fixed, and π = μ/n satisfies

log μ = log(nπ) = Xβ + [log n − log(1′exp(Xβ))]1 = Xβ + α1,

where α = log n − log[1′exp(Xβ)]. In other words, multinomial model (14.27) implies Poisson model (14.29) with

X_a = [1 : X]  and  β_a = (α, β′)′.

The columns of X in the multinomial representation must be linearly independent of 1; that is, the parameter α, which relates to the total sample size, does not appear in β. The dimension of β is 1 less than the number of parameters reported in this text for Poisson loglinear models. For instance, for the saturated model, β has N − 1 elements for the multinomial representation, reflecting the sole constraint on π of Σ_i π_i = 1.
NOTES

Section 14.1: Delta Method

14.1. For detailed discussion of large-sample theory, including the delta method, see Bishop et al. (1975, Chap. 14) and Sen and Singer (1993).

14.2. In applying the delta method to a function g of an asymptotically normal random vector T_n, suppose that the first-order, . . . , (a − 1)st-order differentials of the function are zero at θ but the ath-order differential is nonzero. A generalization of the delta method implies that n^{a/2}[g(T_n) − g(θ)] has a limiting distribution involving products of order a of components of a normal random vector. When a = 2, the limiting distribution is a quadratic form in a multivariate normal vector, which often relates to a chi-squared distribution; in the univariate case, it is [σ² g″(θ)/2] times a χ₁² variable (Casella and Berger 2001, p. 244).

Resampling methods such as the jackknife and the bootstrap are alternative tools for estimating standard errors and obtaining confidence intervals. They can be helpful when use of the delta method is questionable: for instance, for small samples, highly sparse data, or complex sampling designs. For details, see Davison and Hinkley (1997), Fay (1985), Parr and Tolley (1982), and Simonoff (1986).
Section 14.3: Asymptotic Distributions of Residuals and Goodness-of-Fit Statistics

14.3. If Y is Poisson with E(Y) = μ, then for large μ the delta method implies that √Y is approximately normal with standard deviation 1/2. This motivates an alternative goodness-of-fit statistic, the Freeman–Tukey statistic, FT = 4 Σ_i (√y_i − √μ̂_i)². When the model holds, FT − X² is also o_p(1) as n → ∞. See Bishop et al. (1975, p. 514) for details.

Results of this chapter do not apply when the number of cells N grows as n → ∞, or when different expected frequencies grow at different rates. Haberman (1988) showed that the consistency of X² breaks down with such non-standard asymptotics.
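As a small illustration of this note, the sketch below (with made-up observed counts and fitted values) computes X² and the Freeman–Tukey statistic side by side; with moderately large counts the two are close:

```python
import numpy as np

def pearson_X2(y, mu):
    """Pearson statistic X^2 = sum (y_i - mu_i)^2 / mu_i."""
    return ((y - mu) ** 2 / mu).sum()

def freeman_tukey(y, mu):
    """Freeman-Tukey statistic FT = 4 * sum (sqrt(y_i) - sqrt(mu_i))^2."""
    return 4.0 * ((np.sqrt(y) - np.sqrt(mu)) ** 2).sum()

# Made-up observed counts and model-fitted expected frequencies
y = np.array([18.0, 22.0, 31.0, 29.0])
mu = np.array([20.0, 20.0, 30.0, 30.0])
X2 = pearson_X2(y, mu)
FT = freeman_tukey(y, mu)
```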
14.4. Drost et al. (1989) derived noncentral approximations using sequences of alternatives other than the local and fixed ones of (14.25) and (14.26).
PROBLEMS
14.1 Explain why:
a. If c > 0, n^{−c} = o(1) as n → ∞.
b. If c ≠ 0, cz_n has the same order as z_n; that is, o(cz_n) is equivalent to o(z_n) and O(cz_n) is equivalent to O(z_n).
c. o(y_n)o(z_n) = o(y_n z_n), O(y_n)O(z_n) = O(y_n z_n), and o(y_n)O(z_n) = o(y_n z_n).
14.2 If X² has an asymptotic chi-squared distribution with fixed df as n → ∞, explain why X²/n = o_p(1).
14.3 a. Use Tchebychev’s inequality to show that if E(X_n) = μ_n and var(X_n) = σ_n² < ∞, then X_n − μ_n = O_p(σ_n).
b. Suppose that Y₁, . . . , Y_n are independent with E(Y_i) = μ and var(Y_i) = σ² for i = 1, . . . , n. Let Ȳ_n = (Σ_i Y_i)/n. Apply part (a) to show that Ȳ_n − μ = O_p(n^{−1/2}).
14.4 Let Y be a Poisson random variable with mean μ.
a. For a constant c > 0, show that

E[log(Y + c)] = log μ + (c − 1/2)/μ + O(μ⁻²).

[Hint: Note that log(Y + c) = log μ + log[1 + (Y + c − μ)/μ].]
b. Cell counts in a 2 × 2 table are independent Poisson random variables. Use part (a) to argue that to reduce bias in estimating the log odds ratio, a sensible estimator is the sample log odds ratio after adding 1/2 to each cell.
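Part (b) is the familiar "add 1/2" adjustment. A minimal sketch of the resulting estimator (the table entries are arbitrary illustrations) also shows why it remains finite in the presence of a zero cell:

```python
import numpy as np

def log_odds_ratio(table, add=0.5):
    """Sample log odds ratio after adding `add` to each cell; add=0.5
    gives the bias-reducing estimator of Problem 14.4(b)."""
    t = np.asarray(table, dtype=float) + add
    return np.log(t[0, 0] * t[1, 1] / (t[0, 1] * t[1, 0]))

lor = log_odds_ratio([[10, 5], [4, 12]])
lor_zero = log_odds_ratio([[0, 5], [4, 12]])   # finite despite a zero cell
```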
14.5 Let p denote the sample proportion for n independent Bernoulli trials. Find the asymptotic distribution of the estimator [p(1 − p)]^{1/2} of the standard deviation. What happens when π = 0.5?
14.6 Suppose that T_n has a Poisson distribution with mean μ = nλ, for fixed λ > 0. For large n, show that the distribution of log T_n is approximately normal with mean log μ and variance μ⁻¹. [Hint: By the central limit theorem, T_n/n is approximately N(λ, λ/n) for large n.]
14.7 a. Refer to Problem 14.6. If T_n is Poisson, show that √T_n has asymptotic variance 1/4.
b. For a binomial sample with n trials and sample proportion p, show that the asymptotic variance of sin⁻¹(√p) is 1/4n. [This transformation and the one in part (a) are variance stabilizing, producing variates with asymptotic variances that are the same for all values of the parameter. Traditionally, these transformations were employed to make ordinary least squares applicable to count data. See Cochran 1940 for discussion and ML analyses.]
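A quick Monte Carlo check of these two variance-stabilizing transformations is easy to run. The sketch below (sample sizes and parameter values are arbitrary choices) shows the stabilized variances settling near 1/4 and 1/(4n) regardless of the parameter:

```python
import numpy as np

rng = np.random.default_rng(42)

# (a) sqrt of a Poisson count: asymptotic variance 1/4, whatever the mean
for mean in (50.0, 500.0):
    y = rng.poisson(mean, size=200_000)
    var_sqrt = np.sqrt(y).var()              # close to 0.25 for each mean

# (b) arcsine-sqrt of a binomial proportion: asymptotic variance 1/(4n)
n = 400
for p_true in (0.2, 0.5, 0.8):
    p = rng.binomial(n, p_true, size=200_000) / n
    var_asin = np.arcsin(np.sqrt(p)).var()   # close to 1/(4*400) for each p
```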
14.8 For a multinomial (n, {π_i}) distribution, show that the correlation between p_i and p_j is −[π_i π_j /(1 − π_i)(1 − π_j)]^{1/2}. What does this equal when π_i = 1 − π_j and π_k = 0 for k ≠ i, j?
14.9 An animal population has N species, with population proportion π_i of species i. Simpson's index of ecological diversity (Simpson 1949) is I(π) = 1 − Σ π_i². [Rao (1982) surveyed diversity measures.]
a. Two animals are randomly chosen from the population, with replacement. Show that I(π) is the probability they are of different species.
b. For proportions p for a random sample, show that the estimated asymptotic standard error of I(p) is

2{[Σ_i p_i³ − (Σ_i p_i²)²]/n}^{1/2}.
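Part (b) can be checked numerically. The sketch below (with an arbitrary sample) computes I(p) and the stated standard error; note that at the exactly uniform distribution the bracketed term vanishes, so this first-order SE is 0, echoing the degenerate delta-method situation of Note 14.2:

```python
import numpy as np

def simpson_index(p):
    """Simpson's diversity index I(p) = 1 - sum p_i^2."""
    return 1.0 - (p ** 2).sum()

def simpson_se(p, n):
    """Estimated asymptotic SE from Problem 14.9(b):
    2 * sqrt{ [sum p_i^3 - (sum p_i^2)^2] / n }."""
    s2 = (p ** 2).sum()
    s3 = (p ** 3).sum()
    return 2.0 * np.sqrt((s3 - s2 ** 2) / n)

p = np.array([0.5, 0.3, 0.15, 0.05])
I_hat = simpson_index(p)                           # 1 - 0.365 = 0.635
se = simpson_se(p, n=200)
se_uniform = simpson_se(np.full(4, 0.25), n=200)   # degenerate: 0
```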
14.10 Let {Y_i} be independent Poisson random variables. Show by the delta method that the estimated asymptotic variance of Σ a_i log(Y_i) is Σ a_i²/y_i. [This formula applies to ML estimators of parameters for the saturated loglinear model, which are contrasts of {log(y_i)}. Formula (14.9) yields the asymptotic covariance structure of such estimators; see Lee (1977).]
14.11 Assuming two independent binomial samples, derive the asymptotic standard error of the log relative risk (Section 3.1.4).

14.12 Refer to Problem 3.27. The sample size may need to be quite large for the sampling distribution of γ̂ to be approximately normal, especially if |γ| is large. The Fisher-type transform ξ̂ = (1/2) log[(1 + γ̂)/(1 − γ̂)] (Agresti 1984, pp. 166–167, 177; O’Gorman and Woolson 1988) converges more quickly to normality.
a. Show that the asymptotic variance of ξ̂ equals the asymptotic variance of γ̂ multiplied by (1 − γ²)⁻².
b. Explain how to construct a confidence interval for ξ and use it to obtain one for γ.
c. Show that ξ̂ = (1/2) log(C/D). For 2 × 2 tables, show that this is half the log odds ratio.
14.13 Let φ²(T) = Σ_i (T_i − π_{i0})²/π_{i0}. Then φ²(p) = X²/n, where X² is the Pearson statistic (1.15) for testing H₀: π_i = π_{i0}, i = 1, . . . , N, and nφ²(π) is the noncentrality for that test when π is the true value. Under H₀, why does the delta method not yield an asymptotic normal distribution for φ²(p)? (See Note 14.2.)
14.14 In an I × J contingency table, let θ_{ij} denote local odds ratio (2.10), and let θ̂_{ij} denote its sample value.
a. Show that asymp. cov(√n log θ̂_{ij}, √n log θ̂_{i+1, j}) = −[π_{i+1, j}⁻¹ + π_{i+1, j+1}⁻¹].
b. Show that asymp. cov(√n log θ̂_{ij}, √n log θ̂_{i+1, j+1}) = π_{i+1, j+1}⁻¹.
c. When θ̂_{ij} and θ̂_{hk} use mutually exclusive sets of cells, show that asymp. cov(√n log θ̂_{ij}, √n log θ̂_{hk}) = 0.
d. State the asymptotic distribution of log θ̂_{ij}.
14.15 For loglinear model (XY, XZ, YZ), ML estimates of {μ_{ijk}} and hence the X² and G² statistics are not direct. Alternative approaches may yield direct analyses. For 2 × 2 × 2 tables, find a statistic for testing the hypothesis of no three-factor interaction, using the delta method with the asymptotic normality of log θ̂₁₁₁, where

θ̂₁₁₁ = (p₁₁₁ p₂₂₁ / p₁₂₁ p₂₁₁) / (p₁₁₂ p₂₂₂ / p₁₂₂ p₂₁₂).
14.16 Refer to Section 14.2.2, with Σ = diag(π) − ππ′ the covariance matrix of √n (p₁, . . . , p_{N−1})′. Let

Z = c_i with probability π_i, i = 1, . . . , N − 1, and Z = 0 with probability π_N,

and let c = (c₁, . . . , c_{N−1})′.
a. Show that E(Z) = c′π, E(Z²) = c′diag(π)c, and var(Z) = c′Σc.
b. Suppose that at least one c_i ≠ 0, and all π_i > 0. Show that var(Z) > 0, and deduce that Σ is positive definite.
c. If π = (π₁, . . . , π_N)′, so Σ is N × N, prove that Σ is not positive definite.
14.17 Consider the model for a 2 × 2 table, π₁₁ = π², π₁₂ = π₂₁ = π(1 − π), π₂₂ = (1 − π)², where π is unknown (Problems 3.31 and 10.34).
a. Find the matrix A in (14.14) for this model.
b. Use A to obtain the asymptotic variance of π̂. [As a check, it is simple to find it directly using the inverse of −E(∂²L/∂π²), where L is the log likelihood.] For which π value is the variance maximized? What is the distribution of π̂ if π = 0 or π = 1?
c. Find the asymptotic covariance matrix of √n π̂.
d. Find df for testing fit using X².
14.18 Refer to the model for the calf data in Section 1.5.6. Obtain the asymptotic variance of π̂.

14.19 Justify the use of estimated asymptotic covariance matrices. For instance, for large samples, why is Â′Â close to A′A?
14.20 Cell counts {Y_i} are independent Poisson random variables, with μ_i = E(Y_i). Consider the Poisson loglinear model

log μ = X_a β_a,  where μ = (μ₁, . . . , μ_N)′.

Using arguments similar to those in Section 14.2, show that the large-sample covariance matrix of β̂_a can be estimated by [X_a′ diag(μ̂) X_a]⁻¹, where μ̂ is the ML estimator of μ.
14.21 For a given set of parameter constraints, show that weak identifiability conditions hold for the independence loglinear model for a two-way table; that is, when two values for β give the same π, those parameter vectors must be identical.

14.22 Use the delta method, with derivatives (14.22), to derive the asymptotic covariance matrix in (14.23) for residuals. Show that this matrix is idempotent.

14.23 In some situations, X² and G² take very similar values. Explain the joint influence on this event of (a) whether the model holds, (b) whether the sample size n is large, and (c) whether the number of cells N is large.
14.24 Show X and β in multinomial representation (14.27) for the independence model for an I × J table. By contrast, show X_a for the corresponding Poisson loglinear model (14.29).

14.25 Using (14.18) and (14.28), derive the asymptotic côv(π̂) for a multinomial loglinear model.

14.26 Consider the ML estimator π̂_{ij} = p_{i+} p_{+j} of π_{ij} for the independence model, when that model does not hold. Show that E(p_{i+} p_{+j}) = π_{i+} π_{+j} (n − 1)/n + π_{ij}/n. To what does π̂_{ij} converge as n increases?
14.27 Let ζ denote a generic measure of association. For K independent multinomial samples of sizes {n_k}, suppose that √n_k (ζ̂_k − ζ_k) →d N(0, σ_k²) as n_k → ∞. A summary measure is

ζ̄ = [Σ_k (n_k/σ̂_k²) ζ̂_k] / [Σ_k (n_k/σ̂_k²)].

a. Show that Σ_k z_k² = V + [ζ̄²/σ̂²(ζ̄)], where

V = Σ_k n_k (ζ̂_k − ζ̄)²/σ̂_k²,  z_k = n_k^{1/2} ζ̂_k /σ̂_k,  σ̂²(ζ̄) = [Σ_k n_k/σ̂_k²]⁻¹.

b. Suppose that n → ∞ with n_k/n → λ_k > 0, k = 1, . . . , K. State the asymptotic chi-squared distribution for each component in the partitioning in part (a). Indicate the hypothesis that each tests.
Categorical Data Analysis, Second Edition. Alan Agresti
Copyright © 2002 John Wiley & Sons, Inc.
ISBN: 0-471-36093-7
CHAPTER 15

Alternative Estimation Theory for Parametric Models

In this book we have used the maximum likelihood (ML) approach to inference. This is by far the most common approach for categorical data analysis. Other paradigms have been used, however, and in this chapter we discuss some of them. These methods have asymptotic properties similar to those of maximum likelihood, so the large-sample theory of Chapter 14 also applies to them.

In Section 15.1 we discuss weighted least squares for fitting models for categorical data. This and the related quasi-likelihood methods introduced in Sections 4.7 and 11.4 are sometimes simpler to apply than ML.

The Bayesian paradigm is increasingly popular as computations become easier to implement. A full discussion of modern developments with this approach is beyond our scope, but in Section 15.2 we present Bayesian methods of estimating cell probabilities in a contingency table. Four other methods of estimation for categorical data are described in the final section.
15.1 WEIGHTED LEAST SQUARES FOR CATEGORICAL DATA

Weighted least squares (WLS) is an extension of ordinary least squares that permits responses to be correlated and to have nonconstant variance. Familiarity with the WLS method is useful because:

1. WLS computations have a standard form that is simple to apply for a wide variety of models.
2. Algorithms for calculating ML estimates often consist of iterative use of WLS. An example is the Fisher scoring method for generalized linear models (Section 4.6.3).
3. When the model holds, WLS and ML estimators are asymptotically equivalent, both falling in the class of best asymptotically normal (BAN) estimators. For large samples, the estimators are approximately normally distributed around the parameter value, and the ratio of their variances converges to 1.

Grizzle, Starmer, and Koch (1969) popularized WLS for categorical data analyses. In honor of them, WLS for such analyses is often called the GSK method. This section summarizes the ingredients of this approach.
15.1.1 Notation and Preliminaries for WLS Approach

For a response variable Y with J categories, consider multinomial samples of sizes n₁, . . . , n_I at I levels of an explanatory variable or combinations of levels of several explanatory variables. Let π = (π₁′, . . . , π_I′)′, where

π_i = (π_{1|i}, π_{2|i}, . . . , π_{J|i})′  with  Σ_j π_{j|i} = 1

denotes the conditional distribution of Y at level i. Let p denote the corresponding sample proportions, with V their IJ × IJ covariance matrix. When the I samples are independent, V is block diagonal,

V = diag(V₁, V₂, . . . , V_I).

From Section 14.1.4, the covariance matrix of √n_i p_i is n_i V_i, which has elements π_{j|i}(1 − π_{j|i}) on the main diagonal and −π_{j|i} π_{k|i} (j ≠ k) off the diagonal. Each set of proportions has (J − 1) linearly independent elements.
Let F be a vector of u ≤ I(J − 1) response functions

F(π) = (F₁(π), . . . , F_u(π))′.

The WLS approach applies to linear models for F of the form

F(π) = Xβ,    (15.1)

where β is a q × 1 vector of parameters and X is a u × q model matrix of known constants having rank q. From Section 8.5.4, loglinear and logit response functions are special cases of F(π) = C log(Aπ) for certain matrices C and A.

Let F(p) denote the sample response functions. We assume that F has continuous second-order partial derivatives in an open region containing π. This assumption enables the delta method to determine the large-sample normal distribution for F(p). The asymptotic covariance matrix of F(p) depends on the u × IJ matrix

Q = [∂F_k(π)/∂π_{j|i}]

for k = 1, . . . , u and all IJ combinations (i, j). Linear response models have response functions of the form F(π) = Aπ for a matrix of known constants A, in which case Q = A. For the generalized loglinear model F(π) = C log(Aπ) (recall Sections 8.5.4 and 11.2.5), Q = C[diag(Aπ)]⁻¹A. [See Magnus and Neudecker 1988 for matrix differential calculus.] By the multivariate delta method (Section 14.1.5), the asymptotic covariance matrix of F(p) is

V_F = QVQ′.

Let V̂_F denote the sample version of V_F, substituting sample proportions in Q and V. For subsequent formulas, this matrix must be nonsingular.
15.1.2 Inference Using the WLS Approach to Model Fitting

For the general model (15.1), the WLS estimate of β is

b = (X′V̂_F⁻¹X)⁻¹ X′V̂_F⁻¹ F(p).

This is the β value that minimizes the quadratic form

[F(p) − Xβ]′ V̂_F⁻¹ [F(p) − Xβ].

The ordinary least squares estimate, for uncorrelated responses with constant variance, results when V̂_F is a constant multiple of the identity matrix.

The WLS estimator has an asymptotic multivariate normal distribution, with estimated covariance matrix

côv(b) = (X′V̂_F⁻¹X)⁻¹.

The normal approximation improves as the sample size increases and F(p) becomes more nearly normally distributed.
The estimate b yields predicted values F̂ = Xb for the response functions. Since they satisfy the model, these predicted values are smoother than the sample response functions F(p). When the model holds, F̂ is asymptotically better than F(p) as an estimator of F(π) (Section 14.2.2). The estimated covariance matrix of the predicted values is

V̂_F̂ = X(X′V̂_F⁻¹X)⁻¹X′.

The test of model goodness of fit uses the residual term

W = [F(p) − Xb]′ V̂_F⁻¹ [F(p) − Xb] = F(p)′V̂_F⁻¹F(p) − b′(X′V̂_F⁻¹X)b,

which compares the sample response functions with their model predicted values. Under H₀: F(π) − Xβ = 0, that is, when the model holds, W is asymptotically chi-squared with df = u − q, the difference between the number of response functions and the number of model parameters.
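The computations of this subsection fit in a few lines of linear algebra. The sketch below (with a small hypothetical F(p), V̂_F, and X, not data from the text) computes b, its estimated covariance matrix, and the goodness-of-fit statistic W; fitting the saturated model (square, invertible X) reproduces F(p) exactly and gives W = 0:

```python
import numpy as np

def wls_fit(F, VF, X):
    """GSK weighted least squares: b = (X' VF^{-1} X)^{-1} X' VF^{-1} F,
    with cov(b) = (X' VF^{-1} X)^{-1} and residual goodness-of-fit
    statistic W = (F - Xb)' VF^{-1} (F - Xb), df = u - q."""
    VF_inv = np.linalg.inv(VF)
    cov_b = np.linalg.inv(X.T @ VF_inv @ X)
    b = cov_b @ X.T @ VF_inv @ F
    resid = F - X @ b
    W = float(resid @ VF_inv @ resid)
    return b, cov_b, W

# Hypothetical example: u = 3 sample logits, q = 1 (common-value model)
F = np.array([0.9, 1.1, 1.3])
VF = np.diag([0.04, 0.05, 0.06])           # estimated covariance of F(p)
X = np.ones((3, 1))
b, cov_b, W = wls_fit(F, VF, X)            # df = 3 - 1 = 2

b_sat, _, W_sat = wls_fit(F, VF, np.eye(3))   # saturated model: W = 0
```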
One can check the model fit more closely by studying the residuals, F(p) − F̂. They are orthogonal to the fit F̂, so

cov[F(p)] = cov[(F(p) − F̂) + F̂] = cov[F(p) − F̂] + cov(F̂).

Thus, the estimated covariance matrix of the residuals equals

côv[F(p)] − côv(F̂) = V̂_F − V̂_F̂ = V̂_F − X(X′V̂_F⁻¹X)⁻¹X′.

Dividing the residuals by their standard errors yields standardized residuals having large-sample standard normal distributions.
Hypotheses about contrasts and other effects of explanatory variables have the form H₀: Cβ = 0, where C is a known c × q matrix with c ≤ q, having rank c. The estimator Cb of Cβ is asymptotically normal with mean 0 under H₀ and with covariance matrix estimated by C(X′V̂_F⁻¹X)⁻¹C′. The Wald statistic

W_C = b′C′[C(X′V̂_F⁻¹X)⁻¹C′]⁻¹Cb    (15.2)

has an approximate chi-squared null distribution with df = c. This statistic also equals the difference between residual chi-squared statistics for the reduced model implied by H₀ and the full model. For the special case H₀: β_i = 0, W_C = b_i²/v̂ar(b_i) has df = 1.
15.1.3 Scope of WLS versus ML Estimation

The WLS approach requires estimating the multinomial covariance matrix of the sample responses at each setting of the explanatory variables. It is inapplicable when explanatory variables are continuous, since there may be only one observation at each such setting. WLS also becomes less appropriate as the number of explanatory variables increases, since few observations may occur at each of the many combinations of settings. By contrast, in principle, continuous explanatory variables or many explanatory settings are not problematic for ML.

When a certain model holds, with large cell expected frequencies ML and WLS give similar results. Both estimators are in the class of best asymptotically normal estimators. However, practical considerations often favor ML estimation. For example, zero cell counts often adversely affect the WLS approach. The sample response functions may then be ill-defined or have a singular estimated covariance matrix.

WLS shares with quasi-likelihood the feature that inferential results depend only on specifying a model for the mean responses together with a variance function and covariance structure (here, based on the multinomial). It does not use the likelihood function for the complete distribution. Thus, inference uses Wald methods.

Historically, an advantage of the WLS approach was computational simplicity. This is not relevant now that software is available for ML analyses and for extensions of WLS (e.g., quasi-likelihood methods such as GEE) that do not have some of its disadvantages. Thus, WLS is now used much less frequently than it was about 25 years ago. Nonetheless, it has close connections with more sophisticated methods. Some algorithms for calculating ML estimates iteratively use WLS. Also, Miller et al. (1993) showed that under certain conditions the solution of the first iteration in the GEE fitting process gives the WLS estimate. This equivalence uses initial estimates based directly on sample values and assumes a saturated association structure that allows a separate correlation parameter for each pair of response categories and each pair of observations in a cluster. In this sense, GEE is an iterated form of WLS. Moreover, in this case, the covariance matrix for the estimates is the same in both approaches.
15.2 BAYESIAN INFERENCE FOR CATEGORICAL DATA

Methodology using the Bayesian paradigm has advanced tremendously in the past decade. New computational methods make it easier to evaluate posterior distributions for model parameters. Nonetheless, Bayesian inference is not as fully developed or as commonly used for categorical data analysis as in many other areas of statistics. For multiway contingency table analysis, this is partly because of the plethora of parameters for multinomial models, often necessitating substantial prior specification. Bayesian theory and methods are beyond the scope of this book. We present only relatively elementary problems in which the Bayesian approach applies quite naturally and is sometimes more appealing than ML. We then briefly summarize more complex developments.

The first applications of Bayesian methods to contingency tables involved smoothing cell counts to improve estimation of cell probabilities (e.g., Good 1965). The sample proportions are the ordinary ML estimators for the saturated model. When data are sparse, these can have undesirable features. Large sparse tables often contain many sampling zeros, for which 0.0 is unappealing as a probability estimate. In addition, Stein's results for estimating multivariate normal means suggest that lower total mean-squared error occurs with Bayes estimators that shrink the sample proportions toward some average value (Efron and Morris 1975).

In considering Bayesian estimators, we cannot hope to find one that is uniformly better than ML. For instance, suppose that a true cell probability π_i = 0. Then the sample proportion p_i = 0 with probability 1, and the sample proportion is better than any other estimator. Because parameter values exist for which the sample proportion is optimal, no other estimator is uniformly better over the entire parameter space. Here the criterion of comparison is the expected value of a loss function that measures the distance between the estimator and the parameter, such as squared error. In decision-theoretic terms, the sample proportion is an admissible estimator for standard loss functions (Johnson 1971). In this sense, the sample proportion for the multinomial or multivariate binomial differs from the sample mean for the multivariate normal, which is inadmissible (dominated by Bayes estimators) when the dimension of the mean vector is at least three (Ferguson 1967, p. 170). Meeden et al. (1998) gave related results for decomposable loglinear models.

Another approach for estimating cell probabilities fits an unsaturated model. Often, though, there is no particular model expected to describe the table well. For I × J cross-classifications of nominal variables, for instance, the independence model rarely fits well. When unsaturated models approximate the true relationship poorly, model-based estimators also have undesirable properties. Although they smooth the data, the smoothing is too severe for large samples. The model-based estimators are inconsistent, converging to values that may be far from the true cell probabilities as n increases.

A Bayesian approach to estimating cell probabilities compromises between sample proportions and model-based estimators. A model still provides part of the smoothing mechanism, with the Bayes estimators shrinking the sample proportions toward a set of proportions satisfying the model.
15.2.1 Bayesian Estimation of a Binomial Parameter

We illustrate the basic ideas with Bayesian inference for a binomial parameter. Let y denote a bin(n, π) variate. Since π falls between 0 and 1, a natural prior density for π is the beta [(13.8) in Section 13.3.1] for some choice of α > 0 and β > 0. This satisfies E(π) = α/(α + β).

In Bayesian inference the posterior density of a parameter, given the data, is proportional to the product of the prior density with the likelihood function. Here, the beta prior depends on π through π^{α−1}(1 − π)^{β−1}, and the binomial likelihood has kernel depending on π through π^y(1 − π)^{n−y}. Thus, the posterior density h(π | y) of π is proportional to

h(π | y) ∝ π^y(1 − π)^{n−y} π^{α−1}(1 − π)^{β−1} = π^{y+α−1}(1 − π)^{n−y+β−1},

for 0 ≤ π ≤ 1. The beta is the conjugate prior distribution: the posterior density is also beta, with parameters α* = y + α and β* = n − y + β.

The mean of the posterior distribution is a Bayesian estimator of a parameter. This is optimal when a squared-error loss function (T − π)² describes the consequence of estimating π by an estimator T (Ferguson 1967, p. 46). The mean of the beta posterior distribution for π is

E(π | y) = α*/(α* + β*) = (y + α)/(n + α + β) = w(y/n) + (1 − w)[α/(α + β)],

where w = n/(n + α + β). This is a weighted average of the sample proportion p = y/n and the mean of the prior distribution. For fixed (α, β), the weight given the sample increases as n increases. The standard deviation of the posterior distribution describes the accuracy of this estimator. It equals the square root of

var(π | y) = α*β*/[(α* + β*)²(α* + β* + 1)].

For large n the standard deviation is roughly √[p(1 − p)/n], the ordinary standard error for the ML estimator π̂ = p.
The Bayes estimator requires selecting parameters (α, β) for the prior distribution. Complete ignorance about π might suggest a uniform prior distribution. This is the beta distribution with α = β = 1. The posterior distribution then has the same shape as the binomial likelihood function. The Bayes estimator is then

E(π | y) = (y + 1)/(n + 2).

This shrinks the sample proportion slightly toward 1/2.

Alternatively, a popular prior with Bayesians is the Jeffreys prior. This is proportional to the square root of the determinant of the Fisher information matrix for the parameters of interest, for a single observation. With a single parameter β, this is [−E(∂² log f(y | β)/∂β²)]^{1/2}. In the binomial case with β = π and n = 1, this equals [π(1 − π)]^{−1/2}, and the prior is beta with α = β = 0.5. Brown et al. (2001) showed that the posterior generated by this prior yields a confidence interval for π with good performance. It approximates the Clopper–Pearson interval with the mid-P adjustment (Sections 1.4.4 and 1.4.5). For a test of H₀: π ≥ 1/2 against H_a: π < 1/2, a Bayesian P-value is the posterior probability that π ≥ 1/2. Routledge (1994) showed that with the Jeffreys prior, this posterior probability approximately equals the one-sided mid-P-value for the ordinary binomial test.
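The conjugate updating just described is a one-liner in code. The sketch below (with arbitrary data y = 8, n = 20) forms the Beta(y + α, n − y + β) posterior under the Jeffreys prior and approximates an equal-tailed posterior interval by simulation, which avoids needing an incomplete-beta routine:

```python
import numpy as np

def beta_posterior(y, n, a=0.5, b=0.5):
    """Posterior parameters and mean under a Beta(a, b) prior;
    the default a = b = 0.5 is the Jeffreys prior."""
    a_post, b_post = y + a, n - y + b
    return a_post, b_post, a_post / (a_post + b_post)

a_post, b_post, post_mean = beta_posterior(y=8, n=20)   # mean = 8.5/21

# Equal-tailed 95% posterior interval by Monte Carlo
rng = np.random.default_rng(1)
draws = rng.beta(a_post, b_post, size=100_000)
lo, hi = np.quantile(draws, [0.025, 0.975])
```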
15.2.2 Dirichlet Prior and Posterior for Multinomial Parameters

These ideas generalize from the binomial to the multinomial (Good 1965). Suppose that cell counts (n₁, . . . , n_N) have a multinomial distribution with n = Σ n_i and parameters π = (π₁, . . . , π_N)′. The multinomial likelihood is proportional to

Π_{i=1}^{N} π_i^{n_i}.

For a prior distribution over potential π values, the multivariate generalization of the beta is the Dirichlet density

g(π) = [Γ(Σ_i β_i) / Π_i Γ(β_i)] Π_{i=1}^{N} π_i^{β_i − 1}  for 0 ≤ π_i ≤ 1 all i, Σ_i π_i = 1,

where {β_i > 0}. For it, E(π_i) = β_i /(Σ_j β_j).

The posterior density is also Dirichlet, with parameters {n_i + β_i}. The Bayes estimator of π_i is

E(π_i | n₁, . . . , n_N) = (n_i + β_i) / (n + Σ_j β_j).    (15.3)

Let K = Σ_j β_j and γ_i = E(π_i) = β_i /K. The {γ_i} are prior guesses for the cell probabilities. Bayes estimator (15.3) equals the weighted average

[n/(n + K)] p_i + [K/(n + K)] γ_i.    (15.4)

From (15.3) the Bayes estimator is a sample proportion when the prior information corresponds to Σ_j β_j trials with β_i outcomes of type i, i = 1, . . . , N. This interpretation may provide guidance for choosing {β_i}. The Jeffreys prior sets all β_i = 0.5. Good referred to K as a flattening constant, since with identical {β_i}, (15.4) shrinks each sample proportion toward the uniform value γ_i = 1/N. Greater flattening occurs as K increases, for fixed n. Hierarchical models treat {β_i} as unknown and specify a second-stage prior for them (e.g., Albert and Gupta 1982).
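Estimator (15.3) amounts to adding β_i to each count before normalizing. The sketch below (with made-up counts) applies it with the Jeffreys choice β_i = 0.5 and verifies the weighted-average form (15.4) numerically; note that the sampling zero receives a positive estimate:

```python
import numpy as np

def dirichlet_smooth(counts, prior):
    """Bayes estimator (15.3): (n_i + beta_i) / (n + K), K = sum beta_i."""
    counts = np.asarray(counts, dtype=float)
    prior = np.asarray(prior, dtype=float)
    return (counts + prior) / (counts.sum() + prior.sum())

counts = np.array([12.0, 0.0, 3.0, 5.0])     # note the sampling zero
prior = np.full(4, 0.5)                      # Jeffreys: all beta_i = 0.5
est = dirichlet_smooth(counts, prior)

# Weighted-average identity (15.4): shrink p_i toward gamma_i = 1/N
n, K = counts.sum(), prior.sum()
p, gamma = counts / n, prior / K
est2 = (n / (n + K)) * p + (K / (n + K)) * gamma
```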
Bayes estimators combine good characteristics of sample proportions and model-based estimators. Like sample proportions and unlike model-based estimators, they are consistent even when the model does not hold. Unless the model holds, the weight given the sample proportion increases to 1.0 as the sample size increases. Like model-based estimators and unlike sample proportions, the Bayes estimators smooth the data. The resulting estimates, although slightly biased, usually have smaller total mean-squared error than the sample proportions.
15.2.3
Development of Bayesian Methods for Categorical Data
We now summarize the development of Bayesian methods for categorical
data since Good’s Ž1965. work on smoothing multinomial proportions.
Leonard and Hsu Ž1994. provided a more detailed review. We begin with
methods for two-way contingency tables.
For 2 × 2 tables, Altham (1969) gave a Bayesian analysis comparing
parameters for two independent binomial samples. She tested H₀: π₁ ≤ π₂
against π₁ > π₂ using independent beta(α_i, β_i) priors for π₁ and π₂.
Altham showed that the P-value that is the posterior probability that π₁ ≤ π₂
can equal the one-sided P-value for Fisher's exact test. This happens when
one uses improper prior distributions (α₁, β₁) = (1, 0) and (α₂, β₂) = (0, 1).
These represent prior belief favoring the null hypothesis, in effect penalizing
against concluding that π₁ > π₂. That is, Fisher's exact test corresponds to a
conservative prior distribution.

If α_i = β_i = γ, i = 1, 2, with 0 ≤ γ ≤ 1, Altham showed that the Bayesian
P-value is smaller than the Fisher P-value. The difference between the two is
no greater than the null probability of the observed data. Use of Jeffreys
priors with α_i = β_i = 0.5 provides a type of continuity correction to Fisher's
exact test in much the way the mid-P-value does for the frequentist approach.
Howard (1998) showed that with these priors the posterior probability that
π₁ ≤ π₂ approximates the one-sided P-value for the large-sample z test
using pooled variance (i.e., the signed square root of the Pearson statistic; see
Problem 3.30) for testing H₀: π₁ = π₂ against H_a: π₁ > π₂. Howard also
discussed other priors for 2 × 2 tables, including ones that treat π₁ and π₂
as dependent.
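A Monte Carlo sketch of this connection, using hypothetical 2 × 2 counts: with independent Jeffreys beta(0.5, 0.5) priors, the posterior probability that π₁ ≤ π₂ behaves like a continuity-corrected one-sided P-value, here compared against the exact one-sided Fisher P-value for H_a: π₁ > π₂. The data and the simulation size are illustrative assumptions.

```python
import math
import random

y1, n1 = 3, 10        # group 1: successes, trials (hypothetical data)
y2, n2 = 8, 10        # group 2

random.seed(1)
draws = 50_000
# Posterior draws from independent Jeffreys-updated beta distributions
posterior = sum(
    random.betavariate(y1 + 0.5, n1 - y1 + 0.5)
    <= random.betavariate(y2 + 0.5, n2 - y2 + 0.5)
    for _ in range(draws)
) / draws

# One-sided Fisher exact P-value: P(Y1 >= y1) in the hypergeometric
# distribution obtained by conditioning on the total t = y1 + y2.
t = y1 + y2
fisher_p = sum(
    math.comb(n1, k) * math.comb(n2, t - k) / math.comb(n1 + n2, t)
    for k in range(y1, min(n1, t) + 1)
    if 0 <= t - k <= n2
)
```

With these counts both quantities are large, reflecting strong evidence against π₁ > π₂; the posterior probability is the Bayesian analog of the exact P-value.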
Altham (1971) presented Bayesian analyses for binomial proportions from
matched-pairs data. For a simple model in which the probability of success is
the same for each subject at a given occasion, she again showed that the
classical exact P-value (Section 10.1.4, using the binomial distribution) is a
Bayesian P-value for a prior distribution favoring H₀. For a model similar to
(10.8) in which the probability varies by subject but the occasion effect is
constant, she showed that the Bayesian evidence against the null is weaker
as the number of pairs giving the same response at both occasions increases,
for fixed values of the numbers of pairs giving different responses at the two
occasions. This differs from the conditional ML result, which does not
depend on such pairs (Section 10.2.3). Ghosh et al. (2000) showed related
results.
The Bayesian approaches presented so far focused directly on cell
probabilities by using a prior distribution for them. Lindley (1964) did this with
I × J contingency tables. He considered the posterior distribution of contrasts
of log probabilities, such as the log odds ratio. An alternative approach
(Laird 1978; Leonard 1975) focused on parameters of the saturated loglinear
model, using normal priors. This is not a conjugate prior, but normal
distributions can approximate the posterior. Using independent normal
N(0, σ²) distributions for the association parameters is a way of inducing
shrinkage toward the independence model (Laird 1978). A hierarchical
approach puts second-stage priors on the parameters of the prior distribution
(Leonard 1975).
Historically, a barrier for the Bayesian approach has been the difficulty of
calculating the posterior distribution when the prior is not conjugate. This is
less problematic with modern ways of approximating posterior distributions
by simulating samples from them. These include the importance sampling
generalization of Monte Carlo simulation (Zellner and Rossi 1984) and
Markov chain Monte Carlo methods such as Gibbs sampling (Gelfand and
Smith 1990). Zellner and Rossi used Bayesian methods for logistic regression,
and Gelfand and Smith considered a class of multinomial models with
Dirichlet priors. Zeger and Karim (1991) fitted generalized linear mixed
models (GLMMs) essentially using a Bayesian framework with priors for
fixed and random effects.
The focus on distributions for random effects in GLMMs in articles such
as Zeger and Karim (1991) led to the treatment of parameters in GLMs as
random variables with a fully Bayesian approach. Dey et al. (2000) edited a
collection of articles that provided Bayesian analyses for GLMs. For instance,
in that volume Gelfand and Ghosh surveyed the subject, Albert and Ghosh
reviewed item response modeling, Chib modeled correlated binary data, and
Chen and Dey modeled correlated ordinal data.
Bayesian methods are used increasingly in applications. For instance,
Skene and Wakefield (1990) modeled multicenter binary response studies
with a logit model that allows the treatment–response log odds ratio to vary
among centers. This gives a Bayesian alternative to the GLMM analysis
presented in Section 12.3.4. Daniels and Gatsonis (1999) used multilevel
GLMs to analyze geographic and temporal trends with clustered longitudinal
binary data. This built on hierarchical modeling ideas introduced by Wong
and Mason (1985). An article by Landrum and Normand in Dey et al. (2000)
gave a case study using Bayesian ordinal probit and logit models. Chaloner
and Larntz (1989) used a Bayesian approach to determining optimal design
for experiments using logistic regression. J. Albert has suggested Bayesian
models for a variety of categorical data analyses. For instance, Albert (1997)
modeled associations in two-way tables, and Albert and Chib (1993) studied
binary regression modeling, focusing on the probit case with extensions to
ordered multinomial responses.
15.2.4 Data-Dependent Choice of Prior Distribution
With Bayesian analyses, careful prior specification is necessary. The use of an
improper prior, such as the uniform prior over the entire real line or the
positive real line, sometimes results in an improper posterior. One may not
realize this from the output of software for Bayesian fitting. In addition, with
simulation methods it may not be obvious when convergence has occurred. Be
suspicious if results are dramatically different from ordinary frequentist ML
results.

Some dislike the subjectivity of the Bayesian approach inherent in
selecting a prior distribution. Instead of choosing particular parameters for a
prior distribution, it is increasingly popular to use a hierarchical approach in
which those parameters themselves have a second-stage prior distribution.
Alternatively, the empirical Bayes approach lets the data suggest parameter
values for use in the prior distribution (e.g., Efron and Morris 1975). This
approach uses the prior that maximizes the marginal probability of the
observed data, integrating out with respect to the prior. Laird (1978) did this
for the loglinear model, estimating σ² in normal priors for association
parameters by finding the value that maximizes an approximation for the
marginal distribution of the cell counts, evaluated at the observed data. A
disadvantage of empirical Bayes compared to the hierarchical approach is
that it does not take into account the variability due to substituting estimates
for prior parameters.
Fienberg and Holland (1973) proposed analyses for contingency tables
with data-dependent priors. For a particular choice of Dirichlet means {γ_i}
for the Bayes estimator (15.4), they showed that the minimum total
mean-squared error occurs when

    K = (1 − Σ_i π_i²) / Σ_i (γ_i − π_i)².   (15.5)

The optimal K = K(γ, π) depends on π, so they used the estimate K(γ, p)
of K in which the sample proportion p replaces π. As p falls closer to the
prior guess γ, K(γ, p) increases and the prior guess receives more weight in
the posterior estimate. They selected the prior pattern {γ_i} for the cell
probabilities based on the fit of a simple model. For two-way tables, they
used the independence fit {γ_ij = p_{i+} p_{+j}}. The Bayes estimator then shrinks
sample proportions toward that fit.
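The Fienberg–Holland recipe can be sketched numerically: take γ from the independence fit, plug the sample proportions into (15.5) to estimate K, and shrink as in (15.4). The 2 × 3 table of counts below is hypothetical.

```python
import numpy as np

counts = np.array([[10.0, 4.0, 2.0],
                   [6.0, 8.0, 10.0]])
n = counts.sum()
p = counts / n

# Prior pattern from the independence fit: gamma_ij = p_{i+} p_{+j}
gamma = np.outer(p.sum(axis=1), p.sum(axis=0))

# K(gamma, p): (15.5) with the sample proportions p substituted for pi
K = (1.0 - (p ** 2).sum()) / ((gamma - p) ** 2).sum()

# Shrink the sample proportions toward the independence fit, as in (15.4)
smoothed = (n * p + K * gamma) / (n + K)
```

Each smoothed cell lies between the sample proportion and the independence fit, with more shrinkage the closer p is to γ.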
As in other inference, Bayesian modeling should normally account for any
ordering in the response categories. For instance, in the method just mentioned for smoothing contingency tables, one could shrink toward an ordinal
model.
15.3 OTHER METHODS OF ESTIMATION
In this final section we describe some alternative estimation methods for
categorical data. Consider estimation of π or β, assuming a model π = π(β).
Let β̃ denote a generic estimator of β, for which π̃ = π(β̃) estimates π.
The ML estimator β̂ maximizes the likelihood. It also minimizes the deviance
statistic G² comparing observed and fitted proportions (Section 14.3.4).
15.3.1 Minimum Chi-Squared Estimators
Other estimators minimize other measures of distance between π(β) and p.
The value β̃ that minimizes the Pearson statistic

    X²(π(β), p) = n Σ_i [p_i − π_i(β)]² / π_i(β)

is called the minimum chi-squared estimate. It is simpler to calculate the
estimate that minimizes the modified statistic

    X²_mod(π(β), p) = n Σ_i [p_i − π_i(β)]² / p_i   (15.6)

that replaces the denominator by the sample proportion. This is called the
minimum modified chi-squared estimate. It is the solution for β to the
equations

    Σ_i [π_i(β)/p_i] ∂π_i(β)/∂β_j = 0,   j = 1, ..., q.
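A minimal numerical sketch of minimizing the modified statistic (15.6), for the toy model of two independent binomial samples sharing a common probability π; the counts are hypothetical. Because the sample proportions sit in the denominators, the objective is quadratic in π, so a grid search recovers the minimizer, a variance-weighted average of the group proportions.

```python
import numpy as np

y = np.array([12.0, 27.0])    # success counts (assumed data)
m = np.array([30.0, 30.0])    # group sizes
ph = y / m                    # sample proportions: 0.4 and 0.9

def x2_mod(pi):
    # Both categories of each binomial contribute, with sample
    # proportions (not fitted values) in the denominators, as in (15.6).
    return np.sum(m * (ph - pi) ** 2 / ph + m * (ph - pi) ** 2 / (1 - ph))

grid = np.linspace(0.001, 0.999, 99_901)
pi_tilde = grid[np.argmin([x2_mod(v) for v in grid])]

p_ml = y.sum() / m.sum()      # ML estimate: the pooled proportion, 0.65
```

Here π̃ ≈ 0.764 while the pooled ML estimate is 0.65: the modified chi-squared criterion weights each group by the inverse of its estimated variance, so the methods disagree noticeably when the common-π model fits poorly, as discussed below.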
Neyman (1949) introduced minimum modified chi-squared estimators. He
showed that they and minimum chi-squared estimators are best asymptotically
normal (BAN) estimators. When the model holds, they are asymptotically
(as n → ∞) equivalent to ML estimators. Under the model, different
estimation methods (ML, WLS, minimum chi-squared, etc.) yield nearly
identical estimates of parameters when n is large. This happens partly
because the estimators are consistent, converging in probability to β as n
increases. When the model does not hold, estimates from different methods
can be quite different, even when n is large. The estimators converge to
values for which the model gives the best approximation to reality, and this
approximation differs when best is defined in terms of minimizing G²
rather than minimizing X² or some other measure.

For any n, minimum modified chi-squared estimates are sometimes
identical to WLS estimates. The connection refers to an alternative way of
specifying a model, using a set of constraint equations for π,

    {g_j(π₁, ..., π_N) = 0}.

For instance, for an I × J table, the (I − 1)(J − 1) constraint equations

    log π_ij − log π_{i,j+1} − log π_{i+1,j} + log π_{i+1,j+1} = 0

specify the model of independence. The number of constraint equations
equals the residual df for the model.
Neyman (1949) noted that minimum modified chi-squared estimates result
from minimizing

    Σ_{i=1}^N (p_i − π_i)²/p_i + Σ_{j=1}^{N−q} λ_j g_j(π₁, ..., π_N)

with respect to π, where the {λ_j} are Lagrange multipliers. When the
constraint equations are linear in π, the resulting estimating equations are
linear. Bhapkar (1966) then showed that these estimators are identical to
WLS estimators. The statistic (15.6) then equals the WLS residual statistic
(Section 15.1.2) for testing model fit.
Usually, however, constraint equations are nonlinear in π, such as for the
independence model. The WLS estimator is then the minimum modified
chi-squared estimator based on a linearized version of the constraints,

    g_j(p) + (π − p)′ ∂g_j(π)/∂π = 0,

with the vector of partial derivatives evaluated at p.
Berkson (1944, 1955, 1980) was a strong advocate of minimum chi-squared
methods. For logistic regression, his minimum logit chi-squared estimators
minimized a weighted sum of squares between sample logits and linear
predictions. Mantel (1985) criticized such methods, noting that their
consistency requires group sizes to grow large, whereas ML (or conditional ML,
when there are many nuisance parameters) is consistent however information
goes to the limit (see also Problem 15.14).
15.3.2 Minimum Discrimination Information
Kullback (1959) formulated estimation by minimum discrimination
information (MDI). The discrimination information for two probability vectors π
and γ is

    I(π; γ) = Σ_{i=1}^N π_i log(π_i/γ_i).   (15.7)
This directed measure of distance between π and γ is nonnegative, equaling
0 only when π = γ. Gokhale and Kullback (1978) studied MDI estimates
that minimize I(π; γ), subject to model constraints, using γ = p for some
problems and γ with γ₁ = γ₂ = ··· = γ_N = 1/N for others. Good (1963)
conducted related work in the area of maximum entropy.
In some cases with {γ_i = 1/N}, the MDI estimator is identical to the ML
estimator (Simon 1973). With γ = p it is not ML, but it has similar
asymptotic properties, being best asymptotically normal (BAN). Gokhale and
Kullback then recommended testing goodness of fit using twice the minimized
value of I(π; p). This statistic reverses the roles of p and π relative to G²,
much as X²_mod in (15.6) reverses their roles relative to X². Both statistics fall
in the class of power divergence statistics (Cressie and Read 1984; see also
Problem 3.34) and have similar asymptotic properties. More generally, one
could choose any member of the power divergence statistics and define
estimates to be the values minimizing it. Under regularity conditions, they
are all BAN.
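The directed distance (15.7) is easy to compute directly. A small sketch with illustrative probability vectors, checking the nonnegativity property stated above:

```python
import math

def discrimination_info(pi, gamma):
    # I(pi; gamma) = sum_i pi_i log(pi_i / gamma_i); terms with pi_i = 0
    # contribute 0 by the usual convention.
    return sum(p * math.log(p / g) for p, g in zip(pi, gamma) if p > 0)

pi = [0.5, 0.3, 0.2]
uniform = [1.0 / 3] * 3
i_uniform = discrimination_info(pi, uniform)   # positive: vectors differ
i_self = discrimination_info(pi, pi)           # zero: identical vectors
```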
15.3.3 Kernel Smoothing
Kernel estimation is a smoothing method that estimates a probability density
or mass function without assuming a parametric distribution. Let K denote a
matrix containing nonnegative elements and having column sums equal to 1.
Kernel estimates of cell probabilities in a contingency table have the form

    π̃ = Kp.   (15.8)

For unordered multinomials with N categories, Aitchison and Aitken
(1976) used

    k_ij = λ,                  i = j,
    k_ij = (1 − λ)/(N − 1),    i ≠ j,

for 1/N ≤ λ ≤ 1. The resulting kernel estimator of π has the form

    (1 − α)p + α(1/N),   (15.9)

where α = N(1 − λ)/(N − 1). This estimator shrinks the sample proportion
toward (1/N, ..., 1/N). As λ decreases from 1 to 1/N, the smoothing
parameter α increases from 0 to 1. Brown and Rundell (1985) proved that
when no π_i = 1, λ < 1 exists such that the total mean squared error is
smaller for this kernel estimator than for the sample proportions. Results for
other shrinkage estimators applied to multivariate means suggest that the
improvement for the kernel estimator can be large when n is small and the
true cell probabilities are roughly equal.
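A sketch of the Aitchison–Aitken kernel, confirming that the matrix form (15.8) reproduces the shrinkage form (15.9); the counts and the value of λ are hypothetical.

```python
import numpy as np

counts = np.array([5.0, 0.0, 2.0, 3.0])     # hypothetical counts
N = counts.size
p = counts / counts.sum()
lam = 0.8                                   # smoothing parameter, 1/N <= lam <= 1

# Kernel matrix: lam on the diagonal, equal mass off it
K = np.full((N, N), (1 - lam) / (N - 1))
np.fill_diagonal(K, lam)
smoothed = K @ p                            # matrix form (15.8)

alpha = N * (1 - lam) / (N - 1)
direct = (1 - alpha) * p + alpha / N        # shrinkage form (15.9)
```

The two forms agree exactly, and the kernel matrix has unit column sums, so the smoothed vector remains a probability distribution.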
Brown and Rundell generalized kernel smoothing for multiway contingency
tables that may contain both nominal and ordinal variables. For a
T-way table, let L_k be a stochastic matrix (i.e., row and column sums equal to
1) with elements

    l_{k,ij} = λ_k,                   i = j,
    l_{k,ij} = d_k(i, j)(1 − λ_k),    i ≠ j,

k = 1, ..., T. They let K in (15.8) be the Kronecker product

    K = L₁ ⊗ ··· ⊗ L_T.

When variable k is ordinal, shrinkage alone is not enough, and it helps to
borrow information from nearby cells. Then d_k(i, j) is chosen to be smaller
for greater distances between categories i and j. If variable k is nominal, the
natural choice is d_k(i, j) = 1/(I_k − 1), where I_k is the number of categories
for variable k. For fixed {λ_k}, collapsing the smoothed table gives the same
result as smoothing the corresponding collapsing of the original table. With
{λ_k = λ, k = 1, ..., T}, Brown and Rundell described ways of finding λ to
minimize an unbiased estimate of the total mean squared error.
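The multiway construction can be sketched by forming the Kronecker product of per-variable kernels and applying it to the vectorized table. Here both variables are treated as nominal, so d_k(i, j) = 1/(I_k − 1); the 3 × 4 table and the λ_k values are hypothetical.

```python
import numpy as np

def aitchison_aitken(N, lam):
    # Per-variable nominal kernel: lam on the diagonal, equal mass elsewhere
    L = np.full((N, N), (1 - lam) / (N - 1))
    np.fill_diagonal(L, lam)
    return L

table = np.array([[4.0, 1, 0, 2],
                  [0, 6, 3, 1],
                  [2, 0, 5, 0]])
p = (table / table.sum()).ravel()           # vectorize row by row

# Overall kernel: Kronecker product of the per-variable kernels
K = np.kron(aitchison_aitken(3, 0.9), aitchison_aitken(4, 0.85))
smoothed = K @ p
```

Because each L_k is stochastic, K has unit column sums, and every cell, including the empty ones, receives a positive smoothed estimate.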
Dong and Simonoff (1995) and Simonoff (1986) described other
approaches for ordered categories. Most such kernels yield probability estimates
of the form

    π̃_i = (1 − α)p_i + α × smoother_i,

where the smoothing is designed to work well when true probabilities in
nearby cells are similar.
15.3.4 Penalized Likelihood
Good and Gaskins (1971) introduced the penalized likelihood method for
density estimation. For log likelihood L(π), the estimator maximizes

    L*(π) = L(π) − α(π),

where α(·) is a function that provides a roughness penalty. That is, α(π)
decreases as elements of π are smoother, in some sense. The penalized
likelihood estimator has a Bayesian interpretation. With prior density
proportional to exp[−α(π)], the posterior density is proportional to the penalized
likelihood function. Hence, the mode of the posterior distribution equals
the penalized likelihood estimator.
Simonoff (1983) applied penalized likelihood to estimating cell
probabilities π. Like Bayesian and kernel methods, it provides estimates that are
smoother than the sample proportions. For a single multinomial with ordered
categories, Simonoff (1983) used the penalty function
α(π) = Σ_{i=1}^{N−1} (log π_i − log π_{i+1})²,
which encourages adjacent category estimates to be similar. For
two-way contingency tables, Simonoff suggested using α(π) = Σ_i Σ_j (log θ_ij)²
with {θ_ij} the local odds ratios. This provides shrinkage toward the independence
estimator. One chooses the smoothing parameter to minimize an approximation
for the mean-squared error of the estimator.
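A crude hill-climbing sketch of the penalized likelihood idea for ordered categories: maximize L(π) − w Σ_i (log π_i − log π_{i+1})² over probability vectors, accepting only improving proposals so the final value is at least as good as the starting point. The counts and the penalty weight w are hypothetical; a real implementation would choose the smoothing parameter to minimize an estimate of mean-squared error.

```python
import math
import random

counts = [1, 0, 2, 5, 9]      # hypothetical ordered-category counts
n = sum(counts)
w = 2.0                       # penalty weight (assumed)

def objective(logits):
    # Softmax keeps the estimate a valid probability vector
    m = max(logits)
    expd = [math.exp(t - m) for t in logits]
    total = sum(expd)
    pi = [v / total for v in expd]
    loglik = sum(c * math.log(q) for c, q in zip(counts, pi))
    penalty = sum((math.log(pi[i]) - math.log(pi[i + 1])) ** 2
                  for i in range(len(pi) - 1))
    return loglik - w * penalty, pi

random.seed(0)
# Start near the sample proportions, nudged off zero for the empty cell
logits = [math.log((c + 0.5) / (n + 2.5)) for c in counts]
best, pi_hat = objective(logits)
start = best
for _ in range(20_000):
    j = random.randrange(len(logits))
    step = random.gauss(0.0, 0.05)
    logits[j] += step
    value, pi = objective(logits)
    if value > best:
        best, pi_hat = value, pi
    else:
        logits[j] -= step     # reject moves that do not improve
```

The accept-only-if-better rule guarantees the penalized objective at the final estimate is no worse than at the start, and the log-parameterization keeps every cell estimate strictly positive.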
In evaluating smoothing methods such as kernel smoothing and penalized
likelihood, it is useful to distinguish between large-sample asymptotics with a
fixed number of cells N and sparse-data asymptotics for which N grows with
n (recall Section 6.3.4). For the former, these smoothing methods and
Bayesian inference behave asymptotically like ordinary ML (i.e., the sample
proportions). They have the same rate of convergence to the true probabilities.
These methods then improve over ML primarily for small samples, where the
benefit of "borrowing from the whole" occurs. For sparse-data asymptotics,
however, smoothing is particularly beneficial. As the dimensions of a table
increase, the number of cells grows exponentially and the "curse of
dimensionality" occurs. Accurate estimation becomes more difficult, with
estimators converging more slowly to the true values. The table then has an increasing
proportion of empty cells. Smoothing can be better than ML even
asymptotically. For such results, see Fienberg and Holland (1973) for the
Dirichlet-based Bayes multinomial estimator and Simonoff (1983) for penalized
likelihood with the multinomial. Simonoff showed that consistency can occur with
the latter estimator in the sense that sup_i |π̂_i/π_i − 1| → 0 in probability
as n and N grow and the probabilities themselves approach 0.
For surveys of smoothing methods, see Fahrmeir and Tutz (2001, Chap. 5),
Lloyd (1999, Chap. 5), and Simonoff (1996, Chap. 6; 1998). As Simonoff
noted, all smoothing methods attempt to balance the low bias of
undersmoothing with the low variability of oversmoothing. The methods require
input from the user about the degree of smoothness, whether it be determined
by a prior distribution or some type of smoothing parameter.
In summary, many methods exist for smoothing categorical data. Besides
those discussed in this section, there are traditional model-building methods.
Some of these, such as generalized additive models (Section 4.8), are also
specifically directed toward smoothing. A particular type of smoothing method
may seem most natural for a given application. An advantage of the Bayesian
approach is that its entire formulation seems less ad hoc than some others.
NOTES
Section 15.1: Weighted Least Squares for Categorical Data
15.1. Applications of WLS include fitting mean response models (Grizzle et al. 1969) and
models for marginal distributions (Koch et al. 1977). For general discussion, see
Bhapkar and Koch (1968), Imrey et al. (1981), and Koch et al. (1985).
Section 15.2: Bayesian Inference for Categorical Data
15.2. Other literature on Bayesian analyses of categorical responses includes Fienberg et al.
(1999), Forster and Smith (1998), Good (1976), Knuiman and Speed (1988),
Spiegelhalter and Smith (1982), and Walley (1996).
Section 15.3: Other Methods of Estimation
15.3. For further discussion of minimum chi-squared methods, see Bhapkar (1966), Koch et
al. (1985), Neyman (1949), and Rao (1963).
15.4. For the use of minimum discrimination information, see Gokhale and Kullback (1978),
Ireland and Kullback (1968a, b), Ireland et al. (1969), and Ku et al. (1971).
15.5. Hall and Titterington (1987) studied rates of convergence for multinomial kernel
estimators. They defined one that achieves the optimal rate. Ordinary kernel estimators
tend to be biased toward zero at the boundary of a table. Dong and Simonoff (1994)
dealt with improving kernel estimates on the boundary of large sparse tables. Kernel
methods are also useful for discrete regression modeling. For binary response data,
Copas (1983) used one to display in a nonparametric manner the dependence of
P(Y = 1) on x.
PROBLEMS
Applications
15.1 Consider the mean response model fitted in Section 7.4.6. Show how
to use WLS for this analysis. Identify the number of multinomial
samples I, the number of response categories J, the response
functions F, the model matrix X, the parameter vector β, and the
estimated covariance matrix V̂_F.
15.2 Use WLS to conduct the longitudinal analysis of depression in
Section 11.2.1. Using software (e.g., SAS: PROC CATMOD), obtain
WLS estimates and standard errors and compare to the ML results.
15.3 Refer to Problem 15.2. Using these data, describe the differences
between (a) WLS and ML, and (b) WLS and GEE methods for
marginal models with multivariate categorical response data.
15.4 Using data from Section 1.4.3, obtain a Bayesian estimate of the
proportion of vegetarians. Explain how you chose the prior
distribution. Compare results to those with ML.
15.5 Refer to Table 9.8. Consider the model that simultaneously assumes
(9.12) as well as linear logit relationships for the marginal effects of
age on breathlessness and on wheeze.
a. Specify C, A, and X for which this model has form C log(Aπ) = Xβ.
b. Using software, fit the model and interpret estimates.
Theory and Methods
15.6 Consider marginal homogeneity for an I × I table.
a. Letting F(π) = Aπ, explain how marginal homogeneity can be
expressed as (i) F(π) = 0, where A has I − 1 rows, or as (ii)
F(π) = Xβ, where A has 2(I − 1) rows and β has I − 1 elements.
In part (ii), show A, π, X, β when I = 3.
b. Explain how to use WLS to test marginal homogeneity. [This is
Bhapkar's test (10.16).]
15.7 For WLS with F(π) = C[log(Aπ)], show that Q = C[diag(Aπ)]⁻¹A.
15.8 With WLS, show that [F(p) − Xβ]′V̂_F⁻¹[F(p) − Xβ] is minimized by
β̂ = (X′V̂_F⁻¹X)⁻¹X′V̂_F⁻¹F(p).
15.9 The response functions F(p) have asymptotic covariance matrix V_F.
Derive the asymptotic covariance matrix of the WLS model parameter
estimator b and the predicted values F̂ = Xb.
15.10 Consider the Bayes estimator of a binomial parameter π using a beta
prior distribution.
a. Does any beta prior distribution produce a Bayes estimator that
coincides with the ML estimator?
b. Show that the ML estimator is a limit of Bayes estimators, for a
certain sequence of beta prior parameter values.
c. Find an improper prior density (one whose integral is not finite)
such that the Bayes estimator coincides with the ML estimator. (In
this sense, the ML estimator is a generalized Bayes estimator.)
d. For Bayesian inference using loss function w(π)(T − π)², the
Bayes estimator of π is the posterior expected value of πw(π)
divided by the posterior expected value of w(π) (Ferguson 1967,
p. 47). With loss function (T − π)²/[π(1 − π)], show that the ML
estimator of π is a Bayes estimator for the uniform prior distribution.
e. The risk function is the expected loss, treated as a function of π.
For the loss function in part (d), show that the risk function is constant.
(Bayes estimators with constant risk are minimax; their maximum
risk is no greater than the maximum risk of any other estimator.)
f. Show that the Jeffreys prior for π equals the beta density with
α = β = 0.5.
15.11 For the Dirichlet prior for multinomial probabilities, show that the
posterior expected value of π_i is (15.3). Derive the expression for this
Bayes estimator as a weighted average of p_i and E(π_i).
15.12 For the Bayes estimator (15.4), show that the total mean squared error is

    [K/(n + K)]² Σ_i (π_i − γ_i)² + [n/(n + K)²](1 − Σ_i π_i²).

Show that (15.5) is the value of K that minimizes this.
15.13 Refer to Problem 15.6. For marginal homogeneity, explain why the
minimum modified chi-squared estimates are identical to WLS estimates.
15.14 Let y_i be a bin(n_i, π_i) variate for group i, i = 1, ..., N, with {y_i}
independent. Consider the model that π₁ = ··· = π_N. Denote the
common value by π.
a. Show that the ML estimator of π is p = (Σ_i y_i)/(Σ_i n_i).
b. The minimum chi-squared estimator π̃ is the value of π minimizing

    Σ_{i=1}^N [(y_i/n_i) − π]²/π + Σ_{i=1}^N [(y_i/n_i) − π]²/(1 − π).

The second term results from comparing (1 − y_i/n_i) to (1 − π),
the proportions in the second category. If n₁ = ··· = n_N = 1,
show that π̃ minimizes Np(1 − π)/π + N(1 − p)π/(1 − π).
Hence show that

    π̃ = p^{1/2} / [p^{1/2} + (1 − p)^{1/2}].

Note the bias toward ½ in this estimator.
c. Argue that as N → ∞ with all n_i = 1, the ML estimator is
consistent but the minimum chi-squared estimator is not (Mantel 1985).
15.15 Refer to Problem 15.14. For N = 2 groups with n₁ and n₂ independent
observations, find the minimum modified chi-squared estimator
of π. Compare it to the ML estimator.
15.16 Show that the kernel estimator (15.9) is the same as the Bayes
estimator (15.3) for the Dirichlet prior with {β_i = αn/[(1 − α)N]}.
Using this result, suggest a way of letting the data determine the
value of α in the kernel estimator.
CHAPTER 16
Historical Tour of Categorical Data Analysis*
This book concludes with an informal historical overview of the evolution of
methods for categorical data analysis ŽCDA.. We have seen that categorical
scales are pervasive in the social sciences and the biomedical sciences. Not
surprisingly, the development of GLMs for categorical responses was fostered by statisticians having ties to the social sciences or to the biomedical
sciences.
Only in the last quarter of the twentieth century did these models receive
the attention given early in the century to models for continuous data.
Regression models for continuous variables evolved out of Francis Galton’s
breakthroughs in the 1880s. The strong influence of R. A. Fisher, G. Udny
Yule, and other statisticians on experimentation in agriculture and biological
sciences ensured widespread adoption of regression and ANOVA modeling
by the mid-twentieth century. On the other hand, despite influential articles
around 1900 by Karl Pearson and Yule on association between categorical
variables, models for categorical responses received scant attention until the
1960s.
The beginnings of CDA were often shrouded in controversy. Key figures
in the development of statistical science made groundbreaking contributions,
but these statisticians were often in heated disagreement with one another.
16.1 PEARSON–YULE ASSOCIATION CONTROVERSY
Much of the early development of methods for CDA took place in England,
and it is fitting that we begin our historical tour in London at the beginning
of the twentieth century. The year 1900 is an apt starting point, since in that
year Karl Pearson introduced his chi-squared statistic (X²) and G. Udny
Yule presented the odds ratio and related measures of association. Before
then most work focused on descriptive aspects for relatively simple measures.
For instance, Goodman and Kruskal (1959) noted that the Belgian social
statistician Adolphe Quetelet used the relative risk in 1849.
By 1900, Karl Pearson (1857–1936) was already well known in the statistics
community. He was head of a statistical laboratory at University College in
London. His work of the previous decade included developing a family of
skewed probability distributions (called Pearson curves), obtaining the
product-moment estimate of the correlation coefficient and finding its standard
error, and extending Galton's work on linear regression. In fact, Pearson was
a true renaissance man, writing on a wide variety of topics that included art,
religion, philosophy, law, socialism, women's rights, physics, genetics,
eugenics, and evolution. Pearson's motivation for developing the chi-squared test
included testing whether outcomes on a roulette wheel in Monte Carlo varied
randomly, checking the fit of normal distributions and Pearson curves to
various data sets, and testing statistical independence in two-way contingency
tables.
Much of the literature on CDA early in the twentieth century consisted of
vocal debates about appropriate ways to summarize association. Pearson's
approach assumed that continuous bivariate distributions underlie two-way
contingency tables (Pearson 1904, 1913). He argued in favor of approximating
a measure, such as the correlation, for the underlying continuum. In 1904,
Pearson introduced the term contingency as a "measure of the total
deviation of the classification from independent probability," and he introduced
measures to describe its extent. The tetrachoric correlation is a ML
estimate of the correlation for a bivariate normal distribution assumed to
underlie counts in 2 × 2 tables. It is the correlation value in the bivariate
normal density that would produce cell probabilities equal to the sample cell
proportions when that density is collapsed to a 2 × 2 table having the same
marginal proportions as the observed table. The mean-square contingency and
the contingency coefficient are normalizations of X² to the (0, 1) scale.
Pearson's contingency coefficient (Problem 3.33) for I × J tables standardized
X² to approximate an underlying correlation.
George Udny Yule (1871–1951), a British contemporary of Pearson's, took
a different approach. Having completed pioneering work developing multiple
regression models and multiple and partial correlation coefficients, Yule
turned his attention between 1900 and 1912 to association in contingency
tables. He believed that many categorical variables, such as (vaccinated,
unvaccinated) and (died, survived), are inherently discrete. Yule defined indices
directly using cell counts without assuming an underlying continuum. He
popularized the odds ratio [which Goodman (2000) noted may first have
been proposed by a Hungarian statistician, J. Kőrösy] and a transformation of
it to the [−1, +1] scale, Q = (θ − 1)/(θ + 1), now called Yule's Q (Problem
2.36). Discussing one of Pearson's measures that assumes underlying
normality, Yule argued (1912, p. 612) that "at best the normal coefficient can only
be said to give us in cases like these a hypothetical correlation between
supposititious variables. The introduction of needless and unverifiable
hypotheses does not appear to me a desirable proceeding in scientific work."
Yule (1903) also showed the potential discrepancy between marginal and
conditional associations in contingency tables, later studied by E. H. Simpson
(1951) and now called Simpson's paradox.
In the first quarter of the twentieth century, Karl Pearson was the rarely
challenged leader of statistical science in Britain. Pearson's strong personality
did not take kindly to criticism, and he reacted negatively to Yule's ideas. He
argued that Yule's own coefficients were unsuitable. For instance, Pearson
claimed that their values were unstable, since different collapsings of I × J
tables to 2 × 2 tables could produce quite different values of the measures.
Pearson and D. Heron (1913) filled more than 150 pages of Biometrika, a
journal he co-founded and edited, with a scathing reply to Yule's criticism. In
a passage critical also of Yule's well-received book An Introduction to the
Theory of Statistics, they stated: "If Mr. Yule's views are accepted, irreparable
damage will be done to the growth of modern statistical theory. . . . [Yule's
Q] has never been and never will be used in any work done under his
[Pearson's] supervision. . . . We regret having to draw attention to the manner
in which Mr. Yule has gone astray at every stage in his treatment of
association, but criticism of his methods has been thrust on us not only by
Mr. Yule's recent attack, but also by the unthinking praise which has been
bestowed on a text-book which at many points can only lead statistical
students hopelessly astray." Pearson and Heron attacked Yule's "half-baked
notions" and "specious reasoning" and argued that Yule would have to
withdraw his ideas "if he wishes to maintain any reputation as a statistician."
In retrospect, Pearson and Yule both had valid points. Some
classifications, such as most nominal variables, have no apparent underlying
continuous distribution. On the other hand, many applications relate naturally to an
underlying continuum, and that fact can motivate models and inference (e.g.,
Section 7.2.3). Goodman (1981a, b) noted that the ordinal models presented
in Sections 9.4.1 and 9.6.1 provide a sort of reconciliation between Yule and
Pearson, since Yule's odds ratio characterizes models that fit well when
underlying distributions are approximately normal.
Half a century after the Pearson–Yule controversy, Leo Goodman and
William Kruskal surveyed the development of association measures for
contingency tables and made many contributions of their own. Their 1979
book reprinted four influential articles of theirs from the Journal of the
American Statistical Association on this topic. Initial development of many
measures occurred in the nineteenth century. Their 1959 article contains the
following quote from M. H. Doolittle in 1887, which illustrates the lack of
precision in early attempts to quantify the meaning of association even in
2 × 2 tables: "Having given the number of instances respectively in which
things are both thus and so, in which they are thus but not so, in which they
are so but not thus, and in which they are neither thus nor so, it is required
to eliminate the general quantitative relativity inhering in the mere thingness
of the things, and to determine the special quantitative relativity subsisting
between the thusness and the soness of the things." Goodman (2000) added
to the historical survey and proposed a new measure.
16.2 R. A. FISHER’S CONTRIBUTIONS
Pearson's disagreements with Yule were minor compared to his later ones
with Ronald A. Fisher (1890–1962). Using a geometric representation, Fisher
(1922) introduced degrees of freedom to characterize the family of chi-squared
distributions. Fisher claimed that for tests of independence in I × J tables,
X² has df = (I − 1)(J − 1). By contrast, Pearson (1900, 1904) had argued
that for any application of X², the index that Fisher later identified as df
equals the number of cells minus 1, or IJ − 1 for two-way tables. Fisher
pointed out, however, that estimating hypothesized cell probabilities using
estimated row and column probabilities resulted in an additional (I − 1) +
(J − 1) constraints on the fitted values, thus affecting the distribution of X².
Not surprisingly, Pearson (1922) reacted critically to Fisher's suggestion
that his df formula was incorrect. He stated: "I hold that such a view
[Fisher's] is entirely erroneous, and that the writer has done no service to the
science of statistics by giving it broad-cast circulation in the pages of the
Journal of the Royal Statistical Society. . . . I trust my critic will pardon me for
comparing him with Don Quixote tilting at the windmill; he must either
destroy himself, or the whole theory of probable errors, for they are
invariably based on using sample values for those of the sampled population
unknown to us." Pearson claimed that using row and column sample
proportions to estimate unknown probabilities had negligible effect on
large-sample distributions, although he had realized (Pearson 1917) that df
must be adjusted when the cell counts have linear constraints. Fisher was
unable to get his rebuttal published by the Royal Statistical Society, and he
ultimately resigned his membership.
Statisticians soon realized that Fisher was correct, but he maintained
much bitterness over this and other dealings with Pearson. In the preface to a
later volume of his collected works, he remarked that his 1922 article "had to
find its way to publication past critics who, in the first place, could not
believe that Pearson's work stood in need of correction, and who, if this had
to be admitted, were sure that they themselves had corrected it." Writing
about Pearson, he stated: "If peevish intolerance of free opinion in others is
a sign of senility, it is one which he had developed at an early age." In Fisher
(1926), he was able to dig the knife a bit deeper into the Pearson family using
11,688 2 × 2 tables randomly generated assuming independence by Karl
Pearson's son, E. S. Pearson. Fisher showed that the sample mean of X² for
these tables was 1.00001, much closer to the 1.0 predicted by his formula for
E(X²) of df = (I − 1)(J − 1) = 1 than Pearson's IJ − 1 = 3. His daughter,
Joan Fisher Box (1978), discussed this and other conflicts between Fisher and
Pearson. Hald (1998, pp. 652–663), Plackett (1983), and Stigler (1999,
Chap. 19) summarized the chi-squared controversy.
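Fisher's 1926 check can be replayed in a few lines today. The following Python sketch (an illustration only, with hypothetical function names; not E. S. Pearson's original sampling scheme) generates random 2 × 2 tables under independence and averages Pearson's X²; the mean comes out near 1, in line with df = (I − 1)(J − 1) = 1 rather than IJ − 1 = 3.

```python
import random

def pearson_x2(table):
    """Pearson's X^2 statistic for a two-way table of counts."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return sum((table[i][j] - rows[i] * cols[j] / n) ** 2
               / (rows[i] * cols[j] / n)
               for i in range(len(rows)) for j in range(len(cols)))

def mean_x2_under_independence(n_tables=5000, n=100, seed=1):
    """Average X^2 over random 2x2 tables generated under independence,
    each of n observations falling in any cell with probability 1/4.
    Fisher's formula predicts E(X^2) = (2 - 1)(2 - 1) = 1."""
    rng = random.Random(seed)
    total, used = 0.0, 0
    for _ in range(n_tables):
        t = [[0, 0], [0, 0]]
        for _ in range(n):
            t[rng.randrange(2)][rng.randrange(2)] += 1
        if 0 in (t[0][0] + t[0][1], t[1][0] + t[1][1],
                 t[0][0] + t[1][0], t[0][1] + t[1][1]):
            continue  # skip the vanishingly rare tables with an empty margin
        total += pearson_x2(t)
        used += 1
    return total / used
```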
Fisher's preeminent reputation among statisticians today accrues mainly
from his theoretical work (introducing concepts such as sufficiency,
information, and optimal properties of ML estimators) and his methodological
contributions to the design of experiments and the analysis of variance.
Although not so well known for work in CDA, he made other interesting
contributions. Moreover, he made good use of the methods in his applied
work. For instance, Fisher was also a famed geneticist. In one article, he used
Pearson's goodness-of-fit test to check Mendel's theories of natural
inheritance and showed that the fit was too good (Section 1.5.3).
Fisher realized the limitations of large-sample methods for laboratory
work, and he was at the forefront of advocating specialized small-sample
methods. Writing about large-sample methods in the preface to the first
edition of his classic text Statistical Methods for Research Workers, he stated:
"[T]he traditional machinery of statistical processes is wholly unsuited to the
needs of practical research. Not only does it take a cannon to shoot a
sparrow, but it misses the sparrow! The elaborate mechanism built on the
theory of infinitely large samples is not accurate enough for simple laboratory
data. Only by systematically tackling small sample problems on their merits
does it seem possible to apply accurate tests to practical data." Fisher was
among the first to promote the work by W. S. Gosset (pseudonym "Student")
on the t distribution. The fifth edition of Statistical Methods for Research
Workers (1934) introduced Fisher's exact test for 2 × 2 contingency tables. In
his 1935 book The Design of Experiments, Fisher described the tea-tasting
experiment (Section 3.5.2) motivated by his experience at an afternoon tea
break while employed at Rothamsted Experiment Station.
The mid-1930s finally saw some model building for categorical responses.
Chester Bliss (1934, 1935), following up a 1933 report on quantal response
methods by J. H. Gaddum, popularized the probit model for applications in
toxicology with a binary response. Bliss introduced the term probit but used
the inverse normal cdf with mean 5 (rather than 0, in order to avoid negative
values) and standard deviation 1. In the appendix of Bliss (1935), Fisher
(1935b) outlined an algorithm for finding ML estimates of model parameters.
That algorithm was a Newton–Raphson type of method using expected
information, today commonly called Fisher scoring (Section 4.6.2). Stigler
(1986, p. 246) and Finney (1971) attributed the first use of inverse normal cdf
transformations of proportions to the German physicist Gustav Fechner in
his 1860 book Elemente der Psychophysik. See Finney (1971) and McCulloch
(2000) for other history of the probit method.
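Bliss's convention is easy to state in code. This minimal sketch (the function name is ours) applies the inverse standard normal cdf and adds 5, so probits of ordinary proportions stay positive:

```python
from statistics import NormalDist

def probit(p):
    """Bliss's probit of a proportion p: the inverse standard normal
    cdf, shifted to mean 5 (standard deviation 1) so that values for
    typical proportions avoid negative signs."""
    return 5.0 + NormalDist().inv_cdf(p)
```

For example, probit(0.5) = 5 exactly, and a low response rate such as p = 0.025 maps to about 3.04 instead of −1.96.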
The definition for homogeneous association (no interaction) in contingency
tables originated in an article by the British statistician Maurice
Bartlett (1935) about 2 × 2 × 2 tables. Bartlett showed how to find ML
estimates of cell probabilities satisfying the property of equality of odds ratios
between two variables at each level of the third. He attributed the idea to
Fisher.
In 1940, Fisher developed canonical correlation methods for contingency
tables. He showed how to assign scores to rows and columns of a contingency
table to maximize the correlation. His work relates to the later development,
particularly in France, of correspondence analysis methods (e.g., Benzécri
1973).
R. A. Fisher has had the greatest influence on the practice of modern
statistical science. The biography by his daughter (Box 1978) gives a
fascinating account of his impressive contributions to statistics and genetics.
Fienberg (1980) summarized his contributions to CDA.
16.3 LOGISTIC REGRESSION
Bartlett (1937) used log[y/(1 − y)] in regression and ANOVA to transform
observations y that are continuous proportions (Problem 6.33). In a book of
statistical tables published in 1938, R. A. Fisher and Frank Yates suggested it
as a possible transformation of a binomial parameter for analyzing binary
data. In 1944, the physician and statistician Joseph Berkson introduced the
term logit for this transformation. Berkson showed that the model using the
logit fitted similarly to the probit model, and his subsequent work did much
to popularize logistic regression. In 1951, Jerome Cornfield, another
statistician with strong medical ties, used the odds ratio to approximate relative
risks in case–control studies. Dyke and Patterson (1952) apparently first used
the logit in models with qualitative predictors.
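Berkson's finding that logit and probit models fit similarly reflects the near-proportionality of the two transformations over the central range of p. A small illustration (Python standard library; the names are ours):

```python
import math
from statistics import NormalDist

def logit(p):
    """The transformation log(p / (1 - p)) that Berkson named the logit."""
    return math.log(p / (1.0 - p))

# Over the central range of p, the logit is roughly a constant multiple
# (about 1.6-1.7) of the inverse normal cdf (the unshifted probit),
# which is why fitted logit and probit curves usually look alike.
ratios = [logit(p) / NormalDist().inv_cdf(p) for p in (0.1, 0.2, 0.3, 0.4)]
```

The ratio stays near 1.6–1.7 throughout the central range, so a fitted logit curve is essentially a rescaled probit curve except in the extreme tails.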
Sir David R. Cox introduced many statisticians to logistic regression
through his 1958 article and 1970 book, The Analysis of Binary Data. About
the same time, an article by the Danish statistician and mathematician Georg
Rasch sparked an enormous literature on item response models. The most
important of these is the logit model with subject and item parameters, now
called the Rasch model (Section 12.1.4). This work was highly influential in
the psychometric community of northern Europe (especially in Denmark, the
Netherlands, and Germany) and spurred many generalizations in the
educational testing community in the United States.
The extension of logistic regression to multicategory responses received
occasional attention before 1970 (e.g., Mantel 1966) but substantial work
after about that date. For nominal responses, early work was mainly in
the econometrics literature. See Bock (1970), McFadden (1974), Nerlove
and Press (1973), and Theil (1969, 1970). In 2000, Daniel McFadden won the
Nobel Prize in Economics for his work in the 1970s and 1980s on the
discrete-choice model (Section 7.6). For cumulative logit models for ordinal
responses, see Bock and Jones (1968), Simon (1974), Snell (1964), Walker and
Duncan (1967), and Williams and Grizzle (1972). The cumulative probit case,
based on an underlying normal response, has a longer history; see, for
instance, Aitchison and Silvey (1957) and Bock and Jones (1968, Chap. 8).
Cumulative logit and probit models received much more attention following
publication of McCullagh (1980), which provided a Fisher scoring algorithm
for ML fitting of all cumulative link models.
The next major advances with logistic regression dealt with its application
to case–control studies (e.g., Breslow 1996; Mantel 1973; Prentice 1976a;
Prentice and Pyke 1979; see also Section 5.1.4) and the conditional ML
approach to model fitting for those studies and others with numerous
nuisance parameters (Breslow et al. 1978, with related work in Breslow 1976,
1982; Breslow and Day 1980; Breslow and Powers 1978; Cox 1970; Farewell
1979; Prentice 1976a; Prentice and Breslow 1978; Zelen 1971; see also
Sections 6.7 and 10.2). The conditional approach was later exploited in
small-sample exact inference (Hirji et al. 1987; Mehta and Patel 1995; see
also Section 6.7).
Nathan Mantel, whose name appears in the preceding two paragraphs,
made a variety of interesting contributions to CDA. Although best known for
the 1959 Mantel–Haenszel test and related odds ratio estimator, he also
discussed trend tests (1963), multinomial logit and loglinear modeling (1966),
logistic regression for case–control data (1973), the number of contingency
tables having fixed margins (Gail and Mantel 1977), the analysis of square
contingency tables (Mantel and Byar 1978), and problems with minimum
chi-squared and Wald tests (1985, 1987a).
More recently, attention has focused on fitting logistic models to
correlated responses for clustered data. One strand of this is marginal
modeling of longitudinal data (Diggle et al. 2002; Liang and Zeger 1986;
Liang et al. 1992). Much of this literature focuses on quasi-likelihood
methods such as generalized estimating equations (GEE). Another strand is
generalized linear mixed models (e.g., Breslow and Clayton 1993).
Perhaps the most far-reaching contribution of the past half century has
been the introduction by British statisticians John Nelder and R. W. M.
Wedderburn in 1972 of the concept of generalized linear models. This unifies
the logistic and probit regression models for binomial data with loglinear
models for Poisson data and with long-established regression and ANOVA
models for normal-response data. Interestingly, the algorithm they used to fit
GLMs is Fisher scoring, which R. A. Fisher introduced in 1935 for ML fitting
of probit models. McCulloch (2000) reviewed the journey from probit models
to GLMs and their further generalizations such as quasi-likelihood.
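The algorithm at the center of this story can be sketched compactly. Below is an illustrative (not production) implementation of Fisher scoring for logistic regression with an intercept and one predictor; for the canonical logit link the expected and observed information coincide, so this is also Newton–Raphson, and the same scheme in matrix form fits any GLM.

```python
import math

def fisher_scoring_logistic(x, y, iters=25):
    """Fisher scoring for a logistic regression with intercept b0 and
    slope b1.  Each pass accumulates the score vector and the 2x2
    expected information by hand, then applies one scoring update
    b <- b + (information)^{-1} * score."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        s0 = s1 = i00 = i01 = i11 = 0.0
        for xi, yi in zip(x, y):
            mu = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = mu * (1.0 - mu)      # binomial variance = Fisher weight
            s0 += yi - mu
            s1 += (yi - mu) * xi
            i00 += w
            i01 += w * xi
            i11 += w * xi * xi
        det = i00 * i11 - i01 * i01
        b0 += (i11 * s0 - i01 * s1) / det
        b1 += (i00 * s1 - i01 * s0) / det
    return b0, b1
```

At convergence the score equations force fitted and observed totals to agree, which gives a convenient check on the fit.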
16.4 MULTIWAY CONTINGENCY TABLES AND
LOGLINEAR MODELS
The quarter century following the end of World War II saw the development
of a theoretical underpinning for models for contingency tables. H. Cramér
(1946) derived general expressions for large-sample distributions of
parameter estimators. C. R. Rao (1957, 1963) conducted related work.
In 1949, the Berkeley-based statistician Jerzy Neyman, who had already
performed fundamental work on hypothesis testing and interval estimation
methods with E. S. Pearson, introduced the family of best asymptotically
normal (BAN) estimators. These have the same optimal large-sample
properties as ML estimators. The BAN family includes estimators obtained by
minimizing chi-squared-type measures comparing observed proportions to
proportions predicted by the model (Section 15.3.1). This type of estimator
itself includes some weighted least squares (WLS) estimators. The simplicity of
their computation, compared to ML estimators, was an important
consideration before the advent of modern computing. Neyman's (1949) only mention
of Fisher was the suggestion that Fisher did not realize that estimators other
than ML could be BAN, stating that "the results . . . contradict the assertion
of R. A. Fisher, not a very clear one, that 'the maximum likelihood equation
may indeed be derived from the conditions that it shall be linear in
frequencies, and efficient for all values of θ'." Fisher, of course, returned the
compliment: for instance, writing (1956) about proposals for an unconditional
test for 2 × 2 tables, "the Principles of Neyman and Pearson's 'Theory of
Testing Hypotheses' are liable to mislead those who follow them into much
wasted effort."
In the early 1950s, William Cochran published work dealing with a variety
of important topics in CDA. Scottish-born, Cochran spent most of his career
at American universities: Iowa State, North Carolina State, Johns Hopkins,
and Harvard. He (1940) modeled Poisson and binomial responses with
variance-stabilizing transformations. He (1943) recognized and discussed
ways of dealing with overdispersion. He (1950) introduced a generalization
(Cochran's Q) of McNemar's test for comparing proportions in several
matched samples. His classic 1954 article is a mixture of new methodology
and advice for applied statisticians. It gave sample-size guidelines for
chi-squared approximations to work well for the X² statistic. It also stressed
the importance of directing inferences toward narrow (e.g., single-degree-of-freedom)
alternatives and partitioning chi-squared statistics into components.
One instance of this was Cochran's proposed test of conditional independence
in several 2 × 2 tables, which was closely related to the Mantel and
Haenszel (1959) test (Section 6.3.2). Another was a test for a linear trend in
proportions across quantitatively defined rows of an I × 2 table (Section
5.3.5). See also Cochran (1955). Fienberg (1984) reviewed Cochran's
contributions to CDA.
Bartlett's work on interaction structure in 2 × 2 × 2 contingency tables
had relatively little impact for 20 years. Indeed, in presenting methods for
partitioning X² in 2 × 2 × 2 tables, Lancaster (1951) noted that "Doubtless
little use will ever be made of more than a three-dimensional classification."
However, in the mid-1950s and early 1960s, Bartlett's work was extended in
many ways to multiway tables. See, for instance, Darroch (1962), Good
(1963), Goodman (1964b), Plackett (1962), Roy and Kastenbaum (1956), and
Roy and Mitra (1956). These articles as well as influential articles by Martin
W. Birch (1963, 1964a, b, 1965) were the genesis of research work on
loglinear models between about 1965 and 1975. Birch's work was part of a
never-submitted Ph.D. thesis at the University of Glasgow. He showed how
to obtain ML estimates of cell probabilities in three-way tables, under various
conditions. He showed the equivalence of those ML estimates for Poisson
and multinomial sampling. He (and Watson 1959) extended theoretical
results of Cramér and Rao on large-sample distributions for contingency
table models. Mantel (1966) discussed early results and made the loglinear
model formula explicit. A survey article by the French statistician Henri
Caussinus (1966), based partly on his Ph.D. thesis, provides a good glimpse of
the state of the art of CDA just before this decade of advances. There,
Caussinus introduced the quasi-symmetry model for square tables.
Much of the work in the next decades on loglinear and related logit
modeling took place at three American universities: the University of Chicago,
Harvard University, and the University of North Carolina. At Chicago, Leo
Goodman wrote a series of groundbreaking articles, dealing with such topics
as partitionings of chi-squared, models for square tables (e.g.,
quasi-independence), stepwise logit and loglinear model-building procedures,
deriving asymptotic variances of ML estimates of loglinear parameters, latent
class models, association models, correlation models, and correspondence
analysis. For surveys of his early work, see Goodman (1968, an R. A. Fisher
memorial lecture; 1970). For later work, see Goodman (1985, 1996, 2000).
Goodman also wrote a stream of articles for social science journals that had
a substantial impact on popularizing loglinear and logit methods for
applications (e.g., Goodman 1969b).
Over the past 50 years, Goodman has been the most prolific contributor to
the advancement of CDA methodology. The field owes tremendous gratitude
to his steady and impressive body of work. In addition, some of Goodman's
students at Chicago also made fundamental contributions. In 1970, Shelby
Haberman completed a Ph.D. dissertation (the basis of his 1974a monograph)
making substantial theoretical contributions to loglinear modeling.
Among topics he considered were residual analyses, existence of ML
estimates, loglinear models for ordinal variables, and theoretical results for
models (such as the Rasch model) for which the number of parameters grows
with the sample size. Clifford Clogg followed in Goodman's steps by having
influence in the social sciences and in statistics with his work on association
models, demography, models for rates, the census, and various other topics.
Simultaneously with Goodman's work, related research on ML methods
for loglinear-logit models occurred at Harvard by students of Frederick
Mosteller (such as Stephen Fienberg) and William Cochran. Much of this
research was inspired by problems arising in analyzing large, multivariate
data sets in the National Halothane Study (Bishop and Mosteller 1969; see
also p. 345 of an interview with Lincoln Moses in Statist. Sci. 14, 1999). That
FIGURE 16.1 Four leading figures in the development of categorical data analysis.
study investigated whether halothane was more likely than other anesthetics
to cause death due to liver damage. A presidential address by Mosteller
(1968) to the American Statistical Association described early uses of
loglinear models for smoothing multidimensional discrete data sets. Fienberg
and his own students advanced this work further. A landmark book in 1975 by
him with Yvonne Bishop and Paul Holland, Discrete Multivariate Analysis, was
largely responsible for introducing loglinear models to the general statistical
community and remains an excellent reference.
Research at North Carolina by Gary Koch and several students and
co-workers has been highly influential in the biomedical sciences. Their
research developed WLS methods for categorical data models (Section 15.1).
The 1969 article by Koch with J. Grizzle and F. Starmer popularized this
approach. Koch and colleagues extended it in later articles to an impressive
variety of problems, including problems for which ML methods are awkward
to use, such as the analysis of repeated categorical measurement data (Koch
et al. 1977). In 1966, Vasant Bhapkar showed that the WLS estimator is often
identical to Neyman’s minimum modified chi-squared estimator.
The early literature on loglinear models treated all classifications as
nominal. Haberman (1974b) and Simon (1974) showed how to exploit
ordinality of classifications in loglinear models. This work was extended in
several articles by Leo Goodman (1979a, 1981a, b, 1983, 1985, 1986). The
extensions included association models, which replace ordered scores in
loglinear models by parameters (Section 9.5). Goodman (1985, 1986, 1996)
also discussed related correlation models and provided a model-based
perspective for the closely related correspondence analysis methods.
Certain loglinear models with conditional independence structure provide
graphical models for contingency tables. These relate to the association
graphs used in Section 9.1. Darroch et al. (1980) was the genesis of much of
this work.
16.5 RECENT (AND FUTURE?) DEVELOPMENTS
The most active area of new research in CDA in the past decade has been
the modeling of clustered data, such as occur in longitudinal studies and
other forms of repeated measurement. A variety of ways now exist of
modeling while accounting for the correlation among responses in the same
cluster.
As discussed in Chapters 11 and 12, ML estimation is difficult for such
models. For complex forms of generalized linear mixed models, for instance,
it is a challenge to estimate the regression parameters and variance
components well. Integrating out the random effects to obtain the likelihood
function requires an approximation such as numerical integration. Not
surprisingly, various Monte Carlo approaches are applied increasingly here. A promising
approach is a Monte Carlo EM algorithm that uses a Monte Carlo approximation for the E step (Booth and Hobert 1999). The Monte Carlo error can
be assessed at each iteration, and one can accurately reproduce the ML
estimates with sufficiently many iterations.
The modeling of clustered correlated data is likely to be an active area of
research in coming years. The class of generalized linear mixed models is
certain to see substantial work and further generalization. One extension
is generalized additive mixed models. Time-series models for categorical
responses have so far received relatively little attention. For all such models
with correlated responses, model diagnostics are of vital importance and
need development. For longitudinal data, missing data are a common problem. This area currently has much activity.
Another important recent advance is the development of efficient algorithms for exact small-sample methods. With such methods, one can guarantee that the size of a test is no greater than some prespecified level and that
the coverage probability for a confidence interval is at least the nominal level.
The "exactness" refers only to inference being based on probability distributions that do not depend on unknown parameters. There is no unique way to
do this, and certain methods can be highly conservative because of discreteness. Most literature deals with the conditional approach, which eliminates
nuisance parameters by conditioning on their sufficient statistics. Hence, the
basic idea builds on Fisher’s exact test. Conditional methods are versatile,
applying to exponential family linear models that use the canonical link
function, such as loglinear models for Poisson responses and logit models for
binomial responses. Many of the computational advances with the exact
conditional approach occurred in a series of articles by Cyrus Mehta, Nitin
Patel, and colleagues at Harvard (e.g., Mehta and Patel 1983), using the
network algorithm. See surveys by Agresti (1992), Mehta (1994), Mehta and
Patel (1995), and the StatXact and LogXact manuals (Cytel Software,
Cambridge, MA, founded by Mehta and Patel).
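The conditional calculation that these algorithms accelerate is, at its core, Fisher's hypergeometric computation. A brute-force sketch (function name ours; plain enumeration, with none of the network algorithm's cleverness):

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table
    [[a, b], [c, d]].  With all margins fixed, the count in cell (1,1)
    is hypergeometric; the two-sided p-value sums the probabilities of
    every table no more probable than the one observed."""
    n = a + b + c + d
    r1 = a + b                # first row total
    c1 = a + c                # first column total
    def prob(x):              # P(cell (1,1) = x) given the margins
        return comb(r1, x) * comb(n - r1, c1 - x) / comb(n, c1)
    p_obs = prob(a)
    lo, hi = max(0, c1 - (n - r1)), min(r1, c1)
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-12))
```

For Fisher's tea-tasting table (3, 1, 1, 3) this returns 34/70 ≈ 0.486; the point of the network algorithm and its successors is to make such enumeration feasible for tables far too large to handle naively.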
Although the development of "exact" methods has seen considerable
progress, certain analyses are still infeasible and likely to be so for some time
because of the exponential increase in computing time as the table size or
sample size increases. There is an ever-increasing variety of methods for
accurate approximation of exact methods. These include simple Monte Carlo
(e.g., Agresti et al. 1979), Monte Carlo with importance sampling (e.g., Booth
and Butler 1999; Mehta et al. 1988), Markov chain Monte Carlo (MCMC;
Forster et al. 1996), saddlepoint approximations (Pierce and Peters 1992;
Strawderman and Wells 1998), and related work on an approximate
conditioning approach (Pierce and Peters 1999) in which discreteness is not so
problematic.
Finally, the development of Bayesian approaches to CDA is an increasingly
active area. The multiplicity of parameters complicates Bayesian modeling.
For early use of Bayesian estimation of probabilities, see Good (1965)
and Lindley (1964). Good's (1965) article apparently evolved from his work
during World War II with Alan Turing at Bletchley Park, England, on
breaking Nazi codes. The development of the Bayesian approach for CDA is
discussed in Section 15.2.3.
Predicting the future is always dangerous. However, it is likely that much
future research will focus on computationally intensive methods such as
generalized linear mixed models. Another hot topic, largely outside the realm
of traditional modeling, is the development of algorithmic methods for huge
data sets with large numbers of variables. Such methods, often referred to as
data mining, deal with the handling of complex data structures, with a
premium on predictive power at the sacrifice of simplicity and interpretability
of structure. Important areas of application include genetics, such as the
analysis of discrete DNA sequences in the form of very high-dimensional
contingency tables, and business applications such as credit scoring and
tree-structured methods for predicting future behavior of customers.
Sources for the historical tour in this chapter include Stigler (1986),
Studies in the History of Probability and Statistics, edited by E. S. Pearson and
M. G. Kendall (London: Griffin, 1970), and personal conversations over the
years with several statisticians, including Erling Andersen, R. L. Anderson,
Henri Caussinus, William Cochran, Sir David Cox, John Darroch, Leo
Goodman, Gary Koch, Frederick Mosteller, John Nelder, C. R. Rao, Stephen
Stigler, Geoffrey Watson, and Marvin Zelen. To readers who have made it
this far, I congratulate your perseverance! To develop a more complete
understanding of the historical development of CDA, you may want to study
the following chronological list of 25 sources. These convey a sense of how
methodology has evolved. Alternatively, look at some early books on this
topic, such as A. E. Maxwell's Analysing Qualitative Data (New York:
Methuen, 1961), R. L. Plackett's The Analysis of Categorical Data (London:
Griffin, 1974), and the Bishop, Fienberg, and Holland Discrete Multivariate
Analysis (Cambridge, MA: MIT Press, 1975).
Pearson (1900)
Yule (1912)
Fisher (1922)
Bartlett (1935)
Berkson (1944)
Neyman (1949)
Cochran (1954)
Goodman and Kruskal (1954)
Roy and Mitra (1956)
Cox (1958a)
Mantel and Haenszel (1959)
Birch (1963)
Birch (1964b)
Caussinus (1966)
Goodman (1968)
Mosteller (1968)
Grizzle et al. (1969)
Goodman (1970)
Nelder and Wedderburn (1972)
Haberman (1974a)
McFadden (1974)
Goodman (1979a)
McCullagh (1980)
Liang and Zeger (1986)
Breslow and Clayton (1993)
APPENDIX A
Using Computer Software to
Analyze Categorical Data
In this appendix we discuss statistical software for categorical data analysis,
with emphasis on SAS. We begin by mentioning major software that can
perform the analyses discussed in this book. Then we illustrate, by chapter,
SAS code for the analyses. Information about other packages (such as S-Plus,
R, SPSS, and Stata), as well as updated information about SAS, is at the Web
site (www.stat.ufl.edu/~aa/cda/cda.html). Section A.2 on SAS also lists
other software for analyses not currently available in SAS.
A.1 SOFTWARE FOR CATEGORICAL DATA ANALYSIS
A.1.1 SAS
SAS is general-purpose software for a wide variety of statistical analyses. The
main procedures (PROCs) for categorical data analyses are FREQ,
GENMOD, LOGISTIC, NLMIXED, and CATMOD.
PROC FREQ computes measures of association and their estimated
standard errors. It also performs generalized Cochran–Mantel–Haenszel
tests of conditional independence, and exact tests of independence in I × J
tables.
PROC GENMOD fits generalized linear models. It fits cumulative link
models for ordinal responses. It can perform GEE analyses for marginal
models. One can form one’s own variance function and allow scale parameters, making it suitable for quasi-likelihood analyses.
PROC LOGISTIC gives ML fitting of binary response models, cumulative
link models for ordinal responses, and baseline-category logit models for
nominal responses. It incorporates model selection procedures, regression
diagnostic options, and exact conditional inference. PROC PROBIT also
conducts ML fitting of binary and cumulative link models as well as quantal
response models that permit a strictly positive probability as the linear
predictor decreases to −∞.
PROC CATMOD fits baseline-category logit models. It is also useful for
WLS fitting of a wide variety of models for categorical data.
PROC NLMIXED fits generalized linear mixed models (GLMMs). It
approximates the likelihood using adaptive Gauss–Hermite quadrature.
Other programs run on SAS that are not specifically supported by the SAS
Institute. For further details about SAS for categorical data analyses, see the
very helpful guide by Stokes et al. (2000). Also useful are SAS publications on
logistic regression (Allison 1999) and graphics (Friendly 2000).
A.1.2 Other Software Packages
Most major statistical software has procedures for categorical data analyses.
For instance, see SPSS (SPSS Regression Models 10.0 by M. J. Norusis, SPSS
Inc., 1999), Stata (A Handbook of Statistical Analyses Using Stata, 2nd ed., by
S. Rabe-Hesketh and B. Everitt, CRC Press, Boca Raton, FL, 2000), S-Plus
(Modern Applied Statistics with S-Plus, 3rd ed., by W. N. Venables and B. D.
Ripley, Springer-Verlag, New York, 1999), and the related free package, R,
and GLIM (Aitkin et al. 1989). Most major software now follows the lead of
GLIM and includes a generalized linear models routine. Examples are
PROC GENMOD in SAS and the glm function in R and S-Plus.
For certain analyses, specialized software is better than the major packages.
A good example is StatXact (Cytel Software, Cambridge, Massachusetts),
which provides exact analysis for categorical data methods and some
nonparametric methods. Among its procedures are small-sample confidence
intervals for differences and ratios of proportions and for odds ratios, and
Fisher's exact test and its generalizations for I × J tables. It can also conduct
exact tests of conditional independence and of equality of odds ratios in
2 × 2 × K tables, and exact confidence intervals for the common odds ratio
in several 2 × 2 tables. StatXact uses Monte Carlo methods to approximate
exact P-values and confidence intervals when a data set is too large for exact
inference to be computationally feasible. Its companion, LogXact, performs
exact conditional logistic regression.
Other examples of specialized software are SUDAAN for GEE-type
analyses that handle clustering in survey data (Research Triangle Institute,
Research Triangle Park, North Carolina), Latent GOLD for latent class
modeling (Statistical Innovations, Belmont, Massachusetts), MLn (Institute
of Education, London) and HLM (Scientific Software, Chicago) for
multilevel models, and PASS for power analyses (NCSS Statistical Software,
Kaysville, Utah). S-Plus and R functions are also available from individuals
or from published work for particular analyses. For instance, Statistical
Models in S by J. M. Chambers and T. J. Hastie (Wadsworth, Belmont,
California, 1993, p. 227) showed the use of S-Plus in quasi-likelihood
analyses using the quasi and make.family functions.
USING COMPUTER SOFTWARE TO ANALYZE CATEGORICAL DATA
TABLE A.1 SAS Code for Chi-Squared, Measures of Association,
and Residuals for Education–Religion Data in Table 3.2
data table;
input degree religion $ count @@;
datalines;
1 fund 178
1 mod 138
1 lib 108
2 fund 570
2 mod 648
2 lib 442
3 fund 138
3 mod 252
3 lib 252
;
proc freq order = data; weight count;
tables degree*religion/ chisq expected measures cmh1;
proc genmod order = data; class degree religion;
model count = degree religion / dist = poi link = log residuals;
A.2 EXAMPLES OF SAS CODE BY CHAPTER
The examples below show SAS code (Version 8.1). We focus on basic model
fitting rather than the great variety of options. The material is organized by
chapter of presentation. For convenience, data for examples are entered in
the form of the contingency table displayed in the text. In practice, one
would usually enter data at the subject level. These tables and the full data
sets are available at www.stat.ufl.edu/~aa/cda/cda.html.
Chapters 1–3: Introduction, Two-Way Contingency Tables
Table A.1 uses SAS to analyze Table 3.2. The @@ symbol indicates that
each line of data contains more than one observation. Input of a variable as
characters rather than numbers requires an accompanying $ label in the
INPUT statement. PROC FREQ forms the table with the TABLES
statement, ordering row and column categories alphanumerically. To use
instead the order in which the categories appear in the data set (e.g., to treat
the variable properly in an ordinal analysis), use the ORDER=DATA option
in the PROC statement. The WEIGHT statement is needed when one enters
the cell counts instead of subject-level data. PROC FREQ can conduct
chi-squared tests of independence (CHISQ option), show its estimated
expected frequencies (EXPECTED), provide a wide assortment of measures
of association and their standard errors (MEASURES), and provide the
ordinal statistic (3.15) with a "nonzero correlation" test (CMH1). One can
also perform chi-squared tests using PROC GENMOD (using loglinear
models discussed in the Chapters 8-9 section of this appendix), as shown. Its
RESIDUALS option provides cell residuals. The output labeled "StReschi"
is the standardized Pearson residual (3.13).
Table A.2 analyzes Table 3.8. With PROC FREQ, for 2 × 2 tables the
MEASURES option in the TABLES statement provides confidence intervals
TABLE A.2 SAS Code for Fisher’s Exact Test and Confidence Intervals
for Odds Ratio for Tea-Tasting Data in Table 3.8
data fisher;
input poured guess count @@;
datalines;
1 1 3
1 2 1
2 1 1
2 2 3
;
proc freq;
weight count;
tables poured*guess / measures riskdiff;
exact fisher or / alpha = .05;
proc logistic descending; freq count;
model guess = poured / clodds = pl;
for the odds ratio (labeled "case-control" on output) and the relative risk,
and the RISKDIFF option provides intervals for the proportions and their
difference. For tables having small cell counts, the EXACT statement can
provide various exact analyses. These include Fisher's exact test and its
generalization for I × J tables, treating variables as nominal, with keyword
FISHER. The OR keyword gives the odds ratio and its large-sample
confidence interval (3.2) and the small-sample interval based on (3.20). Other
EXACT statement keywords include binomial tests for 1 × 2 tables (keyword
BINOMIAL), exact trend tests for I × 2 tables (TREND), and exact
chi-squared tests (CHISQ) and exact correlation tests for I × J tables
(MHCHI). One can use Monte Carlo simulation (option MC) to estimate
exact P-values when the exact calculation is too time consuming. Table A.2
also uses PROC LOGISTIC to get a profile-likelihood confidence interval
for the odds ratio (CLODDS=PL). In LOGISTIC, the FREQ statement
serves the same purpose as the WEIGHT statement in PROC FREQ.
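For instance, a sketch of Monte Carlo estimation of an exact P-value (the
table variables are hypothetical, and the N= and SEED= values arbitrary):

proc freq;
weight count;
tables row*col;
exact fisher / mc n = 10000 seed = 1234;  * Monte Carlo exact P-value;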
Other
StatXact provides small-sample confidence intervals for a binomial
parameter, the difference of proportions, relative risk, and odds ratio.
Blaker (2000) gave S-Plus functions that provide his confidence interval for
a binomial parameter.
Chapter 4: Models for Binary Response Variables
PROC GENMOD fits GLMs. It specifies the response distribution in the
DIST option ("poi" for Poisson, "bin" for binomial, "mult" for multinomial,
"negbin" for negative binomial) and specifies the link in the LINK option.
Table A.3 illustrates for Table 4.2. For binomial models with grouped data,
the response in the model statements takes the form of the number of
‘‘successes’’ divided by the number of cases.
TABLE A.3 SAS Code for Binary GLMs for Snoring Data in Table 4.2
data glm;
input snoring disease total @@;
datalines;
0 24 1379
2 35 638
4 21 213
5 30 254
;
proc genmod; model disease / total = snoring / dist = bin link = identity;
proc genmod; model disease / total = snoring / dist = bin link = logit;
proc genmod; model disease / total = snoring / dist = bin link = probit;
TABLE A.4 SAS Code for Poisson and Negative Binomial GLMs for Horseshoe
Crab Data in Table 4.3
data crab;
input color spine width satell weight;
datalines;
3 3 28.3 8 3.05
4 3 22.5 0 1.55
...
3 2 24.5 0 2.00
;
proc genmod;
model satell = width / dist = poi link = log;
proc genmod;
model satell = width / dist = poi link = identity;
proc genmod;
model satell = width / dist = negbin link = identity;
Table A.4 uses GENMOD for count modeling of Table 4.3. Each observation refers to a single crab. Using width as the predictor, the first two models
use Poisson regression. The third model uses the identity link assuming a
negative binomial distribution.
Table A.5 uses GENMOD for the overdispersed data of Table 4.5.
A CLASS statement requests dummy variables for the groups. With no
intercept in the model (option NOINT) for the identity link, the estimated
parameters are the four group probabilities. The ESTIMATE statement provides an estimate, confidence interval, and test for a contrast of
model parameters, in this case the difference in probabilities for the first
and second groups. The second analysis uses the Pearson statistic to scale
standard errors to adjust for overdispersion. PROC LOGISTIC can also
provide overdispersion modeling of binary responses; see Table A.27 in the
Chapter 13 part of this appendix.
PROC GAM (starting in Version 8.2) fits generalized additive models.
TABLE A.5 SAS Code for Overdispersion Modeling of Teratology Data in Table 4.5
data moore;
input litter group n y @@;
datalines;
1 1 10 1
2 1 11 4
3 1 12 9
4 1 4 4
5 1 10 10
...
55 4 14 1
56 4 8 0
58 4 17 0
;
proc genmod; class group;
model y/n = group / dist = bin link = identity noint;
estimate ‘pi1- pi2 ’ group 1 -1 0 0;
proc genmod; class group;
model y/n = group / dist = bin link = identity noint scale = pearson;
Chapters 5 and 6: Logistic Regression
One can fit logistic regression models using either software for GLMs or
specialized software for logistic regression. PROC GENMOD uses
Newton–Raphson, whereas PROC LOGISTIC uses Fisher scoring. Both yield
ML estimates, but SE values use the observed information in GENMOD and
the expected information in LOGISTIC. These are the same for the logit link.
Table A.6 applies GENMOD and LOGISTIC to Table 5.2, where "y" out
of "n" crabs had satellites at a given width level. In GENMOD, the LRCI
option provides profile likelihood confidence intervals. The ALPHA= option
can specify an error probability other than the default of 0.05. The TYPE3
option provides likelihood-ratio tests for each parameter. (In the Chapters
8-9 section we discuss the second GENMOD analysis.)
TABLE A.6 SAS Code for Modeling Grouped Crab Data in Table 5.2
data crab;
input width y n satell; logcases = log(n);
datalines;
22.69 5 14 14
...
30.41 14 14 72
;
proc genmod;
model y/n = width / dist = bin link = logit lrci alpha = .01 type3;
proc logistic;
model y/n = width / influence stb;
output out = predict p = pi_hat lower = LCL upper = UCL;
proc print data = predict;
proc genmod;
model satell = width / dist = poi link = log offset = logcases residuals;
TABLE A.7 SAS Code for Logit Modeling of AIDS Data in Table 5.5
data aids;
input race $ azt $ y n @@;
datalines;
White Yes 14 107
White No 32 113
Black Yes 11 63
Black No 12 55
;
proc genmod; class race azt;
model y/n = azt race / dist = bin type3 lrci residuals obstats;
proc logistic; class race azt / param = reference;
model y/n = azt race / aggregate scale = none clparm = both clodds = both;
output out = predict p = pi_hat lower = lower upper = upper;
proc print data = predict;
proc logistic; class race azt (ref = first) / param = ref;
model y/n = azt / aggregate = (azt race) scale = none;
With PROC LOGISTIC, logistic regression is the default for binary data.
LOGISTIC has a built-in check of whether logistic regression ML estimates
exist. It can detect a complete separation of data points with 0 and 1
outcomes. LOGISTIC can also apply other links, such as the probit. Its
INFLUENCE option provides Pearson and deviance residuals and diagnostic
measures (Pregibon 1981). The STB option provides standardized estimates
by multiplying by s_{x_j}√3/π (Section 5.4.7 and Note 5.9). Following the model
statement, Table A.6 requests predicted probabilities and lower and upper
95% confidence limits for the probabilities.
Table A.7 uses GENMOD and LOGISTIC to fit a logit model with
qualitative predictors to Table 5.5. In GENMOD, the OBSTATS option
provides various "observation statistics," including predicted values and their
confidence limits. The RESIDUALS option requests residuals such as the
Pearson and standardized Pearson residuals (labeled "Reschi" and
"StReschi"). A CLASS statement requests dummy variables for the factor. By
default, in GENMOD the parameter estimate for the last level of each factor
equals 0. In LOGISTIC, estimates sum to zero. That is, dummies take the
effect coding (1, −1), with 1 when in the category and −1 when not, for which
parameters sum to 0. In the CLASS statement in LOGISTIC, the option
PARAM=REF requests (1, 0) dummy variables with the last category as the
reference level. Also, putting REF=FIRST next to a variable name requests
its first category as the reference level. The CLPARM=BOTH and
CLODDS=BOTH options provide Wald and profile likelihood confidence
intervals for parameters and odds ratio effects of explanatory variables. With
AGGREGATE SCALE=NONE in the model statement, LOGISTIC reports
Pearson and deviance tests of fit; it forms groups by aggregating data
into the possible combinations of explanatory variable values, without
overdispersion adjustments. Adding variables in parentheses after
AGGREGATE (as in the second use of LOGISTIC in Table A.7) specifies the
predictors used for forming the table on which to test fit, even when some
predictors may have no effect in the model.
TABLE A.8 SAS Code for Logistic Regression Models with Horseshoe
Crab Data in Table 4.3
data crab;
input color spine width satell weight;
if satell>0 then y = 1; if satell = 0 then y = 0;
if color = 4 then light = 0; if color<4 then light = 1;
datalines;
2 3 28.3 8 3.05
...
2 2 24.5 0 2.00
;
proc genmod descending; class color;
model y = width color / dist = bin link = logit lrci type3 obstats;
contrast 'a-d' color 1 0 0 -1;
proc genmod descending;
model y = width color / dist = bin link = logit;
proc genmod descending;
model y = width light / dist = bin link = logit;
proc genmod descending; class color spine;
model y = width weight color spine / dist = bin link = logit type3;
proc logistic descending; class color spine / param = ref;
model y = width weight color spine / selection = backward lackfit
outroc = classif1;
proc plot data = classif1; plot _sensit_*_1mspec_;
Table A.8 shows logistic regression analyses for Table 4.3. The models
refer to a constructed binary variable Y that equals 1 when a horseshoe crab
has satellites and 0 otherwise. With binary data entry, GENMOD and
LOGISTIC order the levels alphanumerically, forming the logit with (1, 0)
responses as log[P(Y = 0)/P(Y = 1)]. Invoking the procedure with
DESCENDING following the PROC name reverses the order. The first two
GENMOD statements use both color and width as predictors; color is
qualitative in the first model (by the CLASS statement) and quantitative in
the second. A CONTRAST statement tests contrasts of parameters, such as
whether parameters for two levels of a factor are identical. The statement
shown contrasts the first and fourth color levels. The third GENMOD
statement uses a dummy variable for color, indicating whether a crab is light
or dark (color = 4). The fourth GENMOD statement fits the main-effects
model using all the predictors from Table 4.3. LOGISTIC has options for
stepwise selection of variables, as the final model statement shows. The
LACKFIT option yields the Hosmer–Lemeshow statistic. Using the
OUTROC option, LOGISTIC can output a data set for plotting a ROC curve.
Table A.9 analyzes Table 6.9. The CMH option in PROC FREQ specifies
the CMH statistic, the Mantel–Haenszel estimate of a common odds ratio
and its confidence interval, and the Breslow–Day statistic. FREQ uses the
TABLE A.9 SAS Code for CMH Analysis of Clinical Trial Data in Table 6.9
data crab;
input center $ treat response count @@ ;
datalines;
a 1 1 11
a 1 2 25
a 2 1 10
a 2 2 27
...
h 1 1 4
h 1 2 2
h 2 1 6
h 2 2 1
;
proc freq; weight count;
tables center*treat*response/ cmh chisq;
two rightmost variables in the TABLES statement as the rows and columns
for each partial table; the CHISQ option yields chi-squared tests of
independence for each partial table. For I × 2 tables the TREND keyword in
the TABLES statement provides the Cochran–Armitage trend test.
Exact conditional logistic regression is available in PROC LOGISTIC with
the EXACT statement. It provides ordinary and mid-P-values as well as
confidence limits for each model parameter and the corresponding odds ratio
with the ESTIMATE=BOTH option. One can also conduct the exact
conditional version of the Cochran–Armitage test using the TREND option
in the EXACT statement with PROC FREQ. Version 9 of SAS will include
asymptotic conditional logistic regression, using a STRATA statement to
indicate the stratification parameters to be conditioned out. One can also use
PROC PHREG to do this (Stokes et al. 2000).
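The PHREG approach codes matched sets as strata of an artificial survival
analysis. As a sketch (the data set and variable names are hypothetical),
cases receive a smaller artificial "survival time" than controls, and
TIES=DISCRETE yields the conditional logistic likelihood:

data matched2; set matched;
time = 2 - case;  * case = 1 for cases, 0 for controls;
proc phreg data = matched2;
model time*case(0) = x / ties = discrete;
strata set_id;  * one stratum per matched set;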
Models with probit and complementary log-log (CLOGLOG) links are
available with PROC GENMOD, PROC LOGISTIC, or PROC PROBIT.
O'Brien (1986) gave a SAS macro for computing powers using the noncentral
chi-squared distribution.
Other
LogXact provides exact conditional logistic regression, and StatXact provides
exact inference about the odds ratio in 2 × 2 × K tables. PASS (NCSS
Statistical Software) provides power analyses.
Chapter 7: Multinomial Response Models
PROC LOGISTIC fits baseline-category logit models (as of Version 8.2)
using the LINK=GLOGIT option. The final response category is the
default baseline for the logits. Exact inference is also available using the
conditional distribution to eliminate nuisance parameters. PROC CATMOD
also fits baseline-category logit models, as Table A.10 shows. CATMOD
codes estimates for a factor so that they sum to zero. The PRED=PROB
and PRED=FREQ options provide predicted probabilities and fitted values
and their standard errors. The POPULATION statement provides the
TABLE A.10 SAS Code for Baseline-Category Logit Models with Alligator Data
in Table 7.1
data gator;
input lake gender size food count @@;
datalines;
1 1 1 1 7 1 1 1 2 1 1 1 1 3 0 1 1 1 4 0 1 1 1 5 5
...
4 2 2 1 8 4 2 2 2 1 4 2 2 3 0 4 2 2 4 0 4 2 2 5 1
;
proc logistic; freq count; class lake size / param = ref;
model food(ref = '1') = lake size / link = glogit
aggregate scale = none;
proc catmod; weight count;
population lake size gender;
model food = lake size / pred = freq pred = prob;
variables that define the predictor settings. For instance, with ‘‘gender’’ in
that statement, the model with lake and size effects is fitted to the full table
also classified by gender.
PROC GENMOD can fit the proportional odds version of cumulative
logit models using the DIST=MULTINOMIAL and LINK=CLOGIT
options. Table A.11 fits it to Table 7.5. When the number of response
categories exceeds 2, by default PROC LOGISTIC fits this model. It also
gives a score test of the proportional odds assumption of identical effect
parameters for each cutpoint. Both procedures use the α_j + βx form of the
model. Cox (1995) used PROC NLIN for the more general model (7.8) having
a scale parameter.
Both GENMOD and LOGISTIC can use other links in cumulative link
models. GENMOD uses LINK=CPROBIT for the cumulative probit model
and LINK=CCLL for the cumulative complementary log-log model. Table
A.11 uses LINK=PROBIT in LOGISTIC to fit a cumulative probit model.
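As a sketch (untested), the corresponding cumulative probit fit in GENMOD
for the mental impairment data would change only the link:

proc genmod;
model mental = life ses / dist = multinomial link = cprobit;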
TABLE A.11 SAS Code for Cumulative Logit and Probit Models with Mental
Impairment Data in Table 7.5
data impair;
input mental ses life;
datalines;
1 1 1
...
4 0 9
;
proc genmod ;
model mental = life ses / dist = multinomial link = clogit lrci type3;
proc logistic;
model mental = life ses / link = probit;
TABLE A.12 SAS Code for Adjacent-Categories Logit and Mean Response Models
and CMH Analysis of Job Satisfaction Data in Table 7.8
data jobsat;
input gender income satisf count @@;
count2 = count + .01;
datalines;
1 1 1 1 1 1 2 3 1 1 3 11 1 1 4 2
...
0 4 1 0 0 4 2 1 0 4 3 9 0 4 4 6
;
proc catmod order = data; * ML analysis of adj-cat logit (ACL) model;
weight count;
population gender income;
model satisf =
(1 0 0 3 3, 0 1 0 2 2, 0 0 1 1 1,
1 0 0 6 3, 0 1 0 4 2, 0 0 1 2 1,
1 0 0 9 3, 0 1 0 6 2, 0 0 1 3 1,
1 0 0 12 3, 0 1 0 8 2, 0 0 1 4 1,
1 0 0 3 0, 0 1 0 2 0, 0 0 1 1 0,
1 0 0 6 0, 0 1 0 4 0, 0 0 1 2 0,
1 0 0 9 0, 0 1 0 6 0, 0 0 1 3 0,
1 0 0 12 0, 0 1 0 8 0, 0 0 1 4 0)
/ml pred = freq;
proc catmod order = data; weight count2; * WLS analysis of ACL model;
response alogits; population gender income; direct gender income;
model satisf = response gender income;
proc catmod; weight count; * mean response model;
population gender income; response mean; direct gender income;
model satisf = gender income / covb;
proc freq; weight count;
tables gender*income*satisf/ cmh scores = table;
One can fit adjacent-categories logit models in CATMOD by fitting
equivalent baseline-category logit models. Table A.12 uses it for Table 7.8,
where each line of code in the model statement specifies the predictor values
(for the three intercepts, income, and gender) for the three logits. The
income and gender predictor values are multiplied by 3 for the first logit, 2
for the second, and 1 for the third, to make effects comparable in the two
models. PROC CATMOD has options (CLOGITS and ALOGITS) for fitting
cumulative logit and adjacent-categories logit models to ordinal responses;
however, those options provide weighted least squares (WLS) rather than
ML fits. A constant must be added to empty cells for WLS to run. CATMOD
treats zero counts as structural zeros, so they must be replaced by small
constants when they are actually sampling zeros. The DIRECT statements
identify predictors treated as quantitative. The second analysis in Table A.12
uses the ALOGITS option. CATMOD can also fit mean response models
using WLS, as the third analysis in Table A.12 shows.
With the CMH option, PROC FREQ provides the generalized CMH tests
of conditional independence. The statistic for the "general association"
alternative treats X and Y as nominal [statistic (7.20)], the statistic for the
"row mean scores differ" alternative treats X as nominal and Y as ordinal,
and the statistic for the "nonzero correlation" alternative treats X and Y as
ordinal [statistic (7.21)]. Table A.12 analyzes Table 7.8, using scores (1, 2, 3, 4)
for each variable.
PROC MDC fits multinomial discrete-choice models, with logit and probit
links. One can also use PROC PHREG, which is designed for the Cox
proportional hazards model for survival analysis, because the partial
likelihood for that analysis has the same form as the likelihood for the
multinomial model (Allison 1999, Chap. 7; Chen and Kuo 2001).
Other
LogXact provides exact conditional analyses for baseline-category logit
models. Joseph Lang (jblang@stat.uiowa.edu) has an R function that can fit
mean response models by ML.
Chapters 8 and 9: Loglinear Models
Table A.13 uses GENMOD to fit model (AC, AM, CM) to Table 8.3. Table
A.14 uses GENMOD for table raking of Table 8.15. Table A.15 uses
GENMOD to fit the linear-by-linear association model (9.6) and the row
effects model (9.8) to Table 9.3 (with column scores 1, 2, 4, 5). The defined
TABLE A.13 SAS Code for Fitting Loglinear Models to Drug Survey
Data in Table 8.3
data drugs;
input a c m count @@;
datalines;
1 1 1 911
1 1 2 538
1 2 1 44
1 2 2 456
2 1 1 3
2 1 2 43
2 2 1 2
2 2 2 279
;
proc genmod; class a c m;
model count = a c m a*m a*c c*m / dist = poi link = log lrci type3 obstats;
TABLE A.14 SAS Code for Raking Table 8.15
data rake;
input school atti count @@;
log_c = log(count); pseudo = 100 / 3;
datalines;
1 1 209
1 2 101
1 3 237
...
;
proc genmod; class school atti;
model pseudo = school atti / dist = poi link = log offset = log_c obstats;
TABLE A.15 SAS Code for Fitting Association Models to GSS Data in Table 9.3
data sex;
input premar birth u v count @@; assoc = u*v ;
datalines;
1 1 1 1 38
1 2 1 2 60
1 3 1 4 68
1 4 1 5 81
...
;
proc genmod; class premar birth;
model count = premar birth assoc / dist = poi link = log;
proc genmod; class premar birth;
model count = premar birth premar*v / dist = poi link = log;
variable "assoc" represents the cross-product of row and column scores,
which has the β parameter as its coefficient in model (9.6). Table A.6 uses
GENMOD to fit the Poisson regression model with log link for the grouped
data of Table 5.2. It models the total number of satellites at each width level
(variable "satell"), using the log of the number of cases as offset.
Correspondence analysis is available with PROC CORRESP.
Other
Prof. Joseph Lang (jblang@stat.uiowa.edu) has R and S-Plus functions for
ML fitting of the generalized loglinear model (8.18). Becker (1990) gave a
FORTRAN program that fits the RC(M) model.
Chapter 10: Models for Matched Pairs
Table A.16 analyzes Table 10.1. For square tables, the AGREE option in
PROC FREQ provides the McNemar chi-squared statistic for binary matched
pairs, the X^2 test of fit of the symmetry model (also called Bowker's test),
TABLE A.16 SAS Code for McNemar’s Test and Comparing Proportions
for Matched Samples in Table 10.1
data matched;
input first second count @@;
datalines;
1 1 794
1 2 150
2 1 86
2 2 570
;
proc freq; weight count;
tables first*second / agree; exact mcnem;
proc catmod; weight count;
response marginals;
model first*second = (1 0,
1 1);
TABLE A.17 SAS Code for Testing Marginal Homogeneity with Migration
Data in Table 10.6
data migrate;
input then $ now $ count m11 m12 m13 m21 m22 m23 m31 m32 m33 m44 m1 m2 m3;
datalines;
ne ne 11607 1 0 0 0 0 0 0 0 0 0 0 0 0
ne mw 100 0 1 0 0 0 0 0 0 0 0 0 0 0
ne s 366 0 0 1 0 0 0 0 0 0 0 0 0 0
ne w 124 -1 -1 -1 0 0 0 0 0 0 0 1 0 0
mw ne 87 0 0 0 1 0 0 0 0 0 0 0 0 0
mw mw 13677 0 0 0 0 1 0 0 0 0 0 0 0 0
mw s 515 0 0 0 0 0 1 0 0 0 0 0 0 0
mw w 302 0 0 0 -1 -1 -1 0 0 0 0 0 1 0
s ne 172 0 0 0 0 0 0 1 0 0 0 0 0 0
s mw 225 0 0 0 0 0 0 0 1 0 0 0 0 0
s s 17819 0 0 0 0 0 0 0 0 1 0 0 0 0
s w 270 0 0 0 0 0 0 -1 -1 -1 0 0 0 1
w ne 63 -1 0 0 -1 0 0 -1 0 0 0 1 0 0
w mw 176 0 -1 0 0 -1 0 0 -1 0 0 0 1 0
w s 286 0 0 -1 0 0 -1 0 0 -1 0 0 0 1
w w 10192 0 0 0 0 0 0 0 0 0 1 0 0 0
;
proc genmod;
model count = m11 m12 m13 m21 m22 m23 m31 m32 m33 m44 m1 m2 m3
/ dist = poi link = identity;
proc catmod; weight count; response marginals;
model then*now = _response_ / freq;
repeated time 2;
and Cohen’s kappa and weighted kappa with SE values. The MCNEM
keyword in the EXACT statement provides a small-sample binomial version
of McNemar’s test. PROC CATMOD can provide the confidence interval for
the difference of proportions. The code forms a model for the marginal
proportions in the first row and the first column, specifying a model matrix in
the model statement that has an intercept parameter Žthe first column. that
applies to both proportions and a slope parameter that applies only to the
second; hence the second parameter is the difference between the second
and first marginal proportions.
PROC LOGISTIC can conduct conditional logistic regression.
Table A.17 shows ways of testing marginal homogeneity for Table 10.6.
The GENMOD code shows the Lipsitz et al. (1990) approach, expressing the
I² expected frequencies in terms of parameters for the (I − 1)² cells in the
first I − 1 rows and I − 1 columns, the cell in the last row and last column,
and I − 1 marginal totals (which are the same for rows and columns). Here,
m11 denotes the expected frequency μ11, m1 denotes μ1+ = μ+1, and so on.
This parameterization uses formulas such as μ14 = μ1+ − μ11 − μ12 − μ13 for
terms in the last column or last row. CATMOD provides the Bhapkar test
(10.16) of marginal homogeneity, as shown.
TABLE A.18 SAS Code Showing Square-Table Analysis of Table 10.5
data sex;
input premar extramar symm qi count @@;
unif = premar*extramar;
datalines;
1 1 1 1 144
1 2 2 5 2
1 3 3 5 0
1 4 4 5 0
2 1 2 5 33
2 2 5 2 4
2 3 6 5 2
2 4 7 5 0
3 1 3 5 84
3 2 6 5 14
3 3 8 3 6
3 4 9 5 1
4 1 4 5 126
4 2 7 5 29
4 3 9 5 25
4 4 10 4 5
;
proc genmod; class symm;
model count = symm / dist = poi link = log; * symmetry;
proc genmod; class extramar premar symm;
model count = symm extramar premar / dist = poi link = log; *QS;
proc genmod; class symm;
model count = symm extramar premar / dist = poi link = log; * ordinal QS;
proc genmod; class extramar premar qi;
model count = extramar premar qi / dist = poi link = log; * quasi indep;
proc genmod; class extramar premar;
model count = extramar premar unif / dist = poi link = log;
data sex2;
input score below above @@; trials = below + above;
datalines;
1 33 2
1 14 2
1 25 1
2 84 0
2 29 0
3 126 0
;
proc genmod data = sex2;
model above/trials = score / dist = bin link = logit noint;
proc genmod data = sex2;
model above/trials = / dist = bin link = logit noint;
proc genmod data = sex2;
model above/trials = / dist = bin link = logit;
Table A.18 shows various square-table analyses of Table 10.5. The "symm"
factor indexes the pairs of cells that have the same association terms in the
symmetry and quasi-symmetry models. For instance, "symm" takes the same
value for cells (1, 2) and (2, 1). Including this term as a factor in a model
invokes a parameter λij satisfying λij = λji. The first model fits this factor
alone, providing the symmetry model. The second model looks like the third
except that it identifies "premar" and "extramar" as class variables (for
quasi-symmetry), whereas the third model statement does not (for ordinal
quasi-symmetry). The fourth model fits quasi-independence. The "qi" factor
invokes the δi parameters. It takes a separate level for each cell on the main
diagonal and a common value for all other cells. The fifth model fits the
quasi-uniform association model (10.29).
The bottom of Table A.18 fits square-table models as logit models. The
pairs of cell counts (nij, nji), labeled as "above" and "below" with reference
to the main diagonal, are six sets of binomial counts. The variable defined as
"score" is the distance (uj − ui) = j − i. The first two cases are symmetry
and ordinal quasi-symmetry. Neither model contains an intercept (NOINT),
and the ordinal model uses "score" as the predictor. The third model allows
an intercept and is the conditional symmetry model (10.28).
TABLE A.19 SAS Code for Fitting Bradley–Terry Model to Table 10.10
data baseball;
input wins games milw detr toro newy bost clev balt;
datalines;
7 13 1 -1 0 0 0 0 0
...
6 13 0 0 0 0 0 1 -1
;
proc genmod;
model wins / games = milw detr toro newy bost clev balt /
dist = bin link = logit noint covb;
Table A.19 uses GENMOD for logit fitting of the Bradley–Terry model to
Table 10.10 by forming an artificial explanatory variable for each team. For a
given observation, the variable for team i is 1 if it wins, −1 if it loses, and 0 if
it is not one of the teams for that match. Each observation lists the number
of wins ("wins") for the team with variate-level equal to 1 out of the number
of games ("games") against the team with variate-level equal to −1. The
model has these artificial variates, one of which is redundant, as explanatory
variables with no intercept term. The COVB option provides the estimated
covariance matrix of parameter estimators.
Chapter 11: Analyzing Repeated Categorical Response Data
Table A.20 uses GENMOD for the likelihood-ratio test of marginal
homogeneity for Table 11.1, where, for instance, m11p denotes μ11+. The
marginal homogeneity model expresses the eight cell expected frequencies in
terms of
TABLE A.20 SAS Code for Testing Marginal Homogeneity with Crossover
Study of Table 11.1
data crossover;
input a b c count m111 m11p m1p1 mp11 m1pp m222 @@;
datalines;
1 1 1 6 1 0 0 0 0 0
1 1 2 16 -1 1 0 0 0 0
1 2 1 2 -1 0 1 0 0 0
1 2 2 4 1 -1 -1 0 1 0
2 1 1 2 -1 0 0 1 0 0
2 1 2 4 1 -1 0 -1 1 0
2 2 1 6 1 0 -1 -1 1 0
2 2 2 6 0 0 0 0 0 1
;
proc genmod;
model count = m111 m11p m1p1 mp11 m1pp m222 / dist = poi link = identity;
proc catmod; weight count; response marginals;
model a*b*c = _response_ / freq;
repeated drug 3;
TABLE A.21 SAS Code for Marginal Modeling of Depression Data in Table 11.2
data depress;
input case diagnose drug time outcome @@; * outcome = 1 is normal;
datalines;
1 0 0 0 1
1 0 0 1 1
1 0 0 2 1
...
340 1 1 0 0 340 1 1 1 0 340 1 1 2 0
;
proc genmod descending; class case;
model outcome = diagnose drug time drug*time / dist = bin link = logit type3;
repeated subject = case / type = exch corrw;
proc nlmixed qpoints = 200;
parms alpha = -.03 beta1 = -1.3 beta2 = -.06 beta3 = .48 beta4 = 1.02 sigma = .066;
eta = alpha + beta1*diagnose + beta2*drug + beta3*time + beta4*drug*time + u;
p = exp(eta) / (1 + exp(eta));
model outcome ~ binary(p);
random u ~ normal(0, sigma*sigma) subject = case;
TABLE A.22 SAS Code for GEE and Random Intercept Cumulative Logit Analysis
of Insomnia Data in Table 11.4
data francom;
input case treat time outcome @@;
datalines;
1 1 0 1
1 1 1 1
...
239 0 0 4 239 0 1 4
;
proc genmod; class case;
model outcome = treat time treat*time / dist = multinomial
link = clogit;
repeated subject = case / type = indep corrw;
proc nlmixed qpoints = 40;
bounds i2>0; bounds i3>0;
eta1 = i1 + treat*beta1 + time*beta2 + treat*time*beta3+ u;
eta2 = i1 + i2 + treat*beta1 + time*beta2 + treat*time*beta3+ u;
eta3 = i1 + i2 + i3 + treat*beta1 + time*beta2 + treat*time*beta3+ u;
p1 = exp(eta1) / (1 + exp(eta1));
p2 = exp(eta2) / (1 + exp(eta2)) - exp(eta1) / (1 + exp(eta1));
p3 = exp(eta3) / (1 + exp(eta3)) - exp(eta2) / (1 + exp(eta2));
p4 = 1 - exp(eta3) / (1 + exp(eta3));
ll = y1*log(p1) + y2*log(p2) + y3*log(p3) + y4*log(p4);
model y1 ~ general(ll);
estimate 'interc2' i1 + i2; * this is alpha_2 in model, and i1 is alpha_1;
estimate 'interc3' i1 + i2 + i3; * this is alpha_3 in model;
random u ~ normal(0, sigma*sigma) subject = case;
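The p1–p4 statements in Table A.22 difference three cumulative logits into four category probabilities. A small Python check of that construction (the logit values are illustrative, not estimates from Table 11.4):

```python
import math

def expit(x):
    """Inverse logit, as in the p = exp(eta)/(1 + exp(eta)) statements."""
    return math.exp(x) / (1 + math.exp(x))

def category_probs(eta):
    """Turn increasing cumulative logits (eta1, eta2, eta3) into the four
    ordered-category probabilities by differencing cumulative probabilities."""
    cum = [expit(e) for e in eta]          # P(Y <= 1), P(Y <= 2), P(Y <= 3)
    return [cum[0], cum[1] - cum[0], cum[2] - cum[1], 1 - cum[2]]

probs = category_probs([-1.0, 0.0, 1.5])
print(probs, sum(probs))
```

The bounds i2 > 0 and i3 > 0 in the SAS code guarantee that the cumulative logits increase, so every differenced probability is positive.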
EXAMPLES OF SAS CODE BY CHAPTER
μ111, μ11+, μ1+1, μ+11, μ1++, and μ222 (since μ+1+ = μ++1 = μ1++). Note,
for instance, that μ112 = μ11+ − μ111 and μ122 = μ111 + μ1++ − μ11+ − μ1+1.
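The coefficient columns of Table A.20 encode exactly these identities. A Python sketch (the parameter values are hypothetical, chosen only to keep all cells positive) that generates the eight cell means from the six parameters:

```python
# Coefficient columns of Table A.20: each cell's expected frequency as a
# linear combination of (m111, m11p, m1p1, mp11, m1pp, m222).
rows = {
    (1, 1, 1): ( 1,  0,  0,  0, 0, 0),
    (1, 1, 2): (-1,  1,  0,  0, 0, 0),
    (1, 2, 1): (-1,  0,  1,  0, 0, 0),
    (1, 2, 2): ( 1, -1, -1,  0, 1, 0),
    (2, 1, 1): (-1,  0,  0,  1, 0, 0),
    (2, 1, 2): ( 1, -1,  0, -1, 1, 0),
    (2, 2, 1): ( 1,  0, -1, -1, 1, 0),
    (2, 2, 2): ( 0,  0,  0,  0, 0, 1),
}

def cell_means(params):
    """Expected frequency of each cell under the identity-link Poisson model."""
    return {cell: sum(c * p for c, p in zip(coefs, params))
            for cell, coefs in rows.items()}

# Illustrative values for (mu111, mu11+, mu1+1, mu+11, mu1++, mu222).
mu = cell_means((6.0, 10.0, 9.0, 8.0, 14.0, 6.0))
print(mu[(1, 1, 2)])  # mu11+ - mu111 = 4.0
```

As a consistency check, the four cells with first index 1 sum to the μ1++ parameter, which is what marginal homogeneity constrains to equal the other one-way margins.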
CATMOD provides the generalized Bhapkar test (11.5) of marginal homogeneity.
Table A.21 uses GENMOD to analyze Table 11.2 using GEE. Possible
working correlation structures are TYPE=EXCH for exchangeable,
TYPE=AR for autoregressive, TYPE=INDEP for independence, and
TYPE=UNSTR for unstructured. Output shows estimates and standard errors under
the naive working correlation and based on the sandwich matrix incorporating
the empirical dependence. Alternatively, the working association structure
in the binary case can use the log odds ratio (e.g., using LOGOR=EXCH
for exchangeability). The TYPE3 option in GEE provides score tests
about effects. See Stokes et al. (2000, Sec. 15.11) for the use of GEE with
missing data.
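The naive-versus-sandwich contrast can be seen in the simplest possible GEE: estimating a marginal proportion from clustered binary data. This Python sketch uses toy data (not Table 11.2); with positive within-cluster correlation the sandwich variance exceeds the naive one:

```python
# Toy clustered binary data: each inner list is one subject's repeated responses.
clusters = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]]

flat = [y for cl in clusters for y in cl]
n = len(flat)
p = sum(flat) / n                 # GEE estimate under independence working corr.

# Naive variance treats all n observations as independent.
naive_var = p * (1 - p) / n

# Sandwich variance accumulates score contributions sum_t (y_it - p) by cluster,
# so within-cluster dependence is reflected empirically.
score_sums = [sum(y - p for y in cl) for cl in clusters]
sandwich_var = sum(s * s for s in score_sums) / n**2

print(p, naive_var, sandwich_var)
```

This is only the intercept-only special case; GENMOD does the analogous calculation for each regression coefficient.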
Table A.22 uses GENMOD to implement GEE for a cumulative logit
model for Table 11.4. For multinomial responses, independence is currently
the only working correlation structure.
Other
Joseph Lang (jblang@stat.uiowa.edu) has R and S-Plus functions for ML
fitting of marginal models through the generalized loglinear model (11.8),
using the constraint approach with Lagrange multipliers. The program
MAREG (Kastner et al. 1997) provides GEE fitting and ML fitting of
marginal models with the Fitzmaurice and Laird (1993) approach, allowing
multicategory responses. See www.stat.uni-muenchen.de/~andreas/mareg/winmareg.html.
Chapter 12: Random Effects: Generalized Linear Mixed Models
PROC NLMIXED extends GLMs to GLMMs by including random effects.
Table A.23 analyzes the matched pairs model (12.3). Table A.24 analyzes the
election data in Table 12.2.
TABLE A.23 SAS Code for Fitting Model (12.3) for Matched Pairs to Table 12.1
data matched;
input case occasion response count @@;
datalines;
1 0 1 794
1 1 1 794
2 0 1 150
2 1 0 150
3 0 0 86
3 1 1 86
4 0 0 570
4 1 0 570
;
proc nlmixed;
eta = alpha + beta*occasion + u; p = exp(eta) / (1 + exp(eta));
model response ~ binary(p);
random u ~ normal(0, sigma*sigma) subject = case;
replicate count;
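For these data the subject-specific occasion effect in model (12.3) can be checked against the conditional ML estimate, which uses only the 150 + 86 pairs that changed response. A quick Python computation from the counts coded above:

```python
import math

# Discordant pairs from the data above: 150 pairs moved 1 -> 0 and
# 86 pairs moved 0 -> 1 between the two occasions.
n_01, n_10 = 86, 150

# Conditional ML estimate of the occasion effect beta in model (12.3),
# based only on the discordant pairs, with its asymptotic standard error.
beta_hat = math.log(n_01 / n_10)
se = math.sqrt(1 / n_01 + 1 / n_10)

print(round(beta_hat, 3), round(se, 3))  # -0.556 0.135
```

The negative sign reflects the drop in approval between the two occasions; the NLMIXED random-intercept fit gives a similar subject-specific effect.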
TABLE A.24 SAS Code for GLMM Analysis of Election Data in Table 12.2
data vote;
input y n;
case = _n_;
datalines;
1 5
16 32
...
1 4
;
proc nlmixed;
eta = alpha + u; p = exp(eta) / (1 + exp(eta));
model y ~ binomial(n,p);
random u ~ normal(0, sigma*sigma) subject = case;
predict p out = new;
proc print data = new;
TABLE A.25 SAS Code for GLMM Modeling of Opinions in Table 10.13
data new;
input sex poor single any count;
datalines;
1 1 1 1 342
...
2 0 0 0 457
;
data new; set new;
sex = sex - 1; case = _n_;
q1 = 1; q2 = 0; resp = poor; output;
q1 = 0; q2 = 1; resp = single; output;
q1 = 0; q2 = 0; resp = any; output;
drop poor single any;
proc nlmixed qpoints = 50;
parms alpha = 0 beta1 = .8 beta2 = .3 gamma = 0 sigma = 8.6;
eta = alpha + beta1*q1 + beta2*q2 + gamma*sex + u;
p = exp(eta) / (1 + exp(eta));
model resp ~ binary(p);
random u ~ normal(0, sigma*sigma) subject = case;
replicate count;
TABLE A.26 SAS Code for GLMM for Leading Crowd Data in Table 12.8
data crowd;
input mem1 att1 mem2 att2 count;
datalines;
1 1 1 1 458
...
0 0 0 0 554
;
data new; set crowd;
case = _n_;
x1m = 1; x1a = 0; x2m = 0; x2a = 0; var = 1; resp = mem1; output;
x1m = 0; x1a = 1; x2m = 0; x2a = 0; var = 0; resp = att1; output;
x1m = 0; x1a = 0; x2m = 1; x2a = 0; var = 1; resp = mem2; output;
x1m = 0; x1a = 0; x2m = 0; x2a = 1; var = 0; resp = att2; output;
drop mem1 att1 mem2 att2;
proc nlmixed data = new;
eta = beta1m*x1m + beta1a*x1a + beta2m*x2m + beta2a*x2a + um*var +
ua*(1- var);
p = exp(eta) / (1 + exp(eta));
model resp ~ binary(p);
random um ua ~ normal([0,0],[s1*s1, cov12, s2*s2]) subject = case;
replicate count;
estimate 'mem change' beta2m - beta1m; estimate 'att change' beta2a - beta1a;
Table A.25 fits model (12.10) to Table 10.13. This shows how to set initial
values and set the number of quadrature points for Gauss-Hermite quadrature
(e.g., QPOINTS=). One could let SAS fit without initial values but
then take that fit as initial values in further runs, increasing QPOINTS until
estimates and standard errors converge to the necessary precision.
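What the quadrature approximates is the integral of the logistic probability over the normal random effect. A rough Python stand-in (a plain midpoint rule rather than the adaptive Gauss-Hermite rule NLMIXED actually uses; the function names here are invented):

```python
import math

def marginal_prob(alpha, sigma, grid=2000, width=8.0):
    """Approximate E[expit(alpha + sigma*Z)] for Z ~ N(0,1) by a midpoint
    rule on [-width, width]; NLMIXED approximates the same kind of integral
    with adaptive Gauss-Hermite quadrature (QPOINTS= nodes)."""
    h = 2 * width / grid
    total = 0.0
    for k in range(grid):
        z = -width + (k + 0.5) * h
        phi = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)  # N(0,1) density
        total += math.exp(alpha + sigma * z) / (1 + math.exp(alpha + sigma * z)) * phi * h
    return total

# With sigma = 0 the integral collapses to the ordinary logistic probability;
# with sigma > 0 the marginal probability is pulled toward 1/2.
print(marginal_prob(1.0, 0.0), marginal_prob(1.0, 1.0))
```

The attenuation toward 1/2 as sigma grows is why marginal (population-averaged) effects are smaller than the subject-specific effects in the GLMM.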
Table A.21 uses NLMIXED for Table 11.2. Table A.22 uses NLMIXED
for ordinal modeling of Table 11.4, defining a general multinomial log
likelihood. Table A.26 shows a correlated bivariate random effect analysis of
Table 12.8. Agresti et al. (2000) showed NLMIXED examples for clustered
data, Agresti and Hartzel (2000) showed code for multicenter trials such as
Table 12.5, and Hartzel et al. (2001a) showed code for multicenter trials with
an ordinal response. The Web site for the journal Statistical Modelling shows
NLMIXED code for an adjacent-categories logit model and a nominal model
at the data archive for Hartzel et al. (2001b). Chen and Kuo (2001) discussed
fitting multinomial logit models, including discrete-choice models, with random effects.
Other
MLn (Institute of Education, London) and HLM (Scientific Software,
Chicago) fit multilevel models. MIXOR is a FORTRAN program for ML
TABLE A.27 SAS Code for Overdispersion Analysis of Table 4.5
data moore;
input litter group n y @@;
z2 = 0; z3 = 0; z4 = 0;
if group = 2 then z2 = 1; if group = 3 then z3 = 1; if group = 4
then z4 = 1;
datalines;
1 1 10 1
2 1 11 4
3 1 12 9
4 1 4 4
...
55 4 14 1
56 4 8 0
57 4 6 0
58 4 17 0
;
proc logistic;
model y / n = z2 z3 z4 / scale = williams;
proc logistic;
model y / n = z2 z3 z4 / scale = pearson;
proc nlmixed qpoints = 200;
eta = alpha + beta2*z2 + beta3*z3 + beta4*z4 + u;
p = exp(eta) / (1 + exp(eta));
model y ~ binomial(n,p);
random u ~ normal(0, sigma*sigma) subject = litter;
TABLE A.28 SAS Code for Fitting Models to Murder Data in Table 13.6
data new;
input white black other response;
datalines;
1070 119 55 0
60 16 5 1
...
1 0 0 6
;
data new; set new; count = white; race = 0; output;
count = black; race = 1; output; drop white black other;
data new2; set new; do i = 1 to count; output; end; drop i;
proc genmod data = new2;
model response = race / dist = negbin link = log;
proc genmod data = new2;
model response = race / dist = poi link = log scale = pearson;
data new; set new; case = _n_;
proc nlmixed data = new qpoints = 400;
parms alpha = -3.7 beta = 1.90 sigma = 1.6;
eta = alpha + beta*race + u; mu = exp(eta);
model response ~ poisson(mu);
random u ~ normal(0, sigma*sigma) subject = case;
replicate count;
fitting of binary and ordinal random effects models, available from Don
Hedeker (www.uic.edu/~hedeker/mix.html).
Chapter 13: Other Mixture Models for Categorical Data
PROC LOGISTIC provides two overdispersion approaches for binary data.
The SCALE=WILLIAMS option uses a variance function of the beta-binomial
form (13.10), and SCALE=PEARSON uses the scaled binomial
variance (13.11). Table A.27 illustrates for Table 4.5. That table also uses
NLMIXED for adding litter random intercepts.
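The two options inflate the binomial variance in different ways. A Python sketch of the two variance functions (symbols follow the beta-binomial and scaled-binomial forms; the function names are invented):

```python
def binomial_var(n, pi):
    """Ordinary binomial variance n*pi*(1 - pi)."""
    return n * pi * (1 - pi)

def williams_var(n, pi, rho):
    """Beta-binomial-type variance (13.10): the inflation factor
    1 + (n - 1)*rho grows with the litter size n."""
    return n * pi * (1 - pi) * (1 + (n - 1) * rho)

def pearson_var(n, pi, phi):
    """Scaled binomial variance (13.11): a constant multiplicative
    inflation phi, estimated from the Pearson statistic."""
    return phi * n * pi * (1 - pi)

# With rho = 0 or phi = 1 both collapse to the ordinary binomial variance.
print(williams_var(10, 0.3, 0.0), pearson_var(10, 0.3, 1.0), binomial_var(10, 0.3))
```

The practical difference: the Williams form penalizes large litters more heavily, while the Pearson form inflates every standard error by the same factor.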
For Table 13.6, Table A.28 uses GENMOD to fit a negative binomial
model and a quasi-likelihood model with scaled Poisson variance using the
Pearson statistic, and NLMIXED to fit a Poisson GLMM. PROC NLMIXED
can also fit negative binomial models.
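The negative binomial variance function that distinguishes it from the Poisson can be sketched directly (mean-dispersion parameterization; the values are hypothetical):

```python
def negbin_var(mu, d):
    """Variance of a negative binomial with mean mu and dispersion d:
    mu + d*mu**2, approaching the Poisson variance mu as d -> 0."""
    return mu + d * mu ** 2

print(negbin_var(4.0, 0.0))  # 4.0  (Poisson limit)
print(negbin_var(4.0, 0.5))  # 12.0 (overdispersed)
```

Because the overdispersion term is quadratic in the mean, the negative binomial fit diverges most from the Poisson fit for the cells with the largest expected counts.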
Other
Latent GOLD (developed by J. Vermunt and J. Magidson for Statistical
Innovations, Belmont, Massachusetts) can fit a wide variety of mixture
models, including latent class models, nonparametric mixtures of logistic
regression, and some Rasch mixture models.
Categorical Data Analysis, Second Edition. Alan Agresti
Copyright © 2002 John Wiley & Sons, Inc.
ISBN: 0-471-36093-7
APPENDIX B
Chi-Squared Distribution Values
Right-Tailed Probability

df      0.250    0.100    0.050    0.025    0.010    0.005    0.001
1        1.32     2.71     3.84     5.02     6.63     7.88    10.83
2        2.77     4.61     5.99     7.38     9.21    10.60    13.82
3        4.11     6.25     7.81     9.35    11.34    12.84    16.27
4        5.39     7.78     9.49    11.14    13.28    14.86    18.47
5        6.63     9.24    11.07    12.83    15.09    16.75    20.52
6        7.84    10.64    12.59    14.45    16.81    18.55    22.46
7        9.04    12.02    14.07    16.01    18.48    20.28    24.32
8       10.22    13.36    15.51    17.53    20.09    21.96    26.12
9       11.39    14.68    16.92    19.02    21.67    23.59    27.88
10      12.55    15.99    18.31    20.48    23.21    25.19    29.59
11      13.70    17.28    19.68    21.92    24.72    26.76    31.26
12      14.85    18.55    21.03    23.34    26.22    28.30    32.91
13      15.98    19.81    22.36    24.74    27.69    29.82    34.53
14      17.12    21.06    23.68    26.12    29.14    31.32    36.12
15      18.25    22.31    25.00    27.49    30.58    32.80    37.70
16      19.37    23.54    26.30    28.85    32.00    34.27    39.25
17      20.49    24.77    27.59    30.19    33.41    35.72    40.79
18      21.60    25.99    28.87    31.53    34.81    37.16    42.31
19      22.72    27.20    30.14    32.85    36.19    38.58    43.82
20      23.83    28.41    31.41    34.17    37.57    40.00    45.32
25      29.34    34.38    37.65    40.65    44.31    46.93    52.62
30      34.80    40.26    43.77    46.98    50.89    53.67    59.70
40      45.62    51.80    55.76    59.34    63.69    66.77    73.40
50      56.33    63.17    67.50    71.42    76.15    79.49    86.66
60      66.98    74.40    79.08    83.30    88.38    91.95    99.61
70      77.58    85.53    90.53    95.02    100.4    104.2    112.3
80      88.13    96.58    101.8    106.6    112.3    116.3    124.8
90      98.65    107.6    113.1    118.1    124.1    128.3    137.2
100     109.1    118.5    124.3    129.6    135.8    140.2    149.5
Source: Calculated using StaTable, Cytel Software, Cambridge, MA.
References
Abdelbasit, K. M., and R. L. Plackett. 1983. Experimental design for binary data. J. Amer. Statist. Assoc. 78: 90–98.
Agresti, A. 1984. Analysis of Ordinal Categorical Data. New York: Wiley.
Agresti, A. 1992. A survey of exact inference for contingency tables. Statist. Sci. 7: 131–153.
Agresti, A. 1993. Computing conditional maximum likelihood estimates for generalized Rasch models using simple loglinear models with diagonal parameters. Scand. J. Statist. 20: 63–71.
Agresti, A. 1997. A model for repeated measurements of a multivariate binary response. J. Amer. Statist. Assoc. 92: 315–321.
Agresti, A. 1999. On logit confidence intervals for the odds ratio with small samples. Biometrics 55: 597–602.
Agresti, A. 2001. Exact inference for categorical data: Recent advances and continuing controversies. Statist. Medic. 20: 2709–2722.
Agresti, A., and B. Caffo. 2000. Simple and effective confidence intervals for proportions and difference of proportions result from adding two successes and two failures. Amer. Statist. 54: 280–288.
Agresti, A., and B. A. Coull. 1998. Approximate is better than exact for interval estimation of binomial parameters. Amer. Statist. 52: 119–126.
Agresti, A., and J. Hartzel. 2000. Strategies for comparing treatments on a binary response with multi-centre data. Statist. Medic. 19(8): 1115–1139.
Agresti, A., and J. Lang. 1993a. A proportional odds model with subject-specific effects for repeated ordered categorical responses. Biometrika 80: 527–534.
Agresti, A., and J. Lang. 1993b. Quasi-symmetric latent class models, with application to rater agreement. Biometrics 49: 131–139.
Agresti, A., and I. Liu. 1999. Modeling a categorical variable allowing arbitrarily many category choices. Biometrics 55: 936–943.
Agresti, A., and Y. Min. 2001. On small-sample confidence intervals for parameters in discrete distributions. Biometrics 57: 963–971.
Agresti, A., and R. Natarajan. 2001. Modeling clustered ordered categorical data: A survey. Internat. Statist. Rev. 69: 345–371.
Agresti, A., D. Wackerly, and J. Boyett. 1979. Exact conditional tests for cross-classifications: Approximation of attained significance levels. Psychometrika 44: 75–84.
Agresti, A., C. Chuang, and A. Kezouh. 1987. Order-restricted score parameters in association models for contingency tables. J. Amer. Statist. Assoc. 82: 619–623.
Agresti, A., C. R. Mehta, and N. R. Patel. 1990. Exact inference for contingency tables with ordered categories. J. Amer. Statist. Assoc. 85: 453–458.
Agresti, A., J. Booth, J. Hobert, and B. Caffo. 2000. Random-effects modeling of categorical response data. Sociol. Methodol. 30: 27–81.
Aitchison, J., and C. G. G. Aitken. 1976. Multivariate binary discrimination by the kernel method. Biometrika 63: 413–420.
Aitchison, J., and C. H. Ho. 1989. The multivariate Poisson-log normal distribution. Biometrika 76: 643–653.
Aitchison, J., and S. M. Shen. 1980. Logistic-normal distributions: Some properties and uses. Biometrika 67: 261–272.
Aitchison, J., and S. D. Silvey. 1957. The generalization of probit analysis to the case of multiple responses. Biometrika 44: 131–140.
Aitchison, J., and S. D. Silvey. 1958. Maximum likelihood estimation of parameters subject to restraints. Ann. Math. Statist. 29: 813–828.
Aitkin, M. 1979. A simultaneous test procedure for contingency table models. Appl. Statist. 28: 233–242.
Aitkin, M. 1980. A note on the selection of log-linear models. Biometrics 36: 173–178.
Aitkin, M. 1999. A general maximum likelihood analysis of variance components in generalized linear models. Biometrics 55: 117–128.
Aitkin, M., and D. Clayton. 1980. The fitting of exponential, Weibull, and extreme value distributions to complex censored survival data using GLIM. Appl. Statist. 29: 156–163.
Aitkin, M., and M. Stasinopoulos. 1989. Likelihood analysis of a binomial sample size problem. Pp. 399–411 in Contributions to Probability and Statistics: Essays in Honor of Ingram Olkin, ed. L. J. Gleser, M. D. Perlman, S. J. Press, and A. R. Sampson. New York: Springer-Verlag.
Aitkin, M., D. Anderson, and J. Hinde. 1981. Statistical modelling of data on teaching styles. J. Roy. Statist. Soc. Ser. A 144: 419–461.
Aitkin, M., D. Anderson, B. Francis, and J. Hinde. 1989. Statistical Modeling in GLIM. Oxford: Clarendon Press.
Albert, J. H. 1997. Bayesian testing and estimation of association in a two-way contingency table. J. Amer. Statist. Assoc. 92: 685–693.
Albert, A., and J. A. Anderson. 1984. On the existence of maximum likelihood estimates in logistic models. Biometrika 71: 1–10.
Albert, J. H., and S. Chib. 1993. Bayesian analysis of binary and polychotomous response data. J. Amer. Statist. Assoc. 88: 669–679.
Albert, J. H., and A. K. Gupta. 1982. Mixtures of Dirichlet distributions and estimation in contingency tables. Ann. Statist. 10: 1261–1268.
Allison, P. D. 1999. Logistic Regression Using the SAS System. Cary, NC: SAS Institute.
Altham, P. M. E. 1969. Exact Bayesian analysis of a 2 × 2 contingency table and Fisher's "exact" significance test. J. Roy. Statist. Soc. Ser. B 31: 261–269.
Altham, P. M. E. 1970. The measurement of association of rows and columns for an r × s contingency table. J. Roy. Statist. Soc. Ser. B 32: 63–73.
Altham, P. M. E. 1971. The analysis of matched proportions. Biometrika 58: 561–576.
Altham, P. M. E. 1975. Quasi-independent triangular contingency tables. Biometrics 31: 233–238.
Altham, P. M. E. 1978. Two generalizations of the binomial distribution. Appl. Statist. 27: 162–167.
Altham, P. M. E. 1984. Improving the precision of estimation by fitting a model. J. Roy. Statist. Soc. Ser. B 46: 118–119.
Amemiya, T. 1981. Qualitative response models: A survey. J. Econom. Literature 19: 1483–1536.
Andersen, E. B. 1970. Asymptotic properties of conditional maximum-likelihood estimators. J. Roy. Statist. Soc. Ser. B 32: 283–301.
Andersen, E. B. 1980. Discrete Statistical Models with Social Science Applications. Amsterdam: North-Holland.
Andersen, E. B. 1995. Polytomous Rasch models and their estimation. Pp. 272–291 in Rasch Models: Foundations, Recent Developments, and Applications, eds. G. Fischer and I. Molenaar. New York: Springer-Verlag.
Anderson, J. A. 1972. Separate sample logistic discrimination. Biometrika 59: 19–35.
Anderson, J. A. 1975. Quadratic logistic discrimination. Biometrika 62: 149–154.
Anderson, J. A. 1984. Regression and ordered categorical variables. J. Roy. Statist. Soc. Ser. B 46: 1–30.
Anderson, D. A., and M. Aitkin. 1985. Variance component models with binary response: Interviewer variability. J. Roy. Statist. Soc. Ser. B 47: 203–210.
Anderson, C. J., and U. Böckenholt. 2000. Graphical regression models for polytomous variables. Psychometrika 65: 497–509.
Anderson, T. W., and L. A. Goodman. 1957. Statistical inference about Markov chains. Ann. Math. Statist. 28: 89–110.
Anderson, J. A., and P. R. Philips. 1981. Regression, discrimination, and measurement models for ordered categorical variables. Appl. Statist. 30: 22–31.
Anderson, C. J., and J. K. Vermunt. 2000. Log-multiplicative models as latent variable models for nominal and/or ordinal data. Sociol. Methodol. 30: 81–121.
Aranda-Ordaz, F. J. 1981. On two families of transformations to additivity for binary response data. Biometrika 68: 357–363.
Aranda-Ordaz, F. J. 1983. An extension of the proportional hazards model for grouped data. Biometrics 39: 109–117.
Arminger, G., C. C. Clogg, and T. Cheng. 2000. Regression analysis of multivariate binary response variables using Rasch-type models and finite mixture methods. Sociol. Methodol. 30: 1–26.
Armitage, P. 1955. Tests for linear trends in proportions and frequencies. Biometrics 11: 375–386.
Ashford, J. R., and R. D. Sowden. 1970. Multivariate probit analysis. Biometrics 26: 535–546.
Asmussen, S., and D. Edwards. 1983. Collapsibility and response variables in contingency tables. Biometrika 70: 567–578.
Azzalini, A. 1994. Logistic regression for autocorrelated data with application to repeated measures. Biometrika 81: 767–775.
Baglivo, J., D. Olivier, and M. Pagano. 1992. Methods for exact goodness-of-fit tests. J. Amer. Statist. Assoc. 87: 464–469.
Baker, S. G. 1992. A simple method for computing the observed information matrix when using the EM algorithm with categorical data. J. Comput. Graph. Statist. 1: 63–76.
Baker, S. G., and N. M. Laird. 1988. Regression analysis for categorical variables with outcome subject to nonignorable nonresponse. J. Amer. Statist. Assoc. 83: 62–69.
Baker, R. J., M. R. B. Clarke, and P. W. Lane. 1985. Zero entries in contingency tables. Comput. Statist. Data Anal. 3: 33–45.
Banerjee, C., M. Capozzoli, L. McSweeney, and D. Sinha. 1999. Beyond kappa: A review of interrater agreement measures. Canad. J. Statist. 27: 3–23.
Baptista, J., and M. C. Pike. 1977. Algorithm AS115: Exact two-sided confidence limits for the odds ratio in a 2 × 2 table. Appl. Statist. 26: 214–220.
Barnard, G. A. 1945. A new test for 2 × 2 tables. Nature 156: 177.
Barnard, G. A. 1947. Significance tests for 2 × 2 tables. Biometrika 34: 123–138.
Barnard, G. A. 1949. Statistical inference. J. Roy. Statist. Soc. Ser. B 11: 115–139.
Barnard, G. A. 1979. In contradiction to J. Berkson's dispraise: Conditional tests can be more efficient. J. Statist. Plann. Inference 3: 181–188.
Barndorff-Nielsen, O. E., and B. Jørgensen. 1991. Some parametric models on the simplex. J. Multivariate Anal. 39: 106–116.
Bartholomew, D. J. 1980. Factor analysis for categorical data. J. Roy. Statist. Soc. Ser. B 42: 293–321.
Bartholomew, D. J., and M. Knott. 1999. Latent Variable Models and Factor Analysis, 2nd ed. London: Edward Arnold.
Bartlett, M. S. 1935. Contingency table interactions. J. Roy. Statist. Soc. Suppl. 2: 248–252.
Bartlett, M. S. 1937. Some examples of statistical methods of research in agriculture and applied biology. J. Roy. Statist. Soc. Suppl. 4: 137–183.
Becker, M. 1989a. Models for the analysis of association in multivariate contingency tables. J. Amer. Statist. Assoc. 84: 1014–1019.
Becker, M. 1989b. On the bivariate normal distribution and association models for ordinal categorical data. Statist. Probab. Lett. 8: 435–440.
Becker, M. 1990. Maximum likelihood estimation of the RC(M) association model. Appl. Statist. 39: 152–167.
Becker, M., and A. Agresti. 1992. Log-linear modelling of pairwise interobserver agreement on a categorical scale. Statist. Medic. 11: 101–114.
Becker, M., and C. C. Clogg. 1989. Analysis of sets of two-way contingency tables using association models. J. Amer. Statist. Assoc. 84: 142–151.
Bedrick, E. J. 1983. Chi-squared tests for cross-classified tables of survey data. Biometrika 70: 591–595.
Bedrick, E. J. 1987. A family of confidence intervals for the ratio of two binomial proportions. Biometrics 43: 993–998.
Begg, C. B., and R. Gray. 1984. Calculation of polytomous logistic regression parameters using individualized regressions. Biometrika 71: 11–18.
Beitler, P. J., and J. R. Landis. 1985. A mixed-effects model for categorical data. Biometrics 41: 991–1000.
Benedetti, J. K., and M. B. Brown. 1978. Strategies for the selection of loglinear models. Biometrics 34: 680–686.
Benichou, J. 1998. Attributable risk. Pp. 216–229 in Encyclopedia of Biostatistics. Chichester, UK: Wiley.
Benzécri, J.-P. 1973. L'Analyse des Données, Vol. 1, La Taxonomie; Vol. 2, L'Analyse des Correspondances. Paris: Dunod.
Berger, R., and D. D. Boos. 1994. p-Values maximized over a confidence set for the nuisance parameter. J. Amer. Statist. Assoc. 89: 1012–1016.
Bergsma, W. P., and T. Rudas. 2002. Marginal models for categorical data. Ann. Statist. 30: 140–159.
Berkson, J. 1938. Some difficulties of interpretation encountered in the application of the chi-square test. J. Amer. Statist. Assoc. 33: 526–536.
Berkson, J. 1944. Application of the logistic function to bio-assay. J. Amer. Statist. Assoc. 39: 357–365.
Berkson, J. 1951. Why I prefer logits to probits. Biometrics 7: 327–339.
Berkson, J. 1953. A statistically precise and relatively simple method of estimating the bioassay with quantal response, based on the logistic function. J. Amer. Statist. Assoc. 48: 565–599.
Berkson, J. 1955. Maximum likelihood and minimum logit χ² estimation of the logistic function. J. Amer. Statist. Assoc. 50: 130–162.
Berkson, J. 1978. In dispraise of the exact test. J. Statist. Plann. Inference 2: 27–42.
Berkson, J. 1980. Minimum chi-square, not maximum likelihood! Ann. Statist. 8: 457–487.
Berry, G., and P. Armitage. 1995. Mid-P confidence intervals: A brief review. The Statistician 44: 417–423.
Bhapkar, V. P. 1966. A note on the equivalence of two test criteria for hypotheses in categorical data. J. Amer. Statist. Assoc. 61: 228–235.
Bhapkar, V. P. 1968. On the analysis of contingency tables with a quantitative response. Biometrics 24: 329–338.
Bhapkar, V. P. 1973. On the comparison of proportions in matched samples. Sankhya Ser. A 35: 341–356.
Bhapkar, V. P. 1989. Conditioning on ancillary statistics and loss of information in the presence of nuisance parameters. J. Statist. Plann. Inference 21: 139–160.
Bhapkar, V. P., and G. G. Koch. 1968. On the hypothesis of "no interaction" in multidimensional contingency tables. Biometrics 24: 567–594.
Bhapkar, V. P., and G. W. Somes. 1977. Distribution of Q when testing equality of matched proportions. J. Amer. Statist. Assoc. 72: 658–661.
Biggeri, A. 1998. Negative binomial distribution. Pp. 2962–2967 in Encyclopedia of Biostatistics. Chichester, UK: Wiley.
Billingsley, P. 1961. Statistical methods in Markov chains. Ann. Math. Statist. 32: 12–40.
Birch, M. W. 1963. Maximum likelihood in three-way contingency tables. J. Roy. Statist. Soc. Ser. B 25: 220–233.
Birch, M. W. 1964a. A new proof of the Pearson–Fisher theorem. Ann. Math. Statist. 35: 817–824.
Birch, M. W. 1964b. The detection of partial association I: The 2 × 2 case. J. Roy. Statist. Soc. Ser. B 26: 313–324.
Birch, M. W. 1965. The detection of partial association II: The general case. J. Roy. Statist. Soc. Ser. B 27: 111–124.
Bishop, Y. M. M. 1971. Effects of collapsing multidimensional contingency tables. Biometrics 27: 545–562.
Bishop, Y. M. M., and F. Mosteller. 1969. Smoothed contingency table analysis. Chap. IV-3 in The National Halothane Study. Washington, DC: U.S. Government Printing Office.
Bishop, Y. M. M., S. E. Fienberg, and P. W. Holland. 1975. Discrete Multivariate Analysis. Cambridge, MA: MIT Press.
Blaker, H. 2000. Confidence curves and improved exact confidence intervals for discrete distributions. Canad. J. Statist. 28: 783–798.
Bliss, C. I. 1934. The method of probits. Science 79: 38–39.
Bliss, C. I. 1935. The calculation of the dosage–mortality curve. Ann. Appl. Biol. 22: 134–167.
Blyth, C. R. 1972. On Simpson's paradox and the sure-thing principle. J. Amer. Statist. Assoc. 67: 364–366.
Blyth, C. R., and H. A. Still. 1983. Binomial confidence intervals. J. Amer. Statist. Assoc. 78: 108–116.
Bock, R. D. 1970. Estimating multinomial response relations. Pp. 453–479 in Contributions to Statistics and Probability, ed. R. C. Bose. Chapel Hill, NC: University of North Carolina Press.
Bock, R. D., and M. Aitkin. 1981. Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika 46: 443–459.
Bock, R. D., and L. V. Jones. 1968. The Measurement and Prediction of Judgement and Choice. San Francisco: Holden-Day.
Böckenholt, U., and W. Dillon. 1997. Modelling within-subject dependencies in ordinal paired comparison data. Psychometrika 62: 411–434.
Bonney, G. E. 1987. Logistic regression for dependent binary observations. Biometrics 43: 951–973.
Boos, D. D. 1992. On generalized score tests. Amer. Statist. 46: 327–333.
Booth, J., and R. Butler. 1999. An importance sampling algorithm for exact conditional tests in log-linear models. Biometrika 86: 321–332.
Booth, J. G., and J. P. Hobert. 1998. Standard errors of prediction in generalized linear mixed models. J. Amer. Statist. Assoc. 93: 262–272.
Booth, J. G., and J. P. Hobert. 1999. Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. J. Roy. Statist. Soc. Ser. B 61: 265–285.
Bowker, A. H. 1948. A test for symmetry in contingency tables. J. Amer. Statist. Assoc. 43: 572–574.
Box, J. F. 1978. R. A. Fisher: The Life of a Scientist. New York: Wiley.
Bradley, R. A. 1976. Science, statistics, and paired comparisons. Biometrics 32: 213–240.
Bradley, R. A., and M. E. Terry. 1952. Rank analysis of incomplete block designs I. The method of paired comparisons. Biometrika 39: 324–345.
Breslow, N. 1976. Regression analysis of the log odds ratio: A method for retrospective studies. Biometrics 32: 409–416.
Breslow, N. 1981. Odds ratio estimators when the data are sparse. Biometrika 68: 73–84.
Breslow, N. 1982. Covariance adjustment of relative-risk estimates in matched studies. Biometrics 38: 661–672.
Breslow, N. 1984. Extra-Poisson variation in log-linear models. Appl. Statist. 33: 38–44.
Breslow, N. 1996. Statistics in epidemiology: The case–control study. J. Amer. Statist. Assoc. 91: 14–28.
Breslow, N., and D. G. Clayton. 1993. Approximate inference in generalized linear mixed models. J. Amer. Statist. Assoc. 88: 9–25.
Breslow, N., and N. E. Day. 1980, 1987. Statistical Methods in Cancer Research, Vol. I, The Analysis of Case–Control Studies; Vol. II, The Design and Analysis of Cohort Studies. Lyon: IARC.
Breslow, N., and X. Lin. 1995. Bias correction in generalised linear mixed models with a single component of dispersion. Biometrika 82: 81–91.
Breslow, N., and W. Powers. 1978. Are there two logistic regressions for retrospective studies? Biometrics 34: 100–105.
Breslow, N., N. Day, K. Halvorsen, R. Prentice, and C. Sabai. 1978. Estimation of multiple relative risk functions in matched case–control studies. Amer. J. Epidemiol. 108: 299–307.
Brier, S. S. 1980. Analysis of contingency tables under cluster sampling. Biometrika 67: 591–596.
Brooks, S. P., B. J. T. Morgan, M. S. Ridout, and S. E. Pack. 1997. Finite mixture models for proportions. Biometrics 53: 1097–1115.
Bross, I. D. J. 1958. How to use ridit analysis. Biometrics 14: 18–38.
Brown, M. B. 1976. Screening effects in multidimensional contingency tables. Appl. Statist. 25: 37–46.
Brown, M. B., and J. K. Benedetti. 1977. Sampling behavior of tests for correlation in two-way contingency tables. J. Amer. Statist. Assoc. 72: 309–315.
Brown, P. J., and P. W. K. Rundell. 1985. Kernel estimates for categorical data. Technometrics 27: 293–299.
Brown, L. D., T. T. Cai, and A. Das Gupta. 2001. Interval estimation for a binomial proportion. Statist. Sci. 16: 101–133.
Brownstone, D., and K. F. Train. 1999. Forecasting new product penetration with flexible
substitution patterns. J. Econometrics 89: 109129.
Bull, S. B., and A. Donner. 1987. The efficiency of multinomial logistic regression compared with
multiple group discriminant analysis. J. Amer. Statist. Assoc. 82: 11181122.
Burnham, K. P., and D. R. Anderson. 1998. Model Selection and Inference: A Practical Information-Theoretic Approach. New York: Springer-Verlag.
Burnham, K. P. and W. S. Overton. 1978. Estimation of the size of a closed population when
capture probabilities vary among animals. Biometrika 65: 625633.
Burridge, J. 1981. A note on maximum likelihood estimation for regression models using
grouped data. J. Roy. Statist. Soc. Ser. B 43: 4145.
Cameron, A. C., and P. K. Trivedi. 1998. Regression Analysis of Count Data. Cambridge, U.K.:
Cambridge University Press.
Carey, V., S. L. Zeger, and P. Diggle. 1993. Modelling multivariate binary data with alternating
logistic regressions. Biometrika 80: 517–526.
Carroll, R. J., S. Wang, and C. Y. Wang. 1995. Prospective analysis of logistic case–control pairs. J. Amer. Statist. Assoc. 90: 157–169.
Casella, G., and R. Berger. 2001. Statistical Inference, 2nd ed. Pacific Grove, CA: Wadsworth.
Catalano, P. J., and L. M. Ryan. 1992. Bivariate latent variable models for clustered discrete and continuous outcomes. J. Amer. Statist. Assoc. 87: 651–658.
Caussinus, H. 1966. Contribution à l'analyse statistique des tableaux de corrélation. Ann. Fac. Sci. Univ. Toulouse 29: 77–182.
Chaloner, K., and K. Larntz. 1989. Optimal Bayesian design applied to logistic regression experiments. J. Statist. Plann. Inference 21: 191–208.
Chamberlain, G. 1980. Analysis of covariance with qualitative data. Rev. Econ. Stud. 47: 225–238.
Chambers, E. A., and D. R. Cox. 1967. Discrimination between alternative binary response models. Biometrika 54: 573–578.
Chambers, R. L., and D. G. Steel. 2001. Simple methods for ecological inference in 2 × 2 tables. J. Roy. Statist. Soc. Ser. A 164: 175–192.
Chan, I. 1998. Exact tests of equivalence and efficacy with non-zero lower bound for comparative studies. Statist. Medic. 17: 1403–1413.
Chan, J. S. K., and A. Y. C. Kuk. 1997. Maximum likelihood estimation for probit-linear mixed models with correlated random effects. Biometrics 53: 86–97.
Chao, A., P. K. Tsay, S.-H. Lin, W.-Y. Shau, and D.-Y. Chao. 2001. The applications of capture–recapture models to epidemiological data. Statist. Medic. 20: 3123–3157.
Chapman, D. G., and R. C. Meng. 1966. The power of chi-square tests for contingency tables. J. Amer. Statist. Assoc. 61: 965–975.
Chen, Z., and L. Kuo. 2001. A note on the estimation of the multinomial logit model with random effects. Amer. Statist. 55: 89–95.
Christensen, R. 1997. Log-Linear Models and Logistic Regression. New York: Springer-Verlag.
Chuang, C., D. Gheva, and C. Odoroff. 1985. Methods for diagnosing multiplicative-interaction models for two-way contingency tables. Commun. Statist. Ser. A 14: 2057–2080.
Clogg, C. C. 1995. Latent class models. Pp. 311–359 in Handbook of Statistical Modeling for the Social and Behavioral Sciences, ed. G. Arminger and C. C. Clogg. New York: Plenum Press.
Clogg, C. C., and S. R. Eliason. 1987. Some common problems in log-linear analysis. Sociol. Methods Res. 15: 4–44.
Clogg, C. C., and L. A. Goodman. 1984. Latent structure analysis of a set of multidimensional contingency tables. J. Amer. Statist. Assoc. 79: 762–771.
REFERENCES
Clogg, C. C., and E. S. Shihadeh. 1994. Statistical Models for Ordinal Variables. Thousand Oaks,
CA: Sage Publications.
Clopper, C. J., and E. S. Pearson. 1934. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika 26: 404–413.
Cochran, W. G. 1940. The analysis of variance when experimental errors follow the Poisson or binomial laws. Ann. Math. Statist. 11: 335–347.
Cochran, W. G. 1943. Analysis of variance for percentages based on unequal numbers. J. Amer. Statist. Assoc. 38: 287–301.
Cochran, W. G. 1950. The comparison of percentages in matched samples. Biometrika 37: 256–266.
Cochran, W. G. 1952. The χ² test of goodness-of-fit. Ann. Math. Statist. 23: 315–345.
Cochran, W. G. 1954. Some methods of strengthening the common χ² tests. Biometrics 10: 417–451.
Cochran, W. G. 1955. A test of a linear function of the deviations between observed and expected numbers. J. Amer. Statist. Assoc. 50: 377–397.
Coe, P. R., and A. C. Tamhane. 1993. Small sample confidence intervals for the difference, ratio and odds ratio of two success probabilities. Commun. Statist. Ser. B 22: 925–938.
Cohen, J. 1960. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20: 37–46.
Cohen, J. 1968. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychol. Bull. 70: 213–220.
Cohen, A., and H. B. Sackrowitz. 1991. Tests for independence in contingency tables with ordered alternatives. J. Multivariate Anal. 36: 56–67.
Cohen, A., and H. B. Sackrowitz. 1992. An evaluation of some tests of trend in contingency tables. J. Amer. Statist. Assoc. 87: 470–475.
Collett, D. 1991. Modelling Binary Data. London: Chapman & Hall.
Conaway, M. R. 1989. Analysis of repeated categorical measurements with conditional likelihood methods. J. Amer. Statist. Assoc. 84: 53–62.
Cook, R. D., and S. Weisberg. 1999. Applied Regression Including Computing and Graphics. New York: Wiley.
Copas, J. B. 1973. Randomization models for the matched and unmatched 2 × 2 tables. Biometrika 60: 467–476.
Copas, J. B. 1983. Plotting p against x. Appl. Statist. 32: 25–31.
Copas, J. B. 1988. Binary regression models for contaminated data. J. Roy. Statist. Soc. Ser. B 50: 225–265.
Corcoran, C., L. Ryan, P. Senchaudhuri, C. Mehta, N. Patel, and G. Molenberghs. 2001. An exact trend test for correlated binary data. Biometrics 57: 941–948.
Cormack, R. M. 1989. Log-linear models for capture–recapture. Biometrics 45: 395–413.
Cornfield, J. 1951. A method of estimating comparative rates from clinical data: Applications to cancer of the lung, breast and cervix. J. Natl. Cancer Inst. 11: 1269–1275.
Cornfield, J. 1956. A statistical problem arising from retrospective studies. In Proc. 3rd Berkeley Symposium on Mathematics, Statistics and Probability, ed. J. Neyman, 4: 135–148.
Cornfield, J. 1962. Joint dependence of risk of coronary heart disease on serum cholesterol and systolic blood pressure: A discriminant function analysis. Fed. Proc. 21, Suppl. 11: 58–61.
Coull, B. A., and A. Agresti. 1999. The use of mixed logit models to reflect heterogeneity in capture–recapture studies. Biometrics 55: 294–301.
Coull, B. A., and A. Agresti. 2000. Random effects modeling of multiple binomial responses using the multivariate binomial logit-normal distribution. Biometrics 56: 73–80.
Cox, C. 1984. An elementary introduction to maximum likelihood estimation for multinomial models: Birch's theorem and the delta method. Amer. Statist. 38: 283–287.
Cox, C. 1995. Location-scale cumulative odds models for ordinal data: A generalized non-linear model approach. Statist. Medic. 14: 1191–1203.
Cox, C. 1996. Nonlinear quasi-likelihood models: Applications to continuous proportions. Comput. Statist. Data Anal. 21: 449–461.
Cox, D. R. 1958a. The regression analysis of binary sequences. J. Roy. Statist. Soc. Ser. B 20: 215–242.
Cox, D. R. 1958b. Two further applications of a model for binary regression. Biometrika 45: 562–565.
Cox, D. R. 1970. The Analysis of Binary Data (2nd ed. 1989, by D. R. Cox and E. J. Snell). London: Chapman & Hall.
Cox, D. R. 1972. The analysis of multivariate binary data. Appl. Statist. 21: 113–120.
Cox, D. R. 1983. Some remarks on overdispersion. Biometrika 70: 269–274.
Cox, D. R., and D. V. Hinkley. 1974. Theoretical Statistics. London: Chapman & Hall.
Cramér, H. 1946. Mathematical Methods of Statistics. Princeton, NJ: Princeton University Press.
Cressie, N., and T. R. C. Read. 1984. Multinomial goodness-of-fit tests. J. Roy. Statist. Soc. Ser. B 46: 440–464.
Cressie, N., and T. R. C. Read. 1989. Pearson X² and the loglikelihood ratio statistic G²: A comparative review. Internat. Statist. Rev. 57: 19–43.
Croon, M., W. Bergsma, and J. Hagenaars. 2000. Analyzing change in categorical variables by generalized log-linear models. Sociol. Methods Res. 29: 195–229.
Crouchley, R. 1995. A random-effects model for ordered categorical data. J. Amer. Statist. Assoc. 90: 489–498.
Crowder, M. J. 1978. Beta-binomial ANOVA for proportions. Appl. Statist. 27: 34–37.
D'Agostino, R. B., Jr. 1998. Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Statist. Medic. 17: 2265–2281.
Daniels, M. J., and C. Gatsonis. 1999. Hierarchical generalized linear models in the analysis of variations in health care utilization. J. Amer. Statist. Assoc. 94: 29–42.
Dardanoni, V., and A. Forcina. 1998. A unified approach to likelihood inference on stochastic orderings in a nonparametric context. J. Amer. Statist. Assoc. 93: 1112–1123.
Darroch, J. N. 1962. Interactions in multi-factor contingency tables. J. Roy. Statist. Soc. Ser. B 24: 251–263.
Darroch, J. N. 1981. The Mantel–Haenszel test and tests of marginal symmetry; Fixed-effects and mixed models for a categorical response. Internat. Statist. Rev. 49: 285–307.
Darroch, J. N., and P. I. McCloud. 1986. Category distinguishability and observer agreement. Austral. J. Statist. 28: 371–388.
Darroch, J. N., and D. Ratcliff. 1972. Generalized iterative scaling for log-linear models. Ann. Math. Statist. 43: 1470–1480.
Darroch, J. N., S. L. Lauritzen, and T. P. Speed. 1980. Markov fields and log-linear interaction models for contingency tables. Ann. Statist. 8: 522–539.
Darroch, J. N., S. E. Fienberg, G. F. V. Glonek, and B. W. Junker. 1993. A three-sample multiple-recapture approach to census population estimation with heterogeneous catchability. J. Amer. Statist. Assoc. 88: 1137–1148.
Das Gupta, S., and M. D. Perlman. 1974. Power of the noncentral F-test: Effect of additional variates on Hotelling's T²-test. J. Amer. Statist. Assoc. 69: 174–180.
David, H. A. 1988. The Method of Paired Comparisons, 2nd ed. Oxford: Oxford University Press.
Davis, L. J. 1986a. Exact tests for 2 by 2 contingency tables. Amer. Statist. 40: 139–141.
Davis, L. J. 1986b. Relationship between strictly collapsible and perfect tables. Statist. Probab. Lett. 4: 119–122.
Davis, L. J. 1989. Intersection union tests for strict collapsibility in three-dimensional contingency tables. Ann. Statist. 17: 1693–1708.
Davison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their Application. Cambridge, UK: Cambridge University Press.
Dawson, R. B., Jr. 1954. A simplified expression for the variance of the χ²-function on a contingency table. Biometrika 41: 280.
Day, N. E., and D. P. Byar. 1979. Testing hypotheses in case–control studies: Equivalence of Mantel–Haenszel statistics and logit score tests. Biometrics 35: 623–630.
de Falguerolles, A., S. Jmel, and J. Whittaker. 1995. Correspondence analysis and association models constrained by a conditional independence graph. Psychometrika 60: 161–180.
Deming, W. E. 1964. Statistical Adjustment of Data (reprint of 1943 Wiley text). New York: Dover.
Deming, W. E., and F. F. Stephan. 1940. On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Ann. Math. Statist. 11: 427–444.
Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39: 1–38.
Dey, D. K., S. K. Ghosh, and B. K. Mallick (editors). 2000. Generalized Linear Models: A Bayesian Perspective. New York: Marcel Dekker.
Diaconis, P., and B. Efron. 1985. Testing for independence in a two-way table: New interpretations of the chi-square statistic. Ann. Statist. 13: 845–874.
Diaconis, P., and B. Sturmfels. 1998. Algebraic algorithms for sampling from conditional distributions. Ann. Statist. 26: 363–397.
Diggle, P. J., P. Heagerty, K.-Y. Liang, and S. L. Zeger. 2002. Analysis of Longitudinal Data, 2nd ed. Oxford: Clarendon Press.
Dittrich, R., R. Hatzinger, and W. Katzenbeisser. 1998. Modeling the effect of subject-specific covariates in paired comparison studies with an application to university rankings. Appl. Statist. 47: 511–525.
Dobson, A. J. 2001. An Introduction to Generalized Linear Models, 2nd ed. London: Chapman & Hall.
Dong, J. 1998. Simpson's paradox. Pp. 4108–4110 in Encyclopedia of Biostatistics, Vol. 5. Chichester, UK: Wiley.
Dong, J., and J. S. Simonoff. 1994. The construction and properties of boundary kernels for smoothing sparse multinomials. J. Computat. Graph. Statist. 3: 57–66.
Dong, J., and J. S. Simonoff. 1995. A geometric combination estimator for d-dimensional ordinal sparse contingency tables. Ann. Statist. 23: 1143–1159.
Donner, A., and W. W. Hauck. 1986. The large-sample efficiency of the Mantel–Haenszel estimator in the fixed-strata case. Biometrics 42: 537–545.
Doolittle, M. H. 1888. Association ratios. Bull. Philos. Soc. Washington 10: 83–87, 94–96.
Drost, F. C., W. C. M. Kallenberg, D. S. Moore, and J. Oosterhoff. 1989. Power approximations to multinomial tests of fit. J. Amer. Statist. Assoc. 84: 130–141.
Ducharme, G. R., and Y. Lepage. 1986. Testing collapsibility in contingency tables. J. Roy. Statist. Soc. Ser. B 48: 197–205.
Dupont, W. D. 1986. Sensitivity of Fisher's exact test to minor perturbations in 2 × 2 contingency tables. Statist. Medic. 5: 629–635.
Dyke, G. V., and H. D. Patterson. 1952. Analysis of factorial arrangements when the data are proportions. Biometrics 8: 1–12.
Edwardes, M. D. deB. 1997. Univariate random cut-points theory for the analysis of ordered categorical data. J. Amer. Statist. Assoc. 92: 1114–1123.
Edwards, A. W. F. 1963. The measure of association in a 2 × 2 table. J. Roy. Statist. Soc. Ser. A 126: 109–114.
Edwards, D. 2000. Introduction to Graphical Modelling, 2nd ed. New York: Springer-Verlag.
Edwards, D., and S. Kreiner. 1983. The analysis of contingency tables by graphical models. Biometrika 70: 553–565.
Efron, B. 1975. The efficiency of logistic regression compared to normal discriminant analysis. J. Amer. Statist. Assoc. 70: 892–898.
Efron, B. 1978. Regression and ANOVA with zero–one data: Measures of residual variation. J. Amer. Statist. Assoc. 73: 113–121.
Efron, B., and D. V. Hinkley. 1978. Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information. Biometrika 65: 457–482.
Efron, B., and C. Morris. 1975. Data analysis using Stein's estimator and its generalizations. J. Amer. Statist. Assoc. 70: 311–319.
Ekholm, A., J. W. McDonald, and P. W. F. Smith. 2000. Association models for a multivariate binary response. Biometrics 56: 712–718.
Escoufier, Y. 1982. L'analyse des tableaux de contingence simples et multiples. In Proc. International Meeting on the Analysis of Multidimensional Contingency Tables (Rome, 1981), ed. R. Coppi. Metron 40: 53–77.
Espeland, M. A., and S. L. Handelman. 1989. Using latent class models to characterize and assess relative error in discrete measurements. Biometrics 45: 587–599.
Fahrmeir, L., and G. Tutz. 2001. Multivariate Statistical Modelling Based on Generalized Linear Models, 2nd ed. New York: Springer-Verlag.
Farewell, V. T. 1979. Some results on the estimation of logistic models based on retrospective data. Biometrika 66: 27–32.
Farewell, V. T. 1982. A note on regression analysis of ordinal data with variability of classification. Biometrika 69: 533–538.
Fay, R. 1985. A jackknifed chi-squared test for complex samples. J. Amer. Statist. Assoc. 80: 148–157.
Fay, R. 1986. Causal models for patterns of nonresponse. J. Amer. Statist. Assoc. 81: 354–365.
Ferguson, T. S. 1967. Mathematical Statistics: A Decision Theoretic Approach. New York: Academic Press.
Fienberg, S. E. 1970a. An iterative procedure for estimation in contingency tables. Ann. Math. Statist. 41: 907–917.
Fienberg, S. E. 1970b. Quasi-independence and maximum likelihood estimation in incomplete contingency tables. J. Amer. Statist. Assoc. 65: 1610–1616.
Fienberg, S. E. 1972. The analysis of incomplete multi-way contingency tables. Biometrics 28: 177–202.
Fienberg, S. E. 1980. Fisher's contributions to the analysis of categorical data. Pp. 75–84 in R. A. Fisher: An Appreciation, ed. S. E. Fienberg and D. V. Hinkley. Berlin: Springer-Verlag.
Fienberg, S. E. 1984. The contributions of William Cochran to categorical data analysis. Pp. 103–118 in W. G. Cochran's Impact on Statistics, ed. P. S. R. S. Rao and J. Sedransk. New York: Wiley.
Fienberg, S. E., and P. W. Holland. 1973. Simultaneous estimation of multinomial cell probabilities. J. Amer. Statist. Assoc. 68: 683–690.
Fienberg, S. E., and K. Larntz. 1976. Loglinear representation for paired and multiple comparison models. Biometrika 63: 245–254.
Fienberg, S. E., M. A. Johnson, and B. J. Junker. 1999. Classical multilevel and Bayesian approaches to population size estimation using multiple lists. J. Roy. Statist. Soc. Ser. A 162: 383–405.
Finney, D. J. 1947. The estimation from individual records of the relationship between dose and quantal response. Biometrika 34: 320–334.
Finney, D. J. 1971. Probit Analysis, 3rd ed. Cambridge: Cambridge University Press.
Firth, D. 1987. On the efficiency of quasi-likelihood estimation. Biometrika 74: 233–245.
Firth, D. 1989. Marginal homogeneity and the superposition of Latin squares. Biometrika 76: 179–182.
Firth, D. 1991. Generalized linear models. Pp. 55–82 in Statistical Theory and Modelling: In Honour of Sir David Cox, FRS, D. V. Hinkley, N. Reid, and E. J. Snell, eds. London: Chapman & Hall.
Firth, D. 1993a. Bias reduction of maximum likelihood estimates. Biometrika 80: 27–38.
Firth, D. 1993b. Recent developments in quasi-likelihood methods. Proc. ISI 49th Session, pp. 341–358.
Firth, D., and J. Kuha. 2000. On the index of dissimilarity for lack of fit in log linear models. Unpublished manuscript.
Fischer, G. H., and I. W. Molenaar. 1995. Rasch Models: Foundations, Recent Developments, and Applications. New York: Springer-Verlag.
Fisher, R. A. 1922. On the interpretation of chi-square from contingency tables, and the calculation of P. J. Roy. Statist. Soc. 85: 87–94.
Fisher, R. A. 1924. The conditions under which chi-square measures the discrepancy between observation and hypothesis. J. Roy. Statist. Soc. 87: 442–450.
Fisher, R. A. 1926. Bayes' theorem and the fourfold table. Eugenics Rev. 18: 32–33.
Fisher, R. A. 1934, 1970. Statistical Methods for Research Workers (originally published 1925; 14th ed., 1970). Edinburgh: Oliver & Boyd.
Fisher, R. A. 1935a. The Design of Experiments (8th ed., 1966). Edinburgh: Oliver & Boyd.
Fisher, R. A. 1935b. Appendix to article by C. Bliss. Ann. Appl. Biol. 22: 164–165.
Fisher, R. A. 1935c. The logic of inductive inference. J. Roy. Statist. Soc. 98: 39–82.
Fisher, R. A. 1945. A new test for 2 × 2 tables (Letter to the Editor). Nature 156: 388.
Fisher, R. A. 1956. Statistical Methods for Scientific Inference. Edinburgh: Oliver & Boyd.
Fisher, R. A., and F. Yates. 1938. Statistical Tables. Edinburgh: Oliver & Boyd.
Fitzmaurice, G. M., and N. M. Laird. 1993. A likelihood-based method for analysing longitudinal binary responses. Biometrika 80: 141–151.
Fitzmaurice, G. M., N. M. Laird, and S. Lipsitz. 1994. Analysing incomplete longitudinal binary responses: A likelihood-based approach. Biometrics 50: 601–612.
Fitzmaurice, G. M., N. M. Laird, and A. G. Rotnitzky. 1993. Regression models for discrete longitudinal responses. Statist. Sci. 8: 284–299.
Fitzpatrick, S., and A. Scott. 1987. Quick simultaneous confidence intervals for multinomial proportions. J. Amer. Statist. Assoc. 82: 875–878.
Fleiss, J. L. 1981. Statistical Methods for Rates and Proportions, 2nd ed. New York: Wiley.
Fleiss, J. L., and J. Cohen. 1973. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educ. Psychol. Meas. 33: 613–619.
Fleiss, J. L., J. Cohen, and B. S. Everitt. 1969. Large-sample standard errors of kappa and weighted kappa. Psychol. Bull. 72: 323–327.
Follmann, D. A., and D. Lambert. 1989. Generalizing logistic regression by nonparametric mixing. J. Amer. Statist. Assoc. 84: 295–300.
Forster, J. J., and P. W. F. Smith. 1998. Model-based inference for categorical survey data subject to non-ignorable non-response. J. Roy. Statist. Soc. Ser. B 60: 57–70.
Forster, J. J., J. W. McDonald, and P. W. F. Smith. 1996. Monte Carlo exact conditional tests for log-linear and logistic models. J. Roy. Statist. Soc. Ser. B 58: 445–453.
Fowlkes, E. B. 1987. Some diagnostics for binary logistic regression via smoothing. Biometrika 74: 503–515.
Fowlkes, E. B., A. E. Freeny, and J. Landwehr. 1988. Evaluating logistic models for large contingency tables. J. Amer. Statist. Assoc. 83: 611–622.
Freedman, D., R. Pisani, and R. Purves. 1978. Statistics. New York: W. W. Norton.
Freeman, G. H., and J. H. Halton. 1951. Note on an exact treatment of contingency, goodness-of-fit and other problems of significance. Biometrika 38: 141–149.
Freeman, D. H., Jr., and T. R. Holford. 1980. Summary rates. Biometrics 36: 195–205.
Freeman, M. F., and J. W. Tukey. 1950. Transformations related to the angular and the square root. Ann. Math. Statist. 21: 607–611.
Freidlin, B., and J. L. Gastwirth. 1999. Unconditional versions of several tests commonly used in the analysis of contingency tables. Biometrics 55: 264–267.
Friendly, M. 2000. Visualizing Categorical Data. Cary, NC: SAS Institute.
Frome, E. L. 1983. The analysis of rates using Poisson regression models. Biometrics 39: 665–674.
Fuchs, C. 1982. Maximum likelihood estimation and model selection in contingency tables with missing data. J. Amer. Statist. Assoc. 77: 270–278.
Gabriel, K. R. 1966. Simultaneous test procedures for multiple comparisons on categorical data. J. Amer. Statist. Assoc. 61: 1081–1096.
Gabriel, K. R. 1971. The biplot graphic display of matrices with applications to principal component analysis. Biometrika 58: 453–467.
Gail, M. H., and J. J. Gart. 1973. The determination of sample sizes for use with the exact conditional test in 2 × 2 comparative trials. Biometrics 29: 441–448.
Gail, M., and N. Mantel. 1977. Counting the number of r × c contingency tables with fixed margins. J. Amer. Statist. Assoc. 72: 859–862.
Gart, J. J. 1966. Alternative analyses of contingency tables. J. Roy. Statist. Soc. Ser. B 28: 164–179.
Gart, J. J. 1969. An exact test for comparing matched proportions in crossover designs. Biometrika 56: 75–80.
Gart, J. J. 1970. Point and interval estimation of the common odds ratio in the combination of 2 × 2 tables with fixed margins. Biometrika 57: 471–475.
Gart, J. J. 1971. The comparison of proportions: A review of significance tests, confidence intervals and adjustments for stratification. Rev. Internat. Statist. Inst. 39: 148–169.
Gart, J. J., and J. Nam. 1988. Approximate interval estimation of the ratio of binomial parameters: A review and corrections for skewness. Biometrics 44: 323–338.
Gart, J. J., and J. R. Zweifel. 1967. On the bias of various estimators of the logit and its variance with applications to quantal bioassay. Biometrika 54: 181–187.
Gelfand, A. E., and A. F. Smith. 1990. Sampling-based approaches to calculating marginal densities. J. Amer. Statist. Assoc. 85: 398–409.
Genter, F. C., and V. T. Farewell. 1985. Goodness-of-link testing in ordinal regression models. Canad. J. Statist. 13: 37–44.
Ghosh, B. K. 1979. A comparison of some approximate confidence intervals for the binomial parameter. J. Amer. Statist. Assoc. 74: 894–900.
Ghosh, M., M. Chen, A. Ghosh, and A. Agresti. 2000. Hierarchical Bayesian analysis of binary matched pairs data. Statist. Sin. 10: 647–657.
Gibbons, R. D., and D. Hedeker. 1997. Random-effects probit and logistic regression models for three-level data. Biometrics 53: 1527–1537.
Gill, J. 2000. Generalized Linear Models: A Unified Approach. Thousand Oaks, CA: Sage Publications.
Gilmour, A. R., R. D. Anderson, and A. L. Rae. 1985. The analysis of binomial data by a generalized linear mixed model. Biometrika 72: 593–599.
Gilula, Z., and S. Haberman. 1986. Canonical analysis of contingency tables by maximum likelihood. J. Amer. Statist. Assoc. 81: 780–788.
Gilula, Z., and S. Haberman. 1988. The analysis of multivariate contingency tables by restricted canonical and restricted association models. J. Amer. Statist. Assoc. 83: 760–771.
Gilula, Z., and S. Haberman. 1998. Chi-square, partition of. Pp. 622–627 in Encyclopedia of Biostatistics. Chichester, UK: Wiley.
Gleser, L. J., and D. S. Moore. 1985. The effect of positive dependence on chi-squared tests for categorical data. J. Roy. Statist. Soc. Ser. B 47: 459–465.
Glonek, G. 1996. A class of regression models for multivariate categorical responses. Biometrika 83: 15–28.
Glonek, G. F. V., and P. McCullagh. 1995. Multivariate logistic models. J. Roy. Statist. Soc. Ser. B 57: 533–546.
Glonek, G., J. N. Darroch, and T. P. Speed. 1988. On the existence of maximum likelihood estimators for hierarchical loglinear models. Scand. J. Statist. 15: 187–193.
Gokhale, D. V., and S. Kullback. 1978. The Information in Contingency Tables. New York: Marcel Dekker.
Goldstein, H. 1995. Multilevel Statistical Models, 2nd ed. London: Edward Arnold.
Goldstein, H., and J. Rasbash. 1996. Improved approximations for multilevel models with binary responses. J. Roy. Statist. Soc. Ser. A 159: 505–513.
Good, I. J. 1963. Maximum entropy for hypothesis formulation, especially for multi-dimensional contingency tables. Ann. Math. Statist. 34: 911–934.
Good, I. J. 1965. The Estimation of Probabilities: An Essay on Modern Bayesian Methods. Cambridge, MA: MIT Press.
Good, I. J. 1976. On the application of symmetric Dirichlet distributions and their mixtures to contingency tables. Ann. Statist. 4: 1159–1189.
Good, I. J., and R. A. Gaskins. 1971. Nonparametric roughness penalties for probability densities. Biometrika 58: 255–277.
Good, I. J., and Y. Mittal. 1987. The amalgamation and geometry of two-by-two contingency tables. Ann. Statist. 15: 694–711.
Good, I. J., T. N. Gover, and G. J. Mitchell. 1970. Exact distributions for χ² and for the likelihood-ratio statistic for the equiprobable multinomial distribution. J. Amer. Statist. Assoc. 65: 267–283.
Goodman, L. A. 1964a. Simultaneous confidence intervals for cross-product ratios in contingency tables. J. Roy. Statist. Soc. Ser. B 26: 86–102.
Goodman, L. A. 1964b. Interactions in multi-dimensional contingency tables. Ann. Math. Statist. 35: 632–646.
Goodman, L. A. 1965. On simultaneous confidence intervals for multinomial proportions. Technometrics 7: 247–254.
Goodman, L. A. 1968. The analysis of cross-classified data: Independence, quasi-independence, and interactions in contingency tables with or without missing entries. J. Amer. Statist. Assoc. 63: 1091–1131.
Goodman, L. A. 1969a. On partitioning chi-square and detecting partial association in three-way contingency tables. J. Roy. Statist. Soc. Ser. B 31: 486–498.
Goodman, L. A. 1969b. How to ransack social mobility tables and other kinds of cross-classification tables. Amer. J. Sociol. 75: 1–40.
Goodman, L. A. 1970. The multivariate analysis of qualitative data: Interaction among multiple classifications. J. Amer. Statist. Assoc. 65: 226–256.
Goodman, L. A. 1971a. The analysis of multidimensional contingency tables: Stepwise procedures and direct estimation methods for building models for multiple classifications. Technometrics 13: 33–61.
Goodman, L. A. 1971b. The partitioning of chi-square, the analysis of marginal contingency tables, and the estimation of expected frequencies in multidimensional contingency tables. J. Amer. Statist. Assoc. 66: 339–344.
Goodman, L. A. 1973. The analysis of multidimensional contingency tables when some variables are posterior to others: A modified path analysis approach. Biometrika 60: 179–192.
Goodman, L. A. 1974. Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika 61: 215–231.
Goodman, L. A. 1979a. Simple models for the analysis of association in cross-classifications having ordered categories. J. Amer. Statist. Assoc. 74: 537–552.
Goodman, L. A. 1979b. Multiplicative models for square contingency tables with ordered categories. Biometrika 66: 413–418.
Goodman, L. A. 1981a. Association models and canonical correlation in the analysis of cross-classifications having ordered categories. J. Amer. Statist. Assoc. 76: 320–334.
Goodman, L. A. 1981b. Association models and the bivariate normal for contingency tables with ordered categories. Biometrika 68: 347–355.
Goodman, L. A. 1983. The analysis of dependence in cross-classifications having ordered categories, using log-linear models for frequencies and log-linear models for odds. Biometrics 39: 149–160.
Goodman, L. A. 1985. The analysis of cross-classified data having ordered and/or unordered categories: Association models, correlation models, and asymmetry models for contingency tables with or without missing entries. Ann. Statist. 13: 10–69.
Goodman, L. A. 1986. Some useful extensions of the usual correspondence analysis approach and the usual log-linear models approach in the analysis of contingency tables. Internat. Statist. Rev. 54: 243–309.
Goodman, L. A. 1996. A single general method for the analysis of cross-classified data: Reconciliation and synthesis of some methods of Pearson, Yule, and Fisher, and also some methods of correspondence analysis and association analysis. J. Amer. Statist. Assoc. 91: 408–427.
Goodman, L. A. 2000. The analysis of cross-classified data: Notes on a century of progress in contingency table analysis, and some comments on its prehistory and its future. Pp. 189–231 in Statistics for the 21st Century, ed. C. R. Rao and G. J. Székely. New York: Marcel Dekker.
Goodman, L. A., and W. H. Kruskal. 1979. Measures of Association for Cross Classifications. New York: Springer-Verlag (contains articles appearing in J. Amer. Statist. Assoc. in 1954, 1959, 1963, 1972).
Gould, S. J. 1981. The Mismeasure of Man. New York: W. W. Norton.
Gourieroux, C., A. Monfort, and A. Trognon. 1984. Pseudo maximum likelihood methods: Theory. Econometrica 52: 681–700.
Graubard, B. I., and E. L. Korn. 1987. Choice of column scores for testing independence in ordered 2 × K contingency tables. Biometrics 43: 471–476.
Green, P. J. 1984. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. J. Roy. Statist. Soc. Ser. B 46: 149–192.
Greenacre, M. J. 1993. Correspondence Analysis in Practice. New York: Academic Press.
Greenland, S. 1991. On the logical justification of conditional tests for two-by-two contingency tables. Amer. Statist. 45: 248–251.
Greenland, S., and J. M. Robins. 1985. Estimation of a common effect parameter from sparse follow-up data. Biometrics 41: 55–68.
Greenwood, M., and G. U. Yule. 1920. An inquiry into the nature of frequency distributions representative of multiple happenings with particular reference to the occurrence of multiple attacks of disease or of repeated accidents. J. Roy. Statist. Soc. Ser. A 83: 255–279.
Greenwood, P. E., and M. S. Nikulin. 1996. A Guide to Chi-Squared Testing. New York: Wiley.
Grizzle, J. E., C. F. Starmer, and G. G. Koch. 1969. Analysis of categorical data by linear models. Biometrics 25: 489–504.
Gross, S. T. 1981. On asymptotic power and efficiency of tests of independence in contingency tables with ordered classifications. J. Amer. Statist. Assoc. 76: 935–941.
Gueorguieva, R., and A. Agresti. 2001. A correlated probit model for joint modeling of clustered binary and continuous responses. J. Amer. Statist. Assoc. 96: 1102–1112.
Haber, M. 1980. A comparison of some continuity corrections for the chi-squared test on 2 × 2 tables. J. Amer. Statist. Assoc. 75: 510–515.
Haber, M. 1982. The continuity correction and statistical testing. Internat. Statist. Rev. 50: 135–144.
Haber, M. 1985. Maximum likelihood methods for linear and log-linear models in categorical data. Comput. Statist. Data Anal. 3: 1–10.
Haber, M. 1986. An exact unconditional test for the 2 × 2 comparative trial. Psychol. Bull. 99: 129–132.
Haber, M. 1989. Do the marginal totals of a 2 × 2 contingency table contain information regarding the table proportions? Commun. Statist. Ser. A 18: 147–156.
Haberman, S. J. 1973a. The analysis of residuals in cross-classification tables. Biometrics 29: 205–220.
Haberman, S. J. 1973b. Log-linear models for frequency data: Sufficient statistics and likelihood equations. Ann. Statist. 1: 617–632.
Haberman, S. J. 1974a. The Analysis of Frequency Data. Chicago: University of Chicago Press.
Haberman, S. J. 1974b. Log-linear models for frequency tables with ordered classifications. Biometrics 30: 589–600.
Haberman, S. J. 1977a. Log-linear models and frequency tables with small expected cell counts. Ann. Statist. 5: 1148–1169.
Haberman, S. J. 1977b. Maximum likelihood estimation in exponential response models. Ann. Statist. 5: 815–841.
Haberman, S. J. 1978, 1979. Analysis of Qualitative Data, Vols. 1 and 2. New York: Academic Press.
Haberman, S. J. 1981. Tests for independence in two-way contingency tables based on canonical correlation and on linear-by-linear interaction. Ann. Statist. 9: 1178–1186.
Haberman, S. J. 1982. The analysis of dispersion of multinomial responses. J. Amer. Statist. Assoc. 77: 568–580.
Haberman, S. J. 1988. A warning on the use of chi-squared statistics with frequency tables with small expected cell counts. J. Amer. Statist. Assoc. 83: 555–560.
Haberman, S. J. 1995. Computation of maximum likelihood estimates in association models. J. Amer. Statist. Assoc. 90: 1438–1446.
Hagenaars, J. A. 1998. Categorical causal modeling: Latent class analysis and directed log-linear models with latent variables. Sociol. Methods Res. 26: 436–486.
Hald, A. 1998. A History of Mathematical Statistics from 1750 to 1930. New York: Wiley.
Haldane, J. B. S. 1940. The mean and variance of χ², when used as a test of homogeneity, when expectations are small. Biometrika 31: 346–355.
Haldane, J. B. S. 1956. The estimation and significance of the logarithm of a ratio of frequencies. Ann. Human Genet. 20: 309–311.
Hall, P., and D. M. Titterington. 1987. On smoothing sparse multinomial data. Austral. J. Statist. 29: 19–37.
Hamada, M., and C. F. J. Wu. 1990. A critical look at accumulation analysis and related methods. Technometrics 32: 119–130.
Hansen, L. P. 1982. Large sample properties of generalized method of moments estimators. Econometrica 50: 1029–1054.
Harkness, W. L., and L. Katz. 1964. Comparison of the power functions for the test of independence in 2 × 2 contingency tables. Ann. Math. Statist. 35: 1115–1127.
Harrell, F. E., R. M. Califf, D. B. Pryor, K. L. Lee, and R. A. Rosati. 1982. Evaluating the yield of medical tests. J. Amer. Medic. Assoc. 247: 2543–2546.
Hartzel, J., I.-M. Liu, and A. Agresti. 2001a. Describing heterogeneous effects in stratified ordinal contingency tables, with application to multi-center clinical trials. Computat. Statist. Data Anal. 35: 429–449.
Hartzel, J., A. Agresti, and B. Caffo. 2001b. Multinomial logit random effects models. Statistical Modelling 1: 81–102.
Haslett, S. 1990. Degrees of freedom and parameter estimability in hierarchical models for sparse complete contingency tables. Computat. Statist. Data Anal. 9: 179–195.
Hastie, T., and R. Tibshirani. 1987. Non-parametric logistic and proportional odds regression. Appl. Statist. 36: 260–276.
Hastie, T., and R. Tibshirani. 1990. Generalized Additive Models. London: Chapman & Hall.
Hatzinger, R. 1989. The Rasch model, some extensions and their relation to the class of generalized linear models. Statistical Modelling: Lecture Notes in Statistics, Vol. 57. Berlin: Springer-Verlag.
Hauck, W. W. 1979. The large sample variance of the Mantel–Haenszel estimator of a common odds ratio. Biometrics 35: 817–819.
Hauck, W. W. 1983. A note on confidence bands for the logistic response curve. Amer. Statist. 37: 158–160.
Hauck, W. W., and A. Donner. 1977. Wald's test as applied to hypotheses in logit analysis. J. Amer. Statist. Assoc. 72: 851–853.
Heagerty, P. J. 1999. Marginally specified logistic-normal models for longitudinal binary data. Biometrics 55: 688–698.
Heagerty, P. J., and S. L. Zeger. 1996. Marginal regression models for clustered ordinal measurements. J. Amer. Statist. Assoc. 91: 1024–1036.
Heagerty, P. J., and S. L. Zeger. 2000. Marginalized multilevel models and likelihood inference. Statist. Sci. 15: 1–19.
Hedeker, D., and R. D. Gibbons. 1994. A random-effects ordinal regression model for multilevel analysis. Biometrics 50: 933–944.
Heinen, T. 1996. Latent Class and Discrete Latent Trait Models. Thousand Oaks, CA: Sage Publications.
Heyde, C. C. 1997. Quasi-likelihood and Its Application. New York: Springer-Verlag.
Hinde, J. 1982. Compound Poisson regression models. Pp. 109–121 in GLIM 82: Proc. International Conference on Generalised Linear Models, ed. R. Gilchrist. New York: Springer-Verlag.
Hinde, J., and C. G. B. Demetrio.
1998. Overdispersion: Models and estimation. Comput. Statist.
´
Data Anal. 27: 151᎐170.
Hirji, K. F. 1991. A comparison of exact, mid-P, and score tests for matched case-control studies. Biometrics 47: 487–496.
Hirji, K. F., C. R. Mehta, and N. R. Patel. 1987. Computing distributions for exact logistic regression. J. Amer. Statist. Assoc. 82: 1110–1117.
Hirotsu, C. 1982. Use of cumulative efficient scores for testing ordered alternatives in discrete models. Biometrika 69: 567–577.
Hirschfeld, H. O. 1935. A connection between correlation and contingency. Cambridge Philos. Soc. Proc. (Math. Proc.) 31: 520–524.
Hodges, J. L., Jr. 1958. Fitting the logistic by maximum likelihood. Biometrics 14: 453–461.
Hoem, J. M. 1987. Statistical analysis of a multiplicative model and its application to the standardization of vital rates: A review. Internat. Statist. Rev. 5: 119–152.
Holford, T. R. 1980. The analysis of rates and of survivorship using log-linear models. Biometrics 36: 299–305.
Holt, D., A. J. Scott, and P. D. Ewings. 1980. Chi-squared tests with survey data. J. Roy. Statist. Soc. Ser. A 143: 303–320.
Hook, E. B., and R. R. Regal. 1995. Capture–recapture methods in epidemiology: Methods and limitations. Epidemiol. Rev. 17: 243–264.
Hosmer, D. W., and S. Lemeshow. 1980. A goodness-of-fit test for the multiple logistic regression model. Commun. Statist. Ser. A 9: 1043–1069.
Hosmer, D. W., and S. Lemeshow. 2000. Applied Logistic Regression, 2nd ed. New York: Wiley.
Hosmer, D. W., T. Hosmer, S. le Cessie, and S. Lemeshow. 1997. A comparison of goodness-of-fit tests for the logistic regression model. Statist. Medic. 16: 965–980.
Hout, M., O. D. Duncan, and M. E. Sobel. 1987. Association and heterogeneity: Structural models of similarities and differences. Sociol. Methodol. 17: 145–184.
Howard, J. V. 1998. The 2 × 2 table: A discussion from a Bayesian viewpoint. Statist. Sci. 13: 351–367.
Hsieh, F. Y. 1989. Sample size tables for logistic regression. Statist. Medic. 8: 795–802.
Hsieh, F. Y., D. A. Bloch, and M. D. Larsen. 1998. A simple method of sample size calculation for linear and logistic regression. Statist. Medic. 17: 1623–1634.
Hwang, J. T. G., and M. T. Wells. 2002. Optimality results for mid P-values. To appear.
Hwang, J. T. G., and M.-C. Yang. 2001. An optimality theory for mid P-values in 2 × 2 contingency tables. Statist. Sin. 11: 807–826.
Imrey, P. B. 1998. Bradley–Terry model. Pp. 437–443 in Encyclopedia of Biostatistics. Chichester, UK: Wiley.
Imrey, P. B., W. D. Johnson, and G. G. Koch. 1976. An incomplete contingency table approach to paired-comparison experiments. J. Amer. Statist. Assoc. 71: 614–623.
Imrey, P. B., G. G. Koch, and M. E. Stokes. 1981. Categorical data analysis: Some reflections on the log linear model and logistic regression. I: Historical and methodological overview. Internat. Statist. Rev. 49: 265–283.
Ireland, C. T., and S. Kullback. 1968a. Minimum discrimination information estimation. Biometrics 24: 707–713.
Ireland, C. T., and S. Kullback. 1968b. Contingency tables with given marginals. Biometrika 55: 179–188.
Ireland, C. T., H. H. Ku, and S. Kullback. 1969. Symmetry and marginal homogeneity of an r × r contingency table. J. Amer. Statist. Assoc. 64: 1323–1341.
Irwin, J. O. 1935. Tests of significance for differences between percentages based on small numbers. Metron 12: 83–94.
Jennison, C., and B. W. Turnbull. 2000. Group Sequential Methods with Applications to Clinical Trials. London: Chapman & Hall.
Johnson, B. M. 1971. On the admissible estimators for certain fixed sample binomial problems. Ann. Math. Statist. 42: 1579–1587.
Johnson, W. 1985. Influence measures for logistic regression: Another point of view. Biometrika 72: 59–65.
Johnson, N. L., S. Kotz, and A. W. Kemp. 1992. Univariate Discrete Distributions, 2nd ed. New York: Wiley.
Jones, B., and M. G. Kenward. 1987. Modelling binary data from a three-period cross-over trial. Statist. Medic. 6: 555–564.
Jones, M. P., T. W. O'Gorman, J. H. Lemke, and R. F. Woolson. 1989. A Monte Carlo investigation of homogeneity tests of the odds ratio under various sample size considerations. Biometrics 45: 171–181.
Jørgensen, B. 1983. Maximum likelihood estimation and large-sample inference for generalized linear and nonlinear regression models. Biometrika 70: 19–28.
Jørgensen, B. 1987. Exponential dispersion models. J. Roy. Statist. Soc. Ser. B 49: 127–162.
Kalbfleisch, J. D., and J. F. Lawless. 1985. The analysis of panel data under a Markov assumption. J. Amer. Statist. Assoc. 80: 863–871.
Kastner, C., A. Fieger, and C. Heumann. 1997. MAREG and WinMAREG: A tool for marginal regression models. Comput. Statist. Data Anal. 24: 237–241.
Kauermann, G., and R. J. Carroll. 2001. A note on the efficiency of sandwich covariance matrix estimation. J. Amer. Statist. Assoc. 96: 1387–1397.
Kauermann, G., and G. Tutz. 2001. Testing generalized linear and semiparametric models against smooth alternatives. J. Roy. Statist. Soc. Ser. B 63: 147–166.
Kelderman, H. 1984. Loglinear Rasch model tests. Psychometrika 49: 223–245.
Kempthorne, O. 1979. In dispraise of the exact test: Reactions. J. Statist. Plann. Inference 3: 199–213.
Kendall, M. G. 1945. The treatment of ties in rank problems. Biometrika 33: 239–251.
Kendall, M., and A. Stuart. 1979. The Advanced Theory of Statistics, Vol. 2: Inference and Relationship, 4th ed. New York: Macmillan.
Kenward, M. G., and B. Jones. 1991. The analysis of categorical data from cross-over trials using a latent variable model. Statist. Medic. 10: 1607–1619.
Kenward, M. G., and B. Jones. 1994. The analysis of binary and categorical data from crossover trials. Statist. Methods Medic. Res. 3: 325–344.
Kenward, M. G., E. Lesaffre, and G. Molenberghs. 1994. An application of maximum likelihood and estimating equations to the analysis of ordinal data from a longitudinal study with cases missing at random. Biometrics 50: 945–953.
Khamis, H. J. 1983. Log-linear model analysis of the semi-symmetric intraclass contingency table. Commun. Statist. Ser. A 12: 2723–2752.
Kim, D., and A. Agresti. 1995. Improved exact inference about conditional association in three-way contingency tables. J. Amer. Statist. Assoc. 90: 632–639.
Kim, D., and A. Agresti. 1997. Nearly exact tests of conditional independence and marginal homogeneity for sparse contingency tables. Comput. Statist. Data Anal. 24: 89–104.
King, G. 1997. A Solution to the Ecological Inference Problem. Princeton, NJ: Princeton University Press.
Knuiman, M. W., and T. P. Speed. 1988. Incorporating prior information into the analysis of contingency tables. Biometrics 44: 1061–1071.
Koch, G. G., and V. P. Bhapkar. 1982. Chi-square tests. Pp. 442–457 in Encyclopedia of Statistical Sciences, Vol. 1. New York: Wiley.
Koch, G. G., J. R. Landis, J. L. Freeman, D. H. Freeman, and R. G. Lehnen. 1977. A general methodology for the analysis of experiments with repeated measurement of categorical data. Biometrics 33: 133–158.
Koch, G. G., I. A. Amara, G. W. Davis, and D. B. Gillings. 1982. A review of some statistical methods for covariance analysis of categorical data. Biometrics 38: 563–595.
Koch, G. G., P. B. Imrey, J. M. Singer, S. S. Atkinson, and M. E. Stokes. 1985. Lecture Notes for Analysis of Categorical Data. Montreal: Les Presses de l'Université de Montréal.
Koehler, K. 1986. Goodness-of-fit tests for log-linear models in sparse contingency tables. J. Amer. Statist. Assoc. 81: 483–493.
Koehler, K. 1998. Chi-square tests. Pp. 608–622 in Encyclopedia of Biostatistics. Chichester, UK: Wiley.
Koehler, K., and K. Larntz. 1980. An empirical investigation of goodness-of-fit statistics for sparse multinomials. J. Amer. Statist. Assoc. 75: 336–344.
Koehler, K., and J. Wilson. 1986. Chi-square tests for comparing vectors of proportions for several cluster samples. Commun. Statist. Ser. A 15: 2977–2990.
Koopman, P. A. R. 1984. Confidence limits for the ratio of two binomial proportions. Biometrics 40: 513–517.
Kraemer, H. C. 1979. Ramifications of a population model for κ as a coefficient of reliability. Psychometrika 44: 461–472.
Kreiner, S. 1987. Analysis of multidimensional contingency tables by exact conditional tests: Techniques and strategies. Scand. J. Statist. 14: 97–112.
Kreiner, S. 1998. Interaction models. Pp. 2063–2068 in Encyclopedia of Biostatistics. Chichester, UK: Wiley.
Kruskal, W. H. 1958. Ordinal measures of association. J. Amer. Statist. Assoc. 53: 814–861.
Ku, H. H., R. N. Varner, and S. Kullback. 1971. Analysis of multidimensional contingency tables. J. Amer. Statist. Assoc. 66: 55–64.
Kuha, J., and C. Skinner. 1997. Categorical data analysis and misclassification. Pp. 633–670 in Survey Measurement and Process Quality, ed. L. Lyberg et al. New York: Wiley.
Kuha, J., C. Skinner, and J. Palmgren. 1998. Misclassification error. Pp. 2615–2621 in Encyclopedia of Biostatistics. Chichester, UK: Wiley.
Kullback, S. 1959. Information Theory and Statistics. New York: Wiley.
Kullback, S., M. Kupperman, and H. H. Ku. 1962. Tests for contingency tables and Markov chains. Technometrics 4: 573–608.
Kupper, L. L., and J. K. Haseman. 1978. The use of a correlated binomial model for the analysis of certain toxicological experiments. Biometrics 34: 69–76.
Kupper, L. L., C. Portier, M. D. Hogan, and E. Yamamoto. 1986. The impact of litter effects on dose–response modeling in teratology. Biometrics 42: 85–98.
Läärä, E., and J. N. S. Matthews. 1985. The equivalence of two models for ordinal data. Biometrika 72: 206–207.
Lachin, J. M. 1977. Sample-size determinations for r × c comparative trials. Biometrics 33: 315–324.
Laird, N. M. 1978. Empirical Bayes methods for two-way contingency tables. Biometrika 65: 581–590.
Laird, N. M. 1998. EM algorithm. Pp. 1300–1313 in Encyclopedia of Biostatistics. Chichester, UK: Wiley.
Laird, N. M., and D. Olivier. 1981. Covariance analysis of censored survival data using log-linear analysis techniques. J. Amer. Statist. Assoc. 76: 231–240.
Lancaster, H. O. 1949. The derivation and partition of χ² in certain discrete distributions. Biometrika 36: 117–129.
Lancaster, H. O. 1951. Complex contingency tables treated by partition of χ². J. Roy. Statist. Soc. Ser. B 13: 242–249.
Lancaster, H. O. 1961. Significance tests in discrete distributions. J. Amer. Statist. Assoc. 56: 223–234.
Lancaster, H. O. 1969. The Chi-Squared Distribution. New York: Wiley.
Lancaster, H. O., and M. A. Hamdan. 1964. Estimation of the correlation coefficient in contingency tables with possible nonmetrical characters. Psychometrika 29: 383–391.
Landis, J. R., and G. G. Koch. 1977. An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics 33: 363–374.
Landis, J. R., E. R. Heyman, and G. G. Koch. 1978. Average partial association in three-way contingency tables: A review and discussion of alternative tests. Internat. Statist. Rev. 46: 237–254.
Landis, J. R., T. J. Sharp, S. J. Kuritz, and G. G. Koch. 1998. Mantel–Haenszel methods. Pp. 2378–2691 in Encyclopedia of Biostatistics. Chichester, UK: Wiley.
Landwehr, J. M., D. Pregibon, and A. C. Shoemaker. 1984. Graphical methods for assessing logistic regression models. J. Amer. Statist. Assoc. 79: 61–71.
Lang, J. B. 1992. Obtaining the observed information matrix for the Poisson log linear model with incomplete data. Biometrika 79: 405–407.
Lang, J. B. 1996a. Maximum likelihood methods for a generalized class of log-linear models. Ann. Statist. 24: 726–752.
Lang, J. B. 1996b. On the partitioning of goodness-of-fit statistics for multivariate categorical response models. J. Amer. Statist. Assoc. 91: 1017–1023.
Lang, J. B. 1996c. On the comparison of multinomial and Poisson log-linear models. J. Roy. Statist. Soc. Ser. B 58: 253–266.
Lang, J. B., and A. Agresti. 1994. Simultaneously modeling joint and marginal distributions of multivariate categorical responses. J. Amer. Statist. Assoc. 89: 625–632.
Lang, J. B., J. W. McDonald, and P. W. F. Smith. 1999. Association-marginal modeling of multivariate categorical responses: A maximum likelihood approach. J. Amer. Statist. Assoc. 94: 1161–1171.
Laplace, P. S. 1812. Théorie Analytique des Probabilités. Paris: Courcier.
Larntz, K. 1978. Small-sample comparison of exact levels for chi-squared goodness-of-fit statistics. J. Amer. Statist. Assoc. 73: 253–263.
Larsen, K., J. H. Petersen, E. Budtz-Jørgensen, and L. Endahl. 2000. Interpreting parameters in the logistic regression model with random effects. Biometrics 56: 909–914.
Larson, M. G. 1984. Covariate analysis of competing-risks data with log-linear models. Biometrics 40: 459–469.
Lauritzen, S. L. 1996. Graphical Models. New York: Oxford University Press.
Lauritzen, S. L., and N. Wermuth. 1989. Graphical models for associations between variables, some of which are qualitative and some quantitative. Ann. Statist. 17: 31–57.
LaVange, L. M., G. G. Koch, and T. A. Schwartz. 2001. Applying sample survey methods to clinical trials data. Statist. Medic. 20: 2609–2623.
Lawal, H. B. 1984. Comparisons of the X², Y², Freeman–Tukey and Williams' improved G² test statistics in small samples of one-way multinomials. Biometrika 71: 415–418.
Lawless, J. F. 1987. Negative binomial and mixed Poisson regression. Canad. J. Statist. 15: 209–225.
Lazarsfeld, P. F., and N. W. Henry. 1968. Latent Structure Analysis. Boston: Houghton Mifflin.
Lee, S. K. 1977. On the asymptotic variances of û terms in loglinear models of multidimensional contingency tables. J. Amer. Statist. Assoc. 72: 412–419.
Lee, Y., and J. A. Nelder. 1996. Hierarchical generalized linear models. J. Roy. Statist. Soc. Ser. B 58: 619–678.
Lefkopoulou, M., D. Moore, and L. Ryan. 1989. The analysis of multiple correlated binary outcomes: Application to rodent teratology experiments. J. Amer. Statist. Assoc. 84: 810–815.
Lehmann, E. L. 1966. Some concepts of dependence. Ann. Math. Statist. 37: 1137–1153.
Lehmann, E. L. 1986. Testing Statistical Hypotheses, 2nd ed. New York: Wiley.
Leonard, T. 1975. Bayesian estimation methods for two-way contingency tables. J. Roy. Statist. Soc. Ser. B 37: 23–37.
Leonard, T., and J. S. J. Hsu. 1994. The Bayesian analysis of categorical data: A selective review. Pp. 283–310 in Aspects of Uncertainty: A Tribute to D. V. Lindley, ed. P. R. Freeman and A. F. M. Smith. New York: Wiley.
Lesaffre, E., and A. Albert. 1989. Multiple-group logistic regression diagnostics. Appl. Statist. 38: 425–440.
Lesaffre, E., and G. Molenberghs. 1991. Multivariate probit analysis: A neglected procedure in medical statistics. Statist. Medic. 10: 1391–1403.
Lesaffre, E., and B. Spiessens. 2001. On the effect of quadrature points in a logistic random-effects model: An example. Appl. Statist. 50: 325–335.
Lewis, T., I. W. Saunders, and M. Westcott. 1984. The moments of the Pearson chi-squared statistic and the minimum expected value in two-way tables. Biometrika 71: 515–522.
Liang, K. Y. 1984. The asymptotic efficiency of conditional likelihood methods. Biometrika 71: 305–313.
Liang, K. Y., and J. Hanfelt. 1994. On the use of the quasi-likelihood method in teratological experiments. Biometrics 50: 872–880.
Liang, K. Y., and P. McCullagh. 1993. Case studies in binary dispersion. Biometrics 49: 623–630.
Liang, K. Y., and S. G. Self. 1985. Tests for homogeneity of odds ratios when the data are sparse. Biometrika 72: 353–358.
Liang, K. Y., and S. L. Zeger. 1986. Longitudinal data analysis using generalized linear models. Biometrika 73: 13–22.
Liang, K. Y., and S. L. Zeger. 1988. On the use of concordant pairs in matched case–control studies. Biometrics 44: 1145–1156.
Liang, K. Y., and S. L. Zeger. 1995. Inference based on estimating functions in the presence of nuisance parameters. Statist. Sci. 10: 158–173.
Liang, K. Y., S. L. Zeger, and B. Qaqish. 1992. Multivariate regression analyses for categorical data. J. Roy. Statist. Soc. Ser. B 54: 3–24.
Lin, X. 1997. Variance component testing in generalized linear models with random effects. Biometrika 84: 309–326.
Lindley, D. V. 1964. The Bayesian analysis of contingency tables. Ann. Math. Statist. 35: 1622–1643.
Lindsay, B., C. Clogg, and J. Grego. 1991. Semi-parametric estimation in the Rasch model and related exponential response models, including a simple latent class model for item analysis. J. Amer. Statist. Assoc. 86: 96–107.
Lindsey, J. K. 1999. Models for Repeated Measurements, 2nd ed. Oxford: Oxford University Press.
Lindsey, J. K., and P. M. E. Altham. 1998. Analysis of the human sex ratio by using overdispersion models. Appl. Statist. 47: 149–157.
Lindsey, J. K., and G. Mersch. 1992. Fitting and comparing probability distributions with log linear models. Comput. Statist. Data Anal. 13: 373–384.
Lipsitz, S. 1992. Methods for estimating the parameters of a linear model for ordered categorical data. Biometrics 48: 271–281.
Lipsitz, S. R., and G. Fitzmaurice. 1996. The score test for independence in R × C contingency tables with missing data. Biometrics 52: 751–762.
Lipsitz, S., N. Laird, and D. Harrington. 1990. Finding the design matrix for the marginal homogeneity model. Biometrika 77: 353–358.
Lipsitz, S., N. Laird, and D. Harrington. 1991. Generalized estimating equations for correlated binary data: Using the odds ratio as a measure of association. Biometrika 78: 153–160.
Lipsitz, S. R., K. Kim, and L. Zhao. 1994. Analysis of repeated categorical data using generalized estimating equations. Statist. Medic. 13: 1149–1163.
Little, R. J. 1989. Testing the equality of two independent binomial proportions. Amer. Statist. 43: 283–288.
Little, R. J. 1998. Missing data. Pp. 2622–2635 in Encyclopedia of Biostatistics. Chichester, UK: Wiley.
Little, R. J., and D. B. Rubin. 1987. Statistical Analysis with Missing Data. New York: Wiley.
Little, R. J. A., and M.-M. Wu. 1991. Models for contingency tables with known margins when target and sampled populations differ. J. Amer. Statist. Assoc. 86: 87–95.
Liu, Q., and D. A. Pierce. 1993. Heterogeneity in Mantel–Haenszel-type models. Biometrika 80: 543–556.
Liu, Q., and D. A. Pierce. 1994. A note on Gauss–Hermite quadrature. Biometrika 81: 624–629.
Lloyd, C. J. 1988a. Some issues arising from the analysis of 2 × 2 contingency tables. Austral. J. Statist. 30: 35–46.
Lloyd, C. J. 1988b. Doubling the one-sided P-value in testing independence in 2 × 2 tables against a two-sided alternative. Statist. Medic. 7: 1297–1306.
Lloyd, C. J. 1999. Statistical Analysis of Categorical Data. New York: Wiley.
Longford, N. T. 1993. Random Coefficient Models. New York: Oxford University Press.
Loughin, T. M., and P. N. Scherer. 1998. Testing for association in contingency tables with multiple column responses. Biometrics 54: 630–637.
Louis, T. A. 1982. Finding the observed information matrix when using the EM algorithm. J. Roy. Statist. Soc. Ser. B 44: 226–233.
Luce, R. D. 1959. Individual Choice Behavior. New York: Wiley.
Madansky, A. 1963. Tests of homogeneity for correlated samples. J. Amer. Statist. Assoc. 58: 97–119.
Maddala, G. S. 1983. Limited-Dependent and Qualitative Variables in Econometrics. Cambridge: Cambridge University Press.
Magnus, J. R., and H. Neudecker. 1988. Matrix Differential Calculus with Applications in Statistics and Econometrics. New York: Wiley.
Mantel, N. 1963. Chi-square tests with one degree of freedom: Extensions of the Mantel–Haenszel procedure. J. Amer. Statist. Assoc. 58: 690–700.
Mantel, N. 1966. Models for complex contingency tables and polychotomous dosage response curves. Biometrics 22: 83–95.
Mantel, N. 1973. Synthetic retrospective studies and related topics. Biometrics 29: 479–486.
Mantel, N. 1985. Maximum likelihood vs. minimum chi-square. Biometrics 41: 777–781.
Mantel, N. 1987a. Understanding Wald's test for exponential families. Amer. Statist. 41: 147–148.
Mantel, N. 1987b. Exact tests for 2 × 2 contingency tables (Letter). Amer. Statist. 41: 159.
Mantel, N., and D. P. Byar. 1978. Marginal homogeneity, symmetry and independence. Commun. Statist. Ser. A 7: 953–976.
Mantel, N., and W. Haenszel. 1959. Statistical aspects of the analysis of data from retrospective studies of disease. J. Natl. Cancer Inst. 22: 719–748.
Martín Andrés, A., and A. Silva Mato. 1994. Choosing the optimal unconditional test for comparing two independent proportions. Comput. Statist. Data Anal. 17: 555–574.
Matthews, J. N. S., and K. P. Morris. 1995. An application of Bradley–Terry-type models to the measurement of pain. Appl. Statist. 44: 243–255.
McCullagh, P. 1978. A class of parametric models for the analysis of square contingency tables with ordered categories. Biometrika 65: 413–418.
McCullagh, P. 1980. Regression models for ordinal data. J. Roy. Statist. Soc. Ser. B 42: 109–142.
McCullagh, P. 1982. Some applications of quasisymmetry. Biometrika 69: 303–308.
McCullagh, P. 1983. Quasi-likelihood functions. Ann. Statist. 11: 59–67.
McCullagh, P. 1986. The conditional distribution of goodness-of-fit statistics for discrete data. J. Amer. Statist. Assoc. 81: 104–107.
McCullagh, P., and J. A. Nelder. 1983; 2nd ed., 1989. Generalized Linear Models. London: Chapman & Hall.
McCulloch, C. E. 1994. Maximum likelihood variance components estimation for binary data. J. Amer. Statist. Assoc. 89: 330–335.
McCulloch, C. E. 1997. Maximum likelihood algorithms for generalized linear mixed models. J. Amer. Statist. Assoc. 92: 162–170.
McCulloch, C. E. 2000. Generalized linear models. J. Amer. Statist. Assoc. 95: 1320–1324.
McCulloch, C. E., and S. Searle. 2001. Generalized, Linear, and Mixed Models. New York: Wiley.
McFadden, D. 1974. Conditional logit analysis of qualitative choice behavior. Pp. 105–142 in Frontiers in Econometrics, ed. P. Zarembka. New York: Academic Press.
McFadden, D. 1982. Qualitative response models. Pp. 1–37 in Advances in Econometrics, ed. W. Hildebrand. Cambridge: Cambridge University Press.
McNemar, Q. 1947. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12: 153–157.
Mee, R. W. 1984. Confidence bounds for the difference between two probabilities (letter). Biometrics 40: 1175–1176.
Meeden, G., C. Geyer, J. Lang, and E. Funo. 1998. The admissibility of the maximum likelihood estimator for decomposable log-linear interaction models for contingency tables. Commun. Statist. Ser. A 27: 473–493.
Mehta, C. R. 1994. The exact analysis of contingency tables in medical research. Statist. Methods Medic. Res. 3: 135–156.
Mehta, C. R., and N. R. Patel. 1983. A network algorithm for performing Fisher's exact test in r × c contingency tables. J. Amer. Statist. Assoc. 78: 427–434.
Mehta, C. R., and N. R. Patel. 1995. Exact logistic regression: Theory and examples. Statist. Medic. 14: 2143–2160.
Mehta, C. R., and S. J. Walsh. 1992. Comparison of exact, mid-P, and Mantel–Haenszel confidence intervals for the common odds ratio across several 2 × 2 contingency tables. Amer. Statist. 46: 146–150.
Mehta, C. R., N. R. Patel, and R. Gray. 1985. Computing an exact confidence interval for the common odds ratio in several 2 by 2 contingency tables. J. Amer. Statist. Assoc. 80: 969–973.
Mehta, C. R., N. R. Patel, and P. Senchaudhuri. 1988. Importance sampling for estimating exact probabilities in permutational inference. J. Amer. Statist. Assoc. 83: 999–1005.
Mehta, C. R., N. R. Patel, and P. Senchaudhuri. 2000. Efficient Monte Carlo methods for conditional logistic regression. J. Amer. Statist. Assoc. 95: 99–108.
Michailidis, G., and J. de Leeuw. 1998. The Gifi system of descriptive multivariate analysis. Statist. Sci. 13: 307–336.
Miettinen, O. S. 1969. Individual matching with multiple controls in the case of all-or-none responses. Biometrics 25: 339–355.
Miettinen, O. S., and M. Nurminen. 1985. Comparative analysis of two rates. Statist. Medic. 4: 213–226.
Miller, M. E., C. S. Davis, and J. R. Landis. 1993. The analysis of longitudinal polytomous data: Generalized estimating equations and connections with weighted least squares. Biometrics 49: 1033–1044.
Minkin, S. 1987. On optimal design for binary data. J. Amer. Statist. Assoc. 82: 1098–1103.
Mirkin, B. 2001. Eleven ways to look at the chi-squared coefficient for contingency tables. Amer. Statist. 55: 111–120.
Mitra, S. K. 1958. On the limiting power function of the frequency chi-square test. Ann. Math. Statist. 29: 1221–1233.
Molenberghs, G., and E. Goetghebeur. 1997. Simple fitting algorithms for incomplete categorical data. J. Roy. Statist. Soc. Ser. B 59: 401–414.
Molenberghs, G., and E. Lesaffre. 1994. Marginal modeling of correlated ordinal data using a multivariate Plackett distribution. J. Amer. Statist. Assoc. 89: 633–644.
Molenberghs, G., M. G. Kenward, and E. Lesaffre. 1997. The analysis of longitudinal ordinal data with nonrandom drop-out. Biometrika 84: 33–44.
Moore, D. F. 1986a. Asymptotic properties of moment estimates for overdispersed counts and proportions. Biometrika 35: 583–588.
Moore, D. S. 1986b. Tests of chi-squared type. Pp. 63–95 in Goodness-of-Fit Techniques, ed. R. D'Agostino and M. A. Stephens. New York: Marcel Dekker.
Moore, D. F., and A. Tsiatis. 1991. Robust estimation of the variance in moment methods for extra-binomial and extra-Poisson variation. Biometrics 47: 383–401.
Morgan, B. J. T. 1992. Analysis of Quantal Response Data. London: Chapman & Hall.
Morgan, W. M., and B. A. Blumenstein. 1991. Exact conditional tests for hierarchical models in multidimensional contingency tables. Appl. Statist. 40: 435–442.
Mosimann, J. E. 1962. On the compound multinomial distribution, the multivariate β-distribution and correlations among proportions. Biometrika 49: 65–82.
Mosteller, F. 1951. Remarks on the method of paired comparisons I: The least-squares solution assuming equal standard deviations and equal correlations. Psychometrika 16: 3–9.
Mosteller, F. 1952. Some statistical problems in measuring the subjective response to drugs. Biometrics 8: 220–226.
Mosteller, F. 1968. Association and estimation in contingency tables. J. Amer. Statist. Assoc. 63: 1–28.
Nair, V. N. 1987. Chi-squared-type tests for ordered alternatives in contingency tables. J. Amer. Statist. Assoc. 82: 283–291.
Natarajan, R., and C. McCulloch. 1995. A note on the existence of the posterior distribution for a class of mixed models for binomial responses. Biometrika 82: 639–643.
Natarajan, R., and C. McCulloch. 1998. Gibbs sampling with diffuse proper priors: A valid approach to data-driven inference? J. Comput. Graph. Statist. 7: 267–277.
Nelder, J., and D. Pregibon. 1987. An extended quasi-likelihood function. Biometrika 74: 221–232.
Nelder, J., and R. W. M. Wedderburn. 1972. Generalized linear models. J. Roy. Statist. Soc. Ser. A 135: 370–384.
Nerlove, M., and S. J. Press. 1973. Univariate and multivariate log-linear and logistic models. Technical Report R-1306-EDA/NIH, Rand Corporation, Santa Monica, CA.
Neuhaus, J. M. 1992. Statistical methods for longitudinal and clustered designs with binary responses. Statist. Methods Medic. Res. 1: 249–273.
Neuhaus, J. M., and N. P. Jewell. 1990a. Some comments on Rosner’s multiple logistic model for
clustered data. Biometrics 46: 523534.
Neuhaus, J. M., and N. P. Jewell. 1990b. The effect of retrospective sampling on binary
regression models for clustered data. Biometrics 46: 977990.
Neuhaus, J. M., and M. L. Lesperance. 1996. Estimation efficiency in a binary mixed-effects
model setting. Biometrika 83: 441446.
Neuhaus, J. M., J. D. Kalbfleisch, and W. W. Hauck. 1991. A comparison of cluster-specific and
population-averaged approaches for analyzing correlated binary data. Internat. Statist. Re®.
59: 2535.
Neuhaus, J. M., W. W. Hauck, and J. D. Kalbfleisch. 1992. The effects of mixture distribution
misspecification when fitting mixed-effects logistic models. Biometrika 79: 755762.
Neuhaus, J. M., J. D. Kalbfleisch, and W. W. Hauck. 1994. Conditions for consistent estimation
in mixed-effects models for binary matched-pairs data. Canad. J. Statist. 22: 139–148.
Newcombe, R. 1998a. Two-sided confidence intervals for the single proportion: Comparison of seven methods. Statist. Medic. 17: 857–872.
Newcombe, R. 1998b. Interval estimation for the difference between independent proportions: Comparison of eleven methods. Statist. Medic. 17: 873–890.
Newcombe, R. 2001. Logit confidence intervals and the inverse sinh transformation. Amer. Statist. 55: 200–202.
Neyman, J. 1935. On the problem of confidence limits. Ann. Math. Statist. 6: 111–116.
Neyman, J. 1949. Contributions to the theory of the χ² test. Pp. 239–273 in Proc. First Berkeley Symposium on Mathematical Statistics and Probability, ed. J. Neyman. Berkeley, CA: University of California Press.
Nurminen, M. 1986. Confidence intervals for the ratio and difference of two binomial proportions. Biometrics 42: 675–676.
O’Brien, P. C. 1988. Comparing two samples: Extensions of the t, rank-sum, and log-rank tests. J. Amer. Statist. Assoc. 83: 52–61.
O’Brien, R. G. 1986. Using the SAS system to perform power analyses for log-linear models. Pp. 778–784 in Proc. 11th Annual SAS Users Group Conference. Cary, NC: SAS Institute.
Ochi, Y., and R. Prentice. 1984. Likelihood inference in a correlated probit regression model. Biometrika 71: 531–543.
O’Gorman, T. W., and R. F. Woolson. 1988. Analysis of ordered categorical data using the SAS system. Pp. 957–963 in Proc. 13th Annual SAS Users Group Conference. Cary, NC: SAS Institute.
Paik, M. 1985. A graphic representation of a three-way contingency table: Simpson’s paradox and correlation. Amer. Statist. 39: 53–54.
Palmgren, J. 1981. The Fisher information matrix for log-linear models arguing conditionally on the observed explanatory variables. Biometrika 68: 563–566.
Palmgren, J., and A. Ekholm. 1987. Exponential family non-linear models for categorical data with errors of observation. Appl. Stochastic Models Data Anal. 3: 111–124.
Park, T., and M. B. Brown. 1994. Models for categorical data with nonignorable nonresponse. J. Amer. Statist. Assoc. 89: 44–52.
Parr, W. C., and H. D. Tolley. 1982. Jackknifing in categorical data analysis. Austral. J. Statist. 24: 67–79.
Parzen, E. 1997. Concrete statistics. Pp. 309–332 in Statistics of Quality. New York: Marcel Dekker.
Patefield, W. M. 1982. Exact tests for trends in ordered contingency tables. Appl. Statist. 31: 32–43.
REFERENCES
681
Patnaik, P. B. 1949. The non-central χ²- and F-distributions and their applications. Biometrika 36: 202–232.
Paul, S. R., K. Y. Liang, and S. G. Self. 1989. On testing departure from the binomial and multinomial assumptions. Biometrics 45: 231–236.
Pearson, E. S. 1947. The choice of a statistical test illustrated on the interpretation of data classified in 2 × 2 tables. Biometrika 34: 139–167.
Pearson, K. 1900. On a criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philos. Mag. Ser. 5 50: 157–175. (Reprinted in Karl Pearson’s Early Statistical Papers, ed. E. S. Pearson. Cambridge: Cambridge University Press, 1948.)
Pearson, K. 1904. Mathematical contributions to the theory of evolution XIII: On the theory of contingency and its relation to association and normal correlation. Draper’s Co. Research Memoirs, Biometric Series, no. 1. (Reprinted in Karl Pearson’s Early Statistical Papers, ed. E. S. Pearson. Cambridge: Cambridge University Press, 1948.)
Pearson, K. 1913. On the probable error of a correlation coefficient as found from a fourfold table. Biometrika 9: 22–27.
Pearson, K. 1917. On the general theory of multiple contingency with special reference to partial contingency. Biometrika 11: 145–158.
Pearson, K. 1922. On the χ² test of goodness of fit. Biometrika 14: 186–191.
Pearson, K., and D. Heron. 1913. On theories of association. Biometrika 9: 159–315.
Peduzzi, P., J. Concato, E. Kemper, T. R. Holford, and A. R. Feinstein. 1996. A simulation study of the number of events per variable in logistic regression analysis. J. Clin. Epidemiol. 49: 1373–1379.
Pendergast, J. F., S. J. Gange, M. A. Newton, M. J. Lindstrom, M. Palta, and M. R. Fisher. 1996. A survey of methods for analyzing clustered binary response data. Internat. Statist. Rev. 64: 89–118.
Pepe, M. S. 2000. Receiver operating characteristic methodology. J. Amer. Statist. Assoc. 95: 308–311.
Peterson, B., and F. E. Harrell, Jr. 1990. Partial proportional odds models for ordinal response variables. Appl. Statist. 39: 205–217.
Pierce, D. A., and D. Peters. 1992. Practical use of higher order asymptotics for multiparameter exponential families. J. Roy. Statist. Soc. Ser. B 54: 701–725.
Pierce, D. A., and D. Peters. 1999. Improving on exact tests by approximate conditioning. Biometrika 86: 265–277.
Pierce, D. A., and B. R. Sands. 1975. Extra-Bernoulli variation in regression of binary data. Technical Report 46, Statistics Department, Oregon State University, Corvallis, OR.
Pierce, D. A., and D. W. Schafer. 1986. Residuals in generalized linear models. J. Amer. Statist. Assoc. 81: 977–983.
Plackett, R. L. 1962. A note on interactions in contingency tables. J. Roy. Statist. Soc. Ser. B 24: 162–166.
Plackett, R. L. 1964. The continuity correction in 2 × 2 tables. Biometrika 51: 327–337.
Plackett, R. L. 1983. Karl Pearson and the chi-squared test. Internat. Statist. Rev. 51: 59–72.
Podgor, M. J., J. L. Gastwirth, and C. R. Mehta. 1996. Efficiency robust tests of independence in contingency tables with ordered classifications. Statist. Medic. 15: 2095–2105.
Poisson, S.-D. 1837. Recherches sur la probabilité des jugements en matière criminelle et en matière civile, précédées des règles générales du calcul des probabilités. Paris: Bachelier.
Pratt, J. W. 1981. Concavity of the log likelihood. J. Amer. Statist. Assoc. 76: 103–106.
Pregibon, D. 1980. Goodness of link tests for generalized linear models. Appl. Statist. 29: 15–24.
Pregibon, D. 1981. Logistic regression diagnostics. Ann. Statist. 9: 705–724.
Pregibon, D. 1982. Score tests in GLIM with application. Pp. 87–97 in Lecture Notes in Statistics, 14: GLIM 82, Proc. International Conference on Generalised Linear Models, ed. R. Gilchrist. New York: Springer-Verlag.
Prentice, R. 1976a. Use of the logistic model in retrospective studies. Biometrics 32: 599–606.
Prentice, R. 1976b. Generalization of the probit and logit methods for dose response curves. Biometrics 32: 761–768.
Prentice, R. 1986. Binary regression using an extended beta-binomial distribution, with discussion of correlation induced by covariate measurement errors. J. Amer. Statist. Assoc. 81: 321–327.
Prentice, R., and N. Breslow. 1978. Retrospective studies and failure time models. Biometrika 65: 153–158.
Prentice, R., and L. A. Gloeckler. 1978. Regression analysis of grouped survival data with application to breast cancer data. Biometrics 34: 57–67.
Prentice, R., and R. Pyke. 1979. Logistic disease incidence models and case-control studies. Biometrika 66: 403–412.
Prentice, R., and L. P. Zhao. 1991. Estimating equations for parameters in means and covariances of multivariate discrete and continuous responses. Biometrics 47: 825–839.
Press, S. J., and S. Wilson. 1978. Choosing between logistic regression and discriminant analysis. J. Amer. Statist. Assoc. 73: 699–705.
Qu, A., B. G. Lindsay, and B. Li. 2000. Improving generalised estimating equations using quadratic inference functions. Biometrika 87: 823–836.
Quine, M. P., and E. Seneta. 1987. Bortkiewicz’s data and the law of small numbers. Internat. Statist. Rev. 55: 173–181.
Rabe-Hesketh, S., and A. Skrondal. 2001. Parameterisation of multivariate random effects models for categorical data. Biometrics 57:.
Raftery, A. E. 1986. Choosing models for cross-classification. Amer. Sociol. Rev. 51: 145–146.
Rao, C. R. 1957. Maximum likelihood estimation for the multinomial distribution. Sankhya 18: 139–148.
Rao, C. R. 1963. Criteria of estimation in large samples. Sankhya 25: 189–206.
Rao, C. R. 1973. Linear Statistical Inference and Its Applications, 2nd ed. New York: Wiley.
Rao, C. R. 1982. Diversity: Its measurement, decomposition, apportionment, and analysis. Sankhya Ser. A 44: 1–22.
Rao, J. N. K., and A. J. Scott. 1987. On simple adjustments to chi-square tests with sample survey data. Ann. Statist. 15: 385–397.
Rao, J. N. K., and D. R. Thomas. 1988. The analysis of cross-classified categorical data from complex sample surveys. Sociol. Methodol. 18: 213–270.
Rasch, G. 1961. On general laws and the meaning of measurement in psychology. Pp. 321–333 in Proc. 4th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 4, ed. J. Neyman. Berkeley, CA: University of California Press.
Rayner, J. C. W., and D. J. Best. 2001. A Contingency Table Approach to Nonparametric Testing. London: Chapman & Hall.
Read, T. R. C., and N. A. C. Cressie. 1988. Goodness-of-Fit Statistics for Discrete Multivariate Data. New York: Springer-Verlag.
Rice, W. R. 1988. A new probability model for determining exact P-values for 2 × 2 contingency tables when comparing binomial proportions. Biometrics 44: 1–22.
Ritov, Y., and Z. Gilula. 1991. The order-restricted RC model for ordered contingency tables: Estimation and testing for fit. Ann. Statist. 19: 2090–2101.
Robins, J., N. Breslow, and S. Greenland. 1986. Estimators of the Mantel–Haenszel variance consistent in both sparse data and large-strata limiting models. Biometrics 42: 311–323.
Robins, J., A. Rotnitzky, and L. P. Zhao. 1995. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J. Amer. Statist. Assoc. 90: 106–121.
Röhmel, J., and U. Mansmann. 1999. Unconditional non-asymptotic one-sided tests for independent binomial proportions when the interest lies in showing non-inferiority and/or superiority. Biometrical J. 41: 149–170.
Rosenbaum, P. R., and D. B. Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70: 41–55.
Rosner, B. 1984. Multivariate methods in ophthalmology with application to other paired-data situations. Biometrics 40: 1025–1035.
Rosner, B. 1989. Multivariate methods for clustered binary data with more than one level of nesting. J. Amer. Statist. Assoc. 84: 373–380.
Rotnitzky, A., and N. P. Jewell. 1990. Hypothesis testing of regression parameters in semiparametric generalized linear models for cluster correlated data. Biometrika 77: 485–497.
Routledge, R. D. 1992. Resolving the conflict over Fisher’s exact test. Canad. J. Statist. 20: 201–209.
Routledge, R. D. 1994. Practicing safe statistics with the mid-P*. Canad. J. Statist. 22: 103–110.
Roy, S. N., and M. A. Kastenbaum. 1956. On the hypothesis of no ‘‘interaction’’ in a multiway contingency table. Ann. Math. Statist. 27: 749–757.
Roy, S. N., and S. K. Mitra. 1956. An introduction to some nonparametric generalizations of analysis of variance and multivariate analysis. Biometrika 43: 361–376.
Rudas, T., C. C. Clogg, and B. G. Lindsay. 1994. A new index of fit based on mixture methods for the analysis of contingency tables. J. Roy. Statist. Soc. Ser. B 56: 623–639.
Ryan, L. 1992. Quantitative risk assessment for developmental toxicity. Biometrics 48: 163–174.
Ryan, L. 1995. Comment on article by Liang and Zeger. Statist. Sci. 10: 189–193.
Samuels, M. L. 1993. Simpson’s paradox and related phenomena. J. Amer. Statist. Assoc. 88: 81–88.
Santner, T. J., and M. K. Snell. 1980. Small-sample confidence intervals for p1 − p2 and p1/p2 in 2 × 2 contingency tables. J. Amer. Statist. Assoc. 75: 386–394.
Santner, T. J., and S. Yamagami. 1993. Invariant small sample confidence intervals for the difference of two success probabilities. Commun. Statist. Ser. B 22: 33–59.
Schafer, J. L. 1997. Analysis of Incomplete Multivariate Data. London: Chapman & Hall.
Schluchter, M. D., and K. L. Jackson. 1989. Log-linear analysis of censored survival data with partially observed covariates. J. Amer. Statist. Assoc. 84: 42–52.
Scott, A., and C. Wild. 2001. Case–control studies with complex sampling. Appl. Statist. 50: 389–401.
Seeber, G. 1998. Poisson regression. Pp. 3404–3412 in Encyclopedia of Biostatistics. Chichester, UK: Wiley.
Sekar, C. C., and W. E. Deming. 1949. On a method of estimating birth and death rates and the extent of registration. J. Amer. Statist. Assoc. 44: 101–115.
Self, S. G., and K.-Y. Liang. 1987. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Amer. Statist. Assoc. 82: 605–610.
Sen, P. K., and J. M. Singer. 1993. Large Sample Methods in Statistics: An Introduction with Applications. London: Chapman & Hall.
Shapiro, S. H. 1982. Collapsing contingency tables: A geometric approach. Amer. Statist. 36: 43–46.
Shuster, J., and D. Downing. 1976. Two-way contingency tables for complex sampling schemes. Biometrika 63: 271–276.
Silvapulle, M. J. 1981. On the existence of maximum likelihood estimators for the binomial response models. J. Roy. Statist. Soc. Ser. B 43: 310–313.
Simon, G. 1973. Additivity of information in exponential family probability laws. J. Amer. Statist. Assoc. 68: 478–482.
Simon, G. 1974. Alternative analyses for the singly-ordered contingency table. J. Amer. Statist. Assoc. 69: 971–976.
Simon, G. 1978. Efficacies of measures of association for ordinal contingency tables. J. Amer. Statist. Assoc. 73: 545–551.
Simonoff, J. 1983. A penalty function approach to smoothing large sparse contingency tables. Ann. Statist. 11: 208–218.
Simonoff, J. 1986. Jackknifing and bootstrapping goodness-of-fit statistics in sparse multinomials. J. Amer. Statist. Assoc. 81: 1005–1011.
Simonoff, J. S. 1996. Smoothing Methods in Statistics. New York: Springer-Verlag.
Simonoff, J. S. 1998. Three sides of smoothing: Categorical data smoothing, nonparametric regression, and density estimation. Internat. Statist. Rev. 66: 137–156.
Simpson, E. H. 1949. The measurement of diversity. Nature 163: 699.
Simpson, E. H. 1951. The interpretation of interaction in contingency tables. J. Roy. Statist. Soc. Ser. B 13: 238–241.
Skellam, J. G. 1948. A probability distribution derived from the binomial distribution by regarding the probability of success as variable between the sets of trials. J. Roy. Statist. Soc. Ser. B 10: 257–261.
Skene, A. M., and J. C. Wakefield. 1990. Hierarchical models for multicentre binary response studies. Statist. Medic. 9: 919–929.
Slaton, T. L., W. W. Piegorsch, and S. D. Durham. 2000. Estimation and testing with overdispersed proportions using the beta-logistic regression model of Heckman and Willis. Biometrics 56: 125–133.
Small, K. A. 1987. A discrete choice model for ordered alternatives. Econometrica 55: 409–424.
Smith, K. W. 1976. Table standardization and table shrinking: Aids in the traditional analysis of contingency tables. Social Forces 54: 669–693.
Smith, P. W. F., J. J. Forster, and J. W. McDonald. 1996. Monte Carlo exact tests for square contingency tables. J. Roy. Statist. Soc. Ser. A 159: 309–321.
Snell, E. J. 1964. A scaling procedure for ordered categorical data. Biometrics 20: 592–607.
Somers, R. H. 1962. A new asymmetric measure of association for ordinal variables. Amer. Sociol. Rev. 27: 799–811.
Speed, T. 1998. Iterative proportional fitting. Pp. 2116–2119 in Encyclopedia of Biostatistics. Chichester, UK: Wiley.
Spiegelhalter, D. J., and A. F. M. Smith. 1982. Bayes factors for linear and log-linear models with vague prior information. J. Roy. Statist. Soc. Ser. B 44: 377–387.
Spitzer, R. L., J. Cohen, J. L. Fleiss, and J. Endicott. 1967. Quantification of agreement in psychiatric diagnosis. Arch. Gen. Psychiatry 17: 83–87.
Sprott, D. A. 2000. Statistical Inference in Science. New York: Springer-Verlag.
Stern, S. 1997. Simulation-based estimation. J. Econ. Literature 35: 2006–2039.
Sterne, T. E. 1954. Some remarks on confidence or fiducial limits. Biometrika 41: 275–278.
Stevens, S. S. 1951. Mathematics, measurement, and psychophysics. Pp. 1–49 in Handbook of Experimental Psychology, ed. S. S. Stevens. New York: Wiley.
Stevens, W. L. 1950. Fiducial limits of the parameter of a discontinuous distribution. Biometrika 37: 117–129.
Stigler, S. 1986. The History of Statistics: The Measurement of Uncertainty before 1900. Cambridge, MA: Harvard University Press.
Stigler, S. 1994. Citation patterns in the journals of statistics and probability. Statist. Sci. 9: 94–108.
Stigler, S. 1999. Statistics on the Table. Cambridge, MA: Harvard University Press.
Stiratelli, R., N. Laird, and J. H. Ware. 1984. Random-effects models for serial observations with binary response. Biometrics 40: 1025–1035.
Stokes, M. E., C. S. Davis, and G. G. Koch. 2000. Categorical Data Analysis Using the SAS System, 2nd ed. Cary, NC: SAS Institute.
Strawderman, R. L., and M. T. Wells. 1998. Approximately exact inference for the common odds ratio in several 2 × 2 tables. J. Amer. Statist. Assoc. 93: 1294–1307.
Stuart, A. 1955. A test for homogeneity of the marginal distributions in a two-way classification. Biometrika 42: 412–416.
Stukel, T. A. 1988. Generalized logistic models. J. Amer. Statist. Assoc. 83: 426–431.
Suissa, S., and J. J. Shuster. 1984. Are uniformly most powerful unbiased tests really best? Amer. Statist. 38: 204–206.
Suissa, S., and J. J. Shuster. 1985. Exact unconditional sample sizes for the 2 by 2 binomial trial. J. Roy. Statist. Soc. Ser. A 148: 317–327.
Suissa, S., and J. J. Shuster. 1991. The 2 × 2 matched-pairs trial: Exact unconditional design and analysis. Biometrics 47: 361–372.
Sundberg, R. 1975. Some results about decomposable (or Markov-type) models for multidimensional contingency tables: Distribution of marginals and partitioning of tests. Scand. J. Statist. 2: 71–79.
Tango, T. 1998. Equivalence test and confidence interval for the difference in proportions for the paired-sample design. Statist. Medic. 17: 891–908.
Tanner, M. A., and M. A. Young. 1985. Modelling agreement among raters. J. Amer. Statist. Assoc. 80: 175–180.
Tarone, R. E. 1985. On heterogeneity tests based on efficient scores. Biometrika 72: 91–95.
Tarone, R. E., and J. J. Gart. 1980. On the robustness of combined tests for trends in proportions. J. Amer. Statist. Assoc. 75: 110–116.
Tarone, R. E., J. J. Gart, and W. W. Hauck. 1983. On the asymptotic relative efficiency of certain noniterative estimators of a common relative risk or odds ratio. Biometrika 70: 519–522.
Tavaré, S., and P. M. E. Altham. 1983. Serial dependence of observations leading to contingency tables, and corrections to chi-squared statistics. Biometrika 70: 139–144.
Ten Have, T. R. 1996. A mixed effects model for multivariate ordinal response data including correlated discrete failure times with ordinal responses. Biometrics 52: 473–491.
Ten Have, T. R., and A. R. Localio. 1999. Empirical Bayes estimation of random effects parameters in mixed effects logistic regression models. Biometrics 55: 1022–1029.
Ten Have, T. R., and A. Morabia. 1999. Mixed effects models with bivariate and univariate association parameters for longitudinal bivariate binary response data. Biometrics 55: 85–93.
Ten Have, T. R., and D. H. Uttal. 1994. Subject-specific and population-averaged continuation ratio logit models for multiple discrete time survival profiles. Appl. Statist. 43: 371–384.
Theil, H. 1969. A multinomial extension of the linear logit model. Internat. Econ. Rev. 10: 251–259.
Theil, H. 1970. On the estimation of relationships involving qualitative variables. Amer. J. Sociol. 76: 103–154.
Thompson, R., and R. J. Baker. 1981. Composite link functions in generalized linear models. Appl. Statist. 30: 125–131.
Thompson, W. A. 1977. On the treatment of grouped observations in life studies. Biometrics 33: 463–470.
Thurstone, L. L. 1927. The method of paired comparisons for social values. J. Abnormal Social Psych. 21: 384–400.
Tjur, T. 1982. A connection between Rasch’s item analysis model and a multiplicative Poisson model. Scand. J. Statist. 9: 23–30.
Tocher, K. D. 1950. Extension of the Neyman–Pearson theory of tests to discontinuous variates. Biometrika 37: 130–144.
Toledano, A., and C. Gatsonis. 1996. Ordinal regression methodology for ROC curves derived from correlated data. Statist. Medic. 15: 1807–1826.
Train, K. 1986. Qualitative Choice Analysis: Theory, Econometrics, and an Application. Cambridge, MA: MIT Press.
Tsiatis, A. A. 1980. A note on the goodness-of-fit test for the logistic regression model. Biometrika 67: 250–251.
Tutz, G. 1989. Compound regression models for ordered categorical data. Biometrical J. 31: 259–272.
Tutz, G. 1991. Sequential models in categorical regression. Comput. Statist. Data Anal. 11: 275–295.
Tutz, G., and W. Hennevogl. 1996. Random effects in ordinal regression models. Comput. Statist. Data Anal. 22: 537–557.
Uebersax, J. S. 1993. Statistical modeling of expert ratings on medical treatment appropriateness. J. Amer. Statist. Assoc. 88: 421–427.
Uebersax, J. S., and W. M. Grove. 1990. Latent class analysis of diagnostic agreement. Statist. Medic. 9: 559–572.
Uebersax, J. S., and W. M. Grove. 1993. A latent trait finite mixture model for the analysis of rating agreement. Biometrics 49: 823–835.
van der Heijden, P. G. M., and J. de Leeuw. 1985. Correspondence analysis: A complement to log-linear analysis. Psychometrika 50: 429–447.
van der Heijden, P. G. M., A. de Falguerolles, and J. de Leeuw. 1989. A combined approach to contingency table analysis using correspondence analysis and log-linear analysis. Appl. Statist. 38: 249–292.
Verbeke, G., and E. Lesaffre. 1996. A linear mixed-effects model with heterogeneity in the random-effects population. J. Amer. Statist. Assoc. 91: 217–221.
Verbeke, G., and G. Molenberghs. 2000. Linear Mixed Models for Longitudinal Data. New York: Springer-Verlag.
Wedderburn, R. W. M. 1974. Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika 61: 439–447.
Wedderburn, R. W. M. 1976. On the existence and uniqueness of the maximum likelihood estimates for certain generalized linear models. Biometrika 63: 27–32.
Wermuth, N. 1976. Model search among multiplicative models. Biometrics 32: 253–263.
Wermuth, N. 1987. Parametric collapsibility and the lack of moderating effects in contingency tables with a dichotomous response variable. J. Roy. Statist. Soc. Ser. B 49: 353–364.
Westfall, P. H., and R. D. Wolfinger. 1997. Multiple tests with discrete distributions. Amer. Statist. 51: 3–8.
Westfall, P. H., and S. S. Young. 1993. Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. New York: Wiley.
White, H. 1982. Maximum likelihood estimation of misspecified models. Econometrica 50: 1–26.
White, A. A., J. R. Landis, and M. M. Cooper. 1982. A note on the equivalence of several marginal homogeneity test criteria for categorical data. Internat. Statist. Rev. 50: 27–34.
Whitehead, J. 1993. Sample size calculations for ordered categorical data. Statist. Medic. 12: 2257–2271.
Whittaker, J. 1990. Graphical Models in Applied Multivariate Statistics. New York: Wiley.
Whittaker, J., and M. Aitkin. 1978. A flexible strategy for fitting complex log-linear models. Biometrics 34: 487–495.
Whittemore, A. S. 1978. Collapsibility of multidimensional tables. J. Roy. Statist. Soc. Ser. B 40: 328–340.
Whittemore, A. S. 1981. Sample size for logistic regression with small response probability. J. Amer. Statist. Assoc. 76: 27–32.
Wilks, S. S. 1935. The likelihood test of independence in contingency tables. Ann. Math. Statist. 6: 190–196.
Wilks, S. S. 1938. The large-sample distribution of the likelihood ratio for testing composite hypotheses. Ann. Math. Statist. 9: 60–62.
Williams, D. A. 1975. The analysis of binary responses from toxicological experiments involving reproduction and teratogenicity. Biometrics 31: 949–952.
Williams, D. A. 1982. Extra-binomial variation in logistic linear models. Appl. Statist. 31: 144–148.
Williams, D. A. 1987. Generalized linear model diagnostics using the deviance and single-case deletions. Appl. Statist. 36: 181–191.
Williams, D. A. 1988. Comments on ‘‘The impact of litter effects on dose–response modeling in teratology.’’ Biometrics 44: 305–308.
Williams, E. J. 1952. Use of scores for the analysis of association in contingency tables. Biometrika 39: 274–289.
Williams, O. D., and J. E. Grizzle. 1972. Analysis for contingency tables having ordered response categories. J. Amer. Statist. Assoc. 67: 55–63.
Wilson, E. B. 1927. Probable inference, the law of succession, and statistical inference. J. Amer. Statist. Assoc. 22: 209–212.
Wolfinger, R., and M. O’Connell. 1993. Generalized linear mixed models: A pseudo-likelihood approach. J. Statist. Comput. Simul. 48: 233–243.
Wong, G. Y., and W. M. Mason. 1985. The hierarchical logistic regression model for multilevel analysis. J. Amer. Statist. Assoc. 80: 513–524.
Woolf, B. 1955. On estimating the relation between blood group and disease. Ann. Human Genet. (London) 19: 251–253.
Woolson, R. F., and W. R. Clarke. 1984. Analysis of categorical incomplete longitudinal data. J. Roy. Statist. Soc. Ser. A 147: 87–99.
Wu, C. F. J. 1985. Efficient sequential designs with binary data. J. Amer. Statist. Assoc. 80: 974–984.
Yang, I., and M. P. Becker. 1997. Latent variable modeling of diagnostic accuracy. Biometrics 53: 948–958.
Yates, F. 1934. Contingency tables involving small numbers and the χ² test. J. Roy. Statist. Soc. Suppl. 1: 217–235.
Yates, F. 1948. The analysis of contingency tables with grouping based on quantitative characters. Biometrika 35: 176–181.
Yates, F. 1984. Tests of significance for 2 × 2 contingency tables. J. Roy. Statist. Soc. Ser. A 147: 426–463.
Yee, T. W., and C. J. Wild. 1996. Vector generalized additive models. J. Roy. Statist. Soc. Ser. B 58: 481–493.
Yerushalmy, J. 1947. Statistical problems in assessing methods of medical diagnosis, with special reference to x-ray techniques. Public Health Rep. 62: 1432–1449.
Yule, G. U. 1900. On the association of attributes in statistics. Philos. Trans. Roy. Soc. London Ser. A 194: 257–319.
Yule, G. U. 1903. Notes on the theory of association of attributes in statistics. Biometrika 2: 121–134.
Yule, G. U. 1906. On a property which holds good for all groupings of a normal distribution of frequency for two variables, with application to the study of contingency tables for the inheritance of unmeasured qualities. Proc. Roy. Soc. Ser. A 77: 324–336.
Yule, G. U. 1912. On the methods of measuring association between two attributes. J. Roy. Statist. Soc. 75: 579–642.
Zeger, S. L., and M. R. Karim. 1991. Generalized linear models with random effects: A Gibbs sampling approach. J. Amer. Statist. Assoc. 86: 79–86.
Zeger, S. L., K.-Y. Liang, and P. S. Albert. 1988. Models for longitudinal data: A generalized estimating equation approach. Biometrics 44: 1049–1060.
Zelen, M. 1971. The analysis of several 2 × 2 contingency tables. Biometrika 58: 129–137.
Zelen, M. 1991. Multinomial response models. Comput. Statist. Data Anal. 12: 249–254.
Zellner, A., and P. E. Rossi. 1984. Bayesian analysis of dichotomous quantal response models. J. Economet. 25: 365–393.
Zelterman, D. 1987. Goodness-of-fit tests for large sparse multinomial distributions. J. Amer. Statist. Assoc. 82: 624–629.
Zermelo, E. 1929. Die Berechnung der Turnier-Ergebnisse als ein Maximumproblem der Wahrscheinlichkeitsrechnung. Math. Z. 29: 436–460.
Zhang, H., J. Crowley, H. Sox, and R. Olshen. 1998. Tree-structured statistical methods. Pp. 4561–4573 in Encyclopedia of Biostatistics. Chichester, UK: Wiley.
Zheng, B., and A. Agresti. 2000. Summarizing the predictive power of a generalized linear model. Statist. Medic. 19: 1771–1781.
Zhu, Y., and N. Reid. 1994. Information, ancillarity, and sufficiency in the presence of nuisance parameters. Canad. J. Statist. 22: 111–123.
Categorical Data Analysis, Second Edition. Alan Agresti
Copyright © 2002 John Wiley & Sons, Inc.
ISBN: 0-471-36093-7
Examples Index
Abortion and education, 345
Abortion opinions, 29, 205–206, 441, 486, 504–506, 553
Admissions into Berkeley, 62–63
Admissions into Florida, 223–224, 529
Afterlife, belief in, 302–303
AIDS and AZT use, 184–187
AIDS, measures to deal with, 347
Air pollution and breathing, 377–378
Alcohol, cigarettes, and marijuana use, 322–326, 361–363, 367, 482–483, 528
Alcohol consumption and malformation, 89–90, 158, 179–180, 182
Alcohol and driving, 203
Alligator food choice, 268–274, 304
Alzheimer’s disease and cognitive impairment, 310
Aspirations by income, 107
Aspirin and heart attacks, 37, 46, 71–72
Automobile collisions and seat belts, 40–41, 61, 305–306, 327–329, 331, 349, 361
Baseball complete games, 157–158
Baseball standings, 437–438
Beetle mortality, 247–250
Birth control, teenage, 352
Blood pressure and heart disease, 221–223
Breast cancer, 38, 105, 107
Breathing test and smoking, 307, 377–378
Breathlessness, wheeze, and age, 378
Buchanan vote in Palm Beach County, 156–157
Busing and race, 348
Calves and pneumonia, 25–26, 34
Cancer of larynx and radiation therapy, 107
Cancer remission, 197–199, 261
Capture–recapture, hepatitis, 533
Capture–recapture of snowshoe hares, 511–513, 544–545, 551–552
Carcinoma of uterine cervix, 431–435, 532, 541–544, 549–551
Chlorophyll inheritance, 29
Cholesterol and cereal, 309
Claritin, 109
Clinical trials, 230–236, 507–510
Coffee drinking, 446
Cola drink taste test, 448
Condoms and adolescents, 202
Coronary deaths and smoking, 404
Credit card and income (Italy), 206
Crime and race, 63
Crossover drug trial, 457, 483–484
Death penalty and race, 48–52, 63, 65, 201
Depression, mental, 459–461, 468–469, 506–507
Developmental toxicity study, 290–291, 517–521
Diabetes, case-control study, 418–419
Diagnostic tests, 60, 66
Diarrhea, 255
Draft position in sports, 207
Dumping severity, 308–309
Dysmenorrhea, 483–484, 572
Esophageal cancer, 203
Fish egg hatching, 568–569
Free throws, 105, 160–161
Gambler’s ruin, 489–490
Genetics, 165
Government spending, 349–351, 449, 530–531
Graduate admissions at Florida, 223–224, 529
Graduate admissions at Berkeley, 62–63
Graham Greene, 28
Gun-related deaths, 61
Heart attacks and aspirin use, 37, 46
Heart catheterization and race, 62
Heart disease and blood pressure, 221–223
Heart disease and snoring, 121–123
Heart valve replacement and survival, 385–387
Hepatitis outbreaks, 533
Home team advantage in baseball, 437–438
Homicide victims, number, 561–563, 564–565, 571
Horseshoe crab mating, 126–131, 154–155, 159, 168–170, 173–176, 188–192, 212–216, 570
Income by year, 308
Infant survival, gestation, smoking, and age, 400–401
Insomnia, 462–464, 469, 487, 514–515, 531
Job satisfaction and income, 57–59, 87–88, 287–288, 295, 297, 308
Job satisfaction and race, gender, age, and location, 205
Journal citations, 448
Kyphosis and spinal surgery, 199–200
Labelling index and remission, 197–199, 261
Larry Bird free throws, 105
Leading crowd, 516–517, 532
Leprosy, 239
Life table, 284
Lung cancer and chemotherapy, 306
Lung cancer and smoking, 42, 61, 62, 64
Lung cancer survival, 390–391
Malformation of infants, 89–90, 158, 179–180, 182
Mendel’s theories, 22–23
Mental health and parents’ SES, 381, 383–384
Mental impairment, life events and SES, 279–282
Migration, 423, 427–428
Missing people in London, 202
Mixture for two protozoan genuses, 546
Motor vehicle accident rates, 403
Movie reviewers, 445–446
Multicenter clinical trial, infection cream, 230–235, 508–510
Multicenter clinical trial, fungal infections, 394–395, 530
Multiple sclerosis and neurologist ratings, 447
Murder rates in U.S., 62, 63
Myocardial infarction and aspirin, 37, 46, 71–72
Myocardial infarction and diabetes, 418–419
NCAA graduation rates, 202
Nervousness and Claritin, 109
Obesity, occasion and gender, 487
Occupational aspirations, 206
Occupational status, father and son, 447
Oral contraceptive use, 200
Osteosarcoma, 262–263
Palm Beach County vote for Buchanan, 156–157
Party identification by race and by gender, 105–106, 303
Party identification and protestors, 307
Pathologists’ ratings of carcinoma, 431–435, 532, 541–544, 549–551
Penicillin and rabbits, 259–260
Pig farmer survey, 484–485
Pneumonia infections, 25–26, 34
Poison dose for protozoa, 546–547
Political ideology and party affiliation, 305, 375–377
Pregnancy rates, 567
Presidential approval rating, 409–412
Presidential vote, by state, 503–504, 534
Promotion discrimination, 254–255
Prussian army and mule kicks, 30
Psychiatric patients and prescribed drugs, 106–107
Religious fundamentalism and education, 80, 81–82
Religious services, frequency of attendance, 352
Respiratory illness, age and maternal smoking, 480–481
Respiratory illness in children, 478–479
Satisfaction with housing, 310
Satisfaction with job, 205
Schizophrenia origin, 83–84
Seat belts and injury, 40–41, 61, 305–306, 327–329, 331, 349, 361
Sex, frequency of, 569–570
Sex opinions, 65, 217–219, 368, 371–373, 421, 430, 431, 530
Sexual intercourse, gender and race, 201
Shopping choice, 300
Snowshoe hares, 511–513, 544–545, 551–552
Soccer and arrests, 403
Sore throat in surgery, 204
Space shuttle, 199
Student survey (alcohol, marijuana, cigarettes), 322–326, 361–363, 367, 482–483, 528
Tea drinker, 92, 100
Teenage birth control, 368, 371–373
Tennis rankings, 449
Teratology studies, 151–153
Titanic, 61
Toxicity study, 517–520
Train accidents, 403, 569
UFOs, 106
Vegetarianism, 16–17, 29
Veterinary information sources, 484–485
Voting, proportion by state, 503–504, 534
Categorical Data Analysis, Second Edition. Alan Agresti
Copyright © 2002 John Wiley & Sons, Inc.
ISBN: 0-471-36093-7
Author Index
Abdelbasit, K. M., 196
Agresti, A., 27, 32, 33, 60, 100, 101, 102, 104,
111, 156, 227, 255, 258, 266, 298, 301, 379,
384, 397, 399, 422, 426, 435, 443, 445, 453,
465, 481, 485, 491, 502, 511, 513, 517, 518,
526, 527, 533, 536, 551, 552, 565, 567, 596,
630, 651
Aitchison, J., 265, 301, 465, 561, 613, 625
Aitken, C. G. G., 613
Aitkin, M., 155, 388, 398, 399, 495, 520, 526,
545, 565, 633
Albert, A., 195, 197
Albert, J. H., 607, 609
Albert, P. S., 688
Allison, P. D., 633, 643
Altham, P. M. E., xv, 59, 103, 104, 240, 442,
443, 555, 566, 573, 587, 608
Amemiya, T., 227, 258, 300
Andersen, E. B., 255, 399, 450, 496, 526, 576,
631
Anderson, C. J., 398, 399
Anderson, D. A., 520, 526
Anderson, D. R., 216
Anderson, J. A., 171, 195, 196, 197, 207, 277
Anderson, R. L., 631
Anderson, T. W., 478, 482, 490
Aranda-Ordaz, F. J., 250, 399
Arminger, G., 549
Armitage, P., 104, 181
Ashford, J. R., 258, 379
Asmussen, S., 360
Azzalini, A., 480
Baglivo, J., 346
Baker, R. J., 283
Baker, S. G., 393, 482, 541
Banerjee, C., 443
Baptista, J., 100
Barnard, G. A., 95, 104, 114
Barndorff-Nielsen, O. E., 266
Bartholomew, D. J., 526, 565
Bartlett, M. S., 265, 623–624, 631
Becker, M., 370, 399, 435, 443, 544, 644
Bedrick, E. J., 77, 103
Begg, C. B., 273
Beitler, P. J., 508
Benedetti, J. K., 102, 398
Benichou, J., 66
Benzécri, J. P., 399, 624
Berger, R., 18, 33, 95, 594
Bergsma, W. P., 481
Berkson, J., 80, 104, 166, 197, 612, 624, 631
Berry, G., 104
Berry, S. M., 207
Best, D. J., 103
Bhapkar, V. P., 27, 103, 104, 291, 422, 453,
488, 612, 615, 616, 629
Bickel, P., 63
Biggeri, A., 566
Billingsley, P., 482
Birch, M. W., 255, 263, 295, 298, 336, 339, 340,
341, 346, 369, 392, 576, 585, 627, 631
Bishop, Y. M. M., 347, 360, 366, 452, 482, 526,
576, 582, 587, 591, 594, 627, 629, 631
Blaker, H., 20, 27, 93, 635
Bliss, C., 246, 247, 560, 623
Blyth, C. R., 20, 27, 32, 59
Bock, R. D., 300, 301, 495, 526, 624, 625
Böckenholt, U., 398, 399, 443
Bonney, G. E., 479, 480
Boos, D. D., 95, 467, 481
Booth, J., xv, 104, 223, 397, 443, 523, 525, 567,
630
Bowker, A. H., 424
Box, J. F., 23, 92, 623, 624
Bradley, R. A., 302, 436, 443
Breslow, N., 51, 59, 155, 156, 171, 234, 235,
255, 258, 399, 419, 493, 523, 524, 563, 625,
631
Brier, S. S., 515
Brooks, S. P., 566
Bross, I. D. J., 111
693
694
Brown, L. D., 15, 27, 33, 606
Brown, M. B., 102, 398
Brown, P. J., 613, 614
Brownstone, D., 302, 527
Bull, S. B., 196
Burnham, K. P., 216, 526, 552
Burridge, J., 283
Butler, R., 104, 397, 443, 630
Byar, D. P., 232, 295, 414, 481, 625
Caffo, B., xv, 102, 656
Cameron, A. C., 131, 155, 561, 566, 574
Carey, V., 474
Carroll, R. J., 171, 467
Casella, G., 18, 33, 594
Catalano, P. J., 527
Caussinus, H., 425, 427, 428, 443, 451, 627, 631
Chaloner, K., 196, 609
Chamberlain, G., 419, 420, 526
Chambers, E. A., 258
Chambers, J. M., 633
Chambers, R. L., 527
Chan, I., 104
Chan, J. S. K., 523
Chao, A., 513, 526, 533
Chapman, D. G., 258
Chen, Z., 527, 643, 651
Chib, S., 609
Christensen, R., 196
Chuang, C., 399, 462
Clayton, D. G., 388, 399, 493, 523, 563, 625,
631
Clogg, C. C., 103, 391, 399, 565, 627
Clopper, C. J., 18
Cochran, W. G., 27, 80, 88, 163, 181, 232, 239,
396, 459, 488, 596, 626, 627, 631
Coe, P. R., 101
Cohen, A., 103, 104
Cohen, J., 434, 435, 443
Coleman, J. S., 516, 532
Collett, D., 196, 204
Conaway, M. R., 426, 482, 526, 565
Cook, R. D., 225
Copas, J. B., 156, 257, 442, 616
Corcoran, C., 197, 573
Cormack, R. M., 511, 526, 551
Cornfield, J., 42, 47, 51, 71, 77, 99, 100, 171,
196, 208, 221, 624
Coull, B. A., xv, 27, 32, 33, 513, 518, 526, 533,
552, 567, 655, 662
Cox, C., 266, 282, 286, 576, 587, 641
Cox, D. R., 12, 104, 133, 138, 196, 197, 258,
415, 482, 493, 497, 624, 625, 631
Cramér, H., 112, 576, 587, 625–626
Cressie, N., 27, 112, 258, 396, 612
Croon, M., 481
Crouchley, R., 527
Crowder, M. J., 555, 566
D’Agostino, R. B., Jr., 196
Dalal, S. R., 199
Daniels, M. J., 524, 609
Dardanoni, V., 301
Darroch, J. N., 347, 357, 398, 414, 426, 443,
459, 481, 513, 526, 551, 552, 565, 626, 629,
631
Das Gupta, S., 237
David, H. A., 443
Davis, L. J., 59, 93, 398
Davison, A. C., 156, 594
Dawson, R. B., Jr., 103
Dawson, R. J. M., 61
Day, N. E., 51, 171, 232, 235, 258, 399, 625
de Falguerolles, A., 399, 664, 686
de Leeuw, J., 399
Demétrio, C. G. B., 156, 555, 566
Deming, W. E., 343, 347, 511
Dempster, A. P., 522
Dey, D. K., 609
Diaconis, P., 103, 104
Diggle, P., 471, 625
Dillon, W., 443
Dittrich, R., xv, 443
Dobson, A. J., 155
Doll, R., 42, 62, 64, 404
Dong, J., xv, 59, 614, 616
Donner, A., 172, 196, 258
Doolittle, M. H., 621
Downing, D., 103
Drost, F. C., 112, 258, 595
Ducharme, G. R., 398
Duncan, D. B., 195, 197, 277, 301, 624
Dupont, W. D., 93
Dyke, G. V., 624
Edwardes, M. D., 301
Edwards, A. W. F., 59
Edwards, D., 360, 398
Efron, B., 103, 146, 196, 227, 258, 526, 605,
610
Ekholm, A., 156, 481
Eliason, S. R., 103, 391
Escoufier, Y., 383, 399
Espeland, M. A., 544, 571
Everitt, B. S., 633
Fahrmeir, L., 155, 300, 615
Farewell, V. T., 171, 301, 527, 625
Fay, R., 103, 482, 594
Fechner, G., 623
Ferguson, T. S., 605, 617
Fienberg, S. E., 344, 347, 392, 438, 443, 513,
526, 610, 615, 616, 626, 627, 629, 631
Finney, D., 151, 258, 556, 623
Firth, D., 155, 156, 196, 330, 467, 481, 482
Fischer, G. H., 526
Fisher, R. A., 12, 22, 23, 29, 51, 79, 91, 92, 95,
99, 104, 114, 146, 156, 162, 237, 247, 560,
576, 589, 622–624, 625, 626, 628, 631
Fitzmaurice, G. M., 103, 466, 474, 481, 482,
649
Fitzpatrick, S., 35
Fleiss, J. L., 104, 110, 111, 242, 258, 347, 435,
436, 443
Follman, D. A., 546, 547, 548, 566
Forcina, A., 301
Forster, J. J., 104, 346, 397, 482, 616, 630
Forthofer, R. N., 378
Fowlkes, E. B., 199, 226, 257
Francom, S., 462
Freedman, D., 23, 63
Freeman, D. H. Jr., 399
Freeman, G. H., 97
Freeman, M. F., 112
Freidlin, B., 104
Friendly, M., 59, 399, 633
Frome, E. L., 155, 399
Fuchs, C., 482
Gabriel, K. R., 263, 399
Gaddum, J. H., 623
Gail, M. H., 104, 625
Gart, J. J., 70, 71, 77, 102, 104, 197, 255, 258,
397, 442
Gaskins, R. A., 614
Gastwirth, J., 104, 197
Gatsonis, C., xv, 230, 481, 524, 609
Gelfand, A. E., 609
Genter, F. C., 301
Geyer, C., 678
Ghosh, B. K., 27
Ghosh, M., 442, 609
Gibbons, R. D., 520
Gilbert, G. N., 217
Gill, J., 155
Gilmour, A. R., 526
Gilula, Z., 83, 382, 384
Gini, C., 329
Glass, P. V., 447
Gleser, L. J., 103
Glonek, G. F. V., 393, 466
Godambe, V. P., 104, 482
Goetghebeur, E., 482
Gokhale, D. V., 112, 612, 616
Goldstein, H., 520, 524
Good, I. J., 24, 60, 104, 605, 607, 608, 612, 614,
616, 626, 630
Goodman, L. A., 35, 59, 68, 69, 83, 84, 102,
110, 213, 217, 228, 340, 346, 365, 366, 369,
370, 374, 379, 380, 381, 382, 383, 384, 397,
398, 399, 406, 407, 408, 425, 428, 431, 443,
478, 482, 490, 516, 527, 540, 565, 566, 572,
621, 622, 627, 628, 629, 631
Gould, S. J., 544
Gourieroux, C., 467, 482
Graubard, B. I., 89, 103
Gray, R., 273
Green, P. J., 156
Greenacre, M. J., 384, 399
Greene, G., 28
Greenland, S., 96, 234, 258
Greenwood, M., 566
Greenwood, P. E., 27
Grego, J., 686
Grizzle, J. E., 291, 301, 457, 601, 615, 624, 629,
631
Gross, S. T., 197
Grove, W. M., 544
Gueorguieva, R., xv, 527, 670
Gupta, A. K., 607
Haber, M., 95, 96, 103, 291, 465
Haberman, S. J., 69, 81, 83, 113, 195, 224, 258,
268, 300, 346, 347, 349, 364, 367, 369, 374,
380, 382, 392, 393, 396, 399, 408, 440, 526,
540, 565, 572, 576, 589, 591, 592, 595, 627,
629, 631
Hagenaars, J., 565
Hald, A., 623
Haldane, J. B. S., 70, 103, 196
Hall, P., 616
Halton, J. H., 97
Hamada, M., 301
Handelman, S. L., 544, 571
Hanfelt, J., 566, 571
Hansen, L. P., 467, 482
Harkness, W. L., 258
Harrell, F. E., 229, 282, 301
Hartzel, J., 511, 513, 514, 516, 534, 651
Haslett, S., 394
Hastie, T., 153, 199, 301, 633
Hatzinger, R., 565
Hauck, W. W., 172, 234, 258
Haynam, G. F., 243
Heagerty, P., 481, 527, 548
Hedeker, D., 520, 653
Heinen, T., 565
Hennevogl, W., 513
Henry, N. W., 565
Heyde, C. C., 156, 481
Hill, A. B., 42, 64, 111, 404
Hinde, J., 155, 156, 555, 563, 566
Hinkley, D., 12, 104, 133, 138, 146, 156, 594
Hirji, K. F., 104, 258, 625
Hirotsu, C., 406
Hirschfeld, H., 399
Hoadley, B., 199
Hobert, J., xv, 523, 525, 630
Hodges, J. L., 197
Hoem, J. M., 347, 399
Holford, T. R., 389, 390, 399
Holland, P. W., 610, 615, 629, 631
Hollander, M., 443
Holt, D., 103
Holtbrügge, 306
Hook, E. B., 526
Hosmer, D. W., 177, 196, 197, 257, 258
Hout, M., 65, 428, 443
Howard, J. V., 104, 608
Hsieh, F. Y., 242, 243
Hsu, J. S. J., 608
Hwang, J. T. G., 104
Imrey, P. B., 346, 443, 615
Ireland, C. T., 616
Irwin, J., 91
Jennison, C., 103
Jewell, N. P., 467, 496, 566
Johnson, B. M., 605
Johnson, N. L., 566, 574
Johnson, W., 257
Jones, B., 442, 484, 536
Jones, M. P., 258
Jørgensen, B., 136, 155, 156, 266, 470
Kalbfleisch, J. D., 482
Karim, M. R., 524, 609
Kastenbaum, M. A., 627
Kastner, C., 466, 649
Katzenbeisser, W., xv, 672
Kauerman, G., 156, 467
Kelderman, H., 565
Kempthorne, O., 96, 104
Kendall, M. G., 27, 56, 60, 68, 399, 631
Kenward, M. G., 442, 475, 484, 536
Khamis, H. J., xv, 332, 443
Kim, D., 104, 255, 298, 379, 397
King, G., 527
Knott, M., 565
Knuiman, M. W., 616
Koch, G. G., 27, 302, 436, 447, 459, 460, 481,
532, 601, 615, 616, 629, 631, 670, 673, 674,
675
Koehler, K., 27, 103, 396, 397
Koopman, P. A. R., 77
Korn, E. L., 89, 103
Kraemer, H. C., 443
Kreiner, S., xv, 358, 398
Kruskal, W. H., 59, 60, 68, 69, 102, 110, 621,
631
Ku, H. H., 616
Kuha, J., 330, 347
Kuk, A. Y. C., 523
Kullback, S., 112, 399, 612, 616
Kuo, L., 527, 643, 651
Kupper, L. L., 566
Läärä, E., 301, 313
Lachin, J., 258
Laird, N. M., 385, 386, 389, 466, 482, 522, 541,
609, 610, 649
Lambert, D., 546, 547, 548, 566
Lancaster, H., 20, 27, 83, 84, 113, 399, 626
Landis, J. R., 111, 295, 297, 301, 302, 436, 447,
462, 508, 532
Landrum, M., 609
Landwehr, J. M., 226, 257
Lang, J. B., xv, 301, 340, 399, 465, 481, 537,
541, 551, 643, 644, 649, 655, 675, 678
Laplace, P. S., 15
Larntz, K., 196, 396, 397, 438, 443, 609
Larsen, K., 498
Larson, M. G., 399
Lauritzen, S. L., 346, 398, 399
LaVange, L. M., 103, 197, 481
Lawal, H. B., 396
Lawless, J. F., 155, 482, 560, 561, 566
Lazarsfeld, P. F., 565
Lee, E., 198
Lee, S. K., 596
Lee, Y., 559, 566, 574
Lefkopoulou, M., 566
Lehmann, E., 67, 104, 263, 406
Lehnen, R. G., 378
Lemeshow, S., 177, 196, 197, 257, 258
Leonard, T., 608, 609
Lesaffre, E., 258, 300, 466, 522, 545
Lesperance, M. L., 526, 548
Liang, K.-Y., 104, 258, 442, 467, 469, 471, 473,
481, 482, 525, 556, 566, 571, 573, 625, 631
Lin, X., 524, 525
Lindley, D. V., 609, 630
Lindsay, B., 494, 545, 549
Lindsey, J. K., 400, 467, 566, 573
Lipsitz, S., 103, 291, 422, 456, 469, 473, 474,
481, 645
Little, R. J., 114, 346, 347, 475, 476, 482
Liu, I., xv, 485, 655, 671
Liu, Q., 510, 522
Lloyd, C., 93, 104, 156, 615
Localio, A. R., 526
Longford, N. T., 520
Loughin, T., 484
Louis, T., 541
Luce, R., 299, 302, 443
Madansky, A., 422, 456
Maddala, G. S., 258, 264, 302
Magidson, J., 653
Magnus, J. R., 602
Mansmann, U., 104
Mantel, N., 87, 93, 104, 171, 197, 209, 230, 231,
232, 234, 238, 260, 295, 296, 297, 300, 379,
414, 481, 612, 618, 624, 625, 627, 631
Martín Andrés, A., 104
Mason, W. M., 524, 609
Matthews, J. N. S., 301, 313, 443
Maxwell, A. E., 631
McArdle, J. J., 202
McCloud, P. I., 443
McCullagh, P., 132, 155, 156, 257, 276, 277,
283, 286, 290, 301, 308, 312, 340, 378, 397,
431, 443, 466, 471, 481, 556, 566, 625, 631
McCulloch, C. E., 522, 523, 524, 527, 548, 555,
623, 625
McDonald, J. W., 667, 684
McFadden, D., 228, 264, 299, 300, 302, 624, 631
McNemar, Q., 411
Mee, R. W., 77
Meeden, G., 605
Mehta, C. R., 98, 104, 254, 255, 258, 298, 397,
625, 630
Mendel, G., 22
Mendenhall, W. M., 107
Mersch, G., 400
Michailidis, G., 399
Miettinen, O. S., 77, 442
Miller, M. E., 481, 604
Min, Y., 100, 101
Minkin, S., 196
Mirkin, B., 112
Mitra, S. K., 79, 258, 346, 591, 627, 631
Mittal, Y., 60
Molenaar, I. W., 526
Molenberghs, G., 258, 466, 482
Moore, D. F., 152, 556, 566
Moore, D. S., 27, 103
Morabia, A., 527
Morgan, B. J. T., 196, 207
Morgan, W. M., 346
Morris, C., 526, 605, 610
Mosimann, J. E., 566
Mosteller, F., 345, 412, 443, 627, 629, 631
Nair, V. N., 103, 301
Nam, J., 77
Natarajan, R., xv, 481, 502, 524
Nelder, J., 116, 132, 148, 149, 155, 156, 257,
290, 301, 312, 340, 378, 559, 566, 574, 625,
631
Nerlove, M., 300, 624
Neudecker, H., 602
Neuhaus, J. M., 417, 494, 496, 499, 502, 526,
547, 548, 566
Newcombe, R., 27, 109, 110
Neyman, J., 18, 112, 611, 612, 616, 626, 631
Nikulin, M. S., 27
Normand, S.-L., 609
Norusis, M. J., 633
Nurminen, M., 77
O’Brien, P. C., 207
O’Brien, R. G., 244, 258, 640
Ochi, Y., 258
Odoroff, C., 661
O’Gorman, T. W., 596
Olivier, D., 385, 386, 389
Overton, W. S., 526, 552
Pagano, M., 61, 657
Paik, M., 59
Palmgren, J., 156, 340
Park, T., 482
Parr, W. C., 594
Parzen, E., 34
Patefield, W. M., 104
Patel, N. R., 98, 258, 625, 630
Patnaik, P. B., 258
Paul, S. R., 566
Pearson, E. S., 18, 104, 626, 631
Pearson, K., 22, 79, 112, 399, 576, 589, 620,
621, 622, 628, 631
Peduzzi, P., 212
Pendergast, J. F., xv, 502
Pepe, M. S., 258
Perlman, M., 237
Peters, D., 104, 630
Peterson, B., 282, 301
Peto, R., 62
Piccarreta, R., 206
Pierce, D. A., 104, 143, 156, 497, 502, 522, 526,
630
Pike, M. C., 100
Piegorsch, W. W., 684
Plackett, R. L., 103, 196, 399, 623, 627, 631
Podgor, M. J., 197
Poisson, S.-D., 7
Pratt, J. W., 283
Pregibon, D., 143, 156, 197, 225, 257, 258, 566,
638
Prentice, R. L., 171, 196, 258, 283, 399, 482,
555, 566, 625
Presnell, B., 156
Press, S. J., 196, 300, 624
Pyke, R., 171, 625
Qaqish, B., 676
Qu, A., 482
Quetelet, A., 68
Quine, M. P., 29
Rabe-Hesketh, S., 527, 633
Radelet, M., 48, 65
Raftery, A., 257
Rao, C. R., 10, 12, 576, 582, 585, 587, 589, 591,
596, 616, 626, 631
Rao, J. N. K., 103, 515
Rasbash, J., 520, 524
Rasch, G., 399, 415, 493, 495, 624
Rayner, J. C. W., 103
Read, T. R. C., 27, 112, 258, 396, 612
Regal, R. R., 526
Reid, N., 96
Rice, W. R., 104
Ripley, B., 633
Ritov, Y., 384
Robins, J., 234, 258, 475
Röhmel, J., 104
Rosenbaum, P. R., 196
Rosner, B., 566
Rossi, P. E., 609
Rotnitzky, A., 467
Routledge, R. D., 104, 607
Roy, S. N., 79, 346, 627, 631
Rubin, D., 196, 475, 482
Rudas, T., 481, 565
Rundell, P. W. K., 613, 614
Ryan, L., 290, 527, 566
Sackrowitz, H. B., 103, 104
Samuels, M. L., 60
Santner, T. J., 101
Schafer, D. W., 143, 156
Schafer, J. L., 103, 347, 482
Schluchter, M. D., 399
Schumacher, M., 306
Scott, A. J., 35, 103, 197
Searle, S., 527, 555
Seeber, G., 155
Sekar, C. C., 511
Self, S. G., 258, 525
Sen, P. K., 594
Seneta, E., 29
Shapiro, S. H., 398
Shen, S. M., 265
Shihadeh, E. S., 399
Shuster, J. J., 95, 103, 104, 442
Silva Mato, A., 104
Silvapulle, M. J., 195
Silvey, S. D., 301, 465, 625
Simon, G., 197, 301, 374, 399, 612, 624, 629
Simonoff, J., 594, 614, 615, 616
Simpson, E. H., 51, 60, 398, 596, 621
Singer, J. M., 594
Skellam, J. G., 566
Skene, A. M., 502, 609
Skinner, C., 347
Skrondal, A., 527
Slaton, T. L., 566
Small, K. A., xv, 302
Smith, A. F. M., 609, 616
Smith, K. W., 345
Smith, P. W. F., 443, 482, 616
Snell, E. J., 196, 301, 624
Snell, M. K., 101
Sobel, M. E., 672
Somers, R. H., 68
Somes, G. W., 488
Speed, T., 347, 616
Spiegelhalter, D., 616
Spitzer, R. L., 435
Sprott, D. A., 95, 114, 453
Starmer, C. F., 601, 629
Stasinopoulos, M., 526
Stern, S., 302
Sterne, T. E., 20
Stevens, S. S., 26
Stigler, S., 22, 443, 448, 623, 631
Still, H. A., 20, 27, 32
Stiratelli, R., 482, 526
Stokes, M. E., xv, 282, 302, 399, 476, 482, 633,
640, 649
Strawderman, R. L., 104, 630
Stuart, A., 27, 56, 399, 422
Stukel, T. A., 196, 250
Sturmfels, B., 104
Suissa, S., 95, 104, 442
Sundberg, R., 346, 366
Tamhane, A. C., 101
Tango, T., 411
Tanner, M. A., 443
Tarone, R. E., 197, 234, 258
Tavaré, S., 103
Ten Have, T. R., xv, 517, 526, 527
Theil, H., 57, 228, 300, 624
Thomas, D. R., 103, 515
Thompson, R., 283
Thompson, W. A., 399
Thurstone, L. L., 443
Tibshirani, R., 153, 199, 301
Titterington, D. M., 616
Tjur, T., 426, 552, 553
Tocher, K. D., 94
Toledano, A., 230, 481
Tolley, H. D., 594
Train, K., 302, 527
Trivedi, P. K., 131, 155, 561, 566, 574
Tsiatis, A. A., 152, 197, 556, 566
Tukey, J., 112
Turing, A., 631
Turnbull, B. W., 103
Tutz, G., 155, 156, 289, 290, 300, 301, 513, 615
Uebersax, J. S., 544
Uttal, D. H., 517
van der Heijden, P. G., 399
Venables, W. N., 633
Verbeke, G., 482, 545
Vermunt, J. K., 399, 653
Wainer, H., 63
Wakefield, J. C., 502, 609
Wald, A., 11, 172
Walker, S. H., 195, 197, 277, 301, 624
Walley, P., 616
Walsh, S. J., 104
Wardrop, R. L., 105
Ware, J. H., 478, 480, 482
Watson, G. S., 79, 103, 576, 590, 627, 631
Wedderburn, R. W. M., 116, 148, 149, 150,
155, 156, 195, 258, 265, 266, 466, 470, 625,
631
Weisberg, S., 226
Wells, M. T., 104, 630
Wermuth, N., 398, 399, 401
Westfall, P. H., 214, 360
White, A. A., 481
White, H., 467, 471, 482
Whitehead, J., 301
Whittaker, J., 346, 358, 398, 399
Whittemore, A. S., 243, 398
Wild, C., 103, 197, 301
Wilks, S. S., 12
Williams, D. A., 156, 225, 397, 555, 566, 653
Williams, E. J., 103, 399
Williams, O. D., 291, 301, 624
Wilson, E. B., 16
Wilson, J., 103
Winner, L., 445
Wolfinger, R. D., 214, 360, 527
Wong, G. Y., 524, 609
Woolf, B., 71
Woolson, R. F., 487, 596
Wu, C. F. J., 196, 301
Wu, M., 346, 347
Yamagami, S., 101
Yang, I., 544
Yang, M., 104
Yates, F., 91, 93, 96, 98, 103, 104, 114, 239, 624
Yee, T. W., 301
Yerushalmy, J., 38
Young, S. S., 214, 360
Yule, G. U., 44, 53, 59, 68, 110, 346, 406, 566, 620–621, 628, 631
Zeger, S. L., 442, 467, 469, 471, 473, 481, 482,
499, 500, 524, 548, 609, 625, 631
Zelen, M., 255, 625, 631
Zellner, A., 609
Zelterman, D., 397
Zermelo, E., 443
Zhang, H., 257
Zhao, L., 482
Zheng, B., 227, 258, 266
Zhu, Y., 96
Zweiful, J. R., 70, 397
Subject Index
Adjacent categories logit, 286–288, 370–371, 374–376, 642
Adjusted residual, see Standardized Pearson residual
Agreement, 431–436, 443, 453–454, 541–544, 549–551
AIC, 216–217, 324
Alternating logistic regressions, 474
Ancillary statistic, 104
Arc sine transformation, 596
Armitage test, see Cochran–Armitage trend test
Association, see Measures of association
Association graphs, 357–360, 539
Association models, 373–381, 399
Asymptotic covariance matrix, 137–138, 577–581, 594
Asymptotic normality, 73–77, 577–581
Attributable risk, 66, 110
Backward elimination, 214–216
BAN, 611, 626
Baseline-category logits, 267–274, 300, 310–311, 426, 515, 640–643
Bayesian inference, 604–610, 616, 630–631
binomial parameters, 605–607, 617
generalized linear mixed models, 524, 609
kernel smoothing, connection, 614
multinomial proportions, 607–610, 618
Bernoulli distribution, 117
Beta-binomial distribution, 30, 553–559, 566, 572, 573, 653
Beta distribution, 554, 572, 605–606
Bias, 70, 85, 196, 450, 496, 524, 548, 595, 615
BIC, 257
Binary data
correlated, 409–420, 455–482, 491–527, 538–559
generalized linear models, 120–125, 137, 140
matched pairs, 409–420
Binomial distribution, 5–6
admissible estimator, 605
confidence interval for proportion, 15–17, 32–33, 635
exact inference, 18–20
exponential family, 117, 134
GLM likelihood equations, 137
likelihood function, 9
matched pairs, 409–420
moment generating function, 31
overdispersion, 8, 30
tests for proportion, 14–15
variance stabilizing, 596
Binomial models
deviance, 140
GLMs, 120–125
likelihood equations, 137, 265
overdispersion, 151–153, 291, 573, 653
Birch's results, 336
Bootstrap, 75, 156, 525, 531, 594
Bradley–Terry model, 436–439, 443, 647
Breslow–Day test, 258
Calibration, 207
Canonical correlation, 382, 399, 408, 624
Canonical link, 117, 148–149, 193, 257, 472, 496
Capture–recapture, 511–513, 526, 544–545, 551–552
CART, 257
Case-control study, 42–43, 46–47, 59, 233
and logistic regression, 170–171, 418–420, 625
several controls per case, 233, 442
Categorical data analysis, 1688
Causal diagram, 217–218
Censoring, 386, 400
Centering, 167, 175
Chi-squared distribution
df, 12, 79, 175, 589
mgf, 35
moments, 27
noncentral, 237, 258, 408, 591–592, 595, 597
reproductive property, 82
table of percentage points, 654
Chi-squared statistics
likelihood-ratio, see Likelihood-ratio
statistic
partitioning, see Partitioning
Pearson, see Pearson chi-squared statistic
Classification methods, 196, 228–230, 257, 258
Clinical trials, 42, 230–236, 507–510
Clopper–Pearson confidence interval, 18–20, 33, 606
Cluster sampling, 103, 481, 515
Clustered data, 455, 491–527, 556–558
Cochran, W. G., 626
Cochran–Armitage trend test, 181–182, 197, 237, 253, 640
Cochran–Mantel–Haenszel test, 231–234, 639
exact test, 254, 298
and marginal homogeneity, 413, 458–459, 481
and McNemar test, 413–414
matched pairs, 413
nominal and ordinal cases, 295–298, 302, 379, 642–643
score test for logit model, 232, 297–298
Cochran's Q, 459, 488
Collapsibility, 358–360, 398
Complementary log-log model
binary response, 248–250, 640
ordinal response, 283–284, 301, 313, 527, 641
Computer software, see Software
Concentration coefficient, 69
Concordance index, 229
Concordant pair, 57–59
Conditional distribution, 37, 48
Conditional independence, 52
I × J × K tables, 293–298, 302, 318–319, 325
logit models, 183–184, 230–234, 263, 293–295, 359–360
versus marginal independence, 53, 365–366
power and sample size, 244–245
small-sample test, 254, 298
Conditional inference, 91–101, 250–257, 416–420, 495–496, 630
Conditional logistic regression, 250–258, 414–420, 495–496, 526, 625, 640, 645
Conditional logit, 299
Conditional ML, 100, 417, 494–496, 526
Conditional symmetry, 431, 452
Confidence intervals
likelihood-based, 13, 77–78
tail method, 18, 99
Wald, 13
score, 15–16, 77
Confounding, 47–51, 230
Conjugate mixture model, 558–559
Constraint equations, 612
Constraints, parameter, 178–179, 317, 352–353
Contingency coefficient, 112, 620
Contingency table, 36, 47–54
Continuation-ratio logit, 289–291, 301, 517–520
Continuity correction, 27
Continuous proportions, 265–266, 624
Contrasts, 82, 317, 340, 344, 603, 636, 639
Correlation, 87, 226, 296, 634
Correlation models, 381–384, 399, 408
Correspondence analysis, 382–384, 399, 624, 644
Cramér's V², 112
Credit scoring, 165, 263, 631
Cross-classification table, see Contingency table
Crossover study, 444, 457, 483, 498, 501, 572
Cross-product ratio, 44
Cross validation, 266
Cumulant function, 155
Cumulative link models, 282–286, 313
Cumulative logit models, 274–282, 301, 624, 641
dispersion effects, 285–286
marginal models, 420–421, 462–463, 469
proportional odds property, 275–276, 282
random effects, 514–515, 536
score test and ranks, 301
Cumulative odds ratio, 67, 276
Cumulative probit model, 278, 283, 301, 312, 624–625, 641
Data mining, 219, 631
Decomposable model, 346, 360
Degrees of freedom, 12, 79, 175, 589, 622
Delta method, 73–77, 577–581, 594
Dependent proportions, 410–412
Design, 196, 609
Design matrix, see Model matrix
Deviance, 118–119, 139–142
grouped vs. ungrouped binary data, 208
likelihood-ratio tests, 141–142, 186–187, 363–365
residual, 142, 220, 638
R-squared measures, 228
Diagnostics, 142–143, 219–230, 257–258, 366–367
Diagonals-parameter symmetry, 443
Difference of proportions, 43
collapsibility, 398
dependent, 410–412, 645
homogeneity, 258
large-sample confidence interval, 72, 77, 102, 110, 410–411
sample size determination, 240–242, 258
small-sample confidence interval, 101
z test and Pearson statistic, 111
Directed alternatives, 88–90, 236–239, 373
Dirichlet distribution, 607, 610
Discordant pair, 57–59
Discrete choice models, 298–300, 302, 527, 624
Discreteness and conservatism, 18–20, 93–94, 257
Discriminant analysis, 196
Dispersion parameters, 131, 133, 285–286, 560
Dissimilarity index, 329–330
Diversity index, 596
Dummy variables, 178–179
Ecological inference, 527
Effect modifier, 54
EM algorithm, 522–523, 540–541
Empirical Bayes, 526, 610
Empirical logit, 168
Empty cells, 392
Entropy, 57, 613
Estimated expected frequencies, 25, 78, 315
Estimating equations, 470, 481–482
Exact confidence intervals, 18–20, 99–101, 255
Exact tests
binomial parameter, 18, 412
conditional independence, 254, 298
Fisher, 91–97, 253
I × J tables, 97–98, 104
logistic regression, 251–257
matched pairs, 412
ordinal variables, 114
StatXact and LogXact, 633, 635, 640, 643
trend in proportions, 98
unconditional test, 94–96, 104, 114
Expected frequencies, 22, 25
Exponential dispersion family, 133, 310
Exponential distribution, 313, 388
Exponential family, 116, 133
Extreme-value distribution, 249–250, 264
Fisher, R. A., 22–23, 622–624, 626, 628
df argument with Pearson, 622–623
variance test, 163
Fisher scoring, 145–149, 156, 247, 623, 625
Fisher's exact test, 91–97, 99, 253, 623
and Bayes approach, 608
conservatism, 93–94
controversy, 95–96, 104
software, 635
UMPU, 104
versus unconditional test, 95–96, 104, 114
Fitted values, 121
asymptotic distribution, 194, 341, 585–586, 593
Freeman–Tukey chi-squared, 112, 594
G² statistic, see Likelihood-ratio statistic
G²(M0 | M1), 187, 363
Gamma, 58–59, 88, 110, 596–597
Gamma distribution, 559–560, 574
Gauss–Hermite quadrature, 521–522, 651
Generalized estimating equations (GEE), 466–475, 481–482, 501, 557–558, 649
Generalized additive models, 153–155, 156, 301, 630, 636
Generalized linear mixed model (GLMM), 417, 492
Bayesian approach, 524, 609
binary data, 492–527
correlation nonnegative, 497, 564
count data, 563–565
heterogeneity, interpretation, 497–498
marginal effects, comparison, 498–502, 535, 563–564
marginal model, corresponding, 527, 563–564, 574–575
misspecification, 547–548
model fitting, 520–526, 527
multinomial data, 513–516
software, 649–653
Generalized linear model (GLM), 116–119, 625
canonical link, 117, 148–149, 193, 257, 472, 496
covariance matrix, 137–138
exponential dispersion family, 133
inference using, 139–143
likelihood equations, 135–136, 148
model fitting, 143–149
moments, 132–134
multivariate, 274
variance function, 136
Generalized loglinear model, 332–333, 464, 481, 602
Gini concentration index, 68
Goodman, L. A., 627–629
Goodman and Kruskal tau and lambda, 68–69
Goodness-of-fit statistics
continuous explanatory variables, 176–177, 197
deviance for GLMs, 118–119, 139–142
likelihood-ratio test, 141–142, 186–187, 363–365
logistic regression, 174–177, 186–187, 208
loglinear models, 324
mixture summary, 565
Pearson chi-squared, 22–26
uninformative for ungrouped data, 162
Graphical models, 357–360, 398, 629
Grouped versus ungrouped data, 140–141, 162, 174–177, 208, 228
GSK method, 601
Gumbel distribution, 249
Hat matrix, 143, 225, 589
Hazard function, 301, 388, 399–400
Heterogeneity, 130, 235–236, 291, 377, 492–493, 497, 499–500, 507–510, 538
Hierarchical models, 316, 520, 609
History, 619–631
Homogeneity of odds ratios, 54, 183, 234–236, 255, 258
Homogeneous association, 54, 320, 377, 407, 623
Hosmer–Lemeshow statistic, 177, 639
Hypergeometric distribution, 91
and binomial, 113
moments, 103, 232
multiple hypergeometric, 97
noncentral, 99
Identity link, 117, 120, 124, 128, 385, 387, 562, 565
Incomplete table, 392
Independence
conditional, see Conditional independence
estimated expected frequencies, 78
exact test, see Fisher's exact test
from irrelevant alternatives, 299, 302
joint, 318, 319
likelihood-ratio test, 79
loglinear model, 132, 314–315, 336, 352
mutual, 318–319, 353, 354
Pearson test, 78–79
quasi, 426–428, 432–433, 443
residuals, 81, 111–112
smoothing using, 85–86
two-way table, 38–39, 78–79, 111
variance of proportion estimator, 113
Independent multinomial sampling, 40, 67, 339–340
Influence diagnostics, 224–226, 638
Information matrix, 9
GLM, 138, 145–146
logistic regression, 193
loglinear model, 339
observed versus expected, 145–146, 247
Interaction, 210
and odds ratios, 54
three-factor, 320
uniform, 407
Isotropy, 406
Item response models, 495
Iterative proportional fitting, 343–345, 347
Iterative reweighted least squares, 147, 156, 195, 343
Joint independence, 318, 319
Kappa, 434–435, 443, 453, 645
Kendall's tau and tau-b, 60, 68
Kernel smoothing, 613–615, 616
Lambda (measure of association), 69
Laplace approximation, 523
Latent class models, 538–545, 565, 571–572, 653
Latent variable, 277–278, 399
LD50, 167
Leverage, 143, 589
Likelihood function, 9
generalized linear model, 133, 135
marginal likelihood, 521
Likelihood-ratio statistic, 11–12, 24
asymptotic chi-squared distribution, 590–591
and confidence intervals, 13, 16, 17, 78, 638
difference of deviances, 141–142, 187, 363–364
independence, 79
minimized by ML estimate, 590–591
monotone property, 141
nested models, 363–365
noncentrality, 243
nonnegative, 34, 141
partitioning, 82–84, 363–365, 399, 405
Pearson statistic, comparison, 24, 80, 364
as power divergence statistic, 112
sparse data, 80, 395–397
Linear-by-linear association, 369–373, 643–644
and bivariate normal, 370, 399
and correlation model, 408
heterogeneous, 377
homogeneous, 377–379, 407
score statistic, 406
Linear logit model, 180–182
directed inference, 236–237
efficiency, 197
exact test, 253
likelihood equations, 209
and trend test, 197, 237–239
Linear predictor, 116
Linear probability model, 120–121, 291
and trend test, 181–182
Link function, 116, 135
canonical, 117, 148–149, 193, 257, 472, 496
cumulative, 282–286, 301
goodness of link, 257–258, 301
inverse cdf, 124–125, 163, 282
Litter effects, 151–153, 291, 556–558, 566
Local odds ratio, 55, 312, 369–370
asymptotic covariances, 597
conditional, 321–322, 377
exponential family for multinomial, 310–311
Logistic distribution, 125, 162, 197, 246
Logistic-normal distribution, 265
Logistic-normal model, 496–513, 516–527
Logistic regression, 121–125, 165–196
case-control studies, 170–171, 418–420, 625
categorical predictors, 177–186
conditional, 250–258, 414–420, 495–496, 526, 625, 640, 645
conditional independence, 183–184, 231
covariance matrix, 193–194
design, 196, 609
diagnostics, 219–230, 257–258
existence of ML estimates, 195–196, 394–395
fitting model, 192–196
generalized linear model, 117, 121–125
goodness-of-fit, 174–177, 186–187, 197
inference, 172–177
interpretation, 166–171, 191
likelihood equations, 192–193
linear logit model, see Linear logit model
loglinear models, connection, 315, 330–332, 367, 593–594
marginal models, 414, 456–476
matched pairs, 414–420, 493–496
model-building, 211–225
multiple predictors, 182–195
nonparametric mixture, 546–547, 653
normal distribution connection, 171, 207–208
and odds ratio, 124, 166
perfect discrimination, 195–196
probability estimators, 166–167, 191, 194
random effects, 496–513, 516–527
regressive logistic model, 479–481
repeated binary response, 414–420, 456–476, 496–513, 516–527
repeated multinomial response, 461–464, 469, 474–475, 513–516
residuals, 219–223
sample size determination, 242–243
sample size and number of predictors, 212
software, 637–643, 645, 649–651
Logit transform, 75, 117, 624
bias, 196
confidence interval, 109
in logistic regression, 123
standard error, 74–75
Wald test of proportion, 208–209
Loglinear models, 117–118, 314–347, 627–629
covariance matrix, 138–139, 338, 341, 593, 598
existence of estimates, 341, 392–395
fitting, 342–344
four dimensions, 326–330, 355
generalized loglinear model, 332–333, 464, 481
generalized linear model, 117–118, 125–132
goodness of fit, 337–338
homogeneous association, 320, 377
independence, 232, 314–315, 318–319, 336, 352, 365–366
likelihood equations, 334–336
linear-by-linear association, 369–373, 377–379
logit models, connection, 315, 330–332, 367, 593–594
ordinal variables, 367–377
parameter definition, 316–317, 352–353
Poisson-multinomial connection, 317–318, 339–340
probability estimates, 340–341
rates, 385–391
saturated, 316, 380
selection, 360–366
software, 643–644
square tables, 424–431
three-factor interaction, 320
(X, Y, Z) type symbols, 320–321
Log link, 118, 124, 125, 132, 138, 140, 314, 560, 563
Log-log models, 248–250, 283
Longitudinal studies, see Repeated response
Lowess, 154
Mann–Whitney statistic, 90, 301, 452–453
Mantel, N., 625
Mantel–Haenszel estimator, 234–235, 417, 639
Mantel–Haenszel test, see Cochran–Mantel–Haenszel test
Mantel score test, 87, 88, 89, 379
Marginal distribution, 37. See also Marginal
models
Marginal homogeneity
binary matched pairs, 410–413
and independence, 111
nominal tests, 422–423, 457–459
ordinal tests, 421, 452–453, 458
multi-way table, 439–442, 456–459, 647–649
Marginal likelihood, 521
Marginal models, 414, 420–423, 439–442, 456–476
conditional models, comparison, 498–502
GEE approach, 466–475
ML fitting, 464–466, 481
odds ratio, 451, 494
software, 644–649
Marginal symmetry, 442
Marginal table, 48
same association as partial table, 358–360, 398
Markov chains, 477–481, 482, 489–490
Matched pairs, 409–454
Cochran–Mantel–Haenszel approach, 413
dependent proportions, 410–412
logistic models, 414–420, 493–496, 516–517
McNemar test, 411–413, 424, 442, 644–645
odds ratio estimates, 417, 451, 494
ordinal data, 420–421, 429–431, 439, 443, 452–453, 462–464, 536
random effects, 417–418, 493–494, 535
Maximum likelihood, 9
conditional, 100, 417, 494–496, 526
inconsistent estimator, 450
iterative reweighted least squares, 147, 156, 195, 343
likelihood function, see Likelihood function
versus other methods, 468, 603–605, 612
McNemar test, 411–413, 424, 442, 644–645
Mean response model, 291–294
Measurement error, 347, 493
Measures of association, 43–47, 54–60, 68–69, 620–622
asymptotic normality, 110
comparing several values, 599
Mendel, 22–23, 623
Mid-distribution function, 34
Mid-P-value, 20, 27, 33, 104
Midranks, 89, 90, 302
Minimum chi-squared, 112, 611–612, 616, 618, 629
SUBJECT INDEX
Minimum discrimination information, 112, 612–613, 616
Misclassification error, 347
Missing data, 103, 347, 463, 475–476, 482
Mixture models, 538–566. See also Generalized linear mixed models
ML, see Maximum likelihood
Model-based inference
improved precision of estimation, 85, 112, 174, 239–240, 264
model-based tests, 141–142, 172, 363–365, 396, 399
Model matrix, 135
Monotone trends, 88. See also Trend tests
Monte Carlo methods, 114, 522–525, 609, 629–630, 635
Multicollinearity, 212
Multilevel models, 520, 609, 651
Multinomial distribution, 6–7
binomial factorization, 289
exponential family, 310–311
inference, 21–26, 35
mean, correlation, covariance, 7, 31, 579–580, 596
and Poisson, 8–9, 40
sampling models, 40–41, 67
Multinomial logit models, 267–291, 298–300, 302, 624, 640–643, 651–653
Multinomial loglinear model, 317–318, 339–341
Multinomial response models, 267–300, 640–643
Mutual independence, 318–319, 353, 354
National Halothane Study, 627, 629
Natural exponential family, 116, 133, 155
Natural parameter, 133
Negative binomial
distribution, 31, 161, 163, 560, 566, 574
regression model, 131, 560–563, 565, 566, 653
Nested models
likelihood-ratio comparison, 141–142, 187, 363–364
simultaneous tests, 263
using X², 364
Newton–Raphson, 143–146, 163–164
and Fisher scoring, 145, 247
IPF, comparison, 344–345
logistic regression, 194–195
loglinear models, 342–345
Neyman, J., 626
Nominal variable, 2–3
baseline-category logit models, 267–274, 300, 310–311, 426, 515, 640–643
matched pairs, 422–423
measures of association, 55–57, 68–69
square table models, 425–433, 439–442
Noncentral chi-squared distribution, 237, 258
asymptotic representation, 591–592, 595
noncentrality parameter, 237, 243–245, 408, 597
power and df, 237–239
Nonparametric random effects, 545–553, 565–566, 653
Normal distribution
asymptotic normality, see Delta method
and chi-squared, 82
and logistic regression, 171, 207–208
underlying categorical data, 112, 264, 370, 620
O, o rates of convergence, 577, 595
Observational study, 43
Odds, 44
Odds ratio, 44, 620
bias, 70, 595
case-control studies, 46–47
conditional, 51–54, 255, 321, 417, 451
conditional ML estimate, 255, 417
confidence interval, 71, 77–78, 99–102, 255, 256
cumulative, 67
exact inference, 99–101, 253, 255
homogeneity, in 2 × 2 × K tables, 54, 183, 234–236, 255
I × J tables, 55–56, 581, 597
invariance properties, 45–46, 59
local, see Local odds ratio
logistic regression parameters, 124, 166, 171, 179, 183, 331, 415, 497–500
loglinear model parameters, 315, 316, 321, 331, 369
Mantel–Haenszel estimator, 234–235
marginal, 451, 494
matched pairs, 415–418, 451
ordinal variables, see Local odds ratio
relation to relative risk, 47, 124, 624
standard error, 71, 75–77, 581, 597
Offset, 385
Ordinal variables, 2–3
cumulative link models, 282–286
cumulative logit models, 274–282, 301, 420–421
efficiency, 197, 301
exact tests, 98, 253
improved power, 88–90, 236–239, 373
loglinear models, 367–377, 399
marginal models, 420–421, 429–430, 440–441, 462–464
matched pairs, 420–421, 429–431, 439, 443, 452–454, 462–464
mean response model, 291–294
measures of association, 57–59, 67, 68
multinomial response models, 274–295
ordinal quasi symmetry, 429–430, 440–441, 647
repeated response, 461–464, 469, 474–475, 514–515, 517–520
scores, choice of, 88–90, 383–384
testing independence, 86–91, 373
Overdispersion, 493
binomial, 8, 30, 151–153, 291, 555–558, 573, 653
litter effects, 151–153, 291, 556–558, 566
Poisson, 7–8, 130–131, 636
quasi-likelihood, 151–153, 291, 555–558, 653
Paired comparisons, see Bradley–Terry model
Parallel odds models, 374–375
Partial tables, 48
Partitioning
chi-squared statistic, 82–84, 112–113, 365, 399, 405
and combining rows, 112
I × J tables, 82–83
nested models, 365
trend test, 181, 373
Pattern mixture model, 476
Pearson, Karl, 619–623, 628
arguments with Fisher, Yule, 79, 619–623
goodness of fit, 22–24, 79
Pearson chi-squared statistic, 22–26, 79, 111–112
asymptotic chi-squared distribution, 589–590
asymptotic conditional distribution, 103
continuity correction, 103
degrees of freedom, 25, 79, 622
and z for difference of proportions, 111
goodness of fit, 22–26
independence, 78–79, 111–112, 622
and likelihood-ratio, comparison, 24, 80, 364
minimizing, 112, 611–612, 616, 618, 629
moments, 103
multinomial parameters, 22–26
nested models, 364
noncentral chi-squared distribution, see Noncentral chi-squared distribution
score statistic, 24
sparse data, 80, 395–397
with ungrouped data, 162
upper bound, 112
Pearson residual, 81, 142, 588–589, 593
binomial GLM, 220, 555, 638
Poisson GLM, 142, 366, 588
Penalized likelihood, 614–615
Penalized quasi-likelihood (PQL), 523–524
Perfect contingency tables, 398
Perfect discrimination, 195–196
Phi-squared, 112
Poisson distribution, 7
comparing means, 31
exponential family, 117, 134
moments, 7, 31
and multinomial, 8–9, 40
and negative binomial, 131, 559–560, 566, 574
overdispersion, 7–8, 130–131, 636
Poisson sampling, 39
variance test, 163
Poisson models
counts, 125–132, 155, 563–565
deviance, 140
loglinear model, 117–118, 125–132, 138–139, 232, 314–347
overdispersion, 130–131, 150–151, 636
random effects, 563–565
rates, 385–391, 399–400
Polytomous logit models, 267–291
Population-averaged effects, 414, 495, 499–501
Positive likelihood-ratio dependence, 406
Power
calculating, 240–245, 640
increased, for directed alternatives, 88–90, 236–239, 373
and noncentrality, 237–239, 243–245
and number of ordinal categories, 301
Power-divergence statistic, 112, 613
Prediction, 525–526
Probit model, 124–125, 246–247, 258, 623, 640
discrete choice, 302
likelihood equations, 265
normal parameters, 163, 246, 264
ordinal data, 278, 283, 301, 312, 641
random effects, 535
threshold and utility motivations, 264
Profile likelihood confidence interval, 78, 512, 638
Propensity score, 196
Proportional hazards model, 283–284, 301, 389, 643
Proportional odds, see Cumulative logit models
Proportional reduction in variation, 56–57, 67–68
Proportions
admissible estimator, 605
asymptotic distribution, 585–588, 593
Bayesian inference, 605–607
confidence interval, 15–17, 32–33, 635
dependent, 410–412
difference, see Difference of proportions
ratio, see Relative risk
standard error, 11, 340–341
P-value
mid-P-value, 20, 27, 33, 104
randomized, 27, 32
UMVU estimator, 162
Qualitative variable, 34
Quantitative variable, 34
Quasi-association, 431, 453–454
Quasi-independence, 426–428, 432–433, 443
Quasi-likelihood
binary models, 151–153, 291, 555–558
count models, 150–151
GLM, 149–153, 156
multivariate (GEE), 466–475, 481–482, 625
overdispersion, 150–153, 291, 555–558
Quasi-symmetry, 425–431, 433–434, 451, 454, 646–647
and Bradley–Terry model, 438–439
and marginal homogeneity, 428–430
multiway tables, 440–441
and Rasch model, 552–553, 565
Raking a table, 345–346, 347, 643
Random component of GLM, 116, 133
Random effects, 417, 492–527
Random intercept, 493
Ranks, 89, 90, 298, 301, 302
Rasch mixture model, 548–551, 653
Rasch model, 495–496, 517, 526, 535, 565, 624
Rates, 385–391, 399–400
RC model, 379–381, 399–400
Regressive logistic model, 479–481
Relative risk, 43–44
asymptotic standard error, 73
collapsibility, 398
confidence interval, 73, 77
homogeneity, 258
in model, 124
and odds ratio, 47, 624
Repeated response, 409–517. See also Generalized linear mixed models; Marginal models; Matched pairs
Residuals, 142–143, 156
asymptotic distribution, 587–589
binomial GLMs, 219–223
deviance, see Deviance residual
Pearson, see Pearson residual
Poisson GLMs, 143, 366–367
standardized Pearson, see Standardized
Pearson residual
Retrospective study, 42–43. See also Case-control study
logistic regression, 170–171
odds ratio, 46–47
Ridits, 111, 406
ROC curve, 228–230, 258
Row and column effects model, see RC model
Row effects model, 374–376, 643–644
R-squared type measure
logistic regression, 226–228, 258
nominal association, 56–57, 67–68
Sample size determination, 240–245
Sampling methods, 39–43
Sampling zero, 392
Sandwich estimator, 471–474
SAS, 632–643
Saturated model, 119, 139, 382
logit models, 178
loglinear models, 316, 380
Scaled deviance, 140
Scores
choice of, 88–90, 383–384
efficiency, 197, 301
in loglinear models, 369–379, 407
in trend test, 88–89, 181–182, 406
Score statistic, 12, 26–27
confidence intervals, 15–16, 77
logistic regression, 232, 297–298
Pearson statistic, 24
and standardized residuals, 156
trend test, 182
Selection model, 475–476
Sensitivity, 38, 60, 228–230
Simpson diversity index, 596
Simpson’s paradox, 51, 59–60, 224, 354, 621
Small-area estimation, 502–504
Small samples
adding constants to cells, 397–398
alternative asymptotics, 233, 396–397
exact inference, 18–20, 91–101, 104, 251–257
existence of estimates, 195–196, 341, 392–395
model-based tests, 187, 251–257
X² and G², 24, 80, 364, 395–397
zeros, 392–398
Smoothing
Bayes, 606–610
generalized additive model, 153–155
improved estimation with model, 85, 112, 174, 239–240, 264
kernel, 613–615, 616
penalized likelihood, 614–615
Software, 632–653
SAS, 632–643
StatXact and LogXact, 633, 635, 640, 643
Somers’ d, 68
Sparse data, 187, 250–257, 391–398, 591
asymptotics, 233, 396–397
Spearman’s rho, 90
Specificity, 38, 60, 228–230
Square tables, 409–454
Standardized parameter estimate, 191–192, 197
Standardized Pearson residual, 81, 143, 589
binomial GLMs, 220, 638
and Pearson statistic, 112
Poisson GLMs, 143, 367, 634
as score statistic, 156
Standardized table, 345–346
StatXact, 633, 635, 640, 643
Stepwise model-building, 213–216
Stochastic ordering, 33, 67, 301
Structural zero, 25, 392
Subject-specific effects, 414–420, 491, 498–500
Sufficient statistics, 148, 250–257, 273, 334, 336
Suppressor variable, 67
Survival data, 385–391
Symmetric association, 425
Symmetry, 424–425, 644–647
complete, 440
multiway, 439–442
Systematic component of GLM, 116
Tetrachoric correlation, 620
Three-factor interaction, 320
Threshold model, 264, 277–279
Tolerance distribution, 245–246
Transformations, 595, 596
Transition probabilities, 477, 490
Transitional model, 464, 476–481, 482
Tree-structured methods, 257, 631
Trend tests, 86–90, 103, 296, 373, 379
Cochran–Armitage for proportions, 90, 181–182, 237–239
efficiency, 197, 301
exact, 253
software, 634, 635
Uncertainty coefficient, 57
Uniform association model, 312, 369–370, 377
Uniform interaction model, 407
Uniqueness of ML estimate, 341
Utility, 264
Variance
asymptotic, see Delta method
components, 492, 525
in exponential family, 134
stabilizing, 596, 626
test for Poisson, 163
variance function, 136, 149–150
Wald statistic, 11, 27
and power, 172, 208–209
Wald confidence intervals, 13
adjusted intervals, 33, 102
Weight matrix, 138, 155, 164
Weighted kappa, 435, 443, 645
Weighted observation, 391
Weighted least squares, 481, 600–604, 615, 629
and minimum modified chi-squared, 611,
612
and ML estimation, 146–148, 603–604
Wilcoxon test, 90, 301
WLS, see Weighted least squares
X² statistic, see Pearson chi-squared statistic
X²(M0 | M1), 364
Yates continuity correction, 103
Yule, G. U., 620–621, 628
Yule’s Q, 68, 110
Zero cell count
adding constants, 70–71, 397–398
effects on estimates, 70–71, 78, 256
sampling, 392
structural, 25, 392