

Categorical Data Analysis

Categorical Data Analysis, Second Edition
ALAN AGRESTI
University of Florida, Gainesville, Florida

This book is printed on acid-free paper.

Copyright © 2002 John Wiley & Sons, Inc., Hoboken, New Jersey. All rights reserved. Published simultaneously in Canada. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4744. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 605 Third Avenue, New York, NY 10158-0012, (212) 850-6011, fax (212) 850-6008, E-Mail: PERMREQ@WILEY.COM. For ordering and customer service, call 1-800-CALL-WILEY.

Library of Congress Cataloging-in-Publication Data Is Available
ISBN 0-471-36093-7
Printed in the United States of America 10 9 8 7 6 5 4 3 2 1

To Jacki

Contents

Preface
1. Introduction: Distributions and Inference for Categorical Data
   1.1 Categorical Response Data
   1.2 Distributions for Categorical Data
   1.3 Statistical Inference for Categorical Data
   1.4 Statistical Inference for Binomial Parameters
   1.5 Statistical Inference for Multinomial Parameters
   Notes; Problems
2. Describing Contingency Tables
   2.1 Probability Structure for Contingency Tables
   2.2 Comparing Two Proportions
   2.3 Partial Association in Stratified 2 × 2 Tables
   2.4 Extensions for I × J Tables
   Notes; Problems
3. Inference for Contingency Tables
   3.1 Confidence Intervals for Association Parameters
   3.2 Testing Independence in Two-Way Contingency Tables
   3.3 Following-Up Chi-Squared Tests
   3.4 Two-Way Tables with Ordered Classifications
   3.5 Small-Sample Tests of Independence
   3.6 Small-Sample Confidence Intervals for 2 × 2 Tables*
   3.7 Extensions for Multiway Tables and Nontabulated Responses
   Notes; Problems
4. Introduction to Generalized Linear Models
   4.1 Generalized Linear Model
   4.2 Generalized Linear Models for Binary Data
   4.3 Generalized Linear Models for Counts
   4.4 Moments and Likelihood for Generalized Linear Models*
   4.5 Inference for Generalized Linear Models
   4.6 Fitting Generalized Linear Models
   4.7 Quasi-likelihood and Generalized Linear Models*
   4.8 Generalized Additive Models*
   Notes; Problems
5. Logistic Regression
   5.1 Interpreting Parameters in Logistic Regression
   5.2 Inference for Logistic Regression
   5.3 Logit Models with Categorical Predictors
   5.4 Multiple Logistic Regression
   5.5 Fitting Logistic Regression Models
   Notes; Problems
6. Building and Applying Logistic Regression Models
   6.1 Strategies in Model Selection
   6.2 Logistic Regression Diagnostics
   6.3 Inference About Conditional Associations in 2 × 2 × K Tables
   6.4 Using Models to Improve Inferential Power
   6.5 Sample Size and Power Considerations*
   6.6 Probit and Complementary Log-Log Models*
   6.7 Conditional Logistic Regression and Exact Distributions*
   Notes; Problems

*Sections marked with an asterisk are less important for an overview.
7. Logit Models for Multinomial Responses
   7.1 Nominal Responses: Baseline-Category Logit Models
   7.2 Ordinal Responses: Cumulative Logit Models
   7.3 Ordinal Responses: Cumulative Link Models
   7.4 Alternative Models for Ordinal Responses*
   7.5 Testing Conditional Independence in I × J × K Tables*
   7.6 Discrete-Choice Multinomial Logit Models*
   Notes; Problems
8. Loglinear Models for Contingency Tables
   8.1 Loglinear Models for Two-Way Tables
   8.2 Loglinear Models for Independence and Interaction in Three-Way Tables
   8.3 Inference for Loglinear Models
   8.4 Loglinear Models for Higher Dimensions
   8.5 The Loglinear-Logit Model Connection
   8.6 Loglinear Model Fitting: Likelihood Equations and Asymptotic Distributions*
   8.7 Loglinear Model Fitting: Iterative Methods and Their Application*
   Notes; Problems
9. Building and Extending Loglinear/Logit Models
   9.1 Association Graphs and Collapsibility
   9.2 Model Selection and Comparison
   9.3 Diagnostics for Checking Models
   9.4 Modeling Ordinal Associations
   9.5 Association Models*
   9.6 Association Models, Correlation Models, and Correspondence Analysis*
   9.7 Poisson Regression for Rates
   9.8 Empty Cells and Sparseness in Modeling Contingency Tables
   Notes; Problems
10. Models for Matched Pairs
   10.1 Comparing Dependent Proportions
   10.2 Conditional Logistic Regression for Binary Matched Pairs
   10.3 Marginal Models for Square Contingency Tables
   10.4 Symmetry, Quasi-symmetry, and Quasi-independence
   10.5 Measuring Agreement Between Observers
   10.6 Bradley-Terry Model for Paired Preferences
   10.7 Marginal Models and Quasi-symmetry Models for Matched Sets*
   Notes; Problems
11. Analyzing Repeated Categorical Response Data
   11.1 Comparing Marginal Distributions: Multiple Responses
   11.2 Marginal Modeling: Maximum Likelihood Approach
   11.3 Marginal Modeling: Generalized Estimating Equations Approach
   11.4 Quasi-likelihood and Its GEE Multivariate Extension: Details*
   11.5 Markov Chains: Transitional Modeling
   Notes; Problems
12. Random Effects: Generalized Linear Mixed Models for Categorical Responses
   12.1 Random Effects Modeling of Clustered Categorical Data
   12.2 Binary Responses: Logistic-Normal Model
   12.3 Examples of Random Effects Models for Binary Data
   12.4 Random Effects Models for Multinomial Data
   12.5 Multivariate Random Effects Models for Binary Data
   12.6 GLMM Fitting, Inference, and Prediction
   Notes; Problems
13. Other Mixture Models for Categorical Data*
   13.1 Latent Class Models
   13.2 Nonparametric Random Effects Models
   13.3 Beta-Binomial Models
   13.4 Negative Binomial Regression
   13.5 Poisson Regression with Random Effects
   Notes; Problems
14. Asymptotic Theory for Parametric Models
   14.1 Delta Method
   14.2 Asymptotic Distributions of Estimators of Model Parameters and Cell Probabilities
   14.3 Asymptotic Distributions of Residuals and Goodness-of-Fit Statistics
   14.4 Asymptotic Distributions for Logit/Loglinear Models
   Notes; Problems
15. Alternative Estimation Theory for Parametric Models
   15.1 Weighted Least Squares for Categorical Data
   15.2 Bayesian Inference for Categorical Data
   15.3 Other Methods of Estimation
   Notes; Problems
16. Historical Tour of Categorical Data Analysis*
   16.1 Pearson-Yule Association Controversy
   16.2 R. A. Fisher's Contributions
   16.3 Logistic Regression
   16.4 Multiway Contingency Tables and Loglinear Models
   16.5 Recent (and Future?) Developments
Appendix A. Using Computer Software to Analyze Categorical Data
   A.1 Software for Categorical Data Analysis
   A.2 Examples of SAS Code by Chapter
Appendix B. Chi-Squared Distribution Values
References
Examples Index
Author Index
Subject Index

Preface

The explosion in the development of methods for analyzing categorical data that began in the 1960s has continued apace in recent years. This book provides an overview of these methods, as well as older, now standard, methods. It gives special emphasis to generalized linear modeling techniques, which extend linear model methods for continuous variables, and their extensions for multivariate responses.

Today, because of this development and the ubiquity of categorical data in applications, most statistics and biostatistics departments offer courses on categorical data analysis. This book can be used as a text for such courses. The material in Chapters 1 through 7 forms the heart of most courses. Chapters 1 through 3 cover distributions for categorical responses and traditional methods for two-way contingency tables. Chapters 4 through 7 introduce logistic regression and related logit models for binary and multicategory response variables. Chapters 8 and 9 cover loglinear models for contingency tables. Over time, this model class seems to have lost importance, and this edition somewhat reduces its discussion of them and expands its focus on logistic regression. In the past decade, the major area of new research has been the development of methods for repeated measurement and other forms of clustered categorical data. Chapters 10 through 13 present these methods, including marginal models and generalized linear mixed models with random effects. Chapters 14 and 15 present theoretical foundations as well as alternatives to the maximum likelihood paradigm that this text adopts. Chapter 16 is devoted to a historical overview of the development of the methods.
It examines contributions of noted statisticians, such as Pearson and Fisher, whose pioneering efforts (and sometimes vocal debates) broke the ground for this evolution.

Every chapter of the first edition has been extensively rewritten, and some substantial additions and changes have occurred. The major differences are:

- A new Chapter 1 that introduces distributions and methods of inference for categorical data.
- A unified presentation of models as special cases of generalized linear models, starting in Chapter 4 and then throughout the text.
- Greater emphasis on logistic regression for binary response variables and extensions for multicategory responses, with Chapters 4 through 7 introducing models and Chapters 10 through 13 extending them for clustered data.
- Three new chapters on methods for clustered, correlated categorical data, increasingly important in applications.
- A new chapter on the historical development of the methods.
- More discussion of "exact" small-sample procedures and of conditional logistic regression.

In this text, I interpret categorical data analysis to refer to methods for categorical response variables. For most methods, explanatory variables can be qualitative or quantitative, as in ordinary regression. Thus, the focus is intended to be more general than contingency table analysis, although for simplicity of data presentation, most examples use contingency tables. These examples are often simplistic, but should help readers focus on understanding the methods themselves and make it easier for them to replicate results with their favorite software.

Special features of the text include:

- More than 100 analyses of "real" data sets.
- More than 600 exercises at the end of the chapters, some directed towards theory and methods and some towards applications and data analysis.
- An appendix that shows, by chapter, the use of SAS for performing analyses presented in this book.
- Notes at the end of each chapter that provide references for recent research and many topics not covered in the text.

Appendix A summarizes statistical software needed to use the methods described in this text. It shows how to use SAS for analyses included in the text and refers to a web site (www.stat.ufl.edu/~aa/cda/cda.html) that contains (1) information on the use of other software (such as R, S-Plus, SPSS, and Stata), (2) data sets for examples in the form of complete SAS programs for conducting the analyses, (3) short answers for many of the odd-numbered exercises, (4) corrections of errors in early printings of the book, and (5) extra exercises. I recommend that readers refer to this appendix or specialized manuals while reading the text, as an aid to implementing the methods.

I intend this book to be accessible to the diverse mix of students who take graduate-level courses in categorical data analysis. But I have also written it with practicing statisticians and biostatisticians in mind. I hope it enables them to catch up with recent advances and learn about methods that sometimes receive inadequate attention in the traditional statistics curriculum.

The development of new methods has influenced, and been influenced by, the increasing availability of data sets with categorical responses in the social, behavioral, and biomedical sciences, as well as in public health, human genetics, ecology, education, marketing, and industrial quality control. And so, although this book is directed mainly to statisticians and biostatisticians, I also aim for it to be helpful to methodologists in these fields. Readers should possess a background that includes regression and analysis of variance models, as well as maximum likelihood methods of statistical theory. Those not having much theory background should be able to follow most methodological discussions. Sections and subsections marked with an asterisk are less important for an overview.
Readers with mainly applied interests can skip most of Chapter 4 on the theory of generalized linear models and proceed to other chapters. However, the book has a distinctly higher technical level and is more thorough and complete than my lower-level text, An Introduction to Categorical Data Analysis (Wiley, 1996).

I thank those who commented on parts of the manuscript or provided help of some type. Special thanks to Bernhard Klingenberg, who read several chapters carefully and made many helpful suggestions; Yongyi Min, who constructed many of the figures and helped with some software; and Brian Caffo, who helped with some examples. Many thanks to Roslyn Stone and Brian Marx for each reviewing half the manuscript, and to Brian Caffo, I-Ming Liu, and Yongyi Min for giving insightful comments on several chapters. Thanks to Constantine Gatsonis and his students for using a draft in a course at Brown University and providing suggestions. Others who provided comments on chapters or help of some type include Patricia Altham, Wicher Bergsma, Jane Brockmann, Brent Coull, Al DeMaris, Regina Dittrich, Jianping Dong, Herwig Friedl, Ralitza Gueorguieva, James Hobert, Walter Katzenbeisser, Harry Khamis, Svend Kreiner, Joseph Lang, Jason Liao, Mojtaba Ganjali, Jane Pendergast, Michael Radelet, Kenneth Small, Maura Stokes, Tom Ten Have, and Rongling Wu. I thank my co-authors on various projects, especially Brent Coull, Joseph Lang, James Booth, James Hobert, Brian Caffo, and Ranjini Natarajan, for permission to use material from those articles. Thanks to the many who reviewed material or suggested examples for the first edition, mentioned in the Preface of that edition. Thanks also to Wiley Executive Editor Steve Quigley for his steadfast encouragement and facilitation of this project. Finally, thanks to my wife Jacki Levine for continuing support of all kinds, despite the many days this work has taken from our time together.
ALAN AGRESTI
Gainesville, Florida
November 2001

CHAPTER 1
Introduction: Distributions and Inference for Categorical Data

From helping to assess the value of new medical treatments to evaluating the factors that affect our opinions and behaviors, analysts today are finding myriad uses for categorical data methods. In this book we introduce these methods and the theory behind them.

Statistical methods for categorical responses were late in gaining the level of sophistication achieved early in the twentieth century by methods for continuous responses. Despite influential work around 1900 by the British statistician Karl Pearson, relatively little development of models for categorical responses occurred until the 1960s. In this book we describe the early fundamental work that still has importance today but place primary emphasis on more recent modeling approaches. Before outlining the topics covered, we describe the major types of categorical data.

1.1 CATEGORICAL RESPONSE DATA

A categorical variable has a measurement scale consisting of a set of categories. For instance, political philosophy is often measured as liberal, moderate, or conservative. Diagnoses regarding breast cancer based on a mammogram use the categories normal, benign, probably benign, suspicious, and malignant.

The development of methods for categorical variables was stimulated by research studies in the social and biomedical sciences. Categorical scales are pervasive in the social sciences for measuring attitudes and opinions. Categorical scales in biomedical sciences measure outcomes such as whether a medical treatment is successful. Although categorical data are common in the social and biomedical sciences, they are by no means restricted to those areas.
They frequently occur in the behavioral sciences (e.g., type of mental illness, with the categories schizophrenia, depression, neurosis), epidemiology and public health (e.g., contraceptive method at last intercourse, with the categories none, condom, pill, IUD, other), genetics (type of allele inherited by an offspring), zoology (e.g., alligators' primary food preference, with the categories fish, invertebrate, reptile), education (e.g., student responses to an exam question, with the categories correct and incorrect), and marketing (e.g., consumer preference among leading brands of a product, with the categories brand A, brand B, and brand C). They even occur in highly quantitative fields such as engineering sciences and industrial quality control. Examples are the classification of items according to whether they conform to certain standards, and subjective evaluation of some characteristic: how soft to the touch a certain fabric is, how good a particular food product tastes, or how easy to perform a worker finds a certain task to be.

Categorical variables are of many types. In this section we provide ways of classifying them and other variables.

1.1.1 Response–Explanatory Variable Distinction

Most statistical analyses distinguish between response (or dependent) variables and explanatory (or independent) variables. For instance, regression models describe how the mean of a response variable, such as the selling price of a house, changes according to the values of explanatory variables, such as square footage and location. In this book we focus on methods for categorical response variables. As in ordinary regression, explanatory variables can be of any type.

1.1.2 Nominal–Ordinal Scale Distinction

Categorical variables have two primary types of scales. Variables having categories without a natural ordering are called nominal.
Examples are religious affiliation (with the categories Catholic, Protestant, Jewish, Muslim, other), mode of transportation to work (automobile, bicycle, bus, subway, walk), favorite type of music (classical, country, folk, jazz, rock), and choice of residence (apartment, condominium, house, other). For nominal variables, the order of listing the categories is irrelevant. The statistical analysis does not depend on that ordering.

Many categorical variables do have ordered categories. Such variables are called ordinal. Examples are size of automobile (subcompact, compact, midsize, large), social class (upper, middle, lower), political philosophy (liberal, moderate, conservative), and patient condition (good, fair, serious, critical). Ordinal variables have ordered categories, but distances between categories are unknown. Although a person categorized as moderate is more liberal than a person categorized as conservative, no numerical value describes how much more liberal that person is. Methods for ordinal variables utilize the category ordering.

An interval variable is one that does have numerical distances between any two values. For example, blood pressure level, functional life length of a television set, length of prison term, and annual income are interval variables. (An interval variable is sometimes called a ratio variable if ratios of values are also valid.)

The way that a variable is measured determines its classification. For example, "education" is only nominal when measured as public school or private school; it is ordinal when measured by highest degree attained, using the categories none, high school, bachelor's, master's, and doctorate; it is interval when measured by number of years of education, using the integers 0, 1, 2, .... A variable's measurement scale determines which statistical methods are appropriate.
In the measurement hierarchy, interval variables are highest, ordinal variables are next, and nominal variables are lowest. Statistical methods for variables of one type can also be used with variables at higher levels but not at lower levels. For instance, statistical methods for nominal variables can be used with ordinal variables by ignoring the ordering of categories. Methods for ordinal variables cannot, however, be used with nominal variables, since their categories have no meaningful ordering. It is usually best to apply methods appropriate for the actual scale.

Since this book deals with categorical responses, we discuss the analysis of nominal and ordinal variables. The methods also apply to interval variables having a small number of distinct values (e.g., number of times married) or for which the values are grouped into ordered categories (e.g., education measured as <10 years, 10–12 years, >12 years).

1.1.3 Continuous–Discrete Variable Distinction

Variables are classified as continuous or discrete, according to the number of values they can take. Actual measurement of all variables occurs in a discrete manner, due to precision limitations in measuring instruments. The continuous–discrete classification, in practice, distinguishes between variables that take lots of values and variables that take few values. For instance, statisticians often treat discrete interval variables having a large number of values (such as test scores) as continuous, using them in methods for continuous responses.

This book deals with certain types of discretely measured responses: (1) nominal variables, (2) ordinal variables, (3) discrete interval variables having relatively few values, and (4) continuous variables grouped into a small number of categories.

1.1.4 Quantitative–Qualitative Variable Distinction

Nominal variables are qualitative: distinct categories differ in quality, not in quantity.
Interval variables are quantitative: distinct levels have differing amounts of the characteristic of interest. The position of ordinal variables in the quantitative–qualitative classification is fuzzy. Analysts often treat them as qualitative, using methods for nominal variables. But in many respects, ordinal variables more closely resemble interval variables than they resemble nominal variables. They possess important quantitative features: Each category has a greater or smaller magnitude of the characteristic than another category; and although not possible to measure, an underlying continuous variable is usually present. The political philosophy classification (liberal, moderate, conservative) crudely measures an inherently continuous characteristic. Analysts often utilize the quantitative nature of ordinal variables by assigning numerical scores to categories or assuming an underlying continuous distribution. This requires good judgment and guidance from researchers who use the scale, but it provides benefits in the variety of methods available for data analysis.

1.1.5 Organization of This Book

The models for categorical response variables discussed in this book resemble regression models for continuous response variables; however, they assume binomial, multinomial, or Poisson response distributions instead of normality. Two types of models receive special attention, logistic regression and loglinear models. Ordinary logistic regression models, also called logit models, apply with binary (i.e., two-category) responses and assume a binomial distribution. Generalizations of logistic regression apply with multicategory responses and assume a multinomial distribution. Loglinear models apply with count data and assume a Poisson distribution. Certain equivalences exist between logistic regression and loglinear models. The book has four main units.
In the first, Chapters 1 through 3, we summarize descriptive and inferential methods for univariate and bivariate categorical data. These chapters cover discrete distributions, methods of inference, and analyses for measures of association. They summarize the non-model-based methods developed prior to about 1960.

In the second and primary unit, Chapters 4 through 9, we introduce models for categorical responses. In Chapter 4 we describe a class of generalized linear models having models of this text as special cases. We focus on models for binary and count response variables. Chapters 5 and 6 cover the most important model for binary responses, logistic regression. In Chapter 7 we present generalizations of that model for nominal and ordinal multicategory response variables. In Chapter 8 we introduce the modeling of multivariate categorical response data and show how to represent association and interaction patterns by loglinear models for counts in the table that cross-classifies those responses. In Chapter 9 we discuss model building with loglinear and related logistic models and present some related models.

In the third unit, Chapters 10 through 13, we discuss models for handling repeated measurement and other forms of clustering. In Chapter 10 we present models for a categorical response with matched pairs; these apply, for instance, with a categorical response measured for the same subjects at two times. Chapter 11 covers models for more general types of repeated categorical data, such as longitudinal data from several times with explanatory variables. In Chapter 12 we present a broad class of models, generalized linear mixed models, that use random effects to account for dependence with such data. In Chapter 13 further extensions and applications of the models from Chapters 10 through 12 are described.

The fourth and final unit is more theoretical. In Chapter 14 we develop asymptotic theory for categorical data models.
This theory is the basis for large-sample behavior of model parameter estimators and goodness-of-fit statistics. Maximum likelihood estimation receives primary attention here and throughout the book, but Chapter 15 covers alternative methods of estimation, such as the Bayesian paradigm. Chapter 16 stands alone from the others, being a historical overview of the development of categorical data methods.

Most categorical data methods require extensive computations, and statistical software is necessary for their effective use. In Appendix A we discuss software that can perform the analyses in this book and show the use of SAS for text examples. See the Web site www.stat.ufl.edu/~aa/cda/cda.html to download sample programs and data sets and find information about other software.

Chapter 1 provides background material. In Section 1.2 we review the key distributions for categorical data: the binomial, multinomial, and Poisson. In Section 1.3 we review the primary mechanisms for statistical inference, using maximum likelihood. In Sections 1.4 and 1.5 we illustrate these by presenting significance tests and confidence intervals for binomial and multinomial parameters.

1.2 DISTRIBUTIONS FOR CATEGORICAL DATA

Inferential data analyses require assumptions about the random mechanism that generated the data. For regression models with continuous responses, the normal distribution plays the central role. In this section we review the three key distributions for categorical responses: binomial, multinomial, and Poisson.

1.2.1 Binomial Distribution

Many applications refer to a fixed number n of binary observations. Let y_1, y_2, \ldots, y_n denote responses for n independent and identical trials such that P(Y_i = 1) = \pi and P(Y_i = 0) = 1 - \pi. We use the generic labels "success" and "failure" for outcomes 1 and 0. Identical trials means that the probability of success \pi is the same for each trial.
Independent trials means that the {Y_i} are independent random variables. These are often called Bernoulli trials. The total number of successes, Y = \sum_{i=1}^{n} Y_i, has the binomial distribution with index n and parameter \pi, denoted by bin(n, \pi). The probability mass function for the possible outcomes y for Y is

    p(y) = \binom{n}{y} \pi^{y} (1 - \pi)^{n-y}, \qquad y = 0, 1, 2, \ldots, n,    (1.1)

where the binomial coefficient \binom{n}{y} = n! / [y!(n - y)!]. Since E(Y_i) = E(Y_i^2) = 1 \times \pi + 0 \times (1 - \pi) = \pi, it follows that

    E(Y_i) = \pi  and  var(Y_i) = \pi(1 - \pi).

The binomial distribution for Y = \sum_i Y_i has mean and variance

    \mu = E(Y) = n\pi  and  \sigma^2 = var(Y) = n\pi(1 - \pi).

The skewness is described by E(Y - \mu)^3 / \sigma^3 = (1 - 2\pi) / \sqrt{n\pi(1 - \pi)}. The distribution converges to normality as n increases, for fixed \pi.

There is no guarantee that successive binary observations are independent or identical. Thus, occasionally, we will utilize other distributions. One such case is sampling binary outcomes without replacement from a finite population, such as observations on gender for 10 students sampled from a class of size 20. The hypergeometric distribution, studied in Section 3.5.1, is then relevant. In Section 1.2.4 we mention another case that violates these binomial assumptions.

1.2.2 Multinomial Distribution

Some trials have more than two possible outcomes. Suppose that each of n independent, identical trials can have outcome in any of c categories. Let y_{ij} = 1 if trial i has outcome in category j and y_{ij} = 0 otherwise. Then y_i = (y_{i1}, y_{i2}, \ldots, y_{ic}) represents a multinomial trial, with \sum_j y_{ij} = 1; for instance, (0, 0, 1, 0) denotes outcome in category 3 of four possible categories. Note that y_{ic} is redundant, being linearly dependent on the others. Let n_j = \sum_i y_{ij} denote the number of trials having outcome in category j. The counts (n_1, n_2, \ldots, n_c) have the multinomial distribution. Let \pi_j = P(Y_{ij} = 1)
denote the probability of outcome in category j for each trial. The multinomial probability mass function is

    p(n_1, n_2, \ldots, n_{c-1}) = \left( \frac{n!}{n_1! \, n_2! \cdots n_c!} \right) \pi_1^{n_1} \pi_2^{n_2} \cdots \pi_c^{n_c}.    (1.2)

Since \sum_j n_j = n, this is (c - 1)-dimensional, with n_c = n - (n_1 + \cdots + n_{c-1}). The binomial distribution is the special case with c = 2. For the multinomial distribution,

    E(n_j) = n\pi_j, \quad var(n_j) = n\pi_j(1 - \pi_j), \quad cov(n_j, n_k) = -n\pi_j\pi_k.    (1.3)

We derive the covariance in Section 14.1.4. The marginal distribution of each n_j is binomial.

1.2.3 Poisson Distribution

Sometimes, count data do not result from a fixed number of trials. For instance, if y = number of deaths due to automobile accidents on motorways in Italy during this coming week, there is no fixed upper limit n for y (as you are aware if you have driven in Italy). Since y must be a nonnegative integer, its distribution should place its mass on that range. The simplest such distribution is the Poisson. Its probabilities depend on a single parameter, the mean \mu. The Poisson probability mass function (Poisson 1837, p. 206) is

    p(y) = \frac{e^{-\mu} \mu^{y}}{y!}, \qquad y = 0, 1, 2, \ldots.    (1.4)

It satisfies E(Y) = var(Y) = \mu. It is unimodal with mode equal to the integer part of \mu. Its skewness is described by E(Y - \mu)^3 / \sigma^3 = 1 / \sqrt{\mu}. The distribution approaches normality as \mu increases.

The Poisson distribution is used for counts of events that occur randomly over time or space, when outcomes in disjoint periods or regions are independent. It also applies as an approximation for the binomial when n is large and \pi is small, with \mu = n\pi. So if each of the 50 million people driving in Italy next week is an independent trial with probability 0.000002 of dying in a fatal accident that week, the number of deaths Y is a bin(50000000, 0.000002) variate, or approximately Poisson with \mu = n\pi = 50,000,000(0.000002) = 100.

A key feature of the Poisson distribution is that its variance equals its mean.
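The quality of this approximation is easy to check numerically from (1.1) and (1.4). The following sketch is not from the text; it is a minimal illustration using only the Python standard library, evaluating the binomial probability on the log scale since \pi^y would underflow at such a large n:

```python
import math

def binom_pmf(y, n, p):
    # equation (1.1), computed on the log scale to avoid underflow for large n
    logpmf = math.log(math.comb(n, y)) + y * math.log(p) + (n - y) * math.log1p(-p)
    return math.exp(logpmf)

def poisson_pmf(y, mu):
    # equation (1.4): e^(-mu) mu^y / y!
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

# the Italian road-fatality illustration: n = 50 million, pi = 0.000002, mu = 100
n, p = 50_000_000, 0.000002
mu = n * p
for y in (80, 100, 120):
    print(f"y = {y}: binomial {binom_pmf(y, n, p):.6f}, Poisson {poisson_pmf(y, mu):.6f}")
```

The binomial and Poisson probabilities agree closely at every y, which is the point of the approximation.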
Sample counts vary more when their mean is higher. When the mean number of weekly fatal accidents equals 100, greater variability occurs in the weekly counts than when the mean equals 10.

1.2.4 Overdispersion

In practice, count observations often exhibit variability exceeding that predicted by the binomial or Poisson. This phenomenon is called overdispersion. We assumed above that each person has the same probability of dying in a fatal accident in the next week. More realistically, these probabilities vary, due to factors such as amount of time spent driving, whether the person wears a seat belt, and geographical location. Such variation causes fatality counts to display more variation than predicted by the Poisson model.

Suppose that $Y$ is a random variable with variance $\mathrm{var}(Y \mid \mu)$ for given $\mu$, but $\mu$ itself varies because of unmeasured factors such as those just described. Let $\theta = E(\mu)$. Then unconditionally,
\[ E(Y) = E[E(Y \mid \mu)], \qquad \mathrm{var}(Y) = E[\mathrm{var}(Y \mid \mu)] + \mathrm{var}[E(Y \mid \mu)]. \]
When $Y$ is conditionally Poisson (given $\mu$), for instance, then
\[ E(Y) = E(\mu) = \theta \quad \text{and} \quad \mathrm{var}(Y) = E(\mu) + \mathrm{var}(\mu) = \theta + \mathrm{var}(\mu) > \theta. \]

Assuming a Poisson distribution for a count variable is often too simplistic, because of factors that cause overdispersion. The negative binomial is a related distribution for count data that permits the variance to exceed the mean. We introduce it in Section 4.3.4.

Analyses assuming binomial (or multinomial) distributions are also sometimes invalid because of overdispersion. This might happen because the true distribution is a mixture of different binomial distributions, with the parameter varying because of unmeasured variables. To illustrate, suppose that an experiment exposes pregnant mice to a toxin and then after a week observes the number of fetuses in each mouse's litter that show signs of malformation. Let $n_i$ denote the number of fetuses in the litter for mouse $i$.
The mice also vary according to other factors that may not be measured, such as their weight, overall health, and genetic makeup. Extra variation then occurs because of the variability from litter to litter in the probability $\pi$ of malformation. The distribution of the number of fetuses per litter showing malformations might cluster near 0 and near $n_i$, showing more dispersion than expected for binomial sampling with a single value of $\pi$. Overdispersion could also occur when $\pi$ varies among fetuses in a litter according to some distribution (Problem 1.12). In Chapters 4, 12, and 13 we introduce methods for data that are overdispersed relative to binomial and Poisson assumptions.

1.2.5 Connection between Poisson and Multinomial Distributions

In Italy this next week, let $y_1$ = number of people who die in automobile accidents, $y_2$ = number who die in airplane accidents, and $y_3$ = number who die in railway accidents. A Poisson model for $(Y_1, Y_2, Y_3)$ treats these as independent Poisson random variables, with parameters $(\mu_1, \mu_2, \mu_3)$. The joint probability mass function for $\{Y_i\}$ is the product of the three mass functions of form (1.4). The total $n = \sum Y_i$ also has a Poisson distribution, with parameter $\sum \mu_i$.

With Poisson sampling the total count $n$ is random rather than fixed. If we assume a Poisson model but condition on $n$, $\{Y_i\}$ no longer have Poisson distributions, since each $Y_i$ cannot exceed $n$. Given $n$, $\{Y_i\}$ are also no longer independent, since the value of one affects the possible range for the others.

For $c$ independent Poisson variates, with $E(Y_i) = \mu_i$, let's derive their conditional distribution given that $\sum Y_i = n$. The conditional probability of a set of counts $\{n_i\}$ satisfying this condition is
\[ P\Bigl( Y_1 = n_1, Y_2 = n_2, \ldots, Y_c = n_c \,\Bigm|\, \textstyle\sum_j Y_j = n \Bigr) = \frac{P(Y_1 = n_1, Y_2 = n_2, \ldots, Y_c = n_c)}{P(\sum_j Y_j = n)} = \frac{\prod_i \exp(-\mu_i)\,\mu_i^{n_i}/n_i!}{\exp(-\sum_j \mu_j)\,(\sum_j \mu_j)^n/n!} = \left( \frac{n!}{\prod_i n_i!} \right) \prod_i \pi_i^{n_i}, \tag{1.5} \]
where $\pi_i = \mu_i/(\sum_j \mu_j)$. This is the multinomial $(n, \{\pi_i\})$ distribution, characterized by the sample size $n$ and the probabilities $\{\pi_i\}$.

Many categorical data analyses assume a multinomial distribution. Such analyses usually have the same parameter estimates as those of analyses assuming a Poisson distribution, because of the similarity in the likelihood functions.

1.3 STATISTICAL INFERENCE FOR CATEGORICAL DATA

The choice of distribution for the response variable is but one step of data analysis. In practice, that distribution has unknown parameter values. In this section we review methods of using sample data to make inferences about the parameters. Sections 1.4 and 1.5 cover binomial and multinomial parameters.

1.3.1 Likelihood Functions and Maximum Likelihood Estimation

In this book we use maximum likelihood for parameter estimation. Under weak regularity conditions, such as the parameter space having fixed dimension with true value falling in its interior, maximum likelihood estimators have desirable properties: They have large-sample normal distributions; they are asymptotically consistent, converging to the parameter as $n$ increases; and they are asymptotically efficient, producing large-sample standard errors no greater than those from other estimation methods.

Given the data, for a chosen probability distribution the likelihood function is the probability of those data, treated as a function of the unknown parameter. The maximum likelihood (ML) estimate is the parameter value that maximizes this function. This is the parameter value under which the data observed have the highest probability of occurrence. The parameter value that maximizes the likelihood function also maximizes the log of that function. It is simpler to maximize the log likelihood since it is a sum rather than a product of terms.

We denote a parameter for a generic problem by $\beta$ and its ML estimate by $\hat\beta$.
The likelihood function is $l(\beta)$ and the log-likelihood function is $L(\beta) = \log[l(\beta)]$. For many models, $L(\beta)$ has concave shape and $\hat\beta$ is the point at which the derivative equals 0. The ML estimate is then the solution of the likelihood equation, $\partial L(\beta)/\partial\beta = 0$. Often, $\beta$ is multidimensional, denoted by $\boldsymbol\beta$, and $\hat{\boldsymbol\beta}$ is the solution of a set of likelihood equations.

Let $\mathrm{SE}$ denote the standard error of $\hat\beta$, and let $\mathrm{cov}(\hat{\boldsymbol\beta})$ denote the asymptotic covariance matrix of $\hat{\boldsymbol\beta}$. Under regularity conditions (Rao 1973, p. 364), $\mathrm{cov}(\hat{\boldsymbol\beta})$ is the inverse of the information matrix. The $(j, k)$ element of the information matrix is
\[ -E\left( \frac{\partial^2 L(\boldsymbol\beta)}{\partial\beta_j\,\partial\beta_k} \right). \tag{1.6} \]
The standard errors are the square roots of the diagonal elements for the inverse information matrix. The greater the curvature of the log likelihood, the smaller the standard errors. This is reasonable, since large curvature implies that the log likelihood drops quickly as $\boldsymbol\beta$ moves away from $\hat{\boldsymbol\beta}$; hence, the data would have been much more likely to occur if $\boldsymbol\beta$ took a value near $\hat{\boldsymbol\beta}$ rather than a value far from $\hat{\boldsymbol\beta}$.

1.3.2 Likelihood Function and ML Estimate for Binomial Parameter

The part of a likelihood function involving the parameters is called the kernel. Since the maximization of the likelihood is with respect to the parameters, the rest is irrelevant. To illustrate, consider the binomial distribution (1.1). The binomial coefficient $\binom{n}{y}$ has no influence on where the maximum occurs with respect to $\pi$. Thus, we ignore it and treat the kernel as the likelihood function. The binomial log likelihood is then
\[ L(\pi) = \log\left[ \pi^y (1-\pi)^{n-y} \right] = y\log(\pi) + (n-y)\log(1-\pi). \tag{1.7} \]
Differentiating with respect to $\pi$ yields
\[ \partial L(\pi)/\partial\pi = \frac{y}{\pi} - \frac{n-y}{1-\pi} = \frac{y - n\pi}{\pi(1-\pi)}. \tag{1.8} \]
Equating this to 0 gives the likelihood equation, which has solution $\hat\pi = y/n$, the sample proportion of successes for the $n$ trials.

Calculating $\partial^2 L(\pi)/\partial\pi^2$, taking the expectation, and combining terms, we get
\[ -E\left[ \frac{\partial^2 L(\pi)}{\partial\pi^2} \right] = E\left[ \frac{y}{\pi^2} + \frac{n-y}{(1-\pi)^2} \right] = \frac{n}{\pi(1-\pi)}. \tag{1.9} \]
Thus, the asymptotic variance of $\hat\pi$ is $\pi(1-\pi)/n$. This is no surprise. Since $E(Y) = n\pi$ and $\mathrm{var}(Y) = n\pi(1-\pi)$, the distribution of $\hat\pi = Y/n$ has mean and standard error
\[ E(\hat\pi) = \pi, \qquad \sigma(\hat\pi) = \sqrt{\frac{\pi(1-\pi)}{n}}. \]

1.3.3 Wald–Likelihood Ratio–Score Test Triad

Three standard ways exist to use the likelihood function to perform large-sample inference. We introduce these for a significance test of a null hypothesis $H_0\colon \beta = \beta_0$ and then discuss their relation to interval estimation. They all exploit the large-sample normality of ML estimators.

With nonnull standard error $\mathrm{SE}$ of $\hat\beta$, the test statistic
\[ z = (\hat\beta - \beta_0)/\mathrm{SE} \]
has an approximate standard normal distribution when $\beta = \beta_0$. One refers $z$ to the standard normal table to obtain one- or two-sided $P$-values. Equivalently, for the two-sided alternative, $z^2$ has a chi-squared null distribution with 1 degree of freedom (df); the $P$-value is then the right-tailed chi-squared probability above the observed value. This type of statistic, using the nonnull standard error, is called a Wald statistic (Wald 1943). The multivariate extension for the Wald test of $H_0\colon \boldsymbol\beta = \boldsymbol\beta_0$ has test statistic
\[ W = (\hat{\boldsymbol\beta} - \boldsymbol\beta_0)' \left[ \mathrm{cov}(\hat{\boldsymbol\beta}) \right]^{-1} (\hat{\boldsymbol\beta} - \boldsymbol\beta_0). \]
(The prime on a vector or matrix denotes the transpose.) The nonnull covariance is based on the curvature (1.6) of the log likelihood at $\hat{\boldsymbol\beta}$. The asymptotic multivariate normal distribution for $\hat{\boldsymbol\beta}$ implies an asymptotic chi-squared distribution for $W$. The df equal the rank of $\mathrm{cov}(\hat{\boldsymbol\beta})$, which is the number of nonredundant parameters in $\boldsymbol\beta$.

A second general-purpose method uses the likelihood function through the ratio of two maximizations: (1) the maximum over the possible parameter values under $H_0$, and (2) the maximum over the larger set of parameter values permitting $H_0$ or an alternative $H_a$ to be true.
Let $l_0$ denote the maximized value of the likelihood function under $H_0$, and let $l_1$ denote the maximized value generally (i.e., under $H_0 \cup H_a$). For instance, for parameter vector $\boldsymbol\beta = (\boldsymbol\beta_0, \boldsymbol\beta_1)$ and $H_0\colon \boldsymbol\beta_0 = \mathbf{0}$, $l_1$ is the likelihood function calculated at the $\boldsymbol\beta$ value for which the data would have been most likely; $l_0$ is the likelihood function calculated at the $\boldsymbol\beta_1$ value for which the data would have been most likely, when $\boldsymbol\beta_0 = \mathbf{0}$. Then $l_1$ is always at least as large as $l_0$, since $l_0$ results from maximizing over a restricted set of the parameter values.

The ratio $\Lambda = l_0/l_1$ of the maximized likelihoods cannot exceed 1. Wilks (1935, 1938) showed that $-2\log\Lambda$ has a limiting null chi-squared distribution, as $n \to \infty$. The df equal the difference in the dimensions of the parameter spaces under $H_0 \cup H_a$ and under $H_0$. The likelihood-ratio test statistic equals
\[ -2\log\Lambda = -2\log(l_0/l_1) = -2(L_0 - L_1), \]
where $L_0$ and $L_1$ denote the maximized log-likelihood functions.

The third method uses the score statistic, due to R. A. Fisher and C. R. Rao. The score test is based on the slope and expected curvature of the log-likelihood function $L(\beta)$ at the null value $\beta_0$. It utilizes the size of the score function
\[ u(\beta) = \partial L(\beta)/\partial\beta, \]
evaluated at $\beta_0$. The value $u(\beta_0)$ tends to be larger in absolute value when $\hat\beta$ is farther from $\beta_0$. Denote $-E[\partial^2 L(\beta)/\partial\beta^2]$ (i.e., the information) evaluated at $\beta_0$ by $\iota(\beta_0)$. The score statistic is the ratio of $u(\beta_0)$ to its null SE, which is $[\iota(\beta_0)]^{1/2}$. This has an approximate standard normal null distribution. The chi-squared form of the score statistic is
\[ \frac{[u(\beta_0)]^2}{\iota(\beta_0)} = \frac{[\partial L(\beta)/\partial\beta_0]^2}{-E[\partial^2 L(\beta)/\partial\beta_0^2]}, \]
where the partial derivative notation reflects derivatives with respect to $\beta$ that are evaluated at $\beta_0$.
In the multiparameter case, the score statistic is a quadratic form based on the vector of partial derivatives of the log likelihood with respect to $\boldsymbol\beta$ and the inverse information matrix, both evaluated at the $H_0$ estimates (i.e., assuming that $\boldsymbol\beta = \boldsymbol\beta_0$).

Figure 1.1 is a generic plot of a log-likelihood $L(\beta)$ for the univariate case. It illustrates the three tests of $H_0\colon \beta = 0$. The Wald test uses the behavior of $L(\beta)$ at the ML estimate $\hat\beta$, having chi-squared form $(\hat\beta/\mathrm{SE})^2$. The SE of $\hat\beta$ depends on the curvature of $L(\beta)$ at $\hat\beta$. The score test is based on the slope and curvature of $L(\beta)$ at $\beta = 0$. The likelihood-ratio test combines information about $L(\beta)$ at both $\hat\beta$ and $\beta_0 = 0$. It compares the log-likelihood values $L_1$ at $\hat\beta$ and $L_0$ at $\beta_0 = 0$ using the chi-squared statistic $-2(L_0 - L_1)$. In Figure 1.1, this statistic is twice the vertical distance between values of $L(\beta)$ at $\hat\beta$ and at 0. In a sense, this statistic uses the most information of the three types of test statistic and is the most versatile.

As $n \to \infty$, the Wald, likelihood-ratio, and score tests have certain asymptotic equivalences (Cox and Hinkley 1974, Sec. 9.3). For small to moderate sample sizes, the likelihood-ratio test is usually more reliable than the Wald test.

[Figure 1.1: Log-likelihood function and information used in three tests of $H_0\colon \beta = 0$.]

1.3.4 Constructing Confidence Intervals

In practice, it is more informative to construct confidence intervals for parameters than to test hypotheses about their values. For any of the three test methods, a confidence interval results from inverting the test. For instance, a 95% confidence interval for $\beta$ is the set of $\beta_0$ for which the test of $H_0\colon \beta = \beta_0$ has a $P$-value exceeding 0.05. Let $z_a$ denote the $z$-score from the standard normal distribution having right-tailed probability $a$; this is the $100(1-a)$ percentile of that distribution. Let $\chi^2_{\mathrm{df}}(a)$ denote the $100(1-a)$
percentile of the chi-squared distribution with df degrees of freedom. $100(1-\alpha)\%$ confidence intervals based on asymptotic normality use $z_{\alpha/2}$, for instance $z_{0.025} = 1.96$ for 95% confidence. The Wald confidence interval is the set of $\beta_0$ for which $|\hat\beta - \beta_0|/\mathrm{SE} < z_{\alpha/2}$. This gives the interval $\hat\beta \pm z_{\alpha/2}(\mathrm{SE})$. The likelihood-ratio-based confidence interval is the set of $\beta_0$ for which $-2[L(\beta_0) - L(\hat\beta)] < \chi_1^2(\alpha)$. [Recall that $\chi_1^2(\alpha) = z_{\alpha/2}^2$.]

When $\hat\beta$ has a normal distribution, the log-likelihood function has a parabolic shape (i.e., a second-degree polynomial). For small samples with categorical data, $\hat\beta$ may be far from normality and the log-likelihood function can be far from a symmetric, parabolic-shaped curve. This can also happen with moderate to large samples when a model contains many parameters. In such cases, inference based on asymptotic normality of $\hat\beta$ may have inadequate performance. A marked divergence in results of Wald and likelihood-ratio inference indicates that the distribution of $\hat\beta$ may not be close to normality. The example in Section 1.4.3 illustrates this with quite different confidence intervals for different methods. In many such cases, inference can instead utilize an exact small-sample distribution or "higher-order" asymptotic methods that improve on simple normality (e.g., Pierce and Peters 1992).

The Wald confidence interval is most common in practice because it is simple to construct using ML estimates and standard errors reported by statistical software. The likelihood-ratio-based interval is becoming more widely available in software and is preferable for categorical data with small to moderate $n$. For the best known statistical model, regression for a normal response, the three types of inference necessarily provide identical results.
1.4 STATISTICAL INFERENCE FOR BINOMIAL PARAMETERS

In this section we illustrate inference methods for categorical data by presenting tests and confidence intervals for the binomial parameter $\pi$, based on $y$ successes in $n$ independent trials. In Section 1.3.2 we obtained the likelihood function and ML estimator $\hat\pi = y/n$ of $\pi$.

1.4.1 Tests about a Binomial Parameter

Consider $H_0\colon \pi = \pi_0$. Since $H_0$ has a single parameter, we use the normal rather than chi-squared forms of Wald and score test statistics. They permit tests against one-sided as well as two-sided alternatives. The Wald statistic is
\[ z_W = \frac{\hat\pi - \pi_0}{\mathrm{SE}} = \frac{\hat\pi - \pi_0}{\sqrt{\hat\pi(1-\hat\pi)/n}}. \tag{1.10} \]
Evaluating the binomial score (1.8) and information (1.9) at $\pi_0$ yields
\[ u(\pi_0) = \frac{y}{\pi_0} - \frac{n-y}{1-\pi_0}, \qquad \iota(\pi_0) = \frac{n}{\pi_0(1-\pi_0)}. \]
The normal form of the score statistic simplifies to
\[ z_S = \frac{u(\pi_0)}{[\iota(\pi_0)]^{1/2}} = \frac{y - n\pi_0}{\sqrt{n\pi_0(1-\pi_0)}} = \frac{\hat\pi - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}}. \tag{1.11} \]
Whereas the Wald statistic $z_W$ uses the standard error evaluated at $\hat\pi$, the score statistic $z_S$ uses it evaluated at $\pi_0$. The score statistic is preferable, as it uses the actual null SE rather than an estimate. Its null sampling distribution is closer to standard normal than that of the Wald statistic.

The binomial log-likelihood function (1.7) equals $L_0 = y\log\pi_0 + (n-y)\log(1-\pi_0)$ under $H_0$ and $L_1 = y\log\hat\pi + (n-y)\log(1-\hat\pi)$ more generally. The likelihood-ratio test statistic simplifies to
\[ -2(L_0 - L_1) = 2\left[ y\log\frac{\hat\pi}{\pi_0} + (n-y)\log\frac{1-\hat\pi}{1-\pi_0} \right]. \]
Expressed as
\[ -2(L_0 - L_1) = 2\left[ y\log\frac{y}{n\pi_0} + (n-y)\log\frac{n-y}{n-n\pi_0} \right], \]
it compares observed success and failure counts to fitted (i.e., null) counts by
\[ 2\sum \text{observed} \times \log\left( \frac{\text{observed}}{\text{fitted}} \right). \tag{1.12} \]
We'll see that this formula also holds for tests about Poisson and multinomial parameters. Since no unknown parameters occur under $H_0$ and one occurs under $H_a$, (1.12) has an asymptotic chi-squared distribution with df = 1.
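The three test statistics of the triad take simple closed forms for the binomial and can be computed in a few lines. A sketch (the data, $y = 9$ successes in $n = 20$ trials, are hypothetical and not from the book):

```python
import math

def binomial_tests(y, n, pi0):
    """Wald, score, and likelihood-ratio statistics for H0: pi = pi0."""
    pihat = y / n
    z_wald = (pihat - pi0) / math.sqrt(pihat * (1 - pihat) / n)   # SE at pihat
    z_score = (pihat - pi0) / math.sqrt(pi0 * (1 - pi0) / n)      # SE at pi0
    # Likelihood-ratio statistic 2 * sum[observed * log(observed/fitted)],
    # with the convention 0 * log(0) = 0.
    lr = 0.0
    for obs, fit in ((y, n * pi0), (n - y, n * (1 - pi0))):
        if obs > 0:
            lr += 2 * obs * math.log(obs / fit)
    return z_wald, z_score, lr

zw, zs, lr = binomial_tests(9, 20, 0.5)
print(f"z_W = {zw:.3f}, z_S = {zs:.3f}, LR = {lr:.3f}")
```

For data this close to the null, the three statistics nearly agree; the differences grow when $\hat\pi$ is far from $\pi_0$ or $n$ is small.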
1.4.2 Confidence Intervals for a Binomial Parameter

A significance test merely indicates whether a particular $\pi$ value (such as $\pi = 0.5$) is plausible. We learn more by using a confidence interval to determine the range of plausible values. Inverting the Wald test statistic gives the interval of $\pi_0$ values for which $|z_W| < z_{\alpha/2}$, or
\[ \hat\pi \pm z_{\alpha/2} \sqrt{\frac{\hat\pi(1-\hat\pi)}{n}}. \tag{1.13} \]
Historically, this was one of the first confidence intervals used for any parameter (Laplace 1812, p. 283). Unfortunately, it performs poorly unless $n$ is very large (e.g., Brown et al. 2001). The actual coverage probability usually falls below the nominal confidence coefficient, much below when $\pi$ is near 0 or 1. A simple adjustment that adds $\frac{1}{2}z_{\alpha/2}^2$ observations of each type to the sample before using this formula performs much better (Problem 1.24).

The score confidence interval contains the $\pi_0$ values for which $|z_S| < z_{\alpha/2}$. Its endpoints are the $\pi_0$ solutions to the equations
\[ (\hat\pi - \pi_0)\Big/\sqrt{\pi_0(1-\pi_0)/n} = \pm z_{\alpha/2}. \]
These are quadratic in $\pi_0$. First discussed by E. B. Wilson (1927), this interval is
\[ \hat\pi\left(\frac{n}{n + z_{\alpha/2}^2}\right) + \frac{1}{2}\left(\frac{z_{\alpha/2}^2}{n + z_{\alpha/2}^2}\right) \pm z_{\alpha/2}\sqrt{\frac{1}{n + z_{\alpha/2}^2}\left[\hat\pi(1-\hat\pi)\left(\frac{n}{n + z_{\alpha/2}^2}\right) + \frac{1}{2}\,\frac{1}{2}\left(\frac{z_{\alpha/2}^2}{n + z_{\alpha/2}^2}\right)\right]}. \]
The midpoint $\tilde\pi$ of the interval is a weighted average of $\hat\pi$ and $\frac{1}{2}$, where the weight $n/(n + z_{\alpha/2}^2)$ given $\hat\pi$ increases as $n$ increases. Combining terms, this midpoint equals $\tilde\pi = (y + z_{\alpha/2}^2/2)/(n + z_{\alpha/2}^2)$. This is the sample proportion for an adjusted sample that adds $z_{\alpha/2}^2$ observations, half of each type. The square of the coefficient of $z_{\alpha/2}$ in this formula is a weighted average of the variance of a sample proportion when $\pi = \hat\pi$ and the variance of a sample proportion when $\pi = \frac{1}{2}$, using the adjusted sample size $n + z_{\alpha/2}^2$ in place of $n$. This interval has much better performance than the Wald interval.

The likelihood-ratio-based confidence interval is more complex computationally, but simple in principle.
It is the set of $\pi_0$ for which the likelihood-ratio test has a $P$-value exceeding $\alpha$. Equivalently, it is the set of $\pi_0$ for which double the log likelihood drops by less than $\chi_1^2(\alpha)$ from its value at the ML estimate $\hat\pi = y/n$.

1.4.3 Proportion of Vegetarians Example

To collect data in an introductory statistics course, recently I gave the students a questionnaire. One question asked each student whether he or she was a vegetarian. Of $n = 25$ students, $y = 0$ answered "yes." They were not a random sample of a particular population, but we use these data to illustrate 95% confidence intervals for a binomial parameter $\pi$.

Since $y = 0$, $\hat\pi = 0/25 = 0$. Using the Wald approach, the 95% confidence interval for $\pi$ is
\[ 0 \pm 1.96\sqrt{(0.0 \times 1.0)/25}, \quad \text{or} \quad (0, 0). \]
When the observation falls at the boundary of the sample space, often Wald methods do not provide sensible answers.

By contrast, the 95% score interval equals $(0.0, 0.133)$. This is a more believable inference. For $H_0\colon \pi = 0.5$, for instance, the score test statistic is $z_S = (0 - 0.5)/\sqrt{(0.5 \times 0.5)/25} = -5.0$, so 0.5 does not fall in the interval. By contrast, for $H_0\colon \pi = 0.10$, $z_S = (0 - 0.10)/\sqrt{(0.10 \times 0.90)/25} = -1.67$, so 0.10 falls in the interval.

When $y = 0$ and $n = 25$, the kernel of the likelihood function is $l(\pi) = \pi^0(1-\pi)^{25} = (1-\pi)^{25}$. The log likelihood (1.7) is $L(\pi) = 25\log(1-\pi)$. Note that $L(\hat\pi) = L(0) = 0$. The 95% likelihood-ratio confidence interval is the set of $\pi_0$ for which the likelihood-ratio statistic
\[ -2(L_0 - L_1) = -2[L(\pi_0) - L(\hat\pi)] = -50\log(1-\pi_0) \leq \chi_1^2(0.05) = 3.84. \]
The upper bound is $1 - \exp(-3.84/50) = 0.074$, and the confidence interval equals $(0.0, 0.074)$. [In this book, we use the natural logarithm throughout, so its inverse is the exponential function $\exp(x) = e^x$.] Figure 1.2 shows the likelihood and log-likelihood functions and the corresponding confidence region for $\pi$.
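The three intervals in this example can be reproduced directly from the formulas above; a minimal sketch:

```python
import math

z = 1.96  # z_{0.025}
y, n = 0, 25
pihat = y / n

# Wald interval (1.13): degenerate at (0, 0) when y = 0.
half = z * math.sqrt(pihat * (1 - pihat) / n)
wald = (pihat - half, pihat + half)

# Score (Wilson) interval, written with numerator and denominator
# both multiplied by n: (y + z^2/2 +/- z*sqrt(n*pihat*(1-pihat) + z^2/4))
# divided by (n + z^2).
mid = (y + z**2 / 2) / (n + z**2)
halfw = (z / (n + z**2)) * math.sqrt(n * pihat * (1 - pihat) + z**2 / 4)
score = (mid - halfw, mid + halfw)

# Likelihood-ratio interval when y = 0: the upper bound solves
# -50 * log(1 - pi0) = 3.84.
lr_upper = 1 - math.exp(-3.84 / 50)

print("Wald:", wald)
print("Score:", score)
print("LR upper bound:", lr_upper)
```

The score upper bound is about 0.133 and the likelihood-ratio upper bound about 0.074, matching the values quoted in the text.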
The three large-sample methods yield quite different results. When $\pi$ is near 0, the sampling distribution of $\hat\pi$ is highly skewed to the right for small $n$. It is worth considering alternative methods not requiring asymptotic approximations.

[Figure 1.2: Binomial likelihood and log likelihood when $y = 0$ in $n = 25$ trials, and confidence interval for $\pi$.]

1.4.4 Exact Small-Sample Inference*

With modern computational power, it is not necessary to rely on large-sample approximations for the distribution of statistics such as $\hat\pi$. Tests and confidence intervals can use the binomial distribution directly rather than its normal approximation. Such inferences occur naturally for small samples, but apply for any $n$.

We illustrate by testing $H_0\colon \pi = 0.5$ against $H_a\colon \pi \neq 0.5$ for the survey results on vegetarianism, $y = 0$ with $n = 25$. We noted that the score statistic equals $z = -5.0$. The exact $P$-value for this statistic, based on the null $\mathrm{bin}(25, 0.5)$ distribution, is
\[ P(|z| \geq 5.0) = P(Y = 0 \text{ or } Y = 25) = 0.5^{25} + 0.5^{25} = 0.00000006. \]

$100(1-\alpha)\%$ confidence intervals consist of all $\pi_0$ for which $P$-values exceed $\alpha$ in exact binomial tests. The best known interval (Clopper and Pearson 1934) uses the tail method for forming confidence intervals. It requires each one-sided $P$-value to exceed $\alpha/2$. The lower and upper endpoints are the solutions in $\pi_0$ to the equations
\[ \sum_{k=y}^{n} \binom{n}{k} \pi_0^k (1-\pi_0)^{n-k} = \alpha/2 \quad \text{and} \quad \sum_{k=0}^{y} \binom{n}{k} \pi_0^k (1-\pi_0)^{n-k} = \alpha/2, \]
except that the lower bound is 0 when $y = 0$ and the upper bound is 1 when $y = n$. When $y = 1, 2, \ldots, n-1$, from connections between binomial sums and the incomplete beta function and related cumulative distribution functions (cdf's) of beta and $F$ distributions, the confidence interval equals
\[ \left[ 1 + \frac{n - y + 1}{y\, F_{2y,\, 2(n-y+1)}(1 - \alpha/2)} \right]^{-1} < \pi < \left[ 1 + \frac{n - y}{(y + 1)\, F_{2(y+1),\, 2(n-y)}(\alpha/2)} \right]^{-1}, \]
where $F_{a,b}(c)$
denotes the $1 - c$ quantile of the $F$ distribution with degrees of freedom $a$ and $b$. When $y = 0$ with $n = 25$, the Clopper–Pearson 95% confidence interval for $\pi$ is $(0.0, 0.137)$.

In principle this approach seems ideal. However, there is a serious complication. Because of discreteness, the actual coverage probability for any $\pi$ is at least as large as the nominal confidence level (Casella and Berger 2001, p. 434; Neyman 1935) and it can be much greater. Similarly, for a test of $H_0\colon \pi = \pi_0$ at a fixed desired size $\alpha$ such as 0.05, it is not usually possible to achieve that size. There is a finite number of possible samples, and hence a finite number of possible $P$-values, of which 0.05 may not be one. In testing $H_0$ with fixed $\pi_0$, one can pick a particular $\alpha$ that can occur as a $P$-value. For interval estimation, however, this is not an option. This is because constructing the interval corresponds to inverting an entire range of $\pi_0$ values in $H_0\colon \pi = \pi_0$, and each distinct $\pi_0$ value can have its own set of possible $P$-values; that is, there is not a single null parameter value $\pi_0$ as in one test. For any fixed parameter value, the actual coverage probability can be much larger than the nominal confidence level.

[Figure 1.3: Plot of coverage probabilities for nominal 95% confidence intervals for binomial parameter $\pi$ when $n = 25$.]

When $n = 25$, Figure 1.3 plots the coverage probabilities as a function of $\pi$ for the Clopper–Pearson method, the score method, and the Wald method. At a fixed $\pi$ value with a given method, the coverage probability is the sum of the binomial probabilities of all those samples for which the resulting interval contains that $\pi$. There are 26 possible samples and 26 corresponding confidence intervals, so the coverage probability is a sum of somewhere between 0 and 26 binomial probabilities.
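This coverage calculation is easy to program. A sketch of the computation behind Figure 1.3 for $n = 25$, using SciPy (the Clopper–Pearson endpoints are computed via the equivalent beta-quantile form rather than the $F$-quantile form given above):

```python
import math
from scipy.stats import binom, beta

n, z, alpha = 25, 1.96, 0.05

def wald(y):
    p = y / n
    h = z * math.sqrt(p * (1 - p) / n)
    return p - h, p + h

def score(y):
    mid = (y + z**2 / 2) / (n + z**2)
    h = (z / (n + z**2)) * math.sqrt(y * (n - y) / n + z**2 / 4)
    return mid - h, mid + h

def clopper_pearson(y):
    lo = 0.0 if y == 0 else beta.ppf(alpha / 2, y, n - y + 1)
    hi = 1.0 if y == n else beta.ppf(1 - alpha / 2, y + 1, n - y)
    return lo, hi

def coverage(interval, pi):
    # Sum the binomial probabilities of the samples y whose interval covers pi.
    return sum(binom.pmf(y, n, pi) for y in range(n + 1)
               if interval(y)[0] <= pi <= interval(y)[1])

for pi in (0.1, 0.3, 0.5):
    print(pi, [round(coverage(f, pi), 4) for f in (wald, score, clopper_pearson)])
```

Running this reproduces the qualitative pattern of Figure 1.3: the Wald coverage falls below 0.95 while the Clopper–Pearson coverage never does.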
As $\pi$ moves from 0 to 1, this coverage probability jumps up or down whenever $\pi$ moves into or out of one of these intervals. Figure 1.3 shows that coverage probabilities are too low for the Wald method, whereas the Clopper–Pearson method errs in the opposite direction. The score method behaves well, except for some $\pi$ values close to 0 or 1. Its coverage probabilities tend to be near the nominal level, not being consistently conservative or liberal. This is a good method unless $\pi$ is very close to 0 or 1 (Problem 1.23).

In discrete problems using small-sample distributions, shorter confidence intervals usually result from inverting a single two-sided test rather than two one-sided tests. The interval is then the set of parameter values for which the $P$-value of a two-sided test exceeds $\alpha$. For the binomial parameter, see Blaker (2000), Blyth and Still (1983), and Sterne (1954) for methods. For observed outcome $y_o$, with Blaker's approach the $P$-value is the minimum of the two one-tailed binomial probabilities $P(Y \geq y_o)$ and $P(Y \leq y_o)$ plus an attainable probability in the other tail that is as close as possible to, but not greater than, that one-tailed probability. The interval is computationally more complex, although available in software (Blaker gave S-Plus functions). The result is still conservative, but less so than the Clopper–Pearson interval. For the vegetarianism example, the 95% confidence interval using the Blaker exact method is $(0.0, 0.128)$, compared to the Clopper–Pearson interval of $(0.0, 0.137)$.

1.4.5 Inference Based on the Mid-P-Value*

To adjust for discreteness in small-sample distributions, one can base inference on the mid-P-value (Lancaster 1961). For a test statistic $T$ with observed value $t_o$ and one-sided $H_a$ such that large $T$ contradicts $H_0$,
\[ \text{mid-}P\text{-value} = \tfrac{1}{2}P(T = t_o) + P(T > t_o), \]
with probabilities calculated from the null distribution.
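The definition just given is easy to apply to the vegetarian data. A sketch using the lower-tail analog (the observed $y = 0$ is the smallest possible value, and doubling gives a two-sided version by the symmetry of $\mathrm{bin}(25, 0.5)$):

```python
from scipy.stats import binom

y, n, pi0 = 0, 25, 0.5

# Ordinary two-sided exact P-value: P(Y = 0) + P(Y = 25).
p_exact = binom.pmf(0, n, pi0) + binom.pmf(n, n, pi0)

# One-sided lower-tail mid-P-value: (1/2) P(Y = y) + P(Y < y),
# doubled for the two-sided version.
mid_p = 2 * (0.5 * binom.pmf(y, n, pi0) + binom.cdf(y - 1, n, pi0))

print(p_exact, mid_p)
```

Since $y = 0$ is the most extreme possible outcome, $P(Y < y) = 0$ and the mid-$P$-value is exactly half the ordinary $P$-value, as noted in the text.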
Thus, the mid-$P$-value is less than the ordinary $P$-value by half the probability of the observed result.

Compared to the ordinary $P$-value, the mid-$P$-value behaves more like the $P$-value for a test statistic having a continuous distribution. The sum of its two one-sided $P$-values equals 1.0. Although discrete, under $H_0$ its null distribution is more like the uniform distribution that occurs in the continuous case. For instance, it has a null expected value of 0.5, whereas this expected value exceeds 0.5 for the ordinary $P$-value for a discrete test statistic.

Unlike an exact test with ordinary $P$-value, a test using the mid-$P$-value does not guarantee that the probability of type I error is no greater than a nominal value (Problem 1.19). However, it usually performs well, typically being a bit conservative. It is less conservative than the ordinary exact test. Similarly, one can form less conservative confidence intervals by inverting tests using the exact distribution with the mid-$P$-value (e.g., the 95% confidence interval is the set of parameter values for which the mid-$P$-value exceeds 0.05).

For testing $H_0\colon \pi = 0.5$ against $H_a\colon \pi \neq 0.5$ in the example about the proportion of vegetarians, with $y = 0$ for $n = 25$, the result observed is the most extreme possible. Thus the mid-$P$-value is half the ordinary $P$-value, or 0.00000003. Using the Clopper–Pearson inversion of the exact binomial test but with the mid-$P$-value yields a 95% confidence interval of $(0.000, 0.113)$ for $\pi$, compared to $(0.000, 0.137)$ for the ordinary Clopper–Pearson interval.

The mid-$P$-value seems a sensible compromise between having overly conservative inference and using irrelevant randomization to eliminate problems from discreteness. We recommend it both for tests and confidence intervals with highly discrete distributions.

1.5 STATISTICAL INFERENCE FOR MULTINOMIAL PARAMETERS

We now present inference for multinomial parameters $\{\pi_j\}$.
Of $n$ observations, $n_j$ occur in category $j$, $j = 1, \ldots, c$.

1.5.1 Estimation of Multinomial Parameters

First, we obtain ML estimates of $\{\pi_j\}$. As a function of $\{\pi_j\}$, the multinomial probability mass function (1.2) is proportional to the kernel
\[ \prod_j \pi_j^{n_j}, \qquad \text{where all } \pi_j \geq 0 \text{ and } \sum_j \pi_j = 1. \tag{1.14} \]
The ML estimates are the $\{\pi_j\}$ that maximize (1.14).

The multinomial log-likelihood function is
\[ L(\boldsymbol\pi) = \sum_j n_j \log\pi_j. \]
To eliminate redundancies, we treat $L$ as a function of $(\pi_1, \ldots, \pi_{c-1})$, since $\pi_c = 1 - (\pi_1 + \cdots + \pi_{c-1})$. Thus, $\partial\pi_c/\partial\pi_j = -1$, $j = 1, \ldots, c-1$. Since
\[ \frac{\partial\log\pi_c}{\partial\pi_j} = \frac{1}{\pi_c}\,\frac{\partial\pi_c}{\partial\pi_j} = -\frac{1}{\pi_c}, \]
differentiating $L(\boldsymbol\pi)$ with respect to $\pi_j$ gives the likelihood equation
\[ \frac{\partial L(\boldsymbol\pi)}{\partial\pi_j} = \frac{n_j}{\pi_j} - \frac{n_c}{\pi_c} = 0. \]
The ML solution satisfies $\hat\pi_j/\hat\pi_c = n_j/n_c$. Now
\[ 1 = \sum_j \hat\pi_j = \hat\pi_c\left(\sum_j \frac{n_j}{n_c}\right) = \frac{\hat\pi_c\, n}{n_c}, \]
so $\hat\pi_c = n_c/n$ and then $\hat\pi_j = n_j/n$. From general results presented later in the book (Section 8.6), this solution does maximize the likelihood. Thus, the ML estimates of $\{\pi_j\}$ are the sample proportions.

1.5.2 Pearson Statistic for Testing a Specified Multinomial

In 1900 the eminent British statistician Karl Pearson introduced a hypothesis test that was one of the first inferential methods. It had a revolutionary impact on categorical data analysis, which had focused on describing associations. Pearson's test evaluates whether multinomial parameters equal certain specified values. His original motivation in developing this test was to analyze whether possible outcomes on a particular Monte Carlo roulette wheel were equally likely (Stigler 1986).

Consider $H_0\colon \pi_j = \pi_{j0}$, $j = 1, \ldots, c$, where $\sum_j \pi_{j0} = 1$. When $H_0$ is true, the expected values of $\{n_j\}$, called expected frequencies, are $\mu_j = n\pi_{j0}$, $j = 1, \ldots, c$. Pearson proposed the test statistic
\[ X^2 = \sum_j \frac{(n_j - \mu_j)^2}{\mu_j}. \tag{1.15} \]
Greater differences $\{n_j - \mu_j\}$ produce greater $X^2$ values, for fixed $n$. Let $X_o^2$ denote the observed value of $X^2$.
The $P$-value is the null value of $P(X^2 \geq X_o^2)$. This equals the sum of the null multinomial probabilities of all count arrays (having a sum of $n$) with $X^2 \geq X_o^2$. For large samples, $X^2$ has approximately a chi-squared distribution with df $= c - 1$. The $P$-value is approximated by $P(\chi^2_{c-1} \geq X_o^2)$, where $\chi^2_{c-1}$ denotes a chi-squared random variable with df $= c - 1$. Statistic (1.15) is called the Pearson chi-squared statistic.

1.5.3 Example: Testing Mendel's Theories

Among its many applications, Pearson's test was used in genetics to test Mendel's theories of natural inheritance. Mendel crossed pea plants of pure yellow strain with plants of pure green strain. He predicted that second-generation hybrid seeds would be 75% yellow and 25% green, yellow being the dominant strain. One experiment produced $n = 8023$ seeds, of which $n_1 = 6022$ were yellow and $n_2 = 2001$ were green. The expected frequencies for $H_0\colon \pi_{10} = 0.75$, $\pi_{20} = 0.25$ are $\mu_1 = 8023(0.75) = 6017.25$ and $\mu_2 = 2005.75$. The Pearson statistic $X^2 = 0.015$ (df = 1) has a $P$-value of $P = 0.90$. This does not contradict Mendel's hypothesis.

Mendel performed several experiments of this type. In 1936, R. A. Fisher summarized Mendel's results. He used the reproductive property of chi-squared: If $X_1^2, \ldots, X_k^2$ are independent chi-squared statistics with degrees of freedom $\nu_1, \ldots, \nu_k$, then $\sum_i X_i^2$ has a chi-squared distribution with df $= \sum_i \nu_i$. Fisher obtained a summary chi-squared statistic equal to 42, with df = 84. A chi-squared distribution with df = 84 has mean 84 and standard deviation $(2 \times 84)^{1/2} = 13.0$, and the right-tailed probability above 42 is $P = 0.99996$. In other words, the chi-squared statistic was so small that the fit seemed too good.

Fisher commented: "The general level of agreement between Mendel's expectations and his reported results shows that it is closer than would be expected in the best of several thousand repetitions. . . .
I have no doubt that Mendel was deceived by a gardening assistant, who knew only too well what his principal expected from each trial made.'' In a letter written at the time (see Box 1978, p. 297), he stated: ''Now, when data have been faked, I know very well how generally people underestimate the frequency of wide chance deviations, so that the tendency is always to make them agree too well with expectations.''

In summary, goodness-of-fit tests can reveal not only when a fit is inadequate, but also when it is better than random fluctuations would have us expect. [R. A. Fisher's daughter, Joan Fisher Box (1978, pp. 295-300), and Freedman et al. (1978, pp. 420-428, 478) discussed Fisher's analysis of Mendel's data and the accompanying controversy. Despite possible difficulties with Mendel's data, subsequent work led to general acceptance of his theories.]

1.5.4 Chi-Squared Theoretical Justification*

We now outline why Pearson's statistic has a limiting chi-squared distribution. For a multinomial sample (n_1, . . . , n_c) of size n, the marginal distribution of n_j is the bin(n, π_j) distribution. For large n, by the normal approximation to the binomial, n_j (and π̂_j = n_j/n) have approximate normal distributions. More generally, by the central limit theorem, the sample proportions π̂ = (n_1/n, . . . , n_{c−1}/n)′ have an approximate multivariate normal distribution (Section 14.1.4). Let Σ_0 denote the null covariance matrix of √n π̂, and let π_0 = (π_{10}, . . . , π_{c−1,0})′. Under H_0, since √n(π̂ − π_0) converges to a N(0, Σ_0) distribution, the quadratic form

    n(π̂ − π_0)′ Σ_0^{−1} (π̂ − π_0)    (1.16)

has distribution converging to chi-squared with df = c − 1. In Section 14.1.4 we show that the covariance matrix of √n π̂ has elements

    σ_{jk} = −π_j π_k  if j ≠ k,   σ_{jj} = π_j(1 − π_j)  if j = k.

The matrix Σ_0^{−1} has (j, k)th element 1/π_{c0} when j ≠ k and (1/π_{j0} + 1/π_{c0}) when j = k. (You can verify this by showing that Σ_0 Σ_0^{−1} equals the identity matrix.)
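One can verify numerically both the stated form of Σ_0^{−1} and the claim that quadratic form (1.16) equals X²; this sketch (not from the text) uses hypothetical counts with c = 3, so the matrices are 2 × 2:

```python
n = 100
pi0 = [0.2, 0.3, 0.5]                  # hypothetical null probabilities
counts = [25, 28, 47]                  # hypothetical observed counts
phat = [cnt / n for cnt in counts]
c = len(pi0)

# Null covariance matrix of sqrt(n) * pi-hat (first c - 1 coordinates).
S = [[pi0[j] * (1 - pi0[j]) if j == k else -pi0[j] * pi0[k]
      for k in range(c - 1)] for j in range(c - 1)]
# Claimed inverse: 1/pi_c0 off-diagonal, 1/pi_j0 + 1/pi_c0 on the diagonal.
Sinv = [[1 / pi0[j] + 1 / pi0[c - 1] if j == k else 1 / pi0[c - 1]
         for k in range(c - 1)] for j in range(c - 1)]

# Check that S @ Sinv equals the identity matrix.
for j in range(c - 1):
    for k in range(c - 1):
        prod = sum(S[j][m] * Sinv[m][k] for m in range(c - 1))
        assert abs(prod - (1.0 if j == k else 0.0)) < 1e-12

# Quadratic form (1.16) equals Pearson's X^2 (1.15).
d = [phat[j] - pi0[j] for j in range(c - 1)]
quad = n * sum(d[j] * Sinv[j][k] * d[k]
               for j in range(c - 1) for k in range(c - 1))
x2 = sum((counts[j] - n * pi0[j]) ** 2 / (n * pi0[j]) for j in range(c))
assert abs(quad - x2) < 1e-10
print(round(quad, 5), round(x2, 5))
```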
With this substitution, direct calculation (with appropriate combining of terms) shows that (1.16) simplifies to X². In Section 14.3 we provide a formal proof in a more general setting.

This argument is similar to Pearson's in 1900. R. A. Fisher (1922) gave a simpler justification, the gist of which follows: Suppose that (n_1, . . . , n_c) are independent Poisson random variables with means (μ_1, . . . , μ_c). For large {μ_j}, the standardized values {z_j = (n_j − μ_j)/√μ_j} have approximate standard normal distributions. Thus, Σ_j z_j² = X² has an approximate chi-squared distribution with c degrees of freedom. Adding the single linear constraint Σ_j (n_j − μ_j) = 0, thus converting the Poisson distributions to a multinomial, we lose a degree of freedom.

When c = 2, Pearson's X² simplifies to the square of the normal score statistic (1.11). For Mendel's data, π̂_1 = 6022/8023, π_{10} = 0.75, n = 8023, and z_S = 0.123, for which X² = (0.123)² = 0.015. In fact, for general c the Pearson test is the score test about multinomial parameters.

1.5.5 Likelihood-Ratio Chi-Squared

An alternative test for multinomial parameters uses the likelihood-ratio test. The kernel of the multinomial likelihood is (1.14). Under H_0 the likelihood is maximized when π̂_j = π_{j0}. In the general case, it is maximized when π̂_j = n_j/n. The ratio of the likelihoods equals

    Λ = ∏_j (π_{j0})^{n_j} / ∏_j (n_j/n)^{n_j}.

Thus, the likelihood-ratio statistic, denoted by G², is

    G² = −2 log Λ = 2 Σ_j n_j log(n_j / nπ_{j0}).    (1.17)

This statistic, which has form (1.12), is called the likelihood-ratio chi-squared statistic. The larger the value of G², the greater the evidence against H_0.

In the general case, the parameter space consists of {π_j} subject to Σ_j π_j = 1, so the dimensionality is c − 1. Under H_0, the {π_j} are specified completely, so the dimension is 0. The difference in these dimensions equals (c − 1).
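For Mendel's data the two statistics nearly coincide; a sketch using only the standard library, with the df = 1 tail probability computed from the fact that a chi-squared with df = 1 is the square of a standard normal, so P(χ²₁ ≥ x) = erfc(√(x/2)):

```python
import math

counts, pi0 = [6022, 2001], [0.75, 0.25]   # yellow, green seeds
n = sum(counts)

# Pearson statistic (1.15) and likelihood-ratio statistic (1.17).
x2 = sum((nj - n * pj) ** 2 / (n * pj) for nj, pj in zip(counts, pi0))
g2 = 2 * sum(nj * math.log(nj / (n * pj)) for nj, pj in zip(counts, pi0))

# P(chi-squared_1 >= x) via the standard normal complementary error function.
p_value = math.erfc(math.sqrt(x2 / 2.0))

print(round(x2, 3), round(g2, 3), round(p_value, 2))   # 0.015 0.015 0.9
```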
For large n, G² has a chi-squared null distribution with df = c − 1.

When H_0 holds, the Pearson X² and the likelihood ratio G² both have asymptotic chi-squared distributions with df = c − 1. In fact, they are asymptotically equivalent in that case; specifically, X² − G² converges in probability to zero (Section 14.3.4). When H_0 is false, they tend to grow proportionally to n; they need not take similar values, however, even for very large n. For fixed c, as n increases the distribution of X² usually converges to chi-squared more quickly than that of G². The chi-squared approximation is usually poor for G² when n/c < 5. When c is large, it can be decent for X² for n/c as small as 1 if the table does not contain both very small and moderately large expected frequencies. We provide further guidelines in Section 9.8.4. Alternatively, one can use the multinomial probabilities to generate exact distributions of these test statistics (Good et al. 1970).

1.5.6 Testing with Estimated Expected Frequencies

Pearson's X² (1.15) compares a sample distribution to a hypothetical one {π_{j0}}. In some applications, {π_{j0} = π_{j0}(θ)} are functions of a smaller set of unknown parameters θ. ML estimates θ̂ of θ determine ML estimates {π_{j0}(θ̂)} of {π_{j0}} and hence ML estimates {μ̂_j = nπ_{j0}(θ̂)} of expected frequencies in X². Replacing {μ_j} by estimates {μ̂_j} affects the distribution of X². When dim(θ) = p, the true df = (c − 1) − p (Section 14.3.3). Pearson failed to realize this (Section 16.2).

We now show a goodness-of-fit test with estimated expected frequencies. A sample of 156 dairy calves born in Okeechobee County, Florida, were classified according to whether they caught pneumonia within 60 days of birth. Calves that got a pneumonia infection were also classified according to whether they got a secondary infection within 2 weeks after the first infection cleared up. Table 1.1 shows the data.
Calves that did not get a primary infection could not get a secondary infection, so no observations can fall in the category for ''no'' primary infection and ''yes'' secondary infection. That combination is called a structural zero.

A goal of this study was to test whether the probability of primary infection was the same as the conditional probability of secondary infection, given that the calf got the primary infection. In other words, if π_{ab} denotes the probability that a calf is classified in row a and column b of this table, the null hypothesis is

    H_0: π_11 + π_12 = π_11 / (π_11 + π_12),  or  π_11 = (π_11 + π_12)².

Let π = π_11 + π_12 denote the probability of primary infection. The null hypothesis states that the probabilities satisfy the structure that Table 1.2 shows; that is, probabilities in a trinomial for the categories (yes-yes, yes-no, no-no) for primary-secondary infection equal (π², π(1 − π), 1 − π). Let n_{ab} denote the number of observations in category (a, b). The ML estimate of π is the value maximizing the kernel of the multinomial likelihood

    (π²)^{n_11} (π − π²)^{n_12} (1 − π)^{n_22}.

TABLE 1.1  Primary and Secondary Pneumonia Infections in Calves

                     Secondary Infection^a
Primary Infection    Yes          No
Yes                  30 (38.1)    63 (39.0)
No                   0 (--)       63 (78.9)

Source: Data courtesy of Thang Tran and G. A. Donovan, College of Veterinary Medicine, University of Florida.
^a Values in parentheses are estimated expected frequencies.

TABLE 1.2  Probability Structure for Hypothesis

                     Secondary Infection
Primary Infection    Yes    No          Total
Yes                  π²     π(1 − π)    π
No                   --     1 − π       1 − π

The log likelihood is

    L(π) = n_11 log π² + n_12 log(π − π²) + n_22 log(1 − π).

Differentiation with respect to π gives the likelihood equation

    2n_11/π + n_12/π − n_12/(1 − π) − n_22/(1 − π) = 0.

The solution is

    π̂ = (2n_11 + n_12) / (2n_11 + 2n_12 + n_22).

For Table 1.1, π̂ = 0.494. Since n = 156, the estimated expected frequencies are μ̂_11 = nπ̂² = 38.1, μ̂_12 = n(π̂ − π̂²) = 39.0, and μ̂_22 = n(1 − π̂) = 78.9. Table 1.1 shows them. Pearson's statistic is X² = 19.7. Since the c = 3 possible responses have p = 1 parameter (π) determining the expected frequencies, df = (3 − 1) − 1 = 1. There is strong evidence against H_0 (P = 0.00001). Inspection of Table 1.1 reveals that many more calves got a primary infection but not a secondary infection than H_0 predicts. The researchers concluded that the primary infection had an immunizing effect that reduced the likelihood of a secondary infection.

NOTES

Section 1.1: Categorical Response Data

1.1. Stevens (1951) defined (nominal, ordinal, interval) scales of measurement. Other scales result from mixtures of these types. For instance, partially ordered scales occur when subjects respond to questions having categories ordered except for don't know or undecided categories.

Section 1.3: Statistical Inference for Categorical Data

1.2. The score method does not use β̂. Thus, when β is a model parameter, one can usually compute the score statistic for testing H_0: β = β_0 without fitting the model. This is advantageous when fitting several models in an exploratory analysis and model fitting is computationally intensive. An advantage of the score and likelihood-ratio methods is that they apply even when |β̂| = ∞. In that case, one cannot compute the Wald statistic. Another disadvantage of the Wald method is that its results depend on the parameterization; inference based on β̂ and its SE is not equivalent to inference based on a nonlinear function of it, such as log β̂ and its SE.

Section 1.4: Statistical Inference for Binomial Parameters

1.3. Among others, Agresti and Coull (1998), Blyth and Still (1983), Brown et al. (2001), Ghosh (1979), and Newcombe (1998a) showed the superiority of the score interval to the Wald interval for π. Of the ''exact'' methods, Blaker's (2000) has particularly good properties. It is contained in the Clopper-Pearson interval and has a nestedness property whereby an interval of higher nominal confidence level necessarily contains one of lower level.

1.4. Using continuity corrections with large-sample methods provides approximations to exact small-sample methods. Thus, they tend to behave conservatively. We do not present them, since if one prefers an exact method, with modern computational power it can be used directly rather than approximated.

1.5. In theory, one can eliminate problems with discreteness in tests by performing a supplementary randomization on the boundary of a critical region (see Problem 1.19). In rejecting the null at the boundary with a certain probability, one can obtain a fixed overall type I error probability α even when it is not an achievable P-value. For such randomization, the one-sided P-value is

    randomized P-value = U × P(T = t_o) + P(T > t_o),

where U denotes a uniform(0, 1) random variable (Stevens 1950). In practice, this is not used, as it is absurd to let this random number influence a decision. The mid P-value replaces the arbitrary uniform multiple U × P(T = t_o) by its expected value.

Section 1.5: Statistical Inference for Multinomial Parameters

1.6. The chi-squared distribution has mean df, variance 2 df, and skewness (8/df)^{1/2}. It is approximately normal when df is large. Greenwood and Nikulin (1996), Kendall and Stuart (1979), and Lancaster (1969) presented other properties. Cochran (1952) presented a historical survey of chi-squared tests of fit. See also Cressie and Read (1989), Koch and Bhapkar (1982), Koehler (1998), and Moore (1986b).

PROBLEMS

Applications

1.1 Identify each variable as nominal, ordinal, or interval.
a. UK political party preference (Labour, Conservative, Social Democrat)
b. Anxiety rating (none, mild, moderate, severe, very severe)
c. Patient survival (in number of months)
d. Clinic location (London, Boston, Madison, Rochester, Montreal)
e. Response of tumor to chemotherapy (complete elimination, partial reduction, stable, growth progression)
f. Favorite beverage (water, juice, milk, soft drink, beer, wine)
g. Appraisal of company's inventory level (too low, about right, too high)

1.2 Each of 100 multiple-choice questions on an exam has four possible answers, one of which is correct. For each question, a student guesses by selecting an answer randomly.
a. Specify the distribution of X, the student's number of correct answers.
b. Find the mean and standard deviation of that distribution. Would it be surprising if the student made at least 50 correct responses? Why?
c. Specify the distribution of (n_1, n_2, n_3, n_4), where n_j is the number of times the student picked choice j.
d. Find E(n_j), var(n_j), cov(n_j, n_k), and corr(n_j, n_k).

1.3 An experiment studies the number of insects that survive a certain dose of an insecticide, using several batches of insects of size n each. The insects are sensitive to factors that vary among batches during the experiment but were not measured, such as temperature level. Explain why the distribution of the number of insects per batch surviving the experiment might show overdispersion relative to a bin(n, π) distribution.

1.4 In his autobiography A Sort of Life, British author Graham Greene described a period of severe mental depression during which he played Russian roulette. This ''game'' consists of putting a bullet in one of the six chambers of a pistol, spinning the chambers to select one at random, and then firing the pistol once at one's head.
a. Greene played this game six times and was lucky that none of them resulted in a bullet firing. Find the probability of this outcome.
b. Suppose that he had kept playing this game until the bullet fired. Let Y denote the number of the game on which it fires. Show the probability mass function for Y, and justify.
1.5 Consider the statement, ''Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if she is married and does not want any more children.'' For the 1996 General Social Survey, conducted by the National Opinion Research Center (NORC), 842 replied ''yes'' and 982 replied ''no.'' Let π denote the population proportion who would reply ''yes.'' Find the P-value for testing H_0: π = 0.5 using the score test, and construct a 95% confidence interval for π. Interpret the results.

1.6 Refer to the vegetarianism example in Section 1.4.3. For testing H_0: π = 0.5 against H_a: π ≠ 0.5, show that:
a. The likelihood-ratio statistic equals 2[25 log(25/12.5)] = 34.7.
b. The chi-squared form of the score statistic equals 25.0.
c. The Wald z or chi-squared statistic is infinite.

1.7 In a crossover trial comparing a new drug to a standard, π denotes the probability that the new one is judged better. It is desired to estimate π and test H_0: π = 0.5 against H_a: π ≠ 0.5. In 20 independent observations, the new drug is better each time.
a. Find and sketch the likelihood function. Give the ML estimate of π.
b. Conduct a Wald test and construct a 95% Wald confidence interval for π. Are these sensible?
c. Conduct a score test, reporting the P-value. Construct a 95% score confidence interval. Interpret.
d. Conduct a likelihood-ratio test and construct a likelihood-based 95% confidence interval. Interpret.
e. Construct an exact binomial test and 95% confidence interval. Interpret.
f. Suppose that researchers wanted a sufficiently large sample to estimate the probability of preferring the new drug to within 0.05, with confidence 0.95. If the true probability is 0.90, about how large a sample is needed?

1.8 In an experiment on chlorophyll inheritance in maize, for 1103 seedlings of self-fertilized heterozygous green plants, 854 seedlings were green and 249 were yellow. Theory predicts the ratio of green to yellow is 3:1.
Test the hypothesis that 3:1 is the true ratio. Report the P-value, and interpret.

1.9 Table 1.3 contains Ladislaus von Bortkiewicz's data on deaths of soldiers in the Prussian army from kicks by army mules (Fisher 1934; Quine and Seneta 1987). The data refer to 10 army corps, each observed for 20 years. In 109 corps-years of exposure, there were no deaths, in 65 corps-years there was one death, and so on. Estimate the mean and test whether probabilities of occurrences in these five categories follow a Poisson distribution (truncated for 4 and above).

TABLE 1.3  Data for Problem 1.9

Number of Deaths       0    1   2   3  4  ≥5
Number of Corps-Years  109  65  22  3  1  0

1.10 A sample of 100 women suffer from dysmenorrhea. A new analgesic is claimed to provide greater relief than a standard one. After using each analgesic in a crossover experiment, 40 reported greater relief with the standard analgesic and 60 reported greater relief with the new one. Analyze these data.

Theory and Methods

1.11 Why is it easier to get a precise estimate of the binomial parameter π when it is near 0 or 1 than when it is near 1/2?

1.12 Suppose that P(Y_i = 1) = 1 − P(Y_i = 0) = π, i = 1, . . . , n, where {Y_i} are independent. Let Y = Σ_i Y_i.
a. What are var(Y) and the distribution of Y?
b. When {Y_i} instead have pairwise correlation ρ > 0, show that var(Y) > nπ(1 − π), overdispersion relative to the binomial. [Altham (1978) discussed generalizations of the binomial that allow correlated trials.]
c. Suppose that heterogeneity exists: P(Y_i = 1 | π) = π for all i, but π is a random variable with density function g(·) on [0, 1] having mean ρ and positive variance. Show that var(Y) > nρ(1 − ρ). (When π has a beta distribution, Y has the beta-binomial distribution of Section 13.3.)
d. Suppose that P(Y_i = 1 | π_i) = π_i, i = 1, . . . , n, where {π_i} are independent from g(·). Explain why Y has a bin(n, ρ)
distribution unconditionally but not conditionally on {π_i}. (Hint: In each case, is Y a sum of independent, identical Bernoulli trials?)

1.13 For a sequence of independent Bernoulli trials, Y is the number of successes before the kth failure. Explain why its probability mass function is the negative binomial,

    p(y) = [(y + k − 1)! / (y! (k − 1)!)] π^y (1 − π)^k,  y = 0, 1, 2, . . . .

[For it, E(Y) = kπ/(1 − π) and var(Y) = kπ/(1 − π)², so var(Y) > E(Y); the Poisson is the limit as k → ∞ and π → 0 with kπ = μ fixed.]

1.14 For the multinomial distribution, show that

    corr(n_j, n_k) = −π_j π_k / √[π_j(1 − π_j) π_k(1 − π_k)].

Show that corr(n_1, n_2) = −1 when c = 2.

1.15 Show that the moment generating function (mgf) for the binomial distribution is m(t) = (1 − π + πe^t)^n, and use it to obtain the first two moments. Show that the mgf for the Poisson distribution is m(t) = exp{μ[exp(t) − 1]}, and use it to obtain the first two moments.

1.16 A likelihood-ratio statistic equals t_o. At the ML estimates, show that the data are exp(t_o/2) times more likely under H_a than under H_0.

1.17 Assume that y_1, y_2, . . . , y_n are independent from a Poisson distribution.
a. Obtain the likelihood function. Show that the ML estimator μ̂ = ȳ.
b. Construct a large-sample test statistic for H_0: μ = μ_0 using (i) the Wald method, (ii) the score method, and (iii) the likelihood-ratio method.
c. Construct a large-sample confidence interval for μ using (i) the Wald method, (ii) the score method, and (iii) the likelihood-ratio method.

1.18 Inference for Poisson parameters can often be based on connections with binomial and multinomial distributions. Show how to test H_0: μ_1 = μ_2 for two populations based on independent Poisson counts (y_1, y_2), using a corresponding test about a binomial parameter π. [Hint: Condition on n = y_1 + y_2 and identify π = μ_1/(μ_1 + μ_2).] How can one construct a confidence interval for μ_1/μ_2 based on one for π?
1.19 A researcher routinely tests using a nominal P(type I error) = 0.05, rejecting H_0 if the P-value ≤ 0.05. An exact test using test statistic T has null distribution P(T = 0) = 0.30, P(T = 1) = 0.62, and P(T = 2) = 0.08, where a higher T provides more evidence against the null.
a. With the usual P-value, show that the actual P(type I error) = 0.
b. With the mid-P-value, show that the actual P(type I error) = 0.08.
c. Find P(type I error) in parts (a) and (b) when P(T = 0) = 0.30, P(T = 1) = 0.66, P(T = 2) = 0.04. Note that the test with mid-P-value can be conservative or liberal. The exact test with ordinary P-value cannot be liberal.
d. In part (a), a randomized-decision test generates a uniform random variable U from [0, 1] and rejects H_0 when T = 2 and U ≤ 5/8. Show the actual P(type I error) = 0.05. Is this a sensible test?

1.20 For a binomial parameter π, show how the inversion process for constructing a confidence interval works with (a) the Wald test, and (b) the score test.

1.21 For a flip of a coin, let π denote the probability of a head. An experiment tests H_0: π = 0.5 against H_a: π ≠ 0.5, using n = 5 independent flips.
a. Show that the true null probability of rejecting H_0 at the 0.05 significance level is 0.0 for the exact binomial test and 1/16 using the large-sample score test.
b. Suppose that truly π = 0.5. Explain why the probability that the 95% Clopper-Pearson confidence interval contains π equals 1.0. (Hint: Is there any possible y for which both one-sided tests of H_0: π = 0.5 have P-value ≤ 0.025?)

1.22 Consider the Wald confidence interval for a binomial parameter π. Since it is degenerate when π̂ = 0 or 1, argue that for 0 < π < 1 the probability the interval covers π cannot exceed [1 − π^n − (1 − π)^n]; hence, the infimum of the coverage probability over 0 < π < 1 equals 0, regardless of n.

1.23 Consider the 95% binomial score confidence interval for π.
When y = 1, show that the lower limit is approximately 0.18/n; in fact, 0 < π < 0.18/n then falls in an interval only when y = 0. Argue that for large n and π just barely below 0.18/n or just barely above 1 − 0.18/n, the actual coverage probability is about e^{−0.18} = 0.84. Hence, even as n → ∞, this method is not guaranteed to have coverage probability ≥ 0.95 (Agresti and Coull 1998; Blyth and Still 1983).

1.24 From Section 1.4.2 the midpoint π̃ of the score confidence interval for π is the sample proportion for an adjusted data set that adds z²_{α/2}/2 observations of each type to the sample. This motivates an adjusted Wald interval,

    π̃ ± z_{α/2} √[π̃(1 − π̃)/n*],  where n* = n + z²_{α/2}.

Show that the variance π̃(1 − π̃)/n* at the weighted average is at least as large as the weighted average of the variances that appears under the square root sign in the score interval. (Hint: Use Jensen's inequality.) Thus, this interval contains the score interval. [Agresti and Coull (1998) and Brown et al. (2001) showed that it performs much better than the Wald interval. It does not have the score interval's disadvantage (Problem 1.23) of poor coverage near 0 and 1.]

1.25 A binomial sample of size n has y = 0 successes.
a. Show that the confidence interval for π based on the likelihood function is [0.0, 1 − exp(−z²_{α/2}/2n)]. For α = 0.05, use the expansion of an exponential function to show that this is approximately [0, 2/n].
b. For the score method, show that the confidence interval is [0, z²_{α/2}/(n + z²_{α/2})], or approximately [0, 4/(n + 4)] when α = 0.05.
c. For the Clopper-Pearson approach, show that the upper bound is 1 − (α/2)^{1/n}, or approximately −log(0.025)/n = 3.69/n when α = 0.05.
d. For the adaptation of the Clopper-Pearson approach using the mid-P-value, show that the upper bound is 1 − α^{1/n}, or approximately −log(0.05)/n = 3/n when α = 0.05.

1.26 For the geometric distribution p(y) = π^y(1 − π), y = 0, 1, 2, . . .
, show that the tail method for constructing a confidence interval [i.e., equating P(Y ≥ y) and P(Y ≤ y) to α/2] yields [(α/2)^{1/y}, (1 − α/2)^{1/(y+1)}]. Show that all π between 0 and 1 − α/2 never fall above a confidence interval, and hence the actual coverage probability exceeds 1 − α/2 over this region.

1.27 A statistic T has discrete distribution with cdf F(t). Show that F(T) is stochastically larger than uniform over [0, 1]; that is, its cdf is everywhere no greater than that of the uniform (Casella and Berger 2001, pp. 77, 434). Explain why an implication is that a P-value based on T has null distribution that is stochastically larger than uniform.

1.28 Suppose that P(T = t_j) = π_j, j = 1, . . . . Show that E(mid-P-value) = 0.5. [Hint: Show that Σ_j π_j(π_j/2 + π_{j+1} + ⋯) = (Σ_j π_j)²/2.]

1.29 For a statistic T with cdf F(t) and p(t) = P(T = t), the mid-distribution function is F_mid(t) = F(t) − 0.5p(t) (Parzen 1997). Given T = t_o, show that the mid-P-value equals 1 − F_mid(t_o). (It also satisfies E[F_mid(T)] = 0.5 and var[F_mid(T)] = (1/12){1 − E[p²(T)]}.)

1.30 Genotypes AA, Aa, and aa occur with probabilities [π², 2π(1 − π), (1 − π)²]. A multinomial sample of size n has frequencies (n_1, n_2, n_3) of these three genotypes.
a. Form the log likelihood. Show that π̂ = (2n_1 + n_2)/(2n_1 + 2n_2 + 2n_3).
b. Show that −∂²L(π)/∂π² = [(2n_1 + n_2)/π²] + [(n_2 + 2n_3)/(1 − π)²] and that its expectation is 2n/[π(1 − π)]. Use this to obtain an asymptotic standard error of π̂.
c. Explain how to test whether the probabilities truly have this pattern.

1.31 Refer to Section 1.5.6. Using the likelihood function to obtain the information, find the approximate standard error of π̂.

1.32 Refer to Section 1.5.6.
Let a denote the number of calves that got a primary, secondary, and tertiary infection, b the number that received a primary and secondary but not a tertiary infection, c the number that received a primary but not a secondary infection, and d the number that did not receive a primary infection. Let π be the probability of a primary infection. Consider the hypothesis that the probability of infection at time t, given infection at times 1, . . . , t − 1, is also π, for t = 2, 3. Show that π̂ = (3a + 2b + c)/(3a + 3b + 2c + d).

1.33 Refer to quadratic form (1.16).
a. Verify that the matrix quoted in the text for Σ_0^{−1} is the inverse of Σ_0.
b. Show that (1.16) simplifies to Pearson's statistic (1.15).
c. For the z_S statistic (1.11), show that z_S² = X² for c = 2.

1.34 For testing H_0: π_j = π_{j0}, j = 1, . . . , c, using sample multinomial proportions {π̂_j}, the likelihood-ratio statistic (1.17) is

    G² = −2n Σ_j π̂_j log(π_{j0}/π̂_j).

Show that G² ≥ 0, with equality if and only if π̂_j = π_{j0} for all j. [Hint: Apply Jensen's inequality to E(−2n log X), where X equals π_{j0}/π̂_j with probability π̂_j.]

1.35 The chi-squared mgf with df = ν is m(t) = (1 − 2t)^{−ν/2}, for |t| < 1/2. Use it to prove the reproductive property of the chi-squared distribution.

1.36 For the multinomial (n, {π_j}) distribution with c > 2, confidence limits for π_j are the solutions of

    (π̂_j − π_j)² = z²_{α/2c} π_j(1 − π_j)/n,  j = 1, . . . , c.

a. Using the Bonferroni inequality, argue that these c intervals simultaneously contain all {π_j} (for large samples) with probability at least 1 − α.
b. Show that the standard deviation of π̂_j − π̂_k is √{[π_j + π_k − (π_j − π_k)²]/n}. For large n, explain why the probability is at least 1 − α that the Wald confidence intervals

    (π̂_j − π̂_k) ± z_{α/2a} {[π̂_j + π̂_k − (π̂_j − π̂_k)²]/n}^{1/2}

simultaneously contain the a = c(c − 1)/2 differences {π_j − π_k} (see Fitzpatrick and Scott 1987; Goodman 1965).
Categorical Data Analysis, Second Edition. Alan Agresti
Copyright © 2002 John Wiley & Sons, Inc. ISBN: 0-471-36093-7

CHAPTER 2

Describing Contingency Tables

In this chapter we introduce tables that display relationships between categorical variables. We also define parameters that summarize their association. Parameters in Section 2.2 are used to compare groups on the proportions of responses in the outcome categories. The odds ratio has special importance, appearing as a parameter in models discussed later. In Section 2.3 we extend the scope by controlling for a third variable. The association can change dramatically under a control. The chapter's primary focus is binary variables, which have only two categories, but in Section 2.4 we present parameters for nominal and ordinal multicategory variables. First, in Section 2.1, we introduce basic terminology and notation.

2.1 PROBABILITY STRUCTURE FOR CONTINGENCY TABLES

The joint distribution between two categorical variables determines their relationship. This distribution also determines the marginal and conditional distributions.

2.1.1 Contingency Tables and Their Distributions

Let X and Y denote two categorical response variables, X with I categories and Y with J categories. Classifications of subjects on both variables have IJ possible combinations. The responses (X, Y) of a subject chosen randomly from some population have a probability distribution. A rectangular table having I rows for categories of X and J columns for categories of Y displays this distribution. The cells of the table represent the IJ possible outcomes. When the cells contain frequency counts of outcomes for a sample, the table is called a contingency table, a term introduced by Karl Pearson (1904). Another name is cross-classification table. A contingency table with I rows and J columns is called an I × J (or I-by-J) table.
TABLE 2.1  Cross-Classification of Aspirin Use and Myocardial Infarction

                    Myocardial Infarction
           Fatal Attack   Nonfatal Attack   No Attack
Placebo    18             171               10,845
Aspirin    5              99                10,933

Source: Preliminary report: Findings from the aspirin component of the ongoing Physicians' Health Study. New Engl. J. Med. 318: 262-264 (1988).

Table 2.1, a 2 × 3 contingency table, is from a report on the relationship between aspirin use and heart attacks by the Physicians' Health Study Research Group at Harvard Medical School. The Physicians' Health Study was a 5-year randomized study of whether regular aspirin intake reduces mortality from cardiovascular disease. Every other day, physicians participating in the study took either one aspirin tablet or a placebo. The study was blind: those in the study did not know whether they were taking aspirin or a placebo. Of the 11,034 physicians taking a placebo, 18 suffered fatal heart attacks over the course of the study, whereas of the 11,037 taking aspirin, 5 had fatal heart attacks.

Let π_ij denote the probability that (X, Y) occurs in the cell in row i and column j. The probability distribution {π_ij} is the joint distribution of X and Y. The marginal distributions are the row and column totals that result from summing the joint probabilities. We denote these by {π_{i+}} for the row variable and {π_{+j}} for the column variable, where the subscript ''+'' denotes the sum over that index; that is,

    π_{i+} = Σ_j π_ij  and  π_{+j} = Σ_i π_ij.

These satisfy Σ_i π_{i+} = Σ_j π_{+j} = Σ_i Σ_j π_ij = 1.0. The marginal distributions provide single-variable information.

In most contingency tables (such as Table 2.1), one variable, say Y, is a response variable and the other (X) is an explanatory variable. When X is fixed rather than random, the notion of a joint distribution for X and Y is no longer meaningful. However, for a fixed category of X, Y has a probability distribution.
It is germane to study how this distribution changes as the category of X changes. Given that a subject is classified in row i of X, π_{j|i} denotes the probability of classification in column j of Y, j = 1, . . . , J. Note that Σ_j π_{j|i} = 1. The probabilities {π_{1|i}, . . . , π_{J|i}} form the conditional distribution of Y at category i of X. A principal aim of many studies is to compare conditional distributions of Y at various levels of explanatory variables.

TABLE 2.2  Estimated Conditional Distributions for Breast Cancer Diagnoses

                 Diagnosis of Test
Breast Cancer    Positive   Negative   Total
Yes              0.82       0.18       1.0
No               0.01       0.99       1.0

Source: Data from W. Lawrence et al., J. Natl. Cancer Inst. 90: 1792-1800 (1998).

2.1.2 Sensitivity and Specificity

The results in Table 2.2 are from a recent article about various methods of attempting to diagnose breast cancer. Based on a literature survey, the authors reported these results for the impact of using mammography together with clinical breast examination. Let X = true disease status (i.e., whether a woman truly has breast cancer) and let Y = diagnosis (positive, negative), where a positive outcome predicts that a woman has breast cancer. The probabilities estimated in Table 2.2 are conditional probabilities of Y given X.

With diagnostic tests for a disease, the two correct diagnoses are a positive test outcome when the subject has the disease and a negative test outcome when a subject does not have it. Given that the subject has the disease, the conditional probability that the diagnostic test is positive is called the sensitivity; given that the subject does not have the disease, the conditional probability that the test is negative is called the specificity (Yerushalmy 1947). Ideally, these are both high. For a 2 × 2 table with the format of Table 2.2, sensitivity is π_{1|1} and specificity is π_{2|2}. In Table 2.2, the estimated sensitivity of combined mammography and clinical examination is 0.82.
Of women with breast cancer, 82% were diagnosed correctly. The estimated specificity is 0.99: of women not having breast cancer, 99% were diagnosed correctly.

2.1.3 Independence of Categorical Variables

When both variables are response variables, descriptions of the association can use their joint distribution, the conditional distribution of Y given X, or the conditional distribution of X given Y. The conditional distribution of Y given X relates to the joint distribution by π_{j|i} = π_{ij}/π_{i+} for all i and j. Two categorical response variables are defined to be independent if all joint probabilities equal the product of their marginal probabilities,

    π_{ij} = π_{i+} π_{+j}   for i = 1, ..., I and j = 1, ..., J.    (2.1)

TABLE 2.3 Notation for Joint, Conditional, and Marginal Probabilities

                            Column
Row          1                  2                 Total
1       π_{11} (π_{1|1})   π_{12} (π_{2|1})   π_{1+} (1.0)
2       π_{21} (π_{1|2})   π_{22} (π_{2|2})   π_{2+} (1.0)
Total   π_{+1}             π_{+2}             1.0

When X and Y are independent,

    π_{j|i} = π_{ij}/π_{i+} = (π_{i+} π_{+j})/π_{i+} = π_{+j}   for i = 1, ..., I.

Each conditional distribution of Y is identical to the marginal distribution of Y. Thus, two variables are independent when {π_{j|1} = ··· = π_{j|I}, for j = 1, ..., J}; that is, the probability of any given column response is the same in each row. When Y is a response and X is an explanatory variable, this is a more natural way to define independence than (2.1). Independence is then often referred to as homogeneity of the conditional distributions.

Table 2.3 displays notation for joint, conditional, and marginal distributions for the 2 × 2 case. Sample distributions use similar notation, with p or π̂ in place of π. For instance, {p_{ij}} denotes the sample joint distribution. The cell frequencies are denoted {n_{ij}}, and n = Σ_i Σ_j n_{ij} is the total sample size. Thus, p_{ij} = n_{ij}/n. The sample proportion of times that subjects in row i made response j is p_{j|i} = p_{ij}/p_{i+} = n_{ij}/n_{i+}, where n_{i+} = n p_{i+} = Σ_j n_{ij}.
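As a minimal sketch of these quantities, the sample conditional distributions, sensitivity, and specificity can be computed from a 2 × 2 count table. The counts below are hypothetical illustrations (Table 2.2 reports only the estimated conditional probabilities, not the underlying counts):

```python
# Sample conditional distributions of Y (diagnosis) given X (true status)
# for a 2x2 table. Hypothetical counts, chosen so the conditional
# proportions match Table 2.2; not the data behind that table.
counts = [[82, 18],   # row 1: disease present (positive, negative)
          [1, 99]]    # row 2: disease absent  (positive, negative)

def conditional_dist(row):
    """p_{j|i} = n_{ij} / n_{i+} for one row of the table."""
    total = sum(row)
    return [n / total for n in row]

sensitivity = conditional_dist(counts[0])[0]  # p_{1|1}
specificity = conditional_dist(counts[1])[1]  # p_{2|2}
print(sensitivity, specificity)               # 0.82 0.99
```

Each row's conditional distribution sums to 1, mirroring Σ_j π_{j|i} = 1.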
2.1.4 Poisson, Binomial, and Multinomial Sampling

The probability distributions introduced in Section 1.2 extend to cell counts in contingency tables. For instance, a Poisson sampling model treats cell counts {Y_{ij}} as independent Poisson random variables with parameters {μ_{ij}}. The joint probability mass function for potential outcomes {n_{ij}} is then the product of the Poisson probabilities P(Y_{ij} = n_{ij}) for the IJ cells, or

    ∏_i ∏_j exp(−μ_{ij}) μ_{ij}^{n_{ij}} / n_{ij}!.

When the total sample size n is fixed but the row and column totals are not, a multinomial sampling model applies. The IJ cells are the possible outcomes. The probability mass function of the cell counts has the multinomial form

    [n! / (n_{11}! ··· n_{IJ}!)] ∏_i ∏_j π_{ij}^{n_{ij}}.

Often, observations on a response Y occur separately at each setting of an explanatory variable X. This case normally treats row totals as fixed, and for simplicity, we use the notation n_i = n_{i+}. Suppose that the n_i observations on Y at setting i of X are independent, each with probability distribution {π_{1|i}, ..., π_{J|i}}. The counts {n_{ij}, j = 1, ..., J} satisfying Σ_j n_{ij} = n_i then have the multinomial form

    [n_i! / ∏_j n_{ij}!] ∏_j π_{j|i}^{n_{ij}}.    (2.2)

When samples at different settings of X are independent, the joint probability function for the entire data set is the product of the multinomial functions (2.2) from the various settings. This sampling scheme is independent multinomial sampling, also called product multinomial sampling.

Independent multinomial sampling also results under the following conditions: Suppose that {n_{ij}} result from either independent Poisson sampling with means {μ_{ij}}, or multinomial sampling over the IJ cells with probabilities {π_{ij} = μ_{ij}/n}. When X is an explanatory variable, it is sensible to perform statistical inference conditional on the totals {n_i = Σ_j n_{ij}} even when their values are not fixed by the sampling design.
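The multinomial mass function (2.2) for one row can be evaluated directly. This is a small sketch; the row probabilities and counts are chosen purely for illustration:

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Multinomial probability (2.2): [n_i!/prod_j n_ij!] * prod_j pi_{j|i}^{n_ij}."""
    n_i = sum(counts)
    coef = factorial(n_i)
    for n_ij in counts:
        coef //= factorial(n_ij)             # multinomial coefficient, exact integer
    return coef * prod(p ** n for p, n in zip(probs, counts))

# One row with n_i = 4 observations spread over J = 3 response categories.
p = multinomial_pmf([2, 1, 1], [0.5, 0.3, 0.2])
print(p)  # ≈ 0.18 (= 12 * 0.25 * 0.3 * 0.2)
```

Under independent multinomial sampling, the joint probability of the whole table is just the product of such row terms.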
Conditional on {n_i}, the cell counts {n_{ij}, j = 1, ..., J} have the multinomial distribution (2.2) with response probabilities {π_{j|i} = μ_{ij}/μ_{i+}, j = 1, ..., J}, and cell counts from different rows are independent. With this conditioning, we treat the row totals as fixed and analyze the data as if they formed separate independent samples.

Sometimes both row and column margins are naturally fixed. The appropriate sampling distribution is then the hypergeometric. In Section 3.5.1 we discuss this case, which is less common.

2.1.5 Seat Belt Example

Researchers in the Massachusetts Highway Department plan to study the relationship between seat-belt use (yes, no) and outcome of an automobile crash (fatality, nonfatality) for drivers involved in accidents on the Massachusetts Turnpike. They will summarize results in the format shown in Table 2.4. They plan to catalog all accidents on the turnpike for the next year, classifying each according to these variables.

TABLE 2.4 Seat-Belt Use and Results of Automobile Crashes

                  Result of Crash
Seat-Belt Use   Fatality   Nonfatality
Yes
No

The total sample size is then a random variable. They might treat the numbers of observations at the four combinations of seat-belt use and outcome of crash as independent Poisson random variables with unknown means {μ_{11}, μ_{12}, μ_{21}, μ_{22}}.

Suppose, instead, that the researchers randomly sample 200 police records of crashes on the turnpike in the past year and classify each according to seat-belt use and outcome of crash. For this study, the total sample size n is fixed. They might then treat the four cell counts as a multinomial random variable with n = 200 trials and unknown joint probabilities {π_{11}, π_{12}, π_{21}, π_{22}}. Suppose, instead, that police records for accidents involving fatalities were filed separately from the others.
The researchers might instead randomly sample 100 records of accidents with a fatality and randomly sample 100 records of accidents with no fatality. This approach fixes the column totals in Table 2.4 at 100. They might then regard each column of Table 2.4 as an independent binomial sample.

Yet another approach, the traditional experimental design, takes 200 subjects and randomly assigns 100 of them to wear seat belts; the 200 then all are forced to have an accident. The recorded results would then be independent binomial samples in each row, with fixed row totals of 100 each. (Obviously, traditional designs common in some experimental science may not be ethical for humans. This is especially true in medical studies.)

2.1.6 Types of Studies

Table 2.5 comes from one of the first studies of the link between lung cancer and smoking, by Richard Doll and A. Bradford Hill. In 20 hospitals in London, England, patients admitted with lung cancer in the preceding year were queried about their smoking behavior. For each of the 709 patients admitted, researchers studied the smoking behavior of a noncancer patient at the same hospital of the same gender and within the same 5-year grouping on age. The 709 cases in the first column of Table 2.5 are those having lung cancer and the 709 controls in the second column are those not having it. A smoker was defined as a person who had smoked at least one cigarette a day for at least a year.

TABLE 2.5 Cross-Classification of Smoking by Lung Cancer

              Lung Cancer
Smoker    Cases    Controls
Yes        688       650
No          21        59
Total      709       709

Source: Based on data reported in Table IV, R. Doll and A. B. Hill, British Med. J., Sept. 30, 1950, pp. 739–748.

Normally, whether lung cancer occurs is a response variable and smoking behavior is an explanatory variable. In this study, however, the marginal distribution of lung cancer is fixed by the sampling design, and the outcome measured is whether the subject ever was a smoker.
The study, which uses a retrospective design to "look into the past," is called a case–control study. Such studies are common in health-related applications. Often, the two samples are matched, as in this study. Sometimes the samples of cases and controls are independent rather than matched. For instance, another early case–control study on lung cancer and smoking sampled subjects by sending letters to the estates of physicians who had died of some type of cancer in 1950 or 1951, and observations were cross-classified on type of cancer and the subject's smoking behavior (see, e.g., Cornfield 1956).

One might want to compare smokers with nonsmokers in terms of the proportion who suffered lung cancer. These proportions refer to the conditional distribution of lung cancer, given smoking behavior. Instead, case–control studies provide proportions in the reverse direction, for the conditional distribution of smoking behavior, given lung cancer status. For those in Table 2.5 with lung cancer, the proportion who were smokers was 688/709 = 0.970, while it was 650/709 = 0.917 for the controls. When we know the proportion of the population having lung cancer, we can use Bayes' theorem to compute sample conditional distributions in the direction of main interest (Problem 2.21). Otherwise, using a retrospective sample, we cannot estimate the probability of lung cancer at each category of smoking behavior. For Table 2.5 we do not know the population prevalence of lung cancer, and the patients suffering it were probably sampled at a rate far in excess of their occurrence in the general population.

By contrast, imagine a study that samples subjects from the population of teenagers and then 60 years later measures the rates of lung cancer for the smokers and nonsmokers. Such a sampling design is prospective. There are two types of prospective studies. Clinical trials randomly allocate subjects to the groups who will be smokers and nonsmokers.
In cohort studies, subjects make their own choice about whether to smoke, and the study observes in future time who develops lung cancer. Yet another approach, a cross-sectional design, samples subjects and classifies them simultaneously on both variables.

Prospective studies usually condition on the totals {n_i = Σ_j n_{ij}} for categories of X and regard each row of J counts as an independent multinomial sample on Y. Retrospective studies usually treat the totals {n_{+j}} for Y as fixed and regard each column of I counts as a multinomial sample on X. In cross-sectional studies, the total sample size is fixed but not the row or column totals, and the IJ cell counts are a multinomial sample.

Case–control, cohort, and cross-sectional studies are called observational studies. They simply observe who chooses each group and who has the outcome of interest. By contrast, a clinical trial is an experimental study, the investigator having the advantage of experimental control over which subjects receive each treatment. Such studies can use the power of randomization to make the groups balance roughly on other variables that may be associated with the response. Observational studies are common but have more potential for biases of various types.

2.2 COMPARING TWO PROPORTIONS

Many studies are designed to compare groups on a binary response variable. Then Y has only two categories, such as (success, failure) for outcome of a medical treatment. With two groups, a 2 × 2 contingency table displays the results. The rows are the groups and the columns are the categories of Y. This section presents parameters for comparing the groups.

2.2.1 Difference of Proportions

For subjects in row i, π_{1|i} is the probability that the response has outcome in category 1 ("success"). With only two possible outcomes, π_{2|i} = 1 − π_{1|i}, and we use the simpler notation π_i for π_{1|i}.
The difference of proportions of successes, π₁ − π₂, is a basic comparison of the two rows. Comparison on failures is equivalent to comparison on successes, since (1 − π₁) − (1 − π₂) = π₂ − π₁. The difference of proportions falls between −1.0 and +1.0. It equals zero when the rows have identical conditional distributions. The response Y is statistically independent of the row classification when π₁ − π₂ = 0.

When both variables are responses, conditional distributions apply in either direction. One can also compare the two columns, such as by the difference between the proportions in row 1. This usually is not equal to the difference π₁ − π₂ comparing the rows.

2.2.2 Relative Risk

A value π₁ − π₂ of fixed size may have greater importance when both π_i are close to 0 or 1 than when they are not. For a study comparing two treatments on the proportion of subjects who die, the difference between 0.010 and 0.001 may be more noteworthy than the difference between 0.410 and 0.401, even though both are 0.009. In such cases, the ratio of proportions is also informative. The relative risk is defined to be the ratio

    π₁/π₂.    (2.3)

It can be any nonnegative real number. A relative risk of 1.0 corresponds to independence. For the proportions just given, the relative risks are 0.010/0.001 = 10.0 and 0.410/0.401 = 1.02. Comparing the rows on the second response category gives a different relative risk, (1 − π₁)/(1 − π₂).

2.2.3 Odds Ratio

For a probability π of success, the odds are defined to be

    Ω = π/(1 − π).

The odds are nonnegative, with Ω > 1.0 when a success is more likely than a failure. When π = 0.75, for instance, then Ω = 0.75/0.25 = 3.0; a success is three times as likely as a failure, and we expect about three successes for every one failure. When Ω = 1/3, a failure is three times as likely as a success. Inversely,

    π = Ω/(Ω + 1).

For instance, when Ω = 1/3, then π = 0.25. Refer again to a 2 × 2 table.
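Before moving on, the odds–probability conversions just described are simple to sketch (a minimal illustration, not tied to any particular data set):

```python
def odds(pi):
    """Odds of success: Omega = pi / (1 - pi)."""
    return pi / (1 - pi)

def prob(omega):
    """Inverse conversion: pi = Omega / (Omega + 1)."""
    return omega / (omega + 1)

print(odds(0.75))   # 3.0: a success is three times as likely as a failure
print(prob(1 / 3))  # 0.25
```

The two functions invert each other, so prob(odds(pi)) recovers pi for any 0 < pi < 1.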
Within row i, the odds of success instead of failure are Ω_i = π_i/(1 − π_i). The ratio of the odds Ω₁ and Ω₂ in the two rows,

    θ = Ω₁/Ω₂ = [π₁/(1 − π₁)] / [π₂/(1 − π₂)],    (2.4)

is called the odds ratio. For joint distributions with cell probabilities {π_{ij}}, the equivalent definition for the odds in row i is Ω_i = π_{i1}/π_{i2}, i = 1, 2. Then the odds ratio is

    θ = (π_{11}/π_{12}) / (π_{21}/π_{22}) = (π_{11} π_{22}) / (π_{12} π_{21}).    (2.5)

An alternative name for θ is the cross-product ratio, since it equals the ratio of the products π_{11}π_{22} and π_{12}π_{21} of probabilities from diagonally opposite cells (Yule 1900, 1912).

2.2.4 Properties of the Odds Ratio

The odds ratio can equal any nonnegative number. The condition Ω₁ = Ω₂, and hence (when all cell probabilities are positive) θ = 1, corresponds to independence of X and Y. When 1 < θ < ∞, subjects in row 1 are more likely to have a success than are subjects in row 2; that is, π₁ > π₂. For instance, when θ = 4, the odds of success in row 1 are four times the odds in row 2. This does not mean that the probability π₁ = 4π₂; that is the interpretation of a relative risk of 4.0. When 0 < θ < 1, π₁ < π₂. When one cell has zero probability, θ equals 0 or ∞.

Values of θ farther from 1.0 in a given direction represent stronger association. Two values represent the same association, but in opposite directions, when one is the inverse of the other. For instance, when θ = 0.25, the odds of success in row 1 are 0.25 times the odds in row 2, or equivalently, the odds of success in row 2 are 1/0.25 = 4.0 times the odds in row 1. When the order of the rows is reversed or the order of the columns is reversed, the new value for θ is the inverse of the original value.

For inference, we shall see it is convenient to use log θ. Independence corresponds to log θ = 0. The log odds ratio is symmetric about this value: reversal of rows or of columns results in a change in its sign.
Two values for log θ that are the same except for sign, such as log 4 = 1.39 and log 0.25 = −1.39, represent the same strength of association.

The odds ratio does not change value when the orientation of the table reverses so that the rows become the columns and the columns become the rows. This is clear from the symmetric form of (2.5). It is unnecessary to identify one classification as the response variable in order to use θ. In fact, although (2.4) defined it in terms of odds using π_i = P(Y = 1 | X = i), one could just as well define it using reverse conditional probabilities. With a joint distribution, conditional distributions exist in each direction, and

    θ = (π_{11} π_{22}) / (π_{12} π_{21})
      = [P(Y = 1 | X = 1)/P(Y = 2 | X = 1)] / [P(Y = 1 | X = 2)/P(Y = 2 | X = 2)]
      = [P(X = 1 | Y = 1)/P(X = 2 | Y = 1)] / [P(X = 1 | Y = 2)/P(X = 2 | Y = 2)].    (2.6)

In fact, the odds ratio is equally valid for prospective, retrospective, or cross-sectional sampling designs. The sample odds ratio estimates the same parameter in each case. For cell counts {n_{ij}}, the sample odds ratio is θ̂ = n_{11} n_{22}/(n_{12} n_{21}). This does not change when both cell counts within any row are multiplied by a nonzero constant or when both cell counts within any column are multiplied by a nonzero constant. An implication is that the sample odds ratio estimates the same characteristic (θ) even when the sample is disproportionately large or small from marginal categories of a variable. For a retrospective study of the association between vaccination and catching a certain strain of flu, the sample odds ratio estimates the same characteristic with a random sample of (1) 100 people who got the flu and 100 people who did not, or (2) 40 people who got the flu and 160 people who did not. The sample versions of the difference of proportions and relative risk (2.3)
are invariant to multiplication of counts within rows by a constant, but they change with multiplication within columns or with row-column interchange.

2.2.5 Aspirin and Heart Attacks Revisited

We illustrate the three association measures with Table 2.1 on aspirin use and heart attacks. The table differentiates between fatal and nonfatal heart attacks, but we combine these outcomes for now. Of the 11,034 physicians taking placebo, 189 suffered heart attacks, a proportion of 189/11,034 = 0.0171. Of the 11,037 taking aspirin, 104 had heart attacks, a proportion of 0.0094. The sample difference of proportions is 0.0171 − 0.0094 = 0.0077. The relative risk is 0.0171/0.0094 = 1.82. The proportion suffering heart attacks of those taking placebo was 1.82 times the proportion suffering heart attacks of those taking aspirin. The sample odds ratio is (189 × 10,933)/(10,845 × 104) = 1.83. The odds of heart attack for those taking placebo was 1.83 times the odds for those taking aspirin.

2.2.6 Case–Control Studies and the Odds Ratio

With retrospective sampling designs, such as case–control studies, it is possible to estimate conditional probabilities of the form P(X = i | Y = j). It is usually not possible to estimate the probability P(Y = j | X = i) of an outcome of interest, or the difference of proportions or relative risk for that outcome. It is possible to estimate the odds ratio, however, since by (2.6) it is determined by conditional probabilities in either direction. To illustrate, we revisit Table 2.5 on X = smoking behavior and Y = lung cancer. The data were two binomial samples on X at fixed levels of Y. Thus, we can estimate the probability a subject was a smoker, given the outcome on whether the subject had lung cancer; this was 688/709 for the cases and 650/709 for the controls. We cannot estimate the probability of lung cancer, given whether one smoked, which is more relevant. Thus, we cannot estimate differences or ratios of probabilities of lung cancer.
The difference of proportions and relative risk are limited to comparisons of the probabilities of being a smoker. However, we can compute the odds ratio using the sample analog of (2.6),

    [(688/709)/(21/709)] / [(650/709)/(59/709)] = (688 × 59)/(650 × 21) = 3.0.

Moreover, by (2.6), interpretations can use the direction of interest, even though the study was retrospective: The estimated odds of lung cancer for smokers were 3.0 times the estimated odds for nonsmokers.

2.2.7 Relationship between Odds Ratio and Relative Risk

From definitions (2.3) and (2.4),

    odds ratio = relative risk × [(1 − π₂)/(1 − π₁)].

Their magnitudes are similar whenever the probability π_i of the outcome of interest is close to zero for both groups. We saw this similarity in Section 2.2.5 for the aspirin study, where the heart attack proportion was less than 0.02 for each group. The relative risk was 1.82 and the odds ratio was 1.83. Because of this similarity, when each π_i is small, the odds ratio provides a rough indication of the relative risk when it is not directly estimable, such as in case–control studies (Cornfield 1951). For instance, for Table 2.5, if the probability of lung cancer is small regardless of smoking behavior, 3.0 is also a rough estimate of the relative risk; that is, smokers had about 3.0 times the relative frequency of lung cancer as nonsmokers.

2.3 PARTIAL ASSOCIATION IN STRATIFIED 2 × 2 TABLES

An important part of most studies, especially observational studies, is the choice of control variables. In studying the effect of X on Y, one should control any covariate that can influence that relationship. This involves using some mechanism to hold the covariate constant. Otherwise, an observed effect of X on Y may actually reflect effects of that covariate on both X and Y. The relationship between X and Y then shows confounding.
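As a quick numerical check, a sketch using the counts quoted above from Tables 2.1 and 2.5 reproduces the case–control odds ratio and the odds ratio–relative risk relationship:

```python
def odds_ratio(n11, n12, n21, n22):
    """Sample odds ratio n11*n22 / (n12*n21); valid under any sampling design."""
    return (n11 * n22) / (n12 * n21)

# Table 2.5 (case-control): smoking (yes, no) by lung cancer (cases, controls).
print(round(odds_ratio(688, 650, 21, 59), 1))   # 3.0

# Aspirin study: odds ratio = relative risk * (1 - pi2) / (1 - pi1).
pi1, pi2 = 189 / 11034, 104 / 11037   # placebo and aspirin heart-attack proportions
rr = pi1 / pi2
or_ = rr * (1 - pi2) / (1 - pi1)
print(round(rr, 2), round(or_, 2))    # 1.82 1.83 -- close because both pi_i are small
```

The last line illustrates why, with rare outcomes, the odds ratio serves as a rough stand-in for the relative risk in case–control studies.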
Experimental studies can remove effects of confounding covariates by randomly assigning subjects to different levels of X, but this is not possible with observational studies.

Suppose that a study considers effects of passive smoking, the effects on a nonsmoker of living with a smoker. To analyze whether passive smoking is associated with lung cancer, a cross-sectional study might compare lung cancer rates between nonsmokers whose spouses smoke and nonsmokers whose spouses do not smoke. The study should attempt to control for age, socioeconomic status, or other factors that might relate both to spouse smoking and to developing lung cancer. Otherwise, results will have limited usefulness. Spouses of nonsmokers may tend to be younger than spouses of smokers, and younger people are less likely to have lung cancer. Then a lower proportion of lung cancer cases among spouses of nonsmokers may merely reflect their lower average age.

In this section we discuss the analysis of the association between categorical variables X and Y while controlling for a possibly confounding variable Z. For simplicity, the examples refer to a single control variable. In later chapters we treat more general cases and discuss the use of models to perform statistical control.

2.3.1 Partial Tables

We control for Z by studying the XY relationship at fixed levels of Z. Two-way cross-sectional slices of the three-way contingency table cross-classify X and Y at separate categories of Z. These cross sections are called partial tables. They display the XY relationship while removing the effect of Z by holding its value constant.

The two-way contingency table obtained by combining the partial tables is called the XY marginal table. Each cell count in the marginal table is a sum of counts from the same location in the partial tables. The marginal table, rather than controlling Z, ignores it. The marginal table contains no information about Z.
It is simply a two-way table relating X and Y, but it may reflect the effects of Z on X and Y.

The associations in partial tables are called conditional associations, because they refer to the effect of X on Y conditional on fixing Z at some level. Conditional associations in partial tables can be quite different from associations in marginal tables. In fact, it can be misleading to analyze only marginal tables of a multiway contingency table. The following example illustrates.

2.3.2 Death Penalty Example

Table 2.6 is a 2 × 2 × 2 contingency table (two rows, two columns, and two layers) from an article that studied effects of racial characteristics on whether persons convicted of homicide received the death penalty. The 674 subjects classified in Table 2.6 were the defendants in indictments involving cases with multiple murders in Florida between 1976 and 1987.

TABLE 2.6 Death Penalty Verdict by Defendant's Race and Victims' Race

                                    Death Penalty
Victims' Race   Defendant's Race   Yes    No    Percent Yes
White           White               53   414       11.3
                Black               11    37       22.9
Black           White                0    16        0.0
                Black                4   139        2.8
Total           White               53   430       11.0
                Black               15   176        7.9

Source: M. L. Radelet and G. L. Pierce, Florida Law Rev. 43: 1–34 (1991). Reprinted with permission from the Florida Law Review.

[Figure 2.1: Percent receiving death penalty.]

The variables in Table 2.6 are Y = death penalty verdict, having the categories (yes, no), X = race of defendant, and Z = race of victims, each having the categories (white, black). We study the effect of defendant's race on the death penalty verdict, treating victims' race as a control variable. Table 2.6 has a 2 × 2 partial table relating defendant's race and the death penalty verdict at each category of victims' race. For each combination of defendant's race and victims' race, Table 2.6 lists and Figure 2.1 displays the percentage of defendants who received the death penalty. These describe the conditional associations.
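The conditional and marginal percentages in Table 2.6 can be reproduced from the cell counts; this is a small sketch treating the counts as given:

```python
# Death-penalty counts from Table 2.6: (victims' race, defendant's race) -> (yes, no)
counts = {
    ("white", "white"): (53, 414),
    ("white", "black"): (11, 37),
    ("black", "white"): (0, 16),
    ("black", "black"): (4, 139),
}

def pct_yes(yes, no):
    """Percentage of defendants receiving the death penalty."""
    return 100 * yes / (yes + no)

# Conditional percentages, controlling for victims' race
for (victim, defendant), (yes, no) in counts.items():
    print(victim, defendant, round(pct_yes(yes, no), 1))   # 11.3, 22.9, 0.0, 2.8

# Marginal percentages, ignoring victims' race
for defendant in ("white", "black"):
    yes = sum(counts[(v, defendant)][0] for v in ("white", "black"))
    no = sum(counts[(v, defendant)][1] for v in ("white", "black"))
    print(defendant, round(pct_yes(yes, no), 1))           # 11.0, 7.9
```

Within each victims' race category the percentage is higher for black defendants, yet marginally it is higher for white defendants, which is the reversal discussed next.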
When the victims were white, the death penalty was imposed 22.9% − 11.3% = 11.6% more often for black defendants than for white defendants. When the victims were black, the death penalty was imposed 2.8% more often for black defendants than for white defendants. Controlling for victims' race by keeping it fixed, the death penalty was imposed more often on black defendants than on white defendants.

The bottom portion of Table 2.6 displays the marginal table. It results from summing the cell counts in Table 2.6 over the two categories of victims' race, thus combining the two partial tables (e.g., 11 + 4 = 15). Overall, 11.0% of white defendants and 7.9% of black defendants received the death penalty. Ignoring victims' race, the death penalty was imposed less often on black defendants than on white defendants. The association reverses direction compared to the partial tables.

Why does the association change so much when we ignore versus control victims' race? This relates to the nature of the association between victims' race and each of the other variables. First, the association between victims' race and defendant's race is extremely strong. The marginal table relating these variables has odds ratio (467 × 143)/(48 × 16) = 87.0. Second, Table 2.6 shows that, regardless of defendant's race, the death penalty was much more likely when the victims were white than when the victims were black. So whites are tending to kill whites, and killing whites is more likely to result in the death penalty. This suggests that the marginal association should show a greater tendency than the conditional associations for white defendants to receive the death penalty. In fact, Table 2.6 has this pattern.

[Figure 2.2: Proportion receiving death penalty by defendant's race, controlling and ignoring victims' race.]

Figure 2.2 illustrates why the marginal association differs so from the conditional associations.
For each defendant's race, the figure plots the proportion receiving the death penalty at each category of victims' race. Each proportion is labeled by a letter symbol giving the category of victims' race. Surrounding each observation is a circle having area proportional to the number of observations at that combination of defendant's race and victims' race. For instance, the W in the largest circle represents a proportion of 0.113 receiving the death penalty for cases with white defendants and white victims. That circle is largest because the number of cases at that combination (53 + 414 = 467) is largest. The next-largest circle relates to cases in which blacks kill blacks.

We control for victims' race by comparing circles having the same victims' race letter at their centers. The line connecting the two W circles has a positive slope, as does the line connecting the two B circles. Controlling for victims' race, this reflects the death penalty being more likely for black defendants than for white defendants. When we add results across victims' race to get a summary result for the marginal effect of defendant's race on the death penalty verdict, the larger circles, having the greater number of cases, have greater influence. Thus, the summary proportions for each defendant's race, marked on the figure by periods, fall closer to the center of the larger circles than to the center of the smaller circles. A line connecting the summary marginal proportions has negative slope, indicating that overall the death penalty was more likely for white defendants than for black defendants.

The result that a marginal association can have a different direction from each conditional association is called Simpson's paradox (Simpson 1951; Yule 1903). It applies to quantitative as well as categorical variables. Statisticians commonly use it to caution against imputing causal effects from an association of X with Y.
For instance, when doctors started to observe strong odds ratios between smoking and lung cancer, statisticians such as R. A. Fisher warned that some variable (e.g., a genetic factor) could exist such that the association would disappear under the relevant control. However, other statisticians (such as J. Cornfield) showed that with a very strong XY association, a very strong association must exist between the confounding variable Z and both X and Y in order for the effect to disappear or change under the control (Breslow and Day 1980, Sec. 3.4).

2.3.3 Conditional and Marginal Odds Ratios

Odds ratios can describe marginal and conditional associations. We illustrate for 2 × 2 × K tables, where K denotes the number of categories of a control variable, Z. Let {μ_{ijk}} denote cell expected frequencies for some sampling model, such as binomial, multinomial, or Poisson sampling. Within a fixed category k of Z, the odds ratio

    θ_{XY(k)} = (μ_{11k} μ_{22k}) / (μ_{12k} μ_{21k})    (2.7)

describes conditional XY association in partial table k. The odds ratios for the K partial tables are called XY conditional odds ratios. These can be quite different from marginal odds ratios. The XY marginal table has expected frequencies {μ_{ij+} = Σ_k μ_{ijk}}. The XY marginal odds ratio is

    θ_{XY} = (μ_{11+} μ_{22+}) / (μ_{12+} μ_{21+}).

Sample values of θ_{XY(k)} and θ_{XY} use similar formulas with cell counts substituted for expected frequencies. We illustrate for the association between defendant's race and the death penalty in Table 2.6. In the first partial table, victims' race is white and

    θ̂_{XY(1)} = (53 × 37)/(414 × 11) = 0.43.

The sample odds for white defendants receiving the death penalty were 43% of the sample odds for black defendants. In the second partial table, victims' race is black and the estimated odds ratio equals θ̂_{XY(2)} = (0 × 139)/(16 × 4) = 0.0, since the death penalty was never given to white defendants with black victims.
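These conditional odds ratios, and the corresponding marginal one obtained by collapsing over victims' race, can be checked directly from the counts in Table 2.6 (a sketch treating the counts as given):

```python
def odds_ratio(table):
    """Sample odds ratio n11*n22/(n12*n21) for a 2x2 table [[n11, n12], [n21, n22]]."""
    (n11, n12), (n21, n22) = table
    return (n11 * n22) / (n12 * n21)

# Partial tables from Table 2.6: rows = defendant's race (white, black),
# columns = death penalty verdict (yes, no), one table per victims' race.
partials = {"white victims": [[53, 414], [11, 37]],
            "black victims": [[0, 16], [4, 139]]}

for label, t in partials.items():
    print(label, round(odds_ratio(t), 2))    # 0.43 and 0.0: conditional odds ratios

# Marginal table: sum the partial tables cell by cell, collapsing over Z.
marginal = [[sum(t[i][j] for t in partials.values()) for j in (0, 1)] for i in (0, 1)]
print(marginal)                              # [[53, 430], [15, 176]]
print(round(odds_ratio(marginal), 2))        # 1.45: the direction reverses
```

Both conditional odds ratios are below 1 while the marginal odds ratio is above 1, the Simpson's-paradox pattern described above.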
Estimation of the marginal odds ratio uses the 2 × 2 marginal table within Table 2.6, collapsing over victims' race, giving (53 × 176)/(430 × 15) = 1.45. The sample odds of the death penalty were 45% higher for white defendants than for black defendants. Yet within each victims' race category, those odds were smaller for white defendants. This reversal in the association after controlling for victims' race illustrates Simpson's paradox.

2.3.4 Marginal versus Conditional Independence

More generally, X may have I categories and Y may have J categories. An I × J × K table describes the relationship between X and Y, controlling for Z. If X and Y are independent in partial table k, then X and Y are called conditionally independent at level k of Z. When Y is a response, this means that

    P(Y = j | X = i, Z = k) = P(Y = j | Z = k)   for all i, j.    (2.8)

More generally, X and Y are said to be conditionally independent given Z when they are conditionally independent at every level of Z, that is, when (2.8) holds for all k. Then, given Z, Y does not depend on X. Suppose that a single multinomial applies to the entire three-way table, with joint probabilities {π_{ijk} = P(X = i, Y = j, Z = k)}. Then

    π_{ijk} = P(X = i, Z = k) P(Y = j | X = i, Z = k),

which, under conditional independence of X and Y given Z, equals

    π_{i+k} P(Y = j | Z = k) = π_{i+k} P(Y = j, Z = k) / P(Z = k).

Thus, conditional independence is then equivalent to

    π_{ijk} = π_{i+k} π_{+jk} / π_{++k}   for all i, j, and k.    (2.9)

TABLE 2.7 Expected Frequencies Showing That Conditional Independence Does Not Imply Marginal Independence

                        Response
Clinic   Treatment   Success   Failure
1        A              18        12
         B              12         8
2        A               2         8
         B               8        32
Total    A              20        20
         B              20        40

Conditional independence does not imply marginal independence (Yule 1903). For instance, summing (2.9) over k on both sides yields

    π_{ij+} = Σ_k (π_{i+k} π_{+jk} / π_{++k}).
All three terms in the summation involve k, and this does not simplify to π_ij+ = π_i++ π_+j+, marginal independence.

For 2 × 2 × K tables, X and Y are conditionally independent when the odds ratio between X and Y equals 1 at each category of Z. The expected frequencies {μ_ijk} in Table 2.7 illustrate this relation for Y = response (success, failure), X = drug treatment (A, B), and Z = clinic (1, 2). From (2.7), the conditional XY odds ratios are

θ_XY(1) = (18 × 8)/(12 × 12) = 1.0,   θ_XY(2) = (2 × 32)/(8 × 8) = 1.0.

Given the clinic, response and treatment are conditionally independent. The marginal table combines the tables for the two clinics. Its odds ratio is θ_XY = (20 × 40)/(20 × 20) = 2.0, so the variables are not marginally independent.

Ignoring the clinic, why are the odds of a success for treatment A twice those for treatment B? The conditional XZ and YZ odds ratios give a clue. The odds ratio between Z and either X or Y, at each fixed category of the other variable, equals 6.0. For instance, the XZ odds ratio at the first category of Y equals (18 × 8)/(12 × 2) = 6.0. The conditional odds (given response) of receiving treatment A at clinic 1 are six times those at clinic 2, and the conditional odds (given treatment) of success at clinic 1 are six times those at clinic 2. Clinic 1 tends to use treatment A more often, and clinic 1 also tends to have more successes. For instance, if patients at clinic 1 tended to be younger and in better health than those at clinic 2, perhaps they had a better success rate regardless of the treatment received.

It is misleading to study only the marginal table, concluding that successes are more likely with treatment A. Subjects within a particular clinic are likely to be more homogeneous than the overall sample, and response is independent of treatment in each clinic.

2.3.5 Homogeneous Association

A 2 × 2 × K table has homogeneous XY association when

θ_XY(1) = θ_XY(2) = ··· = θ_XY(K).
Then the effect of X on Y is the same at each category of Z. Conditional independence of X and Y is the special case in which each θ_XY(k) = 1.0. Under homogeneous XY association, homogeneity also holds for the other associations. For instance, the conditional odds ratio between two categories of X and two categories of Z is identical at each category of Y.

For the odds ratio, homogeneous association is a symmetric property. It applies to any pair of variables viewed across the categories of the third. When it occurs, there is said to be no interaction between two variables in their effects on the other variable. When interaction exists, the conditional odds ratio for any pair of variables changes across categories of the third. For X = smoking (yes, no), Y = lung cancer (yes, no), and Z = age (<45, 45–65, >65), suppose that θ_XY(1) = 1.2, θ_XY(2) = 3.9, and θ_XY(3) = 8.8. Then smoking has a weak effect on lung cancer for young people, but the effect strengthens considerably with age. Age is called an effect modifier; the effect of smoking is modified depending on its value.

For the death penalty data (Table 2.6), θ̂_XY(1) = 0.43 and θ̂_XY(2) = 0.0. The values are not close, but the second estimate is unstable because of the zero cell count. Adding 1/2 to each cell count, θ̂_XY(2) = 0.94. Because θ̂_XY(2) is unstable and because further variation occurs from sampling variability, these partial tables do not necessarily contradict homogeneous association in a population. In Section 6.3 we show how to analyze whether sample data are consistent with homogeneous association or conditional independence.

2.4 EXTENSIONS FOR I × J TABLES

For 2 × 2 tables, a single number such as the odds ratio can summarize the association. For I × J tables, it is rarely possible to summarize association by a single number without some loss of information. However, a set of odds ratios or another summary index can describe certain features of the association.
2.4.1 Odds Ratios in I × J Tables

Odds ratios can use each of the (I choose 2) = I(I − 1)/2 pairs of rows in combination with each of the (J choose 2) = J(J − 1)/2 pairs of columns. For rows a and b and columns c and d, the odds ratio (π_ac π_bd)/(π_bc π_ad) uses four cells in a rectangular pattern. There are (I choose 2)(J choose 2) odds ratios of this type. This set of odds ratios contains much redundant information. Consider the subset of (I − 1)(J − 1) local odds ratios

θ_ij = (π_ij π_{i+1,j+1}) / (π_{i,j+1} π_{i+1,j}),   i = 1, …, I − 1,  j = 1, …, J − 1.    (2.10)

Figure 2.3 shows that local odds ratios use cells in adjacent rows and adjacent columns. These (I − 1)(J − 1) odds ratios determine all odds ratios formed from pairs of rows and pairs of columns. To illustrate, in Table 2.1, the sample local odds ratio is 2.08 for the first two columns and 1.74 for the second and third columns. In each case, the more serious outcome was more prevalent for the placebo group. The product of these two odds ratios is 3.63, which is the odds ratio for the first and third columns.

FIGURE 2.3 Odds ratios for I × J tables.

Construction (2.10) for a minimal set of odds ratios is not unique. Another basic set is

α_ij = (π_ij π_IJ) / (π_Ij π_iJ),   i = 1, …, I − 1,  j = 1, …, J − 1.    (2.11)

This uses the rectangular pattern of cells determined by the cell in row i and column j and the cell in the last row and last column. Figure 2.3 illustrates. Given the marginal distributions {π_i+} and {π_+j}, when {π_ij > 0}, conversion of the probabilities into the set of odds ratios (2.10) or (2.11) does not discard information. The cell probabilities determine the odds ratios, and given the marginals, the odds ratios determine the cell probabilities. In this sense, (I − 1)(J − 1) parameters can describe any association in an I × J table. Independence is equivalent to all (I − 1)(J − 1) odds ratios equaling 1.0.
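A short Python sketch of construction (2.10); the 2 × 3 table below is hypothetical, since Table 2.1's counts are not reproduced in this excerpt. It also checks the claim that local odds ratios determine the odds ratios for nonadjacent columns:

```python
def local_odds_ratios(p):
    """Local odds ratios (2.10) for a table of counts or probabilities."""
    I, J = len(p), len(p[0])
    return [[(p[i][j] * p[i + 1][j + 1]) / (p[i][j + 1] * p[i + 1][j])
             for j in range(J - 1)] for i in range(I - 1)]

# Hypothetical 2 x 3 table of counts (not Table 2.1's actual data).
counts = [[30, 20, 10],
          [15, 25, 20]]
theta = local_odds_ratios(counts)
print(theta)   # [[2.5, 1.6]]

# The (I-1)(J-1) local odds ratios determine all rectangular odds ratios:
# for columns 1 and 3, the spanning odds ratio is the product of the two.
spanning = (counts[0][0] * counts[1][2]) / (counts[0][2] * counts[1][0])
assert abs(theta[0][0] * theta[0][1] - spanning) < 1e-9   # 2.5 * 1.6 = 4.0
```

This mirrors the Table 2.1 illustration in the text, where 2.08 × 1.74 = 3.63 recovers the odds ratio for the first and third columns.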
For three-way I × J × K tables, sets of odds ratios in the partial tables describe the conditional association. Homogeneous XY association means that any conditional odds ratio formed using two categories of X and two categories of Y is the same at each category of Z.

2.4.2 Summary Measures of Association

An alternative way to describe association uses a single summary index. We discuss this first for nominal variables and then ordinal variables. The most interpretable indices for nominal variables have the same structure as R-squared for interval variables. It and the more general intraclass correlation coefficient and correlation ratio (Kendall and Stuart 1979) describe the proportional reduction in variance from the marginal distribution of the response Y to the conditional distributions of Y given an explanatory variable X.

Let V(Y) denote a measure of variation for the marginal distribution {π_+j} of Y, and let V(Y|i) denote this measure computed for the conditional distribution {π_1|i, …, π_J|i} of Y at the ith setting of X. A proportional reduction in variation measure has the form

[V(Y) − E[V(Y|X)]] / V(Y),    (2.12)

where E[V(Y|X)] is the expectation of the conditional variation taken with respect to the distribution of X. For the marginal distribution {π_i+} of X, E[V(Y|X)] = Σ_i π_i+ V(Y|i).

For a nominal response, Theil (1970) proposed an index using the variation measure V(Y) = −Σ_j π_+j log π_+j, called the entropy. For contingency tables, the proportional reduction in entropy equals

U = − [Σ_i Σ_j π_ij log(π_ij / (π_i+ π_+j))] / [Σ_j π_+j log π_+j],    (2.13)

called the uncertainty coefficient. This measure is well defined when more than one π_+j > 0. It takes values between 0 and 1: U = 0 is equivalent to independence of X and Y; U = 1 is equivalent to a lack of conditional variation, in the sense that for each i, π_j|i = 1 for some j. Various measures of form (2.12)
describe association in I × J tables (e.g., Problems 2.38 and 2.39). A difficulty with them is developing intuition for how large a value constitutes a strong association. What does it mean, for instance, to say that there is a 30% reduction in entropy? Summary measures seem easier to interpret and more useful when both classifications are ordinal, as discussed next.

2.4.3 Ordinal Trends: Concordant and Discordant Pairs

In Table 2.8 the variables are income and job satisfaction, measured for the black males in a national (U.S.) sample. Both classifications are ordinal, job satisfaction with the categories very dissatisfied (VD), little dissatisfied (LD), moderately satisfied (MS), and very satisfied (VS).

TABLE 2.8 Cross-Classification of Job Satisfaction by Income

                       Job Satisfaction
Income (dollars)    Very          Little        Moderately   Very
                    Dissatisfied  Dissatisfied  Satisfied    Satisfied
<15,000             1             3             10           6
15,000–25,000       2             3             10           7
25,000–40,000       1             6             14           12
>40,000             0             1             9            11

Source: 1996 General Social Survey, National Opinion Research Center.

When X and Y are ordinal, a monotone trend association is common. As the level of X increases, responses on Y tend to increase toward higher levels, or responses on Y tend to decrease toward lower levels. For instance, perhaps job satisfaction tends to increase as income does. A single parameter can describe this trend. Measures analogous to the correlation describe the degree to which the relationship is monotone.

Some measures are based on classifying each pair of subjects as concordant or discordant. A pair is concordant if the subject ranked higher on X also ranks higher on Y. The pair is discordant if the subject ranking higher on X ranks lower on Y. The pair is tied if the subjects have the same classification on X and/or Y.

We illustrate for Table 2.8. Consider a pair of subjects, one in the cell (<15,000, VD) and the other in the cell (15,000–25,000, LD).
This pair is concordant, since the second subject ranks higher than the first both on income and on job satisfaction. The subject in cell (<15,000, VD) forms concordant pairs when matched with each of the three subjects classified (15,000–25,000, LD), so these two cells provide 1 × 3 = 3 concordant pairs. The subject in the cell (<15,000, VD) is also part of a concordant pair when matched with each of the other (10 + 7 + 6 + 14 + 12 + 1 + 9 + 11) subjects ranked higher on both variables. Similarly, the three subjects in the (<15,000, LD) cell are part of concordant pairs when matched with the (10 + 7 + 14 + 12 + 9 + 11) subjects ranked higher on both variables. The total number of concordant pairs, denoted by C, equals

C = 1(3 + 10 + 7 + 6 + 14 + 12 + 1 + 9 + 11) + 3(10 + 7 + 14 + 12 + 9 + 11)
  + 10(7 + 12 + 11) + 2(6 + 14 + 12 + 1 + 9 + 11) + 3(14 + 12 + 9 + 11)
  + 10(12 + 11) + 1(1 + 9 + 11) + 6(9 + 11) + 14(11) = 1331.

The total number of discordant pairs of observations is

D = 3(2 + 1 + 0) + 10(2 + 3 + 1 + 6 + 0 + 1) + ··· + 12(0 + 1 + 9) = 849.

In this example, C > D, suggesting a tendency for low income to occur with low job satisfaction and high income with high job satisfaction.

Consider two independent observations from a joint probability distribution {π_ij}. For that pair, the probabilities of concordance and discordance are

Π_c = 2 Σ_i Σ_j π_ij (Σ_{h>i} Σ_{k>j} π_hk),   Π_d = 2 Σ_i Σ_j π_ij (Σ_{h>i} Σ_{k<j} π_hk).

Here i and j are fixed in the inner summations, and the factor of 2 occurs because the first observation could be in cell (i, j) and the second in cell (h, k), or vice versa. Several association measures for ordinal variables utilize the difference Π_c − Π_d.

2.4.4 Ordinal Measure of Association: Gamma

Given that a pair is untied on both variables, Π_c/(Π_c + Π_d) is the probability of concordance and Π_d/(Π_c + Π_d) is the probability of discordance.
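The pair counting just described can be automated. A minimal Python sketch using the Table 2.8 counts:

```python
def concordance_counts(table):
    """Count concordant (C) and discordant (D) pairs for an ordinal table."""
    I, J = len(table), len(table[0])
    C = D = 0
    for i in range(I):
        for j in range(J):
            for h in range(i + 1, I):       # h > i, so each pair counts once
                for k in range(J):
                    if k > j:
                        C += table[i][j] * table[h][k]
                    elif k < j:
                        D += table[i][j] * table[h][k]
    return C, D

# Table 2.8: income (rows, low to high) by job satisfaction (VD, LD, MS, VS).
table = [[1, 3, 10, 6],
         [2, 3, 10, 7],
         [1, 6, 14, 12],
         [0, 1, 9, 11]]
C, D = concordance_counts(table)
print(C, D)                         # 1331 849
print(round((C - D) / (C + D), 3))  # 0.221, the sample gamma of Section 2.4.4
```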
The difference between these probabilities is

γ = (Π_c − Π_d) / (Π_c + Π_d),    (2.14)

called gamma (Goodman and Kruskal 1954). The sample version is γ̂ = (C − D)/(C + D).

Like the correlation, gamma treats the variables symmetrically; it is unnecessary to identify one classification as a response variable. Also like the correlation, gamma has range −1 ≤ γ ≤ 1. A reversal in the category orderings of one variable causes a change in the sign of γ. Whereas the absolute value of the correlation is 1 when the relationship between X and Y is perfectly linear, only monotonicity is required for |γ| = 1, with γ = 1 if Π_d = 0 and γ = −1 if Π_c = 0. Independence implies that γ = 0, but the converse is not true. For instance, a U-shaped joint distribution can have Π_c = Π_d and hence γ = 0.

2.4.5 Gamma for Job Satisfaction Example

For Table 2.8, C = 1331 and D = 849. Hence,

γ̂ = (1331 − 849) / (1331 + 849) = 0.221.

Only a weak tendency exists for job satisfaction to increase as income increases. Of the untied pairs, the proportion of concordant pairs is 0.221 higher than the proportion of discordant pairs.

NOTES

Section 2.2: Comparing Two Proportions

2.1. Breslow (1996) presented an interesting overview of the development of methods for case–control studies.

2.2. For 2 × 2 tables, Edwards (1963) showed that functions of the odds ratio are the only statistics that are invariant both to row–column interchange and to multiplication within rows or within columns by a constant. For I × J tables, Altham (1970) gave related results. Yule (1912, p. 587) had argued that multiplicative invariance is a desirable property for measures of association, especially when proportions sampled in various marginal categories are arbitrary. Goodman (2000) showed five ways of viewing association in a 2 × 2 table and proposed a general measure that includes all five.

Section 2.3: Partial Association in Stratified 2 × 2 Tables

2.3. Paik (1985)
proposed circle diagrams of the type of Figure 2.2 to summarize three-way tables. Friendly (2000) discussed graphical presentation of categorical data. For more on Simpson's paradox and when it can happen, see Blyth (1972), Davis (1989), Dong (1998), Samuels (1993), and Simpson (1951). Good and Mittal (1989) extended it to an amalgamation paradox, whereby a marginal measure is greater than the maximum or less than the minimum of the partial table measures.

Section 2.4: Extensions for I × J Tables

2.4. For continuous variables, samples can be fully ranked (i.e., no ties occur), so C + D = (n choose 2) and γ̂ = (C − D)/(n choose 2). This is Kendall's tau. Agresti (1984, Chaps. 9 and 10) and Kruskal (1958) surveyed ordinal measures of association. These also apply when one variable is ordinal and the other is binary. When Y is ordinal and X is nominal with I > 2, no measure presented in Section 2.4 is very helpful. Ordinal modeling approaches (Section 7.2) use a parameter for each category of X; comparing parameters compares the ordinal response for pairs of categories of X.

PROBLEMS

Applications

2.1 An article in the New York Times (Feb. 17, 1999) about the PSA blood test for detecting prostate cancer stated: "The test fails to detect prostate cancer in 1 in 4 men who have the disease (false-negative results), and as many as two-thirds of the men tested receive false-positive results." Let C (C̄) denote the event of having (not having) prostate cancer, and let + (−) denote a positive (negative) test result. Which is true: P(−|C) = 1/4 or P(C|−) = 1/4? P(C|+) = 2/3 or P(+|C) = 2/3? Determine the sensitivity and specificity.

2.2 A diagnostic test has sensitivity = specificity = 0.80. Find the odds ratio between true disease status and the diagnostic test result.

2.3 Table 2.9 is based on records of accidents in 1988 compiled by the Department of Highway Safety and Motor Vehicles in Florida.
Identify the response variable, and find and interpret the difference of proportions, relative risk, and odds ratio. Why are the relative risk and odds ratio approximately equal?

TABLE 2.9 Data for Problem 2.3

                            Injury
Safety Equipment in Use   Fatal   Nonfatal
None                      1,601   162,527
Seat belt                 510     412,368

Source: Florida Department of Highway Safety and Motor Vehicles.

2.4 Consider the following two studies reported in the New York Times.
a. A British study reported (Dec. 3, 1998) that of smokers who get lung cancer, "women were 1.7 times more vulnerable than men to get small-cell lung cancer." Is 1.7 the odds ratio or the relative risk?
b. A National Cancer Institute study about tamoxifen and breast cancer reported (Apr. 7, 1998) that the women taking the drug were 45% less likely to experience invasive breast cancer than were women taking placebo. Find the relative risk for (i) those taking the drug compared to those taking placebo, and (ii) those taking placebo compared to those taking the drug.

2.5 A study (E. G. Krug et al., Internat. J. Epidemiol. 27: 214–221, 1998) reported that the number of gun-related deaths per 100,000 people in 1994 was 14.24 in the United States, 4.31 in Canada, 2.65 in Australia, 1.24 in Germany, and 0.41 in England and Wales. Use the relative risk to compare the United States with the other countries. Interpret.

2.6 A newspaper article preceding the 1994 World Cup semifinal match between Italy and Bulgaria stated that "Italy is favored 10–11 to beat Bulgaria, which is rated at 10–3 to reach the final." Suppose that this means that the odds that Italy wins are 11/10 and the odds that Bulgaria wins are 3/10. Find the probability that each team wins, and comment.

2.7 In the United States, the estimated annual probability that a woman over the age of 35 dies of lung cancer equals 0.001304 for current smokers and 0.000121 for nonsmokers (M. Pagano and K. Gauvreau, Principles of Biostatistics, Duxbury Press, Pacific Grove, CA, 1993, p.
134).
a. Find and interpret the difference of proportions and the relative risk. Which measure is more informative for these data? Why?
b. Find and interpret the odds ratio. Explain why the relative risk and odds ratio take similar values.

2.8 For adults who sailed on the Titanic on its fateful voyage, the odds ratio between gender (female, male) and survival (yes, no) was 11.4. (For data, see R. J. M. Dawson, J. Statist. Ed. 3, 1995.)
a. What is wrong with the interpretation, "The probability of survival for females was 11.4 times that for males"? Give the correct interpretation. When would the quoted interpretation be approximately correct?
b. The odds of survival for females equaled 2.9. For each gender, find the proportion who survived.

2.9 In an article about crime in the United States, Newsweek (Jan. 10, 1994) quoted FBI statistics for 1992 stating that of blacks slain, 94% were slain by blacks, and of whites slain, 83% were slain by whites. Let Y = race of victim and X = race of murderer. Which conditional distribution do these statistics refer to, Y|X or X|Y? What additional information would you need to estimate the probability that the victim was white given that a murderer was white? Find and interpret the odds ratio.

2.10 A research study estimated that under a certain condition, the probability that a subject would be referred for heart catheterization was 0.906 for whites and 0.847 for blacks.
a. A press release about the study stated that the odds of referral for cardiac catheterization for blacks are 60% of the odds for whites. Explain how they obtained 60% (more accurately, 57%).
b. An Associated Press story later described the study and said "Doctors were only 60% as likely to order cardiac catheterization for blacks as for whites." Explain what is wrong with this interpretation. Give the correct percentage for this interpretation.
(In stating results to the general public, it is better to use the relative risk than the odds ratio. It is simpler to understand and less likely to be misinterpreted. For details, see New Engl. J. Med. 341: 279–283, 1999.)

2.11 A 20-year cohort study of British male physicians (R. Doll and R. Peto, British Med. J. 2: 1525–1536, 1976) noted that the proportion per year who died from lung cancer was 0.00140 for cigarette smokers and 0.00010 for nonsmokers. The proportion who died from coronary heart disease was 0.00669 for smokers and 0.00413 for nonsmokers.
a. Describe the association of smoking with each of lung cancer and heart disease, using the difference of proportions, relative risk, and odds ratio. Interpret.
b. Which response is more strongly related to cigarette smoking, in terms of the reduction in number of deaths that would occur with elimination of cigarettes? Explain.

2.12 Table 2.10 refers to applicants to graduate school at the University of California at Berkeley, for fall 1973. It presents admissions decisions by gender of applicant for the six largest graduate departments. Denote the three variables by A = whether admitted, G = gender, and D = department. Find the sample AG conditional odds ratios and the marginal odds ratio. Interpret, and explain why they give such different indications of the AG association.

TABLE 2.10 Data for Problem 2.12

              Whether Admitted
              Male          Female
Department    Yes    No     Yes    No
A             512    313    89     19
B             353    207    17     8
C             120    205    202    391
D             138    279    131    244
E             53     138    94     299
F             22     351    24     317
Total         1198   1493   557    1278

Source: Data from Freedman et al. (1978, p. 14). See also P. Bickel et al., Science 187: 398–403 (1975).

2.13 State three "real-world" variables X, Y, and Z for which you expect a marginal association between X and Y but conditional independence controlling for Z.
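A quick way to check the hand computations that Problem 2.12 asks for; this Python sketch is an aid to verification, not a substitute for interpreting the results:

```python
# Table 2.10 (Problem 2.12): Berkeley admissions for six departments.
# Each tuple: (male admitted, male rejected, female admitted, female rejected).
depts = {"A": (512, 313, 89, 19), "B": (353, 207, 17, 8),
         "C": (120, 205, 202, 391), "D": (138, 279, 131, 244),
         "E": (53, 138, 94, 299), "F": (22, 351, 24, 317)}

def odds_ratio(n11, n12, n21, n22):
    return (n11 * n22) / (n12 * n21)

# Conditional AG odds ratios, one per department.
conditional = {d: round(odds_ratio(*cells), 2) for d, cells in depts.items()}
print(conditional)

# Marginal AG odds ratio, collapsing over department.
marginal = [sum(cells[i] for cells in depts.values()) for i in range(4)]
print(round(odds_ratio(*marginal), 2))   # ~1.84
```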
2.14 Based on 1987 murder rates in the United States, an Associated Press story reported that the probability that a newborn child has of eventually being a murder victim is 0.0263 for nonwhite males, 0.0049 for white males, 0.0072 for nonwhite females, and 0.0023 for white females.
a. Find the conditional odds ratios between race and whether a murder victim, given the gender. Interpret. Do these variables exhibit homogeneous association?
b. Half the newborns are of each gender, for each race. Find the marginal odds ratio between race and whether a murder victim.

2.15 At each age level, the death rate is higher in South Carolina than in Maine, but overall, the death rate is higher in Maine. Explain how this could be possible. (For data, see H. Wainer, Chance 12: 44, 1999.)

2.16 A study of the death penalty for cases in Kentucky between 1976 and 1991 (T. Keil and G. Vito, Amer. J. Criminal Justice 20: 17–36, 1995) indicated that the defendant received the death penalty in 8% of the 391 cases in which a white killed a white, in 2% of the 108 cases in which a black killed a black, in 12% of the 57 cases in which a black killed a white, and in 0% of the 18 cases in which a white killed a black. Form the three-way contingency table, obtain the conditional odds ratios between the defendant's race and the death penalty verdict, interpret those associations, study whether Simpson's paradox occurs, and explain why the marginal association is so different from the conditional associations.

2.17 An estimated odds ratio for adult females between the presence of squamous cell carcinoma (yes, no) and smoking behavior (smoker, nonsmoker) equals 11.7 when the smoker category has subjects whose smoking level s is 0 < s < 20 cigarettes per day; it is 26.1 for smokers with s ≥ 20 cigarettes per day (R. C. Brownson et al., Epidemiology 3: 61–64, 1992). Show that the estimated odds ratio between carcinoma (yes, no)
and the smoking levels (s ≥ 20, 0 < s < 20) equals 2.2.

2.18 Table 2.11 refers to a retrospective study of lung cancer and tobacco smoking among patients in several English hospitals. The table compares male lung cancer patients with control patients having other diseases, according to the average number of cigarettes smoked daily over a 10-year period preceding the onset of the disease.
a. Find the sample odds of lung cancer at each smoking level and the five odds ratios that pair each level of smoking with no smoking. As smoking increases, is there a trend? Interpret.
b. If the log odds of lung cancer is linearly related to smoking level, the log odds in row i satisfies log(odds_i) = α + βi. Show that this implies that the local odds ratios are identical.
c. Using these data, can you estimate the probability of lung cancer at each level of smoking? Are the estimated odds ratios in part (a) meaningful? Explain.
d. Show that the disease groups are stochastically ordered with respect to their distributions on smoking of cigarettes (see Problem 2.34 and Section 7.3.4). Interpret.

TABLE 2.11 Data for Problem 2.18

Daily Average                  Disease Group
Number of Cigarettes   Lung Cancer Patients   Control Patients
None                   7                      61
<5                     55                     129
5–14                   489                    570
15–24                  475                    431
25–49                  293                    154
50+                    38                     12

Source: Reprinted with permission from R. Doll and A. B. Hill, British Med. J. 2: 1271–1286 (1952).

TABLE 2.12 Data for Problem 2.19

                        Wife's Rating of Sexual Fun
Husband's Rating        Never or      Fairly   Very    Almost
                        Occasionally  Often    Often   Always
Never or occasionally   7             7        2       3
Fairly often            2             8        3       7
Very often              1             5        4       9
Almost always           2             8        9       14

Source: Reprinted with permission from Hout et al. (1987).

2.19 Table 2.12 summarizes responses of 91 married couples in Arizona to a question about how often sex is fun. Find and interpret a measure of association between wife's response and husband's response.

2.20 Table 2.13 is from an early study on the death penalty in Florida.
Analyze these data and show that Simpson's paradox occurs.

TABLE 2.13 Data for Problem 2.20

                                   Death Penalty
Victim's Race   Defendant's Race   Yes   No
White           White              19    132
                Black              11    52
Black           White              0     9
                Black              6     97

Source: Reprinted with permission from M. L. Radelet, Amer. Sociol. Rev. 46: 918–927 (1981).

Theory and Methods

2.21 For a diagnostic test of a certain disease, π_1 denotes the probability that the diagnosis is positive given that a subject has the disease, and π_2 denotes the probability that the diagnosis is positive given that a subject does not have it. Let ρ denote the probability that a subject does have the disease.
a. Given that the diagnosis is positive, show that the probability that a subject does have the disease is

π_1 ρ / [π_1 ρ + π_2 (1 − ρ)].

b. Suppose that a diagnostic test for HIV+ status has both sensitivity and specificity equal to 0.95, and ρ = 0.005. Find the probability that a subject is truly HIV+, given that the diagnostic test is positive. To better understand this answer, find the joint probabilities relating diagnosis to actual disease status, and discuss their relative sizes.

2.22 Binomial parameters for two groups are graphed, with π_1 on the horizontal axis and π_2 on the vertical axis. Plot the locus of points for a 2 × 2 table having (a) relative risk = 0.5, (b) odds ratio = 0.5, and (c) difference of proportions = −0.5.

2.23 Let D denote having a certain disease and E denote having exposure to a certain risk factor. The attributable risk (AR) is the proportion of disease cases attributable to that exposure (see Benichou 1998).
a. Let P(Ē) = 1 − P(E). Explain why

AR = [P(D) − P(D|Ē)] / P(D).

b. Show that AR relates to the relative risk RR by

AR = P(E)(RR − 1) / [1 + P(E)(RR − 1)].

2.24 For a 2 × 2 table of counts {n_ij}, show that the odds ratio is invariant to (a) interchanging rows with columns, and (b) multiplication of cell counts within rows or within columns by c ≠ 0.
Show that the difference of proportions and the relative risk do not have these properties.

2.25 For given π_1 and π_2, show that the relative risk cannot be farther than the odds ratio from their independence value of 1.0.

2.26 Explain why for three events E_1, E_2, and E_3 and their complements, it is possible that P(E_1|E_2) > P(E_1|Ē_2) even if both P(E_1|E_2E_3) < P(E_1|Ē_2E_3) and P(E_1|E_2Ē_3) < P(E_1|Ē_2Ē_3). (Hint: Use Simpson's paradox for a three-way table.)

2.27 Let π_ij|k = P(X = i, Y = j | Z = k). Explain why XY conditional independence is π_ij|k = π_i+|k π_+j|k for all i, j, and k.

2.28 For a 2 × 2 × 2 table, show that homogeneous association is a symmetric property, by showing that equal XY conditional odds ratios is equivalent to equal YZ conditional odds ratios.

2.29 Smith and Jones are baseball players. Smith has a higher batting average than Jones in each of K years. Is it possible that for the combined data from the K years, Jones has the higher batting average? Explain, using an example to illustrate.

2.30 When X and Y are conditionally dependent at each level of Z yet marginally independent, Z is called a suppressor variable. Specify joint probabilities for a 2 × 2 × 2 table to show that this can happen (a) when there is homogeneous association, and (b) when the association has opposite direction in the partial tables.

2.31 Show that the {α_ij} in (2.11) determine (a) all (I choose 2)(J choose 2) odds ratios formed from pairs of rows and pairs of columns, and (b) all {θ_ij} in (2.10), and vice versa.

2.32 Refer to Problem 2.31. When all rows and columns have positive probability, show that independence is equivalent to all {α_ij = 1}.

2.33 For I × J contingency tables, explain why the variables are independent when the (I − 1)(J − 1) differences π_j|i − π_j|I = 0, i = 1, …, I − 1, j = 1, …, J − 1.

2.34 A 2 × J table has ordinal response. Let F_j|i = π_1|i + ··· + π_j|i.
When F_j|2 ≤ F_j|1 for j = 1, …, J, the conditional distribution in row 2 is stochastically higher than the one in row 1. Consider the cumulative odds ratios

θ_j = [F_j|1 / (1 − F_j|1)] / [F_j|2 / (1 − F_j|2)],   j = 1, …, J − 1.

a. Show that log θ_j ≥ 0 for all j is equivalent to row 2 being stochastically higher than row 1. Explain why row 2 is then more likely than row 1 to have observations at the high end of the ordinal scale.
b. If all local log odds ratios are nonnegative, then log θ_j ≥ 0 for 1 ≤ j ≤ J − 1 (Lehmann 1966). Show by counterexample that the converse is not true.

2.35 Suppose that {Y_ij} are independent Poisson variates with means {μ_ij}. Show that P(Y_ij = n_ij) for all i, j, conditional on {Y_i+ = n_i}, satisfy independent multinomial sampling [i.e., the product of (2.2) for all i] within the rows.

2.36 For 2 × 2 tables, Yule (1900, 1912) introduced

Q = (π_11 π_22 − π_12 π_21) / (π_11 π_22 + π_12 π_21),

which he labeled Q in honor of the Belgian statistician Quetelet. It is now called Yule's Q.
a. Show that for 2 × 2 tables, Goodman and Kruskal's γ = Q.
b. Show that Q falls between −1 and 1.
c. State conditions under which Q = −1 or Q = 1.
d. Show that Q relates to the odds ratio by Q = (θ − 1)/(θ + 1), a monotone transformation of θ from the [0, ∞] scale onto the [−1, +1] scale.

2.37 When X and Y are ordinal with counts {n_ij}:
a. Explain why the (n choose 2) pairs of observations partition into C + D + T_X + T_Y − T_XY, where T_X = Σ_i n_i+(n_i+ − 1)/2 pairs are tied on X, T_Y pairs are tied on Y, and T_XY pairs are tied on both X and Y.
b. For each ordered pair of observations (X_a, Y_a) and (X_b, Y_b), let X_ab = sign(X_a − X_b) and Y_ab = sign(Y_a − Y_b). Show that the sample correlation for the n(n − 1) distinct (X_ab, Y_ab) pairs is

τ_b = (C − D) / {[(n choose 2) − T_X][(n choose 2) − T_Y]}^{1/2}.

This ordinal measure, called Kendall's tau-b (Kendall 1945), is less sensitive than gamma to the choice of response categories.
c. Let d = (C − D)/[(n choose 2) − T_X]. Explain why d is the difference between the proportions of concordant and discordant pairs out of those pairs untied on X (Somers 1962). (For 2 × 2 tables, d equals the difference of proportions, and tau-b equals the correlation between X and Y.)

2.38 Goodman and Kruskal (1954) proposed an association measure (tau) for nominal variables based on the variation measure

V(Y) = Σ_j π_+j(1 − π_+j) = 1 − Σ_j π²_+j.

a. Show that V(Y) is the probability that two independent observations on Y fall in different categories (called the Gini concentration index). Show that V(Y) = 0 when π_+j = 1 for some j and V(Y) takes maximum value of (J − 1)/J when π_+j = 1/J for all j.
b. For the proportional reduction in variation, show that E[V(Y|X)] = 1 − Σ_i Σ_j π²_ij/π_i+. [The resulting measure (2.12) is called the concentration coefficient. Like U, τ = 0 is equivalent to independence. Haberman (1982) presented generalized concentration and uncertainty coefficients.]

2.39 The measure of association lambda for nominal variables (Goodman and Kruskal 1954) has V(Y) = 1 − max{π_+j} and V(Y|i) = 1 − max_j{π_j|i}. Interpret lambda as a proportional reduction in prediction error for predictions which select the response category that is most likely. Show that independence implies λ = 0 but that the converse is not true.

CHAPTER 3 Inference for Contingency Tables

In this chapter we introduce inferential methods for contingency tables. Many of these methods also play a vital role in analyses of later chapters for which categorical data need not have contingency table form. The methods assume Poisson, multinomial, or independent binomial sampling. In Section 3.1 we present confidence intervals for measures of association for 2 × 2 tables such as the odds ratio.
Section 3.2 covers chi-squared tests of the hypothesis of independence between two categorical variables. Like any significance test, these have limited usefulness. In Section 3.3 we show how to follow up the test using residuals or the partitioning property of chi-squared to extract components that describe the evidence about the association. In Section 3.4 we present more powerful inference applicable with ordered categories. The methods of Sections 3.1 through 3.4 assume large samples. In Sections 3.5 and 3.6 we introduce small-sample methods.

3.1 CONFIDENCE INTERVALS FOR ASSOCIATION PARAMETERS

The accuracy of estimators of association parameters is characterized by the standard errors of their sampling distributions. In this section we present large-sample standard errors and confidence intervals.

3.1.1 Interval Estimation of Odds Ratios

The sample odds ratio $\hat\theta = n_{11}n_{22}/n_{12}n_{21}$ for a $2 \times 2$ table equals 0 or $\infty$ if any $n_{ij} = 0$, and it is undefined if both entries in a row or column are zero. Since these outcomes have positive probabilities, the expected value and variance of $\hat\theta$ and $\log\hat\theta$ do not exist. (In fact, this is also true for ML estimators of model parameters presented in later chapters.) In terms of bias and mean-squared error, Gart and Zweifel (1967) and Haldane (1956) showed that the amended estimators

\tilde\theta = \frac{(n_{11} + 0.5)(n_{22} + 0.5)}{(n_{12} + 0.5)(n_{21} + 0.5)}

and $\log\tilde\theta$ behave well (Problem 14.4). The estimators $\hat\theta$ and $\tilde\theta$ have the same asymptotic normal distribution around $\theta$. Unless $n$ is quite large, however, their distributions are highly skewed. When $\theta = 1$, for instance, $\hat\theta$ cannot be much smaller than $\theta$ (since $\hat\theta \ge 0$), but it could be much larger with nonnegligible probability. The log transform, having an additive rather than multiplicative structure, converges more rapidly to normality. An estimated standard error for $\log\hat\theta$ is

\hat\sigma(\log\hat\theta) = \left(\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}\right)^{1/2}.
(3.1)

We derive this formula in Section 3.1.7. By the large-sample normality of $\log\hat\theta$,

\log\hat\theta \pm z_{\alpha/2}\,\hat\sigma(\log\hat\theta)    (3.2)

is a Wald confidence interval for $\log\theta$. Exponentiating (taking antilogs of) its endpoints provides a confidence interval for $\theta$. Woolf (1955) proposed this interval. It works quite well, usually being a bit conservative (i.e., actual coverage probability higher than the nominal level). When $\hat\theta = 0$ or $\infty$, Woolf's interval does not exist. When $\hat\theta = 0$, one should take 0 as the lower limit, and when $\hat\theta = \infty$, one should take $\infty$ as the upper limit. The other bound can use the Woolf formula following some adjustment, such as Gart's (1966), which replaces $\{n_{ij}\}$ by $\{n_{ij} + 0.5\}$ in the estimator and standard error. A less ad hoc approach forms the interval by inverting score tests (Cornfield 1956) or likelihood-ratio tests for $\theta$, as we discuss in Section 3.1.8.

3.1.2 Aspirin and Myocardial Infarction Example

We illustrate inference for the odds ratio with Table 3.1, based on a Swedish study of the association between aspirin use and myocardial infarction similar to that described in Section 2.2.5. The study randomly assigned 1360 patients who had already suffered a stroke to an aspirin treatment (one low-dose tablet a day) or to a placebo treatment. Table 3.1 reports the number of deaths due to myocardial infarction during a follow-up period of about 3 years.

TABLE 3.1 Swedish Study on Aspirin Use and Myocardial Infarction

             Myocardial Infarction
             Yes     No     Total
Placebo       28    656      684
Aspirin       18    658      676

Source: Based on results described in Lancet 338: 1345-1349 (1991).

The sample odds ratio $\hat\theta = 1.56$ is close to $\tilde\theta = 1.55$, since no cell count is especially small. For $\log\hat\theta = 0.445$, the standard error (3.1) is $\hat\sigma(\log\hat\theta) = 0.307$. A 95% confidence interval for $\log\theta$ in the population this sample represents is $0.445 \pm 1.96(0.307)$, or $(-0.157, 1.047)$. The corresponding interval for $\theta$ is $[\exp(-0.157), \exp(1.047)]$, or $(0.85, 2.85)$.
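The calculation is easy to reproduce. The following Python sketch (ours, not from the text) computes $\hat\theta$, the standard error (3.1), and the Wald interval (3.2) for the Table 3.1 counts:

```python
from math import exp, log, sqrt

# Cell counts from Table 3.1 (placebo row, then aspirin row)
n11, n12 = 28, 656
n21, n22 = 18, 658

theta_hat = (n11 * n22) / (n12 * n21)        # sample odds ratio
log_theta = log(theta_hat)
se = sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)     # estimated SE (3.1) of log odds ratio

z = 1.96                                     # z_{alpha/2} for 95% confidence
lo, hi = log_theta - z * se, log_theta + z * se   # Wald interval (3.2) for log(theta)
ci = (exp(lo), exp(hi))                      # exponentiate endpoints for theta
```

Rounding the results reproduces $\hat\theta = 1.56$, $\hat\sigma(\log\hat\theta) = 0.307$, and the interval $(0.85, 2.85)$ for $\theta$.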
The estimate of the true odds ratio is rather imprecise. Since the confidence interval for $\theta$ contains 1.0, it is plausible that the true odds of death due to myocardial infarction are equal for aspirin and placebo. If there truly is a beneficial effect of aspirin but the odds ratio is not large, a large sample size may be required to show that benefit, because of the relatively small number of myocardial infarction cases (Problem 3.21).

3.1.3 Interval Estimation of Difference of Proportions

The difference of proportions and the relative risk compare conditional distributions of a response variable for two groups. For these measures, we treat the samples as independent binomials. For group $i$, $y_i$ has a binomial distribution with sample size $n_i$ and probability $\pi_i$ of a "success" response. The sample proportion $\hat\pi_i = y_i/n_i$ has expectation $\pi_i$ and variance $\pi_i(1 - \pi_i)/n_i$. Since $\hat\pi_1$ and $\hat\pi_2$ are independent, their difference has $E(\hat\pi_1 - \hat\pi_2) = \pi_1 - \pi_2$ and standard error

\sigma(\hat\pi_1 - \hat\pi_2) = \left[\frac{\pi_1(1 - \pi_1)}{n_1} + \frac{\pi_2(1 - \pi_2)}{n_2}\right]^{1/2}.    (3.3)

The estimate $\hat\sigma(\hat\pi_1 - \hat\pi_2)$ uses formula (3.3) with $\pi_i$ replaced by $\hat\pi_i$. Then

(\hat\pi_1 - \hat\pi_2) \pm z_{\alpha/2}\,\hat\sigma(\hat\pi_1 - \hat\pi_2)    (3.4)

is a Wald confidence interval for $\pi_1 - \pi_2$. Like the Wald interval (1.13) for a single proportion, it usually has true coverage probability less than the nominal confidence coefficient, especially when $\pi_1$ and $\pi_2$ are near 0 or 1. More complex but better methods are cited in Section 3.1.8, Note 3.2, and Problem 3.23.

3.1.4 Interval Estimation of Relative Risk

The sample relative risk is $r = \hat\pi_1/\hat\pi_2$. Like the odds ratio, it converges to normality faster on the log scale. The asymptotic standard error of $\log r$ is

\sigma(\log r) = \left[\frac{1 - \pi_1}{\pi_1 n_1} + \frac{1 - \pi_2}{\pi_2 n_2}\right]^{1/2}.    (3.5)

The Wald interval exponentiates the endpoints of $\log r \pm z_{\alpha/2}\,\hat\sigma(\log r)$. It works well but can be somewhat conservative.
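Formulas (3.3) through (3.5) can be applied directly to the Table 3.1 data. The following Python sketch (ours, not from the text) computes the Wald intervals for the relative risk and the difference of proportions, anticipating the numerical results discussed next:

```python
from math import exp, log, sqrt

# Table 3.1: deaths y out of n subjects in each group
y1, n1 = 28, 684    # placebo
y2, n2 = 18, 676    # aspirin
p1, p2 = y1 / n1, y2 / n2

# Wald CI for the relative risk, built on the log scale with SE (3.5)
r = p1 / p2
se_log_r = sqrt((1 - p1) / (p1 * n1) + (1 - p2) / (p2 * n2))
rr_ci = tuple(exp(log(r) + s * 1.96 * se_log_r) for s in (-1, 1))

# Wald interval (3.4) for the difference of proportions, with estimated SE (3.3)
diff = p1 - p2
se_diff = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
diff_ci = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)
```

This yields $r = 1.54$ with interval $(0.86, 2.75)$ and a difference interval of about $(-0.005, 0.033)$.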
We discuss an alternative method in Section 3.1.8.

For Table 3.1, the sample proportion of myocardial infarction deaths was 0.0409 for subjects taking placebo and 0.0266 for subjects taking aspirin. The sample relative risk is $0.0409/0.0266 = 1.54$. The 95% confidence interval for the log relative risk, $\log(1.54) \pm 1.96(0.297)$, translates to $(0.86, 2.75)$ for the relative risk. We infer that the death rate for those taking placebo was between 0.86 and 2.75 times that for those taking aspirin. The Wald 95% confidence interval for $\pi_1 - \pi_2$ is $0.014 \pm 1.96(0.0098)$, or $(-0.005, 0.033)$. According to either measure, substantial public health benefits could result from taking aspirin, but no effect or a slight negative effect are also plausible. Results for the larger study described in Section 2.2.5 do show a benefit.

3.1.5 Deriving Standard Errors with the Delta Method*

A simple and useful method exists for deriving standard errors for large-sample inferences. Let $T_n$ denote a statistic that is asymptotically normally distributed about a parameter $\theta$, the subscript $n$ expressing its dependence on sample size. Suppose that an estimator is a function $g(T_n)$ of $T_n$. Then, under mild conditions, $g(T_n)$ itself has a large-sample normal distribution. The standard error depends on how fast $g(t)$ changes for $t$ near $\theta$.

Specifically, for large $n$, suppose that $T_n$ is normally distributed about $\theta$ with standard error $\sigma/\sqrt{n}$. That is, as $n \to \infty$, the cdf of $\sqrt{n}(T_n - \theta)$ converges to the cdf of a normal random variable with mean 0 and variance $\sigma^2$. This limiting behavior is an example of convergence in distribution, denoted by

\sqrt{n}(T_n - \theta) \xrightarrow{d} N(0, \sigma^2).

Let $g$ be a function that is at least twice differentiable at $\theta$. Using the Taylor series expansion for $g(t)$ in a neighborhood of $t = \theta$, in Section 14.1.2 we show that

\sqrt{n}\left[g(T_n) - g(\theta)\right] \approx \sqrt{n}(T_n - \theta)\,g'(\theta)

FIGURE 3.1 Depiction of delta method.

for large $n$, where $g'(\theta)$
$= \partial g/\partial t$ evaluated at $t = \theta$. Recall that if a variate $Y \sim N(0, \sigma^2)$, then $cY \sim N(0, c^2\sigma^2)$. Thus,

\sqrt{n}\left[g(T_n) - g(\theta)\right] \xrightarrow{d} N\left(0, \left[g'(\theta)\right]^2 \sigma^2\right).    (3.6)

In other words, $g(T_n)$ is approximately normal around $g(\theta)$ with variance $[g'(\theta)]^2\sigma^2/n$. Figure 3.1 portrays this result. Locally around $\theta$, $g(t)$ is approximately linear, with slope $g'(\theta)$. Then $g(T_n)$ is approximately normal, since linear transformations of normal random variables are themselves normal. The dispersion of $g(T_n)$ values about $g(\theta)$ is about $|g'(\theta)|$ times the dispersion of $T_n$ values about $\theta$. If the slope of $g$ at $\theta$ is $\frac{1}{2}$, then $g$ maps a region of $T_n$ values into a region of $g(T_n)$ values only about half as wide.

Result (3.6) is called the delta method. Since $g'(\theta)$ and $\sigma^2 = \sigma^2(\theta)$ usually depend on the unknown parameter $\theta$, the asymptotic variance is unknown. Confidence intervals and tests substitute $T_n$ for $\theta$ and use the result that $\sqrt{n}\left[g(T_n) - g(\theta)\right]/\left[|g'(T_n)|\,\sigma(T_n)\right]$ is asymptotically standard normal. For instance,

g(T_n) \pm 1.96\,|g'(T_n)|\,\sigma(T_n)/\sqrt{n}

is a large-sample Wald 95% confidence interval for $g(\theta)$.

3.1.6 Delta Method Applied to Sample Logit*

We illustrate the delta method for a function of the ML estimator $T_n = \hat\pi = y/n$ of the binomial parameter $\pi$, for $y$ successes in $n$ trials. Since $E(Y) = n\pi$ and $\mathrm{var}(Y) = n\pi(1 - \pi)$, $E(\hat\pi) = \pi$ and $\mathrm{var}(\hat\pi) = \pi(1 - \pi)/n$. Also, $\hat\pi$ has a large-sample normal distribution by the central limit theorem. So do many functions of $\hat\pi$. The log odds function of $\hat\pi$,

g(\hat\pi) = \log\left[\hat\pi/(1 - \hat\pi)\right],

is called the sample logit. Evaluated at $\pi$, its derivative equals $1/[\pi(1 - \pi)]$. By the delta method, the asymptotic variance of the sample logit is $\pi(1 - \pi)/n$ (the variance of $\hat\pi$) multiplied by the square of $1/[\pi(1 - \pi)]$. That is,

\sqrt{n}\left(\log\frac{\hat\pi}{1 - \hat\pi} - \log\frac{\pi}{1 - \pi}\right) \xrightarrow{d} N\left(0, \frac{1}{\pi(1 - \pi)}\right).
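As a small numerical sketch (ours, not from the text; using the 28 deaths out of 684 in the placebo row of Table 3.1 purely for illustration), the sample logit and its estimated delta-method standard error $[n\hat\pi(1-\hat\pi)]^{-1/2}$ can be computed as follows:

```python
from math import log, sqrt

def sample_logit_ci(y, n, z=1.96):
    """Wald CI for logit(pi), with the delta-method SE [n*p*(1-p)]**(-1/2),
    which simplifies algebraically to sqrt(1/y + 1/(n - y))."""
    p = y / n
    logit = log(p / (1 - p))
    se = 1 / sqrt(n * p * (1 - p))
    return logit, se, (logit - z * se, logit + z * se)

# Illustrative data choice (ours): 28 successes in 684 trials
logit, se, ci = sample_logit_ci(28, 684)
```

The identity $[n\hat\pi(1-\hat\pi)]^{-1/2} = \sqrt{1/y + 1/(n-y)}$ follows since $n\hat\pi(1-\hat\pi) = y(n-y)/n$, which foreshadows the form of (3.1).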
The asymptotic normality of $\hat\pi$ propagates to asymptotic normality of $\log[\hat\pi/(1 - \hat\pi)]$. The asymptotic variance is the variance of the normal distribution that approximates the true distribution for large $n$. It is not an approximation for the variance of the true distribution. For $0 < \pi < 1$, the asymptotic variance $[n\pi(1 - \pi)]^{-1}$ of the sample logit is finite. By contrast, the true variance does not exist: Since $\hat\pi = 0$ or 1 with positive probability, the logit can equal $-\infty$ or $\infty$ with positive probability. The probability of an infinite logit converges to zero rapidly as $n$ increases. For large $n$, the distribution of the sample logit looks essentially normal with mean $\log[\pi/(1 - \pi)]$ and standard deviation $[n\pi(1 - \pi)]^{-1/2}$. Thus, for the logit, the asymptotic variance actually has greater use than the true variance. Incidentally, related to this, the bootstrap is not helpful for approximating standard errors for many discrete measures, because it mimics the true rather than the more relevant asymptotic standard error.

3.1.7 Delta Method for Log Odds Ratio*

Standard errors for the log odds ratio and the log relative risk result from a multiparameter version of the delta method. Suppose that $\{n_i,\ i = 1, \dots, c\}$ have a multinomial$(n, \{\pi_i\})$ distribution. The sample proportion $\hat\pi_i = n_i/n$ has mean and variance

E(\hat\pi_i) = \pi_i \quad \text{and} \quad \mathrm{var}(\hat\pi_i) = \pi_i(1 - \pi_i)/n.    (3.7)

In Section 14.1.4 we show that for $i \ne j$, $\hat\pi_i$ and $\hat\pi_j$ have covariance

\mathrm{cov}(\hat\pi_i, \hat\pi_j) = -\pi_i\pi_j/n.    (3.8)

The sample proportions $(\hat\pi_1, \hat\pi_2, \dots, \hat\pi_{c-1})$ have a large-sample multivariate normal distribution. For functions of them, the delta method implies the
y g Ž ␲ .xr␴ Ý ␲ i ␾i2 y Ž Ý ␲ i ␾i . 2 converges to standard Ž 3.9 . . The asymptotic variance depends on ␲ i 4 and the partial derivatives of the measure with respect to ␲ i 4 . In practice, replacing ␲ i 4 and  ␾ i 4 in Ž3.9. by their sample values yields an ML estimate ␴ˆ 2 of ␴ 2 . Then ␴ˆr'n is an estimated standard error for g Ž ␲ ˆ .. A large-sample Wald confidence interval for g Ž ␲ . is gŽ ␲ ˆ . " z␣ r2 ␴ˆr'n . With the substitution of ␴ˆ for ␴ in Ž3.9., the limiting distribution is still standard normal, but convergence is slower. The equivalence in the largesample distribution is justified as follows: The sample proportions converge in probability to ␲ i 4 , by the weak law of large numbers. Since ␴ˆ is a continuous function of the sample proportions, it converges in probability to ␴ , and ␴r␴ˆ converges in probability to 1. Now 'n gŽ ␲ ˆ . y gŽ ␲. ␴ˆ s 'n gŽ ␲ ˆ . y gŽ ␲. ␴ ␴ˆ ␴ . The first term on the right-hand side converges in distribution to standard normal, by Ž3.9., and the second term converges in probability to 1. Thus, their product also has a limiting standard normal distribution. We now apply the delta method to the log odds ratio, taking g Ž ␲ . s log ␪ s log ␲ 11 q log ␲ 22 y log ␲ 12 y log ␲ 21 . Since ␾ 11 s ⭸ Ž log ␪ . r⭸␲ 11 s 1r␲ 11 ␾ 12 s y1r␲ 12 , ␾ 21 s y1r␲ 21 , ␾ 22 s 1r␲ 22 , Ý i Ý j␲ i j ␾ i j s 0 and ␴ 2 s Ý i Ý j␲ i j ␾ i2j s Ý i Ý j Ž1r␲ i j .. The asymptotic standard error of log ␪ˆ for a multinomial sample  n i j 4 is ␴ Ž log ␪ˆ. s ␴r'n s ž Ý Ý 1rn␲ i j i j / Since n␲ ˆ i j s n i j , the estimated standard error is Ž3.1.. 1r2 . CONFIDENCE INTERVALS FOR ASSOCIATION PARAMETERS 77 The delta method also applies directly with ␪ to obtain ␴ˆ Ž ␪ˆ. and a Wald confidence interval ␪ˆ" z␣ r2 ␴ˆ Ž ␪ˆ.. 
This is not recommended: $\hat\theta$ converges more slowly than $\log\hat\theta$ to normality, this interval could contain negative values, and it does not give results equivalent to those obtained with the Wald interval using $1/\hat\theta$ and its standard error.

3.1.8 Score and Profile Likelihood Confidence Intervals*

Standard errors obtained with the delta method appear in Wald confidence intervals. However, intervals based on inverting Wald tests sometimes work poorly for small to moderate $n$. Alternative intervals result from inverting likelihood-ratio or score tests. Although computationally more complex, these methods often perform better.

We illustrate first with the score method for the difference of proportions. The score test (Mee 1984; Miettinen and Nurminen 1985) of $H_0$: $\pi_1 - \pi_2 = \Delta$ has the test statistic

z(\Delta) = \frac{(\hat\pi_1 - \hat\pi_2) - \Delta}{\sqrt{\hat\pi_1(\Delta)\left[1 - \hat\pi_1(\Delta)\right]/n_1 + \hat\pi_2(\Delta)\left[1 - \hat\pi_2(\Delta)\right]/n_2}},

where $\hat\pi_i(\Delta)$ denotes the ML estimate of $\pi_i$ subject to the constraint $\pi_1 - \pi_2 = \Delta$. That is, $\hat\pi_1(\Delta)$ and $\hat\pi_2(\Delta)$ are the values of $\pi_1$ and $\pi_2$ satisfying $\pi_1 - \pi_2 = \Delta$ that maximize the product of the two binomial probability mass functions. These values do not have closed-form expressions and are determined using numerical methods. The score confidence interval is the set of $\Delta$ such that $|z(\Delta)| < z_{\alpha/2}$. Computations for such intervals require iteration (Nurminen 1986). For the relative risk also, slightly better performance results with an interval using the score method (Bedrick 1987; Gart and Nam 1988; Koopman 1984; Miettinen and Nurminen 1985; Nurminen 1986). Cornfield (1956) and Miettinen and Nurminen (1985) showed the score interval for the odds ratio. We prefer not to use a continuity or finite-sampling correction with these intervals, as performance is then too conservative.
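A minimal sketch of this inversion (ours, not an algorithm from the cited papers; all function names are our own): since the constrained log likelihood is concave in $\pi_2$, the constrained ML estimate can be found by a one-dimensional golden-section search, and the interval obtained by a crude scan over a grid of $\Delta$ values.

```python
from math import log, sqrt

def constrained_p2(y1, n1, y2, n2, delta, iters=80):
    """ML estimate of pi2 subject to pi1 - pi2 = delta, by golden-section
    search (the constrained binomial log likelihood is concave in pi2)."""
    def negloglik(p2):
        p1 = p2 + delta
        return -(y1 * log(p1) + (n1 - y1) * log(1 - p1)
                 + y2 * log(p2) + (n2 - y2) * log(1 - p2))
    eps = 1e-9
    a, b = max(0.0, -delta) + eps, min(1.0, 1.0 - delta) - eps
    invphi = (sqrt(5.0) - 1.0) / 2.0
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    for _ in range(iters):
        if negloglik(c) < negloglik(d):
            b, d = d, c
            c = b - invphi * (b - a)
        else:
            a, c = c, d
            d = a + invphi * (b - a)
    return (a + b) / 2.0

def score_z(y1, n1, y2, n2, delta):
    """Score statistic z(delta) for H0: pi1 - pi2 = delta."""
    p2 = constrained_p2(y1, n1, y2, n2, delta)
    p1 = p2 + delta
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (y1 / n1 - y2 / n2 - delta) / se

def score_interval(y1, n1, y2, n2, z=1.96, steps=2000):
    """Crude grid inversion: all delta with |z(delta)| < z."""
    inside = [d for d in (-0.999 + 1.998 * k / steps for k in range(steps + 1))
              if abs(score_z(y1, n1, y2, n2, d)) < z]
    return min(inside), max(inside)
```

Production implementations invert the test with more careful root finding; this grid scan only locates the endpoints to the grid resolution.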
The fact that the score intervals are computationally more complex than Wald intervals should not be an impediment to their use in this modern era of computing, as the principle behind them is simple. However, currently they are not available in standard software.

For a confidence interval based on the likelihood-ratio test, we illustrate with the odds ratio. The multinomial likelihood for a $2 \times 2$ table is a function of $\{\pi_{11}, \pi_{12}, \pi_{21}\}$. Equivalently, it can be expressed in terms of $\{\theta, \pi_{1+}, \pi_{+1}\}$ (recall Section 2.4.1). Thus, in inverting a likelihood-ratio test of $H_0$: $\theta = \theta_0$ to check whether $\theta_0$ belongs in the confidence interval, there are two nuisance parameters. Their null ML estimates $\hat\pi_{1+}(\theta_0)$ and $\hat\pi_{+1}(\theta_0)$, which maximize the likelihood under the null, vary as $\theta_0$ does.

The profile log-likelihood function is $L(\theta_0, \hat\pi_{1+}(\theta_0), \hat\pi_{+1}(\theta_0))$, viewed as a function of $\theta_0$. For each $\theta_0$, this function gives the maximum of the ordinary log likelihood subject to the constraint $\theta = \theta_0$. Evaluated at $\theta_0 = \hat\theta$, this is the maximized log likelihood $L(\hat\theta, \hat\pi_{1+}, \hat\pi_{+1})$, which occurs at the sample proportions $\hat\pi_{1+} = n_{1+}/n$ and $\hat\pi_{+1} = n_{+1}/n$. The profile likelihood confidence interval for $\theta$ is the set of $\theta_0$ for which

-2\left[L\left(\theta_0, \hat\pi_{1+}(\theta_0), \hat\pi_{+1}(\theta_0)\right) - L\left(\hat\theta, \hat\pi_{1+}, \hat\pi_{+1}\right)\right] < \chi_1^2(\alpha).

This contains all $\theta_0$ not rejected in likelihood-ratio tests of nominal size $\alpha$. The profile likelihood approach is available with some software (e.g., for SAS, see Table A.2 in Appendix A). A related approach, discussed in Section 6.7.1, uses a conditional likelihood function that eliminates the nuisance parameters by conditioning on their sufficient statistics. This is beneficial when there are many nuisance parameters. An advantage of score and likelihood-based intervals is that, unlike the Wald interval, they are not adversely affected when the sample relative risk or odds ratio is 0 or $\infty$.
In this section we have discussed interval estimation. Significance tests normally refer to a null hypothesis value of 0.0 for the log odds ratio, log relative risk, or difference of proportions. These are special cases of independence applied to $2 \times 2$ tables. In the next section we present tests of independence for two-way contingency tables.

3.2 TESTING INDEPENDENCE IN TWO-WAY CONTINGENCY TABLES

For multinomial sampling with probabilities $\{\pi_{ij}\}$ in an $I \times J$ contingency table, the null hypothesis of statistical independence is $H_0$: $\pi_{ij} = \pi_{i+}\pi_{+j}$ for all $i$ and $j$. For independent multinomial samples in the $I$ rows, independence corresponds to homogeneity of each outcome probability among the rows. Our discussion refers to a single multinomial sample, but the same tests apply with independent multinomial samples.

3.2.1 Pearson and Likelihood-Ratio Chi-Squared Tests

In Section 1.5.2 we introduced the Pearson $X^2$ statistic (1.15) for tests about multinomial probabilities. A test of $H_0$: independence uses $X^2$ with $n_{ij}$ in place of $n_i$ and with $\mu_{ij} = n\pi_{i+}\pi_{+j}$ in place of $\mu_i$. Here $\mu_{ij} = E(n_{ij})$ under $H_0$. Usually, $\{\pi_{i+}\}$ and $\{\pi_{+j}\}$ are unknown. Their ML estimates are the sample marginal proportions $\hat\pi_{i+} = n_{i+}/n$ and $\hat\pi_{+j} = n_{+j}/n$, so estimated expected frequencies are $\{\hat\mu_{ij} = n\hat\pi_{i+}\hat\pi_{+j} = n_{i+}n_{+j}/n\}$. Then $X^2$ equals

X^2 = \sum_i \sum_j \frac{(n_{ij} - \hat\mu_{ij})^2}{\hat\mu_{ij}}.    (3.10)

Pearson (1900, 1904, 1922) claimed that replacing $\{\mu_{ij}\}$ by estimates $\{\hat\mu_{ij}\}$ would not affect the distribution of $X^2$. Since the contingency table has $IJ$ categories, he argued that $X^2$ is asymptotically chi-squared with $\mathrm{df} = IJ - 1$. On the contrary, since $\{\hat\mu_{ij}\}$ require estimating $\{\pi_{i+}\}$ and $\{\pi_{+j}\}$, by Section 1.5.6,

\mathrm{df} = (IJ - 1) - (I - 1) - (J - 1) = (I - 1)(J - 1).

The dimensions of $\{\pi_{i+}\}$ and $\{\pi_{+j}\}$ reflect the constraints $\sum_i \pi_{i+} = \sum_j \pi_{+j} = 1$. R. A. Fisher (1922) corrected Pearson's error (see Section 16.2).
His article introduced the notion of degrees of freedom. (Pearson had dealt with an indexed family of chi-squared distributions but had not dealt explicitly with "degrees of freedom.")

The score test produces the $X^2$ statistic. The likelihood-ratio test produces a different one. For multinomial sampling, the kernel of the likelihood is

\prod_i \prod_j \pi_{ij}^{n_{ij}}, \quad \text{where all } \pi_{ij} \ge 0 \text{ and } \sum_i \sum_j \pi_{ij} = 1.

Under $H_0$: independence, $\hat\pi_{ij} = \hat\pi_{i+}\hat\pi_{+j} = n_{i+}n_{+j}/n^2$. In the general case, $\hat\pi_{ij} = n_{ij}/n$. The ratio of the likelihoods equals

\Lambda = \frac{\prod_i \prod_j (n_{i+}n_{+j})^{n_{ij}}}{n^n \prod_i \prod_j n_{ij}^{n_{ij}}}.

The likelihood-ratio chi-squared statistic is $-2\log\Lambda$. Denoted by $G^2$, it equals

G^2 = -2\log\Lambda = 2\sum_i \sum_j n_{ij}\log(n_{ij}/\hat\mu_{ij}),    (3.11)

where $\{\hat\mu_{ij} = n_{i+}n_{+j}/n\}$. The larger the values of $G^2$ and $X^2$, the more evidence exists against independence. In the general case, the parameter space consists of $\{\pi_{ij}\}$ subject to the linear restriction $\sum_i \sum_j \pi_{ij} = 1$, so the dimension is $IJ - 1$. Under $H_0$, $\{\pi_{ij}\}$ are determined by $\{\pi_{i+}\}$ and $\{\pi_{+j}\}$, so the dimension is $(I - 1) + (J - 1)$. The difference in these dimensions equals $(I - 1)(J - 1)$. For large samples, $G^2$ has a chi-squared null distribution with $\mathrm{df} = (I - 1)(J - 1)$. So $G^2$ and $X^2$ have the same limiting null chi-squared distribution. In fact, they are then asymptotically equivalent; $X^2 - G^2$ converges in probability to zero (Section 14.3.4). The limiting results for multinomial sampling also hold with other sampling schemes (Roy and Mitra 1956; Watson 1959).

These results apply as $n$ grows, and hence $\{\mu_{ij} = n\pi_{ij}\}$ grow, for a fixed number of cells. As they grow, the multinomial distribution for $\{n_{ij}\}$ is better approximated by a multivariate normal, and $X^2$ and $G^2$ have distributions that are more nearly chi-squared. The convergence to chi-squared is quicker for $X^2$ than for $G^2$. The approximation is usually poor for $G^2$ when $n/IJ < 5$.
When $I$ or $J$ is large, it can be decent for $X^2$ when some expected frequencies are as small as 1 but most exceed 5. In Section 9.8.4 we provide further guidelines. Small-sample methods (Section 3.5) are available whenever it is doubtful whether $n$ is sufficiently large.

3.2.2 Education and Religious Fundamentalism Example

Table 3.2 cross-classifies the degree of fundamentalism of subjects' religious beliefs by their highest degree of education. The table also contains the estimated expected frequencies for $H_0$: independence. For instance, $\hat\mu_{11} = n_{1+}n_{+1}/n = (424 \times 886)/2726 = 137.8$. The chi-squared statistics are $X^2 = 69.2$ and $G^2 = 69.8$, with $\mathrm{df} = (3 - 1)(3 - 1) = 4$. The $P$-values are $< 0.0001$. These statistics provide extremely strong evidence of an association.

3.3 FOLLOWING-UP CHI-SQUARED TESTS

Like any significance test, chi-squared tests of independence have limited usefulness. A small $P$-value indicates strong evidence of association but provides little information about the nature or strength of the association. Statisticians have long warned about the dangers of relying solely on results of chi-squared tests rather than studying the nature of the association (e.g., Berkson 1938; Cochran 1954). In this section we discuss ways to follow up the tests to learn more about the association.

TABLE 3.2 Education and Religious Beliefs

                                        Religious Beliefs
Highest Degree                  Fundamentalist       Moderate             Liberal              Total
Less than high school           178 (137.8) (4.5)    138 (161.5) (-2.6)   108 (124.7) (-1.9)    424
High school or junior college   570 (539.5) (2.6)    648 (632.1) (1.3)    442 (488.4) (-4.0)   1660
Bachelor or graduate            138 (208.7) (-6.8)   252 (244.5) (0.7)    252 (188.9) (6.3)     642
Total                           886                  1038                  802                  2726

Source: 1996 General Social Survey, National Opinion Research Center. The first parenthesized value in each cell is the estimated expected frequency for testing independence; the second is the standardized Pearson residual.
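The statistics just reported can be verified with a short script (a Python sketch of ours, not from the text):

```python
from math import log

# Table 3.2: highest degree (rows) by religious beliefs (columns)
counts = [[178, 138, 108],
          [570, 648, 442],
          [138, 252, 252]]

n = sum(map(sum, counts))
row_tot = [sum(row) for row in counts]
col_tot = [sum(col) for col in zip(*counts)]

# Estimated expected frequencies under independence: mu_ij = n_{i+} n_{+j} / n
mu = [[r * c / n for c in col_tot] for r in row_tot]

X2 = sum((counts[i][j] - mu[i][j]) ** 2 / mu[i][j]
         for i in range(3) for j in range(3))              # Pearson statistic (3.10)
G2 = 2 * sum(counts[i][j] * log(counts[i][j] / mu[i][j])
             for i in range(3) for j in range(3))          # likelihood-ratio statistic (3.11)
df = (len(counts) - 1) * (len(counts[0]) - 1)
```

Rounding gives $X^2 = 69.2$, $G^2 = 69.8$, and df $= 4$, matching the values reported above.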
3.3.1 Pearson and Standardized Residuals

A cell-by-cell comparison of observed and estimated expected frequencies helps show the nature of the dependence. Under $H_0$, larger differences $(n_{ij} - \hat\mu_{ij})$ tend to occur in cells with larger $\mu_{ij}$. For Poisson sampling, for instance, the standard deviation of $n_{ij}$, and hence of $(n_{ij} - \mu_{ij})$, is $\sqrt{\mu_{ij}}$; the standard deviation of $(n_{ij} - \hat\mu_{ij})$ is less than that of $(n_{ij} - \mu_{ij})$ but is proportional to $\sqrt{\mu_{ij}}$. Thus, the raw difference is insufficient. The Pearson residual, defined for a cell by

e_{ij} = \frac{n_{ij} - \hat\mu_{ij}}{\hat\mu_{ij}^{1/2}},    (3.12)

attempts to adjust for this. Pearson residuals relate to the Pearson statistic by $\sum_i \sum_j e_{ij}^2 = X^2$. Under $H_0$, $\{e_{ij}\}$ are asymptotically normal with mean 0. However, in Section 14.3.2 we show that their asymptotic variances are less than 1.0, averaging $[(I - 1)(J - 1)]/(\text{number of cells})$. Comparing Pearson residuals to standard normal percentage points provides conservative indications of cells having lack of fit. A standardized Pearson residual that is asymptotically standard normal results from dividing $e_{ij}$ by its standard error (Haberman 1973a; see also Section 14.3.2). For $H_0$: independence, this is

\frac{n_{ij} - \hat\mu_{ij}}{\left[\hat\mu_{ij}(1 - p_{i+})(1 - p_{+j})\right]^{1/2}}.    (3.13)

A standardized Pearson residual that exceeds about 2 or 3 in absolute value indicates lack of fit of $H_0$ in that cell. Larger values are more relevant when df is larger, since it then becomes more likely that at least one residual is large simply by chance.

3.3.2 Education and Religious Fundamentalism Revisited

Table 3.2 also shows standardized Pearson residuals for testing independence. For instance, $n_{11} = 178$ and $\hat\mu_{11} = 137.8$. The relevant marginal proportions equal $p_{1+} = 424/2726 = 0.156$ and $p_{+1} = 886/2726 = 0.325$. The standardized Pearson residual (3.13) for this cell equals

\frac{178 - 137.8}{\left[137.8(1 - 0.156)(1 - 0.325)\right]^{1/2}} = 4.5.
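The full set of standardized residuals shown in Table 3.2 can be reproduced as follows (a Python sketch of ours, not from the text):

```python
from math import sqrt

# Table 3.2 counts: highest degree (rows) by religious beliefs (columns)
counts = [[178, 138, 108],
          [570, 648, 442],
          [138, 252, 252]]
n = sum(map(sum, counts))
row_tot = [sum(row) for row in counts]
col_tot = [sum(col) for col in zip(*counts)]

def std_residual(i, j):
    """Standardized Pearson residual (3.13) for cell (i, j)."""
    mu = row_tot[i] * col_tot[j] / n
    p_row, p_col = row_tot[i] / n, col_tot[j] / n
    return (counts[i][j] - mu) / sqrt(mu * (1 - p_row) * (1 - p_col))

residuals = [[round(std_residual(i, j), 1) for j in range(3)] for i in range(3)]
```

The resulting matrix matches the parenthesized residuals in Table 3.2, from 4.5 in the upper-left cell to 6.3 in the lower-right.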
This cell shows a much greater discrepancy between $n_{11}$ and $\hat\mu_{11}$ than we would expect if the variables were truly independent.

Table 3.2 shows large positive residuals for subjects with less than a high school education and fundamentalist views and for subjects with a bachelor's or graduate degree and liberal views. This means that significantly more subjects were at these combinations than $H_0$: independence predicts. Similarly, there were fewer subjects with high levels of education and fundamentalist views and with low levels of education and liberal views than independence predicts. Odds ratios describe this trend. The $2 \times 2$ table constructed from the first and last rows and the first and last columns of Table 3.2 has a sample odds ratio of $(178 \times 252)/(108 \times 138) = 3.0$. For those with a bachelor's or graduate degree, the estimated odds of selecting liberal instead of fundamentalist were 3.0 times the estimated odds for those with less than a high school education.

3.3.3 Partitioning Chi-Squared

Let $Z$ denote a standard normal random variable. Then $Z^2$ has a chi-squared distribution with df $= 1$. A chi-squared random variable with df $= \nu$ has representation $Z_1^2 + \cdots + Z_\nu^2$, where $Z_1, \dots, Z_\nu$ are independent standard normal variables. Thus, a chi-squared statistic having df $= \nu$ has partitionings into independent chi-squared components: for example, into $\nu$ components each having df $= 1$. Conversely, if $X_1^2$ and $X_2^2$ are independent chi-squared random variables having degrees of freedom $\nu_1$ and $\nu_2$, then $X^2 = X_1^2 + X_2^2$ has a chi-squared distribution with df $= \nu_1 + \nu_2$.

Another supplement to a chi-squared test partitions its test statistic so that the components represent certain aspects of the effects. A partitioning may show that an association reflects primarily differences between certain categories or groupings of categories. We begin with a partitioning for the test of independence in $2 \times J$ tables.
We partition $G^2$, which has df $= J - 1$, into $J - 1$ components. The $j$th component is $G^2$ for a $2 \times 2$ table whose first column combines columns 1 through $j$ of the full table and whose second column is column $j + 1$. That is, $G^2$ for testing independence in a $2 \times J$ table equals a statistic that compares the first two columns, plus a statistic that combines the first two columns and compares them to the third column, and so on, up to a statistic that combines the first $J - 1$ columns and compares them to the last column. (In Section 9.2.4 we justify this partitioning.) Each component statistic has df $= 1$.

It might seem more natural to compute $G^2$ for the $J - 1$ separate $2 \times 2$ tables that pair each column with a particular one, say the last. However, these component statistics are not independent and do not sum to $G^2$ for the full table. (This is beyond our scope at this stage, but it relates to the contrasts of log probabilities that form the log odds ratios for the two tables not being orthogonal.)

For an $I \times J$ table, independent chi-squared components result from comparing columns 1 and 2, then combining them and comparing them to column 3, and so on. Each of the $J - 1$ statistics has df $= I - 1$. More refined partitions contain $(I - 1)(J - 1)$ statistics, each having df $= 1$. One such partition (Lancaster 1949) applies to the $(I - 1)(J - 1)$ separate $2 \times 2$ tables

\begin{pmatrix} \sum_{a<i}\sum_{b<j} n_{ab} & \sum_{a<i} n_{aj} \\ \sum_{b<j} n_{ib} & n_{ij} \end{pmatrix}    (3.14)

for $i = 2, \dots, I$ and $j = 2, \dots, J$. For others, see Gilula and Haberman (1998) and Goodman (1969a, 1971b).
3.3.4 Origin of Schizophrenia Example

Table 3.3 classifies a sample of psychiatrists by their school of psychiatric thought and by their opinion on the origin of schizophrenia. Here $G^2 = 23.04$ with df $= 4$. To understand this association better, we partition $G^2$ into four independent components. The partitioning (3.14) applies to the subtables shown in Table 3.4.

TABLE 3.3 Most Influential School of Psychiatric Thought and Ascribed Origin of Schizophrenia

                                   Origin of Schizophrenia
School of Psychiatric Thought   Biogenic   Environmental   Combination
Eclectic                            90           12              78
Medical                             13            1               6
Psychoanalytic                      19           13              50

Source: Reprinted with permission, based on data from B. J. Gallagher III, B. J. Jones, and L. P. Barakat, J. Clin. Psychol. 43: 438-443 (1987).

TABLE 3.4 Subtables Used in Partitioning Chi-Squared for Table 3.3

            Bio   Env               Bio+Env   Com
Ecl          90    12      Ecl         102     78
Med          13     1      Med          14      6

            Bio   Env               Bio+Env   Com
Ecl+Med     103    13      Ecl+Med     116     84
Psy          19    13      Psy          32     50

Bio, biogenic; Com, combination; Ecl, eclectic; Env, environmental; Psy, psychoanalytic.

The first subtable compares the eclectic and medical schools of psychiatric thought on whether the origin of schizophrenia is biogenic or environmental, given that the classification was in one of these two categories. For this subtable, $G^2 = 0.29$, with df $= 1$. The second subtable compares these two schools on the proportion of times the origin was ascribed to be a combination rather than biogenic or environmental. This subtable has $G^2 = 1.36$, with df $= 1$. The sum of these two components equals $G^2$ for testing independence with the first two rows of Table 3.3. There is little evidence of a difference between the eclectic and medical schools of thought on the ascribed origin of schizophrenia.

Next we combine the eclectic and medical schools and compare them to the psychoanalytic school. The third subtable in Table 3.4 compares them for the (biogenic, environmental) classification, giving $G^2 = 12.95$ with df $= 1$. The fourth subtable compares them for the (biogenic or environmental, combination) split, giving $G^2 = 8.43$ with df $= 1$. The psychoanalytic school seems more likely than the other schools to ascribe the origins of schizophrenia as being a combination.
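The partitioning just described can be verified numerically (a Python sketch of ours, not from the text); the four components sum exactly to $G^2$ for the full table:

```python
from math import log

def G2(table):
    """Likelihood-ratio statistic (3.11) for independence in a two-way table."""
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    return 2 * sum(x * log(x * n / (row_tot[i] * col_tot[j]))
                   for i, row in enumerate(table)
                   for j, x in enumerate(row) if x > 0)

full = [[90, 12, 78],    # Eclectic
        [13,  1,  6],    # Medical
        [19, 13, 50]]    # Psychoanalytic

subtables = [                    # the four 2x2 tables of form (3.14)
    [[90, 12], [13, 1]],         # Ecl vs Med: Bio vs Env
    [[102, 78], [14, 6]],        # Ecl vs Med: Bio+Env vs Com
    [[103, 13], [19, 13]],       # Ecl+Med vs Psy: Bio vs Env
    [[116, 84], [32, 50]],       # Ecl+Med vs Psy: Bio+Env vs Com
]
components = [G2(t) for t in subtables]
```

Rounding the components reproduces 0.29, 1.36, 12.95, and 8.43, and their sum equals $G^2$ for the full table.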
Of those who chose either the biogenic or environmental origin, members of the psychoanalytic school were somewhat more likely than the other schools to choose the environmental origin. The sum of these four $G^2$ components equals the value of 23.04 for testing independence in the full table.

3.3.5 Rules for Partitioning

Goodman (1968, 1969a, 1971b) and Lancaster (1949, 1969) gave rules for determining independent components of chi-squared. For forming subtables, the necessary conditions include the following:

1. The df for the subtables must sum to the df for the full table.
2. Each cell count in the full table must be a cell count in one and only one subtable.
3. Each marginal total of the full table must be a marginal total for one and only one subtable.

For a certain partitioning, when the subtable df values sum properly but the $G^2$ values do not, the components are not independent.

Exact partitionings occur for the $G^2$ statistic. The Pearson $X^2$ need not equal the sum of the $X^2$ values for the subtables. It is still valid to use the $X^2$ statistics for the separate subtables; they simply need not provide an exact algebraic partitioning of $X^2$ for the full table. When the null hypotheses all hold, however, $X^2$ does have an asymptotic equivalence with $G^2$. In addition, when the table has small counts, in large-sample chi-squared tests it is safer to use $X^2$ to study the subtables.

3.3.6 Limitations of Chi-Squared Tests

Chi-squared tests of independence merely indicate the degree of evidence of association. They are rarely adequate for answering all questions about a data set. Rather than relying solely on results of these tests, investigate the nature of the association: Study residuals, decompose chi-squared into components, and estimate parameters such as odds ratios that describe the strength of association.

The chi-squared tests also have limitations in the types of data to which they apply.
For instance, they require large samples. Also, the {μ̂_ij = n_i+ n_+j / n} used in X² and G² depend on the marginal totals but not on the order of listing the rows and columns. Thus, X² and G² do not change value with arbitrary reorderings of rows or of columns. This implies that they treat both classifications as nominal. When at least one variable is ordinal, test statistics that utilize the ordinality are usually more appropriate. We present such tests in Section 3.4.

3.3.7 Why Consider Independence?

Any idealized structure such as independence is unlikely to hold in any given practical situation. With large samples such as in Table 3.2, it is not surprising to obtain a small P-value. Given this and the limitations just mentioned, why even bother to consider independence as a possible representation for a joint distribution? One reason refers to the benefits of model parsimony. If the independence model approximates the true probabilities well, then unless n is very large, the model-based estimates {π̂_ij = n_i+ n_+j / n²} of cell probabilities tend to be better than the sample proportions {p_ij = n_ij / n}. The independence ML estimates smooth the sample counts, somewhat damping the random sampling fluctuations. The mean-squared error (MSE) formula

    MSE = variance + (bias)²

explains why the independence estimators can have smaller MSE. Although they may be biased, they have smaller variance because they are based on estimating fewer parameters ({π_i+} and {π_+j} instead of {π_ij}). Hence, MSE can be smaller unless n is so large that the bias term dominates the variance.

We illustrate using Table 3.5, which has π_ij = π_i+ π_+j [1 + δ(i − 2)(j − 2)] for π_i+ = π_+j = 1/3. Here −1 < δ < 1, with δ = 0 equivalent to independence. Independence approximates the relationship well when δ is close to zero. The total MSE values of the two estimators are

    MSE({p_ij}) = Σ_i Σ_j E(p_ij − π_ij)² = Σ_i Σ_j var(p_ij)
                = Σ_i Σ_j π_ij(1 − π_ij)/n = (1 − Σ_i Σ_j π_ij²)/n,

    MSE({π̂_ij}) = Σ_i Σ_j E(π̂_ij − π_ij)².

TABLE 3.5 Cell Probabilities for Comparison of Estimators

(1 + δ)/9    1/9    (1 − δ)/9
   1/9       1/9       1/9
(1 − δ)/9    1/9    (1 + δ)/9

For Table 3.5,

    MSE({p_ij}) = (1/n)[8/9 − 4δ²/81]

and rather tedious calculations yield

    MSE({π̂_ij}) = (1/n)[4/9 + 4/(9n)] + (4δ²/81)[1 − 2/n + 2/n² − 2/n³].

TABLE 3.6 Comparison of Total MSE (×10,000) for Sample Proportion and Independence Estimators

         δ = 0        δ = 0.1      δ = 0.2      δ = 0.6      δ = 1.0
  n      p     π̂      p     π̂      p     π̂      p     π̂      p     π̂
 10    889   489    888   493    887   505    871   634    840   893
 50    178    91    178    95    177   110    174   261    168   565
100     89    45     89    50     89    65     87   220     84   529
500     18     9     18    14     18    28     17   186     17   500
 ∞       0     0      0     5      0    20      0   178      0   494

Table 3.6 lists the total MSE values for various δ and n. When δ = 0, MSE({p_ij}) = 8/9n, whereas MSE({π̂_ij}) ≈ 4/9n for large n. The independence estimator is then much better than the sample proportions. When the table is close to independence (δ ≈ 0) and n is not large, MSE is only about half as large for the independence estimator. When δ ≠ 0, the inconsistency of {π̂_ij} is reflected by MSE({π̂_ij}) → 4δ²/81 [whereas MSE({p_ij}) → 0] as n → ∞. When the table is close to independence, however, the independence estimator has a lower total MSE even for moderately large n (e.g., for n = 500 when δ = 0.1).

3.4 TWO-WAY TABLES WITH ORDERED CLASSIFICATIONS

The X² and G² chi-squared tests ignore some information when used to test independence between ordinal classifications. When rows and/or columns are ordered, more powerful tests usually exist.

3.4.1 Linear Trend Alternative to Independence

When the row variable X and the column variable Y are ordinal, a positive or negative trend in the association is common. One approach to inference, described later in this section, uses an ordinal measure of monotone trend.
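As a numerical check, the two closed-form MSE expressions above can be evaluated directly; a small Python sketch (our own function names) reproduces the entries of Table 3.6, apart from occasional differences of one unit in the last digit that appear to reflect rounding:

```python
def mse_sample(n, delta):
    """Total MSE of the sample proportions {p_ij} for Table 3.5."""
    return (8/9 - 4 * delta**2 / 81) / n

def mse_indep(n, delta):
    """Total MSE of the independence estimators {pi_hat_ij} for Table 3.5."""
    return ((4/9 + 4/(9*n)) / n
            + (4 * delta**2 / 81) * (1 - 2/n + 2/n**2 - 2/n**3))

# Reproduce a few entries of Table 3.6 (tabled values are 10,000 * MSE):
for n, delta in [(10, 0.0), (50, 0.1), (50, 0.2), (500, 0.6)]:
    print(n, delta,
          round(1e4 * mse_sample(n, delta)),
          round(1e4 * mse_indep(n, delta)))
```

For example, (n, δ) = (10, 0) gives 889 and 489, matching the first row of the table.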
A more popular analysis assigns scores to categories and summarizes the linear trend. A test statistic that is sensitive to positive or negative linear trends utilizes correlation information. Let u₁ ≤ u₂ ≤ ⋯ ≤ u_I denote scores for the rows, and let v₁ ≤ v₂ ≤ ⋯ ≤ v_J denote column scores. The scores have the same ordering as the categories. They assign distances between categories and actually treat the measurement scale as interval, with greater distances between categories that are farther apart.

The sum Σ_i Σ_j u_i v_j n_ij weights cross-products of scores by their frequency. It relates to the covariation of X and Y. For the scores chosen, the correlation r between X and Y equals the standardization of this sum to the −1 to +1 scale (in fact, r equals this sum when both sets of scores are linearly transformed for the n subjects to have a mean of 0 and standard deviation of 1). The larger the correlation is in absolute value, the farther the data fall from independence in this linear dimension. A statistic for testing independence against the two-sided alternative of nonzero true correlation is

    M² = (n − 1)r².          (3.15)

This statistic increases as |r| or n do. For large samples, it is approximately chi-squared with df = 1 (Mantel 1963). Large values contradict independence, so as with X² and G², the P-value is the right-tailed probability above the value observed. A small P-value does not imply that the association is linear, merely that searching for a linear component to the association helped to build power against H₀. The test treats the variables symmetrically.

3.4.2 Job Satisfaction Example Revisited

Table 2.8 showed job satisfaction and income for 96 subjects. The ordinary chi-squared statistics for testing independence are X² = 6.0 and G² = 6.8 with df = 9 (P-values = 0.74 and 0.66). These statistics show little evidence of association, but they ignore the ordering of rows and columns.
With scores (1, 2, 3, 4) for job satisfaction and scores {7.5, 20, 32.5, 60} for income that approximate midpoints of categories in thousands of dollars, the correlation is r = 0.200. The linear trend test statistic is M² = (96 − 1)(0.200)² = 3.81. This shows some evidence of association (P = 0.051). The evidence is stronger for the one-sided (positive trend) alternative, using M = √(n − 1) r = 1.95 (P = 0.026). The nontrivial evidence of positive association may be surprising, since X² and G² have such unimpressive values. When a positive or negative trend exists, analyses designed to detect that trend can provide much smaller P-values than analyses that ignore it.

3.4.3 Monotone Trend Alternatives to Independence

Ordinal variables do not have a specified metric. Detecting a linear trend alternative to independence requires assigning scores to X and Y, treating them as interval variables. Alternatively, a strictly ordinal analysis with the weaker alternative of monotonicity uses an ordinal measure of association, such as gamma (Section 2.4.4). For large random samples, sample gamma has approximately a normal sampling distribution. The standard error (SE) follows from the delta method (Problem 3.27). Gamma is the basis of an ordinal test of independence using test statistic z = γ̂/SE. A confidence interval describes the strength of positive or negative monotone association.

For Table 2.8 on income and job satisfaction, in Section 2.4.5 we showed that γ̂ = 0.221. The sample has a weak tendency for job satisfaction to be higher at higher income levels. Software (e.g., PROC FREQ in SAS) reports a standard error of 0.117 for gamma. There is some evidence that γ > 0, since z = 0.221/0.117 = 1.89 (P = 0.03 for the one-sided alternative). An approximate 95% confidence interval for γ is 0.221 ± 1.96(0.117), or (−0.01, 0.45). The true association between income and job satisfaction is at best moderately positive.
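The point estimate γ̂ = (C − D)/(C + D), based on the numbers of concordant (C) and discordant (D) pairs, is easy to compute directly. The following Python sketch is our own illustration (it omits the delta-method standard error used for the z test above):

```python
def gamma_hat(table):
    """Sample gamma = (C - D)/(C + D), where C and D count concordant
    and discordant pairs in an ordered I x J contingency table."""
    I, J = len(table), len(table[0])
    C = D = 0
    for i in range(I):
        for j in range(J):
            for k in range(i + 1, I):      # pairs in a later row
                for l in range(J):
                    if l > j:              # later column too: concordant
                        C += table[i][j] * table[k][l]
                    elif l < j:            # earlier column: discordant
                        D += table[i][j] * table[k][l]
    return (C - D) / (C + D)

print(gamma_hat([[2, 0], [0, 2]]))   # 1.0  (perfect concordance)
print(gamma_hat([[0, 2], [2, 0]]))   # -1.0 (perfect discordance)
```

A table with all counts equal gives γ̂ = 0, consistent with independence.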
3.4.4 Extra Power with Ordinal Tests

For testing independence, X² and G² refer to the most general alternative, whereby cell probabilities exhibit any type of statistical dependence. Their df value of (I − 1)(J − 1) reflects an alternative hypothesis that has (I − 1)(J − 1) more parameters than the null hypothesis: the nonredundant odds ratios that describe the association [such as (2.10)]. These statistics are designed to detect any pattern for these parameters. In achieving this generality, they sacrifice sensitivity for detecting particular patterns.

By contrast, the analyses for ordinal row and column variables attempt to describe association using a single parameter. For instance, M² uses the correlation. When a chi-squared test statistic refers to a single parameter [as M² and (γ̂/SE)² do], it has df = 1. When the association truly has a positive or negative trend, an ordinal test has a power advantage over the tests using X² or G². Since df equals the mean of the chi-squared distribution, a relatively large M² value with df = 1 falls farther out in its right-hand tail than a comparable value of X² or G² with df = (I − 1)(J − 1); falling farther out in the tail produces a smaller P-value. The potential discrepancy in power increases as I and J increase. In Section 6.4 we present the theory behind such a power comparison.

3.4.5 Choice of Scores

Often, it is unclear how to assign scores to statistics that require them, such as M². Cochran (1954) noted that "any set of scores gives a valid test, provided that they are constructed without consulting the results of the experiment. If the set of scores is poor, in that it badly distorts a numerical scale that really does underlie the ordered classification, the test will not be sensitive.
The scores should therefore embody the best insight available about the way in which the classification was constructed and used." Ideally, the scale is chosen by a consensus of experts, and subsequent interpretations use that same scale.

How sensitive are analyses to the choice of scores? There is no simple answer, but different scoring systems can give quite different results (e.g., Graubard and Korn 1987). For most data sets, different choices of monotone scores give similar results. Scores that are linear transforms of each other, such as (1, 2, 3, 4) and (0, 2, 4, 6), have the same absolute correlation and hence the same M². Results may depend on the scores, however, when the data are highly unbalanced, with some categories having many more observations than others.

Table 3.7 illustrates the potential dependence. It refers to a prospective study of maternal drinking and congenital malformations. After the first three months of pregnancy, the women in the sample completed a questionnaire about alcohol consumption. Following childbirth, observations were recorded on the presence or absence of congenital sex organ malformations.

When a variable is nominal but has only two categories, statistics that treat it as ordinal are still valid. For instance, we can artificially regard malformation as ordinal, treating "present" as "high" and "absent" as "low." With only two rows, any set of distinct row scores is a linear transformation of any other set and gives the same M² value. Alcohol consumption, measured as the average number of drinks per day, is an ordinal explanatory variable. This groups a naturally continuous variable, and we first use the scores {v₁ = 0, v₂ = 0.5, v₃ = 1.5, v₄ = 4.0, v₅ = 7.0}, the last score being somewhat arbitrary. For this choice, M² = 6.57, for which the P-value is 0.010. By contrast, for the equally spaced scores (1, 2, 3, 4, 5), M² = 1.83, giving a much weaker conclusion (P = 0.18).
TABLE 3.7 Example for which Results Depend on Choice of Scores

                  Alcohol Consumption (average number of drinks per day)
Malformation         0         <1        1-2       3-5       ≥6
Absent            17,066    14,464      788       126        37
Present               48        38        5         1         1

Source: Reprinted with permission from the Biometric Society (Graubard and Korn 1987).

An alternative approach uses the data to form the scores automatically, by using ranks as the category scores. All subjects in a category receive the average of the ranks that would apply for a complete ranking of the sample from 1 to n. These are called midranks. The 17,114 subjects at level 0 for alcohol consumption share ranks 1 through 17,114. Each receives the average of these ranks, which is the midrank (1 + 17,114)/2 = 8557.5. Similarly, the midranks for the last four categories are 24,365.5; 32,013; 32,473; and 32,555.5. These scores yield M² = 0.35 and a weaker conclusion yet (P = 0.55).

Why does this happen? Adjacent categories having relatively few observations necessarily have similar midranks. The midranks are similar for the final three categories, since those categories have few observations compared with the first two categories. This scoring scheme treats alcohol consumption level 1-2 drinks (category 3) as much closer to consumption level ≥ 6 drinks (category 5) than to consumption level 0 drinks (category 1). This seems inappropriate. It is usually better to select scores that reflect distances between categories. When uncertain about this choice, a sensitivity analysis should be performed, selecting two or three sensible choices and checking whether results are similar. Equally spaced scores often provide a reasonable compromise when the category labels do not suggest obvious choices, such as the categories (liberal, moderate, conservative) for political philosophy. When X and Y are both ordinal and M² uses midrank scores, the correlation on which M² is based is called Spearman's rho.
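The midranks and the three M² values for Table 3.7 can be reproduced with a short Python sketch (the m2 helper implements (3.15) and is our own, not code from the text):

```python
from math import sqrt

# Table 3.7: rows = malformation (absent, present), columns = alcohol level
counts = [[17066, 14464, 788, 126, 37],
          [48, 38, 5, 1, 1]]
col_totals = [a + p for a, p in zip(*counts)]

# Midrank for each column = average of the ranks its subjects would occupy
midranks, first = [], 1
for t in col_totals:
    midranks.append((first + (first + t - 1)) / 2)
    first += t
print(midranks)   # [8557.5, 24365.5, 32013.0, 32473.0, 32555.5]

def m2(table, u, v):
    """M^2 = (n - 1) r^2 for row scores u and column scores v."""
    n = sum(map(sum, table))
    su = sum(ui * sum(row) for ui, row in zip(u, table))
    sv = sum(vj * sum(col) for vj, col in zip(v, zip(*table)))
    suu = sum(ui**2 * sum(row) for ui, row in zip(u, table))
    svv = sum(vj**2 * sum(col) for vj, col in zip(v, zip(*table)))
    suv = sum(ui * vj * nij for ui, row in zip(u, table)
              for vj, nij in zip(v, row))
    r = (suv - su * sv / n) / sqrt((suu - su**2 / n) * (svv - sv**2 / n))
    return (n - 1) * r**2

u = [0, 1]   # with two rows, any distinct row scores give the same M^2
for v in ([0, 0.5, 1.5, 4.0, 7.0], [1, 2, 3, 4, 5], midranks):
    print(round(m2(counts, u, v), 2))   # 6.57, then 1.83, then 0.35
```

The three printed values match the conclusions in the text: the approximate-midpoint scores give strong evidence of a trend, while equally spaced scores and midranks give much weaker evidence.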
3.4.6 Trend Tests for I × 2 and 2 × J Tables

When I or J equals 2, the tests based on linear or monotone trend simplify to well-established procedures. With binary X, 2 × J tables occur in comparisons of two groups, such as when the rows represent two treatments. Using scores {u₁ = 0, u₂ = 1} for the levels of X, the covariation measure Σ_i Σ_j u_i v_j n_ij in M² simplifies to Σ_j v_j n_2j. This term sums the scores on Y for all subjects in row 2. Divided by the number of subjects in row 2, it gives the mean score for that row. In fact, M² is then directed toward detecting differences between the two row means of the scores on Y.

With midrank scores for Y, the test using M² for 2 × J tables is sensitive to differences in mean ranks for the two rows. This test is called the Wilcoxon or Mann-Whitney test. Most nonparametric statistics textbooks present this test for fully ranked response data, whereas the 2 × J table is an extended case in which sets of subjects in the same category of Y are tied and use midranks. The large-sample version of that nonparametric test uses a standard normal z statistic. The square of the z statistic is equivalent to M², using arbitrary row scores and midranks for the columns. It is also asymptotically equivalent to test statistics based on the numbers of concordant and discordant pairs, such as the one using gamma.

When Y has two levels, the table has size I × 2. The linear trend statistic then refers to a linear trend in the probability of either response category, such as the probability of malformation as a function of alcohol consumption. The test in that case, often called the Cochran-Armitage trend test, is presented in Section 5.3.5.

3.4.7 Nominal-Ordinal Tables

The tests using the correlation or gamma are appropriate when both classifications are ordinal. When one is nominal with more than two categories, other statistics are needed.
One is based on summarizing the variation among means on the ordinal variable in the various categories of the nominal variable. We defer discussion of this case to Section 7.5.3, Note 3.6, and Problem 3.28.

3.5 SMALL-SAMPLE TESTS OF INDEPENDENCE

The inferential methods of the preceding four sections are large-sample methods. When n is small, alternative methods use exact small-sample distributions rather than large-sample approximations. In this section we describe small-sample tests of independence, starting with one that R. A. Fisher proposed for 2 × 2 tables.

3.5.1 Fisher's Exact Test for 2 × 2 Tables

In Section 3.5.7 we show that a distribution not depending on unknown parameters results from conditioning on the marginal totals of the contingency table. These are usually not naturally fixed. For Poisson sampling nothing is fixed, for multinomial sampling only n is fixed, and for independent binomial sampling in the two rows only the row marginal totals are fixed. In any of these cases, under H₀: independence, conditioning on both sets of marginal totals yields the hypergeometric distribution

    p(t) = P(n_11 = t) = C(n_1+, t) C(n_2+, n_+1 − t) / C(n, n_+1),          (3.16)

where C(a, b) denotes the binomial coefficient. This formula expresses the distribution of {n_ij} in terms of only n_11. Given the marginal totals, n_11 determines the other three cell counts. The range of possible values for n_11 is m₋ ≤ n_11 ≤ m₊, where m₋ = max(0, n_1+ + n_+1 − n) and m₊ = min(n_1+, n_+1).

For 2 × 2 tables, independence is equivalent to the odds ratio θ = 1. To test H₀: θ = 1, the P-value is the sum of certain hypergeometric probabilities. To illustrate, consider H_a: θ > 1. For the given marginal totals, tables having larger n_11 have larger sample odds ratios and hence stronger evidence in favor of H_a. Thus, the P-value equals P(n_11 ≥ t_o), where t_o denotes the observed value of n_11. This test for 2 × 2 tables is called Fisher's exact test (Fisher 1934, 1935a,c; Irwin 1935; Yates 1934).
3.5.2 Fisher's Tea Drinker

R. A. Fisher (1935a) described the following experiment from his days at Rothamsted Experiment Station, an agriculture research lab north of London. Muriel Bristol, a colleague of Fisher's, claimed that when drinking tea she could distinguish whether milk or tea was added to the cup first (she preferred milk first). To test her claim, Fisher asked her to taste eight cups of tea, four of which had milk added first and four of which had tea added first. She knew there were four cups of each type and had to predict which four had the milk added first. The order of presenting the cups to her was randomized. Table 3.8 shows a possible result.

Distinguishing the order of pouring better than with pure guessing corresponds to θ > 1, reflecting a positive association between order of pouring and the prediction. We conduct Fisher's exact test of H₀: θ = 1 against H_a: θ > 1. The experimental design fixed both marginal distributions, since Dr. Bristol had to predict which four cups had milk added first. Thus, the hypergeometric applies naturally for the null distribution of n_11. The P-value for Fisher's exact test is the null probability of Table 3.8 and of tables having even more evidence in favor of her claim. The observed table, t_o = 3 correct choices of the cups having milk added first, has null probability

    C(4, 3) C(4, 1) / C(8, 4) = 0.229.

The only table that is more extreme in the direction of H_a has n_11 = 4 correct. It has a probability of 0.014. The P-value is P(n_11 ≥ 3) = 0.243. This result does not establish an association between the actual order of pouring and her predictions. It is difficult to do so with such a small sample. According to Fisher's daughter (Box 1978, p. 134), in reality Bristol did convince Fisher of her ability.
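The hypergeometric calculation (3.16) for the tea-tasting margins takes only a few lines. The following Python sketch (our own helper, not code from the text) also computes the mid-P-value discussed in Section 3.5.4:

```python
from math import comb

def hypergeom_p(t, r1, c1, n):
    """P(n11 = t) from the hypergeometric distribution (3.16),
    for row 1 total r1, column 1 total c1, and sample size n."""
    return comb(r1, t) * comb(n - r1, c1 - t) / comb(n, c1)

# Tea-tasting margins: n = 8, four cups with milk first (row 1),
# four cups predicted as milk first (column 1)
p3 = hypergeom_p(3, 4, 4, 8)
p4 = hypergeom_p(4, 4, 4, 8)
print(round(p3, 3), round(p4, 3))    # 0.229 0.014
print(round(p3 + p4, 3))             # one-sided P-value: 0.243
print(round(0.5 * p3 + p4, 3))       # mid-P-value: 0.129
```

The one-sided P-value 0.243 and mid-P-value 0.129 agree with the values reported in this section and in Section 3.5.4.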
TABLE 3.8 Fisher's Tea-Tasting Experiment

                     Guess Poured First
Poured First       Milk      Tea      Total
Milk                 3         1        4
Tea                  1         3        4
Total                4         4

Source: Based on experiment described by Fisher (1935a).

3.5.3 Two-Sided P-Values for Fisher's Exact Test

For the one-sided alternative, the same P-value results using tables ordered according to larger n_11, larger odds ratio, or larger difference of proportions (Davis 1986a). For the two-sided alternative, different criteria can have different P-values. For a two-sided P-value, a popular approach sums P(n_11 = t) in (3.16) for counts t such that p(t) ≤ p(t_o); that is, the P-value is P = P[p(n_11) ≤ p(t_o)] for the observed value t_o. Another possibility sums p(t) for tables that are farther from H₀; that is,

    P = P[|n_11 − E(n_11)| ≥ |t_o − E(n_11)|],

where the hypergeometric E(n_11) = n_1+ n_+1 / n. This is identical to P(X² ≥ X²_o) for observed Pearson statistic X²_o. A third approach takes P = 2 min[P(n_11 ≥ t_o), P(n_11 ≤ t_o)], but this can exceed 1. A fourth approach takes P = min[P(n_11 ≥ t_o), P(n_11 ≤ t_o)] plus an attainable probability in the other tail that is as close as possible to, but not greater than, that one-tailed probability. Each approach has advantages and disadvantages (Blaker 2000; Davis 1986a; Dupont 1986; Lloyd 1988b; Mantel 1987b; Yates and discussants 1984). They can provide different results because of the discreteness and potential skewness. The approach of ordering tables by a distance measure from H₀, such as X², extends naturally to I × J tables.

In practice, two-sided tests are much more common than one-sided. Partly this is so that researchers can avoid charges of bias in giving evidence that supports their predicted direction for an effect. To conduct a test of size 0.05 when one truly believes that the effect has a particular direction, it is safest to conduct the one-sided test at the 0.025 level to guard against criticism.
For instance, in the 1998 document Biostatistical Principles for Clinical Trials, the International Conference on Harmonization (ICH E9) stated: "The approach of setting type I errors for one-sided tests at half the conventional type I error used in two-sided tests is preferable in regulatory settings. This promotes consistency with two-sided confidence intervals that are generally appropriate for estimating the possible size of the difference between two treatments."

3.5.4 Discreteness and Conservatism Issues

The hypergeometric distribution (3.16) is highly discrete for small samples, as n_11 and hence the P-value can assume relatively few values. It is usually not possible to achieve a fixed significance level (size) such as 0.05. In the tea-tasting experiment, for instance, n_11 can equal only 4, 3, 2, 1, 0. The one-sided P-values are restricted to 0.014, 0.243, 0.757, 0.986, and 1.0. If one rejects H₀ when the P-value does not exceed 0.05, then 0.05 is not the probability of type I error. Only the P-value of 0.014 does not exceed 0.05; thus, when H₀ is true, the probability of falsely rejecting it is 0.014, not 0.05. In this sense, the traditional approach to hypothesis testing is conservative: The true probability of type I error is less than the nominal level.

It is possible to achieve any fixed significance level by data-unrelated randomization on the boundary of the critical region, in deciding whether to reject H₀. For the tea-tasting experiment, suppose that we reject H₀ when n_11 = 4, we reject H₀ with probability 0.157 when n_11 = 3, and we do not reject H₀ otherwise; that is, when n_11 = 3, we generate a uniform random variable U over [0, 1] and reject H₀ if U < 0.157. For expectation taken with respect to the null hypergeometric distribution of n_11, the significance level equals

    P(reject H₀) = E[P(reject H₀ | n_11)]
                 = 1.0(0.014) + 0.157(0.229) + 0.0 × P(n_11 ≤ 2) = 0.05.
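The boundary randomization probability can be solved for directly: it is the value c with P(n_11 = 4) + c · P(n_11 = 3) = 0.05. A Python sketch (our own, using the exact hypergeometric probabilities rather than the rounded values 0.014 and 0.229, which is apparently why it gives 0.156 rather than 0.157):

```python
from math import comb

# Null hypergeometric probabilities P(n11 = t) for the tea-tasting margins
p = {t: comb(4, t) * comb(4, 4 - t) / comb(8, 4) for t in range(5)}

# Rejection probability at the boundary n11 = 3 that makes the size exactly 0.05
c = (0.05 - p[4]) / p[3]
print(round(c, 5))               # 0.15625
print(round(p[4] + c * p[3], 10))  # size = 0.05 exactly
```

By construction, always rejecting at n_11 = 4 and rejecting with probability c at n_11 = 3 yields size exactly 0.05.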
With the randomization extension, Tocher (1950) showed that Fisher's test is uniformly most powerful unbiased (UMPU). In practice, randomization having nothing to do with the data is unacceptable. We recommend simply reporting the P-value. To reduce conservativeness, report the mid-P-value (Section 1.4.5). The test is no longer guaranteed to have true P(type I error) no greater than the nominal value, but in practice it is rarely much greater. For the one-sided test with the tea-tasting data,

    mid-P-value = (1/2)P(n_11 = 3) + P(n_11 > 3) = 0.129.

3.5.5 Small-Sample Unconditional Test of Independence*

A common sampling assumption for analyses comparing two groups on a binary response is that the rows are independent binomial samples. Then, only {n_i+} are naturally fixed. For Poisson and multinomial sampling schemes, neither marginal distribution is fixed. For such cases it may seem artificial to condition on both sets of marginal counts. An alternative small-sample test, designed for independent binomial samples, conditions on only the row totals.

Under binomial sampling with parameter π_i in row i, consider testing H₀: π₁ = π₂ using some test statistic T, such as the Pearson X². For fixed {n_i+}, T can take a discrete set of values, one of which is the observed value t_o. Given π₁ = π₂ = π, the P-value is P_π(T ≥ t_o), calculated using the product of the two binomial probability mass functions. This is the sum of the product binomial probabilities for those pairs of binomial samples that have T ≥ t_o. Since π is unknown, the actual P-value is defined as

    P = sup over 0 ≤ π ≤ 1 of P_π(T ≥ t_o).

This is an unconditional small-sample test of independence. Like Fisher's exact test, the true size is no greater than the nominal value (e.g., if we reject when P ≤ 0.05, the actual P(type I error) is no greater than 0.05).
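The supremum can be approximated by evaluating P_π(T ≥ t_o) over a fine grid of π values. The following Python sketch (our own, not the text's software) does this for two independent Binomial(3, π) samples with observed table (3, 0 / 0, 3), the example analyzed next:

```python
from math import comb

def x2(table):
    """Pearson X^2 for a 2 x 2 table (defined as 0 if a margin is empty)."""
    n = sum(map(sum, table))
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    if 0 in rows or 0 in cols:
        return 0.0
    return sum((table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i in range(2) for j in range(2))

n1 = n2 = 3                        # fixed row totals (binomial sample sizes)
x2_obs = x2([[3, 0], [0, 3]])      # observed statistic: 6.0

def p_value(pi):
    """P_pi(X^2 >= observed) under independent Binomial rows with the same pi."""
    total = 0.0
    for s1 in range(n1 + 1):
        for s2 in range(n2 + 1):
            if x2([[s1, n1 - s1], [s2, n2 - s2]]) >= x2_obs - 1e-9:
                total += (comb(n1, s1) * pi**s1 * (1 - pi)**(n1 - s1)
                          * comb(n2, s2) * pi**s2 * (1 - pi)**(n2 - s2))
    return total

sup_p = max(p_value(k / 1000) for k in range(1001))   # attained at pi = 0.5
fisher_two_sided = 2 * comb(3, 0) * comb(3, 3) / comb(6, 3)
print(sup_p, fisher_two_sided)     # 0.03125 0.1
```

The grid search recovers the supremum 2(0.5)³(0.5)³ = 0.031 at π = 1/2 and the two-sided Fisher P-value 0.100, the two quantities compared in the example that follows.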
We illustrate using test statistic X² for the 2 × 2 table having entries (3, 0 / 0, 3), by row, with fixed row totals (3, 3) as binomial sample sizes. The sample X² = 6.0. This X² value, for the observed table and for table (0, 3 / 3, 0), is the maximum possible. For a given value π for π₁ = π₂, the probability of the first table is [π³(1 − π)⁰][π⁰(1 − π)³] = π³(1 − π)³ (3 successes and 0 failures in the first row and 0 successes and 3 failures in the second), the product of two binomial probabilities. Similarly, the probability of the second table is (1 − π)³π³. Thus, the P-value is P_π(X² ≥ 6) = 2π³(1 − π)³, the sum of the product binomial probabilities for those two tables. The supremum of this over 0 ≤ π ≤ 1 occurs at π = 1/2, giving overall P-value equal to 2(0.5)³(0.5)³ = 0.031. By contrast, the two-sided Fisher's exact test has P-value equal to 2[C(3, 0)C(3, 3)/C(6, 3)] = 0.100. Barnard (1945, 1947) first proposed an unconditional test comparing binomial parameters, although he later (1949) retracted it in favor of Fisher's exact test. Several authors have since proposed related tests (e.g., Haber 1986; Suissa and Shuster 1985).

3.5.6 Conditional versus Unconditional Tests*

Since Barnard introduced the unconditional test, statisticians have debated the proper way to conduct small-sample analyses of 2 × 2 tables. Fisher criticized the unconditional approach, arguing that possible samples with quite different numbers of successes than observed were not relevant. In Fisher's (1945) view, ". . . the existence of these less informative possibilities should not affect our judgment of significance based on the series actually observed . . . . The fact that such an unhelpful outcome as these might occur . . . is surely no reason for enhancing our judgment of significance in cases where it has not occurred; . . .
it is only the sampling distribution of samples of the same type that can supply a rational test of significance." Sprott (2000, Sec. 6.4.4) recently provided a similar argument.

An adaptation of the unconditional approach by Berger and Boos (1994) addresses this criticism somewhat. They took the supremum for the P-value over a confidence interval of values for the nuisance parameter π rather than over all possible values. Their unconditional P-value is

    P = sup over π in C_γ of P_π(T ≥ t_o) + γ,

where C_γ is a 100(1 − γ)% confidence interval for π. Here, γ is taken to be very small (e.g., 0.001), and the test maintains the guaranteed upper bound on size.

Other arguments in favor of conditioning on both sets of marginal totals are that the conditional approach provides a simple way to eliminate nuisance parameters in a variety of problems (e.g., generalizing to other contingency table problems), and the margins contain little information about the association (Haber 1989; Yates 1984). Zhu and Reid (1994) noted that some information loss occurs in conditioning on the margins except when θ = 1.

Arguments against conditioning partly concern the increased discreteness that occurs. The few possible values for n_11 make it difficult to obtain a small P-value. In repeated use with a nominal significance level, the actual type I error probability may be much smaller than the nominal value and the power may suffer. Finally, for inference about nonnull values (e.g., confidence intervals), we will see that the conditional approach applies only with the odds ratio and not other measures.

The conservatism problem is partly unavoidable. Statistics having discrete distributions are necessarily conservative in terms of achieving nominal significance levels. Because an unconditional test fixes only one margin, however, it has many more tables in the reference set for its sampling distribution.
That distribution is less discrete, and a richer array of possible P-values occurs than with Fisher's exact test. An unconditional test tends to be less conservative and more powerful than Fisher's exact test. A disadvantage is that computations are very intensive for more complex problems, such as larger tables.

If a table truly has two independent binomial samples, the unconditional approach seems sensible. See Kempthorne (1979) for a cogent argument. The conditional approach is useful for other cases. In a randomized clinical trial, a convenience sample of n subjects is randomly allocated to two treatments. The samples are not binomials, as they are not random samples from two populations of interest. One could focus on the sample alone and consider the probability of a result at least as extreme as observed if there truly is no treatment effect. For instance, out of all possible ways of choosing n_1+ of the n subjects for treatment 1, for what proportion would n_11 be at least as large as observed? Under the null hypothesis of no treatment effect, the same overall response distribution (n_+1, n_+2) of successes and failures occurs regardless of the allocation of subjects to treatments. Thus, the column margin is also naturally fixed. This argument leads to hypergeometric null probabilities and Fisher's exact test (Greenland 1981). This argument does not extend, however, to nonnull effect values and hence to confidence intervals.

When both sets of marginal totals are naturally fixed, such as in Table 3.8, the high degree of discreteness is unavoidable and Fisher's exact test is the best procedure. Regardless of which margins are naturally fixed, using the mid-P-value helps reduce conservative effects of discreteness.

3.5.7 Derivation of Exact Conditional Distribution*

We now show how the conditional test for independence yields the hypergeometric distribution.
We do this for I × J tables, since we next discuss extensions of Fisher's exact test for them. We assume independent multinomial sampling within rows, as often applies in comparing I treatment groups. Then row totals {n_i+} are fixed, and we estimate the I conditional distributions {π_{j|i}, j = 1, . . . , J}. Under H₀: independence, π_{j|1} = π_{j|2} = ⋯ = π_{j|I} = π_+j, for j = 1, . . . , J. The product of the I multinomial probability functions then simplifies to

    Π_i [ (n_i+! / Π_j n_ij!) Π_j π_{j|i}^{n_ij} ] = (Π_i n_i+!)(Π_j π_+j^{n_+j}) / (Π_i Π_j n_ij!).          (3.17)

This distribution for {n_ij} depends on {π_+j}. These are nuisance parameters, since they do not describe the association. Fisher introduced the standard way of eliminating nuisance parameters, by conditioning on their sufficient statistics. From the definition of sufficiency, the resulting conditional distribution does not depend on those parameters.

The contribution of {π_+j} to the product multinomial distribution (3.17) depends on the data only through {n_+j}, which are their sufficient statistics. The {n_+j} have the multinomial(n, {π_+j}) distribution, namely

    (n! / Π_j n_+j!) Π_j π_+j^{n_+j}.          (3.18)

The joint probability function of {n_ij} and {n_+j} is identical to the probability function of {n_ij}, since {n_ij} determines {n_+j}. Thus, the probability function of {n_ij}, conditional on {n_+j}, equals the probability function (3.17) of {n_ij} divided by the probability function (3.18) evaluated at {n_+j}, or

    (Π_i n_i+!)(Π_j n_+j!) / (n! Π_i Π_j n_ij!).          (3.19)

This is the multiple hypergeometric distribution. It applies to the set of {n_ij} having the same {n_i+} and {n_+j} as the observed table. For 2 × 2 tables, it is the hypergeometric distribution (3.16).

When a table has a single multinomial sample, the unknown parameters are {π_ij}. For testing independence (π_ij = π_i+ π_+j for all i and j), distribution (3.19)
results from conditioning on the row and column totals. These are sufficient statistics for $\{\pi_{i+}\}$ and $\{\pi_{+j}\}$, which determine the null distribution. For either sampling model, both sets of margins are fixed after the conditioning. The end result (3.19) does not depend on unknown parameters and thus permits exact probability calculations.

3.5.8 Exact Tests of Independence for I × J Tables*

Exact tests for $I \times J$ tables utilize the multiple hypergeometric distribution. Freeman and Halton (1951) defined the P-value as the probability of the set of tables with the given margins that are no more likely to occur than the table observed. Other exact tests order the tables using a statistic describing distance from $H_0$. Yates (1934) used $X^2$. The P-value is then the null value of $P(X^2 \ge X_o^2)$ for observed value $X_o^2$. When classifications have ordered categories, an ordinal statistic is more relevant. For the alternative hypothesis of a positive association, we could use $P(T \ge t_o)$, where $T$ is the correlation or gamma and where $t_o$ denotes its observed value.

We illustrate an exact test for ordered categories with Table 3.9, which cross-classifies level of smoking and myocardial infarction for a sample of young women in a case-control study.

TABLE 3.9 Example for Exact Conditional Test

                          Smoking Level (cigarettes/day)
                            0      1-24     >= 25
Control                    25       25        12
Myocardial infarction       0        1         3

Source: Reprinted with permission, based on Table 5 in S. Shapiro et al., Lancet 743-746 (1979).

The second row contains small counts, and large-sample tests may be inappropriate. Given the marginal counts, the only table having greater evidence of positive association between smoking and myocardial infarction has counts (25, 26, 11) in row 1 and (0, 0, 4) in row 2. Conditional on both sets of margins, the null probability of the observed table and this more extreme table [based on formula (3.19)] equals 0.018.
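The 0.018 figure can be checked by direct enumeration. The sketch below (Python; the code and the equally spaced column scores 0, 1, 2 are illustrative assumptions, not from the text) enumerates all allocations of the four cases across columns consistent with the margins and sums the probabilities (3.19) of tables whose case-row score sum is at least the observed value. For a table with two rows and fixed margins, this ordering is equivalent to ordering by a correlation-type statistic with those column scores.

```python
from math import comb
from itertools import product

def exact_ordinal_pvalue(col_totals, n_cases, scores, observed):
    """One-sided exact conditional P-value for a 2 x J table with the
    case-row total fixed at n_cases, ordering tables by the score sum
    in the case row.  With both margins fixed, (3.19) reduces to a
    multivariate hypergeometric for allocating the cases to columns."""
    n = sum(col_totals)
    denom = comb(n, n_cases)
    t_obs = sum(s * x for s, x in zip(scores, observed))
    p = 0.0
    # Enumerate all case-row allocations consistent with the margins.
    for cells in product(*(range(min(m, n_cases) + 1) for m in col_totals)):
        if sum(cells) != n_cases:
            continue
        if sum(s * x for s, x in zip(scores, cells)) >= t_obs:
            prob = 1.0
            for m, x in zip(col_totals, cells):
                prob *= comb(m, x)
            p += prob / denom
    return p

# Table 3.9: column totals 25, 26, 15; four myocardial infarction cases
# observed as (0, 1, 3) across the smoking levels.
p = exact_ordinal_pvalue([25, 26, 15], 4, scores=[0, 1, 2], observed=[0, 1, 3])
print(round(p, 3))  # 0.018, matching the text
```

Only the observed table and the table with case row (0, 0, 4) have score sums at least as large as observed, so the sum reduces to the two probabilities described above.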
Although the sample contains only four myocardial infarction patients, evidence exists of a positive association. The evidence is stronger than using $X^2$, which ignores the ordering of categories: the exact $P(X^2 \ge X_o^2) = P(X^2 \ge 6.96) = 0.052$.

Special algorithms and software for computing exact tests for $I \times J$ tables are widely available (e.g., Mehta and Patel 1983; see also Appendix A). We recommend these tests when asymptotic approximations may be invalid. Computing time increases exponentially as $n$, $I$, or $J$ increases. However, one can use Monte Carlo methods to sample randomly from the set of tables with the given margins. The estimated P-value is then the sample proportion of tables having test statistic value at least as large as the value observed. As $I$ and/or $J$ increases, the number of possible values for any test statistic $T$ tends to increase. Thus, the conservativeness issue for conditional tests becomes less problematic.

3.6 SMALL-SAMPLE CONFIDENCE INTERVALS FOR 2 × 2 TABLES*

Small-sample methods also apply to estimation. Exact distributions depending only on the parameter of interest result from the same arguments. These distributions are the basis of confidence intervals for measures such as the odds ratio.

3.6.1 Small-Sample Inference for the Odds Ratio

For multinomial sampling, the distribution of $\{n_{ij}\}$ depends on $n$ and the cell probabilities $\{\pi_{ij}\}$. For $2 \times 2$ tables, the odds ratio is

$\theta = \frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}} = \frac{\pi_{11}\left( 1 - \pi_{1+} - \pi_{+1} + \pi_{11} \right)}{\left( \pi_{1+} - \pi_{11} \right)\left( \pi_{+1} - \pi_{11} \right)}.$

Hence, $\pi_{11}$ is a function of $\theta$ and $\{\pi_{1+}, \pi_{+1}\}$. The same argument applies to any $\pi_{ij}$, so the multinomial distribution of $\{n_{ij}\}$ can use the parameters $\{\theta, \pi_{1+}, \pi_{+1}\}$. Conditional on $\{n_{1+}, n_{+1}\}$, the distribution of $\{n_{ij}\}$ depends only on $\theta$. Since $n_{11}$ determines all other cell counts, given the marginal totals, the conditional distribution of $\{n_{ij}\}$ is specified by some function $P(n_{11} = t) = f(t;\, n_{1+}, n_{+1}, n, \theta)$.
This distribution (Fisher 1935c) is the noncentral hypergeometric,

$f(t;\, n_{1+}, n_{+1}, n, \theta) = \frac{\dbinom{n_{1+}}{t}\dbinom{n - n_{1+}}{n_{+1} - t}\,\theta^t}{\sum_{u=m_-}^{m_+} \dbinom{n_{1+}}{u}\dbinom{n - n_{1+}}{n_{+1} - u}\,\theta^u} \qquad (3.20)$

for $m_- \le t \le m_+$. A confidence interval for $\theta$ results from inverting the test of $H_0$: $\theta = \theta_0$, having observed $n_{11} = t_o$. For $H_a$: $\theta > \theta_0$, the P-value is

$P = \sum_{t \ge t_o} f(t;\, n_{1+}, n_{+1}, n, \theta_0).$

For testing against $H_a$: $\theta < \theta_0$,

$P = \sum_{t \le t_o} f(t;\, n_{1+}, n_{+1}, n, \theta_0).$

When $\theta_0 = 1$, these are one-sided Fisher's exact tests. Cornfield (1956) constructed a confidence interval using the tail method. The lower endpoint is the $\theta_0$ for which $P = \alpha/2$ in testing against $H_a$: $\theta > \theta_0$. The upper endpoint is the $\theta_0$ for which $P = \alpha/2$ for $H_a$: $\theta < \theta_0$. The interval is the set of $\theta_0$ for which both one-sided P-values $\ge \alpha/2$.

As in Fisher's exact test, the conditional approach to interval estimation is necessarily conservative because of discreteness. The actual confidence coefficient, defined as the infimum of the coverage probabilities for all possible $\theta$, has the nominal confidence level as a lower bound. Less conservative behavior and shorter intervals result from inverting a single two-sided test rather than inverting two one-sided tests (Agresti and Min 2001; Baptista and Pike 1977). An alternative approach with independent binomial samples inverts nonnull unconditional small-sample tests. Because of the reduced discreteness, such intervals are also usually shorter.

The conditional ML estimate of $\theta$ is the value of $\theta$ that maximizes probability (3.20). Differentiating the log likelihood with respect to $\theta$ shows that this estimate satisfies the equation $n_{11} = E(n_{11})$ in $\theta$, where the expectation refers to distribution (3.20). This equation has a unique solution $\hat\theta$ and is solved using iterative methods (Cornfield 1956).
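A minimal numerical sketch of this computation (Python; the code is illustrative, not from the text) evaluates the mean of distribution (3.20) and solves $n_{11} = E(n_{11})$ by bisection on the log scale, since the mean is increasing in $\theta$:

```python
from math import comb

def nc_hypergeom_mean(n1p, np1, n, theta):
    """Mean of the noncentral hypergeometric distribution (3.20)."""
    lo = max(0, n1p + np1 - n)   # m_-
    hi = min(n1p, np1)           # m_+
    support = range(lo, hi + 1)
    weights = [comb(n1p, u) * comb(n - n1p, np1 - u) * theta**u
               for u in support]
    return sum(u * w for u, w in zip(support, weights)) / sum(weights)

def conditional_mle(n11, n1p, np1, n, lo=1e-6, hi=1e6, tol=1e-10):
    """Solve n11 = E(n11) for theta by bisection; the mean is
    increasing in theta, so the root is unique."""
    for _ in range(200):
        mid = (lo * hi) ** 0.5          # bisect on the log scale
        if nc_hypergeom_mean(n1p, np1, n, mid) < n11:
            lo = mid
        else:
            hi = mid
        if hi / lo < 1 + tol:
            break
    return (lo * hi) ** 0.5

# Fisher's tea-tasting data (Table 3.8): n11 = 3, both margins 4, n = 8.
theta_hat = conditional_mle(3, 4, 4, 8)
print(round(theta_hat, 1))  # 6.4
```

Applied to the tea-tasting data of Table 3.8, this recovers the conditional ML estimate of 6.4.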
This estimator differs from the unconditional ML estimator $\hat\theta = n_{11}n_{22}/n_{12}n_{21}$, which uses the ML estimates of $\{\pi_{ij}\}$ for the multinomial distribution of $\{n_{ij}\}$. Using statistical software, we can calculate conditional ML estimates and small-sample confidence intervals for odds ratios (e.g., for SAS, see Table A.2).

3.6.2 Tea Tasting Example

We illustrate with Table 3.8 from Fisher's tea-tasting experiment. The conditional ML estimate of $\theta$ is 6.4. Software provides the Cornfield tail-method interval (0.2, 626.2), with confidence coefficient guaranteed $\ge$ 0.95. Not surprisingly, it is very wide because of the small sample. Inverting a family of two-sided "exact" conditional score tests gives a more precise interval, (0.3, 306.2). The unconditional approach is not appropriate here because of the sampling design. [If the table were two binomial samples, that approach gives the interval (0.4, 234.4) by inverting "exact" unconditional score tests.]

3.6.3 Impact of Discreteness on Exact Confidence Intervals

Small-sample inference is "exact" in the sense that the conditional distribution is free of nuisance parameters. Confidence intervals and tests use exact probability calculations rather than approximate ones. However, their operating characteristics are conservative because of discreteness. Large-sample methods do not have the guarantee of bounds on error probabilities. They can be conservative or liberal, and thus their results can appear quite different from those of exact methods. For example, for the tea-tasting data (Table 3.8), the P-value for the Pearson chi-squared test equals 0.157, compared to 0.486 for the two-sided exact test. The 95% large-sample confidence interval (3.2) for the odds ratio is (0.4, 220.9), compared to Cornfield's exact interval of (0.2, 626.2). Normally, one would prefer an exact method over an approximate one. When the conditional distribution is highly discrete, however, the choice is not so obvious.
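The degree of discreteness for the tea-tasting data is easy to see from the one-sided exact and mid-P-values under $H_0$: $\theta = 1$, computed from the hypergeometric distribution (3.16). A minimal sketch (Python; illustrative code, not from the text):

```python
from math import comb

def fisher_onesided_pvalues(n11, n1p, np1, n):
    """Ordinary and mid one-sided P-values (H_a: theta > 1) from the
    central hypergeometric distribution (3.16)."""
    lo = max(0, n1p + np1 - n)
    hi = min(n1p, np1)
    denom = comb(n, np1)
    prob = {t: comb(n1p, t) * comb(n - n1p, np1 - t) / denom
            for t in range(lo, hi + 1)}
    p_exact = sum(v for t, v in prob.items() if t >= n11)
    # Mid-P-value counts only half the probability of the observed table.
    p_mid = p_exact - 0.5 * prob[n11]
    return p_exact, p_mid

# Tea-tasting data (Table 3.8): n11 = 3, both margins 4, n = 8.
p_exact, p_mid = fisher_onesided_pvalues(3, 4, 4, 8)
print(round(p_exact, 3), round(p_mid, 3))  # 0.243 0.129
```

With only five points in the conditional support, the ordinary P-value of 17/70 ≈ 0.243 sits far from any nominal level, and the mid-P-value of 9/70 ≈ 0.129 removes half of the large spike at the observed table.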
Exact methods then can be quite conservative, especially with small samples. For highly discrete data, it seems sensible to use adjustments of exact methods based on the mid-P-value. Confidence intervals with the conditional approach then invert hypergeometric tests of $\theta = \theta_0$ using the mid-P-value. Although not guaranteed to have error probabilities no greater than the nominal level, this method usually comes closer than the exact method to the desired level. Compared to large-sample methods, it has the advantage of working well as the degree of discreteness diminishes, since it then is essentially the same as the corresponding exact method using an ordinary P-value.

Inference based on the mid-P-value compromises between the conservativeness of exact methods and the uncertain adequacy of large-sample methods. For interval estimation of the odds ratio, this method tends to be a bit conservative, but for small samples it can yield much shorter intervals than the Cornfield exact interval. For the tea-tasting data, for instance, the 95% confidence interval based on inverting two one-sided hypergeometric tests using the mid-P-value is (0.31, 309), compared to the Cornfield interval of (0.21, 626).

3.6.4 Small-Sample Inference for Difference of Proportions

The conditional approach to eliminating nuisance parameters works when those parameters have sufficient statistics. However, we'll see (Section 6.7.9) that reduced sufficient statistics occur only for certain models. For binary data, such models must have odds ratios as parameters. For $2 \times 2$ tables, the conditional approach cannot yield confidence intervals for differences or ratios of proportions.

The unconditional approach is more complex but does not require sufficient statistics. We used it in Section 3.5.5 for testing $\pi_1 - \pi_2 = 0$ with independent binomial samples.
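That unconditional test can be sketched directly (Python; the choice of $\hat\pi_1 - \hat\pi_2$ as test statistic and the finite grid search over the common nuisance parameter are illustrative assumptions, not prescriptions from the text):

```python
from math import comb

def unconditional_pvalue(y1, n1, y2, n2, grid_size=999):
    """One-sided exact unconditional P-value for H0: pi1 = pi2 vs
    Ha: pi1 > pi2, ordering outcomes by the difference of sample
    proportions and taking the supremum of the binomial-product
    probability over a grid for the common nuisance parameter pi."""
    t_obs = y1 / n1 - y2 / n2
    # Outcomes at least as extreme as observed.
    extreme = [(a, b) for a in range(n1 + 1) for b in range(n2 + 1)
               if a / n1 - b / n2 >= t_obs - 1e-12]
    p_sup = 0.0
    for k in range(1, grid_size + 1):
        pi = k / (grid_size + 1)
        prob = sum(comb(n1, a) * pi**a * (1 - pi)**(n1 - a)
                   * comb(n2, b) * pi**b * (1 - pi)**(n2 - b)
                   for a, b in extreme)
        p_sup = max(p_sup, prob)
    return p_sup

# The table with rows (3, 0) and (0, 3): the unconditional P-value is
# 1/64 (Problem 3.42), versus 1/20 for Fisher's exact test.
p = unconditional_pvalue(3, 3, 0, 3)
print(p)  # 0.015625
```

Here the supremum occurs at $\pi = 0.5$, which lies on the grid, so the grid search returns the exact supremum; in general a fine grid only approximates it.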
A small-sample confidence interval inverts the corresponding unconditional test of $H_0$: $\pi_1 - \pi_2 = \delta_0$, for any fixed $-1 < \delta_0 < 1$. The probability function for the table is the product of $\mathrm{bin}(n_1, \pi_1)$ and $\mathrm{bin}(n_2, \pi_2)$ mass functions. One can express this in terms of $\delta = \pi_1 - \pi_2$ and a nuisance parameter $\lambda$. For instance, if $\lambda = \pi_1 + \pi_2$, one substitutes $\pi_1 = (\lambda + \delta)/2$ and $\pi_2 = (\lambda - \delta)/2$. For $\delta = \delta_0$ and a fixed value of $\lambda$, one then uses this binomial product to calculate the probability that the test statistic is at least as large as observed. The P-value is the supremum of such probabilities calculated over all possible values of $\lambda$. This provides a family of tests for the various values of $\delta_0$. The confidence interval for $\pi_1 - \pi_2$ is the set of $\delta_0$ for which this P-value exceeds $\alpha$.

This approach can be quite conservative. For details regarding various test statistics, see Agresti and Min (2001), Coe and Tamhane (1993), Santner and Snell (1980), and Santner and Yamagami (1993). It is better to invert a single two-sided test, as in Coe and Tamhane (1993), than to invert two separate one-sided tests.

3.7 EXTENSIONS FOR MULTIWAY TABLES AND NONTABULATED RESPONSES

The methods of this chapter extend to multiway contingency tables. For instance, tests of independence for two-way tables extend to tests of conditional independence in three-way tables. In future chapters we present such methods with models that provide a basis for defining relevant parameters and their statistical inferences. The methods then apply in a greater variety of situations, such as when some explanatory variables are continuous rather than categorical.

3.7.1 Categorical Data Need Not Be Contingency Tables

Examples so far have presented categorical data in the format of contingency tables. However, this book has a broader focus than contingency table analysis.
Models for categorical response variables can have continuous as well as categorical explanatory variables. Even when all or most variables are categorical, source data files are not usually contingency tables but have the form of a line of data for each subject. The first three lines in a data file containing responses to a survey of subjects measuring gender, race, education (1 = less than high school, 2 = high school or some college, 3 = college graduate), and opinion about homosexuality (1 = tolerant, 2 = homophobic) might be:

subject   gender   race   education   opinion
   1        f       w         2          1
   2        m       b         3          1
   3        m       w         1          2

Software can read data files of this type and then conduct analyses that may involve forming contingency tables. In the next chapter we introduce the modeling framework used in the rest of the book. All the methods that we've studied in this chapter result from inferences for parameters in simple versions of these models.

NOTES

Section 3.1: Confidence Intervals for Association Parameters

3.1. Adaptations of Woolf's interval (3.2) for $\log\theta$ to handle zero cell counts include Agresti (1999) and Gart (1966, 1971). Goodman (1964a) presented simultaneous confidence intervals for all odds ratios in an $I \times J$ table. Brown and Benedetti (1977) and Goodman and Kruskal (1963, 1972) provided standard errors for many association measures. Goodman and Kruskal (1963, 1972) extended (3.9) for independent multinomial sampling.

3.2. Agresti and Caffo (2000) showed that, as in the single-sample case (Problem 1.24), the Wald interval (3.4) for $\pi_1 - \pi_2$ behaves much better after adding two pseudo-observations of each type (one of each type in each sample).

Section 3.2: Testing Independence in Two-Way Contingency Tables

3.3. For hypergeometric sampling, $\{\hat\mu_{ij}\}$ in tests of independence are exact (rather than estimated) expected values. Specifically,

$E(n_{11}) = \frac{n_{1+}n_{+1}}{n} \quad \text{and} \quad \mathrm{var}(n_{11}) = \frac{n_{1+}n_{+1}n_{2+}n_{+2}}{n^2(n-1)}.$

Haldane (1940) derived $E(X^2) = (I-1)(J-1)n/(n-1)$
and a complex formula for $\mathrm{var}(X^2)$; Dawson (1954) provided a simplified expression. Lewis et al. (1984) derived the third central moment. Watson (1959) showed that the conditional distribution of $X^2$ also has the limiting chi-squared distribution.

3.4. Diaconis and Efron (1985) presented inference based on a uniform distribution over all possible tables of the same $I$, $J$, and $n$; their volume test considers the proportion of such tables having $X^2 \le X_o^2$.

3.5. Specialized methods are necessary for complex sampling designs. Sequential methods are useful in biomedical applications (Jennison and Turnbull 2000, Chap. 12). Social science applications often incorporate clustering and/or stratification. LaVange et al. (2001) and Rao and Thomas (1988) surveyed analyses of categorical data for complex sampling methods. Gleser and Moore (1985) showed that positive dependence causes null distributions of Pearson statistics to stochastically increase. See also Bedrick (1983), Clogg and Eliason (1987), Fay (1985), Holt et al. (1980), Koehler and Wilson (1986), Rao and Scott (1987), Scott and Wild (2001), Shuster and Downing (1976), Tavaré and Altham (1983), and the methods of Chapter 12. Other modifications are necessary when some data are missing. Watson (1956) was perhaps the first to study this. Lipsitz and Fitzmaurice (1996) derived score tests of independence and conditional independence for contingency tables, assuming ignorable nonresponse, and showed that the test statistics have the usual asymptotic chi-squared null distributions. See Schafer (1997, Chap. 7) for a survey of methods.

Section 3.4: Two-Way Tables with Ordered Classifications

3.6. Bhapkar (1968) and Yates (1948) proposed statistics similar to $M^2$ and also proposed statistics for singly ordered tables. Graubard and Korn (1987) listed 14 tests for $2 \times J$ tables that utilize a correlation-type statistic. See also Nair (1987) and Williams (1952). Cohen and Sackrowitz (1991, 1992)
evaluated decision-theoretic aspects, such as admissibility, of tests based on gamma and local log odds ratios. Rayner and Best (2001) considered nonparametric methods in a contingency table format.

Section 3.5: Small-Sample Tests of Independence

3.7. Yates (1934) mentioned that Fisher suggested the hypergeometric to him for an exact test. He proposed a continuity-corrected version of $X^2$,

$X_c^2 = \sum\sum \frac{\left( |n_{ij} - \hat\mu_{ij}| - 0.5 \right)^2}{\hat\mu_{ij}},$

to approximate the exact test. Haber (1980, 1982), Plackett (1964), and Yates (1984) discussed its appropriateness. Since software now makes Fisher's exact test feasible even with large samples, this correction is no longer needed.

3.8. The UMPU property of Fisher's exact test follows from conditioning on a sufficient statistic that is complete and has distribution in the exponential family (Lehmann 1986, Secs. 4.5-4.7). Fleiss (1981), Gail and Gart (1973), and Suissa and Shuster (1985) studied sample size for obtaining fixed power in Fisher's test. The controversy over conditioning includes Barnard (1945, 1947, 1949, 1979), Berkson (1978), Fisher (1956), Howard (1998), Kempthorne (1979), Lloyd (1988a), Pearson (1947), Rice (1988), Routledge (1992), Suissa and Shuster (1984, 1985), and Yates (1984). Yates and discussants also addressed the choice of two-sided P-value. Discussion of unconditional methods includes Chan (1998), Martín Andrés and Silva Mato (1994), and Röhmel and Mansmann (1999). Altham (1969) and Howard (1998) discussed Bayesian analyses for $2 \times 2$ tables (see Section 15.2.3). Agresti (1992, 2001) surveyed small-sample methods.

3.9. For discussion of inference using the mid-P-value, see Berry and Armitage (1995), Hirji (1991), Hwang and Wells (2002), Hwang and Yang (2001), Mehta and Walsh (1992), and Routledge (1994). Similar benefits can accrue from alternative proposed P-values.
One approach, useful when several tables have the same value for a test statistic, uses the table probability to create a more finely partitioned sample space: among tables having the observed test statistic value, only those that are no more likely than the observed table contribute to the P-value (Cohen and Sackrowitz 1992; Kim and Agresti 1995). This depends on more than the sufficient statistic, and in some cases a Rao-Blackwellized version is the mid-P-value (Hwang and Wells 2002). Ordinary P-values obtained with higher-order asymptotic methods without continuity corrections for discreteness yield performance similar to that of the mid-P-value (Pierce and Peters 1999; Strawderman and Wells 1998).

3.10. For exact treatment of $I \times J$ tables, see Mehta and Patel (1983). For ordered categories, see also Agresti et al. (1990). For Monte Carlo estimation of exact P-values, see Agresti et al. (1979), Booth and Butler (1999), Diaconis and Sturmfels (1998), Forster et al. (1996), Mehta et al. (1988), and Patefield (1982). Gail and Mantel (1977) and Good (1976) gave approximate formulas for the number of tables having certain fixed margins. Freidlin and Gastwirth (1999) extended the unconditional approach to a test for trend in $I \times 2$ tables and a test of conditional independence with several $2 \times 2$ tables.

Section 3.6: Small-Sample Confidence Intervals for 2 × 2 Tables

3.11. Suppose that $(\theta, \lambda)$ has minimal sufficient statistic $(T, U)$, where $\lambda$ is a nuisance parameter. Cox and Hinkley (1974, p. 35) defined $U$ to be ancillary for $\theta$ if its distribution depends only on $\lambda$, and the distribution of $T$ given $U$ depends only on $\theta$. For $2 \times 2$ tables with odds ratio $\theta$ and $\lambda = (\pi_{1+}, \pi_{+1})$, let $T = n_{11}$ and $U = (n_{1+}, n_{+1})$. Then $U$ is not ancillary, because its distribution depends on $\theta$ as well as $\lambda$. Using a definition due to Godambe, Bhapkar (1989) referred to the marginals $U$ as partial ancillary for $\theta$.
This means that the distribution of the data, given $U$, depends only on $\theta$, and that for fixed $\theta$, the family of distributions of $U$ for various $\lambda$ is complete. Liang (1984) gave an alternative definition referring to conditional and unconditional inference being equally efficient.

PROBLEMS

Applications

3.1 Refer to Table 2.9. Construct and interpret a 95% confidence interval for the population (a) odds ratio, (b) difference of proportions, and (c) relative risk between seat-belt use and type of injury.

3.2 Refer to Table 2.5 on lung cancer and smoking. Construct a confidence interval for a relevant measure of association. Interpret.

3.3 In professional basketball games during 1980-1982, when Larry Bird of the Boston Celtics shot a pair of free throws, 5 times he missed both, 251 times he made both, 34 times he made only the first, and 48 times he made only the second (Wardrop 1995). Is it plausible that the successive free throws are independent?

3.4 Refer to Table 3.10.
a. Using $X^2$ and $G^2$, test the hypothesis of independence between party identification and race. Report the P-values and interpret.
b. Use residuals to describe the evidence of association.
c. Partition chi-squared into components regarding the choice between Democrat and Independent and between these two combined and Republican. Interpret.
d. Summarize association by constructing a 95% confidence interval for the odds ratio between race and whether a Democrat or Republican. Interpret.

TABLE 3.10 Data for Problem 3.4

                  Party Identification
Race      Democrat   Independent   Republican
Black       103          15            11
White       341         105           405

Source: 1991 General Social Survey, National Opinion Research Center.

3.5 Refer to Table 3.10. In the same survey, gender was cross-classified with party identification. Table 3.11 shows some results. Explain how to interpret all the results on this printout.

3.6 In a study of the relationship between stage of breast cancer at diagnosis (local or advanced)
and a woman's living arrangement, of 144 women living alone, 41.0% had an advanced case; of 209 living with spouse, 52.2% were advanced; of 89 living with others, 59.6% were advanced. The authors reported the P-value for the relationship as 0.02 (D. J. Moritz and W. A. Satariano, J. Clin. Epidemiol. 46: 443-454, 1993). Reconstruct the analysis performed to obtain this P-value.

TABLE 3.11 Results for Problem 3.5

Frequency                  dem        indep      repub
(Expected)
female                     279          73        225
                        (261.42)    (70.653)   (244.93)
male                       165          47        191
                        (182.58)    (49.347)   (171.07)

Statistic                      DF    Value     Prob
Chi-Square                      2    7.0095    0.0301
Likelihood Ratio Chi-Square     2    7.0026    0.0302

Observ    Resraw     Reschi    StReschi
  1       17.584      1.088      2.293
  2        2.347      0.279      0.465
  3      -19.931     -1.274     -2.618
  4      -17.584     -1.301     -2.293
  5       -2.347     -0.334     -0.464
  6       19.931      1.524      2.618

3.7 Refer to Table 2.1. Partition $G^2$ for testing whether the incidence of heart attacks is independent of aspirin intake into two components. Interpret.

3.8 Project Blue Book: Analysis of Reports of Unidentified Aerial Objects was published by the U.S. Air Force (Air Technical Intelligence Center at Wright-Patterson Air Force Base) in May 1955 to analyze reports of unidentified flying objects (UFOs). In its Table II, the report classified 1765 sightings later regarded as known objects and 434 sightings later regarded as unknown, according to the object color (nine categories). The report states: "The chi-square test is applicable only to distributions which have the same number of elements," so the investigators multiplied all counts in the known category by (434/1765), so each row has 434 observations, before computing $X^2$. They reported $X^2 = 26.15$ with df = 8. Explain why this is incorrect. What should $X^2$ equal? (Hint: For their adjusted table, first show that the contribution to $X^2$ is the same for each cell in a column, and then show the effect on those contributions of multiplying each count in one row by a constant.
)

3.9 Table 3.12 classifies a sample of psychiatric patients by their diagnosis and by whether their treatment prescribed drugs.
a. Obtain standardized Pearson residuals for independence, and interpret.
b. Partition chi-squared into three components to describe differences and similarities among the diagnoses, by comparing (i) the first two rows, (ii) the third and fourth rows, and (iii) the last row to the first and second rows combined and the third and fourth rows combined.

TABLE 3.12 Data for Problem 3.9

Diagnosis               Drugs   No Drugs
Schizophrenia            105        8
Affective disorder        12        2
Neurosis                  18       19
Personality disorder      47       52
Special symptoms           0       13

Source: Reprinted with permission from E. Helmes and G. C. Fekken, J. Clin. Psychol. 42: 569-576 (1986).

3.10 Refer to Table 7.8. For the combined data for the two genders, yielding a single $4 \times 4$ table, $X^2 = 11.5$ ($P = 0.24$), whereas using row scores (3, 10, 20, 35) and column scores (1, 3, 4, 5), $M^2 = 7.04$ ($P = 0.008$). Explain why the results are so different.

3.11 A study on educational aspirations of high school students (S. Crysdale, Internat. J. Compar. Sociol. 16: 19-36, 1975) measured aspirations with the scale (some high school, high school graduate, some college, college graduate). The student counts in these categories were (11, 52, 23, 22) when family income was low, (9, 44, 13, 10) when family income was middle, and (9, 41, 12, 27) when family income was high.
a. Test independence of educational aspirations and family income using $X^2$ or $G^2$. Explain the deficiency of this test for these data.
b. Find the standardized Pearson residuals. Do they suggest any association pattern?
c. Conduct an alternative test that may be more powerful. Interpret.

3.12 Refer to Table 8.15. Obtain a 95% confidence interval for gamma. Interpret the association between schooling and attitude toward abortion.
3.13 Table 3.13 shows the results of a retrospective study comparing radiation therapy with surgery in treating cancer of the larynx. The response indicates whether the cancer was controlled for at least two years following treatment. Table 3.14 shows SAS output.

TABLE 3.13 Data for Problem 3.13

                     Cancer Controlled   Cancer Not Controlled
Surgery                     21                     2
Radiation therapy           15                     3

Source: Reprinted with permission from W. M. Mendenhall, R. R. Million, D. E. Sharkey, and N. J. Cassisi, Internat. J. Radiat. Oncol. Biol. Phys. 10: 357-363 (1984), Pergamon Press plc.

TABLE 3.14 SAS Output for Problem 3.13

Fisher's Exact Test
Cell (1,1) Frequency (F)        21
Left-sided Pr <= F          0.8947
Right-sided Pr >= F         0.3808
Table Probability (P)       0.2755
Two-sided Pr <= P           0.6384

Odds Ratio                  2.1000
Asymptotic Conf Limits:   95% Lower Conf Limit  0.3116   95% Upper Conf Limit  14.1523
Exact Conf Limits:        95% Lower Conf Limit  0.2089   95% Upper Conf Limit  27.5522

a. Report and interpret the P-value for Fisher's exact test with (i) $H_a$: $\theta > 1$, and (ii) $H_a$: $\theta \ne 1$. Explain how the P-values are calculated.
b. Interpret the confidence intervals for $\theta$. Explain the difference between them and how they were calculated.
c. Find and interpret the one-sided mid-P-value. Give advantages and disadvantages of this type of P-value.

3.14 A study considered the effect of prednisolone on severe hypercalcaemia in women with metastatic breast cancer (B. Kristensen et al., J. Intern. Med. 232: 237-245, 1992). Of 30 patients, 15 were randomly selected to receive prednisolone. The other 15 formed a control group. Normalization in their level of serum-ionized calcium was achieved by 7 of the treated patients and none of the control group. Analyze whether results were significantly better for treatment than for control. Interpret.

3.15 For Problem 3.14, obtain a 95% confidence interval for the odds ratio using (a) the Woolf (i.e., Wald) interval, (b) Cornfield's "exact" approach, (c)
the profile likelihood. In each case, note the effect of the zero cell count. Summarize advantages and disadvantages of each approach.

3.16 Refer to the tea-tasting data (Table 3.8). Construct the null distributions of the ordinary P-value and the mid-P-value for Fisher's exact test with $H_a$: $\theta > 1$. Find and compare their expected values.

3.17 Consider a $3 \times 3$ table having entries, by row, of (4, 2, 0 / 2, 2, 2 / 0, 2, 4). Conduct an exact test of independence, using $X^2$. Assuming ordered rows and columns and using equally spaced scores, conduct an ordinal exact test. Explain why results differ so much.

3.18 An advertisement by Schering Corp. in 1999 for the allergy drug Claritin mentioned that in a pediatric randomized clinical trial, symptoms of nervousness were shown by 4 of 188 patients on loratadine (Claritin), 2 of 262 patients taking placebo, and 2 of 170 patients on chlorpheniramine. In each part below, explain which method you used, and why.
a. Is there inferential evidence that nervousness depends on drug?
b. For the Claritin and placebo groups, construct and interpret a 95% confidence interval for the (i) odds ratio and (ii) difference of proportions suffering nervousness.

3.19 Refer to Problem 2.19 on sexual fun. Analyze these data. Present a short report summarizing results and interpretations.

Theory and Methods

3.20 Is $\hat\theta$ the midpoint of large- and small-sample confidence intervals for $\theta$? Why or why not?

3.21 For comparing two binomial samples, show that the standard error (3.1) of a log odds ratio increases as the absolute difference of proportions of successes and failures for a given sample increases.

3.22 Using the delta method, show that the Wald confidence interval for the logit of a binomial parameter $\pi$ is

$\log\left[ \hat\pi / (1 - \hat\pi) \right] \pm z_{\alpha/2}\big/\sqrt{n\hat\pi(1 - \hat\pi)}.$

Explain how to use this interval to obtain one for $\pi$ itself. [Newcombe (2001) noted that the sample logit is also the midpoint of the score interval for $\pi$, on the logit scale.
He showed that this logit interval contains the score interval.]

3.23 For two parameters, a confidence interval for $\theta_1 - \theta_2$ based on single-sample estimates $\hat\theta_i$ and intervals $(l_i, u_i)$ for $\theta_i$, $i = 1, 2$, is

$\left( \hat\theta_1 - \hat\theta_2 - \sqrt{(\hat\theta_1 - l_1)^2 + (u_2 - \hat\theta_2)^2},\quad \hat\theta_1 - \hat\theta_2 + \sqrt{(u_1 - \hat\theta_1)^2 + (\hat\theta_2 - l_2)^2} \right).$

Newcombe (1998b) proposed an interval for $\pi_1 - \pi_2$ using the score interval $(l_i, u_i)$ for $\pi_i$ that performs much better than the Wald interval (3.4). It is $(\hat\pi_1 - \hat\pi_2 - z_{\alpha/2}s_L,\ \hat\pi_1 - \hat\pi_2 + z_{\alpha/2}s_U)$, with

$s_L = \sqrt{\frac{l_1(1 - l_1)}{n_1} + \frac{u_2(1 - u_2)}{n_2}}, \qquad s_U = \sqrt{\frac{u_1(1 - u_1)}{n_1} + \frac{l_2(1 - l_2)}{n_2}}.$

Show that it has the general form above of an interval for $\theta_1 - \theta_2$.

3.24 For multinomial sampling, use the asymptotic variance of $\log\hat\theta$ to show that for Yule's $Q$ (Problem 3.26) the asymptotic variance of $\sqrt{n}(\hat Q - Q)$ is $\sigma^2 = \left(\sum_i \sum_j \pi_{ij}^{-1}\right)(1 - Q^2)^2/4$ (Yule 1900, 1912).

3.25 Refer to Problem 2.23. For multinomial sampling, show how to obtain a confidence interval for AR by first finding one for $\log(1 - \mathrm{AR})$ (Fleiss 1981, p. 76).

3.26 For multinomial probabilities $\pi = (\pi_1, \pi_2, \dots)$ with a contingency table of arbitrary dimensions, suppose that a measure $g(\pi) = \nu/\delta$. Show that the asymptotic variance of $\sqrt{n}\,[g(\hat\pi) - g(\pi)]$ is $\sigma^2 = \left[\sum_i \pi_i \eta_i^2 - \left(\sum_i \pi_i \eta_i\right)^2\right]/\delta^4$, where $\eta_i = \delta(\partial\nu/\partial\pi_i) - \nu(\partial\delta/\partial\pi_i)$ (Goodman and Kruskal 1972).

3.27 For ordinal variables, consider gamma (2.14). Let

$\pi_{ij}^{(c)} = \sum_{a<i}\sum_{b<j} \pi_{ab} + \sum_{a>i}\sum_{b>j} \pi_{ab}, \qquad \pi_{ij}^{(d)} = \sum_{a<i}\sum_{b>j} \pi_{ab} + \sum_{a>i}\sum_{b<j} \pi_{ab},$

where $i$ and $j$ are fixed in the summations. Show that $\Pi_c = \sum_i \sum_j \pi_{ij}\pi_{ij}^{(c)}$ and $\Pi_d = \sum_i \sum_j \pi_{ij}\pi_{ij}^{(d)}$. Use the delta method to show that the large-sample normality (3.9) applies for $\hat\gamma$, with (Goodman and Kruskal 1963)

$\phi_{ij} = \frac{4\left[\Pi_d \pi_{ij}^{(c)} - \Pi_c \pi_{ij}^{(d)}\right]}{(\Pi_c + \Pi_d)^2}, \qquad \sum_i \sum_j \pi_{ij}\phi_{ij} = 0,$

$\sigma^2 = \frac{16}{(\Pi_c + \Pi_d)^4} \sum_i \sum_j \pi_{ij}\left[\Pi_d \pi_{ij}^{(c)} - \Pi_c \pi_{ij}^{(d)}\right]^2.$
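As a numerical check on these formulas (not part of the exercise), the sketch below (Python; the probability table is an arbitrary illustrative example) computes $\Pi_c$, $\Pi_d$, $\hat\gamma$, and $\sigma^2$ from the definitions in Problem 3.27 and confirms the identity $\sum_i\sum_j \pi_{ij}\phi_{ij} = 0$:

```python
def gamma_components(pi):
    """Compute Pi_c, Pi_d, gamma, sigma^2, and the identity check
    sum_ij pi_ij * phi_ij from the formulas of Problem 3.27 for a
    probability table pi (list of lists summing to 1)."""
    I, J = len(pi), len(pi[0])
    def pc(i, j):  # pi_ij^(c): mass concordant with cell (i, j)
        return sum(pi[a][b] for a in range(I) for b in range(J)
                   if (a < i and b < j) or (a > i and b > j))
    def pd(i, j):  # pi_ij^(d): mass discordant with cell (i, j)
        return sum(pi[a][b] for a in range(I) for b in range(J)
                   if (a < i and b > j) or (a > i and b < j))
    Pc = sum(pi[i][j] * pc(i, j) for i in range(I) for j in range(J))
    Pd = sum(pi[i][j] * pd(i, j) for i in range(I) for j in range(J))
    gamma = (Pc - Pd) / (Pc + Pd)
    phi = [[4 * (Pd * pc(i, j) - Pc * pd(i, j)) / (Pc + Pd) ** 2
            for j in range(J)] for i in range(I)]
    sigma2 = (16 / (Pc + Pd) ** 4) * sum(
        pi[i][j] * (Pd * pc(i, j) - Pc * pd(i, j)) ** 2
        for i in range(I) for j in range(J))
    check = sum(pi[i][j] * phi[i][j] for i in range(I) for j in range(J))
    return Pc, Pd, gamma, sigma2, check

# An arbitrary 3 x 3 probability table with positive association.
pi = [[0.20, 0.10, 0.05],
      [0.10, 0.20, 0.10],
      [0.05, 0.10, 0.10]]
Pc, Pd, gamma, sigma2, check = gamma_components(pi)
print(abs(check) < 1e-12)  # True: the identity holds
```

The identity follows immediately from the definitions, since $\sum_{ij}\pi_{ij}(\Pi_d \pi_{ij}^{(c)} - \Pi_c \pi_{ij}^{(d)}) = \Pi_d\Pi_c - \Pi_c\Pi_d = 0$; the code verifies it to floating-point precision.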
111 PROBLEMS 3.28 An I = J table has ordered columns and unordered rows. Ridits ŽBross 1958. are data-based column scores. The jth sample ridit is the average cumulative proportion within category j, jy1 ž / 1 p . ˆr j s Ý pqk q 2 qj ks1 The sample mean ridit in row i is Rˆi s Ý j ˆ r j pj < i . Show that Ý j pqj ˆ rj s 0.50 and Ý i piq Rˆi s 0.50. wFor ridit analyses, see Agresti Ž1984, Secs. 9.3 and 10.2., Bross Ž1958., Fleiss Ž1981, Sec. 9.4., and Landis et al. Ž1978..x 3.29 Show that X 2 s nÝÝŽ pi j y piq pqj . 2rpiq pqj . Thus, X 2 can be large when n is large, regardless of whether the association is practically important. Explain why this test, like other tests, simply indicates the degree of evidence against H0 and does not describe strength of association. Ž‘‘Like fire, the chi-square test is an excellent servant and a bad master,’’ Sir Austin Bradford Hill, Proc. Roy. Soc. Med. 58: 295᎐300, 1965.. 3.30 For testing H0 : ␲ 1 s ␲ 2 using independent binomial variates y 1 and y 2 with n1 and n 2 trials, the score statistic is zs ␲ ˆ 1 y ␲ˆ 2 '␲ˆ Ž 1 y ␲ˆ . Ž 1rn 1 q 1rn 2 . , where ␲ ˆ s Ž y 1 q y 2 .rŽ n1 q n 2 . is the pooled estimate of ␲ 1 s ␲ 2 under H0 . Show that z 2 s X 2 . 3.31 For a 2 = 2 table, consider H0 : ␲ 11 s ␪ 2 , ␲ 12 s ␲ 21 s ␪ Ž1 y ␪ ., ␲ 22 s Ž1 y ␪ . 2 . a. Show that the marginal distributions are identical and that independence holds. b. For a multinomial sample, under H0 show that ␪ˆs Ž p1qq pq1 .r2. c. Explain how to test H0 . Show that df s 2 for the test statistic. d. Refer to Problem 3.3. Are Larry Bird’s pairs of free throws plausibly independent and identically distributed? 3.32 For a 2 = 2 table, show that: a. The four Pearson residuals may take different values. 112 INFERENCE FOR CONTINGENCY TABLES b. All four standardized Pearson residuals have the same absolute value. ŽThis is sensible, since df s 1.. c. The square of each standardized Pearson residual equals X 2 . w Note: X 2 s nŽ n11 n 22 y n12 n 21 . 
²/(n_{1+} n_{2+} n_{+1} n_{+2}) for 2 × 2 tables. See Mirkin (2001) for alternative X² formulas for I × J tables.]

3.33 For testing independence, show that X² ≤ n min(I − 1, J − 1). Hence V² = X²/[n min(I − 1, J − 1)] falls between 0 and 1 (Cramér 1946). For 2 × 2 tables, X²/n is often called phi-squared; it equals Goodman and Kruskal's tau (Problem 2.38). Other measures based on X² include the contingency coefficient [X²/(X² + n)]^{1/2} (Pearson 1904).

3.34 For counts {n_i}, the power divergence statistic for testing goodness of fit (Cressie and Read 1984; Read and Cressie 1988) is

[2/λ(λ + 1)] Σ n_i [(n_i/μ̂_i)^λ − 1]  for −∞ < λ < ∞.

a. For λ = 1, show that this equals X².
b. As λ → 0, show that it converges to G². [Hint: log t = lim_{h→0} (t^h − 1)/h.]
c. As λ → −1, show that it converges to 2Σ μ̂_i log(μ̂_i/n_i), the minimum discrimination information statistic (Gokhale and Kullback 1978).
d. For λ = −2, show that it equals Σ(n_i − μ̂_i)²/n_i, the Neyman modified chi-squared statistic (Neyman 1949).
e. For λ = −1/2, show that it equals 4Σ(√n_i − √μ̂_i)², the Freeman–Tukey statistic (Freeman and Tukey 1950).
[Under regularity conditions, their asymptotic distributions are identical (see Drost et al. 1989). The chi-squared null approximation works best for λ near 2/3.]

3.35 Use a partitioning argument to explain why G² for testing independence cannot increase after combining two rows (or two columns) of a contingency table. [Hint: Argue that G² for full table = G² for collapsed table + G² for table of the two rows that are combined in the collapsed table.]

3.36 Motivate partitioning (3.14) by showing that the multiple hypergeometric distribution (3.19) for {n_ij} factors as the product of hypergeometric distributions for the separate component tables (Lancaster 1949).

3.37 Explain why {n_{+j}} are sufficient for {π_{+j}} in (3.17).

3.38 Assume independence, and let p_ij = n_ij/n and π̂_ij = p_{i+} p_{+j}.
a. Show that p_ij and π̂_ij are unbiased for π_ij = π_{i+} π_{+j}.
b. Show that var(p_ij) = π_{i+} π_{+j}(1 − π_{i+} π_{+j})/n.
c. Using E(p_{i+} p_{+j})² = E(p_{i+}²) E(p_{+j}²) and E(p_{i+}²) = var(p_{i+}) + [E(p_{i+})]², show that

var(π̂_ij) = { π_{i+} π_{+j} [π_{i+}(1 − π_{+j}) + π_{+j}(1 − π_{i+})] n + π_{i+}(1 − π_{i+}) π_{+j}(1 − π_{+j}) } / n².

d. As n → ∞, show that lim var(√n π̂_ij) ≤ lim var(√n p_ij), with equality only if π_ij = 1 or 0. Hence, if the model holds or if it nearly holds, the model estimator is better than the sample proportion.

3.39 Show that the sample value of the uncertainty coefficient (2.13) satisfies Û = −G²/[2n(Σ_j p_{+j} log p_{+j})]. [Haberman (1982) gave its standard error.]

3.40 When a test statistic has a continuous distribution, the P-value has a null uniform distribution, P(P-value ≤ α) = α for 0 < α < 1. For Fisher's exact test, explain why under the null, P(P-value ≤ α) ≤ α for 0 < α < 1. [Hint: P(P-value ≤ α) = E[P(P-value ≤ α | n_{1+}, n_{+1}, n)].]

3.41 Refer to Note 3.3 about moments of the hypergeometric distribution (3.16). Letting ρ = n_{+1}/n, show that n_11 has the same mean as a binomial random variable for n_{1+} trials with success probability ρ, and that it has its variance multiplied by a finite population correction factor (n − n_{1+})/(n − 1). (The hypergeometric is similar to the binomial when n_{1+} is small compared to n.)

3.42 A contingency table for two independent binomial variables has counts (3, 0 / 0, 3) by row. For H_0: π_1 = π_2 and H_a: π_1 > π_2, show that the P-value equals 1/64 for the exact unconditional test and 1/20 for Fisher's exact test. [For discussion of this example, see Little (1989), G. Barnard's remarks at the end of Yates (1984), and Sprott (2000, Sec. 6.4.4).]

3.43 Refer to Problem 3.42 and exact tests using X² with H_a: π_1 ≠ π_2.
Explain why the unconditional P-value, evaluated at π = 0.5, is related to Fisher conditional P-values for various tables by

P(X² ≥ 6) = Σ_{k=0}^{6} P(X² ≥ 6 | n_{+1} = k) P(n_{+1} = k).

Thus, the unconditional P-value of 1/32 is a weighted average of the Fisher P-value for the observed column margins and P-values of 0 corresponding to the impossibility of getting results as extreme as observed if other margins had occurred [i.e., 1/32 = 0.10 × C(6, 3)(1/2)⁶]. The Fisher quote in Section 3.5.6 gave his view about this.

3.44 Consider exact tests of independence, given the marginals, for the I × I table having n_ii = 1 for i = 1, …, I, and n_ij = 0 otherwise. Show that (a) tests that order tables by their probabilities, X², or G² have P-value = 1.0, and (b) the one-sided test that orders tables by an ordinal statistic such as r or C − D has P-value = 1/I!.

3.45 A Monte Carlo scheme randomly samples M separate I × J tables having the observed margins to approximate P_o = P(X² ≥ X_o²) for an exact test. Let P̂ be the sample proportion of the M tables with X² ≥ X_o². Show that P(|P̂ − P_o| ≤ B) = 1 − α requires that M ≈ z²_{α/2} P_o(1 − P_o)/B².

3.46 Show that the conditional ML estimate of θ satisfies n_11 = E(n_11) for distribution (3.18).

CHAPTER 4

Introduction to Generalized Linear Models

In Chapters 2 and 3 we focused on methods for two-way contingency tables. Most studies, however, have several explanatory variables, and they may be continuous as well as categorical. The goal is usually to describe their effects on response variables. Modeling the effects helps us do this efficiently. A good-fitting model evaluates effects, includes relevant interactions, and provides smoothed estimates of response probabilities. The rest of the book focuses on model building for categorical response variables.
In this chapter we introduce a family of generalized linear models that contains the most important models for categorical responses as well as standard models for continuous responses. Section 4.1 covers three components common to all generalized linear models. Section 4.2 illustrates with models for binary responses. The most important case is logistic regression, a linear model for the logit transformation of a binomial parameter. In Chapters 5 through 7 we study these models in detail. In Section 4.3 we present generalized linear models for counts. A Poisson regression model called a loglinear model is a linear model for the log of a Poisson mean. In Chapters 8 and 9 we study them for modeling counts in contingency tables.

Sections 4.4 through 4.8 are more technical. Readers wanting mainly an overview of methods can skip them or read them lightly. For generalized linear models, Section 4.4 covers likelihood equations and the asymptotic covariance matrix of ML model parameter estimates, and Section 4.5 summarizes inferential methods. Methods of solving the likelihood equations are presented in Section 4.6. In the final two sections we introduce generalizations, quasi-likelihood and generalized additive models, that further extend the scope of models.

4.1 GENERALIZED LINEAR MODEL

Generalized linear models (GLMs) extend ordinary regression models to encompass nonnormal response distributions and modeling functions of the mean. Three components specify a generalized linear model: A random component identifies the response variable Y and its probability distribution; a systematic component specifies explanatory variables used in a linear predictor function; and a link function specifies the function of E(Y) that the model equates to the systematic component. Nelder and Wedderburn (1972) introduced the class of GLMs, although many models in the class were well established by then.
4.1.1 Components of Generalized Linear Models

The random component of a GLM consists of a response variable Y with independent observations (y_1, …, y_N) from a distribution in the natural exponential family. This family has probability density function or mass function of form

f(y_i; θ_i) = a(θ_i) b(y_i) exp[y_i Q(θ_i)].  (4.1)

Several important distributions are special cases, including the Poisson and binomial. The value of the parameter θ_i may vary for i = 1, …, N, depending on values of explanatory variables. The term Q(θ) is called the natural parameter. In Section 4.4 we present a more general formula that also has a dispersion parameter, but (4.1) is sufficient for basic discrete data models.

The systematic component of a GLM relates a vector (η_1, …, η_N) to the explanatory variables through a linear model. Let x_ij denote the value of predictor j (j = 1, 2, …, p) for subject i. Then

η_i = Σ_j β_j x_ij,  i = 1, …, N.

This linear combination of explanatory variables is called the linear predictor. Usually, one x_ij = 1 for all i, for the coefficient of an intercept (often denoted by α) in the model.

The third component of a GLM is a link function that connects the random and systematic components. Let μ_i = E(Y_i), i = 1, …, N. The model links μ_i to η_i by η_i = g(μ_i), where the link function g is a monotonic, differentiable function. Thus, g links E(Y_i) to explanatory variables through the formula

g(μ_i) = Σ_j β_j x_ij,  i = 1, …, N.  (4.2)

The link function g(μ) = μ, called the identity link, has η_i = μ_i. It specifies a linear model for the mean itself. This is the link function for ordinary regression with normally distributed Y. The link function that transforms the mean to the natural parameter is called the canonical link. For it, g(μ_i) = Q(θ_i), and Q(θ_i) = Σ_j β_j x_ij. The following subsections show examples.
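The interplay of the systematic component and the link can be shown numerically. The following sketch uses an illustrative design matrix and coefficients (not fitted to any data) to evaluate the linear predictor and then invert three common links:

```python
import math

# Hypothetical design: an intercept column plus one predictor, N = 3 subjects.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
beta = [0.5, -0.2]  # illustrative coefficients, not estimates from the text

# Systematic component: linear predictor eta_i = sum_j beta_j * x_ij.
eta = [sum(b * x for b, x in zip(beta, row)) for row in X]

# Link component: the model sets g(mu_i) = eta_i, so mu_i = g^{-1}(eta_i).
mu_identity = list(eta)                                  # g(mu) = mu (normal)
mu_log = [math.exp(e) for e in eta]                      # g(mu) = log(mu) (Poisson)
mu_logit = [math.exp(e) / (1 + math.exp(e)) for e in eta]  # g(mu) = logit(mu) (binomial)
```

Note how the log link guarantees positive means and the logit link guarantees means in (0, 1), whatever real value the linear predictor takes.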
In summary, a GLM is a linear model for a transformed mean of a response variable that has distribution in the natural exponential family. We now illustrate the three components by introducing the key GLMs for discrete response variables.

4.1.2 Binomial Logit Models for Binary Data

Many response variables are binary. Represent the success and failure outcomes by 1 and 0. The Bernoulli distribution for this Bernoulli trial specifies probabilities P(Y = 1) = π and P(Y = 0) = 1 − π, for which E(Y) = π. This is the special case of the binomial (1.1) with n = 1. The probability mass function is

f(y; π) = π^y (1 − π)^{1−y} = (1 − π)[π/(1 − π)]^y = (1 − π) exp[y log(π/(1 − π))]  (4.3)

for y = 0 and 1. This is in the natural exponential family (4.1), identifying θ with π, a(π) = 1 − π, b(y) = 1, and Q(π) = log[π/(1 − π)]. The natural parameter log[π/(1 − π)] is the log odds of response 1, the logit of π. This is the canonical link. GLMs using the logit link are often called logit models.

4.1.3 Poisson Loglinear Models for Count Data

Some response variables have counts as their possible outcomes. For a sample of silicon wafers used in manufacturing computer chips, each observation might be the number of imperfections on a wafer. Counts also occur as entries in contingency tables. The simplest distribution for count data is the Poisson. Like counts, Poisson variates can take any nonnegative integer value.

Let Y denote a count and let μ = E(Y). The Poisson probability mass function (1.4) for Y is

f(y; μ) = e^{−μ} μ^y / y! = exp(−μ)(1/y!) exp(y log μ),  y = 0, 1, 2, … .

This has natural exponential form (4.1) with θ = μ, a(μ) = exp(−μ), b(y) = 1/y!, and Q(μ) = log μ.
The natural parameter is log μ, so the canonical link function is the log link, η = log μ. The model using this link is

log μ_i = Σ_j β_j x_ij,  i = 1, …, N.  (4.4)

This model is called a Poisson loglinear model.

TABLE 4.1 Types of Generalized Linear Models for Statistical Analysis

Random Component   Link               Systematic Component   Model                    Chapters
Normal             Identity           Continuous             Regression
Normal             Identity           Categorical            Analysis of variance
Normal             Identity           Mixed                  Analysis of covariance
Binomial           Logit              Mixed                  Logistic regression      5 and 6
Poisson            Log                Mixed                  Loglinear                8 and 9
Multinomial        Generalized logit  Mixed                  Multinomial response     7

4.1.4 Generalized Linear Models for Continuous Responses

The class of GLMs also includes models for continuous responses. The normal distribution is in a natural exponential family that includes dispersion parameters. Its natural parameter is the mean. Therefore, an ordinary regression model for E(Y) is a GLM using the identity link. Table 4.1 lists this and other standard models for a normal random component. The table also lists GLMs for discrete responses that are presented in the next six chapters.

A traditional way to analyze data transforms Y so that it has approximately a normal distribution with constant variance; then, ordinary least-squares regression is applicable. With GLMs, by contrast, the choice of link function is separate from the choice of random component. If a link is useful in the sense that a linear model for the predictors is plausible for that link, it is not necessary that it also stabilizes variance or produces normality. This is because the fitting process maximizes the likelihood for the choice of distribution for Y, and that choice is not restricted to normality.

4.1.5 Deviance

For a particular GLM for observations y = (y_1, …, y_N), let L(μ; y) denote the log-likelihood function expressed in terms of the means μ = (μ_1, …, μ_N). Let L(μ̂; y)
denote the maximum of the log likelihood for the model. Considered for all possible models, the maximum achievable log likelihood is L(y; y). This occurs for the most general model, having a separate parameter for each observation and the perfect fit μ̂ = y. Such a model is called the saturated model. This model is not useful, since it does not provide data reduction. However, it serves as a baseline for comparison with other model fits.

The deviance of a Poisson or binomial GLM is defined to be

−2[L(μ̂; y) − L(y; y)].

This is the likelihood-ratio statistic for testing the null hypothesis that the model holds against the general alternative (i.e., the saturated model). For some Poisson and binomial GLMs, the number of observations N stays fixed as the individual counts increase in size. Then the deviance has a chi-squared asymptotic null distribution. The df = N − p, where p is the number of model parameters; that is, df equals the difference between the numbers of parameters in the saturated and unsaturated models. The deviance then provides a test of model fit.

An example is binomial counts at N fixed settings of predictors when the number of trials at each setting increases. Let Y_i be bin(n_i, π_i), i = 1, …, N. Consider the simple model of homogeneity, π_i = α for all i. It has p = 1 parameter. The saturated model makes no assumption about {π_i}, letting them be any N values between 0 and 1.0. It has N parameters. The deviance for the homogeneity model has df = N − 1. In fact, it equals the G² likelihood-ratio statistic (3.11) for testing independence in the N × 2 table that these samples form. Under independence, it has approximately a chi-squared distribution as the {n_i} increase, for fixed N.

We use the deviance throughout the book for model checking and for inferential comparisons of models. Components of the deviance are residual measures of lack of fit.
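The identity between the deviance of the binomial homogeneity model and G² for the corresponding N × 2 table is easy to check numerically. The counts below are illustrative, not data from the text:

```python
from math import log

def xlogy(a, b):
    """a * log(a/b), with the convention 0 * log(0) = 0."""
    return 0.0 if a == 0 else a * log(a / b)

# Illustrative binomial counts: y_i successes in n_i trials at N = 3 settings.
y = [24, 35, 21]
n = [1379, 638, 213]

# Homogeneity model pi_i = alpha: the ML estimate is the pooled proportion.
pi_hat = sum(y) / sum(n)

# Deviance of the homogeneity model (df = N - 1): -2[L(mu-hat; y) - L(y; y)].
deviance = 2 * sum(xlogy(yi, ni * pi_hat) + xlogy(ni - yi, ni * (1 - pi_hat))
                   for yi, ni in zip(y, n))

# G^2 statistic (3.11) for independence in the N x 2 table of (yes, no) counts.
total, col1 = sum(n), sum(y)
G2 = 2 * sum(xlogy(nij, ni * colsum / total)
             for yi, ni in zip(y, n)
             for nij, colsum in [(yi, col1), (ni - yi, total - col1)])
```

The two quantities agree because the fitted cell counts under homogeneity, n_i π̂, are exactly the independence estimates n_i (Σy)/(Σn) for the N × 2 table.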
Methods for analyzing the deviance generalize analysis of variance methods for normal linear models.

4.1.6 Advantages of the GLM Formulation

GLMs provide a unified theory of modeling that encompasses the most important models for continuous and discrete variables. Models studied in this text are GLMs with binomial or Poisson random component, or multivariate extensions of GLMs. The ML parameter estimates are computed with an algorithm, presented in Section 4.6, that iteratively uses a weighted version of least squares. The reason for restricting GLMs to the exponential family of distributions for Y is that the same algorithm applies to this entire family, for any choice of link function. Most statistical software has the facility to fit GLMs. Appendix A gives details.

4.2 GENERALIZED LINEAR MODELS FOR BINARY DATA

Let Y denote a binary response variable. For instance, Y might indicate vote in a British election (Labour, Conservative), choice of automobile (domestic, import), or diagnosis of breast cancer (present, absent). Each observation has one of two outcomes, denoted by 0 and 1, binomial for a single trial. The mean E(Y) = P(Y = 1). We denote P(Y = 1) by π(x), reflecting its dependence on values x = (x_1, …, x_p) of predictors. The variance of Y is

var(Y) = π(x)[1 − π(x)],

the binomial variance for one trial. In introducing GLMs for binary data, for simplicity we use a single explanatory variable.

4.2.1 Linear Probability Model

For a binary response, the regression model

π(x) = α + βx  (4.5)

is called a linear probability model. With independent observations it is a GLM with binomial random component and identity link function. The linear probability model has a major structural defect. Probabilities fall between 0 and 1, but linear functions take values over the entire real line. Model (4.5) has π(x) < 0 and π(x) > 1 for sufficiently large or small x values.
For its extension with multiple predictors, difficulties often occur in fitting this model because during the fitting process, π̂(x) falls outside the [0, 1] range for some subjects' x values. The model can be valid over a restricted range of x values. When it is plausible, an advantage is its simple interpretation: β is the change in π(x) for a one-unit increase in x.

We defer to Section 4.6 the technical details of fitting this and other GLMs. One should assume a binomial distribution for Y and use maximum likelihood (ML) rather than ordinary least squares. Least squares is ML for a normal distribution with constant variance. For binary responses, the constant variance condition that makes least squares estimators optimal (i.e., minimum variance in the class of linear unbiased estimators) is not satisfied. Since var(Y) = π(x)[1 − π(x)], the variance depends on x through its influence on π(x). As π(x) moves toward 0 or 1, the distribution of Y is more nearly concentrated at a single point, and the variance moves toward 0. Because of the nonconstant variance, the binomial ML estimator is more efficient than least squares. Also Y, being binary, is very far from normally distributed. Thus, the usual sampling distributions for the least squares estimators do not apply. The estimates and standard errors for ML and least squares are usually similar, however, when π̂(x) for the sample x values falls in the range within which the variance is relatively stable (about 0.3 to 0.7).

TABLE 4.2 Relationship between Snoring and Heart Disease

                           Heart Disease
Snoring               Yes     No    Proportion Yes   Linear Fit^a   Logit Fit^a
Never                  24    1355       0.017           0.017          0.021
Occasionally           35     603       0.055           0.057          0.044
Nearly every night     21     192       0.099           0.096          0.093
Every night            30     224       0.118           0.116          0.132

^a Model fits refer to proportion of yes responses.
Source: P. G. Norton and E. V. Dunn, British Med. J. 291: 630–632 (1985), BMJ Publishing Group.
4.2.2 Snoring and Heart Disease Example

We illustrate the linear probability model with Table 4.2, from an epidemiological survey of 2484 subjects to investigate snoring as a risk factor for heart disease. Those surveyed were classified according to their spouses' report of how much they snored. The model states that the probability of heart disease is linearly related to the level of snoring x. We treat the rows of the table as independent binomial samples. No obvious choice of scores exists for categories of x. We used (0, 2, 4, 5), treating the last two levels as closer than the other adjacent pairs (Problem 4.4 uses equally spaced scores). ML estimates and standard errors are the same if we use a data file of 2484 binary observations or if we enter the four binomial totals of yes and no responses listed in Table 4.2. Software (see, e.g., Table A.3 for SAS) reports the ML fit, π̂(x) = 0.0172 + 0.0198x, with a standard error SE = 0.0028 for β̂ = 0.0198. For nonsnorers (x = 0), the estimated proportion of subjects having heart disease is 0.0172.

We refer to the estimated values of E(Y) for a GLM as fitted values. Table 4.2 shows the sample proportions and the fitted values for this model. Figure 4.1 graphs the sample and fitted values. The table and graph suggest that the model fits well. (In Section 5.2.3 we discuss formal goodness-of-fit analyses for binary-response GLMs.) The model interpretation is simple. The estimated probability of heart disease is about 0.02 for nonsnorers; it increases 2(0.0198) = 0.04 for occasional snorers, another 0.04 for those who snore nearly every night, and another 0.02 for those who always snore.

4.2.3 Logistic Regression Model

Usually, binary data result from a nonlinear relationship between π(x) and x. A fixed change in x often has less impact when π(x) is near 0 or 1 than when π(x) is near 0.5. In the purchase of an automobile, consider the choice between buying new or used. Let π(x)
denote the probability of selecting new when annual family income = x. An increase of $50,000 in annual income would have less effect when x = $1,000,000 [for which π(x) is near 1] than when x = $50,000. In practice, nonlinear relationships between π(x) and x are often monotonic, with π(x) increasing continuously or π(x) decreasing continuously as x increases. The S-shaped curves in Figure 4.2 are typical. The most important curve with this shape has the model formula

π(x) = exp(α + βx) / [1 + exp(α + βx)].  (4.6)

This is the logistic regression model. As x → ∞, π(x) ↓ 0 when β < 0 and π(x) ↑ 1 when β > 0.

Let's find the link function for which logistic regression is a GLM. For (4.6) the odds are

π(x) / [1 − π(x)] = exp(α + βx).

The log odds has the linear relationship

log{π(x) / [1 − π(x)]} = α + βx.  (4.7)

FIGURE 4.1 Predicted probabilities for linear probability and logistic regression models.

FIGURE 4.2 Logistic regression functions.

Thus, the appropriate link is the log odds transformation, the logit. Logistic regression models are GLMs with binomial random component and logit link function. Logistic regression models are also called logit models. The logit is the natural parameter of the binomial distribution, so the logit link is its canonical link.

Whereas π(x) must fall in the (0, 1) range, the logit can be any real number. The real numbers are also the range for linear predictors (such as α + βx) that form the systematic component of a GLM. So this model does not have the structural problem that the linear probability model has.

For the snoring data in Table 4.2, software reports the logistic regression ML fit

logit π̂(x) = −3.87 + 0.40x.

The positive β̂ = 0.40 reflects the increased incidence of heart disease at higher snoring levels.
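This logit fit can be reproduced with a few Newton–Raphson steps applied to the grouped binomial log likelihood (for the canonical logit link, Newton–Raphson coincides with the Fisher scoring algorithm of Section 4.6). The routine below is a sketch, not the software route the text uses:

```python
import numpy as np

# Snoring data from Table 4.2, with the scores x = (0, 2, 4, 5) used in the text.
x = np.array([0.0, 2.0, 4.0, 5.0])
yes = np.array([24.0, 35.0, 21.0, 30.0])        # heart disease cases
n = np.array([1379.0, 638.0, 213.0, 254.0])     # binomial totals (yes + no)

X = np.column_stack([np.ones_like(x), x])       # intercept and snoring score
beta = np.zeros(2)
for _ in range(25):
    pi = 1 / (1 + np.exp(-(X @ beta)))          # current fitted probabilities
    score = X.T @ (yes - n * pi)                # gradient of the log likelihood
    W = n * pi * (1 - pi)                       # binomial variances (weights)
    info = X.T @ (W[:, None] * X)               # Fisher information matrix
    beta = beta + np.linalg.solve(info, score)  # Newton-Raphson update
```

Starting from beta = 0, the iterations converge quickly to the reported fit, roughly (−3.87, 0.40).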
In Chapters 5 and 6 we study logistic regression in detail and interpret such equations. Estimated probabilities result from substituting x values into the estimated probability formula (4.6). Table 4.2 also reports these fitted values. Figure 4.1 displays the fit. The fit is close to linear over this narrow range of estimated probabilities, and results are similar to those for the linear probability model.

4.2.4 Binomial GLM for 2 × 2 Contingency Tables

Among the simplest GLMs for a binary response is the one having a single explanatory variable X that is also binary. Label its values by 0 and 1. For a given link function, the GLM

link[π(x)] = α + βx

has the effect of X described by

β = link[π(1)] − link[π(0)].

For the identity link, β = π(1) − π(0) is the difference between proportions. For the log link, β = log[π(1)] − log[π(0)] = log[π(1)/π(0)] is the log relative risk. For the logit link,

β = logit[π(1)] − logit[π(0)] = log{π(1)/[1 − π(1)]} − log{π(0)/[1 − π(0)]} = log( {π(1)/[1 − π(1)]} / {π(0)/[1 − π(0)]} )

is the log odds ratio. Measures of association for 2 × 2 tables are effect parameters in GLMs for binary data.

4.2.5 Probit and Inverse CDF Link Functions*

A monotone regression curve such as the first one in Figure 4.2 has the shape of a cumulative distribution function (cdf) for a continuous random variable. This suggests a model for a binary response having form π(x) = F(x) for some cdf F. Using an entire class of location-scale cdf's, such as normal cdf's with their variety of means and variances, permits the curve π(x) = F(x) to have flexibility in the rate of increase and in the location where most of that increase occurs. Let Φ(·) denote the standard cdf of the class, such as the N(0, 1) cdf. Using Φ but writing the model as

π(x) = Φ(α + βx)  (4.8)

provides the same flexibility. Shapes of different cdf's in the class occur as α and β vary.
Replacing x by βx permits the curve to increase at a different rate than the standard cdf (or even to decrease if β < 0); varying α moves the curve to the left or right. When Φ is strictly increasing over the entire real line, its inverse function Φ⁻¹(·) exists and (4.8) is, equivalently,

Φ⁻¹[π(x)] = α + βx.  (4.9)
In Chapters 8 and 9 we present Poisson GLMs for counts in contingency tables with categorical response variables. In this section we introduce Poisson GLMs using an alternative application: modeling count or rate data for a single discrete response variable. 4.3.1 Poisson Loglinear Models The Poisson distribution has a positive mean ␮. Although a GLM can model a positive mean using the identity link, it is more common to model the log of the mean. Like the linear predictor ␣ q ␤ x, the log mean can take any real value. The log mean is the natural parameter for the Poisson distribution, and the log link is the canonical link for a Poisson GLM. A Poisson loglinear GLM assumes a Poisson distribution for Y and uses the log link. The Poisson loglinear model with explanatory variable X is log ␮ s ␣ q ␤ x . Ž 4.10 . 126 INTRODUCTION TO GENERALIZED LINEAR MODELS For this model, the mean satisfies the exponential relationship x ␮ s exp Ž ␣ q ␤ x . s e ␣ Ž e ␤ . . Ž 4.11 . A 1-unit increase in x has a multiplicative impact of e ␤ on ␮ : The mean at x q 1 equals the mean at x multiplied by e ␤. 4.3.2 Horseshoe Crab Mating Example We illustrate Poisson GLMs for Table 4.3 from a study of nesting horseshoe crabs. Each female horseshoe crab had a male crab resident in her nest. The study investigated factors affecting whether the female crab had any other males, called satellites, residing nearby. Explanatory variables are the female crab’s color, spine condition, weight, and carapace width. The response outcome for each female crab is her number of satellites. For now, we use width alone as a predictor. Table 4.3 lists width in centimeters. The sample mean width equals 26.3 and the standard deviation equals 2.1. Figure 4.3 plots the response counts of satellites against width, with numbered symbols indicating the number of observations at each point. The substantial variability makes it difficult to discern a clear trend. 
To get a clearer picture, we grouped the female crabs into width categories (≤ 23.25, 23.25–24.25, 24.25–25.25, 25.25–26.25, 26.25–27.25, 27.25–28.25, 28.25–29.25, > 29.25) and calculated the sample mean number of satellites for female crabs in each category. Figure 4.4 plots these sample means against the sample mean width for crabs in each category.

More sophisticated ways of portraying the trend smooth the data without grouping the width values or assuming a particular functional relationship. Figure 4.4 also shows a smoothed curve based on an extension of the GLM introduced in Section 4.8. The sample means and the smoothed curve both show a strong increasing trend. (The means tend to fall above the curve, since the response counts in a category tend to be skewed to the right; the smoothed curve is less susceptible to outlying observations.) The trend seems approximately linear, and we discuss next models for the ungrouped data for which the mean or the log of the mean is linear in width.

For a female crab, let μ be the expected number of satellites and x = width. From GLM software (e.g., for SAS, see Table A.4), the ML fit of the Poisson loglinear model (4.10) is

log μ̂ = α̂ + β̂x = −3.305 + 0.164x.

The effect β̂ = 0.164 of width is positive, with SE = 0.020. The model fitted value at any width level is an estimated mean number of satellites μ̂. For instance, the fitted value at the mean width of x = 26.3 is

μ̂ = exp(α̂ + β̂x) = exp[−3.305 + 0.164(26.3)] = 2.74.
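Using the estimates reported above (taken from the text, not refitted here), a short check reproduces the fitted mean and the multiplicative width effect implied by the log link:

```python
from math import exp

# ML estimates for the Poisson loglinear fit reported in the text.
alpha_hat, beta_hat = -3.305, 0.164

def fitted_mean(width):
    """Fitted mean number of satellites at a given carapace width (cm)."""
    return exp(alpha_hat + beta_hat * width)

mu_at_mean_width = fitted_mean(26.3)            # about 2.74 satellites
width_effect = fitted_mean(27.3) / fitted_mean(26.3)  # equals e^beta, about 1.18
```

Because the link is the log, the ratio of fitted means for any two widths 1 cm apart is exactly e^β̂, regardless of where on the width scale the comparison is made.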
GENERALIZED LINEAR MODELS FOR COUNTS

TABLE 4.3 Number of Crab Satellites by Female's Characteristics
[Listing of the observed values of C, S, W, Wt, and Sa for the 173 female crabs; the column-interleaved data values are not reproduced here.]
Key: C, color (1, light medium; 2, medium; 3, dark medium; 4, dark); S, spine condition (1, both good; 2, one worn or broken; 3, both worn or broken); W, carapace width (cm); Wt, weight (kg); Sa, number of satellites.
Source: Data courtesy of Jane Brockmann, Zoology Department, University of Florida; study described in Ethology 102:1–21 (1996).

FIGURE 4.3 Number of satellites by width of female crab.

For this model, $\exp(\hat\beta) = \exp(0.164) = 1.18$ is the multiplicative effect on $\hat\mu$ of a 1-cm increase in $x$. For instance, the fitted value at $x = 27.3 = 26.3 + 1$ is $\exp[-3.305 + 0.164(27.3)] = 3.23$, which equals $1.18 \times 2.74$. A 1-cm increase in width yields an 18% increase in the estimated mean.

Figure 4.4 shows that $E(Y)$ may grow approximately linearly with width. This suggests the Poisson GLM with identity link. It has ML fit
$$\hat\mu = \hat\alpha + \hat\beta x = -11.53 + 0.55x.$$
This model has an additive rather than a multiplicative effect of $X$ on $\mu$: a 1-cm increase in $x$ corresponds to an estimated increase of $\hat\beta = 0.55$ in $\hat\mu$.
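The multiplicative interpretation of the log link is easy to verify directly. A minimal sketch in Python, using the fitted coefficients quoted above (the helper `mu` is our own; the 3.23 figure is from the text):

```python
import math

# Log link: log(mu) = a + b*x, so a one-unit increase in x
# multiplies the fitted mean by exp(b).
a, b = -3.305, 0.164            # fitted Poisson loglinear coefficients

def mu(x):
    return math.exp(a + b * x)

ratio = mu(27.3) / mu(26.3)      # effect of a 1-cm width increase
print(mu(27.3), ratio, math.exp(b))
```

The ratio matches $\exp(0.164) \approx 1.18$ regardless of the baseline width, which is exactly the multiplicative-effect interpretation of the log link.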
The fitted values are positive at all sampled $x$, and the model describes the effect simply: on the average, about a 2-cm increase in width is associated with an extra satellite.

Figure 4.5 plots $\hat\mu$ against width for the models with log link and identity link. Although they diverge somewhat for relatively small and large widths, they provide similar predictions over the width range in which most observations occur. We now study whether either model fits adequately.

FIGURE 4.4 Smoothings of horseshoe crab counts.

TABLE 4.4 Sample Mean and Variance of Number of Satellites

Width (cm)   | Number of Cases | Number of Satellites | Sample Mean | Sample Variance
< 23.25      | 14 | 14  | 1.00 | 2.77
23.25–24.25  | 14 | 20  | 1.43 | 8.88
24.25–25.25  | 28 | 67  | 2.39 | 6.54
25.25–26.25  | 39 | 105 | 2.69 | 11.38
26.25–27.25  | 22 | 63  | 2.86 | 6.88
27.25–28.25  | 24 | 93  | 3.87 | 8.81
28.25–29.25  | 18 | 71  | 3.94 | 16.88
> 29.25      | 14 | 72  | 5.14 | 8.29

FIGURE 4.5 Estimated mean number of satellites for log and identity links.

4.3.3 Overdispersion for Poisson GLMs

In Section 1.2.4 we noted that count data often show greater variability than the Poisson allows. For the grouped horseshoe crab data, Table 4.4 shows the sample mean and variance of the number of satellites for the female crabs in each width category. The variances are much larger than the means, whereas Poisson distributions have identical mean and variance. The greater variability than predicted by the GLM random component reflects overdispersion.

A common cause of overdispersion is subject heterogeneity. For instance, suppose that width, weight, color, and spine condition are the four predictors that affect a female crab's number of satellites. Suppose that $Y$ has a Poisson distribution at each fixed combination of those predictors, but our model uses width alone as a predictor. Crabs having a certain width are then a mixture of crabs of various weights, colors, and spine conditions.
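A quick way to screen for overdispersion of this kind is to compute within-group sample means and variances, as in Table 4.4. A minimal sketch (the counts below are illustrative, not the crab data):

```python
# Compare within-group means and variances of counts.
# Under a Poisson model the two should be roughly equal;
# variance well above the mean suggests overdispersion.
groups = {
    "narrow": [0, 0, 1, 0, 5, 0, 2, 9],   # illustrative counts
    "wide":   [3, 0, 8, 1, 12, 0, 6, 2],
}

def mean_var(xs):
    n = len(xs)
    m = sum(xs) / n
    v = sum((x - m) ** 2 for x in xs) / (n - 1)  # sample variance
    return m, v

for name, xs in groups.items():
    m, v = mean_var(xs)
    print(f"{name}: mean={m:.2f} variance={v:.2f} ratio={v/m:.2f}")
```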
Thus, the population of crabs having that width is a mixture of several Poisson populations, each having its own mean for the response. This heterogeneity results in an overall response distribution at that width having greater variation than the Poisson predicts. If the variance equals the mean when all relevant variables are controlled, it exceeds the mean when only one of them is controlled.

Overdispersion is not an issue in ordinary regression with normally distributed $Y$, because the normal distribution has a separate parameter (the variance) to describe variability. For binomial and Poisson distributions, however, the variance is a function of the mean. Overdispersion is common in the modeling of counts. When the model for the mean is correct but the true distribution is not Poisson, the ML estimators of model parameters are still consistent, but standard errors are incorrect. We next introduce an extension of the Poisson GLM that has an extra parameter and accounts better for overdispersion. In Section 4.7 we present another approach, quasi-likelihood inference.

4.3.4 Negative Binomial GLMs

The negative binomial distribution has probability mass function
$$f(y; k, \mu) = \frac{\Gamma(y + k)}{\Gamma(k)\,\Gamma(y + 1)} \left(\frac{k}{\mu + k}\right)^{k} \left(1 - \frac{k}{\mu + k}\right)^{y}, \qquad y = 0, 1, 2, \ldots, \qquad (4.12)$$
where $k$ and $\mu$ are parameters. This distribution has
$$E(Y) = \mu, \qquad \mathrm{var}(Y) = \mu + \mu^2/k.$$
The index $k^{-1}$ is called a dispersion parameter. As $k^{-1} \to 0$, $\mathrm{var}(Y) \to \mu$ and the negative binomial distribution converges to the Poisson (Cameron and Trivedi 1998, p. 75). Usually, $k^{-1}$ is unknown; estimating it helps summarize the extent of overdispersion.

For fixed $k$, one can express (4.12) in natural exponential family form (4.1). Then a model with negative binomial random component is a GLM. For simplicity, such models let $k$ be the same constant for all observations but treat it as unknown. As in GLMs for binary data, a variety of link functions are possible.
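As a numerical check on these moment formulas, the pmf (4.12) can be evaluated on the log scale with the standard-library `math.lgamma`; summing over a long grid recovers total probability 1, mean $\mu$, and variance $\mu + \mu^2/k$. This is our own sketch (the cutoff 400 is an arbitrary truncation of the infinite support):

```python
import math

def negbin_pmf(y, k, mu):
    """Negative binomial pmf (4.12), computed on the log scale for stability."""
    logp = (math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
            + k * math.log(k / (mu + k)) + y * math.log(mu / (mu + k)))
    return math.exp(logp)

k, mu = 1.5, 3.0
probs = [negbin_pmf(y, k, mu) for y in range(400)]  # tail beyond 400 is negligible
total = sum(probs)
mean = sum(y * p for y, p in enumerate(probs))
var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
print(total, mean, var, mu + mu ** 2 / k)
```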
Most common is the log link, as in Poisson loglinear models, but sometimes the identity link is adequate. In Section 13.4 we discuss negative binomial GLMs in detail; here we illustrate them for the crab data analyzed above with Poisson GLMs.

With the identity link and width as predictor, the Poisson GLM has $\hat\mu = -11.53 + 0.55x$ (SE $= 0.06$ for $\hat\beta$). For the negative binomial GLM, $\hat\mu = -11.15 + 0.53x$ (SE $= 0.11$). Moreover, $\hat k^{-1} = 0.98$, so at a predicted $\hat\mu$ the estimated variance is roughly $\hat\mu + \hat\mu^2$, compared to $\hat\mu$ for the Poisson GLM. Although the fitted values are similar, the greater SE for $\hat\beta$ and the greater estimated variance in the negative binomial model reflect the overdispersion uncaptured by the Poisson GLM.

4.3.5 Poisson Regression for Rates

When events of a certain type occur over time, space, or some other index of size, it is usually more relevant to model the rate at which they occur than their number. For instance, a study of homicides in a given year for a sample of cities might model the homicide rate, defined for a city as its number of homicides that year divided by its population size. The model might describe how the rate depends on the city's unemployment rate, its residents' median income, and the percentage of residents having completed high school. In Section 9.7 we discuss Poisson regression for modeling rates.

4.3.6 Poisson GLM of Independence in I × J Contingency Tables

One use of Poisson loglinear models is in modeling counts in contingency tables. We illustrate for two-way tables with independent counts $\{Y_{ij}\}$ having Poisson distributions with means $\{\mu_{ij}\}$. Suppose that the $\{\mu_{ij}\}$ satisfy
$$\mu_{ij} = \mu\,\alpha_i\,\beta_j,$$
where $\{\alpha_i\}$ and $\{\beta_j\}$ are positive constants satisfying $\sum_i \alpha_i = \sum_j \beta_j = 1$. This is a multiplicative model, but a linear predictor for a GLM results from the log link,
$$\log\mu_{ij} = \lambda + \alpha_i^* + \beta_j^*, \qquad (4.13)$$
where $\lambda = \log\mu$, $\alpha_i^* = \log\alpha_i$, and $\beta_j^* = \log\beta_j$.
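A small numerical illustration of the multiplicative form $\mu_{ij} = \mu\alpha_i\beta_j$ behind (4.13): for a table whose counts are exactly independent, the fitted values $n_{i+}n_{+j}/n$ reproduce the counts, and every $2 \times 2$ log odds ratio of fitted values is 0, reflecting the absence of an interaction term. The table below is made up for the sketch:

```python
import math

# Independence fitted values for a 2x3 table: mu_ij = row_i * col_j / n.
table = [[20, 10, 5],
         [40, 20, 10]]   # second row is double the first: exact independence
n = sum(sum(row) for row in table)
rows = [sum(r) for r in table]
cols = [sum(table[i][j] for i in range(2)) for j in range(3)]
fit = [[rows[i] * cols[j] / n for j in range(3)] for i in range(2)]

# Additivity of (4.13): any 2x2 log odds ratio of fitted values is 0.
lor = (math.log(fit[0][0]) - math.log(fit[0][1])
       - math.log(fit[1][0]) + math.log(fit[1][1]))
print(fit, lor)
```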
This Poisson loglinear model has additive main effects of the two classifications but no interaction. Since the $\{Y_{ij}\}$ are independent, the total sample size $\sum_i\sum_j Y_{ij}$ has a Poisson distribution with mean $\sum_i\sum_j \mu_{ij} = \mu$. Conditional on $\sum_i\sum_j Y_{ij} = n$, the cell counts have a multinomial distribution with probabilities $\{\pi_{ij} = \mu_{ij}/\mu = \alpha_i\beta_j\}$. Similarly, one can check that conditional on $n$, the row totals $\{Y_{i+}\}$ have a multinomial distribution with probabilities $\{\pi_{i+} = \alpha_i\}$ and the column totals $\{Y_{+j}\}$ have a multinomial distribution with probabilities $\{\pi_{+j} = \beta_j\}$. Conditional on $n$, the model is a multinomial one satisfying $\pi_{ij} = \alpha_i\beta_j = \pi_{i+}\pi_{+j}$: independence of the two classifications. In fact, in Poisson form, independence is the loglinear model (4.13). The inferences conducted in Chapter 3 about independence in two-way contingency tables relate to GLMs, either Poisson loglinear models or corresponding multinomial models that fix $n$ or the row or column totals. In Chapters 8 and 9 we present more complex loglinear models for contingency tables.

4.4 MOMENTS AND LIKELIHOOD FOR GENERALIZED LINEAR MODELS*

Having introduced GLMs for binary and count data, we now turn our attention to details such as likelihood equations and methods for solving them. The remainder of this chapter is somewhat technical, providing general results that apply to most modeling methods presented in subsequent chapters. See McCullagh and Nelder (1989) for further details.

It is helpful to extend the notation for a GLM so that it can handle many distributions that have a second parameter. The random component of the GLM specifies that the $N$ observations $(y_1, \ldots, y_N)$ on $Y$ are independent, with probability mass or density function for $y_i$ of the form
$$f(y_i; \theta_i, \phi) = \exp\{[y_i\theta_i - b(\theta_i)]/a(\phi) + c(y_i, \phi)\}. \qquad (4.14)$$
This is called the exponential dispersion family, and $\phi$ is called the dispersion parameter (Jørgensen 1987). The parameter $\theta_i$ is the natural parameter. When $\phi$ is known, (4.14) simplifies to the form (4.1) for the natural exponential family,
$$f(y_i; \theta_i) = a(\theta_i)\,b(y_i)\exp[y_i Q(\theta_i)].$$
We identify $Q(\theta)$ here with $\theta/a(\phi)$ in (4.14), $a(\theta)$ with $\exp[-b(\theta)/a(\phi)]$ in (4.14), and $b(y)$ with $\exp[c(y, \phi)]$ in (4.14). The more general formula (4.14) is not needed for one-parameter families such as the binomial and Poisson. Usually, $a(\phi)$ has the form $a(\phi) = \phi/\omega_i$ for a known weight $\omega_i$. For instance, when $y_i$ is a mean of $n_i$ independent readings, such as a sample proportion for $n_i$ Bernoulli trials, $\omega_i = n_i$ (Section 4.4.2).

4.4.1 Mean and Variance Functions for the Random Component

General expressions for $E(Y_i)$ and $\mathrm{var}(Y_i)$ use terms in (4.14). Let $L_i = \log f(y_i; \theta_i, \phi)$ denote the contribution of $y_i$ to the log likelihood, so the log-likelihood function is $L = \sum_i L_i$. From (4.14),
$$L_i = [y_i\theta_i - b(\theta_i)]/a(\phi) + c(y_i, \phi). \qquad (4.15)$$
Therefore,
$$\partial L_i/\partial\theta_i = [y_i - b'(\theta_i)]/a(\phi), \qquad \partial^2 L_i/\partial\theta_i^2 = -b''(\theta_i)/a(\phi),$$
where $b'(\theta_i)$ and $b''(\theta_i)$ denote the first two derivatives of $b(\cdot)$ evaluated at $\theta_i$. We now apply the general likelihood results
$$E\!\left(\frac{\partial L}{\partial\theta}\right) = 0 \quad \text{and} \quad -E\!\left(\frac{\partial^2 L}{\partial\theta^2}\right) = E\!\left(\frac{\partial L}{\partial\theta}\right)^{\!2},$$
which hold under regularity conditions satisfied by the exponential family (Cox and Hinkley 1974, Sec. 4.8). From the first formula applied to a single observation, $E[Y_i - b'(\theta_i)]/a(\phi) = 0$, or
$$\mu_i = E(Y_i) = b'(\theta_i). \qquad (4.16)$$
From the second formula,
$$b''(\theta_i)/a(\phi) = E[Y_i - b'(\theta_i)]^2/[a(\phi)]^2 = \mathrm{var}(Y_i)/[a(\phi)]^2,$$
so that
$$\mathrm{var}(Y_i) = b''(\theta_i)\,a(\phi). \qquad (4.17)$$
In summary, the function $b(\cdot)$ in (4.14) determines the moments of $Y_i$.
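These identities are easy to check numerically. For the Poisson, $b(\theta) = \exp(\theta)$ and $a(\phi) = 1$, so both $b'(\theta)$ and $b''(\theta)$ should equal $\mu$. A sketch of our own using central finite differences:

```python
import math

# For the Poisson, b(theta) = exp(theta); check numerically that
# b'(theta) = mu and b''(theta) = mu, matching (4.16)-(4.17) with a(phi) = 1.
def b(theta):
    return math.exp(theta)

mu = 2.7
theta = math.log(mu)   # natural parameter: theta = log(mu)
h = 1e-5
b1 = (b(theta + h) - b(theta - h)) / (2 * h)                 # approximates b'
b2 = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h ** 2   # approximates b''
print(b1, b2)   # both close to mu
```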
4.4.2 Mean and Variance Functions for Poisson and Binomial

We illustrate the mean and variance expressions for Poisson and binomial distributions. When $Y_i$ is Poisson,
$$f(y_i; \mu_i) = \frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!} = \exp(y_i\log\mu_i - \mu_i - \log y_i!) = \exp[y_i\theta_i - \exp(\theta_i) - \log y_i!],$$
where $\theta_i = \log\mu_i$. This has exponential dispersion form (4.14) with $b(\theta_i) = \exp(\theta_i)$, $a(\phi) = 1$, and $c(y_i, \phi) = -\log y_i!$. The natural parameter is $\theta_i = \log\mu_i$. From (4.16) and (4.17),
$$E(Y_i) = b'(\theta_i) = \exp(\theta_i) = \mu_i, \qquad \mathrm{var}(Y_i) = b''(\theta_i) = \exp(\theta_i) = \mu_i.$$

Next, suppose that $n_i Y_i$ has a $\mathrm{bin}(n_i, \pi_i)$ distribution; that is, here $y_i$ is the sample proportion (rather than the number) of successes, so $E(Y_i)$ does not depend on $n_i$. Let $\theta_i = \log[\pi_i/(1 - \pi_i)]$. Then $\pi_i = \exp(\theta_i)/[1 + \exp(\theta_i)]$ and $\log(1 - \pi_i) = -\log[1 + \exp(\theta_i)]$. Extending (4.3), one can show that
$$f(y_i; \pi_i, n_i) = \binom{n_i}{n_i y_i}\pi_i^{\,n_i y_i}(1 - \pi_i)^{\,n_i - n_i y_i} = \exp\!\left\{\frac{y_i\theta_i - \log[1 + \exp(\theta_i)]}{1/n_i} + \log\binom{n_i}{n_i y_i}\right\}. \qquad (4.18)$$
This has exponential dispersion form (4.14) with $b(\theta_i) = \log[1 + \exp(\theta_i)]$, $a(\phi) = 1/n_i$, and $c(y_i, \phi) = \log\binom{n_i}{n_i y_i}$. The natural parameter is the logit, $\theta_i = \log[\pi_i/(1 - \pi_i)]$. From (4.16) and (4.17),
$$E(Y_i) = b'(\theta_i) = \frac{\exp(\theta_i)}{1 + \exp(\theta_i)} = \pi_i, \qquad \mathrm{var}(Y_i) = b''(\theta_i)\,a(\phi) = \frac{\exp(\theta_i)}{[1 + \exp(\theta_i)]^2}\frac{1}{n_i} = \frac{\pi_i(1 - \pi_i)}{n_i}.$$

4.4.3 Systematic Component and Link Function

Let $(x_{i1}, \ldots, x_{ip})$ denote values of explanatory variables for observation $i$. The systematic component of a GLM relates parameters $\{\eta_i\}$ to these variables using a linear predictor
$$\eta_i = \sum_j \beta_j x_{ij}, \qquad i = 1, \ldots, N.$$
In matrix form, $\boldsymbol\eta = \mathbf{X}\boldsymbol\beta$, where $\boldsymbol\eta = (\eta_1, \ldots, \eta_N)'$ and $\boldsymbol\beta = (\beta_1, \ldots, \beta_p)'$ are column vectors of model parameters, and $\mathbf{X}$ is the $N \times p$ matrix of values of the explanatory variables for the $N$ subjects.
In ordinary linear models, $\mathbf{X}$ is called the design matrix. It need not refer to an experimental design, however, and the GLM literature calls it the model matrix.

The GLM links $\eta_i$ to $\mu_i = E(Y_i)$ by a link function $g(\cdot)$. Thus, $\mu_i$ relates to the explanatory variables by
$$\eta_i = g(\mu_i) = \sum_j \beta_j x_{ij}, \qquad i = 1, \ldots, N.$$
The link function $g$ for which $g(\mu_i) = \theta_i$ in (4.14) is the canonical link. For it, the direct relationship
$$\theta_i = \sum_j \beta_j x_{ij}$$
holds between the natural parameter and the linear predictor. Since $\mu_i = b'(\theta_i)$, the natural parameter is the function of the mean $\theta_i = (b')^{-1}(\mu_i)$, where $(b')^{-1}(\cdot)$ denotes the inverse function of $b'$. Thus, the canonical link is the inverse of $b'$. In the Poisson case, for instance, $b(\theta_i) = \exp(\theta_i)$, so $b'(\theta_i) = \exp(\theta_i) = \mu_i$. Thus, $(b')^{-1}(\cdot)$ is the inverse of the exponential function, which is the log function (i.e., $\theta_i = \log\mu_i$): the canonical link is the log link.

4.4.4 Likelihood Equations for a GLM

For $N$ independent observations, from (4.15) the log likelihood is
$$L(\boldsymbol\beta) = \sum_i L_i = \sum_i \log f(y_i; \theta_i, \phi) = \sum_i \frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + \sum_i c(y_i, \phi). \qquad (4.19)$$
The notation $L(\boldsymbol\beta)$ reflects the dependence of $\theta$ on the model parameters $\boldsymbol\beta$. The likelihood equations are
$$\partial L(\boldsymbol\beta)/\partial\beta_j = \sum_i \partial L_i/\partial\beta_j = 0 \quad \text{for all } j.$$
To differentiate the log likelihood (4.19), we use the chain rule,
$$\frac{\partial L_i}{\partial\beta_j} = \frac{\partial L_i}{\partial\theta_i}\,\frac{\partial\theta_i}{\partial\mu_i}\,\frac{\partial\mu_i}{\partial\eta_i}\,\frac{\partial\eta_i}{\partial\beta_j}. \qquad (4.20)$$
Since $\partial L_i/\partial\theta_i = [y_i - b'(\theta_i)]/a(\phi)$, and since $\mu_i = b'(\theta_i)$ and $\mathrm{var}(Y_i) = b''(\theta_i)\,a(\phi)$ from (4.16) and (4.17),
$$\partial L_i/\partial\theta_i = (y_i - \mu_i)/a(\phi), \qquad \partial\mu_i/\partial\theta_i = b''(\theta_i) = \mathrm{var}(Y_i)/a(\phi).$$
Also, since $\eta_i = \sum_j \beta_j x_{ij}$, $\partial\eta_i/\partial\beta_j = x_{ij}$. Finally, since $\eta_i = g(\mu_i)$, $\partial\mu_i/\partial\eta_i$ depends on the link function for the model. In summary, substituting into (4.20) gives
$$\frac{\partial L_i}{\partial\beta_j} = \frac{y_i - \mu_i}{a(\phi)}\,\frac{a(\phi)}{\mathrm{var}(Y_i)}\,\frac{\partial\mu_i}{\partial\eta_i}\,x_{ij} = \frac{(y_i - \mu_i)x_{ij}}{\mathrm{var}(Y_i)}\,\frac{\partial\mu_i}{\partial\eta_i}. \qquad (4.21)$$
The likelihood equations are
$$\sum_{i=1}^{N}\frac{(y_i - \mu_i)x_{ij}}{\mathrm{var}(Y_i)}\,\frac{\partial\mu_i}{\partial\eta_i} = 0, \qquad j = 1, \ldots, p. \qquad (4.22)$$
Although $\boldsymbol\beta$ does not appear explicitly in these equations, it enters through $\mu_i$, since $\mu_i = g^{-1}(\sum_j \beta_j x_{ij})$. Different link functions yield different sets of equations.

Interestingly, the likelihood equations (4.22) depend on the distribution of $Y_i$ only through $\mu_i$ and $\mathrm{var}(Y_i)$. The variance itself depends on the mean through a particular functional form $\mathrm{var}(Y_i) = v(\mu_i)$ for some function $v$, such as $v(\mu_i) = \mu_i$ for the Poisson, $v(\mu_i) = \mu_i(1 - \mu_i)$ for the Bernoulli, and $v(\mu_i) = \sigma^2$ (i.e., constant) for the normal. When $Y_i$ has a distribution in the natural exponential family, the relationship between the mean and the variance characterizes the distribution (Jørgensen 1987). For instance, if $Y_i$ has a distribution in the natural exponential family and $v(\mu_i) = \mu_i$, then necessarily $Y_i$ has the Poisson distribution.

4.4.5 Likelihood Equations for Binomial GLMs

Using the notation of Section 4.4.2, suppose that $n_i Y_i$ has a $\mathrm{bin}(n_i, \pi_i)$ distribution, so that $y_i$ is a sample proportion of successes for $n_i$ trials. The binomial GLM (4.8) for a single predictor extends with several predictors to
$$\pi_i = \Phi\Big(\sum_j \beta_j x_{ij}\Big), \qquad (4.23)$$
where $\Phi$ is the standard cdf of some class of continuous distributions. Since $\pi_i = \mu_i = \Phi(\eta_i)$ with $\eta_i = \sum_j \beta_j x_{ij}$,
$$\partial\mu_i/\partial\eta_i = \phi(\eta_i) = \phi\Big(\sum_j \beta_j x_{ij}\Big),$$
where $\phi(u) = \partial\Phi(u)/\partial u$ (i.e., the probability density function corresponding to the cdf $\Phi$). Since $\mathrm{var}(Y_i) = \pi_i(1 - \pi_i)/n_i$, the likelihood equations (4.22) simplify to
$$\sum_i \frac{n_i(y_i - \pi_i)x_{ij}}{\pi_i(1 - \pi_i)}\,\phi\Big(\sum_j \beta_j x_{ij}\Big) = 0, \qquad (4.24)$$
where $\pi_i = \Phi(\sum_j \beta_j x_{ij})$. These depend on the link function $\Phi^{-1}$ through the derivative of its inverse. For the logit link, $\eta_i = \log[\pi_i/(1 - \pi_i)]$, so $\partial\eta_i/\partial\pi_i = 1/[\pi_i(1 - \pi_i)]$ and $\partial\mu_i/\partial\eta_i = \partial\pi_i/\partial\eta_i = \pi_i(1 - \pi_i)$.
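The simplification this derivative produces can be seen concretely in the special case of a single binary predictor: the logit-link ML fit is then the pair of sample proportions, and the score equations $\sum_i n_i(y_i - \pi_i)x_{ij} = 0$ hold exactly. A sketch with made-up counts:

```python
import math

# Logistic regression with one binary predictor has a closed-form ML fit:
# the fitted probability at each x equals that group's sample proportion.
n = [50, 60]          # trials at x = 0 and x = 1
succ = [10, 30]       # successes
y = [s / m for s, m in zip(succ, n)]   # sample proportions
pi = y[:]             # ML fitted probabilities (two-group saturated fit)

x = [[1, 0], [1, 1]]  # model matrix: intercept and the binary predictor
score = [sum(n[i] * (y[i] - pi[i]) * x[i][j] for i in range(2))
         for j in range(2)]            # should be exactly zero
beta1 = (math.log(pi[1] / (1 - pi[1]))
         - math.log(pi[0] / (1 - pi[0])))   # logit slope = log odds ratio
print(score, beta1)
```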
Then the likelihood equations (4.22) and (4.24) simplify to
$$\sum_i n_i(y_i - \pi_i)x_{ij} = 0, \qquad (4.25)$$
where $\pi_i$ satisfies (4.23) with $\Phi$ the standard logistic cdf.

4.4.6 Asymptotic Covariance Matrix of Model Parameter Estimators

The likelihood function for the GLM also determines the asymptotic covariance matrix of the ML estimator $\hat{\boldsymbol\beta}$. This matrix is the inverse of the information matrix $\mathcal{I}$, which has elements $E[-\partial^2 L(\boldsymbol\beta)/\partial\beta_h\,\partial\beta_j]$. To find it, for the contribution $L_i$ to the log likelihood we use the helpful result
$$E\!\left(\frac{\partial^2 L_i}{\partial\beta_h\,\partial\beta_j}\right) = -E\!\left(\frac{\partial L_i}{\partial\beta_h}\,\frac{\partial L_i}{\partial\beta_j}\right),$$
which holds for exponential families (Cox and Hinkley 1974, Sec. 4.8). Thus, from (4.21),
$$E\!\left(\frac{\partial^2 L_i}{\partial\beta_h\,\partial\beta_j}\right) = -E\!\left[\frac{(Y_i - \mu_i)x_{ih}}{\mathrm{var}(Y_i)}\,\frac{\partial\mu_i}{\partial\eta_i}\cdot\frac{(Y_i - \mu_i)x_{ij}}{\mathrm{var}(Y_i)}\,\frac{\partial\mu_i}{\partial\eta_i}\right] = \frac{-x_{ih}x_{ij}}{\mathrm{var}(Y_i)}\left(\frac{\partial\mu_i}{\partial\eta_i}\right)^{\!2}.$$
Since $L(\boldsymbol\beta) = \sum_i L_i$,
$$E\!\left(-\frac{\partial^2 L(\boldsymbol\beta)}{\partial\beta_h\,\partial\beta_j}\right) = \sum_{i=1}^{N}\frac{x_{ih}x_{ij}}{\mathrm{var}(Y_i)}\left(\frac{\partial\mu_i}{\partial\eta_i}\right)^{\!2}.$$
Generalizing from this typical element to the entire matrix, the information matrix has the form
$$\mathcal{I} = \mathbf{X}'\mathbf{W}\mathbf{X}, \qquad (4.26)$$
where $\mathbf{W}$ is the diagonal matrix with main-diagonal elements
$$w_i = (\partial\mu_i/\partial\eta_i)^2/\mathrm{var}(Y_i). \qquad (4.27)$$
The asymptotic covariance matrix of $\hat{\boldsymbol\beta}$ is estimated by
$$\widehat{\mathrm{cov}}(\hat{\boldsymbol\beta}) = \hat{\mathcal{I}}^{-1} = (\mathbf{X}'\hat{\mathbf{W}}\mathbf{X})^{-1}, \qquad (4.28)$$
where $\hat{\mathbf{W}}$ is $\mathbf{W}$ evaluated at $\hat{\boldsymbol\beta}$. From (4.27), the form of $\mathbf{W}$ also depends on the link function. We see an example for Poisson GLMs next and for binomial GLMs in Section 5.5.

4.4.7 Likelihood Equations and Covariance Matrix for Poisson Loglinear Model

The general Poisson loglinear model (4.4) has the matrix form $\log\boldsymbol\mu = \mathbf{X}\boldsymbol\beta$. For the log link, $\eta_i = \log\mu_i$, so $\mu_i = \exp(\eta_i)$ and $\partial\mu_i/\partial\eta_i = \exp(\eta_i) = \mu_i$. Since $\mathrm{var}(Y_i) = \mu_i$, the likelihood equations (4.22) simplify to
$$\sum_i (y_i - \mu_i)x_{ij} = 0. \qquad (4.29)$$
These equate the sufficient statistics $\sum_i y_i x_{ij}$ for $\boldsymbol\beta$ to their expected values. Also, for this model,
$$w_i = (\partial\mu_i/\partial\eta_i)^2/\mathrm{var}(Y_i) = \mu_i^2/\mu_i = \mu_i.$$
Hence the estimated covariance matrix (4.28) of $\hat{\boldsymbol\beta}$ is $(\mathbf{X}'\hat{\mathbf{W}}\mathbf{X})^{-1}$, where $\hat{\mathbf{W}}$ is the diagonal matrix with the elements of $\hat{\boldsymbol\mu}$ on the main diagonal.

4.5 INFERENCE FOR GENERALIZED LINEAR MODELS

For most GLMs the likelihood equations (4.22) are nonlinear functions of $\boldsymbol\beta$. For now, we put off the details of solving them for the ML estimator $\hat{\boldsymbol\beta}$ and focus instead on using the fit for statistical inference. The Wald, score, and likelihood-ratio methods introduced in Section 1.3.3 for significance testing and interval estimation apply to any GLM. In this section we concentrate on likelihood-ratio inference, through the deviance of the GLM.

4.5.1 Deviance and Goodness of Fit

From Section 4.1.5, the saturated GLM has a separate parameter for each observation and gives a perfect fit. This sounds good, but it is not a helpful model: it does not smooth the data or have the advantages of a simpler model, such as parsimony. Nonetheless, it serves as a baseline for other models, such as for checking model fit. A saturated model explains all variation by the systematic component of the model.

Let $\tilde\theta$ denote the estimate of $\theta$ for the saturated model, corresponding to estimated means $\tilde\mu_i = y_i$ for all $i$. For a particular unsaturated model, denote the corresponding ML estimates by $\hat\theta$ and $\hat\mu_i$. For maximized log likelihood $L(\hat{\boldsymbol\mu}; \mathbf{y})$ for that model and maximized log likelihood $L(\mathbf{y}; \mathbf{y})$ in the saturated case,
$$-2\log\frac{\text{maximum likelihood for model}}{\text{maximum likelihood for saturated model}} = -2[L(\hat{\boldsymbol\mu}; \mathbf{y}) - L(\mathbf{y}; \mathbf{y})]$$
describes lack of fit. It is the likelihood-ratio statistic for testing the null hypothesis that the model holds against the alternative that a more general model holds. From (4.19),
$$-2[L(\hat{\boldsymbol\mu}; \mathbf{y}) - L(\mathbf{y}; \mathbf{y})] = 2\sum_i [y_i\tilde\theta_i - b(\tilde\theta_i)]/a(\phi) - 2\sum_i [y_i\hat\theta_i - b(\hat\theta_i)]/a(\phi).$$
Usually, $a(\phi)$ in (4.14) has the form $a(\phi) = \phi/\omega_i$, and this statistic then equals
$$2\sum_i \omega_i[y_i(\tilde\theta_i - \hat\theta_i) - b(\tilde\theta_i) + b(\hat\theta_i)]/\phi = D(\mathbf{y}; \hat{\boldsymbol\mu})/\phi. \qquad (4.30)$$
The statistic (4.30) is called the scaled deviance, and $D(\mathbf{y}; \hat{\boldsymbol\mu})$ is called the deviance. The greater the scaled deviance, the poorer the fit. For some GLMs the scaled deviance has an approximate chi-squared distribution.

4.5.2 Deviance for Poisson Models

For Poisson GLMs, by Section 4.4.2, $\hat\theta_i = \log\hat\mu_i$ and $b(\hat\theta_i) = \exp(\hat\theta_i) = \hat\mu_i$. Similarly, $\tilde\theta_i = \log y_i$ and $b(\tilde\theta_i) = y_i$ for the saturated model. Also $a(\phi) = 1$, so the deviance and scaled deviance (4.30) equal
$$D(\mathbf{y}; \hat{\boldsymbol\mu}) = 2\sum_i [y_i\log(y_i/\hat\mu_i) - y_i + \hat\mu_i]. \qquad (4.31)$$
When a model with log link contains an intercept term, the likelihood equation (4.29) implied by that parameter is $\sum_i y_i = \sum_i \hat\mu_i$. Then the deviance simplifies to
$$D(\mathbf{y}; \hat{\boldsymbol\mu}) = 2\sum_i y_i\log(y_i/\hat\mu_i). \qquad (4.32)$$
For two-way contingency tables this reduces to the $G^2$ statistic (3.11) of Section 3.2.1, substituting the cell count $n_{ij}$ for $y_i$ and the independence fitted value $\hat\mu_{ij}$ for $\hat\mu_i$. For a Poisson or multinomial model applied to a contingency table with a fixed number of cells $N$, we will see in Section 14.3 that the deviance has an approximate chi-squared distribution for large $\{\mu_i\}$.

4.5.3 Deviance for Binomial Models: Grouped and Ungrouped Data

Now consider binomial GLMs with sample proportions $\{y_i\}$ based on $\{n_i\}$ trials. By Section 4.4.2, $\hat\theta_i = \log[\hat\pi_i/(1 - \hat\pi_i)]$ and $b(\hat\theta_i) = \log[1 + \exp(\hat\theta_i)] = -\log(1 - \hat\pi_i)$. Similarly, $\tilde\theta_i = \log[y_i/(1 - y_i)]$ and $b(\tilde\theta_i) = -\log(1 - y_i)$ for the saturated model. Also, $a(\phi) = 1/n_i$, so $\phi = 1$ and $\omega_i = n_i$. The deviance (4.30) equals
$$2\sum_i n_i\left\{y_i\left[\log\frac{y_i}{1 - y_i} - \log\frac{\hat\pi_i}{1 - \hat\pi_i}\right] + \log(1 - y_i) - \log(1 - \hat\pi_i)\right\}$$
$$= 2\sum_i n_i\left[y_i\log\frac{y_i}{\hat\pi_i} + (1 - y_i)\log\frac{1 - y_i}{1 - \hat\pi_i}\right]$$
$$= 2\sum_i n_i y_i\log\frac{n_i y_i}{n_i\hat\pi_i} + 2\sum_i (n_i - n_i y_i)\log\frac{n_i - n_i y_i}{n_i - n_i\hat\pi_i}.$$
At setting $i$, $n_i y_i$ is the number of successes and $n_i - n_i y_i$ is the number of failures.
Thus, the deviance is a sum over the $2N$ cells of successes and failures ($i = 1, \ldots, N$) and has the same form,
$$D(\mathbf{y}; \hat{\boldsymbol\mu}) = 2\sum \text{observed} \times \log(\text{observed}/\text{fitted}), \qquad (4.33)$$
as the deviance (4.32) for Poisson loglinear models with an intercept term.

With binomial responses, one can construct the data file either as expressed here, with the counts of successes and failures at each setting of the predictors, or with the individual Bernoulli 0–1 observations at the subject level. The deviance differs in the two cases. In the first case the saturated model has a parameter at each setting of the predictors, whereas in the second case it has a parameter for each subject. We refer to these as the grouped-data and ungrouped-data cases. The approximate chi-squared distribution for the deviance occurs for grouped data but not for ungrouped data (see Problems 4.22 and 5.37). With grouped data, the sample size increases for a fixed number of settings of the predictors and hence a fixed number of parameters for the saturated model.

4.5.4 Likelihood-Ratio Model Comparison Using the Deviance

For a Poisson or binomial model $M$, $\phi = 1$, so the deviance (4.30) equals
$$D(\mathbf{y}; \hat{\boldsymbol\mu}) = -2[L(\hat{\boldsymbol\mu}; \mathbf{y}) - L(\mathbf{y}; \mathbf{y})]. \qquad (4.34)$$
Consider two models, $M_0$ with fitted values $\hat{\boldsymbol\mu}_0$ and $M_1$ with fitted values $\hat{\boldsymbol\mu}_1$, with $M_0$ a special case of $M_1$. Model $M_0$ is then said to be nested within $M_1$. Since $M_0$ is simpler than $M_1$, a smaller set of parameter values satisfies $M_0$ than satisfies $M_1$, and maximizing the log likelihood over a smaller space cannot yield a larger maximum. Thus, $L(\hat{\boldsymbol\mu}_0; \mathbf{y}) \le L(\hat{\boldsymbol\mu}_1; \mathbf{y})$, and it follows from (4.34), with the same $L(\mathbf{y}; \mathbf{y})$ for each model, that
$$D(\mathbf{y}; \hat{\boldsymbol\mu}_1) \le D(\mathbf{y}; \hat{\boldsymbol\mu}_0):$$
simpler models have larger deviances. Assuming that model $M_1$ holds, the likelihood-ratio test of the hypothesis that $M_0$ holds uses the test statistic
$$-2[L(\hat{\boldsymbol\mu}_0; \mathbf{y}) - L(\hat{\boldsymbol\mu}_1; \mathbf{y})] = -2[L(\hat{\boldsymbol\mu}_0; \mathbf{y}) - L(\mathbf{y}; \mathbf{y})] - \{-2[L(\hat{\boldsymbol\mu}_1; \mathbf{y}) - L(\mathbf{y}; \mathbf{y})]\} = D(\mathbf{y}; \hat{\boldsymbol\mu}_0) - D(\mathbf{y}; \hat{\boldsymbol\mu}_1).$$
The likelihood-ratio statistic comparing the two models is thus simply the difference between their deviances, $D(\mathbf{y}; \hat{\boldsymbol\mu}_0) - D(\mathbf{y}; \hat{\boldsymbol\mu}_1)$. This statistic is large when $M_0$ fits poorly compared with $M_1$.

In fact, since the part of (4.30) involving the saturated model cancels, the difference between the deviances,
$$D(\mathbf{y}; \hat{\boldsymbol\mu}_0) - D(\mathbf{y}; \hat{\boldsymbol\mu}_1) = 2\sum_i \omega_i[y_i(\hat\theta_{1i} - \hat\theta_{0i}) - b(\hat\theta_{1i}) + b(\hat\theta_{0i})],$$
also has the form of a deviance. Under regularity conditions, this difference has approximately a chi-squared null distribution with df equal to the difference between the numbers of parameters in the two models.

For binomial GLMs and Poisson loglinear GLMs with intercept, from expression (4.33) for the deviance, the difference of deviances uses the observed counts and the two sets of fitted values in the form
$$D(\mathbf{y}; \hat{\boldsymbol\mu}_0) - D(\mathbf{y}; \hat{\boldsymbol\mu}_1) = 2\sum \text{observed} \times \log(\text{fitted}_1/\text{fitted}_0).$$
With binomial responses, the test comparing models does not depend on whether the data file has grouped or ungrouped form. The saturated model differs in the two cases, but its log likelihood cancels when one forms the difference between the deviances.

4.5.5 Residuals for GLMs

When a GLM fits poorly according to an overall goodness-of-fit test, examination of residuals highlights where the fit is poor. One type of residual uses components of the deviance. In (4.30), let $D(\mathbf{y}; \hat{\boldsymbol\mu}) = \sum_i d_i$, where
$$d_i = 2\omega_i[y_i(\tilde\theta_i - \hat\theta_i) - b(\tilde\theta_i) + b(\hat\theta_i)].$$
The deviance residual for observation $i$ is
$$\sqrt{d_i}\times\mathrm{sign}(y_i - \hat\mu_i). \qquad (4.35)$$
An alternative is the Pearson residual,
$$e_i = \frac{y_i - \hat\mu_i}{[\widehat{\mathrm{var}}(Y_i)]^{1/2}}. \qquad (4.36)$$
For instance, for a Poisson GLM, $\mathrm{var}(Y_i) = \mu_i$ and the Pearson residual is $e_i = (y_i - \hat\mu_i)/\sqrt{\hat\mu_i}$. For two-way contingency tables, identifying $y_i$ with the cell count $n_{ij}$ and $\hat\mu_i$ with the independence fitted value $\hat\mu_{ij}$, this has the form (3.12); then $\sum e_{ij}^2 = X^2$, the Pearson $X^2$ statistic.
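Both residual types are simple to compute in the Poisson case. For an independence fit to a two-way table, the squared Pearson residuals sum to $X^2$ and the squared deviance residuals to $G^2$; a sketch with illustrative counts:

```python
import math

# Pearson and deviance residuals for a Poisson (independence) fit.
# Their sums of squares give X^2 and G^2, respectively.
y = [14, 6, 10, 30]           # 2x2 table, read row by row
mu = [8, 12, 16, 24]          # independence fitted values n_i+ * n_+j / n

pearson = [(yi - mi) / math.sqrt(mi) for yi, mi in zip(y, mu)]

def dev_res(yi, mi):
    di = 2 * (yi * math.log(yi / mi) - yi + mi)   # deviance component (4.31)
    return math.copysign(math.sqrt(di), yi - mi)  # signed as in (4.35)

deviance = [dev_res(yi, mi) for yi, mi in zip(y, mu)]
x2 = sum(e * e for e in pearson)
g2 = sum(d * d for d in deviance)
print(x2, g2)
```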
Similarly, the squared deviance residuals sum to $G^2$, the likelihood-ratio statistic for testing independence.

When the model holds, Pearson and deviance residuals are less variable than standard normal because they compare $y_i$ to the fitted means rather than to the true mean (e.g., the denominator of (4.36) estimates $[\mathrm{var}(Y_i)]^{1/2} = [\mathrm{var}(Y_i - \mu_i)]^{1/2}$ rather than $[\mathrm{var}(Y_i - \hat\mu_i)]^{1/2}$). Standardized residuals divide the ordinary residuals by their asymptotic standard errors. For GLMs, the asymptotic covariance matrix of the vector of raw residuals $\{y_i - \hat\mu_i\}$ is
$$\mathrm{cov}(\mathbf{Y} - \hat{\boldsymbol\mu}) = \mathrm{cov}(\mathbf{Y})(\mathbf{I} - \mathbf{Hat}).$$
Here, $\mathbf{I}$ is the identity matrix and $\mathbf{Hat}$ is the hat matrix,
$$\mathbf{Hat} = \mathbf{W}^{1/2}\mathbf{X}(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}^{1/2}, \qquad (4.37)$$
where $\mathbf{W}$ is the diagonal matrix with elements (4.27) (Pregibon 1981). Let $\hat h_i$ denote the estimated diagonal element of $\mathbf{Hat}$ for observation $i$, called its leverage. Then, standardizing $y_i - \hat\mu_i$ by dividing by its estimated SE yields the standardized Pearson residual
$$r_i = \frac{y_i - \hat\mu_i}{[\widehat{\mathrm{var}}(Y_i)(1 - \hat h_i)]^{1/2}} = \frac{e_i}{\sqrt{1 - \hat h_i}}. \qquad (4.38)$$
For Poisson GLMs, for instance, $r_i = (y_i - \hat\mu_i)/\sqrt{\hat\mu_i(1 - \hat h_i)}$. Pierce and Schafer (1986) presented standardized deviance residuals.

In linear models the hat matrix is so named because $\mathbf{Hat}\cdot\mathbf{y}$ projects the data onto the fitted values, $\hat{\boldsymbol\mu} =$ "mu-hat." For GLMs, applying the estimated hat matrix to a linearized approximation of $g(\mathbf{y})$ yields $\hat{\boldsymbol\eta} = g(\hat{\boldsymbol\mu})$, the model's estimated linear predictor values. The greater an observation's leverage, the greater its potential influence on the fit. As in ordinary regression, the leverages fall between 0 and 1 and sum to the number of model parameters. Unlike in ordinary regression, the hat values depend on the fit as well as on the model matrix, and points with extreme predictor values need not have high leverage.

4.6 FITTING GENERALIZED LINEAR MODELS

Finally, we study how to find the ML estimators $\hat{\boldsymbol\beta}$ of GLM parameters.
The likelihood equations (4.22) are usually nonlinear in $\hat{\boldsymbol\beta}$. We describe a general-purpose iterative method for solving nonlinear equations and apply it in two ways to determine the maximum of a likelihood function.

4.6.1 Newton–Raphson Method

The Newton–Raphson method is an iterative method for solving nonlinear equations, such as equations whose solution determines the point at which a function takes its maximum. It begins with an initial guess for the solution.
ŽHowever, computing routines use standard methods for solving the linear equations rather than explicitly calculating the inverse.. Iterations proceed until changes in LŽ␤Ž t . . between successive cycles are sufficiently small. The ML estimator is the limit of ␤Ž t . as t ™ ⬁; however, this need not happen if LŽ␤ . has other local maxima at which the derivative of LŽ␤ . equals 0. In that case, a good initial estimate is crucial. To help understand the Newton᎐Raphson process, work through these steps when ␤ has a single element ŽProblem 4.34.. Then, Figure 4.6 illustrates a cycle of the method, showing the parabolic Žsecond-order. approximation at a given step. In the next chapter we use Newton᎐Raphson for logistic regression models. For now, we illustrate it with a simpler problem for which we know the answer, maximizing the log likelihood based on an observation y from a bin Ž n, ␲ . distribution. From Section 1.3.2, the first two derivatives of LŽ␲ . s y log ␲ q Ž n y y .logŽ1 y ␲ . are u s Ž y y n␲ . r␲ Ž 1 y ␲ . , H s y yr␲ 2 q Ž n y y . r Ž 1 y ␲ . . 2 Each Newton᎐Raphson step has the form ␲ Ž tq1. s ␲ Ž t . q y Ž ␲ Žt. . 2 q y1 nyy Ž 1 y ␲ Žt. . 2 y y n␲ Ž t . ␲ Žt. Ž 1 y ␲ Žt. . . 145 FITTING GENERALIZED LINEAR MODELS FIGURE 4.6 Cycle of Newton᎐Raphson method. This adjusts ␲ Ž t . up if yrn ) ␲ Ž t . and down if yrn - ␲ Ž t .. For instance, with ␲ Ž0. s 12 , you can check that ␲ Ž1. s yrn. When ␲ Ž t . s yrn, no adjustment occurs and ␲ Ž tq1. s yrn, which is the correct answer for ␲ ˆ . For starting values other than 12 , adequate convergence usually takes four or five iterations. ˆ for the Newton᎐Raphson method is usually The convergence of ␤Ž t . to ␤ fast. For large t, the convergence satisfies, for each j, ␤ jŽ tq1. y ␤ˆj F c ␤ jŽ t . y ␤ˆj 2 for some c ) 0 and is referred to as second-order. This implies that the number of correct decimals in the approximation roughly doubles after sufficiently many iterations. 
In practice, it often takes relatively few iterations for satisfactory convergence.

4.6.2 Fisher Scoring Method

Fisher scoring is an alternative iterative method for solving likelihood equations. It resembles the Newton–Raphson method, the distinction being the treatment of the Hessian matrix. Fisher scoring uses the expected value of this matrix, called the expected information, whereas Newton–Raphson uses the matrix itself, called the observed information.

Let $\mathcal{I}^{(t)}$ denote approximation $t$ for the ML estimate of the expected information matrix; that is, $\mathcal{I}^{(t)}$ has elements $-E(\partial^2 L(\beta)/\partial\beta_a\,\partial\beta_b)$, evaluated at $\beta^{(t)}$. The formula for Fisher scoring is

$$\beta^{(t+1)} = \beta^{(t)} + (\mathcal{I}^{(t)})^{-1} u^{(t)} \quad \text{or} \quad \mathcal{I}^{(t)}\beta^{(t+1)} = \mathcal{I}^{(t)}\beta^{(t)} + u^{(t)}. \tag{4.40}$$

For estimating a binomial parameter, from Section 1.3.2 the information is $n/[\pi(1 - \pi)]$. A step of Fisher scoring gives

$$\pi^{(t+1)} = \pi^{(t)} + \left[\frac{n}{\pi^{(t)}(1 - \pi^{(t)})}\right]^{-1} \frac{y - n\pi^{(t)}}{\pi^{(t)}(1 - \pi^{(t)})} = \pi^{(t)} + \frac{y - n\pi^{(t)}}{n} = \frac{y}{n}.$$

This gives the answer for $\hat\pi$ after a single iteration and stays at that value for successive iterations.

Formula (4.26) showed that $\mathcal{I} = X^{\mathsf T} W X$. Similarly, $\mathcal{I}^{(t)} = X^{\mathsf T} W^{(t)} X$, where $W^{(t)}$ is $W$ [see (4.27)] evaluated at $\beta^{(t)}$. The estimated asymptotic covariance matrix $\hat{\mathcal{I}}^{-1}$ of $\hat\beta$ [see (4.28)] occurs as a by-product of this algorithm as $(\mathcal{I}^{(t)})^{-1}$ for $t$ at which convergence is adequate.

From (4.22), for both Fisher scoring and Newton–Raphson, $u$ has elements

$$u_j = \frac{\partial L(\beta)}{\partial\beta_j} = \sum_{i=1}^{N} \frac{(y_i - \mu_i)x_{ij}}{\mathrm{var}(Y_i)}\,\frac{\partial\mu_i}{\partial\eta_i}. \tag{4.41}$$

For GLMs with a canonical link, we'll see (Section 4.6.4) that the observed and expected information are the same. For noncanonical link models, Fisher scoring has the advantages that it produces the asymptotic covariance matrix as a by-product, the expected information is necessarily nonnegative definite, and, as seen next, it is closely related to weighted least squares methods for ordinary linear models.
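The contrast with Newton–Raphson shows up directly in the binomial case: replacing the observed information by the expected information $n/[\pi(1-\pi)]$ makes the update land on $y/n$ in one step from any interior start. A small sketch (ours, not from the text):

```python
# Fisher scoring for a binomial parameter.  The expected information is
# n / (pi * (1 - pi)), so u / info = (y - n*pi) / n and one step from any
# interior starting value gives y/n.

def fisher_scoring_binom(y, n, pi0=0.2, n_iter=5):
    pi = pi0
    for _ in range(n_iter):
        u = (y - n * pi) / (pi * (1 - pi))      # score
        info = n / (pi * (1 - pi))              # expected information
        pi = pi + u / info                      # Fisher scoring step (4.40)
    return pi
```

One iteration already returns $y/n$ (up to floating-point rounding), and further iterations stay there.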
However, it need not have second-order convergence, and for complex models the observed information is often easier to calculate. Efron and Hinkley (1978), developing arguments of R. A. Fisher, gave reasons for preferring observed information. They argued that its variance estimates better approximate a relevant conditional variance (conditional on statistics not relevant to the parameter being estimated), it is "closer to the data," and it tends to agree more closely with Bayesian analyses.

4.6.3 ML as Iterative Reweighted Least Squares*

A relation exists between weighted least squares estimation and using Fisher scoring to find ML estimates. We refer here to the general linear model of form $z = X\beta + \epsilon$. When the covariance matrix of $\epsilon$ is $V$, the weighted least squares (WLS) estimator of $\beta$ is $(X^{\mathsf T}V^{-1}X)^{-1}X^{\mathsf T}V^{-1}z$.

From $\mathcal{I} = X^{\mathsf T} W X$, expression (4.41) for elements of $u$, and since the diagonal elements of $W$ are $w_i = (\partial\mu_i/\partial\eta_i)^2/\mathrm{var}(Y_i)$, it follows that in (4.40),

$$\mathcal{I}^{(t)}\beta^{(t)} + u^{(t)} = X^{\mathsf T} W^{(t)} z^{(t)},$$

where $z^{(t)}$ has elements

$$z_i^{(t)} = \sum_j x_{ij}\beta_j^{(t)} + (y_i - \mu_i^{(t)})\frac{\partial\eta_i^{(t)}}{\partial\mu_i^{(t)}} = \eta_i^{(t)} + (y_i - \mu_i^{(t)})\frac{\partial\eta_i^{(t)}}{\partial\mu_i^{(t)}}.$$

Equations (4.40) for Fisher scoring then have the form

$$(X^{\mathsf T} W^{(t)} X)\beta^{(t+1)} = X^{\mathsf T} W^{(t)} z^{(t)}.$$

These are the normal equations for using weighted least squares to fit a linear model for a response variable $z^{(t)}$, when the model matrix is $X$ and the inverse of the covariance matrix is $W^{(t)}$. The equations have solution

$$\beta^{(t+1)} = (X^{\mathsf T} W^{(t)} X)^{-1} X^{\mathsf T} W^{(t)} z^{(t)}.$$

The vector $z$ in this formulation is a linearized form of the link function $g$, evaluated at $y$,

$$g(y_i) \approx g(\mu_i) + (y_i - \mu_i)g'(\mu_i) = \eta_i + (y_i - \mu_i)(\partial\eta_i/\partial\mu_i) = z_i. \tag{4.42}$$

This adjusted (or "working") response variable $z$ has element $i$ approximated by $z_i^{(t)}$ for cycle $t$ of the iterative scheme. That cycle regresses $z^{(t)}$ on $X$ with weight (i.e., inverse covariance) $W^{(t)}$
to obtain a new estimate $\beta^{(t+1)}$. This estimate yields a new linear predictor value $\eta^{(t+1)} = X\beta^{(t+1)}$ and a new adjusted response value $z^{(t+1)}$ for the next cycle. The ML estimate results from iterative use of weighted least squares, in which the weight matrix changes at each cycle. The process is called iterative reweighted least squares.

A simple way to begin the iterative process uses the data $y$ as the initial estimate of $\mu$. This determines the first estimate of the weight matrix $W$ and hence the initial estimate of $\beta$. It may be necessary to alter some observations slightly for this first cycle only so that $g(y)$, the initial value of $z$, is finite. For instance, when $g$ is the log link applied to counts, a count of $y_i = 0$ is problematic, so one could set $y_i = \tfrac{1}{2}$. This is not a problem with the model itself, since the log applies to the mean, and fitted means are usually strictly positive in successive iterations.

4.6.4 Simplifications for Canonical Links*

Certain simplifications result with GLMs using the canonical link. For that link,

$$\eta_i = \theta_i = \sum_j \beta_j x_{ij}.$$

Often, $a(\phi)$ in the density or mass function (4.14) is identical for all observations, such as for Poisson GLMs [$a(\phi) = 1$] and binomial GLMs with each $n_i = 1$ [for which $a(\phi) = 1/n_i = 1$]. Then the part of the log likelihood (4.19) involving both parameters and data is $\sum_i y_i\theta_i$, which simplifies to

$$\sum_i y_i\Bigl(\sum_j \beta_j x_{ij}\Bigr) = \sum_j \beta_j\Bigl(\sum_i y_i x_{ij}\Bigr).$$

Sufficient statistics for estimating $\beta$ in the GLM are then $\sum_i y_i x_{ij}$, $j = 1, \ldots, p$.

For the canonical link,

$$\partial\mu_i/\partial\eta_i = \partial\mu_i/\partial\theta_i = \partial b'(\theta_i)/\partial\theta_i = b''(\theta_i).$$

Thus, the contribution (4.21) to the likelihood equation for $\beta_j$ simplifies to

$$\frac{\partial L_i}{\partial\beta_j} = \frac{y_i - \mu_i}{\mathrm{var}(Y_i)}\,b''(\theta_i)\,x_{ij} = \frac{(y_i - \mu_i)x_{ij}}{a(\phi)}. \tag{4.43}$$

When $a(\phi)$ is identical for all observations, the likelihood equations are

$$\sum_i x_{ij}y_i = \sum_i x_{ij}\mu_i, \qquad j = 1, \ldots, p. \tag{4.44}$$
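The cycle described above can be written compactly for a Poisson loglinear model, for which $\partial\eta_i/\partial\mu_i = 1/\mu_i$ and $w_i = \mu_i$. The following is our illustrative sketch (the function name and starting rule are assumptions, not the book's code); at convergence the solution satisfies the canonical-link likelihood equations (4.44):

```python
import numpy as np

def irls_poisson_log(X, y, n_iter=25):
    """Fit a Poisson GLM with log link by iterative reweighted least
    squares.  For the log link, eta = log(mu), d(eta)/d(mu) = 1/mu,
    and the weights are w_i = mu_i."""
    y = np.asarray(y, dtype=float)
    mu = np.where(y > 0, y, 0.5)        # start at the data, adjusting zeros
    for _ in range(n_iter):
        eta = np.log(mu)
        z = eta + (y - mu) / mu         # working response (4.42)
        W = mu                          # diagonal weights
        XtW = X.T * W
        beta = np.linalg.solve(XtW @ X, XtW @ z)   # weighted normal equations
        mu = np.exp(X @ beta)
    return beta, mu
```

Each pass solves one weighted least squares problem; since the log link is canonical for the Poisson, this cycle is simultaneously Fisher scoring and Newton–Raphson.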
These equations equate the sufficient statistics for the model parameters to their expected values (Nelder and Wedderburn 1972). For a normal distribution with identity link, these are the normal equations. We obtained these for Poisson loglinear models in (4.29) and for binomial logistic regression models (when each $n_i = 1$) in (4.25).

From expression (4.43) for $\partial L_i/\partial\beta_j$, with the canonical link the second derivatives of the log likelihood have components

$$\frac{\partial^2 L_i}{\partial\beta_j\,\partial\beta_h} = -\frac{x_{ij}}{a(\phi)}\,\frac{\partial\mu_i}{\partial\beta_h}.$$

This does not depend on the observation $y_i$, so

$$\partial^2 L(\beta)/\partial\beta_h\,\partial\beta_j = E\bigl[\partial^2 L(\beta)/\partial\beta_h\,\partial\beta_j\bigr].$$

That is, $H = -\mathcal{I}$, and the Newton–Raphson and Fisher scoring algorithms are identical for canonical link models (Nelder and Wedderburn 1972).

4.7 QUASI-LIKELIHOOD AND GENERALIZED LINEAR MODELS*

A GLM $g(\mu_i) = \sum_j \beta_j x_{ij}$ specifies $\mu_i$ using a link function $g$ and linear predictor. From (4.22) and (4.41), the ML estimates $\hat\beta$ are the solutions of the likelihood equations

$$u_j(\beta) = \sum_{i=1}^{N} \frac{(y_i - \mu_i)x_{ij}}{v(\mu_i)}\left(\frac{\partial\mu_i}{\partial\eta_i}\right) = 0, \qquad j = 1, \ldots, p, \tag{4.45}$$

where $\mu_i = g^{-1}(\sum_j \beta_j x_{ij})$ and $v(\mu_i) = \mathrm{var}(Y_i)$. These equations set the score functions $\{u_j(\beta)\}$, which are derivatives of the log likelihood with respect to $\{\beta_j\}$, equal to 0. As we noted in Section 4.4.4, the likelihood equations depend on the assumed distribution for $Y_i$ only through $\mu_i$ and $v(\mu_i)$. The choice of distribution determines the mean–variance relationship $v(\mu_i)$.

4.7.1 Mean–Variance Relationship Determines Quasi-likelihood Estimates

Wedderburn (1974) proposed an alternative approach, quasi-likelihood estimation, which assumes only a mean–variance relationship rather than a specific distribution for $Y_i$. It has a link function and linear predictor of the usual GLM form, but instead of assuming a distributional type for $Y_i$ it assumes only

$$\mathrm{var}(Y_i) = v(\mu_i)$$

for some chosen variance function $v$.
The equations that determine quasi-likelihood estimates are the same as the likelihood equations (4.45) for GLMs. They are not likelihood equations, however, without the additional assumption that the $\{Y_i\}$ have distribution in the natural exponential family.

To illustrate, suppose we assume that the $\{Y_i\}$ are independent with $v(\mu_i) = \mu_i$. The quasi-likelihood (QL) estimates are the solution of (4.45) with $v(\mu_i)$ replaced by $\mu_i$. Under the additional assumption that the $\{Y_i\}$ have distribution in the exponential dispersion family (4.14), these estimates are also ML estimates. That case is simply the Poisson distribution. Thus, for $v(\mu) = \mu$, quasi-likelihood estimates are also ML estimates when the random component has a Poisson distribution.

Wedderburn suggested using the estimating equations (4.45) for any variance function, even if it does not occur for a member of the natural exponential family. In fact, the purpose of the quasi-likelihood method was to encompass a greater variety of cases, such as those discussed in Section 4.7.2. The QL estimates have asymptotic covariance matrix of the same form (4.28) as in GLMs, namely $(X^{\mathsf T}\hat W X)^{-1}$ with $w_i = (\partial\mu_i/\partial\eta_i)^2/\mathrm{var}(Y_i)$.

4.7.2 Overdispersion for Poisson GLMs and Quasi-likelihood

For count data, we've seen (Section 4.3.3) that the Poisson assumption is often unrealistic because of overdispersion: the variance exceeds the mean. One cause of this is heterogeneity among subjects. This suggests an alternative to a Poisson GLM in which the mean–variance relationship has the form

$$v(\mu_i) = \phi\mu_i$$

for some constant $\phi$. The case $\phi > 1$ represents overdispersion for the Poisson model. In the estimating equations (4.45) with $v(\mu_i) = \phi\mu_i$, $\phi$ drops out. Thus, the equations are identical to the likelihood equations for Poisson models, and model parameter estimates are also identical. Also,

$$w_i = (\partial\mu_i/\partial\eta_i)^2/\mathrm{var}(Y_i) = (\partial\mu_i/\partial\eta_i)^2/(\phi\mu_i),$$
so the estimated $\mathrm{cov}(\hat\beta) = (X^{\mathsf T}\hat W X)^{-1}$ is $\phi$ times that for the Poisson model.

When a variance function has the form $v(\mu_i) = \phi\,v^*(\mu_i)$, usually $\phi$ is also unknown. However, $\phi$ is not in the estimating equations. Let $X^2 = \sum_i (y_i - \hat\mu_i)^2/v^*(\hat\mu_i)$, a Pearson-type statistic for the simpler model with $\phi = 1$. Then $X^2/\phi$ is a sum of squares of $N$ standardized terms. When $X^2/\phi$ is approximately chi-squared, or when $\mu_i$ is approximately linear in $\beta$ with $v^*(\hat\mu_i)$ close to $v^*(\mu_i)$, then $E(X^2/\phi) \approx N - p$, the number of observations minus the number $p$ of model parameters. Hence, $E[X^2/(N - p)] \approx \phi$. Using the motivation of moment estimation, Wedderburn (1974) suggested taking $\hat\phi = X^2/(N - p)$ as the estimated multiple of the covariance matrix.

In summary, this quasi-likelihood approach for count data is simple: Fit the ordinary Poisson model and use its $p$ parameter estimates. Multiply the ordinary standard error estimates by $\sqrt{X^2/(N - p)}$.

We illustrate for the horseshoe crab data analyzed with Poisson GLMs in Section 4.3.2. With the log link, the fit using width to predict number of satellites was $\log\hat\mu = -3.305 + 0.164x$, with SE $= 0.020$ for $\hat\beta = 0.164$. To improve the adequacy of using a chi-squared statistic to summarize fit, we use the satellite totals and fit for all female crabs at a given width, to increase the counts and fitted values relative to those for individual female crabs. The $N = 66$ distinct width levels each have a total count $y_i$ for the number of satellites and a fitted total $\hat\mu_i$. The Pearson statistic comparing these is $X^2 = 174.3$. The quasi-likelihood adjustment for standard errors equals $\sqrt{174.3/(66 - 2)} = 1.65$. Thus, SE $= 1.65(0.020) = 0.033$ is a more plausible standard error for $\hat\beta = 0.164$ in this prediction equation.

Alternative ways of handling overdispersion include mixture models that allow heterogeneity in the mean at fixed settings of predictors.
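The adjustment above is just the moment estimator $\hat\phi = X^2/(N - p)$ applied to a reported standard error. A sketch of the arithmetic (ours, not the book's code), reproducing SE $\approx 0.033$ from $X^2 = 174.3$ with $N = 66$ and $p = 2$:

```python
import math

def ql_adjusted_se(se, X2, N, p):
    """Inflate a GLM standard error by the quasi-likelihood factor
    sqrt(phi_hat), where phi_hat = X^2 / (N - p)."""
    phi_hat = X2 / (N - p)
    return se * math.sqrt(phi_hat)
```

For example, `ql_adjusted_se(0.020, 174.3, 66, 2)` returns approximately 0.033, the adjusted standard error quoted in the text.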
For count data these include Poisson GLMs having random effects (Section 13.5) and negative binomial GLMs that result when a Poisson parameter itself has a gamma distribution (Sections 4.3.4 and 13.4).

4.7.3 Overdispersion for Binomial GLMs and Quasi-likelihood

The quasi-likelihood approach can also handle overdispersion for counts based on binary data. When $y_i$ is the sample mean of $n_i$ independent binary observations with parameter $\pi_i$, $i = 1, \ldots, N$, binomial sampling has $E(Y_i) = \pi_i$ and $\mathrm{var}(Y_i) = \pi_i(1 - \pi_i)/n_i$. A simple quasi-likelihood approach uses the alternative variance function

$$v(\pi_i) = \phi\,\pi_i(1 - \pi_i)/n_i. \tag{4.46}$$

Overdispersion occurs when $\phi > 1$. The quasi-likelihood estimates are the same as ML estimates for the binomial model, since $\phi$ drops out of the estimating equations (4.45). As in the overdispersed Poisson case, $\phi$ enters the denominator of $w_i$. Thus, the asymptotic covariance matrix multiplies by $\phi$, and standard errors multiply by $\sqrt{\phi}$. An estimate of $\phi$ using the $X^2$ fit statistic for the ordinary binomial model is $X^2/(N - p)$ (Finney 1947).

Methods like these that use estimates from ordinary models but inflate their standard errors are appropriate only if the model chosen describes well the structural relationship between the mean of $Y$ and the predictors. If a large goodness-of-fit statistic is due to some other type of lack of fit, such as failing to include a relevant interaction term, making an adjustment for overdispersion will not address the inadequacy. For counts with binary data, alternative mechanisms for handling overdispersion include mixture models such as binomial GLMs with random effects (Section 12.3) and models for which a binomial parameter itself has a beta distribution (Section 13.3).

4.7.4 Teratology Overdispersion Example

Table 4.5 shows results of a teratology experiment in which female rats on iron-deficient diets were assigned to four groups.
Rats in group 1 were given placebo injections, and rats in the other groups were given injections of an iron supplement; this was done weekly in group 4, only on days 7 and 10 in group 2, and only on days 0 and 7 in group 3. The 58 rats were made pregnant, sacrificed after three weeks, and then the total number of dead fetuses was counted in each litter.

TABLE 4.5 Response Counts of (Litter Size, Number Dead) for 58 Litters of Rats in Low-Iron Teratology Study

Group 1, untreated (low iron): (10, 1) (11, 4) (12, 9) (4, 4) (10, 10) (11, 9) (9, 9) (11, 11) (10, 10) (10, 7) (12, 12) (10, 9) (8, 8) (11, 9) (6, 4) (9, 7) (14, 14) (12, 7) (11, 9) (13, 8) (14, 5) (10, 10) (12, 10) (13, 8) (10, 10) (14, 3) (13, 13) (4, 3) (8, 8) (13, 5) (12, 12)

Group 2, injections days 7 and 10: (10, 1) (3, 1) (13, 1) (12, 0) (14, 4) (9, 2) (13, 2) (16, 1) (11, 0) (4, 0) (1, 0) (12, 0)

Group 3, injections days 0 and 7: (8, 0) (11, 1) (14, 0) (14, 1) (11, 0)

Group 4, injections weekly: (3, 0) (13, 0) (9, 2) (17, 2) (15, 0) (2, 0) (14, 1) (8, 0) (6, 0) (17, 0)

Source: Moore and Tsiatis (1991).

In teratology experiments, due to unmeasured covariates and genetic variability, the probability of death may vary from litter to litter within a particular treatment group. Let $y_{i(g)}$ denote the proportion of dead fetuses out of the $n_{i(g)}$ in litter $i$ in treatment group $g$. Let $\pi_{i(g)}$ denote the probability of death for a fetus in that litter. Consider the model with $n_{i(g)}y_{i(g)}$ a $\mathrm{bin}(n_{i(g)}, \pi_{i(g)})$ variate, where $\pi_{i(g)} = \pi_g$, $g = 1, 2, 3, 4$. That is, the model treats all litters in a particular group $g$ as having the same probability of death $\pi_g$. The ML fit has estimate $\hat\pi_g$ equal to the sample proportion of deaths for all fetuses from litters in that group. These equal $\hat\pi_1 = 0.758$ (SE $= 0.024$), $\hat\pi_2 = 0.102$ (SE $= 0.028$), $\hat\pi_3 = 0.034$ (SE $= 0.024$), and $\hat\pi_4 = 0.048$ (SE $= 0.021$), where for group $g$,

$$\mathrm{SE} = \sqrt{\hat\pi_g(1 - \hat\pi_g)\Big/\sum_i n_{i(g)}}.$$
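As a check, the group estimates and their standard errors can be recomputed from the litter counts in Table 4.5; the following sketch is ours, not the book's:

```python
import math

# Litter data from Table 4.5: (litter size, number dead) per group.
groups = {
    1: [(10,1),(11,4),(12,9),(4,4),(10,10),(11,9),(9,9),(11,11),(10,10),
        (10,7),(12,12),(10,9),(8,8),(11,9),(6,4),(9,7),(14,14),(12,7),
        (11,9),(13,8),(14,5),(10,10),(12,10),(13,8),(10,10),(14,3),
        (13,13),(4,3),(8,8),(13,5),(12,12)],
    2: [(10,1),(3,1),(13,1),(12,0),(14,4),(9,2),(13,2),(16,1),(11,0),
        (4,0),(1,0),(12,0)],
    3: [(8,0),(11,1),(14,0),(14,1),(11,0)],
    4: [(3,0),(13,0),(9,2),(17,2),(15,0),(2,0),(14,1),(8,0),(6,0),(17,0)],
}

def group_estimates(litters):
    """Pooled ML estimate pi_hat and its binomial SE for one group."""
    n = sum(size for size, _ in litters)
    dead = sum(d for _, d in litters)
    pi_hat = dead / n
    se = math.sqrt(pi_hat * (1 - pi_hat) / n)
    return pi_hat, se
```

Running `group_estimates` over the four groups reproduces the estimates 0.758, 0.102, 0.034, and 0.048 and their standard errors quoted above.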
The estimated probability of death is considerably higher for the placebo group.

For litter $i$ in group $g$, $n_{i(g)}\hat\pi_g$ is a fitted number of deaths and $n_{i(g)}(1 - \hat\pi_g)$ is a fitted number of nondeaths. Comparing these fitted values to the observed counts of deaths and nondeaths in the $N = 58$ litters using the Pearson statistic gives $X^2 = 154.7$ with df $= 58 - 4 = 54$. There is considerable evidence of overdispersion. With the quasi-likelihood approach, the $\{\hat\pi_g\}$ are the same as the binomial ML estimates; however, $\hat\phi = X^2/(N - p) = 154.7/(58 - 4) = 2.86$, so standard errors multiply by $\hat\phi^{1/2} = 1.69$.

Even with this adjustment for overdispersion, strong evidence remains that the probability of death is substantially higher for the placebo group. For instance, a 95% confidence interval for $\pi_1 - \pi_2$ is

$$(0.758 - 0.102) \pm 1.96\bigl[(1.69 \times 0.024)^2 + (1.69 \times 0.028)^2\bigr]^{1/2}, \quad \text{or} \quad (0.54, 0.78).$$

This is wider, however, than the Wald interval of $(0.59, 0.73)$ for comparing independent proportions, which ignores the overdispersion.

4.8 GENERALIZED ADDITIVE MODELS*

The GLM generalizes the ordinary linear model to permit nonnormal distributions and modeling functions of the mean. Quasi-likelihood provides a further generalization, specifying how the variance depends on the mean without assuming a given distribution. Another generalization replaces the linear predictor by smooth functions of the predictors.

4.8.1 Smoothing Data

The GLM structure $g(\mu_i) = \sum_j \beta_j x_{ij}$ generalizes to

$$g(\mu_i) = \sum_j s_j(x_{ij}),$$

where $s_j(\cdot)$ is an unspecified smooth function of predictor $j$. A useful smooth function is the cubic spline. It has separate cubic polynomials over sets of disjoint intervals, joined together smoothly at the boundaries of those intervals. Like GLMs, this model specifies a distribution for the random component and a link function $g$. The resulting model is called a generalized additive model, symbolized by GAM (Hastie and Tibshirani 1990).
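The teratology adjustment above is simple arithmetic. A sketch (ours, not the book's) reproducing $\hat\phi$, the inflation factor, and the interval for $\pi_1 - \pi_2$; the endpoints come out near (0.53, 0.78), matching the text's (0.54, 0.78) up to rounding of the reported standard errors:

```python
import math

# Quasi-likelihood adjustment for the teratology example:
# X^2 = 154.7 with N = 58 litters and p = 4 group probabilities.
X2, N, p = 154.7, 58, 4
phi_hat = X2 / (N - p)                 # about 2.86
infl = math.sqrt(phi_hat)              # about 1.69

# 95% CI for pi_1 - pi_2 using the inflated standard errors
se1, se2 = infl * 0.024, infl * 0.028
diff = 0.758 - 0.102
half = 1.96 * math.sqrt(se1**2 + se2**2)
lo, hi = diff - half, diff + half      # roughly (0.53, 0.78)
```

The unadjusted interval uses the same formula with `infl = 1` and is noticeably narrower, which is the point of the comparison in the text.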
The GLM is the special case in which each $s_j$ is a linear function. Also possible is taking some $s_j$ as smooth functions and others as linear functions or as dummy variables for qualitative predictors.

The details of fitting GAMs are beyond our scope. The fitting algorithm employs a generalization of the Newton–Raphson method that utilizes local smoothing. This corresponds to subtracting from the log-likelihood function a penalty function that increases as the smooth function gets more wiggly. The model fit assigns a deviance and an approximate df value to each $s_j$ in the additive predictor, enabling inference about those terms. For instance, a smooth function having df $= 5$ is similar in overall complexity to a fourth-degree polynomial, which has five parameters. One's choice of a df value (or smoothing parameter) determines how smooth the resulting GAM fit looks. It is usually worth trying a variety of degrees of smoothing to find one that smooths the data sufficiently so that the trend is not too irregular but does not smooth so much that it suppresses interesting patterns. This approach may suggest that a linear model is adequate with a particular link, or suggest ways to improve on linearity.

Some software packages that do not have GAMs can smooth the data by employing a type of regression that gives greater weight to nearby observations in predicting the value at a given point; such locally weighted least squares regression is often referred to as lowess. We prefer GAMs because they recognize explicitly the form of the response. For instance, with a binary response, lowess can give predicted values below 0 or above 1, which cannot happen with a GAM.

Even when one plans to use GLMs, a GAM can be helpful for exploratory analysis. For instance, for continuous $X$ with continuous responses, scatter diagrams provide visual information about the dependence of $Y$ on $X$.
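The locally weighted idea behind lowess can be sketched with an even simpler kernel smoother (our illustration; the Gaussian kernel and bandwidth are our assumptions, and this is not the GAM algorithm itself). Unlike lowess, a normalized kernel average of 0/1 responses is a weighted mean of 0's and 1's and so always stays within [0, 1]:

```python
import numpy as np

def kernel_smooth(x, y, grid, bandwidth=1.0):
    """Nadaraya-Watson kernel smoother: a locally weighted average,
    giving greater weight to observations near each grid point."""
    x, y, grid = map(np.asarray, (x, y, grid))
    d = (grid[:, None] - x[None, :]) / bandwidth
    w = np.exp(-0.5 * d**2)                 # Gaussian kernel weights
    return (w @ y) / w.sum(axis=1)          # weighted mean at each grid point
```

Applied to binary data such as the satellite indicator plotted in Figure 4.7, such a smooth can reveal the general increasing trend without assuming a functional form.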
For binary responses, the following example shows that such diagrams are not very informative. Plotting the fitted smooth function for a predictor may reveal a general trend without assuming a particular functional relationship.

FIGURE 4.7 Whether satellites are present (1, yes; 0, no), by width of female crab, with smoothing fit of generalized additive model.

4.8.2 GAMs for Horseshoe Crab Example

In Section 4.3.2, Figure 4.4 showed the trend relating number of satellites for horseshoe crabs to their width. This smooth curve is the fit of a generalized additive model, assuming a Poisson distribution and using the log link.

In the next chapter we'll use logistic regression to model the probability that a crab has at least one satellite. For crab $i$, let $y_i = 1$ if she has at least one satellite and $y_i = 0$ otherwise. Figure 4.7 plots these data against $x =$ crab width. It consists of a set of points with $y_i = 1$ and a second set of points with $y_i = 0$. The numbered symbols indicate the number of observations at each point. It appears that $y_i = 1$ tends to occur relatively more often at higher $x$ values. Figure 4.7 also shows a curve based on smoothing the data using a GAM, assuming a binomial response and logit link. This curve shows a roughly increasing trend and is more informative than viewing the binary data alone. It suggests that an S-shaped regression function may describe this relationship relatively well.

NOTES

Section 4.1: Generalized Linear Model

4.1. Distribution (4.1) is called a natural (or linear) exponential family to distinguish it from a more general exponential family that replaces $y$ by $r(y)$ in the exponential term. For other generalizations, see Jørgensen (1987). Books on GLMs and related models, in approximate order of technical level from highest to lowest, are McCullagh and Nelder (1989), Fahrmeir and Tutz (2001), Aitkin et al. (1989), Dobson (2002), and Gill (2000). See also Firth (1991).
Section 4.3: Generalized Linear Models for Counts

4.2. For further discussion of Poisson regression and related models for count data, see Breslow (1984), Cameron and Trivedi (1998), Frome (1983), Hinde (1982), Lawless (1987), and Seeber (1998) and references therein.

Section 4.4: Moments and Likelihood for Generalized Linear Models

4.3. The function $b(\cdot)$ in (4.14) is called the cumulant function, since when $a(\phi) = 1$ its derivatives yield the cumulants of the distribution (Jørgensen 1987). For many GLMs, including Poisson models with log link and binary models with logit link, with full-rank model matrix the Hessian is negative definite and the log likelihood is a strictly concave function. Then ML estimates of model parameters exist and are unique under quite general conditions (Wedderburn 1976).

Section 4.5: Inference for Generalized Linear Models

4.4. The matrix $W$ used in $\mathrm{cov}(\hat\beta)$ [see (4.28)], in the hat matrix for standardized Pearson residuals [see (4.38)], and in Fisher scoring [see (4.40)] is the inverse of the covariance matrix of the linearized form of $g(y)$ (see Section 4.6.3). McCullagh and Nelder (1989, Chap. 12) discussed model checking for GLMs. For discussions about residuals, see also Green (1984), Pierce and Schafer (1986), Pregibon (1980, 1981), and Williams (1987). Pregibon (1982) showed that the squared standardized Pearson residual is the score statistic for testing whether the observation is an outlier. Davison and Hinkley (1997, Sec. 7.2) discussed bootstrapping in GLMs.

Section 4.6: Fitting Generalized Linear Models

4.5. Fisher (1935b) introduced the Fisher scoring method to calculate ML estimates for probit models. For further discussion of GLM model fitting and the relationship between iterative reweighted least squares and ML estimation, see Green (1984), Jørgensen (1983), McCullagh and Nelder (1989), and Nelder and Wedderburn (1972). Green (1984), Jørgensen (1983), and Palmgren and Ekholm (1987)
also discussed this relation for exponential family nonlinear models.

Section 4.7: Quasi-likelihood and Generalized Linear Models

4.6. For more on quasi-likelihood, see Sections 11.4, 12.6.4, and 13.3, Breslow (1984), Cox (1983), Firth (1987), Hinde and Demétrio (1998), McCullagh (1983), McCullagh and Nelder (1989), Nelder and Pregibon (1987), and Wedderburn (1974, 1976). See Heyde (1997) for a theoretical perspective.

Section 4.8: Generalized Additive Models

4.7. Besides GAMs, other nonparametric smoothing methods can describe the dependence of a binary response on a predictor. For instance, see Copas (1983), Lloyd (1999, Chap. 5), and Section 15.3.3 for kernel smoothing and Kauermann and Tutz (2001) for models with random effects.

PROBLEMS

Applications

4.1 In the 2000 U.S. presidential election, Palm Beach County in Florida was the focus of unusual voting patterns (including a large number of illegal double votes) apparently caused by a confusing "butterfly ballot." Many voters claimed that they voted mistakenly for the Reform Party candidate, Pat Buchanan, when they intended to vote for Al Gore. Figure 4.8 shows the total number of votes for Buchanan plotted against the number of votes for the Reform Party candidate in 1996 (Ross Perot), by county in Florida. (For details, see A. Agresti and B. Presnell, J. Law Public Policy, Vol. 13, Fall 2001, 117–134.)

FIGURE 4.8 Total vote, by county in Florida, for Reform Party candidates Buchanan in 2000 and Perot in 1996.

a. In county $i$, let $\pi_i$ denote the proportion of the vote for Buchanan and let $x_i$ denote the proportion of the vote for Perot in 1996. For the linear probability model fitted to all counties except Palm Beach County, $\hat\pi_i = -0.0003 + 0.0304x_i$. Give the value of $P$ in the interpretation: The estimated proportion vote for Buchanan in 2000 was roughly $P$% of that for Perot in 1996.

b. For Palm Beach County, $\pi_i = 0.0079$ and $x_i = 0.0774$.
Does this result appear to be an outlier? Explain.

c. For logistic regression, $\log[\hat\pi_i/(1 - \hat\pi_i)] = -7.164 + 12.219x_i$. Find $\hat\pi_i$ in Palm Beach County. Is that county an outlier for this model?

4.2 For games in baseball's National League during nine decades, Table 4.6 shows the percentage of times that the starting pitcher pitched a complete game.

TABLE 4.6 Data for Problem 4.2

Decade      Percent Complete
1900-1909   72.7
1910-1919   63.4
1920-1929   50.0
1930-1939   44.3
1940-1949   41.6
1950-1959   32.8
1960-1969   27.2
1970-1979   22.5
1980-1989   13.3

Source: Data from George Will, Newsweek, Apr. 10, 1989.

a. Treating the number of games as the same in each decade, the ML fit of the linear probability model is $\hat\pi = 0.7578 - 0.0694x$, where $x =$ decade ($x = 1, 2, \ldots, 9$). Interpret 0.7578 and $-0.0694$.

b. Substituting $x = 10, 11, 12$, predict the percentages of complete games for the next three decades. Are these predictions plausible? Why?

c. The ML fit with logistic regression is $\hat\pi = \exp(1.148 - 0.315x)/[1 + \exp(1.148 - 0.315x)]$. Obtain $\hat\pi$ for $x = 10, 11, 12$. Are these more plausible?

4.3 For Table 3.7 with scores (0, 0.5, 1.5, 4.0, 7.0) for alcohol consumption, ML fitting of the linear probability model for malformation has output:

Parameter   Estimate   Std Error   Wald 95% Conf Limits
Intercept   0.0025     0.0003       0.0019   0.0032
Alcohol     0.0011     0.0007      -0.0003   0.0025

Interpret the model fit. Use it to estimate the relative risk of malformation for alcohol consumption levels 0 and 7.0.

4.4 For Table 4.2, refit the linear probability model or the logistic regression model using the scores (a) (0, 2, 4, 6), (b) (0, 1, 2, 3), and (c) (1, 2, 3, 4). Compare $\hat\beta$ for the three choices. Compare fitted values. Summarize the effect of linear transformations of scores, which preserve relative sizes of spacings between scores.
4.5 For Table 4.3, let $Y = 1$ if a crab has at least one satellite, and $Y = 0$ otherwise. Using $x =$ weight, fit the linear probability model.

a. Use ordinary least squares. Interpret the parameter estimates. Find the estimated probability at the highest observed weight (5.20 kg). Comment.

b. Try to fit the model using ML, treating $Y$ as binomial. [The failure is due to a fitted probability falling outside the (0, 1) range. The fit in part (a) is ML for a normal random component, for which fitted values outside this range are permissible.]

c. Fit the logistic regression model. Show that the fitted probability at a weight of 5.20 kg equals 0.9968.

d. Fit the probit model. Find the fitted probability at 5.20 kg.

4.6 An experiment analyzes imperfection rates for two processes used to fabricate silicon wafers for computer chips. For treatment A applied to 10 wafers, the numbers of imperfections are 8, 7, 6, 6, 3, 4, 7, 2, 3, 4. Treatment B applied to 10 other wafers has 9, 9, 8, 14, 8, 13, 11, 5, 7, 6
Can you conduct a likelihood-ratio test of this hypothesis? If not, what else do you need? e. Is there evidence of overdispersion? If necessary, adjust standard errors and interpret. TABLE 4.7 SAS Output for Problem 4.7 Criterion Deviance Pearson Chi- Square Log Likelihood DF 171 171 Value 560.8664 535.8957 71.9524 Parameter Estimate Std Error Wald 95% Conf Limits Chi- Sq Pr > ChiSq Intercept y0.4284 0.1789 y0.7791 y0.0777 5.73 0.0167 weight 0.5893 0.0650 0.4619 0.7167 82.15 <.0001 4.8 Refer to Problem 4.7. Using the identity link with x s weight, ␮ ˆs y2.60 q 2.264 x, where ␤ˆ s 2.264 has SE s 0.228. Repeat parts Ža. through Žc.. 4.9 Refer to Table 4.3. a. Fit a Poisson loglinear model using both W s weight and C s color to predict Y s number of satellites. Assigning dummy variables, treat C as a nominal factor. Interpret parameter estimates. 160 INTRODUCTION TO GENERALIZED LINEAR MODELS b. Estimate E Ž Y . for female crabs of average weight Ž2.44 kg. that are Ži. medium light, and Žii. dark. c. Test whether color is needed in the model. Ž Hint: From Section 4.5.4, the likelihood-ratio statistic comparing models is the difference in deviances. . d. The estimated color effects are monotone across the four categories. Fit a simpler model that treats C as quantitative and assumes a linear effect. Interpret its color effect and repeat the analyses of parts Žb. and Žc.. Compare the fit to the model in part Ža.. Interpret. e. Add width to the model. What effect does the strong positive correlation between width and weight have? Are both needed in the model? 4.10 In Section 4.3.2, refer to the Poisson model with identity link. The fit using least squares is ␮ ˆ s y10.42 q 0.51 x ŽSE s 0.11.. Explain why the parameter estimates differ and why the SE values are so different. 4.11 For the negative binomial model fitted to the crab satellite counts with log link and width predictor, ␣ ˆ s y4.05, ␤ˆ s 0.192 ŽSE s 0.048., ˆky1 s 1.106 ŽSE s 0.197.. Interpret. 
Why is SE for ␤ˆ so different from SE s 0.020 for the corresponding Poisson GLM in Sec 4.3.2? Which is more appropriate? Why? 4.12 Refer to Problem 4.6. The sample mean and variance are 5.0 and 4.2 for treatment A and 9.0 and 8.4 for treatment B. a. Is there evidence of overdispersion for the Poisson model having a dummy variable for treatment? Explain. b. Fit the negative binomial loglinear model. Note that the estimated dispersion parameter is 0 and that estimates of treatment means and standard errors are the same as with the Poisson loglinear GLM. c. For the overall sample of 20 observations, the sample mean and variance are 7.0 and 10.2. Fit the loglinear model having only an intercept term under Poisson and negative binomial assumptions. Compare results, and compare confidence intervals for the overall mean response. Why do they differ? Ž Note: This shows how the Poisson model can deteriorate when an important covariate is unmeasured. . 4.13 Table 4.8 shows the free-throw shooting, by game, of Shaq O’Neal of the Los Angeles Lakers during the 2000 NBA Žbasketball . playoffs. Commentators remarked that his shooting varied dramatically from game to game. In game i, suppose that Yi s number of free throws 161 PROBLEMS TABLE 4.8 Data for Problem 4.13 Number Number of Number Number of Number Number of Game Made Attempts Game Made Attempts Game Made Attempts 1 2 3 4 5 6 7 8 4 5 5 5 2 7 6 9 5 11 14 12 7 10 14 15 9 10 11 12 13 14 15 16 4 1 13 5 6 9 7 3 12 4 27 17 12 9 12 10 17 18 19 20 21 22 23 8 1 18 3 10 1 3 12 6 39 13 17 6 12 Source: www.nba.com. made out of n i attempts is a bin Ž n i , ␲ i . variate and the  Yi 4 are independent. a. Fit the model, ␲ i s ␣ , and find and interpret ␣ ˆ and its standard error. Does the model appear to fit adequately? Ž Note: You could check this with a small-sample test of independence of the 23 = 2 table of game and the binary outcome.. b. Adjust the standard error for overdispersion. 
Using the original SE and its correction, find and compare 95% confidence intervals for α. Interpret.

4.14 Refer to Table 13.6. Fit a loglinear model with a dummy variable for race, (a) assuming a Poisson distribution, and (b) allowing overdispersion with a quasi-likelihood approach. Compare results.

4.15 Refer to Problem 4.6. The wafers are also classified by thickness of silicon coating (z = 0, low; z = 1, high). The first five imperfection counts reported for each treatment refer to z = 0 and the last five refer to z = 1. Analyze these data.

4.16 Refer to Table 13.9 on frequency of sexual intercourse. Analyze these data.

Theory and Methods

4.17 Describe the purpose of the link function of a GLM. What is the identity link? Explain why it is not often used with binomial or Poisson responses.

4.18 For known k, show that the negative binomial distribution (4.12) has exponential family form (4.1) with natural parameter log[μ/(μ + k)].

4.19 For binary data, define a GLM using the log link. Show that effects refer to the relative risk. Why do you think this link is not often used? (Hint: What happens if the linear predictor takes a positive value?)

4.20 For the logistic regression model (4.6) with β > 0, show that (a) as x → ∞, π(x) is monotone increasing, and (b) the curve for π(x) is the cdf of a logistic distribution having mean −α/β and standard deviation π/(β√3).

4.21 Show representation (4.18) for the binomial distribution.

4.22 Let Y_i be a bin(n_i, π_i) variate for group i, i = 1, ..., N, with {Y_i} independent. Consider the model that π₁ = ··· = π_N. Denote that common value by π. For observations {y_i}, show that π̂ = (Σ y_i)/(Σ n_i). When all n_i = 1, for testing this model's fit in the N × 2 table, show that X² = n. Thus, goodness-of-fit statistics can be completely uninformative for ungrouped data. (See also Problem 5.37.)

4.23 Suppose that Y_i is Poisson with g(μ_i) = α + βx_i, where x_i = 1 for i = 1, ..., n_A from group A and x_i = 0 for i = n_A + 1, ..., n_A + n_B from group B. Show that for any link function g, the likelihood equations (4.22) imply that the fitted means μ̂_A and μ̂_B equal the sample means.

4.24 For binary data with sample proportion y_i based on n_i trials, we use quasi-likelihood to fit a model using variance function (4.46). Show that parameter estimates are the same as for the binomial GLM but that the covariance matrix multiplies by φ.

4.25 A binomial GLM π_i = Φ(Σ_j β_j x_ij) with arbitrary inverse link function Φ assumes that n_i Y_i has a bin(n_i, π_i) distribution. Find w_i in (4.27) and hence côv(β̂). For logistic regression, show that w_i = n_i π_i(1 − π_i).

4.26 A GLM has parameter β with sufficient statistic S. A goodness-of-fit test statistic T has observed value t₀. If β were known, a P-value is P = P(T ≥ t₀; β). Explain why P(T ≥ t₀ | S) is the uniform minimum variance unbiased estimator of P.

4.27 Let y_ij be observation j of a count variable for group i, i = 1, ..., I, j = 1, ..., n_i. Suppose that {Y_ij} are independent Poisson with E(Y_ij) = μ_i.
a. Show that the ML estimate of μ_i is μ̂_i = ȳ_i = Σ_j y_ij/n_i.
b. Simplify the expression for the deviance for this model. [For testing this model, it follows from Fisher (1970, p. 58, originally published in 1925) that the deviance and the Pearson statistic Σ_i Σ_j (y_ij − ȳ_i)²/ȳ_i have approximate chi-squared distributions with df = Σ_i (n_i − 1). For a single group, Cochran (1954) referred to Σ_j (y_1j − ȳ₁)²/ȳ₁ as the variance test for the fit of a Poisson distribution, since it compares the sample variance to the estimated Poisson variance ȳ₁.]

4.28 Conditional on λ, Y has a Poisson distribution with mean λ. Values of λ vary according to the gamma density (13.12), which has E(λ) = μ, var(λ) = μ²/k. Show that marginally Y has the negative binomial distribution (4.12).
Explain why the negative binomial model is a way to handle overdispersion for the Poisson.

4.29 Consider the class of binary models (4.8) and (4.9). Suppose that the standard cdf Φ corresponds to a probability density function φ that is symmetric around 0.
a. Show that the x at which π(x) = 0.5 is x = −α/β.
b. Show that the rate of change in π(x) when π(x) = 0.5 is βφ(0). Show this is 0.25β for the logit link and β/√(2π) (where π = 3.14...) for the probit link.
c. Show that the probit regression curve has the shape of a normal cdf with mean −α/β and standard deviation 1/|β|.

4.30 Show that the normal distribution N(μ, σ²) with fixed σ satisfies family (4.1), and identify the components. Formulate the ordinary regression model as a GLM.

4.31 In Problem 4.30, when σ is also a parameter, show that it satisfies the exponential dispersion family (4.14).

4.32 For binary observations, consider the model π(x) = 1/2 + (1/π) tan⁻¹(α + βx). Which distribution has a cdf of this form? Explain when a GLM using this curve might be more appropriate than logistic regression.

4.33 Find the form of the deviance residual (4.35) for an observation in a (a) binomial GLM, and (b) Poisson GLM. Illustrate part (b) for a cell count in a two-way contingency table for the model of independence.

4.34 Consider the value β̂ that maximizes a function L(β). Let β^(0) denote an initial guess.
a. Using L′(β̂) = L′(β^(0)) + (β̂ − β^(0))L″(β^(0)) + ···, argue that for β^(0) close to β̂, approximately 0 = L′(β^(0)) + (β̂ − β^(0))L″(β^(0)). Solve this equation to obtain an approximation β^(1) for β̂.
b. Let β^(t) denote approximation t for β̂, t = 0, 1, 2, .... Justify that the next approximation is β^(t+1) = β^(t) − L′(β^(t))/L″(β^(t)).

4.35 For n independent observations from a Poisson distribution, show that Fisher scoring gives μ^(t+1) = ȳ for all t > 0.
By contrast, what happens with Newton–Raphson?

4.36 Write a computer program using the Newton–Raphson algorithm to maximize the likelihood for a binomial sample. For π̂ = 0.3 based on n = 10, print out results of the first six iterations when the starting value π^(0) is (a) 0.1, (b) 0.2, ..., (i) 0.9. Summarize the effects of the starting value on speed of convergence. What happens if it is 0 or 1?

4.37 In a GLM, suppose that var(Y) = v(μ) for μ = E(Y). Show that the link g satisfying g′(μ) = [v(μ)]^(−1/2) has the same weight matrix W^(t) at each cycle. Show this link for a Poisson random component is g(μ) = 2√μ.

4.38 For noncanonical links in a GLM, show that the observed information matrix may depend on the data and hence differs from the expected information. Illustrate using the probit model.

Categorical Data Analysis, Second Edition. Alan Agresti
Copyright © 2002 John Wiley & Sons, Inc. ISBN: 0-471-36093-7

CHAPTER 5

Logistic Regression

In introducing generalized linear models for binary data in Chapter 4, we highlighted logistic regression. This is the most important model for categorical response data. It is used increasingly in a wide variety of applications. Early uses were in biomedical studies, but the past 20 years have also seen much use in social science research and marketing. Recently, logistic regression has become a popular tool in business applications. Some credit-scoring applications use logistic regression to model the probability that a subject is creditworthy. For instance, a model for the probability that a subject pays a bill on time may use predictors such as the size of the bill, annual income, occupation, mortgage and debt obligations, percentage of bills paid on time in the past, and other aspects of an applicant's credit history. A company that relies on catalog sales may determine whether to send a catalog to a potential customer by modeling the probability of a sale as a function of indices of past buying behavior.
Another area of increasing application is genetics. For instance, one recent article (J. M. Henshall and M. E. Goddard, Genetics 151:885–894, 1999) used logistic regression to estimate quantitative trait loci effects, modeling the probability that an offspring inherits an allele of one type instead of another type as a function of phenotypic values on various traits for that offspring. Another recent article (D. F. Levinson et al., Amer. J. Hum. Genet. 67:652–663, 2000) used logistic regression for analysis of the genotype data of affected sibling pairs (ASPs) and their parents from several research centers. The model studied the probability that ASPs have identity-by-descent allele sharing and tested its heterogeneity among the centers.

In this chapter we study logistic regression more closely. Section 5.1 covers parameter interpretation. In Section 5.2 we present inferential methods for those parameters. Sections 5.3 and 5.4 generalize to multiple predictors, some of which may be qualitative. Finally, in Section 5.5 we apply GLM model-fitting methods to determine and solve likelihood equations for logistic regression.

5.1 INTERPRETING PARAMETERS IN LOGISTIC REGRESSION

For a binary response variable Y and an explanatory variable X, let π(x) = P(Y = 1 | X = x) = 1 − P(Y = 0 | X = x). The logistic regression model is

    π(x) = exp(α + βx) / [1 + exp(α + βx)].    (5.1)

Equivalently, the log odds, called the logit, has the linear relationship

    logit[π(x)] = log[π(x)/(1 − π(x))] = α + βx.    (5.2)

This equates the logit link function to the linear predictor.

5.1.1 Interpreting β: Odds, Probabilities, and Linear Approximations

How can we interpret β in (5.2)? Its sign determines whether π(x) is increasing or decreasing as x increases. The rate of climb or descent increases as |β| increases; as β → 0 the curve flattens to a horizontal straight line. When β = 0, Y is independent of X.
For quantitative x with β > 0, the curve for π(x) has the shape of the cdf of the logistic distribution (recall Section 4.2.5). Since the logistic density is symmetric, π(x) approaches 1 at the same rate that it approaches 0.

Exponentiating both sides of (5.2) shows that the odds are an exponential function of x. This provides a basic interpretation for the magnitude of β: the odds increase multiplicatively by e^β for every 1-unit increase in x. In other words, e^β is an odds ratio, the odds at X = x + 1 divided by the odds at X = x.

Most scientists are not familiar with odds or logits, so the interpretation of a multiplicative effect of e^β on the odds scale or an additive effect of β on the logit scale is not helpful to them. A simpler, although approximate, slope interpretation uses a linearization argument (Berkson 1951). Since it has a curved rather than a linear appearance, the logistic regression function (5.1) implies that the rate of change in π(x) per unit change in x varies. A straight line drawn tangent to the curve at a particular x value, shown in Figure 5.1, describes the rate of change at that point. Calculating ∂π(x)/∂x using (5.1) yields a fairly complex function of the parameters and x, but it simplifies to the form βπ(x)[1 − π(x)]. For instance, the line tangent to the curve at the x for which π(x) = 1/2 has slope β(1/2)(1/2) = β/4; when π(x) = 0.9 or 0.1, it has slope 0.09β. The slope approaches 0 as π(x) approaches 1.0 or 0. The steepest slope occurs at the x for which π(x) = 1/2; that x value is x = −α/β. [To check that π(x) = 1/2 at this point, substitute −α/β for x in (5.1), or substitute π(x) = 1/2 in (5.2) and solve for x.] This x value is sometimes called the median effective level and denoted EL₅₀.

FIGURE 5.1 Linear approximation to logistic regression curve.
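These interpretations are easy to verify numerically. The sketch below (plain Python; the values α = −12.35 and β = 0.497 are illustrative, chosen close to the horseshoe crab fit of Section 5.1.3) checks that e^β is the odds ratio for a 1-unit increase in x, that π(−α/β) = 1/2, and that the tangent slope there is β/4.

```python
import math

def pi(x, a, b):
    """Logistic regression curve (5.1): P(Y=1 | X=x)."""
    return math.exp(a + b * x) / (1 + math.exp(a + b * x))

def odds(p):
    return p / (1 - p)

a, b = -12.35, 0.497   # illustrative values, close to the crab fit

# e^beta is the odds ratio comparing x + 1 to x, at any x
print(odds(pi(26.0, a, b)) / odds(pi(25.0, a, b)))  # equals e^b, about 1.64

# at the median effective level x = -a/b, pi = 1/2 and the
# tangent slope b * pi * (1 - pi) reduces to b/4
x50 = -a / b
print(pi(x50, a, b))                              # 0.5
print(b * pi(x50, a, b) * (1 - pi(x50, a, b)))    # b/4
```

The first ratio is exactly e^β no matter which x is chosen, which is what makes the odds-ratio interpretation so convenient.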
In toxicology studies it is called LD₅₀ (LD = lethal dose), the dose with a 50% chance of a lethal result.

From this linear approximation, near the x where π(x) = 1/2, a change in x of 1/β corresponds to a change in π(x) of roughly (1/β)(β/4) = 1/4; that is, 1/β approximates the distance between the x values where π(x) = 0.25 or 0.75 (in reality, 0.27 and 0.73) and where π(x) = 0.50. The linear approximation works better for smaller changes in x, however.

An alternative way to interpret the effect reports the values of π(x) at certain x values, such as their quartiles. This entails substituting those quartiles for x into formula (5.1) for π(x). The change in π(x) over the middle half of x values, from the lower quartile to the upper quartile of x, then describes the effect. It can be compared to the corresponding change over the middle half of values of other predictors.

The intercept parameter α is not usually of particular interest. However, by centering the predictor about 0 [i.e., replacing x by (x − x̄)], α becomes the logit at that mean, and thus e^α/(1 + e^α) = π(x̄). (As in ordinary regression, centering is also helpful in complex models containing quadratic or interaction terms, to reduce correlations among model parameter estimates.)

5.1.2 Looking at the Data

In practice, these interpretations use formula (5.1) with ML estimates substituted for parameters. Before fitting the model and making such interpretations, look at the data to check that the logistic regression model is appropriate. Since Y takes only values 0 and 1, it is difficult to check this by plotting Y against x. It can be helpful to plot sample proportions or logits against x. Let n_i denote the number of observations at setting i of x. Of them, let y_i denote the number of "1" outcomes, with p_i = y_i/n_i. Sample logit i is log[p_i/(1 − p_i)] = log[y_i/(n_i − y_i)]. This is not finite when y_i = 0 or n_i.
An ad hoc adjustment adds a positive constant to the number of outcomes of the two types. The adjustment

    log[(y_i + 1/2) / (n_i − y_i + 1/2)]

is the least-biased estimator of this form of the true logit (Note 5.2). The plot of sample logits should be roughly linear.

When X is continuous and all n_i = 1, or when it is essentially continuous and all n_i are small, this is unsatisfactory. One could group the data with nearby x values into categories before calculating sample proportions and sample logits. A better approach that does not require choosing arbitrary categories uses a smoothing mechanism to reveal trends. One such smoothing approach fits a generalized additive model (Section 4.8), which replaces the linear predictor of a GLM by a smooth function. Inspect a plot of the fit to see if severe discrepancies occur from the S-shaped trend predicted by logistic regression.

5.1.3 Horseshoe Crabs Revisited

To illustrate logistic regression, we reanalyze the horseshoe crab data introduced in Section 4.3.2. The binary response is whether a female crab has any male crabs residing nearby (satellites): Y = 1 if she has at least one satellite, and Y = 0 if she has none. We first use as a predictor the female crab's width.

Figure 4.7 plotted the data and showed the smoothed prediction of the mean provided by a generalized additive model (GAM), assuming a binomial response and logit link. The logistic regression model appears to be adequate. This is also suggested by the grouping of the data used to investigate the adequacy of Poisson regression models in Section 4.3.2 (Table 4.4). In each of the eight width categories, we computed the sample proportion of crabs having satellites and the mean width for the crabs in that category.
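The adjusted sample logit of Section 5.1.2 is simple to compute for grouped data such as these width categories. A minimal sketch (plain Python; the success counts are invented for illustration):

```python
import math

def sample_logit(y, n):
    """Adjusted empirical logit log[(y + 1/2) / (n - y + 1/2)].

    Finite even when y = 0 or y = n, unlike log[y / (n - y)].
    """
    return math.log((y + 0.5) / (n - y + 0.5))

# hypothetical grouped binary data: y successes out of n at each x setting
for y, n in [(0, 12), (3, 10), (8, 11), (12, 12)]:
    print(round(sample_logit(y, n), 3))
```

Plotting these adjusted logits against the group means of x gives a quick visual check that the logit is roughly linear in x.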
Figure 5.2 shows eight dots representing the sample proportions of female crabs having satellites plotted against the mean widths for the eight categories. The eight plotted sample proportions and the GAM smoothing curve both show a roughly increasing trend, so we proceed with fitting the logistic regression model with linear width predictor. We defer to Section 5.5 details about ML fitting. Software (e.g., for SAS see Table A.8) reports output such as Table 5.1 exhibits.

FIGURE 5.2 Observed and fitted proportions of satellites by width of female crab.

For the ungrouped data from Table 4.3, let π(x) denote the probability that a female horseshoe crab of width x has a satellite. The ML fit is

    π̂(x) = exp(−12.351 + 0.497x) / [1 + exp(−12.351 + 0.497x)].

TABLE 5.1 Computer Output for Logistic Regression Model with Horseshoe Crab Data

Criteria For Assessing Goodness Of Fit
Criterion             DF    Value
Deviance              171   194.4527
Pearson Chi-Square    171   165.1434
Log Likelihood              −97.2263

Parameter   Estimate   Std Error   Likelihood-Ratio 95% Conf Limits   Wald Chi-Sq   P > ChiSq
Intercept   −12.3508   2.6287      −17.8097   −7.4573                 22.07         <.0001
width         0.4972   0.1017        0.3084    0.7090                 23.89         <.0001

Substituting x = 26.3 cm, the mean width level in this sample, π̂(x) = 0.674. The estimated probability equals 1/2 when x = −α̂/β̂ = 12.351/0.497 = 24.8. Figure 5.2 plots π̂(x) against width.

The estimated odds of a satellite multiply by exp(β̂) = exp(0.497) = 1.64 for each 1-cm increase in width; that is, there is a 64% increase. To convey the effect less technically, we could report the incremental rate of change in the probability of a satellite. At the mean width, π̂(x) = 0.674, and π̂(x) increases by about β̂[π̂(x)(1 − π̂(x))] = 0.497(0.674)(0.326) = 0.11 for a 1-cm increase in width. Or, we could report π̂(x) at the quartiles of x.
The lower quartile, median, and upper quartile for width are 24.9, 26.1, and 27.7; π̂(x) at those values equals 0.51, 0.65, and 0.81, increasing by 0.30 over the x values for the middle half of the sample. The latter summary is useful for comparing the effects of predictors having different units. For instance, with crab weight as the predictor, logit[π̂(x)] = −3.695 + 1.815x. A 1-kg increase in weight is not comparable to a 1-cm increase in width, so β̂ = 0.497 for x = width is not comparable to β̂ = 1.815 for x = weight. The quartiles for weight are 2.00, 2.35, and 2.85; π̂(x) at those values are 0.48, 0.64, and 0.81, increasing by 0.33 over the middle half of the sampled weights. The effect is similar to that of width.

5.1.4 Logistic Regression with Retrospective Studies

Another property of logistic regression relates to situations in which the explanatory variable X rather than the response variable Y is random. This occurs with retrospective sampling designs, such as case–control biomedical studies (Section 2.1.6). For samples of subjects having Y = 1 (cases) and having Y = 0 (controls), the value of X is observed. Evidence exists of an association if the distribution of X values differs between cases and controls.

In retrospective studies, one can estimate odds ratios (Section 2.2.4). Effects in the logistic regression model refer to odds ratios. Thus, one can fit such models and estimate effects in case–control studies. Here is a justification for this. Let Z indicate whether a subject is sampled (1 = yes, 0 = no). Let ρ₁ = P(Z = 1 | y = 1) denote the probability of sampling a case, and let ρ₀ = P(Z = 1 | y = 0) denote the probability of sampling a control. Even though the conditional distribution of Y given X = x is not sampled, we need a model for P(Y = 1 | z = 1, x), assuming that P(Y = 1 | x) follows the logistic model. By Bayes' theorem,

    P(Y = 1 | z = 1, x) = P(Z = 1 | y = 1, x) P(Y = 1 | x) / [Σ_{j=0}^{1} P(Z = 1 | y = j, x) P(Y = j | x)].    (5.3)

Now, suppose that P(Z = 1 | y, x) = P(Z = 1 | y) for y = 0 and 1; that is, for each y, the sampling probabilities do not depend on x. For instance, often x refers to exposure of some type, such as whether someone has been a smoker. Then, for cases and for controls, the probability of being sampled is the same for smokers and nonsmokers. Under this assumption, substituting ρ₁ and ρ₀ in (5.3) and dividing numerator and denominator by P(Y = 0 | x), (5.3) simplifies to

    P(Y = 1 | z = 1, x) = ρ₁ exp(α + βx) / [ρ₀ + ρ₁ exp(α + βx)].

Then, dividing numerator and denominator by ρ₀ and using ρ₁/ρ₀ = exp[log(ρ₁/ρ₀)] yields

    logit[P(Y = 1 | z = 1, x)] = α* + βx   with   α* = α + log(ρ₁/ρ₀).

Thus, the logistic regression model holds with the same effect parameter β as in the model for P(Y = 1 | x). If the sampling rate for cases is 10 times that for controls, the intercept estimated is log(10) = 2.3 larger than the one estimated with a prospective study. For related comments, see Anderson (1972), Breslow and Day (1980, p. 203), Breslow and Powers (1978), Carroll et al. (1995), Farewell (1979), Mantel (1973), Prentice (1976a), and Prentice and Pyke (1979).

With case–control studies, one cannot estimate β in other binary-response models. Unlike the odds ratio, the effect for the conditional distribution of X given Y does not then equal that for Y given X. This is an important advantage of the logit link and is a major reason why logit models have surpassed other models in popularity in biomedical studies.

Many case–control studies employ matching. Each case is matched with one or more control subjects. The controls are like the case on key characteristics such as age. The model and subsequent analysis should take the matching into account. In Section 10.2.5 we discuss logistic regression for matched case–control studies.
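The algebra for retrospective sampling can be checked numerically: the logit of the retrospective probability is linear in x with the same slope β, while the intercept shifts by log(ρ₁/ρ₀). A sketch under made-up values of α, β, ρ₁, ρ₀:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def p_case_given_sampled(x, a, b, rho1, rho0):
    """P(Y=1 | Z=1, x), assuming sampling probabilities do not depend on x."""
    num = rho1 * math.exp(a + b * x)
    return num / (rho0 + num)

a, b = -2.0, 0.8          # hypothetical prospective parameters
rho1, rho0 = 0.5, 0.05    # cases sampled 10 times as often as controls

l0 = logit(p_case_given_sampled(0.0, a, b, rho1, rho0))
l1 = logit(p_case_given_sampled(1.0, a, b, rho1, rho0))
print(l1 - l0)   # equals b: the odds-ratio effect is unchanged
print(l0 - a)    # equals log(rho1/rho0) = log(10), about 2.3
```

This is why a case–control fit recovers the prospective β (and hence the odds ratio) but not the prospective intercept.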
Regardless of the sampling mechanism, logistic regression may or may not describe a relationship well. In one special case, it necessarily holds. Given that Y = i, suppose that X has a N(μ_i, σ²) distribution, i = 0, 1. Then, by Bayes' theorem, P(Y = 1 | X = x) equals (5.1) with β = (μ₁ − μ₀)/σ² (Cornfield 1962). When a population is a mixture of two types of subjects, one type with Y = 1 that is approximately normally distributed on X and the other type with Y = 0 that is approximately normal on X with similar variance, the logistic regression function (5.1) approximates well the curve for π(x). If the distributions are normal but with different variances, the model applies also having a quadratic term (Anderson 1975). In that case, the relationship is nonmonotone, with π(x) increasing and then decreasing, or the reverse (Problem 5.33).

5.2 INFERENCE FOR LOGISTIC REGRESSION

By Wald's (1943) asymptotic results for ML estimators, parameter estimators in logistic regression models have large-sample normal distributions. Thus, inference can use the (Wald, likelihood-ratio, score) triad of methods (Section 1.3.3).

5.2.1 Types of Inference

For the model with a single predictor, logit[π(x)] = α + βx, significance tests focus on H₀: β = 0, the hypothesis of independence. The Wald test uses the log likelihood at β̂, with test statistic z = β̂/SE or its square; under H₀, z² is asymptotically χ₁². The likelihood-ratio test uses twice the difference between the maximized log likelihood at β̂ and at β = 0 and also has an asymptotic χ₁² null distribution. The score test uses the log likelihood at β = 0 through the derivative of the log likelihood (i.e., the score function) at that point. The test statistic compares the sufficient statistic for β to its null expected value, suitably standardized [N(0, 1) or χ₁²]. Section 5.3.5 presents this test of H₀: β = 0. For large samples, the three tests usually give similar results.
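For a concrete sense of two of these tests, the sketch below computes the Wald and likelihood-ratio statistics from a fit's reported estimate, standard error, and maximized log likelihoods. The numbers are those reported later, in Section 5.2.2, for the horseshoe crab width model:

```python
beta_hat, se = 0.497, 0.102             # reported ML estimate and its SE
logL_null, logL_fit = -112.88, -97.23   # maximized log likelihoods at beta = 0 and at beta_hat

wald_z = beta_hat / se
wald_chisq = wald_z ** 2                # compare to chi-squared with df = 1
lr_stat = -2 * (logL_null - logL_fit)   # also chi-squared with df = 1

print(round(wald_z, 1))       # 4.9
print(round(lr_stat, 1))      # 31.3
```

Both statistics are highly significant here; with rounded inputs the Wald chi-squared is about 23.7 (23.9 from unrounded estimates), versus 31.3 for the likelihood ratio.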
The likelihood-ratio test is preferred over the Wald test. It uses more information, since it incorporates the log likelihood at H₀ as well as at β̂. When |β| is relatively large, the Wald test is not as powerful as the likelihood-ratio test and can even show aberrant behavior [see Hauck and Donner (1977) and Problem 5.38].

Confidence intervals are more informative than tests. An interval for β results from inverting a test of H₀: β = β₀. The interval is the set of β₀ for which the chi-squared test statistic is no greater than χ₁²(α) = z²_{α/2}. For the Wald approach, this means [(β̂ − β₀)/SE]² ≤ z²_{α/2}; the interval is β̂ ± z_{α/2}(SE).

For summarizing the relationship, other characteristics may have greater importance than β, such as π(x) at various x values. For fixed x = x₀, logit[π̂(x₀)] = α̂ + β̂x₀ has a large-sample SE given by the square root of the estimated variance

    var(α̂ + β̂x₀) = var(α̂) + x₀² var(β̂) + 2x₀ cov(α̂, β̂).

A 95% confidence interval for logit[π(x₀)] is (α̂ + β̂x₀) ± 1.96 SE. Substituting each endpoint into the inverse transformation π(x₀) = exp(logit)/[1 + exp(logit)] gives a corresponding interval for π(x₀).

Each method of inference can also produce small-sample confidence intervals and tests. We defer discussion of this until Section 6.7.

5.2.2 Inference for Horseshoe Crab Data

We illustrate logistic regression inferences with the model for the probability that a horseshoe crab has a satellite, with width as the predictor. Table 5.1 showed the fit and standard errors. The statistic z = β̂/SE = 0.497/0.102 = 4.9 provides strong evidence of a positive width effect (P < 0.0001). The equivalent Wald chi-squared statistic, z² = 23.9, has df = 1. The maximized log likelihoods equal −112.88 under H₀: β = 0 and −97.23 for the full model. The likelihood-ratio statistic equals −2[−112.88 − (−97.23)] = 31.3, with df = 1.
This provides even stronger evidence than the Wald test.

The Wald 95% confidence interval for β is 0.497 ± 1.96(0.102), or (0.298, 0.697). Table 5.1 reports a likelihood-ratio confidence interval of (0.308, 0.709), based on the profile likelihood function. The confidence interval for the effect on the odds per 1-cm increase in width equals (e^0.308, e^0.709) = (1.36, 2.03). We infer that a 1-cm increase in width has at least a 36% increase and at most a doubling in the odds of a satellite.

Most software for logistic regression also reports estimates and confidence intervals for π(x) (e.g., PROC GENMOD in SAS with the OBSTATS option). Consider this for crabs of width x = 26.5, near the mean width. The estimated logit is −12.351 + 0.497(26.5) = 0.825, and π̂(x) = 0.695. Software reports

    v̂ar(α̂) = 6.910,   v̂ar(β̂) = 0.01035,   côv(α̂, β̂) = −0.2668,

from which

    v̂ar{logit[π̂(x)]} = 6.910 + x²(0.01035) + 2x(−0.2668).

At x = 26.5 this is 0.038, so the 95% confidence interval for logit[π(26.5)] equals 0.825 ± 1.96√0.038, or (0.44, 1.21). This translates to the interval (0.61, 0.77) for the probability of satellites (e.g., exp(0.44)/[1 + exp(0.44)] = 0.61). (Alternatively, for the model fit using predictor x* = x − 26.5, α̂ and its SE are the estimated logit and its SE.) Figure 5.3 plots the confidence bands around the prediction equation for π(x) as a function of x. Hauck (1983) gave alternative bands for which the confidence coefficient applies simultaneously to all possible predictor values.

One could ignore the model fit and simply use sample proportions (i.e., the saturated model) to estimate such probabilities. Six female crabs in the sample had x = 26.5, and four of them had satellites. The sample proportion estimate at x = 26.5 is π̂ = 4/6 = 0.67, similar to the model-based estimate. The 95% score confidence interval (Section 1.4.2) based on these six observations alone equals (0.30, 0.90).
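The interval for π(26.5) can be reproduced from the reported variances and covariance. A sketch (plain Python, using the rounded values printed above, so the last digits can differ slightly from software output based on unrounded estimates):

```python
import math

a_hat, b_hat = -12.351, 0.497
var_a, var_b, cov_ab = 6.910, 0.01035, -0.2668

x0 = 26.5
logit_hat = a_hat + b_hat * x0                       # about 0.82
var_logit = var_a + x0 ** 2 * var_b + 2 * x0 * cov_ab  # delta-method variance

lo = logit_hat - 1.96 * math.sqrt(var_logit)
hi = logit_hat + 1.96 * math.sqrt(var_logit)
expit = lambda t: math.exp(t) / (1 + math.exp(t))    # inverse logit

print(round(var_logit, 3))                           # 0.038
print(round(expit(lo), 2), round(expit(hi), 2))      # 0.61 0.77
```

Transforming the endpoints of the logit interval through the inverse logit, rather than building an interval directly on the probability scale, keeps the interval inside (0, 1).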
FIGURE 5.3 Prediction equation and 95% confidence bands for probability of satellite as a function of width.

When the logistic regression model truly holds, the model-based estimator of a probability is considerably better than the sample proportion. The model has only two parameters to estimate, whereas the saturated model has a separate parameter for every distinct value of x. For instance, at x = 26.5, software reports SE = 0.04 for the model-based estimate 0.695, whereas the SE is √[π̂(1 − π̂)/n] = √[(0.67)(0.33)/6] = 0.19 for the sample proportion of 0.67 with only 6 observations. The 95% confidence intervals are (0.61, 0.77) using the model versus (0.30, 0.90) using the sample proportion. Instead of using only 6 observations, the model uses the information that all 173 observations provide in estimating the two model parameters. The result is a much more precise estimate.

Reality is a bit more complicated. In practice, the model is not exactly the true relationship between π(x) and x. However, if it approximates the true probabilities decently, its estimator still tends to be closer than the sample proportion to the true value. The model smooths the sample data, somewhat dampening the observed variability. The resulting estimators tend to be better unless each sample proportion is based on an extremely large sample. Section 6.4.5 discusses this advantage of using models.

5.2.3 Checking Goodness of Fit: Ungrouped and Grouped Data

In practice, there is no guarantee that a certain logistic regression model fits the data well. For any type of binary data, one way to detect lack of fit uses a likelihood-ratio test to compare the model to more complex ones. A more complex model might contain a nonlinear effect, such as a quadratic term. Models with multiple predictors would consider interaction. If more complex models do not fit better, this provides some assurance that the model chosen is reasonable.
Other approaches to detecting lack of fit search for any way that the model fails. This is simplest when the explanatory variables are solely categorical, as we'll illustrate in Section 5.4.3. At each setting of x, one can multiply the estimated probabilities of the two outcomes by the number of subjects at that setting to obtain estimated expected frequencies for y = 0 and y = 1. These are fitted values. The test of the model compares the observed counts and fitted values using a Pearson X² or likelihood-ratio G² statistic. For a fixed number of settings, as the fitted counts increase, X² and G² have limiting chi-squared null distributions. The degrees of freedom, called the residual df for the model, subtract the number of parameters in the model from the number of parameters in the saturated model (i.e., the number of settings of x).

The reason for the restriction to categorical predictors for a global test of fit relates to the distinction, mentioned in Section 4.5.3, between grouped and ungrouped data for binomial models. The saturated model differs in the two cases. An asymptotic chi-squared distribution for the deviance results as n → ∞ with a fixed number of parameters in that model and hence a fixed number of settings of predictor values.

5.2.4 Goodness of Fit of Model for Horseshoe Crabs

We illustrate with a goodness-of-fit analysis for the model using x = width to predict the probability that a female crab has a satellite. One way to check it compares it to a more complex model, such as the model containing a quadratic term. With width centered at 0 by subtracting its mean of 26.3, that model has fit

    logit[π̂(x)] = 0.618 + 0.533x + 0.040x².

The quadratic estimate has SE = 0.046. There is not much evidence to support adding that term. The likelihood-ratio statistic for testing that the true coefficient of x² is 0 equals 0.83 (df = 1).

We next consider overall goodness of fit.
Width takes 66 distinct values for the 173 crabs, with few observations at most widths. One can view the data as a 66 = 2 contingency table. The two cells in each row count the number of crabs with satellites and the number of crabs without satellites, at that width. The chi-squared theory for X 2 and G 2 applies when the number of levels of x is fixed, and the number of observations at each level grows. Although we grouped the data using the distinct width values rather than using 173 separate binary responses, this theory is violated here in two ways. First, most fitted counts are very small. Second, when more data are collected, additional width values would occur, so the contingency table would contain more cells rather than a fixed number. Because of this, X 2 and G 2 for logistic regression models with continuous or nearly continuous predictors do not have approximate chi-squared distributions. ŽNormal approximations can be 176 LOGISTIC REGRESSION TABLE 5.2 Grouping of Observed and Fitted Values for Fit of Logistic Regression Model to Horseshoe Crab Data Width Žcm. Number Yes Number No Fitted Yes Fitted No - 23.25 23.2524.25 24.2525.25 25.2526.25 26.2527.25 27.2528.25 28.2529.25 ) 29.25 5 4 17 21 15 20 15 14 9 10 11 18 7 4 3 0 3.64 5.31 13.78 24.23 15.94 19.38 15.65 13.08 10.36 8.69 14.22 14.77 6.06 4.62 2.35 0.92 more appropriate, but no single method has received much attention; see Section 9.8.6 for references. . One could use X 2 and G 2 to compare the observed and fitted values in grouped form. Table 5.2 uses the groupings of Table 4.4, giving an 8 = 2 table. In each width category, the fitted value for a yes response is the sum of the estimated probabilities ␲ ˆ Ž x . for all crabs having width in that category; the fitted value for a no response is the sum of 1 y ␲ ˆ Ž x . for those crabs. The fitted values are then much larger. Then, X 2 and G 2 have better validity, although the chi-squared theory still is not perfect since ␲ Ž x . 
is not constant in each category. Their values are X² = 5.3 and G² = 6.2. Table 5.2 has eight binomial samples, one for each width setting; the model has two parameters, so df = 8 − 2 = 6. Neither X² nor G² shows evidence of lack of fit (P > 0.4). Thus, we can feel more comfortable about using the model for the original ungrouped data.

5.2.5 Checking Goodness of Fit with Ungrouped Data by Grouping

As just noted, with ungrouped data or with continuous or nearly continuous predictors, X² and G² do not have limiting chi-squared distributions. They are still useful for comparing models, as done above for checking a quadratic term and as we will discuss in Sections 5.4.3 and 9.8.5. Also, as just noted, one can apply them in an approximate manner to grouped observed and fitted values for a partition of the space of x values. As the number of explanatory variables increases, however, simultaneous grouping of values for each variable can produce a contingency table with a large number of cells, most of which have small counts.

Regardless of the number of predictors, one can partition observed and fitted values according to the estimated probabilities of success using the original ungrouped data. One common approach forms the groups in the partition so they have approximately equal size. With 10 groups, the first pair of observed counts and corresponding fitted counts refers to the n/10 observations having the highest estimated probabilities, the next pair refers to the n/10 observations having the second decile of estimated probabilities, and so on. Each group has an observed count of subjects with each outcome and a fitted value for each outcome. The fitted value for an outcome is the sum of the estimated probabilities for that outcome for all observations in that group. This construction is the basis of a test due to Hosmer and Lemeshow (1980).
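Either style of grouped comparison is mechanical to compute. As a check on Section 5.2.4, the observed and fitted counts of Table 5.2 reproduce the statistics quoted there; a minimal sketch (ours, not the book's code):

```python
from math import log

# Observed and fitted counts from Table 5.2 (eight width categories).
obs_yes = [5, 4, 17, 21, 15, 20, 15, 14]
obs_no  = [9, 10, 11, 18, 7, 4, 3, 0]
fit_yes = [3.64, 5.31, 13.78, 24.23, 15.94, 19.38, 15.65, 13.08]
fit_no  = [10.36, 8.69, 14.22, 14.77, 6.06, 4.62, 2.35, 0.92]

obs = obs_yes + obs_no
fit = fit_yes + fit_no

# Pearson statistic: sum over the 16 cells of (observed - fitted)^2 / fitted.
X2 = sum((o - f) ** 2 / f for o, f in zip(obs, fit))

# Likelihood-ratio statistic: 2 * sum of observed * log(observed / fitted);
# a cell with observed count 0 contributes 0.
G2 = 2 * sum(o * log(o / f) for o, f in zip(obs, fit) if o > 0)

print(round(X2, 1), round(G2, 1))   # ≈ 5.3 and 6.2, with df = 8 - 2 = 6
```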
They proposed a Pearson statistic comparing the observed and fitted counts for this partition. Let y_ij denote the binary outcome for observation j in group i of the partition, i = 1, ..., g, j = 1, ..., n_i. Let π̂_ij denote the corresponding fitted probability for the model fitted to the ungrouped data. Their statistic equals

    Σ_{i=1}^{g}  (Σ_j y_ij − Σ_j π̂_ij)² / [ (Σ_j π̂_ij)(1 − Σ_j π̂_ij / n_i) ].

When many observations have the same estimated probability, there is some arbitrariness in forming the groups, and different software may report somewhat different values. This statistic does not have a limiting chi-squared distribution, because the observations in a group are not identical trials, since they do not share a common success probability. However, Hosmer and Lemeshow noted that when the number of distinct patterns of covariate values equals the sample size, the null distribution is approximated by chi-squared with df = g − 2. For the logistic regression fit to the horseshoe crab data with continuous width predictor, the Hosmer–Lemeshow statistic with g = 10 groups equals 3.5, with df = 8. It also indicates a decent fit.

Unfortunately, like other proposed global fit statistics, the Hosmer–Lemeshow statistic does not have good power for detecting particular types of lack of fit (Hosmer et al. 1997). In any case, a large value of a global fit statistic merely indicates some lack of fit but provides no insight about its nature. The approach of comparing the working model to a more complex one is more useful from a scientific perspective, since it searches for lack of fit of a particular type. For either approach, when the fit is poor, diagnostic measures describe the influence of individual observations on the model fit and highlight reasons for the inadequacy. We discuss these in Section 6.2.1.
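The Hosmer–Lemeshow construction amounts to sorting observations by fitted probability, splitting them into g groups, and applying the formula above. A sketch (the function name and the simulated illustration are ours; a real application would use the ML fitted probabilities π̂_ij):

```python
import random
from math import exp

def hosmer_lemeshow(y, pihat, g=10):
    """Partition observations into g groups of roughly equal size by
    sorted fitted probability, then compare observed and fitted success
    counts group by group with the statistic displayed above."""
    pairs = sorted(zip(pihat, y))
    k, r = divmod(len(pairs), g)
    stat, start = 0.0, 0
    for i in range(g):
        size = k + (1 if i < r else 0)
        chunk = pairs[start:start + size]
        start += size
        fit = sum(p for p, _ in chunk)       # sum of fitted probabilities
        obs = sum(yi for _, yi in chunk)     # observed success count
        stat += (obs - fit) ** 2 / (fit * (1 - fit / len(chunk)))
    return stat, g - 2     # df = g - 2 for distinct covariate patterns

# Simulated illustration: binary responses from a known logistic model;
# here the true probabilities stand in for fitted values to keep it short.
random.seed(1)
xs = [random.gauss(26, 2) for _ in range(200)]
ps = [exp(-12 + 0.5 * xi) / (1 + exp(-12 + 0.5 * xi)) for xi in xs]
ys = [1 if random.random() < p else 0 for p in ps]
stat, df = hosmer_lemeshow(ys, ps, g=10)
print(round(stat, 2), df)
```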
5.3 LOGIT MODELS WITH CATEGORICAL PREDICTORS

Like ordinary regression, logistic regression extends to include qualitative explanatory variables, often called factors. In this section we use dummy variables to do this.

5.3.1 ANOVA-Type Representation of Factors

For simplicity, we first consider a single factor X, with I categories. In row i of the I × 2 table, y_i is the number of outcomes in the first column (successes) out of n_i trials. We treat y_i as binomial with parameter π_i. The logit model with a factor is

    log[π_i / (1 − π_i)] = α + β_i.     (5.4)

The higher β_i is, the higher the value of π_i. The right-hand side of (5.4) resembles the model formula for cell means in one-way ANOVA. As in ANOVA, the factor has as many parameters {β_i} as categories, but one is redundant. With I categories, X has I − 1 nonredundant parameters. One parameter can be set to 0, say β_I = 0. If the values do not satisfy this, we can recode so that it is true. For instance, set β̃_i = β_i − β_I and α̃ = α + β_I, which satisfy β̃_I = 0. Then

    logit(π_i) = α + β_i = (α̃ − β_I) + (β̃_i + β_I) = α̃ + β̃_i,

where the newly defined parameters satisfy the constraint. When β_I = 0, α equals the logit in row I, and β_i is the difference between the logits in rows i and I. Thus, β_i equals the log odds ratio for that pair of rows. For any {π_i > 0}, {β_i} exist such that model (5.4) holds. The model has as many parameters (I) as binomial observations and is saturated.

When a factor has no effect, β_1 = β_2 = ··· = β_I. Since this is equivalent to π_1 = ··· = π_I, this model with only an intercept term specifies statistical independence of X and Y.

5.3.2 Dummy Variables in Logit Models

An equivalent expression of model (5.4) uses dummy variables. Let x_i = 1 for observations in row i and x_i = 0 otherwise, i = 1, ..., I − 1. The model is

    logit(π_i) = α + β_1 x_1 + β_2 x_2 + ··· + β_{I−1} x_{I−1}.
This accounts for parameter redundancy by not forming a dummy variable for category I. The constraint β_I = 0 in (5.4) corresponds to this form of dummy variable. The choice of category to exclude for the dummy variable is arbitrary. Some software sets β_1 = 0; this corresponds to a model with dummy variables for categories 2 through I, but not category 1. Another way to impose constraints sets Σ_i β_i = 0. Suppose that X has I = 2 categories, so β_1 = −β_2. This results from effect coding for a dummy variable, x = 1 in category 1 and x = −1 in category 2.

The same substantive results occur for any coding scheme. For model (5.4), regardless of the constraint for {β_i}, {α̂ + β̂_i} and hence {π̂_i} are the same. The differences β̂_a − β̂_b for pairs (a, b) of categories of X are identical and represent estimated log odds ratios. Thus, exp(β̂_a − β̂_b) is the estimated odds of success in category a of X divided by the estimated odds of success in category b of X. Reparameterizing a model may change parameter estimates but does not change the model fit or the effects of interest.

The value β_i or β̂_i for a single category is irrelevant. Different constraint systems result in different values. For a binary predictor, for instance, using dummy variables with reference value β_2 = 0, the log odds ratio equals β_1 − β_2 = β_1; by contrast, for effect coding with ±1 dummy variable and hence β_1 + β_2 = 0, the log odds ratio equals β_1 − β_2 = β_1 − (−β_1) = 2β_1. A parameter or its estimate makes sense only by comparison with one for another category.

5.3.3 Alcohol and Infant Malformation Example Revisited

We return now to Table 3.7 from the study of maternal alcohol consumption and child's congenital malformations, shown again in Table 5.3. For model (5.4), we treat malformations as the response and alcohol consumption as an explanatory factor.
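The coding invariance of Section 5.3.2 is easy to verify numerically, since for a single factor the saturated model reproduces the sample logits. A sketch with hypothetical counts (ours, not data from the text):

```python
from math import log

# Hypothetical I = 2 example: success and failure counts by category.
success = [30, 15]
failure = [70, 85]
logits = [log(s / f) for s, f in zip(success, failure)]

# Reference coding (beta_2 = 0): alpha is the logit in the last row.
alpha_ref = logits[1]
beta_ref = [logits[0] - logits[1], 0.0]

# Effect coding (beta_1 + beta_2 = 0): alpha is the mean logit.
alpha_eff = sum(logits) / 2
beta_eff = [logits[0] - alpha_eff, logits[1] - alpha_eff]

# {alpha + beta_i} agree under both schemes, so the fits are identical.
for i in range(2):
    assert abs((alpha_ref + beta_ref[i]) - (alpha_eff + beta_eff[i])) < 1e-12

# The log odds ratio beta_1 - beta_2 is also identical under both ...
log_or_ref = beta_ref[0] - beta_ref[1]
log_or_eff = beta_eff[0] - beta_eff[1]
print(round(log_or_ref, 4), round(log_or_eff, 4))

# ... and under effect coding, beta_1 alone is half the log odds ratio.
assert abs(2 * beta_eff[0] - log_or_eff) < 1e-12
```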
Regardless of the constraint for {β_i}, {α̂ + β̂_i} are the sample logits, reported in Table 5.3. For instance,

    logit(π̂_1) = α̂ + β̂_1 = log(48/17,066) = −5.87.

For the coding that constrains β_5 = 0, α̂ = −3.61 and β̂_1 = −2.26. For the coding β_1 = 0, α̂ = −5.87. Table 5.3 shows that except for the slight reversal between the first and second categories of alcohol consumption, the logits and hence the sample proportions of malformation cases increase as alcohol consumption increases.

TABLE 5.3  Logits and Proportion of Malformation for Table 3.7

                                                        Proportion Malformed
    Alcohol Consumption   Present   Absent    Logit     Observed    Fitted
    0                        48     17,066    −5.87      0.0028     0.0026
    < 1                      38     14,464    −5.94      0.0026     0.0030
    1–2                       5        788    −5.06      0.0063     0.0041
    3–5                       1        126    −4.84      0.0079     0.0091
    ≥ 6                       1         37    −3.61      0.0263     0.0231

The simpler model with all β_i = 0 specifies independence. For it, α̂ equals the logit for the overall sample proportion of malformations, or log(93/32,481) = −5.86. To test H₀: independence (df = 4), the Pearson statistic (3.10) is X² = 12.1 (P = 0.02), and the likelihood-ratio statistic (3.11) is G² = 6.2 (P = 0.19). These provide mixed signals. Table 5.3 has a mixture of very small, moderate, and extremely large counts. Even though n = 32,574, the null sampling distributions of X² or G² may not be close to chi-squared. The P-values using the exact conditional distributions of X² and G² are 0.03 and 0.13. These are closer, but still give differing evidence. In any case, these statistics ignore the ordinality of alcohol consumption. The sample suggests that malformations may tend to be more likely with higher alcohol consumption. The first two percentages are similar and the next two are also similar, however, and any of the last three percentages changes substantially with the addition or deletion of one malformation case.

5.3.4 Linear Logit Model for I × 2 Tables

Model (5.4)
treats the explanatory factor as nominal, since it is invariant to the ordering of categories. For ordered factor categories, other models are more parsimonious than this, yet more complex than the independence model. For instance, let scores {x_1, x_2, ..., x_I} describe distances between categories of X. When one expects a monotone effect of X on Y, it is natural to fit the linear logit model

    logit(π_i) = α + β x_i.     (5.5)

The independence model is the special case β = 0.

The near-monotone increase in sample logits in Table 5.3 indicates that the linear logit model (5.5) may fit better than the independence model. As measured, alcohol consumption groups a naturally continuous variable. With scores {x_1 = 0, x_2 = 0.5, x_3 = 1.5, x_4 = 4.0, x_5 = 7.0}, the last score being somewhat arbitrary, Table 5.4 shows results. The estimated multiplicative effect of a unit increase in daily alcohol consumption on the odds of malformation is exp(0.317) = 1.37. Table 5.3 shows the observed and fitted proportions of malformation. The model seems to fit well, as statistics comparing observed and fitted counts are G² = 1.95 and X² = 2.05, with df = 3.

TABLE 5.4  Computer Output for Logistic Regression Model with Infant Malformation Data

    Criteria For Assessing Goodness Of Fit
    Criterion            DF        Value
    Deviance              3       1.9487
    Pearson Chi-Square    3       2.0523
    Log Likelihood             −635.5968

    Parameter   Estimate   Std Error   Likelihood-Ratio 95% Conf Limits   Wald Chi-Sq   Pr>ChiSq
    Intercept    −5.9605      0.1154       −6.1930   −5.7397                 2666.41      <.0001
    alcohol       0.3166      0.1254        0.0187    0.5236                    6.37      0.0116

5.3.5 Cochran–Armitage Trend Test

Armitage (1955) and Cochran (1954) were among the first to emphasize the importance of utilizing ordered categories in a contingency table. For I × 2 tables with ordered rows and I independent bin(n_i, π_i) variates {y_i}, they proposed a trend statistic for testing independence by partitioning the Pearson statistic for that hypothesis.
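Before developing that statistic, the linear logit fit just reported (Table 5.4) can be checked directly against Table 5.3; a sketch (ours, not the book's code) reproducing the fitted proportions, the odds multiplier, and the deviance:

```python
from math import exp, log

# Table 5.3 counts and the scores used for the linear logit model (5.5).
scores  = [0.0, 0.5, 1.5, 4.0, 7.0]
present = [48, 38, 5, 1, 1]
absent  = [17066, 14464, 788, 126, 37]

# ML estimates reported in Table 5.4.
alpha, beta = -5.9605, 0.3166

fitted = [1 / (1 + exp(-(alpha + beta * x))) for x in scores]
print([round(p, 4) for p in fitted])   # matches the fitted column of Table 5.3

# Multiplicative effect on the odds of a one-unit score increase:
print(round(exp(beta), 2))             # 1.37

# Deviance comparing observed counts with fitted counts over all 10 cells:
G2 = 0.0
for y, m, p in zip(present, absent, fitted):
    n = y + m
    G2 += 2 * (y * log(y / (n * p)) + m * log(m / (n * (1 - p))))
print(round(G2, 2))                    # ≈ 1.95 with df = 3, as in Table 5.4
```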
Armitage and Cochran used a linear probability model,

    π_i = α + β x_i,     (5.6)

fitted by ordinary least squares. For this model, the null hypothesis of independence is H₀: β = 0. Let x̄ = Σ_i n_i x_i / n. Let p_i = y_i / n_i, and let p = (Σ_i y_i)/n denote the overall proportion of successes. The prediction equation is π̂_i = p + b(x_i − x̄), where

    b = Σ_i n_i (p_i − p)(x_i − x̄) / Σ_i n_i (x_i − x̄)².

Denote the Pearson statistic for testing independence by X²(I). For I × 2 tables with ordered rows, it satisfies

    X²(I) = [1/(p(1 − p))] Σ_i n_i (p_i − p)² = z² + X²(L),

where

    X²(L) = [1/(p(1 − p))] Σ_i n_i (p_i − π̂_i)²,

    z² = b² Σ_i n_i (x_i − x̄)² / [p(1 − p)],  that is,  z = Σ_i (x_i − x̄) y_i / √[p(1 − p) Σ_i n_i (x_i − x̄)²].     (5.7)

When the linear probability model holds, X²(L) is asymptotically chi-squared with df = I − 2. It tests the fit of the model. The statistic z², with df = 1, tests H₀: β = 0 for the linear trend in the proportions (5.6). The test of independence using this statistic is called the Cochran–Armitage trend test.

This analysis seems unrelated to the linear logit model. However, the Cochran–Armitage statistic is equivalent to the score statistic for testing H₀: β = 0 in that model. Moreover, this statistic relates to the statistic M² in (3.15) used to test for a linear trend in an I × J table; namely, it equals M² applied when J = 2, except with (n − 1) replaced by n. When I = 2, X²(L) = 0 and z² = X²(I).

For Table 5.3 on alcohol consumption and malformation, X²(I) = 12.1. Using the same scores as in the linear logit model, the Cochran–Armitage trend test has z² = 6.6 (P-value = 0.010). The test suggests strong evidence of a positive slope. In addition, X²(I) = 12.1 = 6.6 + 5.5, where X²(L) = 5.5 (df = 3) shows only slight evidence of departure of the proportions from linearity. The trend test agrees with M² for the sample correlation of r = 0.014 for n = 32,573 (Section 3.4.5).
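The partition just quoted is straightforward to verify for Table 5.3 with the same scores; a sketch (ours):

```python
from math import sqrt

# Table 5.3: y_i malformations out of n_i, with the alcohol scores x_i.
x = [0.0, 0.5, 1.5, 4.0, 7.0]
y = [48, 38, 5, 1, 1]
n = [17114, 14502, 793, 127, 38]

N = sum(n)
p = sum(y) / N                                    # overall proportion
xbar = sum(ni * xi for ni, xi in zip(n, x)) / N
props = [yi / ni for yi, ni in zip(y, n)]         # sample proportions p_i

# Pearson statistic for independence, X2(I):
X2_I = sum(ni * (pi - p) ** 2 for ni, pi in zip(n, props)) / (p * (1 - p))

# Trend statistic z from (5.7):
sxx = sum(ni * (xi - xbar) ** 2 for ni, xi in zip(n, x))
z = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sqrt(p * (1 - p) * sxx)

X2_L = X2_I - z * z                               # lack-of-fit component
print(round(X2_I, 1), round(z * z, 1), round(X2_L, 1))
# X2(I) ≈ 12.1 partitions as z^2 ≈ 6.6 plus X2(L) ≈ 5.5
```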
For the chosen scores, the correlation seems weak. However, r has limited use as a descriptive measure for tables that are highly discrete and unbalanced.

The Cochran–Armitage trend test (i.e., the score test) usually gives results similar to the Wald or likelihood-ratio test of H₀: β = 0 in the linear logit model. The asymptotics work well even for quite small n when {n_i} are equal and {x_i} are equally spaced. With Table 5.3, the Wald statistic equals (β̂/SE)² = (0.317/0.125)² = 6.4 (P = 0.012) and the likelihood-ratio statistic equals 4.25 (P = 0.039). The highly unbalanced counts suggest that it is safest to use the likelihood function through the likelihood-ratio approach. This is also true for estimation. The profile likelihood 95% confidence interval of (0.02, 0.52) for β reported in Table 5.4 is preferable to the Wald interval of 0.317 ± 1.96(0.125) = (0.07, 0.56). Even though n is very large, exact inference based on small-sample methods presented in Section 6.7.4 is relevant here.

5.4 MULTIPLE LOGISTIC REGRESSION

Like ordinary regression, logistic regression extends to models with multiple explanatory variables. For instance, the model for π(x) = P(Y = 1) at values x = (x_1, ..., x_p) of p predictors is

    logit π(x) = α + β_1 x_1 + β_2 x_2 + ··· + β_p x_p.     (5.8)

The alternative formula, directly specifying π(x), is

    π(x) = exp(α + β_1 x_1 + β_2 x_2 + ··· + β_p x_p) / [1 + exp(α + β_1 x_1 + β_2 x_2 + ··· + β_p x_p)].     (5.9)

The parameter β_i refers to the effect of x_i on the log odds that Y = 1, controlling the other x_j. For instance, exp(β_i) is the multiplicative effect on the odds of a 1-unit increase in x_i, at fixed levels of other x_j. An explanatory variable can be qualitative, using dummy variables for categories.

5.4.1 Logit Models for Multiway Contingency Tables

When all variables are categorical, a multiway contingency table displays the data.
We illustrate ideas with binary predictors X and Z. We treat the sample size at given combinations (i, k) of X and Z as fixed and regard the two counts on Y at each setting as binomial, with different binomials treated as independent. Denote the two categories for each variable by (0, 1), and let dummy variables for X and Z have x_1 = z_1 = 1 and x_2 = z_2 = 0. The model

    logit P(Y = 1) = α + β_1 x_i + β_2 z_k     (5.10)

has main effects for X and Z but assumes an absence of interaction. The effect of one factor is the same at each level of the other. At a fixed level z_k of Z, the effect on the logit of changing categories of X is

    [α + β_1(1) + β_2 z_k] − [α + β_1(0) + β_2 z_k] = β_1.

This logit difference equals the difference of log odds, which is the log odds ratio between X and Y, fixing Z. Thus, exp(β_1) is the conditional odds ratio between X and Y. Controlling for Z, the odds of success when X = 1 equal exp(β_1) times the odds when X = 0. This conditional odds ratio is the same at each level of Z; that is, there is homogeneous XY association (Section 2.3.5). The lack of an interaction term in (5.10) implies a common odds ratio for the partial tables. When β_1 = 0, that common odds ratio equals 1. Then X and Y are independent in each partial table, or conditionally independent, given Z (Section 2.3.4).

Additivity on the logit scale is the generally accepted definition of no interaction for categorical variables. However, one could, instead, define it as additivity on some other scale, such as with probit or identity link. Significant interaction can occur on one scale when there is none on another scale. In some applications, a particular definition may be natural. For instance, theory might assume an underlying normal distribution and predict that the probit is an additive function of predictor effects.

A factor with I categories needs I − 1 dummy variables, as we showed in Section 5.3.2.
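The homogeneous-association property is mechanical to check: under model (5.10), the XY odds ratio within each level of Z equals exp(β₁) exactly, whatever values the parameters take. A sketch with arbitrary illustrative values (ours):

```python
from math import exp

def pi(alpha, b1, b2, x, z):
    """P(Y = 1) under the main-effects logit model (5.10)."""
    eta = alpha + b1 * x + b2 * z
    return exp(eta) / (1 + exp(eta))

alpha, b1, b2 = -0.4, 0.9, -1.3      # arbitrary illustrative values

for z in (0, 1):
    odds_x1 = pi(alpha, b1, b2, 1, z) / (1 - pi(alpha, b1, b2, 1, z))
    odds_x0 = pi(alpha, b1, b2, 0, z) / (1 - pi(alpha, b1, b2, 0, z))
    # The conditional XY odds ratio is the same at each level of Z:
    assert abs(odds_x1 / odds_x0 - exp(b1)) < 1e-9
print(round(exp(b1), 3))             # the common conditional odds ratio
```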
An alternative representation of factors resembles the way that ANOVA models often express them. The model formula

    logit P(Y = 1) = α + β_i^X + β_k^Z     (5.11)

represents effects of X with parameters {β_i^X} and effects of Z with parameters {β_k^Z}. (The X and Z superscripts are merely labels and do not represent powers.) Model form (5.11) applies for any number of categories for X and Z. The parameter β_i^X denotes the effect on the logit of classification in category i of X. Conditional independence between X and Y, given Z, corresponds to β_1^X = β_2^X = ··· = β_I^X, whereby P(Y = 1) does not change as i changes. For each factor, one parameter in (5.11) is redundant. Fixing one at 0, such as β_I^X = β_K^Z = 0, represents the category not having its own dummy variable. When X and Z have two categories, the parameterization in model (5.11) then corresponds to that in model (5.10) with β_1^X = β_1 and β_2^X = 0, and with β_1^Z = β_2 and β_2^Z = 0.

5.4.2 AIDS and AZT Example

Table 5.5 is from a study on the effects of AZT in slowing the development of AIDS symptoms. In the study, 338 veterans whose immune systems were beginning to falter after infection with the AIDS virus were randomly assigned either to receive AZT immediately or to wait until their T cells showed severe immune weakness. Table 5.5 cross-classifies the veterans' race, whether they received AZT immediately, and whether they developed AIDS symptoms during the 3-year study.

In model (5.10), we identify X with AZT treatment (x_1 = 1 for immediate AZT use, x_2 = 0 otherwise) and Z with race (z_1 = 1 for whites, z_2 = 0 for blacks), for predicting the probability that AIDS symptoms developed.
Thus, α is the log odds of developing AIDS symptoms for black subjects without immediate AZT use, β_1 is the increment to the log odds for those with immediate AZT use, and β_2 is the increment to the log odds for white subjects.

TABLE 5.5  Development of AIDS Symptoms by AZT Use and Race

                             Symptoms
    Race     AZT Use      Yes       No
    White    Yes           14       93
             No            32       81
    Black    Yes           11       52
             No            12       43

    Source: New York Times, Feb. 15, 1991.

TABLE 5.6  Computer Output for Logit Model with AIDS Symptoms Data

    Goodness-of-Fit Statistics
    Criterion   DF    Value    Pr > ChiSq
    Deviance     1   1.3835      0.2395
    Pearson      1   1.3910      0.2382

    Analysis of Maximum Likelihood Estimates
    Parameter   Estimate   Std Error   Wald Chi-Square   Pr > ChiSq
    Intercept    −1.0736     0.2629        16.6705         <.0001
    azt          −0.7195     0.2790         6.6507         0.0099
    race          0.0555     0.2886         0.0370         0.8476

    Odds Ratio Estimates
    Effect   Estimate   95% Wald Confidence Limits
    azt        0.487        0.282    0.841
    race       1.057        0.600    1.861

    Profile Likelihood Confidence Interval for Odds Ratios
    Effect   Estimate   95% Confidence Limits
    azt        0.487        0.279    0.835
    race       1.057        0.605    1.884

    Obs   race   azt    y     n    pi_hat     lower     upper
     1      1     1    14   107   0.14962   0.09897   0.21987
     2      1     0    32   113   0.26540   0.19668   0.34774
     3      0     1    11    63   0.14270   0.08704   0.22519
     4      0     0    12    55   0.25472   0.16953   0.36396

Table 5.6 shows output. The estimated odds ratio between immediate AZT use and development of AIDS symptoms equals exp(−0.7195) = 0.487. For each race, the estimated odds of symptoms are half as high for those who took AZT immediately. The Wald confidence interval for this effect is exp[−0.720 ± 1.96(0.279)] = (0.28, 0.84). Similar results occur for the likelihood-based interval.

The hypothesis of conditional independence of AZT treatment and development of AIDS symptoms, controlling for race, is H₀: β_1 = 0 in (5.10). The likelihood-ratio statistic comparing model (5.10) with the simpler model having β_1 = 0 equals 6.9 (df = 1), showing evidence of association (P = 0.01). The Wald statistic (β̂_1/SE)² = (−0.720/0.279)
² = 6.65 provides similar results.

Table 5.7 shows parameter estimates for three ways of defining factor parameters in (5.11): (1) setting the last parameter equal to 0, (2) setting the first parameter equal to 0, and (3) having parameters sum to zero. For each coding scheme, at a given combination of AZT use and race, the estimated probability of developing AIDS symptoms is the same. For instance, the intercept estimate plus the estimate for immediate AZT use plus the estimate for being white is −1.738 for each scheme, so the estimated probability that white veterans with immediate AZT use develop AIDS symptoms equals exp(−1.738)/[1 + exp(−1.738)] = 0.15. The bottom of Table 5.6 shows point and interval estimates of the probabilities. Figure 5.4 shows a graphical representation of the sample proportions (the four dots) and the point estimates enclosed in 95% confidence intervals.

TABLE 5.7  Parameter Estimates for Logit Model Fitted to Table 5.5

                          Definition of Parameters
    Parameter     Last = Zero   First = Zero   Sum = Zero
    Intercept        −1.074        −1.738        −1.406
    AZT
      Yes            −0.720         0.000        −0.360
      No              0.000         0.720         0.360
    Race
      White           0.055         0.000         0.028
      Black           0.000        −0.055        −0.028

FIGURE 5.4  Estimated effects of AZT use and race on probability of developing AIDS symptoms (dots are sample proportions).

Similarly, for each coding scheme, β_1^X − β_2^X is identical and represents the conditional log odds ratio of X with the response, given Z. Here, exp(β̂_1^X − β̂_2^X) = exp(−0.720) = 0.49 estimates the common odds ratio between immediate AZT use and AIDS symptoms, for each race.

5.4.3 Goodness of Fit as a Likelihood-Ratio Test

The likelihood-ratio statistic −2(L_0 − L_1) tests whether certain model parameters are zero by comparing the log likelihood L_1 for the fitted model M_1 with L_0 for a simpler model M_0. Denote this statistic for testing M_0, given that M_1 holds, by G²(M_0 | M_1). The goodness-of-fit statistic G²(M)
is a special case in which M_0 = M and M_1 is the saturated model. In testing whether M fits, we test whether all parameters in the saturated model but not in M equal zero. The asymptotic df is the difference in the number of parameters in the two models, which is the number of binomials modeled minus the number of parameters in M.

We illustrate by checking the fit of model (5.10) for the AIDS data. For its fit, white veterans with immediate AZT use had estimated probability 0.150 of developing AIDS symptoms during the study. Since 107 white veterans took AZT, the fitted value is 107(0.150) = 16.0 for developing symptoms and 107(0.850) = 91.0 for not developing them. Similarly, one can obtain fitted values for all eight cells in Table 5.5. The goodness-of-fit statistics comparing these with the cell counts are G² = 1.38 and X² = 1.39. The model has four binomials, one at each combination of AZT use and race. Since it has three parameters, residual df = 4 − 3 = 1. The small G² and X² values suggest that the model fits decently (P > 0.2).

For model (5.10), the odds ratio between X and Y is the same at each level of Z. The goodness-of-fit test checks this structure. That is, the test also provides a test of homogeneous odds ratios. For Table 5.5, homogeneity is plausible. Since residual df = 1, the more complex model that adds an interaction term and permits the two odds ratios to differ is saturated.

Let L_S denote the maximized log likelihood for the saturated model. As discussed in Section 4.5.4, the likelihood-ratio statistic for comparing models M_1 and M_0 is

    G²(M_0 | M_1) = −2(L_0 − L_1) = −2(L_0 − L_S) − [−2(L_1 − L_S)] = G²(M_0) − G²(M_1).

The test statistic comparing two models is identical to the difference in G² goodness-of-fit statistics (deviances) for the two models. To illustrate, consider H₀: β_2 = 0 for the race effect with the AIDS data. The likelihood-ratio statistic equals 0.04, suggesting that the simpler model is adequate.
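The fit and deviance discussed above can be reproduced from Table 5.5 alone using the Newton–Raphson method of Section 4.6; a sketch (ours, not the book's code):

```python
import numpy as np

# Grouped binomial data from Table 5.5; design columns: intercept, azt, race.
X = np.array([[1.0, 1.0, 1.0],   # white, immediate AZT
              [1.0, 0.0, 1.0],   # white, later AZT
              [1.0, 1.0, 0.0],   # black, immediate AZT
              [1.0, 0.0, 0.0]])  # black, later AZT
y = np.array([14.0, 32.0, 11.0, 12.0])    # developed symptoms
n = np.array([107.0, 113.0, 63.0, 55.0])  # group sizes

# Newton-Raphson (equivalently, iteratively reweighted least squares):
beta = np.zeros(3)
for _ in range(25):
    pi = 1 / (1 + np.exp(-X @ beta))
    W = n * pi * (1 - pi)              # binomial variances
    score = X.T @ (y - n * pi)
    info = X.T @ (X * W[:, None])      # Fisher information
    beta += np.linalg.solve(info, score)

print(np.round(beta, 4))   # ≈ [-1.0736, -0.7195, 0.0555], as in Table 5.6

# Deviance G2(M) against the saturated model (residual df = 4 - 3 = 1):
pi = 1 / (1 + np.exp(-X @ beta))
G2 = 2 * float(np.sum(y * np.log(y / (n * pi))
                      + (n - y) * np.log((n - y) / (n - n * pi))))
print(round(G2, 4))        # ≈ 1.3835
```

Refitting without the race column and differencing the two deviances reproduces the 0.04 comparison statistic in the same way.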
This equals G²(M_0) − G²(M_1) = 1.42 − 1.38, where M_0 is the simpler model with β_2 = 0.

The model comparison statistic often has an approximate chi-squared null distribution even when separate G²(M_i) do not. For instance, when a predictor is continuous or a contingency table has very small fitted values, the sampling distribution of G²(M_i) may be far from chi-squared. Nonetheless, if df for the comparison statistic is modest (as in comparing two models that differ by a few parameters), the null distribution of G²(M_0 | M_1) is approximately chi-squared.

5.4.4 Horseshoe Crab Example Revisited

Like ordinary regression, logistic regression can have a mixture of quantitative and qualitative predictors. We illustrate with the horseshoe crab data (Section 5.1.3), using the female crab's width and color as predictors. Color has five categories: light, medium light, medium, medium dark, dark. It is a surrogate for age, older crabs tending to be darker. The sample contained no light crabs, so our models use only the other four categories.

We first treat color as qualitative. The four categories use three dummy variables. The model is

    logit(π) = α + β_1 c_1 + β_2 c_2 + β_3 c_3 + β_4 x,     (5.12)

where π = P(Y = 1), x = width in centimeters, and

    c_1 = 1 for medium-light color, and 0 otherwise,
    c_2 = 1 for medium color, and 0 otherwise,
    c_3 = 1 for medium-dark color, and 0 otherwise.

The crab color is dark (category 4) when c_1 = c_2 = c_3 = 0. Table 5.8 shows the ML parameter estimates. For instance, for dark crabs, logit(π̂) = −12.715 + 0.468x; by contrast, for medium-light crabs, c_1 = 1, and logit(π̂) = (−12.715 + 1.330) + 0.468x = −11.385 + 0.468x. At the average width of 26.3 cm, π̂ = 0.399 for dark crabs and 0.715 for medium-light crabs.

The model assumes a lack of interaction between color and width in their effects. Width has the same coefficient (0.468)
for all colors, so the shapes of the curves relating width to π are identical. For each color, a 1-cm increase in width has a multiplicative effect of exp(0.468) = 1.60 on the odds that Y = 1. Figure 5.5 displays the fitted model. Any one curve equals any other curve shifted to the right or left.

TABLE 5.8  Computer Output for Model with Width and Color Predictors

    Criteria For Assessing Goodness Of Fit
    Criterion            DF        Value
    Deviance            168     187.4570
    Pearson Chi-Square  168     168.6590
    Log Likelihood               −93.7285

    Parameter   Estimate   Standard Error   Likelihood-Ratio 95% Confidence Limits   Chi-Square   Pr>ChiSq
    intercept   −12.7151       2.7618          −18.4564   −7.5788                       21.20      <.0001
    c1            1.3299       0.8525           −0.2738    3.1354                        2.43      0.1188
    c2            1.4023       0.5484            0.3527    2.5260                        6.54      0.0106
    c3            1.1061       0.5921           −0.0279    2.3138                        3.49      0.0617
    width         0.4680       0.1055            0.2713    0.6870                       19.66      <.0001

FIGURE 5.5  Logistic regression model using width and color predictors of satellite presence for horseshoe crabs.

The parallelism of curves in the horizontal dimension implies that any two curves never cross. At all width values, color 4 (dark) has a lower estimated probability of a satellite than the other colors. There is a noticeable positive effect of width.

The exponentiated difference between two color parameter estimates is an odds ratio comparing those colors. For instance, the difference for medium-light crabs and dark crabs equals 1.330. At any given width, the estimated odds that a medium-light crab has a satellite are exp(1.330) = 3.8 times the estimated odds for a dark crab. At width x = 26.3, the odds equal 0.715/0.285 = 2.51 for a medium-light crab and 0.399/0.601 = 0.66 for a dark crab, for which 2.51/0.66 = 3.8.

5.4.5 Model Comparison

To test whether color contributes significantly to model (5.12), we test H₀: β_1 = β_2 = β_3 = 0. This states that controlling for width, the probability of a satellite is independent of color. We compare the maximized log-likelihood L_1 for the full model (5.12)
to L_0 for the simpler model. The test statistic −2(L_0 − L_1) = 7.0 has df = 3, the difference between the numbers of parameters in the two models. The chi-squared P-value of 0.07 provides slight evidence of a color effect.

The more complex model allowing color × width interaction has three additional terms, the cross-products of width with the color dummy variables. Fitting this model is equivalent to fitting the logistic regression with width predictor separately for crabs of each color. Each color then has a different-shaped curve relating width to P(Y = 1), so a comparison of two colors varies according to the width value. The likelihood-ratio statistic comparing the models with and without the interaction terms equals 4.4, with df = 3. The evidence of interaction is weak (P = 0.22).

5.4.6 Quantitative Treatment of Ordinal Predictor

Color has ordered categories, from lightest to darkest. A simpler model yet treats this predictor as quantitative. Color may have a linear effect, for a set of monotone scores. To illustrate, for scores c = {1, 2, 3, 4} for the color categories, the model

    logit(π) = α + β_1 c + β_2 x     (5.13)

has β̂_1 = −0.509 (SE = 0.224) and β̂_2 = 0.458 (SE = 0.104). This shows strong evidence of an effect for each. At a given width, for every one-category increase in color darkness, the estimated odds of a satellite multiply by exp(−0.509) = 0.60.

The likelihood-ratio statistic comparing this fit to the more complex model (5.12) having a separate parameter for each color equals 1.7 (df = 2). This statistic tests that the simpler model (5.13) is adequate, given that model (5.12) holds. It tests that when plotted against the color scores, the color parameters in (5.12) follow a linear trend. The simplification seems permissible (P = 0.44).

The color parameter estimates in the qualitative-color model (5.12) are (1.33, 1.40, 1.11, 0), the 0 value for the dark category reflecting its lack of a dummy variable.
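These estimates translate into fitted probabilities and odds ratios directly; a sketch (ours) using the Table 5.8 values to reproduce the color comparisons quoted in Section 5.4.4:

```python
from math import exp

# ML estimates from Table 5.8 for model (5.12).
intercept, c1, c2, c3, b_width = -12.7151, 1.3299, 1.4023, 1.1061, 0.4680

def pihat(color_term, x):
    """Estimated P(satellite) at width x, given a color's parameter."""
    eta = intercept + color_term + b_width * x
    return exp(eta) / (1 + exp(eta))

x = 26.3                            # mean width
p_dark = pihat(0.0, x)              # dark: c1 = c2 = c3 = 0
p_mlight = pihat(c1, x)             # medium light
print(round(p_dark, 2), round(p_mlight, 2))
# ≈ 0.40 and 0.72 (0.399 and 0.715 before rounding the estimates)

# At any fixed width, the medium-light versus dark odds ratio is exp(c1):
odds_ratio = (p_mlight / (1 - p_mlight)) / (p_dark / (1 - p_dark))
print(round(odds_ratio, 1))         # ≈ 3.8 = exp(1.3299)
```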
Although these color estimates do not depart significantly from a linear trend, the first three are quite similar compared to the last one. Thus, another potential color scoring for model (5.13) is {1, 1, 1, 0}; that is, score = 0 for dark-colored crabs, and score = 1 otherwise. The likelihood-ratio statistic comparing model (5.13) with these binary scores to model (5.12) equals 0.5 (df = 2), showing that this simpler model is also adequate. Its fit is

    logit(π̂) = −12.980 + 1.300c + 0.478x,     (5.14)

with standard errors 0.526 and 0.104. At a given width, the estimated odds that a lighter-colored crab has a satellite are exp(1.300) = 3.7 times the estimated odds for a dark crab.

In summary, the qualitative-color model, the quantitative-color model with scores {1, 2, 3, 4}, and the model with binary color scores {1, 1, 1, 0} all suggest that dark crabs are least likely to have satellites. A much larger sample is needed to determine which color scoring is most appropriate. It is advantageous to treat ordinal predictors in a quantitative manner when such models fit well. The model is simpler and easier to interpret, and tests of the predictor effect are more powerful when it has a single parameter rather than several parameters. In Section 6.4 we discuss this issue further.

5.4.7 Standardized and Probability-Based Interpretations

To compare effects of quantitative predictors having different units, it can be helpful to report standardized coefficients. One approach fits the model to standardized predictors, replacing each x_j by (x_j − x̄_j)/s_{x_j}. Then, each regression coefficient represents the effect of a standard deviation change in a predictor, controlling for the other variables. Equivalently, for each j one can multiply the unstandardized estimate β̂_j by s_{x_j} (see also Note 5.9).

Regardless of the units, many find it difficult to understand odds or odds ratio effects.
The simpler interpretation of the approximate change in the probability based on a linearization of the model (Section 5.1.1) applies also to multiple predictors. Consider a setting of predictors at which P̂(Y = 1) = π̂. Then, controlling for the other predictors, a 1-unit increase in x_j corresponds approximately to a β̂_j π̂(1 − π̂) change in π̂. For instance, at predictor settings at which π̂ = 0.5 for fit (5.14), the approximate effect of a 1-cm increase in width is (0.478)(0.5)(0.5) = 0.12. This is considerable, since a 1-cm change in width is less than half a standard deviation. This linear approximation deteriorates as the change in the predictor increases.

More precise interpretations use the probability formula directly. To describe the effect of x_j, one could set the other predictors at their sample means and compute the estimated probabilities at the smallest and largest x_j values. These are sensitive to outliers, however. It is often more sensible to use the quartiles. For fit (5.14), the sample means are 26.3 for x and 0.873 for c. The lower and upper quartiles of x are 24.9 and 27.7. At x = 24.9 and c = c̄, π̂ = 0.51. At x = 27.7 and c = c̄, π̂ = 0.80. The change in π̂ from 0.51 to 0.80 over the middle 50% of the range of width values reflects a strong width effect.

Since c takes only values 0 and 1, one could instead report this effect separately for each. Also, when an explanatory variable is a dummy, it makes sense to report the estimated probabilities at its two values rather than at quartiles, which could be identical. At x = 26.3, π̂ = 0.40 when c = 0 and π̂ = 0.71 when c = 1. This color effect, differentiating dark crabs from others, is also substantial.

Table 5.9 shows a way to present effects that can be understandable to those not familiar with odds ratios. It also shows results of the extension of model (5.14) permitting interaction. The estimated width effect is then greater for the lighter-colored crabs.
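The quartile and dummy-variable probability summaries above follow directly from the fitted equation (5.14); a minimal sketch of the computations (Python):

```python
import math

def pi_hat(c, x):
    """Estimated P(Y = 1) from fit (5.14): logit = -12.980 + 1.300 c + 0.478 x."""
    eta = -12.980 + 1.300 * c + 0.478 * x
    return 1.0 / (1.0 + math.exp(-eta))

c_bar = 0.873                 # sample mean of the binary color score
lq, uq = 24.9, 27.7           # lower and upper quartiles of width

width_effect = pi_hat(c_bar, uq) - pi_hat(c_bar, lq)   # 0.80 - 0.51 = 0.29
color_effect = pi_hat(1, 26.3) - pi_hat(0, 26.3)       # 0.71 - 0.40 = 0.31

# Linear approximation near pi = 0.5: the slope of pi in x is beta * pi * (1 - pi)
approx_slope = 0.478 * 0.5 * 0.5                       # about 0.12 per cm of width
```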
However, the interaction is not significant.

TABLE 5.9 Summary of Effects in Model (5.14) with Crab Width and Color as Predictors of Presence of Satellites

  No interaction model
  Variable                       Estimate      SE     Comparison          Change in Probability
  Intercept                      −12.980     2.727
  Color (0 = dark, 1 = other)      1.300     0.526    (1, 0) at x̄         0.31 = 0.71 − 0.40
  Width, x (cm)                    0.478     0.104    (UQ, LQ) at c̄       0.29 = 0.80 − 0.51

  Interaction model
  Intercept                       −5.854     6.694
  Color (0 = dark, 1 = other)     −6.958     7.318
  Width, x (cm)                    0.200     0.262    (UQ, LQ) at c = 0   0.13 = 0.43 − 0.30
  Width × color                    0.322     0.286    (UQ, LQ) at c = 1   0.29 = 0.84 − 0.55

5.5 FITTING LOGISTIC REGRESSION MODELS

The mechanics of ML estimation and model fitting for logistic regression are special cases of the GLM fitting results of Section 4.6. With n subjects, we treat the n binary responses as independent. Let x_i = (x_i1, . . . , x_ip) denote setting i of values of p explanatory variables, i = 1, . . . , N. When explanatory variables are continuous, a different setting may occur for each subject, in which case N = n. The logistic regression model (5.8), regarding α as a regression parameter with unit coefficient, is

    π(x_i) = exp(Σ_j β_j x_ij) / [1 + exp(Σ_j β_j x_ij)],  j = 1, . . . , p.    (5.15)

5.5.1 Likelihood Equations

When more than one observation occurs at a fixed x_i value, it is sufficient to record the number of observations n_i and the number of successes. We then let y_i refer to this success count rather than to an individual binary response. Then {Y₁, . . . , Y_N} are independent binomials with E(Y_i) = n_i π(x_i), where n₁ + ··· + n_N = n. Their joint probability mass function is proportional to the product of N binomial functions,

    Π_i π(x_i)^{y_i} [1 − π(x_i)]^{n_i − y_i}
      = {Π_i exp[y_i log(π(x_i)/(1 − π(x_i)))]} {Π_i [1 − π(x_i)]^{n_i}}
      = {exp[Σ_i y_i log(π(x_i)/(1 − π(x_i)))]} {Π_i [1 − π(x_i)]^{n_i}}.
For model (5.15), the ith logit is Σ_j β_j x_ij, so the exponential term in the last expression equals exp[Σ_i y_i (Σ_j β_j x_ij)] = exp[Σ_j (Σ_i y_i x_ij) β_j]. Also, since 1 − π(x_i) = [1 + exp(Σ_j β_j x_ij)]^{−1}, the log likelihood equals

    L(β) = Σ_j (Σ_i y_i x_ij) β_j − Σ_i n_i log[1 + exp(Σ_j β_j x_ij)].    (5.16)

This depends on the binomial counts only through the sufficient statistics {Σ_i y_i x_ij, j = 1, . . . , p}.

The likelihood equations result from setting ∂L(β)/∂β = 0. Since

    ∂L(β)/∂β_j = Σ_i y_i x_ij − Σ_i n_i x_ij exp(Σ_k β_k x_ik) / [1 + exp(Σ_k β_k x_ik)],

the likelihood equations are

    Σ_i y_i x_ij − Σ_i n_i π̂_i x_ij = 0,  j = 1, . . . , p,    (5.17)

where π̂_i = exp(Σ_k β̂_k x_ik)/[1 + exp(Σ_k β̂_k x_ik)] is the ML estimate of π(x_i). We observed these equations as a special case of those for binomial GLMs in (4.25) (but there y_i is the proportion of successes). The equations are nonlinear and require iterative solution.

Let X denote the N × p matrix of values of {x_ij}. The likelihood equations (5.17) have the form

    X′y = X′μ̂,    (5.18)

where μ̂_i = n_i π̂_i. This equation illustrates a fundamental result: For GLMs with canonical link, the likelihood equations equate the sufficient statistics to the estimates of their expected values. Equation (4.44) showed this result in the GLM context, and (5.18) are the normal equations in ordinary regression.

5.5.2 Asymptotic Covariance Matrix of Parameter Estimators

The ML estimators β̂ have a large-sample normal distribution with covariance matrix equal to the inverse of the information matrix. The observed information matrix has elements

    −∂²L(β)/∂β_a ∂β_b = Σ_i x_ia x_ib n_i exp(Σ_j β_j x_ij) / [1 + exp(Σ_j β_j x_ij)]²
                      = Σ_i x_ia x_ib n_i π_i (1 − π_i).    (5.19)

This is not a function of {y_i}, so the observed and expected information are identical.
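Since the likelihood equations are nonlinear, a small numerical sketch may help fix ideas. The following illustration (Python, with hypothetical grouped binomial data) solves (5.17) by the Newton iteration detailed in Section 5.5.4 and then confirms the sufficient-statistic identity (5.18):

```python
import math

# Hypothetical grouped binomial data for illustration: rows (x_i, n_i, y_i).
# The design row for setting i is (1, x_i): an intercept plus one predictor.
data = [(1.0, 6, 1), (2.0, 8, 3), (3.0, 7, 4), (4.0, 6, 5)]

def fitted_probs(b0, b1):
    # (5.15): pi_i = exp(b0 + b1*x_i) / [1 + exp(b0 + b1*x_i)]
    return [math.exp(b0 + b1 * x) / (1.0 + math.exp(b0 + b1 * x))
            for x, _, _ in data]

b0, b1 = 0.0, 0.0
for _ in range(25):                        # Newton iterations (Section 5.5.4)
    pi = fitted_probs(b0, b1)
    # Score components sum_i (y_i - n_i pi_i) x_ij, the left side of (5.17)
    u0 = sum(y - n * p for (x, n, y), p in zip(data, pi))
    u1 = sum((y - n * p) * x for (x, n, y), p in zip(data, pi))
    # Information elements sum_i x_ia x_ib n_i pi_i (1 - pi_i), as in (5.19)
    w = [n * p * (1.0 - p) for (x, n, y), p in zip(data, pi)]
    i00 = sum(w)
    i01 = sum(wi * x for wi, (x, n, y) in zip(w, data))
    i11 = sum(wi * x * x for wi, (x, n, y) in zip(w, data))
    det = i00 * i11 - i01 * i01
    b0 += (i11 * u0 - i01 * u1) / det      # Newton step: add I^{-1} u
    b1 += (i00 * u1 - i01 * u0) / det

# At the ML estimate the sufficient statistics equal their fitted expected
# values, the matrix form of which is (5.18): X'y = X'mu_hat.
pi = fitted_probs(b0, b1)
suff_y = sum(y for _, _, y in data)                       # intercept column of X'y
suff_xy = sum(x * y for x, _, y in data)                  # predictor column of X'y
fit_y = sum(n * p for (_, n, _), p in zip(data, pi))
fit_xy = sum(x * n * p for (x, n, _), p in zip(data, pi))
```

Note that the information elements computed from (5.19) involve the data only through {n_i} and the current {π_i}, not {y_i}, so observed and expected information coincide.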
This happens for all GLMs that use canonical links (Section 4.6.4).

The estimated covariance matrix is the inverse of the matrix having elements (5.19), substituting β̂. This has the form

    côv(β̂) = {X′ diag[n_i π̂_i (1 − π̂_i)] X}^{−1},    (5.20)

where diag[n_i π̂_i (1 − π̂_i)] denotes the N × N diagonal matrix having {n_i π̂_i (1 − π̂_i)} on the main diagonal. This is the special case of the GLM covariance matrix (4.28) with estimated diagonal weight matrix Ŵ having elements ŵ_i = n_i π̂_i (1 − π̂_i). The square roots of the main diagonal elements of (5.20) are estimated standard errors of β̂.

5.5.3 Distribution of Probability Estimators

Using côv(β̂), one can conduct inference about β and related effects such as odds ratios. One can also construct confidence intervals for response probabilities π(x) at particular settings x.

The estimated variance of logit[π̂(x)] = xβ̂ is x côv(β̂) x′. For large samples, logit[π̂(x)] ± z_{α/2} √[x côv(β̂) x′] is a confidence interval for the true logit. The endpoints invert to a corresponding interval for π(x) using the transform π = exp(logit)/[1 + exp(logit)].

5.5.4 Newton–Raphson Method Applied to Logistic Regression

We refer back to Section 4.6.1 for the Newton–Raphson iterative method. Let

    u_j^(t) = ∂L(β)/∂β_j |_{β^(t)} = Σ_i (y_i − n_i π_i^(t)) x_ij,

    h_ab^(t) = ∂²L(β)/∂β_a ∂β_b |_{β^(t)} = −Σ_i x_ia x_ib n_i π_i^(t) (1 − π_i^(t)).

Here π^(t), approximation t for π̂, is obtained from β^(t) through

    π_i^(t) = exp(Σ_j β_j^(t) x_ij) / [1 + exp(Σ_j β_j^(t) x_ij)].    (5.21)

We use u^(t) and H^(t) with formula (4.39) to obtain the next value β^(t+1), which in this context is

    β^(t+1) = β^(t) + {X′ diag[n_i π_i^(t) (1 − π_i^(t))] X}^{−1} X′(y − μ^(t)),    (5.22)

where μ_i^(t) = n_i π_i^(t). This is used to obtain π^(t+1), and so forth. With an initial guess β^(0), (5.21) yields π^(0)
, and for t > 0 the iterations proceed as just described using (5.22) and (5.21). In the limit, π^(t) and β^(t) converge to the ML estimates π̂ and β̂ (Walker and Duncan 1967). The H^(t) matrices converge to Ĥ = −X′ diag[n_i π̂_i (1 − π̂_i)] X. By (5.20), the estimated asymptotic covariance matrix of β̂ is a by-product of the Newton–Raphson method, namely −Ĥ^{−1}.

From the argument in Section 4.6.3, β^(t+1) has the iterative reweighted least squares form (X′ V_t^{−1} X)^{−1} X′ V_t^{−1} z^(t), where z^(t) has elements

    z_i^(t) = log[π_i^(t)/(1 − π_i^(t))] + (y_i − n_i π_i^(t)) / [n_i π_i^(t) (1 − π_i^(t))],    (5.23)

and where V_t is a diagonal matrix with elements {1/[n_i π_i^(t) (1 − π_i^(t))]}. In this expression, z^(t) is the linearized form of the logit link function for the sample data, evaluated at π^(t) [see (4.42)]. From Section 3.1.6, the elements of V_t are estimated asymptotic variances of the sample logits. The ML estimate is the limit of a sequence of weighted least squares estimates, where the weight matrix changes at each cycle.

5.5.5 Convergence and Existence of Finite Estimates

The log-likelihood function for logistic regression models is strictly concave. ML estimates exist and are unique except in certain boundary cases (Haberman 1974a; Wedderburn 1976; Albert and Anderson 1984). Estimates do not exist or may be infinite when there is no overlap in the sets of explanatory variable values having y = 0 and having y = 1; that is, when a hyperplane can pass through the space of predictor values such that on one side of that hyperplane y = 0 for all observations, whereas on the other side, y = 1 always. There is then perfect discrimination, as one can predict the sample outcomes perfectly by knowing the predictor values (except possibly at a boundary point). When there is overlap, ML estimates exist and are unique. Similar results occur for the probit and some other links (Silvapulle 1981). Figure 5.6 illustrates for a single explanatory variable.
Here, y = 0 at x = 10, 20, 30, 40, and y = 1 at x = 60, 70, 80, 90. An ideal fit has π̂ = 0 for x ≤ 40 and π̂ = 1 for x ≥ 60. By letting β̂ → ∞ and, for fixed β̂, letting α̂ = −β̂(50) so that π̂ = 0.5 at x = 50, one generates a sequence with ever-increasing value of the likelihood that comes successively closer to a perfect fit.

In practice, most software fails to recognize that β̂ = ∞. After a few cycles of iterative fitting, the log likelihood looks flat at the working estimate, and convergence criteria are satisfied. Because the log likelihood is so flat and because variances come from the inverse of the matrix of negative second derivatives, software typically reports huge standard errors. For these data, for instance, PROC GENMOD in SAS reports logit(π̂) = −192.2 + 3.8x with standard errors of 8.0 × 10⁸ and 1.5 × 10⁷.

FIGURE 5.6 Perfect discrimination resulting in infinite logistic regression parameter estimate.

NOTES

Section 5.1: Interpreting Parameters in Logistic Regression

5.1. Books focusing on applied logistic regression include Collett (1991) and Hosmer and Lemeshow (2000). Books having major components on logistic regression include Christensen (1997), Cox and Snell (1989), and Morgan (1992). Prentice (1976b) and Stukel (1988) extended the scope by introducing shape parameters that modify the behavior of the curve in extreme probability regions and allow for asymmetric treatment of the two tails.

5.2. Haldane (1956) recommended adding 1/2 to the numerator and denominator of the sample logit. With this modification, the bias is on the order of only 1/n_i², for large n_i (see Firth 1993a and Problem 14.4).

5.3. The Cornfield (1962) result that normal distributions for (X | Y = i) imply the logistic curve for P(Y = 1 | x) suggests that logistic regression is useful in discrimination and classification problems. These use a subject's x value to predict to which of two populations the subject belongs.
Anderson (1975), Bull and Donner (1987), Efron (1975), and Press and Wilson (1978) compared logistic regression favorably to discriminant analysis, which assumes that explanatory variables have a normal distribution at each level of Y.

5.4. Rosenbaum and Rubin (1983) used logistic regression to adjust for bias in comparing two groups in observational studies. They defined the propensity as the probability of being in one group, for a given setting of the explanatory variables x, and they used logistic regression to estimate how the propensity depends on x. In comparing the groups on the response variable, they showed that one can control for differing distributions of the groups on x by adjusting for the estimated propensity. This is done by using the propensity to match samples from the groups, to subclassify subjects into several strata consisting of intervals of propensity scores, or to adjust directly by entering the propensity in the model. See D'Agostino (1998) for a tutorial.

5.5. Abdelbasit and Plackett (1983), Chaloner and Larntz (1988), Minkin (1987), and Wu (1985) discussed design problems for binary response experiments, such as choosing settings for a predictor to optimize a criterion for estimating parameter values or estimating the setting at which the response probability equals some fixed value. The nonconstant variance makes this challenging.

Section 5.2: Inference for Logistic Regression

5.6. Albert and Anderson (1984), Berkson (1951, 1953, 1955), Cox (1958a), Hodges (1958), and Walker and Duncan (1967) discussed ML estimation for logistic regression. For adjustments with complex sample surveys, see Hosmer and Lemeshow (2000, Sec. 6.4) and LaVange et al. (2001). Scott and Wild (2001) discussed the analysis of case-control studies with complex sampling designs.

5.7. Tsiatis (1980)
suggested an alternative goodness-of-fit test that partitions values for the explanatory variables into a set of regions and adds a dummy variable to the model for each region. The test statistic compares the fit of this model to the simpler one, testing that the extra parameters are not needed. The idea of grouping values to check model fit by comparing observed and fitted counts extends to any GLM (Pregibon 1982). Hosmer et al. (1997) compared various ways of doing this.

Section 5.3: Logit Models with Categorical Predictors

5.8. The Cochran–Armitage trend test is locally asymptotically efficient for both linear and logistic alternatives for P(Y = 1). Its efficiency against linear alternatives follows from the approximate normality of the sample proportions, with constant Bernoulli variance when β = 0. For the linear logit model (5.5), its efficiency follows from its equivalence with the score test. See Problem 9.35 and Cox (1958a) for related remarks. Tarone and Gart (1980) showed that the score test for a binary linear trend model does not depend on the link function. Gross (1981) noted that for the linear logit model, the local asymptotic relative efficiency for testing independence using the statistic with an incorrect set of scores equals the square of the Pearson correlation between the true and incorrect scores. Simon (1978) gave related asymptotic results. Corcoran et al. (2001), Mantel (1963), and Podgor et al. (1996) extended the trend test.

Section 5.4: Multiple Logistic Regression

5.9. Since the standardized logistic cdf has standard deviation π/√3, some software (e.g., PROC LOGISTIC in SAS) defines a standardized estimate by multiplying the unstandardized estimate by s_{x_j} √3/π.

PROBLEMS

Applications

5.1 For a study using logistic regression to determine characteristics associated with remission in cancer patients, Table 5.10 shows the most important explanatory variable, a labeling index (LI).
This index measures proliferative activity of cells after a patient receives an injection of tritiated thymidine, representing the percentage of cells that are ''labeled.'' The response Y measured whether the patient achieved remission (1 = yes). Software reports Table 5.11 for a logistic regression model using LI to predict the probability of remission.

TABLE 5.10 Data for Problem 5.1

  LI    Number of Cases    Number of Remissions
   8          2                    0
  10          2                    0
  12          3                    0
  14          3                    0
  16          3                    0
  18          1                    1
  20          3                    2
  22          2                    1
  24          1                    0
  26          1                    1
  28          1                    1
  32          1                    0
  34          1                    1
  38          3                    2

Source: Data reprinted with permission from E. T. Lee, Comput. Prog. Biomed. 4: 80–92 (1974).

TABLE 5.11 Computer Output for Problem 5.1

  Criterion    Intercept Only    Intercept and Covariates
  −2 Log L         34.372                26.073

  Testing Global Null Hypothesis: BETA = 0
  Test                Chi-Square    DF    Pr > ChiSq
  Likelihood Ratio      8.2988       1      0.0040
  Score                 7.9311       1      0.0049
  Wald                  5.9594       1      0.0146

  Parameter    Estimate    Standard Error    Chi-Square    Pr > ChiSq
  Intercept     −3.7771        1.3786          7.5064        0.0061
  li             0.1449        0.0593          5.9594        0.0146

  Odds Ratio Estimates
  Effect    Point Estimate    95% Wald Confidence Limits
  li            1.156              1.029    1.298

  Estimated Covariance Matrix
  Variable     Intercept         li
  Intercept    1.900616      −0.07653
  li          −0.07653        0.003521

  Obs    li    remiss    n    pi_hat     lower      upper
   1      8      0       2    0.06797    0.01121    0.31925
   2     10      0       2    0.08879    0.01809    0.34010

a. Show how software obtained π̂ = 0.068 when LI = 8.
b. Show that π̂ = 0.5 when LI = 26.0.
c. Show that the rate of change in π̂ is 0.009 when LI = 8 and 0.036 when LI = 26.
d. The lower quartile and upper quartile for LI are 14 and 28. Show that π̂ increases by 0.42, from 0.15 to 0.57, between those values.
e. For a unit change in LI, show that the estimated odds of remission multiply by 1.16.
f. Explain how to obtain the confidence interval reported for the odds ratio. Interpret.
g. Construct a Wald test for the effect. Interpret.
h.
Conduct a likelihood-ratio test for the effect, showing how to construct the test statistic using the −2 log L values reported.
i. Show how software obtained the confidence interval for π reported at LI = 8. (Hint: Use the reported covariance matrix.)

TABLE 5.12 Data for Problem 5.2ᵃ

  Ft  Temp  TD    Ft  Temp  TD    Ft  Temp  TD    Ft  Temp  TD    Ft  Temp  TD
   1   66    0     2   70    1     3   69    0     4   68    0     5   67    0
   6   72    0     7   73    0     8   70    0     9   57    1    10   63    1
  11   70    1    12   78    0    13   67    0    14   53    1    15   67    0
  16   75    0    17   70    0    18   81    0    19   76    0    20   79    0
  21   75    1    22   76    0    23   58    1

ᵃ Ft, flight number; Temp, temperature (°F); TD, thermal distress (1, yes; 0, no).
Source: Data based on Table 1 in J. Amer. Statist. Assoc. 84: 945–957 (1989), by S. R. Dalal, E. B. Fowlkes, and B. Hoadley. Reprinted with permission from the Journal of the American Statistical Association.

5.2 For the 23 space shuttle flights before the Challenger mission disaster in 1986, Table 5.12 shows the temperature at the time of the flight and whether at least one primary O-ring suffered thermal distress.
a. Use logistic regression to model the effect of temperature on the probability of thermal distress. Plot a figure of the fitted model, and interpret.
b. Estimate the probability of thermal distress at 31°F, the temperature at the place and time of the Challenger flight.
c. Construct a confidence interval for the effect of temperature on the odds of thermal distress, and test the statistical significance of the effect.
d. Check the model fit by comparing it to a more complex model.

5.3 Refer to Table 4.2. Using scores {0, 2, 4, 5} for snoring, fit the logistic regression model. Interpret using fitted probabilities, linear approximations, and effects on the odds. Analyze the goodness of fit.

5.4 Hastie and Tibshirani (1990, p. 282) described a study to determine risk factors for kyphosis, severe forward flexion of the spine following corrective spinal surgery.
The ages in months at the time of the operation for the 18 subjects for whom kyphosis was present were

12, 15, 42, 52, 59, 73, 82, 91, 96, 105, 114, 120, 121, 128, 130, 139, 139, 157

and for the 22 subjects for whom kyphosis was absent were

1, 1, 2, 8, 11, 18, 22, 31, 37, 61, 72, 81, 97, 112, 118, 127, 131, 140, 151, 159, 177, 206.

a. Fit a logistic regression model using age as a predictor of whether kyphosis is present. Test whether age has a significant effect.
b. Plot the data. Note the difference in dispersion on age at the two levels of kyphosis. Fit the model logit[π(x)] = α + β₁x + β₂x². Test the significance of the squared age term, plot the fit, and interpret. (Note also Problem 5.33.)

5.5 Refer to Table 6.11. The Pearson test of independence has X²(I) = 6.88 (P = 0.14). For equally spaced scores, the Cochran–Armitage trend test has z² = 6.67 (P = 0.01). Interpret, and explain why results differ so. Analyze the data using a linear logit model. Test independence using the Wald and likelihood-ratio tests, and compare results to the Cochran–Armitage test. Check the fit of the model, and interpret.

5.6 For Table 5.3, conduct the trend test using alcohol consumption scores (1, 2, 3, 4, 5) instead of (0.0, 0.5, 1.5, 4.0, 7.0). Compare results, noting the sensitivity to the choice of scores for highly unbalanced data.

5.7 Refer to Table 2.11. Using scores (0, 3, 9.5, 19.5, 37, 55) for cigarette smoking, analyze these data using a logit model. Is the intercept estimate meaningful? Explain.

5.8 A study used the 1998 Behavioral Risk Factors Social Survey to consider factors associated with women's use of oral contraceptives in the United States. Table 5.13 summarizes effects for a logistic regression model for the probability of using oral contraceptives. Each predictor uses a dummy variable, and the table lists the category having dummy outcome 1. Interpret effects.
Construct and interpret a confidence interval for the conditional odds ratio between contraceptive use and education.

TABLE 5.13 Data for Problem 5.8

  Variable          Coding = 1 if:       Estimate      SE
  Age               35 or younger         −1.320     0.087
  Race              White                  0.622     0.098
  Education         ≥ 1 year college       0.501     0.077
  Marital status    Married               −0.460     0.073

Source: Data courtesy of Debbie Wilson, College of Pharmacy, University of Florida.

TABLE 5.14 Computer Output for Problem 5.9

  Criteria For Assessing Goodness Of Fit
  Criterion             DF       Value
  Deviance               1      0.3798
  Pearson Chi-Square     1      0.1978
  Log Likelihood             −209.4783

  Parameter    Estimate    Standard Error    Likelihood Ratio 95% Conf Limits    Chi-Square
  Intercept     −3.5961        0.5069          −4.7754    −2.7349                  50.33
  def           −0.8678        0.3671          −1.5633    −0.1140                   5.59
  vic            2.4044        0.6006           1.3068     3.7175                  16.03

  LR Statistics
  Source    DF    Chi-Square    Pr > ChiSq
  def        1       5.01         0.0251
  vic        1      20.35         <.0001

5.9 Refer to Table 2.6. Table 5.14 shows the results of fitting a logit model, treating death penalty as the response (1 = yes) and defendant's race (1 = white) and victims' race (1 = white) as dummy predictors.
a. Interpret parameter estimates. Which group is most likely to have the yes response? Find the estimated probability in that case.
b. Interpret 95% confidence intervals for conditional odds ratios.
c. Test the effect of defendant's race, controlling for victims' race, using (i) a Wald test, and (ii) a likelihood-ratio test. Interpret.
d. Test the goodness of fit. Interpret.

5.10 Model the effects of victim's race and defendant's race for Table 2.13. Interpret.

5.11 Table 5.15 appeared in a national study of 15- and 16-year-old adolescents. The event of interest is ever having sexual intercourse. Analyze,

TABLE 5.15 Data for Problem 5.11

                        Intercourse
  Race     Gender      Yes      No
  White    Male         43     134
           Female       26     149
  Black    Male         29      23
           Female       22      36

Source: S. P. Morgan and J. D. Teachman, J. Marriage Fam. 50: 929–936 (1988). Reprinted with permission from the National Council on Family Relations.
including description and inference about the effects of gender and race, goodness of fit, and summary interpretations.

5.12 According to the Independent newspaper (London, Mar. 8, 1994), the Metropolitan Police in London reported 30,475 people as missing in the year ending March 1993. For those of age 13 or less, 33 of 3271 missing males and 38 of 2486 missing females were still missing a year later. For ages 14 to 18, the values were 63 of 7256 males and 108 of 8877 females; for ages 19 and above, the values were 157 of 5065 males and 159 of 3520 females. Analyze and interpret. (Thanks to Pat Altham for showing me these data.)

5.13 The National Collegiate Athletic Association studied graduation rates for freshman student athletes during the 1984–1985 academic year. The (sample size, number graduated) totals were (796, 498) for white females, (1625, 878) for white males, (143, 54) for black females, and (660, 197) for black males (J. J. McArdle and F. Hamagami, J. Amer. Statist. Assoc. 89: 1107–1123, 1994). Analyze and interpret.

5.14 In a study designed to evaluate whether an educational program makes sexually active adolescents more likely to obtain condoms, adolescents were randomly assigned to two experimental groups. The educational program, involving a lecture and videotape about transmission of the HIV virus, was provided to one group but not the other. Table 5.16 summarizes results of a logistic regression model for factors observed to influence teenagers to obtain condoms.
a. Find the parameter estimates for the fitted model, using (1, 0) dummy variables for the first three predictors. Based on the corresponding confidence interval for the log odds ratio, determine the standard error for the group effect.
b. Explain why either the estimate of 1.38 for the odds ratio for gender or the corresponding confidence interval is incorrect.
Show that if the reported interval is correct, 1.38 is actually the log odds ratio, and the estimated odds ratio equals 3.98.

TABLE 5.16 Data for Problem 5.14

  Variable                        Odds Ratio    95% Confidence Interval
  Group (education vs. none)         4.04           (1.17, 13.9)
  Gender (males vs. females)         1.38           (1.23, 12.88)
  SES (high vs. low)                 5.82           (1.87, 18.28)
  Lifetime number of partners        3.22           (1.08, 11.31)

Source: V. I. Rickert et al., Clin. Pediatr. 31: 205–210 (1992).

TABLE 5.17 Data for Problem 5.15

  Variable          Effect     P-value
  Intercept          −7.00      <0.01
  Alcohol use         0.10       0.03
  Smoking             1.20      <0.01
  Race                0.30       0.02
  Race × smoking      0.20       0.04

5.15 Table 5.17 shows estimated effects for a logistic regression model with squamous cell esophageal cancer (Y = 1, yes; Y = 0, no) as the response. Smoking status (S) equals 1 for at least one pack per day and 0 otherwise, alcohol consumption (A) equals the average number of alcoholic drinks consumed per day, and race (R) equals 1 for blacks and 0 for whites. To describe the race × smoking interaction, construct the prediction equation when R = 1 and again when R = 0. Find the fitted YS conditional odds ratio for each case. Similarly, construct the prediction equation when S = 1 and again when S = 0. Find the fitted YR conditional odds ratios. Note that for each association, the coefficient of the cross-product term is the difference between the log odds ratios at the two fixed levels for the other variable. Explain why the coefficient of S represents the log odds ratio between Y and S for whites. To what hypotheses do the P-values for R and S refer?

5.16 A survey of high school students on Y = whether the subject has driven a motor vehicle after consuming a substantial amount of alcohol (1 = yes), s = gender (1 = female), r = race (1 = black; 0 = white), and g = grade (g₁ = 1, grade 9; g₂ = 1, grade 10; g₃ = 1, grade 11; g₁ = g₂ = g₃ = 0, grade 12) has prediction equation

logit P̂(Y = 1)
= −0.88 − 0.40s − 0.72r − 2.22g₁ − 1.43g₂ − 0.58g₃ + 0.74rg₁ + 0.38rg₂ + 0.01rg₃.

a. Carefully interpret effects. Explain the interaction by describing the race effect at each grade and the grade effect for each race.
b. Replace r above by r₁ (1 = black, 0 = other). The study also measured r₂ (1 = Hispanic, 0 = other), with r₁ = r₂ = 0 for white. Suppose that the prediction equation is as above but with additional terms −0.29r₂ + 0.53r₂g₁ + 0.25r₂g₂ − 0.06r₂g₃. Interpret the effects.

TABLE 5.18 Data for Problem 5.17

  Patient    D    T    Y    Patient    D    T    Y    Patient    D    T    Y
      1     45    0    0       13     50    1    0       25     20    1    0
      2     15    0    0       14     75    1    1       26     45    0    1
      3     40    0    1       15     30    0    0       27     15    1    0
      4     83    1    1       16     25    0    1       28     25    0    1
      5     90    1    1       17     20    1    0       29     15    1    0
      6     25    1    1       18     60    1    1       30     30    0    1
      7     35    0    1       19     70    1    1       31     40    0    1
      8     65    0    1       20     30    0    1       32     15    1    0
      9     95    0    1       21     60    0    1       33    135    1    1
     10     35    0    1       22     61    0    0       34     20    1    0
     11     75    0    1       23     65    0    1       35     40    1    0
     12     45    1    1       24     15    1    0

Source: Data from D. Collett, in Encyclopedia of Biostatistics (New York: Wiley, 1998), pp. 350–358.

5.17 Table 5.18 shows the results of a study of Y = whether a patient having surgery with general anesthesia experienced a sore throat on waking (0 = no, 1 = yes) as a function of D = duration of the surgery (in minutes) and T = type of device used to secure the airway (0 = laryngeal mask airway; 1 = tracheal tube). Fit a logit model using these predictors, interpret parameter estimates, and conduct inference about the effects.

5.18 Refer to model (5.2) for the horseshoe crabs using x = width.
a. Show that (i) at the mean width (26.3), the estimated odds of a satellite equal 2.07; (ii) at x = 27.3, the estimated odds equal 3.40; and (iii) since exp(β̂) = 1.64, 3.40 = (1.64)(2.07), and the odds increase by 64%.
b. Based on the 95% confidence interval for β, show that for x near where π = 0.5, the rate of increase in the probability of a satellite per 1-cm increase in x falls between about 0.07 and 0.17.
5.19 For Table 4.3, fit a logistic regression model for the probability of a satellite, using color alone as the predictor.
a. Treat color as nominal. Explain why this model is saturated. Express its parameter estimates in terms of the sample logits for each color.
b. Conduct a likelihood-ratio test that color has no effect.
c. Fit a model that treats color as quantitative. Interpret the fit, and test that color has no effect.
d. Test the goodness of fit of the model in part (c). Interpret.

5.20 Refer to model (5.14). Describe the effect of width by finding the estimated probabilities of a satellite at its lower and upper quartiles, separately for c = 1 and c = 0.

5.21 Refer to the prediction equation logit(π̂) = −10.071 − 0.509c + 0.458x for model (5.13). The means and standard deviations are c̄ = 2.44 and s = 0.80 for color, and x̄ = 26.30 and s = 2.11 for width. For standardized predictors [e.g., x = (width − 26.3)/2.11], explain why the estimated coefficients of c and x equal −0.41 and 0.97. Interpret these by comparing the partial effects of a 1 standard deviation increase in each predictor on the odds. Describe the color effect by estimating the change in π̂ between the first and last color categories at the mean score for width.

5.22 Refer to model (5.12).
a. Fit the model using x = weight. Interpret effects of weight and color.
b. Does the model permitting interaction provide an improved fit? Interpret.
c. For part (b), construct a confidence interval for a difference between the slope parameters for medium-light and dark crabs. Interpret.
d. Using models that treat color as quantitative, repeat the analyses in parts (a) to (c).

5.23 Fowlkes et al. (1988) reported results of a survey of employees of a large national corporation to determine how satisfaction depends on race, gender, age, and regional location. The data are at the book's Web site (www.stat.ufl.edu/~aa/cda/cda.html).
Fit a logit model to these data and carefully interpret the parameter estimates. Fowlkes et al. (1988) reported: ''The least-satisfied employees are less than 35 years of age, female, other (race), and work in the Northeast; . . . The most satisfied group is greater than 44 years of age, male, other, and working in the Pacific or Mid-Atlantic regions; the odds of such employees being satisfied are about 3.5 to 1.'' Show how these interpretations result from the fit of this model.

5.24 Let Y denote a subject's opinion about current laws legalizing abortion (1 = support), for gender h (h = 1, female; h = 2, male), religious affiliation i (i = 1, Protestant; i = 2, Catholic; i = 3, Jewish), and political party affiliation j (j = 1, Democrat; j = 2, Republican; j = 3, Independent). For survey data, software for fitting the model

    logit P(Y = 1) = α + β_h^G + β_i^R + β_j^P

reports α̂ = 0.62, β̂₁^G = 0.08, β̂₂^G = −0.08, β̂₁^R = −0.16, β̂₂^R = −0.25, β̂₃^R = 0.41, β̂₁^P = 0.87, β̂₂^P = −1.27, β̂₃^P = 0.40.
a. Interpret how the odds of support depend on religion.
b. Estimate the probability of support for the group most (least) likely to support current laws.
c. If, instead, the parameters used constraints β₁^G = β₁^R = β₁^P = 0, report the estimates.

5.25 Table 5.19 refers to a sample of subjects randomly selected for an Italian study on the relation between income and whether one possesses a travel credit card. At each level of annual income in millions of lira, the table indicates the number of subjects sampled and the number possessing at least one travel credit card. Analyze these data.

TABLE 5.19 Data for Problem 5.25
Income  Cases  Cards    Income  Cases  Cards    Income  Cases  Cards
  24      1      0        39      2      0        65      6      6
  27      1      0        40      5      0        68      3      3
  28      5      2        41      2      0        70      5      3
  29      3      0        42      2      0        79      1      0
  30      9      1        45      1      1        80      1      0
  31      5      1        48      1      0        84      1      0
  32      8      0        49      1      0        94      1      0
  33      1      0        50     10      2       120      6      6
  34      7      1        52      1      0       130      1      1
  35      1      1        59      1      0
  38      3      1        60      5      2

(Income is in millions of lira; Cards is the number of cases possessing at least one travel credit card.)

Source: Categorical Data Analysis, Quaderni del Corso Estivo di Statistica e Calcolo delle Probabilità, n. 4, Istituto di Metodi Quantitativi, Università Luigi Bocconi, by R. Piccarreta.

5.26 Refer to Table 9.1, treating marijuana use as the response variable. Analyze these data.

5.27 The book's Web site (www.stat.ufl.edu/~aa/cda/cda.html) contains a five-way table relating occupational aspirations (high, low) to gender, residence, IQ, and socioeconomic status. Analyze these data.

Theory and Methods

5.28 For model (5.1), show that ∂π(x)/∂x = βπ(x)[1 − π(x)].

5.29 For model (5.1), when π(x) is small, explain why you can interpret exp(β) approximately as π(x + 1)/π(x).

5.30 Prove that the logistic regression curve (5.1) has its steepest slope where π(x) = 1/2. Generalize to model (5.8).

5.31 The calibration problem is that of estimating the x at which π(x) = π₀. For the linear logit model, argue that a confidence interval is the set of x values for which

|α̂ + β̂x − logit(π₀)| / [var(α̂) + x² var(β̂) + 2x cov(α̂, β̂)]^{1/2} < z_{α/2}.

[Morgan (1992, Sec. 2.7) surveyed other approaches.]

5.32 A study for several professional sports of the effect of a player's draft position d (d = 1, 2, 3, . . .) of selection from the pool of potential players in a given year on the probability π of eventually being named an all-star used the model logit(π) = α + β log d (S. M. Berry, Chance, 14: 53–57, 2001).
a. Show that π/(1 − π) = e^α d^β. Show that e^α is the odds for the first draft pick.
b. In the United States, Berry reported α̂ = 2.3 and β̂ = −1.1 for pro basketball, and α̂ = 0.7 and β̂ = −0.6 for pro baseball.
This suggests that in basketball a first draft pick is more crucial and picks with high d are relatively less likely to be all-stars. Explain why.

5.33 For the population of subjects having Y = j, X has a N(μ_j, σ²) distribution, j = 0, 1.
a. Using Bayes' theorem, show that P(Y = 1 | x) satisfies the logistic regression model with β = (μ₁ − μ₀)/σ².
b. Suppose that (X | Y = j) is N(μ_j, σ_j²) with σ₀ ≠ σ₁. Show that the logistic model holds with a quadratic term (Anderson 1975). [Problem 5.4 showed that a quadratic term is helpful when x values have quite different dispersion at y = 0 and y = 1. This result also suggests that to test equality of means of normal distributions when the variances differ, one can fit a quadratic logistic regression with the two groups as the response and test the quadratic term; see O'Brien (1988).]
c. Suppose that (X | Y = j) has exponential dispersion family density f(x; θ_j) = exp{[xθ_j − b(θ_j)]/a(φ) + c(x, φ)}. Find the relevant logistic model.
d. For multiple predictors, suppose that (X | Y = j) has a multivariate N(μ_j, Σ) distribution, j = 0, 1. Show that P(Y = 1 | x) satisfies logistic regression with effect parameters Σ⁻¹(μ₁ − μ₀) (Cornfield 1962).

5.34 Suppose that π(x) = F(x) for some strictly increasing cdf F. Explain why a monotone transformation of x exists such that the logistic regression model holds. Generalize to alternative link functions.

5.35 For an I × 2 contingency table, consider logit model (5.4).
a. Given {π_i > 0}, show how to find {β_i} satisfying β_I = 0.
b. Prove that β₁ = β₂ = · · · = β_I is the independence model. Find its likelihood equation, and show that α̂ = logit[(Σ_i y_i)/(Σ_i n_i)].

5.36 Construct the log-likelihood function for the model logit[π(x)] = α + βx, with independent binomial outcomes of y₀ successes in n₀ trials at x = 0 and y₁ successes in n₁ trials at x = 1.
Derive the likelihood equations, and show that β̂ is the sample log odds ratio.

5.37 A study has n_i independent binary observations {y_{i1}, . . . , y_{in_i}} when X = x_i, i = 1, . . . , N, with n = Σ_i n_i. Consider the model logit(π_i) = α + βx_i, where π_i = P(Y_{ij} = 1).
a. Show that the kernel of the likelihood function is the same whether the data are treated as n Bernoulli observations or N binomial observations.
b. For the saturated model, explain why the likelihood function is different for these two data forms. (Hint: The number of parameters differs.) Hence, the deviance reported by software depends on the form of data entry.
c. Explain why the difference between deviances for two unsaturated models does not depend on the form of data entry.
d. Suppose that each n_i = 1. Show that the deviance depends on π̂_i but not y_i. Hence, it is not useful for checking model fit (see also Problem 4.22).

5.38 Suppose that Y has a bin(n, π) distribution. For the model logit(π) = α, consider testing H₀: α = 0 (i.e., π = 0.5). Let π̂ = y/n.
a. From Section 3.1.6, the asymptotic variance of α̂ = logit(π̂) is [nπ(1 − π)]⁻¹. Compare the estimated SE for the Wald test and the SE using the null value of π, using test statistic [logit(π̂)/SE]². Show that the ratio of the Wald statistic to the statistic with null SE equals 4π̂(1 − π̂). What is the implication about performance of the Wald test if |α| is large and π̂ tends to be near 0 or 1?
b. Wald inference depends on the parameterization. How does the comparison of tests change with the scale [(π̂ − 0.5)/SE]², where SE is now the estimated or null SE of π̂?
c. Suppose that y = 0 or y = n. Show that the Wald test in part (a) cannot reject H₀: π = π₀ for any 0 < π₀ < 1, whereas the Wald test in part (b) rejects every such π₀.
[Note: Analogous results apply for inference about the Poisson mean versus the log mean; see Mantel (1987a).]

5.39 Find the likelihood equations for model (5.10). Show that they imply that the fitted values and the sample values are identical in the marginal two-way tables.

5.40 Consider the linear logit model (5.5) for an I × 2 table, with y_i a bin(n_i, π_i) variate.
a. Show that the log likelihood is

L(α, β) = Σ_{i=1}^I y_i(α + βx_i) − Σ_{i=1}^I n_i log[1 + exp(α + βx_i)].

b. Show that the sufficient statistic for β is Σ_i y_i x_i, and explain why this is essentially the variable utilized in the Cochran–Armitage test. (Hence, that test is a score test of H₀: β = 0.)
c. Letting S = Σ_i y_i, show that the likelihood equations are

S = Σ_i n_i exp(α + βx_i)/[1 + exp(α + βx_i)],
Σ_i y_i x_i = Σ_i n_i x_i exp(α + βx_i)/[1 + exp(α + βx_i)].

d. Let {μ̂_i = n_i π̂_i}. Explain why Σ_i μ̂_i = Σ_i y_i and

Σ_i x_i (y_i / S) = Σ_i x_i (μ̂_i / Σ_a μ̂_a).

Explain why this implies that the mean score on x across the rows in the first column is the same for the model fit as for the observed data. They are also identical for the second column.

5.41 Let Y_i be bin(n_i, π_i) at x_i, and let p_i = y_i/n_i. For binomial GLMs with logit link:
a. For p_i near π_i, show that

log[p_i/(1 − p_i)] ≈ log[π_i/(1 − π_i)] + (p_i − π_i)/[π_i(1 − π_i)].

b. Show that z_i^(t) in (5.23) is a linearized version of the ith sample logit, evaluated at the approximation π_i^(t) for π̂_i.
c. Verify formula (5.20) for the estimated covariance matrix of β̂.

5.42 Using graphs or tables, explain what is meant by no interaction in modeling response Y and explanatory variables X and Z when:
a. All variables are continuous (multiple regression).
b. Y and X are continuous, Z is categorical (analysis of covariance).
c. Y is continuous, X and Z are categorical (two-way ANOVA).
d. Y is binary, X and Z are categorical (logit model).
CHAPTER 6

Building and Applying Logistic Regression Models

Having studied the basics of fitting and interpreting logistic regression models, we now turn our attention to building and applying them. With several explanatory variables, there are many potential models. In Section 6.1 we discuss strategies for model selection. After choosing a preliminary model, model checking addresses whether systematic lack of fit exists. Section 6.2 covers diagnostics, such as residuals, for model checking. In practice, a common application compares two groups on a binary response, with data stratified by control variables. In Section 6.3 we present logit-related analyses of such data. In Section 6.4 we show the advantages of a well-chosen model in enhancing inferential power for detecting and estimating associations. Section 6.5 covers power and sample size determination for logistic regression. Although the logit is the most popular link function for probabilities, other links are sometimes more appropriate. In Section 6.6 we present models using the probit link and links making a double log transform. For small samples or models with many parameters, ordinary large-sample ML inference may perform poorly. In Section 6.7 we discuss conditional logistic regression. Like small-sample methods for 2 × 2 tables, this uses conditioning arguments to eliminate nuisance parameters.

6.1 STRATEGIES IN MODEL SELECTION

Model selection for logistic regression faces the same issues as for ordinary regression. The selection process becomes harder as the number of explanatory variables increases, because of the rapid increase in possible effects and interactions. There are two competing goals: the model should be complex enough to fit the data well; on the other hand, it should be simple to interpret, smoothing rather than overfitting the data. Most studies are designed to answer certain questions.
Those questions guide the choice of model terms. Confirmatory analyses then use a restricted set of models. For instance, a study hypothesis about an effect may be tested by comparing models with and without that effect. For studies that are exploratory rather than confirmatory, a search among possible models may provide clues about the dependence structure and raise questions for future research. In either case, it is helpful first to study the effect on Y of each predictor by itself, using graphics (incorporating smoothing) for a continuous predictor or a contingency table for a discrete predictor. This gives a "feel" for the marginal effects.

Unbalanced data, with relatively few responses of one type, limit the number of predictors for the model. One guideline suggests that at least 10 outcomes of each type should occur for every predictor (Peduzzi et al. 1996). If y = 1 only 30 times out of n = 1000, for instance, the model should contain no more than about three x terms. Such guidelines are approximate, and this does not mean that if you have 500 outcomes of each type you are well served by a model with 50 predictors.

Many model selection procedures exist, no one of which is always best. Cautions that apply to ordinary regression hold for any generalized linear model. For instance, a model with several predictors may suffer from multicollinearity: correlations among predictors can make it seem that no one variable is important when all the others are in the model. A variable may seem to have little effect because it overlaps considerably with other predictors in the model, itself being predicted well by the other predictors. Deleting such a redundant predictor can be helpful, for instance to reduce standard errors of other estimated effects.

6.1.1 Horseshoe Crab Example Revisited

The horseshoe crab data set in Table 4.3 has four predictors: color (four categories), spine condition (three categories), weight, and width of the carapace shell.
We now fit a logistic regression model using all these to predict whether the female crab has satellites (y = 1). We start by fitting a model containing main effects,

logit P(Y = 1) = α + β₁weight + β₂width + β₃c₁ + β₄c₂ + β₅c₃ + β₆s₁ + β₇s₂,

treating color (c_i) and spine condition (s_j) as qualitative (factors), with dummy variables for the first three colors and the first two spine conditions. Table 6.1 shows results. A likelihood-ratio test that Y is jointly independent of these predictors simultaneously tests H₀: β₁ = · · · = β₇ = 0. The test statistic equals 40.6 with df = 7 (P < 0.0001). This shows extremely strong evidence that at least one predictor has an effect.

TABLE 6.1 Computer Output from Fitting Model with All Main Effects to Horseshoe Crab Data

Testing Global Null Hypothesis: BETA = 0
Test                Chi-Square   DF   Pr > ChiSq
Likelihood Ratio    40.5565      7    <.0001

Analysis of Maximum Likelihood Estimates
Parameter    Estimate   Std Error   Chi-Square   Pr > ChiSq
Intercept    -9.2734    3.8378      5.8386       0.0157
weight        0.8258    0.7038      1.3765       0.2407
width         0.2631    0.1953      1.8152       0.1779
color 1       1.6087    0.9355      2.9567       0.0855
color 2       1.5058    0.5667      7.0607       0.0079
color 3       1.1198    0.5933      3.5624       0.0591
spine 1      -0.4003    0.5027      0.6340       0.4259
spine 2      -0.4963    0.6292      0.6222       0.4302

Although the overall test is highly significant, the Table 6.1 results are discouraging. The estimates for weight and width are only slightly larger than their SE values. The estimates for the factors compare each category to the final one as a baseline. For color, the largest difference is less than two standard errors; for spine condition, the largest difference is less than a standard error. The small P-value for the overall test, yet the lack of significance for individual effects, is a warning sign of multicollinearity. In Section 5.2.2 we showed strong evidence of a width effect.
Controlling for weight, color, and spine condition, little evidence remains of a partial width effect. However, weight and width have a strong correlation (0.887). For practical purposes they are equally good predictors, but it is nearly redundant to use them both. Our further analysis uses width (W) with color (C) and spine condition (S) as predictors.

For simplicity, we symbolize models by their highest-order terms, regarding C and S as factors. For instance, (C + S + W) denotes a model with main effects, whereas (C + S*W) denotes a model that has those main effects plus an S × W interaction. It is not usually sensible to consider a model with interaction but not the main effects that make up that interaction.

6.1.2 Stepwise Procedures

In exploratory studies, an algorithmic method for searching among models can be informative if we use results cautiously. Goodman (1971a) proposed methods analogous to forward selection and backward elimination in ordinary regression.

Forward selection adds terms sequentially until further additions do not improve the fit. At each stage it selects the term giving the greatest improvement in fit. The minimum P-value for testing the term in the model is a sensible criterion, since reductions in deviance for different terms may have different df values. A stepwise variation of this procedure retests, at each stage, terms added at previous stages to see if they are still significant.

Backward elimination begins with a complex model and sequentially removes terms. At each stage, it selects the term whose removal has the least damaging effect on the model (e.g., largest P-value). The process stops when any further deletion leads to a significantly poorer fit.

With either approach, for qualitative predictors with more than two categories, the process should consider the entire variable at any stage rather than just individual dummy variables.
Add or drop the entire variable rather than just one of its dummies; otherwise, the result depends on the coding. The same remark applies to interactions containing that variable.

Many statisticians prefer backward elimination over forward selection, feeling it safer to delete terms from an overly complex model than to add terms to an overly simple one. Forward selection can stop prematurely because a particular test in the sequence has low power. Neither strategy necessarily yields a meaningful model. Use variable selection procedures with caution! When you evaluate many terms, one or two that are not important may look impressive simply due to chance. For instance, when all the true effects are weak, the largest sample effect may substantially overestimate its true effect. See Westfall and Wolfinger (1997) and Westfall and Young (1993) for ways to adjust P-values to take multiple tests into account.

Some software has additional options for selecting a model. One approach attempts to determine the best model with some fixed number of terms, according to some criterion. If such a method and the backward and forward selection procedures yield quite different models, this is an indication that such results are of dubious use. Another such indication would be when a quite different model results from applying a given procedure to a bootstrap sample of the same size from the sample distribution.

Finally, statistical significance should not be the sole criterion for inclusion of a term in a model. It is sensible to include a variable that is central to the purposes of the study and report its estimated effect even if it is not statistically significant. Keeping it in the model may help reduce bias in estimated effects of other predictors and may make it possible to compare results with other studies where the effect is significant (perhaps because of a larger sample size).
Algorithmic selection procedures are no substitute for careful thought in guiding the formulation of models.

6.1.3 Backward Elimination for Horseshoe Crab Example

Table 6.2 summarizes results of fitting and comparing several logit models to the horseshoe crab data with predictors width, color, and spine condition.

TABLE 6.2 Results of Fitting Several Logistic Regression Models to Horseshoe Crab Data

Model  Predictors*       Deviance G²  df   AIC    Models Compared,        Corr.
                                                  Deviance Difference     r(y, μ̂)
1      (C*S*W)           170.44       152  212.4  --
2      (C*S+C*W+S*W)     173.68       155  209.7  (2)-(1): 3.2 (df = 3)
3a     (C*S+S*W)         177.34       158  207.3  (3a)-(2): 3.7 (df = 3)
3b     (C*W+S*W)         181.56       161  205.6  (3b)-(2): 7.9 (df = 6)
3c     (C*S+C*W)         173.69       157  205.7  (3c)-(2): 0.0 (df = 2)
4a     (S+C*W)           181.64       163  201.6  (4a)-(3c): 8.0 (df = 6)
4b     (W+C*S)           177.61       160  203.6  (4b)-(3c): 3.9 (df = 3)
5      (C+S+W)           186.61       166  200.6  (5)-(4b): 9.0 (df = 6)
6a     (C+S)             208.83       167  220.8  (6a)-(5): 22.2 (df = 1)
6b     (S+W)             194.42       169  202.4  (6b)-(5): 7.8 (df = 3)
6c     (C+W)             187.46       168  197.5  (6c)-(5): 0.8 (df = 2)   0.452
7a     (C)               212.06       169  220.1  (7a)-(6c): 24.5 (df = 1) 0.285
7b     (W)               194.45       171  198.5  (7b)-(6c): 7.0 (df = 3)  0.402
8      (C = dark + W)    187.96       170  194.0  (8)-(6c): 0.5 (df = 2)   0.447
9      None              225.76       172  227.8  (9)-(8): 37.8 (df = 2)   0.000

*C, color; S, spine condition; W, width.

The deviance (G²) test of fit compares the model to the saturated model. As noted in Sections 5.2.4 and 5.2.5, this is not approximately chi-squared when a predictor is continuous, as width is. However, the difference of deviances between two models that differ by a modest number of parameters is relevant. That difference is the likelihood-ratio statistic −2(L₀ − L₁) comparing the models, and it has an approximate null chi-squared distribution.

To select a model, we use backward elimination. We test only the highest-order terms for each variable.
It is inappropriate, for instance, to remove a main-effect term if the model has interactions involving that term.

We begin with the most complex model, symbolized by (C*S*W), model 1 in Table 6.2. This model uses main effects for each term as well as the three two-factor interactions and the three-factor interaction. It allows a separate width effect at each C × S combination. (In fact, at some of those combinations y outcomes of only one type occur, so effects are not estimable.) The likelihood-ratio statistic comparing this model to the simpler model (C*S + C*W + S*W) removing the three-factor interaction term equals 3.2 (df = 3). This suggests that the three-factor term is not needed (P = 0.36), thank goodness, so we continue the simplification process.

In the next stage we consider the three models that remove a two-factor interaction. Of these, (C*S + C*W) gives essentially the same fit as the more complex model, so we drop the S × W interaction. Next, we consider dropping one of the other two-factor interactions. The model (S + C*W), dropping the C × S interaction, has an increased deviance of 8.0 on df = 6 (P = 0.24); the model (W + C*S), dropping the C × W interaction, has an increased deviance of 3.9 on df = 3 (P = 0.27). Neither increase is important, suggesting that we can drop either and proceed. In either case, dropping next the remaining interaction also seems permissible. For instance, dropping the C × S interaction from model (W + C*S), leaving model (C + S + W), increases the deviance by 9.0 on df = 6 (P = 0.17).

The working model now has the main effects alone. In the next stage we consider dropping one of them. Table 6.2 shows little consequence of removing S. Both remaining variables (C and W) then have nonnegligible effects. For instance, removing C increases the deviance (comparing models 7b and 6c) by 7.0 on df = 3 (P = 0.07).
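This elimination step can be mimicked numerically from the deviances and df reported in Table 6.2. The sketch below (Python; the `backward_step` helper and the hard-coded 0.05-level chi-squared critical values are our own scaffolding, not part of the text) tests dropping each main effect from model 5 and keeps the least damaging nonsignificant drop:

```python
# 0.05-level chi-squared critical values for the df differences that arise here
CHI2_05 = {1: 3.841, 2: 5.991, 3: 7.815}

def backward_step(full, reduced):
    """Return the reduced model whose deviance increase over the full model
    is smallest among the increases that are not significant at the 0.05 level."""
    g2_full, df_full = full
    best_name, best_diff = None, None
    for name, (g2, df) in reduced.items():
        diff, ddf = g2 - g2_full, df - df_full   # LR statistic and its df
        if diff <= CHI2_05[ddf] and (best_diff is None or diff < best_diff):
            best_name, best_diff = name, diff
    return best_name, best_diff

# (G^2, residual df) from Table 6.2: model 5 and the three one-term deletions
full = (186.61, 166)                               # (C + S + W)
reduced = {"C+S": (208.83, 167),                   # drop width
           "S+W": (194.42, 169),                   # drop color
           "C+W": (187.46, 168)}                   # drop spine condition
name, diff = backward_step(full, reduced)
print(name, round(diff, 2))                        # C+W 0.85
```

As in the text, dropping width is strongly rejected (LR = 22.2, df = 1), while dropping spine condition costs only 0.85 on df = 2, so the step leads to model (C + W).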
The analysis in Section 5.4.6 revealed a noticeable difference between dark crabs (category 4) and the others. The simpler model that has a single dummy variable for color, equaling 0 for dark crabs and 1 otherwise, fits essentially as well. (The deviance difference between models 8 and 6c equals 0.5, with df = 2.) Further simplification results in large increases in deviance and is unjustified.

6.1.4 AIC, Model Selection, and the Correct Model

In selecting a model, we are mistaken if we think that we have found the true one. Any model is a simplification of reality. For instance, width does not exactly have a linear effect on the probability of satellites, whether we use the logit link or the identity link. What is the logic of testing the fit of a model when we know that it does not truly hold?

A simple model that fits adequately has the advantages of model parsimony. If a model has relatively little bias, describing reality well, it tends to provide more accurate estimates of the quantities of interest. This was discussed in Sections 3.3.7 and 5.2.2 and is examined further in Section 6.4.5.

Other criteria besides significance tests can help select a good model in terms of estimating quantities of interest. The best known is the Akaike information criterion (AIC). It judges a model by how close its fitted values tend to be to the true values, in terms of a certain expected value. Even though a simple model is farther from the true model than is a more complex model, it may be preferred because it tends to provide better estimates of certain characteristics of the true model, such as cell probabilities. Thus, the optimal model is the one that tends to have fit closest to reality. Given a sample, Akaike showed that this criterion selects the model that minimizes

AIC = −2(maximized log likelihood − number of parameters in model).

This penalizes a model for having many parameters.
With models for categorical Y, this ordering is equivalent to one based on an adjustment of the deviance, [G² − 2(df)], which subtracts from G² twice its residual df. For cogent arguments supporting this criterion, see Burnham and Anderson (1998).

We illustrate AIC for model selection using the models that Table 6.2 lists. That table also shows the AIC values. Of models using the three basic variables, AIC is smallest (AIC = 197.5) for C + W, having main effects of color and width. The simpler model having a dummy variable for whether a crab is dark fares better yet (AIC = 194.0). Either model seems reasonable. We should balance the lower AIC for the simpler model against its having been suggested by the fit of C + W.

6.1.5 Using Causal Hypotheses to Guide Model Building

Although selection procedures are helpful exploratory tools, the model-building process should utilize theory and common sense. Often, a time ordering among the variables suggests possible causal relationships. Analyzing a certain sequence of models helps to investigate those relationships (Goodman 1973).

We illustrate with Table 6.3, from a British study. A sample of men and women who had petitioned for divorce and a similar number of married people were asked: (a) "Before you married your (former) husband/wife, had you ever made love with anyone else?"; (b) "During your (former) marriage, (did you have) have you had any affairs or brief sexual encounters with another man/woman?" The 2 × 2 × 2 × 2 table has variables G = gender, E = whether reported extramarital sex, P = whether reported premarital sex, and M = marital status.

The time points at which responses on the four variables occur suggest the following ordering of the variables:

G (gender) → P (premarital sex) → E (extramarital sex) → M (marital status)

Any of these is an explanatory variable when a variable listed to its right is the response. Figure 6.1 shows one possible causal structure.
In this figure, a variable at the tip of an arrow is a response for a model at some stage. The explanatory variables have arrows pointing to the response, directly or indirectly.

We first treat P as a response. Figure 6.1 predicts that G has a direct effect on P, so the model of independence of these variables is inadequate.

TABLE 6.3 Marital Status by Report of Pre- and Extramarital Sex (PMS and EMS)

                             Women                      Men
                     PMS: Yes      PMS: No      PMS: Yes      PMS: No
Marital Status  EMS:  Yes   No     Yes   No      Yes   No     Yes   No
Divorced               17   54      36  214       28   60      17    68
Still married           4   25       4  322       11   42       4   130

Source: G. N. Gilbert, Modelling Society (London: George Allen & Unwin, 1981). Reprinted with permission from Unwin Hyman Ltd.

FIGURE 6.1 Causal diagram for Table 6.3.

At the second stage, E is the response. Figure 6.1 predicts that P and G have direct effects on E. It also suggests that G has an indirect effect on E, through its effect on P. These effects on E can be analyzed using the logit model for E with additive G and P effects. If G has only an indirect effect on E, the model with P alone as a predictor is adequate; that is, controlling for P, E and G are conditionally independent.

At the third stage, M is the response. Figure 6.1 predicts that E has a direct effect on M, P has direct effects as well as indirect effects through its effects on E, and G has indirect effects through its effects on P and E. This suggests the logit model for M having additive E and P effects. For this model, G and M are independent, given P and E.

Table 6.4 shows results. The first stage, having P as the response, shows strong evidence of a GP association. The sample odds ratio for their marginal table is 0.27; the estimated odds of premarital sex for females are 0.27 times those for males. The second stage has E as the response. Only weak evidence occurs that G had a direct as well as an indirect effect on E, as G² drops by 2.9 (df = 1)
after adding G to a model already containing P as a predictor. For this model, the estimated EP conditional odds ratio is 4.0.

The third stage has M as the response. Figure 6.1 specifies the logit model with main effects of E and P, but it fits poorly.

TABLE 6.4 Goodness of Fit of Various Models for Table 6.3*

Stage  Response   Potential       Actual           G²     df
       Variable   Explanatory     Explanatory
1      P          G               None             75.3   1
                                  (G)              0.0    0
2      E          G, P            None             48.9   3
                                  (P)              2.9    2
                                  (G + P)          0.0    1
3      M          G, P, E         (E + P)          18.2   5
                                  (E*P)            5.2    4
                                  (E*P + G)        0.7    3

*P, premarital sex; E, extramarital sex; M, marital status; G, gender.

The model that allows an E × P interaction in their effects on M but assumes conditional independence of G and M fits much better (a G² decrease of 13.0, with df = 1). The model that also has a main effect for G fits slightly better yet. Either model is more complicated than Figure 6.1 predicted, since the effects of E on M vary according to the level of P. However, some preliminary thought about causal relationships suggested a model similar to one giving a good fit. We leave it to the reader to estimate and interpret effects for the third stage.

6.1.6 New Model-Building Strategies for Data Mining

As computing power continues to explode, enormous data sets are more common. A financial institution that markets credit cards may have observations for millions of subjects to whom it sent advertising, on whether they applied for a card. For its customers, it has monthly data on whether they paid their bill on time, plus information on many variables measured on the credit card application. The analysis of huge data sets is called data mining.

Model building for huge data sets is challenging. There is currently considerable study of alternatives to traditional statistical methods, including automated algorithms that ignore concepts such as sampling error or modeling.
Significance tests are usually irrelevant, as nearly any variable has a significant effect if n is sufficiently large. Model-building strategies view some models as useful for prediction even if they have complex structure. Nonetheless, a point of diminishing returns still occurs in adding predictors to models. After a point, new predictors tend to be so correlated with a linear combination of ones already in the model that they do not improve predictive power. For large n, inference is less relevant than summary measures of predictive power. This is a topic of the next section.

6.2 LOGISTIC REGRESSION DIAGNOSTICS

In Section 5.2.3 we introduced statistics for checking model fit in a global sense. After selecting a preliminary model, we obtain further insight by switching to a microscopic mode of analysis. In contingency tables, for instance, the pattern of lack of fit revealed in cell-by-cell comparisons of observed and fitted counts may suggest a better model. For continuous predictors, graphical displays are also helpful. Such diagnostic analyses may suggest a reason for the lack of fit, such as nonlinearity in the effect of an explanatory variable.

6.2.1 Pearson, Deviance, and Standardized Residuals

With categorical predictors, it is useful to form residuals to compare observed and fitted counts. Let y_i denote the binomial variate for n_i trials at setting i of the explanatory variables, i = 1, . . . , N. Let π̂_i denote the model estimate of P(Y = 1). Then n_i π̂_i is the fitted number of successes. For a GLM with binomial random component, the Pearson residual (4.36) for this fit is

e_i = (y_i − n_i π̂_i) / [var̂(Y_i)]^{1/2} = (y_i − n_i π̂_i) / [n_i π̂_i (1 − π̂_i)]^{1/2}.    (6.1)

This divides the raw residual (y_i − μ̂_i) by the estimated binomial standard deviation of y_i. The Pearson statistic for testing the model fit satisfies

X² = Σ_{i=1}^N e_i².

Each squared Pearson residual is a component of X².
With π̂_i replaced by π_i in the numerator of (6.1), e_i is the difference between a binomial random variable and its expectation, divided by its estimated standard deviation. For large n_i, e_i then has an approximate N(0, 1) distribution when the model holds. Since π_i is estimated by π̂_i and the {π̂_i} depend on {y_i}, however, the {y_i − n_i π̂_i} tend to be smaller than {y_i − n_i π_i}, and the {e_i} are less variable than N(0, 1). If X² has df = ν, then X² = Σ_i e_i² is asymptotically comparable to the sum of squares of ν (rather than N) independent standard normal random variables. Thus, when the model holds, E(Σ_i e_i²)/N ≈ ν/N < 1.

The standardized Pearson residual is slightly larger in absolute value and is approximately N(0, 1) when the model holds. In Section 4.5.5 we showed that the adjustment uses the leverage from an estimated hat matrix. For observation i with leverage ĥ_i, the standardized residual is

r_i = e_i / (1 − ĥ_i)^{1/2} = (y_i − n_i π̂_i) / [n_i π̂_i (1 − π̂_i)(1 − ĥ_i)]^{1/2}.

Absolute values larger than roughly 2 or 3 provide evidence of lack of fit.

An alternative residual uses components of the G² fit statistic. These are the deviance residuals, introduced for GLMs in (4.35). The deviance residual for observation i is

√d_i × sign(y_i − n_i π̂_i),    (6.2)

where

d_i = 2[y_i log(y_i / n_i π̂_i) + (n_i − y_i) log((n_i − y_i) / (n_i − n_i π̂_i))].

This also tends to be less variable than N(0, 1) and can be standardized.

Plots of residuals against explanatory variables or linear predictor values may detect a type of lack of fit. When fitted values are very small, however, just as X² and G² lose relevance, so do residuals. When explanatory variables are continuous, often n_i = 1 at each setting. Then y_i can equal only 0 or 1, and e_i can assume only two values. One must then be cautious about regarding either outcome as extreme, and a single residual is usually uninformative.
Plots of residuals also then have limited use, consisting simply of two parallel lines of dots. The deviance itself is then completely uninformative (Problem 5.37). When data can be grouped into sets of observations having common predictor values, it is better to compute residuals for the grouped data than for individual subjects.

6.2.2 Heart Disease Example

A sample of male residents of Framingham, Massachusetts, aged 40 through 59, were classified on several factors, including blood pressure (Table 6.5). The response variable is whether they developed coronary heart disease during a six-year follow-up period. Let π_i be the probability of heart disease for blood pressure category i. The table shows the fit and the standardized Pearson residuals for two logistic regression models. The first model,

    logit(π_i) = α,

treats the response as independent of blood pressure. Some residuals for that model are large. This is not surprising, since the model fits poorly (G² = 30.0, X² = 33.4, df = 7).

TABLE 6.5 Standardized Pearson Residuals for Logit Models Fitted to Data on Blood Pressure and Heart Disease

                                            Fitted                  Residual
Blood       Sample   Observed         Indep.     Linear        Indep.     Linear
Pressure    Size     Heart Disease    Model      Logit         Model      Logit
< 117        156       3              10.8        5.2          −2.62      −1.11
117–126      252      17              17.4       10.6          −0.12       2.37
127–136      284      12              19.7       15.1          −2.02      −0.95
137–146      271      16              18.8       18.1          −0.74      −0.57
147–156      139      12               9.6       11.6           0.84       0.13
157–166       85       8               5.9        8.9           0.93      −0.33
167–186       99      16               6.9       14.2           3.76       0.65
> 186         43       8               3.0        8.4           3.07      −0.18

Source: Data from Cornfield (1962).
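For the independence model the fitted probability is the overall sample proportion, and for an intercept-only binomial GLM the leverage reduces to ĥ_i = n_i / Σ_j n_j. A short stdlib-Python sketch (not part of the text) then reproduces the standardized residuals in Table 6.5:

```python
import math

# Table 6.5: group sizes and observed heart disease counts by blood pressure category
n = [156, 252, 284, 271, 139, 85, 99, 43]
y = [3, 17, 12, 16, 12, 8, 16, 8]

p = sum(y) / sum(n)  # common fitted probability under logit(pi_i) = alpha

std_res = []
for ni, yi in zip(n, y):
    e = (yi - ni * p) / math.sqrt(ni * p * (1 - p))  # Pearson residual (6.1)
    h = ni / sum(n)  # leverage for the intercept-only fit: w_i / sum_j w_j
    std_res.append(e / math.sqrt(1 - h))  # standardized Pearson residual
```

The first and last two categories reproduce the table's values −2.62, 3.76, and 3.07.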
TABLE 6.6 Residuals Reported in SAS for Heart Disease Data of Table 6.5ᵃ

Observation   disease     n      blood     Reschi      Resdev      StReschi
1                3       156     111.5     −0.9794     −1.0617     −1.1058
2               17       252     121.5      2.0057      1.8501      2.3746
3               12       284     131.5     −0.8133     −0.8420     −0.9453
4               16       271     141.5     −0.5067     −0.5162     −0.5727
5               12       139     151.5      0.1176      0.1170      0.1261
6                8        85     161.5     −0.3042     −0.3088     −0.3261
7               16        99     176.5      0.5135      0.5050      0.6520
8                8        43     191.5     −0.1395     −0.1402     −0.1773

ᵃ Reschi, Pearson residual; Resdev, deviance residual; StReschi, adjusted (standardized Pearson) residual.

A plot of the residuals shows an increasing trend. This suggests the linear logit model,

    logit(π_i) = α + β x_i,

with scores {x_i} for blood pressure level. We used scores (111.5, 121.5, 131.5, 141.5, 151.5, 161.5, 176.5, 191.5). The nonextreme scores are midpoints for the intervals of blood pressure. The trend in residuals disappears for this model, and only the second category shows some evidence of lack of fit.

Table 6.6 reports residuals for the linear logit model, as reported by SAS. The Pearson residuals (Reschi), deviance residuals (Resdev), and standardized Pearson residuals (StReschi) show similar results. Each is somewhat large in the second category. One relatively large residual is not surprising, however. With many residuals, some may be large purely by chance. Here the overall fit statistics (G² = 5.9, X² = 6.3 with df = 6) do not indicate problems. In analyzing residual patterns, we should be cautious about attributing patterns to what might be chance variation from a model.

Another useful graphical display for showing lack of fit compares observed and fitted proportions by plotting them against each other or by plotting both of them against explanatory variables.

[FIGURE 6.2 Observed and predicted proportions of heart disease for linear logit model.]
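SAS obtains Table 6.6 from the ML fit. As an independent check (a sketch, not the text's code), a minimal Newton–Raphson fit of the linear logit model in stdlib Python reproduces the three residual columns and the overall X²:

```python
import math

# Table 6.5 data with midpoint scores for blood pressure
x = [111.5, 121.5, 131.5, 141.5, 151.5, 161.5, 176.5, 191.5]
n = [156, 252, 284, 271, 139, 85, 99, 43]
y = [3, 17, 12, 16, 12, 8, 16, 8]

# Newton-Raphson for logit(pi_i) = a + b*(x_i - xbar); centering x stabilizes the arithmetic
xbar = sum(x) / len(x)
a = b = 0.0
for _ in range(25):
    pi = [1 / (1 + math.exp(-(a + b * (xi - xbar)))) for xi in x]
    w = [ni * p * (1 - p) for ni, p in zip(n, pi)]                    # IRLS weights
    u0 = sum(yi - ni * p for yi, ni, p in zip(y, n, pi))              # score, intercept
    u1 = sum((xi - xbar) * (yi - ni * p)
             for xi, yi, ni, p in zip(x, y, n, pi))                   # score, slope
    i00, i01 = sum(w), sum(wi * (xi - xbar) for wi, xi in zip(w, x))  # information matrix
    i11 = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    det = i00 * i11 - i01 * i01
    a += (i11 * u0 - i01 * u1) / det                                  # solve 2x2 Newton step
    b += (i00 * u1 - i01 * u0) / det

reschi, resdev, streschi = [], [], []
for xi, ni, yi, p, wi in zip(x, n, y, pi, w):
    e = (yi - ni * p) / math.sqrt(ni * p * (1 - p))
    d = 2 * (yi * math.log(yi / (ni * p))
             + (ni - yi) * math.log((ni - yi) / (ni - ni * p)))
    h = wi * (i11 - 2 * i01 * (xi - xbar) + i00 * (xi - xbar) ** 2) / det  # leverage
    reschi.append(e)
    resdev.append(math.copysign(math.sqrt(d), yi - ni * p))
    streschi.append(e / math.sqrt(1 - h))

X2 = sum(e ** 2 for e in reschi)
```

The second observation's residuals agree with the Reschi, Resdev, and StReschi columns to the reported precision, and X² matches the overall fit statistic.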
For the linear logit model, Figure 6.2 plots both the observed proportions and the estimated probabilities of heart disease against blood pressure. The fit seems decent.

Studying residuals helps us understand either why a model fits poorly or where there is lack of fit in a generally good-fitting model. The next example illustrates the second case.

6.2.3 Graduate Admissions Example

Table 6.7 refers to graduate school applications to the 23 departments in the College of Liberal Arts and Sciences at the University of Florida during the 1997–1998 academic year. It cross-classifies applicant's gender (G), whether admitted (A), and department (D) to which the prospective students applied. We consider logit models with A as the response variable. Let y_{ik} denote the number admitted and let π_{ik} denote the probability of admission for gender i in department k. We treat {Y_{ik}} as independent bin(n_{ik}, π_{ik}). Other things being equal, one would hope the admissions decision is independent of gender. However, the model with no gender effect, given the department,

    logit(π_{ik}) = α + β_k^D,

fits rather poorly (G² = 44.7, X² = 40.9, df = 23).

TABLE 6.7 Data Relating Admission to Gender and Department for Model with No Gender Effect

        Females      Males      Std. Res              Females      Males      Std. Res
Dept    Yes   No     Yes   No   (Fem, Yes)    Dept    Yes   No     Yes   No   (Fem, Yes)
anth     32   81      21   41     −0.76       ling     21   10       7    8      1.37
astr      6    0       3    8      2.87       math     25   18      31   37      1.29
chem     12   43      34  110     −0.27       phil      3    0       9    6      1.34
clas      3    1       4    0     −1.07       phys     10   11      25   53      1.32
comm     52  149       5   10     −0.63       poli     25   34      39   49     −0.23
comp      8    7       6   12      1.16       psyc      2  123       4   41     −2.27
engl     35  100      30  112      0.94       reli      3    3       0    2      1.26
geog      9    1      11   11      2.17       roma     29   13       6    3      0.14
geol      6    3      15    6     −0.26       soci     16   33       7   17      0.30
germ     17    0       4    1      1.89       stat     23    9      36   14     −0.01
hist      9    9      21   19     −0.18       zool      4   62      10   54     −1.76
lati     26    7      25   16      1.65

Source: Data courtesy of James Booth.

Table 6.7 also reports standardized Pearson residuals for the number of females who were admitted for this model. For instance, the astronomy department admitted 6 females, which was 2.87 standard deviations higher than the model predicted. Each department has only a single nonredundant standardized residual, because of marginal constraints for the model. The model has fit π̂_{ik} = (y_{1k} + y_{2k})/n_{+k}, corresponding to an independence fit (π̂_{1k} = π̂_{2k}) in each partial table. Now,

    y_{1k} − n_{1k} π̂_{1k} = y_{1k} − n_{1k}(y_{1k} + y_{2k})/n_{+k}
                           = (n_{2k}/n_{+k}) y_{1k} − (n_{1k}/n_{+k}) y_{2k}
                           = −(y_{2k} − n_{2k} π̂_{2k}).

Thus, the standard errors of (y_{1k} − n_{1k} π̂_{1k}) and (y_{2k} − n_{2k} π̂_{2k}) are identical. The standardized residuals are identical in absolute value for males and females but of different sign. Astronomy admitted 3 males, and their standardized residual was −2.87; the number admitted was 2.87 standard deviations fewer than predicted.

This is another advantage of standardized over ordinary Pearson residuals. The model of independence in a partial table has df = 1. Only one bit of information exists about how the data depart from independence, yet the ordinary Pearson residual for males need not equal the ordinary Pearson residual for females.

Departments with large standardized Pearson residuals reveal the reason for the lack of fit. Significantly more females were admitted than the model predicts in the astronomy and geography departments, and fewer in the psychology department. Without these three departments, the model fits reasonably well (G² = 24.4, X² = 22.8, df = 20).
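The sign-flip identity just derived is easy to check numerically. Within a department the fit is an independence fit, for which the leverage reduces to ĥ_i = n_i/n_{+k}. A stdlib-Python sketch (not part of the text) for the astronomy partial table:

```python
import math

# Astronomy partial table from Table 6.7: (admitted, rejected) by gender
females = (6, 0)
males = (3, 8)

n1, n2 = sum(females), sum(males)        # applicants of each gender
p = (females[0] + males[0]) / (n1 + n2)  # independence fit: common admission rate

def std_residual(yes, n_gender, n_total, p):
    e = (yes - n_gender * p) / math.sqrt(n_gender * p * (1 - p))  # Pearson residual
    h = n_gender / n_total  # leverage under the within-department independence fit
    return e / math.sqrt(1 - h)

r_f = std_residual(females[0], n1, n1 + n2, p)
r_m = std_residual(males[0], n2, n1 + n2, p)
```

This reproduces 2.87 for females and exactly the negative of that value for males.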
For the complete data, adding a gender effect to the model does not provide an improved fit (G² = 42.4, X² = 39.0, df = 22), because the departments just described have associations in different directions and of greater magnitude than other departments. This model has an ML estimate of 1.19 for the GA conditional odds ratio, the odds of admission being 19% higher for females than males, given department. By contrast, the marginal table collapsed over department has a GA sample odds ratio of 0.94, the overall odds of admission being 6% lower for females. This illustrates Simpson's paradox (Section 2.3.2), the conditional association having different direction than the marginal association.

6.2.4 Influence Diagnostics for Logistic Regression

Other regression diagnostic tools are also helpful in assessing fit. These include plots of ordered residuals against normal percentiles (Haberman 1973a) and analyses that describe an observation's influence on parameter estimates and fit statistics. Whenever a residual indicates that a model fits an observation poorly, it can be informative to delete the observation and refit the model to the remaining ones. This is equivalent to adding a parameter to the model for that observation, forcing a perfect fit for it.

As in ordinary regression, an observation may be relatively influential in determining parameter estimates. The greater an observation's leverage, the greater its potential influence. The fit could be quite different if an observation that appears to be an outlier on y and has large leverage is deleted. However, a single observation can have a more exorbitant influence in ordinary regression than a single binary observation in logistic regression, since there is no bound on the distance of y_i from its expected value. Also, in Section 4.5.5 we observed that the GLM estimated hat matrix

    Ĥat = Ŵ^{1/2} X (X′ Ŵ X)^{−1} X′ Ŵ^{1/2}

depends on the fit as well as the model matrix X.
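A small sketch (hypothetical fitted model, stdlib Python, not from the text) computes the diagonal leverages of this hat matrix for a two-parameter logit fit. The leverages sum to the number of parameters, and the most extreme predictor value can still have modest leverage because its weight ŵ_i = n_i π̂_i(1 − π̂_i) is small when π̂_i is near 0 or 1:

```python
import math

# Hypothetical fitted logit model with intercept and one predictor
x = [1.0, 2.0, 3.0, 4.0, 10.0]  # note the extreme point x = 10
n = [20, 20, 20, 20, 20]
alpha, beta = -4.0, 1.0         # assumed fitted coefficients (illustration only)

pi = [1 / (1 + math.exp(-(alpha + beta * xi))) for xi in x]
w = [ni * p * (1 - p) for ni, p in zip(n, pi)]  # binomial GLM weights

# 2x2 matrix X'WX for design rows (1, x_i), inverted directly
s00, s01 = sum(w), sum(wi * xi for wi, xi in zip(w, x))
s11 = sum(wi * xi * xi for wi, xi in zip(w, x))
det = s00 * s11 - s01 * s01
inv = ((s11 / det, -s01 / det), (-s01 / det, s00 / det))

# leverage h_i = w_i * (1, x_i) inv (1, x_i)'
h = [wi * (inv[0][0] + 2 * inv[0][1] * xi + inv[1][1] * xi * xi)
     for wi, xi in zip(w, x)]
```

Here the point at x = 10 has π̂ ≈ 0.998, so despite its extreme predictor value its leverage is smaller than that of interior points.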
For logistic regression, in Section 5.5.2 we showed that the weight matrix Ŵ is diagonal with element ŵ_i = n_i π̂_i(1 − π̂_i) for the n_i observations at setting i of the predictors. Points that have extreme predictor values need not have high leverage. In fact, the leverage can be small if π̂_i is close to 0 or 1.

Several measures that describe the effect on parameter estimates and fit statistics of removing an observation from the data set are related algebraically to the observation's leverage (Pregibon 1981; Williams 1987). In logistic regression, the observation could be a single binary response or a binomial response for a set of subjects all having the same predictor values. Influence measures for each observation include:

1. For each model parameter, the change in the parameter estimate when the observation is deleted. This change, divided by its standard error, is called Dfbeta.
2. A measure of the change in a joint confidence interval for the parameters produced by deleting the observation. This confidence interval displacement diagnostic is denoted by c.
3. The change in X² or G² goodness-of-fit statistics when the observation is deleted.

For each measure, the larger the value, the greater the influence. We illustrate them using the linear logit model with blood pressure as a predictor for heart disease in Table 6.5. Table 6.8 contains simple approximations (due to Pregibon 1981) for the Dfbeta measure for the coefficient of blood pressure, the confidence interval diagnostic c, the change in G², and the change in X². (This is the square of the standardized Pearson residual, r_i².) All their values show that deleting the second observation has the greatest effect. This is not surprising, as that observation has the only relatively large residual. By contrast, Table 6.8 also contains the changes in X² and G² for deleting observations in fitting the independence model.
At the low and high ends of the blood pressure values, several changes are very large. However, these all relate to removing an entire binomial sample at a blood pressure level instead of removing a single subject's binary observation. Such subject-level deletions have little effect even for this model. With continuous or multiple predictors, it can be informative to plot these diagnostics, for instance against the estimated probabilities. See Cook and Weisberg (1999, Chap. 22), Fowlkes (1987), and Landwehr et al. (1984) for examples of useful diagnostic plots.

TABLE 6.8 Diagnostic Measures for Logistic Regression Models Fitted to Heart Disease Data

Blood                          Pearson      Likelihood-Ratio     Pearson      Likelihood-Ratio
Pressure   Dfbeta      c       X² Diff.     G² Diff.             X² Diff.ᵃ    G² Diff.ᵃ
111.5       0.49      0.34      1.22         1.39                 6.86         9.13
121.5      −1.14      2.26      5.64         5.04                 0.02         0.02
131.5       0.33      0.31      0.89         0.94                 4.08         4.56
141.5       0.08      0.09      0.33         0.34                 0.55         0.57
151.5       0.01      0.00      0.02         0.02                 0.70         0.66
161.5      −0.07      0.02      0.11         0.11                 0.87         0.80
176.5       0.40      0.26      0.42         0.42                14.17        10.83
191.5      −0.12      0.02      0.03         0.03                 9.41         6.73

ᵃ Independence model; other values refer to model with blood pressure predictor.
Source: Data from Cornfield (1962).

6.2.5 Summarizing Predictive Power: R and R-Squared Measures

In ordinary regression, R² describes the proportional reduction in variation in comparing the conditional variation of the response to the marginal variation. It and the multiple correlation R describe the power of the explanatory variables to predict the response, with R = 1 for perfect prediction. Despite various attempts to define analogs for categorical response models, no proposed measure is as widely useful as R and R². We present a few proposed measures in this section.

For any GLM, the correlation r(y, μ̂) between the observed responses {y_i} and the model's fitted values {μ̂_i} measures predictive power.
For least squares regression, this is the multiple correlation between Y and the predictors. An advantage of the correlation relative to its square is the appeal of working on the original scale and its approximate proportionality to effect size: for a small effect with a single predictor, doubling the slope corresponds roughly to doubling the correlation. This measure can be useful for comparing fits of different models to the same data set.

In logistic regression, μ̂_i for a particular model is the estimated probability π̂_i for binary observation i. Table 6.2 shows r(y, μ̂) for a few models fitted to the horseshoe crab data. Width alone has r = 0.402, and adding color to the model increases r to 0.452. The simpler model that uses color merely to indicate whether a crab is dark does essentially as well, with r = 0.447. The complex model containing color, spine condition, width, and all their two- and three-way interactions has r = 0.526. This seems considerably higher, but with multiple predictors the r estimates become more highly biased in estimating the true correlation. It can be misleading to compare r values for models with greatly different df values. After a jackknife adjustment designed to reduce bias, there is little difference between r for this overly complex model and the simpler model (Zheng and Agresti 2000). Little is lost and much is gained by using the simpler model.

Another way to measure the association between the binary responses {y_i} and their fitted values {π̂_i} uses the proportional reduction in squared error

    1 − Σ_i (y_i − π̂_i)² / Σ_i (y_i − ȳ)²,

obtained by using π̂_i instead of ȳ = (Σ_i y_i)/n as a predictor of y_i (Efron 1978). Amemiya (1981) suggested a related measure that weights squared deviations by inverse predicted variances. For logistic regression, unlike normal GLMs, these and r(y, μ̂) need not be nondecreasing as the model gets more complex.
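Both measures are simple to compute. A sketch with hypothetical binary responses and fitted probabilities (stdlib Python, not from the text):

```python
import math

def corr(u, v):
    """Pearson correlation between two sequences."""
    m = len(u)
    mu, mv = sum(u) / m, sum(v) / m
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return suv / (su * sv)

# Hypothetical binary responses and fitted probabilities from some model
y = [1, 0, 1, 1, 0, 0, 1, 0]
pihat = [0.8, 0.3, 0.6, 0.7, 0.4, 0.2, 0.5, 0.1]

r = corr(y, pihat)  # r(y, muhat)
ybar = sum(y) / len(y)
efron = 1 - (sum((yi - pi) ** 2 for yi, pi in zip(y, pihat))
             / sum((yi - ybar) ** 2 for yi in y))  # proportional reduction in squared error
```

Perfect prediction (π̂_i = y_i for all i) makes both measures equal 1.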
Like any correlation-type measure, they can depend strongly on the range of observed values of the explanatory variables.

Other measures directly use the likelihood function. Denote the maximized log likelihood by L_M for a given model, L_S for the saturated model, and L_0 for the null model containing only an intercept term. Probabilities are no greater than 1.0, so log likelihoods are nonpositive. As the model complexity increases, the parameter space expands, so the maximized log likelihood increases. Thus, L_0 ≤ L_M ≤ L_S ≤ 0. The measure

    (L_M − L_0) / (L_S − L_0)    (6.3)

falls between 0 and 1. It equals 0 when the model provides no improvement in fit over the null model, and it equals 1 when the model fits as well as the saturated model. A weakness is that the log likelihood is not an easily interpretable scale. Interpreting the numerical value is difficult, other than in a comparative sense for different models.

For n independent Bernoulli observations, the maximized log likelihood is

    log Π_{i=1}^{n} π̂_i^{y_i} (1 − π̂_i)^{1−y_i} = Σ_{i=1}^{n} [ y_i log π̂_i + (1 − y_i) log(1 − π̂_i) ].

The null model gives π̂_i = (Σ_i y_i)/n = ȳ, so that

    L_0 = n [ ȳ log ȳ + (1 − ȳ) log(1 − ȳ) ].

The saturated model has a parameter for each subject and implies that π̂_i = y_i for all i. Thus, L_S = 0 and (6.3) simplifies to

    D = (L_0 − L_M) / L_0.

McFadden (1974) proposed this measure.

With multiple observations at each setting of explanatory variables, the data file can take the grouped-data form of N binomial counts rather than n Bernoulli indicators. The saturated model then has a parameter for each count. It gives N fitted proportions equal to the N sample proportions of success. Then L_S is nonzero, and (6.3) takes a different value than when calculated using individual subjects. For N binomial counts, the maximized likelihoods are related to the G² goodness-of-fit statistic by G²(M) = −2(L_M − L_S), so (6.3) becomes

    D* = [G²(0) − G²(M)] / G²(0).

Goodman (1971a) and Theil (1970) discussed this and related partial association measures. With grouped data, D* can be large even when predictive power is weak at the subject level. For instance, a model can fit much better than the null model even though fitted probabilities are close to 0.5 for the entire sample. In particular, D* = 1 when it fits perfectly, regardless of how well one can predict individual subjects' responses on Y with that model. Also, suppose that the population satisfies the given model, but not the null model. As the sample size n increases with the number of settings N fixed, G²(M) behaves like a chi-squared random variable but G²(0) grows unboundedly. Thus, D* → 1 as n → ∞, and its magnitude tends to depend on n. This measure confounds model goodness of fit with predictive power. Similar behavior occurs for R² in regression analyses when calculated using means of Y values (rather than individual subjects) at N different x settings. It is more sensible to use D for binary, ungrouped data.

6.2.6 Summarizing Predictive Power: Classification Tables and ROC Curves

A classification table cross-classifies the binary response with a prediction of whether y = 0 or 1. The prediction is ŷ = 1 when π̂_i > π_0 and ŷ = 0 when π̂_i ≤ π_0, for some cutoff π_0. Most classification tables use π_0 = 0.5 and summarize predictive power by

    sensitivity = P(ŷ = 1 | y = 1)   and   specificity = P(ŷ = 0 | y = 0)

(recall Section 2.1.2). Limitations of this table are that it collapses continuous predictive values π̂ into binary ones, the choice of π_0 is arbitrary, and it is highly sensitive to the relative numbers of times y = 1 and y = 0.

A receiver operating characteristic (ROC) curve is a plot of sensitivity as a function of (1 − specificity) for the possible cutoffs π_0.

[FIGURE 6.3 ROC curve for logistic regression model with horseshoe crab data.]
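A sketch (hypothetical fitted probabilities, stdlib Python, not from the text) builds the curve's points from this definition and confirms numerically that the trapezoidal area under the curve equals the concordance index, the proportion of (y = 1, y = 0) pairs in which the y = 1 observation has the higher estimate, with tied estimates counted as half-concordant:

```python
def roc_points(y, score):
    """ROC points (1 - specificity, sensitivity), one cutoff at each distinct score."""
    pos = sum(y)
    neg = len(y) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(score), reverse=True):
        tp = sum(1 for yi, s in zip(y, score) if s >= t and yi == 1)
        fp = sum(1 for yi, s in zip(y, score) if s >= t and yi == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def concordance(y, score):
    """Proportion of (y=1, y=0) pairs with the y=1 member scored higher; ties count 1/2."""
    ones = [s for yi, s in zip(y, score) if yi == 1]
    zeros = [s for yi, s in zip(y, score) if yi == 0]
    total = sum(1.0 if s1 > s0 else 0.5 if s1 == s0 else 0.0
                for s1 in ones for s0 in zeros)
    return total / (len(ones) * len(zeros))

# Hypothetical binary outcomes and fitted probabilities (note the tied values 0.6)
y = [1, 1, 0, 0, 1, 0]
pihat = [0.9, 0.4, 0.6, 0.2, 0.6, 0.6]
pts = roc_points(y, pihat)
area = auc(pts)
c = concordance(y, pihat)
```

The curve starts at (0, 0), ends at (1, 1), and here its area and the pair-counting index both equal 2/3.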
This curve usually has a concave shape connecting the points (0, 0) and (1, 1). The higher the area under the curve, the better the predictions. The ROC curve is more informative than the classification table, since it summarizes predictive power for all possible π_0. Figure 6.3 shows how PROC LOGISTIC in SAS reports the ROC curve for the model for the horseshoe crabs using width and color as predictors.

The area under a ROC curve is identical to the value of another measure of predictive power, the concordance index. Consider all pairs of observations (i, j) such that y_i = 1 and y_j = 0. The concordance index c estimates the probability that the predictions and the outcomes are concordant, the observation with the larger y also having the larger π̂ (Harrell et al. 1982). A value c = 0.5 means the predictions were no better than random guessing. This corresponds to a model having only an intercept term and an ROC curve that is a straight line connecting the points (0, 0) and (1, 1). For the horseshoe crab data, c = 0.639 with color alone as a predictor, 0.742 with width alone, 0.771 with width and color, and 0.772 with width and a dummy for whether a crab has dark color.

ROC curves are a popular way of evaluating diagnostic tests. Sometimes such tests have J > 2 ordered response categories rather than (positive, negative). The ROC curve then refers to the various possible cutoffs for defining a result to be positive. It plots sensitivity against 1 − specificity for the possible collapsings of the J categories to a (positive, negative) scale [see Toledano and Gatsonis (1996)].

6.3 INFERENCE ABOUT CONDITIONAL ASSOCIATIONS IN 2 × 2 × K TABLES

The analysis of the graduate admissions data in Section 6.2.3 used the model of conditional independence.
This model is an important one in biomedical studies that investigate whether an association exists between a treatment variable and a disease outcome after controlling for a possibly confounding variable that might influence that association. In this section we review the test of conditional independence as a logit model analysis for a 2 × 2 × K contingency table. We also present a test (Mantel and Haenszel 1959) that seems non-model-based but relates to the logit model.

We illustrate using Table 6.9, showing results of a clinical trial with eight centers. The study compared two cream preparations, an active drug and a control, on their success in curing an infection.

TABLE 6.9 Clinical Trial Relating Treatment to Response for Eight Centers

                        Response
Center   Treatment   Success   Failure   Odds Ratio   μ_{11k}   var(n_{11k})
1        Drug          11        25         1.19       10.36       3.79
         Control       10        27
2        Drug          16         4         1.82       14.62       2.47
         Control       22        10
3        Drug          14         5         4.80       10.50       2.41
         Control        7        12
4        Drug           2        14         2.29        1.45       0.70
         Control        1        16
5        Drug           6        11          ∞          3.52       1.20
         Control        0        12
6        Drug           1        10          ∞          0.52       0.25
         Control        0        10
7        Drug           1         4         2.0         0.71       0.42
         Control        1         8
8        Drug           4         2         0.33        4.62       0.62
         Control        6         1

Source: Beitler and Landis (1985).

This table illustrates a common pharmaceutical application, comparing two treatments on a binary response with observations from several strata. The strata are often medical centers or clinics; or they may be levels of age or severity of the condition being treated or combinations of levels of several control variables; or they may be different studies of the same sort evaluated in a meta-analysis.

6.3.1 Using Logit Models to Test Conditional Independence

For a binary response Y, we study the effect of a binary predictor X, controlling for a qualitative covariate Z. Let π_{ik} = P(Y = 1 | X = i, Z = k). Consider the model

    logit(π_{ik}) = α + β x_i + β_k^Z,   i = 1, 2,   k = 1, ..., K,    (6.4)

where x_1 = 1 and x_2 = 0.
This model assumes that the XY conditional odds ratio is the same at each category of Z, namely exp(β). The null hypothesis of XY conditional independence is H_0: β = 0. The Wald statistic is (β̂/SE)². The likelihood-ratio statistic is the difference between the G² statistics for the reduced model

    logit(π_{ik}) = α + β_k^Z    (6.5)

and the full model. These tests are sensible when X has a similar effect at each category of Z. They have df = 1.

Alternatively, since the reduced model (6.5) is equivalent to conditional independence of X and Y, one could test conditional independence using a goodness-of-fit test of that model. That test has df = K when X is binary. This corresponds to comparing model (6.5) and the saturated model, which permits β ≠ 0 and contains XZ interaction parameters. When no interaction exists or when interaction exists but it has minor substantive importance, it follows from results to be presented in Section 6.4.2 that this approach is less powerful, especially when K is large. However, when the direction of the XY association varies among categories of Z, it can be more powerful.

6.3.2 Cochran–Mantel–Haenszel Test of Conditional Independence

Mantel and Haenszel (1959) proposed a non-model-based test of H_0: conditional independence in 2 × 2 × K tables. Focusing on retrospective studies of disease, they treated the response (column) marginal totals as fixed. Thus, in each partial table k of cell counts {n_{ijk}}, their analysis conditions on both the predictor totals {n_{1+k}, n_{2+k}} and the response outcome totals {n_{+1k}, n_{+2k}}. The usual sampling schemes then yield a hypergeometric distribution (3.16) for the first cell count n_{11k} in each partial table. That count determines {n_{12k}, n_{21k}, n_{22k}}, given the marginal totals. Under H_0, the hypergeometric mean and variance of n_{11k} are

    μ_{11k} = E(n_{11k}) = n_{1+k} n_{+1k} / n_{++k},

    var(n_{11k}) = n_{1+k} n_{2+k} n_{+1k} n_{+2k} / [n_{++k}² (n_{++k} − 1)].
Cell counts from different partial tables are independent. The test statistic combines information from the K tables by comparing Σ_k n_{11k} to its null expected value. It equals

    CMH = [Σ_k (n_{11k} − μ_{11k})]² / Σ_k var(n_{11k}).    (6.6)

This statistic has a large-sample chi-squared null distribution with df = 1. When the odds ratio θ_XY(k) > 1 in partial table k, we expect that (n_{11k} − μ_{11k}) > 0. When θ_XY(k) > 1 in every partial table or θ_XY(k) < 1 in every table, Σ_k (n_{11k} − μ_{11k}) tends to be relatively large in absolute value.

This test works best when the XY association is similar in each partial table. In this sense it is similar to the tests of H_0: β = 0 in logit model (6.4). When the sample sizes in the strata are moderately large, this test usually gives similar results. In fact, it is a score test (Section 1.3.3) of H_0: β = 0 in that model (Day and Byar 1979).

Cochran (1954) proposed a similar statistic. He treated the rows in each 2 × 2 table as two independent binomials rather than a hypergeometric. Cochran's statistic is (6.6) with var(n_{11k}) replaced by

    var(n_{11k}) = n_{1+k} n_{2+k} n_{+1k} n_{+2k} / n_{++k}³.

Because of the similarity in their approaches, we call (6.6) the Cochran–Mantel–Haenszel (CMH) statistic. The Mantel and Haenszel approach using the hypergeometric is more general in that it also applies to some cases in which the rows are not independent binomial samples from two populations. Examples are retrospective studies and randomized clinical trials with the available subjects randomly allocated to two treatments. In the first case the column totals are naturally fixed. In the second, under the null hypothesis the column margins are the same regardless of how subjects were assigned to treatments, and randomization arguments lead to the hypergeometric in each 2 × 2 table. Mantel and Haenszel (1959) proposed (6.6) with a continuity correction.
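Formula (6.6) is straightforward to apply directly to the eight partial tables of Table 6.9. A stdlib-Python sketch (not part of the text):

```python
# Table 6.9 partial tables: (drug success, drug failure, control success, control failure)
tables = [
    (11, 25, 10, 27), (16, 4, 22, 10), (14, 5, 7, 12), (2, 14, 1, 16),
    (6, 11, 0, 12), (1, 10, 0, 10), (1, 4, 1, 8), (4, 2, 6, 1),
]

num = 0.0  # accumulates sum of (n11k - mu11k)
den = 0.0  # accumulates sum of hypergeometric variances
for n11, n12, n21, n22 in tables:
    n = n11 + n12 + n21 + n22
    mu = (n11 + n12) * (n11 + n21) / n                      # row1 total * col1 total / n
    var = ((n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22)
           / (n ** 2 * (n - 1)))
    num += n11 - mu
    den += var

cmh = num ** 2 / den
```

This reproduces the CMH value of 6.38 reported for these data.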
The P-value from the test then better approximates an exact conditional test (Section 6.7.5), but it tends to be conservative. The CMH statistic generalizes for I × J × K tables (Section 7.5.3).

6.3.3 Multicenter Clinical Trial Example

For the multicenter clinical trial, Table 6.9 reports the sample odds ratio for each table and the expected value and variance of the number of successes for the drug treatment (n_{11k}) under H_0: conditional independence. In each table except the last, the sample odds ratio shows a positive association. Thus, it makes sense to combine results, giving CMH = 6.38 with df = 1. There is considerable evidence against H_0 (P = 0.012). Similar results occur in testing H_0: β = 0 in logit model (6.4). The model fit has β̂ = 0.777 with SE = 0.307. The Wald statistic is (0.777/0.307)² = 6.42 (P = 0.011). The likelihood-ratio statistic equals 6.67 (P = 0.010).

6.3.4 CMH Test and Sparse Data*

In summary, for logit model (6.4), CMH is the score statistic alternative to the likelihood-ratio or Wald test of H_0: β = 0. As n → ∞ with fixed K, the tests have the same asymptotic chi-squared behavior under H_0. An advantage of CMH is that its chi-squared limit also applies with an alternative asymptotic scheme in which K → ∞ as n → ∞. The asymptotic theory for likelihood-ratio and Wald tests requires the number of parameters (and hence K) to be fixed, so it does not apply to this scheme. An application of this type is when each stratum has a single matched pair of subjects, one in each group.

With strata of matched pairs, n_{1+k} = n_{2+k} = 1 for each k. Then n = 2K, so K → ∞ as n → ∞. Table 6.10 shows the data layout for this situation. When both subjects in stratum k make the same response (as in the first case in Table 6.10), n_{+1k} = 0 or n_{+2k} = 0. Given the marginal counts, the internal counts are then completely determined, and μ_{11k} = n_{11k} and var(n_{11k}) = 0.
When the subjects make differing responses (as in the second case), n_{+1k} = n_{+2k} = 1, so that μ_{11k} = 0.5 and var(n_{11k}) = 0.25. Thus, a matched pair contributes to the CMH statistic only when the two subjects' responses differ. Let K* denote the number of the K tables that satisfy this. Although each n_{11k} can take only two values, the central limit theorem implies that Σ_k n_{11k} is approximately normal for large K*. Thus, the distribution of CMH is approximately chi-squared.

TABLE 6.10 Stratum Containing a Matched Pair

                       Response                Response
Element of Pair   Success   Failure       Success   Failure
First                1         0             1         0
Second               1         0             0         1

Usually, when K grows with n, each stratum has few observations. There may be more than two observations, such as in case–control studies that match several controls with each case. Contingency tables with relatively few observations are referred to as sparse. The nonstandard setting in which K → ∞ as n → ∞ is called sparse-data asymptotics. Ordinary ML estimation then breaks down because the number of parameters is not fixed, instead having the same order as the sample size. In particular, an approximate chi-squared distribution holds for the likelihood-ratio and Wald statistics for testing conditional independence only when the strata marginal totals generally exceed about 5 to 10 and K is fixed and small relative to n.

6.3.5 Estimation of Common Odds Ratio

It is more informative to estimate the strength of association than to test hypotheses about it. When the association seems stable among partial tables, it is helpful to combine the K sample odds ratios into a summary measure of conditional association. The logit model (6.4) implies homogeneous association, θ_XY(1) = ⋯ = θ_XY(K) = exp(β). The ML estimate of the common odds ratio is exp(β̂). Other estimators of a common odds ratio are not model-based. Woolf (1955)
proposed an exponentiated weighted average of the K sample log odds ratios. Mantel and Haenszel (1959) proposed

    θ̂_MH = Σ_k (n_{11k} n_{22k} / n_{++k}) / Σ_k (n_{12k} n_{21k} / n_{++k})
         = Σ_k p_{11|k} p_{22|k} n_{++k} / Σ_k p_{12|k} p_{21|k} n_{++k},    (6.7)

where p_{ij|k} = n_{ijk}/n_{++k}. This gives more weight to strata with larger sample sizes. It is preferred over the ML estimator when K is large and the data are sparse. The ML estimator β̂ of the log odds ratio then tends to be too large in absolute value. For sparse-data asymptotics with only a single matched pair in each stratum, for instance, β̂ →p 2β. [This convergence in probability means that for any ε > 0, P(|β̂ − 2β| < ε) → 1 as n → ∞; see Problem 10.24.]

Hauck (1979) gave an asymptotic variance for log(θ̂_MH) that applies for a fixed number of strata. In that case log(θ̂_MH) is slightly less efficient than the ML estimator β̂ unless β = 0 (Tarone et al. 1983). Robins et al. (1986) derived an estimated variance that applies both for these standard asymptotics with large n and fixed K and for sparse asymptotics in which K is also large. Expressing θ̂_MH = R/S = (Σ_k R_k)/(Σ_k S_k) with R_k = n_{11k} n_{22k}/n_{++k} and S_k = n_{12k} n_{21k}/n_{++k}, their derivation showed that (log θ̂_MH − log θ) is approximately proportional to (R − θS). They also showed that E(R − θS) = 0 and derived the variance of (R − θS). Their result is

    σ̂²[log θ̂_MH] = Σ_k n_{++k}^{−1} (n_{11k} + n_{22k}) R_k / (2R²)
                  + Σ_k n_{++k}^{−1} (n_{12k} + n_{21k}) S_k / (2S²)
                  + Σ_k n_{++k}^{−1} [ (n_{11k} + n_{22k}) S_k + (n_{12k} + n_{21k}) R_k ] / (2RS).

For the eight-center clinical trial summarized by Table 6.9,

    θ̂_MH = [ (11 × 27)/73 + ⋯ + (4 × 1)/13 ] / [ (25 × 10)/73 + ⋯ + (2 × 6)/13 ] = 2.13.

For log θ̂_MH = 0.758, σ̂[log θ̂_MH] = 0.303. A 95% confidence interval for the common odds ratio is exp(0.758 ± 1.96 × 0.303), or (1.18, 3.87). Similar results occur using model (6.4). The 95% confidence interval for exp(β)
is exp Ž0.777 " 1.96 = 0.307., or Ž1.19, 3.97., using the Wald interval, and Ž1.20, 4.02. using the likelihood-ratio interval. Although the evidence of an effect is considerable, inference about its size is rather imprecise. The odds of success may be as little as 20% higher with the drug, or they may be as much as four times as high. If the true odds ratios are not identical but do not vary drastically, ␪ˆM H still is a useful summary of the conditional associations. Similarly, the CMH test is a powerful summary of evidence against H0 : conditional independence, as long as the sample associations fall primarily in a single direction. It is not necessary to assume equality of odds ratios to use the CMH test. 6.3.6 Testing Homogeneity of Odds Ratios The homogeneous association condition ␪ X Y Ž1. s ⭈⭈⭈ s ␪ X Y Ž K . for 2 = 2 = K tables is equivalent to logit model Ž6.4.. A test of homogeneous association is implicitly a goodness-of-fit test of this model. The usual G 2 and X 2 test statistics provide this, with df s K y 1. They test that the K y 1 parameters in the saturated model that are the coefficients of interaction terms wcross products of the dummy variable for x with Ž K y 1. dummy variables for categories of Z x all equal 0. Breslow and Day Ž1980, p. 142. proposed an alternative large-sample test ŽNote 6.5.. For the eight-center clinical trial data in Table 6.9, G 2 s 9.7 and X 2 s 8.0 Ždf s 7. do not contradict the hypothesis of equal odds ratios. It is reasonable to summarize the conditional association by a single odds ratio Že.g., ␪ˆMH s 2.1. for all eight partial tables. In fact, even with a small P-value in a test of homogeneous association, if the variability in the sample odds ratios is not substantial, a summary measure such as ␪ˆMH is useful. A test of homogeneity is not a prerequisite for this measure or for testing conditional independence. 
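As a sketch of the estimators just described, the following computes the Mantel–Haenszel odds ratio (6.7) and the Robins et al. (1986) variance of its log, with a Wald confidence interval for the common odds ratio. The 2 × 2 × K counts here are hypothetical, not the Table 6.9 data, and the function name is ours.

```python
import numpy as np

def mantel_haenszel(tables):
    """Mantel-Haenszel common odds ratio (6.7) and the Robins-Breslow-
    Greenland estimated variance of its log, for 2x2xK data given as a
    list of K tables [[n11, n12], [n21, n22]]."""
    t = np.asarray(tables, dtype=float)
    n = t.sum(axis=(1, 2))                 # stratum totals n_{++k}
    R_k = t[:, 0, 0] * t[:, 1, 1] / n      # n11k * n22k / n++k
    S_k = t[:, 0, 1] * t[:, 1, 0] / n      # n12k * n21k / n++k
    R, S = R_k.sum(), S_k.sum()
    P_k = (t[:, 0, 0] + t[:, 1, 1]) / n    # (n11k + n22k) / n++k
    Q_k = (t[:, 0, 1] + t[:, 1, 0]) / n    # (n12k + n21k) / n++k
    var_log = ((P_k * R_k).sum() / (2 * R ** 2)
               + (P_k * S_k + Q_k * R_k).sum() / (2 * R * S)
               + (Q_k * S_k).sum() / (2 * S ** 2))
    return R / S, var_log

# Hypothetical two-stratum example; 95% CI from exp(log OR +- 1.96 SE)
or_mh, v = mantel_haenszel([[[10, 5], [4, 12]], [[8, 7], [6, 9]]])
lo, hi = np.exp(np.log(or_mh) + np.array([-1.96, 1.96]) * np.sqrt(v))
```

For a single 2 × 2 table, the variance reduces algebraically to Woolf's 1/n₁₁ + 1/n₁₂ + 1/n₂₁ + 1/n₂₂, a useful sanity check.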
6.3.7 Summarizing Heterogeneity in Odds Ratios In practice, a predictor effect is often similar from stratum to stratum. In multicenter clinical trials comparing a new drug to a standard, for example, if the new drug is truly more beneficial, the true effect is usually positive in each stratum. 236 BUILDING AND APPLYING LOGISTIC REGRESSION MODELS In strict terms, however, a model with homogeneous effects is unrealistic. First, we rarely expect the true odds ratio to be exactly the same in each stratum, because of unmeasured covariates that affect it. Breslow Ž1976. discussed modeling of the log odds ratio using a set of explanatory variables. Second, the model regards the strata effects  ␤ kZ 4 as fixed effects, treating them as the only strata of interest. Often the strata are merely a sampling of the possible ones. Multicenter clinical trials have data for certain centers but many other centers could have been used. Scientists would like their conclusions to apply to all such centers, not only those in the study. A somewhat different logit model treats the true log odds ratios in partial tables as a random sample from a N Ž ␮, ␴ 2 . distribution. Fitting the model yields an estimated mean log odds ratio and an estimated variability about that mean. The inference applies to the population of strata rather than only those sampled. This type of model uses random effects in the linear predictor to induce this extra type of variability. In Chapter 12 we discuss GLMs with random effects, and in Section 12.3.4 we fit such a model to Table 6.9. 6.4 USING MODELS TO IMPROVE INFERENTIAL POWER When contingency tables have ordered categories, in Section 3.4 we showed that tests that utilize the ordering can have improved power. Testing independence against a linear trend alternative in a linear logit model ŽSections 5.3.4, and 5.4.6 . is a way to do this. In this section we present the reason for these power improvements. 
6.4.1 Directed Alternatives Consider an I = 2 contingency table for I binomial variates with parameters ␲ i 4 . H0 : independence states logit Ž ␲ i . s ␣ . The ordinary X 2 and G 2 statistics of Section 3.2.1 refer to the general alternative, logit Ž ␲ i . s ␣ q ␤i , which is saturated. They test H0 : ␤ 1 s ␤ 2 s ⭈⭈⭈ s ␤I s 0 in that model, with df s Ž I y 1.. Their general alternative treats both classifications as nominal. Denote these test statistics as G 2 Ž I . and X 2 Ž I .. Recall that G 2 Ž I . is the likelihood-ratio statistic G 2 Ž M0 < M1 . s y2Ž L0 y L1 . for comparing the saturated model M1 with the independence Ž I . model M0 . Ordinal test statistics refer to narrower, usually more relevant, alternatives. With ordered rows, an example is a test of H0 : ␤ s 0 in the linear logit USING MODELS TO IMPROVE INFERENTIAL POWER 237 model, logitŽ␲ i . s ␣ q ␤ x i . The likelihood-ratio statistic G 2 Ž I < L. s G 2 Ž I . y G 2 Ž L. compares the linear logit model and the independence model. When a test statistic focuses on a single parameter, such as ␤ in that model, it has df s 1. Now, df equals the mean of the chi-squared distribution. A large test statistic with df s 1 falls farther out in its right-hand tail than a comparable value of X 2 Ž I . or G 2 Ž I . with df s Ž I y 1.. Thus, it has a smaller P-value. 6.4.2 Noncentral Chi-Squared Distribution To compare power of G 2 Ž I < L. and G 2 Ž I ., it is necessary to compare their nonnull sampling distributions. When H0 is false, their distributions are approximately noncentral chi-squared. This distribution, introduced by R. A. Fisher in 1928, arises from the following construction: If Zi ; N Ž ␮i , 1., i s 1, . . . , ␯ , and if Z1 , . . . , Z␯ are independent, ÝZi2 has the noncentral chisquared distribution with df s ␯ and noncentrality parameter ␭ s Ý ␮2i . Its mean is ␯ q ␭ and its variance is 2Ž ␯ q 2 ␭.. The ordinary Žcentral . chisquared distribution, which occurs when H0 is true, has ␭ s 0. 
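The defining construction of the noncentral chi-squared distribution is easy to check by simulation; a sketch, where the means μ_i are arbitrary illustrative values:

```python
import numpy as np

# If Z_i ~ N(mu_i, 1) independently, sum(Z_i^2) is noncentral
# chi-squared with df = nu and noncentrality lambda = sum(mu_i^2);
# its mean is nu + lambda and its variance 2(nu + 2 lambda).
rng = np.random.default_rng(1)
mu = np.array([0.5, 1.0, 1.5])    # nu = 3, lambda = 3.5
z = rng.normal(loc=mu, scale=1.0, size=(200_000, 3))
s = (z ** 2).sum(axis=1)
print(s.mean())   # close to 3 + 3.5 = 6.5
print(s.var())    # close to 2 * (3 + 2 * 3.5) = 20
```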
Let X␯2, ␭ denote a noncentral chi-squared random variable with df s ␯ and noncentrality ␭. A fundamental result for chi-squared analyses is that, for fixed ␭, P X␯2, ␭ ) ␹␯2 Ž␣ . increases as ␯ decreases . That is, the power for rejecting H0 at a fixed ␣-level increases as the df of the test decreases Že.g., Das Gupta and Perlman 1974.. For fixed ␯ , the power equals ␣ when ␭ s 0, and it increases as ␭ increases. The inverse relation between power and df suggests that focusing the noncentrality on a statistic having a small df value can improve power. 6.4.3 Increased Power for Narrower Alternatives Suppose that X has, at least approximately, a linear effect on logit w P Ž Y s 1.x. To test independence, it is then sensible to use a statistic having strong power for that effect. This is the purpose of the tests based on the linear logit model, using the likelihood-ratio statistic G 2 Ž I < L., the Wald statistic z s ␤ˆrSE, and the Cochran᎐Armitage Žscore. statistic. When is G 2 Ž I < L. more powerful than G 2 Ž I .? The statistics satisfy G 2 Ž I . s G 2 Ž I < L. q G2 Ž L. , where G 2 Ž L. tests goodness of fit of the linear logit model. When the linear logit model holds, G 2 Ž L. has an asymptotic chi-squared distribution with 238 BUILDING AND APPLYING LOGISTIC REGRESSION MODELS df s I y 2; then if ␤ / 0, G 2 Ž I . and G 2 Ž I < L. both have approximate noncentral chi-squared distributions with the same noncentrality. Whereas df s I y 1 for G 2 Ž I ., df s 1 for G 2 Ž I < L.. Thus, G 2 Ž I < L. is more powerful, since it uses fewer degrees of freedom. When the linear logit model does not hold, G 2 Ž I . has greater noncentrality than G 2 Ž I < L., the discrepancy increasing as the model fits more poorly. However, when the model approximates reality fairly well, usually G 2 Ž I < L. is still more powerful. That test’s df value of 1 more than compensates for its loss in noncentrality. 
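The inverse relation between power and df can be computed directly from the noncentral chi-squared distribution; a sketch using SciPy (the helper name is ours):

```python
from scipy.stats import chi2, ncx2

def power(lam, df, alpha=0.05):
    """P(noncentral chi-squared exceeds the central critical value)."""
    return ncx2.sf(chi2.ppf(1 - alpha, df), df, lam)

# For fixed noncentrality, power rises as df falls; with lam = 0 it
# reduces to alpha.
print(power(1.0, 1))   # about 0.17, matching the df = 1 row of Table 6.12
print(power(1.0, 7))
```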
The closer the true relationship is to the linear logit, the more nearly G²(I|L) captures the same noncentrality as G²(I), and the more powerful it is compared to G²(I). To illustrate, Figure 6.4 plots power as a function of noncentrality when df = 1 and 7. When the noncentrality of a test having df = 1 is at least about half that of a test having df = 7, the test with df = 1 is more powerful. The linear logit model then helps detect a key component of an association. As Mantel (1963) argued in a similar context, "that a linear regression is being tested does not mean that an assumption of linearity is being made. Rather it is that test of a linear component of regression provides power for detecting any progressive association which may exist."

The improved power results from sacrificing power in other cases. The G²(I) test can have greater power than G²(I|L) when the linear logit model describes reality very poorly.

The remark about the desirability of focusing noncentrality holds for nominal variables also. For instance, consider testing conditional independence in 2 × 2 × K tables. One approach tests β = 0 in model (6.4), using df = 1. Another approach tests goodness of fit of model (6.5), using df = K (Section 6.3.1). When model (6.4) holds, both tests have the same noncentrality. Thus, the test of β = 0 is more powerful, since it has fewer degrees of freedom.

FIGURE 6.4 Power and noncentrality, for df = 1 and df = 7, when α = 0.05.

TABLE 6.11 Change in Clinical Condition by Degree of Infiltration

                           Degree of Infiltration
Clinical Change          High    Low    Proportion High
Worse                      1      11         0.08
Stationary                13      53         0.20
Slight improvement        16      42         0.28
Moderate improvement      15      27         0.36
Marked improvement         7      11         0.39

Source: Reprinted with permission from the Biometric Society (Cochran 1954).
6.4.4 Treatment of Leprosy Example Table 6.11 refers to an experiment on the use of sulfones and streptomycin drugs in the treatment of leprosy. The degree of infiltration at the start of the experiment measures a type of skin damage. The response is the change in the overall clinical condition of the patient after 48 weeks of treatment. We use response scores  y1, 0, 1, 2, 34 . The question of interest is whether subjects with high infiltration changed differently from those with low infiltration. Here, the clinical change response variable is ordinal. It seems natural to compare the mean change for the two infiltration levels. Cochran Ž1954. and Yates Ž1948. noted that this analysis is identical to a trend test treating the binary variable as the response. That test is sensitive to linearity between clinical change and the proportion of cases with high infiltration. The test G 2 Ž I . s 7.28 Ždf s 4. does not show much evidence of association Ž P s 0.12., but it ignores the row ordering. The sample proportion of high infiltration increases monotonically as the clinical change improves. The test of H0 : ␤ s 0 in the linear logit model has G 2 Ž I < L. s 6.65, with df s 1 Ž P s 0.01.. It gives strong evidence of more positive clinical change at the higher level of infiltration. Using the ordering by decreasing df from 4 to 1 pays a strong dividend. In addition, G 2 Ž L. s 0.63 with df s 3 suggests that the linear trend model fits well. 6.4.5 Model Smoothing Improves Precision of Estimation Using directed alternatives can improve not only test power, but also estimation of cell probabilities and summary measures. In generic form, let ␲ be true cell probabilities in a contingency table, let p denote sample proportions, and let ␲ ˆ denote model-based ML estimates of ␲. 240 BUILDING AND APPLYING LOGISTIC REGRESSION MODELS When ␲ satisfy a certain model, both ␲ ˆ for that model and p are consistent estimators of ␲. 
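The trend test for Table 6.11 can be reproduced, up to the usual small difference between score and likelihood-ratio statistics, by computing the Cochran–Armitage statistic directly from the counts and scores; a sketch:

```python
import numpy as np
from scipy.stats import chi2

# Table 6.11, treating "high infiltration" as the binary response and
# using the clinical-change scores {-1, 0, 1, 2, 3}
high = np.array([1, 13, 16, 15, 7])
low = np.array([11, 53, 42, 27, 11])
s = np.array([-1.0, 0.0, 1.0, 2.0, 3.0])
n_i, n = high + low, (high + low).sum()
pbar = high.sum() / n
num = (s * high).sum() - pbar * (s * n_i).sum()
den = pbar * (1 - pbar) * ((s**2 * n_i).sum() - (s * n_i).sum() ** 2 / n)
z2 = num**2 / den        # about 6.7; compare G2(I|L) = 6.65, df = 1
pval = chi2.sf(z2, 1)    # about 0.01
```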
The model-based estimator ␲ ˆ is better, as its true asymptotic standard error cannot exceed that of p. This happens because of model parsimony: The unsaturated model, on which ␲ ˆ is based, has fewer parameters than the saturated model, on which p is based. In fact, modelbased estimators are also more efficient in estimating functions g Ž ␲ . of cell probabilities. For any differentiable function g, asymp. var 'n g Ž ␲ˆ . F asymp. var 'n g Ž p. . In Section 14.2.2 we prove this result. It holds more generally than for categorical data models ŽAltham 1984.. This is one reason that statisticians prefer parsimonious models. In reality, of course, a chosen model is unlikely to hold exactly. However, when the model approximates ␲ well, unless n is extremely large, ␲ ˆ is still better than p. Although ␲ ˆ i is biased, it has smaller variance than pi , and MSEŽ␲ ˆ i . - MSEŽ pi . when its variance plus squared bias is smaller than varŽ pi .. In Section 3.3.7 we showed that in two-way tables, independencemodel estimates of cell probabilities can be better than sample proportions even when that model does not hold. 6.5 SAMPLE SIZE AND POWER CONSIDERATIONS* In any statistical procedure, the sample size n influences the results. Strong effects are likely to be detected even when n is small. By contrast, detection of weak effects requires large n. A study design should reflect the sample size needed to provide good power for detecting the effect. 6.5.1 Sample Size and Power for Comparing Two Proportions For test statistics having large-sample normal distributions, power calculations can use ordinary methods. To illustrate, consider a test comparing binomial parameters ␲ 1 and ␲ 2 for two medical treatments. An experiment plans independent samples of size n i s nr2 receiving each treatment. The researchers expect ␲ i f 0.6 for each, and a difference of at least 0.10 is important. 
In testing H0: π₁ = π₂, the variance of the difference π̂₁ − π̂₂ in sample proportions is π₁(1 − π₁)/(n/2) + π₂(1 − π₂)/(n/2) ≈ 0.6 × 0.4 × (4/n) = 0.96/n. In particular,

z = [(π̂₁ − π̂₂) − (π₁ − π₂)] / (0.96/n)^{1/2}

has approximately a standard normal distribution for π₁ and π₂ near 0.6. The power of an α-level test of H0 is approximately

P[|π̂₁ − π̂₂| / (0.96/n)^{1/2} ≥ z_{α/2}].

When π₁ − π₂ = 0.10, for α = 0.05, this equals

P{[(π̂₁ − π̂₂) − 0.10] / (0.96/n)^{1/2} > 1.96 − 0.10(n/0.96)^{1/2}}
  + P{[(π̂₁ − π̂₂) − 0.10] / (0.96/n)^{1/2} < −1.96 − 0.10(n/0.96)^{1/2}}
= P[z > 1.96 − 0.10(n/0.96)^{1/2}] + P[z < −1.96 − 0.10(n/0.96)^{1/2}]
= 1 − Φ[1.96 − 0.10(n/0.96)^{1/2}] + Φ[−1.96 − 0.10(n/0.96)^{1/2}],

where Φ is the standard normal cdf. The power is approximately 0.11 when n = 50 and 0.30 when n = 200. It is not easy to attain significance when effects are small and the sample is not very large. Figure 6.5 shows how the power increases in n when π₁ − π₂ = 0.1. By contrast, it shows how the power improves when π₁ − π₂ = 0.2.

FIGURE 6.5 Approximate power for testing equality of proportions, with true values near middle of range and α = 0.05.

For a given P(type I error) = α and P(type II error) = β (and hence power = 1 − β), one can determine the sample size needed to attain those values. A study using n₁ = n₂ requires approximately

n₁ = n₂ = (z_{α/2} + z_β)² [π₁(1 − π₁) + π₂(1 − π₂)] / (π₁ − π₂)².

For a test with α = 0.05 and β = 0.10 when π₁ and π₂ are truly about 0.60 and 0.70, n₁ = n₂ = 473. This formula also provides the sample sizes needed for a comparable confidence interval for π₁ − π₂. With about 473 subjects in each group, a 95% confidence interval has only a 0.10 chance of containing 0 when actually π₁ = 0.60 and π₂ = 0.70.
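The power and sample-size formulas of Section 6.5.1 translate directly into code; a sketch, with function names of our own choosing (the power helper uses the alternative-hypothesis variance rather than the 0.96/n approximation above, so it differs slightly from the quoted values):

```python
from scipy.stats import norm

def n_per_group(p1, p2, alpha=0.05, beta=0.10):
    """Sample size per group for a two-sided alpha-level test of
    H0: pi1 = pi2 to have power about 1 - beta."""
    za, zb = norm.ppf(1 - alpha / 2), norm.ppf(1 - beta)
    return (za + zb) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2

def approx_power(p1, p2, n_total, alpha=0.05):
    """Approximate power with n_total/2 subjects per group."""
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / (n_total / 2)) ** 0.5
    za, d = norm.ppf(1 - alpha / 2), abs(p1 - p2)
    return norm.sf(za - d / se) + norm.cdf(-za - d / se)

print(round(n_per_group(0.60, 0.70)))  # 473, as in the text
print(approx_power(0.60, 0.70, 50))    # about 0.12 (text: 0.11)
```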
This sample-size formula is approximate and may underestimate slightly the actual values required. It is adequate for most practical work, though, in which only rough conjectures are available for ␲ 1 and ␲ 2 . Fleiss Ž1981. showed more precise formulas. 6.5.2 Sample Size Determination in Logistic Regression Consider now the model logitw␲ Ž x i .x s ␣ q ␥ x i , i s 1, . . . , n, in which x is quantitative. wWe use ␥ so as not to confuse with ␤ s P Žtype II error..x The sample size needed to achieve a certain power for testing H0 : ␥ s 0 depends on the variance of ␥ ˆ. This depends on ␲ Ž x i .4, and formulas for n use a guess for ␲ ˆ s ␲ Ž x . and the distribution of X. The effect size is the log odds ratio ␶ comparing ␲ Ž x . to ␲ Ž x q s x ., the probability for a standard deviation above the mean of x. For a one-sided test when X is approximately normal, Hsieh Ž1989. derived n s z␣ q z␤ exp Ž y␶ 2r4 . Ž 1 q 2␲␦ ˆ . r Ž ␲␶ ˆ 2., 2 where ␦ s 1 q Ž 1 q ␶ 2 . exp Ž 5␶ 2r4 . r 1 q exp Ž y␶ 2r4 . . The value n decreases as ␲ ˆ ™ 0.5 and as < ␶ < increases. We illustrate for modeling the effect of x s cholesterol level on the probability of severe heart disease for a population for which that probability at an average level of cholesterol is about 0.08. Researchers want the test to be sensitive to a 50% increase in this probability, for a standard deviation increase in cholesterol. The odds of severe heart disease at the mean cholesterol level equal 0.08r0.92 s 0.087, and the odds one standard deviation above the mean equal 0.12r0.88 s 0.136. The odds ratio equals 0.136r0.087 s 1.57, and ␶ s log Ž1.57. s 0.450. For ␣ s 0.05 and ␤ s 0.10, ␦ s 1.306 and n s 612. 6.5.3 Sample Size in Multiple Logistic Regression A multiple logistic regression model requires larger n to detect effects. Let R denote the multiple correlation between the predictor X of interest and the 243 SAMPLE SIZE AND POWER CONSIDERATIONS others in the model. The formula for n above divides by Ž1 y R 2 .. 
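Hsieh's formula, together with the 1/(1 − R²) variance-inflation adjustment just described, can be sketched as follows; the function name and its R argument are our own conventions:

```python
from math import exp, log
from scipy.stats import norm

def n_logistic(p, tau, alpha=0.05, beta=0.10, R=0.0):
    """Hsieh (1989) sample size for a one-sided test of H0: gamma = 0
    with an approximately normal predictor X; p is P(Y=1) at the mean
    of x, tau the log odds ratio per standard deviation of x, and R
    the multiple correlation of X with any other predictors."""
    za, zb = norm.ppf(1 - alpha), norm.ppf(1 - beta)
    delta = (1 + (1 + tau**2) * exp(5 * tau**2 / 4)) / (1 + exp(-tau**2 / 4))
    n = (za + zb * exp(-tau**2 / 4)) ** 2 * (1 + 2 * p * delta) / (p * tau**2)
    return n / (1 - R**2)

# Cholesterol example: p = 0.08, odds ratio 1.57 per SD of cholesterol
tau = log((0.12 / 0.88) / (0.08 / 0.92))
print(round(n_logistic(0.08, tau)))           # 612
print(round(n_logistic(0.08, tau, R=0.40)))   # about 729
```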
In that formula, π̂ is evaluated at the mean of all the explanatory variables, and the odds ratio refers to the effect of X at the mean level of the other predictors. Consider the example in Section 6.5.2 when blood pressure is also a predictor. If the correlation between cholesterol and blood pressure is 0.40, we need n ≈ 612/[1 − (0.40)²] = 729.

These formulas provide, at best, rough indications of sample size. Most applications have only a crude guess for π̂ and R, and X may be far from normally distributed. For other work on this problem, see Hsieh et al. (1998) and Whittemore (1981).

6.5.4 Power for Chi-Squared Tests in Contingency Tables

When hypotheses are false, squared normal and X² and G² statistics have large-sample noncentral chi-squared distributions (Section 6.4.2). Suppose that H0 is equivalent to model M for a contingency table. Let π_i denote the true probability in cell i, and let π_i(M) denote the value to which the ML estimate π̂_i for model M converges, where Σπ_i = Σπ_i(M) = 1. For a multinomial sample of size n, the noncentrality parameter for X² equals

λ = n Σ_i [π_i − π_i(M)]² / π_i(M).     (6.8)

This has the same form as X², with π_i in place of the sample proportion p_i and π_i(M) in place of π̂_i. The noncentrality parameter for G² equals

λ = 2n Σ_i π_i log[π_i / π_i(M)].     (6.9)
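Formulas (6.8) and (6.9), together with the power recipe of Section 6.5.4, can be sketched as follows; the cell probabilities below are hypothetical:

```python
import numpy as np
from scipy.stats import chi2, ncx2

def noncentrality(pi, pi_M, n):
    """Noncentrality parameters (6.8) for X^2 and (6.9) for G^2, given
    true cell probabilities pi and the probabilities pi_M to which the
    model-based estimates converge."""
    pi, pi_M = np.asarray(pi, float), np.asarray(pi_M, float)
    lam_x2 = n * np.sum((pi - pi_M) ** 2 / pi_M)
    lam_g2 = 2 * n * np.sum(pi * np.log(pi / pi_M))
    return lam_x2, lam_g2

# Hypothetical true and model-implied probabilities, n = 100, df = 1
lx, lg = noncentrality([0.5, 0.5], [0.6, 0.4], n=100)
power = ncx2.sf(chi2.ppf(0.95, 1), 1, lx)   # final step of the recipe
```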
TABLE 6.12 Power of Chi-Squared Test for α = 0.05

                                        Noncentrality
df    0.0   0.2   0.4   0.6   0.8   1.0   2.0   3.0   4.0   5.0   7.0  10.0  15.0  25.0
 1   .050  .073  .097  .121  .146  .170  .293  .410  .516  .609  .754  .885  .972  .998
 2   .050  .065  .081  .098  .115  .133  .226  .322  .415  .504  .655  .815  .944  .996
 3   .050  .062  .075  .088  .102  .116  .192  .275  .358  .440  .590  .761  .917  .993
 4   .050  .060  .071  .082  .093  .106  .172  .244  .320  .396  .540  .716  .891  .989
 6   .050  .058  .066  .075  .084  .094  .146  .206  .270  .336  .468  .644  .843  .980
 8   .050  .057  .064  .071  .079  .087  .131  .182  .238  .296  .417  .588  .799  .968
10   .050  .056  .062  .068  .075  .082  .121  .166  .215  .268  .379  .542  .760  .956
20   .050  .053  .056  .060  .063  .066  .096  .125  .158  .193  .273  .402  .611  .883
50   .050  .052  .054  .056  .059  .061  .076  .092  .110  .129  .173  .250  .398  .687

Source: Reprinted with permission from G. E. Haynam, Z. Govindarajulu, and F. C. Leone, in Selected Tables in Mathematical Statistics, eds. H. L. Harter and D. B. Owen (Chicago: Markham, 1970).

When H0 is true, all π_i = π_i(M). Then, for either statistic, λ = 0 and the central chi-squared distribution applies. To determine the approximate power for a chi-squared test with df = ν: (1) choose a hypothetical set of true values {π_i}; (2) calculate {π_i(M)} by fitting to {π_i} the model M for H0; (3) calculate the noncentrality parameter λ; and (4) calculate P[X²_{ν,λ} > χ²_ν(α)]. Table 6.12 shows an excerpt from a table of noncentral chi-squared probabilities for step 4 with α = 0.05.

6.5.5 Power for Testing Conditional Independence

We use an example based on one in O'Brien (1986). A standard fetal heart rate monitoring test predicts whether a fetus will require nonroutine care following delivery. The standard test has categories (worrisome, reassuring). The response Y is whether the newborn required some nonroutine medical care during the first week after birth (1 = yes, 0 = no).
A new fetal heart rate monitoring test is developed, having categories (very worrisome, somewhat worrisome, reassuring). A physician plans to study whether this new test can help make predictions about the outcome; that is, given the result of the standard test, is there an association between the response and the result of the new test? A relevant statistic tests the effect of the new monitoring test in the logit model having the new test (N) and standard test (S) as qualitative predictors.

To help select n, a statistician asks the physician to conjecture about the joint distribution of the explanatory variables, with questions such as "What proportion of the cases do you think will be scored 'reassuring' by both tests?" For each NS combination, the physician also guessed P(Y = 1). Table 6.13 shows one scenario for marginal and conditional probabilities. These yield a joint distribution {π_ijk} from their product, such as 0.04 × 0.40 = 0.016 for the proportion of cases judged worrisome by the standard test and very worrisome by the new test and requiring nonroutine medical care.

TABLE 6.13 Scenario for Power Computation

Standard      New                   Joint Probability   P(nonroutine care)
Worrisome     Very worrisome              0.04                 0.40
              Somewhat worrisome          0.08                 0.32
              Reassuring                  0.04                 0.27
Reassuring    Very worrisome              0.02                 0.30
              Somewhat worrisome          0.18                 0.22
              Reassuring                  0.64                 0.15

Source: Reprinted with permission from O'Brien (1986).

These joint probabilities yield fitted probabilities π(M₀) and π(M₁) for the null and alternative logit models. (One can get these by entering {π_ijk} in percentage form as counts in software for logistic regression, fit the relevant model, and divide the fitted counts by 100 to get the fitted joint probabilities.) The likelihood-ratio test comparing these models has noncentrality (6.9) with π(M₁) playing the role of π and π(M₀) playing the role of π(M).
For the scenario in Table 6.13, the noncentrality equals 0.00816 n, with df s 2. For n s 400, 600, and 1000, the approximate powers when ␣ s 0.05 are 0.35, 0.49, and 0.73. This scenario predicts 64% of the observations to occur at only one combination of the factors. The lack of dispersion for the factors weakens the power. 6.5.6 Effects of Sample Size on Model Selection and Inference The effects of sample size suggest some cautions for model selection. For small n, the most parsimonious model accepted in a goodness-of-fit test may be quite simple. By contrast, larger samples usually require more complex models to pass goodness-of-fit tests. Then, some effects that are statistically significant may be weak and substantively unimportant. With large n it may be adequate to use a model that is simpler than models that pass goodnessof-fit tests. An analysis that focuses solely on goodness-of-fit tests is incomplete. It is also necessary to estimate model parameters and describe strengths of effects. These remarks merely reflect limitations of significance testing. Null hypotheses are rarely true. With large enough n, they will be rejected. A more relevant concern is whether the difference between true parameter values and null hypothesis values is sufficient to be important. Many methodologists overemphasize testing and underutilize estimation methods such as confidence intervals. When the P-value is small, a confidence interval specifies the extent to which H0 may be false, thus helping us determine whether rejecting it has practical importance. When the P-value is not small, the confidence interval indicates whether some plausible parameter values are far from H0 . A wide confidence interval containing the H0 value indicates that the test had weak power at important alternatives. 6.6 PROBIT AND COMPLEMENTARY LOG-LOG MODELS* For binary responses, in this section we discuss two alternatives to logit models. Like the logit model, these models have form Ž4.8., ␲ Ž x . 
s ⌽ Ž␣ q ␤ x . Ž 6.10 . for a continuous cdf ⌽. The following argument motivates this class. 6.6.1 Tolerance Motivation for Binary Response Models In toxicology, binary response models describe the effect of dosage of a toxin on whether a subject dies. The tolerance distribution provides justification for 246 BUILDING AND APPLYING LOGISTIC REGRESSION MODELS model Ž6.10.. Let x denote the dosage level. For a randomly selected subject, let Y s 1 if the subject dies. Suppose that the subject has tolerance T for the dosage, with Ž Y s 1. equivalent to ŽT F x .. For instance, an insect survives if the dosage x is less than T and dies if the dosage is at least T. Tolerances vary among subjects, and let F Ž t . s P ŽT F t .. For fixed dosage x, the probability a randomly selected subject dies is ␲ Ž x. s PŽ Y s 1 < X s x. s PŽT F x. s FŽ x. . That is, the appropriate binary model is the one having the shape of the cdf F of the tolerance distribution. Let ⌽ denote the standard cdf for the family to which F belongs. A common standardization uses the mean and standard deviation of T, so that ␲ Ž x . s F Ž x . s ⌽ Ž x y ␮ . r␴ . Then, the model has form ␲ Ž x . s ⌽ Ž␣ q ␤ x .. 6.6.2 Probit Models Toxicological experiments often measure dosage as the log concentration ŽBliss 1935.. Often, the tolerance distribution for the dosage is approximately N Ž ␮, ␴ 2 . for unknown ␮ and ␴ . If F is the N Ž ␮, ␴ 2 . cdf, then ␲ Ž x . has the form ␲ Ž x . s ⌽ Ž␣ q ␤ x ., where ⌽ is the standard normal cdf, ␣ s y␮r␴ and ␤ s 1r␴ . In GLM form, ⌽y1 ␲ Ž x . s ␣ q ␤ x Ž 6.11 . is the probit model. The probit link function is ⌽y1 Ž⭈.. Whereas the cdf maps the real line onto the Ž0, 1. probability scale, the inverse cdf maps the Ž0, 1. scale for ␲ Ž x . onto the real line values for linear predictors in binary response models. The response curve for ␲ Ž x . wor for 1 y ␲ Ž x ., when ␤ - 0x has the appearance of the normal cdf with mean ␮ s y␣r␤ and standard deviation ␴ s 1r < ␤ < . 
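The correspondence between probit parameters and the tolerance distribution, and the 0.40β-versus-0.25β rates of change, can be checked numerically; a sketch with hypothetical α and β:

```python
from scipy.stats import norm

# A probit curve pi(x) = Phi(alpha + beta*x) is the cdf of a
# N(-alpha/beta, 1/beta^2) tolerance distribution (values hypothetical)
alpha, beta = -2.0, 4.0
x = 0.3
p1 = norm.cdf(alpha + beta * x)
p2 = norm.cdf(x, loc=-alpha / beta, scale=1 / beta)   # same probability

# Maximum rate of change, at x = -alpha/beta: beta * phi(0) = 0.399 beta,
# versus 0.25 beta for a logistic curve with the same beta
rate = beta * norm.pdf(0.0)
```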
Since 68% of the normal density falls within a standard deviation of the mean, 1r< ␤ < is the distance between x values where ␲ Ž x . s 0.16 or 0.84 and where ␲ Ž x . s 0.50. The rate of change in ␲ Ž x . is ⭸␲ Ž x .r⭸ x s ␤␾ Ž␣ q ␤ x ., where ␾ Ž⭈. is the standard normal density function. The rate is highest when ␣ q ␤ x s 0 Ži.e., at x s y␣r␤ ., where it equals ␤rŽ2␲ .1r2 s 0.40 ␤ Žfor ␲ s 3.14 . . . .. At that point, ␲ Ž x . s 12 . By comparison, in logistic regression with parameter ␤ , the curve for ␲ Ž x . is a logistic cdf with standard deviation ␲r< ␤ < '3 . Its rate of change in ␲ Ž x . at x s y␣r␤ is 0.25 ␤ . The rates of change where ␲ Ž x . s 12 are the same for the cdf ’s corresponding to the probit and logistic curves when the logistic ␤ is 0.40r0.25 s 1.6 times the probit ␤ . The standard deviations are the same when the logistic ␤ is ␲r'3 s 1.8 times the probit ␤ . When both PROBIT AND COMPLEMENTARY LOG-LOG MODELS 247 models fit well, parameter estimates in logistic regression are about 1.6 to 1.8 times those in probit models. The likelihood equations that Ž4.24. showed for binomial regression models apply to probit models Žsee also Problem 6.32.. One can solve them using the Fisher scoring algorithm for GLMs ŽBliss 1935, Fisher 1935b.. Newton᎐Raphson yields the same ML estimates but slightly different standard errors. For the information matrix inverted to obtain the asymptotic covariance matrix, Newon᎐Raphson uses observed information, whereas Fisher scoring uses expected information. These differ for binary links other than the logit. 6.6.3 Beetle Mortality Example Table 6.14 reports the number of beetles killed after 5 hours of exposure to gaseous carbon disulfide at various concentrations. Figure 6.6 plots Žas dots. the proportion killed against the log concentration. The proportion jumps up at about x s 1.8, and it is close to 1 above there. The ML fit of the probit model is ⌽y1 ␲ ˆ Ž x . s y34.96 q 19.74 x. For this fit, ␲ ˆ Ž x . 
= 0.5 at x = 34.96/19.74 = 1.77. The fit corresponds to a normal tolerance distribution with μ = 1.77 and σ = 1/19.74 = 0.05. The curve for π̂(x) is that of a N(1.77, 0.05²) cdf. At dosage x_i with n_i beetles, n_i π̂(x_i) is the fitted count for death, i = 1, ..., 8. Table 6.14 reports the fitted values and Figure 6.6 shows the fit. The table also shows fitted values for the linear logit model. These models fit similarly and rather poorly. The G² goodness-of-fit statistic equals 11.1 for the logit model and 10.0 for the probit model, with df = 6.

TABLE 6.14 Beetles Killed after Exposure to Carbon Disulfide

                                               Fitted Values
Log Dose   Number of Beetles   Number Killed   Comp. Log-Log   Probit   Logit
 1.691            59                  6             5.7           3.4     3.5
 1.724            60                 13            11.3          10.7     9.8
 1.755            62                 18            20.9          23.4    22.4
 1.784            56                 28            30.3          33.8    33.9
 1.811            63                 52            47.7          49.6    50.0
 1.837            59                 53            54.2          53.4    53.3
 1.861            62                 61            61.1          59.7    59.2
 1.884            60                 60            59.9          59.2    58.8

Source: Data reprinted with permission from Bliss (1935).

FIGURE 6.6 Proportion of beetles killed versus log dosage, with fits of probit and complementary log-log models.

6.6.4 Complementary Log-Log Link Models

The logit and probit links are symmetric about 0.5, in the sense that

link[π(x)] = −link[1 − π(x)].

To illustrate,

logit[π(x)] = log{π(x)/[1 − π(x)]} = −log{[1 − π(x)]/π(x)} = −logit[1 − π(x)].

This means that the response curve for π(x) has a symmetric appearance about the point where π(x) = 0.5, so π(x) approaches 0 at the same rate it approaches 1. Logit and probit models are inappropriate when this is badly violated. The response curve

π(x) = 1 − exp[−exp(α + βx)]     (6.12)

has the shape shown in Figure 6.7. It is asymmetric, π(x) approaching 0 fairly slowly but approaching 1 quite sharply. For this model, log[−log(1 − π(x))] = α + βx.
The link for this GLM is called the complementary log-log link, since the log-log link applies to the complement of ␲ Ž x .. PROBIT AND COMPLEMENTARY LOG-LOG MODELS 249 FIGURE 6.7 Model with complementary log᎐log link. To interpret model Ž6.12., we note that at x 1 and x 2 , log ylog Ž 1 y ␲ Ž x 2 . . y log ylog Ž 1 y ␲ Ž x 1 . . s ␤ Ž x 2 y x 1 . , so that log 1 y ␲ Ž x 2 . log 1 y ␲ Ž x 1 . s exp ␤ Ž x 2 y x 1 . and 1 y ␲ Ž x 2 . s 1 y ␲ Ž x1 . exp w ␤ Ž x 2 yx 1 .x . For x 2 y x 1 s 1, the complement probability at x 2 equals the complement probability at x 1 raised to the power expŽ ␤ .. A related model to Ž6.12. is ␲ Ž x . s exp yexp Ž␣ q ␤ x . . Ž 6.13 . For it, ␲ Ž x . approaches 0 sharply but approaches 1 slowly. As x increases, the curve is monotone decreasing when ␤ ) 0, and monotone increasing when ␤ - 0. In GLM form it uses the log-log link log ylog Ž ␲ Ž x . . s ␣ q ␤ x. When the complementary log-log model holds for the probability of a success, the log-log model holds for the probability of a failure. Model Ž6.13. with log-log link is the special case of Ž6.10. with cdf of the extreme ®alue Žor Gumbel . distribution. The cdf equals F Ž x . s exp  yexp y Ž x y a . rb 4 250 BUILDING AND APPLYING LOGISTIC REGRESSION MODELS for parameters b ) 0 and y⬁ - a - ⬁. It has mean a q 0.577b and standard deviation ␲ br'6 . Models with log-log links can be fitted using the Fisher scoring algorithm for GLMs. 6.6.5 Beetle Mortality Example Revisited For the beetle mortality data ŽTable 6.14., the complementary log-log model has ML estimates ␣ ˆ s y39.52 and ␤ˆ s 22.01. At dosage x s 1.7, the fitted probability of survival is 1 y ␲ ˆ Ž x . s expyexpwy39.52 q 22.01Ž1.7.x4 s 0.885, whereas at x s 1.8 it is 0.332 and at x s 1.9 it is 5 = 10y5. The probability of survival at dosage x q 0.1 equals the probability at dosage x raised to the power exp Ž22.01 = 0.1. s 9.03. For instance, 0.332 s Ž0.885. 9.03 . Table 6.14 shows the fitted values and Figure 6.6 shows the fit. 
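The complementary log-log calculations for the beetle data can be reproduced from the reported ML estimates alone; a sketch:

```python
from math import exp

# ML estimates for the complementary log-log fit to Table 6.14
alpha, beta = -39.52, 22.01

def survive(x):
    """P(survival) = 1 - pi(x) = exp[-exp(alpha + beta * x)]."""
    return exp(-exp(alpha + beta * x))

print(survive(1.7))   # about 0.885
print(survive(1.8))   # about 0.332
# survival at x + 0.1 is survival at x to the power exp(0.1*beta) = 9.03
print(survive(1.7) ** exp(0.1 * beta))   # again about 0.332
```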
They are close to the observed death counts (G² = 3.5, df = 6). The fit seems adequate. Aranda-Ordaz (1981) and Stukel (1988) discussed these data further.

6.7  CONDITIONAL LOGISTIC REGRESSION AND EXACT DISTRIBUTIONS*

ML estimators of logistic model parameters work best when the sample size n is large compared to the number of parameters. When n is small or when the number of parameters grows as n does, improved inference results from using conditional maximum likelihood. In this section we present this approach, and in Section 10.2 we apply it with matched case–control studies.

6.7.1  Conditional Likelihood

The conditional likelihood approach eliminates nuisance parameters by conditioning on their sufficient statistics. This generalizes Fisher's method for 2 × 2 tables (Section 3.5). The conditional likelihood refers to a conditional distribution defined for potential samples that provide the same information about the nuisance parameters as occurs in the observed sample. We begin with a general exposition and then discuss special cases.

Let y_i denote the binary response for subject i, i = 1, ..., N. (For now, each y_i refers to a single trial, so n_i = 1.) Let x_ij be the value of predictor j for that subject, j = 1, ..., p. The model is

    P(Y_i = y_i) = exp[y_i(α + Σ_{j=1}^p β_j x_ij)] / [1 + exp(α + Σ_{j=1}^p β_j x_ij)],     (6.14)

where substituting y_i = 1 gives the usual expression, such as (5.15). Here, we explicitly separate the intercept from the coefficients of the p predictors. For N independent observations,

    P(Y_1 = y_1, ..., Y_N = y_N) = exp[(Σ_i y_i)α + Σ_{j=1}^p (Σ_i y_i x_ij)β_j] / Π_i [1 + exp(α + Σ_{j=1}^p β_j x_ij)].     (6.15)

From this likelihood function, the sufficient statistic for β_j is Σ_i y_i x_ij, j = 1, ..., p. The sufficient statistic for α is Σ_i y_i, the total number of successes. Usually, some parameters refer to effects of primary interest.
Others may be there to adjust for relevant effects, but their values are not of special interest. We can eliminate the latter parameters from the likelihood by conditioning on their sufficient statistics. We illustrate by eliminating α. (In Section 10.2.5 we show that for models for matched case–control studies, intercept terms cause difficulties with inference about the primary parameters, so it can be helpful to eliminate them.)

Since the sufficient statistic for α is Σ_i y_i, we condition on Σ_i y_i. Suppose that Σ_i y_i = t. Denote the conditional reference set of samples having the same value of Σ_i y_i as observed by

    S(t) = {(y₁*, ..., y_N*): Σ_i y_i* = t}.

With {y_i} such that Σ_i y_i = t, the conditional likelihood function equals

    P(Y_1 = y_1, ..., Y_N = y_N | Σ_i y_i = t)
        = P(Y_1 = y_1, ..., Y_N = y_N) / Σ_{S(t)} P(Y_1 = y₁*, ..., Y_N = y_N*)
        = {exp[tα + Σ_{j=1}^p (Σ_i y_i x_ij)β_j] / Π_i [1 + exp(α + Σ_{j=1}^p β_j x_ij)]}
          ÷ {Σ_{S(t)} exp[tα + Σ_{j=1}^p (Σ_i y_i* x_ij)β_j] / Π_i [1 + exp(α + Σ_{j=1}^p β_j x_ij)]}
        = exp[Σ_{j=1}^p (Σ_i y_i x_ij)β_j] / Σ_{S(t)} exp[Σ_{j=1}^p (Σ_i y_i* x_ij)β_j].

This does not depend on α. A conditional likelihood is used just like an ordinary likelihood. For the parameters in it, the conditional ML estimates are the values maximizing it. Calculated using iterative methods, the estimators are asymptotically normal with covariance matrix equal to the negative inverse of the matrix of second partial derivatives of the conditional log likelihood.

6.7.2  Small-Sample Conditional Inference for Logistic Regression

For small samples, inference for a parameter uses the conditional distribution after eliminating all other parameters. With it, one can calculate probabilities such as P-values exactly rather than with crude approximations (Cox 1970). For instance, suppose that inference focuses on β_p in model (6.14). To eliminate the other parameters, we condition on their sufficient statistics T_j = Σ_i y_i x_ij, j = 0, ...
, p − 1 (where x_i0 = 1). With an argument like that just shown, one obtains the conditional distribution

    P(Y_1 = y_1, ..., Y_N = y_N | T_j = t_j, j = 0, ..., p − 1)
        = exp[(Σ_i y_i x_ip)β_p] / Σ_{S(t_0, ..., t_{p−1})} exp[(Σ_i y_i* x_ip)β_p]
        = exp(t_p β_p) / Σ_{S(t_0, ..., t_{p−1})} exp(t_p* β_p),

where

    S(t_0, ..., t_{p−1}) = {(y₁*, ..., y_N*): Σ_i y_i* x_ij = t_j, j = 0, ..., p − 1}.

This depends only on β_p. Inference for β_p uses the conditional distribution of its sufficient statistic, T_p = Σ_i y_i x_ip, given the others. Let c(t_0, ..., t_{p−1}, t) denote the number of data vectors in S(t_0, ..., t_{p−1}) for which T_p = t. The conditional distribution of T_p is

    P(T_p = t | T_j = t_j, j = 0, ..., p − 1) = c(t_0, ..., t_{p−1}, t) exp(tβ_p) / Σ_u c(t_0, ..., t_{p−1}, u) exp(uβ_p),     (6.16)

where the denominator summation refers to the possible values u of T_p.

For testing H_0: β_p = 0, the conditional distribution simplifies. For H_a: β_p > 0 and observed T_p = t_obs, the exact conditional P-value is

    Σ_{t ≥ t_obs} P(T_p = t | T_j = t_j, j = 0, ..., p − 1) = Σ_{t ≥ t_obs} c(t_0, ..., t_{p−1}, t) / Σ_u c(t_0, ..., t_{p−1}, u),

the proportion of data configurations in the conditional set that have the sufficient statistic for β_p at least as large as observed. Implementing this inference requires calculating {c(t_0, ..., t_{p−1}, u)}. For all but the simplest problems, the computations are intensive and require specialized software (e.g., LogXact of Cytel Software or PROC LOGISTIC in SAS). In the remainder of this section we consider special cases for small-sample inference.

6.7.3  Small-Sample Conditional Inference for 2 × 2 Contingency Tables

First, consider logistic regression with a single predictor x,

    logit[P(Y_i = 1)] = α + βx_i,   i = 1, ..., N,     (6.17)

when x_i takes only two values.
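For a single predictor with an intercept, the conditional quantities just described can be computed by brute-force enumeration of the reference set, which is feasible only for small N. A sketch (the data in the example are hypothetical, for illustration only):

```python
from itertools import combinations
from math import exp

def conditional_likelihood(beta, y, x):
    """Conditional likelihood of beta for a one-predictor logit model,
    eliminating the intercept by conditioning on t = sum(y): the
    reference set S(t) holds every outcome vector with success total t."""
    N, t = len(y), sum(y)
    num = exp(beta * sum(yi * xi for yi, xi in zip(y, x)))
    den = sum(exp(beta * sum(x[i] for i in idx))
              for idx in combinations(range(N), t))
    return num / den

def exact_pvalue(y, x):
    """Exact one-sided P-value for Ha: beta > 0: the proportion of
    vectors in S(t) whose statistic sum(y_i x_i) is >= the observed."""
    N, t = len(y), sum(y)
    t_obs = sum(yi * xi for yi, xi in zip(y, x))
    stats = [sum(x[i] for i in idx) for idx in combinations(range(N), t)]
    return sum(u >= t_obs for u in stats) / len(stats)

# Hypothetical toy data
y = [0, 0, 1, 0, 1]
x = [1.0, 2.0, 3.0, 4.0, 5.0]
```

At β = 0 every vector in S(t) is equally likely, so the conditional likelihood of any observed vector is 1/|S(t)|; the P-value is then just the counting ratio displayed above.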
The model applies to 2 × 2 tables, where x_i = 1 denotes row 1 and x_i = 0 denotes row 2. The sufficient statistic for α is Σ_i y_i, which is the first column total. The sufficient statistic for β is T = Σ_i y_i x_i, which simplifies to the number of successes in the first row. Equivalently, the sufficient statistics for the model are the numbers of successes in the two rows. Let s₁ and s₂ denote these binomial variates. The row totals n₁ and n₂ are their indices. To eliminate α, we condition on s = s₁ + s₂, the first column total. Since N = n₁ + n₂ is fixed, so then is the other column marginal total. Fixing both sets of marginal totals yields hypergeometric probabilities for s₁ that depend only on β [see (3.20), identifying θ = exp(β)]. In that case the conditional distribution satisfies (6.16) with

    c(t_0, t) = C(n₁, t) C(N − n₁, t_0 − t),

where C(a, b) denotes the binomial coefficient, and with t_0 = s and t = s₁. The resulting exact conditional test that β = 0 is Fisher's exact test for 2 × 2 tables (Section 3.5.1).

6.7.4  Small-Sample Conditional Inference for Linear Logit Model

The linear logit model, logit(π_i) = α + βx_i, applies to I × 2 tables with ordered rows. We discussed this model in Section 5.3.4. For it, the data {y_i} are I independent bin(n_i, π_i) counts, with fixed row totals {n_i}. Conditioning on s = Σ_i y_i, and hence on the column totals, yields a conditional likelihood free of α. Exact inference about β uses its sufficient statistic, T = Σ_i x_i y_i. From (6.16) its distribution has the form

    P(T = t | Σ_i y_i = s; β) = c(s, t)e^{βt} / Σ_u c(s, u)e^{βu}.     (6.18)

Here, c(s, u) equals the sum of Π_i C(n_i, y_i) over all tables with the given marginal totals that have T = u. When β = 0, the cell counts have the multiple hypergeometric distribution (3.19). For this test, ordering the tables with the given margins by T is equivalent to ordering them by the Cochran–Armitage statistic (Section 5.3.5).
Thus, this test for the linear logit model is an exact trend test.

In Section 5.3.5 we applied the Cochran–Armitage test to Table 5.3 on maternal alcohol consumption and infant malformation. Even though n = 32,573, the table is highly unbalanced, with both very small and very large counts, so it is safer to use small-sample methods. For the exact conditional trend test with the same scores, the one-sided P-value for H_a: β > 0 is 0.0168. The two-sided P-value is 0.0172, reflecting asymmetry of the conditional distribution, given the marginal counts. This is not much different from the two-sided P-value of 0.010 obtained with the large-sample Cochran–Armitage test.

6.7.5  Small-Sample Tests of Conditional Independence in 2 × 2 × K Tables

For 2 × 2 × K tables {n_ijk}, the Cochran–Mantel–Haenszel test uses Σ_k n_{11k}. For logit model (6.4), this is the sufficient statistic for β, the effect of X. To conduct a small-sample test of β = 0, one needs to eliminate the other model parameters. Constructing the likelihood reveals that the sufficient statistics for {β_k^Z} are the column marginal totals {n_{+jk}} in each partial table. When X and Z are predictors, it is natural to treat the numbers of trials {n_{i+k}} at each combination of XZ values as fixed. Thus, exact inference about β conditions on the row and column totals in each stratum.

Conditional on the strata margins, an exact test uses Σ_k n_{11k}. Hypergeometric probabilities occur in each partial table for the independent null distributions of {n_{11k}, k = 1, ..., K}. The product of the K mass functions gives the null joint distribution of {n_{11k}, k = 1, ..., K}. [This is (6.19) below, setting θ = 1.] This determines the null distribution of Σ_k n_{11k}. For H_a: β > 0, the P-value is the null probability that Σ_k n_{11k} is at least as large as observed, for the fixed strata marginal totals. Mehta et al. (1985) presented a fast algorithm.
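For small K and modest margins, no special algorithm is needed: the null distribution of Σ_k n_{11k} can be built directly as the convolution of the K stratum hypergeometric distributions. A sketch, illustrated with the margins of Table 6.15 from the promotion example of Section 6.7.6 below:

```python
from math import comb

def stratum_null(n1, c1, n):
    """Null (theta = 1) hypergeometric distribution of n11k for one
    2 x 2 stratum with row-1 total n1, column-1 total c1, total n."""
    lo, hi = max(0, c1 - (n - n1)), min(n1, c1)
    den = comb(n, c1)
    return {t: comb(n1, t) * comb(n - n1, c1 - t) / den
            for t in range(lo, hi + 1)}

def sum_null_dist(strata):
    """Null distribution of sum_k n11k: convolve the K strata."""
    dist = {0: 1.0}
    for n1, c1, n in strata:
        new = {}
        for s, p in dist.items():
            for t, q in stratum_null(n1, c1, n).items():
                new[s + t] = new.get(s + t, 0.0) + p * q
        dist = new
    return dist

# Margins of Table 6.15: (row-1 total, promotions, stratum total)
dist = sum_null_dist([(7, 4, 27), (7, 4, 24), (8, 2, 23)])

# Observed sum is 0, the smallest possible value, so the one-sided
# P-value for Ha: beta < 0 is the null probability of that outcome.
p_value = dist[0]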
The test simplifies to Fisher's exact test when K = 1.

6.7.6  Promotion Discrimination Example

Table 6.15 refers to U.S. government computer specialists of similar seniority considered for promotion. The table cross-classifies promotion decision by employee's race, considered for three separate months. We test conditional independence of promotion decision and race, or H_0: β = 0 in model (6.4). The table contains several small counts. The overall sample size is not small (n = 74), but one marginal count (collapsing over month of decision) equals zero, so we might be wary of using the CMH test. For H_a: β < 0 (i.e., odds ratio < 1), the probability of promotion was lower for black employees than for white employees.

TABLE 6.15  Promotion Decisions by Race and by Month

            July Promotions    August Promotions    September Promotions
  Race        Yes     No          Yes     No            Yes     No
  Black        0       7           0       7             0       8
  White        4      16           4      13             2      13

Source: J. Gastwirth, Statistical Reasoning in Law and Public Policy (San Diego, CA: Academic Press, 1988), p. 266.

For the margins of the partial tables in Table 6.15, n_{111} can range between 0 and 4, n_{112} can range between 0 and 4, and n_{113} can range between 0 and 2. The total Σ_k n_{11k} can range between 0 and 10. The sample data are the most extreme possible result in each case. The observed Σ_k n_{11k} = 0, and the P-value is the null probability of this outcome. Software provides P = 0.026. A two-sided P-value, based on summing the probabilities of all tables no more likely than the observed table, equals 0.056.

6.7.7  Exact Conditional Estimation and Comparison of Odds Ratios

For model (6.4) of homogeneous association in 2 × 2 × K tables, the ordinary ML estimator of the odds ratio θ = exp(β) behaves poorly for sparse-data asymptotics.
The conditional ML estimator maximizes the conditional likelihood function after reducing the parameter space by conditioning on sufficient statistics for the other parameters (Andersen 1970; Birch 1964b). For cell counts {n_ijk}, given {n_{i+k}, n_{+jk}} for all k, the conditional probability mass function that (n_{111} = t₁, ..., n_{11K} = t_K) is the product of the functions (3.20) from the separate strata, or

    Π_k P(n_{11k} = t_k | n_{1+k}, n_{+1k}, n_{++k}; θ)
        = Π_k [C(n_{1+k}, t_k) C(n_{++k} − n_{1+k}, n_{+1k} − t_k) θ^{t_k}]
              / [Σ_u C(n_{1+k}, u) C(n_{++k} − n_{1+k}, n_{+1k} − u) θ^u].     (6.19)

The conditional ML estimator θ̂ maximizes (6.19). Like the Mantel–Haenszel estimator θ̂_MH, it has good properties for both standard and sparse-data asymptotic cases (Andersen 1970; Breslow 1981), since the number of parameters does not change as K does. It can be slightly more efficient than θ̂_MH, except when θ = 1.0, where they are equally efficient, or for matched pairs, where they are identical (Breslow 1981).

The conditional distribution (6.19) induces one for Σ_k n_{11k}, which is used to test H_0: θ = θ_0 for an arbitrary value. A 95% confidence interval for θ then consists of all θ_0 for which the P-value exceeds 0.05. Such an interval is guaranteed to have at least the nominal coverage probability (Gart 1970; Kim and Agresti 1995; Mehta et al. 1985). This extends the interval for a single 2 × 2 table (Section 3.6.1). For the promotion discrimination case (Table 6.15), Σ_k n_{11k} = 0, so the lower bound of any confidence interval for θ should be 0. For the generalization to several strata of Cornfield's tail-method interval, StatXact reports a 95% confidence interval of (0, 1.01).

Zelen (1971) presented a small-sample test of homogeneity of the odds ratios. See Agresti (1992) for discussion of this and other small-sample methods for contingency tables.
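Each stratum's factor in (6.19) is a noncentral hypergeometric probability; multiplying the factors gives the conditional likelihood that the conditional ML estimator θ̂ maximizes. A minimal sketch (the margins used in the second check below are hypothetical):

```python
from math import comb

def nc_hypergeom_pmf(t, n1, c1, n, theta):
    """Noncentral hypergeometric probability that n11k = t in one
    2 x 2 stratum with row-1 total n1, column-1 total c1, total n,
    and odds ratio theta: one stratum's factor in (6.19)."""
    lo, hi = max(0, c1 - (n - n1)), min(n1, c1)
    den = sum(comb(n1, u) * comb(n - n1, c1 - u) * theta ** u
              for u in range(lo, hi + 1))
    return comb(n1, t) * comb(n - n1, c1 - t) * theta ** t / den

def conditional_lik(theta, strata_data):
    """Conditional likelihood (6.19): product over strata of the
    noncentral hypergeometric probabilities of the observed n11k,
    with strata_data a list of (t, n1, c1, n) tuples."""
    lik = 1.0
    for t, n1, c1, n in strata_data:
        lik *= nc_hypergeom_pmf(t, n1, c1, n, theta)
    return lik
```

Evaluating conditional_lik over a grid of θ values gives a crude approximation to the conditional ML estimate; at θ = 1 each factor reduces to the ordinary central hypergeometric mass function, the null distribution used in the CMH-type exact test.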
TABLE 6.16  Example for Exact Conditional Logistic Regression

  Cephalexin^a   Age^a   Length of Stay^a   Cases of Diarrhea   Sample Size
       0           0            0                   0                385
       0           0            1                   5                233
       0           1            0                   3                789
       0           1            1                  47               1081
       1           1            1                   5                  5

^a See the text for an explanation of 0 and 1.
Source: Based on a study by E. Jaffe and V. Chang, Cornell Medical Center, reported in the Manual for LogXact (Cambridge, MA: CYTEL Software, 1999), p. 259.

6.7.8  Diarrhea Example

The final example deals with a larger number of variables. Table 6.16 refers to 2493 patients having stays in a hospital. The response is whether they suffered an acute form of diarrhea during their stay. The three predictors are age (1 for over 50 years old, 0 for under 50), length of stay in hospital (1 for more than 1 week, 0 for less than 1 week), and exposure to an antibiotic called Cephalexin (1 for yes, 0 for no). We discuss estimation of the effect of Cephalexin, controlling for age and length of stay, using a model containing only main-effect terms.

The sample size is large, yet relatively few cases of acute diarrhea occurred. Moreover, all subjects having exposure to Cephalexin were also diarrhea cases. Such boundary outcomes, in which none or all responses fall in one category, cause infinite ML estimates of some model parameters. An ML estimate of ∞ for the Cephalexin effect means that the likelihood function increases continually as the parameter estimate for Cephalexin increases indefinitely. To study the Cephalexin effect, we use an exact distribution, conditioning on sufficient statistics for the other predictors. Although the estimate of the log-odds-ratio parameter for the effect of Cephalexin is infinite, it is possible to construct a confidence interval by inverting the family of tests for the parameter, using the conditional distribution. Doing this, a 95% confidence interval for the odds ratio is (19, ∞).
Assuming that the main-effects model is valid, Cephalexin appears to have a strong effect. Similarly, P < 0.0001 for testing that the log odds ratio equals zero. Results must be qualified somewhat because no Cephalexin cases occurred at the first three combinations of levels of age and length of stay. In fact, the first three rows of Table 6.16 make no contribution to the analysis (Problem 6.18). The data actually provide evidence about the effect of Cephalexin only for older subjects having a long stay.

6.7.9  Complications from Discreteness

Like Fisher's exact test, exact conditional inference for contingency tables is conservative because of discreteness. This is especially true when n is small or the data are unbalanced, with most observations falling in a single column or row. Using mid-P-values, or P-values based on a finer partitioning of the sample space (Note 3.9), in tests and related confidence intervals reduces the conservativeness. For the promotion discrimination data (Table 6.15), we reported a 95% confidence interval for the common odds ratio of (0, 1.01). Inverting exact tests of H_0: θ = θ_0 with the mid-P-value yields the interval (0, 0.78). However, this approach cannot guarantee that the actual coverage probability is bounded below by 0.95.

A particular problem occurs when no other set of {y_i*} values has the same value of a given sufficient statistic Σ_i y_i x_ij as the observed data. In that case the conditional distribution of the sufficient statistic for the parameter of interest is degenerate, and the P-value for the exact test then equals 1.0. This commonly happens when at least one explanatory variable x_j whose effect is conditioned out for the inference is continuous, with unequally spaced observed values.

Finally, a limitation of the conditional approach is that it requires sufficient statistics for the nuisance parameters. This happens only with GLMs that use the canonical link.
Thus, for instance, the conditional approach works for logit models but not for probit models.

NOTES

Section 6.1: Strategies in Model Selection

6.1. A Bayesian argument motivates the Bayesian information criterion BIC = [G² − (log n)(df)], an alternative to AIC. It takes sample size into account. Compared to AIC, BIC gravitates less quickly toward more complex models as n increases. For details and critiques, see Raftery (1986) and the February 1999 issue of Sociological Methods and Research.

6.2. Tree-structured methods such as CART are alternatives to logistic regression that formalize a decision process using a sequential set of questions that branch in different directions depending on a subject's responses. An example is deciding whether a subject with chest pains may be suffering a heart attack. Zhang et al. (1998) surveyed such methods.

Section 6.2: Logistic Regression Diagnostics

6.3. For logistic regression diagnostics, see Copas (1988), Fowlkes (1987), Hosmer and Lemeshow (2000, Chap. 5), Johnson (1985), Landwehr et al. (1984), and Pregibon (1981). Separate diagnostics are useful for checking the adequacy of each component of a GLM (McCullagh and Nelder 1989, Chap. 12). For a family g(μ; γ) of link functions indexed by parameter γ, Pregibon (1980) showed how to estimate the γ giving the link with best fit and how to check the adequacy of a given link g(μ; γ₀).

6.4. Amemiya (1981), Efron (1978), Maddala (1983), and Zheng and Agresti (2000), and references therein, reviewed R² measures for binary regression. Hosmer and Lemeshow (2000, Sec. 5.2.3) discussed classification tables and their limitations. Pepe (2000) and references therein surveyed ROC methodology.

Section 6.3: Inference about Conditional Associations in 2 × 2 × K Tables

6.5. Analogs of θ̂_MH summarize differences of proportions or relative risks from several strata (Greenland and Robins 1985). Breslow and Day (1980, p. 142)
proposed an alternative large-sample test of homogeneity of odds ratios. In each partial table, let {μ̂_ijk} have the same marginals as the observed data, yet have odds ratio equal to θ̂_MH. Their test statistic has the Pearson form comparing {n_ijk} to {μ̂_ijk}. Tarone (1985) showed that, because of the inefficiency of θ̂_MH, one must adjust the Breslow–Day statistic for it to have a limiting chi-squared null distribution with df = K − 1. This adjustment is usually minor. Jones et al. (1989) reviewed and compared several tests of homogeneity in sparse and nonsparse settings. Other work on comparing odds ratios and estimating a common value includes Breslow and Day (1980, Sec. 4.4), Donner and Hauck (1986), Gart (1970), and Liang and Self (1985). For modeling the odds ratio, see Breslow (1976), Breslow and Day (1980, Sec. 7.5), and Prentice (1976a). Breslow emphasized retrospective studies, in which the conditional approach is natural since the outcome totals are fixed.

Section 6.5: Sample Size and Power Considerations

6.6. For sample-size determination for comparing proportions, Fleiss (1981, Sec. 3.2) provided tables. See Lachin (1977) for the I × J case. Chapman and Meng (1966), Drost et al. (1989), Haberman (1974a, pp. 109–112), Harkness and Katz (1964), Mitra (1958), and Patnaik (1949) derived theory for the asymptotic nonnull behavior of chi-squared statistics; see also Section 14.3.5. O'Brien's (1986) simulation results suggested that the noncentral chi-squared approximation for G² holds well for a wide range of powers. Read and Cressie (1988, pp. 147–148) listed other articles that studied the nonnull behavior of X² and G².

Section 6.6: Probit and Complementary Log-Log Models

6.7. Finney (1971) is the standard reference on probit modeling. Chambers and Cox (1967) showed that it is difficult to distinguish between probit and logit models unless n is extremely large. Ashford and Sowden (1970)
generalized the probit model for multivariate binary responses; see also Lesaffre and Molenberghs (1991) and Ochi and Prentice (1984). Wedderburn (1976) showed that the log likelihood is concave for probit and complementary log-log links.

Section 6.7: Conditional Logistic Regression

6.8. For details about conditional logistic regression, see Section 10.2, Breslow and Day (1980, Chap. 7), Cox (1970), and Hosmer and Lemeshow (2000, Chap. 5). Liang (1984) showed that conditional ML estimators and conditional score tests are asymptotically equivalent to their unconditional counterparts under sampling from exponential families. For exact inference using the conditional likelihood, see Hirji et al. (1987), Mehta and Patel (1995), and the LogXact manual (Cytel Software). Mehta et al. (2000) discussed Monte Carlo approximations.

PROBLEMS

Applications

6.1 For the horseshoe crab data, fit a model using weight and width as predictors. Conduct (a) a likelihood-ratio test of H_0: β₁ = β₂ = 0, and (b) separate tests for the partial effects. Why does neither test in part (b) show evidence of an effect when the test in part (a) shows strong evidence?

6.2 Refer to the data for Problem 8.13. Treating opinion about premarital sex as the response variable, use backward elimination to select a model. Interpret.

6.3 Refer to Table 6.4. Fit the stage 3 model denoted there by (E*P + G). Use the parameter estimates to interpret the G effect and the dependence of the E effect on P.

6.4 Discern the reasons that Simpson's paradox occurs for Table 6.7.

6.5 Refer to Problem 2.12.
    a. Fit the model with G and D main effects. Using it, estimate the AG conditional odds ratio. Compare to the marginal odds ratio, and explain why they are so different. Test its goodness of fit.
    b. Fit the model of no G effect, given the department. Use X² to test the fit. Obtain residuals, and interpret the lack of fit. (Each department has a single nonredundant standardized Pearson residual.
They satisfy Σ_{i=1}^6 r_i² = X², their squares giving six df = 1 components.)
    c. Fit the two models excluding department A. Again consider lack of fit, and interpret.

6.6 Conduct a residual analysis for the independence model with Table 6.11. What type of lack of fit is indicated?

6.7 Table 6.17 refers to the effectiveness of immediately injected or 1½-hour-delayed penicillin in protecting rabbits against lethal injection with β-hemolytic streptococci.
    a. Let X = delay, Y = whether cured, and Z = penicillin level. Fit the logit model (6.4). Argue that the pattern of 0 cell counts suggests that (with no intercept) β̂₁^Z = −∞ and β̂₅^Z = ∞. What does your software report?
    b. Using the logit model, conduct the likelihood-ratio test of XY conditional independence. Interpret.

TABLE 6.17  Data for Problem 6.7

                                      Response
  Penicillin Level     Delay      Cured     Died
        1/8            None         0         6
                       1½ h         0         5
        1/4            None         3         3
                       1½ h         0         6
        1/2            None         6         0
                       1½ h         2         4
         1             None         5         1
                       1½ h         6         0
         4             None         2         0
                       1½ h         5         0

Source: Reprinted with permission from Mantel (1963).

    c. Test XY conditional independence using the Cochran–Mantel–Haenszel test. Interpret.
    d. Estimate the XY conditional odds ratio using (i) ML with the logit model, and (ii) the Mantel–Haenszel estimate. Interpret.
    e. The small cell counts make large-sample analyses questionable. Conduct small-sample inference, and interpret.

6.8 Refer to Table 2.6. Use the CMH statistic to test independence of death penalty verdict and victim's race, controlling for defendant's race. Show another test of this hypothesis, and compare results.

6.9 Treatments A and B were compared on a binary response for 40 pairs of subjects matched on relevant covariates. For each pair, treatments were assigned to the subjects randomly. Twenty pairs of subjects made the same response for each treatment.
Six pairs had a success for the subject receiving A and a failure for the subject receiving B, whereas the other 14 pairs had a success for B and a failure for A. Use the Cochran–Mantel–Haenszel procedure to test independence of response and treatment. (In Section 10.1 we present an equivalent test, McNemar's test.)

6.10 Refer to Section 6.5.1. Suppose that π₁ = 0.7 and π₂ = 0.6. What sample size is needed for the test to have approximate power 0.80, when α = 0.05, for (a) H_a: π₁ ≠ π₂, and (b) H_a: π₁ > π₂?

6.11 Refer to Section 6.5.1. Suppose that π₁ = 0.63 and π₂ = 0.57. When treatment sample sizes are equal, explain why the joint probabilities in the 2 × 2 table are 0.315 and 0.185 in the row for treatment A and 0.285 and 0.215 in the row for treatment B. For the model of independence, explain why the fitted joint probabilities are 0.30 for success and 0.20 for failure, in each row. Show that X² has noncentrality parameter 0.00375n and df = 1. For n = 200 and α = 0.05, find the power.

6.12 In an experiment designed to compare two treatments on a three-category response, a researcher expects the conditional distributions to be approximately (0.2, 0.2, 0.6) and (0.3, 0.3, 0.4).
    a. With α = 0.05, find the approximate power using (i) X², and (ii) G² to compare the distributions with 100 observations for each treatment. Compare results.
    b. What sample size is needed for each treatment for the tests in part (a) to have approximate power 0.90?

6.13 The horseshoe crab width values in Table 4.3 have x̄ = 26.3 and s_x = 2.1. If the true relationship were similar to the fitted equation in Section 5.1.3, about how large a sample yields P(type II error) = 0.10, with α = 0.05, for testing H_0: β = 0 against H_a: β > 0?

6.14 Refer to Problem 5.1. Table 6.18 shows output for fitting a probit model. Interpret the parameter estimates by (a) using characteristics of the normal cdf response curve, (b)
finding the estimated rate of change in the probability of remission where it equals 0.5, and (c) finding the difference between the estimated probabilities of remission at the lower and upper quartiles of the labeling index, 14 and 28.

TABLE 6.18  Data for Problem 6.14

                           Standard    Likelihood Ratio 95%     Chi-
  Parameter    Estimate    Error       Confidence Limits        Square    Pr > ChiSq
  Intercept    -2.3178     0.7795      -4.0114     -0.9084       8.84       0.0029
  LI            0.0878     0.0328       0.0275      0.1575       7.19       0.0073

6.15 Use probit models to describe the effects of width and color on the probability of a satellite for Table 4.3. Interpret.

6.16 Refer to Table 6.14. Fit the model having the log-log link rather than the complementary log-log link. Test the fit. Why does it fit so poorly?

6.17 For the linear logit model with Table 3.9 and scores (0, 15, 30), conduct the exact test of H_0: β = 0 and find point and interval estimates of β using the conditional likelihood. Interpret.

6.18 Refer to Table 6.16. Apply conditional logistic regression to the model discussed in Section 6.7.8.
    a. Obtain an exact P-value for testing no C effect against the alternative of a positive effect. Construct a 95% confidence interval for the conditional CD odds ratio.
    b. Construct the partial tables relating C to D for the combinations of levels of (A, L). Note that three tables have no data when C = 1. For the sole partial table having data at both C levels, find a 95% exact confidence interval for the odds ratio and find an exact one-sided P-value. Compare to results using the entire data set. Comment on the contribution to inference of tables having only a single positive row total or a single positive column total.
    c. Obtain the ordinary ML fit of the logistic regression model. To investigate the sensitivity of the estimated C effect, find the change in the estimate and SE after adding one observation to the data set, a case with no diarrhea when (C, A, L) = (1, 1, 1).
6.19 Consider Table 6.19, from a study of nonmetastatic osteosarcoma (A. M. Goorin, J. Clin. Oncol. 5: 1178–1184, 1987, and the manual for LogXact). The response is whether the subject achieved a three-year disease-free interval.
    a. Show that each predictor has a significant effect when used individually without the others.
    b. Try to fit a main-effects logistic regression model containing all three predictors. Explain why the ML estimate for the effect of lymphocytic infiltration is infinite.

TABLE 6.19  Data for Problem 6.19

  Lymphocytic                   Osteoblastic      Disease-Free
  Infiltration     Gender       Pathology         Yes     No
  High             Female       No                 3       0
                                Yes                2       0
                   Male         No                 4       0
                                Yes                1       0
  Low              Female       No                 5       0
                                Yes                3       2
                   Male         No                 5       4
                                Yes                6      11

Source: LogXact 4 for Windows (Cambridge, MA: CYTEL Software, 1999).

    c. Using conditional logistic regression, (i) conduct an exact test for the effect of lymphocytic infiltration, controlling for the other variables; and (ii) find a 95% confidence interval for the effect. Interpret results.

6.20 Use the methods discussed in this chapter to select a model for Table 5.5.

6.21 Logistic regression is applied increasingly to large financial databases, such as for credit scoring to model the influence of predictors on whether a consumer is creditworthy. The data archive found under the index at www.stat.uni-muenchen.de contains such a data set that includes 20 covariates for 1000 observations. Build a model for creditworthiness using the predictors running account, duration of credit, payment of previous credits, intended use, gender, and marital status.

Theory and Methods

6.22 For a sequence of s nested models M₁, ..., M_s, model M_s is the most complex. Let ν denote the difference in residual df between M₁ and M_s.
    a. Explain why, for j < k, G²(M_j | M_k) ≤ G²(M_j | M_s).
    b. Assume model M_j, so that M_k also holds when k > j. For all k > j, as n → ∞, P[G²(M_j | M_k) > χ²_ν(α)] ≤ α. Explain why.
    c. Gabriel (1966)
suggested a simultaneous testing procedure in which, for each pair of models, the critical value for differences between G² values is χ²_ν(α). The final model accepted must be more complex than any model rejected in a pairwise comparison. Since part (b) is true for all j < k, argue that Gabriel's procedure has type I error probability no greater than α.

6.23 Prove that the Pearson residuals for the linear logit model applied to an I × 2 contingency table satisfy X² = Σ_{i=1}^I e_i². Note that this holds for a binomial GLM with any link.

6.24 Refer to logit model (6.4) for a 2 × 2 × K contingency table {n_ijk}.
    a. Using dummy variables, write the log-likelihood function. Identify the sufficient statistics for the various parameters. Explain how to conduct exact conditional inference about the effect of X, controlling for Z.
    b. Using a basic result for testing in exponential families, explain why uniformly most powerful unbiased tests of conditional XY independence are based on Σ_k n_{11k} (Birch 1964b; Lehmann 1986, Sec. 4.8).

6.25 Suppose that {π_ijk} in a 2 × 2 × 2 table are, by row, (0.15, 0.10 / 0.10, 0.15) when Z = 1 and (0.10, 0.15 / 0.15, 0.10) when Z = 2. For testing conditional XY independence with logit models having Y as a response, explain why the likelihood-ratio test comparing models X + Z and Z is not consistent but the likelihood-ratio test of fit of the XY conditional independence model is.

6.26 Refer to Section 6.4.1. When Y is N(μ_i, σ²), consider the comparison of (μ₁, ..., μ_I) based on independent samples at the I categories of X. When approximately μ_i = α + βx_i, explain why the t or F test of H_0: β = 0 is more powerful than the one-way ANOVA F test. Describe a pattern for {μ_i} for which the ANOVA test would be more powerful.

6.27 For a multinomial distribution, let γ = Σ_i b_iπ_i, and suppose that π_i = f_i(θ) > 0, i = 1, ..., I.
For sample proportions {p_i}, let S = Σ_i b_i p_i. Let T = Σ_i b_i π̂_i, where π̂_i = f_i(θ̂), for the ML estimator θ̂ of θ.
a. Show that var(S) = [Σ_i b_i² π_i − (Σ_i b_i π_i)²]/n.
b. Using the delta method, show var(T) ≈ [var(θ̂)][Σ_i b_i f_i′(θ)]².
c. By computing the information for L(θ) = Σ_i n_i log[f_i(θ)], show that var(θ̂) is approximately [n Σ_i (f_i′(θ))²/f_i(θ)]⁻¹.
d. Asymptotically, show that var[√n(T − γ)] ≤ var[√n(S − γ)]. [Hint: Show that var(T)/var(S) is a squared correlation between two random variables, where with probability π_i the first equals b_i and the second equals f_i′(θ)/f_i(θ).]

6.28 A threshold model can also motivate the probit model. For it, there is an unobserved continuous response Y* such that the observed y_i = 0 if y_i* ≤ τ and y_i = 1 if y_i* > τ. Suppose that y_i* = μ_i + ε_i, where μ_i = α + βx_i and where {ε_i} are independent from a N(0, σ²) distribution. For identifiability one can set σ = 1 and the threshold τ = 0. Show that the probit model holds, and explain why β represents the expected number of standard deviations change in Y* for a 1-unit increase in x.

6.29 Consider the choice between two options, such as two product brands. Let U_0 denote the utility of outcome y = 0 and U_1 the utility of y = 1. For y = 0 and 1, suppose that U_y = α_y + β_y x + ε_y, using a scale such that ε_y has some standardized distribution. A subject selects y = 1 if U_1 > U_0 for that subject.
a. If ε_0 and ε_1 are independent N(0, 1) random variables, show that P(Y = 1) satisfies the probit model.
b. If {ε_y} are independent extreme-value random variables, with cdf F(ε) = exp[−exp(−ε)], show that P(Y = 1) satisfies the logistic regression model (Maddala 1983, p. 60; McFadden 1974).

6.30 Consider model (6.12) with complementary log-log link.
a. Find x at which π(x) = 1/2.
b. Show that the greatest rate of change of π(x) occurs at x = −α/β. What does π(x)
equal at that point? Give the corresponding result for the model with log-log link, and compare to the logit and probit models.

6.31 Suppose that log-log model (6.13) holds. Explain how to interpret β.

6.32 Let y_i, i = 1, ..., n, denote n independent binary random variables.
a. Derive the log likelihood for the probit model Φ⁻¹[π(x_i)] = Σ_j β_j x_ij.
b. Show that the likelihood equations for the logistic and probit regression models are

Σ_i (y_i − π̂_i) z_i x_ij = 0,  j = 0, ..., p,

where z_i = 1 for the logistic case and z_i = φ(Σ_j β̂_j x_ij)/[π̂_i(1 − π̂_i)] for the probit case. (When the link is not canonical, there is no reduction of the data in sufficient statistics.)

6.33 Sometimes, sample proportions are continuous rather than of the binomial form (number of successes)/(number of trials). Each observation is any real number between 0 and 1, such as the proportion of a tooth surface that is covered with plaque. For independent responses {y_i}, Aitchison and Shen (1980) and Bartlett (1937) modeled logit(Y_i) ~ N(β_i, σ²). Then Y_i itself is said to have a logistic-normal distribution.
a. Expressing a N(β, σ²) variate as β + σZ, where Z is standard normal, show that Y_i = exp(β_i + σZ)/[1 + exp(β_i + σZ)].
b. Show that for small σ,

Y_i = e^{β_i}/(1 + e^{β_i}) + [e^{β_i}/(1 + e^{β_i})²] σZ + [e^{β_i}(1 − e^{β_i})/(2(1 + e^{β_i})³)] σ²Z² + ···.

c. Letting μ_i = e^{β_i}/(1 + e^{β_i}), when σ is close to 0 show that

E(Y_i) ≈ μ_i,  var(Y_i) ≈ [μ_i(1 − μ_i)]² σ².

d. For independent continuous proportions {y_i}, let μ_i = E(Y_i). For a GLM, it is sensible to use an inverse cdf link for μ_i, but it is unclear how to choose a distribution for Y_i. The approximate moments for the logistic-normal motivate a quasi-likelihood approach (Wedderburn 1974) with variance function v(μ_i) = φ[μ_i(1 − μ_i)]² for unknown φ. Explain why this provides similar results as fitting a normal regression model to the sample logits assuming constant variance. (The QL approach has the advantage of not requiring adjustment of 0 or 1 observations, for which sample logits don't exist.)
e. Wedderburn (1974) gave an example with response the proportion of a leaf showing a type of blotch. Envision an approximation of binomial form based on cutting each leaf into a large number of small regions of the same size and observing for each region whether it is mostly covered with blotch. Explain why this suggests that v(μ_i) = φμ_i(1 − μ_i). What violation of the binomial assumptions might make this questionable? [The parametric family of beta distributions has a variance function of this form (see Section 13.3.1). Barndorff-Nielsen and Jørgensen (1991) proposed a distribution having v(μ_i) = φ[μ_i(1 − μ_i)]³; see also Cox (1996).]

6.34 For independent binomial sampling, construct the log likelihood and identify the sufficient statistics to be conditioned out to perform exact inference about β in model (6.4).

6.35 Let π̂(−) = (π̂(−1), ..., π̂(−n)), where π̂(−i) denotes the estimate of E(Y_i) for binary observation i after fitting the model without that observation. Cross-validation declares a model to have good predictive power if corr(π̂(−), y) is high. Consider the model logit(π_i) = α for all i. Show that π̂_i = ȳ and hence π̂(−i) = [n/(n − 1)][ȳ − (1/n)y_i], and hence corr(π̂(−), y) = −1 regardless of how well the model fits. Thus, cross-validation can be misleading with binary data (Zheng and Agresti 2000).

Categorical Data Analysis, Second Edition. Alan Agresti. Copyright © 2002 John Wiley & Sons, Inc. ISBN: 0-471-36093-7

CHAPTER 7

Logit Models for Multinomial Responses

In Chapters 5 and 6 we discussed modeling binary response variables with binomial GLMs. Multicategory responses use multinomial GLMs.
In this chapter we generalize logistic regression for multinomial (nominal and ordinal) response variables. In Section 7.1 we present a model for nominal responses that uses a separate binary logit model for each pair of response categories. In Section 7.2 we present a model for ordinal responses that uses logits of cumulative response probabilities. In Section 7.3 we use other link functions for those cumulative probabilities. Section 7.4 covers alternative ordinal-response models. In Section 7.5 we discuss tests of conditional independence with multinomial responses, using models and using generalizations of the Cochran–Mantel–Haenszel statistic. In the final section we introduce a multinomial logit model for discrete-choice modeling of a subject's choice from one of several options when values of predictors may depend on the option.

7.1 NOMINAL RESPONSES: BASELINE-CATEGORY LOGIT MODELS

Let Y be a categorical response with J categories. Multicategory (also called polytomous) logit models for nominal response variables simultaneously describe log odds for all (J choose 2) = J(J − 1)/2 pairs of categories. Given a certain choice of J − 1 of these, the rest are redundant.

7.1.1 Baseline-Category Logits

Let π_j(x) = P(Y = j | x) at a fixed setting x for explanatory variables, with Σ_j π_j(x) = 1. For observations at that setting, we treat the counts at the J categories of Y as multinomial with probabilities {π_1(x), ..., π_J(x)}.

Logit models pair each response category with a baseline category, often the last one or the most common one. The model

log[π_j(x)/π_J(x)] = α_j + β_j′x,  j = 1, ..., J − 1,  (7.1)

simultaneously describes the effects of x on these J − 1 logits. The effects vary according to the response paired with the baseline. These J − 1 equations determine parameters for logits with other pairs of response categories, since

log[π_a(x)/π_b(x)] = log[π_a(x)/π_J(x)] − log[π_b(x)/π_J(x)].
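As a quick numeric check of this identity, the short Python sketch below computes the logit for an arbitrary pair of categories both directly and as a difference of baseline-category logits; the probabilities are made up purely for illustration.

```python
import math

# Made-up response probabilities for J = 3 categories (illustrative only)
pi = [0.5, 0.3, 0.2]
J = len(pi)

# Baseline-category logits log[pi_j / pi_J], j = 1, ..., J - 1
baseline = [math.log(pi[j] / pi[J - 1]) for j in range(J - 1)]

# The logit for the pair (a, b) = (1, 2), computed directly ...
direct = math.log(pi[0] / pi[1])
# ... equals the difference of the corresponding baseline-category logits
from_baseline = baseline[0] - baseline[1]

print(round(direct, 6), round(from_baseline, 6))
```

The two printed values coincide, which is why fitting the J − 1 baseline logits determines all J(J − 1)/2 pairwise comparisons.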
With categorical predictors, X² and G² goodness-of-fit statistics provide a model check when data are not sparse. When an explanatory variable is continuous or the data are sparse, such statistics are still valid for comparing nested models differing by relatively few terms (Haberman 1974a, pp. 372–373; 1977a).

7.1.2 Alligator Food Choice Example

Table 7.1 is from a study of factors influencing the primary food choice of alligators. It used 219 alligators captured in four Florida lakes. The nominal response variable is the primary food type, in volume, found in an alligator's stomach. This had five categories: fish, invertebrate, reptile, bird, other. The invertebrates included apple snails, aquatic insects, and crayfish. The reptiles were primarily turtles, although one stomach contained the tags of 23 baby alligators released in the lake the previous year! The "other" category consisted of amphibian, mammal, plant material, stones or other debris, or no food or dominant type. Table 7.1 also classifies the alligators according to L = lake of capture (Hancock, Oklawaha, Trafford, George), G = gender (male, female), and S = size (≤ 2.3 meters long, > 2.3 meters long).

Baseline-category logit models can investigate the effects of L, G, and S on primary food type. Table 7.2 contains fit statistics for several models. We denote a model by its predictors: for instance, (L + S) having additive lake and size effects and ( ) having no predictors. The data are sparse: 219 observations scattered among 80 cells. Thus, G² is more reliable for comparing models than for testing fit. The statistics G²[( ) | (G)] = 2.1 and G²[(L + S) | (G + L + S)] = 2.2, each based on df = 4, suggest simplifying by collapsing the table over gender. (Other analyses, not presented here, show that adding interaction terms including G does not improve the fit significantly.) The G² and X² values for the collapsed table indicate that both L and S have effects.
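The model-comparison arithmetic quoted above can be reproduced from the deviances in Table 7.2. The sketch below does so in plain Python; `chi2_sf_even_df` is a hypothetical helper (not from the text) that uses the closed-form chi-squared right-tail probability available when the df is even.

```python
import math

def chi2_sf_even_df(x, df):
    # Right-tail P(X > x) for chi-squared with even df (df = 2k):
    # P(X > x) = exp(-x/2) * sum_{i < k} (x/2)^i / i!
    k = df // 2
    return math.exp(-x / 2) * sum((x / 2) ** i / math.factorial(i) for i in range(k))

# Deviances G^2 and residual df for four of the models in Table 7.2
models = {"( )": (116.8, 60), "(G)": (114.7, 56),
          "(L+S)": (52.5, 44), "(G+L+S)": (50.3, 40)}

for simple, complex_ in [("( )", "(G)"), ("(L+S)", "(G+L+S)")]:
    diff = models[simple][0] - models[complex_][0]
    df = models[simple][1] - models[complex_][1]
    print(f"G2[{simple} | {complex_}] = {diff:.1f}, df = {df}, "
          f"P = {chi2_sf_even_df(diff, df):.2f}")
```

Both differences of deviances have P-values near 0.7 on df = 4, supporting the simplification of collapsing over gender.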
TABLE 7.1 Primary Food Choice of Alligators

                              Primary Food Choice
Lake      Gender   Size (m)   Fish   Invertebrate   Reptile   Bird   Other
Hancock   Male     ≤ 2.3        7         1            0        0      5
                   > 2.3        4         0            0        1      2
          Female   ≤ 2.3       16         3            2        2      3
                   > 2.3        3         0            1        2      3
Oklawaha  Male     ≤ 2.3        2         2            0        0      1
                   > 2.3       13         7            6        0      0
          Female   ≤ 2.3        3         9            1        0      2
                   > 2.3        0         1            0        1      0
Trafford  Male     ≤ 2.3        3         7            1        0      1
                   > 2.3        8         6            6        3      5
          Female   ≤ 2.3        2         4            1        1      4
                   > 2.3        0         1            0        0      0
George    Male     ≤ 2.3       13        10            0        2      2
                   > 2.3        9         0            0        1      2
          Female   ≤ 2.3        3         9            1        0      1
                   > 2.3        8         1            0        0      1

Source: Data courtesy of Clint Moore, from an unpublished manuscript by M. F. Delaney and C. T. Moore.

TABLE 7.2 Goodness of Fit of Baseline-Category Logit Models for Table 7.1

Model^a           G²       X²      df
( )             116.8    106.5     60
(G)             114.7    101.2     56
(S)             101.6     86.9     56
(L)              73.6     79.6     48
(L + S)          52.5     58.0     44
(G + L + S)      50.3     52.6     40
Collapsed over G
( )              81.4     73.1     28
(S)              66.2     54.3     24
(L)              38.2     32.7     16
(L + S)          17.1     15.0     12

^a G, gender; S, size; L, lake of capture. See the text for details.

Table 7.3 exhibits fitted values for model (L + S) for the collapsed table.

TABLE 7.3 Observed and Fitted Values for Study of Alligators' Primary Food Choice

             Size of                      Primary Food Choice
Lake       Alligator (m)   Fish        Invertebrate   Reptile    Bird       Other
Hancock    ≤ 2.3           23 (20.9)    4 (3.6)       2 (1.9)    2 (2.7)    8 (9.9)
           > 2.3            7 (9.1)     0 (0.4)       1 (1.1)    3 (2.3)    5 (3.1)
Oklawaha   ≤ 2.3            5 (5.2)    11 (12.0)      1 (1.5)    0 (0.2)    3 (1.1)
           > 2.3           13 (12.8)    8 (7.0)       6 (5.5)    1 (0.8)    0 (1.9)
Trafford   ≤ 2.3            5 (4.4)    11 (12.4)      2 (2.1)    1 (0.9)    5 (4.2)
           > 2.3            8 (8.6)     7 (5.6)       6 (5.9)    3 (3.1)    5 (5.8)
George     ≤ 2.3           16 (18.5)   19 (16.9)      1 (0.5)    2 (1.2)    3 (3.8)
           > 2.3           17 (14.5)    1 (3.1)       0 (0.5)    1 (1.8)    3 (2.2)

Absolute values of standardized Pearson residuals comparing observed and fitted values exceed 2 in only two of the 40 cells and exceed 3 in none of the cells. The fit seems adequate. Fish was the most common food choice.
We now estimate the effects of lake and size on the odds that alligators select other primary food types instead of fish. With fish as the baseline category, Table 7.4 contains ML estimates of effect parameters. These result from models using dummy variables for the first three lakes and for size. The table uses letter subscripts to denote the food choice categories. For example, the prediction equation for the log odds of selecting invertebrates instead of fish is

log(π̂_I/π̂_F) = −1.55 + 1.46s − 1.66z_H + 0.94z_O + 1.12z_T,

TABLE 7.4 Estimated Parameters in Logit Model for Alligator Food Choice, Based on Dummy Variable for First Size Category and Each Lake Except Lake George^a

                                                  Lake
Logit^b         Intercept   Size ≤ 2.3     Hancock        Oklawaha       Trafford
log(π_I/π_F)     −1.55       1.46 (0.40)   −1.66 (0.61)    0.94 (0.47)    1.12 (0.49)
log(π_R/π_F)     −3.31      −0.35 (0.58)    1.24 (1.19)    2.46 (1.12)    2.94 (1.12)
log(π_B/π_F)     −2.09      −0.63 (0.64)    0.70 (0.78)   −0.65 (1.20)    1.09 (0.84)
log(π_O/π_F)     −1.90       0.33 (0.45)    0.83 (0.56)    0.01 (0.78)    1.52 (0.62)

^a SE values in parentheses.
^b I, invertebrate; R, reptile; B, bird; O, other; F, fish.

where s = 1 for size ≤ 2.3 meters and 0 otherwise, z_H is a dummy variable for Lake Hancock (z_H = 1 for alligators in that lake and 0 otherwise), and z_O and z_T are dummy variables for Lakes Oklawaha and Trafford. Size of alligator has a noticeable effect. For a given lake, for small alligators the estimated odds that primary food choice was invertebrates instead of fish are exp(1.46) = 4.3 times the estimated odds for large alligators; the Wald 95% confidence interval is exp[1.46 ± 1.96(0.396)] = (2.0, 9.3). The lake effects indicate that the estimated odds that the primary food choice was invertebrates instead of fish are relatively higher at Lakes Trafford and Oklawaha and relatively lower at Lake Hancock than they are at Lake George. The equations in Table 7.4 determine those for other food-choice pairs.
For instance, for (invertebrate, other),

log(π̂_I/π̂_O) = log(π̂_I/π̂_F) − log(π̂_O/π̂_F)
  = (−1.55 + 1.46s − 1.66z_H + 0.94z_O + 1.12z_T) − (−1.90 + 0.33s + 0.83z_H + 0.01z_O + 1.52z_T)
  = 0.35 + 1.13s − 2.48z_H + 0.93z_O − 0.39z_T.

7.1.3 Estimating Response Probabilities

The equation that expresses multinomial logit models directly in terms of response probabilities {π_j(x)} is

π_j(x) = exp(α_j + β_j′x) / [1 + Σ_{h=1}^{J−1} exp(α_h + β_h′x)],  (7.2)

with α_J = 0 and β_J = 0. This follows from (7.1), using the fact that (7.1) also holds with j = J by setting α_J = 0 and β_J = 0. (Also, the parameters equal zero for a baseline category for identifiability reasons; see Problem 7.26.) The denominator of (7.2) is the same for each j. The numerators for various j sum to the denominator, so Σ_j π_j(x) = 1. For J = 2, (7.2) simplifies to the formula of type (5.1) used for binary logistic regression.

From Table 7.4 the estimated probability that a large alligator in Lake Hancock has invertebrates as the primary food choice is

π̂_I = e^{−1.55−1.66} / [1 + e^{−1.55−1.66} + e^{−3.31+1.24} + e^{−2.09+0.70} + e^{−1.90+0.83}] = 0.023.

The estimated probabilities for reptile, bird, other, and fish are 0.072, 0.141, 0.194, and 0.570.

This example used qualitative predictors. Multinomial logit models can also contain quantitative predictors. In this study, the biologists used the size dummy variable to distinguish between adult and subadult alligators. However, the alligators' actual length was measured and is quantitative. With quantitative predictors, it is informative to plot the estimated probabilities. To illustrate, for alligators at one lake, Figure 7.1 plots the estimated probabilities that primary food choice is fish, invertebrate, or other (which combines the other, bird, and reptile categories) as a function of length.

FIGURE 7.1 Estimated probabilities for primary food choice.
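The probability calculation of (7.2) can be sketched in a few lines of Python. This uses the rounded two-decimal estimates from Table 7.4, so the results agree with the text only to within rounding.

```python
import math

# Rounded ML estimates from Table 7.4:
# (intercept, size <= 2.3 m, Hancock, Oklawaha, Trafford), fish as baseline
coef = {
    "invertebrate": (-1.55, 1.46, -1.66, 0.94, 1.12),
    "reptile":      (-3.31, -0.35, 1.24, 2.46, 2.94),
    "bird":         (-2.09, -0.63, 0.70, -0.65, 1.09),
    "other":        (-1.90, 0.33, 0.83, 0.01, 1.52),
}

def probs(s, zH, zO, zT):
    # Baseline-category probabilities, equation (7.2); fish has all-zero parameters
    x = (1, s, zH, zO, zT)
    expo = {food: math.exp(sum(b * v for b, v in zip(beta, x)))
            for food, beta in coef.items()}
    denom = 1 + sum(expo.values())   # the leading 1 is exp(0) for fish
    pi = {food: e / denom for food, e in expo.items()}
    pi["fish"] = 1 / denom
    return pi

# Large (> 2.3 m) alligator in Lake Hancock: s = 0, zH = 1, zO = zT = 0
pi = probs(0, 1, 0, 0)
print({food: round(p, 3) for food, p in pi.items()})
```

The invertebrate probability reproduces the 0.023 above, and the other categories come out close to the reported 0.072, 0.141, 0.194, and 0.570.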
With more than two response categories, the probability for a given category need not continuously increase or decrease (Problem 7.27).

7.1.4 Fitting of Baseline-Category Logit Models*

ML fitting of multinomial logit models maximizes the likelihood subject to {π_j(x)} simultaneously satisfying the J − 1 equations that specify the model. For i = 1, ..., n, let y_i = (y_{i1}, ..., y_{iJ}) represent the multinomial trial for subject i, where y_{ij} = 1 when the response is in category j and y_{ij} = 0 otherwise. Thus, Σ_j y_{ij} = 1. Let x_i = (x_{i1}, ..., x_{ip})′ denote explanatory variable values for subject i. Let β_j = (β_{j1}, ..., β_{jp})′ denote parameters for the jth logit. Since π_J = 1 − (π_1 + ··· + π_{J−1}) and y_{iJ} = 1 − (y_{i1} + ··· + y_{i,J−1}), the contribution to the log likelihood by subject i is

log[ Π_{j=1}^{J} π_j(x_i)^{y_{ij}} ]
  = Σ_{j=1}^{J−1} y_{ij} log π_j(x_i) + (1 − Σ_{j=1}^{J−1} y_{ij}) log[1 − Σ_{j=1}^{J−1} π_j(x_i)]
  = Σ_{j=1}^{J−1} y_{ij} log{ π_j(x_i) / [1 − Σ_{j=1}^{J−1} π_j(x_i)] } + log[1 − Σ_{j=1}^{J−1} π_j(x_i)].

Thus, the baseline-category logits are the natural parameters for the multinomial distribution.

Now assume n independent observations. In the last expression above, substituting α_j + β_j′x_i for the logit in the first term and π_J(x_i) = 1/[1 + Σ_{j=1}^{J−1} exp(α_j + β_j′x_i)] in the second term, the log likelihood is

log[ Π_{i=1}^{n} Π_{j=1}^{J} π_j(x_i)^{y_{ij}} ]
  = Σ_{i=1}^{n} { Σ_{j=1}^{J−1} y_{ij}(α_j + β_j′x_i) − log[1 + Σ_{j=1}^{J−1} exp(α_j + β_j′x_i)] }
  = Σ_{j=1}^{J−1} [ α_j (Σ_{i=1}^{n} y_{ij}) + Σ_{k=1}^{p} β_{jk} (Σ_{i=1}^{n} x_{ik} y_{ij}) ] − Σ_{i=1}^{n} log[1 + Σ_{j=1}^{J−1} exp(α_j + β_j′x_i)].

The sufficient statistic for β_{jk} is Σ_i x_{ik} y_{ij}, j = 1, ..., J − 1, k = 1, ..., p. The sufficient statistic for α_j is Σ_i y_{ij} = Σ_i x_{i0} y_{ij} for x_{i0} = 1; this is the total number of outcomes in category j. The likelihood equations equate the sufficient statistics to their expected values.
The log likelihood is concave, and the Newton–Raphson method yields the ML parameter estimates. The estimators have large-sample normal distributions. Their asymptotic standard errors are square roots of diagonal elements of the inverse information matrix.

Most statistical software can fit multinomial logit models, but some can fit only binary logistic regression models. An alternative fitting approach fits binary logit models separately for the J − 1 pairings of responses: model (7.1) for j = 1 alone, using only observations in category 1 or J of the response variable to obtain estimates of α_1 and β_1; model (7.1) using only categories 2 and J to obtain estimates of α_2 and β_2; in this manner, obtaining J − 1 separate fits of logit models. A logit model fitted using data from only two response categories is the same as a regular logit model fitted conditional on classification into one of those categories. For instance, the jth baseline-category logit is a logit of conditional probabilities:

log{ [π_j(x)/(π_j(x) + π_J(x))] / [π_J(x)/(π_j(x) + π_J(x))] } = log[π_j(x)/π_J(x)].

The separate-fitting estimates differ from the ML estimates for simultaneous fitting of the J − 1 logits. They are less efficient, tending to have larger standard errors. However, Begg and Gray (1984) showed that the efficiency loss is minor when the response category having highest prevalence is the baseline. To illustrate this approach, we used the data for the categories invertebrate and fish alone. The fit is

log(π̂_I/π̂_F) = −1.69 + 1.66s − 1.78z_H + 1.05z_O + 1.22z_T,

with standard errors (0.43, 0.62, 0.49, 0.52) for the effects. The effects are similar to those from simultaneous fitting with all five response categories (see the first row of Table 7.4). The estimated standard errors are only slightly larger, since 155 of the 219 observations were in the fish or invertebrate categories of food type.
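To see the fitting mechanics on a small scale, the sketch below fits an intercepts-only baseline-category logit model by plain gradient ascent (standing in for Newton–Raphson) to the marginal food-choice totals from Table 7.3. With no predictors, the likelihood equations have the known closed-form solution α̂_j = log(n_j/n_J), which the iteration recovers; this toy setup is an illustration, not the model fitted in the text.

```python
import math

# Marginal food-choice totals from Table 7.3 (invertebrate, reptile, bird, other),
# with fish (n = 94) as the baseline category, listed last
counts = [61, 19, 13, 32, 94]
n = sum(counts)
J = len(counts)

# Gradient ascent on the multinomial log likelihood in (alpha_1, ..., alpha_{J-1});
# the gradient component for alpha_j is the sufficient statistic sum_i y_ij minus
# its expected value n * pi_j (rescaled here by 1/n)
alpha = [0.0] * (J - 1)
for _ in range(5000):
    expo = [math.exp(a) for a in alpha] + [1.0]   # baseline logit is 0
    denom = sum(expo)
    pi = [e / denom for e in expo]
    for j in range(J - 1):
        alpha[j] += counts[j] / n - pi[j]

closed_form = [math.log(counts[j] / counts[-1]) for j in range(J - 1)]
print([round(a, 4) for a in alpha])
print([round(a, 4) for a in closed_form])
```

The two printed vectors agree: the iteration stops changing exactly when the fitted counts n·π̂_j equal the observed counts, which is what "the likelihood equations equate the sufficient statistics to their expected values" means.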
7.1.5 Multicategory Logit Model as Multivariate GLM*

For a univariate response variable in the natural exponential family, a GLM has form g(μ_i) = x_i′β for a link function g, expected response μ_i = E(Y_i), vector of values x_i of p explanatory variables for observation i, and parameter vector β = (β_1, ..., β_p)′. This extends to a multivariate GLM for distributions in the multivariate exponential family (Problem 7.24), such as the multinomial. Let y_i = (y_{i1}, y_{i2}, ...)′ be a vector response for subject i, with μ_i = E(Y_i). Let g be a vector of link functions. The multivariate GLM has the form

g(μ_i) = X_i β,  (7.3)

where row h of the model matrix X_i for observation i contains values of explanatory variables for y_{ih}. For details, see Fahrmeir and Tutz (2001, Chap. 3).

The baseline-category logit model is a multivariate GLM. Here y_i = (y_{i1}, ..., y_{i,J−1})′, since y_{iJ} is redundant. Then μ_i = (π_1(x_i), ..., π_{J−1}(x_i))′ and

g_j(μ_i) = log{ μ_{ij} / [1 − (μ_{i1} + ··· + μ_{i,J−1})] }.

The model matrix for observation i is block diagonal, with J − 1 blocks (1, x_i′) on the diagonal and 0 entries in other locations, and β′ = (α_1, β_1′, ..., α_{J−1}, β_{J−1}′). One can also formulate the model for grouped data using sample proportions in the categories.

7.2 ORDINAL RESPONSES: CUMULATIVE LOGIT MODELS

In Section 6.4.1 we showed the benefits of utilizing the ordinality of a variable by focusing inferences on a single parameter. These benefits extend to models for ordinal responses. Models with terms that reflect ordinal characteristics such as monotone trend have improved model parsimony and power. In this section we introduce the most popular logit model for ordinal responses.

7.2.1 Cumulative Logits

One way to use category ordering forms logits of cumulative probabilities,

P(Y ≤ j | x) = π_1(x) + ··· + π_j(x),  j = 1, ..., J.

The cumulative logits are defined as

logit[P(Y ≤ j | x)]
  = log{ P(Y ≤ j | x) / [1 − P(Y ≤ j | x)] }
  = log{ [π_1(x) + ··· + π_j(x)] / [π_{j+1}(x) + ··· + π_J(x)] },  j = 1, ..., J − 1.  (7.4)

Each cumulative logit uses all J response categories. A model for logit[P(Y ≤ j)] alone is an ordinary logit model for a binary response in which categories 1 to j form one outcome and categories j + 1 to J form the second. Better, models can use all J − 1 cumulative logits in a single parsimonious model.

7.2.2 Proportional Odds Model

A model that simultaneously uses all cumulative logits is

logit[P(Y ≤ j | x)] = α_j + β′x,  j = 1, ..., J − 1.  (7.5)

Each cumulative logit has its own intercept. The {α_j} are increasing in j, since P(Y ≤ j | x) increases in j for fixed x, and the logit is an increasing function of this probability.

This model has the same effects β for each logit. For a continuous predictor x, Figure 7.2 depicts the model when J = 4.

FIGURE 7.2 Cumulative logit model with effect independent of cutpoint.

For fixed j, the response curve is a logistic regression curve for a binary response with outcomes Y ≤ j and Y > j. The response curves for j = 1, 2, and 3 have the same shape. They share exactly the same rate of increase or decrease but are horizontally displaced from each other. For j < k, the curve for P(Y ≤ k) is the curve for P(Y ≤ j) translated by (α_k − α_j)/β units in the x direction; that is,

P(Y ≤ k | X = x) = P(Y ≤ j | X = x + (α_k − α_j)/β).

Figure 7.3 portrays the curves for the category probabilities.

FIGURE 7.3 Category probabilities in cumulative logit model.

The cumulative logit model (7.5) satisfies

logit[P(Y ≤ j | x_1)] − logit[P(Y ≤ j | x_2)]
  = log{ [P(Y ≤ j | x_1)/P(Y > j | x_1)] / [P(Y ≤ j | x_2)/P(Y > j | x_2)] }
  = β′(x_1 − x_2).

An odds ratio of cumulative probabilities is called a cumulative odds ratio. The odds of making response ≤ j at x = x_1 are exp[β′(x_1 − x_2)] times the odds at x = x_2.
The log cumulative odds ratio is proportional to the distance between x_1 and x_2. The same proportionality constant applies to each logit. Because of this property, McCullagh (1980) called (7.5) a proportional odds model.

With a single predictor, the cumulative odds ratio equals e^β whenever x_1 − x_2 = 1. Figure 7.4 illustrates the constant cumulative odds ratio this model then implies for all j. It shows the J-category response collapsed into the binary outcome (≤ j, > j) and shows the sets of cells that determine the cumulative odds ratio AD/BC that takes the same value e^β for each such collapsing.

FIGURE 7.4 Uniform odds ratios AD/BC whenever x_1 − x_2 = 1, for all response cutpoints with proportional odds model.

Model (7.5) constrains the J − 1 response curves to have the same shape. Thus, its fit is not the same as fitting separate logit models for each j. Again let (y_{i1}, ..., y_{iJ}) be binary indicators of the response for subject i. The likelihood function is

Π_{i=1}^{n} Π_{j=1}^{J} π_j(x_i)^{y_{ij}}
  = Π_{i=1}^{n} Π_{j=1}^{J} [P(Y ≤ j | x_i) − P(Y ≤ j − 1 | x_i)]^{y_{ij}}
  = Π_{i=1}^{n} Π_{j=1}^{J} { exp(α_j + β′x_i)/[1 + exp(α_j + β′x_i)] − exp(α_{j−1} + β′x_i)/[1 + exp(α_{j−1} + β′x_i)] }^{y_{ij}},  (7.6)

viewed as a function of ({α_j}, β). McCullagh (1980) and Walker and Duncan (1967) used Fisher scoring algorithms to obtain ML estimates.

7.2.3 Latent Variable Motivation*

A regression model for a continuous variable assumed to underlie Y motivates the common effect β for different j in the proportional odds model (Anderson and Philips 1981). Let Y* denote this underlying variable. In statistics, such an unobserved variable is called a latent variable. Suppose that it has cdf G(y* − η), where values of y* vary around a location parameter η (such as a mean) that depends on x through η(x) = β′x. Suppose that −∞ = α_0 < α_1 < ··· < α_J = ∞ are cutpoints of the continuous scale such
Suppose that y⬁ s ␣ 0 - ␣ 1 - ⭈⭈⭈ - ␣ J s ⬁ are cutpoints of the continuous scale such 278 LOGIT MODELS FOR MULTINOMIAL RESPONSES FIGURE 7.5 Ordinal measurement and underlying regression model for a latent variable. that the observed response Y satisfies Y s j if ␣ jy1 - Y * F ␣ j . That is, Y falls in category j when the latent variable falls in the jth interval of values ŽFigure 7.5.. Then P Ž Y F j < x . s P Ž Y * F ␣ j < x . s G Ž␣j y ␤X x . . The appropriate model for Y implies that the link Gy1 , the inverse of the cdf for Y *, applies to P Ž Y F j < x.. If Y * s ␤X x q ⑀ , where the cdf G of ⑀ is the logistic ŽSection 4.2.5., then Gy1 is the logit link and a proportional odds model results. Normality for ⑀ implies a probit link for cumulative probabilities ŽSection 7.3.1.. In this derivation, the same parameters ␤ occur for the effects on Y regardless of how the cutpoints  ␣ j 4 chop up the scale for the latent variable. The effect parameters are invariant to the choice of categories for Y. If a continuous variable measuring political philosophy has a linear regression with some predictor variables, then the same effect parameters apply to a discrete version of political philosophy with the categories Žliberal, moderate, conservative. or Žvery liberal, slightly liberal, moderate, slightly conservative, very conservative.. This feature makes it possible to compare estimates from studies using different response scales. ORDINAL RESPONSES: CUMULATIVE LOGIT MODELS 279 Note that the use of a cdf of form G Ž y* y ␩ . for the latent variable results in linear predictor ␣ j y ␤X x rather than ␣ j q ␤X x. When ␤ ) 0, as x increases each cumulative logit then decreases, so each cumulative probability decreases and relatively less probability mass falls at the low end of the Y scale. Thus, Y tends to be larger at higher values of x. With this parameterization the sign of ␤ has the usual meaning. However, most software Že.g., SAS. uses form Ž7.5.. 
7.2.4 Mental Impairment Example

Table 7.5 comes from a study of mental health for a random sample of adult residents of Alachua County, Florida. It relates mental impairment to two explanatory variables. Mental impairment is an ordinal response, with categories (well, mild symptom formation, moderate symptom formation, impaired). The life events index x_1 is a composite measure of the number and severity of important life events, such as birth of child, new job, divorce, or death in family, that occurred to the subject within the past 3 years. Socioeconomic status (x_2 = SES) is measured here as binary (1 = high, 0 = low).

TABLE 7.5 Mental Impairment by SES and Life Events

          Mental        SES^a  Life Events           Mental        SES^a  Life Events
Subject   Impairment    x_2        x_1      Subject  Impairment    x_2        x_1
 1        Well           1          1         21     Mild           1          9
 2        Well           1          9         22     Mild           0          3
 3        Well           1          4         23     Mild           1          3
 4        Well           1          3         24     Mild           1          1
 5        Well           0          2         25     Moderate       0          0
 6        Well           1          0         26     Moderate       1          4
 7        Well           0          1         27     Moderate       0          3
 8        Well           1          3         28     Moderate       0          9
 9        Well           1          3         29     Moderate       1          6
10        Well           1          7         30     Moderate       0          4
11        Well           0          1         31     Moderate       0          3
12        Well           0          2         32     Impaired       1          8
13        Mild           1          5         33     Impaired       1          2
14        Mild           0          6         34     Impaired       1          7
15        Mild           1          3         35     Impaired       0          5
16        Mild           0          1         36     Impaired       0          4
17        Mild           1          8         37     Impaired       0          4
18        Mild           1          2         38     Impaired       1          8
19        Mild           0          5         39     Impaired       0          8
20        Mild           1          5         40     Impaired       0          9

^a 0, low; 1, high.

TABLE 7.6 Output for Fitting Cumulative Logit Model to Table 7.5

Score Test for the Proportional Odds Assumption
  Chi-Square    DF    Pr > ChiSq
      2.3255     4        0.6761

                                   Like. Ratio 95%
Parameter     Estimate  Std Error    Conf Limits      Chi-Square  Pr > ChiSq
Intercept1     -0.2819    0.6423   -1.5615   0.9839      0.19       0.6607
Intercept2      1.2128    0.6607   -0.0507   2.5656      3.37       0.0664
Intercept3      2.2094    0.7210    0.8590   3.7123      9.39       0.0022
life           -0.3189    0.1210   -0.5718  -0.0920      6.95       0.0084
ses             1.1112    0.6109   -0.0641   2.3471      3.31       0.0689

The main-effects model of form (7.5) is

logit[P(Y ≤ j | x)] = α_j + β_1 x_1 + β_2 x_2.

Table 7.6 shows output.
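The estimates in Table 7.6 can be used directly to reproduce the probability summaries derived in the following discussion; a minimal Python sketch:

```python
import math

def expit(t):
    return 1 / (1 + math.exp(-t))

# ML estimates from Table 7.6
a1, b_life, b_ses = -0.2819, -0.3189, 1.1112
mean_life = 4.275

# P(Y = 1) = P(Y <= 1) = expit(alpha_1 + b_life*x1 + b_ses*x2), the "well" outcome
p_low = expit(a1 + b_life * mean_life)            # low SES, mean life events
p_high = expit(a1 + b_life * mean_life + b_ses)   # high SES, mean life events
print(round(p_low, 2), round(p_high, 2))           # 0.16 and 0.37

# Cumulative odds ratio for SES, the same for every cutpoint j
print(round(math.exp(b_ses), 2))                   # about 3.0

# Life-events effect at its quartiles, x1 = 2.0 and 6.5, at each SES level
for ses in (1, 0):
    print([round(expit(a1 + b_life * q + b_ses * ses), 2) for q in (2.0, 6.5)])
```

The quartile comparisons match the text's 0.55/0.22 (high SES) and 0.28/0.09 (low SES) to within rounding of the reported coefficients.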
With J = 4 response categories, the model has three intercepts {α_j}. Usually, these are not of interest except for computing response probabilities. The parameter estimates yield estimated logits and hence estimates of P(Y ≤ j), P(Y > j), or P(Y = j). We illustrate for subjects at the mean life events score of x_1 = 4.275 with low SES (x_2 = 0). Since α̂_1 = −0.282, the estimated probability of response well is

P̂(Y = 1) = P̂(Y ≤ 1) = exp[−0.282 − 0.319(4.275)] / {1 + exp[−0.282 − 0.319(4.275)]} = 0.16.

Figure 7.6 plots P̂(Y > 2) as a function of the life events index, at the two levels of SES.

FIGURE 7.6 Estimated values of P(Y > 2) for Table 7.5.

The effect estimates β̂_1 = −0.319 and β̂_2 = 1.111 suggest that the cumulative probability starting at the well end of the scale decreases as the life events score increases and increases at the higher level of SES. Given the life events score, at the high SES level the estimated odds of mental impairment below any fixed level are e^{1.111} = 3.0 times the estimated odds at the low SES level.

Descriptions of effects can compare cumulative probabilities rather than use odds ratios. These can be easier to understand. We describe effects of quantitative variables by comparing probabilities at their quartiles. We describe effects of qualitative variables by comparing probabilities for different categories. We control for quantitative variables by setting them at their mean. We control for qualitative variables by fixing the category, unless there are several, in which case we can set each at their dummy means. We illustrate again with P(Y = 1), the well outcome. First, we describe the SES effect. At the mean life events of 4.275, P̂(Y = 1) = 0.37 at high SES (i.e., x_2 = 1) and 0.16 at low SES (x_2 = 0). Next, we describe the life events effect. The lower and upper quartiles of the life events score are 2.0 and 6.5. For high SES, P̂(Y = 1)
changes from 0.55 to 0.22 between these quartiles; for low SES, it changes from 0.28 to 0.09. (Note that comparing 0.55 to 0.28 at the lower quartile and 0.22 to 0.09 at the upper quartile provides further information about the SES effect.) The sample effect is substantial for both predictors.

The output in Table 7.6, taken from SAS, also presents a score test of the proportional odds property. This tests whether the effects are the same for each cumulative logit against the alternative of separate effects. It compares the model with one parameter for x_1 and one for x_2 to a more complex model with three parameters for each, allowing different effects for logit[P(Y ≤ 1)], logit[P(Y ≤ 2)], and logit[P(Y ≤ 3)]. Here, the score statistic equals 2.33. It has df = 4, since the more complex model has four additional parameters. The more complex model does not fit significantly better (P = 0.68).

7.2.5 More Complex Models

More complex cumulative logit models are formulated as in ordinary logistic regression. They simply require a set of intercept parameters rather than a single one. In the previous example, for instance, permitting interaction yields a model with ML fit

logit[P̂(Y ≤ j | x)] = α̂_j − 0.420x_1 + 0.371x_2 + 0.181x_1x_2,

where the coefficient of x_1x_2 has SE = 0.238. The estimated effect of life events on the cumulative logit is −0.420 for the low SES group and (−0.420 + 0.181) = −0.239 for the high SES group. The impact of life events seems more severe for the low SES group, but the difference in effects is not significant.

Models in this section used the proportional odds assumption of the same effects for different cumulative logits. An advantage is that effects are simple to summarize and interpret, requiring only a single parameter for each predictor. The models generalize to include separate effects, replacing β in (7.5) by β_j. This implies nonparallelism of curves for different logits.
However, curves for different cumulative probabilities then cross for some x values. Such models violate the proper order among the cumulative probabilities. Even if such a model fits better over the observed range of x, for reasons of parsimony the simple model might be preferable. One case is when effects {β̂_j} with different logits are not substantially different in practical terms. Then the significance in a test of proportional odds may reflect primarily a large value of n. Even with smaller n, although effect estimators using the simple model are biased, they may have smaller MSE than estimators from a more complex model having many more parameters. So even if a test of proportional odds has a small P-value, don't discard this model automatically.

If a proportional odds model fits poorly in terms of practical as well as statistical significance, alternative strategies exist. These include (1) trying a link function for which the response curve is nonsymmetric (e.g., complementary log-log); (2) adding additional terms, such as interactions, to the linear predictor; (3) adding dispersion parameters; (4) permitting separate effects for each logit for some but not all predictors (i.e., partial proportional odds); and (5) fitting baseline-category logit models and using the ordinality in an informal way in interpreting the associations. For approach (4), see Peterson and Harrell (1990), Stokes et al. (2000, Sec. 15.13), and criticism by Cox (1995). In the next section we generalize the cumulative logit model to permit extensions (1) and (3).

7.3 ORDINAL RESPONSES: CUMULATIVE LINK MODELS

Cumulative logit models use the logit link. As in univariate GLMs, other link functions are possible. Let G⁻¹ denote a link function that is the inverse of the continuous cdf G (recall Section 4.2.5). The cumulative link model

G⁻¹[P(Y ≤ j | x)] = α_j + β′x    (7.7)

links the cumulative probabilities to the linear predictor. The logit link function G⁻¹(u)
= log[u/(1 − u)] is the inverse of the standard logistic cdf. As in the proportional odds model (7.5), effects of x in (7.7) are assumed the same for each cutpoint, j = 1, ..., J − 1. In Section 7.2.3 we showed that this assumption holds when a linear regression for a latent variable Y* has standardized cdf G. Model (7.7) results from discrete measurement of Y* from a location-parameter family having cdf G(y* − β′x). The parameters {α_j} are category cutpoints on a standardized version of the latent scale. In this sense, cumulative link models are regression models, using a linear predictor β′x to describe effects of explanatory variables on crude ordinal measurement of Y*. Using −β rather than +β in the linear predictor merely results in a change of sign of β̂. Most software (e.g., GENMOD and LOGISTIC in SAS) fits it in +β form.

7.3.1 Types of Cumulative Links

Use of the standard normal cdf Φ for G gives the cumulative probit model. This generalizes the binary probit model (Section 6.6) to ordinal responses. It is appropriate when the distribution for Y* is normal. Parameters in probit models can be interpreted in terms of the latent variable Y*. For instance, consider the model Φ⁻¹[P(Y ≤ j)] = α_j − βx. From Section 7.2.3, since Y* = βx + ε where ε ~ N(0, 1) has cdf Φ, β has the interpretation that a 1-unit increase in x corresponds to a β increase in E(Y*). When ε need not be in standardized form with σ = 1, a 1-unit increase in x corresponds to a β standard deviation increase in E(Y*). Cumulative logit models provide fits similar to those for cumulative probit models, and their parameter interpretation is simpler.

An underlying extreme value distribution for Y* implies a model of the form

log{−log[1 − P(Y ≤ j | x)]} = α_j + β′x.

In Section 6.6 we introduced this complementary log-log link for binary data.
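In code, the three common links differ only in the assumed cdf G for the latent variable; each link is the inverse of its cdf. A small sketch (the cutpoints and effect below are hypothetical, for illustration only):

```python
from math import exp, log
from statistics import NormalDist

# Candidate cdf's G for the latent variable, and their inverse links G^{-1}.
logistic_cdf = lambda z: 1 / (1 + exp(-z))
normal = NormalDist()
ev_cdf = lambda z: 1 - exp(-exp(z))            # extreme-value cdf

logit = lambda u: log(u / (1 - u))             # inverse of the logistic cdf
probit = normal.inv_cdf                        # inverse of Phi
cloglog = lambda u: log(-log(1 - u))           # inverse of ev_cdf

# Each link really is the inverse of its cdf:
for G, Ginv in [(logistic_cdf, logit), (normal.cdf, probit), (ev_cdf, cloglog)]:
    assert abs(G(Ginv(0.3)) - 0.3) < 1e-9

# Under model (7.7), P(Y <= j | x) = G(alpha_j + beta'x); the values below
# are hypothetical placeholders, not from the text.
alphas, beta, x = [-1.0, 0.5, 1.5], -0.4, 2.0
cum = {"logit":   [logistic_cdf(a + beta * x) for a in alphas],
       "probit":  [normal.cdf(a + beta * x) for a in alphas],
       "cloglog": [ev_cdf(a + beta * x) for a in alphas]}
```

Whatever the link, the cumulative probabilities increase in j; the links differ in how fast P(Y ≤ j) approaches 0 and 1 in the tails.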
The ordinal model using this link is sometimes called a proportional hazards model, since it results from a generalization of the proportional hazards model for survival data to handle grouped survival times (Prentice and Gloeckler 1978). It has the property

P(Y > j | x₁) = P(Y > j | x₂)^exp[β′(x₁ − x₂)].

With this link, P(Y ≤ j) approaches 1.0 at a faster rate than it approaches 0.0. The related log-log link, log{−log[P(Y ≤ j)]}, is appropriate when the complementary log-log link holds for the categories listed in reverse order.

7.3.2 Estimation for Cumulative Link Models

McCullagh (1980) and Thompson and Baker (1981) treated cumulative link models as multivariate GLMs. McCullagh presented a Fisher scoring algorithm for ML estimation, expressing the likelihood in the form (7.6) using cumulative probabilities. McCullagh showed that sufficiently large n guarantees a unique maximum of the likelihood. Burridge (1981) and Pratt (1981) showed that the log likelihood is concave for many cumulative link models, including the logit, probit, and complementary log-log. Iterative algorithms usually converge rapidly to the ML estimates.

7.3.3 Life Table Example

Table 7.7 shows the life-length distribution for U.S. residents in 1981, by race and gender. Life length uses five ordered categories. The underlying continuous cdf of life length increases slowly at small to moderate ages but increases sharply at older ages. This suggests the complementary log-log link. This link also results from assuming that the hazard rate increases exponentially with age, which happens for an extreme value distribution (the Gompertz). For gender G (1 = female; 0 = male), race R (1 = black; 0 = white), and life length Y, Table 7.7 contains fitted distributions for the model

log{−log[1 − P(Y ≤ j | G = g, R = r)]} = α_j + β₁g + β₂r.

Goodness-of-fit statistics are irrelevant, since the table contains population distributions.
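For this model, the survivor-proportion property stated earlier follows directly from the link: under the complementary log-log link, P(Y > j | x) = exp[−exp(α_j + β′x)]. A numerical check (the α and β values below are arbitrary placeholders, not the life-table values):

```python
from math import exp

def surv(a, bx):
    """P(Y > j) under the complementary log-log model: exp(-exp(a + b'x))."""
    return exp(-exp(a + bx))

a, b = -0.7, 0.4        # arbitrary cutpoint and effect, for illustration
x1, x2 = 2.0, 0.5

lhs = surv(a, b * x1)
rhs = surv(a, b * x2) ** exp(b * (x1 - x2))   # property P(Y>j|x1) = P(Y>j|x2)^exp(b(x1-x2))
assert abs(lhs - rhs) < 1e-12
```

Note the cutpoint α_j cancels out of the exponent comparison, which is why a single power exp[β′(x₁ − x₂)] relates the two survivor curves at every j.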
The model describes the four distributions well. Its parameter values are β₁ = −0.658 and β₂ = 0.626. The fitted cdf's satisfy

P(Y > j | G = 0, R = r) = P(Y > j | G = 1, R = r)^exp(0.658).

Given race, the proportion of men living longer than a fixed time equaled the proportion for women raised to the exp(0.658) = 1.93 power. Given gender, the proportion of blacks living longer than a fixed time equaled the proportion for whites raised to the exp(0.626) = 1.87 power. The β₁ and β₂ values indicate that white men and black women had similar distributions, that white women tended to have the longest lives, and that black men tended to have the shortest lives. If the probability of living longer than some fixed time equaled π for white women, that probability was about π² for white men and black women and about π⁴ for black men.

TABLE 7.7 Life-Length Distribution of U.S. Residents (Percent),ᵃ 1981

                      Males                        Females
Life Length     White         Black          White         Black
0–20             2.4  (2.4)    3.6  (4.4)     1.6  (1.2)    2.7  (2.3)
20–40            3.4  (3.5)    7.5  (6.4)     1.4  (1.9)    2.9  (3.4)
40–50            3.8  (4.4)    8.3  (7.7)     2.2  (2.4)    4.4  (4.3)
50–65           17.5 (16.7)   25.0 (26.1)     9.9  (9.6)   16.3 (16.3)
Over 65         72.9 (73.0)   55.6 (55.4)    84.9 (84.9)   73.7 (73.7)

ᵃValues in parentheses are the fit of the proportional hazards (i.e., complementary log-log link) model.
Source: Data from Statistical Abstract of the United States (Washington, DC: U.S. Bureau of the Census, 1984), p. 69.

7.3.4 Incorporating Dispersion Effects*

For cumulative link models, settings of the explanatory variables are stochastically ordered on the response: for any pair x₁ and x₂, either P(Y ≤ j | x₁) ≤ P(Y ≤ j | x₂) for all j or P(Y ≤ j | x₁) ≥ P(Y ≤ j | x₂) for all j. Figure 7.7a illustrates this for underlying continuous density functions and cdf's at two settings of x. When this is violated and such models fit poorly, often it is because the dispersion also varies with x.
For instance, perhaps responses tend to concentrate around the same location, but more dispersion occurs at x₁ than at x₂. Then perhaps P(Y ≤ j | x₁) > P(Y ≤ j | x₂) for small j but P(Y ≤ j | x₁) < P(Y ≤ j | x₂) for large j. In other words, at x₁ the responses concentrate more at the extreme categories than at x₂. Figure 7.7b illustrates this for underlying continuous distributions.

FIGURE 7.7 (a) Distribution 1 stochastically higher than distribution 2; (b) distributions not stochastically ordered.

A cumulative link model that incorporates dispersion effects is

G⁻¹[P(Y ≤ j | x)] = (α_j + β′x)/exp(γ′x).    (7.8)

(Again, one can replace + by − to more closely mimic a location–scale family for an underlying continuous variable.) The denominator contains scale parameters γ that describe the dispersion's dependence on x. The ordinary model (7.7) is the special case γ = 0. Otherwise, the cumulative probabilities tend to shrink toward each other when γ′x > 0. This creates higher probabilities in the end categories and overall greater dispersion. The cumulative probabilities tend to move apart (creating less dispersion) when γ′x < 0.

To illustrate, we use this model to compare two groups on an ordinal scale. Suppose that x is a dummy variable with x = 1 for the first group. With cumulative logits, model (7.8) is

logit[P(Y ≤ j)] = α_j,                    x = 0,
logit[P(Y ≤ j)] = (α_j + β)/exp(γ),       x = 1.

The case γ = 0 is the usual model, in which β is a location shift that determines a common cumulative log odds ratio for all 2 × 2 collapsings of the 2 × J table. When γ ≠ 0, the difference between the logits for the two groups, and hence the cumulative odds ratio, varies with j. When γ > 0, responses at x = 1 tend to be more disperse than at x = 0. See Cox (1995) and McCullagh (1980) for model fitting and examples.
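The scale effect in the two-group version above can be seen numerically. A minimal sketch (the cutpoints, β, and γ below are hypothetical, not fitted values):

```python
from math import exp

def inv_logit(z):
    return 1 / (1 + exp(-z))

alphas = [-1.5, 0.0, 1.5]        # hypothetical cutpoints alpha_j (J = 4 categories)
beta, gamma = 0.3, 0.8           # hypothetical location and scale effects

cum0 = [inv_logit(a) for a in alphas]                        # group x = 0
cum1 = [inv_logit((a + beta) / exp(gamma)) for a in alphas]  # group x = 1

# With gamma > 0 the cumulative probabilities for x = 1 shrink toward each
# other, so more probability lands in the two end categories.
spread0 = cum0[-1] - cum0[0]
spread1 = cum1[-1] - cum1[0]
assert spread1 < spread0
assert cum1[0] > cum0[0]                   # more mass in the lowest category
assert (1 - cum1[-1]) > (1 - cum0[-1])     # and in the highest category
```

The cumulative log odds ratio between the groups here differs across j, which is exactly what the ordinary model (7.7) cannot represent.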
7.4 ALTERNATIVE MODELS FOR ORDINAL RESPONSES*

Models for ordinal responses need not use cumulative probabilities. In this section we discuss alternative logit models and a simpler model that resembles ordinary regression.

7.4.1 Adjacent-Categories Logits

The adjacent-categories logits are

logit[P(Y = j | Y = j or j + 1)] = log(π_j/π_{j+1}),   j = 1, ..., J − 1.    (7.9)

These logits are a basic set equivalent to the baseline-category logits. The connections are

log(π_j/π_J) = log(π_j/π_{j+1}) + log(π_{j+1}/π_{j+2}) + ··· + log(π_{J−1}/π_J),

and

log(π_j/π_{j+1}) = log(π_j/π_J) − log(π_{j+1}/π_J),   j = 1, ..., J − 1.    (7.10)

Either set determines logits for all (J choose 2) pairs of response categories.

Models using adjacent-categories logits can be expressed as baseline-category logit models. For instance, consider the adjacent-categories logit model

log[π_j(x)/π_{j+1}(x)] = α_j + β′x,   j = 1, ..., J − 1,    (7.11)

with common effect β. From adding (J − j) terms as in (7.10), the equivalent baseline-category logit model is

log[π_j(x)/π_J(x)] = Σ_{k=j}^{J−1} α_k + β′(J − j)x,   j = 1, ..., J − 1
                   = α*_j + β′u_j,   j = 1, ..., J − 1,

with u_j = (J − j)x. The adjacent-categories logit model corresponds to a baseline-category logit model with adjusted model matrix but also a single parameter for each predictor. With some software one can fit model (7.11) by fitting the equivalent baseline-category logit model.

The construction of the adjacent-categories logits recognizes the ordering of Y categories. To benefit from this in model parsimony requires appropriate specification of the linear predictor. For instance, if an explanatory variable has a similar effect for each logit, advantages accrue from having a single parameter instead of (J − 1) parameters describing that effect. When used with this proportional odds form, model (7.11) with adjacent-categories logits fits well in similar situations as model (7.5) with cumulative logits.
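The telescoping identity relating the two logit sets is easy to verify numerically. A minimal sketch (the probabilities below are arbitrary illustrations):

```python
from math import log

pi = [0.1, 0.2, 0.3, 0.4]          # arbitrary response probabilities, J = 4
J = len(pi)

adjacent = [log(pi[j] / pi[j + 1]) for j in range(J - 1)]   # logits (7.9)
baseline = [log(pi[j] / pi[J - 1]) for j in range(J - 1)]   # baseline-category logits

# Telescoping: log(pi_j / pi_J) is the sum of the adjacent logits from j onward.
for j in range(J - 1):
    assert abs(sum(adjacent[j:]) - baseline[j]) < 1e-12

# Consequently, a common effect beta in model (7.11) shows up in the
# baseline-category logit for category j as (J - j)*beta: 3b, 2b, b here.
```

This is why fitting (7.11) via a baseline-category logit routine requires only multiplying each predictor by (J − j) in the equation for category j.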
They both imply stochastically ordered distributions for Y at different predictor values. The choice of model should depend less on goodness of fit than on whether one prefers effects to refer to individual response categories, as the adjacent-categories logits provide, or instead to groupings of categories using the entire scale or an underlying latent variable, which cumulative logits provide. Since effects in cumulative logit models refer to the entire scale, they are usually larger. The ratio of estimate to standard error, however, is usually similar for the two model types. An advantage of the cumulative logit model is the approximate invariance of effect estimates to the choice and number of response categories. This does not happen with the adjacent-categories logits.

7.4.2 Job Satisfaction Example

Table 7.8 refers to the relationship between job satisfaction (Y) and income, stratified by gender, for black Americans. For simplicity, we use income scores (1, 2, 3, 4). For income x and gender g (1 = females, 0 = males), consider the model

log(π_j/π_{j+1}) = α_j + β₁x + β₂g,   j = 1, 2, 3.

TABLE 7.8 Job Satisfaction and Income, Controlling for Gender

                                        Job Satisfaction
Gender   Income (dollars)    Very          A Little    Moderately    Very
                             Dissatisfied  Satisfied   Satisfied     Satisfied
Female   < 5000                  1             3           11            2
         5000–15,000             2             3           17            3
         15,000–25,000           0             1            8            5
         > 25,000                0             2            4            2
Male     < 5000                  1             1            2            1
         5000–15,000             0             3            5            1
         15,000–25,000           0             0            7            3
         > 25,000                0             1            9            6

Source: 1991 General Social Survey, National Opinion Research Center.

It describes the odds of being very dissatisfied instead of a little satisfied, a little instead of moderately satisfied, and moderately instead of very satisfied. This model is equivalent to the baseline-category logit model

log(π_j/π₄) = α*_j + β₁(4 − j)x + β₂(4 − j)g,   j = 1, 2, 3.
The value of the first predictor in this model is set equal to 3x in the equation for log(π₁/π₄), 2x in the equation for log(π₂/π₄), and x in the equation for log(π₃/π₄). Some software (e.g., PROC CATMOD in SAS; see Table A.12) allows one to enter a row of a model matrix for each baseline-category logit at a given setting of predictors. Then, after fitting the baseline-category logit model that constrains the effects to be the same for each logit, the estimated regression parameters are the ML estimates of parameters for the adjacent-categories logit model.

The ML fit gives β̂₁ = −0.389 (SE = 0.155) and β̂₂ = 0.045 (SE = 0.314). For this parameterization, β̂₁ < 0 means the odds of lower job satisfaction decrease as income increases. Given gender, the estimated odds of response in the lower of two adjacent categories multiplies by exp(−0.389) = 0.68 for each category increase in income. The model describes 24 logits (three for each income × gender combination) with five parameters. Its deviance G² = 12.6 with df = 19. This model, with a linear trend for the income effect and a lack of interaction between income and gender, seems adequate.

Similar substantive results occur with a cumulative logit model. Its deviance G² = 13.3 with df = 19. The income effect is larger (β̂₁ = −0.51, SE = 0.20), since it refers to the entire response scale rather than adjacent categories. However, significance is similar, with β̂₁/SE ≈ −2.5 for each model.

7.4.3 Continuation-Ratio Logits

Continuation-ratio logits are defined as

log[π_j/(π_{j+1} + ··· + π_J)],   j = 1, ..., J − 1    (7.12)

or as

log[π_{j+1}/(π₁ + ··· + π_j)],   j = 1, ..., J − 1.    (7.13)

The continuation-ratio logit model form is useful when a sequential mechanism, such as survival through various age periods, determines the response outcome (e.g., Tutz 1991). Let ω_j = P(Y = j | Y ≥ j). With explanatory variables,

ω_j(x) = π_j(x)/[π_j(x) + ··· + π_J(x)],   j = 1, ..., J − 1.    (7.14)
The continuation-ratio logits (7.12) are ordinary logits of these conditional probabilities: namely, log{ω_j(x)/[1 − ω_j(x)]}.

At the ith setting x_i of x, let {y_ij, j = 1, ..., J} denote the response counts, with n_i = Σ_j y_ij. When n_i = 1, y_ij indicates whether the response is in category j, as in Section 7.1.4. Let b(n, y; ω) denote the binomial probability of y successes in n trials with parameter ω for each trial. By expressing the multinomial probability of (y_i1, ..., y_iJ) in the form p(y_i1)p(y_i2 | y_i1) ··· p(y_iJ | y_i1, ..., y_{i,J−1}), one can show that the multinomial mass function has factorization

b[n_i, y_i1; ω₁(x_i)] b[n_i − y_i1, y_i2; ω₂(x_i)] ··· b[n_i − y_i1 − ··· − y_{i,J−2}, y_{i,J−1}; ω_{J−1}(x_i)].    (7.15)

The full likelihood is the product of multinomial mass functions from the different x_i values. Thus, the log likelihood is a sum of terms such that different ω_j enter into different terms. When parameters in the model specification for logit(ω_j) are distinct from those for logit(ω_k) whenever j ≠ k, maximizing each term separately maximizes the full log likelihood. Thus, separate fitting of models for different continuation-ratio logits gives the same results as simultaneous fitting. The sum of the J − 1 separate G² statistics provides an overall goodness-of-fit statistic pertaining to the simultaneous fitting of J − 1 models. Because these logits refer to a binary response in which one category combines levels of the original scale, separate fitting can use methods for binary logit models. Similar remarks apply to continuation-ratio logits (7.13), although those logits and the subsequent analysis do not give equivalent results. Sometimes, simpler models with the same effects for each logit are plausible (McCullagh and Nelder 1989, p. 164; Tutz 1991).
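The factorization (7.15) can be checked numerically on a small example (the probabilities and counts below are arbitrary):

```python
from math import comb, factorial

def binom(n, y, w):
    """Binomial probability b(n, y; w) of y successes in n trials."""
    return comb(n, y) * w**y * (1 - w) ** (n - y)

def multinom(ys, ps):
    """Multinomial probability of counts ys with cell probabilities ps."""
    n = sum(ys)
    coef = factorial(n)
    for y in ys:
        coef //= factorial(y)
    out = float(coef)
    for y, p in zip(ys, ps):
        out *= p**y
    return out

pi = [0.2, 0.3, 0.5]    # arbitrary category probabilities, J = 3
y = [3, 2, 5]           # arbitrary counts, n = 10

# continuation-ratio parameters omega_j = pi_j / (pi_j + ... + pi_J)
w1 = pi[0]
w2 = pi[1] / (pi[1] + pi[2])

lhs = multinom(y, pi)                                   # multinomial mass
rhs = binom(10, y[0], w1) * binom(10 - y[0], y[1], w2)  # factorization (7.15)
assert abs(lhs - rhs) < 1e-12
```

The same decomposition is what lets each logit(ω_j) model be fit by a separate binary logistic regression.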
7.4.4 Developmental Toxicity Study with Pregnant Mice

We illustrate continuation-ratio logits using Table 7.9 from a developmental toxicity study. Such experiments with rodents test substances posing potential danger to developing fetuses. Diethylene glycol dimethyl ether (diEGdiME), one such substance, is an industrial solvent used in the manufacture of protective coatings such as lacquers and metal coatings. This study administered diEGdiME in distilled water to pregnant mice. Each mouse was exposed to one of five concentration levels for 10 days early in the pregnancy. The mice exposed to level 0 formed a control group. Two days later, the uterine contents of the pregnant mice were examined for defects. Each fetus has three possible outcomes (nonlive, malformation, normal). The outcomes are ordered, with nonlive the least desirable result.

We use continuation-ratio logits to model (1) the probability π₁ of a nonlive fetus, and (2) the conditional probability π₂/(π₂ + π₃) of a malformed fetus, given that the fetus was live. We fitted the continuation-ratio logit models

log{π₁(x_i)/[π₂(x_i) + π₃(x_i)]} = α₁ + β₁x_i,   log[π₂(x_i)/π₃(x_i)] = α₂ + β₂x_i,

using x_i scores {0, 62.5, 125, 250, 500} for concentration level. The ML estimates are β̂₁ = 0.0064 (SE = 0.0004) and β̂₂ = 0.0174 (SE = 0.0012). In each case, the less desirable outcome is more likely as the concentration increases. For instance, given that a fetus was live, the estimated odds that it was malformed rather than normal multiplies by exp(1.74) = 5.7 for every 100-unit increase in the concentration of diEGdiME.

TABLE 7.9 Outcomes for Pregnant Mice in Developmental Toxicity Studyᵃ

Concentration                       Response
(mg/kg per day)     Nonlive     Malformation     Normal
0 (controls)           15             1            281
62.5                   17             0            225
125                    22             7            283
250                    38            59            202
500                   144           132              9

ᵃBased on results in C. J. Price et al., Fund. Appl. Toxicol. 8:115–126 (1987). I thank Louise Ryan for showing me these data.
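The interpretation of the slopes is simple arithmetic on the fitted values, and the two fitted logits together determine all three category probabilities. A sketch (the intercepts a1, a2 below are hypothetical placeholders, not the fitted ones, which the text does not report):

```python
from math import exp

b1, b2 = 0.0064, 0.0174   # fitted slopes for the two continuation-ratio logits

# odds multipliers per 100-unit increase in concentration
assert round(exp(100 * b1), 1) == 1.9     # nonlive vs. live
assert round(exp(100 * b2), 1) == 5.7     # malformed vs. normal, given live

def probs(x, a1=-3.2, a2=-5.7):
    """Recover (pi1, pi2, pi3) from the two logits at concentration x.
    a1, a2 are hypothetical intercepts for illustration only."""
    w1 = 1 / (1 + exp(-(a1 + b1 * x)))    # P(nonlive)
    w2 = 1 / (1 + exp(-(a2 + b2 * x)))    # P(malformed | live)
    return w1, (1 - w1) * w2, (1 - w1) * (1 - w2)

assert abs(sum(probs(250)) - 1.0) < 1e-12
```

The inversion π₁ = ω₁, π₂ = (1 − ω₁)ω₂, π₃ = (1 − ω₁)(1 − ω₂) mirrors the sequential structure of the logits: a fetus first survives or not, then a surviving fetus is malformed or not.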
The likelihood-ratio fit statistics are G² = 5.78 for j = 1 and G² = 6.06 for j = 2, each based on df = 3. Their sum, G² = 11.84 (or similarly, X² = 9.76), with df = 6, summarizes the fit.

This analysis treats pregnancy outcomes for different fetuses as independent, identical observations. In fact, each pregnant mouse had a litter of fetuses, and statistical dependence may exist among different fetuses in the same litter. Different litters at a given concentration level may also have different response probabilities. Heterogeneity of various sorts among the litters (e.g., due to varying physical characteristics among different pregnant mice) would cause these probabilities to vary somewhat. Either statistical dependence or heterogeneous probabilities violates the binomial assumption and causes overdispersion. At a fixed concentration level, the number of fetuses in a litter that die may vary among pregnant mice more than if the counts were independent and identical binomial variates. The total G² shows some evidence of lack of fit (P = 0.07) but may reflect overdispersion caused by these factors rather than an inappropriate choice of response curve. To account for overdispersion, we could adjust standard errors using the quasi-likelihood approach (Section 4.7). This multiplies standard errors by √(X²/df) = √(9.76/6) = 1.28. For each logit, strong evidence remains that β_j > 0. In Chapters 12 and 13 we present other methods that account for the clustering of fetuses in litters.

7.4.5 Mean Response Models for Ordered Response

We now present a model that resembles ordinary regression for a continuous response variable. For scores v₁ ≤ v₂ ≤ ··· ≤ v_J, let

M(x) = Σ_j v_j π_j(x)

denote the mean response. The model

M(x) = α + β′x    (7.16)

assumes a linear relationship between the mean and the explanatory variables. With J = 2, it is the linear probability model (Section 4.2.1).
With J > 2, it does not structurally specify the response probabilities but merely describes the dependence of the mean on x.

Assuming independent multinomial sampling at different x_i, Bhapkar (1968), Grizzle et al. (1969), and Williams and Grizzle (1972) presented weighted least squares (WLS) fits for mean response models. The WLS approach, described in Section 15.1, applies when all explanatory variables are categorical. The ML approach for maximizing the product multinomial likelihood applies for categorical or continuous explanatory variables. Haber (1985) and Lipsitz (1992) presented algorithms for ML fitting of a family including mean response models. This is somewhat complex, since the probabilities in the multinomial likelihood are not direct functions of the parameters in (7.16). Specialized software is available (see Appendix A).

7.4.6 Job Satisfaction Example Revisited

We illustrate for Table 7.8, modeling the mean of Y = job satisfaction using income x and gender g (1 = females, 0 = males). For simplicity, we use job satisfaction scores and income scores (1, 2, 3, 4). The model has ML fit

M̂ = 2.59 + 0.181x − 0.030g,

with SE = 0.069 for income and 0.145 for gender. Given gender, the estimated increase in mean job satisfaction is about 0.2 response category for each category increase of income. Although the evidence of a positive effect is strong [e.g., Wald statistic (0.181/0.069)² = 6.8, df = 1, P = 0.009], the effect itself is weak. Job satisfaction at the highest income level is estimated to average about half a category higher than at the lowest income level, since 3(0.181) = 0.54. Similar results occur with the WLS solution, for which the estimated income effect of 0.182 has SE = 0.068 (Table A.12 shows the use of CATMOD in SAS).

The deviance for testing the model fit equals 5.1. Since means occur at eight income × gender settings and the model has three parameters, residual df = 5.
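The summaries just quoted are direct arithmetic on the fitted equation; a short sketch:

```python
def mean_sat(x, g):
    """ML fit of mean response model (7.16) for the job satisfaction data:
    M-hat = 2.59 + 0.181 x - 0.030 g, with income x and gender g scored as in text."""
    return 2.59 + 0.181 * x - 0.030 * g

# difference between highest (x = 4) and lowest (x = 1) income, given gender
assert abs((mean_sat(4, 0) - mean_sat(1, 0)) - 3 * 0.181) < 1e-12

# Wald statistic for the income effect; about 6.9 from these rounded
# estimates (6.8 in the text, which uses unrounded values)
wald = (0.181 / 0.069) ** 2

# residual df: eight income-by-gender means minus three model parameters
df_resid = 4 * 2 - 3   # = 5
```

Note the predicted means stay well inside the (1, 4) score range here, so the structural difficulty discussed in Section 7.4.7 does not arise for these data.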
The fit seems adequate.

7.4.7 Advantages and Disadvantages of Mean Response Models

Treating ordinal variables in a quantitative manner is sensible if their categorical nature reflects crude measurement of an inherently continuous variable. Mean response models have the advantage of closely resembling ordinary regression. With J = 2, in Section 4.2.1 we noted that linear probability models have a structural difficulty because of the restriction of probabilities to (0, 1). A similar difficulty occurs here, since a linear model can have predicted means outside the range of assigned scores. This happens less frequently when J is large and reasonable dispersion of responses occurs throughout the domain of interest for the explanatory variables. The notion of an underlying latent variable makes more sense for an ordinal variable than for a strictly binary response, so this difficulty has less relevance here.

Unlike logit models, mean response models do not uniquely determine cell probabilities. Thus, mean response models do not specify structural aspects such as stochastic orderings. These models do not represent the categorical response structure as fully as do models for probabilities, and conditions such as independence do not occur as special cases. However, they provide simpler descriptions than odds ratios or summaries from cumulative link models. As J increases, they also interface with ordinary regression models. For large J, they are a simple mechanism for approximating results for a regression model we would use if we could measure Y continuously.

7.5 TESTING CONDITIONAL INDEPENDENCE IN I × J × K TABLES*

In Section 6.3.2 we introduced the Cochran–Mantel–Haenszel (CMH) test of conditional independence for 2 × 2 × K tables. This section presents related tests with multicategory responses for I × J × K tables.
Likelihood-ratio tests compare the fit of a model specifying X–Y conditional independence with a model having dependence. Alternatively, generalizations of the CMH statistic are score statistics for certain models.

7.5.1 Using Multinomial Models to Test Conditional Independence

Treating Z as a nominal control factor, we discuss four cases, with (Y, X) as (ordinal, ordinal), (ordinal, nominal), (nominal, ordinal), and (nominal, nominal). For ordinal Y we use cumulative logit models, but other ordinal links yield analogous tests. As we noted in Section 6.3.2, when the X–Y association is similar in the partial tables, power benefits from basing a test statistic on a model of homogeneous association.

1. Y ordinal, X ordinal. Let {x_i} be ordered scores. The model

logit[P(Y ≤ j | X = i, Z = k)] = α_j + βx_i + β_k^Z    (7.17)

has the same linear trend for the X effect in each partial table. For it, X–Y conditional independence is H₀: β = 0. Likelihood-ratio, score, or Wald statistics for H₀ provide large-sample chi-squared tests with df = 1 that are sensitive to the trend alternative.

2. Y ordinal, X nominal. An alternative to conditional independence that treats X as a factor is

logit[P(Y ≤ j | X = i, Z = k)] = α_j + β_i + β_k^Z,

with a constraint such as β_I = 0. For this model, X–Y conditional independence is H₀: β₁ = ··· = β_I. Large-sample chi-squared tests have df = I − 1.

3. Y nominal, X ordinal. When Y is nominal, analogous tests use baseline-category logit models. The model of X–Y conditional independence is

log[P(Y = j | X = i, Z = k)/P(Y = J | X = i, Z = k)] = α_jk.    (7.18)

For ordered scores {x_i}, a test that is sensitive to the same linear trend alternatives in each partial table compares this model to

log[P(Y = j | X = i, Z = k)/P(Y = J | X = i, Z = k)] = α_jk + β_j x_i.

Conditional independence is H₀: β₁ = ··· = β_{J−1} = 0. Large-sample chi-squared tests have df = J − 1.

4. Y nominal, X nominal.
An alternative to X–Y conditional independence that treats X as a factor is

log[P(Y = j | X = i, Z = k)/P(Y = J | X = i, Z = k)] = α_jk + β_ij    (7.19)

with constraints such as β_Ij = 0 for each j. For each j, X and Z have additive effects of the form α_k + β_i. Conditional independence is H₀: β_1j = ··· = β_Ij for j = 1, ..., J − 1. Large-sample chi-squared tests have df = (I − 1)(J − 1).

Table 7.10 summarizes the four tests. They work well when the model describes at least a major component of the departure from conditional independence. This does not mean that one must test the fit of the model to use the test (see the remarks at the end of Section 6.3.2). Occasionally, the association may change dramatically across the K partial tables. When Z is ordinal, an alternative by which a log odds ratio changes linearly across levels of Z is sometimes of use. For instance, when Z = age of subject, the association between a risk factor X (e.g., level of smoking) and a response Y (e.g., severity of heart disease) may tend to increase with Z. When Z is nominal, one can test the conditional independence models against a more general alternative with separate effect parameters at each level of Z. Allowing effects to vary across levels of Z, however, results in the test df being multiplied by K, which handicaps power.

TABLE 7.10 Summary of Models for Testing Conditional Independence

Y–X       Model                                               Conditional Independence    df
Ord–Ord   logit[P(Y ≤ j)] = α_j + βx_i + β_k^Z                β = 0                       1
Ord–Nom   logit[P(Y ≤ j)] = α_j + β_i + β_k^Z                 β₁ = ··· = β_I              I − 1
Nom–Ord   log[P(Y = j)/P(Y = J)] = α_jk + β_j x_i             β₁ = ··· = β_{J−1} = 0      J − 1
Nom–Nom   log[P(Y = j)/P(Y = J)] = α_jk + β_ij                all β_ij = 0                (I − 1)(J − 1)

7.5.2 Job Satisfaction Example Revisited

We now revisit the job satisfaction data (Table 7.8). Table 7.11 summarizes the fit of several models.
The model treating income as an ordinal predictor uses scores {3, 10, 20, 35}, approximate midpoints of the categories in thousands of dollars. Each likelihood-ratio test compares a given model to the model deleting the income effect, controlling for gender. Testing conditional independence with the cumulative logit model (7.17) yields likelihood-ratio statistic 19.62 − 13.95 = 5.7 with df = 20 − 19 = 1, strong evidence of an effect. Models that treat either or both variables as nominal do not provide such strong evidence. Focusing the test on a linear trend alternative yields a smaller P-value. However, we learn more from estimating parameters than from significance tests, as in Sections 7.4.2 and 7.4.6.

TABLE 7.11 Summary of Model-Based Likelihood-Ratio Tests of Conditional Independence for Table 7.8

Satisfaction   Income         G² Fit    df    Test Statistic    df    P-value
Ordinal        Ordinal         13.95    19         5.7           1     0.017
               Nominal         10.51    17         9.1           3     0.028
               Not in model    19.62    20          —            —       —
Nominal        Ordinal         11.74    15         7.6           3     0.054
               Nominal          7.09     9        12.3           9     0.198
               Not in model    19.37    18          —            —       —

7.5.3 Generalized Cochran–Mantel–Haenszel Tests for I × J × K Tables

Birch (1965), Landis et al. (1978), and Mantel and Byar (1978) generalized the CMH statistic (Section 6.3.2). The tests treat X and Y symmetrically, so the three cases correspond to treating both as nominal, both as ordinal, or one of each. Conditional on row and column totals, each stratum has (I − 1)(J − 1) nonredundant cell counts. Let

n_k = (n_{11k}, n_{12k}, ..., n_{1,J−1,k}, ..., n_{I−1,J−1,k})′.

Let μ_k = E(n_k) under H₀: conditional independence, namely

μ_k = (n_{1+k}n_{+1k}, n_{1+k}n_{+2k}, ..., n_{I−1,+,k}n_{+,J−1,k})′/n_{++k}.

Let V_k denote the null covariance matrix of n_k, where

cov(n_{ijk}, n_{i′j′k}) = n_{i+k}(δ_{ii′}n_{++k} − n_{i′+k})n_{+jk}(δ_{jj′}n_{++k} − n_{+j′k}) / [n²_{++k}(n_{++k} − 1)]
with δ_ab = 1 when a = b and δ_ab = 0 otherwise. The most general statistic treats rows and columns as unordered. Summing over the K strata, let

n = Σ n_k,   μ = Σ μ_k,   V = Σ V_k.

The generalized CMH statistic for nominal X and Y is

CMH = (n − μ)′V⁻¹(n − μ).    (7.20)

Its large-sample chi-squared distribution has df = (I − 1)(J − 1). The df value equals that for the statistics comparing logit models (7.18) and (7.19). Both statistics are sensitive to detecting a conditional association that is similar in each stratum. For K = 1 stratum with n observations, CMH = [(n − 1)/n]X², where X² is the Pearson statistic (3.10).

Mantel (1963) introduced a generalized statistic for ordinal X and Y. Using ordered scores {u_i} and {v_j}, it is sensitive to a correlation of common sign in each stratum. Evidence of a positive trend occurs if in each stratum T_k = Σ_i Σ_j u_i v_j n_{ijk} exceeds its null expectation. Given the marginal totals in each stratum, under conditional independence

E(T_k) = (Σ_i u_i n_{i+k})(Σ_j v_j n_{+jk}) / n_{++k},

var(T_k) = [1/(n_{++k} − 1)] × [Σ_i u_i²n_{i+k} − (Σ_i u_i n_{i+k})²/n_{++k}] × [Σ_j v_j²n_{+jk} − (Σ_j v_j n_{+jk})²/n_{++k}].

The statistic [T_k − E(T_k)]/[var(T_k)]^{1/2} equals the correlation between X and Y in stratum k multiplied by √(n_{++k} − 1). To summarize across the K strata, Mantel (1963) proposed

M² = [Σ_k {Σ_i Σ_j u_i v_j n_{ijk} − E(Σ_i Σ_j u_i v_j n_{ijk})}]² / Σ_k var(Σ_i Σ_j u_i v_j n_{ijk}).    (7.21)

This has an approximate χ²₁ null distribution, the same as for testing H₀: β = 0 in ordinal model (7.17). For K = 1, this is the M² statistic (3.15). Landis et al. (1978) presented a statistic that has (7.20) and (7.21) as special cases. Their statistic also can treat X as nominal and Y as ordinal, summarizing information about how I row means compare to their null expected values, with df = I − 1 (see Note 7.7).

7.5.4 Job Satisfaction Example Revisited

Table 7.12 shows output from conducting generalized CMH tests for Table 7.8.
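Formula (7.21) is compact enough to compute directly. A minimal pure-Python sketch (not any package's implementation); for a single stratum it reduces to (n − 1)r², with r the sample score correlation, consistent with statistic (3.15):

```python
def mantel_m2(tables, u, v):
    """Generalized CMH correlation statistic M^2 of (7.21).
    tables: list of I x J count matrices (lists of lists), one per stratum;
    u, v: ordered row and column scores."""
    num = var = 0.0
    for n in tables:
        N = sum(sum(row) for row in n)
        row = [sum(r) for r in n]                  # n_{i+k}
        col = [sum(c) for c in zip(*n)]            # n_{+jk}
        T = sum(u[i] * v[j] * n[i][j]
                for i in range(len(u)) for j in range(len(v)))
        ru = sum(ui * ri for ui, ri in zip(u, row))
        cv = sum(vj * cj for vj, cj in zip(v, col))
        ET = ru * cv / N                           # E(T_k) given the margins
        su = sum(ui**2 * ri for ui, ri in zip(u, row)) - ru**2 / N
        sv = sum(vj**2 * cj for vj, cj in zip(v, col)) - cv**2 / N
        num += T - ET
        var += su * sv / (N - 1)                   # var(T_k)
    return num**2 / var

# Single-stratum check: M^2 = (n - 1) r^2, with r the score correlation.
table = [[10, 5], [5, 10]]
m2 = mantel_m2([table], [0, 1], [0, 1])
r = (10 * 10 - 5 * 5) / (15 * 15)   # score correlation for this 2 x 2 table
assert abs(m2 - 29 * r**2) < 1e-12
```

With several strata, the statistic pools the centered T_k sums before squaring, so trends of a common sign reinforce each other while opposite trends cancel.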
Statistics treating a variable as ordinal used scores {3, 10, 20, 35} for income and scores {1, 3, 4, 5} for job satisfaction. (Table A.12 shows the use of PROC FREQ in SAS, but with different scores.) The general association alternative treats X and Y as nominal and uses (7.20). It is sensitive to any association that is similar in each level of Z. The row mean scores differ alternative treats rows as nominal and columns as ordinal. It is sensitive to variation among the I row mean scores on Y, when that variation is similar in each level of Z. Finally, the nonzero correlation alternative treats X and Y as ordinal and uses (7.21). It is sensitive to a similar linear trend in each level of Z. As in the model-based analyses that Table 7.11 summarized, the evidence is stronger using the df = 1 ordinal test.

7.5.5  Related Score Tests for Multinomial Logit Models

The generalized CMH tests seem to be non-model-based alternatives to those of Section 7.5.1 using multinomial logit models. However, a close connection exists between them. For various multinomial logit models, the generalized CMH tests are score tests.

TABLE 7.12  Output for Generalized Cochran–Mantel–Haenszel Tests with Job Satisfaction and Income Data

Summary Statistics for income by satisf
Controlling for gender

Cochran-Mantel-Haenszel Statistics (Based on Table Scores)

Statistic   Alternative Hypothesis    DF    Value     Prob
    1       Nonzero Correlation        1    6.1563    0.0131
    2       Row Mean Scores Differ     3    9.0342    0.0288
    3       General Association        9   10.2001    0.3345

The generalized CMH test (7.20) that treats X and Y as nominal is the score test that the (I − 1)(J − 1) {β_ij} parameters in logit model (7.19) equal 0. The generalized CMH test using M² that treats X and Y as ordinal is the score test of β = 0 in model (7.17). For the cumulative logit model, the equivalence has the same {x_i} scores in the model as in M², and the {v_j} scores in M² are average rank scores.
For the adjacent-categories logit model analog of (7.17), the {v_j} scores in M² are any equally spaced scores. With large samples in each stratum, the generalized CMH tests give results similar to those of likelihood-ratio tests comparing the relevant models. An advantage of the model-based approach is providing estimates of effects. An advantage of the generalized CMH tests is maintaining good performance under sparse asymptotics whereby K grows as n does. Remarks in Section 6.3.4 apply here also.

7.5.6  Exact Tests of Conditional Independence

In principle, exact tests of conditional independence can use the generalized CMH statistics, generalizing Section 6.7.5 for 2 × 2 × K tables. To eliminate nuisance parameters, one conditions on row and column totals in each stratum. The distribution of counts in each stratum is the multiple hypergeometric (Section 3.5.7), and this propagates an exact conditional distribution for the statistic of interest. The P-value is the probability of those tables having the same strata margins as observed but test statistic at least as large as observed (see Birch 1965; Kim and Agresti 1997; Mehta et al. 1988).

7.6  DISCRETE-CHOICE MULTINOMIAL LOGIT MODELS*

An important application of multinomial logit models is determining effects of explanatory variables on a subject's choice from a discrete set of options: for instance, the choice of transportation system to take to work (drive, bus, subway, walk, bicycle), housing (buy house, buy condominium, rent), primary shopping location (downtown, mall, catalogs, Internet), or product brand. Models for response variables consisting of a discrete set of choices are called discrete-choice models.

7.6.1  Discrete-Choice Modeling

In many discrete-choice applications, an explanatory variable takes different values for different response choices. As predictors of choice of transportation system, cost and time to reach destination take different values for each option.
As a predictor of choice of product brand, price varies according to the option. Explanatory variables of this type are characteristics of the choices. They differ from the usual ones, for which values remain constant across the choice set. Such variables, characteristics of the chooser, include income, education, and other demographic characteristics.

McFadden (1974) proposed a discrete-choice model for explanatory variables that are characteristics of the choices. His model also permits the choice set to vary among subjects. For instance, some subjects may not have the subway as an option for travel to work. For subject i and response choice j, let x_ij = (x_{ij1}, ..., x_{ijp})' denote the values of the p explanatory variables, and let x_i = (x_{i1}, ..., x_{iJ}). Conditional on the choice set C_i for subject i, the model for the probability of selecting option j is

  π_j(x_i) = exp(β'x_ij) / Σ_{h ∈ C_i} exp(β'x_ih).        (7.22)

For each pair of choices a and b, this model has the logit form

  log[π_a(x_i)/π_b(x_i)] = β'(x_ia − x_ib).        (7.23)

Conditional on the choice being a or b, a variable's influence depends on the distance between the subject's values of that variable for those choices. If the values are the same, the model asserts that the variable has no influence on the choice between a and b. Reflecting this property, McFadden originally referred to model (7.22) as a conditional logit model.

From (7.23), the odds of choosing a over b do not depend on the other alternatives in the choice set or on their values of the explanatory variables. Luce (1959) called this property independence from irrelevant alternatives. It is unrealistic in some applications. For instance, for travel options auto and red bus, suppose that 80% choose auto, an odds of 4.0. Now suppose that the options are auto, red bus, and blue bus.
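A minimal sketch of computing the selection probabilities in (7.22) for one subject, assuming every listed option is in the subject's choice set (the function name is ours):

```python
import numpy as np

def choice_probs(beta, x_choices):
    """Selection probabilities of McFadden's model (7.22) for one subject.
    x_choices has one row of choice-specific covariates per option."""
    eta = np.asarray(x_choices, dtype=float) @ np.asarray(beta, dtype=float)
    eta -= eta.max()              # guard against overflow in exp
    w = np.exp(eta)
    return w / w.sum()
```

Consistent with (7.23), log(p[a]/p[b]) recovers β'(x_ia − x_ib), so adding the same constant to every option's covariates leaves the probabilities unchanged.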
According to (7.23), the odds are still 4.0 of choosing auto instead of red bus, but intuitively, we expect them to be about 8.0 (10% choosing each bus option). McFadden (1974) stated: "Application of the model should be limited to situations where the alternatives can plausibly be assumed to be distinct and weighed independently in the eyes of each decision-maker."

7.6.2  Discrete-Choice and Multinomial Logit Models

Model (7.22) can also incorporate explanatory variables that are characteristics of the chooser. This may seem surprising, since (7.22) has a single parameter for each explanatory variable; that is, the parameter vector is the same for each pair of choices. However, multinomial logit model (7.2) has discrete-choice form (7.22) after replacing such an explanatory variable by J artificial variables; the jth is the product of the explanatory variable with a dummy variable that equals 1 when the response choice is j. For instance, for a single explanatory variable, let x_i denote its value for subject i. For j = 1, ..., J, let δ_jk equal 1 when k = j and 0 otherwise, and let

  z_ij = (δ_j1, ..., δ_jJ, δ_j1 x_i, ..., δ_jJ x_i)'.

Let β = (α_1, ..., α_J, β_1, ..., β_J)'. Then β'z_ij = α_j + β_j x_i, and (7.2) is (with α_J = β_J = 0 for identifiability)

  π_j(x_i) = exp(α_j + β_j x_i) / [exp(α_1 + β_1 x_i) + ··· + exp(α_J + β_J x_i)]
           = exp(β'z_ij) / [exp(β'z_i1) + ··· + exp(β'z_iJ)].

This has form (7.22). With this approach, discrete-choice models can contain characteristics of the chooser and the choices. Thus, model (7.22) is very general. The ordinary multinomial logit model (7.2) using baseline-category logits is a special case.

7.6.3  Shopping Choice Example

McFadden (1974) used multinomial logit models to describe how residents of Pittsburgh, Pennsylvania chose a shopping destination. The five possible destinations were different city zones.
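The artificial-variable construction of Section 7.6.2 can be made concrete with a small helper (our own, using a 0-based choice index): with β = (α_1, ..., α_J, β_1, ..., β_J)', the inner product β'z_ij returns α_j + β_j x.

```python
import numpy as np

def z_vector(j, x, J):
    """Artificial covariates z_ij: J choice indicators followed by the
    products of those indicators with the chooser covariate x."""
    d = np.zeros(J)
    d[j] = 1.0                      # dummy for response choice j
    return np.concatenate([d, d * x])
```

For example, with J = 3, β = (1, 2, 0, 0.5, −0.5, 0)' and x = 2, the inner products β'z_i1 = 1 + 0.5·2 and β'z_i2 = 2 − 0.5·2 reproduce the linear predictors α_j + β_j x of model (7.2).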
One explanatory variable measured shopping opportunities, defined to be the retail employment in the zone as a percentage of total retail employment in the region. The other explanatory variable was price of the trip, defined from a separate analysis using auto in-vehicle time and auto operating cost. The ML estimates of model parameters were −1.06 (SE = 0.28) for price of trip and 0.84 (SE = 0.23) for shopping opportunity. From (7.23),

  log(π̂_a/π̂_b) = −1.06(P_a − P_b) + 0.84(S_a − S_b),

where P = price and S = shopping opportunity. Not surprisingly, a destination is relatively more attractive as the trip price decreases and as the shopping opportunity increases. Given values of P and S for each destination, the sample analog of (7.22) provides estimated probabilities of choosing each destination.

NOTES

Section 7.1: Nominal Responses: Baseline-Category Logit Models

7.1. Multicategory models derive from latent variable constructions that generalize those for binary responses. One approach uses the principle of selecting the category having maximum utility (Problem 6.29). Fahrmeir and Tutz (2001, Chap. 3) gave discussion and references. Baseline-category logit models were developed in Bock (1970), Haberman (1974a, pp. 352–373), Mantel (1966), Nerlove and Press (1973), and Theil (1969, 1970). Lesaffre and Albert (1989) presented regression diagnostics. Amemiya (1981), Haberman (1982), and Theil (1970) presented R-squared measures.

Section 7.2: Ordinal Responses: Cumulative Logit Models

7.2. Early uses of cumulative logit models include Bock and Jones (1968), Simon (1974), Snell (1964), Walker and Duncan (1967), and Williams and Grizzle (1972). McCullagh (1980) popularized the proportional odds case. Later articles include Agresti and Lang (1993a), Hastie and Tibshirani (1987), Peterson and Harrell (1990), and Tutz (1989). See also Section 11.3.3, Note 11.3, and Section 12.4.1. McCullagh and Nelder (1989, Sec. 5.6)
suggested using cumulative totals in forming residuals.

7.3. McCullagh (1980) noted that score tests for model (7.5) are equivalent to nonparametric tests using average ranks. For instance, for 2 × J tables assume that logit[P(Y ≤ j)] = α_j + βx, with x an indicator. The score test of H0: β = 0 is equivalent to a discrete version of the Wilcoxon–Mann–Whitney test. Whitehead (1993) gave sample size formulas for this case. The sample size n_J needed for a certain power decreases as J increases: When response categories have equal probabilities, n_J ≈ 0.75 n_2 / (1 − 1/J²). Thus, for large J, n_J ≈ 0.75 n_2, and 1 − 1/J² is a type of efficiency measure of using J categories instead of a continuous response. The efficiency loss is minor with J ≈ 5, but major in collapsing to J = 2. Edwardes (1997) innovatively adapted the test by treating the cutpoints as random. This relates to random effects models of Section 12.4.1.

Section 7.3: Ordinal Responses: Cumulative Link Models

7.4. Aitchison and Silvey (1957) and Bock and Jones (1968, Chap. 8) studied cumulative probit models. Farewell (1982) generalized the complementary log-log model to allow variation among the sample in the category boundaries for the underlying scale; this relates to random effects models (Section 12.4). Genter and Farewell (1985) introduced a generalized link function that permits comparison of fits provided by probit, complementary log-log, and other links. Yee and Wild (1996) defined generalized additive models for nominal and ordinal responses. Hamada and Wu (1990) and Nair (1987) presented alternatives to model (7.8) for detecting dispersion effects.

7.5. Some authors have considered inference relating generally to stochastic ordering; see, for instance, Dardanoni and Forcina (1998) and survey articles in a 2002 issue of J. Statist. Plann. Inference (Vol. 107, Nos. 1–2).

Section 7.4: Alternative Models for Ordinal Responses

7.6.
The ratio of a pdf to the complement of the cdf is the hazard function (Section 9.7.3). For discrete variables, this is the ratio found in continuation-ratio logits. Hence, continuation-ratio logits are sometimes interpreted as log hazards. Thompson (1977) used them in modeling discrete survival-time data. When lengths of time intervals approach 0, his model converges to the Cox proportional hazards model. Other applications of continuation-ratio logits include Läärä and Matthews (1985) and Tutz (1991).

Section 7.5: Testing Conditional Independence in I × J × K Tables

7.7. Let B_k = u_k ⊗ v_k denote a matrix of constants based on row scores u_k and column scores v_k for stratum k, where ⊗ denotes the Kronecker product. The Landis et al. (1978) generalized statistic is

  L² = [Σ_k B_k(n_k − μ_k)]' [Σ_k B_k V_k B_k']⁻¹ [Σ_k B_k(n_k − μ_k)].

When u_k = (u_1, ..., u_I) and v_k = (v_1, ..., v_J) for all strata, L² = M². When u_k is an (I − 1) × I matrix (I, −1), where I is an identity matrix of size (I − 1) and 1 denotes a column vector of I − 1 ones, and v_k is the analogous matrix of size (J − 1) × J, L² simplifies to (7.20) with df = (I − 1)(J − 1). With this u_k and v_k = (v_1, ..., v_J), L² sums over the strata information about how I row means compare to their null expected values, and it has df = I − 1. Rank score versions are analogs for ordered categorical responses of strata-adjusted Spearman correlation and Kruskal–Wallis tests. Landis et al. (1998) and Stokes et al. (2000) reviewed CMH methods. Koch et al. (1982) reviewed related methods.

Section 7.6: Discrete-Choice Multinomial Logit Models

7.8. McFadden's model relates to models proposed by Bradley and Terry (1952) (see Section 10.6) and Luce (1959). See Train (1986) for a text treatment. McFadden (1982) discussed hierarchical models having a nesting of choices in a tree-like structure. For other discussion, see Maddala (1983) and Small (1987).
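The L² statistic of Note 7.7 can be sketched in Python as follows. As an implementation convenience of ours (not the text's parameterization), the B_k matrices here act on the full vectorized I × J table, with the corresponding full-table hypergeometric covariance; with 1 × I and 1 × J score vectors this reproduces M².

```python
import numpy as np

def landis_L2(strata, row_B, col_B):
    """Sketch of the Landis et al. generalized statistic L^2 of Note 7.7,
    with B_k = row_B (x) col_B applied to the vectorized table.
    row_B is a x I, col_B is b x J; both are held fixed across strata."""
    row_B = np.asarray(row_B, dtype=float)
    col_B = np.asarray(col_B, dtype=float)
    B = np.kron(row_B, col_B)
    num = np.zeros(B.shape[0])
    mid = np.zeros((B.shape[0], B.shape[0]))
    for table in strata:
        t = np.asarray(table, dtype=float)
        n = t.sum()
        r, c = t.sum(axis=1), t.sum(axis=0)
        mu = np.outer(r, c) / n
        # multiple hypergeometric covariance of the vectorized table
        Vr = np.diag(r) * n - np.outer(r, r)
        Vc = np.diag(c) * n - np.outer(c, c)
        V = np.kron(Vr, Vc) / (n**2 * (n - 1))
        d = (t - mu).ravel()
        num += B @ d
        mid += B @ V @ B.T
    return num @ np.linalg.solve(mid, num)
```

With score vectors row_B = [u] and col_B = [v], the result agrees with Mantel's M² of (7.21), as the note asserts.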
Models that do not assume independence from irrelevant alternatives result with probit link (Amemiya 1981) or with the logit link but including random effects (Brownstone and Train 1999). Methods in Section 12.6 for random effects models are useful for fitting such models. These include Monte Carlo methods for approximating integrals that determine the likelihood function. See Stern (1997) for a review.

PROBLEMS

Applications

7.1  For Table 7.13, let Y = belief in life after death, x_1 = gender (1 = females, 0 = males), and x_2 = race (1 = whites, 0 = blacks). Table 7.14 shows the fit of the model

  log(π_j/π_3) = α_j + β_j^G x_1 + β_j^R x_2,   j = 1, 2,

with SE values in parentheses.

TABLE 7.13  Data for Problem 7.1

                         Belief in Afterlife
Race     Gender     Yes    Undecided    No
White    Female     371       49        74
         Male       250       45        71
Black    Female      64        9        15
         Male        25        5        13

Source: 1991 General Social Survey, National Opinion Research Center.

TABLE 7.14  Fit of Model for Problem 7.1

              Belief Categories for Logit
Parameter     Yes/No            Undecided/No
Intercept      0.883 (0.243)    −0.758 (0.361)
Gender         0.419 (0.171)     0.105 (0.246)
Race           0.342 (0.237)     0.271 (0.354)

a. Find the prediction equation for log(π_1/π_2).
b. Using the yes and no response categories, interpret the conditional gender effect using a 95% confidence interval for an odds ratio.
c. Show that for white females, π̂_1 = P̂(Y = yes) = 0.76.
d. Without calculating estimated probabilities, explain why the intercept estimates indicate that for black males π̂_1 > π̂_3 > π̂_2. Use the intercept and gender estimates to show that the same ordering applies for black females.
e. Without calculating estimated probabilities, explain why the estimates in the gender and race rows indicate that π̂_3 is highest for black males.
f. For this fit, G² = 0.9. Explain why residual df = 2. Deleting the gender effect, G² = 8.0. Test whether opinion is independent of gender, given race. Interpret.

7.2  A model fit predicting preference for U.S.
President (Democrat, Republican, Independent) using x = annual income (in $10,000) is log(π̂_D/π̂_I) = 3.3 − 0.2x and log(π̂_R/π̂_I) = 1.0 + 0.3x.
a. Find the prediction equation for log(π̂_R/π̂_D) and interpret the slope. For what range of x is π̂_R > π̂_D?
b. Find the prediction equation for π̂_I.
c. Plot π̂_D, π̂_I, and π̂_R for x between 0 and 10, and interpret.

7.3  Table 7.15 refers to the effect on political party identification of gender and race. Find a baseline-category logit model that fits well. Interpret estimated effects on the odds that party identification is Democrat instead of Republican.

TABLE 7.15  Data for Problem 7.3

                       Party Identification
Gender    Race     Democrat   Republican   Independent
Male      White      132         176           127
          Black       42           6            12
Female    White      172         129           130
          Black       56           4            15

TABLE 7.16  Data for Problem 7.4ᵃ

Males, length (m) and choice:
1.30 I; 1.32 F; 1.32 F; 1.40 F; 1.42 I; 1.42 F; 1.47 I; 1.47 F; 1.50 I; 1.52 I; 1.63 I; 1.65 O; 1.65 O; 1.65 I; 1.65 F; 1.68 F; 1.70 I; 1.73 O; 1.78 F; 1.78 O; 1.80 F; 1.85 F; 1.93 I; 1.93 F; 1.98 I; 2.03 F; 2.03 F; 2.31 F; 2.36 F; 2.46 F; 3.25 O; 3.28 O; 3.33 F; 3.56 F; 3.58 F; 3.66 F; 3.68 O; 3.71 F; 3.89 F

Females, length (m) and choice:
1.24 I; 1.30 I; 1.45 I; 1.45 O; 1.55 I; 1.60 I; 1.60 I; 1.65 F; 1.78 I; 1.78 O; 1.80 I; 1.88 I; 2.16 F; 2.26 F; 2.31 F; 2.36 F; 2.39 F; 2.41 F; 2.44 F; 2.56 O; 2.67 F; 2.72 I; 2.79 F; 2.84 F

ᵃ I, invertebrates; F, fish; O, other.

7.4  For 63 alligators caught in Lake George, Florida, Table 7.16 classifies primary food choice as (fish, invertebrate, other) and shows length in meters. Alligators are called subadults if length < 1.83 meters (6 feet) and adults if length > 1.83 meters.
a. Measuring length as (adult, subadult), find a model that adequately describes effects of gender and length on food choice. Interpret the effects. For adult females, find the estimated probabilities of the food-choice categories.
b.
Using only observations for which primary food choice was fish or invertebrate, find a model that adequately describes effects of gender and binary length. Compare parameter estimates and standard errors for this separate-fitting approach to those obtained with simultaneous fitting, including the other category. c. Treating length as binary loses information. Adapt the model in part Ža. to use the continuous measurements. Interpret, explaining how the estimated outcome probabilities vary with length. Find the 305 PROBLEMS estimated length at which the invertebrate and other categories are equally likely. 7.5 For recent data from a General Social Survey, the cumulative logit model Ž7.5. with Y s political ideology Žvery liberal, slightly liberal, moderate, slightly conservative, very conservative. and x s 1 for the 428 Democrats and x s 0 for the 407 Republicans has ␤ˆ s 0.975 ŽSE s 0.129. and ␣ ˆ1 s y2.469. Interpret ␤ˆ. Find the estimated probability of a very liberal response for each group. 7.6 Refer to Problem 7.5. With adjacent-categories logits, ␤ˆ s 0.435. Interpret using odds ratios for adjacent categories and for the Žvery liberal, very conservative. pair of categories. 7.7 Table 7.17 is an expanded version of a data set analyzed in Section 8.4.2. The response categories are Ž1. not injured, Ž2. injured but not transported by emergency medical services, Ž3. injured and transported by emergency medical services but not hospitalized, Ž4. injured and hospitalized but did not die, and Ž5. injured and died. Table 7.18 shows output for a model of form Ž7.5., using dummy variables for predictors. a. Why are there four intercepts? Explain how they determine the estimated response distribution for males in urban areas wearing seat belts. b. Construct a confidence interval for the effect of gender, given seat-belt use and location. Interpret. c. 
Find the estimated cumulative odds ratio between the response and seat-belt use for those in rural locations and for those in urban locations, given gender. Based on this, explain how the effect of seat-belt use varies by region, and explain how to interpret the interaction estimate, y0.1244. TABLE 7.17 Data for Problem 7.7 Response Gender Location Female Urban Rural Male Urban Rural Seat Belt 1 2 3 4 5 No Yes No Yes 7,287 11,587 3,246 6,134 175 126 73 94 720 577 710 564 91 48 159 82 10 8 31 17 No Yes No Yes 10,381 10,969 6,123 6,693 136 83 141 74 566 259 710 353 96 37 188 74 14 1 45 12 Source: Data courtesy of Cristanna Cook, Medical Care Development, Augusta, Maine. 306 LOGIT MODELS FOR MULTINOMIAL RESPONSES TABLE 7.18 Output for Problem 7.7 Parameter Intercept1 Intercept2 Intercept3 Intercept4 gender gender location location seatbelt seatbelt location*seatbelt location*seatbelt location*seatbelt location*seatbelt female male rural urban no yes rural rural urban urban no yes no yes DF 1 1 1 1 1 0 1 0 1 0 1 0 0 0 Estimate 3.3074 3.4818 5.3494 7.2563 y0.5463 0.0000 y0.6988 0.0000 y0.7602 0.0000 y0.1244 0.0000 0.0000 0.0000 Std Error 0.0351 0.0355 0.0470 0.0914 0.0272 0.0000 0.0424 0.0000 0.0393 0.0000 0.0548 0.0000 0.0000 0.0000 7.8 Refer to the cumulative logit model for Table 7.8. a. Compare the estimated income effect ␤ˆ1 s y0.510 to the estimate after collapsing the response to three categories by combining categories Ži. very satisfied and moderately satisfied, and Žii. very dissatisfied and a little satisfied. What property of the model does this reflect? b. Consider ␤ˆ1rSE using the full scale to ␤ˆ1rSE for the collapsing in part ŽaŽi... Usually, a disadvantage of collapsing multinomial responses is that the significance of effects diminishes. c. Check whether an improved model results from permitting interaction between income and gender. Interpret. 7.9 Table 7.19 refers to a clinical trial for the treatment of small-cell lung cancer. 
Patients were randomly assigned to two treatment groups. The sequential therapy administered the same combination of chemotherapeutic agents in each treatment cycle; the alternating therapy had three different combinations, alternating from cycle to cycle. TABLE 7.19 Data for Problem 7.9 Response to Chemotherapy Therapy Gender Progressive Disease No Change Partial Remission Complete Remission Sequential Male Female 28 4 45 12 29 5 26 2 Alternating Male Female 41 12 44 7 20 3 20 1 Source: W. Holtbrugge and M. Schumacher, Appl. Statist. 40: 249᎐259 Ž1991.. 307 PROBLEMS a. Fit a cumulative logit model with main effects for treatment and gender. Interpret. b. Fit the model that also contains an interaction term. Interpret. Does it fit better? Explain why it is equivalent to using the four gendertreatment combinations as levels of a single factor. 7.10 Refer to Table 7.13. Treating belief in an afterlife as ordinal, fit and interpret an ordinal model. 7.11 Table 9.7 displays associations among smoking status Ž S ., breathing test results Ž B ., and age Ž A. for workers in certain industrial plants. Treat B as a response. a. Specify a baseline-category logit model with additive factor effects of S and A. This model has deviance G 2 s 25.9. Show that df s 4, and explain why this model treats all variables as nominal. b. Treat B as ordinal and S as ordinal in terms of how recently one was a smoker, with scores  si 4 . Consider the model log P Ž B s k q 1 < S s i, A s j. P Ž B s k < S s i, A s j. s ␣ k q ␤ 1 si q ␤ 2 a j q ␤ 3 si a j with a1 s 0 and a2 s 1. Show that this assumes a linear effect of S with slope ␤ 1 for age - 40 and ␤ 1 q ␤ 3 for age 40᎐59. Using  si s i4 , ␤ˆ1 s 0.115, ␤ˆ2 s 0.311, and ␤ˆ3 s 0.663 ŽSE s 0.164.. Interpret the interaction. c. From part Žb., for age 40᎐59 show that the estimated odds of abnormal rather than borderline breathing for current smokers are 2.18 times those for former smokers and expŽ2 = 0.778. s 4.74 times those for never smokers. 
Explain why the squares of these values are estimated odds of abnormal rather than normal breathing. 7.12 The book’s Web site Ž www. stat.ufl.edur;aarcdarcda.html . has a 7 = 2 table that refers to subjects who graduated from high school in 1965. They were classified as protestors if they took part in at least one demonstration, protest march, or sit-in, and classified according to their party identification in 1982. Analyze the data, using response Ža. party identification, Žb. whether a protestor. Compare interpretations. 7.13 For Table 7.5, the cumulative probit model has fit ⌽y1 w PˆŽ Y F j .x s ␣ ˆj y 0.195 x 1 q 0.683 x 2 , with ␣ ˆ1 s y0.161, ␣ˆ2 s 0.746, and ␣ˆ3 s 1.339. Find the means and standard deviation for the two normal cdf ’s that provide the curves for PˆŽ Y ) 2. as a function of x 1 s life events index, at the two levels of x 2 s SES. Interpret effects. 308 LOGIT MODELS FOR MULTINOMIAL RESPONSES 7.14 Analyze Table 7.8 with a cumulative probit model. Compare interpretations to those in the text with other ordinal models. 7.15 Fit a model with complementary log-log link to Table 7.20, which shows family income distributions by percent for families in the northeast U.S. Interpret the difference between the income distributions. TABLE 7.20 Data for Problem 7.15 Income Ž$1000. Year 03 35 57 710 1012 1215 15 q 1960 1970 6.5 4.3 8.2 6.0 11.3 7.7 23.5 13.2 15.6 10.5 12.7 16.3 22.2 42.1 Source: Reproduced with permission from the Royal Statistical Society, London ŽMcCullagh 1980.. 7.16 Table 7.21 shows results of fitting the mean response model to Table 7.8 using scores  3, 10, 20, 354 for income and  1, 3, 4, 54 for job satisfaction. Interpret the income effect, provide a confidence interval for the difference in mean satisfaction at income levels 35 and 3, controlling for gender, and check the model fit. 
TABLE 7.21 Results for Problem 7.16 Effect Intercept gender income Source DF Chi- Square Pr > ChiSq Residual 5 6.99 0.2211 Analysis of Weighted Least Squares Estimates Parameter Estimate Std Error Chi- Square 1 2 3 3.8076 y0.0687 0.0160 0.1796 0.1419 0.0066 449.47 0.23 5.97 Pr > ChiSq <.0001 0.6283 0.0146 7.17 The book’s Web site Ž www. stat.ufl.edur;aarcdarcda.html . has a 3 = 4 = 4 table that cross-classifies dumping severity Ž Y . and operation Ž X . for four hospitals Ž H .. The four operations refer to treatments for duodenal ulcer patients and have a natural ordering. Dumping severity describes a possible undesirable side effect of the operation. Its three categories are also ordered. a. Table 7.22 shows results of generalized CMH tests. Interpret, explaining how one test can be much more significant than the others. 309 PROBLEMS TABLE 7.22 Results for Problem 7.17 Summary Statistics for dumping by operate Controlling for hospital Statistic Alternative Hypothesis DF Value Prob 1 2 3 Nonzero Correlation Row Mean Scores Differ General Association 1 3 6 6.3404 6.5901 10.5983 0.0118 0.0862 0.1016 b. Let  x i s i4 . Fit the model logit P Ž Y F j < H s h, X s i . s ␣ j q ␮ h q ␤ x i . Test conditional independence of X and Y using it, and interpret ␤ˆ. Which generalized CMH test has the same spirit as this? c. Does an improved fit result from allowing the operation effect to vary by hospital? Interpret. d. Find a mean response model that fits well. Interpret. 7.18 Table 7.23 refers to a study that randomly assigned subjects to a control or treatment group. Daily during the study, treatment subjects ate cereal containing psyllium. The study analyzed the effect on LDL cholesterol. a. Model the ending cholesterol level as a function of treatment, using the beginning level as a covariate. Interpret the treatment effect. b. Repeat part Ža., now treating the beginning level as qualitative. Compare results. c. An alternative to part Žb. 
uses a generalized CMH test relating treatment to the ending response for partial tables defined by beginning cholesterol level. Apply such a test, taking into account the response ordering, to compare treatments. Interpret, and compare to part Žb.. TABLE 7.23 Data for Problem 7.18 Ending LDL Cholesterol Level Control Beginning F 3.4 F 3.4 3.4᎐4.1 4.1᎐4.9 ) 4.9 18 16 0 0 Treatment 3.4᎐4.1 4.1᎐4.9 ) 4.9 3.4 3.4᎐4.1 4.1᎐4.9 ) 4.9 8 30 14 2 0 13 28 15 0 2 7 22 21 17 11 1 4 25 35 5 2 6 36 14 0 0 6 12 Source: Data courtesy of Sallee Anderson, Kellogg Co. 310 LOGIT MODELS FOR MULTINOMIAL RESPONSES 7.19 Analyze Table 7.5 with each type of model studied in this chapter. Write a report summarizing results and advantages and disadvantages of each modeling strategy. 7.20 The book’s Web site Ž www. stat.ufl.edur;aarcdarcda.html . has a 4 = 4 = 5 table that cross-classifies assessment of cognitive impairment, Alzheimer’s disease, and age. Analyze these data, treating Ža. Alzheimer’s disease, and Žb. cognitive impairment, as the response variable. 7.21 Analyze Table 9.5 using logit models that treat Ža. party affiliation, and Žb. ideology, as the response variable. 7.22 The book’s Web site Ž www. stat.ufl.edur;aarcdarcda.html . has a 4 = 2 = 3 = 3 table that refers to a sample of residents of Copenhagen. The variables are type of housing Ž H ., degree of contact with other residents Ž C ., feeling of influence on apartment management Ž I ., and satisfaction with housing conditions Ž S .. Treating S as the response variable, analyze these data. 7.23 Refer to Table 7.17. Analyze these data. Theory and Methods 7.24 A multivariate generalization of the exponential dispersion family Ž4.14. is f Ž yi ; ␪ i , ␾ . s exp  yiX ␪ i y b Ž ␪ i . raŽ ␾ . q c Ž yi , ␾ . 4 , where ␪ i is the natural parameter. Show that the multinomial variate yi defined in Section 7.1.5 for a single trial with parameters ␲ j , j s 1, . . . 
, J y 14 is in the Ž J y 1.-parameter exponential family, with baseline-category logits as natural parameters. 7.25 Cell counts  yi j 4 in an I = J contingency table have a multinomial Ž n; ␲ i j 4. distribution. Show that  P Ž Yi j s n i j ., i s 1, . . . , I, j s 1, . . . , J 4 can be expressed as d n n! Ł i Ł Ž ni j !. j y1 Iy1 Jy1 exp Ý Ý n i j log Ž␣i j . is1 js1 Iy1 q Ý is1 n iq log Ž ␲ i Jr␲ I J . q Jy1 Ý nqj log Ž ␲ I jr␲ I J . js1 311 PROBLEMS where ␣ i j s ␲ i j␲ I Jr␲ i J ␲ I j and d is a constant independent of the data. Find an alternative expression using local odds ratios  ␪ i j 4 , by showing that Ý Ý n i j log ␣ i j s Ý Ý si j log ␪ i j , i j i where si j s j Ý Ý n ab . aFi bFj 7.26 Suppose that we express Ž7.2. as ␲ j Ž x. s exp Ž␣j q ␤Xj x . J Ý hs1 exp Ž␣h q ␤Xh x . . Show that dividing numerator and denominator by expŽ␣J q ␤XJ x. yields new parameters ␣ j* s ␣ j y ␣ J and ␤ j* s ␤ j y ␤ J that satisfy ␣ J s 0 and ␤ J s 0. Thus, without loss of generality, ␣ J s 0 and ␤ J s 0. 7.27 When J s 3, suppose that ␲ j Ž x . s exp Ž␣j q ␤ j x . r 1 q exp Ž␣1 q ␤ 1 x . q exp Ž␣2 q ␤ 2 x . , j s 1, 2. Show that ␲ 3 Ž x . is Ža. decreasing in x if ␤ 1 ) 0 and ␤ 2 ) 0, Žb. increasing in x if ␤ 1 - 0 and ␤ 2 - 0, and Žc. nonmonotone when ␤ 1 and ␤ 2 have different signs. 7.28 Refer to the log-likelihood function for the baseline-category logit model ŽSection 7.1.4.. Denote the sufficient statistics by npj s Ý i yi j and S jk s Ý i x i k yi j , j s 1, . . . J y 1, k s 1, . . . , p. Let S s Ž S11 , . . . , S1 t , . . . S J 1 , . . . , S J t .X . Condition on Ý i yi j , j s 1, . . . , J. Under the null hypothesis that explanatory variables have no effect, show that E Ž S. s n Ž p m m . , var Ž S. s n Ž V m ⌺ . , where p s Ž p1 , . . . , p J .X ; m s Ž x 1 , . . . , x t .X , where x k s ŽÝ i x i k .rn; ⌺ has elements Ž sk2 ® ., where sk2 ® s wÝ i Ž x i k y x k .Ž x i ® y x ® .xrŽ n y 1.; V has elements ®ii s pi Ž1 y pi . 
and ®i j s y pi pj , and m denotes the Kronecker product ŽZelen 1991.. 7.29 Is the proportional odds model a special case of a baseline-category logit model? Explain why or why not. 7.30 Prove factorization Ž7.15. for the multinomial distribution. 312 LOGIT MODELS FOR MULTINOMIAL RESPONSES 7.31 Show that for the model, logitw P Ž Y F j .x s ␣ j q ␤ j x, cumulative probabilities may be misordered for some x values. 7.32 For an I = J contingency table with ordinal Y and scores  x i s i4 for x, consider the model logit P Ž Y F j < X s x i . s ␣ j q ␤ x i . Ž 7.24 . a. Show that logitw P Ž Y F j < X s x iq1 .x y logitw P Ž Y F j < X s x i .x s ␤ . Show that this difference in logits is a log cumulative odds ratio for the 2 = 2 table consisting of rows i and i q 1 and the binary response having cutpoint following category j. Thus, Ž7.24. is a uniform association model in cumulative odds ratios. b. Show that residual df s IJ y I y J. c. Show that independence of X and Y is the special case ␤ s 0. d. Using the same linear predictor but with adjacent-categories logits, show that uniform association applies to the local odds ratios Ž2.10.. e. A generalization of Ž7.24. replaces  ␤ x i 4 by unordered parameters  ␮ i 4 , hence treating X as nominal. For rows a and b, show that the log cumulative odds ratio equals ␮ a y ␮ b for all J y 1 cutpoints. 7.33 Suppose that model Ž7.24. holds for a 2 = J table with J ) 2, and let x 2 y x 1 s 1. Explain why local log odds ratios are typically smaller in absolute value than the cumulative log odds ratio ␤ . wIn fact, on p. 122 of their first edition, McCullagh and Nelder Ž1989. noted that local odds ratios  ␪ 1 j 4 relate to ␤ by log ␪ 1 j s ␤ P Ž Y F j q 1 . y P Ž Y F j y 1 . q o Ž ␤ . , j s 1, . . . , J y 1, where oŽ ␤.r␤ ™ 0 as ␤ ™ 0.x 7.34 A response scale has the categories Žstrongly agree, mildly agree, mildly disagree, strongly disagree, don’t know.. 
One way to model such a scale uses a logit model for the probability of a don't know response and uses a separate ordinal model for the ordered categories, conditional on response in one of those categories. Explain how to construct a likelihood to do this simultaneously.

7.35 For the cumulative probit model Φ⁻¹[P(Y ≤ j)] = α_j − β′x, explain why a 1-unit increase in x_i corresponds to a β_i standard deviation increase in the expected underlying latent response, controlling for the other predictors.

7.36 For cumulative link model (7.7), show that for 1 ≤ j < k ≤ J − 1, P(Y ≤ k | x) = P(Y ≤ j | x*), where x* is obtained by increasing the ith component of x by (α_k − α_j)/β_i. Interpret.

7.37 A cumulative link model for an I × J contingency table with a qualitative predictor is

  G⁻¹[P(Y ≤ j)] = α_j + μ_i,   i = 1, ..., I,   j = 1, ..., J − 1.

a. Show that the residual df = (I − 1)(J − 2).
b. When this model holds, show that independence corresponds to μ_1 = ··· = μ_I and that the test of independence has df = I − 1.
c. When this model holds, show that the rows are stochastically ordered on Y.

7.38 F_1(y) = 1 − exp(−λy) for y > 0 is a negative exponential cdf with parameter λ, and F_2(y) = 1 − exp(−μy) for y > 0. Show that the difference between the cdf's on a complementary log-log scale is identical for all y. Give implications for categorical data analysis.

7.39 Consider the model link[ω_j(x)] = α_j + β_j′x, where ω_j(x) is (7.14).
a. Explain why this model can be fitted separately for j = 1, ..., J − 1.
b. For the complementary log-log link, show that this model is equivalent to one using the same link for cumulative probabilities (Läärä and Matthews 1985).

7.40 Why is it not optimal to fit mean response models for ordinal responses using ordinary least squares, as is done for normal regression?

7.41 When X and Y are ordinal, explain how to test conditional independence by allowing a different trend in each partial table.
[Hint: Generalize model (7.17) by replacing β by β_k.]

7.42 A cafe has four entrées: chicken, beef, fish, vegetarian. Specify a model of form (7.22) for the selection of an entrée using x = gender (1 = female, 0 = male) and u = cost of entrée, which is a characteristic of the choices. Interpret the model parameters.

Categorical Data Analysis, Second Edition. Alan Agresti. Copyright © 2002 John Wiley & Sons, Inc. ISBN: 0-471-36093-7

CHAPTER 8

Loglinear Models for Contingency Tables

In Section 4.3 we introduced loglinear models as generalized linear models (GLMs) using the log link function with a Poisson response. A common use is modeling cell counts in contingency tables. The models specify how the expected count depends on levels of the categorical variables for that cell as well as associations and interactions among those variables. The purpose of loglinear modeling is the analysis of association and interaction patterns.

In Section 8.1 we introduce loglinear models for two-way contingency tables. In Sections 8.2 and 8.3 we extend them to three-way tables, and in Section 8.4 we discuss models for multiway tables. Loglinear models are of use primarily when at least two variables are response variables. With a single categorical response, it is simpler and more natural to use logit models. When one variable is treated as a response and the others as explanatory variables, logit models for that response variable are equivalent to certain loglinear models. Section 8.5 covers this connection. In Sections 8.6 and 8.7 we discuss ML loglinear model fitting.

8.1 LOGLINEAR MODELS FOR TWO-WAY TABLES

Consider an I × J contingency table that cross-classifies a multinomial sample of n subjects on two categorical responses. The cell probabilities are {π_ij} and the expected frequencies are {μ_ij = nπ_ij}.
Loglinear model formulas use {μ_ij} rather than {π_ij}, so they also apply with Poisson sampling for N = IJ independent cell counts {Y_ij} having {μ_ij = E(Y_ij)}. In either case we denote the observed cell counts by {n_ij}.

8.1.1 Independence Model

Under statistical independence, in Section 4.3.6 we noted that the {μ_ij} have the structure μ_ij = μα_i β_j. For multinomial sampling, for instance, μ_ij = nπ_{i+}π_{+j}. Denote the row variable by X and the column variable by Y. The formula expressing independence is multiplicative. Thus, log μ_ij has the additive form

  log μ_ij = λ + λ_i^X + λ_j^Y   (8.1)

for a row effect λ_i^X and a column effect λ_j^Y. This is the loglinear model of independence. As usual, identifiability requires constraints such as λ_I^X = λ_J^Y = 0. The ML fitted values are {μ̂_ij = n_{i+}n_{+j}/n}, the estimated expected frequencies for chi-squared tests of independence. The tests using X² and G² (Section 3.2.1) are also goodness-of-fit tests of this loglinear model.

8.1.2 Interpretation of Parameters

Loglinear models for contingency tables are GLMs that treat the N cell counts as independent observations of a Poisson random component. Loglinear GLMs identify the data as the N cell counts rather than the individual classifications of the n subjects. The expected cell counts link to the explanatory terms using the log link. As (8.1) illustrates, the model does not distinguish between response and explanatory variables among the cross-classified variables. It treats both jointly as responses, modeling {μ_ij} for combinations of their levels.

To interpret parameters, however, it is helpful to treat the variables asymmetrically. We illustrate with the independence model for I × 2 tables. In row i, the logit equals

  logit[P(Y = 1 | X = i)] = log [P(Y = 1 | X = i) / P(Y = 2 | X = i)] = log(μ_i1/μ_i2)
    = log μ_i1 − log μ_i2
    = (λ + λ_i^X + λ_1^Y) − (λ + λ_i^X + λ_2^Y)
    = λ_1^Y − λ_2^Y.
The final term does not depend on i; that is, logit[P(Y = 1 | X = i)] is identical at each level of X. Thus, independence implies a model of the form logit[P(Y = 1 | X = i)] = α. In each row, the odds of response in column 1 equal exp(α) = exp(λ_1^Y − λ_2^Y). An analogous property holds when J > 2. Differences between two parameters for a given variable relate to the log odds of making one response, relative to the other, on that variable. Of course, with a single response variable, logit models apply directly and loglinear models are unneeded.

8.1.3 Saturated Model

Statistically dependent variables satisfy a more complex loglinear model,

  log μ_ij = λ + λ_i^X + λ_j^Y + λ_ij^XY.   (8.2)

The {λ_ij^XY} are association terms that reflect deviations from independence. The right-hand side of (8.2) resembles the formula for cell means in two-way ANOVA allowing interaction. The {λ_ij^XY} represent interactions between X and Y, whereby the effect of one variable on μ_ij depends on the level of the other. The independence model (8.1) results when all λ_ij^XY = 0.

With constraints λ_I^X = λ_J^Y = 0 in (8.1) and (8.2), {λ_i^X} and {λ_j^Y} are, equivalently, coefficients of dummy variables for the first (I − 1) categories of X and the first (J − 1) categories of Y. Then λ_ij^XY is the coefficient of the product of the dummy variables for λ_i^X and λ_j^Y. Since there are (I − 1)(J − 1) such cross products, λ_Ij^XY = λ_iJ^XY = 0, and only (I − 1)(J − 1) of these parameters are nonredundant. Tests of independence analyze whether these (I − 1)(J − 1) parameters equal zero, so they have residual df = (I − 1)(J − 1). The number of parameters in model (8.2) equals 1 + (I − 1) + (J − 1) + (I − 1)(J − 1) = IJ, the number of cells. Hence, this model describes perfectly any {μ_ij > 0} (see Problem 8.16). It is the most general model for two-way contingency tables, the saturated model.
For it, direct relationships exist between log odds ratios and {λ_ij^XY}. For instance, for 2 × 2 tables,

  log θ = log (μ_11 μ_22 / μ_12 μ_21) = log μ_11 + log μ_22 − log μ_12 − log μ_21
    = (λ + λ_1^X + λ_1^Y + λ_11^XY) + (λ + λ_2^X + λ_2^Y + λ_22^XY)
      − (λ + λ_1^X + λ_2^Y + λ_12^XY) − (λ + λ_2^X + λ_1^Y + λ_21^XY)
    = λ_11^XY + λ_22^XY − λ_12^XY − λ_21^XY.   (8.3)

Thus, {λ_ij^XY} determine the association. In practice, unsaturated models are preferable, since their fit smooths the sample data and has simpler interpretations. For tables with at least three variables, unsaturated models can include association terms. Then, loglinear models are more commonly used to describe associations (through two-factor terms) than to describe odds (through single-factor terms).

Like others in this book, model (8.2) is hierarchical. This means that the model includes all lower-order terms composed from variables contained in a higher-order model term. When the model contains λ_ij^XY, it also contains λ_i^X and λ_j^Y. A reason for including lower-order terms is that, otherwise, the statistical significance and the interpretation of a higher-order term depend on how variables are coded. This is undesirable; with hierarchical models the same results occur no matter how variables are coded.

An example of a nonhierarchical model is log μ_ij = λ + λ_i^X + λ_ij^XY. This model permits association but forces unnatural behavior of expected frequencies, with the pattern depending on the constraints used for parameters. For instance, with constraints whereby parameters are zero at the last level, log μ_Ij = λ in every column. Nonhierarchical models are rarely sensible in practice. Using them is analogous to using ANOVA or regression models with interaction terms but without the corresponding main effects.

When a model has two-factor terms, interpretations focus on them rather than on the single-factor terms.
By analogy with two-way ANOVA with two-factor interaction, it can be misleading to report main effects. The estimates of the main-effect terms depend on the coding scheme used for the higher-order effects, and the interpretation also depends on that scheme (see Problem 8.16). Normally, we restrict our attention to the highest-order terms for a variable, as we illustrate in Section 8.2.

8.1.4 Alternative Parameter Constraints

As with the independence model, the parameter constraints for the saturated model are arbitrary. Instead of setting all λ_Ij^XY = λ_iJ^XY = 0, one could set Σ_i λ_ij^XY = Σ_j λ_ij^XY = 0 for all i and j. Different software uses different constraints. What is unique are contrasts such as λ_11^XY + λ_22^XY − λ_12^XY − λ_21^XY in (8.3) that determine odds ratios.

For instance, suppose that a log odds ratio equals 2.0 in a 2 × 2 table. With the first set of constraints, 2.0 is the coefficient of the product of a dummy variable indicating the first category of X and a dummy variable indicating the first category of Y. With it, λ_11^XY = 2.0 and λ_12^XY = λ_21^XY = λ_22^XY = 0. For sum-to-zero constraints, λ_11^XY = λ_22^XY = 0.5 and λ_12^XY = λ_21^XY = −0.5. For either set, the log odds ratio (8.3) equals 2.0. For a set of parameters, an advantage of setting a baseline parameter equal to 0 instead of the sum equal to 0 is that some parameters in a set can have infinite estimates.

8.1.5 Multinomial Models for Cell Probabilities

Conditional on the sum n of the cell counts, Poisson loglinear models for {μ_ij} become multinomial models for the cell probabilities {π_ij = μ_ij/(ΣΣ μ_ab)}. To illustrate, for the saturated model,

  π_ij = exp(λ + λ_i^X + λ_j^Y + λ_ij^XY) / Σ_a Σ_b exp(λ + λ_a^X + λ_b^Y + λ_ab^XY).   (8.4)

This representation implies the usual constraints for probabilities, {π_ij ≥ 0} and Σ_i Σ_j π_ij = 1. The λ intercept parameter cancels in the multinomial model (8.4).
This parameter relates purely to the total sample size, which is random in the Poisson model but not in the multinomial model.

8.2 LOGLINEAR MODELS FOR INDEPENDENCE AND INTERACTION IN THREE-WAY TABLES

In Section 2.3 we introduced three-way contingency tables and related structure such as conditional independence and homogeneous association. Loglinear models for three-way tables describe their independence and association patterns.

8.2.1 Types of Independence

A three-way I × J × K cross-classification of response variables X, Y, and Z has several potential types of independence. We assume a multinomial distribution with cell probabilities {π_ijk}, where Σ_i Σ_j Σ_k π_ijk = 1.0. The models also apply to Poisson sampling with means {μ_ijk}. The three variables are mutually independent when

  π_ijk = π_{i++} π_{+j+} π_{++k}   for all i, j, and k.   (8.5)

For expected frequencies {μ_ijk}, mutual independence has the loglinear form

  log μ_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z.   (8.6)

Variable Y is jointly independent of X and Z when

  π_ijk = π_{i+k} π_{+j+}   for all i, j, and k.   (8.7)

This is ordinary two-way independence between Y and a variable composed of the IK combinations of levels of X and Z. The loglinear model is

  log μ_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_ik^XZ.   (8.8)

Similarly, X could be jointly independent of Y and Z, or Z could be jointly independent of X and Y. Mutual independence (8.5) implies joint independence of any one variable from the others.

From Section 2.3, X and Y are conditionally independent, given Z, when independence holds for each partial table within which Z is fixed. That is, if π_{ij|k} = P(X = i, Y = j | Z = k), then π_{ij|k} = π_{i+|k} π_{+j|k} for all i, j, and k. For joint probabilities over the entire table, equivalently,

  π_ijk = π_{i+k} π_{+jk} / π_{++k}   (8.9)

for all i, j, and k. Conditional independence of X and Y, given Z, is the loglinear model

  log μ_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_ik^XZ + λ_jk^YZ.   (8.10)
This is a weaker condition than mutual or joint independence. Mutual independence implies that Y is jointly independent of X and Z, which itself implies that X and Y are conditionally independent. Table 8.1 summarizes these three types of independence.

In Section 2.3.2 we showed that partial associations can be quite different from marginal associations. For instance, conditional independence does not imply marginal independence. Conditional independence and marginal independence both hold when one of the stronger types of independence studied above applies. Figure 8.1 summarizes relationships among the four types of independence.

TABLE 8.1 Summary of Loglinear Independence Models

Model    Probabilistic Form for π_ijk    Association Terms in Loglinear Model    Interpretation
(8.6)    π_{i++} π_{+j+} π_{++k}         None                   Variables mutually independent
(8.8)    π_{i+k} π_{+j+}                 λ_ik^XZ                Y independent of X and Z
(8.10)   π_{i+k} π_{+jk} / π_{++k}       λ_ik^XZ + λ_jk^YZ      X and Y independent, given Z

[Figure 8.1: Relationships among types of independence.]

8.2.2 Homogeneous Association and Three-Factor Interaction

Loglinear models (8.6), (8.8), and (8.10) have three, two, and one pair of conditionally independent variables, respectively. In the latter two models, the doubly subscripted terms (such as λ_ij^XY) pertain to conditionally dependent variables. A model that permits all three pairs to be conditionally dependent is

  log μ_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_ij^XY + λ_ik^XZ + λ_jk^YZ.   (8.11)

From exponentiating both sides, the cell probabilities have the form π_ijk = ψ_ij φ_jk ω_ik. No closed-form expression exists for the three components in terms of margins of {π_ijk} except in certain special cases (see Note 9.2). For this model, in the next section we show that conditional odds ratios between any two variables are identical at each category of the third variable. That is, each pair has homogeneous association (Section 2.3.5). Model (8.11)
is called the loglinear model of homogeneous association or of no three-factor interaction.

The general loglinear model for a three-way table is

  log μ_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_ij^XY + λ_ik^XZ + λ_jk^YZ + λ_ijk^XYZ.   (8.12)

With dummy variables, λ_ijk^XYZ is the coefficient of the product of the ith dummy variable for X, the jth dummy variable for Y, and the kth dummy variable for Z. The total number of nonredundant parameters is

  1 + (I − 1) + (J − 1) + (K − 1) + (I − 1)(J − 1) + (I − 1)(K − 1) + (J − 1)(K − 1) + (I − 1)(J − 1)(K − 1) = IJK,

the total number of cell counts. This model has as many parameters as observations and is saturated. It describes all possible positive {μ_ijk}. Each pair of variables may be conditionally dependent, and an odds ratio for any pair may vary across categories of the third variable.

Setting certain parameters equal to zero in (8.12) yields the models introduced previously. Table 8.2 lists some of these models. To ease referring to models, Table 8.2 assigns to each model a symbol that lists the highest-order term(s) for each variable.

TABLE 8.2 Loglinear Models for Three-Dimensional Tables

Loglinear Model                                                                          Symbol
log μ_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z                                                    (X, Y, Z)
log μ_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_ij^XY                                          (XY, Z)
log μ_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_ij^XY + λ_jk^YZ                                (XY, YZ)
log μ_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_ij^XY + λ_jk^YZ + λ_ik^XZ                      (XY, YZ, XZ)
log μ_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_ij^XY + λ_jk^YZ + λ_ik^XZ + λ_ijk^XYZ          (XYZ)

For instance, the model (8.10) of conditional independence between X and Y has symbol (XZ, YZ), since its highest-order terms are λ_ik^XZ and λ_jk^YZ. In the notation we used for logit models in Sections 6.1 and 7.1.2, this stands for (X*Z + Y*Z), which is itself shorthand for the notation (X + Y + Z + X × Z + Y × Z) that has the main effects as well as the interactions.
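For the models in Table 8.2 other than (XY, YZ, XZ), the ML fitted values turn out to be the sample analogs of the probabilistic forms in Table 8.1; for example, the conditional-independence model (XZ, YZ) has μ̂_ijk = n_{i+k} n_{+jk} / n_{++k}. The sketch below checks this numerically on a hypothetical 2 × 2 × 2 table (the data and the use of NumPy are illustrative, not from the text):

```python
import numpy as np

# Hypothetical 2 x 2 x 2 table of counts n[i, j, k] (illustrative only)
n = np.array([[[20., 10.], [15., 25.]],
              [[12., 30.], [ 8., 40.]]])

n_ipk = n.sum(axis=1)        # n_{i+k}
n_pjk = n.sum(axis=0)        # n_{+jk}
n_ppk = n.sum(axis=(0, 1))   # n_{++k}

# Direct ML fitted values for the conditional-independence model (XZ, YZ):
# mu_{ijk} = n_{i+k} n_{+jk} / n_{++k}
mu = n_ipk[:, None, :] * n_pjk[None, :, :] / n_ppk[None, None, :]

# The fit reproduces the XZ and YZ margins exactly ...
assert np.allclose(mu.sum(axis=1), n_ipk) and np.allclose(mu.sum(axis=0), n_pjk)

# ... and within each level k of Z, the fitted XY odds ratio is exactly 1
for k in range(2):
    theta = mu[0, 0, k] * mu[1, 1, k] / (mu[0, 1, k] * mu[1, 0, k])
    print(round(theta, 6))   # -> 1.0
```

For the homogeneous-association model (XY, YZ, XZ), no such closed form exists and iterative fitting (Section 8.7) is required.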
8.2.3 Interpreting Model Parameters

Interpretations of loglinear model parameters use their highest-order terms. For instance, interpretations for model (8.11) use the two-factor terms to describe conditional odds ratios. At a fixed level k of Z, the conditional association between X and Y uses (I − 1)(J − 1) odds ratios, such as the local odds ratios

  θ_ij(k) = (π_ijk π_{i+1,j+1,k}) / (π_{i,j+1,k} π_{i+1,j,k}),   1 ≤ i ≤ I − 1,   1 ≤ j ≤ J − 1.   (8.13)

Similarly, (I − 1)(K − 1) odds ratios {θ_i(j)k} describe the XZ conditional association, and (J − 1)(K − 1) odds ratios {θ_(i)jk} describe the YZ conditional association. Loglinear models have characterizations using constraints on conditional odds ratios. For instance, conditional independence of X and Y is equivalent to {θ_ij(k) = 1, i = 1, ..., I − 1, j = 1, ..., J − 1, k = 1, ..., K}.

The two-factor parameters relate directly to the conditional odds ratios. To illustrate, substituting (8.11) for model (XY, XZ, YZ) into log θ_ij(k) yields

  log θ_ij(k) = log [(μ_ijk μ_{i+1,j+1,k}) / (μ_{i+1,j,k} μ_{i,j+1,k})]
    = λ_ij^XY + λ_{i+1,j+1}^XY − λ_{i,j+1}^XY − λ_{i+1,j}^XY.   (8.14)

Since the right-hand side is the same for all k, an absence of three-factor interaction is equivalent to θ_ij(1) = θ_ij(2) = ··· = θ_ij(K) for all i and j. The same argument for the other conditional odds ratios shows that model (XY, XZ, YZ) is also equivalent to

  θ_i(1)k = θ_i(2)k = ··· = θ_i(J)k   for all i and k,

and to

  θ_(1)jk = θ_(2)jk = ··· = θ_(I)jk   for all j and k.

Any model not having the three-factor interaction term has a homogeneous association for each pair of variables.

When X and Y have two categories, only one nonredundant λ_ij^XY parameter occurs. Thus, expression (8.14) simplifies depending on the constraints.
By the same argument as in Section 8.1.3 for 2 × 2 tables, the conditional log odds ratio simplifies to λ_11^XY with dummy-variable constraints setting parameters at the second level of X or Y equal to 0.

The λ_ijk^XYZ term in the general model (8.12) refers to three-factor interaction. It describes how the odds ratio between two variables changes across categories of the third. We illustrate for 2 × 2 × 2 tables. By direct substitution of the general model formula,

  log [θ_11(1)/θ_11(2)] = log { [(μ_111 μ_221)/(μ_121 μ_211)] / [(μ_112 μ_222)/(μ_122 μ_212)] }
    = (λ_111^XYZ + λ_221^XYZ − λ_121^XYZ − λ_211^XYZ) − (λ_112^XYZ + λ_222^XYZ − λ_122^XYZ − λ_212^XYZ).

Only one parameter is nonredundant. For constraints setting the second-category parameters equal to 0, this log ratio of odds ratios equals λ_111^XYZ. When λ_111^XYZ = 0, θ_11(1) = θ_11(2), giving homogeneous XY association.

8.2.4 Alcohol, Cigarette, and Marijuana Use Example

Table 8.3 refers to a 1992 survey by the Wright State University School of Medicine and the United Health Services in Dayton, Ohio. The survey asked 2276 students in their final year of high school in a nonurban area near Dayton whether they had ever used alcohol, cigarettes, or marijuana. Denote the variables in this 2 × 2 × 2 table by A for alcohol use, C for cigarette use, and M for marijuana use.

TABLE 8.3 Alcohol, Cigarette, and Marijuana Use for High School Seniors

                                Marijuana Use
Alcohol Use    Cigarette Use    Yes     No
Yes            Yes              911    538
               No                44    456
No             Yes                3     43
               No                 2    279

Source: Data courtesy of Harry Khamis, Wright State University.

Section 8.7 covers the fitting of loglinear models. For now, we emphasize interpretation. Table 8.4 shows fitted values for several loglinear models. The fit for model (AC, AM, CM) is close to the observed data, which are the fitted values for the saturated model (ACM). The other models fit poorly.

TABLE 8.4 Fitted Values for Loglinear Models Applied to Table 8.3ᵃ

Alcohol  Cigarette  Marijuana                       Loglinear Model
Use      Use        Use         (A, C, M)   (AC, M)   (AM, CM)   (AC, AM, CM)   (ACM)
Yes      Yes        Yes           540.0      611.2     909.24       910.4        911
                    No            740.2      837.8     438.84       538.6        538
         No         Yes           282.1      210.9      45.76        44.6         44
                    No            386.7      289.1     555.16       455.4        456
No       Yes        Yes            90.6       19.4       4.76         3.6          3
                    No            124.2       26.6     142.16        42.4         43
         No         Yes            47.3      118.5       0.24         1.4          2
                    No             64.9      162.5     179.84       279.6        279

ᵃ A, alcohol use; C, cigarette use; M, marijuana use.

Table 8.5 illustrates model association patterns by presenting estimated conditional and marginal odds ratios. For example, the entry 1.0 for the AC conditional association for the model (AM, CM) of AC conditional independence is the common value of the AC fitted odds ratios at the two levels of M,

  1.0 = (909.24 × 0.24)/(45.76 × 4.76) = (438.84 × 179.84)/(555.16 × 142.16).

The entry 2.7 for the AC marginal association for this model is the odds ratio for the marginal AC fitted table. The odds ratios for the observed data are those reported for the saturated model (ACM).

TABLE 8.5 Estimated Odds Ratios for Loglinear Models in Table 8.4

                    Conditional Association     Marginal Association
Model                AC      AM      CM          AC      AM      CM
(A, C, M)            1.0     1.0     1.0         1.0     1.0     1.0
(AC, M)             17.7     1.0     1.0        17.7     1.0     1.0
(AM, CM)             1.0    61.9    25.1         2.7    61.9    25.1
(AC, AM, CM)         7.8    19.8    17.3        17.7    61.9    25.1
(ACM) level 1       13.8    24.3    17.5        17.7    61.9    25.1
(ACM) level 2        7.7    13.5     9.7

Table 8.5 shows that estimated conditional odds ratios equal 1.0 for each pairwise term not appearing in a model, such as the AC association in model (AM, CM). For that model, the estimated marginal AC odds ratio differs from 1.0, since conditional independence does not imply marginal independence. Some models have conditional associations that are necessarily the same as the corresponding marginal associations. In Section 9.1.2 we present a condition guaranteeing this. Model (AC, AM, CM)
permits all pairwise associations but maintains homogeneous odds ratios between two variables at each level of the third. The AC fitted conditional odds ratios for this model equal 7.8. One can calculate this odds ratio using the model's fitted values at either level of M, or [from (8.14)] using exp(λ̂_11^AC + λ̂_22^AC − λ̂_12^AC − λ̂_21^AC).

Table 8.5 shows that estimated odds ratios are very dependent on the model. This highlights the importance of good model selection. An estimate from this table is informative only to the extent that its model fits well. In the next section we discuss goodness of fit.

8.3 INFERENCE FOR LOGLINEAR MODELS

A good-fitting loglinear model provides a basis for describing and making inferences about associations among categorical responses. Standard methods apply for checking fit and making inference about model parameters.

8.3.1 Chi-Squared Goodness-of-Fit Tests

As usual, X² and G² test whether a model holds by comparing cell fitted values to observed counts. Here df equals the number of cell counts minus the number of model parameters. For the student survey (Table 8.3), Table 8.6 shows results of testing fit for several loglinear models. Models that lack any association term fit poorly. The model (AC, AM, CM) that has all pairwise associations fits well (P = 0.54). It is suggested by other criteria also, such as minimizing

  AIC = −2(maximized log likelihood − number of parameters in model)

or, equivalently, minimizing [G² − 2(df)].

TABLE 8.6 Goodness-of-Fit Tests for Loglinear Models in Table 8.4

Model               G²        X²       df    P-valueᵃ
(A, C, M)         1286.0    1411.4      4    < 0.001
(A, CM)            534.2     505.6      3    < 0.001
(C, AM)            939.6     824.2      3    < 0.001
(M, AC)            843.8     704.9      3    < 0.001
(AC, AM)           497.4     443.8      2    < 0.001
(AC, CM)            92.0      80.8      2    < 0.001
(AM, CM)           187.8     177.6      2    < 0.001
(AC, AM, CM)         0.4       0.4      1    0.54
(ACM)                0.0       0.0      0    —

ᵃ P-value for G² statistic.

8.3.2 Inference about Conditional Associations

Tests about conditional associations compare loglinear models. The likelihood-ratio statistic −2(L_0 − L_1) is identical to the difference

  G²(M_0 | M_1) = G²(M_0) − G²(M_1)

between the deviances for the model without that term and the model with it. For model (XY, XZ, YZ), consider the hypothesis of XY conditional independence. This is H_0: λ_ij^XY = 0 for the (I − 1)(J − 1) XY association parameters. The test statistic is G²(XZ, YZ) − G²(XY, XZ, YZ), with df = (I − 1)(J − 1). This has the same purpose as the generalized CMH and model-based tests for nominal variables presented in Section 7.5.

For instance, the test of conditional independence between alcohol use and cigarette smoking compares model (AM, CM) with the alternative (AC, AM, CM). The test statistic is

  G²[(AM, CM) | (AC, AM, CM)] = 187.8 − 0.4 = 187.4,

with df = 2 − 1 = 1 (P < 0.001). The statistics comparing (AC, CM) and (AC, AM) with (AC, AM, CM) also provide strong evidence of AM and CM conditional associations. Further analyses of Table 8.3 use model (AC, AM, CM).

With large sample sizes, statistically significant effects can be weak and unimportant. A more relevant concern is whether the associations are strong enough to be important. Confidence intervals are more useful than tests for assessing this. Table 8.7 shows output from fitting model (AC, AM, CM)
with parameters in the last row and in the last column equal to zero, such as by using (1, 0) dummy variables for each classification.

TABLE 8.7 Output for Fitting Loglinear Model to Table 8.3

Criteria For Assessing Goodness Of Fit
Criterion             DF    Value     Value/DF
Deviance               1    0.3740      0.3740
Pearson Chi-Square     1    0.4011      0.4011

Parameter     DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept      1      5.6334        0.0597            8903.96          <.0001
a              1      0.4877        0.0758              41.44          <.0001
c              1     -1.8867        0.1627             134.47          <.0001
m              1     -5.3090        0.4752             124.82          <.0001
a*m            1      2.9860        0.4647              41.29          <.0001
a*c            1      2.0545        0.1741             139.32          <.0001
c*m            1      2.8479        0.1638             302.14          <.0001

LR Statistics
Source    DF    Chi-Square    Pr > ChiSq
a*m        1       91.64        <.0001
a*c        1      187.38        <.0001
c*m        1      497.00        <.0001

Consider the conditional AC odds ratio, assuming model (AC, AM, CM). Table 8.7 reports λ̂_11^AC = 2.054, with SE = 0.174. For these constraints, this is the estimated conditional log odds ratio. A 95% Wald confidence interval for the true conditional AC odds ratio is exp[2.054 ± 1.96(0.174)], or (5.5, 11.0). Strong positive association exists between cigarette use and alcohol use, both for users and nonusers of marijuana.

For model (AC, AM, CM), the 95% Wald confidence intervals are (8.0, 49.2) for the AM conditional odds ratio and (12.5, 23.8) for the CM conditional odds ratio. The intervals are wide, but these associations also are strong. Table 8.5 shows that estimated marginal associations are even stronger. Controlling for the outcome on one response moderates the association somewhat between the other two.

The analyses in this section pertain to associations. A different analysis pertains to comparing single-variable marginal distributions, for instance to determine whether students used cigarettes more than alcohol or marijuana. That type of analysis is presented in Section 10.1.
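The analysis of model (AC, AM, CM) is easy to reproduce numerically. The sketch below computes its ML fit to Table 8.3 by iterative proportional fitting, an algorithm covered in Section 8.7 (the NumPy implementation is illustrative, not from the text), and then recomputes the Wald interval from the estimate and standard error reported in Table 8.7:

```python
import numpy as np

# Table 8.3 counts, indexed [A, C, M] with 0 = yes, 1 = no
n = np.array([[[911., 538.], [44., 456.]],
              [[  3.,  43.], [ 2., 279.]]])

# Iterative proportional fitting for model (AC, AM, CM): repeatedly scale
# the fitted values so each two-way margin matches the observed margin
mu = np.ones_like(n)
for _ in range(200):
    mu *= n.sum(axis=2, keepdims=True) / mu.sum(axis=2, keepdims=True)  # AC
    mu *= n.sum(axis=1, keepdims=True) / mu.sum(axis=1, keepdims=True)  # AM
    mu *= n.sum(axis=0, keepdims=True) / mu.sum(axis=0, keepdims=True)  # CM

print(np.round(mu, 1))  # matches the (AC, AM, CM) column of Table 8.4

# The AC conditional odds ratio is the same at each level of M
theta_ac = mu[0, 0, 0] * mu[1, 1, 0] / (mu[0, 1, 0] * mu[1, 0, 0])
print(round(theta_ac, 1))  # -> 7.8

# 95% Wald CI from the reported estimate 2.054 (SE 0.174) in Table 8.7
lo, hi = np.exp(2.054 - 1.96 * 0.174), np.exp(2.054 + 1.96 * 0.174)
print(round(lo, 1), round(hi, 1))  # -> 5.5 11.0
```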
8.4 LOGLINEAR MODELS FOR HIGHER DIMENSIONS

Loglinear models for three-way tables are more complex than for two-way tables, because of the variety of potential association terms. Loglinear models for three-way tables extend readily, however, to multiway tables. As the number of dimensions increases, some complications arise. One is the increase in the number of possible association and interaction terms, making model selection more difficult. Another is the increase in the number of cells. In Section 9.8 we show that this can cause difficulties with the existence of estimates and the appropriateness of asymptotic theory.

8.4.1 Four-Way Contingency Tables

We illustrate models for higher dimensions using a four-way table with variables W, X, Y, and Z. Interpretations are simplest when the model has no three-factor interaction terms. Such models are special cases of

  log μ_hijk = λ + λ_h^W + λ_i^X + λ_j^Y + λ_k^Z + λ_hi^WX + λ_hj^WY + λ_hk^WZ + λ_ij^XY + λ_ik^XZ + λ_jk^YZ,

denoted by (WX, WY, WZ, XY, XZ, YZ). Each pair of variables is conditionally dependent, with the same odds ratios at each combination of categories of the other two variables. An absence of a two-factor term implies conditional independence, given the other two variables.

A variety of models exhibit three-factor interaction. A model could contain any of the WXY, WXZ, WYZ, or XYZ terms. For model (WXY, WZ, XZ, YZ), each pair of variables is conditionally dependent, but at each level of Z the WX association, the WY association, and the XY association may vary across categories of the remaining variable. The conditional association between Z and another variable is homogeneous. The saturated model contains all the three-factor terms plus a four-factor interaction term.

8.4.2 Automobile Accident Example

Table 8.8 summarizes observations of 68,694 passengers in autos and light trucks involved in accidents in the state of Maine in 1991.
The table classifies passengers by gender (G), location of accident (L), seat-belt use (S), and injury (I). Table 8.8 also reports the sample proportion of passengers who were injured. For each GL combination, the proportion of injuries was about halved for passengers wearing seat belts.

TABLE 8.8 Loglinear Models for Injury (I), Seat-Belt Use (S), Gender (G), and Location (L)ᵃ

                                 Observed           (GI, GL, GS, IL, IS, LS)      (GLS, GI, IL, IS)       Sample
                                  Injury                    Injury                     Injury            Proportion
Gender   Location   Seat Belt   No      Yes          No          Yes            No          Yes          Injured
Female   Urban      No         7,287     996       7,166.4       993.0        7,273.2     1,009.8         0.12
                    Yes       11,587     759      11,748.3       721.3       11,632.6       713.4         0.06
         Rural      No         3,246     973       3,353.8       988.8        3,254.7       964.3         0.23
                    Yes        6,134     757       5,985.5       781.9        6,093.5       797.5         0.11
Male     Urban      No        10,381     812      10,471.5       845.1       10,358.9       834.1         0.07
                    Yes       10,969     380      10,837.8       387.6       10,959.2       389.8         0.03
         Rural      No         6,123   1,084       6,045.3     1,038.1        6,150.2     1,056.8         0.15
                    Yes        6,693     513       6,811.4       518.2        6,697.6       508.4         0.07

ᵃ G, gender; I, injury; L, location; S, seat-belt use.
Source: Data courtesy of Cristanna Cook, Medical Care Development, Augusta, Maine.

Table 8.9 displays tests of fit for several loglinear models. To investigate the complexity of model needed, we consider models (G, I, L, S), (GI, GL, GS, IL, IS, LS), and (GIL, GIS, GLS, ILS), which have all terms of a given order. Model (G, I, L, S) of mutual independence fits very poorly. Model (GI, GL, GS, IL, IS, LS) fits much better but still shows lack of fit (P < 0.001). Model (GIL, GIS, GLS, ILS) fits well (G² = 1.3, df = 1) but is complex and difficult to interpret. This suggests studying models more complex than (GI, GL, GS, IL, IS, LS) but simpler than (GIL, GIS, GLS, ILS).

TABLE 8.9 Goodness-of-Fit Tests for Loglinear Models in Table 8.8

Model                          G²       df    P-Value
(G, I, L, S)                2792.8      11    <0.0001
(GI, GL, GS, IL, IS, LS)      23.4       5    <0.001
(GIL, GIS, GLS, ILS)           1.3       1     0.25
(GIL, GS, IS, LS)             18.6       4     0.001
(GIS, GL, IL, LS)             22.8       4    <0.001
(GLS, GI, IL, IS)              7.5       4     0.11
(ILS, GI, GL, GS)             20.6       4    <0.001

First, however, we analyze model (GI, GL, GS, IL, IS, LS), which focuses on pairwise associations. Table 8.8 displays its fitted values. Table 8.10 reports the model-based estimated conditional odds ratios. One can obtain them directly using the fitted values for partial tables relating two variables at any combination of levels of the other two. They also follow directly from the parameter estimates; for instance, 0.44 = exp(λ̂_{11}^{IS} + λ̂_{22}^{IS} − λ̂_{12}^{IS} − λ̂_{21}^{IS}).

TABLE 8.10 Estimated Conditional Odds Ratios for Models of Table 8.8

Odds Ratio           (GI, GL, GS, IL, IS, LS)    (GLS, GI, IL, IS)
GI                            0.58                     0.58
IL                            2.13                     2.13
IS                            0.44                     0.44
GL  (S = no)                  1.23                     1.33
GL  (S = yes)                 1.23                     1.17
GS  (L = urban)               0.63                     0.66
GS  (L = rural)               0.63                     0.58
LS  (G = female)              1.09                     1.17
LS  (G = male)                1.09                     1.03

Since the sample size is large, the estimated odds ratios are quite precise. For instance, the standard error of the estimated IS conditional log odds ratio of −0.814 is 0.028. A 95% Wald confidence interval for the true odds ratio is exp[−0.814 ± 1.96(0.028)], or (0.42, 0.47). This model estimates that the odds of injury for passengers wearing seat belts were less than half the odds for passengers not wearing them, at each gender and location combination. The fitted odds ratios in Table 8.10 also suggest that, other factors being fixed, injury was more likely in rural than urban accidents and more likely for females than for males. The estimated odds that males used seat belts were only 0.63 times the estimated odds for females.

Interpretations are more complex for models containing three-factor interaction terms. Table 8.9 shows the results of adding a single three-factor term to model (GI, GL, GS, IL, IS, LS). Of the four possible models, (GLS, GI, IL, IS) appears to fit best. Table 8.8 also displays its fit. Given the large sample size, its G² value suggests that it fits quite well.
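The Wald interval calculation above is a one-line computation; as a minimal sketch, using the estimate −0.814 and standard error 0.028 reported in the text:

```python
import math

# Estimated IS conditional log odds ratio and its standard error (from the text)
log_or, se = -0.814, 0.028

# 95% Wald confidence interval on the log scale, then exponentiated
z = 1.96
lower = math.exp(log_or - z * se)
upper = math.exp(log_or + z * se)

print(f"estimated odds ratio: {math.exp(log_or):.2f}")   # 0.44
print(f"95% CI: ({lower:.2f}, {upper:.2f})")             # (0.42, 0.47)
```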
For model (GLS, GI, IL, IS), each pair of variables is conditionally dependent, and at each category of I the association between any two of G, L, and S varies across the categories of the remaining variable. For this model it is inappropriate to interpret the GL, GS, and LS two-factor terms on their own. Since I does not occur in a three-factor interaction, the conditional odds ratio between I and each other variable (see the top portion of Table 8.10) is the same at each combination of categories of the other two variables.

When a model has a three-factor interaction term but no term of higher order, one can study the interaction by calculating fitted odds ratios between two variables at each level of the third. One can do this at any levels of the remaining variables not involved in the interaction. The bottom portion of Table 8.10 illustrates this for model (GLS, GI, IL, IS). For instance, the fitted GS odds ratio of 0.66 for L = urban refers to four fitted values for urban accidents, using either the four with injury = no or the four with injury = yes; for example, 0.66 = (7273.2 × 10,959.2)/(11,632.6 × 10,358.9).

8.4.3 Large Samples and Statistical versus Practical Significance

Model (GLS, GI, IL, IS) seems to fit much better than (GI, GL, GS, IL, IS, LS). The difference in G² values of 23.4 − 7.5 = 15.9 has df = 5 − 4 = 1 (P = 0.0001). Table 8.10 indicates, however, that the degree of three-factor interaction is weak: the fitted odds ratio between any two of G, L, and S is similar at both levels of the third variable. The significantly better fit of model (GLS, GI, IL, IS) mainly reflects the enormous sample size. As in any test, a statistically significant effect need not be practically important. With huge samples, it is crucial to focus on estimation rather than hypothesis testing.
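Both the likelihood-ratio comparison of the two models and, anticipating Section 8.4.4, the dissimilarity index can be reproduced from the G² values and the Table 8.8 counts and fitted values; a sketch (for df = 1, the chi-squared tail probability equals erfc(√(x/2)), so no special software is needed):

```python
import math

# Likelihood-ratio comparison of (GI, GL, GS, IL, IS, LS) vs. (GLS, GI, IL, IS)
diff_g2 = 23.4 - 7.5          # 15.9
diff_df = 5 - 4               # 1
p_value = math.erfc(math.sqrt(diff_g2 / 2))   # P(chi-square_1 > 15.9)
print(f"G2 difference = {diff_g2:.1f}, df = {diff_df}, P = {p_value:.4f}")  # P = 0.0001

# Dissimilarity index for model (GI, GL, GS, IL, IS, LS), from Table 8.8
observed = [7287, 11587, 3246, 6134, 10381, 10969, 6123, 6693,
            996, 759, 973, 757, 812, 380, 1084, 513]
fitted = [7166.4, 11748.3, 3353.8, 5985.5, 10471.5, 10837.8, 6045.3, 6811.4,
          993.0, 721.3, 988.8, 781.9, 845.1, 387.6, 1038.1, 518.2]
n = sum(observed)             # 68,694 passengers
delta = sum(abs(o - f) for o, f in zip(observed, fitted)) / (2 * n)
print(f"dissimilarity index = {delta:.3f}")   # 0.008
```

The tiny P-value next to the tiny dissimilarity index illustrates the point: the lack of fit is highly significant statistically yet negligible practically.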
For instance, a comparison of fitted odds ratios for the two models in Table 8.10 suggests that the simpler model (GI, GL, GS, IL, IS, LS) is adequate for most purposes.

8.4.4 Dissimilarity Index

For a table of arbitrary dimension with cell counts {n_i = n p_i} and fitted values {μ̂_i = n π̂_i}, one can summarize the closeness of a model fit to the data by the dissimilarity index (Gini 1914),

Δ̂ = Σ_i |n_i − μ̂_i| / 2n = Σ_i |p_i − π̂_i| / 2.

This index falls between 0 and 1, with smaller values representing a better fit. It represents the proportion of sample cases that must move to different cells for the model to fit perfectly.

The dissimilarity index Δ̂ estimates a corresponding population index Δ describing model lack of fit. The value Δ = 0 occurs when the model holds perfectly. In practice this is unrealistic for unsaturated models, and Δ > 0. The estimator Δ̂ helps to study whether the lack of fit is important in a practical sense. When Δ̂ < 0.02 or 0.03, the sample data follow the model pattern quite closely, even though the model is not perfect. When Δ is near 0, Δ̂ tends to overestimate Δ, substantially so for small n. Firth and Kuha (2000) provided an approximate variance for Δ̂ and studied ways to reduce its estimation bias.

For Table 8.8, model (GI, GL, GS, IL, IS, LS) has Δ̂ = 0.008, and model (GLS, GI, IL, IS) has Δ̂ = 0.003. For either model, moving less than 1% of the data yields a perfect fit. The relatively large G² value for (GI, GL, GS, IL, IS, LS) indicated that it does not truly hold. Nevertheless, the small Δ̂ value suggests that, in practical terms, it fits decently.

8.5 LOGLINEAR–LOGIT MODEL CONNECTION

Loglinear models treat categorical response variables symmetrically, focusing on associations and interactions in their joint distribution. Logit models, by contrast, describe how a single categorical response depends on explanatory variables.
The model types seem distinct, but connections exist between them. For a loglinear model, forming logits on one response helps to interpret the model. Moreover, logit models with categorical explanatory variables have equivalent loglinear models.

8.5.1 Using Logit Models to Interpret Loglinear Models

To understand the implications of a loglinear model formula, it can help to form a logit on one variable. We illustrate with the loglinear model (XY, XZ, YZ). When Y is binary, its logit is

log [P(Y = 1 | X = i, Z = k) / P(Y = 2 | X = i, Z = k)]
  = log(μ_{i1k} / μ_{i2k}) = log μ_{i1k} − log μ_{i2k}
  = (λ + λ_i^X + λ_1^Y + λ_k^Z + λ_{i1}^{XY} + λ_{ik}^{XZ} + λ_{1k}^{YZ})
    − (λ + λ_i^X + λ_2^Y + λ_k^Z + λ_{i2}^{XY} + λ_{ik}^{XZ} + λ_{2k}^{YZ})
  = (λ_1^Y − λ_2^Y) + (λ_{i1}^{XY} − λ_{i2}^{XY}) + (λ_{1k}^{YZ} − λ_{2k}^{YZ}).

The first parenthetical term is a constant, not depending on i or k. The second parenthetical term depends on the category i of X, and the third depends on the category k of Z. Thus the logit has the additive form

logit P(Y = 1 | X = i, Z = k) = α + β_i^X + β_k^Z.  (8.15)

Using the notation that summarizes logit models by their predictors, we denote it by (X + Z).

In Section 5.4.1 we discussed this logit model. When Y is binary, the loglinear model (XY, XZ, YZ) is equivalent to it. The λ_{ik}^{XZ} terms for association among the explanatory variables cancel in the difference of logarithms that defines the logit. The logit model does not study this association.

8.5.2 Auto Accident Example Revisited

For the Maine auto accidents (Table 8.8), in Section 8.4.2 we showed that the loglinear model (GLS, GI, LI, IS),

log μ_{gils} = λ + λ_g^G + λ_i^I + λ_l^L + λ_s^S + λ_{gi}^{GI} + λ_{gl}^{GL} + λ_{gs}^{GS} + λ_{il}^{IL} + λ_{is}^{IS} + λ_{ls}^{LS} + λ_{gls}^{GLS},

fits well. It is natural to treat injury (I) as a response variable and gender (G), location (L), and seat-belt use (S) as explanatory variables, or perhaps S as a response with G and L as explanatory. One can show that this loglinear model is equivalent to the logit model (G + L + S),

logit P(I = 1 | G = g, L = l, S = s) = α + β_g^G + β_l^L + β_s^S.  (8.16)

For instance, the seat-belt effects in the two models satisfy β_s^S = λ_{1s}^{IS} − λ_{2s}^{IS}. In the logit calculation, all terms in the loglinear model not having the injury index i cancel. Fitted values, goodness-of-fit statistics, residual df, and standardized Pearson residuals for the logit model are identical to those for the loglinear model.

Odds ratios describing effects on I relate to two-factor loglinear parameters and main-effect logit parameters. In the logit model, the log odds ratio for the effect of S on I equals β_1^S − β_2^S. This equals λ_{11}^{IS} + λ_{22}^{IS} − λ_{12}^{IS} − λ_{21}^{IS} in the loglinear model. Their estimates are the same no matter how software sets up the constraints. For Table 8.8, β̂_1^S − β̂_2^S = −0.817 for the logit model, and λ̂_{11}^{IS} + λ̂_{22}^{IS} − λ̂_{12}^{IS} − λ̂_{21}^{IS} = −0.817 for the loglinear model.

Loglinear models are GLMs that treat the 16 cell counts in Table 8.8 as 16 independent Poisson variates. Logit models are GLMs that treat the table as binomial counts. Logit models with I as the response treat the GLS marginal table {n_{g+ls}} as fixed and regard {n_{g1ls}} as eight independent binomial variates on that response. Although the sampling models differ, the results from fits of corresponding models are identical.

8.5.3 Correspondence between Loglinear and Logit Models

In the derivation of the logit model (X + Z) [see (8.15)] from loglinear model (XY, XZ, YZ), the λ_{ik}^{XZ} term cancels. It might seem as if the model (XY, YZ) omitting this term is also equivalent to that logit model. Indeed, forming the logit on Y for (XY, YZ) results in the same logit formula. The loglinear model that has the same fit as the logit model, however, contains a general interaction term for the relationships among the explanatory variables. The logit model assumes nothing about relationships among the explanatory variables, so it allows an arbitrary interaction pattern for them.

TABLE 8.11 Equivalent Loglinear and Logit Models for a Three-Way Table with Binary Response Variable Y

Loglinear Symbol    Logit Model                           Logit Symbol
(Y, XZ)             α                                     (—)
(XY, XZ)            α + β_i^X                             (X)
(YZ, XZ)            α + β_k^Z                             (Z)
(XY, YZ, XZ)        α + β_i^X + β_k^Z                     (X + Z)
(XYZ)               α + β_i^X + β_k^Z + β_{ik}^{XZ}       (X*Z)

Table 8.11 summarizes equivalent logit and loglinear models for three-way tables when Y is a binary response. Each loglinear model contains the XZ association term relating the explanatory variables in the logit models. The simple loglinear model (Y, XZ) states that Y is jointly independent of both X and Z, and is equivalent to the logit model having only an intercept. The saturated loglinear model (XYZ) contains the three-factor interaction term. When Y is a binary response, this model is equivalent to a logit model with an interaction between the predictors X and Z: the effect of X on Y depends on Z, meaning that the XY odds ratio varies across its categories. That logit model is also saturated. Analogous correspondences hold when Y has several categories, using baseline-category logit models.

An advantage of the loglinear approach is its generality. It applies when more than one response variable exists. The alcohol, cigarette, and marijuana example in Section 8.2.4, for instance, used loglinear models to study association patterns among three response variables. Loglinear models are most natural when at least two variables are response variables. When only one is a response, it is more sensible to use logit models directly.

8.5.4 Generalized Loglinear Model*

Let n = (n_1, ..., n_N)′ and μ = (μ_1, ..., μ_N)′ denote column vectors of observed and expected counts for the N cells of a contingency table, with n = Σ_i n_i. For simplicity we use a single index, but the table may be multidimensional.
Loglinear models for positive Poisson means have the form

log μ = Xβ  (8.17)

for a model matrix X and a column vector β of model parameters. We illustrate with the independence model, log μ_{ij} = λ + λ_i^X + λ_j^Y, for a 2 × 2 table. With the constraints λ_2^X = λ_2^Y = 0, it is

[ log μ_11 ]   [ 1  1  1 ]
[ log μ_12 ] = [ 1  1  0 ] [ λ     ]
[ log μ_21 ]   [ 1  0  1 ] [ λ_1^X ]
[ log μ_22 ]   [ 1  0  0 ] [ λ_1^Y ].

A generalization of (8.17) allows many additional models. This generalized loglinear model is

C log(Aμ) = Xβ  (8.18)

for matrices C and A. The ordinary loglinear model (8.17) results when C and A are identity matrices. Other special cases include logit models for binary or multicategory responses. For instance, the loglinear model of independence for a 2 × 2 table is equivalent to a model by which the logit for Y is the same in each row of X (see Section 8.1.2). That logit model has form (8.18): A is a 4 × 4 identity matrix, so Aμ is the 4 × 1 vector μ = (μ_11, μ_12, μ_21, μ_22)′; the product C log(Aμ) forms the logit in row 1 and the logit in row 2 using

C = [ 1  −1   0   0 ]
    [ 0   0   1  −1 ];

then X = (1, 1)′ is a 2 × 1 matrix, and β is a single constant α, so Xβ forms a common value for those two logits. In Chapters 10 and 11 we use the generalized loglinear model for models outside the classes of GLMs studied thus far. An example is modeling marginal distributions of multivariate responses.

8.6 LOGLINEAR MODEL FITTING: LIKELIHOOD EQUATIONS AND ASYMPTOTIC DISTRIBUTIONS*

In discussing the fitting of loglinear models, we first derive sufficient statistics and likelihood equations. We then present large-sample normal distributions for ML estimators of model parameters and cell probabilities. We illustrate the results with models for three-way tables. For simplicity, the derivations use the Poisson sampling model, which does not require a constraint on the parameters as the multinomial does.
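Before turning to fitting, the generalized loglinear representation (8.18) of Section 8.5.4 can be checked numerically. A sketch for the 2 × 2 independence example, using the A, C, and X matrices displayed there and an arbitrary set of means satisfying independence:

```python
import numpy as np

# Means satisfying independence: mu_ij = n * (row prob i) * (col prob j)
row, col = np.array([0.6, 0.4]), np.array([0.7, 0.3])
mu = 100 * np.outer(row, col).ravel()    # (mu11, mu12, mu21, mu22)

A = np.eye(4)                            # identity: the model applies to cell means
C = np.array([[1.0, -1.0, 0.0, 0.0],     # logit in row 1: log mu11 - log mu12
              [0.0, 0.0, 1.0, -1.0]])    # logit in row 2: log mu21 - log mu22
X = np.array([[1.0], [1.0]])             # common intercept alpha for both logits

lhs = C @ np.log(A @ mu)
# Under independence both logits equal log(0.7/0.3), so C log(A mu) = X alpha
alpha = np.log(0.7 / 0.3)
print(np.allclose(lhs, (X * alpha).ravel()))   # True
```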
8.6.1 Minimal Sufficient Statistics

For three-way tables, the joint Poisson probability that the cell counts {Y_ijk} take values {n_ijk} is

Π_i Π_j Π_k exp(−μ_ijk) μ_ijk^{n_ijk} / n_ijk!,

where the product refers to all cells of the table. The kernel of the log likelihood is

L(μ) = Σ_i Σ_j Σ_k n_ijk log μ_ijk − Σ_i Σ_j Σ_k μ_ijk.  (8.19)

For the general loglinear model (8.12), this simplifies to

L(μ) = n λ + Σ_i n_{i++} λ_i^X + Σ_j n_{+j+} λ_j^Y + Σ_k n_{++k} λ_k^Z
       + Σ_i Σ_j n_{ij+} λ_{ij}^{XY} + Σ_i Σ_k n_{i+k} λ_{ik}^{XZ} + Σ_j Σ_k n_{+jk} λ_{jk}^{YZ}
       + Σ_i Σ_j Σ_k n_ijk λ_{ijk}^{XYZ} − Σ_i Σ_j Σ_k exp(λ + ⋯ + λ_{ijk}^{XYZ}).  (8.20)

Since the Poisson distribution is in the exponential family, the coefficients of the parameters are sufficient statistics. For this saturated model, {n_ijk} are the coefficients of {λ_{ijk}^{XYZ}}, so there is no reduction of the data. For simpler models, certain parameters are zero and (8.20) simplifies. For instance, for the model (X, Y, Z) of mutual independence, the sufficient statistics are the coefficients in (8.20) of {λ_i^X}, {λ_j^Y}, and {λ_k^Z}. These are {n_{i++}}, {n_{+j+}}, and {n_{++k}}.

Table 8.12 lists minimal sufficient statistics for several loglinear models. Each one is the coefficient of the highest-order term(s) in which a variable appears. In fact, they are the marginal tables corresponding to the terms in the model symbol. Simpler models use more condensed sample information. For instance, whereas (X, Y, Z) uses only the single-factor marginal distributions, (XY, XZ, YZ) uses the two-way marginal tables.

TABLE 8.12 Minimal Sufficient Statistics for Fitting Loglinear Models

Model              Minimal Sufficient Statistics
(X, Y, Z)          {n_{i++}}, {n_{+j+}}, {n_{++k}}
(XY, Z)            {n_{ij+}}, {n_{++k}}
(XY, YZ)           {n_{ij+}}, {n_{+jk}}
(XY, XZ, YZ)       {n_{ij+}}, {n_{i+k}}, {n_{+jk}}

8.6.2 Likelihood Equations for Loglinear Models

The fitted values for a model are solutions to the likelihood equations. We derive the likelihood equations using the general representation (8.17) of a loglinear model. For a vector of counts n with μ = E(n), the model is log μ = Xβ, for which log(μ_i) = Σ_j x_ij β_j for all i. Extending (8.19), for Poisson sampling the log likelihood is

L(μ) = Σ_i n_i log μ_i − Σ_i μ_i = Σ_i n_i (Σ_j x_ij β_j) − Σ_i exp(Σ_j x_ij β_j).  (8.21)

The sufficient statistic for β_j is its coefficient, Σ_i n_i x_ij. Since

∂/∂β_j exp(Σ_h x_ih β_h) = x_ij exp(Σ_h x_ih β_h) = x_ij μ_i,

∂L(μ)/∂β_j = Σ_i n_i x_ij − Σ_i μ_i x_ij,  j = 1, 2, ..., p.

The likelihood equations equate these derivatives to zero. They have the form

X′n = X′μ̂.  (8.22)

These equations match the sufficient statistics to their expected values, a result obtained with GLM theory in (4.29). For the models considered so far, these sufficient statistics are the marginal tables in the model symbol.

To illustrate, consider model (XZ, YZ). Its log likelihood is (8.20) with λ^{XY} = λ^{XYZ} = 0. The log-likelihood derivatives

∂L/∂λ_{ik}^{XZ} = n_{i+k} − μ_{i+k}  and  ∂L/∂λ_{jk}^{YZ} = n_{+jk} − μ_{+jk}

yield the likelihood equations

μ̂_{i+k} = n_{i+k}  for all i and k,  (8.23)
μ̂_{+jk} = n_{+jk}  for all j and k.  (8.24)

Derivatives with respect to lower-order terms yield equations implied by these (Problem 8.30). For model (XZ, YZ), the fitted values have the same XZ and YZ marginal totals as the observed data.

8.6.3 Birch's Results for Loglinear Models

For model (XZ, YZ), from (8.23), (8.24), and Table 8.12, the minimal sufficient statistics are the ML estimates of the corresponding marginal distributions of expected frequencies. Equation (8.22) gives the corresponding result for any loglinear model. Birch (1963) showed that the likelihood equations for loglinear models match minimal sufficient statistics to their expected values. Poisson GLM theory implied this result in (4.29) and (4.44).
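The matching of sufficient statistics in (8.22) is easy to verify for a case with a closed-form fit; a small sketch for the two-way independence model, using the model matrix displayed in Section 8.5.4 and the direct fitted values μ̂_ij = n_{i+} n_{+j} / n:

```python
import numpy as np

n_cells = np.array([30.0, 10.0, 20.0, 40.0])   # counts (n11, n12, n21, n22)
nij = n_cells.reshape(2, 2)

# Direct ML fit for independence: mu_hat_ij = n_{i+} n_{+j} / n
mu_hat = np.outer(nij.sum(axis=1), nij.sum(axis=0)).ravel() / n_cells.sum()

# Model matrix with constraints lambda_2^X = lambda_2^Y = 0
X = np.array([[1, 1, 1],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 0]], dtype=float)

# Likelihood equations (8.22): X'n = X'mu_hat
print(X.T @ n_cells)   # sufficient statistics: n, n_{1+}, n_{+1}
print(X.T @ mu_hat)    # identical
```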
Thus, fitted values for loglinear models are smoothed versions of the cell counts that match them in certain marginal distributions but have associations and interactions satisfying the patterns the model implies. Birch showed that a unique set of fitted values both satisfies the model and matches the data in the minimal sufficient statistics. Hence, if we find such a solution, it must be the ML solution. To illustrate, the independence model for a two-way table,

log μ_ij = λ + λ_i^X + λ_j^Y,

has minimal sufficient statistics {n_{i+}} and {n_{+j}}. The likelihood equations are

μ̂_{i+} = n_{i+},  μ̂_{+j} = n_{+j},  for all i and j.

The fitted values {μ̂_ij = n_{i+} n_{+j} / n} satisfy these equations and also satisfy the model. Birch's result implies that they are the ML estimates.

8.6.4 Direct versus Iterative Calculation of Fitted Values

To illustrate how to solve the likelihood equations, we continue the analysis of model (XZ, YZ). From (8.9), the model satisfies

π_ijk = π_{i+k} π_{+jk} / π_{++k}  for all i, j, and k.

For Poisson sampling, the related formula uses expected frequencies. Setting π_ijk = μ_ijk / n, this is {μ_ijk = μ_{i+k} μ_{+jk} / μ_{++k}}. The likelihood equations (8.23) and (8.24) specify that the ML estimates satisfy μ̂_{i+k} = n_{i+k} and μ̂_{+jk} = n_{+jk}, and thus also μ̂_{++k} = n_{++k}. Since ML estimates of functions of parameters are the same functions of the ML estimates of those parameters,

μ̂_ijk = μ̂_{i+k} μ̂_{+jk} / μ̂_{++k} = n_{i+k} n_{+jk} / n_{++k}.

This solution satisfies the model and matches the data in the sufficient statistics. Thus, it is the unique ML solution.

TABLE 8.13 Fitted Values for Loglinear Models in Three-Way Tablesᵃ

Model           Probabilistic Form                     Fitted Value
(X, Y, Z)       π_ijk = π_{i++} π_{+j+} π_{++k}        μ̂_ijk = n_{i++} n_{+j+} n_{++k} / n²
(XY, Z)         π_ijk = π_{ij+} π_{++k}                μ̂_ijk = n_{ij+} n_{++k} / n
(XY, XZ)        π_ijk = π_{ij+} π_{i+k} / π_{i++}      μ̂_ijk = n_{ij+} n_{i+k} / n_{i++}
(XY, XZ, YZ)    π_ijk = ψ_ij φ_jk ω_ik                 Iterative methods (Section 8.7)
(XYZ)           No restriction                         μ̂_ijk = n_ijk

ᵃ Formulas for models not listed are obtained by symmetry; for example, for (XZ, Y), μ̂_ijk = n_{i+k} n_{+j+} / n.

Similar reasoning produces {μ̂_ijk} for all except one model in Table 8.12. Table 8.13 shows the formulas. That table also expresses {π_ijk} in terms of marginal probabilities. These expressions and the likelihood equations determine the ML formulas, using the approach just described. For models having explicit formulas for μ̂_ijk, the estimates are said to be direct.

Many loglinear models do not have direct estimates. ML estimation then requires iterative methods. Of the models in Tables 8.12 and 8.13, the only one not having direct estimates is (XY, XZ, YZ). Although the two-way marginal tables are its minimal sufficient statistics, it is not possible to express {π_ijk} directly in terms of {π_{ij+}}, {π_{i+k}}, and {π_{+jk}}. Direct estimates do not exist for unsaturated models containing all two-factor associations. In practice, it is not essential to know which models have direct estimates. Iterative methods for models without direct estimates also apply to models that have them, and statistical software for loglinear models uses such iterative methods in all cases.

8.6.5 Chi-Squared Goodness-of-Fit Tests

Model goodness-of-fit statistics compare fitted cell counts to sample counts. For Poisson GLMs, in Section 4.5.2 we showed that for models with an intercept term, the deviance equals the G² statistic. With a fixed number of cells, G² and X² have approximate chi-squared null distributions when the expected frequencies are large. The df equal the difference in dimension between the alternative and null hypotheses. This equals the difference between the number of parameters in the general case and when the model holds.
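Returning to Table 8.13, the direct formulas are easy to verify numerically. A sketch for models (X, Y, Z), (XY, Z), and (XZ, YZ) on an arbitrary 2 × 2 × 2 table of counts, checking that each fit matches the data in its minimal sufficient statistics:

```python
import numpy as np

rng = np.random.default_rng(1)
counts = rng.integers(5, 40, size=(2, 2, 2)).astype(float)
n = counts.sum()

ni = counts.sum(axis=(1, 2))     # n_{i++}
nj = counts.sum(axis=(0, 2))     # n_{+j+}
nk = counts.sum(axis=(0, 1))     # n_{++k}
nij = counts.sum(axis=2)         # n_{ij+}
nik = counts.sum(axis=1)         # n_{i+k}
njk = counts.sum(axis=0)         # n_{+jk}

# Direct fitted values from Table 8.13
mu_XYZ = np.einsum('i,j,k->ijk', ni, nj, nk) / n**2    # model (X, Y, Z)
mu_XY_Z = np.einsum('ij,k->ijk', nij, nk) / n          # model (XY, Z)
mu_XZ_YZ = np.einsum('ik,jk->ijk', nik, njk) / nk      # model (XZ, YZ)

# Model (XZ, YZ): fitted XZ and YZ margins equal the observed margins
print(np.allclose(mu_XZ_YZ.sum(axis=1), nik))   # True
print(np.allclose(mu_XZ_YZ.sum(axis=0), njk))   # True
```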
We illustrate with model (X, Y, Z), for multinomial sampling with probabilities {π_ijk}. In the general case, the only constraint is Σ_i Σ_j Σ_k π_ijk = 1, so there are IJK − 1 parameters. For model (X, Y, Z), {π_ijk = π_{i++} π_{+j+} π_{++k}} are determined by I − 1 of {π_{i++}} (since Σ_i π_{i++} = 1), J − 1 of {π_{+j+}}, and K − 1 of {π_{++k}}. Thus,

df = (IJK − 1) − [(I − 1) + (J − 1) + (K − 1)] = IJK − I − J − K + 2.

TABLE 8.14 Residual Degrees of Freedom for Loglinear Models for Three-Way Tables

Model             Degrees of Freedom
(X, Y, Z)         IJK − I − J − K + 2
(XY, Z)           (K − 1)(IJ − 1)
(XZ, Y)           (J − 1)(IK − 1)
(YZ, X)           (I − 1)(JK − 1)
(XY, YZ)          J(I − 1)(K − 1)
(XZ, YZ)          K(I − 1)(J − 1)
(XY, XZ)          I(J − 1)(K − 1)
(XY, XZ, YZ)      (I − 1)(J − 1)(K − 1)
(XYZ)             0

The same df formula applies for Poisson sampling. Then the general case has IJK parameters {μ_ijk}. The residual df equal the number of cells in the table minus the number of parameters in the Poisson loglinear model for {μ_ijk}. For instance, model (X, Y, Z) has residual df = IJK − [1 + (I − 1) + (J − 1) + (K − 1)], reflecting the single intercept parameter λ and constraints such as λ_I^X = λ_J^Y = λ_K^Z = 0. This equals the number of linearly independent parameters equated to zero in the saturated model to obtain the given model. Table 8.14 shows df formulas for testing three-way loglinear models.

8.6.6 Covariance Matrix of ML Parameter Estimators

To present large-sample distributions of ML parameter estimators, we return to the general expression log(μ_i) = Σ_j x_ij β_j, from which we obtained the log-likelihood derivatives

∂L(μ)/∂β_j = Σ_i n_i x_ij − Σ_i μ_i x_ij,  j = 1, 2, ..., p.

The Hessian matrix of second partial derivatives has elements

∂²L(μ)/∂β_j ∂β_k = −Σ_i x_ij (∂μ_i/∂β_k) = −Σ_i x_ij {∂/∂β_k exp(Σ_h x_ih β_h)} = −Σ_i x_ij x_ik μ_i.

Like logistic regression models, loglinear models are GLMs using the canonical link; thus this matrix does not depend on the observed data. The information matrix, the negative of the Hessian, is

I = X′ diag(μ) X,

where diag(μ) has the elements of μ on the main diagonal. For a fixed number of cells, as n → ∞ the ML estimator β̂ is asymptotically normal with mean β and covariance matrix I⁻¹. Thus, for Poisson sampling, the asymptotic covariance matrix is

cov(β̂) = [X′ diag(μ) X]⁻¹.  (8.25)

Substituting the ML fitted values and then taking square roots of the diagonal elements yields standard errors for β̂. This also follows from the general expression (4.28) for GLMs, as noted in Section 4.4.7.

8.6.7 Connection between Multinomial and Poisson Loglinear Models

Similar asymptotic results hold with multinomial sampling. When {Y_i, i = 1, ..., N} are independent Poisson random variables, the conditional distribution of {Y_i} given n = Σ_i Y_i is multinomial with parameters {π_i = μ_i / (Σ_a μ_a)}. Birch (1963) showed that the ML estimates of loglinear model parameters are the same for multinomial sampling as for independent Poisson sampling. He showed that the estimates are also the same for independent multinomial sampling, as long as the model contains a term for the marginal distribution fixed by the sampling design. To illustrate, suppose that at each combination of categories of X and Z, an independent multinomial sample occurs on Y. Then {n_{i+k}} are fixed. The model must contain λ_{ik}^{XZ}, so the fitted values satisfy {μ̂_{i+k} = n_{i+k}}.

That separate inferential theory is unnecessary for multinomial loglinear models follows from the following argument. Express the Poisson loglinear model for {μ_i} as

log μ_i = λ + x_i β,

where (1, x_i) is row i of the model matrix X and (λ, β′)′ is the model parameter vector. The Poisson log likelihood is

L = L(λ, β) = Σ_i n_i log μ_i − Σ_i μ_i = Σ_i n_i (λ + x_i β) − Σ_i exp(λ + x_i β) = nλ + Σ_i n_i x_i β − τ,

where τ = Σ_i μ_i = Σ_i exp(λ + x_i β). Since log τ = λ + log[Σ_i exp(x_i β)], this log likelihood has the form

L = L(τ, β) = {Σ_i n_i x_i β − n log[Σ_i exp(x_i β)]} + (n log τ − τ).  (8.26)

Now π_i = μ_i / (Σ_a μ_a) = exp(λ + x_i β) / [Σ_a exp(λ + x_a β)], and exp(λ) cancels in the numerator and denominator. Thus, the first term (in braces) on the right-hand side of (8.26) is Σ_i n_i log π_i, which is the multinomial log likelihood, conditional on the total cell count n. Unconditionally, n = Σ_i n_i has a Poisson distribution with expectation Σ_i μ_i = τ, so the second term in (8.26) is the Poisson log likelihood for n. Since β enters only in the first term, the ML estimator β̂ and its covariance matrix for the Poisson log likelihood L(λ, β) are identical to those for the multinomial log likelihood. The Poisson loglinear model has one more parameter (i.e., λ) than the multinomial loglinear model because of the random sample size. See Birch (1963), Lang (1996c), McCullagh and Nelder (1989, p. 211), and Palmgren (1981) for details.

For a multinomial sample, we show in Section 14.4.1 that the estimated covariance matrix of the loglinear parameter estimators is

côv(β̂) = {X′[diag(μ̂) − μ̂μ̂′/n]X}⁻¹.  (8.27)

The intercept λ from the Poisson model is not relevant, and X for the multinomial model deletes the column of X pertaining to it in the Poisson model.

A similar argument applies with several independent multinomial samples. Each log-likelihood term is a sum of components from the different samples, but the Poisson log likelihood again decomposes into two parts. One part is a Poisson log likelihood for the independent sample sizes, and the other part is the sum of the independent multinomial log likelihoods. Palmgren (1981) showed that, conditional on the observed marginal totals for the explanatory variables, the asymptotic covariances for estimators of parameters involving the response are the same as for Poisson sampling. For a single multinomial sample, Palmgren's result implies that (8.27) is identical to (8.25) with the row and column referring to λ deleted. Birch (1963) and Goodman (1970) gave related results. Lang (1996c) gave an elegant discussion of connections between multinomial and Poisson models. His results imply that the asymptotic variance of any linear contrast of estimated log means within a covariate level is identical for the two models.

8.6.8 Distribution of Probability Estimators

For multinomial sampling, the ML estimates of the cell probabilities are π̂ = μ̂/n. We next give the asymptotic cov(π̂). Lang (1996c) showed the asymptotic covariance matrix of μ̂ for Poisson sampling and its connection with cov(π̂).

The saturated model has π̂ = p, the sample proportions. Under multinomial sampling, from (3.7) and (3.8), their covariance matrix is

cov(p) = [diag(π) − ππ′]/n.  (8.28)

With I independent multinomial samples on a response variable with J categories, π and p consist of I sets of proportions, each having J − 1 nonredundant elements. Then cov(p) is a block diagonal matrix. Each of the independent samples has a (J − 1) × (J − 1) block of form (8.28), and the matrix contains zeros off the main diagonal of blocks.

Now assume an unsaturated model. Using the delta method, we show in Sections 14.2.2 and 14.4.1 that π̂ has an asymptotic normal distribution about π. The estimated covariance matrix equals

côv(π̂) = côv(p) X [X′ côv(p) X]⁻¹ X′ côv(p).

For a single multinomial sample, this expression equals

côv(π̂) = {[diag(π̂) − π̂π̂′] X [X′(diag(π̂) − π̂π̂′)X]⁻¹ X′ [diag(π̂) − π̂π̂′]}/n.
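Equation (8.28) for the saturated case can be sketched numerically; the diagonal elements reproduce the familiar binomial-type variances π_i(1 − π_i)/n, and the off-diagonal elements are negative because the cell counts compete for the fixed total:

```python
import numpy as np

pi = np.array([0.5, 0.3, 0.2])   # cell probabilities for one multinomial sample
n = 200                          # multinomial sample size

# Covariance matrix (8.28) of the sample proportions p
cov_p = (np.diag(pi) - np.outer(pi, pi)) / n

# Diagonal: var(p_i) = pi_i (1 - pi_i) / n
print(np.allclose(np.diag(cov_p), pi * (1 - pi) / n))   # True
# Off-diagonal: cov(p_i, p_j) = -pi_i pi_j / n
print(cov_p[0, 1], -pi[0] * pi[1] / n)
```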
For tables with many cells, it is not unusual to have a sample proportion of 0 in some cell. In this case the ordinary standard error is 0, which is unappealing. An advantage of fitting a model is that it typically produces a positive fitted probability and standard error.

8.6.9 Uniqueness of ML Estimates

When all {n_i > 0}, the ML estimates exist and are unique. To show this, for simplicity we use Poisson sampling. Suppose that the model is parameterized so that X has full rank. Birch (1963) showed that the likelihood equations are soluble by noting that the kernel of the Poisson log likelihood,

L(μ) = Σ_i (n_i log μ_i − μ_i),

has individual terms converging to −∞ as log(μ_i) → ±∞; thus, the log likelihood is bounded above and attains its maximum at finite values of the model parameters. It is stationary at this maximum, since it has continuous first partial derivatives. Birch showed that the likelihood equations have a unique solution and that the likelihood is maximized at that point. He proved this by showing that the matrix of values {−∂²L/∂β_h ∂β_j} [i.e., the information matrix X′ diag(μ) X] is nonsingular and nonnegative definite, and hence positive definite. Nonsingularity follows from X having full rank and the diagonal matrix having positive elements {μ_i}. Any quadratic form c′X′ diag(μ)Xc equals Σ_i μ_i (Σ_j x_ij c_j)² ≥ 0, so the matrix is also nonnegative definite.

8.7 LOGLINEAR MODEL FITTING: ITERATIVE METHODS AND THEIR APPLICATION*

When a loglinear model does not have direct estimates, iterative algorithms such as Newton–Raphson can solve the likelihood equations. In this section we also present a simpler but more limited method, iterative proportional fitting.

8.7.1 Newton–Raphson Method

In Section 4.6.1 we introduced the Newton–Raphson method. Referring to the notation there, we identify L(β) as the log likelihood for Poisson loglinear models. From (8.21), let

L(β) = Σ_i n_i (Σ_h x_ih β_h) − Σ_i exp(Σ_h x_ih β_h).

Then

u_j = ∂L(β)/∂β_j = Σ_i n_i x_ij − Σ_i μ_i x_ij,
h_jk = ∂²L(β)/∂β_j ∂β_k = −Σ_i μ_i x_ij x_ik,

so that

u_j^(t) = Σ_i (n_i − μ_i^(t)) x_ij  and  h_jk^(t) = −Σ_i μ_i^(t) x_ij x_ik.

The tth approximation μ^(t) for μ̂ derives from β^(t) through μ^(t) = exp(Xβ^(t)). It generates the next value β^(t+1) using (4.39), which in this context is

β^(t+1) = β^(t) + [X′ diag(μ^(t)) X]⁻¹ X′(n − μ^(t)).

Alternatively, β^(t+1) can be expressed as

β^(t+1) = −(H^(t))⁻¹ r^(t),  (8.29)

where

r_j^(t) = Σ_i μ_i^(t) x_ij [log μ_i^(t) + (n_i − μ_i^(t))/μ_i^(t)].

The expression in brackets is the first-order Taylor series expansion of log n_i about log μ_i^(t).

The iterative process begins with all μ_i^(0) = n_i, or with an adjustment such as μ_i^(0) = n_i + 1/2 if any n_i = 0. Then (8.29) produces β^(1), and for t > 0 the iterations proceed as just described. For loglinear models L(β) is concave, and μ^(t) and β^(t) usually converge rapidly to the ML estimates μ̂ and β̂ as t increases. The matrix H^(t) converges to Ĥ = −X′ diag(μ̂)X. By (8.25), the estimated large-sample covariance matrix of β̂ is −Ĥ⁻¹, a by-product of the method.

As we discussed in Section 4.6.3 for GLMs, (8.29) has the iteratively reweighted least squares form

β^(t+1) = (X′ V̂_t⁻¹ X)⁻¹ X′ V̂_t⁻¹ z^(t).

Here, z^(t) has elements z_i^(t) = log μ_i^(t) + (n_i − μ_i^(t))/μ_i^(t), and V̂_t = [diag(μ^(t))]⁻¹. Thus, β^(t+1) is the weighted least squares solution for the model

z^(t) = Xβ + ε,

where {ε_i} are uncorrelated with variances {1/μ_i^(t)}. With {μ_i^(0) = n_i}, β^(1) is the weighted least squares estimate for the model log(n) = Xβ + ε.

8.7.2 Iterative Proportional Fitting

The iterative proportional fitting (IPF) algorithm is a simple method for calculating {μ̂_i} for loglinear models. Introduced by Deming and Stephan (1940), it has the following steps:
Introduced by Deming and Stephan (1940), it has the following steps:

1. Start with {μ_i^(0)} satisfying a model no more complex than the one being fitted. For instance, {μ_i^(0) ≡ 1.0} are trivially adequate.
2. By multiplying by appropriate factors, adjust {μ_i^(0)} successively to match each marginal table in the set of minimal sufficient statistics.
3. Continue until the maximum difference between the sufficient statistics and their fitted values is sufficiently close to zero.

We illustrate using model (XY, XZ, YZ). Its minimal sufficient statistics are {n_ij+}, {n_i+k}, and {n_+jk}. Initial estimates must satisfy the model. The first cycle of the IPF algorithm has three steps:

    μ_ijk^(1) = μ_ijk^(0) ( n_ij+ / μ_ij+^(0) ),
    μ_ijk^(2) = μ_ijk^(1) ( n_i+k / μ_i+k^(1) ),
    μ_ijk^(3) = μ_ijk^(2) ( n_+jk / μ_+jk^(2) ).

Summing both sides of the first expression over k shows that μ_ij+^(1) = n_ij+ for all i and j. After step 1, observed and fitted values match in the XY marginal table. After step 2, all μ_i+k^(2) = n_i+k, but the XY marginal tables no longer match. After step 3, all μ_+jk^(3) = n_+jk, but the XY and XZ marginal tables no longer match. A new cycle begins by again matching the XY marginal tables, using μ_ijk^(4) = μ_ijk^(3) ( n_ij+ / μ_ij+^(3) ), and so on.

At each step, the updated estimates continue to satisfy the model. For instance, step 1 uses the same adjustment factor ( n_ij+ / μ_ij+^(0) ) at different levels k of Z. Thus, XY odds ratios from different levels of Z have ratio equal to 1, and the homogeneous association pattern continues at each step. As the cycles progress, the G² statistic comparing cell counts to the updated fit is monotone decreasing, and the process must converge (Fienberg 1970a; Haberman 1974a). The IPF algorithm produces ML estimates because it generates a sequence of fitted values converging to a solution that both satisfies the model and matches the sufficient statistics.
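The three-step cycle is easy to code directly. The sketch below (made-up counts, not a dataset from the text) fits model (XY, XZ, YZ) to a 2 × 2 × 2 table by rescaling the fit to match each sufficient two-way margin in turn:

```python
import numpy as np

# Iterative proportional fitting for model (XY, XZ, YZ): each step multiplies
# the current fit by a factor that makes one two-way margin match the data.
n = np.array([[[15.,  8.], [ 7., 22.]],
              [[10.,  5.], [ 6., 19.]]])    # made-up counts, axes (X, Y, Z)

mu = np.ones_like(n)                        # initial values satisfy the model
for cycle in range(500):
    mu *= (n.sum(axis=2) / mu.sum(axis=2))[:, :, None]   # match XY margin
    mu *= (n.sum(axis=1) / mu.sum(axis=1))[:, None, :]   # match XZ margin
    mu *= (n.sum(axis=0) / mu.sum(axis=0))[None, :, :]   # match YZ margin

print(np.allclose(mu.sum(axis=2), n.sum(axis=2)),        # True
      np.allclose(mu.sum(axis=1), n.sum(axis=1)),        # True
      np.allclose(mu.sum(axis=0), n.sum(axis=0)))        # True
```

Because each update multiplies all cells at a fixed (i, j) [or (i, k), or (j, k)] by a common factor, the fit keeps the homogeneous-association structure throughout, as described above.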
By Birch's results (Section 8.6.3), only one such solution exists, and it is ML.

The IPF method works even for models having direct estimates. Then, IPF normally yields ML estimates within one cycle (Haberman 1974a, p. 197). We illustrate with the model of independence. The minimal sufficient statistics are {n_i+} and {n_+j}. With {μ_ij^(0) ≡ 1.0}, the first cycle gives

    μ_ij^(1) = μ_ij^(0) ( n_i+ / μ_i+^(0) ) = n_i+ / J,
    μ_ij^(2) = μ_ij^(1) ( n_+j / μ_+j^(1) ) = n_i+ n_+j / n.

The IPF algorithm then gives μ̂_ij^(t) = n_i+ n_+j / n for all t ≥ 2.

8.7.3 Comparison of Iterative Methods

The IPF algorithm is simple and easy to implement. It converges to the ML fit even when the likelihood is poorly behaved, for instance with zero fitted counts and estimates on the boundary of the parameter space. The Newton-Raphson method is more complex, requiring the solution of a system of equations at each step. Newton-Raphson is sometimes not feasible when the model is of high dimensionality, for instance when the contingency table and parameter vector are huge.

However, IPF has disadvantages. It is applicable primarily to models for which the likelihood equations equate observed and fitted counts in marginal tables. By contrast, Newton-Raphson is a general-purpose method that can solve more complex likelihood equations. IPF sometimes converges slowly compared to Newton-Raphson. Unlike Newton-Raphson, IPF does not produce the model parameter estimates and their estimated covariance matrix as a by-product. The fitted values that IPF produces can generate this information: model parameter estimates are contrasts of {log μ̂_i} (see Problems 8.16 and 8.17), and substituting fitted values into (8.25) yields côv(β̂).

Because Newton-Raphson applies to a wide variety of models and also yields standard errors, it is the fitting routine used by most software for loglinear models. IPF is increasingly viewed as primarily of historical interest.
However, for some applications the analysis is more transparent using IPF, as the next example illustrates.

8.7.4 Contingency Table Standardization

Table 8.15 relates education and attitudes toward legalized abortion, using data from a General Social Survey conducted by the National Opinion Research Center. To make patterns of association clearer, Smith (1976) standardized the table so that all row and column marginal totals equal 100 while maintaining the sample odds ratio structure. The IPF routine to standardize with margins of 100 sets μ_ij^(0) = n_ij and then, for t = 1, 3, 5, . . . ,

    μ_ij^(t) = μ_ij^(t−1) ( 100 / μ_i+^(t−1) ),   μ_ij^(t+1) = μ_ij^(t) ( 100 / μ_+j^(t) ).

At the end of each odd-numbered step, all row totals equal 100. At the end of each even-numbered step, all column totals equal 100. Odds ratios do not change at each odd (even) step, since all counts in a given row (column) multiply by the same constant. The IPF algorithm converges to the entries in parentheses in Table 8.15.

TABLE 8.15 Marginal Standardization of Attitudes toward Abortion by Years of Schooling

                            Attitude toward Legalized Abortion
Schooling                Generally      Middle        Generally     Total
                         Disapprove     Position      Approve
Less than high school    209 (49.4)     101 (32.0)    237 (18.6)    (100)
High school              151 (32.8)     126 (36.6)    426 (30.6)    (100)
More than high school     16 (17.8)      21 (31.3)    138 (50.9)    (100)
Total                        (100)          (100)         (100)

Source: Smith (1976).

The association is clearer in this standardized table. A ridge appears down the main diagonal, with higher levels of education having more favorable attitudes about abortion. The other counts fall away smoothly on both sides.

Table standardization is useful for comparing tables having different marginal structures. Mosteller (1968) compared intergenerational occupational mobility tables from Britain and Denmark. Yule (1912) compared three hospitals on vaccination and recovery for smallpox patients.
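A minimal sketch of the standardization just described, run on the counts of Table 8.15 (rows: schooling levels; columns: disapprove, middle, approve):

```python
import numpy as np

# Rake Table 8.15 to row and column totals of 100: odd steps scale rows,
# even steps scale columns; odds ratios are unchanged at every step.
n = np.array([[209., 101., 237.],
              [151., 126., 426.],
              [ 16.,  21., 138.]])

mu = n.copy()
for t in range(100):
    mu *= (100.0 / mu.sum(axis=1))[:, None]   # row totals -> 100
    mu *= (100.0 / mu.sum(axis=0))[None, :]   # column totals -> 100

print(np.round(mu, 1))   # the standardized entries of Table 8.15,
                         # e.g. 49.4 at top left and 50.9 at bottom right
```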
A modern application is adjusting sample data to match marginal distributions specified by census results. The process of table standardization is called raking the table. Imrey et al. (1981) and Little and Wu (1991) derived the asymptotic covariance matrix for raked sample proportions.

For sample counts {n_ij} with {μ_ij = E(n_ij)}, let {E_ij} denote expected frequencies for the standardized table and {Ê_ij} fitted values in the standardized table. The standardization process corresponds to fitting the model

    log( E_ij / μ_ij ) = λ + λ_i^E + λ_j^A.

That is, maintaining the odds ratios means that the two-way tables of {E_ij/μ_ij} and of {Ê_ij/n_ij} satisfy independence. The fitted values {Ê_ij} in the standardized table satisfy

    log Ê_ij − log n_ij = λ̂ + λ̂_i^E + λ̂_j^A.

The adjustment term, −log n_ij, to the log link of the fit is called an offset. The fit corresponds to using log n_ij as a predictor on the right-hand side and forcing its coefficient to equal 1.0. Standard GLM software can fit models having offsets. To rake a table, one enters as sample data pseudo-values that satisfy independence and have the desired margins, taking log n_ij as an offset. (For SAS, see Table A.14.) In Section 9.7.1 we discuss further the use of model offsets.

NOTES

Section 8.2: Loglinear Models for Independence and Interaction in Three-Way Tables

8.1. Roy and Mitra (1956) discussed types of independence for three-way tables and their large-sample tests. Birch's (1963) article on ML estimation for loglinear models was part of substantial research on loglinear models in the 1960s, much due to L. A. Goodman (see Section 16.4). Haberman (1974a) presented an influential theoretical study of loglinear models.

Section 8.3: Inference for Loglinear Models

8.2. Goodman (1970, 1971b), Haberman (1974a, Chap. 5), Lauritzen (1996), Sundberg (1975), and Whittaker (1990, Sec. 12.4)
discussed families of loglinear models that have direct ML estimates and interpretations in terms of independence, conditional independence, or equiprobability. Such models are called decomposable, since expected frequencies decompose into products and ratios of expected marginal sufficient statistics. Haberman proved conditions under which loglinear models have direct estimates. Baglivo et al. (1992), Forster et al. (1996), and Morgan and Blumenstein (1991) discussed exact inference.

8.3. For methods that allow for misclassification error, see Kuha and Skinner (1997) and Kuha et al. (1998) and references therein. For treatment of missing data, see Little (1998), Schafer (1997, Chap. 8), and their references.

Section 8.7: Loglinear Model Fitting: Iterative Methods and Their Application

8.4. Deming (1964, Chap. VII) described early work on IPF by Deming and Stephan. Darroch (1962) used IPF to obtain ML estimates in contingency tables. Bishop et al. (1975), Fienberg (1970a), and Speed (1998) presented other applications of IPF. Darroch and Ratcliff (1972) generalized IPF for models in which sufficient statistics are more complex than marginal tables.

8.5. For further discussion of table raking, see Bishop et al. (1975, pp. 76-102), Fleiss (1981, Chap. 14), Haberman (1979, Chap. 9), Hoem (1987), and Little and Wu (1991).

PROBLEMS

Applications

8.1 The 1988 General Social Survey compiled by the National Opinion Research Center asked: "Do you support or oppose the following measures to deal with AIDS? (1) Have the government pay all of the health care costs of AIDS patients; (2) Develop a government information program to promote safe sex practices, such as the use of condoms." Table 8.16 summarizes opinions about health care costs (H) and the information program (I), classified also by the respondent's gender (G).

a. Fit loglinear models (GH, GI), (GH, HI), (GI, HI), and (GH, GI, HI). Show that models that lack the HI term fit poorly.
b.
For model (GH, GI, HI), show that 95% Wald confidence intervals equal (0.55, 1.10) for the GH conditional odds ratio and (0.99, 2.55) for the GI conditional odds ratio. Interpret. Is it plausible that gender has no effect on opinion for these issues?

TABLE 8.16 Data for Problem 8.1

                                Information Opinion
Gender     Health Opinion      Support      Oppose
Male       Support                76            6
           Oppose                114           11
Female     Support               160           25
           Oppose                181           48

Source: 1988 General Social Survey, National Opinion Research Center.

8.2 Refer to Table 8.17 from the 1991 General Social Survey. White subjects were asked: (B) "Do you favor busing of (Negro/Black) and white school children from one school district to another?", (P) "If your party nominated a (Negro/Black) for President, would you vote for him if he were qualified for the job?", (D) "During the last few years, has anyone in your family brought a friend who was a (Negro/Black) home for dinner?" The response scale for each item was (yes, no, don't know). Fit model (BD, BP, DP).

a. Using the yes and no categories, estimate the conditional odds ratio for each pair of variables. Interpret.
b. Analyze the model's goodness of fit. Interpret.
c. Conduct inference for the BP conditional association using a Wald or likelihood-ratio confidence interval and test. Interpret.

TABLE 8.17 Data for Problem 8.2 (1, yes; 2, no; 3, don't know)

                            Home
President    Busing      1      2      3
1            1          41     65      0
             2          71    157      1
             3           1     17      0
2            1           2      5      0
             2           3     44      0
             3           1      0      0
3            1           0      3      1
             2           0     10      0
             3           0      0      1

Source: 1991 General Social Survey, National Opinion Research Center.

8.3 Refer to Section 8.3.2. Explain why software for which parameters sum to zero across levels of each index reports λ̂_11^AC = λ̂_22^AC = 0.514 and λ̂_12^AC = λ̂_21^AC = −0.514, with SE = 0.044 for each term.

8.4 Refer to Table 2.6. Let D = defendant's race, V = victims' race, and P = death penalty verdict. Fit the loglinear model (DV, DP, PV).

a.
Using the fitted values, estimate and interpret the odds ratio between D and P at each level of V. Note the common odds ratio property.
b. Calculate the marginal odds ratio between D and P, (i) using the fitted values, and (ii) using the sample data. Why are they equal? Contrast the odds ratio with part (a). Explain why Simpson's paradox occurs.
c. Fit the corresponding logit model, treating P as the response. Show the correspondence between parameter estimates and fit statistics.
d. Is there a simpler model that fits well? Interpret, and show the logit-loglinear connection.

8.5 Table 8.18 refers to automobile accident records in Florida in 1988.

TABLE 8.18 Data for Problem 8.5

Safety Equipment    Whether        Injury
in Use              Ejected      Nonfatal      Fatal
Seat belt           Yes             1,105         14
                    No            411,111        483
None                Yes             4,624        497
                    No            157,342      1,008

Source: Florida Department of Highway Safety and Motor Vehicles.

a. Find a loglinear model that describes the data well. Interpret associations.
b. Treating whether killed as the response, fit an equivalent logit model. Interpret the effects.
c. Since n is large, goodness-of-fit statistics are large unless the model fits very well. Calculate the dissimilarity index for the model in part (a), and interpret.

8.6 Refer to Table 8.19. Subjects were asked their opinions about government spending on the environment (E), health (H), assistance to big cities (C), and law enforcement (L).

TABLE 8.19 Data for Problem 8.6 (1, too little; 2, about right; 3, too much)

                              Cities:      1                2                3
                     Law Enforcement:  1    2    3      1    2    3      1    2    3
Environment  Health
1            1                        62   17    5     90   42    3     74   31   11
             2                        11    7    0     22   18    1     19   14    3
             3                         2    3    1      2    0    1      1    3    1
2            1                        11    3    0     21   13    2     20    8    3
             2                         1    4    0      6    9    0      6    5    2
             3                         1    0    1      2    1    1      4    3    1
3            1                         3    0    0      2    1    0      9    2    1
             2                         1    0    0      2    1    0      4    2    0
             3                         1    0    0      0    0    0      1    2    3

Source: 1989 General Social Survey, National Opinion Research Center.
TABLE 8.20 Output for Fitting Model to Table 8.19

Criteria For Assessing Goodness Of Fit
Criterion             DF    Value        Value/DF
Deviance              48    31.6695      0.6598
Pearson Chi-Square    48    26.5224      0.5526
Log Likelihood              1284.9404

                                  Standard    Wald 95%              Chi-
Parameter        DF   Estimate    Error       Confidence Limits     Square
e*h   1  1       1     2.1425     0.5566      1.0515    3.2335      14.81
e*h   1  2       1     1.4221     0.6034      0.2394    2.6049       5.55
e*h   2  1       1     0.7294     0.5667     −0.3813    1.8402       1.66
e*h   2  2       1     0.3183     0.6211     −0.8991    1.5356       0.26
e*l   1  1       1    −0.1328     0.6378     −1.3829    1.1172       0.04
e*l   1  2       1     0.3739     0.6975     −0.9931    1.7410       0.29
e*l   2  1       1    −0.2630     0.6796     −1.5949    1.0689       0.15
e*l   2  2       1     0.4250     0.7361     −1.0178    1.8678       0.33
e*c   1  1       1     1.2000     0.5177      0.1854    2.2147       5.37
e*c   1  2       1     1.3896     0.4774      0.4540    2.3253       8.47
e*c   2  1       1     0.6917     0.5605     −0.4068    1.7902       1.52
e*c   2  2       1     1.3767     0.5024      0.3921    2.3614       7.51
h*c   1  1       1    −0.1865     0.4547     −1.0777    0.7048       0.17
h*c   1  2       1     0.7464     0.4808     −0.1959    1.6886       2.41
h*c   2  1       1    −0.4675     0.4978     −1.4431    0.5081       0.88
h*c   2  2       1     0.7293     0.5023     −0.2553    1.7138       2.11
h*l   1  1       1     1.8741     0.5079      0.8786    2.8696      13.61
h*l   1  2       1     1.0366     0.5262      0.0052    2.0680       3.88
h*l   2  1       1     1.9371     0.6226      0.7168    3.1574       9.68
h*l   2  2       1     1.8230     0.6355      0.5775    3.0686       8.23
c*l   1  1       1     0.8735     0.4604     −0.0289    1.7760       3.60
c*l   1  2       1     0.5707     0.4863     −0.3824    1.5239       1.38
c*l   2  1       1     1.0793     0.4326      0.2314    1.9271       6.23
c*l   2  2       1     1.2058     0.4462      0.3312    2.0804       7.30

a. Table 8.20 shows some results, including the two-factor estimates, for the homogeneous association model. Check the fit, and interpret.
b. All estimates at category 3 of each variable equal 0. Report the estimated conditional odds ratios using the too much and too little categories for each pair of variables. Summarize the associations. Based on these results, which term(s) might you consider dropping from the model? Why?
c. Table 8.21 reports {λ̂_eh^EH} when parameters sum to zero within rows and within columns, and when parameters are zero in the first row and first column.
Show how these yield the estimated EH conditional odds ratio for the too much and too little categories. Compare to part (b). Construct a confidence interval for that odds ratio. Interpret.

TABLE 8.21 Parameter Estimates for Problem 8.6

           Sum-to-Zero Constraints          Zero for First Level
                     H                                H
E            1        2        3              1        2        3
1        0.509   −0.065   −0.445              0        0        0
2        0.166   −0.099   −0.068              0    0.309    0.720
3       −0.676    0.163    0.513              0    1.413    2.142

8.7 Refer to the loglinear models for Table 8.8.
a. Explain why the fitted odds ratios in Table 8.10 for model (GI, GL, GS, IL, IS, LS) suggest that the most likely accident case for injury is females not wearing seat belts in rural locations.
b. Fit model (GLS, GI, IL, IS). Using model parameter estimates, show that the fitted IS conditional odds ratio equals 0.44. Show that for each injury level, the estimated conditional LS odds ratio is 1.17 for G = female and 1.03 for G = male. How can you get these using the model parameter estimates?

8.8 Consider the following two-stage model for Table 8.8. The first stage is a logit model with S as the response for the three-way GLS table. The second stage is a logit model with these three variables as predictors for I in the four-way table. Explain why this composite model is sensible, fit the models, and interpret results.

8.9 Refer to the logit model in Problem 5.24. Let A = opinion on abortion.
a. Give the symbol for the loglinear model that is equivalent to this logit model.
b. Which logit model corresponds to loglinear model (AR, AP, GRP)?
c. State the equivalent loglinear and logit models for which (i) A is jointly independent of G, R, and P; (ii) there are main effects of R on A, but A is conditionally independent of G and P, given R; (iii) there is interaction between P and R in their effects on A, and G has main effects.

8.10 For a multiway contingency table, when is a logit model more appropriate than a loglinear model? When is a loglinear model more appropriate?
8.11 Using software, conduct the analyses described in this chapter for the student survey data (Table 8.3).

8.12 Standardize Table 10.6. Describe the migration patterns.

8.13 The book's Web site (www.stat.ufl.edu/~aa/cda/cda.html) has a 2 × 3 × 2 × 2 table relating responses on frequency of attending religious services, political views, opinion on making birth control available to teenagers, and opinion about a man and woman having sexual relations before marriage. Analyze these data using loglinear models.

Theory and Methods

8.14 Suppose that {μ_ij = nπ_ij} satisfy the independence model (8.1).
a. Show that λ_a^Y − λ_b^Y = log( π_+a / π_+b ).
b. Show that {all λ_j^Y = 0} is equivalent to π_+j = 1/J for all j.

8.15 Refer to the independence model, μ_ij = μα_i β_j. For the corresponding loglinear model (8.1):
a. Show that one can constrain Σ λ_i^X = Σ λ_j^Y = 0 by setting

    λ = log μ + ( Σ_h log α_h ) / I + ( Σ_h log β_h ) / J,
    λ_i^X = log α_i − ( Σ_h log α_h ) / I,
    λ_j^Y = log β_j − ( Σ_h log β_h ) / J.

b. Show that one can constrain λ_1^X = λ_1^Y = 0 by defining λ_i^X = log α_i − log α_1 and λ_j^Y = log β_j − log β_1. Then, what does λ equal?

8.16 For an I × J table, let η_ij = log μ_ij, and let a dot subscript denote the mean for that index (e.g., η_i· = Σ_j η_ij / J). Then, let λ = η_·· , λ_i^X = η_i· − η_·· , λ_j^Y = η_·j − η_·· , and λ_ij^XY = η_ij − η_i· − η_·j + η_·· .
a. Show that log μ_ij = λ + λ_i^X + λ_j^Y + λ_ij^XY. Hence, any set of positive {μ_ij} satisfies the saturated model.
b. Show that Σ_i λ_i^X = Σ_j λ_j^Y = Σ_i λ_ij^XY = Σ_j λ_ij^XY = 0.
c. For 2 × 2 tables, show that log θ = 4λ_11^XY.
d. For 2 × J tables, show that λ_11^XY = ( Σ_j log α_j ) / 2J, where α_j = μ_11 μ_2j / ( μ_21 μ_1j ), j = 2, . . . , J.
e. Alternative constraints have other odds ratio formulas. Let λ = η_11, λ_i^X = η_i1 − η_11, λ_j^Y = η_1j − η_11, and λ_ij^XY = η_ij − η_i1 − η_1j + η_11.
Then, show that the saturated model holds with λ_1^X = λ_1^Y = λ_1j^XY = λ_i1^XY = 0 for all i and j, and λ_ij^XY = log( μ_11 μ_ij / μ_1j μ_i1 ).

8.17 Suppose that all μ_ijk > 0. Let η_ijk = log μ_ijk, and consider model parameters with zero-sum constraints.
a. For the general loglinear model (8.12), define parameters in the fashion of Problem 8.16 (e.g., λ_ij^XY = η_ij· − η_i·· − η_·j· + η_··· ).
b. For model (XY, XZ, YZ) with a 2 × 2 × 2 table, show that λ_11^XY = (1/4) log θ_11(k).
c. For (XYZ) with a 2 × 2 × 2 table, show that λ_111^XYZ = (1/8) log[ θ_11(1) / θ_11(2) ]. Thus, λ_ijk^XYZ = 0 is equivalent to θ_11(1) = θ_11(2).

8.18 Two balanced coins are flipped, independently. Let X = whether the first flip resulted in a head (yes, no), Y = whether the second flip resulted in a head, and Z = whether both flips had the same result. Using this example, show that marginal independence for each pair of three variables does not imply that the variables are mutually independent.

8.19 For three categorical variables X, Y, and Z:
a. When Y is jointly independent of X and Z, show that X and Y are conditionally independent, given Z.
b. Prove that mutual independence of X, Y, and Z implies that X and Y are both marginally and conditionally independent.
c. When X is independent of Y and Y is independent of Z, does it follow that X is independent of Z? Explain.
d. When any pair of variables is conditionally independent, explain why there is no three-factor interaction.

8.20 Suppose that X and Y are conditionally independent, given Z, and X and Z are marginally independent.
a. Show that X is jointly independent of Y and Z.
b. Show that X and Y are marginally independent.
c. Show that if X and Z are conditionally (rather than marginally) independent, then X and Y are still marginally independent.

8.21 A 2 × 2 × 2 table satisfies π_i++ = π_+j+ = π_++k = 1/2 for all i, j, k. Give an example of {π_ijk} that satisfies model (a) (X, Y, Z), (b) (XY, Z), (c) (XY, YZ), (d)
(XY, XZ, YZ), and (e) (XYZ), but in each case not a simpler model.

8.22 Suppose that model (XY, XZ, YZ) holds in a 2 × 2 × 2 table, and the common XY conditional log odds ratio at the two levels of Z is positive. If the XZ and YZ conditional log odds ratios are both positive or both negative, show that the XY marginal odds ratio is larger than the XY conditional odds ratio. Hence, Simpson's paradox cannot occur for the XY association.

8.23 Show that the general loglinear model in T dimensions has 2^T terms. [Hint: It has an intercept, (T choose 1) single-factor terms, (T choose 2) two-factor terms, . . . .]

8.24 Each of T responses is binary. For dummy variables {z_1, . . . , z_T}, the loglinear model of mutual independence has the form

    log μ_{z_1, . . . , z_T} = λ_1 z_1 + ⋅⋅⋅ + λ_T z_T.

Show how to express the general loglinear model (Cox 1972).

8.25 Consider a cross-classification of W, X, Y, Z.
a. Explain why (WXZ, WYZ) is the most general loglinear model for which X and Y are conditionally independent.
b. State the model symbol for which X and Y are conditionally independent and there is no three-factor interaction.

8.26 For a four-way table with binary response Y, give the equivalent loglinear and logit models that have:
a. Main effects of A, B, and C on Y.
b. Interaction between A and B in their effects on Y, and C has main effects.
c. Repeat part (a) for a nominal response Y with a baseline-category logit model.

8.27 For a 3 × 3 table with ordered rows having scores {x_i}, identify all terms in the generalized loglinear model (8.18) for models (a) logit[ P(Y ≤ j) ] = α_j + βx_i, and (b) log[ P(Y = j)/P(Y = 3) ] = α_j + β_j x_i.

8.28 For the independence model for a two-way table, derive minimal sufficient statistics, likelihood equations, fitted values, and residual df.

8.29 For the loglinear model for an I × J table, log μ_ij = λ + λ_i^X, show that μ̂_ij = n_i+ / J and residual df = I(J − 1).
8.30 Write the log likelihood L for model (XZ, YZ). Calculate ∂L/∂λ and show that it implies μ̂_+++ = n. Show that ∂L/∂λ_i^X = n_i++ − μ_i++. Similarly, differentiate with respect to each parameter to obtain likelihood equations. Show that (8.23) and (8.24) imply the other equations, so those equations determine the ML estimates.

8.31 For model (XY, Z), derive (a) minimal sufficient statistics, (b) likelihood equations, (c) fitted values, and (d) residual df for tests of fit.

8.32 Consider the loglinear model with symbol (XZ, YZ).
a. For fixed k, show that {μ̂_ijk} equal the fitted values for testing independence between X and Y within level k of Z.
b. Show that the Pearson and likelihood-ratio statistics for testing this model's fit have the form X² = Σ_k X_k², where X_k² tests independence between X and Y at level k of Z.

8.33 Verify the df values shown in Table 8.14 for models (XY, Z), (XY, YZ), and (XY, XZ, YZ).

8.34 Verify that loglinear model (GLS, GI, LI, IS) implies logit model (8.16). Show that the conditional log odds ratio for the effect of S on I equals β_1^S − β_2^S in the logit model and λ_11^IS + λ_22^IS − λ_12^IS − λ_21^IS in the loglinear model.

8.35 Table 8.22 shows fitted values for models for four-way tables that have direct estimates.
a. Use Birch's results to verify that the entry is correct for (W, X, Y, Z). Verify its residual df.
b. Motivate the estimate and df formulas for (WX, YZ), (WXY, Z), (WXY, WZ), and (WXY, WXZ) using composite variables and the corresponding results for two-way tables [e.g., for (WXY, WZ), given W, Z is independent of the composite XY variable].

TABLE 8.22 Data for Problem 8.35 (number of levels of W, X, Y, Z denoted by H, I, J, K; estimates for other models of each type are obtained by symmetry)

Model            Expected Frequency Estimate                     Residual DF
(W, X, Y, Z)     n_h+++ n_+i++ n_++j+ n_+++k / n³                HIJK − H − I − J − K + 3
(WX, Y, Z)       n_hi++ n_++j+ n_+++k / n²                       HIJK − HI − J − K + 2
(WX, WY, Z)      n_hi++ n_h+j+ n_+++k / ( n_h+++ n )             HIJK − HI − HJ − K + H + 1
(WX, YZ)         n_hi++ n_++jk / n                               (HI − 1)(JK − 1)
(WX, WY, XZ)     n_hi++ n_h+j+ n_+i+k / ( n_h+++ n_+i++ )        HIJK − HI − HJ − IK + H + I
(WX, WY, WZ)     n_hi++ n_h+j+ n_h++k / ( n_h+++ )²              HIJK − HI − HJ − HK + 2H
(WXY, Z)         n_hij+ n_+++k / n                               (HIJ − 1)(K − 1)
(WXY, WZ)        n_hij+ n_h++k / n_h+++                          H(IJ − 1)(K − 1)
(WXY, WXZ)       n_hij+ n_hi+k / n_hi++                          HI(J − 1)(K − 1)

8.36 A T-dimensional table {n_ab...t} has I_i categories in dimension i.
a. Find minimal sufficient statistics, ML estimates of cell probabilities, and residual df for the mutual independence model.
b. Find the minimal sufficient statistics and residual df for the hierarchical model having all two-factor associations but no three-factor interactions.

8.37 Consider loglinear model (X, Y, Z) for a 2 × 2 × 2 table.
a. Express the model in the form log μ = Xβ.
b. Show that the likelihood equations X′n = X′μ̂ equate {n_ijk} and {μ̂_ijk} in the one-dimensional margins.

8.38 Apply IPF to models (a) (X, YZ) and (b) (XZ, YZ). Show that the ML estimates result within one cycle.

8.39 Given target row totals {r_i > 0} and column totals {c_j > 0}:
a. Explain how to use IPF to adjust sample proportions {p_ij} to have these totals but maintain the sample odds ratios.
b. Show how to find cell proportions that have these totals and for which all local odds ratios equal θ > 0. [Hint: Take initial values of 1.0 in all cells in the first row and in the first column. This determines all other initial cell entries such that all local odds ratios equal θ.]
c. Explain how cell proportions are determined by the marginal proportions and the local odds ratios.

8.40 Refer to Birch's results in Section 8.6.3. Show that L has individual terms converging to −∞ as log μ_i → ±∞.
Explain why positive definiteness of the information matrix implies that the solution of the likelihood equations is unique, with the likelihood maximized at that point.

Categorical Data Analysis, Second Edition. Alan Agresti
Copyright © 2002 John Wiley & Sons, Inc. ISBN: 0-471-36093-7

CHAPTER 9

Building and Extending Loglinear/Logit Models

In Chapters 5 through 7 we presented logistic regression models, which use the logit link for binomial or multinomial responses. In Chapter 8 we presented loglinear models for contingency tables, which use the log link for Poisson cell counts. Equivalences between them were discussed in Section 8.5.3.

In this chapter we discuss building and extending these models with contingency tables. In Section 9.1 we present graphs that show a model's association and conditional independence patterns. In Section 9.2 we discuss selection and comparison of loglinear models. Diagnostics for checking models, such as residuals, are presented in Section 9.3. The loglinear models of Chapter 8 treat all variables as nominal. In Section 9.4 we present loglinear models of association between ordinal variables. In Sections 9.5 and 9.6 we present generalizations that replace fixed scores by parameters. In the final section we discuss complications that occur with sparse contingency tables.

9.1 ASSOCIATION GRAPHS AND COLLAPSIBILITY

A graphical representation for associations in loglinear models indicates the pairs of conditionally independent variables. This representation helps reveal implications of models. Our presentation derives partly from Darroch et al. (1980), who used mathematical graph theory to represent certain loglinear models (called graphical models) having a conditional independence structure.

9.1.1 Association Graphs

An association graph has a set of vertices, each vertex representing a variable.
An edge connecting two variables represents a conditional association between them. For instance, loglinear model (WX, WY, WZ, YZ) lacks XY and XZ terms. It assumes independence between X and Y and between X and Z, conditional on the remaining two variables. Figure 9.1 portrays this model's association graph. The four variables form the vertices. The four edges represent pairwise conditional associations. Edges do not connect X and Y or X and Z, the conditionally independent pairs.

[FIGURE 9.1 Association graph for model (WX, WY, WZ, YZ).]

Two loglinear models with the same pairwise associations have the same association graph. For instance, this association graph is also the one for model (WX, WYZ), which adds a three-factor WYZ interaction.

A path in an association graph is a sequence of edges leading from one variable to another. Two variables X and Y are said to be separated by a subset of variables if all paths connecting X and Y intersect that subset. For instance, in Figure 9.1, W separates X and Y, since any path connecting X and Y goes through W. The subset {W, Z} also separates X and Y. A fundamental result states that two variables are conditionally independent given any subset of variables that separates them (Kreiner 1987; Whittaker 1990, p. 67). Thus, not only are X and Y conditionally independent given W and Z, but also given W alone. Similarly, X and Z are conditionally independent given W alone.

9.1.2 Collapsibility in Three-Way Contingency Tables

In Section 2.3.3 we showed that conditional associations in partial tables usually differ from marginal associations. Under certain collapsibility conditions, however, they are the same. For three-way tables, XY marginal and conditional odds ratios are identical if either Z and X are conditionally independent or if Z and Y are conditionally independent. The conditions state that the variable treated as the control (Z)
is conditionally independent of X or Y, or both. These conditions occur for loglinear models (XY, YZ) and (XY, XZ). Thus, the fitted XY odds ratio is identical in the partial tables and the marginal table for models with association graphs

    X - Y - Z   and   Y - X - Z

or even simpler models, but not for the model with graph

    X - Z - Y

in which an edge connects Z to both X and Y. The proof follows directly from the formulas for models (XY, YZ) and (XY, XZ) (Problem 9.26).

We illustrate for the student survey (Table 8.3) from Section 8.2.4, with A = alcohol use, C = cigarette use, and M = marijuana use. Model (AM, CM) specifies AC conditional independence, given M. It has association graph A - M - C. Consider the AM association. Since C is conditionally independent of A, the AM fitted conditional odds ratios are the same as the AM fitted marginal odds ratio collapsed over C. From Table 8.5, both equal 61.9. Similarly, the CM association is collapsible. The AC association is not, because M is conditionally dependent with both A and C in model (AM, CM). Thus, A and C may be marginally dependent, even though they are conditionally independent. In fact, from Table 8.5, the fitted AC marginal odds ratio for this model is 2.7.

For model (AC, AM, CM), no pair is conditionally independent. No collapsibility conditions are fulfilled. Table 8.5 showed that each pair has quite different fitted marginal and conditional associations for this model. When a model contains all two-factor effects, effects may change after collapsing over any variable.

9.1.3 Collapsibility and Logit Models

The collapsibility conditions apply also to logit models. For instance, suppose that a clinical trial studies the association between a binary treatment variable X (x_1 = 1, x_2 = 0) and a binary response Y, using data from K centers (Z). The logit model

    logit P( Y = 1 | X = i, Z = k ) = α + βx_i + β_k^Z
Since this model corresponds to loglinear model (XY, XZ, YZ), this effect may differ after collapsing the 2 x 2 x K table over centers. The estimated XY conditional odds ratio, exp(beta-hat), typically differs from the sample odds ratio in the marginal 2 x 2 table.

Next, consider the simpler model that lacks center effects,

    logit P(Y = 1 | X = i, Z = k) = alpha + beta*x_i.

For a given treatment, the success probability is identical for each center. The model satisfies a collapsibility condition, because it states that Z is conditionally independent of Y, given X. This logit model is equivalent to loglinear model (XY, XZ), for which the XY association is collapsible. So, when center effects are negligible and the simpler model fits nearly as well, the estimated treatment effect is approximately the marginal XY odds ratio.

9.1.4 Collapsibility and Association Graphs for Multiway Tables

Bishop et al. (1975, p. 47) provided a parametric collapsibility condition for multiway tables: Suppose that a model for a multiway table partitions the variables into three mutually exclusive subsets, A, B, C, such that B separates A and C. After collapsing the table over the variables in C, parameters relating variables in A and parameters relating variables in A to variables in B are unchanged.

We illustrate using model (WX, WY, WZ, YZ) (Figure 9.1). Let A = {X}, B = {W}, and C = {Y, Z}. Since the XY and XZ terms do not appear, all parameters linking set A with set C equal zero, and B separates A and C. If we collapse over Y and Z, the WX association is unchanged. Next, identify A = {Y, Z}, B = {W}, C = {X}. Then, conditional associations among W, Y, and Z remain the same after collapsing over X. This result also implies that when any variable is independent of all other variables, collapsing over it does not affect any other model terms. For instance, associations among W, X, and Y in model (WX, WY, XY, Z) are the same as in (WX, WY, XY).

When set B contains more than one variable, although parameter values are unchanged in collapsing over set C, the ML estimates of those parameters may differ slightly. A stronger collapsibility definition also requires that the estimates be identical. This condition of commutativity of fitting and collapsing holds if the model contains the highest-order term relating variables in B to each other. Asmussen and Edwards (1983) discussed this property, which relates to decomposability of tables (Note 8.2).

9.2 MODEL SELECTION AND COMPARISON

Strategies for selecting and comparing loglinear models are similar to those for logistic regression, discussed in Section 6.1. A model should be complex enough to fit well but also relatively simple to interpret, smoothing rather than overfitting the data.

9.2.1 Considerations in Model Selection

The potentially useful models are usually a small subset of the possible models. A study designed to answer certain questions through confirmatory analyses may plan to compare models that differ only by the inclusion of certain terms. Also, models should recognize distinctions between response and explanatory variables. The modeling process should concentrate on terms linking responses and terms linking explanatory variables to responses. The model should contain the most general interaction term relating the explanatory variables. From the likelihood equations, this has the effect of equating the fitted totals to the sample totals at combinations of their levels. This is natural, since one normally treats such totals as fixed.

Related to this, certain marginal totals are often fixed by the sampling design. Any potential model should include those totals as sufficient statistics, so the likelihood equations equate them to the fitted totals. Consider Table 8.8, with I = automobile injury and S = seat-belt use as responses and G = gender and L = location as explanatory variables.
Then we treat {n_{g+l+}} as fixed at each combination of G and L. For example, 20,629 women had accidents in urban locations, so the fitted counts should have 20,629 women in urban locations. To ensure this, a loglinear model should contain the GL term, which implies from its likelihood equations that {mu-hat_{g+l+} = n_{g+l+}}. Thus, the model should be at least as complex as (GL, S, I) and focus on the effects of G and L on S and I as well as on the SI association. If S is also explanatory and only I is a response, {n_{g+ls}} should be fixed.

With a single categorical response, relevant loglinear models correspond to logit models for that response. One should then use logit rather than loglinear models when the main focus is describing effects on that response.

For exploratory studies, a search among potential models may provide clues about associations and interactions. One approach first fits the model having single-factor terms, then the model having two-factor and single-factor terms, then the model having three-factor and lower terms, and so on. Fitting such models often reveals a restricted range of good-fitting models. In Section 8.4.2 we used this strategy with the automobile injury data set. Automatic search mechanisms among possible models, such as backward elimination, may also be useful but should be used with care and skepticism. Such a strategy need not yield a meaningful model.

9.2.2 Model Building for the Dayton Student Survey

In Sections 8.2.4 and 8.3.2 we analyzed the use of alcohol (A), cigarettes (C), and marijuana (M) by a sample of high school seniors. The study also classified students by gender (G) and race (R). Table 9.1 shows the five-dimensional contingency table. In selecting a model, we treat A, C, and M as responses and G and R as explanatory. Thus, a model should contain the GR term, which forces the GR fitted marginal totals to equal the sample marginal totals. Table 9.2 displays goodness-of-fit tests for several models.
TABLE 9.1 Alcohol, Cigarette, and Marijuana Use for High School Seniors

                                        Marijuana Use
                       Race = White                 Race = Other
                    Female       Male            Female       Male
Alcohol  Cigarette  Yes   No     Yes   No        Yes   No     Yes   No
Yes      Yes        405   268    453   228       23    23     30    19
         No          13   218     28   201        2    19      1    18
No       Yes          1    17      1    17        0     1      1     8
         No           1   117      1   133        0    12      0    17

Source: Harry Khamis, Wright State University.

TABLE 9.2 Goodness-of-Fit Tests for Loglinear Models for Table 9.1 (a)

Model                                      G2       df
1. Mutual independence + GR              1325.1     25
2. Homogeneous association                 15.3     16
3. All three-factor terms                   5.3      6
4a. (2) - AC                              201.2     17
4b. (2) - AM                              107.0     17
4c. (2) - CM                              513.5     17
4d. (2) - AG                               18.7     17
4e. (2) - AR                               20.3     17
4f. (2) - CG                               16.3     17
4g. (2) - CR                               15.8     17
4h. (2) - GM                               25.2     17
4i. (2) - MR                               18.9     17
5. (AC, AM, CM, AG, AR, GM, GR, MR)        16.7     18
6. (AC, AM, CM, AG, AR, GM, GR)            19.9     19
7. (AC, AM, CM, AG, AR, GR)                28.8     20

(a) G, gender; R, race; A, alcohol use; C, cigarette use; M, marijuana use.

Because many cell counts are small, the chi-squared approximation for G2 may be poor, but this index is useful for comparing models. The first model listed contains only the GR association and assumes conditional independence for the other nine pairs of associations. It fits horribly, which is no surprise. Model 2, with all two-factor terms, on the other hand, seems to fit well. Model 3, containing all the three-factor interaction terms, also fits well, but the improvement in fit is not great (difference in G2 of 15.3 - 5.3 = 10.0, based on df = 16 - 6 = 10). Thus, we consider models without three-factor terms.

Beginning with model 2, we eliminate two-factor terms. We use backward elimination, sequentially removing the term whose removal produces the smallest increase in G2 when the model is refitted. Table 9.2 shows the start of this process. Nine pairwise associations are candidates for removal from model 2 (all except GR), shown in models 4a through 4i.
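A single step of this backward-elimination strategy amounts to scanning Table 9.2 for the candidate with the smallest increase in G2. A minimal sketch (the helper name `best_drop` is ours; the G2 and df values are taken from Table 9.2):

```python
def best_drop(base, candidates):
    """Pick the term whose removal gives the smallest increase in G2.
    `base` is (G2, df) for the current model; `candidates` maps each
    removable term to (G2, df) of the model without that term."""
    term = min(candidates, key=lambda t: candidates[t][0] - base[0])
    g2_change = candidates[term][0] - base[0]
    df_change = candidates[term][1] - base[1]
    return term, round(g2_change, 1), df_change

base = (15.3, 16)                      # model 2, all two-factor terms
candidates = {                         # models 4a-4i from Table 9.2
    "AC": (201.2, 17), "AM": (107.0, 17), "CM": (513.5, 17),
    "AG": (18.7, 17), "AR": (20.3, 17), "CG": (16.3, 17),
    "CR": (15.8, 17), "GM": (25.2, 17), "MR": (18.9, 17)}
print(best_drop(base, candidates))     # CR gives the smallest increase, 0.5
```

This reproduces the first elimination discussed next: dropping CR costs only 0.5 on df = 1.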
The smallest increase in G2, compared to model 2, occurs in removing the CR term (i.e., model 4g). The increase is 15.8 - 15.3 = 0.5, with df = 17 - 16 = 1, so this elimination seems sensible. After removing it, the smallest additional increase results from removing the CG term (model 5), giving G2 = 16.7 with df = 18, a change in G2 of 0.9 based on df = 1. Removing next the MR term (model 6) yields G2 = 19.9 with df = 19, a change in G2 of 3.2 based on df = 1. Further removals have a more severe effect. For instance, removing the AG term increases G2 by 5.3, with df = 1, for a P-value of 0.02. One cannot take such P-values literally, since the data suggested these tests, but it seems safest not to drop additional terms. [See Westfall and Wolfinger (1997) and Westfall and Young (1993) for methods of adjusting P-values to account for multiple tests.]

Model 6, denoted by (AC, AM, CM, AG, AR, GM, GR), has an association graph with vertices {A, C, M, G, R} and edges for those seven pairwise terms. Every path between C and {G, R} involves a variable in {A, M}. Given the outcome on alcohol use and marijuana use, the model states that cigarette use is independent of both gender and race. Collapsing over the explanatory variables race and gender, the conditional associations between C and A and between C and M are the same as with the model (AC, AM, CM) fitted in Section 8.2.4.

Removing the GM term from this model yields model 7 in Table 9.2. Its association graph reveals that A separates {G, R} from {C, M}. Thus, all pairwise conditional associations among A, C, and M in model 7 are identical to those in model (AC, AM, CM), collapsing over G and R. In fact, model 7 does not fit poorly (G2 = 28.8 with df = 20), considering the large sample size. (Its sample dissimilarity index is Delta-hat = 0.036.) Hence, one might collapse over gender and race in studying associations among the primary variables.
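Each of the model comparisons above has df = 1, and for one degree of freedom the chi-squared right-tail probability has a closed form via the complementary error function, P(chi-squared_1 > x) = erfc(sqrt(x/2)). A quick check of the P-values quoted above (the function name is ours):

```python
import math

def chi2_sf_df1(x):
    """Right-tail probability P(chi-squared_1 > x) = erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(x / 2))

# Dropping AG raises G2 by 5.3 on df = 1: the P-value of about 0.02 quoted above
print(round(chi2_sf_df1(5.3), 3))
# The earlier eliminations (CR: 0.5, MR: 3.2) give unremarkable P-values
print(round(chi2_sf_df1(0.5), 2), round(chi2_sf_df1(3.2), 3))
```

The identity follows because a chi-squared variate with df = 1 is the square of a standard normal variate.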
An advantage of the full five-variable model is that it estimates effects of gender and race on these responses, in particular the effects of race and gender on alcohol use and the effect of gender on marijuana use.

9.2.3 Loglinear Model Comparison Statistics

Consider two loglinear models, M1 and M0, with M0 a special case of M1. By Sections 4.5.4 and 5.4.3, the likelihood-ratio statistic for testing M0 against M1 is G2(M0|M1) = G2(M0) - G2(M1). We used this statistic above in comparing pairs of models.

Let n denote a column vector of the observed cell counts {n_i}. Let mu-hat_0 and mu-hat_1 denote vectors of the fitted values {mu-hat_{0i}} and {mu-hat_{1i}} for M0 and M1. The deviance G2(M0) for the simpler model partitions into

    G2(M0) = G2(M1) + G2(M0|M1).                                   (9.1)

Just as G2(M) measures the distance of the fitted values for M from n, G2(M0|M1) measures the distance of fit mu-hat_0 from fit mu-hat_1. In this sense, decomposition (9.1) expresses a certain orthogonality: the distance of n from mu-hat_0 equals the distance of n from mu-hat_1 plus the distance of mu-hat_1 from mu-hat_0. The model comparison statistic equals

    G2(M0|M1) = 2 Sum_i n_i log(n_i / mu-hat_{0i}) - 2 Sum_i n_i log(n_i / mu-hat_{1i})
              = 2 Sum_i n_i log(mu-hat_{1i} / mu-hat_{0i}).        (9.2)

The two loglinear models have the matrix form (8.17), or log mu_0 = X_0 beta_0 and log mu_1 = X_1 beta_1. Since M0 is simpler than M1, one can express log mu_0 = X_0 beta_0 = X_1 beta_1*, where beta_1* equals beta_0 with 0 elements appended corresponding to the parameters in beta_1 but not in beta_0. Then, from (9.2),

    G2(M0|M1) = 2 n'(log mu-hat_1 - log mu-hat_0) = 2 n'(X_1 beta-hat_1 - X_1 beta-hat_1*)
              = 2 mu-hat_1'(X_1 beta-hat_1 - X_1 beta-hat_1*) = 2 mu-hat_1'(log mu-hat_1 - log mu-hat_0)
              = 2 Sum_i mu-hat_{1i} log(mu-hat_{1i} / mu-hat_{0i}),    (9.3)

where the replacement of n by mu-hat_1 follows from the likelihood equations n'X_1 = mu-hat_1'X_1 for M1 [recall (8.22)]. Statistic (9.3) has the same form as G2(M0), but with {mu-hat_{1i}} playing the role of the observed data. Note that G2(M0) is the special case of G2(M0|M1) with M1 saturated.

The Pearson difference X2(M0) - X2(M1) does not have Pearson form. It is not even necessarily nonnegative. A more appropriate Pearson statistic for comparing models is

    X2(M0|M1) = Sum_i (mu-hat_{1i} - mu-hat_{0i})^2 / mu-hat_{0i}.    (9.4)

This has the usual form with {mu-hat_{1i}} in place of {n_i}. Statistics (9.3) and (9.4) depend on the data only through the fitted values and thus only through the sufficient statistics for M1.

When M0 holds, G2(M0) and G2(M1) have asymptotic chi-squared distributions, and G2(M0|M1) is asymptotically chi-squared with df equal to the difference between the df values for M0 and M1. Haberman (1977a) showed that G2(M0|M1) and X2(M0|M1) have the same null large-sample behavior, even for fairly sparse tables. (Under certain conditions, their difference converges in probability to 0 as n increases.) When M1 holds but M0 does not, G2(M1) still has its asymptotic chi-squared distribution, but the other two statistics tend to grow unboundedly as n increases.

9.2.4 Partitioning Chi-Squared with Model Comparisons

Equation (9.1) utilizes the property by which a chi-squared statistic with df > 1 partitions into components. We used such partitionings in tests for trend with ordinal predictors in linear logit or linear probability models (Section 5.3.5) and with ordinal responses in cumulative logit models (Section 7.2). More generally, this property applies with a set of nested models to test a sequence of hypotheses. The separate tests for comparing pairs of models are asymptotically independent.

For example, a chi-squared decomposition with J - 1 models justifies the partitioning of G2 stated in Section 3.3.3 for 2 x J tables. For j = 2, ..., J, let M_j denote the model that satisfies

    theta_i = (mu_{1i} mu_{2,i+1}) / (mu_{1,i+1} mu_{2i}) = 1,   i = 1, ..., j - 1.

For M_j, the 2 x j table consisting of columns 1 through j satisfies independence. Model M_J is independence in the complete 2 x J table. Model M_h is a special case of M_j whenever h > j. By (9.2),

    G2(M_J) = G2(M_J|M_{J-1}) + G2(M_{J-1})
            = G2(M_J|M_{J-1}) + G2(M_{J-1}|M_{J-2}) + G2(M_{J-2})
            = ... = G2(M_J|M_{J-1}) + ... + G2(M_3|M_2) + G2(M_2).

From (9.3), G2(M_j|M_{j-1}) has the G2 form with the fitted values for model M_{j-1} playing the role of the observed data. Substitution of the fitted values for the two models into (9.3) shows that G2(M_j|M_{j-1}) is identical to G2 for testing independence in a 2 x 2 table: the first column combines columns 1 through j - 1 of the original table, and the second column is column j of the original table.

With several preplanned comparisons, simultaneous test procedures lessen the probability of attributing importance to sample effects that simply reflect chance variation. These procedures use adjusted significance levels. For a set of s tests for nested models, when each test has level 1 - (1 - alpha)^(1/s), the overall asymptotic P(type I error) <= alpha (Goodman 1969a). For instance, suppose that we test the fit of (WXZ, WY, XY, ZY), compare that model to (WX, WZ, XZ, WY, XY, ZY), and compare that model to (WX, WZ, XZ, WY, ZY). To ensure overall alpha = 0.05 for the s = 3 tests, use level 1 - (0.95)^(1/3) = 0.017 for each.

9.2.5 Identical Marginal and Conditional Tests of Independence

A test using G2(M0|M1) simplifies dramatically when both models have direct estimates. In that case, the models have the independence linkages necessary to ensure collapsibility. A test of conditional independence has the same result as the test of independence applied to the marginal table. Sundberg (1975) proved the following: when two direct models M0 and M1 are identical except for a pairwise association term, G2(M0|M1) is identical to G2 for testing independence in the marginal table for that pair of variables. Bishop (1971) and Goodman (1970, 1971b) have related discussion.

For instance, G2[(X, Y, Z)|(XY, Z)] tests lambda^{XY} = 0 in model (XY, Z). Thus, it tests XY conditional independence under the assumption that X and Y are jointly independent of Z. Using the two sets of fitted values, from (9.3), it equals

    2 Sum_i Sum_j Sum_k (n_{ij+} n_{++k} / n) log[ (n_{ij+} n_{++k} / n) / (n_{i++} n_{+j+} n_{++k} / n^2) ]
        = 2 Sum_i Sum_j n_{ij+} log[ n_{ij+} / (n_{i++} n_{+j+} / n) ],

which equals G2[(X, Y)] for testing independence in the marginal XY table. This is not surprising. The collapsibility conditions imply that for model (XY, Z), the marginal XY association is the same as the conditional XY association.

9.3 DIAGNOSTICS FOR CHECKING MODELS

The model comparison test using G2(M0|M1) is useful for detecting whether an extra term improves a model's fit. Cell residuals provide a cell-specific indication of model lack of fit.

9.3.1 Residuals for Loglinear Models

In Section 4.5.5 we noted that residuals for the independence model (Section 3.3.1) extend to any Poisson GLM. For cell i in a contingency table with observed count n_i and fitted value mu-hat_i, the Pearson residual is

    e_i = (n_i - mu-hat_i) / sqrt(mu-hat_i).                       (9.5)

These relate to the Pearson statistic by Sum_i e_i^2 = X2. Like the Pearson residual (6.1) for binomial models, the asymptotic variances of {e_i} are less than 1.0. They average (residual df)/(number of cells). Haberman (1973a) defined the standardized Pearson residual,

    r_i = e_i / sqrt(1 - h-hat_i),

where the leverage h-hat_i is a diagonal element of the estimated hat matrix (Section 4.5.5). This has an asymptotic standard normal distribution and is preferable to the Pearson residual. A closed-form expression applies for loglinear models having direct estimates (Haberman 1978, p. 275). Alternative residuals use components of the deviance (Section 4.5.5).

9.3.2 Student Survey Example Revisited

For Table 9.1, cross-classifying alcohol, cigarette, and marijuana use by gender and race, we suggested in Section 9.2.2 that the model with all two-factor associations is plausible. For it, the only large standardized Pearson residual equals 3.2, resulting from a fitted value of 3.1 in the cell having a count of 8. Further comparisons suggested that the simpler model (AC, AM, CM, AG, AR, GM, GR) is adequate. Its only large standardized residual equals 3.3, referring to a fitted value of 2.9 in that cell. The number of nonwhite males who did not use alcohol or marijuana but who smoked cigarettes is somewhat greater than either model predicts. The standardized Pearson residuals do not suggest problems with either model, considering the large sample size and the many cells studied.

9.3.3 Correspondence between Loglinear and Logit Residuals

In Section 8.5 we showed that logit models in contingency tables are equivalent to certain loglinear models. However, a Pearson residual for a logit model differs from a Pearson residual for a loglinear model. The numerators comparing the ith observed and fitted binomial or Poisson count are the same, since the model fitted values are the same. However, the logit model uses a fitted binomial standard deviation in the denominator [see (6.1)], whereas the loglinear model uses a fitted Poisson standard deviation [see (9.5)]. Thus, the logit Pearson residual exceeds the loglinear Pearson residual (9.5). Once standardized by dividing by estimated standard errors, the standardized Pearson residuals are identical for the two models. This is another reason for preferring standardized residuals over ordinary Pearson residuals.

9.4 MODELING ORDINAL ASSOCIATIONS

The loglinear models presented so far have a serious limitation: they treat all classifications as nominal.
If the order of a variable's categories changes in any way, the fit is the same. For ordinal classifications, these models ignore important information.

Refer to Table 9.3. Subjects were asked their opinion about a man and woman having sexual relations before marriage (always wrong, almost always wrong, wrong only sometimes, not wrong at all). They were also asked whether methods of birth control should be available to teenagers between the ages of 14 and 16 (strongly disagree, disagree, agree, strongly agree).

TABLE 9.3 Opinions about Premarital Sex and Availability of Teenage Birth Control (a)

                                    Teenage Birth Control
Premarital Sex         Strongly Disagree      Disagree            Agree               Strongly Agree
Always wrong            81 (42.4)  7.6 (80.9)  68 (51.2)  3.1 (67.6)   60 (86.4) -4.1 (69.4)   38 (67.0) -4.8 (29.1)
Almost always wrong     24 (16.0)  2.3 (20.8)  26 (19.3)  1.8 (23.1)   29 (32.5) -0.8 (31.5)   14 (25.2) -2.8 (17.6)
Wrong only sometimes    18 (30.1) -2.7 (24.4)  41 (36.3)  1.0 (36.1)   74 (61.2)  2.2 (65.7)   42 (47.4) -1.0 (48.8)
Not wrong at all        36 (70.6) -6.1 (33.0)  57 (85.2) -4.6 (65.1)  161 (143.8) 2.4 (157.4) 157 (111.4) 6.8 (155.5)

(a) Cell entries are the observed count, the independence model fit (first parenthesis), the standardized Pearson residual for the independence model fit, and the linear-by-linear association model fit (second parenthesis).
Source: 1991 General Social Survey, National Opinion Research Center.

For the loglinear model of independence, denoted by I, G2(I) = 127.6 with df = 9. The model fits poorly. Yet, adding the ordinary association term makes it saturated and unhelpful. Table 9.3 also contains fitted values and standardized residuals for independence. The residuals in the corners stand out. Sample counts are much larger than independence predicts where both responses are the most negative possible or the most positive possible. By contrast, the counts are much smaller than the fitted values where one response is the most positive and the other is the most negative.
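The independence fits and the corner pattern in Table 9.3 are easy to reproduce. The sketch below computes the independence fitted values, G2(I), and ordinary Pearson residuals; note that these are the unadjusted residuals e_i of (9.5), so they are somewhat smaller in magnitude than the standardized residuals reported in Table 9.3, which also divide by sqrt(1 - h-hat_i):

```python
import math

# Observed counts from Table 9.3 (rows: premarital sex; columns: birth control)
counts = [[81, 68, 60, 38],
          [24, 26, 29, 14],
          [18, 41, 74, 42],
          [36, 57, 161, 157]]
row_tot = [sum(r) for r in counts]
col_tot = [sum(c) for c in zip(*counts)]
n = sum(row_tot)

G2 = 0.0
pearson_resid = []
for i, row in enumerate(counts):
    resid_row = []
    for j, nij in enumerate(row):
        fit = row_tot[i] * col_tot[j] / n        # independence fitted value
        G2 += 2 * nij * math.log(nij / fit)
        resid_row.append((nij - fit) / math.sqrt(fit))
    pearson_resid.append(resid_row)

print(round(G2, 1))                              # compare with G2(I) = 127.6
# Corner residuals: large positive on the main diagonal corners,
# large negative on the off-diagonal corners
print(round(pearson_resid[0][0], 1), round(pearson_resid[3][3], 1))
print(round(pearson_resid[0][3], 1), round(pearson_resid[3][0], 1))
```

The positive corner residuals for (always wrong, strongly disagree) and (not wrong at all, strongly agree) display the positive trend discussed next.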
Cross-classifications of ordinal variables often exhibit their greatest deviations from independence in the corner cells. This pattern for Table 9.3 indicates lack of fit in the form of a positive trend. Subjects who are more willing to make birth control available to teenagers also tend to feel more tolerant about premarital sex.

Models for ordinal variables use association terms that permit trends. The models are more complex than the independence model, yet unsaturated. Models with association and interaction terms exist in situations in which nominal models are saturated. Tests with ordinal models have improved power for detecting trends.

9.4.1 Linear-by-Linear Association in Two-Way Tables

For two-way tables, a simple model for two ordinal variables assigns ordered row scores u_1 <= u_2 <= ... <= u_I and column scores v_1 <= v_2 <= ... <= v_J. The model is

    log mu_ij = lambda + lambda_i^X + lambda_j^Y + beta * u_i * v_j,    (9.6)

with constraints such as lambda_I^X = lambda_J^Y = 0. This is the special case of the saturated model (8.2) in which lambda_ij^{XY} = beta * u_i * v_j. It requires only one parameter to describe the association, whereas the saturated model requires (I - 1)(J - 1). Independence occurs when beta = 0.

The term beta * u_i * v_j represents the deviation of log mu_ij from independence. The deviation is linear in the Y scores at a fixed level of X and linear in the X scores at a fixed level of Y. In column j, for instance, the deviation is a linear function of X, having form (slope) x (score for X), with slope beta * v_j. Because of this property, (9.6) is called the linear-by-linear association model (abbreviated L x L). The model has its greatest departures from independence in the corners of the table. Birch (1965), Goodman (1979a), and Haberman (1974b) introduced special cases.

The direction and strength of the association depend on beta. When beta > 0, Y tends to increase as X increases, and expected frequencies are larger than independence predicts in cells where X and Y are both high or both low. When beta < 0, Y tends to decrease as X increases. When the data display a positive or negative trend, the L x L model usually fits much better than the independence model.

For the 2 x 2 table using the cells intersecting rows a and c with columns b and d, direct substitution shows that the model has

    log[(mu_ab mu_cd) / (mu_ad mu_cb)] = beta (u_c - u_a)(v_d - v_b).    (9.7)

This log odds ratio is stronger as |beta| increases, and it is stronger for pairs of categories that are farther apart. Simple interpretations result when u_2 - u_1 = ... = u_I - u_{I-1} and v_2 - v_1 = ... = v_J - v_{J-1}. When {u_i = i} and {v_j = j}, for instance, the local odds ratios (2.10) for adjacent rows and adjacent columns have common value e^beta. Goodman (1979a) called this case uniform association. Figure 9.2 portrays local odds ratios having uniform value.

[FIGURE 9.2 Constant odds ratio implied by the uniform association model. (Note: beta = the constant log odds ratio for adjacent rows and adjacent columns.)]

The choice of scores affects the interpretation of beta. Often, the response scale discretizes an inherently continuous scale. It is sensible to choose scores that approximate distances between midpoints of categories for the underlying scale, such as we did in measuring alcohol consumption for a linear logit model in Section 3.4.5. It is sometimes useful to standardize the scores, subtracting the mean and dividing by the standard deviation, so that

    Sum_i u_i pi_{i+} = Sum_j v_j pi_{+j} = 0,   Sum_i u_i^2 pi_{i+} = Sum_j v_j^2 pi_{+j} = 1.

Then beta represents the log odds ratio for standard-deviation distances in the X and Y directions. The L x L model tends to fit well when an underlying continuous distribution is approximately bivariate normal. For standardized scores, beta is then comparable to rho/(1 - rho^2), where rho is the underlying correlation. For weak associations, beta is approximately rho (see Becker 1989b; Goodman 1981a, b, 1985).
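The odds ratio structure implied by (9.6) and (9.7) can be verified numerically. With unit-spaced scores and arbitrary (hypothetical) main effects, every local odds ratio equals e^beta, and the rectangle formula (9.7) recovers beta from any pair of rows and columns:

```python
import math

beta = 0.3                          # hypothetical association parameter
u = [1, 2, 3, 4]                    # unit-spaced row scores
v = [1, 2, 3, 4]                    # unit-spaced column scores
lam_x = [0.5, -0.2, 0.1, 0.0]       # arbitrary row main effects
lam_y = [0.4, 0.0, -0.3, 0.2]       # arbitrary column main effects
mu = [[math.exp(lam_x[i] + lam_y[j] + beta * u[i] * v[j])
       for j in range(4)] for i in range(4)]

# Under uniform association, every local log odds ratio equals beta ...
local = [math.log(mu[i][j] * mu[i + 1][j + 1] / (mu[i][j + 1] * mu[i + 1][j]))
         for i in range(3) for j in range(3)]

# ... and formula (9.7) recovers beta from the four corner cells
corner = math.log(mu[0][0] * mu[3][3] / (mu[0][3] * mu[3][0]))
print(round(corner / ((u[3] - u[0]) * (v[3] - v[0])), 10))   # equals beta
```

The main effects cancel in every odds ratio, which is why only beta governs the association.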
9.4.2 Corresponding Logit Model for Adjacent Responses

A logit formulation of the L x L model treats Y as a response and X as explanatory. Let pi_{j|i} = P(Y = j | X = i). Using logits for adjacent response categories (Section 7.4.1),

    log(pi_{j+1|i} / pi_{j|i}) = log(mu_{i,j+1} / mu_{ij}) = (lambda_{j+1}^Y - lambda_j^Y) + beta (v_{j+1} - v_j) u_i.

For unit-spaced {v_j}, this simplifies to

    log(pi_{j+1|i} / pi_{j|i}) = alpha_j + beta * u_i,

where alpha_j = lambda_{j+1}^Y - lambda_j^Y. The same linear logit effect beta applies simultaneously for all J - 1 pairs of adjacent response categories: the odds that Y = j + 1 instead of Y = j multiply by e^beta for each unit change in X. In using equal-interval response scores, we implicitly assume that the effect of X is the same on each of the J - 1 adjacent-categories logits for Y.

9.4.3 Likelihood Equations and Model Fitting

The Poisson log likelihood L(mu) = Sum_i Sum_j n_ij log mu_ij - Sum_i Sum_j mu_ij simplifies for the L x L model (9.6) to

    L(mu) = n lambda + Sum_i n_{i+} lambda_i^X + Sum_j n_{+j} lambda_j^Y + beta Sum_i Sum_j u_i v_j n_ij
            - Sum_i Sum_j exp(lambda + lambda_i^X + lambda_j^Y + beta u_i v_j).

Differentiating L(mu) with respect to (lambda_i^X, lambda_j^Y, beta) and setting the partial derivatives equal to zero yields the likelihood equations

    mu-hat_{i+} = n_{i+},   i = 1, ..., I,
    mu-hat_{+j} = n_{+j},   j = 1, ..., J,
    Sum_i Sum_j u_i v_j mu-hat_ij = Sum_i Sum_j u_i v_j n_ij.

Iterative methods such as Newton-Raphson yield the ML fit.

Let p_ij = n_ij / n and pi-hat_ij = mu-hat_ij / n. The third likelihood equation implies that

    Sum_i Sum_j u_i v_j pi-hat_ij = Sum_i Sum_j u_i v_j p_ij.

Since the marginal distributions, and hence the marginal means and variances, are identical for the fitted and observed distributions, the third equation implies that the correlation between the scores for X and Y is the same for both distributions. The fitted counts display the same positive or negative trend as the data.

Since {u_i} and {v_j} are fixed, the L x L model (9.6) has only one more parameter (beta) than the independence model. Its residual df = IJ - [1 + (I - 1) + (J - 1) + 1] = IJ - I - J, so the model is unsaturated for all but 2 x 2 tables.

9.4.4 Sex Opinions Example

Table 9.3 also reports fitted values for the linear-by-linear association model, using scores {1, 2, 3, 4} for rows and columns. Table 9.4 shows software output. To get this, we added a variable (denoted "linlin") to the independence model, having values equal to the product of the row and column numbers.

TABLE 9.4 Output for Fitting Linear-by-Linear Association Model to Table 9.3

Criteria For Assessing Goodness Of Fit
Criterion              DF     Value
Deviance                8    11.5337
Pearson Chi-Square      8    11.5085

Parameter    Estimate   Standard Error   Wald 95% Conf. Limits   Chi-Square   Pr > ChiSq
Intercept     0.4735      0.4339         -0.3769   1.3239            1.19       0.2751
premar 1      1.7537      0.2343          1.2944   2.2129           56.01       <.0001
premar 2      0.1077      0.1988         -0.2820   0.4974            0.29       0.5880
premar 3     -0.0163      0.1264         -0.2641   0.2314            0.02       0.8972
premar 4      0.0000      0.0000          0.0000   0.0000             .           .
birth 1       1.8797      0.2491          1.3914   2.3679           56.94       <.0001
birth 2       1.4156      0.1996          1.0243   1.8068           50.29       <.0001
birth 3       1.1551      0.1291          0.9021   1.4082           80.07       <.0001
birth 4       0.0000      0.0000          0.0000   0.0000             .           .
linlin        0.2858      0.0282          0.2305   0.3412          102.46       <.0001

LR Statistics
Source    DF    Chi-Square    Pr > ChiSq
linlin     1      116.12        <.0001

Compared to the independence model, for which G2(I) = 127.6 with df = 9, the L x L model fits dramatically better [G2(L x L) = 11.5, df = 8]. This is especially noticeable in the corners, where it predicts the greatest departures from independence. The ML estimate beta-hat = 0.286 (SE = 0.028) indicates that subjects having more favorable attitudes about teen birth control also tend to have more tolerant attitudes about premarital sex. The estimated local odds ratio is exp(beta-hat) = exp(0.286) = 1.33. A 95% Wald confidence interval is exp(0.286 +/- 1.96 x 0.028), or (1.26, 1.41). The strength of association seems weak. From (9.7), however, nonlocal odds ratios are stronger.
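The "linlin" fit in Table 9.4 can be reproduced without special software by Newton-Raphson (equivalently, iteratively reweighted least squares) for the Poisson loglinear model. A sketch using numpy, starting from the closed-form independence fit and adding the u_i * v_j column; the design-matrix construction is our own, not from any particular package:

```python
import numpy as np

# Observed counts from Table 9.3 (premarital sex rows x birth control columns)
counts = np.array([[81., 68., 60., 38.],
                   [24., 26., 29., 14.],
                   [18., 41., 74., 42.],
                   [36., 57., 161., 157.]])
I, J = counts.shape
u = np.arange(1., I + 1)            # row scores {1, 2, 3, 4}
v = np.arange(1., J + 1)            # column scores {1, 2, 3, 4}

# Design matrix: intercept, row dummies, column dummies, and the u_i*v_j term
rows, cols = np.indices((I, J))
design = [np.ones(I * J)]
design += [(rows.ravel() == i).astype(float) for i in range(I - 1)]
design += [(cols.ravel() == j).astype(float) for j in range(J - 1)]
design.append((u[rows] * v[cols]).ravel())
X = np.column_stack(design)
y = counts.ravel()

# Start at the closed-form independence fit, then take Newton-Raphson steps
r, c, n = counts.sum(1), counts.sum(0), counts.sum()
beta = np.zeros(X.shape[1])
beta[0] = np.log(r[-1] * c[-1] / n)
beta[1:I] = np.log(r[:-1] / r[-1])
beta[I:I + J - 1] = np.log(c[:-1] / c[-1])
for _ in range(30):
    mu = np.exp(X @ beta)
    beta = beta + np.linalg.solve((X.T * mu) @ X, X.T @ (y - mu))

mu = np.exp(X @ beta)
G2 = 2 * np.sum(y * np.log(y / mu))
print(round(beta[-1], 4), round(G2, 2))   # approx 0.2858 and 11.53 (Table 9.4)
```

The last coefficient is beta-hat for the linlin term, and the fitted corner values reproduce those shown in Table 9.3.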
The estimated odds ratio for the four corner cells equals

    exp[beta-hat (u_4 - u_1)(v_4 - v_1)] = exp[0.286 (4 - 1)(4 - 1)] = 13.1.

This also results from the corner fitted values: (80.9 x 155.5)/(29.1 x 33.0) = 13.1.

Two sets of scores having the same spacings yield the same beta-hat and the same fit. Any other sets of equally spaced scores yield the same fit but an appropriately rescaled beta-hat. For instance, using row scores {2, 4, 6, 8} with {v_j = j} also yields G2 = 11.5, but beta-hat = 0.143 with SE = 0.014 (both half as large). For Table 9.3, one might regard categories 2 and 3 as farther apart than categories 1 and 2 or categories 3 and 4. Scores such as {1, 2, 4, 5} for rows and columns recognize this. The L x L model then has G2 = 8.8 (df = 8) and beta-hat = 0.146 (SE = 0.014).

One need not regard the scores as approximations for distances between categories, or as reasonable scalings of ordinal variables, in order for the models to be valid. They simply imply a certain pattern for the odds ratios. If the L x L model fits well with equally spaced row and column scores, the uniform local odds ratio describes the association regardless of whether the scores are sensible indexes of true distances between categories.

For scores {u_i = i} with Table 9.3, the marginal mean and standard deviation for premarital sex are 2.81 and 1.26. The standardized scores are {(i - 2.81)/1.26}, or (-1.44, -0.65, 0.15, 0.95). The standardized equal-interval scores for birth control are (-1.65, -0.69, 0.27, 1.23). For these scores, beta-hat = 0.374. Solving beta-hat = rho-hat/(1 - rho-hat^2) for rho-hat gives rho-hat = 0.333. If there is an underlying bivariate normal distribution, we estimate the correlation to be 0.333.

9.4.5 Directed Ordinal Test of Independence

For the linear-by-linear association model, H0: independence is H0: beta = 0. The likelihood-ratio test statistic equals

    G2(I | L x L) = G2(I) - G2(L x L).

Designed to detect positive or negative trends, it has df = 1.
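The standardized scores and the implied correlation estimate above can be reproduced directly from the marginal counts of Table 9.3 (the helper name is ours):

```python
import math

def standardized_scores(margin):
    """Standardize equal-interval scores 1..k using a marginal distribution."""
    n = sum(margin)
    p = [m / n for m in margin]
    raw = range(1, len(margin) + 1)
    mean = sum(s * q for s, q in zip(raw, p))
    sd = math.sqrt(sum(s * s * q for s, q in zip(raw, p)) - mean ** 2)
    return [(s - mean) / sd for s in raw]

premarital = [247, 93, 175, 411]       # row margins of Table 9.3
birth = [159, 192, 324, 251]           # column margins of Table 9.3
print([round(s, 2) for s in standardized_scores(premarital)])
print([round(s, 2) for s in standardized_scores(birth)])

# Back out the underlying correlation from beta-hat = rho/(1 - rho^2):
# the positive root of beta*rho^2 + rho - beta = 0
beta_hat = 0.374
rho = (-1 + math.sqrt(1 + 4 * beta_hat ** 2)) / (2 * beta_hat)
print(round(rho, 3))
```

This reproduces the standardized scores (-1.44, -0.65, 0.15, 0.95) and (-1.65, -0.69, 0.27, 1.23) and the correlation estimate 0.333.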
For Table 9.3, G2(I | L x L) = 127.6 - 11.5 = 116.1. This has P < 0.0001: extremely strong evidence of an association. The Wald statistic z^2 = (beta-hat/SE)^2 = (0.286/0.0282)^2 = 102.5 (df = 1) also shows strong evidence. The correlation statistic (3.15) presented in Section 3.4.1 for testing independence is the score statistic for H0: beta = 0 in this model. It equals 112.6 (df = 1).

When the L x L model holds, the ordinal test using G2(I | L x L) is asymptotically more powerful than the test using G2(I). This is true for the same reason given in Section 6.4.2 for the linear logit model. The power of a chi-squared test increases as df decrease, for fixed noncentrality. When the L x L model holds, the noncentrality is the same for G2(I | L x L) and G2(I); thus G2(I | L x L) is more powerful, since it has df = 1 compared to df = (I - 1)(J - 1) for G2(I). The power advantage increases as I and J increase, since the noncentrality remains focused on df = 1 for G2(I | L x L) while the df for G2(I) increases.

9.5 ASSOCIATION MODELS*

Generalizations of the linear-by-linear association model apply to multiway tables or treat scores as parameters rather than as fixed constants. The models are called association models, because they focus on the association structure.

9.5.1 Row and Column Effects Models

We first present a model that treats X as nominal and Y as ordinal. It is appropriate for two-way tables with ordered columns, using scores v_1 <= v_2 <= ... <= v_J. Since the rows are unordered, they do not have scores. Replacing the ordered values {beta * u_i} in the linear-by-linear term beta * u_i * v_j in model (9.6) by unordered parameters {mu_i} gives

    log mu_ij = lambda + lambda_i^X + lambda_j^Y + mu_i v_j.    (9.8)

Constraints are needed, such as lambda_I^X = lambda_J^Y = mu_I = 0. The {mu_i} are called row effects, and the model is called the row effects model. Model (9.8) has I - 1 more parameters (the {mu_i}) than the independence model.
Independence is the special case μ_1 = ··· = μ_I. A corresponding column effects model has association term u_i ν_j. It treats X as ordinal with scores {u_i} and Y as nominal with parameters {ν_j}. The row effects and column effects models were developed by Goodman (1979a), Haberman (1974b), and Simon (1974).

9.5.2 Logit Model for Adjacent Responses

With {v_{j+1} − v_j = 1}, the row effects model has the adjacent-categories logit form

    log[P(Y = j + 1 | X = i) / P(Y = j | X = i)] = α_j + μ_i.    (9.9)

The effect in row i is identical for each pair of adjacent responses. Plots of these logits against i (i = 1, ..., I) for different j are parallel. Goodman (1983) referred to model (9.9) as the parallel odds model.

Differences among {μ_i} compare rows with respect to their conditional distributions on Y. When μ_i = μ_h, rows h and i have identical conditional distributions. If μ_i > μ_h, Y is stochastically higher in row i than in row h.

The likelihood equations for the row effects model (9.8) are {μ̂_{i+} = n_{i+}}, {μ̂_{+j} = n_{+j}}, and

    Σ_j v_j μ̂_ij = Σ_j v_j n_ij,  i = 1, ..., I.

Let π̂_{j|i} = μ̂_ij/μ̂_{i+} and p_{j|i} = n_ij/n_{i+}. Since μ̂_{i+} = n_{i+}, the third likelihood equation is Σ_j v_j π̂_{j|i} = Σ_j v_j p_{j|i}. For the conditional distribution within each row, the mean column score is the same for the fitted and sample distributions. The likelihood equations are solved using iterative methods.

TABLE 9.5 Observed Frequencies and Fitted Values for Political Ideology Data

                            Political Ideology
Party Affiliation   Liberal                Moderate               Conservative           Total
Democrat            143 (102.0)^1 (136.6)^2   156 (161.4) (168.7)   100 (135.6) (93.6)    399
Independent         119 (120.2) (123.8)       210 (190.1) (200.4)   141 (159.7) (145.8)   470
Republican           15 (54.7) (16.6)          72 (86.6) (68.9)     127 (72.7) (128.6)    214

^1 Independence model; ^2 row effects model.
Source: Based on data in R. D. Hedlund, Public Opinion Quart. 41: 498–514 (1978).
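The adjacent-categories form (9.9) follows from the row effects model (9.8) in one line, since the λ and λ_i^X terms cancel in the ratio of adjacent expected frequencies:

```latex
\log\frac{P(Y=j+1 \mid X=i)}{P(Y=j \mid X=i)}
  = \log\frac{\mu_{i,j+1}}{\mu_{ij}}
  = \bigl(\lambda^{Y}_{j+1}-\lambda^{Y}_{j}\bigr) + \mu_i\bigl(v_{j+1}-v_j\bigr),
```

which equals α_j + μ_i when v_{j+1} − v_j = 1, with α_j = λ^Y_{j+1} − λ^Y_j.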
9.5.3 Political Ideology Example

Table 9.5 displays the relationship between political ideology and political party affiliation for a sample of voters in a presidential primary in Wisconsin. The table shows fitted values for the independence (I) model and the row effects (R) model with {v_j = j}. Table 9.6 shows output. Goodness-of-fit tests show that independence is inadequate. Adding the row effect parameters improves the fit greatly (G²(I) = 105.7, df = 4; G²(R) = 2.8, df = 2). Also, testing H_0: μ_1 = μ_2 = μ_3 using G²(I | R) = 102.9 (df = 2) shows very strong evidence of an association. In Table 9.5, the improved fit is especially noticeable at the ends of the ordinal scale, where the model has its greatest deviation from independence.

The output uses dummy variables for the first two categories of each classification. The interaction term equals the product of the score for ideology and a parameter for party. Thus, the row effect estimates satisfy μ̂_3 = 0, and the other two estimates contrast the first two parties with Republicans. The estimates are μ̂_1 = −1.213 and μ̂_2 = −0.943. The farther μ̂_i falls in the negative direction, the greater the tendency for party i to locate at the liberal end of the ideology scale, relative to Republicans. In this sample the Republicans are much more conservative than the other two groups, and the Democrats (row 1) are the most liberal.

From (9.9), the model predicts constant odds ratios for adjacent columns of political ideology. For instance, since μ̂_3 − μ̂_1 = 1.213, the estimated odds that Republicans were conservative instead of moderate, or moderate instead of liberal, were exp(1.213) = 3.36 times the corresponding estimated odds for Democrats. Figure 9.3 shows the parallelism of the estimated logits for the row effects model.

The loglinear model does not distinguish between response and explanatory variables.
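The reported estimates can be cross-checked against the fitted values in Table 9.5 and the output in Table 9.6; a small sketch with the numbers keyed in from those tables:

```python
import math

# Row effect estimates (from the Table 9.6 output) and the implied
# adjacent-categories logit alpha_1 + mu_i for liberal vs. moderate
mu = {"Democrat": -1.2134, "Independent": -0.9426, "Republican": 0.0}
alpha1 = -0.6244 - (-2.0488)          # lambda^Y_2 - lambda^Y_1

logit_dem = alpha1 + mu["Democrat"]
logit_rep = alpha1 + mu["Republican"]

# The same logits from the row-effects fitted values in Table 9.5
fitted_dem = math.log(168.7 / 136.6)
fitted_rep = math.log(68.9 / 16.6)

# Constant adjacent-categories odds ratio, Republicans vs. Democrats
odds_ratio = math.exp(mu["Republican"] - mu["Democrat"])
print(round(logit_dem, 3), round(fitted_dem, 3), round(odds_ratio, 2))
```

The model-based and fitted-value logits agree to rounding error, and the odds ratio reproduces exp(1.213) = 3.36.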
Instead, one could use a cumulative logit model to describe the effects of party affiliation on ideology, or a baseline-category logit model to describe linear effects of ideology on party affiliation.

TABLE 9.6 Output for Fitting Row Effects Model to Table 9.5

    Criteria For Assessing Goodness Of Fit
    Criterion             DF     Value
    Deviance               2    2.8149
    Pearson Chi-Square     2    2.8039

    Parameter                Estimate  Std Error   Wald 95% Conf. Limits   Chi-Square  Pr > ChiSq
    Intercept                 4.8565    0.0858      4.6883    5.0246        3204.02      <.0001
    party        Democ        3.3230    0.3188      2.6981    3.9479         108.63      <.0001
    party        Indep        2.9536    0.3149      2.3364    3.5707          87.98      <.0001
    party        Repub        0.0000    0.0000      0.0000    0.0000           .           .
    ideology     1           -2.0488    0.2216     -2.4831   -1.6145          85.50      <.0001
    ideology     2           -0.6244    0.1139     -0.8476   -0.4013          30.08      <.0001
    ideology     3            0.0000    0.0000      0.0000    0.0000           .           .
    score*party  Democ       -1.2134    0.1304     -1.4690   -0.9577          86.56      <.0001
    score*party  Indep       -0.9426    0.1260     -1.1896   -0.6956          55.95      <.0001
    score*party  Repub        0.0000    0.0000      0.0000    0.0000           .           .

    LR Statistics
    Source        DF   Chi-Square   Pr > ChiSq
    score*party    2     102.85       <.0001

FIGURE 9.3 Observed and predicted logits for adjacent response categories.

9.5.4 Ordinal Variables in Models for Multiway Tables

Multidimensional tables with ordinal responses can use generalizations of association models. In three dimensions, the rich collection of models includes (1) association models that are more parsimonious than the nominal model (XY, XZ, YZ), and (2) models permitting heterogeneous association that, unlike model (XYZ), are unsaturated. Models for association that are special cases of (XY, XZ, YZ) replace λ_ij^XY association terms by structured terms that account for ordinality.
For instance, when both X and Y are ordinal, alternatives to λ_ij^XY are a linear-by-linear term βu_i v_j, a row effects term μ_i v_j, or a column effects term u_i ν_j; these provide a stochastic ordering of conditional distributions within rows and within columns, or just within rows, or just within columns. With a linear-by-linear term, the model is

    log μ_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + βu_i v_j + λ_ik^XZ + λ_jk^YZ.    (9.10)

The conditional local odds ratios (8.13) then satisfy

    log θ_ij(k) = β(u_{i+1} − u_i)(v_{j+1} − v_j)  for all k.

The association is the same in different partial tables, with homogeneous linear-by-linear XY association.

When the association is heterogeneous, structured terms for ordinal variables make effects simpler to interpret than in the saturated model. For instance, the heterogeneous linear-by-linear XY association model

    log μ_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + β_k u_i v_j + λ_ik^XZ + λ_jk^YZ    (9.11)

allows the XY association to change across levels of Z. With unit-spaced scores, log θ_ij(k) = β_k for all i and j. It has uniform association within each level of Z, but heterogeneity among levels of Z in the strength of association. Fitting it corresponds to fitting the L × L model (9.6) separately at each level of Z.

9.5.5 Air Pollution and Breathing Examples

Table 9.7 displays associations among smoking status (S), breathing test results (B), and age (A) for workers in certain industrial plants in Houston, Texas.

TABLE 9.7 Cross-Classification of Industrial Workers by Breathing Test Results

                                 Breathing Test Results
Age      Smoking Status     Normal   Borderline   Abnormal
< 40     Never smoked         577        27           7
         Former smoker        192        20           3
         Current smoker       682        46          11
40–59    Never smoked         164         4           0
         Former smoker        145        15           7
         Current smoker       245        47          27

Source: From p. 21 of Public Program Analysis by R. N. Forthofer and R. G. Lehnen. Copyright © 1981 by Lifetime Learning Publications, Belmont, CA 94002, a division of Wadsworth, Inc.
Reprinted by permission of Van Nostrand Reinhold. All rights reserved.

The loglinear model (SA, SB, BA) fits poorly (G² = 25.9, df = 4). Thus, simpler models such as homogeneous linear-by-linear SB association are not plausible (G² = 29.1, df = 7, using equally spaced scores). The heterogeneous linear-by-linear SB association model fits much better with only one additional parameter (G² = 10.8, df = 6). With integer scores for S and B, β̂_1 = 0.115 for the younger group and β̂_2 = 0.781 for the older group, with SE = 0.167 for the difference. The effect of smoking seems much stronger for the older group, with estimated local odds ratio of exp(0.781) = 2.18, compared with exp(0.115) = 1.12 for the younger group. Here, it may be more natural to use logit models with B as the response variable (Problem 7.11).

When strata are ordered, roughly a linear trend may exist across strata in certain log odds ratios, as Table 9.8 illustrates. The data refer to a sample of coal miners, measured on B = breathlessness, W = wheeze, and A = age, where B and W are response variables. One could use a separate logit model to describe the effects of age on each response. To study whether the BW association varies by age, we fit model (BW, AB, AW). It has residual G² = 26.7, with df = 8. Table 9.8 reports the standardized Pearson residuals. They show a decreasing tendency as age increases. This suggests the model

    log μ_ijk = (BW, AB, AW) + kI(i = j = 1)δ,    (9.12)

where I(·) is the indicator function. It amends the homogeneous association model by adding δ in the cell for μ_111, ..., and 9δ in the cell for μ_119. Then the BW log odds ratio changes linearly in the age category. The model fit has δ̂ = −0.131 (SE = 0.029). The estimated BW log odds ratio at level k of age is 3.676 − 0.131k, decreasing from 3.55 to 2.50. The model has residual G² = 6.80 (df = 7). McCullagh and Nelder (1989, Sec. 6.6) showed other analyses.
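A numerical companion to the two examples above (the Wald z comparing the two age-specific slopes is our own arithmetic from the reported estimates and the SE of their difference, not a statistic quoted in the text):

```python
import math

# Breathing-test example: compare the two age-specific L x L slopes
b_young, b_old, se_diff = 0.115, 0.781, 0.167   # reported estimates
z = (b_old - b_young) / se_diff                 # Wald comparison of slopes
or_young, or_old = math.exp(b_young), math.exp(b_old)

# Coal-miner example: BW log odds ratio 3.676 - 0.131k at age level k
log_or = [3.676 - 0.131 * k for k in range(1, 10)]

print(f"z = {z:.2f}; local ORs {or_young:.2f} (young), {or_old:.2f} (old)")
print(f"BW log odds ratio: {log_or[0]:.3f} down to {log_or[-1]:.3f}")
```

The slopes differ by about four standard errors, consistent with the heterogeneous-association model's better fit, and the trend endpoints match the 3.55 and 2.50 quoted above to rounding.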
TABLE 9.8 Coal Miners Classified by Breathlessness, Wheeze, and Age

                 Breathlessness: Yes         Breathlessness: No
Age          Wheeze: Yes   Wheeze: No    Wheeze: Yes   Wheeze: No    Std. Pearson Residual^a
20–24              9            7              95          1841             0.75
25–29             23            9             105          1654             2.20
30–34             54           19             177          1863             2.10
35–39            121           48             257          2357             1.77
40–44            169           54             273          1778             1.13
45–49            269           88             324          1712            −0.42
50–54            404          117             245          1324             0.81
55–59            406          152             225           967            −3.65
60–64            372          106             132           526            −1.44

^a Residual refers to the yes–yes and no–no cells; reverse the sign for the yes–no and no–yes cells.
Source: Reprinted with permission from Ashford and Sowden (1970).

9.5.6 Other Ordinal Tests of Conditional Independence

Tests of conditional independence of ordinal classifications can generalize G²(I | L × L). For instance, one can compare the XY conditional independence model (XZ, YZ) to the homogeneous linear-by-linear XY association model (9.10). This tests β = 0 in that model, with df = 1. It is an alternative to the ordinal test of conditional independence in Section 7.5.3. Like Mantel's score statistic (7.21), this statistic uses correlation information, since Σ_k(Σ_i Σ_j u_i v_j n_ijk) is the sufficient statistic for β in model (9.10). In fact, the Mantel statistic provides the score test of H_0: β = 0 in that model. Exact, small-sample tests can use likelihood-ratio, score, or Wald statistics for such models. Computations require special algorithms (Agresti et al. 1990; Kim and Agresti 1997).

9.6 ASSOCIATION MODELS, CORRELATION MODELS, AND CORRESPONDENCE ANALYSIS*

The linear-by-linear association (L × L) model is a special case of the row effects (R) model, which has parameter row scores, and of the column effects (C) model, which has parameter column scores. These models are special cases of a more general model with both row and column parameter scores.

9.6.1 Multiplicative Row and Column Effects Model

Replacing {u_i} and {v_j} in the L × L model (9.6) by parameters yields the row and column effects (RC)
model (Goodman 1979a),

    log μ_ij = λ + λ_i^X + λ_j^Y + βμ_i ν_j.    (9.13)

Identifiability requires location and scale constraints on {μ_i} and {ν_j}. The residual df = (I − 2)(J − 2). This model is not loglinear, because the predictor is a multiplicative (rather than linear) function of the parameters μ_i and ν_j. It treats the classifications as nominal; the same fit results from a permutation of rows or columns. Parameter interpretation is simplest when at least one variable is ordinal, through the local log odds ratios

    log θ_ij = β(μ_{i+1} − μ_i)(ν_{j+1} − ν_j).

Although it may seem appealing to use parameters instead of arbitrary scores, the RC model presents complications that do not occur with loglinear models. The likelihood may not be concave and may have local maxima. Independence is a special case, but it is awkward to test independence using the RC model. Haberman (1981) showed that the null distribution of G²(I) − G²(RC) is not chi-squared but rather that of the maximum eigenvalue from a Wishart matrix.

When one set of parameter scores is fixed, the RC model simplifies to the R or C model. Goodman (1979a) suggested an iterative model-fitting algorithm that exploits this. A cycle of the algorithm has two steps. First, for some initial guess of {ν_j}, it estimates the row scores as in the R model. Then, treating the estimated row scores from the first step as fixed, it estimates the column scores as in the C model. Those estimates serve as fixed column scores in the first step of the next cycle, for reestimating the row scores in the R model. There is no guarantee of convergence to ML estimates, but this seems to happen when the model fits well. Haberman (1995) provided more sophisticated fitting methods for association models.

Goodman (1985)
expressed the association term in the saturated model in a form that generalizes the βμ_i ν_j term in the RC model, namely,

    λ_ij^XY = Σ_{k=1}^{M} β_k μ_ik ν_jk,    (9.14)

where M = min(I − 1, J − 1). The parameters satisfy constraints such as

    Σ_i μ_ik π_{i+} = Σ_j ν_jk π_{+j} = 0   for all k,
    Σ_i μ²_ik π_{i+} = Σ_j ν²_jk π_{+j} = 1   for all k,    (9.15)
    Σ_i μ_ik μ_ih π_{i+} = Σ_j ν_jk ν_jh π_{+j} = 0   for all k ≠ h.

When β_k = 0 for k > M*, model (9.14) is called the RC(M*) model. See Becker (1990) for ML model fitting. The RC model (9.13) is the case M* = 1.

TABLE 9.9 Cross-Classification of Mental Health Status and Socioeconomic Status

                                       Mental Health Status
Parents'                         Mild Symptom    Moderate Symptom
Socioeconomic Status    Well      Formation         Formation       Impaired
A (high)                 64          94                58              46
B                        57          94                54              40
C                        57         105                65              60
D                        72         141                77              94
E                        36          97                54              78
F (low)                  21          71                54              71

Source: Reprinted with permission from L. Srole et al., Mental Health in the Metropolis: The Midtown Manhattan Study (New York: NYU Press, 1978), p. 289.

9.6.2 Mental Health Status Example

Table 9.9 describes the relationship between child's mental impairment and parents' socioeconomic status for a sample of residents of Manhattan (Goodman 1979a). The RC model fits well (G² = 3.6, df = 8). For scaling (9.15), the ML estimates are (−1.11, −1.12, −0.37, 0.03, 1.01, 1.82) for the row scores, (−1.68, −0.14, 0.14, 1.41) for the column scores, and β̂ = 0.17. Nearly all estimated local log odds ratios are positive, indicating a tendency for mental health to be better at higher levels of parents' SES.

Ordinal loglinear models also fit well. For equal-interval scores, G²(L × L) = 9.9 (df = 14). The statistic G²(L × L | RC) = 6.3 (df = 6) tests that the row and column scores in the RC model are equal-interval. The parameter scores do not provide a significantly better fit. It is sufficient to use a uniform local odds ratio to describe the table.
For unit-spaced scores, β̂ = 0.091 (SE = 0.015), so the fitted local odds ratio is exp(0.091) = 1.09. There is strong evidence of positive association, but the degree of association is rather weak, at least locally.

9.6.3 Correlation Models

A correlation model for two-way tables has many features in common with the RC model (Goodman 1985). In its simplest form, it is

    π_ij = π_{i+} π_{+j}(1 + λμ_i ν_j),    (9.16)

where {μ_i} and {ν_j} are score parameters satisfying

    Σ_i μ_i π_{i+} = Σ_j ν_j π_{+j} = 0   and   Σ_i μ²_i π_{i+} = Σ_j ν²_j π_{+j} = 1.

The parameter λ is the correlation between the scores for the joint distribution (9.16). The correlation model is also called the canonical correlation model, because the ML estimates of the scores maximize the correlation for (9.16).

The general canonical correlation model is

    π_ij = π_{i+} π_{+j}(1 + Σ_{k=1}^{M} λ_k μ_ik ν_jk),

where 0 ≤ λ_M ≤ ··· ≤ λ_1 ≤ 1 and with constraints such as in (9.15). The parameter λ_k is the correlation between {μ_ik, i = 1, ..., I} and {ν_jk, j = 1, ..., J}. The {μ_i1} and {ν_j1} are standardized scores that maximize the correlation λ_1 for the joint distribution; {μ_i2} and {ν_j2} are standardized scores that maximize the correlation λ_2, subject to {μ_i1} and {μ_i2} being uncorrelated and {ν_j1} and {ν_j2} being uncorrelated; and so on. Unsaturated models result from replacing M by M* < min(I − 1, J − 1). Gilula and Haberman (1986) and Goodman (1985) discussed ML fitting. When λ is close to zero in (9.16), Goodman (1981a, 1985, 1986) noted that ML estimates of λ and the score parameters are similar to those of β and the score parameters in the RC model. Correlation models can also use fixed scores instead of parameter scores.

Goodman discussed advantages of association models over correlation models.
The correlation model is not defined for all possible combinations of score values, because of the constraint 0 ≤ π_ij ≤ 1; ML fitted values do not have the same marginal totals as the observed data; and the model does not generalize simply to multiway tables. Gilula and Haberman (1988) analyzed multiway tables with correlation models by treating the explanatory variables as a single variable and the response variables as a second variable.

9.6.4 Correspondence Analysis

Correspondence analysis is a graphical way to represent associations in two-way contingency tables. The rows and columns are represented by points on a graph, the positions of which indicate associations. Goodman (1985, 1986) noted that the coordinates of the points are reparameterizations of {μ_ik} and {ν_jk} in the general canonical correlation model. Correspondence analysis uses the adjusted scores

    x_ik = λ_k μ_ik,   y_jk = λ_k ν_jk.

These are close to zero for dimensions k in which the correlation λ_k is close to zero. A correspondence analysis graph uses the first two dimensions, plotting (x_i1, x_i2) for each row and (y_j1, y_j2) for each column.

TABLE 9.10 Scores from Correspondence Analysis Applied to Table 9.9

                        Dimension
Column Score       1         2         3
    1            0.260     0.012     0.023
    2            0.030     0.024    −0.019
    3           −0.013    −0.069    −0.002
    4           −0.236     0.019     0.016

                        Dimension
Row Score          1         2         3
    1            0.181    −0.018     0.042
    2            0.185     0.028     0.011
    3            0.059    −0.011     0.044
    4           −0.008    −0.026    −0.009
    5           −0.164    −0.021    −0.061
    6           −0.287    −0.010     0.005

Source: Reprinted with permission from the Institute of Mathematical Statistics, based on Goodman (1985).

Goodman (1985, 1986) used Table 9.9 to illustrate the similarities of correspondence analysis to analyses using correlation models and association models. For the general canonical correlation model, M = min(I − 1, J − 1) = 3. Its estimated squared correlations are (0.0260, 0.0014, 0.0003). The association is rather weak.
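These squared correlations can be reproduced directly from Table 9.9: the canonical correlations of a two-way table are the singular values of the standardized residual matrix, the standard computation underlying correspondence analysis programs. A minimal sketch:

```python
import numpy as np

# Table 9.9 (mental health status by parents' SES, rows A-F)
N = np.array([[64,  94, 58, 46],
              [57,  94, 54, 40],
              [57, 105, 65, 60],
              [72, 141, 77, 94],
              [36,  97, 54, 78],
              [21,  71, 54, 71]], dtype=float)

P = N / N.sum()                  # correspondence matrix
r = P.sum(axis=1)                # row masses
c = P.sum(axis=0)                # column masses

# Standardized residual matrix; its singular values are the
# canonical correlations lambda_k
S = np.diag(r ** -0.5) @ (P - np.outer(r, c)) @ np.diag(c ** -0.5)
lam = np.linalg.svd(S, compute_uv=False)[:3]

share = lam[0] ** 2 / (lam ** 2).sum()
print(np.round(lam ** 2, 4))     # squared correlations
print(round(share, 2))           # share of the first dimension
```

The computation recovers (0.0260, 0.0014, 0.0003) and shows the first dimension accounting for about 94% of the total squared correlation, as noted below.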
Table 9.10 contains the estimated row and column scores for the correspondence analysis of these three dimensions. Both sets of scores in the first dimension fall in a monotone pattern, except for a slight discrepancy between the first two row scores. This indicates an overall positive association. The scores for the second and third dimensions are close to zero, reflecting the relatively small λ̂_2 and λ̂_3.

Figure 9.4 exhibits the results of the correspondence analysis. The horizontal axis has the estimates for the first dimension, and the vertical axis has the estimates for the second dimension. Six points (circles) represent the six rows, with point i giving (x̂_i1, x̂_i2). Similarly, four points (squares) display the estimates (ŷ_j1, ŷ_j2). Both sets of points lie close to the horizontal axis, since the first dimension is more important than the second.

FIGURE 9.4 Graphical display of scores from the first two dimensions of the correspondence analysis. [Based on Escoufier (1982); reprinted with permission.]

Row points that are close together represent rows with similar conditional distributions across the columns. Close column points represent columns with similar conditional distributions across the rows. Row points close to column points represent combinations that are more likely than expected under independence. Figure 9.4 shows a tendency for subjects at the high end of one scale to be at the high end of the other, and for subjects at the low end of one to be at the low end of the other.

Correspondence analysis is used mainly as a descriptive tool. Goodman (1986) developed inferential methods for it. For Table 9.9, inferential analysis reveals that the first dimension, accounting for 94% of the total squared correlation, is adequate for describing the association.
Goodman argued for choosing the unsaturated model employing only one dimension and having graphics display fitted scores for that dimension alone. Then, correspondence analysis is equivalent to an ML analysis using correlation model (9.16). The estimated scores for that model are (−1.09, −1.17, −0.37, 0.05, 1.01, 1.80) for the rows and (−1.60, −0.19, 0.09, 1.48) for the columns. The model fits well (G² = 2.75, df = 8). The quality of fit and the estimated scores are similar to those we saw in Section 9.6.2 for the RC model. More parsimonious correlation models also fit these data well, such as ones using equally spaced scores.

All analyses of Table 9.9 have yielded similar conclusions about the association. They all neglect, however, that mental health is a natural response variable. It may make more sense to use an ordinal logit model. A severe limitation of correspondence analysis, as of correlation models, is that it does not generalize simply to multiway tables. Greenacre (1993) showed displays of several pairwise associations in a single plot.

9.6.5 Model Selection and Score Choice for Ordinal Variables

The past three sections showed several ways to use category orderings in model building. With allowance for ordinal effects, the variety of potential models is much greater than with standard loglinear models. To choose among models, one approach uses the standard models for guidance. If a standard model fits well, simplify by replacing some parameters with structured terms for ordinal classifications.

Association, correlation, and correspondence analysis models have scores for categories of ordinal variables. Parameter interpretations are simplest for equally spaced scores. With parameter scores, the resulting ML estimates of scores need not be monotone. Constrained versions of the models force monotonicity by maximizing the likelihood subject to order restrictions (e.g., Agresti et al. 1987; Ritov and Gilula 1991).
Disadvantages exist, however, of treating scores as parameters. The model becomes less parsimonious, and tests of effects may be less powerful because of a greater df value (recall Section 6.4.3). When one variable alone is a response, cumulative link models (Sections 7.2 and 7.3) for that response do not require preassigned or parameter scores.

9.7 POISSON REGRESSION FOR RATES

Loglinear models need not refer to contingency tables. In Section 4.3 we introduced Poisson regression for modeling counts. When outcomes occur over time, space, or some other index of size, it is more relevant to model their rate of occurrence than their raw number.

9.7.1 Analyzing Rates Using Loglinear Models with Offsets

When a response count n_i has index equal to t_i, the sample rate is n_i/t_i. Its expected value is μ_i/t_i. With an explanatory variable x, a loglinear model for the expected rate has form

    log(μ_i/t_i) = α + βx_i.    (9.17)

This model has the equivalent representation

    log μ_i − log t_i = α + βx_i.

As noted in Section 8.7.4, the adjustment term, −log t_i, to the log link of the mean is called an offset. The fit corresponds to using log t_i as a predictor on the right-hand side and forcing its coefficient to equal 1.0. For model (9.17), the expected response count satisfies

    μ_i = t_i exp(α + βx_i).

The mean is proportional to the index, with proportionality constant depending on the value of x.

The identity link is also sometimes useful. The model is then μ_i/t_i = α + βx_i, or μ_i = αt_i + βx_i t_i. This does not require an offset. It corresponds to an ordinary Poisson GLM using the identity link with t_i and x_i t_i as explanatory variables and no intercept. It provides additive, rather than multiplicative, predictor effects. It is less useful with many predictors, as the fitting process may fail because of negative fitted counts at some iteration.

9.7.2 Modeling Death Rates for Heart Valve Operations

Laird and Olivier (1981)
analyzed patient survival after heart valve replacement operations. A sample of 109 patients were classified by type of heart valve (aortic, mitral) and by age (<55, ≥55). Follow-up observations occurred until the patient died or the study ended.

TABLE 9.11 Data on Heart Valve Replacement Operations

                            Type of Heart Valve
Age                         Aortic      Mitral
< 55   Deaths                  4            1
       Time at risk         1259         2082
       Death rate           0.0032       0.0005
55+    Deaths                  7            9
       Time at risk         1417         1647
       Death rate           0.0049       0.0055

Source: Reprinted with permission, based on data in Laird and Olivier (1981).

Operations occurred throughout the study period, and follow-up observations covered lengths of time varying from 3 to 97 months. The response was whether the subject died and the follow-up time. For subjects who died, this is the time after the operation until death; for the others, it is the time until the study ended or the subject withdrew from it. Table 9.11 lists the numbers of deaths during the follow-up period, by valve type and age. These counts are the first layer of a three-way contingency table that classifies valve type, age, and whether the subject died (yes, no). The subjects not tabulated in Table 9.11 were not observed to die. They are censored, since we know only a lower bound for how long they lived after the operation.

It is inappropriate to analyze that 2 × 2 × 2 table using binary GLMs for the probability of death, since subjects had differing times at risk; it is not sensible to treat a subject who could be observed for 3 months and a subject who could be observed for 97 months as identical trials with the same probability. To use age and valve type as predictors in a model for frequency of death, the proper baseline is not the number of subjects but rather the total time that subjects were at risk. Thus, we model the rate of death. The time at risk for a subject is their follow-up time of observation.
For a given age and valve type, the total time at risk is the sum of the times at risk for all subjects in that cell (those who died and those censored). Table 9.11 lists those total times in months. The sample rate, also shown in the table, divides the number of deaths by the total time at risk. For instance, 4 deaths in 1259 months of observation occurred for younger subjects with aortic valve replacement, so their sample rate is 4/1259 = 0.0032.

We now model effects of age and valve type on the rate. Let a be a dummy variable for age, with a_1 = 0 for the younger age group and a_2 = 1 for the older group. Let v be a dummy variable for valve type, with v_1 = 0 for aortic and v_2 = 1 for mitral. Let n_ij denote the number of deaths for age a_i and valve type v_j, with expected value μ_ij for total time at risk t_ij. Given t_ij, the expected rate is μ_ij/t_ij. The model

    log(μ_ij/t_ij) = α + β_1 a_i + β_2 v_j    (9.18)

assumes a lack of interaction in the effects. Model fitting uses standard iterative methods, treating {n_ij} as independent Poisson variates with means {μ_ij}. This is done conditional on {t_ij}.

TABLE 9.12 Fit to Table 9.11 for Poisson Regression Models

                              Log Link           Identity Link
Age                        Aortic   Mitral      Aortic   Mitral
< 55   Number of deaths     2.28     2.72        3.16     1.19
       Death rate          0.0018   0.0013      0.0025   0.0006
55+    Number of deaths     8.72     7.28        9.17     7.48
       Death rate          0.0062   0.0044      0.0065   0.0046

Table 9.12 presents the fitted death counts and estimated rates. The estimated effects are

    β̂_1 = 1.221 (SE = 0.514),   β̂_2 = −0.330 (SE = 0.438).

There is evidence of an age effect. Given valve type, the estimated rate for the older age group is exp(1.221) = 3.4 times that for the younger age group. The 95% Wald confidence interval for β_1 of 1.221 ± 1.96(0.514) translates to (1.2, 9.3) for the true multiplicative effect exp(β_1). (The likelihood-ratio confidence interval is (1.3, 10.4).) The study contains much censored data.
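The ML fit of model (9.18) can be reproduced from Table 9.11 in a few lines; a minimal Fisher-scoring sketch (not the software used for the analyses in this chapter), with the data keyed in from the table:

```python
import numpy as np

# Table 9.11: deaths and total months at risk, by age and valve type
a = np.array([0, 0, 1, 1])            # 0 = age < 55, 1 = age 55+
v = np.array([0, 1, 0, 1])            # 0 = aortic, 1 = mitral
n = np.array([4.0, 1.0, 7.0, 9.0])    # observed death counts
t = np.array([1259.0, 2082.0, 1417.0, 1647.0])  # exposure (months)

X = np.column_stack([np.ones(4), a, v])
offset = np.log(t)

# Fisher scoring for the Poisson log-link model (9.18):
# log(mu) = offset + X @ beta
beta = np.zeros(3)
for _ in range(50):
    mu = np.exp(offset + X @ beta)
    beta += np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (n - mu))

print(np.round(beta, 3))   # [alpha, beta1 (age), beta2 (valve)]
print(np.round(mu, 2))     # fitted death counts, as in Table 9.12
```

The printed slopes reproduce β̂_1 = 1.221 and β̂_2 = −0.330, and the fitted counts match the log-link columns of Table 9.12.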
Of the 109 patients, only 21 died during the study period. Both effect estimates are imprecise. Note, though, that the analysis uses all 109 patients through their contributions to the times at risk.

Goodness-of-fit statistics comparing {n_ij} to the fitted values {μ̂_ij} are G² = 3.2 and X² = 3.1. The residual df = 1, since the four response counts have three parameters. The mild evidence of lack of fit corresponds to evidence of interaction between valve type and age. However, the model without valve-type effects [i.e., β_2 = 0 in (9.18)] fits nearly as well, with G² = 3.8 and X² = 3.8 (df = 2). Models omitting age effects fit poorly.

The corresponding model with identity link,

    μ_ij = αt_ij + β_1 a_i t_ij + β_2 v_j t_ij,

shows a good fit, with G² = 1.1 and X² = 1.1 (df = 1). Table 9.12 shows the fit. Substantive conclusions are similar. The estimate β̂_1 = 0.0040 (SE = 0.0014) then represents an estimated difference in death rates between the older and younger age groups, for each valve type.

9.7.3 Modeling Survival Times*

A method for modeling survival times relates to the Poisson loglinear model for rates. This method focuses on times until death rather than on numbers of deaths. Let T denote the time to some event, such as death or product failure in a reliability study. Let f(t) denote the probability density function (pdf) and F(t) the cdf of T. A connection exists between ML estimation using a Poisson likelihood for numbers of events and a negative exponential likelihood for T (Aitkin and Clayton 1980).

A subject having T = t contributes f(t) to the likelihood. For a subject whose censoring time equals t, we know only that T > t. Thus, this subject contributes P(T > t) = 1 − F(t). Using the indicator w_i = 1 for death and 0 for censoring for subject i, the survival-time likelihood for n independent observations is

    ∏_{i=1}^{n} f(t_i)^{w_i} [1 − F(t_i)]^{1−w_i}.
The log likelihood equals

    Σ_i w_i log f(t_i) + Σ_i (1 − w_i) log[1 − F(t_i)].    (9.19)

Further analysis requires a parametric form for f and a model for the dependence of its parameters on explanatory variables.

Most survival models focus on the rate at which death occurs rather than on E(T). The hazard function

    h(t) = f(t)/[1 − F(t)] = lim_{ε↓0} P[t < T < t + ε | T > t]/ε

represents the instantaneous rate of death for subjects who have survived to time t.

A simple density for survival modeling is the negative exponential. The pdf is f(t) = λe^{−λt}, t > 0. The cdf is F(t) = 1 − e^{−λt} for t > 0, and E(T) = λ^{−1}. The hazard function is h(t) = λ, t > 0, constant for all t.

Now we include explanatory variables x. Suppose that the hazard function for a negative exponential survival distribution is

    h(t; x) = λ exp(β′x).    (9.20)
When we sum terms in the log likelihood for subjects having a common value of x, the observed data are the numbers of deaths (Σw_i) at each setting of x, and the offset is the log of (Σt_i) at each setting.

The assumption of constant hazard over time is often not sensible. As products wear out, their failure rate increases. A generalization divides the time scale into disjoint time intervals and assumes constant hazard in each, namely,

  h(t; x) = λ_k exp(β′x)  for t in interval k, k = 1, . . . .

A separate hazard rate applies to each piece of the time scale. Consider the contingency table for numbers of deaths, in which one dimension is a discrete time scale and other dimensions represent categorical explanatory variables. Holford (1980) and Laird and Olivier (1981) showed that Poisson loglinear models and likelihoods for this table are equivalent to loglinear hazard models and likelihoods that assume piecewise exponential hazards for the survival times.

For short time intervals, the piecewise exponential approach is essentially nonparametric, making no assumption about the dependence of the hazard on time. This suggests the generalization of model (9.20) that replaces λ by an unspecified function λ(t), so that

  h(t; x) = λ(t) exp(β′x).

This is the Cox proportional hazards model. Its ratio of hazards

  h(t; x₁)/h(t; x₂) = exp[β′(x₁ − x₂)]

is the same for all t.

TABLE 9.13 Number of Deaths from Lung Cancer, with Total Follow-up in Parentheses^a

Follow-up        Histology 1               Histology 2               Histology 3
Interval        Disease Stage             Disease Stage             Disease Stage
(months)      I       II      III       I       II      III       I       II      III
0–2        9 (157) 12 (134) 42 (212)  5 (77)  4 (71) 28 (130)  1 (21)  1 (22) 19 (101)
2–4        2 (139)  7 (110) 26 (136)  2 (68)  3 (63) 19 (72)   1 (17)  1 (18) 11 (63)
4–6        9 (126)  5 (96)  12 (90)   3 (63)  5 (58) 10 (42)   1 (14)  3 (14)  7 (43)
6–8       10 (102) 10 (86)  10 (64)   2 (55)  4 (42)  5 (21)   1 (12)  1 (10)  6 (32)
8–10       1 (88)   4 (66)   5 (47)   2 (50)  2 (35)  0 (14)   0 (10)  0 (8)   3 (21)
10–12      3 (82)   3 (59)   4 (39)   2 (45)  1 (32)  3 (13)   1 (8)   0 (8)   3 (14)
12+        1 (76)   4 (51)   1 (29)   2 (42)  4 (28)  2 (7)    0 (6)   2 (6)   3 (10)
^a Values in parentheses represent total follow-up.
Source: Reprinted with permission from the Biometric Society, based on Holford (1980).

9.7.4 Lung Cancer Survival Example*

Table 9.13 describes survival for 539 males diagnosed with lung cancer. The prognostic factors are histology (H) and stage (S) of disease. For a piecewise exponential hazard approach, the time scale for follow-up (T) was divided into two-month intervals. Let μ_ijk denote the expected number of deaths and t_ijk the total time at risk for histology i and stage of disease j, in follow-up time interval k. The model

  log(μ_ijk/t_ijk) = λ + λ_i^H + λ_j^S + λ_k^T    (9.21)

has residual G² = 43.9 (df = 52). All models assuming no interaction between follow-up time interval and either prognostic factor are proportional hazards models, since they have the same effects of histology and stage of disease for each time interval. Table 9.14 summarizes results of fitting several such models. Although stage of disease is an important prognostic factor, histology did not contribute significant additional information. For model (9.21), the effects of stage of disease satisfy

  λ̂_2^S − λ̂_1^S = 0.470 (SE = 0.174),  λ̂_3^S − λ̂_1^S = 1.324 (SE = 0.152).

TABLE 9.14 Results for Poisson Regression Models of Proportional Hazards Form with Table 9.13^a

Effects                G²      df
T                    170.7     56
T + H                143.1     54
T + S                 45.8     54
T + S + H             43.9     52
T + S + H + S×H       41.5     48

^a T, time scale for follow-up; H, histology; S, disease stage.

For instance, at a fixed follow-up time for a given histology, the estimated death rate at the third stage of disease is exp(1.324) = 3.8 times that at the first stage. Adding interaction terms between stage and time does not significantly improve the fit (change in G² = 14.9, change in df = 12). The {λ̂_j^S} are very similar for the simpler model without the histology effects.
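The piecewise idea can be made concrete with one cross-classification from Table 9.13 (histology 1, disease stage I): the crude hazard estimate within each interval is simply deaths divided by total follow-up in that interval, and a single pooled (constant-hazard, exponential) estimate would mask the variation across intervals. A sketch:

```python
# Crude piecewise-exponential hazard estimates for one cross-classification
# of Table 9.13 (histology 1, disease stage I): within each two-month
# interval, lambda_hat_k = (deaths in interval k) / (total follow-up in k).
intervals = ["0-2", "2-4", "4-6", "6-8", "8-10", "10-12", "12+"]
deaths    = [9, 2, 9, 10, 1, 3, 1]
exposure  = [157, 139, 126, 102, 88, 82, 76]  # total follow-up at risk

rates = [d / e for d, e in zip(deaths, exposure)]
for k, r in zip(intervals, rates):
    print(f"{k:>6}: {r:.4f} deaths per unit follow-up")

# A constant-hazard (negative exponential) fit would pool everything:
pooled = sum(deaths) / sum(exposure)
print(f"pooled: {pooled:.4f}")
```

The interval-specific rates vary by nearly an order of magnitude here, which is exactly the situation in which assuming a constant hazard over time is not sensible.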
9.7.5 Analyzing Weighted Data*

The process of fitting a loglinear model with an offset is also useful in other applications. For expected frequencies {μ_i} and fixed constants {t_i}, consider a model

  log(μ_i/t_i) = α + β₁x_i1 + β₂x_i2 + ⋯.

Standard loglinear models have {t_i = 1}. The general form is useful for the analysis of categorical data with sampling designs more complex than simple random sampling. Many surveys have sampling designs employing stratification and/or clustering. Case weights inflate or deflate the influence of each observation according to features of that design. Adding the case weights for subjects in a particular cell i provides a total weighted frequency for that cell. The average cell weight z_i is defined to be the total weighted frequency divided by the cell count. Conditional on {z_i}, loglinear models for the weighted expected frequencies {z_i μ_i = μ_i/t_i} with t_i = z_i⁻¹ express the model as a standard loglinear model for {log μ_i}, with offset {log t_i = −log z_i}. Fitting this model provides appropriate parameter estimates and standard errors (Clogg and Eliason 1987).

9.8 EMPTY CELLS AND SPARSENESS IN MODELING CONTINGENCY TABLES

Contingency tables having small cell counts are said to be sparse. We end this chapter by discussing effects of sparse tables on model fitting. Sparse tables occur when the sample size n is small. They also occur when n is large but so is the number of cells. Sparseness is common in tables with many variables. The following discussion refers to a generic contingency table and model, with cell counts {n_i} and expected frequencies {μ_i} for n observations in N cells.

9.8.1 Empty Cells: Sampling versus Structural Zeros

Sparse tables usually contain cells with n_i = 0. These empty cells are of two types: sampling zeros and structural zeros. In most cases, even though n_i = 0, μ_i > 0.
It is possible to have observations in the cell, and n_i > 0 with sufficiently large n. This empty cell is called a sampling zero. The empty cells in Table 9.1 for the student survey are sampling zeros. An empty cell in which observations are impossible is called a structural zero. For such cells μ_i = 0 and necessarily μ̂_i = 0 and n_i = 0 regardless of n. For a table that cross-classifies cancer patients on their gender, race, and type of cancer, some cancers (e.g., prostate cancer, ovarian cancer) are gender specific. Thus, certain cells have structural zeros. Contingency tables with structural zeros are called incomplete tables.

Sampling zeros are part of the data set. A count of 0 is a permissible outcome for a Poisson or multinomial variate. It contributes to the likelihood function and model fitting. A structural zero, on the other hand, is not an observation and is not part of the data. Sampling zeros are much more common than structural zeros, and the remaining discussion refers to them.

9.8.2 Existence of Estimates in Loglinear/Logit Models

Sampling zeros can affect the existence of finite ML estimates of loglinear and logit model parameters. Haberman (1973b, 1974a), generalizing work by Birch (1963) and Fienberg (1970b), studied this. Let n denote the vector of cell counts and μ their expected values. Haberman showed results 1 through 5 for Poisson sampling, but by result 6 they apply also to multinomial sampling.

1. The log-likelihood function is a strictly concave function of log μ.
2. If a ML estimate of μ exists, it is unique and satisfies the likelihood equations X′n = X′μ̂. Conversely, if μ̂ satisfies the model and also the likelihood equations, it is the ML estimate of μ.
3. If all n_i > 0, ML estimates of loglinear model parameters exist.
4. Suppose that ML parameter estimates exist for a loglinear model that equates observed and fitted counts in certain marginal tables. Then those marginal tables have uniformly positive counts.
5.
If ML estimates exist for a model M, they also exist for any special case of M.
6. For any loglinear model, the ML estimates μ̂ are identical for multinomial and independent Poisson sampling, and those estimates exist in the same situations.

To illustrate, consider the saturated model. By results 2 and 3, when all n_i > 0, the ML estimate of μ is n. By result 4, parameter estimates do not exist when any n_i = 0. Model parameter estimates are contrasts of {log μ̂_i}, and since μ̂ = n for the saturated model, the estimates are finite only when all n_i > 0. For unsaturated models, by results 3 and 4, ML estimates exist when all n_i > 0 and do not exist when any count is zero in the set of sufficient marginal tables.

Suppose that at least one n_i = 0 but the sufficient marginal counts are all positive. For hierarchical loglinear models, Glonek et al. (1988) showed that the positivity of the sufficient counts implies the existence of ML estimates if and only if the model is decomposable (Note 8.2), which includes the conditional independence models. Models having all pairs of variables associated, however, are more complex. For model (XY, XZ, YZ), for instance, ML estimates exist when only one n_i = 0 but may not exist when at least two cells are empty. For instance, ML estimates do not exist for Table 9.15, even though all sufficient statistics (the two-way marginal totals) are positive (Problem 9.47).

Haberman showed that the supremum of the likelihood function is finite. This motivated him to define extended ML estimators of μ. These always exist but may equal 0 and, falling on the boundary, need not have the same properties as regular ML estimators [see also Baker et al. (1985)]. A sequence of estimates satisfying the model that converges to the extended estimate has log likelihood approaching its supremum.
In this extended sense, μ̂_i = 0 is the ML estimate of μ_i for the saturated model when n_i = 0, and one can have infinite loglinear parameter estimates. When a sufficient marginal count for a factor equals zero, infinite estimates occur for that term. For instance, when a XY marginal total equals zero, infinite estimates occur among {λ̂_ij^XY} for loglinear models such as (XY, XZ, YZ), and infinite estimates occur among {β̂_i^X} for the effect of X on Y in logit models. Sometimes, however, not even infinite estimates exist. An example is estimating the log odds ratio when both entries in a row or column of a 2 × 2 table equal 0.

TABLE 9.15 Data for Which ML Estimates Do Not Exist for Model (XY, XZ, YZ)^a

              Z = 1           Z = 2
          Y = 1  Y = 2    Y = 1  Y = 2
X = 1       0      *        *      *
X = 2       *      *        *      0

^a Cells containing * may contain any positive numbers.

A value of ∞ (or −∞) for a ML parameter estimate implies that ML fitted values equal 0 in some cells, and some odds ratio estimates equal ∞ or 0. One potential indicator is when the iterative fitting process does not converge, typically because an estimate keeps increasing from cycle to cycle. Most software, however, is fooled after a certain point in the iterative process by the nearly flat likelihood. It reports convergence, but because of the very slight curvature of the log likelihood, the estimated standard errors (based on inverting the information matrix of second partial derivatives) are extremely large and numerically unstable. Slight changes in the data then often cause dramatic changes in the estimates and their standard errors. A danger with sparse data is that one might not realize that a true estimated effect is infinite and, as a consequence, report estimated effects and results of statistical inferences that are invalid and highly unstable.

Many ML analyses are unharmed by empty cells. Even when a parameter estimate is infinite, this is not fatal to data analysis.
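The pathology illustrated by Table 9.15 is easy to verify numerically. Setting the starred cells to 1 (any positive values would do), every two-way marginal total — the sufficient statistics for model (XY, XZ, YZ) — is strictly positive, even though the two zeros sit in the configuration that prevents finite ML estimates for this non-decomposable model. A sketch:

```python
# Table 9.15 pattern: n[x][y][z], with zeros at (X=1, Y=1, Z=1) and
# (X=2, Y=2, Z=2); starred cells set to 1 for illustration.
n = [[[0, 1], [1, 1]],
     [[1, 1], [1, 0]]]

def margin_xy(x, y): return sum(n[x][y][z] for z in range(2))
def margin_xz(x, z): return sum(n[x][y][z] for y in range(2))
def margin_yz(y, z): return sum(n[x][y][z] for x in range(2))

# All sufficient statistics (two-way margins) for model (XY, XZ, YZ)
# are strictly positive ...
xy = [margin_xy(x, y) for x in range(2) for y in range(2)]
xz = [margin_xz(x, z) for x in range(2) for z in range(2)]
yz = [margin_yz(y, z) for y in range(2) for z in range(2)]
print(min(xy + xz + yz) > 0)  # True

# ... yet two cells are empty, in the placement that makes finite
# ML estimates fail to exist for this model.
print(sum(cell == 0 for plane in n for row in plane for cell in row))  # 2
```

This is why positive sufficient statistics guarantee existence only for decomposable models: checking the margins alone is not enough for a model such as (XY, XZ, YZ).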
The likelihood-ratio confidence interval for the true log odds ratio has one endpoint that is finite. For instance, when n₁₁ = 0 but other n_ij > 0 in a 2 × 2 table, log θ̂ = −∞ and a confidence interval has form (−∞, U) for some finite upper bound U. When the pattern of empty cells forces certain fitted values for a model to equal 0, this affects the df for testing model fit (Haslett 1990).

9.8.3 Clinical Trials Example

Table 9.16 shows results of a clinical trial conducted at five centers. The purpose was to compare an active drug to placebo for treating fungal infections, with a binary (success, failure) response. For these data, let Y = response, X = treatment (x₁ = 1 for active drug and x₂ = 0 for placebo), and Z = center. Centers 1 and 3 had no successes. Thus, the 5 × 2 marginal table relating response to center, collapsed over treatment, contains zero counts. The last two columns of Table 9.16 show this marginal table. Infinite ML estimates occur for terms in loglinear or logit models containing the YZ association. An example is the logit model

  logit P(Y = 1 | X = i, Z = k) = βx_i + β_k^Z.

(We omit the intercept, so the {β_k^Z} need no constraint; then, these refer to center effects rather than contrasts between centers and a baseline center.) The likelihood function increases continually as β₁^Z and β₃^Z decrease toward −∞; that is, as the logit decreases toward −∞, the fitted probability of success decreases toward the ML estimate of 0 for those centers.

The counts in the 2 × 2 marginal table relating response to treatment, shown in the bottom panel of Table 9.16, are all positive. The empty cells in Table 9.16 affect the center estimates, but not the treatment estimate, for this logit model. In the limit as the log likelihood increases, the fitted values have a log odds ratio β̂ = 1.55 (SE = 0.70).
Most software reports this, but instead of β̂₁^Z = β̂₃^Z = −∞ reports large numbers with extremely large standard errors. For instance, PROC GENMOD in SAS reports values of about −26 for β̂₁^Z and β̂₃^Z, with standard errors of about 200,000.

TABLE 9.16 Clinical Trial Relating Treatment to Response, with XY and YZ Marginal Tables^a

                                                YZ Marginal
Center   Treatment      Success   Failure   Success   Failure
1        Active drug       0         5
         Placebo           0         9         0        14
2        Active drug       1        12
         Placebo           0        10         1        22
3        Active drug       0         7
         Placebo           0         5         0        12
4        Active drug       6         3
         Placebo           2         6         8         9
5        Active drug       5         9
         Placebo           2        12         7        21

XY marginal
         Active drug      12        36
         Placebo           4        42

^a X, treatment; Y, response; Z, center.
Source: Data courtesy of Diane Connell, Sandoz Pharmaceuticals Corporation.

The treatment estimate β̂ = 1.55 also results from deleting centers 1 and 3 from the analysis. When a center contains responses of only one type, it provides no information about this odds ratio. (It does provide information about the size of some other measures, such as the difference of proportions.) In fact, such tables also make no contribution to standard tests of conditional independence, such as the Cochran–Mantel–Haenszel test (Section 6.3.2) and exact test (Section 6.7.5).

An alternative strategy in multicenter analyses combines centers of a similar type. Then, if each resulting partial table has responses with both outcomes, the inferences use all data. For Table 9.16, perhaps centers 1 and 3 are similar to center 2, since the success rate is very low for that center. Combining these three centers and refitting the model to this table and the tables for the other two centers yields β̂ = 1.56 (SE = 0.70). Usually, this strategy produces results similar to deleting the table with no outcomes of a particular type.
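The point that centers with only one response outcome carry no information about the common odds ratio can be seen directly with the (non-ML) Mantel–Haenszel estimator for Table 9.16: centers 1 and 3 contribute zero to both its numerator and its denominator, so deleting them changes nothing, and its logarithm happens to be close to the ML value β̂ = 1.55 quoted above. A sketch:

```python
import math

# Table 9.16: (success, failure) for active drug and for placebo, by center.
centers = [
    ((0, 5),  (0, 9)),    # center 1: no successes at all
    ((1, 12), (0, 10)),   # center 2
    ((0, 7),  (0, 5)),    # center 3: no successes at all
    ((6, 3),  (2, 6)),    # center 4
    ((5, 9),  (2, 12)),   # center 5
]

# Mantel-Haenszel common odds ratio: sum(a*d/n) / sum(b*c/n) over strata,
# where each stratum's 2x2 table is [[a, b], [c, d]] with total n.
num = den = 0.0
for (a, b), (c, d) in centers:
    n = a + b + c + d
    num += a * d / n
    den += b * c / n

or_mh = num / den
print(round(or_mh, 2), round(math.log(or_mh), 2))  # 4.72 1.55

# Centers 1 and 3 add 0 to both sums (a*d = 0 and b*c = 0 there), so they
# carry no information about the common odds ratio.
```

The same zero contributions explain why such strata also drop out of the Cochran–Mantel–Haenszel test statistic mentioned above.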
9.8.4 Effect of Small Samples on X² and G²

Although empty cells and sparse tables need not affect parameter estimates of interest, they can cause sampling distributions of goodness-of-fit statistics to be far from chi-squared. The true sampling distributions converge to chi-squared as n → ∞, for a fixed number of cells N. The adequacy of the chi-squared approximation depends both on n and N.

Cochran studied the chi-squared approximation for X² in several articles. In 1954, he suggested that to test independence with df > 1, a minimum expected value μ_i ≈ 1 is permissible as long as no more than about 20% of the cells have μ_i < 5. Koehler (1986), Koehler and Larntz (1980), and Larntz (1978) showed that X² applies with smaller n and more sparse tables than G². The distribution of G² is usually poorly approximated by chi-squared when n/N is less than 5. Depending on the sparseness, P-values based on referring G² to a chi-squared distribution can be too large or too small. When most μ_i are smaller than 0.5, treating G² as chi-squared gives a highly conservative test; when H₀ is true, reported P-values tend to be much larger than true ones. When most μ_i are between 0.5 and 4, G² tends to be too liberal; the reported P-value tends to be too small.

The size of n/N that produces adequate approximations for X² tends to decrease as N increases (Koehler and Larntz 1980). However, the approximation tends to be poor for sparse tables containing both small and moderately large μ_i (Haberman 1988). It is difficult to give a guideline that covers all cases. For other discussion, see Cressie and Read (1989) and Lawal (1984).

For fixed n and N, the chi-squared approximation is better for tests with smaller df. For instance, in testing conditional independence in I × J × K tables, G²[(XZ, YZ) | (XY, XZ, YZ)] (with df = (I − 1)(J − 1)) is closer to chi-squared than G²(XZ, YZ) (with df = K(I − 1)(J − 1)).
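For reference, the two statistics compared throughout this subsection are X² = Σ(n_i − μ̂_i)²/μ̂_i and G² = 2Σ n_i log(n_i/μ̂_i). A small example (counts invented for illustration) with several small expected values shows how far apart they can be in a sparse table:

```python
import math

def pearson_x2(obs, exp):
    # X^2 = sum over cells of (n_i - mu_i)^2 / mu_i
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

def deviance_g2(obs, exp):
    # G^2 = 2 * sum of n_i * log(n_i / mu_i), with 0 * log(0) taken as 0
    return 2 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o > 0)

# Invented sparse example: n = 20 observations in N = 5 cells,
# tested against a uniform null (all mu_i = 4).
obs = [0, 1, 2, 3, 14]
exp = [4, 4, 4, 4, 4]
print(round(pearson_x2(obs, exp), 2))   # 32.5
print(round(deviance_g2(obs, exp), 2))  # 27.81
```

With moderate expected values the two statistics are usually close; here, with an empty cell and several counts far from their expectations, they disagree noticeably, and (as discussed above) their null distributions can be far from chi-squared as well.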
The ordinal test of H₀: β = 0 with the homogeneous linear-by-linear XY association model (9.10) has df = 1, and behaves even better.

9.8.5 Model-Based Tests and Sparseness

From (9.3) and (9.4), the model-based statistics G²(M₀ | M₁) and X²(M₀ | M₁) depend on the data only through the fitted values, and hence only through minimal sufficient statistics for the more complex model. These statistics have null distributions converging to chi-squared as the expected values of the minimal sufficient statistics grow. For most loglinear models, these sufficient statistics refer to marginal tables. Marginal totals are more nearly normally distributed than are single cell counts. Thus, G²(M₀ | M₁) and X²(M₀ | M₁) converge to their limiting chi-squared distribution more quickly than do G²(M₀) and X²(M₀), which depend also on individual cell counts. When {μ̂_i} are small but the sufficient marginal totals for M₁ are mostly in at least the range 5 to 10, the chi-squared approximation is usually adequate for model comparison statistics. Haberman (1977a) provided theoretical justification.

9.8.6 Alternative Asymptotics and Alternative Statistics

When large-sample approximations are inadequate, exact small-sample methods are an alternative. When they are infeasible, it is often possible to approximate exact distributions precisely using Monte Carlo methods (e.g., Booth and Butler 1999; Forster et al. 1996; Kim and Agresti 1997; Mehta et al. 1988). An alternative approach uses sparse asymptotic approximations that apply when the number of cells N increases as n increases. For this approach, {μ_i} need not increase, as they must do in the usual (fixed N, n → ∞) large-sample theory. For goodness-of-fit testing of a specified multinomial, Koehler and Larntz (1980) showed that a standardized version of G² has an approximate normal distribution for very sparse tables. Koehler (1986)
presented limiting normal distributions for G² for use in testing models having direct ML estimates. McCullagh (1986) reviewed ways of handling sparse tables and presented an alternative approximation for G². Zelterman (1987) gave normal approximations for X² and proposed an alternative statistic.

9.8.7 Adding Constants to Cells of a Contingency Table

Empty cells and sparse tables can cause problems with existence of estimates for loglinear model parameters, estimation of odds ratios, performance of computational algorithms, and asymptotic approximations of chi-squared statistics. However, they need not be problematic. The likelihood can still be maximized, a point estimate of ∞ for an effect still usually has a finite lower bound for a likelihood-based confidence interval, and one can use small-sample inferential methods rather than asymptotic ones.

One way to obtain finite estimates of all effects and ensure convergence of fitting algorithms is to add a small constant to cell counts. Some algorithms add ½ to each cell, as Goodman (1964b, 1970, 1971a) recommended for saturated models. An example of the beneficial effect of this for a saturated model is bias reduction for estimating an odds ratio in a 2 × 2 table (Gart 1966; Gart and Zweifel 1967). Adding ½ to each cell before fitting an unsaturated model smooths the data too much, however, causing havoc with sampling distributions. This operation has too conservative an influence on estimated effects and test statistics. The effect is very severe with a large number of cells.

Even for a saturated model, adding ½ to each cell is not a panacea for all purposes. When the ordinary ML estimate of an odds ratio is infinite, the estimate after adding ½ to each cell is finite, as are the endpoints of any confidence interval. However, it is more sensible to use an upper bound of ∞ for the odds ratio, since no sample evidence suggests that the odds ratio falls below any given value.
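The effect of the ½ correction on a saturated-model quantity is easy to see with the sample odds ratio of a 2 × 2 table containing a zero (counts invented for illustration): the raw estimate is 0, with log odds ratio −∞, while adding ½ to every cell — often called the Haldane–Anscombe correction — gives a finite value. A sketch:

```python
import math

def odds_ratio(table, add=0.0):
    # table = [[n11, n12], [n21, n22]]; 'add' is a constant added to each cell
    (a, b), (c, d) = table
    a, b, c, d = a + add, b + add, c + add, d + add
    return (a * d) / (b * c)

# Invented 2x2 example with an empty cell (n11 = 0):
table = [[0, 8], [5, 10]]

print(odds_ratio(table))                      # 0.0, so log theta-hat = -inf
print(odds_ratio(table, add=0.5))             # finite after adding 1/2
print(math.log(odds_ratio(table, add=0.5)))   # finite log odds ratio
```

As the text cautions, this yields a finite point estimate and interval endpoints, but an interval whose upper bound is ∞ better reflects what the data actually say about the odds ratio in such a table.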
When in doubt about the effect of sparse data, one should perform a sensitivity analysis. For example, for each possibly influential observation, delete it or move it to another cell to see how results vary with small perturbations to the data. Influence diagnostics for GLMs (Williams 1987) are also useful for this purpose. Often, some associations are not affected by empty cells and give stable results for the various analyses, whereas others that are affected are highly unstable. Use caution in making conclusions about an association if small changes in the data are influential.

Later chapters show ways to smooth data in a less ad hoc manner than adding arbitrary constants to cells. These include random effects models (Section 12.3) and Bayesian methods (Section 15.2).

NOTES

Section 9.1: Association Graphs and Collapsibility

9.1. Darroch et al. (1980) defined a class of graphical models that contains the family of decomposable models (see Note 8.2). For expositions on graphical models and their relevant independence graphs, which show the conditional independence structure, see Anderson and Böckenholt (2000), Edwards (2000), Edwards and Kreiner (1983), Kreiner (1998), Lauritzen (1996), and Whittaker (1990). Whittaker (1990, Sec. 12.5) summarized connections with various definitions of collapsibility.

9.2. For I × J × 2 tables, the collapsibility conditions (Section 9.1.2) are necessary as well as sufficient (Simpson 1951; Whittemore 1978). For I × J × K tables, Ducharme and Lepage (1986) showed the conditions are necessary and sufficient for the odds ratios to remain the same no matter how the levels of Z are pooled (i.e., no matter how Z is partially collapsed). Darroch (1962) defined a perfect table as one for which, for all i, j, k,

  Σ_i π_ij+ π_i+k / π_i++ = π_+j+ π_++k,
  Σ_j π_ij+ π_+jk / π_+j+ = π_i++ π_++k,
  Σ_k π_i+k π_+jk / π_++k = π_i++ π_+j+.
For perfect tables, homogeneous association implies that {π_ijk = π_ij+ π_i+k π_+jk / (π_i++ π_+j+ π_++k)} and conditional odds ratios are identical to marginal odds ratios. Whittemore (1978) used perfect tables to illustrate that for I × J × K tables with K > 2, conditional and marginal odds ratios can be identical even when no pair of variables is conditionally independent. See also Davis (1986b). Suppose that the difference of proportions or relative risk, computed for a binary response Y and predictor X, is the same at every level of Z. If Z is independent of X in the marginal XZ table or if Z is conditionally independent of Y given X, the measure has the same value in the marginal XY table (Shapiro 1982). Thus, for factorial designs with the same number of observations at each combination of levels, the difference of proportions and relative risk are collapsible. See also Wermuth (1987).

Section 9.2: Model Selection and Comparison

9.3. Articles on loglinear model selection include Aitkin (1979, 1980), Benedetti and Brown (1978), Brown (1976), Goodman (1970, 1971a), Wermuth (1976), and Whittaker and Aitkin (1978). When a certain model holds, G²/df has an asymptotic mean of 1. Goodman (1971a) recommended this index for comparing fits. Smaller values represent better fits.

9.4. Kullback et al. (1962) and Lancaster (1951) were among the first to partition chi-squared statistics in multiway tables. Goodman (1970) and Plackett (1962) noted difficulties with their approaches. When observations have distribution in the natural exponential family, Simon (1973) showed G²(M₀ | M₁) = 2Σ_i μ̂_1i log(μ̂_1i/μ̂_0i) whenever models are linear in the natural parameters. See Lang (1996b) for partitionings for more complex models.

Section 9.4: Modeling Ordinal Associations

9.5. Goodman (1979a) stimulated research on loglinear models for ordinal data.
His work extended Haberman (1974b), who expressed the λ^XY association term with an expansion in orthogonal polynomials. For more general ordinal models for multiway tables, see Agresti (1984), Becker (1989a), Becker and Clogg (1989), and Goodman (1986).

Section 9.6: Association Models, Correlation Models, and Correspondence Analysis

9.6. Early articles on the RC model include Goodman (1979a, 1981a, b) and Andersen (1980, pp. 210–216), apparently partly motivated by earlier work of G. Rasch (see Andersen 1995). Anderson and Böckenholt (2000), Becker (1989a, b, 1990), Becker and Clogg (1989), Chuang et al. (1985), and Goodman (1985, 1986, 1996) discussed generalizations for multiway tables. Anderson (1984) discussed a related model. Anderson and Vermunt (2000) showed that RC and related association models arise when observed variables are conditionally independent given a latent variable that is conditionally normal, given the observed variables. Their work generalizes results in Lauritzen and Wermuth (1989) and discussion by Whittaker of van der Heijden et al. (1989). See also de Falguerolles et al. (1995). Clogg and Shihadeh (1994) surveyed association models and related correlation models.

9.7. Kendall and Stuart (1979, Chap. 33) surveyed basic canonical correlation methods for contingency tables. See also Williams (1952), who discussed earlier work by R. A. Fisher and others. Karl Pearson often analyzed tables by assuming an underlying bivariate normal distribution (Section 16.1). For estimating that distribution's correlation, see Becker (1989b), Goodman (1981b), Kendall and Stuart (1979, Chaps. 26 and 33), Lancaster (1969, Chap. X), the Pearson (1904) tetrachoric correlation for 2 × 2 tables, and the Lancaster and Hamdan (1964) polychoric correlation for I × J tables.

9.8. Correspondence analysis gained popularity in France under the influence of Benzécri (see, e.g., 1973). Goodman (1996) attributed its origins to H. O.
Hartley, publishing under his original German name (Hirschfeld, 1935). Greenacre (1993) related it to the singular value decomposition of a matrix. For other discussion, see Escoufier (1982), Friendly (2000, Chap. 5), Goodman (1986, 1996, 2000), Michailidis and de Leeuw (1998), van der Heijden and de Leeuw (1985), and van der Heijden et al. (1989). Gabriel (1971) discussed related work on biplots.

Section 9.7: Poisson Regression for Rates

9.9. Another application using offsets is table standardization (Section 8.7.4). For analyses of rate data, see Breslow and Day (1987, Sec. 4.5), Freeman and Holford (1980), Frome (1983), and Hoem (1987). Articles dealing with grouped survival data, particularly loglinear and logit models for survival probabilities, include Aranda-Ordaz (1983), Larson (1984), Prentice and Gloeckler (1978), Schluchter and Jackson (1989), Stokes et al. (2000, Chap. 17), and Thompson (1977). Aitkin and Clayton (1980) discussed exponential survival models and also presented similar models having hazard functions for Weibull or extreme-value survival distributions. Log likelihood (9.19) actually applies only for noninformative censoring mechanisms. It does not make sense if subjects tend to withdraw from the study because of factors related to it, perhaps because of health effects related to one of the treatments.

9.10. Lindsey and Mersch (1992) showed a clever way to use loglinear models to fit exponential family distributions f(y; θ) of form (4.14) with φ known. One breaks the response scale into intervals {(y_k − Δ_k/2, y_k + Δ_k/2)}. Counts in those intervals follow a multinomial with probabilities approximated by {f(y_k, θ)Δ_k}. The log expected count approximations are linear in θ with an offset.

PROBLEMS

Applications

9.1 Use odds ratios in Table 8.3 to illustrate the collapsibility conditions.
    a. For (A, C, M), all conditional odds ratios equal 1.0.
Explain why all reported marginal odds ratios equal 1.0.
    b. For (AC, M), explain why (i) all conditional odds ratios are the same as the marginal odds ratios, and (ii) all μ̂_ac+ = n_ac+.
    c. For (AM, CM), explain why (i) the AC conditional odds ratios of 1.0 need not be the same as the AC marginal odds ratio, (ii) the AM and CM conditional odds ratios are the same as the marginal odds ratios, and (iii) all μ̂_a+m = n_a+m and μ̂_+cm = n_+cm.
    d. For (AC, AM, CM), explain why (i) no conditional odds ratios need be the same as the related marginal odds ratios, and (ii) the fitted marginal odds ratios must equal the sample marginal odds ratios.

9.2 Table 9.17 summarizes a study with variables age of mother (A), length of gestation (G) in days, infant survival (I), and number of cigarettes smoked per day during the prenatal period (S). Treat G and I as response variables and A and S as explanatory.
    a. Explain why a loglinear model should include the λ^AS term.
    b. Fit the models (AGIS), (AGI, AIS, AGS, GIS), (AG, AI, AS, GI, GS, IS), and (AS, G, I). Identify a subset of models nested between two of these that may fit well. Select one such model.
    c. Use (i) forward selection, and (ii) backward elimination to build a model. Compare the results of the strategies, and interpret the models chosen.

TABLE 9.17 Data for Problem 9.2

                                   Infant Survival
Age     Smoking   Gestation      No        Yes
< 30     < 5      ≤ 260          50        315
                  > 260          24       4012
         5+       ≤ 260           9         40
                  > 260           6        459
30+      < 5      ≤ 260          41        147
                  > 260          14       1594
         5+       ≤ 260           4         11
                  > 260           1        124

Source: N. Wermuth, pp. 279–295 in Proc. 9th International Biometrics Conference, Vol. 1 (1976). Reprinted with permission from the Biometric Society.

9.3 Refer to Table 2.13. Consider the nested set {(DVP), (DP, VP, DV), (VP, DV), (P, DV), (D, V, P)}. Partition chi-squared to compare the four pairs, ensuring that the overall type I error probability for the four comparisons does not exceed α = 0.10. Which model would you select, using a backward comparison starting with (DVP)? Show that the final
1 Ž1976.. Reprinted with permission from the Biometric Society. model selected depends on the choice of nested set, by repeating the analysis with Ž DP, VP, DV ., Ž DP, DV ., Ž P, DV ., Ž D, V, P .. 9.4 Consider the loglinear model selection for Table 6.3. a. Why is it not sensible to consider models omitting the ␭G M term? b. Using forward selection starting with Ž GM, E, P ., show that model Ž GM, GP, EG, EMP . seems reasonable. c. Using backward elimination, show that Ž GM, GP, EMP . or Ž GM, GP, EG, EMP . seems reasonable. d. The EMP interaction seems vital. To describe it, show that the effect of extramarital sex on divorce is greater for subjects who had no premarital sex. e. Use residuals to describe the lack of fit of model Ž GM, EMP .. 9.5 For model Ž AC, AM, CM . with Table 8.3, the standardized Pearson residual in each cell equals "0.63. Interpret, and explain why each one has the same absolute value. By contrast, model Ž AM, CM . has standardized Pearson residual "3.70 in each cell where M s yes Že.g., q3.70 when A s C s yes. and "12.80 in each cell where M s no Že.g., q12.80 when A s C s yes.. Interpret. 9.6 Refer to Table 8.8. Conduct a residual analysis with the model of no three-factor interaction to describe the nature of the interaction. 9.7 Perform a residual analysis for the independence model with Table 3.2. Explain why it suggests that the linear-by-linear association model may fit better. Fit it, compare to the independence model, and interpret. 402 BUILDING AND EXTENDING LOGLINEAR r LOGIT MODELS 9.8 Refer to Problem 9.7. a. Using standardized scores, find ␤ˆ. Comment on the strength of association. b. Fit a model in which job satisfaction scores are parameters. Interpret the estimated scores, and compare the fit to the L = L model. 9.9 Refer to Table 9.3. a. For the linear-by-linear association model, construct a 95% confidence interval for the odds ratio using the four corner cells. Interpret. b. Fit the column effects model. 
Compare estimated column scores to the equal-interval scores in part (a). Test that the true column scores are equal-interval, given that the model holds. Interpret. Construct a 95% confidence interval for the odds ratio using the four corner cells. Compare to part (a).

9.10 A weak local association may be substantively important for nonlocal categories. Illustrate with the L × L model for Table 9.9, showing how the estimated odds ratio for the four corner cells compares to the estimated local odds ratio.

9.11 Refer to Table 7.8. Fit the homogeneous linear-by-linear association model, and interpret. Test conditional independence between income (I) and job satisfaction (S), controlling for gender (G), using (a) that model, and (b) model (IS, IG, SG). Explain why the results are so different.

9.12 Fit the RC model to Table 9.3. Interpret the estimated scores. Does it fit better than the uniform association model?

9.13 Replicate the results in Section 9.6 for the correlation and correspondence models with Table 9.9.

9.14 One hundred leukemia patients were randomly assigned to two treatments. During the study, 10 subjects on treatment A died and 18 subjects on treatment B died. The total time at risk was 170.4 years for treatment A and 147.3 years for treatment B. Test whether the two treatments have the same death rates. Compare the rates with a confidence interval.

9.15 For Table 9.11, fit a model in which death rate depends only on age. Interpret the age effect.

9.16 Consider model (9.18). What is the effect on the model parameter estimates, their standard errors, and the goodness-of-fit statistics when (a) the times at risk are doubled, but the numbers of deaths stay the same; (b) the times at risk stay the same, but the numbers of deaths double; and (c) the times at risk and the numbers of deaths both double?

9.17 Consider Table 9.13. Explain how one could analyze whether the hazard depends on time.

9.18 An article by W. A. Ray et al. (Amer. J. Epidemiol. 132: 873–884, 1992) dealt with motor vehicle accident rates for 16,262 subjects aged 65–84 years, with data on each for up to 4 years. In 17.3 thousand years of observation, the women had 175 accidents in which an injury occurred. In 21.4 thousand years, men had 320 injurious accidents.
a. Find a 95% confidence interval for the true overall rate of injurious accidents.
b. Using a model, compare the rates for men and women.

9.19 A table at the text's Web site (www.stat.ufl.edu/~aa/cda/cda.html) shows the number of train miles (in millions) and the number of collisions involving British Rail passenger trains between 1970 and 1984. A Poisson model assuming a constant log rate α over the 14-year period has α̂ = −4.177 (SE = 0.1325) and X² = 14.8 (df = 13). Interpret.

9.20 Table 9.18 lists total attendance (in thousands) and the total number of arrests in the 1987–1988 season for soccer teams in the Second Division of the British football league. Let Y = number of arrests for a team, and let t = total attendance. Explain why the model E(Y) = μt might be plausible. Assuming Poisson sampling, fit it and interpret. Plot arrests against attendance, and overlay the prediction equation. Use residuals to identify teams that had arrest counts much different than expected.

TABLE 9.18 Data for Problem 9.20

Team             Attendance   Arrests     Team              Attendance   Arrests
                 (thousands)                                (thousands)
Aston Villa         404         308       Shrewsbury           108          68
Bradford City       286         197       Swindon Town         210          67
Leeds United        443         184       Sheffield Utd.       224          60
Bournemouth         169         149       Stoke City           211          57
West Brom           222         132       Barnsley             168          55
Hudderfield         150         126       Millwall             185          44
Middlesbro          321         110       Hull City            158          38
Birmingham          189         101       Manchester City      429          35
Ipswich Town        258          99       Plymouth             226          29
Leicester City      223          81       Reading              150          20
Blackburn           211          79       Oldham               148          19
Crystal Palace      215          78

Source: The Independent (London), Dec. 21, 1988. Thanks to P. M. E. Altham for showing me these data.

TABLE 9.19 Data for Problem 9.21

              Person-Years              Coronary Deaths
Age        Nonsmokers   Smokers      Nonsmokers   Smokers
35–44        18,793      52,407           2          32
45–54        10,673      43,248          12         104
55–64         5,710      28,612          28         206
65–74         2,585      12,663          28         186
75–84         1,462       5,317          31         102

Source: R. Doll and A. B. Hill, Natl. Cancer Inst. Monogr. 19: 205–268 (1966). See also N. R. Breslow in A Celebration of Statistics, ed. A. C. Atkinson and S. E. Fienberg (New York: Springer-Verlag, 1985).

9.21 Table 9.19 is based on a study with British doctors.
a. For each age, find the sample coronary death rates per 1000 person-years for nonsmokers and smokers. To compare them, take their ratio and describe its dependence on age.
b. Fit a main-effects model for the log rates having four parameters for age and one for smoking. In discussing lack of fit, show that this model assumes a constant ratio of nonsmokers' to smokers' coronary death rates over age.
c. From part (a), explain why it is sensible to add a quantitative interaction of age and smoking. For this model, show that the log ratio of coronary death rates changes linearly with age. Assign scores to age, fit the model, and interpret.

9.22 Analyze Table 9.9 using ordinal logit models. Interpret, and discuss advantages/disadvantages compared to loglinear analyses.

9.23 Refer to Problem 8.6. Analyze these data, using methods of this chapter.

Theory and Methods

9.24 In a 2 × 2 × K table, the true XY conditional odds ratios are identical, but different from the XY marginal odds ratio. Is there three-factor interaction? Is Z conditionally independent of X or Y? Explain.

9.25 Consider loglinear model (WX, XY, YZ). Explain why W and Z are independent given X alone or given Y alone or given both X and Y. When are W and Y conditionally independent? When are X and Z conditionally independent?

9.26 Suppose that loglinear model (XY, XZ) holds.
a. Find μij+ and log μij+.
Show that the loglinear model for the XY marginal table has the same association parameters as {λij^XY} in (XY, XZ). Deduce that odds ratios are the same in the XY marginal table as in the partial tables. Using an analogous result for model (XY, YZ), deduce the collapsibility conditions in Section 9.1.2.
b. Calculate log μij+ for model (XY, XZ, YZ), and explain why marginal associations need not equal conditional associations.

9.27 For a four-way table, is the WX conditional association the same as the WX marginal association for the loglinear model (a) (WX, XYZ)? and (b) (WX, WZ, XY, YZ)? Why?

9.28 Loglinear model M0 is a special case of loglinear model M1.
a. Explain why the fitted values for the two models are identical in the sufficient marginal distributions for M0.
b. Haberman (1974a) showed that when {μ̂i} satisfy any model that is a special case of M0, Σi μ̂1i log μ̂i = Σi μ̂0i log μ̂i. Thus, μ̂0 is the orthogonal projection of μ̂1 onto the linear manifold of {log μ} satisfying M0. Using this, show that

G²(M0) − G²(M1) = 2 Σi μ̂1i log(μ̂1i/μ̂0i).

9.29 Refer to Section 9.2.4. Show that G²(Mj | Mj−1) equals G² for independence in the 2 × 2 table comparing columns 1 through j − 1 with column j.

9.30 For T categorical variables X1, ..., XT, explain why:
a. G²(X1, X2, ..., XT) = G²(X1, X2) + G²(X1X2, X3) + ··· + G²(X1X2···XT−1, XT).
b. G²(X1···XT−1, XT) = G²(X1, XT) + G²(X1XT, X1X2) + ··· + G²(X1X2···XT−1, X1X2···XT−2XT).

9.31 For I × 2 contingency tables, explain why the linear-by-linear association model is equivalent to the linear logit model (5.5).

9.32 Consider the L × L model (9.6) with {vj = j} replaced by {vj = 2j}. Explain why β̂ is halved but {μ̂ij}, {θ̂ij}, and G² are unchanged.

9.33 Lehmann (1966) defined (X, Y) to be positively likelihood-ratio dependent if their joint density satisfies

f(x1, y1) f(x2, y2) ≥ f(x1, y2) f(x2, y1)  whenever x1 < x2 and y1 < y2.

Then, the conditional distribution of Y (X) stochastically increases as X (Y) increases (Goodman 1981a).
a. For the L × L model, show that the conditional distributions of Y and of X are stochastically ordered. What is its nature if β > 0?
b. In row effects model (9.8), if μi > μh, show that the conditional distribution of Y is stochastically higher in row i than in row h. Explain why μ1 = ··· = μI is equivalent to the equality of the I conditional distributions within rows.

9.34 Yule (1906) defined a table to be isotropic if an ordering of rows and of columns exists such that the local log odds ratios are all nonnegative [see also Goodman (1981a)].
a. Show that a table is isotropic if it satisfies (i) the linear-by-linear association model, (ii) the row effects model, and (iii) the RC model.
b. Explain why a table that is isotropic for a certain ordering is still isotropic when adjacent rows or columns are combined.

9.35 Consider the log likelihood for the linear-by-linear association model.
a. Differentiating with respect to β and evaluating at β = 0 and null estimates of parameters, show that the score function is proportional to

Σi Σj ui vj (pij − pi+ p+j).

b. Use the delta method to show that its null SE is

{[Σ ui² pi+ − (Σ ui pi+)²][Σ vj² p+j − (Σ vj p+j)²] / n}^(1/2).

c. Construct a score statistic for testing independence. Show that it is essentially the correlation test (3.15). [Hirotsu (1982) discussed a family of score tests for ordered alternatives.]

9.36 Given the parenthetical result in Problem 7.33, show that if cumulative logit model (7.24) holds and |β| is small, the linear-by-linear association model should fit well with row scores {xi} and "ridit" column scores {vj = [P(Y ≤ j − 1) + P(Y ≤ j)]/2}, with its β parameter about twice β for model (7.24).

9.37 Consider the row effects model (9.8).
a. Show that no loss of generality occurs in letting λI^X = λJ^Y = μI = 0.
b. Show that minimal sufficient statistics are {ni+}, {n+j}, and {Σj vj nij, i = 1, ..., I}, and derive the likelihood equations.

9.38 Show that the column effects model corresponds to a baseline-category logit model for Y that is linear in scores for X, with slope depending on the paired response categories.

9.39 Refer to the homogeneous linear-by-linear association model (9.10).
a. Show that the likelihood equations are, for all i, j, and k,

μ̂i+k = ni+k,  μ̂+jk = n+jk,  Σi Σj ui vj μ̂ij+ = Σi Σj ui vj nij+.

b. Show that residual df = K(I − 1)(J − 1) − 1.
c. When I = J = 2, explain why it is equivalent to (XY, XZ, YZ).
d. Show how the last likelihood equation above changes for heterogeneous linear-by-linear XY association (9.11). Explain why, in each stratum, the fitted XY correlation equals the sample correlation.

9.40 When model (XY, XZ, YZ) is inadequate and variables are ordinal, useful models are nested between it and (XYZ). For ordered scores {ui}, {vj}, and {wk}, consider

log μijk = λ + λi^X + λj^Y + λk^Z + λij^XY + λik^XZ + λjk^YZ + β ui vj wk.   (9.22)

a. Define θijk = θij(k+1)/θij(k) = θi(j+1)k/θi(j)k = θ(i+1)jk/θ(i)jk. For unit-spaced scores, show that log θijk = β. Goodman (1979a) called this the uniform interaction model.
b. Show that log odds ratios for any two variables change linearly across levels of the third variable.
c. Show that the likelihood equations are those for model (XY, XZ, YZ) plus

Σi Σj Σk ui vj wk μ̂ijk = Σi Σj Σk ui vj wk nijk.

d. Explain why model (9.12) is a special case of model (9.22).

9.41 Construct a model having general XZ and YZ associations, but row effects for the XY association that are (a) homogeneous, and (b) heterogeneous across levels of Z. Interpret.

9.42 Explain why the RC model requires scale constraints for the scores. Show that the residual df = (I − 2)(J − 2). Find and interpret the likelihood equations. Explain why the fit is invariant to category orderings.

9.43 Refer to correlation model (9.16) (Goodman 1985, 1986).
a. Show that λ is the correlation between the scores.
b. If this model holds, show that Σi μi(πij/π+j) = λνj and Σj νj(πij/πi+) = λμi. Interpret.
c. With λ close to zero, show that log(πij) has form γi + δj + λμiνj + o(λ), where o(λ)/λ → 0 as λ → 0. Thus, when the association is weak, the correlation model is similar to the linear-by-linear association model with β = λ and scores {ui = μi} and {vj = νj}.

9.44 For the general canonical correlation model, show that

Σ λk² = Σi Σj (πij − πi+π+j)² / (πi+π+j).

Thus, the squared correlations partition a dependence measure that is the noncentrality (6.8) of X² for the independence model with n = 1. [Goodman (1986) stated other partitionings.]

9.45 Refer to model (9.18). Given the times at risk {tij}, show that sufficient statistics are {ni+} and {n+j}.

9.46 Refer to Section 9.7.3. Let T = Σ ti and W = Σ wi. Suppose that survival times have a negative exponential distribution with parameter λ.
a. Using log likelihood (9.19), show that λ̂ = W/T.
b. Conditional on T, show that W has a Poisson distribution with mean Tλ. Using the Poisson likelihood, show that λ̂ = W/T.

9.47 Show that ML estimates do not exist for Table 9.15. [Hint: Haberman (1973b, 1974a, p. 398): If μ̂111 = c > 0, then marginal constraints the model satisfies imply that μ̂222 = −c.]

9.48 For a loglinear model, explain heuristically why the ML estimate of a parameter is infinite when its sufficient statistic takes its maximum or minimum possible value, for given values of other sufficient statistics.
CHAPTER 10

Models for Matched Pairs

We next introduce methods for comparing categorical responses for two samples when each observation in one sample pairs with an observation in the other. Such matched-pairs data commonly occur in studies with repeated measurement of subjects, such as longitudinal studies that observe subjects over time. Because of the matching, the responses in the two samples are statistically dependent. This is the first of four chapters on special methods for handling such dependence.

Table 10.1 illustrates matched-pairs data. For a poll of a random sample of 1600 voting-age British citizens, 944 indicated approval of the Prime Minister's performance in office. Six months later, of these same 1600 people, 880 indicated approval. The two cells with identical row and column response form the main diagonal of the table. These subjects had the same opinion at both surveys. They compose most of the sample, since relatively few people changed opinion. A strong association exists between opinions six months apart, the sample odds ratio being (794 × 570)/(150 × 86) = 35.1.

For matched pairs with a categorical response, a two-way contingency table with the same row and column categories summarizes the data. The table is square. In this chapter we present analyses of square tables. In Section 10.1 we describe methods for comparing proportions with a binary response. In Section 10.2 we discuss logistic regression analyses of such data. For multicategory responses, Section 10.3 covers nominal and ordinal logit models for comparing the response distributions. In Section 10.4 we introduce loglinear models for square tables.

TABLE 10.1 Rating of Performance of Prime Minister

                         Second Survey
First Survey      Approve   Disapprove   Total
Approve             794        150         944
Disapprove           86        570         656
Total               880        720        1600
In Sections 10.5 and 10.6 we discuss two matched-pairs applications for which models for square tables are useful: analyzing agreement between two observers who rate a common set of subjects, and evaluating preferences among treatments based on their pairwise evaluation. Section 10.7 extends the models of Sections 10.2 through 10.4 to multiway tables that result from matched sets of observations. In Chapter 11 we extend them further to incorporate explanatory variables.

10.1 COMPARING DEPENDENT PROPORTIONS

For each of n matched pairs, let πab denote the probability of outcome a for the first observation and outcome b for the second. Let nab count the number of such pairs, with pab = nab/n the sample proportion. We treat {nab} as a sample from a multinomial (n; {πab}) distribution. Then pa+ is the proportion in category a for observation 1, and p+a is the corresponding proportion for observation 2. We compare samples by comparing marginal proportions {pa+} with {p+a}. With matched samples, these proportions are correlated, and methods for independent samples are inappropriate.

In this section we consider binary outcomes. When π1+ = π+1, then π2+ = π+2 also, and there is marginal homogeneity. Since

π1+ − π+1 = (π11 + π12) − (π11 + π21) = π12 − π21,

marginal homogeneity in 2 × 2 tables is equivalent to π12 = π21. The table then shows symmetry across the main diagonal.

10.1.1 Inference for Dependent Proportions

One comparison of the marginal distributions uses δ = π+1 − π1+. Let d = p+1 − p1+ = p2+ − p+2. From formula (1.3) for multinomial covariances, cov(p+1, p1+) = cov(p11 + p21, p11 + p12) simplifies to (π11π22 − π12π21)/n. Thus,

var(√n d) = π1+(1 − π1+) + π+1(1 − π+1) − 2(π11π22 − π12π21).   (10.1)

For large samples, d has approximately a normal sampling distribution. A confidence interval for δ = π+1 − π1+ is then

d ± zα/2 σ̂(d),

where

σ̂²(d) = [p1+(1 − p1+) + p+1(1 − p+1) − 2(p11p22 − p12p21)]/n
       = [(p12 + p21) − (p12 − p21)²]/n,   (10.2)

with the second formula following after substitution and some algebra. Inverting the score test of H0: δ = δ0 is more complex but provides coverage probabilities closer to the nominal values (Tango 1998), as does adding 1 to each cell before computing d and σ̂(d).

The hypothesis of marginal homogeneity is H0: π1+ = π+1 (i.e., δ = 0). The ratio z = d/σ̂(d) or its square is a Wald test statistic. Under H0, an alternative estimated variance is

σ̂0²(d) = (p12 + p21)/n = (n12 + n21)/n².   (10.3)

The score test statistic z0 = d/σ̂0(d) simplifies to

z0 = (n21 − n12)/(n21 + n12)^(1/2).   (10.4)

The square of z0 is a chi-squared statistic with df = 1. The test using it is called McNemar's test (McNemar 1947).

The McNemar statistic depends only on cases classified in different categories for the two observations. The n11 + n22 cases on the main diagonal are irrelevant to inference about whether π1+ and π+1 differ. This may seem surprising, but all cases contribute to inference about how much π1+ and π+1 differ: for instance, to estimating δ and the standard error.

10.1.2 Prime Minister Approval Rating Example

For Table 10.1, the sample proportions of approval of the Prime Minister's performance are p1+ = 944/1600 = 0.59 for the first survey and p+1 = 880/1600 = 0.55 for the second. Using (10.2), a 95% confidence interval for π+1 − π1+ is (0.55 − 0.59) ± 1.96(0.0095), or (−0.06, −0.02). The approval rating appears to have dropped between 2 and 6%.

For testing marginal homogeneity, the test statistic (10.4) using the null variance is

z0 = (86 − 150)/(86 + 150)^(1/2) = −4.17.

It shows strong evidence of a drop in the approval rating.

10.1.3 Increased Precision with Dependent Samples

The final term of formula (10.1), based on cov(p+1, p1+), reflects the dependence between the marginal proportions.
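The Wald interval (10.2) and the McNemar statistic (10.4) are simple enough to verify directly from the cell counts. A minimal sketch in plain Python, using the Table 10.1 counts (the script itself is illustrative, not from the text):

```python
from math import sqrt

# Table 10.1 counts: rows = first survey, columns = second survey
n11, n12, n21, n22 = 794, 150, 86, 570
n = n11 + n12 + n21 + n22

p12, p21 = n12 / n, n21 / n
d = p21 - p12                 # d = p+1 - p1+ = 0.55 - 0.59 = -0.04

# Wald standard error, second form of (10.2)
se = sqrt(((p12 + p21) - (p12 - p21) ** 2) / n)

# 95% confidence interval for delta
ci = (d - 1.96 * se, d + 1.96 * se)

# McNemar score statistic (10.4): uses only the discordant counts
z0 = (n21 - n12) / sqrt(n21 + n12)
```

Running this reproduces the quantities of Section 10.1.2: se ≈ 0.0095, the interval (−0.06, −0.02), and z0 ≈ −4.17.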
By contrast, for independent samples of size n each to estimate binomial probabilities π1 and π2, the covariance of the sample proportions is zero, and

var[√n (difference of sample proportions)] = π1(1 − π1) + π2(1 − π2).

Dependent samples usually exhibit a positive dependence, with log θ = log[π11π22/(π12π21)] > 0; that is, π11π22 > π12π21. From (10.1), positive dependence implies that var(d) is smaller than when the samples are independent. A study design using dependent samples can help improve the precision of statistical inferences for within-subject effects. (By contrast, standard errors tend to be larger, per given number of observations, for between-subject group comparisons.) The improvement is substantial when samples are highly correlated.

To illustrate, Table 10.1 with dependent samples of size 1600 each has a standard error of 0.0095 for d = 0.55 − 0.59. The two observations have strong association, the sample odds ratio being 35.1. Independent samples of size 1600 each with π̂1 − π̂2 = 0.55 − 0.59 have a standard error of 0.0175 for the difference, nearly twice as large.

10.1.4 Small-Sample Test Comparing Matched Proportions

The null hypothesis of marginal homogeneity for binary matched pairs is, equivalently, H0: π12 = π21, or π21/(π21 + π12) = 0.5. For small samples, an exact test conditions on n* = n21 + n12 (Mosteller 1952). Under H0, n21 has a binomial (n*, ½) distribution, for which E(n21) = ½n*. The P-value for the test is a binomial tail probability.

For instance, for Table 10.1, consider Ha: π+1 < π1+, or equivalently, Ha: π21 < π12. Since n* = 86 + 150 = 236, the reference distribution is bin(236, ½). The P-value is the probability of at least 150 successes out of 236 trials, which equals 0.00002. The P-value for Ha: π+1 ≠ π1+ doubles this. When n* > 10, the reference binomial distribution is approximately normal with mean ½n* and variance n*(½)(½).
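The exact binomial tail probability quoted above can be computed directly; a sketch in plain Python (an illustration, not from the text):

```python
from math import comb

# Exact conditional test: under H0, n21 ~ bin(n*, 1/2) given n* = n12 + n21
n12, n21 = 150, 86
nstar = n12 + n21                       # 236 discordant pairs

# One-sided P-value for Ha: pi21 < pi12: P(X >= 150) with X ~ bin(236, 1/2)
p_one_sided = sum(comb(nstar, k) for k in range(n12, nstar + 1)) / 2 ** nstar
p_two_sided = 2 * p_one_sided
```

This reproduces the one-sided P-value of about 0.00002 reported in the text.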
The standardized normal test statistic equals

z = (n21 − ½n*)/[n*(½)(½)]^(1/2) = (n21 − n12)/(n21 + n12)^(1/2).

This is identical to the McNemar statistic (10.4).

10.1.5 Connection between McNemar and Cochran–Mantel–Haenszel Tests

An alternative representation of binary responses for n matched pairs presents the data in n partial tables, one 2 × 2 table for each pair. It has columns that are the two possible outcomes for each measurement. Row 1 shows the outcome of the first observation, and row 2 shows the outcome of the second. Table 10.2 shows the four possible partial tables in this representation. For Table 10.1, the full three-way table has 1600 partial tables; 794 look like the one for subject 1 (i.e., "approve" at both surveys), 570 who disapproved at each survey have tables like the one for subject 2, 86 have tables like the one for subject 3, and 150 have tables like the one for subject 4.

The 1600 subjects from Table 10.1 provide 3200 observations in a 2 × 2 × 1600 contingency table. Collapsing this table over the 1600 partial tables yields a 2 × 2 table with first row equal to (944, 656) and second row equal to (880, 720). These are the total numbers of (approve, disapprove) responses for the two surveys. They form the marginal counts in Table 10.1.

For each subject, suppose that the probability of approval is identical in each survey. Then, conditional independence exists between the opinion outcome and the survey time, controlling for subject. The probability of approval is then also the same for each survey in the marginal table collapsed over the subjects. But this implies that the true probabilities for Table 10.1 satisfy marginal homogeneity. Thus, a test of conditional independence in the 2 × 2 × 1600 table provides a test of marginal homogeneity for Table 10.1. To test conditional independence in this three-way table, one can use the Cochran–Mantel–Haenszel (CMH) statistic (6.6).
The result of that chi-squared statistic is algebraically identical to the squared McNemar statistic, namely (n21 − n12)²/(n12 + n21) for tables of the form of Table 10.1. McNemar's test is a special case of the CMH test applied to the binary responses of n matched pairs displayed in n partial tables. This connection is not helpful for computational purposes, since the McNemar statistic is simple. But it does suggest ways of handling more complex matched data. With several outcome categories or several observations, one can test marginal homogeneity by applying the generalized CMH tests (Section 7.5) using a single stratum for each subject, with each row representing a particular observation (Darroch 1981; Mantel and Byar 1978).

TABLE 10.2 Representation of Four Types of Matched Pairs Contributing to Counts in Table 10.1

                            Response
Subject    Survey      Approve   Disapprove
1          First          1          0
           Second         1          0
2          First          0          1
           Second         0          1
3          First          0          1
           Second         1          0
4          First          1          0
           Second         0          1

Coming sections refer to the 2 × 2 × n table representation of matched-pairs data as the subject-specific table. They refer to the 2 × 2 table of the form of Table 10.1 as the population-averaged table, since its margins provide direct estimates of population marginal proportions.

10.2 CONDITIONAL LOGISTIC REGRESSION FOR BINARY MATCHED PAIRS

In Section 6.7 we introduced conditional logistic regression for eliminating nuisance parameters from an analysis. We now study this for binary matched-pairs data. The models refer to subject-specific tables.

10.2.1 Marginal versus Conditional Models for Matched Pairs

The analyses of Section 10.1 occur in the context of models. Let (Y1, Y2) denote the pair of observations for a randomly selected subject, where a "1" outcome denotes category 1 (success) and "0" denotes category 2. The difference δ = P(Y2 = 1) − P(Y1 = 1) between marginal probabilities occurs as a parameter in

P(Yt = 1) = α + δxt,   (10.5)
where x1 = 0 and x2 = 1; then, P(Y1 = 1) = α and P(Y2 = 1) = α + δ. Alternatively, the logit link yields

logit P(Yt = 1) = α + βxt.   (10.6)

The parameter β is a log odds ratio for the marginal distributions. Models (10.5) and (10.6) are marginal models: They focus on the marginal distributions of responses for the two observations. For instance, in terms of the population-averaged table, the ML estimate of β in (10.6) is the log odds ratio of marginal proportions, β̂ = log[p+1 p2+/(p+2 p1+)]. See Problem 10.26 for its asymptotic variance.

By contrast, the subject-specific table having strata like Table 10.2 implicitly allows probabilities to vary by subject. Let (Yi1, Yi2) denote the ith pair of observations, i = 1, ..., n. A model then has the form

link[P(Yit = 1)] = αi + βxt.   (10.7)

This is called a conditional model, since the effect β is defined conditional on the subject. Its estimate describes conditional association for the three-way table stratified by subject. The effect is subject-specific, since it is defined at the subject level. By contrast, the effects in marginal models (10.5) and (10.6) are population-averaged, since they refer to averaging over the entire population rather than to individual subjects.

For the identity link, subject-specific and population-averaged effects are identical. For instance, for the conditional model (10.7) with identity link, β = P(Yi2 = 1) − P(Yi1 = 1) for all i, and averaging this over subjects in the population equates β to the δ parameter in model (10.5). For nonlinear links, however, the effects differ. For model (10.7) with the logit link, for instance,

P(Yit = 1) = exp(αi + βxt)/[1 + exp(αi + βxt)].

The average of this for the population does not have the form exp(α + βxt)/[1 + exp(α + βxt)] corresponding to the marginal logit model (10.6).
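The distinction can be made concrete with Table 10.1. The population-averaged estimate for model (10.6) is the log odds ratio of the marginal proportions, while the subject-specific estimate for the logit version of (10.7) (derived in Section 10.2.3) uses only the discordant counts. A sketch in plain Python (the script is illustrative, not from the text):

```python
from math import log

# Table 10.1 counts
n11, n12, n21, n22 = 794, 150, 86, 570
n = n11 + n12 + n21 + n22

# Population-averaged (marginal) estimate: log odds ratio of the
# marginal proportions, the ML estimate of beta in model (10.6)
p1plus, plus1 = (n11 + n12) / n, (n11 + n21) / n
beta_marginal = log((plus1 * (1 - p1plus)) / ((1 - plus1) * p1plus))

# Subject-specific (conditional) estimate for the logit model:
# uses only the discordant pairs (Section 10.2.3)
beta_conditional = log(n21 / n12)
```

Here beta_marginal ≈ −0.163 while beta_conditional ≈ −0.556; as is typical with a nonlinear link, the subject-specific effect is larger in magnitude than the population-averaged one.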
We now take a closer look at the conditional model with logit link.

10.2.2 A Logit Model with Subject-Specific Probabilities

Model (10.7) differs from models in earlier chapters by permitting subjects to have their own probability distributions. Cox (1958b, 1970) and Rasch (1961) presented this model with logit link. This model for Yit, observation t for subject i, is

logit P(Yit = 1) = αi + βxt,   (10.8)

where x1 = 0 and x2 = 1. Although permitting subject-specific distributions, it assumes a common effect β. For subject i,

P(Yi1 = 1) = exp(αi)/[1 + exp(αi)],   P(Yi2 = 1) = exp(αi + β)/[1 + exp(αi + β)].

The parameter β compares the response distributions. For each subject, the odds of success for observation 2 are exp(β) times the odds for observation 1.

Given the parameters, with model (10.8) one normally assumes independence of responses for different subjects and for the two observations on the same subject. However, averaged over all subjects, the responses are nonnegatively associated. Suppose that |β| is small compared to |αi|. A subject with a large positive αi has high P(Yit = 1) for each t and is likely to have a success each time; a subject with a large negative αi has low P(Yit = 1) for each t and is likely to have a failure each time. The greater the variability in {αi}, the greater the overall positive association between responses, successes (failures) for observation 1 tending to occur with successes (failures) for observation 2. This is true for any β. The positive association reflects the shared value of αi for each observation in a pair. No association occurs only when {αi} are identical. Thus, the model does account for the dependence in matched pairs. Fitting it takes the nonnegative association into account through the structure of the model.
For this model, the large number of {αi} causes difficulties with the fitting process and with the properties of ordinary ML estimators (Problem 10.24). The remedy of conditional ML treats them as nuisance parameters and maximizes the likelihood function for a conditional distribution that eliminates them.

A note on terminology: We've referred to model (10.8) as a conditional model, meaning that its effect β is subject-specific, conditional on the subject. The analyses described below for such models are examples of conditional logistic regression; but here the term conditional refers to the ML analysis that is performed conditional on sufficient statistics for nuisance parameters, to eliminate those parameters from the likelihood.

10.2.3 Conditional ML Inference for Binary Matched Pairs

For model (10.8), assuming independence of responses for different subjects and for the two observations on the same subject, the joint mass function for {(y11, y12), ..., (yn1, yn2)} is

∏(i=1 to n) [exp(αi)/(1 + exp(αi))]^yi1 [1/(1 + exp(αi))]^(1−yi1)
         × [exp(αi + β)/(1 + exp(αi + β))]^yi2 [1/(1 + exp(αi + β))]^(1−yi2).

In terms of the data, this is proportional to

exp[Σi αi(yi1 + yi2) + β Σi yi2].

To eliminate {αi}, we condition on their sufficient statistics, the pairwise success totals {Si = yi1 + yi2}. Given Si = 0, P(Yi1 = Yi2 = 0) = 1, and given Si = 2, P(Yi1 = Yi2 = 1) = 1. The distribution of (Yi1, Yi2) depends on β only when Si = 1; that is, only when outcomes differ for the two responses. Given yi1 + yi2 = 1, the conditional distribution is

P(Yi1 = yi1, Yi2 = yi2 | Si = 1) = P(Yi1 = yi1, Yi2 = yi2) / [P(Yi1 = 1, Yi2 = 0) + P(Yi1 = 0, Yi2 = 1)].

Substituting the model probabilities, the factors involving αi cancel from numerator and denominator, leaving

P(Yi1 = 0, Yi2 = 1 | Si = 1) = exp(β)/[1 + exp(β)],
P(Yi1 = 1, Yi2 = 0 | Si = 1) = 1/[1 + exp(β)].

Again, let {nab} denote the counts for the four possible sequences. For subjects having Si = 1, Σi yi1 = n12, the number of subjects having success for observation 1 and failure for observation 2. Similarly, for those subjects, Σi yi2 = n21 and Σi Si = n* = n12 + n21. Since n21 is the sum of n* independent, identical Bernoulli variates, its conditional distribution is binomial with parameter exp(β)/[1 + exp(β)]. For testing marginal homogeneity (β = 0), the parameter equals ½.

In summary, the conditional analysis for the logit model implies that pairs in which yi1 = yi2 are irrelevant to inference about β. When this model is realistic, it provides justification for comparing marginal distributions using only the n12 + n21 pairings having outcomes in different categories at the two observations. Conditional on Si = 1, the joint distribution of the matched pairs is

∏(Si=1) [1/(1 + exp(β))]^yi1 [exp(β)/(1 + exp(β))]^yi2 = exp(β)^n21 / [1 + exp(β)]^n*,   (10.9)

where the product refers to all pairs having Si = 1. Differentiating the log of this conditional likelihood, equating to 0, and solving yields the conditional ML estimator of β in model (10.8). You can check that it and its standard error are

β̂ = log(n21/n12),   SE = (1/n21 + 1/n12)^(1/2).   (10.10)

10.2.4 Random Effects in the Binary Matched-Pairs Model

An alternative remedy for handling the huge number of nuisance parameters in logit model (10.8) treats {αi} as random effects. This regards {αi} as an unobserved random sample from a probability distribution, usually assumed to be N(μ, σ²) with unknown μ and σ. It eliminates {αi} by averaging with respect to their distribution, yielding a marginal distribution. The likelihood function then depends on β as well as the N(μ, σ²) parameters. It has only three parameters and is more manageable.
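Formula (10.10) is immediate to apply. A sketch computing the conditional ML estimate for Table 10.1, together with a 95% confidence interval for the subject-specific odds ratio exp(β) (the interval is an illustration, not from the text):

```python
from math import exp, log, sqrt

# Discordant counts from Table 10.1
n12, n21 = 150, 86

# Conditional ML estimate (10.10) for beta in model (10.8)
beta_hat = log(n21 / n12)
se = sqrt(1 / n21 + 1 / n12)

# 95% confidence interval for the subject-specific odds ratio exp(beta)
or_ci = (exp(beta_hat - 1.96 * se), exp(beta_hat + 1.96 * se))
```

For these counts, β̂ ≈ −0.556 with SE ≈ 0.135: within subjects, the estimated odds of approval at the second survey are exp(−0.556) ≈ 0.57 times the odds at the first.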
For matched pairs with nonnegative sample log odds ratio, this approach also yields $\hat\beta = \log(n_{21}/n_{12})$ (Neuhaus et al. 1994). This model is an example of a generalized linear mixed model, containing both random effects and the fixed effect $\beta$. Its analysis is presented in Chapter 12.

Model (10.8) implies that the true odds ratio for each of the $n$ subject-specific partial tables equals $\exp(\beta)$. In Section 6.3.5 we presented the Mantel–Haenszel estimate of a common odds ratio for several $2 \times 2$ tables. In fact, that estimator applied to subject-specific tables of the form shown in Table 10.2 is algebraically identical to $n_{21}/n_{12}$ for tables of the form shown in Table 10.1. (Recall that partial tables with responses in only one column do not contribute to the CMH test or Mantel–Haenszel estimate.) In summary, the Mantel–Haenszel estimate, the conditional ML estimate, and (with nonnegative log odds ratio) the ML estimate for the random effects version of logit model (10.8) yield $\exp(\hat\beta) = n_{21}/n_{12}$.

10.2.5 Logistic Regression for Matched Case–Control Studies

The two observations $(y_{i1}, y_{i2})$ in a matched pair need not refer to the same subject. For instance, case–control studies that match a single control with each case yield matched-pairs data. For a binary response $Y$, each case ($Y = 1$) is matched with a control ($Y = 0$) according to criteria that could affect the response. Subjects in the matched pairs are measured on the predictor variable(s) of interest, $X$, and the $XY$ association is analyzed.

Table 10.3 illustrates. A case–control study of acute myocardial infarction (MI) among Navajo Indians matched 144 victims of MI according to age and gender with 144 people free of heart disease. Subjects were asked whether they had ever been diagnosed as having diabetes ($x = 0$, no; $x = 1$, yes). Table 10.3 has the same form as Table 10.1 except that the levels of $X$ rather than the levels of $Y$ form the rows and the columns.
One can display the data for each matched case–control pair using a partial table of the form shown in Table 10.2, but reversing the roles of $X$ and $Y$. The $X$ values have four possible patterns, shown in Table 10.4. There are 37 partial tables of type a, since for 37 pairs the case had diabetes and the control did not, 16 partial tables of type b, 9 of type c, and 82 of type d. Now, for subject $t$ in matched pair $i$, consider the model
$$
\text{logit } P(Y_{it} = 1) = \alpha_i + \beta x_{it}. \qquad (10.11)
$$

TABLE 10.3 Previous Diagnoses of Diabetes for Myocardial Infarction (MI) Case–Control Pairs

                                MI Cases
    MI Controls      Diabetes   No Diabetes   Total
    Diabetes             9          16          25
    No diabetes         37          82         119
    Total               46          98         144

Source: J. L. Coulehan et al., Amer. J. Public Health 76: 412–414 (1986); reprinted with permission from the American Public Health Association.

TABLE 10.4 Possible Case–Control Pairs for Table 10.3

                     a               b               c               d
    Diabetes   Case Control    Case Control    Case Control    Case Control
    Yes          1     0         0     1         1     1         0     0
    No           0     1         1     0         0     0         1     1

The probabilities modeled refer to the distribution of $Y$ given $X$, but the retrospective study provides information only about the distribution of $X$ given $Y$. One can estimate the odds ratio $\exp(\beta)$, however, since it refers to the $XY$ odds ratio, which relates to both conditional distributions (Sections 2.2.4, 5.1.4). Even though this study reverses the roles of $X$ and $Y$ in terms of which is fixed and which is random, the conditional ML estimate of $\exp(\beta)$ is simply $n_{21}/n_{12} = 37/16 = 2.3$.

10.2.6 Conditional ML for Matched Pairs with Multiple Predictors

When the binary response has $p$ predictors for case–control or subject-specific matched pairs, the model generalizes to
$$
\text{logit } P(Y_{it} = 1) = \alpha_i + \beta_1 x_{1it} + \beta_2 x_{2it} + \cdots + \beta_p x_{pit}, \qquad (10.12)
$$
where $x_{hit}$ denotes the value of predictor $h$ for observation $t$ in pair $i$, $t = 1, 2$. Typically, one predictor is an explanatory variable of interest, such as diabetes status.
The others are covariates being controlled, in addition to those already controlled by virtue of using them to form the matched pairs. The conditional ML approach to estimating $\{\beta_j\}$ conditions on sufficient statistics for $\{\alpha_i\}$ to eliminate them from the likelihood. Let $\mathbf{x}_{it} = (x_{1it}, \ldots, x_{pit})'$ and $\boldsymbol\beta = (\beta_1, \ldots, \beta_p)'$. A generalization of the derivation in Section 10.2.3 shows that
$$
P(Y_{i1}=0,\, Y_{i2}=1 \mid S_i=1) = \frac{\exp(\mathbf{x}_{i2}'\boldsymbol\beta)}{\exp(\mathbf{x}_{i1}'\boldsymbol\beta) + \exp(\mathbf{x}_{i2}'\boldsymbol\beta)},
$$
$$
P(Y_{i1}=1,\, Y_{i2}=0 \mid S_i=1) = \frac{\exp(\mathbf{x}_{i1}'\boldsymbol\beta)}{\exp(\mathbf{x}_{i1}'\boldsymbol\beta) + \exp(\mathbf{x}_{i2}'\boldsymbol\beta)}. \qquad (10.13)
$$
Dividing numerator and denominator by $\exp(\mathbf{x}_{i1}'\boldsymbol\beta)$ shows that the first equation has the form of logistic regression with no intercept and with predictor values $\mathbf{x}_i^* = \mathbf{x}_{i2} - \mathbf{x}_{i1}$. In fact, one can obtain conditional ML estimates for model (10.12) by fitting a logistic regression model to those pairs alone, using artificial response $y^* = 1$ when $(y_{i1}=0, y_{i2}=1)$, $y^* = 0$ when $(y_{i1}=1, y_{i2}=0)$, no intercept, and predictor values $\mathbf{x}_i^*$. This addresses the same likelihood as the conditional likelihood (Breslow et al. 1978; Chamberlain 1980).

To illustrate, for model (10.11) with Table 10.3, let $y_i^* = y_{i2} - y_{i1}$ and $x_i^* = x_{i2} - x_{i1}$. If $t = 1$ refers to the control and $t = 2$ to the case, then $y_i^* = 1$ always. Since $x_{it} = 1$ represents ''yes'' for diabetes and $x_{it} = 0$ represents ''no,'' $(y_i^* = 1, x_i^* = -1)$ for 16 observations, $(y_i^* = 1, x_i^* = 0)$ for $9 + 82 = 91$ observations, and $(y_i^* = 1, x_i^* = +1)$ for 37 observations. The logit model that forces $\hat\alpha = 0$ has $\hat\beta = 0.84$. With a single binary predictor, the estimate is identical to $\log(n_{21}/n_{12})$.

10.2.7 Marginal Models and Conditional Models: Extensions

For binary matched-pairs data, Section 10.1 presented analyses for a marginal (i.e., population-averaged) model, and this section presented analyses for a conditional (i.e., subject-specific) model.
These models generalize to multinomial responses and to matched sets. For instance, Chamberlain (1980) discussed conditional ML for matched pairs on a multinomial response. For binary responses, model (10.12) applies when $\alpha_i$ refers to a set of repeated measurements on subject $i$. Or, it could refer to a matched set that is a cluster of subjects, such as children from family $i$ or fetuses from litter $i$.

With extensions of the conditional model to matched-set clusters, the conditional ML approach is restricted to estimating $\beta_j$ that are within-cluster effects, such as occur in case–control and crossover studies. For these, the explanatory variable varies in $t$ for each $i$. Conditional ML cannot estimate a between-cluster effect. Statistics providing information about such an effect use subject totals at different levels of the relevant explanatory variable; however, those totals sum the sufficient statistics for $\{\alpha_i\}$, so they are themselves fixed and have degenerate distributions after conditioning on the sufficient statistics. An explanatory variable that is constant in $t$ for each $i$ cancels out of the conditional likelihood. [You can observe this for matched pairs with (10.13) for any $j$ for which $x_{ji1} = x_{ji2}$ for all $i$.] For it, at best one can stratify by its levels and fit a model estimating within-cluster effects separately at each level. An advantage of using the random effects approach instead of conditional ML with the conditional model is that it is not restricted to estimating within-cluster effects.

In the remainder of this chapter we emphasize marginal models for matched pairs with multinomial responses. In the following chapter we deal with marginal model extensions allowing matched sets and explanatory variables. Conditional models using a random effects approach have extra computational complexities. We mention briefly some multinomial conditional models in this chapter, but we defer most discussion to Chapter 12.
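The two routes to the conditional ML estimate developed in this section, the closed form (10.10) and the no-intercept logistic fit to the paired differences, can be checked against each other numerically. A minimal Python sketch (not from the book) applies both to the diabetes data of Table 10.3, where $n_{21} = 37$ and $n_{12} = 16$:

```python
import math

# Discordant-pair counts from Table 10.3 (matched case-control diabetes data)
n21, n12 = 37, 16

# Route 1: closed-form conditional ML estimate and standard error (10.10)
beta_closed = math.log(n21 / n12)
se = math.sqrt(1 / n21 + 1 / n12)

# Route 2: no-intercept logistic regression on the differences x* = x2 - x1,
# with artificial response y* = 1 for every pair (Section 10.2.6).
# Data: x* = -1 (16 pairs), x* = 0 (91 pairs), x* = +1 (37 pairs);
# the pairs with x* = 0 contribute a constant and do not affect beta.
data = [(-1.0, 16), (0.0, 91), (+1.0, 37)]

def score_and_info(beta):
    """Score and observed information for the no-intercept logit likelihood."""
    score, info = 0.0, 0.0
    for x, count in data:
        p = 1.0 / (1.0 + math.exp(-beta * x))   # fitted P(y* = 1)
        score += count * x * (1.0 - p)
        info += count * x * x * p * (1.0 - p)
    return score, info

beta = 0.0
for _ in range(25):                              # Newton-Raphson iterations
    score, info = score_and_info(beta)
    beta += score / info
```

Both routes agree: $\hat\beta = \log(37/16) \approx 0.84$ with $SE \approx 0.30$, matching the estimate quoted in Section 10.2.6.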
10.3 MARGINAL MODELS FOR SQUARE CONTINGENCY TABLES

Matched-pairs analyses generalize from binary to $I > 2$ outcome categories. A square $I \times I$ table $\{n_{ab}\}$ shows counts of possible sequences $(a, b)$ of outcomes for $(Y_1, Y_2)$. Let $\pi_{ab} = P(Y_1 = a, Y_2 = b)$. Marginal homogeneity is
$$
P(Y_1 = a) = P(Y_2 = a) \quad \text{for } a = 1, \ldots, I.
$$
Marginal models compare $\{P(Y_1 = a)\}$ and $\{P(Y_2 = a)\}$.

10.3.1 Marginal Models for Ordinal Classifications

For ordered categories, marginal model (10.6) for binary matched pairs extends using ordinal logits. With cumulative logits,
$$
\text{logit } P(Y_t \le j) = \alpha_j + \beta x_t, \quad t = 1, 2, \quad j = 1, \ldots, I-1, \qquad (10.14)
$$
where $x_1 = 0$ and $x_2 = 1$. This model has proportional odds structure (Section 7.2.2). The odds of outcome $Y_2 \le j$ equal $\exp(\beta)$ times the odds of outcome $Y_1 \le j$. The model implies stochastically ordered marginal distributions, with $\beta > 0$ meaning that $Y_1$ tends to be higher than $Y_2$. Marginal homogeneity corresponds to $\beta = 0$.

Model fitting treats $(Y_1, Y_2)$ as dependent. The ML approach maximizes the multinomial likelihood for $\{\pi_{ab}\}$. This is not simple. Since the model refers to marginal probabilities $\{P(Y_1 = a) = \pi_{a+}\}$ and $\{P(Y_2 = b) = \pi_{+b}\}$, one cannot substitute the model formula in the kernel $\sum_a \sum_b n_{ab} \log \pi_{ab}$ of the log likelihood, which has joint probabilities. We defer discussion of ML model fitting of marginal models to Section 11.2.5. Model (10.14) describes the $2(I-1)$ marginal probabilities by $I$ parameters, so df $= I - 2$ for testing fit. Alternatively, one can compare margins using summaries such as a difference in means for chosen category scores (Problem 10.38).

10.3.2 Premarital and Extramarital Sex Example

Refer to Table 10.5. For a General Social Survey, subjects gave their opinion about premarital sex (a couple having sex before marriage) and extramarital sex (a married person having sex with someone other than the marriage partner).
The response categories are 1 = always wrong, 2 = almost always wrong, 3 = wrong only sometimes, 4 = not wrong at all. The sample cumulative marginal proportions are (0.307, 0.389, 0.611) for premarital sex and (0.815, 0.918, 0.987) for extramarital sex. This suggests that responses on premarital sex tended to be higher on the ordinal scale than those on extramarital sex. With scores (1, 2, 3, 4), the mean for premarital sex is 2.69, closest to the ''wrong only sometimes'' score, and the mean response for extramarital sex is 1.28, closest to the ''always wrong'' score.

The cumulative logit model (10.14) has $\hat\beta = 2.51$ ($SE = 0.13$). There is strong evidence that population responses are more positive on premarital than on extramarital sex. The fit of the marginal homogeneity model has $G^2 = 348.1$ (df = 3), and the fit of model (10.14) has $G^2 = 35.1$ (df = 2). The ordinal model does not fit well, but it fits much better than the marginal homogeneity model. Models to be considered in Section 10.4.7 fit better yet.

TABLE 10.5 Opinions on Premarital Sex and Extramarital Sex

                           Extramarital Sex
    Premarital Sex      1     2     3     4   Total
    1                 144     2     0     0     146
    2                  33     4     2     0      39
    3                  84    14     6     1     105
    4                 126    29    25     5     185
    Total             387    49    33     6     475

Source: 1989 General Social Survey, National Opinion Research Center.

10.3.3 Marginal Models for Nominal Classifications

With nominal responses, it is not sensible to assume the same effect for each logit. A baseline-category logit model has form
$$
\log\left[P(Y_t = j)/P(Y_t = I)\right] = \alpha_j + \beta_j x_t, \quad t = 1, 2, \quad j = 1, \ldots, I-1, \qquad (10.15)
$$
where $x_1 = 0$ and $x_2 = 1$. This model has $2(I-1)$ parameters for the $2(I-1)$ marginal probabilities. It is saturated. Marginal homogeneity is the special case $\beta_1 = \cdots = \beta_{I-1} = 0$. To fit it, Lipsitz et al. (1990) and Madansky (1963) maximized the multinomial likelihood for $\{n_{ab}\}$ subject to these constraints. Iterative methods produce fitted values $\{\hat\mu_{ab}\}$.
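Marginal homogeneity can also be tested directly from the sample marginal proportions, without iterative model fitting, using Bhapkar's Wald statistic $W$ and Stuart's score statistic $W_0$ developed in the next few paragraphs. A minimal numpy sketch (not from the book; the function name is illustrative), with a $2 \times 2$ check case where $W_0$ must reduce to McNemar's statistic $(n_{12}-n_{21})^2/(n_{12}+n_{21})$:

```python
import numpy as np

def marginal_homogeneity_stats(counts):
    """Bhapkar's W and Stuart's W0 for an I x I table of matched-pairs counts."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p = counts / n
    d = (p.sum(axis=0) - p.sum(axis=1))[:-1]   # d_a = p_{+a} - p_{a+}, a < I
    I = counts.shape[0]
    V0 = np.zeros((I - 1, I - 1))              # null (score) covariance of sqrt(n) d
    for a in range(I - 1):
        for b in range(I - 1):
            if a == b:
                V0[a, a] = p[:, a].sum() + p[a, :].sum() - 2 * p[a, a]
            else:
                V0[a, b] = -(p[a, b] + p[b, a])
    V = V0 - np.outer(d, d)                    # unrestricted (Wald) covariance
    W0 = n * d @ np.linalg.solve(V0, d)        # Stuart's score statistic
    W = n * d @ np.linalg.solve(V, d)          # Bhapkar's Wald statistic
    return W, W0

# 2 x 2 check case (hypothetical counts): McNemar gives (15 - 5)^2 / 20 = 5
W, W0 = marginal_homogeneity_stats([[10, 5], [15, 20]])
```

Both statistics are referred to chi-squared with df $= I - 1$; applied to the $4 \times 4$ migration table of Section 10.3.4, $W$ should reproduce the 236.5 reported there up to rounding.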
Comparing these fitted values $\{\hat\mu_{ab}\}$ to $\{n_{ab}\}$ using $G^2$ or $X^2$ tests marginal homogeneity, with df $= I - 1$.

Bhapkar (1966) tested marginal homogeneity by exploiting the asymptotic normality of marginal proportions. Let $d_a = p_{+a} - p_{a+}$, and let $\mathbf{d}' = (d_1, \ldots, d_{I-1})$. It is redundant to include $d_I$, since $\sum_a d_a = 0$. The sample covariance matrix $\hat{\mathbf{V}}$ of $\sqrt{n}\,\mathbf{d}$ has elements
$$
\hat v_{ab} = -(p_{ab} + p_{ba}) - (p_{+a} - p_{a+})(p_{+b} - p_{b+}) \quad \text{for } a \ne b,
$$
$$
\hat v_{aa} = p_{+a} + p_{a+} - 2p_{aa} - (p_{+a} - p_{a+})^2.
$$
Now $\sqrt{n}\,[\mathbf{d} - E(\mathbf{d})]$ has an asymptotic multivariate normal distribution with estimated covariance matrix $\hat{\mathbf{V}}$. Under marginal homogeneity, $E(\mathbf{d}) = \mathbf{0}$, and
$$
W = n\,\mathbf{d}'\hat{\mathbf{V}}^{-1}\mathbf{d} \qquad (10.16)
$$
is asymptotically chi-squared with df $= I - 1$. This is a Wald test for parameters in the analog of model (10.15) using the identity link. Stuart (1955) proposed $W_0 = n\,\mathbf{d}'\hat{\mathbf{V}}_0^{-1}\mathbf{d}$, which uses the sample null covariance matrix $\hat{\mathbf{V}}_0$ and is the score test. This has
$$
\hat v_{ab}^0 = -(p_{ab} + p_{ba}) \quad \text{for } a \ne b, \qquad
\hat v_{aa}^0 = p_{+a} + p_{a+} - 2p_{aa}.
$$
Ireland et al. (1969) noted that $W = W_0/(1 - W_0/n)$. For $I = 2$, $W_0$ is McNemar's statistic, the square of (10.4).

These tests use all $I - 1$ degrees of freedom available for comparisons of $I$ pairs of marginal proportions. With ordered categories, when $I$ is large and the dependence between classifications is strong, ordinal tests (with df = 1) can be much more powerful (Agresti 1984, p. 209).

TABLE 10.6 Migration from 1980 to 1985, with Fit of Marginal Homogeneity Model

                                        Residence in 1985
    Residence in 1980   Northeast          Midwest            South              West               Total
    Northeast           11,607 (11,607)       100 (98.1)         366 (265.7)        124 (94.0)      12,197 (12,064.7)
    Midwest                 87 (88.7)      13,677 (13,677)       515 (379.1)        302 (323.3)     14,581 (14,377.1)
    South                  172 (276.5)        225 (350.8)     17,819 (17,819)       270 (287.3)     18,486 (18,733.5)
    West                    63 (92.5)         176 (251.3)        286 (269.8)     10,192 (10,192)    10,717 (10,805.6)
    Total               11,929 (12,064.7) 14,178 (14,377.1) 18,986 (18,733.5)  10,888 (10,805.6)    55,981

Source: Data based on Table 12 of U.S.
Bureau of the Census, Current Population Reports, Series P-20, No. 420, Geographical Mobility: 1985 (Washington, DC: U.S. Government Printing Office, 1987).

10.3.4 Migration Example

For a sample of U.S. residents, Table 10.6 compares region of residence in 1985 with 1980. Relatively few people changed region, 95% of the observations falling on the main diagonal. The ML fit of marginal homogeneity, shown in Table 10.6, gives $G^2 = 240.8$ (df = 3). Statistics using differences in sample marginal proportions give similar results. For instance, Bhapkar's statistic (10.16) is $W = 236.5$ (df = 3).

The sample marginal proportions for the four regions were (0.218, 0.260, 0.330, 0.191) in 1980 and (0.213, 0.253, 0.339, 0.194) in 1985. Little change occurred over such a short time period. The large test statistics reflect the huge sample size. To estimate the change for a given region, we apply (10.2) to the collapsed $2 \times 2$ table that combines the other regions. A 95% confidence interval for $\pi_{+1} - \pi_{1+}$ is $(0.2131 - 0.2179) \pm 1.96(0.00054)$, or $-0.005 \pm 0.001$. Similarly, a 95% confidence interval for $\pi_{+2} - \pi_{2+}$ is $-0.007 \pm 0.001$, for $\pi_{+3} - \pi_{3+}$ is $0.009 \pm 0.001$, and for $\pi_{+4} - \pi_{4+}$ is $0.003 \pm 0.001$. Although strong evidence of change occurs for all four regions, the changes were small.

10.4 SYMMETRY, QUASI-SYMMETRY, AND QUASI-INDEPENDENCE

An alternative analysis of square contingency tables directly models the joint distribution using logit or loglinear models. Some models have marginal homogeneity as a special case.

An $I \times I$ joint distribution $\{\pi_{ab}\}$ satisfies symmetry if
$$
\pi_{ab} = \pi_{ba} \quad \text{whenever } a \ne b. \qquad (10.17)
$$
Under symmetry, $\pi_{a+} = \sum_b \pi_{ab} = \sum_b \pi_{ba} = \pi_{+a}$ for all $a$, so marginal homogeneity occurs. For $I = 2$, symmetry is equivalent to marginal homogeneity, but for $I > 2$, marginal homogeneity can occur without symmetry.

10.4.1 Symmetry as Logit and Loglinear Models

When all $\pi_{ab} > 0$, symmetry is a logit and a loglinear model.
In logit form, it is trivially
$$
\log(\pi_{ab}/\pi_{ba}) = 0 \quad \text{for all } a < b.
$$
For expected frequencies $\{\mu_{ab} = n\pi_{ab}\}$, it has the loglinear form
$$
\log \mu_{ab} = \lambda + \lambda_a + \lambda_b + \lambda_{ab}, \qquad (10.18)
$$
where all $\lambda_{ab} = \lambda_{ba}$. Both classifications have the same single-factor parameters $\{\lambda_a\}$, so $\log \mu_{ab} = \log \mu_{ba}$. Identifiability requires constraints. A simpler expression is $\log \mu_{ab} = \lambda_{ab}$, with all $\lambda_{ab} = \lambda_{ba}$.

For Poisson or multinomial cell counts $\{n_{ab}\}$, the likelihood equations are
$$
\hat\mu_{ab} + \hat\mu_{ba} = n_{ab} + n_{ba} \quad \text{for all } a < b, \qquad
\hat\mu_{aa} = n_{aa} \quad \text{for all } a.
$$
The main diagonal has perfect fit. The solution that satisfies symmetry is
$$
\hat\mu_{ab} = \frac{n_{ab} + n_{ba}}{2} \quad \text{for all } a, b.
$$
The logit symmetry model has no parameters for the $\binom{I}{2}$ binomial pairs $\{(n_{ab}, n_{ba})\}$ with $a < b$, so its residual df $= I(I-1)/2$. Equivalently, the loglinear symmetry model $\log \mu_{ab} = \lambda_{ab}$ ($\lambda_{ab} = \lambda_{ba}$) for the $I^2$ Poisson counts $\{n_{ab}\}$ has $\binom{I}{2}$ parameters $\{\lambda_{ab}\}$ with $a < b$ and $I$ parameters $\{\lambda_{aa}\}$, so df $= I^2 - [I + I(I-1)/2] = I(I-1)/2$.

For testing symmetry, Bowker (1948) showed that $X^2$ simplifies to
$$
X^2 = \sum\sum_{a<b} \frac{(n_{ab} - n_{ba})^2}{n_{ab} + n_{ba}}.
$$
For $I = 2$ this is McNemar's statistic, the square of (10.4). The standardized Pearson residuals equal
$$
r_{ab} = \frac{n_{ab} - n_{ba}}{(n_{ab} + n_{ba})^{1/2}}.
$$
Only one residual for each pair of categories is nonredundant, since $r_{ab} = -r_{ba}$. They satisfy $\sum\sum_{a<b} r_{ab}^2 = X^2$.

The symmetry model is very simple. Except for a few specialized applications, such as describing intraobserver agreement for pairs of measurements by an observer, it rarely fits well. When the marginal distributions differ substantially, it fits poorly.

10.4.2 Quasi-symmetry

One can accommodate marginal heterogeneity by permitting the main-effect terms in the symmetry model (10.18) to differ. The resulting loglinear model, called quasi-symmetry, is
$$
\log \mu_{ab} = \lambda + \lambda_a^X + \lambda_b^Y + \lambda_{ab}, \qquad (10.19)
$$
where $\lambda_{ab} = \lambda_{ba}$ for all $a < b$ (Caussinus 1966).
Symmetry is the special case $\lambda_a^X = \lambda_a^Y$ for $a = 1, \ldots, I$, and independence is the special case in which all $\lambda_{ab} = 0$. The likelihood equations for quasi-symmetry are
$$
\hat\mu_{a+} = n_{a+}, \quad a = 1, \ldots, I, \qquad
\hat\mu_{+b} = n_{+b}, \quad b = 1, \ldots, I, \qquad
\hat\mu_{ab} + \hat\mu_{ba} = n_{ab} + n_{ba} \quad \text{for } a \le b. \qquad (10.20)
$$
Only one of the first two sets of equations is needed; the other is redundant, given the remaining two. The residual df $= (I-1)(I-2)/2$.

From (10.20), $\hat\mu_{aa} = n_{aa}$ for $a = 1, \ldots, I$. Otherwise, the likelihood equations do not have a direct solution. They are solved using iterative methods such as Newton–Raphson and IPF (Caussinus 1966). The quasi-symmetry model has multiplicative form
$$
\pi_{ab} = \alpha_a \beta_b \gamma_{ab}, \quad \text{where } \gamma_{ab} = \gamma_{ba} \text{ for all } a < b \qquad (10.21)
$$
and all parameters are positive. The symmetry model is (10.21) with $\alpha_a = \beta_a$ for all $a$. This equation indicates that a table satisfying quasi-symmetry is the cellwise product of a table satisfying independence with one satisfying symmetry. The association symmetry implies that odds ratios on one side of the main diagonal are identical to corresponding odds ratios on the other side. In fact, the model can be defined by properties such as
$$
\frac{\mu_{ab}\,\mu_{II}}{\mu_{aI}\,\mu_{Ib}} = \frac{\mu_{ba}\,\mu_{II}}{\mu_{bI}\,\mu_{Ia}} \quad \text{for all } a < b \qquad (10.22)
$$
or $\theta_{ab} = \theta_{ba}$ for local odds ratios. Goodman (1979a) referred to it as the symmetric association model.

The meaning of quasi-symmetry is less obvious than symmetry. However, it usually fits much better and has greater scope. One way to interpret its parameters relates to subject-specific logit models. For such models having additivity of subject terms and occasion terms, of which model (10.8) is the simplest case, the joint distribution in the corresponding population-averaged table necessarily satisfies quasi-symmetry (see Darroch 1981; Section 13.2.7 shows this). Consider the generalization of baseline-category logit model (10.15) to a subject-specific model
$$
\log\left[P(Y_{it} = j)/P(Y_{it} = I)\right] = \alpha_{ij} + \beta_j x_t, \quad t = 1, 2, \quad j = 1, \ldots, I-1.
$$
This has the additive form of (10.8) for each $j$. The model implies, averaging over subjects, that the quasi-symmetry model (10.19) holds for the $I \times I$ population-averaged table with $\{\beta_j = \lambda_j^Y - \lambda_j^X\}$, when one constrains $\lambda_I^X = \lambda_I^Y = 0$. In fact, for the conditional ML analysis that conditions out $\{\alpha_{ij}\}$, the conditional ML estimates $\{\hat\beta_j\}$ relate to the ordinary ML fit of quasi-symmetry by $\{\hat\beta_j = \hat\lambda_j^Y - \hat\lambda_j^X\}$ (Conaway 1989). This provides an interpretation for the main-effect terms in quasi-symmetry. Related results hold for multiple occasions using a multivariate form (10.33) of quasi-symmetry (e.g., Agresti 1997; Conaway 1989; Darroch 1981; Tjur 1982; see also Section 13.2.7). In addition, quasi-symmetry contains as special cases other useful models, including the ones in Sections 10.4.3 and 10.6.3.

10.4.3 Quasi-independence

Square tables usually exhibit positive dependence, manifested by larger counts on the main diagonal than the independence model predicts. Conditional on the event that a matched pair falls off the main diagonal, though, the relationship may have a simple structure. A square contingency table satisfies quasi-independence when the variables are independent, given that the row and column outcomes differ. This has the loglinear form
$$
\log \mu_{ab} = \lambda + \lambda_a^X + \lambda_b^Y + \delta_a I(a = b), \qquad (10.23)
$$
where $I(\cdot)$ is the indicator function,
$$
I(a = b) = \begin{cases} 1, & a = b,\\ 0, & a \ne b. \end{cases}
$$
This adds a parameter to the independence model for each cell on the main diagonal. The first three terms in (10.23) specify independence, and $\{\delta_a\}$ permit $\{\mu_{aa}\}$ to depart from this pattern and have arbitrary positive values. When $\delta_a > 0$, $\mu_{aa}$ is larger than under independence. The likelihood equations for quasi-independence are
$$
\hat\mu_{a+} = n_{a+}, \qquad \hat\mu_{+a} = n_{+a}, \qquad \hat\mu_{aa} = n_{aa}, \qquad a = 1, \ldots, I.
$$
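These equations can be solved by iterative proportional fitting, cycling through scalings that match the row totals, the column totals, and the main-diagonal counts. A minimal Python sketch (not from the book), applied to the migration counts of Table 10.6:

```python
import math

# Migration counts from Table 10.6 (rows: 1980 region; columns: 1985 region)
n = [[11607,   100,   366,   124],
     [   87, 13677,   515,   302],
     [  172,   225, 17819,   270],
     [   63,   176,   286, 10192]]
I = len(n)
row = [sum(r) for r in n]
col = [sum(n[a][b] for a in range(I)) for b in range(I)]

# IPF for quasi-independence: each cycle rescales to satisfy, in turn,
# mu_{a+} = n_{a+}, mu_{+b} = n_{+b}, and mu_{aa} = n_{aa}.
mu = [[1.0] * I for _ in range(I)]
for _ in range(1000):
    for a in range(I):                          # match row totals
        s = sum(mu[a])
        mu[a] = [x * row[a] / s for x in mu[a]]
    for b in range(I):                          # match column totals
        s = sum(mu[a][b] for a in range(I))
        for a in range(I):
            mu[a][b] *= col[b] / s
    for a in range(I):                          # match diagonal counts exactly
        mu[a][a] = n[a][a]

# Deviance; the text reports G^2 = 69.5 (df = 5) for this model
G2 = 2 * sum(n[a][b] * math.log(n[a][b] / mu[a][b])
             for a in range(I) for b in range(I))
```

At convergence the off-diagonal fitted values satisfy independence exactly (every rectangular off-diagonal odds ratio equals 1), and they should match the quasi-independence column of Table 10.7 up to rounding.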
A perfect fit occurs on the main diagonal, but independence holds for the remaining cells. The model implies that odds ratios equal 1.0 for all rectangularly formed $2 \times 2$ tables in which all cells fall off the main diagonal. One can fit the model using Newton–Raphson or IPF. The model has $I$ more parameters than the independence model, so its residual df $= (I-1)^2 - I$. It applies to tables with $I \ge 3$. Quasi-independence is the special case of quasi-symmetry (10.21) in which $\{\gamma_{ab} \text{ for } a \ne b\}$ are identical. Caussinus (1966, p. 146) showed that they are equivalent when $I = 3$.

10.4.4 Migration Revisited

We now return to Table 10.6 on migration patterns. Not surprisingly, the independence model fits terribly, with $G^2 = 125{,}923$ and $X^2 = 146{,}929$. (The maximum possible value of $X^2$ is $3n = 167{,}943$; see Problem 3.33.) The symmetry model is also unpromising. For instance, 124 people moved from the northeast to the west, but only 63 people made the reverse move. The deviance for testing symmetry is $G^2 = 243.6$ (df = 6).

Quasi-independence states that for people who moved, residence in 1985 is independent of region in 1980. Table 10.7 contains its fitted values, for which $G^2 = 69.5$ (df = 5). This model fits much better than the independence model, primarily because it forces a perfect fit on the main diagonal, where most observations occur. However, lack of fit is apparent off that diagonal. Many more people moved from the northeast to the south and many fewer moved from the west to the south than quasi-independence predicts.

TABLE 10.7 Fit of Models to Table 10.6

Each off-diagonal cell shows the observed count, the quasi-independence fit (first parentheses), and the quasi-symmetry fit (second parentheses); both models give a perfect fit on the main diagonal.

                                        Residence in 1985
    Residence in 1980   Northeast             Midwest               South                 West                 Total
    Northeast           11,607                100 (126.6) (95.8)    366 (312.9) (370.4)   124 (150.5) (123.8)  12,197
    Midwest             87 (117.4) (91.2)     13,677                515 (531.1) (501.7)   302 (255.5) (311.1)  14,581
    South               172 (133.2) (167.6)   225 (243.8) (238.3)   17,819                270 (290.0) (261.1)  18,486
    West                63 (71.4) (63.2)      176 (130.6) (166.9)   286 (323.0) (294.9)   10,192               10,717
    Total               11,929                14,178                18,986                10,888               55,981

The quasi-symmetry model has $G^2 = 3.0$, with df = 3. Table 10.7 displays its fit, which is much better than with quasi-independence. The lack of symmetry in cell probabilities reflects slight marginal heterogeneity. The subject-specific effects can be described using the model's parameter estimates $\{\hat\lambda_1^Y - \hat\lambda_1^X = -0.672,\ \hat\lambda_2^Y - \hat\lambda_2^X = -0.623,\ \hat\lambda_3^Y - \hat\lambda_3^X = 0.122\}$. For instance, for a given subject the estimated odds of living in the south instead of the west in 1985 were $\exp(0.122) = 1.13$ times the odds in 1980. We'll see in Chapter 12 that such subject-specific effects tend to be stronger than those in corresponding marginal models, especially in tables like this with strong association.

A related application with matched samples is the study of occupational mobility. Each observation pairs parent's occupation with child's occupation (Goodman 1979b; Hout et al. 1987).

10.4.5 Marginal Homogeneity and Quasi-symmetry

Marginal homogeneity is not equivalent to a loglinear model. However, quasi-symmetry is a useful model for studying marginal homogeneity. Caussinus (1966) showed that symmetry is equivalent to quasi-symmetry and marginal homogeneity holding simultaneously. We have seen that symmetry implies both quasi-symmetry and marginal homogeneity. Now we give Caussinus's argument for the converse, that the joint occurrence of quasi-symmetry and marginal homogeneity implies symmetry.

From (10.21), if quasi-symmetry holds, $\pi_{ab} = \alpha_a \beta_b \gamma_{ab}$, where $\gamma_{ab} = \gamma_{ba} > 0$ for all $a < b$. Equivalently, $\pi_{ab} = \rho_a \delta_{ab}$, where $\rho_a = \alpha_a/\beta_a$ and $\delta_{ab} = \beta_a \beta_b \gamma_{ab}$ also satisfies $\delta_{ab} = \delta_{ba} > 0$ for all $a < b$. If there is also marginal homogeneity, then
$$
\pi_{j+} = \rho_j \sum_b \delta_{jb} = \sum_a \rho_a \delta_{aj} = \pi_{+j},
$$
or
$$
\rho_j = \Big(\sum_a \rho_a \delta_{aj}\Big) \Big/ \Big(\sum_b \delta_{jb}\Big)
       = \Big(\sum_a \rho_a \delta_{aj}\Big) \Big/ \Big(\sum_b \delta_{bj}\Big), \qquad j = 1, \ldots, I.
$$
Thus, each $\rho_j$ is a weighted average of $\{\rho_a\}$, with weights $\{\delta_{aj}/\sum_b \delta_{bj} > 0,\ a = 1, \ldots, I\}$. Any set $\{\rho_a\}$ satisfying this must be identical. Otherwise, there would be a $\rho_j$ that is no greater than any $\rho_a$ but smaller than at least one, and hence it could not be a positive weighted average of all of them. But since $\{\rho_a\}$ are identical,
$$
\pi_{ab} = \rho_a \delta_{ab} = \rho_b \delta_{ab} = \rho_b \delta_{ba} = \pi_{ba},
$$
so symmetry holds. Thus, a table that satisfies both quasi-symmetry and marginal homogeneity also satisfies symmetry. Since the converse holds,
$$
\text{quasi-symmetry} + \text{marginal homogeneity} = \text{symmetry}. \qquad (10.24)
$$
It follows that when quasi-symmetry (QS) holds, marginal homogeneity (MH) is equivalent to symmetry (S), which is $\{\lambda_a^X = \lambda_a^Y,\ a = 1, \ldots, I\}$ in the QS model. Thus, conditional on quasi-symmetry, testing marginal homogeneity is equivalent to testing symmetry. A test of marginal homogeneity compares fit statistics for the symmetry and quasi-symmetry models,
$$
G^2(S \mid QS) = G^2(S) - G^2(QS), \qquad (10.25)
$$
with df $= I - 1$. This is an alternative to approaches using marginal models discussed in Section 10.3.3.

Table 10.6 on migration from 1980 to 1985 has $G^2(S) = 243.6$ and $G^2(QS) = 3.0$. The difference $G^2(S \mid QS) = 240.6$ (df = 3) shows extremely strong evidence of marginal heterogeneity. Results are similar to those quoted in Section 10.3.4 for the likelihood-ratio test based on model (10.15), for which $G^2 = 240.8$, or the Wald test, for which $W = 236.5$ (both with df = 3).

10.4.6 Ordinal Quasi-symmetry Model

The loglinear models presented so far for square tables treat classifications as nominal. With ordered categories, more parsimonious models are useful. Let $u_1 \le \cdots \le u_I$ denote ordered scores for both the rows and columns. An ordinal quasi-symmetry model is
$$
\log \mu_{ab} = \lambda + \lambda_a + \lambda_b + \beta u_b + \lambda_{ab}, \qquad (10.26)
$$
where $\lambda_{ab} = \lambda_{ba}$ for all $a < b$. It is the special case of the quasi-symmetry model (10.19)
in which $\lambda_b^Y - \lambda_b^X = \beta u_b$ has a linear trend. Symmetry is the special case $\beta = 0$. This model has the logit representation
$$
\log(\pi_{ab}/\pi_{ba}) = \beta(u_b - u_a) \quad \text{for } a \le b. \qquad (10.27)
$$
This is the special case of the linear logit model, $\text{logit}(\pi) = \alpha + \beta x$, with $\alpha = 0$, $x = u_b - u_a$, and $\pi$ equal to the conditional probability of cell $(a, b)$, given response sequence $(a, b)$ or $(b, a)$. The greater the value of $|\beta|$, the greater the difference between $\pi_{ab}$ and $\pi_{ba}$ and hence between the marginal distributions.

The likelihood equations for ordinal quasi-symmetry are
$$
\sum_a u_a \hat\mu_{a+} = \sum_a u_a n_{a+}, \qquad
\sum_b u_b \hat\mu_{+b} = \sum_b u_b n_{+b}, \qquad
\hat\mu_{ab} + \hat\mu_{ba} = n_{ab} + n_{ba} \quad \text{for } a < b.
$$
The fitted marginal counts need not equal the observed marginal counts. However, dividing the first two equations by $n$ shows that they have the same means. When $\beta \ne 0$, this model implies stochastically ordered margins. When $\beta > 0$ ($\beta < 0$), responses have a higher mean in the column (row) distribution. Like the ordinal marginal models (Section 10.3.1), this model concentrates the marginal effect on df = 1. A test of marginal homogeneity ($H_0\colon \beta = 0$) uses
$$
\text{ordinal quasi-symmetry} + \text{marginal homogeneity} = \text{symmetry}.
$$
The likelihood-ratio test statistic compares the deviances for symmetry and ordinal quasi-symmetry.

One can fit this model by fitting (10.27) with logit model software: identify $(n_{ab}, n_{ba})$ as binomial with $n_{ab} + n_{ba}$ trials, and fit a logit model with no intercept and predictor $x = u_b - u_a$. One can also fit (10.26) using iterative methods for loglinear models.

10.4.7 Premarital and Extramarital Sex Revisited

For Table 10.5 on attitudes toward premarital and extramarital sex, a cursory glance at the data reveals that the symmetry model is inadequate ($G^2 = 402.2$, df = 6). By comparison, quasi-symmetry fits well ($G^2 = 1.4$, df = 3). The simpler model of ordinal quasi-symmetry also fits well: with scores $\{1, 2, 3, 4\}$, $G^2 = 2.1$ (df = 5). The ML estimate is $\hat\beta = -2.86$.
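The binomial logit trick just described needs only the six discordant pair totals from Table 10.5. A minimal Python sketch (not from the book) fits (10.27) by Newton–Raphson:

```python
import math

# Discordant pairs (a < b) from Table 10.5, with scores u = (1, 2, 3, 4):
# (count above diagonal n_ab, count below diagonal n_ba, score gap u_b - u_a)
pairs = [(2, 33, 1), (0, 84, 2), (0, 126, 3),
         (2, 14, 1), (0, 29, 2), (1, 25, 1)]

# Newton-Raphson for the no-intercept binomial logit model (10.27)
beta = 0.0
for _ in range(50):
    score, info = 0.0, 0.0
    for n_ab, n_ba, x in pairs:
        trials = n_ab + n_ba
        p = 1.0 / (1.0 + math.exp(-beta * x))   # fitted P(cell lies above diagonal)
        score += x * (n_ab - trials * p)
        info += x * x * trials * p * (1.0 - p)
    beta += score / info
```

The estimate agrees with the text's $\hat\beta = -2.86$; the negative sign reflects that responses on premarital sex fall in more permissive categories than those on extramarital sex.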
From (10.27), the estimated probability that the outcome on premarital sex is $x$ categories more positive than the outcome on extramarital sex equals $\exp(2.86x)$ times the reverse probability. For instance, the estimated probability that premarital sex is judged almost always wrong and extramarital sex is always wrong equals $\exp(2.86) = 17.4$ times the estimated probability that premarital sex is always wrong and extramarital sex is almost always wrong.

10.4.8 Other Ordinal Models for Square Tables

For ordered classifications, when symmetry does not hold, often either $\pi_{ab} > \pi_{ba}$ for all $a < b$, or $\pi_{ab} < \pi_{ba}$ for all $a < b$. A generalization of symmetry with this property is the logit model
$$
\log(\pi_{ab}/\pi_{ba}) = \tau \quad \text{for } a < b. \qquad (10.28)
$$
It implies that for all $a < b$,
$$
P(Y_{i1} = a, Y_{i2} = b \mid Y_{i1} < Y_{i2}) = P(Y_{i1} = b, Y_{i2} = a \mid Y_{i1} > Y_{i2}).
$$
The pattern of probabilities for cells above the main diagonal is a mirror image of the pattern for cells below it. This property is called conditional symmetry (McCullagh 1978). Problem 10.35 shows the corresponding loglinear model and its fit. Symmetry is the special case $\tau = 0$.

Another model generalizes quasi-independence. Let $\{u_a\}$ be ordered scores. The model
$$
\log \mu_{ab} = \lambda + \lambda_a^X + \lambda_b^Y + \beta u_a u_b + \delta_a I(a = b) \qquad (10.29)
$$
permits linear-by-linear association [see (9.6)] off the main diagonal. It is a special case of quasi-symmetry, and quasi-independence is the special case $\beta = 0$. For equal-interval scores, it implies uniform local association, given that responses differ. Goodman (1979a) called it quasi-uniform association.

For Table 10.5 on opinions about premarital and extramarital sex, the conditional symmetry model has $\hat\tau = -4.130$ ($SE = 0.451$). The estimated probability that extramarital sex is considered more wrong is $\exp(4.13) = 62.2$ times the estimated probability that premarital sex is considered more wrong. The quasi-uniform association model has $\hat\beta = 0.632$ ($SE = 0.106$).
Off the main diagonal, the estimated local odds ratio equals $\exp(0.632) = 1.88$.

10.5 MEASURING AGREEMENT BETWEEN OBSERVERS

We now discuss an application, analyzing agreement between two observers, that uses matched-pairs models. We illustrate with Table 10.8. This shows ratings by two pathologists, labeled A and B, who separately classified 118 slides regarding the presence and extent of carcinoma of the uterine cervix. The rating scale has the ordered categories (1) negative, (2) atypical squamous hyperplasia, (3) carcinoma in situ, (4) squamous or invasive carcinoma.

10.5.1 Agreement: Departures from Independence

Let $\pi_{ab}$ denote the probability that observer A classifies a slide in category $a$ and observer B classifies it in category $b$. Then $\pi_{aa}$ is the probability that they both choose category $a$, and $\sum_a \pi_{aa}$ is the total probability of agreement. Perfect agreement occurs when $\sum_a \pi_{aa} = 1$. With subjective scales, agreement is less than perfect. Analyses focus on describing strength of agreement and detecting patterns of disagreement.

TABLE 10.8 Diagnoses of Carcinoma

Values in parentheses are standardized Pearson residuals for the independence model.

                            Pathologist B
    Pathologist A      1           2           3           4        Total
    1               22 (8.5)    2 (-0.5)    2 (-5.9)    0 (-1.8)      26
    2                5 (-0.5)   7 (3.2)    14 (-0.5)    0 (-1.8)      26
    3                0 (-4.1)   2 (-1.2)   36 (5.5)     0 (-2.3)      38
    4                0 (-3.3)   1 (-1.3)   17 (0.3)    10 (5.9)       28
    Total           27          12         69          10            118

Source: N. S. Holmquist, C. A. McMahon, and O. D. Williams, Arch. Pathol. 84: 334–345 (1967); reprinted with permission from the American Medical Association. See also Landis and Koch (1977).

Agreement and association are distinct facets of the joint distribution. Strong agreement requires strong association, but strong association can exist without strong agreement. If observer A consistently rates subjects one category higher than observer B, strength of agreement is poor even though the association is strong.
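The overall agreement summaries just defined are simple to compute from Table 10.8. A minimal Python sketch (not from the book) contrasts the observed agreement with the level expected if the two pathologists' ratings were independent:

```python
# Table 10.8: rows = pathologist A, columns = pathologist B
n = [[22, 2, 2, 0],
     [5, 7, 14, 0],
     [0, 2, 36, 0],
     [0, 1, 17, 10]]
total = sum(sum(r) for r in n)                              # 118 slides
row = [sum(r) for r in n]                                   # A's marginal counts
col = [sum(n[a][b] for a in range(4)) for b in range(4)]    # B's marginal counts

# Observed probability of agreement, sum_a p_aa
p_agree = sum(n[a][a] for a in range(4)) / total

# Agreement expected under independence, sum_a p_{a+} p_{+a}
p_indep = sum(row[a] * col[a] for a in range(4)) / total**2
```

The observed agreement (about 0.64) well exceeds the roughly 0.28 expected by chance alone, consistent with the large positive diagonal residuals in Table 10.8.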
Evaluations of agreement compare {n_ab} to the values {n_a+ n_+b / n} predicted under independence. That model is a baseline, showing the agreement expected if no association existed between the ratings. Normally, it fits poorly if even mild agreement exists, but its cell standardized residuals (Section 3.3.1) show patterns of agreement and disagreement. Ideally, standardized residuals are large and positive on the main diagonal and large and negative off that diagonal. The sizes are influenced by the sample size n, however, with larger values tending to occur as n increases.

The independence model fits Table 10.8 poorly (G² = 118.0, df = 9). That table reports the standardized Pearson residuals in parentheses. The large positive residuals on the main diagonal indicate that agreement for each category is greater than expected by chance, especially for the first category. Off the main diagonal they are primarily negative. Disagreements occurred less often than expected under independence, although the evidence of this is weaker for categories closer together. The most common disagreements were observer B choosing category 3 and observer A instead choosing category 2 or 4.

10.5.2 Using Quasi-independence to Analyze Agreement

More complex models add components that relate to agreement beyond that expected under independence. A useful generalization is quasi-independence (10.23), which adds main-diagonal parameters {δ_a}. For Table 10.8, this model has G² = 13.2 (df = 5). It fits much better than independence, but some lack of fit remains. Table 10.9 shows the fit.

TABLE 10.9 Fitted Values for Carcinoma Diagnoses of Table 10.8

                              Pathologist B^a
Pathologist A        1                2                 3                4
1            22 (22.0)(22.0)    2 (0.7)(2.4)      2 (1.2)(1.6)     0 (0.0)(0.0)
2             5 (2.4)(4.6)      7 (7.0)(7.0)     14 (16.6)(14.4)   0 (0.0)(0.0)
3             0 (0.8)(0.4)      2 (3.3)(1.6)     36 (36.0)(36.0)   0 (0.0)(0.0)
4             0 (1.9)(0.0)      1 (3.0)(1.0)     17 (13.1)(17.0)  10 (10.0)(10.0)

a First parenthesized value: quasi-independence model; second parenthesized value: quasi-symmetry model.
For two subjects, suppose that each observer classifies one in category a and one in category b. The odds that the observers agree rather than disagree on which is in category a and which is in category b equal

    τ_ab = (π_aa π_bb)/(π_ab π_ba) = (μ_aa μ_bb)/(μ_ab μ_ba).     (10.30)

As τ_ab increases, the observers are more likely to agree for that pair of categories. Under quasi-independence,

    τ_ab = exp(δ_a + δ_b).

Larger {δ_a} represent stronger agreement. For instance, for Table 10.8, δ̂_2 = 0.6 and δ̂_3 = 1.9, and τ̂_23 = 12.3. The degree of agreement also seems fairly strong for other pairs of categories.

10.5.3 Quasi-symmetry and Agreement Modeling

For Table 10.8, the quasi-independence model shows some lack of fit. Given that the pathologists disagree, some association remains between ratings. For observer agreement tables, this is common. Quasi-symmetry (10.19) often fits much better, because it permits association. For Table 10.8, it has G² = 1.0 (df = 2). Table 10.9 displays the fit.

It is not unusual for such tables to have many empty cells. When n_ab + n_ba = 0 for any pair (such as categories 1 and 4 in Table 10.8), the ML fitted values for quasi-symmetry in those cells must also be zero, since one of its likelihood equations is μ̂_ab + μ̂_ba = n_ab + n_ba. One should eliminate those cells from the fitting process to get the proper residual df value.

Under quasi-symmetry, τ̂_ab = exp(λ̂_aa + λ̂_bb − λ̂_ab − λ̂_ba), where λ̂_ab = λ̂_ba. For categories 2 and 3 of Table 10.8, for instance, τ̂_23 = 10.7.

Loglinear models directly address the association component of agreement. The quasi-symmetry model also yields information about the similarity of the marginal distributions. The simpler symmetry model, which forces the margins to be identical, fits Table 10.8 poorly (G² = 39.2, df = 5). The statistic G²(S | QS) = 39.2 − 1.0 = 38.2 (df = 3) provides strong evidence of marginal heterogeneity.
In Table 10.8, differences in marginal proportions are substantial in each category but the first. The marginal heterogeneity is one reason that the agreement is not stronger.

Models for agreement can take the ordering of categories into account. Conditional on observer disagreement, a tendency usually remains for high (low) ratings by one observer to occur with relatively high (low) ratings by the other observer (see Problem 10.41).

10.5.4 Kappa Measure of Agreement

An alternative approach summarizes agreement with a single index. For nominal scales, the most popular measure is Cohen's kappa (Cohen 1960). It compares the probability of agreement, Σ_a π_aa, to that expected if the ratings were independent, Σ_a π_a+ π_+a, by

    κ = (Σ_a π_aa − Σ_a π_a+ π_+a) / (1 − Σ_a π_a+ π_+a).

The denominator equals the numerator with Σ_a π_aa replaced by its maximum possible value of 1, corresponding to perfect agreement. Kappa equals 0 when the agreement merely equals that expected under independence. It equals 1.0 when perfect agreement occurs. The stronger the agreement, the higher is κ, for given marginal distributions. Negative values occur when agreement is weaker than expected by chance, but this rarely happens.

For multinomial sampling, the sample value κ̂ has a large-sample normal distribution. Its estimated asymptotic variance (Fleiss et al. 1969) is

    σ̂²(κ̂) = (1/n) { P_o(1 − P_o)/(1 − P_e)²
             + 2(1 − P_o)[2 P_o P_e − Σ_a p_aa(p_a+ + p_+a)]/(1 − P_e)³
             + (1 − P_o)²[Σ_a Σ_b p_ab(p_b+ + p_+a)² − 4P_e²]/(1 − P_e)⁴ },

where P_o = Σ_a p_aa and P_e = Σ_a p_a+ p_+a.

It is rarely plausible that agreement is no better than expected by chance. Thus, rather than testing H_0: κ = 0, it is more relevant to estimate the strength of agreement by interval estimation of κ. For Table 10.8, P_o = 0.636 and P_e = 0.281. Sample kappa equals (0.636 − 0.281)/(1 − 0.281) = 0.493.
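These calculations are easy to reproduce directly. The following sketch (Python, with the cell counts read from Table 10.8) computes P_o, P_e, sample kappa, and the Fleiss et al. (1969) standard error:

```python
# Cohen's kappa and its asymptotic SE (Fleiss et al. 1969) for Table 10.8.
# Cell counts n[a][b]: pathologist A in row a, pathologist B in column b.
n = [[22, 2, 2, 0],
     [5, 7, 14, 0],
     [0, 2, 36, 0],
     [0, 1, 17, 10]]
N = sum(sum(row) for row in n)
I = len(n)
p = [[n[a][b] / N for b in range(I)] for a in range(I)]
row = [sum(p[a]) for a in range(I)]                       # p_{a+}
col = [sum(p[a][b] for a in range(I)) for b in range(I)]  # p_{+b}

P_o = sum(p[a][a] for a in range(I))           # observed agreement
P_e = sum(row[a] * col[a] for a in range(I))   # agreement expected under independence
kappa = (P_o - P_e) / (1 - P_e)

# Estimated asymptotic variance of kappa-hat, term by term
t1 = P_o * (1 - P_o) / (1 - P_e) ** 2
t2 = 2 * (1 - P_o) * (2 * P_o * P_e
     - sum(p[a][a] * (row[a] + col[a]) for a in range(I))) / (1 - P_e) ** 3
t3 = (1 - P_o) ** 2 * (sum(p[a][b] * (row[b] + col[a]) ** 2
     for a in range(I) for b in range(I)) - 4 * P_e ** 2) / (1 - P_e) ** 4
se = ((t1 + t2 + t3) / N) ** 0.5

print(round(P_o, 3), round(P_e, 3), round(kappa, 3), round(se, 3))
# P_o = 0.636, P_e = 0.281, kappa = 0.493, SE = 0.057
```

The output matches the values quoted in the text.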
The difference between the observed agreement and that expected under independence is about 50% of the maximum possible difference. The estimated standard error is 0.057, so κ apparently falls roughly between 0.4 and 0.6, moderately strong agreement.

10.5.5 Weighted Kappa: Quantifying Disagreement

Kappa treats classifications as nominal. When categories are ordered, the seriousness of a disagreement depends on the difference between the ratings. For nominal classifications also, some disagreements may be considered more severe than others. The measure weighted kappa (Spitzer et al. 1967) uses weights {w_ab} satisfying 0 ≤ w_ab ≤ 1, with all w_aa = 1 and all w_ab = w_ba, to describe closeness of agreement. One possibility is {w_ab = 1 − |a − b|/(I − 1)}, for which agreement is greater for cells nearer the main diagonal. Fleiss and Cohen (1973) suggested {w_ab = 1 − (a − b)²/(I − 1)²}. The weighted agreement is Σ_a Σ_b w_ab π_ab, and weighted kappa is

    κ_w = (Σ_a Σ_b w_ab π_ab − Σ_a Σ_b w_ab π_a+ π_+b) / (1 − Σ_a Σ_b w_ab π_a+ π_+b).

Controversy surrounds the utility of kappa and weighted kappa, partly because their values depend strongly on the marginal distributions. The same diagnostic rating process can yield quite different values, depending on the proportions of cases of the various types (Problem 10.40). In summarizing a contingency table by a single number, the reduction in information can be severe. It is helpful to construct models providing a more detailed investigation of the agreement and disagreement structure rather than to depend solely on a summary index.

10.5.6 Extensions to Multiple Observers

With several observers, ordinary loglinear models are not usually relevant. Their description of agreement and association between two observers is conditional on the ratings by the others. It is more relevant to study this marginally, without conditioning on the other ratings.
Hence, for R observers, modeling the pairwise agreement and association structure simultaneously requires studying the C(R, 2) pairs of two-way marginal distributions (Becker and Agresti 1992).

Other approaches have also been used. For instance, generalizations of kappa summarize pairwise agreements or multiple agreements (Fleiss 1981, Sec. 13.2; Landis and Koch 1977). Or, it may make sense to use a mixture model that assumes latent classes of subjects for whom the observers agree and subjects for whom they disagree. Such an analysis is shown in Section 13.1.2.

10.6 BRADLEY–TERRY MODEL FOR PAIRED PREFERENCES

Sometimes, categorical outcomes result from pairwise evaluations. A common example is athletic competition, in which the outcome for a team or player falls in the categories (win, lose). Another example is pairwise comparison of product brands, such as two brands of wine of some type. When a wine critic rates I brands of sauvignon blanc, it might be difficult to establish an outright ranking, especially if I is large. However, for any given pair, the critic could probably state a preference after tasting them on the same occasion. An overall ranking of the wines could then be based on the pairwise preferences. We present a model for this in this section.

10.6.1 Bradley–Terry Model

Bradley and Terry (1952) proposed a logit model for paired evaluations. Let Π_ab denote the probability that a is preferred to b. Suppose that Π_ab + Π_ba = 1 for all pairs; that is, a tie cannot occur. The Bradley–Terry model is

    log(Π_ab / Π_ba) = β_a − β_b.     (10.31)

Alternatively,

    Π_ab = exp(β_a) / [exp(β_a) + exp(β_b)].

Thus, Π_ab = 1/2 when β_a = β_b, and Π_ab > 1/2 when β_a > β_b. Identifiability requires a constraint such as β_I = 0 or Σ_a exp(β̂_a) = 1. Since the model describes the C(I, 2) probabilities {Π_ab for a < b} by I − 1 parameters, residual df = C(I, 2) − (I − 1).
For a < b, let N_ab denote the sample number of evaluations, with a preferred n_ab times and b preferred n_ba = N_ab − n_ab times. A square contingency table with empty cells on the main diagonal summarizes the results. When the N_ab comparisons are independent with probability Π_ab for each, n_ab has a bin(N_ab, Π_ab) distribution. If evaluations for different pairs are also independent, ordinary methods for logit models apply for fitting the model.

10.6.2 Home Team Advantage in Baseball

Table 10.10 shows results of the 1987 season for the seven baseball teams in the Eastern Division of the American League. For instance, of the games between Boston and New York, Boston won 7 and New York won 6. Table 10.10 shows the population of regular-season games. We regard this as a sample estimate of a conceptual distribution representing the long-run performance of the teams as constituted in 1987.

TABLE 10.10 Results of 1987 Season for American League Baseball Teams

                                        Losing Team^a
Winning Team  Milwaukee  Detroit  Toronto  New York  Boston   Cleveland  Baltimore
Milwaukee        —       7 (7.0)  9 (7.4)  7 (7.6)   7 (8.0)   9 (9.2)   11 (10.8)
Detroit       6 (6.0)       —     7 (7.0)  5 (7.1)  11 (7.6)   9 (8.8)    9 (10.5)
Toronto       4 (5.6)    6 (6.0)     —     7 (6.7)   7 (7.1)   8 (8.4)   12 (10.2)
New York      6 (5.4)    8 (5.9)  6 (6.3)     —      6 (7.0)   7 (8.3)   10 (10.1)
Boston        6 (5.0)    2 (5.4)  6 (5.9)  7 (6.0)      —      7 (7.9)   12 (9.8)
Cleveland     4 (3.8)    4 (4.2)  5 (4.6)  6 (4.7)   6 (5.1)      —       6 (8.6)
Baltimore     2 (2.2)    4 (2.5)  1 (2.8)  3 (2.9)   1 (3.2)   7 (4.4)      —

a Values in parentheses represent the fit of the Bradley–Terry model.
Source: American League Red Book, 1988 (St. Louis, MO: Sporting News Publishing Co.).

We fitted the Bradley–Terry model as a logit model for the C(7, 2) = 21 independent binomial samples, using an appropriate model matrix and no intercept (e.g., for SAS, see Table A.19). The model fits adequately (G² = 15.7, df = 15). Table 10.10 contains the fitted values {μ̂_ab}.
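The ML fit can also be reproduced without a logit-model routine, using the classic iterative algorithm for the Bradley–Terry model. The sketch below (Python; win counts taken from Table 10.10, teams indexed 0–6 in the table's order) is an alternative route to the same estimates, not the model-matrix fit used in the text:

```python
# Bradley-Terry fit for Table 10.10 by the classic iterative (MM) algorithm.
# wins[a][b] = number of times team a beat team b; each pair played 13 games.
teams = ["Milwaukee", "Detroit", "Toronto", "New York",
         "Boston", "Cleveland", "Baltimore"]
wins = [[0, 7, 9, 7, 7, 9, 11],
        [6, 0, 7, 5, 11, 9, 9],
        [4, 6, 0, 7, 7, 8, 12],
        [6, 8, 6, 0, 6, 7, 10],
        [6, 2, 6, 7, 0, 7, 12],
        [4, 4, 5, 6, 6, 0, 6],
        [2, 4, 1, 3, 1, 7, 0]]
I = len(teams)
w = [sum(wins[a]) for a in range(I)]           # total wins per team
N = [[wins[a][b] + wins[b][a] for b in range(I)] for a in range(I)]

p = [1.0 / I] * I                              # p_a proportional to exp(beta_a)
for _ in range(1000):
    p_new = [w[a] / sum(N[a][b] / (p[a] + p[b])
                        for b in range(I) if b != a) for a in range(I)]
    s = sum(p_new)
    p = [x / s for x in p_new]                 # normalize: sum of exp(beta_a) = 1

# Estimated P(Boston beats New York); the text reports 0.46
prob_bos_ny = p[4] / (p[4] + p[3])
print([round(x, 3) for x in p], round(prob_bos_ny, 2))
```

The normalized strengths agree with the exp(β̂_a) column of Table 10.11, and the Boston–New York probability matches the 0.46 quoted below.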
Table 10.11 displays the sample proportion of games each team won and the model estimates {β̂_a} (setting β̂_7 = 0) and {exp(β̂_a)} [setting Σ_a exp(β̂_a) = 1]. When Boston played New York, the estimated probability that Boston won is

    Π̂_54 = exp(β̂_5) / [exp(β̂_5) + exp(β̂_4)] = 0.46.

The standard error of each β̂_a and of each β̂_a − β̂_b is about 0.3, so not much evidence exists of a difference among the top five teams.

TABLE 10.11 Results of Fitting Bradley–Terry Models to Baseball Data

Team        Winning Percentage   β̂_i (10.31)   exp(β̂_i) (10.31)   exp(β̂_i) (10.32)
Milwaukee         64.1              1.58            0.218              0.220
Detroit           60.2              1.44            0.189              0.190
Toronto           56.4              1.29            0.164              0.164
New York          55.1              1.25            0.158              0.157
Boston            51.3              1.11            0.136              0.137
Cleveland         39.7              0.68            0.089              0.088
Baltimore         23.1              0.00            0.045              0.044

This model does not recognize which team is the home team. Most sports have a home-field advantage: A team is more likely to win when it plays in its home city. Table 10.12 contains results for the 1987 season according to the (home team, away team) classification. For instance, when Boston was the home team, it beat New York 4 times and lost 2 times; when New York was the home team, it beat Boston 4 times and lost 3 times.

TABLE 10.12 Wins/Losses by Home and Away Team, 1987

                                     Away Team
Home Team   Milwaukee  Detroit  Toronto  New York  Boston  Cleveland  Baltimore
Milwaukee      —         4-3      4-2      4-3       6-1      4-2        6-0
Detroit       3-3         —       4-2      4-3       6-0      6-1        4-3
Toronto       2-5        4-3       —       2-4       4-3      4-2        6-0
New York      3-3        5-1      2-5       —        4-3      4-2        6-1
Boston        5-1        2-5      3-3      4-2        —       5-2        6-0
Cleveland     2-5        3-3      3-4      4-3       4-2       —         2-4
Baltimore     2-5        1-5      1-6      2-4       1-6      3-4         —

Source: American League Red Book, 1988 (St. Louis, MO: Sporting News Publishing Co.).

Now, for all a ≠ b, let Π*_ab denote the probability that team a beats team b, when a is the home team. Consider the logit model

    log[Π*_ab / (1 − Π*_ab)] = α + (β_a − β_b).     (10.32)

When α > 0, a home-field advantage exists.
The home team of two evenly matched teams has probability exp(α)/[1 + exp(α)] of winning. For Table 10.12, model (10.32) describes 42 binomial distributions with 7 parameters. It has G² = 38.6 (df = 35). Table 10.11 displays the resulting {exp(β̂_a)}, which are similar to those obtained previously. The estimate of the home-field parameter is α̂ = 0.302. For two evenly matched teams, the home team had estimated probability 0.575 of winning. When Boston played New York, the estimated probability of a Boston win was 0.54 at Boston and 0.39 at New York.

Model (10.32) is a useful generalization of the Bradley–Terry model whenever an order effect exists. For instance, in pairwise taste evaluations, the product tasted first may have a slight advantage.

10.6.3 Bradley–Terry Model and Quasi-symmetry

Fienberg and Larntz (1976) showed that the Bradley–Terry model is a logit formulation of the quasi-symmetry model (10.19). For quasi-symmetry, given that an observation is in cell (a, b) or (b, a), the logit of the conditional probability of cell (a, b) equals

    log(μ_ab / μ_ba) = (λ + λ_a^X + λ_b^Y + λ_ab^XY) − (λ + λ_b^X + λ_a^Y + λ_ba^XY)
                     = (λ_a^X − λ_a^Y) − (λ_b^X − λ_b^Y)
                     = β_a − β_b,

where β_a = λ_a^X − λ_a^Y, using λ_ab^XY = λ_ba^XY. Estimates {λ̂_a^X} and {λ̂_a^Y} for quasi-symmetry yield {β̂_a} for the Bradley–Terry model.

10.6.4 Extensions to Ties and Ordinal Evaluations

The Bradley–Terry model extends to ordinal comparisons, such as the evaluation scale (much better, slightly better, the same, slightly worse, much worse) for comparing two products. With cumulative logits and an I-category evaluation scale, let Y_ab denote the response for a comparison of a with b. The model is

    logit P(Y_ab ≤ j) = α_j + (β_a − β_b).

Since P(Y_ab ≤ j) = P(Y_ba > I − j) = 1 − P(Y_ba ≤ I − j), it follows that logit[P(Y_ab ≤ j)] = −logit[P(Y_ba ≤ I − j)]. Thus, necessarily, α_j = −α_{I−j}. The most common ordered preference scale is (win, tie, lose).
Then, α_1 = −α_2.

10.7 MARGINAL AND QUASI-SYMMETRY MODELS FOR MATCHED SETS*

Methods for matched pairs extend to matched sets. Here we present mainly the loglinear modeling approach; in Chapters 11 and 12 we present extensions of the marginal and conditional logit modeling approaches.

10.7.1 Marginal Homogeneity, Complete Symmetry, and Quasi-symmetry

Let (Y_1, Y_2, ..., Y_T) denote the T responses in each matched set. With I response categories, a contingency table with I^T cells summarizes the possible outcomes. Let i = (i_1, ..., i_T) denote the cell having Y_t = i_t, t = 1, ..., T. Let π_i = P(Y_t = i_t, t = 1, ..., T), and let μ_i = nπ_i. Then P(Y_t = j) = π_{+···+j+···+}, where the j subscript is in position t, and {P(Y_t = j), j = 1, ..., I} is the marginal distribution for Y_t.

This T-way table satisfies marginal homogeneity if

    P(Y_1 = j) = P(Y_2 = j) = ··· = P(Y_T = j)   for j = 1, ..., I.

It satisfies complete symmetry if

    π_i = π_j

for any permutation j = (j_1, ..., j_T) of i = (i_1, ..., i_T). Complete symmetry implies marginal homogeneity, but the converse does not hold except when T = I = 2.

Complete symmetry is a loglinear model. One representation is

    log μ_i = λ_{ab...m},

where a is the minimum of (i_1, ..., i_T), b is the next smallest, ..., and m is the maximum. In a three-way table, for instance, log μ_122 = log μ_212 = log μ_221 = λ_122. The number of {λ_{ab...m}} parameters is the number of ways of selecting T out of I items with replacement, which is C(I + T − 1, T). Thus, residual df = I^T − C(I + T − 1, T) (Haberman 1978, p. 518).

An I^T table satisfies quasi-symmetry if

    log μ_i = λ_{1i_1} + λ_{2i_2} + ··· + λ_{Ti_T} + λ_{ab...m},     (10.33)

where λ_{ab...m} is defined as in the complete symmetry model. It has symmetric association and higher-order interaction terms, but permits each single-factor marginal distribution to have its own parameters.
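The residual df formula for complete symmetry stated above is easy to check numerically; a minimal sketch (Python, using the standard-library `math.comb` for the binomial coefficient):

```python
from math import comb

# Residual df for the complete symmetry model in an I^T table:
# number of cells minus the number of multisets of size T from I categories.
def symmetry_df(I, T):
    return I ** T - comb(I + T - 1, T)

# T = 2 recovers the familiar I(I-1)/2 residual df of the symmetry
# model for an I x I square table.
print(symmetry_df(4, 2))  # 16 - 10 = 6
print(symmetry_df(2, 3))  # 8 - 4 = 4
```

For T = 2 and I = 4 this gives df = 6, the usual symmetry df for a 4 × 4 table (it drops to 5 when a cell pair with n_ab + n_ba = 0 is deleted, as for Table 10.8).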
Identifiability requires constraints such as λ_{tI} = 0 for each t. One set of main-effect terms is redundant (Problem 10.31). This model has (I − 1)(T − 1) more parameters than complete symmetry. It is fitted using iterative methods.

For ordinal responses, a simpler model with quantitative main effects uses ordered scores {u_a}. The ordinal quasi-symmetry model is

    log μ_i = β_1 u_{i_1} + β_2 u_{i_2} + ··· + β_T u_{i_T} + λ_{ab...m},

where one can set β_T = 0. Complete symmetry is the special case β_1 = ··· = β_T.

When quasi-symmetry (10.33) or ordinal quasi-symmetry holds, marginal homogeneity is equivalent to complete symmetry. Marginal heterogeneity occurs if quasi-symmetry (QS) holds but complete symmetry (S) does not. The statistic

    G²(S | QS) = G²(S) − G²(QS)

tests marginal homogeneity. Under complete symmetry, it is asymptotically chi-squared with df = (I − 1)(T − 1). The corresponding test for the ordinal quasi-symmetry model has df = (T − 1).

10.7.2 Attitudes toward Legalized Abortion Example

Refer to Table 10.13. Subjects indicated whether they support legalized abortion in three situations: (1) if the family has a very low income and cannot afford any more children, (2) when the woman is not married and does not want to marry the man, and (3) when the woman wants it for any reason. The table also classifies subjects by gender, resulting in a 2⁴ table. Let μ_{ghij} denote the expected frequency for gender g (1 = female; 0 = male) with response sequence (h, i, j) for the three questions. Consider the model

    log μ_{ghij} = β_g + λ_{abc},

where the interaction term is λ_111 when (h, i, j) = (1, 1, 1), λ_112 when (h, i, j) = (1, 1, 2) or (1, 2, 1) or (2, 1, 1), λ_122 when (h, i, j) = (1, 2, 2) or (2, 1, 2) or (2, 2, 1), and λ_222 when (h, i, j) = (2, 2, 2). This model implies the same complete symmetry pattern of probabilities for each gender. Its fit has G² = 39.2 with df = 11.
Adding main-effect terms for the three issues implies the same quasi-symmetric pattern for each gender. It fits much better, having G² = 10.2 with df = 9. Thus, it seems plausible to assume a symmetric association structure. In fact, the loglinear model with only two-factor association terms has fitted log odds ratios of 3.2 for items 1 and 2, 2.6 for items 1 and 3, and 3.3 for items 2 and 3.

One can test marginal homogeneity, given gender, by the likelihood-ratio statistic 39.2 − 10.2 = 29.0, with df = 2. An analysis of the main-effect terms in the quasi-symmetry model shows greater support for legalized abortion when the family has a low income and cannot afford any more children than in the other two instances.

TABLE 10.13 Support for Legalizing Abortion in Three Situations, by Gender

                    Sequence of Responses on the Three Items^a
Gender   (1,1,1)  (1,1,2)  (2,1,1)  (2,1,2)  (1,2,1)  (1,2,2)  (2,2,1)  (2,2,2)
Male       342       26       6       21       11       32       19       356
Female     440       25      14       18       14       47       22       457

a Items are (1) if the family has a very low income and cannot afford any more children, (2) when the woman is not married and does not want to marry the man, and (3) when the woman wants it for any reason. 1, yes; 2, no.
Source: Data from 1994 General Social Survey, National Opinion Research Center.

10.7.3 Types of Marginal Symmetry

A general type of symmetry for I^T tables has marginal homogeneity and complete symmetry as special cases. For an I^T table, P(Y_{t_1} = j_1, ..., Y_{t_h} = j_h), where h is between 1 and T, is an h-dimensional marginal probability, h = 1 giving single-variable marginal probabilities. There is hth-order marginal symmetry if for all h-tuples j = (j_1, ..., j_h), this probability is the same for each permutation of j and for all combinations t = (t_1, ..., t_h) of h of the T responses.

For h = 1, first-order marginal symmetry is marginal homogeneity. Second-order marginal symmetry occurs if for all t and u, P(Y_t = a, Y_u = b)
is the same and the equality holds for all pairs of outcomes (a, b). In other words, the two-way marginal tables exhibit symmetry, and they are identical. Tth-order marginal symmetry in an I^T table is complete symmetry.

When hth-order symmetry holds, ith-order marginal symmetry holds for any i < h. For instance, complete symmetry implies second-order marginal symmetry, which itself implies marginal homogeneity. Although this hierarchy is mathematically attractive, the higher-order symmetries are usually too restrictive to fit well in practice.

10.7.4 Marginal Models: Multiway Tables

In practice, usually the form of the joint distribution is of secondary interest. Research questions pertain instead to the marginal distributions. The marginal models of Section 10.3 for matched pairs extend to matched sets. For instance, with ordinal classifications, a cumulative logit model is

    logit P(Y_t ≤ j) = α_j + β_t,   j = 1, ..., I − 1,  t = 1, ..., T.     (10.34)

In the next chapter we study marginal models in more general contexts, extending the analyses of this chapter to incorporate matched sets and explanatory variables.

NOTES

Section 10.1: Comparing Dependent Proportions

10.1. Miettinen (1969) generalized the McNemar test to case–control sets having several controls per case. The Table 10.2 representation is then useful. Each of n matched sets forms a stratum of a 2 × 2 × n table with one observation in column 1 (the case) and several observations in column 2 (the controls). Altham (1971) and Ghosh et al. (2000) presented Bayesian analyses for binary matched pairs. Copas (1973), Gart (1969), Kenward and Jones (1994), and Miettinen (1969) studied generalizations of matched-pairs designs. With some approaches (Ghosh et al. 2000; Liang and Zeger 1988; Suissa and Shuster 1991), inferences about marginal homogeneity also use the main-diagonal observations.

Section 10.4: Symmetry, Quasi-symmetry, and Quasi-independence

10.2.
For other discussion of quasi-symmetry, see Darroch (1981) and McCullagh (1982). The term quasi-independence originated in Goodman (1968). A more general definition of it is π_ab = α_a β_b for some fixed set of cells. See Caussinus (1966), Fienberg (1970b, 1972), and Goodman (1968). Caussinus used the concept to analyze tables that deleted a certain set of cells from consideration, and Goodman used it in earlier analyses of social mobility. Altham (1975) used it with triangular tables, for which observations occur only above or only below the main diagonal. Stigler (1999, Chap. 19) summarized early uses, including Karl Pearson's handling in 1913 of a triangular array. Booth and Butler (1999) and Smith et al. (1996) discussed exact tests for square-table models.

10.3. The effect β in ordinal quasi-symmetry relates to the occasion effect in a subject-specific adjacent-categories-logit model (Agresti 1993). Conditional symmetry is a special case of diagonals-parameter symmetry, log(π_ab/π_ba) = τ_{b−a}, a < b. See Goodman (1979b, 1985) and Hout et al. (1987).

10.4. In some applications a table is a priori symmetric or independent, but one can observe only the pair (i, j) rather than their order, thus leading to an upper-triangular table. See Khamis (1983) for examples and ML fitting of models for such three-way tables that are symmetric within layers.

Section 10.5: Measuring Agreement between Observers

10.5. Kappa and weighted kappa relate to the intraclass correlation, a measure of interrater reliability for interval scales (Fleiss 1981; Fleiss and Cohen 1973; Kraemer 1979). Banerjee et al. (1999) and Fleiss (1981, Chap. 13) reviewed kappa and its generalizations. See Becker and Agresti (1992), Goodman (1979b), Tanner and Young (1985), and Problem 10.41 for examples of modeling agreement with loglinear models. Darroch and McCloud (1986) showed that quasi-symmetry has an important role in agreement modeling.
Section 10.6: Bradley–Terry Model for Paired Preferences

10.6. Zermelo (1929) proposed a model that is equivalent to the Bradley–Terry model. Luce (1959) provided an axiomatic basis for it. Mosteller (1951) and Thurstone (1927) proposed an analogous model with probit link. An interesting interview of Ralph Bradley by M. Hollander (Stat. Sci. 16: 75–100, 2001) discussed food-tasting applications that motivated its development. For extensions, see Bradley (1976). Fienberg and Larntz (1976) and Imrey et al. (1976) related it to quasi-independence. Dittrich et al. (1998) allowed covariates. Matthews and Morris (1995) gave an application with a factorial design, ties, and allowance for dependence among judgments. Böckenholt and Dillon (1997) modeled dependence with ordinal preferences. David (1988) and Imrey (1998) surveyed paired preference methods.

PROBLEMS

Applications

10.1 Table 10.14 shows results when subjects were asked "Do you think a person has the right to end his or her own life if this person has an incurable disease?" and "When a person has a disease that cannot be cured, do you think doctors should be allowed to end the patient's life by some painless means if the patient and his family request it?" The table refers to these variables as "suicide" and "let patient die."

TABLE 10.14 Data for Problem 10.1

             Let Patient Die
Suicide       Yes     No
Yes          1097     90
No            203    435

Source: 1994 General Social Survey, National Opinion Research Center.

a. Compare the marginal proportions using a confidence interval.
b. Perform McNemar's test, and interpret.
c. Find the conditional ML estimate of β for model (10.8). Interpret.

10.2 Refer to Table 8.16 and Problem 8.1. Treat the data as matched pairs on opinion, stratified by gender. Testing independence for the 2 × 2 table using entries (6, 160) in row 1 and (11, 181) in row 2 tests equality of β for logit model (10.8) for each gender. Explain why.
10.3 A crossover experiment with 100 subjects compares two drugs for treating migraine headaches. The response scale is success (1) or failure (0). Half the study subjects, randomly selected, used drug A the first time they had a headache and drug B the next time. For them, 6 had outcomes (1, 1) for (A, B), 25 had outcomes (1, 0), 10 had outcomes (0, 1), and 9 had outcomes (0, 0). For the 50 subjects who took the drugs in the reverse order, 10 were (1, 1) for (A, B), 20 were (1, 0), 12 were (0, 1), and 8 were (0, 0).
a. Ignoring treatment order, compare the success probabilities for the two drugs. Interpret.
b. McNemar's test uses only the pairs of outcomes that differ. For this study, Table 10.15 shows such data from both treatment orders. Testing independence for this table tests whether success rates are identical for the treatments (Gart 1969). Explain why. Analyze these data, and interpret.

TABLE 10.15 Data for Problem 10.3

                   Treatment That Is Better
Treatment Order      First    Second
A, then B              25       10
B, then A              12       20

10.4 A case–control study has 8 pairs of subjects. The cases have colon cancer, and the controls are matched with the cases on gender and age. A possible explanatory variable is the extent of red meat in a subject's diet, measured as "1 = high" or "0 = low." The (case, control) observations on this were (1, 1) for 3 pairs, (0, 0) for 1 pair, (1, 0) for 3 pairs, and (0, 1) for 1 pair.
a. Cross-classify the 8 pairs in terms of diet (1 or 0) for the case against diet (1 or 0) for the control. Call this Table A. Display the 2 × 2 × 8 table with eight partial tables relating diet (1 or 0) to response (case or control) for the 8 pairs. Call this Table B.
b. Calculate the McNemar z² for Table A and the CMH statistic for Table B. Compare.
c. Show that the Mantel–Haenszel estimate of a common odds ratio for Table B is identical to n_12/n_21 for Table A.
d.
For Table B with pairs deleted in which the case and the control had the same diet, show that the CMH statistic and the Mantel–Haenszel odds ratio estimate do not change.
e. This sample size is small for large-sample tests. Use the binomial distribution with Table A to find the exact P-value for testing marginal homogeneity against the alternative hypothesis of a higher incidence of colon cancer for the high-red-meat diet.

10.5 Each week Variety magazine summarizes reviews of new movies by critics in several cities. Each review is categorized as pro, con, or mixed, according to whether the overall evaluation is positive, negative, or a mixture of the two. Table 10.16 summarizes the ratings from April 1995 through September 1996 for Chicago film critics Gene Siskel and Roger Ebert.

TABLE 10.16 Data for Problem 10.5

                   Ebert
Siskel      Con   Mixed   Pro
Con          24     8      13
Mixed         8    13      11
Pro          10     9      64

Source: A. Agresti and L. Winner, CHANCE 10: 10–14 (1997), reprinted with permission, copyright 1997 by the American Statistical Association.

a. Fit the symmetry model, quasi-independence model, and quasi-symmetry model. Interpret.
b. Test marginal homogeneity using models, and interpret.
c. Analyze these data using agreement models and/or measures of agreement.

10.6 Refer to Table 10.5. Fit the ordinal quasi-symmetry model using u_1 = 1 and u_4 = 4 and picking u_2 and u_3 that are unequally spaced but represent sensible choices. Compare results and interpretations to those in Sections 10.3.2 and 10.4.7.

10.7 Refer to all four items in Table 8.19.
a. Fit the complete symmetry and quasi-symmetry models. Test marginal homogeneity. Interpret.
b. Fit the ordinal quasi-symmetry model. Test marginal homogeneity. Interpret the effects.

10.8 Table 10.17 shows subjects' purchase choice of instant decaffeinated coffee at two times.
a. Fit the symmetry model and use residuals to analyze changes.
b. Test marginal homogeneity.
Show that the small P-value reflects a decrease in the proportion choosing High Point and an increase in the proportion choosing Sanka, with no evidence of change for the other coffees.
c. Show that quasi-independence has G² = 13.8 (df = 11). Interpret, and suggest other analyses that might be useful.

TABLE 10.17 Data for Problem 10.8

                                Second Purchase
First Purchase    High Point  Taster's Choice  Sanka  Nescafe  Brim
High Point            93            17           44      7      10
Taster's Choice        9            46           11      0       9
Sanka                 17            11          155      9      12
Nescafe                6             4            9     15       2
Brim                  10             4           12      2      27

Source: Based on data from R. Grover and V. Srinivasan, J. Market. Res. 24: 139–153 (1987). Reprinted with permission from the American Marketing Association.

10.9 Table 10.18 relates father's and son's occupational status for a British sample. Analyze these data, using models of (a) symmetry, (b) quasi-symmetry, (c) ordinal quasi-symmetry, (d) conditional symmetry, (e) marginal homogeneity, (f) quasi-independence, and (g) quasi-uniform association. Interpret using their fit and lack of fit.

TABLE 10.18 Data for Problem 10.9

                      Father's Status
Son's Status     1     2     3     4     5    Total
1               50    45     8    18     8     129
2               28   174    84   154    55     495
3               11    78   110   223    96     518
4               14   150   185   714   447    1510
5                3    42    72   320   411     848
Total          106   489   459  1429  1017    3500

Source: Reprinted with permission from D. V. Glass (ed.), Social Mobility in Britain, Glencoe, IL: Free Press (1954).

10.10 For Table 10.18, use kappa to describe agreement. Interpret.

10.11 Table 10.19 displays multiple sclerosis diagnoses for two neurologists who classified patients in two sites, Winnipeg and New Orleans. The diagnostic classes are (1) certain; (2) probable; (3) possible; and (4) doubtful, unlikely, or definitely not. For the New Orleans patients, study the agreement using (a) the independence model and residuals, (b) more complex models, and (c) kappa. Interpret each.
TABLE 10.19 Data for Problem 10.11

                        Winnipeg Neurologist
New Orleans      Winnipeg Patients       New Orleans Patients
Neurologist      1    2    3    4         1    2    3    4
1               38    5    0    1         5    3    0    0
2               33   11    3    0         3   11    4    0
3               10   14    5    6         2   13    3    4
4                3    7    3   10         1    2    4   14

Source: J. R. Landis and G. G. Koch, Biometrics 33: 159–174 (1977). Reprinted with permission from the Biometric Society.

10.12 For Problem 10.11, construct a model that describes agreement between neurologists for the two sites simultaneously.

10.13 Calculate kappa for a 4 × 4 table having n_{ii} = 5 for all i, n_{i,i+1} = 15 for i = 1, 2, 3, n_{41} = 15, and n_{ij} = 0 otherwise. Explain why strong association does not imply strong agreement.

10.14 Refer to Table 10.8. Based on the reported standardized residuals, explain why the linear-by-linear association model (9.6) might fit well. Fit it and describe the association.

10.15 In 1990, a sample of psychology graduate students at the University of Florida made blind, pairwise preference tests of three cola drinks. For 49 comparisons of Coke and Pepsi, Coke was preferred 29 times. For 47 comparisons of Classic Coke and Pepsi, Classic Coke was preferred 19 times. For 50 comparisons of Coke and Classic Coke, Coke was preferred 31 times. Comparisons resulting in ties are not reported.
a. Fit the Bradley–Terry model, analyze the quality of fit, and rank the drinks. Is there sufficient evidence to conclude a preference for one drink?
b. Estimate the probability that Coke is preferred to Pepsi, using the model, and compare to the sample proportion.

10.16 Table 10.20 refers to journal citations among four statistics journals during 1987–1989. The more often articles in a particular journal are cited, the more prestige that journal accrues. For citations involving pair A and B, view it as a victory for A if it is cited by B and a defeat for A if it cites B. Fit the Bradley–Terry model. Interpret the fit, and give a prestige ranking of the journals. For citations involving Commun. Stat.
and JRSS-B, estimate the probability that the Commun. Stat. article cites the JRSS-B article.

TABLE 10.20 Data for Problem 10.16

                             Cited Journal
Citing Journal    Biometrika  Commun. Stat.   JASA   JRSS-B
Biometrika               714             33    320      284
Commun. Stat.            730            425    813      276
JASA                     498             68   1072      325
JRSS-B                   221             17    142      188

Source: Stigler (1994). Reprinted with permission from the Institute of Mathematical Statistics.

TABLE 10.21 Data for Problem 10.17

                                 Loser
Winner         Seles  Graf  Sabatini  Navratilova  Sanchez
Seles              —     2         1            3        2
Graf               3     —         6            3        7
Sabatini           0     3         —            1        3
Navratilova        3     0         2            —        3
Sanchez            0     1         2            1        —

10.17 Table 10.21 refers to matches for several women tennis players during 1989 and 1990.
a. Fit the Bradley–Terry model. Interpret, and rank the players.
b. Estimate the probability of Seles beating Graf. Compare the model estimate to the sample proportion. Construct a 90% confidence interval for the probability.
c. Which pairs of players are significantly different according to an 80% simultaneous Bonferroni comparison?

10.18 Refer to Problem 3.3 on basketball free-throw shooting. Analyze these data.

10.19 Refer to Table 2.12 and Problem 2.19. Using models, describe the relationship between husband's and wife's sexual fun.

10.20 Refer to Table 8.19. The two-way table relating responses for the environment (as rows) and cities (as columns) has cell counts, by row, (108, 179, 157 / 21, 55, 52 / 5, 6, 24). Analyze these data.

Theory and Methods

10.21 Explain the following analogy: McNemar's test is to binary data as the paired-difference t test is to normally distributed data.

10.22 For a 2 × 2 table, derive cov(p_{+1}, p_{1+}), and show that var[√n(p_{+1} − p_{1+})] equals (10.1).

10.23 Refer to the subject-specific model (10.8) for binary matched pairs.
a. Show that exp(β) is a conditional odds ratio between observation and outcome. Explain the distinction between it and the odds ratio exp(β) for model (10.6).
b.
Using the conditional distribution (10.9), show that β̂ = log(n_{21}/n_{12}).
c. For a random sample of n pairs, explain why

    E(n_{21}/n) = (1/n) Σ_{i=1}^{n} {1/[1 + exp(α_i)]} {exp(α_i + β)/[1 + exp(α_i + β)]}.

Similarly, state E(n_{12}/n). Using their ratio for fixed n and as n → ∞, explain why n_{21}/n_{12} converges in probability to exp(β). (Hint: Apply the law of large numbers due to A. A. Markov for independent but not identically distributed random variables, or use Chebyshev's inequality.)
d. Show that the Mantel–Haenszel estimator (6.7) of a common odds ratio in the 2 × 2 × n form of the data simplifies to exp(β̂) = n_{21}/n_{12}.
e. Use the delta method to show (10.10) for the SE of β̂.
f. For a table of the form shown in Table 10.2, show that the CMH statistic (6.6) is algebraically identical to the McNemar statistic (n_{21} − n_{12})²/(n_{21} + n_{12}) for tables of Table 10.1 type.

10.24 Refer to Problem 10.23. Unlike the conditional ML estimator of β, the unconditional ML estimator is inconsistent (Andersen 1980, pp. 244–245; first shown by him in 1973). Show this as follows:
a. Assuming independence of responses for different subjects and different observations by the same subject, find the log likelihood. Show that the likelihood equations are y_{+t} = Σ_i P(Y_{it} = 1) and y_{i+} = Σ_t P(Y_{it} = 1).
b. Substituting exp(α_i)/[1 + exp(α_i)] + exp(α_i + β)/[1 + exp(α_i + β)] in the second likelihood equation, show that α̂_i = −∞ for the n_{22} subjects with y_{i+} = 0, α̂_i = ∞ for the n_{11} subjects with y_{i+} = 2, and α̂_i = −β̂/2 for the n_{21} + n_{12} subjects with y_{i+} = 1.
c. By breaking Σ_i P(Y_{it} = 1) into components for the sets of subjects having y_{i+} = 0, y_{i+} = 2, and y_{i+} = 1, show that the first likelihood equation is, for t = 1,

    y_{+1} = n_{22}(0) + n_{11}(1) + (n_{21} + n_{12}) exp(−β̂/2)/[1 + exp(−β̂/2)].

Explain why y_{+1} = n_{11} + n_{12}, and solve the first likelihood equation to show that β̂ = 2 log(n_{21}/n_{12}). Hence, as a result of Problem 10.23, β̂ converges in probability to 2β.

10.25 Consider marginal model (10.6)
when Y_1 and Y_2 are independent and conditional model (10.8) when {α_i} are identical. Explain why they are equivalent.

10.26 Let β̂_M = log(p_{+1} p_{2+}/p_{+2} p_{1+}) refer to marginal model (10.6) and β̂_C = log(n_{21}/n_{12}) to conditional model (10.8). Using the delta method, show that the asymptotic variance of √n(β̂_M − β_M) is

    (π_{1+} π_{2+})⁻¹ + (π_{+1} π_{+2})⁻¹ − 2(π_{11} π_{22} − π_{12} π_{21})/(π_{1+} π_{2+} π_{+1} π_{+2}).

Under the independence condition of the previous problem, β_M = β_C. In that case, show that the asymptotic variances satisfy

    var[√n β̂_M] = (π_{1+} π_{2+})⁻¹ + (π_{+1} π_{+2})⁻¹
                ≤ (π_{1+} π_{+2})⁻¹ + (π_{+1} π_{2+})⁻¹ = π_{12}⁻¹ + π_{21}⁻¹ = var[√n β̂_C].

10.27 Refer to model (10.12) for a matched-pairs study. For the conditional ML approach, show that the conditional distribution satisfies (10.13) and does not depend on β when S_i = 0 or 2. Show what happens to β_j in the conditional distribution for a predictor for which x_{ji1} = x_{ji2} for all i.

10.28 Consider model (10.12) for a study with matched sets of T observations rather than matched pairs. Explain how (10.13) generalizes, and construct the form of the conditional likelihood.

10.29 Give an example illustrating that when I > 2, marginal homogeneity does not imply symmetry.

10.30 Derive the likelihood equations and residual df for (a) symmetry, (b) quasi-symmetry, (c) quasi-independence, and (d) ordinal quasi-symmetry.

10.31 For the quasi-symmetry model (10.19), let λ_a = λ_a^X − λ_a^Y. Show that one can express it equivalently as log μ_{ab} = λ + λ_a + λ*_{ab}, with λ*_{ab} = λ*_{ba}. Hence, one needs only one set of main-effect parameters.

10.32 Show that quasi-symmetry is equivalent (Caussinus 1966) to

    (π_{ab} π_{bc} π_{ca})/(π_{ba} π_{cb} π_{ac}) = 1  for all a, b, and c.

10.33 Derive the covariance matrix (10.16) for the difference vector d.

10.34 Construct the loglinear model satisfying both marginal homogeneity and statistical independence.
Show that π̂_{ab} = (p_{+a} + p_{a+})(p_{+b} + p_{b+})/4 and residual df = I(I − 1).

10.35 Consider the conditional symmetry (CS) model (10.28).
a. Show that it has the loglinear representation

    log μ_{ab} = λ_{min(a,b), max(a,b)} + τ I(a < b),

where I(·) is an indicator (see also Bishop et al. 1975, pp. 285–286).
b. Show that the likelihood equations are

    μ̂_{ab} + μ̂_{ba} = n_{ab} + n_{ba} for all a ≤ b,    ΣΣ_{a<b} μ̂_{ab} = ΣΣ_{a<b} n_{ab}.

c. Show that τ̂ = log[(ΣΣ_{a<b} n_{ab})/(ΣΣ_{a>b} n_{ab})], μ̂_{aa} = n_{aa} for a = 1, . . . , I, and

    μ̂_{ab} = exp[τ̂ I(a < b)](n_{ab} + n_{ba})/[exp(τ̂) + 1] for a ≠ b.

d. Show that the estimated asymptotic variance of τ̂ is (ΣΣ_{a<b} n_{ab})⁻¹ + (ΣΣ_{a>b} n_{ab})⁻¹.
e. Show that residual df = (I + 1)(I − 2)/2.
f. Show that conditional symmetry + marginal homogeneity = symmetry. Explain why G²(S | CS) tests marginal homogeneity (df = 1). When the model holds, G²(S | CS) is asymptotically more powerful than G²(S | QS). Why?

10.36 Identify loglinear models that correspond to the logit models, for a < b, log(π_{ab}/π_{ba}) = (a) 0, (b) τ, (c) α_a − α_b, and (d) β(b − a).

10.37 A nonmodel-based ordinal measure of marginal heterogeneity is

    Δ̂ = ΣΣ_{a<b} p_{a+} p_{+b} − ΣΣ_{a>b} p_{a+} p_{+b}.

Show that Δ̂ estimates Δ = P(Y_1 > Y_2) − P(Y_2 > Y_1), where Y_1 has distribution {π_{a+}} and Y_2 is independent from {π_{+b}}. Show that marginal homogeneity implies that Δ = 0. Show that the estimated asymptotic variance of Δ̂ is

    [Σ_a Σ_b φ̂²_{ab} p_{ab} − (Σ_a Σ_b φ̂_{ab} p_{ab})²]/n,

where φ̂_{ab} = F̂_{b1} + F̂_{b−1,1} − F̂_{a2} − F̂_{a−1,2}, with F̂_{a1} = (p_{1+} + ⋯ + p_{a+}) and F̂_{a2} = (p_{+1} + ⋯ + p_{+a}) (Agresti 1984, pp. 208–209).

10.38 For ordered scores {u_a}, let ȳ_1 = Σ_a u_a p_{a+} and ȳ_2 = Σ_a u_a p_{+a}. Show that marginal homogeneity implies that E(Ȳ_1) = E(Ȳ_2) and that

    [Σ_a Σ_b (u_a − u_b)² p_{ab} − (ȳ_1 − ȳ_2)²]/n

estimates var(ȳ_1 − ȳ_2). Construct a test of marginal homogeneity (Bhapkar 1968).
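As a numerical illustration of Problem 10.38, the following sketch applies the mean-score comparison to the mobility data of Table 10.18 with scores u_a = a. (The z value it produces is our own calculation for illustration, not a result quoted in the text.)

```python
import numpy as np

# Table 10.18: son's status (rows) by father's status (columns)
n = np.array([
    [50,  45,   8,  18,   8],
    [28, 174,  84, 154,  55],
    [11,  78, 110, 223,  96],
    [14, 150, 185, 714, 447],
    [ 3,  42,  72, 320, 411],
], dtype=float)
N = n.sum()
p = n / N
u = np.arange(1, 6, dtype=float)     # scores u_a = a

ybar1 = u @ p.sum(axis=1)            # mean score of the row margin
ybar2 = u @ p.sum(axis=0)            # mean score of the column margin

# Estimated var(ybar1 - ybar2) from the formula in Problem 10.38
diff_sq = (u[:, None] - u[None, :]) ** 2
var_hat = (np.sum(diff_sq * p) - (ybar1 - ybar2) ** 2) / N

z = (ybar1 - ybar2) / np.sqrt(var_hat)   # approx. standard normal under MH
```

For these data z is about −4.5, indicating that sons' status tends to run lower-numbered (here, higher on the scale used) than fathers', so marginal homogeneity is implausible.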
10.39 Consider the multiplicative model for a square table,

    π_{ab} = α_a α_b (1 − β),           a ≠ b,
    π_{aa} = α_a² + β α_a (1 − α_a),    a = b.

a. Show that the model satisfies (i) symmetry, (ii) marginal homogeneity, (iii) quasi-symmetry, and (iv) quasi-independence.
b. Show that α_a = π_{a+} = π_{+a}, a = 1, . . . , I.
c. Show that β = Cohen's kappa, and interpret κ = 0 and κ = 1 for this model.

10.40 A 2 × 2 table has a true odds ratio of 10. Find the cell probabilities for which (a) π_{1+} = π_{+1} = 0.5, (b) π_{1+} = π_{+1} = 0.3, and (c) π_{1+} = π_{+1} = 0.1. Find the value of kappa for each. (This shows that for a given association, kappa depends strongly on the marginal probabilities; see also Sprott 2000, p. 59.)

10.41 A model for agreement on an ordinal response partitions beyond-chance agreement into that due to a baseline association and a main-diagonal increment (A. Agresti, Biometrics 44: 539–548, 1988). For ordered scores {u_a}, the model is

    log μ_{ab} = λ + λ_a^A + λ_b^B + β u_a u_b + δ I(a = b).    (10.35)

a. Show that this is a special case of quasi-symmetry and of quasi-association (10.29).
b. For agreement odds (10.30), show that log τ_{ab} = (u_b − u_a)² β + 2δ. For unit-spaced scores, show that the local odds ratios have log θ_{ab} = β when none of the four cells falls on the main diagonal.
c. Find the likelihood equations, and show that {μ̂_{ab}} and {n_{ab}} share the same marginal distributions, correlation, and prevalence of exact agreement.
d. For Table 10.8 using {u_a = a}, show that (10.35) has G² = 4.8 (df = 7), with δ̂ = 0.842 (SE = 0.427) and β̂ = 1.316 (SE = 0.420). Interpret using τ̂_{a,a+1} and θ̂_{ab} for |a − b| > 1.

10.42 Refer to the Bradley–Terry model.
a. Show that log(Π_{ac}/Π_{ca}) = log(Π_{ab}/Π_{ba}) + log(Π_{bc}/Π_{cb}).
b. With this model, is it possible that a could be preferred to b (i.e., Π_{ab} > Π_{ba}) and b could be preferred to c, yet c could be preferred to a? Explain.
c. Explain why {β_a} are not identifiable without a constraint such as β_I = 0.
(Hint: Show that the model holds when {β*_a = β_a − c} for any c.)

10.43 Refer to model (10.32).
a. Construct a more general model having home-team parameters {β_{Hi}} and away-team parameters {β_{Ai}}, such that the probability that team i beats team j when i is the home team is exp(β_{Hi})/[exp(β_{Hi}) + exp(β_{Aj})], where β_{AI} = 0 but β_{HI} is unrestricted.
b. Interpret the case {β_{Hi} = β_{Ai} + c} when (i) c = 0 and (ii) c > 0.
c. Fit the model to Table 10.12. Compare the fit to model (10.32). Compare {β̂_{Hi}} and {β̂_{Ai}} to describe how teams play at home and away.

10.44 Find the log likelihood for the Bradley–Terry model. From the kernel, show that (given {N_{ab}}) the minimal sufficient statistics are {n_{a+}}. Thus, explain how "victory totals" determine the estimated ranking.

10.45 Explain how to fit the complete symmetry model in T dimensions.

10.46 Prove that if kth-order marginal symmetry holds, jth-order marginal symmetry holds for any j < k.

10.47 Suppose that quasi-symmetry holds for an I^T table. When the table is collapsed over a variable, show that the model holds for the I^{T−1} table with the same main effects.

CHAPTER 11

Analyzing Repeated Categorical Response Data

Many studies observe the response variable for each subject repeatedly, at several times or under various conditions. Repeated categorical response data occur commonly in health-related applications, especially in longitudinal studies. For example, a physician might evaluate patients at weekly intervals regarding whether a new drug treatment is successful. In some cases explanatory variables may also vary over time. But the repeated responses need not refer to different times. A dental study might measure whether there is decay for each tooth in a subject's mouth. Often, the responses refer to matched sets, or clusters, of subjects. An example is a (survival, nonsurvival)
response for each fetus in a litter, for a sample of pregnant mice exposed to various dosages of a toxin. A multistage sample to study factors affecting obesity in children may regard children from the same family as a cluster. Observations within a cluster tend to be more alike than observations from different clusters. Ordinary analyses that ignore this may be badly inappropriate.

In this chapter we generalize methods of Chapter 10, which referred to matched pairs. In Section 11.1 we compare marginal distributions in T-way tables. The remaining sections extend models to include explanatory variables. For instance, many studies compare the repeated measurements for different groups or treatments. In Section 11.2 we use ML methods for fitting marginal models. In Section 11.3 we use generalized estimating equations (GEE), a multivariate version of quasi-likelihood that is computationally simpler than ML. Section 11.4 covers technical details about the GEE approach. In the final section we introduce a transitional approach that models observations in terms of previous outcomes.

11.1 COMPARING MARGINAL DISTRIBUTIONS: MULTIPLE RESPONSES

Usually, the multivariate dependence among repeated responses is of less interest than their marginal distributions. For instance, in treating a chronic condition (such as a phobia) with some treatment, the primary goal might be to study whether the probability of success increases over the T weeks of a treatment period. The T success probabilities refer to the T first-order marginal distributions. In Sections 10.2.1 and 10.3 we compared marginal distributions for matched pairs (T = 2) using models that apply directly to the marginal distributions. In this section we extend this approach to T > 2.

11.1.1 Binary Marginal Models and Marginal Homogeneity

Denote T binary responses by (Y_1, Y_2, . . . , Y_T). The marginal logit model (10.6) for matched pairs extends to

    logit P(Y_t = 1)
= α + β_t,  t = 1, . . . , T,    (11.1)

with a constraint such as β_T = 0 or α = 0. For a possible sequence of outcomes i = (i_1, i_2, . . . , i_T), where each i_t = 0 or 1, let

    π_i = P(Y_1 = i_1, Y_2 = i_2, . . . , Y_T = i_T).

Let π denote the vector of these probabilities for the possible i. They refer to a 2^T table that cross-classifies the T responses and describes the joint distribution of (Y_1, . . . , Y_T). The sample cell proportions are the ML estimates of π, and the sample proportion with y_t = 1 is the ML estimate of P(Y_t = 1). Model (11.1) is saturated, describing T marginal probabilities by T parameters.

Marginal homogeneity, for which P(Y_1 = 1) = ⋯ = P(Y_T = 1), is the special case β_1 = ⋯ = β_T. Even though this case has only one parameter, ML fitting is not simple. The multinomial likelihood refers to the 2^T joint cell probabilities π rather than the T marginal probabilities {P(Y_t = 1)}. Fitting methods are described in Section 11.2.5.

Let n_i denote the sample count in cell i. The kernel of the log likelihood L(π) is Σ_i n_i log π_i. Let L(p) denote the log likelihood evaluated at the sample proportions {p_i = n_i/n}, the ML fit of model (11.1). Let L(π̂_MH) denote the maximized log likelihood assuming marginal homogeneity. The likelihood-ratio test of marginal homogeneity (Lipsitz et al. 1990; Madansky 1963) uses

    −2[L(π̂_MH) − L(p)] = 2 Σ_i n_i log(p_i/π̂_{i,MH}).    (11.2)

The asymptotic null chi-squared distribution has df = T − 1, since the general model (11.1) has T − 1 more parameters than marginal homogeneity.

TABLE 11.1 Responses to Three Drugs in a Crossover Study

                     Drug A Favorable             Drug A Unfavorable
                B Favorable  B Unfavorable   B Favorable  B Unfavorable
C Favorable               6              2             2              6
C Unfavorable            16              4             4              6

Source: Reprinted with permission from the Biometric Society (Grizzle et al. 1969).
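Two non-likelihood tests of marginal homogeneity are easy to compute directly from Table 11.1's 2³ counts: Cochran's Q (a statistic discussed later in this chapter, from Cochran 1950) and an exact binomial comparison of a pair of drugs based on discordant pairs. This is an illustrative sketch; the Q value it yields is our own computation (the text's likelihood-ratio statistic for these data is 5.95).

```python
import math
import numpy as np

# Table 11.1 counts for response patterns (A, B, C); 1 = favorable
patterns = {(1, 1, 1): 6, (1, 1, 0): 16, (1, 0, 1): 2, (1, 0, 0): 4,
            (0, 1, 1): 2, (0, 1, 0): 4, (0, 0, 1): 6, (0, 0, 0): 6}

# Expand to a subjects-by-drugs 0/1 matrix
y = np.array([p for p, count in patterns.items() for _ in range(count)])
n_subj, T = y.shape

# Cochran's Q statistic for marginal homogeneity of T binary responses
col = y.sum(axis=0)     # favorable total for each drug
row = y.sum(axis=1)     # favorable total for each subject
Q = (T - 1) * (T * np.sum(col ** 2) - col.sum() ** 2) / (T * row.sum() - np.sum(row ** 2))

# Exact binomial (McNemar-type) two-sided comparison of drugs A and C
n_ac = int(np.sum((y[:, 0] == 1) & (y[:, 2] == 0)))   # A favorable, C unfavorable
n_ca = int(np.sum((y[:, 0] == 0) & (y[:, 2] == 1)))   # C favorable, A unfavorable
n_disc = n_ac + n_ca
p_exact = 2 * sum(math.comb(n_disc, k) for k in range(min(n_ac, n_ca) + 1)) / 2 ** n_disc
```

Here Q has an approximate chi-squared null distribution with df = T − 1 = 2, and p_exact is the two-sided exact P-value for the A-versus-C comparison (about 0.036 for these counts).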
11.1.2 Crossover Drug Comparison Example

Table 11.1 comes from a crossover study in which each subject used each of three drugs for treatment of a chronic condition at three times. The response measured the reaction as favorable or unfavorable. The 2³ table gives the (favorable, unfavorable) classification for reaction to drug A in the first dimension, drug B in the second, and drug C in the third. We assume that the drugs have no carryover effects and that the severity of the condition remained stable for each subject throughout the experiment. These assumptions are reasonable for many chronic conditions, such as migraine headache.

The sample proportion favorable was (0.61, 0.61, 0.35) for drugs (A, B, C). The likelihood-ratio statistic for testing marginal homogeneity is 5.95 (df = 2), for a P-value of 0.05. For simultaneous confidence intervals comparing pairs of treatments with overall error probability no greater than 0.05, the Bonferroni method uses confidence coefficient (1 − 0.05/3) = 0.9833 for each. For instance, from formula (10.1), the estimate 0.261 = 0.609 − 0.348 of the difference between drugs A and C has an estimated standard error of 0.108. The confidence interval for the true difference is 0.261 ± 2.39(0.108), or (0.002, 0.520). The same interval holds for the comparison of drugs B and C. There is some evidence that the proportion of favorable responses is lower for drug C. The sample size is not large, however, so we view these results with caution.

For each pair of drugs, a 2 × 2 table relates the two responses. An exact binomial test (Section 10.4.1) uses its off-diagonal counts. These yield P-values of 1.0 for comparing drugs A and B, and 0.036 both for comparing A with C and for comparing B with C.

11.1.3 Modeling Margins of a Multicategory Response

The binary marginal model (11.1) extends to multinomial responses. With baseline-category logits for I outcome categories, the saturated model is

    log[P(Y_t = j)/P(Y_t = I)] = β_{tj},  t = 1, . . .
, T,  j = 1, . . . , I − 1.    (11.3)

Marginal homogeneity, whereby P(Y_1 = j) = ⋯ = P(Y_T = j) for j = 1, . . . , I − 1, is the special case in which β_{1j} = β_{2j} = ⋯ = β_{Tj}, j = 1, . . . , I − 1. The likelihood-ratio test of marginal homogeneity comparing the two models has form (11.2) and df = (T − 1)(I − 1).

For an ordinal response, an unsaturated model that is more complex than marginal homogeneity focuses on shifts up and down in the T margins. One such model is

    logit P(Y_t ≤ j) = α_j + β_t,  t = 1, . . . , T,  j = 1, . . . , I − 1,    (11.4)

with a constraint such as β_T = 0. Marginal homogeneity is the special case β_1 = ⋯ = β_T. Its test has df = T − 1. The {α_j} satisfy α_1 < ⋯ < α_{I−1} because of the ordering of the cumulative probabilities. These models can be fitted using ML methodology presented in Section 11.2.5.

11.1.4 Wald and Generalized CMH Score Tests of Marginal Homogeneity

In this chapter we focus on modeling the marginal distributions rather than merely testing marginal homogeneity. However, a variety of tests are available besides the likelihood ratio, so we briefly summarize a couple of them. Let p_j(t) denote the sample proportion in category j for response Y_t, let

    p̄_j = Σ_t p_j(t)/T,    d_j(t) = p_j(t) − p̄_j,

and let d denote the vector of {d_j(t), t = 1, . . . , T − 1, j = 1, . . . , I − 1}. Let V̂ denote the estimated covariance matrix of √n d. Bhapkar (1973) proposed the Wald statistic

    W = n d′V̂⁻¹d    (11.5)

for the general alternative. This generalizes (10.16) and has a large-sample chi-squared distribution with df = (I − 1)(T − 1).

Other statistics are special cases of the generalized Cochran–Mantel–Haenszel (CMH) statistic (Section 7.5.3). Recall that for the binary case (I = 2) with matched pairs (T = 2), the CMH statistic applies to a three-way table (see, e.g., Table 10.2) in which each stratum shows the two outcomes for a given subject.
A generalization of Table 10.2 provides n strata of T × I tables. The kth stratum gives the T outcomes for subject k. Row t in a stratum has a 1 in the column that is the outcome for observation t, and 0 in all other columns (or 0 in every column if that observation is missing).

Probability distributions for the subject-stratified setup naturally relate to subject-specific models such as logit model (10.8), rather than to marginal models. However, conditional independence in this three-way table (given subject) corresponds to an exchangeability among variables in the I^T table that implies marginal homogeneity. A generalized CMH test of conditional independence in the T × I × n table also tests marginal homogeneity using a sampling distribution generated under the stronger exchangeability condition (Darroch 1981).

For an ordinal response with fixed scores, the generalized CMH statistic for detecting variability among T means is appropriate. When I = 2 and T = 2, this CMH approach is equivalent to McNemar's statistic. When I = 2 but T > 2, the generalized CMH statistic treating the T responses as unordered is identical to a statistic Cochran (1950) proposed. His statistic, called Cochran's Q, has df = T − 1 (Problem 11.22).

11.2 MARGINAL MODELING: MAXIMUM LIKELIHOOD APPROACH

The analyses above compared marginal distributions, but without accounting for explanatory variables. We now include such predictors. In this section we use ML, but we defer model-fitting details to the end of the section.

11.2.1 Longitudinal Mental Depression Example

We use Table 11.2 to illustrate a variety of analyses in this and the next chapter. It refers to a longitudinal study comparing a new drug with a standard drug for treatment of subjects suffering mental depression (Koch et al. 1977). Subjects were classified into two initial diagnosis groups according to whether severity of depression was mild or severe.
In each group, subjects were randomly assigned to one of the two drugs. Following 1 week, 2 weeks, and 4 weeks of treatment, each subject's suffering from mental depression was classified as normal or abnormal.

TABLE 11.2 Cross-Classification of Responses on Depression at Three Times, by Diagnosis and Treatment

                             Response at Three Times^a
Diagnosis  Treatment     NNN  NNA  NAN  NAA  ANN  ANA  AAN  AAA
Mild       Standard       16   13    9    3   14    4   15    6
           New drug       31    0    6    0   22    2    9    0
Severe     Standard        2    2    8    9    9   15   27   28
           New drug        7    2    5    2   31    5   32    6

^a N, normal; A, abnormal.
Source: Reprinted with permission from the Biometric Society (Koch et al. 1977).
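The sample marginal proportions of a normal response that Section 11.2.1 analyzes can be recovered directly from Table 11.2's 2³ pattern counts; a computational sketch (assuming the pattern-to-count transcription shown above):

```python
import numpy as np

# Table 11.2 counts; columns are the patterns NNN, ..., AAA
# (N = normal, A = abnormal, at weeks 1, 2, 4); rows are the four groups
counts = np.array([
    [16, 13,  9,  3, 14,  4, 15,  6],   # mild,   standard
    [31,  0,  6,  0, 22,  2,  9,  0],   # mild,   new drug
    [ 2,  2,  8,  9,  9, 15, 27, 28],   # severe, standard
    [ 7,  2,  5,  2, 31,  5, 32,  6],   # severe, new drug
])
patterns = ["NNN", "NNA", "NAN", "NAA", "ANN", "ANA", "AAN", "AAA"]

# Proportion normal at occasion t: sum the counts whose pattern has N in position t
prop = np.zeros((4, 3))
for g in range(4):
    for t in range(3):
        normal = [j for j, pat in enumerate(patterns) if pat[t] == "N"]
        prop[g, t] = counts[g, normal].sum() / counts[g].sum()
```

For example, prop[0, 0] is the week-1 proportion for the mild/standard group, (16 + 13 + 9 + 3)/80 ≈ 0.51.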
increased over time for each group; Ž2. increased at a faster rate for the new drug than the standard, for each fixed initial diagnosis; and Ž3. was higher for the mild than the severe initial diagnosis, for each treatment at each occasion. In such a study the company that developed the new drug would hope to show that patients have a significantly higher rate of improvement with it. The marginal logit model logit P Ž Yt s 1 . s ␣ q ␤ 1 s q ␤ 2 d q ␤ 3 t has the main effects of the explanatory variables Žseverity of initial diagnosis and drug. and of the variable Žtime. that specifies the different components of the multivariate response. Its linear time effect ␤ 3 is the same for each group. The natural sampling assumption is multinomial for the eight cells in the 2 3 cross-classification of the three responses, independently for the four TABLE 11.3 Sample Marginal Proportions of Normal Response for Depression Data of Table 11.2 Sample Proportion Diagnosis Treatment Week 1 Week 2 Week 4 Mild Standard New drug Standard New drug 0.51 0.53 0.21 0.18 0.59 0.79 0.28 0.50 0.68 0.97 0.46 0.83 Severe MARGINAL MODELING: MAXIMUM LIKELIHOOD APPROACH 461 groups. However, the model refers to 12 marginal probabilities Žfor 2 drug treatments = 2 initial severity diagnoses = 3 time points. rather than the 4 = 2 3 s 32 cell probabilities in the product multinomial likelihood function. The three marginal binomial variates for each group are dependent. ML estimation requires an iterative routine for maximizing the product multinomial likelihood, subject to the constraint that the marginal probabilities satisfy the model. An algorithm for this is given in Section 11.2.5. A check of model fit compares the 32 cell counts in Table 11.2 to their ML fitted values. Since the model describes 12 marginal logits using four parameters, residual df s 8. The deviance G 2 s 34.6. The poor fit is not surprising. 
The model assumes a common rate of improvement ␤ 3 , but the sample shows a higher rate for the new drug. A more realistic model permits the time effect to differ by drug, logit P Ž Yt s 1 . s ␣ q ␤ 1 s q ␤ 2 d q ␤ 3 t q ␤4 dt. Its time effect estimate is ␤ˆ3 s 0.48 ŽSE s 0.12. for the standard drug Ž d s 0. and ␤ˆ3 q ␤ˆ4 s 1.49 ŽSE s 0.14. for the new one Ž d s 1.. For the new drug, the slope is ␤ˆ4 s 1.01 ŽSE s 0.18. higher than for the standard, giving strong evidence of faster improvement. This model fits much better, with G 2 s 4.2 Ždf s 7.. The G 2 decrease of 34.6 y 4.2 s 30.4 compared to the simpler model is the likelihood-ratio test of H0 : ␤4 s 0, a common time effect for each drug. The severity of initial diagnosis estimate is ␤ˆ1 s y1.29 ŽSE s 0.14.; for each drug᎐time combination, the estimated odds of a normal response when the initial diagnosis was severe equal expŽy1.29. s 0.27 times the estimated odds when the initial diagnosis was mild. The estimate ␤ˆ2 s y0.06 ŽSE s 0.22. indicates an insignificant difference between the drugs after 1 week Žfor which t s 0.. At time t, the estimated odds of normal response with the new drug are expŽy0.06 q 1.01 t . times the estimated odds for the standard drug, for each initial diagnosis level. In summary, severity of initial diagnosis, drug treatment, and time all have substantial effects on the probability of a normal response. 11.2.2 Modeling a Repeated Multinomial Response Models for marginal distributions of a repeated binary response generalize to multicategory responses. At observation t, the marginal response distribution has I y 1 logits. With nominal responses, baseline-category logit models describe the odds of each outcome relative to a baseline. For ordinal responses, one might use cumulative logit models. For a particular marginal logit, a model has the form logit j Ž t . s ␣ j q ␤Xj x t , j s 1, . . . , I y 1, t s 1, . . . . 
For an ordinal response, perhaps logit_j(t) = logit[P(Y_t ≤ j)]. Then β_j may simplify to β, in which case the model takes the proportional odds form with the same effects for each logit. Some parameters in β may refer to the variable subscripted by t (e.g., time) that indexes the repeated measurements. One can then compare marginal distributions at particular settings of x or evaluate effects of x on the response. In either case, checking for interaction is crucial. For instance, are the effects of x the same at each t?

11.2.3 Insomnia Example

Table 11.4 shows results of a randomized, double-blind clinical trial comparing an active hypnotic drug with a placebo in patients who have insomnia problems. The response is the patient's reported time (in minutes) to fall asleep after going to bed. Patients responded before and following a two-week treatment period. The two treatments, active and placebo, form a binary explanatory variable. The subjects receiving the two treatments were independent samples.

Table 11.5 displays sample marginal distributions for the four treatment–occasion combinations. From the initial to follow-up occasion, time to falling asleep seems to shift downward for both treatments. The degree of shift seems greater for the active treatment, indicating possible interaction. The response variable is a discrete version of a continuous variable, so by the derivation in Section 7.2.3 a cumulative link model is natural. The proportional odds model

    logit P(Y_t ≤ j) = α_j + β_1 t + β_2 x + β_3 tx    (11.6)

permits interaction between t = occasion (0 = initial, 1 = follow-up) and

TABLE 11.4 Time to Falling Asleep, by Treatment and Occasion

                       Follow-up Time to Falling Asleep
Treatment   Initial     < 20   20–30   30–60   > 60
Active      < 20           7       4       1      0
            20–30         11       5       2      2
            30–60         13      23       3      1
            > 60           9      17      13      8
Placebo     < 20           7       4       2      1
            20–30         14       5       1      0
            30–60          6       9      18      2
            > 60           4      11      14     22

Source: From S. F. Francom, C. Chuang-Stein, and J. R.
Landis, Statist. Med. 8: 571–582 (1989). Reprinted with permission from John Wiley & Sons Ltd.

x = treatment (0 = placebo, 1 = active), but assumes the same effects for each response cutpoint.

TABLE 11.5 Sample Marginal Distributions of Table 11.4

                                 Response
Treatment   Occasion      < 20   20–30   30–60   > 60
Active      Initial      0.101   0.168   0.336   0.395
            Follow-up    0.336   0.412   0.160   0.092
Placebo     Initial      0.117   0.167   0.292   0.425
            Follow-up    0.258   0.242   0.292   0.208

For ML model fitting, G² = 8.0 (df = 6) for comparing observed to fitted cell counts in modeling the 12 marginal logits using these six parameters. The ML estimates are β̂_1 = 1.074 (SE = 0.162), β̂_2 = 0.046 (SE = 0.236), and β̂_3 = 0.662 (SE = 0.244). This shows evidence of interaction. At the initial observation, the estimated odds that time to falling asleep for the active treatment is below any fixed level equal exp(0.046) = 1.04 times the estimated odds for the placebo treatment; at the follow-up observation, the effect is exp(0.046 + 0.662) = 2.03. In other words, initially the two groups had similar distributions, but at the follow-up those with the active treatment tended to fall asleep more quickly.

For simpler interpretation, it can be helpful to report sample marginal means and their differences. With response scores {10, 25, 45, 75} for time to fall asleep, the initial means were 50.0 for the active group and 50.3 for the placebo. The difference in means between the initial and follow-up responses was 22.2 for the active group and 13.0 for the placebo. The difference between these differences of means equals 9.2, with SE = 3.0, indicating that the change was significantly greater for the active group.

11.2.4 Comparisons That Control for Initial Response

For data such as Table 11.4, suppose that the marginal distributions for initial response are identical for the treatment groups.
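The mean comparison just described for the insomnia data can be reproduced from Table 11.4's counts; a sketch using the response scores {10, 25, 45, 75} and the transcribed counts (the SE = 3.0 reported in the text requires a separate variance calculation not repeated here):

```python
import numpy as np

scores = np.array([10.0, 25.0, 45.0, 75.0])

# Table 11.4: rows = initial response, columns = follow-up response
active = np.array([[ 7,  4,  1,  0],
                   [11,  5,  2,  2],
                   [13, 23,  3,  1],
                   [ 9, 17, 13,  8]], dtype=float)
placebo = np.array([[ 7,  4,  2,  1],
                    [14,  5,  1,  0],
                    [ 6,  9, 18,  2],
                    [ 4, 11, 14, 22]], dtype=float)

def mean_change(table):
    """Mean initial score and (initial - follow-up) mean change for one group."""
    n = table.sum()
    initial = scores @ table.sum(axis=1) / n    # initial margin mean
    followup = scores @ table.sum(axis=0) / n   # follow-up margin mean
    return initial, initial - followup

init_a, change_a = mean_change(active)
init_p, change_p = mean_change(placebo)
diff_in_diff = change_a - change_p   # how much larger the active group's drop is
```

This reproduces the initial means of about 50.0 (active) and 50.3 (placebo), changes of 22.2 and 13.0, and the difference of differences 9.2.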
This is true, apart from sampling error, with random assignment of subjects to the groups. Suppose also that conditional on the initial response, the follow-up response distribution is identical for the treatment groups. Then, the follow-up marginal distributions are also identical. If the initial marginal distributions are not identical, however, the difference between follow-up and initial marginal distributions may differ between treatment groups, even though their conditional distributions for follow-up response are identical. In such cases, although marginal models can be useful, they may not tell the entire story. It may be more informative to construct models that compare the follow-up responses while controlling for the initial response.

Let Y2 denote the follow-up response, for treatment x with initial response y1. In the model

logit P(Y2 ≤ j) = αj + β1 x + β2 y1,    (11.7)

β1 compares the follow-up distributions for the treatments, controlling for the initial observation. This is an analog of an analysis-of-covariance model, with ordinal rather than continuous response. This cumulative logit model refers to a univariate response (Y2) rather than marginal distributions of a multivariate response (Y1, Y2). It is an example of a transitional model, discussed in the final section of this chapter.

11.2.5 ML Fitting of Marginal Logit Models*

ML fitting of marginal logit models is awkward. For T observations on an I-category response, at each setting of predictors the likelihood refers to I^T multinomial joint probabilities, but the model applies to T sets of marginal multinomial parameters {P(Yt = k), k = 1, . . . , I}. The marginal multinomial variates are not independent. Let π denote the complete set of multinomial joint probabilities for all settings of predictors. Marginal logit models have the generalized loglinear model form

C log(Aπ) = Xβ    (11.8)

introduced in Section 8.5.4.
In the binary case, the matrix A applied to π forms the T marginal probabilities {P(Yt = 1)} and their complements at each setting of predictors. The matrix C applied to the log marginal probabilities forms the T marginal logits for each setting; each row of C has 1 in the position multiplied by the log numerator probability for a given marginal logit, -1 in the position multiplied by the log denominator probability, and 0 elsewhere. For instance, for the model of marginal homogeneity in a 2^T table with no covariates, β is a single parameter, denoted by α in (11.1). For T = 2, π = (π11, π12, π21, π22)' has four elements, and this model is

    ( 1  -1   0   0 )        ( 1  1  0  0 ) (π11)     ( 1 )
    ( 0   0   1  -1 )   log  ( 0  0  1  1 ) (π12)  =  ( 1 ) α,
                             ( 1  0  1  0 ) (π21)
                             ( 0  1  0  1 ) (π22)

which sets both logit(π11 + π12) = logit[P(Y1 = 1)] and logit(π11 + π21) = logit[P(Y2 = 1)] equal to α.

The likelihood function l(π) for a marginal logit model is the product of the multinomial mass functions from the various predictor settings. One approach for ML fitting views the model as a set of constraints and uses methods for maximizing a function subject to constraints. In model (11.8), let U denote a full column rank matrix such that the space spanned by the columns of U is the orthogonal complement of the space spanned by the columns of X. Then U'X = 0, and the model has the equivalent constraint form

U'C log(Aπ) = 0.

For instance, for marginal homogeneity in a 2 × 2 table with (11.8) as expressed above, U' = (1, -1). Then U' applied to C log(Aπ) sets the difference between the row and column marginal logits equal to 0. This method of maximizing the likelihood incorporates these model constraints as well as identifiability constraints, which constrain the response probabilities at each predictor setting to sum to 1. We express this collection of model constraints U'C log(Aπ) = 0 and identifiability constraints as f(π) = 0.
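The matrices in the marginal-homogeneity display are easy to verify numerically. A minimal numpy sketch (our own illustration, not from the text) builds A, C, and U' for the T = 2 case and confirms that U'C log(Aπ) = 0 exactly when the two marginal logits agree:

```python
import numpy as np

# pi = (pi11, pi12, pi21, pi22)' for a 2 x 2 table
A = np.array([[1, 1, 0, 0],     # P(Y1 = 1)
              [0, 0, 1, 1],     # P(Y1 = 2)
              [1, 0, 1, 0],     # P(Y2 = 1)
              [0, 1, 0, 1]])    # P(Y2 = 2)
C = np.array([[1, -1, 0, 0],    # logit P(Y1 = 1)
              [0, 0, 1, -1]])   # logit P(Y2 = 1)
X = np.array([[1.0], [1.0]])    # model matrix for the single parameter alpha
U_t = np.array([[1.0, -1.0]])   # spans the orthogonal complement of X: U'X = 0

def constraint(pi):
    """U'C log(A pi): zero iff the two marginal logits are equal."""
    return U_t @ C @ np.log(A @ pi)

# A table satisfying marginal homogeneity (pi12 = pi21) ...
pi_mh = np.array([0.3, 0.2, 0.2, 0.3])
# ... and one that does not
pi_not = np.array([0.3, 0.3, 0.1, 0.3])

print(U_t @ X, constraint(pi_mh), constraint(pi_not))
```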
The method introduces Lagrange multipliers corresponding to these constraints and solves the Lagrangian likelihood equations using a Newton-Raphson algorithm (Aitchison and Silvey 1958; Haber 1985). Let θ be a vector having elements π and the Lagrange multipliers λ. The Lagrangian likelihood equations have form h(θ) = 0, where

h(θ) = h(π, λ) = ( f(π), ∂ log l(π)/∂π + [∂f(π)'/∂π] λ )'

is a vector with terms involving the contrasts in marginal logits that the model specifies as constraints as well as log-likelihood derivatives. The Newton-Raphson method then is

θ^(t+1) = θ^(t) - [∂h(θ^(t))/∂θ]^(-1) h(θ^(t)),    t = 1, 2, . . . .

This can be computationally intensive because the derivative matrix inverted has dimensions larger than the number of elements in π. A refinement (Lang 1996a; Lang and Agresti 1994) uses an asymptotic approximation to a reparameterized derivative matrix that has a much simpler form, requiring inverting only a diagonal matrix and a symmetric positive definite matrix.

This ML marginal fitting method is available in specialized software (Appendix A mentions an S-Plus function). It makes no assumption about the model that describes the joint distribution π. Thus, when the marginal model holds, the ML estimate of β in (11.8) is consistent regardless of the dependence structure for that distribution.

Several alternative fitting approaches have been considered. Lang and Agresti (1994) simultaneously fitted a marginal model and an unsaturated loglinear model for π. The complete model can be specified as a special case of (11.8) and fitted using the constraint approach with Lagrange multipliers just described. In standard cases, the marginal and joint model parameters are orthogonal. If the marginal model holds, the ML estimator of the marginal model parameters is consistent even if the model for the joint distribution is incorrect.

Fitzmaurice and Laird (1993)
gave a related ML approach. A one-to-one correspondence holds between π and parameters of the saturated loglinear model. They used a further one-to-one correspondence between the main-effect and the higher-order parameters of that loglinear model with the marginal probabilities and those same higher-order loglinear parameters. Models were then specified separately for the marginal probabilities and the higher-order (conditional) loglinear parameters. The likelihood is then maximized in terms of the two sets of model parameters. Again, the two sets of parameters are orthogonal, so the ML estimator of marginal model parameters is consistent when the marginal model holds. This mixed parameter approach is also available in specialized software (Kastner et al. 1997; see also Appendix A).

Yet another ML approach uses a one-to-one correspondence between π and parameters that describe the marginal distributions, the bivariate distributions, the trivariate distributions, and so on (e.g., Glonek and McCullagh 1995; Molenberghs and Lesaffre 1994). Multivariate logistic models then apply to the component distributions, although some higher-order effects may be assumed to vanish, for simplicity. Glonek (1996) proposed a hybrid of this and the Fitzmaurice and Laird (1993) approach.

11.3 MARGINAL MODELING: GENERALIZED ESTIMATING EQUATIONS (GEE) APPROACH

At each combination of predictor values, ML fitting assumes a multinomial distribution for the I^T cell probabilities for the T observations on an I-category response. As the number of predictors increases, the number of multinomial probabilities increases dramatically. Currently, none of the ML approaches described above is practical when T is large or there are many predictors, especially when some are continuous.
Compared to the continuous-response case using the multivariate normal, marginal modeling of multivariate categorical responses is also hindered by the lack of a simple multivariate distribution for describing correlations among the T responses. For instance, with T means and a common variance and correlation, the multivariate normal has only T + 2 parameters, compared to the I^T - 1 parameters for the multinomial.

An alternative to ML fitting uses a multivariate generalization of quasi-likelihood (Section 4.7). Rather than assuming a particular distribution for Y, the quasi-likelihood method specifies only the first two moments; it links the mean to a linear predictor and also specifies how the variance depends on the mean. The estimates are solutions of estimating equations that are likelihood equations under the further assumption of a distribution in the exponential family with that mean and variance (Wedderburn 1974).

11.3.1 Generalized Estimating Equation Methodology: Basic Ideas

Repeated measurement provides a multivariate response (Y1, Y2, . . . , YT), where T sometimes varies by subject. As in the univariate case, the quasi-likelihood method specifies a model for μ = E(Y) and specifies a variance function v(μ) describing how var(Y) depends on μ. Now, though, that model applies to the marginal distribution for each Yt. The method also requires a working guess for the correlation structure among {Yt}. The estimates are solutions of quasi-likelihood equations called generalized estimating equations. The method is often referred to as the GEE method. Liang and Zeger (1986) proposed it for marginal modeling with GLMs. Their work built on related material in the econometrics literature (e.g., Gourieroux et al. 1984; Hansen 1982; White 1982). We outline concepts here and give more details in Section 11.4.

The GEE approach utilizes an assumed covariance structure for (Y1, Y2, . . .
, YT), specifying a variance function and a pairwise correlation pattern, without assuming a particular multivariate distribution. The GEE estimates of model parameters are valid even if one misspecifies the covariance structure. Consistency (i.e., estimates converging in probability to the true parameters) depends on the first moment but not the second. Specifically, suppose that the model is correct in the sense that the chosen link function and linear predictor truly describe how E(Yt) depends on the predictors, t = 1, . . . , T. Then the GEE model parameter estimators are consistent. In practice, a chosen model is never exactly correct. This result is useful, however, for suggesting that the correlation structure need not adversely affect the quality of estimates for whatever model one uses. Often, no a priori information is available about this structure, and the correlation is regarded as a nuisance.

A simple implementation of the GEE method naively treats {Yt} as pairwise independent. Although parameter estimates are usually fine under this naive assumption, standard errors are not. More appropriate standard errors result from an adjustment the GEE method makes using the empirical dependence the data exhibit. The naive standard errors based on the independence assumption are updated using the information the data provide about the actual dependence structure to yield more appropriate (robust) standard errors.

As an alternative to estimates that treat {Yt} as pairwise independent, the GEE method can use a working guess about the correlation structure but again empirically adjust the standard errors. The exchangeable working correlation structure treats corr(Yt, Ys) as identical for all s and t. This is more flexible and realistic than the naive independence assumption. Even more realistic is an unstructured working correlation that permits a separate correlation for each pair.
When T is large, however, this approach suffers some efficiency loss because of the many additional parameters. In theory, choosing the working correlation wisely can pay benefits of improved efficiency of estimation. However, Liang and Zeger (1986) noted that estimators based on an independence working correlation can have surprisingly good efficiency when the actual correlation is weak to moderate. One can check the sensitivity to the selection by comparing results for different working correlation assumptions. In our experience, when the correlations are modest, all working correlation structures yield similar GEE estimates and standard errors, as the empirical dependence has a large impact on adjusting the naive standard errors. (If they differed substantially, a more careful study of the correlation structure would be necessary.) Unless one expects dramatic differences among the correlations, we recommend the exchangeable working correlation structure. This recognizes the dependence at the cost of only one extra parameter.

The GEE approach is appealing for categorical data because of its computational simplicity compared to ML. Advantages include not requiring a multivariate distribution and the consistency of estimation even with a misspecified correlation structure. However, it has limitations. Since the GEE approach does not completely specify the joint distribution, it does not have a likelihood function. Likelihood-based methods are not available for testing fit, comparing models, and conducting inference about parameters. Instead, inference uses Wald statistics constructed with the asymptotic normality of the estimators together with their estimated covariance matrix. However, unless the sample size is quite large, the empirically based standard errors tend to underestimate the true ones (e.g., Firth 1993b).
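To make the working correlation structures concrete, here is a small numpy sketch (function name ours) that constructs the independence and exchangeable forms for T repeated responses:

```python
import numpy as np

def working_corr(T, structure, alpha=0.0):
    """Working correlation matrix R(alpha) for T repeated responses.

    'independence': R = I (naive, treats responses as uncorrelated);
    'exchangeable': corr(Yt, Ys) = alpha for all t != s, one extra parameter.
    """
    if structure == "independence":
        return np.eye(T)
    if structure == "exchangeable":
        R = np.full((T, T), float(alpha))
        np.fill_diagonal(R, 1.0)
        return R
    raise ValueError(f"unknown structure: {structure}")

R_ind = working_corr(3, "independence")
R_exch = working_corr(3, "exchangeable", alpha=0.4)
print(R_exch)
```

An unstructured working correlation would instead carry a separate parameter for each of the T(T - 1)/2 pairs, which is the efficiency cost noted above when T is large.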
As estimators, those standard errors can also show more variability than parametric estimators (Kauermann and Carroll 2001). Boos (1992) and Rotnitzky and Jewell (1990) proposed analogs of score tests for effects of predictors, using quasi-log-likelihood, that may be more trustworthy than Wald tests. Some statisticians (e.g., Lindsey 1999) are critical of the GEE approach because of the lack of a likelihood. Others do not find this problematic, as they regard GEE as an estimation method rather than a model.

11.3.2 Longitudinal Mental Depression Example

For Table 11.2 comparing two treatments for mental depression, ML fitting of a logit model with drug × time interaction was used in Section 11.2.1. The GEE analysis provides similar results, regardless of the choice of working correlation structure. With the exchangeable structure, the GEE estimated slope (on the logit scale) for the standard drug is β̂3 = 0.48 (SE = 0.12). For the new drug the slope increases by β̂4 = 1.02 (SE = 0.19).

Table 11.6 shows results using the independence working correlations. Estimates are the same to two decimal places. The initial estimates and standard errors there are those that apply if the repeated responses are truly independent. They equal those obtained by using ordinary logistic regression with 3 × 340 = 1020 independent observations rather than treating the data as three dependent observations for each of 340 subjects. The empirical standard errors incorporate the sample dependence to adjust the independence-based standard errors.
TABLE 11.6 Output from Using GEE to Fit Logit Model to Table 11.2

Initial Parameter Estimates
Parameter     Estimate    Std Error
Intercept      -0.0280      0.1639
diagnose       -1.3139      0.1464
drug           -0.0596      0.2222
time            0.4824      0.1148
drug*time       1.0174      0.1888

GEE Parameter Estimates: Empirical Std Error Estimates
Parameter     Estimate    Std Error
Intercept      -0.0280      0.1742
diagnose       -1.3139      0.1460
drug           -0.0596      0.2285
time            0.4824      0.1199
drug*time       1.0174      0.1877

Working Correlation Matrix
          Col1      Col2      Col3
Row1    1.0000    0.0000    0.0000
Row2    0.0000    1.0000    0.0000
Row3    0.0000    0.0000    1.0000

With the exchangeable correlation structure, the estimated common correlation between pairs of the three responses is -0.003. The successive observations apparently have pairwise appearance like independent observations. This is quite unusual for repeated measurement data. For this reason, similar results occur from fitting the model assuming the three observations for a subject actually come from three separate subjects (i.e., assuming 1020 independent observations).

11.3.3 GEE Approach for Multinomial Responses: Insomnia Example

Liang and Zeger (1986) originally specified the GEE methodology for modeling univariate marginal distributions, such as the binomial and Poisson. It extends to marginal modeling of multinomial responses. Lipsitz et al. (1994) outlined a GEE approach for cumulative logit models with repeated ordinal responses. With this approach, for each pair of outcome categories one selects a working correlation matrix for the pairs of repeated observations. Each multinomial response at a fixed observation uses the (I - 1) × (I - 1) multinomial covariance matrix. Section 11.4.4 has details.

We illustrate for the insomnia data of Table 11.4. In Section 11.2.3 we used ML to fit the marginal model

logit P(Yt ≤ j) = αj + β1 t + β2 x + β3 tx

for Yt = time to fall asleep with treatment x at occasion t.
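The independence-working-correlation GEE for binary data amounts to ordinary logistic regression plus a cluster-robust (sandwich) covariance. A self-contained numpy sketch (simulated clustered data, not the depression study, since the raw records are not reproduced here) shows both the naive and empirically adjusted standard errors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical clustered binary data: n subjects, T = 3 repeated responses,
# one binary subject-level covariate; a shared cluster effect induces dependence.
n, T = 200, 3
x = rng.integers(0, 2, size=n)
b = rng.normal(0.0, 0.8, size=n)                     # shared cluster effect
X = np.column_stack([np.ones(n * T), np.repeat(x, T)])
eta = X @ np.array([-0.5, 1.0]) + np.repeat(b, T)
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))
groups = np.repeat(np.arange(n), T)

# GEE with independence working correlation = ordinary logistic regression
# (Fisher scoring), then a cluster-robust sandwich covariance.
beta = np.zeros(2)
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ beta))
    W = mu * (1 - mu)
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - mu))

bread = np.linalg.inv(X.T @ (W[:, None] * X))        # naive (model-based) covariance
meat = np.zeros((2, 2))
for g in np.unique(groups):
    idx = groups == g
    s = X[idx].T @ (y[idx] - mu[idx])                # cluster score contribution
    meat += np.outer(s, s)
cov_sandwich = bread @ meat @ bread

se_naive = np.sqrt(np.diag(bread))
se_robust = np.sqrt(np.diag(cov_sandwich))
print(beta, se_naive, se_robust)
```

With positive within-cluster dependence, the robust standard errors typically exceed the naive ones; in Table 11.6 the near-zero estimated correlation explains why the two sets barely differ.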
With the independence working correlation structure, the GEE estimates are β̂1 = 1.038 (SE = 0.168), β̂2 = 0.034 (SE = 0.238), and β̂3 = 0.708 (SE = 0.244). The estimates are similar to the ML estimates, and the substantive conclusions are the same. Considerable evidence exists that the distribution of time to fall asleep decreased more for the treatment group than for the placebo group.

11.4 QUASI-LIKELIHOOD AND ITS GEE MULTIVARIATE EXTENSION: DETAILS*

A GLM assumes a certain distribution for the response variable. Sometimes it is unclear how to select it. However, often there is a plausible relationship between the mean and variance, such as v(μi) = φμi for count data. Then, an alternative to ML estimation is quasi-likelihood estimation (Section 4.7). We next present some details about this method and its GEE extension for marginal modeling of multivariate responses. We begin with models for a single response and later discuss marginal models for a multivariate response.

For subject i, i = 1, . . . , n, let yi be the outcome on Y with μi = E(Yi) and variance function v(μi), and let xij be the value of explanatory variable j. For link function g, the linear predictor is ηi = g(μi) = Σj βj xij = xi'β. The quasi-likelihood (QL) parameter estimates β̂ are the solutions of quasi-score equations

u(β) = Σi (∂μi/∂β)' [v(μi)]^(-1) (yi - μi) = 0,    (11.9)

where μi = g^(-1)(xi'β). These estimating equations are the same as the likelihood equations (4.22) for GLMs when we substitute

∂μi/∂βj = (∂μi/∂ηi)(∂ηi/∂βj) = (∂μi/∂ηi) xij.

They are not likelihood equations, however, without the extra assumption that {yi} has distribution in the natural exponential family. Under that assumption, v(μi) characterizes the distribution within the natural exponential family (Jorgensen 1987). Another motivation for equations (11.9) is that
with v(μi) replaced by known variances vi, they result from the weighted least squares problem of minimizing Σi (yi - μi)²/vi. The likelihood equations (4.22) for a GLM depend only on the mean and variance of {yi} and the link function g, which determines ∂μi/∂ηi. Thus, Wedderburn (1974) suggested using them as estimating equations for any link and variance function, even if they do not correspond to a particular member of the natural exponential family.

11.4.1 Properties of Quasi-likelihood Estimators

In the quasi-likelihood (QL) method, the quasi-score function u(β) in (11.9) is called an unbiased estimating function; this term refers to any function h(y; β) of y and β such that E[h(Y; β)] = 0 for all β. The equations (11.9) that determine β̂ are called estimating equations.

The quasi-likelihood method treats the quasi-score function as the derivative of a function called the quasi-log likelihood. This function may not be a proper log likelihood function. Nonetheless, McCullagh (1983) showed that QL estimators have properties similar to those of ML estimators. For instance, the QL estimators β̂ are asymptotically normal with covariance matrix approximated by

V = [ Σi (∂μi/∂β)' [v(μi)]^(-1) (∂μi/∂β) ]^(-1).    (11.10)

This is equivalent to the formula for the large-sample covariance matrix of the ML estimator in a GLM [which is estimated by (4.28)].

A key result is that the QL estimator β̂ is consistent for β (i.e., β̂ →p β) even if the variance function is misspecified, as long as the specification is correct for the link function and linear predictor. That is, assuming that the model g(μi) = Σj βj xij is correct, the consistency of β̂ holds even if the true variance function is not v(μi). We now give a heuristic explanation for this. When truly μi = g^(-1)(Σj βj xij), then from (11.9), E[uj(β)] = 0 for all j.
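To make the quasi-score equations (11.9) concrete, here is a numpy sketch (our own illustration) for a log-link count model with v(μi) = μi. For that choice, since ∂μi/∂β = μi xi, equation (11.9) collapses to Σi xi (yi - μi) = 0, and Fisher scoring drives the quasi-score to zero:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical count data: log link, variance function v(mu) = mu
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ beta_true))

def quasi_score(beta):
    # u(beta) = sum_i (dmu_i/dbeta)' v(mu_i)^(-1) (y_i - mu_i); with log link
    # dmu_i/dbeta = mu_i x_i and v(mu_i) = mu_i, this is just X'(y - mu).
    mu = np.exp(X @ beta)
    return X.T @ (y - mu)

beta = np.zeros(2)
for _ in range(25):                        # Fisher scoring iterations
    mu = np.exp(X @ beta)
    info = X.T @ (mu[:, None] * X)         # Sum_i mu_i x_i x_i'
    beta += np.linalg.solve(info, quasi_score(beta))

print(beta, quasi_score(beta))             # quasi-score is ~ 0 at the solution
```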
From (11.9), u(β)/n is a vector of sample means. By a law of large numbers, it converges in probability to its expected value of 0. The solution β̂ of the quasi-score equations is a continuous function of these sample means, so it converges to β, since β̂ is the value of β for which the sum is exactly equal to 0. The consistency also follows from general results for unbiased estimating functions (Liang and Zeger 1995).

11.4.2 Sandwich Covariance Adjustment for Variance Misspecification

If one assumes that var(Yi) = v(μi) but the true var(Yi) ≠ v(μi), then the actual asymptotic covariance matrix of the QL estimator β̂ is not V as given in (11.10). Instead, it is (Diggle et al. 2001; White 1982)

V [ Σi (∂μi/∂β)' [v(μi)]^(-1) var(Yi) [v(μi)]^(-1) (∂μi/∂β) ] V.    (11.11)

Even though the variances are scalar, we express the matrices in this form to motivate the GEE multivariate extension discussed below. Matrix (11.11) simplifies to V if var(Yi) = v(μi). In practice, the true variance function is unknown. A consistent estimator of (11.11) is a sample analog, replacing μi by μ̂i and var(Yi) by (yi - μ̂i)² (Liang and Zeger 1986). The estimated covariance matrix is valid regardless of whether the variance specification v(μi) is correct. This estimated covariance matrix is called a sandwich estimator, because the empirical evidence is sandwiched between the model-driven covariance matrices.

In summary, even with incorrect specification of the variance function, one can still consistently estimate β, and one can estimate the asymptotic variance of β̂ by estimating the sandwich adjustment (11.11). However, some efficiency loss occurs when the variance chosen, v(μi), is wildly inaccurate. Also, the number of clusters n may need to be large for the sample version of (11.11) to work well; otherwise, it can be biased downward. Of course, a modeling process never gets anything exactly correct.
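The sandwich adjustment (11.11) is straightforward to compute. A numpy sketch (our own illustration) fits a working variance v(μi) = μi to deliberately overdispersed counts; the sandwich standard errors then exceed the naive ones from (11.10):

```python
import numpy as np

rng = np.random.default_rng(2)

# Overdispersed counts: negative binomial with mean mu and var = mu + mu^2/2,
# fitted under the (wrong) working variance v(mu) = mu.
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
mu_true = np.exp(X @ np.array([0.2, 0.4]))
y = rng.negative_binomial(n=2.0, p=2.0 / (2.0 + mu_true))

beta = np.zeros(2)
for _ in range(30):                                  # Fisher scoring
    mu = np.exp(X @ beta)
    beta += np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (y - mu))

mu = np.exp(X @ beta)
D = mu[:, None] * X                                  # rows are (dmu_i/dbeta)'
V = np.linalg.inv(D.T @ (D / mu[:, None]))           # naive covariance (11.10)
meat = D.T @ (D * ((y - mu) ** 2 / mu ** 2)[:, None])
cov_sandwich = V @ meat @ V                          # sample analog of (11.11)

se_naive = np.sqrt(np.diag(V))
se_sandwich = np.sqrt(np.diag(cov_sandwich))
print(se_naive, se_sandwich)
```

Note that numpy's `negative_binomial(n, p)` has mean n(1 - p)/p, so p = 2/(2 + μ) gives mean μ with variance μ + μ²/2, i.e., genuine overdispersion relative to the working variance.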
Just as the variance function chosen only approximates the true one (hopefully, closely), so is the specification for the mean only approximate.

11.4.3 GEE Methodology: Technical Details

Now we consider the generalized estimating equations (GEE) multivariate generalization of QL. For subject i, let yi = (yi1, . . . , yiTi)' and μi = (μi1, . . . , μiTi)', where μit = E(Yit). The number Ti of responses may vary by cluster. Let xit denote a p × 1 vector of explanatory variable values for yit. The notation allows for cases where explanatory variables also vary over the repeated measurements. The linear predictor of the model is ηit = g(μit) = xit'β for link function g. The model refers to the marginal distribution at each t rather than the joint distribution. Let Xi be the Ti × p matrix of predictor values for cluster (or subject) i, for which row t is xit'.

We assume that yit has probability mass function of form

f(yit; θit, φ) = exp{ [yit θit - b(θit)]/φ + c(yit, φ) }.

When φ is known, this is the natural exponential family with natural parameter θit. From Section 4.4.1,

μit = E(Yit) = b'(θit),    v(μit) = var(Yit) = b''(θit)φ.

The GEE method also assumes a working correlation matrix R(α) for Yi, depending on parameters α. The exchangeable working correlation has corr(Yit, Yis) = α for each pair in Yi. Let bi(θ) = (b(θi1), . . . , b(θiTi))', and let Bi denote a diagonal matrix with main diagonal elements b''(θit). Then the working covariance matrix for Yi is

Vi = Bi^(1/2) R(α) Bi^(1/2) φ.    (11.12)

Note that Vi = cov(Yi) if R is the true correlation matrix for Yi. Now let Δi be the diagonal matrix with elements ∂θit/∂ηit on the main diagonal for t = 1, . . . , Ti. (For the canonical link, this is the identity matrix.) Let

Di = ∂μi/∂β = Bi Δi Xi

be a Ti × p matrix with typical element expressing ∂μit/∂βj in the form (∂μit/∂θit)(∂θit/∂ηit)(∂ηit/∂βj).
From (11.9), for univariate GLMs the quasi-likelihood estimating equations have the form

Σi (∂μi/∂β)' [v(μi)]^(-1) [yi - μi(β)] = 0,

where μi = μi(β) = g^(-1)(xi'β). The analog of this in the multivariate case is the set of generalized estimating equations

Σ(i=1 to n) Di' Vi^(-1) [yi - μi(β)] = 0.

The GEE estimator β̂ is the solution of these equations.

The naive approach, which sets R(α) = I, treats pairs of responses as independent. In that case, (11.12) simplifies to Vi = Bi φ, and the generalized estimating equations simplify to

Σi Di' Vi^(-1) [yi - μi(β)] = Σi Xi' Δi Bi Vi^(-1) [yi - μi(β)] = (1/φ) Σi Xi' Δi [yi - μi(β)] = 0,

or Σi Xi' Δi [yi - μi(β)] = 0. The solution β̂ is then the same as the ordinary estimator for a GLM with the chosen link function and variance function, treating (yi1, . . . , yiTi) as independent observations.

Normally, one selects a working correlation matrix permitting dependence, such as the exchangeable structure. For time-series data, also popular is the autoregressive structure, corr(Yit, Yis) = α^|t-s|, which treats observations farther apart in time as more weakly correlated.

Liang and Zeger (1986) suggested computing the GEE estimates by iterating between a modified Fisher scoring algorithm for solving the generalized estimating equations for β (given current estimates of α and φ) and using residuals for moment estimation of α and φ (based on the current estimate of β). They suggested estimates of R(α) for a variety of correlation structures. Alternative algorithms simultaneously solve estimating equations for β and for association parameters (e.g., Liang et al. 1992; see also Note 11.8). GEE algorithms need not converge, but often one iteration gives adequate results (Lipsitz et al. 1991). Liang and Zeger (1986) showed asymptotic normality and consistency as the number of clusters n increases.
Under certain regularity conditions,

√n (β̂ - β) →d N(0, VG).

Here, generalizing (11.11), VG = lim(n→∞) VG,n with

VG,n = n [ Σi Di' Vi^(-1) Di ]^(-1) [ Σi Di' Vi^(-1) cov(Yi) Vi^(-1) Di ] [ Σi Di' Vi^(-1) Di ]^(-1).

The estimated covariance matrix V̂G,n/n of β̂ replaces β with β̂, φ with φ̂, α with α̂, and cov(Yi) by [yi - μi(β̂)][yi - μi(β̂)]'. The purpose of the sandwich estimator is to use the data's empirical evidence about covariation to adjust the standard errors in case the true covariance differs substantially from the working guess. When the working correlation structure is the true one and cov(Yi) = Vi, the asymptotic covariance matrix VG,n/n simplifies to [Σi Di' Vi^(-1) Di]^(-1). This is the relevant covariance if we put complete faith in our guess about the correlation structure.

With binary data, the correlation may not be the best way to express the within-cluster association. The marginal probabilities constrain the possible correlation values, since the range of possible values for E(Yit Yis) = P(Yit = 1, Yis = 1) depends on P(Yit = 1) and P(Yis = 1). An alternative approach uses the odds ratio, for instance by modeling the log odds ratios for pairs in a cluster as exchangeable. This has the advantage that the association parameters are distinct from the means. See Fitzmaurice et al. (1993) and Lipsitz et al. (1991). Carey et al. (1993) suggested an iterative alternating logistic regressions algorithm. It alternates between a GEE step for the regression parameters in the model for the mean and a step for an association model for the log odds ratio. This is useful when the structure of the association is itself a major focus rather than a nuisance.

11.4.4 GEE Approach: Multinomial Responses

We now briefly describe the Lipsitz et al. (1994) GEE approach for marginal modeling with a multinomial response. This is appropriate, for instance, with cumulative logit models.
Let yit(j) = 1 if observation t in cluster i has outcome j (j = 1, . . . , I - 1). Let yi be the Ti(I - 1) binary indicators for cluster i. Then, one selects a [Ti(I - 1)] × [Ti(I - 1)] working covariance matrix Vi for yi, specifying a pattern for corr(Yit(j), Yis(k)) for each pair of outcome categories (j, k) and each pair (t, s). The (I - 1) × (I - 1) block of Vi for (yit(1), . . . , yit(I - 1)) is a multinomial covariance matrix with vit(j) = P(Yit(j) = 1)[1 - P(Yit(j) = 1)] on the main diagonal and -P(Yit(j) = 1)P(Yit(k) = 1) off it. The remaining elements of Vi contain elements cov(Yit(j), Yis(k)). For instance, one possibility is the exchangeable structure, corr(Yit(j), Yis(k)) = ρjk for all t and s.

In this approach the generalized estimating equations for β again have the form

u(β) = Σ(i=1 to n) Di' Vi^(-1) (yi - μi) = 0,

where μi is the vector of probabilities associated with yi, Di' = ∂μi'/∂β, and the parameters are evaluated at their current estimates. Lipsitz et al. suggested a Fisher scoring algorithm for solving these equations and a method-of-moments update for estimating {ρjk} at each step of the iteration. An empirically adjusted sandwich covariance matrix of β̂ is again

[ Σ(i=1 to n) Di' Vi^(-1) Di ]^(-1) [ Σ(i=1 to n) Di' Vi^(-1) cov(Yi) Vi^(-1) Di ] [ Σ(i=1 to n) Di' Vi^(-1) Di ]^(-1).

This is estimated by substituting μ̂i from the model fit and replacing cov(Yi) by the empirical covariance matrix of yi.

11.4.5 Dealing with Missing Data

Unfortunately, studies with repeated measurement often have cases for which at least one response in a cluster is missing. In a longitudinal study, for instance, some subjects may drop out before its conclusion. When data are missing, analyzing the observed data alone as if no data are missing can result in biased estimates.
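The (I - 1) × (I - 1) multinomial covariance block described above is simple to construct directly; a numpy sketch (function name ours):

```python
import numpy as np

def multinomial_block(p):
    """Covariance block for the indicators (Y(1), ..., Y(I-1)) of one multinomial trial.

    p holds the I - 1 category probabilities P(Y(j) = 1); the block is
    diag(p) - p p', i.e., p_j (1 - p_j) on the diagonal and -p_j p_k off it.
    """
    p = np.asarray(p, dtype=float)
    return np.diag(p) - np.outer(p, p)

V_block = multinomial_block([0.2, 0.3, 0.4])   # I = 4 categories, last omitted
print(V_block)
```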
An advantage of the GEE method is that different clusters can have different numbers of observations. The data input file has a separate line for each observation, and for longitudinal studies, computations use those times for which a subject has an observation. However, bias can arise in GEE estimates unless one can make certain assumptions about why the data are missing.

Let Y^(o) denote the observed responses, Y^(m) the missing responses, and Y their union. Let M denote a missing-data indicator that equals 1 when an observation is missing and 0 otherwise. Little and Rubin (1987) called the data missing completely at random if M is statistically independent of Y; that is, the probability that an observation is missing is independent of that observation's value, although it may depend on the explanatory variables. Less restrictively, they called the data missing at random if the distribution of (M | Y) equals that of (M | Y^(o)); that is, missingness depends only on Y^(o) and not on the missing values. When either of these is plausible, with a likelihood-based analysis it is not necessary to model the missingness mechanism. An analysis using only Y^(o) is not systematically biased. The same is true with GEE methods when estimating equations can be weighted by response probabilities (Robins et al. 1995). Otherwise, however, with non-likelihood-based methods such as GEE, the missingness process can be ignored only when data are missing completely at random. Kenward et al. (1994) illustrated the breakdown in GEE estimates when the data are not missing completely at random.

Often, missingness depends on the missing values. For instance, in a longitudinal study measuring pain, perhaps a subject dropped out when the pain got above some threshold. Then, more complex analyses are needed that model the joint distribution of Y and M (Little 1998). Let f(·) denote a generic probability mass function, which also depends on explanatory variables x and parameters.
Selection models factor the joint distribution of Y and M as

f(y, M; x, β, ψ) = f(y; x, β) f(M | y; x, ψ),

where f(y; x, β) is the model in the absence of missing values and f(M | y; x, ψ) is the model for the missing-data mechanism. Pattern mixture models use the alternative factorization,

f(y, M; x, θ, φ) = f(y | M; x, φ) f(M; x, θ),

which conditions the distribution of Y on the missing-data pattern. The two specifications are equivalent when M is independent of Y, with β = φ and ψ = θ. For discussion of advantages of each modeling approach and details on ways of modeling missingness, see Little (1998) and references in Note 11.9. See Stokes et al. (2000, p. 524) for an example of building the missingness pattern into a model to check whether it is associated with the response or interacts with effects of explanatory variables.

Analyses in the presence of much missingness should be made with caution. Typically, little is known about the missing-data mechanism, and assumptions about it cannot be checked. Since inferences may not be robust, a sensitivity study is necessary to check how results depend on specification of that mechanism. In the absence of a model for the missingness, one should at least compare results of the analysis using all available cases for all clusters to the analysis using only clusters having no missing observations. If results differ substantially, conclusions should be very tentative until the reasons for missingness can be studied.

11.5 MARKOV CHAINS: TRANSITIONAL MODELING

When Y_t denotes the response at time t, t = 0, 1, 2, ..., the indexed family of random variables (Y_0, Y_1, Y_2, ...) is a stochastic process. The state space of the process is the set of possible values for Y_t. The value Y_0 is the initial state. When the state space is categorical and observations occur at a discrete set of times, {Y_t} has discrete state space and discrete time.
11.5.1 Transitional Models

The main focus is usually on the dependence of Y_t on the responses {y_0, y_1, ..., y_{t−1}} observed previously as well as any explanatory variables. Models of this type are called transitional models. Let f(y_0, ..., y_T) denote the joint probability mass function of (Y_0, ..., Y_T) (ignoring, for now, explanatory variables). Transitional models use the factorization

f(y_0, ..., y_T) = f(y_0) f(y_1 | y_0) f(y_2 | y_0, y_1) ··· f(y_T | y_0, y_1, ..., y_{T−1}).

Unlike the marginal models in the other sections of this chapter, this modeling is conditional on previous responses. In this section we introduce discrete-time Markov chains, a simple stochastic process having discrete state space. Many transitional models have Markov chain structure for at least part of the model.

11.5.2 First-Order Markov Chains

A Markov chain is a stochastic process for which, for all t, the conditional distribution of Y_{t+1}, given Y_0, ..., Y_t, is identical to the conditional distribution of Y_{t+1} given Y_t alone. That is, given Y_t, Y_{t+1} is conditionally independent of Y_0, ..., Y_{t−1}. Knowing the present state of a Markov chain, information about past states does not help us predict the future. For Markov chains,

f(y_0, ..., y_T) = f(y_0) f(y_1 | y_0) f(y_2 | y_1) ··· f(y_T | y_{T−1}).    (11.13)

A stochastic process is a kth-order Markov chain if, for all t, the conditional distribution of Y_{t+1}, given Y_0, ..., Y_t, is identical to the conditional distribution of Y_{t+1}, given (Y_t, ..., Y_{t−k+1}). Given the states at the previous k times, the future behavior of the chain is independent of past behavior before those k times. Our discussion here focuses mainly on ordinary Markov chains as in (11.13), which are first order (k = 1). Denote the conditional probability P(Y_t = j | Y_{t−1} = i) by π_{j|i}(t).
The {π_{j|i}(t)}, which satisfy Σ_j π_{j|i}(t) = 1, are called transition probabilities. The I × I matrix {π_{j|i}(t), i = 1, ..., I, j = 1, ..., I} is a transition probability matrix. It is called one-step, to distinguish it from the matrix of probabilities for k-step transitions from time t − k to time t. From (11.13), the joint distribution for a Markov chain depends only on one-step transition probabilities and the marginal distribution for the initial state. It also follows that the joint distribution satisfies loglinear model

(Y_0 Y_1, Y_1 Y_2, ..., Y_{T−1} Y_T).

For a sample of realizations of a stochastic process, a contingency table displays counts of the possible sequences. A test of fit of this loglinear model checks whether the process plausibly satisfies the Markov property.

Statistical inference for Markov chains uses standard methods of categorical data analysis. For example, consider ML estimation of transition probabilities. Let n_ij(t) denote the number of transitions from state i at time t − 1 to state j at time t. For fixed t, {n_ij(t)} form the two-way marginal table for dimensions t − 1 and t of an I^{T+1} contingency table. For the n_{i+}(t) subjects in category i at time t − 1, suppose that {n_ij(t), j = 1, ..., I} have a multinomial distribution with parameters {π_{j|i}(t)}. Let {n_i0} denote the initial counts. Suppose that they also have a multinomial distribution, with parameters {π_{i0}}. If subjects behave independently, from (11.13) the likelihood function is proportional to

[∏_{i=1}^I π_{i0}^{n_{i0}}] [∏_{t=1}^T ∏_{i=1}^I ∏_{j=1}^I π_{j|i}(t)^{n_{ij}(t)}].    (11.14)

The transition probabilities are parameters of I × T independent multinomial distributions. From Anderson and Goodman (1957), the ML estimates are

π̂_{j|i}(t) = n_ij(t) / n_{i+}(t).
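As an illustrative sketch (not code from the text), the factorization (11.13) and the ML estimates above translate directly into a few lines; states are coded 0, ..., I − 1 and the sequences are assumed aligned in time.

```python
import numpy as np

def chain_prob(seq, init, P):
    """Probability of an observed path under a first-order chain,
    via the factorization (11.13): f(y0) * prod_t f(y_t | y_{t-1})."""
    p = init[seq[0]]
    for a, b in zip(seq[:-1], seq[1:]):
        p *= P[a][b]
    return p

def transition_mle(seqs, n_states):
    """Time-specific ML estimates pi_hat[t][i, j] = n_ij(t) / n_i+(t)
    from equal-length sequences (states coded 0..n_states-1)."""
    T = len(seqs[0]) - 1
    est = []
    for t in range(T):
        counts = np.zeros((n_states, n_states))
        for s in seqs:
            counts[s[t], s[t + 1]] += 1
        row = counts.sum(axis=1, keepdims=True)
        est.append(np.divide(counts, row, out=np.zeros_like(counts),
                             where=row > 0))
    return est
```

Pooling the transition counts over t instead gives the stationary-chain estimator π̂_{j|i} = n_ij/n_{i+} of Problem 11.33(b).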
11.5.3 Respiratory Illness Example

Table 11.7 refers to a longitudinal study at Harvard of effects of air pollution on respiratory illness in children. The children were examined annually at ages 9 through 12 and classified according to the presence or absence of wheeze. Denote the binary response (wheeze, no wheeze) by Y_t at age t, t = 9, 10, 11, 12.

The loglinear model (Y9Y10, Y10Y11, Y11Y12) represents a first-order Markov chain. It fits poorly, with G² = 122.9 (df = 8). Given the state at time t, classification at time t + 1 depends on states at times previous to time t. The model (Y9Y10Y11, Y10Y11Y12) represents a second-order Markov chain, satisfying conditional independence at ages 9 and 12, given states at ages 10 and 11. This model also fits poorly, with G² = 23.9 (df = 4). The poor fits may partly reflect subject heterogeneity, since these analyses ignore possibly relevant covariates such as parental smoking behavior.

The loglinear model (Y9Y10, Y9Y11, Y9Y12, Y10Y11, Y10Y12, Y11Y12) that permits association at each pair of ages fits well, with G² = 1.5 (df = 5). Table 11.8 shows its ML estimates of pairwise conditional log odds ratios. The association seems similar for pairs of ages 1 year apart, and somewhat weaker for pairs of ages more than 1 year apart.

TABLE 11.7 Results of Breath Test at Four Ages^a

Y9  Y10  Y11  Y12  Count     Y9  Y10  Y11  Y12  Count
1   1    1    1     94       2   1    1    1     19
1   1    1    2     30       2   1    1    2     15
1   1    2    1     15       2   1    2    1     10
1   1    2    2     28       2   1    2    2     44
1   2    1    1     14       2   2    1    1     17
1   2    1    2      9       2   2    1    2     42
1   2    2    1     12       2   2    2    1     35
1   2    2    2     63       2   2    2    2    572

^a 1, wheeze; 2, no wheeze.
Source: Ware et al. (1988).

TABLE 11.8 Estimated Conditional Log Odds Ratios for Table 11.7

Association   Estimate   Simpler Structure
Y9Y10         1.81       1.75
Y10Y11        1.65       1.75
Y11Y12        1.85       1.75
Y9Y11         0.95       1.04
Y9Y12         1.05       1.04
Y10Y12        1.07       1.04
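As an illustrative check (not code from the text), the first-order Markov model (Y9Y10, Y10Y11, Y11Y12) can be fitted to Table 11.7 by iterative proportional fitting, since the model matches the three successive two-way margins; this reproduces the reported lack of fit.

```python
import numpy as np

# Table 11.7 counts, indexed (y9, y10, y11, y12) with category 1 -> index 0
counts = np.array([94, 30, 15, 28, 14, 9, 12, 63,
                   19, 15, 10, 44, 17, 42, 35, 572],
                  dtype=float).reshape(2, 2, 2, 2)

def ipf(n, margin_sets, n_cycles=50):
    """Iterative proportional fitting: scale a uniform start so each
    requested marginal table matches the observed one."""
    fit = np.full(n.shape, n.sum() / n.size)
    for _ in range(n_cycles):
        for axes in margin_sets:
            other = tuple(a for a in range(n.ndim) if a not in axes)
            fit *= n.sum(axis=other, keepdims=True) / fit.sum(axis=other, keepdims=True)
    return fit

# Margins of the first-order Markov loglinear model (Y9Y10, Y10Y11, Y11Y12)
fit = ipf(counts, [(0, 1), (1, 2), (2, 3)])
G2 = 2 * np.sum(counts * np.log(counts / fit))
# G2 is approximately 122.9 (df = 8), the value reported in the text
```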
The simpler model in which λ^{Y9Y10}_{ij} = λ^{Y10Y11}_{ij} = λ^{Y11Y12}_{ij} and λ^{Y9Y11}_{ij} = λ^{Y9Y12}_{ij} = λ^{Y10Y12}_{ij} fits well, with G² = 2.3 (df = 9). The estimated log odds ratios are 1.75 in the first case and 1.04 in the second.

11.5.4 Transitional Models with Explanatory Variables

Transitional models usually also include explanatory variables x. The joint mass function of T sequential responses is then

f(y_1, ..., y_T; x) = f(y_1; x) f(y_2 | y_1; x) f(y_3 | y_1, y_2; x) ··· f(y_T | y_1, y_2, ..., y_{T−1}; x).

With binary y, for instance, one might specify a logistic regression model for each term in this factorization,

f(y_t | y_1, ..., y_{t−1}; x_t) = exp[y_t(α + β_1 y_1 + ··· + β_{t−1} y_{t−1} + β'x_t)] / [1 + exp(α + β_1 y_1 + ··· + β_{t−1} y_{t−1} + β'x_t)],  y_t = 0, 1.

Here, the predictor x_t may take different values for each component. The model treats previous responses as explanatory variables. It is called a regressive logistic model (Bonney 1987).

The interpretation and magnitude of β̂ depend on how many previous observations are in the model. Within-cluster effects may diminish markedly by conditioning on previous responses. This is an important difference from marginal models, for which the interpretation does not depend on the specification of the dependence structure. In the special case of first-order Markov structure, the coefficients of {y_1, ..., y_{t−2}} equal 0 in the model for y_t (e.g., Azzalini 1994; Bonney 1987). It may help to allow interaction between x_t and y_{t−1} in their effects on y_t.

For a given subject, the product of the conditional mass functions determines that subject's contribution to the likelihood function. (One usually ignores the contribution of the marginal distribution for the first term.) That is, given the predictor, the model treats repeated transitions by a subject as independent.
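A minimal sketch of this idea (hypothetical data, not from the text): each subject's sequence is expanded into one row per transition, with the previous response carried along as a predictor, after which any ordinary logistic regression routine can be applied to the stacked rows.

```python
def transition_rows(subjects):
    """Expand (x, [y_1, ..., y_T]) pairs into one record per transition:
    (x, t, y_previous, y_t).  Ordinary GLM software can then fit a
    first-order regressive logistic model by treating these rows as
    independent observations."""
    rows = []
    for x, ys in subjects:
        for t in range(1, len(ys)):
            rows.append((x, t + 1, ys[t - 1], ys[t]))
    return rows

# Hypothetical subjects: a covariate x and four binary responses each
subjects = [(0, [0, 0, 1, 0]), (1, [1, 1, 1, 0])]
rows = transition_rows(subjects)
```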
Thus, one can fit the model with ordinary GLM software, treating each transition as a separate observation (Bonney 1986).

11.5.5 Child's Respiratory Illness and Maternal Smoking

Table 11.9 is also from the Harvard study of air pollution and health. At ages 7 through 10, children were evaluated annually on the presence of respiratory illness. A predictor is maternal smoking at the start of the study, where s = 1 for smoking regularly and s = 0 otherwise. Let y_t denote the response at age t (t = 7, 8, 9, 10). We consider the regressive logistic model

logit P(Y_t = 1) = α + β_1 s + β_2 t + β_3 y_{t−1},  t = 8, 9, 10.

Each subject contributes three observations to the model fitting. The data set consists of 12 binomials, for the 2 × 3 × 2 combinations of (s, t, y_{t−1}). For instance, for the combination (0, 8, 0), y_8 = 0 for 237 + 10 + 15 + 4 = 266 subjects and y_8 = 1 for 16 + 2 + 7 + 3 = 28 subjects. The ML fit is

logit P̂(Y_t = 1) = −0.293 + 0.296 s − 0.243 t + 2.211 y_{t−1},

with SE values (0.846, 0.156, 0.095, 0.158). Not surprisingly, the previous observation has a strong effect. Given that and the child's age, there is slight evidence of a positive effect of maternal smoking: The likelihood-ratio statistic for H_0: β_1 = 0 is 3.55 (df = 1, P = 0.06). The model itself does not show any evidence of lack of fit (G² = 3.1, df = 8).

TABLE 11.9 Child's Respiratory Illness by Age and Maternal Smoking

Child's Respiratory Illness     No Maternal Smoking     Maternal Smoking
                                      Age 10                 Age 10
Age 7   Age 8   Age 9             No       Yes            No       Yes
No      No      No               237       10            118        6
No      No      Yes               15        4              8        2
No      Yes     No                16        2             11        1
No      Yes     Yes                7        3              6        4
Yes     No      No                24        3              7        3
Yes     No      Yes                3        2              3        1
Yes     Yes     No                 6        2              4        2
Yes     Yes     Yes                5       11              4        7

Source: Data courtesy of James Ware.

NOTES

Section 11.1: Comparing Marginal Distributions: Multiple Responses

11.1. Darroch (1981) surveyed thoroughly the relationships among statistics for testing marginal homogeneity and their connections with generalized CMH analyses. See also Mantel and Byar (1978) and White et al. (1982).
Croon et al. (2000) studied a variety of hypotheses for longitudinal data in the context of the generalized loglinear model.

Section 11.2: Marginal Modeling: Maximum Likelihood Approach

11.2. For other work on ML fitting of marginal models, see Bergsma and Rudas (2002), Ekholm et al. (2000), Fitzmaurice et al. (1993), and Lang et al. (1999).

Section 11.3: Marginal Modeling: Generalized Estimating Equations Approach

11.3. Liang et al. (1992) discussed GEE methods for categorical (primarily binary) responses. For multinomial responses, see Heagerty and Zeger (1996), Lipsitz et al. (1994), Miller et al. (1993), and references in Agresti and Natarajan (2001). More general models with ordinal responses allow for dispersion parameters that also depend on covariates (Toledano and Gatsonis 1996).

11.4. LaVange et al. (2001) used GEE methods to adjust for clustered sampling in surveys and clinical trials. Boos (1992) discussed generalized score tests that incorporate empirical variance estimates, illustrating with tests for trend and lack of fit in binary regression.

11.5. Koch et al. (1977) used weighted least squares (WLS) to fit marginal models to Table 11.2. WLS for categorical modeling is described in Section 15.1. It has severe limitations (e.g., covariates must be categorical and marginal tables cannot be sparse) but led naturally to the GEE approach.

Section 11.4: Quasi-likelihood and Its GEE Multivariate Extension: Details

11.6. Firth (1993b) provided a useful overview of quasi-likelihood methods. McCullagh (1983) showed that under correct specification of the mean and the variance function, quasi-likelihood estimators are asymptotically efficient among estimators that are locally linear in {y_i}. His result generalizes the Gauss-Markov theorem, although in an asymptotic rather than exact manner. See also Heyde (1997) and Liang and Zeger (1995)
for discussions of unbiased estimating functions and their connections with asymptotic consistency and efficiency. Godambe showed in 1960 that ML estimators are optimal solutions with an unbiased estimating function. When quasi-likelihood estimators are not ML, Cox (1983) and Firth (1987) suggested that they still retain good efficiency when the departure from the natural exponential family is at most moderate, such as modest overdispersion relative to such a family.

11.7. The generalized estimating equations are likelihood equations, and hence the GEE estimates are also ML, in certain cases. Examples are multivariate normal data or binary data when the working covariance is correct (Fitzmaurice et al. 1993). Results about effects of model misspecification arise in a variety of model-building contexts. For general theory, see Gourieroux et al. (1984), Hansen (1982), Liang and Zeger (1995), and White (1982).

11.8. A GEE2 analysis adds estimating equations for the correlation structure (Prentice and Zhao 1991). This has the potential to increase efficiency. A disadvantage is that, unlike with ordinary GEE, β̂ is no longer consistent if this part of the model is misspecified. Qu et al. (2000) showed how to increase efficiency by representing the working correlation matrix by a linear combination of basis matrices.

11.9. For surveys of ways to handle missing data, see Little (1998), Little and Rubin (1987, Chap. 9), Schafer (1997), and Verbeke and Molenberghs (2000). See also Baker and Laird (1988), Fay (1986), Fitzmaurice et al. (1994), Forster and Smith (1998), Fuchs (1982), Molenberghs and Goetghebeur (1997), Molenberghs et al. (1997), Park and Brown (1994), and Stokes et al. (2000).

Section 11.5: Markov Chains: Transitional Modeling

11.10. For statistical inference with Markov chains, see Andersen (1980, Sec. 7.7), Anderson and Goodman (1957), Billingsley (1961), Bishop et al. (1975, Chap.
7), and Kalbfleisch and Lawless (1985). See Conaway (1989), Stiratelli et al. (1984), and Ware et al. (1988) for other analyses focusing on the conditional dependence structure.

PROBLEMS

Applications

11.1 Refer to Table 8.3. Viewing the table as matched triplets, construct the marginal distribution for each substance. Find the sample proportions of students who used marijuana, alcohol, and cigarettes. Test the hypothesis of marginal homogeneity. Interpret results.

11.2 Refer to Table 9.1. Fit a marginal model to describe main effects of race, gender, and substance type (marijuana, alcohol, cigarettes) on whether a subject had used that substance. Summarize effects.

11.3 Refer to Problem 11.2. Further study shows evidence of an interaction between gender and substance type. Using GEE with exchangeable working correlation, the model fit for the probability π of using a particular substance is

logit(π̂) = −0.57 + 1.93 S1 + 0.86 S2 + 0.38 R − 0.20 G + 0.37 G × S1 + 0.22 G × S2,

where R, G, S1, S2 are dummy variables for race (1 = white), gender (1 = female), and substance type (S1 = 1, S2 = 0 for alcohol; S1 = 0, S2 = 1 for cigarettes; S1 = S2 = 0 for marijuana). Show that:
a. The estimated odds a nonwhite male has used marijuana are exp(−0.57) = 0.57.
b. Given gender, the estimated odds a white subject used a given substance are 1.46 times the estimated odds for a black subject.
c. Given race, the estimated odds a female has used alcohol are 1.19 times the estimated odds for males; for cigarettes and for marijuana, the estimated odds ratios are 1.02 and 0.82.
d. Given race, the estimated odds a female has used alcohol (cigarettes) are 9.97 (2.94) times the estimated odds she has used marijuana.
e. Given race, the estimated odds a male has used alcohol (cigarettes) are 6.89 (2.36) times the estimated odds he has used marijuana.
Interpret the interaction.

11.4 Refer to Table 11.2. Analyze the data using the scores (1, 2, 4)
for the week number, using ML or GEE. Interpret estimates and compare substantive results to those in the text with scores (0, 1, 2).

11.5 Analyze Table 11.9 using a marginal logit model with age and maternal smoking as predictors. Compare interpretations to the Markov model of Section 11.5.5.

11.6 Table 11.10 refers to a three-period crossover trial to compare placebo (treatment A) with a low-dose analgesic (treatment B) and high-dose analgesic (treatment C) for relief of primary dysmenorrhea. Subjects in the study were divided randomly into six groups, the possible sequences for administering the treatments. At the end of each period, each subject rated the treatment as giving no relief (0) or some relief (1). Let y_{i(k)t} = 1 denote relief for subject i using treatment t (t = A, B, C), where subject i is nested in treatment sequence k (k = 1, ..., 6). Assuming common treatment effects for each sequence, and setting β_A = 0, obtain and interpret {β̂_t} (using ML or GEE) for the model

logit P(Y_{i(k)t} = 1) = α_k + β_t.

How would you order the drugs, taking significance into account?

TABLE 11.10 Data for Problem 11.6

                     Response Pattern for Treatments (A, B, C)
Treatment Sequence   000  001  010  011  100  101  110  111
A B C                 0    2    2    9    0    0    1    1
A C B                 2    0    0    9    1    0    0    4
B A C                 0    1    1    8    1    3    0    1
B C A                 0    1    1    8    1    0    0    1
C A B                 3    0    0    7    0    1    2    1
C B A                 1    5    0    4    0    3    1    0

Source: Jones and Kenward (1987).

11.7 Table 11.11 is from a Kansas State University survey of 262 pig farmers. For the question "What are your primary sources of veterinary information?," the categories were (A) professional consultant, (B) veterinarian, (C) state or local extension service, (D) magazines, and (E) feed companies and reps. Farmers sampled were asked to select all relevant categories. The 2^5 × 2 × 4 table shows the (yes, no)
counts for each of these five sources cross-classified with the farmers’ education Žwhether they had at least some college education. and size of farm Žnumber of pigs marketed annually, in thousands .. TABLE 11.11 Data for Problem 11.7 Response on D A s yes B s yes A s no B s no B s yes B s no C s yes C s no C s yes C s no C s yes C s no C s yes C s no Educ Pigs E Y N Y N Y N Y N Y N Y N Y N Y N No -1 Y N Y N Y N Y N Y N Y N Y N Y N 1 0 2 0 3 1 2 1 3 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 3 0 6 0 0 0 1 0 0 0 10 2 1 4 0 3 0 1 0 4 4 2 2 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 2 5 4 5 1 2 0 1 1 2 1 4 0 5 1 4 1 4 1 0 2 0 0 0 0 4 0 2 0 0 0 1 1 5 7 7 0 0 3 4 0 1 1 4 1 0 0 6 0 2 6 14 0 1 7 14 1 1 4 4 0 0 2 4 3 0 4 0 1 0 2 0 11 0 6 0 3 0 2 0 1᎐2 2᎐5 )5 Some -1 1᎐2 2᎐5 )5 Source: Data courtesy of Tom Loughin, Kansas State University. 485 PROBLEMS a. Explain why it is not proper to analyze the data by fitting a multinomial model to the counts in the 2 = 4 = 5 contingency table cross-classifying education by size of farm by the source of veterinary information, treating source as the response variable. ŽThis table contains 453 positive responses of sources from the 262 farmers.. b. For a farmer with education i and size of farm s, let  j Ž is . denote the probability of responding ‘‘ yes’’ on the jth source. Table 11.12 shows output for using GEE with exchangeable working correlation to estimate parameters in the model lacking an education effect, logit  j Ž is . s ␣ j q ␤ j s, s s 1, 2, 3, 4. Explain how to interpret the working correlation matrix. Explain why the results suggest a strong positive size of farm effect for source A and perhaps a weak negative size effect of similar magnitude for C, D, and E. c. 
Constraining β_3 = β_4 = β_5, the ML estimate of the common slope is −0.184 (SE = 0.063). Explain why it is advantageous to fit the marginal model simultaneously for all sources rather than separately to each. [Agresti and Liu (1999) and Loughin and Scherer (1998) discussed analyses for data of this form.]

TABLE 11.12 Output for Problem 11.7

Working Correlation Matrix
        Col1     Col2     Col3     Col4     Col5
Row1    1.0000   0.0997   0.0997   0.0997   0.0997
Row2    0.0997   1.0000   0.0997   0.0997   0.0997
Row3    0.0997   0.0997   1.0000   0.0997   0.0997
Row4    0.0997   0.0997   0.0997   1.0000   0.0997
Row5    0.0997   0.0997   0.0997   0.0997   1.0000

Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates
Parameter          Estimate   Std Error    Z       Pr > |Z|
source        1    -4.4994    0.6457      -6.97    <.0001
source        2    -0.8279    0.2809      -2.95    0.0032
source        3    -0.1526    0.2744      -0.56    0.5780
source        4     0.4875    0.2698       1.81    0.0708
source        5    -0.0808    0.2738      -0.30    0.7680
size*source   1     1.0812    0.1979       5.46    <.0001
size*source   2     0.0792    0.1105       0.72    0.4738
size*source   3    -0.1894    0.1121      -1.69    0.0912
size*source   4    -0.2206    0.1081      -2.04    0.0412
size*source   5    -0.2387    0.1126      -2.12    0.0341

TABLE 11.13 Output for Problem 11.8

Working Correlation Matrix
        Col1     Col2     Col3
Row1    1.0000   0.8173   0.8173
Row2    0.8173   1.0000   0.8173
Row3    0.8173   0.8173   1.0000

Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates
Parameter      Estimate   Std Error    Z       Pr > |Z|
Intercept      -0.1253    0.0676      -1.85    0.0637
question 1      0.1493    0.0297       5.02    <.0001
question 2      0.0520    0.0270       1.92    0.0544
question 3      0.0000    0.0000       .       .
female          0.0034    0.0878       0.04    0.9688

11.8 Refer to Table 11.13 on attitudes toward legalized abortion. For the response Y_t (1 = support legalization, 0 = oppose) for question t (t = 1, 2, 3) and for gender g (1 = female, 0 = male), consider the model

logit[P(Y_t = 1)] = α + γg + β_t

with β_3 = 0.
a.
A GEE analysis using unstructured working correlation gives correlation estimates 0.826 for questions 1 and 2, 0.797 for 1 and 3, and 0.832 for 2 and 3. What does this suggest about a reasonable working correlation structure?
b. Table 11.13 shows a GEE analysis with exchangeable working correlation. Interpret effects.
c. Treating the three responses for each subject as independent observations and performing ordinary logistic regression, β̂_1 = 0.149 (SE = 0.066), β̂_2 = 0.052 (SE = 0.066), and γ̂ = 0.004 (SE = 0.054). Give a heuristic explanation of why within-subject standard errors are much larger than with GEE, yet the between-subject standard error is smaller.

11.9 Refer to the air pollution data in Table 11.7. Using ML or GEE, fit marginal logit models that assume (a) marginal homogeneity, (b) a linear effect of time, and (c) no pattern. Interpret and compare.

11.10 Refer to the clinical trials data in Table 12.5, analyzed with random effects models in Section 12.3.4. Use GEE methods to analyze them, treating each center as a correlated cluster.

11.11 Refer to Table 10.5. Using GEE methods with cumulative logits, compare the two marginal distributions. Compare results to those using ML in Section 10.3.2.

11.12 Refer to the 3^4 table on government spending in Table 8.19. Analyze these data with a marginal cumulative logit model. Interpret effects.

11.13 Refer to Table 11.4.
a. To compare effects while controlling for initial response, fit model (11.7), using scores {10, 25, 45, 75} for time to falling asleep. Also fit the interaction model, and describe the lack of fit. (Note that for the first two baseline levels, the active and placebo treatments have similar sample response distributions at the follow-up; at higher baseline levels, the active treatment seems more successful.)
b. Fit the interaction model

logit P(Y_2 ≤ j)
= α_j + β_1 x + β_2 y_1 + β_3 x y_1 that constrains the effects {β_1 x + β_2 y_1 + β_3 x y_1} to follow the pattern (τ, τ, λ + σ, λ) for the active group and (τ, τ, σ, 0) for the placebo group. Interpret λ̂.

11.14 Find a marginal model with another type of logit that fits the insomnia data of Table 11.4 well. Interpret parameter estimates, and compare conclusions to those using cumulative logits.

11.15 Refer to Table 11.9. Combine the data for the two levels of maternal smoking. Does a first-order Markov chain model these data adequately? Find a loglinear model that does fit adequately.

11.16 Analyze Table 11.9 using a transitional model with two previous responses. Does it fit better than the first-order model of Section 11.5.5? Interpret.

11.17 Analyze Table 11.2 using a first-order transitional model. Compare interpretations to those in this chapter using marginal models.

11.18 Table 11.14 is from a longitudinal study of coronary risk factors in schoolchildren (Woolson and Clarke 1984). A sample of children aged 11-13 in 1977 were classified by gender and by relative weight (obese, not obese) in 1977, 1979, and 1981. Analyze these data.

TABLE 11.14 Data for Problem 11.18

              Responses^a
Gender    NNN  NNO  NON  NOO  ONN  ONO  OON  OOO
Male      119   7    8    3   13    4   11   16
Female    129   8    7    9    6    2    7   14

^a NNN indicates not obese in 1977, 1979, and 1981; NNO indicates not obese in 1977 and 1979 but obese in 1981; and so on.
Source: Reproduced with permission from the Royal Statistical Society, London (Woolson and Clarke 1984).

11.19 Refer to the pig farmer survey of Problem 11.7 (Table 11.11). Analyze these data using marginal models with all the variables.

11.20 Refer to the cereal diet and cholesterol study of Problem 7.18 (Table 7.23). Analyze these data with marginal models.

Theory and Methods

11.21 Refer to Problem 11.1.
Suppose that we expressed the data with a 3 × 2 partial table of drug-by-response for each subject, to use a generalized CMH procedure to test marginal homogeneity. Explain why the 911 + 279 subjects who make the same response for every drug have no effect on the test.

11.22 Let y_it = 1 or 0 for observation t on subject i, i = 1, ..., n, t = 1, ..., T. Let ȳ_·t = Σ_i y_it / n, ȳ_i· = Σ_t y_it / T, and ȳ_·· = Σ_i Σ_t y_it / nT.
a. Regard {y_{i+}} as fixed. Suppose that each way to allocate the y_{i+} "successes" to y_{i+} of the observations is equally likely. Show that E(Y_it) = ȳ_i·, var(Y_it) = ȳ_i·(1 − ȳ_i·), and cov(Y_it, Y_ik) = −ȳ_i·(1 − ȳ_i·)/(T − 1) for t ≠ k. [Hint: The covariance is the same for any pair of cells in the same row, and var(Σ_t Y_it) = 0 since y_{i+} is fixed.]
b. Refer to part (a). For large n with independent subjects, explain why (Ȳ_·1, ..., Ȳ_·T) is approximately multivariate normal with pairwise correlation ρ = −1/(T − 1). Conclude that Cochran's Q statistic (Cochran 1950),

Q = n²(T − 1) Σ_{t=1}^T (ȳ_·t − ȳ_··)² / [T Σ_{i=1}^n ȳ_i·(1 − ȳ_i·)],

is approximately chi-squared with df = T − 1. [One way notes that if (X_1, ..., X_T) is multivariate normal with common mean, common variance σ², and common correlation ρ for pairs (X_t, X_k), then Σ(X_t − X̄)²/[σ²(1 − ρ)] is chi-squared with df = T − 1. See Bhapkar and Somes (1977) for slightly weaker conditions for a chi-squared limiting distribution for Q than those in part (a).]
c. Show that Q is unaffected by deleting cases in which y_{i1} = ··· = y_{iT}.

11.23 Consider the model μ_i = β, i = 1, ..., n, assuming that v(μ_i) = μ_i. Suppose that actually var(Y_i) = μ_i². Using the univariate version of GEE described in Section 11.4, show that u(β) = Σ_i (y_i − β)/β and β̂ = ȳ. Show that V in (11.10) equals β/n, the actual asymptotic variance (11.11) simplifies to β²/n, and its consistent estimate is Σ_i (y_i − ȳ)²/n².
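An illustrative implementation (not from the text) of the Q statistic in Problem 11.22(b), which also makes part (c)'s invariance easy to check numerically:

```python
import numpy as np

def cochran_q(y):
    """Cochran's Q for an n x T binary matrix y (subjects x occasions):
    Q = n^2 (T-1) sum_t (ybar_.t - ybar_..)^2 / [T sum_i ybar_i.(1 - ybar_i.)].
    Rows with all-equal responses contribute 0 to the denominator and
    shift the column means in a way that leaves Q unchanged."""
    y = np.asarray(y, dtype=float)
    n, T = y.shape
    col_mean = y.mean(axis=0)            # ybar_.t
    row_mean = y.mean(axis=1)            # ybar_i.
    grand = y.mean()                     # ybar_..
    num = n**2 * (T - 1) * ((col_mean - grand) ** 2).sum()
    den = T * (row_mean * (1 - row_mean)).sum()
    return num / den                     # approx. chi-squared, df = T - 1
```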
11.24 Repeat Problem 11.23, assuming that v(μ_i) = σ² when actually var(Y_i) = μ_i.

11.25 Consider the model μ_i = β, i = 1, ..., n, for independent Poisson observations. For β̂ = ȳ, show that the model-based asymptotic variance estimate is ȳ/n, whereas the robust estimate of the asymptotic variance is Σ_i (y_i − ȳ)²/n². Which would you expect to be better (a) if the Poisson model holds, and (b) if there is severe overdispersion?

11.26 Show that (11.10) is equivalent to the formula for the large-sample covariance of the ML estimator in a GLM, estimated by (4.28).

11.27 a. For a univariate response, how is quasi-likelihood (QL) inference different from ML inference? When are they equivalent?
b. Explain the sense in which GEE methodology is a multivariate version of QL.
c. Summarize the advantages and disadvantages of the QL approach.
d. Describe conditions under which GEE parameter estimators are consistent and conditions under which they are not. For conditions in which they are consistent, explain why.

11.28 Formulate a model using adjacent-categories logits or continuation-ratio logits that is analogous to (11.4). Interpret parameters.

11.29 Refer to the analysis of mean time to falling asleep at the end of Section 11.2.3. Explain how to calculate the SE for the difference between the differences of means reported there. (Note that one difference uses paired samples and the other uses independent samples.)

11.30 What is wrong with this statement?: "For a first-order Markov chain, Y_t is independent of Y_{t−2}."

11.31 Suppose that loglinear model (Y_0, Y_1, ..., Y_T) holds. Is this a Markov chain?

11.32 Gamblers A and B have a total of I dollars. They play games of pool repeatedly. Each game they each bet $1, and the winner takes the other's dollar. The outcomes of the games are statistically independent, and A has probability π and B has probability 1 − π of winning any game. Play stops when one player has all the money.
Let Y_t denote A's monetary total after t games.
a. Show that {Y_t} is a first-order Markov chain.
b. State the transition probability matrix. (For this gambler's ruin problem, 0 and I are absorbing states. Eventually, the chain enters one of these and stays. The other states are transient.)

11.33 A first-order Markov chain has stationary (or time-homogeneous) transition probabilities if the one-step transition probability matrices are identical, that is, if for all i and j,

π_{j|i}(1) = π_{j|i}(2) = ··· = π_{j|i}(T) = π_{j|i}.

Let X, Y, and Z denote the classifications for the I × I × T table consisting of {n_ij(t), i = 1, ..., I, j = 1, ..., I, t = 1, ..., T}.
a. Explain why all transition probabilities are stationary if expected frequencies for this table satisfy loglinear model (XY, XZ). [Thus, the likelihood-ratio statistic for testing stationary transition probabilities equals G² for testing fit of model (XY, XZ).]
b. Let n_ij = Σ_t n_ij(t). Under the assumption of stationary transition probabilities, show how the likelihood in (11.14) simplifies, and show that the ML estimators are π̂_{j|i} = n_ij / n_{i+}.
c. For a Markov chain with stationary transition probabilities, let y_ijk denote the number of transitions from i to j to k over two successive steps. For {y_ijk}, argue that the goodness of fit of loglinear model (Y1Y2, Y2Y3) tests that the chain is first order against the alternative that it is second order (Anderson and Goodman 1957).

CHAPTER 12

Random Effects: Generalized Linear Mixed Models for Categorical Responses

In Chapter 11 we noted that observations often occur in clusters. For instance, cluster i might consist of repeated measurements on subject i or observations for all subjects in family i.
Observations within a cluster tend to be more alike than observations from different clusters. Thus, they are usually positively correlated. Ordinary analyses that ignore the correlation and treat within-cluster observations the same as between-cluster observations produce invalid standard errors.

In Chapter 11 we focused on modeling the marginal distributions of clustered responses, treating the joint dependence structure as a nuisance. In this chapter we present an alternative approach using cluster-level terms in the model. These terms take the same value for each observation in a cluster but different values for different clusters. They are unobserved and, when treated as varying randomly among clusters, are called random effects. In Section 10.2.4 we introduced this approach in a model for matched pairs. The models have conditional interpretations, referred to as subject-specific when each cluster is a subject. This contrasts with marginal models, which have population-averaged interpretations.

Random effects models for normal responses are well established. By contrast, only recently have random effects been used much in models for categorical data. In this chapter we extend generalized linear models to include random effects. In Section 12.1 we introduce this extension, the generalized linear mixed model. In Section 12.2 we discuss an important special case for binary data, the logistic-normal model. Several examples are shown in Section 12.3. Section 12.4 covers extensions for multinomial responses, and Section 12.5 covers models with multivariate random effects. In Section 12.6 we discuss model fitting, assuming normality for the random effects. Parts of this chapter are from Agresti et al. (2000).

12.1 RANDOM EFFECTS MODELING OF CLUSTERED CATEGORICAL DATA

Parameters that describe a factor's effects in ordinary linear models are called fixed effects.
They apply to all categories of interest, such as genders, age groupings, or treatments. By contrast, random effects usually apply to a sample. For a study using a sample of clinics, for example, the model treats observations from a given clinic as a cluster, and it has a random effect for each clinic. GLMs extend ordinary regression by allowing nonnormal responses and a link function of the mean. The generalized linear mixed model (GLMM) is a further extension that permits random effects as well as fixed effects in the linear predictor.

12.1.1 Generalized Linear Mixed Model

Let y_it denote observation t in cluster i, t = 1, …, T_i. As in the GEE analyses in Chapter 11, the number of observations may vary by cluster. In a longitudinal study, even if clusters have equal size, many of them may have missing observations. Let x_it denote a column vector of values of explanatory variables, for fixed effect model parameters β. Let u_i denote the vector of random effect values for cluster i. This is common to all observations in the cluster. Let z_it denote a column vector of their explanatory variables. Often, the random effect is univariate.

Conditional on u_i, a GLMM resembles an ordinary GLM. Let μ_it = E(Y_it | u_i). The linear predictor for a GLMM has the form

g(μ_it) = x′_it β + z′_it u_i    (12.1)

for link function g(·). The random effect vector u_i is assumed to have a multivariate normal distribution N(0, Σ). The covariance matrix Σ depends on unknown variance components and possibly also correlation parameters. Denote var(Y_it | u_i) = φ_it v(μ_it), where the variance function v(·) describes how the (conditional) variance depends on the mean. As in Section 4.4, often φ_it = 1 or φ_it = φ/ω_it, where ω_it is a known weight (e.g., the number of trials for a binomial count) and φ is an unknown dispersion parameter. Conditional on u_i, the model treats {y_it} as independent over i and t.
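In code, the conditional specification in (12.1) is just a linear predictor passed through an inverse link. The following minimal sketch (hypothetical numbers throughout; the function names are ours, not from any package) evaluates the conditional mean and variance for one observation under a logit-link GLMM with a one-trial binomial response:

```python
import math

def inv_logit(eta):
    """Inverse of the logit link: g^{-1}(eta) = e^eta / (1 + e^eta)."""
    return 1.0 / (1.0 + math.exp(-eta))

def glmm_conditional_moments(x, beta, z, u):
    """Conditional mean and variance of a one-trial binomial Y_it given u_i.

    Implements (12.1) with the logit link:
        eta = x'beta + z'u,  mu = g^{-1}(eta),  var(Y | u) = mu(1 - mu)
    (binomial variance function v(mu) = mu(1 - mu), with phi_it = 1).
    """
    eta = sum(xj * bj for xj, bj in zip(x, beta))
    eta += sum(zj * uj for zj, uj in zip(z, u))
    mu = inv_logit(eta)
    return mu, mu * (1.0 - mu)

# Hypothetical values: fixed effects (intercept, slope) and a random intercept.
mu, var = glmm_conditional_moments(x=[1.0, 0.5], beta=[-0.2, 1.0], z=[1.0], u=[0.8])
```

Here the cluster's random intercept u_i = 0.8 simply shifts the linear predictor, exactly as an omitted cluster-level covariate would.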
As discussed in Section 10.2.2, the variability among {u_i} induces a nonnegative association among the responses, for the marginal distribution averaged over the subjects. This is caused by the shared random effect u_i for each observation in a cluster.

In (12.1), the random effect enters the model on the same scale as the predictor terms. This is convenient but also natural for many applications. For instance, random effects sometimes represent heterogeneity caused by omitting certain explanatory variables. Consider the special case with univariate random effect and z_it = 1. With u_i replaced by σu_i*, where {u_i*} are N(0, 1), the GLMM has the form

g(μ_it) = x′_it β + σu_i*.

This has the form of an ordinary GLM with unobserved values {u_i*} of a particular covariate. Thus, random effects models relate to methods of dealing with unmeasured predictors and other forms of missing data. The random effects part of the linear predictor reflects terms that would be in the fixed effects part if those explanatory variables had been included.

Random effects also sometimes represent random measurement error in the explanatory variables. If we replace a particular predictor x_it by x_it* + ε_i, with x_it* the true value and ε_i the measurement error, then ε_i times the regression parameter can be absorbed in the random effects term. Related to these motivations, random effects also provide a mechanism for explaining overdispersion in basic models not having those effects (Breslow and Clayton 1993).

12.1.2 Logit GLMM for Binary Matched Pairs

We illustrate the GLMM expression (12.1) using a simple case, that of binary matched pairs. The data form two dependent binomial samples (Section 10.1). Cluster i consists of the responses (y_i1, y_i2) for matched pair i. Observation t in cluster i has y_it = 1 (a success) or 0 (a failure), t = 1, 2. In Section 10.2.2 we introduced the model (Cox 1958b; Rasch 1961)
logit P(Y_it = 1) = α_i + β x_t    (12.2)

where x_1 = 0 and x_2 = 1. For it, β is a cluster-specific log odds ratio. That section treated α_i as a fixed effect and eliminated it using conditional ML. An equivalent representation of (12.2) is

logit P(Y_i1 = 1 | u_i) = α + u_i,    logit P(Y_i2 = 1 | u_i) = α + β + u_i,    (12.3)

where u_i = α_i − α for some constant α. Now, we treat u_i as a random effect for cluster i, with {u_i} independent from a N(0, σ²) distribution with σ unknown. Conditionally on u_i, we assume that y_i1 and y_i2 are independent. Model (12.3) is the special case of (12.1) in which μ_it = P(Y_it = 1 | u_i), g(·) is the logit link, β′ = (α, β), x′_i1 = (1, 0) and x′_i2 = (1, 1) for all i, and z_it = 1 for all i and t. The univariate random effect adjusts the intercept but does not modify the fixed effect. A GLMM with random effect of this form is called a random intercept model. Instead of the usual fixed intercept α, it has a random intercept α + u_i.

Let Y_1 = Σ_i y_i1 and Y_2 = Σ_i y_i2. Marginally, Y_1 is binomial with n trials and parameter E{exp(α + U)/[1 + exp(α + U)]}, and Y_2 is binomial with parameter E{exp(α + β + U)/[1 + exp(α + β + U)]}. The expectations refer to U, a N(0, σ²) random variable. The model implies a nonnegative correlation between Y_1 and Y_2, with greater association resulting from greater heterogeneity (i.e., larger σ). Clusters with a large positive u_i have a relatively large P(Y_it = 1 | u_i) for each t, whereas clusters with a large negative u_i have a relatively small P(Y_it = 1 | u_i) for each. For this model, Y_1 and Y_2 are independent only if σ = 0.

A 2 × 2 population-averaged table with (success, failure) for both the row and column categories summarizes the number of observations for which (y_i1, y_i2) = (1, 1), (1, 0), (0, 1), or (0, 0). Let {n_ab} denote these counts. Table 12.1, analyzed first in Section 10.1, is an example.
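The nonnegative association implied by the shared random intercept can be checked by simulation. The sketch below (with arbitrary parameter values, not the fit to Table 12.1) generates matched pairs from model (12.3); with σ > 0, the 2 × 2 table of (y_i1, y_i2) counts should show a sample odds ratio well above 1 even though the two responses are conditionally independent given u_i:

```python
import math
import random

def simulate_pairs(n_pairs, alpha, beta, sigma, seed=1):
    """Simulate matched pairs (y_i1, y_i2) from model (12.3); return 2x2 counts."""
    rng = random.Random(seed)
    counts = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
    for _ in range(n_pairs):
        u = rng.gauss(0.0, sigma)  # random intercept shared by the pair
        p1 = 1.0 / (1.0 + math.exp(-(alpha + u)))
        p2 = 1.0 / (1.0 + math.exp(-(alpha + beta + u)))
        y1 = 1 if rng.random() < p1 else 0  # conditionally independent given u
        y2 = 1 if rng.random() < p2 else 0
        counts[(y1, y2)] += 1
    return counts

counts = simulate_pairs(20000, alpha=0.0, beta=-0.5, sigma=2.0)
sample_or = (counts[(1, 1)] * counts[(0, 0)]) / (counts[(1, 0)] * counts[(0, 1)])
# With sigma > 0 the shared u_i induces positive marginal association, so
# sample_or should be well above 1; with sigma = 0 it would be near 1.
```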
Let {μ̂_ab} denote marginal fitted values for model (12.3). We defer discussion of model fitting until Section 12.6. However, model (12.3) is a rare instance in which the fixed effect in a random effects model has a closed-form ML estimate,

β̂ = log(μ̂_21/μ̂_12).

When the sample log odds ratio log(n_11 n_22/n_12 n_21) ≥ 0, then {μ̂_ab = n_ab} and β̂ = log(n_21/n_12). This is the same as the conditional ML estimate (Section 10.2.3). Neuhaus et al. (1994) showed that this is true for any parametric choice of random effects distribution for which the model (12.3) can generate {n_ab} as fitted values. Lindsay et al. (1991) showed that this estimate also results with a nonparametric approach discussed in Section 13.2.4. The model implies that the true log odds ratio for this 2 × 2 table is at least 0. When log(n_11 n_22/n_12 n_21) < 0, however, then β̂ = 0 and the fitted values {μ̂_ab = n_{a+} n_{+b}/n} satisfy independence. Then, β̂ is identical to the estimate for the marginal model (10.6) by which β is the difference between logits for the two marginal distributions, namely β̂ = log[(n_{2+} n_{+1})/(n_{1+} n_{+2})].

12.1.3 Ratings of Prime Minister Revisited

TABLE 12.1 Rating of Performance of Prime Minister

                            Second Survey
First Survey        Approve   Disapprove    Total
Approve                 794          150      944
Disapprove               86          570      656
Total                   880          720     1600

For Table 12.1, the ML fit of model (12.3), treating {u_i} as normal, yields β̂ = log(86/150) = −0.556 (SE = 0.135), with σ̂ = 5.16. This is identical to the conditional ML estimate (10.10), with standard error [(1/86) + (1/150)]^{1/2}. For a given subject, the estimated odds of approval at the second
allows T ) 2 observations in each cluster. The random intercept model then has form logit P Ž Yit s 1 < u i . s u i q t , Ž 12.4 . where  u i 4 are independent N Ž0,  2 .. Equivalently, the model can add an intercept  or let E Ž u i . s  , but then identifiability requires a constraint such as  T s 0. Early applications of this GLMM were in psychometrics. The model describes responses to a battery of T questions on an exam. The probability P Ž Yit s 1 < u i . that subject i makes the correct response on question t depends on the overall ability of subject i, characterized by u i , and the easiness of question t, characterized by t . Such models are called item-response models. The logit form Ž12.4. is called the Rasch model ŽRasch 1961.. In estimating  t 4 , Rasch treated  u i 4 as fixed effects and used conditional ML, as outlined in Section 10.2.3 for matched pairs. Later authors used the normal random effects approach for this model and the model with probit link Že.g., Bock and Aitkin 1981.. The t 4 in the Rasch model differ from parameters in corresponding marginal models such as Ž11.1., since the effects are subject specific. The Rasch model refers to a T = 2 = n table of observation by outcome by subject, whereas the marginal model refers to the T = 2 observation-byoutcome table of the T marginal distributions, collapsed over subjects. For observations s and t for a given subject i with model Ž12.4.,  s y t s logit P Ž Yi s s 1 < u i . y logit P Ž Yit s 1 < u i . , which is a log odds ratio conditional on the subject. By contrast, the corresponding population-averaged effect in marginal model Ž11.1. is  s y t s logit P Ž Yh s s 1 . y logit P Ž Yit s 1 . , with subject h randomly selected for observation s and subject i randomly selected for observation t Ži.e., h and i are independent observations.. 12.1.5 Random Effects versus Conditional ML Approaches Suppose that one treated  u i 4 in model Ž12.4. as fixed effects instead of random effects. 
Then, consider ordinary ML estimation of {β_t} and {u_i}. As n increases, so does the number of parameters, since each subject has a u_i. Even though the number of {β_t} does not increase as n does, the ordinary ML estimators {β̂_t} are not consistent. This happens in many models when the number of parameters has an order similar to that of the number of subjects. Asymptotic optimality properties of ML estimators, such as consistency, require the number of parameters to be fixed as n increases. For model (12.4), ML estimators of {β_t} have bias of order T/(T − 1) (Andersen 1980, pp. 244–245). For the matched-pairs model (12.2), for instance, β̂ → 2β in probability (Problem 10.24).

For this reason, the preferable approach for the fixed effects model is conditional ML. One eliminates {u_i} by conditioning on their sufficient statistics {S_i = Σ_t y_it, i = 1, …, n}. In the item-response context, these are the numbers of correct responses for each subject. Conditional on {S_i}, the distribution of {y_it} is independent of {u_i}. Maximizing the resulting likelihood then yields consistent estimators of {β_t}. The analysis generalizes the one in Section 10.2.3 for the subject-specific logistic model (10.8) for matched pairs. See Andersen (1980) for details.

Compared with the random effects approach, the conditional ML approach has certain advantages. One does not need to assume a parametric distribution for {u_i}. It is difficult to check this assumption in the random effects approach. Conditional ML is also appropriate with retrospective sampling. In that case, bias can occur with a random effects approach because the clusters are not randomly sampled (Neuhaus and Jewell 1990b). However, the conditional ML approach has severe disadvantages. It is restricted to the canonical link (the logit), for which reduced sufficient statistics exist for {u_i}.
More important, as discussed in Section 10.2.7, it is restricted to inference about within-cluster fixed effects. The conditioning removes the source of variability needed for estimating between-cluster effects in models with explanatory variables such as those considered next. Also, this approach does not provide information about {u_i}, such as predictions of their values and estimates of their variability or of the probabilities they determine. Finally, in more general models with covariates, conditional ML can be less efficient than the random effects approach for estimating the fixed effects (see Note 12.2).

12.2 BINARY RESPONSES: LOGISTIC-NORMAL MODEL

The item-response model (12.4) with random intercept is a special case of an important class of random effects models for binary data called logistic-normal models. With univariate random effect, the model form is

logit P(Y_it = 1 | u_i) = x′_it β + u_i    (12.5)

where {u_i} are independent N(0, σ²) variates. This is the special case of the GLMM (12.1) in which g(·) is the logit link and the random effects structure simplifies to a random intercept. The logistic-normal model has a long history, dating at least to Cox (1970, Prob. 20 in that text) for the matched-pairs model (12.3) and Pierce and Sands (1975).

More generally, the link function in model (12.5) can be an arbitrary inverse cdf. For such models, Y_is and Y_it are treated conditionally (given u_i) as independent but are marginally nonnegatively correlated. Let F denote the cdf that is the inverse link function. Then, for s ≠ t,

cov(Y_is, Y_it) = E[cov(Y_is, Y_it | u_i)] + cov[E(Y_is | u_i), E(Y_it | u_i)]
               = 0 + cov[F(x′_is β + u_i), F(x′_it β + u_i)].    (12.6)

The functions in the last covariance term are both monotone increasing in u_i, and hence are nonnegatively correlated. For common predictor value x at each t, the joint distribution for the model is exchangeable.
This is often plausible for clustered data. In longitudinal studies, however, observations closer together in time may tend to be more highly correlated.

Usually, the main focus in using a GLMM is inference about the fixed effects. The random effects part of the model is a mechanism for representing how the positive correlation occurs between observations within a cluster. Parameters pertaining to the random effects may themselves be of interest, however. For instance, the estimate σ̂ of the standard deviation of a random intercept may be a useful summary of the degree of heterogeneity of a population.

12.2.1 Interpreting Heterogeneity in Logistic-Normal Models

When σ = 0, the logistic-normal model (12.5) simplifies to the ordinary logistic regression model treating all observations as independent. When σ > 0, how can we interpret the variability in effects this model implies? Consider observation y_it at setting x_it of predictors and observation y_hs at setting x_hs. Their log odds ratio is

logit P(Y_it = 1 | u_i) − logit P(Y_hs = 1 | u_h) = (x_it − x_hs)′β + (u_i − u_h).

We cannot observe (u_i − u_h), which has a N(0, 2σ²) distribution. However, 100(1 − α)% of those log odds ratios fall within

(x_it − x_hs)′β ± z_{α/2} √2 σ.    (12.7)

When σ = 0, (x_it − x_hs)′β is the usual form of log odds ratio for a model without random effects. When σ > 0, (x_it − x_hs)′β is the log odds ratio for two observations in the same cluster (h = i) or with the same random effect value.

Suppose that x_it = x_hs for observations from different clusters. Then, using z_{0.25} = 0.674, the middle 50% of the log odds ratios fall within ±0.674√2 σ = ±0.95σ. Hence, the median odds ratio between the observation with higher random effect and the observation with lower random effect equals exp(0.95σ). With a single predictor and x_it − x_hs = 1, the median such odds ratio equals exp(β + 0.95σ). Larsen et al. (2000)
presented related interpretations.

12.2.2 Connections between Conditional Models and Marginal Models

The fixed effects parameters β in GLMMs have conditional interpretations, given the random effect. Those fixed effects are of two types. First, consider an explanatory variable that varies in value among observations in a cluster. For instance, in a crossover study comparing T drugs, for each subject the drug taken varies from observation to observation in that subject's cluster of T observations. For such an explanatory variable, its coefficient in the model refers to the effect on the response of a within-cluster (e.g., subject-specific) 1-unit increase of that predictor. The random effect as well as other explanatory variables in the model are constant while that predictor increases by 1. The effect of that explanatory variable is a "within-cluster" or "within-subject" one.

Second, consider an explanatory variable with constant value among observations in a cluster. An example is gender when each subject forms a cluster. For such an explanatory variable, its coefficient refers to the effect on the response of a "between-cluster" 1-unit increase of that predictor. An example is a comparison of females and males using a dummy variable and its coefficient. However, this fixed effect in the GLMM applies only when the random effect (as well as other explanatory variables in the model) takes the same value in both groups: for instance, a male and a female with the same value for their random effects.

It is in this sense that random effects models are conditional models, as both within- and between-cluster effects apply conditional on the random effect value. By contrast, effects in marginal models are averaged over all clusters (i.e., population averaged), so those effects do not refer to a comparison at a fixed value of a random effect.
In fact, a fundamental difference between the two model types is that when the link function is nonlinear, such as the logit, the population-averaged effects of marginal models often are smaller than the cluster-specific effects of GLMMs. Specifically, the GLMM (12.1) refers to the conditional mean, μ_it = E(Y_it | u_i). By inverting the link function,

E(Y_it | u_i) = g⁻¹(x′_it β + z′_it u_i).

Marginally, averaging over the random effects, the mean is

E(Y_it) = E[E(Y_it | u_i)] = ∫ g⁻¹(x′_it β + z′_it u_i) f(u_i; Σ) du_i,

where f(u; Σ) is the N(0, Σ) density function for the random effects. For the identity link,

E(Y_it) = ∫ (x′_it β + z′_it u_i) f(u_i; Σ) du_i = x′_it β.

The marginal model has the same model form and effects β. This is not true for other links. For instance, for the logistic-normal model (12.5),

E(Y_it) = E{exp(x′_it β + u_i)/[1 + exp(x′_it β + u_i)]}.

This expectation does not have form exp(x′_it β)/[1 + exp(x′_it β)] except when u_i has a degenerate distribution (σ = 0).

Approximate relationships exist between estimates from the two model types. In the logistic-normal case with effect β and small σ, Zeger et al. (1988) showed that

E(Y_it) ≈ exp(c x′_it β)/[1 + exp(c x′_it β)],    (12.8)

where c = [1 + 0.6σ²]^{−1/2}. Since the effect in the marginal model multiplies that of the conditional model by about c, it is typically smaller in absolute value. The discrepancy increases as σ increases. For β near 0, Neuhaus et al. (1991) showed that the marginal model effect is approximately β(1 − ρ), where ρ = corr(Y_it, Y_is) at β = 0. Again, the discrepancy increases as σ increases, since ρ increases with σ.

For Table 12.1 on ratings of the prime minister, the ML estimate for model (12.3) is β̂ = −0.556, with σ̂ = 5.16 for variability of {u_i}. Approximation (12.8) suggests that β̂ = −0.556 with σ̂ = 5.16 corresponds to a marginal estimate of about [1 + 0.6(5.16)²]^{−1/2}(−0.556)
= −0.135. The actual marginal estimate is the log odds ratio for the sample marginal distributions, equaling

log[(880/720)/(944/656)] = −0.163.

In fact, the marginal effect is much smaller than the conditional effect, but this approximation connecting the two estimates works better for smaller σ̂. At β = 0, the fit of the model is that of the symmetry model, for which μ̂_12 = μ̂_21 = (n_12 + n_21)/2. The correlation for that 2 × 2 table equals 0.699, from which the conditional estimate of −0.556 suggests a marginal estimate of −0.556(1 − 0.699) = −0.167, very close to the actual value of −0.163.

Figure 12.1 illustrates why the marginal effect is smaller than the conditional effect.

[FIGURE 12.1 Logistic random-intercept model, showing the conditional (subject-specific) curves and the marginal (population-averaged) curve averaging over these.]

For a single explanatory variable x, the figure shows subject-specific curves for P(Y_it = 1 | u_i) for several subjects when considerable heterogeneity exists. This corresponds to a relatively large σ for random effects. At any fixed value of x, variability occurs in the conditional means, E(Y_it | u_i) = P(Y_it = 1 | u_i). The average of these is the marginal mean, E(Y_it). These averages for various x values yield the superimposed curve. It has a shallower slope. In fact, it does not exactly follow the logistic formula.

Similar remarks apply to other GLMMs. For the probit link with binary data, however, the conditional probit model with normal random effect does imply a marginal model of probit form (Problem 12.29). With univariate random intercept, the marginal effect equals the conditional effect multiplied by [1 + σ²]^{−1/2} (Zeger et al. 1988). In Section 13.5.1 we explore the connection between conditional and marginal effects for loglinear GLMMs.

12.2.3 Comments about Conditional versus Marginal Models

Random effects models describe conditional (subject-specific)
effects, whereas marginal models describe population-averaged effects. Some statisticians prefer one of these types, but most feel that both are useful, depending on the application. The conditional modeling approach is preferable if one wants to specify a mechanism that could generate positive association among clustered observations, estimate cluster-specific effects, estimate their variability, or model the joint distribution. Latent variable constructions used to motivate model forms (e.g., the tolerance motivation for binary models of Section 6.6.1 and the related threshold motivation in Problem 6.28 and utility motivation in Problem 6.29) usually apply more naturally at the cluster level than at the marginal level. Given a conditional model, one can recover information about marginal distributions. That is, a conditional model implies a marginal model, but a marginal model does not itself imply a conditional model (although see Note 12.10 for an implicit connection).

In many surveys or epidemiological studies, a goal is to compare the relative frequency of occurrence of some outcome for different groups in a population. Then, quantities of primary interest include between-group odds ratios among marginal probabilities for the different groups. That is, effects of interest are between-cluster rather than within-cluster. When marginal effects are the main focus, it is usually simpler and may be preferable to model the margins directly. One can then parameterize the model so that regression parameters have a direct marginal interpretation. Developing a more detailed model of the joint distribution that generates those margins, as a random effects model does, provides greater opportunity for misspecification. For instance, with longitudinal data the assumption that observations are independent, given the random effect, need not be realistic.
With the marginal model approach, we showed in Chapter 11 that ML is sometimes possible but that the GEE approach is computationally simpler and more versatile. A drawback of the GEE approach is that it does not explicitly model random effects and therefore does not allow these effects to be estimated. In addition, likelihood-based inferences are not possible because the joint distribution of the responses is not specified.

In Section 12.2.2 it was noted that conditional effects are usually larger than marginal effects, and that the difference between them increases as variance components increase. Usually, though, the significance of an effect (e.g., as measured by the ratio of estimate to standard error) is similar in the two model types. If one effect seems more important than another in a conditional model, the same is usually true with a marginal model. So the choice of the model is usually not crucial to inferential conclusions. This statement requires a caveat, however, since sizes of effects in marginal models depend on the degree of heterogeneity in conditional models. In comparing effects for two groups or two variables that have quite different variance components, relative sizes of effects will differ for marginal and conditional models. From (12.8), with binary data the attenuation from the conditional to the marginal effect will tend to be greater for the group having the larger variance component. For instance, suppose that two groups, one young in age and the other elderly, both show the same conditional effect in a crossover study comparing two drugs. If the elderly group has more heterogeneity on the response, their marginal effect may be smaller than that for the younger group. The marginal effects differ even though the conditional effects are the same, because of the greater variance component for the elderly. In such cases, the conditional effect (appropriately modeled) may have more relevance.
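The conditional-to-marginal attenuation discussed in Section 12.2.2 can be verified numerically for Table 12.1. The sketch below recomputes the conditional estimate β̂ = log(n_21/n_12) and its SE, the actual marginal log odds ratio, and the two approximations (the c multiplier of (12.8), and the β(1 − ρ) rule with ρ taken from the symmetry fit):

```python
import math

# Table 12.1 counts: rows = first survey, columns = second survey
# (approve, disapprove); n12 = approve then disapprove, etc.
n11, n12, n21, n22 = 794, 150, 86, 570
sigma_hat = 5.16  # reported ML estimate of sigma for model (12.3)

# Conditional (subject-specific) effect and its standard error
beta_hat = math.log(n21 / n12)         # log(86/150) = -0.556
se = math.sqrt(1 / n21 + 1 / n12)      # [(1/86) + (1/150)]^(1/2) = 0.135

# Actual marginal effect: log odds ratio of the two marginal distributions
marginal = math.log(((n21 + n22) * (n11 + n21)) / ((n11 + n12) * (n12 + n22)))

# Approximation (12.8): marginal ~ c * beta, with c = [1 + 0.6 sigma^2]^(-1/2)
c = (1 + 0.6 * sigma_hat**2) ** -0.5
approx_c = c * beta_hat                # about -0.135

# Approximation of Neuhaus et al. (1991): marginal ~ beta * (1 - rho),
# with rho the correlation under the symmetry fit (beta = 0)
m = (n12 + n21) / 2                    # symmetry fit: mu12 = mu21 = 118
r1, r2 = n11 + m, m + n22              # margins (rows = columns by symmetry)
rho = (n11 * n22 - m * m) / (r1 * r2)  # 0.699
approx_rho = beta_hat * (1 - rho)      # about -0.167, vs. actual -0.163
```

Both approximations land near the actual marginal estimate of −0.163, with the β(1 − ρ) version closer here despite the large σ̂.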
Finally, with either marginal or conditional models, missing data are a common problem with multivariate responses. Unless data are missing at random, potential bias occurs in ML inference. GEE methods usually require the stronger condition that data are missing completely at random (Section 11.4.5). Thus, modeling missingness or conducting a sensitivity study to discern its potential effects can be an important component of an analysis.

Regardless of the choice of paradigm, it is a challenge for statisticians even to explain to practitioners why marginal and conditional effects differ with a nonlinear link function. Graphics such as Figure 12.1 can help. Neuhaus (1992) and Pendergast et al. (1996) surveyed ways of analyzing clustered binary data, including conditional and marginal models. Agresti and Natarajan (2001) surveyed conditional and marginal modeling of clustered ordinal data.

12.3 EXAMPLES OF RANDOM EFFECTS MODELS FOR BINARY DATA

In the next three sections we present a variety of examples of random effects models. In this section we consider binary responses.

12.3.1 Small-Area Estimation of Binomial Proportions

Small-area estimation refers to estimation of parameters for a large number of geographical areas when each has relatively few observations. For instance, one might want county-specific estimates of characteristics such as the unemployment rate or the proportion of families having health insurance coverage. With a national or statewide survey, some counties may have few observations. Then, sample proportions in the counties may poorly estimate the true countywide proportions. Random effects models that treat each county as a cluster can provide improved estimates. In assuming that the true proportions vary according to some distribution, the fitting process "borrows from the whole": it uses data from all the counties to estimate the proportion in any given one.
Let π_i denote the true proportion in area i, i = 1, …, n. These areas may be all the ones of interest, or only a sample. Let {y_i} denote independent bin(T_i, π_i) variates; that is, y_i = Σ_{t=1}^{T_i} y_it, where {y_it, t = 1, …, T_i} are independent with P(Y_it = 1) = π_i and P(Y_it = 0) = 1 − π_i. The sample proportions {p_i = y_i/T_i} are ML estimates of {π_i} for the fixed-effects model

logit(π_i) = α + β_i,    i = 1, …, n.

This model is saturated, having n nonredundant parameters (with a constraint such as Σ_i β_i = 0) for the n binomial observations.

For small {T_i}, {p_i} have large standard errors. Thus, {p_i} may display much more variability than {π_i}, especially when {π_i} are similar. Then, it is helpful to shrink {p_i} toward their overall mean. One can accomplish this with the random effects model

logit P(Y_it = 1 | u_i) = α + u_i,    (12.9)

where {u_i} are independent N(0, σ²) variates. This model is a logit analog of one-way random effects ANOVA. When σ = 0, all π_i are identical. For this model,

π̂_i = exp(α̂ + û_i)/[1 + exp(α̂ + û_i)].

This estimate differs from the sample proportion p_i. If σ̂ = 0, then û_i = 0. Then, the random effects estimate of each π_i is (Σ_{i=1}^n Σ_{t=1}^{T_i} y_it)/(Σ_i T_i), the overall sample proportion after pooling all n samples. When truly all π_i are equal, this is a much better estimator of that common value than the sample proportion from a single sample.

Generally, the random effects model estimators shrink the separate sample proportions toward the overall sample proportion. The amount of shrinkage decreases as σ̂ increases. The shrinkage also decreases as the {T_i} grow; as each sample has more data, we put more trust in the separate sample proportions. The predicted random effect û_i is the estimated mean of the distribution of u_i, given the data (see Section 12.6.7). This prediction depends on all the data, not just data from area i.
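The shrinkage just described can be sketched numerically. The following is not the ML fit of (12.9) (which integrates over u_i; see Section 12.6) but a rough approximation that holds α̂ and σ̂ fixed at their estimates and uses the posterior mode of u_i, found by Newton's method. Applied to the DC sample in the election example below (T_i = 4, y_i = 4), it shows the heavy shrinkage of the sample proportion p_i = 1.0:

```python
import math

def posterior_mode_estimate(y, T, alpha, sigma, steps=25):
    """Approximate random-effects estimate of pi_i for one area.

    A sketch only: holds alpha and sigma fixed and takes the mode (not the
    mean) of the posterior of u_i, maximizing
        y*(alpha + u) - T*log(1 + e^(alpha + u)) - u^2 / (2 sigma^2)
    by Newton's method. The ML fit of (12.9) integrates over u_i instead.
    """
    u = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + math.exp(-(alpha + u)))
        grad = y - T * p - u / sigma**2             # first derivative
        hess = -T * p * (1.0 - p) - 1.0 / sigma**2  # second derivative (< 0)
        u -= grad / hess
    return 1.0 / (1.0 + math.exp(-(alpha + u)))

# DC in the election example below: T_i = 4, y_i = 4, so p_i = 1.0.
# With alpha = 0.163 and sigma = 0.29 (the reported estimates for model
# (12.9)), the estimate shrinks most of the way toward the pooled level.
pi_dc = posterior_mode_estimate(y=4, T=4, alpha=0.163, sigma=0.29)
pooled = 1.0 / (1.0 + math.exp(-0.163))  # overall level, about 0.54
```

Here pi_dc comes out near 0.58, close to the value 0.576 reported for DC in Table 12.2; because T_i is so small and σ̂ is small, the sample proportion 1.0 is pulled almost all the way to the pooled proportion.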
A benefit is potential reduction in the mean-squared error of the estimates around the true values.

We illustrate model (12.9) with a simulated sample of size 2000 to mimic a poll taken before the 1996 U.S. presidential election. For T_i observations in state i (i = 1, …, 51, where i = 51 is DC = District of Columbia), y_i is bin(T_i, π_i), where π_i is the actual proportion of votes in state i for Bill Clinton in the 1996 election, conditional on voting for Clinton or the Republican candidate, Bob Dole. Here, T_i is proportional to the state's population size, subject to Σ_i T_i = 2000. Table 12.2 shows {T_i}, {π_i}, and {p_i = y_i/T_i}.

For the ML fit of model (12.9), α̂ = 0.163 and σ̂ = 0.29. The predicted random effect values (obtained using PROC NLMIXED in SAS) yield the proportion estimates {π̂_i}, also shown in Table 12.2. Since {T_i} are mostly small and since σ̂ is relatively small, considerable shrinkage of these estimates occurs from the sample proportions toward the overall proportion supporting Clinton, which was 0.548. The {π̂_i} vary only between 0.468 (for TX = Texas) and 0.696 (for NY = New York), whereas the sample proportions vary between 0.111 (for Idaho) and 1.0 (for DC). Sample proportions based on fewer observations, such as DC, tended to shrink more. Although the estimates incorporating random effects are relatively homogeneous, they tend to be closer than the sample proportions to the true values.

TABLE 12.2 Estimates of Proportion of Vote for Clinton, Conditional on Voting for Clinton or Dole in 1996 U.S.
Presidential Election a State Ti i pi  ˆi State Ti i pi  ˆi AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS 5 32 19 34 240 29 25 4 5 108 56 9 22 9 89 44 19 29 33 46 38 9 73 35 41 21 0.394 0.463 0.594 0.512 0.572 0.492 0.604 0.903 0.586 0.532 0.494 0.643 0.557 0.391 0.596 0.468 0.400 0.506 0.566 0.686 0.586 0.627 0.573 0.594 0.535 0.472 0.200 0.500 0.526 0.618 0.538 0.586 0.720 1.000 0.400 0.602 0.554 0.556 0.500 0.111 0.539 0.432 0.316 0.448 0.667 0.739 0.474 0.778 0.589 0.571 0.561 0.333 0.508 0.524 0.537 0.573 0.538 0.558 0.602 0.576 0.527 0.583 0.548 0.543 0.528 0.472 0.540 0.488 0.477 0.506 0.592 0.637 0.511 0.578 0.570 0.554 0.550 0.477 MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VT WA WI WV WY 7 55 5 13 9 60 13 12 137 84 23 24 90 7 28 6 40 144 15 51 4 42 39 14 4 0.483 0.475 0.461 0.395 0.567 0.600 0.540 0.506 0.660 0.536 0.456 0.547 0.552 0.689 0.469 0.479 0.513 0.473 0.380 0.489 0.633 0.572 0.559 0.584 0.426 0.429 0.455 0.600 0.462 0.556 0.667 0.462 0.500 0.752 0.488 0.478 0.625 0.567 0.571 0.571 0.667 0.500 0.444 0.333 0.412 0.500 0.619 0.487 0.571 0.250 0.526 0.494 0.546 0.524 0.543 0.611 0.524 0.533 0.696 0.507 0.520 0.569 0.558 0.545 0.552 0.555 0.522 0.468 0.490 0.473 0.538 0.578 0.517 0.548 0.518 a  i , True; pi , sample;  ˆ i , estimate using random effects model. 12.3.2 Modeling Repeated Binary Responses In Section 12.1.4 we introduced a random effects version of the Rasch model for repeated binary measurement. This model extends to incorporate covariates. We illustrate using Table 10.13, first analyzed in Section 10.7.2. The subjects indicated whether they supported legalizing abortion in each of three situations. Table 10.13 also classified the subjects by gender. Let yit denote the response for subject i on item t, with yit s 1 representing support. Consider the model logit P Ž Yi t s 1 < u i . s u i q t q  x i , Ž 12.10 . 
where x_i = 1 for females and 0 for males, and where {u_i} are independent N(0, σ²). (Equivalently, one could place a constraint on {β_t} and allow an intercept α.) Here, the gender effect γ is assumed the same for each item, and the {β_t} refer to the items.

Since model (12.10) implies nonnegative association among responses on the items, one should use items and scales for which this should occur. For opinions about legalized abortion with scale (yes, no), it would not be appropriate for one question to ask "Do you agree that abortion should be legal when a woman is not married?" and another to ask "Do you agree that abortion should be illegal during the last three months of pregnancy?"

Table 12.3 summarizes ML fitting results. The contrasts of {β̂_t} indicate greater support for legalized abortion with item 1 (when the family has a low income and cannot afford any more children) than with the other two. There is slight evidence of greater support with item 2 (when the woman is not married and does not want to marry the man) than with item 3 (when the woman wants the abortion for any reason). The fixed effects estimates have log odds ratio interpretations. For a given subject of either gender, for instance, the estimated odds of supporting legalized abortion for item 1 equal exp(0.83) = 2.3 times the estimated odds for item 3. Since γ̂ = 0.01, for each item the estimated probability of supporting legalized abortion is similar for females and males with similar random effect values.

For these data, subjects are highly heterogeneous (σ̂ = 8.6). Thus, strong associations exist among responses on the three items. This is reflected by 1595 of the 1850 subjects making the same response on all three items: that is, response patterns (0, 0, 0) and (1, 1, 1). It implies tremendous variability in between-subject odds ratios.
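This between-subject variability can be quantified directly. The difference of two independent N(0, σ²) random effects has standard deviation σ√2, so the middle 50% of between-subject odds ratios spans exp(β̂ ± 0.674·σ̂√2) ≈ exp(β̂ ± 0.95σ̂). A quick illustrative check using the reported estimates (β̂ = 0.83, σ̂ = 8.6):

```python
import numpy as np
from scipy.stats import norm

beta_hat, sigma_hat = 0.83, 8.6          # estimates reported for the fitted GLMM
c = norm.ppf(0.75) * np.sqrt(2.0)        # ~0.954: upper quartile of (u_i - u_i')/sigma
lo, hi = np.exp(beta_hat - c * sigma_hat), np.exp(beta_hat + c * sigma_hat)
```

The resulting interval is enormous (lo is essentially 0, hi in the thousands), reflecting the extreme heterogeneity.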
From (12.7), for different subjects of a given gender, the middle 50% of odds ratios comparing items 1 and 3 are estimated to vary between about exp(0.83 − 0.95 × 8.6) and exp(0.83 + 0.95 × 8.6).

For contingency tables, one can obtain cell fitted values. To do this, one must integrate over the estimated random effects distribution to obtain estimated marginal probabilities of any particular sequence of responses. For the ML parameter estimates, the probability of a particular sequence of responses (y_i1, ..., y_iT) for a given u_i is the appropriate product of conditional probabilities, Π_t P(Y_it = y_it | u_i), since the responses are independent given u_i. Integrating this product probability with respect to u_i for the N(0, σ̂²) distribution estimates the marginal probability for a given cell (averaged over subjects). This requires numerical integration methods described in Section 12.6. Multiplying this marginal probability of a given sequence by the sample size for that multinomial gives a fitted value. Not surprisingly, for these data, the response patterns (0, 0, 0) and (1, 1, 1) also have the largest fitted values for the multinomial for each gender. For instance, for females 440 indicated support under all three circumstances (457 under none of the three), and the fitted value was 436.5 (459.3). Overall chi-squared statistics comparing the 16 observed and fitted counts are G² = 23.2 and X² = 27.8 (df = 9).

TABLE 12.3 Summary of ML Estimates for Random Effects Model (12.10) and ML and GEE Estimates for Corresponding Marginal Model

                          GLMM ML          Marginal Model ML    Marginal Model GEE
Effect       Parameter    Estimate  SE     Estimate  SE         Estimate  SE
Abortion     β1 − β3      0.83      0.16   0.148     0.030      0.149     0.030
             β1 − β2      0.54      0.16   0.098     0.027      0.097     0.028
             β2 − β3      0.29      0.16   0.049     0.027      0.052     0.027
Gender       γ            0.01      0.48   0.005     0.088      0.003     0.088
√var(u_i)    σ            8.6       0.54
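The integration just described can be sketched numerically with Gauss–Hermite quadrature. This is an illustrative implementation, not the book's code: the item parameters below are placeholders on roughly the scale of the reported contrasts (an arbitrary identifiability choice), with σ = 8.6 as estimated:

```python
import numpy as np
from itertools import product
from scipy.special import expit

nodes, wts = np.polynomial.hermite.hermgauss(60)
w = wts / np.sqrt(np.pi)                 # normalized weights sum to 1

def marginal_prob(y_seq, beta, sigma):
    """P(Y_1=y_1,...,Y_T=y_T): average the conditional product over u ~ N(0, sigma^2)."""
    u = np.sqrt(2.0) * sigma * nodes
    prob = np.ones_like(u)
    for beta_t, y_t in zip(beta, y_seq):
        p = expit(beta_t + u)            # P(Y_t = 1 | u) at each node
        prob *= p**y_t * (1 - p)**(1 - y_t)
    return np.sum(w * prob)

beta, sigma = (0.6, 0.3, 0.0), 8.6       # placeholder item effects, estimated sigma
probs = {s: marginal_prob(s, beta, sigma) for s in product((0, 1), repeat=3)}
```

With σ this large, the probability mass concentrates on the all-0 and all-1 patterns, matching the observation that those patterns dominate the fitted counts.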
These are not that large considering the very large sample size and the few parameters (β₁, β₂, β₃, γ, σ) used to describe the 14 multinomial cell probabilities (8 − 1 = 7 for each gender) in Table 10.13. Here, df = 9 since we are modeling 14 multinomial parameters using five GLMM parameters. An extended model allows interaction between gender and item. It has different {β_t} for men and women. However, it does not fit better. The likelihood-ratio statistic equals 1.0 (df = 2) for testing that the extra parameters equal 0.

An alternative analysis of these data focuses on the marginal distributions, treating the dependence as a nuisance. A marginal model analog of (12.10) is

logit P(Y_t = 1) = β_t + γx.

For it, Table 12.3 also shows GEE estimates for the exchangeable working correlation structure and ML estimates. The marginal model fits well, with G² = 1.1; here, df = 2 since the model describes six marginal probabilities (three for each gender) using four parameters. These population-averaged {β̂_t} are much smaller than the subject-specific {β̂_t} from the GLMM. This reflects the very large GLMM heterogeneity (σ̂ = 8.6) and the corresponding strong correlations among the three responses. For instance, the GEE analysis estimates a common correlation of 0.82 between pairs of responses. Although the GLMM {β̂_t} are about five to six times the marginal model {β̂_t}, so are the standard errors. The two approaches provide similar substantive interpretations and conclusions.

12.3.3 Longitudinal Mental Depression Study Revisited

We now revisit Table 11.2 from a longitudinal study to compare a new drug with a standard for treating subjects suffering mental depression. In Section 11.2.1 we analyzed the data using marginal models. The response y_t for measurement t on mental depression equals 1 for normal and 0 for abnormal.
For severity of initial diagnosis s (1 = severe, 0 = mild), drug treatment d (1 = new, 0 = standard), and time of measurement t, we used the model

logit P(Y_t = 1) = α + β₁s + β₂d + β₃t + β₄dt

to evaluate the marginal distributions.

TABLE 12.4 Model Parameter Estimates for Marginal and Conditional Logit Models Fitted to Table 11.2

                ML Marginal          GEE Marginal         Random Effects ML
Parameter       Estimate  Std.Err   Estimate  Std.Err    Estimate  Std.Err
Diagnosis       −1.29     0.14      −1.31     0.15       −1.32     0.15
Drug            −0.06     0.22      −0.06     0.23       −0.06     0.22
Time             0.48     0.12       0.48     0.12        0.48     0.12
Drug × Time      1.01     0.18       1.02     0.19        1.02     0.19

Now let y_it denote observation t for subject i. The model

logit P(Y_it = 1 | u_i) = α + β₁s + β₂d + β₃t + β₄dt + u_i

has subject-specific rather than population-averaged effects. Table 12.4 shows the ML estimates. The time trend estimates are β̂₃ = 0.48 for the standard drug and β̂₃ + β̂₄ = 1.50 for the new one. These are nearly identical to the ML and GEE estimates for the corresponding marginal model, also shown in the table (these are discussed in Sections 11.2.1 and 11.3.2). The reason is that the repeated observations do not exhibit much correlation, as the GEE analysis observed. Here, this is reflected by σ̂ = 0.07, showing little heterogeneity among subjects.

Based on the model fit, integrating over the N(0, 0.07²) random effects distribution yields marginal fitted values of the possible response sequences. Comparing these to the sample counts in Table 11.2 indicates a relatively good fit. The model describes the 28 multinomial cell probabilities (seven for the trivariate response at each of the four severity × drug combinations) using six parameters. The usual fit statistics comparing the observed cell counts to their fitted values are G² = 22.0 and X² = 20.8 (df = 28 − 6 = 22). The deviance increases by only 0.001 when one assumes that σ = 0.
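The test of H₀: σ = 0 places the parameter on the boundary of its space, so the deviance difference is asymptotically a 50:50 mixture of a point mass at 0 and χ₁² (see Section 12.6.6). A minimal helper illustrating the halved P-value:

```python
from scipy.stats import chi2

def boundary_pvalue(deviance_diff):
    """P-value for H0: sigma = 0 against H1: sigma > 0.
    The null distribution is an equal mixture of 0 and chi-squared(1),
    so the chi-squared(1) tail probability is halved."""
    return 0.5 * chi2.sf(deviance_diff, df=1)
```

For example, `boundary_pvalue(0.001)` reproduces the P ≈ 0.49 reported for the depression data.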
From results to be discussed in Section 12.6.6, the P-value for comparing the models is half what one gets by treating the deviance difference as chi-squared with df = 1, or P = 0.49. This simpler model, which gives nearly identical effect estimates and SE values, is adequate. This is also suggested by AIC values (e.g., PROC NLMIXED in SAS reports 1173.9 for the random effects model and 1171.9 for the simpler model with σ = 0).

12.3.4 Modeling Heterogeneity among Multicenter Clinical Trials

Many applications compare two groups on a response for data stratified on a third variable. With binary outcomes, the data form several 2 × 2 contingency tables. The main focus relates to studying the association in the 2 × 2 tables and whether and how it varies among the strata.

The strata are sometimes themselves a sample, such as schools or medical clinics. A random effects approach is then natural. With a random sampling of strata, it enables inferences to extend to the population of strata. The fit of the random effects model provides a simple summary, such as an estimated mean and standard deviation of log odds ratios for the population of strata. In each stratum it also provides a predicted log odds ratio that shrinks the sample value toward the mean. This is especially useful when the sample size in a stratum is small and the ordinary sample odds ratio has a large standard error. Even when the strata are not a random sample, or not even a sample, and a random effects approach is not as natural, the model is beneficial for these purposes.

We illustrate using Table 12.5, previously analyzed in Section 6.3, showing the results of a clinical trial at eight centers. The purpose was to compare an active drug and a control for curing an infection. For a subject in center i using treatment t (1 = active drug; 2 = control), let y_it = 1 denote success. One possible model is the logistic-normal,

logit P(Y_i1 = 1 | u_i) = α + β/2 + u_i,
logit P(Y_i2 = 1 | u_i) = α − β/2 + u_i,    (12.11)

where {u_i} are independent N(0, σ²) variates. This model assumes that the log odds ratio β between treatment and response is constant over centers. The parameter σ summarizes center heterogeneity in the success probabilities.

TABLE 12.5 Clinical Trial Relating Treatment to Response for Eight Centers

                         Response
Center  Treatment   Success  Failure   Sample Odds Ratio   Fitted Odds Ratio
1       Drug        11       25        1.19                2.02
        Control     10       27
2       Drug        16        4        1.82                2.09
        Control     22       10
3       Drug        14        5        4.80                2.19
        Control      7       12
4       Drug         2       14        2.29                2.11
        Control      1       16
5       Drug         6       11        ∞                   2.18
        Control      0       12
6       Drug         1       10        ∞                   2.12
        Control      0       10
7       Drug         1        4        2.0                 2.11
        Control      1        8
8       Drug         4        2        0.33                2.06
        Control      6        1

Source: Beitler and Landis (1985).

A logistic-normal model permitting treatment-by-center interaction is

logit P(Y_i1 = 1 | u_i, b_i) = α + (β + b_i)/2 + u_i,
logit P(Y_i2 = 1 | u_i, b_i) = α − (β + b_i)/2 + u_i,    (12.12)

where {u_i} are independent N(0, σ_a²), {b_i} are independent N(0, σ_b²), and {u_i} are independent of {b_i}. The log odds ratio equals β + b_i in center i. These vary among centers according to a N(β, σ_b²) distribution. That is, β is the expected center-specific log odds ratio between treatment and response, and σ_b describes variability in those log odds ratios. The model parameters are (α, β, σ_a, σ_b).

In Table 12.5 the sample success rates vary markedly among centers, both for the control and drug treatments, but in all except the last center that rate is higher for the drug treatment. In using models with random center and possibly random treatment effects, it is preferable to have more than eight centers. It is difficult to get reliable variance component estimates with so few centers. Keeping this in mind, we use these data to illustrate the models.
With a large number of centers it would also be sensible to allow correlation between b_i and u_i, but we shall not attempt that here.

The treatment estimates are β̂ = 0.739 (SE = 0.300) for the model (12.11) of no interaction and β̂ = 0.746 (SE = 0.325) for the model (12.12) permitting interaction. There is considerable evidence of a drug effect. With such a small sample, however, it is unclear whether that effect is weak or moderate. The evidence of association is weaker for the model permitting interaction. The Wald statistics are (0.739/0.300)² = 6.0 for the no-interaction model and (0.746/0.325)² = 5.3 for the interaction model. The corresponding likelihood-ratio statistics are 6.3 and 4.6 (df = 1).

The extra variance component in the interaction model pertains to variability in the log odds ratios. As its estimate σ̂_b increases, the standard error of the estimated treatment effect β̂ tends to increase. In this example, σ̂_b = 0.15 is relatively small, and the standard errors of β̂ are not very different in the two models. When σ̂_b = 0, the standard errors and the model fits are the same.

To show the effect of larger σ̂_b on the standard error of the mean treatment effect estimate β̂, we alter Table 12.5 slightly. We change three failures to successes for drug in center 3 and three successes to failures for drug in center 8. With these changes, the estimated variability of the treatment effects increases from σ̂_b = 0.15 to σ̂_b = 1.4. The ML estimates of the mean treatment effects are then β̂ = 0.722 (SE = 0.299) for the no-interaction model (12.11) and β̂ = 0.767 (SE = 0.623) for the interaction model. The Wald statistics are 5.8 and 1.5. The evidence of a treatment effect is then dramatically weaker for the interaction model (12.12). Not surprisingly, when the treatment effect varies substantially among centers, it is more difficult to estimate the mean of that effect.
For the actual data in Table 12.5, because σ̂_b = 0.15 for model (12.12) is relatively small, the model shrinks the sample odds ratios considerably. Table 12.5 shows the sample values and the model predicted values. These are based on predicting the random effects (to be explained in Section 12.6) and substituting them and the ML estimates of the fixed effects into the model formula to estimate the two response probabilities for each treatment in each center. The sample odds ratios vary from 0.33 to ∞; their random effects model counterparts (computed with PROC NLMIXED in SAS) vary only between 2.0 and 2.2. The smoothed estimates are much less variable and do not have the same ordering as the sample values. For instance, the smoothed estimate of 2.2 for center 3 is greater than the estimate of 2.1 for center 6, even though the sample value is infinite for the latter. This partly reflects the greater shrinkage that occurs when sample sizes are smaller. When σ̂_b = 0, model (12.12) provides the same fit as model (12.11), and estimated odds ratios are identical in each center. For related analyses permitting heterogeneity in odds ratios with several 2 × 2 tables, see Liu and Pierce (1993) and Skene and Wakefield (1990).

12.3.5 Alternative Formulations of Random Effects Models

There are other ways to express the models. For instance, an equivalent expression for interaction model (12.12) is

logit P(Y_it = 1 | u_i, b_it) = α + βx_t + b_it + u_i,

where x_t is a treatment dummy variable (x₁ = 1, x₂ = 0), {u_i} are independent N(0, σ_a²), and {b_i1} and {b_i2} are independent N(0, σ²). Here, b_i1 − b_i2 corresponds to b_i in parameterization (12.12), and 2σ² corresponds to σ_b².

Formulating a random effects model requires care about the implications of the model expression and the random effects correlation structure. Suppose that one expressed the interaction model (12.12) as

logit P(Y_it = 1 | u_i, b_i) = α + (β + b_i)x_t + u_i,    (12.13)

with {b_i} from N(0, σ_b²).
This is inappropriate, since the model then imposes greater variability for the logit with the first treatment than the second, since x₂ = 0 and {u_i} and {b_i} are uncorrelated. Also, the model should not depend on the definition of the dummy variable x_t. Note, however, that if z_t = x_t + c for some constant c, then model (12.13) is equivalently

logit P(Y_it = 1 | u_i, b_i) = α + (β + b_i)(z_t − c) + u_i
                             = α′ + (β + b_i)z_t + v_i,

where α′ = α − cβ and v_i = u_i − cb_i. Thus, (v_i, b_i) are correlated even if (u_i, b_i) are not. In fact, expression (12.13) is sensible only with correlated random effects. It is then equivalent to (12.12) with correlated random effects. See Agresti and Hartzel (2000) for further discussion.

12.3.6 Capture–Recapture Modeling to Predict Population Size

Capture–recapture experiments use a series of samples to estimate the size of a population. Such methods have traditionally been used to estimate animal abundance in some habitat. At each sampling occasion, animals are captured and marked in some manner. The animals captured for any given sample are freed, and all animals are candidates for recapture in a later sample. With T sampling occasions, a 2^T contingency table displays the data, with scale (captured, not captured) at each occasion. The count n_{22⋯2} is missing for the cell corresponding to noncapture at every occasion. If we knew this cell count, adding it to the others would yield the population size.

Models specified for this 2^T table use the 2^T − 1 observed counts to fit the model. The fit refers to those 2^T − 1 cells, but extrapolating it yields an estimated count in the unobserved cell. Adding that to the total of the 2^T − 1 observed counts yields an estimate of population size. To illustrate, suppose that T = 2.
We observe n₁₁ animals at both occasions, n₁₂ at the first but not the second occasion, and n₂₁ at the second but not the first. We do not know the number n₂₂ not captured either time. If we assumed independence in the 2 × 2 table, the prediction n̂₂₂ would be the value giving an odds ratio of 1.0; but (n₁₁n̂₂₂)/(n₁₂n₂₁) = 1 implies that n̂₂₂ = n₁₂n₂₁/n₁₁. This yields a population size prediction (Sekar and Deming 1949) of

N̂ = n₁₁ + n₁₂ + n₂₁ + n₁₂n₂₁/n₁₁ = n₁₊n₊₁/n₁₁

with

var̂(N̂) = n₁₊n₊₁n₁₂n₂₁ / n₁₁³.

The assumption of independence is usually unrealistic, however. With additional sampling occasions, one can try more complex models. Table 12.6, analyzed by Cormack (1989) and others, refers to a study having T = 6 consecutive trapping days for a population of snowshoe hares. The study observed 68 hares. For instance, Table 12.6 indicates that 3 hares were observed on the first day but on none of the other days.

For simplicity, models for studies over a brief time period assume that no deaths, births, or immigration into the population occurred during the study period. This is called a closed population. Most methods for capture–recapture treat the probability of capture at a given occasion as identical for each subject (e.g., animal). This is usually unrealistic.

TABLE 12.6 Results of Capture–Recapture of Snowshoe Hares

Capture         Capture 3, Capture 2, Capture 1
6 5 4      000        001      010      011      100      101      110      111
0 0 0      — (24.0)   3 (4.8)  4 (3.9)  1 (1.3)  4 (6.8)  4 (2.3)  2 (1.9)  1 (1.0)
0 0 1      3 (2.3)    2 (0.8)  2 (0.6)  0 (0.3)  1 (1.1)  0 (0.6)  0 (0.5)  1 (0.4)
0 1 0      6 (5.4)    3 (1.8)  3 (1.5)  0 (0.8)  1 (2.6)  3 (1.3)  1 (1.1)  1 (0.9)
0 1 1      0 (0.9)    0 (0.5)  1 (0.4)  0 (0.3)  1 (0.6)  0 (0.5)  0 (0.4)  0 (0.5)
1 0 0      5 (3.2)    0 (1.1)  0 (0.9)  0 (0.5)  2 (1.5)  1 (0.8)  1 (0.7)  0 (0.5)
1 0 1      1 (0.5)    1 (0.3)  1 (0.2)  0 (0.2)  0 (0.4)  0 (0.3)  0 (0.3)  0 (0.3)
1 1 0      0 (1.2)    0 (0.6)  0 (0.5)  0 (0.4)  2 (0.9)  2 (0.7)  1 (0.6)  1 (0.7)
1 1 1      0 (0.3)    0 (0.3)  0 (0.2)  0 (0.3)  0 (0.4)  0 (0.4)  0 (0.4)  2 (0.7)

Values in parentheses represent the fit of the logistic-normal model.
Source: A. Agresti, Biometrics 50: 494–500 (1994).

One way to allow heterogeneous capture probabilities uses a logit model having subject random effects. For subject i, i = 1, ..., N with N unknown, let y_i′ = (y_i1, ..., y_iT), where y_it = 1 denotes capture in sample t and y_it = 0 denotes noncapture. Lacking explanatory variables, one might use the Rasch-type model

logit P(Y_it = 1 | u_i) = u_i + β_t,

where {u_i} are independent N(0, σ²). The larger the value of β_t, the greater the capture probability at occasion t. The larger σ is, the more heterogeneous are the capture probabilities. When σ = 0 this logistic-normal model simplifies to mutual independence [i.e., loglinear model (8.6)] for the 2^T table.

As with other random effects models, integrating the random effect from the probability mass function of (y_i | u_i) yields the likelihood function (as discussed in Section 12.6). One can consider this likelihood function and the resulting ML estimates of {β_t} and σ for all possible counts in the unobserved cell. A profile likelihood function views the maximized likelihood as a function of the unobserved cell count. The ML prediction for that unobserved cell count is the value that maximizes this profile likelihood. Lacking specialized software, one can fit the random effects model repeatedly with various counts in the unobserved cell to determine by trial and error the count that maximizes the likelihood function.
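For the two-occasion (T = 2) case described earlier, the Sekar–Deming prediction and its variance estimate are simple to compute directly. A small sketch with made-up counts:

```python
def petersen(n11, n12, n21):
    """Two-sample capture-recapture estimate of population size,
    assuming independent capture occasions (Sekar and Deming 1949)."""
    n1p = n11 + n12              # total captured in sample 1
    np1 = n11 + n21              # total captured in sample 2
    N_hat = n1p * np1 / n11      # equals n11 + n12 + n21 + n12*n21/n11
    var_hat = n1p * np1 * n12 * n21 / n11**3
    return N_hat, var_hat

N_hat, var_hat = petersen(10, 20, 30)   # hypothetical counts
```

Here the predicted unobserved count is n₁₂n₂₁/n₁₁ = 60, giving N̂ = 120.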
ML fitting of this model to Table 12.6 yields a prediction of 24 for the unobserved cell count. Since the study observed 68 hares, the population size estimate is N̂ = 92. For this fit, σ̂ = 1.0. Methods for obtaining a confidence interval for N include using the profile likelihood function or a nonparametric bootstrap method. With the profile likelihood approach, the interval for the missing cell count consists of the possible counts for that cell such that the G² fit statistic increases by less than the chi-squared quantile χ₁²(α) from its value at the ML estimate. Adding the number of subjects observed in the samples to the endpoints of this interval gives the corresponding interval for N. For the snowshoe hares, a 95% profile-likelihood confidence interval for N is (75, 154). It is common for N̂ to be nearer the low end of the interval. See Coull and Agresti (1999) for details.

The greater the heterogeneity, as reflected by larger σ̂, the larger N̂ tends to be and the wider the confidence interval tends to be. Large σ̂ causes difficulties in estimation, since it results in a relatively flat likelihood surface. This implies imprecise estimates of N. In particular, the upper limit of the profile-likelihood confidence interval for N is essentially infinite when the likelihood function gets sufficiently flat. Also, the ML estimator is then often unstable, with small changes in the data yielding large changes in N̂. Difficulties can also arise when probabilities of capture are small. Evidence of this occurs when most subjects captured appear in only one sample. When this happens, or when σ̂ is large, it is unrealistic to expect narrow confidence intervals for N. Alternative models are discussed in Section 13.1.3. Models that ignore likely heterogeneity can give unrealistically narrow confidence intervals for N.

Although traditionally used for animal populations, capture–recapture applications also include estimating population size for human populations, such as estimating the population prevalence of injecting drug use and HIV infection. Darroch et al. (1993) considered census population estimation, and Chao et al. (2001) estimated the number of people infected during a hepatitis outbreak (Problem 12.21).
An interesting application is estimating the number of files on the World Wide Web relating to some subject by taking samples using several search engines (Fienberg et al. 1999).

12.4 RANDOM EFFECTS MODELS FOR MULTINOMIAL DATA

Random effects models for binary responses extend to multicategory responses. For the multicategory models of Chapter 7, a multinomial observation with I categories is a vector of I − 1 indicators, the jth of which is 1 when the observation falls in category j and 0 otherwise. In Section 7.1.5 we defined a multivariate GLM by applying a vector of link functions to this multivariate response. Adding random effects extends this multivariate GLM and the GLMM (12.1) to a multivariate GLMM (Hartzel et al. 2001b; Tutz and Hennevogl 1996). This class includes models for nominal and ordinal responses.

12.4.1 Cumulative Logit Model with Random Intercept

Modeling is simpler with ordinal than nominal responses, since often the same random effect and the same fixed effect can apply to each logit. With cumulative logits, this is the proportional odds structure (Section 7.2.2). Denote the possible outcomes for y_it, observation t in cluster i, by 1, 2, ..., I. A GLMM for the cumulative logits has the form

logit P(Y_it ≤ j | u_i) = α_j + x′_it β + z′_it u_i,   j = 1, ..., I − 1.    (12.14)

Hedeker and Gibbons (1994) discussed model fitting, primarily with u_i as multivariate normal. For cumulative logit and probit random intercept models, the same relationship exists between their effects and those in marginal models as presented in Section 12.2.2 for binary-response models. Marginal effects tend to be smaller, increasingly so as σ increases. Also, the same predictor structure as in (12.14) holds with other links for which a common effect for each logit is plausible. For instance, Hartzel et al. (2001a, b) used it with adjacent-categories logits.
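To make the proportional-odds structure in (12.14) concrete, the following sketch (with hypothetical cutpoints and linear predictor) converts the cumulative logits into category probabilities for a fixed value of η = x′β + z′u:

```python
import numpy as np
from scipy.special import expit

def category_probs(alphas, eta):
    """P(Y = j | u) under logit P(Y <= j | u) = alpha_j + eta,
    with alphas increasing (proportional-odds structure)."""
    cum = expit(np.asarray(alphas, dtype=float) + eta)   # P(Y<=1), ..., P(Y<=I-1)
    cum = np.concatenate(([0.0], cum, [1.0]))            # pad with P(Y<=0)=0, P(Y<=I)=1
    return np.diff(cum)                                  # successive differences

probs = category_probs([-1.0, 0.5], eta=0.3)   # hypothetical values, I = 3 categories
```

Because the same η shifts every cutpoint, the whole cumulative distribution moves together, which is what lets one random intercept and one set of fixed effects serve all I − 1 logits.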
12.4.2 Insomnia Study Revisited

Table 11.4 showed results of a clinical trial at two occasions comparing a drug with placebo in treating insomnia patients. In Sections 11.2.3 and 11.3.3 the data were analyzed with marginal models. For y_t = time to fall asleep at occasion t, the marginal model

logit P(Y_t ≤ j) = α_j + β₁t + β₂x + β₃tx

permitted interaction between t = occasion (0 = initial, 1 = follow-up) and x = treatment (1 = active, 0 = placebo). Table 12.7 shows the ML and GEE estimates. Now let y_it denote the response for subject i at occasion t. Table 12.7 also shows results of fitting the random-intercept model

logit P(Y_it ≤ j | u_i) = u_i + α_j + β₁t + β₂x + β₃tx.

TABLE 12.7 Fits of Cumulative Logit Models to Table 11.4 (values in parentheses are standard errors)

Effect                   Marginal ML     Marginal GEE    Random Effects (GLMM) ML
Treatment                0.046 (0.236)   0.034 (0.238)   0.058 (0.366)
Occasion                 1.074 (0.162)   1.038 (0.168)   1.602 (0.283)
Treatment × occasion     0.662 (0.244)   0.708 (0.244)   1.081 (0.380)

Results are substantively similar to those for the marginal model, but the GLMM estimates and standard errors are about 50% larger. This reflects the relatively large heterogeneity (σ̂ = 1.90) and the resultant strong association between the responses at the two occasions.

12.4.3 Cluster Sampling

With surveys that use cluster sampling, standard methods based on simple random sampling (e.g., for a single multinomial sample) require adjustment. Ordinary standard errors are too small. The usual chi-squared test statistics no longer have chi-squared null distributions, but rather, those of weighted sums of chi-squared variates. Rao and Thomas (1988) surveyed ways of adjusting standard inferences to take complex sampling methods into account in the analysis and modeling of categorical data. When the sampling scheme randomly samples clusters, one can account for the clustering using cluster random effects.
We illustrate using data from Brier (1980), who reported 96 observations taken from 20 neighborhoods (the clusters) on Y = satisfaction with home and X = satisfaction with neighborhood as a whole. Each variable was measured with the ordinal scale (unsatisfied, satisfied, very satisfied). Brier's analysis adjusted for clustering by reducing the Pearson statistic for testing independence in the 3 × 3 contingency table relating X and Y from 17.9 to 15.7 (df = 4). Consider the model for y_it, observation t in cluster i,

logit P(Y_it ≤ j | u_i) = u_i + α_j + x_it β,    (12.15)

with scores (1, 2, 3) for the satisfaction levels of x_it. With a N(0, σ²) distribution assumed for u_i, the ML effect estimate is β̂ = −1.201 (SE = 0.407), with σ̂ = 0.92. By contrast, treating the 96 observations as a random sample corresponds to fitting this model with σ = 0. It has β̂ = −1.226 (SE = 0.370). A slight reduction in significance results from adjusting for clustering.

12.4.4 Baseline-Category Logit Models with Random Effects

For nominal response variables, one can formulate a binary model that pairs each category with a baseline and fit these models simultaneously while allowing separate effects. This requires using a vector of cluster-specific random effects u_ij, one for each logit. The general form of the baseline-category logit model with random effects is

log[P(Y_it = j)/P(Y_it = I)] = α_j + x′_it β_j + z′_it u_ij,   j = 1, ..., I − 1.

The fixed effects β_j and the random effects u_ij depend on j, since the baseline category is arbitrary. With nominal responses there is no reason to expect effects to be similar for different j. Cluster i has a vector u′_i = (u′_i1, ..., u′_i,I−1) of random effects. The usual approach treats {u_i} as independent multivariate normal variates. We recommend an unspecified covariance matrix Σ for u_i.
For instance, it is sensible to allow different variances for random effects that apply to different logits. With a common variance, that variance would not be the same as that for the implied random effect for a logit for an arbitrary pair of categories, log[P(Y_it = j)/P(Y_it = k)]. With an unspecified covariance matrix, the model is structurally the same regardless of the choice of baseline category. See Hartzel et al. (2001b) for an example.

12.5 MULTIVARIATE RANDOM EFFECTS MODELS FOR BINARY DATA

In practice, random effects are often univariate, taking the form of random intercepts. However, we've seen that nominal responses require multivariate random effects and that bivariate random effects are helpful for describing heterogeneity in multicenter clinical trials. In this section we present other examples in which multivariate random effects are natural.

12.5.1 Matched Pairs with a Bivariate Binary Response

Leo Goodman analyzed Table 12.8 in several articles (e.g., Goodman 1974). A sample of schoolboys were interviewed twice, several months apart, and asked about their self-perceived membership in the "leading crowd" and about whether they sometimes needed to go against their principles to belong to that group. Thus, there are two binary response variables, which we refer to as membership and attitude, measured at two interview times for each subject. Table 12.8 labels the categories for attitude as (positive, negative), where "positive" refers to disagreeing with the statement that one must go against his principles.

TABLE 12.8 Membership and Attitude Toward the "Leading Crowd"

(M, A) for          (M, A) for Second Interview
First Interview     (Yes, Positive)  (Yes, Negative)  (No, Positive)  (No, Negative)
Yes, positive       458              140              110             49
Yes, negative       171              182              56              87
No, positive        184              75               531             281
No, negative        85               97               338             554

M, membership; A, attitude.
Source: J. S. Coleman, Introduction to Mathematical Sociology (London: Free Press of Glencoe, 1964), p. 170.
For subject i, let y_itv be the response at interview time t on variable v, where v = M for membership and v = A for attitude. The logit model

logit P(Y_itv = 1 | u_iv) = β_tv + u_iv    (12.16)

is a multivariate form of the Rasch-type model (12.4). It has additive item and subject effects for each variable v. Here, (u_iM, u_iA) is a bivariate random effect that describes subject heterogeneity for (membership, attitude). We assume that the {(u_iM, u_iA)} are independent from a bivariate normal distribution, N(0, Σ), with possibly different variances and nonzero correlation.

The ML fit yields β̂_2M − β̂_1M = 0.379 (SE = 0.075) and β̂_2A − β̂_1A = 0.176 (SE = 0.058). For both variables, the probability of the first outcome category is higher at the second interview. For instance, for a given subject the odds of self-perceived membership in the leading crowd at interview 2 are estimated to be exp(0.379) = 1.46 times the odds at interview 1.

The estimated correlation between the random effects is 0.30. Their estimated standard deviations are σ̂_1 = 3.1 for {u_iM} and σ̂_2 = 1.5 for {u_iA}. Since these are quite different, the relative sizes of membership and attitude effects differ for marginal and conditional models (recall the caveat in Section 12.2.3). The marginal effect is attenuated more for membership. For this conditional model, the ratio of estimated odds ratios is exp(0.379)/exp(0.176) = 1.46/1.19 = 1.22. For the marginal model, the estimated odds ratios use the marginal distributions of each variable at each time [e.g., this is (1392/2006)/(1253/2145) = 1.188 for membership], and the ratio of estimated odds ratios is 1.188/1.133 = 1.05.

Integrating over the estimated random effects distribution yields fitted values for the 16 possible sequences of responses in Table 12.8. The deviance of G² = 5.5 (df = 8) compares the 16 observed counts to their fitted values.
The model, which describes 15 multinomial probabilities with seven parameters, fits well. The model constraining the random effects to be uncorrelated fits poorly (G² = 97.5, df = 9). The model constraining the random effects to be perfectly correlated is equivalent to having a single random effect u_i for each subject. The model is then a Rasch-type model with four items that are the combinations of interviews and variables. That model fits very poorly (G² = 655.5, df = 10). Agresti et al. (2000) gave further details.

12.5.2 Continuation-Ratio Logits for Clustered Ordinal Outcomes: Toxicity Study

For continuation-ratio logit models with ordinal responses, the logits refer to independent binomial variates (Section 7.4.3). Thus, binary logit random effects models apply to clustered ordinal responses using continuation-ratio logits (Ten Have and Uttal 1994). For observation t in cluster i, let ω_ij = P(Y_it = j | Y_it ≥ j, u_ij). (More generally, this probability could also depend on t, but this generality is not needed for the example below.) The continuation-ratio logits are {logit(ω_ij), j = 1, ..., I − 1}.

Let n_ij be the number of subjects in cluster i making response j. Let n_i = Σ_{j=1}^{I} n_ij. For a given cluster in a continuation-ratio logit model, treating (n_i1, ..., n_i,I−1) as multinomial is equivalent to treating them as a sequential set of independent binomial variates, where n_ij is bin(n_i − Σ_{h<j} n_ih, ω_ij), j = 1, ..., I − 1.

We illustrate with a developmental toxicity study conducted under the U.S. National Toxicology Program. This study examined the developmental effects of ethylene glycol (EG) by administering one of four dosages (0, 0.75, 1.50, 3.00 g/kg) to pregnant rodents. The four dose groups had (25, 24, 22, 23) pregnant rodents. The clusters are litters of mice. The three possible outcomes (dead/resorption, malformation, normal)
for each fetus are ordered, normal being the most desirable result. Table 12.9 shows the data. The continuation-ratio logit is natural here since categories are hierarchically related; an animal must survive before a malformation can take place. The following analyses are from Coull and Agresti (2000).

For litter i in dose group d, let logit(ω_i(d)1) be the continuation-ratio logit for the probability of death and logit(ω_i(d)2) the continuation-ratio logit for the conditional probability of malformation, given survival. [The notation i(d) represents litter i nested within dose d.] Let x_d be the dosage for group d. We account for the litter effect using litter-specific random effects u_i(d) = (u_i(d)1, u_i(d)2) sampled from N(0, Σ_d). This bivariate random effect allows for differing amounts of overdispersion for the probability of death and for the probability of malformation, given survival. A model also permitting different fixed effects for each is

logit(ω_i(d)j) = u_i(d)j + α_j + β_j x_d.    (12.17)

TABLE 12.9 Response Counts for 94 Litters of Mice on (Number Dead, Number Malformed, Number Normal)

Dose = 0.00 g/kg: (1,0,7), (0,0,14), (0,0,13), (0,0,10), (0,1,15), (1,0,14), (1,0,10), (0,0,12), (0,0,11), (0,0,8), (1,0,6), (0,0,15), (0,0,12), (0,0,12), (0,0,13), (0,0,10), (0,0,10), (1,0,11), (0,0,12), (0,0,13), (1,0,14), (0,0,13), (0,0,13), (1,0,14), (0,0,14)

Dose = 0.75 g/kg: (0,3,7), (1,3,11), (0,2,9), (0,0,12), (0,1,11), (0,3,10), (0,0,15), (0,0,11), (2,0,8), (0,1,10), (0,0,10), (0,1,13), (0,1,9), (0,0,14), (1,1,11), (0,1,9), (0,1,10), (0,0,15), (0,0,15), (0,3,10), (0,2,5), (0,1,11), (0,1,6), (1,1,8)

Dose = 1.50 g/kg: (0,8,2), (0,6,5), (0,5,7), (0,11,2), (1,6,3), (0,7,6), (0,0,1), (0,3,8), (0,8,3), (0,2,12), (0,1,12), (0,10,5), (0,5,6), (0,1,11), (0,3,10), (0,0,13), (0,6,1), (0,2,6), (0,1,2), (0,0,7), (0,4,6), (0,0,12)
Dose = 3.00 g/kg: (0,4,3), (1,9,1), (0,4,8), (1,11,0), (0,7,3), (0,9,1), (0,3,1), (0,7,0), (0,1,3), (0,12,0), (2,12,0), (0,11,3), (0,5,6), (0,4,8), (0,5,7), (2,3,9), (0,9,1), (0,0,9), (0,5,4), (0,2,5), (1,3,9), (0,2,5), (0,1,11)

Source: Study described by C. J. Price, C. A. Kimmel, R. W. Tyl, and M. C. Marr, Toxicol. Appl. Pharmacol. 81: 113–127 (1985).

TABLE 12.10 Comparisons of Log Likelihoods for Multivariate Random Effects Models for Developmental Toxicity Study

Model                  Number of    Change in     Change in
                       Parameters   Parameters    Log Likelihood
Dose-specific Σ_d          16           —              —
Common α, β                14           2            28.4
Common Σ                    7           9             7.4
Common Σ, ρ = 0             6          10             7.4
Univariate σ²               5          11            16.7

Table 12.10 reports the change in the maximized log likelihood from fitting four special cases of this model:

1. Common intercept and slope for the two logits: α_1 = α_2 and β_1 = β_2
2. Common covariance matrix for the four doses: Σ_1 = Σ_2 = Σ_3 = Σ_4
3. Common covariance matrix and uncorrelated random effects
4. Univariate common variance component across dose: u_i(d)1 = u_i(d)2 and σ_d = σ

Tests of the first three special cases against the general model (12.17) can use ordinary likelihood-ratio tests. Little seems to be lost by using the simpler model having uncorrelated random effects with homogeneous covariance structure (i.e., the fourth model listed in Table 12.10), as the likelihood-ratio statistic comparing this to model (12.17) equals 2(7.4) = 14.8 (df = 10). The model provides a separate univariate logistic-normal model for each conditional binomial outcome, specifying that the proportion of dead pups and the proportion of malformed pups (given survival) are independent, both within litter and marginally.

The univariate model in Table 12.10 is the special case of the third model listed in which the variances are common for the two logits and the random effects are perfectly correlated. Hence, it reduces to a univariate random effects model.
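The likelihood-ratio comparison just described is easy to reproduce from Table 12.10. A short Python sketch using SciPy (illustrative only; the statistics are taken directly from the table):

```python
from scipy.stats import chi2

# LR statistic comparing the uncorrelated, homogeneous-covariance model
# (6 parameters) to the general model (12.17) (16 parameters): twice the
# change in maximized log likelihood reported in Table 12.10.
lr_stat = 2 * 7.4          # = 14.8
df = 16 - 6                # = 10 parameters dropped
p_value = chi2.sf(lr_stat, df)

print(lr_stat, df, round(p_value, 3))   # p is about 0.14
```

The P-value of about 0.14 is consistent with the text's conclusion that little is lost by adopting the simpler model. (All restrictions tested here are interior points of the parameter space, so the ordinary chi-squared reference distribution applies; boundary cases are treated in Section 12.6.6.)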
Comparing the univariate model to a multivariate counterpart involves testing that correlation parameters fall on the boundary. Ordinary chi-squared asymptotic theory for likelihood-ratio tests applies only when the parameter falls in the interior of the parameter space. Tests when a null model has a correlation of 1 or a variance component of 0 are complex and beyond our scope here (see Section 12.6.6). However, an informal analysis of change in log likelihoods suggests that the univariate model is inadequate.

The ML estimated effects for the separate univariate logistic-normal model for each conditional binomial outcome are β̂_1 = 0.08 (SE = 0.21) and β̂_2 = 1.79 (SE = 0.22). For a given cluster, there is no evidence of a dose effect on the death rate, but the estimated odds of malformation, given survival, multiply by exp(1.79) = 6.0 for every additional g/kg of ethylene glycol. The variance component estimates suggest a stronger litter effect for the malformation outcome given survival (σ̂_2 = 1.6) than for death (σ̂_1 = 0.5).

12.5.3 Hierarchical (Multilevel) Modeling

Hierarchical data structures, with units grouped at different levels, are common in education. A statewide study of factors that affect student performance might measure students' scores on a battery of exams but use a model that takes into account the student, the school or school district, and the county. Just as two observations on the same student might tend to be more alike than observations on different students, so might two students in the same school tend to be more alike than students from different schools. Student, school, and county terms might be treated as random effects, with different ones referring to different levels of the model. For instance, a model might have students at level 1, schools at level 2, and counties at level 3. GLMMs for data having a hierarchical grouping of this sort are called multilevel models.
Random effects enter the model at each level of the hierarchy. We illustrate with a two-level model. Let π_i(j)t denote the probability that student i in school j passes test t in a battery of tests. A multilevel model with random effects for student and school and fixed effects for explanatory variables has the form

logit[π_i(j)t] = x′_i(j)t β + u_j + v_i(j).

Here, the explanatory variables x might include one that identifies the test in the battery. The random effects u_j for schools and v_i(j) for students within schools are independent with different variance components. The level 1 random effects {v_i(j)} account for variability among students in ability or parents' socioeconomic status or other characteristics not measured by x. When they have a relatively large variance component, there is a strong correlation among the test results for students. The level 2 random effects {u_j} account for variability among schools due to possibly unmeasured factors such as per-capita expenditure in the school's budget. For examples of the use of multivariate random effects in multilevel modeling, see Aitkin et al. (1981), Anderson and Aitkin (1985), Gibbons and Hedeker (1997), Goldstein (1995), Goldstein and Rasbash (1996), and Longford (1993).

12.6 GLMM FITTING, INFERENCE, AND PREDICTION

Model fitting is rather complex for GLMMs. The main difficulty is that the likelihood function does not have a closed form. Numerical methods for approximating it can be computationally intensive for models with multivariate random effects. In this section we outline the basic ideas of ML fitting of GLMMs. Some ML methods are available in software (e.g., PROC NLMIXED in SAS).

12.6.1 Marginal Likelihood and Maximum Likelihood Fitting

The GLMM is a two-stage model. At the first stage, conditional on the random effects, observations are assumed to follow a GLM.
That is, observation y_it in cluster i has distribution in the exponential family with expected value μ_it linked to a linear predictor,

g(μ_it) = x′_it β + z′_it u_i.

Then, z′_it u_i is a known offset and observations in a cluster are independent. At the second stage, the random effects {u_i} are assumed independent from a N(0, Σ) distribution.

For a discrete variable, denote the vector of all the observations by y and the vector of all the random effects by u. Let f(y | u; β) denote the conditional mass function of y, given u. Let f(u; Σ) denote the normal density function for u. The likelihood function ℓ(β, Σ; y) for a GLMM is the probability mass function f(y; β, Σ) of y, viewed as a function of β and Σ. This mass function refers to the marginal distribution of y after integrating out the random effects,

ℓ(β, Σ; y) = f(y; β, Σ) = ∫ f(y | u; β) f(u; Σ) du.    (12.18)

It is often called a marginal likelihood. For example, the likelihood function ℓ(β, σ²; y) for the logistic-normal model (12.5) (absorbing α into β) is

∏_i ∫ { ∏_t [exp(x′_it β + u_i) / (1 + exp(x′_it β + u_i))]^{y_it} [1 / (1 + exp(x′_it β + u_i))]^{1−y_it} } f(u_i; σ²) du_i.

The likelihood function is evaluated numerically and maximized as a function of β and Σ. Many methods have been developed to do this. We next discuss a few of the most popular.

12.6.2 Gauss–Hermite Quadrature Methods

The integral determining the likelihood function has dimension that depends on the random effects structure. When the dimension is small, as in the one-dimensional integral above for the logistic-normal model (12.5), standard numerical integration methods can approximate the likelihood function. Gauss–Hermite quadrature is a method for approximating the integral of a function f(·) multiplied by another function having the shape of a normal density. The approximation is a finite weighted sum that evaluates the function at certain points.
In the univariate normal random effects case, the approximation has the form

∫_{−∞}^{∞} f(u) exp(−u²) du ≈ Σ_{k=1}^{q} c_k f(s_k),

with weights {c_k} and quadrature points {s_k} that are tabulated. The approximation improves as q, the number of quadrature points, increases. The approximated likelihood can be maximized with standard algorithms such as Newton–Raphson, yielding ML estimates β̂ and Σ̂. Inverting an approximation for the observed information matrix provides standard errors for the ML estimates. For complex models, second partial derivatives for the Hessian may be computed numerically rather than analytically. Adequate approximation usually requires larger q for standard errors than for β̂. We recommend sequentially increasing q until the changes are negligible in both the estimates and standard errors.

An adaptive version of Gauss–Hermite quadrature (e.g., Liu and Pierce 1994) centers the quadrature points with respect to the mode of the function being integrated and scales them according to the estimated curvature at the mode. This improves efficiency, dramatically reducing the number of quadrature points needed to approximate the integrals effectively. Lesaffre and Spiessens (2001) showed comparisons and warned against using too few points.

12.6.3 Monte Carlo Methods

Multivariate forms of Gauss–Hermite quadrature handle multivariate, correlated random effects. Adequate approximation becomes more difficult, however, when the dimension of the integral exceeds roughly 5. Then, Monte Carlo methods are more feasible computationally than numerical integration. Various Monte Carlo approaches have been studied (e.g., McCulloch 1997), including Monte Carlo in combination with Newton–Raphson, Monte Carlo in combination with the EM algorithm, and simulating the likelihood directly. Here, we briefly describe a Monte Carlo EM (MCEM) algorithm.
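To make the quadrature formula and the simulation alternative concrete, the following Python sketch approximates a single cluster's contribution to the logistic-normal likelihood (12.18) both ways. The data values and parameters are hypothetical, chosen only for illustration; note the change of variable u = √2 σ s needed to match the exp(−s²) kernel of the quadrature rule.

```python
import numpy as np

rng = np.random.default_rng(0)

# One cluster's likelihood contribution: integral over u of
#   prod_t p(u)^y_t (1 - p(u))^(1 - y_t) * N(u; 0, sigma^2) du.
y = np.array([1, 1, 0])   # three binary observations in the cluster (hypothetical)
eta = 0.5                 # fixed part x'beta, taken constant here (hypothetical)
sigma = 1.2

def cluster_lik(u):
    p = 1.0 / (1.0 + np.exp(-(eta + u)))
    return np.prod(p**y * (1 - p)**(1 - y))

# Gauss-Hermite: substitute u = sqrt(2)*sigma*s so that the N(0, sigma^2)
# density matches the exp(-s^2) weight; a factor 1/sqrt(pi) remains.
s, c = np.polynomial.hermite.hermgauss(20)
gh = sum(ck * cluster_lik(np.sqrt(2) * sigma * sk)
         for sk, ck in zip(s, c)) / np.sqrt(np.pi)

# Crude Monte Carlo: average the conditional likelihood over draws
# u ~ N(0, sigma^2), i.e., simulate the integral directly.
u_draws = rng.normal(0.0, sigma, size=50_000)
p_draws = 1.0 / (1.0 + np.exp(-(eta + u_draws)))
mc = np.mean(np.prod(p_draws[:, None]**y * (1 - p_draws[:, None])**(1 - y),
                     axis=1))

print(gh, mc)   # the two approximations agree closely
```

With a smooth one-dimensional integrand like this, 20 quadrature points are ample; the Monte Carlo average needs tens of thousands of draws to reach comparable accuracy, which is why quadrature is preferred in low dimensions.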
The EM algorithm is a popular iterative method of finding ML estimates when data are missing or when filling in some "missing" data simplifies a likelihood (Dempster et al. 1977) [see Laird (1998) for a useful review]. In each cycle an E-step takes an expectation over the missing data to approximate the likelihood function and an M-step maximizes the likelihood given the working values of the parameter estimates. In GLMMs, one regards the random effects u as missing data. Then, h(y, u; β, Σ) = f(y | u; β) f(u; Σ) specifies the joint distribution of the complete data. The E-step in iteration r of the EM algorithm calculates

E{log h(y, u; β, Σ) | y; β^(r), Σ^(r)}.

The expectation is with respect to the distribution of (u | y) with parameter values equal to β^(r) and Σ^(r), the working estimates for iteration r. The distribution of (u | y) follows from those of (y | u) and u in the GLMM via Bayes' theorem. The M-step then maximizes the result with respect to β and Σ to obtain β^(r+1) and Σ^(r+1).

The MCEM algorithm approximates the expectation in the E-step using Monte Carlo methods. Possible ways of doing this include using independent simulations from the distribution of (u | y), at the current estimate of parameters, or using Markov chain Monte Carlo (MCMC). For details, including the issue of choosing an appropriate Monte Carlo sample size, see Booth and Hobert (1999), Chan and Kuk (1997), and McCulloch (1994, 1997).

12.6.4 Penalized Quasi-likelihood Approximation

The Gauss–Hermite and Monte Carlo integration methods provide likelihood approximations such that resulting parameter estimates converge to the ML estimates as they are applied more finely (i.e., as the number of quadrature points increases for numerical integration and as the Monte Carlo sample size increases in the MCEM method). This contrasts with other approximate methods that are simpler but need not yield estimates near the ML estimates.
These methods maximize an analytical approximation of the likelihood function. Recall that the likelihood function (12.18) results from integrating out the random effects u from the joint distribution of y and u. Using the exponential family representation of each component of that joint distribution, the integrand of (12.18) is an exponential function of u. One approach approximates that function using a second-order Taylor series expansion of its exponent around a point ũ at which the first-order term equals 0. [That point ũ ≈ E(u | y).] The approximating function for the integrand is then exponential with quadratic exponent in (u − ũ) and has the form of a constant multiple of a multivariate normal density. Thus, its integral has closed form. This type of integral approximation is called a Laplace approximation.

The approximation for integral (12.18) is then treated as a likelihood and maximized with respect to β and Σ. For one such method (Breslow and Clayton 1993), the integral approximation yields a function approximating the log likelihood that has the form

q(β, y) − (1/2) ũ′ Σ^{−1} ũ,

where q(β, y) resembles a quasi-log-likelihood function for the GLM conditional on u = ũ. Thus, the approximation results in a penalty for the quasi-log likelihood, with the penalty increasing as elements of ũ increase in absolute value. This approach is called penalized quasi-likelihood (PQL). The calculations for maximizing the penalized quasi-likelihood use methods for linear mixed models with a normal response. This treats a linearization of the logit as a working response and entails iterative solution of sets of likelihood-like equations in β and u. PQL methods do not require numerical or Monte Carlo integration and so are simpler than ML methods. They are computationally feasible for large data sets and models with complex random effects structure.
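The Laplace idea underlying PQL can be illustrated on a one-dimensional version of integral (12.18): find the mode u₀ of the log integrand h(u), then replace the integral by exp[h(u₀)] √(2π / −h″(u₀)). The following Python sketch (illustrative only, with hypothetical data and parameter values) compares the Laplace value with a fine numerical quadrature of the same integral:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

# Log integrand of a one-dimensional logistic-normal likelihood
# contribution: Bernoulli log likelihood plus N(0, sigma^2) log density.
# Data and parameters are hypothetical.
y = np.array([1, 1, 0])
eta, sigma = 0.5, 1.2

def h(u):
    p = 1.0 / (1.0 + np.exp(-(eta + u)))
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    log_prior = -0.5 * u**2 / sigma**2 - 0.5 * np.log(2 * np.pi * sigma**2)
    return log_lik + log_prior

# Mode of h (here, approximately E(u | y) up to the Laplace approximation).
u0 = minimize_scalar(lambda u: -h(u), bounds=(-10, 10), method="bounded").x

# Numerical second derivative of h at the mode (central differences).
eps = 1e-3
h2 = (h(u0 + eps) - 2 * h(u0) + h(u0 - eps)) / eps**2

laplace = np.exp(h(u0)) * np.sqrt(2 * np.pi / -h2)

# Reference value by adaptive quadrature of the exact integrand.
exact, _ = quad(lambda u: np.exp(h(u)), -10, 10)

print(laplace, exact)   # Laplace value is close to the quadrature value
```

Because the integrand here is log-concave and nearly Gaussian, the Laplace value is close to the true integral; the poor behavior of PQL noted below arises in less favorable settings, notably binary data with large variance components.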
Unfortunately, PQL methods can perform poorly relative to ML (McCulloch 1997). For instance, for the abortion example in Section 12.3.2, the PQL approximations to the ML estimates (obtained using the GLIMMIX macro in SAS) are decent for {β_t}, but the standard errors and the estimate of σ are only about half what they should be (e.g., PQL gives σ̂ = 4.3, compared to the ML estimate of 8.6). When true variance components are large, ordinarily PQL tends to produce variance component estimates with substantial negative bias (Breslow and Lin 1995). The PQL estimators also behave poorly when the response distribution is far from normal (e.g., binary). Adjustments have been developed for some cases to lessen the bias (e.g., Goldstein and Rasbash 1996), but where possible we recommend using ML rather than PQL.

12.6.5 Bayesian Approaches

Another approach to fitting of GLMMs is Bayesian. With it, the distinction between fixed and random effects no longer occurs, as every effect has a probability distribution. Use of a flat prior distribution yields a posterior that is a constant multiple of the likelihood function. Then, Markov chain Monte Carlo (MCMC) methods for approximating intractable posterior distributions can approximate the likelihood function (Zeger and Karim 1991). For instance, an approximation for the mode of the posterior distribution approximates the ML estimate. A danger is that improper prior distributions have improper posteriors for many models for categorical data (Natarajan and McCulloch 1995). In using MCMC, one may fail to realize that the posterior is improper. It is safer to use a proper but relatively diffuse prior. However, the posterior mode need not be close to the ML estimate, and Markov chains may converge slowly (Natarajan and McCulloch 1998). This is currently an active area of research, not just as a way of approximating ML results but also as an approach preferred over ML by those who adopt the Bayesian paradigm.
See, for instance, Daniels and Gatsonis (1999) for multilevel modeling of geographic and temporal trends with clustered longitudinal binary data, which built on earlier hierarchical modeling by Wong and Mason (1985).

12.6.6 Inference for Model Parameters

After fitting the model, inference about fixed effects proceeds in the usual way. For instance, likelihood-ratio tests can compare nested models. Asymptotics for GLMMs apply as the number of clusters increases, rather than as the numbers of observations within the clusters increase. Similarly, resampling methods such as the bootstrap using a large number of clusters should sample clusters rather than individual observations within clusters, to preserve the within-cluster dependence.

Inference about random effects (e.g., their variance components) is more complex. For instance, sometimes one model is a special case of another in which a variance component equals 0. The simpler model then falls on the boundary of the parameter space relative to the more complex model, so ordinary likelihood-based inference does not apply. The asymptotic distribution of the likelihood-ratio statistic is known for the most common situation, testing H₀: σ² = 0 against Hₐ: σ² > 0 for a model containing a single variance component. The null distribution is an equal mixture of χ²₀ (i.e., degenerate at 0) and χ²₁ random variables (Self and Liang 1987). The value of 0 occurs when σ̂ = 0, in which case the maximized likelihoods are identical under H₀ and Hₐ. When σ̂ > 0 and the observed test statistic equals t, the P-value for this large-sample test is ½P(χ²₁ > t), half the P-value that applies for χ²₁ asymptotic tests. For testing more than one variance component, the mixture distribution becomes more complex, and it is simpler to use a score test (Lin 1997).
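The boundary-adjusted P-value just described is simple to compute; a short Python sketch using SciPy (illustrative only):

```python
from scipy.stats import chi2

def boundary_lr_pvalue(t):
    """P-value for testing H0: sigma^2 = 0 against Ha: sigma^2 > 0 with a
    single variance component, using the equal mixture of chi-square(0)
    and chi-square(1) null distributions (Self and Liang 1987). For t > 0
    this is half the ordinary chi-square(1) P-value."""
    if t <= 0:   # sigma-hat = 0: maximized likelihoods identical under H0, Ha
        return 1.0
    return 0.5 * chi2.sf(t, df=1)

# Example: an LR statistic of 2.706 (the ordinary 0.10 critical value of
# chi-square(1)) gives a mixture P-value of about 0.05.
print(round(boundary_lr_pvalue(2.706), 3))   # -> 0.05
```

One practical consequence: using the naive χ²₁ reference distribution doubles the P-value and so makes the test conservative, never anti-conservative, in this single-variance-component case.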
12.6.7 Prediction Using Random Effects

The use of random effects in a model implies heterogeneity of certain effects of interest, such as odds ratios. Estimated effects of interest are often then linear combinations of fixed and random effects. For example, in the clinical trial comparing two treatments with random effects for centers (Section 12.3.4), one can predict the probability of success for each treatment in each center and odds ratios in those centers.

Given the data, the conditional distribution of (u | y) contains the information about the random effects u. A prediction for u is E(u | y), its posterior mean given the data. Calculation of E(u | y) itself requires numerical integration or Monte Carlo approximation. The expectation depends on β and Σ, so in practice one substitutes β̂ and Σ̂ in the approximation. The standard error of the predictor of the random effect u_i is the standard deviation of the distribution of (u_i | y). When one substitutes β̂ and Σ̂ in E(u | y), however, the standard error does not account for the sampling variability in those estimates. Hence, the true standard error tends to be underestimated (Booth and Hobert 1998).

This approach to prediction using posterior means of random effects provides effect estimates that exhibit shrinkage relative to estimates using only data in the specific cluster. In this sense the results are similar to those using an empirical Bayes approach (Ten Have and Localio 1999). This adapts an ordinary Bayesian analysis by using the sample data to estimate parameters of the prior distribution. For a vector of mean parameters, this approach yields an estimate of a particular mean that is a weighted average of the sample mean and the overall mean of the sample means. Thus, it shrinks the sample mean toward the overall mean.
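The shrinkage behavior described above can be sketched with the simplest normal-normal case: the posterior-mean weight on a cluster's own data is the between-cluster variance divided by the total variance of the cluster mean, so smaller clusters are pulled harder toward the overall mean. The following Python sketch uses hypothetical cluster means, sizes, and variance components:

```python
# Normal-normal shrinkage sketch of the empirical Bayes idea: each
# cluster's estimate is a weighted average of its own sample mean and
# the grand mean. All values here are hypothetical.
cluster_means = [0.20, 0.45, 0.80]   # sample means, one per cluster
cluster_sizes = [5, 20, 100]         # observations per cluster
var_within = 0.25                    # assumed within-cluster variance
var_between = 0.02                   # assumed between-cluster variance

grand_mean = sum(cluster_means) / len(cluster_means)

shrunk = []
for ybar, n in zip(cluster_means, cluster_sizes):
    # Posterior-mean weight on the cluster's own data: shrinks more
    # (smaller w) when n is small relative to the variance ratio.
    w = var_between / (var_between + var_within / n)
    shrunk.append(w * ybar + (1 - w) * grand_mean)

for ybar, est in zip(cluster_means, shrunk):
    print(ybar, round(est, 3))
# Each estimate lies between its sample mean and the grand mean,
# and the smallest cluster is shrunk the most.
```

In a GLMM these weights are not available in closed form, since E(u | y) involves the integrals discussed in Section 12.6.1, but the qualitative behavior (more shrinkage for sparser clusters) is the same.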
Shrinkage estimators can be far superior to sample values when the sample size for estimating each parameter is small, when there are many parameters to estimate, or when the true parameter values are roughly equal. The empirical Bayes paradigm has been in use for some time: for instance, for estimating a vector of means or binomial proportions (Efron and Morris 1975).

Although random effects models are natural in many applications, further work is needed. Work continues on the development of methodology for model fitting and inference with complex GLMMs. In addition, research is needed on model checking and diagnostics. Nonetheless, we believe that GLMMs provide a very useful extension of ordinary GLMs.

NOTES

Section 12.1: Random Effects Modeling of Clustered Categorical Data

12.1. For further discussion of the Rasch model and ways of estimating its parameters, see Andersen (1980, Sec. 6.4) and Fischer and Molenaar (1995). Haberman (1977b) showed ML estimators can achieve consistency when both n and T grow at suitable rates. For multinomial Rasch extensions, see Andersen (1980, pp. 272–284; 1995) and Conaway (1989). Early work on random effects models for a categorical response includes Anderson and Aitkin (1985), Bartholomew (1980), Bock and Aitkin (1981), Chamberlain (1980), Gilmour et al. (1985), Pierce and Sands (1975), and Stiratelli et al. (1984).

12.2. In models with covariates, Neuhaus and Lesperance (1996) noted that conditional ML may lose considerable efficiency compared to the random effects approach when cluster sizes are small and covariates have strong positive within-cluster correlation. As that correlation approaches +1, the covariate effect resembles a between-cluster one, which the conditional ML approach cannot estimate.
The matched-pairs case referred to in Section 12.1.2 in which the conditional ML estimate equals the random effects estimate has within-cluster covariate correlation = −1, as depending on the order of viewing the observations, x_t changes from 0 to 1 or from 1 to 0; then, no efficiency loss occurs.

Section 12.3: Examples of Random Effects Models for Binary Data

12.3. For further discussion of modeling capture–recapture data, see Bishop et al. (1975, Chap. 6), Chao et al. (2001), Cormack (1989), Coull and Agresti (1999), Darroch et al. (1993), Fienberg et al. (1999), and Hook and Regal (1995). Similarities exist between this problem and the related problem of estimating the binomial index n when observing independent bin(n, π) counts with unknown n and π; see Aitkin and Stasinopoulos (1989) and references therein. Relatively flat log likelihoods also occur with other models that permit capture heterogeneity (Burnham and Overton 1978), such as a beta-binomial model.

12.4. King (1997) used random effects models as part of a solution for analyzing aggregated categorical data, the problem of ecological inference. Chambers and Steel (2001) discussed early work by Leo Goodman on this problem and proposed a simpler semiparametric approach.

Section 12.4: Random Effects Models for Multinomial Data

12.5. With the complementary log-log link, the likelihood function has closed form with a log gamma random effects distribution (Crouchley 1995; Farewell 1982; Ten Have 1996).

12.6. Chen and Kuo (2001) discussed nominal responses, including discrete choice models (Sec. 7.6) with random effects. See also Brownstone and Train (1999) for discrete choice GLMMs.

Section 12.5: Multivariate Random Effects Models for Binary Data

12.7. Rabe-Hesketh and Skrondal (2001) showed that careful attention must be paid to parameter identification in models with multivariate random effects. Their factor model contains many multivariate random effects models as special cases.

12.8.
For longitudinal bivariate binary responses, Ten Have and Morabia (1999) simultaneously modeled bivariate log odds ratios and univariate logits. Multivariate responses sometimes have both continuous and categorical components. For random effects modeling of such data, see Catalano and Ryan (1992) and Gueorguieva and Agresti (2001).

Section 12.6: GLMM Fitting, Inference, and Prediction

12.9. See Fahrmeir and Tutz (2001, Chap. 7) and McCulloch and Searle (2001) for more details on the fitting of GLMMs. Just as the likelihood function for a GLMM is an integral, so do likelihood equations have the form of integral equations (McCulloch and Searle 2001, p. 227). Wolfinger and O'Connell (1993) described a fitting method related to PQL, also motivated by a Laplace approximation.

12.10. A GLMM determines the marginal relationship (averaged over random effects) between the mean response and explanatory variables. Conversely, Heagerty (1999) noted that a marginal model for the mean implicitly determines the form of the fixed portion of the linear predictor in a conditional model. The conditional GLMM (12.1) has linear predictor x′_it β + z′_it u_i. A more general form Δ_it + z′_it u_i implies a particular marginal model. Here, Δ_it is a function of the marginal linear predictor and the random effects distribution. It is implicitly defined by the integral equation that links the marginal and conditional means.

PROBLEMS

Applications

12.1 Refer to the matched-pairs data of Table 10.14 and Problem 10.1.
a. Fit model (12.3). Interpret β̂. If your software uses numerical integration, report β̂, σ̂, and their standard errors for 5, 10, 25, 100, and 200 quadrature points, and comment on convergence.
b. Compare β̂ and its SE for this approach to the conditional ML approach.

12.2 Refer to Table 4.8 on the free-throw shooting of Shaq O'Neal. In game i, suppose that y_i = number made out of n_i attempts is a bin(n_i, π_i)
variate and {y_i} are independent.
a. Fit the model, logit(π_i) = α. Find and interpret π̂_i. Does the model appear to fit adequately?
b. Fit the model, logit(π_i) = α + u_i, where {u_i} are independent N(0, σ²). Use α̂ and σ̂ to summarize O'Neal's free-throw shooting.
c. Explain how the model in part (a) is a special case of that in part (b). Is there evidence that the one in part (b) fits better?

12.3 For Table 8.3, let y_it = 1 when subject i used substance t. Table 12.11 shows output for the logistic-normal model

logit P(Y_it = 1 | u_i) = u_i + β_t.

Interpret. Illustrate by comparing use of cigarettes and marijuana.

TABLE 12.11 Output for Problem 12.3

Description             Value
Subjects                 2276
Max Obs Per Subject         3
Parameters                  4
Quadrature Points         200
Log Likelihood          −3311

Parameter   Estimate   Std Error   t Value
beta1         4.2227      0.1824     23.15
beta2         1.6209      0.1207     13.43
beta3        −0.7751      0.1061     −7.31
sigma         3.5496      0.1627     21.82

12.4 How is the focus different for the model in Problem 12.3 than for the loglinear model (AC, AM, CM) used in Section 8.2.4? If σ̂ = 0, which loglinear model has the same fit as the GLMM?

12.5 For the student survey in Table 9.1, (a) analyze using GLMMs, and (b) compare results and interpretations to those with marginal models in Problem 11.2.

12.6 Fit model (12.10) to the responses on abortion. If your software uses Gauss–Hermite quadrature, report the approximate number of quadrature points needed for parameter estimates to converge and the number needed for standard error estimates to converge. (This example has large σ̂ and requires many points.)

12.7 For the crossover study in Table 11.10 (Problem 11.6), fit the model

logit P(Y_i(k)t = 1 | u_i(k)) = α_k + β_t + u_i(k),    (12.19)

where {u_i(k)} are independent N(0, σ²). Interpret {β̂_t} and σ̂.

12.8 For Problem 12.7, compare estimates of β_B − β_A and β_C − β_A and SE values to those using (a) a marginal model (Problem 11.6), and (b)
conditional logistic regression (Section 10.2), treating subject terms in model (12.19) as fixed effects.

12.9 For Problem 12.7, fit the more general GLMM having treatment effects {β_tk} that vary by sequence. Test whether the fit is better. One could also consider period or carryover effects. Add two period effects to model (12.19) (e.g., the first-period-effect parameter adds to the model when t = A and k = 1, 2, t = B and k = 3, 4, and t = C and k = 5, 6). Check whether the fit improves. Interpret.

12.10 Consider the logistic-normal model (12.10) for the abortion opinion data, under the constraint σ = 0.
a. Explain why the fit is the same as an ordinary logit model treating the three responses for each subject as if they were independent responses for three separate subjects.
b. Explain why the model fit is the same as an ordinary loglinear model (GI₁, GI₂, GI₃) of mutual independence of responses on the three items (I₁, I₂, I₃), given G = gender.
c. Fit the model. Interpret, and explain why {β̂_t − β̂_u} are quite different from those in Section 12.3.2 allowing σ > 0.

12.11 For Table 6.7 on admissions decisions for graduate school applicants, let y_ig = 1 denote a subject in department i of gender g (1 = females, 0 = males) being admitted.
a. For the fixed effects model, logit[P(Y_ig = 1)] = α + βg + β_i^D, β̂ = 0.173 (SE = 0.112). Interpret.
b. The corresponding model (12.12) in which departments are a normal random effect has β̂ = 0.163 (SE = 0.111). Interpret.
c. The model of form (12.12) allowing the gender effect to vary by department has β̂ = 0.176 (SE = 0.132), with σ̂_b = 0.20. Interpret. Explain why the standard error of β̂ is slightly larger than with the other analyses.
d. The marginal sample log odds ratio between gender and whether admitted equals −0.07. How could this take different sign from β̂ in these models?
e.
The sample conditional odds ratios between gender and whether admitted vary between 0 and ∞. By contrast, predicted odds ratios for the interaction random effects model do not vary much. Explain why results can be so different.

12.12 For the clinical trial in Table 9.16, let π_it = P(Y_it = 1 | u_i) denote the probability of success for treatment t in center i.
a. The random intercept model (12.11) has β̂ = 1.52 (SE = 0.70) and σ̂ = 1.9. Interpret.
b. From Section 9.8.3, the fixed effects analog of this model (replacing α + u_i by α_i) has α̂₁ = α̂₃ = −∞, corresponding to π̂_1t = π̂_3t = 0 for each treatment. By contrast, the random effects model has α̂ + û₁ = −3.78 (using NLMIXED in SAS) and π̂₁₁ = 0.047 and π̂₁₂ = 0.011 in center 1. Explain how this model can have π̂_it > 0 in centers having no successes.

12.13 Refer to the subject-specific model in Section 12.3.3. Verify that the estimated difference in time effect slopes between the new and standard drugs for treating depression is (a) 1.018 (SE = 0.192) with the GLMM approach, and (b) 1.156 (SE = 0.222) with conditional ML.

12.14 For marginal model (10.14) for Table 10.5 on premarital and extramarital sex, Table 12.12 shows results of fitting a corresponding random intercept model. Interpret β̂. Compare estimates of and inferences about β to those in Section 10.3.2 for the marginal model.

TABLE 12.12 Output for Problem 12.14

Subjects: 475; Max Obs Per Subject: 2; Parameters: 5; Quadrature Points: 100; Log Likelihood: −890.1

Parameter   Estimate   Std Error   t Value
inter1      −1.5422    0.1826      −8.45
inter2      −0.6682    0.1578      −4.24
inter3       0.9273    0.1673       5.54
beta         4.1342    0.3296      12.54
sigma        2.0757    0.2487       8.35

12.15 A data set from the 1994 General Social Survey classified subjects' opinions on four items (the environment, health, law enforcement, education) according to whether they believed government spending on each item should increase, stay the same, or decrease. Subjects were also classified by their gender and race.
For subject i, let G_i = 1 for females and 0 for males, let R_1i = 1 for whites and 0 otherwise, R_2i = 1 for blacks and 0 otherwise, and R_1i = R_2i = 0 for the other category of race. Let y_it denote the response for subject i on spending item t, where outcomes (1, 2, 3) represent (increase, stay the same, decrease).
a. With constraint β₄ = 0, the random-intercept model
logit P(Y_it ≤ j | u_i) = α_j + β_t + γ_g G_i + γ_r1 R_1i + γ_r2 R_2i + u_i,  j = 1, 2,
has β̂₁ = −0.55, β̂₂ = −0.60, β̂₃ = −0.49, with σ̂ = 1.03. These estimates are greater than five standard errors in absolute value. Interpret.
b. Table 12.13 shows results with a race-by-item interaction. Interpret.

TABLE 12.13 Results for Problem 12.15^a

Variable        Estimate    SE
Intercept-1      1.065     0.391
Intercept-2      1.919     0.051
Gender           0.409     0.088
Race1-w         −0.055     0.397
Race2-b          0.434     0.452
Item1-envir     −0.357     0.539
Item2-health    −0.319     0.493
Item3-crime     −0.585     0.480
Race1*Item1     −0.170     0.549
Race1*Item2     −0.387     0.503
Race1*Item3      0.197     0.491
Race2*Item1     −0.452     0.606
Race2*Item2      0.454     0.598
Race2*Item3     −0.518     0.560

^a Coding 0 for item 4 (education) and race 3 (other).

12.16 Refer to Problem 11.12 for Table 8.19 on government spending. Analyze these data using a cumulative logit model with random effects. Interpret. Compare results to those with a marginal model (Problem 11.12).

12.17 For the insomnia example in Section 12.4.2, according to SAS the maximized log likelihood equals −593.0, compared to −621.0 for the simpler model forcing σ = 0. Compare models, using either a likelihood-ratio test or AIC. What do you conclude?

TABLE 12.14 Results for Problem 12.18

Observer   GEE Effect (SE)    Random Effects (SE)
A          −0.451 (0.108)     −1.201 (0.300)
B          −0.391 (0.093)     −0.919 (0.299)
C           0.319 (0.118)      0.558 (0.301)
D           0.632 (0.105)      1.545 (0.313)
E          −0.491 (0.098)     −1.379 (0.300)
F           1.252 (0.161)      2.907 (0.344)

12.18 Landis and Koch (1977)
showed ratings by seven pathologists who separately classified 118 slides regarding the presence and extent of carcinoma of the uterine cervix, using a five-point ordinal scale. (Table 13.1 is a collapsing of their table that combines the first two categories and the last three categories.) For slide i with rater t, Table 12.14 shows results of fitting the model
logit P(Y_it ≤ j | u_i) = u_i + α_j + β_t
to the ordinal table (with β̂_G = 0), assuming that the {u_i} are independent N(0, σ²). It also shows GEE estimates, using independence working equations, for the corresponding marginal model. Interpret β̂_F for each model. Explain why estimates using the random effects model, for which σ̂ = 3.8, tend to be much larger in absolute value. Discuss the differences in assumptions and interpretations for the two models.

12.19 Refer to Section 12.5.1 on boys' attitudes toward the leading crowd. Table 12.15 shows results for a sample of schoolgirls. Fit model (12.16) and interpret. Summarize the estimated variability and correlation of random effects.

TABLE 12.15 Data for Problem 12.19

                           (M, A) for Second Interview^a
(M, A) for           (Yes,       (Yes,       (No,        (No,
First Interview      Positive)   Negative)   Positive)   Negative)
Yes, positive          484          93         107          32
Yes, negative          112         110          30          46
No, positive           129          40         768         321
No, negative            74          75         303         536

^a M, membership; A, attitude.
Source: J. S. Coleman, Introduction to Mathematical Sociology (London: Free Press of Glencoe, 1964), p. 168.

12.20 Generalize model (12.16) to apply simultaneously to Tables 12.8 and 12.15, using a gender main effect but the same membership effect and the same attitude effect for each gender. Fit the model. Use the maximized log likelihood to compare with a more general model having different membership effects and different attitude effects for each gender. Interpret.

12.21 Table 12.16 reports results from a study to estimate the number N of people infected during a 1995 hepatitis A outbreak in Taiwan.
The 271 observed cases were reported from records based on a serum test taken by the Institute of Preventive Medicine of Taiwan (P), records reported by the National Quarantine Service (Q), and records based on questionnaires administered by epidemiologists (E). Estimating N is difficult, because many subjects had only one capture.
a. Find N̂ if you observed only (i) P and Q, (ii) P and E, (iii) Q and E.
b. Find N̂ using the model of mutual independence with P, Q, and E.
c. Find a 95% profile likelihood interval for N using the model in part (b).
d. The random effects model of Section 12.3.6 has fit shown in Table 12.16, for which σ̂ = 2.9. The log-likelihood is relatively flat, and N̂ = 4551 with a 95% profile likelihood interval of (758, ∞) (Coull and Agresti 1999). Explain why this model may provide imprecise estimates of N. Since the interval in part (c) is much narrower, is it necessarily more reliable?

TABLE 12.16 Data for Problem 12.21

PQE   Observed Count   Logistic-Normal ML Fit
000        —              (487, ∞)
001        63              61.0
010        55              58.0
011        18              17.0
100        69              68.0
101        17              20.0
110        21              19.0
111        28              28.0

Source: Data from Chao et al. (2001).

12.22 Analyze the crossover data of Table 11.1 using a random effects approach. Interpret, and compare results to those in Section 11.1.2.

12.23 The analyses in Section 12.3.2 comparing opinions on some topic extend to ordinal responses. Using an ordinal random effects model, analyze the 4 × 3 table in Agresti (1993), found also at the book's Web site, www.stat.ufl.edu/~aa/cda/cda.html.

12.24 The analyses in Section 12.3.4 describing heterogeneity in multicenter clinical trials extend to ordinal responses. Using random effects models, analyze the 2 × 3 × 8 table in Hartzel et al. (2001a).

12.25 You are a statistical consultant asked to analyze Table 4 in B. Efron, Statistical Science 13: 95–122 (1998), which shows 2 × 2 tables from a clinical trial in 41 cities.
Analyze, and write a report summarizing your analysis.

12.26 Analyze Table 11.9 with age and maternal smoking as predictors using a (a) logistic-normal model, (b) marginal model, and (c) transitional model. Explain how the interpretation of the maternal smoking effect differs for the three approaches.

Theory and Methods

12.27 Refer to Section 12.3.1. Using supplementary information improves predictions. Let q_i denote the true proportion of votes for Clinton in state i in the 1992 election, conditional on voting for him or Bush. Consider the model
logit P(Y_it = 1 | u_i) = logit(q_i) + α + u_i,
where {q_i} are known and {u_i} are independent N(0, σ²). When σ̂ = 0, show that
π̂_i = q_i exp(α̂) / [1 − q_i + q_i exp(α̂)].
Compared to {q_i}, explain how {π̂_i} then shifts up or down depending on how the overall Democratic vote compares in the current poll to the previous election (i.e., depending on α̂). When also α̂ = 0, show that π̂_i = q_i.

12.28 For a binary response, consider the random effects model
logit P(Y_it = 1 | u_i) = α + β_t + u_i,  t = 1, ..., T,
where {u_i} are independent N(0, σ²), and the marginal model
logit P(Y_t = 1) = α + β_t*,  t = 1, ..., T.
For identifiability, β_T = β_T* = 0. Explain why all β_t = 0 implies that all β_t* = 0. Is the converse true?

12.29 The GLMM for binary data using probit link function is
Φ⁻¹[P(Y_it = 1 | u_i)] = x′_it β + z′_it u_i,
where Φ is the N(0, 1) cdf and u_i has N(0, Σ) pdf, f(u_i; Σ).
a. Show that the marginal mean is
P(Y_t = 1) = ∫ P(Z − z′_it u_i ≤ x′_it β) f(u_i; Σ) du_i,
where Z is a standard normal variate that is independent of u_i.
b. Since Z − z′_it u_i has a N(0, 1 + z′_it Σ z_it) distribution, deduce that
P(Y_t = 1) = Φ[x′_it β (1 + z′_it Σ z_it)^(−1/2)].
Hence, the marginal model is a probit model with attenuated effect. In the univariate random intercept case, show that the marginal effect equals that from the GLMM divided by √(1 + σ²).
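The attenuation identity in Problem 12.29(b) is easy to verify numerically in the random-intercept case. The sketch below is illustrative only (function and variable names are ours, not from any package): it integrates Φ(x′β + u) against the N(0, σ²) density by Gauss–Hermite quadrature and compares the result with the closed form Φ(x′β / √(1 + σ²)).

```python
import numpy as np
from math import erf, sqrt, pi

def std_normal_cdf(x):
    # Standard normal cdf via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def marginal_probit_prob(lin_pred, sigma, n_quad=80):
    """Marginal P(Y_t = 1) = E_u[Phi(lin_pred + u)] with u ~ N(0, sigma^2),
    approximated by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    u = sqrt(2.0) * sigma * nodes          # change of variables for N(0, sigma^2)
    w = weights / sqrt(pi)                 # weights then sum to 1
    return float(sum(wi * std_normal_cdf(lin_pred + ui) for wi, ui in zip(w, u)))

# Problem 12.29(b): the integral equals Phi(lin_pred / sqrt(1 + sigma^2))
lp, sigma = 0.8, 1.5
attenuated = std_normal_cdf(lp / sqrt(1.0 + sigma**2))
```

Here `marginal_probit_prob(lp, sigma)` and `attenuated` agree to numerical precision, illustrating why marginal probit effects are attenuated relative to the conditional (subject-specific) effects.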
12.30 In the Rasch model, logit[P(Y_it = 1)] = α_i + β_t, where α_i is a fixed effect.
a. Assuming independence of responses for different subjects and for different observations on the same subject, show that the log likelihood is
Σ_i Σ_t α_i y_it + Σ_i Σ_t β_t y_it − Σ_i Σ_t log[1 + exp(α_i + β_t)].
b. Show that the likelihood equations are y_+t = Σ_i P(Y_it = 1) and y_i+ = Σ_t P(Y_it = 1) for all i and t. Explain why conditioning on {y_i+} yields a distribution that does not depend on {α_i}.
c. Discuss advantages and disadvantages of, instead, treating α_i as random.

12.31 Consider the matched-pairs random effects model (12.3). For given β₀, let c₀ be such that the fitted counts μ̂₁₂ = n₁₂ + c₀ and μ̂₂₁ = n₂₁ − c₀ satisfy log(μ̂₂₁/μ̂₁₂) = β₀. Suppose {μ̂_ij} has nonnegative log odds ratio. Explain why:
a. This is the fit of the model assuming β = β₀.
b. The likelihood-ratio statistic for testing H₀: β = β₀ in this model equals
2[n₁₂ log(n₁₂/(n₁₂ + c₀)) + n₂₁ log(n₂₁/(n₂₁ − c₀))].
c. The likelihood-ratio test of H₀: β = 0 is the test of symmetry.

12.32 Explain why the logistic-normal model is not helpful for capture–recapture experiments with only two captures.

12.33 Refer to the crossover study in Problem 12.7. Kenward and Jones (1991) reported results using the ordinal response scale (none, moderate, complete) for relief. Explain how to formulate an ordinal logit random effects model for these data analogous to model (12.19).

12.34 Formulate a model using adjacent-categories logits that is analogous to model (12.14) for cumulative logits. Interpret parameters.

12.35 For ordinal square I × I tables of counts {n_ab}, model (12.3) for binary matched-pairs responses (Y_i1, Y_i2) for subject i extends to
logit P(Y_it ≤ j | u_i) = α_j + βx_t + u_i
with {u_i} independent N(0, σ²) variates and x₁ = 0 and x₂ = 1.
a. Explain how to interpret β, and compare to the interpretation of β in the corresponding marginal model (10.14).
b. This model implies model (12.3) for each 2 × 2 collapsing that combines categories 1 through j for one outcome and categories j + 1 through I for the other. Use the form of the conditional ML (or random effects ML) estimator for binary matched pairs to explain why
log[ (Σ_{a>j} Σ_{b≤j} n_ab) / (Σ_{a≤j} Σ_{b>j} n_ab) ]
is a consistent estimator of β.
c. Treat these (I − 1) collapsed 2 × 2 tables naively as if they are independent samples. Show that adding the numerators and adding the denominators of the separate estimates of e^β motivates the summary estimator of β,
β̃ = log[ Σ_{a>b} (a − b) n_ab / Σ_{b>a} (b − a) n_ab ].
Explain why β̃ is consistent for β even recognizing the actual dependence.
d. A standard error for β̃ that treats the collapsed tables in part (c) as independent is inappropriate. Treating {n_ab} as a multinomial sample, show that an estimated asymptotic variance of β̃ is (Agresti and Lang 1993a)
[Σ_{b>a} (b − a)² n_ab] / [Σ_{b>a} (b − a) n_ab]² + [Σ_{a>b} (a − b)² n_ab] / [Σ_{a>b} (a − b) n_ab]².

12.36 Summarize advantages and disadvantages of using a GLMM approach compared to a marginal model approach. Describe conditions under which parameter estimators are consistent for (a) marginal models using GEE, (b) marginal models using ML, (c) GLMM using PQL, and (d) GLMM using ML.

Categorical Data Analysis, Second Edition. Alan Agresti
Copyright © 2002 John Wiley & Sons, Inc.
ISBN: 0-471-36093-7

CHAPTER 13

Other Mixture Models for Categorical Data*

In Chapters 10 through 12 we introduced ways of handling correlated observations due to repeated measurement and other forms of clustering. The generalized linear mixed models (GLMMs) of Chapter 12 assume normal random effects. They describe heterogeneity by replacing the linear predictor by a normally distributed mixture of linear predictors. In this chapter we present additional models having connections with GLMMs. Except for one case, these models use nonnormal mixture distributions.
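As just noted, the normal random effects of Chapter 12 make each cluster's likelihood contribution an integral, which software typically approximates by Gauss–Hermite quadrature. A minimal sketch for a random-intercept logit model with a single fixed intercept makes this concrete (the code and its names are our own illustration, not any package's implementation):

```python
import numpy as np

def glmm_loglik(beta, sigma, clusters, n_quad=20):
    """Approximate log-likelihood of the random-intercept logit model
    logit P(Y_it = 1 | u_i) = beta + u_i, u_i ~ N(0, sigma^2),
    by Gauss-Hermite quadrature: each cluster contributes an integral."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    u = np.sqrt(2.0) * sigma * nodes      # change of variables for N(0, sigma^2)
    w = weights / np.sqrt(np.pi)          # weights then sum to 1
    total = 0.0
    for y in clusters:                    # y: 0/1 responses within one cluster
        eta = beta + u                    # linear predictor at each quadrature node
        p = 1.0 / (1.0 + np.exp(-eta))
        # Likelihood of the cluster's responses at each node, by conditional independence
        lik_at_u = np.prod(np.where(np.array(y)[:, None] == 1, p, 1.0 - p), axis=0)
        total += np.log(np.dot(w, lik_at_u))   # finite sum approximates the integral
    return total
```

With σ = 0 this reduces to the ordinary independence log likelihood, and increasing `n_quad` shows the convergence behavior that Problem 12.1 asks about.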
In Section 13.1 we present latent class models. These treat a contingency table as a mixture of unobserved tables at categories of a qualitative latent (unobserved) variable. In Section 13.2 we discuss a related nonparametric approach to fitting GLMMs that uses an unspecified discrete quantitative distribution for the random effects. In Section 13.3 we model clustered binomial responses using the beta distribution to describe heterogeneity of binomial parameters. The resulting beta-binomial distribution has a variance function for which quasi-likelihood methods are also available. In Section 13.4 we model count responses using the gamma distribution to describe heterogeneity of Poisson parameters. The resulting negative binomial regression model corresponds to a Poisson GLMM having a log-gamma distributed random effect. It is an alternative to the GLMM for Poisson responses with normal random effects, a model discussed in Section 13.5.

13.1 LATENT CLASS MODELS

GLMMs create a mixture of linear predictor values using a latent variable, the unobserved random effect vector, having a normal distribution. By contrast, latent class models use a mixture distribution that is qualitative rather than quantitative. The basic model assumes the existence of a latent categorical variable such that the observed response variables are conditionally independent, given that variable. For categorical response variables (Y₁, Y₂, ..., Y_T), the latent class model assumes a latent categorical variable Z such that for each possible sequence of response outcomes (y₁, ..., y_T) and each category z of Z,
P(Y₁ = y₁, ..., Y_T = y_T | Z = z) = P(Y₁ = y₁ | Z = z) ··· P(Y_T = y_T | Z = z).
Figure 13.1 shows the association graph for the model.

FIGURE 13.1 Association graph for latent class model.

A latent class model summarizes probabilities of classification P(Z = z)
in the latent classes as well as conditional probabilities P(Y_t = y_t | Z = z) of outcomes for each Y_t within each latent class. These are the model parameters. More generally, the latent variable Z can be multivariate. The model is an analog, for categorical responses and latent variables, of the factor analysis model for multivariate normal responses.

The latent class model is sometimes plausible when the observed variables are several indicators of some concept, such as prejudice, religiosity, or opinion about an issue. An example is Table 10.13, in which subjects gave their opinions about whether abortion should be legal in various situations. Perhaps an underlying latent variable describes one's basic attitude toward legalized abortion, such that given the value of that latent variable, responses on the observed variables are conditionally independent. For instance, the latent variable may be a qualitative variable with three categories: one class for those who always oppose legalized abortion regardless of the situation, one for those who always favor it, and one for those whose response depends on the situation.

The T-dimensional contingency table cross-classifying (Y₁, ..., Y_T) is observed. The (T + 1)-dimensional table that cross-classifies it with the latent variable is an unobserved table. Denote the number of categories of each Y_t by I and the number of latent classes of Z by q. For the observed table, let π_{y₁,...,y_T} = P(Y₁ = y₁, ..., Y_T = y_T). The model assumes a multinomial distribution over its I^T cells. For a given cell,
π_{y₁,...,y_T} = Σ_{z=1}^{q} P(Y₁ = y₁, ..., Y_T = y_T | Z = z) P(Z = z).
The conditional independence factorization for the latent class model states that
π_{y₁,...,y_T} = Σ_{z=1}^{q} [ Π_{t=1}^{T} P(Y_t = y_t | Z = z) ] P(Z = z).   (13.1)
This is a nonlinear model for the I^T multinomial probabilities.
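Formula (13.1) is simple to evaluate directly. A small sketch, using binary items for concreteness (the function and parameter names are ours, purely for illustration):

```python
from itertools import product
import math

def cell_prob(y, w, p):
    """pi_{y1,...,yT} under (13.1): a mixture over latent classes z of
    conditionally independent item probabilities, where w[z] = P(Z = z)
    and p[z][t] = P(Y_t = 1 | Z = z)."""
    return sum(
        w[z] * math.prod(p[z][t] if y[t] == 1 else 1.0 - p[z][t]
                         for t in range(len(y)))
        for z in range(len(w)))

# The I^T cell probabilities sum to 1 for any valid parameter values:
w = [0.4, 0.6]                                   # q = 2 classes
p = [[0.9, 0.8, 0.7], [0.2, 0.1, 0.3]]           # T = 3 binary items
total = sum(cell_prob(y, w, p) for y in product([0, 1], repeat=3))
```

Summing `cell_prob` over all 2³ outcome patterns returns 1, confirming that (13.1) defines a proper multinomial distribution over the observed table.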
13.1.1 Fitting Latent Class Models

Denote the counts in the observed table by {n_{y₁,...,y_T}}. Summing over the I^T cells in that table, the kernel of the multinomial log likelihood is
Σ n_{y₁,...,y_T} log π_{y₁,...,y_T}.   (13.2)
Substituting parameters from (13.1), one can maximize (13.2) with respect to those parameters using Newton–Raphson (Haberman 1979, Chap. 10) or the EM algorithm (Goodman 1974). It is helpful to note that the latent class model states that the loglinear model symbolized by (Y₁Z, Y₂Z, ..., Y_TZ) holds for the unobserved table. The model makes no assumption about the {Y_tZ} associations but assumes that the {Y_t} are mutually independent within each category of Z.

The EM algorithm has two steps in each iteration. The E (expectation) step in iteration s calculates pseudo-counts {n^{(s)}_{y₁,...,y_T,z}} for the unobserved table using {n_{y₁,...,y_T}} and a working conditional distribution for (Z | Y₁, ..., Y_T) described shortly. The M (maximization) step treats {n^{(s)}_{y₁,...,y_T,z}} as data and applies an algorithm such as iteratively reweighted least squares or IPF for fitting the model [i.e., the loglinear model (Y₁Z, Y₂Z, ..., Y_TZ)]. The fit {μ^{(s)}_{y₁,...,y_T,z}} of that model in the unobserved table then determines the new working conditional distribution of (Z | Y₁, ..., Y_T) to apply to {n_{y₁,...,y_T}} for the E-step of the next iteration. This allocates the observed data to pseudo-counts in the unobserved cells in proportion to this fit, using
n^{(s+1)}_{y₁,...,y_T,z} = n_{y₁,...,y_T} μ^{(s)}_{y₁,...,y_T,z} / Σ_{k=1}^{q} μ^{(s)}_{y₁,...,y_T,k}.
These are entries in the unobserved table for iteration (s + 1). They are used as pseudo-data for the M-step of iteration (s + 1).
Eventually, the algorithm converges to fitted values for the unobserved table that provide fitted probabilities satisfying mutual independence within each latent class, and such that the corresponding fitted probabilities in the observed table (i.e., added over the latent categories) maximize the likelihood (13.2). The fitted probabilities in the unobserved table are an estimated joint distribution for (Y₁, ..., Y_T, Z). One can use them to calculate the ML estimates of the latent class model parameters {P(Y_t = y_t | Z = z)} and {P(Z = z)}.

The EM algorithm is computationally simple and relatively stable. Each iteration increases the likelihood. However, its convergence can be slow. See Laird (1998) for a review. The log likelihood for a latent class model may have local maxima. Thus, with either the Newton–Raphson or EM algorithm, it is advisable to perform the fitting process a few times with different starting guesses for the parameter values. The EM algorithm tends to be less sensitive to the choice of starting values. Thus, some software begins with the EM algorithm and then switches to the Newton–Raphson algorithm as it approaches the ML estimates, to speed the process. As q increases, multiple local maxima are more likely and the danger of a lack of identifiability increases.

Standard errors for model parameter estimates result from inverting the model's estimated information matrix. This is a by-product of the Newton–Raphson algorithm but not of the EM algorithm. One way to obtain standard errors with the EM algorithm applies a useful formula of Louis (1982) for the observed information. It equals the expected value of the observed information for the loglinear model for the unobserved table minus the expected value of the information for the conditional distribution of Z given the observed data. Baker (1992) and Lang (1992) gave related results.
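For binary items, the E- and M-steps just described can be sketched directly in terms of the model parameters {P(Z = z)} and {P(Y_t = 1 | Z = z)}, without the loglinear/IPF machinery. This is our own illustrative reimplementation, not the algorithm as coded in Goodman (1974) or in any package; as the text advises, one should rerun it from several random starts:

```python
import numpy as np

def latent_class_em(Y, q, n_iter=200, seed=0):
    """EM for a latent class model with binary items.
    Y: (n, T) array of 0/1 responses; q: number of latent classes.
    Returns class probabilities w[z] and item probabilities p[z, t]."""
    rng = np.random.default_rng(seed)
    n, T = Y.shape
    w = np.full(q, 1.0 / q)
    p = rng.uniform(0.3, 0.7, size=(q, T))       # random start; rerun with other seeds
    for _ in range(n_iter):
        # E-step: posterior P(Z = z | y_i), using conditional independence (13.1)
        like = np.ones((n, q))
        for z in range(q):
            like[:, z] = np.prod(np.where(Y == 1, p[z], 1.0 - p[z]), axis=1)
        post = like * w
        post /= post.sum(axis=1, keepdims=True)
        # M-step: posterior-weighted proportions play the role of the pseudo-count fit
        w = post.mean(axis=0)
        p = (post.T @ Y) / post.sum(axis=0)[:, None]
    return w, p
```

One check that follows from the M-step equations: after each iteration the model's fitted marginal probabilities Σ_z w_z p_zt reproduce the sample proportions exactly, for every item t.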
Chi-squared statistics comparing observed cell counts to fitted values test the model fit. The residual df = I^T − qT(I − 1) − q. This follows since multinomial model (13.1) describes I^T − 1 multinomial probabilities using (I − 1) parameters {P(Y_t = y_t | Z = z), y_t = 1, ..., I − 1} at each of the qT combinations of z and t, and q − 1 parameters {P(Z = z)}. Often, the nature of the variables suggests a value for q, usually quite small (2 to 4). Otherwise, the usual procedure starts with q = 2; if the fit is inadequate, q increases by steps of 1 as long as the fit shows substantive improvement. Specialized software exists for such models (Appendix A).

13.1.2 Latent Class Model for Rater Agreement

Table 13.1 is an expanded data set of the example in Section 10.5. Seven pathologists classified each of 118 slides on the presence or absence of carcinoma in the uterine cervix. For modeling interobserver agreement, the conditional independence assumption of the latent class model is often plausible. With a blind rating scheme, ratings of a given subject or unit by different pathologists are independent. If subjects having true rating in a given category are relatively homogeneous, then ratings by different pathologists may be nearly independent within a given true rating class. Thus, one might posit a latent class model with q = 2 classes, one for subjects whose true rating is positive and one for subjects whose true rating is negative.
This model expresses the 2⁷ joint distribution of the seven ratings as a mixture of two 2⁷ distributions, one for each true rating class.

TABLE 13.1 Diagnoses of Carcinoma and Fits of Latent Class Models^a

        Pathologist                        Fit
A  B  C  D  E  F  G    Count    q=1     q=2     q=3
0  0  0  0  0  0  0     34      1.1    23.0    33.8
0  0  0  0  1  0  0      2      1.6     6.6     2.0
0  1  0  0  0  0  0      6      2.2    12.7     6.3
0  1  0  0  0  0  1      1      2.8     1.7     1.5
0  1  0  0  1  0  0      4      3.3     3.6     3.0
0  1  0  0  1  0  1      5      4.2     0.5     4.7
1  0  0  0  0  0  0      2      1.4     3.0     2.1
1  0  1  0  1  0  1      1      1.6     0.2     0.2
1  1  0  0  0  0  0      2      2.8     1.7     1.3
1  1  0  0  0  0  1      1      3.5     0.3     1.6
1  1  0  0  1  0  0      2      4.2     0.5     2.9
1  1  0  0  1  0  1      7      5.3     3.7     6.5
1  1  0  0  1  1  1      1      1.4     2.6     1.4
1  1  0  1  0  0  1      1      1.3     0.1     0.1
1  1  0  1  1  0  1      2      2.0     4.3     2.6
1  1  0  1  1  1  1      3      0.5     3.1     2.0
1  1  1  0  1  0  1     13      3.3    11.5     9.6
1  1  1  0  1  1  1      5      0.9     8.4     8.7
1  1  1  1  1  0  1     10      1.2    13.5    13.6
1  1  1  1  1  1  1     16      0.3     9.9    12.3

^a Fits obtained with Latent Gold (Statistical Innovations, Belmont, MA); 1, yes; 0, no.
Source: Based on data in Landis and Koch (1977), not showing empty cells.

Table 13.2 shows results of fitting some latent class models (including a mixture model studied in Section 13.2.4). Because the observed table is sparse, the deviance is mainly useful for comparing models. This is an informal comparison, though, since the chi-squared distribution does not apply for comparing deviances of models with different numbers of latent classes. A model with q classes is a special case of a model with q* > q classes in which P(Z = z) = 0 for z > q and hence falls on the boundary of the parameter space. Ordinary chi-squared likelihood-ratio tests require parameters to fall in the interior of the parameter space [i.e., 0 < P(Z = z) < 1 for z = 1, ..., q*].

Table 13.1 also shows the fitted values for latent class models with q = 1, 2, 3, for the cells having positive counts. (Each empty cell also has a fitted value, not shown here.)
The model with q = 1 latent class is the model of mutual independence of the seven ratings. Equivalently, it is the loglinear model (Y₁, Y₂, ..., Y₇). It fits poorly, as one would expect. With q = 2, considerable evidence of lack of fit remains. For instance, the fitted count for a negative rating by each pathologist is 23.0, compared to an observed count of 34. (The small G² that Table 13.2 reports for this model does not imply a good fit; in Section 9.8.4 we noted that G² tends to be highly conservative when most fitted values are very close to 0.) The model with q = 3 seems to fit adequately.

TABLE 13.2 Likelihood-Ratio Statistics for Latent Class Models Fitted to Table 13.1^a

Model                              Number of        Deviance (G²)    df
                                   Latent Classes
Mutual independence                1                476.8            120
Latent class                       2                 62.4            112
Rasch mixture                      2                 67.6            118
Latent class                       3                 15.3            104
Rasch mixture                      3                 27.5            116
Latent class                       4                  6.4             96
Rasch mixture (quasi-symmetry)     4                 23.7            114

^a Models fitted with Latent Gold (Statistical Innovations, Belmont, MA).

Studying the estimated probability P(Y_t = 1 | Z = z) of a carcinoma diagnosis for each pathologist, conditional on a given latent class z, helps illuminate the nature of these classes. Table 13.3 reports these for the three-class model. They suggest that (1) the first latent class refers to cases that all pathologists (except occasionally B) agree show no carcinoma; (2) the third latent class refers to cases in which A, B, E, and G agree show carcinoma and C and D usually agree; and (3) the second latent class refers to cases of strong disagreement, whereby C, D, and F rarely diagnose carcinoma but B, E, and G usually do. The estimated proportions in the three latent classes are P̂(Z = 1) = 0.37, P̂(Z = 2) = 0.18, and P̂(Z = 3) = 0.45. The model estimates that 18% of the cases fall in the problematic class.
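Given class-specific estimates of this kind, Bayes' theorem converts them into posterior class probabilities for any observed rating pattern. A sketch with illustrative numbers (not the fitted values from the tables; the function name is ours):

```python
import math

def posterior_class(w, p, y):
    """P(Z = z | Y = y) by Bayes' theorem, given class probabilities w[z] and
    item probabilities p[z][t] = P(Y_t = 1 | Z = z), assuming conditional
    independence of the items given Z."""
    joint = [w[z] * math.prod(p[z][t] if y[t] == 1 else 1.0 - p[z][t]
                              for t in range(len(y)))
             for z in range(len(w))]
    total = sum(joint)                      # marginal probability of pattern y
    return [j / total for j in joint]

# With one very accurate binary item, a positive rating points strongly to class 2:
post = posterior_class([0.5, 0.5], [[0.1], [0.9]], [1])
```

Here `post` is [0.1, 0.9]: the single positive rating shifts the prior (0.5, 0.5) toward the class that usually produces positive ratings.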
TABLE 13.3 Estimated Probabilities of Diagnosing Carcinoma, for Latent Class Model and Rasch Mixture Model with Three Classes^a

                             Pathologist
Model     Latent Class    A       B       C       D       E       F       G
Latent    1               0.057   0.138   0.000   0.000   0.055   0.000   0.000
Class     2               0.513   1.00    0.000   0.058   0.751   0.000   0.631
          3               1.000   0.981   0.858   0.586   1.000   0.476   1.000
Rasch     1               0.022   0.150   0.001   0.000   0.047   0.000   0.022
Mixture   2               0.611   0.923   0.052   0.015   0.774   0.009   0.611
Model     3               0.994   0.999   0.853   0.617   0.997   0.483   0.994

^a Results obtained with Latent Gold (Statistical Innovations, Belmont, MA).

A danger with latent variable models, shared by factor analysis for continuous responses, is the temptation to interpret latent variables too literally. In this example it is tempting to treat latent class 1 (latent class 3) as cases truly without carcinoma (with carcinoma). Thus, it is tempting to treat a rating of no carcinoma (a rating of carcinoma), given that the subject falls in latent class 1 (latent class 3), as necessarily being a correct judgment. One should realize the tentative nature of the latent variable. Be careful not to make the error of reification: treating an abstract construction as if it has actual existence (Gould 1981).

Using model parameter estimates and Bayes' theorem, one can also estimate P(Z = z | Y_t = y_t) and P(Z = z | Y₁ = y₁, ..., Y_T = y_T). If a pathologist makes a "yes" rating, for instance, what is the estimated probability that the subject is in the latent class for which agreement on a positive rating usually occurs? We perform further analysis in Section 13.2.5 after studying a simpler model.

Espeland and Handelman (1989), Uebersax (1993), Uebersax and Grove (1990, 1993), and Yang and Becker (1997) presented various latent variable models for rater agreement and diagnostic accuracy. One could also use methods of Chapters 11 and 12, such as a model with a continuous rather than qualitative latent variable.
A logistic-normal random intercept model, for instance, yields subject-specific comparisons of P(Y_t = 1) for various t.

13.1.3 Latent Class Models for Capture–Recapture

We next apply latent class models to capture–recapture modeling for estimating population size. In Section 12.3.6 a logistic-normal GLMM was used for this. With T sampling occasions, a 2^T contingency table displays the data, with scale (captured, not captured) at each occasion. A prediction of the population size equals the prediction for the missing cell count, representing subjects not captured on any occasion, added to the counts in the other cells.

With two classes, the latent class model treats the population as a mixture of two types, perhaps determined by genetic or environmental factors. Homogeneity of capture probabilities occurs for subjects within each type, but the type of any given subject is unknown. This model represents a compromise between the mutual independence model, which assumes a single latent class and complete homogeneity, and the logistic-normal GLMM, which assumes a continuous mixture of capture probabilities rather than two classes.

We illustrate with the T = 6-capture data set on snowshoe hares in Table 12.6. The model of mutual independence predicts N̂ = 75. Its 95% profile-likelihood confidence interval for N is (70, 83). The latent class model with two classes has N̂ = 85 and a profile-likelihood confidence interval of (74, 106). The latent class model with three classes gives similar results. Since the logistic-normal GLMM in Section 12.3.6 gave the interval (75, 154), these seem too short to be trusted. This simple latent class model may not capture
13.2 NONPARAMETRIC RANDOM EFFECTS MODELS

In spite of its popularity and attractive features, the normality assumption for random effects in ordinary GLMMs can rarely be closely checked. For instance, in studying normal GLMMs, Verbeke and Lesaffre (1996) noted that under a normality assumption for random effects, their predicted values often appear normally distributed even when the true values are generated from a highly nonnormal distribution. An obvious concern with this or any parametric assumption for the random effects is the possibly harmful effect of misspecification. To check sensitivity to this assumption, one can fit GLMMs using alternative or more general random effects assumptions.

13.2.1 Logit Models with Unspecified Random Effects Distribution

A nonparametric approach (e.g., Aitkin 1999) guards against possibly harmful misspecification effects. This uses an unspecified random effects distribution on a finite set of mass points. The locations of the mass points and their probabilities are parameters. The number of mass points can be fixed. When this number is itself unknown, one treats it as fixed in the estimation process but increases it sequentially until the likelihood is maximized. The maximization usually requires relatively few mass points. Even allowing a continuous mixture distribution, the nonparametric estimate of that distribution takes a finite number of points (e.g., Lindsay et al. 1991). In fact, fitting a model having only two mass points often results in fixed effects estimates quite similar to those with the full maximization. This approach is useful primarily when the random effects distribution is not itself of direct interest, since the nonparametric estimate of that distribution tends to be poor even for very large samples. Model fitting is actually simpler than for models with normal random effects, since the integral that determines the likelihood function simplifies to a finite sum.
In Section 13.2.4 we discuss this point with a Rasch-type model. Specialized software can fit nonparametric mixture models (Appendix A).

However, this approach also has disadvantages. For instance, with multivariate random effects it cannot provide simple correlation structure as the normal can. Standard inference does not apply for comparing models with different numbers of mass points, since one model is on the boundary of the parameter space compared to the other. Also, the ML estimate of the random effects distribution often places some weight at ±∞. Although this can be useful with binary data for identifying a subsample for which the estimated response probability equals 1 or equals 0 for all observations in a cluster, it is not then possible to describe heterogeneity with an estimated variance component.

To illustrate this approach, we reanalyze Table 10.13 on attitudes about legalized abortion. In Section 12.3.2 we fitted the logistic-normal model (12.10),

  logit P(Yit = 1 | ui) = ui + βt + γx,   (13.3)

with x = gender and parameters {βt} representing three conditions under which abortion might be legal. Treating ui instead nonparametrically, the likelihood maximizes with a two-point mixture distribution. Estimated abortion item effects are β̂1 − β̂3 = 0.83 (SE = 0.16), β̂2 − β̂3 = 0.30 (SE = 0.16), and β̂1 − β̂2 = 0.52 (SE = 0.16). Results are similar to those that Table 12.3 shows for the normal random effects approach (Section 12.3.2).

13.2.2 Nonparametric Mixing of Logistic Regression

Follman and Lambert (1989) presented an example with a prespecified number of mass points. They analyzed the effect of the dosage of a poison on the probability of death of a protozoan of a particular genus. Table 13.4 shows the data. They assumed two unobserved types of that genus. Let πi(x) denote the probability of death at log dose level x for genus type i, i = 1, 2.
Let ρ denote the probability that a protozoan belongs to genus type 1. Their model specifies

  π(x) = ρπ1(x) + (1 − ρ)π2(x),  where  logit πi(x) = αi + βx,

with unknown ρ. The curve for π(x) is a weighted average of two curves having the same shapes but different intercepts. The ordinary logistic regression model is the special case ρ = 1. Its fit, logit[π̂(x)] = −68.4 + 42.1x (with SE = 3.8 for β̂ = 42.1), is poor, with deviance G² = 24.7 (df = 6). The fit of the mixture model is

  π̂(x) = 0.34π̂1(x) + 0.66π̂2(x),

with

  logit π̂1(x) = −196.2 + 124.8x,   logit π̂2(x) = −205.7 + 124.8x,

and SE = 25.2 for β̂ = 124.8.

TABLE 13.4 Number of Protozoa Exposed to Poison Dose and Number That Died

Poison Dose   Exposed   Dead        Poison Dose   Exposed   Dead
4.7           55        0           5.1           53        22
4.8           49        8           5.2           53        37
4.9           60        18          5.3           51        47
5.0           55        18          5.4           50        50

Source: Follman and Lambert (1989). Reprinted with permission from the Journal of the American Statistical Association.

FIGURE 13.2 Fit of binary mixture of logistic regressions to Table 13.4 [model fitted using Latent Gold (Statistical Innovations, Belmont, MA)].

Figure 13.2 shows the fit. This is much better, with G² = 3.4 (df = 4); that is, double the maximized log-likelihood increases by 24.7 − 3.4 = 21.3 by adding two parameters: an additional intercept and the probability for the mixture. Follman and Lambert noted that with eight dose levels, at most two mixture points are identifiable for this model. The ordinary GLMM assumes a normal mixture of logistic curves. It gives a deviance reduction of only 1.7 compared to the ordinary logistic model with ρ = 1.

13.2.3 Is Misspecification a Serious Problem?

Is it worth the trouble to consider alternatives to the normality assumption for random effects in GLMMs, whether they be parametric or nonparametric? Not much work exists on investigating misspecification effects.
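The log-likelihood comparison in the Follman and Lambert example can be checked directly from the Table 13.4 counts by evaluating both published fits. A sketch (the published estimates are rounded, so the computed difference is only approximately half of 24.7 − 3.4):

```python
import math

# Table 13.4: (poison dose, number exposed, number dead); x = log(dose).
data = [(4.7, 55, 0), (4.8, 49, 8), (4.9, 60, 18), (5.0, 55, 18),
        (5.1, 53, 22), (5.2, 53, 37), (5.3, 51, 47), (5.4, 50, 50)]

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def pi_single(x):
    """Published ordinary logistic fit."""
    return expit(-68.4 + 42.1 * x)

def pi_mixture(x):
    """Published two-point mixture fit, with mixing probability 0.34."""
    return 0.34 * expit(-196.2 + 124.8 * x) + 0.66 * expit(-205.7 + 124.8 * x)

def loglik(pi_fn):
    """Binomial log-likelihood kernel: sum of y log(pi) + (n - y) log(1 - pi)."""
    ll = 0.0
    for dose, n, y in data:
        p = pi_fn(math.log(dose))
        if y > 0:
            ll += y * math.log(p)
        if n - y > 0:
            ll += (n - y) * math.log(1 - p)
    return ll

# Should be close to (24.7 - 3.4)/2, up to rounding of the published estimates.
print(loglik(pi_mixture) - loglik(pi_single))
```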
For logistic random intercept models, different assumptions about the random effects distribution often provide similar results for estimating the regression effects. Choosing an incorrect random effects distribution does not tend to bias estimators of those effects, although a skewed true distribution for the random effects can result in some bias for the normal intercept estimator (Neuhaus et al. 1992). The choice of random effects distribution also usually has little impact on the efficiency of estimation. When the true random effects distribution is dramatically far from normal, there can be some efficiency loss for the logistic-normal estimator. This can happen when the true distribution is a two-point mixture with a large variance component.

B. Caffo and I studied this with various models, such as a simple one-way random effects model. In cluster i, let yit be a Bernoulli variate satisfying

  logit P(Yit = 1 | ui) = α + ui,   i = 1, ..., n,  t = 1, ..., T,   (13.4)

where var(ui) = σ². Simulated samples from this model used various n, T, α, and σ, and various true distributions for ui, including normal, uniform, exponential, and binary. Usually, assuming normality does not hurt when the true distribution is nonnormal. Also, using a nonparametric approach when the true distribution is normal does not result in much efficiency loss [Neuhaus and Lesperance (1996) noted this for a related model]. However, when the true distribution is a two-point mixture, the normal approach loses efficiency in estimating {μi = P(Yit = 1 | ui)} as σ and T increase. For example, when n = T = 30, α = 0, and the mixture has probability 0.5 at each point, the expected value of μ̂i − μi is (0.06, 0.05) for the (normal, nonparametric) approach when σ = 0.5, (0.06, 0.02) when σ = 1.0, and (0.04, 0.01) when σ = 2.0. Differences for estimating α are less dramatic. The example from Follman and Lambert (1989)
discussed in Section 13.2.2, which has a covariate but T = 1, illustrates the potential efficiency loss with the logistic-normal GLMM. The two-point mixture model has β̂ = 124.8 with SE = 25.2, for which β̂/SE = 4.9. The normal mixture model has β̂ = 65.5 with SE = 19.5, for which β̂/SE = 3.4.

Our study suggested that the random effects distribution has to be rather extremely nonnormal for the normal GLMM to suffer in bias or efficiency. However, Heagerty and Zeger (2000) (see also McCulloch 1997) noted that other types of misspecification can be more crucial. Regarding bias, they argued that sensitivity to the random effects assumption is greater for estimating regression parameters in random effects models than for estimating their counterparts in corresponding marginal models. They illustrated this with a model violation by which the variance of the random effects depends on values of covariates. They concluded that between-cluster effects may be more sensitive to correct specification of the random effects distribution than within-cluster effects. This is an advantage of using marginal models for between-cluster effects.

13.2.4 Rasch Mixture Model

From Section 12.1.4, for subject i with item t the Rasch model for a binary response is

  logit P(Yit = 1 | ui) = ui + βt,   t = 1, ..., T.   (13.5)

The GLMM treats {ui} as normal random effects. Lindsay et al. (1991) studied this model when ui instead can assume only a finite number q of values. Denote the distribution of the latent variable ui, which is the same for all i, by

  P(U = ak) = ρk,   k = 1, ..., q,

for unknown {ak} and {ρk}. For identifiability one can place a constraint either on this distribution, such as Σk ρk ak = 0, or on {βt}. This model is called a Rasch mixture model.

Like other random effects models, the Rasch mixture model is a latent variable model.
The random effect ui is unobserved, and the T responses are assumed conditionally independent at each fixed ui value. It differs from the ordinary latent class model for binary responses having q latent classes (Section 13.1), since it assumes structure (13.5) for P(Yit = 1 | ui), whereas latent class model (13.1) assumes no structure for P(Yt = yt | Z = z). This model is simpler to fit than GLMMs with normal random effects, because the GLMM's intractable integral that determines the likelihood function is replaced by a finite sum. The marginal probability of a sequence of responses (y1, ..., yT) is

  π_{y1,...,yT} = Σ_{k=1}^{q} ρk Π_{t=1}^{T} exp[yt(ak + βt)] / [1 + exp(ak + βt)].

Substituting this in the multinomial log likelihood (13.2), ML estimation of {ak, ρk} and {βt} can proceed using Newton–Raphson or EM algorithms.

As q increases, the maximized likelihood increases and the fit improves. However, Lindsay et al. (1991) showed that with T items, the likelihood no longer changes once q = (T + 1)/2. Then, the model gives the same fit to the 2^T observed table as the quasi-symmetry model (10.33). Thus, this simpler latent class model has a symmetric conditional association structure among the observed variables. Arminger et al. (2000) extended the Rasch mixture model to incorporate covariates.

13.2.5 Modeling Rater Agreement

For the ratings of carcinoma by seven pathologists (Table 13.1), Table 13.2 also summarizes the fit of Rasch mixture models. Here, P(Yit = 1 | ui) in (13.5) denotes the probability of a carcinoma diagnosis for pathologist t evaluating slide i. With q = 3 (i.e., ui can take 3 values), it does not fit significantly more poorly than the latent class model. With T = 7 raters, the discrete mixture can take at most (T + 1)/2 = 4 points. The model with q = 4 is equivalently the quasi-symmetry model. It does not seem to fit better than with q = 3.
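The finite-sum form of the marginal probability in Section 13.2.4 makes likelihood evaluation elementary: for each mass point, multiply Bernoulli terms across items, then weight by ρk. A minimal sketch with illustrative (not fitted) parameter values; summing over all 2^T response sequences checks the bookkeeping:

```python
import math
from itertools import product

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def marginal_prob(y, a, rho, beta):
    """Rasch mixture marginal probability of response sequence y:
    sum over mass points a_k of rho_k times a product of Bernoulli terms."""
    prob = 0.0
    for a_k, rho_k in zip(a, rho):
        term = rho_k
        for y_t, b_t in zip(y, beta):
            p_t = expit(a_k + b_t)            # P(Y_t = 1 | U = a_k)
            term *= p_t if y_t == 1 else 1 - p_t
        prob += term
    return prob

# Illustrative parameters: q = 3 mass points, T = 4 items (not fitted values).
a = [-3.0, 0.0, 3.0]
rho = [0.4, 0.2, 0.4]
beta = [0.5, -0.5, 1.0, -1.0]

total = sum(marginal_prob(y, a, rho, beta) for y in product([0, 1], repeat=4))
print(total)   # should sum to 1 over all 2^T sequences
```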
FIGURE 13.3 Pathologist estimates for Rasch mixture model and results of 90% Bonferroni simultaneous comparison.

Figure 13.3 shows {β̂t} for the Rasch mixture model with q = 3, setting Σt β̂t = 0. These describe variation among the pathologists' response distributions at each latent level. For a given latent class, for instance, the estimated odds of a carcinoma diagnosis for pathologist B are exp(3.52 − 1.48) = 7.7 times the estimated odds for pathologist A. Pathologist B tends to make a carcinoma diagnosis most often, and D and F the least. The figure also shows results of a 90% Bonferroni comparison of the 21 pairs of pathologists, based on standard errors of the pairwise differences β̂t − β̂s.

For pathologist t, conditional on latent level k for a slide,

  exp(âk + β̂t) / [1 + exp(âk + β̂t)]

estimates the probability of a carcinoma diagnosis. Table 13.3 reports these, which use â1 = −5.25, â2 = −1.02, and â3 = 3.63. They are similar to the estimates for the ordinary latent class model but a bit smoother, with fewer estimates at the boundary. Again, at latent level 1 pathologists tend not to diagnose carcinoma, at level 2 many disagreements occur, and at level 3 pathologists tend to diagnose carcinoma. The estimated latent class proportions are ρ̂1 = 0.37, ρ̂2 = 0.19, and ρ̂3 = 0.43, with 19% of cases falling in the problematic class.

Model (13.5) implies that the association between each Yt and U has log odds ratio (ak − al) for levels k and l of U. For instance, in the third latent class the estimated odds that a pathologist diagnoses carcinoma are exp[3.63 − (−5.25)] > 7000 times those in the first latent class. In terms of the estimated probabilities in Table 13.3, for pathologist A this odds ratio is (0.994/0.006)/(0.022/0.978). The large {âk − âl} suggest strong association between each pathologist's rating and the latent variable.
This induces strong association between pairs of pathologist ratings. (The model-fitted odds ratios between pairs of raters vary between about 7 and 400.) However, the quite varied {β̂t} suggest that substantial marginal heterogeneity exists among the seven ratings. This causes heterogeneity in pairwise levels of agreement.

The mutual independence model is the special case of the Rasch mixture model with q = 1; that is, ρ1 = 1. For Table 13.1 the Rasch mixture model with q = 3 has only four more parameters than the mutual independence model (i.e., ρk and ak, k = 1, 2). Yet it fits well and has simple interpretations. See Agresti and Lang (1993b) for further details and a simpler model that sets a1 − a2 = a2 − a3.

13.2.6 Other Models for Capture–Recapture

In Section 13.1.3 latent class models were used for capture–recapture experiments. Alternatively, one could use the Rasch mixture model. Model (13.5) with two classes gives N̂ = 77 and a 95% profile-likelihood confidence interval of (71, 87). This seems too short to trust. It is more realistic to allow a continuous distribution for capture probabilities. Model (13.5) treating ui as normal rather than binary does this, and in Section 12.3.6 we used it for these data.

So, which models might be used other than a parametric random effects model? One possibility is a loglinear model (Cormack 1989). This is a marginal model, applying to probabilities averaged over subjects. Let Yt denote the binary capture variable for a randomly selected subject at occasion t, with categories (captured, not captured). The simplest model, denoted by (Y1, Y2, ..., YT), assumes that capture events are mutually independent. This is equivalent to the logistic-normal model (13.5) with σ = 0 and to latent class model (13.1) with q = 1. A more plausible model allows an association between pairs of capture variables. This is equivalently the loglinear model denoted (Y1Y2, Y1Y3, ...
, YT−1YT). Alternatively, a model with Markov structure such as (Y1Y2, Y2Y3, ..., YT−1YT) may be useful. Usually, insufficient data exist to warrant using very complex loglinear models. For any such model, its fit for the 2^T − 1 observed cells projects to the remaining cell to predict the number of subjects unobserved at all occasions.

A connection exists between the nonparametric random effects and loglinear approaches. In Section 13.2.7 we show that assuming model (13.5) but using a nonparametric treatment of ui implies a loglinear model of quasi-symmetric form for the marginal model. The quasi-symmetry model (10.33) itself is not useful for this problem, because any count in the missing cell is consistent with it. The model has an interaction parameter pertaining to that cell alone, which results in a likelihood equation equating that cell count to its fitted value. So, information in other cells does not help in the estimation of the expected frequency in that cell. However, special cases of quasi-symmetry are useful (Darroch et al. 1993). An example is the loglinear model with the same association for each pair of occasions. Like the logistic-normal model, this model of exchangeable association has only one more parameter than the mutual independence model.

For the snowshoe hare data of Table 12.6, the model with exchangeable two-factor association has N̂ = 90.5 and a confidence interval of (75, 125). This interval and the one of (71, 87) for the Rasch mixture model with q = 2 are substantially narrower than the interval (75, 154) for the logistic-normal model (Section 12.3.6). In capture–recapture experiments, N̂ and the confidence interval for N depend strongly on the choice of model. The problem is inherently one of prediction. Estimating N requires extrapolating from the observed numbers of subjects having 1, 2, ..., T captures to the number of subjects with 0 captures.
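Under the simplest (mutual independence) model, this extrapolation reduces to a classical fixed-point calculation: with n subjects observed at least once and nt captures at occasion t, the ML estimate (ignoring integer constraints) satisfies N̂ = n / [1 − Πt(1 − nt/N̂)], an equation going back to Darroch (1958). A sketch with synthetic counts (not the snowshoe hare data):

```python
def nhat_independence(n_obs, captures, tol=1e-8, max_iter=1000):
    """Fixed-point iteration for N-hat under mutual independence:
    N = n_obs / (1 - prod_t (1 - n_t / N)), ignoring integer constraints."""
    N = float(n_obs) + 1.0                      # start just above n_obs
    for _ in range(max_iter):
        prod = 1.0
        for n_t in captures:
            prod *= 1.0 - n_t / N               # P(missed at occasion t)
        N_new = n_obs / (1.0 - prod)
        if abs(N_new - N) < tol:
            return N_new
        N = N_new
    return N

# Synthetic example: 6 occasions, per-occasion capture counts, 68 distinct animals seen.
captures = [25, 22, 28, 21, 26, 24]
N_hat = nhat_independence(68, captures)
print(N_hat)
```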
Standard goodness-of-fit criteria are of limited help. Two models can fit the data well, yet yield quite different estimates for the unobserved count. For instance, for the snowshoe hare data, the loglinear models of mutual independence and of two-factor association both fit the observed cells relatively well (G² = 58.3, df = 56 for mutual independence and G² = 32.4, df = 41 for the two-factor model); however, their N̂ values are 75 and 105. Simpler models usually give narrower confidence intervals for N, through the usual benefits of model parsimony. This is not necessarily good. A narrow confidence interval for N is desirable, but not at the expense of a severe sacrifice in the actual confidence level. Intervals based on a possibly unrealistic assumption of subject homogeneity may be overly optimistic. Simulations suggest that actual coverage probabilities are often well below nominal levels when even slight model misspecification occurs. Allowance for heterogeneity among subjects results in wider intervals. Severe population heterogeneity makes reaching useful conclusions difficult, as intervals can be very wide (Burnham and Overton 1978; Coull and Agresti 1999).

13.2.7 Nonparametric Mixtures and Quasi-symmetry

A distribution-free approach for ui with the Rasch form of model (13.5) implies the quasi-symmetry loglinear model marginally (Darroch 1981; Tjur 1982). We now show this result, to which we alluded in Section 10.4.2. Let Yi denote the sequence of T responses for subject i. For possible outcomes y = (y1, ..., yT), where each yt = 1 or 0,

  P(Yi = y | ui) = Πt [exp(ui + βt) / (1 + exp(ui + βt))]^{yt} [1 / (1 + exp(ui + βt))]^{1−yt}
               = exp[ui(Σt yt) + Σt ytβt] / Πt [1 + exp(ui + βt)].

Let F denote the cdf of ui. The marginal probability of sequence y for a randomly selected subject is (suppressing the subject label)

  π_{y1,...,yT} = E_U P(Y = y | U) = exp(Σt ytβt) ∫ { exp[u(Σt yt)] / Πt [1 + exp(u + βt)] } dF(u).
This probability contributes to the log likelihood, which is (13.2) for a multinomial distribution over the 2^T cells for possible y. Regardless of the choice for F, the integral is complex. However, it depends on the data only through Σt yt. A more general model replaces this integral by a separate parameter for each value of Σt yt. This model has form

  log π_{y1,...,yT} = Σt ytβt + λ_{y1+⋯+yT}.   (13.6)

The final term represents a separate parameter at each value of Σt yt. The implied marginal model (13.6) has an interaction term that is invariant to a permutation of the response outcomes y, since each such permutation yields the same sum, Σt yt. Thus, it is the loglinear model of quasi-symmetry (10.33). No matter what form F takes, the marginal model has the same main effect structure, and it has an interaction term that is a special case of the one in (13.6). Thus, one can consistently estimate {βt} using the ordinary ML estimates for the loglinear model. In fact, Tjur (1982) showed that these estimates are also the conditional ML estimates, treating {ui} as fixed effects and conditioning on their sufficient statistics. The interaction parameters in model (13.6) result from the dependence in responses among variables, due to heterogeneity in {ui}.

We illustrate for the opinions about legalized abortion analyzed in Sections 10.7.2 and 12.3.2 and with a nonparametric random effects approach in Section 13.2.1. For model (13.3), estimated within-subject comparisons βt − βs of items result from fitting a quasi-symmetric loglinear model. Let μg(y1, y2, y3) denote the expected frequency for gender g making response yt to item t, t = 1, 2, 3, where for item t, yt = 1 for approval of legalized abortion and 0 for disapproval. The loglinear model is

  log μg(y1, y2, y3) = β1y1 + β2y2 + β3y3 + γg + λ_{y1+y2+y3}.   (13.7)
For y1 + y2 + y3 = k, λk refers to all cells in which subjects voiced approval for k of the three items, k = 0, 1, 2, 3. The ML fit, which has G² = 10.2 with df = 9, yields β̂1 − β̂2 = 0.521 (SE = 0.154), β̂1 − β̂3 = 0.828 (SE = 0.160), and β̂2 − β̂3 = 0.307 (SE = 0.161). These are similar to the normal random effects estimates (Table 12.3) and the nonparametric random effects estimates in Section 13.2.1. They also are the conditional ML estimates for model (13.3), treating {ui} as fixed. With this approach or conditional ML, however, one cannot estimate between-groups effects, such as the gender effect in model (13.3). [The γ parameter in model (13.7) refers to the relative sample sizes of males and females and is not the same as the gender effect in (13.3).]

13.3 BETA-BINOMIAL MODELS

The beta-binomial model is a parametric mixture model that is another alternative to binary GLMMs with normal random effects. As with other mixture models that assume a binomial distribution at a fixed parameter value, the marginal distribution permits more variation than the binomial. Thus, a model using the beta-binomial is a way to handle overdispersion occurring with ordinary binomial models.

13.3.1 Beta-Binomial Distribution

The beta-binomial distribution results from a beta distribution mixture of binomials. Suppose that (a) given π, Y has a binomial distribution, bin(n, π), and (b) π has a beta distribution. The beta probability density function is

  f(π; α, β) = [Γ(α + β) / (Γ(α)Γ(β))] π^{α−1} (1 − π)^{β−1},   0 ≤ π ≤ 1,   (13.8)

with parameters α > 0 and β > 0, for the gamma function Γ(·). Let

  μ = α/(α + β),   θ = 1/(α + β).

The beta distribution for π has mean and variance

  E(π) = μ,   var(π) = μ(1 − μ)θ/(1 + θ).

When α and β exceed 1.0, the distribution is unimodal, with skew to the right when α < β, skew to the left when α > β, and symmetry when α = β. It simplifies to the uniform distribution when α = β = 1.
Marginally, averaging with respect to the beta distribution for π, Y has the beta-binomial distribution. Its mass function is

  p(y; α, β) = C(n, y) B(α + y, n + β − y) / B(α, β),   y = 0, 1, ..., n,

where C(n, y) denotes the binomial coefficient and B(·, ·) the beta function. In terms of μ and θ, the beta-binomial mass function is

  p(y; μ, θ) = C(n, y) [Π_{k=0}^{y−1} (μ + kθ)] [Π_{k=0}^{n−y−1} (1 − μ + kθ)] / [Π_{k=0}^{n−1} (1 + kθ)].   (13.9)

It is easier to understand the nature of this distribution from its moments than from its mass function. The first two moments are

  E(Y) = nμ,   var(Y) = nμ(1 − μ)[1 + (n − 1)θ/(1 + θ)].

As θ → 0 in the beta distribution, var(π) → 0 and that distribution converges to a degenerate distribution at μ. Then var(Y) → nμ(1 − μ) and the beta-binomial distribution converges to the bin(n, μ).

13.3.2 Models Using the Beta-Binomial Distribution

Models using the beta-binomial distribution permit μ [and hence E(Y)] to depend on explanatory variables. The simplest models let θ be the same unknown constant for all observations. [Prentice (1986) considered extensions where it could also depend on covariates.] Like GLMs, models can use various link functions, but the logit is most common. For observation i with ni trials, assuming that yi has a beta-binomial distribution with index ni and parameters (μi, θ), the model links μi to predictors by

  logit(μi) = α + β′xi.

The beta-binomial is not in the natural exponential family, even for known θ. Articles using beta-binomial models have employed a variety of fitting methods (Note 13.4). Crowder (1978) discussed the likelihood behavior for an ANOVA-type model. Hinde and Demétrio (1998) obtained the ML fit by iterating between solving the likelihood equations for the regression parameters β, for fixed θ, and solving the likelihood equation for θ for fixed β. Each part can use Newton–Raphson. McCulloch and Searle (2001, p. 61) showed the asymptotic covariance matrix of (μ̂, θ̂) and of (α̂, β̂)
for independent observations from a single beta-binomial distribution.

A related but simpler approach for overdispersed binary counts uses quasi-likelihood with a variance function similar to that of the beta-binomial. The quasi-likelihood variance function is

  v(μi) = ni μi(1 − μi)[1 + (ni − 1)ρ]   (13.10)

with |ρ| ≤ 1. Although motivated by the beta-binomial model, this variance function results merely from assuming that πi has a distribution with var(πi) = ρμi(1 − μi). It also results from assuming a common correlation ρ between each pair of the ni individual binary random variables that sum to yi (Altham 1978). The ordinary binomial variance results when ρ = 0. Overdispersion occurs when ρ > 0.

For this quasi-likelihood approach, Williams (1982) gave an iterative routine for estimating β and the overdispersion parameter ρ. He let ρ̂ be such that the resulting Pearson X², which sums the squared Pearson residuals for this variance function, equals the residual df for the model. This requires an iterative two-step process of (1) solving the quasi-likelihood equations for β for a given ρ̂, and then (2) using the updated β̂, solving for ρ̂ in the equation that equates X² (which depends on β̂ and ρ̂) to its df.

An alternative quasi-likelihood approach uses the simpler variance function

  v(μi) = φ ni μi(1 − μi)   (13.11)

introduced in Section 4.7.3. The ordinary binomial variance has φ = 1.0, and overdispersion has φ > 1. With this approach, β̂ is the same as its ML estimate for the ordinary binomial model. Commonly, φ̂ = X²/df, where X² is the Pearson fit statistic for the binomial model (Finney 1947). The standard errors for the overdispersion approach multiply those for the binomial model by φ̂^{1/2}. Liang and McCullagh (1993) showed several examples using these two variance functions.
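The beta-binomial mass function (13.9) is straightforward to compute directly, and the moment formulas above can be verified numerically. A sketch using only the standard library, with illustrative parameter values:

```python
import math

def beta_binom_pmf(y, n, mu, theta):
    """Beta-binomial mass function (13.9), parameterized by (mu, theta)."""
    num = 1.0
    for k in range(y):
        num *= mu + k * theta
    for k in range(n - y):
        num *= 1 - mu + k * theta
    den = 1.0
    for k in range(n):
        den *= 1 + k * theta
    return math.comb(n, y) * num / den

n, mu, theta = 10, 0.3, 0.2
pmf = [beta_binom_pmf(y, n, mu, theta) for y in range(n + 1)]
mean = sum(y * p for y, p in zip(range(n + 1), pmf))
var = sum((y - mean) ** 2 * p for y, p in zip(range(n + 1), pmf))

print(sum(pmf))   # should be 1 up to floating-point rounding
print(mean)       # should be n * mu, here approximately 3.0
print(var)        # should be n*mu*(1-mu)*(1 + (n-1)*theta/(1+theta))
```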
A plot of the standardized residuals for the ordinary binomial model against the indices {ni} can provide insight about which is more appropriate. When the residuals show an increasing trend in their spread as ni increases, the beta-binomial-type variance function may be more appropriate. This is because when the beta-binomial variance holds, the residuals from an ordinary binomial model have a denominator that is progressively too small as ni increases. The two quasi-likelihood approaches are equivalent when the {ni} are identical. Only when the indices vary considerably might results differ much. Because the variance function v(μi) = φ ni μi(1 − μi) has a structural problem when ni = 1 (Problem 13.33) and has less direct motivation, we prefer quasi-likelihood with the beta-binomial variance function.

13.3.3 Teratology Overdispersion Example Revisited

Refer back to Table 4.5 on results of a teratology experiment analyzed by Liang and McCullagh (1993) and Moore and Tsiatis (1991). Female rats on iron-deficient diets were assigned to four groups. Group 1 was given only placebo injections. The other groups were given injections of an iron supplement according to various schedules. The rats were made pregnant and then sacrificed after 3 weeks. For each fetus in each rat's litter, the response was whether the fetus was dead. Because of unmeasured covariates, it is natural to permit the probability of death to vary from litter to litter within a particular treatment group.

Let yi denote the number dead out of the ni fetuses in litter i. Let πit denote the probability of death for fetus t in litter i. First, suppose that yi is a bin(ni, πit) variate, with

  logit(πit) = α + β2z2i + β3z3i + β4z4i,

where zgi = 1 if litter i is in group g and 0 otherwise. This model treats all litters in a group g as having the same probability of death, exp(α + βg)/[1 + exp(α + βg)], where β1 = 0.
However, it has evidence of overdispersion, with X² = 154.7 and G² = 173.5 (df = 54). Table 13.5 shows the ML estimates and standard errors.

TABLE 13.5 Estimates for Several Logit Models Fitted to Table 4.5^a

Parameter        Binomial ML       QL(1)             QL(2)             GEE               GLMM
Intercept         1.144 (0.129)     1.212 (0.223)     1.144 (0.219)     1.144 (0.276)     1.802 (0.362)
Group 2          −3.322 (0.331)    −3.370 (0.563)    −3.322 (0.560)    −3.322 (0.440)    −4.515 (0.736)
Group 3          −4.476 (0.731)    −4.585 (1.303)    −4.476 (1.238)    −4.476 (0.610)    −5.855 (1.190)
Group 4          −4.130 (0.476)    −4.250 (0.848)    −4.130 (0.806)    −4.130 (0.576)    −5.594 (0.919)
Overdispersion    None              ρ̂ = 0.192        φ̂ = 2.86         ρ̂ = 0.185        σ̂ = 1.53

^a Binomial ML assumes no overdispersion; QL(1) is quasi-likelihood with beta-binomial-type variance; QL(2) is quasi-likelihood with inflated binomial variance; QL(2) and GEE (independence working equations) estimates are the same as binomial ML estimates. Values in parentheses are standard errors.

Table 13.5 also shows results for the two quasi-likelihood approaches. Estimates and standard errors are qualitatively similar for each. For variance function v(μi) = φ ni μi(1 − μi), the estimates equal the binomial ML estimates, but the standard errors are multiplied by φ̂^{1/2} = (X²/df)^{1/2} = (154.7/54)^{1/2} = 1.69. For the beta-binomial-type variance function, ρ̂ = 0.192. This fit treats the variance of Yi as

  ni μi(1 − μi)[1 + 0.192(ni − 1)].

This corresponds roughly to a doubling of the variance relative to the binomial with a litter size of 6 and a tripling with ni = 11. Even with these adjustments for overdispersion, Table 13.5 shows that strong evidence remains that the probability of death is substantially lower for each treatment group than for the placebo group.

Figure 13.4 plots the standardized Pearson residuals against litter size for the binomial logit model.
The apparent increase in their variability as litter size increases suggests that the beta-binomial variance function is plausible.

The term ρ in that variance function corresponds to θ/(1 + θ) in the variance of the beta-binomial distribution. For that distribution or more generally, ρ̂ = 0.192 means that the probabilities of death for litters of a particular group have estimated standard deviation

  √[0.192 μi(1 − μi)].

This equals 0.22 when the mean is 0.5 and 0.13 when the mean is 0.1 or 0.9, which is considerable heterogeneity. More generally, a model could let ρ vary by treatment group or be different for the placebo group than the others. We leave this to the reader.

For comparison, Table 13.5 also shows results with the GEE approach to fitting the logit model, assuming an independence working correlation structure for observations within a litter. The estimates are the same as the ML estimates for the binomial logit model, but the empirical adjustment increases the standard errors. Similar results occur with an exchangeable working correlation structure. For it, the estimated within-litter correlation between the binary responses is 0.185. This is comparable to the value of 0.192 that yields the quasi-likelihood results with the beta-binomial variance function. The GEE standard errors are somewhat different from those with the quasi-likelihood approach. It may be that the sample size is insufficient for the GEE sandwich adjustment, which tends to underestimate standard errors unless the number of clusters is quite large. Or, this may simply reflect the different variance function for the GEE approach.

FIGURE 13.4 Standardized Pearson residuals for binomial logit model fitted to Table 4.5.

Finally, Table 13.5 also shows results for the GLMM that adds a normal random intercept ui for litter i to the binomial logit model.
Results are again similar in terms of significance of the treatment groups relative to placebo. Estimated effects are larger for this logistic-normal model, since they are subject-specific (i.e., litter-specific) rather than population-averaged.

13.3.4 Conjugate Mixture Models

The beta-binomial model is an example of a conjugate mixture model. In such models the data have a particular distribution conditional on a parameter, and the parameter itself has a distribution chosen so that the marginal distribution has closed form. Similarly, in Bayesian methods the conjugate prior distribution is one that, when combined with the likelihood, gives a closed form for the posterior distribution. For instance, for observations from a binomial distribution with a beta prior distribution for the binomial parameter, the posterior distribution of that parameter is also beta. Conjugate models were the primary method of conducting Bayesian analysis before the development of computationally intensive methods, such as Markov chain Monte Carlo, for evaluating the integral that determines the posterior distribution.

The beta-binomial conjugate mixture model applies to totals from binary trials. In the next section we study a conjugate mixture model for count data. It uses a gamma distribution to mix the Poisson parameter. A disadvantage of the conjugate mixture approach is its lack of generality and flexibility, requiring a different mixture distribution for each type of problem. In addition, the extra variability need not enter on the same scale as the ordinary predictors, and it can be difficult to accommodate a multivariate random effects structure. Lee and Nelder (1996) discussed this approach and considered a variety of hierarchical models of GLMM form in which the random effect need not be normal.
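The binomial-beta conjugacy mentioned above amounts to a two-line update; a minimal sketch (the prior and data values are illustrative, not from the text):

```python
# Conjugacy for the binomial-beta pair: a Beta(a, b) prior on the success
# probability, combined with y successes in n trials, gives a
# Beta(a + y, b + n - y) posterior in closed form -- no numerical integration.
def beta_binomial_update(a, b, y, n):
    """Posterior Beta parameters after observing y successes in n trials."""
    return a + y, b + n - y

# Illustrative values: Beta(2, 2) prior, 7 successes in 10 trials
a_post, b_post = beta_binomial_update(2, 2, 7, 10)
post_mean = a_post / (a_post + b_post)       # posterior mean = 9/14
print(a_post, b_post, round(post_mean, 3))   # prints 9 5 0.643
```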
13.4 NEGATIVE BINOMIAL REGRESSION

The negative binomial is a conjugate mixture distribution for count data. It is useful when overdispersion occurs with Poisson GLMs.

13.4.1 Negative Binomial as Gamma Mixture of Poisson Distributions

In Section 4.3.3 we noted that a severe limitation of Poisson models is that the variance of $Y$ must equal the mean. Hence, at a fixed mean the variance cannot decrease as additional predictors enter the model. Count data often show overdispersion, with the variance exceeding the mean. This might happen, for instance, because some relevant explanatory variables are not in the model.

A mixture model is a flexible way to account for overdispersion. At a fixed setting of the predictors used, given the mean the distribution of $Y$ is Poisson, but the mean itself varies according to some distribution. Suppose that (1) given $\lambda$, $Y$ has a Poisson distribution with mean $\lambda$, and (2) $\lambda$ has a gamma distribution, $G(k, \mu)$. The gamma probability density function for $\lambda$ is
$$f(\lambda; k, \mu) = \frac{(k/\mu)^k}{\Gamma(k)}\exp(-k\lambda/\mu)\,\lambda^{k-1}, \qquad \lambda \ge 0.$$
This gamma distribution has
$$E(\lambda) = \mu, \qquad \mathrm{var}(\lambda) = \mu^2/k. \qquad (13.12)$$
The parameter $k > 0$ describes the shape. The density is skewed to the right, but the degree of skewness decreases as $k$ increases.

Marginally, the gamma mixture of the Poisson distributions yields the negative binomial distribution for $Y$. Its probability mass function is
$$p(y; k, \mu) = \frac{\Gamma(y + k)}{\Gamma(k)\,\Gamma(y + 1)}\left(\frac{k}{\mu + k}\right)^{k}\left(1 - \frac{k}{\mu + k}\right)^{y}, \qquad y = 0, 1, 2, \ldots. \qquad (13.13)$$
This negative binomial distribution has
$$E(Y) = \mu, \qquad \mathrm{var}(Y) = \mu + \mu^2/k.$$
The index $k^{-1}$ is called the dispersion parameter. As $k^{-1} \to 0$, the gamma distribution has $\mathrm{var}(\lambda) \to 0$ and it converges to a degenerate distribution at $\mu$; similarly, the negative binomial distribution then has $\mathrm{var}(Y) \to \mu$ and it converges to the Poisson distribution with mean $\mu$. For given $k^{-1}$, the negative binomial is in the natural exponential family.
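The gamma-mixture construction above is easy to check by simulation; a minimal sketch (parameter values are illustrative) confirming that the marginal moments match $E(Y) = \mu$ and $\mathrm{var}(Y) = \mu + \mu^2/k$:

```python
import math
import random

# Simulate Y by first drawing lambda ~ gamma with E = mu, var = mu^2/k
# (shape k, scale mu/k), then Y | lambda ~ Poisson(lambda).  Marginally
# Y is negative binomial with E(Y) = mu and var(Y) = mu + mu^2/k.
random.seed(1)
k, mu, n = 2.0, 4.0, 200_000

def poisson_draw(lam):
    """Poisson variate by inversion; keeps the sketch dependency-free."""
    u, p, y = random.random(), math.exp(-lam), 0
    c = p
    while u > c:
        y += 1
        p *= lam / y
        c += p
    return y

ys = [poisson_draw(random.gammavariate(k, mu / k)) for _ in range(n)]
m = sum(ys) / n
v = sum((y - m) ** 2 for y in ys) / n
print(round(m, 2), round(v, 2))   # close to mu = 4 and mu + mu^2/k = 12
```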
The natural parameter is $\log[\mu/(\mu + k)]$. Usually, though, the dispersion parameter $k^{-1}$ is itself unknown. Estimating it helps to summarize the extent of overdispersion: the greater $k^{-1}$, the greater the overdispersion compared with the ordinary Poisson GLM. For independent observations, the ML estimate of $\mu$ is the sample mean, but ML estimation of $k^{-1}$ requires iterative methods (R. A. Fisher showed this in an appendix of a 1953 Biometrics article by C. Bliss). Problem 13.40 shows an alternative gamma parameterization that implies a linear rather than quadratic variance function for the negative binomial.

13.4.2 Negative Binomial Regression Modeling

Negative binomial models for counts permit $\mu$ to depend on explanatory variables (Lawless 1987). Such models normally take $k^{-1}$ to be the same for all observations. This corresponds to a constant coefficient of variation in the gamma mixing distribution,
$$\sqrt{\mathrm{var}(\lambda)}\,/\,E(\lambda) = 1/\sqrt{k},$$
with the standard deviation increasing as the mean does. Most common is the log link, as in Poisson loglinear models. Sometimes the identity link is adequate. One such case is with a single predictor that is a factor.

For $k$ fixed, a negative binomial model is a GLM. Thus, the likelihood equations for the regression parameters $\boldsymbol\beta$ are special cases of those [see (4.22)] for an ordinary GLM with variance function $v(\mu) = \mu + \mu^2/k$. The usual iteratively reweighted least squares algorithm applies for ML model fitting. When $k$ is unknown, ML fitting can use a Newton-Raphson routine on all the parameters simultaneously. Or, one can evaluate the profile likelihood for various fixed $k$ (Lawless 1987). Another approach alternates between (1) using iteratively reweighted least squares to solve the equations for $\hat{\boldsymbol\beta}$ for fixed $k$, and (2) for fixed $\hat{\boldsymbol\beta}$, using Newton-Raphson to estimate $k$, iterating between them until convergence. The full log likelihood $L(\boldsymbol\beta, k; \mathbf{y})$
for a negative binomial model satisfies
$$\frac{\partial^2 L}{\partial\beta_j\,\partial k} = \sum_i \frac{y_i - \mu_i}{(k + \mu_i)^2\,g'(\mu_i)}\,x_{ij}.$$
Thus, $E(\partial^2 L/\partial\beta_j\,\partial k) = 0$ for each $j$. Similarly, the inverse of the expected information matrix has 0 elements connecting $k$ with each $\beta_j$. Since this is the asymptotic covariance matrix, $\hat{\boldsymbol\beta}$ and $\hat k$ are asymptotically independent. It follows that standard errors for $\hat{\boldsymbol\beta}$ obtained from part (1) of the iterative scheme above are correct. Cameron and Trivedi (1998, p. 72) showed the asymptotic covariance matrix. They [and Lawless (1987)] considered a moment estimator for $k^{-1}$ and studied robustness properties of estimators. They noted that $\hat{\boldsymbol\beta}$ from this model is consistent if the model for the mean is correctly specified, even if the true distribution is not negative binomial.

13.4.3 Frequency of Knowing Homicide Victims Example

Table 13.6 summarizes responses of 1308 subjects to the question: Within the past 12 months, how many people have you known personally that were victims of homicide? The table shows responses by race, for those who identified their race as white or as black. The sample mean for the 159 blacks was 0.522, with a variance of 1.150. The sample mean for the 1149 whites was 0.092, with a variance of 0.155.

A natural first choice for modeling count data is a Poisson GLM, such as a loglinear model with a dummy predictor for race. Let $y_{it}$ denote the response for subject $t$ of race $i$. For $\mu_{it} = E(Y_{it})$, this model is
$$\log \mu_{it} = \alpha + \beta x_{it},$$
with $x_{1t} = 1$ (blacks) and $x_{2t} = 0$ (whites).

TABLE 13.6  Number of Victims of Murder Known in Past Year, by Race, with Fit of Poisson and Negative Binomial Models

          Data             Poisson GLM        Neg. Bin. GLM      Poisson GLMM
Response  Black   White    Black    White     Black    White     Black    White
0         119     1070     94.3     1047.7    122.8    1064.9    116.7    1068.3
1         16      60       49.2     96.7      17.9     67.5      24.5     65.3
2         12      14       12.9     4.5       7.8      12.7      8.1      10.1
3         7       4        2.2      0.1       4.1      2.9       3.6      2.8
4         3       0        0.3      0.0       2.4      0.7       1.9      1.1
5         2       0        0.0      0.0       1.4      0.2       1.1      0.5
6         0       1        0.0      0.0       0.9      0.1       0.7      0.3

Source: 1990 General Social Survey, National Opinion Research Center.

This model has fit $\log\hat\mu_{it} = -2.38 + 1.733 x_{it}$. The estimated expected responses are $\exp(-2.38 + 1.733) = 0.522$ for blacks and $\exp(-2.38) = 0.092$ for whites, the sample means. For any link function for this model, the likelihood equations imply that the fitted means equal the sample means. Since $\hat\beta = 1.733$ (SE $= 0.147$) is the difference between the log means for blacks and whites, the ratio of sample means is $\exp(1.733) = 5.7 = 0.522/0.092$.

However, for each race the sample variance is roughly double the mean. Table 13.6 also shows the fit of this model. The evidence of overdispersion is reflected by the higher observed counts at $y = 0$ and at large $y$ values than the Poisson GLM predicts.

An alternative is the same model form but assuming a negative binomial response. A mixture model does seem plausible. Due to various demographic factors, heterogeneity probably occurs among subjects of a given race in the distribution of $Y$. For ML fitting, the deviance decreases by 122.2 compared to the ordinary Poisson GLM that is the special case with $k^{-1} = 0$. Table 13.6 also shows this model fit. It is dramatically better at $y = 0$ and 1. Table 13.7 shows parameter estimates for the negative binomial and Poisson GLMs. For both, $\hat\beta = 1.733$, since both models provide fitted means equal to the sample means. However, the estimated standard error of $\hat\beta$ increases from 0.147 for the Poisson GLM to 0.238 for the negative binomial model.
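The quoted fit can be reproduced directly from the Table 13.6 counts, since with a single race dummy the ML fitted means equal the sample means; a small sketch (the standard errors are the values reported in Table 13.7):

```python
import math

# Two-group Poisson loglinear model log(mu) = alpha + beta*x with a race
# dummy: the ML fitted means equal the sample means, so beta_hat is the log
# ratio of sample means, and Wald 95% intervals for the ratio of means
# are exp(beta_hat +/- 1.96*SE).
black = [119, 16, 12, 7, 3, 2, 0]    # counts of y = 0,...,6 (Table 13.6)
white = [1070, 60, 14, 4, 0, 0, 1]

def sample_mean(counts):
    return sum(y * c for y, c in enumerate(counts)) / sum(counts)

mb, mw = sample_mean(black), sample_mean(white)
beta_hat = math.log(mb / mw)

# SEs reported in Table 13.7: Poisson GLM 0.147, negative binomial 0.238
ci = {se: (math.exp(beta_hat - 1.96 * se), math.exp(beta_hat + 1.96 * se))
      for se in (0.147, 0.238)}

print(round(mb, 3), round(mw, 3), round(beta_hat, 3))           # 0.522 0.092 1.733
print({se: (round(lo, 1), round(hi, 1)) for se, (lo, hi) in ci.items()})
```

The wider interval under the negative binomial SE reflects the overdispersion adjustment discussed next.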
The Wald 95% confidence interval for the ratio of means for blacks and whites goes from $\exp[1.733 \pm 1.96(0.147)] = (4.2, 7.5)$ for the Poisson GLM to $\exp[1.733 \pm 1.96(0.238)] = (3.5, 9.0)$ for the negative binomial. In accounting for the overdispersion, we obtain results that are not as precise as the more naive model suggests.

The negative binomial model has $\hat k^{-1} = 4.94$ (SE $= 1.00$). This shows strong evidence that $k^{-1} > 0$, indicating that the negative binomial model is more appropriate than the Poisson GLM. The estimated variance of $Y$ is $\hat\mu + \hat\mu^2\hat k^{-1} = \hat\mu + 4.94\hat\mu^2$, which is 0.13 for whites and 1.87 for blacks, much closer to the sample values than the Poisson model provides.

Table 13.7 also shows results for negative binomial and Poisson models using the identity link. The fits $\hat\mu_{it} = 0.092 + 0.430 x_{it}$ reproduce the sample means. Now $\hat\beta$ refers to the difference in means rather than their log ratio. The estimated difference $\hat\beta = 0.430$ has SE $= 0.058$ for the Poisson model and SE $= 0.109$ for the negative binomial. Results are more imprecise but more realistic with the negative binomial model. For this link also, the estimated dispersion parameter is $\hat k^{-1} = 4.94$.

TABLE 13.7  Parameter Estimates for Models Fitted to Homicide Data

                  Models with Log Link                            Models with Identity Link
Term            Neg. Binom. GLM   Poisson GLM   Poisson GLMM      Neg. Binom. GLM   Poisson GLM
$\alpha$        -2.38             -2.38         -3.69             0.092             0.092
$\beta$          1.733             1.733         1.897            0.430             0.430
SE($\hat\beta$)  0.238             0.147         0.246            0.109             0.058

13.5 POISSON REGRESSION WITH RANDOM EFFECTS

The GLMMs introduced in Chapter 12 referred to categorical responses. GLMMs are also useful for other types of discrete responses, such as counts. This section illustrates with Poisson regression modeling of count data. We've seen that a flexible way to account for overdispersion is with a mixture model. In Section 13.4 we mixed the Poisson using the gamma distribution, yielding the negative binomial marginally. Breslow (1984)
and Hinde (1982) suggested the GLMM structure (12.1) with the log link and a normal random intercept. The model for the mean for observation $t$ in cluster $i$ is
$$\log E(Y_{it} \mid u_i) = \mathbf{x}_{it}'\boldsymbol\beta + u_i, \qquad (13.14)$$
where $\{u_i\}$ are independent $N(0, \sigma^2)$. Conditional on $u_i$, $y_{it}$ has a Poisson distribution. Marginally, the distribution has variance greater than the mean whenever $\sigma > 0$. Applications of Poisson GLMMs include the analysis of maps of cancer rates in epidemiology (Breslow and Clayton 1993) and modeling variability in bacteria counts (Aitchison and Ho 1989).

Although links other than the log are possible, the identity link (and any other link having range only the positive real line) has a structural problem. With a normal random effect with $\sigma > 0$, a positive probability exists that the linear predictor is negative, but the Poisson mean must be nonnegative. The negative binomial model (for fixed $k$) is a GLMM with nonnormal random effect. With the log link, it results from a loglinear model of form (13.14) with random intercept, where $\exp(u_i)$ has a gamma distribution with mean 1 and variance $k^{-1}$. With the identity link, negative binomial models usually work better than Poisson GLMMs. Regardless of the gamma mixture distribution, the resulting marginal mean is nonnegative for the negative binomial.

13.5.1 Marginal Model Implied by Poisson GLMM

The Poisson GLMM (13.14) implies a relatively simple marginal model, averaging out the random effect. The mean of the marginal distribution is
$$E(Y_{it}) = E[E(Y_{it} \mid u_i)] = E\bigl[e^{\mathbf{x}_{it}'\boldsymbol\beta + u_i}\bigr] = e^{\mathbf{x}_{it}'\boldsymbol\beta + \sigma^2/2}.$$
Here $E[\exp(u_i)] = \exp(\sigma^2/2)$, because a $N(0, \sigma^2)$ variate $u_i$ has moment generating function $E[\exp(tu_i)] = \exp(t^2\sigma^2/2)$. So, for the Poisson GLMM the log of the mean conditionally equals $\mathbf{x}_{it}'\boldsymbol\beta + u_i$ and marginally equals $\mathbf{x}_{it}'\boldsymbol\beta + \sigma^2/2$. A loglinear model still applies.
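These marginal moments can be verified by simulation; a minimal sketch with illustrative parameter values (the variance formula in the check is the standard lognormal-mixture result derived just below):

```python
import math
import random

# Poisson GLMM: log E(Y|u) = eta + u with u ~ N(0, sigma^2).
# Marginally, E(Y) = exp(eta + sigma^2/2) and
# var(Y) = E(Y) + E(Y)^2 * (exp(sigma^2) - 1).
random.seed(2)
eta, sigma, n = 0.5, 0.8, 200_000

def poisson_draw(lam):
    """Poisson variate by inversion; keeps the sketch dependency-free."""
    u, p, y = random.random(), math.exp(-lam), 0
    c = p
    while u > c:
        y += 1
        p *= lam / y
        c += p
    return y

ys = [poisson_draw(math.exp(eta + random.gauss(0.0, sigma))) for _ in range(n)]
m = sum(ys) / n
v = sum((y - m) ** 2 for y in ys) / n

m_theory = math.exp(eta + sigma ** 2 / 2)
v_theory = m_theory + m_theory ** 2 * (math.exp(sigma ** 2) - 1)
print(round(m, 2), round(m_theory, 2), round(v, 2), round(v_theory, 2))
```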
The marginal effects of the explanatory variables are the same as the cluster-specific effects. Thus, the ratio of means at two different settings of $\mathbf{x}_{it}$ is the same conditionally and marginally. However, marginally the intercept is offset. (Note that Jensen's inequality applies, since the link is not linear.)

The variance of the marginal distribution is
$$\begin{aligned}
\mathrm{var}(Y_{it}) &= E[\mathrm{var}(Y_{it} \mid u_i)] + \mathrm{var}[E(Y_{it} \mid u_i)] \\
&= E\bigl[e^{\mathbf{x}_{it}'\boldsymbol\beta + u_i}\bigr] + e^{2\mathbf{x}_{it}'\boldsymbol\beta}\,\mathrm{var}(e^{u_i}) \\
&= e^{\mathbf{x}_{it}'\boldsymbol\beta + \sigma^2/2} + e^{2\mathbf{x}_{it}'\boldsymbol\beta}\bigl(e^{2\sigma^2} - e^{\sigma^2}\bigr) \\
&= E(Y_{it}) + [E(Y_{it})]^2\bigl(e^{\sigma^2} - 1\bigr).
\end{aligned}$$
Here, $\mathrm{var}(e^{u_i}) = E(e^{2u_i}) - [E(e^{u_i})]^2 = e^{2\sigma^2} - e^{\sigma^2}$, by evaluating the moment generating function at $t = 2$ and $t = 1$. As in the negative binomial model, the marginal variance is a quadratic function of the marginal mean. It exceeds the marginal mean when $\sigma > 0$. The ordinary Poisson model results when $\sigma = 0$. When $\sigma > 0$ the marginal distribution is not Poisson, and the extent to which the variance exceeds the mean increases as $\sigma$ increases.

As in binary GLMMs, $Y_{it}$ and $Y_{is}$ are independent given $u_i$ but are marginally nonnegatively correlated. For $t \ne s$,
$$\mathrm{cov}(Y_{it}, Y_{is}) = E[\mathrm{cov}(Y_{it}, Y_{is} \mid u_i)] + \mathrm{cov}[E(Y_{it} \mid u_i), E(Y_{is} \mid u_i)] = 0 + \mathrm{cov}\bigl(\exp(\mathbf{x}_{it}'\boldsymbol\beta + u_i),\, \exp(\mathbf{x}_{is}'\boldsymbol\beta + u_i)\bigr). \qquad (13.15)$$
The functions in the last covariance term are both monotone increasing functions of $u_i$, and hence are nonnegatively correlated (Problem 13.44).

13.5.2 Frequency of Knowing Homicide Victims Example

We now return to Table 13.6 on responses, classified by race, of the number of victims of homicide within the past 12 months that subjects knew personally. Models permitting subject heterogeneity are sensible. For the response $y_{it}$ for subject $t$ of race $i$, the Poisson GLMM is
$$\log E(Y_{it} \mid u_{it}) = \alpha + \beta x_{it} + u_{it},$$
where $\{u_{it}\}$ are independent $N(0, \sigma^2)$. The log means vary according to a $N(\alpha, \sigma^2)$ distribution for whites and a $N(\alpha + \beta, \sigma^2)$ distribution for blacks.
Given $u_{it}$, $y_{it}$ has a Poisson distribution. Table 13.6 also shows this model fit, and Table 13.7 shows estimates. The random effects have $\hat\sigma = 1.63$ (SE $= 0.15$). The deviance decreases by 116.6 compared to the Poisson GLM, indicating a better fit by allowing heterogeneity.

For subjects at the means of the random effects distributions ($u_{it} = 0$), the estimated expected responses are $\exp(-3.69 + 1.90) = 0.167$ for blacks and $\exp(-3.69) = 0.025$ for whites. The fitted marginal mean is $\exp(\hat\alpha + \hat\beta x_{it} + \hat\sigma^2/2)$, or 0.63 for blacks and 0.09 for whites. The fitted marginal variances are 5.78 for blacks and 0.21 for whites. These are somewhat larger than the sample means and variances, perhaps because the fitted distribution has nonnegligible mass above the largest observed response of 6.

13.5.3 Negative Binomial Models versus Poisson GLMMs

The Poisson GLMM with normal random effects has the advantage, relative to the negative binomial GLM, of easily permitting multivariate random effects and multilevel models. However, the negative binomial has properties that can make interpretation simpler. We've seen that the identity link is valid for it, which is useful for simple examples such as the preceding one with a factor predictor. With any link and a factor predictor, its ML fitted means equal the sample means. This is not the case for the Poisson GLMM.

Besides the Poisson GLMM and the negative binomial model, an alternative way of accounting for overdispersion with count data is quasi-likelihood with variance function $v(\mu_i) = \phi\mu_i$ for some constant $\phi$. This is often adequate for exploratory analyses.

NOTES

Section 13.1: Latent Class Models

13.1. Aitkin et al. (1981), Bartholomew and Knott (1999), Clogg (1995), Clogg and Goodman (1984), Goodman (1974), Haberman (1979, Chap. 10), Hagenaars (1998), Heinen (1996), and Lazarsfeld and Henry (1968) discussed fitting and interpretation of latent class and related latent variable models.

13.2. Rudas et al. (1994)
proposed a clever mixture method for summarizing goodness of fit. For a model $M$ for a contingency table with true probabilities $\boldsymbol\pi$, they used the mixture $\boldsymbol\pi = (1 - \rho)\boldsymbol\pi_1 + \rho\boldsymbol\pi_2$, with $\boldsymbol\pi_1$ the model-based probabilities and $\boldsymbol\pi_2$ unconstrained. Their index of lack of fit is the smallest such $\rho$ possible for which this holds. It is the fraction of the population that cannot be described by the model. This recognizes that any given model does not truly hold but is useful if $\rho$ is close to 0. The mixture contrasts with the latent class model, in which both $\boldsymbol\pi_1$ and $\boldsymbol\pi_2$ correspond to independence.

Section 13.2: Nonparametric Random Effects Models

13.3. For connections between Rasch-type models and quasi-symmetry models, see Agresti (1993), Conaway (1989), Darroch (1981), Darroch et al. (1993), Hatzinger (1989), and Kelderman (1984). For the matched-pairs random effects model (12.16), a nonparametric or conditional ML treatment of $(u_{i1}, u_{i2})$ implies a multivariate quasi-symmetry model (Agresti 1997). Model (12.16) with correlated normal random effects is a continuous analog to discrete latent class models that Goodman (1974) proposed, based on two associated binary latent variables.

Section 13.3: Beta-Binomial Models

13.4. Skellam (1948) introduced the beta-binomial distribution and discussed parameter estimation. For modeling using this distribution or related quasi-likelihood approaches, see Brooks et al. (1997), Crowder (1978), Hinde (1996), Lee and Nelder (1996), Liang and Hanfelt (1994), Liang and McCullagh (1993), Lindsey and Altham (1998), Moore (1986a), Moore and Tsiatis (1991), Nelder and Pregibon (1987), Prentice (1986), Rosner (1984, 1989) [with critique by Neuhaus and Jewell (1990a)], Slaton et al. (2000), and Williams (1975, 1982). For beta-binomial-type variance, Ryan (1995) and Williams (1988) showed advantages of the quasi-likelihood approach over ML.
Often, it helps to permit the quasi-likelihood scale parameter $\rho$ (or the related parameter $\theta$ in the beta-binomial) to vary among groups. The beta-binomial generalizes to a Dirichlet-multinomial. Conditional on the probabilities, the distribution is multinomial. The probabilities themselves have a Dirichlet distribution, which is a generalization of the beta defined on vectors of probabilities that sum to 1. See Mosimann (1962) and Paul et al. (1989).

13.5. Kupper et al. (1986) and Ryan (1992) discussed modeling overdispersion caused by litter effects in developmental toxicity studies. See Follman and Lambert (1989), Kupper and Haseman (1978), and Lefkopoulou et al. (1989) for related material.

Section 13.4: Negative Binomial Regression

13.6. Greenwood and Yule (1920) derived the negative binomial as a gamma mixture of Poissons. Johnson et al. (1992) summarized its properties. Biggeri (1998), Cameron and Trivedi (1998), Hinde and Demétrio (1998), and Lawless (1987) discussed modeling using it.

PROBLEMS

Applications

13.1 For the $2^3$ table of opinions about legalized abortion (Table 10.13) collapsed over gender, fit a latent class model with two classes. Show that it is saturated. For each latent class, report the estimated probability of supporting legalized abortion in each of the three situations. Give a tentative interpretation for the classes.

13.2 Analyze Table 8.3 using a latent class model with $q = 2$.
a. For a subject in the first latent class, estimate the probability of having used (i) marijuana, (ii) alcohol, (iii) cigarettes, (iv) all three, and (v) none of them.
b. Estimate the probability a subject is in the first latent class, given they have used (i) marijuana, (ii) alcohol, (iii) cigarettes, (iv) all three, and (v) none of them.

13.3 Analyze Table 8.19 on government spending using latent class models.

13.4 For capture-recapture experiments, Coull and Agresti (1999)
used a loglinear model with exchangeable association and no higher-order terms. Explain why the model expected frequencies satisfy
$$\log \mu(y_1, \ldots, y_T) = \lambda + \beta_1 y_1 + \cdots + \beta_T y_T + \beta(y_1 y_2 + y_1 y_3 + \cdots + y_{T-1} y_T).$$
Show that the fit of this model to Table 12.6 yields $\hat N = 90.5$ and a 95% profile-likelihood confidence interval for $N$ of (75, 125).

13.5 Use or write software to replicate the analyses of the opinions about abortion data in Section 13.2, using (a) nonparametric random effects fitting of logit model (13.3), and (b) the quasi-symmetry model.

13.6 A data set on pregnancy rates among girls under 18 years of age in 13 north central Florida counties has information on a 3-year total for each county $i$ on $n_i$ = number of births and $y_i$ = number of those for which the mother had age under 18 (see J. Booth, in Statistical Modelling: Lecture Notes in Statistics, 104, Springer, 43-52, 1995).
a. A beta-binomial model states that given $\{\pi_i\}$, $\{Y_i\}$ are independent bin$(n_i, \pi_i)$ variates, and $\{\pi_i\}$ are independent from a beta$(\alpha, \beta)$ distribution. The ML estimated parameters are $\hat\alpha = 9.9$ and $\hat\beta = 240.8$ (thanks to J. Booth for this analysis). Use the mean and variance to describe the estimated beta distribution and the estimated marginal distribution of $Y_i$ (as a function of $n_i$).
b. Quasi-likelihood using variance function (13.10) for the model logit$(\mu_i) = \alpha$ has $\hat\alpha = -3.18$ and $\hat\rho = 0.005$. Describe the estimated mean and variance of $Y_i$.
c. Quasi-likelihood using variance (13.11) for the model logit$(\mu_i) = \alpha$ has $\hat\alpha = -3.35$ and $\hat\phi = 8.3$. Describe the estimated mean and variance of $Y_i$.
d. The logistic-normal GLMM, logit$(\pi_i) = \alpha + u_i$, yields $\hat\alpha = -3.24$ and $\hat\sigma = 0.33$. Describe the estimated mean of $Y_i$ [recall (12.8)].

13.7 In Problem 12.2 about Shaq O'Neal's free-throw shooting, the simple binomial model, $\pi_i = \alpha$, has lack of fit. Fit the beta-binomial model, or use the quasi-likelihood approach with that variance structure.
Use the fit to summarize his free-throw shooting, by giving an estimated mean and standard deviation for $\pi_i$.

13.8 For the toxicity study of Table 12.9, collapsing to a binary response, consider linear logit models for the probability a fetus is normal.
a. Does the ordinary binomial model show evidence of overdispersion?
b. Fit the linear logit model using the quasi-likelihood approach with inflated binomial variance. How do the standard errors change?
c. Fit the linear logit model using quasi-likelihood with beta-binomial variance. Interpret and compare with previous results.
d. Fit the linear logit model using a GEE approach with exchangeable working correlation among fetuses in the same litter. Interpret and compare with previous results, including comparing the estimated GEE correlation with the estimate $\hat\rho$ from part (c).
e. Fit the linear logit GLMM after adding a litter-specific normal random effect. Interpret and compare with previous results.

13.9 Extend the various analyses of the teratology data (Table 4.5) in Section 13.3.3 as follows:
a. Include a predictor for litter size (as well as group). Interpret, and compare results to those without this predictor.
b. Fit a model with beta-binomial variance (13.10) in which $\rho$ varies by treatment group. Use results to motivate a model that allows overdispersion only in the placebo group. Interpret and compare results to those with common $\rho$ for each group.

13.10 Table 13.8 reports the results of a study of fish hatching under three environments. Eggs from seven clutches were randomly assigned to three treatments, and the response was whether an egg hatched by day 10. The three treatments were (1) carbon dioxide and oxygen removed, (2) carbon dioxide only removed, and (3) neither removed.
TABLE 13.8  Data for Problem 13.10

          Treatment 1              Treatment 2              Treatment 3
Clutch    Number Hatched  Total    Number Hatched  Total    Number Hatched  Total
1         0               6        3               6        0               6
2         0               13       0               13       0               13
3         0               10       8               10       6               9
4         0               16       10              16       9               16
5         0               32       25              28       23              30
6         0               7        7               7        5               7
7         0               21       10              20       4               20

Source: Data courtesy of Becca Hale, Zoology Department, University of Florida.

a. Let $\pi_{it}$ denote the probability of hatching for an egg from clutch $i$ in treatment $t$. Assuming independent binomial observations, fit the model
$$\mathrm{logit}(\pi_{it}) = \beta_1 z_1 + \beta_2 z_2 + \beta_3 z_3,$$
where $z_t = 1$ for treatment $t$ and 0 otherwise. What does your software report for $\hat\beta_1$, and what should it be? (Hint: Note that treatment 1 has no successes.)
b. Analyze these data using an approach that allows overdispersion. Interpret. Indicate whether evidence of overdispersion occurs for treatments 2 and 3.

13.11 For the train accidents in Problem 9.19, a negative binomial model assuming constant log rate over the 14-year period has estimate $-4.177$ (SE $= 0.153$) and estimated dispersion parameter 0.012. Interpret.

13.12 One question in the 1990 General Social Survey asked subjects how many times they had sexual intercourse in the preceding month. Table 13.9 shows responses, classified by gender.
a. The sample means were 5.9 for males and 4.3 for females; the sample variances were 54.8 and 34.4. The mode for each gender was 0. Does an ordinary Poisson GLM seem appropriate? Explain.
b. The Poisson GLM with log link and a dummy variable for gender (1 = males, 0 = females) has gender estimate 0.308 (SE $= 0.038$). Explain why this implies a ratio of 1.36 for the fitted means. (This is also the ratio of sample means, since this model has fitted means equal to sample means.) Show that the Wald 95% confidence interval for the ratio of means for males and females is (1.26, 1.47).
TABLE 13.9  Data for Problem 13.12

Response  Male  Female    Response  Male  Female    Response  Male  Female
0         65    128       9         2     2         20        7     6
1         11    17        10        24    13        22        0     1
2         13    23        12        6     10        23        0     1
3         14    16        13        3     3         24        1     0
4         26    19        14        0     1         25        1     3
5         13    17        15        3     10        27        0     1
6         15    17        16        3     1         30        3     1
7         7     3         17        0     1         50        1     0
8         21    15        18        0     1         60        1     0

Source: 1990 General Social Survey, National Opinion Research Center.

c. For the negative binomial model, the log likelihood increases by 248.7 (deviance decreases by 497.3). The estimated difference between the log means is also 0.308, but now SE $= 0.127$. Show that the 95% confidence interval for the ratio of means is (1.06, 1.75). Compare to the Poisson GLM, and interpret.
d. The mode for the Poisson distribution is the integer part of the mean, rather than 0. Argue that a possibly more realistic mixture model assumes for gender $i$ a proportion $\rho_i$ that has a Poisson distribution with mean 0 and a proportion $1 - \rho_i$ that has a distribution that is a gamma mixture of Poissons. Explain why the corresponding marginal distribution for each gender is a mixture of a degenerate distribution at 0 and a negative binomial distribution.

13.13 Refer to Problem 13.12. Fit the Poisson and negative binomial GLMs using the identity link. Show that the estimated differences in means between males and females are identical for the two GLMs but the SE values are very different. Explain why. Use the more appropriate one to form a confidence interval for the true difference in means.

13.14 For the counts of horseshoe-crab satellites in Table 4.3, Table 13.10 shows the results of ML fitting of the negative binomial model using width as the predictor, with the identity link.
a. State and interpret the prediction equation.
b. Show that at a predicted $\hat\mu$, the estimated variance is roughly $\hat\mu + \hat\mu^2$.
c. The corresponding Poisson GLM has fit $\hat\mu = -11.53 + 0.55x$ (SE $= 0.06$). Compare 95% confidence intervals for the slopes for the two models.
Interpret, and indicate whether overdispersion seems to exist relative to the Poisson GLM.

TABLE 13.10  Results for Problem 13.14

Parameter    Estimate    Standard Error    Wald 95% Confidence Limits    Chi-Square
Intercept    -11.1471    2.8275            -16.6890   -5.6052            15.54
width          0.5308    0.1132              0.3089    0.7528            21.97
Dispersion     0.9843    0.1822              0.6847    1.4149

13.15 Refer to Problem 13.14.
a. Fit a negative binomial model with log link. Interpret. Plot the counts against width and indicate which link seems more appropriate.
b. Fit a Poisson GLMM with log link, using width as predictor. Interpret.
c. Compare results for the various models, including those in Section 4.3.2 for a Poisson GLM. Indicate your preferred model. Justify.

13.16 Refer to Problems 13.14 and 13.15. Using width and qualitative color as predictors, fit a (a) negative binomial GLM, and (b) Poisson GLMM, checking for interaction and interpreting the final model.

13.17 Refer to Table 13.6. For those with race classified as "other," the sample counts for (0, 1, 2, 3, 4, 5, 6) homicides were (55, 5, 1, 0, 1, 0, 0). Fit an appropriate model simultaneously to these data and those for the white and black race categories. Interpret by making pairwise comparisons of the three pairs of means.

13.18 Use a quasi-likelihood approach to analyze Table 13.6 on counts of murder victims.

13.19 Conduct the analyses of Problem 4.6 on defects in the fabrication of computer chips, but use a negative binomial GLM. Compare results to those for the Poisson GLM. Indicate why results are similar.

13.20 With data at the book's Web site (www.stat.ufl.edu/~aa/cda/cda.html), use methods of this chapter to analyze how the countywide vote for the Reform Party candidate Pat Buchanan in the 2000 presidential election related to the vote for Reform Party candidate Ross Perot in the 1996 presidential election.
Note that Palm Beach County is an enormous outlier (apparently mainly reflecting votes intended for Al Gore but cast for Buchanan because of a confusing ballot). Model with and without that observation and compare results.

13.21 Conduct a latent class analysis of the data in Espeland and Handelman (1989).

13.22 Refer to the teratology study in Liang and Hanfelt (1994). Analyze these data using at least two different approaches for overdispersed binary data. Compare results and interpret.

13.23 Refer to Problem 13.14. Using an appropriate subset of width, weight, color, and spine condition as predictors, find and interpret a reasonable model for predicting the number of satellites.

Theory and Methods

13.24 Derive the residual df for a latent class model with $q$ latent classes. When $I = 2$, for $q \ge 2$ show one needs $T \ge 4$ for the model to be unsaturated. Then, find the maximum value for $q$ when $T = 4, 5$. For an $I \times I$ table, show one needs $q < I^2/(2I - 1)$.

13.25 Express the log likelihood for latent class model (13.1) in terms of the model parameters. Derive the likelihood equations (Goodman 1974, Haberman 1979).

13.26 Let $\boldsymbol\pi$ denote an $I \times J$ matrix of cell probabilities for the joint distribution of $X$ and $Y$. Suppose that there exist $I \times 1$ column vectors $\boldsymbol\pi_{1k}$ and $J \times 1$ column vectors $\boldsymbol\pi_{2k}$ of probabilities, $k = 1, \ldots, q$, and a set of probabilities $\{\rho_k\}$ such that
$$\boldsymbol\pi = \sum_{k=1}^{q} \rho_k \boldsymbol\pi_{1k}\boldsymbol\pi_{2k}'.$$
Explain why this implies that there is a latent variable $Z$ such that $X$ and $Y$ are conditionally independent, given $Z$.

13.27 In Section 13.2.2, under the null hypothesis that the ordinary logistic regression model holds, explain why it is inappropriate to treat the difference between the deviances for that model and the mixture of two logistic regressions as a chi-squared statistic.

13.28 Refer to Problem 12.7. Let $\mu_k(a, b, c)$ denote the expected frequency of outcomes $(a, b, c)$ for treatments $(A, B, C)$
under treatment sequence k, where outcome 1 = relief and 0 = nonrelief. With a nonparametric random effects approach, show that one can estimate treatment effects in model (12.19) by fitting the quasi-symmetry model

    log μ_k(a, b, c) = aβ_A + bβ_B + cβ_C + λ_k(a, b, c),

where λ_k(a, b, c) = λ_k(a, c, b) = λ_k(b, a, c) = λ_k(b, c, a) = λ_k(c, a, b) = λ_k(c, b, a). Fit the model, and show that β̂_B − β̂_A = 1.64 (SE = 0.34), β̂_C − β̂_A = 2.23 (SE = 0.39), β̂_C − β̂_B = 0.59 (SE = 0.39). Interpret. Compare results with Problem 12.7 for model (12.19).

13.29 Show that the beta-binomial distribution (13.9) simplifies to the binomial when θ = 0.

13.30 Express the numerator of the beta density in terms of μ and θ. Using this, show that it is (a) unimodal when θ < min(μ, 1 − μ), and (b) the uniform density when μ = θ = 1/2.

13.31 Suppose that π_i = P(Y_it = 1) = 1 − P(Y_it = 0), for t = 1, ..., n_i, and corr(Y_it, Y_is) = ρ for t ≠ s. Show that var(Y_it) = π_i(1 − π_i), cov(Y_it, Y_is) = ρπ_i(1 − π_i), and

    var(Σ_t Y_it) = n_i π_i(1 − π_i)[1 + ρ(n_i − 1)].

13.32 When n = 1, show that the beta-binomial distribution is no different from the binomial (i.e., Bernoulli). Explain why overdispersion cannot occur when n = 1.

13.33 When y_i is the sum of n_i binary responses each having mean μ_i, refer to the quasi-likelihood approach with v(μ_i) = φ n_i μ_i(1 − μ_i). Explain why this variance function has a structural problem, with only φ = 1 making sense when n_i = 1.

13.34 Liang and Hanfelt (1994) described a teratology study comparing control and treatment groups in which the ML estimate of the treatment effect in a beta-binomial model differs by a factor of 2 depending on whether one assumes the same overdispersion parameter for each group. By contrast, with variance function (13.11), the quasi-likelihood estimate of the treatment effect is the same whether one assumes the same or different φ for the two groups. Explain why, and discuss whether this is an advantage or disadvantage of that method.

13.35 Consider the logistic-normal model, logit(π_i) = α + x_i′β + u_i. For small σ, show that it corresponds approximately to a mixture model for which the mixture distribution has var(π_i) = [μ_i(1 − μ_i)]²σ². (Hint: See Problem 6.33.)

13.36 Altham (1978) introduced the discrete distribution

    f(y; π, ψ) = c(π, ψ) (n choose y) π^y (1 − π)^{n−y} exp[ψ y(n − y)],   y = 0, 1, ..., n,

where c(π, ψ) is a normalizing constant. Show that this is in the exponential family. Show that the binomial occurs when ψ = 0. [Altham noted that overdispersion occurs when ψ < 0. Corcoran et al. (2001) and Lindsey and Altham (1998) used this as the basis of an alternative model to the beta-binomial.]

13.37 When y_1, ..., y_N are independent from the negative binomial distribution (13.13) with k fixed, show that μ̂ = ȳ.

13.38 Using E(Y) = E[E(Y|X)] and var(Y) = E[var(Y|X)] + var[E(Y|X)], derive the mean and variance of the (a) beta-binomial distribution, and (b) negative binomial distribution.

13.39 Suppose that given u, Y is Poisson with E(Y|u) = uμ, where μ may depend on predictors. Suppose that u is a positive random variable with E(u) = 1 and var(u) = τ. Show that E(Y) = μ and var(Y) = μ + τμ². Explain how negative binomial GLMs and Poisson GLMMs with log link can follow as special cases.

13.40 An alternative negative binomial parameterization results from the gamma density formula,

    f(λ; k, μ) = [k^{kμ}/Γ(kμ)] exp(−kλ) λ^{kμ−1},   λ ≥ 0,

for which E(λ) = μ and var(λ) = μ/k. Show that this gamma mixture of Poissons yields a negative binomial with

    E(Y) = μ,   var(Y) = μ(1 + k)/k.

For what limiting value of k does this reduce to the Poisson? [See Nelder and Lee (1996) for ML model fitting. Cameron and Trivedi (1998, p. 75) pointed out that, unlike with quadratic variance, consistency does not occur for parameter estimators when the model for the mean holds but the true distribution is not negative binomial.]

13.41 The negative binomial distribution is unimodal with a mode at the integer part of μ(k − 1)/k (Johnson et al. 1992, pp. 208–209). Show that the mode is 0 when μ ≤ 1, and that when μ > 1 the mode is still 0 if k < μ/(μ − 1). (This gives greater scope than the Poisson, since its mode equals the integer part of the mean.)

13.42 Consider the loglinear random effects model

    log E(Y_it | u_i) = x_it′β + z_it′u_i,

where {u_i} are independent N(0, Σ). Show that this implies the marginal loglinear model

    log E(Y_it) − (1/2) z_it′ Σ z_it = x_it′β,

with the same fixed effects but with an offset term. For the random-intercept case, indicate the role of σ on the size of the offset. Explain what happens when σ = 0.

13.43 In Section 13.5.1 and Problem 13.42 we saw that for Poisson GLMMs, the marginal effects are the same as the cluster-specific effects. This does not imply that ML estimates of effects are the same for a Poisson GLMM and a Poisson GLM. Explain why. (Hint: For the GLMM, is the marginal distribution Poisson?)

13.44 For the Poisson GLMM (13.14), use the normal mgf to show that for t ≠ s,

    cov(Y_it, Y_is) = exp[(x_it + x_is)′β] exp(σ²)[exp(σ²) − 1].

Hence, find corr(Y_it, Y_is).

13.45 Consider a Poisson GLMM using the identity link. Relate the marginal mean and variance to the conditional mean and variance. Explain the structural problem that this model has.

CHAPTER 14

Asymptotic Theory for Parametric Models

This chapter has a more theoretical flavor than others. It presents asymptotic theory for parametric models for categorical data, with emphasis on multinomial models for contingency tables.
In Section 14.1 we review and extend the delta method, which is used to derive large-sample normal distributions for many statistics. In Section 14.2 we apply the delta method to ML estimation of parameters in models for contingency tables, later illustrated in Section 14.4 for logit and loglinear models. In Section 14.3 we derive asymptotic distributions of cell residuals and the X² and G² goodness-of-fit statistics.

The results in this chapter have a long history. Pearson (1900) derived the asymptotic chi-squared distribution of X² for testing a specified multinomial distribution. Fisher (1922, 1924) showed the adjustment in degrees of freedom when multinomial probabilities are functions of unknown parameters. Cramér (1946, pp. 424–434) formally proved this result, under the assumption that ML estimators of the parameters are consistent. Rao (1957) proved consistency of the ML estimators under general conditions. He also gave the asymptotic distribution of the ML estimators, although the primary emphasis of his articles was on proving consistency. Birch (1964a) proved these results under weaker conditions. Andersen (1980), Bishop et al. (1975), Cox (1984), Haberman (1974a), and Watson (1959) provided other proofs or considered related cases.

As in Cramér's and Rao's proofs, our derivation regards the ML estimator as a point in the parameter space where the derivative of the log-likelihood function is zero. Birch regarded it as a point at which the likelihood takes value arbitrarily near its supremum. Although his approach is more powerful, the proofs are more complex. We avoid a formal "theorem–proof" style of exposition. Instead, we show that powerful results follow from simple mathematical ideas, such as Taylor series expansions.

14.1 DELTA METHOD

Suppose that a statistic used as an estimator of a parameter has a large-sample normal distribution. In this section we show that many functions of that statistic are also asymptotically normal.

14.1.1 O, o Rates of Convergence

Big O and little o notation is useful for describing limiting behavior of sequences. For real numbers {z_n}, the little o notation o(z_n) represents a term that has smaller order than z_n as n → ∞, in the sense that o(z_n)/z_n → 0 as n → ∞. For instance, √n is o(n) as n → ∞, since √n/n → 0 as n → ∞. A sequence that is o(1) satisfies o(1)/1 = o(1) → 0; for instance, n^{−1/2} is o(1) as n → ∞. The big O notation O(z_n) represents terms that have the same order of magnitude as z_n, in the sense that O(z_n)/z_n is bounded as n → ∞. For instance, (3/n) + (8/n²) is O(n^{−1}) as n → ∞; dividing it by n^{−1} gives a ratio that takes value close to 3 as n increases.

Similar notation applies to sequences of random variables. This notation uses a subscript p to indicate that the sequence has probabilistic rather than deterministic behavior. The symbol o_p(z_n) denotes a random variable of smaller order than z_n for large n, in the sense that o_p(z_n)/z_n converges in probability to 0; that is, for any fixed ε > 0,

    P(|o_p(z_n)/z_n| ≤ ε) → 1 as n → ∞.

The notation O_p(z_n) represents a random variable such that for every ε > 0, there is a constant K and an integer n_0 such that

    P(|O_p(z_n)/z_n| < K) > 1 − ε for all n > n_0.

To illustrate, let Ȳ_n denote the sample mean of n independent observations Y_1, ..., Y_n from a distribution having E(Y_i) = μ. Then (Ȳ_n − μ) = o_p(1), since (Ȳ_n − μ)/1 converges in probability to zero as n → ∞ by the law of large numbers. By Tchebychev's inequality, the difference between a random variable and its expected value has the same order of magnitude as the standard deviation of that random variable. Since Ȳ_n − μ has standard deviation σ/√n, (Ȳ_n − μ) = O_p(n^{−1/2}). A random variable that is O_p(n^{−1/2}) is also o_p(1). An example is (Ȳ_n − μ).
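These order relations can be seen numerically. The following sketch (an added illustration, not part of the text) simulates Uniform(0, 1) sample means: the raw deviation Ȳ_n − μ behaves as o_p(1) and shrinks toward 0, while the rescaled deviation √n(Ȳ_n − μ) behaves as O_p(1), fluctuating on the order of σ = √(1/12) ≈ 0.289 but staying bounded.

```python
import random
import math

# Added illustration (not from the text): for i.i.d. Uniform(0,1) draws
# with mean mu = 0.5, (Ybar_n - mu) is o_p(1) while sqrt(n)(Ybar_n - mu)
# is O_p(1): the first shrinks to 0, the second stays bounded in probability.
random.seed(20)

def scaled_deviation(n, mu=0.5):
    """Return (Ybar_n - mu, sqrt(n) * (Ybar_n - mu)) for one simulated sample."""
    ybar = sum(random.random() for _ in range(n)) / n
    return ybar - mu, math.sqrt(n) * (ybar - mu)

for n in (10**2, 10**4, 10**6):
    dev, scaled = scaled_deviation(n)
    print(n, round(dev, 5), round(scaled, 3))
```

With increasing n the first printed column shrinks toward 0 while the second keeps the same order of magnitude, matching the O_p(n^{−1/2}) rate above.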
Multiplication affects the order in the way one expects intuitively (Problem 14.1). For instance, √n(Ȳ_n − μ) = n^{1/2} O_p(n^{−1/2}) = O_p(n^{1/2} n^{−1/2}) = O_p(1). If the difference between two random variables is o_p(1) as n → ∞, Slutzky's theorem states that those random variables have the same limiting distribution.

14.1.2 Delta Method for Function of Random Variable

Let T_n denote a statistic, the subscript expressing its dependence on the sample size n. For large samples, suppose that T_n is approximately normally distributed about θ, with approximate standard error σ/√n. More precisely, as n → ∞, suppose that the cdf of √n(T_n − θ) converges to a N(0, σ²) cdf. This limiting behavior is an example of convergence in distribution, denoted

    √n(T_n − θ) →d N(0, σ²).   (14.1)

For a function g, we now derive the limiting distribution of g(T_n). Suppose that g is at least twice differentiable at θ. We use the Taylor series expansion for g(t) in a neighborhood of θ. For some θ* between t and θ,

    g(t) = g(θ) + (t − θ)g′(θ) + (t − θ)² g″(θ*)/2
         = g(θ) + (t − θ)g′(θ) + O[(t − θ)²].

Substituting the random variable T_n for t, we have

    √n[g(T_n) − g(θ)] = √n(T_n − θ)g′(θ) + √n O[(T_n − θ)²]
                      = √n(T_n − θ)g′(θ) + O_p(n^{−1/2}),   (14.2)

since √n O[(T_n − θ)²] = √n O[O_p(n^{−1})] = O_p(n^{−1/2}). Since the O_p(n^{−1/2}) term is asymptotically negligible, √n[g(T_n) − g(θ)] has the same limiting distribution as √n(T_n − θ)g′(θ); that is, g(T_n) − g(θ) behaves like the constant multiple g′(θ) of (T_n − θ). Now, (T_n − θ) is approximately normal with variance σ²/n. Thus, g(T_n) − g(θ) is approximately normal with variance σ²[g′(θ)]²/n. More precisely,

    √n[g(T_n) − g(θ)] →d N(0, σ²[g′(θ)]²).   (14.3)

Figure 3.1 illustrated this result, and in Section 3.1.6 it was applied to the sample logit. Result (14.3) is called the delta method for obtaining asymptotic distributions.
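A small Monte Carlo sketch of (14.3) (added here; it is not part of the text): for i.i.d. exponential observations with mean θ, the sample mean T_n has σ = θ, and g(t) = log t has g′(θ) = 1/θ, so σ²[g′(θ)]² = 1 and √n[log T_n − log θ] should be approximately N(0, 1).

```python
import random
import math

# Monte Carlo sketch of the delta method (added; not from the text).
# Exponential observations with mean theta: sigma = theta, g(t) = log(t),
# g'(theta) = 1/theta, so sigma^2 [g'(theta)]^2 = 1 in (14.3), and
# sqrt(n)[log(T_n) - log(theta)] should be approximately N(0, 1).
random.seed(7)
theta, n, reps = 2.0, 300, 3000

vals = []
for _ in range(reps):
    # random.expovariate takes the rate 1/theta, giving mean theta
    tn = sum(random.expovariate(1 / theta) for _ in range(n)) / n
    vals.append(math.sqrt(n) * (math.log(tn) - math.log(theta)))

mean = sum(vals) / reps
var = sum((v - mean) ** 2 for v in vals) / (reps - 1)
print(round(mean, 3), round(var, 3))  # mean near 0, variance near 1
```

The simulated variance is close to the value 1 that the delta method predicts; the exact variance exceeds 1 only by a term of order 1/n.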
Since σ² = σ²(θ) and g′(θ) usually depend on θ, the asymptotic variance is unknown. Let σ(T_n) and g′(T_n) denote these terms evaluated at the sample estimator T_n of θ. When g′(·) and σ = σ(·) are continuous at θ, σ(T_n)g′(T_n) is a consistent estimator of σ(θ)g′(θ). Thus, confidence intervals and tests use the result that √n[g(T_n) − g(θ)]/[σ(T_n)g′(T_n)] is asymptotically standard normal. For instance,

    g(T_n) ± z_{α/2} σ(T_n)g′(T_n)/√n

is a large-sample 100(1 − α)% confidence interval for g(θ). When g′(θ) = 0, (14.3) is uninformative because the limiting variance equals 0. In that case, √n[g(T_n) − g(θ)] = o_p(1), and higher-order terms in the Taylor series expansion yield the asymptotic distribution (see Note 14.1).

14.1.3 Delta Method for Function of Random Vector

The delta method generalizes to functions of random vectors. Suppose that T_n = (T_n1, ..., T_nN)′ is asymptotically multivariate normal with mean θ = (θ_1, ..., θ_N)′ and covariance matrix Σ/n. Suppose that g(t_1, ..., t_N) has a nonzero differential φ = (φ_1, ..., φ_N)′ at θ, where

    φ_i = ∂g/∂t_i evaluated at t = θ.

Then

    √n[g(T_n) − g(θ)] →d N(0, φ′Σφ).   (14.4)

For large n, g(T_n) has distribution similar to the normal with mean g(θ) and variance φ′Σφ/n. The proof of (14.4) follows from the expansion

    g(T_n) − g(θ) = (T_n − θ)′φ + o(‖T_n − θ‖),

where ‖z‖ = (Σ z_i²)^{1/2} denotes the length of vector z. For large n, g(T_n) − g(θ) behaves like a linear function of the approximately normal random vector (T_n − θ). Thus, it itself is approximately normal.

14.1.4 Asymptotic Normality of Functions of Multinomial Counts

The delta method for random vectors implies asymptotic normality of many functions of cell counts in contingency tables. Suppose that cell counts (n_1, ..., n_N) have a multinomial distribution with cell probabilities π = (π_1, ..., π_N)′. Let n = n_1 + ⋯ + n_N, and let p = (p_1, ..., p_N)′ denote the sample proportions, where p_i = n_i/n. Denote observation i of the n cross-classified in the contingency table by Y_i = (Y_i1, ..., Y_iN), where Y_ij = 1 if it falls in cell j, and Y_ij = 0 otherwise, i = 1, ..., n. For instance, Y_6 = (0, 0, 1, 0, 0, ..., 0) means that observation 6 is in the third cell of the table. Now, since each observation falls in just one cell, Σ_j Y_ij = 1 and Y_ij Y_ik = 0 when j ≠ k. Also, p_j = Σ_i Y_ij/n, and

    E(Y_ij) = P(Y_ij = 1) = π_j = E(Y_ij²),   E(Y_ij Y_ik) = 0 if j ≠ k.

It follows that

    E(Y_i) = π and cov(Y_i) = Σ,   i = 1, ..., n,

where Σ = (σ_jk) with

    σ_jj = var(Y_ij) = E(Y_ij²) − [E(Y_ij)]² = π_j(1 − π_j),
    σ_jk = cov(Y_ij, Y_ik) = E(Y_ij Y_ik) − E(Y_ij)E(Y_ik) = −π_j π_k for j ≠ k.

The matrix Σ has form

    Σ = diag(π) − ππ′,

where diag(π) is the diagonal matrix with the elements of π on the main diagonal. Since p is a sample mean of n independent observations, namely

    p = Σ_{i=1}^{n} Y_i / n,

we have

    cov(p) = [diag(π) − ππ′]/n.   (14.5)

This covariance matrix is singular, because of the linear dependence Σ p_i = 1. The multivariate central limit theorem (Rao 1973, p. 128) implies

    √n(p − π) →d N(0, diag(π) − ππ′).   (14.6)

By the delta method, functions of p having nonzero differential at π are also asymptotically normal. Let g(t_1, ..., t_N) be a differentiable function, and let φ_i denote ∂g/∂t_i evaluated at t = π, i = 1, ..., N. By the delta method (14.4),

    √n[g(p) − g(π)] →d N(0, φ′[diag(π) − ππ′]φ).   (14.7)

The asymptotic variance equals

    φ′ diag(π) φ − (φ′π)² = Σ_i π_i φ_i² − (Σ_i π_i φ_i)².

In Section 3.1.7 we used this formula to derive the large-sample variance of the sample log odds ratio.

14.1.5 Delta Method for Vector Function of Random Vector

The delta method generalizes further to a vector of functions of an asymptotically normal random vector. Let g(t) = (g_1(t), ..., g_q(t))′ and let (∂g/∂π) denote the q × N Jacobian matrix for which the entry in row i and column j is ∂g_i(t)/∂t_j evaluated at t = π. Then

    √n[g(T_n) − g(π)] →d N(0, (∂g/∂π) Σ (∂g/∂π)′).   (14.8)

The rank of the limiting normal distribution equals the rank of the asymptotic covariance matrix. Expression (14.8) is useful for finding large-sample joint distributions. For instance, from (14.6), (14.7), and (14.8), the asymptotic distribution of several functions of multinomial proportions has covariance matrix of the form

    asymp. cov{√n[g(p) − g(π)]} = Φ[diag(π) − ππ′]Φ′,

where Φ is the Jacobian (∂g/∂π).

14.1.6 Joint Asymptotic Normality of Log Odds Ratios

We illustrate formula (14.8) by finding the joint asymptotic distribution of a set of log odds ratios in a contingency table. We use the log scale because convergence to normality is more rapid for it. Let g(π) = log(π) denote the vector of natural logs of cell probabilities, for which

    ∂g/∂π = [diag(π)]^{−1}.

The covariance of the asymptotic distribution of √n[log(p) − log(π)] is

    [diag(π)]^{−1}[diag(π) − ππ′][diag(π)]^{−1} = [diag(π)]^{−1} − 11′,

where 1 is an N × 1 vector of 1 elements. For a q × N matrix of constants C, it follows that

    √n C[log(p) − log(π)] →d N(0, C[diag(π)]^{−1}C′ − C11′C′).   (14.9)

Now, suppose C log(p) is a set of sample log odds ratios. Then each row of C contains zeros except for two +1 elements and two −1 elements in the positions multiplied by the relevant elements of log(p) to form the given log odds ratio. The second term in the covariance matrix in (14.9) is then zero. If a particular odds ratio uses the cells numbered h, i, j, and k, the variance of the asymptotic distribution is

    asymp. var[√n(sample log odds ratio)] = π_h^{−1} + π_i^{−1} + π_j^{−1} + π_k^{−1}.

When two log odds ratios have no cells in common, their asymptotic covariance in the limiting normal distribution equals zero.
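With sample proportions substituted for the π's, the variance formula above yields the familiar estimated standard error of a sample log odds ratio in a 2 × 2 table: (1/n)(1/p_h + 1/p_i + 1/p_j + 1/p_k) reduces to the sum of reciprocal cell counts. A minimal sketch (the counts below are hypothetical):

```python
import math

# Estimated SE of a sample log odds ratio via the delta-method formula:
# (1/n)(1/p_h + 1/p_i + 1/p_j + 1/p_k) = 1/n11 + 1/n12 + 1/n21 + 1/n22.
# The 2 x 2 counts used here are hypothetical, for illustration only.
def log_odds_ratio(n11, n12, n21, n22):
    """Return the sample log odds ratio and its estimated standard error."""
    est = math.log((n11 * n22) / (n12 * n21))
    se = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    return est, se

est, se = log_odds_ratio(30, 20, 15, 35)
print(round(est, 3), round(se, 3))  # log(3.5) ~ 1.253, SE ~ 0.423
ci = (est - 1.96 * se, est + 1.96 * se)  # large-sample 95% interval
```

Because convergence to normality is faster on the log scale, the Wald interval is formed for the log odds ratio and then exponentiated if an interval for the odds ratio itself is wanted.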
14.2 ASYMPTOTIC DISTRIBUTIONS OF ESTIMATORS OF MODEL PARAMETERS AND CELL PROBABILITIES

We now derive basic results of large-sample model-based inference for contingency tables. The delta method is the key tool. The derivations apply to a single multinomial distribution. They extend directly to products of multinomials, when the parameter space stays fixed as the sample size increases.

The observations are counts n = (n_1, ..., n_N)′ in N cells of a contingency table. The asymptotics regard N as fixed and let n = Σ n_i → ∞. We assume that n = np has a multinomial distribution with probabilities π = (π_1, ..., π_N)′. The model is

    π = π(θ),

where π(θ) denotes a function that relates π to a smaller number of parameters θ = (θ_1, ..., θ_q)′. As θ ranges over its parameter space, π(θ) ranges over a subset of the space of π for N probabilities. Adding components to θ, the model becomes more complex and the space of π that satisfy the model is larger.

We use θ and π to denote generic parameter and probability values, and θ_0 = (θ_10, ..., θ_q0)′ and π_0 = (π_10, ..., π_N0)′ = π(θ_0) to denote true values for a particular application. When the model does not hold, no θ_0 exists for which π(θ_0) = π_0; that is, π_0 falls outside the subset of π values that is the range of π(θ) for the space of possible θ. We consider this case in Section 14.3.5.

We first derive the asymptotic distribution of the ML estimator θ̂ of θ. We use that to derive the asymptotic distribution of the model-based ML estimator π̂ = π(θ̂) of π. The approach follows Rao (1973, Sec. 5e) and Bishop et al. (1975, Secs. 14.7 and 14.8). The assumed regularity conditions are:

1. θ_0 is not on the boundary of the parameter space.
2. All π_i0 > 0.
3. π(θ) has continuous first-order partial derivatives in a neighborhood of θ_0.
4. The Jacobian matrix (∂π/∂θ) has full rank q at θ_0.

These conditions ensure that π(θ) is locally smooth and one-to-one at θ_0 and Taylor series expansions exist in neighborhoods around θ_0 and π_0. When the Jacobian does not have full rank, often it does with reformulation of the model using fewer parameters.

14.2.1 Distribution of Model Parameter Estimator

The key to deriving the asymptotic distribution of θ̂ is to express θ̂ as a linearized function of p. Then the delta method applies, using the asymptotic normality of p. The linearization has two steps, first relating p to π̂, and then π̂ to θ̂. The kernel of the multinomial log likelihood is

    L(θ) = log Π_{i=1}^{N} [π_i(θ)]^{n_i} = n Σ_{i=1}^{N} p_i log π_i(θ).

The likelihood equations are

    ∂L(θ)/∂θ_j = n Σ_i [p_i/π_i(θ)] ∂π_i(θ)/∂θ_j = 0,   j = 1, ..., q.   (14.10)

These depend on the functional form π(θ) used in the model. Note that

    Σ_i ∂π_i(θ)/∂θ_j = ∂[Σ_i π_i(θ)]/∂θ_j = ∂(1)/∂θ_j = 0.   (14.11)

Let ∂π̂_i/∂θ̂_j represent ∂π_i(θ)/∂θ_j evaluated at θ̂. Subtracting a common term from both sides of the jth likelihood equation (14.10),

    Σ_i n(p_i − π_i0)(1/π̂_i)(∂π̂_i/∂θ̂_j) = Σ_i n(π̂_i − π_i0)(1/π̂_i)(∂π̂_i/∂θ̂_j),   (14.12)

since the first sum on the right-hand side equals zero from (14.11). Next we express π̂ in terms of θ̂ using

    π̂_i − π_i0 = Σ_k (θ̂_k − θ_k0) ∂π_i*/∂θ_k,

where ∂π_i*/∂θ_k represents ∂π_i/∂θ_k evaluated at some point θ* falling between θ̂ and θ_0. Substitution of this into the right-hand side of (14.12) and division of both sides by √n yields, for each j,

    Σ_i √n(p_i − π_i0)(1/π̂_i)(∂π̂_i/∂θ̂_j) = Σ_k √n(θ̂_k − θ_k0)[Σ_i (1/π̂_i)(∂π̂_i/∂θ̂_j)(∂π_i*/∂θ_k)].   (14.13)

Some notation lets us express more simply the dependence of θ̂ on p. Let A denote the N × q matrix having elements

    a_ij = π_i0^{−1/2} ∂π_i(θ_0)/∂θ_j.

The matrix expression for A is

    A = [diag(π_0)]^{−1/2}(∂π/∂θ_0),   (14.14)

where (∂π/∂θ_0) denotes the Jacobian (∂π/∂θ) evaluated at θ_0. As θ̂ converges to θ_0, the term in brackets on the right-hand side of (14.13) converges to the element in row j and column k of A′A. As θ̂ → θ_0, the set of equations (14.13) has the form

    A′[diag(π_0)]^{−1/2} √n(p − π_0) = (A′A)√n(θ̂ − θ_0) + o_p(1).

Since the Jacobian has full rank at θ_0, A′A is nonsingular. Thus,

    √n(θ̂ − θ_0) = (A′A)^{−1}A′[diag(π_0)]^{−1/2} √n(p − π_0) + o_p(1).   (14.15)

Now, the asymptotic distribution of p determines that of θ̂. From (14.6), √n(p − π_0) is asymptotically normal, with covariance matrix [diag(π_0) − π_0π_0′]. By the delta method, √n(θ̂ − θ_0) is also asymptotically normal, with asymptotic covariance matrix

    (A′A)^{−1}A′[diag(π_0)]^{−1/2} × [diag(π_0) − π_0π_0′] × [diag(π_0)]^{−1/2}A(A′A)^{−1}.

Using (14.11) and (14.14), the term subtracted in this expression disappears because

    π_0′[diag(π_0)]^{−1/2}A = π_0′[diag(π_0)]^{−1/2}[diag(π_0)]^{−1/2}(∂π/∂θ_0) = 1′(∂π/∂θ_0) = (Σ_i ∂π_i/∂θ_0)′ = 0′.

Thus, this asymptotic covariance expression for √n(θ̂ − θ_0) simplifies to (A′A)^{−1}. In summary, this argument establishes the general result

    √n(θ̂ − θ_0) →d N(0, (A′A)^{−1}).   (14.16)

The asymptotic covariance matrix of θ̂ depends on (∂π/∂θ_0) and hence on the function for modeling π in terms of θ. Let Â denote A evaluated at the ML estimate θ̂. The estimated covariance matrix is

    côv(θ̂) = (Â′Â)^{−1}/n.

The asymptotic normality and covariance of θ̂ follow more simply from general results for ML estimators. However, those results require stronger regularity conditions (Rao 1973, p. 364) than the ones assumed here. Suppose that observations are independent from f(y; θ), some probability mass function. The ML estimator θ̂ is efficient, in the sense that

    √n(θ̂ − θ) →d N(0, I^{−1}),

where I is the information matrix for a single observation. The (j, k) element of I is

    −E[∂² log f(y; θ)/∂θ_j ∂θ_k] = E[(∂ log f(y; θ)/∂θ_j)(∂ log f(y; θ)/∂θ_k)].

When f is the probability of a single observation having multinomial probabilities {π_1(θ), ..., π_N(θ)}, this element of I equals

    Σ_{i=1}^{N} [∂ log π_i(θ)/∂θ_j][∂ log π_i(θ)/∂θ_k] π_i(θ) = Σ_{i=1}^{N} [∂π_i(θ)/∂θ_j][∂π_i(θ)/∂θ_k]/π_i(θ).

This is the (j, k) element of A′A. Thus the asymptotic covariance is I^{−1} = (A′A)^{−1}.

For results of this section to apply, an ML estimator of θ must exist and be a solution of the likelihood equations. This requires the following strong identifiability condition: For every ε > 0, there exists a δ > 0 such that if ‖θ − θ_0‖ > ε, then ‖π(θ) − π_0‖ > δ. This condition implies a weaker one that two θ values cannot have the same π value. When strong identifiability and the other regularity conditions hold, the probability that an ML estimator is a root of the likelihood equations converges to 1 as n → ∞. That estimator has the asymptotic properties given above of a solution of the likelihood equations. For proofs, see Birch (1964a) and Rao (1973, pp. 360–362).

14.2.2 Asymptotic Distribution of Cell Probability Estimators

The asymptotic distribution of the model-based estimator π̂ follows from the Taylor-series expansion

    π̂ = π(θ̂) = π(θ_0) + (∂π/∂θ_0)(θ̂ − θ_0) + o_p(n^{−1/2}).   (14.17)

The size of the remainder term follows from (θ̂ − θ_0) = O_p(n^{−1/2}). Now π(θ_0) = π_0, and √n(θ̂ − θ_0) is asymptotically normal with asymptotic covariance (A′A)^{−1}. By the delta method,

    √n(π̂ − π_0) →d N(0, (∂π/∂θ_0)(A′A)^{−1}(∂π/∂θ_0)′).   (14.18)

When the model holds with θ having q < N − 1 elements, π̂ = π(θ̂) is more efficient than the sample proportion p for estimating π. More generally, for estimating a smooth function g(π) of π, g(π̂) has smaller asymptotic variance than g(p). We next derive this result, discussed in Section 6.4.5.

The derivation deletes the Nth component from p and π̂, so their covariance matrices are positive definite (Problem 14.16). The Nth proportion is linearly dependent on the first N − 1, since they sum to 1. Let Σ = diag(π) − ππ′ denote the (N − 1) × (N − 1) covariance matrix of √n p. The inverse of Σ is

    Σ^{−1} = [diag(π)]^{−1} + 11′/π_N,   (14.19)

which can be verified by evaluating ΣΣ^{−1} and showing that it equals the identity matrix. Let (∂g/∂π_0) = (∂g/∂π_1, ..., ∂g/∂π_{N−1})′, evaluated at π = π_0. By the delta method,

    asymp. var[√n g(p)] = (∂g/∂π_0)′ cov(√n p)(∂g/∂π_0) = (∂g/∂π_0)′ Σ (∂g/∂π_0)

and

    asymp. var[√n g(π̂)] = (∂g/∂π_0)′ asymp. cov(√n π̂)(∂g/∂π_0).

Using (14.11) and (14.19) yields

    A′A = (∂π/∂θ_0)′[diag(π_0)]^{−1}(∂π/∂θ_0) = (∂π/∂θ_0)′ Σ^{−1} (∂π/∂θ_0),

so that

    asymp. cov(√n π̂) = (∂π/∂θ_0)[(∂π/∂θ_0)′ Σ^{−1} (∂π/∂θ_0)]^{−1}(∂π/∂θ_0)′.

Since Σ is positive definite and (∂π/∂θ_0) has rank q, Σ^{−1} and [(∂π/∂θ_0)′ Σ^{−1} (∂π/∂θ_0)]^{−1} are also positive definite. To show that asymp. var[√n g(p)] ≥ asymp. var[√n g(π̂)], we show that

    (∂g/∂π_0)′ {Σ − (∂π/∂θ_0)[(∂π/∂θ_0)′ Σ^{−1} (∂π/∂θ_0)]^{−1}(∂π/∂θ_0)′} (∂g/∂π_0) ≥ 0.

But this quadratic form is identical to

    (Y − Bλ)′ Σ^{−1} (Y − Bλ),

where Y = Σ(∂g/∂π_0), B = (∂π/∂θ_0), and λ = (B′Σ^{−1}B)^{−1}B′Σ^{−1}Y. The result then follows from the positive definiteness of Σ^{−1}.

This proof is based on one given by Altham (1984). Her proof uses standard properties of ML estimators. It applies whenever regularity conditions hold that guarantee those properties. The proof applies not only to categorical data but to any situation in which a model describes the dependence of a set of parameters π on some smaller set θ.

14.3 ASYMPTOTIC DISTRIBUTIONS OF RESIDUALS AND GOODNESS-OF-FIT STATISTICS

We next study the distribution of the Pearson X² and likelihood-ratio G² goodness-of-fit statistics for the multinomial model π = π(θ). We first derive the asymptotic joint distribution of the sample proportions p and the model-based estimator π̂. This distribution determines large-sample distributions of statistics that depend on both p and π̂. For instance, it determines the asymptotic joint distribution of the Pearson residuals, which compare p with π̂. Deriving the large-sample chi-squared distribution for X², which is the sum of squared Pearson residuals, is then straightforward. We also show that X² and G² are asymptotically equivalent when the model holds. Our presentation borrows from Bishop et al. (1975, Chap. 14), Cox (1984), Cramér (1946, pp. 432–433), and Rao (1973, Sec. 6b).

14.3.1 Joint Asymptotic Normality of p and π̂

We first express the joint dependence of p and π̂ on p, in order to show the joint asymptotic normality of p and π̂. Let

    D = [diag(π_0)]^{1/2} A(A′A)^{−1}A′ [diag(π_0)]^{−1/2}.

From (14.15) and (14.17),

    π̂ − π_0 = (∂π/∂θ_0)(θ̂ − θ_0) + o_p(n^{−1/2}) = D(p − π_0) + o_p(n^{−1/2}).
We also show that X 2 and G 2 are asymptotically equivalent, when the model holds. Our presentation borrows from Bishop et al. Ž1975, Chap 14., Cox Ž1984., Cramer ´ Ž1946, pp. 432433., and Rao Ž1973, Sect. 6b.. 14.3.1 Joint Asymptotic Normality of p and  ˆ We first express the joint dependence of p and  ˆ on p, in order to show the joint asymptotic normality of p and . ˆ Let D s diag Ž  0 . 1r2 X A Ž AA . A diag Ž  0 . y1 X y1r2 From Ž14.15. and Ž14.17.,  ˆ y 0 s  0 Ž ˆ y 0 . q o p Ž ny1r2 . s D Ž p y  0 . q o p Ž ny1r2 . . . 588 ASYMPTOTIC THEORY FOR PARAMETRIC MODELS Therefore, 'n ž p y 0  ˆ y 0 / s ž /' Ž I D n p y  0 . q o p Ž 1. , where I is a N = N identity matrix. By the delta method, ž p y 0  ˆ y 0 / d N Ž 0, * . Ž 14.20 . 6 'n where * s ž diag Ž  0 . y  0 X0 diag Ž  0 . y  0 X0 DX D diag Ž  0 . y  0 X0 D diag Ž  0 . y  0 X0 DX / . Ž 14.21 . The two matrix blocks on the main diagonal of * are covŽ'n p. and asymp. covŽ'n  ˆ ., derived previously. The new information here is that asymp. covŽ'n p, 'n  ˆ . s wdiagŽ  0 . y  0 X0 xDX . 14.3.2 Asymptotic Distribution of Pearson and Standardized Residuals For cell counts  n i 4 the Pearson statistic is X 2 s Ýe i2 , where ei s ni y ␮ ˆi  ˆ1r2 i s 'n Ž pi y ˆ i . ˆ i1r2 . We next derive the asymptotic distribution of e s Ž e1 , . . . , e N .X , which is a diagnostic measure of lack of fit. For Poisson models it is the Pearson residual. Dividing it by its standard error gives the standardized residual. The distribution of e is also helpful in deriving the distribution of X 2 . The residuals e are functions of p and , ˆ which are jointly asymptotically normal from Ž14.20.. To use the delta method, we calculate e ir pi s 'n ˆy1r2 , i e ir ˆ i s y'n Ž pi q ˆ i . r2 ˆy3r2 i e ir p j s e ir ˆ j s 0 for i / j. That is, e p e  ˆ s 'n diag Ž  ˆ. y1r2 and s y Ž 12 . 'n diag Ž p . q diag Ž  ˆ . diag Ž  ˆ. y3r2 . Ž 14.22 . Evaluated at p s  0 and  ˆ s  0 , these matrices equal 'n diagŽ  0 .y1r2 and y1r2 ' Ž . y n diag  0 . 
Using Ž14.21., Ž14.22., and AX  1r2 s 0 wwhich follows 0 589 DISTRIBUTIONS OF RESIDUALS AND FIT STATISTICS from Ž14.11.x, the delta method implies that d X 1r2 X N Ž 0, I y  1r2 y A Ž AA . 0 0 6 e A .. Ž 14.23 . y1 X The limiting distribution has form N Ž0, I y Hat., where Hat is the hat matrix ŽSection 4.5.5.. Although asymptotically normal, e behaves less variably than standard normal random variables. The standardized Pearson residual ŽHaberman 1973a. divides e by its estimated standard error. This statistic, which is asymptotically standard normal, equals ri s ei 1 y ˆ i y Ý j Ý k Ž 1r ˆ i . ž ˆ ir  j /Ž ˆ ® jk ir  k . ˆ 1r2 , Ž 14.24 . ˆXˆ.y1 . The where ˆ ® jk denotes the element in row j and column k of ŽAA denominator of ri is 1yˆ h i , where the leverage ˆ h i for observation i estimates the ith diagonal element of the hat matrix. This simplifies to Ž3.13. for testing independence in two-way tables. ' 14.3.3 Asymptotic Distribution of Pearson Statistic The proof that the Pearson X 2 statistic has an asymptotic chi-squared distribution uses the following relationship between normal and chi-squared distributions ŽRao 1973, p. 188.: Let X be multivariate normal with mean  and covariance matrix B. A necessary and sufficient condition for ŽX y  .X CŽX y  . to have a chi-squared distribution is BCBCB s BCB. The degrees of freedom equal the rank of CB. When B is nonsingular, the condition simplifies to CBC s C. The Pearson statistic relates to e by X 2 s eX e, so we apply this result by X y1 X 1r2 X . A 4. identifying X with e,  s 0, C s I, and B s I y  1r2 y AŽAA 0 0 X X 2 Ž . Ž .  4 Since C s I, the condition for X y  C X y  s e e s X to have a chi-squared distribution simplifies to BBB s BB. A direct computation using AX  1r2 s 0 shows that B is idempotent, so the condition holds. Since e is 0 asymptotically multivariate normal, X 2 is asymptotically chi-squared. For symmetric idempotent matrices, the rank equals the trace. 
The trace X 1r2 1r2 X equals the trace of  1r2  0 s Ý i0 s 1, of I is N; the trace of  1r2 0 0 0 X y1 X X y1 X which is 1; the trace of AŽAA. A equals the trace of ŽAA. ŽAA. s identity matrix of size q = q, which is q. Thus, the rank of B s CB is N y q y 1, and the asymptotic chi-squared distribution has df s N y q y1. This result, due to Fisher Ž1922., is remarkably simple. When the sample size is large, the distribution of X 2 does not depend on  0 or the model form. It depends only on the difference between the dimension of , which is N y 1, and the dimension of . With q s 0 parameters, X 2 is Pearson’s 590 ASYMPTOTIC THEORY FOR PARAMETRIC MODELS Ž1900. statistic Ž1.15. for testing that multinomial probabilities equal certain specified values, and df s N y 1 as Pearson claimed. Watson Ž1959. showed that the same result holds for the asymptotic conditional distribution, given a sufficient statistic for nuisance parameters. 14.3.4 Asymptotic Distribution of Likelihood-Ratio Statistic When the model holds, the likelihood-ratio statistic G 2 is asymptotically equivalent to X 2 as n ™ . To show this, we express G 2 s 2 Ý n i log i ni  ˆi ž pi y ˆ i s 2 n Ý pi log 1 q i ˆi / and apply the expansion log Ž 1 q x . s x y x 2r2 q x 3r3 y  for x - 1. We identify x with Ž pi y ˆ i .r ˆ i , which converges in probability to 0 when the model holds. For large n, G s 2 n Ý ˆ i q Ž pi y ˆ i . 2 pi y ˆ i s 2nÝ Ž pi y ˆ i . y i s nÝ i Ž pi y ˆ i . ˆi ž / y ˆi i 1 Ž pi y ˆ i . 2 ˆi ž / 1 Ž pi y ˆ i . 2 ˆ i2 2 q Ž pi y ˆ i . 2 q  2 ˆi q Op Ž pi y ˆ i . 3 2 q 2 nOp Ž ny3r2 . s X 2 q Op Ž ny1r2 . s X 2 q op Ž 1 . , since ÝŽ pi y ˆ i . s 0 and Ž pi y ˆ i . s Ž pi y i . y Ž ˆ i y i ., both of which are Op Ž ny1r2 .. Thus, when the model holds, the difference between X 2 and G 2 converges in probability to 0. As a consequence, G 2 , like X 2 , has an asymptotic chi-squared distribution with df s N y q y 1. The parameter value that maximizes the likelihood is the one that minimizes G 2 . 
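The asymptotic equivalence of X² and G² when the model holds can be checked numerically. The sketch below (added for illustration; the counts are hypothetical) computes both statistics for testing completely specified multinomial probabilities, the q = 0 case with df = N − 1:

```python
import math

# Added numerical check (hypothetical counts): when the model holds and n is
# large, X^2 = sum (n_i - mu_i)^2 / mu_i and G^2 = 2 sum n_i log(n_i / mu_i)
# are nearly equal. Here the "model" fully specifies the probabilities
# (q = 0), so both statistics are referred to chi-squared with df = N - 1.
def fit_stats(counts, probs):
    """Return (X^2, G^2) for observed counts against specified probabilities."""
    n = sum(counts)
    fitted = [n * p for p in probs]
    x2 = sum((o - f) ** 2 / f for o, f in zip(counts, fitted))
    g2 = 2 * sum(o * math.log(o / f) for o, f in zip(counts, fitted))
    return x2, g2

x2, g2 = fit_stats([240, 262, 498], [0.25, 0.25, 0.50])
print(round(x2, 3), round(g2, 3))  # 0.984 and about 0.981: nearly equal
```

Both values are far below the chi-squared critical value with df = 2, and their difference is of the O_p(n^{−1/2}) order derived above; for counts far from the fitted values (a model that does not hold), X² − G² need not be small.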
DISTRIBUTIONS OF RESIDUALS AND FIT STATISTICS

To show that the likelihood-maximizing value also minimizes G², let G²(π; p) = 2n Σ_i p_i log(p_i/π_i). The kernel of the multinomial log likelihood is

L(π) = n Σ_i p_i log π_i = −n Σ_i p_i log(p_i/π_i) + n Σ_i p_i log p_i = −(1/2) G²(π; p) + n Σ_i p_i log p_i.

The second term in the last expression does not depend on π, so maximizing L(π) is equivalent to minimizing G² with respect to π.

A fundamental result for G² concerns comparisons of nested models. Suppose that model M₀ is a special case of model M₁. Let q₀ and q₁ denote the numbers of parameters in the two models. Let {π̂_0i} and {π̂_1i} denote ML estimators of cell probabilities for the two models. Then

G²(M₀) − G²(M₁) = 2n Σ_i p_i log(π̂_1i/π̂_0i)

has the form of −2(log likelihood ratio) for testing that M₀ holds against the alternative that M₁ holds. Theory for likelihood-ratio tests suggests that when the simpler model holds, the asymptotic distribution of G²(M₀) − G²(M₁) is chi-squared with q₁ − q₀ degrees of freedom. For details, see Bishop et al. (1975, pp. 525–526), Haberman (1974a, p. 108), and Rao (1973, pp. 418–419). The statistic X²(M₀ | M₁) defined in (9.4) is a quadratic approximation for the G² difference. Haberman (1977a) noted that these tests can perform well even for large, sparse tables, as long as q₁ − q₀ is small compared to the sample size and no expected frequency has larger order of magnitude than the others.

14.3.5 Asymptotic Noncentral Distributions

Results in this chapter assume that a certain parametric model holds. In practice, any unsaturated model almost surely does not hold perfectly, so one might question the scope of these results. This is not problematic if we regard models merely as convenient approximations for reality. For instance, the ML estimator θ̂ converges to a value θ₀ that describes the best fit of the chosen model to reality. In this sense, inferences about θ give us information about a useful approximation for reality.
Similarly, model-based inferences about cell probabilities are inconsistent for the true probabilities when the model does not hold; nevertheless, those inferences are consistent for describing a useful smoothing of reality.

For goodness-of-fit statistics, a relevant distinction exists between limiting behavior when the model holds and when it does not hold. When the model holds, we've seen that X² and G² have a limiting chi-squared distribution, and the difference between them disappears as n increases. When the model does not hold, X² and G² tend to grow unboundedly as n increases, and X² − G² need not go to zero. One method for obtaining proper limiting distributions considers a sequence of situations {π_n} for which the lack of fit diminishes as n increases. Specifically, the model is π = f(θ), but in reality

π_n = f(θ) + δ/√n.    (14.25)

The best fit of the model to the population has ith probability equal to f_i(θ), but the true value differs from that by δ_i/√n.

For this representation, Mitra (1958) showed that the Pearson X² has a limiting noncentral chi-squared distribution, with df = N − q − 1 and noncentrality parameter

λ = n Σ_{i=1}^{N} [π_{ni} − f_i(θ)]² / f_i(θ).

This has the form of X², with the sample values p_i and π̂_i replaced by the population values π_{ni} and f_i(θ). Similarly, the noncentrality of the likelihood-ratio statistic has the form of G², with the same substitution. Haberman (1974a, pp. 109–112) showed that under certain conditions G² and X² have the same limiting distribution; that is, their noncentrality values converge to a common value as n → ∞.

Representation (14.25) means that for large n, the noncentral chi-squared approximation is valid when the model is just barely incorrect. In practice, it is often reasonable to adopt (14.25) for fixed, finite n to approximate the distribution of X², even though (14.25) would not be plausible as we obtain more data. The alternative representation

π = f(θ)
+ δ    (14.26)

in which π differs from f(θ) by a fixed amount as n → ∞ may seem more natural. In fact, this is more appropriate than (14.25) for proving the test to be consistent (i.e., for convergence to 1 of the probability of rejecting the hypothesis that the model holds). For (14.26), however, the noncentrality parameter λ grows unboundedly as n → ∞, and a proper limiting distribution does not result for X² and G². When the model holds, δ = 0 in either representation (14.25) or (14.26). That is, f(θ) = π(θ), λ = 0, and the results in Sections 14.3.3 and 14.3.4 apply.

14.4 ASYMPTOTIC DISTRIBUTIONS FOR LOGIT / LOGLINEAR MODELS

For loglinear models, formulas in Section 8.6 for the asymptotic covariance matrices of β̂ and μ̂ are special cases of ones derived in Section 14.2. We present these for the multinomial form of the models, which relates directly to that section. Then we discuss the connection to Poisson loglinear models. To constrain probabilities to sum to 1, we express loglinear models for multinomial sampling as

π = exp(Xβ) / 1'exp(Xβ)    (14.27)

where X is a model matrix and 1' = (1, . . . , 1). Letting x_i denote row i of X,

π_i = π_i(β) = exp(x_i β) / Σ_k exp(x_k β).

14.4.1 Asymptotic Covariance Matrices

A model affects covariance matrices through the Jacobian. Since

∂π_i/∂β_j = [exp(x_i β) x_{ij} Σ_k exp(x_k β) − exp(x_i β) Σ_k x_{kj} exp(x_k β)] / [Σ_k exp(x_k β)]²
          = π_i (x_{ij} − Σ_k π_k x_{kj}),

the matrix of these elements has the form

∂π/∂β = [diag(π) − ππ']X.

Using this with (14.14) and (14.16), the information matrix at β₀ is

A'A = (∂π/∂β₀)' [diag(π₀)]⁻¹ (∂π/∂β₀)
    = X'[diag(π₀) − π₀π₀'] [diag(π₀)]⁻¹ [diag(π₀) − π₀π₀']X
    = X'[diag(π₀) − π₀π₀']X.

Thus, for multinomial loglinear models, β̂ is asymptotically normally distributed with estimated covariance matrix

côv(β̂) = {X'[diag(π̂) − π̂π̂']X}⁻¹ / n.    (14.28)

Similarly, from (14.18)
the estimated asymptotic covariance matrix of π̂ is

côv(π̂) = [diag(π̂) − π̂π̂']X {X'[diag(π̂) − π̂π̂']X}⁻¹ X'[diag(π̂) − π̂π̂'] / n.

From (14.23), the Pearson residuals e are asymptotically normal with

asymp. cov(e) = I − √π₀(√π₀)' − A(A'A)⁻¹A'
              = I − √π₀(√π₀)' − [diag(π₀)]^(−1/2) [diag(π₀) − π₀π₀']X {X'[diag(π₀) − π₀π₀']X}⁻¹ X'[diag(π₀) − π₀π₀'] [diag(π₀)]^(−1/2).

14.4.2 Connection with Poisson Loglinear Models

This book has expressed loglinear models in terms of Poisson expected cell frequencies μ = (μ₁, . . . , μ_N)', using formulas of the form

log μ = X_a β_a.    (14.29)

The model matrix X_a and parameter vector β_a in this formula are slightly different from X and β in multinomial model (14.27). The Poisson expression (14.29) does not have constraints on μ. For multinomial model (14.27), Σ_i μ_i = n is fixed, and π = μ/n satisfies

log μ = log nπ = Xβ + [log n − log(1'exp(Xβ))]1 = Xβ + α1,

where α = log n − log(1'exp(Xβ)). In other words, multinomial model (14.27) implies Poisson model (14.29) with

X_a = [1 : X] and β_a = (α, β')'.

The columns of X in the multinomial representation must be linearly independent of 1; that is, the parameter α, which relates to the total sample size, does not appear in β. The dimension of β is 1 less than the number of parameters reported in this text for Poisson loglinear models. For instance, for the saturated model, β has N − 1 elements in the multinomial representation, reflecting the sole constraint on π of Σ_i π_i = 1.

NOTES

Section 14.1: Delta Method

14.1. For detailed discussion of large-sample theory including the delta method, see Bishop et al. (1975, Chap. 14) and Sen and Singer (1993).

14.2. In applying the delta method to a function g of an asymptotically normal random vector T_n, suppose that the first-order, . . . , (a − 1)st-order differentials of the function are zero at θ, but the ath-order differential is nonzero.
A generalization of the delta method implies that n^(a/2)[g(T_n) − g(θ)] has a limiting distribution involving products of order a of components of a normal random vector. When a = 2, the limiting distribution is a quadratic form in a multivariate normal vector, which often relates to a chi-squared distribution; in the univariate case, it is σ²g″(θ)/2 times a χ²₁ variable (Casella and Berger 2001, p. 244).

Resampling methods such as the jackknife and the bootstrap are alternative tools for estimating standard errors and obtaining confidence intervals. They can be helpful when use of the delta method is questionable, for instance with small samples, highly sparse data, or complex sampling designs. For details, see Davison and Hinkley (1997), Fay (1985), Parr and Tolley (1982), and Simonoff (1986).

Section 14.3: Asymptotic Distributions of Residuals and Goodness-of-Fit Statistics

14.3. If Y is Poisson with E(Y) = μ, then for large μ the delta method implies that √Y is approximately normal with standard deviation 1/2. This motivates an alternative goodness-of-fit statistic, the Freeman–Tukey statistic,

FT = 4 Σ (√y_i − √μ̂_i)².

When the model holds, FT − X² is also o_p(1) as n → ∞. See Bishop et al. (1975, p. 514) for details.

Results of this chapter do not apply when the number of cells N grows as n → ∞, or when different expected frequencies grow at different rates. Haberman (1988) showed that the consistency of X² breaks down with such nonstandard asymptotics.

14.4. Drost et al. (1989) derived noncentral approximations using other sequences of alternatives than the local and fixed ones, (14.25) and (14.26).

PROBLEMS

14.1 Explain why:
a. If c > 0, n^(−c) = o(1) as n → ∞.
b. If c ≠ 0, cz_n has the same order as z_n; that is, o(cz_n) is equivalent to o(z_n) and O(cz_n) is equivalent to O(z_n).
c. o(y_n)o(z_n) = o(y_n z_n), O(y_n)O(z_n) = O(y_n z_n), o(y_n)O(z_n) = o(y_n z_n).
14.2 If X² has an asymptotic chi-squared distribution with fixed df as n → ∞, then explain why X²/n = o_p(1).

14.3 a. Use Tchebychev's inequality to show that if E(X_n) = μ_n and var(X_n) = σ_n² < ∞, then (X_n − μ_n) = O_p(σ_n).
b. Suppose that Y₁, . . . , Y_n are independent with E(Y_i) = μ and var(Y_i) = σ² for i = 1, . . . , n. Let Ȳ_n = (Σ_i Y_i)/n. Apply part (a) to show that Ȳ_n − μ = O_p(n^(−1/2)).

14.4 Let Y be a Poisson random variable with mean μ.
a. For a constant c > 0, show that E[log(Y + c)] = log μ + (c − 1/2)/μ + O(μ⁻²). [Hint: Note that log(Y + c) = log μ + log[1 + (Y + c − μ)/μ].]
b. Cell counts in a 2 × 2 table are independent Poisson random variables. Use part (a) to argue that to reduce bias in estimating the log odds ratio, a sensible estimator is the sample log odds ratio after adding 1/2 to each cell.

14.5 Let p denote the sample proportion for n independent Bernoulli trials. Find the asymptotic distribution of the estimator [p(1 − p)]^(1/2) of the standard deviation. What happens when π = 0.5?

14.6 Suppose that T_n has a Poisson distribution with mean μ = nλ, for fixed λ > 0. For large n, show that the distribution of log T_n is approximately normal with mean log μ and variance μ⁻¹. [Hint: By the central limit theorem, T_n/n is approximately N(λ, λ/n) for large n.]

14.7 a. Refer to Problem 14.6. If T_n is Poisson, show that √T_n has asymptotic variance 1/4.
b. For a binomial sample with n trials and sample proportion p, show that the asymptotic variance of sin⁻¹(√p) is 1/4n. [This transformation and the one in part (a) are variance stabilizing, producing variates with asymptotic variances that are the same for all values of the parameter. Traditionally, these transformations were employed to make ordinary least squares applicable to count data. See Cochran (1940) for discussion and ML analyses.]

14.8 For a multinomial (n, {π_i})
distribution, show that the correlation between p_i and p_j is −[π_i π_j / (1 − π_i)(1 − π_j)]^(1/2). What does this equal when π_i = 1 − π_j and π_k = 0 for k ≠ i, j?

14.9 An animal population has N species, with population proportion π_i of species i. Simpson's index of ecological diversity (Simpson 1949) is I(π) = 1 − Σ_i π_i². [Rao (1982) surveyed diversity measures.]
a. Two animals are randomly chosen from the population, with replacement. Show that I(π) is the probability that they are different species.
b. For proportions p from a random sample, show that the estimated asymptotic standard error of I(p) is 2{[Σ_i p_i³ − (Σ_i p_i²)²]/n}^(1/2).

14.10 Let {Y_i} be independent Poisson random variables. Show by the delta method that the estimated asymptotic variance of Σ a_i log(Y_i) is Σ a_i²/y_i. [This formula applies to ML estimators of parameters for the saturated loglinear model, which are contrasts of {log(y_i)}. Formula (14.9) yields the asymptotic covariance structure of such estimators; see Lee (1977).]

14.11 Assuming two independent binomial samples, derive the asymptotic standard error of the log relative risk (Section 3.1.4).

14.12 Refer to Problem 3.27. The sample size may need to be quite large for the sampling distribution of γ̂ to be approximately normal, especially if |γ| is large. The Fisher-type transform ξ̂ = (1/2) log[(1 + γ̂)/(1 − γ̂)] (Agresti 1984, pp. 166–167, 177; O'Gorman and Woolson 1988) converges more quickly to normality.
a. Show that the asymptotic variance of ξ̂ equals the asymptotic variance of γ̂ multiplied by (1 − γ²)⁻².
b. Explain how to construct a confidence interval for ξ and use it to obtain one for γ.
c. Show that ξ̂ = (1/2) log(C/D). For 2 × 2 tables, show that this is half the log odds ratio.

14.13 Let φ²(T) = Σ_i (T_i − π_i0)²/π_i0. Then φ²(p) = X²/n, where X² is the Pearson statistic (1.15) for testing H₀: π_i = π_i0, i = 1, . . . , N, and nφ²(π) is the noncentrality for that test when π is the true value.
Under H₀, why does the delta method not yield an asymptotic normal distribution for φ²(p)? (See Note 14.2.)

14.14 In an I × J contingency table, let θ_ij denote the local odds ratio (2.10), and let θ̂_ij denote its sample value.
a. Show that asymp. cov(√n log θ̂_ij, √n log θ̂_{i+1,j}) = −[π_{i+1,j}⁻¹ + π_{i+1,j+1}⁻¹].
b. Show that asymp. cov(√n log θ̂_ij, √n log θ̂_{i+1,j+1}) = π_{i+1,j+1}⁻¹.
c. When θ̂_ij and θ̂_hk use mutually exclusive sets of cells, show that asymp. cov(√n log θ̂_ij, √n log θ̂_hk) = 0.
d. State the asymptotic distribution of log θ̂_ij.

14.15 For loglinear model (XY, XZ, YZ), ML estimates of {μ_ijk} and hence the X² and G² statistics are not direct. Alternative approaches may yield direct analyses. For 2 × 2 × 2 tables, find a statistic for testing the hypothesis of no three-factor interaction, using the delta method with the asymptotic normality of log θ̂₁₁₁, where

θ̂₁₁₁ = (p₁₁₁ p₂₂₁ / p₁₂₁ p₂₁₁) / (p₁₁₂ p₂₂₂ / p₁₂₂ p₂₁₂).

14.16 Refer to Section 14.2.2, with Σ = diag(π) − ππ' the covariance matrix of √n (p₁, . . . , p_{N−1})'. Let

Z = c_i with probability π_i, i = 1, . . . , N − 1, and Z = 0 with probability π_N,

and let c = (c₁, . . . , c_{N−1})'.
a. Show that E(Z) = c'π, E(Z²) = c'diag(π)c, and var(Z) = c'Σc.
b. Suppose that at least one c_i ≠ 0, and all π_i > 0. Show that var(Z) > 0, and deduce that Σ is positive definite.
c. If π = (π₁, . . . , π_N)', so that Σ is N × N, prove that Σ is not positive definite.

14.17 Consider the model for a 2 × 2 table, π₁₁ = θ², π₁₂ = π₂₁ = θ(1 − θ), π₂₂ = (1 − θ)², where θ is unknown (Problems 3.31 and 10.34).
a. Find the matrix A in (14.14) for this model.
b. Use A to obtain the asymptotic variance of θ̂. [As a check, it is simple to find it directly using the inverse of −E(∂²L/∂θ²), where L is the log likelihood.] For which θ value is the variance maximized? What is the distribution of θ̂ if θ = 0 or θ = 1?
c. Find the asymptotic covariance matrix of √n π̂.
d. Find df for testing fit using X².
14.18 Refer to the model for the calf data in Section 1.5.6. Obtain the asymptotic variance of the ML estimator π̂.

14.19 Justify the use of estimated asymptotic covariance matrices. For instance, for large samples, why is Â'Â close to A'A?

14.20 Cell counts {Y_i} are independent Poisson random variables, with μ_i = E(Y_i). Consider the Poisson loglinear model log μ = X_a β_a, where μ = (μ₁, . . . , μ_N)'. Using arguments similar to those in Section 14.2, show that the large-sample covariance matrix of β̂_a can be estimated by [X_a' diag(μ̂) X_a]⁻¹, where μ̂ is the ML estimator of μ.

14.21 For a given set of parameter constraints, show that weak identifiability conditions hold for the independence loglinear model for a two-way table; that is, when two values for β give the same π, those parameter vectors must be identical.

14.22 Use the delta method, with derivatives (14.22), to derive the asymptotic covariance matrix in (14.23) for residuals. Show that this matrix is idempotent.

14.23 In some situations, X² and G² take very similar values. Explain the joint influence on this event of (a) whether the model holds, (b) whether the sample size n is large, and (c) whether the number of cells N is large.

14.24 Show X and β in multinomial representation (14.27) for the independence model for an I × J table. By contrast, show X_a for the corresponding Poisson loglinear model (14.29).

14.25 Using (14.18) and (14.28), derive the asymptotic côv(π̂) for a multinomial loglinear model.

14.26 Consider the ML estimator π̂_ij = p_{i+} p_{+j} of π_ij for the independence model, when that model does not hold. Show that E(p_{i+} p_{+j}) = π_{i+} π_{+j} (n − 1)/n + π_ij/n. To what does π̂_ij converge as n increases?

14.27 Let γ denote a generic measure of association. For K independent multinomial samples of sizes {n_k}, suppose that √n_k (γ̂_k − γ_k) →d N(0, σ_k²) as n_k → ∞. A summary measure is

γ̂ = [Σ_k (n_k/σ̂_k²) γ̂_k] / [Σ_k (n_k/σ̂_k²)].

a.
Show that Σ_k z_k² = V + [γ̂²/σ̂²(γ̂)], where

V = Σ_k (n_k/σ̂_k²)(γ̂_k − γ̂)², z_k = n_k^(1/2) γ̂_k/σ̂_k, σ̂²(γ̂) = [Σ_k (n_k/σ̂_k²)]⁻¹.

b. Suppose that n → ∞ with n_k/n → ρ_k > 0, k = 1, . . . , K. State the asymptotic chi-squared distribution for each component in the partitioning in part (a). Indicate the hypothesis that each tests.

Categorical Data Analysis, Second Edition. Alan Agresti. Copyright © 2002 John Wiley & Sons, Inc. ISBN: 0-471-36093-7

CHAPTER 15

Alternative Estimation Theory for Parametric Models

In this book we have used the maximum likelihood (ML) approach to inference. This is by far the most common approach for categorical data analysis. Other paradigms have been used, however. In this chapter we discuss some of them. These methods have asymptotic properties similar to those of maximum likelihood, so the large-sample theory of Chapter 14 applies to them as well.

In Section 15.1 we discuss weighted least squares for fitting models for categorical data. This and related quasi-likelihood methods introduced in Sections 4.7 and 11.4 are sometimes simpler to apply than ML. The Bayesian paradigm is increasingly popular as computations become easier to implement. A full discussion of modern developments with this approach is beyond our scope, but in Section 15.2 we present Bayesian methods of estimating cell probabilities in a contingency table. Four other methods of estimation for categorical data are described in the final section.

15.1 WEIGHTED LEAST SQUARES FOR CATEGORICAL DATA

Weighted least squares (WLS) is an extension of ordinary least squares that permits responses to be correlated and to have nonconstant variance. Familiarity with the WLS method is useful because:

1. WLS computations have a standard form that is simple to apply for a wide variety of models.
2. Algorithms for calculating ML estimates often consist of iterative use of WLS. An example is the Fisher scoring method for generalized linear models (Section 4.6.3).
3.
When the model holds, WLS and ML estimators are asymptotically equivalent, both falling in the class of best asymptotically normal (BAN) estimators. For large samples, the estimators are approximately normally distributed around the parameter value, and the ratio of their variances converges to 1.

Grizzle, Starmer, and Koch (1969) popularized WLS for categorical data analyses. In honor of them, WLS for such analyses is often called the GSK method. This section summarizes the ingredients of this approach.

15.1.1 Notation and Preliminaries for WLS Approach

For a response variable Y with J categories, consider multinomial samples of sizes n₁, . . . , n_I at I levels of an explanatory variable or combinations of levels of several explanatory variables. Let π = (π₁', . . . , π_I')', where

π_i = (π_{1|i}, π_{2|i}, . . . , π_{J|i})' with Σ_j π_{j|i} = 1

denotes the conditional distribution of Y at level i. Let p denote the corresponding sample proportions, with V their IJ × IJ covariance matrix. When the I samples are independent, V is block diagonal,

V = diag(V₁, V₂, . . . , V_I).

From Section 14.1.4, the covariance matrix of √n_i p_i is n_i V_i, which has diagonal elements π_{j|i}(1 − π_{j|i}) and off-diagonal elements −π_{j|i} π_{k|i}.

Each set of proportions has J − 1 linearly independent elements. Let F be a vector of u ≤ I(J − 1) response functions

F(π) = [F₁(π), . . . , F_u(π)]'.

The WLS approach applies to linear models for F of the form

F(π) = Xβ,    (15.1)

where β is a q × 1 vector of parameters and X is a u × q model matrix of known constants having rank q. From Section 8.5.4, loglinear and logit response functions are special cases of F(π) = C log(Aπ) for certain matrices C and A. Let F(p) denote the sample response functions.
We assume that F has continuous second-order partial derivatives in an open region containing π. This assumption enables the delta method to determine the large-sample normal distribution for F(p). The asymptotic covariance matrix of F(p) depends on the u × IJ matrix

Q = [∂F_k(π)/∂π_{j|i}]

for k = 1, . . . , u and all IJ combinations (i, j). Linear response models have response functions of the form F(π) = Aπ for a matrix of known constants A, in which case Q = A. For the generalized loglinear model F(π) = C log(Aπ) (recall Sections 8.5.4 and 11.2.5), Q = C[diag(Aπ)]⁻¹A. [See Magnus and Neudecker (1988) for matrix differential calculus.] By the multivariate delta method (Section 14.1.5), the asymptotic covariance matrix of F(p) is

V_F = QVQ'.

Let V̂_F denote the sample version of V_F, substituting sample proportions in Q and V. For subsequent formulas, this matrix must be nonsingular.

15.1.2 Inference Using the WLS Approach to Model Fitting

For the general model (15.1), the WLS estimate of β is

b = (X'V̂_F⁻¹X)⁻¹ X'V̂_F⁻¹ F(p).

This is the β value that minimizes the quadratic form

[F(p) − Xβ]' V̂_F⁻¹ [F(p) − Xβ].

The ordinary least squares estimate, for uncorrelated responses with constant variance, results when V̂_F is a constant multiple of the identity matrix. The WLS estimator has an asymptotic multivariate normal distribution, with estimated covariance matrix

côv(b) = (X'V̂_F⁻¹X)⁻¹.

The normal approximation improves as the sample size increases and F(p) is more nearly normally distributed.

The estimate b yields predicted values F̂ = Xb for the response functions. Since they satisfy the model, these predicted values are smoother than the sample response functions F(p). When the model holds, F̂ is asymptotically better than F(p) as an estimator of F(π) (Section 14.2.2). The estimated covariance matrix of the predicted values is

V̂_F̂ = X(X'V̂_F⁻¹X)⁻¹X'.
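The computations above are short enough to sketch directly. The following is a minimal illustration (the counts and scores are hypothetical) for a linear logit model with I = 3 independent binomial samples; the response functions are sample logits, and since the samples are independent, V̂_F is diagonal with entries 1/[n_i p_i(1 − p_i)]:

```python
import math

# Grouped binary data (hypothetical): y_i successes out of n_i trials at score x_i.
y = [12, 25, 42]
n = [50, 50, 50]
x = [0.0, 1.0, 2.0]

p = [yi / ni for yi, ni in zip(y, n)]          # sample proportions
F = [math.log(pi / (1 - pi)) for pi in p]      # sample logits F(p)

# Estimated asymptotic variances of the sample logits (delta method);
# independence of the samples makes V_F diagonal.
v = [1 / (ni * pi * (1 - pi)) for ni, pi in zip(n, p)]

# Model matrix for the linear logit model F = beta0 + beta1 * x.
X = [[1.0, xi] for xi in x]

# Normal equations (X' V^-1 X) b = X' V^-1 F(p), solved via a 2x2 inverse.
a11 = sum(X[i][0] ** 2 / v[i] for i in range(3))
a12 = sum(X[i][0] * X[i][1] / v[i] for i in range(3))
a22 = sum(X[i][1] ** 2 / v[i] for i in range(3))
c1 = sum(X[i][0] * F[i] / v[i] for i in range(3))
c2 = sum(X[i][1] * F[i] / v[i] for i in range(3))
det = a11 * a22 - a12 ** 2
b = [(a22 * c1 - a12 * c2) / det, (a11 * c2 - a12 * c1) / det]

# Estimated covariance matrix of b: (X' V^-1 X)^{-1}.
cov_b = [[a22 / det, -a12 / det], [-a12 / det, a11 / det]]

# Residual statistic W = (F - Xb)' V^-1 (F - Xb), on df = u - q = 3 - 2 = 1.
resid = [F[i] - (b[0] + b[1] * x[i]) for i in range(3)]
W = sum(resid[i] ** 2 / v[i] for i in range(3))
print(f"b = ({b[0]:.3f}, {b[1]:.3f}), W = {W:.3f} on 1 df")
```

By construction b satisfies the normal equations exactly, and W here plays the role of the goodness-of-fit statistic discussed next, with df equal to the number of response functions minus the number of parameters.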
The test of model goodness of fit uses the residual term

W = [F(p) − Xb]' V̂_F⁻¹ [F(p) − Xb] = F(p)'V̂_F⁻¹F(p) − b'(X'V̂_F⁻¹X)b,

which compares the sample response functions with their model predicted values. Under H₀: F(π) − Xβ = 0 that the model holds, W is asymptotically chi-squared with df = u − q, the difference between the number of response functions and the number of model parameters.

One can more closely check the model fit by studying the residuals, F(p) − F̂. They are orthogonal to the fit F̂, so

cov[F(p)] = cov[F(p) − F̂ + F̂] = cov[F(p) − F̂] + cov(F̂).

Thus, the estimated covariance matrix of the residuals equals

côv[F(p) − F̂] = V̂_F − V̂_F̂ = V̂_F − X(X'V̂_F⁻¹X)⁻¹X'.

Dividing the residuals by their standard errors yields standardized residuals having large-sample standard normal distributions.

Hypotheses about contrasts and other effects of explanatory variables have the form H₀: Cβ = 0, where C is a known c × q matrix with c ≤ q, having rank c. The estimator Cb of Cβ is asymptotically normal with mean 0 under H₀ and with covariance matrix estimated by C(X'V̂_F⁻¹X)⁻¹C'. The Wald statistic

W_C = b'C'[C(X'V̂_F⁻¹X)⁻¹C']⁻¹Cb    (15.2)

has an approximate chi-squared null distribution with df = c. This statistic also equals the difference between residual chi-squared statistics for the reduced model implied by H₀ and the full model. For the special case H₀: β_i = 0,

W_C = b_i²/vâr(b_i)

has df = 1.

15.1.3 Scope of WLS versus ML Estimation

The WLS approach requires estimating the multinomial covariance matrix of sample responses at each setting of the explanatory variables. It is inapplicable when explanatory variables are continuous, since there may be only one observation at each such setting.
WLS also becomes less appropriate as the number of explanatory variables increases, since few observations may occur at each of the many combinations of settings. By contrast, in principle, continuous explanatory variables or many explanatory settings are not problematic for ML.

When a certain model holds, with large cell expected frequencies ML and WLS give similar results. Both estimators are in the class of best asymptotically normal estimators. However, practical considerations often favor ML estimation. For example, zero cell counts often adversely affect the WLS approach. The sample response functions may then be ill-defined or have a singular estimated covariance matrix.

WLS shares with quasi-likelihood the feature that inferential results depend only on specifying a model for the mean responses and specifying a variance function and covariance structure (here, based on the multinomial). It does not use the likelihood function for the complete distribution. Thus, inference uses Wald methods.

Historically, an advantage of the WLS approach was computational simplicity. This is not relevant now that software is available for ML analyses and for extensions of WLS (e.g., quasi-likelihood methods such as GEE) that do not have some of its disadvantages. Thus, WLS is now used much less frequently than it was about 25 years ago. Nonetheless, it has close connections with more sophisticated methods. Some algorithms for calculating ML estimates iteratively use WLS. Also, Miller et al. (1993) showed that under certain conditions the solution of the first iteration in the GEE fitting process gives the WLS estimate. This equivalence uses initial estimates based directly on sample values and assumes a saturated association structure that allows a separate correlation parameter for each pair of response categories and each pair of observations in a cluster. In this sense, GEE is an iterated form of WLS.
Moreover, in this case, the covariance matrix for the estimates is the same in both approaches.

15.2 BAYESIAN INFERENCE FOR CATEGORICAL DATA

Methodology using the Bayesian paradigm has advanced tremendously in the past decade. New computational methods make it easier to evaluate posterior distributions for model parameters. Nonetheless, Bayesian inference is not as fully developed or commonly used for categorical data analysis as in many other areas of statistics. For multiway contingency table analysis, this is partly because of the plethora of parameters for multinomial models, often necessitating substantial prior specification.

Bayesian theory and methods are beyond the scope of this book. We present only relatively elementary problems in which the Bayesian approach applies quite naturally and is sometimes more appealing than ML. We then briefly summarize more complex developments.

The first applications of Bayesian methods to contingency tables involved smoothing cell counts to improve estimation of cell probabilities (e.g., Good 1965). The sample proportions are the ordinary ML estimators for the saturated model. When data are sparse, these can have undesirable features. Large sparse tables often contain many sampling zeros, for which 0.0 is unappealing as a probability estimate. In addition, Stein's results for estimating multivariate normal means suggest that lower total mean-squared error occurs with Bayes estimators that shrink the sample proportions toward some average value (Efron and Morris 1975).

In considering Bayesian estimators, we cannot hope to find one that is uniformly better than ML. For instance, suppose that a true cell probability π_i = 0. Then the sample proportion p_i = 0 with probability 1, and the sample proportion is better than any other estimator. Because parameter values exist for which the sample proportion is optimal, no other estimator is uniformly better over the entire parameter space.
Here the criterion of comparison is the expected value of a loss function that measures distance between the estimator and the parameter, such as squared error. In decision-theoretic terms, the sample proportion is an admissible estimator for standard loss functions (Johnson 1971). In this sense, the sample mean for the multinomial or multivariate binomial differs from the sample mean for the multivariate normal, which is inadmissible (dominated by Bayes estimators) when the dimension of the mean vector is at least three (Ferguson 1967, p. 170). Meeden et al. (1998) gave related results for decomposable loglinear models.

Another approach for estimating cell probabilities fits an unsaturated model. Often, though, there is no particular model expected to describe the table well. For I × J cross-classifications of nominal variables, for instance, the independence model rarely fits well. When unsaturated models approximate the true relationship poorly, model-based estimators also have undesirable properties. Although they smooth the data, the smoothing is too severe for large samples. The model-based estimators are inconsistent, converging to values that may be far from the true cell probabilities as n increases.

A Bayesian approach to estimating cell probabilities compromises between sample proportions and model-based estimators. A model still provides part of the smoothing mechanism, with the Bayes estimators shrinking the sample proportions toward a set of proportions satisfying the model.

15.2.1 Bayesian Estimation of Binomial Parameter

We illustrate basic ideas with Bayesian inference for a binomial parameter. Let y denote a bin(n, π) variate. Since π falls between 0 and 1, a natural prior density for π is the beta [(13.8) in Section 13.3.1] for some choice of α > 0 and β > 0. This satisfies E(π) = α/(α + β).
In Bayesian inference the posterior density of a parameter, given the data, is proportional to the product of the prior density with the likelihood function. Here, the beta prior depends on π through π^(α−1)(1 − π)^(β−1), and the binomial likelihood has kernel depending on π through π^y(1 − π)^(n−y). Thus, the posterior density h(π | y) of π satisfies

h(π | y) ∝ π^y(1 − π)^(n−y) π^(α−1)(1 − π)^(β−1) = π^(y+α−1)(1 − π)^(n−y+β−1), for 0 ≤ π ≤ 1.

The beta is the conjugate prior distribution: the posterior density is also beta, with parameters α* = y + α and β* = n − y + β.

The mean of the posterior distribution is a Bayesian estimator of a parameter. This is optimal when a squared-error loss function (T − π)² describes the consequence of estimating π by an estimator T (Ferguson 1967, p. 46). The mean of the beta posterior distribution for π is

E(π | y) = α*/(α* + β*) = (y + α)/(n + α + β) = w(y/n) + (1 − w)[α/(α + β)],

where w = n/(n + α + β). This is a weighted average of the sample proportion p = y/n and the mean of the prior distribution. For fixed (α, β), the weight given the sample increases as n increases. The standard deviation of the posterior distribution describes the accuracy of this estimator. It equals the square root of

var(π | y) = α*β* / [(α* + β*)²(α* + β* + 1)].

For large n the standard deviation is roughly √[p(1 − p)/n], the ordinary standard error for the ML estimator π̂ = p.

The Bayes estimator requires selecting parameters (α, β) for the prior distribution. Complete ignorance about π might suggest a uniform prior distribution. This is the beta distribution with α = β = 1. The posterior distribution then has the same shape as the binomial likelihood function. The Bayes estimator is then

E(π | y) = (y + 1)/(n + 2).

This shrinks the sample proportion slightly toward 1/2.
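These formulas amount to two lines of code. The sketch below (with hypothetical y and n, and exact rational arithmetic via the standard library) checks that the posterior mean under the uniform prior agrees with its weighted-average form:

```python
from fractions import Fraction

def beta_posterior_mean(y, n, a, b):
    """Posterior mean of pi with a beta(a, b) prior and y successes in n trials."""
    return Fraction(y + a, n + a + b)

y, n = 3, 10

# Uniform prior (a = b = 1): shrinks p = y/n slightly toward 1/2.
est_uniform = beta_posterior_mean(y, n, 1, 1)   # (y + 1)/(n + 2)

# Weighted-average form: w*(y/n) + (1 - w)*a/(a + b), with w = n/(n + a + b).
w = Fraction(n, n + 2)
weighted = w * Fraction(y, n) + (1 - w) * Fraction(1, 2)

print(est_uniform, weighted)  # both equal 1/3
```

Here the sample proportion 3/10 is pulled toward the prior mean 1/2, giving 4/12 = 1/3 by either route.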
Alternatively, a popular prior with Bayesians is the Jeffreys prior. This is proportional to the square root of the determinant of the Fisher information matrix for the parameters of interest, for a single observation. With a single parameter θ, this is [−E(∂² log f(y | θ)/∂θ²)]^(1/2). In the binomial case with θ = π and n = 1, this equals [π(1 − π)]^(−1/2), and the prior is beta with α = β = 0.5. Brown et al. (2001) showed that the posterior generated by this prior yields a confidence interval for π with good performance. It approximates the Clopper–Pearson interval with the mid-P adjustment (Sections 1.4.4 and 1.4.5). For a test of H₀: π ≥ 1/2 against Hₐ: π < 1/2, a Bayesian P-value is the posterior probability that π ≥ 1/2. Routledge (1994) showed that with the Jeffreys prior, this posterior probability approximately equals the one-sided mid-P-value for the ordinary binomial test.

15.2.2 Dirichlet Prior and Posterior for Multinomial Parameters

These ideas generalize from the binomial to the multinomial (Good 1965). Suppose that cell counts (n₁, . . . , n_N) have a multinomial distribution with n = Σᵢnᵢ and parameters π = (π₁, . . . , π_N)′. The multinomial likelihood is proportional to

∏ᵢ₌₁^N πᵢ^(nᵢ).

For a prior distribution over potential π values, the multivariate generalization of the beta is the Dirichlet density

g(π) = [Γ(Σᵢβᵢ)/∏ᵢΓ(βᵢ)] ∏ᵢ₌₁^N πᵢ^(βᵢ−1),  for 0 ≤ πᵢ ≤ 1 all i, Σᵢπᵢ = 1,

where {βᵢ > 0}. For it, E(πᵢ) = βᵢ/(Σⱼβⱼ). The posterior density is also Dirichlet, with parameters {nᵢ + βᵢ}. The Bayes estimator of πᵢ is

E(πᵢ | n₁, . . . , n_N) = (nᵢ + βᵢ)/(n + Σⱼβⱼ).  (15.3)

Let K = Σⱼβⱼ and γᵢ = E(πᵢ) = βᵢ/K. The {γᵢ} are prior guesses for the cell probabilities. Bayes estimator (15.3) equals the weighted average

[n/(n + K)]pᵢ + [K/(n + K)]γᵢ.  (15.4)

From (15.3)
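Routledge's correspondence is easy to check numerically. In this sketch (the function names and the crude midpoint-rule integration are ours, not from the text), we compare the posterior probability of H₀ under the Jeffreys prior with the one-sided mid-P-value:

```python
from math import comb

def post_prob_null(y, n, lo=0.5, steps=100000):
    """P(pi >= lo | y) under the beta(y + 0.5, n - y + 0.5) Jeffreys
    posterior, via midpoint-rule integration of the unnormalized density."""
    a, b = y + 0.5, n - y + 0.5
    dens = lambda p: p ** (a - 1) * (1 - p) ** (b - 1)
    h = 1.0 / steps
    total = sum(dens((i + 0.5) * h) for i in range(steps)) * h
    tail_steps = int(steps * (1 - lo))
    th = (1 - lo) / tail_steps
    tail = sum(dens(lo + (i + 0.5) * th) for i in range(tail_steps)) * th
    return tail / total

def mid_p(y, n):
    """One-sided mid-P-value for H0: pi >= 1/2 vs Ha: pi < 1/2."""
    pmf = lambda k: comb(n, k) * 0.5 ** n
    return sum(pmf(k) for k in range(y)) + 0.5 * pmf(y)

print(round(mid_p(6, 20), 4), round(post_prob_null(6, 20), 4))
```

For y = 6 successes in n = 20 trials the two quantities agree closely, illustrating the approximate equality.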
the Bayes estimator is a sample proportion when the prior information corresponds to Σⱼβⱼ trials with βᵢ outcomes of type i, i = 1, . . . , N. This interpretation may provide guidance for choosing {βᵢ}. The Jeffreys prior sets all βᵢ = 0.5. Good referred to K as a flattening constant, since with identical {βᵢ}, (15.4) shrinks each sample proportion toward the uniform value γᵢ = 1/N. Greater flattening occurs as K increases, for fixed n. Hierarchical models treat {βᵢ} as unknown and specify a second-stage prior for them (e.g., Albert and Gupta 1982).

Bayes estimators combine good characteristics of sample proportions and model-based estimators. Like sample proportions and unlike model-based estimators, they are consistent even when the model does not hold. Unless the model holds, the weight given the sample proportion increases to 1.0 as the sample size increases. Like model-based estimators and unlike sample proportions, the Bayes estimators smooth the data. The resulting estimates, although slightly biased, usually have smaller total mean-squared error than the sample proportions.

15.2.3 Development of Bayesian Methods for Categorical Data

We now summarize the development of Bayesian methods for categorical data since Good's (1965) work on smoothing multinomial proportions. Leonard and Hsu (1994) provided a more detailed review. We begin with methods for two-way contingency tables.

For 2 × 2 tables, Altham (1969) gave a Bayesian analysis comparing parameters for two independent binomial samples. She tested H₀: π₁ ≤ π₂ against π₁ > π₂ using independent beta(αᵢ, βᵢ) priors for π₁ and π₂. Altham showed that the P-value that is the posterior probability that π₁ ≤ π₂ can equal the one-sided P-value for Fisher's exact test. This happens when one uses improper prior distributions (α₁, β₁) = (1, 0) and (α₂, β₂) = (0, 1).
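As a small illustration (the function name and data are invented for this sketch), the weighted-average form (15.4) with a uniform prior pattern and Jeffreys-type flattening K = N/2 gives every cell, including an empty one, a positive estimate:

```python
def dirichlet_bayes(counts, gammas, K):
    """Bayes estimate (15.4): shrink each sample proportion toward the
    prior guess gamma_i, with flattening constant K = sum_j beta_j."""
    n = sum(counts)
    w = n / (n + K)
    return [w * (ni / n) + (1 - w) * gi for ni, gi in zip(counts, gammas)]

# Jeffreys-type prior for N = 3 cells: beta_i = 0.5, so K = 1.5, gamma_i = 1/3
est = dirichlet_bayes([8, 2, 0], [1/3, 1/3, 1/3], K=1.5)
print([round(e, 4) for e in est])   # equivalently (n_i + 0.5)/(n + 1.5)
```

The estimates still sum to 1, and the empty third cell receives estimate 0.5/11.5 rather than 0.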
These represent prior belief favoring the null hypothesis, in effect penalizing against concluding that π₁ > π₂. That is, Fisher's exact test corresponds to a conservative prior distribution. If αᵢ = βᵢ = γ, i = 1, 2, with 0 ≤ γ ≤ 1, Altham showed that the Bayesian P-value is smaller than the Fisher P-value. The difference between the two is no greater than the null probability of the observed data. Use of Jeffreys priors with αᵢ = βᵢ = 0.5 provides a type of continuity correction to Fisher's exact test, in much the way the mid-P-value does for the frequentist approach. Howard (1998) showed that with these priors the posterior probability that π₁ ≤ π₂ approximates the one-sided P-value for the large-sample z test using pooled variance (i.e., the signed square root of the Pearson statistic; see Problem 3.30) for testing H₀: π₁ = π₂ against Hₐ: π₁ > π₂. Howard also discussed other priors for 2 × 2 tables, including ones that treat π₁ and π₂ as dependent.

Altham (1971) showed Bayesian analyses for binomial proportions from matched-pairs data. For a simple model in which the probability of success is the same for each subject at a given occasion, she again showed that the classical exact P-value (Section 10.1.4, using the binomial distribution) is a Bayesian P-value for a prior distribution favoring H₀. For a model similar to (10.8) in which the probability varies by subject but the occasion effect is constant, she showed that the Bayesian evidence against the null is weaker as the number of pairs giving the same response at both occasions increases, for fixed values of the numbers of pairs giving different responses at the two occasions. This differs from the conditional ML result, which does not depend on such pairs (Section 10.2.3). Ghosh et al. (2000) showed related results.

The Bayesian approaches presented so far focused directly on cell probabilities by using a prior distribution for them.
Lindley (1964) did this with I × J contingency tables. He considered the posterior distribution of contrasts of log probabilities, such as the log odds ratio. An alternative approach (Laird 1978; Leonard 1975) focused on parameters of the saturated loglinear model, using normal priors. This is not a conjugate prior, but normal distributions can approximate the posterior. Using independent normal N(0, σ²) distributions for the association parameters is a way of inducing shrinkage toward the independence model (Laird 1978). A hierarchical approach puts second-stage priors on the parameters of the prior distribution (Leonard 1975).

Historically, a barrier for the Bayesian approach has been the difficulty of calculating the posterior distribution when the prior is not conjugate. This is less problematic with modern ways of approximating posterior distributions by simulating samples from them. These include the importance sampling generalization of Monte Carlo simulation (Zellner and Rossi 1984) and Markov chain Monte Carlo methods such as Gibbs sampling (Gelfand and Smith 1990). Zellner and Rossi used Bayesian methods for logistic regression, and Gelfand and Smith considered a class of multinomial models with Dirichlet prior. Zeger and Karim (1991) fitted generalized linear mixed models (GLMMs) essentially using a Bayesian framework with priors for fixed and random effects. The focus on distributions for random effects in GLMMs in articles such as Zeger and Karim (1991) led to the treatment of parameters in GLMs as random variables with a fully Bayesian approach. Dey et al. (2000) edited a collection of articles that provided Bayesian analyses for GLMs. For instance, in that volume Gelfand and Ghosh surveyed the subject, Albert and Ghosh reviewed item response modeling, Chib modeled correlated binary data, and Chen and Dey modeled correlated ordinal data.

Bayesian methods are used increasingly in applications. For instance, Skene and Wakefield (1990)
modeled multicenter binary response studies with a logit model that allows the treatment-by-response log odds ratio to vary among centers. This gives a Bayesian alternative to the GLMM analysis presented in Section 12.3.4. Daniels and Gatsonis (1999) used multilevel GLMs to analyze geographic and temporal trends with clustered longitudinal binary data. This built on hierarchical modeling ideas introduced by Wong and Mason (1985). An article by Landrum and Normand in Dey et al. (2000) gave a case study using Bayesian ordinal probit and logit models. Chaloner and Larntz (1989) used a Bayesian approach to determining optimal design for experiments using logistic regression. J. Albert has suggested Bayesian models for a variety of categorical data analyses. For instance, Albert (1997) modeled associations in two-way tables, and Albert and Chib (1993) studied binary regression modeling, focusing on the probit case with extensions to ordered multinomial responses.

15.2.4 Data-Dependent Choice of Prior Distribution

With Bayesian analyses, careful prior specification is necessary. The use of an improper prior, such as the uniform prior over the entire or positive real line, sometimes results in improper posteriors. One may not realize this from the output of software for Bayesian fitting. In addition, with simulation methods it may not be obvious when convergence has occurred. Be suspicious if results are dramatically different from ordinary ML frequentist results.

Some dislike the subjectivity of the Bayesian approach inherent in selecting a prior distribution. Instead of choosing particular parameters for a prior distribution, it is increasingly popular to use a hierarchical approach in which those parameters themselves have a second-stage prior distribution. Alternatively, the empirical Bayes approach lets the data suggest parameter values for use in the prior distribution (e.g., Efron and Morris 1975).
This approach uses the prior that maximizes the marginal probability of the observed data, integrating out with respect to the prior. Laird (1978) did this for the loglinear model, estimating σ² in normal priors for association parameters by finding the value that maximizes an approximation for the marginal distribution of the cell counts, evaluated at the observed data. A disadvantage of empirical Bayes compared to the hierarchical approach is that it does not take into account the source of variability due to substituting estimates for prior parameters.

Fienberg and Holland (1973) proposed analyses for contingency tables with data-dependent priors. For a particular choice of Dirichlet means {γᵢ} for the Bayes estimator (15.4), they showed that the minimum total mean-squared error occurs when

K = (1 − Σᵢπᵢ²)/Σᵢ(γᵢ − πᵢ)².  (15.5)

The optimal K = K(γ, π) depends on π, so they used the estimate K(γ, p) of K in which the sample proportion p replaces π. As p falls closer to the prior guess γ, K(γ, p) increases and the prior guess receives more weight in the posterior estimate. They selected the prior pattern {γᵢ} for the cell probabilities based on the fit of a simple model. For two-way tables, they used the independence fit {γᵢⱼ = pᵢ₊p₊ⱼ}. The Bayes estimator then shrinks sample proportions toward that fit.

As in other inference, Bayesian modeling should normally account for any ordering in the response categories. For instance, in the method just mentioned for smoothing contingency tables, one could shrink toward an ordinal model.

15.3 OTHER METHODS OF ESTIMATION

In this final section we describe some alternative estimation methods for categorical data. Consider estimation of π or θ, assuming a model π = π(θ). Let θ̃ denote a generic estimator of θ, for which π̃ = π(θ̃) estimates π. The ML estimator θ̂ maximizes the likelihood.
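A minimal sketch of this smoothing scheme for a two-way table follows (the function name is ours; it assumes the table is not exactly independent, so the denominator of the estimated (15.5) is nonzero):

```python
def fienberg_holland(table):
    """Shrink a two-way table's sample proportions toward the independence
    fit, with flattening constant K estimated by plugging p into (15.5)."""
    n = sum(sum(row) for row in table)
    p = [[nij / n for nij in row] for row in table]
    rows = [sum(row) for row in p]                                    # p_{i+}
    cols = [sum(row[j] for row in p) for j in range(len(table[0]))]   # p_{+j}
    gamma = [[ri * cj for cj in cols] for ri in rows]   # independence fit
    num = 1 - sum(pij ** 2 for row in p for pij in row)
    den = sum((g - pij) ** 2 for grow, prow in zip(gamma, p)
              for g, pij in zip(grow, prow))
    K = num / den                       # estimate (15.5), evaluated at pi = p
    w = n / (n + K)
    return [[w * pij + (1 - w) * g for pij, g in zip(prow, grow)]
            for prow, grow in zip(p, gamma)]

smoothed = fienberg_holland([[10, 5], [3, 12]])
print([[round(v, 4) for v in row] for row in smoothed])
```

Each smoothed cell lies between its sample proportion and its independence-fit value, and the estimates still sum to 1.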
It also minimizes the deviance statistic G² comparing observed and fitted proportions (Section 14.3.4).

15.3.1 Minimum Chi-Squared Estimators

Other estimators minimize other measures of distance between π(θ) and p. The value θ̃ that minimizes the Pearson statistic

X²[π(θ), p] = n Σᵢ [pᵢ − πᵢ(θ)]²/πᵢ(θ)

is called the minimum chi-squared estimate. It is simpler to calculate the estimate that minimizes the modified statistic

X²_mod[π(θ), p] = n Σᵢ [pᵢ − πᵢ(θ)]²/pᵢ  (15.6)

that replaces the denominator by the sample proportion. This is called the minimum modified chi-squared estimate. It is the solution for θ to the equations

Σᵢ [πᵢ(θ)/pᵢ][∂πᵢ(θ)/∂θⱼ] = 0,  j = 1, . . . , q.

Neyman (1949) introduced minimum modified chi-squared estimators. He showed that they and minimum chi-squared estimators are best asymptotically normal (BAN) estimators. When the model holds, they are asymptotically (as n → ∞) equivalent to ML estimators.

Under the model, different estimation methods (ML, WLS, minimum chi-squared, etc.) yield nearly identical estimates of parameters when n is large. This happens partly because the estimators are consistent, converging in probability to θ as n increases. When the model does not hold, estimates for different methods can be quite different, even when n is large. The estimators converge to values for which the model gives the best approximation to reality, and this approximation is different when best is defined in terms of minimizing G² rather than minimizing X² or some other measure.

For any n, minimum modified chi-squared estimates are sometimes identical to WLS estimates. The connection refers to an alternative way of specifying a model, using a set of constraint equations for π, {gⱼ(π₁, . . . , π_N) = 0}. For instance, for an I × J table, the (I − 1)(J − 1)
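To make the idea concrete, here is a toy minimization of the modified statistic (15.6) (the constant factor n does not affect the minimizer, so it is dropped) for a one-parameter model of our own choosing, not taken from the text: Hardy–Weinberg proportions π(θ) = (θ², 2θ(1 − θ), (1 − θ)²), fitted by a plain grid search with refinement:

```python
def x2_mod(theta, p):
    """Modified chi-squared distance (15.6) between sample proportions p
    and Hardy-Weinberg model proportions at theta (factor n omitted)."""
    model = (theta ** 2, 2 * theta * (1 - theta), (1 - theta) ** 2)
    return sum((pi - mi) ** 2 / pi for pi, mi in zip(p, model))

def minimize_grid(p, lo=1e-6, hi=1 - 1e-6, rounds=6, points=200):
    """Crude one-dimensional minimizer: repeatedly search a grid,
    then zoom in around the best point."""
    for _ in range(rounds):
        thetas = [lo + (hi - lo) * k / points for k in range(points + 1)]
        best = min(thetas, key=lambda t: x2_mod(t, p))
        step = (hi - lo) / points
        lo, hi = max(best - step, 1e-9), min(best + step, 1 - 1e-9)
    return best

counts = (30, 50, 20)                 # hypothetical genotype counts
n = sum(counts)
p = tuple(c / n for c in counts)
print(round(minimize_grid(p), 3))     # close to p1 + p2/2 = 0.55
```

When the sample proportions satisfy the model exactly, e.g. p = (0.25, 0.5, 0.25), the minimum is at θ̃ = 0.5 with distance essentially zero.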
constraint equations

log πᵢⱼ − log πᵢ,ⱼ₊₁ − log πᵢ₊₁,ⱼ + log πᵢ₊₁,ⱼ₊₁ = 0

specify the model of independence. The number of constraint equations equals the residual df for the model. Neyman (1949) noted that minimum modified chi-squared estimates result from minimizing

Σᵢ₌₁^N (pᵢ − πᵢ)²/pᵢ + Σⱼ₌₁^(N−q) λⱼgⱼ(π₁, . . . , π_N)

with respect to π, where the {λⱼ} are Lagrange multipliers. When the constraint equations are linear in π, the resulting estimating equations are linear. Then Bhapkar (1966) showed that these estimators are identical to WLS estimators. The statistic (15.6) then equals the WLS residual statistic (Section 15.1.2) for testing model fit. Usually, however, constraint equations are nonlinear in π, such as for the independence model. The WLS estimator is then the minimum modified chi-squared estimator based on a linearized version of the constraints,

gⱼ(p) + (π − p)′∂gⱼ(π)/∂π = 0,

with the differential vector evaluated at p.

Berkson (1944, 1955, 1980) was a strong advocate of minimum chi-squared methods. For logistic regression, his minimum logit chi-squared estimators minimized a weighted sum of squares between sample logits and linear predictions. Mantel (1985) criticized such methods, noting that their consistency requires group sizes to grow large, whereas ML (or conditional ML, when there are many nuisance parameters) is consistent however information goes to the limit (see also Problem 15.14).

15.3.2 Minimum Discrimination Information

Kullback (1959) formulated estimation by minimum discrimination information (MDI). The discrimination information for two probability vectors π and γ is

I(π; γ) = Σᵢ₌₁^N πᵢ log(πᵢ/γᵢ).  (15.7)

This directed measure of distance between π and γ is nonnegative, equaling 0 only when π = γ. Gokhale and Kullback (1978)
studied MDI estimates that minimize I(π; γ), subject to model constraints, using γ = p for some problems and γ with γ₁ = γ₂ = ⋯ = γ_N = 1/N for others. Good (1963) conducted related work in the area of maximum entropy. In some cases with {γᵢ = 1/N}, the MDI estimator is identical to the ML estimator (Simon 1973). With γ = p it is not ML, but it has similar asymptotic properties, being best asymptotically normal (BAN). Then Gokhale and Kullback recommended testing goodness of fit using twice the minimized value of I(π; p). This statistic reverses the roles of p and π relative to G², much as X²_mod in (15.6) reverses their roles relative to X². Both statistics fall in the class of power divergence statistics (Cressie and Read 1984; see also Problem 3.34) and have similar asymptotic properties. More generally, one could choose any member of the power divergence statistics and define estimates to be the values minimizing it. Under regularity conditions, they are all BAN.

15.3.3 Kernel Smoothing

Kernel estimation is a smoothing method that estimates a probability density or mass function without assuming a parametric distribution. Let K denote a matrix containing nonnegative elements and having column sums equal to 1. Kernel estimates of cell probabilities in a contingency table have form

π̃ = Kp.  (15.8)

For unordered multinomials with N categories, Aitchison and Aitken (1976) used

kᵢⱼ = λ for i = j,  kᵢⱼ = (1 − λ)/(N − 1) for i ≠ j,

for (1/N) ≤ λ ≤ 1. The resulting kernel estimator of π has form

(1 − α)p + α(1/N),  (15.9)

where α = N(1 − λ)/(N − 1). This estimator shrinks the sample proportion toward (1/N, . . . , 1/N). As λ decreases from 1 to 1/N, the smoothing parameter α increases from 0 to 1. Brown and Rundell (1985) proved that when no πᵢ = 1, λ < 1 exists such that the total mean squared error is smaller for this kernel estimator than for the sample proportions.
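The estimator (15.9) is a one-line computation; this sketch (names are ours) shows the two boundary cases of the smoothing parameter:

```python
def kernel_smooth(p, lam):
    """Aitchison-Aitken kernel estimate (15.9):
    pi_tilde = (1 - alpha) p + alpha / N, alpha = N(1 - lam)/(N - 1),
    for 1/N <= lam <= 1."""
    N = len(p)
    alpha = N * (1 - lam) / (N - 1)
    return [(1 - alpha) * pi + alpha / N for pi in p]

p = [0.5, 0.3, 0.2, 0.0]
print(kernel_smooth(p, lam=1.0))    # lam = 1: alpha = 0, no smoothing
print(kernel_smooth(p, lam=0.25))   # lam = 1/N: alpha = 1, fully uniform
```

At λ = 1 the sample proportions are returned unchanged; at λ = 1/N every cell is flattened to 1/N; intermediate λ values interpolate while preserving the total probability of 1.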
Results for other shrinkage estimators applied to multivariate means suggest that the improvement for the kernel estimator can be large when n is small and the true cell probabilities are roughly equal.

Brown and Rundell generalized kernel smoothing for multiway contingency tables that may contain both nominal and ordinal variables. For a T-way table, let Lₖ be a stochastic matrix (i.e., row and column sums equal to 1) with elements

lₖ,ᵢⱼ = λₖ for i = j,  lₖ,ᵢⱼ = dₖ(i, j)(1 − λₖ) for i ≠ j,  k = 1, . . . , T.

They let K in (15.8) be the Kronecker product K = L₁ ⊗ ⋯ ⊗ L_T. When variable k is ordinal, shrinkage alone is not enough, and it helps to borrow information from nearby cells. Then dₖ(i, j) is chosen to be smaller for greater distances between categories i and j. If variable k is nominal, the natural choice is dₖ(i, j) = 1/(Iₖ − 1), where Iₖ is the number of categories for variable k. For fixed {λₖ}, collapsing the smoothed table gives the same result as smoothing the corresponding collapsing of the original table. With {λₖ = λ, k = 1, . . . , T}, Brown and Rundell described ways of finding λ to minimize an unbiased estimate of the total mean squared error.

Dong and Simonoff (1995) and Simonoff (1986) described other approaches for ordered categories. Most such kernels yield probability estimates of the form

π̃ᵢ = (1 − α)pᵢ + α × smootherᵢ,

where the smoothing is designed to work well when true probabilities in nearby cells are similar.

15.3.4 Penalized Likelihood

Good and Gaskins (1971) introduced the penalized likelihood method for density estimation. For log likelihood L(π), the estimator maximizes

L*(π) = L(π) − α(π),

where α(·) is a function that provides a roughness penalty. That is, α(π) decreases as elements of π are smoother, in some sense. The penalized likelihood estimator has a Bayesian interpretation.
With prior density proportional to exp[−α(π)], the posterior density is proportional to the penalized likelihood function. Hence, the mode of the posterior distribution equals the penalized likelihood estimator.

Simonoff (1983) applied penalized likelihood to estimating cell probabilities π. Like Bayesian and kernel methods, it provides estimates that are smoother than the sample proportions. For a single multinomial with ordered categories, Simonoff (1983) used the penalty function α(π) = λ Σᵢ₌₁^(N−1) (log πᵢ − log πᵢ₊₁)², which encourages adjacent-category estimates to be similar. For two-way contingency tables, Simonoff suggested using α(π) = λ ΣᵢΣⱼ (log θᵢⱼ)² with the local odds ratios. This provides shrinkage toward the independence estimator. One chooses the smoothing parameter λ to minimize an approximation for the mean-squared error of the estimator.

In evaluating smoothing methods such as kernel smoothing and penalized likelihood, it is useful to distinguish between large-sample asymptotics with a fixed number of cells N and sparse-data asymptotics for which N grows with n (recall Section 6.3.4). For the former, these smoothing methods and Bayesian inference behave asymptotically like ordinary ML (i.e., the sample proportions). They have the same rate of convergence to true probabilities. These methods then improve over ML primarily for small samples, where the benefit of "borrowing from the whole" occurs.

For sparse-data asymptotics, however, smoothing is particularly beneficial. As the dimensions of a table increase, the number of cells grows exponentially and the "curse of dimensionality" occurs. Accurate estimation becomes more difficult, with estimators converging more slowly to true values. The table then has an increasing proportion of empty cells. Smoothing can be better than ML even asymptotically. For such results, see Fienberg and Holland (1973) for the Dirichlet-based Bayes multinomial estimator and Simonoff (1983)
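To see the penalty in action without any optimization machinery, here is a toy version of the adjacent-categories penalty reduced to just N = 2 cells (this reduction, and all names, are ours); the penalized log likelihood can then be maximized by brute force over a grid:

```python
from math import log

def penalized_estimate(y1, y2, lam, grid=100000):
    """Maximize y1 log(p) + y2 log(1 - p) - lam * (log p - log(1 - p))^2,
    the N = 2 case of the adjacent-category roughness penalty, over a grid
    of interior points. The penalty shrinks the estimate toward 1/2."""
    def lstar(p):
        return y1 * log(p) + y2 * log(1 - p) - lam * (log(p) - log(1 - p)) ** 2
    cands = [(k + 0.5) / grid for k in range(grid)]
    return max(cands, key=lstar)

print(penalized_estimate(9, 1, lam=0.0))   # no penalty: ~ ML estimate p = 0.9
print(penalized_estimate(9, 1, lam=5.0))   # heavy penalty: pulled toward 1/2
```

With λ = 0 the maximizer is the sample proportion; as λ grows, the estimate moves monotonically from p toward the uniform value 1/2, which is exactly the smoothing behavior described in the text.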
for penalized likelihood with the multinomial. Simonoff showed that consistency can occur with the latter estimator in the sense that supᵢ |π̂ᵢ/πᵢ − 1| converges in probability to 0 as n and N grow and the probabilities themselves approach 0.

For surveys of smoothing methods, see Fahrmeir and Tutz (2001, Chap. 5), Lloyd (1999, Chap. 5), and Simonoff (1996, Chap. 6; 1998). As Simonoff noted, all smoothing methods attempt to balance the low bias of undersmoothing with the low variability of oversmoothing. The methods require input from the user about the degree of smoothness, whether it be determined by a prior distribution or some type of smoothing parameter.

In summary, many methods exist for smoothing categorical data. Besides those discussed in this section, there are traditional model-building methods. Some of these, such as generalized additive models (Section 4.8), are also specifically directed toward smoothing. A particular type of smoothing method may seem most natural for a given application. An advantage of the Bayesian approach is that its entire formulation seems less ad hoc than some others.

NOTES

Section 15.1: Weighted Least Squares for Categorical Data

15.1. Applications of WLS include fitting mean response models (Grizzle et al. 1969) and models for marginal distributions (Koch et al. 1977). For general discussion, see Bhapkar and Koch (1968), Imrey et al. (1981), and Koch et al. (1985).

Section 15.2: Bayesian Inference for Categorical Data

15.2. Other literature on Bayesian analyses of categorical responses includes Fienberg et al. (1999), Forster and Smith (1998), Good (1976), Knuiman and Speed (1988), Spiegelhalter and Smith (1982), and Walley (1996).

Section 15.3: Other Methods of Estimation

15.3. For further discussion of minimum chi-squared methods, see Bhapkar (1966), Koch et al. (1985), Neyman (1949), and Rao (1963).

15.4.
For the use of minimum discrimination information, see Gokhale and Kullback (1978), Ireland and Kullback (1968a, b), Ireland et al. (1969), and Ku et al. (1971).

15.5. Hall and Titterington (1987) studied rates of convergence for multinomial kernel estimators. They defined one that achieves the optimal rate. Ordinary kernel estimators tend to be biased toward zero at the boundary of a table. Dong and Simonoff (1994) dealt with improving kernel estimates on the boundary of large sparse tables. Kernel methods are also useful for discrete regression modeling. For binary response data, Copas (1983) used one to display in a nonparametric manner the dependence of P(Y = 1) on x.

PROBLEMS

Applications

15.1 Consider the mean response model fitted in Section 7.4.6. Show how to use WLS for this analysis. Identify the number of multinomial samples I, the number of response categories J, the response functions F, the model matrix X, the parameter vector β, and the estimated covariance matrix V̂_F.

15.2 Use WLS to conduct the longitudinal analysis of depression in Section 11.2.1. Using software (e.g., SAS: PROC CATMOD), obtain WLS estimates and standard errors and compare to the ML results.

15.3 Refer to Problem 15.2. Using these data, describe the differences between (a) WLS and ML, and (b) WLS and GEE methods for marginal models with multivariate categorical response data.

15.4 Using data from Section 1.4.3, obtain a Bayesian estimate of the proportion of vegetarians. Explain how you chose the prior distribution. Compare results to those with ML.

15.5 Refer to Table 9.8. Consider the model that simultaneously assumes (9.12) as well as linear logit relationships for the marginal effects of age on breathlessness and on wheeze.
a. Specify C, A, and X for which this model has form C log(Aπ) = Xβ.
b. Using software, fit the model and interpret estimates.

Theory and Methods

15.6 Consider marginal homogeneity for an I × I table.
a. Letting F(π) = Aπ, explain how (i)
F(π) = 0, where A has I − 1 rows, and (ii) F(π) = Xβ, where A has 2(I − 1) rows and β has I − 1 elements, can express marginal homogeneity. In part (ii), show A, π, X, and β when I = 3.
b. Explain how to use WLS to test marginal homogeneity. [This is Bhapkar's test (10.16).]

15.7 For WLS with F(π) = C[log(Aπ)], show that Q = C[diag(Aπ)]⁻¹A.

15.8 With WLS, show that [F(p) − Xβ]′V̂_F⁻¹[F(p) − Xβ] is minimized by β̂ = (X′V̂_F⁻¹X)⁻¹X′V̂_F⁻¹F(p).

15.9 The response functions F(p) have asymptotic covariance matrix V_F. Derive the asymptotic covariance matrix of the WLS model parameter estimator b and the predicted values F̂ = Xb.

15.10 Consider the Bayes estimator of a binomial parameter π using a beta prior distribution.
a. Does any beta prior distribution produce a Bayes estimator that coincides with the ML estimator?
b. Show that the ML estimator is a limit of Bayes estimators, for a certain sequence of beta prior parameter values.
c. Find an improper prior density (one for which its integral is not finite) such that the Bayes estimator coincides with the ML estimator. (In this sense, the ML estimator is a generalized Bayes estimator.)
d. For Bayesian inference using loss function w(θ)(T − θ)², the Bayes estimator of θ is the posterior expected value of θw(θ) divided by the posterior expected value of w(θ) (Ferguson 1967, p. 47). With loss function (T − π)²/[π(1 − π)], show the ML estimator of π is a Bayes estimator for the uniform prior distribution.
e. The risk function is the expected loss, treated as a function of π. For the loss function in part (d), show the risk function is constant. (Bayes estimators with constant risk are minimax; their maximum risk is no greater than the maximum risk for any other estimator.)
f. Show that the Jeffreys prior for π equals the beta density with α = β = 0.5.

15.11 For the Dirichlet prior for multinomial probabilities, show the posterior expected value of πᵢ is (15.3).
Derive the expression for this Bayes estimator as a weighted average of pᵢ and E(πᵢ).

15.12 For Bayes estimator (15.4), show that the total mean squared error is

[K/(n + K)]² Σᵢ(πᵢ − γᵢ)² + [n/(n + K)²](1 − Σᵢπᵢ²).

Show that (15.5) is the value of K that minimizes this.

15.13 Refer to Problem 15.6. For marginal homogeneity, explain why the minimum modified chi-squared estimates are identical to WLS estimates.

15.14 Let yᵢ be a bin(nᵢ, πᵢ) variate for group i, i = 1, . . . , N, with {yᵢ} independent. Consider the model that π₁ = ⋯ = π_N. Denote that common value by π.
a. Show that the ML estimator of π is p = (Σᵢyᵢ)/(Σᵢnᵢ).
b. The minimum chi-squared estimator π̃ is the value of π minimizing

Σᵢ₌₁^N [(yᵢ/nᵢ) − π]²/π + Σᵢ₌₁^N [(yᵢ/nᵢ) − π]²/(1 − π).

The second term results from comparing (1 − yᵢ/nᵢ) to (1 − π), the proportions in the second category. If n₁ = ⋯ = n_N = 1, show that π̃ minimizes Np(1 − π)/π + N(1 − p)π/(1 − π). Hence show that

π̃ = p^(1/2)/[p^(1/2) + (1 − p)^(1/2)].

Note the bias toward 1/2 in this estimator.
c. Argue that as N → ∞ with all nᵢ = 1, the ML estimator is consistent but the minimum chi-squared estimator is not (Mantel 1985).

15.15 Refer to Problem 15.14. For N = 2 groups with n₁ and n₂ independent observations, find the minimum modified chi-squared estimator of π. Compare it to the ML estimator.

15.16 Show that the kernel estimator (15.9) is the same as the Bayes estimator (15.3) for the Dirichlet prior with {βᵢ = αn/[(1 − α)N]}. Using this result, suggest a way of letting the data determine the value of α in the kernel estimator.

CHAPTER 16

Historical Tour of Categorical Data Analysis*

This book concludes with an informal historical overview of the evolution of methods for categorical data analysis (CDA).
We have seen that categorical scales are pervasive in the social sciences and the biomedical sciences. Not surprisingly, the development of GLMs for categorical responses was fostered by statisticians having ties to the social sciences or to the biomedical sciences. Only in the last quarter of the twentieth century did these models receive the attention given early in the century to models for continuous data. Regression models for continuous variables evolved out of Francis Galton's breakthroughs in the 1880s. The strong influence of R. A. Fisher, G. Udny Yule, and other statisticians on experimentation in agriculture and biological sciences ensured widespread adoption of regression and ANOVA modeling by the mid-twentieth century. On the other hand, despite influential articles around 1900 by Karl Pearson and Yule on association between categorical variables, models for categorical responses received scant attention until the 1960s.

The beginnings of CDA were often shrouded in controversy. Key figures in the development of statistical science made groundbreaking contributions, but these statisticians were often in heated disagreement with one another.

16.1 PEARSON–YULE ASSOCIATION CONTROVERSY

Much of the early development of methods for CDA took place in England, and it is fitting that we begin our historical tour in London at the beginning of the twentieth century. The year 1900 is an apt starting point, since in that year Karl Pearson introduced his chi-squared statistic (X²) and G. Udny Yule presented the odds ratio and related measures of association. Before then, most work focused on descriptive aspects for relatively simple measures. For instance, Goodman and Kruskal (1959) noted that the Belgian social statistician Adolphe Quetelet used the relative risk in 1849. By 1900, Karl Pearson (1857–1936) was already well known in the statistics community.
He was head of a statistical laboratory at University College in London. His work the previous decade included developing a family of skewed probability distributions (called Pearson curves), obtaining the product-moment estimate of the correlation coefficient and finding its standard error, and extending Galton's work on linear regression. In fact, Pearson was a true renaissance man, writing on a wide variety of topics that included art, religion, philosophy, law, socialism, women's rights, physics, genetics, eugenics, and evolution. Pearson's motivation for developing the chi-squared test included testing whether outcomes on a roulette wheel in Monte Carlo varied randomly, checking the fit to various data sets of normal distributions and Pearson curves, and testing statistical independence in two-way contingency tables.

Much of the literature on CDA early in the twentieth century consisted of vocal debates about appropriate ways to summarize association. Pearson's approach assumed that continuous bivariate distributions underlie two-way contingency tables (Pearson 1904, 1913). He argued in favor of approximating a measure, such as the correlation, for the underlying continuum. In 1904, Pearson introduced the term contingency as a "measure of the total deviation of the classification from independent probability," and he introduced measures to describe its extent. The tetrachoric correlation is a ML estimate of the correlation for a bivariate normal distribution assumed to underlie counts in 2 × 2 tables. It is the correlation value ρ in the bivariate normal density that would produce cell probabilities equal to the sample cell proportions when that density is collapsed to a 2 × 2 table having the same marginal proportions as the observed table. The mean-square contingency and the contingency coefficient are normalizations of X² to the (0, 1) scale. Pearson's contingency coefficient (Problem 3.33)
for I × J tables standardized X² to approximate an underlying correlation.

George Udny Yule (1871–1951), a British contemporary of Pearson's, took a different approach. Having completed pioneering work developing multiple regression models and multiple and partial correlation coefficients, Yule turned his attention between 1900 and 1912 to association in contingency tables. He believed that many categorical variables, such as (vaccinated, unvaccinated) and (died, survived), are inherently discrete. Yule defined indices directly using cell counts without assuming an underlying continuum. He popularized the odds ratio θ [which Goodman (2000) noted may first have been proposed by a Hungarian statistician, J. Körösy] and a transformation of it to the (−1, +1) scale, Q = (θ − 1)/(θ + 1), now called Yule's Q (Problem 2.36). Discussing one of Pearson's measures that assumes underlying normality, Yule argued (1912, p. 612) that "at best the normal coefficient can only be said to give us in cases like these a hypothetical correlation between supposititious variables. The introduction of needless and unverifiable hypotheses does not appear to me a desirable proceeding in scientific work." Yule (1903) also showed the potential discrepancy between marginal and conditional associations in contingency tables, later studied by E. H. Simpson (1951) and now called Simpson's paradox.

In the first quarter of the twentieth century, Karl Pearson was the rarely challenged leader of statistical science in Britain. Pearson's strong personality did not take kindly to criticism, and he reacted negatively to Yule's ideas. He argued that Yule's own coefficients were unsuitable. For instance, Pearson claimed that their values were unstable, since different collapsings of I × J tables to 2 × 2 tables could produce quite different values of the measures. Pearson and D. Heron (1913)
filled more than 150 pages of Biometrika, a journal he co-founded and edited, with a scathing reply to Yule's criticism. In a passage critical also of Yule's well-received book An Introduction to the Theory of Statistics, they stated: "If Mr. Yule's views are accepted, irreparable damage will be done to the growth of modern statistical theory. . . . [Yule's Q] has never been and never will be used in any work done under his [Pearson's] supervision. . . . We regret having to draw attention to the manner in which Mr. Yule has gone astray at every stage in his treatment of association, but criticism of his methods has been thrust on us not only by Mr. Yule's recent attack, but also by the unthinking praise which has been bestowed on a text-book which at many points can only lead statistical students hopelessly astray." Pearson and Heron attacked Yule's "half-baked notions" and "specious reasoning" and argued that Yule would have to withdraw his ideas "if he wishes to maintain any reputation as a statistician."

In retrospect, Pearson and Yule both had valid points. Some classifications, such as most nominal variables, have no apparent underlying continuous distribution. On the other hand, many applications relate naturally to an underlying continuum, and that fact can motivate models and inference (e.g., Section 7.2.3). Goodman (1981a, b) noted that the ordinal models presented in Sections 9.4.1 and 9.6.1 provide a sort of reconciliation between Yule and Pearson, since Yule's odds ratio characterizes models that fit well when underlying distributions are approximately normal.

Half a century after the Pearson–Yule controversy, Leo Goodman and William Kruskal surveyed the development of association measures for contingency tables and made many contributions of their own. Their 1979 book reprinted four influential articles of theirs from the Journal of the American Statistical Association on this topic.
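The two quantities at the center of Yule's side of the debate are simple to state for a 2 × 2 table of counts. A minimal sketch in Python (the [[a, b], [c, d]] table layout and the function names are mine, for illustration only):

```python
def odds_ratio(a, b, c, d):
    """Sample odds ratio theta = (a*d)/(b*c) for a 2 x 2 table
    with counts a, b in row 1 and c, d in row 2."""
    return (a * d) / (b * c)

def yules_q(a, b, c, d):
    """Yule's Q = (theta - 1)/(theta + 1): the odds ratio mapped
    to the (-1, +1) scale, with Q = 0 under independence."""
    theta = odds_ratio(a, b, c, d)
    return (theta - 1) / (theta + 1)

# A hypothetical table with positive association:
print(odds_ratio(10, 5, 2, 4))   # theta = 4.0
print(yules_q(10, 5, 2, 4))      # Q = 0.6
```

Note that Q, unlike the tetrachoric correlation, is defined from the cell counts alone, with no appeal to an underlying continuous distribution; that is precisely the point on which Yule and Pearson disagreed.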
Initial development of many measures occurred in the nineteenth century. Their 1959 article contains the following quote from M. H. Doolittle in 1887, which illustrates the lack of precision in early attempts to quantify the meaning of association even in 2 × 2 tables: "Having given the number of instances respectively in which things are both thus and so, in which they are thus but not so, in which they are so but not thus, and in which they are neither thus nor so, it is required to eliminate the general quantitative relativity inhering in the mere thingness of the things, and to determine the special quantitative relativity subsisting between the thusness and the soness of the things." Goodman (2000) added to the historical survey and proposed a new measure.

16.2 R. A. FISHER'S CONTRIBUTIONS

Pearson's disagreements with Yule were minor compared to his later ones with Ronald A. Fisher (1890–1962). Using a geometric representation, Fisher (1922) introduced degrees of freedom to characterize the family of chi-squared distributions. Fisher claimed that for tests of independence in I × J tables, X² has df = (I − 1)(J − 1). By contrast, Pearson (1900, 1904) had argued that for any application of X², the index that Fisher later identified as df equals the number of cells minus 1, or IJ − 1 for two-way tables. Fisher pointed out, however, that estimating hypothesized cell probabilities using estimated row and column probabilities resulted in an additional (I − 1) + (J − 1) constraints on the fitted values, thus affecting the distribution of X². Not surprisingly, Pearson (1922) reacted critically to Fisher's suggestion that his df formula was incorrect. He stated: "I hold that such a view [Fisher's] is entirely erroneous, and that the writer has done no service to the science of statistics by giving it broad-cast circulation in the pages of the Journal of the Royal Statistical Society. . . .
I trust my critic will pardon me for comparing him with Don Quixote tilting at the windmill; he must either destroy himself, or the whole theory of probable errors, for they are invariably based on using sample values for those of the sampled population unknown to us." Pearson claimed that using row and column sample proportions to estimate unknown probabilities had negligible effect on large-sample distributions, although he had realized (Pearson 1917) that df must be adjusted when the cell counts have linear constraints.

Fisher was unable to get his rebuttal published by the Royal Statistical Society, and he ultimately resigned his membership. Statisticians soon realized that Fisher was correct, but he maintained much bitterness over this and other dealings with Pearson. In the preface to a later volume of his collected works, he remarked that his 1922 article "had to find its way to publication past critics who, in the first place, could not believe that Pearson's work stood in need of correction, and who, if this had to be admitted, were sure that they themselves had corrected it." Writing about Pearson, he stated: "If peevish intolerance of free opinion in others is a sign of senility, it is one which he had developed at an early age."

In Fisher (1926), he was able to dig the knife a bit deeper into the Pearson family using 11,688 2 × 2 tables randomly generated assuming independence by Karl Pearson's son, E. S. Pearson. Fisher showed that the sample mean of X² for these tables was 1.00001, much closer to the 1.0 predicted by his formula for E(X²) of df = (I − 1)(J − 1) = 1 than Pearson's IJ − 1 = 3. His daughter, Joan Fisher Box (1978), discussed this and other conflicts between Fisher and Pearson. Hald (1998, pp. 652–663), Plackett (1983), and Stigler (1999, Chap. 19) summarized the chi-squared controversy.
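A short sketch makes the disputed bookkeeping concrete: the expected counts are built from the estimated row and column margins, which is exactly the step that imposes the extra (I − 1) + (J − 1) constraints Fisher identified. (Pure Python; the function name and the example tables are mine.)

```python
def pearson_x2(table):
    """Pearson's X^2 for an I x J contingency table (a list of rows),
    returned together with Fisher's df = (I - 1)(J - 1)."""
    I, J = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(I)) for j in range(J)]
    x2 = 0.0
    for i in range(I):
        for j in range(J):
            # Expected count under independence, estimated from margins:
            expected = row_totals[i] * col_totals[j] / n
            x2 += (table[i][j] - expected) ** 2 / expected
    return x2, (I - 1) * (J - 1)

# A 2 x 2 table that exactly satisfies independence gives X^2 = 0, df = 1:
print(pearson_x2([[10, 20], [30, 60]]))
```

For a 2 × 2 table this yields df = 1, the value Fisher's E(X²) formula predicted for E. S. Pearson's simulated tables, rather than Karl Pearson's IJ − 1 = 3.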
Fisher's preeminent reputation among statisticians today accrues mainly from his theoretical work (introducing concepts such as sufficiency, information, and optimal properties of ML estimators) and his methodological contributions to the design of experiments and the analysis of variance. Although not so well known for work in CDA, he made other interesting contributions. Moreover, he made good use of the methods in his applied work. For instance, Fisher was also a famed geneticist. In one article, he used Pearson's goodness-of-fit test to check Mendel's theories of natural inheritance and showed that the fit was too good (Section 1.5.3).

Fisher realized the limitations of large-sample methods for laboratory work, and he was at the forefront of advocating specialized small-sample methods. Writing about large-sample methods in the preface to the first edition of his classic text Statistical Methods for Research Workers, he stated: "[T]he traditional machinery of statistical processes is wholly unsuited to the needs of practical research. Not only does it take a cannon to shoot a sparrow, but it misses the sparrow! The elaborate mechanism built on the theory of infinitely large samples is not accurate enough for simple laboratory data. Only by systematically tackling small sample problems on their merits does it seem possible to apply accurate tests to practical data." Fisher was among the first to promote the work by W. S. Gosset (pseudonym "Student") on the t distribution. The fifth edition of Statistical Methods for Research Workers (1934) introduced Fisher's exact test for 2 × 2 contingency tables. In his 1935 book The Design of Experiments, Fisher described the tea-tasting experiment (Section 3.5.2) motivated by his experience at an afternoon tea break while employed at Rothamsted Experiment Station.

The mid-1930s finally saw some model building for categorical responses.
Chester Bliss (1934, 1935), following up a 1933 report on quantal response methods by J. H. Gaddum, popularized the probit model for applications in toxicology with a binary response. Bliss introduced the term probit but used the inverse normal cdf with mean 5 (rather than 0, in order to avoid negative values) and standard deviation 1. In the appendix of Bliss (1935), Fisher (1935b) outlined an algorithm for finding ML estimates of model parameters. That algorithm was a Newton–Raphson type of method using expected information, today commonly called Fisher scoring (Section 4.6.2). Stigler (1986, p. 246) and Finney (1971) attributed the first use of inverse normal cdf transformations of proportions to the German physicist Gustav Fechner in his 1860 book Elemente der Psychophysik. See Finney (1971) and McCulloch (2000) for other history of the probit method.

The definition for homogeneous association (no interaction) in contingency tables originated in an article by the British statistician Maurice Bartlett (1935) about 2 × 2 × 2 tables. Bartlett showed how to find ML estimates of cell probabilities satisfying the property of equality of odds ratios between two variables at each level of the third. He attributed the idea to Fisher.

In 1940, Fisher developed canonical correlation methods for contingency tables. He showed how to assign scores to rows and columns of a contingency table to maximize the correlation. His work relates to the later development, particularly in France, of correspondence analysis methods (e.g., Benzécri 1973).

R. A. Fisher has had the greatest influence on the practice of modern statistical science. The biography by his daughter (Box 1978) gives a fascinating account of his impressive contributions to statistics and genetics. Fienberg (1980) summarized his contributions to CDA.

16.3 LOGISTIC REGRESSION

Bartlett (1937)
used log[y/(1 − y)] in regression and ANOVA to transform observations y that are continuous proportions (Problem 6.33). In a book of statistical tables published in 1938, R. A. Fisher and Frank Yates suggested it as a possible transformation of a binomial parameter for analyzing binary data. In 1944, the physician and statistician Joseph Berkson introduced the term logit for this transformation. Berkson showed that the model using the logit fitted similarly to the probit model, and his subsequent work did much to popularize logistic regression. In 1951, Jerome Cornfield, another statistician with strong medical ties, used the odds ratio to approximate relative risks in case–control studies. Dyke and Patterson (1952) apparently first used the logit in models with qualitative predictors. Sir David R. Cox introduced many statisticians to logistic regression, through his 1958 article and 1970 book, The Analysis of Binary Data.

About the same time, an article by the Danish statistician and mathematician Georg Rasch sparked an enormous literature on item response models. The most important of these is the logit model with subject and item parameters, now called the Rasch model (Section 12.1.4). This work was highly influential in the psychometric community of northern Europe (especially in Denmark, the Netherlands, and Germany) and spurred many generalizations in the educational testing community in the United States.

The extension of logistic regression to multicategory responses received occasional attention before 1970 (e.g., Mantel 1966) but substantial work after about that date. For nominal responses, early work was mainly in the econometrics literature. See Bock (1970), McFadden (1974), Nerlove and Press (1973), and Theil (1969, 1970). In 2000, Daniel McFadden won the Nobel Prize in Economics for his work in the 1970s and 1980s on the discrete-choice model (Section 7.6).
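The two rival link transformations discussed above can be written down directly. A minimal sketch in Python (the function names are mine; `statistics.NormalDist` requires Python 3.8+):

```python
import math
from statistics import NormalDist

def logit(p):
    """Berkson's logit: the log odds log[p/(1 - p)] of a proportion p in (0, 1)."""
    return math.log(p / (1 - p))

def bliss_probit(p):
    """Bliss's original probit: the inverse standard normal cdf shifted by 5
    so that tabulated values stay positive."""
    return 5 + NormalDist().inv_cdf(p)

print(logit(0.5))         # 0.0 at p = 1/2
print(bliss_probit(0.5))  # 5.0 at p = 1/2 under Bliss's convention
```

Berkson's empirical point was that, over most of the (0, 1) range, plots of the logit and the probit against a predictor are nearly linear rescalings of each other, so the two models typically fit data about equally well.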
For cumulative logit models for ordinal responses, see Bock and Jones (1968), Simon (1974), Snell (1964), Walker and Duncan (1967), and Williams and Grizzle (1972). The cumulative probit case, based on an underlying normal response, has a longer history; see, for instance, Aitchison and Silvey (1957) and Bock and Jones (1968, Chap. 8). Cumulative logit and probit models received much more attention following publication of McCullagh (1980), which provided a Fisher scoring algorithm for ML fitting of all cumulative link models.

The next major advances with logistic regression dealt with its application to case–control studies (e.g., Breslow 1996; Mantel 1973; Prentice 1976a; Prentice and Pyke 1979; see also Section 5.1.4) and the conditional ML approach to model fitting for those studies and others with numerous nuisance parameters (Breslow et al. 1978, with related work in Breslow 1976, 1982; Breslow and Day 1980; Breslow and Powers 1978; Cox 1970; Farewell 1979; Prentice 1976a; Prentice and Breslow 1978; Zelen 1971; see also Sections 6.7 and 10.2). The conditional approach was later exploited in small-sample exact inference (Hirji et al. 1987; Mehta and Patel 1995; see also Section 6.7).

Nathan Mantel, whose name appears in the preceding two paragraphs, made a variety of interesting contributions to CDA. Although best known for the 1959 Mantel–Haenszel test and related odds ratio estimator, he also discussed trend tests (1963), multinomial logit and loglinear modeling (1966), logistic regression for case–control data (1973), the number of contingency tables having fixed margins (Gail and Mantel 1977), the analysis of square contingency tables (Mantel and Byar 1978), and problems with minimum chi-squared and Wald tests (1985, 1987a).

More recently, attention has focused on fitting logistic models to correlated responses for clustered data. One strand of this is marginal modeling of longitudinal data (Diggle et al.
2002; Liang and Zeger 1986; Liang et al. 1992). Much of this literature focuses on quasi-likelihood methods such as generalized estimating equations (GEE). Another strand is generalized linear mixed models (e.g., Breslow and Clayton 1993).

Perhaps the most far-reaching contribution of the past half century has been the introduction by British statisticians John Nelder and R. W. M. Wedderburn in 1972 of the concept of generalized linear models. This unifies the logistic and probit regression models for binomial data with loglinear models for Poisson data and with long-established regression and ANOVA models for normal-response data. Interestingly, the algorithm they used to fit GLMs is Fisher scoring, which R. A. Fisher introduced in 1935 for ML fitting of probit models. McCulloch (2000) reviewed the journey from probit models to GLMs and their further generalizations such as quasi-likelihood.

16.4 MULTIWAY CONTINGENCY TABLES AND LOGLINEAR MODELS

The quarter century following the end of World War II saw the development of a theoretical underpinning for models for contingency tables. H. Cramér (1946) derived general expressions for large-sample distributions of parameter estimators. C. R. Rao (1957, 1963) conducted related work. In 1949, the Berkeley-based statistician Jerzy Neyman, who had already performed fundamental work on hypothesis testing and interval estimation methods with E. S. Pearson, introduced the family of best asymptotically normal (BAN) estimators. These have the same optimal large-sample properties as ML estimators. The BAN family includes estimators obtained by minimizing chi-squared-type measures comparing observed proportions to proportions predicted by the model (Section 15.3.1). This type of estimator itself includes some weighted least squares (WLS) estimators. The simplicity of their computation, compared to ML estimators, was an important consideration before the advent of modern computing.
Neyman's (1949) only mention of Fisher was the suggestion that Fisher did not realize that estimators other than ML could be BAN, stating that "the results . . . contradict the assertion of R. A. Fisher, not a very clear one, that 'the maximum likelihood equation may indeed be derived from the conditions that it shall be linear in frequencies, and efficient for all values of θ'." Fisher, of course, returned the compliment: for instance, writing (1956) about proposals for an unconditional test for 2 × 2 tables, "the Principles of Neyman and Pearson's 'Theory of Testing Hypotheses' are liable to mislead those who follow them into much wasted effort."

In the early 1950s, William Cochran published work dealing with a variety of important topics in CDA. Scottish-born, Cochran spent most of his career at American universities: Iowa State, North Carolina State, Johns Hopkins, and Harvard. He (1940) modeled Poisson and binomial responses with variance-stabilizing transformations. He (1943) recognized and discussed ways of dealing with overdispersion. He (1950) introduced a generalization (Cochran's Q) of McNemar's test for comparing proportions in several matched samples. His classic 1954 article is a mixture of new methodology and advice for applied statisticians. It gave sample-size guidelines for chi-squared approximations to work well for the X² statistic. It also stressed the importance of directing inferences toward narrow (e.g., single-degree-of-freedom) alternatives and partitioning chi-squared statistics into components. One instance of this was Cochran's proposed test of conditional independence in several 2 × 2 tables, which was closely related to the Mantel and Haenszel (1959) test (Section 6.3.2). Another was a test for a linear trend in proportions across quantitatively defined rows of an I × 2 table (Section 5.3.5). See also Cochran (1955). Fienberg (1984) reviewed Cochran's contributions to CDA.
Bartlett's work on interaction structure in 2 × 2 × 2 contingency tables had relatively little impact for 20 years. Indeed, in presenting methods for partitioning X² in 2 × 2 × 2 tables, Lancaster (1951) noted that "Doubtless little use will ever be made of more than a three-dimensional classification." However, in the mid-1950s and early 1960s, Bartlett's work was extended in many ways to multiway tables. See, for instance, Darroch (1962), Good (1963), Goodman (1964b), Plackett (1962), Roy and Kastenbaum (1956), and Roy and Mitra (1956). These articles as well as influential articles by Martin W. Birch (1963, 1964a, b, 1965) were the genesis of research work on loglinear models between about 1965 and 1975. Birch's work was part of a never-submitted Ph.D. thesis at the University of Glasgow. He showed how to obtain ML estimates of cell probabilities in three-way tables, under various conditions. He showed the equivalence of those ML estimates for Poisson and multinomial sampling. He (and Watson 1959) extended theoretical results of Cramér and Rao on large-sample distributions for contingency table models. Mantel (1966) discussed early results and made the loglinear model formula explicit. A survey article by the French statistician Henri Caussinus (1966), based partly on his Ph.D. thesis, provides a good glimpse of the state of the art of CDA just before this decade of advances. There, Caussinus introduced the quasi-symmetry model for square tables.

Much of the work in the next decades on loglinear and related logit modeling took place at three American universities: the University of Chicago, Harvard University, and the University of North Carolina.
At Chicago, Leo Goodman wrote a series of groundbreaking articles, dealing with such topics as partitionings of chi-squared, models for square tables (e.g., quasi-independence), stepwise logit and loglinear model-building procedures, deriving asymptotic variances of ML estimates of loglinear parameters, latent class models, association models, correlation models, and correspondence analysis. For surveys of his early work, see Goodman (1968, an R. A. Fisher memorial lecture, 1970). For later work, see Goodman (1985, 1996, 2000). Goodman also wrote a stream of articles for social science journals that had a substantial impact on popularizing loglinear and logit methods for applications (e.g., Goodman 1969b). Over the past 50 years, Goodman has been the most prolific contributor to the advancement of CDA methodology. The field owes tremendous gratitude to his steady and impressive body of work.

In addition, some of Goodman's students at Chicago also made fundamental contributions. In 1970, Shelby Haberman completed a Ph.D. dissertation (the basis of his 1974a monograph) making substantial theoretical contributions to loglinear modeling. Among topics he considered were residual analyses, existence of ML estimates, loglinear models for ordinal variables, and theoretical results for models (such as the Rasch model) for which the number of parameters grows with the sample size. Clifford Clogg followed in Goodman's steps by having influence in the social sciences and in statistics with his work on association models, demography, models for rates, the census, and various other topics.

Simultaneously with Goodman's work, related research on ML methods for loglinear-logit models occurred at Harvard by students of Frederick Mosteller (such as Stephen Fienberg) and William Cochran. Much of this research was inspired by problems arising in analyzing large, multivariate data sets in the National Halothane Study (Bishop and Mosteller 1969; see also p.
345 of an interview with Lincoln Moses in Statist. Sci. 14, 1999). That study investigated whether halothane was more likely than other anesthetics to cause death due to liver damage. A presidential address by Mosteller (1968) to the American Statistical Association described early uses of loglinear models for smoothing multidimensional discrete data sets. Fienberg and his own students advanced this work further. A landmark book in 1975 by him with Yvonne Bishop and Paul Holland, Discrete Multivariate Analysis, was largely responsible for introducing loglinear models to the general statistical community and remains an excellent reference.

FIGURE 16.1 Four leading figures in the development of categorical data analysis.

Research at North Carolina by Gary Koch and several students and co-workers has been highly influential in the biomedical sciences. Their research developed WLS methods for categorical data models (Section 15.1). The 1969 article by Koch with J. Grizzle and F. Starmer popularized this approach. Koch and colleagues extended it in later articles to an impressive variety of problems, including problems for which ML methods are awkward to use, such as the analysis of repeated categorical measurement data (Koch et al. 1977). In 1966, Vasant Bhapkar showed that the WLS estimator is often identical to Neyman's minimum modified chi-squared estimator.

The early literature on loglinear models treated all classifications as nominal. Haberman (1974b) and Simon (1974) showed how to exploit ordinality of classifications in loglinear models. This work was extended in several articles by Leo Goodman (1979a, 1981a, b, 1983, 1985, 1986).
The extensions included association models, which replace ordered scores in loglinear models by parameters (Section 9.5). Goodman (1985, 1986, 1996) also discussed related correlation models and provided a model-based perspective for the closely related correspondence analysis methods. Certain loglinear models with conditional independence structure provide graphical models for contingency tables. These relate to the association graphs used in Section 9.1. Darroch et al. (1980) was the genesis of much of this work.

16.5 RECENT (AND FUTURE?) DEVELOPMENTS

The most active area of new research in CDA in the past decade has been the modeling of clustered data, such as occur in longitudinal studies and other forms of repeated measurement. A variety of ways now exist of modeling while accounting for the correlation among responses in the same cluster. As discussed in Chapters 11 and 12, ML estimation is difficult for such models. For complex forms of generalized linear mixed models, for instance, it is a challenge to estimate regression parameters and variance components well. Integrating out the random effect to obtain the likelihood function requires an approximation such as numerical integration. Not surprisingly, various Monte Carlo approaches are applied increasingly here. A promising approach is a Monte Carlo EM algorithm that uses a Monte Carlo approximation for the E step (Booth and Hobert 1999). The Monte Carlo error can be assessed at each iteration, and one can accurately reproduce the ML estimates with sufficiently many iterations.

The modeling of clustered correlated data is likely to be an active area of research in coming years. The class of generalized linear mixed models is certain to see substantial work and further generalization. One extension is generalized additive mixed models. Time-series models for categorical responses have so far received relatively little attention.
For all such models with correlated responses, model diagnostics are of vital importance and need development. For longitudinal data, missing data are a common problem. This area currently has much activity.

Another important recent advance is the development of efficient algorithms for exact small-sample methods. With such methods, one can guarantee that the size of a test is no greater than some prespecified level and that the coverage probability for a confidence interval is at least the nominal level. The "exactness" refers only to inference being based on probability distributions that do not depend on unknown parameters. There is no unique way to do this, and certain methods can be highly conservative because of discreteness. Most literature deals with the conditional approach, which eliminates nuisance parameters by conditioning on their sufficient statistics. Hence, the basic idea builds on Fisher's exact test. Conditional methods are versatile, applying to exponential family linear models that use the canonical link function, such as loglinear models for Poisson responses and logit models for binomial responses. Many of the computational advances with the exact conditional approach occurred in a series of articles by Cyrus Mehta, Nitin Patel, and colleagues at Harvard (e.g., Mehta and Patel 1983), using the network algorithm. See surveys by Agresti (1992), Mehta (1994), Mehta and Patel (1995), and the StatXact and LogXact manuals (Cytel Software, Cambridge, MA, founded by Mehta and Patel). Although the development of "exact" methods has seen considerable progress, certain analyses are still infeasible and likely to be so for some time because of the exponential increase in computing time as the table size or sample size increases.

There is an ever-increasing variety of methods for accurate approximation of exact methods. These include simple Monte Carlo (e.g., Agresti et al.
1979), Monte Carlo with importance sampling (e.g., Booth and Butler 1999; Mehta et al. 1988), Markov chain Monte Carlo (MCMC; Forster et al. 1996), saddlepoint approximations (Pierce and Peters 1992; Strawderman and Wells 1998), and related work on an approximate conditioning approach (Pierce and Peters 1999) in which discreteness is not so problematic.

Finally, the development of Bayesian approaches to CDA is an increasingly active area. The multiplicity of parameters complicates Bayesian modeling. For early use of Bayesian estimation of probabilities, see Good (1965) and Lindley (1964). Good's (1965) article apparently evolved from his work during World War II with Alan Turing at Bletchley Park, England, on breaking Nazi codes. The development of the Bayesian approach for CDA is discussed in Section 15.2.3.

Predicting the future is always dangerous. However, it is likely that much future research will focus on computationally intensive methods such as generalized linear mixed models. Another hot topic, largely outside the realm of traditional modeling, is the development of algorithmic methods for huge data sets with large numbers of variables. Such methods, often referred to as data mining, deal with the handling of complex data structures, with a premium on predictive power at the sacrifice of simplicity and interpretability of structure. Important areas of application include genetics, such as the analysis of discrete DNA sequences in the form of very high-dimensional contingency tables, and business applications such as credit scoring and tree-structured methods for predicting future behavior of customers.

Sources for the historical tour in this chapter include Stigler (1986), Studies in the History of Probability and Statistics, edited by E. S. Pearson and M. G. Kendall (London: Griffin, 1970), and personal conversations over the years with several statisticians, including Erling Andersen, R. L.
Anderson, Henri Caussinus, William Cochran, Sir David Cox, John Darroch, Leo Goodman, Gary Koch, Frederick Mosteller, John Nelder, C. R. Rao, Stephen Stigler, Geoffrey Watson, and Marvin Zelen.

To readers who have made it this far, I congratulate your perseverance! To develop a more complete understanding of the historical development of CDA, you may want to study the following chronological list of 25 sources. These convey a sense of how methodology has evolved. Alternatively, look at some early books on this topic, such as A. E. Maxwell's Analysing Qualitative Data (New York: Methuen, 1961), R. L. Plackett's The Analysis of Categorical Data (London: Griffin, 1974), and the Bishop, Fienberg, and Holland Discrete Multivariate Analysis (Cambridge, MA: MIT Press, 1975).

Pearson (1900)
Yule (1912)
Fisher (1922)
Bartlett (1935)
Berkson (1944)
Neyman (1949)
Cochran (1954)
Goodman and Kruskal (1954)
Roy and Mitra (1956)
Cox (1958a)
Mantel and Haenszel (1959)
Birch (1963)
Birch (1964b)
Caussinus (1966)
Goodman (1968)
Mosteller (1968)
Grizzle et al. (1969)
Goodman (1970)
Nelder and Wedderburn (1972)
Haberman (1974a)
McFadden (1974)
Goodman (1979a)
McCullagh (1980)
Liang and Zeger (1986)
Breslow and Clayton (1993)

Categorical Data Analysis, Second Edition. Alan Agresti. Copyright © 2002 John Wiley & Sons, Inc. ISBN: 0-471-36093-7

APPENDIX A

Using Computer Software to Analyze Categorical Data

In this appendix we discuss statistical software for categorical data analysis, with emphasis on SAS. We begin by mentioning major software that can perform the analyses discussed in this book. Then we illustrate, by chapter, SAS code for the analyses. Information about other packages (such as S-Plus, R, SPSS, and Stata), as well as updated information about SAS, is at the Web site www.stat.ufl.edu/~aa/cda/cda.html. Section A.2 on SAS also lists other software for analyses not currently available in SAS.
A.1 SOFTWARE FOR CATEGORICAL DATA ANALYSIS

A.1.1 SAS

SAS is general-purpose software for a wide variety of statistical analyses. The main procedures (PROCs) for categorical data analyses are FREQ, GENMOD, LOGISTIC, NLMIXED, and CATMOD. PROC FREQ computes measures of association and their estimated standard errors. It also performs generalized Cochran–Mantel–Haenszel tests of conditional independence, and exact tests of independence in I × J tables. PROC GENMOD fits generalized linear models. It fits cumulative link models for ordinal responses. It can perform GEE analyses for marginal models. One can form one's own variance function and allow scale parameters, making it suitable for quasi-likelihood analyses. PROC LOGISTIC gives ML fitting of binary response models, cumulative link models for ordinal responses, and baseline-category logit models for nominal responses. It incorporates model selection procedures, regression diagnostic options, and exact conditional inference. PROC PROBIT also conducts ML fitting of binary and cumulative link models as well as quantal response models that permit a strictly positive probability as the linear predictor decreases to −∞. PROC CATMOD fits baseline-category logit models. It is also useful for WLS fitting of a wide variety of models for categorical data. PROC NLMIXED fits generalized linear mixed models (GLMMs). It approximates the likelihood using adaptive Gauss–Hermite quadrature. Other programs run on SAS that are not specifically supported by the SAS Institute. For further details about SAS for categorical data analyses, see the very helpful guide by Stokes et al. (2000). Also useful are SAS publications on logistic regression (Allison 1999) and graphics (Friendly 2000).

A.1.2 Other Software Packages

Most major statistical software has procedures for categorical data analyses. For instance, see SPSS (SPSS Regression Models 10.0 by M. J.
Norusis, SPSS Inc., 1999), Stata (A Handbook of Statistical Analyses Using Stata, 2nd ed., by S. Rabe-Hesketh and B. Everitt, CRC Press, Boca Raton, FL, 2000), S-Plus (Modern Applied Statistics with S-Plus, 3rd ed., by W. N. Venables and B. D. Ripley, Springer-Verlag, New York, 1999) and the related free package R, and GLIM (Aitkin et al. 1989). Most major software now follows the lead of GLIM and includes a generalized linear models routine. Examples are PROC GENMOD in SAS and the glm function in R and S-Plus.

For certain analyses, specialized software is better than the major packages. A good example is StatXact (Cytel Software, Cambridge, Massachusetts), which provides exact analysis for categorical data methods and some nonparametric methods. Among its procedures are small-sample confidence intervals for differences and ratios of proportions and for odds ratios, and Fisher's exact test and its generalizations for I × J tables. It can also conduct exact tests of conditional independence and of equality of odds ratios in 2 × 2 × K tables, and exact confidence intervals for the common odds ratio in several 2 × 2 tables. StatXact uses Monte Carlo methods to approximate exact P-values and confidence intervals when a data set is too large for exact inference to be computationally feasible. Its companion, LogXact, performs exact conditional logistic regression. Other examples of specialized software are SUDAAN for GEE-type analyses that handle clustering in survey data (Research Triangle Institute, Research Triangle Park, North Carolina), Latent GOLD for latent class modeling (Statistical Innovations, Belmont, Massachusetts), MLn (Institute of Education, London) and HLM (Scientific Software, Chicago) for multilevel models, and PASS for power analyses (NCSS Statistical Software, Kaysville, Utah). S-Plus and R functions are also available from individuals or from published work for particular analyses. For instance, Statistical Models in S by J. M. Chambers and T.
J. Hastie (Wadsworth, Belmont, California, 1993, p. 227) showed the use of S-Plus in quasi-likelihood analyses using the quasi and make.family functions.

TABLE A.1 SAS Code for Chi-Squared, Measures of Association, and Residuals for Education–Religion Data in Table 3.2

data table;
input degree religion $ count @@;
datalines;
1 fund 178 1 mod 138 1 lib 108
2 fund 570 2 mod 648 2 lib 442
3 fund 138 3 mod 252 3 lib 252
;
proc freq order = data; weight count;
tables degree*religion / chisq expected measures cmh1;
proc genmod order = data; class degree religion;
model count = degree religion / dist = poi link = log residuals;

A.2 EXAMPLES OF SAS CODE BY CHAPTER

The examples below show SAS code (Version 8.1). We focus on basic model fitting rather than the great variety of options. The material is organized by chapter of presentation. For convenience, data for examples are entered in the form of the contingency table displayed in the text. In practice, one would usually enter data at the subject level. These tables and the full data sets are available at www.stat.ufl.edu/~aa/cda/cda.html.

Chapters 1–3: Introduction, Two-Way Contingency Tables

Table A.1 uses SAS to analyze Table 3.2. The @@ symbol indicates that each line of data contains more than one observation. Input of a variable as characters rather than numbers requires an accompanying $ label in the INPUT statement. PROC FREQ forms the table with the TABLES statement, ordering row and column categories alphanumerically. To use instead the order in which the categories appear in the data set (e.g., to treat the variable properly in an ordinal analysis), use the ORDER = DATA option in the PROC statement. The WEIGHT statement is needed when one enters the cell counts instead of subject-level data.
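For readers working outside SAS, the chi-squared analysis that Table A.1 requests with the CHISQ and EXPECTED options can be sketched in Python. This is not part of the book's SAS material; it assumes scipy is available and simply reproduces the Pearson test on the same counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Education (rows: <high school, high school, >high school) by
# religious beliefs (columns: fundamentalist, moderate, liberal),
# counts as entered in the DATALINES of Table A.1
table = np.array([[178, 138, 108],
                  [570, 648, 442],
                  [138, 252, 252]])

# correction=False gives the ordinary Pearson X^2, matching PROC FREQ's CHISQ
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"X^2 = {chi2:.2f} on df = {dof}, P = {p:.4g}")
```

The `expected` array corresponds to what the EXPECTED option prints; residuals can then be formed from `table` and `expected` directly.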
PROC FREQ can conduct chi-squared tests of independence (CHISQ option), show its estimated expected frequencies (EXPECTED), provide a wide assortment of measures of association and their standard errors (MEASURES), and provide the ordinal statistic (3.15) with a "nonzero correlation" test (CMH1). One can also perform chi-squared tests using PROC GENMOD (using loglinear models discussed in the Chapters 8–9 section of this appendix), as shown. Its RESIDUALS option provides cell residuals. The output labeled "StReschi" is the standardized Pearson residual (3.13).

Table A.2 analyzes Table 3.8. With PROC FREQ, for 2 × 2 tables the MEASURES option in the TABLES statement provides confidence intervals

TABLE A.2 SAS Code for Fisher's Exact Test and Confidence Intervals for Odds Ratio for Tea-Tasting Data in Table 3.8

data fisher;
input poured guess count @@;
datalines;
1 1 3 1 2 1 2 1 1 2 2 3
;
proc freq; weight count;
tables poured*guess / measures riskdiff;
exact fisher or / alpha = .05;
proc logistic descending; freq count;
model guess = poured / clodds = pl;

for the odds ratio (labeled "case-control" on output) and the relative risk, and the RISKDIFF option provides intervals for the proportions and their difference. For tables having small cell counts, the EXACT statement can provide various exact analyses. These include Fisher's exact test and its generalization for I × J tables, treating variables as nominal, with keyword FISHER. The OR keyword gives the odds ratio and its large-sample confidence interval (3.2) and the small-sample interval based on (3.20). Other EXACT statement keywords include binomial tests for 1 × 2 tables (keyword BINOMIAL), exact trend tests for I × 2 tables (TREND), and exact chi-squared tests (CHISQ) and exact correlation tests for I × J tables (MHCHI). One can use Monte Carlo simulation (option MC) to estimate exact P-values when the exact calculation is too time consuming.
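Outside SAS, the Fisher's exact test of Table A.2 can be sketched with scipy (an assumption of this sketch, not something the book documents; note also that scipy reports the sample odds ratio ad/bc, whereas SAS's small-sample interval is based on the conditional distribution):

```python
from scipy.stats import fisher_exact

# Tea-tasting data of Table 3.8 as entered in Table A.2:
# rows = order actually poured, columns = guess
table = [[3, 1],
         [1, 3]]

odds_ratio, p_two_sided = fisher_exact(table, alternative='two-sided')
p_one_sided = fisher_exact(table, alternative='greater')[1]
print(f"sample odds ratio = {odds_ratio}, two-sided P = {p_two_sided:.4f}, "
      f"one-sided P = {p_one_sided:.4f}")
```

With these margins the hypergeometric distribution gives a one-sided P-value of 17/70 and a two-sided P-value of 34/70.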
Table A.2 also uses PROC LOGISTIC to get a profile-likelihood confidence interval for the odds ratio (CLODDS = PL). LOGISTIC uses FREQ to serve the same purpose as PROC FREQ uses WEIGHT.

Other: StatXact provides small-sample confidence intervals for a binomial parameter, the difference of proportions, relative risk, and odds ratio. Blaker (2000) gave S-Plus functions that provide his confidence interval for a binomial parameter.

Chapter 4: Models for Binary Response Variables

PROC GENMOD fits GLMs. It specifies the response distribution in the DIST option ("poi" for Poisson, "bin" for binomial, "mult" for multinomial, "negbin" for negative binomial) and specifies the link in the LINK option. Table A.3 illustrates for Table 4.2. For binomial models with grouped data, the response in the model statements takes the form of the number of "successes" divided by the number of cases.

TABLE A.3 SAS Code for Binary GLMs for Snoring Data in Table 4.2

data glm;
input snoring disease total @@;
datalines;
0 24 1379 2 35 638 4 21 213 5 30 254
;
proc genmod; model disease / total = snoring / dist = bin link = identity;
proc genmod; model disease / total = snoring / dist = bin link = logit;
proc genmod; model disease / total = snoring / dist = bin link = probit;

TABLE A.4 SAS Code for Poisson and Negative Binomial GLMs for Horseshoe Crab Data in Table 4.3

data crab;
input color spine width satell weight;
datalines;
3 3 28.3 8 3.05
4 3 22.5 0 1.55
...
3 2 24.5 0 2.00
;
proc genmod; model satell = width / dist = poi link = log;
proc genmod; model satell = width / dist = poi link = identity;
proc genmod; model satell = width / dist = negbin link = identity;

Table A.4 uses GENMOD for count modeling of Table 4.3. Each observation refers to a single crab. Using width as the predictor, the first two models use Poisson regression. The third model uses the identity link assuming a negative binomial distribution.
Table A.5 uses GENMOD for the overdispersed data of Table 4.5. A CLASS statement requests dummy variables for the groups. With no intercept in the model (option NOINT) for the identity link, the estimated parameters are the four group probabilities. The ESTIMATE statement provides an estimate, confidence interval, and test for a contrast of model parameters, in this case the difference in probabilities for the first and second groups. The second analysis uses the Pearson statistic to scale standard errors to adjust for overdispersion. PROC LOGISTIC can also provide overdispersion modeling of binary responses; see Table A.27 in the Chapter 13 part of this appendix. PROC GAM (starting in Version 8.2) fits generalized additive models.

TABLE A.5 SAS Code for Overdispersion Modeling of Teratology Data in Table 4.5

data moore;
input litter group n y @@;
datalines;
1 1 10 1 2 1 11 4 3 1 12 9 4 1 4 4 5 1 10 10
...
55 4 14 1 56 4 8 0 58 4 17 0
;
proc genmod; class group;
model y/n = group / dist = bin link = identity noint;
estimate 'pi1-pi2' group 1 -1 0 0;
proc genmod; class group;
model y/n = group / dist = bin link = identity noint scale = pearson;

Chapters 5 and 6: Logistic Regression

One can fit logistic regression models using either software for GLMs or specialized software for logistic regression. PROC GENMOD uses Newton–Raphson, whereas PROC LOGISTIC uses Fisher scoring. Both yield ML estimates, but SE values use observed information in GENMOD and expected information in LOGISTIC. These are the same for the logit link. Table A.6 applies GENMOD and LOGISTIC to Table 5.2, when "y" out of "n" crabs had satellites at a given width level. In GENMOD, the LRCI option provides profile likelihood confidence intervals. The ALPHA = option can specify an error probability other than the default of 0.05. The TYPE3 option provides likelihood-ratio tests for each parameter.
(In the Chapters 8–9 section we discuss the second GENMOD analysis.)

TABLE A.6 SAS Code for Modeling Grouped Crab Data in Table 5.2

data crab;
input width y n satell;
logcases = log(n);
datalines;
22.69 5 14 14
...
30.41 14 14 72
;
proc genmod;
model y/n = width / dist = bin link = logit lrci alpha = .01 type3;
proc logistic;
model y/n = width / influence stb;
output out = predict p = pi_hat lower = LCL upper = UCL;
proc print data = predict;
proc genmod;
model satell = width / dist = poi link = log offset = logcases residuals;

TABLE A.7 SAS Code for Logit Modeling of AIDS Data in Table 5.5

data aids;
input race $ azt $ y n @@;
datalines;
White Yes 14 107 White No 32 113 Black Yes 11 63 Black No 12 55
;
proc genmod; class race azt;
model y/n = azt race / dist = bin type3 lrci residuals obstats;
proc logistic; class race azt / param = reference;
model y/n = azt race / aggregate scale = none clparm = both clodds = both;
output out = predict p = pi_hat lower = lower upper = upper;
proc print data = predict;
proc logistic; class race azt (ref = first) / param = ref;
model y/n = azt / aggregate = (azt race) scale = none;

With PROC LOGISTIC, logistic regression is the default for binary data. LOGISTIC has a built-in check of whether logistic regression ML estimates exist. It can detect a complete separation of data points with 0 and 1 outcomes. LOGISTIC can also apply other links, such as the probit. Its INFLUENCE option provides Pearson and deviance residuals and diagnostic measures (Pregibon 1981). The STB option provides standardized estimates by multiplying by s_xj(√3/π) (Section 5.4.7 and Note 5.9). Following the model statement, Table A.6 requests predicted probabilities and lower and upper 95% confidence limits for the probabilities. Table A.7 uses GENMOD and LOGISTIC to fit a logit model with qualitative predictors to Table 5.5.
In GENMOD, the OBSTATS option provides various "observation statistics," including predicted values and their confidence limits. The RESIDUALS option requests residuals such as the Pearson and standardized Pearson residuals (labeled "Reschi" and "StReschi"). A CLASS statement requests dummy variables for the factor. By default, in GENMOD the parameter estimate for the last level of each factor equals 0. In LOGISTIC, estimates sum to zero. That is, dummies take the effect coding (1, −1): 1 when in the category and −1 when not, for which parameters sum to 0. In the CLASS statement in LOGISTIC, the option PARAM = REF requests (1, 0) dummy variables with the last category as the reference level. Also putting REF = FIRST next to a variable name requests its first category as the reference level. The CLPARM = BOTH and CLODDS = BOTH options provide Wald and profile likelihood confidence intervals for parameters and odds ratio effects of explanatory variables. With AGGREGATE SCALE = NONE in the model statement, LOGISTIC reports Pearson and deviance tests of fit; it forms groups by aggregating data into the possible combinations of explanatory variable values, without overdispersion adjustments. Adding variables in parentheses after AGGREGATE (as in the second use of LOGISTIC in Table A.7) specifies the predictors used for forming the table on which to test fit, even when some predictors may have no effect in the model.
TABLE A.8 SAS Code for Logistic Regression Models with Horseshoe Crab Data in Table 4.3

data crab;
input color spine width satell weight;
if satell>0 then y = 1; if satell = 0 then y = 0;
if color = 4 then light = 0; if color<4 then light = 1;
datalines;
2 3 28.3 8 3.05
...
2 2 24.5 0 2.00
;
proc genmod descending; class color;
model y = width color / dist = bin link = logit lrci type3 obstats;
contrast 'a-d' color 1 0 0 -1;
proc genmod descending;
model y = width color / dist = bin link = logit;
proc genmod descending;
model y = width light / dist = bin link = logit;
proc genmod descending; class color spine;
model y = width weight color spine / dist = bin link = logit type3;
proc logistic descending; class color spine / param = ref;
model y = width weight color spine / selection = backward lackfit outroc = classif1;
proc plot data = classif1;
plot _sensit_ * _1mspec_;

Table A.8 shows logistic regression analyses for Table 4.3. The models refer to a constructed binary variable Y that equals 1 when a horseshoe crab has satellites and 0 otherwise. With binary data entry, GENMOD and LOGISTIC order the levels alphanumerically, forming the logit with (1, 0) responses as log[P(Y = 0)/P(Y = 1)]. Invoking the procedure with DESCENDING following the PROC name reverses the order. The first two GENMOD statements use both color and width as predictors; color is qualitative in the first model (by the CLASS statement) and quantitative in the second. A CONTRAST statement tests contrasts of parameters, such as whether parameters for two levels of a factor are identical. The statement shown contrasts the first and fourth color levels. The third GENMOD statement uses a dummy variable for color, indicating whether a crab is light or dark (color = 4). The fourth GENMOD statement fits the main effects model using all the predictors from Table 4.3. LOGISTIC has options for stepwise selection of variables, as the final model statement shows.
The LACKFIT option yields the Hosmer–Lemeshow statistic. Using the OUTROC option, LOGISTIC can output a data set for plotting a ROC curve.

Table A.9 analyzes Table 6.9. The CMH option in PROC FREQ specifies the CMH statistic, the Mantel–Haenszel estimate of a common odds ratio and its confidence interval, and the Breslow–Day statistic. FREQ uses the

TABLE A.9 SAS Code for CMH Analysis of Clinical Trial Data in Table 6.9

data crab;
input center $ treat response count @@;
datalines;
a 1 1 11 a 1 2 25 a 2 1 10 a 2 2 27
...
h 1 1 4 h 1 2 2 h 2 1 6 h 2 2 1
;
proc freq; weight count;
tables center*treat*response / cmh chisq;

two rightmost variables in the TABLES statement as the rows and columns for each partial table; the CHISQ option yields chi-square tests of independence for each partial table. For I × 2 tables the TREND keyword in the TABLES statement provides the Cochran–Armitage trend test.

Exact conditional logistic regression is available in PROC LOGISTIC with the EXACT statement. It provides ordinary and mid-P-values as well as confidence limits for each model parameter and the corresponding odds ratio with the ESTIMATE = BOTH option. One can also conduct the exact conditional version of the Cochran–Armitage test using the TREND option in the EXACT statement with PROC FREQ. Version 9 of SAS will include asymptotic conditional logistic regression, using a STRATA statement to indicate the stratification parameters to be conditioned out. One can also use PROC PHREG to do this (Stokes et al. 2000). Models with probit and complementary log-log (CLOGLOG) links are available with PROC GENMOD, PROC LOGISTIC, or PROC PROBIT. O'Brien (1986) gave a SAS macro for computing powers using the noncentral chi-squared distribution.

Other: LogXact provides exact conditional logistic regression and StatXact provides exact inference about the odds ratio in 2 × 2 × K tables. PASS (NCSS Statistical Software)
provides power analyses.

Chapter 7: Multinomial Response Models

PROC LOGISTIC fits baseline-category logit models (as of Version 8.2) using the LINK = GLOGIT option. The final response category is the default baseline for the logits. Exact inference is also available using the conditional distribution to eliminate nuisance parameters. PROC CATMOD also fits baseline-category logit models, as Table A.10 shows. CATMOD codes estimates for a factor so that they sum to zero. The PRED = PROB and PRED = FREQ options provide predicted probabilities and fitted values and their standard errors. The POPULATION statement provides the

TABLE A.10 SAS Code for Baseline-Category Logit Models with Alligator Data in Table 7.1

data gator;
input lake gender size food count @@;
datalines;
1 1 1 1 7 1 1 1 2 1 1 1 1 3 0 1 1 1 4 0 1 1 1 5 5
...
4 2 2 1 8 4 2 2 2 1 4 2 2 3 0 4 2 2 4 0 4 2 2 5 1
;
proc logistic; freq count;
class lake size / param = ref;
model food(ref = '1') = lake size / link = glogit aggregate scale = none;
proc catmod; weight count;
population lake size gender;
model food = lake size / pred = freq pred = prob;

variables that define the predictor settings. For instance, with "gender" in that statement, the model with lake and size effects is fitted to the full table also classified by gender.
GENMOD uses LINK s CPROBIT for the cumulative probit model and LINK s CCLL for the cumulative complementary log-log model. Table A.11 uses LINK s PROBIT in LOGISTIC to fit a cumulative probit model. TABLE A.11 SAS Code for Cumulative Logit and Probit Models with Mental Impairment Data in Table 7.5 data impair; input mental ses life; datalines; 1 1 1 ⭈⭈⭈ 4 0 9 ; proc genmod ; model mental = life ses / dist = multinomial link = clogit lrci type3; proc logistic; model mental = life ses / link = probit; 642 USING COMPUTER SOFTWARE TO ANALYZE CATEGORICAL DATA TABLE A.12 SAS Code for Adjacent-Categories Logit and Mean Response Models and CMH Analysis of Job Satisfaction Data in Table 7.8 data jobsat; input gender income satisf count @@; count2 = count + .01; datalines; 1 1 1 1 1 1 2 3 1 1 3 11 1 1 4 2 ... 0 4 1 0 0 4 2 1 0 4 3 9 0 4 4 6 ; proc catmod order = data; * ML analysis of adj- cat logit (ACL) model; weight count; population gender income; model satisf = (1 0 0 3 3, 0 1 0 2 2, 0 0 1 1 1, 1 0 0 6 3, 0 1 0 4 2, 0 0 1 2 1, 1 0 0 9 3, 0 1 0 6 2, 0 0 1 3 1, 1 0 0 12 3, 0 1 0 8 2, 0 0 1 4 1, 1 0 0 3 0, 0 1 0 2 0, 0 0 1 1 0, 1 0 0 6 0, 0 1 0 4 0, 0 0 1 2 0, 1 0 0 9 0, 0 1 0 6 0, 0 0 1 3 0, 1 0 0 12 0, 0 1 0 8 0, 0 0 1 4 0) /ml pred = freq; proc catmod order = data; weight count2; * WLS analysis of ACL model; response alogits; population gender income; direct gender income; model satisf =  response gender income; proc catmod; weight count; * mean response model; population gender income; response mean; direct gender income; model satisf = gender income / covb; proc freq; weight count; tables gender*income*satisf/ cmh scores = table; One can fit adjacent-categories logit models in CATMOD by fitting equivalent baseline-category logit models. Table A.12 uses it for Table 7.8, where each line of code in the model statement specifies the predictor values Žfor the three intercepts, income, and gender. for the three logits. 
The income and gender predictor values are multiplied by 3 for the first logit, 2 for the second, and 1 for the third, to make effects comparable in the two models. PROC CATMOD has options (CLOGITS and ALOGITS) for fitting cumulative logit and adjacent-categories logit models to ordinal responses; however, those options provide weighted least squares (WLS) rather than ML fits. A constant must be added to empty cells for WLS to run. CATMOD treats zero counts as structural zeros, so they must be replaced by small constants when they are actually sampling zeros. The DIRECT statements identify predictors treated as quantitative. The second analysis in Table A.12 uses the ALOGITS option. CATMOD can also fit mean response models using WLS, as the third analysis in Table A.12 shows.

With the CMH option, PROC FREQ provides the generalized CMH tests of conditional independence. The statistic for the "general association" alternative treats X and Y as nominal [statistic (7.20)], the statistic for the "row mean scores differ" alternative treats X as nominal and Y as ordinal, and the statistic for the "nonzero correlation" alternative treats X and Y as ordinal [statistic (7.21)]. Table A.12 analyzes Table 7.8, using scores (1, 2, 3, 4) for each variable.

PROC MDC fits multinomial discrete choice models, with logit and probit links. One can also use PROC PHREG, which is designed for the Cox proportional hazards model for survival analysis, because the partial likelihood for that analysis has the same form as the likelihood for the multinomial model (Allison 1999, Chap. 7; Chen and Kuo 2001).

Other: LogXact provides exact conditional analyses for baseline-category logit models. Joseph Lang (jblang@stat.uiowa.edu) has an R function that can fit mean response models by ML.

Chapters 8 and 9: Loglinear Models

Table A.13 uses GENMOD to fit model (AC, AM, CM) to Table 8.3. Table A.14 uses GENMOD for table raking of Table 8.15.
Table A.15 uses GENMOD to fit the linear-by-linear association model (9.6) and the row effects model (9.8) to Table 9.3 (with column scores 1, 2, 4, 5). The defined

TABLE A.13 SAS Code for Fitting Loglinear Models to Drug Survey Data in Table 8.3

data drugs;
input a c m count @@;
datalines;
1 1 1 911 1 1 2 538 1 2 1 44 1 2 2 456
2 1 1 3 2 1 2 43 2 2 1 2 2 2 2 279
;
proc genmod; class a c m;
model count = a c m a*m a*c c*m / dist = poi link = log lrci type3 obstats;

TABLE A.14 SAS Code for Raking Table 8.15

data rake;
input school atti count @@;
log_c = log(count);
pseudo = 100 / 3;
datalines;
1 1 209 1 2 101 1 3 237
...
;
proc genmod; class school atti;
model pseudo = school atti / dist = poi link = log offset = log_c obstats;

TABLE A.15 SAS Code for Fitting Association Models to GSS Data in Table 9.3

data sex;
input premar birth u v count @@;
assoc = u*v;
datalines;
1 1 1 1 38 1 2 1 2 60 1 3 1 4 68 1 4 1 5 81
...
;
proc genmod; class premar birth;
model count = premar birth assoc / dist = poi link = log;
proc genmod; class premar birth;
model count = premar birth premar*v / dist = poi link = log;

variable "assoc" represents the cross-product of row and column scores, which has the β parameter as coefficient in model (9.6).

Table A.6 uses GENMOD to fit the Poisson regression model with log link for the grouped data of Table 5.2. It models the total number of satellites at each width level (variable "satell"), using the log of the number of cases as offset. Correspondence analysis is available with PROC CORRESP.

Other: Prof. Joseph Lang (jblang@stat.uiowa.edu) has R and S-Plus functions for ML fitting of the generalized loglinear model (8.18). Becker (1990) gave a FORTRAN program that fits the RC(M) model.

Chapter 10: Models for Matched Pairs

Table A.16 analyzes Table 10.1.
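The McNemar analysis that Table A.16 requests with the AGREE option and EXACT MCNEM can be sketched in Python (a statsmodels stand-in, not part of the book's SAS material):

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Matched-pairs data of Table 10.1, as entered in Table A.16:
# rows = first response, columns = second response
table = np.array([[794, 150],
                  [86, 570]])

# Large-sample McNemar chi-squared (no continuity correction),
# like the statistic the AGREE option reports
result = mcnemar(table, exact=False, correction=False)
print(f"X^2 = {result.statistic:.2f}, P = {result.pvalue:.4g}")

# Small-sample binomial version, as with EXACT MCNEM
exact_result = mcnemar(table, exact=True)
print(f"exact P = {exact_result.pvalue:.4g}")
```

The chi-squared statistic is (b − c)²/(b + c) computed from the two off-diagonal counts, which makes the output easy to check by hand.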
For square tables, the AGREE option in PROC FREQ provides the McNemar chi-squared statistic for binary matched pairs, the X² test of fit of the symmetry model (also called Bowker's test),

TABLE A.16 SAS Code for McNemar's Test and Comparing Proportions for Matched Samples in Table 10.1

data matched;
input first second count @@;
datalines;
1 1 794 1 2 150 2 1 86 2 2 570
;
proc freq; weight count;
tables first*second / agree;
exact mcnem;
proc catmod; weight count;
response marginals;
model first*second = (1 0, 1 1);

TABLE A.17 SAS Code for Testing Marginal Homogeneity with Migration Data in Table 10.6

data migrate;
input then $ now $ count m11 m12 m13 m21 m22 m23 m31 m32 m33 m44 m1 m2 m3;
datalines;
ne ne 11607 1 0 0 0 0 0 0 0 0 0 0 0 0
ne mw 100 0 1 0 0 0 0 0 0 0 0 0 0 0
ne s 366 0 0 1 0 0 0 0 0 0 0 0 0 0
ne w 124 -1 -1 -1 0 0 0 0 0 0 0 1 0 0
mw ne 87 0 0 0 1 0 0 0 0 0 0 0 0 0
mw mw 13677 0 0 0 0 1 0 0 0 0 0 0 0 0
mw s 515 0 0 0 0 0 1 0 0 0 0 0 0 0
mw w 302 0 0 0 -1 -1 -1 0 0 0 0 0 1 0
s ne 172 0 0 0 0 0 0 1 0 0 0 0 0 0
s mw 225 0 0 0 0 0 0 0 1 0 0 0 0 0
s s 17819 0 0 0 0 0 0 0 0 1 0 0 0 0
s w 270 0 0 0 0 0 0 -1 -1 -1 0 0 0 1
w ne 63 -1 0 0 -1 0 0 -1 0 0 0 1 0 0
w mw 176 0 -1 0 0 -1 0 0 -1 0 0 0 1 0
w s 286 0 0 -1 0 0 -1 0 0 -1 0 0 0 1
w w 10192 0 0 0 0 0 0 0 0 0 1 0 0 0
;
proc genmod;
model count = m11 m12 m13 m21 m22 m23 m31 m32 m33 m44 m1 m2 m3 / dist = poi link = identity;
proc catmod; weight count;
response marginals;
model then*now = _response_ / freq;
repeated time 2;

and Cohen's kappa and weighted kappa with SE values. The MCNEM keyword in the EXACT statement provides a small-sample binomial version of McNemar's test. PROC CATMOD can provide the confidence interval for the difference of proportions. The code forms a model for the marginal proportions in the first row and the first column, specifying a model matrix in the model statement that has an intercept parameter (the first column)
that applies to both proportions and a slope parameter that applies only to the second; hence the second parameter is the difference between the second and first marginal proportions. PROC LOGISTIC can conduct conditional logistic regression.

Table A.17 shows ways of testing marginal homogeneity for Table 10.6. The GENMOD code shows the Lipsitz et al. (1990) approach, expressing the I² expected frequencies in terms of parameters for the (I − 1)² cells in the first I − 1 rows and I − 1 columns, the cell in the last row and last column, and I − 1 marginal totals (which are the same for rows and columns). Here, m11 denotes expected frequency μ11, m1 denotes μ1+ = μ+1, and so on. This parameterization uses formulas such as μ14 = μ1+ − μ11 − μ12 − μ13 for terms in the last column or last row. CATMOD provides the Bhapkar test (10.16) of marginal homogeneity, as shown.

TABLE A.18 SAS Code Showing Square-Table Analysis of Table 10.5

data sex;
input premar extramar symm qi count @@;
unif = premar*extramar;
datalines;
1 1 1 1 144 1 2 2 5 2 1 3 3 5 0 1 4 4 5 0
2 1 2 5 33 2 2 5 2 4 2 3 6 5 2 2 4 7 5 0
3 1 3 5 84 3 2 6 5 14 3 3 8 3 6 3 4 9 5 1
4 1 4 5 126 4 2 7 5 29 4 3 9 5 25 4 4 10 4 5
;
proc genmod; class symm;
model count = symm / dist = poi link = log; * symmetry;
proc genmod; class extramar premar symm;
model count = symm extramar premar / dist = poi link = log; * QS;
proc genmod; class symm;
model count = symm extramar premar / dist = poi link = log; * ordinal QS;
proc genmod; class extramar premar qi;
model count = extramar premar qi / dist = poi link = log; * quasi indep;
proc genmod; class extramar premar;
model count = extramar premar unif / dist = poi link = log;
data sex2;
input score below above @@;
trials = below + above;
datalines;
1 33 2 1 14 2 1 25 1 2 84 0 2 29 0 3 126 0
;
proc genmod data = sex2;
model above / trials = score / dist = bin link = logit noint;
proc genmod data = sex2;
model above
/ trials = / dist = bin link = logit noint;
proc genmod data = sex2;
model above / trials = / dist = bin link = logit;

Table A.18 shows various square-table analyses of Table 10.5. The "symm" factor indexes the pairs of cells that have the same association terms in the symmetry and quasi-symmetry models. For instance, "symm" takes the same value for cells (1, 2) and (2, 1). Including this term as a factor in a model invokes a parameter λij satisfying λij = λji. The first model fits this factor alone, providing the symmetry model. The second model looks like the third except that it identifies "premar" and "extramar" as class variables (for quasi-symmetry), whereas the third model statement does not (for ordinal quasi-symmetry). The fourth model fits quasi-independence. The "qi" factor invokes the δi parameters. It takes a separate level for each cell on the main diagonal and a common value for all other cells. The fifth model fits the quasi-uniform association model (10.29).

The bottom of Table A.18 fits square-table models as logit models. The pairs of cell counts (nij, nji), labeled as "above" and "below" with reference to the main diagonal, are six sets of binomial counts. The variable defined as "score" is the distance (uj − ui) = j − i. The first two cases are symmetry

TABLE A.19 SAS Code for Fitting Bradley–Terry Model to Table 10.10

data baseball;
input wins games milw detr toro newy bost clev balt;
datalines;
7 13 1 -1 0 0 0 0 0
...
6 13 0 0 0 0 0 1 -1
;
proc genmod;
model wins / games = milw detr toro newy bost clev balt / dist = bin link = logit noint covb;
Table A.19 uses GENMOD for logit fitting of the Bradley–Terry model to Table 10.10 by forming an artificial explanatory variable for each team. For a given observation, the variable for team i is 1 if it wins, −1 if it loses, and 0 if it is not one of the teams for that match. Each observation lists the number of wins ("wins") for the team with variate-level equal to 1 out of the number of games ("games") against the team with variate-level equal to −1. The model has these artificial variates, one of which is redundant, as explanatory variables with no intercept term. The COVB option provides the estimated covariance matrix of parameter estimators.

Chapter 11: Analyzing Repeated Categorical Response Data

Table A.20 uses GENMOD for the likelihood-ratio test of marginal homogeneity for Table 11.1, where for instance m11p denotes μ₁₁₊. The marginal homogeneity model expresses the eight cell expected frequencies in terms of

TABLE A.20 SAS Code for Testing Marginal Homogeneity with Crossover Study of Table 11.1

data crossover;
input a b c count m111 m11p m1p1 mp11 m1pp m222 @@;
datalines;
1 1 1 6  1 0 0 0 0 0      1 1 2 16  -1 1 0 0 0 0
1 2 1 2  -1 0 1 0 0 0     1 2 2 4   1 -1 -1 0 1 0
2 1 1 2  -1 0 0 1 0 0     2 1 2 4   1 -1 0 -1 1 0
2 2 1 6  1 0 -1 -1 1 0    2 2 2 6   0 0 0 0 0 1
;
proc genmod;
model count = m111 m11p m1p1 mp11 m1pp m222 / dist = poi link = identity;
proc catmod;
weight count; response marginals;
model a*b*c = _response_ / freq;
repeated drug 3;

TABLE A.21 SAS Code for Marginal Modeling of Depression Data in Table 11.2

data depress;
input case diagnose drug time outcome @@; * outcome = 1 is normal;
datalines;
1 0 0 0 1   1 0 0 1 1   1 0 0 2 1
...
340 1 1 0 0   340 1 1 1 0   340 1 1 2 0
;
proc genmod descending; class case;
model outcome = diagnose drug time drug*time / dist = bin link = logit type3;
repeated subject = case / type = exch corrw;
proc nlmixed qpoints = 200;
parms alpha = -.03 beta1 = -1.3 beta2 = -.06 beta3
= .48 beta4 = 1.02 sigma = .066;
eta = alpha + beta1*diagnose + beta2*drug + beta3*time + beta4*drug*time + u;
p = exp(eta) / (1 + exp(eta));
model outcome ~ binary(p);
random u ~ normal(0, sigma*sigma) subject = case;

TABLE A.22 SAS Code for GEE and Random Intercept Cumulative Logit Analysis of Insomnia Data in Table 11.4

data francom;
input case treat time outcome @@;
datalines;
1 1 0 1   1 1 1 1
...
239 0 0 4   239 0 1 4
;
proc genmod; class case;
model outcome = treat time treat*time / dist = multinomial link = clogit;
repeated subject = case / type = indep corrw;
proc nlmixed qpoints = 40;
bounds i2 > 0; bounds i3 > 0;
eta1 = i1 + treat*beta1 + time*beta2 + treat*time*beta3 + u;
eta2 = i1 + i2 + treat*beta1 + time*beta2 + treat*time*beta3 + u;
eta3 = i1 + i2 + i3 + treat*beta1 + time*beta2 + treat*time*beta3 + u;
p1 = exp(eta1) / (1 + exp(eta1));
p2 = exp(eta2) / (1 + exp(eta2)) - exp(eta1) / (1 + exp(eta1));
p3 = exp(eta3) / (1 + exp(eta3)) - exp(eta2) / (1 + exp(eta2));
p4 = 1 - exp(eta3) / (1 + exp(eta3));
ll = y1*log(p1) + y2*log(p2) + y3*log(p3) + y4*log(p4);
model y1 ~ general(ll);
estimate 'interc2' i1 + i2; * this is alpha_2 in model, and i1 is alpha_1;
estimate 'interc3' i1 + i2 + i3; * this is alpha_3 in model;
random u ~ normal(0, sigma*sigma) subject = case;

μ₁₁₁, μ₁₁₊, μ₁₊₁, μ₊₁₁, μ₁₊₊, and μ₂₂₂ (since μ₊₁₊ = μ₊₊₁ = μ₁₊₊). Note, for instance, that μ₁₁₂ = μ₁₁₊ − μ₁₁₁ and μ₁₂₂ = μ₁₁₁ + μ₁₊₊ − μ₁₁₊ − μ₁₊₁. CATMOD provides the generalized Bhapkar test (11.5) of marginal homogeneity. Table A.21 uses GENMOD to analyze Table 11.2 using GEE. Possible working correlation structures are TYPE = EXCH for exchangeable, TYPE = AR for autoregressive, TYPE = INDEP for independence, and TYPE = UNSTR for unstructured. Output shows estimates and standard errors under the naive working correlation and based on the sandwich matrix incorporating the empirical dependence.
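The naive-versus-sandwich contrast can be illustrated outside SAS. A minimal Python sketch (numpy only, using simulated clustered binary data rather than Table 11.2) computes both standard errors for an intercept-only binary GEE with independence working correlation, where the sandwich has the scalar form A⁻¹BA⁻¹:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated clustered binary data (NOT Table 11.2): a shared cluster
# effect induces strong positive within-cluster correlation.
n_clusters, m = 200, 5
effect = rng.normal(0.0, 2.0, size=n_clusters)
y = (rng.normal(effect[:, None], 1.0, size=(n_clusters, m)) > 0).astype(float)

# Intercept-only binary GEE with independence working correlation:
# the estimating equation sum(y - p) = 0 gives p_hat = overall mean.
p_hat = y.mean()
beta_hat = np.log(p_hat / (1.0 - p_hat))

A = y.size * p_hat * (1.0 - p_hat)     # model-based ("naive") information
resid = y.sum(axis=1) - m * p_hat      # per-cluster residual totals
B = np.sum(resid ** 2)                 # empirical "meat" of the sandwich

se_naive = 1.0 / np.sqrt(A)            # ignores within-cluster dependence
se_robust = np.sqrt(B) / A             # sandwich A^{-1} B A^{-1}
print(round(se_naive, 4), round(se_robust, 4))
```

With positive within-cluster correlation, the sandwich standard error exceeds the naive one, which is the pattern GENMOD's GEE output displays for exchangeable data.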
Alternatively, the working association structure in the binary case can use the log odds ratio (e.g., using LOGOR = EXCH for exchangeability). The TYPE3 option in GEE provides score tests about effects. See Stokes et al. (2000, Sec. 15.11) for the use of GEE with missing data. Table A.22 uses GENMOD to implement GEE for a cumulative logit model for Table 11.4. For multinomial responses, independence is currently the only working correlation structure.

Other

Joseph Lang (jblang@stat.uiowa.edu) has R and S-Plus functions for ML fitting of marginal models through the generalized loglinear model (11.8), using the constraint approach with Lagrange multipliers. The program MAREG (Kastner et al. 1997) provides GEE fitting and ML fitting of marginal models with the Fitzmaurice and Laird (1993) approach, allowing multicategory responses. See www.stat.uni-muenchen.de/~andreas/mareg/winmareg.html.

Chapter 12: Random Effects: Generalized Linear Mixed Models

PROC NLMIXED extends GLMs to GLMMs by including random effects. Table A.23 analyzes the matched pairs model (12.3). Table A.24 analyzes the election data in Table 12.2.
TABLE A.23 SAS Code for Fitting Model (12.3) for Matched Pairs to Table 12.1

data matched;
input case occasion response count @@;
datalines;
1 0 1 794   1 1 1 794
2 0 1 150   2 1 0 150
3 0 0 86    3 1 1 86
4 0 0 570   4 1 0 570
;
proc nlmixed;
eta = alpha + beta*occasion + u;
p = exp(eta) / (1 + exp(eta));
model response ~ binary(p);
random u ~ normal(0, sigma*sigma) subject = case;
replicate count;

TABLE A.24 SAS Code for GLMM Analysis of Election Data in Table 12.2

data vote;
input y n;
case = _n_;
datalines;
1 5
16 32
...
1 4
;
proc nlmixed;
eta = alpha + u;
p = exp(eta) / (1 + exp(eta));
model y ~ binomial(n, p);
random u ~ normal(0, sigma*sigma) subject = case;
predict p out = new;
proc print data = new;

TABLE A.25 SAS Code for GLMM Modeling of Opinions in Table 10.13

data new;
input sex poor single any count;
datalines;
1 1 1 1 342
...
2 0 0 0 457
;
data new; set new;
sex = sex - 1; case = _n_;
q1 = 1; q2 = 0; resp = poor; output;
q1 = 0; q2 = 1; resp = single; output;
q1 = 0; q2 = 0; resp = any; output;
drop poor single any;
proc nlmixed qpoints = 50;
parms alpha = 0 beta1 = .8 beta2 = .3 gamma = 0 sigma = 8.6;
eta = alpha + beta1*q1 + beta2*q2 + gamma*sex + u;
p = exp(eta) / (1 + exp(eta));
model resp ~ binary(p);
random u ~ normal(0, sigma*sigma) subject = case;
replicate count;

TABLE A.26 SAS Code for GLMM for Leading Crowd Data in Table 12.8

data crowd;
input mem1 att1 mem2 att2 count;
datalines;
1 1 1 1 458
...
0 0 0 0 554
;
data new; set crowd;
case = _n_;
x1m = 1; x1a = 0; x2m = 0; x2a = 0; var = 1; resp = mem1; output;
x1m = 0; x1a = 1; x2m = 0; x2a = 0; var = 0; resp = att1; output;
x1m = 0; x1a = 0; x2m = 1; x2a = 0; var = 1; resp = mem2; output;
x1m = 0; x1a = 0; x2m = 0; x2a = 1; var = 0; resp = att2; output;
drop mem1 att1 mem2 att2;
proc nlmixed data = new;
eta = beta1m*x1m + beta1a*x1a + beta2m*x2m + beta2a*x2a + um*var + ua*(1 - var);
p = exp(eta) / (1 +
exp(eta));
model resp ~ binary(p);
random um ua ~ normal([0,0], [s1*s1, cov12, s2*s2]) subject = case;
replicate count;
estimate 'mem change' beta2m - beta1m;
estimate 'att change' beta2a - beta1a;

Table A.25 fits model (12.10) to Table 10.13. This shows how to set initial values and set the number of quadrature points for Gauss–Hermite quadrature (e.g., QPOINTS=). One could let SAS fit without initial values but then take that fit as initial values in further runs, increasing QPOINTS until estimates and standard errors converge to the necessary precision. Table A.21 uses NLMIXED for Table 11.2. Table A.22 uses NLMIXED for ordinal modeling of Table 11.4, defining a general multinomial log likelihood. Table A.26 shows a correlated bivariate random effect analysis of Table 12.8. Agresti et al. (2000) showed NLMIXED examples for clustered data, Agresti and Hartzel (2000) showed code for multicenter trials such as Table 12.5, and Hartzel et al. (2001a) showed code for multicenter trials with an ordinal response. The Web site for the journal Statistical Modelling shows NLMIXED code for an adjacent-categories logit model and a nominal model at the data archive for Hartzel et al. (2001b). Chen and Kuo (2001) discussed fitting multinomial logit models, including discrete-choice models, with random effects.

Other

MLn (Institute of Education, London) and HLM (Scientific Software, Chicago) fit multilevel models.
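The advice above to increase QPOINTS until estimates and standard errors stabilize can be mimicked numerically. A small Python sketch (numpy only, with illustrative parameter values not taken from the book) doubles the number of Gauss–Hermite nodes until successive approximations of a logistic-normal integral agree to a tolerance:

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def logistic_normal_mean(alpha, sigma, n_quad):
    # E[1 / (1 + exp(-(alpha + u)))] with u ~ N(0, sigma^2), Gauss-Hermite
    nodes, weights = hermgauss(n_quad)
    u = np.sqrt(2.0) * sigma * nodes
    return float(np.sum(weights / np.sqrt(np.pi) /
                        (1.0 + np.exp(-(alpha + u)))))

# Double the node count until successive estimates stabilize,
# mimicking "increase QPOINTS until results converge".
alpha, sigma, tol = 0.5, 2.0, 1e-8   # illustrative values
prev, converged = None, False
for n_quad in (10, 20, 40, 80, 160, 320, 640):
    cur = logistic_normal_mean(alpha, sigma, n_quad)
    if prev is not None and abs(cur - prev) < tol:
        converged = True
        break
    prev = cur
print(n_quad, round(cur, 8))
```

Larger random-effects standard deviations need more nodes, which is why the models with large sigma in these tables request high QPOINTS values.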
MIXOR is a FORTRAN program for ML

TABLE A.27 SAS Code for Overdispersion Analysis of Table 4.5

data moore;
input litter group n y @@;
z2 = 0; z3 = 0; z4 = 0;
if group = 2 then z2 = 1;
if group = 3 then z3 = 1;
if group = 4 then z4 = 1;
datalines;
1 1 10 1    2 1 11 4    3 1 12 9    4 1 4 4
...
55 4 14 1   56 4 8 0    57 4 6 0    58 4 17 0
;
proc logistic;
model y / n = z2 z3 z4 / scale = williams;
proc logistic;
model y / n = z2 z3 z4 / scale = pearson;
proc nlmixed qpoints = 200;
eta = alpha + beta2*z2 + beta3*z3 + beta4*z4 + u;
p = exp(eta) / (1 + exp(eta));
model y ~ binomial(n, p);
random u ~ normal(0, sigma*sigma) subject = litter;

TABLE A.28 SAS Code for Fitting Models to Murder Data in Table 13.6

data new;
input white black other response;
datalines;
1070 119 55 0
60 16 5 1
...
1 0 0 6
;
data new; set new;
count = white; race = 0; output;
count = black; race = 1; output;
drop white black other;
data new2; set new;
do i = 1 to count; output; end;
drop i;
proc genmod data = new2;
model response = race / dist = negbin link = log;
proc genmod data = new2;
model response = race / dist = poi link = log scale = pearson;
data new; set new;
case = _n_;
proc nlmixed data = new qpoints = 400;
parms alpha = -3.7 beta = 1.90 sigma = 1.6;
eta = alpha + beta*race + u;
mu = exp(eta);
model response ~ poisson(mu);
random u ~ normal(0, sigma*sigma) subject = case;
replicate count;

fitting of binary and ordinal random effects models available from Don Hedeker (www.uic.edu/~hedeker/mix.html).

Chapter 13: Other Mixture Models for Categorical Data

PROC LOGISTIC provides two overdispersion approaches for binary data. The SCALE = WILLIAMS option uses a variance function of the beta-binomial form (13.10), and SCALE = PEARSON uses the scaled binomial variance (13.11). Table A.27 illustrates for Table 4.5. That table also uses NLMIXED for adding litter random intercepts.
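The SCALE = PEARSON idea can be sketched directly. The Python fragment below (numpy only, with hypothetical litter counts since Table 4.5 is not reproduced here) fits a common success probability, computes the Pearson statistic against the binomial variances, and inflates the model-based standard error by the square root of X²/df:

```python
import numpy as np

# Hypothetical litter data (n_i trials, y_i successes); the real
# Table 4.5 counts are not reproduced in this sketch.
n = np.array([10, 11, 12, 4, 10, 9, 11, 10], dtype=float)
y = np.array([1, 4, 9, 4, 10, 9, 2, 1], dtype=float)

# Intercept-only binomial fit: common success probability
pi_hat = y.sum() / n.sum()

# Pearson statistic against the fitted binomial variances
X2 = np.sum((y - n * pi_hat) ** 2 / (n * pi_hat * (1 - pi_hat)))
df = len(n) - 1                 # one fitted parameter
phi = X2 / df                   # the SCALE=PEARSON inflation factor

# Model-based SE of logit(pi_hat), then the quasi-likelihood SE
se_naive = 1.0 / np.sqrt(n.sum() * pi_hat * (1 - pi_hat))
se_scaled = np.sqrt(phi) * se_naive
print(round(phi, 2), round(se_naive, 3), round(se_scaled, 3))
```

When the clusters are overdispersed relative to the binomial, phi exceeds 1 and the scaled standard errors are correspondingly wider, which is the adjustment PROC LOGISTIC applies to every coefficient.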
For Table 13.6, Table A.28 uses GENMOD to fit a negative binomial model and a quasi-likelihood model with scaled Poisson variance using the Pearson statistic, and NLMIXED to fit a Poisson GLMM. PROC NLMIXED can also fit negative binomial models.

Other

Latent GOLD (developed by J. Vermunt and J. Magidson for Statistical Innovations, Belmont, Massachusetts) can fit a wide variety of mixture models, including latent class models, nonparametric mixtures of logistic regression, and some Rasch mixture models.

APPENDIX B

Chi-Squared Distribution Values

Right-Tailed Probability
df     0.250    0.100    0.050    0.025    0.010    0.005    0.001
1      1.32     2.71     3.84     5.02     6.63     7.88     10.83
2      2.77     4.61     5.99     7.38     9.21     10.60    13.82
3      4.11     6.25     7.81     9.35     11.34    12.84    16.27
4      5.39     7.78     9.49     11.14    13.28    14.86    18.47
5      6.63     9.24     11.07    12.83    15.09    16.75    20.52
6      7.84     10.64    12.59    14.45    16.81    18.55    22.46
7      9.04     12.02    14.07    16.01    18.48    20.28    24.32
8      10.22    13.36    15.51    17.53    20.09    21.96    26.12
9      11.39    14.68    16.92    19.02    21.67    23.59    27.88
10     12.55    15.99    18.31    20.48    23.21    25.19    29.59
11     13.70    17.28    19.68    21.92    24.72    26.76    31.26
12     14.85    18.55    21.03    23.34    26.22    28.30    32.91
13     15.98    19.81    22.36    24.74    27.69    29.82    34.53
14     17.12    21.06    23.68    26.12    29.14    31.32    36.12
15     18.25    22.31    25.00    27.49    30.58    32.80    37.70
16     19.37    23.54    26.30    28.85    32.00    34.27    39.25
17     20.49    24.77    27.59    30.19    33.41    35.72    40.79
18     21.60    25.99    28.87    31.53    34.81    37.16    42.31
19     22.72    27.20    30.14    32.85    36.19    38.58    43.82
20     23.83    28.41    31.41    34.17    37.57    40.00    45.32
25     29.34    34.38    37.65    40.65    44.31    46.93    52.62
30     34.80    40.26    43.77    46.98    50.89    53.67    59.70
40     45.62    51.80    55.76    59.34    63.69    66.77    73.40
50     56.33    63.17    67.50    71.42    76.15    79.49    86.66
60     66.98    74.40    79.08    83.30    88.38    91.95    99.61
70     77.58    85.53    90.53    95.02    100.4    104.2    112.3
80     88.13    96.58    101.8    106.6    112.3    116.3    124.8
90     98.65    107.6    113.1    118.1    124.1    128.3    137.2
100    109.1    118.5    124.3    129.6    135.8    140.2    149.5
Source: Calculated using StaTable, Cytel Software, Cambridge, MA.

References

Adelbasit, K. M., and R. L. Plackett. 1983. Experimental design for binary data. J. Amer. Statist. Assoc. 78: 90–98.
Agresti, A. 1984. Analysis of Ordinal Categorical Data. New York: Wiley.
Agresti, A. 1992. A survey of exact inference for contingency tables. Statist. Sci. 7: 131–153.
Agresti, A. 1993. Computing conditional maximum likelihood estimates for generalized Rasch models using simple loglinear models with diagonal parameters. Scand. J. Statist. 20: 63–71.
Agresti, A. 1997. A model for repeated measurements of a multivariate binary response. J. Amer. Statist. Assoc. 92: 315–321.
Agresti, A. 1999. On logit confidence intervals for the odds ratio with small samples. Biometrics 55: 597–602.
Agresti, A. 2001. Exact inference for categorical data: Recent advances and continuing controversies. Statist. Medic. 20: 2709–2722.
Agresti, A., and B. Caffo. 2000. Simple and effective confidence intervals for proportions and difference of proportions result from adding two successes and two failures. Amer. Statist. 54: 280–288.
Agresti, A., and B. A. Coull. 1998. Approximate is better than exact for interval estimation of binomial parameters. Amer. Statist. 52: 119–126.
Agresti, A., and J. Hartzel. 2000. Strategies for comparing treatments on a binary response with multi-centre data. Statist. Medic. 19(8): 1115–1139.
Agresti, A., and J. Lang. 1993a. A proportional odds model with subject-specific effects for repeated ordered categorical responses. Biometrika 80: 527–534.
Agresti, A., and J. Lang. 1993b. Quasi-symmetric latent class models, with application to rater agreement. Biometrics 49: 131–139.
Agresti, A., and I. Liu. 1999. Modeling a categorical variable allowing arbitrarily many category choices. Biometrics 55: 936–943.
Agresti, A., and Y. Min. 2001.
On small-sample confidence intervals for parameters in discrete distributions. Biometrics 57: 963–971.
Agresti, A., and R. Natarajan. 2001. Modeling clustered ordered categorical data: A survey. Internat. Statist. Rev. 69: 345–371.
Agresti, A., D. Wackerly, and J. Boyett. 1979. Exact conditional tests for cross-classifications: Approximation of attained significance levels. Psychometrika 44: 75–84.
Agresti, A., C. Chuang, and A. Kezouh. 1987. Order-restricted score parameters in association models for contingency tables. J. Amer. Statist. Assoc. 82: 619–623.
Agresti, A., C. R. Mehta, and N. R. Patel. 1990. Exact inference for contingency tables with ordered categories. J. Amer. Statist. Assoc. 85: 453–458.
Agresti, A., J. Booth, J. Hobert, and B. Caffo. 2000. Random-effects modeling of categorical response data. Sociol. Methodol. 30: 27–81.
Aitchison, J., and C. G. G. Aitken. 1976. Multivariate binary discrimination by the kernel method. Biometrika 63: 413–420.
Aitchison, J., and C. H. Cho. 1989. The multivariate Poisson-log normal distribution. Biometrika 76: 643–653.
Aitchison, J., and S. M. Shen. 1980. Logistic-normal distributions: Some properties and uses. Biometrika 67: 261–272.
Aitchison, J., and S. D. Silvey. 1957. The generalization of probit analysis to the case of multiple responses. Biometrika 44: 131–140.
Aitchison, J., and S. D. Silvey. 1958. Maximum likelihood estimation of parameters subject to restraints. Ann. Math. Statist. 29: 813–828.
Aitkin, M. 1979. A simultaneous test procedure for contingency table models. Appl. Statist. 28: 233–242.
Aitkin, M. 1980. A note on the selection of log-linear models. Biometrics 36: 173–178.
Aitkin, M. 1999. A general maximum likelihood analysis of variance components in generalized linear models. Biometrics 55: 117–128.
Aitkin, M., and D. Clayton. 1980. The fitting of exponential, Weibull, and extreme value distributions to complex censored survival data using GLIM. Appl. Statist. 29: 156–163.
Aitkin, M., and M. Stasinopoulos. 1989. Likelihood analysis of a binomial sample size problem. Pp. 399–411 in Contributions to Probability and Statistics: Essays in Honor of Ingram Olkin, ed. L. J. Gleser, M. D. Perlman, S. J. Press, and A. R. Sampson. New York: Springer-Verlag.
Aitkin, M., D. Anderson, and J. Hinde. 1981. Statistical modelling of data on teaching styles. J. Roy. Statist. Soc. Ser. A 144: 419–461.
Aitkin, M., D. Anderson, B. Francis, and J. Hinde. 1989. Statistical Modeling in GLIM. Oxford: Clarendon Press.
Albert, J. H. 1997. Bayesian testing and estimation of association in a two-way contingency table. J. Amer. Statist. Assoc. 92: 685–693.
Albert, A., and J. A. Anderson. 1984. On the existence of maximum likelihood estimates in logistic models. Biometrika 71: 1–10.
Albert, J. H., and S. Chib. 1993. Bayesian analysis of binary and polychotomous response data. J. Amer. Statist. Assoc. 88: 669–679.
Albert, J. H., and A. K. Gupta. 1982. Mixtures of Dirichlet distributions and estimation in contingency tables. Ann. Statist. 10: 1261–1268.
Allison, P. D. 1999. Logistic Regression Using the SAS System. Cary, NC: SAS Institute.
Altham, P. M. E. 1969. Exact Bayesian analysis of a 2 × 2 contingency table and Fisher's "exact" significance test. J. Roy. Statist. Soc. Ser. B 31: 261–269.
Altham, P. M. E. 1970. The measurement of association of rows and columns for an r × s contingency table. J. Roy. Statist. Soc. Ser. B 32: 63–73.
Altham, P. M. E. 1971. The analysis of matched proportions. Biometrika 58: 561–576.
Altham, P. M. E. 1975. Quasi-independent triangular contingency tables. Biometrics 31: 233–238.
Altham, P. M. E. 1978. Two generalizations of the binomial distribution. Appl. Statist. 27: 162–167.
Altham, P. M. E. 1984. Improving the precision of estimation by fitting a model. J. Roy. Statist. Soc. Ser. B 46: 118–119.
Amemiya, T. 1981. Qualitative response models: A survey. J. Econom. Literature 19: 1483–1536.
Andersen, E. B. 1970.
Asymptotic properties of conditional maximum-likelihood estimators. J. Roy. Statist. Soc. Ser. B 32: 283–301.
Andersen, E. B. 1980. Discrete Statistical Models with Social Science Applications. Amsterdam: North-Holland.
Andersen, E. B. 1995. Polytomous Rasch models and their estimation. Pp. 272–291 in Rasch Models: Foundations, Recent Developments, and Applications, eds. G. Fischer and I. Molenaar. New York: Springer-Verlag.
Anderson, J. A. 1972. Separate sample logistic discrimination. Biometrika 59: 19–35.
Anderson, J. A. 1975. Quadratic logistic discrimination. Biometrika 62: 149–154.
Anderson, J. A. 1984. Regression and ordered categorical variables. J. Roy. Statist. Soc. Ser. B 46: 1–30.
Anderson, D. A., and M. Aitkin. 1985. Variance component models with binary response: Interviewer variability. J. Roy. Statist. Soc. Ser. B 47: 203–210.
Anderson, C. J., and U. Böckenholt. 2000. Graphical regression models for polytomous variables. Psychometrika 65: 497–509.
Anderson, T. W., and L. A. Goodman. 1957. Statistical inference about Markov chains. Ann. Math. Statist. 28: 89–110.
Anderson, J. A., and P. R. Philips. 1981. Regression, discrimination, and measurement models for ordered categorical variables. Appl. Statist. 30: 22–31.
Anderson, C. J., and J. K. Vermunt. 2000. Log-multiplicative models as latent variable models for nominal and/or ordinal data. Sociol. Methodol. 30: 81–121.
Aranda-Ordaz, F. J. 1981. On two families of transformations to additivity for binary response data. Biometrika 68: 357–363.
Aranda-Ordaz, F. J. 1983. An extension of the proportional hazards model for grouped data. Biometrics 39: 109–117.
Arminger, G., C. C. Clogg, and T. Cheng. 2000. Regression analysis of multivariate binary response variables using Rasch-type models and finite mixture methods. Sociol. Methodol. 30: 1–26.
Armitage, P. 1955. Tests for linear trends in proportions and frequencies. Biometrics 11: 375–386.
Ashford, J. R., and R. D. Sowden. 1970. Multivariate probit analysis.
Biometrics 26: 535–546.
Asmussen, S., and D. Edwards. 1983. Collapsibility and response variables in contingency tables. Biometrika 70: 567–578.
Azzalini, A. 1994. Logistic regression for autocorrelated data with application to repeated measures. Biometrika 81: 767–775.
Baglivo, J., D. Olivier, and M. Pagano. 1992. Methods for exact goodness-of-fit tests. J. Amer. Statist. Assoc. 87: 464–469.
Baker, S. G. 1992. A simple method for computing the observed information matrix when using the EM algorithm with categorical data. J. Comput. Graph. Statist. 1: 63–76.
Baker, S. G., and N. M. Laird. 1988. Regression analysis for categorical variables with outcome subject to nonignorable nonresponse. J. Amer. Statist. Assoc. 83: 62–69.
Baker, R. J., M. R. B. Clarke, and P. W. Lane. 1985. Zero entries in contingency tables. Comput. Statist. Data Anal. 3: 33–45.
Banerjee, C., M. Capozzoli, L. McSweeney, and D. Sinha. 1999. Beyond kappa: A review of interrater agreement measures. Canad. J. Statist. 27: 3–23.
Baptista, J., and M. C. Pike. 1977. Algorithm AS115: Exact two-sided confidence limits for the odds ratio in a 2 × 2 table. Appl. Statist. 26: 214–220.
Barnard, G. A. 1945. A new test for 2 × 2 tables. Nature 156: 177.
Barnard, G. A. 1947. Significance tests for 2 × 2 tables. Biometrika 34: 123–138.
Barnard, G. A. 1949. Statistical inference. J. Roy. Statist. Soc. Ser. B 11: 115–139.
Barnard, G. A. 1979. In contradiction to J. Berkson's dispraise: Conditional tests can be more efficient. J. Statist. Plann. Inference 3: 181–188.
Barndorff-Nielsen, O. E., and B. Jørgensen. 1991. Some parametric models on the simplex. J. Multivariate Anal. 39: 106–116.
Bartholomew, D. J. 1980. Factor analysis for categorical data. J. Roy. Statist. Soc. Ser. B 42: 293–321.
Bartholomew, D. J., and M. Knott. 1999. Latent Variable Models and Factor Analysis, 2nd ed. London: Edward Arnold.
Bartlett, M. S. 1935. Contingency table interactions. J. Roy. Statist. Soc. Suppl. 2: 248–252.
Bartlett, M. S.
1937. Some examples of statistical methods of research in agriculture and applied biology. J. Roy. Statist. Soc. Suppl. 4: 137–183.
Becker, M. 1989a. Models for the analysis of association in multivariate contingency tables. J. Amer. Statist. Assoc. 84: 1014–1019.
Becker, M. 1989b. On the bivariate normal distribution and association models for ordinal categorical data. Statist. Probab. Lett. 8: 435–440.
Becker, M. 1990. Maximum likelihood estimation of the RC(M) association model. Appl. Statist. 39: 152–167.
Becker, M., and A. Agresti. 1992. Log-linear modelling of pairwise interobserver agreement on a categorical scale. Statist. Medic. 11: 101–114.
Becker, M., and C. C. Clogg. 1989. Analysis of sets of two-way contingency tables using association models. J. Amer. Statist. Assoc. 84: 142–151.
Bedrick, E. J. 1983. Chi-squared tests for cross-classified tables of survey data. Biometrika 70: 591–595.
Bedrick, E. J. 1987. A family of confidence intervals for the ratio of two binomial proportions. Biometrics 43: 993–998.
Begg, C. B., and R. Gray. 1984. Calculation of polytomous logistic regression parameters using individualized regressions. Biometrika 71: 11–18.
Beitler, P. J., and J. R. Landis. 1985. A mixed-effects model for categorical data. Biometrics 41: 991–1000.
Benedetti, J. K., and M. B. Brown. 1978. Strategies for the selection of loglinear models. Biometrics 34: 680–686.
Benichou, J. 1998. Attributable risk. Pp. 216–229 in Encyclopedia of Biostatistics. Chichester, UK: Wiley.
Benzécri, J.-P. 1973. L'Analyse des Données, Vol. 1, La Taxonomie; Vol. 2, L'Analyse des Correspondances. Paris: Dunod.
Berger, R., and D. D. Boos. 1994. p-Values maximized over a confidence set for the nuisance parameter. J. Amer. Statist. Assoc. 89: 1012–1016.
Bergsma, W. P., and T. Rudas. 2002. Marginal models for categorical data. Ann. Statist. 30: 140–159.
Berkson, J. 1938. Some difficulties of interpretation encountered in the application of the chi-square test. J. Amer. Statist. Assoc.
33: 526–536.
Berkson, J. 1944. Application of the logistic function to bio-assay. J. Amer. Statist. Assoc. 39: 357–365.
Berkson, J. 1951. Why I prefer logits to probits. Biometrics 7: 327–339.
Berkson, J. 1953. A statistically precise and relatively simple method of estimating the bioassay with quantal response, based on the logistic function. J. Amer. Statist. Assoc. 48: 565–599.
Berkson, J. 1955. Maximum likelihood and minimum logit χ² estimation of the logistic function. J. Amer. Statist. Assoc. 50: 130–162.
Berkson, J. 1978. In dispraise of the exact test. J. Statist. Plann. Inference 2: 27–42.
Berkson, J. 1980. Minimum chi-square, not maximum likelihood! Ann. Statist. 8: 457–487.
Berry, G., and P. Armitage. 1995. Mid-P confidence intervals: A brief review. The Statistician 44: 417–423.
Bhapkar, V. P. 1966. A note on the equivalence of two test criteria for hypotheses in categorical data. J. Amer. Statist. Assoc. 61: 228–235.
Bhapkar, V. P. 1968. On the analysis of contingency tables with a quantitative response. Biometrics 24: 329–338.
Bhapkar, V. P. 1973. On the comparison of proportions in matched samples. Sankhya Ser. A 35: 341–356.
Bhapkar, V. P. 1989. Conditioning on ancillary statistics and loss of information in the presence of nuisance parameters. J. Statist. Plann. Inference 21: 139–160.
Bhapkar, V. P., and G. G. Koch. 1968. On the hypothesis of "no interaction" in multidimensional contingency tables. Biometrics 24: 567–594.
Bhapkar, V. P., and G. W. Somes. 1977. Distribution of Q when testing equality of matched proportions. J. Amer. Statist. Assoc. 72: 658–661.
Biggeri, A. 1998. Negative binomial distribution. Pp. 2962–2967 in Encyclopedia of Biostatistics. Chichester, UK: Wiley.
Billingsley, P. 1961. Statistical methods in Markov chains. Ann. Math. Statist. 32: 12–40.
Birch, M. W. 1963. Maximum likelihood in three-way contingency tables. J. Roy. Statist. Soc. Ser. B 25: 220–233.
Birch, M. W. 1964a. A new proof of the Pearson–Fisher theorem. Ann.
Math. Statist. 35: 817–824.
Birch, M. W. 1964b. The detection of partial association I: The 2 × 2 case. J. Roy. Statist. Soc. Ser. B 26: 313–324.
Birch, M. W. 1965. The detection of partial association II: The general case. J. Roy. Statist. Soc. Ser. B 27: 111–124.
Bishop, Y. M. M. 1971. Effects of collapsing multidimensional contingency tables. Biometrics 27: 545–562.
Bishop, Y. M. M., and F. Mosteller. 1969. Smoothed contingency table analysis. Chap. IV-3 in The National Halothane Study. Washington, DC: U.S. Government Printing Office.
Bishop, Y. M. M., S. E. Fienberg, and P. W. Holland. 1975. Discrete Multivariate Analysis. Cambridge, MA: MIT Press.
Blaker, H. 2000. Confidence curves and improved exact confidence intervals for discrete distributions. Canad. J. Statist. 28: 783–798.
Bliss, C. I. 1934. The method of probits. Science 79: 38–39.
Bliss, C. I. 1935. The calculation of the dosage–mortality curve. Ann. Appl. Biol. 22: 134–167.
Blyth, C. R. 1972. On Simpson's paradox and the sure-thing principle. J. Amer. Statist. Assoc. 67: 364–366.
Blyth, C. R., and H. A. Still. 1983. Binomial confidence intervals. J. Amer. Statist. Assoc. 78: 108–116.
Bock, R. D. 1970. Estimating multinomial response relations. Pp. 453–479 in Contributions to Statistics and Probability, ed. R. C. Bose. Chapel Hill, NC: University of North Carolina Press.
Bock, R. D., and M. Aitkin. 1981. Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika 46: 443–459.
Bock, R. D., and L. V. Jones. 1968. The Measurement and Prediction of Judgement and Choice. San Francisco: Holden-Day.
Böckenholt, U., and W. Dillon. 1997. Modelling within-subject dependencies in ordinal paired comparison data. Psychometrika 62: 411–434.
Bonney, G. E. 1987. Logistic regression for dependent binary observations. Biometrics 43: 951–973.
Boos, D. D. 1992. On generalized score tests. Amer. Statist. 46: 327–333.
Booth, J., and R. Butler. 1999.
An importance sampling algorithm for exact conditional tests in log-linear models. Biometrika 86: 321–332.
Booth, J. G., and J. P. Hobert. 1998. Standard errors of prediction in generalized linear mixed models. J. Amer. Statist. Assoc. 93: 262–272.
Booth, J. G., and J. P. Hobert. 1999. Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. J. Roy. Statist. Soc. Ser. B 61: 265–285.
Bowker, A. H. 1948. A test for symmetry in contingency tables. J. Amer. Statist. Assoc. 43: 572–574.
Box, J. F. 1978. R. A. Fisher: The Life of a Scientist. New York: Wiley.
Bradley, R. A. 1976. Science, statistics, and paired comparisons. Biometrics 32: 213–240.
Bradley, R. A., and M. E. Terry. 1952. Rank analysis of incomplete block designs I. The method of paired comparisons. Biometrika 39: 324–345.
Breslow, N. 1976. Regression analysis of the log odds ratio: A method for retrospective studies. Biometrics 32: 409–416.
Breslow, N. 1981. Odds ratio estimators when the data are sparse. Biometrika 68: 73–84.
Breslow, N. 1982. Covariance adjustment of relative-risk estimates in matched studies. Biometrics 38: 661–672.
Breslow, N. 1984. Extra-Poisson variation in log-linear models. Appl. Statist. 33: 38–44.
Breslow, N. 1996. Statistics in epidemiology: The case–control study. J. Amer. Statist. Assoc. 91: 14–28.
Breslow, N., and D. G. Clayton. 1993. Approximate inference in generalized linear mixed models. J. Amer. Statist. Assoc. 88: 9–25.
Breslow, N., and N. E. Day. 1980, 1987. Statistical Methods in Cancer Research, Vol. I, The Analysis of Case–Control Studies; Vol. II, The Design and Analysis of Cohort Studies. Lyon: IARC.
Breslow, N., and X. Lin. 1995. Bias correction in generalised linear mixed models with a single component of dispersion. Biometrika 82: 81–91.
Breslow, N., and W. Powers. 1978. Are there two logistic regressions for retrospective studies? Biometrics 34: 100–105.
Breslow, N., N. Day, K. Halvorsen, R. Prentice, and C. Sabai. 1978.
Estimation of multiple relative risk functions in matched case–control studies. Amer. J. Epidemiol. 108: 299–307.
Brier, S. S. 1980. Analysis of contingency tables under cluster sampling. Biometrika 67: 591–596.
Brooks, S. P., B. J. T. Morgan, M. S. Ridout, and S. E. Pack. 1997. Finite mixture models for proportions. Biometrics 53: 1097–1115.
Bross, I. D. J. 1958. How to use ridit analysis. Biometrics 14: 18–38.
Brown, M. B. 1976. Screening effects in multidimensional contingency tables. Appl. Statist. 25: 37–46.
Brown, M. B., and J. K. Benedetti. 1977. Sampling behavior of tests for correlation in two-way contingency tables. J. Amer. Statist. Assoc. 72: 309–315.
Brown, P. J., and P. W. K. Rundell. 1985. Kernel estimates for categorical data. Technometrics 27: 293–299.
Brown, L. D., T. T. Cai, and A. Das Gupta. 2001. Interval estimation for a binomial proportion. Statist. Sci. 16: 101–133.
Brownstone, D., and K. F. Train. 1999. Forecasting new product penetration with flexible substitution patterns. J. Econometrics 89: 109–129.
Bull, S. B., and A. Donner. 1987. The efficiency of multinomial logistic regression compared with multiple group discriminant analysis. J. Amer. Statist. Assoc. 82: 1118–1122.
Burnham, K. P., and D. R. Anderson. 1998. Model Selection and Inference: A Practical Information-Theoretic Approach. New York: Springer-Verlag.
Burnham, K. P., and W. S. Overton. 1978. Estimation of the size of a closed population when capture probabilities vary among animals. Biometrika 65: 625–633.
Burridge, J. 1981. A note on maximum likelihood estimation for regression models using grouped data. J. Roy. Statist. Soc. Ser. B 43: 41–45.
Cameron, A. C., and P. K. Trivedi. 1998. Regression Analysis of Count Data. Cambridge, UK: Cambridge University Press.
Carey, V., S. L. Zeger, and P. Diggle. 1993. Modelling multivariate binary data with alternating logistic regressions. Biometrika 80: 517–526.
Carroll, R. J., S. Wang, and C. Y. Wang. 1995.
Prospective analysis of logistic case–control pairs. J. Amer. Statist. Assoc. 90: 157–169.
Casella, G., and R. Berger. 2001. Statistical Inference, 2nd ed. Pacific Grove, CA: Wadsworth.
Catalano, P. J., and L. M. Ryan. 1992. Bivariate latent variable models for clustered discrete and continuous outcomes. J. Amer. Statist. Assoc. 87: 651–658.
Caussinus, H. 1966. Contribution à l'analyse statistique des tableaux de corrélation. Ann. Fac. Sci. Univ. Toulouse 29: 77–182.
Chaloner, K., and K. Larntz. 1989. Optimal Bayesian design applied to logistic regression experiments. J. Statist. Plann. Inference 21: 191–208.
Chamberlain, G. 1980. Analysis of covariance with qualitative data. Rev. Econ. Stud. 47: 225–238.
Chambers, E. A., and D. R. Cox. 1967. Discrimination between alternative binary response models. Biometrika 54: 573–578.
Chambers, R. L., and D. G. Steel. 2001. Simple methods for ecological inference in 2 × 2 tables. J. Roy. Statist. Soc. Ser. A 164: 175–192.
Chan, I. 1998. Exact tests of equivalence and efficacy with non-zero lower bound for comparative studies. Statist. Medic. 17: 1403–1413.
Chan, J. S. K., and A. Y. C. Kuk. 1997. Maximum likelihood estimation for probit-linear mixed models with correlated random effects. Biometrics 53: 86–97.
Chao, A., P. K. Tsay, S.-H. Lin, W.-Y. Shau, and D.-Y. Chao. 2001. The applications of capture–recapture models to epidemiological data. Statist. Medic. 20: 3123–3157.
Chapman, D. G., and R. C. Meng. 1966. The power of chi-square tests for contingency tables. J. Amer. Statist. Assoc. 61: 965–975.
Chen, Z., and L. Kuo. 2001. A note on the estimation of the multinomial logit model with random effects. Amer. Statist. 55: 89–95.
Christensen, R. 1997. Log-Linear Models and Logistic Regression. New York: Springer-Verlag.
Chuang, C., D. Gheva, and C. Odoroff. 1985. Methods for diagnosing multiplicative-interaction models for two-way contingency tables. Commun. Statist. Ser. A 14: 2057–2080.
Clogg, C. C. 1995. Latent class models. Pp.
311–359 in Handbook of Statistical Modeling for the Social and Behavioral Sciences, ed. G. Arminger and C. C. Clogg. New York: Plenum Press.
Clogg, C. C., and S. R. Eliason. 1987. Some common problems in log-linear analysis. Sociol. Methods Res. 15: 4–44.
Clogg, C. C., and L. A. Goodman. 1984. Latent structure analysis of a set of multidimensional contingency tables. J. Amer. Statist. Assoc. 79: 762–771.
Clogg, C. C., and E. S. Shihadeh. 1994. Statistical Models for Ordinal Variables. Thousand Oaks, CA: Sage Publications.
Clopper, C. J., and E. S. Pearson. 1934. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika 26: 404–413.
Cochran, W. G. 1940. The analysis of variance when experimental errors follow the Poisson or binomial laws. Ann. Math. Statist. 11: 335–347.
Cochran, W. G. 1943. Analysis of variance for percentages based on unequal numbers. J. Amer. Statist. Assoc. 38: 287–301.
Cochran, W. G. 1950. The comparison of percentages in matched samples. Biometrika 37: 256–266.
Cochran, W. G. 1952. The χ² test of goodness-of-fit. Ann. Math. Statist. 23: 315–345.
Cochran, W. G. 1954. Some methods of strengthening the common χ² tests. Biometrics 10: 417–451.
Cochran, W. G. 1955. A test of a linear function of the deviations between observed and expected numbers. J. Amer. Statist. Assoc. 50: 377–397.
Coe, P. R., and A. C. Tamhane. 1993. Small sample confidence intervals for the difference, ratio and odds ratio of two success probabilities. Commun. Statist. Ser. B 22: 925–938.
Cohen, J. 1960. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20: 37–46.
Cohen, J. 1968. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychol. Bull. 70: 213–220.
Cohen, A., and H. B. Sackrowitz. 1991. Tests for independence in contingency tables with ordered alternatives. J. Multivariate Anal. 36: 56–67.
Cohen, A., and H. B. Sackrowitz. 1992.
An evaluation of some tests of trend in contingency tables. J. Amer. Statist. Assoc. 87: 470–475.
Collett, D. 1991. Modelling Binary Data. London: Chapman & Hall.
Conaway, M. R. 1989. Analysis of repeated categorical measurements with conditional likelihood methods. J. Amer. Statist. Assoc. 84: 53–62.
Cook, R. D., and S. Weisberg. 1999. Applied Regression Including Computing and Graphics. New York: Wiley.
Copas, J. B. 1973. Randomization models for the matched and unmatched 2 × 2 tables. Biometrika 60: 467–476.
Copas, J. B. 1983. Plotting p against x. Appl. Statist. 32: 25–31.
Copas, J. B. 1988. Binary regression models for contaminated data. J. Roy. Statist. Soc. Ser. B 50: 225–265.
Corcoran, C., L. Ryan, P. Senchaudhuri, C. Mehta, N. Patel, and G. Molenberghs. 2001. An exact trend test for correlated binary data. Biometrics 57: 941–948.
Cormack, R. M. 1989. Log-linear models for capture–recapture. Biometrics 45: 395–413.
Cornfield, J. 1951. A method of estimating comparative rates from clinical data: Applications to cancer of the lung, breast and cervix. J. Natl. Cancer Inst. 11: 1269–1275.
Cornfield, J. 1956. A statistical problem arising from retrospective studies. In Proc. 3rd Berkeley Symposium on Mathematics, Statistics and Probability, ed. J. Neyman, 4: 135–148.
Cornfield, J. 1962. Joint dependence of risk of coronary heart disease on serum cholesterol and systolic blood pressure: A discriminant function analysis. Fed. Proc. 21, Suppl. 11: 58–61.
Coull, B. A., and A. Agresti. 1999. The use of mixed logit models to reflect heterogeneity in capture–recapture studies. Biometrics 55: 294–301.
Coull, B. A., and A. Agresti. 2000. Random effects modeling of multiple binomial responses using the multivariate binomial logit-normal distribution. Biometrics 56: 73–80.
Cox, C. 1984. An elementary introduction to maximum likelihood estimation for multinomial models: Birch's theorem and the delta method. Amer. Statist. 38: 283–287.
Cox, C. 1995.
Location-scale cumulative odds models for ordinal data: A generalized non-linear model approach. Statist. Medic. 14: 1191–1203.
Cox, C. 1996. Nonlinear quasi-likelihood models: Applications to continuous proportions. Comput. Statist. Data Anal. 21: 449–461.
Cox, D. R. 1958a. The regression analysis of binary sequences. J. Roy. Statist. Soc. Ser. B 20: 215–242.
Cox, D. R. 1958b. Two further applications of a model for binary regression. Biometrika 45: 562–565.
Cox, D. R. 1970. The Analysis of Binary Data (2nd ed. 1989, by D. R. Cox and E. J. Snell). London: Chapman & Hall.
Cox, D. R. 1972. The analysis of multivariate binary data. Appl. Statist. 21: 113–120.
Cox, D. R. 1983. Some remarks on overdispersion. Biometrika 70: 269–274.
Cox, D. R., and D. V. Hinkley. 1974. Theoretical Statistics. London: Chapman & Hall.
Cramér, H. 1946. Mathematical Methods of Statistics. Princeton, NJ: Princeton University Press.
Cressie, N., and T. R. C. Read. 1984. Multinomial goodness-of-fit tests. J. Roy. Statist. Soc. Ser. B 46: 440–464.
Cressie, N., and T. R. C. Read. 1989. Pearson X² and the loglikelihood ratio statistic G²: A comparative review. Internat. Statist. Rev. 57: 19–43.
Croon, M., W. Bergsma, and J. Hagenaars. 2000. Analyzing change in categorical variables by generalized log-linear models. Sociol. Methods Res. 29: 195–229.
Crouchley, R. 1995. A random-effects model for ordered categorical data. J. Amer. Statist. Assoc. 90: 489–498.
Crowder, M. J. 1978. Beta-binomial ANOVA for proportions. Appl. Statist. 27: 34–37.
D'Agostino, R. B., Jr. 1998. Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Statist. Medic. 17: 2265–2281.
Daniels, M. J., and C. Gatsonis. 1999. Hierarchical generalized linear models in the analysis of variations in health care utilization. J. Amer. Statist. Assoc. 94: 29–42.
Dardanoni, V., and A. Forcina. 1998.
A unified approach to likelihood inference on stochastic orderings in a nonparametric context. J. Amer. Statist. Assoc. 93: 1112–1123.
Darroch, J. N. 1962. Interactions in multi-factor contingency tables. J. Roy. Statist. Soc. Ser. B 24: 251–263.
Darroch, J. N. 1981. The Mantel–Haenszel test and tests of marginal symmetry; fixed-effects and mixed models for a categorical response. Internat. Statist. Rev. 49: 285–307.
Darroch, J. N., and P. I. McCloud. 1986. Category distinguishability and observer agreement. Austral. J. Statist. 28: 371–388.
Darroch, J. N., and D. Ratcliff. 1972. Generalized iterative scaling for log-linear models. Ann. Math. Statist. 43: 1470–1480.
Darroch, J. N., S. L. Lauritzen, and T. P. Speed. 1980. Markov fields and log-linear interaction models for contingency tables. Ann. Statist. 8: 522–539.
Darroch, J. N., S. E. Fienberg, G. F. V. Glonek, and B. W. Junker. 1993. A three-sample multiple-recapture approach to census population estimation with heterogeneous catchability. J. Amer. Statist. Assoc. 88: 1137–1148.
Das Gupta, S., and M. D. Perlman. 1974. Power of the noncentral F-test: Effect of additional variates on Hotelling's T²-test. J. Amer. Statist. Assoc. 69: 174–180.
David, H. A. 1988. The Method of Paired Comparisons, 2nd ed. Oxford: Oxford University Press.
Davis, L. J. 1986a. Exact tests for 2 by 2 contingency tables. Amer. Statist. 40: 139–141.
Davis, L. J. 1986b. Relationship between strictly collapsible and perfect tables. Statist. Probab. Lett. 4: 119–122.
Davis, L. J. 1989. Intersection union tests for strict collapsibility in three-dimensional contingency tables. Ann. Statist. 17: 1693–1708.
Davison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their Application. Cambridge, U.K.: Cambridge University Press.
Dawson, R. B., Jr. 1954. A simplified expression for the variance of the χ²-function on a contingency table. Biometrika 41: 280.
Day, N. E., and D. P. Byar. 1979.
Testing hypotheses in case–control studies: Equivalence of Mantel–Haenszel statistics and logit score tests. Biometrics 35: 623–630.
de Falguerolles, A., S. Jmel, and J. Whittaker. 1995. Correspondence analysis and association models constrained by a conditional independence graph. Psychometrika 60: 161–180.
Deming, W. E. 1964. Statistical Adjustment of Data (reprint of 1943 Wiley text). New York: Dover.
Deming, W. E., and F. F. Stephan. 1940. On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Ann. Math. Statist. 11: 427–444.
Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39: 1–38.
Dey, D. K., S. K. Ghosh, and B. K. Mallick (editors). 2000. Generalized Linear Models: A Bayesian Perspective. New York: Marcel Dekker.
Diaconis, P., and B. Efron. 1985. Testing for independence in a two-way table: New interpretations of the chi-square statistic. Ann. Statist. 13: 845–874.
Diaconis, P., and B. Sturmfels. 1998. Algebraic algorithms for sampling from conditional distributions. Ann. Statist. 26: 363–397.
Diggle, P. J., P. Heagerty, K.-Y. Liang, and S. L. Zeger. 2002. Analysis of Longitudinal Data, 2nd ed. Oxford: Clarendon Press.
Dittrich, R., R. Hatzinger, and W. Katzenbeisser. 1998. Modeling the effect of subject-specific covariates in paired comparison studies with an application to university rankings. Appl. Statist. 47: 511–525.
Dobson, A. J. 2001. An Introduction to Generalized Linear Models, 2nd ed. London: Chapman & Hall.
Dong, J. 1998. Simpson's paradox. Pp. 4108–4110 in Encyclopedia of Biostatistics, Vol. 5. Chichester, UK: Wiley.
Dong, J., and J. S. Simonoff. 1994. The construction and properties of boundary kernels for smoothing sparse multinomials. J. Computat. Graph. Statist. 3: 57–66.
Dong, J., and J. S. Simonoff. 1995. A geometric combination estimator for d-dimensional ordinal sparse contingency tables. Ann.
Statist. 23: 1143–1159.
Donner, A., and W. W. Hauck. 1986. The large-sample efficiency of the Mantel–Haenszel estimator in the fixed-strata case. Biometrics 42: 537–545.
Doolittle, M. H. 1888. Association ratios. Bull. Philos. Soc. Washington 10: 83–87, 94–96.
Drost, F. C., W. C. M. Kallenberg, D. S. Moore, and J. Oosterhoff. 1989. Power approximations to multinomial tests of fit. J. Amer. Statist. Assoc. 84: 130–141.
Ducharme, G. R., and Y. Lepage. 1986. Testing collapsibility in contingency tables. J. Roy. Statist. Soc. Ser. B 48: 197–205.
Dupont, W. D. 1986. Sensitivity of Fisher's exact test to minor perturbations in 2 × 2 contingency tables. Statist. Medic. 5: 629–635.
Dyke, G. V., and H. D. Patterson. 1952. Analysis of factorial arrangements when the data are proportions. Biometrics 8: 1–12.
Edwardes, M. D. deB. 1997. Univariate random cut-points theory for the analysis of ordered categorical data. J. Amer. Statist. Assoc. 92: 1114–1123.
Edwards, A. W. F. 1963. The measure of association in a 2 × 2 table. J. Roy. Statist. Soc. Ser. A 126: 109–114.
Edwards, D. 2000. Introduction to Graphical Modelling, 2nd ed. New York: Springer-Verlag.
Edwards, D., and S. Kreiner. 1983. The analysis of contingency tables by graphical models. Biometrika 70: 553–565.
Efron, B. 1975. The efficiency of logistic regression compared to normal discriminant analysis. J. Amer. Statist. Assoc. 70: 892–898.
Efron, B. 1978. Regression and ANOVA with zero–one data: Measures of residual variation. J. Amer. Statist. Assoc. 73: 113–121.
Efron, B., and D. V. Hinkley. 1978. Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information. Biometrika 65: 457–482.
Efron, B., and C. Morris. 1975. Data analysis using Stein's estimator and its generalizations. J. Amer. Statist. Assoc. 70: 311–319.
Ekholm, A., J. W. McDonald, and P. W. F. Smith. 2000. Association models for a multivariate binary response. Biometrics 56: 712–718.
Escoufier, Y. 1982.
L'analyse des tableaux de contingence simples et multiples. In Proc. International Meeting on the Analysis of Multidimensional Contingency Tables (Rome, 1981), ed. R. Coppi. Metron 40: 53–77.
Espeland, M. A., and S. L. Handelman. 1989. Using latent class models to characterize and assess relative error in discrete measurements. Biometrics 45: 587–599.
Fahrmeir, L., and G. Tutz. 2001. Multivariate Statistical Modelling Based on Generalized Linear Models, 2nd ed. New York: Springer-Verlag.
Farewell, V. T. 1979. Some results on the estimation of logistic models based on retrospective data. Biometrika 66: 27–32.
Farewell, V. T. 1982. A note on regression analysis of ordinal data with variability of classification. Biometrika 69: 533–538.
Fay, R. 1985. A jackknifed chi-squared test for complex samples. J. Amer. Statist. Assoc. 80: 148–157.
Fay, R. 1986. Causal models for patterns of nonresponse. J. Amer. Statist. Assoc. 81: 354–365.
Ferguson, T. S. 1967. Mathematical Statistics: A Decision Theoretic Approach. New York: Academic Press.
Fienberg, S. E. 1970a. An iterative procedure for estimation in contingency tables. Ann. Math. Statist. 41: 907–917.
Fienberg, S. E. 1970b. Quasi-independence and maximum likelihood estimation in incomplete contingency tables. J. Amer. Statist. Assoc. 65: 1610–1616.
Fienberg, S. E. 1972. The analysis of incomplete multi-way contingency tables. Biometrics 28: 177–202.
Fienberg, S. E. 1980. Fisher's contributions to the analysis of categorical data. Pp. 75–84 in R. A. Fisher: An Appreciation, ed. S. E. Fienberg and D. V. Hinkley. Berlin: Springer-Verlag.
Fienberg, S. E. 1984. The contributions of William Cochran to categorical data analysis. Pp. 103–118 in W. G. Cochran's Impact on Statistics, ed. P. S. R. S. Rao and J. Sedransk. New York: Wiley.
Fienberg, S. E., and P. W. Holland. 1973. Simultaneous estimation of multinomial cell probabilities. J. Amer. Statist. Assoc. 68: 683–690.
Fienberg, S. E., and K. Larntz. 1976.
Loglinear representation for paired and multiple comparison models. Biometrika 63: 245–254.
Fienberg, S. E., M. A. Johnson, and B. J. Junker. 1999. Classical multilevel and Bayesian approaches to population size estimation using multiple lists. J. Roy. Statist. Soc. Ser. A 162: 383–405.
Finney, D. J. 1947. The estimation from individual records of the relationship between dose and quantal response. Biometrika 34: 320–334.
Finney, D. J. 1971. Probit Analysis, 3rd ed. Cambridge: Cambridge University Press.
Firth, D. 1987. On the efficiency of quasi-likelihood estimation. Biometrika 74: 233–245.
Firth, D. 1989. Marginal homogeneity and the superposition of Latin squares. Biometrika 76: 179–182.
Firth, D. 1991. Generalized linear models. Pp. 55–82 in Statistical Theory and Modelling: In Honour of Sir David Cox, FRS, D. V. Hinkley, N. Reid, and E. J. Snell, eds. London: Chapman & Hall.
Firth, D. 1993a. Bias reduction of maximum likelihood estimates. Biometrika 80: 27–38.
Firth, D. 1993b. Recent developments in quasi-likelihood methods. Proc. ISI 49th Session, pp. 341–358.
Firth, D., and J. Kuha. 2000. On the index of dissimilarity for lack of fit in log linear models. Unpublished manuscript.
Fischer, G. H., and I. W. Molenaar. 1995. Rasch Models: Foundations, Recent Developments, and Applications. New York: Springer-Verlag.
Fisher, R. A. 1922. On the interpretation of chi-square from contingency tables, and the calculation of P. J. Roy. Statist. Soc. 85: 87–94.
Fisher, R. A. 1924. The conditions under which chi-square measures the discrepancy between observation and hypothesis. J. Roy. Statist. Soc. 87: 442–450.
Fisher, R. A. 1926. Bayes' theorem and the fourfold table. Eugenics Rev. 18: 32–33.
Fisher, R. A. 1934, 1970. Statistical Methods for Research Workers (originally published 1925; 14th ed., 1970). Edinburgh: Oliver & Boyd.
Fisher, R. A. 1935a. The Design of Experiments (8th ed., 1966). Edinburgh: Oliver & Boyd.
Fisher, R. A. 1935b. Appendix to article by C.
Bliss. Ann. Appl. Biol. 22: 164–165.
Fisher, R. A. 1935c. The logic of inductive inference. J. Roy. Statist. Soc. 98: 39–82.
Fisher, R. A. 1945. A new test for 2 × 2 tables (Letter to the Editor). Nature 156: 388.
Fisher, R. A. 1956. Statistical Methods for Scientific Inference. Edinburgh: Oliver & Boyd.
Fisher, R. A., and F. Yates. 1938. Statistical Tables. Edinburgh: Oliver and Boyd.
Fitzmaurice, G. M., and N. M. Laird. 1993. A likelihood-based method for analysing longitudinal binary responses. Biometrika 80: 141–151.
Fitzmaurice, G. M., N. M. Laird, and S. Lipsitz. 1994. Analysing incomplete longitudinal binary responses: A likelihood-based approach. Biometrics 50: 601–612.
Fitzmaurice, G. M., N. M. Laird, and A. G. Rotnitzky. 1993. Regression models for discrete longitudinal responses. Statist. Sci. 8: 284–299.
Fitzpatrick, S., and A. Scott. 1987. Quick simultaneous confidence intervals for multinomial proportions. J. Amer. Statist. Assoc. 82: 875–878.
Fleiss, J. L. 1981. Statistical Methods for Rates and Proportions, 2nd ed. New York: Wiley.
Fleiss, J. L., and J. Cohen. 1973. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educ. Psychol. Meas. 33: 613–619.
Fleiss, J. L., J. Cohen, and B. S. Everitt. 1969. Large-sample standard errors of kappa and weighted kappa. Psychol. Bull. 72: 323–327.
Follmann, D. A., and D. Lambert. 1989. Generalizing logistic regression by nonparametric mixing. J. Amer. Statist. Assoc. 84: 295–300.
Forster, J. J., and P. W. F. Smith. 1998. Model-based inference for categorical survey data subject to non-ignorable non-response. J. Roy. Statist. Soc. Ser. B 60: 57–70.
Forster, J. J., J. W. McDonald, and P. W. F. Smith. 1996. Monte Carlo exact conditional tests for log-linear and logistic models. J. Roy. Statist. Soc. Ser. B 58: 445–453.
Fowlkes, E. B. 1987. Some diagnostics for binary logistic regression via smoothing. Biometrika 74: 503–515.
Fowlkes, E. B., A. E. Freeny, and J.
Landwehr. 1988. Evaluating logistic models for large contingency tables. J. Amer. Statist. Assoc. 83: 611–622.
Freedman, D., R. Pisani, and R. Purves. 1978. Statistics. New York: W. W. Norton.
Freeman, G. H., and J. H. Halton. 1951. Note on an exact treatment of contingency, goodness-of-fit and other problems of significance. Biometrika 38: 141–149.
Freeman, D. H., Jr., and T. R. Holford. 1980. Summary rates. Biometrics 36: 195–205.
Freeman, M. F., and J. W. Tukey. 1950. Transformations related to the angular and the square root. Ann. Math. Statist. 21: 607–611.
Freidlin, B., and J. L. Gastwirth. 1999. Unconditional versions of several tests commonly used in the analysis of contingency tables. Biometrics 55: 264–267.
Friendly, M. 2000. Visualizing Categorical Data. Cary, NC: SAS Institute.
Frome, E. L. 1983. The analysis of rates using Poisson regression models. Biometrics 39: 665–674.
Fuchs, C. 1982. Maximum likelihood estimation and model selection in contingency tables with missing data. J. Amer. Statist. Assoc. 77: 270–278.
Gabriel, K. R. 1966. Simultaneous test procedures for multiple comparisons on categorical data. J. Amer. Statist. Assoc. 61: 1081–1096.
Gabriel, K. R. 1971. The biplot graphic display of matrices with applications to principal component analysis. Biometrika 58: 453–467.
Gail, M. H., and J. J. Gart. 1973. The determination of sample sizes for use with the exact conditional test in 2 × 2 comparative trials. Biometrics 29: 441–448.
Gail, M., and N. Mantel. 1977. Counting the number of r × c contingency tables with fixed margins. J. Amer. Statist. Assoc. 72: 859–862.
Gart, J. J. 1966. Alternative analyses of contingency tables. J. Roy. Statist. Soc. Ser. B 28: 164–179.
Gart, J. J. 1969. An exact test for comparing matched proportions in crossover designs. Biometrika 56: 75–80.
Gart, J. J. 1970. Point and interval estimation of the common odds ratio in the combination of 2 × 2 tables with fixed margins. Biometrika 57: 471–475.
Gart, J. J. 1971.
The comparison of proportions: A review of significance tests, confidence intervals and adjustments for stratification. Rev. Internat. Statist. Inst. 39: 148–169.
Gart, J. J., and J. Nam. 1988. Approximate interval estimation of the ratio of binomial parameters: A review and corrections for skewness. Biometrics 44: 323–338.
Gart, J. J., and J. R. Zweifel. 1967. On the bias of various estimators of the logit and its variance with applications to quantal bioassay. Biometrika 54: 181–187.
Gelfand, A. E., and A. F. Smith. 1990. Sampling-based approaches to calculating marginal densities. J. Amer. Statist. Assoc. 85: 398–409.
Genter, F. C., and V. T. Farewell. 1985. Goodness-of-link testing in ordinal regression models. Canad. J. Statist. 13: 37–44.
Ghosh, B. K. 1979. A comparison of some approximate confidence intervals for the binomial parameter. J. Amer. Statist. Assoc. 74: 894–900.
Ghosh, M., M. Chen, A. Ghosh, and A. Agresti. 2000. Hierarchical Bayesian analysis of binary matched pairs data. Statist. Sin. 10: 647–657.
Gibbons, R. D., and D. Hedeker. 1997. Random-effects probit and logistic regression models for three-level data. Biometrics 53: 1527–1537.
Gill, J. 2000. Generalized Linear Models: A Unified Approach. Thousand Oaks, CA: Sage Publications.
Gilmour, A. R., R. D. Anderson, and A. L. Rae. 1985. The analysis of binomial data by a generalized linear mixed model. Biometrika 72: 593–599.
Gilula, Z., and S. Haberman. 1986. Canonical analysis of contingency tables by maximum likelihood. J. Amer. Statist. Assoc. 81: 780–788.
Gilula, Z., and S. Haberman. 1988. The analysis of multivariate contingency tables by restricted canonical and restricted association models. J. Amer. Statist. Assoc. 83: 760–771.
Gilula, Z., and S. Haberman. 1998. Chi-square, partition of. Pp. 622–627 in Encyclopedia of Biostatistics. Chichester, UK: Wiley.
Gleser, L. J., and D. S. Moore. 1985. The effect of positive dependence on chi-squared tests for categorical data. J. Roy. Statist.
Soc. Ser. B 47: 459–465.
Glonek, G. 1996. A class of regression models for multivariate categorical responses. Biometrika 83: 15–28.
Glonek, G. F. V., and P. McCullagh. 1995. Multivariate logistic models. J. Roy. Statist. Soc. Ser. B 57: 533–546.
Glonek, G., J. N. Darroch, and T. P. Speed. 1988. On the existence of maximum likelihood estimators for hierarchical loglinear models. Scand. J. Statist. 15: 187–193.
Gokhale, D. V., and S. Kullback. 1978. The Information in Contingency Tables. New York: Marcel Dekker.
Goldstein, H. 1995. Multilevel Statistical Models, 2nd ed. London: Edward Arnold.
Goldstein, H., and J. Rasbash. 1996. Improved approximations for multilevel models with binary responses. J. Roy. Statist. Soc. Ser. A 159: 505–513.
Good, I. J. 1963. Maximum entropy for hypothesis formulation, especially for multi-dimensional contingency tables. Ann. Math. Statist. 34: 911–934.
Good, I. J. 1965. The Estimation of Probabilities: An Essay on Modern Bayesian Methods. Cambridge, MA: MIT Press.
Good, I. J. 1976. On the application of symmetric Dirichlet distributions and their mixtures to contingency tables. Ann. Statist. 4: 1159–1189.
Good, I. J., and R. A. Gaskins. 1971. Nonparametric roughness penalties for probability densities. Biometrika 58: 255–277.
Good, I. J., and Y. Mittal. 1987. The amalgamation and geometry of two-by-two contingency tables. Ann. Statist. 15: 694–711.
Good, I. J., T. N. Gover, and G. J. Mitchell. 1970. Exact distributions for χ² and for the likelihood-ratio statistic for the equiprobable multinomial distribution. J. Amer. Statist. Assoc. 65: 267–283.
Goodman, L. A. 1964a. Simultaneous confidence intervals for cross-product ratios in contingency tables. J. Roy. Statist. Soc. Ser. B 26: 86–102.
Goodman, L. A. 1964b. Interactions in multi-dimensional contingency tables. Ann. Math. Statist. 35: 632–646.
Goodman, L. A. 1965. On simultaneous confidence intervals for multinomial proportions. Technometrics 7: 247–254.
Goodman, L. A. 1968.
The analysis of cross-classified data: Independence, quasi-independence, and interactions in contingency tables with or without missing entries. J. Amer. Statist. Assoc. 63: 1091–1131.
Goodman, L. A. 1969a. On partitioning chi-square and detecting partial association in three-way contingency tables. J. Roy. Statist. Soc. Ser. B 31: 486–498.
Goodman, L. A. 1969b. How to ransack social mobility tables and other kinds of cross-classification tables. Amer. J. Sociol. 75: 1–40.
Goodman, L. A. 1970. The multivariate analysis of qualitative data: Interaction among multiple classifications. J. Amer. Statist. Assoc. 65: 226–256.
Goodman, L. A. 1971a. The analysis of multidimensional contingency tables: Stepwise procedures and direct estimation methods for building models for multiple classifications. Technometrics 13: 33–61.
Goodman, L. A. 1971b. The partitioning of chi-square, the analysis of marginal contingency tables, and the estimation of expected frequencies in multidimensional contingency tables. J. Amer. Statist. Assoc. 66: 339–344.
Goodman, L. A. 1973. The analysis of multidimensional contingency tables when some variables are posterior to others: A modified path analysis approach. Biometrika 60: 179–192.
Goodman, L. A. 1974. Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika 61: 215–231.
Goodman, L. A. 1979a. Simple models for the analysis of association in cross-classifications having ordered categories. J. Amer. Statist. Assoc. 74: 537–552.
Goodman, L. A. 1979b. Multiplicative models for square contingency tables with ordered categories. Biometrika 66: 413–418.
Goodman, L. A. 1981a. Association models and canonical correlation in the analysis of cross-classifications having ordered categories. J. Amer. Statist. Assoc. 76: 320–334.
Goodman, L. A. 1981b. Association models and the bivariate normal for contingency tables with ordered categories. Biometrika 68: 347–355.
Goodman, L. A. 1983.
The analysis of dependence in cross-classifications having ordered categories, using log-linear models for frequencies and log-linear models for odds. Biometrics 39: 149–160.
Goodman, L. A. 1985. The analysis of cross-classified data having ordered and/or unordered categories: Association models, correlation models, and asymmetry models for contingency tables with or without missing entries. Ann. Statist. 13: 10–69.
Goodman, L. A. 1986. Some useful extensions of the usual correspondence analysis approach and the usual log-linear models approach in the analysis of contingency tables. Internat. Statist. Rev. 54: 243–309.
Goodman, L. A. 1996. A single general method for the analysis of cross-classified data: Reconciliation and synthesis of some methods of Pearson, Yule, and Fisher, and also some methods of correspondence analysis and association analysis. J. Amer. Statist. Assoc. 91: 408–427.
Goodman, L. A. 2000. The analysis of cross-classified data: Notes on a century of progress in contingency table analysis, and some comments on its prehistory and its future. Pp. 189–231 in Statistics for the 21st Century, ed. C. R. Rao and G. J. Székely. New York: Marcel Dekker.
Goodman, L. A., and W. H. Kruskal. 1979. Measures of Association for Cross Classifications. New York: Springer-Verlag (contains articles appearing in J. Amer. Statist. Assoc. in 1954, 1959, 1963, 1972).
Gould, S. J. 1981. The Mismeasure of Man. New York: W. W. Norton.
Gourieroux, C., A. Monfort, and A. Trognon. 1984. Pseudo maximum likelihood methods: Theory. Econometrica 52: 681–700.
Graubard, B. I., and E. L. Korn. 1987. Choice of column scores for testing independence in ordered 2 × K contingency tables. Biometrics 43: 471–476.
Green, P. J. 1984. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. J. Roy. Statist. Soc. Ser. B 46: 149–192.
Greenacre, M. J. 1993. Correspondence Analysis in Practice. New York: Academic Press.
Greenland, S. 1991. On the logical justification of conditional tests for two-by-two contingency tables. Amer. Statist. 45: 248–251.
Greenland, S., and J. M. Robins. 1985. Estimation of a common effect parameter from sparse follow-up data. Biometrics 41: 55–68.
Greenwood, M., and G. U. Yule. 1920. An inquiry into the nature of frequency distributions representative of multiple happenings with particular reference to the occurrence of multiple attacks of disease or of repeated accidents. J. Roy. Statist. Soc. Ser. A 83: 255–279.
Greenwood, P. E., and M. S. Nikulin. 1996. A Guide to Chi-Squared Testing. New York: Wiley.
Grizzle, J. E., C. F. Starmer, and G. G. Koch. 1969. Analysis of categorical data by linear models. Biometrics 25: 489–504.
Gross, S. T. 1981. On asymptotic power and efficiency of tests of independence in contingency tables with ordered classifications. J. Amer. Statist. Assoc. 76: 935–941.
Gueorguieva, R., and A. Agresti. 2001. A correlated probit model for joint modeling of clustered binary and continuous responses. J. Amer. Statist. Assoc. 96: 1102–1112.
Haber, M. 1980. A comparison of some continuity corrections for the chi-squared test on 2 × 2 tables. J. Amer. Statist. Assoc. 75: 510–515.
Haber, M. 1982. The continuity correction and statistical testing. Internat. Statist. Rev. 50: 135–144.
Haber, M. 1985. Maximum likelihood methods for linear and log-linear models in categorical data. Comput. Statist. Data Anal. 3: 1–10.
Haber, M. 1986. An exact unconditional test for the 2 × 2 comparative trial. Psychol. Bull. 99: 129–132.
Haber, M. 1989. Do the marginal totals of a 2 × 2 contingency table contain information regarding the table proportions? Commun. Statist. Ser. A 18: 147–156.
Haberman, S. J. 1973a. The analysis of residuals in cross-classification tables. Biometrics 29: 205–220.
Haberman, S. J. 1973b. Log-linear models for frequency data: Sufficient statistics and likelihood equations. Ann. Statist. 1: 617–632.
Haberman, S. J. 1974a.
The Analysis of Frequency Data. Chicago: University of Chicago Press.
Haberman, S. J. 1974b. Log-linear models for frequency tables with ordered classifications. Biometrics 30: 589–600.
Haberman, S. J. 1977a. Log-linear models and frequency tables with small expected cell counts. Ann. Statist. 5: 1148–1169.
Haberman, S. J. 1977b. Maximum likelihood estimation in exponential response models. Ann. Statist. 5: 815–841.
Haberman, S. J. 1978, 1979. Analysis of Qualitative Data, Vols. 1 and 2. New York: Academic Press.
Haberman, S. J. 1981. Tests for independence in two-way contingency tables based on canonical correlation and on linear-by-linear interaction. Ann. Statist. 9: 1178–1186.
Haberman, S. J. 1982. The analysis of dispersion of multinomial responses. J. Amer. Statist. Assoc. 77: 568–580.
Haberman, S. J. 1988. A warning on the use of chi-squared statistics with frequency tables with small expected cell counts. J. Amer. Statist. Assoc. 83: 555–560.
Haberman, S. J. 1995. Computation of maximum likelihood estimates in association models. J. Amer. Statist. Assoc. 90: 1438–1446.
Hagenaars, J. A. 1998. Categorical causal modeling: Latent class analysis and directed log-linear models with latent variables. Sociol. Methods Res. 26: 436–486.
Hald, A. 1998. A History of Mathematical Statistics from 1750 to 1930. New York: Wiley.
Haldane, J. B. S. 1940. The mean and variance of χ², when used as a test of homogeneity, when expectations are small. Biometrika 31: 346–355.
Haldane, J. B. S. 1956. The estimation and significance of the logarithm of a ratio of frequencies. Ann. Human Genet. 20: 309–311.
Hall, P., and D. M. Titterington. 1987. On smoothing sparse multinomial data. Austral. J. Statist. 29: 19–37.
Hamada, M., and C. F. J. Wu. 1990. A critical look at accumulation analysis and related methods. Technometrics 32: 119–130.
Hansen, L. P. 1982. Large sample properties of generalized method of moments estimators. Econometrica 50: 1029–1054.
Harkness, W.
L., and L. Katz. 1964. Comparison of the power functions for the test of independence in 2 = 2 contingency tables. Ann. Math. Statist. 35: 1115᎐1127. Harrell F. E., R. M. Califf, D. B. Pryor, K. L. Lee, and R. A. Rosati. 1982. Evaluating the yield of medical tests. J. Amer. Medic. Assoc. 247: 2543᎐2546. Hartzel, J., I.-M. Liu, and A. Agresti. 2001a. Describing heterogeneous effects in stratified ordinal contingency tables, with application to multi-center clinical trials. Computat. Statist. Data Anal. 35: 429᎐449. Hartzel, J., A. Agresti, and B. Caffo. 2001b. Multinomial logit random effects models. Statistical Modelling 1: 81᎐102. Haslett, S. 1990. Degrees of freedom and parameter estimability in hierarchical models for sparse complete contingency tables. Computat. Statist. Data Anal. 9: 179᎐195. Hastie, T., and R. Tibshirani. 1987. Non-parametric logistic and proportional odds regression. Appl. Statist. 36: 260᎐276. Hastie, T., and R. Tibshirani. 1990. Generalized Additi®e Models. London: Chapman & Hall. Hatzinger, R. 1989. The Rasch model, some extensions and their relation to the class of generalized linear models. Statistical Modelling: Lecture Notes in Statistics, Vol. 57. Berlin: Springer-Verlag. Hauck, W. W. 1979. The large sample variance of the Mantel᎐Haenszel estimator of a common odds ratio. Biometrics 35: 817᎐819. Hauck, W. W. 1983. A note on confidence bands for the logistic response curve. Amer. Statist. 37: 158᎐160. Hauck, W. W., and A. Donner. 1977. Wald’s test as applied to hypotheses in logit analysis. J. Amer. Statist. Assoc. 72: 851᎐853. Heagerty, P. J. 1999. Marginally specified logistic-normal models for longitudinal binary data. Biometrics 55: 688᎐698. Heagerty, P. J., and S. L. Zeger. 1996. Marginal regression models for clustered ordinal measurements. J. Amer. Statist. Assoc. 91: 1024᎐1036. Heagerty, P. J., and S. L. Zeger. 2000. Marginalized multilevel models and likelihood inference. Statist. Sci. 15: 1᎐19. Hedeker, D., and R. D. 
Gibbons. 1994. A random-effects ordinal regression model for multilevel analysis. Biometrics 50: 933᎐944. Heinen, T. 1996. Latent Class and Discrete Latent Trait Models. Thousand Oaks, CA: Sage Publications. Heyde, C. C. 1997. Quasi-likelihood and Its Application. New York: Springer-Verlag. Hinde, J. 1982. Compound Poisson regression models. Pp. 109᎐121 in GLIM 82: Proc. International Conference on Generalised Linear Models, ed. R. Gilchrist. New York: Springer-Verlag. Hinde, J., and C. G. B. Demetrio. 1998. Overdispersion: Models and estimation. Comput. Statist. ´ Data Anal. 27: 151᎐170. 672 REFERENCES Hirji, K. F. 1991. A comparison of exact, mid-P, and score tests for matched case-control studies. Biometrics 47: 487496. Hirji, K. F., C. R. Mehta, and N. R. Patel. 1987. Computing distributions for exact logistic regression. J. Amer. Statist. Assoc. 82: 11101117. Hirotsu, C. 1982. Use of cumulative efficient scores for testing ordered alternatives in discrete models. Biometrika 69: 567577. Hirschfeld, H. O. 1935. A connection between correlation and contingency. Cambridge Philos. Soc. Proc. Ž Math. Proc.. 31: 520524. Hodges, J. L., Jr. 1958. Fitting the logistic by maximum likelihood. Biometrics 14: 453461. Hoem, J. M. 1987. Statistical analysis of a multiplicative model and its application to the standardization of vital rates: A review. Internat. Statist. Re®. 5: 119152. Holford, T. R. 1980. The analysis of rates and of survivorship using log-linear models. Biometrics 36: 299305. Holt, D., A. J. Scott, and P. D. Ewings. 1980. Chi-squared tests with survey data. J. Roy. Statist. Soc. Ser. A 143: 303320. Hook, E. B., and R. R. Regal. 1995. Capturerecapture methods in epidemiology: Methods and limitations. Epidemiol. Re®. 17: 243264. Hosmer, D. W., and S. Lemeshow. 1980. A goodness-of-fit test for multiple logistic regression model. Commun. Statist. Ser A 9: 10431069. Hosmer, D. W., and S. Lemeshow. 2000. Applied Logistic Regression, 2nd ed. New York: Wiley. 
Hosmer, D. W., T. Hosmer, S. le Cessie, and S. Lemeshow. 1997. A comparison of goodness-of-fit tests for the logistic regression model. Statist. Medic. 16: 965–980.
Hout, M., O. D. Duncan, and M. E. Sobel. 1987. Association and heterogeneity: Structural models of similarities and differences. Sociol. Methodol. 17: 145–184.
Howard, J. V. 1998. The 2 × 2 table: A discussion from a Bayesian viewpoint. Statist. Sci. 13: 351–367.
Hsieh, F. Y. 1989. Sample size tables for logistic regression. Statist. Medic. 8: 795–802.
Hsieh, F. Y., D. A. Bloch, and M. D. Larsen. 1998. A simple method of sample size calculation for linear and logistic regression. Statist. Medic. 17: 1623–1634.
Hwang, J. T. G., and M. T. Wells. 2002. Optimality results for mid P-values. To appear.
Hwang, J. T. G., and M.-C. Yang. 2001. An optimality theory for mid P-values in 2 × 2 contingency tables. Statist. Sin. 11: 807–826.
Imrey, P. B. 1998. Bradley–Terry model. Pp. 437–443 in Encyclopedia of Biostatistics. Chichester, UK: Wiley.
Imrey, P. B., W. D. Johnson, and G. G. Koch. 1976. An incomplete contingency table approach to paired-comparison experiments. J. Amer. Statist. Assoc. 71: 614–623.
Imrey, P. B., G. G. Koch, and M. E. Stokes. 1981. Categorical data analysis: Some reflections on the log linear model and logistic regression. I: Historical and methodological overview. Internat. Statist. Rev. 49: 265–283.
Ireland, C. T., and S. Kullback. 1968a. Minimum discrimination information estimation. Biometrics 24: 707–713.
Ireland, C. T., and S. Kullback. 1968b. Contingency tables with given marginals. Biometrika 55: 179–188.
Ireland, C. T., H. H. Ku, and S. Kullback. 1969. Symmetry and marginal homogeneity of an r × r contingency table. J. Amer. Statist. Assoc. 64: 1323–1341.
Irwin, J. O. 1935. Tests of significance for differences between percentages based on small numbers. Metron 12: 83–94.
Jennison, C., and B. W. Turnbull. 2000. Group Sequential Methods with Applications to Clinical Trials. London: Chapman & Hall.
Johnson, B. M. 1971. On the admissible estimators for certain fixed sample binomial problems. Ann. Math. Statist. 42: 1579–1587.
Johnson, W. 1985. Influence measures for logistic regression: Another point of view. Biometrika 72: 59–65.
Johnson, N. L., S. Kotz, and A. W. Kemp. 1992. Univariate Discrete Distributions, 2nd ed. New York: Wiley.
Jones, B., and M. G. Kenward. 1987. Modelling binary data from a three-period cross-over trial. Statist. Medic. 6: 555–564.
Jones, M. P., T. W. O'Gorman, J. H. Lemke, and R. F. Woolson. 1989. A Monte Carlo investigation of homogeneity tests of the odds ratio under various sample size considerations. Biometrics 45: 171–181.
Jørgensen, B. 1983. Maximum likelihood estimation and large-sample inference for generalized linear and nonlinear regression models. Biometrika 70: 19–28.
Jørgensen, B. 1987. Exponential dispersion models. J. Roy. Statist. Soc. Ser. B 49: 127–162.
Kalbfleisch, J. D., and J. F. Lawless. 1985. The analysis of panel data under a Markov assumption. J. Amer. Statist. Assoc. 80: 863–871.
Kastner, C., A. Fieger, and C. Heumann. 1997. MAREG and WinMAREG: A tool for marginal regression models. Comput. Statist. Data Anal. 24: 237–241.
Kauermann, G., and R. J. Carroll. 2001. A note on the efficiency of sandwich covariance matrix estimation. J. Amer. Statist. Assoc. 96: 1387–1397.
Kauermann, G., and G. Tutz. 2001. Testing generalized linear and semiparametric models against smooth alternatives. J. Roy. Statist. Soc. Ser. B 63: 147–166.
Kelderman, H. 1984. Loglinear Rasch model tests. Psychometrika 49: 223–245.
Kempthorne, O. 1979. In dispraise of the exact test: Reactions. J. Statist. Plann. Inference 3: 199–213.
Kendall, M. G. 1945. The treatment of ties in rank problems. Biometrika 33: 239–251.
Kendall, M., and A. Stuart. 1979. The Advanced Theory of Statistics, Vol. 2: Inference and Relationship, 4th ed. New York: Macmillan.
Kenward, M. G., and B. Jones. 1991. The analysis of categorical data from cross-over trials using a latent variable model. Statist. Medic. 10: 1607–1619.
Kenward, M. G., and B. Jones. 1994. The analysis of binary and categorical data from crossover trials. Statist. Methods Medic. Res. 3: 325–344.
Kenward, M. G., E. Lesaffre, and G. Molenberghs. 1994. An application of maximum likelihood and estimating equations to the analysis of ordinal data from a longitudinal study with cases missing at random. Biometrics 50: 945–953.
Khamis, H. J. 1983. Log-linear model analysis of the semi-symmetric intraclass contingency table. Commun. Statist. Ser. A 12: 2723–2752.
Kim, D., and A. Agresti. 1995. Improved exact inference about conditional association in three-way contingency tables. J. Amer. Statist. Assoc. 90: 632–639.
Kim, D., and A. Agresti. 1997. Nearly exact tests of conditional independence and marginal homogeneity for sparse contingency tables. Comput. Statist. Data Anal. 24: 89–104.
King, G. 1997. A Solution to the Ecological Inference Problem. Princeton, NJ: Princeton University Press.
Knuiman, M. W., and T. P. Speed. 1988. Incorporating prior information into the analysis of contingency tables. Biometrics 44: 1061–1071.
Koch, G. G., and V. P. Bhapkar. 1982. Chi-square tests. Pp. 442–457 in Encyclopedia of Statistical Sciences, Vol. 1. New York: Wiley.
Koch, G. G., J. R. Landis, J. L. Freeman, D. H. Freeman, and R. G. Lehnen. 1977. A general methodology for the analysis of experiments with repeated measurement of categorical data. Biometrics 33: 133–158.
Koch, G. G., I. A. Amara, G. W. Davis, and D. B. Gillings. 1982. A review of some statistical methods for covariance analysis of categorical data. Biometrics 38: 563–595.
Koch, G. G., P. B. Imrey, J. M. Singer, S. S. Atkinson, and M. E. Stokes. 1985. Lecture Notes for Analysis of Categorical Data. Montreal: Les Presses de L'Université de Montréal.
Koehler, K. 1986. Goodness-of-fit tests for log-linear models in sparse contingency tables. J. Amer. Statist. Assoc. 81: 483–493.
Koehler, K. 1998. Chi-square tests. Pp. 608–622 in Encyclopedia of Biostatistics. Chichester, UK: Wiley.
Koehler, K., and K. Larntz. 1980. An empirical investigation of goodness-of-fit statistics for sparse multinomials. J. Amer. Statist. Assoc. 75: 336–344.
Koehler, K., and J. Wilson. 1986. Chi-square tests for comparing vectors of proportions for several cluster samples. Commun. Statist. Ser. A 15: 2977–2990.
Koopman, P. A. R. 1984. Confidence limits for the ratio of two binomial proportions. Biometrics 40: 513–517.
Kraemer, H. C. 1979. Ramifications of a population model for κ as a coefficient of reliability. Psychometrika 44: 461–472.
Kreiner, S. 1987. Analysis of multidimensional contingency tables by exact conditional tests: Techniques and strategies. Scand. J. Statist. 14: 97–112.
Kreiner, S. 1998. Interaction models. Pp. 2063–2068 in Encyclopedia of Biostatistics. Chichester, UK: Wiley.
Kruskal, W. H. 1958. Ordinal measures of association. J. Amer. Statist. Assoc. 53: 814–861.
Ku, H. H., R. N. Varner, and S. Kullback. 1971. Analysis of multidimensional contingency tables. J. Amer. Statist. Assoc. 66: 55–64.
Kuha, J., and C. Skinner. 1997. Categorical data analysis and misclassification. Pp. 633–670 in Survey Measurement and Process Quality, ed. L. Lyberg et al. New York: Wiley.
Kuha, J., C. Skinner, and J. Palmgren. 1998. Misclassification error. Pp. 2615–2621 in Encyclopedia of Biostatistics. Chichester, UK: Wiley.
Kullback, S. 1959. Information Theory and Statistics. New York: Wiley.
Kullback, S., M. Kupperman, and H. H. Ku. 1962. Tests for contingency tables and Markov chains. Technometrics 4: 573–608.
Kupper, L. L., and J. K. Haseman. 1978. The use of a correlated binomial model for the analysis of certain toxicological experiments. Biometrics 34: 69–76.
Kupper, L. L., C. Portier, M. D. Hogan, and E. Yamamoto. 1986. The impact of litter effects on dose–response modeling in teratology. Biometrics 42: 85–98.
Laara, ¨¨ ¨ E., and J. N. S. Matthews. 1985. The equivalence of two models for ordinal data. Biometrika 72: 206᎐207. Lachin, J. M. 1977. Sample-size determinations for r = c comparative trials. Biometrics 33: 315᎐324. Laird, N. M. 1978. Empirical Bayes methods for two-way contingency tables. Biometrika 65: 581᎐590. Laird, N. M. 1998. EM algorithm. Pp. 1300᎐1313 in Encyclopedia of Biostatistics. Chichester, UK: Wiley. Laird, N. M., and D. Olivier. 1981. Covariance analysis of censored survival data using log-linear analysis techniques. J. Amer. Statist. Assoc. 76: 231᎐240. Lancaster, H. O. 1949. The derivation and partition of ␹ 2 in certain discrete distributions. Biometrika 36: 117᎐129. REFERENCES 675 Lancaster, H. O. 1951. Complex contingency tables treated by partition of ␹ 2 . J. Roy. Statist. Soc. Ser. B 13: 242᎐249. Lancaster, H. O. 1961. Significance tests in discrete distributions. J. Amer. Statist. Assoc. 56: 223᎐234. Lancaster, H. O. 1969. The Chi-Squared Distribution. New York: Wiley. Lancaster, H. O., and M. A. Hamdan. 1964. Estimation of the correlation coefficient in contingency tables with possible nonmetrical characters. Psychometrika 29: 383᎐391. Landis, J. R., and G. G. Koch. 1977. An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics 33: 363᎐374. Landis, J. R., E. R. Heyman, and G. G. Koch. 1978. Average partial association in three-way contingency tables: A review and discussion of alternative tests. Internat. Statist. Re®. 46: 237᎐254. Landis, J. R., T. J. Sharp, S. J. Kuritz, and G. G. Koch. 1998. Mantel-Haenszel methods. Pp. 2378᎐2691 in Encyclopedia of Biostatistics. Chichester, UK: Wiley. Landwehr, J. M., D. Pregibon, and A. C. Shoemaker. 1984. Graphical methods for assessing logistic regression models. J. Amer. Statist. Assoc. 79: 61᎐71. Lang, J. B. 1992. Obtaining the observed information matrix for the Poisson log linear model with incomplete data. 
Biometrika 79: 405᎐407. Lang, J. B. 1996a. Maximum likelihood methods for a generalized class of log-linear models. Ann. Statist. 24: 726᎐752. Lang, J. B. 1996b. On the partitioning of goodness-of-fit statistics for multivariate categorical response models. J. Amer. Statist. Assoc. 91: 1017᎐1023. Lang, J. B. 1996c. On the comparison of multinomial and Poisson log-linear models. J. Roy. Statist. Soc. Ser. B 58: 253᎐266. Lang, J. B., and A. Agresti. 1994. Simultaneously modeling joint and marginal distributions of multivariate categorical responses. J. Amer. Statist. Assoc. 89: 625᎐632. Lang, J. B., J. W. McDonald, and P. W. F. Smith. 1999. Association-marginal modeling of multivariate categorical responses: A maximum likelihood approach. J. Amer. Statist. Assoc. 94: 1161᎐1171. Laplace, P. S. 1812. Theorie ´ Analytique des Probabilites. ´ Paris: Courcier. Larntz, K. 1978. Small-sample comparison of exact levels for chi-squared goodness-of-fit statistics. J. Amer. Statist. Assoc. 73: 253᎐263. Larsen, K., J. H. Petersen, E. Budtz-JoⲐ rgensen, and L. Endahl. 2000. Interpreting parameters in the logistic regression model with random effects. Biometrics 56: 909᎐914. Larson, M. G. 1984. Covariate analysis of competing-risks data with log-linear models. Biometrics 40: 459᎐469. Lauritzen, S. L. 1996. Graphical Models. New York: Oxford University Press. Lauritzen, S. L., and N. Wermuth. 1989. Graphical models for associations between variables, some of which are qualitative and some quantitative. Ann. Statist. 17: 31᎐57. LaVange, L. M., G. G. Koch, and T. A. Schwartz. 2001. Applying sample survey methods to clinical trials data. Statist. Medic. 20: 2609᎐2623. Lawal, H. B. 1984. Comparisons of the X 2 , Y 2 , Freeman᎐Tukey and Williams improved G 2 test statistics in small samples of one-way multinomials. Biometrika 71: 415᎐418. Lawless, J. F. 1987. Negative binomial and mixed Poisson regression. Canad. J. Statist. 15: 209᎐225. Lazarsfeld, P. F., and N. W. Henry. 1968. 
Latent Structure Analysis. Boston: Houghton Mifflin. Lee, S. K. 1977. On the asymptotic variances of ˆ u terms in loglinear models of multidimensional contingency tables. J. Amer. Statist. Assoc. 72: 412᎐419. 676 REFERENCES Lee, Y., and J. A. Nelder. 1996. Hierarchical generalized linear models. J. Roy. Statist. Soc. Ser B 58: 619678. Lefkopoulou, M., D. Moore, and L. Ryan. 1989. The analysis of multiple correlated binary outcomes: Application to rodent teratology experiments. J. Amer. Statist. Assoc. 84: 810815. Lehmann, E. L. 1966. Some concepts of dependence. Ann. Math. Statist. 37: 11371153. Lehmann, E. L. 1986. Testing Statistical Hypotheses, 2nd ed. New York: Wiley. Leonard, T. 1975. Bayesian estimation methods for two-way contingency tables. J. Roy. Statist. Soc. Ser. B 37: 2337. Leonard, T. and J. S. J. Hsu. 1994. The Bayesian analysis of categorical data: A selective review. Pp. 283310 in Aspects of Uncertainty: A Tribute to D. V. Lindley. P. R. Freeman and A. F. M. Smith, eds. New York: Wiley. Lesaffre, E., and A. Albert. 1989. Multiple-group logistic regression diagnostics. Appl. Statist. 38: 425440. Lesaffre, E., and G. Molenberghs. 1991. Multivariate probit analysis: A neglected procedure in medical statistics. Statist. Medic. 10: 13911403. Lesaffre, E., and B. Spiessens. 2001. On the effect of quadrature points in a logistic randomeffects model: An example. Appl. Statist. 50: 325335. Lewis, T., I. W. Saunders, and M. Westcott. 1984. The moments of the Pearson chi-squared statistic and the minimum expected value in two-way tables. Biometrika 71: 515522. Liang, K. Y. 1984. The asymptotic efficiency of conditional likelihood methods. Biometrika 71: 305313. Liang, K. Y., and J. Hanfelt. 1994. On the use of the quasi-likelihood method in teratological experiments. Biometrics 50: 872880. Liang, K. Y., and P. McCullagh. 1993. Case studies in binary dispersion. Biometrics 49: 623630. Liang, K. Y., and S. G. Self. 1985. 
Tests for homogeneity of odds ratios when the data are sparse. Biometrika 72: 353358. Liang, K. Y., and S. L. Zeger. 1986. Longitudinal data analysis using generalized linear models. Biometrika 73: 1322. Liang, K. Y., and S. L. Zeger. 1988. On the use of concordant pairs in matched casecontrol studies. Biometrics 44: 11451156. Liang, K. Y, and S. L. Zeger. 1995. Inference based on estimating functions in the presence of nuisance parameters. Statist. Sci. 10 158173. Liang, K. Y., S. L. Zeger, and B. Qaqish. 1992. Multivariate regression analyses for categorical data. J. Roy. Statist. Soc.Ser. B 54: 324. Lin, X. 1997. Variance component testing in generalized linear models with random effects. Biometrika 84: 309326. Lindley, D. V. 1964. The Bayesian analysis of contingency tables. Ann. Math. Statist. 35: 16221643. Lindsay, B., C. Clogg, and J. Grego. 1991. Semi-parametric estimation in the Rasch model and related exponential response models, including a simple latent class model for item analysis. J. Amer. Statist. Assoc. 86: 96107. Lindsey, J. K. 1999. Models for Repeated Measurements, 2nd ed. Oxford: Oxford University Press. Lindsey, J. K., and P. M. E. Altham. 1998. Analysis of the human sex ratio by using overdispersion models. Appl. Statist. 47: 149157. Lindsey, J. K., and G. Mersch. 1992. Fitting and comparing probability distributions with log linear models. Comput. Statist. Data Anal. 13: 373384. Lipsitz, S. 1992. Methods for estimating the parameters of a linear model for ordered categorical data. Biometrics 48: 271281. REFERENCES 677 Lipsitz, S. R., and G. Fitzmaurice. 1996. The score test for independence in R = C contingency tables with missing data. Biometrics 52: 751762. Lipsitz, S., N. Laird, and D. Harrington. 1990. Finding the design matrix for the marginal homogeneity model. Biometrika 77: 353358. Lipsitz, S., N. Laird, and D. Harrington. 1991. 
Generalized estimating equations for correlated binary data: Using the odds ratio as a measure of association. Biometrika 78: 153160. Lipsitz, S. R., K. Kim, and L. Zhao. 1994. Analysis of repeated categorical data using generalized estimating equations. Statist. Medic. 13: 11491163. Little, R. J. 1989. Testing the equality of two independent binomial proportions. Amer. Statist. 43: 283288. Little, R. J. 1998. Missing data. Pp. 26222635 in Encyclopedia of Biostatistics. Chichester, UK: Wiley. Little, R. J., and D. B. Rubin. 1987. Statistical Analysis with Missing Data. New York: Wiley. Little, R. J. A., and M.-M. Wu. 1991. Models for contingency tables with known margins when target and sampled populations differ. J. Amer. Statist. Assoc. 86: 8795. Liu, Q., and D. A. Pierce. 1993. Heterogeneity in MantelHaenszel-type models. Biometrika 80: 543556. Liu, Q., and D. A. Pierce. 1994. A note on GaussHermite quadrature. Biometrika 81: 624629. Lloyd, C. J. 1988a. Some issues arising from the analysis of 2 = 2 contingency tables. Austral. J. Statist. 30: 3546. Lloyd, C. J. 1988b. Doubling the one-sided P-value in testing independence in 2 = 2 tables against a two-sided alternative. Statist. Medic. 7: 12971306. Lloyd, C. J. 1999. Statistical Analysis of Categorical Data. New York: Wiley. Longford, N. T. 1993. Random Coefficient Models. New York: Oxford University Press. Loughin, T. M., and P. N. Scherer. 1998. Testing for association in contingency tables with multiple column responses. Biometrics 54: 630637. Louis, T. A. 1982. Finding the observed information matrix when using the EM algorithm. J. Roy. Statist. Soc. Ser. B 44: 226233. Luce, R. D. 1959. Indi®idual Choice Beha®ior. New York: Wiley. Madansky, A. 1963. Tests of homogeneity for correlated samples. J. Amer. Statist. Assoc. 58: 97119. Maddala, G. S. 1983. Limited-Dependent and Qualitati®e Variables in Econometrics. Cambridge: Cambridge University Press. Magnus, J. R., and H. Neudecker. 1988. 
Matrix Differential Calculus with Applications in Statistics and Econometrics. New York: Wiley. Mantel, N. 1963. Chi-square tests with one degree of freedom: Extensions of the MantelHaenszel procedure. J. Amer. Statist. Assoc. 58: 690700. Mantel, N. 1966. Models for complex contingency tables and polychotomous dosage response curves. Biometrics 22: 8395. Mantel, N. 1973. Synthetic retrospective studies and related topics. Biometrics 29: 479486. Mantel, N. 1985. Maximum likelihood vs. minimum chi-square. Biometrics 41: 777781. Mantel, N. 1987a. Understanding Wald’s test for exponential families. Amer. Statist. 41: 147148. Mantel, N. 1987b. Exact tests for 2 = 2 contingency tables ŽLetter.. Amer. Statist. 41: 159. Mantel, N., and D. P. Byar. 1978. Marginal homogeneity, symmetry and independence. Commun. Statist. Ser. A 7: 953976. Mantel, N., and W. Haenszel. 1959. Statistical aspects of the analysis of data from retrospective studies of disease. J. Natl. Cancer Inst. 22: 719748. 678 REFERENCES ´ Andres, Martin ´ A., and Silva Mato, A. 1994. Choosing the optimal unconditional test for comparing two independent proportions. Comput. Statist. Data Anal. 17: 555574. Matthews, J. N. S., and K. P. Morris. 1995. An application of BradleyTerry-type models to the measurement of pain. Appl. Statist. 44: 243255. McCullagh, P. 1978. A class of parametric models for the analysis of square contingency tables with ordered categories. Biometrika 65: 413418. McCullagh, P. 1980. Regression models for ordinal data. J. Roy. Statist. Soc. Ser. B 42: 109142. McCullagh, P. 1982. Some applications of quasisymmetry. Biometrika 69: 303308. McCullagh, P. 1983. Quasi-likelihood functions. Ann. Statist. 11: 5967. McCullagh, P. 1986. The conditional distribution of goodness-of-fit statistics for discrete data. J. Amer. Statist. Assoc. 81: 104107. McCullagh, P., and J. A. Nelder. 1983; 2nd ed., 1989. Generalized Linear Models. London: Chapman & Hall. McCulloch, C. E. 1994. 
Maximum likelihood variance components estimation for binary data. J. Amer. Statist. Assoc. 89: 330335. McCulloch, C. E. 1997. Maximum likelihood algorithms for generalized linear mixed models. J. Amer. Statist. Assoc. 92: 162170. McCulloch, C. E. 2000. Generalized linear models. J. Amer. Statist. Assoc. 95: 13201324. McCulloch, C. E., and S. Searle. 2001. Generalized, Linear, and Mixed Models. New York: Wiley. McFadden, D. 1974. Conditional logit analysis of qualitative choice behavior. Pp. 105142 in Frontiers in Econometrics, ed. P. Zarembka. New York: Academic Press. McFadden, D. 1982. Qualitative response models. Pp. 137 in Ad®ances in Econometrics, ed. W. Hildebrand. Cambridge: Cambridge University Press. McNemar, Q. 1947. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12: 153157. Mee, R. W. 1984. Confidence bounds for the difference between two probabilities Žletter.. Biometrics 40: 11751176. Meeden, G., C. Geyer, J. Lang, and E. Funo. 1998. The admissibility of the maximum likelihood estimator for decomposable log-linear interaction models for contingency tables. Commun. Statist. Ser. A 27: 473493. Mehta, C. R. 1994. The exact analysis of contingency tables in medical research. Statist. Methods Medic. Res. 3: 135156. Mehta, C. R., and N. R. Patel. 1983. A network algorithm for performing Fisher’s exact test in r = c contingency tables. J. Amer. Statist. Assoc. 78: 427434. Mehta, C. R., and N. R. Patel. 1995. Exact logistic regression: Theory and examples. Statist. Medic. 14: 21432160. Mehta, C. R., and S. J. Walsh. 1992. Comparison of exact, mid-P, and MantelHaenszel confidence intervals for the common odds ratio across several 2 = 2 contingency tables. Amer. Statist. 46: 146150. Mehta, C. R., N. R. Patel, and R. Gray. 1985. Computing an exact confidence interval for the common odds ratio in several 2 by 2 contingency tables. J. Amer. Statist. Assoc. 80: 969973. Mehta, C. R., N. R. Patel, and P. 
Senchaudhuri. 1988. Importance sampling for estimating exact probabilities in permutational inference. J. Amer. Statist. Assoc. 83: 9991005. Mehta, C. R., N. R. Patel, and P. Senchaudhuri. 2000. Efficient Monte Carlo methods for conditional logistic regression. J. Amer. Statist. Assoc. 95: 99108. Michailidis, G., and J. de Leeuw. 1998. The Gifi system of descriptive multivariate analysis. Statist. Sci. 13: 307336. REFERENCES 679 Miettinen, O. S. 1969. Individual matching with multiple controls in the case of all-or-none responses. Biometrics 25: 339355. Miettinen, O. S., and M. Nurminen. 1985. Comparative analysis of two rates. Statist. Medic. 4: 213226. Miller, M. E., C. S. Davis, and J. R. Landis. 1993. The analysis of longitudinal polytomous data: Generalized estimating equations and connections with weighted least squares. Biometrics 49: 10331044. Minkin, S. 1987. On optimal design for binary data. J. Amer. Statist. Assoc. 82: 10981103. Mirkin, B. 2001. Eleven ways to look at the chi-squared coefficient for contingency tables. Amer. Statist. 55: 111120. Mitra, S. K. 1958. On the limiting power function of the frequency chi-square test. Ann. Statist. 29: 12211233. Molenberghs, G., and E. Goetghebeur. 1997. Simple fitting algorithms for incomplete categorical data. J. Roy. Statist. Soc. Ser. B 59: 401414. Molenberghs, G., and E. Lesaffre. 1994. Marginal modeling of correlated ordinal data using a multivariate Plackett distribution. J. Amer. Statist. Assoc. 89: 633644. Molenberghs, G., M. G. Kenward, and E. Lesaffre. 1997. The analysis of longitudinal ordinal data with nonrandom drop-out. Biometrika 84: 3344. Moore, D. F. 1986a. Asymptotic properties of moment estimates for overdispersed counts and proportions. Biometrika 35: 583588. Moore, D. S. 1986b. Tests of chi-squared type. Pp. 6395 in Goodness-of-Fit Techniques, ed. R. D’Agostino and M. A. Stephens. New York: Marcel Dekker. Moore, D. F., and A. Tsiatis. 1991. 
Robust estimation of the variance in moment methods for extra-binomial and extra-Poisson variation. Biometrics 47: 383401. Morgan, B. J. T. 1992. Analysis of Quantal Response Data. London: Chapman & Hall. Morgan, W. M., and B. A. Blumenstein. 1991. Exact conditional tests for hierarchical models in multidimensional contingency tables. Appl. Statist. 40: 435442. Mosimann, J. E. 1962. On the compound multinomial distribution, the multivariate ␤-distribution and correlations among proportions. Biometrika 49: 65᎐82. Mosteller, F. 1951. Remarks on the method of paired comparisons I: The least-squares solution assuming equal standard deviations and equal correlations. Psychometrika 16: 3᎐9. Mosteller, F. 1952. Some statistical problems in measuring the subjective response to drugs. Biometrics 8: 220᎐226. Mosteller, F. 1968. Association and estimation in contingency tables. J. Amer. Statist. Assoc. 63: 1᎐28. Nair, V. N. 1987. Chi-squared-type tests for ordered alternatives in contingency tables. J. Amer. Statist. Assoc. 82: 283᎐291. Natarajan, R., and C. McCulloch. 1995. A note on the existence of the posterior distribution for a class of mixed models for binomial responses. Biometrika 82: 639᎐643. Natarajan, R., and C. McCulloch. 1998. Gibbs sampling with diffuse proper priors: A valid approach to data-driven inference? J. Comput. Graph. Statist. 7: 267᎐277. Nelder, J., and D. Pregibon. 1987. An extended quasi-likelihood function. Biometrika 74: 221᎐232. Nelder, J., and R. W. M. Wedderburn. 1972. Generalized linear models. J. Roy. Statist. Soc. Ser. A 135: 370᎐384. Nerlove, M., and S. J. Press. 1973. Univariate and multivariate log-linear and logistic models. Technical Report R-1306-EDArNIH, Rand Corporation, Santa Monica, CA. Neuhaus, J. M. 1992. Statistical methods for longitudinal and clustered designs with binary responses. Statist. Methods Medic. Res. 1: 249᎐273. 680 REFERENCES Neuhaus, J. M., and N. P. Jewell. 1990a. 
Some comments on Rosner’s multiple logistic model for clustered data. Biometrics 46: 523534. Neuhaus, J. M., and N. P. Jewell. 1990b. The effect of retrospective sampling on binary regression models for clustered data. Biometrics 46: 977990. Neuhaus, J. M., and M. L. Lesperance. 1996. Estimation efficiency in a binary mixed-effects model setting. Biometrika 83: 441446. Neuhaus, J. M., J. D. Kalbfleisch, and W. W. Hauck. 1991. A comparison of cluster-specific and population-averaged approaches for analyzing correlated binary data. Internat. Statist. Re®. 59: 2535. Neuhaus, J. M., W. W. Hauck, and J. D. Kalbfleisch. 1992. The effects of mixture distribution misspecification when fitting mixed-effects logistic models. Biometrika 79: 755762. Neuhaus, J. M., J. D. Kalbfleisch, and W. W. Hauck. 1994. Conditions for consistent estimation in mixed-effects models for binary matched-pairs data. Canad. J. Statist. 22: 139148. Newcombe, R. 1998a. Two-sided confidence intervals for the single proportion: Comparison of seven methods. Statist. Medic. 17: 857872. Newcombe, R. 1998b. Interval estimation for the difference between independent proportions: Comparison of eleven methods. Statist. Medic. 17: 873890. Newcombe, R. 2001. Logit confidence intervals and the inverse sinh transformation. Amer. Statist. 55: 200202. Neyman, J. 1935. On the problem of confidence limits. Ann. Math. Statist. 6: 111116. Neyman, J. 1949. Contributions to the theory of the ␹ 2 test. Pp. 239᎐273 in Proc. First Berkeley Symposium on Mathematical Statistics and Probability, ed. J. Neyman. Berkeley, CA: University of California Press. Nurminen, M. 1986. Confidence intervals for the ratio and difference of two binomial proportions. Biometrics 42: 675᎐676. O’Brien, P. C. 1988. Comparing two samples: Extensions of the t, rank-sum, and log-rank tests. J. Amer. Statist. Assoc. 83: 52᎐61. O’Brien, R. G. 1986. Using the SAS system to perform power analyses for log-linear models. Pp. 778᎐784 in Proc. 
11th Annual SAS Users Group Conference. Cary, NC: SAS Institute. Ochi, Y., and R. Prentice. 1984. Likelihood inference in a correlated probit regression model. Biometrika 71: 531᎐543. O’Gorman, T. W., and R. F. Woolson. 1988. Analysis of ordered categorical data using the SAS system. Pp. 957᎐963 in Proc. 13th Annual SAS Users Group Conference. Cary, NC: SAS Institute. Paik, M. 1985. A graphic representation of a three-way contingency table: Simpson’s paradox and correlation. Amer. Statist. 39: 53᎐54. Palmgren, J. 1981. The Fisher information matrix for log-linear models arguing conditionally in the observed explanatory variables. Biometrika 68: 563᎐566. Palmgren, J., and A. Ekholm. 1987. Exponential family non-linear models for categorical data with errors of observation. Appl. Stochastic Models Data Anal. 3: 111᎐124. Park, T., and M. B. Brown. 1994. Models for categorical data with nonignorable nonresponse. J. Amer. Statist. Assoc. 89: 44᎐52. Parr, W. C., and H. D. Tolley. 1982. Jackknifing in categorical data analysis. Austral. J. Statist. 24: 67᎐79. Parzen, E. 1997. Concrete statistics. Pp. 309᎐332 in Statistics of Quality. New York: Marcel Dekker. Patefield, W. M. 1982. Exact tests for trends in ordered contingency tables. Appl. Statist. Ser B 31: 32᎐43. REFERENCES 681 Patnaik, P. B. 1949. The non-central ␹ 2 and F-distributions and their applications. Biometrika 36: 202᎐232. Paul, S. R., K. Y. Liang, and S. G. Self. 1989. On testing departure from the binomial and multinomial assumptions. Biometrics 45: 231᎐236. Pearson, E. S. 1947. The choice of a statistical test illustrated on the interpretation of data classified in 2 = 2 tables. Biometrika 34: 139᎐167. Pearson, K. 1900. On a criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philos. Mag. Ser. 5 50: 157᎐175. ŽReprinted in Karl Pearson’s Early Statistical Papers, ed. E. 
S. Pearson. Cambridge: Cambridge University Press, 1948.)
Pearson, K. 1904. Mathematical contributions to the theory of evolution XIII: On the theory of contingency and its relation to association and normal correlation. Draper's Co. Research Memoirs, Biometric Series, no. 1. (Reprinted in Karl Pearson's Early Statistical Papers, ed. E. S. Pearson. Cambridge: Cambridge University Press, 1948.)
Pearson, K. 1913. On the probable error of a correlation coefficient as found from a fourfold table. Biometrika 9: 22–27.
Pearson, K. 1917. On the general theory of multiple contingency with special reference to partial contingency. Biometrika 11: 145–158.
Pearson, K. 1922. On the χ² test of goodness of fit. Biometrika 14: 186–191.
Pearson, K., and D. Heron. 1913. On theories of association. Biometrika 9: 159–315.
Peduzzi, P., J. Concato, E. Kemper, T. R. Holford, and A. R. Feinstein. 1996. A simulation study of the number of events per variable in logistic regression analysis. J. Clin. Epidemiol. 49: 1373–1379.
Pendergast, J. F., S. J. Gange, M. A. Newton, M. J. Lindstrom, M. Palta, and M. R. Fisher. 1996. A survey of methods for analyzing clustered binary response data. Internat. Statist. Rev. 64: 89–118.
Pepe, M. S. 2000. Receiver operating characteristic methodology. J. Amer. Statist. Assoc. 95: 308–311.
Peterson, B., and F. E. Harrell, Jr. 1990. Partial proportional odds models for ordinal response variables. Appl. Statist. 39: 205–217.
Pierce, D. A., and D. Peters. 1992. Practical use of higher order asymptotics for multiparameter exponential families. J. Roy. Statist. Soc. Ser. B 54: 701–725.
Pierce, D. A., and D. Peters. 1999. Improving on exact tests by approximate conditioning. Biometrika 86: 265–277.
Pierce, D. A., and B. R. Sands. 1975. Extra-Bernoulli variation in regression of binary data. Technical Report 46, Statistics Department, Oregon State University, Corvallis, OR.
Pierce, D. A., and D. W. Schafer. 1986. Residuals in generalized linear models. J. Amer. Statist.
Assoc. 81: 977–983.
Plackett, R. L. 1962. A note on interactions in contingency tables. J. Roy. Statist. Soc. Ser. B 24: 162–166.
Plackett, R. L. 1964. The continuity correction in 2 × 2 tables. Biometrika 51: 327–337.
Plackett, R. L. 1983. Karl Pearson and the chi-squared test. Internat. Statist. Rev. 51: 59–72.
Podgor, M. J., J. L. Gastwirth, and C. R. Mehta. 1996. Efficiency robust tests of independence in contingency tables with ordered classifications. Statist. Medic. 15: 2095–2105.
Poisson, S.-D. 1837. Recherches sur la probabilité des jugements en matière criminelle et en matière civile, précédées des règles générales du calcul des probabilités. Paris: Bachelier.
Pratt, J. W. 1981. Concavity of the log likelihood. J. Amer. Statist. Assoc. 76: 103–106.
Pregibon, D. 1980. Goodness of link tests for generalized linear models. Appl. Statist. 29: 15–24.
Pregibon, D. 1981. Logistic regression diagnostics. Ann. Statist. 9: 705–724.
Pregibon, D. 1982. Score tests in GLIM with application. Pp. 87–97 in Lecture Notes in Statistics, 14: GLIM 82, Proc. International Conference on Generalised Linear Models, ed. R. Gilchrist. New York: Springer-Verlag.
Prentice, R. 1976a. Use of the logistic model in retrospective studies. Biometrics 32: 599–606.
Prentice, R. 1976b. Generalization of the probit and logit methods for dose response curves. Biometrics 32: 761–768.
Prentice, R. 1986. Binary regression using an extended beta-binomial distribution, with discussion of correlation induced by covariate measurement errors. J. Amer. Statist. Assoc. 81: 321–327.
Prentice, R., and N. Breslow. 1978. Retrospective studies and failure time models. Biometrika 65: 153–158.
Prentice, R., and L. A. Gloeckler. 1978. Regression analysis of grouped survival data with application to breast cancer data. Biometrics 34: 57–67.
Prentice, R., and R. Pyke. 1979. Logistic disease incidence models and case-control studies. Biometrika 66: 403–412.
Prentice, R., and L. P. Zhao.
1991. Estimating equations for parameters in means and covariances of multivariate discrete and continuous responses. Biometrics 47: 825–839.
Press, S. J., and S. Wilson. 1978. Choosing between logistic regression and discriminant analysis. J. Amer. Statist. Assoc. 73: 699–705.
Qu, A., B. G. Lindsay, and B. Li. 2000. Improving generalised estimating equations using quadratic inference functions. Biometrika 87: 823–836.
Quine, M. P., and E. Seneta. 1987. Bortkiewicz's data and the law of small numbers. Internat. Statist. Rev. 55: 173–181.
Rabe-Hesketh, S., and A. Skrondal. 2001. Parameterisation of multivariate random effects models for categorical data. Biometrics 57.
Raftery, A. E. 1986. Choosing models for cross-classification. Amer. Sociol. Rev. 51: 145–146.
Rao, C. R. 1957. Maximum likelihood estimation for the multinomial distribution. Sankhya 18: 139–148.
Rao, C. R. 1963. Criteria of estimation in large samples. Sankhya 25: 189–206.
Rao, C. R. 1973. Linear Statistical Inference and Its Applications, 2nd ed. New York: Wiley.
Rao, C. R. 1982. Diversity: Its measurement, decomposition, apportionment, and analysis. Sankhya Ser. A 44: 1–22.
Rao, J. N. K., and A. J. Scott. 1987. On simple adjustments to chi-square tests with sample survey data. Ann. Statist. 15: 385–397.
Rao, J. N. K., and D. R. Thomas. 1988. The analysis of cross-classified categorical data from complex sample surveys. Sociol. Methodol. 18: 213–270.
Rasch, G. 1961. On general laws and the meaning of measurement in psychology. Pp. 321–333 in Proc. 4th Berkeley Symposium on Mathematics, Statistics, and Probability, Vol. 4, ed. J. Neyman. Berkeley, CA: University of California Press.
Rayner, J. C. W., and D. J. Best. 2001. A Contingency Table Approach to Nonparametric Testing. London: Chapman & Hall.
Read, T. R. C., and N. A. C. Cressie. 1988. Goodness-of-Fit Statistics for Discrete Multivariate Data. New York: Springer-Verlag.
Rice, W. R. 1988.
A new probability model for determining exact P-values for 2 × 2 contingency tables when comparing binomial proportions. Biometrics 44: 1–22.
Ritov, Y., and Z. Gilula. 1991. The order-restricted RC model for ordered contingency tables: Estimation and testing for fit. Ann. Statist. 19: 2090–2101.
Robins, J., N. Breslow, and S. Greenland. 1986. Estimators of the Mantel–Haenszel variance consistent in both sparse data and large-strata limiting models. Biometrics 42: 311–323.
Robins, J., A. Rotnitzky, and L. P. Zhao. 1995. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J. Amer. Statist. Assoc. 90: 106–121.
Röhmel, J., and U. Mansmann. 1999. Unconditional non-asymptotic one-sided tests for independent binomial proportions when the interest lies in showing non-inferiority and/or superiority. Biometrical J. 41: 149–170.
Rosenbaum, P. R., and D. R. Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70: 41–55.
Rosner, B. 1984. Multivariate methods in ophthalmology with application to other paired-data situations. Biometrics 40: 1025–1035.
Rosner, B. 1989. Multivariate methods for clustered binary data with more than one level of nesting. J. Amer. Statist. Assoc. 84: 373–380.
Rotnitzky, A., and N. P. Jewell. 1990. Hypothesis testing of regression parameters in semiparametric generalized linear models for cluster correlated data. Biometrika 77: 485–497.
Routledge, R. D. 1992. Resolving the conflict over Fisher's exact test. Canad. J. Statist. 20: 201–209.
Routledge, R. D. 1994. Practicing safe statistics with the mid-p*. Canad. J. Statist. 22: 103–110.
Roy, S. N., and M. A. Kastenbaum. 1956. On the hypothesis of no "interaction" in a multiway contingency table. Ann. Math. Statist. 27: 749–757.
Roy, S. N., and S. K. Mitra. 1956. An introduction to some nonparametric generalizations of analysis of variance and multivariate analysis. Biometrika 43: 361–376.
Rudas, T., C. C. Clogg, and B. G. Lindsay. 1994. A new index of fit based on mixture methods for the analysis of contingency tables. J. Roy. Statist. Soc. Ser. B 56: 623–639.
Ryan, L. 1992. Quantitative risk assessment for developmental toxicity. Biometrics 48: 163–174.
Ryan, L. 1995. Comment on article by Liang and Zeger. Statist. Sci. 10: 189–193.
Samuels, M. L. 1993. Simpson's paradox and related phenomena. J. Amer. Statist. Assoc. 88: 81–88.
Santner, T. J., and M. K. Snell. 1980. Small-sample confidence intervals for p1 − p2 and p1/p2 in 2 × 2 contingency tables. J. Amer. Statist. Assoc. 75: 386–394.
Santner, T. J., and S. Yamagami. 1993. Invariant small sample confidence intervals for the difference of two success probabilities. Commun. Statist. Ser. B 22: 33–59.
Schafer, J. L. 1997. Analysis of Incomplete Multivariate Data. London: Chapman & Hall.
Schluchter, M. D., and K. L. Jackson. 1989. Log-linear analysis of censored survival data with partially observed covariates. J. Amer. Statist. Assoc. 84: 42–52.
Scott, A., and C. Wild. 2001. Case–control studies with complex sampling. Appl. Statist. 50: 389–401.
Seeber, G. 1998. Poisson regression. Pp. 3404–3412 in Encyclopedia of Biostatistics. Chichester, UK: Wiley.
Sekar, C. C., and W. E. Deming. 1949. On a method of estimating birth and death rates and the extent of registration. J. Amer. Statist. Assoc. 44: 101–115.
Self, S. G., and K.-Y. Liang. 1987. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Amer. Statist. Assoc. 82: 605–610.
Sen, P. K., and J. M. Singer. 1993. Large Sample Methods in Statistics: An Introduction with Applications. London: Chapman & Hall.
Shapiro, S. H. 1982. Collapsing contingency tables: A geometric approach. Amer. Statist. 36: 43–46.
Shuster, J., and D. Downing. 1976. Two-way contingency tables for complex sampling schemes. Biometrika 63: 271–276.
Silvapulle, M. J. 1981.
On the existence of maximum likelihood estimators for the binomial response models. J. Roy. Statist. Soc. Ser. B 43: 310–313.
Simon, G. 1973. Additivity of information in exponential family probability laws. J. Amer. Statist. Assoc. 68: 478–482.
Simon, G. 1974. Alternative analyses for the singly-ordered contingency table. J. Amer. Statist. Assoc. 69: 971–976.
Simon, G. 1978. Efficacies of measures of association for ordinal contingency tables. J. Amer. Statist. Assoc. 73: 545–551.
Simonoff, J. 1983. A penalty function approach to smoothing large sparse contingency tables. Ann. Statist. 11: 208–218.
Simonoff, J. 1986. Jackknifing and bootstrapping goodness-of-fit statistics in sparse multinomials. J. Amer. Statist. Assoc. 81: 1005–1011.
Simonoff, J. S. 1996. Smoothing Methods in Statistics. New York: Springer-Verlag.
Simonoff, J. S. 1998. Three sides of smoothing: Categorical data smoothing, nonparametric regression, and density estimation. Internat. Statist. Rev. 66: 137–156.
Simpson, E. H. 1949. The measurement of diversity. Nature 163: 699.
Simpson, E. H. 1951. The interpretation of interaction in contingency tables. J. Roy. Statist. Soc. Ser. B 13: 238–241.
Skellam, J. G. 1948. A probability distribution derived from the binomial distribution by regarding the probability of success as variable between the sets of trials. J. Roy. Statist. Soc. Ser. B 10: 257–261.
Skene, A. M., and J. C. Wakefield. 1990. Hierarchical models for multicentre binary response studies. Statist. Medic. 9: 919–929.
Slaton, T. L., W. W. Piegorsch, and S. D. Durham. 2000. Estimation and testing with overdispersed proportions using the beta-logistic regression model of Heckman and Willis. Biometrics 56: 125–133.
Small, K. A. 1987. A discrete choice model for ordered alternatives. Econometrica 55: 409–424.
Smith, K. W. 1976. Table standardization and table shrinking: Aids in the traditional analysis of contingency tables. Social Forces 54: 669–693.
Smith, P. W. F., J. J. Forster, and J.
W. McDonald. 1996. Monte Carlo exact tests for square contingency tables. J. Roy. Statist. Soc. Ser. A 159: 309–321.
Snell, E. J. 1964. A scaling procedure for ordered categorical data. Biometrics 20: 592–607.
Somers, R. H. 1962. A new asymmetric measure of association for ordinal variables. Amer. Sociol. Rev. 27: 799–811.
Speed, T. 1998. Iterative proportional fitting. Pp. 2116–2119 in Encyclopedia of Biostatistics. Chichester, UK: Wiley.
Spiegelhalter, D. J., and A. F. M. Smith. 1982. Bayes factors for linear and log-linear models with vague prior information. J. Roy. Statist. Soc. Ser. B 44: 377–387.
Spitzer, R. L., J. Cohen, J. L. Fleiss, and J. Endicott. 1967. Quantification of agreement in psychiatric diagnosis. Arch. Gen. Psychiatry 17: 83–87.
Sprott, D. A. 2000. Statistical Inference in Science. New York: Springer-Verlag.
Stern, S. 1997. Simulation-based estimation. J. Econ. Literature 35: 2006–2039.
Sterne, T. E. 1954. Some remarks on confidence or fiducial limits. Biometrika 41: 275–278.
Stevens, S. S. 1951. Mathematics, measurement, and psychophysics. Pp. 1–49 in Handbook of Experimental Psychology, ed. S. S. Stevens. New York: Wiley.
Stevens, W. L. 1950. Fiducial limits of the parameter of a discontinuous distribution. Biometrika 37: 117–129.
Stigler, S. 1986. The History of Statistics: The Measurement of Uncertainty before 1900. Cambridge, MA: Harvard University Press.
Stigler, S. 1994. Citation patterns in the journals of statistics and probability. Statist. Sci. 9: 94–108.
Stigler, S. 1999. Statistics on the Table. Cambridge, MA: Harvard University Press.
Stiratelli, R., N. Laird, and J. H. Ware. 1984. Random-effects models for serial observations with binary response. Biometrics 40: 1025–1035.
Stokes, M. E., C. S. Davis, and G. G. Koch. 2000. Categorical Data Analysis Using the SAS System, 2nd ed. Cary, NC: SAS Institute.
Strawderman, R. L., and M. T. Wells. 1998. Approximately exact inference for the common odds ratio in several 2 × 2 tables. J.
Amer. Statist. Assoc. 93: 1294–1307.
Stuart, A. 1955. A test for homogeneity of the marginal distributions in a two-way classification. Biometrika 42: 412–416.
Stukel, T. A. 1988. Generalized logistic models. J. Amer. Statist. Assoc. 83: 426–431.
Suissa, S., and J. J. Shuster. 1984. Are uniformly most powerful unbiased tests really best? Amer. Statist. 38: 204–206.
Suissa, S., and J. J. Shuster. 1985. Exact unconditional sample sizes for the 2 by 2 binomial trial. J. Roy. Statist. Soc. Ser. A 148: 317–327.
Suissa, S., and J. J. Shuster. 1991. The 2 × 2 matched-pairs trial: Exact unconditional design and analysis. Biometrics 47: 361–372.
Sundberg, R. 1975. Some results about decomposable (or Markov-type) models for multidimensional contingency tables: Distribution of marginals and partitioning of tests. Scand. J. Statist. 2: 71–79.
Tango, T. 1998. Equivalence test and confidence interval for the difference in proportions for the paired-sample design. Statist. Medic. 17: 891–908.
Tanner, M. A., and M. A. Young. 1985. Modelling agreement among raters. J. Amer. Statist. Assoc. 80: 175–180.
Tarone, R. E. 1985. On heterogeneity tests based on efficient scores. Biometrika 72: 91–95.
Tarone, R. E., and J. J. Gart. 1980. On the robustness of combined tests for trends in proportions. J. Amer. Statist. Assoc. 75: 110–116.
Tarone, R. E., J. J. Gart, and W. W. Hauck. 1983. On the asymptotic relative efficiency of certain noniterative estimators of a common relative risk or odds ratio. Biometrika 70: 519–522.
Tavaré, S., and P. M. E. Altham. 1983. Serial dependence of observations leading to contingency tables, and corrections to chi-squared statistics. Biometrika 70: 139–144.
Ten Have, T. R. 1996. A mixed effects model for multivariate ordinal response data including correlated discrete failure times with ordinal responses. Biometrics 52: 473–491.
Ten Have, T. R., and A. R. Localio. 1999. Empirical Bayes estimation of random effects parameters in mixed effects logistic regression models.
Biometrics 55: 1022–1029.
Ten Have, T. R., and A. Morabia. 1999. Mixed effects models with bivariate and univariate association parameters for longitudinal bivariate binary response data. Biometrics 55: 85–93.
Ten Have, T. R., and D. H. Uttal. 1994. Subject-specific and population-averaged continuation ratio logit models for multiple discrete time survival profiles. Appl. Statist. 43: 371–384.
Theil, H. 1969. A multinomial extension of the linear logit model. Internat. Econ. Rev. 10: 251–259.
Theil, H. 1970. On the estimation of relationships involving qualitative variables. Amer. J. Sociol. 76: 103–154.
Thompson, R., and R. J. Baker. 1981. Composite link functions in generalized linear models. Appl. Statist. 30: 125–131.
Thompson, W. A. 1977. On the treatment of grouped observations in life studies. Biometrics 33: 463–470.
Thurstone, L. L. 1927. The method of paired comparisons for social values. J. Abnormal Social Psych. 21: 384–400.
Tjur, T. 1982. A connection between Rasch's item analysis model and a multiplicative Poisson model. Scand. J. Statist. 9: 23–30.
Tocher, K. D. 1950. Extension of the Neyman–Pearson theory of tests to discontinuous variates. Biometrika 37: 130–144.
Toledano, A., and C. Gatsonis. 1996. Ordinal regression methodology for ROC curves derived from correlated data. Statist. Medic. 15: 1807–1826.
Train, K. 1986. Qualitative Choice Analysis: Theory, Econometrics, and an Application. Cambridge, MA: MIT Press.
Tsiatis, A. A. 1980. A note on the goodness-of-fit test for the logistic regression model. Biometrika 67: 250–251.
Tutz, G. 1989. Compound regression models for ordered categorical data. Biometrical J. 31: 259–272.
Tutz, G. 1991. Sequential models in categorical regression. Comput. Statist. Data Anal. 11: 275–295.
Tutz, G., and W. Hennevogl. 1996. Random effects in ordinal regression models. Comput. Statist. Data Anal. 22: 537–557.
Uebersax, J. S. 1993. Statistical modeling of expert ratings on medical treatment appropriateness. J. Amer.
Statist. Assoc. 88: 421–427.
Uebersax, J. S., and W. M. Grove. 1990. Latent class analysis of diagnostic agreement. Statist. Medic. 9: 559–572.
Uebersax, J. S., and W. M. Grove. 1993. A latent trait finite mixture model for the analysis of rating agreement. Biometrics 49: 823–835.
van der Heijden, P. G. M., and J. de Leeuw. 1985. Correspondence analysis: A complement to log-linear analysis. Psychometrika 50: 429–447.
van der Heijden, P. G. M., A. de Falguerolles, and J. de Leeuw. 1989. A combined approach to contingency table analysis using correspondence analysis and log-linear analysis. Appl. Statist. 38: 249–292.
Verbeke, G., and E. Lesaffre. 1996. A linear mixed-effects model with heterogeneity in the random-effects population. J. Amer. Statist. Assoc. 91: 217–221.
Verbeke, G., and G. Molenberghs. 2000. Linear Mixed Models for Longitudinal Data. New York: Springer-Verlag.
Wald, A. 1943. Tests of statistical hypotheses concerning several parameters when the number of observations is large. Trans. Amer. Math. Soc. 54: 426–482.
Walker, S. H., and D. B. Duncan. 1967. Estimation of the probability of an event as a function of several independent variables. Biometrika 54: 167–179.
Walley, P. 1996. Inferences from multinomial data: Learning about a bag of marbles. J. Roy. Statist. Soc. Ser. B 58: 3–34.
Wardrop, R. L. 1995. Simpson's paradox and the hot hand in basketball. Amer. Statist. 49: 24–28.
Ware, J. H., S. Lipsitz, and F. E. Speizer. 1988. Issues in the analysis of repeated categorical outcomes. Statist. Medic. 7: 95–107.
Watson, G. S. 1956. Missing and "mixed up" frequencies in contingency tables. Biometrics 12: 47–50.
Watson, G. S. 1959. Some recent results in chi-square goodness-of-fit tests. Biometrics 15: 440–468.
Wedderburn, R. W. M. 1974. Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika 61: 439–447.
Wedderburn, R. W. M. 1976.
On the existence and uniqueness of the maximum likelihood estimates for certain generalized linear models. Biometrika 63: 27–32.
Wermuth, N. 1976. Model search among multiplicative models. Biometrics 32: 253–263.
Wermuth, N. 1987. Parametric collapsibility and the lack of moderating effects in contingency tables with a dichotomous response variable. J. Roy. Statist. Soc. Ser. B 49: 353–364.
Westfall, P. H., and R. D. Wolfinger. 1997. Multiple tests with discrete distributions. Amer. Statist. 51: 3–8.
Westfall, P. H., and S. S. Young. 1993. Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. New York: Wiley.
White, H. 1982. Maximum likelihood estimation of misspecified models. Econometrica 50: 1–26.
White, A. A., J. R. Landis, and M. M. Cooper. 1982. A note on the equivalence of several marginal homogeneity test criteria for categorical data. Internat. Statist. Rev. 50: 27–34.
Whitehead, J. 1993. Sample size calculations for ordered categorical data. Statist. Medic. 12: 2257–2271.
Whittaker, J. 1990. Graphical Models in Applied Multivariate Statistics. New York: Wiley.
Whittaker, J., and M. Aitkin. 1978. A flexible strategy for fitting complex log-linear models. Biometrics 34: 487–495.
Whittemore, A. S. 1978. Collapsibility of multidimensional tables. J. Roy. Statist. Soc. Ser. B 40: 328–340.
Whittemore, A. S. 1981. Sample size for logistic regression with small response probability. J. Amer. Statist. Assoc. 76: 27–32.
Wilks, S. S. 1935. The likelihood test of independence in contingency tables. Ann. Math. Statist. 6: 190–196.
Wilks, S. S. 1938. The large-sample distribution of the likelihood ratio for testing composite hypotheses. Ann. Math. Statist. 9: 60–62.
Williams, D. A. 1975. The analysis of binary responses from toxicological experiments involving reproduction and teratogenicity. Biometrics 31: 949–952.
Williams, D. A. 1982. Extra-binomial variation in logistic linear models. Appl. Statist. 31: 144–148.
Williams, D. A. 1987.
Generalized linear model diagnostics using the deviance and single-case deletions. Appl. Statist. 36: 181–191.
Williams, D. A. 1988. Comments on "The impact of litter effects on dose–response modeling in teratology." Biometrics 44: 305–308.
Williams, E. J. 1952. Use of scores for the analysis of association in contingency tables. Biometrika 39: 274–289.
Williams, O. D., and J. E. Grizzle. 1972. Analysis for contingency tables having ordered response categories. J. Amer. Statist. Assoc. 67: 55–63.
Wilson, E. B. 1927. Probable inference, the law of succession, and statistical inference. J. Amer. Statist. Assoc. 22: 209–212.
Wolfinger, R., and M. O'Connell. 1993. Generalized linear mixed models: A pseudo-likelihood approach. J. Statist. Comput. Simul. 48: 233–243.
Wong, G. Y., and W. M. Mason. 1985. The hierarchical logistic regression model for multilevel analysis. J. Amer. Statist. Assoc. 80: 513–524.
Woolf, B. 1955. On estimating the relation between blood group and disease. Ann. Human Genet. (London) 19: 251–253.
Woolson, R. F., and W. R. Clarke. 1984. Analysis of categorical incomplete longitudinal data. J. Roy. Statist. Soc. Ser. A 147: 87–99.
Wu, C. F. J. 1985. Efficient sequential designs with binary data. J. Amer. Statist. Assoc. 80: 974–984.
Yang, I., and M. P. Becker. 1997. Latent variable modeling of diagnostic accuracy. Biometrics 53: 948–958.
Yates, F. 1934. Contingency tables involving small numbers and the χ² test. J. Roy. Statist. Soc. Suppl. 1: 217–235.
Yates, F. 1948. The analysis of contingency tables with grouping based on quantitative characters. Biometrika 35: 176–181.
Yates, F. 1984. Tests of significance for 2 × 2 contingency tables. J. Roy. Statist. Soc. Ser. A 147: 426–463.
Yee, T. W., and C. J. Wild. 1996. Vector generalized additive models. J. Roy. Statist. Soc. Ser. B 58: 481–493.
Yerushalmy, J. 1947. Statistical problems in assessing methods of medical diagnosis, with special reference to x-ray techniques. Public Health Rep.
62: 1432–1449.
Yule, G. U. 1900. On the association of attributes in statistics. Philos. Trans. Roy. Soc. London Ser. A 194: 257–319.
Yule, G. U. 1903. Notes on the theory of association of attributes in statistics. Biometrika 2: 121–134.
Yule, G. U. 1906. On a property which holds good for all groupings of a normal distribution of frequency for two variables, with application to the study of contingency tables for the inheritance of unmeasured qualities. Proc. Roy. Soc. Ser. A 77: 324–336.
Yule, G. U. 1912. On the methods of measuring association between two attributes. J. Roy. Statist. Soc. 75: 579–642.
Zeger, S. L., and M. R. Karim. 1991. Generalized linear models with random effects: A Gibbs sampling approach. J. Amer. Statist. Assoc. 86: 79–86.
Zeger, S. L., K.-Y. Liang, and P. S. Albert. 1988. Models for longitudinal data: A generalized estimating equation approach. Biometrics 44: 1049–1060.
Zelen, M. 1971. The analysis of several 2 × 2 contingency tables. Biometrika 58: 129–137.
Zelen, M. 1991. Multinomial response models. Comput. Statist. Data Anal. 12: 249–254.
Zellner, A., and P. E. Rossi. 1984. Bayesian analysis of dichotomous quantal response models. J. Economet. 25: 365–393.
Zelterman, D. 1987. Goodness-of-fit tests for large sparse multinomial distributions. J. Amer. Statist. Assoc. 82: 624–629.
Zermelo, E. 1929. Die Berechnung der Turnier-Ergebnisse als ein Maximumproblem der Wahrscheinlichkeitsrechnung. Math. Z. 29: 436–460.
Zhang, H., J. Crowley, H. Sox, and R. Olshen. 1998. Tree-structured statistical methods. Pp. 4561–4573 in Encyclopedia of Biostatistics. Chichester, UK: Wiley.
Zheng, B., and A. Agresti. 2000. Summarizing the predictive power of a generalized linear model. Statist. Medic. 19: 1771–1781.
Zhu, Y., and N. Reid. 1994. Information, ancillarity, and sufficiency in the presence of nuisance parameters. Canad. J. Statist. 22: 111–123.
Categorical Data Analysis, Second Edition. Alan Agresti Copyright © 2002 John Wiley & Sons, Inc.
ISBN: 0-471-36093-7

Examples Index

Abortion and education, 345
Abortion opinions, 29, 205–206, 441, 486, 504–506, 553
Admissions into Berkeley, 62–63
Admissions into Florida, 223–224, 529
Afterlife, belief in, 302–303
AIDS and AZT use, 184–187
AIDS, measures to deal with, 347
Air pollution and breathing, 377–378
Alcohol, cigarettes, and marijuana use, 322–326, 361–363, 367, 482–483, 528
Alcohol consumption and malformation, 89–90, 158, 179–180, 182
Alcohol and driving, 203
Alligator food choice, 268–274, 304
Alzheimer's disease and cognitive impairment, 310
Aspirations by income, 107
Aspirin and heart attacks, 37, 46, 71–72
Automobile collisions and seat belts, 40–41, 61, 305–306, 327–329, 331, 349, 361
Baseball complete games, 157–158
Baseball standings, 437–438
Beetle mortality, 247–250
Birth control, teenage, 352
Blood pressure and heart disease, 221–223
Breast cancer, 38, 105, 107
Breathing test and smoking, 307, 377–378
Breathlessness, wheeze, and age, 378
Buchanan vote in Palm Beach County, 156–157
Busing and race, 348
Calves and pneumonia, 25–26, 34
Cancer of larynx and radiation therapy, 107
Cancer remission, 197–199, 261
Capture–recapture, hepatitis, 533
Capture–recapture of snowshoe hares, 511–513, 544–545, 551–552
Carcinoma of uterine cervix, 431–435, 532, 541–544, 549–551
Chlorophyll inheritance, 29
Cholesterol and cereal, 309
Claritin, 109
Clinical trials, 230–236, 507–510
Coffee drinking, 446
Cola drink taste test, 448
Condoms and adolescents, 202
Coronary deaths and smoking, 404
Credit card and income (Italy), 206
Crime and race, 63
Crossover drug trial, 457, 483–484
Death penalty and race, 48–52, 63, 65, 201
Depression, mental, 459–461, 468–469, 506–507
Developmental toxicity study, 290–291, 517–521
Diabetes, case-control study, 418–419
Diagnostic tests, 60, 66
Diarrhea, 255
Draft position in sports, 207
Dumping severity, 308–309
Dysmenorrhea, 483–484, 572
Esophageal cancer, 203
Fish egg hatching, 568–569
Free throws, 105, 160–161
Gambler's ruin, 489–490
Genetics, 165
Government spending,
349–351, 449, 530–531
Graduate admissions at Florida, 223–224, 529
Graduate admissions at Berkeley, 62–63
Graham Greene, 28
Gun-related deaths, 61
Heart attacks and aspirin use, 37, 46
Heart catheterization and race, 62
Heart disease and blood pressure, 221–223
Heart disease and snoring, 121–123
Heart valve replacement and survival, 385–387
Hepatitis outbreaks, 533
Home team advantage in baseball, 437–438
Homicide victims, number, 561–563, 564–565, 571
Horseshoe crab mating, 126–131, 154–155, 159, 168–170, 173–176, 188–192, 212–216, 570
Income by year, 308
Infant survival, gestation, smoking, and age, 400–401
Insomnia, 462–464, 469, 487, 514–515, 531
Job satisfaction and income, 57–59, 87–88, 287–288, 295, 297, 308
Job satisfaction and race, gender, age, and location, 205
Journal citations, 448
Kyphosis and spinal surgery, 199–200
Labelling index and remission, 197–199, 261
Larry Bird free throws, 105
Leading crowd, 516–517, 532
Leprosy, 239
Life table, 284
Lung cancer and chemotherapy, 306
Lung cancer and smoking, 42, 61, 62, 64
Lung cancer survival, 390–391
Malformation of infants, 89–90, 158, 179–180, 182
Mendel's theories, 22–23
Mental health, and parents' SES, 381, 383–384
Mental impairment, life events and SES, 279–282
Migration, 423, 427–428
Missing people in London, 202
Mixture for two protozoan genuses, 546
Motor vehicle accident rates, 403
Movie reviewers, 445–446
Multicenter clinical trial, infection cream, 230–235, 508–510
Multicenter clinical trial, fungal infections, 394–395, 530
Multiple sclerosis and neurologist ratings, 447
Murder rates in U.S., 62, 63
Myocardial infarction and aspirin, 37, 46, 71–72
Myocardial infarction and diabetes, 418–419
NCAA graduation rates, 202
Nervousness and Claritin, 109
Obesity, occasion and gender, 487
Occupational aspirations, 206
Occupational status, father and son, 447
Oral contraceptive use, 200
Osteosarcoma, 262–263
Palm Beach County vote for Buchanan, 156–157
Party identification by race and by gender, 105–106, 303
Party
identification and protestors, 307
Pathologists' ratings of carcinoma, 431–435, 532, 541–544, 549–551
Penicillin and rabbits, 259–260
Pig farmer survey, 484–485
Pneumonia infections, 25–26, 34
Poison dose for protozoa, 546–547
Political ideology and party affiliation, 305, 375–377
Pregnancy rates, 567
Presidential approval rating, 409–412
Presidential vote, by state, 503–504, 534
Promotion discrimination, 254–255
Prussian army and mule kicks, 30
Psychiatric patients and prescribed drugs, 106–107
Religious fundamentalism and education, 80, 81–82
Religious services, frequency of attendance, 352
Respiratory illness, age and maternal smoking, 480–481
Respiratory illness in children, 478–479
Satisfaction with housing, 310
Satisfaction with job, 205
Schizophrenia origin, 83–84
Seat belts and injury, 40–41, 61, 305–306, 327–329, 331, 349, 361
Sex, frequency of, 569–570
Sex opinions, 65, 217–219, 368, 371–373, 421, 430, 431, 530
Sexual intercourse, gender and race, 201
Shopping choice, 300
Snowshoe hares, 511–513, 544–545, 551–552
Soccer and arrests, 403
Sore throat in surgery, 204
Space shuttle, 199
Student survey (alcohol, marijuana, cigarettes), 322–326, 361–363, 367, 482–483, 528
Tea drinker, 92, 100
Teenage birth control, 368, 371–373
Tennis rankings, 449
Teratology studies, 151–153
Titanic, 61
Toxicity study, 517–520
Train accidents, 403, 569
UFOs, 106
Vegetarianism, 16–17, 29
Veterinary information sources, 484–485
Voting, proportion by state, 503–504, 534

Author Index

Adelbasit, K. M., 196
Agresti, A., 27, 32, 33, 60, 100, 101, 102, 104, 111, 156, 227, 255, 258, 266, 298, 301, 379, 384, 397, 399, 422, 426, 435, 443, 445, 453, 465, 481, 485, 491, 502, 511, 513, 517, 518, 526, 527, 533, 536, 551, 552, 565, 567, 596, 630, 651
Aitchison, J., 265, 301, 465, 561, 613, 625
Aitken, C. G.
G., 613
Aitkin, M., 155, 388, 398, 399, 495, 520, 526, 545, 565, 633
Albert, A., 195, 197
Albert, J. H., 607, 609
Albert, P. S., 688
Allison, P. D., 633, 643
Altham, P. M. E., xv, 59, 103, 104, 240, 442, 443, 555, 566, 573, 587, 608
Amemiya, T., 227, 258, 300
Andersen, E. B., 255, 399, 450, 496, 526, 576, 631
Anderson, C. J., 398, 399
Anderson, D. A., 520, 526
Anderson, D. R., 216
Anderson, J. A., 171, 195, 196, 197, 207, 277
Anderson, R. L., 631
Anderson, T. W., 478, 482, 490
Aranda-Ordaz, F. J., 250, 399
Arminger, G., 549
Armitage, P., 104, 181
Ashford, J. R., 258, 379
Asmussen, S., 360
Azzalini, A., 480
Baglivo, J., 346
Baker, R. J., 283
Baker, S. G., 393, 482, 541
Banerjee, C., 443
Baptista, J., 100
Barnard, G. A., 95, 104, 114
Barndorff-Nielsen, O. E., 266
Bartholomew, D. J., 526, 565
Bartlett, M. S., 265, 623–624, 631
Becker, M., 370, 399, 435, 443, 544, 644
Bedrick, E. J., 77, 103
Begg, C. B., 273
Beitler, P. J., 508
Benedetti, J. K., 102, 398
Benichou, J., 66
Benzécri, J. P., 399, 624
Berger, R., 18, 33, 95, 594
Bergsma, W. P., 481
Berkson, J., 80, 104, 166, 197, 612, 624, 631
Berry, G., 104
Berry, S. M., 207
Best, D. J., 103
Bhapkar, V. P., 27, 103, 104, 291, 422, 453, 488, 612, 615, 616, 629
Bickel, P., 63
Biggeri, A., 566
Billingsley, P., 482
Birch, M. W., 255, 263, 295, 298, 336, 339, 340, 341, 346, 369, 392, 576, 585, 627, 631
Bishop, Y. M. M., 347, 360, 366, 452, 482, 526, 576, 582, 587, 591, 594, 627, 629, 631
Blaker, H., 20, 27, 93, 635
Bliss, C., 246, 247, 560, 623
Blyth, C. R., 20, 27, 32, 59
Bock, R. D., 300, 301, 495, 526, 624, 625
Böckenholt, U., 398, 399, 443
Bonney, G. E., 479, 480
Boos, D. D., 95, 467, 481
Booth, J., xv, 104, 223, 397, 443, 523, 525, 567, 630
Bowker, A. H., 424
Box, J. F., 23, 92, 623, 624
Bradley, R. A., 302, 436, 443
Breslow, N., 51, 59, 155, 156, 171, 234, 235, 255, 258, 399, 419, 493, 523, 524, 563, 625, 631
Brier, S. S., 515
Brooks, S. P., 566
Bross, I. D. J., 111
B., 102, 398 Brown, P. J., 613, 614 Brownstone, D., 302, 527 Bull, S. B., 196 Burnham, K. P., 216, 526, 552 Burridge, J., 283 Butler, R., 104, 397, 443, 630 Byar, D. P., 232, 295, 414, 481, 625 Caffo, B., xv, 102, 656 Cameron, A. C., 131, 155, 561, 566, 574 Carey, V., 474 Carroll, R. J., 171, 467 Casella, G., 18, 33, 594 Catalano, P. J., 527 Caussinus, H., 425, 427, 428, 443, 451, 627, 631 Chaloner, K., 196, 609 Chamberlain, G., 419, 420, 526 Chambers, E. A., 258 Chambers, J. M., 633 Chambers, R. L., 527 Chan, I., 104 Chan, J. S. K., 523 Chao, A., 513, 526, 533 Chapman, D. G., 258 Chen, Z., 527, 643, 651 Chib, S., 609 Christensen, R., 196 Chuang, C., 399, 462 Clayton, D. G., 388, 399, 493, 523, 563, 625, 631 Clogg, C. C., 103, 391, 399, 565, 627 Clopper, C. J., 18 Cochran, W. G., 27, 80, 88, 163, 181, 232, 239, 396, 459, 488, 596, 626, 627, 631 Coe, P. R., 101 Cohen, A., 103, 104 Cohen, J., 434, 435, 443 Coleman, J. S., 516, 532 Collett, D., 196, 204 Conaway, M. R., 426, 482, 526, 565 Cook, R. D., 225 Copas, J. B., 156, 257, 442, 616 Corcoran, C., 197, 573 Cormack, R. M., 511, 526, 551 Cornfield, J., 42, 47, 51, 71, 77, 99, 100, 171, 196, 208, 221, 624 Coull, B. A., xv, 27, 32, 33, 513, 518, 526, 533, 552, 567, 655, 662 Cox, C., 266, 282, 286, 576, 587, 641 Cox, D. R., 12, 104, 133, 138, 196, 197, 258, 415, 482, 493, 497, 624, 625, 631 Cramér, H., 112, 576, 587, 625–626 Cressie, N., 27, 112, 258, 396, 612 Croon, M., 481 Crouchley, R., 527 Crowder, M. J., 555, 566 D'Agostino, R. B., Jr., 196 Dalal, S. R., 199 Daniels, M. J., 524, 609 Dardanoni, V., 301 Darroch, J. N., 347, 357, 398, 414, 426, 443, 459, 481, 513, 526, 551, 552, 565, 626, 629, 631 Das Gupta, S., 237 David, H. A., 443 Davis, L. J., 59, 93, 398 Davison, A. C., 156, 594 Dawson, R. B., Jr., 103 Dawson, R. J. M., 61 Day, N. E., 51, 171, 232, 235, 258, 399, 625 de Falguerolles, A., 399, 664, 686 de Leeuw, J., 399 Demétrio, C. G. B., 156, 555, 566 Deming, W. E., 343, 347, 511 Dempster, A.
P., 522 Dey, D. K., 609 Diaconis, P., 103, 104 Diggle, P., 471, 625 Dillon, W., 443 Dittrich, R., xv, 443 Dobson, A. J., 155 Doll, R., 42, 62, 64, 404 Dong, J., xv, 59, 614, 616 Donner, A., 172, 196, 258 Doolittle, M. H., 621 Downing, D., 103 Drost, F. C., 112, 258, 595 Ducharme, G. R., 398 Duncan, D. B., 195, 197, 277, 301, 624 Dupont, W. D., 93 Dyke, G. V., 624 Edwardes, M. D., 301 Edwards, A. W. F., 59 Edwards, D., 360, 398 Efron, B., 103, 146, 196, 227, 258, 526, 605, 610 Ekholm, A., 156, 481 Eliason, S. R., 103, 391 Escoufier, Y., 383, 399 Espeland, M. A., 544, 571 Everitt, B. S., 633 Fahrmeir, L., 155, 300, 615 Farewell, V. T., 171, 301, 527, 625 Fay, R., 103, 482, 594 Fechner, G., 623 Ferguson, T. S., 605, 617 Fienberg, S. E., 344, 347, 392, 438, 443, 513, 526, 610, 615, 616, 626, 627, 629, 631 Finney, D., 151, 258, 556, 623 Firth, D., 155, 156, 196, 330, 467, 481, 482 Fischer, G. H., 526 Fisher, R. A., 12, 22, 23, 29, 51, 79, 91, 92, 95, 99, 104, 114, 146, 156, 162, 237, 247, 560, 576, 589, 622–624, 625, 626, 628, 631 Fitzmaurice, G. M., 103, 466, 474, 481, 482, 649 Fitzpatrick, S., 35 Fleiss, J. L., 104, 110, 111, 242, 258, 347, 435, 436, 443 Follmann, D. A., 546, 547, 548, 566 Forcina, A., 301 Forster, J. J., 104, 346, 397, 482, 616, 630 Forthofer, R. N., 378 Fowlkes, E. B., 199, 226, 257 Francom, S., 462 Freedman, D., 23, 63 Freeman, D. H., Jr., 399 Freeman, G. H., 97 Freeman, M. F., 112 Freidlin, B., 104 Friendly, M., 59, 399, 633 Frome, E. L., 155, 399 Fuchs, C., 482 Gabriel, K. R., 263, 399 Gaddum, J. H., 623 Gail, M. H., 104, 625 Gart, J. J., 70, 71, 77, 102, 104, 197, 255, 258, 397, 442 Gaskins, R. A., 614 Gastwirth, J., 104, 197 Gatsonis, C., xv, 230, 481, 524, 609 Gelfand, A. E., 609 Genter, F. C., 301 Geyer, C., 678 Ghosh, B. K., 27 Ghosh, M., 442, 609 Gibbons, R. D., 520 Gilbert, G. N., 217 Gill, J., 155 Gilmour, A. R., 526 Gilula, Z., 83, 382, 384 Gini, C., 329 Glass, P. V., 447 Gleser, L. J., 103 Glonek, G. F.
V., 393, 466 Godambe, V. P., 104, 482 Goetghebeur, E., 482 Gokhale, D. V., 112, 612, 616 Goldstein, H., 520, 524 Good, I. J., 24, 60, 104, 605, 607, 608, 612, 614, 616, 626, 630 Goodman, L. A., 35, 59, 68, 69, 83, 84, 102, 110, 213, 217, 228, 340, 346, 365, 366, 369, 370, 374, 379, 380, 381, 382, 383, 384, 397, 398, 399, 406, 407, 408, 425, 428, 431, 443, 478, 482, 490, 516, 527, 540, 565, 566, 572, 621, 622, 627, 628, 629, 631 Gould, S. J., 544 Gouriéroux, C., 467, 482 Graubard, B. I., 89, 103 Gray, R., 273 Green, P. J., 156 Greenacre, M. J., 384, 399 Greene, G., 28 Greenland, S., 96, 234, 258 Greenwood, M., 566 Greenwood, P. E., 27 Grego, J., 686 Grizzle, J. E., 291, 301, 457, 601, 615, 624, 629, 631 Gross, S. T., 197 Grove, W. M., 544 Gueorguieva, R., xv, 527, 670 Gupta, A. K., 607 Haber, M., 95, 96, 103, 291, 465 Haberman, S. J., 69, 81, 83, 113, 195, 224, 258, 268, 300, 349, 346, 347, 364, 367, 369, 374, 380, 382, 392, 393, 396, 399, 408, 440, 526, 540, 565, 572, 576, 589, 591, 592, 595, 627, 629, 631 Hagenaars, J., 565 Hald, A., 623 Haldane, J. B. S., 70, 103, 196 Hall, P., 616 Halton, J. H., 97 Hamada, M., 301 Handelman, S. L., 544, 571 Hanfelt, J., 566, 571 Hansen, L. P., 467, 482 Harkness, W. L., 258 Harrell, F. E., 229, 282, 301 Hartzel, J., 511, 513, 514, 516, 534, 651 Haslett, S., 394 Hastie, T., 153, 199, 301, 633 Hatzinger, R., 565 Hauck, W. W., 172, 234, 258 Haynam, G. F., 243 Heagerty, P., 481, 527, 548 Hedeker, D., 520, 653 Heinen, T., 565 Hennevogl, W., 513 Henry, N. W., 565 Heyde, C. C., 156, 481 Hill, A. B., 42, 64, 111, 404 Hinde, J., 155, 156, 555, 563, 566 Hinkley, D., 12, 104, 133, 138, 146, 156, 594 Hirji, K. F., 104, 258, 625 Hirotsu, C., 406 Hirschfeld, H., 399 Hoadley, B., 199 Hobert, J., xv, 523, 525, 630 Hodges, J. L., 197 Hoem, J. M., 347, 399 Holford, T. R., 389, 390, 399 Holland, P. W., 610, 615, 629, 631 Hollander, M., 443 Holt, D., 103 Holtbrügge, 306 Hook, E. B., 526 Hosmer, D.
W., 177, 196, 197, 257, 258 Hout, M., 65, 428, 443 Howard, J. V., 104, 608 Hsieh, F. Y., 242, 243 Hsu, J. S. J., 608 Hwang, J. T. G., 104 Imrey, P. B., 346, 443, 615 Ireland, C. T., 616 Irwin, J., 91 Jennison, C., 103 Jewell, N. P., 467, 496, 566 Johnson, B. M., 605 Johnson, N. L., 566, 574 Johnson, W., 257 Jones, B., 442, 484, 536 Jones, M. P., 258 Jørgensen, B., 136, 155, 156, 266, 470 Kalbfleisch, J. D., 482 Karim, M. R., 524, 609 Kastenbaum, M. A., 627 Kastner, C., 466, 649 Katzenbeisser, W., xv, 672 Kauermann, G., 156, 467 Kelderman, H., 565 Kempthorne, O., 96, 104 Kendall, M. G., 27, 56, 60, 68, 399, 631 Kenward, M. G., 442, 475, 484, 536 Khamis, H. J., xv, 332, 443 Kim, D., 104, 255, 298, 379, 397 King, G., 527 Knott, M., 565 Knuiman, M. W., 616 Koch, G. G., 27, 302, 436, 447, 459, 460, 481, 532, 601, 615, 616, 629, 631, 670, 673, 674, 675 Koehler, K., 27, 103, 396, 397 Koopman, P. A. R., 77 Korn, E. L., 89, 103 Kraemer, H. C., 443 Kreiner, S., xv, 358, 398 Kruskal, W. H., 59, 60, 68, 69, 102, 110, 621, 631 Ku, H. H., 616 Kuha, J., 330, 347 Kuk, A. Y. C., 523 Kullback, S., 112, 399, 612, 616 Kuo, L., 527, 643, 651 Kupper, L. L., 566 Läärä, E., 301, 313 Lachin, J., 258 Laird, N. M., 385, 386, 389, 466, 482, 522, 541, 609, 610, 649 Lambert, D., 546, 547, 548, 566 Lancaster, H., 20, 27, 83, 84, 113, 399, 626 Landis, J. R., 111, 295, 297, 301, 302, 436, 447, 462, 508, 532 Landrum, M., 609 Landwehr, J. M., 226, 257 Lang, J. B., xv, 301, 340, 399, 465, 481, 537, 541, 551, 643, 644, 649, 655, 675, 678 Laplace, P. S., 15 Larntz, K., 196, 396, 397, 438, 443, 609 Larsen, K., 498 Larson, M. G., 399 Lauritzen, S. L., 346, 398, 399 LaVange, L. M., 103, 197, 481 Lawal, H. B., 396 Lawless, J. F., 155, 482, 560, 561, 566 Lazarsfeld, P. F., 565 Lee, E., 198 Lee, S. K., 596 Lee, Y., 559, 566, 574 Lefkopoulou, M., 566 Lehmann, E., 67, 104, 263, 406 Lehnen, R.
G., 378 Lemeshow, S., 177, 196, 197, 257, 258 Leonard, T., 608, 609 Lesaffre, E., 258, 300, 466, 522, 545 Lesperance, M. L., 526, 548 Liang, K.-Y., 104, 258, 442, 467, 469, 471, 473, 481, 482, 525, 556, 566, 571, 573, 625, 631 Lin, X., 524, 525 Lindley, D. V., 609, 630 Lindsay, B., 494, 545, 549 Lindsey, J. K., 400, 467, 566, 573 Lipsitz, S., 103, 291, 422, 456, 469, 473, 474, 481, 645 Little, R. J., 114, 346, 347, 475, 476, 482 Liu, I., xv, 485, 655, 671 Liu, Q., 510, 522 Lloyd, C., 93, 104, 156, 615 Localio, A. R., 526 Longford, N. T., 520 Loughin, T., 484 Louis, T., 541 Luce, R., 299, 302, 443 Madansky, A., 422, 456 Maddala, G. S., 258, 264, 302 Magidson, J., 653 Magnus, J. R., 602 Mansmann, U., 104 Mantel, N., 87, 93, 104, 171, 197, 209, 230, 231, 232, 234, 238, 260, 295, 296, 297, 300, 379, 414, 481, 612, 618, 624, 625, 627, 631 Martín Andrés, A., 104 Mason, W. M., 524, 609 Matthews, J. N. S., 301, 313, 443 Maxwell, A. E., 631 McArdle, J. J., 202 McCloud, P. I., 443 McCullagh, P., 132, 155, 156, 257, 276, 277, 283, 286, 290, 301, 308, 312, 340, 378, 397, 431, 443, 466, 471, 481, 556, 566, 625, 631 McCulloch, C. E., 522, 523, 524, 527, 548, 555, 623, 625 McDonald, J. W., 667, 684 McFadden, D., 228, 264, 299, 300, 302, 624, 631 McNemar, Q., 411 Mee, R. W., 77 Meeden, G., 605 Mehta, C. R., 98, 104, 254, 255, 258, 298, 397, 625, 630 Mendel, G., 22 Mendenhall, W. M., 107 Mersch, G., 400 Michailidis, G., 399 Miettinen, O. S., 77, 442 Miller, M. E., 481, 604 Min, Y., 100, 101 Minkin, S., 196 Mirkin, B., 112 Mitra, S. K., 79, 258, 346, 591, 627, 631 Mittal, Y., 60 Molenaar, I. W., 526 Molenberghs, G., 258, 466, 482 Moore, D. F., 152, 556, 566 Moore, D. S., 27, 103 Morabia, A., 527 Morgan, B. J. T., 196, 207 Morgan, W. M., 346 Morris, C., 526, 605, 610 Mosimann, J. E., 566 Mosteller, F., 345, 412, 443, 627, 629, 631 Nair, V.
N., 103, 301 Nam, J., 77 Natarajan, R., xv, 481, 502, 524 Nelder, J., 116, 132, 148, 149, 155, 156, 257, 290, 301, 312, 340, 378, 559, 566, 574, 625, 631 Nerlove, M., 300, 624 Neudecker, H., 602 Neuhaus, J. M., 417, 494, 496, 499, 502, 526, 547, 548, 566 Newcombe, R., 27, 109, 110 Neyman, J., 18, 112, 611, 612, 616, 626, 631 Nikulin, M. S., 27 Normand, S.-L., 609 Norusis, M. J., 633 Nurminen, M., 77 O'Brien, P. C., 207 O'Brien, R. G., 244, 258, 640 Ochi, Y., 258 Odoroff, C., 661 O'Gorman, T. W., 596 Olivier, D., 385, 386, 389 Overton, W. S., 526, 552 Pagano, M., 61, 657 Paik, M., 59 Palmgren, J., 156, 340 Park, T., 482 Parr, W. C., 594 Parzen, E., 34 Patefield, W. M., 104 Patel, N. R., 98, 258, 625, 630 Patnaik, P. B., 258 Paul, S. R., 566 Pearson, E. S., 18, 104, 626, 631 Pearson, K., 22, 79, 112, 399, 576, 589, 620, 621, 622, 628, 631 Peduzzi, P., 212 Pendergast, J. F., xv, 502 Pepe, M. S., 258 Perlman, M., 237 Peters, D., 104, 630 Peterson, B., 282, 301 Peto, R., 62 Piccarreta, R., 206 Piegorsch, W. W., 684 Pierce, D. A., 104, 143, 156, 497, 502, 522, 526, 630 Pike, M. C., 100 Plackett, R. L., 103, 196, 399, 623, 627, 631 Podgor, M. J., 197 Poisson, S.-D., 7 Pratt, J. W., 283 Pregibon, D., 143, 156, 197, 225, 257, 258, 566, 638 Prentice, R. L., 171, 196, 258, 283, 399, 482, 555, 566, 625 Presnell, B., 156 Press, S. J., 196, 300, 624 Pyke, R., 171, 625 Qaqish, B., 676 Qu, A., 482 Quetelet, A., 68 Quine, M. P., 29 Rabe-Hesketh, S., 527, 633 Radelet, M., 48, 65 Raftery, A., 257 Rao, C. R., 10, 12, 576, 582, 585, 587, 589, 591, 596, 616, 626, 631 Rao, J. N. K., 103, 515 Rasbash, J., 520, 524 Rasch, G., 399, 415, 493, 495, 624 Rayner, J. C. W., 103 Read, T. R. C., 27, 112, 258, 396, 612 Regal, R. R., 526 Reid, N., 96 Rice, W. R., 104 Ripley, B., 633 Ritov, Y., 384 Robins, J., 234, 258, 475 Röhmel, J., 104 Rosenbaum, P. R., 196 Rosner, B., 566 Rossi, P. E., 609 Rotnitzky, A., 467 Routledge, R. D., 104, 607 Roy, S.
N., 79, 346, 627, 631 Rubin, D., 196, 475, 482 Rudas, T., 481, 565 Rundell, P. W. K., 613, 614 Ryan, L., 290, 527, 566 Sackrowitz, H. B., 103, 104 Samuels, M. L., 60 Santner, T. J., 101 Schafer, D. W., 143, 156 Schafer, J. L., 103, 347, 482 Schluchter, M. D., 399 Schumacher, M., 306 Scott, A. J., 35, 103, 197 Searle, S., 527, 555 Seeber, G., 155 Sekar, C. C., 511 Self, S. G., 258, 525 Sen, P. K., 594 Seneta, E., 29 Shapiro, S. H., 398 Shen, S. M., 265 Shihadeh, E. S., 399 Shuster, J. J., 95, 103, 104, 442 Silva Mato, A., 104 Silvapulle, M. J., 195 Silvey, S. D., 301, 465, 625 Simon, G., 197, 301, 374, 399, 612, 624, 629 Simonoff, J., 594, 614, 615, 616 Simpson, E. H., 51, 60, 398, 596, 621 Singer, J. M., 594 Skellam, J. G., 566 Skene, A. M., 502, 609 Skinner, C., 347 Skrondal, A., 527 Slaton, T. L., 566 Small, K. A., xv, 302 Smith, A. F. M., 609, 616 Smith, K. W., 345 Smith, P. W. F., 443, 482, 616 Snell, E. J., 196, 301, 624 Snell, M. K., 101 Sobel, M. E., 672 Somers, R. H., 68 Somes, G. W., 488 Speed, T., 347, 616 Spiegelhalter, D., 616 Spitzer, R. L., 435 Sprott, D. A., 95, 114, 453 Starmer, C. F., 601, 629 Stasinopoulos, M., 526 Stern, S., 302 Sterne, T. E., 20 Stevens, S. S., 26 Stigler, S., 22, 443, 448, 623, 631 Still, H. A., 20, 27, 32 Stiratelli, R., 482, 526 Stokes, M. E., xv, 282, 302, 399, 476, 482, 633, 640, 649 Strawderman, R. L., 104, 630 Stuart, A., 27, 56, 399, 422 Stukel, T. A., 196, 250 Sturmfels, B., 104 Suissa, S., 95, 104, 442 Sundberg, R., 346, 366 Tamhane, A. C., 101 Tango, T., 411 Tanner, M. A., 443 Tarone, R. E., 197, 234, 258 Tavaré, S., 103 Ten Have, T. R., xv, 517, 526, 527 Theil, H., 57, 228, 300, 624 Thomas, D. R., 103, 515 Thompson, R., 283 Thompson, W. A., 399 Thurstone, L. L., 443 Tibshirani, R., 153, 199, 301 Titterington, D. M., 616 Tjur, T., 426, 552, 553 Tocher, K. D., 94 Toledano, A., 230, 481 Tolley, H. D., 594 Train, K., 302, 527 Trivedi, P.
K., 131, 155, 561, 566, 574 Tsiatis, A. A., 152, 197, 556, 566 Tukey, J., 112 Turing, A., 631 Turnbull, B. W., 103 Tutz, G., 155, 156, 289, 290, 300, 301, 513, 615 Uebersax, J. S., 544 Uttal, D. H., 517 van der Heijden, P. G., 399 Venables, W. N., 633 Verbeke, G., 482, 545 Vermunt, J. K., 399, 653 Wainer, H., 63 Wakefield, J. C., 502, 609 Wald, A., 11, 172 Walker, S. H., 195, 197, 277, 301, 624 Walley, P., 616 Walsh, S. J., 104 Wardrop, R. L., 105 Ware, J. H., 478, 480, 482 Watson, G. S., 79, 103, 576, 590, 627, 631 Wedderburn, R. W. M., 116, 148, 149, 150, 155, 156, 195, 258, 265, 266, 466, 470, 625, 631 Weisberg, S., 226 Wells, M. T., 104, 630 Wermuth, N., 398, 399, 401 Westfall, P. H., 214, 360 White, A. A., 481 White, H., 467, 471, 482 Whitehead, J., 301 Whittaker, J., 346, 358, 398, 399 Whittemore, A. S., 243, 398 Wild, C., 103, 197, 301 Wilks, S. S., 12 Williams, D. A., 156, 225, 397, 555, 566, 653 Williams, E. J., 103, 399 Williams, O. D., 291, 301, 624 Wilson, E. B., 16 Wilson, J., 103 Winner, L., 445 Wolfinger, R. D., 214, 360, 527 Wong, G. Y., 524, 609 Woolf, B., 71 Woolson, R. F., 487, 596 Wu, C. F. J., 196, 301 Wu, M., 346, 347 Yamagami, S., 101 Yang, I., 544 Yang, M., 104 Yates, F., 91, 93, 96, 98, 103, 104, 114, 239, 624 Yee, T. W., 301 Yerushalmy, J., 38 Young, S. S., 214, 360 Yule, G. U., 44, 53, 59, 68, 110, 346, 406, 566, 620–621, 628, 631 Zeger, S. L., 442, 467, 469, 471, 473, 481, 482, 499, 500, 524, 548, 609, 625, 631 Zelen, M., 255, 625, 631 Zellner, A., 609 Zelterman, D., 397 Zermelo, E., 443 Zhang, H., 257 Zhao, L., 482 Zheng, B., 227, 258, 266 Zhu, Y., 96 Zweifel, J. R., 70, 397
Subject Index Adjacent categories logit, 286–288, 370–371, 374–376, 642 Adjusted residual, see Standardized Pearson residual Agreement, 431–436, 443, 453–454, 541–544, 549–551 AIC, 216–217, 324 Alternating logistic regressions, 474 Ancillary statistic, 104 Arc sine transformation, 596 Armitage test, see Cochran–Armitage trend test Association, see Measures of association Association graphs, 357–360, 539 Association models, 373–381, 399 Asymptotic covariance matrix, 137–138, 577–581, 594 Asymptotic normality, 73–77, 577–581 Attributable risk, 66, 110 Backward elimination, 214–216 BAN, 611, 626 Baseline-category logits, 267–274, 300, 310–311, 426, 515, 640–643 Bayesian inference, 604–610, 616, 630–631 binomial parameters, 605–607, 617 generalized linear mixed models, 524, 609 kernel smoothing, connection, 614 multinomial proportions, 607–610, 618 Bernoulli distribution, 117 Beta-binomial distribution, 30, 553–559, 566, 572, 573, 653 Beta distribution, 554, 572, 605–606 Bias, 70, 85, 196, 450, 496, 524, 548, 595, 615 BIC, 257 Binary data correlated, 409–420, 455–482, 491–527, 538–559 generalized linear models, 120–125, 137, 140 matched pairs, 409–420 Binomial distribution, 5–6 admissible estimator, 605 confidence interval for proportion, 15–17, 32–33, 635 exact inference, 18–20 exponential family, 117, 134 GLM likelihood equations, 137 likelihood function, 9 matched pairs, 409–420 moment generating function, 31 overdispersion, 8, 30 tests for proportion, 14–15 variance stabilizing, 596 Binomial models deviance, 140 GLMs, 120–125 likelihood equations, 137, 265 overdispersion, 151–153, 291, 573, 653 Birch's results, 336 Bootstrap, 75, 156, 525, 531, 594 Bradley–Terry model, 436–439, 443, 647 Breslow–Day test, 258 Calibration, 207 Canonical correlation, 382, 399, 408, 624 Canonical link, 117, 148–149, 193, 257, 472, 496 Capture–recapture, 511–513, 526, 544–545, 551–552 CART, 257 Case-control study, 42–43, 46–47, 59, 233, and logistic regression, 170–171, 418–420, 625 several controls per case, 233,
442 Categorical data analysis, 1–688 Causal diagram, 217–218 Censoring, 386, 400 Centering, 167, 175 Chi-squared distribution df, 12, 79, 175, 589 mgf, 35 moments, 27 noncentral, 237, 258, 408, 591–592, 595, 597 reproductive property, 82 table of percentage points, 654 Chi-squared statistics likelihood-ratio, see Likelihood-ratio statistic partitioning, see Partitioning Pearson, see Pearson chi-squared statistic Classification methods, 196, 257, 228–230, 258 Clinical trials, 42, 230–236, 507–510 Clopper–Pearson confidence interval, 18–20, 33, 606 Cluster sampling, 103, 481, 515 Clustered data, 455, 491–527, 556–558 Cochran, W. G., 626 Cochran–Armitage trend test, 181–182, 197, 237, 253, 640 Cochran–Mantel–Haenszel test, 231–234, 639 exact test, 254, 298 and marginal homogeneity, 413, 458–459, 481 and McNemar test, 413–414 matched pairs, 413 nominal and ordinal cases, 295–298, 302, 379, 642–643 score test for logit model, 232, 297–298 Cochran's Q, 459, 488 Collapsibility, 358–360, 398 Complementary log-log model binary response, 248–250, 640 ordinal response, 283–284, 301, 313, 527, 641 Computer software, see Software Concentration coefficient, 69 Concordance index, 229 Concordant pair, 57–59 Conditional distribution, 37, 48 Conditional independence, 52 I × J × K tables, 293–298, 302, 318–319, 325 logit models, 183–184, 230–234, 263, 293–295, 359–360 versus marginal independence, 53, 365–366 power and sample size, 244–245 small-sample test, 254, 298 Conditional inference, 91–101, 250–257, 416–420, 495–496, 630 Conditional logistic regression, 250–258, 414–420, 495–496, 526, 625, 640, 645 Conditional logit, 299 Conditional ML, 100, 417, 494–496, 526 Conditional symmetry, 431, 452 Confidence intervals likelihood-based, 13, 77–78 tail method, 18, 99 Wald, 13 score, 15–16, 77 Confounding, 47–51, 230 Conjugate mixture model, 558–559 Constraint equations, 612 Constraints, parameter, 178–179, 317, 352–353 Contingency coefficient, 112, 620 Contingency table,
36, 47–54 Continuation-ratio logit, 289–291, 301, 517–520 Continuity correction, 27, Continuous proportions, 265–266, 624 Contrasts, 82, 317, 340, 344, 603, 636, 639 Correlation, 87, 226, 296, 634 Correlation models, 381–384, 399, 408 Correspondence analysis, 382–384, 399, 624, 644 Cramér's V 2, 112 Credit scoring, 165, 263, 631 Cross-classification table, see Contingency table Crossover study, 444, 457, 483, 498, 501, 572 Cross-product ratio, 44 Cross validation, 266 Cumulant function, 155 Cumulative link models, 282–286, 313 Cumulative logit models, 274–282, 301, 624, 641 dispersion effects, 285–286 marginal models, 420–421, 462–463, 469 proportional odds property, 275–276, 282 random effects, 514–515, 536 score test and ranks, 301 Cumulative odds ratio, 67, 276 Cumulative probit model, 278, 283, 301, 312, 624–625, 641 Data mining, 219, 631 Decomposable model, 346, 360 Degrees of freedom, 12, 79, 175, 589, 622 Delta method, 73–77, 577–581, 594 Dependent proportions, 410–412 Design, 196, 609 Design matrix, see Model matrix Deviance, 118–119, 139–142 grouped vs.
ungrouped binary data, 208 likelihood-ratio tests, 141–142, 186–187, 363–365 residual, 142, 220, 638 R-squared measures, 228 Diagnostics, 142–143, 219–230, 257–258, 366–367 Diagonals-parameter symmetry, 443 Difference of proportions, 43 collapsibility, 398 dependent, 410–412, 645 homogeneity, 258 large-sample confidence interval, 72, 77, 102, 110, 410–411 sample size determination, 240–242, 258 small-sample confidence interval, 101 z test and Pearson statistic, 111 Directed alternatives, 88–90, 236–239, 373 Dirichlet distribution, 607, 610 Discordant pair, 57–59 Discrete choice models, 298–300, 302, 527, 624 Discreteness and conservatism, 18–20, 93–94, 257 Discriminant analysis, 196 Dispersion parameters, 131, 133, 285–286, 560 Dissimilarity index, 329–330 Diversity index, 596 Dummy variables, 178–179 Ecological inference, 527 Effect modifier, 54 EM algorithm, 522–523, 540–541 Empirical Bayes, 526, 610 Empirical logit, 168 Empty cells, 392 Entropy, 57, 613 Estimated expected frequencies, 25, 78, 315 Estimating equations, 470, 481–482 Exact confidence intervals, 18–20, 99–101, 255 Exact tests binomial parameter, 18, 412 conditional independence, 254, 298 Fisher, 91–97, 253 I × J tables, 97–98, 104 logistic regression, 251–257 matched pairs, 412 ordinal variables, 114 StatXact and LogXact, 633, 635, 640, 643 trend in proportions, 98 unconditional test, 94–96, 104, 114 Expected frequencies, 22, 25 Exponential dispersion family, 133, 310 Exponential distribution, 313, 388 Exponential family, 116, 133 Extreme-value distribution, 249–250, 264 Fisher, R.
A., 22–23, 622–624, 626, 628 df argument with Pearson, 622–623 variance test, 163 Fisher scoring, 145–149, 156, 247, 623, 625 Fisher's exact test, 91–97, 99, 253, 623 and Bayes approach, 608 conservatism, 93–94 controversy, 95–96, 104 software, 635 UMPU, 104 versus unconditional test, 95–96, 104, 114 Fitted values, 121 asymptotic distribution, 194, 341, 585–586, 593 Freeman–Tukey chi-squared, 112, 594 G2 statistic, see Likelihood-ratio statistic G2(M0 | M1), 187, 363 Gamma, 58–59, 88, 110, 596–597 Gamma distribution, 559–560, 574 Gauss–Hermite quadrature, 521–522, 651 Generalized additive models, 153–155, 156, 301, 630, 636 Generalized estimating equations (GEE), 466–475, 481–482, 501, 557–558, 649 Generalized linear mixed model (GLMM), 417, 492 Bayesian approach, 524, 609 binary data, 492–527 correlation nonnegative, 497, 564 count data, 563–565 heterogeneity, interpretation, 497–498 marginal effects, comparison, 498–502, 535, 563–564 marginal model, corresponding, 527, 563–564, 574–575 misspecification, 547–548 model fitting, 520–526, 527 multinomial data, 513–516 software, 649–653 Generalized linear model (GLM), 116–119, 625 canonical link, 117, 148–149, 193, 257, 472, 496 covariance matrix, 137–138 exponential dispersion family, 133 inference using, 139–143 likelihood equations, 135–136, 148 model fitting, 143–149 moments, 132–134 multivariate, 274 variance function, 136 Generalized loglinear model, 332–333, 464, 481, 602 Gini concentration index, 68 Goodman, L.
A., 627–629 Goodman and Kruskal tau and lambda, 68–69 Goodness-of-fit statistics continuous explanatory variables, 176–177, 197 deviance for GLMs, 118–119, 139–142 likelihood-ratio test, 141–142, 186–187, 363–365 logistic regression, 174–177, 186–187, 208 loglinear models, 324 mixture summary, 565 Pearson chi-squared, 22–26 uninformative for ungrouped data, 162 Graphical models, 357–360, 398, 629 Grouped versus ungrouped data, 140–141, 162, 174–177, 208, 228 GSK method, 601 Gumbel distribution, 249 Hat matrix, 143, 225, 589 Hazard function, 301, 388, 399–400 Heterogeneity, 130, 235–236, 291, 377, 492–493, 497, 499–500, 507–510, 538 Hierarchical models, 316, 520, 609 History, 619–631 Homogeneity of odds ratios, 54, 183, 234–236, 255, 258 Homogeneous association, 54, 320, 377, 407, 623 Hosmer–Lemeshow statistic, 177, 639 Hypergeometric distribution, 91 and binomial, 113 moments, 103, 232 multiple hypergeometric, 97 noncentral, 99 Identity link, 117, 120, 124, 128, 385, 387, 562, 565 Incomplete table, 392 Independence conditional, see Conditional independence estimated expected frequencies, 78 exact test, see Fisher's exact test from irrelevant alternatives, 299, 302 joint, 318, 319 likelihood-ratio test, 79 loglinear model, 132, 314–315, 336, 352 mutual, 318–319, 353, 354 Pearson test, 78–79 quasi, 426–428, 432–433, 443 residuals, 81, 111–112 smoothing using, 85–86 two-way table, 38–39, 78–79, 111 variance of proportion estimator, 113 Independent multinomial sampling, 40, 67, 339–340 Influence diagnostics, 224–226, 638 Information matrix, 9 GLM, 138, 145–146 logistic regression, 193 loglinear model, 339 observed versus expected, 145–146, 247 Interaction, 210 and odds ratios, 54 three-factor, 320 uniform, 407 Isotropy, 406 Item response models, 495 Iterative proportional fitting, 343–345, 347 Iterative reweighted least squares, 147, 156, 195, 343 Joint independence, 318, 319 Kappa, 434–435, 443, 453, 645 Kendall's tau and tau-b, 60, 68 Kernel smoothing, 613–615, 616 Lambda (measure of association), 69 Laplace approximation, 523 Latent class models, 538–545, 565, 571–572, 653 Latent variable, 277–278, 399 LD 50, 167 Leverage, 143, 589 Likelihood function, 9 generalized linear model, 133, 135 marginal likelihood, 521 Likelihood-ratio statistic, 11–12, 24 asymptotic chi-squared distribution, 590–591 and confidence intervals, 13, 16, 17, 78, 638 difference of deviances, 141–142, 187, 363–364 independence, 79 minimized by ML estimate, 590–591 monotone property, 141 nested models, 363–365 noncentrality, 243 nonnegative, 34, 141 partitioning, 82–84, 363–365, 399, 405 Pearson statistic, comparison, 24, 80, 364 as power divergence statistic, 112 sparse data, 80, 395–397 Linear-by-linear association, 369–373, 643–644 and bivariate normal, 370, 399 and correlation model, 408 heterogeneous, 377 homogeneous, 377–379, 407 score statistic, 406 Linear logit model, 180–182 directed inference, 236–237 efficiency, 197 exact test, 253 likelihood equations, 209 and trend test, 197, 237–239 Linear predictor, 116 Linear probability model, 120–121, 291 and trend test, 181–182 Link function, 116, 135 canonical, 117, 148–149, 193, 257, 472, 496 cumulative, 282–286, 301 goodness of link, 257–258, 301 inverse cdf, 124–125, 163, 282 Litter effects, 151–153, 291, 556–558, 566 Local odds ratio, 55, 312, 369–370 asymptotic covariances, 597 conditional, 321–322, 377 exponential family for multinomial, 310–311 Logistic distribution, 125, 162, 197, 246 Logistic-normal distribution, 265 Logistic-normal model, 496–513, 516–527 Logistic regression, 121–125, 165–196 case-control studies, 170–171, 418–420, 625 categorical predictors, 177–186 conditional, 250–258, 414–420, 495–496, 526, 625, 640, 645 conditional independence, 183–184, 231 covariance matrix, 193–194 design, 196, 609 diagnostics, 219–230, 257–258 existence of ML estimates, 195–196, 394–395 fitting model, 192–196 generalized linear model, 117, 121–125 goodness-of-fit, 174–177, 186–187, 197 inference, 172–177 interpretation, 166–171, 191 likelihood equations,
192–193 linear logit model, see Linear logit model loglinear models, connection, 315, 330–332, 367, 593–594 marginal models, 414, 456–476 matched pairs, 414–420, 493–496 model-building, 211–225 multiple predictors, 182–195 nonparametric mixture, 546–547, 653 normal distribution connection, 171, 207–208 and odds ratio, 124, 166 perfect discrimination, 195–196 probability estimators, 166–167, 191, 194 random effects, 496–513, 516–527 regressive logistic model, 479–481 repeated binary response, 414–420, 456–476, 496–513, 516–527 repeated multinomial response, 461–464, 469, 474–475, 513–516 residuals, 219–223 sample size determination, 242–243 sample size and number of predictors, 212 software, 637–643, 645, 649–651 Logit transform, 75, 117, 624 bias, 196 confidence interval, 109 in logistic regression, 123 standard error, 74–75 Wald test of proportion, 208–209 Loglinear models, 117–118, 314–347, 627–629 covariance matrix, 138–139, 338, 341, 593, 598 existence of estimates, 341, 392–395 fitting, 342–344 four dimensions, 326–330, 355 generalized loglinear model, 332–333, 464, 481 generalized linear model, 117–118, 125–132 goodness of fit, 337–338 homogeneous association, 320, 377 independence, 232, 314–315, 318–319, 336, 352, 365–366 likelihood equations, 334–336 linear-by-linear association, 369–373, 377–379 logit models, connection, 315, 330–332, 367, 593–594 ordinal variables, 367–377 parameter definition, 316–317, 352–353 Poisson-multinomial connection, 317–318, 339–340 probability estimates, 340–341 rates, 385–391 saturated, 316, 380 selection, 360–366 software, 643–644 square tables, 424–431 three-factor interaction, 320 (X, Y, Z) type symbols, 320–321 Log link, 118, 124, 125, 132, 138, 140, 314, 560, 563 Log-log models, 248–250, 283 Longitudinal studies, see Repeated response Lowess, 154 Mann–Whitney statistic, 90, 301, 452–453 Mantel, N., 625 Mantel–Haenszel estimator, 234–235, 417, 639 Mantel–Haenszel test, see Cochran–Mantel–Haenszel test Mantel score test, 87, 88, 89, 379 Marginal distribution, 37.
See also Marginal models Marginal likelihood, 521 Marginal homogeneity binary matched pairs, 410–413 and independence, 111 nominal tests, 422–423, 457–459 ordinal tests, 421, 452–453, 458 multi-way table, 439–442, 456–459, 647–649 Marginal models, 414, 420–423, 439–442, 456–476 conditional models, comparison, 498–502 GEE approach, 466–475 ML fitting, 464–466, 481 odds ratio, 451, 494 software, 644–649 Marginal symmetry, 442 Marginal table, 48 same association as partial table, 358–360, 398 Markov chains, 477–481, 482, 489–490 Matched pairs, 409–454 Cochran–Mantel–Haenszel approach, 413 dependent proportions, 410–412 logistic models, 414–420, 493–496, 516–517 McNemar test, 411–413, 424, 442, 644–645 odds ratio estimates, 417, 451, 494 ordinal data, 420–421, 429–431, 439, 443, 452–453, 462–464, 536 random effects, 417–418, 493–494, 535 Maximum likelihood, 9 conditional, 100, 417, 494–496, 526 inconsistent estimator, 450 iterative reweighted least squares, 147, 156, 195, 343 likelihood function, see Likelihood function versus other methods, 468, 603–605, 612 McNemar test, 411–413, 424, 442, 644–645 Mean response model, 291–294 Measurement error, 347, 493 Measures of association, 43–47, 54–60, 68–69, 620–622 asymptotic normality, 110 comparing several values, 599 Mendel, 22–23, 623 Mid-distribution function, 34 Mid-P-value, 20, 27, 33, 104 Midranks, 89, 90, 302 Minimum chi-squared, 112, 611–612, 616, 618, 629 Minimum discrimination information, 112, 612–613, 616 Misclassification error, 347 Missing data, 103, 347, 463, 475–476, 482 Mixture models, 538–566. See also Generalized linear mixed models ML, see Maximum likelihood Model-based inference improved precision of estimation, 85, 112, 174, 239–240, 264 model-based tests, 141–142, 172, 363–365, 396, 399 Model matrix, 135 Monotone trends, 88.
See also Trend tests Monte Carlo methods, 114, 522–525, 609, 629–630, 635 Multicollinearity, 212 Multilevel models, 520, 609, 651 Multinomial distribution, 6–7 binomial factorization, 289 exponential family, 310–311 inference, 21–26, 35 mean, correlation, covariance, 7, 31, 579–580, 596 and Poisson, 8–9, 40 sampling models, 40–41, 67 Multinomial logit models, 267–291, 298–300, 302, 624, 640–643, 651–653 Multinomial loglinear model, 317–318, 339–341 Multinomial response models, 267–300, 640–643 Mutual independence, 318–319, 353, 354 National Halothane Study, 627, 629 Natural exponential family, 116, 133, 155 Natural parameter, 133 Negative binomial distribution, 31, 161, 163, 560, 566, 574 regression model, 131, 560–563, 565, 566, 653 Nested models likelihood-ratio comparison, 141–142, 187, 363–364 simultaneous tests, 263 using X 2, 364 Newton–Raphson, 143–146, 163–164 and Fisher scoring, 145, 247 IPF, comparison, 344–345 logistic regression, 194–195 loglinear models, 342–345 Neyman, J., 626 Nominal variable, 2–3 baseline-category logit models, 267–274, 300, 310–311, 426, 515, 640–643 Nominal variable (continued)
  matched pairs, 422–423
  measures of association, 55–57, 68–69
  square table models, 425–433, 439–442
Noncentral chi-squared distribution, 237, 258
  asymptotic representation, 591–592, 595
  noncentrality parameter, 237, 243–245, 408, 597
  power and df, 237–239
Nonparametric random effects, 545–553, 565–566, 653
Normal distribution
  asymptotic normality, see Delta method
  and chi-squared, 82
  and logistic regression, 171, 207–208
  underlying categorical data, 112, 264, 370, 620
O, o rates of convergence, 577, 595
Observational study, 43
Odds, 44
Odds ratio, 44, 620
  bias, 70, 595
  case-control studies, 46–47
  conditional, 51–54, 255, 321, 417, 451
  conditional ML estimate, 255, 417
  confidence interval, 71, 77–78, 99–102, 255, 256
  cumulative, 67
  exact inference, 99–101, 253, 255
  homogeneity, in 2 × 2 × K tables, 54, 183, 234–236, 255
  I × J tables, 55–56, 581, 597
  invariance properties, 45–46, 59
  local, see Local odds ratio
  Mantel–Haenszel estimator, 234–235
  marginal, 451, 494
  matched pairs, 415–418, 451
  logistic regression parameters, 124, 166, 171, 179, 183, 331, 415, 497–500
  loglinear model parameters, 315, 316, 321, 331, 369
  ordinal variables, see Local odds ratio
  relation to relative risk, 47, 124, 624
  standard error, 71, 75–77, 581, 597
Offset, 385
Ordinal variables, 2–3
  cumulative link models, 282–286
  cumulative logit models, 274–282, 301, 420–421
  efficiency, 197, 301
  exact tests, 98, 253
  improved power, 88–90, 236–239, 373
  loglinear models, 367–377, 399
  marginal models, 420–421, 429–430, 440–441, 462–464
  matched pairs, 420–421, 429–431, 439, 443, 452–454, 462–464
  mean response model, 291–294
  measures of association, 57–59, 67, 68
  multinomial response models, 274–295
  ordinal quasi symmetry, 429–430, 440–441, 647
  repeated response, 461–464, 469, 474–475, 514–515, 517–520
  scores, choice of, 88–90, 383–384
  testing independence, 86–91, 373
Overdispersion, 493
  binomial, 8, 30, 151–153, 291, 555–558, 573, 653
  litter effects, 151–153, 291, 556–558, 566
  Poisson, 7–8, 130–131, 636
  quasi-likelihood, 151–153, 291, 555–558, 653
Paired comparisons, see Bradley–Terry model
Parallel odds models, 374–375
Partial tables, 48
Partitioning chi-squared statistic, 82–84, 112–113, 365, 399, 405
  and combining rows, 112
  I × J tables, 82–83
  nested models, 365
  trend test, 181, 373
Pattern mixture model, 476
Pearson, Karl, 619–623, 628
  arguments with Fisher, Yule, 79, 619–623
  goodness of fit, 22–24, 79
Pearson chi-squared statistic, 22–26, 79, 111–112
  asymptotic chi-squared distribution, 589–590
  asymptotic conditional distribution, 103
  continuity correction, 103
  degrees of freedom, 25, 79, 622
  and z for difference of proportions, 111
  goodness of fit, 22–26
  independence, 78–79, 111–112, 622
  and likelihood-ratio, comparison, 24, 80, 364
  minimizing, 112, 611–612, 616, 618, 629
  moments, 103
  multinomial parameters, 22–26
  nested models, 364
  noncentral chi-squared distribution, see Noncentral chi-squared distribution
  score statistic, 24
  sparse data, 80, 395–397
  with ungrouped data, 162
  upper bound, 112
Pearson residual, 81, 142, 588–589, 593
  binomial GLM, 220, 555, 638
  Poisson GLM, 142, 366, 588
Penalized likelihood, 614–615
Penalized quasi likelihood (PQL), 523–524
Perfect contingency tables, 398
Perfect discrimination, 195–196
Phi-squared, 112
Poisson distribution, 7
  comparing means, 31
  exponential family, 117, 134
  moments, 7, 31
  and multinomial, 8–9, 40
  and negative binomial, 131, 559–560, 566, 574
  overdispersion, 7–8, 130–131, 636
  Poisson sampling, 39
  variance test, 163
Poisson models
  counts, 125–132, 155, 563–565
  deviance, 140
  loglinear model, 117–118, 125–132, 138–139, 232, 314–347
  overdispersion, 130–131, 150–151, 636
  random effects, 563–565
  rates, 385–391, 399–400
Polytomous logit models, 267–291
Population-averaged effects, 414, 495, 499–501
Positive likelihood-ratio dependence, 406
Power
  calculating, 240–245, 640
  increased, for directed alternatives, 88–90, 236–239, 373
  and noncentrality, 237–239, 243–245
  and number of ordinal categories, 301
Power-divergence statistic, 112, 613
Prediction, 525–526
Probit model, 124–125, 246–247, 258, 623, 640
  discrete choice, 302
  likelihood equations, 265
  normal parameters, 163, 246, 264
  ordinal data, 278, 283, 301, 312, 641
  random effects, 535
  threshold and utility motivations, 264
Profile likelihood confidence interval, 78, 512, 638
Propensity score, 196
Proportional hazards model, 283–284, 301, 389, 643
Proportional odds, see Cumulative logit models
Proportional reduction in variation, 56–57, 67–68
Proportions
  admissible estimator, 605
  asymptotic distribution, 585–588, 593
  Bayesian inference, 605–607
  confidence interval, 15–17, 32–33, 635
  dependent, 410–412
  difference, see Difference of proportions
  ratio, see Relative risk
  standard error, 11, 340–341
P-value
  mid-P-value, 20, 27, 33, 104
  randomized, 27, 32
  UMVU estimator, 162
Qualitative variable, 3–4
Quantitative variable, 3–4
Quasi-association, 431, 453–454
Quasi-independence, 426–428, 432–433, 443
Quasi-likelihood
  binary models, 151–153, 291, 555–558
  count models, 150–151
  GLM, 149–153, 156
  multivariate (GEE), 466–475, 481–482, 625
  overdispersion, 150–153, 291, 555–558
Quasi-symmetry, 425–431, 433–434, 451, 454, 646–647
  and Bradley–Terry model, 438–439
  and marginal homogeneity, 428–430
  multiway tables, 440–441
  and Rasch model, 552–553, 565
Raking a table, 345–346, 347, 643
Random component of GLM, 116, 133
Random effects, 417, 492–527
Random intercept, 493
Ranks, 89, 90, 298, 301, 302
Rasch mixture model, 548–551, 653
Rasch model, 495–496, 517, 526, 535, 565, 624
Rates, 385–391, 399–400
RC model, 379–381, 399–400
Regressive logistic model, 479–481
Relative risk, 43–44
  asymptotic standard error, 73
  collapsibility, 398
  confidence interval, 73, 77
  homogeneity, 258
  in model, 124
  and odds ratio, 47, 624
Repeated response, 409–517.
  See also Generalized linear mixed models; Marginal models; Matched pairs
Residuals, 142–143, 156
  asymptotic distribution, 587–589
  binomial GLMs, 219–223
  deviance, see Deviance residual
  Pearson, see Pearson residual
  Poisson GLMs, 143, 366–367
  standardized Pearson, see Standardized Pearson residual
Retrospective study, 42–43. See also Case-control study
  logistic regression, 170–171
  odds ratio, 46–47
Ridits, 111, 406
ROC curve, 228–230, 258
Row and column effects model, see RC model
Row effects model, 374–376, 643–644
R-squared type measure
  logistic regression, 226–228, 258
  nominal association, 56–57, 67–68
Sample size determination, 240–245
Sampling methods, 39–43
Sampling zero, 392
Sandwich estimator, 471–474
SAS, 632–643
Saturated model, 119, 139, 382
  logit models, 178
  loglinear models, 316, 380
Scaled deviance, 140
Scores
  choice of, 88–90, 383–384
  efficiency, 197, 301
  in loglinear models, 369–379, 407
  in trend test, 88–89, 181–182, 406
Score statistic, 12, 26–27
  confidence intervals, 15–16, 77
  logistic regression, 232, 297–298
  Pearson statistic, 24
  and standardized residuals, 156
  trend test, 182
Selection model, 475–476
Sensitivity, 38, 60, 228–230
Simpson diversity index, 596
Simpson’s paradox, 51, 59–60, 224, 354, 621
Small-area estimation, 502–504
Small samples
  adding constants to cells, 397–398
  alternative asymptotics, 233, 396–397
  exact inference, 18–20, 91–101, 104, 251–257
  existence of estimates, 195–196, 341, 392–395
  model-based tests, 187, 251–257
  X 2 and G 2, 24, 80, 364, 395–397
  zeros, 392–398
Smoothing
  Bayes, 606–610
  generalized additive model, 153–155
  improved estimation with model, 85, 112, 174, 239–240, 264
  kernel, 613–615, 616
  penalized likelihood, 614–615
Software, 632–653
  SAS, 632–643
  StatXact and LogXact, 633, 635, 640, 643
Somers’ d, 68
Sparse data, 187, 250–257, 391–398, 591
  asymptotics, 233, 396–397
Spearman’s rho, 90
Specificity, 38, 60, 228–230
Square tables, 409–454
Standardized table, 345–346
Standardized parameter estimate, 191–192, 197
Standardized Pearson residual, 81, 143, 589
  binomial GLMs, 220, 638
  and Pearson statistic, 112
  Poisson GLMs, 143, 367, 634
  as score statistic, 156
StatXact, 633, 635, 640, 643
Stepwise model-building, 213–216
Stochastic ordering, 33, 67, 301
Structural zero, 25, 392
Subject-specific effects, 414–420, 491, 498–500
Sufficient statistics, 148, 250–257, 273, 334, 336
Suppressor variable, 67
Survival data, 385–391
Symmetric association, 425
Symmetry, 424–425, 644–647
  complete, 440
  multiway, 439–442
Systematic component of GLM, 116
Tetrachoric correlation, 620
Three-factor interaction, 320
Threshold model, 264, 277–279
Tolerance distribution, 245–246
Transformations, 595, 596
Transition probabilities, 477, 490
Transitional model, 464, 476–481, 482
Tree-structured methods, 257, 631
Trend tests, 86–90, 103, 296, 373, 379
  Cochran–Armitage for proportions, 90, 181–182, 237–239
  efficiency, 197, 301
  exact, 253
  software, 634, 635
Uncertainty coefficient, 57
Uniform association model, 312, 369–370, 377
Uniform interaction model, 407
Uniqueness of ML estimate, 341
Utility, 264
Variance
  asymptotic, see Delta method
  components, 492, 525
  in exponential family, 134
  stabilizing, 596, 626
  test for Poisson, 163
  variance function, 136, 149–150
Wald statistic, 11, 27
  and power, 172, 208–209
Wald confidence intervals, 13
  adjusted intervals, 33, 102
Weight matrix, 138, 155, 164
Weighted kappa, 435, 443, 645
Weighted observation, 391
Weighted least squares, 481, 600–604, 615, 629
  and minimum modified chi-squared, 611, 612
  and ML estimation, 146–148, 603–604
Wilcoxon test, 90, 301
WLS, see Weighted least squares
X 2 statistic, see Pearson chi-squared statistic
X 2(M0 | M1), 364
Yates continuity correction, 103
Yule, G. U., 620–621, 628
Yule’s Q, 68, 110
Zero cell count
  adding constants, 70–71, 397–398
  effects on estimates, 70–71, 78, 256
  sampling, 392
  structural, 25, 392