Visualising contingency table data

Dongwen Luo, G. R. Wood, G. Jones

Abstract

A geometric object, a simplex, is useful for picturing the joint, conditional and marginal distributions within a contingency table. The joint distribution is represented using weights on all vertices of the simplex, a conditional distribution by weights on vertices of a face of the simplex, and a marginal distribution by weights on the faces containing the conditional distributions. All detailed discussion is based on the simplest case, that of a two-by-two contingency table, for which all distributions are seen in a tetrahedron.

1 Introduction

A contingency table is a cross-tabulation of categorical variables. An example is given in Table 1, using data from an Australian survey of attitudes to genetic engineering of food [4]. The 894 respondents are distributed among four categories defined by income level and attitude to genetic engineering. The question of interest is whether income level and attitude to genetic engineering of food are dependent.

              Attitude
Income     For    Against
Low        258    222
High       263    151

Table 1. A cross-tabulation of income level against acceptance of genetic engineering of food, with data drawn from a recent Australia-wide survey.

When faced with contingency table data, it is useful for the practitioner to have a quick method for visualising the associated distributions. The primary aim of this article is to bring such a method to a wider audience; the secondary aim is to provide a cameo example of the symbiosis between mathematics and statistics. The article exposits and builds on ideas first introduced by Fienberg [2] and Fienberg and Gilbert [3].

There are three distributional types associated with a contingency table: the joint distribution, conditional distributions and marginal distributions. This article pictures these three types in a simplex.
For a given contingency table, the joint distribution can be represented by weights on all vertices of the simplex, a conditional distribution by weights on vertices of a face of the simplex, and a marginal distribution by weights on the faces containing the conditional distributions. All discussion is based on the contents of a two-by-two table, since such a table is complex enough to illustrate all items of interest yet simple enough to be readily pictured. In the next section we review the three distributions, using notation of Agresti [1]. The three distributional types are described geometrically in Section 3, then the article is completed with a generalisation in Section 4 to tables of arbitrary dimension and a conclusion.

2 Distributions in a two-by-two table

We begin this section by briefly reviewing standard terminology and notation for joint, conditional and marginal distributions in a contingency table. Consider two categorical variables $X_1$ and $X_2$, each at two levels. The joint distribution of $X_1$ and $X_2$ can be represented in a $2 \times 2$ table denoted $(\pi_{ij})$, where $\pi_{ij}$ is the probability of $X_1$ at the $i$th level and $X_2$ at the $j$th level, for $i = 1, 2$ and $j = 1, 2$.

The marginal distributions of $X_1$ and $X_2$ are denoted $(\pi_{1+}, \pi_{2+})$ and $(\pi_{+1}, \pi_{+2})$ respectively. Here the subscript "+" denotes summation over the associated index, so $\pi_{i+} = \sum_j \pi_{ij}$ and $\pi_{+j} = \sum_i \pi_{ij}$. Thus, the marginal distribution of $X_1$ ($X_2$) appears as the row (column) totals of the table $(\pi_{ij})$. The distribution of $X_2$ conditional upon $X_1 = i$ is written as $(\pi_{1|i}, \pi_{2|i})$, so $\pi_{j|i} = \pi_{ij}/\pi_{i+}$ for all $j$. Symmetrically, we could define the distribution of $X_1$ for a given level of $X_2$. These three distributions associated with a two-by-two table and a numerical example (the frequency table of the Australia survey data) are displayed in Table 2.
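As a minimal sketch (not part of the original article), the definitions above can be applied directly to the counts in Table 1. The variable names are illustrative assumptions; rows are taken to be Income (Low, High) and columns Attitude (For, Against). Rounded to four decimal places, the results should agree with the relative frequencies reported in Table 2.

```python
# Counts from Table 1: rows are Income (Low, High), columns are Attitude (For, Against).
counts = [[258, 222],
          [263, 151]]

n = sum(sum(row) for row in counts)  # total number of respondents

# Joint distribution: pi_ij = n_ij / n
joint = [[c / n for c in row] for row in counts]

# Marginal distributions: row totals pi_{i+} and column totals pi_{+j}
row_marginal = [sum(row) for row in joint]
col_marginal = [sum(col) for col in zip(*joint)]

# Distribution of X2 conditional on X1 = i: pi_{j|i} = pi_ij / pi_{i+}
conditional = [[p / row_marginal[i] for p in row] for i, row in enumerate(joint)]

print("joint          ", [[round(p, 4) for p in row] for row in joint])
print("conditional    ", [[round(p, 4) for p in row] for row in conditional])
print("row marginal   ", [round(p, 4) for p in row_marginal])
print("column marginal", [round(p, 4) for p in col_marginal])
```

Each row of `conditional` sums to one, as does the whole of `joint`; this is what allows each to be read as a set of weights in the sections that follow.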
                       $X_2$
$X_1$     1                            2                            Total
1         $\pi_{11}$ ($\pi_{1|1}$)     $\pi_{12}$ ($\pi_{2|1}$)     $\pi_{1+}$
2         $\pi_{21}$ ($\pi_{1|2}$)     $\pi_{22}$ ($\pi_{2|2}$)     $\pi_{2+}$
Total     $\pi_{+1}$                   $\pi_{+2}$                   1.00

                       Attitude
Income    For                 Against             Total
Low       0.2886 (0.5375)     0.2483 (0.4625)     0.5369
High      0.2942 (0.6353)     0.1689 (0.3647)     0.4631
Total     0.5828              0.4172              1.00

Table 2. The left (here, upper) panel presents the notation for joint, conditional and marginal distributions of categorical variables $X_1$ and $X_2$, each with two levels. The right (here, lower) panel presents the relative frequency table for the Australia survey data. Figures in brackets show the distribution of $X_2$ for the given level of $X_1$.

3 Geometry of the three distributions

The joint distribution of categorical variables $X_1$ and $X_2$ with two levels each can be represented as

\[ (\pi_{11}, \pi_{12}, \pi_{21}, \pi_{22}) = \pi_{11} e_1 + \pi_{12} e_2 + \pi_{21} e_3 + \pi_{22} e_4 \]

where $e_1 = (1, 0, 0, 0)$, $e_2 = (0, 1, 0, 0)$, $e_3 = (0, 0, 1, 0)$ and $e_4 = (0, 0, 0, 1)$ form the standard basis in $\mathbb{R}^4$ (points A, B, C and D respectively in Figure 1(a)). Thus the joint distribution of $X_1$ and $X_2$ can be pictured as weights $\pi_{11}$, $\pi_{12}$, $\pi_{21}$ and $\pi_{22}$ on A, B, C and D respectively. Alternatively, since $\pi_{ij} \ge 0$ for all $i, j$ and $\sum_{ij} \pi_{ij} = 1$, the joint distribution of $X_1$ and $X_2$ can be represented by the centre of mass $J$ (more formally known as the "resultant" or "barycentre") of these weights on A, B, C and D in the three dimensional simplex given by

\[ S_3 = \Big\{ (\pi_{11}, \pi_{12}, \pi_{21}, \pi_{22}) : \sum_{ij} \pi_{ij} = 1 \text{ and } \pi_{ij} \ge 0 \text{ for all } i, j \Big\} \]

as illustrated in Figure 1(a).

The distribution of $X_2$ conditional on $X_1 = 1$ can be represented as $(\pi_{1|1}, \pi_{2|1}, 0, 0)$, an ordered 4-tuple in $\mathbb{R}^4$, and since we have the representation

\[ C_1 = \pi_{1|1} e_1 + \pi_{2|1} e_2 \]

evidently this distribution can be represented by weights $\pi_{1|1}$ and $\pi_{2|1}$ on A and B alone. Alternatively, since $\pi_{j|1} \ge 0$ for all $j$ with $\sum_j \pi_{j|1} = 1$, the distribution of $X_2$ conditional on $X_1 = 1$ is the resultant of these weights on A and B, so is a point $C_1$ in line segment AB.
Similarly, the distribution of $X_2$ conditional on $X_1 = 2$ can be represented as $(0, 0, \pi_{1|2}, \pi_{2|2})$, so as a point $C_2$, the resultant of weights $\pi_{1|2}$ and $\pi_{2|2}$ on C and D respectively (illustrated in Figure 1(b)).

[Figure 1: three panels, (a) Joint distribution, (b) Conditional distributions, (c) Marginal distribution, each showing the tetrahedron with vertices A (1, 0, 0, 0), B (0, 1, 0, 0), C (0, 0, 1, 0) and D (0, 0, 0, 1).]

Figure 1. The three distributions of categorical variables $X_1$ and $X_2$, each with two levels. In (a) the joint distribution of $X_1$ and $X_2$ is seen as weights $\pi_{11}$, $\pi_{12}$, $\pi_{21}$ and $\pi_{22}$ on A, B, C and D, with resultant $J$. In (b) the conditional distribution of $X_2$ when $X_1 = 1$ is seen as weights $\pi_{1|1}$ and $\pi_{2|1}$ on A and B, having resultant $C_1$, while the conditional distribution of $X_2$ when $X_1 = 2$ is weights $\pi_{1|2}$ and $\pi_{2|2}$ on C and D, having resultant $C_2$. In (c) the marginal distribution of $X_1$ is seen as weights $\pi_{1+}$ and $\pi_{2+}$ on edges AB and CD.

Joint distributions lying on AB oblige $X_1$ to equal one, so arguably line segment AB corresponds to $X_1 = 1$. Similarly, line segment CD corresponds to $X_1 = 2$. For this reason the marginal distribution of $X_1$, $(\pi_{1+}, \pi_{2+})$, can be represented as these weights on edges AB and CD, pictured by weighting these edges in Figure 1(c). From the definition of conditional probability we have that

\[ (\pi_{11}, \pi_{12}, \pi_{21}, \pi_{22}) = \pi_{1+} (\pi_{1|1}, \pi_{2|1}, 0, 0) + \pi_{2+} (0, 0, \pi_{1|2}, \pi_{2|2}) \]

or

\[ J = \pi_{1+} C_1 + \pi_{2+} C_2. \]

In this special case where the joint distribution $J$ and the conditional distributions $C_1$ and $C_2$ are known, the marginal distribution of $X_1$ can be represented as the weights $\pi_{1+}$ and $\pi_{2+}$ on $C_1$ and $C_2$ (still on AB and CD respectively) having resultant $J$. Figure 1 in fact illustrates these ideas using the frequency table of the Australia survey data shown in the right panel of Table 2.
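The decomposition of $J$ as a weighted combination of $C_1$ and $C_2$ can also be checked numerically. The sketch below is an illustration only, not code from the original article; it assumes the survey counts from Table 1, listed in the vertex order A, B, C, D, and all variable names are hypothetical.

```python
# Cell order (pi_11, pi_12, pi_21, pi_22) matches vertices A, B, C, D.
counts = [258, 222, 263, 151]
n = sum(counts)

# Joint distribution: a point J in the tetrahedron.
J = [c / n for c in counts]

# Marginal distribution of X1: weights on edges AB (X1 = 1) and CD (X1 = 2).
pi_1plus = J[0] + J[1]
pi_2plus = J[2] + J[3]

# Conditional distributions of X2 given X1: points on edges AB and CD.
C1 = [J[0] / pi_1plus, J[1] / pi_1plus, 0.0, 0.0]
C2 = [0.0, 0.0, J[2] / pi_2plus, J[3] / pi_2plus]

# Resultant of weights pi_1plus and pi_2plus placed at C1 and C2.
resultant = [pi_1plus * a + pi_2plus * b for a, b in zip(C1, C2)]

# Largest coordinate-wise discrepancy between the resultant and J.
max_error = max(abs(r - j) for r, j in zip(resultant, J))
print("max |resultant - J| =", max_error)
```

Up to floating-point rounding, the resultant coincides with `J`, which is the numerical counterpart of placing the marginal weights on the two conditional points.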
Here we can represent the joint distribution of Income and Attitude as $(0.2886, 0.2483, 0.2942, 0.1689) \in \mathbb{R}^4$, which corresponds to point $J$ in the tetrahedron. The distributions of Attitude conditional on Income Low and Income High can be represented by $C_1 = (0.5375, 0.4625, 0, 0)$ and $C_2 = (0, 0, 0.6353, 0.3647)$ respectively. Since $J = 0.5369\,C_1 + 0.4631\,C_2$, the marginal distribution of Income, $(0.5369, 0.4631)$, can now be represented as weights 0.5369 and 0.4631 on $C_1$ and $C_2$, having resultant $J$.

Fienberg and Gilbert [3] showed that the locus of all points corresponding to independence of rows and columns in a $2 \times 2$ table is a portion of a hyperbolic paraboloid in the tetrahedron, illustrated in Figure 2. In the figure, the point $J$ (the joint distribution of Income and Attitude) is seen to be a small distance away from the independence surface; further analysis would confirm that, with a sample size as large as 894, this indicates dependence between Income and Attitude. Loosely speaking, for a given sample size the further $J$ is from the independence surface, the greater the dependence between $X_1$ and $X_2$.

[Figure 2: the tetrahedron ABCD containing the independence surface and the point J.]

Figure 2. A graphic illustrating the locus of all points corresponding to independent $2 \times 2$ tables (a portion of a hyperbolic paraboloid) and the joint distribution $J$ of Income and Attitude in the tetrahedron ABCD.

4 Tables of higher dimension

For a general contingency table, the three distributional types can be pictured in a higher dimensional simplex, having as many vertices as cells of the table. The joint distribution appears as weights on all vertices of the simplex. Conditioning on the levels of a subset of the variables partitions all vertices of the simplex; the convex hull of each partition set forms a face of the simplex. A distribution conditional on levels of the chosen variables appears as weights on vertices of the associated face.
The marginal distribution of the random variables used for conditioning appears as weights on the simplicial faces determined by the partition sets.

For example, for a $4 \times 4$ table with variables $X_1$ and $X_2$, the joint distribution is the weights on the sixteen vertices of the simplex $S_{15}$. To picture the distribution of $X_2$ conditional upon $X_1$, the vertices of $S_{15}$ are partitioned into four sets of four using the levels of $X_1$. Four faces of $S_{15}$ are then constructed as convex hulls of each set of vertices; the distribution of $X_2$ conditional upon a given level of $X_1$ is weights on the vertices of the associated face. The marginal distribution of $X_1$ is weights on the four faces. These ideas are illustrated in Figure 3.

[Figure 3: schematic of a higher dimensional simplex containing the point J, with four shaded facial simplexes.]

Figure 3. A schematic illustration showing that for a multi-way table the joint distribution $J$ appears as weights on all vertices of a higher dimensional simplex; the resultant is a point in the simplex. Conditioning on values of a subset of all variables leads to a partitioning of the vertex set. Such a partition is shown as the four shaded simplexes. A conditional distribution is a weighting of the vertices of a partition set, for example, a weighting on the vertices of the upper shaded simplex. The associated marginal distribution of the subset of variables is the weighting of the facial simplexes formed by the partition, shown here using shading. The diagram presented here is strictly appropriate for a $4 \times 4$ table.

5 Conclusion

The three distributional types associated with a $2 \times 2$ table have been pictured in a tetrahedron. The joint distribution appears as weights on all vertices of the tetrahedron with resultant a point in the tetrahedron. A conditional distribution can be viewed as weights on vertices of an edge of the tetrahedron with resultant a point in the edge. A marginal distribution can be viewed as weights on the edges containing the conditional distributions. These ideas directly generalize to multi-way tables.

References

[1] A. Agresti, Categorical Data Analysis (Wiley, New York, 1990).
[2] S.E. Fienberg, The geometry of an r × c contingency table, The Annals of Mathematical Statistics 39 (1968), 1186–1190.
[3] S.E. Fienberg and J.P. Gilbert, The geometry of a two by two contingency table, Journal of the American Statistical Association 65 (1970), 694–701.
[4] J. Norton, G. Lawrence, and G.R. Wood, The Australian public's perception of genetically-engineered foods, Australasian Biotechnology 8 (1998), 172–181.

Department of Statistics, Macquarie University, NSW 2109
E-mail: gwood@efs.mq.edu.au

Institute of Information Sciences and Technology, College of Sciences, Massey University, Palmerston North, New Zealand

Received 26 May 2004, accepted 8 July 2004.