A Primer to Correspondence Analysis (© Gianmarco Alberti 2009)
_____________________________________________________________________________________
Correspondence Analysis (CA) is a method of data analysis for representing tabular data graphically
(Greenacre 2007: 1). Its graphical display is intended to reveal latent structure in the data and to make
complex data easier to interpret.
CA has attracted the attention of archaeologists because it operates on data matrices.
It is common in archaeological practice to build, for various purposes, tables in which contexts (e.g.,
tombs, huts, pits) are arranged in rows and their contents (e.g., vessels, bronze, pins, daggers, weights) in
columns, while the cells contain counts. A table arranged in this way is usually called a contingency
table. If we define contexts as objects and their contents as variables (for they vary from one context to another), it can
be said that a contingency table displays the joint distribution of the variables across the objects. In this way, it
is possible to compare objects in terms of their similarity with respect to the variables.
Now, let us consider a simple example: see tab. 1.
tab.1
Graves \ Types       A      B      C    Tot
1                   68     37      8    113
2                  136     95      3    234
3                   41      0      3     44
4                  690    181     26    897
5                   78    165     19    262
Tot               1013    478     59   1550
Tab. 1 reports the composition of five grave assemblages in terms of the counts of three different pottery types.
If we want to compare these assemblages, we have to standardize them by removing the effect of
different assemblage sizes. We can achieve this by converting the counts into percentages, so as to obtain
proportions rather than absolute values (Drennan 1996: 65-76).
Let us consider tab. 2.
tab.2
Graves \ Types       A      B      C    Tot
1                 0.60   0.33   0.07   1.00
2                 0.58   0.41   0.01   1.00
3                 0.93   0.00   0.07   1.00
4                 0.77   0.20   0.03   1.00
5                 0.30   0.63   0.07   1.00
average           0.65   0.31   0.04   1.00
It reports the composition of the five assemblages in terms of the proportion of each type within each assemblage.
This simple tabulation is immediately more informative than tab. 1, since it allows
us to compare the assemblages easily.
We can, for example, note that type A tends to predominate in the majority of the assemblages, and that
assemblage 5 has the highest proportion of type B.
The average row tells us that type A makes up about 65%, type B 31% and type C 4% of the overall pottery
assemblage. The average row also provides a standard against which to compare the individual rows. For
example, we can note that assemblage 3 is above the average as far as types A and C are concerned, while it is below it in
relation to type B.
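The conversion of tab. 1 into tab. 2 can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the counts are typed in by hand from tab. 1, and the function names (`row_profiles`, `average_row_profile`) are chosen here for the example and do not come from any CA software.

```python
# Counts from tab. 1: one row per grave, columns are pottery types A, B, C.
counts = [
    [68, 37, 8],     # grave 1
    [136, 95, 3],    # grave 2
    [41, 0, 3],      # grave 3
    [690, 181, 26],  # grave 4
    [78, 165, 19],   # grave 5
]


def row_profiles(table):
    """Divide each cell by its row total, so that every row sums to 1."""
    return [[cell / sum(row) for cell in row] for row in table]


def average_row_profile(table):
    """Divide the column totals by the grand total."""
    grand_total = sum(sum(row) for row in table)
    return [sum(column) / grand_total for column in zip(*table)]


profiles = row_profiles(counts)
average = average_row_profile(counts)

for grave, profile in enumerate(profiles, start=1):
    print(grave, [round(p, 2) for p in profile])
print("average", [round(p, 2) for p in average])
```

Rounded to two decimals, the printed rows should reproduce the proportions of tab. 2.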
What has been described so far applies to columns as well. While the row proportions allow us to compare the
assemblages with one another, the types can be compared with one another only by means of the column proportions.
Let us consider tab.3.
tab.3
Graves \ Types       A      B      C   average
1                 0.07   0.08   0.14   0.07
2                 0.13   0.20   0.05   0.15
3                 0.04   0.00   0.05   0.03
4                 0.68   0.38   0.44   0.58
5                 0.08   0.35   0.32   0.17
Tot               1.00   1.00   1.00   1.00
Tab. 3 tells us how the types are distributed between the assemblages. From this table it can be seen, for
example, that the majority of type A occurs in assemblage 4, while types B and C occur in large proportions in
assemblages 4 and 5.
In this case as well, the average column provides a basis for comparison. We can see that about 58% of the
overall pottery belongs to assemblage 4. By the same token, it can be seen that the 68% of type A in assemblage 4
is well above the average, while its frequency in assemblages 1 and 2 is below it.
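The column proportions of tab. 3 can be obtained in the same spirit. The following is a minimal sketch, with hypothetical function names chosen for this example only; the counts are again those of tab. 1.

```python
# Counts from tab. 1: one row per grave, columns are pottery types A, B, C.
counts = [
    [68, 37, 8],     # grave 1
    [136, 95, 3],    # grave 2
    [41, 0, 3],      # grave 3
    [690, 181, 26],  # grave 4
    [78, 165, 19],   # grave 5
]


def column_profiles(table):
    """Divide each cell by its column total, so that every column sums to 1."""
    column_totals = [sum(column) for column in zip(*table)]
    return [[cell / total for cell, total in zip(row, column_totals)]
            for row in table]


def average_column_profile(table):
    """Divide the row totals by the grand total."""
    grand_total = sum(sum(row) for row in table)
    return [sum(row) / grand_total for row in table]


col_profiles = column_profiles(counts)
avg_col = average_column_profile(counts)

for grave, (profile, avg) in enumerate(zip(col_profiles, avg_col), start=1):
    print(grave, [round(p, 2) for p in profile], round(avg, 2))
```

Each printed line gives, for one grave, its share of types A, B and C, followed by its share of the whole assemblage (the average column of tab. 3).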
By combining the information provided by tab. 2 and 3 (row and column proportions), we obtain a good
picture that summarizes, in a more structured way, the information of the original contingency table (tab. 1).
For example, we now know that assemblage 1 is made up of 60% type A, 33% type
B and 7% type C. We know that its 60% of type A is below the average of 65%, and so on. But we also know that, seen
from the perspective of the distribution of type A between the assemblages, assemblage 1 has only about 7% of that
type. In other words, the 68 pots of type A found in assemblage 1 correspond to 60% of the pots present in it, but
only to 7% of the overall distribution of that type between the five assemblages. From this last standpoint, type A
occurs most frequently in assemblage 4.
In CA terminology, row, column and average proportions are called (as we will see later)
row, column and average profiles respectively.
Let us take this discussion a step further.
Since we are dealing here with three variables, we can display the information provided by the
original contingency table (tab. 1) in a tripolar or ternary graph (fig. 1). The calculations behind this
graph are the same as those used to convert tab. 1 into tab. 2, that is, replacing the original absolute values with row
proportions. So we can think of this graph as taking its values from tab. 2.
Each side of the triangle represents a percentage scale, so that the assemblages can be plotted according to the values
of their profiles in relation to the sides of the triangle. Each assemblage is represented by a dot, as is the average.
An assemblage containing only type A (that is, 100% of type A) would lie at vertex A. The same applies to
an assemblage containing, say, only type B or only type C, which would lie at vertex B or C respectively.
In the case under discussion, the points representing the assemblages are spread inside the triangle according to
their different row profiles. Note, moreover, that some assemblages are closer than others to the average.
Fig. 1 (ternary graph with vertices A, B and C, showing assemblages 1-5 and the average)
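The position of a profile inside the triangle can be computed as a weighted mean of the three vertices, with the profile values as weights. The sketch below assumes an equilateral triangle with vertex A at the top, B at the bottom left and C at the bottom right; this layout and the function name `ternary_to_xy` are assumptions made for the example, not a description of how fig. 1 was actually drawn.

```python
import math

# Assumed layout of the triangle: A at the top, B bottom left, C bottom right.
VERTICES = {
    "A": (0.5, math.sqrt(3) / 2),
    "B": (0.0, 0.0),
    "C": (1.0, 0.0),
}


def ternary_to_xy(profile):
    """Place a three-part profile inside the triangle.

    The point is the weighted mean of the vertices; counts are accepted too,
    because the values are first rescaled to sum to 1.
    """
    a, b, c = profile
    total = a + b + c
    weights = (a / total, b / total, c / total)
    corners = (VERTICES["A"], VERTICES["B"], VERTICES["C"])
    x = sum(w * corner[0] for w, corner in zip(weights, corners))
    y = sum(w * corner[1] for w, corner in zip(weights, corners))
    return x, y


# Counts from tab. 1 (types A, B, C for graves 1-5).
counts = [
    [68, 37, 8],
    [136, 95, 3],
    [41, 0, 3],
    [690, 181, 26],
    [78, 165, 19],
]

for grave, row in enumerate(counts, start=1):
    x, y = ternary_to_xy(row)
    print(grave, round(x, 3), round(y, 3))
```

An assemblage made up of type A only is mapped exactly onto vertex A, which matches the reading of the graph given above.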
This kind of graphical display is satisfactory as long as we are dealing with three variables. With
three variables we can visualize the data by means of a triangle (tripolar graph), that is, in a two-dimensional space. Note that the number of dimensions needed to represent the data is one less than the
number of variables (n) involved. In this case n = 3, so the number of dimensions is 2.
As the number of variables grows, the number of dimensions needed grows
with it. With four variables we need a three-dimensional space to depict the data; by the same token,
objects defined by, say, twenty variables need a space with 19 dimensions to be displayed graphically.
As stressed at the opening of this primer, CA makes it possible to represent complex tabular data in a
low-dimensional space.
This possibility relies on what we touched upon before, that is, the difference of the individual
assemblage and type profiles (or, in more general terminology, row and column profiles) from the average.
With this concept we enter the very core of CA.
As a general statement, it can be said that CA represents tabular data graphically (as noted at
the start of this primer) and that CA is a generalization of a simple graphical concept, namely the scatterplot
(Greenacre 2007: 1), the latter being a representation of data as a set of points with respect to two perpendicular axes.
This means that CA can represent our tabular data as points in a low-dimensional space.
At this point one may wonder what values CA takes into account in order to display
tabular data graphically and, consequently, what is responsible for the spread of points in the scatterplot.
The answer lies in what we stated a few lines above: the difference of the individual row and
column profiles from the average.
In this respect, let us return to the simple example of tab. 1 above. If there were no difference between the
assemblages as far as their type proportions are concerned, we would expect the profile of each row to be more or
less the same as the average profile, differing from it only because of random sampling variation. Under this
assumption, we would expect each assemblage to be split up in the same proportions as the average.
Let us consider tab.4.
tab.4
Graves \ Types            A       B       C    Tot   row mass
1   observed             68      37       8    113   0.07
    expected           73.9    34.8     4.3
2   observed            136      95       3    234   0.15
    expected          152.9    72.2     8.9
3   observed             41       0       3     44   0.03
    expected           28.8    13.6     1.7
4   observed            690     181      26    897   0.58
    expected          586.2   276.6    34.1
5   observed             78     165      19    262   0.17
    expected          171.2    80.8    10.0
Tot                    1013     478      59   1550
average                0.65    0.31    0.04
It is similar to tab. 1, but it adds some extra information relevant to what we are dealing with right now.
Each "observed" row (as in tab. 1) reports the absolute observed values, that is, the counts of pottery types in each
assemblage. Under each of these rows the expected values are reported, that is, the values that one would expect if
each assemblage were split up in the same proportions as the average. The values reported to the right are referred to
as row masses: they give an idea of how big each row is in relation to the total. For example, assemblage 1
makes up 7% of the overall assemblage (113/1550 = 0.07, or 7%), assemblage 2 makes up 15% (234/1550 = 0.15, or 15%), and
so on. It can be seen that assemblage 4 has the largest impact on the overall assemblage, making up 58% of the total.
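The expected values and the row masses of tab. 4 follow directly from the row totals, the column totals and the grand total of tab. 1. A minimal sketch, with function names that are assumptions of this example:

```python
# Counts from tab. 1: one row per grave, columns are pottery types A, B, C.
counts = [
    [68, 37, 8],     # grave 1
    [136, 95, 3],    # grave 2
    [41, 0, 3],      # grave 3
    [690, 181, 26],  # grave 4
    [78, 165, 19],   # grave 5
]


def expected_counts(table):
    """Counts expected if every row were split up like the average profile."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in column_totals] for r in row_totals]


def row_masses(table):
    """Share of each row total in the grand total."""
    grand_total = sum(sum(row) for row in table)
    return [sum(row) / grand_total for row in table]


expected = expected_counts(counts)
masses = row_masses(counts)

for grave, (row, mass) in enumerate(zip(expected, masses), start=1):
    print(grave, [round(value, 1) for value in row], round(mass, 2))
```

Note that the expected counts of each grave add up to the observed total of that grave, since only the split between types is being changed.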
It is clear that the observed frequencies differ from the expected ones. The point is to understand whether
these differences are large enough to be statistically significant, that is, so large that they are unlikely to have occurred
by chance alone.
For this purpose we can perform the $\chi^2$ test (Drennan 1996: 187-191; Shennan 1997: 104-115), which returns
a value; the larger this value, the more discrepant the observed and expected frequencies are and, therefore, the less
likely it is that such a discrepancy occurred by chance.
The formula of this test is given below:
$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
Taking data from tab. 4, the chi-square calculation will be:
$$\chi^2 = \frac{(68 - 73.9)^2}{73.9} + \frac{(37 - 34.8)^2}{34.8} + \frac{(8 - 4.3)^2}{4.3} + \text{the corresponding terms for other assemblages}$$
The value returned by the test is about 236.7 which, for 8 degrees of freedom, is highly significant. This means that
there is much variation in the table and it is highly unlikely that this variation occurred by chance alone.
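The chi-square calculation can be carried out for the whole table as follows. This is a sketch in plain Python rather than the output of a statistical package; the critical value used for comparison is the tabulated one for 8 degrees of freedom at the 0.001 level and is typed in by hand as an assumption of the example. Because the code works with unrounded expected values, its result may differ slightly from a hand calculation based on rounded figures.

```python
# Counts from tab. 1: one row per grave, columns are pottery types A, B, C.
counts = [
    [68, 37, 8],     # grave 1
    [136, 95, 3],    # grave 2
    [41, 0, 3],      # grave 3
    [690, 181, 26],  # grave 4
    [78, 165, 19],   # grave 5
]


def chi_square(table):
    """Return the chi-square statistic and its degrees of freedom."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for row, row_total in zip(table, row_totals):
        for observed, column_total in zip(row, column_totals):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(row_totals) - 1) * (len(column_totals) - 1)
    return statistic, degrees_of_freedom


statistic, degrees_of_freedom = chi_square(counts)

# Tabulated critical value for 8 degrees of freedom at the 0.001 level
# (taken from standard chi-square tables, not computed here).
CRITICAL_VALUE_DF8_P001 = 26.12

print(round(statistic, 1), degrees_of_freedom,
      statistic > CRITICAL_VALUE_DF8_P001)
```

The degrees of freedom are (rows - 1) x (columns - 1), that is 4 x 2 = 8 for this table.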
CA is interested in the chi-square statistic because it measures the heterogeneity of the row profiles (and,
equally, of the column profiles), that is, how much variance there is in the data table,
and because it gives a measure of the distance of the row profiles from the average.
To do so, the chi-square formula must be re-expressed in order to take into account the row profiles and the
row totals. The above formula can be rewritten as follows:
$$\chi^2 = 113 \times \frac{(0.60 - 0.65)^2}{0.65} + 113 \times \frac{(0.33 - 0.31)^2}{0.31} + 113 \times \frac{(0.07 - 0.04)^2}{0.04} + \text{the corresponding terms for other assemblages}$$
or, more generally, each term in this calculation is of the form
$$\text{row total} \times \frac{(\text{observed row profile} - \text{expected row profile})^2}{\text{expected row profile}}$$
In other words, we take the formula of the chi-square calculation and in each term replace the observed and
expected frequencies with the observed and expected row profiles (in this case taken from tab. 2).
Each term, moreover, is multiplied by the corresponding row total.
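Now we make another small modification, and we divide each term of the above formula by the table grand
total (or, in other words, by the sample size n). In our case (see tab. 4) this value is 1550.
So, let us re-express the above formula:
$$\frac{\chi^2}{1550} = \frac{113}{1550} \times \frac{(0.60 - 0.65)^2}{0.65} + \frac{113}{1550} \times \frac{(0.33 - 0.31)^2}{0.31} + \frac{113}{1550} \times \frac{(0.07 - 0.04)^2}{0.04} + \text{the corresponding terms for other assemblages}$$
that is
$$\frac{\chi^2}{1550} = 0.07 \times \frac{(0.60 - 0.65)^2}{0.65} + 0.07 \times \frac{(0.33 - 0.31)^2}{0.31} + 0.07 \times \frac{(0.07 - 0.04)^2}{0.04} + \text{the corresponding terms for other assemblages}$$
that is
$$\frac{\chi^2}{1550} = 0.07 \times \left[ \frac{(0.60 - 0.65)^2}{0.65} + \frac{(0.33 - 0.31)^2}{0.31} + \frac{(0.07 - 0.04)^2}{0.04} \right] + \text{the corresponding terms for other assemblages} \quad (A)$$
Note that the value 113/1550 = 0.07 turns out to correspond to the individual row mass, as stated before in relation to tab. 4.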
The quantity $\chi^2 / n$ is called total inertia or, simply, inertia in CA. As stated by Greenacre (2007: 28), it is a
measure of how much variance there is in the table and does not depend on the sample size. It turns out to correspond to the
$\phi^2$ statistic (Shennan 1997: 115-116).
It should also be noted that in the formula of the inertia (labeled A above) each of the five terms
(one for each row of tab. 2, that is, one for each assemblage in our simple example) is the row
mass multiplied by a quantity that, as we are about to see, is the square of the $\chi^2$-distance.
This distance measures how far each row profile lies from the average profile, and it is
a “modified” version of the Euclidean distance.
Let us briefly explain what the Euclidean distance is (example taken from Shennan 1997: 223-224). Let us
imagine that we have two vessels (i and j) described by two variables: x = rim diameter, y = height.
We could display our vessels in a scattergram: see fig. 2.
In this example, the Euclidean distance is a measure of the similarity between the two vessels as defined by the
two variables. It is the straight-line distance between the two points representing the vessels,
and is found by means of Pythagoras’ theorem.
In this case, the distance will be:
$$d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$
If there are more variables, then we have to add extra terms (that is, the squared differences in brackets), one
for every variable.
When two points are in the same place, the items are identical (and vice versa), and clearly the distance will
be 0.
Fig. 2 (scattergram of vessels i and j plotted by rim diameter and height)
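The Euclidean distance for any number of variables can be sketched as a short function. The two vessels below are hypothetical: their measurements are invented for the example and do not come from Shennan or from any real assemblage.

```python
import math


def euclidean_distance(item_i, item_j):
    """Straight-line distance between two items measured on the same variables."""
    if len(item_i) != len(item_j):
        raise ValueError("both items must be described by the same variables")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(item_i, item_j)))


# Hypothetical vessels described by (rim diameter, height), in centimetres.
vessel_i = (12.0, 20.0)
vessel_j = (15.0, 24.0)

print(euclidean_distance(vessel_i, vessel_j))  # 3-4-5 triangle, so 5.0
print(euclidean_distance(vessel_i, vessel_i))  # identical items, so 0.0
```

Adding a third variable (say, base diameter) only requires longer tuples; the function itself does not change.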
Now let us return to the $\chi^2$-distance. If we were to compute the Euclidean distance between, say, the
profile of assemblage 1 and the average profile, the formula would be as follows:
$$\text{Euclidean distance} = \sqrt{(0.60 - 0.65)^2 + (0.33 - 0.31)^2 + (0.07 - 0.04)^2}$$
If we divide each term by the corresponding value of the average, we get the $\chi^2$-distance:
$$\chi^2\text{-distance} = \sqrt{\frac{(0.60 - 0.65)^2}{0.65} + \frac{(0.33 - 0.31)^2}{0.31} + \frac{(0.07 - 0.04)^2}{0.04}}$$
It is evident that the $\chi^2$-distance is a weighted Euclidean distance, for it rescales the Euclidean distance using a
scaling factor in the denominator that corresponds to the elements of the expected profile. In this example, the $\chi^2$-distance
provides a measure of the distance between the profile of assemblage 1 and the average profile.
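The two distances can be compared numerically. The sketch below uses the unrounded profiles computed from the counts of tab. 1, so its results differ a little from a hand calculation based on the two-decimal values of tab. 2; the function names are assumptions of the example.

```python
import math

# Counts from tab. 1: one row per grave, columns are pottery types A, B, C.
counts = [
    [68, 37, 8],
    [136, 95, 3],
    [41, 0, 3],
    [690, 181, 26],
    [78, 165, 19],
]

grand_total = sum(sum(row) for row in counts)
average = [sum(column) / grand_total for column in zip(*counts)]


def euclidean_distance(profile, reference):
    """Plain straight-line distance between two profiles."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(profile, reference)))


def chi_square_distance(profile, reference):
    """Euclidean distance with each squared difference divided by the
    corresponding element of the reference (average) profile."""
    return math.sqrt(sum((p - a) ** 2 / a for p, a in zip(profile, reference)))


grave_1 = [cell / sum(counts[0]) for cell in counts[0]]
d_euclid = euclidean_distance(grave_1, average)
d_chi = chi_square_distance(grave_1, average)

print(round(d_euclid, 3), round(d_chi, 3))
```

Since every element of the average profile is smaller than 1, dividing by it enlarges each term, and most of all the term of the rarest type (C).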
If we take a look at the formula of the inertia (labeled A above), we will see that the quantity in square
brackets is exactly the square of the $\chi^2$-distance of each individual row profile, multiplied by the mass of each row.
As a consequence of this rather long reasoning, we can say that to find the inertia of the table we
have to calculate, for each row, the row mass, multiply it by the square of the row's $\chi^2$-distance from the average,
and add up the results.
By means of this reasoning, and looking at the formula of the inertia (A, above), we are able to know:
- how much variation there is in the table, that is, the total amount of departure from the average (the total inertia or
inertia);
- the $\chi^2$-distance of each individual row profile from the average profile;
- that the inertia will be high when the row profiles deviate greatly from the average; conversely, it will be low when
they are close to the average;
- that the rows which contribute the most to the inertia will be those which have the largest departures from the
average and the largest mass.
It should be noted that what we have described so far applies to the column profiles as well.
It should also be noted that if we multiply the total inertia by n (the table grand total or, in
other words, the sample size), we end up with the $\chi^2$ statistic for our table, which can be useful in itself (Drennan 1996:
187-191; Shennan 1997: 104-115; Greenacre 2007: 74).
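The recipe for the inertia (row mass times squared $\chi^2$-distance, summed over the rows) and its link with the $\chi^2$ statistic can be checked numerically. A minimal sketch on the counts of tab. 1, using unrounded values; the names are chosen for this example only.

```python
# Counts from tab. 1: one row per grave, columns are pottery types A, B, C.
counts = [
    [68, 37, 8],
    [136, 95, 3],
    [41, 0, 3],
    [690, 181, 26],
    [78, 165, 19],
]

grand_total = sum(sum(row) for row in counts)
column_totals = [sum(column) for column in zip(*counts)]
average = [total / grand_total for total in column_totals]

# Route 1: row mass x squared chi-square distance, one term per row.
contributions = []
for row in counts:
    row_total = sum(row)
    mass = row_total / grand_total
    profile = [cell / row_total for cell in row]
    squared_distance = sum((p - a) ** 2 / a for p, a in zip(profile, average))
    contributions.append(mass * squared_distance)
inertia = sum(contributions)

# Route 2: the usual chi-square statistic from observed and expected counts.
chi_square = 0.0
for row in counts:
    for observed, column_total in zip(row, column_totals):
        expected = sum(row) * column_total / grand_total
        chi_square += (observed - expected) ** 2 / expected

for grave, part in enumerate(contributions, start=1):
    print(grave, round(part, 4), f"{100 * part / inertia:.1f}%")
print(round(inertia, 4), round(inertia * grand_total, 1), round(chi_square, 1))
```

The last line prints the inertia, the inertia multiplied by n, and the $\chi^2$ statistic computed directly; the last two figures should coincide. The per-row percentages show which graves contribute most to the inertia.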
Having understood the concepts of row and column profiles, average profile, the dimensions needed to
display them graphically, the $\chi^2$-distance of row/column profiles from the average profile, and total inertia, our discussion
jumps directly to the visualization of the data of tab. 1 by means of CA. Some other rather complex technical aspects
should be mentioned in between, but I prefer to refer the reader to Greenacre’s Chapters 5-8 (Greenacre 2007: 33-65). I
shall limit myself to highlighting the key concepts needed to understand CA scatterplots, and to explaining how
what we have touched upon so far is involved in such a visual display.
Let us see what the CA scatterplot of tab. 1 looks like (see fig. 3).
Fig. 3 (CA scatterplot of tab. 1: Axis 1 horizontal, Axis 2 vertical; points for graves 1-5 and types A-C)
The scatterplot represents the data in a low-dimensional space defined by orthogonal axes. In this case, we are
representing the data by means of axis 1 (horizontal) and axis 2 (vertical). Each axis accounts for part of the total inertia in
the table: the first axis usually accounts for the greatest part of the inertia, the remainder being accounted for by the other
axes. In this case, axis 1 accounts for 94.5% of the total inertia, while axis 2 accounts for the remaining 5.5%.
Together, the two axes thus account for all the variance in our data. These statistics are provided by any
software that performs CA.
To understand the spread of points in the space defined by the CA axes, we just have to recall what we
talked about above, that is, the concepts of $\chi^2$-distance and row/column profiles.
The points, as their labels show, represent the row and column profiles. The point of
intersection between the two axes represents the average; consequently, the spread of the points in the two-dimensional
space is related to the $\chi^2$-distance of each row and column profile from the average profile. It must be stressed that, in
order to be displayed in such a two-dimensional space, the $\chi^2$-distance has been transformed into a Euclidean
distance (for more details see Greenacre 2007: 33-40). In other words, the straight-line distance between the point of
intersection and the row/column profile points relates to their $\chi^2$-distance from the average.
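For readers who wish to see where the axes and their shares of inertia come from, the following sketch applies the usual computational route for CA, a singular value decomposition of the standardized residuals, to the counts of tab. 1. It relies on NumPy and is an illustration only: it is not the procedure used by the software that produced fig. 3, and the sign of each axis is arbitrary, so a plot based on it may appear mirrored.

```python
import numpy as np

# Counts from tab. 1: one row per grave, columns are pottery types A, B, C.
counts = [
    [68, 37, 8],
    [136, 95, 3],
    [41, 0, 3],
    [690, 181, 26],
    [78, 165, 19],
]


def correspondence_analysis(table):
    """Return principal inertias and principal coordinates of rows and columns."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                    # table of relative frequencies
    r = P.sum(axis=1)                  # row masses
    c = P.sum(axis=0)                  # column masses (average row profile)
    # Standardized residuals: departures from the average, rescaled.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, singular_values, Vt = np.linalg.svd(S, full_matrices=False)
    principal_inertias = singular_values ** 2
    row_coords = (U * singular_values) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * singular_values) / np.sqrt(c)[:, None]
    return principal_inertias, row_coords, col_coords


inertias, row_coords, col_coords = correspondence_analysis(counts)
shares = 100 * inertias / inertias.sum()

print("total inertia:", round(float(inertias.sum()), 4))
print("share of inertia per axis (%):", np.round(shares[:2], 1))
print("grave coordinates on axes 1 and 2:")
print(np.round(row_coords[:, :2], 3))
print("type coordinates on axes 1 and 2:")
print(np.round(col_coords[:, :2], 3))
```

In these coordinates the origin is the average profile, and the squared distance of a grave from the origin equals its squared $\chi^2$-distance from the average, which is the relationship described in the text.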
How can we read the scatterplot? (For more details see Shennan 1997: 320-341; Greenacre 2007: 65-80.)
As far as the horizontal dimension (axis 1) is concerned, the scatterplot distinguishes between
row profiles (graves, in our archaeological example) with a relatively high proportion of type A,
like graves 3 and 4, and graves with a relatively high proportion of types B and/or C, like graves 1, 2
and 5. Each group is in turn spread out along the vertical dimension (axis 2): axis 2
sets grave 3 apart from grave 4 on the one hand, and graves 1 and 5 apart from grave 2 on the other. But it must be kept in mind that axis 2
accounts for only about 6% of the inertia, so the main trend in the data is represented by axis 1.
If we take a look at tab. 2 and 3 we can readily understand the graphical display of our data with CA.
Graves 3 and 4 are above the average as far as the proportion of type A within their
respective assemblages is concerned (tab. 2); graves 1, 2 and 5, on the other hand, are below the average for
type A, but above the average for the proportion
(within each assemblage) of types B and/or C. This causes graves 3 and 4 to be displayed far apart
from the group 1, 2, 5. It should also be noted that graves 3 and 4 are far apart along axis 2: this happens
because of the relatively higher proportion of type A in grave 4 than in grave 3, as far as the distribution of that type
between the assemblages is concerned (see tab. 3). The same can be said for the separation (along the vertical axis)
between grave 1 (with its relatively higher proportion of type C) and grave 2 (with its relatively higher proportion of type
B).
Even from this simple example it is evident how useful CA is for a better understanding of the structure
of a data table. In a case like the one discussed here, where we are dealing with only a few variables, it would have been
possible to compare the grave assemblages just by eyeballing the data of tab. 1-3 and comparing the profiles (row, column and
average) with each other in order to ascertain which ones are most similar. But this method would not be far-reaching if we were dealing with many objects (in this case, graves) and many variables (in this case, pottery types).
Let us now consider a slightly more complex example, with a higher number of
objects and variables, in order to evaluate the help CA can give in the interpretation of a more complex data set.
Let us consider tab. 5, which is made up of 10 objects (grave assemblages) and 5 variables (pottery types). In
this example, graves are defined by the presence in them of 5 different pottery types, ranging from the most precious (A)
to the least precious (E).
tab.5
Graves \ Types       A      B      C      D      E    Tot
1                    3     19     39     14     10     85
2                    1      2     13      1     12     29
3                    6     25     49     21     29    130
4                    3     15     41     35     26    120
5                   10     22     47      9     26    114
6                    3     11     25     15     34     88
7                    1      6     14      5     11     37
8                    0     12     34     17     23     86
9                    2      5     11      4      7     29
10                   2     11     37      8     20     78
Tot                 31    128    310    129    198    796
av. row profile   0.04   0.16   0.39   0.16   0.25
Tab. 5 reports the row and column totals and the average row profile. This 10x5 table lies in a four-dimensional
space, that is, a space with four dimensions would be needed to represent it exactly as a scatterplot.
Let us see the CA scatterplot for this table (fig. 4):
Fig. 4 (CA scatterplot of tab. 5: Axis 1 horizontal, Axis 2 vertical; points for graves 1-10 and types A-E)
Each axis accounts for a part of the inertia, expressed as a percentage: axis 1 for 47.2%, axis 2 for 36.7%. The total
inertia is 0.082879, hence the $\chi^2$ statistic is 65.97 (that is, 0.082879 x 796), a highly significant value for 9 x 4 = 36
degrees of freedom.
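The figures quoted for tab. 5 can be reproduced from the counts alone. The sketch below is plain Python with names chosen for the example; it computes the $\chi^2$ statistic from observed and expected counts and derives the inertia by dividing by the grand total.

```python
# Counts from tab. 5: one row per grave, columns are pottery types A-E.
table5 = [
    [3, 19, 39, 14, 10],   # grave 1
    [1, 2, 13, 1, 12],     # grave 2
    [6, 25, 49, 21, 29],   # grave 3
    [3, 15, 41, 35, 26],   # grave 4
    [10, 22, 47, 9, 26],   # grave 5
    [3, 11, 25, 15, 34],   # grave 6
    [1, 6, 14, 5, 11],     # grave 7
    [0, 12, 34, 17, 23],   # grave 8
    [2, 5, 11, 4, 7],      # grave 9
    [2, 11, 37, 8, 20],    # grave 10
]

row_totals = [sum(row) for row in table5]
column_totals = [sum(column) for column in zip(*table5)]
n = sum(row_totals)

chi_square = 0.0
for row, row_total in zip(table5, row_totals):
    for observed, column_total in zip(row, column_totals):
        expected = row_total * column_total / n
        chi_square += (observed - expected) ** 2 / expected

inertia = chi_square / n
degrees_of_freedom = (len(row_totals) - 1) * (len(column_totals) - 1)

print(n, round(inertia, 6), round(chi_square, 2), degrees_of_freedom)
```

The printed values should agree, up to rounding, with the grand total, the total inertia, the $\chi^2$ statistic and the degrees of freedom given in the text.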
Let us consider how to interpret the scattergram and what it tells us about our dataset.
It can easily be seen that the horizontal dimension (axis 1) lines up four of the pottery types, with the most
precious (A) to the left and the less precious (D) far to the right; the least precious type of all lies far from this
group: axis 2, in fact, separates pottery type E from the group A-D.
Now, let us consider the spread of the row profile points in relation to the position of the column profile points: it is
clear that the further a grave lies to the left, the higher its proportion of precious types; the further a grave lies to the
right, the higher its proportion of less precious types; the higher a grave lies in the scattergram,
the higher its proportion of type E (the least precious of all the pottery types).
Finally, it should be noted that graves 1, 2, 9 and 10 have similar positions in relation to axis 1, but are
spread out along axis 2. This means that these graves have similar positions with respect to the type
categories A-D, but grave 1 has a lower proportion of type E than grave 2.
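The statement about type E can be checked against the row profiles of tab. 5. A small sketch, limited to the four graves mentioned above; the dictionary layout is an assumption of the example.

```python
# Counts from tab. 5 for the four graves discussed (types A, B, C, D, E).
selected = {
    1: [3, 19, 39, 14, 10],
    2: [1, 2, 13, 1, 12],
    9: [2, 5, 11, 4, 7],
    10: [2, 11, 37, 8, 20],
}

# Proportion of type E (last column) within each grave assemblage.
share_of_e = {grave: row[-1] / sum(row) for grave, row in selected.items()}

for grave, share in share_of_e.items():
    print(grave, round(share, 2))
```

Grave 2 has by far the largest share of type E among the four, and grave 1 the smallest, which is consistent with their opposite positions along axis 2.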
On the basis of what we have seen so far, it is possible to emphasize that CA, with its graphical representation of
data tables as scatterplots, allows us to visualize multi-dimensional data in a low-dimensional space, to
explore the structure of the data, to understand the relations between objects and variables, and to compare objects with each
other. Naturally, the reduction of dimensionality causes some loss of information, but the objective is to
restrict this loss to a minimum so that a maximum amount of information is retained (Greenacre 2007: 41).
Before passing to the next issue, that is, the use of CA for seriation purposes, I would like to stress, using
Greenacre’s words, that when applicable, it is useful to test a contingency table for significant association using
the $\chi^2$ test.
However, statistical significance is not a crucial requirement for justifying an inspection of the maps. CA should be
regarded as a way of re-expressing the data in pictorial form for ease of interpretation. With this objective any table of
data is worth looking at (Greenacre 2007: 80).
…to be continued
© Gianmarco Alberti
2009
References:
Drennan R.D. 1996, Statistics for Archaeologists. A Commonsense Approach, New York.
Greenacre M. 2007, Correspondence Analysis in Practice, New York/London (second edition).
Shennan S. 1997, Quantifying Archaeology, Edinburgh (second edition).
Additional documentation:
Baxter M.J. 1994, Exploratory Multivariate Analysis in Archaeology, Edinburgh.
Weller S.C., Romney A.K. 1990, Metric Scaling. Correspondence Analysis, Newbury Park-London-New Delhi.
Notes:
All graphs in this primer have been made with PAST.
Fig. 2 after Shennan 1997.
Freeware Software:
For two useful free computer programs capable of performing CA (PAST, CAPCA), please visit my Personal Home Page
http://xoomer.alice.it/gianmarco.alberti and go to the page “Links”.