arXiv:2003.00043v1 [stat.AP] 28 Feb 2020
Finding archetypal patterns for binary
questionnaires
Ismael Cabero1, Irene Epifanio2
Abstract
Archetypal analysis is an exploratory tool that explains a set of observations as mixtures
of pure (extreme) patterns. If the patterns are actual observations of the sample, we refer
to them as archetypoids. For the first time, we propose to use archetypoid analysis for
binary observations. This tool can contribute to the understanding of a binary data set,
as in the multivariate case. We illustrate the advantages of the proposed methodology
in a simulation study and two applications, one exploring objects (rows) and the other
exploring items (columns). One is related to determining student skill set profiles and
the other to describing item response functions.
MSC:
62H99, 62P25, 97D60
Keywords:
dichotomous item test, archetypal analysis, functional data analysis, item response
theory, skill profile
1 Departament de Didàctica de les Matemàtiques, Universitat de València, Spain
2 Departament de Matemàtiques-IMAC-IF, Universitat Jaume I, Castelló 12071, Spain. Email: epifanio@uji.es
1. Introduction
Mining binary survey data is of utmost importance in social sciences. Many raw data
from exams, opinion surveys, attitude questionnaires, etc. come in the form of a binary
data matrix, i.e. examinees’ responses are coded as 0/1 (1 if examinee i answers item h
correctly, otherwise 0). The binary matrix can be viewed from two points of view. In
the first, the interest lies in the rows, i.e. in the people, while in the second, the interest
lies in the columns that contain the items or variables. In both cases, exploratory data
analysis (EDA) aims to find information in data and generate ideas (Unwin, 2010). In
order to be useful as a tool for EDA on data sets, the tool should be simple and easy to
use, with few parameters, and reveal the salient features of the data in such a way that
humans can visualize them (Friedman and Tukey, 1974).
For the first time, we propose the use of the exploratory tool Archetypoid Analysis (ADA) for this kind of data in order to understand, describe, visualize and extract
information that is easily interpretable, even by non-experts. ADA is an unsupervised
statistical learning technique (see Hastie et al. (2009, Chapter 14) for a complete review
of unsupervised learning techniques). Its objective is to approximate sample data by
a convex combination (a mixture) of k pure patterns, the archetypoids, which are extreme representative observations of the sample. Being part of the sample makes them
interpretable, but also being extreme cases facilitates comprehension of the data. Humans understand the data better when the observations are shown through their extreme
constituents (Davis and Love, 2010) or when features of one observation are shown as
opposed to those of another (Thurau et al., 2012).
ADA was proposed by Vinué et al. (2015) for real continuous multivariate data as a
derivative methodology of Archetype Analysis (AA). AA was formulated by Cutler and Breiman
(1994), and like ADA, it seeks to approximate data through mixtures of archetypes, also
for real continuous multivariate data. However, archetypes are not actual cases, but rather
a mixture of data points. Recently, Seth and Eugster (2016b) proposed a probabilistic
framework of AA (PAA) to accommodate binary observations by working in the parameter space.
AA and ADA have been applied to many different fields, such as astrophysics (Chan et al.,
2003), biology (D’Esposito et al., 2012), climate (Steinschneider and Lall, 2015; Su et al.,
2017), developmental psychology (Ragozini et al., 2017), e-learning (Theodosiou et al.,
2013), finance (Moliner and Epifanio, 2019), genetics (Thøgersen et al., 2013), human
development (Epifanio, 2016; Epifanio et al., 2020), industrial engineering (Epifanio et al.,
2013, 2018; Millán-Roures et al., 2018; Alcacer et al., 2020), machine learning (Mørup and Hansen,
2012; Seth and Eugster, 2016a,b; Ragozini and D’Esposito, 2015; Cabero and Epifanio,
2019), market research (Li et al., 2003; Porzio et al., 2008; Midgley and Venaik, 2013),
multi-document summarization (Canhasi and Kononenko, 2013, 2014), nanotechnology
(Fernandez and Barnard, 2015), neuroscience (Tsanousa et al., 2015; Hinrich et al., 2016)
and sports (Eugster, 2012; Vinué and Epifanio, 2017, 2019).
Archetypal analysis techniques lie somewhere in between two well-known unsupervised statistical techniques: Principal Component Analysis (PCA) and Cluster Analysis (CLA). In data decomposition techniques, a data set is viewed as a linear combination of several factors to find the latent components. Different prototypical analysis tools arise depending on the constraints on the factors and how they are combined
(Mørup and Hansen, 2012; Vinué et al., 2015). The factors with the least restrictions
are those produced by PCA, since they are linear combinations of variables. One of
the advantages is that this helps explain the variability of the data; however, the interpretability of the factors is compromised. Instead, the greatest restrictions are found in
cluster tools, such as k-means or k-medoids. Their factors are readily interpreted because they are centroids (means of groups of data) or medoids (concrete observations)
in the case of k-means and k-medoids, respectively. The price that clustering tools pay
for interpretability is their modeling flexibility due to the binary assignment of data to
the clusters. Archetypal tools, on the other hand, enjoy higher modeling flexibility than
cluster tools but without losing the interpretability of their factors. A table summarizing the relationship between several unsupervised multivariate techniques is provided by
Mørup and Hansen (2012) and Vinué et al. (2015).
AA and ADA were originally thought of for real-valued observations. The aim of
this work is to extend archetypal tools to binary data. For AA, as the factors (archetypes)
are a mixture of data, they would not necessarily be binary vectors, and as a consequence
they would not be interpretable. In ADA though, the factors (archetypoids) are actual
cases, so ADA can be applied to binary data without losing the interpretability of the
factors. So, among the possible archetypal techniques (AA, PAA and ADA), we propose
to use ADA for binary data.
To perform a sanity check and provide insight we analyze the solutions obtained by
AA, PAA and ADA through a simulation study, where ADA shows its appropriateness
versus AA or PAA for binary data sets. Furthermore, we present two real applications
and compare ADA solutions with those of other established unsupervised techniques
to illustrate the advantages of ADA in educational and behavioral sciences, when used
as another useful tool for data mining in these fields (Slater et al., 2017). In the first
application, we are interested in rows, while in the second application in columns.
The outline of the paper is as follows: In Section 2 we review AA and ADA for real-valued multivariate and functional data and PAA, besides other multivariate techniques
used in the comparison. In Section 3 we introduce the analysis for binary multivariate
data. In Section 4, a simulation study with binary data compares the different strategies
for obtaining archetypal patterns. In Section 5, our proposal is applied to two real data
sets and compared to the results of other well-known unsupervised statistical learning
techniques. Section 6 contains conclusions and some ideas for future work.
The data sets and code in R (R Development Core Team, 2019) for reproducing the
results for both artificial and real data are available at http://www3.uji.es/˜epifanio/RESEARCH/adae
2. Preliminary
2.1. AA and ADA in the real-valued multivariate case
Let X be an n × m real-valued matrix with n observations and m variables. Three matrices are established in AA: a) the k archetypes z_j, which are the rows of a k × m matrix Z; b) an n × k matrix α = (α_ij) with the mixture coefficients that approximate each observation x_i by a mixture of the archetypes (x̂_i = ∑_{j=1}^k α_ij z_j); and c) a k × n matrix β = (β_jl) with the mixture coefficients that characterize each archetype (z_j = ∑_{l=1}^n β_jl x_l). To figure out these matrices, we minimize the following residual sum of squares (RSS) with the respective constraints (‖·‖ denotes the Euclidean norm for vectors):

RSS = ∑_{i=1}^n ‖x_i − ∑_{j=1}^k α_ij z_j‖² = ∑_{i=1}^n ‖x_i − ∑_{j=1}^k α_ij ∑_{l=1}^n β_jl x_l‖²,   (1)

under the constraints

1) ∑_{j=1}^k α_ij = 1 with α_ij ≥ 0 for i = 1, ..., n, and

2) ∑_{l=1}^n β_jl = 1 with β_jl ≥ 0 for j = 1, ..., k.
As previously mentioned, archetypes do not necessarily match real observations. Indeed, this will only happen when one and only one β_jl is equal to one for each archetype, i.e. when each archetype is defined by only one observation. So, in ADA the previous constraint 2) is substituted by the following one, and as a consequence in ADA a mixed-integer problem is optimized instead of the AA continuous optimization problem:

2) ∑_{l=1}^n β_jl = 1 with β_jl ∈ {0, 1} and j = 1, ..., k.
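With this constraint, fitting ADA amounts to a combinatorial search over subsets of rows. As an illustration, a minimal sketch in Python (ours, not the authors' R code; the helper names are hypothetical, and the sum-to-one constraint is imposed through a weighted-row trick rather than an exact constrained solver):

```python
import numpy as np
from scipy.optimize import nnls

def mixture_coefficients(X, Z, penalty=200.0):
    """Row-wise solve: min ||x_i - alpha_i Z|| s.t. alpha_i >= 0, sum(alpha_i) = 1.
    The sum-to-one constraint is approximated by a heavily weighted extra row."""
    A = np.vstack([Z.T, penalty * np.ones(Z.shape[0])])
    return np.array([nnls(A, np.append(x, penalty))[0] for x in X])

def ada_rss(X, idx):
    """RSS of equation (1) when the archetypoids are the sample rows X[idx],
    i.e. each beta_j is an indicator vector selecting one observation."""
    Z = X[np.asarray(idx)]
    alpha = mixture_coefficients(X, Z)
    return float(np.sum((X - alpha @ Z) ** 2))
```

For data lying in a triangle, choosing the three vertices as archetypoids gives an RSS near zero, since every other point is a convex combination of them.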
As regards the location of archetypes, they are on the boundary of the convex hull
of the data if k > 1 (see Cutler and Breiman (1994)), although this does not necessarily
happen for archetypoids (see Vinué et al. (2015)). Nonetheless, if k = 1, the archetype is equal to the mean, and the archetypoid to the medoid (Kaufman and Rousseeuw, 1990).
We want to emphasize that archetypal analysis is an EDA technique based on a geometric formulation (no distribution of data is assumed). It is not an inferential statistical
technique, i.e. it is not about fitting models, parameter estimation, or testing hypotheses.
Nevertheless, a field to study in the future would be to view archetypal analysis as a
feature extraction method (Hastie et al., 2009, Ch. 5), where the raw data are preprocessed and described by α , which can be used as inputs into any learning procedure for
compositional data (Pawlowsky-Glahn et al., 2015).
2.1.1. Computation of AA and ADA
The estimation of the matrices in the AA problem can be achieved by means of an alternating minimizing algorithm developed by Cutler and Breiman (1994), where the best
α for given archetypes Z and the best archetypes Z for a given α are computed by turns.
To solve the convex least squares problems, a penalized version of the non-negative least
squares algorithm by Lawson and Hanson (1974) is used. Eugster and Leisch (2009)
implemented that algorithm in the R package archetypes, although with some changes.
Specifically, the data are standardized and the spectral norm in equation 1 is used instead of the Frobenius norm for matrices. In our R implementation those changes were undone,
i.e. the data are not standardized by default and the objective function to minimize is
defined by equation 1.
With respect to the estimation of the matrices in the ADA problem, it can be achieved
using the algorithm developed by Vinué et al. (2015). It is composed of two steps: the
BUILD step and the SWAP step. The objective of the BUILD step is to determine an
initial set of archetypoids that will be upgraded during the following step. The objective
of the SWAP step is to improve the primitive set by exchanging the selected instances
for unselected observations and checking whether these replacements decrease the RSS.
Vinué (2017) implemented that algorithm in the R package Anthropometry with three
possible original sets in the BUILD step: cand_ns, cand_α and cand_β. These sets correspond to the nearest neighbor observations in Euclidean distance to the k archetypes,
the cases with the maximum α value for each archetype j and the observations with
the maximum β value for each archetype j, respectively. Then three possible solutions
are obtained once these three sets go through the SWAP step, but only the solution with
lowest RSS (often the same final set is returned from the three initializations) is chosen
as the ADA solution.
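The SWAP idea can be sketched as follows (our simplified greedy Python version, not the exact Anthropometry implementation; the function names are ours):

```python
import numpy as np
from scipy.optimize import nnls

def ada_rss(X, idx, penalty=200.0):
    """RSS when the archetypoids are the rows X[idx]; each alpha_i is the
    best non-negative, sum-to-one mixture of those rows (the sum constraint
    is approximated by a heavily weighted extra row)."""
    Z = X[np.asarray(idx)]
    A = np.vstack([Z.T, penalty * np.ones(len(idx))])
    alpha = np.array([nnls(A, np.append(x, penalty))[0] for x in X])
    return float(np.sum((X - alpha @ Z) ** 2))

def swap_step(X, idx):
    """Exchange selected archetypoids for unselected observations as long
    as the exchange lowers the RSS (a simplified greedy SWAP)."""
    idx, best = list(idx), ada_rss(X, idx)
    improved = True
    while improved:
        improved = False
        for j in range(len(idx)):
            for l in range(len(X)):
                if l in idx:
                    continue
                cand = idx[:j] + [l] + idx[j + 1:]
                r = ada_rss(X, cand)
                if r < best - 1e-9:
                    idx, best, improved = cand, r, True
    return idx, best
```

Starting from an initial set (the BUILD step), the loop stops at a set of rows that no single exchange can improve, mirroring the PAM-style search described above.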
One important point is the selection of k, since archetypes are not necessarily nested
and neither are archetypoids. If the user has prior knowledge of the structure of the data,
the value of k can be chosen based on that information. Otherwise, a simple but effective heuristic (Cutler and Breiman, 1994; Eugster and Leisch, 2009; Vinué et al., 2015;
Seth and Eugster, 2016b) such as the elbow criterion can be used. With the elbow criterion, we plot the RSS for different k values and the value of k is selected as the point
where the elbow is located.
2.1.2. Illustrative example
In Figure 1 a toy two-dimensional data set is used to illustrate what archetypoids
mean and the differences compared with CLA and PCA, as well as to provide some
intuition on what these pure and extreme patterns imply in behavioral sciences. Two
numeric variables are considered from the data set personality-1.0 of the R package
neuropsychology (Makowski, 2016), which contains personality traits data from an online questionnaire: Empathy.Agreeableness and Honesty.Humility. We apply k-means
and ADA with k = 3, i.e. we find 3 clusters and archetypoids. We also apply PCA.
Archetypoids are people with extreme values, which have clear profiles: archetypoid 1
is characterized by a very low Empathy.Agreeableness value together with a high Honesty.Humility value (1, 5.25), archetypoid 2 has the maximum values for both Empathy.Agreeableness and Honesty.Humility (7,7), while the third archetypoid has a very
high Empathy.Agreeableness value together with the lowest Honesty.Humility value
(6,0). Archetypoids are the purest people. The rest of the individuals are expressed
as mixtures (collected in alpha coefficients) of these ideal people. For example, an individual with values of 6.25 and 0.75 for Empathy.Agreeableness and Honesty.Humility,
respectively, is explained by 11% of archetypoid 2 plus 89% of archetypoid 3.
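This decomposition can be reproduced numerically. A sketch of ours in Python (not the paper's R code), solving the constrained least-squares problem for α with the archetypoid values quoted above:

```python
import numpy as np
from scipy.optimize import nnls

# Archetypoids quoted in the text (Empathy.Agreeableness, Honesty.Humility).
Z = np.array([[1.0, 5.25],   # archetypoid 1
              [7.0, 7.0],    # archetypoid 2
              [6.0, 0.0]])   # archetypoid 3
x = np.array([6.25, 0.75])   # the individual discussed in the text

# Solve min ||x - alpha Z|| s.t. alpha >= 0, sum(alpha) = 1, enforcing the
# sum constraint through a heavily weighted extra row.
M = 200.0
A = np.vstack([Z.T, M * np.ones(3)])
alpha, _ = nnls(A, np.append(x, M))
print(np.round(alpha, 2))  # roughly (0.00, 0.11, 0.89), i.e. 11% + 89%
```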
This is compatible with the natural tendency of humans to represent a group of objects by its extreme elements (Davis and Love, 2010). Figure 1 d) shows the partition
of the set generated by assigning the observations to the archetypoid that best explains
each individual. However, when we apply k-means to this kind of data set, without differentiated clusters, the centroids are in the middle of the data cloud. Centroid profiles
are not as differentiated from each other as archetypoid profiles. This happens because
centroids have to cover the set in such a way that the set is partitioned by minimizing the
distance with respect to the assigned centroid (see Wu et al. (2016) about the connection
between set partitioning and clustering). On the one hand, this means that the set partition generated by k-means and ADA would be different (Figures 1 c) and d)). On the
other hand, centroids are not the purest, and therefore their profiles are not as clear as
those of archetypoids. For example, centroids 2 and 3 have values (4.1, 5.7) and (5.3,
3.5), which are not as intuitively interpretable as archetypoids. If we look again at the
individual with values (6.25, 0.75) from the clustering point of view this individual is
clearly assigned to cluster 3, with centroid (5.3, 3.5), but clustering does not say anything about the distance of this point with respect to the assigned centroid, or in which
Figure 1. (a) Plot of the toy example. The size of the points depends on their frequency. The
red crosses represent the archetypoids, while the green stars represent the centroids of each cluster; (b) PC scores. Projected archetypoids are represented by red crosses; (c) k-means cluster
assignments; (d) ADA assignments by the maximum alpha, i.e. assigned to the archetypoid that
best explains the corresponding observation.
direction they are separated. In fact, (6.25, 0.75) is quite far from (5.3, 3.5). This happens because the objective of clustering is to assign the data to groups, not to explain the
structure of the data more qualitatively. Finally, note that archetypoids do not coincide
with the individuals with the most extreme PC scores (see Figure 1 b)).
In summary, depending on our objective, the appropriate analysis should be selected.
The objective of PCA is to reduce data dimension. Although PCA returns the location
of the observations in the new dimensions by PC scores, there is no guarantee that the
principal components are interpretable. In other words, observations are expressed in a
new base, but in general the PCA base is not easily interpretable. However, the objective
of CLA is to segment the data, i.e. to make groups of data by finding modes in data. Although the modes can be easily interpretable, CLA does not return an expression about
the location of each observation with respect to each mode. On the other hand, finding
extreme profiles, which are easily interpretable, is not the objective of PCA or CLA, but
that of AA or ADA. These techniques also return the location of the observations as a
function of the extreme profiles, in fact as a mixture (a convex combination), which is
more easily interpretable than a linear combination. This provides a complete overview
of the data set, generally supported by visual methods, i.e. this allows data to tell us
more beyond the formal modeling or hypothesis testing task.
2.2. Probabilistic archetype analysis
The idea underlying PAA is to work in a parameter space instead of the observation
space, since the parameter space is often vectorial even if the sample space is not. The
key is to assume that data points come from a certain distribution (from the Bernoulli
distribution in the case of binary observations). Then the maximum likelihood estimates
of the parameters of the distributions are seen as the parametric profiles that best describe
each observation, and archetypal profiles are computed in the parameter space by maximizing the corresponding log-likelihood under the constraints for α and β . In summary,
probabilistic archetypes lie in the parameter space, whereas classical archetypes lie in
the observation space. Thus, archetypal profiles for binary data are the probability of a
positive response. Details can be found in Seth and Eugster (2016b).
2.3. AA and ADA in the functional case
In Functional Data Analysis (FDA) each datum is a function. Therefore, the sample is a
set of functions {x1 (t), ..., xn (t)} with t ∈ [a, b], i.e. the values of the m variables in the
standard multivariate context are replaced by function values with a continuous index t.
We assume that these functions belong to a Hilbert space, satisfy reasonable smoothness
conditions and are square-integrable functions on that interval. Simplistically, the sums
are replaced by integrals in the definition of the inner product.
AA and ADA were extended to functional data by Epifanio (2016). In the functional context, functions from the data set are approximated by mixtures of archetypal
functions. In functional archetype analysis (FAA), we seek k archetype functions that
approximate the functional data sample by their mixtures. In other words, the objective
of FAA is the same as AA, but now both archetypes (z j (t)) and observations (xi (t)) are
functions. As a consequence, RSS is now calculated with a functional norm instead of a
vector norm. We consider the L²-norm, ‖f‖² = ⟨f, f⟩ = ∫_a^b f(t)² dt. The interpretation
of matrices α and β is the same as in the classical multivariate case.
Analogously, FADA is also a generalization of ADA, where k functional archetypoids, which are functions of the sample, approximate the functions of the sample
through the mixtures of these functional archetypoids. Again, vectors are replaced by
functions and vector norms by functional norms, and the matrices are interpreted in the same way as before.
To obtain FAA and FADA in a computationally efficient way (Epifanio (2016)),
functional data are represented by means of basis functions (see Ramsay and Silverman
(2005) for a detailed explanation about smoothing functional data). Let B_h (h = 1, ..., m) be the basis functions and b_i the vector of coefficients of length m such that x_i(t) ≈ ∑_{h=1}^m b_i^h B_h(t). Then, RSS is formulated as (see Epifanio (2016) for details):
RSS = ∑_{i=1}^n ‖x_i − ∑_{j=1}^k α_ij z_j‖² = ∑_{i=1}^n ‖x_i − ∑_{j=1}^k α_ij ∑_{l=1}^n β_jl x_l‖² = ∑_{i=1}^n a_i′ W a_i,   (2)

where a_i′ = b_i′ − ∑_{j=1}^k α_ij ∑_{l=1}^n β_jl b_l′ and W is the order m symmetric matrix with the inner products of the pairs of basis functions, w_{m1,m2} = ∫ B_{m1} B_{m2}. If the basis is orthonormal, for instance the Fourier basis, W is the order m identity matrix and FAA and FADA can be estimated using standard AA and ADA with the basis coefficients. If not, W has to be calculated previously one single time by numerical integration.
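For instance, W can be approximated by numerical quadrature. A small Python check of ours (the interval [0, 2π] and grid size are arbitrary choices) that an orthonormal Fourier basis indeed yields W ≈ I:

```python
import numpy as np

def fourier_basis(t, m):
    """First m functions of the orthonormal Fourier basis on [0, 2*pi]."""
    B = [np.full_like(t, 1.0 / np.sqrt(2.0 * np.pi))]
    k = 1
    while len(B) < m:
        B.append(np.sin(k * t) / np.sqrt(np.pi))
        if len(B) < m:
            B.append(np.cos(k * t) / np.sqrt(np.pi))
        k += 1
    return np.array(B)

# Trapezoidal-rule weights on a fine grid over [a, b] = [0, 2*pi].
t = np.linspace(0.0, 2.0 * np.pi, 2001)
w = np.full(t.size, t[1] - t[0])
w[0] = w[-1] = 0.5 * (t[1] - t[0])

B = fourier_basis(t, 5)
W = (B * w) @ B.T   # W[h1, h2] approximates the integral of B_h1 * B_h2
```

For a non-orthonormal basis (e.g. B-splines), the same quadrature gives the W needed in equation (2).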
2.4. Other unsupervised learning techniques
The following well-known multivariate analysis tools for binary data are used in
the comparison. We use homogeneity analysis (multiple correspondence analysis) using the R package homals (de Leeuw and Mair, 2009) (HOMALS). HOMALS can be
considered as an equivalent to PCA in the case of categorical data. For CLA we use
Partitioning Around Medoids (PAM) from the R package cluster (Maechler et al., 2018;
Kaufman and Rousseeuw, 1990), since it returns representative objects or medoids among
the observations of the data set. The pairwise dissimilarities between observations in the
data set needed for PAM are computed with the daisy function from the R package
cluster (Maechler et al., 2018), specifically using Gower’s coefficient (Gower, 1971)
for binary observations. Other popular clustering methods (Flynt and Dean, 2016) are
also used in the comparison: latent class analysis (LCA) from the R package poLCA
(Linzer and Lewis, 2011), which is a finite mixture model clustering for categorical data,
and classical k-means clustering (Lloyd, 1982). It is used in the literature (Henry et al.,
2015), despite not being recommended for binary data (IBM Support, 2016). For that
reason, we also consider PAM, since it is a robustified version of k-means (Steinley,
2006) that can be used with distances other than Euclidean, and observations, rather
than centroids, serve as the exemplars for each cluster.
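For intuition, Gower's coefficient for symmetric binary items reduces to the proportion of disagreeing items. A small Python sketch of ours (the function name is hypothetical, and this covers only the symmetric-binary case, not the full daisy implementation):

```python
import numpy as np

def simple_matching_dissimilarity(X):
    """Pairwise proportion of disagreeing items between binary rows; for
    symmetric binary variables, Gower's coefficient reduces to this."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            D[i, j] = np.mean(X[i] != X[j])
    return D
```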
3. Archetypal analysis for binary data
Let X be an n × m binary matrix with n observations and m variables. The idea
behind archetypal analysis is that we can find a set of archetypal patterns, and that data
can be expressed as a mixture of those archetypal patterns. In the case of binary data, on
the one hand the archetypal patterns should also be binary data, as the population from
which data come. For example, if pregnancy was one of the binary variables, it would
not make sense to consider as an archetypal observation a woman who was pregnant 0.7.
In other words, archetypal patterns should be binary in order to have a clear meaning
and not lose their interpretability, which is the cornerstone of archetypal techniques, i.e.
they should not be ‘mythological’, but rather something that might be observed. On the
other hand, in order to describe data as mixtures, we should assume that observations
exist in a vector space, i.e. that observations can be multiplied by scalars (in this case in
the interval [0, 1]) and added together.
A solution that meets all these ideas is to apply ADA to X, since the feasible archetypal patterns belong to the observed sample. In fact, ADA was originally created as a
response to the problem in which pure non-fictitious patterns were sought (Vinué et al.,
2015).
Instead, the archetypes returned by applying AA or PAA do not need to be binary,
i.e. they do not need to belong to the feasible set of solutions. In fact, Seth and Eugster
(2016b) binarized the archetypes obtained by AA or PAA in experiments. However,
using a continuous optimization problem to solve a problem whose feasible solutions
are not continuous can fail badly (Fletcher, 2000, Ch. 13). Indeed, there is no guarantee
that this approach will provide a good solution, even by examining all the feasible binary
solutions in a certain neighborhood of the continuous solution.
Therefore, we propose to use ADA to handle binary observations.
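The point can be seen with a two-line example (ours, in Python): a mixture of binary archetypes is generally fractional, whereas an archetypoid, being a row of X, is always binary.

```python
import numpy as np

# Two binary archetypes and a 70/30 mixture of them: the mixture has
# fractional entries (the "pregnant 0.7" situation above), so it is not a
# feasible binary pattern; any archetypoid, being a row of X, is.
z1 = np.array([1, 0, 1, 1, 0])
z2 = np.array([0, 0, 1, 0, 1])
mixture = 0.7 * z1 + 0.3 * z2
print(mixture)  # fractional values appear wherever z1 and z2 disagree
```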
4. Simulation study
We have carried out a simulation study to assess all the alternatives in a controlled scenario. The design of the experiment has been based on simulation studies that appear
in Vinué et al. (2015) and Seth and Eugster (2016b). We generate k = 6 archetypes, ζ_i, with m = 10 binary variables by sampling them from a Bernoulli distribution with a probability of success p = 0.7, A = [ζ_1, ζ_2, ζ_3, ζ_4, ζ_5, ζ_6]. Given the archetypes, we generate n = 100 observations as the binarized version of x_i = Ã_i h_i + E_i, where Ã_i contains the archetypes after adding salt and pepper noise to them, h_i is a random vector sampled from a Dirichlet distribution with α = (0.8, 0.8, 0.8, 0.8, 0.8, 0.8), and E_i is a 10-dimensional random vector of Gaussian white noise with a mean of zero and standard deviation of 0.1. The binarized versions are obtained by replacing all values above 0.5 with 1 and others with 0. The noise density added to A is 0.05 (the default value used in MATLAB). With salt and pepper noise, a certain amount of the data is changed to either 0 or 1. To ensure that the Ã_i's are archetypes, we chose α = 0.8, a value near to but less than one.
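The generator just described can be sketched as follows (our Python transcription of the design above; the seed and helper name are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
k, m, n = 6, 10, 100

# k binary archetypes with entries sampled from Bernoulli(0.7).
A = (rng.random((k, m)) < 0.7).astype(float)

def salt_and_pepper(Z, density=0.05, rng=rng):
    """Flip about `density` of the entries of Z to a random 0 or 1."""
    noisy = Z.copy()
    mask = rng.random(Z.shape) < density
    noisy[mask] = rng.integers(0, 2, size=mask.sum())
    return noisy

# Observations: Dirichlet(0.8, ..., 0.8) mixtures of noisy archetypes plus
# Gaussian white noise (sd 0.1), then binarized at 0.5.
H = rng.dirichlet(0.8 * np.ones(k), size=n)
X = np.empty((n, m))
for i in range(n):
    X[i] = H[i] @ salt_and_pepper(A) + rng.normal(0.0, 0.1, m)
X = (X > 0.5).astype(int)
```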
We compute PAA, AA and ADA. The archetypes returned by PAA and AA are
binarized for comparison with the true ones, A. We calculated the Hamming distance
(Manhattan distance between binary vectors), which is the same as the misclassification
error used with binary images, between each archetypal solution and the true archetypes
after permuting the columns of each archetypal solution to match the true archetypes in
such a way that the least error with the city block distance is provided.
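The matching step can be sketched as follows (our Python version; it permutes the rows of the estimated archetype matrix, which is feasible here since 6! = 720):

```python
import numpy as np
from itertools import permutations

def match_error(true_A, est_A):
    """Smallest total Hamming distance between the true archetypes and the
    estimated ones over all row permutations of the estimate."""
    best = None
    for perm in permutations(range(len(est_A))):
        err = int(np.sum(true_A != est_A[list(perm)]))
        best = err if best is None else min(best, err)
    return best
```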
This was repeated 100 times. The first 10 times are displayed in Figure 2. The
solutions returned by all the methods are quite similar to the true archetypes, i.e. the
number of errors (a zero in the solution where the true value is 1, or vice versa) is very
small. Nevertheless, there are differences between the methods, which are more evident
in columns 5 and 6. For columns 5 and 6, the number of errors for PAA is 5 and 5,
it is 4 and 2 for AA, but only 2 and 2 for ADA. Table 1 shows a summary of the
misclassifications. The archetypoids returned by ADA match the true archetypes better
than those returned by AA or PAA, in this order, i.e. ADA provides the smallest mean
misclassification error.
Figure 2. Comparison between true archetypes and those returned by PAA, AA and ADA,
respectively. The 10 columns represent the first 10 repetitions of the simulation. Black represents
0 and white 1.
Table 1. Summary of misclassification errors of the archetype profiles for each method over 100 simulations.

Method   Mean (Std. dev.)
PAA      4.20 (1.86)
AA       3.59 (1.99)
ADA      3.19 (1.88)
5. Applications
5.1. An initial mathematical skills test for first-year university students
5.1.1. Data
The first application corresponds to the first point of view of the binary matrix (analysis
of the rows). We analyze the data set described by Orús and Gregori (2008), which
was obtained through the application of a test on the initial mathematics skills of 690
first-year students of the College of Technology and Experimental Sciences at Jaume I
University (Spain) at the beginning of the 2003-04 academic year. The test consisted
of 17 questions corresponding to 21 single items, the answers to which were coded
as 0 (incorrect or unanswered) or 1 (correct). The items of the test were selected in
order to ascertain some given didactic hypotheses on the didactic discontinuities between
mathematics at pre-university and university levels. It is not a test designed to rank
the students and return a unique score. The complete description of the questions can
be seen in Orús and Gregori (2008). With ADA, we could obtain students’ skill set
profiles. In this way, students can be grouped by their similar mastery of skills. For
instance, students showing consistently high levels of aptitude may be selected for an
advanced class or students with similar difficulties could receive extra instruction as a
group, and teaching strategies could also be adapted to suit their level. A classical
way to group student skill set profiles is by using a clustering method, as carried out
by Dean and Nugent (2013), but in terms of human interpretability, the central points
returned by clustering tools do not seem as favorable as the extreme points returned by
ADA. Results from different exploratory tools are compared.
5.1.2. Results and discussion
We would like to estimate the skill set profiles hidden in the data set. In other words,
we would like to discover the data structure. Our intuition tells us that skill sets vary continuously across students, i.e. we do not expect there to be clearly differentiated (separate) groups of students with different abilities. Even so, CLA has been used to generate
groups of students with similar skill set profiles (Chiu et al., 2009; Dean and Nugent,
2013). Here, we are going to consider the raw binary data and let the data speak for
themselves, as ADA is a data-driven method. We compare the ADA solution with others
from well-established unsupervised techniques introduced in Section 2.4 to highlight the
understanding of the data that ADA provides.
For the sake of brevity and as an illustrative example, we examine the results of
k = 3. The RSS elbow for ADA and the Bayesian Information Criterion (BIC) elbow
for LCA are found at k = 3 (see Figure 3). According to the silhouette coefficient
(a method of interpretation and validation of consistency within clusters of data, see
Kaufman and Rousseeuw (1990) for details), the optimal numbers of clusters are k = 2
and k = 3 for PAM. However, the highest value of the silhouette coefficient is 0.22 (for
k = 2 and k = 3 clusters), which means that no substantial cluster structure was found,
as we predicted. We perform an h-plot (a multidimensional scaling method that is particularly suited for representing non-Euclidean dissimilarities, see Epifanio (2013) for
details) on the dissimilarities used by PAM to graphically summarize the data set and to
visualize the obtained clusters by PAM in two dimensions (see Figure 4). Effectively, separate clusters do not seem to exist.

Figure 3. Initial mathematical skills test data: Screeplot of ADA (left-hand panel); screeplot of LCA (right-hand panel).
Figure 4. H-plot of dissimilarities for the initial mathematical skills test data. We perform PAM. The black circles represent data points assigned to the first cluster, the red triangles to the second cluster and the blue crosses to the third cluster.

This is also corroborated by Figure 5, where the students' scores from HOMALS are plotted in two dimensions. As regards the interpretation of the dimensions of HOMALS,
the loadings are displayed in Figure 6 and Table 2 shows their exact values, together
with the number of correct answers. As also happens with PCA, their interpretation is
not always easy and immediate. For the first dimension, all the coefficients are positive
(as a measure of size), which can indicate a kind of sum score. The highest coefficients
more or less correspond to the last questions of the test, which fewer students answered
correctly. The second dimension compares, above all, questions 4, 5, 6a and 6b (with
high positive coefficients) with 13a and 13b (with low negative values), while in the third
dimension, questions 1, 3, 7, 8 and 10 (with high positive coefficients) are compared with
14a and 14b (with low negative values). However, we do not know how the meaning of these
contradistinctions should be interpreted.
LCA returns the conditional item response probabilities by outcome variable for
each class. Table 3 lists these probabilities for a correct answer. The predicted classes
for each student are shown in Figure 5; the profiles of clusters 1 and 3 are mainly
differentiated by questions 4, 5, 13a and 13b, which are the most relevant variables of
dimension 2 of HOMALS.
Figure 5. HOMALS of the initial mathematical skills test data. Plot of students’ scores. The
numbers indicate the code of each of the 690 students (left-hand panel). We perform LCA. The
black circles represent data points assigned to the first cluster, the red triangles to the second
cluster and the blue crosses to the third cluster (right-hand panel).
Figure 6. HOMALS of the initial mathematical skills test data. Loadings plot.
Table 3 also lists the profiles of the medoids, the centroids of k-means and the archetypal profiles for AA, PAA and ADA. For medoids and archetypoids, the code of the
corresponding observation is also displayed. To facilitate the analysis, we also show the
binarized profiles of AA and PAA, referred to as BAA and BPAA, respectively.
Table 2. Number of correct answers and loadings of the first three dimensions by HOMALS for
the initial mathematical skills test data.

Question  No. correct answers  D1          D2             D3
1         621                  0.02862783  -0.0023240445   0.109355256
2         589                  0.05796987   0.0626839294   0.052003283
3         301                  0.09098387   0.0229084678   0.076212734
4         233                  0.07597107   0.0959592091  -0.008635664
5         253                  0.09922374   0.0910173149   0.004650411
6a        231                  0.05413846   0.0839718688  -0.059468794
6b        105                  0.05230047   0.0873491670  -0.048042860
7         270                  0.07408320   0.0332456303   0.078814530
8         140                  0.06009601  -0.0172906782   0.105955795
9         109                  0.07749049   0.0705908727  -0.040283362
10        202                  0.09540734  -0.0246907330   0.073252540
11        71                   0.07484609   0.0786972991  -0.017170762
12        329                  0.08006423   0.0138046564   0.015179516
13a       177                  0.12934508  -0.1274837375  -0.021853917
13b       132                  0.12952677  -0.1231243539  -0.012406654
14a       114                  0.11951449  -0.0355861709  -0.096428155
14b       22                   0.07564891  -0.0454400932  -0.065063675
15        183                  0.10748607  -0.0366512703   0.017544829
16        236                  0.12062274  -0.0001786048  -0.004647963
17a       47                   0.12115852   0.0035393336  -0.009522369
17b       62                   0.10884146   0.0095484411  -0.022486862

As a simple summary of the profiles, we compute the percentage of correct answers
for each profile. For PAM, the percentages are 9.5%, 33.3% and 57.1%; for binarized
LCA, 38.1%, 9.5% and 33.3%; for binarized k-means, 38.1%, 9.5% and 42.9%; for
BAA, 9.5%, 47.6% and 61.9%; for BPAA, 9.5%, 42.9% and 57.1%; and for ADA,
57.1%, 52.4% and 9.5%, respectively. Note that the median of the percentage of correct
answers in the data set is 28.6% (the minimum is 0, the first quartile is 19.1%, the third
quartile is 38.1%, while the maximum is 95.2%).
One profile is repeated in all the methods, a student who only answers questions 1
and 2 correctly, i.e. a student with a serious lack of competence. We therefore concentrate the analysis on the other two profiles for each method.
In contrast with the third archetypoid, i.e. the student with very poor skills, the first
and second archetypoids correspond to students with very high percentages of correct
answers. In fact, the first archetypoid corresponds to the 92nd percentile of the data
set, while the second archetypoid corresponds to the 88th percentile. Nevertheless, both
profiles are quite different. In fact, the Hamming distance between archetypoids 1 and 2
is 13, which means that although they answered a lot of items correctly, these correctly
answered items do not coincide. In other words, archetypoids 1 and 2 are somehow
complementary. Both answered items 1, 2, 3, 12 and 16 correctly, which were among
the most correctly answered items. Neither of them answered items 11, 14b and 17a
correctly, which were among the least correctly answered items. On the one hand, the
items that archetypoid 1 answered correctly, but archetypoid 2 did not are 8, 10, 13a,
13b, 14a, 15 and 17b. These items are about nonlinear systems and linear functions.
On the other hand, the items that archetypoid 2 answered correctly, but archetypoid 1
did not are 4, 5, 6a, 6b, 7 and 9. These items are about the calculation of derivatives
and integrals and algebraic interpretation. The skills of these archetypoids are clear and
different from each other.
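These quantities are straightforward to reproduce from the 0/1 rows of Table 3. A short Python sketch (item order as in the table; the helper names are ours):

```python
# Archetypoid profiles copied from Table 3 (21 items in the table's order).
ada_182 = [1,1,1,0,0,0,0,0,1,0,1,0,1,1,1,1,0,1,1,0,1]
ada_274 = [1,1,1,1,1,1,1,1,0,1,0,0,1,0,0,0,0,0,1,0,0]
ada_1   = [1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]

def hamming(a, b):
    """Number of items on which two profiles disagree."""
    return sum(x != y for x, y in zip(a, b))

def pct_correct(profile):
    """Percentage of correctly answered items, one decimal place."""
    return round(100 * sum(profile) / len(profile), 1)

print(hamming(ada_182, ada_274))  # 13: the two high-scoring profiles barely overlap
print([pct_correct(p) for p in (ada_182, ada_274, ada_1)])  # [57.1, 52.4, 9.5]
```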
We can use the alpha values for each of the students to learn about their relationship
to the archetypoid profiles. The ternary plot in Figure 7 displays the alpha values that
provide further insight into the data structure. Note that the majority of the data is
concentrated around archetypoid 3, i.e. the one with very poor skills. If we wanted to
form three groups using the alpha values, we could assign each student to the group in
which their corresponding alpha is the maximum, as we did in Figure 1 (d). In this way,
the number of students similar to archetypoid 1 is 113, to archetypoid 2 it is 110 and to
archetypoid 3 it is 467.
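The grouping step itself reduces to taking, for each student, the archetypoid with the largest alpha. A minimal Python sketch (the alpha rows below are invented, not the fitted values):

```python
# Each row of alphas sums to 1 over the k = 3 archetypoids (invented values).
alphas = [
    [0.10, 0.05, 0.85],  # dominated by archetypoid 3
    [0.70, 0.20, 0.10],  # dominated by archetypoid 1
    [0.35, 0.45, 0.20],  # a mixture, but archetypoid 2 dominates
]

def assign(alpha_row):
    """1-based index of the archetypoid with the largest alpha."""
    return max(range(len(alpha_row)), key=alpha_row.__getitem__) + 1

print([assign(row) for row in alphas])  # [3, 1, 2]
```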
The profiles of medoids 2 and 3 are not as complementary as the previous archetypoids. In fact, medoid 2 corresponds to the 56th percentile, while medoid 3 corresponds
to the 92nd percentile. In this case, the percentage of correct answers for medoid 2 is not
high. The Hamming distance between the two medoids is only 7. On the one hand, both
answered items 1, 2, 3, 5, 7 and 12 correctly, which are the most correctly answered
items. On the other hand, both failed items 6a, 6b, 8, 9, 11, 14b, 17a and 17b, many
more items than in the case of ADA.
Figure 7. Ternary plot of the αs of ADA together with a density estimate for the initial mathematical skills test data.
The only item that medoid 2 answered correctly
but medoid 3 did not is item 4. The items that medoid 3 answered correctly but medoid 2
did not are 10, 13a, 13b, 14a, 15 and 16. It seems as if the cluster definition was guided
by the number of correct answers rather than by the kind of item answered correctly.
This is the reason why PAM selects medoid 2 in the middle of the data cloud. PAM, like
usual clustering methods, tries to cover the set in such a way that every point is near
one medoid or one cluster center. The number of students belonging to each cluster is
398, 179 and 113, respectively. Note that the size of the cluster of students with poor
skills is smaller than in the case of ADA, because some of those students are assigned to
the cluster of medoid 2.
The binarized profile of LCA 1, corresponding to the 75th percentile, is similar to
medoid 3, but with a lower number of correct answers (5, 7, 14a and 15), while the
binarized profile of LCA 3, corresponding to the 56th percentile, is similar to medoid 2,
only differentiated by two items (7 and 16). Therefore, they are even less complementary
than the previous medoids. The Hamming distance between both LCA-profiles is only
I. Cabero and I. Epifanio
23
Table 3. Profiles for the initial mathematical skills test data, for PAM, LCA, k-means (k-M), AA
(and binarized, BAA), PAA (and binarized, BPAA) and ADA, with k = 3. The numbers in brackets
for PAM and ADA indicate the code of the representative student
Methods
1 2 3 4 5 6a 6b 7 8 9 10 11 12 13a 13b 14a 14b 15 16 17a 17b
PAM (661) 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
PAM (586) 1 1 1 1 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0
PAM (162) 1 1 1 0 1 0 0 1 0 0 1 0 1 1 1 1 0 1 1 0 0
LCA 1
0.93 0.88 0.57 0.36 0.48 0.37 0.17 0.49 0.32 0.20 0.51 0.14 0.62 1.00 0.85 0.43 0.10 0.50 0.55 0.18 0.19
LCA 2
0.88 0.79 0.28 0.17 0.15 0.24 0.07 0.30 0.14 0.03 0.15 0.02 0.32 0.05 0 0.03 0 0.13 0.14 0 0
LCA 3
0.91 0.94 0.60 0.62 0.65 0.48 0.27 0.47 0.23 0.35 0.36 0.23 0.63 0.04 0 0.20 0.03 0.31 0.53 0.09 0.17
k-M 1
0.91 0.95 0.64 0.70 0.74 0.43 0.25 0.53 0.24 0.32 0.38 0.22 0.63 0.05 0.00 0.19 0.02 0.33 0.52 0.10 0.15
k-M 2
0.88 0.78 0.26 0.13 0.11 0.27 0.09 0.27 0.13 0.05 0.14 0.02 0.32 0.06 0.01 0.03 0.01 0.12 0.15 0.01 0.01
k-M 3
0.93 0.91 0.59 0.34 0.49 0.37 0.17 0.50 0.33 0.20 0.54 0.14 0.64 1 0.88 0.46 0.10 0.52 0.58 0.18 0.19
AA 1
0.85 0.68 0.02 0 0 0.05 0 0.04 0.05 0 0.01 0 0.16 0 0 0 0 0.01 0 0 0
AA 2
0.90 1 0.87 1 1 1 0.63 0.82 0.19 0.52 0.24 0.38 0.65 0 0 0.16 0 0.07 0.43 0.09 0.15
AA 3
1 1 0.89 0.32 0.53 0.19 0.06 0.71 0.58 0.26 1 0.17 1 1 1 0.67 0.18 1 1 0.36 0.37
BAA 1
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
BAA 2
1 1 1 1 1 1 1 1 0 1 0 0 1 0 0 0 0 0 0 0 0
BAA 3
1 1 1 0 1 0 0 1 1 0 1 0 1 1 1 1 0 1 1 0 0
PAA 1
0.86 0.72 0.13 0 0 0 0 0.12 0.07 0 0 0 0.23 0 0 0 0 0 0 0 0
PAA 2
0.90 1 0.78 1 1 1 0.61 0.73 0.27 0.40 0.38 0.31 0.66 0 0 0 0 0 0.43 0 0
PAA 3
0.99 1 0.82 0.36 0.57 0.25 0 0.66 0.44 0.27 0.86 0.12 0.85 1 1 0.73 0.15 1 1 0.32 0.42
BPAA 1
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
BPAA 2
1 1 1 1 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0
BPAA 3
1 1 1 0 1 0 0 1 0 0 1 0 1 1 1 1 0 1 1 0 0
ADA (182) 1 1 1 0 0 0 0 0 1 0 1 0 1 1 1 1 0 1 1 0 1
ADA (274) 1 1 1 1 1 1 1 1 0 1 0 0 1 0 0 0 0 0 1 0 0
ADA (1)
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5. The number of students belonging to each cluster is 155, 352 and 183, respectively.
Note that the size of the cluster of students with poor skills is smaller than in the case of
PAM.
The binarized profile of the first centroid of k-means, corresponding to the 75th
percentile, is similar to medoid 2, only differentiated by item 16, while the binarized
profile of the third centroid, corresponding to the 82nd percentile, is similar to medoid 3,
but with a lower number of correct items (5, 7 and 14a). The Hamming distance between
both centroids is 7. The level of complementarity between both centroids is similar to
that of the medoids of PAM, but the number of correct answers of medoid 3 is higher
than binarized centroid 3. The number of students belonging to each cluster is 196, 349
and 145, respectively. Note that the size of the cluster of students with poor skills is
smaller than in the case of PAM, but larger than in the case of PAM for cluster 3, which
in both clustering methods corresponds to the students with more correct answers.
In the clustering methods, the profiles of each cluster are not as extreme as the archetypoids. Archetypoids are also more complementary, which makes it easier to establish
which kinds of features distinguish one group from another. Remember also that clustering is limited to assigning each student to a group, whereas the alpha values of ADA
reveal the composition of each observation, i.e. ADA returns richer information.
The profiles of BAA2 and BAA3 and of BPAA2 and BPAA3 are quite similar to the
profiles of archetypoids 2 and 1, respectively, but with slight differences. The percentiles corresponding to correctly answered items are also high, although for one of
the archetypes not as high as for archetypoids. The percentiles are the 82nd and 94th for
BAA2 and BAA3, and the 75th and 92nd for BPAA2 and BPAA3, respectively. Therefore, the archetypoids are more extreme than the binarized archetypes of AA and PAA.
Although the profiles for BAA and BPAA are also complementary, they are not as complementary as the two archetypoids. The Hamming distance between BAA2 and BAA3
is 11, and 9 between BPAA2 and BPAA3. Archetypoids therefore manage to find more
complementary profiles.
5.2. An American College Testing (ACT) Mathematics Test
5.2.1. Data
This application corresponds to the second point of view of the binary matrix (analysis of
the columns). We use the same data and approach followed by Ramsay and Silverman
(2002, Ch. 9) and Rossi et al. (2002), although another strategy could be considered
(Ramsay and Wiberg, 2017). The data used are the 0/1 (incorrect/correct) responses of
2115 males from administration of a version of the ACT Program 60-item Mathematics
Test. Unlike the test introduced in Section 5.1.1, the objective of the test is to relate a
student’s ACT score with the probability of him or her earning a college degree, i.e. to rank
students. The test is designed so that the difficulty of the questions increases with the
question number.
Although this binary matrix does not seem curvaceous at first sight, by making the
simplifying assumption that the probabilities Pih (probability that examinee h gets item
i right) vary in a smooth one-dimensional way across examinees, we can estimate the
ability space curve that this assumption implies. Then, we can work with item response
functions (IRFs) Pi (θ ) as functional data (Ramsay and Silverman, 2005), where θ is the
charting variable that measures out positions along the ability space curve. Or rather,
we can work with the log odds-ratio functions Wi (θ ), since these transformations of the
item response functions have the unconstrained variation that we are used to seeing in
directly observed curves. Ramsay and Silverman (2002, Ch. 9) and Rossi et al. (2002)
used functional PCA (FPCA) to study variations among these functions. Instead, we
propose to use functional ADA (FADA), which reveals very interesting patterns that
were not discovered with FPCA.
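The logit transformation and its inverse, which map between the bounded Pi(θ) and the unconstrained Wi(θ), can be sketched as follows (plain Python illustrating the transformation only, not the penalized EM estimation itself):

```python
import math

def logit(p):
    """W = log(P / (1 - P)): maps a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def inv_logit(w):
    """P = exp(W) / (1 + exp(W)): back-transform to a probability."""
    return math.exp(w) / (1.0 + math.exp(w))

# The round trip recovers the original success probabilities.
probs = [0.1, 0.5, 0.9]
print(all(abs(inv_logit(logit(p)) - p) < 1e-12 for p in probs))  # True
```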
Note that in the literature, we find other terms for IRFs, such as option characteristic
curves, category characteristic curves, operating characteristic curves, category response
functions, item category response functions or option response functions (Mazza et al.,
2014).
5.2.2. Results and discussion
As mentioned previously, we used the same data and approach followed by Ramsay and Silverman
(2002, Ch. 9) and Rossi et al. (2002) to estimate IRFs, Pi (θ ), and their logit functions,
Wi (θ ) = log(Pi (θ )/(1 − Pi (θ ))). In particular, a penalized EM algorithm was used and
functions were expanded in terms of 11 B-spline basis functions using equally spaced
knots. Figure 8 displays the estimated IRFs, exp(Wi (θ ))/(1+ exp(Wi (θ ))), and their log
odds-ratio functions Wi(θ) for the 60 items. As expected, this kind of graph with superimposed curves is largely uninformative and aesthetically unappealing (Jones and Rice,
1992).
Figure 8. Estimated IRFs (left-hand panel) and log odds-ratio functions (right-hand panel) for
the ACT math exam estimated from the male data.
To explore a set of curves, Jones and Rice (1992) proposed the use of functions with
extreme principal component scores. This could be viewed as finding the archetypoid
functions. Nevertheless, the aim of PCA is not to recover extreme patterns. In fact,
curves with extreme PCA scores do not necessarily correspond to archetypal observations. This is discussed in Cutler and Breiman (1994) and shown in Epifanio et al.
(2013) through an example where archetypes could not be restored with PCA, even if
all the components had been considered. Not only that, Stone and Cutler (1996) also
showed that AA may be more appropriate than PCA when the data do not have elliptical
distributions.
In order to show the advantages of ADA over PCA, we compute FPCA and FADA
for W (θ ), since they are unconstrained, therefore making them more appropriate for
PCA application than the bounded Pi (θ ). This is not a problem with FADA as it works
with convex combinations. Figure 9 displays the first four PCs after a varimax rotation having been back-transformed to their probability counterparts, as performed by
Ramsay and Silverman (2002, Ch. 9) and Rossi et al. (2002). We base the interpretation
of each PC on the detailed description carried out by Ramsay and Silverman (2002, Ch.
9).
Figure 9. The first four functional PCs in IRFs after VARIMAX rotation (Harmonic I: 43%;
Harmonic II: 20%; Harmonic III: 10%; Harmonic IV: 27%). Plus (minus) signs indicate the effect
of adding (subtracting) a multiple of a component to the mean function. The mean function is the
dashed line.
The percentage of total variation explained by those four components is nearly 100%,
while the percentage explained by each component is reported in Figure 9. The first component concentrates on the middle part of the ability range, in such a way that an item
with a high (low) score in that component has a higher (lower) slope than the mean from
approximately 0 to 2, i.e. it quantifies a discriminability trade-off between average students and those with rather high abilities. Analogously, the fourth component quantifies
a discriminability trade-off between average examinees and those with rather low abilities. On the contrary, the second component concentrates on the upper end of the ability
range. As Rossi et al. (2002) explained, the 3PL model is not well suited to modeling
this type of variation. An item with a low score on this component is good at sorting out
very high ability students from others of moderately high ability, whereas if the score
for this item is high, it will discriminate well among most of the population but will be
found to be of approximately equal difficulty by all the very good students. Nevertheless, conclusions on the extreme part of the ability range should be made with caution,
since the estimation is carried out using a relatively small number of students. The third
component also accounts for variation in the characteristics of test items in the extreme
ability range, but now in low ability ranges. PC scores for these four components can
be seen in Figure 10. Note that evaluating the 4 PC scores simultaneously and combining
them to give an idea about each item is not easily comprehensible or human-readable.
Figure 10. Bivariate plots of principal component scores of IRFs. PC1 versus PC2 (left-hand
panel); PC3 versus PC4 (right-hand panel).
Figure 11 displays the archetypoids for k = 4, which explain 97% of the variability, nearly
as high as that of FPCA. The archetypoids are items 2, 18, 28 and 60. These four items
describe the extreme patterns found in the sample. Item 2 has very high scores in PC
3 and PC 4, high scores in PC 1 and a score of nearly zero for PC 2. Its IRF is quite
flat with a very slight slope; it seems to be a very easy item, with high probabilities of
success throughout the ability range. The other archetypoids discriminate better between
low and high ability students but in very different ways. Item 18 has a very high score
for PC 2 and a negative score for PC 3, but nearly zero for PC 1 and PC 4. It is an item
that is quite difficult even for the students in the very high ability range. The IRF of item
28 is quite similar to that of item 18 for the low ability range until θ ≈ 0, but its slope for
the high ability range is higher, and the probabilities of success are higher than 0.9 for
θs higher than 1. On the contrary, the probabilities of success of the IRF of item 60 are
quite low as far as θ ≈ 1.5, which means that it is a difficult item, but the probabilities of
success for the best students are high. In fact, the probabilities of success for item 60 for
θ higher than 2 are higher than those of item 18. Item 28 has a high score for PC 1 and
a low score for PC 2, while it has a score of nearly zero for PC 3 and PC 4. However, item
60 has low scores for PC 1 and PC 4, a high score for PC 2 and nearly zero for PC 3. In
other words, it would have been very difficult to guess the extreme representatives of the
sample returned by ADA from an analysis of the scores in Figure 10.
Figure 11. ACT data: The four IRF archetypoids are items 2, 18, 28 and 60. See the legend
inside the plot.
The alpha values (from 0 to 1) tell us about the contribution of each archetypoid to
each item. Remember that they add up to 1. Figure 12 shows star plots of the alpha
values for each archetypoid, thus providing a complete human-readable view of the data
set. The 4 alpha values in this case are represented starting on the right and going
counter-clockwise around the circle. The size of each alpha is shown by the radius of
the segment representing it. The items that are similar to the archetypoids can be clearly
seen (for example, 7 and 8 are somehow similar to 2; 15 and 19 are somehow similar to
18; 14 and 16 are somehow similar to 28; and 56, 57 and 59 are similar to 60), as can
the items that are a mixture of several archetypoids (for example, item 1 is a mixture of
mainly item 2, together with items 28 and 18, to a lesser extent). Item 1 was selected
by Ramsay and Silverman (2002, Ch. 9) and Rossi et al. (2002) as an example of a low
difficulty item, although it seems that item 2 would be a better representative of this kind
of item. Item 9 was selected by Ramsay and Silverman (2002, Ch. 9) and Rossi et al.
(2002) as an example of a medium difficulty item, and it is mainly a mixture of items
18 and 28. Finally, item 59 was selected by Ramsay and Silverman (2002, Ch. 9) and
Rossi et al. (2002) as an example of a hard item. Item 59 was mainly explained by item
60.
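The reading of Figure 12 can be mimicked numerically: an item's IRF is approximated by the convex combination of the four archetypoid IRFs weighted by its alphas. The sketch below uses invented curve values and alphas; only the item/archetypoid numbering comes from the text:

```python
# IRF values of the four archetypoid items at three ability levels (invented).
archetypoid_curves = {
    2:  [0.80, 0.85, 0.90],  # easy item: high success across the range
    18: [0.10, 0.15, 0.40],  # hard even for very able students
    28: [0.10, 0.50, 0.95],  # steep in the high-ability range
    60: [0.05, 0.10, 0.70],  # difficult, but the best students succeed
}
# Alphas of a hypothetical item that is mainly a mixture of items 2 and 28.
alphas_item = {2: 0.6, 18: 0.1, 28: 0.3, 60: 0.0}

# Convex combination: the alphas are nonnegative and sum to 1.
approx = [sum(alphas_item[k] * curve[t] for k, curve in archetypoid_curves.items())
          for t in range(3)]
print([round(v, 3) for v in approx])  # [0.52, 0.675, 0.865]
```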
Figure 12. Star plots of the alphas of each archetypoid for IRFs. The item number appears
below each plot. The archetypoids are 2, 18, 28 and 60.
Results of applying FADA with kernel and parametric IRF estimates are discussed
in the Supplementary Material.
6. Conclusion
For the first time, we have proposed to find archetypal patterns in binary data using
ADA for a better understanding of a data set. A simulation study and results provided
in two applications have highlighted the benefits of ADA for binary questionnaires as an
alternative that can be used instead of (or in addition to) other established methodologies.
Although much of statistics is based on the idea that averaging over many elements
of a data set is a good thing to do, in this paper we adopt a different perspective. We have
selected a small number of representative observations, archetypal observations, and the
data composition is explained through mixtures of those extreme observations. We have
shown that this can be highly informative and is a useful tool for making a data set more
“human-readable”, even to non-experts.
In the first application, we have shown how ADA returns the most complementary
profiles, which can be more useful in order to establish groups of students with similar
mastery of skills. Furthermore, ADA returns composition information for each observation through the alpha values, which is richer information than the simple assignment to
groups returned by CLA. In the second application, FADA has discovered the extreme
patterns in the data, which cannot be recovered by FPCA. Furthermore, we have explained each item of the ACT math exam as a percentage of the archetypal items, which
is easily understandable even for non-experts.
As regards future work, throughout the paper all variables share the same weight, but
for certain situations some variables could have more weight in RSS. Another direction
of future work would be to consider ADA for nominal observations, for example, by
converting those variables into dummy variables, i.e. with binary codes. Furthermore,
this work is limited to binary data, but questionnaires can also have Likert-type scale
responses. Therefore, archetypal techniques for ordinal data would be very valuable.
Another, not so immediate, extension would be to consider the case of mixed data, with
real-valued and categorical data, together with missing data. Finally, from the computational point of view, when working with a very big data set, the ADA algorithm
described in Section 2.1.1 could be slow. In that case, a recent alternative implemented in
the R package adamethods (Vinue and Epifanio, 2019) for computing ADA with large
data sets could be used.
Acknowledgments
This work is supported by the following grants: DPI2017-87333-R from the Spanish
Ministry of Science, Innovation and Universities (AEI/FEDER, EU) and UJI-B2017-13
from Universitat Jaume I.
References
Alcacer, A., I. Epifanio, M. V. Ibáñez, A. Simó, and A. Ballester (2020). A data-driven classification of 3D foot types by archetypal shapes based on landmarks. PLOS
ONE 15(1), 1–19.
Cabero, I. and I. Epifanio (2019). Archetypal analysis: an alternative to clustering for
unsupervised texture segmentation. Image Analysis & Stereology 38(2), 151–160.
Canhasi, E. and I. Kononenko (2013). Multi-document summarization via archetypal
analysis of the content-graph joint model. Knowledge and Information Systems, 1–22.
Canhasi, E. and I. Kononenko (2014). Weighted archetypal analysis of the multi-element
graph for query-focused multi-document summarization. Expert Systems with Applications 41(2), 535–543.
Chan, B., D. Mitchell, and L. Cram (2003). Archetypal analysis of galaxy spectra.
Monthly Notices of the Royal Astronomical Society 338(3), 790–795.
Chiu, C.-Y., J. A. Douglas, and X. Li (2009). Cluster analysis for cognitive diagnosis:
Theory and applications. Psychometrika 74(4), 633.
Cutler, A. and L. Breiman (1994). Archetypal Analysis. Technometrics 36(4), 338–347.
Davis, T. and B. Love (2010). Memory for category information is idealized through
contrast with competing options. Psychological Science 21(2), 234–242.
de Leeuw, J. and P. Mair (2009). Gifi methods for optimal scaling in R: The package
homals. Journal of Statistical Software 31(4), 1–20.
Dean, N. and R. Nugent (2013). Clustering student skill set profiles in a unit hypercube using mixtures of multivariate betas. Advances in Data Analysis and Classification 7(3), 339–357.
D’Esposito, M. R., F. Palumbo, and G. Ragozini (2012). Interval Archetypes: A New
Tool for Interval Data Analysis. Statistical Analysis and Data Mining 5(4), 322–335.
Epifanio, I. (2013). h-plots for displaying nonmetric dissimilarity matrices. Statistical
Analysis and Data Mining 6(2), 136–143.
Epifanio, I. (2016). Functional archetype and archetypoid analysis. Computational
Statistics & Data Analysis 104, 24–34.
Epifanio, I., M. V. Ibáñez, and A. Simó (2020). Archetypal analysis with missing data:
see all samples by looking at a few based on extreme profiles. The American Statistician.
Epifanio, I., M. V. Ibáñez, and A. Simó (2018). Archetypal shapes based on landmarks
and extension to handle missing data. Advances in Data Analysis and Classification 12(3), 705–735.
Epifanio, I., G. Vinué, and S. Alemany (2013). Archetypal analysis: contributions for
estimating boundary cases in multivariate accommodation problem. Computers &
Industrial Engineering 64(3), 757–765.
Eugster, M. J. and F. Leisch (2009). From Spider-Man to Hero - Archetypal Analysis in
R. Journal of Statistical Software 30(8), 1–23.
Eugster, M. J. A. (2012). Performance profiles based on archetypal athletes. International Journal of Performance Analysis in Sport 12(1), 166–187.
Fernandez, M. and A. S. Barnard (2015). Identification of nanoparticle prototypes and
archetypes. ACS Nano 9(12), 11980–11992.
Fletcher, R. (2000). Practical Methods of Optimization (Second ed.). John Wiley &
Sons.
Flynt, A. and N. Dean (2016). A survey of popular R packages for cluster analysis.
Journal of Educational and Behavioral Statistics 41(2), 205–225.
Friedman, J. H. and J. W. Tukey (1974). A projection pursuit algorithm for exploratory
data analysis. IEEE Transactions on Computers C-23(9), 881–890.
Gower, J. C. (1971). A general coefficient of similarity and some of its properties.
Biometrics 27(4), 857–871.
Hastie, T., R. Tibshirani, and J. Friedman (2009). The Elements of Statistical Learning.
Data mining, inference and prediction. 2nd ed., Springer-Verlag.
Henry, D., A. B. Dymnicki, N. Mohatt, J. Allen, and J. G. Kelly (2015). Clustering
methods with qualitative data: a mixed-methods approach for prevention research
with small samples. Prevention Science 16(7), 1007–1016.
Hinrich, J. L., S. E. Bardenfleth, R. E. Roge, N. W. Churchill, K. H. Madsen, and
M. Mørup (2016). Archetypal analysis for modeling multisubject fMRI data. IEEE
Journal on Selected Topics in Signal Processing 10(7), 1160–1171.
IBM Support (2016). Clustering binary data with K-Means (should be avoided).
http://www-01.ibm.com/support/docview.wss?uid=swg21477401. Accessed: 2018-07-09.
Jones, M. C. and J. A. Rice (1992). Displaying the important features of large collections
of similar curves. The American Statistician 46(2), 140–145.
Kaufman, L. and P. J. Rousseeuw (1990). Finding Groups in Data: An Introduction to
Cluster Analysis. New York: John Wiley.
Lawson, C. L. and R. J. Hanson (1974). Solving Least Squares Problems. Prentice Hall.
Li, S., P. Wang, J. Louviere, and R. Carson (2003). Archetypal Analysis: A New Way
To Segment Markets Based On Extreme Individuals. In ANZMAC 2003 Conference
Proceedings, pp. 1674–1679.
Linzer, D. A. and J. B. Lewis (2011). poLCA: An R package for polytomous variable
latent class analysis. Journal of Statistical Software 42(10), 1–29.
Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory 28, 129–137.
Maechler, M., P. Rousseeuw, A. Struyf, M. Hubert, and K. Hornik (2018). cluster:
Cluster Analysis Basics and Extensions. R package version 2.0.7-1.
Makowski, D. (2016). Package ’neuropsychology’: An R Toolbox for Psychologists,
Neuropsychologists and Neuroscientists. R package version 0.5.0.
Mazza, A., A. Punzo, and B. McGuire (2014). KernSmoothIRT: An R package for
kernel smoothing in item response theory. Journal of Statistical Software 58(6), 1–34.
Midgley, D. and S. Venaik (2013). Marketing strategy in MNC subsidiaries: pure versus
hybrid archetypes. In P. McDougall-Covin and T. Kiyak, Proceedings of the 55th
Annual Meeting of the Academy of International Business, pp. 215–216.
Millán-Roures, L., I. Epifanio, and V. Martínez (2018). Detection of anomalies in
water networks by functional data analysis. Mathematical Problems in Engineering 2018(Article ID 5129735), 13.
Moliner, J. and I. Epifanio (2019). Robust multivariate and functional archetypal analysis
with application to financial time series analysis. Physica A: Statistical Mechanics and
its Applications 519, 195 – 208.
Mørup, M. and L. K. Hansen (2012). Archetypal analysis for machine learning and data
mining. Neurocomputing 80, 54–63.
Orús, P. and P. Gregori (2008). Fictitious Pupils and Implicative Analysis: a Case Study,
pp. 321–345. Berlin, Heidelberg: Springer.
Pawlowsky-Glahn, V., J. J. Egozcue, and R. Tolosana-Delgado (2015). Modeling and
analysis of compositional data. John Wiley & Sons.
Porzio, G. C., G. Ragozini, and D. Vistocco (2008). On the use of archetypes as benchmarks. Applied Stochastic Models in Business and Industry 24, 419–437.
R Development Core Team (2019). R: A Language and Environment for Statistical
Computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0.
Ragozini, G. and M. R. D’Esposito (2015). Archetypal networks. In Proceedings of the
2015 IEEE/ACM International Conference on Advances in Social Networks Analysis
and Mining 2015, New York, NY, USA, pp. 807–814. ACM.
Ragozini, G., F. Palumbo, and M. R. D’Esposito (2017). Archetypal analysis for
data-driven prototype identification. Statistical Analysis and Data Mining: The ASA Data
Science Journal 10(1), 6–20.
Ramsay, J. O. and B. W. Silverman (2002). Applied Functional Data Analysis. Springer.
Ramsay, J. O. and B. W. Silverman (2005). Functional Data Analysis (2nd ed.). Springer.
Ramsay, J. O. and M. Wiberg (2017). A strategy for replacing sum scoring. Journal of
Educational and Behavioral Statistics 42(3), 282–307.
Rossi, N., X. Wang, and J. O. Ramsay (2002). Nonparametric item response function estimates with the EM algorithm. Journal of Educational and Behavioral Statistics 27(3), 291–317.
Seth, S. and M. J. A. Eugster (2016a). Archetypal analysis for nominal observations.
IEEE Trans. Pattern Anal. Mach. Intell. 38(5), 849–861.
Seth, S. and M. J. A. Eugster (2016b). Probabilistic archetypal analysis. Machine Learning 102(1), 85–113.
Slater, S., S. Joksimović, V. Kovanovic, R. S. Baker, and D. Gasevic (2017). Tools for
educational data mining: A review. Journal of Educational and Behavioral Statistics 42(1), 85–106.
Steinley, D. (2006). K-means clustering: A half-century synthesis. British Journal of
Mathematical and Statistical Psychology 59(1), 1–34.
Steinschneider, S. and U. Lall (2015). Daily precipitation and tropical moisture exports
across the Eastern United States: An application of archetypal analysis to identify
spatiotemporal structure. Journal of Climate 28(21), 8585–8602.
Stone, E. and A. Cutler (1996). Introduction to archetypal analysis of spatio-temporal
dynamics. Physica D: Nonlinear Phenomena 96(1), 110 – 131.
Su, Z., Z. Hao, F. Yuan, X. Chen, and Q. Cao (2017). Spatiotemporal variability of
extreme summer precipitation over the Yangtze river basin and the associations with
climate patterns. Water 9(11).
Theodosiou, T., I. Kazanidis, S. Valsamidis, and S. Kontogiannis (2013). Courseware
usage archetyping. In Proceedings of the 17th Panhellenic Conference on Informatics,
PCI ’13, New York, NY, USA, pp. 243–249. ACM.
Thøgersen, J. C., M. Mørup, S. Damkiær, S. Molin, and L. Jelsbak (2013). Archetypal
analysis of diverse Pseudomonas aeruginosa transcriptomes reveals adaptation in
cystic fibrosis airways. BMC Bioinformatics 14, 279.
Thurau, C., K. Kersting, M. Wahabzada, and C. Bauckhage (2012). Descriptive matrix
factorization for sustainability: Adopting the principle of opposites. Data Mining and
Knowledge Discovery 24(2), 325–354.
Tsanousa, A., N. Laskaris, and L. Angelis (2015). A novel single-trial methodology for
studying brain response variability based on archetypal analysis. Expert Systems with
Applications 42(22), 8454 – 8462.
Unwin, A. (2010). Exploratory data analysis. In P. Peterson, E. Baker, and B. McGaw
(Eds.), International Encyclopedia of Education (Third Edition), pp. 156–161.
Oxford: Elsevier.
Vinué, G. (2017). Anthropometry: An R package for analysis of anthropometric data.
Journal of Statistical Software 77(6), 1–39.
Vinué, G. and I. Epifanio (2017). Archetypoid analysis for sports analytics. Data Mining
and Knowledge Discovery 31(6), 1643–1677.
Vinué, G. and I. Epifanio (2019). adamethods: Archetypoid Algorithms and Anomaly
Detection. R package version 1.2.
Vinué, G. and I. Epifanio (2019). Forecasting basketball players’ performance using
sparse functional data. Statistical Analysis and Data Mining: The ASA Data Science
Journal 12(6), 534–547.
Vinué, G., I. Epifanio, and S. Alemany (2015). Archetypoids: A new approach to define
representative archetypal data. Computational Statistics & Data Analysis 87, 102 –
115.
Wu, C., E. Kamar, and E. Horvitz (2016). Clustering for set partitioning with a case study
in ridesharing. In IEEE 19th International Conference on Intelligent Transportation
Systems (ITSC), pp. 1384–1388.