Thinking by Classes in Data Science SDA
Keywords: data science, data mining, classification, learning, symbolic data analysis, functional analysis, Bayesian, multilevel analysis, complex data, big data, granular computing, compositional data
interest the user. For example, shopping transactions are interesting units in marketing, but customers, considered as new units represented by the classes of their transactions in a given period, can also be considered as interesting units by themselves and not only as a means of summarizing the data.

How are classes obtained? Behind any ground population described by classical (numerical or categorical) variables, there are underlying hidden populations of classes induced, e.g., by the categorical variables, by numerical variables (transformed into categorical variables by a discretization process), or by the Cartesian product of such categorical variables. Another way of obtaining classes is to use a clustering method on the ground population. Nevertheless, in practice the classes are often induced from given categories (such as regions, types of unemployed people, epidemiological strategies, consumption level, degradation level, etc.).

The description of classes cannot be expressed by just numerical and categorical values. This is due to the variability of the individuals inside each class. This variability is better expressed by intervals, histograms, probability distributions, bar charts, sequences of categorical or numerical values, sometimes weighted by numbers or associated with categorical values, and the like. These kinds of data are called ‘symbolic’ as they cannot be reduced to numbers without losing much information. The so-called ‘symbolic variables’ are the variables that associate a symbolic value with each class. In SDA Paradigm section, different kinds of symbolic variables are presented.

As classes are considered in the SDA framework as ‘objects’ to be described in all their useful facets, their descriptions are often called ‘symbolic objects.’ Several kinds of more or less structured and complex data have been associated with different kinds of symbolic objects; e.g., ‘hoard’ in Refs 1, 2 and ‘belief objects’ in Ref 3. The first advantage of such symbolic objects was to give a linear expression of the symbolic description of classes in a complex context; the second was to consider symbolic objects as the intent of the described class, this intent having an extent as in the Galois lattice framework (this is developed in Unsupervised Classification Extended to Symbolic Data section). In this study, the ‘symbolic objects’ considered are vectors of symbolic data.

A ‘symbolic data table’ is a table where classes of individuals are described by at least one symbolic variable. Standard variables can also describe classes by considering the set of classes as a new ground population of higher level.

An example of a symbolic data table is given in Table 1, where the statistical units of the ground population are players of French cup teams and the classes of players are the teams called Paris, Lyon, Marseille, and Bordeaux. The variability of the players inside each team is expressed by the symbolic variables: ‘Weight’, whose value is the [min, max] interval of the weights of the players of the associated team; ‘National Country’, whose value is the list of their nationalities; and ‘Age bar chart’, which gives the frequency of the players whose age lies in the intervals [less than 20], [20, 25], [25, 30], [more than 30], respectively denoted (0), (1), (2), (3) in Table 1. The symbolic variable ‘age’ is called a ‘bar chart variable’ as the age intervals on which it is defined are the same for all the classes and can therefore be considered as categories. The last variable is numerical, as its value for a team is the frequency of the French players in this team among all the French players of all the teams. Hence, this variable produces a vertical bar chart, in contrast with the symbolic variable ‘age’, whose values in Table 1 are horizontal bar charts. By adding to the French column the same kinds of columns associated with the other nationalities, we can obtain a new symbolic variable whose values are lists of numbers, where each number is the frequency of the players of a given nationality in a team among all the players having this nationality in all the teams. A team can also be described by standard variables such as, e.g., its expenses or the number of goals in a season.
TABLE 1 | Example of Symbolic Data Table Where Teams of the French Cup Are Described by Three Symbolic Variables of Interval, Sequence of Categories, ‘Horizontal’ Bar Charts, and a Numerical Variable Inducing a ‘Vertical’ Bar Chart
French Cup Teams | Weight | National Country | Age | Frequency of French Among All French (%)
Paris | [73, 85] | {France, Argentina, Senegal} | {(0) 30%, (1) 70%} | 30
Lyon | [68, 90] | {France, Brazil, Italia} | {(0) 30%, (1) 65%, (2) 5%} | 25
Marseille | [77, 85] | {France, Brazil, Algeria} | {(1) 40%, (2) 52%, (3) 8%} | 28
Bordeaux | [80, 90] | {France, Argentina} | {(0) 40%, (1) 60%} | 17
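As a rough illustration of how a table like Table 1 could be derived from a ground population, the following Python sketch aggregates player records into one symbolic description per team. The player records, the field names, the function names, and the half-open reading of the age intervals are all assumptions made for this example; they are not the data or the procedure behind Table 1.

```python
from collections import Counter

# Hypothetical ground population: one record per player.
# The values are illustrative only and are not the data behind Table 1.
PLAYERS = [
    {"team": "Paris", "weight": 73, "country": "France", "age": 19},
    {"team": "Paris", "weight": 85, "country": "Argentina", "age": 22},
    {"team": "Paris", "weight": 80, "country": "France", "age": 24},
    {"team": "Lyon", "weight": 68, "country": "Brazil", "age": 21},
    {"team": "Lyon", "weight": 90, "country": "France", "age": 27},
]


def age_category(age):
    """Map an age to the categories (0)-(3).

    Assumption: intervals are read as <20, [20, 25), [25, 30), >=30.
    """
    if age < 20:
        return 0
    if age < 25:
        return 1
    if age < 30:
        return 2
    return 3


def build_symbolic_table(players):
    """Aggregate player records into one symbolic description per team."""
    by_team = {}
    for p in players:
        by_team.setdefault(p["team"], []).append(p)
    all_french = sum(p["country"] == "France" for p in players)

    table = {}
    for team, members in by_team.items():
        weights = [m["weight"] for m in members]
        ages = Counter(age_category(m["age"]) for m in members)
        french = sum(m["country"] == "France" for m in members)
        table[team] = {
            # Interval-valued variable: [min, max] of the weights.
            "weight": (min(weights), max(weights)),
            # Categorical multivalued variable: nationalities present.
            "national_country": sorted({m["country"] for m in members}),
            # 'Horizontal' bar chart: frequency of each age category in the team.
            "age": {cat: n / len(members) for cat, n in sorted(ages.items())},
            # Numerical variable inducing a 'vertical' bar chart.
            "french_among_all_french": french / all_french if all_french else None,
        }
    return table


SYMBOLIC_TABLE = build_symbolic_table(PLAYERS)
```

Each row of the resulting dictionary plays the role of one row of a symbolic data table: the units are now teams, and each cell keeps part of the variability of the players instead of a single summary number.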
Basically in SDA, there are two types of descriptive variables depending on the population on which they are defined:

1. The standard variables (numerical or categorical), so-called ‘ground variables’ when they are defined on the ground population of individuals.
2. The so-called ‘symbolic-valued variables’ defined on classes, whose values cannot be reduced to just numbers (e.g., means).

These symbolic variables can be obtained from standard ground variables and define a symbolic data table as illustrated by Figure 1.

SDA gives a framework for building, describing, analyzing and extracting new knowledge from symbolic data tables. In that way, SDA can enhance the usual study of any ground population of individuals described by classical variables by adding a complementary study of classes of individuals described by symbolic variables expressing the internal variability of these classes.

As it is better to know what we describe before describing it, an important question is to know well which kinds of classes we have and what their meaning is. Basically, we have two kinds of classes, which depend on the kinds of variability of their individuals:

1. Variability between different entities: each class is a set of entities considered as individuals of the ground population. The variability is between the individuals. This is the most common variability in SDA. Let {i1, i2, … , in} be a fixed class of n individuals; e.g., the variability between n players inside their team considered as a class of players described by several classical variables. Another example can be species (of plants or animals) where the variability is between the specimens of a species. A last example is diseases considered as classes where the considered variability is between patients having the same disease. In all these cases, the classes can be described by symbolic variables expressing the variability of the individuals of each class.
2. Variability of (or inside) single entities: each class is a set defined by a fixed entity (such as a player, a specimen or a patient) considered in different conditions, parts, etc. The variability of a single entity depends on external conditions (such as position, time, and environmental situations) or on internal conditions (such as among its parts, its patterns, or its physical constitutions). More formally, this means that a class associated with the ith entity is the subset {ij1, ij2, … , ijk} of k individuals of the ground population representing the ith entity varying in k external conditions (k times, k positions, etc.) or among k internal conditions (k of its parts, k of its patterns, etc.).

Examples of Variability of a Single Entity
A player’s performance considered at different times can be considered as a class of the same player at these different times. In this case, the single entities are players. The individuals of the ground population are players considered at different times. Hence, the ith player can be a class of k individuals (ij1, ij2, … , ijk) associated with the ith player considered at
FIGURE 1 | From a standard data table (X, Y) describing a set of individuals X by a set of standard variables Y, to a symbolic data table (X′, Y′) describing a set of teams X′ by a set of symbolic variables Y′.
k times. Each player at each time is described by different kinds of performance. The higher level population of players is then described by symbolic variables obtained by aggregation of these kinds of variables for each player considered at k times. Another example is a traveler visiting different hotels. The single entity is a traveler. In this case, a class is a subset of individuals (ijh1, ijh2, … , ijhk) of the ground population associated with the traveler visiting k hotels. Each traveler is described, at each hotel visit, by different criteria of satisfaction with the visited hotel. The higher level population of travelers is then described by symbolic variables obtained by aggregation of these kinds of variables, on the k visited hotels, for each traveler.

Examples of Variability Inside Single Entities
The single entities are towers. The individuals of the ground population are parts of each tower. These parts are cracks. Hence the ith tower can be associated with a class of k individuals (ic1, ic2, … , ick) associated with k cracks. Each crack of each tower is described by classical variables such as the size, depth, orientation, and so on. The higher level population of towers is then described by symbolic variables obtained by aggregation of these kinds of variables, on the k cracks, for each tower.

In industrial data, these different kinds of classes often happen together. This leads to what we call ‘complex data’ (see Classes of Individuals and Their Symbolic Description Built from ‘Complex Data’ section for more details). In contrast to the case where the ground data can be modeled by a unique standard data table (where a set of individuals is described by a set of classical variables), complex data are composed of many such data tables of several kinds of units, variables and sizes. For example (see Application to Complex Data section for more details), nuclear power plant towers constitute a class of individuals which vary (e.g., in their geographical position). Moreover, each tower is considered as an entity varying in time, and also inside its parts (such as its cracks or corrosion points). More details on this example can be found in Afonso et al. (Refs 4, 5). Another example of complex data (developed in Classes of Individuals and Their Symbolic Description Built from ‘Complex Data’ section) concerns surveys in Official Statistics for the sociodemographic description of each region, which require many different census data tables (on families, schools, hospitals, etc.; for more details see Refs 6–9). In both examples, we obtain many standard data tables to be aggregated and merged. This leads naturally to towers or regions described by symbolic data with several kinds of symbolic variables.

Why aggregate classes and describe them by symbolic data? The symbolic description of classes leads at least to the following advantages:

• Finding new and complementary kinds of knowledge not available at the level of individuals: e.g., a kind of knowledge inside the population of individuals is to find the players whose age is between 20 and 25 years old. A complementary kind of knowledge can be found inside the population of teams (i.e., classes of players) described by distributions: find the classes whose probability of having players of age between 20 and 25 years old is higher than 0.9.
• Studying the data by units given at the needed level of generalization: e.g., if we wish to know what makes a player win, a good level of study is a data table where individuals (i.e., the players) are the units; if we wish to know what makes a team win, a good level of study is a data table where classes of individuals (i.e., the teams) are the units. The description of the teams needs to take the variability inside the classes of players into account by using: intervals or histograms of age, bar chart of nationalities, list of sponsors etc., associated with the kind of sponsoring, and so forth, which leads to symbolic data.
• Summarizing by reducing the loss of information: when each class contains millions of individuals, it is easier to study them when they are summarized by symbolic data than to study the millions of individuals which define them. A reduction of loss is obtained when, instead of the single point value in the p-dimensional space ℝ^p seen in classical data (by using, e.g., means of classes), the symbolic data take the variability inside classes into account, e.g., by interval values (such as the min–max age of the players of each team instead of their mean), thus producing hypercubes in ℝ^p.
• Reducing the number of individuals: e.g., when the teams are the classes, there are fewer teams than players, so the number of units is reduced from the ground level of players to the higher level of teams.
• Reducing the number of variables: the number of variables can become higher at the level of classes than at the level of the individuals (this will be developed in Some Principles section, in
the ‘class facets principle’). For example, a ‘team age’ can be described by the mean, mean square, min–max interval, confidence interval, histogram, quantiles, and so on of the age of the players of the team. These values allow the definition of descriptive standard or symbolic variables at the higher level of classes. Nevertheless, by considering symbolic variables instead of their transformation into numerical or categorical standard variables, we reduce the number of variables. For example, by considering p symbolic variables whose values are intervals, we have only p interval-valued variables instead of 2p numerical variables of min or max values. When we use p bar chart-valued variables of k categories instead of variables whose values are the frequency of each class for each category of each bar chart, we reduce the number of variables from k × p classical numerical variables to p symbolic variables. Hence, in a principal component analysis (PCA) extended to symbolic data we will have p variables to represent on the correlation circle instead of k × p.
• Reducing missing data: e.g., if we have a million individuals described by a numerical ground variable containing 1000 missing values, the study of 10 classes of such individuals would lead to some missing data at the level of classes for this variable. This can happen when a class contains only missing data for the ground variable. However, the number of missing data at the level of the ten classes will be far fewer than 1000!
• Solving confidentiality questions: classes are generally less confidential than individuals. For example, counties instead of inhabitants.
• Facilitating interpretation of results: e.g., in reduced decision trees, as classes are less numerous than their individuals; another example is PCA with new kinds of graphics expressing the variability inside the classes and the correlation between the symbolic variables.
• Transforming complex data with unstructured data tables and unpaired variables into a structured symbolic data table with paired symbolic variables. SDA methods and tools can then be applied to this new symbolic data table. This will be detailed in Application to Complex Data section.

Representing classes by marginal or joint probability distributions, as often done in SDA (in the case of ground numerical variables), is more informative than representing them just by means or min–max intervals. Nevertheless, other kinds of symbolic variables (whose values are not distributions) describing classes by other facets can be added. For example, vertical bar chart variables, as in Table 1, interquartile interval-valued variables, percentile-valued variables or correlations between ground variables can be usefully added (among many others) to the descriptive variables of each class.

Notice that bar charts or distributions built from ordinal variables can always describe classes, as they can express different kinds of variability. However, distributions cannot describe classes in the case of nonordinal variables. By definition, a (cumulative) distribution, evaluated at ‘x,’ is the probability that a real-valued random variable X will take a value less than or equal to x. Hence, from a ground nonordinal variable, we can build for each given class a bar chart-valued variable but not a distribution-valued variable, as in this case ‘a value less than or equal to x’ has no meaning.

Classes can also be described by bar charts, histograms, or distributions of two kinds, horizontal or vertical, depending on the kind of probability used: Pr(C|yi) or Pr(yi|C), where C is a class of individuals of the ground population and yi is a category of a ground categorical variable y. For example, Table 1 contains the symbolic variable ‘age,’ whose values are horizontal bar charts, and the numerical variable ‘Frequency among all French,’ which leads to a vertical bar chart. These kinds of description will be developed in How to Measure the Quality of Bar Chart Symbolic Variables, Classes, and Their Associated Symbolic Data Tables section.

Statistical Observations and Symbolic Data
In contrast to observed data (i.e., ‘observations’), which are considered given in standard statistics, ‘symbolic data’ are built from classes. More precisely, in standard statistics an ‘observation’ is usually the (numerical or categorical) value, at a particular period, of a particular variable [see the OECD Glossary of statistical terms (2007) at http://stats.oecd.org/glossary/search.asp].

An observation describes a unit (here called ‘individual’) of a given population. Therefore, in order to ‘describe’ the classes in SDA, the symbolic variable values (such as intervals) are not ‘observations’ but are built from observations given on the ground population of individuals. The classes of individuals are considered as higher level units and so constitute a new population of higher level. To be studied, this new population uses the ‘description’ of the higher
TABLE 2 | Class Descriptive Single or Multivalued Variables (Two Are Classical and Four Are Symbolic) with Examples Based on Descriptive Variables of Team Players
Variable | Single Value | Multivalue
Numerical | Classical numerical variable. Example: Frequency list in a team of nationalities among the players of same nationality in all the teams | Symbolic numerical multivalued variable. Example: List of team players ages
Categorical | Classical categorical variable. Example: Nationality (of the team manager) | Symbolic categorical multivalued variable. Example: List of a team players nationalities
Interval | Symbolic interval-valued variable. Example: [Min, Max] Interval age of a team players | Symbolic interval multivalued variable. Example: Intervals between successive month team expenses or of stock values
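One possible way to make the six cells of Table 2 concrete is to classify a description value by its kind (numerical, categorical, interval) and by whether it is single or multivalued. The sketch below is only an illustration under assumed conventions that do not come from the source: numbers are Python numbers, categories are strings, an interval is a 2-tuple (low, high), and a multivalue is a Python list.

```python
from numbers import Real


def _kind(value):
    """Return 'numerical', 'categorical' or 'interval' for a single value."""
    if isinstance(value, bool):
        raise TypeError("booleans are not handled in this sketch")
    if isinstance(value, Real):
        return "numerical"
    if isinstance(value, str):
        return "categorical"
    if (
        isinstance(value, tuple)
        and len(value) == 2
        and all(isinstance(b, Real) and not isinstance(b, bool) for b in value)
        and value[0] <= value[1]
    ):
        return "interval"
    raise TypeError(f"unsupported value: {value!r}")


def table2_cell(value):
    """Locate a description value in Table 2 as (kind, 'single' | 'multivalue')."""
    if isinstance(value, list):
        kinds = {_kind(v) for v in value}
        if len(kinds) != 1:
            raise ValueError("a multivalue must be a non-empty, homogeneous list")
        return kinds.pop(), "multivalue"
    return _kind(value), "single"
```

For instance, `table2_cell((19, 34))` falls in the ‘symbolic interval-valued variable’ cell, while `table2_cell(["France", "Brazil"])` falls in the ‘symbolic categorical multivalued variable’ cell.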
describes. For example, the [Min, Max] or the interquartile interval of the random ground variable ‘age’ inside a team considered as a class of players.

Multivalued Variables
This is the case where the symbolic variable values express the internal variability of a class of individuals by a list of numbers, categories or intervals. These kinds of variables are connected to reality and are not an abstract construct introduced just to complete the cells of Table 2. Hence, the list of numbers can be the list of marks of a pupil varying between examinations in some courses (Mathematics, Physics, etc.), of temperatures of a patient varying during each hour every day, or of performances of an athlete varying in various competitions. In these cases, the classes are, respectively, the pupils, the patients, and the athletes. The variables are, respectively, the courses, the days, and the competitions. The list of categories can be a list of products bought by customers in several sections of a supermarket in a given period. In this case, the classes are the customers and the variables are the sections of the supermarket. The list of intervals can be the list of interval variations of stock prices every day, during several weeks. In this case, the classes are the stocks and the symbolic variables are the weeks.

In Table 3, other kinds of symbolic variables are illustrated by examples as follows.

Numerical, Categorical, or Interval Modal Symbolic Variables
The values of such symbolic variables are lists of pairs (number, mode), (category, mode), or (interval, mode), where the mode, expressed in the columns of Table 3, is a number, a category, or an interval.

Therefore, each of the 9 (i.e., 3 × 3) cells of Table 3 defines a different type of symbolic variable. In Refs 10–12, ‘categorical modal variables’ are restricted to ‘symbolic (category, number) list-valued variables,’ which contain the bar chart-valued variables as a particular case. In Table 3, we can see that ‘categorical modal variables’ are enlarged to two other types: symbolic (category, category) list-valued variables and symbolic (category, interval) list-valued variables. In the same books, ‘symbolic interval modal variables’ are restricted to ‘symbolic (interval, number) list-valued variables,’ which therefore contain the case of ‘histogram-valued variables,’ when the sum of the numbers associated with each interval of the list is equal to one. They also contain ‘compositional data’ (Ref 19) when the sum of these numbers remains constant. In Table 3, ‘symbolic interval modal variables’ are enlarged to the case of ‘symbolic (interval, category) list-valued variables’ and to the case of ‘symbolic (interval, interval) list-valued variables.’ All these cases are illustrated by examples given in Table 3.

WHAT IS TO BE OBSERVED AND SCRUTINIZED?
In SDA, we observe and scrutinize classes of individuals, in order to identify them by the standard or symbolic data which describe them.

From Where Can Classes and Their Descriptions Be Obtained?
In the introduction, we defined two kinds of classes: classes expressing variability between entities and classes expressing the variability of (or inside) entities. When classes express the variability between entities, they can be obtained in two ways. First, they can be induced from the categories, defined on the ground population, and induced by standard numerical (after discretization) and/or categorical variables and their Cartesian product. These categories are sometimes ‘taxonomic’ by inducing classes of individuals structured in different ways as by hierarchy or
TABLE 3 | Class Description by Modal Symbolic Variables Illustrated by Examples of Team Players
Support | Numerical Modal | Categorical Modal | Interval Modal
Numerical | (number, number) list-valued variable. Example: time series team description by its performance list: (time × number of goals in the ongoing season) list | (number, category) list-valued variable. Example: team description by its rank evolution: (time × rank) list | (number, interval) list-valued variable. Example: team description by its score evolution in the ongoing season: (time × [min,max] number of goals)
Categorical | (category, number) list-valued variable. Example: team description by the bar chart: (nationality, frequency) list | (category, category) list-valued variable. Example: team description by its (sponsor × dress) list as: (Adidas (T-shirt), Total (shoes)) | (category, interval) list-valued variable. Example: team description by: (nationality × (age interval value)) list
Interval | (interval, number) list-valued variable. Example: Histogram team description by: frequencies of age intervals list | (interval, category) list-valued variable. Example: team description by time intervals sequence of same rank: (interval of time × rank) list | (interval, interval) list-valued variable. Example: team description by: (interval of time × interval of rank) list
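The text on modal variables distinguishes (interval, number) lists whose numbers sum to one (histogram-valued variables) from those whose sum merely stays constant across classes (compositional data). The following Python sketch shows one way to check both conditions. The representation of a modal value as a list of (support, number) pairs, the function names, the tolerance, and the sample values are assumptions made for illustration.

```python
import math


def total_weight(pairs):
    """Sum of the numbers attached to the supports of one modal value."""
    return math.fsum(w for _, w in pairs)


def is_histogram(pairs, tol=1e-9):
    """True if pairs is an (interval, number) list whose numbers sum to one."""
    intervals_ok = all(
        isinstance(s, tuple) and len(s) == 2 and s[0] <= s[1] for s, _ in pairs
    )
    weights_ok = all(w >= 0 for _, w in pairs)
    return intervals_ok and weights_ok and abs(total_weight(pairs) - 1.0) <= tol


def has_constant_sum(descriptions, tol=1e-9):
    """True if every class description has the same total (compositional case)."""
    sums = [total_weight(d) for d in descriptions]
    return all(abs(s - sums[0]) <= tol for s in sums)


# Hypothetical age descriptions of two teams.
AGE_FREQUENCIES = [((15, 20), 0.3), ((20, 25), 0.7)]   # relative frequencies
AGE_COUNTS_A = [((15, 20), 3), ((20, 25), 7)]          # absolute counts, total 10
AGE_COUNTS_B = [((15, 20), 5), ((20, 25), 5)]          # absolute counts, total 10
```

Here `AGE_FREQUENCIES` qualifies as a histogram value, while the two count-based descriptions do not, although they share the same constant total of 10.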
other structures as, e.g., in the case of towns, regions, countries, and continents. Second, these classes can also be obtained from a clustering algorithm applied to the ground population. This clustering yields more or less structured clusters defining classes of individuals, as they can provide partitions, hierarchies, overlapping clusters, pyramids (or more generally, spatial pyramids; Ref 20), and Galois lattices (see Unsupervised Classification Extended to Symbolic Data section).

When classes are due to the variability of (or inside) single entities, the classes can be built directly from the ground population by the subsets {ij1, ij2, … , ijk} associated with each entity i considered in k internal or external conditions as explained in the introduction.

Both kinds of classes can be described by symbolic variables taking care of their internal variability. Notice that the inclusion order between classes must be coherent with the chosen order between their symbolic descriptions. For example, if the symbolic description of a set of classes is given by an interval-valued variable (i.e., each class is associated with an interval), then, to be coherent, inclusion between classes should imply inclusion between the associated intervals; Diday (Ref 21) studies how the taxonomic order (between classes) can imply the symbolic order (between symbolic values). By using these symbolic descriptions, one can obtain different levels of more or less specialized knowledge and therefore a better understanding of the inherent knowledge structure of the data.

Sometimes, we can have ‘native symbolic data’ describing classes but not obtained from a given ground population. Native symbolic data appear in many situations. For example, in biology, ‘species’ (considered as classes of individuals called ‘specimens’) are described by experts in specialized books by their interval of height, distribution of colors and the like, without giving the ground level of the specimens. Native symbolic data can also appear when, due to confidential data, the observations on the ground populations (people or companies and the like) are not available. Native symbolic data happen in national censuses of Official Statistics, where we have for each region the bar charts associated with different ground variables describing hospitals, schools, the sociodemographic situation of the inhabitants, and so on, but we do not have the underlying surveys giving the ground populations of the hospitals, the schools, the inhabitants, and so forth. This last example is detailed in Classes of Individuals and Their Symbolic Description Built from ‘Complex Data’ section as it is a case of ‘complex data.’

Classes of Individuals and Their Symbolic Description Built from ‘Complex Data’
‘Complex data’ are the case where we have entities defined by several ground populations. In this context, we have as input several unstructured data tables with different numbers of individuals (in rows) and of variables (in columns). Moreover, the individuals and the variables can be different between the data tables. The entities can also be considered as individuals of another ground population.

Example. Each entity is a ‘power plant tower’ defined by a population of cracks (described by their orientation, their length, etc.) and a population of corrosions (described by their depth, their surface, etc.). Each entity is considered as an individual of a ground population of power plant towers. This example is developed in Application to Complex Data section where Figure 9 shows the complex data and their fusion in symbolic data. More generally, we can
not constant, as, e.g., when the absolute and not the specific tools or by extending known tools to sym-
relative frequency is used. bolic data, as descriptive statistics, PCA, regression,
Between the first level where the ground popu- decision trees, clustering, dissimilarities, and the like.
lation is described by standard numerical or categori- Several studies have already been performed in this
cal variables and the second level where classes are direction in the four books (Refs 10–12 and 22). Sev-
described by symbolic variables, there is an interme- eral introductions have been done, e.g., in Ref 23
diate step. For example, if we consider the random and with a nontechnical introduction: Refs 24–27.
variable Age which associates with each player of a Several international SDA workshops [Vienna
team c, its age, we can then define a higher level ran- (2009), Namur (2011), Madrid (2012), Taiwan
dom variable Age whose value on each team is Agec. (2014), and Orléans (2015)] give the trends. More
Hence, this random variable Age, is a random varia- details on SDA theory and tools are given in Some
ble of random variable values and constitutes an Methods for the Analysis of Symbolic Data
intermediate step from where the symbolic variables section and directions of research in Some Direction
values (e.g., as the probability distributions of each of Research and Links with Specific Domains of
team) are obtained. Data Science section. Any new method extended on
symbolic data can be studied in term of stability and
How Do Random Variables of Random convergence when the ground population increases.
Variables Value and Model of Models This question is discussed in Theoretical Develop-
Appear in SDA? ment section.
Basically, the process is the following: the ground
population X, where n individuals are described by a
set Y of p standard variables, constitutes the initial WHAT ARE THE PRINCIPLES AND THE
data table. This table is transformed into a new and THEORETICAL DEVELOPMENTS?
reduced in rows data table (X0 , Y0 ) of k rows and
p columns, where X0 is a set of k classes and Y0 is a Some Principles
set of p0 new variables whose values on each class is
a random variable. The random variable values of Y′ generally have different laws on the classes. For example, the random variables associated with the age of each team of players generally do not have the same law since, e.g., the players of one team can be younger, with mean age 22, than the players of another team with mean age 25. Hence, we obtain a new data table (of k rows and p new variables) which contains, in each cell, a random variable defined on the individuals of its associated class.
This intermediate table containing a random variable in each cell leads to symbolic variables with empirical histogram or distribution values (or bar chart values in the case of categorical variables) but also to parametric models. Therefore, the question of finding models of models can arise. Dirichlet is a good example of such a model, as the associated random variable values can be probability density distributions. In that direction, several works have been recently developed (see Theoretical Development section).

How to Analyze a Symbolic Data Table?
For a symbolic data table (built from a ground population or native), tools have been developed in order to analyze and discover new knowledge from such tables. This analysis can be performed by developing

Class Facets Principle
This principle aims to consider each class of individuals as ‘a thing by itself’ to be described, like an object, by its different facets. As classes are considered as ‘objects,’ their symbolic descriptions are often called ‘symbolic objects’ in the SDA literature. The different facets include descriptive statistics of classes considered as new standard or symbolic variables, but not only these: it is often useful to add variables (at the level of classes) that make no sense at the level of individuals, e.g., the sponsoring company of a team or the team ‘region.’ From this principle, it results that the number of variables which describe the classes can be much larger than the number of initial ground variables. For example, the description of classes by a ground numerical variable such as ‘age’ can induce several numerical variables, which associate with each class its mean, its median, and/or its mean square. This ground numerical variable can also induce symbolic descriptive variables whose value associated with each class is, e.g., the [Min, Max] interval, an interquartile interval or another percentile interval, a percentile list, a distribution, or a histogram. In the same way, a single ground categorical variable can induce several symbolic variables where the value for each class can be a list of categories, a bar chart, and moreover an interval or a distribution when the categories are ordered.
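As a minimal sketch of the Class Facets Principle, the following Python fragment aggregates one ground numerical variable (‘age’) into several class-level facets: a mean, a median, a [Min, Max] interval, and a histogram. The player data, the bin boundaries, and the function name `describe_classes` are hypothetical and chosen only for illustration; they are not taken from Table 1.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical ground data: one (team, age) pair per player.
players = [
    ("Paris", 22), ("Paris", 24), ("Paris", 29),
    ("Lyon", 21), ("Lyon", 25), ("Lyon", 25), ("Lyon", 31),
]

def describe_classes(rows, bins=((20, 25), (25, 30), (30, 35))):
    """Aggregate one ground numerical variable into several class-level facets."""
    by_class = defaultdict(list)
    for cls, value in rows:
        by_class[cls].append(value)
    table = {}
    for cls, values in by_class.items():
        n = len(values)
        # Histogram facet: relative frequency of each half-open bin [lo, hi).
        hist = [sum(lo <= v < hi for v in values) / n for lo, hi in bins]
        table[cls] = {
            "mean": mean(values),                   # standard numerical variable
            "median": median(values),               # standard numerical variable
            "interval": (min(values), max(values)), # interval-valued variable
            "histogram": hist,                      # histogram-valued variable
        }
    return table

symbolic_table = describe_classes(players)
for team, facets in symbolic_table.items():
    print(team, facets)
```

Each row of `symbolic_table` describes a class (a team), and a single ground variable has produced four class-level variables, which is the growth in the number of variables that the principle describes.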
Also, links between the ground variables inside the classes induce more descriptive symbolic variables, such as symbolic variables of joint distribution values. They also induce correlation and concordance values of different kinds (standard correlation, Spearman correlation, Kendall Tau concordance, and the like). For example, in the symbolic description of each team of players by their aggregated height, weight, and so on, we can add the correlation, associated with each class, between the height and the weight of the individuals of this class. This means that we can add to the symbolic data table a new (numerical) variable called ‘correlation between height and weight’ defined on the set of classes. Notice that adding such a variable at the ground level of individuals would make no sense, as the correlation between two numbers defined by the height and the weight of a single individual is meaningless.
More generally, if we have p ground numerical variables, we can add to the p symbolic variables obtained by aggregation from the ground data the p(p − 1)/2 numerical variables expressing the correlation of each pair of ground variables for each class. In order to reduce the number of variables, different ways of selecting the variables of best explanatory and/or discriminatory power are given in How to Measure the Quality of Bar Chart Symbolic Variables, Classes, and Their Associated Symbolic Data Tables section.
Adding variables (e.g., correlation-valued variables) means that two classes are closer if the added values in their symbolic descriptions are close. As for the correlations, we can add inside the symbolic description of each class the most discriminating joint probabilities of categorical variables. Notice that at this level, a good copula model can save much computing time in the case of big data by calculating the joint distribution from the marginal distributions, which avoids calculating it from the ground data (see Figure 8). Moreover, other classical variables specific to the classes can be added (e.g., the amount of expenses of a team is specific to the team and not to the players).
Sometimes, other kinds of variables can be added to the description of a class by a transformation of symbolic data to other kinds of symbolic data. For example, functions (such as curves or time series) can be transformed into histograms by a wavelet series expansion (Ref 28) or into a set of coefficients by a Fourier series expansion. In the same way, estimated probability distributions can be transformed into other symbolic variable values such as histograms or lists of percentile values.

Variability Principle
This principle aims to take the variability inside classes into account by avoiding a simple reduction of symbolic data to classical data. Transforming symbolic data into numerical data is possible and can be useful but loses much information. For example, transforming descriptive vectors of p intervals into vectors of 2p numbers associated with 2p Min- and Max-valued variables loses the information contained in their inherent hyperrectangle of 2^p vertices. In Figure 3, we give an example where two symbolic interval-valued variables Y1 and Y2 are first associated with four Min or Max numerical variables denoted a1, b1, a2, and b2. In this case, the graphical representation in the space defined by these four numerical variables associates a point with each class; hence, in that way, there is no variability inside the class. In a second way, each class is represented in a biplot on the two symbolic variables by 2^2 = 4 vertices. In that way, the variability of each class appears in terms of a rectangle associated with each class. Therefore, symbolic data cannot be transformed to standard data without losing some aspects of the symbolic variables and their expressed variability.

Specificity of the Individual Level Versus the Class Level Principle
This principle aims to clearly distinguish the ground level knowledge from the higher level knowledge of classes. This means that the ground data describing individuals and the symbolic data describing classes must each be considered with their own specificity. For example, a question like ‘for which players is the height higher than 1.80 m?’ has a meaning at the level of the ground data table describing players (i.e., individuals) but has no meaning at the level of teams (i.e., classes). In contrast, a question like ‘for which classes is the probability that a player is taller than 1.80 m higher than 0.9?’ has a meaning at the level of classes and not at the ground level of individuals. In order to build a decision tree, suppose we have ‘good classes’ and ‘bad classes.’ This implies that we also have ‘good individuals’ and ‘bad individuals.’ The binary answer to the first question can be used on the ground data describing individuals and the answer to the second question can be used on the symbolic data describing classes. In that way, we can obtain two different decision trees: one is specific to the individuals and the other is specific to the classes. Nevertheless, an individual can be considered as a class reduced to a single individual and so, by using the decision tree on classes, we can say if a new individual is good or bad. However, a class cannot generally be considered as an individual and so cannot be identified in a decision tree on the individuals. There are some examples where curiously the allocation of new individuals considered as
FIGURE 3 | Graphical representation of the variability inside symbolic data by four numeric and two symbolic variables.
classes has given better results than the decision tree on individuals (see Ref 29). When this happens is an open question. Notice that in the case of complex data (see the power plant cooling towers example in Application to Complex Data section), only the decision tree on classes has a meaning. Also, in the case of a huge ground population, the decision tree on individuals becomes very large and difficult to interpret. On the other hand, the decision tree on classes is much smaller and so easier to interpret.

Interpretation Principle
In the interpretation of the symbolic descriptions, we have to verify (before drawing wrong conclusions) the independence between the descriptive distributions and other descriptive variables. If, e.g., the frequency of a category of a symbolic variable (of bar chart value) is higher for one class than for another class, we have to verify that this is not the effect of other descriptive symbolic variables. For example, classes defined by regions and described by distributions of different antibiotic strategies on inhabitants can depend on sociodemographic variables of each region. One way to validate the distributions of the strategies is to compare them inside each homogeneous cluster obtained on the ground population described by only the sociodemographic variables. That kind of situation has been studied in medical epidemiology (Ref 30).

Generalization Principle
Often SDA tools generalize standard tools to symbolic data. When an SDA tool generalizes a standard one, this principle draws attention to the fact that its associated software can be applied to standard data too, as they are a special case of symbolic data.

Explanatory Power Versus Discriminatory Power Principle
This principle aims to clearly distinguish the ‘explanatory power’ from the ‘discriminatory power’ of a symbolic variable. Basically, the ‘explanatory power’ of a symbolic variable is a measure of the differences between the classes (i.e., between their symbolic values for this variable). The higher these differences, the higher the explanatory power of the symbolic variable. Hence, the explanatory power of a symbolic variable is nil when there are no differences between the classes. This is coherent with the fact that a variable is of no help in explaining the classes when their symbolic values, for this variable, are all the same.
The ‘discriminatory power’ of a symbolic variable Y associated with a ground categorical variable y is a measure of the differences between the categories of y, considered as classes described in terms of symbolic data by a symbolic variable, the so-called ‘class variable,’ whose categories are the classes. The higher these differences, the higher the discriminatory
power of a symbolic variable. Hence, the discriminatory power of a symbolic variable is nil when there are no differences between the categories of y in their symbolic description of the classes. In other words, this means that a symbolic variable Y is of no help in discriminating the classes when all the symbolic values of the symbolic class variable, describing the categories of y, are the same.
We illustrate these two notions with two symbolic variables based on two kinds of frequencies, roughly defined by the conditional probabilities Pr(yj|Ci) = fij and Pr(Ci|yj) = gij, where yj is a category of the ground variable y.
More precisely, let y be a ground categorical variable of m categories denoted y = (y1, …, ym) and let C = {C1, …, Ck} be the set of k classes of individuals, with card(Ci) = ni. Let nij be the number of individuals of the category yj in the class Ci.
A bar chart symbolic-valued variable, called Yh, associated with the ground variable y and defined on C is such that Yh(Ci) = (fi1, …, fim) is a vector of m frequencies fij = nij/ni. This vector defines a bar chart as we have Σj=1,…,m fij = 1. The vector Yh(Ci) is ‘horizontal’ in the symbolic data table. Therefore, we say Yh is a ‘horizontal bar charts variable.’ Notice that, having an individual and its class Ci, we can associate with this individual its best explanatory category, which is the one maximizing fij for j = 1, …, m. This best frequency tends to become higher when the explanatory power of the variable Yh increases. This comes from the fact that the larger the distance between two bar charts, the more the frequencies in each bar chart are contrasted (more large and small values) and concentrated (more 0 values) on some categories (which differ as much as possible from one bar chart to the other).
Now we define another symbolic variable, also associated with the ground variable y and called Yv, which induces ‘vertical’ bar charts in the symbolic data table. This symbolic variable, defined on C, is such that Yv(Ci) = (gi1, …, gim) is a vector of m frequencies gij = nij/Nj, where Nj is the total number of individuals taking the value yj in the ground population. Hence, to each category yj of y is associated the vector Gj = (g1j, …, gkj)T, which is a vertical bar chart as we have Σi=1,…,k gij = 1. The sub data table Y ’ I = (Yv(C1), …, Yv(Ck))T of k rows and m columns is identical to the vector G = (G1, …, Gm) of m ‘vertical’ bar charts in the symbolic data table. Therefore, we say that Yv is a ‘vertical bar charts variable.’
The explanatory (resp. discriminatory) power of Yh (resp. Yv) can then be measured, e.g., by a sum of the L1 distances between the horizontal (resp. vertical) bar charts, taken two by two. This criterion is detailed in How to Measure the Quality of Bar Chart Symbolic Variables, Classes, and Their Associated Symbolic Data Tables section. Hence, if the horizontal (resp. vertical) bar charts are all the same, the explanatory (resp. discriminatory) power value is 0. The more the horizontal (resp. vertical) bar charts differ, the higher the explanatory (resp. discriminatory) power value. We can call fij (resp. gij) the explanatory (resp. discriminatory) value of the category yj (of the class Ci) for the class Ci (resp. for the category yj). Notice that, having an individual and its class Ci, we can associate with this individual its category of best explanatory value, which is the one maximizing fij for j = 1, …, m. In the same way, having an individual and its category yj, we can associate with this individual its class of best discriminating value, which is the one maximizing gij for i = 1, …, k. This best frequency tends to become higher when the discriminatory power of the variable Yv increases. Hence, having a new individual whose category is known for the variable y, we can associate it with the class of best explanatory value, or with the class of best discriminatory value, or with the class of best combination of these two values. Such a combination of the explanatory and discriminatory power of a ground categorical variable and a given set of classes will be seen in How to Measure the Quality of Bar Chart Symbolic Variables, Classes, and Their Associated Symbolic Data Tables section.
More generally, instead of fij (or gij), we can use in the same way other values depending on fij (or gij), such as the tf or the tf–idf value (see, e.g., Ref 31).
Coming back to Table 1, we can see that the symbolic variable ‘Age bar chart’ is a ‘horizontal bar chart symbolic variable’ denoted Yh, as it associates with each team a horizontal bar chart. For example, the age category of best explanatory value for an individual of the Lyon team is the interval [20, 25] (which is the category denoted (1)), as 30% is the highest frequency among the categories of age. We can define a ‘vertical symbolic variable’ by adding to the numerical variable ‘Frequency of French among all French’ (which leads to a vertical bar chart, as shown in Table 1) other such numerical variables obtained in the same way by considering the other nationalities. More precisely, having a ground variable denoted z which associates with any individual its nationality, we can then define a ‘vertical bar charts variable’ Zv whose value for each class is a vector of m numbers if there are m nationalities. Each of these numbers is the frequency of a nationality of the players of a team, among all the players having this
nationality. This symbolic variable Zv leads to m vertical bar charts associated with m nationalities. For example, the team of best discriminatory value for an individual having French nationality is Paris, with frequency 30%. Having a new player of French nationality, we can allocate him to the team of highest explanatory value (i.e., of highest proportion of French in the team) or of highest discriminatory value (i.e., with the greatest proportion of French among all French). Notice that if the teams all have the same number of players, then the same team maximizes the two values.
Can we say that among all the ground categorical variables, the variable y which induces the symbolic variable Yh with higher explanatory power is also the variable which induces the variable Yv with the higher discriminatory power? In How to Measure the Quality of Bar Chart Symbolic Variables, Classes, and Their Associated Symbolic Data Tables section, we show that this is not always the case and we give eight rules relating these two kinds of power.

How to Measure the Quality of Bar Chart Symbolic Variables, Classes, and Their Associated Symbolic Data Tables
Having obtained the symbolic data table from standard or complex ground populations, an important question is measuring the quality of symbolic variables and their associated symbolic data tables. First (in Quality Criteria Based on the Symbolic Variables section), we define criteria based on the ‘quality’ of the symbolic variables measured by their explanatory and discriminatory power, which we define precisely. Second (in Quality Criteria of Classes and Variables Based on the Cells of the Symbolic Data Table section), based on the quality of each cell, we can define the quality of the classes, of the symbolic variables, and of the symbolic data table.

Quality Criteria Based on the Symbolic Variables
In this section, we define criteria in order to measure the quality of a symbolic data table. These criteria are based on the explanatory and discriminatory power of the symbolic variables. We focus on bar chart-valued variables as they can be obtained directly from the ground data table or from other kinds of symbolic data. Therefore, the first step is to build in a ‘good way’ the bar chart-valued variables from the ground classical variables. Diday et al. (Ref 32) give a way of building these symbolic variables by a discretization process which can optimize their explanatory or discriminatory power.
The second step is to select the best symbolic variables by using criteria based on their explanatory and/or discriminatory power, which we define precisely. Then, a natural question appears: can we say that the variable with higher explanatory power is also the variable with the higher discriminatory power? In this section, we show that this is not always the case and we give eight rules relating these two kinds of power in the case of ground binary variables and a partition with two classes. The quality criterion defined in this case can then be easily extended to the case of multicategorical variables and to a partition of more than two classes.
Let X = (X1, X2) and U = (U1, U2) be the partitions of the ground population induced by two ground binary variables. The numbers of individuals of Xi and Ui in the ground population are denoted xi = |Xi| and ui = |Ui|. We also set au = |U1 ∩ X1|, bu = |X2 ∩ U1|, du = |U2 ∩ X1|. Thus, we have x1 = du + au, u1 = au + bu and u1 + u2 = x1 + x2 = X.
Hence, we obtain the symbolic data tables of Figure 4, where the frequency of the class U1 in the class X1 (resp. X2) is f11 = |X1 ∩ U1|/|X1| = au/x1 (resp. f21 = |X2 ∩ U1|/|X2| = bu/x2), and in the same way we obtain the following frequencies: g11 = |X1 ∩ U1|/|U1| = au/u1, g12 = |X1 ∩ U2|/|U2| = du/u2.
Notice that, in the first data table of Figure 4, U is considered as a symbolic variable and X as a class variable. In the second data table, X is considered as a symbolic variable and U as a class variable.

X/U   U1       U2             U/X   X1       X2
X1    au/x1    1 − au/x1      U1    au/u1    1 − au/u1
X2    bu/x2    1 − bu/x2      U2    du/u2    1 − du/u2

FIGURE 4 | The tables X/U and U/X.

The L1 distance between classes and between categories is defined by DX/U(X1, X2) and DU/X(U1, U2) such that:
DX/U(X1, X2) = (Σj=1,2 |f1j − f2j|)/2 expresses the explanatory power of the symbolic variable U for the class variable X.
DU/X(U1, U2) = (Σj=1,2 |g1j − g2j|)/2 expresses the discriminatory power of the symbolic variable X for the class variable U. We can see that both distances vary between 0 and 1.
In the following, instead of U we use the variables Y and Z, and instead of X we use the variable C. Therefore, e.g., we have f11 = |C1 ∩ Y1|/|C1| = ay/c1, f21 = |C2 ∩ Y1|/|C2| = by/c2, and so on.
Example. The symbolic data tables C/Y, C/Z, Y′/C, Z′/C are given in Figure 5. By comparing the
distances DC/Y(C1, C2) = 1 and DC/Z(C1, C2) = 0, we see that the explanatory power of Y is higher than that of Z (in fact it is the highest possible for Y and the lowest possible for Z). In the same way, by comparing the distances DY/C(Y1, Y2) = 0.1 and DZ/C(Z1, Z2) = 0.3, we see that the discriminatory power of Z′ is higher than that of Y′.

C/Y   Y1   Y2      C/Z   Z1    Z2      Y′/C   C1    C2      Z′/C   C1    C2
C1    1    0       C1    1/3   2/3     Y′1    0.6   0.4     Z′1    0.3   0.7
C2    0    1       C2    1/3   2/3     Y′2    0.5   0.5     Z′2    0     1

FIGURE 5 | The explanatory power of Y is much higher than the one of Z and the discriminatory power of Z′ is higher than the one of Y′.

In Ref 33, it is shown that eight rules (see Figure 6) relating the explanatory and discriminatory power can be induced from the two following results:

$$\frac{D_{C/Y}}{D_{C/Z}} = \frac{y_1 y_2}{z_1 z_2}\,\frac{D_{Y/C}}{D_{Z/C}} \tag{1}$$

and

$$\frac{D_{Y/C}}{D_{Z/C}} = \frac{z_1 z_2}{y_1 y_2}\,\frac{D_{C/Y}}{D_{C/Z}} \tag{2}$$

[Figure 6, partially recovered: an ‘If … then’ table of rules whose columns are DZ/C < DY/C and DZ/C > DY/C.]

Examples. We give two examples. In the first one, the same variable has simultaneously the best explanatory and the best discriminatory power. In the second example, this is not the case.
In the first example, the ground data table is given in Figure 7, where the number of individuals is C = 7 and c1 = 4, c2 = 3, y1 = 3, y2 = 4, z1 = 2, z2 = 5. From this data table, it also results that ay = 2, by = 1 for the variable Y and az = 1, bz = 1 for the variable Z. From these, we can deduce that DC/Z = 1/12 < DC/Y = 1/6, and from Eq. (2) we obtain:

$$\frac{D_{Y/C}}{D_{Z/C}} = \frac{z_1 z_2\, D_{C/Y}}{y_1 y_2\, D_{C/Z}} = 2 \times \frac{10}{12} = \frac{5}{3}.$$

Therefore, DC/Z < DC/Y and DZ/C < DY/C, with y1y2 = 12 > 10 = z1z2. In other words, in this case the more explanatory variable Y is also the more discriminatory one.
In the second example, we have c = 100, c1 = 80, c2 = 20, ay = 9, by = 3, y1 = 12, y2 = 88, az = 39, bz = 11, z1 = 50, z2 = 50. Then, we obtain DC/Y/DC/Z = 3/5 < 1 and DY/C/DZ/C = (z1z2/y1y2) DC/Y/DC/Z = 125/88 > 1. From this, it results that DC/Y(C1, C2) < DC/Z(C1, C2) and DY/C(Y1, Y2) > DZ/C(Z1, Z2). Hence, in this case the more explanatory variable Z is not the more discriminatory one.
In order to select symbolic variables with a ‘good’ explanatory and discriminatory power, we can define a selection criterion denoted S(Y), to be maximized, as follows:

$$S(Y) = \sum_{Z \in V} \frac{D_{Y/C}}{D_{Z/C}} \times \frac{D_{C/Y}}{D_{C/Z}}.$$

Then, from Eq. (2), we obtain:

$$S(Y) = \sum_{Z \in V} \frac{z_1 z_2}{y_1 y_2} \left(\frac{D_{C/Y}}{D_{C/Z}}\right)^{2}.$$

Hence, the larger this criterion, the larger the explanatory and the discriminatory power of the variable Y compared to the other variables Z belonging to a given set of variables V.
In the more general case of a class variable with a partition in more than two classes and of symbolic variables with more than two categories, this criterion can be extended by summing over all the classes of categories divided in two parts for each variable and over all the pairs of classes of the partition.
By introducing the entropy, a criterion measuring the explanatory and discriminatory power of a bar chart symbolic variable can be defined in the following way:

$$W(Y) = a\,\frac{ID_{C/Y}}{1 - \mathrm{Entr}(C/Y)} + b\,\frac{ID_{Y/C}}{1 - \mathrm{Entr}(Y/C)} \quad \text{with } a + b = 1.$$
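The two numerical examples can be checked with exact rational arithmetic. The sketch below is only an illustration, not the author's software: the function names `powers` and `eq2_sides` are hypothetical. It relies on the fact that, for a binary variable and a two-class partition, each L1 distance reduces to a single absolute difference (because each bar chart has two frequencies summing to 1).

```python
from fractions import Fraction as F

def powers(c1, c2, a, b):
    """Explanatory and discriminatory power of a binary variable Y for (C1, C2).

    a = |C1 ∩ Y1| and b = |C2 ∩ Y1|, hence y1 = a + b and y2 = c1 + c2 - y1.
    Returns (D_C/Y, D_Y/C, y1 * y2).
    """
    y1 = a + b
    y2 = c1 + c2 - y1
    d = c1 - a                                   # d = |C1 ∩ Y2|
    explanatory = abs(F(a, c1) - F(b, c2))       # |f11 - f21|
    discriminatory = abs(F(a, y1) - F(d, y2))    # |a/y1 - d/y2|
    return explanatory, discriminatory, y1 * y2

def eq2_sides(c1, c2, ay, by, az, bz):
    """Return both sides of Eq. (2) for the variables Y and Z."""
    dcy, dyc, yy = powers(c1, c2, ay, by)
    dcz, dzc, zz = powers(c1, c2, az, bz)
    lhs = dyc / dzc
    rhs = F(zz, yy) * dcy / dcz
    return lhs, rhs

# First example: c1 = 4, c2 = 3, ay = 2, by = 1, az = 1, bz = 1.
lhs1, rhs1 = eq2_sides(4, 3, 2, 1, 1, 1)
# Second example: c1 = 80, c2 = 20, ay = 9, by = 3, az = 39, bz = 11.
lhs2, rhs2 = eq2_sides(80, 20, 9, 3, 39, 11)

print(lhs1, rhs1)  # both sides of Eq. (2), first example
print(lhs2, rhs2)  # both sides of Eq. (2), second example
```

In both examples the two sides of Eq. (2) coincide, giving the ratios 5/3 and 125/88 stated in the text. In the second example the explanatory ratio DC/Y/DC/Z = 3/5 is below 1 while the discriminatory ratio is above 1.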
described by different variables. The building process of the symbolic data table is based on the following principle, sometimes called the ‘fusion process.’ First, each class induced by the class variable (on individuals) is aggregated by using the standard variables associated with each ground population. These aggregations yield a symbolic description for each class and are ‘vertically’ concatenated (see Figure 9). More precisely, as the individuals of the ground populations are in rows, this aggregation is called ‘vertical’ and yields a symbolic description of all the classes for each population and its associated set of variables. Then, a horizontal concatenation of the symbolic variable descriptions associated with each ground population yields the final symbolic data table.
Example of industrial application. We illustrate this process by an industrial application for the study of the degradation problems occurring on nuclear power plant cooling towers (see Refs 4, 5 and 66). In order to simplify, we consider here a description of each tower by only two standard data tables of different ground populations, themselves described by different variables (see Figure 9). The population of the first standard data table is a set of cracks described by their length, thickness, orientation, and so on, for each tower. The second population is a set of corrosion positions described by corrosion variables. In order to compare the towers’ degradation, several aggregation and concatenation (vertical and horizontal) processes are applied. They consist first of an aggregation process applied to the two data tables associated with each tower. This aggregation leads to a symbolic description of each tower for each population of cracks and corrosion. Then, a vertical concatenation followed by a horizontal concatenation of the symbolic descriptions of each tower for each population yields the final symbolic data table. A graphical illustration of the kind of symbolic data table obtained after the fusion process is given in Figure 9. The industrial results were a statistical evaluation of the towers’ degradation highlighting atypical/abnormal values for each measurement by the symbolic data table representation and a graphical overview of the towers by a PCA extended to symbolic data (see SYR Software section), the clusters, the network, and the correlation between the symbolic variables of cracks and corrosion (which had no meaning at the ground level), with the factorial axes, the ranking of the towers expressing their degradation degree yielding a reduction of the number of sensors.

SOME METHODS FOR THE ANALYSIS OF SYMBOLIC DATA

Extended Standard Statistics Methods to Symbolic Data

Descriptive Statistics
Descriptive statistics of symbolic data (such as mean, variance, and covariance) are studied in Refs 11 and 67–69. In these studies, univariate (mean, variance, standard deviation, etc.) and bivariate statistics (covariance and correlation) are extended to interval- and histogram-valued variables. Covariance extensions can also be found in Refs 11 and 68–72.

Principal Component Analysis
In the case of interval-valued variables, each symbolic object can be considered as a hyperrectangle. In SDA, the aim is to reduce the number of symbolic variables
FIGURE 9 | Building a symbolic data table from several ground populations described by different sets of variables and a unique class variable.
by obtaining new hyperrectangles in a reduced space of symbolic variables. The basic idea is to consider the means of the intervals or the vertices of this hyperrectangle as new individuals on which a standard PCA can be applied. As usual in PCA, the natural aim is to reduce the number of variables. In this way, different strategies have been tried and developed in Refs 73–75. Le-Rademacher (Refs 76 and 77) uses the entire interval rather than just the vertices. Kosmelj et al. (Ref 78) expand on this. Le-Rademacher and Billard (Ref 79) extend their approach to histogram-valued data.
In the case of bar chart-valued variables, several approaches have been developed, e.g., by extending standard covariance matrices to symbolic data in Refs 70, 72, and 80–83. Le-Rademacher (Ref 76) obtains polytopes as output instead of hyperrectangles.
The case of mixtures of several kinds of symbolic variables (intervals, distributions, etc.) has been considered in two different ways: based on percentiles as in Refs 81 and 84, where the nonordered categories are ordered by their frequencies, or based on ‘metabins’ shortest pathways in Ref 45.

Regression, Canonical Analysis, and Forecasting
Standard regression has been extended in different ways to symbolic data. In the case of interval-valued variables, see Refs 85–89. In the case of histogram-valued variables, see Refs 71 and 90–95. Symbolic regression with different kinds of constraints has been developed in Refs 11, 96, and 97. Ref 46 is copula-based. Canonical analysis has been extended to symbolic data in different ways by Lauro et al. (Ref 98) and Tenenhaus et al. (unpublished data). Multidimensional scaling has been extended to the case of interval-valued variables by Groenen et al. (Ref 99) and Terada and Yadohisa (Ref 100).
Forecasting in the case of interval series has been developed in Refs 58, 60, and 101–103. Histogram time series have been studied in Refs 58 and 104.

Supervised Classification Extended to Symbolic Data
Decision trees have been studied in the case of different kinds of symbolic data in Refs 29 and 105–108.
An extension of association rule tools to symbolic data can be found in Ref 109, where, e.g., the units are the customers considered as classes of transactions, instead of being, as usual, the transactions.
The symbolic histogram-valued variables obtained from numerical ground variables discriminate and identify these classes more or less well, depending on the discretization process (used on the variables of the initial standard data table). This process can be optimized in order to maximize this discrimination (see Ref 32).
Discriminant analysis has been developed by Silva and Brito (Ref 110), Appice et al. (Ref 111), and Duarte Silva and Brito (Ref 112). Factorial discriminant analysis is performed by Lauro et al. (Ref 98) and Cazes (Ref 81).

Unsupervised Classification Extended to Symbolic Data

Dissimilarities Between Symbolic Data
This is an important question in SDA where much has been done (see, e.g., Chapter 8 in Ref 10, Chapter 7 in Ref 11, and Chapter 8 devoted to dissimilarities and matching in Ref 12). Families of such dissimilarities have been defined by Gowda and Diday (Ref 113), Ichino and Yaguchi (Ref 114), and De Carvalho (Ref 115). A widely used dissimilarity in the case of interval-valued variables is the Hausdorff-based dissimilarity (see Ref 116). The Wasserstein metric in the case of probability distribution-valued variables is becoming popular (see Refs 117–119). Several dissimilarity measures between histogram-valued variables have been proposed in Ref 120. For consensus measures between symbolic data, see Ref 121.

Clustering
Much work has been carried out in extending clustering to symbolic data. This comes from the fact that at the beginning we thought that ‘clustering’ would be the main way of obtaining classes considered as clusters (see, e.g., Refs 2, 38, 47, and 122 in the case of data streams).
Several advances based on ‘dynamical clustering’ for partitioning or on ‘pyramidal clustering’ for obtaining overlapping clusters on classical data can be extended to symbolic data. We recall some of them. ‘Dynamical clustering’ (Refs 123 and 124) is an extension of K-means where, instead of the means, we use other kinds of centers called ‘kernels’: seeds, distributions (Ref 125), curves (Refs 126 and 127), regressions (Ref 128), adaptive distances (Ref 129), typological principal components (Ref 130), canonical components (Ref 131), and so on. The link with SDA is that the obtained kernels can be considered to be the symbolic descriptions of the obtained clusters. These clusters constitute the classes considered as higher level units. Hence, we obtain a symbolic data table on which SDA can be applied.
Dynamical clustering has already been extended to symbolic interval-valued variables with Hausdorff
dissimilarity in Refs 116 and 132, and when the centers are adaptive distances in Ref 133; by using Wasserstein-based distance for histogram-valued variables, see Ref 134, or interval-valued variables, see Ref 118; by a probabilistic approach in Refs 15 and 16.
‘Pyramidal clustering’135 is an extension of hierarchies to overlapping clusters. In Refs 136 and 137, pyramidal clustering is extended to symbolic data considering that each level of the pyramid is associated with a ‘complete symbolic object’ (defined hereunder in the section on the Galois structure). Pyramidal and hierarchical clustering are graphically represented with individuals ordered on a straight line as support. They have been extended in order to have a network of two or more dimensions as support. This extension leads to a general theory of spatial pyramids in Ref 20, where its application to symbolic data and its link with Galois lattices are given. Pruning and graphical representations of spatial pyramids are developed in Refs 138 and 139.
By using dissimilarities between symbolic data (see Unsupervised Classification Extended to Symbolic Data section), clustering structures with different levels of clusters can be obtained by known algorithms of hierarchical or pyramidal clustering or top–down clustering. The quality of the resulting structure can be measured by the difference between the symbolic dissimilarity between the symbolic objects and the associated ultrametric (resp. Robinsonian and Yadidean) dissimilarity in the case of a hierarchical (resp. 2D pyramidal and 3D pyramidal) structure (see Ref 20).
Top–down hierarchical clustering for some kinds of symbolic variables (intervals and histograms) has been developed in Refs 140–144. In these approaches, the initial set of classes is recursively divided in two sets by splitting each symbolic variable. The best split is the one which maximizes the ‘quality’ of a class Ck of nk individuals measured by the following criterion ‘Q,’ where d is a dissimilarity between the symbolic objects associated with Ck such that:

$$Q(C_k) = \frac{1}{2 n_k} \sum_{\omega_i \in C_k} \sum_{\omega_j \in C_k} d^2(\omega_i, \omega_j).$$

In the case of functional-valued variables, dynamical clustering with orthonormal polynomials as centers has been used in Ref 124 (p. 523). Self-organizing maps have been extended to interval-valued variables in Refs 10 and 145.

The Galois Lattice Structure of Symbolic Objects
Extent and intent of symbolic objects are introduced in Ref 38, where ‘complete symbolic objects’ are defined and their link with Galois lattices is given (by using the Maximum and the Minimum operators for generalization). A ‘complete symbolic object’ is a symbolic description of a class considered as an ‘intent’ whose ‘extent’ has the same intent.
Example. Suppose a class of individuals is described by a symbolic description reduced to just an interval of age. This interval, denoted I, has an ‘extent’ (in a given ground population), defined by the set of individuals with age included in this interval. The ‘intent’ of this extent can be defined by the interval denoted I′ = [min, max] of the ages of the individuals of this extent. If I = I′, we say that this interval is a ‘complete symbolic object’ for this ground population.
In Refs 146 and 147, ‘concepts’ are defined by an extent C and an intent whose extent is C; they constitute the vertices of a Galois lattice. This result follows from works by Birkhoff37 and Barbut and Monjardet148 in a binary context.
Several works have been developed in this direction in Ref 39 on several kinds of symbolic objects; for reducing the lattices of symbolic objects, see Refs 149–151.
As recalled in more detail in the Theoretical Development section, the stochastic lattice case has been considered for distributional data in Refs 40 and 41, which show (in a probabilistic and Choquet capacities context) that when the size n of a sample of the ground population increases, the sequence of Galois lattices Gn built on this sample converges toward a lattice G. Brito and Polaillon152 define two Galois connections on a set of distributional data and the corresponding concept lattices. Recently, Brito and Polaillon153 proposed a novel approach, which determines intents by intervals, thereby producing more homogeneous concepts, which are easier to interpret.

Models of Models: Distribution-Valued Variables and Their Mixture Decomposition
In the case of a unique symbolic variable of distribution values, an orthonormal wavelet basis has been used in Ref 154. In the multidimensional case of several symbolic variables of distribution values, we use the wavelet method of Mallat28 to extract histogram-valued variables (Billard et al., unpublished data). A PCA on these histograms is then conducted using Diday’s45 approach. In the multidimensional case of several symbolic variables of distribution value mixed
with other kinds of symbolic variables, a percentile representation is used in Ref 84. Mixture decomposition of probability distributions has been studied with copulas in Refs 43, 44, and 155. In the case of a unique symbolic distribution-valued variable, a Dirichlet model has been used in Refs 52 and 156. For clustering based on a normal mixture model for aggregated symbolic data, see Ref 157.
Modeling probability density-valued variables by likelihood estimation has been developed in Ref 158. Modeling interval-valued variables is studied by Brito and Duarte Silva,144 and Diday159 gives formulas relating the density of a set of interval vectors and their associated parameters under the hypothesis of uniform distributions.

‘Metabins’: a Useful Tool in SDA for Taking Care of the Variability Inside Classes
A metabin is a vector of values that are associated with each symbolic variable in order to transform the symbolic data table into a numerical data table expressing the variability inside each of the classes, on which standard tools can be applied. In a PCA on interval variables (see Ref 73), the metabins are the 2^p vectors of numerical values defined by the vertices of the hyperrectangle I1i × … × Ipi, where Iji is the interval associated with the symbolic interval-valued variable j for the class i.
In Ref 45, the p symbolic variables are bar chart-valued and the metabins are the vectors of p category frequencies taken at the same position in each variable. In that way, we obtain a numerical data table denoted T where each column is associated with a numerical variable, itself associated with a symbolic variable, and each row is associated with a metabin and a class. This leads to the numerical data table T of p columns and k × m rows if there are k classes and if m is the largest number of categories for a variable.
In the case of nonordinal (i.e., nominal) categorical variables, the ranking of the bins associated with each of these variables is obtained by maximizing the correlations between the numerical variables associated with each column of the table T. The higher these correlations, the better the ranking. Ichino84 gives another approach based on a transformation of the bar charts into distributions, which is not possible in the case of nominal variables. Nevertheless, the Ichino method gives a solution in this case by ranking the bins by their frequency. Hence, the metabins approach gives an alternative solution to the bins ranking challenge in the nonordinal case, based on a maximization of the correlation between the numerical variables of the table T of metabins.

SDA SOFTWARE
Symbolic Objects Data Analysis
The SODAS software is issued from two European projects (from 1997 to 2003), involving 17 research laboratories, industrial companies, and National Statistical Institutions (NSI) of three countries. The results of these projects are edited in Refs 10 and 12. The SODAS basic principle is based on two steps. In the first step, a symbolic data file (called ‘.sds’) is created from a query to a relational data base. This query defines a standard data table which contains a categorical variable at its first position. The categories of this variable define classes of individuals which constitute the higher level units described by symbolic data by using the Data Base to Symbolic Objects (DB2SO) module as shown in Figure 10.
In the second step, several tools can be applied to the symbolic data obtained at the first step. The output of several of these tools is shown in Figure 11: Kohonen and principal component in the case of interval-valued variables; decision tree, ‘pyramidal clustering,’ and ‘zoom star’ on several kinds of symbolic data. The SODAS pyramidal overlapping clustering tool contains the case of hierarchical clustering. Zoom star allows the graphical representation of the symbolic variables associated with the rows of a circle.
The SODAS package is free, though registration is required and a code is needed for installation, see http://www.info.fundp.ac.be/asso/sodaslink.htm. Numerous reports of French students containing many data bases (from which symbolic data are built and their SODAS files provided) can be found at www.sodas.ceremade.dauphine.fr or www.sodas.lamsade.dauphine.fr. Notice that Chiun-How et al.160 have also developed a symbolic database for Trends in International Mathematics and Science Study (TIMSS), but not related to SODAS.

SYR Software
The SYR package is professional software for industrial applications. Its aim is to extract, from a data file (.txt and .csv) of several millions of units or from an Access data base of hundreds of thousands of units, a reduced number of units (i.e., classes), described by symbolic data which summarize the initial data in a file (called .syr), compatible with the SODAS .sds files by conversion. Then, from this symbolic file, several original tools can be applied.161 For
[Figure 10: relational data base; query to data base; DB2SO; observations described by numerical or categorical variables.]
[Figure 11: SODAS outputs. Panels: Kohonen map on symbolic interval-valued variables; principal components of interval-valued variables (Axis 1: 58.048%, Axis 2: 32.553%); decision tree with splits on age_range and pays_client; pyramidal clustering (‘Pyramide classifiante’) of Bungalow, Fast Food, Excursion, Hotel Room, Restaurant, and Activities in the US and in France.]
a text mining study in Ref 57. The NETSYR allows the visualization of unnoted classes by the pie chart associated with a given bar chart-valued variable and its bar chart view. The result of a clustering on the initial data or on selected PCA factorial axes can also be visualized as a network relating the closest concepts. Moreover, the method produces a correlation circle of the categories where the symbolic variables themselves can be represented. More information on the SYR software can be requested at afonso@syrokko.com.

RSDA: An R-Package for SDA
This package aims to implement in R certain techniques of SDA, such as clustering, as well as some linear models. These implementations will always be made following two principles: Classic Data Analysis should always be a particular case of SDA, and both the output and the input in an SDA should be symbolic of the same kind in order to express the data in the same language at input as at output. The latest version of the RSDA package is 1.2; the author is Oldemar Rodríguez with contributions from Olger Calderón and Roberto Zúñiga. Information can be obtained at oldemar.rodriguez@ucr.ac.cr. An example of output of the RSDA software in the case of a PCA on interval-valued variables is given in Figure 14.

R-Package MAINT.Data
Brito and Duarte Silva144 have proposed parametric models for interval data, which consider Multivariate Normal or Skew-Normal distributions for the MidPoints and Log-Ranges of the interval-valued variables. Several alternative configurations for the global covariance matrix are considered, allowing for taking into account the link that may exist between MidPoints and Ranges of the same or different interval-valued variables. Intermediate parameterizations between the nonrestricted and the noncorrelation setup considered for real-valued data may be relevant for the specific case of interval data. This modeling has been implemented in the R-package MAINT.Data,162 available on CRAN. MAINT.Data introduces a data class for representing interval data and includes methods for the display, management, and analysis of these data. In particular, maximum likelihood estimation and statistical tests for the different configurations are addressed. Methods for (M)ANOVA and Linear and Quadratic Discriminant Analysis of this data class are also provided.

R-Package: Histogram Data Analysis Using Wasserstein Distance (HistDAWass)
This package, from Ref 163, contains methods (see Refs 164 and 165) mainly based on the L2 Wasserstein metric between distributions (i.e., a Euclidean metric between quantile functions). It contains basic statistics of symbolic histogram-valued variables, clustering methods (both hierarchical and dynamic clustering), regression analysis, PCA of distributional variables, histogram time series forecasting using the
[Figure: plot with vertical axis ‘Axis 2 (12.45%)’, showing doc_clust70 clusters with their weights, labeled by theme: Politeness-satisfaction, Contact call, Siret_APE_NAF, Dates, Troubleshooting-intervention, Technical terms, Invoice reading, Schedule.]
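The HistDAWass description above characterizes the L2 Wasserstein metric as a Euclidean metric between quantile functions. The following is a minimal sketch of that idea in Python, not the HistDAWass API (which is an R package): the function names, the representation of a histogram as a pair (bin edges, bin weights), the assumption of a uniform density inside each bin with strictly positive weights, and the grid size are all illustrative assumptions.

```python
import numpy as np


def quantile_function(edges, weights, t):
    """Quantile function of a histogram, assuming a uniform density in each bin.

    edges:   increasing bin edges (len(weights) + 1 values)
    weights: strictly positive bin masses (normalized here)
    t:       probabilities in [0, 1]
    """
    edges = np.asarray(edges, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Cumulative distribution evaluated at the bin edges.
    cdf = np.concatenate(([0.0], np.cumsum(weights) / weights.sum()))
    # With uniform bins the quantile function is piecewise linear in t.
    return np.interp(t, cdf, edges)


def wasserstein_l2(hist_a, hist_b, n_grid=10_000):
    """Approximate L2 Wasserstein distance between two histograms.

    The integral over (0, 1) of the squared difference between the two
    quantile functions is approximated by a midpoint rule on n_grid points.
    """
    t = (np.arange(n_grid) + 0.5) / n_grid
    qa = quantile_function(hist_a[0], hist_a[1], t)
    qb = quantile_function(hist_b[0], hist_b[1], t)
    return float(np.sqrt(np.mean((qa - qb) ** 2)))


# Hypothetical histograms: h2 is h1 shifted by 2 units.
h1 = ([0.0, 1.0, 2.0], [0.5, 0.5])
h2 = ([2.0, 3.0, 4.0], [0.5, 0.5])
print(wasserstein_l2(h1, h2))
```

Because the second histogram is the first one shifted by 2, the two quantile functions differ by 2 everywhere and the printed distance should be close to 2. A dissimilarity of this kind could then serve as the d used by a clustering method on histogram-valued variables.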
to be extended to a symbolic data table with more symbolic variables. In the PCA built on the metabins45 in both cases of interval or bar chart-valued variables, it would be interesting to use the joint probability density of the ground level population by associating with each metabin a weight proportional to the number of individuals whose dissimilarity to this metabin is lower than a given threshold.
Symbolic data tables where classes are described by horizontal and vertical bar chart variables can be obtained directly from the ground data table in the case of ground categorical variables. In the case of ground numerical variables, a simultaneous discretization (both kinds of bar chart variables) optimizing a quality criterion S or W (see How to Measure the Quality of Bar Chart Symbolic Variables, Classes, and Their Associated Symbolic Data Tables section) is needed. Table 1 is a simple example of such a symbolic data table. Then, we can analyze these new kinds of symbolic data tables by any SDA method with new kinds of interpretation due to the vertical bar charts.

Links with Specific Domains of Data Science
SDA can also enhance domains of research like FDA, mixture decomposition, Bayesian approaches, multilevel statistics, uncertainty and fuzzy sets, granular computing, and rough sets.

Compositional Data
‘Compositional data’ appear in a specific case of symbolic data. The relative frequencies of bar charts are ‘compositional parts’ as their sum is equal to 1 for each bar chart. Aitchison19 described the difficulties (such as negative bias and spurious correlation) often encountered with such variables and the need for normalization. Moreover, according to recent developments,167 compositional data are not necessarily defined with a constant sum constraint, but rather, more generally, by the fact that their parts contain quantitatively expressed relative contributions to a whole. From this perspective, the unit sum of parts is just one possible representation of data, where the only relevant information is contained in ratios between parts. In Ref 72, for unit sum normalization of compositional data, the angular transformation of Fisher168 is used; but there are many other possibilities which have to be considered in depth, as in the recent paper by Wang et al.,169 where the Aitchison geometry and centered log ratio coordinates are employed.

Functional Data Analysis
Functional data (i.e., where variable values are functions such as curves or signals) can be seen as a case of complex data by considering each point i of the function f as an individual taking the value f(i).
Example. Ten sensors installed on a bridge produce a signal function when 100 different kinds of trains pass over this bridge before and after improving the state of the bridge. Hence, each train is associated with ten functions fj considered as ten ground variables Yj describing, e.g., nij = 10,000 points i of the function fj by their values f(i). Therefore, in this way, we obtain a complex data table where the variables are not paired as nij varies between variables. Notice that this is a case of ‘variability inside entities’ defined in the introduction section. It is then possible to transform these ground variables, e.g., into symbolic histogram-valued variables, by using wavelets (see the wavelet tour in Ref 28).
Hence, such data can be transformed by a fusion process into a unique symbolic data table where the symbolic variables are paired. This process is based on a vertical concatenation of aggregations of the values of the functions inside each class and a horizontal concatenation of the obtained symbolic variables. Several symbolic variables can be induced by the same functional-valued variable. For example: the min–max interval, the interquartile interval, the bar charts, the histograms, the distributions of the function values, including the one induced by a wavelet approximation or other time series approach, see Ref 28 and Billard et al. (unpublished data). Here, also, the SDA challenge is to find the aggregation which maximizes the discrimination between the classes and also maximizes the correlation between the symbolic variables.

Mixture Decomposition
Mixture decomposition aims to find the underlying distributions of a given population described by a set of standard variables. In the SDA context, the question is how to extend mixture decomposition methods and tools to symbolic data. In the Theoretical Development section, we recall some early studies in the case of classes described by a unique symbolic variable of distribution values where Dirichlet48,49 and copula models,43, have been used. The mixture decomposition of several symbolic variables of different kinds defined in Tables 2 and 3 remains to be considered. Moreover, in the current methodology based on an EM kind of approach,170 the distributions of the obtained mixture decomposition are not the distributions induced by the given clusters. In SDA, as we wish to obtain a symbolic data table where classes (i.e., ‘clusters’) are described by their distribution, the ‘dynamical clustering’ methodology
The ‘standard accuracy’177 of a standard rough set can be extended to the ‘symbolic accuracy’ denoted A of a symbolic rough set XR by setting: A(XR) = |X0 |/|X0 |. Hence, in that way, we can measure the ‘symbolic accuracy’ of any class X = Ci of the set C on which the symbolic variable Y0 has been defined.

Uncertainty
Uncertainty is not variability, as uncertainty describes individual facts or events by subjective values expressing, e.g., a ‘possibility’ or a ‘belief’ which follow their own axioms. At the level of classes, aggregating such values associated with the same categories leads to different aggregation axioms depending on the kind of uncertainty. Several theories have been developed, such as subjective probability, possibility, and belief theory (see a synthesis in Ref 3). Variability inside classes of individuals is expressed by objective values such as frequencies, which follow the Kolmogorov axioms of probability. If there are several facts or events, then a variability can appear among their associated uncertainty values. This variability can be expressed by symbolic data and then analyzed by SDA tools.

Fuzzy Sets
Fuzzy sets express a kind of uncertainty by a kind of subjective membership function. For example, with such a subjective function, we can say that ‘this man is tall’ with a fuzzy value equal to 0.8. If each unit of the ground population is described by one or several fuzzy sets, then classes of such units can be described by intervals, bar charts, or histograms expressing the variability of the fuzzy values inside each class (see an example given in Ref 12, section 1.4.2, p. 14). Otherwise, a fuzzy coding study of symbolic data can be found in Ref 178.

Clusterfier
A ‘clusterfier’ aims to find all the clusters of a population by knowing only some of them. More precisely, a clusterfier is a function which produces, from classes (i.e., clusters) known on a part of a population, new clusters on the remaining part of the same population. Notice that a ‘classifier’ is different, as it is a function which, from classes known on a part of a population, associates a class to each individual of the remaining population. Clusterfiers have been studied in Ref 179 by using a general Lance and Williams180 formula. In the SDA framework, several symbolic clustering tools (partitioning, hierarchical, or pyramidal, see Unsupervised Classification Extended to Symbolic Data section) can be used in order to build new clusters taking care of some known clusters. A simple idea could be simply to aggregate the known clusters. Hence, these aggregations produce new units described by symbolic data. It is then possible to add these units to the initial ground population and then to apply the symbolic clustering tools to the new population. Many other ways are also possible, e.g., by looking for the best discriminating space (e.g., by adaptive distances140) or factorial discriminant analysis (e.g., in Ref 98) of the known clusters and then applying clustering tools involving the whole population in this space. Another way can be to select the symbolic variables which maximize the S or W criterion (see How to Measure the Quality of Bar Chart Symbolic Variables, Classes, and Their Associated Symbolic Data Tables section), in order to find the variables with the highest explanatory and discriminatory power of the given classes. This yields a symbolic data table where the units are the classes and the remaining individuals, and the variables are the selected ones. Then, a clustering tool can be applied to this symbolic data table.

CONCLUSION AND PERSPECTIVES
We have presented a new way of thinking in data science, where currently only the first level of individuals is mainly considered, ignoring the complementary analysis of a higher level of classes considered as new units to be described and studied by themselves. Often we have the question: ‘Does SDA give better results than standard Data Analysis?’ This question can be illustrated by the following example: does studying players give better results than studying teams of players? This question makes no sense, as the considered units are not the same! The only answer to this question is that standard approaches are the best for studying individuals (such as players described by standard data) and SDA is the best for studying classes (such as teams of players). Moreover, we cannot say that one approach is better than the other; we can just say that both approaches are complementary and that SDA methods can enhance standard results by giving class point-of-view results induced from the standard data. We can also say that the SDA tools are more general than the standard ones, as an individual can be considered as a class reduced to a single unit.
Standard, complex, and big data are given, whereas symbolic data are built from these kinds of data. Therefore, before the proper analysis of the built symbolic data, there is a wide domain of research for obtaining ‘good’ symbolic data and measuring their quality (validity, robustness, etc.).
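As a rough illustration of how symbolic data are built from standard data, the sketch below aggregates a ground table of individuals (players) into a table of classes (teams), describing each class by an interval-valued variable and a bar chart-valued variable. The player records, the variable names, and the function build_symbolic_table are hypothetical and chosen only for this example; min–max intervals and relative frequencies are just one possible aggregation among those discussed in this paper.

```python
from collections import Counter, defaultdict

# Hypothetical ground data: one record per individual (player).
players = [
    {"team": "Paris", "age": 24, "position": "forward"},
    {"team": "Paris", "age": 31, "position": "defender"},
    {"team": "Paris", "age": 27, "position": "forward"},
    {"team": "Lyon", "age": 22, "position": "midfielder"},
    {"team": "Lyon", "age": 29, "position": "defender"},
]


def build_symbolic_table(rows, class_key, numeric_key, categorical_key):
    """Aggregate individuals into classes described by symbolic data.

    Each class gets a min-max interval for the numerical variable and a
    bar chart (relative frequencies) for the categorical variable.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[class_key]].append(row)

    table = {}
    for name, members in groups.items():
        values = [m[numeric_key] for m in members]
        counts = Counter(m[categorical_key] for m in members)
        size = len(members)
        table[name] = {
            # Interval-valued variable: variability of the numerical values.
            numeric_key: (min(values), max(values)),
            # Bar chart-valued variable: frequencies summing to 1.
            categorical_key: {cat: n / size for cat, n in counts.items()},
            "size": size,
        }
    return table


symbolic = build_symbolic_table(players, "team", "age", "position")
for team, description in symbolic.items():
    print(team, description)
```

Each row of the resulting table is a class considered as a new unit, and the variability of the individuals inside the class is kept in the interval and in the bar chart rather than being reduced to a single mean or mode.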
Much remains to be done in the following directions: in statistics and data mining, going more deeply inside the classical methods extended to symbolic data and extending other methods not yet considered. In computer science, extending the SQL query algebra language to symbolic data bases (see a start in Ref 181), extending EXCEL to symbolic data (see a start with TABSYR inside the SYR Software; SYR Software section), building summarizations on big data by symbolic data (as said by Minami and Mizuta182). Also, in the case of big ground data bases, parallelizing the current tools for building and analyzing big sets of symbolic data.
We have seen that, starting with complex data defined by several unstructured data tables with unpaired variables, we can obtain a data table with paired symbolic variables. Therefore, SDA is more a solution to the problem of complex data than a problem of complex data.
By enlarging the current framework of Data Science to higher level populations of classes, we have seen that SDA can enhance large domains of applications and research in computational statistics.
Thinking by classes of individuals is our natural way of thinking. This happens, e.g., when we say: ‘I like my dog Tomi, I prefer dogs to cats,’ as in the same sentence ‘Tomi’ is at the individuals level, while ‘dogs’ and ‘cats’ are at the classes level. Hence, in some sense, thinking by classes in Data Science brings our way of thinking in Data Science closer to our natural way of thinking.
What can the SDA framework change? SDA can change our way of teaching, researching, and applying:
Teaching: by considering standard teaching on individuals as a case of teaching on classes described by symbolic data. Researching: by asking ‘how to extend my current results to the case where, instead of standard statistical units described by standard variables, I have classes described by symbolic data?’ Applying: as SDA can enhance our current results by complementary ones on classes, enlightening our current study by changing our current units (i.e., ‘individuals’) into higher level units (i.e., ‘classes’) described by symbolic data.
In 1984, Schweizer183 said that ‘distributions are the numbers of the future.’ In the SDA context, this means that the classes of individuals from which these distributions are obtained are the ‘units of the future’ and moreover that the symbolic data which describe these classes are the numbers of the future.

Epilogue: It is my hope this fascinating domain will inspire many teachers and students in the numerous directions suggested in this paper.
ACKNOWLEDGMENTS
The author gratefully acknowledges the reviewers for their helpful remarks and suggestions.
FURTHER READINGS
Diday E. L’Analyse des données symboliques, un cadre théorique et des outils pour le data mining. In: Diday E,
Kodratoff Y, Brito P, Moulet M, eds. Induction Symbolique Numérique à Partir de Données. Toulouse: CEPADUES; 2000.
Diday E. From Schweizer to Dempster: mixture decomposition of distributions by copulas in the symbolic data analysis
framework. In: IPMU 2002, Annecy, France, July, 2002.
Emilion R. Classification of wind speed distributions. Renew Energy 2011, 36:3091–3097.
Nakano J. Regression analysis for aggregated symbolic data. In: Arroyo J, Maté C, Brito P, Noirhomme M, eds. 3rd Workshop in Symbolic Data Analysis. Universidad Complutense de Madrid; 2012. Available at: http://www.sda-workshop.org/.
Saporta G, Niang N. Resampling ROC curves. In: IASC Meeting on Statistics for Data Mining, Learning and Knowledge
Extraction (IASC07), 30 August–September, 2007.
REFERENCES
1. Diday E. Introduction à l’approche symbolique en analyse des données. Premières journées Symbolique-Numérique. Workshop. CEREMADE Laboratory, 1987, Université Paris-Dauphine, France, 21, 56.
2. Diday E. The symbolic approach in clustering and related methods of data analysis: the basic choices. In: Bock HH, ed. Proceedings of IFCS’87 on Classification and Related Methods of Data Analysis, Amsterdam, North Holland, 1988, 673–684.
3. Diday E. Probabilist, possibilist and belief objects for knowledge analysis. Ann Oper Res 1995, 55:227–276.
4. Afonso F, Diday E, Badez N, Genest Y. Symbolic data analysis of complex data: application to nuclear power plant. In: COMPSTAT’2010, Paris, 2010.
5. Afonso F, Diday E, Badez N, Genest Y. Use of symbolic data analysis for structural health monitoring applications. In: Second International Symposium on Life-Cycle Civil Engineering, IALCCE’2010, October 27–30, 2010. Taipei, Taiwan.
6. Laaksonen S. Chapter 22: people’s life values and trust components in Europe: symbolic data analysis for 20–22 countries. In: Diday E, Noirhomme-Fraiture M, eds. Symbolic Data Analysis and the SODAS Software. Chichester: Wiley & Sons; 2008, 405–419.
7. Laaksonen S. The survey as a basis for symbolic data analysis. In: Carlson M, Nyquist H, Villani M, eds. Official Statistics, Methodology and Applications in Honour of Daniel Thorburn. Stockholm, Sweden: Stockholm University; 2010, 15–28. Available at: officialstatistics.wordpress.com.
8. Afonso F, Laaksonen S. Analyzing European Social Survey data using symbolic data methods and Syrokko software. In: RNTI Special Issue « en l’honneur des travaux de Monique Noirhomme-Fraiture: Analyse de données et Visualisation ». RNTI 2015, 89–100.
9. Korenjak-Cerne S, Kejžar N, Batagelj V. A weighted clustering of population pyramids for the world’s countries, 1996, 2001, 2006. Pop Stud J Demogr 2015, 69:105–120.
10. Bock HH, Diday E. Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data. Heidelberg: Springer-Verlag; 2000, 425. ISBN: 3-540-66619-2.
11. Billard L, Diday E. Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley Series in Computational Statistics. Chichester: Wiley; 2006, 321. ISBN: 0-470-09016-2.
12. Diday E, Noirhomme-Fraiture M. Symbolic Data Analysis and the SODAS software. Chichester: Wiley; 2008. ISBN: 978-0-470-01883-5.
13. Billard L, Douzal-Chouakria A, Diday E. Symbolic principal components for interval-valued observations. Stat Anal Data Mining 2011, 4:229–246.
14. Guan R, Lechevallier Y, Saporta G, Wang H. Advances in Theory and Applications of High Dimensional and Symbolic Data Analysis, vol. E25. Hermann, MO: RNTI; 2013.
15. Brito P, Duarte Silva AP, Dias JG. Probabilistic clustering of interval data. Intell Data Anal 2015, 19:293–313.
16. Brito P, Noirhomme-Fraiture M, Arroyo J. Special issue on symbolic data analysis. Editorial. Adv Data Anal Classif 2015, 9:1–4.
17. Su S-F, Pedrycz W, Hong T-P, De Carvalho AT. Special issue on granular/symbolic data processing. IEEE Trans Cybern 2016, 344–401.
18. Kuhn T. The structure of scientific revolutions. Chicago: University of Chicago Press; 1962.
19. Aitchison J. The Statistical Analysis of Compositional Data. London: Chapman and Hall; 1986.
20. Diday E. Spatial classification. Discrete Appl Math 2008, 156:1271–1294.
21. Diday E. Des objets de l’Analyse des données à ceux de l’Analyse des connaissances. In: Kodratoff Y, Diday E, eds. Induction Symbolic Numerique. Toulouse: CEPADUES; 1991.
22. Brito P, Bertrand P, Cucumel G, de Carvalho F, eds. On the analysis of symbolic data. In: Selected Contributions in Data Analysis and Classification. Berlin: Springer; 2007, 13–22.
23. Billard L, Diday E. From the statistics of data to the statistic of knowledge: symbolic data analysis. J Am Stat Assoc 2003, 98:470–487.
24. Billard L. Special issue on SDA. ASA Data Sci J 2011, 4:147–246.
25. Billard L. Brief overview of symbolic data and analytic issues. Stat Anal Data Mining 2011, 4:149–156.
26. Noirhomme-Fraiture M, Brito P. Far beyond the classical data models: symbolic data analysis. Stat Anal Data Mining 2012, 4:157–170.
27. Brito P. Symbolic data analysis: another look at the interaction of data mining and statistics. Wiley Interdiscip Rev Data Mining Knowl Discov 2014, 4:281–295. doi:10.1002/widm.1133.
28. Mallat S. A Wavelet Tour of Signal Processing. San Diego, CA: Academic Press; 1998.
29. Seck D. Arbres de décision symboliques, outils de validation et d’aide à l’interprétation. PhD (thèse de doctorat), Paris-Dauphine University, France, 2012.
30. Guinot C, Malvy D, Schemann J-F, Afonso F, Haddad R, Diday E. Strategies evaluation in environmental conditions by symbolic data analysis: application in medicine and epidemiology to trachoma. Adv Data Anal Classif 2015, 9:107–119.
31. Leskovec J, Rajaraman A, Ullman JD. Chapter 1: data mining. In: Mining of Massive Datasets. England: Cambridge University Press; 2011, 1–17.
32. Diday E, Afonso F, Haddad R. The symbolic data analysis paradigm, discriminant discretization and financial application. In: HDSDA 2013 Conference, Beijing, China. RNTI-E-25. Paris: Hermann; 2013, 1–14.
33. Diday E. Pouvoir explicatif et discriminant de variables à valeurs diagrammes en bâtons et de tableaux de
données symboliques associés. Revue Modulad n 45, RNTI; In press.
34. Horn S, Pesce AJ, Copeland BE. A robust approach to reference interval estimation and evaluation. Clin Chem 1998, 44:622–631.
35. Royall RM. Model robust confidence intervals using maximum likelihood estimators. Int Stat Rev 1986, 54:221–226.
36. Lebart L, Morineau A, Warwick KM. Multivariate Descriptive Statistical Analysis. New York: Wiley; 1984.
37. Birkhoff G. Lattice Theory, vol. 25. 3rd ed. Providence, RI: AMS Colloquium Publications; 1967. Reprinted 1984.
38. Diday E. Introduction à l’analyse des données symboliques. Oper Res Rev 1989, 23:193–236. Also in Rapport de Recherche No. 1074, INRIA, Rocquencourt.
39. Brito P. Order structure of symbolic assertion objects. IEEE Trans Knowl Data Eng 1994, 6:5.
40. Diday E, Emilion R. Treillis de Galois maximaux et capacités de Choquet. CR Acad Sci Paris 1997, 325:261–266.
41. Diday E, Emilion R. Maximal and stochastic Galois lattices. Discrete Appl Math 2003, 27:271–284.
52. Emilion R. Unsupervised classification of objects described by nonparametric distributions. Stat Anal Data Mining 2012, 388–398.
53. Bezerra B, Carvalho F. Symbolic data analysis tools for recommendation systems. Knowl Inf Syst 2011, 26:385–418. doi:10.1007/s10115-009-0282-3.
54. Quantin C, Billard L, Touati M, Andreu N, Cottin Y, Zeller M, Afonso F, Battaglia G, Seck D, Le Teuff G, et al. Classification and regression trees on aggregate data modeling: an application in acute myocardial infarction. J Prob Stat 2011, 2011:19.
55. Mizuta M. Study on radiation therapy with distribution valued data. In: Arroyo J, Maté C, Brito P, Noirhomme M, eds. 3rd Workshop in Symbolic Data Analysis. Spain: Universidad Complutense de Madrid; 2012.
56. Fablet C, Diday E, Bougeard S, Toque C, Billard L. Classification of hierarchical-structured data with symbolic analysis: application to veterinary epidemiology. In: COMPSTAT’2010, Paris, 2010.
57. Haddad R, Afonso F, Diday E. Approche symbolique pour l’extraction de thématiques: Application à un corpus issu d’appels téléphoniques. In: actes des XVIIIèmes Rencontres de la Société francophone de Classification. Université d’Orléans, France; 2011.
42. Nelsen RB. An Introduction to Copulas. New-York: 58. García-Ascanio C, Maté C. Electric power demand
Springer Verlag; 1999. forecasting using interval time series: a comparison
between VAR and iMLP. Energy Policy 2010,
43. Diday E, Vrac M. Mixture decomposition of distribu-
38:715–725.
tions by Copulas in the symbolic data analysis frame-
work. Discrete Appl Math 2005, 147:27–41. 59. Emilion R. Classification of daily solar radiation distri-
butions using a mixture of Dirichlet distributions.
44. Vrac M, Billard L, Diday E, Chédin A. Copulas analy-
Solar Energy 2009, 83:1056–1063.
sis of mixture model. Comput Stat 2012, 27:427–457.
60. Han A, Hong Y, Lai KK, Wang S. Interval time series
45. Diday E. Principal component analysis for bar charts
analysis with an application to the sterling-dollar
and Metabins tables. Stat Anal Data Mining 2013,
exchange rate. J Syst Sci Complex 2008, 21:550–565.
6:403–430. doi:10.1002/sam.11188.
61. He LT, Hu C. Impacts of interval computing on stock
46. Neto EA, Anjos UU. Regression model for interval- market variability forecasting. Comput Econ 2009,
valued variables based on copulas. J Appl Stat 2015, 33:263–276.
42:2010–2029.
62. Long W, Mok HMK, Hu Y, Wang H. The style and
47. Diday E, Murthy N. Symbolic data clustering. In: innate structure of the stock markets in China, Pacific-
Wang J, ed. Encyclopedia of Data Warehousing and Basin. Finance J 2009, 17:224–242.
Mining. Hershey, NY: Information Science Reference;
2005, 1087–1091. 63. Terraza V, Toque C. Mutual Fund Rating: A Symbolic
Data Approach. In: Terraza V, Razafitombo H, eds.
48. Emilion R. Classification et mélanges de processus. Understanding Investment Funds Insights from Perfor-
CR Acad Sci Paris 2002, 335:189–193. mance and Risk Analysis. Economics & Finance Col-
49. Soule A, Salamatian K, Taft N, Emilion R, lection. London, UK: The Palgrave Macmillan; 2013.
Papagiannaki K. Flow classification by histograms. In: 64. Bouteiller V, Toque C, A, Cherrier J-F, Diday E,
Proceedings of Sigmetrics’04, New York, 2004. Cremona C. Non-destructive electrochemical charac-
50. Soubdhan T, Emilion R, Calif R. Classification of daily terizations of reinforced concrete corrosion: basic and
solar radiation distributions using a mixture of Dirich- symbolic data analysis. Corros Rev 2011, 30:47–62.
let distributions. Solar Energy 2009, 83:1056–1063. doi:10.1515/corrrev-2011-002.
51. Calif R, Emilion R, Soubdhan T. Classification of wind 65. Cury A, Crémona C, Diday E. Application of symbolic
speed distributions using a mixture of Dirichlet distri- data analysis for structural modification assessment.
butions. Renewable Energy 2011, 36:3091–3097. Eng Struct J 2010, 32:762–775.
66. Courtois A, Genest G, Afonso F, Diday E, Orcesi A. In 80. Murillo JD, Rodrıguez O, Diday E, Winberg S. Gener-
service inspection of reinforced concrete cooling alization of the principal components analysis to histo-
towers—EDF’s feedback. In: IALCCE 2012, Vienna, gram data. In: 4th European Conference on Principles
Austria, 2012. and Practice of Knowledge Discovery in Data Bases,
67. Bertrand P, Goupil F. Descriptive statistics for sym- Lyon, France, 12–16 September, 2000.
bolic data. In: Bock H-H, Diday E, eds. Analysis of 81. Cazes P. Analyse factorielle d’un tableau de lois de
Symbolic Data: Exploratory Methods for Extracting probabilité. Rev Stat Appl 2002, 50:5–24.
Statistical Information from Complex Data. Berlin: 82. Wang H, Chen M, Li N, Wang L. Principal Compo-
Springer-Verlag; 2000, 103–124. nent Analysis of Modal Interval-Valued Data with
68. Billard L. Dependencies and variation components of Constant Numerical Characteristics. The Hague, The
symbolic interval-valued data. In: Brito P, Bertrand P, Netherlands: International Statistical Institute; 2012.
Cucumel G, de Carvalho F, eds. Selected Contributions 83. Shimizu N, Nakano J. Histograms principal compo-
in Data Analysis and Classification. Berlin: Springer; nent analysis. In: Arroyo J, Maté C, Brito P,
2007, 3–12. Noihomme M, eds, 3rd Workshop in Symbolic Data
69. Billard L. Sample covariance functions for complex Analysis. Spain: Universidad Compiutense de
quantitative data. In: Mituza M, Nakano J, eds. Pro- Madrid; 2012.
ceedings, World Conferences International Association 84. Ichino M. The quantile method for symbolic principal
of Statistical Computing 2008. Tokyo: Yoko- component analysis. Stat Anal Data Mining 2011,
hama; 2008. 4:184–198.
70. Nagabhushan P, Kumar P. Histogram PCA. Adv Neu- 85. Billard L, Diday E. Regression analysis for interval-
ral Netw 2007, 4492:1012–1021. valued data. In: Data Analysis, Classification, and
71. Verde R, Irpino A. Ordinary least squares for histo- Related Methods, Proceedings of the Seventh Confer-
gram data based on wasserstein distance. In: ence of the International Federation of Classification.
Lechevallier Y, Saporta G, eds, Procedings of COMP- Societies (IFCS00). Namur, Belgium: Springer; 2000,
STAT’2010. Heidelberg: Physica-Verlag; 2010, 369–374.
581–589. 86. De Carvalho FAT, Lima Neto EA, Tenorio CP. A new
72. Makosso-Kallyth S, Diday E. Adaptation of interval method to fit a linear regression model for interval-
PCA to symbolic histogram variables. Advances in valued data. In: KI2004 Advances in Artificial Intelli-
Data Analysis and Classification. Adv Data Anal Clas- gence. Lecture Notes in Computer Science. Berlin/
sif 2012, 6:147–159. Heidelberg: Springer-Verlag; 2004, 295–306.
73. Douzal-Chouakria A, Billard L, Diday E. Principal 87. Wang H, Guan R, Wu J. Linear regression of interval-
component analysis for interval-valued observations. valued data based on complete information in hyper-
Stat Anal Data Mining 2011, 4:229–246. doi:10.1002/ cubes. J Syst Sci Syst Eng 2012, 21:422–442.
sam.10118. 88. Xu W. Symbolic data analysis: interval-valued data
74. Cazes P, Chouakria A, Diday E, Schektman Y. Exten- regression. PhD Dissertation, University of Geor-
sion de l’analyse en composantes principales à des don- gia, 2010.
nées de type intervalle. Rev Stat Appl 1997, 89. Giordani P. Lasso-constrained regression analysis for
XLV:5–24. interval-valued data. Adv Data Anal Classif
75. Wang H, Guan R, Wu J. CIPCA: complete-informa- 2015, 9:5–19.
tion-based principal component analysis for interval- 90. Irpino A, Romano E. Optimal histogram representa-
valued data. Neurocomputing 2012, 86:158–169. tion of large data sets: Fisher vs piecewise linear
76. Le-Rademacher J, Billard L. Principal component anal- approximation. Revue des Nouvelles Technologies de
ysis for interval data. Wiley Interdiscip Rev Comput l’Information (RNTI) 2007, E-9:99–110.
Stat 2012, 4:535–540. 91. Souza RMCR, Queiroz DCF, Cysneiros FJA. Logistic
77. Le-Rademacher J, Billard L. Principal component his- regression-based pattern classifiers for symbolic inter-
tograms from interval-valued observations. Comput val data. Pattern Anal Appl 2011, 14:273–282.
Stat 2013, 28:2117–2138. 92. Dias S, Brito P. Linear regression model with
78. Kosmelj K, Le-Rademacher J, Billard L. Symbolic histogram-valued variables. Stat Anal Data Mining
covariance matrix for interval-valued variables and its 2011, 8:75–113. doi:10.1002/sam.11260.
application to principal component analysis: a case 93. Utkin LV, Coolen FPA. Interval-valued regression and
study. Metodoloski Zvezki No. 11, 2014, 1–20. classification models in the framework of machine
79. Le-Rademacher J, Billard L. Principal component anal- learning. In: 7th International Symposium on Impre-
ysis for histogram-valued data. Adv Data Anal Clas- cise Probability: Theories and Applications, Innsbruck,
sif ) 2016; 1–25. doi:10.1007/s11634-016-0255-9. Austria, 2011.
94. Sinova B, Colubi A, Gil MA, González-Rodríguez G. 109. Afonso F, Diday E. Extension de l’algorithme Apriori
Interval arithmetic-based simple linear regression et des règles d’association aux cas des donnees sym-
between interval data: discussion and sensitivity analy- boliques diagrammes et intervalles. In: Revue RNTI,
sis on the choice of the metric. Inform Sci 2012, Extraction et Gestion des Connaissances (EGC
199:109–124. 2005), vol 1. Toulouse: Editions Cépaduès; 2005,
95. Cerny M, Antoch J, Hladik M. On the possibilistic 205–210.
approach to linear regression models involving uncer- 110. Silva APD, Brito P. Linear discriminant analysis for
tain, indeterminate or interval data. Inform Sci 2013, interval data. Comput Stat 2006, 21:289–308.
244:26–47.
111. Appice A, D’Amato C, Esposito F. Malerba D. In:
96. Afonso F, Billard L, Diday E. Symbolic linear regres- Intelligent Data Analysis: Analysis of Symbolic and
sion with taxonomies. In: Proceedings of the Meeting Spatial Data, vol. 10. The Netherlands: IOS Press
of the International Federation of Classification Socie- Amsterdam; 2006, 301–324.
ties (IFCS), Chicago, IL. Berlin/Heidelberg: Springer-
112. Duarte Silva AP, Brito P. Discriminant analysis of
Verlag; 2004.
interval data: an assessment of parametric and
97. Neto EA, De Carvalho FAT. Constrained linear regres- distance-based approaches. J Classif 2015,
sion models for symbolic interval-valued variables. 32:516–541.
Comput Stat Data Anal 2010, 54:333–347.
113. Gowda KC, Diday E. Symbolic clustering using a
98. Lauro C, Verde R, Irpino A. Generalized canonical new dissimilarity measure. Pattern Recogn 1991,
analysis. In: Diday E, Noirhomme-Fraiture M, eds. 24:567–578.
Symbolic Data Analysis and the Sodas Software. Chi-
chester: Wiley; 2008, 313–330. 114. Ichino M, Yaguchi H. Generalized Minkowski met-
rics for mixed feature-type data analysis. IEEE Trans
99. Groenen PJF, Winsberg S, Rodriguez O, Diday E. I- Syst Man Cybern 1994, 24:698–707.
Scal: multidimensional scaling of interval dissimilari-
ties. Comput Stat Data Anal 2006, 51:360–378. 115. De Carvalho FAT. Extension based proximity coeffi-
cients between constrained Boolean symbolic objects.
100. Terada Y, Yadohisa H. Multidimensional scaling
In: Hayashi C et al., eds. Proceedings of IFCS’96.
with hyperbox model for percentile dissimilarities. In:
Berlin: Springer-Verlag; 1998, 370–378.
Watada J, Phillips-Wren G, Jain LC, Howlett RJ, eds.
Intelligent Decision Technologies. Berlin/Heidelberg: 116. De Carvalho F, Souza R, Chavent M, Lechevallier Y.
Springer-Verlag; 2011, 779–788. Adaptive Hausdorff distances and dynamic clustering
of symbolic interval data. Pattern Recogn Lett 2006,
101. Maia ALS, De Carvalho FDAT, Ludermir TB. Fore-
27:167–179.
casting models for interval-valued time series. Neuro-
computing 2008, 71:3344–3352. 117. Rüschendorf L. Wasserstein metric. In:
Hazewinkel M, ed. Encyclopedia of Mathematics.
102. Arroyo J, Espínola R, Maté C. Different approaches
Berlin/Heidelberg: Springer; 2001.
to forecast interval time series: a comparison in
Finance. Comput Econ 2011, 37:169–191. 118. Irpino A, Verde R. Dynamic clustering of interval
103. Teles P, Brito P. Modelling Interval Time Series with data using a Wasserstein-based distance. Pattern
Space-Time Processes. Commun Stat Theory Method Recogn Lett 2008, 29:1648–1658.
2015, 44:3599–3627. 119. Kosmelj K, Le-Rademacher J, Billard L. Mallows’ L2
104. Arroyo J, Maté C. Forecasting histogram time series distance in some multivariate methods and its appli-
with k-nearest neighbors’ methods. Int J Forecast cation to histogram-type data. Metodoloski Zvezki
2009, 25:192–207. No. 9, 2012, 107–118.
105. Ciampi A, Diday E, Lebbe J, Perinel E, Vignes R. 120. Kim J, Billard L. Dissimilarity measures for
Growing a tree classifier with imprecise data. Pattern histogram-valued observations. Commun Stat Theory
Recogn Lett 2000, 21:787–803. Method 2013, 42:283–303.
106. Bravo M, Garcia-Santesmases J. Symbolic Object 121. García-Santesmases JM, Franco C, Montero J. Con-
Description of Strata by Segmentation Trees, Compu- sensus measures for symbolic data. Comput Eng Inf
tational Statistics, vol. 15. Heidelberg, Germany: Sci 2010, 4:651–658.
Physica-Verlag; 2000, 13–24. 122. Diday E. The symbolic approach in clustering and
107. Mballo C, Diday E. The criterion of Smirnov- related methods of data analysis: the basic choices.
Kolmogorov for binary decision tree: application to In: Bock H, ed. First Conference of the International
interval valued variables. Intell Data Anal 2006, Federation of Classifications Societies. North-Hol-
10:325–341. land: Technical University of Aachen (RFA); 1988.
108. Winsberg S, Diday E, Limam M. A tree structured 123. Diday E, Simon JC. Cluster analysis. In: Fu KS,
classifier for symbolic class description. In: Compstat ed. Digital Pattern Intent Recognition. Berlin/
2006. Rome, Italy: Physica-Verlag; 2006. Heidelberg: Springer-Verlag; 1976.
124. Diday E. Optimisation en Classification Automa- 140. Chavent M. Criterion-based divisive clustering for
tique, Tome 1, 2. Rocquencourt: INRIA; 1979. symbolic data. In: Bock H-H, Diday E, eds. Analysis
125. Diday E, Schroeder A. A new approach in mixed dis- of Symbolic Data: Exploratory Methods for Extract-
tributions detection. Revue d’Automatique, Informa- ing Statistical Information from Complex Data. Ber-
tique et Recherche Opérationnelle (RAIRO), Paris, lin: Springer-Verlag; 2000, 299–311.
France; 1975, 10. 141. Kim J. Dissimilarity measures for histogram-valued
126. Diday E, Ok Y, Schroeder A. The dynamic cluster data and divisive clustering of symbolic objects. Doc-
method in pattern recognition. In: Proceedings of toral Dissertation, University of Georgia, 2009.
IFIP Congress, Stockholm. North-Holland, 1974. 142. Kim J, Billard L. A polythetic clustering process and
127. Ok-Sakun Y. Analyse factorielle typologique et lis- cluster validity indexes for histogram-valued objects.
sage typologique. Thèse de 3ème cycle, Université Comput Stat Data Anal 2011, 55:2250–2262.
Paris VI, Juin, 1975.
143. Kim J, Billard L. Dissimilarity measures and divisive
128. Charles C. Régression typologique et reconnaissance clustering for symbolic multimodal-valued data.
des formes. Thèse de doctorat 3ème cycle, Université Comput Stat Data Anal 2012, 56:2795–2808.
Paris IX-Dauphine, Juin, 1977.
144. Brito P, Duarte Silva AP. Modelling interval data
129. Diday E, Govaert G. Classification avec distance
with normal and skew-normal distributions. J Appl
adaptative. CR Acad Sci Paris 1974, 278:993–995.
Stat 2012, 39:3–20.
130. Diday E. Introduction à l’Analyse factorielle typologi-
que. Rapport Laboria n 27. Rocquencourt: 145. Hajjar C., Hamdan H. Self-organizing map based on
INRIA; 1972. L2 distance for interval-valued data. In: 6th IEEE
International Symposium on Applied Computational
131. Diday E. Analyse canonique du point de vu de la clas-
Intelligence and Informatics (SACI 2011), Timisoara,
sification automatique. Rapport Laboria n 293. Roc-
Romania, 2011, 317–322.
quencourt: INRIA; 1978.
132. De Souza RMCR, De Carvalho FAT. Clustering of 146. Ganter B, Wille R. Formale Begrffsanalyse: Mathema-
interval data based on city-block distances. Pattern tishe Grunlagen. Heidelberg, Deutschland: Springer-
Recogn Lett 2004, 25:353–365. Verlag; 1996.
133. De Carvalho FAT, Lechevallier Y. Partitional cluster- 147. Wille R. Knowledge acquisition by methods of formal
ing algorithms for symbolic interval data based on concepts analysis. In: Proceedings of the conference
single adaptive distances. Pattern Recog 2010, on Data Analysis, Learning Symbolic and Numeric
42:1223–1236. Knowledge. Antibes, France: Nova Sciences; 1989,
365–380.
134. Verde R, Irpino A. Dynamic Clustering of Histogram
Data: Using the Right Metric. Selected Contributions 148. Barbut M, Monjardet B. Ordres et Classification.
in Data Analysis and Classification. Berlin/Heidel- Paris: Hachette; 1971.
berg: Springer; 2007, 123–134.
149. Polaillon G, Diday E. Galois lattices of symbolic
135. Diday E. Orders and overlapping clusters by pyra- objects. Rapport n0 9631. Paris: CEREMADE, Uni-
mids. In: Deleuw J, Heiser WJ, Meulman JJ, versity Paris; 1997.
Critchley F, eds. Multivariables Data Analysis. Lei-
den: DSWO Press; 1986, 201–234. 150. Polaillon G, Diday E. Reduction of symbolic Galois
lattices via hierarchies. In: Proceedings of Conference
136. Brito P, Diday E. Use of pyramids in symbolic data
on Knowledge Extraction and Symbolic Data Analy-
analysis. In: Diday E, Lechevallier Y, Schader M,
sis (KESDA’98). Luxembourg: Office for Official
Bertrand P, Burtschy B, eds. New Approaches in
Publications of the European Communities; 1999,
Classification and Data Analysis. Berlin: Springer-
137–143.
Verlag; 1990, 378–386.
137. Brito P. Symbolic objects: order structure and pyrami- 151. Polaillon G. Interpretation and reduction of Galois
dal clustering. Ann Oper Res 1995, 55:277–297. lattices of complex data. In: Rizzi A, Vichi M,
Bock H-H, eds. Advances in Data Science and Classi-
138. Pak K, Rahal MC, Diday E. Élagage et aide à l’inter- fication. Berlin/Heidelberg: Springer-Verlag; 1998,
prétation symbolique et graphique d’une pyramide. 433–440.
In: Congrès d’extraction et gestion des connaissances
(EGC), 18–21 Janvier. Paris: Editions Cepa- 152. Brito P, Polaillon G. Structuring probabilistic data by
dues; 2005. Galois lattices. Math Social Sci 2005, 169:77–104.
139. Rahal MC, Diday E. Spatial hierarchical and pyrami- 153. Brito P, Polaillon G. Homogeneity and stability in
dal clustering software. In: Proceedings of the 10th conceptual analysis. In: Napoli A, Vychodil V, eds.
Conference of the Federation of Classification Socie- Proceedings of the 8th International Conference on
ties: Data Science and Classification, Ljubljana, Slove- Concept Lattices and Their Applications, Nancy,
nia, 25–29 July, 2006. Editions Springer. France. Nancy: INRIA; 2011, 251–263.
154. Montanary A, Calo DG. Model-based clustering of 169. Wang H, Shangguan L, Guan R, Billard L. Principal
probability density functions. Adv Data Anal Classif component analysis for compositional data vectors.
2013, 7:301–320. Comput Stat 2015, 30:1079–1096.
155. Cuvelier E. QAMML: probability distributions for 170. Dempster A, Laird N, Rubin D. Maximum likelihood
functional. PhD Thesis, University of Namur, from incomplete data with the EM algorithm. J R
Belgium, 2009. Stat Soc Series B Stat Methodol 1977, 39:1–38.
156. Fan W, Bouguila N. Infinite Dirichlet mixtures mod- 171. Gelman A, Carlin J, Stern H, Rubin D. Bayesian
els learning via expectation propagation. Adv Data Data Analysis. 2nd ed. New York: Chapman and
Anal Classif 2013, 7:465–489. Hall; 2001.
157. Shimizu N, Nakano J. Clustering based on normal 172. Marin J-M, Robert C. Bayesian Core: A Practical
mixture model for aggregated symbolic data. In: Approach to Computational Bayesian Statistics.
Arroyo J, Maté C, Brito P, Noihomme M, eds, 3rd New York: Springer-Verlag; 2007.
Workshop in Symbolic Data Analysis. Spain: Univer- 173. Diday E, Emilion R. Symbolic bayesian network. In:
sidad Compiutense de Madrid; 2012. SDA ‘2015, 17–19 November, Orleans, France. 2015.
158. Le-Rademacher J, Billard L. Likelihood functions and Available at: http://www.univ-orleans.fr/mapmo/
some maximum likelihood estimators for symbolic. colloques/sda2015/SDA2015 Slides.zip. (Accessed
J Stat Plan Inference 2011, 141:1593–1602. August 3, 2016).
159. Diday E. Modélisation de Données Symboliques et 174. Raudenbush SW, Bryk AS. Hierarchical Linear Mod-
Application au cas des Intervalles. Orléans: Journées els. 2nd ed. Thousand Oaks, CA: Sage; 2002.
Nationales de la Société Francophone de Classifica- 175. Inuiguchi M, Hirano S, Tsumoto S, eds. Rough Set
tion; 2011. Theory and Granular Computing. Berlin:
Springer; 2003.
160. Chiun-How K, Chih-Wen O, Yin-Jing T, Chuan-
kai, Y, Chun-houh C. A symbolic database for 176. Pedrycz W. Granular Computing: Analysis and
TIMSS. In: Arroyo J, Maté C, Brito P, Noihomme M, Design of Intelligent Systems. Boca Raton, FL: CRC
eds, 3rd Workshop in Symbolic Data Analysis. Spain: Press/Taylor & Francis; 2013.
Universidad Compiutense de Madrid; 2012. 177. Pawlak Z. Rough Sets: Theoretical Aspects of Rea-
161. Afonso F, Haddad R, Toque C, Eliezer ES, Diday E. soning About Data. Dordrecht: Kluwer Academic
User manual of the SYR software. Syrokko Internal Publishing; 1991. ISBN: 0-7923-1472-7.
Publication, 2012, 70. Available at: http://www. 178. Verde R, Diday E. Chapter 16—symbolic data analy-
syrokko.com. (Accessed August 3, 2016). sis: a factorial approach based on fuzzy coded data.
162. Duarte Silva AP, Brito P. MAINT.DATA: model and In: Blasius J, Greenacre M, eds. Visualization and
analyze interval data. R Package, version 0.2; 2011. Verbalization of Data. Mathematics|Probability and
Available at: http://cran.r-project.org/web/packages/ Statistics. UK: CRC Press Chapman & Hall book;
MAINT.Data/index.html. (Accessed August 3, 2016). 2014, 255–270.
163. Irpino A. HistDAWass: Histogram-Valued Data Anal- 179. Diday E, Moreau JV. Hierarchical Inference. In:
ysis, R package, version 0.1.4. 2016. Available at: Proceedings in Computational Statistics
https://cran.rproject.org/web/packages/HistDAWass/ (COMPSTAT 6), Prague: Physica-Verlag; 1984.
index.html.hermann. (Accessed August 3, 2016). 180. Lance GN, Williams WT. A general theory of classifi-
catory sorting strategies: hierarchical systems. Com-
164. Irpino A, Verde R. Linear regression for numeric
put J 1967, 9:373–380.
symbolic variables: a least squares approach based on
Wasserstein distance. Adv Data Anal Classif ) 2015, 181. Meroune O. Traitement à grand échelle des données
9:81–106. symboliques. PhD co-directed by Prof.E. Diday and
P. Rigaux, Paris Dauphine University. France, 2011.
165. Irpino A, Verde R. Basic statistics for distributional
symbolic variables: a new metric-based approach. 182. Minami H, Mizuta M. SDA framework is the tool
Adv Data Anal Classif ) 2015, 9:143–175. for big data analysis? In: Arroyo J, Maté C, Brito P,
Noihomme M, eds. 3rd Workshop in Symbolic Data
166. Benzécri JP. L’Analyse des Données: l’Analyse des Analysis, Spain: Universidad Compiutense de
Correspondances. Paris: Dunod; 1980. Madrid; 2012.
167. Pawlowsky-Glahn V, Egozcue JJ, Tolosana- 183. Schweizer B. Distributions are the numbers of the
Delgado R. Modeling and Analysis of Compositional future. In: Proc. Sec. Napoli Meeting on “The Mathe-
Data. Chichester: Wiley; 2015. matics of Fuzzy Systems”. Instituto di Mathematica
168. Fisher RA. On the mathematical foundations of theo- delle Faculta di Mathematica delle Faculta di Achitec-
retical statistics. Philos Trans A Math Phys Eng Sci tura, Universita degli studi di Napoli; 1984,
1922, 222:309–368. 137–149.