Thinking by Classes in Data Science SDA

Symbolic Data Analysis (SDA) is a paradigm in data science that extends traditional data analysis by considering classes of individual entities as higher-level units, allowing for the representation of variability through symbolic values rather than just numerical data. This approach enables the aggregation of complex data and provides new insights that complement classical methods, particularly in the context of big data. SDA facilitates the transformation of unstructured data into structured formats, enhancing the ability to extract knowledge from diverse datasets.

Overview

Thinking by classes in data science: the symbolic data analysis paradigm
Edwin Diday*

Data Science, considered as a science by itself, is, in general terms, the extraction of knowledge from data. Symbolic data analysis (SDA) gives a new way of thinking in Data Science by extending the standard input to a set of classes of individual entities. Hence, classes of a given population are considered to be units of a higher level population to be studied. Such classes often represent the real units of interest. In order to take variability between the members of each class into account, classes are described by intervals, distributions, sets of categories or numbers (sometimes weighted), and the like. In that way, we obtain new kinds of data, called ‘symbolic’ as they cannot be reduced to numbers without losing much information. The first step in SDA is to build the symbolic data table where the rows are classes and the variables can take symbolic values. The second step is to study and extract new knowledge from these new kinds of data by at least an extension of Computer Statistics and Data Mining to symbolic data. SDA is a new paradigm which opens up a vast domain of research and applications by giving complementary results to classical methods applied to standard data. SDA also gives answers to big data and complex data challenges, as big data can be reduced and summarized by classes and as complex data with multiple unstructured data tables and unpaired variables can be transformed into a structured data table with paired symbolic-valued variables. © 2016 Wiley Periodicals, Inc.

How to cite this article:
WIREs Comput Stat 2016, 8:172–205. doi: 10.1002/wics.1384

Keywords: data science, data mining, classification, learning, symbolic data analysis, functional analysis, Bayesian, multilevel analysis, complex data, big data, granular computing, compositional data

*Correspondence to: diday@ceremade.dauphine.fr
CEREMADE, Paris-Dauphine University, Paris, France
Conflict of interest: The author has declared no conflicts of interest for this article.

INTRODUCTION

Basically in Data Science, we have as input a set of ‘individual entities.’ In the following, this set is called the ‘ground population.’ This ground population can be a set of statistical units considered as a representative sample of a whole population. It can also simply come from any kind of database which already exists before any thought is given to extracting new knowledge with Data Science methods and tools.

‘Classes’ are subsets of individual entities having a common property. In the following, the ‘individual entities’ of the ground population are more simply called ‘individuals’ as distinct from the ‘classes’ of individuals. In symbolic data analysis (SDA), these classes are considered as new units of a higher level of generalization than individuals.

When sets of data are huge in Data Science, a first idea is to consider classes of individuals as new units in order to reduce the initial size of the population by summarizing it. Another feature of classes is that often they can represent the real units which

172 © 2016 Wiley Periodicals, Inc. Volume 8, September/October 2016

WIREs Computational Statistics Symbolic Data Analysis

interest the user. For example, shopping transactions are interesting units in marketing, but customers, considered as new units represented by the classes of their transactions in a given period, can also be considered as interesting units by themselves and not only for summarizing the data.

How are classes obtained? Behind any ground population described by classical (numerical or categorical) variables, there are underlying hidden populations of classes induced, e.g., by the categorical variables, by numerical variables (transformed into categorical variables by a discretization process), or by the Cartesian product of such categorical variables. Another way of obtaining classes is to use a clustering method on the ground population. Nevertheless, in practice the classes are often induced from given categories (such as regions, unemployed people type, epidemiological strategies, consumption level, degradation level, etc.).

For classes, their description cannot be expressed by just numerical and categorical values. This is due to the variability of the individuals inside each class of individuals. This variability is better expressed by intervals, histograms, probability distributions, bar charts, sequences of categorical or numerical values, sometimes weighted by numbers or associated with categorical values, and the like. These kinds of data are called ‘symbolic’ as they cannot be reduced to numbers without a loss of much information. The so-called ‘symbolic variables’ are the variables which associate a symbolic value to each class. In SDA Paradigm section, different kinds of symbolic variables are presented.

As classes are considered in the SDA framework like ‘objects’ to be described in all their useful facets, their descriptions are often called ‘symbolic objects.’ Several kinds of more or less structured and complex data have been associated with different kinds of symbolic objects; e.g., ‘hoard’ in Refs 1, 2 and ‘belief objects’ in Ref 3. The first advantage of such symbolic objects was to give a linear expression of the symbolic description of classes in a complex data context and second, to consider symbolic objects as the intent of the described class; this intent having an extent as in the Galois lattice framework (this is developed in Unsupervised Classification Extended to Symbolic Data section). In this study, the ‘symbolic objects’ considered are vectors of symbolic data.

A ‘symbolic data table’ is a table where classes of individuals are described by at least one symbolic variable. Standard variables can also describe classes by considering the set of classes as a new ground population of higher level.

An example of a symbolic data table is given in Table 1, where the statistical units of the ground population are players of French cup teams and the classes of players are the teams called Paris, Lyon, Marseille, and Bordeaux. The variability of the players inside each team is expressed by the symbolic variables: ‘Weight,’ whose value is the [min, max] interval of the weights of the players of the associated team; ‘National Country,’ whose value is the list of their nationalities; and ‘Age bar chart,’ whose value is the frequency of the players' ages falling in the intervals [less than 20], [20, 25], [25, 30], [more than 30], respectively denoted (0), (1), (2), (3) in Table 1. The symbolic variable ‘age’ is called a ‘bar chart variable’ as the age intervals on which it is defined are the same for all the classes and can therefore be considered as categories. The last variable is numerical, as its value for a team is the frequency of the French players in this team among all the French players of all the teams. Hence, this variable produces a vertical bar chart, in contrast to the symbolic variable ‘age,’ whose values in Table 1 are horizontal bar charts. By adding to the French column the same kinds of columns for the other nationalities, we can obtain a new symbolic variable whose values are lists of numbers, where each number is the frequency of the players of a given nationality in a team among all the players having this nationality across all the teams. A team can also be described by standard variables such as, e.g., its expenses or the number of goals in a season.

TABLE 1 | Example of Symbolic Data Table Where Teams of the French Cup Are Described by Three Symbolic Variables of Interval, Sequence of Categories, ‘Horizontal’ Bar Charts, and a Numerical Variable Inducing a ‘Vertical’ Bar Chart

French Cup Teams | Weight   | National Country             | Age                         | Frequency of FRENCH Among All French (%)
Paris            | [73, 85] | {France, Argentina, Senegal} | {(0) 30%, (1) 70%}          | 30
Lyon             | [68, 90] | {France, Brazil, Italia}     | {(0) 30%, (1) 65%, (2) 5%}  | 25
Marseille        | [77, 85] | {France, Brazil, Algeria}    | {(1) 40%, (2) 52%, (3) 8%}  | 28
Bordeaux         | [80, 90] | {France, Argentina}          | {(0) 40%, (1) 60%}          | 17
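As a rough illustration of how a table of this kind could be built from a ground population, the sketch below aggregates hypothetical player records into one symbolic description per team. The records, the field names, and the functions `age_bin`, `symbolic_row`, and `build_symbolic_table` are assumptions made for this example; in particular, the age categories are treated as half-open intervals, which the text does not specify.

```python
from collections import Counter


def age_bin(age):
    """Map an age to the categories (0)..(3); half-open bounds are an assumption."""
    if age < 20:
        return 0
    if age < 25:
        return 1
    if age < 30:
        return 2
    return 3


def symbolic_row(members):
    """Describe one class (team) by symbolic values built from its players."""
    weights = [p["weight"] for p in members]
    bins = Counter(age_bin(p["age"]) for p in members)
    n = len(members)
    return {
        "weight": (min(weights), max(weights)),                  # interval value
        "nationalities": {p["nationality"] for p in members},    # set of categories
        "age_bar_chart": {b: bins[b] / n for b in sorted(bins)}, # bar chart value
    }


def build_symbolic_table(players):
    """Group the individuals by team, then describe each team symbolically."""
    classes = {}
    for p in players:
        classes.setdefault(p["team"], []).append(p)
    return {team: symbolic_row(members) for team, members in classes.items()}


# Hypothetical ground population (not the data behind Table 1).
players = [
    {"team": "Paris", "weight": 73, "age": 19, "nationality": "France"},
    {"team": "Paris", "weight": 85, "age": 23, "nationality": "Argentina"},
    {"team": "Paris", "weight": 80, "age": 24, "nationality": "Senegal"},
    {"team": "Lyon", "weight": 68, "age": 18, "nationality": "France"},
    {"team": "Lyon", "weight": 90, "age": 27, "nationality": "Brazil"},
]

table = build_symbolic_table(players)

if __name__ == "__main__":
    for team, row in table.items():
        print(team, row)
```

Each row of `table` plays the role of one row of Table 1: the classes are the units, and each cell keeps the variability of the players instead of a single summary number.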


Overview wires.wiley.com/compstats

Basically in SDA, there are two types of descriptive variables depending on the population on which they are defined:

1. The standard variables (numerical or categorical), called ‘ground variables’ when they are defined on the ground population of individuals.
2. The so-called ‘symbolic-valued variables’ defined on classes, whose values cannot be reduced to just numbers (e.g., means).

These symbolic variables can be obtained from standard ground variables and define a symbolic data table as illustrated by Figure 1.

SDA gives a framework for building, describing, analyzing and extracting new knowledge from symbolic data tables. In that way, SDA can enhance the usual study of any ground population of individuals described by classical variables by adding a complementary study of classes of individuals described by symbolic variables expressing the internal variability of these classes.

As it is better to know what we describe before describing it, an important question is to know well which kinds of classes we have and what their meaning is. Basically, we have two kinds of classes which depend on the kinds of variability of their individuals:

1. Variability between different entities: each class is a set of entities considered as individuals of the ground population. The variability is between the individuals. This is the most common variability in SDA. Let {i_1, i_2, …, i_n} be a fixed class of n individuals; e.g., the variability between n players inside their team considered as a class of players described by several classical variables. Another example can be species (of plants or animals) where the variability is between the specimens of a species. A last example is diseases considered as classes where the considered variability is between patients having the same disease. In all these cases, the classes can be described by symbolic variables expressing the variability of the individuals of each class.
2. Variability of (or inside) single entities: each class is a set defined by a fixed entity (such as a player, a specimen or a patient) considered in different conditions, parts, etc. The variability of a single entity depends on external conditions (such as position, time, and environmental situations) or on internal conditions (such as among its parts, its patterns, or its physical constitutions). More formally, this means that a class associated with the ith entity is the subset {i_j1, i_j2, …, i_jk} of k individuals of the ground population representing the ith entity varying in k external conditions (k times, k positions, etc.) or among k internal conditions (k of its parts, k of its patterns, etc.).

Examples of Variability of a Single Entity

A player’s performance considered at different times can be considered as a class of the same player at these different times. In this case, the single entities are players. The individuals of the ground population are players considered at different times. Hence, the ith player can be a class of k individuals (i_j1, i_j2, …, i_jk) associated with the ith player considered at

[Figure 1: a standard data table (left; rows are the players ind_1, …, ind_n, columns are the variables Y_1, …, Y_j, and each cell holds a number such as Messi's age or a category such as Messi's nationality) is transformed into a symbolic data table (right; rows are the classes Cl_1, …, Cl_k, columns are the variables Y′_1, …, Y′_j, and each cell holds a symbolic value describing, e.g., Messi's team: an interval for age, a histogram for weight, a bar chart for nationalities).]

FIGURE 1 | From a standard data table (X, Y) describing a set of individuals X by a set of standard variables Y, to a symbolic data table (X′, Y′) describing a set of teams X′ by a set of symbolic variables Y′.


k times. Each player at each time is described by different kinds of performance. The higher level population of players is then described by symbolic variables obtained by aggregation of these kinds of variables for each player considered at k times. Another example is a traveler visiting different hotels. The single entity is a traveler. In this case, a class is a subset of individuals (i_jh1, i_jh2, …, i_jhk) of the ground population associated with the traveler visiting k hotels. Each traveler is described, at each hotel visit, by different criteria of satisfaction with the visited hotel. The higher level population of travelers is then described by symbolic variables obtained by aggregation of these kinds of variables, on the k visited hotels, for each traveler.

Examples of Variability Inside Single Entities

The single entities are towers. The individuals of the ground population are parts of each tower. These parts are cracks. Hence the ith tower can be associated with a class of k individuals (i_c1, i_c2, …, i_ck) associated with k cracks. Each crack of each tower is described by classical variables such as the size, depth, orientation, and so on. The higher level population of towers is then described by symbolic variables obtained by aggregation of these kinds of variables, on the k cracks, for each tower.

In industrial data, these different kinds of classes often happen together. This leads to what we call ‘complex data’ (see Classes of Individuals and Their Symbolic Description Built from ‘Complex Data’ section for more details). In contrast to the case where the ground data can be modeled by a unique standard data table (where a set of individuals is described by a set of classical variables), complex data are composed of many such data tables of several kinds of units, variables and sizes. For example (see Application to Complex Data section for more details), nuclear power plant towers constitute a class of individuals which vary (e.g., in their geographical position). Moreover, each tower is considered as an entity varying in time, and also inside its parts (such as its cracks or corrosion points). More details on this example can be found in Afonso et al.4,5. Another example of complex data (developed in Classes of Individuals and Their Symbolic Description Built from ‘Complex Data’ section) concerns surveys in Official Statistics for the sociodemographic description of each region, which require many different census data tables (on families, schools, hospitals, etc.; see for more details, Refs 6–9). In both examples, we obtain many standard data tables to be aggregated and merged. This leads naturally to towers or regions described by symbolic data with several kinds of symbolic variables.

Why aggregate classes and describe them by symbolic data? The symbolic description of classes leads at least to the following advantages:

• Finding new and complementary kinds of knowledge not available at the level of individuals: e.g., a kind of knowledge inside the population of individuals is to find the players whose age is between 20 and 25 years old. A complementary kind of knowledge can be found inside the population of teams (i.e., classes of players) described by distributions: find the classes whose probability of having players of age between 20 and 25 years old is higher than 0.9.
• Studying the data by units given at the needed level of generalization: e.g., if we wish to know what makes a player win, a good level of study is a data table where individuals (i.e., the players) are the units; if we wish to know what makes a team win, a good level of study is a data table where classes of individuals (i.e., the teams) are the units. The description of the teams needs to take the variability inside the classes of players into account by using intervals or histograms of age, bar charts of nationalities, lists of sponsors associated with the kind of sponsoring, and so forth, which leads to symbolic data.
• Summarizing by reducing the loss of information: when each class contains millions of individuals, it is easier to study them when they are summarized by symbolic data than to study the millions of individuals which define them. A reduction of loss is obtained when, instead of the single point value in the p-dimensional space ℝ^p seen in classical data (by using, e.g., means of classes), the symbolic data take the variability inside classes into account, e.g., by interval values (such as the min–max age of the players of each team instead of their mean), thus producing hypercubes in ℝ^p.
• Reducing the number of individuals: e.g., when the teams are the classes, there are fewer teams than players, so the number of units is reduced from the ground level of players to the higher level of teams.
• Reducing the number of variables: the number of variables can become higher at the level of classes than at the level of the individuals (this will be developed in Some Principles section, in


the ‘class facets principle’). For example, a ‘team age’ can be described by the mean, mean square, min–max interval, confidence interval, histogram, quantiles, and so on of the age of the players of the team. These values allow the definition of descriptive standard or symbolic variables at the higher level of classes. Nevertheless, by considering symbolic variables instead of their transformation into numerical or categorical standard variables, we reduce the number of variables. For example, by considering p symbolic variables whose values are intervals, we have only p interval-valued variables instead of 2p numerical variables of min or max values. When we use p bar chart-valued variables of k categories instead of variables whose values are the frequency of each class for each category of each bar chart, we reduce the number of variables from k × p classical numerical variables to p symbolic variables. Hence, in a principal component analysis (PCA) extended to symbolic data we will have p variables to represent on the correlation circle instead of k × p.
• Reducing missing data: e.g., if we have a million individuals described by a numerical ground variable containing 1000 missing values, the study of 10 classes of such individuals could lead to some missing data at the level of classes for this variable. This can happen when a class contains only missing data of the ground variable. However, the number of missing data at the level of the ten classes will be much smaller than 1000.
• Solving confidentiality questions: classes are generally less confidential than individuals. For example, counties instead of inhabitants.
• Facilitating interpretation of results: e.g., in reduced decision trees, as classes are less numerous than their individuals; another example is PCA with new kinds of graphics expressing the variability inside the classes and the correlation between the symbolic variables.
• Transforming complex data with unstructured data tables and unpaired variables into a structured symbolic data table with paired symbolic variables. SDA methods and tools can then be applied to this new symbolic data table. This will be detailed in Application to Complex Data section.

Representing classes by marginal or joint probability distributions, as often done in SDA (in the case of ground numerical variables), is more informative than just by means or min–max intervals. Nevertheless, other kinds of symbolic variables (whose values are not distributions) describing classes by other facets can be added. For example, vertical bar chart variables, as in Table 1, interquartile interval-valued variables, percentile-valued variables or correlations between ground variables can be usefully added (among many others) to the descriptive variables of each class.

Notice that bar charts or distributions built from ordinal variables can always describe classes as they can express different kinds of variability. However, distributions cannot describe classes in the case of nonordinal variables. By definition, a distribution, evaluated at ‘x,’ is the probability that a real-valued random variable X will take a value less than or equal to x. Hence, from a ground nonordinal variable, we can build from each given class a bar chart-valued variable but not a distribution-valued variable, as in this case ‘a value less than or equal to x’ has no meaning.

Classes can also be described by bar charts, histograms or distributions of two kinds, horizontal or vertical, depending on the kind of probability used: Pr(C|y_i) or Pr(y_i|C), where C is a class of individuals of the ground population and y_i is a category of a ground categorical variable y. For example, Table 1 contains the symbolic variable ‘age,’ whose values are horizontal bar charts, and the numerical variable ‘Frequency among all French,’ which leads to a vertical bar chart. These kinds of description will be developed in How to Measure the Quality of Bar Chart Symbolic Variables, Classes, and Their Associated Symbolic Data Tables section.

Statistical Observations and Symbolic Data

In contrast to observed data (i.e., ‘observations’), which are considered given in standard statistics, ‘symbolic data’ are built from classes. More precisely, in standard statistics an ‘observation’ is usually the (numerical or categorical) value, at a particular period, of a particular variable [see the OECD Glossary of statistical terms (2007) at http://stats.oecd.org/glossary/search.asp].

An observation describes a unit (here called ‘individual’) of a given population. Therefore, in order to ‘describe’ the classes in SDA, the values of the symbolic variables (such as intervals) are not ‘observations’ but are built from observations given on the ground population of individuals. The classes of individuals are considered as higher level units and so constitute a new population of higher level. To be studied, this new population uses the ‘description’ of the higher


level units by ‘symbolic data’ expressing variability and considered as nonstandard ‘observations.’ Nevertheless, we can consider a standard ‘observation’ as being the ‘description’ of a class reduced to a single individual without variation. Therefore, any tool of SDA should be applicable to standard data and, more specifically, any standard tool generalized to symbolic data should remain applicable to standard data and therefore should contain the standard tool as a special case.

The two main steps of an SDA are, first, to build the symbolic data from given standard or complex (sometimes big) data files and, second, to apply SDA methods and tools to the obtained symbolic data. These two steps are necessary, even if sometimes we can find in practice symbolic data where the first step is already done. For example, the sociodemographic description of counties by histograms is provided by statistical offices.

The theory and practice of SDA have been developed in several books (Refs 10–12), many papers (in JASA, SAM, ADAC journals, etc.), and several international workshops. Special issues related to SDA have recently been published: in ‘The ASA Data Science Journal’ (Wiley), edited by Billard et al.13 on SDA; in the RNTI journal, edited by Guan et al.,14 on ‘Advances in Theory and Applications of High Dimensional and Symbolic Data Analysis’; in the ADAC journal on SDA, edited by Brito et al.15,16; in the IEEE Man and Cybernetic journal edited by Su et al.17 on ‘Granular/Symbolic data processing.’

The sections of this article are organized in the following way. Inspired by ‘The Structure of Scientific Revolutions,’18 we try first to show that the ‘Symbolic Data Analysis’ framework is a new scientific paradigm by answering the following questions: In SDA Paradigm section: what is the actual failure which has produced the SDA domain? In the third section: what is to be observed and scrutinized? In the fourth section: what kinds of questions and how are they structured? In the fifth section: what are the principles and the theoretical development? In the sixth section: what is the applicability domain? Then, an overview of the symbolic data methods and tools is given in the seventh section. In the eighth section, some SDA software is described. In the ninth section, several directions of research are given; some having connections with SDA are positioned as ‘compositional data,’ ‘functional data analysis (FDA),’ ‘mixture decomposition,’ ‘multilevel statistics,’ ‘granular computing,’ ‘rough sets,’ ‘uncertainty,’ ‘fuzzy sets,’ and ‘clusterfier.’ Finally, we give a conclusion with some perspectives.

SDA PARADIGM

What Is the Actual Failure and the Shift Which Has Produced the SDA Domain?

What is the actual failure? The failure is that in actual practice only the ‘individual’ kind of units described by standard numerical and categorical variables is considered.

What is the SDA paradigm shift? First, it is the transition from the analysis of ‘individuals’ (e.g., a player of a team, a stock, a pig, etc.) described by standard variables of numerical and categorical values, to the analysis of ‘classes of individuals’ (a team of players, a fund of stocks, a farm of pigs, etc.) considered as new units. Second, it is the transition to ‘symbolic variables’ whose values can be more general than those of standard numerical or categorical-valued variables. In that way, the classes constitute higher level units described not only by standard variables (such as their means or the correlations of their variables) but also by symbolic variables taking care of the variability inside these classes.

Both transitions transform the standard pair (X, Y), where X is a set of individuals of a given population and Y a set of standard numerical and/or categorical variables, into a new pair (X′, Y′), where X′ is a set of classes of individuals and Y′ is a set of symbolic variables. In standard data analysis, the pair (X, Y) is the so-called ‘standard data table.’ The pair (X′, Y′) is the so-called ‘symbolic data table.’ More generally, a ‘symbolic data table’ is a data table where the set of variables can contain standard numerical and/or categorical variables and at least one symbolic variable. In Figure 1, on the left there is a standard data table (X, Y) where X is a set of players and Y is a set of their descriptive variables. On the right, there is a data table (X′, Y′) where X′ is a set of teams of players and Y′ is a set of symbolic variables describing teams by symbolic data.

What kinds of ‘Symbolic Variables’?

The first characteristic of ‘symbolic variables’ is that they are defined on classes. The second characteristic is that their values take the variability between the individuals inside these classes into account by more than only one category or number.

In Table 2, six kinds of variables describing classes are given and illustrated by examples. The first two are classical numerical or categorical-valued variables. The four other variables of this table are symbolic as their values express a variability in the following way:

Interval-Valued Variable

It is a kind of symbolic variable as an interval expresses by itself a variability inside the class that it


TABLE 2 | Class Descriptive Single or Multivalued Variables (Two Are Classical and Four Are Symbolic) with Examples Based on Descriptive Variables of Team Players

Variable    | Single Value | Multivalue
Numerical   | Classical numerical variable. Example: Frequency list in a team of nationalities among the players of same nationality in all the teams | Symbolic numerical multivalued variable. Example: List of team players ages
Categorical | Classical categorical variable. Example: Nationality (of the team manager) | Symbolic categorical multivalued variable. Example: List of a team players nationalities
Interval    | Symbolic interval-valued variable. Example: [Min, Max] Interval age of a team players | Symbolic interval multivalued variable. Example: Intervals between successive month team expenses or of stock values

describes. For example, the [Min, Max] or the interquartile interval of the random ground variable ‘age’ inside a team considered as a class of players.

Multivalued Variables

This is the case where the values of the symbolic variables express the internal variability of a class of individuals by a list of numbers, categories or intervals. These kinds of variables are connected to reality and are not an abstract construct introduced just to complete the cells of Table 2. Hence, the list of numbers can be the list of marks of a pupil varying between examinations in some courses (Mathematics, Physics, etc.), of temperatures of a patient varying during each hour every day, or of performances of an athlete varying in various competitions. In these cases, the classes are, respectively, the pupils, the patients, and the athletes. The variables are, respectively, the courses, the days, and the competitions. The list of categories can be a list of products bought by customers in several sections of a supermarket in a given period. In this case, the classes are the customers and the variables are the sections of the supermarket. The list of intervals can be the list of interval variations of stock prices every day, during several weeks. In this case, the classes are the stocks and the symbolic variables are the weeks.

In Table 3, other kinds of symbolic variables are illustrated by examples as follows.

Numerical, Categorical, or Interval Modal Symbolic Variables

The values of such symbolic variables are lists of pairs (number, mode), (category, mode), or (interval, mode), where the mode, given by the column of Table 3, is a number, a category or an interval.

Therefore, each of the 9 (i.e., 3 × 3) cells of Table 3 defines a different type of symbolic variable. In Refs 10–12, ‘categorical modal variables’ are restricted to ‘symbolic (category, number) list-valued variables,’ which contain the bar chart-valued variables as a particular case. In Table 3, we can see that ‘categorical modal variables’ are enlarged to two other types: symbolic (category, category) list-valued variables and symbolic (category, interval) list-valued variables. In the same books, ‘symbolic interval modal variables’ are restricted to ‘symbolic (interval, number) list-valued variables,’ which therefore contain the case of ‘histogram-valued variables,’ when the sum of the numbers associated with each interval of the list is equal to one. They also contain ‘compositional data’19 when the sum of these numbers remains constant. In Table 3, ‘symbolic interval modal variables’ are enlarged to the case of ‘symbolic (interval, category) list-valued variables’ and to the case of ‘symbolic (interval, interval) list-valued variables.’ All these cases are illustrated by examples given in Table 3.

WHAT IS TO BE OBSERVED AND SCRUTINIZED?

In SDA, we observe and scrutinize classes of individuals, in order to identify them by the standard or symbolic data which describe them.

From Where Can Classes and Their Descriptions Be Obtained?

In the introduction, we defined two kinds of classes: classes expressing variability between entities and classes expressing the variability of (or inside) entities. When classes express the variability between entities, they can be obtained in two ways. First, they can be induced from the categories, defined on the ground population, and induced by standard numerical (after discretization) and/or categorical variables and their Cartesian product. These categories are sometimes ‘taxonomic’ by inducing classes of individuals structured in different ways as by hierarchy or

178 © 2016 Wiley Periodicals, Inc. Volume 8, September/October 2016

WIREs Computational Statistics Symbolic Data Analysis

TABLE 3 | Class Description by Modal Symbolic Variables Illustrated by Examples of Team Players

| Support \ Value | Numerical Modal | Categorical Modal | Interval Modal |
|---|---|---|---|
| Numerical | (number, number) list-valued variable. Example: time series team description by its performance list: (time × number of goals in the ongoing season) list | (number, category) list-valued variable. Example: team description by its rank evolution: (time × rank) list | (number, interval) list-valued variable. Example: team description by its score evolution in the ongoing season: (time × [min, max] number of goals) |
| Categorical | (category, number) list-valued variable. Example: team description by the bar chart: (nationality, frequency) list | (category, category) list-valued variable. Example: team description by its (sponsor × dress) list as: (Adidas (T-shirt), Total (shoes)) | (category, interval) list-valued variable. Example: team description by: (nationality × (age interval value)) list |
| Interval | (interval, number) list-valued variable. Example: histogram team description by: frequencies of age intervals list | (interval, category) list-valued variable. Example: team description by time intervals sequence of same rank: (interval of time × rank) list | (interval, interval) list-valued variable. Example: team description by: (interval of time × interval of rank) list |
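
As a rough illustration of the nine cells of Table 3, a list-valued symbolic value can be held as a plain list of (support, mode) pairs. The sketch below is only one possible encoding: the names `Interval`, `kind`, `cell_type`, and `is_histogram` are hypothetical helpers introduced here, and the team values are made up. It also checks the histogram condition mentioned in the text, namely an (interval, number) list whose numbers sum to one.

```python
from numbers import Real
from typing import NamedTuple


class Interval(NamedTuple):
    """Closed interval [lo, hi]; a hypothetical helper for this sketch."""
    lo: float
    hi: float


def kind(x):
    """Classify one element of a pair as 'number', 'category' or 'interval'."""
    if isinstance(x, Interval):
        return "interval"
    if isinstance(x, Real):
        return "number"
    return "category"


def cell_type(pairs):
    """Return the (support, mode) cell of Table 3 that a list of pairs falls in."""
    kinds = {(kind(s), kind(m)) for s, m in pairs}
    if len(kinds) != 1:
        raise ValueError("all pairs of one symbolic value must share one type")
    return kinds.pop()


def is_histogram(pairs, tol=1e-9):
    """True for an (interval, number) list whose numbers sum to one."""
    if cell_type(pairs) != ("interval", "number"):
        return False
    return abs(sum(m for _, m in pairs) - 1.0) < tol


# Illustrative (made-up) descriptions of one team.
age_histogram = [(Interval(18, 22), 0.25), (Interval(22, 26), 0.5), (Interval(26, 35), 0.25)]
nationality_bar_chart = [("French", 0.6), ("Spanish", 0.4)]
rank_evolution = [(1, "third"), (2, "second"), (3, "first")]

for value in (age_histogram, nationality_bar_chart, rank_evolution):
    print(cell_type(value), is_histogram(value))
```

Keeping the support and the mode as separate elements of each pair makes the cell of Table 3 a property of the data itself, so the same container serves all nine cases.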

other structures as, e.g., in the case of towns, regions, countries, and continents. Second, these classes can also be obtained from a clustering algorithm applied to the ground population. This clustering yields more or less structured clusters defining classes of individuals, as it can provide partitions, hierarchies, overlapping clusters, pyramids (or, more generally, spatial pyramids; Ref 20), and Galois lattices (see Unsupervised Classification Extended to Symbolic Data section).

When classes are due to the variability of (or inside) single entities, the classes can be built directly from the ground population by the subsets {ij1, ij2, … , ijk} associated with each entity i considered in k internal or external conditions, as explained in the introduction.

Both kinds of classes can be described by symbolic variables taking care of their internal variability. Notice that the inclusion order between classes must be coherent with the chosen order between their symbolic descriptions. For example, if the symbolic description of a set of classes is given by an interval-valued variable (i.e., each class is associated with an interval), then, for coherence, inclusion between classes should imply inclusion between the associated intervals; Diday (Ref 21) studies how the taxonomic order (between classes) can imply the symbolic order (between symbolic values). By using these symbolic descriptions, one can obtain different levels of more or less specialized knowledge and therefore a better understanding of the inherent knowledge structure of the data.

Sometimes, we can have ‘native symbolic data’ describing classes but not obtained from a given ground population. Native symbolic data appear in many situations. For example, in biology, ‘species’ (considered as classes of individuals called ‘specimens’) are described by experts in specialized books by their interval of height, distribution of colors, and the like, without giving the ground level of the specimens. Native symbolic data can also appear when, due to confidentiality, the observations on the ground populations (people, companies, and the like) are not available. Native symbolic data occur in national censuses of Official Statistics, where we have for each region the bar charts associated with different ground variables describing hospitals, schools, the sociodemographic situation of the inhabitants, and so on, but we do not have the underlying surveys giving the ground populations of the hospitals, the schools, the inhabitants, and so forth. This last example is detailed in Classes of Individuals and Their Symbolic Description Built from ‘Complex Data’ section as it is a case of ‘complex data.’

Classes of Individuals and Their Symbolic Description Built from ‘Complex Data’
‘Complex data’ are the case where we have entities defined by several ground populations. In this context, we have as input several unstructured data tables with different numbers of individuals (in rows) and of variables (in columns). Moreover, the individuals and the variables can be different between the data tables. The entities can also be considered as individuals of another ground population.

Example. Each entity is a ‘power plant tower’ defined by a population of cracks (described by their orientation, their length, etc.) and a population of corrosions (described by their depth, their surface, etc.). Each entity is considered as an individual of a ground population of power plant towers. This example is developed in Application to Complex Data section, where Figure 9 shows the complex data and their fusion in symbolic data. More generally, we can


Overview wires.wiley.com/compstats

say, in this case, that we have a population $X$ of $n$ individuals (the power plant towers); each of these individuals is defined by $N$ different populations $X_{ij}$ of different sizes (cracks, corrosions, etc.), each one defined by $n_i$ entities described by $p_j$ variables. Hence, complex data cannot be described by a unique standard data table where individuals in rows are simply described by variables in columns. Moreover, the variables are not paired, as the correlations between variables defined on populations of different sizes cannot be calculated. If there is a variable which induces $k$ classes on all the populations, we can aggregate these classes; this leads to symbolic data and symbolic variables. Finally, we obtain a symbolic data table where $k$ classes are described by symbolic variables defined on these classes. Hence, starting with unstructured data with unpaired variables, we obtain a structured data table with paired symbolic variables.

Example. National Statistical Institutes (NSI) organize censuses in their regions on different kinds of populations: hospitals, schools, inhabitants, and so on. Each of these populations is associated with its own characteristic variables. For hospitals: number of beds, doctors, patients, etc.; for schools: number of pupils, teachers, etc.; for inhabitants: gender, age, socioprofessional category, etc. The regions are the individuals described by the variables available for all these populations of different sizes. If we have $n$ regions and $N$ populations (of hospitals, schools, etc.), then we obtain, after aggregation, a symbolic data table with $n$ rows and $p_1 + \ldots + p_N$ columns associated with the $N$ sets of symbolic variables characteristic of each of the $N$ populations. Other variables can be added to the obtained symbolic data in order to describe other aspects of the regions, such as their number of inhabitants, their different kinds of production, and so on. Some SDA software, such as Symbolic Objects Data Analysis (SODAS) or Symbolic Research (SYR), as shown in Symbolic Data Analysis Software section, facilitates this possibility.

What Distinguishes the Study of Classes from the Study of Individuals?

From Standard Statistical Units to Classes, the Statistic Is Not the Same
The statistics on the ground population of individuals and on the higher level population of classes are not always the same, even in the simplest case where the classes are described by their means. For example, if we consider a ground population of individuals uniformly distributed in the circle given in Figure 2, there is no correlation between $Y_1$ and $Y_2$. Nevertheless, a correlation appears between the two variables for the population defined by the four centers of a given partition, in four classes, of the initial population. As a consequence, any data analysis method based on correlation (regression, PCA, etc.) yields different results depending on which of the two populations is used: the individuals or the classes of individuals.

FIGURE 2 | Individuals are uniformly distributed inside the circle. Therefore, there is no correlation between Y1 and Y2.

WHAT KIND OF QUESTIONS IN SDA AND HOW ARE THEY STRUCTURED?

How to Go from Standard to Symbolic Data?
The first challenge is building the higher level units and their descriptive symbolic variables from the ground population described by standard data. Depending on the kind of variables and the user’s wishes, we can obtain numerous kinds of class descriptions. Currently, those mainly used are intervals, bar charts, histograms, probability distributions, percentile functions, and percentile values.

The class description by intervals is obtained from ordinal variables (numerical or categorical), such as the weight in Table 1. Standard choices are the [min, max] intervals, interquartile intervals, or confidence intervals. The class description by probability distributions is obtained empirically or by standard or classical mixture model estimation from numerical variables. In order to describe classes by histograms, a discretization of ground numerical variables is needed. This has to be done in such a way that the resulting histograms discriminate the classes as well as possible (see How to Analyze a Symbolic Data Table? section on quality of symbolic data). The class description by lists of sometimes weighted values, like bar charts, is obtained from ground categorical variables such as the ‘Nationality’ in Table 1. In many cases, the sum of the weights is


not constant, as, e.g., when the absolute and not the relative frequency is used.

Between the first level, where the ground population is described by standard numerical or categorical variables, and the second level, where classes are described by symbolic variables, there is an intermediate step. For example, if we consider the random variable $\mathrm{Age}_c$ which associates with each player of a team $c$ its age, we can then define a higher level random variable Age whose value on each team $c$ is $\mathrm{Age}_c$. Hence, this random variable Age is a random variable with random variable values and constitutes an intermediate step from which the symbolic variable values (e.g., the probability distributions of each team) are obtained.

How Do Random Variables of Random Variables Value and Model of Models Appear in SDA?
Basically, the process is the following: the ground population $X$, where $n$ individuals are described by a set $Y$ of $p$ standard variables, constitutes the initial data table. This table is transformed into a new data table $(X', Y')$, reduced in rows, of $k$ rows and $p'$ columns, where $X'$ is a set of $k$ classes and $Y'$ is a set of $p'$ new variables whose value on each class is a random variable. The random variable values of $Y'$ generally have different laws on the classes. For example, the random variables associated with the age of each team of players generally do not have the same law as, e.g., the players of a team can be younger, with mean age 22, than the players of another team with mean age 25. Hence, we obtain a new data table (of $k$ rows and $p'$ new variables) which contains, in each cell, a random variable defined on the individuals of its associated class.

This intermediate table containing a random variable in each cell leads to symbolic variables with empirical histogram or distribution values (or bar chart values in the case of categorical variables) but also to parametric models. Therefore, the question of finding models of models can arise. The Dirichlet model is a good example of such a model, as the associated random variable values can be probability density distributions. In that direction, several works have been recently developed (see Theoretical Development section).

How to Analyze a Symbolic Data Table?
For a symbolic data table (built from a ground population or native), tools have been developed in order to analyze and discover new knowledge from such tables. This analysis can be performed by developing specific tools or by extending known tools to symbolic data, such as descriptive statistics, PCA, regression, decision trees, clustering, dissimilarities, and the like. Several studies have already been performed in this direction in the four books (Refs 10–12 and 22). Several introductions have been given, e.g., in Ref 23, and nontechnical introductions in Refs 24–27. Several international SDA workshops [Vienna (2009), Namur (2011), Madrid (2012), Taiwan (2014), and Orléans (2015)] give the trends. More details on SDA theory and tools are given in Some Methods for the Analysis of Symbolic Data section, and directions of research in Some Direction of Research and Links with Specific Domains of Data Science section. Any new method extended to symbolic data can be studied in terms of stability and convergence when the ground population increases. This question is discussed in Theoretical Development section.

WHAT ARE THE PRINCIPLES AND THE THEORETICAL DEVELOPMENTS?

Some Principles

Class Facets Principle
This principle aims to consider each class of individuals as ‘a thing by itself’ to be described, like an object, by its different facets. As classes are considered as ‘objects,’ their symbolic descriptions are often called ‘symbolic objects’ in the SDA literature. The different facets include descriptive statistics of classes, considered as new standard or symbolic variables, but not only these: it is often useful to add variables (at the level of classes) that have no meaning at the level of individuals, e.g., the sponsoring company of a team or the team ‘region.’ It follows from this principle that the number of variables which describe the classes can be much larger than the number of initial ground variables. For example, the description of classes by a ground numerical variable such as the ‘age’ can induce several numerical variables, which associate with each class its mean, its median, and/or its mean square. This ground numerical variable can also induce symbolic descriptive variables whose value associated with each class is, e.g., the [Min, Max] interval, an interquartile interval or another percentile interval, a percentile list, a distribution, or a histogram. In the same way, a unique ground categorical variable can induce several symbolic variables where the value for each class can be a list of categories, a bar chart, and moreover an interval or a distribution when the categories are ordered.
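
The class facets idea can be sketched in a few lines: starting from a ground table with one numerical and one categorical variable, each class receives several standard and symbolic descriptors. This is a minimal sketch under stated assumptions, not the procedure of any SDA software: the player data are made up, and the names `describe_classes`, `histogram`, and the chosen bin edges are hypothetical.

```python
import statistics
from collections import Counter, defaultdict

# Made-up ground data: one row per player (individual).
players = [
    {"team": "Lyon", "age": 21, "nationality": "French"},
    {"team": "Lyon", "age": 24, "nationality": "French"},
    {"team": "Lyon", "age": 29, "nationality": "Spanish"},
    {"team": "Lyon", "age": 33, "nationality": "French"},
    {"team": "Paris", "age": 19, "nationality": "French"},
    {"team": "Paris", "age": 23, "nationality": "Italian"},
    {"team": "Paris", "age": 27, "nationality": "Italian"},
    {"team": "Paris", "age": 31, "nationality": "French"},
]


def histogram(values, edges):
    """Relative frequencies over half-open bins [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return [c / len(values) for c in counts]


def describe_classes(rows, class_key, edges):
    """Build one symbolic row per class from the ground rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[class_key]].append(row)
    table = {}
    for name, members in groups.items():
        ages = [m["age"] for m in members]
        q1, _, q3 = statistics.quantiles(ages, n=4)  # quartile cut points
        nat = Counter(m["nationality"] for m in members)
        table[name] = {
            "age_mean": statistics.mean(ages),              # standard variable
            "age_median": statistics.median(ages),          # standard variable
            "age_min_max": (min(ages), max(ages)),          # interval-valued
            "age_interquartile": (q1, q3),                  # interval-valued
            "age_histogram": histogram(ages, edges),        # histogram-valued
            "nationality_categories": sorted(nat),          # list of categories
            "nationality_bar_chart": {k: v / len(members) for k, v in nat.items()},
        }
    return table


symbolic_table = describe_classes(players, "team", edges=[15, 25, 35])
for team, description in symbolic_table.items():
    print(team, description)
```

Two ground variables yield seven class-level variables here, which is the point made above: the description of classes can be much wider than the ground table, while the number of rows drops from individuals to classes.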


Also, links between the ground variables inside the classes induce more descriptive symbolic variables, such as symbolic variables with joint distribution values. They also induce correlation and concordance values of different kinds (standard correlation, Spearman correlation, Kendall Tau concordance, and the like). For example, in the symbolic description of each team of players by their aggregated height, weight, and so on, we can add the correlation, within each class, between the height and the weight of the individuals of this class. This means that we can add to the symbolic data table a new (numerical) variable called ‘correlation between height and weight’ defined on the set of classes. Notice that adding such a variable at the ground level of individuals would be meaningless, as the correlation between two numbers defined by the height and the weight of a unique individual has no meaning.

More generally, if we have $p$ ground numerical variables, we can add to the $p$ symbolic variables obtained by aggregation from the ground data the $p(p-1)/2$ numerical variables expressing the correlation of each pair of ground variables for each class. In order to reduce the number of variables, different ways of selecting the variables of best explanatory and/or discriminatory power are given in How to Measure the Quality of Bar Chart Symbolic Variables, Classes, and Their Associated Symbolic Data Tables section.

Adding variables (e.g., correlation-valued variables) means that two classes are closer if the added values in their symbolic descriptions are close. As for the correlations, we can add inside the symbolic description of each class the most discriminating joint probabilities of categorical variables. Notice that at this level, a good copula model can save much computing time in the case of big data by calculating the joint distribution from the marginal distributions, which avoids calculating it from the ground data (see Figure 8). Moreover, other classical variables specific to the classes can be added (e.g., the amount of expenses of a team is specific to the team and not to the players).

Sometimes, other kinds of variables can be added to the description of a class by a transformation of symbolic data to other kinds of symbolic data. For example, functions (such as curves or time series) can be transformed into histograms by a wavelet series expansion (Ref 28) or into a set of coefficients by a Fourier series expansion. In the same way, estimated probability distributions can be transformed into other symbolic variable values such as histograms or a list of percentile values.

Variability Principle
This principle aims to take the variability inside classes into account by avoiding a simple reduction of symbolic data to classical data. Transforming symbolic data into numerical data is possible and can be useful but loses much information. For example, transforming descriptive vectors of $p$ intervals into vectors of $2p$ numbers associated with $2p$ Min- and Max-valued variables loses the information contained in their inherent hyperrectangle of $2^p$ vertices. In Figure 3, we give an example where two symbolic interval-valued variables $Y_1$ and $Y_2$ are first associated with four Min or Max numerical variables denoted $a_1$, $b_1$, $a_2$, and $b_2$. In this case, the graphical representation in the space defined by these four numerical variables associates a point to each class; hence, in that way, there is no variability inside the class. In a second way, each class is represented in a biplot on the two symbolic variables by its $2^2$ vertices. In that way, the variability of each class appears in terms of a rectangle associated with each class. Therefore, symbolic data cannot be transformed to standard data without losing some aspects of the symbolic variables and their expressed variability.

Specificity of the Individual Level Versus the Class Level Principle
This principle aims to clearly distinguish the ground level knowledge from the higher level knowledge of classes. This means that the ground data describing individuals and the symbolic data describing classes must each be considered with their own specificity. For example, a question like ‘for which players is the height higher than 1.80 m?’ has a meaning at the level of the ground data table describing players (i.e., individuals) but has no meaning at the level of teams (i.e., classes). In contrast, a question like ‘for which classes is the probability that a player is taller than 1.80 m higher than 0.9?’ has a meaning at the level of classes and not at the ground level of individuals. In order to build a decision tree, suppose we have ‘good classes’ and ‘bad classes.’ This implies that we also have ‘good individuals’ and ‘bad individuals.’ The binary answer to the first question can be used on the ground data describing individuals, and the answer to the second question can be used on the symbolic data describing classes. In that way, we can obtain two different decision trees: one is specific to the individuals and the other is specific to the classes. Nevertheless, an individual can be considered as a class reduced to a single individual and so, by using the decision tree on classes, we can say if a new individual is good or bad. However, a class cannot generally be considered as an individual and so cannot be identified in a decision tree on the individuals. There are some examples where curiously the allocation of new individuals considered as


[Figure 3: two panels. Left panel, ‘Numerical variables for symbolic data’: each class Ci is a row (a1i, b1i, a2i, b2i) of the four numerical variables a1, b1, a2, b2 and is drawn as a single point in the numerical space (four variables); inside points, no variability appears. Right panel, ‘Symbolic variables’: each class Ci has the description (Y1(Ci), Y2(Ci)) = ([a1i, b1i], [a2i, b2i]) and is drawn as a rectangle in the symbolic space (two variables, axes Y1 and Y2); inside rectangles, variability appears.]

FIGURE 3 | Graphical representation of the variability inside symbolic data by four numeric and two symbolic variables.
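
The contrast drawn in Figure 3 can be sketched numerically. In the sketch below, which uses made-up interval bounds and hypothetical function names (`to_min_max_vector`, `hyperrectangle_vertices`, `volume`), the same class is first flattened into 2p numbers, giving a single point with no extent, and then kept as p intervals, giving a hyperrectangle with 2^p vertices and a nonzero volume.

```python
from itertools import product


def to_min_max_vector(intervals):
    """Flatten p intervals into 2p numbers (a1, b1, ..., ap, bp): a single point."""
    return [bound for interval in intervals for bound in interval]


def hyperrectangle_vertices(intervals):
    """Enumerate the 2**p vertices of the hyperrectangle spanned by p intervals."""
    return list(product(*intervals))


def volume(intervals):
    """Product of the interval lengths; zero when some interval is degenerate."""
    result = 1.0
    for lo, hi in intervals:
        result *= hi - lo
    return result


# Made-up class C_i described by two interval-valued variables Y1 and Y2.
class_i = [(1.0, 3.0), (10.0, 14.0)]

point = to_min_max_vector(class_i)            # one point in a 4-dimensional space
vertices = hyperrectangle_vertices(class_i)   # rectangle in the (Y1, Y2) plane

print(point)
print(vertices)
print(volume(class_i))
```

The flattened vector carries the same four numbers, but any method that treats it as an ordinary point no longer sees that they bound a region; the vertex list and the volume make that region explicit.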

classes has given better results than the decision tree on individuals (see Ref 29). When this happens is an open question. Notice that in the case of complex data (see the power plant cooling towers example in Application to Complex Data section), only the decision tree on classes has a meaning. Also, in the case of a huge ground population, the decision tree on individuals becomes very large and difficult to interpret. On the other hand, the decision tree on classes is much smaller and so easier to interpret.

Interpretation Principle
In the interpretation of the symbolic descriptions, we have to verify (before drawing wrong conclusions) independence between the descriptive distributions and other descriptive variables. If, e.g., the frequency of a category of a symbolic variable (with bar chart values) is higher for one class than for another class, we have to verify that this is not the effect of other descriptive symbolic variables. For example, for classes defined by regions and described by distributions of different antibiotic strategies on inhabitants, these distributions can depend on sociodemographic variables of each region. One way to validate the distributions of the strategies is to compare them inside each of the homogeneous clusters obtained on the ground population described by only the sociodemographic variables. That kind of situation has been studied in medical epidemiology (Ref 30).

Generalization Principle
Often SDA tools generalize standard tools to symbolic data. When an SDA tool generalizes a standard one, this principle draws attention to the fact that its associated software can be applied to standard data too, as they are a special case of symbolic data.

Explanatory Power Versus Discriminatory Power Principle
This principle aims to clearly distinguish the ‘explanatory power’ from the ‘discriminatory power’ of a symbolic variable. Basically, the ‘explanatory power’ of a symbolic variable is a measure of the differences between the classes (i.e., between their symbolic values for this variable). The higher these differences are, the higher is the explanatory power of the symbolic variable. Hence, the explanatory power of a symbolic variable is nil when there are no differences between the classes. This is coherent with the fact that a variable is of no help in explaining the classes when their symbolic values, for this variable, are all the same.

The ‘discriminatory power’ of a symbolic variable $Y$ associated with a ground categorical variable $y$ is a measure of the differences between the categories of $y$, considered as classes described in terms of symbolic data by a symbolic variable, the so-called ‘class variable,’ whose categories are the classes. The higher these differences are, the higher is the discriminatory


power of a symbolic variable. Hence, the discriminatory power of a symbolic variable is nil when there are no differences between the categories of $y$ in their symbolic description of the classes. In other words, this means that a symbolic variable $Y$ is of no help in discriminating the classes when all the symbolic values of the symbolic class variable, describing the categories of $y$, are the same.

We illustrate these two notions with two symbolic variables based on two kinds of frequencies, roughly defined by the conditional probabilities $\Pr(y_j \mid C_i) = f_{ij}$ and $\Pr(C_i \mid y_j) = g_{ij}$, where $y_j$ is a category of the ground variable $y$.

More precisely, let $y$ be a ground categorical variable of $m$ categories denoted $y = (y_1, \ldots, y_m)$ and let $C = \{C_1, \ldots, C_k\}$ be the set of $k$ classes of individuals, with $\mathrm{card}(C_i) = n_i$. Let $n_{ij}$ be the number of individuals of the category $y_j$ in the class $C_i$.

A bar chart symbolic-valued variable, called $Y_h$, associated with the ground variable $y$ and defined on $C$, is such that $Y_h(C_i) = (f_{i1}, \ldots, f_{im})$ is a vector of $m$ frequencies $f_{ij} = n_{ij}/n_i$. This vector defines a bar chart, as we have $\sum_{j=1}^{m} f_{ij} = 1$. The vector $Y_h(C_i)$ is ‘horizontal’ in the symbolic data table. Therefore, we say $Y_h$ is a ‘horizontal bar charts variable.’ Notice that, given an individual and its class $C_i$, we can associate to this individual its best explanatory category, which is the one maximizing $f_{ij}$ for $j = 1, \ldots, m$. This best frequency tends to become higher when the explanatory power of the variable $Y_h$ increases. This comes from the fact that the larger the distance between two bar charts, the more the frequencies in each bar chart are contrasted (more large and small values) and concentrated (more 0 values) on some categories (which differ as much as possible from one bar chart to the other).

Now we define another symbolic variable, also associated with the ground variable $y$ and called $Y_v$, which induces ‘vertical’ bar charts in the symbolic data table. This symbolic variable, defined on $C$, is such that $Y_v(C_i) = (g_{i1}, \ldots, g_{im})$ is a vector of $m$ frequencies $g_{ij} = n_{ij}/N_j$, where $N_j$ is the total number of individuals taking the value $y_j$ in the ground population. Hence, to each category $y_j$ of $y$ is associated the vector $G_j = (g_{1j}, \ldots, g_{kj})^T$, which is a vertical bar chart, as we have $\sum_{i=1}^{k} g_{ij} = 1$. The sub data table $Y'_I = (Y_v(C_1), \ldots, Y_v(C_k))^T$ of $k$ rows and $m$ columns is identical to the vector $G = (G_1, \ldots, G_m)$ of $m$ ‘vertical’ bar charts in the symbolic data table. Therefore, we say that $Y_v$ is a ‘vertical bar charts variable.’

The explanatory (resp. discriminatory) power of $Y_h$ (resp. $Y_v$) can then be measured, e.g., by a sum of the $L_1$ distances between the horizontal (resp. vertical) bar charts, taken two by two. This criterion is detailed in How to Measure the Quality of Bar Chart Symbolic Variables, Classes, and Their Associated Symbolic Data Tables section. Hence, if the horizontal (resp. vertical) bar charts are all the same, the explanatory (resp. discriminatory) power value is 0. The more the horizontal (resp. vertical) bar charts differ, the higher the explanatory (resp. discriminatory) power value is. We can call $f_{ij}$ (resp. $g_{ij}$) the explanatory (resp. discriminatory) value of the category $y_j$ for the class $C_i$ (resp. of the class $C_i$ for the category $y_j$). Notice that, given an individual and its class $C_i$, we can associate to this individual its category of best explanatory value, which is the one maximizing $f_{ij}$ for $j = 1, \ldots, m$. In the same way, given an individual and its category $y_j$, we can associate to this individual its class of best discriminating value, which is the one maximizing $g_{ij}$ for $i = 1, \ldots, k$. This best frequency tends to become higher when the discriminatory power of the variable $Y_v$ increases. Hence, given a new individual whose category is known for the variable $y$, we can associate it to the class of best explanatory value, to the class of best discriminatory value, or to the class of best combination of these two values. Such a combination between the explanatory and discriminatory power of a ground categorical variable and a given set of classes will be seen in How to Measure the Quality of Bar Chart Symbolic Variables, Classes, and Their Associated Symbolic Data Tables section.

More generally, instead of $f_{ij}$ (or $g_{ij}$), we can use in the same way other values depending on $f_{ij}$ (or $g_{ij}$), such as the tf or the tf–idf value (see, e.g., Ref 31).

Coming back to Table 1, we can see that the symbolic variable ‘Age bar chart’ is a ‘horizontal bar chart symbolic variable’ denoted $Y_h$, as it associates to each team a horizontal bar chart. For example, the age category of best explanatory value for an individual of the Lyon team is the interval [20, 25] (which is the category denoted (1)), as 30% is the highest frequency of the categories of age. We can define a ‘vertical symbolic variable’ by adding to the numerical variable ‘Frequency of French among all French’ (which leads to a vertical bar chart, as shown in Table 1) other such numerical variables obtained in the same way by considering the other nationalities. More precisely, given a ground variable denoted $z$ which associates to any individual its nationality, we can then define a ‘vertical bar charts variable’ $Z_v$ whose value for each class is a vector of $m$ numbers if there are $m$ nationalities. Each of these numbers is the frequency of a nationality of the players of a team, among all the players having this


nationality. This symbolic variable $Z_v$ leads to m vertical bar charts associated with m nationalities. For example, the team of best discriminatory value for an individual having the French nationality is Paris, with frequency 30%. Given a new player of French nationality, we can allocate him to the team of highest explanatory value (i.e., with the highest proportion of French players in the team) or of highest discriminatory value (i.e., with the greatest proportion of French players among all French players). Notice that if all the teams have the same number of players, then the same team maximizes the two values.

Can we say that, among all the ground categorical variables, the variable y which induces the symbolic variable $Y_h$ with the higher explanatory power is also the variable which induces the variable $Y_v$ with the higher discriminatory power? In the How to Measure the Quality of Bar Chart Symbolic Variables, Classes, and Their Associated Symbolic Data Tables section, we show that this is not always the case and we give eight rules relating these two kinds of power.

How to Measure the Quality of Bar Chart Symbolic Variables, Classes, and Their Associated Symbolic Data Tables

Having obtained the symbolic data table from standard or complex ground populations, an important question is how to measure the quality of the symbolic variables and of their associated symbolic data tables. First (in the Quality Criteria Based on the Symbolic Variables section), we define criteria based on the ‘quality’ of the symbolic variables, measured by their explanatory and discriminatory power. Second (in the Quality Criteria of Classes and Variables Based on the Cells of the Symbolic Data Table section), based on the quality of each cell, we define the quality of the classes, of the symbolic variables, and of the symbolic data table.

Quality Criteria Based on the Symbolic Variables

In this section, we define criteria in order to measure the quality of a symbolic data table. These criteria are based on the explanatory and discriminatory power of the symbolic variables. We focus on bar chart-valued variables as they can be obtained directly from the ground data table or from other kinds of symbolic data. The first step is therefore to build the bar chart-valued variables from the ground classical variables in a ‘good way.’ Diday et al.32 give a way of building these symbolic variables by a discretization process which can optimize their explanatory or discriminatory power. The second step is to select the best symbolic variables by using criteria based on their explanatory and/or discriminatory power. Then, a natural question appears: can we say that the variable with the higher explanatory power is also the variable with the higher discriminatory power? In this section, we show that this is not always the case and we give eight rules relating these two kinds of power in the case of ground binary variables and a partition with two classes. The quality criterion defined in this case can then be easily extended to the case of multicategorical variables and to a partition of more than two classes.

Let $X = (X_1, X_2)$ and $U = (U_1, U_2)$ be the partitions on the ground population induced by two ground binary variables. The numbers of individuals of $X_i$ and $U_i$ in the ground population are denoted $x_i = |X_i|$ and $u_i = |U_i|$. We also set $a_u = |U_1 \cap X_1|$, $b_u = |X_2 \cap U_1|$, $d_u = |U_2 \cap X_1|$. Thus, we have $x_1 = d_u + a_u$, $u_1 = a_u + b_u$, and $u_1 + u_2 = x_1 + x_2 = X$, where X here also denotes the total number of individuals.

Hence, we obtain the symbolic data tables of Figure 4, where the frequency of the class $U_1$ in the class $X_1$ (resp. $X_2$) is $f_{11} = |X_1 \cap U_1| / |X_1| = a_u / x_1$ (resp. $f_{21} = |X_2 \cap U_1| / |X_2| = b_u / x_2$). In the same way, we obtain the following frequencies: $g_{11} = |X_1 \cap U_1| / |U_1| = a_u / u_1$ and $g_{21} = |X_1 \cap U_2| / |U_2| = d_u / u_2$.

Notice that, in the first data table of Figure 4, U is considered as a symbolic variable and X as a class variable. In the second data table, X is considered as a symbolic variable and U as a class variable.

The L1 distances between classes and between categories are denoted $D_{X/U}(X_1, X_2)$ and $D_{U/X}(U_1, U_2)$ and are defined as follows:

$$D_{X/U}(X_1, X_2) = \frac{1}{2} \sum_{j=1,2} |f_{1j} - f_{2j}|$$

expresses the explanatory power of the symbolic variable U for the class variable X, and

$$D_{U/X}(U_1, U_2) = \frac{1}{2} \sum_{j=1,2} |g_{1j} - g_{2j}|$$

expresses the discriminatory power of the symbolic variable X for the class variable U. We can see that both distances vary between 0 and 1.

In the following, instead of U we use the variables Y and Z, and instead of X we use the variable C. Therefore, e.g., we have $f_{11} = |C_1 \cap Y_1| / |C_1| = a_y / c_1$, $f_{21} = |C_2 \cap Y_1| / |C_2| = b_y / c_2$, and so on.

Example. The symbolic data tables C/Y, C/Z, Y′/C, Z′/C are given in Figure 5. By comparing the

X\U | U1 | U2
X1 | au/x1 | 1 − au/x1
X2 | bu/x2 | 1 − bu/x2

U\X | X1 | X2
U1 | au/u1 | 1 − au/u1
U2 | du/u2 | 1 − du/u2

FIGURE 4 | The tables X/U and U/X.
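As a minimal sketch of the definitions above, the two tables of Figure 4 and their L1 distances can be computed from two binary ground variables. The function names, the 0/1 coding (1 for the first class or category, 0 for the second), and the six-individual data set are illustrative assumptions, not part of the original method; the sketch also assumes that no class is empty.

```python
from fractions import Fraction

def bar_chart_table(class_var, symb_var):
    """Return the 2 x 2 table of Figure 4: one row per class, one column per category.

    Both variables are lists of 0/1 codes; code 1 stands for the first class or
    category (X1, U1) and code 0 for the second (X2, U2).
    """
    table = []
    for cls in (1, 0):
        members = [s for c, s in zip(class_var, symb_var) if c == cls]
        first = sum(members)  # individuals of this class in the first category
        table.append((Fraction(first, len(members)),
                      Fraction(len(members) - first, len(members))))
    return table

def l1_distance(table):
    """Half the sum of absolute differences between the two rows (bar charts)."""
    row1, row2 = table
    return sum(abs(p - q) for p, q in zip(row1, row2)) / 2

# Hypothetical ground data: six individuals described by two binary variables.
x = [1, 1, 1, 0, 0, 0]
u = [1, 1, 0, 1, 0, 0]

table_xu = bar_chart_table(x, u)   # classes X1, X2 described by U
table_ux = bar_chart_table(u, x)   # classes U1, U2 described by X
d_xu = l1_distance(table_xu)       # explanatory power of U for X
d_ux = l1_distance(table_ux)       # discriminatory power of X for U
print(table_xu, d_xu)
print(table_ux, d_ux)
```

Exact fractions are used so that the cells can be compared directly with the symbolic entries au/x1, bu/x2, au/u1, and du/u2 of Figure 4.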


distances $D_{C/Y}(C_1, C_2) = 1$ and $D_{C/Z}(C_1, C_2) = 0$, we see that the explanatory power of Y is higher than that of Z (in fact, it is the highest possible for Y and the lowest possible for Z). In the same way, by comparing the distances $D_{Y'/C}(Y'_1, Y'_2) = 0.1$ and $D_{Z'/C}(Z'_1, Z'_2) = 0.3$, we see that the discriminatory power of Z′ is higher than that of Y′.

In Ref 33, it is shown that eight rules (see Figure 6) relating the explanatory and discriminatory power can be deduced from the following two results:

$$\frac{D_{C/Y}}{D_{C/Z}} = \frac{y_1 y_2}{z_1 z_2} \, \frac{D_{Y/C}}{D_{Z/C}} \tag{1}$$

and

$$\frac{D_{Y/C}}{D_{Z/C}} = \frac{z_1 z_2}{y_1 y_2} \, \frac{D_{C/Y}}{D_{C/Z}} \tag{2}$$

Examples. We give two examples. In the first one, the same variable has simultaneously the best explanatory and the best discriminatory power. In the second example, this is not the case.

In the first example, the ground data table is given in Figure 7, where the number of individuals is c = 7 and $c_1 = 4$, $c_2 = 3$, $y_1 = 3$, $y_2 = 4$, $z_1 = 2$, $z_2 = 5$. From this data table, it also follows that $a_y = 2$, $b_y = 1$ for the variable Y and $a_z = 1$, $b_z = 1$ for the variable Z. From these, we deduce that $D_{C/Z} = 1/12 < D_{C/Y} = 1/6$, and from Eq. (2) we obtain:

$$\frac{D_{Y/C}}{D_{Z/C}} = \frac{z_1 z_2 \, D_{C/Y}}{y_1 y_2 \, D_{C/Z}} = 2 \times \frac{10}{12} = \frac{5}{3}.$$

Therefore, $D_{C/Z} < D_{C/Y}$ and $D_{Z/C} < D_{Y/C}$, with $y_1 y_2 = 12 > 10 = z_1 z_2$. In other words, in this case the more explanatory variable Y is also the more discriminatory one.

In the second example, we have c = 100, $c_1 = 80$, $c_2 = 20$, $a_y = 9$, $b_y = 3$, $y_1 = 12$, $y_2 = 88$, $a_z = 39$, $b_z = 11$, $z_1 = 50$, $z_2 = 50$. Then, we obtain $D_{C/Y} / D_{C/Z} = 3/5 < 1$ and $D_{Y/C} / D_{Z/C} = (z_1 z_2 / y_1 y_2) \, D_{C/Y} / D_{C/Z} = 125/88 > 1$. It follows that $D_{C/Y}(C_1, C_2) < D_{C/Z}(C_1, C_2)$ and $D_{Y/C}(Y_1, Y_2) > D_{Z/C}(Z_1, Z_2)$. Hence, in this case the more explanatory variable Z is not the more discriminatory one.

In order to select symbolic variables with a ‘good’ explanatory and discriminatory power, we can define a selection criterion denoted S(Y), to be maximized, as follows:

$$S(Y) = \sum_{Z \in V} \frac{D_{Y/C}}{D_{Z/C}} \times \frac{D_{C/Y}}{D_{C/Z}}.$$

Then, from Eq. (2), we obtain:

$$S(Y) = \sum_{Z \in V} \frac{z_1 z_2}{y_1 y_2} \left( \frac{D_{C/Y}}{D_{C/Z}} \right)^2.$$

Hence, the larger this criterion, the larger the explanatory and discriminatory power of the variable Y compared with the other variables Z belonging to a given set of variables V.

In the more general case of a class variable with a partition in more than two classes and of symbolic variables with more than two categories, this criterion can be extended by summing over all the classes of categories divided in two parts for each variable and over all the pairs of classes of the partition.

By introducing the entropy, a criterion measuring the explanatory and discriminatory power of a bar chart symbolic variable can be defined in the following way:

$$W(Y) = a \, \frac{ID_{C/Y}}{1 - \mathrm{Entr}(C/Y)} + b \, \frac{ID_{Y/C}}{1 - \mathrm{Entr}(Y/C)} \quad \text{with } a + b = 1$$

C/Y | Y1 | Y2
C1 | 1 | 0
C2 | 0 | 1

C/Z | Z1 | Z2
C1 | 1/3 | 2/3
C2 | 1/3 | 2/3

Y′/C | C1 | C2
Y′1 | 0.6 | 0.4
Y′2 | 0.5 | 0.5

Z′/C | C1 | C2
Z′1 | 0.3 | 0.7
Z′2 | 0 | 1

FIGURE 5 | The explanatory power of Y is much higher than that of Z, and the discriminatory power of Z′ is higher than that of Y′.

If | then $D_{Z/C} < D_{Y/C}$ | then $D_{Z/C} > D_{Y/C}$
$D_{C/Z} < D_{C/Y}$ and $y_1 y_2 \le z_1 z_2$ | True | False
$D_{C/Z} < D_{C/Y}$ and $y_1 y_2 > z_1 z_2$ | Possible | Possible
$D_{C/Z} > D_{C/Y}$ and $y_1 y_2 < z_1 z_2$ | Possible | Possible
$D_{C/Z} > D_{C/Y}$ and $y_1 y_2 \ge z_1 z_2$ | False | True

FIGURE 6 | The first cell of this table means that if $y_1 y_2 \le z_1 z_2$ and Y has a better explanatory power than Z, it also has a better discriminatory power than Z.

i | Y | Z | C
i1 | 1 | 0 | 1
i2 | 1 | 0 | 1
i3 | 1 | 1 | 0
i4 | 0 | 1 | 1
i5 | 0 | 0 | 1
i6 | 0 | 0 | 0
i7 | 0 | 0 | 0

FIGURE 7 | The ground data table where seven individuals are described by three binary variables.
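The first example can be checked numerically from the ground data table of Figure 7. The sketch below recomputes the four distances, the right-hand side of Eq. (2), and the selection criterion S(Y) for the set V reduced to the single variable Z. The helper names are illustrative assumptions, and the code assumes that no class is empty and that no distance in a denominator is zero.

```python
from fractions import Fraction

# Ground data table of Figure 7: (Y, Z, C) for the individuals i1..i7.
ROWS = [(1, 0, 1), (1, 0, 1), (1, 1, 0), (0, 1, 1), (0, 0, 1), (0, 0, 0), (0, 0, 0)]
Y, Z, C = (list(col) for col in zip(*ROWS))

def freq(symb_var, class_var, cls):
    """Frequency of category 1 of symb_var inside the class class_var == cls."""
    members = [s for s, c in zip(symb_var, class_var) if c == cls]
    return Fraction(sum(members), len(members))

def distance(class_var, symb_var):
    """L1 distance between the bar charts of the two classes of class_var.

    With two categories, (|f11 - f21| + |f12 - f22|) / 2 reduces to |f11 - f21|.
    """
    return abs(freq(symb_var, class_var, 1) - freq(symb_var, class_var, 0))

def size_product(var):
    """Product of the two category sizes, e.g. y1 * y2."""
    return sum(var) * (len(var) - sum(var))

def selection_criterion(var, others, class_var):
    """S(var): sum over the other variables of the product of the two ratios."""
    return sum((distance(var, class_var) / distance(other, class_var))
               * (distance(class_var, var) / distance(class_var, other))
               for other in others)

d_cy, d_cz = distance(C, Y), distance(C, Z)   # explanatory powers
d_yc, d_zc = distance(Y, C), distance(Z, C)   # discriminatory powers

# Right-hand side of Eq. (2).
rhs = Fraction(size_product(Z), size_product(Y)) * d_cy / d_cz

s_y = selection_criterion(Y, [Z], C)          # V = {Z} is an assumption
print(d_cy, d_cz, d_yc, d_zc, rhs, s_y)
```

The ratio of the discriminatory powers computed directly from the table coincides with the value 5/3 given by Eq. (2).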


where:

• the parameters a and b give more or less importance to the explanatory or to the discriminatory power. They can be learned from examples.
• $ID_{X/U}$ is the sum of the L1 distances between all the pairs $\{X_i, X_j\}$ of the given partition C of the ground population.
• $\mathrm{Entr}(C/Y) = \sum_{i,j} p_{ij} \log p_{ij}$, where $p_{ij}$ is the probability of the category j of the symbolic variable Y for the class $C_i$ of the partition C. The use of the entropy in the criterion improves the explanatory and discriminatory power by increasing the concentration (more 0 values) and contrast (more large and small values) inside the bar charts of the symbolic data tables C/Y and Y/C.

We can define a criterion which measures the quality of a symbolic data table of bar chart symbolic variables by summing the criterion S or W over all these symbolic variables.

Quality Criteria of Classes and Variables Based on the Cells of the Symbolic Data Table

Much work has long been done on robust intervals, which can be useful to measure the robustness of a symbolic interval-valued variable; see, e.g., Refs 34 and 35. The quality of the inferred distributions contained in the cells of the obtained symbolic data table can also be measured by classical model selection criteria such as the Bayesian Information Criterion (BIC), Minimum Description Length (MDL), Akaike's Information Criterion (AIC), Minimum Message Length (MML), or other criteria of this kind based on likelihood estimation. The ‘test value,’ developed in Ref 36, may also be used to measure the quality of a category in a bar chart contained in a cell. The quality of a cell can be considered to be the quality of its category of best quality. Then, by summing the quality of all the cells of each row (resp. column), we obtain a quality of the classes (resp. variables). In the same way, by summing the quality of all the cells, we obtain a quality of the symbolic data table.

A simultaneous quality (related to Galois lattices, see Unsupervised Classification Extended to Symbolic Data section) of a class C and a symbolic variable Y can be measured by the fit between the class C and the extent of its intent, defined by the symbolic description of the class C by this variable Y. Another way is to use the accuracy criterion of a ‘symbolic rough set’ associated with a class C described by Y (see Links with Specific Domains of Data Science section).

Theoretical Development

Theoretical developments have been carried out and continue in that direction, in order to extend standard methods of statistics and data mining to symbolic data; but certainly, much remains to be done. This will be seen in the Symbolic Data Analysis Software and Some Direction of Research and Links with Specific Domains of Data Science sections, where several directions of research are suggested.

The SDA framework can enhance some known mathematical frameworks. We give here three examples: random variables with random variable values, Galois lattices, and copulas.

The framework of random variables whose values are themselves random variables (of different laws) appears in SDA through the process explained in the How to Go from Standard to Symbolic Data? section. This leads to symbolic variables with distribution values, whose dissimilarities have been widely studied, e.g., by Paul Lévy (1925), Kullback and Leibler (1951), Hellinger (1977), and so on. In the SDA framework, these measures have to be extended from a unique symbolic variable of distribution values to multidimensional tables of symbolic variables of distribution values and, simultaneously, to other kinds of symbolic values as shown in Tables 1 and 2.

The lattice framework37 can be extended to the case of symbolic variables (see Refs 38 and 39), and to the stochastic case by using Choquet capacities as in Refs 40 and 41. More details in that direction are given in the Unsupervised Classification Extended to Symbolic Data section.

The copulas framework (see, e.g., Ref 42) is useful in SDA, which extends the field of application of this theory. For example, copula models are needed for the biplot of histogram-valued variables as shown in Figure 8. Copula models are used here in order to reduce calculations and to avoid the false independence assumption that the joint probability is the product of the marginal probabilities. Also, copulas have been used for the classification of distributions43,44 and can be used in PCA as suggested in Ref 45 or in regression as shown in Ref 46.

Theoretical developments more specific to the SDA framework are needed. For example, convergence properties have to be proved. We say that a method converges when there exists an integer N


large enough such that, for any n larger than N, the result of the method does not change. For example, let M(n, k) be an SDA method where k is the number of classes obtained on n initial individuals of the ground population. It would be interesting to show, e.g., that:

• If the k classes are fixed and n tends toward infinity, then M(n, k) converges toward a stable position.

An example of such convergence has been given in Refs 40, 41 in the case where M(n, k) is a method building a Galois lattice (see Unsupervised Classification Extended to Symbolic Data section) from symbolic data defined by vectors of distributions.

• If k increases until obtaining a single individual by class, then M(n, k) converges toward a standard method.

For example, a symbolic PCA on vectors of intervals contains, as a special case, a standard PCA on standard numerical vector data.

• If k and n increase simultaneously toward infinity, then M(n, k) converges toward a stable position.

This is the case of sequential data, such as data streams, where a one-pass algorithm can be used (see, e.g., Ref 47).

Assume that the k laws describing the k classes are of a certain type, say T1, and that they depend on a parameter h. We consider the k laws as a sample of a T1-law-valued random variable and we also assume that each of these laws converges to a law of type T2 as h goes to 0. This is the case, e.g., of histograms built with a bandwidth h, which can be considered as discrete type laws. As h goes to 0, such a histogram generally converges toward a law having a density. The question that arises is whether method M(n, k) applied to a T1-law-valued random variable is consistent as h goes to 0 and whether its limit is the result of a similar method applied to a T2-law-valued random variable. Such a convergence has been proved in the stochastic lattice case of distributional symbolic data in Refs 39 and 40, where it has been shown (in a probabilistic and capacities context) that when the size n of a sample of the ground population increases and h goes to 0, the sequence of Galois lattices Gn built on this sample converges toward a lattice G.

Another example is the case of the mixture decomposition problem.48–52 It was proved that a mixture of Dirichlet distributions applied to histograms built with a bandwidth h converges to a mixture of Dirichlet processes as h converges to 0.

FIGURE 8 | The biplot of histogram-valued variables needing copulas models.

WHAT IS THE SDA APPLICABILITY DOMAIN?

Application to Standard Data Tables Transformed into Symbolic Data Table

The application domain concerns any kind of data expressing the variability of individuals inside classes considered as new units to be studied. Therefore, the application domain often concerns standard data, as in many cases users are interested not only in the study of individuals of a ground population but also in the complementary study of classes of individuals induced by given categorical variables and their Cartesian product, or obtained by a clustering process. We can mention applications in official statistics for people’s life values and trust components in Europe, see Refs 6–8; in marketing for recommendation, see Ref 53; in medicine, for acute myocardial infarction in Ref 54 and radiation therapy in Ref 55. In biology, Fablet et al.56 have studied pig illnesses in veterinary epidemiology, where pigs were the individuals and classes were pig farms. Also, SDA has been applied in text mining in Ref 57, where the individuals were words and the classes were themes. For forecasting of electrical power demand, see Ref 58, and of solar radiation, see also Ref 59; for the sterling-dollar exchange rate, see Ref 60; for the stock market in China, see Refs 61 and 62; for mutual fund rating, see Ref 63; for electrochemical characterizations of reinforced concrete corrosion, see Ref 64; for structural modification assessment, see Ref 65.

Application to Complex Data

Complex data considered here are the case where each individual is described by several different ground populations of different sizes, themselves


described by different variables. The building process of the symbolic data table is based on the following principle, sometimes called ‘fusion process.’ First, each class induced by the class variable (on individuals) is aggregated by using the standard variables associated with each ground population. These aggregations yield a symbolic description for each class and are ‘vertically’ concatenated (see Figure 9). More precisely, as the individuals of the ground populations are in rows, this aggregation is called ‘vertical’ and yields a symbolic description of all the classes for each population and its associated set of variables. Then, a horizontal concatenation of the symbolic variable descriptions associated with each ground population yields the final symbolic data table.

Example of industrial application. We illustrate this process by an industrial application for the study of the degradation problems occurring on nuclear power plant cooling towers (see Refs 4, 5 and 66). In order to simplify, we consider here a description of each tower by only two standard data tables of different ground populations, themselves described by different variables (see Figure 9). The population of the first standard data table is a set of cracks described by their length, thickness, orientation, and so on, for each tower. The second population is a set of corrosion positions described by corrosion variables. In order to compare the towers’ degradation, several aggregation and concatenation (vertical and horizontal) processes are applied. They consist first of an aggregation process applied to the two data tables associated with each tower. This aggregation leads to a symbolic description of each tower for each population of cracks and corrosion. Then, a vertical concatenation followed by a horizontal concatenation of the symbolic descriptions of each tower for each population yields the final symbolic data table. A graphical illustration of the kind of symbolic data table obtained after the fusion process is given in Figure 9. The industrial results were: a statistical evaluation of the towers’ degradation, highlighting atypical or abnormal values for each measurement through the symbolic data table representation; a graphical overview of the towers by a PCA extended to symbolic data (see SYR Software section); the clusters; the network; the correlations between the symbolic variables of cracks and corrosion (which had no meaning at the ground level) and with the factorial axes; and a ranking of the towers expressing their degree of degradation, leading to a reduction of the number of sensors.

SOME METHODS FOR THE ANALYSIS OF SYMBOLIC DATA

Extended Standard Statistics Methods to Symbolic Data

Descriptive Statistics

Descriptive statistics of symbolic data (such as mean, variance, and covariance) are studied in Refs 11 and 67–69. In these studies, univariate (mean, variance, standard deviation, etc.) and bivariate statistics (covariance and correlation) are extended to interval and to histogram-valued variables. Covariance extensions can also be found in Refs 11 and 68–72.

Principal Component Analysis

In the case of interval-valued variables, each symbolic object can be considered as a hyperrectangle. In SDA, the aim is to reduce the number of symbolic variables

[Figure 9 panels: for each of Towers 1 to n, the crack table (crack variables) and the corrosion table (corrosion variables) are aggregated; the aggregated descriptions of Towers 1 to n are vertically concatenated; the crack and corrosion variables are then horizontally concatenated into the symbolic data table.]

FIGURE 9 | Building a symbolic data table from several ground populations described by different sets of variables and a unique class variable.
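The fusion process illustrated in Figure 9 can be sketched as follows. This is only an illustration under stated assumptions: the function names `aggregate` and `fuse`, the variable names, and the measurements are hypothetical; numerical ground variables are summarized by [min, max] intervals and categorical ones by bar charts of frequencies, which are just two of the possible symbolic descriptions; and the variable names of the different ground populations are assumed to be distinct.

```python
from collections import Counter, defaultdict

def aggregate(rows, class_key):
    """Aggregate one ground population into one symbolic description per class."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[class_key]].append(row)
    descriptions = {}
    for cls, members in groups.items():
        desc = {}
        for var in members[0]:
            if var == class_key:
                continue
            values = [m[var] for m in members]
            if all(isinstance(v, (int, float)) for v in values):
                desc[var] = (min(values), max(values))   # interval value
            else:
                counts = Counter(values)                 # bar chart value
                desc[var] = {cat: n / len(values) for cat, n in counts.items()}
        descriptions[cls] = desc
    return descriptions

def fuse(*descriptions):
    """Horizontally concatenate the symbolic descriptions of each class."""
    table = defaultdict(dict)
    for description in descriptions:
        for cls, desc in description.items():
            table[cls].update(desc)
    return dict(table)

# Hypothetical ground populations: cracks and corrosion positions of two towers.
cracks = [
    {"tower": "T1", "length": 1.2, "orientation": "vertical"},
    {"tower": "T1", "length": 0.4, "orientation": "horizontal"},
    {"tower": "T2", "length": 2.0, "orientation": "vertical"},
]
corrosion = [
    {"tower": "T1", "depth": 3.0},
    {"tower": "T2", "depth": 1.0},
    {"tower": "T2", "depth": 5.0},
]

symbolic_table = fuse(aggregate(cracks, "tower"), aggregate(corrosion, "tower"))
for tower, description in symbolic_table.items():
    print(tower, description)
```

Each row of `symbolic_table` describes one tower by symbolic values coming from both ground populations, although the two populations have different sizes and different variables.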


by obtaining new hyperrectangles in a reduced space of symbolic variables. The basic idea is to consider the means of the intervals or the vertices of this hyperrectangle as new individuals on which a standard PCA can be applied. As usual in PCA, the natural aim is to reduce the number of variables. In this way, different strategies have been tried and developed in Refs 73–75. Le-Rademacher76,77 uses the entire interval rather than just the vertices. Kosmelj et al.78 expand on this. Le-Rademacher and Billard79 extend their approach to histogram-valued data.

In the case of bar chart-valued variables, several approaches have been developed, e.g., by extending standard covariance matrices to symbolic data in Refs 70, 72, and 80–83. Le-Rademacher76 obtains polytopes as output instead of hyperrectangles.

The case of mixtures of several kinds of symbolic variables (intervals, distributions, etc.) has been considered in two different ways: based on percentiles as in Refs 81 and 84, where the nonordered categories are ordered by their frequencies, or based on ‘metabins’ shortest pathways in Ref 45.

Regression, Canonical Analysis, and Forecasting

Standard regression has been extended in different ways to symbolic data. In the case of interval-valued variables, see Refs 85–89. In the case of histogram-valued variables, see Refs 71 and 90–95. Symbolic regression with different kinds of constraints has been developed in Refs 11, 96, and 97. Ref 46 is copulas-based. Canonical analysis has been extended to symbolic data in different ways by Lauro et al.98 and Tenenhaus et al. (unpublished data). Multidimensional scaling has been extended to the case of interval-valued variables by Groenen et al.99 and Terada and Yadohisa.100

Forecasting in the case of interval series has been developed in Refs 58, 60, and 101–103. Histogram time series have been studied in Refs 58 and 104.

Supervised Classification Extended to Symbolic Data

Decision trees have been studied in the case of different kinds of symbolic data in Refs 29 and 105–108.

An extension of association rules tools to symbolic data can be found in Ref 109, where, e.g., the units are the customers considered as classes of transactions instead of, as usual, the transactions themselves.

The symbolic histogram-valued variables obtained from numerical ground variables discriminate and identify these classes more or less well, depending on the discretization process (used on the variables of the initial standard data table). This process can be optimized in order to maximize this discrimination (see Ref 32).

Discriminant analysis has been developed by Silva and Brito,110 Appice et al.,111 and Duarte Silva and Brito.112 Factorial discriminant analysis is performed by Lauro et al.98 and Cazes.81

Unsupervised Classification Extended to Symbolic Data

Dissimilarities Between Symbolic Data

This is an important question in SDA where much has been done (see, e.g., Chapter 8 in Ref 10, Chapter 7 in Ref 11, and Chapter 8 devoted to dissimilarities and matching in Ref 12). Families of such dissimilarities have been defined by Gowda and Diday,113 Ichino and Yaguchi,114 and De Carvalho.115 A widely used dissimilarity in the case of interval-valued variables is the Hausdorff-based dissimilarity (see Ref 116). The Wasserstein metric in the case of probability distribution-valued variables is becoming popular (see Refs 117–119). Several dissimilarity measures between histogram-valued variables have been proposed in Ref 120. For consensus measures between symbolic data, see Ref 121.

Clustering

Much work has been carried out in extending clustering to symbolic data. This comes from the fact that at the beginning we thought that ‘clustering’ would be the main way of obtaining classes considered as clusters (see, e.g., Refs 2, 38, 47, and 122 in the case of data streams).

Several advances based on ‘dynamical clustering’ for partitioning or on ‘pyramidal clustering’ for obtaining overlapping clusters on classical data can be extended to symbolic data. We recall some of them. ‘Dynamical clustering’ (Refs 123 and 124) is an extension of K-means where, instead of the means, we use other kinds of centers called ‘kernels’: seeds, distributions,125 curves,126,127 regressions,128 adaptive distances,129 typological principal components,130 canonical components,131 and so on. The link with SDA is that the obtained kernels can be considered to be the symbolic descriptions of the obtained clusters. These clusters constitute the classes considered as higher level units. Hence, we obtain a symbolic data table on which SDA can be applied.

Dynamical clustering has already been extended to symbolic interval-valued variables with Hausdorff


dissimilarity in Refs 116 and 132, and when the centers are adaptive distances in Ref 133; by using a Wasserstein-based distance for histogram-valued variables, see Ref 134, or interval-valued variables, see Ref 118; by a probabilistic approach in Refs 15 and 16.

‘Pyramidal clustering’135 is an extension of hierarchies to overlapping clusters. In Refs 136 and 137, pyramidal clustering is extended to symbolic data considering that each level of the pyramid is associated with a ‘complete symbolic object’ (defined hereunder in the section on the Galois structure). Pyramidal and hierarchical clustering are graphically represented with individuals ordered on a straight line as support. They have been extended in order to have a network of two or more dimensions as support. This extension leads to a general theory of spatial pyramids in Ref 20, where its application to symbolic data and its link with Galois lattices are given. Pruning and graphical representations of spatial pyramids are developed in Refs 138 and 139.

By using dissimilarities between symbolic data (see Unsupervised Classification Extended to Symbolic Data section), clustering structures with different levels of clusters can be obtained by known algorithms of hierarchical, pyramidal, or top–down clustering. The quality of the resulting structure can be measured by the difference between the symbolic dissimilarity between the symbolic objects and the associated ultrametric (resp. Robinsonian and Yadidean) dissimilarity in the case of a hierarchical (resp. 2D pyramidal and 3D pyramidal) structure (see Ref 20).

Top–down hierarchical clustering for some kinds of symbolic variables (intervals and histograms) has been developed in Refs 140–144. In these approaches, the initial set of classes is recursively divided in two sets by splitting each symbolic variable. The best split is the one which maximizes the ‘quality’ of a class $C_k$ of $n_k$ individuals measured by the following criterion ‘Q,’ where d is a dissimilarity between the symbolic objects associated with $C_k$:

$$Q(C_k) = \frac{1}{2 n_k} \sum_{\omega_i \in C_k} \sum_{\omega_j \in C_k} d^2(\omega_i, \omega_j).$$

In the case of functional-valued variables, dynamical clustering with orthonormal polynomials as centers has been used in Ref 124 (p. 523). Self-organizing maps have been extended to interval-valued variables in Refs 10 and 145.

The Galois Lattice Structure of Symbolic Objects

Extent and intent of symbolic objects are introduced in Ref 38, where ‘complete symbolic objects’ are defined and their link with Galois lattices is given (by using the Maximum and the Minimum operators for generalization). A ‘complete symbolic object’ is a symbolic description of a class considered as an ‘intent’ whose ‘extent’ has the same intent.

Example. Suppose a class of individuals is described by a symbolic description reduced to just an interval of age. This interval, denoted I, has an ‘extent’ (in a given ground population), defined by the set of individuals with age included in this interval. The ‘intent’ of this extent can be defined by the interval I′ = [min, max] of the ages of the individuals of this extent. If I = I′, we say that this interval is a ‘complete symbolic object’ for this ground population.

In Refs 146 and 147, ‘concepts’ are defined by an extent C and an intent whose extent is C; they constitute the vertices of a Galois lattice. This result follows from the works of Birkhoff37 and Barbut and Monjardet148 in a binary context.

Several works have been developed in this direction in Ref 39 on several kinds of symbolic objects; for reducing the lattices of symbolic objects, see Refs 149–151.

As recalled in more detail in the Theoretical Development section, the stochastic lattice case has been considered for distributional data in Refs 40 and 41, which show (in a probabilistic and Choquet capacities context) that when the size n of a sample of the ground population increases, the sequence of Galois lattices Gn built on this sample converges toward a lattice G. Brito and Polaillon152 define two Galois connections on a set of distributional data and the corresponding concept lattices. Recently, Brito and Polaillon153 proposed a novel approach, which determines intents by intervals, thereby producing more homogeneous concepts, which are easier to interpret.

Models of Models: Distribution-Valued Variables and Their Mixture Decomposition

In the case of a unique symbolic variable of distribution values, an orthonormal wavelet basis has been used in Ref 154. In the multidimensional case of several symbolic variables of distribution values, we use the wavelet method of Mallat28 to extract histogram-valued variables (Billard et al., unpublished data). A PCA on these histograms is then conducted using Diday’s45 approach. In the multidimensional case of several symbolic variables of distribution value mixed


with other kinds of symbolic variables, a percentile representation is used in Ref 84. Mixture decomposition of probability distributions has been studied with copulas in Refs 43, 44, and 155. In the case of a unique symbolic distribution-valued variable, a Dirichlet model has been used in Refs 52 and 156. For clustering based on a normal mixture model for aggregated symbolic data, see Ref 157.

Modeling probability density-valued variables by likelihood estimation has been developed in Ref 158. Modeling interval-valued variables is studied by Brito and Duarte Silva,144 and Diday159 gives formulas relating the density of a set of interval vectors and their associated parameters under the hypothesis of uniform distributions.

‘Metabins’ a Useful Tool in SDA for Taking Care of the Variability Inside Classes

A metabin is a vector of values that are associated with each symbolic variable in order to transform the symbolic data table into a numerical data table, expressing the variability inside each of the classes, on which standard tools can be applied. In a PCA on interval variables (see Ref 73), the metabins are the $2^p$ vectors of numerical values defined by the vertices of the hyperrectangle $I_{1i} \times \dots \times I_{pi}$, where $I_{ji}$ is the interval associated with the symbolic interval-valued variable j for the class i.

In Ref 45, the p symbolic variables are bar chart-valued and the metabins are the vectors of p category frequencies taken at the same position in each variable. In that way, we obtain a numerical data table denoted T where each column is associated with a numerical variable, itself associated with a symbolic variable, and each row is associated with a metabin and a class. This leads to the numerical data table T of p columns and k × m rows if there are k classes and if m is the largest number of categories for a variable.

In the case of nonordinal (i.e., nominal) categorical variables, the ranking of the bins associated with each of these variables is obtained by maximizing the correlations between the numerical variables associated with each column of the table T. The higher are these correlations, the better is the rank-correlation between the numerical variables of the table T of metabins.

SDA SOFTWARE

Symbolic Objects Data Analysis

The SODAS software is issued from two European projects (from 1997 to 2003), involving 17 research laboratories, industrial companies, and National Statistical Institutions (NSI) of three countries. The results of these projects are edited in Refs 10 and 12. The SODAS basic principle is based on two steps. In the first step, a symbolic data file (called ‘.sds’) is created from a query to a relational database. This query defines a standard data table which contains a categorical variable at its first position. The categories of this variable define classes of individuals which constitute the higher level units described by symbolic data by using the Data Base to Symbolic Objects (DB2SO) module as shown in Figure 10.

In the second step, several tools can be applied to the symbolic data obtained at the first step. The output of several of these tools is shown in Figure 11: Kohonen and principal component in the case of interval-valued variables; decision tree, ‘pyramidal clustering,’ and ‘zoom star’ on several kinds of symbolic data. The SODAS pyramidal overlapping clustering tool contains the case of hierarchical clustering. Zoom star allows the graphical representation of the symbolic variables associated with the rows of a circle.

The SODAS package is free, though registration is required and a code is needed for installation, see http://www.info.fundp.ac.be/asso/sodaslink.htm. Numerous reports of French students containing many databases (from which symbolic data are built and their SODAS files provided) can be found at www.sodas.ceremade.dauphine.fr or www.sodas.lamsade.dauphine.fr. Notice that Chiun-How et al.160 have also developed a symbolic database for Trends in International Mathematics and Science Study (TIMSS), but it is not related to SODAS.

SYR Software

The package SYR is a professional software for
ing. Ichino84 gives another approach based on a industrial applications. Its aim is to extract, from a
transformation of the bar charts in distributions data file (.txt and .csv) of several millions of units or
which is not possible in the case of nominal vari- from an Access data base of hundreds of thousands
ables. Nevertheless, the Ichino method gives a solu- of units, a reduced number of units (i.e., classes),
tion in this case by ranking the bins by their described by symbolic data which summarize the ini-
frequency. Hence, the metabins approach gives an tial data in a file (called .syr), compatible with the
alternative solution to the bins ranking challenge in SODAS .sds files by conversion. Then, from this sym-
the nonordinal case, based on a maximization of the bolic file, several original tools can be applied.161 For

192 © 2016 Wiley Periodicals, Inc. Volume 8, September/October 2016

WIREs Computational Statistics Symbolic Data Analysis

FIGURE 10 | From relational data base to symbolic data. [Diagram: a query to the relational data base returns individuals observed by numerical or categorical variables together with a class variable; DB2SO then produces the symbolic data table, whose rows describe the higher level units by symbolic values and whose columns are symbolic variables.]

FIGURE 11 | Some symbolic data analysis tools output. [Panels: Kohonen map on symbolic interval-valued variables; principal component of interval-valued variables (Axe 1: 58.048%, Axe 2: 32.553%); overlapping stars whose radii are associated with symbolic variables; top-down clustering tree or decision tree on symbolic data; pyramidal clustering (‘Pyramide classifiante’).]

example, in Figure 12 we give an example of a graphical symbolic data table provided by the TABSYR tool, which creates the symbolic data file from a ground data table and provides a user-friendly graphical symbolic data table output allowing the user to select, move, and sort units or symbolic variables from the most discriminating to the least discriminating (and conversely) of the different classes.
The NETSYR tool is an extension of PCA to symbolic data. An example is given in Figure 13 from


Overview wires.wiley.com/compstats

FIGURE 12 | A symbolic data table provided by the SYR software. [Table image: rows ouv01, ouv13, ouv02, ouv07, ouv09, and ouv10; column headers Towel, Height, Gap1, Radius, and Gap2; cells hold intervals (e.g., 157.31:299.76 and 41.66:68.41 for ouv01) and small charts with bins numbered 1 to 5 and 1 to 7.]

a text mining study in Ref 57. NETSYR allows the visualization of unnoted classes by the pie chart associated with a given bar chart-valued variable and its bar chart view. The result of a clustering on the initial data or on selected PCA factorial axes can also be visualized as a network relating the closest concepts. Moreover, the method produces a correlation circle of the categories where the symbolic variables themselves can be represented. More information on the SYR software can be requested at afonso@syrokko.com.

RSDA: An R-Package for SDA
This package aims to implement in R certain techniques of SDA, such as clustering, as well as some linear models. These implementations will always follow two principles: classic data analysis should always be a particular case of SDA, and both the output and the input of an SDA should be symbolic and of the same kind, in order to express the data in the same language at input as at output. The latest version of the RSDA package is 1.2; the author is Oldemar Rodríguez, with contributions from Olger Calderón and Roberto Zúñiga. Information can be obtained at oldemar.rodriguez@ucr.ac.cr. An example of output of the RSDA software in the case of a PCA of interval-valued variables is given in Figure 14.

R-Package MAINT.Data
Brito and Duarte Silva144 have proposed parametric models for interval data, which consider Multivariate Normal or Skew-Normal distributions for the MidPoints and Log-Ranges of the interval-valued variables. Several alternative configurations for the global covariance matrix are considered, allowing for taking into account the link that may exist between MidPoints and Ranges of the same or different interval-valued variables. Intermediate parameterizations between the nonrestricted and the noncorrelation setup considered for real-valued data may be relevant for the specific case of interval data. This modeling has been implemented in the R-package MAINT.Data,162 available on CRAN. MAINT.Data introduces a data class for representing interval data and includes methods for the display, management, and analysis of these data. In particular, maximum likelihood estimation and statistical tests for the different configurations are addressed. Methods for (M)ANOVA and Linear and Quadratic Discriminant Analysis of this data class are also provided.

R-Package: Histogram Data Analysis Using Wasserstein Distance (HistDAWass)
This package, from Ref 163, contains methods (see Refs 164 and 165) mainly based on the L2 Wasserstein metric between distributions (i.e., a Euclidean metric between quantile functions). It contains basic statistics of symbolic histogram-valued variables, clustering methods (both hierarchical and dynamic clustering), regression analysis, PCA of distributional variables, histogram time series forecasting using the

FIGURE 13 | A NETSYR output of a PCA extended to symbolic data. [Plot: Axis 1 (16.30%) against Axis 2 (12.45%); labeled groups include Politness-satisfaction, Contact call, Siret_APE_NAF, Dates, Troubleshooting-intervention, Technical terms, Invoice reading, and Schedule, several with a doc_clust70 bar chart listing cluster weights.]

KNN, and exponential smoothing. It can be downloaded from https://cran.r-project.org/web/packages/HistDAWass/index.html.

FIGURE 14 | An example of RSDA package output in case of a principal component analysis of interval-valued variables.

R-Package: iRegression
This package, from Lima Neto and Vasconcelos, contains some regression methods for interval-valued variables. For each method, it gives the fitted values, residuals, and some goodness-of-fit measures. Since 2012, this R-package (version 1.2) has been available at http://CRAN.R-project.org/package=iRegression.

SOME DIRECTION OF RESEARCH AND LINKS WITH SPECIFIC DOMAINS OF DATA SCIENCE
Some Direction of Research
As we have seen in Some Methods for the Analysis of Symbolic Data section, many standard methods of statistics and Data Mining have started to be extended to symbolic data. Much remains to be done in the continuation of this extension and in the extension to other tools not yet considered such as, in machine learning, SVM, neural networks, social or Bayesian networks, and so on. For example, Correspondence Analysis166 is addressed to contingency data tables which constitute one symbolic variable (i.e., one column of the symbolic data table), so Correspondence Analysis has


to be extended to a symbolic data table with more symbolic variables. In the PCA built on the metabins45 in both cases of interval or bar chart-valued variables, it would be interesting to use the joint probability density of the ground level population by associating with each metabin a weight proportional to the number of individuals whose dissimilarity to this metabin is lower than a given threshold.
Symbolic data tables where classes are described by horizontal and vertical bar chart variables can be obtained directly from the ground data table in the case of ground categorical variables. In the case of ground numerical variables, a simultaneous discretization (both kinds of bar chart variables) optimizing a quality criterion S or W (see How to Measure the Quality of Bar Chart Symbolic Variables, Classes, and Their Associated Symbolic Data Tables section) is needed. Table 1 is a simple example of such a symbolic data table. Then, we can analyze these new kinds of symbolic data tables by any SDA method, with new kinds of interpretation due to the vertical bar charts.

Links with Specific Domains of Data Science
SDA can also enhance domains of research like FDA, mixture decomposition, Bayesian approaches, multilevel statistics, uncertainty and fuzzy sets, granular computing, and rough sets.

Compositional Data
‘Compositional data’ appear in a specific case of symbolic data. The relative frequencies of bar charts are ‘compositional parts’ as their sum is equal to 1 for each bar chart. Aitchison19 described the difficulties (such as negative bias and spurious correlation) often encountered with such variables and the need for normalization. Moreover, according to recent developments,167 compositional data are not necessarily defined with a constant sum constraint, but rather more generally by the fact that their parts contain quantitatively expressed relative contributions on a whole. From this perspective, the unit sum of parts is just one possible representation of data, where the only relevant information is contained in ratios between parts. In Ref 72, for unit sum normalization of compositional data, the angular transformation of Fisher168 is used; but there are many other possibilities which have to be considered in depth, as in the recent paper by Wang et al.,169 where the Aitchison geometry and centered log ratio coordinates are employed.

Functional Data Analysis
Functional data (i.e., where variable values are functions such as curves or signals) can be seen as a case of complex data by considering each point i of the function f as an individual taking the value f(i).
Example. Ten sensors installed on a bridge produce a signal function when 100 different kinds of trains pass over this bridge before and after improving the state of the bridge. Hence, to each train are associated ten functions fj considered as ten ground variables Yj describing, e.g., nij = 10,000 points i of the function fj by their values fj(i). Therefore, in this way, we obtain a complex data table where the variables are not paired, as nij varies between variables. Notice that this is a case of ‘variability inside entities’ defined in the introduction section. It is then possible to transform these ground variables, e.g., into symbolic histogram-valued variables, by using wavelets (see Ref 28).
Hence, such data can be transformed by a fusion process into a unique symbolic data table where the symbolic variables are paired. This process is based on a vertical concatenation of aggregations of the values of the functions inside each class and a horizontal concatenation of the obtained symbolic variables. Several symbolic variables can be induced by the same functional-valued variable. For example: the min–max interval, the interquartile interval, the bar charts, the histograms, the distributions of the function values, including the one induced by a wavelet approximation or other time series approach; see Ref 28 and Billard et al. (unpublished data). Here, also, the SDA challenge is to find the aggregation which maximizes the discrimination between the classes and also maximizes the correlation between the symbolic variables.

Mixture Decomposition
Mixture decomposition aims to find the underlying distributions of a given population described by a set of standard variables. In the SDA context, the question is how to extend mixture decomposition methods and tools to symbolic data. In Theoretical Development section, we recall some early studies in the case of classes described by a unique symbolic variable of distribution values where Dirichlet48,49 and copula models43 have been used. The mixture decomposition of several symbolic variables of the different kinds defined in Tables 2 and 3 remains to be considered. Moreover, in the current methodology based on an EM kind of approach,170 the distributions of the obtained mixture decomposition are not the distributions induced by the given clusters. In SDA, as we wish to obtain a symbolic data table where classes (i.e., ‘clusters’) are described by their distribution, the ‘dynamical clustering’ methodology


(see Unsupervised Classification Extended to Symbolic Data section) is more appropriate, by taking as kernels laws but also adaptive factorial spaces, regressions, distances, and so on. Thus, much remains to be done in this direction, which can enhance the current mixture decomposition domain. Mixture decomposition generally yields clusters which constitute a partitioning of the population; it would be interesting to extend, in the SDA framework, mixture decomposition to other structures like hierarchies or pyramids of two or more dimensions.

Bayesian
Following Gelman et al.,171 the process of Bayesian data analysis can be idealized by dividing it into the following three steps: (1) setting up a full probability model, (2) conditioning on observed data, and (3) evaluating the fit of the model. Therefore, the main question (following Marin and Robert172) is to propose a probability distribution f(x|Θ), and to compute the conditional probability distributions $f_j(x_j \mid \Theta_j, x_1, \dots, x_{j-1})$ and the parameters Θ, Θ1, …, Θp, given n observations x = (x1, …, xp) of p random variables (X1, …, Xp) defined on a measurable space, such that the following equation fits the data as well as possible:

$$f(x,\Theta) = f_1(x_1 \mid \Theta_1) \prod_{j=2}^{p} f_j(x_j \mid \Theta_j, x_1, \dots, x_{j-1}) \qquad (3)$$

In Data Science, this model makes no sense in the case of complex data when the random variables Xj are not paired. When complex data are transformed into symbolic data, Eq. (3) can be set up with random variables Xj having symbolic values. Hence, a wide domain of research is opened in order to define conditional probabilities between symbolic variables of the different types defined in Tables 2 and 3.
Otherwise, the extension to symbolic data of Bayesian networks based on the equation $f(x) = \prod_{j=2}^{n} f_j(x_j \mid \mathrm{parents}(x_j))$ has been initiated in Ref 173.

Multilevel Statistics
Multilevel statistics (see, e.g., Ref 174) aims to measure the influence of class variables on regressions made between given classical variables of the ground data. In relation with SDA, at least two directions of research can be suggested. First, describing the classes of the class variables by symbolic data obtained by aggregating the ground variables. The obtained symbolic data table can provide ideas on the influence of the classes. Second, extending multilevel statistics tools to symbolic data tables with class variables at the class level.

Granular Computing
The idea of granularity occurs in several domains such as ‘rough sets’ or ‘fuzzy sets.’ In What Is the Actual Failure and the Shift Which Has Produced the SDA Domain? section, we have introduced the pair (X, Y) which induces a standard data table where a set of ground numerical and/or categorical variables Y describes a ground population X of individuals. In order to define rough sets and granularity (see Refs 175 and 176), this pair (X, Y) is reduced to the case where Y contains only categorical variables and is called an ‘information system’ or ‘attribute-value system.’ In SDA, having classes of individuals considered as a new set of units X′, the aggregation of these classes transforms the ground categorical variables into a set of symbolic bar chart-valued variables Y′. Therefore, the obtained pair (X′, Y′) can be considered as a ‘Symbolic Information System’ (SIS), which is equivalent to a symbolic data table where the variables of Y′ are only symbolic bar chart-valued variables. Hence, a SIS can be considered as a double granularity: first at the level of the ground population, by using classes as units instead of individuals; second at the level of the descriptive variables of the classes, as their values are, e.g., intervals or weighted intervals in the case of histograms. Some directions of research in this field can be suggested by extending the granular approach to a SIS. This idea is illustrated in the case of rough sets and fuzzy sets hereunder.

Symbolic Rough Sets
Rough sets on the ground population can be defined by using symbolic data. For example, consider an interval-valued symbolic variable Y′ defined on a set of classes C = {Ci} and induced by a numerical variable Y defined on the ground population Ω. Let X be a subset of Ω and let $I_X$ be the smallest interval containing all the values Y(x) with x ∈ X. Let $I_i = Y'(C_i)$ be the interval value associated with the class $C_i \in C$. We can then associate with X the symbolic rough set $X_R$ defined by the pair (X′, X″) such that:

$X' = \bigcup \{C_i : I_i \subseteq I_X\}$ is the lower bound of $X_R$ and

$X'' = \bigcup \{C_i : I_i \cap I_X \neq \emptyset\}$ is the upper bound of $X_R$.

This can be generalized to several interval-valued variables by merging the lower bounds (resp. upper bounds) obtained for each interval-valued variable.
In the case of bar chart or probability distribution-valued variables, we can proceed in the same way by considering their associated interquartile or other percentile interval-valued variables.
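The lower and upper bounds of a symbolic rough set can be sketched in a few lines for a single interval-valued variable. This is only an illustration under stated assumptions: the function name `symbolic_rough_set`, the dictionary layout, and the toy values are hypothetical and do not come from the paper or from any SDA package. Intervals are taken as closed, inclusion and intersection are tested on the endpoints, and the bounds are returned as sets of class labels rather than as unions of individuals.

```python
from typing import Dict, Iterable, Set, Tuple

Interval = Tuple[float, float]


def symbolic_rough_set(
    class_intervals: Dict[str, Interval],
    x_values: Iterable[float],
) -> Tuple[Set[str], Set[str]]:
    """Return (lower, upper) class labels of the symbolic rough set of X.

    class_intervals maps each class C_i to its interval I_i.
    x_values holds the ground values Y(x) for the individuals x in X.
    (Hypothetical helper, written for illustration only.)
    """
    values = list(x_values)
    if not values:
        raise ValueError("X must contain at least one individual")
    # I_X: smallest interval containing all the values Y(x), x in X.
    lo_x, hi_x = min(values), max(values)

    lower: Set[str] = set()
    upper: Set[str] = set()
    for label, (lo_i, hi_i) in class_intervals.items():
        # Lower bound: I_i is included in I_X.
        if lo_x <= lo_i and hi_i <= hi_x:
            lower.add(label)
        # Upper bound: I_i and I_X intersect (closed intervals).
        if lo_i <= hi_x and lo_x <= hi_i:
            upper.add(label)
    return lower, upper


# Toy example with three classes (made-up values).
intervals = {"C1": (1.0, 3.0), "C2": (2.5, 6.0), "C3": (7.0, 9.0)}
lower, upper = symbolic_rough_set(intervals, [0.5, 2.0, 4.0])
print(sorted(lower), sorted(upper))
```

With several interval-valued variables, one possible reading of "merging" is to call the function once per variable and combine the resulting sets; the text does not fix the merge rule, so that choice is left open here.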


The ‘standard accuracy’177 of a standard rough set can be extended to the ‘symbolic accuracy’, denoted A, of a symbolic rough set $X_R$ by setting $A(X_R) = |X'| / |X''|$. Hence, in that way, we can measure the ‘symbolic accuracy’ of any class X = Ci of the set C on which the symbolic variable Y′ has been defined.

Uncertainty
Uncertainty is not variability, as uncertainty describes individual facts or events by subjective values expressing, e.g., a ‘possibility’ or a ‘belief’ which follow their own axioms. At the level of classes, aggregating such values associated with the same categories leads to different aggregation axioms depending on the kind of uncertainty. Several theories have been developed, such as subjective probability, possibility, and belief theory (see a synthesis in Ref 3). Variability inside classes of individuals is expressed by objective values such as frequencies, which follow the Kolmogorov axioms of probability. If there are several facts or events, then a variability can appear among their associated uncertainty values. This variability can be expressed by symbolic data and then analyzed by SDA tools.

Fuzzy Sets
Fuzzy sets express a kind of uncertainty by a kind of subjective membership function. For example, with such a subjective function, we can say that ‘this man is tall’ with a fuzzy value equal to 0.8. If each unit of the ground population is described by one or several fuzzy sets, then classes of such units can be described by intervals, bar charts, or histograms expressing the variability of the fuzzy values inside each class (see an example given in Ref 12, section 1.4.2, p. 14). Otherwise, a fuzzy coding study of symbolic data can be found in Ref 178.

Clusterfier
A ‘clusterfier’ aims to find all the clusters of a population by knowing only some of them. More precisely, a clusterfier is a function which produces, from classes (i.e., clusters) known on a part of a population, new clusters on the remaining part of the same population. Notice that a ‘classifier’ is different, as it is a function which, from classes known on a part of a population, associates a class to each individual of the remaining population. Clusterfiers have been studied in Ref 179 by using a general Lance and Williams180 formula. In the SDA framework, several symbolic clustering tools (by partitioning, hierarchical or pyramidal tools, see Unsupervised Classification Extended to Symbolic Data section) can be used in order to build new clusters taking care of some known clusters. A simple idea could be to aggregate the known clusters. These aggregations produce new units described by symbolic data. It is then possible to add these units to the initial ground population and then to apply the symbolic clustering tools to the new population. Many other ways are also possible, e.g., by looking for the best discriminating space (e.g., by adaptive distances140) or factorial discriminant analysis (e.g., in Ref 98) of the known clusters and then applying clustering tools involving the whole population in this space. Another way can be to select the symbolic variables which maximize the S or W criterion (see How to Measure the Quality of Bar Chart Symbolic Variables, Classes, and Their Associated Symbolic Data Tables section), in order to find the variables with the highest explanatory and discriminatory power for the given classes. This yields a symbolic data table where the units are the classes and the remaining individuals, and the variables are the selected ones. Then, a clustering tool can be applied to this symbolic data table.

CONCLUSION AND PERSPECTIVES
We have presented a new way of thinking in data science, where currently only the first level of individuals is mainly considered, ignoring the complementary analysis of a higher level of classes considered as new units to be described and studied by themselves. Often we are asked: ‘Does SDA give better results than standard Data Analysis?’ This question can be illustrated by the following example: does studying players give better results than studying teams of players? This question makes no sense as the considered units are not the same! The only answer to this question is that standard approaches are the best for studying individuals (as players described by standard data) and SDA is the best for studying classes (as teams of players). Moreover, we cannot say that one approach is better than the other; we can just say that both approaches are complementary and that SDA methods can enhance standard results by giving class-level results induced from the standard data. We can also say that the SDA tools are more general than the standard ones, as an individual can be considered as a class reduced to a single unit.
Standard, complex, and big data are given whereas symbolic data are built from these kinds of data. Therefore, before the proper analysis of the built symbolic data, there is a wide domain of research for obtaining ‘good’ symbolic data and measuring their quality (validity, robustness, etc.).


Much remains to be done in the following directions: in statistics and data mining, going more deeply into the classical methods already extended to symbolic data and extending other methods not yet considered. In computer science, extending the SQL query algebra language to symbolic data bases (see a start in Ref 181), extending EXCEL to symbolic data (see a start with TABSYR inside the SYR Software; SYR Software section), and building summarizations of big data by symbolic data (as said by Minami and Mizuta182). Also, in the case of big ground data bases, parallelizing the current tools for building and analyzing big sets of symbolic data.
We have seen that, starting with complex data defined by several unstructured data tables with unpaired variables, we can obtain a data table with paired symbolic variables. Therefore, SDA is more a solution to the problem of complex data than a problem of complex data.
By enlarging the current framework of Data Science to higher level populations of classes, we have seen that SDA can enhance large domains of applications and research in computational statistics.
Thinking by classes of individuals is our natural way of thinking. This happens, e.g., when we say: ‘I like my dog Tomi, I prefer dogs to cats,’ as in the same sentence ‘Tomi’ is at the individual level, while ‘dogs’ and ‘cats’ are at the class level. Hence, in some sense, thinking by classes in Data Science brings our way of thinking in Data Science closer to our natural way of thinking.
What can the SDA framework change? SDA can change our way of teaching, researching, and applying:
Teaching: by considering standard teaching on individuals as a case of teaching on classes described by symbolic data. Researching: by asking ‘how to extend my current results to the case where, instead of standard statistical units described by standard variables, I have classes described by symbolic data.’ Applying: as SDA can enhance our current results by complementary ones on classes, enlightening our current study by changing our current units (i.e., ‘individuals’) into higher level units (i.e., ‘classes’) described by symbolic data.
In 1984, Schweizer183 said that ‘distributions are the numbers of the future.’ In the SDA context, this means that the classes of individuals from which these distributions are obtained are the ‘units of the future’ and moreover that the symbolic data which describe these classes are the numbers of the future.

Epilogue: It is my hope this fascinating domain will inspire many teachers and students in the numerous directions suggested in this paper.

ACKNOWLEDGMENTS
The author gratefully acknowledges the reviewers for their helpful remarks and suggestions.

FURTHER READINGS
Diday E. L’Analyse des données symboliques, un cadre théorique et des outils pour le data mining. In: Diday E,
Kodratoff Y, Brito P, Moulet M, eds. Induction Symbolique Numérique à Partir de Données. Toulouse: CEPADUES; 2000.
Diday E. From Schweizer to Dempster: mixture decomposition of distributions by copulas in the symbolic data analysis
framework. In: IPMU 2002, Annecy, France, July, 2002.
Emilion R. Classiﬁcation of wind speed distributions. Renew Energy 2011, 36:3091–3097.
Nakano J. Regression analysis for aggregated symbolic data. In: Arroyo J, Maté C, Brito P, Noirhomme M, eds, 3rd Workshop in Symbolic Data Analysis. Universidad Complutense de Madrid; 2012. Available at: http://www.sda-workshop.org/.
Saporta G, Niang N. Resampling ROC curves. In: IASC Meeting on Statistics for Data Mining, Learning and Knowledge
Extraction (IASC07), 30 August–September, 2007.

REFERENCES
1. Diday E. Introduction à l’approche symbolique en analyse des données. Premières journées Symbolique-Numerique. Workshop. CEREMADE Laboratory, 1987, Université Paris-Dauphine, France, 21, 56.
2. Diday E. The symbolic approach in clustering and related methods of data analysis: the basic choices. In: Bock HH, ed. Proceedings of IFCS’87 on Classification and Related Methods of Data Analysis, Amsterdam, North Holland, 1988, 673–684.
3. Diday E. Probabilist, possibilist and belief objects for knowledge analysis. Ann Oper Res 1995, 55:227–276.
4. Afonso F, Diday E, Badez N, Genest Y. Symbolic data analysis of complex data: application to nuclear power plant. In: COMPSTAT’2010, Paris, 2010.
5. Afonso F, Diday E, Badez N, Genest Y. Use of symbolic data analysis for structural health monitoring applications. In: Second International Symposium on Life-Cycle Civil Engineering, IALCCE’2010, October 27–30, 2010. Taipei, Taiwan.
6. Laaksonen S. Chapter 22: people’s life values and trust components in Europe: symbolic data analysis for 20–22 countries. In: Diday E, Noirhomme-Fraiture M, eds. Symbolic Data Analysis and the SODAS Software. Chichester: Wiley & Sons; 2008, 405–419.
7. Laaksonen S. The survey as a basis for symbolic data analysis. In: Carlson M, Nyquist H, Villani M, eds. Official Statistics, Methodology and Applications in Honour of Daniel Thorburn. Stockholm, Sweden: Stockholm University; 2010, 15–28. Available at: officialstatistics.wordpress.com.
8. Afonso F, Laaksonen S. Analyzing European Social Survey data using symbolic data methods and Syrokko software. In: RNTI Special Issue « en l’honneur des travaux de Monique Noirhomme-Fraiture: Analyse de données et Visualisation ». RNTI 2015, 89–100.
9. Korenjak-Cerne S, Kejžar N, Batagelj V. A weighted clustering of population pyramids for the world’s countries, 1996, 2001, 2006. Pop Stud J Demogr 2015, 69:105–120.
10. Bock HH, Diday E. Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data. Heidelberg: Springer-Verlag; 2000, 425. ISBN: 3-540-66619-2.
11. Billard L, Diday E. Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley Series in Computational Statistics. Chichester: Wiley; 2006, 321. ISBN: 0-470-09016-2.
12. Diday E, Noirhomme-Fraiture M. Symbolic Data Analysis and the SODAS software. Chichester: Wiley; 2008. doi:978-0-470-01883-5.
13. Billard L, Douzal-Chouakria A, Diday E. Symbolic principal components for interval-valued observations. Stat Anal Data Mining 2011, 4:229–246.
16. Brito P, Noirhomme-Fraiture M, Arroyo J. Special issue on symbolic data analysis. Editorial. Adv Data Anal Classif 2015, 9:1–4.
17. Su S-F, Pedrycz W, Hong T-P, De Carvalho AT. Special issue on granular/symbolic data processing. IEEE Trans Cybern 2016, 344–401.
18. Kuhn T. The structure of scientific revolutions. Chicago: University of Chicago Press; 1962.
19. Aitchison J. The Statistical Analysis of Compositional Data. London: Chapman and Hall; 1986.
20. Diday E. Spatial classification. Discrete Appl Math 2008, 156:1271–1294.
21. Diday E. Des objets de l’Analyse des données à ceux de l’Analyse des connaissances. In: Kodratoff Y, Diday E, eds. Induction Symbolic Numerique. Toulouse: CEPADUES; 1991.
22. Brito P, Bertrand P, Cucumel G, de Carvalho F, eds. On the analysis of symbolic data. In: Selected Contributions in Data Analysis and Classification. Berlin: Springer; 2007, 13–22.
23. Billard L, Diday E. From the statistics of data to the statistic of knowledge: symbolic data analysis. J Am Stat Assoc 2003, 98:470–487.
24. Billard L. Special issue on SDA. ASA Data Sci J 2011, 4:147–246.
25. Billard L. Brief overview of symbolic data and analytic issues. Stat Anal Data Mining 2011, 4:149–156.
26. Noirhomme-Fraiture M, Brito P. Far beyond the classical data models: symbolic data analysis. Stat Anal Data Mining 2012, 4:157–170.
27. Brito P. Symbolic data analysis: another look at the interaction of data mining and statistics. Wiley Interdiscip Rev Data Mining Knowl Discov 2014, 4:281–295. doi:10.1002/widm.1133.
28. Mallat S. A Wavelet Tour of Signal Processing. San Diego, CA: Academic Press; 1998.
29. Seck D. Arbres de décision symboliques, outils de validation et d’aide à l’interprétation. PhD (thèse de doctorat), Paris-Dauphine University, France, 2012.
30. Guinot C, Malvy D, Schemann J-F, Afonso F, Haddad R, Diday E. Strategies evaluation in environmental conditions by symbolic data analysis: application in medicine and epidemiology to trachoma. Adv Data Anal Classif 2015, 9:107–119.
31. Leskovec J, Rajaraman A, Ullman JD. Chapter 1: data mining. In: Mining of Massive Datasets. England: Cambridge University Press; 2011, 1–17.
14. Guan R, Lechevallier Y, Saporta G, Wang H. 32. Diday E, Afonso F, Haddad R. The symbolic data
Advances in Theory and Applications of High Dimen- analysis paradigm, discriminant discretization and
sional and Symbolic Data Analysis, vol. E25. Her- ﬁnancial application. In: HDSDA 2013 Conference,
mann, MO: RNTI; 2013. Beijing, China. RNTI-E-25. Paris: Hermann;
15. Brito P, Duarte Silva AP, Dias JG. Probabilistic cluster- 2013, 1–14.
ing of interval data. Intell Data Anal 2015, 33. Diday E. Pouvoir explicatif et discriminant de variables
19:293–313. à valeurs diagrammes en bâtons et de tableaux de


données symboliques associés. Revue Modulad n° 45, RNTI; In press.
34. Horn S, Pesce AJ, Copeland BE. A robust approach to reference interval estimation and evaluation. Clin Chem 1998, 44:622–631.
35. Royall RM. Model robust confidence intervals using maximum likelihood estimators. Int Stat Rev 1986, 54:221–226.
36. Lebart L, Morineau A, Warwick KM. Multivariate Descriptive Statistical Analysis. New York: Wiley; 1984.
37. Birkhoff G. Lattice Theory, vol. 25. 3rd ed. Providence, RI: AMS Colloquium Publications; 1967. Reprinted 1984.
38. Diday E. Introduction à l’analyse des données symboliques. Oper Res Rev 1989, 23:193–236. Also in Rapport de Recherche No. 1074, INRIA, Rocquencourt.
39. Brito P. Order structure of symbolic assertion objects. IEEE Trans Knowl Data Eng 1994, 6:5.
40. Diday E, Emilion R. Treillis de Galois maximaux et capacités de Choquet. CR Acad Sci Paris 1997, 325:261–266.
41. Diday E, Emilion R. Maximal and stochastic Galois lattices. Discrete Appl Math 2003, 27:271–284.
42. Nelsen RB. An Introduction to Copulas. New York: Springer Verlag; 1999.
43. Diday E, Vrac M. Mixture decomposition of distributions by Copulas in the symbolic data analysis framework. Discrete Appl Math 2005, 147:27–41.
44. Vrac M, Billard L, Diday E, Chédin A. Copulas analysis of mixture model. Comput Stat 2012, 27:427–457.
45. Diday E. Principal component analysis for bar charts and Metabins tables. Stat Anal Data Mining 2013, 6:403–430. doi:10.1002/sam.11188.
46. Neto EA, Anjos UU. Regression model for interval-valued variables based on copulas. J Appl Stat 2015, 42:2010–2029.
47. Diday E, Murthy N. Symbolic data clustering. In: Wang J, ed. Encyclopedia of Data Warehousing and Mining. Hershey, NY: Information Science Reference; 2005, 1087–1091.
48. Emilion R. Classification et mélanges de processus. CR Acad Sci Paris 2002, 335:189–193.
49. Soule A, Salamatian K, Taft N, Emilion R, Papagiannaki K. Flow classification by histograms. In: Proceedings of Sigmetrics’04, New York, 2004.
50. Soubdhan T, Emilion R, Calif R. Classification of daily solar radiation distributions using a mixture of Dirichlet distributions. Solar Energy 2009, 83:1056–1063.
51. Calif R, Emilion R, Soubdhan T. Classification of wind speed distributions using a mixture of Dirichlet distributions. Renewable Energy 2011, 36:3091–3097.
52. Emilion R. Unsupervised classification of objects described by nonparametric distributions. Stat Anal Data Mining 2012, 388–398.
53. Bezerra B, Carvalho F. Symbolic data analysis tools for recommendation systems. Knowl Inf Syst 2011, 26:385–418. doi:10.1007/s10115-009-0282-3.
54. Quantin C, Billard L, Touati M, Andreu N, Cottin Y, Zeller M, Afonso F, Battaglia G, Seck D, Le Teuff G, et al. Classification and regression trees on aggregate data modeling: an application in acute myocardial infarction. J Prob Stat 2011, 2011:19.
55. Mizuta M. Study on radiation therapy with distribution valued data. In: Arroyo J, Maté C, Brito P, Noirhomme M, eds. 3rd Workshop in Symbolic Data Analysis. Spain: Universidad Complutense de Madrid; 2012.
56. Fablet C, Diday E, Bougeard S, Toque C, Billard L. Classification of hierarchical-structured data with symbolic analysis: application to veterinary epidemiology. In: COMPSTAT’2010, Paris, 2010.
57. Haddad R, Afonso F, Diday E. Approche symbolique pour l’extraction de thématiques: Application à un corpus issu d’appels téléphoniques. In: actes des XVIIIèmes Rencontres de la Société francophone de Classification. Université d’Orléans, France; 2011.
58. García-Ascanio C, Maté C. Electric power demand forecasting using interval time series: a comparison between VAR and iMLP. Energy Policy 2010, 38:715–725.
59. Emilion R. Classification of daily solar radiation distributions using a mixture of Dirichlet distributions. Solar Energy 2009, 83:1056–1063.
60. Han A, Hong Y, Lai KK, Wang S. Interval time series analysis with an application to the sterling-dollar exchange rate. J Syst Sci Complex 2008, 21:550–565.
61. He LT, Hu C. Impacts of interval computing on stock market variability forecasting. Comput Econ 2009, 33:263–276.
62. Long W, Mok HMK, Hu Y, Wang H. The style and innate structure of the stock markets in China. Pacific-Basin Finance J 2009, 17:224–242.
63. Terraza V, Toque C. Mutual Fund Rating: A Symbolic Data Approach. In: Terraza V, Razafitombo H, eds. Understanding Investment Funds Insights from Performance and Risk Analysis. Economics & Finance Collection. London, UK: The Palgrave Macmillan; 2013.
64. Bouteiller V, Toque C, A, Cherrier J-F, Diday E, Cremona C. Non-destructive electrochemical characterizations of reinforced concrete corrosion: basic and symbolic data analysis. Corros Rev 2011, 30:47–62. doi:10.1515/corrrev-2011-002.
65. Cury A, Crémona C, Diday E. Application of symbolic data analysis for structural modification assessment. Eng Struct J 2010, 32:762–775.


66. Courtois A, Genest G, Afonso F, Diday E, Orcesi A. In service inspection of reinforced concrete cooling towers—EDF’s feedback. In: IALCCE 2012, Vienna, Austria, 2012.
67. Bertrand P, Goupil F. Descriptive statistics for symbolic data. In: Bock H-H, Diday E, eds. Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data. Berlin: Springer-Verlag; 2000, 103–124.
68. Billard L. Dependencies and variation components of symbolic interval-valued data. In: Brito P, Bertrand P, Cucumel G, de Carvalho F, eds. Selected Contributions in Data Analysis and Classification. Berlin: Springer; 2007, 3–12.
69. Billard L. Sample covariance functions for complex quantitative data. In: Mizuta M, Nakano J, eds. Proceedings, World Conferences International Association of Statistical Computing 2008. Tokyo: Yokohama; 2008.
70. Nagabhushan P, Kumar P. Histogram PCA. Adv Neural Netw 2007, 4492:1012–1021.
71. Verde R, Irpino A. Ordinary least squares for histogram data based on Wasserstein distance. In: Lechevallier Y, Saporta G, eds, Proceedings of COMPSTAT’2010. Heidelberg: Physica-Verlag; 2010, 581–589.
72. Makosso-Kallyth S, Diday E. Adaptation of interval PCA to symbolic histogram variables. Advances in Data Analysis and Classification. Adv Data Anal Classif 2012, 6:147–159.
73. Douzal-Chouakria A, Billard L, Diday E. Principal component analysis for interval-valued observations. Stat Anal Data Mining 2011, 4:229–246. doi:10.1002/sam.10118.
74. Cazes P, Chouakria A, Diday E, Schektman Y. Extension de l’analyse en composantes principales à des données de type intervalle. Rev Stat Appl 1997, XLV:5–24.
75. Wang H, Guan R, Wu J. CIPCA: complete-information-based principal component analysis for interval-valued data. Neurocomputing 2012, 86:158–169.
76. Le-Rademacher J, Billard L. Principal component analysis for interval data. Wiley Interdiscip Rev Comput Stat 2012, 4:535–540.
77. Le-Rademacher J, Billard L. Principal component histograms from interval-valued observations. Comput Stat 2013, 28:2117–2138.
78. Kosmelj K, Le-Rademacher J, Billard L. Symbolic covariance matrix for interval-valued variables and its application to principal component analysis: a case study. Metodoloski Zvezki No. 11, 2014, 1–20.
79. Le-Rademacher J, Billard L. Principal component analysis for histogram-valued data. Adv Data Anal Classif 2016; 1–25. doi:10.1007/s11634-016-0255-9.
80. Murillo JD, Rodríguez O, Diday E, Winsberg S. Generalization of the principal components analysis to histogram data. In: 4th European Conference on Principles and Practice of Knowledge Discovery in Data Bases, Lyon, France, 12–16 September, 2000.
81. Cazes P. Analyse factorielle d’un tableau de lois de probabilité. Rev Stat Appl 2002, 50:5–24.
82. Wang H, Chen M, Li N, Wang L. Principal Component Analysis of Modal Interval-Valued Data with Constant Numerical Characteristics. The Hague, The Netherlands: International Statistical Institute; 2012.
83. Shimizu N, Nakano J. Histograms principal component analysis. In: Arroyo J, Maté C, Brito P, Noirhomme M, eds, 3rd Workshop in Symbolic Data Analysis. Spain: Universidad Complutense de Madrid; 2012.
84. Ichino M. The quantile method for symbolic principal component analysis. Stat Anal Data Mining 2011, 4:184–198.
85. Billard L, Diday E. Regression analysis for interval-valued data. In: Data Analysis, Classification, and Related Methods, Proceedings of the Seventh Conference of the International Federation of Classification Societies (IFCS00). Namur, Belgium: Springer; 2000, 369–374.
86. De Carvalho FAT, Lima Neto EA, Tenorio CP. A new method to fit a linear regression model for interval-valued data. In: KI2004 Advances in Artificial Intelligence. Lecture Notes in Computer Science. Berlin/Heidelberg: Springer-Verlag; 2004, 295–306.
87. Wang H, Guan R, Wu J. Linear regression of interval-valued data based on complete information in hypercubes. J Syst Sci Syst Eng 2012, 21:422–442.
88. Xu W. Symbolic data analysis: interval-valued data regression. PhD Dissertation, University of Georgia, 2010.
89. Giordani P. Lasso-constrained regression analysis for interval-valued data. Adv Data Anal Classif 2015, 9:5–19.
90. Irpino A, Romano E. Optimal histogram representation of large data sets: Fisher vs piecewise linear approximation. Revue des Nouvelles Technologies de l’Information (RNTI) 2007, E-9:99–110.
91. Souza RMCR, Queiroz DCF, Cysneiros FJA. Logistic regression-based pattern classifiers for symbolic interval data. Pattern Anal Appl 2011, 14:273–282.
92. Dias S, Brito P. Linear regression model with histogram-valued variables. Stat Anal Data Mining 2011, 8:75–113. doi:10.1002/sam.11260.
93. Utkin LV, Coolen FPA. Interval-valued regression and classification models in the framework of machine learning. In: 7th International Symposium on Imprecise Probability: Theories and Applications, Innsbruck, Austria, 2011.


94. Sinova B, Colubi A, Gil MA, González-Rodríguez G. Interval arithmetic-based simple linear regression between interval data: discussion and sensitivity analysis on the choice of the metric. Inform Sci 2012, 199:109–124.
95. Cerny M, Antoch J, Hladik M. On the possibilistic approach to linear regression models involving uncertain, indeterminate or interval data. Inform Sci 2013, 244:26–47.
96. Afonso F, Billard L, Diday E. Symbolic linear regression with taxonomies. In: Proceedings of the Meeting of the International Federation of Classification Societies (IFCS), Chicago, IL. Berlin/Heidelberg: Springer-Verlag; 2004.
97. Neto EA, De Carvalho FAT. Constrained linear regression models for symbolic interval-valued variables. Comput Stat Data Anal 2010, 54:333–347.
98. Lauro C, Verde R, Irpino A. Generalized canonical analysis. In: Diday E, Noirhomme-Fraiture M, eds. Symbolic Data Analysis and the Sodas Software. Chichester: Wiley; 2008, 313–330.
99. Groenen PJF, Winsberg S, Rodriguez O, Diday E. I-Scal: multidimensional scaling of interval dissimilarities. Comput Stat Data Anal 2006, 51:360–378.
100. Terada Y, Yadohisa H. Multidimensional scaling with hyperbox model for percentile dissimilarities. In: Watada J, Phillips-Wren G, Jain LC, Howlett RJ, eds. Intelligent Decision Technologies. Berlin/Heidelberg: Springer-Verlag; 2011, 779–788.
101. Maia ALS, De Carvalho FDAT, Ludermir TB. Forecasting models for interval-valued time series. Neurocomputing 2008, 71:3344–3352.
102. Arroyo J, Espínola R, Maté C. Different approaches to forecast interval time series: a comparison in Finance. Comput Econ 2011, 37:169–191.
103. Teles P, Brito P. Modelling Interval Time Series with Space-Time Processes. Commun Stat Theory Method 2015, 44:3599–3627.
104. Arroyo J, Maté C. Forecasting histogram time series with k-nearest neighbors’ methods. Int J Forecast 2009, 25:192–207.
105. Ciampi A, Diday E, Lebbe J, Perinel E, Vignes R. Growing a tree classifier with imprecise data. Pattern Recogn Lett 2000, 21:787–803.
106. Bravo M, Garcia-Santesmases J. Symbolic Object Description of Strata by Segmentation Trees, Computational Statistics, vol. 15. Heidelberg, Germany: Physica-Verlag; 2000, 13–24.
107. Mballo C, Diday E. The criterion of Smirnov-Kolmogorov for binary decision tree: application to interval valued variables. Intell Data Anal 2006, 10:325–341.
108. Winsberg S, Diday E, Limam M. A tree structured classifier for symbolic class description. In: Compstat 2006. Rome, Italy: Physica-Verlag; 2006.
109. Afonso F, Diday E. Extension de l’algorithme Apriori et des règles d’association aux cas des données symboliques diagrammes et intervalles. In: Revue RNTI, Extraction et Gestion des Connaissances (EGC 2005), vol 1. Toulouse: Editions Cépaduès; 2005, 205–210.
110. Silva APD, Brito P. Linear discriminant analysis for interval data. Comput Stat 2006, 21:289–308.
111. Appice A, D’Amato C, Esposito F, Malerba D. In: Intelligent Data Analysis: Analysis of Symbolic and Spatial Data, vol. 10. The Netherlands: IOS Press Amsterdam; 2006, 301–324.
112. Duarte Silva AP, Brito P. Discriminant analysis of interval data: an assessment of parametric and distance-based approaches. J Classif 2015, 32:516–541.
113. Gowda KC, Diday E. Symbolic clustering using a new dissimilarity measure. Pattern Recogn 1991, 24:567–578.
114. Ichino M, Yaguchi H. Generalized Minkowski metrics for mixed feature-type data analysis. IEEE Trans Syst Man Cybern 1994, 24:698–707.
115. De Carvalho FAT. Extension based proximity coefficients between constrained Boolean symbolic objects. In: Hayashi C et al., eds. Proceedings of IFCS’96. Berlin: Springer-Verlag; 1998, 370–378.
116. De Carvalho F, Souza R, Chavent M, Lechevallier Y. Adaptive Hausdorff distances and dynamic clustering of symbolic interval data. Pattern Recogn Lett 2006, 27:167–179.
117. Rüschendorf L. Wasserstein metric. In: Hazewinkel M, ed. Encyclopedia of Mathematics. Berlin/Heidelberg: Springer; 2001.
118. Irpino A, Verde R. Dynamic clustering of interval data using a Wasserstein-based distance. Pattern Recogn Lett 2008, 29:1648–1658.
119. Kosmelj K, Le-Rademacher J, Billard L. Mallows’ L2 distance in some multivariate methods and its application to histogram-type data. Metodoloski Zvezki No. 9, 2012, 107–118.
120. Kim J, Billard L. Dissimilarity measures for histogram-valued observations. Commun Stat Theory Method 2013, 42:283–303.
121. García-Santesmases JM, Franco C, Montero J. Consensus measures for symbolic data. Comput Eng Inf Sci 2010, 4:651–658.
122. Diday E. The symbolic approach in clustering and related methods of data analysis: the basic choices. In: Bock H, ed. First Conference of the International Federation of Classifications Societies. North-Holland: Technical University of Aachen (RFA); 1988.
123. Diday E, Simon JC. Cluster analysis. In: Fu KS, ed. Digital Pattern Intent Recognition. Berlin/Heidelberg: Springer-Verlag; 1976.


124. Diday E. Optimisation en Classification Automatique, Tome 1, 2. Rocquencourt: INRIA; 1979.
125. Diday E, Schroeder A. A new approach in mixed distributions detection. Revue d’Automatique, Informatique et Recherche Opérationnelle (RAIRO), Paris, France; 1975, 10.
126. Diday E, Ok Y, Schroeder A. The dynamic cluster method in pattern recognition. In: Proceedings of IFIP Congress, Stockholm. North-Holland, 1974.
127. Ok-Sakun Y. Analyse factorielle typologique et lissage typologique. Thèse de 3ème cycle, Université Paris VI, Juin, 1975.
128. Charles C. Régression typologique et reconnaissance des formes. Thèse de doctorat 3ème cycle, Université Paris IX-Dauphine, Juin, 1977.
129. Diday E, Govaert G. Classification avec distance adaptative. CR Acad Sci Paris 1974, 278:993–995.
130. Diday E. Introduction à l’Analyse factorielle typologique. Rapport Laboria n° 27. Rocquencourt: INRIA; 1972.
131. Diday E. Analyse canonique du point de vue de la classification automatique. Rapport Laboria n° 293. Rocquencourt: INRIA; 1978.
132. De Souza RMCR, De Carvalho FAT. Clustering of interval data based on city-block distances. Pattern Recogn Lett 2004, 25:353–365.
133. De Carvalho FAT, Lechevallier Y. Partitional clustering algorithms for symbolic interval data based on single adaptive distances. Pattern Recogn 2010, 42:1223–1236.
134. Verde R, Irpino A. Dynamic Clustering of Histogram Data: Using the Right Metric. Selected Contributions in Data Analysis and Classification. Berlin/Heidelberg: Springer; 2007, 123–134.
135. Diday E. Orders and overlapping clusters by pyramids. In: Deleuw J, Heiser WJ, Meulman JJ, Critchley F, eds. Multivariables Data Analysis. Leiden: DSWO Press; 1986, 201–234.
136. Brito P, Diday E. Use of pyramids in symbolic data analysis. In: Diday E, Lechevallier Y, Schader M, Bertrand P, Burtschy B, eds. New Approaches in Classification and Data Analysis. Berlin: Springer-Verlag; 1990, 378–386.
137. Brito P. Symbolic objects: order structure and pyramidal clustering. Ann Oper Res 1995, 55:277–297.
138. Pak K, Rahal MC, Diday E. Élagage et aide à l’interprétation symbolique et graphique d’une pyramide. In: Congrès d’extraction et gestion des connaissances (EGC), 18–21 Janvier. Paris: Editions Cepadues; 2005.
139. Rahal MC, Diday E. Spatial hierarchical and pyramidal clustering software. In: Proceedings of the 10th Conference of the Federation of Classification Societies: Data Science and Classification, Ljubljana, Slovenia, 25–29 July, 2006. Editions Springer.
140. Chavent M. Criterion-based divisive clustering for symbolic data. In: Bock H-H, Diday E, eds. Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data. Berlin: Springer-Verlag; 2000, 299–311.
141. Kim J. Dissimilarity measures for histogram-valued data and divisive clustering of symbolic objects. Doctoral Dissertation, University of Georgia, 2009.
142. Kim J, Billard L. A polythetic clustering process and cluster validity indexes for histogram-valued objects. Comput Stat Data Anal 2011, 55:2250–2262.
143. Kim J, Billard L. Dissimilarity measures and divisive clustering for symbolic multimodal-valued data. Comput Stat Data Anal 2012, 56:2795–2808.
144. Brito P, Duarte Silva AP. Modelling interval data with normal and skew-normal distributions. J Appl Stat 2012, 39:3–20.
145. Hajjar C, Hamdan H. Self-organizing map based on L2 distance for interval-valued data. In: 6th IEEE International Symposium on Applied Computational Intelligence and Informatics (SACI 2011), Timisoara, Romania, 2011, 317–322.
146. Ganter B, Wille R. Formale Begriffsanalyse: Mathematische Grundlagen. Heidelberg, Deutschland: Springer-Verlag; 1996.
147. Wille R. Knowledge acquisition by methods of formal concepts analysis. In: Proceedings of the conference on Data Analysis, Learning Symbolic and Numeric Knowledge. Antibes, France: Nova Sciences; 1989, 365–380.
148. Barbut M, Monjardet B. Ordres et Classification. Paris: Hachette; 1971.
149. Polaillon G, Diday E. Galois lattices of symbolic objects. Rapport n° 9631. Paris: CEREMADE, University Paris; 1997.
150. Polaillon G, Diday E. Reduction of symbolic Galois lattices via hierarchies. In: Proceedings of Conference on Knowledge Extraction and Symbolic Data Analysis (KESDA’98). Luxembourg: Office for Official Publications of the European Communities; 1999, 137–143.
151. Polaillon G. Interpretation and reduction of Galois lattices of complex data. In: Rizzi A, Vichi M, Bock H-H, eds. Advances in Data Science and Classification. Berlin/Heidelberg: Springer-Verlag; 1998, 433–440.
152. Brito P, Polaillon G. Structuring probabilistic data by Galois lattices. Math Social Sci 2005, 169:77–104.
153. Brito P, Polaillon G. Homogeneity and stability in conceptual analysis. In: Napoli A, Vychodil V, eds. Proceedings of the 8th International Conference on Concept Lattices and Their Applications, Nancy, France. Nancy: INRIA; 2011, 251–263.


154. Montanary A, Calo DG. Model-based clustering of probability density functions. Adv Data Anal Classif 2013, 7:301–320.
155. Cuvelier E. QAMML: probability distributions for functional. PhD Thesis, University of Namur, Belgium, 2009.
156. Fan W, Bouguila N. Infinite Dirichlet mixtures models learning via expectation propagation. Adv Data Anal Classif 2013, 7:465–489.
157. Shimizu N, Nakano J. Clustering based on normal mixture model for aggregated symbolic data. In: Arroyo J, Maté C, Brito P, Noirhomme M, eds, 3rd Workshop in Symbolic Data Analysis. Spain: Universidad Complutense de Madrid; 2012.
158. Le-Rademacher J, Billard L. Likelihood functions and some maximum likelihood estimators for symbolic. J Stat Plan Inference 2011, 141:1593–1602.
159. Diday E. Modélisation de Données Symboliques et Application au cas des Intervalles. Orléans: Journées Nationales de la Société Francophone de Classification; 2011.
160. Chiun-How K, Chih-Wen O, Yin-Jing T, Chuan-kai Y, Chun-houh C. A symbolic database for TIMSS. In: Arroyo J, Maté C, Brito P, Noirhomme M, eds, 3rd Workshop in Symbolic Data Analysis. Spain: Universidad Complutense de Madrid; 2012.
161. Afonso F, Haddad R, Toque C, Eliezer ES, Diday E. User manual of the SYR software. Syrokko Internal Publication, 2012, 70. Available at: http://www.syrokko.com. (Accessed August 3, 2016).
162. Duarte Silva AP, Brito P. MAINT.DATA: model and analyze interval data. R Package, version 0.2; 2011. Available at: http://cran.r-project.org/web/packages/MAINT.Data/index.html. (Accessed August 3, 2016).
163. Irpino A. HistDAWass: Histogram-Valued Data Analysis, R package, version 0.1.4. 2016. Available at: https://cran.r-project.org/web/packages/HistDAWass/index.html. (Accessed August 3, 2016).
164. Irpino A, Verde R. Linear regression for numeric symbolic variables: a least squares approach based on Wasserstein distance. Adv Data Anal Classif 2015, 9:81–106.
165. Irpino A, Verde R. Basic statistics for distributional symbolic variables: a new metric-based approach. Adv Data Anal Classif 2015, 9:143–175.
166. Benzécri JP. L’Analyse des Données: l’Analyse des Correspondances. Paris: Dunod; 1980.
167. Pawlowsky-Glahn V, Egozcue JJ, Tolosana-Delgado R. Modeling and Analysis of Compositional Data. Chichester: Wiley; 2015.
168. Fisher RA. On the mathematical foundations of theoretical statistics. Philos Trans A Math Phys Eng Sci 1922, 222:309–368.
169. Wang H, Shangguan L, Guan R, Billard L. Principal component analysis for compositional data vectors. Comput Stat 2015, 30:1079–1096.
170. Dempster A, Laird N, Rubin D. Maximum likelihood from incomplete data with the EM algorithm. J R Stat Soc Series B Stat Methodol 1977, 39:1–38.
171. Gelman A, Carlin J, Stern H, Rubin D. Bayesian Data Analysis. 2nd ed. New York: Chapman and Hall; 2001.
172. Marin J-M, Robert C. Bayesian Core: A Practical Approach to Computational Bayesian Statistics. New York: Springer-Verlag; 2007.
173. Diday E, Emilion R. Symbolic bayesian network. In: SDA’2015, 17–19 November, Orleans, France. 2015. Available at: http://www.univ-orleans.fr/mapmo/colloques/sda2015/SDA2015 Slides.zip. (Accessed August 3, 2016).
174. Raudenbush SW, Bryk AS. Hierarchical Linear Models. 2nd ed. Thousand Oaks, CA: Sage; 2002.
175. Inuiguchi M, Hirano S, Tsumoto S, eds. Rough Set Theory and Granular Computing. Berlin: Springer; 2003.
176. Pedrycz W. Granular Computing: Analysis and Design of Intelligent Systems. Boca Raton, FL: CRC Press/Taylor & Francis; 2013.
177. Pawlak Z. Rough Sets: Theoretical Aspects of Reasoning About Data. Dordrecht: Kluwer Academic Publishing; 1991. ISBN: 0-7923-1472-7.
178. Verde R, Diday E. Chapter 16—symbolic data analysis: a factorial approach based on fuzzy coded data. In: Blasius J, Greenacre M, eds. Visualization and Verbalization of Data. Mathematics|Probability and Statistics. UK: CRC Press Chapman & Hall book; 2014, 255–270.
179. Diday E, Moreau JV. Hierarchical Inference. In: Proceedings in Computational Statistics (COMPSTAT 6), Prague: Physica-Verlag; 1984.
180. Lance GN, Williams WT. A general theory of classificatory sorting strategies: hierarchical systems. Comput J 1967, 9:373–380.
181. Meroune O. Traitement à grande échelle des données symboliques. PhD co-directed by Prof. E. Diday and P. Rigaux, Paris Dauphine University. France, 2011.
182. Minami H, Mizuta M. SDA framework is the tool for big data analysis? In: Arroyo J, Maté C, Brito P, Noirhomme M, eds. 3rd Workshop in Symbolic Data Analysis, Spain: Universidad Complutense de Madrid; 2012.
183. Schweizer B. Distributions are the numbers of the future. In: Proc. Sec. Napoli Meeting on “The Mathematics of Fuzzy Systems”. Instituto di Mathematica delle Faculta di Mathematica delle Faculta di Achitectura, Universita degli studi di Napoli; 1984, 137–149.

3150 2010 November 01 Agung Santoso-with-cover-page-V2
18 pages
Hypothesis Testing in Statistics
No ratings yet
Hypothesis Testing in Statistics
4 pages
Ellis Dissertation
No ratings yet
Ellis Dissertation
273 pages
Business Analytics - The Science of Data Driven Decision Making
No ratings yet
Business Analytics - The Science of Data Driven Decision Making
4 pages
Quantitative Methods Module
No ratings yet
Quantitative Methods Module
28 pages
Indian Auto 4-Wheeler Sector Analysis
No ratings yet
Indian Auto 4-Wheeler Sector Analysis
81 pages
PreTest & Post Test
No ratings yet
PreTest & Post Test
3 pages
Performance Management Systems Guide
No ratings yet
Performance Management Systems Guide
44 pages
Customer Relationship Management in Banks PPT 22 May
No ratings yet
Customer Relationship Management in Banks PPT 22 May
48 pages
EDA Credit Case Study
100% (1)
EDA Credit Case Study
21 pages
Data Science Life Cycle - All Details
No ratings yet
Data Science Life Cycle - All Details
12 pages
Data Analytics Fundementals
No ratings yet
Data Analytics Fundementals
40 pages
Mathematics in The Modern World
No ratings yet
Mathematics in The Modern World
3 pages
Master Tabel Pengetahuan
No ratings yet
Master Tabel Pengetahuan
11 pages
AI Student Marks Predictor
No ratings yet
AI Student Marks Predictor
8 pages
21CSE355T DMA-8-15 Marks Question Bank
No ratings yet
21CSE355T DMA-8-15 Marks Question Bank
2 pages
Chapter 9,10,11,12 - Công TH C
No ratings yet
Chapter 9,10,11,12 - Công TH C
9 pages
Module 4 Inferential Statistics PDF
No ratings yet
Module 4 Inferential Statistics PDF
25 pages
Business Analytics Executive MBA Exercises
No ratings yet
Business Analytics Executive MBA Exercises
5 pages
Smart Hub Consultancy RFP
No ratings yet
Smart Hub Consultancy RFP
34 pages

INTRODUCTION

already exists before thinking on extracting any new

172 © 2016 Wiley Periodicals, Inc. Volume 8, September/October 2016


[Figure: panels "Standard data table" and "Symbolic data table"; columns: Age, Weight, Nationalities]


level units by ‘symbolic data’ expressing variability

SDA PARADIGM


say, in this case, that we have a population X of Y2


Numerical variables for symbolic data | Symbolic variables

(Y1(C1), Y2(C1)) = ([a1i, b1i], [a2i, b2i])

Numerical space: four variables | Symbolic space: two variables
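
The correspondence named in these labels (two interval-valued symbolic variables versus four numerical variables) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the article: the helper names `to_intervals` and `flatten`, the sample values for a hypothetical class C1, and the choice of [min, max] as the aggregation rule that produces each interval.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]


def to_intervals(rows: List[Dict[str, float]], variables: List[str]) -> Dict[str, Interval]:
    """Describe one class by a [min, max] interval per variable.

    Assumption: min/max aggregation; other summaries are possible.
    """
    return {
        v: (min(r[v] for r in rows), max(r[v] for r in rows))
        for v in variables
    }


def flatten(description: Dict[str, Interval]) -> Dict[str, float]:
    """Turn each interval-valued variable into two numerical variables (bounds)."""
    out: Dict[str, float] = {}
    for name, (lower, upper) in description.items():
        out[f"{name}_min"] = lower
        out[f"{name}_max"] = upper
    return out


# Hypothetical class C1 made of three individuals measured on Y1 and Y2.
c1 = [
    {"Y1": 20.0, "Y2": 60.0},
    {"Y1": 35.0, "Y2": 82.5},
    {"Y1": 28.0, "Y2": 71.0},
]

symbolic = to_intervals(c1, ["Y1", "Y2"])   # symbolic space: two variables
numerical = flatten(symbolic)               # numerical space: four variables
print(symbolic)
print(numerical)
```

The flattened form keeps the bounds but no longer records that `Y1_min` and `Y1_max` belong to the same variable, which is one way to read the distinction between the two spaces.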


DC/Z < DC/Y and

DC/Z < DC/Y and i1 1 0 1

DC/Z < DC/Y and i3 1 1 0


where: ‘symbolic rough set’ associated with a class C


Y2 consistent as h goes to 0 and its limit is the result of a


[Figure: panels "Crack variables", "Corrosion variables", "Cracks and corrosion variables", "Symbolic data table"]


FIGURE 10 | From relational data base to symbolic data. (Figure labels: Individuals; Class variable; Description of the higher level units; Symbolic data table.)

FIGURE 11 | Some symbolic data analysis tools output. (Panels: overlapping stars whose radii are associated with symbolic variables; top-down clustering tree or decision tree on symbolic data.)


Towel Height Gap1 Radius Gap2

ouv01 157.31:299.76 41.66:68.41

ouv13 95.38:233.38 38.98:62.02

ouv02 157.36:300.98 41.66:68.36

ouv07 204.56:332.56 42.02:57.71

ouv09 203.09:331.09 42.04:58.19

ouv10 201.1:334.02 42.07:58.88

FIGURE 12 | A symbolic data table provided by the SYR software


FIGURE 13 | A NETSYR output of a PCA extended to symbolic data.

KNN, and exponential smoothing. It can be down-

R-Package: iRegression

SOME DIRECTION OF RESEARCH AND


(see Unsupervised Classification Extended to Sym-

Granular Computing

In Data Science, this model has no sense in case of
