Guide To Intelligent Data Analysis
Series Editors
David Gries, Department of Computer Science, Upson Hall, Cornell University, Ithaca, NY 14853-7501, USA
Fred B. Schneider, Department of Computer Science, Upson Hall, Cornell University, Ithaca, NY 14853-7501, USA
ISSN 1868-0941 e-ISSN 1868-095X
ISBN 978-1-84882-259-7 e-ISBN 978-1-84882-260-3
DOI 10.1007/978-1-84882-260-3
Springer London Dordrecht Heidelberg New York
The main motivation to write this book came from our difficulty in finding suitable
material for a textbook that would really help us to teach the practical aspects of data
analysis together with the needed theoretical underpinnings. Many books
tackle either one or the other of these aspects (and, especially for the latter, there are
some fantastic textbooks available), but a book providing a good combination was
nowhere to be found.
The idea to write our own book to address this shortcoming arose in two different
places at the same time—when one of the authors was asked to review the book
proposal of the others, we quickly realized that it would be much better to join
forces instead of independently pursuing our individual projects.
We hope that this book helps others to learn what kind of challenges data analysts
face in the real world and at the same time provides them with solid knowledge
about the processes, algorithms, and theories to successfully tackle these problems.
We have put a lot of effort into balancing the practical aspects of applying
data analysis techniques with a sound explanation of the statistical and
mathematical underpinnings of the algorithms involved.
There are many people to be thanked, and we will not attempt to list them all.
However, we do want to single out Iris Adä who has been a tremendous help with
the generation of the data sets used in this book. She and Martin Horn also deserve
our thanks for an intense last-minute round of proofreading.
Contents
1 Introduction
1.1 Motivation
1.1.1 Data and Knowledge
1.1.2 Tycho Brahe and Johannes Kepler
1.1.3 Intelligent Data Analysis
1.2 The Data Analysis Process
1.3 Methods, Tasks, and Tools
1.4 How to Read This Book
References
3 Project Understanding
3.1 Determine the Project Objective
3.2 Assess the Situation
3.3 Determine Analysis Goals
3.4 Further Reading
References
4 Data Understanding
4.1 Attribute Understanding
4.2 Data Quality
4.3 Data Visualization
4.3.1 Methods for One and Two Attributes
4.3.2 Methods for Higher-Dimensional Data
4.4 Correlation Analysis
5 Principles of Modeling
5.1 Model Classes
5.2 Fitting Criteria and Score Functions
5.2.1 Error Functions for Classification Problems
5.2.2 Measures of Interestingness
5.3 Algorithms for Model Fitting
5.3.1 Closed Form Solutions
5.3.2 Gradient Method
5.3.3 Combinatorial Optimization
5.3.4 Random Search, Greedy Strategies, and Other Heuristics
5.4 Types of Errors
5.4.1 Experimental Error
5.4.2 Sample Error
5.4.3 Model Error
5.4.4 Algorithmic Error
5.4.5 Machine Learning Bias and Variance
5.4.6 Learning Without Bias?
5.5 Model Validation
5.5.1 Training and Test Data
5.5.2 Cross-Validation
5.5.3 Bootstrapping
5.5.4 Measures for Model Complexity
5.6 Model Errors and Validation in Practice
5.6.1 Errors and Validation in KNIME
5.6.2 Validation in R
5.7 Further Reading
References
A Statistics
A.1 Terms and Notation
A.2 Descriptive Statistics
A.2.1 Tabular Representations
A.2.2 Graphical Representations
A.2.3 Characteristic Measures for One-Dimensional Data
A.2.4 Characteristic Measures for Multidimensional Data
A.2.5 Principal Component Analysis
A.3 Probability Theory
A.3.1 Probability
A.3.2 Basic Methods and Theorems
A.3.3 Random Variables
A.3.4 Characteristic Measures of Random Variables
A.3.5 Some Special Distributions
A.4 Inferential Statistics
A.4.1 Random Samples
A.4.2 Parameter Estimation
A.4.3 Hypothesis Testing
C KNIME
C.1 Installation and Overview
C.2 Building Workflows
C.3 Example Flow
C.4 R Integration
References
Appendix A
Appendix B
Index
Chapter 1
Introduction
In this introductory chapter we provide a brief overview of some core ideas of
intelligent data analysis and their motivation. In a first step we carefully distinguish
between “data” and “knowledge” in order to obtain clear notions that help us to work
out why it is usually not enough to simply collect data and why we have to strive
to turn them into knowledge. As an illustration, we consider a well-known example
from the history of science. In a second step we characterize the data analysis pro-
cess, also often referred to as the knowledge discovery process, in which so-called
“data mining” is one important step. We characterize standard data analysis tasks
and provide a brief catalog of methods and tools to tackle them.
1.1 Motivation
Every year that passes brings us more powerful computers, faster and cheaper stor-
age media, and higher bandwidth data connections. Due to these groundbreaking
technological advancements, it is possible nowadays to collect and store enormous
amounts of data with amazingly little effort and at impressively low costs. As a
consequence, more and more companies, research centers, and governmental in-
stitutions create huge archives of tables, documents, images, and sounds in elec-
tronic form. Since for centuries lack of data has been a core hindrance to scientific
and economic progress, we feel compelled to think that we can solve—at least in
principle—basically any problem we are faced with if only we have enough data.
M.R. Berthold et al., Guide to Intelligent Data Analysis, Texts in Computer Science 42,
DOI 10.1007/978-1-84882-260-3_1, © Springer-Verlag London Limited 2010
However, a closer examination of the matter reveals that this is an illusion. Data
alone, regardless of how voluminous they are, are not enough. Even though large
databases allow us to retrieve many different single pieces of information and to
compute (simple) aggregations (like average monthly sales in Berlin), general
patterns, structures, and regularities often go undetected. We may say that in the vast
amount of data stored in some databases we cannot see the wood (the patterns)
for the trees (the individual data records). However, it is most often exactly these
patterns, regularities, and trends that are particularly valuable if one desires, for
example, to increase the turnover of a supermarket. Suppose, for instance, that a
supermarket manager discovers, by analyzing the sales and customer records, that
certain products are frequently bought together. In such a case sales can sometimes
be stimulated by cleverly arranging these products on the shelves of the market (they
may, for example, be placed close to each other, or may be offered as a bundle, in
order to invite even more customers to buy them together).
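To make the supermarket example a little more concrete, the following minimal sketch counts how often pairs of products occur in the same purchase. The baskets and product names are invented for illustration, and the function name `pair_support` is our own; this is one simple way to look for products that are "frequently bought together", not a full pattern mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sales records: each set is one customer's basket.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"beer", "chips"},
    {"bread", "butter", "chips"},
    {"milk", "chips"},
]

def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of products."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is counted under one canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

support = pair_support(baskets)
# Pairs bought together most often come first; ties are broken alphabetically.
ranked = sorted(support.items(), key=lambda item: (-item[1], item[0]))
for pair, share in ranked[:3]:
    print(pair, round(share, 2))
```

In this toy data the pair bread and butter appears in three of the five baskets, so it would be the first candidate for a joint shelf placement. On real sales data the number of candidate pairs grows quickly, which is one reason why dedicated methods are needed.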
Unfortunately, it turns out to be harder than may be expected at first sight to ac-
tually discover such patterns and regularities and thus to exploit a larger part of the
information that is contained in the available data. In contrast to the overwhelm-
ing flood of data there was, at least at the beginning, a lack of tools by which raw
data could be transformed into useful information. Almost fifteen years ago John
Naisbitt aptly characterized the situation by saying [3]: “We are drowning in
information, but starving for knowledge.” As a consequence, a new research area has
been developed, which has become known under the name of data mining. The goal
of this area was to meet the challenge of developing tools that can help humans to find
potentially useful patterns in their data and to solve the problems they are facing
by making better use of the data they have. Today, about fifteen years later, a lot of
progress has been made, and a considerable number of methods and implementa-
tions of these techniques in software tools have been developed. Still it is not the
tools alone, but the intelligent composition of human intuition with the computa-
tional power, of sound background knowledge with computer-aided modeling, of
critical reflection with convenient automatic model construction, that leads intelli-
gent data analysis projects to success [1]. In this book we try to provide a hands-on
approach to many basic data analysis techniques and how they are used to solve data
analysis problems if relevant data is available.
1.1.1 Data and Knowledge
In this book we distinguish carefully between data and knowledge. Statements like
“Columbus discovered America in 1492” or “Mister Smith owns a VW Beetle” are
data. Note that we ignore whether we already know these statements or whether
we have any concrete use for them at the moment. The essential property of these
statements we focus on here is that they refer to single events, objects, people, points
in time, etc. That is, they generally refer to single instances or individual cases. As a
consequence, their domain of application and thus their utility is necessarily limited.
In contrast to this, knowledge consists of statements like “All masses attract
each other” or “Every day at 7:30 AM a flight with destination New York departs
from Frankfurt Airport.” Again, we neglect the relevance of these statements for
our current situation and whether we already know them. Rather, we focus on the
essential property that they do not refer to single instances or individual cases but are
general rules or (physical) laws. Hence, if they are true, they have a large domain of
application. Even more importantly, though, they allow us to make predictions and
are thus highly useful (at least if they are relevant to us).
We have to admit, though, that in daily life we also call statements like “Colum-
bus discovered America in 1492” knowledge (actually, this particular statement is
knowledge
• refers to classes of instances
(sets of objects, people, events, points in time, etc.)
• describes general patterns, structures, laws, principles, etc.
• consists of as few statements as possible
(this is actually an explicit goal, see below)
• is often difficult and time-consuming to find or to obtain
(e.g., natural laws, education)
• allows us to make predictions and forecasts
These characterizations make it very clear that generally knowledge is much more
valuable than (raw) data. Its generality and the possibility to make predictions about
the properties of new cases are the main reasons for this superiority.
It is obvious, though, that not all kinds of knowledge are equally valuable. Not all
general statements are equally important, equally substantial, equally
significant, or equally useful. Therefore knowledge has to be assessed, so that we
do not drown in a sea of irrelevant knowledge. The following list (which we do not
claim to be complete) names some of the most important criteria:
criteria to assess knowledge
• correctness (probability, success in tests)
• generality (domain and conditions of validity)
• usefulness (relevance, predictive power)
• comprehensibility (simplicity, clarity, parsimony)
• novelty (previously unknown, unexpected)
In the domain of science, the focus is on correctness, generality, and simplicity
(parsimony): one way of characterizing science is to say that it is
the search for a minimal correct description of the world. In business and industry,
however, the emphasis is placed on usefulness, comprehensibility, and novelty: the
main goal is to gain a competitive edge and thus to increase revenues. Nevertheless,
neither of the two areas can afford to neglect the other criteria.
1.1.2 Tycho Brahe and Johannes Kepler
We illustrate the considerations of the previous section with an (at least partially)
well-known example from the history of science. In the sixteenth century studying
the stars and the planetary motions was one of the core areas of research. Among its
proponents was Tycho Brahe (1546–1601), a Danish nobleman and astronomer, who
in 1576 and 1584, with the financial help of King Frederic II, built two observatories
on the island of Ven, about 32 km north-east of Copenhagen. He had access to the
best astronomical instruments of his time (but no telescopes, which were used only
later by Galileo Galilei (1564–1642) and Johannes Kepler (see below) to observe
celestial bodies), which he used to determine the positions of the sun, the moon,
and the planets to a precision of better than one arcminute. With this precision
he managed to surpass all measurements that had been carried out before and to
actually reach the theoretical limit for observations with the unaided eye (that is,
without the help of telescopes). Working carefully and persistently, he recorded the
motions of the celestial bodies over several years.
Stated plainly, Tycho Brahe collected data about our planetary system, fairly
large amounts of data, at least from the point of view of the sixteenth century. How-
ever, he failed to find a consistent scheme to combine them, could not discern a
clear underlying pattern—partially because he stuck too closely to the geocentric
system (the earth is in the center, and all planets, the sun, and the moon revolve
around the earth). He could tell the precise location of Mars on any given day of
the year 1582, but he could not connect its locations on different days by a clear
and consistent theory. None of the hypotheses he tried fit his highly precise data. For
example, he developed the so-called Tychonic planetary system (the earth is in the
center, the sun and the moon revolve around the earth, and the other planets revolve
around the sun on circular orbits). Although temporarily popular in the seventeenth
century, this system did not stand the test of time. From a modern point of view we
may say that Tycho Brahe had a “data analysis problem” (or “knowledge discov-
ery problem”). He had obtained the necessary data but could not extract the hidden
knowledge.
This problem was solved later by Johannes Kepler (1571–1630), a German
astronomer and mathematician, who worked as an assistant of Tycho Brahe. In contrast
to Brahe, he advocated the Copernican planetary system (the sun is in the center,
the earth and all other planets revolve around the sun in circular orbits) and tried
all his life to reveal the laws that govern the motions of the celestial bodies. His ap-
proach was almost radical for his time, because he strove to find a mathematical de-
scription. He started his investigations with the data Tycho Brahe had collected and
which he extended in later years. After several fruitless trials and searches and long
and cumbersome calculations (imagine: no pocket calculators), Kepler finally suc-
ceeded. He managed to combine Tycho Brahe’s data into three simple laws, which
nowadays bear his name: Kepler’s laws. Having realized as early as 1604 that
the course of Mars is an ellipse, he published the first two of these laws in his work
“Astronomia Nova” in 1609 [6] and the third law ten years later in his magnum opus
“Harmonices Mundi” [4, 7]:
1. The orbit of every planet (including the earth) is an ellipse,
with the sun at a focal point.
2. A line from the sun to the planet sweeps out equal areas
during equal intervals of time.
3. The squares of the orbital periods of any two planets relate to each other
like the cubes of the semimajor axes of their respective orbits:
$T_1^2 / T_2^2 = a_1^3 / a_2^3$, and therefore generally $T \sim a^{3/2}$.
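The third law lends itself to a small numerical check. The sketch below is only an illustration: it uses approximate modern values for the semimajor axes and orbital periods of six planets (not Tycho Brahe's measurements), fits the exponent in T ∼ a^k by least squares on the logarithms, and prints the ratio T²/a³ for each planet. The function name `fit_exponent` is our own.

```python
import math

# Approximate modern values: semimajor axis a (astronomical units) and
# orbital period T (years). These are NOT Tycho Brahe's measurements.
planets = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Earth": (1.000, 1.000),
    "Mars": (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
    "Saturn": (9.537, 29.457),
}

def fit_exponent(data):
    """Least-squares slope of log T against log a."""
    xs = [math.log(a) for a, _ in data.values()]
    ys = [math.log(t) for _, t in data.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

exponent = fit_exponent(planets)
print(f"fitted exponent: {exponent:.3f}")  # should come out close to 3/2

# Third law check: T^2 / a^3 should be nearly the same for every planet.
ratios = {name: t**2 / a**3 for name, (a, t) in planets.items()}
for name, ratio in ratios.items():
    print(f"{name:8s} T^2/a^3 = {ratio:.3f}")
```

With units of years and astronomical units, the ratio is close to 1 for all six planets, which is the law stated as a single number. Kepler, of course, had neither the law to test nor a regression routine to suggest the exponent.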
Tycho Brahe had collected a large amount of astronomical data, and Johannes Ke-
pler found the underlying laws that can explain them. He discovered the hidden
knowledge and thus became one of the most famous “data miners” in history.
Today the works of Tycho Brahe are almost forgotten—few have even heard his
name. His catalogs of celestial data are merely of historical interest. No textbook on
astronomy contains excerpts from his measurements—and this is only partially due
to the better measurement technology we have available today. His observations and
precise measurements are raw data and thus suffer from a decisive drawback: they
do not provide any insight into the underlying mechanisms and thus do not allow us
to make predictions. Kepler’s laws, on the other hand, are treated in basically all as-
tronomy and physics textbooks, because they state the principles according to which
planets and comets move. They combine all of Brahe’s observations and measure-
ments in three simple statements. In addition, they permit us to make predictions:
if we know the location and the speed of a planet relative to the sun at any given
moment, we can compute its future course by drawing on Kepler’s laws.
How did Johannes Kepler find the simple astronomical laws that bear his name?
How did he discover them in Tycho Brahe’s long tables and voluminous catalogs,
thus revolutionizing astronomy? We know fairly little about his searches and efforts.
He must have tried a large number of hypotheses, most of them failing. He must have
carried out long and cumbersome computations, repeating some of them several
times to eliminate errors. It is likely that exceptional mathematical talent, hard and
tenacious work, and a significant amount of good luck finally led him to success.
What we can be sure of is that he did not possess a universally applicable procedure
or method to discover physical or astronomical laws.
Even today we are not much further: there is still no silver bullet to hit on the right
solution. It is still much easier to collect data, with which we are virtually swamped
in today’s “information society” (whatever this popular term actually means) than
to discover knowledge. Automatic measurement instruments and scanners, digital
cameras and computers, and an abundance of other automatic and semiautomatic
devices have even relieved us of the burden of manual data collection. In addition,
database and data warehouse technology allows us to store ever increasing amounts
of data and to retrieve and to sample them easily. John Naisbitt was perfectly right:
“We are drowning in information, but starving for knowledge.”
It took a distinguished researcher like Johannes Kepler several years (actually
half a lifetime) to evaluate the data that Tycho Brahe had collected—data that from
a modern point of view are negligibly few and of which Kepler actually analyzed
closely only those about the orbit of Mars. Given this, how can we hope today to
cope with the enormous amounts of data we are faced with every day? “Manual”
analyses (like Kepler’s) have long ceased to be feasible. Simple aids, like the visual-
ization of data in charts and diagrams, even though highly useful and certainly a first
and important step, quickly reach their limits. Thus, if we refuse to surrender to the
flood of data, we are forced to develop and employ computer-aided techniques, with
which data analysis can be simplified or even automated to some degree. These are
the methods that have been and still are developed in the research areas of intelligent
data analysis, knowledge discovery in databases and data mining. Even though these
methods are far from replacing human beings like Johannes Kepler, especially since
a mindless application can produce artifacts and misleading results, it is not entirely
implausible to assume that Kepler, if he had been supported by these methods and
tools, could have reached his goal a little earlier.
1.1.3 Intelligent Data Analysis
Many people associate any kind of data analysis with statistics (see also Ap-
pendix A, which provides a brief review). Statistics has a long history and originated
from collecting and analyzing data about the population and the state in general.
Statistics can be divided into descriptive and inferential statistics. Descriptive
statistics summarizes data without making specific assumptions about the data,
often by characteristic values like the (empirical) mean or by diagrams like
histograms. Inferential statistics provides more rigorous methods, which are
based on certain assumptions about the random process that generates the data.
The conclusions drawn in inferential statistics are only valid if these
assumptions are satisfied.
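As a small illustration of descriptive statistics, the following sketch computes the empirical mean of a sample and a simple text histogram. The sales figures are invented, and the bin width of 5 is an arbitrary choice made for this example; no assumption about how the data were generated is needed.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample: monthly sales figures (in thousands) for one store.
sales = [12.1, 14.8, 13.5, 19.2, 15.0, 14.1, 22.7, 13.9, 16.4, 15.5]

def histogram(values, bin_width):
    """Count how many values fall into each bin [k*w, (k+1)*w)."""
    counts = Counter(int(v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

print("empirical mean:", round(mean(sales), 2))
for left, count in histogram(sales, bin_width=5).items():
    # One '#' per observation in the bin.
    print(f"[{left:>2}, {left + 5:>2}) " + "#" * count)
```

The mean condenses the sample into one characteristic value, while the histogram shows that most months fall between 10 and 20 and one month lies clearly above the rest.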
Typically, in statistics the first step of the data analysis process is to design the
experiment that defines how data should be collected in order to be able to carry out
a reliable analysis based on the obtained data. To capture this important issue, we
distinguish between experimental and observational studies. In an experimental
study one can control and manipulate the data generating process. For instance,
if we are interested in the effects of certain diets on the health status of a person,
we might ask different groups of people to stick to different diets. Thus we have a
certain control over the data generating process. In this experimental study, we can
decide which and how many people should be assigned to a certain diet.
In an observational study one cannot control the data generating process. For
the same dietary study as above, we might simply ask people on the street what they
normally eat. Then we have no control over which diets are represented in the
data or how many people we will have for each diet.
No matter whether the study is experimental or observational, there are usually
independence assumptions involved, and the data we collect should be representa-
tive. The main reason is that inferential statistics is often applied to hypothesis test-
ing where, based on the collected data, we desire to either confirm or reject some
hypothesis about the considered domain. In this case representative data and certain
independencies are required in order to ensure that the test decisions are valid.
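To illustrate what such a test decision can look like, the sketch below uses a permutation test for the difference between two group means. This particular test is our choice for the example and is not prescribed by the text; it assumes that the observations are independent and would be exchangeable between the groups if the diet had no effect. The health scores are made up, and the function name and parameters are hypothetical.

```python
import random

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Assumes independent observations that are exchangeable under the
    null hypothesis "group membership does not matter".
    """
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical health scores for two diet groups (made-up numbers).
diet_a = [5.1, 4.8, 5.5, 5.0, 5.3, 4.9]
diet_b = [3.9, 4.1, 3.7, 4.0, 3.8, 4.2]
p_value = permutation_test(diet_a, diet_b)
print("estimated p-value:", round(p_value, 4))
```

A small p-value suggests rejecting the hypothesis that both diets lead to the same mean score. If the people in the two groups were not comparable to begin with, as can easily happen in an observational study, the same number would be much harder to interpret.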
In contrast to hypothesis testing, exploratory data analysis is concerned with
generating hypotheses from the collected data. In exploratory data analysis there
are no or at least considerably weaker model assumptions about the data generating
process. Most of the methods presented in this book fall into this category, since
they are mostly universal methods designed to achieve a certain goal but are not
based on a rigorous model as in inferential statistics.
The typical situation we assume in this book is that we already have the data.
They might not have been collected in the best way, or in the way we would have
collected them had we been able to design the experiment in advance. Therefore,
it is often difficult to make specific assumptions about the data generating process.
We are also mostly goal-oriented: we ask questions like “Which customers
will yield the highest profit?” and search for methods that can help us to answer
such questions or to solve our problems.
The opportunity of analyzing large business databases that were initially col-
lected for completely different purposes came with the availability of powerful tools
and technologies that can process and analyze massive amounts of data, so-called
data mining techniques. A few years ago some people seemed to believe that with
just the right data mining tool at hand any kind of desired knowledge could be
squeezed out of a given database automatically with little or no human
intervention. However, practical experience demonstrates that every problem is different
and that full automation of the data analysis process is simply impossible. Today
we understand by knowledge discovery in databases (KDD) an interactive “pro-
cess of identifying valid, novel, potentially useful, and ultimately understandable
patterns in data” [3]. This process consists of multiple phases, and the data mining
or modeling step became just a single step in it. That is, after a period of time where
powerful tools were (sometimes) naively applied to the data, the “intelligent
analyst” is brought back into the loop. As a consequence, the KDD process no longer
differs much from classical statistical data analysis (except where the lack of
principled data acquisition takes its toll). To emphasize that every project is different
and therefore intelligence is required to make the most out of the already gathered
data, we use the term intelligent data analysis, which was coined by David Hand
[1, 5] (and is used today almost synonymously with the KDD process).
In this book we strove to provide a comprehensive guide to intelligent data anal-
ysis, outlining the process and its phases, presenting methods and algorithms for
various tasks and purposes, and illustrating them with two freely available software
tools. In this way we hope to offer a good starting point for anyone who wishes to
become more familiar with the area of intelligent data analysis.
1.2 The Data Analysis Process
In the first case, the problem at hand is by no means new, but it is already solved as
a matter of routine (e.g., approval of credit card applications, technical inspection
during quality assurance, machine control by a plant operator, etc.). If data has been
collected for the past cases together with the result that was finally achieved (such
as poor customer performance, malfunction of parts, etc.), such historical data may
be used to revise and optimize the presently used strategy to reach a decision. In
the second case, a certain question arises for the first time, and only little experi-
ence is available, or the experience is not directly applicable to this new question
(e.g., starting with a new product, preventing abuse of servers, evaluating a large
experiment or survey). In such cases, it is supposed that data from related situations
may be helpful to generalize the new problem or that unknown relationships can be
discovered from the data to gain insights into this unfamiliar area.
What if we have no data at all? This situation rarely occurs in practice,
since in most cases there is at least some data. Especially in businesses huge
amounts of data have been collected and stored for operational reasons in the past
(e.g., billing, logistics, warranty claims) that may now be used to optimize vari-
ous decisions or offer new options (e.g., predicting customer performance, reducing
stock on hand, tracking causes of defects). So the right question should be: How do
we know if we have enough relevant data? This question is not answered easily. If
it actually turns out that the data is not sufficient, one option is to acquire new data
to solve the problem. However, as already pointed out in the preceding section, the
experimental design of data acquisition is beyond the scope of this book.
There are several proposals about what the intelligent data analysis process
should look like, such as SEMMA (an acronym for sample, explore, modify, model,
assess used by SAS Institute Inc.), CRISP-DM (an acronym for CRoss Industry
Standard Process for Data Mining as defined by the CRISP-DM consortium) [2],
or the KDD-process [3] (see [8] for a detailed comparison). In this book, we are
going to follow the CRISP-DM process, which has been developed by a consortium
of large companies, such as NCR, Daimler, and SPSS, and appears to be the most
widely used process model for intelligent data analysis today.
CRISP-DM consists of six phases as shown in Fig. 1.1. Most of these phases are
usually executed more than once, and the most frequent phase transitions are shown
by arrows. The main objective of the first project understanding step (see Chap. 3)
is to identify the potential benefit as well as the risks and efforts of a successful
project, such that a deliberate decision on conducting the full project can be made.
The envisaged solution is also transferred from the project domain to a more techni-
cal, data-centered notion. This first phase is usually called business understanding,
but we stick to the more general term project understanding to emphasize that our
problem at hand may as well be purely technical in nature or a research project
rather than economically motivated.
Next we need to make sure that we will have sufficient data at hand to tackle
the problem. While we cannot know this for sure until the end of the project, we
at least have to convince ourselves that there is enough relevant data. To achieve
this, we proceed in the data understanding phase (see Chap. 4) with a review of
the available databases and the information contained in the database fields, a visual
Fig. 1.1 Overview of the CRISP-DM process together with typical questions to be asked in the
respective phases
assessment of the basic relationships between attributes, a data quality audit, an in-
spection of abnormal cases (outliers), etc. For instance, outliers appear to be abnor-
mal in some sense and are often caused by faulty insertion, but sometimes they give
surprising insights on closer inspection. Some techniques respond very sensitively
to outliers, which is why they should be treated with special care. Another aspect is
empty fields which may occur in the database for various reasons—ignoring them
may introduce a systematic error in the results. By getting familiar with the data,
typically first insights and hypotheses are gained. If we do not believe that the data
suffices to solve the problem, it may be necessary to revise the project’s objective.
So far, we have not changed any field of our database. However, this will be re-
quired to get the data into a shape that enables us to apply modeling tools. In the
data preparation phase (Chap. 6) the data is selected, corrected, and modified, and
even new attributes are generated, so that the prepared data set best suits the problem
and the envisaged modeling technique. Basically all deficiencies that have been
identified in the data understanding phase require special actions. Often the outliers
and missing values are replaced by estimated values or true values obtained from
other sources. We may restrict the further analysis to certain variables and to a se-
lection of the records from the full data set. Redundant and irrelevant data can give
many techniques an unnecessarily hard time.
Once the data is prepared, we select and apply modeling tools to extract knowl-
edge out of the data in the form of a model (Chaps. 5 and 7–9). Depending on
what we want to do with the model, we may choose techniques that are easily in-
terpretable (to gain insights) or less demonstrative black-box models, which may
perform better. If we are not pleased with the results but are confident that the model
can be improved, we step back to the data preparation phase and, say, generate new
attributes from the existing ones, to support the modeling technique or to apply
different techniques. Background knowledge may provide hints on useful transfor-
mations that simplify the representation of the solution.
Compared to the modeling itself, which is typically supported by efficient tools
and algorithms, the data understanding and preparation phases take a considerable
part of the overall project time as they require a close manual inspection of the data,
investigations into the relationships between different data sources, often even the
analysis of the process that generated the data. New insights promote new ideas
for feature generation or alter the subset of selected data, in which case the data
preparation and modeling phases are carried out multiple times. The number of
steps is not predetermined but influenced by the process and the findings themselves.
When the technical benchmarks cannot be improved anymore, the obtained re-
sults are analyzed in the evaluation phase (Chap. 10) from the perspective of the
problem owner. At this point, the project may stop due to unsatisfactory results, the
objectives may be revised in order to succeed under a slightly different setting, or
the found and optimized model may be deployed.
After deployment, which ranges from writing a report to the creation of a soft-
ware system that applies the model automatically to aid or make decisions, the
project is not necessarily finished. If the project results are used continuously over
time, an additional monitoring phase is necessary: during the analysis, a number of
assumptions will be made, and the correctness of the derived model (and the decisions
that rely on the model) depends on them. So we had better verify from time to time
that these assumptions still hold to prevent decision-making on outdated information.
In the literature one can find attempts to create cost models that estimate the costs
associated with a data analysis project. Without going into the details, the major key
factors that remained in a reduced cost model derived from 40 projects were [9]:
While there is not much we can do about the problem size, the goal of this book is
to increase the familiarity with data analysis projects by going through each of the
phases and providing first instructions to get along with the software suites.
The most frequent categories are classification and regression, because decision
making always becomes much easier if reliable predictions of the near future are
available. When a completely new area or domain is explored, cluster analysis and
association analysis may help to identify relationships among attributes or records.
Once the major relationships are understood (e.g., by a domain expert), a deviation
analysis can help to focus on exceptional situations that deviate from regularity.
Available Tools As already mentioned, the key to success is often the proper com-
bination of data preparation and modeling techniques. Data analysis software suites
are of great help as they reduce data formatting efforts and ease method linking.
There is a long list of commercial and free software suites and tools, including the
following classical products:
• IBM SPSS PASW Modeler (formerly Clementine)
Clementine was the first commercial data mining workbench in 1994 and is a
commercial product from SPSS, now IBM.
http://www.spss.com/
• SAS Enterprise Miner
A commercial data mining solution from SAS.
http://www.sas.com/
• The R-project
R is a free software environment for statistical computing and graphics.
http://www.r-project.org/
• Weka
Weka is a popular open-source collection of machine learning algorithms, initially
developed by the University of Waikato, New Zealand.
http://www.cs.waikato.ac.nz/ml/weka/
For an up-to-date list of software suites see, for instance,
http://www.kdnuggets.com/software/suites.html
Although the choice of the software suite has considerable impact on the project
time (usability) and can help to avoid errors (because some of them are easily spot-
ted using powerful visualization capabilities), the suites cannot take over the full
analysis process. They provide at best an initial starting point (by means of analysis
templates or project wizards), but in most cases the key factor is the intelligent com-
bination of tools and background knowledge (regarding the project domain and the
utilized tools). The suites exhibit different strengths, some focus on supporting the
human data analyst by sophisticated graphical user interfaces, graphical configura-
tion and reporting, while others are better suited for batch processing and automati-
zation.
In this book, we will use R, which is particularly powerful in statistical techniques,
and KNIME (the Konstanz Information Miner), which is an open-source
data analysis tool that is growing in popularity due to its graphical workflow editor
and its ability to integrate other well-known toolkits.
appendix is not just a glossary of terms to quickly look up details but also serves as
a book within the book for a few preparative lessons on statistics before delving into
the chapters about intelligent data analysis.
Most chapters contain a section that equips the reader with the necessary infor-
mation for some first hands-on experience using either R or KNIME. We have set-
tled on R and KNIME because they can be seen as extremes on the range of possible
software suites: R is a statistical tool, which is (mostly) command-line oriented and
is particularly useful for scripting and automatization. KNIME, on the other hand,
supports the composition of complex workflows in a graphical user interface.2 Ap-
pendices B and C provide a brief introduction into both systems.
References
1. Berthold, M., Hand, D.: Intelligent Data Analysis. Springer, Berlin (2009)
2. Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., Wirth, R.: Cross
Industry Standard Process for Data Mining 1.0, Step-by-step Data Mining Guide. CRISP-DM
consortium (2000)
3. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (eds.): Advances in Knowl-
edge Discovery and Data Mining. AAAI Press/MIT Press, Menlo Park/Cambridge (1996)
4. Feynman, R.P., Leighton, R.B., Sands, M.: The Feynman Lectures on Physics. Mechanics, Ra-
diation, and Heat, vol. 1. Addison-Wesley, Reading (1963)
5. Hand, D.: Intelligent data analysis: issues and opportunities. In: Proc. 2nd Int. Symp. on Ad-
vances in Intelligent Data Analysis, pp. 1–14. Springer, Berlin (1997)
6. Kepler, J.: Astronomia Nova, aitiologetos seu physica coelestis, tradita commentariis de
motibus stellae martis, ex observationibus Tychonis Brahe. (New Astronomy, Based upon
Causes, or Celestial Physics, Treated by Means of Commentaries on the Motions of the Star
Mars, from the Observations of Tycho Brahe) (1609); English edition: New Astronomy. Cam-
bridge University Press, Cambridge (1992)
7. Kepler, J.: Harmonices Mundi (1619); English edition: The Harmony of the World. American
Philosophical Society, Philadelphia (1997)
8. Kurgan, L.A., Musilek, P.: A survey of knowledge discovery and data mining process models.
Knowl. Eng. Rev. 21(1), 1–24 (2006)
9. Marban, O., Menasalvas, E., Fernandez-Baizan, C.: A cost model to estimate the effort of data
mining process (DMCoMo). Inf. Syst. 33, 133–150 (2008)
2 The workflows discussed in this book are available for download at the book’s website.
Chapter 2
Practical Data Analysis: An Example
Before talking about the full-fledged data analysis process and diving into the details
of individual methods, this chapter demonstrates some typical pitfalls one encoun-
ters when analyzing real-world data. We start our journey through the data analysis
process by looking over the shoulders of two (pseudo) data analysts, Stan and Laura,
working on some hypothetical data analysis problems in a sales environment. Being
differently skilled, they show how things should and should not be done. Through-
out the chapter, a number of typical problems that data analysts meet in real work
situations are demonstrated as well. We will skip algorithmic and other details here
and only briefly mention the intention behind applying some of the processes and
methods. They will be discussed in depth in subsequent chapters.
The Data For the following examples, we will use an artificial set of data sources
from a hypothetical supermarket chain. The data set consists of a few tables, which
have already been extracted from an in-house database:1
1 Often just getting the data is a problem of its own. Data analysis assumes that you have access to
the data you need—an assumption which is, unfortunately, frequently not true.
The Analysts Stan and Laura are responsible for the analytics of the southern and
northern parts, respectively, of a large supermarket chain. They were recently hired
to help better understand customer groups and behavior and try to increase revenue
in the local stores. As is unfortunately all too common, over the years the stores
have already begun all sorts of data acquisition operations, but in recent years quite
a lot of this data has been merged—however, still without a clear picture in mind.
Many other stores had started to issue frequent shopping cards, so the directors of
marketing of the southern and northern markets decided to launch a similar program.
Lots of data have been recorded, and Stan and Laura now face the challenge to fit
existing data to the questions posed. Together with their managers, they have sat
down and defined three data analysis questions to be addressed in the following
year:
• differentiate the different customer groups and their behavior to better understand
their impact on the overall revenue,
• identify connections between products to allow for cross selling campaigns, and
• help design a marketing campaign to attract core customers to increase their pur-
chases.
Stan is a representative of the typical self-taught data analysis newbie with little
experience on the job and some more applied knowledge about the different tech-
niques, whereas Laura has some training in statistics, data processing, and data anal-
ysis process planning.
The first analysis task is a standard data analysis setup: customer segmentation—
find out which types of customers exist in your database and try to link them to
the revenue they create. This can be used later to care for clientele that are re-
sponsible for the largest revenue source or foster groups of customers who are
under-represented. Grouping (or clustering) records in a database is the predomi-
nant method to find such customer segments: the data is partitioned into smaller
subsets, each forming a more coherent group than the overall database contains. We
will go into much more detail on this type of data analysis method in Chap. 7. For
now it suffices to know that some of the most prominent clustering methods return
one typical example for each cluster. This essentially allows us to reduce a large
data set to a small number of representative examples for the subgroups contained
in the database.
Table 2.1
Cluster   Age    Total basket price
1         46.5   € 1,922.07
2         39.4   € 11,162.20
3         39.1   € 7,279.59
4         46.3   € 419.23
5         39.0   € 4,459.30
The Naive Approach Stan quickly jumps onto the challenge, creates a dump of
the database containing customer purchases and their birth date, and computes the
age of the customers based on their birth date and the current day. He realizes that
he is interested in customer clusters and therefore needs to somehow aggregate the
individual purchases to their respective “owner.” He uses an aggregating operator in
his database to compute the total price of the shopping baskets for each customer.
Stan then applies a well-known clustering algorithm which results in five prototyp-
ical examples, as shown in Table 2.1.
Stan is puzzled—he was expecting the clustering algorithm to return reasonably
meaningful groups, but this result looks as if all shoppers are around 40–50 years
old but spend vastly different amounts of money on products. He looks into some of
the customers’ data in some of these clusters but cannot seem to find any interesting
relations or any reason why some seem to buy substantially more than others. He
changes some of the algorithm’s settings, such as the number of clusters created, but
the results are similarly uninteresting.
The Sound Approach Laura takes a different approach. Routinely she first tries
to understand the available data and validates that some basic assumptions are in fact
true. She uses a basic data summarization tool to report the different values for the
string attributes. The distribution of first names seems to match the frequencies she
would expect. Names such as “Michael” and “Maria” are most frequent, and “Rose-
marie” and “Anneliese” appear a lot less often. The frequencies of the occupations
also roughly match her expectations: the majority of the customers are employ-
ees, while the second and third groups are students and freelancers, respectively.
She proceeds to check the attributes holding numbers. In order to check the age
of the customers, she also computes the customers’ ages from their birth date and
checks minimum and maximum. She spots a number of customers who obviously
reported a wrong birthday, because they are unbelievably young. As a consequence,
she decides to filter the data to only include people between the ages of 18 and 100.
In order to explore the data more quickly, she reduces the overall customer data set
to 5,000 records by random sampling and then plots a so-called histogram, which
shows different ranges of the attribute age and how many customers fall into that
range. Figure 2.1 shows the result of this analysis.
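Laura's first checks can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the birth dates, the fixed reference day, and the helper names age_on and histogram are hypothetical, and the bin boundaries (equal-width bins between 18 and 100) are one possible choice rather than the setting used for the figures.

```python
from collections import Counter
from datetime import date


def age_on(birth: date, today: date) -> int:
    """Age in completed years on the reference day."""
    before_birthday = (today.month, today.day) < (birth.month, birth.day)
    return today.year - birth.year - before_birthday


def histogram(ages, lo, hi, bins):
    """Count the ages falling into each of `bins` equal-width bins on [lo, hi]."""
    width = (hi - lo) / bins
    counts = Counter(min(int((a - lo) / width), bins - 1) for a in ages)
    return [counts.get(i, 0) for i in range(bins)]


# Hypothetical birth dates; the reference day is fixed so the result is reproducible.
today = date(2010, 6, 30)
births = [date(1970, 1, 1), date(1970, 1, 1), date(1985, 3, 12),
          date(2008, 5, 1), date(1950, 11, 23), date(1970, 1, 1)]

ages = [age_on(b, today) for b in births]

# Plausibility filter: keep only customers between 18 and 100 years of age.
plausible = [a for a in ages if 18 <= a <= 100]

coarse = histogram(plausible, 18, 100, 8)    # few, wide bins
fine = histogram(plausible, 18, 100, 40)     # many, narrow bins
```

With few bins, a concentration of identical ages is smeared over a wide age range; with many bins, it shows up as a single conspicuous bar.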
This view confirms Laura’s assumptions: the majority of shoppers are middle-aged,
and the number of shoppers continuously declines toward higher age groups.
Fig. 2.1 A histogram for the distribution of the value of attribute age using 8 bins
Fig. 2.2 A histogram for the distribution of the value of attribute age using 40 bins
She creates a second histogram to better inspect the subtle but strange cliff at around
age 48 using a finer setting for the bins. Figure 2.2 shows the result of this analysis.
Surprised, she notices the huge peak in the bin of ages 38–40. She discusses this
observation with colleagues and the administrator of the shopping card database.
They have no explanation for this odd concentration of 40-year-old people ei-
ther. After a few other investigations, a colleague of the person who—before his
retirement—designed the data entry forms suspects that this may have to do with
the coding of missing birth dates. And, as it turns out, this is in fact the case: forms
where people entered no or obviously nonsensical birth dates were entered into the
form as zero values. For technical reasons, these zeros were then converted into the
Java 0-date which turns out to be January 1, 1970. So these people all turn up with
the same birth date in the customer database and in turn have the same age after the
conversion Laura performed initially. Laura marks those entries in her database as
“missing” in order to be able to distinguish them in future analyses.
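A simple way to catch such placeholder values is to look for single dates that account for an implausibly large share of all records. The following sketch assumes hypothetical data and a hypothetical threshold of 5%; the function names are illustrative and not part of any tool used in the book.

```python
from collections import Counter
from datetime import date

EPOCH = date(1970, 1, 1)  # the zero date of Unix and Java timestamps


def suspicious_dates(births, min_share=0.05):
    """Return the dates that make up at least `min_share` of all records."""
    counts = Counter(births)
    n = len(births)
    return {d: c for d, c in counts.items() if c / n >= min_share}


def mark_missing(births, sentinels):
    """Replace sentinel dates by None so later analyses can tell them apart."""
    return [None if b in sentinels else b for b in births]


# Hypothetical customer table: 10 placeholder dates among 90 distinct real ones.
births = [EPOCH] * 10 + [
    date(1960 + i % 30, 1 + i % 12, 1 + i % 28) for i in range(90)
]

suspicious = suspicious_dates(births)
cleaned = mark_missing(births, suspicious)
```

Whether a frequent value really encodes a missing entry still has to be confirmed with someone who knows how the data was recorded, as Laura does; the check only points to candidates.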
Similarly, she inspects the shopping basket and product database and cleans up a
number of other outliers and oddities. She then proceeds with the customer segmen-
tation task. As in her previous data analysis projects, Laura first writes down her
domain knowledge in the form of a cognitive map, indicating relationships and dependencies
between the attributes of her database. Having thus recalled the interactions
between the variables of interest, she is well aware that the length of a customer’s
history and the number of overall shopping trips affect the overall basket price, and
so she settles on the average basket price as a better estimator for the value of a
particular customer. She considers also distinguishing the different product cate-
gories, realizing that those, of course, also potentially affect the average price. For
the first step, she adds the average number of purchases per month, another indicator
for the revenue a customer brings in. Data aggregation is now a bit more complex,
but the modern data analysis tool she is using allows her to do the required join-
ing and pivoting operations effortlessly. Laura knows that clustering algorithms are
very sensitive to attributes with very different magnitudes, so she normalizes the
three attributes to make sure that all three contribute equally to the clustering result.
Running the same clustering algorithm that Stan was using, with the same setting
for the number of clusters to be found, she gets the result shown in Table 2.2.
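Why normalization matters can be seen from the distances a clustering algorithm works with. The sketch below assumes Euclidean distance and z-score standardization; the text does not say which normalization Laura applies, so this is one common choice, and the customer values are hypothetical.

```python
from math import dist
from statistics import mean, pstdev


def zscore_scaler(rows):
    """Build a function that standardizes each attribute to mean 0 and std 1."""
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col)) for col in columns]

    def scale(row):
        return tuple((value - m) / s for value, (m, s) in zip(row, stats))

    return scale


# Hypothetical customers as (age in years, total basket price in euros).
customers = [(25, 2000.0), (70, 2100.0), (25, 2300.0), (45, 400.0), (50, 11000.0)]
a, b, c = customers[0], customers[1], customers[2]

# Raw attributes: the euro amounts dominate, so the 25-year-old a appears
# closer to the 70-year-old b than to the equally young c.
raw_ab, raw_ac = dist(a, b), dist(a, c)

# Standardized attributes: age and price contribute on a comparable scale.
scale = zscore_scaler(customers)
norm_ab, norm_ac = dist(scale(a), scale(b)), dist(scale(a), scale(c))
```

On the raw data the price attribute decides almost alone which customers count as similar, which is consistent with Stan's clusters differing in price but hardly in age.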
Obviously, there is a cluster (#1) of older customers who have a relatively small
average basket price. There is also another group of customers (#4) which seems
to correlate to younger shoppers, also purchasing smaller baskets. The middle-aged
group varies wildly in price, however. Laura realizes that this matches her assump-
tion about family status—people with families will likely buy more products and
hence combine more products into more expensive baskets, which seems to explain
the difference between clusters #2/#3 and cluster #5. The latter also seem to shop
significantly less often. She goes back and validates some of these assumptions by
looking at shopping frequency and average basket size as well and also determines
the overall impact on store revenues for these different groups. She finally discusses
these results with her marketing and campaign specialists to develop strategies to
foster the customer groups which bring in the largest chunk of revenue and develop
the ones which seem to be under-represented.
The Naive Approach Stan recently read in a book on practical data analysis how
association rules can find arbitrary connections of this kind in market basket data. He runs
the association rule mining algorithm in his favorite data analysis tool with the de-
fault settings and inspects the results. Among the top-ranked generated rules, sorted
by their confidence, Stan finds the following output:
’foie gras’ (p1231) <- ’champagne Don Huberto’ (p2149),
                       ’truffle oil de Rossini’ (p578) [s=1E-5, c=75%]
’Tortellini De Cecco 500g’ (p3456)
    <- ’De Cecco Sugo Siciliana’ (p8764) [s=1E-5, c=60%]
He quickly infers that this representation must mean that foie gras is bought when-
ever champagne and truffle oil are bought together and similarly for the other rule.
Stan knows that the confidence measure c is important, as it indicates the strength
of the dependency (the first rule holds in 3 out of 4 cases). He considers the sec-
ond measure of frequency s to be less important and deliberately ignores its fairly
small value. The two rules shown above are followed by a set of other, similarly lux-
ury/culinary product-oriented rules. Stan concludes that luxury products are clearly
the most important products on the shelf and recommends to his marketing man-
ager to launch a campaign to advertise some of the products on the right side of
these rules (champagne, truffle oil) to increase the sales of the left side (foie gras).
In parallel, he increases orders for these products, expecting a recognizable increase
in sales. He proudly sends the results of his analysis to Laura.
The Sound Approach Laura is puzzled by those nonintuitive results. She reruns
the analysis and notices the support values of the rules extracted by Stan—some
of the rules Stan extracted have indeed a remarkably high confidence, and some
do almost forecast shopping behavior. However, they have very low support values,
meaning that only a small number of shopping baskets containing the products were
ever observed. The rules that Stan found are not representative at all for his customer
base. To confirm this, she runs a quick query on her database and sees that, indeed,
there is essentially no influence on the overall revenue.
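The two measures can be computed directly from the baskets. The sketch assumes the usual definitions, which the text does not spell out at this point: the support of a rule is the fraction of all baskets that contain every item of the rule, and the confidence is the fraction of baskets containing the antecedent that also contain the consequent. The baskets are hypothetical.

```python
def support(baskets, items):
    """Fraction of baskets that contain all of the given items."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Fraction of baskets with the antecedent that also hold the consequent."""
    rule_items = set(antecedent) | set(consequent)
    return support(baskets, rule_items) / support(baskets, antecedent)


# Hypothetical data: 1,000 baskets, only 4 of which contain the luxury pair.
luxury = {"champagne", "truffle oil"}
baskets = (
    [luxury | {"foie gras"}] * 3      # the rule holds
    + [luxury | {"bread"}] * 1        # antecedent without consequent
    + [{"bread", "milk"}] * 996       # everyday baskets
)

s = support(baskets, luxury | {"foie gras"})
c = confidence(baskets, luxury, {"foie gras"})
```

Here the confidence is 75% although the rule is backed by only three baskets in a thousand, which is the pattern Laura recognizes in Stan's output.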
She notices that the problem of low support is caused by the fact that Stan ran
the analysis on product IDs, so in effect he was forcing the rules to differentiate
between brands of champagne and truffle oil. She reruns the analysis based on the
product categories instead, ranks them by a mix of support and confidence, and finds
a number of association rules with substantially higher support:
tomatoes <- capers, pasta [s=0.007, c=32%]
tomatoes <- apples [s=0.013, c=22%]
Laura focuses on rules with a much higher support measure s than before and also
realizes that the confidence measure c is significantly higher than one would expect
by chance. The first rule seems to be triggered by a recent fashion of Italian cooking,
whereas the apple/tomato-rule is a known aspect.
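Whether a confidence value is higher than expected by chance can be judged by comparing it with how often the consequent occurs in all baskets. The ratio of the two is often called lift. The counts below are hypothetical: they are chosen so that the rule has the support and confidence quoted above (s=0.013, c=22%), while the overall share of baskets containing tomatoes (8%) is purely an assumption.

```python
def lift(n_total, n_antecedent, n_consequent, n_both):
    """Confidence of the rule divided by the baseline share of the consequent."""
    rule_confidence = n_both / n_antecedent
    baseline = n_consequent / n_total
    return rule_confidence / baseline


# Hypothetical counts for the rule  tomatoes <- apples.
n_total = 10_000        # all baskets
n_apples = 591          # baskets containing apples
n_tomatoes = 800        # baskets containing tomatoes (assumed share of 8%)
n_both = 130            # baskets containing both

rule_support = n_both / n_total
rule_confidence = n_both / n_apples
rule_lift = lift(n_total, n_apples, n_tomatoes, n_both)
```

A lift clearly above 1 means that baskets with apples contain tomatoes more often than an arbitrary basket does; a value near 1 would indicate no real association.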
However, she is still puzzled by one of the rules discovered by Stan, which has
a higher than expected confidence despite a relatively low support. Are there some
gourmets among the customers who prefer a very specific set of products? Rerun-
ning this analysis on the shopping card owners yields almost the same results, so
the (potential) gourmets appear among their regular customers. Just to be sure, she
inspects how many different customers (i.e., shopping cards) occur for baskets that
support this rule. As she had conjectured, there is a very limited number of cus-
tomers that seem to have a strong affection for these products. Those few customers
have bought this combination frequently, thus inflating the overall support measure
(which refers to shopping baskets). This means that the support in terms of the num-
ber of customers is even smaller than the support in terms of number of shopping
baskets. The response to any kind of special promotion would fall even shorter than
expected from Stan’s rule.
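The difference between counting baskets and counting customers can be made concrete as follows. The sketch assumes each basket carries the identifier of the shopping card it belongs to; the data and the function name are hypothetical.

```python
def basket_and_customer_support(baskets, items):
    """Support of an item set counted per basket and per distinct customer.

    `baskets` is a list of (customer_id, set_of_products) pairs.
    """
    items = set(items)
    supporting = [cid for cid, products in baskets if items <= products]
    all_customers = {cid for cid, _ in baskets}
    basket_support = len(supporting) / len(baskets)
    customer_support = len(set(supporting)) / len(all_customers)
    return basket_support, customer_support


# Hypothetical data: one gourmet (customer 0) buys the combination six times,
# while 19 other customers shop twice each without ever buying it.
gourmet_items = {"foie gras", "champagne", "truffle oil"}
baskets = [(0, gourmet_items)] * 6 + [
    (cid, {"bread", "milk"}) for cid in range(1, 20) for _ in range(2)
]

by_basket, by_customer = basket_and_customer_support(baskets, gourmet_items)
```

In this example roughly 14% of the baskets support the item set, but only 5% of the customers do, so a promotion aimed at customers would reach fewer people than the basket-based figure suggests.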
Apparently the time period in which the analyzed data has been collected influ-
ences the results. Thinking about it, she develops an idea how to learn about changes
in the customers’ shopping behavior: she identifies a few rules, some rather promising,
others well-known facts, and decides to monitor those combinations on a regular
basis (say quarterly). She has learned that a chain of liquor stores will soon open
a number of shops close to the chain’s own markets, so she picks some rules with beverages
in their conclusion part to see if the opening has any impact on the established
shopping patterns of its own customers. As she fears a loss of potential sales, she
plans a comparison of rules obtained not only over time but also among markets in
the vicinity of such stores versus the other markets. She wonders whether promot-
ing the products in the rule’s antecedent may help to bring back the customer and
decides to discuss this with the marketing and sales team to determine if and where
appropriate campaigns should be launched, once she has the results of her analysis.
The third and final analysis goal we consider in this brief overview is a forecasting
or prediction problem. The idea is to find some relationship in our existing data that
can help us to predict if and how customers will react to coupon mailings and how
this will affect our future revenue.
The Naive Approach Stan believes that no detailed analysis is required for this
problem and notices that it is fairly straightforward to monitor success. He has seen
at a competitor how discount coupons attract customers to purchase additional prod-
ucts. So he suggests launching a coupon campaign that gives customers a discount of
10% if they purchase products for more than €50. This coupon is mailed to all cus-
tomers on record. Throughout the course of the next month, he carefully monitors
his database and is positively surprised when he sees that his campaign is obviously
working: the average price of shopping baskets is going up in comparison with pre-
vious months. However, at the end of the quarter he is shocked to see that overall
revenues for the past quarter actually fell. His management is finally fed up with the
lack of performance and fires Stan.
The Sound Approach Laura, who is promoted to head of analytics for the northern
and southern supermarket chain, first cancels Stan’s campaign and looks into the
underlying data. She quickly realizes that even though quite a number of customers
did in fact use the coupons and increased their shopping baskets, their average num-
ber of baskets per month actually went down—so quite a number of people seem
to have simply combined smaller shopping trips to be able to benefit from the dis-
count offer. However, for some shoppers, the combined monthly shopping basket
value did go up markedly, so there might be value here. Laura wonders how she can
discriminate between those customers who simply use the coupons to discount their
existing purchases and those who are actually enticed to purchase additional items.
She notices that one of the earlier generated customer segments correlates better
than others with the group of customers whose revenue went up—this fraction of
customers is significantly higher than in the other groups. She considers using this
very simple, manually designed predictor for a future campaign but wants to first
make sure that she cannot do better with some smarter techniques. She decides that
in the end it is not so important if she can actually understand the extracted model
but only how well it performs.
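The effect Laura observes in the campaign data, rising basket values together with falling revenue, is easy to reproduce with a small worked example. The coupon terms (10% off baskets above €50) are taken from Stan's campaign; the two customers and their baskets are hypothetical.

```python
def coupon_price(basket, threshold=50.0, discount=0.10):
    """Amount actually paid for a basket under the coupon terms."""
    return basket * (1 - discount) if basket > threshold else basket


def summarize(baskets):
    """Return (average amount paid per basket, total amount paid per month)."""
    paid = [coupon_price(b) for b in baskets]
    return sum(paid) / len(paid), sum(paid)


# Hypothetical monthly baskets (list prices in euros) before and during the campaign.
before = {"combiner": [30.0, 30.0, 30.0, 30.0], "enticed": [40.0, 40.0]}
during = {"combiner": [60.0, 60.0], "enticed": [70.0, 70.0]}

report = {}
for customer in before:
    avg_before, total_before = summarize(before[customer])
    avg_during, total_during = summarize(during[customer])
    report[customer] = {
        "avg_basket_change": avg_during - avg_before,
        "monthly_revenue_change": total_during - total_before,
    }
```

Both customers show a higher average basket, so that figure alone cannot separate them. Only the monthly revenue reveals that the first customer merely combined shopping trips (revenue drops from €120 to €108), whereas the second one bought additional items (revenue rises from €80 to €126).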
To provide good starting points for the modeling technique, she decides to gen-
erate a few potentially informative attributes first. Models that rely on thousands
of details typically perform poorly, so providing how often every product has been
bought by the customer in the last month is not an option for her. To get robust mod-
els, she wants to aggregate the tiny bits of information, but what kind of aggregation
could be helpful? She returns to her cognitive map to review the dependencies. One
aspect is the availability of competitors: She reckons that customers may have alter-
native (possibly specialized) markets nearby but have been attracted by the coupon
this time, keeping them away from the competitors. She decides to aggregate the
money spent by the customer per month for a number of product types (such as bev-
erages, thinking of the chain of liquor stores again). She conjectures that customers
that perform well on average, but underperform in a specific segment only, may
be enticed by the coupon to buy products for the underperforming segment also.
Providing the segment performance before and after Stan’s campaign should help a
predictor to detect such dependencies if they exist.
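The kind of aggregation described here might look as follows. The record layout (customer, period, product type, amount), the period labels "before" and "after", and the attribute names are assumptions made for this sketch only.

```python
from collections import defaultdict


def segment_spend(purchases):
    """Sum the amount spent per (customer, period, product type)."""
    totals = defaultdict(float)
    for customer, period, product_type, amount in purchases:
        totals[(customer, period, product_type)] += amount
    return totals


def before_after_features(purchases, product_types):
    """One row of aggregated attributes per customer."""
    totals = segment_spend(purchases)
    customers = sorted({p[0] for p in purchases})
    features = {}
    for customer in customers:
        row = {}
        for t in product_types:
            row[f"{t}_before"] = totals.get((customer, "before", t), 0.0)
            row[f"{t}_after"] = totals.get((customer, "after", t), 0.0)
        features[customer] = row
    return features


# Hypothetical purchase records.
purchases = [
    ("c1", "before", "food", 80.0), ("c1", "before", "food", 40.0),
    ("c1", "after", "food", 110.0), ("c1", "after", "beverages", 35.0),
    ("c2", "before", "beverages", 20.0), ("c2", "after", "beverages", 20.0),
]

features = before_after_features(purchases, ["food", "beverages"])
```

Customer c1 bought no beverages before the campaign but did afterwards, which is the kind of change in an underperforming segment that the aggregated attributes are meant to expose to a predictor.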
The cognitive map brings another idea into her mind: people who appreciate the
full assortment but live somewhat further away from the chain’s stores may see the
coupon as a kind of travel compensation. So she adds a variable expressing a coarse
estimate of the distance between the customer’s home and the nearest available
market (which is only possible for the shopping card owners). She continues to use
her cognitive map to address many different aspects and creates attributes that may
help to verify her hypotheses. She then investigates the generated attributes visually
and also technically by means of feature selection methods.
After selecting the most promising attributes, she trains a classifier to distin-
guish the groups. She uses part of the data to simulate an independent test scenario
and thereby evaluates the expected impact of a campaign—are the costs created
by sending coupons to customers who do not purchase additional products offset
by customers buying additional items? After some additional model fine tuning,
she reaches satisfactory performance. She discusses the results with the marketing
and sales team and deploys the prediction system to control the coupon mailings
for the next quarter. She keeps monitoring the performance of these coupon cam-
paigns over future quarters and updates her model sporadically.
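The question of whether the coupon costs are offset can be evaluated on held-out records once a predictor is available. The following sketch is not Laura's model: the held-out records, the rule "mail only to segment A", and the cost and gain figures are all hypothetical and serve only to show the bookkeeping.

```python
def net_gain(records, predict, cost_per_coupon, gain_per_enticed):
    """Expected net gain of mailing coupons to the customers selected by `predict`.

    `records` is a list of (segment, enticed) pairs, where `enticed` tells
    whether the customer actually bought additional items.
    """
    gain = 0.0
    for segment, enticed in records:
        if predict(segment):
            gain -= cost_per_coupon          # every mailed coupon costs money
            if enticed:
                gain += gain_per_enticed     # only enticed customers pay off
    return gain


# Hypothetical held-out records that were not used to build the predictor.
held_out = (
    [("A", True)] * 6 + [("A", False)] * 2
    + [("B", True)] * 1 + [("B", False)] * 11
)

mail_everyone = net_gain(held_out, lambda segment: True, 2.0, 10.0)
mail_segment_a = net_gain(held_out, lambda segment: segment == "A", 2.0, 10.0)
```

With these figures, mailing to everyone yields a net gain of 30, while restricting the mailing to segment A yields 44, because far fewer coupons are wasted on customers who do not buy additional products.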
We are at the beginning of a series of interdependent steps, where the project under-
standing phase marks the first. In this initial phase of the data analysis project, we
have to map a problem onto one or many data analysis tasks. In a nutshell, we con-
jecture that the nature of the problem at hand can be adequately captured by some
data sets (that still have to be identified or constructed), that appropriate modeling
techniques can successfully be applied to learn the relationships in the data, and fi-
nally that the gained insights or models can be transferred back to the real case and
applied successfully. This endeavor relies on a number of assumptions and is threat-
ened by several risks, so the goal of the project understanding phase is to assess the
main objective, the potential benefit, as well as the constraints, assumptions, and
risks. While the number of data analysis projects is rapidly expanding, the failure
rate is still high, so this phase should be carried out seriously to rate the chances
of success realistically. The project understanding phase should be carried out with
care to keep the project on the right track.
We have already sketched the data analysis process (CRISP-DM in Sect. 1.2).
There is a clear order in the steps in the sense that for a later step, all preceding
steps must have been executed. However, this does not mean that we can run once
through all steps to deterministically achieve the desired results. There are many
options and decisions to be made. Most of them will rely on our (subjective and dy-
namic) understanding of the problem at hand. The line of argument will not always
be from an earlier phase to a later one. For instance, if a regression problem has to be
solved, the analyst may decide that a certain method seems to be a promising choice
for the modeling phase. From the characteristics of this technique he knows that all
input data have to be transformed into numerical data, which has to be carried out
beforehand (data preparation phase). This requires a careful look at the multivalued
ordinally scaled attributes already in the data understanding phase to see how the
order of the values is best preserved. Requirements that are not considered in time
can be costly: it may turn out only later, in the evaluation phase, that the project
owner expected to gain insights into the input–output relationship rather than
a black-box model only. If the analyst had considered this requirement beforehand,
he might have chosen a different method. Changing this decision at any point later than in this initial
M.R. Berthold et al., Guide to Intelligent Data Analysis, 25
Texts in Computer Science 42,
DOI 10.1007/978-1-84882-260-3_3, © Springer-Verlag London Limited 2010
26 3 Project Understanding
Table 3.1 Problems faced in data analysis projects, excerpt from [1]
Problem source: Communication
  Project owner perspective: Project owner does not understand the technical terms of the analyst
  Analyst perspective: Analyst does not understand the terms of the domain of the project owner
Problem source: Lack of understanding
  Project owner perspective: Project owner was not sure what the analyst could do or achieve; models of analyst were different from what the project owner envisioned
  Analyst perspective: Analyst found it hard to understand how to help the project owner
project understanding phase often renders some (if not most) of the earlier work in
data understanding, data preparation, and modeling useless. While the time spent on
project and data understanding compared to data preparation and modeling is small
(20% : 80%), the importance to success is just the opposite (80% : 20%) [4].
As a first step, a primary objective (not a long list but one or two) and some success
criteria in terms of the project domain have to be determined (who will decide which
results are desired and whether the original project goal was achieved or not). This
is much easier said than done, especially if the analysis is not carried out by the do-
main expert himself. In such cases the project owner and the analyst speak different
languages which may cause misunderstandings and confusion. In the worst case,
the communication problems lead to very soft project goals, just vague enough to
allow every stakeholder to see his own perspective somehow accounted for. In the
end, all of a sudden, the stakeholders recognize that the results do not fit their
expectations. The challenge here is usually not a matter of technical but of communicative
competence.
Table 3.1 shows some typical problems occurring in such projects. To overcome
language confusion, a glossary of terms, definitions, acronyms, and abbreviations is
indispensable. Knowing the terms still does not imply an understanding of the project
domain, the objectives, constraints, and influencing factors. One interviewing technique
that may help to get the most out of the expert is to rephrase all of her statements,
which often provokes additional relativizing statements. Another technique is to use
explorative tools such as mind maps or cognitive maps to sketch beliefs, experi-
ences, and known factors and how they influence each other.
An example of a cognitive map in the shopping domain considered in Sect. 2 is
given in Fig. 3.1. Each node of this graph represents a property of the considered
product or the customer. The variable of interest is placed in the center: how often
3.1 Determine the Project Objective 27
Fig. 3.1 A cognitive map for the shopping domain: How often will a certain product occur in
a shopping basket of some customer? The positive correlation between income and affordability
reads like the higher the income, the higher the affordability, whereas an example of a negative
correlation reads like the broader the range of offered substitutes, the lower the product affinity
will a certain product be found in the shopping basket of the customer? This depends
on various factors, which are placed around this node. The direction of influence is
given by the arrows, and the line style indicates the way in which the variables influence
each other: The higher the customer’s affinity to the product, the more often it will
be found in the basket. The author of the cognitive map conjectures that the product
affinity itself is positively influenced by a high product quality and the customer’s
brand loyalty (a loyal customer is less likely to buy substitute products). On the
other hand, the broader the range of offered substitutes, the more likely a customer
may try out a different product. Other relationships depend on the product itself:
The higher the demand for a certain product, the more often it will be found in the
shopping basket, but the demand itself may, depending on the product, vary with
gender (e.g., razor blades, hairspray), age (e.g., denture cleaner), or family status
(e.g., napkins, fruit juice). The development of such a map supports the domain
understanding and adjustment of expectations.
While constructing a cognitive map, a few rules should be adhered to: First, to
keep the map clear, only direct dependencies should be included in the graph. For
instance, the size of the household influences the target variable, but only indirectly
via the generated product demand and the affordability, and therefore there is no
direct connection from size of household to frequency of product in shopping basket.
Secondly, the labels of the nodes should be chosen carefully, so that they are easily
interpretable when plugged into the relationship templates such as the higher . . . ,
the higher . . . . As an example, the node size of household could have been named
family status, but then it is not quite clear what the more family status . . . actually
means.
Once an understanding of the domain has been achieved, the problem and pri-
mary objective have to be identified (see Table 3.2). Again, it is often useful to
discuss or model the current solution first, for instance, by using techniques from
software engineering (business process modeling, UML use cases, etc.) [3]. When
the current solution has been elaborated, its advantages and disadvantages can be
explored and discussed. Often, the primary objective is assumed to be known beforehand;
after all, the project would probably not have been initiated without having identified
a problem first. But as there are many different ways to attack a problem, the
objective should be precise about the direction to follow. A general statement about
the goal is easily made (“model the profitable customers to increase the sales”), but
it is often not precise enough (how do we precisely identify a profitable customer?)
and not actionable (how exactly shall this model help to increase the sales?). To
render the objective more precise, it is necessary to sketch the target use already at
this early stage. Thus it becomes clear what kind of result has to be delivered, which
may range from a written technical report with interesting findings to user-friendly
software that uses the final model to automate decisions.
From the perspective of the project owner some of these elaborate steps may ap-
pear unnecessary—they master their domain already, after all. However, these steps
must be considered as a preparation of the closely linked data understanding phase
(see next section). All the identified factors, situations, and relationships that are
assumed to be relevant must be present and recognizable in the data. If they cannot
be found in the data, either there is a misconception in the project understanding or
(even worse) the data is not substantial or detailed enough to reflect the important
relations of the real-world problem. In both cases, it would be fatal to miss this point
and proceed unworried.
tant resources are data and knowledge, that is, databases and experts who can pro-
vide background information (about the domain in general and about the databases
in particular). Besides a plain listing of databases and personnel, it is important to
clarify the access to both: if the data is stored in an operative system, mining the
data may paralyze the applications using it. To become independent, it is advisable
to provide a database dump. Experts are typically busy and hard to get hold of, but
an inaccessible knowledge source is useless. A sufficiently large number of time
slots for meetings should be arranged.
Based on the domain exploration (cognitive map, business process model, etc.),
a list of explicit and implicit assumptions and risks is created to judge the chances
of a successful project and guide the next steps. Data analysis lives on data. This
list shall help to convince ourselves that the data is meaningful and relevant to the
project. Why should we undertake this effort? We will see whether we can build
a model from this data later anyway. Unfortunately, this is only half of the truth.
After reviewing a number of reports in a data analysis competition, Charles Elkan
noted that “when something surprising happens, rather than question the expecta-
tions, people typically believe that they should have done something slightly differ-
ent” [2]. Expecting that the problem can be solved with the given data may lead to
continuously changing and “optimizing” the model—rather than taking the possi-
bility into account that the data is not appropriate for this problem. In order to avoid
this pitfall, the conjectured relations and expert-proven connections can help us in
verifying that the given data satisfy our needs—or to put forward good reasons why
the project will probably fail. This is particularly important as in many projects the
available data have not been collected to serve the purpose that is intended now. To
prevent us from carrying out an expensive project having almost no prospect of suc-
cess, we have to carefully track all assumptions and verify them as soon as possible.
Typical requirements and assumptions include:
• requirements and constraints
– model requirements,
e.g., model has to be explanatory (because decisions must be justified clearly)
– ethical, political, legal issues,
e.g., variables such as gender, age, race must not be used
– technical constraints,
e.g., applying the technical solution must not take more than n seconds
• assumptions
– representativeness:
If conclusions about a specific target group are to be derived, a sufficiently
large number of cases from this group must be contained in the database, and
the sample in the database must be representative for the whole population.
– informativeness:
To cover all aspects by the model, most of the influencing factors (identified in
the cognitive map) should be represented by attributes in the database.
– good data quality:
The relevant data must be of good quality (correct, complete, up-to-date) and
unambiguous thanks to the available documentation.
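The representativeness assumption can be checked roughly as soon as a first extract of the data is available. The following Python sketch is only an illustration under stated assumptions: the customer records, the attribute name age_group, the threshold of 30 cases per group, and the population shares are all hypothetical. It counts the cases per target group and compares the shares in the sample with the shares assumed for the whole population.

```python
from collections import Counter


def group_coverage(records, group_key, min_cases):
    """Count records per group and list the groups with fewer than min_cases."""
    counts = Counter(r[group_key] for r in records)
    too_small = sorted(g for g, n in counts.items() if n < min_cases)
    return counts, too_small


def proportion_gaps(counts, population_shares):
    """Sample share minus assumed population share, per group."""
    total = sum(counts.values())
    return {g: counts.get(g, 0) / total - share
            for g, share in population_shares.items()}


# Hypothetical customer records; only the attribute of interest is shown.
customers = ([{"age_group": "18-29"}] * 5
             + [{"age_group": "30-49"}] * 60
             + [{"age_group": "50+"}] * 35)

# Hypothetical threshold and hypothetical population shares.
counts, too_small = group_coverage(customers, "age_group", min_cases=30)
gaps = proportion_gaps(counts, {"18-29": 0.20, "30-49": 0.45, "50+": 0.35})

print("groups with too few cases:", too_small)
print("sample share minus population share:",
      {g: round(d, 2) for g, d in gaps.items()})
```

In this made-up extract the youngest group is both too small and underrepresented, which would be recorded as a risk before any model is built.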
Finally, the primary objective must be transformed into a more technical data mining
goal. An architecture for the envisaged solution has to be found, composed
of building blocks as discussed in Sect. 1.3 (data analysis tasks). For instance, this
architecture might contain a component responsible for grouping the customers ac-
cording to some readily available attributes first, another component finds interest-
ing deviating subgroups in each of the groups, and a third component predicts some
variable of interest based on the customer data and the membership to the respec-
tive groups and subgroups. The better this architecture fits the actual situation, the
better the chances of finding a model class that will prove successful in practice. To
achieve this fit, the discussions about the project domain are of great help.
Again there is the danger of accepting a reasonable architecture quickly, under-
estimating or even ignoring the great impact on the overall effort. Suppose that a
company wants to increase the sales of some high-end product by direct mailing.
One approach is to develop a model that predicts who will buy this product using
the company’s own customer database. Such a model might be interesting to in-
terpret (useful for a report), but if it is used to decide to whom a mailing should be
sent, most of the customers may have the product already (within the same customer
database). Applying the model to people who are not in the database is impossible as
we lack the information about them that is needed by the model. The predictive
model may also find out that customers buying the product were loyal customers
for many years—but artificially increasing the duration of the customer relationship
to support the purchase of the product is unfortunately impossible. If a foreseeable
result is ignored or a misconception w.r.t. the desired use of the model is not recog-
nized, considerable time may be wasted with building a correct model that turns out
to be useless in the end.
For each of the building blocks, we can select a model class and technique to
derive a model of this class automatically from data. There is no such thing as a single
best method for predictive tasks, because all methods have their individual weaknesses
and strengths and it is impossible to combine all their properties or remove all biases
(see Chap. 5). Although the final decision about the modeling technique will be
made in the modeling phase, it should be clear already at this point of the analysis
which properties the model should have and why. The methods and tools optimize
the technical aspects of the model quality (such as accuracy, see also Chap. 5). Other
3.4 Further Reading 31
aspects are often difficult to formalize and thus to optimize (such as interestingness
or interpretability), so that the choice of the model class has the greatest influence
on these properties. Desirable properties may be, for instance:
• Interpretability:
If the goal of the analysis is a report that sketches possible explanations for a
certain situation, the ultimate goal is to understand the delivered model. For some
black-box models, it is hard to comprehend how the final decision is made, so
such models lack interpretability.
• Reproducibility/stability:
If the analysis is carried out more than once, we may achieve similar performance
—but not necessarily similar models. This does no harm if the model is used as
a black box, but hinders a direct comparison of subsequent models to investigate
their differences.
• Model flexibility/adequacy:
A flexible model can adapt to more (complicated) situations than an inflexible
model, which typically makes more assumptions about the real world and requires
fewer parameters. If the problem domain is complex, the model learned from data
must also be complex to be successful. However, with flexible models the risk of
overfitting increases (discussed in Chap. 5).
• Runtime:
If restrictive runtime requirements are given (either for building or applying the
model), this may exclude some computationally expensive approaches.
• Interestingness and use of expert knowledge:
The more an expert already knows, the more challenging it is to “surprise” him or
her with new findings. Some techniques looking for associations (see Sect. 7.6)
are known for their large number of findings, many of them redundant and thus
uninteresting. So if there is a possibility of including any kind of previous knowledge,
this may ease the search for the best model considerably on the one hand
and may prevent us from rediscovering too many well-known artifacts on the other.
When discussing the various modeling techniques in Chaps. 7–9, we will give hints
as to which properties they possess. The final choice is then up to the analyst.
The books by Dorian Pyle [4, 5] offer many suggestions and constructive hints for
carrying out the project understanding phase. [5] contains a step-by-step workflow
for business understanding and data mining consisting of various action boxes. An
organizationally grounded framework to formally implement the business understanding
phase of data mining projects is presented in [6]. In [1] a template set for
eliciting and documenting project requirements is proposed.
References
1. Britos, P., Dieste, O., García-Martínez, R.: Requirements elicitation in data mining for business
intelligence projects. In: Advances in Information Systems Research, Education and Practice,
pp. 139–150. IEEE Press, Piscataway (2008)
2. Elkan, C.: Magical thinking in data mining: lessons from CoIL Challenge 2000. In: Proc. 7th Int.
Conf. on Knowledge Discovery and Data Mining (KDD), pp. 426–431. ACM Press, New York
(2001)
3. Marban, O., Segovia, J., Menasalvas, E., Fernandez-Baizan, C.: Towards data mining engineer-
ing: a software engineering approach. Inf. Syst. 34, 87–107 (2009)
4. Pyle, D.: Data Preparation for Data Mining. Morgan Kaufmann, San Mateo (1999)
5. Pyle, D.: Business Modeling and Data Mining. Morgan Kaufmann, San Mateo (2003)
6. Sharma, S., Osei-Bryson, K.-M.: Framework for formal implementation of the business under-
standing phase of data mining projects. Expert Syst. Appl. 36, 4114–4124 (2009)
Chapter 4
Data Understanding
The main goal of data understanding is to gain general insights about the data that
will potentially be helpful for the further steps in the data analysis process, but data
understanding should not be driven exclusively by the goals and methods to be ap-
plied in later steps. Although these requirements should be kept in mind during data
understanding, one should approach the data from a neutral point of view. Never
trust any data as long as you have not carried out some simple plausibility checks.
Methods for such plausibility checks will be discussed in this chapter. At the end
of the data understanding phase, we know much better whether the assumptions we
made during the project understanding phase concerning representativeness, infor-
mativeness, data quality, and the presence or absence of external factors are justified.
We first take a general look at single attributes in Sect. 4.1 and ask questions like:
What kind of attributes do we have, and what do their domains look like? What is
the precision of numerical values? Is the domain of an attribute stable over time, or
does it change? We also need to assess the data quality. Methods and criteria for this
purpose are introduced in Sect. 4.2.
Data understanding requires taking a closer look at the data. However, this does
not mean that we must browse through seemingly endless columns of numbers and
other values. In this way we would probably overlook most of the important facts.
Looking at the data refers to visualization techniques (Sect. 4.3) that can be used
to get a quick overview of the basic characteristics of the data and enable us to check
the plausibility of the data to a certain extent. Visualization techniques are suitable
for the analysis of single attributes and of attributes in combination. Apart from the
pure visualization, it is also recommended to compute simple statistical measures
for correlation between attributes as described in Sect. 4.4.
Outliers, that is, values or records that are very different from all others, should be
identified with methods described in Sect. 4.5. They might cause difficulties for some of
the methods applied in later steps, or they might be wrong values due to data quality
problems. Missing values (see Sect. 4.6) can lead to similar problems as outliers,
and by simply ignoring missing values we might obtain wrong data analysis results,
so we must be aware of whether we have to deal with missing values and, if we have
to, of what kind the missing values are.
M.R. Berthold et al., Guide to Intelligent Data Analysis, 33
Texts in Computer Science 42,
DOI 10.1007/978-1-84882-260-3_4, © Springer-Verlag London Limited 2010
Data understanding is also a step that is required for data preparation. For ex-
ample, data understanding will help us to identify and to characterize outliers and
missing values. However, how to treat them—whether to leave them as they are,
to exclude them from further analysis steps, or to replace them by more plausible
values—is a task for data preparation.
Throughout this chapter we will use the Iris data set [2, 7]: a set of 150 data
points describing three different types of iris flowers (Iris setosa, Iris virginica, and
Iris versicolor) using four different attributes measuring the length and width of
the sepal and the petal leaves. This is a classic data set with a few very simple
and obvious properties, which lends itself naturally to demonstrating how several
different data analysis methods work. However, readers should not let themselves be
fooled into believing that any real-world data sets will ever display just nice and well
pronounced features. Quite the opposite: in real-world data the odd and undesired
effects often far outweigh the interesting ones.
a more refined level for the drinks might distinguish between water, beer, wine, . . . ,
and these categories might even be more refined by incorporating the producer. Even
the same producer might offer different brands and variants of water, beer, or wine.
This is still not the lowest level of refinement. We might even distinguish identical
drinks by the size of the container in which they are sold, for instance, 0.5 l, 1.0 l, and
1.5 l bottles. Of course, the most refined level will always provide the most detailed
information. Nevertheless, it is often not very useful to carry out an analysis on this
level, since it is impossible to extract general structures or rules when we restrict
our analysis to the most refined level. A general rule like “Customers tend to buy wine
and cheese together” might not be discovered on a refined level of granularity where
no combination of a specific wine and specific cheese might be frequent. Therefore,
it is crucial to choose an appropriate level of granularity for such attributes taking
the aim of the analysis and the size of the data set into account. For smaller data
sets, little statistical evidence can be found when we look at a very refined level,
since the number of instances per value of the attribute decreases with the level
of refinement. Especially, if a number of such attributes with different levels of
granularity is considered, the number of possible combinations for the values grows
extremely fast for refined levels. Choosing a suitable granularity is, strictly speaking,
a task within data preparation. Nevertheless, we must already decide here on which
level we want to understand and take a closer look at the data. We might even do
this on several levels of granularity.
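The effect of the granularity level on the discovery of general rules can be illustrated with a small Python sketch. The baskets and the mapping from specific products to the coarser categories wine, cheese, and bread are invented for this illustration. The sketch counts how often pairs of items occur together, once on the most refined level and once after mapping each product to its category.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mapping from specific products to a coarser category.
CATEGORY = {
    "Merlot 0.75 l": "wine", "Riesling 0.75 l": "wine", "Chianti 0.75 l": "wine",
    "Brie": "cheese", "Gouda": "cheese", "Cheddar": "cheese",
    "Bread": "bread",
}

# Hypothetical shopping baskets on the most refined level.
baskets = [
    {"Merlot 0.75 l", "Brie"},
    {"Riesling 0.75 l", "Gouda"},
    {"Chianti 0.75 l", "Cheddar"},
    {"Merlot 0.75 l", "Gouda", "Bread"},
    {"Bread"},
]


def pair_counts(baskets):
    """Count in how many baskets each unordered pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


fine = pair_counts(baskets)
coarse = pair_counts([{CATEGORY[p] for p in basket} for basket in baskets])

print("most frequent pair on the refined level:", max(fine.values()), "basket(s)")
print("wine and cheese on the category level:", coarse[("cheese", "wine")], "basket(s)")
```

On the refined level no pair occurs in more than one basket, whereas on the category level wine and cheese occur together in four of the five baskets.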
Another problem that sometimes comes along with categorical attributes is a
dynamic domain. For certain categorical attributes, the domain will never change.
The range of possible values for the month of birth is January, . . . , December, and
it seems quite improbable that it will be changed in the near future. The situation for
products is, however, completely different. Certain products might not be offered
by a shop anymore or might vanish completely from the market, whereas other new
products enter the market, or already existing products are adopted by the shop.
Such a dynamic domain of an attribute can cause problems in the analysis later
on. When products are analyzed on a long-term basis of, say, several years, then
products that have entered the market just recently will not show significant (ac-
cumulated) sales numbers compared to products that have been in the market for
decades. Therefore, categorical attributes can lead to undesired or even wrong analysis
results when problems such as the level of granularity or dynamic domains
are not taken into account. We identify such attributes already at this early stage and
make sure that we do not forget to handle them later.
A specific type of categorical attributes are ordinal attributes with an additional
linear ordering imposed on the domain. An attribute for university degrees with the
values none, B.Sc., M.Sc., and Ph.D. represents an ordinal attribute. A Ph.D. is a
higher degree than an M.Sc., and an M.Sc. is a higher degree than a B.Sc. However,
the ordering does not say that the difference between a Ph.D. and an M.Sc. is the
same as the difference between an M.Sc. and a B.Sc.
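One way to handle such an attribute in code is to store the order explicitly and to use only operations that rely on the order. The following Python sketch is an illustration, not a prescribed encoding: the names DEGREE_ORDER, RANK, is_higher, and ordinal_median are chosen here, and the sample values are made up. Comparisons and the median are meaningful for ordinal values, whereas differences between the ranks are not.

```python
# The linear order of the domain, from lowest to highest degree.
DEGREE_ORDER = ["none", "B.Sc.", "M.Sc.", "Ph.D."]
RANK = {degree: position for position, degree in enumerate(DEGREE_ORDER)}


def is_higher(a, b):
    """True if degree a ranks above degree b."""
    return RANK[a] > RANK[b]


def ordinal_median(values):
    """Lower median with respect to the order of the domain."""
    ranked = sorted(values, key=RANK.__getitem__)
    return ranked[(len(ranked) - 1) // 2]


# Hypothetical sample of degrees.
sample = ["B.Sc.", "none", "Ph.D.", "M.Sc.", "B.Sc."]

print(is_higher("Ph.D.", "M.Sc."))
print(ordinal_median(sample))
# RANK["Ph.D."] - RANK["M.Sc."] equals RANK["M.Sc."] - RANK["B.Sc."],
# but only because of the arbitrary coding 0, 1, 2, 3. The ordering itself
# says nothing about the size of these differences.
```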
The domain of a numerical attribute consists of numbers. Numerical attributes can be
discrete, usually taking integer values, or continuous, taking arbitrary real values.
Discrete numerical attributes often result from counting processes like the number
of children or the number of times a customer has ordered from an on-line shop in
the last twelve months.
Sometimes, categorical attributes are coded by numerical values. For instance,
the three possible values food, drinks, nonfood of the attribute general product cat-
egory might be coded by the numbers 1, 2, and 3. However, this does not turn the
attribute general product category into a (discrete) numerical attribute. We should
bear this fact in mind for later steps of the analysis so that the attribute is not
inadvertently interpreted as a numerical attribute: it does not make sense to carry out
numerical operations like computing the mean on such coded categorical attributes.
For a discrete attribute, though, especially when it represents some counting pro-
cess, it is meaningful to calculate the mean value, even though the mean value will
usually not be an integer number. It is meaningful to say that on average the customers
buy products 2.6 times per year in our on-line shop. But it does not make
sense to say that the average general product category we sell is 2.6, which we might
obtain when we simply compute the mean value of the products we have sold based
on the numerical coding of the general product categories.
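The difference between a count and a coded category can be made concrete in a few lines of Python. The numbers below are invented for illustration, and the coding 1, 2, 3 for food, drinks, and nonfood follows the example in the text. The mean is a sensible summary of the counts, while for the coded categories a frequency-based summary such as the mode is the appropriate choice.

```python
from statistics import mean, mode

# Discrete numerical attribute: hypothetical number of orders per customer and year.
orders_per_year = [1, 2, 2, 3, 5]

# Categorical attribute coded by numbers: 1 = food, 2 = drinks, 3 = nonfood.
CODE_TO_CATEGORY = {1: "food", 2: "drinks", 3: "nonfood"}
category_codes = [1, 1, 2, 3, 3, 3]  # hypothetical sold products

# Meaningful: the average number of orders per year.
average_orders = mean(orders_per_year)

# Computable, but without any meaning: an "average product category".
meaningless_average = mean(category_codes)

# Meaningful for categorical data: the most frequent category.
most_common = mode(CODE_TO_CATEGORY[code] for code in category_codes)

print(average_orders)
print(most_common)
```

Decoding the numbers back into labels before any summary is computed, as done for most_common, is one simple safeguard against treating the codes as quantities.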
In contrast to discrete numerical attributes, a continuous attribute can—at least
theoretically—assume any real value. However, such numerical values will always
be measured and represented with limited precision. It should be taken into account
how precise these values are. Drastic round-off errors or truncations can lead to
problems in later steps of the analysis. Suppose, for instance, that a cluster analysis
is to be carried out later on and that there is one numerical attribute, say X, that
is truncated to only one digit right after the decimal point, while all other numeri-
cal attributes were measured and stored with a higher precision. When comparing
different records, such truncation for the attribute X influences their perceived sim-
ilarity and might be a dominating factor for the further analysis only for this reason.
Truncation errors and measurements with limited precision should be distinguished
from values corrupted with noise. The problem of noise will be tackled in the con-
text of data quality in Sect. 4.2 and will also be discussed in more detail in Chap. 5.
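How truncation can distort the perceived similarity of records may be seen from a small Python sketch. The three records, each with the attributes X and Y, are invented, and the Euclidean distance stands in for whatever distance measure a later cluster analysis would use. Truncating X to one digit after the decimal point changes which record appears closest to record a.

```python
import math
from decimal import Decimal, ROUND_DOWN


def truncate(value, digits=1):
    """Cut off all digits after the given number of decimal places."""
    step = Decimal(10) ** -digits
    return float(Decimal(str(value)).quantize(step, rounding=ROUND_DOWN))


def truncate_x(record):
    """Truncate only the first attribute (X); Y keeps its precision."""
    x, y = record
    return (truncate(x), y)


# Hypothetical records (X, Y) stored with two digits after the decimal point.
a, b, c = (1.19, 2.00), (1.21, 2.00), (1.11, 2.05)

# With full precision, b is the nearest neighbor of a.
print(math.dist(a, b), math.dist(a, c))

# After truncating X, the records a and c coincide in X, and c appears nearer.
ta, tb, tc = truncate_x(a), truncate_x(b), truncate_x(c)
print(math.dist(ta, tb), math.dist(ta, tc))
```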
Numerical attributes can have an interval, a ratio, or an absolute scale. For an
interval scale, the definition of what zero means is more or less arbitrary. The date
is a typical example of an attribute measured on an interval scale. There are calendars
with different definitions of the time point zero. For instance, Unix time,
usually counted in seconds, has its time point zero at the start of the year 1970 of the
Gregorian calendar. The same applies to temperature scales like Fahrenheit and Celsius
degrees, where zero refers to different temperatures. Certain operations like quotients
are not meaningful for interval scales. For example, it does not make sense to
say that a temperature of 21°C is three times as warm as 7°C. (Such a statement may
make sense, though, for the Kelvin temperature scale, because on this scale the
temperature is directly proportional to the average kinetic energy of the particles,
and it is meaningful to compute ratios of energies.)
In contrast to this, a ratio scale has a canonical zero value and thus allows us to
compute meaningful ratios. Examples of ratio scales are height, distance, or duration.
Distance can be measured in different units like meters, kilometers, or miles.
4.2 Data Quality 37
But no matter which unit we choose, a distance of zero will always have the same
meaning. In particular, ratios, which do not make sense for interval scales, are often
useful for ratio scales: the quotient of distances is independent of the measurement
unit, so that the distance 20 km is always twice as long as the distance 10 km, even
if we change the unit kilometers to meters or miles. Whereas for a ratio scale, only
the value zero has a canonical meaning and the meaning of other values depends
on the choice of the measurement unit, for an absolute scale, there is a unique
measurement unit. A typical example for an absolute scale is any kind of counting
procedure.
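The difference between interval and ratio scales can be verified numerically. The following Python sketch uses the temperatures and distances from the examples above together with the standard conversion formulas; the function and variable names are chosen for this illustration. The quotient of two temperatures changes with the scale, whereas the quotient of two distances does not depend on the unit.

```python
def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32


def celsius_to_kelvin(celsius):
    return celsius + 273.15


KM_PER_MILE = 1.609344  # length of the international mile in kilometers

# Interval scale: the quotient depends on the arbitrary choice of zero.
warm_c, cold_c = 21.0, 7.0
ratio_celsius = warm_c / cold_c
ratio_fahrenheit = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cold_c)
ratio_kelvin = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

# Ratio scale: the quotient is the same in every unit.
far_km, near_km = 20.0, 10.0
ratio_km = far_km / near_km
ratio_m = (far_km * 1000) / (near_km * 1000)
ratio_miles = (far_km / KM_PER_MILE) / (near_km / KM_PER_MILE)

print("temperature quotients:",
      round(ratio_celsius, 3), round(ratio_fahrenheit, 3), round(ratio_kelvin, 3))
print("distance quotients:",
      round(ratio_km, 3), round(ratio_m, 3), round(ratio_miles, 3))
```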
The saying “garbage in, garbage out” applies to data analysis as much as to any other area.
The results of an analysis cannot be better than the quality of the data, so
we should be concerned about the data quality before we carry out any deeper analysis
with the data. Data quality refers to how well the data fit their intended use.
There are various data quality dimensions.
Accuracy is defined as the closeness between the value in the data and the true
value. For numerical attributes, accuracy means how exact the value in the data set is
compared to the true value. Noise or limited precision in measurements can lead to
reduced accuracy for numerical attributes. Limited precision is often obvious from
the data set. For example, in the Iris data set all numerical values are measured with
only one digit right after the decimal point. The magnitude of noise can be estimated
when measurements for the same value have been taken repeatedly. Accuracy of nu-
merical values can also be affected by wrong or erroneous measurements or simply
by errors like transposition of digits when measurements are recorded manually.
For categorical attributes, problems with accuracy can result from misspellings like
“fmale” for a value of the attribute gender, and also from erroneous entries.
We distinguish between syntactic and semantic accuracy. Syntactic accuracy
means that a considered value might not be correct, but it belongs at least to the do-
main of the corresponding attribute. For a categorical attribute like gender for which
only the values female and male are admitted, “fmale” violates syntactic accuracy.
For numerical attributes, syntactic accuracy does not only mean that the value must
be a number and not a string or text. Also certain numerical values can be out of the
range of syntactic accuracy. Attributes like weight or duration will admit only pos-
itive values, and therefore negative values would violate syntactic accuracy. Other
numerical attributes have an interval as their range like [0, 100] for the percentage
of votes for a candidate. Negative values and values larger than 100 should not oc-
cur. For integer-valued attributes like the number of items a customer has bought,
floating-point values should be excluded.
Problems with semantic accuracy mean that a value might be in the domain of
the corresponding attribute, but it is not correct. When the attribute gender has the
value female for the customer John Smith, then this is not a question of syntactic
accuracy, since female is a possible value of the attribute gender. But it is obviously
a wrong value for a person named “John”.2
Discovering problems of syntactic accuracy in a data set is a relatively easy task.
Once we know the domains of the attributes, we can easily verify whether the values
lie in the corresponding domains or not. A simple measure for syntactic accuracy is
the fraction of values that lie in the domains of their corresponding attributes.
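The measure just described can be sketched in a few lines of Python. The domain checks and the sample values below are hypothetical and only mirror the examples from the text (gender, percentage of votes, number of items); real domains would come from the data dictionary.

```python
def syntactic_accuracy(values, in_domain):
    """Fraction of values that lie in the domain of their attribute."""
    if not values:
        raise ValueError("no values to check")
    return sum(1 for v in values if in_domain(v)) / len(values)


# Hypothetical domain checks for three attributes mentioned in the text.
def is_gender(v):
    return v in {"female", "male"}


def is_percentage(v):
    return isinstance(v, (int, float)) and not isinstance(v, bool) and 0 <= v <= 100


def is_item_count(v):
    return isinstance(v, int) and not isinstance(v, bool) and v >= 0


genders = ["female", "male", "fmale", "male"]   # one misspelling
votes = [12.5, 101.0, -3.0, 48.0]               # two values outside [0, 100]
items = [3, 2.5, 0, 7]                          # one non-integer count

gender_accuracy = syntactic_accuracy(genders, is_gender)
votes_accuracy = syntactic_accuracy(votes, is_percentage)
items_accuracy = syntactic_accuracy(items, is_item_count)
```

Note that this measure says nothing about semantic accuracy: a value can pass every domain check and still be wrong.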
The verification of semantic accuracy is much more difficult or often even im-
possible. Another source for the same data would enable us to check our data, and
differences not caused by problems with syntactic accuracy indicate problems with
semantic accuracy. Sometimes also certain “business rules” are known for the data.
For instance, if we find a record in our data set with the value male for the at-
tribute gender and yes for the attribute pregnant, there must be a problem of se-
mantic accuracy based on the known “business rule” that only women can be preg-
nant.
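A business rule of this kind can be checked mechanically once it is written down as a predicate on a record. The following sketch assumes records are stored as dictionaries with the hypothetical keys "gender" and "pregnant"; it only flags suspicious records and cannot tell which of the two values is the wrong one.

```python
def violates_pregnancy_rule(record):
    """True if the record contradicts the rule 'only women can be pregnant'."""
    return record.get("gender") == "male" and record.get("pregnant") == "yes"


# Hypothetical records for illustration.
records = [
    {"id": 1, "gender": "female", "pregnant": "yes"},
    {"id": 2, "gender": "male", "pregnant": "no"},
    {"id": 3, "gender": "male", "pregnant": "yes"},  # semantically inconsistent
]

suspicious_ids = [r["id"] for r in records if violates_pregnancy_rule(r)]
```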
Whether and in how much detail to check syntactic and semantic accuracy depends
very much on how the data were generated. Especially when data were entered
manually, there is a higher chance of accuracy problems. In any case, it is
recommended to carry out at least some simple tests to see whether there might be
problems with accuracy. However, the usual practice is to keep these tests to a
minimum and to find out later on that there are problems with accuracy, namely when
the data analysis yields implausible results.
Throughout this book we normally assume that the data are already given, for
example, as a database table. This is not the best point in time to cope with data
quality problems. Chances of avoiding or reducing data quality problems are highest
when the data are entered into the database. For instance, instead of letting a user
type in the value of a categorical attribute, with the danger of misspellings, one could
provide a fixed selection of values from which the user can choose.
Another dimension of data quality is completeness which can be divided into
completeness with respect to attribute values and completeness with respect to
records. Completeness with respect to attribute values refers to missing values
(which will be discussed in Sect. 4.6). When missing values are explicitly marked
as such, then a simple measure for this dimension of data quality is the fraction of
missing values among all values. But we will see that missing values are not al-
ways directly recognizable, so that the fraction of known missing values might only
provide a lower bound for the fraction of actually missing values.
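The following sketch illustrates why the fraction of explicitly marked missing values is only a lower bound. The table and the convention that -1 is a disguised missing value are assumptions made for illustration.

```python
def missing_fraction(table, missing_markers=(None,)):
    """Fraction of entries in the table that match one of the missing markers."""
    total = sum(len(row) for row in table)
    missing = sum(1 for row in table for v in row if v in missing_markers)
    return missing / total


# Hypothetical table with the attributes age, gender, and income.
table = [
    [34, "female", 52000.0],
    [None, "male", 48000.0],
    [51, None, -1],          # -1 is assumed to be a disguised missing income
    [29, "female", 61000.0],
]

explicit = missing_fraction(table)                  # only explicitly marked values
with_hidden = missing_fraction(table, (None, -1))   # also counts the disguised value
```

Here `explicit` is 2/12, whereas counting the disguised value gives 3/12, so the first figure underestimates the actual fraction of missing values.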
Completeness with respect to records means that the data set contains the nec-
essary information that is required for the analysis. Some records might simply
be missing for some technical reasons. Data might have been lost because a few
years ago the underlying database system was changed and only those data records
were transferred to the new database that were considered to be important at that
point in time. In a customer database, customers who had not bought anything
for a longer time might not have been transferred (in order to eliminate potential
2 Note, however, that the problem may also reside with the name. Maybe the name of the person
was misspelled, and the correct name is “Joan Smith”—then the gender is actually female.
4.2 Data Quality 39
zombie customers) to the new database, or older transactions were not stored any-
more.
Very often the available data set itself is biased and not representative. Consider
as an example a bank that provides mortgages to private customers. If the aim of
the analysis is to predict for future loan applicants whether they will repay
the loan, we must take into account that the sample is biased in the sense that we
only have information about those customers who have been granted a loan. For
those customers who were denied a loan initially, we have no information
whether they would have repaid the loan or not. But these customers in particular
might be the ones for whom it is interesting to find a good scheme to predict the
risk. For customers with high income, a safe job, and no current debt, we need
no sophisticated data analysis techniques to predict that there is a good chance that
they will repay the loan. Of course, it is impossible to obtain a representative sample
in the statistical sense in this case. Such a sample would mean that we would have
to provide loans to any customer, no matter how bad their financial status is, for a
certain period and collect and evaluate these data. Unfortunately, this would be a
method entailing almost guaranteed bankruptcy.
The same problems occur in many other areas. For a production plant, we usually
have large amounts of data when it is running in a normal mode. For exceptional
situations, we will have little or no data. We cannot simply request such data, for instance,
by deliberately operating a nuclear plant at its limit to check what happens.
In such cases we might encounter future situations for which we had no corre-
sponding data in our sample. Such possible gaps in the data should be identified. One
should be aware that the space of possible values is automatically covered sparsely
by the data when we have a larger number of attributes. Consider a set of m
numerical attributes, and suppose we want to make sure that we have at least one
positive and one negative value for each attribute in our data set. This does not
require a large data set. But if we want to make sure that we have data for all
combinations of positive and negative values for the considered attributes, this
leads to $2^m$ possible combinations. If we have m = 20 attributes, we already have
more than one million possible combinations of positive and negative values.
Therefore, if we have a data set with one million records, we have on average about
one sample for each of these combinations. For a data set with 100,000 records, at
least 90% of the combinations will not be covered.
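The arithmetic behind these figures can be reproduced as follows. The last quantity is an addition that is not stated in the text: assuming, purely for illustration, that records fall uniformly at random into the sign combinations, it estimates the fraction of combinations that remain empty even with one million records.

```python
m = 20
combinations = 2 ** m                  # number of sign patterns for m attributes

records = 100_000
# Each record covers at most one combination, so at least this
# fraction of the combinations stays uncovered.
min_uncovered = 1 - records / combinations

# Assumption for illustration: one million records spread uniformly at random.
# The expected fraction of combinations without any record is (1 - 1/N)^n.
expected_uncovered = (1 - 1 / combinations) ** 1_000_000
```

Under the uniformity assumption, an average of about one sample per combination still leaves well over a third of the combinations without any data.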
Other problems can be caused by unbalanced data. As an example, consider a
production line for goods for which an automatic quality control is to be installed.
Based on suitable measurements, a classifier is to be constructed that sorts out parts
with flaws or faults. The scrap rate in production is usually very small, so that our
data might contain far fewer than 1% examples of parts with flaws or faults.
Timeliness refers to whether the available data are too old to provide up-to-date
information or cannot be considered as representative for predictions of future data.
Timeliness is often a problem in dynamically changing domains, where only re-
cently collected data provide relevant information, while older data can be mislead-
ing and can indicate trends that have vanished or even reversed.
Fig. 4.2 A bar chart (categorical attribute, left) and a histogram (numerical attribute, right)
A bar chart is a simple way to depict the frequencies of the values of a categorical
attribute. A simple example for a categorical attribute with six values a, b, c, d, e,
and f is shown on the left in Fig. 4.2.
A histogram shows the frequency distribution for a numerical attribute. To this
end, the range of the numerical attribute is discretized into a fixed number of inter-
4.3 Data Visualization 41
vals (called bins), usually of equal length. For each interval, the (absolute) frequency
of values falling into it is indicated by the height of a bar as shown on the right in
Fig. 4.2. From this histogram it can be read, for example, that a little bit more than
100 values lie in the vicinity of 1.
The histogram in Fig. 4.2 resulted from a sample of size 1000 from a mixture of
two normal distributions with means 0 and 3, respectively, having both a standard
deviation of 1. The density of this distribution is shown in Fig. 4.3.
But how should we choose the number of bins, and how much does this choice
influence the result? Figure 4.4 shows a histogram with only five bins for the same
data set underlying the histogram shown in Fig. 4.2. With only five bins, the two
peaks of the original distribution are no longer visible, and one gets the wrong im-
pression that the distribution is unimodal but skewed.
There is no generally best choice for the number of bins, but there are certain
recommendations. Sturges' rule [22] proposes to choose the number k of bins
according to the following formula:
\[ k = \lceil \log_2(n) + 1 \rceil, \tag{4.1} \]
where n is the sample size. Although Sturges' rule is still very often used as a default
in various statistics software packages, it is tailored to data from normal distributions
and data sets of moderate size [21]. The number of bins of the histogram in Fig. 4.2
has been computed based on Sturges’ rule. The size of the data set is n = 1000.
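Sturges' rule is commonly stated as k = ⌈log2(n) + 1⌉. Assuming this form, a minimal sketch for the sample size used here is:

```python
import math


def sturges_bins(n):
    """Number of bins according to Sturges' rule, k = ceil(log2(n) + 1)."""
    if n < 1:
        raise ValueError("sample size must be positive")
    return math.ceil(math.log2(n) + 1)


k = sturges_bins(1000)   # sample size of the histogram discussed above
```

Because the rule grows only logarithmically with n, doubling the sample size adds just one bin.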
Fig. 4.5 Histograms with a suitable choice for the number of bins (left) and too many bins (right)
Assuming that, as in Sturges’ rule, the bins have equal length, the number of bins
can also be determined based on the length h of each bin:
\[ k = \frac{\max_i \{x_i\} - \min_i \{x_i\}}{h}, \tag{4.2} \]
where $x_1, \ldots, x_n$ is the sample to be displayed. Reasonable values for h are [20]
\[ h = \frac{3.5 \cdot s}{n^{1/3}}, \tag{4.3} \]
where s is the sample standard deviation, and [8]
\[ h = \frac{2 \cdot \mathrm{IQR}(x)}{n^{1/3}}, \tag{4.4} \]
where IQR(x) is the interquartile range of the sample, that is, the length of the
interval which covers the middle 50% of the data.
For the data set with the histogram displayed in Fig. 4.2, (4.3) yields k = 16,
and (4.4) leads to k = 17. A histogram for the second choice (that is, for k = 17) is
shown on the left in Fig. 4.5.
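The two bin lengths (4.3) and (4.4) and the resulting number of bins (4.2) can be sketched as follows. The sample is generated here with an assumed 50/50 mixture and an arbitrary seed, so the resulting values of k need not coincide with the values 16 and 17 reported for the book's data set. Rounding k up to an integer and the quartile convention of `statistics.quantiles` are further assumptions.

```python
import math
import random
import statistics


def bin_width_from_std(xs):
    """Bin length h = 3.5 * s / n^(1/3), as in (4.3)."""
    return 3.5 * statistics.stdev(xs) / len(xs) ** (1 / 3)


def bin_width_from_iqr(xs):
    """Bin length h = 2 * IQR / n^(1/3), as in (4.4)."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return 2 * (q3 - q1) / len(xs) ** (1 / 3)


def number_of_bins(xs, h):
    """Number of bins from (4.2), rounded up to an integer (an assumption)."""
    return math.ceil((max(xs) - min(xs)) / h)


# Assumed sample: 500 values each from N(0, 1) and N(3, 1).
rng = random.Random(0)
sample = [rng.gauss(0, 1) for _ in range(500)] + [rng.gauss(3, 1) for _ in range(500)]

k_std = number_of_bins(sample, bin_width_from_std(sample))
k_iqr = number_of_bins(sample, bin_width_from_iqr(sample))
```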
As we have seen in Fig. 4.4, the histogram can be misleading when the number
of bins is chosen too small. Choosing the number of bins too high usually leads
to a very scattered histogram in which it is difficult to distinguish true peaks from
random peaks. An example is shown on the right in Fig. 4.5, where k = 200 was
chosen for the number of bins for the same data underlying Fig. 4.2.
All of these methods (that is, (4.2), (4.3), and (4.4) for determining the number
of bins or the length of the bins) are highly sensitive to outliers, since they divide
the range between the smallest and the largest value of the sample into bins of equal
size. A single outlier can make this range extremely large, so that for a smaller
number of bins, the bins themselves become very large, and for a larger number of
bins, most of the bins can be empty. To avoid this problem, one can either leave out
extreme values from the sample (for instance, the 3% smallest and the 3% largest
values) for calculating and displaying the histogram, or one can deviate from the
principle of bins of equal length.
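The first remedy, leaving out a fixed percentage of the smallest and largest values before computing the histogram, can be sketched as follows. The function name and the example data are illustrative only.

```python
def trim_extremes(xs, percent=3):
    """Drop the `percent` % smallest and largest values of the sample."""
    ordered = sorted(xs)
    cut = len(ordered) * percent // 100   # number of values removed at each end
    return ordered[cut:len(ordered) - cut] if cut else ordered


# Hypothetical sample: the values 0..98 plus one extreme outlier.
sample = list(range(99)) + [10_000]
trimmed = trim_extremes(sample)

full_range = max(sample) - min(sample)       # dominated by the outlier
trimmed_range = max(trimmed) - min(trimmed)  # range used for the bins
```

The trimmed values are only excluded from the display; whether they are erroneous is a separate question that still has to be checked.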
Boxplots are a very compact way to visualize and summarize main character-
istics of a sample from a numerical attribute. Figure 4.6 shows two boxplots from
samples from a standard normal distribution with mean 0 and variance 1. The left
boxplot is based on a sample of size n = 1000, whereas a sample of size n = 100 was
used for the right boxplot.
The line in the middle of a boxplot indicates the sample median. The notch in
the box is not always shown. It indicates a 95% confidence interval for the median.
The box itself corresponds to the interquartile range covering the middle 50% of the
data. The whiskers are drawn in the following way. The maximum length of each
whisker is 1.5 times the length of the interquartile range. But if there is no data point
at the maximum length of a whisker, the corresponding whisker is shortened until
it reaches the next data point. Data points lying outside the whiskers are considered
as outliers and are indicated in the form of small circles.
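The construction of the box, the whiskers, and the outliers can be written down directly. The sketch below uses the "inclusive" quartile convention of Python's `statistics.quantiles`, which is an assumption; plotting packages differ slightly in how they compute quartiles, and the notch is omitted.

```python
import statistics


def boxplot_stats(xs):
    """Median, quartiles, whisker ends, and outliers of a numerical sample."""
    ordered = sorted(xs)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_limit, high_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers are shortened to the most extreme data points within the limits.
    inside = [x for x in ordered if low_limit <= x <= high_limit]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [x for x in ordered if x < low_limit or x > high_limit],
    }


stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
```

In this small example the upper whisker ends at 9 rather than at the theoretical limit, and the value 30 is reported as an outlier.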
Comparing the two boxplots in Fig. 4.6, we can observe the following:
• Although both boxplots come from samples from the same normal distribution,
they look different, since they are based on different samples.
• The notch of the left boxplot, representing a 95% confidence interval for the me-
dian, is much smaller than the notch of the right boxplot because of the larger
sample size for the left boxplot.
• Theoretically, the whiskers for a sample from a symmetric distribution like the
normal distribution should have roughly the same length. For the boxplot based
on the smaller sample size, we can see that whiskers differ significantly in length,
since—by chance—the largest value among the sample of 100 values was not
greater than 2, whereas the smallest value was smaller than −3.
• In contrast to the boxplot on the left-hand side, the right boxplot does not contain
any outliers. This is again due to the smaller sample size. The theoretical length
of the interquartile range for a standard normal distribution is 1.349, with the
quartiles at about ±0.6745, so that the whiskers can extend at most to
±(0.6745 + 1.5 · 1.349) ≈ ±2.698. The probability of a point lying outside this
(theoretical) range [−2.698, 2.698] of the whiskers is almost 0.7%. Therefore, for a
sample from a normal distribution of size n = 1000, we can expect roughly 7 outliers
on average in a boxplot and less than one for a sample of size n = 100.
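The numbers in the last item can be checked with the standard normal distribution from Python's `statistics` module. This is only a verification of the theoretical values; an actual sample will of course deviate from them.

```python
from statistics import NormalDist

std_normal = NormalDist(0.0, 1.0)

q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1                    # theoretical interquartile range
upper_limit = q3 + 1.5 * iqr     # maximum extent of the upper whisker

# By symmetry, both tails beyond the whisker limits have the same probability.
p_outlier = 2 * (1 - std_normal.cdf(upper_limit))

expected_outliers_1000 = 1000 * p_outlier
expected_outliers_100 = 100 * p_outlier
```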
The boxplots of asymmetric distributions look completely different. If we sample
from an exponential distribution, whose probability density function is shown in
Fig. 4.7, we obtain boxplots as they are shown in Fig. 4.8. The boxplots on the left
and right represent samples of sizes n = 1000 and n = 100, respectively.
Fig. 4.10 Density plot (left) and a plot based on hexagonal binning (right) for the same data set
as shown in Fig. 4.9
dimensional domain of the data for the scatter plot is partitioned into bins of the
same size. Possible forms for the bins are rectangles or hexagons. The intensity of
the color for the bin is chosen proportional to the number of data objects falling into
the bin. Figure 4.10 shows a density plot on the left and a plot based on a hexagonal
binning on the right for the same data set displayed in Fig. 4.9. Both plots indicate
a higher density of the data around the point (0.6, 0.4), which cannot be seen in the
simple scatter plot in Fig. 4.9.
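The counting step behind such a plot can be sketched with square bins, which are simpler to index than hexagons. The data below are synthetic and merely imitate a uniform background with a denser region around (0.6, 0.4); the grid size and all names are assumptions.

```python
import random
from collections import Counter


def bin_counts_2d(points, bins_per_axis, lo=0.0, hi=1.0):
    """Count points per square bin on a regular grid over [lo, hi] x [lo, hi]."""
    width = (hi - lo) / bins_per_axis
    counts = Counter()
    for x, y in points:
        # Clamp the indices so that points on or beyond the border
        # fall into the outermost bins.
        i = min(max(int((x - lo) / width), 0), bins_per_axis - 1)
        j = min(max(int((y - lo) / width), 0), bins_per_axis - 1)
        counts[(i, j)] += 1
    return counts


rng = random.Random(1)
background = [(rng.random(), rng.random()) for _ in range(2000)]
cluster = [(rng.gauss(0.6, 0.03), rng.gauss(0.4, 0.03)) for _ in range(500)]

counts = bin_counts_2d(background + cluster, bins_per_axis=8)
densest_bin = max(counts, key=counts.get)
# Color intensity proportional to the number of objects in the bin.
intensity = {cell: c / counts[densest_bin] for cell, c in counts.items()}
```

The densest bin is the one containing the point (0.6, 0.4), which a plain scatter plot of 2500 overlapping symbols would hardly reveal.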
Scatter plots can be enriched with further information in order to involve more
attributes. Different plot symbols or colors can be used for plotting the points in
order to include information about a categorical attribute. Color intensity and the
size of the symbols are possible means to indicate the value of additional numerical
attributes.
Figure 4.11 shows two scatter plots of the Iris data set—one displaying the sepal
length versus the sepal width and the other one the petal length versus the petal
width—in which different species are displayed by different colors. Both plots show
that the red circles, representing the species Iris setosa, can be well distinguished
from the other two species Iris versicolor and Iris virginica displayed as triangles
Fig. 4.11 Scatter plots of the iris data set for sepal length vs. sepal width (left) and for petal length
vs. petal width (right). All quantities are measured in centimeters (cm)
and crosses, respectively. However, the left chart in Fig. 4.11 gives the impression
that Iris virginica and Iris versicolor are very difficult to distinguish, at least when we
only take the sepal length and the sepal width into account. But when we consider
the petal length and the petal width (right chart in Fig. 4.11), we can still see the
overlap of the corresponding symbols for the species, but there is a clear tendency
that Iris virginica tends to larger values than Iris versicolor for the petal length and
width.
Comparing the number of red circles in Figs. 4.11 (left and right), there seem to
be fewer red circles on the right. But how can some of the objects suddenly vanish
in the scatter plot? When we count the number of red circles, we see that in both
scatter plots there are fewer than 50, although the data set contains 50 instances of
Iris setosa that should be displayed by red circles. The circles are not missing in the
scatter plots. Some circles are simply plotted at exactly the same position, since their
measured sepal length and width or their measured petal length and width coincide.
Recall that these values were only measured with a precision of just one digit right
after the decimal point. To avoid this impression of seeing fewer objects than there
actually are, one can add jitter to the scatter plot. Instead of plotting the symbols
exactly at the coordinates specified by the values in the data set, we add a small
random value to each original value in the data table. The left chart in Fig. 4.12 shows
the resulting scatter plot with jitter where we have added random values from a
uniform distribution on the interval [−0.04, 0.04] to the original values. Since two
different original values differ by at least 0.1, whereas the jitter changes their
difference by at most 0.08, this ensures that a point originally lying to the left of or
below another point will always remain to the left of or below the other point, even
when the jitter is added.
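A minimal sketch of jittering with the interval used above follows. The measured values are hypothetical and given with one digit after the decimal point; the random generator is seeded only to make the sketch reproducible.

```python
import random


def add_jitter(values, amplitude=0.04, rng=random):
    """Add uniform noise from [-amplitude, amplitude] to every value."""
    return [v + rng.uniform(-amplitude, amplitude) for v in values]


rng = random.Random(42)
petal_length = [1.4, 1.4, 1.5, 1.3]   # hypothetical measurements, two coincide
jittered = add_jitter(petal_length, rng=rng)

# Coinciding values are now plotted at (almost surely) different positions,
# while the order of values that differed originally is preserved.
order_preserved = jittered[3] < min(jittered[0], jittered[1]) and \
    max(jittered[0], jittered[1]) < jittered[2]
```

The amplitude has to stay below half the measurement resolution (here 0.05); otherwise the order of neighboring values could be reversed.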
Jitter is essential when categorical attributes are used for the coordinate axes of
a scatter plot, since categorical attributes have only a limited number of possible
values, so that plotting of objects at exactly the same position occurs very often
when no jitter is added.
From a scatter plot we can already extract important information. Consider again
the scatter plot displayed in Fig. 4.12. We can see that the petal length and width
are correlated. Objects with larger values for the petal length also tend to have larger
Fig. 4.12 The same scatter plot as in Fig. 4.11 on the right, but with jitter (left) and with jitter and
two outliers (right; the outliers are the red points in the top left and top right corners)
values for the petal width. The scatter plot also shows that Iris setosa—the red circles
in the scatter plot—can be easily distinguished from the other two species just on
the basis of the petal length or width. The scatter plot does not indicate that the other
two species cannot be separated clearly. It only shows that, solely based on the petal
length and width, it is not possible to distinguish the two species perfectly. Outliers
can also be discovered in scatter plots. The left chart in Fig. 4.12 does not have any
outliers. In the right chart, however, we have added two artificial outliers. The data
point in the upper left corner is a clear outlier with respect to the whole data set. Note
that the values for the attributes petal length and width are both in the general range
of the corresponding attributes in the data set. But there is no other object in the data
set with a similar combination of these attribute values. The second outlier in the
right chart of Fig. 4.12—the circle in the upper right corner—is not an outlier with
respect to the values for the petal length and width or their combination. However, it
is an outlier for the class Iris setosa displayed by red circles. Whenever such outliers
are discovered, one should check the data or the data generating process again to
ensure that the outliers are not due to erroneous data.
It should be noted that the scatter plots—like all other visualization techniques—
are very useful tools to discover simple structures and patterns or peculiar deviations
like outliers in a data set. But there is no guarantee that a scatter plot or any visu-
alization technique will automatically show all or even any interesting or deviating
pattern in the data set. A scatter plot with no outliers does not mean that there are
no outliers in the data set. It only means that there are no outliers with respect to the
combination of the attributes displayed in the scatter plot. In this sense, visualization
techniques are like test cases for computer programs. Test cases can discover errors
in a program. But if the test cases have not indicated any errors, this does not imply
that the program does not have any bugs. In the same way, a visualization technique
might give hints to certain interesting patterns in the data set. But if one cannot see
any interesting patterns in a visualization, it does not mean that there are no patterns
in the data set.
Principal component analysis (PCA), which is also briefly described in the ap-
pendix on statistics, is a method from statistics to find a projection to a plane—or
more generally, to a linear subspace—which preserves as much as possible of the
original variance in the data. In order to restrict the search for the best projection
plane to planes through the origin of the coordinate system, the data are first cen-
tered around the origin by subtracting the mean value for each attribute from the
attribute values. In this way, the projection to the plane can be represented by a
matrix M mapping the data points $x \in \mathbb{R}^m$ to the plane by
\[ y = M \cdot (x - \bar{x}), \tag{4.5} \]
where $\bar{x}$ denotes the (empirical) mean value (or vector of (empirical) mean values)
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. \]
\[ M = (v_1, \ldots, v_q), \]
Fig. 4.13 Left: the first (solid line) and the second principal component (dashed line) of an exam-
ple data set (Iris data). Right: the example data set projected to the space that is spanned by the first
and second principal components (resulting from a PCA involving all four attributes)
3 λ is called an eigenvalue of a matrix A if there is a nonzero vector v such that Av = λv. The
vector v is called an eigenvector to the eigenvalue λ.
the scaling of the attributes. If no standardization is carried out, the attribute with the
largest variance can easily dominate the first principal component. For the example
in Fig. 4.13 with z-score standardization, the first principal component is the vector
$(\sqrt{2}/2, \sqrt{2}/2)$. If the petal length is measured in meters instead of centimeters, but
the petal width is still measured in centimeters, the first principal component without
z-score standardization becomes the vector (0.0223, 0.9998), since the change from
centimeters to meters scales the petal length by the factor 0.01 and thereby reduces
its variance drastically (by the factor $0.01^2 = 10^{-4}$), so that more or less only the
petal width contributes to the variance in the data.
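The effect of the measurement unit can be reproduced with two attributes, for which the eigenvalues and eigenvectors of the covariance matrix are available in closed form. The data below are synthetic stand-ins for a length and a width attribute, not the Iris measurements, so the resulting vectors only resemble the ones quoted above.

```python
import math
import random
import statistics


def first_principal_component(xs, ys):
    """First principal component of two attributes and the variance fraction it preserves."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of the symmetric 2x2 covariance matrix in closed form.
    center = (sxx + syy) / 2
    radius = math.hypot((sxx - syy) / 2, sxy)
    lam1, lam2 = center + radius, center - radius
    # Eigenvector for lam1; the else branch handles uncorrelated attributes.
    if sxy != 0:
        vx, vy = sxy, lam1 - sxx
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), lam1 / (lam1 + lam2)


def z_score(values):
    mean, std = statistics.mean(values), statistics.stdev(values)
    return [(v - mean) / std for v in values]


# Synthetic, positively correlated attributes (assumed for illustration).
rng = random.Random(0)
length_cm = [rng.uniform(1.0, 7.0) for _ in range(150)]
width_cm = [0.4 * x + rng.gauss(0.0, 0.2) for x in length_cm]
length_m = [x / 100 for x in length_cm]   # same attribute, measured in meters

pc_standardized, _ = first_principal_component(z_score(length_m), z_score(width_cm))
pc_raw, preserved_raw = first_principal_component(length_m, width_cm)
```

After z-score standardization the unit no longer matters, and for two positively correlated attributes the first principal component is the diagonal direction. Without standardization the component points almost exactly along the width axis.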
PCA can be used for visualization purposes by restricting to the first two principal
components. More generally, PCA can carry out a dimension reduction to any
lower-dimensional space; moreover, PCA also provides information about how many
dimensions the data set actually spreads over. This information can be extracted from the
eigenvalues $\lambda_1 \ge \cdots \ge \lambda_m$ of the covariance matrix. When we project the data to the
first q principal components $v_1, \ldots, v_q$ corresponding to the eigenvalues $\lambda_1, \ldots, \lambda_q$,
this projection will preserve a fraction of
\[ \frac{\lambda_1 + \cdots + \lambda_q}{\lambda_1 + \cdots + \lambda_m} \tag{4.7} \]
of the variance of the original data. Table 4.1 shows the corresponding result of
PCA applied to the Iris data set without the categorical attribute for the species.
A projection of this four-dimensional data set to the first principal component, i.e.,
to only one dimension, already covers 73% of the variance of the original data set.
A projection to a plane defined by the first two principal components already covers
95.8% of the variance. This means that the data points of the Iris data set, described
by four numerical attributes, do not lie exactly on a two-dimensional plane in the
four-dimensional space but do not deviate too much from the plane defined by the
first two principal components.
The right chart in Fig. 4.13 shows the projection of the Iris data set to the first two
principal components where PCA was carried out after the z-score standardization
had been applied.
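The preserved fraction of variance in (4.7) is a simple ratio of eigenvalue sums. The eigenvalues in the following sketch are hypothetical and are not the values of Table 4.1; the helper for choosing q is an illustrative addition.

```python
def preserved_fraction(eigenvalues, q):
    """Fraction of variance preserved by the first q principal components, see (4.7)."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:q]) / sum(ordered)


def components_needed(eigenvalues, target):
    """Smallest q whose preserved fraction reaches the target."""
    for q in range(1, len(eigenvalues) + 1):
        if preserved_fraction(eigenvalues, q) >= target:
            return q
    return len(eigenvalues)


# Hypothetical eigenvalues of a covariance matrix for four attributes.
eigenvalues = [2.9, 0.9, 0.15, 0.05]

one_component = preserved_fraction(eigenvalues, 1)
q_for_90_percent = components_needed(eigenvalues, 0.9)
```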
The importance of the scaling effect carried out by the z-score standardization
can also be observed by revisiting the example where we had considered only the
petal length and width of the Iris data set. We had applied PCA to the original data
and the data where we had changed the measurement of the petal length from cen-
timeters to meters without scaling, resulting in the vector (0.7071, 0.7071) as the
first principal component for the original data and the vector (0.0223, 0.9998) for
the modified data. The variance preserved by the projection to the first principal
component is 98.1% in the first case and 99.996% in the second case. In the lat-
ter case, the first principal component corresponds more or less to the petal width,
since the variance of the petal length measured in meters is almost negligible, and
therefore the projection can preserve close to 100% of the original variance.
In order to illustrate the advantages and limitations of PCA for visualization pur-
poses, we consider an artificial three-dimensional data set illustrated in Fig. 4.14.
The data fill the unit cube in chessboard-like manner. When the unit cube is divided
into eight subcubes, these subcubes are alternatingly empty and filled with data.
There is also one outlier close to the upper left corner of the surface in the front of
the cube. The scatter plots resulting from projections to two axes of the coordinate
system are shown in Fig. 4.15. All these scatter plots give the wrong impression that
the data are uniformly distributed over a grid in the data space. The scatter plots
provide neither a hint to the chessboard pattern in the three-dimensional data space
nor to the single outlier.
Figure 4.16 shows the projection to the first two principal components of the
data set after a z-score standardization has been carried out. The outlier can now
be identified easily. From Fig. 4.16 it is also obvious that the data cannot be dis-
tributed uniformly over the original three-dimensional space but that there must be
some inherent pattern. Of course, it is impossible to recover the three-dimensional
chessboard pattern in a two-dimensional projection completely.
PCA has the advantage that the best projection with respect to the given criterion—
the preservation of the variance—can be computed directly based on the eigenvec-
tors of the covariance matrix. Projection pursuit [10] takes a different approach.
The projection of the data should show interesting aspects of the data. But what does
interesting mean? Normally, for projection pursuit, interestingness of a projection
is defined as the deviation from a normal distribution, according to the observation
that most of the projections of high-dimensional data will resemble a normal distri-
bution [6]. The more the projected data deviate from a normal distribution, the more
interesting is the projection. Various criteria [5, 9–12] can be defined to measure
how much a projection deviates from a normal distribution.
In contrast to PCA, there is no direct way to compute the most interesting projections
with respect to the deviation from a normal distribution. Projection pursuit simply
generates random projections and chooses the ones yielding the best values for the
measures of interestingness.
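The random search can be sketched for one-dimensional projections. The interestingness measure used here, the absolute excess kurtosis of the projected data, is an assumption chosen for brevity and is not necessarily one of the criteria cited above; the data set is synthetic, with a bimodal first attribute and two normally distributed attributes.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Random direction, obtained by normalizing a Gaussian vector."""
    v = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def excess_kurtosis(values):
    """Sample excess kurtosis; it is close to 0 for normally distributed data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) - 3.0


def pursue(data, trials, rng):
    """Return (score, direction) of the most interesting random projection."""
    best = None
    for _ in range(trials):
        u = random_unit_vector(len(data[0]), rng)
        projected = [sum(a * b for a, b in zip(row, u)) for row in data]
        score = abs(excess_kurtosis(projected))
        if best is None or score > best[0]:
            best = (score, u)
    return best


rng = random.Random(0)
data = [
    [rng.choice((-2, 2)) + rng.gauss(0, 0.5), rng.gauss(0, 1), rng.gauss(0, 1)]
    for _ in range(1000)
]
best_score, best_direction = pursue(data, trials=200, rng=rng)
```

The selected direction is dominated by the first attribute, the only one whose distribution deviates clearly from a normal distribution.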
the distances $d_{ij}^{(Y)} = \|p_i - p_j\|$ between the points in the two-dimensional
representation should deviate as little as possible from the original distances $d_{ij}^{(X)}$. One
way to measure this deviation is the sum of the squared errors between the original
distances and the distances in the two-dimensional representation:
\[ E_0 = \sum_{i=1}^{n} \sum_{j=i+1}^{n} \left( d_{ij}^{(Y)} - d_{ij}^{(X)} \right)^2. \tag{4.8} \]
This equation does not refer explicitly to the two-dimensional representation. The
distances $d_{ij}^{(Y)}$ in the lower-dimensional space can be derived from points in $\mathbb{R}^2$, but
the distances could also be computed for points in $\mathbb{R}^q$ for any $q \in \mathbb{N}$.
The sum of squared errors depends on the number of data objects and on the values
for the original distances. For more data points and for larger distances $d_{ij}^{(X)}$, $E_0$
will tend to become larger as well. In order to obtain an error measure independent
of these effects, $E_0$ is often normalized to
\[ E_1 = \frac{1}{\sum_{i=1}^{n} \sum_{j=i+1}^{n} \left( d_{ij}^{(X)} \right)^2} \sum_{i=1}^{n} \sum_{j=i+1}^{n} \left( d_{ij}^{(Y)} - d_{ij}^{(X)} \right)^2. \tag{4.9} \]
The factor $1 / \sum_{i=1}^{n} \sum_{j=i+1}^{n} \left( d_{ij}^{(X)} \right)^2$ is independent of the distances $d_{ij}^{(Y)}$ and
therefore independent of the lower-dimensional representation of the data. It does not
affect the result of the minimization of $E_0$ or $E_1$.
The relative error
\[ E_2 = \sum_{i=1}^{n} \sum_{j=i+1}^{n} \left( \frac{d_{ij}^{(Y)} - d_{ij}^{(X)}}{d_{ij}^{(X)}} \right)^2 \tag{4.10} \]
is an alternative to the absolute error in (4.9). Very often, neither the absolute nor
the relative error is considered for MDS, but a compromise between them given by
\[ E_3 = \frac{1}{\sum_{i=1}^{n} \sum_{j=i+1}^{n} d_{ij}^{(X)}} \sum_{i=1}^{n} \sum_{j=i+1}^{n} \frac{\left( d_{ij}^{(Y)} - d_{ij}^{(X)} \right)^2}{d_{ij}^{(X)}}. \tag{4.11} \]
When MDS is carried out based on the error measure E3 , it is also called Sam-
mon mapping. The error E3 for a concrete representation of the data in the lower-
dimensional space is called stress.
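The stress of a given low-dimensional representation can be evaluated directly from the two sets of pairwise distances. The sketch below only computes the error measure used for the Sammon mapping; it does not perform the minimization. The points are hypothetical, and Euclidean distances are assumed.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distances for all pairs i < j."""
    return {
        (i, j): math.dist(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    }


def sammon_stress(original, embedded):
    """Error measure of the Sammon mapping for a given representation."""
    dx = pairwise_distances(original)
    dy = pairwise_distances(embedded)
    if any(d == 0 for d in dx.values()):
        raise ValueError("original distances must be nonzero")
    scale = sum(dx.values())
    return sum((dy[p] - dx[p]) ** 2 / dx[p] for p in dx) / scale


# Hypothetical data: four points of a unit square lying in the plane z = 0.
original = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
flat = [(0, 0), (1, 0), (0, 1), (1, 1)]   # keeps all distances
squeezed = [(0,), (1,), (0,), (1,)]       # collapses the second axis

stress_flat = sammon_stress(original, flat)
stress_squeezed = sammon_stress(original, squeezed)
```

A representation that preserves all distances has stress zero, whereas collapsing one axis distorts several distances and yields a clearly positive stress.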
So far, we have only proposed error measures for MDS, but we still need to find
a way to minimize these error measures. The minimization requires finding n suitable
points in $\mathbb{R}^m$ and therefore involves m × n variables, where m is the dimension
for the MDS representation of the data, and n is the number of data objects. This
means that even a two-dimensional MDS representation of a small data set like the
Iris data leads to a minimization problem with 2 × 150 = 300 parameters to be
optimized. Unfortunately, there is no known analytical solution for any of the er-
ror measures in (4.8)–(4.11). Therefore, a heuristic optimization strategy is needed.
Typically, a gradient method is applied. Gradient methods are discussed in more
Figure 4.17 shows the result of MDS in the form of the Sammon mapping applied
to the Iris data set. The categorical attribute for the species has not been used for
MDS. But the corresponding points resulting from the Sammon mapping are marked
by colors corresponding to the Iris flower species. Before the distances $d_{ij}^{(X)}$ in the
original four-dimensional space defined by the sepal and petal length and width
are computed, a z-score standardization is applied to each of the four numerical
attributes.
Before the MDS algorithm can be applied, we have to make sure that no distance
between two different data objects is zero because these distances occur in the de-
nominator of the gradient, so that a zero distance would lead to a division by zero
error. There are two Iris flowers—number 102 and number 143—with exactly the
same values for all attributes, leading to a distance of zero between them. Therefore,
we removed one of them from the data set before carrying out the computations for
MDS.
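Removing exact duplicates before computing the distances can be sketched as follows. The rows are illustrative attribute tuples, and the approach only catches objects that coincide in all attributes.

```python
def drop_duplicate_objects(rows):
    """Keep the first occurrence of every distinct data object."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Illustrative rows; the first and the last object coincide in all attributes.
rows = [
    (5.8, 2.7, 5.1, 1.9),
    (6.3, 2.9, 5.6, 1.8),
    (5.8, 2.7, 5.1, 1.9),
]
unique_rows = drop_duplicate_objects(rows)
```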
Figure 4.18 shows the result of the Sammon mapping applied to the “3D chess-
board pattern” data set in Fig. 4.14, where we have again carried out z-score stan-
dardization in advance. The four cubes filled with data can even be recognized in
MDS differs from PCA in various aspects. MDS is based on the idea of preserving
the distances among the original data objects, whereas PCA focuses on the variance.
PCA provides an explicit mapping from the space in which the data objects are
located to the lower-dimensional space, whereas MDS provides only an explicit rep-
resentation of the data objects in the lower-dimensional space. This means that when
a new or hypothetical data object is considered, it can be represented immediately
in case of PCA simply by projecting the data object to the corresponding principal
components. This is not possible for MDS.
The computational complexity of PCA is lower than the complexity of MDS.
The computation of the covariance matrix can be carried out in linear time with re-
spect to the size of the data set. Once the covariance matrix has been calculated, the
computation of the eigenvalues and eigenvectors depends only on the number of at-
tributes, but not on the number of data anymore. The number of attributes is usually
much smaller than the number of data objects. For MDS, it is necessary to consider
the pairwise distances between data objects leading to a quadratic complexity in the
number of data objects. Although a quadratic complexity is often considered as
feasible in computer science, it is unacceptable for larger data sets. Consider a data set
with one million data objects. The distance matrix contains $10^{12}$ entries in this case.
Since the distance matrix is symmetric and the entries in the diagonal are all zero,
we only need to know $(10^{12} - 10^6)/2 = 4999995 \cdot 10^5$ entries. If we want to store
the distances in 4-byte floating-point values, this would require more than 1800
gigabytes! There are various modifications of MDS [3] that try to overcome these
complexity problems by sampling [15] or variations of the error measures [16, 17].
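The storage estimate can be retraced with a few lines of R. This is only a sketch of the arithmetic; it assumes that "gigabytes" is meant in the binary sense (2^30 bytes), which is what makes the figure of roughly 1800 come out.

```r
n <- 1e6                      # number of data objects
entries <- (n^2 - n) / 2      # entries above the diagonal of the distance matrix
bytes <- 4 * entries          # 4-byte floating-point values
gib <- bytes / 2^30           # binary gigabytes (assumption, see above)
c(entries = entries, gigabytes = gib)
```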
All these dimension-reduction methods generate scatter plots with abstract co-
ordinate axes that do not correspond to attributes of the original data. In this way,
the scatter plots of projection pursuit can even be extended and evaluated by the
measures of interestingness proposed for projection pursuit and also other statistical
measures and tests in order to select the most interesting visualizations [23].
Fig. 4.19 Parallel coordinates plot for the Iris data set
Fig. 4.20 Parallel coordinate plots for Iris setosa, Iris versicolor, and Iris virginica (left to right)
Figure 4.19 shows the plot of the Iris data set with parallel coordinates. The
species is a categorical attribute and can only assume three possible values. The
polylines for the different species are displayed with different colors. The plot
clearly shows that Iris setosa has smaller values for the petal length and width than
the other two species.
For larger data sets, it becomes more or less impossible to track the lines that
correspond to a data object in parallel coordinates plots or even to discover gen-
eral structures. It can be helpful to generate separate parallel coordinate plots for
different subsets of the data. For instance, in the case of the Iris data set, we could
generate a separate plot for each of the three species as shown in Fig. 4.20. In or-
der to keep the three plots comparable, we have not rescaled the axes for each of
the three plots. Normally, the axes will be scaled in such a way that the minimum
and maximum of all values for the corresponding attribute of the displayed data are
the lowest and highest points of the axis.
Figure 4.21 shows the parallel coordinates plot for the “3D chessboard pattern”
data set in Fig. 4.14. This plot is identical to the plot that we would obtain if we had
not used the “3D chessboard pattern,” but had filled the cube uniformly with data.
4.4 Correlation Analysis 59
Radar plots are based on a similar idea as parallel coordinates with the difference
that the coordinate axes are not drawn as parallel lines, but in a star-like fashion
intersecting in one point. Figure 4.22 shows a radar plot for the four numerical
attributes of the Iris data set. Due to the fact that radar plots sometimes resemble
spider webs, they are also called spider plots.
Radar plots are only suited for smaller data sets. For such smaller data sets, it is
sometimes better not to draw all data objects in the system of coordinate axes but
to draw each data object separately, which is then called a star plot. A star plot for
the numerical attributes of the Iris data set is shown in Fig. 4.23. The first 50 “stars”
correspond to the objects from the species setosa, the next 50 to versicolor, and the
last 50 to virginica. The star plot also shows clearly that setosa differs considerably
from the other two species.
Fig. 4.23 Star plot for the numerical attributes of the Iris data set
Table 4.3 Pearson’s correlation coefficients for the numerical attributes of the Iris data set
Sepal length Sepal width Petal length Petal width
Of course, the values in the diagonal must be equal to 1, since an attribute corre-
lates fully with itself. When we plot an attribute against itself in a scatter plot, the
points lie on a perfect line, the diagonal, so that Pearson’s correlation coefficient
must be 1 in this case. The matrix with the coefficients must also be symmetric.
It is also not surprising that Pearson’s correlation coefficient between the petal
length and the petal width is very high. This means more or less that the leaves
roughly keep their shape. A short leaf will also not be very broad, and a long leaf
will be broader. It seems counterintuitive that there is more or less no correlation
between the sepal length and the sepal width or even a very small negative correla-
tion. When we take a look at the scatter plots in Fig. 4.11 on page 46, this can be
explained easily. The negative correlation originates from the fact that we have the
measurements from different species. Setosa has short but broad leaves compared to
the other species. When we compute Pearson’s correlation coefficient separately for
the species for the sepal length and width, we obtain the values 0.743, 0.526, and
0.457 for Iris setosa, Iris versicolor, and Iris virginica, respectively. The correlation
is not as high as for the petal length and width, but at least it is positive and not
negative.
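The effect of mixing the species can be reproduced directly with the iris data set that ships with R. The following is a small sketch; the names overall and per_species are chosen here for illustration only.

```r
# Correlation between sepal length and width over all 150 flowers
overall <- cor(iris$Sepal.Length, iris$Sepal.Width)

# The same correlation computed separately within each species
per_species <- sapply(split(iris, iris$Species),
                      function(d) cor(d$Sepal.Length, d$Sepal.Width))

round(overall, 3)      # slightly negative
round(per_species, 3)  # positive for all three species
```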
Pearson’s correlation coefficient measures linear correlation. If there is a
functional dependency between two attributes and the function is monotone but
nonlinear, Pearson’s correlation coefficient will not be −1 or 1. It can even be
far away from these values, depending on how much the function describing the
functional relationship deviates from a line.
Rank correlation coefficients avoid this problem by ignoring the exact numerical
values of the attributes and considering only the ordering of the values. Rank
correlation coefficients intend to measure monotone correlations between attributes
where the monotone function does not have to be linear.
Spearman’s rank correlation coefficient or Spearman’s rho is defined as
$$\rho = 1 - 6\,\frac{\sum_{i=1}^{n} \bigl(r(x_i) - r(y_i)\bigr)^2}{n(n^2 - 1)}, \tag{4.15}$$
where $r(x_i)$ is the rank of value $x_i$ when we sort the list $(x_1, \ldots, x_n)$; $r(y_i)$ is defined
analogously.
Spearman’s rho measures the sum of squared differences of ranks and scales this
measure to the interval [−1, 1]. When the rankings of the x- and y-values are exactly
in the same order, Spearman’s rho will yield the value 1; if they are in reverse order,
we will obtain the value −1.
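Equation (4.15) can be checked against R's built-in rank correlation. The sketch below uses a hypothetical helper spearman_rho and made-up values without ties; y is a monotone but nonlinear function of x, so Pearson's coefficient stays below 1 while Spearman's rho reaches it.

```r
# Spearman's rho computed literally from Eq. (4.15); valid only without ties
spearman_rho <- function(x, y) {
  n <- length(x)
  d <- rank(x) - rank(y)
  1 - 6 * sum(d^2) / (n * (n^2 - 1))
}

x <- c(0.5, 1.0, 2.0, 3.5, 5.0, 8.0)
y <- exp(x)                                     # monotone, nonlinear dependency

rho         <- spearman_rho(x, y)               # same order of ranks
rho_builtin <- cor(x, y, method = "spearman")   # R's own implementation
rho_reverse <- spearman_rho(x, -y)              # reversed order of ranks
pearson     <- cor(x, y)                        # linear correlation only

c(rho = rho, rho_builtin = rho_builtin, rho_reverse = rho_reverse, pearson = pearson)
```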
Spearman’s rho assumes that there are no ties, i.e., no two values of one attribute
are equal. If two or more values coincide, their rank is not uniquely defined. When ties
exist, the rank $r(x_i)$ is usually defined as the mean value of all ranks of consecutive
coinciding values in the sorted list. So if we have the (already sorted) list of values 0.6,
1.2, 1.4, 1.4, 1.6 for an attribute, the corresponding ranks would be 1, 2, 3.5, 3.5, 5.
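In R, the function rank uses exactly this averaging of tied ranks by default (ties.method = "average"), so the small example can be verified as follows; the variable names are illustrative.

```r
values <- c(0.6, 1.2, 1.4, 1.4, 1.6)
ranks <- rank(values)   # tied values share the mean of the ranks 3 and 4
ranks                   # 1.0 2.0 3.5 3.5 5.0
```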
Kendall’s tau rank correlation coefficient or simply Kendall’s tau is not, like
Spearman’s rho, based on ranks, but rather on the comparison of the orders of pairs
of values. Assuming that xi < xj , the two pairs (xi , xj ) and (yi , yj ) are called con-
cordant if yi < yj , i.e., when the two pairs are in the same order. They are called
discordant when they are in reverse order, which means that yi > yj .
Table 4.4 Spearman’s rank correlation coefficients for the numerical attributes of the Iris data set
Sepal length Sepal width Petal length Petal width
Table 4.5 Kendall’s tau for the numerical attributes of the Iris data set
Sepal length Sepal width Petal length Petal width
Kendall’s tau is then defined as
$$\tau = \frac{C - D}{\frac{1}{2}\,n(n - 1)},$$
where $C$ and $D$ denote the numbers of concordant and discordant pairs, respectively.
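Counting concordant and discordant pairs is easy to sketch in R. The helper kendall_tau below is hypothetical and assumes the common definition of Kendall's tau as (C − D) divided by the number of pairs n(n − 1)/2; the data are made up and contain no ties, in which case R's cor with method="kendall" yields the same value.

```r
# Kendall's tau from pair counts (assumed definition: (C - D) / (n(n-1)/2))
kendall_tau <- function(x, y) {
  n <- length(x)
  pairs <- combn(n, 2)                        # all index pairs i < j
  s <- sign(x[pairs[1, ]] - x[pairs[2, ]]) *
       sign(y[pairs[1, ]] - y[pairs[2, ]])
  C <- sum(s > 0)                             # concordant pairs
  D <- sum(s < 0)                             # discordant pairs
  (C - D) / (n * (n - 1) / 2)
}

x <- c(1.2, 3.4, 2.2, 5.0, 4.1)
y <- c(10, 30, 25, 40, 50)

tau <- kendall_tau(x, y)                      # 9 concordant, 1 discordant pair
tau_builtin <- cor(x, y, method = "kendall")
c(tau = tau, tau_builtin = tau_builtin)
```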
Since it has an intuitive, though imprecise meaning, we have used the term outlier
already before without giving a precise definition. Actually there is no formally
precise definition of outliers. An outlier is simply a value or data object that is far
away or very different from all or most of the other data.
4.5 Outlier Detection 63
For a categorical attribute, one can consider the finite set of values. An outlier is a
value that occurs with a frequency far lower than the frequency of all other
values. However, in some cases, this might actually be the target of our analysis.
If we want to set up an automatic quality control system and want to train a classifier
that classifies the parts as correct or with failures based on measurements of
the produced parts, we will probably have so many correct parts in comparison to
the ones with failures that we would consider the latter as outliers. However, removing
these “outliers” from the data set would actually make it impossible to achieve our
original goal to derive a classifier from the data set that can identify the parts with
failures.
For numerical attributes, outlier detection is more difficult. We have already classified
certain data points in a boxplot as outliers. However, the definition of outliers
in a boxplot does not take the number of data objects into account, so that for larger
data sets, boxplots will usually contain points marked as outliers. We have seen this
already in the boxplots in Fig. 4.6, both showing samples from a standard normal
distribution. The left boxplot in Fig. 4.6 for a sample of size n = 1000 contains
eight outliers, which is close to what we expect theoretically, namely seven points
outside the whiskers. As mentioned before, for a normal distribution, we can expect
roughly 0.7% of the points to be marked as outliers in a boxplot.
For asymmetric distributions, boxplots tend to contain more outliers. For example,
in boxplots for samples from an exponential distribution with λ = 1 as they are
shown in Fig. 4.8, we would expect roughly 4.8% of the points to be marked as outliers.
Heavy-tailed distributions tend to show more outliers in a boxplot, whereas for a sample
from a uniform distribution, we would expect no outliers at all, no matter how large
the sample size is.
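These expected proportions follow from the boxplot rule itself. The sketch below assumes the usual convention that points more than 1.5 times the interquartile range beyond the quartiles are marked as outliers; the helper outlier_fraction is hypothetical and takes the quantile function and the distribution function of the distribution under consideration.

```r
# Theoretical proportion of points beyond the whiskers of a boxplot,
# assuming whiskers of at most 1.5 times the interquartile range
outlier_fraction <- function(quantile_fun, cdf) {
  q1 <- quantile_fun(0.25)
  q3 <- quantile_fun(0.75)
  iqr <- q3 - q1
  cdf(q1 - 1.5 * iqr) + (1 - cdf(q3 + 1.5 * iqr))
}

frac_normal      <- outlier_fraction(qnorm, pnorm)   # standard normal
frac_exponential <- outlier_fraction(qexp, pexp)     # exponential, lambda = 1
frac_uniform     <- outlier_fraction(qunif, punif)   # uniform on [0, 1]

round(100 * c(normal = frac_normal,
              exponential = frac_exponential,
              uniform = frac_uniform), 1)            # in percent
```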
Even if we try to adjust the definition of outliers according to the sample size,
the above examples show that what we consider as an outlier depends strongly on
the underlying distribution from which the data are sampled. Therefore, statistical
tests for outliers are usually based on assumptions about the underlying distribution,
although we might not know from which distribution the data are sampled.
The standard assumption for outlier tests for continuous attributes is that the
underlying distribution is a normal distribution. Grubbs’ test is a test for outliers for
normal distributions taking the sample size into account. It is based on the statistic
$$G = \frac{\max\{|x_i - \bar{x}| \mid 1 \le i \le n\}}{s}, \tag{4.18}$$
where $x_1, \ldots, x_n$ is the sample, $\bar{x}$ is its mean value, and $s$ is its empirical standard
deviation. For a given significance level $\alpha$, the null hypothesis that the sample does
not contain outliers is rejected if
$$G > \frac{n-1}{\sqrt{n}} \sqrt{\frac{t^2_{1-\alpha/(2n),\,n-2}}{n - 2 + t^2_{1-\alpha/(2n),\,n-2}}}, \tag{4.19}$$
where $t_{1-\alpha/(2n),\,n-2}$ denotes the $(1-\alpha/(2n))$-quantile of the t-distribution with
$n - 2$ degrees of freedom.
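A minimal sketch of the test in base R, assuming the two-sided form with a square root over the fraction of t-quantile terms as in the standard formulation of Grubbs' test. The function names and the sample values are made up for illustration.

```r
# Test statistic: largest absolute deviation from the mean in units of s
grubbs_statistic <- function(x) max(abs(x - mean(x))) / sd(x)

# Critical value for sample size n and significance level alpha
grubbs_critical <- function(n, alpha = 0.05) {
  t2 <- qt(1 - alpha / (2 * n), df = n - 2)^2
  (n - 1) / sqrt(n) * sqrt(t2 / (n - 2 + t2))
}

# Hypothetical measurements with one suspicious value
x <- c(9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 14.0)

G <- grubbs_statistic(x)
critical <- grubbs_critical(length(x))
G > critical    # TRUE: the null hypothesis of no outliers is rejected
```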
4 For Grubbs’ test, the null hypothesis is that there are no outliers. Then the point in the sample
with the largest distance to the mean is considered. In the case of Grubbs’ test, the p-value is the
probability that in a sample of size n, such a large or even larger deviation from the mean would
occur. For a more formal and general definition of p-values, see also Appendix A.
4.6 Missing Values 65
attributes are considered. Instead of using projections to two attributes, one can also
use dimension-reduction methods like PCA or multidimensional scaling in order to
identify outliers in the corresponding plots as in Figs. 4.16 and 4.18.
There are many approaches for finding outliers in multidimensional data based
on clustering the data and defining those data objects as outliers that cannot be
assigned reasonably to any cluster [18, 19]. There are also distance-based [13, 14],
density-based [4], and projection-based methods [1] for outlier detection.
the wind speed is absolutely zero over a longer time. When we see a diagram like
in Fig. 4.24, it is very probable that during the period between 5 and 10, the wind
speed was not zero, but that there are missing values in this period due to a jammed
anemometer.
It is very important to identify such hidden missing values. Otherwise, the further
analysis of the data can be completely misled by such erroneous values.
When there are missing values, one should also take into account how missing
values enter the data set. The simplest and most common assumption about missing
values is that they are missing completely at random (MCAR) or are observed at
random (OAR). This means that no special circumstances or special values in the data
lead to higher or lower chances for missing values. One can imagine that one has
printed out the data table on a large sheet of paper with no missing values and then
someone has accidentally dropped some random spots of ink on the paper so that
some values cannot be read anymore.
In a more formal way, the situation missing completely at random can be described
in the following way. We consider the random variable $X$ for which we
might have some missing entries in the data set. The random variable $X$ itself does
not have missing values; the corresponding value is just not available in our data
set. The random variable $X_{\text{miss}}$ assumes the value 0 for a missing value and 1
for a nonmissing value of the random variable $X$ of interest. This means that for
$X_{\text{miss}} = 1$, we see the true value of $X$ in our data set, and for $X_{\text{miss}} = 0$, instead of
the true value of $X$, we see a missing value. We also consider the random vector $Y$
representing all attributes except $X$ in our data set. The situation observed at random
means that $X_{\text{miss}}$ is independent of $X$ and $Y$, i.e.,
$$P(X_{\text{miss}}) = P(X_{\text{miss}} \mid X, Y) \quad \text{(MCAR)}.$$
Consider a sensor for the outside air temperature X whose battery might run out
of energy once in a while, leading to missing values of the type missing completely at
random. This is the best case for missing values. It can, for instance, be concluded
that the unknown missing values follow the same distribution as the known values
of X.
The situation missing at random (MAR) is more complicated but might still
be manageable with appropriate techniques. The probability for a missing value
2. Build a classifier with the now binary attribute X as the target attribute and use all
other attributes for the prediction of the class values yes and no.
3. Determine the misclassification rate. The misclassification rate is the proportion
of data objects that are not assigned to the correct class by the classifier.
In the case of missing values of the type observed at random, the other attributes
should not provide any information about whether X has a missing value or not. Therefore,
the misclassification rate of the classifier should not differ significantly from
pure guessing, i.e., if there are 10% missing values for the attribute X, the
misclassification rate of the classifier should not be much smaller than 10%. If, however,
the misclassification rate of the classifier is significantly lower than for pure guessing,
this is an indicator that there is a correlation between missing values for X and the
values of the other attributes. Therefore, the missing values for X might not be of
the type observed at random but of the type missing at random or, even worse,
nonignorable. Note that it is in general not possible to distinguish the case nonignorable
from the other two cases based on the data only.
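The check can be sketched on simulated data. Everything in the following example is an assumption made for illustration: the data are artificial, missingness of x is generated so that it depends on the second attribute y, and logistic regression (glm) merely stands in for whatever classifier one prefers.

```r
set.seed(42)
n <- 1000
y <- rnorm(n)                              # fully observed attribute
x <- rnorm(n)                              # attribute that will get missing values
x[runif(n) < plogis(4 * (y - 1))] <- NA    # missingness depends on y (not MCAR)

# Turn x into a binary attribute: 1 = missing, 0 = observed
d <- data.frame(y = y, x_missing = as.integer(is.na(x)))

# Classifier predicting missingness from the other attribute
fit <- glm(x_missing ~ y, data = d, family = binomial)
predicted <- as.integer(predict(fit, type = "response") > 0.5)

misclassification <- mean(predicted != d$x_missing)
guessing <- min(mean(d$x_missing), 1 - mean(d$x_missing))  # always guess majority
c(misclassification = misclassification, guessing = guessing)
```

Here the misclassification rate is clearly lower than for pure guessing, which hints at missing values that are not of the type observed at random. A careful analysis would estimate the misclassification rate on data not used for training.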
• Apart from these data quality issues, data understanding should also help to dis-
cover new or confirm expected dependencies or correlations between attributes.
Techniques like the ones mentioned in Sect. 4.4 are one way to solve this task.
Apart from this, scatter plots can show correlations between pairs of attributes.
• Specific application-dependent assumptions (for instance, the assumption that a
specific attribute follows a normal distribution) should also be checked during
data understanding.
• Representativeness of the data cannot always be checked just based on the data,
but we have to compare the statistics with our expectations. If we suspect that
there is a change in a numerical attribute over time, we can compare histograms
or boxplots for different time periods. We can do the same with bar charts for
categorical attributes.
• Check the distribution of each attribute for unusual or unexpected
properties like outliers. Are the domains or ranges correct? Do the medians of
numerical attributes look correct? This should be done based on
– histograms and boxplots for continuous attributes and
– bar charts for categorical attributes.
• Check correlations or dependencies between pairs of attributes with scatter plots,
which should be density-based for larger data sets. For small numbers of attributes,
inspect scatter plots for all pairs of attributes. For higher numbers of
attributes, do not generate scatter plots for all pairs, but only for those where
independence or a specific dependency is expected. In addition, generate scatter
plots for some randomly chosen pairs.
To cite Tukey [24] again, if problems occur in later phases of the data analysis process
that could have been discovered or even avoided by plotting and looking at the data
early on, there is absolutely no excuse for this failure.
Of course, one should also exploit the other methods described in this section for
these and other more specific purposes.
A key issue in data understanding (and actually the entire data analysis process)
often causes considerable headaches: the loading of the data into the analysis tool
of choice. One of the strengths of KNIME is its versatility in terms of powerful
file importing nodes and database connectors. R, on the other hand, offers the entire
breadth of analysis and visualization methods, although they are often not all that
intuitive to use.
Fig. 4.25 The dialog of the file reader node offers manifold options to control the reading of
diverse file formats
Data Loading Data understanding starts with loading the data: KNIME offers a
“File Reader” node which hides an entire file import wizard in its dialog. Manifold
options allow one to choose the underlying character encoding, column types, and
separators and escape characters, to name just a few. The ability to quickly check in
the preview tab whether the selected options fit the given file makes it possible to
smoothly find suitable settings even for complex files. Figure 4.25 shows the dialog of this
node together with the “expert tab” which hides another set of options.
KNIME also allows one to read data from specialized file formats, such as the
Weka ARFF format or the compressed KNIME table format. For these formats,
specialized nodes are available in the “IO” category as well.
Reading data from databases is often neglected in stand-alone analytic tools.
KNIME offers flexible connectivity to access databases of various types by specify-
ing the corresponding JDBC driver (Java Database Connectivity) in addition to the
table location and name. For basic data filtering, specialized nodes are also available,
which allow one to do, e.g., column and row filtering also within the database, with-
out loading the data into KNIME explicitly. This saves time for larger databases if
only a much smaller subset is to be analyzed. Figure 4.26 shows a part of a workflow
reading data from a database and filtering some rows and columns before reading
the data into KNIME itself.
Data Types KNIME supports all basic data types, such as strings, integers, and
numbers, but also date and time types and nominal values. In addition, various
extensions add the ability to process images, molecular structures, sequences, and
textual data. The repository of type extensions is constantly growing as well. For a first
glance at the data, read it into KNIME and, in order to check domains and nominal
value ranges, add the “Statistics” and “Value Counter” nodes. The first one computes
4.8 Data Understanding in Practice 71
Fig. 4.26 Specialized nodes for database access allow one to filter rows and columns using
database routines before loading the data into KNIME
Fig. 4.27 Looking at basic information of the data’s attributes often helps one to spot outliers or
other errors in the data
Fig. 4.29 Interactive views in KNIME allow one to propagate selections to other views within the
same workflow. The points falling into the selected bar in the histogram are automatically selected
by KNIME’s Hiliting mechanism
for this selection process (called hiliting) to propagate along the data pipeline as
long as a meaningful translation between records (or rows in the table) is possi-
ble. For instance, selecting a rule will hilite all points contained within that rule in
all other views. Visual brushing is a powerful tool to allow interactive exploration
of data since it allows one to quickly select interesting elements in one view and
see details about the selected elements in other views, for instance, the underlying
customer data, the images covered by a rule, or the molecular structures that seem
to look like outliers in a scatter plot. Figure 4.29 illustrates the hiliting mechanism
in KNIME. We see an interactive histogram of one attribute and a scatter plot de-
picting two other attributes. One of the bars in the histogram was selected, and the
corresponding points in the scatter plot are now marked as well.
4.8.2.1 Histograms
Histograms are generated by the function hist. The simplest way to create a histogram
is to just use the corresponding attribute as an argument of the function
hist, and R will automatically determine the number of bins for the histogram
based on Sturges’ rule. In order to generate the histogram for the petal length of the
Iris data set, the following command is sufficient:
> hist(iris$Petal.Length)
The partition into bins can also be specified directly. One of the parameters of hist
is breaks. If the bins should cover the intervals $[a_0, a_1), [a_1, a_2), \ldots, [a_{k-1}, a_k]$,
then one can simply create a vector in R containing the values $a_i$ and assign
it to breaks. By default, hist treats the bins as right-closed intervals; the right-open
intervals above are obtained with the additional argument right=FALSE. Note that $a_0$
must not be larger than the minimum and $a_k$ must not be smaller than the maximum
value of the corresponding attribute. If we want the boundaries for the bins at
1.0, 3.0, 4.0, 4.5, 6.9, then we would use
> hist(iris$Petal.Length,breaks=c(1.0,3.0,4.0,4.5,6.9))
to generate the histogram. Note that in the case of bins with different length, the
heights of the boxes in the histogram do not show the relative frequencies. The
areas of the boxes are chosen in such a way that they are proportional to the relative
frequencies.
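Both statements about hist can be checked numerically without drawing anything. The sketch assumes Sturges' rule in its usual form k = ⌈log2(n) + 1⌉, which is also what the base R helper nclass.Sturges computes; the variable names are illustrative.

```r
x <- iris$Petal.Length

# Sturges' rule for the suggested number of bins
k <- ceiling(log2(length(x)) + 1)
k_builtin <- nclass.Sturges(x)

# Bins of different widths; plot = FALSE only returns the histogram object
h <- hist(x, breaks = c(1.0, 3.0, 4.0, 4.5, 6.9), plot = FALSE)
areas <- h$density * diff(h$breaks)               # height times width of each box
relative_frequencies <- h$counts / sum(h$counts)

rbind(areas, relative_frequencies)                # the two rows agree
```

Note that hist treats the number of bins only as a suggestion, so the number of bins actually drawn can differ slightly from k.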
4.8.2.2 Boxplots
A boxplot for a single attribute is generated by the function boxplot, for instance,
> boxplot(iris$Petal.Length)
yielding the boxplot for the petal length of the Iris data set. Instead of a single
attribute, we can hand over more than one attribute
> boxplot(iris$Petal.Length,iris$Petal.Width)
to show the boxplots in the same plot. We can even use the whole data set as an
argument to see the boxplots of all attributes in one plot:
> boxplot(iris)
In this case, categorical attributes will be turned into numerical attributes by coding
the values of the categorical attribute as 1, 2, . . . , so that these boxplots are also
shown but do not really make sense.
In order to include the notches in the boxplots, we need to set the parameter
notch to TRUE:
> boxplot(iris,notch=TRUE)
If one is interested in the precise values of the boxplot like the median, etc., one can
use the print-command:
> print(boxplot(iris$Sepal.Width))
$stats
[,1]
[1,] 2.2
[2,] 2.8
[3,] 3.0
[4,] 3.3
[5,] 4.0
$n
[1] 150
$conf
[,1]
[1,] 2.935497
[2,] 3.064503
$out
[1] 4.4 4.1 4.2 2.0
$group
[1] 1 1 1 1
$names
[1] "1"
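The first five values are the end of the lower whisker, the first quartile, the median,
the third quartile, and the end of the upper whisker, respectively. The whisker ends
coincide with the minimum and maximum value of the attribute only if there are no
outliers. n is the number of data objects. Then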
come the boundaries for the confidence interval for the notch, followed by the list of
outliers. The last values group and names only make sense when more than one
boxplot is included in the same plot. Then group is needed to identify to which
attribute the outliers in the list of outliers belong. names just lists the names of the
attributes.
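If only the numbers are of interest and no plot is needed, the same values can be obtained from the base R function boxplot.stats. A short sketch for the sepal width:

```r
s <- boxplot.stats(iris$Sepal.Width)
s$stats                    # whisker ends, quartiles, and median
s$out                      # points marked as outliers
range(iris$Sepal.Width)    # the minimum is smaller than the lower whisker end
```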
A scatter plot of the petal width against petal length of the Iris data is obtained by
> plot(iris$Petal.Width,iris$Petal.Length)
All scatter plots of each attribute against each other in one diagram are created with
> plot(iris)
If symbols representing the values for some categorical attribute should be included
in a scatter plot, this can be achieved by
> plot(iris$Petal.Width,iris$Petal.Length,
pch=as.numeric(iris$Species))
where in this example the three types of Iris are plotted with different symbols.
If there are some interesting or suspicious points in a scatter plot and one wants
to find out which data records these are, one can do this by
> plot(iris$Petal.Width,iris$Petal.Length)
> identify(iris$Petal.Width,iris$Petal.Length)
and then clicking on the points. The index of the corresponding records will be
added to the scatter plot. To finish selecting points, press the ESCAPE-key.
Jitter can be added to a scatter plot in the following way:
> plot(jitter(iris$Petal.Width),
jitter(iris$Petal.Length))
Intensity plots and density plots with hexagonal binning, as shown in Fig. 4.9,
can be generated by
> plot(iris$Petal.Width,iris$Petal.Length,
col=rgb(0,0,0,50,maxColorValue=255),
pch=16)
and
> library(hexbin)
> bin<-hexbin(iris$Petal.Width,
iris$Petal.Length,
xbins=50)
> plot(bin)
respectively, where the library hexbin does not come along with the standard version
of R and needs to be installed as described in the appendix on R. Note that such
plots are not very useful for data sets as small as the Iris data set.
For three-dimensional scatter plots, the library scatterplot3d is needed
and has to be installed first:
> library(scatterplot3d)
> scatterplot3d(iris$Sepal.Length,
iris$Sepal.Width,
iris$Petal.Length)
Rotation:
                    PC1         PC2        PC3        PC4
Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971
> summary(iris.pca)
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.71 0.956 0.3831 0.14393
Proportion of Variance 0.73 0.229 0.0367 0.00518
Cumulative Proportion 0.73 0.958 0.9948 1.00000
> plot(predict(iris.pca))
For the Iris data set, it is necessary to exclude the categorical attribute Species
from PCA. This is achieved by the first line of the code and calling prcomp with
iris[,-species] instead of iris.
The parameter settings center=T, scale=T, where T is just a short form of
TRUE, mean that z-score standardization is carried out for each attribute before
applying PCA.
The function predict can be applied in the above-described way to obtain
the transformed data from which the PCA was computed. If the computed PCA
transformation should be applied to another data set x, this can be achieved by
> predict(iris.pca,newdata=x)
where x must have the same number of columns as the data set from which the PCA
has been computed. In this case, x must have four columns which must be numer-
ical. predict will compute the full transformation, so that the above command
will also yield transformed data with four columns.
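A compact sketch of such a PCA session, consistent with the description above. The object new_flower and its values are hypothetical and serve only to illustrate the projection of a new data object; note that the full name of the scaling parameter of prcomp is scale. (with a trailing dot), which scale=T abbreviates.

```r
# Exclude the categorical attribute Species and standardize the others
species <- which(colnames(iris) == "Species")
iris.pca <- prcomp(iris[, -species], center = TRUE, scale. = TRUE)

# Transformed data (all four principal components) for the original objects
scores <- predict(iris.pca)

# Projection of a new, hypothetical data object with the same four attributes
new_flower <- data.frame(Sepal.Length = 5.0, Sepal.Width = 3.4,
                         Petal.Length = 1.5, Petal.Width = 0.2)
new_scores <- predict(iris.pca, newdata = new_flower)
new_scores[, 1:2]          # coordinates on the first two principal components
```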
MDS requires the library MASS, which ships with most R installations as a recommended
package and otherwise needs to be installed. First, a distance matrix is needed for MDS. Identical objects
leading to zero distances are not admitted. Therefore, if there are identical objects
in a data set, all copies of the same object except one must be removed. In the Iris
data set, there is only one pair of identical objects, so that one of them needs to
be removed. The Species is not a numerical attribute and will be ignored for the
distance.
> library(MASS)
> x <- iris[-102,]
> species <- which(colnames(x)=="Species")
> x.dist <- dist(x[,-species])
> x.sammon <- sammon(x.dist,k=2)
> plot(x.sammon$points)
k = 2 means that MDS should reduce the original data set to two dimensions.
Note that in the above example code no normalization or z-score standardization
is carried out.
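If z-score standardization is desired, as for the Sammon mapping in Fig. 4.18, the attributes can be standardized with the base R function scale before the distances are computed. The following sketch is a variation of the code above; the name x.std is chosen here for illustration.

```r
library(MASS)

x <- iris[-102, ]                          # drop one of the two identical objects
species <- which(colnames(x) == "Species")

x.std <- scale(x[, -species])              # z-score standardization per attribute
x.dist <- dist(x.std)                      # Euclidean distances on standardized data
x.sammon <- sammon(x.dist, k = 2)          # Sammon mapping to two dimensions

head(x.sammon$points)
```

As before, plot(x.sammon$points) displays the resulting two-dimensional representation.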
Parallel coordinates need the library MASS. All attributes must be numerical. If the
attribute Species should be included in the parallel coordinates, one can achieve this
in the following way:
> library(MASS)
> x <- iris
> x$Species <- as.numeric(iris$Species)
> parcoord(x)
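Star and radar plots are obtained by the following two commands:
> stars(iris)
> stars(iris,locations=c(0,0))
Pearson’s correlation coefficient and tests based on Spearman’s rho and Kendall’s tau
are computed by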
> cor(iris$Sepal.Length,iris$Sepal.Width)
> cor.test(iris$Sepal.Length,iris$Sepal.Width,
method="spearman")
> cor.test(iris$Sepal.Length,iris$Sepal.Width,
method="kendall")
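The function cor also accepts a whole table of numerical attributes and then returns the matrix of all pairwise coefficients, in the style of Tables 4.3 to 4.5. A small sketch; the variable names are illustrative.

```r
# Keep only the numerical attributes of the Iris data set
num <- iris[, sapply(iris, is.numeric)]

pearson  <- cor(num)                        # Pearson's correlation coefficients
spearman <- cor(num, method = "spearman")   # Spearman's rho
kendall  <- cor(num, method = "kendall")    # Kendall's tau

round(pearson, 3)
```

Each matrix is symmetric and has ones on its diagonal.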
Grubbs’ test for outlier detection needs the installation of the library outliers:
> library(outliers)
> grubbs.test(iris$Petal.Width)
References
1. Aggarwal, C., Yu, P.: Outlier detection for high dimensional data. In: Proc. ACM SIGMOD
Int. Conf. on Management of Data (SIGMOD 2001, Santa Barbara, CA), pp. 37–46. ACM
Press, New York (2001)
2. Anderson, E.: The irises of the Gaspe Peninsula. Bull. Am. Iris Soc. 59, 2–5 (1935)
3. Borg, I., Groenen, P.: Modern Multidimensional Scaling: Theory and Applications. Springer,
Berlin (1997)
4. Breunig, M., Kriegel, H.-P., Ng, R., Sander, J.: LOF: identifying density-based local outliers.
In: Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD 2000, Dallas, TX),
pp. 93–104. ACM Press, New York (2000)
5. Cook, D., Buja, A., Cabrera, J.: Projection pursuit indices based on orthonormal function
expansion. J. Comput. Graph. Stat. 2, 225–250 (1993)
6. Diaconis, P., Freedman, D.: Asymptotics of graphical projection pursuit. Ann. Stat. 17, 793–
815 (1989)
7. Fisher, R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugen. 7(2),
179–188 (1936)
8. Freedman, D., Diaconis, P.: On the histogram as a density estimator: L2 theory. Z. Wahrschein-
lichkeitstheor. Verw. Geb. 57, 453–476 (1981)
9. Friedman, J.: Exploratory projection pursuit. J. Am. Stat. Assoc. 82, 249–266 (1987)
10. Friedman, J., Tukey, J.: A projection pursuit algorithm for exploratory data analysis. IEEE
Trans. Comput. C-23, 881–890 (1974)
11. Hall, P.: On polynomial-based projection indices for exploratory projection pursuit. Ann. Stat.
17, 589–605 (1989)
12. Huber, P.: Projection pursuit. Ann. Stat. 13, 435–475 (1985)
13. Knorr, E., Ng, R.: Algorithms for mining distance-based outliers in large datasets. In: Proc.
24th Int. Conf. on Very Large Data Bases (VLDB 1998, New York, NY), pp. 392–403. Morgan
Kaufmann, San Mateo (1998)
14. Knorr, E., Ng, R., Tucakov, V.: Distance-based outliers: algorithms and applications. Very
Large Data Bases 8, 237–253 (2000)
15. Morrison, A., Ross, G., Chalmers, M.: Fast multidimensional scaling through sampling. Inf.
Vis. 2, 68–77 (2003)
16. Pekalska, E., Ridder, D., Duin, R., Kraaijveld, M.: A new method of generalizing Sammon
mapping with application to algorithm speed-up. In: Proc. 5th Annual Conf. Advanced School
for Computing and Imaging, pp. 221–228, Delft, Netherlands (1999)
17. Rehm, F., Klawonn, F., Kruse, R.: MDSpolar: a new approach for dimension reduction to
visualize high dimensional data. In: Advances in Intelligent Data Analysis, vol. VI, pp. 316–327.
Springer, Berlin (2005)
18. Rehm, F., Klawonn, F., Kruse, R.: A novel approach to noise clustering for outlier detection.
Soft. Comput. 11, 489–494 (2007)
19. Santos-Pereira, C., Pires, A.: Detection of outliers in multivariate data: a method based on
clustering and robust estimators. In: Proc. 5th Annual Conference of the Advanced School for
Computing and Imaging, pp. 291–296. Physica, Berlin (2002)
20. Scott, D.: On optimal and data-based histograms. Biometrika 66, 605–610 (1979)
21. Scott, D.: Sturges’ rule. In: Wiley Interdisciplinary Reviews: Computational Statistics, vol. 1,
pp. 303–306. Wiley, Chichester (2009)
22. Sturges, H.: The choice of a class interval. J. Am. Stat. Assoc. 21, 65–66 (1926)
23. Tschumitschew, K., Klawonn, F.: Veda: statistical tests for finding interesting visualisations.
In: Knowledge-Based and Intelligent Information and Engineering Systems 2009, Part II, pp.
236–243. Springer, Berlin (2009)
24. Tukey, J.W.: Exploratory Data Analysis. Addison-Wesley, Reading (1977)
Chapter 5
Principles of Modeling
After we have gone through the phases of project and data understanding, we are ei-
ther confident that modeling will be successful or return to the project understanding
phase to revise objectives (or to stop the project). In the former case, we have to pre-
pare the dataset for subsequent modeling. However, as some of the data preparation
steps are motivated by modeling itself, we first discuss the principles of modeling.
Many modeling methods will be introduced in the following chapters, but this chap-
ter is devoted to problems and aspects that are inherent in and common to all the
methods for analyzing the data.
All we need for modeling, it might seem, is a collection of methods from which
we have to choose the most suitable one for our purpose. By now, the project under-
standing has already ruled out a number of methods. For example, when we have to
solve a regression problem, we do not consider clustering methods. But even within
the class of regression problems, there are various methods designed for this task.
Which one would be the best for our problem, and how do we find out about this?
In order to solve this task, we need a better understanding of the underlying princi-
ples of our specific data analysis methods. Most of the data analysis methods can be
viewed within the following four-step procedure:
• Select the Model Class.
First of all, we must specify the general structure of the analysis result. We call
this the “architecture” or “model class.” In a regression problem, one could de-
cide to consider only linear functions; or instead, quadratic functions could be an
alternative; or we could even admit polynomials of arbitrary degree. This, how-
ever, defines only the structure of the “model.” Even for a simple linear function,
we still would have to determine the coefficients (Sect. 5.1).
• Select the Score Function.
We need a score function that evaluates the possible “models,” and we aim to
find the best model with respect to our goals—which is formalized by the score
function. In the case of the simple linear regression function, our score function
will tell us which specific choice of the coefficients is better when we compare
different linear functions (Sect. 5.2).
M.R. Berthold et al., Guide to Intelligent Data Analysis, 81
Texts in Computer Science 42,
DOI 10.1007/978-1-84882-260-3_5, © Springer-Verlag London Limited 2010
how to find the regression line with the best fit. Depending on the error measure,
an algorithm is needed that finds the best fit or at least a good fit.
(b) Another example of an even simpler model is a constant value as a represen-
tative or prototype of a numerical attribute (or an attribute vector) either for
the full dataset or for various subsets of it. The mean value or median are typ-
ical choices for such prototypes. Although the mean value and median can be
viewed as purely descriptive concepts, they can also be derived as optimal fits
in terms of suitable error measures, as we will see in Sect. 5.3.
(c) Multidimensional scaling as it was introduced in Sect. 4.3.2.3 can also be inter-
preted as a model. Here, the model is a set of points in the plane or 3D-space,
representing the given data set. Again, a criterion or error measure is needed to
evaluate whether a given set of points is a good or bad representation of the data
set under consideration.
(d) All examples so far were numerical in nature, but there are, of course, also
models for nominal data. A simple model class is that of propositional rules.
Rules like “If temperature = cold and precipitation = rain, then action = read
book” or “If a customer buys product A, then he also buys product B with a
probability of 30%” can also be viewed as models. The latter rules are called
association rules, because they associate features of arbitrary variables.
The four problems or models described in (a) the regression line, (b) the constant
value, (c) multidimensional scaling, and (d) rules will serve as simple examples
throughout this chapter. In most cases the models are parameterized, as in the
first three examples above. The number of parameters depends on the model. The
constant value and the regression line have only one and two parameters,
respectively, whereas 2n parameters are needed for multidimensional scaling for
the representation of a data set with n records in the plane. The freedom of
choice in the rule models lies in the variables and values that build the
conditions in the rule’s antecedent.
When we are looking for a regression function f : R → R, we might decide that
we want to restrict f to lines of the form f(x) = a_1 x + a_0, so that we have a
fixed set of two parameters. But we might also not know whether a line, a
quadratic function, or even a polynomial of higher degree
f(x) = a_k x^k + ... + a_1 x + a_0 might be the best choice. When we do not fix
the degree of the polynomial for the regression function, we have not a fixed
but a variable set of parameters. A propositional rule (example (d)) or a set of
rules also belongs to this case, as the number of conditions and rules is
typically not fixed in advance. The same holds for decision trees (see Chap. 8),
which likewise cannot be represented in terms of a simple fixed set of
parameters.
Another distinction can be made between, say, linear regression models and
propositional rules by their applicability: a linear regression model can (in
principle) be applied to all possible values of x and yields a resulting y.
Because such models can be applied to all data from the data space, they are
often called global models. This is different with rules: the consequent of the
rule applies only in those cases where all conditions of the antecedent hold;
otherwise, no information is returned. As such models can be applied only to a
limited fraction of the whole data space, they are called local models. We will
use the terms local model and pattern synonymously.
In any case, we need to define the class of possible models that we consider as
possible candidates for solving our data analysis task. Project understanding aims at
identifying the goals of the data analysis process and will therefore already restrict
the choice of possible model classes. But even a well-defined goal will in most cases
not imply a unique class of possible models. For instance, in the case of a regression
function to be learned from the data, it is often not obvious what type of regression
function should be chosen. Unless there is an underlying model—for instance, for a
physical process—that provides the functional dependencies between the attributes,
a polynomial of arbitrary degree or even other functions might be good candidates
for a model.
Finding the “best” model for a given data set is not a trivial task at all,
especially since the question of what a good or the best model means is not
easy to answer. How well the model fits the data is one criterion for a good
model. It can often be expressed in a more or less obvious manner, for instance,
in the case of regression, as the mean squared or the mean absolute error
between the values predicted by the model and the values provided in the data
set. Such fitting criteria will be discussed in more detail in the next
section.
Another important aspect is the simplicity of the model. Simpler models will be
preferred for various reasons:
• They are usually easier to understand and to interpret.
• Their computational complexity is normally lower.
• A model can also be seen as a summary of the data. A complex model might not
be a summary but, in the worst case, just more or less a one-to-one representation
of the data. General structures in the data can only be discovered by a model
when it summarizes the data to a certain extent. The problem that overly complex
models often fail to reveal the general relations and structures in the data is called
overfitting and will be discussed in more detail in Sect. 5.4.
Interpretability is often another requirement for the model derived from the data.
In some cases, when only a classifier or regression function with a low error is
the target, black-box models like neural networks might be accepted. But when the
model should deliver an understandable description of the data or when the way
a classifier selects the predicted class needs an explanation, black-box models are
not a proper choice. It is impossible to define a general measure for interpretability,
because interpretability depends very much on the context. For instance, a quadratic
regression function might be perfectly interpretable in a mechanical process where
acceleration, velocity, and distance are involved, whereas in other cases a quadratic
function would be nothing else than a black-box model, although a very simple one.
Computational aspects also play a role in the choice of the model class.
Finding the model from a model class that satisfies the desired criteria (data
fitting, low model complexity, and interpretability) best, or at least
reasonably well, is also a question of computational effort. In some cases,
there might be simple ways to derive a model based on the given data. For
example, for the regression line example (a), it might be the goal to minimize
the sum of squared errors. In this case, an explicit analytical solution for the
best choice of the parameters a and b can be
provided as we will see in the next section. Although the search space in the associ-
ation rule example (d) is finite in contrast to the search space R2 for the regression
line, it is so large that highly efficient strategies are needed to explore all potentially
interesting rules.
Although all these aspects (the fitting criterion or score function, the model
complexity, interpretability, and the computational complexity required for
finding the model) are important, the focus is very often put on the fitting
criterion, which can usually be defined in a more or less obvious way.
1 Note that minimizing the mean squared error or the sum of squared errors leads to the same
solution, since the parameters a and b do not depend on the constant factor 1/n.
for the parameters a and b. It only provides a criterion telling us whether the
parameter combination (a_1, b_1) is considered better than the parameter
combination (a_2, b_2). This is the case when E(a_1, b_1) < E(a_2, b_2). When we
plot the error function (5.4), which is a function of the parameters a and b, we
obtain the graph shown in Fig. 5.2.
The mean squared error is not the only reasonable choice to measure how well a
model fits the data. Other obvious examples for alternative measures are the mean
absolute error
\[ E(a, b) = \frac{1}{n} \sum_{i=1}^{n} |a x_i + b - y_i| \tag{5.5} \]
or the mean Euclidean distance of the data points to the regression line. Instead
of the mean or the sum of the errors, the maximum of the errors could also be
considered.
All these error measures have in common that they only yield the value zero
when the regression line fits perfectly to the data and that they increase with larger
distance of the data points to the regression line. Properties of such error measures
and their advantages and disadvantages are discussed in more detail in Sect. 8.3.3.3.
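As a small illustration, the error measures mentioned above can be written down for a given line y = ax + b. The following Python sketch is not part of the original text; the function names and the three sample points are assumptions chosen only for this example.

```python
from math import sqrt

def residuals(a, b, xs, ys):
    """Vertical deviations a*x_i + b - y_i of the line from the data."""
    return [a * x + b - y for x, y in zip(xs, ys)]

def mean_squared_error(a, b, xs, ys):
    return sum(r * r for r in residuals(a, b, xs, ys)) / len(xs)

def mean_absolute_error(a, b, xs, ys):
    return sum(abs(r) for r in residuals(a, b, xs, ys)) / len(xs)

def max_absolute_error(a, b, xs, ys):
    return max(abs(r) for r in residuals(a, b, xs, ys))

def mean_euclidean_distance(a, b, xs, ys):
    # The distance of (x, y) to the line y = a*x + b is |a*x + b - y| / sqrt(a^2 + 1).
    return mean_absolute_error(a, b, xs, ys) / sqrt(a * a + 1)

# Hypothetical sample: only the third point deviates from the line y = x.
xs, ys = [0.0, 1.0, 2.0], [0.0, 1.0, 4.0]
for measure in (mean_squared_error, mean_absolute_error,
                max_absolute_error, mean_euclidean_distance):
    print(measure.__name__, measure(1.0, 0.0, xs, ys))
```

For the line y = x and these points, only the third point deviates (by 2), so the mean squared error is 4/3, the mean absolute error is 2/3, and the maximum error is 2. All four measures are zero exactly when every point lies on the line.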
Error functions for the even simpler model (b) (on page 83) of a single value m
representing a sample can be defined in a similar way. It can be shown that the
minimization of mean squared errors
\[ E(m) = \frac{1}{n} \sum_{i=1}^{n} (x_i - m)^2 \tag{5.6} \]
leads to the mean value
\[ m = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \tag{5.7} \]
5.2 Fitting Criteria and Score Functions 87
          OK      broken
OK        0       c1
broken    c2      0

        c1      c2      ...   cm
c1      0       c1,2    ...   c1,m
c2      c2,1    0       ...   c2,m
...     ...     ...     ...   ...
cm      cm,1    cm,2    ...   0
should be minimized. E is the evidence, i.e., the observed values of the predictor
attributes used for the classification, and P (cj |E) is the predicted probability that
the true class is cj given observation E.
Sometimes, classification can be considered as a special form of regression.
A classification problem where each instance belongs to one of two classes can
be reformulated as a regression problem by assigning the values 0 and 1 to the two
classes. This means that the regression function must be learned from data where
the values yi are either 0 or 1. Since the regression function will only approximate
the data, it will usually not yield the exact values 0 and 1. In order to interpret such
a regression function as a classifier, it is necessary to map arbitrary output values to the
classes 0 and 1. The obvious way to do this is to choose 0.5 as a threshold and
consider values lower than 0.5 as class 0 and values greater than 0.5 as class 1.
When a classifier is learned as a regression function based on error measures as
they are used for regression problems, this can lead to undesired results. The aim of
regression is to minimize the approximation error which is not the same as the mis-
classification rate. In order to explain this effect, consider a classification problem
with 1000 instances, half of them belonging to class 0 and the other half to class 1.
Assume that there are two regression functions f and g as possible candidates to be
used as a classifier.
• Regression function f yields 0.1 for all data from class 0 and 0.9 for all data from
class 1.
• Regression function g always yields the exact and correct values 0 and 1, except
for 9 data objects where it yields 1 instead of 0 and vice versa.
Although f does not yield the exact values 0 and 1 for the classes, it classifies all
instances correctly when the threshold value 0.5 is used to make the classification
decision. As a regression function, the mean squared error of f is 0.01. The regres-
sion function g has a smaller mean squared error of 0.009, but classifies 9 instances
incorrectly. From the viewpoint of regression, g is better than f , and from the view-
point of the misclassification rate, f should be preferred.
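The numbers in this example can be checked with a few lines of Python. The sketch below is only an illustration under assumptions: the text does not specify which 9 instances g gets wrong, so we arbitrarily flip 5 instances of class 0 and 4 instances of class 1. The function names are ours.

```python
def mean_squared_error(outputs, targets):
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(targets)

def misclassification_rate(outputs, targets, threshold=0.5):
    # An output above the threshold is interpreted as class 1, otherwise as class 0.
    wrong = sum(1 for o, t in zip(outputs, targets)
                if (1 if o > threshold else 0) != t)
    return wrong / len(targets)

# 1000 instances, half of them in class 0 and half in class 1.
targets = [0] * 500 + [1] * 500

# f yields 0.1 for class 0 and 0.9 for class 1.
f_out = [0.1] * 500 + [0.9] * 500

# g is exact except for 9 instances (assumed split: 5 from class 0, 4 from class 1).
g_out = [float(t) for t in targets]
for i in list(range(5)) + list(range(500, 504)):
    g_out[i] = 1.0 - g_out[i]

print("f:", mean_squared_error(f_out, targets), misclassification_rate(f_out, targets))
print("g:", mean_squared_error(g_out, targets), misclassification_rate(g_out, targets))
```

Up to floating-point rounding, f has a mean squared error of 0.01 and no misclassifications, whereas g has a mean squared error of 0.009 and a misclassification rate of 0.9%.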
5.3 Algorithms for Model Fitting 89
When we search for patterns, for instance, for single classification rules or
association rules of the form if A = a then B = b, statistical measures of
interest are typically used to evaluate the rule, which is then considered as
the “model.” Assume that A = a holds for n_a records, B = b for n_b records, and
A = a and B = b at the same time for n_ab records, where we have n records
altogether. If n_b/n ≈ n_ab/n_a, the rule has no significance at all, since
when replacing the records with A = a by a random sample of size n_a, we would
expect roughly the same fraction of records with B = b in the random sample. The
rule can only be considered as relevant if n_ab/n_a ≫ n_b/n. In order to measure
the relevance of the rule, we can compute the probability that a random sample
of size n_a contains at least n_ab records with B = b. This probability can be
derived from a hypergeometric distribution:
\[ \sum_{i=n_{ab}}^{\min\{n_a, n_b\}} \frac{\binom{n_b}{i} \binom{n - n_b}{n_a - i}}{\binom{n}{n_a}}. \tag{5.10} \]
This can be interpreted as the p-value for the statistical test2 with the null hypothesis
that the rule applies just by chance. The lower this p-value, the more relevant the
rule can be considered.
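The sum in (5.10) is straightforward to evaluate with exact integer binomial coefficients. The following Python sketch is a minimal illustration; the function name rule_p_value and the record counts in the usage example are assumptions made for this example.

```python
from math import comb

def rule_p_value(n, n_a, n_b, n_ab):
    """Probability that a random sample of size n_a, drawn without replacement
    from n records of which n_b satisfy B = b, contains at least n_ab records
    with B = b (hypergeometric tail, as in (5.10))."""
    upper = min(n_a, n_b)
    # comb(n - n_b, n_a - i) is 0 whenever n_a - i exceeds n - n_b.
    favorable = sum(comb(n_b, i) * comb(n - n_b, n_a - i)
                    for i in range(n_ab, upper + 1))
    return favorable / comb(n, n_a)

# Hypothetical data set: 1000 records, 100 with A = a, 300 with B = b.
# By chance alone we would expect about 30 records with both properties.
print(rule_p_value(1000, 100, 300, 30))  # close to what chance would produce
print(rule_p_value(1000, 100, 300, 50))  # far more joint records than expected
```

The second call yields a much smaller p-value than the first, reflecting that 50 joint occurrences are far less likely to arise by chance than 30.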
A very simple measure of interestingness that is often used in the context of
association rules and frequent item set mining is the support or frequency of the
rule in the data set. There, a rule is considered “interesting” if the support exceeds a
given lower bound. As we have seen above, the support alone is not a good advisor
when looking for unexpected, interesting rules. The focus on support in the context
of association rules/frequent patterns has merely technical/algorithmic reasons.
These are two examples for statistical measures of interest. There are many others
depending on the type of pattern we are searching [5]. We will encounter some of
them in the subsequent chapters.
In the best case, a closed-form solution for the optimization problem can be ob-
tained directly. This is, however, not possible for most of the objective functions we
consider. A positive example, for which we can find a closed-form solution, is our
case (a) (see page 82), the linear regression function (see also Sect. 8.3 for more de-
tails). For a minimum of the error function of linear regression, it is necessary that
the partial derivatives with respect to the parameters a and b of the error function
(5.4) vanish. This leads to the system of two linear equations
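\[ \frac{\partial E}{\partial a} = \frac{2}{n} \sum_{i=1}^{n} (a x_i + b - y_i)\, x_i = 0, \]
\[ \frac{\partial E}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (a x_i + b - y_i) = 0. \]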
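Solving these two linear equations for a and b gives the well-known formulas a = (n Σ x_i y_i − Σ x_i Σ y_i) / (n Σ x_i² − (Σ x_i)²) and b = (Σ y_i − a Σ x_i) / n. The Python sketch below implements exactly this; the function name fit_line and the sample data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares line y = a*x + b obtained by solving the two
    linear equations dE/da = 0 and dE/db = 0."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undetermined")
    a = (n * sxy - sx * sy) / denom
    b = (sy - a * sx) / n
    return a, b

# Hypothetical data lying exactly on the line y = 2x + 1.
a, b = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(a, b)
```

For data that lie exactly on a line, as in this usage example, the fit recovers that line (here a = 2 and b = 1).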
the number of instances in the data set displayed in the plane by multidimensional
scaling.
When the objective function is differentiable, a gradient method can be applied.
The gradient, i.e., the vector of partial derivatives with respect to the model
parameters, points in the direction of steepest ascent. The idea of optimization based on a
gradient method is to start at a random point—an arbitrary choice of the parameters
to be optimized—and then to go a certain step in the direction of the gradient, when
the objective function should be maximized, and in the opposite direction of the gra-
dient, when the objective function should be minimized, leading to a new point in
the parameter space. If this point yields a better value for the objective function, the
gradient in this point is computed, and the next point in the direction or, respectively,
in the opposite direction of the new gradient is chosen. This procedure is continued
until no more improvements can be achieved or a fixed number of gradient steps has
been carried out.
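The procedure just described can be sketched for the mean squared error of a regression line. The following Python code is a minimal illustration with a constant stepwidth, not a recommended implementation; the function names, the stepwidth of 0.05, the fixed starting point (0, 0) (instead of a random one, for reproducibility), and the sample data are all assumptions.

```python
def mse(a, b, xs, ys):
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(a, b, xs, ys):
    """Partial derivatives of the mean squared error with respect to a and b."""
    n = len(xs)
    da = 2 / n * sum((a * x + b - y) * x for x, y in zip(xs, ys))
    db = 2 / n * sum((a * x + b - y) for x, y in zip(xs, ys))
    return da, db

def gradient_descent(xs, ys, a=0.0, b=0.0, step=0.05, max_steps=10000):
    error = mse(a, b, xs, ys)
    for _ in range(max_steps):
        da, db = gradient(a, b, xs, ys)
        # Minimization: move in the direction opposite to the gradient.
        new_a, new_b = a - step * da, b - step * db
        new_error = mse(new_a, new_b, xs, ys)
        if new_error >= error:  # no further improvement: stop
            break
        a, b, error = new_a, new_b, new_error
    return a, b

xs, ys = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
print(gradient_descent(xs, ys))  # approaches a = 2, b = 1
```

For this quadratic error function a closed-form solution exists, so the gradient method is not needed here; the example only shows the mechanics on a problem whose optimum is known.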
The stepwidth can be chosen constant. However, the problem with a constant
stepwidth is that, with a large stepwidth, one might “jump” over or oscillate
around a local optimum. On the other hand, a very small stepwidth can lead to
extremely slow convergence or even to starving, which means that the algorithm
stops making progress before a local optimum is reached. Therefore, an adaptive
stepwidth is usually preferred, although at the price of higher computational
costs.
When a gradient method is applied to minimize an objective function, it can only find
the local minimum in the same “valley” of the landscape where the starting point is
located.3 Therefore, it is recommended to run a gradient method repeatedly, starting
with different initial points, in order to increase the chance of finding the global or at
least a good local optimum.
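The effect of the starting point can be illustrated with a one-dimensional function that has two valleys. In the Python sketch below, the objective f(x) = x^4 − 3x^2 + x, the stepwidth, and the fixed list of starting points are assumptions chosen for this example; in practice the starting points would typically be chosen at random.

```python
def f(x):
    return x ** 4 - 3 * x ** 2 + x

def df(x):
    return 4 * x ** 3 - 6 * x + 1

def descend(x, step=0.01, max_steps=10000):
    """Gradient descent with constant stepwidth from starting point x."""
    value = f(x)
    for _ in range(max_steps):
        new_x = x - step * df(x)
        new_value = f(new_x)
        if new_value >= value:  # no further improvement: stop
            break
        x, value = new_x, new_value
    return x, value

# Run the method from several starting points and keep the best result.
starts = [-2.0, -1.0, 1.0, 2.0]
results = [descend(x0) for x0 in starts]
best_x, best_value = min(results, key=lambda r: r[1])

for x0, (x, value) in zip(starts, results):
    print(f"start {x0:5.1f} -> x = {x:.4f}, f(x) = {value:.4f}")
print("best:", best_x, best_value)
```

Runs started at positive x end in the shallower valley near x ≈ 1.13, whereas runs started at negative x reach the deeper valley near x ≈ −1.30. Only the repeated runs reveal that the latter is the better optimum.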
3 For maximization, the same holds, except that the gradient method will just climb the “mountain”
However, if we assume that the data points are corrupted with noise, the
polynomial may mainly fit the noise rather than capture the underlying
relationship. When we obtain further data points, the regression line is likely
to be an even better approximation for these points than the polynomial.
Therefore, a model with a smaller error is not necessarily a better fit for the
data. In general, more complex models can fit the data better but have a higher
tendency to show this undesirable effect, called overfitting.
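This effect can be reproduced with a small experiment. The Python sketch below is an illustration under assumptions: the six training points are generated from the line y = x with hand-picked "noise" values, the new points lie exactly on y = x, and the interpolating polynomial is evaluated in Lagrange form. None of these choices is taken from the figure discussed in the text.

```python
def fit_line(xs, ys):
    """Closed-form least-squares line y = a*x + b."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return a, (sy - a * sx) / n

def interpolate(xs, ys, x):
    """Value at x of the polynomial of degree len(xs) - 1 through all points."""
    total = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        weight = 1.0
        for k, xk in enumerate(xs):
            if k != j:
                weight *= (x - xk) / (xj - xk)
        total += yj * weight
    return total

def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
noise = [0.3, -0.4, 0.5, -0.3, 0.4, -0.5]   # assumed noise values
train_y = [x + e for x, e in zip(train_x, noise)]
new_x = [0.5, 1.5, 2.5, 3.5, 4.5]
new_y = list(new_x)                          # new points on the true line y = x

a, b = fit_line(train_x, train_y)
line = lambda x: a * x + b
poly = lambda x: interpolate(train_x, train_y, x)

print("training error: line", mse(line, train_x, train_y),
      "polynomial", mse(poly, train_x, train_y))
print("new-data error: line", mse(line, new_x, new_y),
      "polynomial", mse(poly, new_x, new_y))
```

The polynomial reproduces the training data without error, but on the new points its error is considerably larger than that of the line.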
Once we have fitted a model to given data, the fitting error can be decomposed into
four components.
• The pure or experimental error,
• the sample error,
• the lack of fit or model error, and
• the algorithmic error.
The pure error or experimental error is inherent in the data and is due to noise,
random variations, imprecise measurements, or the influence of hidden variables
that cannot be observed. It is impossible to overcome this error by the choice of a
suitable model. Therefore, it is also called intrinsic error. In the context of
classification problems, it is also called Bayes error.
Fig. 5.5 A simple classification problem with perfect separation (left) and more difficult classifi-
cation problems with slightly (middle) and strongly overlapping classes (right)
Fig. 5.7 Darts results for a beginner (left), a hobby player (middle), and a professional (right)
the center of the dartboard. The hits of the professional player will all be close to
the center, whereas the complete beginner sometimes almost hits the center but has
the highest deviation.
In terms of a classification problem, all three classes have the same prototype
or center, and only the deviation differs. In such cases, a classifier can still be
constructed, although with a high misclassification rate.
We take an abstract simplified look at the dart player example. Assume that we
draw samples from three univariate normal distributions, representing three different
classes, with the same mean but with different variances, as shown in Fig. 5.8. This
corresponds more or less to the classification problem of the dart players when we
consider only the horizontal deviation to the center of the dartboard. Samples from
the normal distribution with the smallest variance correspond to hits of the
professional player, whereas the normal distribution with the largest variance
represents the complete beginner. In this theoretical example, it is obvious how to find the
classification decision. Assuming that the three classes have the same frequency, an
object, i.e., a simple value in this case, should be assigned to the normal distribution
(class) with the highest likelihood, in other words, to the normal distribution with
the highest value of the corresponding probability density function at the position
of the given value. In this way, the region in the middle would be assigned to the
normal distribution with smallest variance, the left and right outer region would be
assigned to the normal distribution with highest variance, and the region in between
would be assigned to the remaining normal distribution.
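This decision rule can be written down directly. In the following Python sketch, the common mean 0 and the standard deviations 1, 2, and 4 are assumptions for illustration (the text does not give the parameters of the three distributions), as are the class names and function names.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, sd):
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))

# Assumed standard deviations; all three classes share the mean 0.
STANDARD_DEVIATIONS = {"professional": 1.0, "hobby player": 2.0, "beginner": 4.0}

def classify(x, mean=0.0):
    """Assign x to the class whose density is highest at x
    (equal class frequencies are assumed)."""
    return max(STANDARD_DEVIATIONS,
               key=lambda name: normal_pdf(x, mean, STANDARD_DEVIATIONS[name]))

for x in (0.0, 1.0, 2.0, -2.0, 5.0, -5.0):
    print(x, classify(x))
```

With these assumed parameters, values near the center are assigned to the professional, values far out on either side to the beginner, and the two regions in between to the hobby player.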
In this sense, it is obvious how to make best guesses for the classes, although
these best guesses will still lead to a high misclassification rate. For the best guesses,
there are clear boundaries for the classification decision. However, one should not
mix up the classification boundaries with class boundaries. Classification bound-
aries refer to the boundaries drawn by a classifier by assigning objects to classes.
These boundaries will always exist. But these classification boundaries do not nec-
essarily correspond to class boundaries that separate the classes. In most cases, class
boundaries do not even exist, since classes tend to overlap in real applications and
cannot be clearly separated as in Fig. 5.5. This is due to the Bayes or the pure error.
For many classification problems, there are only two classes which the classifier
is supposed to distinguish. Let us call the two classes plus and minus. The classi-
fier can make two different kinds of mistakes. Objects from the class minus can be
wrongly assigned to the class plus. These objects are called false positives. And
vice versa, objects from the class plus can be wrongly classified as minus. Such ob-
jects are called false negatives. The objects that are classified correctly are called
true positives and true negatives, respectively.
There is always a trade-off between false positives and false negatives. One can
easily ensure that there are no false positives by simply classifying all objects as mi-
nus. However, this means that all objects from the class plus become false negatives.
The other extreme is to classify all objects as plus, in this way avoiding false nega-
tives but accepting that all objects from the class minus are false positives. A clas-
sifier must find a compromise between these two extremes, trying to minimize both
the number of false positives and false negatives. A classifier biased to the class
plus will have fewer false negatives but more false positives, whereas a classifier
biased to the class minus will have fewer false positives but more false negatives.
Cost functions, as they were explained in Sect. 5.2.1, are one way to introduce such
biases to false positives or false negatives.
Some classifiers also provide, for each object, a probability that it belongs
to a class. The usual decision is then to assign the object to the class with
the highest probability. So in the case of the two classes plus and minus, we would
assign an object to the class plus if and only if the probability for this class is greater
than 0.5. But we could also decide to be more careful and to assign objects to the
class plus only when the corresponding probability is higher than τ = 0.8, leading
to fewer false positives but more false negatives. If we choose a threshold τ for
assigning an object to the class plus lower than 0.5, we will reduce the number of
false negatives at the price of more false positives.
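The influence of the threshold τ can be seen by counting both kinds of mistakes for a few values of τ. The Python sketch below uses a small set of made-up probabilities and class labels; these data and the function name count_errors are assumptions for illustration.

```python
def count_errors(probabilities, labels, tau):
    """Return (false positives, false negatives) when an object is assigned
    to the class plus if and only if its probability exceeds tau."""
    false_positives = sum(1 for p, y in zip(probabilities, labels)
                          if p > tau and y == "minus")
    false_negatives = sum(1 for p, y in zip(probabilities, labels)
                          if p <= tau and y == "plus")
    return false_positives, false_negatives

# Hypothetical predicted probabilities for the class plus and the true classes.
probabilities = [0.95, 0.85, 0.70, 0.60, 0.55, 0.45, 0.30, 0.20]
labels = ["plus", "plus", "minus", "plus", "minus", "plus", "minus", "minus"]

for tau in (0.25, 0.5, 0.8):
    print(tau, count_errors(probabilities, labels, tau))
```

For these data, τ = 0.5 gives 2 false positives and 1 false negative, the more careful τ = 0.8 gives 0 false positives but 2 false negatives, and τ = 0.25 gives 3 false positives and 0 false negatives.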
This trade-off between false positives and false negatives is illustrated by the
receiver operating characteristic or ROC curve showing the false positive rate
versus the true positive rate (in percent). Figure 5.9 shows examples for possible
ROC curves. For various choices of τ , a new point is drawn at the respective co-
ordinates of false positive rate and true positive rate. These dots are connected to a
curve. The ROC curves in Fig. 5.9 are idealized. Normally, the ROC curves based
on sampled data look less smooth and more ragged.
The best case for a ROC curve would be to jump immediately from 0% to 100%,
so that we could have a classifier with 100% true positives and no false positives.
The red line shows a very good, but not perfect, ROC curve. We can have a very high
true positive rate together with a low false positive rate. The diagonal, shown as a
gray line, corresponds to pure random guessing, so that such a classifier has actually
learned nothing from the data, or there is no connection between the classes and the
attributes used for prediction. Therefore, the diagonal line is the worst case for a
ROC curve.
The area under curve (AUC), i.e., the area under the ROC curve, is an indicator
of how well the classifier solves the problem. The larger the area, the better
the solution for the classification problem. The area is measured relative to
the area of the square [0, 100] × [0, 100] in which the ROC curve is drawn. The
lowest value for AUC is 0.5, corresponding to random guessing, and the highest
value is 1 for a perfect classifier with no misclassifications. The blue line in
Fig. 5.9 is a ROC curve with a lower performance. The low performance may be
due to the Bayes error, but also to other errors that will be discussed in the
following sections.
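The construction of a ROC curve and the computation of the AUC can be sketched as follows. The code is an illustration under assumptions: rates are expressed as fractions in [0, 1] rather than in percent, an object is assigned to the positive class when its score is at least τ, the area is computed with the trapezoidal rule, and the scores and labels are made up.

```python
def roc_points(scores, labels):
    """ROC points (false positive rate, true positive rate), one per threshold,
    where the thresholds are the observed scores in decreasing order."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for tau in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal rule applied to consecutive ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical scores for the positive class and true labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
print(points)
print("AUC:", area_under_curve(points))
```

For these data the AUC is 0.8125. The same value is obtained as the fraction of (positive, negative) pairs in which the positive object has the higher score, here 13 of 16 pairs.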
5.4 Types of Errors 99
True class        Predicted class
                  Iris setosa   Iris versicolor   Iris virginica
Iris setosa       50            0                 0
Iris versicolor   0             47                3
Iris virginica    0             2                 48
When there are more than two classes, it is not possible to draw a ROC curve as
described above. One can only draw ROC curves with respect to one class against
all others.
The confusion matrix is another way to describe the classification errors. A con-
fusion matrix is a table where the rows represent the true classes and the columns
the predicted classes. Each entry specifies how many objects from a given class are
classified into the class of the corresponding column. An ideal classifier with no
misclassifications would have nonzero entries only on the diagonal.
Table 5.3 shows a possible confusion matrix for the Iris data set. From the confu-
sion matrix we can see that all objects from the class setosa are classified correctly
and no object from another class is wrongly classified as setosa. A few objects from
the other classes—three from versicolor and two from virginica—are wrongly clas-
sified.
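A confusion matrix is easily built from the true and the predicted class of each object. The following Python sketch is a minimal illustration; the function names are assumptions, and the matrix table_5_3 simply repeats the counts of Table 5.3.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows correspond to the true classes, columns to the predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

def misclassification_rate(matrix):
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total

# Tiny made-up example with two classes.
print(confusion_matrix(["a", "a", "b"], ["a", "b", "b"], ["a", "b"]))

# The counts of Table 5.3 (setosa, versicolor, virginica).
table_5_3 = [[50, 0, 0],
             [0, 47, 3],
             [0, 2, 48]]
print(misclassification_rate(table_5_3))
```

For the matrix of Table 5.3, 5 of the 150 objects lie off the diagonal, which corresponds to a misclassification rate of about 3.3%.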
The sample error is caused by the fact that the data is only an imperfect represen-
tation of the underlying distribution of the data.
A finite sample, especially when its size is quite small, will seldom exactly reflect
the true probability distribution generating the data. According to
the laws of large numbers, the sample distribution converges with probability one
to the true distribution when the sample size approaches infinity. However, a finite
sample can deviate significantly from the true distribution, although the probability
for such a deviation might be small. The bar chart in Fig. 5.10 shows the result for
throwing a fair die 60 times. In the ideal case, one would expect each of the numbers
1, . . . , 6 to occur 10 times. But for this sample, the sample distribution does not look
uniform. Other sources of sample errors are measurements with limited precision
and round-off errors.
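The die experiment is easy to repeat in simulation. The Python sketch below is an illustration only: the seed and the sample sizes are arbitrary assumptions, and the counts it produces will not coincide with those shown in Fig. 5.10.

```python
import random
from collections import Counter

def throw_die(n_throws, seed):
    """Counts of the faces 1..6 for n_throws simulated throws of a fair die."""
    rng = random.Random(seed)
    return Counter(rng.randint(1, 6) for _ in range(n_throws))

def max_deviation(counts, n_throws):
    """Largest deviation of a relative frequency from the true probability 1/6."""
    return max(abs(counts[face] / n_throws - 1 / 6) for face in range(1, 7))

for n in (60, 600, 60000):
    counts = throw_die(n, seed=1)
    print(n, [counts[face] for face in range(1, 7)], max_deviation(counts, n))
```

For small samples the relative frequencies typically deviate visibly from 1/6, whereas for large samples the deviation becomes small, in line with the laws of large numbers.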
Sometimes the sample is also biased. Consider a bank that supplies loans to cus-
tomers. Based on the data available on the customers who have obtained loans, the
bank wants to estimate the probability for paying back a loan for new customers.
However, the collected data will be biased in the direction of better customers be-
cause customers with a more problematic financial status have not been granted
loans, and therefore, no information is available for such customers whether they
might have paid back the loan nevertheless. From the perspective of a representa-
tive sample of all applicants, we deal with a sample error when using the bank’s
database.
A large error may be caused by a high pure error, but it may also be due to a lack
of fit. When the set of considered models is too simple for the structure inherent in
the data, no model will yield a small error. Such an error is also called model error.
Figure 5.11 shows how a regression line is fitted to data with no pure error. But the
data points originate from a quadratic and not from a linear function.
The line shown in the figure is the one with the smallest mean squared error. Such
a line can always be computed, no matter from which true function the data come.
But the line that best fits such data does not reflect the structure inherent in the data.
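A small computation makes the lack of fit visible. In the Python sketch below, the data are generated without noise from y = x² at x = 0, 1, ..., 4; these points are an assumption for illustration and are not the data of Fig. 5.11.

```python
def fit_line(xs, ys):
    """Closed-form least-squares line y = a*x + b."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return a, (sy - a * sx) / n

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [x * x for x in xs]            # quadratic function, no pure error

a, b = fit_line(xs, ys)
residuals = [a * x + b - y for x, y in zip(xs, ys)]
model_error = sum(r * r for r in residuals) / len(xs)

print("best line: y =", a, "* x +", b)
print("residuals:", residuals)
print("mean squared error:", model_error)
```

The best line is y = 4x − 2 with a mean squared error of 2.8, although the data contain no noise at all. The residuals are negative at both ends and positive in the middle; such a systematic pattern is a typical hint that the model class is too simple.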
Unfortunately, it is often difficult or even impossible to distinguish between the
pure error and the error due to the lack of fit. Simple models tend to have a large
error due to the lack of fit, whereas more complex models lead to small errors for
the given data but tend to overfit and may lead to very large errors for new data.
The algorithmic error is caused by the method that is used to fit the model or
the model parameters. In the ideal case, when an analytical solution for the optimum
of the objective function exists, the algorithmic error is zero or is only caused by
numerical problems. But as we have seen in Sect. 5.3, in many cases an analytical
solution cannot be provided, and heuristic strategies are needed to fit the model to
the data.
Even if a model exists with a very good fit—the global optimum of the objective
function—the heuristic optimization strategy might only be able to find a local op-
timum with a much larger error that is caused neither by the pure error nor by the
error due to the lack of fit.
Most of the time, the algorithmic error will not be considered, and it is assumed
that the heuristic optimization strategy is chosen well enough to find an optimum
that is at least close to the global optimum.
The types of errors mentioned in the previous four subsections can be grouped into
two categories. The algorithmic and the model errors can be controlled to a certain
extent, since we are free to choose a suitable model and algorithm. These errors are
also called machine learning bias. We have no influence on the pure or intrinsic
error. The same applies to the sample error when the data to be analyzed have already
been collected. The error caused by the intrinsic and the sample error is sometimes
also called variance.
It is also well known from statistics that the mean squared error (MSE) of an
estimator θ* for an unknown parameter θ can be decomposed in terms of the variance
of the estimator and its bias:
\[ \mathrm{MSE} = \mathrm{Var}(\theta^*) + (\mathrm{Bias}(\theta^*))^2. \tag{5.11} \]
Note that this decomposition deviates from the classification into model bias and
variance above as it is popular in machine learning. The variance in (5.11) depends
on the intrinsic error, i.e., on the variance of the random variable from which the
sample is generated, and also on the choice of the estimator θ*, which is considered
as part of the model bias in machine learning.
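For completeness, the decomposition (5.11) follows from a short calculation. The derivation below is the standard one and is added here as a reminder; it assumes the usual definitions MSE = E[(θ* − θ)²] and Bias(θ*) = E[θ*] − θ, which are not spelled out in the text.

```latex
\begin{align*}
\mathrm{MSE}
  &= E\big[(\theta^* - \theta)^2\big] \\
  &= E\big[\big((\theta^* - E[\theta^*]) + (E[\theta^*] - \theta)\big)^2\big] \\
  &= E\big[(\theta^* - E[\theta^*])^2\big]
     + 2\,(E[\theta^*] - \theta)\,E\big[\theta^* - E[\theta^*]\big]
     + (E[\theta^*] - \theta)^2 \\
  &= \mathrm{Var}(\theta^*) + (\mathrm{Bias}(\theta^*))^2 .
\end{align*}
```

The mixed term vanishes because E[θ* − E[θ*]] = 0, and E[θ*] − θ is a constant that can be taken out of the expectation.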
A more detailed discussion on the different meanings and usages of the terms
variance and bias can be found in [3].
The different types of errors or biases discussed in the previous section have an in-
teresting additional impact on the ability to find a suitable model for a given data set:
if we have no model or learning bias, we will not be able to generalize. Essentially
this means that we need to constrain either the types of models that are available or
the way we are searching for a suitable model (or both). Tom Mitchell demonstrates
this very convincingly in his hypothesis learning model [8]—in this toy world he
can actually prove that in the unrestricted case of a boolean classification problem,
the fitting models we can possibly find predict “false” in exactly half of the cases
and “true” for the other half. This means that without any constraint we always leave
all choices open. The learner or model bias is essential to put some sort of a priori
knowledge into the model learning process: we either limit what we can express, or
we limit how we search for it.
In the previous section, algorithms were discussed that fit a model from a prede-
fined model class to a given data set. Complex models can satisfy a simple fitting
criterion better than simple models. However, the problem of overfitting increases
with the complexity of the model. Especially, when the model is built for prediction
purposes, the error of the model based on the data set from which it was computed is
usually smaller than for data that have not been used for determining the model. For
instance, in Fig. 5.4 on page 94, the polynomial fits the data perfectly. The error for
the given data set is zero. Nevertheless, the simple line might be a better description
of the data, at least when we assume that the data are corrupted by noise. Under
this assumption, the polynomial would lead to larger errors for new data, especially
in those regions where it tends to oscillate. How do we find out which model is
actually best suited to our problem?
The most common principle to estimate a realistic performance of the model for
unknown or future data is separating the data set for training and testing purposes. In
the simplest case, the dataset is split into two disjoint sets, the training data which
are used for fitting the model and the test data which only serve for evaluating the
trained model but not for fitting the model. Usually, the training set is chosen larger
than the test data set, for instance, 2/3 of the data are used for training, and 1/3 for
testing.
One way to split the data into a training and a test set is a random assignment
of the data objects to these two sets. This means that on average the distributions
of the values in the original data set and in the training and the test data set should
be roughly the same. However, by chance it can happen that the distributions may
differ significantly. When a classification problem is considered, it is usually rec-
ommended to draw stratified samples for the training and the test set. Stratification
means that the random assignments of the data to the test and the training set are
carried out per class and not simply for the whole data set. In this way, it is ensured
that the relative class frequencies in the original data set, the training set, and the
test set are approximately the same.
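A stratified split as described above can be sketched as follows. This is only an illustration under our own assumptions: the function name, the 2/3 training fraction as default, and the toy class labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=2/3, seed=0):
    """Assign record indices to training and test sets separately per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # random assignment within the class
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)

# Toy data: 90 records of class "ok" and 30 of class "broken"
labels = ["ok"] * 90 + ["broken"] * 30
train, test = stratified_split(labels)
print(len(train), len(test))
```

Both resulting sets contain the class "broken" with the same relative frequency of 25% as the full toy data set.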
Sometimes, it is not advisable to carry out a (stratified) random assignment of
the data to the training and test set. Consider again the example of the producer of
tea cups from Sect. 5.2.1 who wants to classify the cups automatically into ok and
broken. Assume that six different types of cups are produced at the moment and in
the future new types of cups might be introduced. Dividing the data set randomly
into training and test data would not reflect the classification problem to be encoun-
tered in the future. If the producer had no intention to change the types of the cups,
it would be correct to draw a random sample for testing from all data. In this way,
the classifier will be trained and tested with examples from all six types of cups.
But since new models of cups might be introduced in the future, this would yield an
over-optimistic estimation of the classification error for future cups. A better way
would be to use the data from four types of cups for training the classifier and to test
it on the remaining two types of cups. In this way, we can get an idea of how the
classifier can cope with cups it has never seen before.
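Holding out complete groups instead of single records can be sketched as follows. The group names and the helper function are hypothetical and only mimic the six cup types of the example.

```python
import random

def split_by_group(groups, n_test_groups=2, seed=0):
    """Put all records of some randomly chosen groups into the test set."""
    rng = random.Random(seed)
    distinct = sorted(set(groups))
    test_groups = set(rng.sample(distinct, n_test_groups))
    train = [i for i, g in enumerate(groups) if g not in test_groups]
    test = [i for i, g in enumerate(groups) if g in test_groups]
    return train, test, test_groups

# Toy data: 60 records, 10 for each of six hypothetical cup types
cup_types = [f"type{1 + i % 6}" for i in range(60)]
train, test, held_out = split_by_group(cup_types)
print(sorted(held_out), len(train), len(test))
```

The classifier would then be fitted on the records in `train` only, so that the test error refers to cup types it has never seen.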
This example shows how important it is in prediction tasks to consider whether
the given data are representative for future data. When predictions are made for
future data for which given data are not representative, extrapolation is carried out
with a higher risk of wrong predictions. In the case of high-dimensional data, sparse
or missing data in certain regions of the space of possible values cannot be avoided,
as we have discussed already in Sect. 4.2, so that we always have to be aware
of this problem.
Sometimes, the data set is split into three parts: In addition to the training and
the test data set, a validation set is also used. If, for instance, a classifier should be
learned from data, but we do not know which kind of model is the most appropriate
one for the classifier, we could make use of a validation set. All classifier models are
generated based on the training data only. Then the classifier with the best perfor-
mance on the validation set is chosen. The prediction error of this classifier is then
estimated based on the test set.
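The role of the three data sets can be illustrated by a small sketch. As an assumption for the illustration, the candidate models are nearest-neighbor classifiers with different numbers k of neighbors on artificial one-dimensional data; neither the data nor the candidate models stem from the text.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (1-D inputs)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def error_rate(train, data, k):
    """Fraction of records in data that are misclassified."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in data)
    return wrong / len(data)

rng = random.Random(0)
# Toy data: class 1 for x > 0.5, with 15% of the labels flipped at random
data = []
for _ in range(300):
    x = rng.uniform(0, 1)
    y = int(x > 0.5)
    if rng.random() < 0.15:
        y = 1 - y
    data.append((x, y))

train, valid, test = data[:150], data[150:225], data[225:]

candidates = [1, 5, 15]                    # odd values avoid ties in the vote
# All candidates are built from the training data; the validation set selects one
best_k = min(candidates, key=lambda k: error_rate(train, valid, k))
# The test set is used only once, for the chosen candidate
test_error = error_rate(train, test, best_k)
print(best_k, round(test_error, 3))
```

The validation error of the selected candidate is optimistically biased by the selection itself, which is why the final estimate comes from the untouched test set.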
5.5.2 Cross-Validation
The estimation of the fitting error for new data based on a test data set that has
not been used for learning the model depends on the splitting of the original data
set into training and test data. By chance, we might just be lucky that the test set
contains easier examples, leading to an over-optimistic evaluation of the model.
Or we might be unlucky when the test set contains more difficult examples and
the performance of the model is underestimated. Cross-validation does not rely on
only one estimation of the model error, but rather on a number of estimations. For
k-fold cross-validation, the data set is partitioned into k subsets of approximately
equal size. Then the first of the k subsets is used as a test set, and the other (k − 1)
sets are used as training data for the model. In this way, we get the first estimation
for the model error. Then this procedure is repeated by using each of the other (k − 1)
subsets in turn as test data and the remaining (k − 1) subsets as training data. Altogether,
we obtain k estimations for the model error. The average of these values is taken as
the estimation for the model error. Typically, k = 10 is chosen.
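The procedure can be sketched in a few lines. The helper names are our own, and the "model" is deliberately trivial (it predicts the mean of the training targets) so that the example stays self-contained.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Partition the indices 0..n-1 randomly into k folds of nearly equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(data, k, fit, error):
    """Return the average of the k error estimates and the estimates themselves."""
    folds = k_fold_indices(len(data), k)
    estimates = []
    for i, test_idx in enumerate(folds):
        train = [data[j] for f, fold in enumerate(folds) if f != i for j in fold]
        test = [data[j] for j in test_idx]
        model = fit(train)                 # fitted without the test fold
        estimates.append(error(model, test))
    return sum(estimates) / k, estimates

def fit_mean(train):
    """Toy model: the mean of the training targets."""
    return sum(y for _, y in train) / len(train)

def mean_squared_error(model, test):
    return sum((y - model) ** 2 for _, y in test) / len(test)

rng = random.Random(1)
data = [(x, 2.0 + rng.gauss(0, 1)) for x in range(50)]
avg_error, per_fold = cross_validate(data, 10, fit_mean, mean_squared_error)
print(round(avg_error, 3))
```

Choosing k equal to the number of records yields folds that contain exactly one record each.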
Small data sets might not contain enough examples for training when 10% are left
out for testing. In this case, the leave-one-out method, also known as the jackknife
method, can be applied which is simply n-fold cross-validation for a data set with n
data objects, so that each time only one data object is used for evaluating the model
error.
5.5.3 Bootstrapping
Bootstrapping is a resampling technique from statistics that does not directly
evaluate the model error but aims at estimating the variance of the estimated model
parameters. Therefore, bootstrapping is suitable for models with real-valued
parameters. As in cross-validation, the model is computed not only once but multiple
times. For this purpose, k bootstrap samples, each of size n, are drawn randomly
with replacement from the original data set with n records. The model is fitted to
each of these bootstrap samples, so that we obtain k estimates for the model
parameters. Based on these k estimates, the empirical standard deviation can be computed
for each parameter to provide information on how reliable the estimation of the
parameter is.
Figure 5.12 shows a data set with n = 20 data points from which k = 10 bootstrap
samples were drawn. For each of the bootstrap samples, the corresponding regres-
sion line is shown in the figure. The resulting parameter estimates for the intercept
and the slope of the regression line are listed in Table 5.4. The standard deviation for
the slope is much lower than for the intercept, so that the estimation for the slope is
more reliable. It is also possible to compute confidence intervals for the parameters
based on bootstrapping [2].
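The computation behind such a table can be sketched as follows. The data points are artificial and are not those of Fig. 5.12; the function names are our own.

```python
import random
import statistics

def fit_line(points):
    """Least squares regression line; returns (intercept, slope)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def bootstrap_lines(points, k=10, seed=0):
    """Fit one regression line to each of k bootstrap samples of size n."""
    rng = random.Random(seed)
    n = len(points)
    lines = []
    while len(lines) < k:
        sample = [points[rng.randrange(n)] for _ in range(n)]   # with replacement
        if len({x for x, _ in sample}) > 1:    # a line needs two distinct x values
            lines.append(fit_line(sample))
    return lines

rng = random.Random(1)
points = [(x, 1.0 + 0.5 * x + rng.gauss(0, 1)) for x in range(20)]   # toy data
lines = bootstrap_lines(points, k=10)
sd_intercept = statistics.stdev([a for a, _ in lines])
sd_slope = statistics.stdev([b for _, b in lines])
print(round(sd_intercept, 3), round(sd_slope, 3))
```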
The results from bootstrapping can be used to improve predictions as well by
applying bagging (bootstrap aggregation). In the case of regression, one would use
the average of the predicted values that the k models generated from the bootstrap
samples yield. In the example in Fig. 5.12, for a given value x, all 10 lines would
provide a prediction for y, and we would use the average of these predictions. For
classification, one would generate k classifiers from the bootstrap samples, calculate
the predicted class for all k classifiers, and use the most frequent class among the
k predicted classes as the final prediction. Bagging will be introduced in more detail
in Sect. 9.4.
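For the classification case, the majority vote can be sketched as follows. As an assumption for this illustration, each classifier is a simple threshold rule ("class 1 if x is at least t") fitted to a bootstrap sample of artificial data; the text does not prescribe a particular type of classifier.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Return the threshold t with the fewest errors for the rule 'class 1 if x >= t'."""
    best_errors, best_threshold = None, None
    for threshold in sorted({x for x, _ in sample}):
        errors = sum((x >= threshold) != (y == 1) for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_errors, best_threshold = errors, threshold
    return best_threshold

def bagged_classifier(data, k=11, seed=0):
    """Fit one threshold rule to each of k bootstrap samples."""
    rng = random.Random(seed)
    n = len(data)
    return [fit_stump([data[rng.randrange(n)] for _ in range(n)]) for _ in range(k)]

def predict(thresholds, x):
    """Most frequent class among the predictions of all k classifiers."""
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Toy data: class 1 for x >= 0.5, with two deliberately wrong labels
data = [(i / 40, int(i >= 20)) for i in range(40)]
data[3] = (data[3][0], 1)
data[30] = (data[30][0], 0)

thresholds = bagged_classifier(data)
print(predict(thresholds, 0.1), predict(thresholds, 0.9))
```

An odd number of classifiers is used here so that the vote between two classes cannot end in a tie.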
Complex models are more flexible and can usually yield a better fit for the training
data. But how well a model fits the training data does not tell much about how
well a model represents the inherent structure in the data. Complex models tend to
overfit, as we have already seen in Fig. 5.4 on page 94.
Model selection—the choice of a suitable model—requires a trade-off between
simplicity and fitting. Based on the principle of Occam’s razor, one should choose
the simplest model that “explains” the data. If a linear function fits the data well
enough, one should prefer the linear function and not a quadratic or cubic function.
However, it is not clear what is meant by “fitting the data well enough.” There is a
need for a trade-off between model simplicity and model fit. But the problem is that
it is more or less impossible to measure these two aspects in the same unit. In order
to combine the two aspects, regularization techniques are applied. Regularization
is a general mathematical concept that introduces additional information in order to
solve an otherwise ill-posed problem. A penalty term for more complex models can
be incorporated into the pure measure for model fit as a regularization technique
for the avoidance of overfitting.
The minimum description length principle (MDL) is one promising way to join
measures for model fit and model complexity into one measure. The basic idea be-
hind MDL is to understand modeling as a technique for data compression. The aim
of data compression is to minimize the memory—measured in bits—needed to store
the information contained in the data in a file. In order to recover the original data,
the compressed data and the decompression rule are needed. Therefore, the overall
size of the compressed data file is the sum of the bits needed for the compressed
data plus the bits needed to encode the decompression rule. In principle, any file
can be compressed to the size of one bit, say the bit with value 1, by defining the
decompression rule “if the first bit is 1, then the decompressed file is the original
data.” This implies that the decompression rule contains the original data set and is
therefore as large as the original data, so that no compression is achieved at all when
we consider the compressed data and the decompression rule together. The same
applies when we do not compress the original data at all. Then we need no space
for the decompression rule but have not saved any memory either, since we have
not carried out any compression. The optimum lies somewhere in between by using
a simple decompression rule allowing a reasonable compression.
The application of these ideas to model selection requires an interpretation of
the model as a compression or decompression scheme, the (binary) coding of the
model, and the compressed (binary) coding of the data. We illustrate the minimum
description length principle by two simplified examples.
The first example is a classification task based on the data set shown in Table 5.5.
The attribute C with the two possible values + and − shall be predicted with a
decision tree based on the two binary attributes A and B with domains {a1 , a2 } and
{b1 , b2 }.
Table 5.5
A  a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a2 a2 a2 a2 a2 a2 a2 a1 a2 a2
B  b1 b2 b2 b2 b1 b1 b1 b2 b1 b2 b1 b1 b1 b1 b1 b1 b1 b1 b2 b2
C  +  +  +  +  +  +  +  +  +  +  −  −  −  −  −  −  −  −  +  +

We consider the two decision trees shown in Fig. 5.13 to solve this classification
task. The decision tree on the left-hand side is simpler but leads to two
misclassifications. The last two records in Table 5.5 are classified wrongly by this
decision tree. The slightly larger decision tree on the right-hand side classifies all
records in Table 5.5 correctly. If we were just concerned with the misclassification
rate for the training data, we would prefer the decision tree on the right-hand side.
But the node with the attribute B is only required to correct two misclassifications,
i.e., 10% of the data set. This might just be an artificial improvement of the
misclassification rate only for the training data set, and it might lead to overfitting.
In order to decide whether we should prefer the smaller or the larger decision
tree, we interpret the two decision trees as compression schemes for the data set
in the following way. Based on the corresponding decision tree, we can predict the
attribute C, so that we do not have to store the value of the attribute C when we know
the decision tree and the values of the attributes A and B for each record. However,
this is only true for the larger decision tree. For the smaller decision tree, we have
to correct the value of the attribute C for the two records that are misclassified by
the smaller decision tree. The length of the compressed file for the data set based on
either of the two decision trees is the sum of the lengths needed for coding
• the corresponding decision tree,
• the values of the records for the attributes A and B, and
• the corrected values for the attribute C only for the misclassified records.
The coding of the values of the records for the attributes A and B is needed for any
decision tree and can be considered as a constant. The length needed for coding the
decision tree depends on the size of the tree. The larger the tree, the more bits are
needed to store the tree. However, with a larger tree, the misclassification rate can
be reduced, and we need fewer bits for coding the corrections for attribute C for the
misclassified records. When we want to minimize the overall length needed to store
the compressed data, we need a compromise between a smaller decision tree with
a higher misclassification rate and a larger decision tree with a lower misclassification
rate. The smaller decision tree will need less space for its own coding but more
for the corrections of the misclassified records, whereas the larger decision tree
needs more space for its own coding but can save space due to a lower number
of misclassifications. According to the minimum description length principle, we
should choose the decision tree with the minimum number of bits needed to code
the tree itself and the corrected values for the attribute C.
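The trade-off can be made concrete with a small sketch. All coding costs here are assumptions made for the illustration only: a fixed number of bits per tree node, and one record index per correction (since the class is binary, naming the record suffices). The node counts 3 and 5 are likewise assumed for a tree with one and two inner nodes.

```python
import math

def description_length(n_tree_nodes, n_misclassified, n_records, bits_per_node=8):
    """Bits for the tree plus bits for correcting the misclassified records."""
    tree_bits = n_tree_nodes * bits_per_node
    # Each correction names one record; flipping its binary class needs no extra bit
    correction_bits = n_misclassified * math.ceil(math.log2(n_records))
    return tree_bits + correction_bits

# Hypothetical setting: 20 records, small tree with 2 errors, large tree with none
for bits in (8, 4):
    small = description_length(3, 2, 20, bits_per_node=bits)
    large = description_length(5, 0, 20, bits_per_node=bits)
    preferred = "small tree" if small < large else "large tree"
    print(f"{bits} bits per node: small = {small}, large = {large}, prefer {preferred}")
```

With the more expensive node coding the smaller tree yields the shorter description, with the cheaper node coding the larger tree does, which shows how strongly the naive approach depends on the chosen coding scheme.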
Of course, the number of bits needed for the coding depends on the binary coding
scheme we use for encoding the decision tree and the corrections for the values of
the attribute C. If we have a highly efficient coding scheme for decision trees but an
inefficient one for the corrections for the values of the attribute C, larger trees would
be preferred. The naive MDL approach will ignore this problem and simply try to
find the most efficient binary coding for both parts. However, there are more general
concepts of universal codes and universal models freeing the minimum description
length principle from the dependency on the specific coding scheme. For a detailed
introduction to universal codes and universal models, which are beyond the scope of
this book, we refer to [4].
As a second example for the application of the minimum description length
principle, we consider a regression problem. Figure 5.14 shows a simple data set to
which a line and a quadratic curve are fitted. Which of these two models should we
prefer? Of course, the quadratic curve will lead to a smaller error for the training
data set than the simple line. We could also think of a polynomial of higher degree
that would reduce the error even more. But then we have to face the problem of
overfitting in the same way as for larger decision trees.
For illustration purposes, we consider an even simpler regression problem pro-
vided by the data set in Table 5.6.
What would be the best model for this data set? Should we describe the relation
between X and Y by a simple constant function, by a linear, or even a quadratic
function? Of course, the quadratic function would yield the smallest, and the con-
stant function the largest error. But we want to avoid overfitting by applying the
minimum description length principle. As in the example of the classification prob-
lem, it will suffice to consider the naive approach and not to bother about universal
codings and universal models. Instead of the decision tree, our models are now func-
tions with one, two, and three real-valued parameters. The errors are now also real
numbers and not just binary values as in the case of the decision trees. If we insist on
exact numbers, the coding of a single error value could need infinite memory, since
a real number can have infinitely many digits. To avoid this problem, we restrict our
precision to two digits right after the decimal point. For reasons of simplicity, we do
not consider a binary coding of the numbers, but a decimal coding. A real number is
coded backwards starting from the lowest digit, in our case the second digit after the
decimal point. Therefore, the numbers 1.23, 2.05, 0.06, and 0.89 would be coded as
321, 502, 6, and 98, respectively. Note that smaller numbers require less memory
because we do not code leading zero digits. We also would have to take the sign of
each number into account for the coding. But we will neglect this single bit here in
order not to mix the decimal coding for the numbers with a binary coding for the
sign. If we use a binary coding for the numbers, the sign integrates naturally into the
coding as an additional bit. Figure 5.15 shows a plot of the data set and the least
squares fits of constant, linear, and quadratic functions.

Table 5.7
Function                     ID 1     2      3      4      5      6      7      8      9
y = 1.92                     0.73   0.59  −0.11   0.71   0.12  −0.41   0.29  −0.96  −0.92
y = 1.14 + 0.19x            −0.05   0.00  −0.51   0.50   0.10  −0.24   0.65  −0.41  −0.18
y = 1.31 + 0.05x + 0.02x^2   0.12   0.05  −0.54   0.43   0.03  −0.27   0.70  −0.24   0.15
Table 5.7 lists the errors for the constant, linear, and quadratic functions that have
been fitted to the data. How many decimal digits do we need for coding the data
set when we use a constant function? We need three digits for the constant 1.92
representing the function. We also need to encode the errors in order to recover
the original values of the attribute Y from our constant function. For the errors of
this constant function, we always need to encode two digits, since the digit before
the decimal point is always zero and is not coded. Altogether, the coding of the data
set with the constant function requires 3 + 9 · 2 = 21 decimal digits.
What about the linear function? The function itself requires the coding of the
two coefficients 1.14 and 0.19 for which we need 5 decimal digits. Coding the
errors requires two decimal digits each time, except for the data points with the ID 1
and 2, for which we need only one and zero decimal digits. This means that we
have altogether 5 + 7 · 2 + 1 + 0 = 20 decimal digits. Similar considerations for the
quadratic curve lead to 5 + 7 · 2 + 2 · 1 = 21 decimal digits.
This means that, in terms of our extremely simplified MDL approach, the linear
function leads to the most efficient coding for the data set, and we would therefore
prefer the linear regression function over the other ones.
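The digit counting can be reproduced with a short sketch that uses the coefficients and errors from Table 5.7 and the simplified decimal coding described above (two digits after the decimal point, no leading zeros, sign neglected). The function names are our own.

```python
def to_hundredths(value):
    """Absolute value in units of 0.01; the sign is neglected as in the text."""
    return round(abs(value) * 100)

def encode(value):
    """Decimal digits in reverse order; leading zeros of the number are not coded."""
    hundredths = to_hundredths(value)
    return str(hundredths)[::-1] if hundredths else ""

def code_length(parameters, errors):
    """Number of decimal digits for the function parameters plus all errors."""
    return sum(len(encode(v)) for v in list(parameters) + list(errors))

models = {
    "constant": ([1.92],
                 [0.73, 0.59, -0.11, 0.71, 0.12, -0.41, 0.29, -0.96, -0.92]),
    "linear": ([1.14, 0.19],
               [-0.05, 0.00, -0.51, 0.50, 0.10, -0.24, 0.65, -0.41, -0.18]),
    "quadratic": ([1.31, 0.05, 0.02],
                  [0.12, 0.05, -0.54, 0.43, 0.03, -0.27, 0.70, -0.24, 0.15]),
}

lengths = {name: code_length(*model) for name, model in models.items()}
best = min(lengths, key=lengths.get)
print(lengths, best)
```

Under this coding, the linear function needs the fewest digits.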
It should be emphasized again that there is a more rigorous theory for the min-
imum description length principle that avoids the problem of finding the most ef-
ficient coding and the restriction to a fixed precision for the representation of real
numbers. But the naive approach we have described here often suffices to give an
idea of how complex the chosen model should be.
\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2 \tag{5.14} \]
With the same assumptions as for (5.13), BIC becomes in the context of regression
\[ \mathrm{BIC}_{\mathrm{Gauss}} = k \ln(n) + \frac{n \cdot \mathrm{MSE}}{\sigma^2}, \tag{5.16} \]
where σ² is the (estimated) variance for the underlying normal distribution that
causes the noise.
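Formula (5.16) is easy to evaluate once the mean squared error is known. In the following sketch, k denotes the number of model parameters and n the number of records; the two candidate models and all numerical values are hypothetical.

```python
import math

def mean_squared_error(predictions, targets):
    """MSE as in (5.14)."""
    n = len(targets)
    return sum((f - y) ** 2 for f, y in zip(predictions, targets)) / n

def bic_gauss(k, n, mse, sigma2):
    """BIC for regression with Gaussian noise as in (5.16)."""
    return k * math.log(n) + n * mse / sigma2

# Hypothetical comparison: the model with more parameters fits slightly better
n, sigma2 = 100, 0.25
bic_simple = bic_gauss(k=2, n=n, mse=0.26, sigma2=sigma2)
bic_complex = bic_gauss(k=6, n=n, mse=0.25, sigma2=sigma2)
print(round(bic_simple, 2), round(bic_complex, 2))
```

In this toy comparison, the small gain in fit of the larger model does not outweigh its penalty term, so the model with two parameters obtains the smaller value.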
KNIME offers a number of modules to estimate errors. Most prominently, the Scorer
node computes a confusion matrix given two columns with the actual and predicted
class label. There is also a node to plot a ROC curve and an entropy scorer, which
allows one to compute the class–class purities between two columns. So the stan-
dard error metrics are available as individual nodes. Figure 5.16 shows the use of
the scorer node in practice. The trained Naive Bayes classifier is applied to a sec-
ond data set, and the output is fed into the scorer node which compares the target
with the predicted class. The output of this scorer is a confusion matrix (which is
also available as node view) and a second matrix listing some well-known error
measures.
More interesting, however, are methods to run cross validation or other
validation techniques. KNIME offers those in the form of so-called meta nodes which
encapsulate a series of other nodes. Figure 5.17 shows the inside of such a node.
Besides the node to train a model (a neural network in this case) and apply the
network to unseen data, there are two special nodes: the beginning of the cross
validation loop, which takes care of the repeated partitioning of the data, and the end
node (X-Aggregator), which collects the information from all runs.
These special “loop nodes” are also available individually in KNIME, and
the user can then assemble much more complex looping constructs, but for
convenience, frequently used setups, such as the cross validation shown here, are available
in preconfigured meta nodes.
Fig. 5.16 The use of the scorer node together with the two tables it produces on its outports. One
table holds the confusion matrix, the second output holds some well-known error measures
Fig. 5.17 Preconfigured meta nodes allow one to run cross validation in KNIME. In this workflow
a neural network is repeatedly applied to different partitions of the incoming data
5.6.2 Validation in R
In order to apply the idea of using separate parts of a data set for training and testing,
one needs to select random subsets of the data set. As a very simple example, we
consider the Iris data set that we want to split into training and test sets. The
training set should contain 2/3 of the original data, and the test set 1/3. It
would not be a good idea to take the first 100 records in the Iris data set for training
purposes and the remaining 50 as a test set, since the records in the Iris data set are
ordered with respect to the species. With such a split, all examples of Iris setosa and
Iris versicolor would end up in the training set, but none of Iris virginica, which
would form the test set. Therefore, we need a random sample from the Iris data set. If
the records in the Iris data set were not systematically ordered, but in a random order,
we could just take the first 100 records for training purposes and the remaining 50
as a test set.
Sampling and orderings in R provide a simple way to shuffle a data set, i.e., to
generate a random order of the records.
First, we need to know the number n of records in our data set. Then we generate
a permutation of the numbers 1, . . . , n by sampling from the vector containing the
numbers 1, . . . , n, generated by the R-command c(1:n). We sample n numbers
without replacement from this vector:
Then we define this permutation as an ordering in which the records of our data set
should be ordered and store the shuffled data set in the object iris.shuffled:
Now define how large the fraction for the training set should be—here 2/3—and
take the first two thirds of the data set as a training set and the last third as a test set:
The R-command sample can also be used to generate bootstrap samples by setting
the parameter replace to TRUE instead of F (FALSE).
References
1. Akaike, H.: A new look at the statistical model identification. IEEE Trans. Autom. Control 19,
716–723 (1974)
2. Chernick, M.: Bootstrap Methods: A Practitioner’s Guide. Wiley, New York (1999)
3. Dietterich, T., Kong, E.: Machine learning bias, statistical bias, and statistical variance of deci-
sion tree algorithms. Technical report, Oregon State University, USA (1995)
4. Grünwald, P.: The Minimum Description Length Principle. MIT Press, Cambridge (2007)
5. Hamilton, H.J., Hilderman, R.J.: Knowledge Discovery and Measures of Interest. Springer,
New York (2001)
6. Hand, D., Mannila, H., Smyth, P.: Principles of Data Mining. MIT Press, Cambridge (2001)
7. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, 2nd edn. Springer, New York (2009)
8. Mitchell, T.: Machine Learning. McGraw-Hill, New York (1997)
9. Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978)
Chapter 6
Data Preparation
In the data understanding phase we have explored all available data and carefully
checked if they satisfy our assumptions and correspond to our expectations. We in-
tend to apply various modeling techniques to extract models from the data. Although
we have not yet discussed any modeling technique in greater detail (see Chaps. 7ff),
we have already caught a glimpse of some fundamental techniques and potential pitfalls in
the previous chapter. Before we start modeling, we have to prepare our data set
appropriately, that is, we are going to modify our dataset so that the modeling tech-
niques are best supported but least biased.
The data preparation phase can be subdivided into at least four steps. The first
step is data selection and will be discussed in Sect. 6.1. If multiple datasets are avail-
able, based on the results of the data understanding phase, we may select a subset
of them as a compromise between accessibility and data quality. Within a selected
dataset, we may concentrate on a subset of records (data rows) and attributes (data
columns). We support the subsequent modeling steps best if we remove all useless
information, such as irrelevant or redundant data. The second step involves the cor-
rection of individual fields, which are conjectured to be noisy, apparently wrong or
missing (Sect. 6.2). If something is known, new attributes may be constructed as
hints for the modeling techniques, which then do not have to rediscover the use-
fulness of such transformations themselves. For some modeling techniques, it may
even be necessary to construct new features from existing data to get them running.
Such data transformations will be discussed in Sect. 6.3. Finally, most available
implementations assume that the data is given in a single table, so if data from
multiple tables have to be analyzed jointly, some integration work has to be done
(Sect. 6.5).
Sometimes there is a lot of data available, which sounds good at first, but it
only helps if the data is actually relevant for the given problem. By adding new columns
M.R. Berthold et al., Guide to Intelligent Data Analysis, 115
Texts in Computer Science 42,
DOI 10.1007/978-1-84882-260-3_6, © Springer-Verlag London Limited 2010
with random values to a table of, say, customer transactions, the table may get an
impressive volume but does not carry any useful information. Even worse, these ad-
ditional attributes are considered harmful as it is often difficult to identify them as
being irrelevant (we will illustrate this point in a minute). This is the main reason
why it is important to restrict the analysis process to potentially useful variables,
but other reasons exist such as efficiency: the computational effort often depends
greatly on the number of variables and the number of data records. The results of
the data understanding phase greatly help to identify useless datasets and attributes,
which have to be withheld from the analysis. But for other attributes, things may
not be that clear, or the number of attributes that are classified as potentially useful
is still quite large. In the following, we investigate how to select the most promising
variables (Sect. 6.1.1) and, alternatively, how to construct a few variables that never-
theless carry most of the information contained in all given attributes (Sect. 6.1.2).
Having selected the data columns, we finally discuss which data rows should enter
our analysis in Sect. 6.1.3.
Suppose that we have some target or response variable R that we want to predict.
For the sake of simplicity, let us assume that we have no data that is relevant for
the final outcome of R. What happens if we nevertheless try to build a model that
explains or predicts the value of R given some other irrelevant variables? How will
it perform? We are going to answer this question following a thought experiment.
Suppose that the probability for a positive response is p. We pick some binary
variable A with P (A = yes) = q. In this experiment, A has nothing to do with the
response R, but usually we do not know this in advance. As R and A are
(statistically) independent, we expect the joint probability of observing, say, A = yes
and R = yes to be P(A = yes ∧ R = yes) = P(A = yes) · P(R = yes) = p · q,
as shown in Table 6.1a. Ignoring the variable A and assuming p = 0.4, having no
other information at hand, we would predict the more likely outcome R = no. In the
long run, this prediction would be correct in 60% (= 1 − p) of the cases and wrong in
40% (= p).
Next we take A into account with q = 0.7. How would this additional knowledge
of A influence our prediction of R? When looking at the expected number of records
in Table 6.1b, the knowledge of A does not suggest a different decision: in both
rows of the table (A = yes and A = no) we have P (R = no) > P (R = yes), so the
chances of getting R = no are higher, and this would be our prediction. Clearly, the
irrelevant information did not help to improve the prediction, but at least it did no
harm, at first glance. There is, however, one problem: The true probabilities p
and q are not known in advance but have to be estimated from the data (introducing a
sample error, see Sect. 5.4). Suppose that our data set consists of 50 records as shown
in Table 6.1c. Note that, by pure chance, the sample from the whole population
slightly differs in our database from the expected situation: only 2 records out of 50
are different (in A = no: R = yes instead of R = no). Now there is a slight majority
for R = yes given A = no. If, based on this sample, the prediction of R is changed
to yes given A = no, the rate of accurate predictions goes down from 60% to 52%.
Table 6.1 (a) Probability distribution of independent variables. (b) Expected number of cases for
a total number of n = 50 records. (c) Observed number of cases
Obviously, such situations occur easily if the numbers in the table are relatively
small. If the dataset consists of many more records, say 10,000, the estimates of
the joint probabilities are much more robust—as long as the number of cells in our
table remains constant. The number of cells, however, increases exponentially in the
number of variables; therefore, the number of cases per cell decreases much faster
than we can gather new records: To keep the number of cases per cell constant, we
would have to double the number of records if we add a single binary variable.
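The arithmetic behind this argument can be sketched in a few lines. The function name `cases_per_cell` is hypothetical, and the sketch assumes that all variables are binary and that the records spread evenly over the cells.

```python
def cases_per_cell(n_records: int, n_binary_vars: int) -> float:
    """Average number of records per cell of the joint contingency table."""
    n_cells = 2 ** n_binary_vars  # every additional binary variable doubles the cells
    return n_records / n_cells


# Ten binary variables already spread 10,000 records very thinly.
few_vars = cases_per_cell(10_000, 2)
many_vars = cases_per_cell(10_000, 10)
print(few_vars, many_vars)
```

Adding one binary variable while doubling the number of records leaves the value unchanged, which is the doubling rule stated above.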
Many methods we will discuss in the following chapters come along with their
own weapons against the danger of overfitting, which generally grows with the number
of (irrelevant) attributes. By removing irrelevant (and redundant) attributes in
advance, the variance of the estimated model is reduced, because we have fewer
parameters to estimate, and the chances of overfitting decrease. In some cases, the
application of feature selection methods seems to be more important than the choice
of a modeling technique [16].
The goal of feature selection is to select an optimal subset of the full set of
available attributes A of size n. The more attributes there are, the wider is the range
of possible subsets: the number of subsets increases exponentially in the size of A.
Feature selection typically implies two tasks: (1) the selection of some evaluation
function that enables us to compare two subsets to decide which will perform better
and (2) a strategy (often heuristic in nature) to select (some of) the possible feature
subsets that will be compared against each other via this measure.
Table 6.2 The full dataset consists of 9 repetitions of the four records on the left plus the four
records in the middle, which differ only in the last record. To the right, there are contingency tables
for all four variables vs. the target variable

Records (left, 9×)           Records (middle, 1×)
A  B  C  D  Target           A  B  C  D  Target
+  +  +  −  no               +  +  +  −  no
+  −  +  −  yes              +  −  +  −  yes
−  +  +  −  yes              −  +  +  −  yes
−  −  −  +  no               −  −  +  +  no

Contingency tables (counts per target value)
     A: no  yes    B: no  yes    C: no  yes    D: no  yes
+       10  10        10  10        11  20        10   0
−       10  10        10  10         9   0        10  20
the target attribute. For categorical data, Pearson’s χ²-test (see Sect. A.4.3.4) indicates
the goodness of fit for the observed contingency table (of target and considered
variable) versus the table expected from the marginal observations. A large
deviation from the expected distribution points towards a dependency between the two
attributes, which is desirable for predictive purposes. Another evaluation function is
the information gain or its normalized variants symmetric uncertainty or gain ratio,
which will be discussed in Sect. 8.1. The information gain itself is known to prefer
attributes with a large number of values, which is somewhat compensated by the
normalized variants (see also Sect. 6.3.2).
As an example, consider the dataset shown in Table 6.2. As we can see from the
contingency tables, the values +/− of the attributes A and B are equally distributed
among the target values no/yes and thus appear useless at first glance. When ob-
serving C = −, the target value distribution no : yes is 9 : 0, which indicates a very
clear preference for no and makes C valuable. When observing D = +, the target
value distribution is 10 : 0, which is even slightly better because it covers one more
case. All the mentioned evaluation measures provide the same ranking for this set
of variables: D – C – A, B (with A and B having the same rank).
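The ranking can be checked directly from the contingency tables of Table 6.2. The following is a minimal sketch that computes the χ² statistic and the information gain for each attribute; the function names are hypothetical, and the normalized variants are not covered.

```python
import math

# Contingency tables from Table 6.2: rows are the attribute values (+, -),
# columns are the counts for the target values (no, yes).
TABLES = {
    "A": [[10, 10], [10, 10]],
    "B": [[10, 10], [10, 10]],
    "C": [[11, 20], [9, 0]],
    "D": [[10, 0], [10, 20]],
}


def chi_square(table):
    """Pearson's chi-square statistic: observed vs. expected from the marginals."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def entropy(counts):
    """Shannon entropy (in bits) of a list of counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(table):
    """Entropy of the target minus its expected entropy given the attribute."""
    total = sum(sum(row) for row in table)
    col_sums = [sum(col) for col in zip(*table)]
    conditional = sum(sum(row) / total * entropy(row) for row in table)
    return entropy(col_sums) - conditional


scores = {name: (chi_square(t), information_gain(t)) for name, t in TABLES.items()}
ranking = sorted(scores, key=lambda name: scores[name][0], reverse=True)
print(ranking, scores)
```

Both measures look at one attribute at a time, which is why A and B receive a score of zero here.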
The interesting observation is that in this artificial example the attributes A and B
together are sufficient to perfectly predict the target value, which is impossible with
attributes C and D, each of them individually ranked higher than A and B. Both C
and D are quite good at predicting the target value no, but they do not complement
one another. By our initial assumption the evaluation functions considered so far do
not take the interaction among features into account but look at the performance of
individual variables only. They do not recognize that C and D are almost redundant,
nor do they realize that A and B jointly perform much better than individually. This
holds for all evaluation functions that analyze the contingency table of individual
attributes only.
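The claim about attribute pairs can be verified on the data of Table 6.2. The sketch below, with the hypothetical helper `best_accuracy`, computes the accuracy of the best possible lookup table that uses only a given attribute subset; it measures fit on the full data and is not a validated model.

```python
from collections import Counter, defaultdict

# Table 6.2: 9 repetitions of the left records plus the middle records once.
LEFT = [("+", "+", "+", "-", "no"), ("+", "-", "+", "-", "yes"),
        ("-", "+", "+", "-", "yes"), ("-", "-", "-", "+", "no")]
MIDDLE = LEFT[:3] + [("-", "-", "+", "+", "no")]
DATA = LEFT * 9 + MIDDLE
COLUMNS = {"A": 0, "B": 1, "C": 2, "D": 3}


def best_accuracy(subset):
    """Accuracy when predicting the majority target for each value combination."""
    groups = defaultdict(Counter)
    for record in DATA:
        key = tuple(record[COLUMNS[a]] for a in subset)
        groups[key][record[4]] += 1
    correct = sum(max(counts.values()) for counts in groups.values())
    return correct / len(DATA)


acc_ab = best_accuracy(["A", "B"])
acc_cd = best_accuracy(["C", "D"])
print(acc_ab, acc_cd)
```

A and B together determine the target completely, whereas C and D together do no better than D alone.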
However, we may arrive at different conclusions if we account for the values of
other attributes, say B, C, D, while evaluating a variable, say A, as it is done by
the Relief family of measures. We discuss only one member (Relief, see Table 6.3)
and refer to the literature for the more robust extensions ReliefF and RReliefF [15].
An advantage of the Relief family is that it is easily implemented for numerical and
categorical attributes (whereas many of the aforementioned evaluation measures are
only available for categorical data in most data analysis toolboxes).
Table 6.3 Evaluation of attributes: Relief algorithm. The nearest hit or miss is selected upon the
same diff function (sum over all attributes). Rather than applying the algorithm to all data, it may
be applied to a sample only to reduce the high computational costs of handling big data sets (caused
by the neighborhood search)
Algorithm Relief(D, A, C) → w[·]
input: data set D, |D| = n,
       attribute set A, target variable C ∈ A
output: attribute weights w[A], A ∈ A
1 set all weights w[A] = 0
2 for all records x ∈ D
3   find nearest hit h (same class label x_C = h_C)
4   find nearest miss m (different class label)
5   for all A ∈ A:
6     w[A] = w[A] − diff(A, x, h)/n + diff(A, x, m)/n
where
  diff(A, x, y) = 0 if x_A = y_A, and 1 otherwise
for categorical attributes A and
  diff(A, x, y) = |x_A − y_A| / (max(A) − min(A))
for numerical attributes A.
Intuitively, the Relief measure estimates how much an attribute A may help to
distinguish between the target classes by means of a weight w[A]: if w[A] is large,
the attribute is useful. The weights are determined incrementally and respect the
interaction between the attributes by the concept of “nearest” or “most similar”
records, which takes all attributes (rather than just A) into account. Given some
record x, if there are very similar records h and m with the same or different target
label, the features present in this record are apparently not of much help in predict-
ing the target value (and the weights w[A] for all attributes A remain unchanged
because the positive and negative diff-terms cancel out in line 6). On the other hand,
if the most similar record with the same target label is close (first diff-term is small),
but the most similar record with a different target label is far away (second diff-term
is large), then the weight will increase overall.
In our particular example, since we have duplicated the records in our dataset,
the nearest hit is almost always identical to the selected record x itself; therefore
the contribution of the first diff-term is zero. If we consider the first two records in
Table 6.2, they carry different labels but differ otherwise only in the value of B. This
makes B attractive, because it is the only attribute that helps to distinguish between
both cases. Overall, the ReliefF ranking delivers A, B – D – C.
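The algorithm of Table 6.3 can be sketched for the categorical data of Table 6.2 as follows. This is the basic Relief variant, not ReliefF. Two details are assumptions of this sketch: the record x itself is excluded when searching for the nearest hit, and ties between equally near records are broken by taking the first one in data order.

```python
ATTRIBUTES = ("A", "B", "C", "D")
LEFT = [("+", "+", "+", "-", "no"), ("+", "-", "+", "-", "yes"),
        ("-", "+", "+", "-", "yes"), ("-", "-", "-", "+", "no")]
MIDDLE = LEFT[:3] + [("-", "-", "+", "+", "no")]
# Each record is a pair (attribute values as dict, target label).
RECORDS = [(dict(zip(ATTRIBUTES, r[:4])), r[4]) for r in LEFT * 9 + MIDDLE]


def diff(attribute, x, y):
    """Categorical diff from Table 6.3: 0 if the values agree, 1 otherwise."""
    return 0 if x[attribute] == y[attribute] else 1


def distance(x, y):
    """Sum of diff over all attributes, used for the neighborhood search."""
    return sum(diff(a, x, y) for a in ATTRIBUTES)


def relief(records):
    n = len(records)
    weights = {a: 0.0 for a in ATTRIBUTES}
    for i, (x, label) in enumerate(records):
        others = [(y, lab) for j, (y, lab) in enumerate(records) if j != i]
        hits = [y for y, lab in others if lab == label]
        misses = [y for y, lab in others if lab != label]
        # min() returns the first of several equally near records.
        hit = min(hits, key=lambda y: distance(x, y))
        miss = min(misses, key=lambda y: distance(x, y))
        for a in ATTRIBUTES:
            weights[a] += (diff(a, x, miss) - diff(a, x, hit)) / n
    return weights


weights = relief(RECORDS)
print(weights)
```

Under these assumptions, A and B receive the largest weights, followed by D and then C, in line with the ranking given above. A different tie-breaking rule may shift weight between A and B.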
ber of cases per newly constructed feature remains relatively high, otherwise
we risk overfitting.
Wrapper: Once we have methods available to learn models automatically from data
(see Chaps. 7ff), we may evaluate the subset by the performance of the derived
model itself: For each subset, we build a new model, evaluate it, and consider
the result as an evaluation of the attribute subset. This is known as the
wrapper approach to feature selection. In order to avoid overfitting
the training data, we must carry out the evaluation on held-out validation data.
New measures: As we have already mentioned, we may remove both irrelevant
attributes (carrying no information) and redundant attributes (whose information is
already carried by other attributes). So far we have focussed on the former
by investigating how much information a given attribute contains with respect to the
target class (which should be close to zero in case of irrelevant attributes). To
identify redundant attributes, we may apply the same evaluation function on
attribute pairs: if one attribute tells us everything about another, they are ob-
viously redundant, and one of them may be skipped. This is the idea of the
correlation-based filter (CFS) [7], where a set of features is evaluated by the
following heuristic:
\[
\frac{k\,\bar{r}_{ci}}{\sqrt{k + k(k-1)\,\bar{r}_{ii}}},
\]
where r̄ci is the average correlation of the target attribute with all other at-
tributes in the set, whereas r̄ii is the average attribute–attribute correlation
between different attributes (except the target) and k denotes the number of
attributes. The evaluation function becomes the larger, the better the attributes
correlate with the target attribute and the less they correlate with each other.
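The behavior of this heuristic is easy to explore numerically. The following sketch assumes the two average correlations have already been computed; the function name `cfs_merit` and the example values are hypothetical.

```python
import math


def cfs_merit(k: int, mean_target_corr: float, mean_inter_corr: float) -> float:
    """CFS heuristic: k * r_ci / sqrt(k + k * (k - 1) * r_ii)."""
    return k * mean_target_corr / math.sqrt(k + k * (k - 1) * mean_inter_corr)


# Same correlation with the target, but different redundancy among the attributes.
low_redundancy = cfs_merit(3, 0.5, 0.2)
high_redundancy = cfs_merit(3, 0.5, 0.9)
print(low_redundancy, high_redundancy)
```

With equal correlation to the target, the less redundant attribute set receives the higher score.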
Once a subset evaluation function has been found, a strategy to select subsets
from the full set of attributes has to be found. Two very simple standard procedures
are forward selection and backward elimination. In the former, one starts with a
subset that does not use any attributes (and a model must always predict the majority
class) and then adds attributes in a greedy manner: in each step the attribute that most
improves the subset performance is added. In backward elimination the process is
reversed: one starts with a subset that uses all available attributes and then removes
them in a greedy manner: in each step the attribute whose removal most improves the
subset performance is eliminated. The process stops if there is no attribute that can
be added (forward selection) or removed (backward elimination) without actually
worsening the model performance. An exhaustive search over all possible attribute
sets is usually prohibitive, but at least possible, if only a limited number of attributes
is available. Even a random set generation is possible (Las Vegas Wrapper).
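Forward selection can be sketched as a small greedy loop around an arbitrary evaluation function. Several simplifications here are assumptions: the loop stops as soon as no attribute strictly improves the score, and the evaluation function is the accuracy of a majority-vote lookup table on the data of Table 6.2 itself, which stands in for a learned model evaluated on held-out data.

```python
from collections import Counter, defaultdict

LEFT = [("+", "+", "+", "-", "no"), ("+", "-", "+", "-", "yes"),
        ("-", "+", "+", "-", "yes"), ("-", "-", "-", "+", "no")]
MIDDLE = LEFT[:3] + [("-", "-", "+", "+", "no")]
DATA = LEFT * 9 + MIDDLE
COLUMNS = {"A": 0, "B": 1, "C": 2, "D": 3}


def evaluate(subset):
    """Stand-in for model performance: accuracy of a majority-vote lookup table."""
    groups = defaultdict(Counter)
    for record in DATA:
        key = tuple(record[COLUMNS[a]] for a in sorted(subset))
        groups[key][record[-1]] += 1
    return sum(max(c.values()) for c in groups.values()) / len(DATA)


def forward_selection(attributes, evaluate):
    """Greedily add the attribute that improves the score most."""
    selected = set()
    best_score = evaluate(selected)  # empty subset: always predict the majority class
    while True:
        candidates = [a for a in attributes if a not in selected]
        if not candidates:
            break
        score, attribute = max((evaluate(selected | {a}), a) for a in candidates)
        if score <= best_score:  # assumption: require a strict improvement
            break
        selected.add(attribute)
        best_score = score
    return selected, best_score


selected, score = forward_selection(["A", "B", "C", "D"], evaluate)
print(selected, score)
```

The greedy search never discovers that A and B are valuable together, because neither of them improves the score when added on its own.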
Applied to the data from Table 6.2, a forward selection wrapper with a standard
decision tree learner selects the attribute D only. Exhaustive search of all subsets
identifies the subset A, B, D as the best. In both cases attribute D is included because
of the preference for attributes that correlate directly with the target variable,
which is built into most decision tree learners. On the one hand, having identified
the optimal subset does not help if the learner is not capable of constructing
Principal component analysis (PCA) has already been introduced in Sect. 4.3.2.1
in the context of data understanding. PCA was used for generating scatter plots
from higher-dimensional data and to get an idea of how many intrinsic dimensions
there are in the data by looking at how much of the variance can be preserved by a
projection to a lower dimension. Therefore, PCA can also be used as a dimension-
reduction method for data preparation. In contrast to feature selection, PCA does not
choose a subset of features or attributes, but rather a set of linear combinations of
features. Although this can be very efficient in certain cases, the extracted features
in the form of linear combinations of the attributes are often difficult to interpret, so
that the later steps will automatically result in a black-box model.
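As an illustration of PCA as a dimension-reduction step, the following sketch extracts only the first principal component by power iteration on the covariance matrix. It uses plain Python to stay self-contained; the function name and the toy data are hypothetical, and a numerical library would normally be used instead.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (direction, explained variance ratio, projected scores)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # start vector, assumed not orthogonal to the solution
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    total = sum(cov[i][i] for i in range(d))
    # The new feature is a linear combination of the original attributes.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, variance / total, scores


data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
direction, explained, scores = first_principal_component(data)
print(direction, explained)
```

Because the toy data lie almost on a line, a single linear combination of the two attributes preserves nearly all of the variance.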
PCA belongs to a more general class of techniques called factor analysis. Fac-
tor analysis aims at explaining observed attributes as a linear combination of un-
observed attributes. In PCA, the (first) principal components are the corresponding
factors. Independent component analysis (ICA) is another method to identify such
unobserved variables that can explain the observed attributes. In contrast to factor
analysis, ICA drops the restriction to linear combinations of the unobserved vari-
ables.
Dimension-reduction techniques that construct an explicit, but not necessarily
linear, mapping from the high-dimensional space to the low-dimensional space are
often called nonlinear PCA, even though they might not aim at preserving the variance
like PCA, but rather the distances like MDS. Examples of such approaches
can be found in [8, 9, 13, 14].
Except for very rare occasions, the available data is already a sample of the whole
population. If we had started our analysis a few weeks earlier, the sample would
have looked different (smaller, in particular) but probably still representative of the
whole population. Likewise we may create a subsample from our data and use it for
the analysis instead of the full sample, e.g., for the sake of faster computation or
the need for a withheld test dataset (see Sect. 5.5.1). Other reasons for using only a
subsample include:
(Sect. 9.3). Instance selection for clustering tasks often involves some condensation
or compression by choosing one record that represents a number of very similar
records in its neighborhood. Often the number of records it represents is included as
an additional weight attribute, which gets higher the more data it substitutes. Many
clustering algorithms (such as k-means clustering, Sect. 7.3) are easily extended to
process this additional weight information appropriately. The instance selection can
be considered as some kind of preclustering as it identifies representative cases that
may stand for a small group of data (a small cluster). The small clusters may then
be aggregated further in subsequent steps.
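One simple way to condense data into weighted representatives is sketched below. The grid-based grouping, the use of the cell mean as representative, and all names are assumptions of this sketch; other condensation schemes are equally possible.

```python
from collections import defaultdict


def condense(points, cell_size):
    """Replace all points that fall into one grid cell by their mean and a weight."""
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(c // cell_size) for c in p)
        cells[key].append(p)
    representatives = []
    for members in cells.values():
        mean = tuple(sum(c) / len(members) for c in zip(*members))
        representatives.append((mean, len(members)))  # weight = records represented
    return representatives


def weighted_centroid(representatives):
    """Centroid that respects the weight attribute, as a weighted k-means step would."""
    total = sum(w for _, w in representatives)
    dims = len(representatives[0][0])
    return tuple(sum(p[d] * w for p, w in representatives) / total
                 for d in range(dims))


points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (5.0, 5.1), (5.2, 5.3)]
reps = condense(points, cell_size=1.0)
centroid = weighted_centroid(reps)
print(reps, centroid)
```

Because the weights are carried along, the centroid of the two representatives equals the centroid of the five original points.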
1 The reduction of inflected or derived words to their root (or stem) is called stemming. So-called
While humans are usually very good at spotting and correcting noise, the problem
is that it may occur in many variants, which makes an automatic recovery very
difficult. It is therefore extremely helpful to define rules and patterns that correctly
spelled values have to match (e.g., regular expressions for text and valid ranges for
numerical values), so that the data can be filtered and conversion rules may be de-
fined case-by-case whenever a new rule violation occurs. If additional data will be
considered later, these rules prevent us from repeating a manual analysis over and
over again.
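Such rules can be kept as small, reusable checks. In the following sketch the field names, the five-digit zip code pattern, and the temperature range are purely hypothetical examples of rules an analyst might define.

```python
import re

ZIP_PATTERN = re.compile(r"\d{5}")   # assumed format: exactly five digits
TEMPERATURE_RANGE = (-50.0, 60.0)    # assumed plausible range in degrees Celsius


def valid_zip(value: str) -> bool:
    return ZIP_PATTERN.fullmatch(value) is not None


def valid_temperature(value: str) -> bool:
    try:
        number = float(value)
    except ValueError:
        return False
    return TEMPERATURE_RANGE[0] <= number <= TEMPERATURE_RANGE[1]


def violations(records):
    """Return (row index, field name) for every value that breaks a rule."""
    checks = {"zip": valid_zip, "temperature": valid_temperature}
    return [(i, field) for i, rec in enumerate(records)
            for field, check in checks.items() if not check(rec[field])]


records = [
    {"zip": "14853", "temperature": "21.5"},
    {"zip": "1485", "temperature": "19.0"},
    {"zip": "14853", "temperature": "215"},
    {"zip": "14853", "temperature": "n/a"},
]
found = violations(records)
print(found)
```

The reported violations can then be inspected case by case, and the same checks can be rerun unchanged when new data arrives.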
While the steps mentioned so far were concerned with single records only, other
errors may be recognized only if multiple records are investigated. The most promi-
nent example is the discovery of duplicate records, e.g., in the table of customers
when creating customer profiles. This problem is known as record linkage or en-
tity resolution. Methods for detecting duplicates utilize some measure of similarity
between records (which will be discussed in Sect. 7.2); two records that are highly
similar are then considered as variants of the same entity and are merged into a single
record. The greatest difficulty is to avoid mistakenly merging two different
entities (false positives).
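A minimal sketch of duplicate detection on a single name field is given below. It uses the string similarity from Python's standard `difflib` module as a stand-in for the similarity measures of Sect. 7.2; the threshold of 0.85 and the customer names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_duplicates(names, threshold=0.85):
    """Return all pairs whose similarity reaches the (assumed) threshold."""
    return [(a, b) for a, b in combinations(names, 2)
            if similarity(a, b) >= threshold]


customers = ["John Smith", "Jon Smith", "Mary Jones", "Peter Miller"]
pairs = candidate_duplicates(customers)
print(pairs)
```

The result is a list of candidates only. Lowering the threshold finds more true duplicates but also raises the risk of false positives, so a manual check before merging is advisable.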
It is always a good idea to document changes, such as the correction of an out-
lier, because it may turn out later that the entry is not defective at all but points out
some rare, special case. An artificial attribute degree of trustworthiness may be in-
troduced, which is 1.0 by default but decreased whenever some correction is applied
to the record that involves heuristic guesswork, be it outlier correction or missing
value treatment, which will be discussed in the next section. Methods which are nu-
merical in nature frequently offer the possibility to take an additional weight into
account, such that suspicious, less trustworthy records get less influence on the re-
sulting model (see also robust regression and M-estimation in Sect. 8.3.3.3).
Correcting all possible errors is a very time-consuming task; and if the data turns
out to be useless for the task at hand, it is definitely not worth the effort. If potentially
useful attributes have been identified early, efforts in data cleaning can be geared
towards those attributes that pay off.
Why is it necessary to treat missing data, anyway? If the data is not in the database,
this is because we do not know anything about the true value—so there is not much
we can do about it—besides guessing the true value, but this seems to be even worse
than keeping the missing data, because it almost certainly introduces errors.
The reason why we nevertheless may want to change the missing fields is that
the implementations of some methods simply cannot deal with empty fields. Then
the imputation of estimated values is a simple means of getting the program to run and
deliver some result, which is often, although affected by estimation errors, better
than having no result at all. Another option would be to remove records with missing
values completely, so that only complete records remain, and the program does not
run into any problems either. In case of nominal attributes there is a third possibility:
we may simply make the missing value explicit by adding another possible outcome,
say MISSING, to its range. Let us consider all these options individually.
Ignorance/Deletion In case deletion, all records that contain at least one missing
value are removed completely and are thus not considered in subsequent steps.
Although this procedure is easy to carry out, it also removes a lot of information
(from the nonempty attributes), especially if the number of attributes is large. It
may be safely applied in case of missing completely at random (see Sect. 4.6), as
long as the remaining dataset is still sufficiently large. As soon as we deal with
missing at random (MAR), we may start to seriously distort the data. To continue
the example from Sect. 4.6, where missing temperature values due to empty bat-
teries are less likely to get fixed when it is raining, case deletion will remove more
rainy, cold days than sunny, warm days, which may lead to biased estimates else-
where. A removal of records may therefore threaten the representativeness of the
dataset.
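The distortion caused by case deletion can be made visible with a small simulation of the temperature example. All numbers below (temperature distributions, missingness rates, sample size, seed) are invented for illustration only.

```python
import random


def simulate(n=10_000, seed=0):
    """Compare the mean temperature before and after case deletion."""
    rng = random.Random(seed)
    all_temps, kept_temps = [], []
    for _ in range(n):
        rainy = rng.random() < 0.5
        temp = rng.gauss(10.0, 2.0) if rainy else rng.gauss(25.0, 2.0)
        # Missingness depends on the weather, not on the temperature itself.
        missing = rng.random() < (0.6 if rainy else 0.1)
        all_temps.append(temp)
        if not missing:
            kept_temps.append(temp)
    return sum(all_temps) / len(all_temps), sum(kept_temps) / len(kept_temps)


true_mean, mean_after_deletion = simulate()
print(true_mean, mean_after_deletion)
```

Since rainy, cold days are removed more often, the mean of the remaining records overestimates the true mean temperature.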
Lossless preparation
Construct data Construct new attributes that indicate imputation, as this may carry
important information itself
Explicit Value or Variable A very simple approach is to replace the missing val-
ues by some new value, say MISSING. This is only possible for nominal attributes,
as any kind of distance to or computation with this value is undefined and mean-
ingless. If the fact that the value is missing carries important information about the
value itself (nonignorable missing values), the explicit introduction of a new value
may actually be advisable, because it may express an intention that is not recover-
able from the other attributes. If we suppose that the absence of the value and the
intention correlate, we may luckily capture this situation by a new constant, but the
problem is that we cannot assure this from the data itself. If the values are miss-
ing completely at random, there is no need to introduce a new constant. The cases
that are marked by a constant do not exhibit any deviation from those cases that
are not marked. If we introduce it nevertheless, the models will have to deal with
an additional, useless value, which makes the models unnecessarily complicated and
estimates less robust.
A better approach is to introduce a new (binary) variable that simply indicates
that the field was missing in the original dataset (and then impute a value). In those
cases where neither the measured nor the estimated values really help, but the fact
that the data was missing represents the most important bit of information, this at-
tribute preserves all chances of discovering this relationship: the newly introduced
attribute will turn out to be informative and useful during modeling. On the other
hand, if no such missing value pattern is present, the original variable (with imputed
values) can be used without any perturbing MISSING entries.
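The indicator approach can be sketched as follows. Mean imputation is used here only as a simple placeholder for whatever imputation method is chosen; the function name and the example values are hypothetical.

```python
def impute_with_indicator(values):
    """Return the imputed column and a binary 'was missing' indicator column."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)  # placeholder imputation: mean of observed
    imputed = [mean if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing


temperature = [21.0, None, 19.0, None, 23.0]
imputed, was_missing = impute_with_indicator(temperature)
print(imputed, was_missing)
```

Both columns are then passed on to modeling, so that a relationship between missingness and the target can still be discovered.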
Scale Conversion If the modeling algorithm handles only certain types of data, we
have to transform the respective attributes such that they fit the needs of the model-
ing tool. Some techniques assume that all attributes are numerical (e.g., regression
in Sect. 8.3, neural networks in Sect. 9.2), so we either ignore the categorical at-
tributes or have to apply some transformation first. Simply assigning a numerical
value to each of the possible (categorical) values is not an option because typical
operations on numerical values (such as averaging or comparison) are not necessar-
ily meaningful for the assigned numerical values. A typical solution is to convert
nominal and ordinal attributes into a set of binary attributes, as will be discussed
in Sect. 7.2.
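One common scheme for such a conversion uses one binary attribute per categorical value. The sketch below illustrates this scheme for a nominal attribute; it is an assumption that this matches the encoding discussed in Sect. 7.2, and the function name is hypothetical.

```python
def one_hot(values):
    """Encode a nominal attribute as one binary attribute per distinct value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


categories, rows = one_hot(["red", "green", "blue", "green"])
print(categories)
print(rows)
```

In contrast to assigning the numbers 1, 2, 3 to the colors, this encoding does not impose an order or a distance between the values.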
Other methods prefer categorical attributes (like Bayesian classifiers² in Sect. 8.2)
or perform better when a discretization is carried out beforehand [2, 17]. Discretization
means that the range of possible numerical values is subdivided into
several contiguous, nonoverlapping intervals and a label of the respective interval
replaces the original value. There are many (more or less intelligent) ways to define
the boundaries of the intervals. An equi-width discretization selects the boundaries
such that all intervals are of the same width (see page 41 for selecting a reasonable
number of bins). As we can see from Fig. 6.1(a) for the sepal length attribute of
the Iris dataset, it may happen that some intervals contain almost no data. An equi-
frequency discretization tries to assure that all intervals contain the same number
of data objects. Of course an equally distributed frequency is not always achievable.
For example, if the dataset with 100 cases contains only 4 distinct numerical values
and a partition into 5 intervals is desired, there are not enough split points. Again,
such a partition does not necessarily represent the underlying data distribution ap-
propriately: in the partition of the sepal length attribute in Fig. 6.1(b), the compact
group of similar data (left) is separated into two intervals; in particular the region of
highest density is split into two intervals.
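The two discretization schemes can be compared on a small skewed sample. The sketch below is a simplified illustration: the function names are hypothetical, and the equi-frequency boundaries are taken directly from the sorted values, which is only one of several possible conventions.

```python
from bisect import bisect_right


def equi_width_edges(values, k):
    """Inner boundaries of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def equi_frequency_edges(values, k):
    """Inner boundaries chosen so that intervals hold roughly equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]


def bin_counts(values, edges):
    """Number of values per interval; a value equal to an edge goes to the right."""
    bins = [bisect_right(edges, v) for v in values]
    return [bins.count(b) for b in range(len(edges) + 1)]


values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
ew_counts = bin_counts(values, equi_width_edges(values, 3))
ef_counts = bin_counts(values, equi_frequency_edges(values, 3))
print(ew_counts, ef_counts)
```

With equal widths the single large value leaves one interval empty, while the equi-frequency boundaries cut through the dense region of the data.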
2 Bayesian classifiers can handle numerical data directly by imposing some assumptions on their
dynamic domains, we must be prepared to handle values unseen before. One solu-
tion is to introduce a new variable, which is an abstraction of the original one. By
mapping the original values to more abstract or general values we can hope that
newly occurring entries can also be mapped to the generalized terms so that the
model remains applicable. In a data warehouse such a granularization scheme may
already be installed; otherwise we have to introduce it and make sure that newly
arriving values are mapped into the hierarchy first.
Another reason for transforming attributes is to ensure that the influence or impor-
tance of every attribute is a priori the same. This is not automatically the case, as the
next examples will show.
When looking for two natural groups in the two datasets of Fig. 6.2 (which is the
task of clustering, see Chap. 7), most people would probably suggest the groups in-
dicated by dotted lines. The rationale for both groupings is that the elements within
each group appear to be more similar than elements of other groups. However, any
notion of similarity or distance strongly depends on the scaling used: both graphs
in Fig. 6.2 actually display the same dataset, only the scaling of the x-axis has been
changed from hour to minute. If the distance between the elements is taken as an
indication of their similarity, a different scaling may change the whole grouping. In
practical applications with considerable background knowledge, one may be in the
comfortable position to conclude that a difference of 1.0 in attribute A is as impor-
tant or relevant as a difference of 2.5 in attribute B. In such a case, all attributes
should be rescaled accordingly so that in the rescaled variables a distance of 1.0 is
gain) are sensitive to such effects and thus implicitly prefer attributes with a larger
number of values (just because memorizing is then much easier). A model, such as
a decision tree, is very likely to evaluate poorly if such a variable is chosen. Rather
than removing such attributes completely, their range of values should be reduced
to a moderate size. This corresponds to selecting a coarser granularity of the data.
For instance, the very large number of individual (full) zip-codes should be reduced
to some regional or area information. The website access timestamp (which even
includes seconds) may be reduced to day-of-week and am/pm. The binning tech-
niques mentioned earlier for discretizing numerical attributes may also be helpful.
If there is useful information contained in these attributes, it is very likely that it is
retained in the coarser representation, but the risk of overfitting is greatly reduced.
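The two coarsening examples can be sketched as simple mapping functions. The function names, the number of zip code digits retained, and the sample values are hypothetical choices.

```python
from datetime import datetime

DAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")


def coarsen_timestamp(ts: datetime):
    """Reduce a full timestamp to day-of-week and am/pm."""
    return DAYS[ts.weekday()], "am" if ts.hour < 12 else "pm"


def coarsen_zip(zip_code: str, digits: int = 2) -> str:
    """Keep only the leading digits as a rough regional code (assumed scheme)."""
    return zip_code[:digits]


access = coarsen_timestamp(datetime(2023, 5, 15, 14, 30, 5))
region = coarsen_zip("14853")
print(access, region)
```

The timestamp attribute is thereby reduced from a practically unlimited number of values to at most 14 combinations.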
Another typical assumption is that some variables obey a certain probability dis-
tribution, which is not necessarily the case in practice. If the assumption of a Gaus-
sian distribution is not met, we may transform the variable to better suit this require-
ment. This should be a last resort, because the interpretability of the transformed
variable suffers. Typical transformations include the application of the square root,
logarithm, or inverse when there is moderate, substantial, or extreme skewness.
(If the variable’s domain includes the zero, it has to be shifted first by adding an
appropriate constant before applying the inverse or logarithm.) Another choice is
the power transform: Given the mean value ȳ, the power transform
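\[
y \;\mapsto\;
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda\,\bar{y}^{\,\lambda-1}} & \text{if } \lambda \neq 0,\\[1ex]
\bar{y}\,\log y & \text{if } \lambda = 0,
\end{cases}
\]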
transforms the data monotonically to better approximate a normal distribution. The
parameter λ is optimized for the problem at hand, e.g., such that the sum of squares
of residuals in regression problems is minimized.
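The power transform can be sketched as a short function that takes ȳ as an explicit argument. In the example, ȳ is set to the geometric mean of the data, which is the usual choice in the normalized Box-Cox transform; this choice and the function names are assumptions of the sketch, and all values must be positive.

```python
import math


def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))


def power_transform(values, lam, y_bar):
    """Apply the power transform with parameter lam and normalization y_bar."""
    if lam == 0:
        return [y_bar * math.log(y) for y in values]
    scale = lam * y_bar ** (lam - 1)
    return [(y ** lam - 1) / scale for y in values]


data = [1.0, 2.0, 4.0, 8.0]
y_bar = geometric_mean(data)
log_case = power_transform(data, 0, y_bar)
near_log_case = power_transform(data, 1e-6, y_bar)
print(log_case, near_log_case)
```

The two cases of the definition fit together: for λ close to zero, the first branch approaches the logarithmic branch. A search over λ, for instance on a grid of candidate values, would call this function repeatedly and keep the value that minimizes the chosen criterion.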
Text Data Analysis Textual data is somewhat difficult to represent in a data ta-
ble. Among the possible representations, the vector model is most frequently used.
The individual words are reduced to their stem and very frequent and very rare
(stemmed) words are removed. The remaining words are then represented by a num-
ber of boolean attributes, indicating that the word occurred in the text or not. This
approach is also known as bag-of-words representation because it is usually easier
to simply enumerate the words contained in the vector instead of storing the en-
tire (usually very sparse) vector itself. Once we have reduced texts/documents to
such a vector-based representation, we can apply the methods presented later in the
book without change. A good summary can be found in [5]. Under the umbrella of
information retrieval, a lot of work has been undertaken as well.
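A bare-bones version of this representation can be sketched as follows. Stemming is omitted because it requires a language-specific stemmer; the frequency thresholds, the whitespace tokenization, and the toy documents are hypothetical.

```python
from collections import Counter


def bag_of_words(documents, min_df=2, max_df_ratio=0.9):
    """Return vocabulary, word bags, and boolean vectors for the documents."""
    tokenized = [set(doc.lower().split()) for doc in documents]
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for words in tokenized for word in words)
    n = len(documents)
    # Drop very rare and very frequent words.
    vocabulary = sorted(w for w, c in df.items()
                        if c >= min_df and c / n <= max_df_ratio)
    bags = [sorted(words & set(vocabulary)) for words in tokenized]
    vectors = [[1 if w in words else 0 for w in vocabulary] for words in tokenized]
    return vocabulary, bags, vectors


docs = ["the cat sat on the mat", "the dog sat on the log", "the cat chased the dog"]
vocabulary, bags, vectors = bag_of_words(docs)
print(vocabulary)
print(bags)
print(vectors)
```

The `bags` list is the compact enumeration of contained words, while `vectors` holds the equivalent boolean attributes.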
Graph Data Analysis The analysis of large graphs and also the analysis of many
collections of graphs poses interesting challenges: from finding common patterns
of similar structure in social networks that can help to identify groups of similar
interest to the identification of reoccurring elements in molecular databases that
could potentially lead to the discovery of a new molecular theme for a particular
medication. Network analysis is often performed on abstractions, describing degrees
of connectivity, and then proceeds by using standard techniques. However, a number
of approaches also exist that operate on the network structure directly. Prominent
examples are frequent subgraph mining approaches, which identify graph patterns that
occur sufficiently often or are sufficiently discriminative. These methods follow the
frequent itemset mining approaches described later in this book (see Chap. 7) but
adapt those methods to directly operate on the underlying graph or graph database.
In [1] a good introduction to the analysis of graph data is given. We also discuss
one example of a molecular graph algorithm in more detail in Sect. 7.6.3.4 when we
introduce item mining algorithms.
Image Data Analysis Analyzing large numbers of images often requires a mix
of both methods. In order to find objects or regions of interest, one needs to ap-
ply sophisticated image segmentation approaches. Often already here analysis steps
are involved to semiautomatically find the best segmentation technique. Once this
is done, various types of image feature computations can be applied to describe
brightness, shape, texture, or other properties of the extracted regions. On these fea-
tures classical data analysis routines are then applied. The analysis of these types of
data emphasizes an interesting additional problem in data analysis, which is often
ignored: each object can be described in many different ways, and it is not always
clear from the beginning which of these representations is best suited for the anal-
ysis at hand. Finding out automatically which descriptors to use goes further than
feature selection discussed earlier in this chapter (Sect. 6.1.1) as we cannot select
from various subsets of features accompanied by different semantics. [12] offers a
good summary of the different aspects of image mining.
Other Data Types Of course, plenty of other types of data exist: temporal data
from stock indices, process control applications, etc. Lots of work has been done in
the analysis of movies and other multimedia data such as music and speech. But
complex data is also generated by various ways of recording data, such as weblogs and
other recordings of user interactions. Complex data also arises during observations
of various types of populations (e.g., cattle herds). From an analysis point of
view, the next big challenge lies in the combined analysis of separate information
sources and types.
Almost all of the discussions in the remainder of this book focus on the modeling or
analysis of data that arrives in one nice, well-defined table. In the previous sections
of this chapter we have discussed how we can prepare the structure of this table by
selecting, cleaning, or constructing data, but what if our data is spread out over numerous
repositories? In reality, data recording has typically grown over time, and various
departments have chosen their own recording mechanisms. Worse, a merger during the
lifetime of our corporate database may have thrown together two completely independently
developed database setups that contain similar, but not quite the same, data.
If we now want to analyze the joint customer base, we may want to merge the two
(or more) customer databases into a uniform one first. This type of data integration
is called vertical data integration since we are really interested in concatenating two
tables holding essentially the same information. But already in this simple case we
will run into many annoying problems as we will see in the following section. The
other type of integration asks for combining different types of information and is
called horizontal data integration. Here we aim to enrich an existing table of, e.g.,
customers with information from another database such as purchase behavior which
was recorded independently. Essentially we are aiming to concatenate the entries
from one database with the entries of the second one, somehow smartly matching
which entries should be appended to each other.
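For the vertical case, the core step is to bring both tables to a common schema before concatenating them. The sketch below shows this for two tiny customer tables; all column names and values are hypothetical, and real integration would also have to reconcile value formats and duplicates.

```python
def vertical_integrate(table_a, table_b, rename_b):
    """Concatenate two tables after renaming the columns of the second one."""
    unified_b = [{rename_b.get(key, key): value for key, value in row.items()}
                 for row in table_b]
    return table_a + unified_b


customers_a = [{"customer_id": 1, "name": "Ann"}]
customers_b = [{"cust_no": 7, "full_name": "Bob"}]
merged = vertical_integrate(
    customers_a, customers_b, {"cust_no": "customer_id", "full_name": "name"}
)
print(merged)
```

Horizontal integration, in contrast, adds columns rather than rows and therefore needs a matching rule between the entries of the two tables.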
An example of horizontal data integration was discussed before, using our example
customer/shopping data. If we have separate customer and purchase databases but
are, for instance, interested in which purchases were actually done by females, we
need to pull information from these datasets together. In order to do such a join,
we will require some identifier which allows us to identify rows in both tables that
“belong together.”
Table 6.5 The two datasets on top contain information about customers and product purchases.
The joint dataset at the bottom combines these two tables. Note how we lose information about
individual customers and how a lot of duplicate information is introduced. In reality this effect is,
of course, far more dramatic
Table 6.5 illustrates how this works. Here the identifier used for the join is the
customer ID. Note that it is actually named differently in the two tables, so we
will need to inform the join operator which two columns to use to join the tables.
There are, of course, other ways to join tables which we will discuss in a minute.
However, let us first discuss the two most problematic issues related to joins in
general: overrepresentation and data explosion.
• overrepresentation of items is (although only mildly) already visible in our example
table above. If we were to use the resulting table to determine the gender
distribution of our shoppers, we would obtain a substantially higher proportion of male
shoppers than is evident from our table of shoppers. So here we would really
answer the question “what is the number of items purchased by male (female)
shoppers” and not a question related to shoppers.
• data explosion is also visible above: we already see a few duplicate entries which
are clearly not necessary to keep all essential information. Finding a more compact
database setup, i.e., splitting a big database into several smaller ones, to
avoid such redundancies is an important aspect of database normalization, where
the goal is to ensure that database entries depend on the key only.
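The overrepresentation effect can be made concrete with a small sketch. The two tables below are hypothetical (they are not the values of Table 6.5), and the column names `customer_id` and `shopper_id` are assumptions chosen only to mirror the differently named identifiers discussed above.

```python
from collections import Counter

# Hypothetical customer and purchase tables (NOT the values of Table 6.5).
customers = [
    {"customer_id": 1, "gender": "f"},
    {"customer_id": 2, "gender": "m"},
    {"customer_id": 3, "gender": "f"},
]
purchases = [
    {"shopper_id": 2, "item_id": 10, "price": 4.99},
    {"shopper_id": 2, "item_id": 11, "price": 1.50},
    {"shopper_id": 2, "item_id": 12, "price": 7.25},
    {"shopper_id": 1, "item_id": 10, "price": 4.99},
]

gender_by_id = {c["customer_id"]: c["gender"] for c in customers}

# Join on customer_id == shopper_id: one output row per purchase.
joined = [
    {**p, "gender": gender_by_id[p["shopper_id"]]}
    for p in purchases
    if p["shopper_id"] in gender_by_id
]

# Gender distribution counted per shopper versus per joined row.
per_shopper = Counter(c["gender"] for c in customers)
per_joined_row = Counter(row["gender"] for row in joined)
print(per_shopper, per_joined_row)
```

Counting on the joined table answers a question about purchased items, not about shoppers: the single male shopper dominates the joined rows, and customer 3, who bought nothing, disappears entirely.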
Joins, as illustrated above, are only a very simple case of what is possible in real
systems. Most database systems allow one to use predicates to define which rows
are to be joined and more generic data analysis systems often allow one to at least
choose one or more attributes as join identifiers and select the four main join types:
inner, left, right, and outer joins:
• inner join: creates a row in the output table only if at least one entry in the left
and right tables can be found with matching identifiers.
• left join: creates at least one row in the output table for each row of the left input
table. If no matching entry in the right table can be found, the output table is filled
with missing or null values. So we could make sure that each customer appears
at least once, even if the customer has not made any purchase.
• right join: similar to the left join, but at least one row is created for each row in
the right table.
• outer join: creates at least one row for every row in the left and right table. If no
matching entries can be found in the other table, the corresponding entries are
filled with null or missing values.
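The four join types listed above can be sketched in a few lines. This is a minimal illustration on lists of dictionaries, not how a database system implements joins; the function name `join`, the table contents, and the assumption that the two tables have no column names in common are all hypothetical.

```python
def join(left, right, left_key, right_key, how="inner"):
    """Join two lists of dict rows; `how` is 'inner', 'left', 'right' or 'outer'."""
    left_cols = list(left[0]) if left else []
    right_cols = list(right[0]) if right else []
    # Index the right table by its join identifier.
    by_key = {}
    for rrow in right:
        by_key.setdefault(rrow[right_key], []).append(rrow)
    out, matched = [], set()
    for lrow in left:
        partners = by_key.get(lrow[left_key], [])
        for rrow in partners:
            out.append({**lrow, **rrow})
            matched.add(id(rrow))
        # Left/outer join: keep unmatched left rows, fill right columns with None.
        if not partners and how in ("left", "outer"):
            out.append({**lrow, **dict.fromkeys(right_cols)})
    # Right/outer join: keep unmatched right rows, fill left columns with None.
    if how in ("right", "outer"):
        for rrow in right:
            if id(rrow) not in matched:
                out.append({**dict.fromkeys(left_cols), **rrow})
    return out

customers = [
    {"customer_id": 1, "gender": "f"},
    {"customer_id": 2, "gender": "m"},
    {"customer_id": 3, "gender": "f"},
]
purchases = [
    {"shopper_id": 2, "item_id": 10, "price": 4.99},
    {"shopper_id": 2, "item_id": 11, "price": 1.50},
    {"shopper_id": 9, "item_id": 12, "price": 7.25},  # unknown customer
]

sizes = {
    how: len(join(customers, purchases, "customer_id", "shopper_id", how))
    for how in ("inner", "left", "right", "outer")
}
print(sizes)
```

In this toy data, only the left and outer joins keep the customers without purchases, and only the right and outer joins keep the purchase whose shopper is missing from the customer table.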
Implementing join operations is not an easy task, and various efficient algorithms
have been developed. Especially on large tables, when not all data fits into main
memory anymore, smart methods need to be used to allow for such out-of-memory
joining. Databases are usually optimized to pick the right algorithm for the tables at
hand, so the data analyst does not need to worry about this. However, being aware
that complex SQL statements may launch hidden joins does not hurt.
Integrating data is, of course, a brute-force way of transforming a problem into
one that we can then deal with using methods developed for single tables. Mining
relational databases, which should actually be called “mining multirelational
databases,” is a research area which deals with finding information directly in such
distributed tables. However, the methods developed so far are rather specialized, and
in the following we will continue to assume the existence of one nice, well-formed
table holding all the relevant data. This is, of course, a drastic simplification, and
quite often, the integration and cleaning of data can easily require much more time
and effort than the analysis itself.
Data preparation is the most ignored aspect in data analysis, and this is also visible
in many data analysis software packages. They often assume the existence of a nice
file representation or a flat table in a database. Visual data-flow-oriented tools such
as KNIME offer the ability to prepare the data intuitively, whereas command line
or script-based tools are often a lot harder to use for data preparation. One of the
important aspects here is not so much the possibility of doing data preparation in the
data analysis tool of choice but the repeatability or reproducibility. If a data analysis
process is to be used in a production environment, it is often necessary to rerun the
analysis periodically or whenever new data becomes available. Having a clear and
intuitive way to model and document the data preparation phase in conjunction with
the actual analysis is then a serious benefit.
6.6 Data Preparation in Practice 139
KNIME offers a vast array of nodes for all kinds of data preparation. Most of them
can be found in the category Data Manipulation where there are nodes for binning,
converting/replacing, filtering, and transforming columns, rows, and entire tables.
Data selection tasks can be modeled by filtering out columns using the Column
Filter node. For row filtering, many more modules are available which filter rows
based on certain criteria applied to individual cells (Row Filter). These criteria can
be regular expressions for string cells or range or other constraints on numerical
attributes. In addition, the Nominal Value Row Filter allows one to select from the
range of nominal values to determine which rows should be filtered out. Those nodes
also come in a splitter-variant which does not filter out rows but instead has a second
output port containing the table with the remaining elements. Those—and more—
functions are all contained within individual nodes, so we skip the corresponding
very simple examples.
However, automated versions of column or feature selection such as the ones de-
scribed in Sect. 6.1.1 are worth being discussed in a bit more detail as they demon-
strate the interesting and very powerful loop-feature of KNIME. KNIME brings
along the framework to run wrapper approaches for feature selection using arbitrary
models. Figure 6.4 shows a workflow performing backward feature elimination using
a decision tree as the underlying model. The flow starts with a loop node which
creates different subsets of features by filtering out all other columns. The reduced
data is then fed into a data partitioner, which splits it into training and validation
data. Afterwards we learn a model (here a decision tree) on the training data and
apply it to the validation data. The Loop end node (Backward Feature Elimination
End) calculates the accuracy of the resulting prediction and remembers this for all
features that were involved in this run. The loop is now executed several times (the
number of runs and number of features per run can be adjusted in the dialog of the
loop-start node), and finally a column filter model is produced, which can be used
to filter all but the top k features/columns from the second table using the Backward
Feature Elimination Filter node.
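The wrapper loop described above can be sketched outside of KNIME as follows. This is only an illustration of the idea, not KNIME's implementation: a 1-nearest-neighbour classifier stands in for the decision tree to keep the sketch short, and the function names, the feature names `signal` and `noise`, and the tiny data set are all hypothetical.

```python
def one_nn_accuracy(train, valid, features):
    """Validation accuracy of a 1-nearest-neighbour stand-in model on a feature subset."""
    def dist(u, v):
        return sum((u[f] - v[f]) ** 2 for f in features)
    hits = 0
    for x, label in valid:
        _, predicted = min(train, key=lambda t: dist(t[0], x))
        hits += predicted == label
    return hits / len(valid)

def backward_elimination(train, valid, features, score=one_nn_accuracy):
    """Drop one feature per round, always the one whose removal keeps the score highest."""
    current = list(features)
    history = [(tuple(current), score(train, valid, current))]
    while len(current) > 1:
        candidates = [[f for f in current if f != drop] for drop in current]
        current = max(candidates, key=lambda subset: score(train, valid, subset))
        history.append((tuple(current), score(train, valid, current)))
    return history

# Hypothetical data: "signal" separates the classes, "noise" is misleading.
train = [({"signal": 0.0, "noise": 0.0}, "A"), ({"signal": 1.0, "noise": 3.0}, "B")]
valid = [({"signal": 0.1, "noise": 3.0}, "A"), ({"signal": 0.9, "noise": 0.0}, "B")]

history = backward_elimination(train, valid, ["signal", "noise"])
for subset, accuracy in history:
    print(subset, accuracy)
```

The recorded history plays the role of the column filter model: it tells us, for each subset size, which features survived and how well the model performed on the validation data.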
Fig. 6.4 An example for feature selection in KNIME using a decision tree as the underlying model
to evaluate feature importance
Fig. 6.6 An example for data preparation in KNIME. We join a number of tables and compute
aggregate information in the Group By nodes. The resulting table is finally fed into a clustering
node
Nodes for constructing and converting data are available in KNIME as well. Dif-
ferent types of normalizations are available via the Normalizer node. Note that this
node has a second outport which carries the normalization model. This allows one
to easily apply the same normalization to a second data pipeline. An often overlooked
problem in data analysis is that users tend to normalize their training data using, e.g.,
min/max normalization and then normalize their test data separately using this same
technique! The resulting errors are hard to find since the min/max ranges of the training
and test data can be, but are not guaranteed to be, exactly the same. Especially if the
ranges are almost equal, the resulting small deviations can cause a lot of confusion.
Figure 6.5 shows an example of a flow that applies the normalization model to a second
pipeline. Here a different type of model (support vector machine, see Chap. 9) is
trained on normalized training data, and the resulting model
is applied to testing data which underwent the same normalization procedure.
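The difference between reusing a normalization model and re-fitting it on the test data can be sketched as follows. The function names and the numbers are hypothetical, and the sketch assumes that the training values are not all identical (otherwise the range would be zero).

```python
def fit_min_max(values):
    """Learn the normalization model (minimum and maximum) from the given data."""
    return min(values), max(values)

def apply_min_max(values, model):
    """Rescale values with a previously learned model; results may fall outside [0, 1]."""
    lo, hi = model
    return [(v - lo) / (hi - lo) for v in values]

train = [10.0, 20.0, 30.0]
test = [12.0, 28.0]

model = fit_min_max(train)                           # learned on the training data only
test_ok = apply_min_max(test, model)                 # same mapping as the training data
test_wrong = apply_min_max(test, fit_min_max(test))  # re-fitted on the test data

print(test_ok, test_wrong)
```

With the re-fitted model, the test value 12 is mapped to 0, although the same value would have been mapped to roughly 0.1 in the training data, so a model trained on the normalized training data sees shifted inputs.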
KNIME also offers nodes for different types of binning or discretization of data
which are available in the “Data Manipulation-Column-Binning” category. Additional
nodes allow one to concatenate and join tables, enabling vertical and horizontal
data integration. Many other functionalities are available in special nodes. Instead
of showing individual examples, we conclude this section by showing (Fig. 6.6) a
small example of a workflow performing more complex data integration and
preprocessing steps to create the data we first discussed in Chap. 2.
The workflow fragment shows how information on products, basket/product as-
sociations, additional basket information (date, customer), and customers is merged
to produce information on the average basket price of individual customers. Since
this information is spread out over several different tables, we first need to join the
two tables containing product information (most notably product ID and price) and
the map between product and basket ID. Once we have this, we can aggregate the ta-
ble (second node from the left Group By) to obtain the price of each basket. We join
this table with the table containing (among others) customer and basket ID and the
date of purchase. The latter field is only available as a string, so we need to convert
it to a date/time representation before reducing this to only contain year and month.
We can then again aggregate the resulting table to obtain the total and average bas-
ket price per customer and month. Joining this with the customer data allows one
to also add the age of the customer. Note that the age needs to be computed based
on the customer birth dates first. We aggregate this again to compute the average
prices per month and finally obtain a table containing the features we are interested
in (age, average purchase per month, average basket size) which can be fed into the
clustering algorithm.
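The first part of this pipeline (join, aggregate to basket prices, convert the date, aggregate per customer and month) can be sketched without KNIME as follows. The table contents, the identifiers, and the date format `%Y-%m-%d` are assumptions made for illustration only; the age computation and the final aggregation are omitted.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical input tables.
products = {"p1": 2.0, "p2": 3.0, "p3": 10.0}           # product ID -> price
basket_items = [("b1", "p1"), ("b1", "p2"), ("b2", "p3"), ("b3", "p1")]
baskets = {                                             # basket ID -> (customer ID, date string)
    "b1": ("c1", "2009-03-14"),
    "b2": ("c1", "2009-03-20"),
    "b3": ("c2", "2009-04-02"),
}

# Join basket/product associations with the prices and aggregate to basket prices.
basket_price = defaultdict(float)
for basket_id, product_id in basket_items:
    basket_price[basket_id] += products[product_id]

# Join with the basket information, convert the date string, keep only year and
# month, and group the basket prices by customer and month.
per_customer_month = defaultdict(list)
for basket_id, price in basket_price.items():
    customer_id, date_string = baskets[basket_id]
    date = datetime.strptime(date_string, "%Y-%m-%d")
    per_customer_month[(customer_id, date.year, date.month)].append(price)

average_basket_price = {
    key: sum(prices) / len(prices) for key, prices in per_customer_month.items()
}
print(average_basket_price)
```

Each loop corresponds to one join followed by one Group By step in the workflow; rerunning the script on updated tables repeats the whole preparation.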
Modeling data integration, transformation, and aggregation in such a graphical
way has two advantages. Firstly, this process can be documented and communicated
to others. Secondly—and even more importantly—this process can be easily exe-
cuted again whenever the underlying data repositories change. The documentation
aspect is often a problem with script-based platforms. In contrast, reproducibility
is an issue with table-based tools, which do allow for an intuitive way to integrate
and transform data but do not offer the ability to rerun this process reliably over and
over again in the future.
The use of PCA as a technique for dimensionality reduction has already been explained
in Sect. 4.3.2.1.
The mean value is in this case also a missing value, since R has no information
about the missing values and how to handle them. But if we explicitly say that
missing values should simply be ignored for the computation of the mean value
(na.rm=T), then R returns the mean value of all nonmissing values:
> mean(x,na.rm=T)
[1] 3
Note that this computation of the mean value implicitly assumes that the values are
missing completely at random (MCAR).
References
1. Cook, D.J., Holder, L.B.: Mining Graph Data. Wiley, Chichester (2006)
2. Dougherty, J., Kohavi, R., Sahami, M.: Supervised and unsupervised discretization of contin-
uous features. In: Proc. 12th Int. Conf. on Machine Learning (ICML 95, Lake Tahoe, CA), pp.
115–123. Morgan Kaufmann, San Mateo (1995)
3. Elomaa, T., Rousu, J.: Efficient multisplitting revisited: optima-preserving elimination of par-
tition candidates. Data Min. Knowl. Discov. 8(2), 97–126 (2004)
4. Fayyad, U., Irani, K.: Multi-interval discretization of continuous-valued attributes for classifi-
cation learning. In: Proc. 10th Int. Conf. on Artificial Intelligence (ICML’93, Amherst, MA),
pp. 1022–1027. Morgan Kaufmann, San Mateo (1993)
5. Feldman, R., Sanger, J.: The Text Mining Handbook: Advanced Approaches in Analyzing
Unstructured Data. Cambridge University Press, Cambridge (2007)
6. Guyon, I., Elisseeff, A.: An Introduction to Variable and Feature Selection. J. Mach. Learn.
Res. 3, 1157–1182 (2003)
References 143
7. Hall, M.A., Smith, L.A.: Feature subset selection: a correlation based filter approach. In: Proc.
Int. Conf. on Neural Information Processing and Intelligent Information Systems, pp. 855–
858. Springer, Berlin (1997)
8. Kolodyazhniy, V., Klawonn, F., Tschumitschew, K.: A neuro-fuzzy model for dimensionality
reduction and its application. Int. J. Uncertain. Fuzziness Knowl. Based Syst. 15, 571–593
(2007)
9. Lowe, D., Tipping, M.E.: Feed-forward neural networks topographic mapping for exploratory
data analysis. Neural Comput. Appl. 4, 83–95 (1996)
10. Markovitch, S., Rosenstein, S.: Feature generation using general constructor functions. Mach.
Learn. 49(1), 59–98 (2002)
11. Murphy, P., Pazani, M.: ID2-of-3: constructive induction of m-of-n concepts for discrimina-
tors in decision trees. In: Proc. 8th Int. Conf. on Machine Learning (ICML’91, Chicago, IL),
pp. 183–188. Morgan Kaufmann, San Mateo (1991)
12. Petrushin, V.A., Khan, L. (eds.): Multimedia Data Mining and Knowledge Discovery.
Springer, New York (2006)
13. Rehm, F., Klawonn, F., Kruse, R.: POLARMAP—efficient visualisation of high dimensional
data. In: Information Visualization, pp. 731–740. IEEE Press, Piscataway (2006)
14. Rehm, F., Klawonn, F.: Improving angle based mappings. In: Advanced Data Mining and
Applications, pp. 3–14. Springer, Berlin (2008)
15. Robnik-Sikonja, M., Kononenko, I.: Theoretical and empirical analysis of ReliefF and
RReliefF. Mach. Learn. 53(1–2), 23–69 (2004)
16. van der Putten, P., van Someren, M.: A bias-variance analysis of a real world learning problem:
the COIL challenge 2000. Mach. Learn. 57, 177–195 (2004)
17. Yang, Y., Webb, G.I.: Discretization for naive Bayes learning: managing discretization bias
and variance. Mach. Learn. 74(1), 39–74 (2009)
Chapter 7
Finding Patterns
Clustering For instance, we may want to group all cars according to their sim-
ilarities. Rather than investigating the offers car by car, we could then look at the
groups as a whole and take a closer look only at the group of cars we are particu-
larly interested in. This aggregation into similar groups is called clustering. (From
our background knowledge we expect to recover the well-known classes, such as
luxury class, upper and lower middle-sized class, compact cars, etc.) Cluster anal-
ysis (or clustering) looks for groups of similar data objects that can naturally be
separated from other, dissimilar data objects. As a formal definition of clusters turns
out to be quite challenging, depending on the application at hand or the algorithmic
approach used to find the structure (e.g., search, mathematical optimization, etc., see
Chap. 5), algorithms with quite different notions of a cluster have been proposed.
Their strengths (and weaknesses) lie in different aspects of the analysis: some focus
on the data constellation (such as compactness and separation of clusters), others
concentrate on the detection of changes in data density, and yet others aim at an
easily understandable and interpretable summary.
M.R. Berthold et al., Guide to Intelligent Data Analysis, 145
Texts in Computer Science 42,
DOI 10.1007/978-1-84882-260-3_7, © Springer-Verlag London Limited 2010
Self-organizing Maps We could even arrange all cars into a two-dimensional map
where similar car offers are placed close together so that we can explore cars similar to
some selected car by examining its neighborhood in the map. Such an overview is
generated by so-called self-organizing maps (Sect. 7.5). By means of color coding it
becomes immediately clear where many similar records reside and where we have
a sparsely populated area. Self-organizing maps were not intended as clustering
techniques but can deliver similar insights.
While clustering methods seek similarities among cases or records in our
database, we may also be interested in relationships between different variables or
in subgroups that behave differently with respect to some target variable.
Association Rules Rather than grouping or organizing the cars, we may be inter-
ested in interdependencies among the individual variables. Several automobile man-
ufacturers, for instance, offer certain packages that contain a number of additional
features for a special price. If the car equipment is listed completely, it is possi-
ble to recover those features that frequently occur together. The existence of some
features increases the probability of others, either because both features are offered
in a package or simply because people frequently select a certain set of features in
combination. One technique to find associations of this kind is association rule
mining (Sect. 7.6): For every feature that can be predicted confidently from the occurrence
of some other features, we obtain a rule that describes this relationship.
Deviation Analysis The last method we are going to investigate in this chapter is
deviation analysis. Usually expensive cars offer luxury equipment and consume
significantly more petrol than standard cars. However, new cars with new fuel-saving
engines, hybrid technology, etc. may represent exceptions to this general
rule. Typically, there is a negative correlation between price and mileage; however, the
group of vintage cars represents an exception. Typically a domain expert is already
familiar with the more global dependencies and is more interested in deviations
from the standard case or exceptional rules. The discovery of deviating subgroups
of the population is discussed in Sect. 7.7.
7.1 Hierarchical Clustering
7.1.1 Overview
Suppose that we have x, y ∈ D with δ = d(x, y) and that from our background
knowledge we conclude that both records should belong to the same cluster. As
distance is our only measure to decide upon belongingness to a cluster, to be con-
sistent, any other z ∈ D with d(x, z) ≤ δ should also belong to the same cluster, as z
is at least as similar to x as y is. Given some distance threshold δ, we may define
clusters implicitly by requiring that any two records x, y must belong to the same cluster C
if their distance is at most δ:
∀x, y ∈ D : (d(x, y) ≤ δ ⇒ ∃C ∈ P : x, y ∈ C) . (7.1)
Apparently, for different values of δ, we will obtain different partitions. Deferring
how to find the partition for the moment, one difficult question is then how to come
up with the right distance threshold δ. The choice must not be arbitrary, because the
resulting clusters should be stable and robust in the sense that a slightly changed
threshold should not lead to completely different clusters (which would render our
clusters arbitrary and useless). The idea is to overcome this problem by exploring
the full range of thresholds and deciding on the best threshold on the basis of the
obtained results.
Exploring the full range of thresholds sounds like a costly task delivering many
different partitions. Comparing all these partitions against each other appears to be
even more expensive. The key observation is, however, that the solutions form a hierarchy,
where clusters obtained from a threshold δ1 are always completely contained
in clusters obtained from δ2 if δ1 < δ2 . This follows directly from (7.1): Suppose
x, y ∈ C, where cluster C has been obtained using δ1 . Then, d(x, y) ≤ δ1 , and thus
d(x, y) ≤ δ2 (as δ1 < δ2 ), so x must be contained in some cluster C′ obtained from
δ2 that contains y.
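The nesting of the partitions can be checked on a small example. The sketch below builds, for a given threshold, the finest partition that satisfies (7.1) by uniting all records that are at most δ apart; the function name, the one-dimensional data, and the use of a union-find structure are choices made for this illustration only.

```python
from itertools import combinations

def threshold_clusters(points, dist, delta):
    """Finest partition satisfying (7.1): records at most delta apart share a cluster."""
    parent = {p: p for p in points}

    def find(p):
        # Follow parent links to the representative, shortening the path on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for x, y in combinations(points, 2):
        if dist(x, y) <= delta:
            parent[find(x)] = find(y)

    clusters = {}
    for p in points:
        clusters.setdefault(find(p), set()).add(p)
    return {frozenset(c) for c in clusters.values()}

points = [0, 1, 2, 10, 11, 30]            # hypothetical one-dimensional records

def absolute_difference(x, y):
    return abs(x - y)

fine = threshold_clusters(points, absolute_difference, 1)
coarse = threshold_clusters(points, absolute_difference, 8)

# Every cluster of the smaller threshold lies inside a cluster of the larger one.
nested = all(any(f <= c for c in coarse) for f in fine)
print(sorted(map(sorted, fine)), sorted(map(sorted, coarse)), nested)
```

For these data the threshold 1 yields three clusters and the threshold 8 yields two, and each of the three is contained in one of the two.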
The evolution of the partitions (as δ increases) can therefore be summarized in
a single hierarchy, called a dendrogram. On the horizontal axis, all data objects are
listed in a suitable order, and the vertical axis shows δ. Consider the example in
Fig. 7.2: For the smallest possible value of δ = 0, we obtain 7 clusters (each of
size 1, represented by leaves in the hierarchy). Records d and g are closest, so at
the level δ = d(d, g), these two points unite to a cluster, while all others still remain
singletons. For δ1 = d(a, c) (green color), all four points on the right side can be
reached within this distance from d, so that all these points belong to the same cluster
already. The three points on the left are still separated into two groups ({a, c} and
{b}), because the distance d(a, b) = δ2 (blue color) is larger than δ1 . If we cut the
dendrogram just above δ1 , the sets of leaves of the remaining subtrees represent the
currently obtained clusters: {a, c}, {b}, and {d, e, f, g}. At δ2 we obtain two clusters
{a, c, b} and {d, e, f, g}, and at a very large distance δ3 = d(c, e), the two clusters
become connected, and all data belong to a single cluster.
Fig. 7.2 Hierarchical clustering (single linkage) applied to a small example dataset
Coming back to the initially stated problem of finding the best distance threshold
δ, the dendrogram provides useful information: As already mentioned, for robust
clusters, a slightly changed δ should not lead to completely different results. If we
select δ1 , it makes a difference whether we cut the tree slightly above or slightly
below δ1 , because we obtain different clusters in both cases. However, for δ∗ , we
obtain clusters that are stable in the sense that we get the same clusters if we choose
δ somewhat smaller or larger. By using δ∗ = (δ2 + δ3 )/2 this stability is maximized,
and we end up with two clusters. Looking for the largest difference between δi and
δi+1 , which corresponds to the largest gap in the dendrogram, helps us in
finding the most stable partition.
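The search for the largest gap is easily written down. In the following sketch the function name and the merge levels are hypothetical; the levels stand for the values of δ at which two clusters unite in a dendrogram.

```python
def most_stable_threshold(merge_levels):
    """Return the midpoint of the largest gap between consecutive merge levels."""
    levels = sorted(merge_levels)
    gaps = [(upper - lower, lower, upper) for lower, upper in zip(levels, levels[1:])]
    _, lower, upper = max(gaps)
    return (lower + upper) / 2

# Hypothetical merge levels read off a dendrogram.
levels = [1.0, 1.5, 2.0, 9.0]
threshold = most_stable_threshold(levels)

# Cutting at the threshold leaves one cluster more than there are merges above it.
clusters_after_cut = 1 + sum(level > threshold for level in levels)
print(threshold, clusters_after_cut)
```

Here the largest gap lies between the levels 2.0 and 9.0, so the cut is placed halfway between them and leaves two clusters.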
Figure 7.3 shows examples of hierarchical clustering (so-called single linkage
clustering, see next section) to illustrate the interpretation of a dendrogram. Example (a)
consists of three well-separated clusters, which are easily discovered by the algo-
rithm. The data belonging to each of the clusters unites at relatively low values of δ
already, whereas quite a large value of δ is required to merge two of the three clus-
ters into one. Therefore, the vertical gap is large, and the existence of three clusters
is easily recognized by visual inspection. From hierarchy (b) we immediately see
that we have two clusters, one being a bit larger (left cluster in tree occupies more
horizontal space and thus has more leaves) and less dense than the other (higher
δ-values compared to the right cluster). The fact that the shape of the clusters is
very different (cf. scatterplot) does not make a difference for the algorithm (and is
likewise not reflected in the dendrogram).
Although example (c) is somewhat similar to (a), with just some random noise
added, the dendrogram is very different. Since it is sufficient to have a high
similarity to just one of the cluster members to belong to a cluster, noise points easily
get chained, thereby building bridges between the clusters that cause cluster merging
at a relatively low δ-value. These chaining capabilities were advantageous for
case (b) but prevent us from observing a clear cluster structure in case (c). Hierar-
chical (single-linkage) clustering is extremely sensitive to noise—adding just a few
noise points to dataset (a) quickly destroys the clear structure of the dendrogram.
We will discuss solutions to this problem in Sect. 7.1.3.
In addition to the dendrogram, a heatmap can also help to illustrate the clustering
result. For a heatmap, hierarchical clustering is carried out for the data records,
i.e., the rows of a data table, and also for the attributes by transposing the data table,
so that clustering is applied to the columns of the data table. Then the records and
the attributes are reordered according to the clustering result. Instead of showing the
numerical entries in the data table, a color scale is used to represent the numerical
values.
Fig. 7.3 Hierarchical clustering (single linkage) for three example data sets
Figure 7.4 shows a heatmap for clustering the numerical attributes of the Iris
data set (after z-score standardization). The dendrogram for the record clustering is
shown on the left, and the dendrogram for the attribute clustering on top. One can
easily see that in the lower part of the diagram, records are clustered based on lower
values for sepal width, the petal length, and the petal width (red color) and on higher
values for the sepal length (yellow or orange color).
7.1.2 Construction
Table 7.1 shows an algorithm to perform hierarchical clustering. Starting from
δ0 = 0, we have n = |D| clusters of size 1. In line 4 we find the two closest clusters
in the current partition Pt . These two clusters are removed from the current partition,
and their union is reinserted as a new cluster (line 6). In this way, the number of clusters
reduces by 1 for each iteration, and after n − 1 iterations the algorithm terminates.
Note that the distance measure d′ used in Table 7.1 measures the distance between
clusters C and C′ and not between data objects. While the Euclidean distance
Table 7.1 Hierarchical clustering algorithm
1 P0 = {{x} | x ∈ D}
2 t = 0, δt = 0
3 while current partition Pt has more than one cluster
4     find pair of clusters (C1 , C2 ) with minimal distance d′(C1 , C2 )
5     δt+1 = d′(C1 , C2 )
6     construct Pt+1 from Pt by removing C1 and C2 and inserting C1 ∪ C2
7     t = t + 1
8 end while
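The algorithm of Table 7.1 can be turned into a short runnable sketch. We assume here that the distance between two clusters is the minimal distance between their members (single linkage), and we use the first distance matrix of Table 7.2 as input; the function name and the data layout are choices made for this illustration, and the naive search for the closest pair is not meant to be efficient.

```python
from itertools import combinations

def agglomerate(records, d):
    """Follow Table 7.1; cluster distance = minimal pairwise distance (single linkage)."""
    def cluster_dist(c1, c2):
        return min(d[frozenset((x, y))] for x in c1 for y in c2)

    partition = [frozenset([x]) for x in records]                 # line 1
    merges = []
    while len(partition) > 1:                                     # line 3
        c1, c2 = min(combinations(partition, 2),
                     key=lambda pair: cluster_dist(*pair))        # line 4
        merges.append((cluster_dist(c1, c2), c1 | c2))            # line 5
        partition = [c for c in partition if c not in (c1, c2)]   # line 6
        partition.append(c1 | c2)
    return merges

# Upper triangle of the first distance matrix of Table 7.2 (records a-g).
names = "abcdefg"
rows = {
    "a": [50, 63, 17, 72, 81, 12],
    "b": [49, 41, 42, 54, 37],
    "c": [52, 13, 16, 61],
    "d": [56, 66, 15],
    "e": [11, 64],
    "f": [74],
}
d = {}
for i, x in enumerate(names[:-1]):
    for y, value in zip(names[i + 1:], rows[x]):
        d[frozenset((x, y))] = value

merges = agglomerate(names, d)
for level, cluster in merges:
    print(level, "".join(sorted(cluster)))
```

The recorded levels are the values δt+1 of line 5; plotting them against the merged clusters yields the dendrogram.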
Table 7.2 First five iterations of hierarchical clustering on records a–g with given distance matrix.
The cluster pairs (C1 , C2 ) are selected by minimal distances (marked with *)

Iteration 1
      a    b    c    d    e    f    g
a     0   50   63   17   72   81   12
b          0   49   41   42   54   37
c               0   52   13   16   61
d                    0   56   66   15
e                         0  11*   64
f                              0   74
g                                   0

Iteration 2
      a    b    c    d   ef    g
a     0   50   63   17   72  12*
b          0   49   41   42   37
c               0   52   13   61
d                    0   56   15
ef                        0   74
g                              0

Iteration 3
     ag    b    c    d   ef
ag    0   37   61   15   72
b          0   49   41   42
c               0   52  13*
d                    0   56
ef                        0

Iteration 4
     ag    b    d  efc
ag    0   37  15*   61
b          0   41   42
d               0   52
efc                  0

Iteration 5
    agd    b  efc
agd   0  37*   52
b          0   54
efc             0
As long as the data set size is moderate, the dendrogram provides useful in-
sights, but a tree with several thousand leaves is quite difficult to manage. Fur-
thermore, at least in the single linkage clustering, one cannot tell anything about
how the data is distributed in the data space, since the clusters may be shaped arbitrarily.
A brief characterization of each cluster is thus difficult to provide. To get
an impression of the cluster, one has to scan through all of its members. On the
other hand, hierarchical clustering may be carried out even if only a distance ma-
trix is given, whereas the approaches in the next two sections require an explicitly
given distance function (often even a specific function, such as the Euclidean dis-
tance).
The runtime and space complexity of the algorithm is quite demanding: In a
naïve implementation we need to store the full distance matrix and have a cubic
runtime complexity (n times find the minimal element in an n × n matrix). At least
the runtime complexity may be reduced by employing data structures that support
querying for neighboring data. Further possibilities to reduce time and space re-
quirements are subsampling (using only part of the dataset), omission of entries
in the distance matrix (transformation into a graph, application of, e.g., minimum
spanning trees), and data compression techniques (use representatives for a bunch
of similar data, e.g., [60]).
The so-called single-linkage distance (7.2) is responsible for the high sensitivity to
noise we have observed already in Fig. 7.3(c). Several alternative distance aggregation
heuristics have been proposed in the literature (cf. Table 7.3). The complete-linkage
approach uses the maximum rather than the minimum and thus measures the diameter
of the united cluster rather than the smallest distance to reach one of the cluster
points. It introduces a bias towards compact cloud-like clusters, which enables hierarchical
clustering to better recover the three clusters in Fig. 7.3(c) but fails in the case
of Fig. 7.3(b) because any cluster consisting of the ring points would also include
all data from the second (interior) cluster.
The single-linkage and complete-linkage approaches represent the extremes in
measuring distances between clusters. Calculating the exact average distance between
data from different clusters via (7.7) is computationally more expensive than
using the heuristics (7.5) or (7.6). If the data stems from a vector space, the cluster
means may be utilized for distance computation. The cluster means can be stored
together with the clusters and easily updated at the merging step, making approach
(7.7) computationally attractive. For hyperspherical clusters, all these measures perform
comparably but otherwise may lead to quite different dendrograms.
Table 7.3 Distance aggregation heuristics
\begin{align}
d'(\{C \cup C'\}, C'') &= \cdots \notag \\
\text{single linkage} &= \min\{d'(C, C''),\, d'(C', C'')\} \tag{7.3} \\
\text{complete linkage} &= \max\{d'(C, C''),\, d'(C', C'')\} \tag{7.4} \\
\text{average linkage} &= \frac{|C|\, d'(C, C'') + |C'|\, d'(C', C'')}{|C| + |C'|} \tag{7.5} \\
\text{Ward} &= \frac{(|C| + |C''|)\, d'(C, C'') + (|C'| + |C''|)\, d'(C', C'') - |C''|\, d'(C, C')}{|C| + |C'| + |C''|} \tag{7.6} \\
\text{centroid (metric)} &= \frac{1}{|C \cup C'|\,|C''|} \sum_{x \in C \cup C'} \sum_{y \in C''} d(x, y) \tag{7.7}
\end{align}
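The update rules (7.3) to (7.5) only need the two old cluster distances and the cluster sizes, which is what makes them cheap. The following sketch shows this for the merge of e and f in Table 7.2 and their distances to a; the function name and its parameter names are hypothetical.

```python
def merged_distance(kind, d_c, d_cp, size_c=1, size_cp=1):
    """Distance of the merged cluster C ∪ C' to C'', given d'(C, C'') and d'(C', C'')."""
    if kind == "single":        # (7.3)
        return min(d_c, d_cp)
    if kind == "complete":      # (7.4)
        return max(d_c, d_cp)
    if kind == "average":       # (7.5), weighted by the sizes of C and C'
        return (size_c * d_c + size_cp * d_cp) / (size_c + size_cp)
    raise ValueError(f"unknown linkage: {kind}")

# Merging e and f (first matrix of Table 7.2): d(e, a) = 72, d(f, a) = 81.
for kind in ("single", "complete", "average"):
    print(kind, merged_distance(kind, 72, 81))
```

No pass over the original records is required, in contrast to the double sum in (7.7).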
Rather than agglomerating small clusters to large ones, we may start with one sin-
gle cluster (full data set) and subsequently subdivide clusters to find the best result.
Such methods are called divisive hierarchical clustering. In agglomerative cluster-
ing we have merged two clusters, whereas in divisive clustering typically a single
cluster is subdivided into two subclusters (bisecting strategy), which leads us again
to a binary tree of clusters (a hierarchy). Two critical questions have to be answered
in every iteration: (1) which cluster has to be subdivided, and (2) how to divide
a cluster into subclusters. For the first problem, validity measures try to evaluate
the quality of a given cluster, and the one with the poorest quality may be the best
candidate for further subdivision. Secondly, if the number of subclusters is known
or assumed to be fixed (e.g., 2 in a bisecting strategy), clustering algorithms from
Sect. 7.3 may be used for this step, as these algorithms subdivide the data into a partition
with a fixed number of clusters. In the bottom-up approach of agglomerative
clustering, the decisions are based on local information (neighborhood distances),
which gives poor advice in case of diffuse cluster boundaries and noisy data. In such
situations, top-down divisive clustering can provide better results, as the global data
distribution is considered from the beginning. The computational effort is higher for
divisive clustering, though, which can be alleviated by stopping the construction of
the hierarchy after a few steps, as the desired number of clusters is typically rather
small.
Providing distances for categorical data leads to a coarse view of the data; for
instance, the number of distinct distance values is limited (as we will see in the
next section). For the moment, just consider six binary features and let us compare
the cases: Visual inspection of the cases a–h in Fig. 7.5 reveals two clusters,
one group of cars (a–d) equipped with features u and v (plus one optional feature)
and the second group (e–h) equipped with y and z (plus one optional feature).
Applying the Jaccard measure (a well-suited distance measure for binary data,
which will be discussed in the next section) to this dataset leads to the distance ma-
trix in Fig. 7.5. A distance of 0.5 occurs both within groups (e.g., d(a, b), d(e, g))
and across groups (d(c, g)). The Jaccard distance is therefore not helpful to discrim-
inate between these two groups.
7.2 Notion of (Dis-)Similarity 155
Fig. 7.5 Grouping of categorical data: A Jaccard distance of 0.5 is achieved within the groups
(e.g., d(a, b), d(e, g)) and also across groups (d(c, g))
Distance measures usually compare only two records from the dataset and do not
take any further information into account. We have seen that single-linkage
clustering reacts sensitively to single outliers, which may drastically reduce the
distance between clusters. Knowing the outliers can thus help to improve the clustering
results we obtain from distances alone. One approach in this direction is to use link
information between data objects: the number of links that exist between data x1
and x2 is the number of records x ∈ D that are close to both x1 and x2. Based on
some threshold ϑ, we define the neighborhood Nϑ(x) of x as {y ∈ D | d(x, y) ≤ ϑ}.
In our example, N0.5(c) = {a, b, c, d, g}. The number of links or common neighbors
of x and y is then |(Nϑ(x) ∩ Nϑ(y))\{x, y}|. In the example we obtain a single common
neighbor for a and g: ({a, b, c, d} ∩ {c, e, f, g, h})\{a, g} = {c}. The matrix on
the right of Fig. 7.5 shows the number of links for all pairs of data objects. Using
the number of links as a measure of similarity rather than the distance now allows us
to discriminate both clusters, and we no longer have the same degree of similarity
within and across groups. This approach is used in the ROCK algorithm (robust
clustering using links) [26].
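The link counting described above can be sketched in a few lines of Python. The feature sets below are hypothetical (they are not the data of Fig. 7.5); they were merely chosen so that the neighborhoods quoted in the text are reproduced. The function names are illustrative as well.

```python
# Hypothetical feature sets for the cars a-h (an assumption, not the data of Fig. 7.5).
cars = {
    "a": {"u", "v", "w"}, "b": {"u", "v", "x"}, "c": {"u", "v", "y"}, "d": {"u", "v"},
    "e": {"y", "z", "w"}, "f": {"y", "z", "x"}, "g": {"y", "z", "u"}, "h": {"y", "z"},
}

def jaccard(s, t):
    """Jaccard dissimilarity of two feature sets."""
    return 1.0 - len(s & t) / len(s | t)

def neighborhood(name, theta):
    """All cars whose Jaccard distance to `name` is at most theta."""
    return {o for o, feats in cars.items() if jaccard(cars[name], feats) <= theta}

def links(x, y, theta=0.5):
    """Number of common neighbors of x and y, excluding x and y themselves."""
    return len((neighborhood(x, theta) & neighborhood(y, theta)) - {x, y})

print(jaccard(cars["a"], cars["b"]), jaccard(cars["c"], cars["g"]))  # same distance
print(links("a", "b"), links("c", "g"))  # but different link counts
```

With this toy data, the within-group pair (a, b) and the across-group pair (c, g) have the same Jaccard distance but a different number of links, which is the effect the link-based similarity exploits.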
In the previous section we have simply taken the length of the line segment be-
tween two data points as the distance or dissimilarity between records. This section
is solely dedicated to distance measures, as they are crucial for clustering appli-
cations. We have already seen in Sect. 6.3.2 that by a different scaling of individ-
ual attributes, the clustering result may be completely different. If we have clearly
pronounced clusters in two dimensions (as in Fig. 7.3(a)), but the distance measure
also takes a third variable into account that is just random noise, this may destroy
the clear structure completely, so that in the end each group mixes data from
several of the original two-dimensional clusters. We have discussed the negative
influence of irrelevant
and redundant variables already in Sect. 6.1.1 but assumed a predictive task such that
we had some reference variable to measure a variable’s informativeness—we have
no such target variable in explorative clustering tasks. Instead, we can only observe
156 7 Finding Patterns
Table 7.4 Small excerpt from a database of cars offered on an internet platform
id Manufacturer Name Type yofr cap fu fxu used air abs esp hs ls
Legend: yofr = year of first registration, cap = engine capacity, fu = fuel consumption (urban),
fxu = fuel consumption (extra urban), used = used car, air = air conditioned, abs = anti blocking
system, esp = electronic stability control, hs = heated seats, ls = leather seats
how well the data gets clustered in two and three dimensions. A clear grouping with
respect to variables A and B means that both attributes correlate somehow (certain
combinations of attribute values occur more frequently than others). There is no rea-
son why a similar co-occurrence pattern should be observable if we add irrelevant
or random variables. Thus, whether we will observe clusters or not strongly depends
on the choice of the distance measure—and that includes the choice of the variable
subset.
For this section, we return to the example of car offers such as those shown in
Table 7.4. This time we have a mixed dataset with binary, categorical, and numerical
attributes. How can we define the similarity between two cars?
Table 7.5 Dissimilarity of numeric vectors. The graph on the right shows (for selected measures)
all two-dimensional points that have a distance of 1.0 from the small bullet in the centre
higher for the Euclidean distance. We have already seen that these distance val-
ues directly influence the order in which the clusters get merged, so we should
carefully check that our choice corresponds to our intended notion of dissimilar-
ity.
Nonisotropic Distances The term isotropic means that the distance grows in all
directions equally fast, which is obviously true for, e.g., the Euclidean distance. All
points with distance 1.0 from some reference point x form the surface of a perfectly
symmetric sphere. We use the distance to capture the dissimilarity between records,
and a large distance may even be used as an indication for an outlier (in particular, if
we have z-score transformed data, where a value can be interpreted as “how many
standard deviations is this value away from the mean?”, see Sect. 6.3.2). If we consider
a pair of correlated variables, say urban and extra-urban fuel consumption, we may
perceive isotropic distances as counterintuitive: The differences of the fuel
consumptions x1 = (8.1, 5.3) and x2 = (5.1, 5.3) from the fuel consumption x = (6.6, 4.3)
are Δ1 = (1.5, 1.0) and Δ2 = (−1.5, 1.0). The lengths of the difference vectors are
identical, and so are the distances when using an isotropic distance function. However,
we would be very surprised if the difference in the urban fuel consumption is
negative while at the same time the difference in the extra-urban fuel consumption is
positive: this deviates vastly from our expectations.
Nonisotropic distance functions help us to respect such cases. We have already
seen that the joint distribution of multiple variables is captured by the covariance
matrix C (see also Sect. 4.3.2.1). The nonisotropic Mahalanobis distance is defined
as
\[
d(x, y) = \sqrt{(x - y)^\top C^{-1} (x - y)}
\]
and considers the variance across all variables. Similar to the z-score transformed
data, the distance may be interpreted as the number of standard deviations a vector
deviates from the mean, but this time the deviation is not measured independently
for every variable but jointly for all variables at the same time. Now, all points
with distance 1.0 from some reference point x no longer form the surface of a
symmetric sphere but the surface of an arbitrary hyperellipsoid.
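As a small illustration of this effect, the following sketch compares the Euclidean and the Mahalanobis distance for the fuel consumption example. The covariance matrix C is an assumed value (positively correlated urban and extra-urban consumption); it is not taken from the car data.

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance sqrt((x - y)^T C^{-1} (x - y))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # Solving C z = diff avoids forming the inverse explicitly.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

# Assumed covariance of (urban, extra-urban) consumption, positively correlated.
C = np.array([[4.0, 2.4],
              [2.4, 2.0]])

x = np.array([6.6, 4.3])
x1 = np.array([8.1, 5.3])   # both values higher than x
x2 = np.array([5.1, 5.3])   # urban lower, extra-urban higher than x

d_euclid = (np.linalg.norm(x1 - x), np.linalg.norm(x2 - x))
d_mahal = (mahalanobis(x1, x, C), mahalanobis(x2, x, C))
print(d_euclid)  # both Euclidean distances agree
print(d_mahal)   # x2 is much farther away in the Mahalanobis sense
```

Under this assumed covariance, the deviation that contradicts the correlation (x2) receives a considerably larger distance than the one that follows it (x1).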
Text Data and Time Series Sometimes not the absolute difference is important
but the fact that the data varies in the same way. This is often the case with em-
bedded time series (x contains the values of a variable at consecutive time points)
Binary Attributes Next, we consider binary attributes, e.g., whether the car was
offered by a dealer or not (private offer). If we encode the Boolean values (yes/no)
into numbers (1/0), we may apply the same differencing we used for numeric values
before. In this case, we get a zero distance if both cars are used cars or both cars are
new cars (and a distance of one otherwise).
However, this approach may be inappropriate for other binary attributes. Suppose
that we have a feature “offer by local dealer,” where “local” refers to the vicinity of
your own home. Then, the presence of this feature is much more informative than
its absence: The cars may be offered by private persons in the vicinity and by private
persons or dealers far away. The presence of this feature tells us much more than
the absence; the probability of absence is typically much higher, and therefore the
absence should not have the same effect in the dissimilarity measure as the presence.
A similar argument applies to the optional car equipment, such as ABS, ESP, heated
seats, leather seats, etc. We could easily invent a hundred such attributes, with
probably only a handful of cars having more than ten of these properties. But if we
consider the absence of a feature as sharing a common property, most cars will be very
similar, simply because they share that 90% of the equipment is not present. In such
cases, the measure by Russel and Rao could be used, which accounts only for shared
present features (see Table 7.6). Another possibility is the Jaccard measure, which
takes its motivation from set theory, where the dissimilarity of two sets A, B ⊆ Ω is
well reflected by J = 1 − |A ∩ B| / |A ∪ B|. Apparently, this definition is independent
of the size of the domain |Ω|. For the cars example restricted to the last six binary
attributes, we obtain, for the Jaccard measure, dJ(x1, x2) = 1 − 2/5 = 3/5 and
dJ(x2, x3) = 1 − 2/3 = 1/3 versus dS(x1, x2) = 1 − 3/6 = 1/2 and
dS(x2, x3) = 1 − 5/6 = 1/6 for the simple match.
Table 7.6 Measuring the dissimilarity in case of a vector of predicates (binary variables) x and
y, which may also be interpreted as a set of properties X, Y ⊆ Ω. The Tanimoto measure from
Table 7.5 becomes identical to the Jaccard measure when applied to binary variables. Here b is
the number of predicates that hold in both records, n the number that hold in neither, and x the
number that hold in exactly one of them
                binary attributes           sets of properties
simple match    dS = 1 − (b+n)/(b+n+x)
Russel & Rao    dR = 1 − b/(b+n+x)          1 − |X∩Y|/|Ω|
Jaccard         dJ = 1 − b/(b+x)            1 − |X∩Y|/|X∪Y|
Dice            dD = 1 − 2b/(2b+x)          1 − 2|X∩Y|/(|X|+|Y|)
Example:
x        y        set X     set Y       b  n  x  dS     dR     dJ     dD
101000   111000   {A, C}    {A, B, C}   2  3  1  0.16̄  0.66̄  0.33̄  0.20
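The four measures of Table 7.6 can be checked with a short sketch. The bit-string encoding and the function names are illustrative choices; the count of predicates that hold in exactly one record is called d in the code to avoid a clash with the vector x. Returning 0 for two all-zero records in the Jaccard and Dice case is an assumption, since the formulas are undefined there.

```python
def counts(x, y):
    """Return (b, n, d): predicates true in both, in neither, in exactly one."""
    b = sum(1 for p, q in zip(x, y) if p == "1" and q == "1")
    n = sum(1 for p, q in zip(x, y) if p == "0" and q == "0")
    d = len(x) - b - n
    return b, n, d

def dissimilarities(x, y):
    """Binary dissimilarity measures of Table 7.6 for two bit strings."""
    b, n, d = counts(x, y)
    total = b + n + d
    return {
        "simple_match": 1 - (b + n) / total,
        "russel_rao": 1 - b / total,
        # Assumption: two records without any present feature get distance 0.
        "jaccard": 1 - b / (b + d) if b + d > 0 else 0.0,
        "dice": 1 - 2 * b / (2 * b + d) if b + d > 0 else 0.0,
    }

example = dissimilarities("101000", "111000")
print(example)
```

For the example row of the table this yields b = 2, n = 3 and one differing predicate, and hence the values listed there.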
becomes true if and only if both records share the same value. Finally, a third alter-
native is the provision of a look-up table for individual dissimilarity values, where
Porsche and BMW are more similar than, say, Mercedes and Dacia.
Ordinal Attributes Ordinal attributes can also be transformed into binary at-
tributes in the same fashion as nominal attributes, but then any two different values
appear equally (dis)similar, which does not reflect the additional information con-
tained in an ordinal attribute. A better transformation is shown in Table 7.8, where
the Jaccard dissimilarity is proportional to the number of ranks lying between the
compared values. This is equivalent to introducing an integer-valued rank attribute
and applying, say, the Manhattan distance to measure the dissimilarity of two values.
The provision of a dissimilarity matrix remains as a third option.
In a mixed scale situation, such as the comparison of car offerings in Table 7.4,
there are two possibilities. First, even interval- and ratio-scaled attributes could be
discretized and then transformed into binary attributes, so that only the Jaccard mea-
sure needs to be applied on the final set of (only) binary attributes. A second option
is the combination of various distance measures (sum, mean, etc.), where each of
the measures considers a different subset of attributes.
The Curse of Dimensionality While we are used to working with distances in two-
or three-dimensional spaces, our intuition does not help us in high-dimensional
spaces. By considering more and more attributes in a dataset, we mathematically
construct a high-dimensional space. While the generalization of, e.g., the Euclidean
distance to an arbitrary number of dimensions is straightforward, the interpretation
of the distances may be somewhat counterintuitive.
Suppose that all variables have been normalized to the interval [−1, 1]. If we
select two attributes, the space in which our data resides corresponds to the box of
edge length 2 around the origin 0 of the coordinate system. Most of its area (≈78%)
is contained in the circle of radius 1 around the origin; only the four corners are
omitted. If we add another attribute, the data space corresponds to a three-dimensional
cube. Now, we have already eight corners that are not covered by the sphere with
radius 1 around the origin, and the sphere contains less than 55% of the cube’s
volume. Figure 7.6 shows the percentage of the data space’s hypervolume that is
covered by a hypersphere of radius 1: The percentage quickly approaches zero.
If one data object is located at the origin of a two-dimensional space and we take
a second data object from [−1, 1]^2, there is a chance of only ≈22% that its distance
to the first object is greater than 1 (assuming the data space is uniformly populated).
In case of an eight-dimensional data space, the chance of being further away than a
distance of 1 rises to more than 95%. The distances among the data objects increase,
and most of the data is in the corners of the hypercube. If we want to identify data
similar to some reference object (here: the origin), the higher the dimensionality, the
less data we will find. This phenomenon is known as the curse of dimensionality,
a term coined by Bellman [9].
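The percentages quoted above follow from the closed-form volume of the m-dimensional unit ball, π^(m/2) / Γ(m/2 + 1), divided by the volume 2^m of the hypercube [−1, 1]^m. A minimal sketch (the function name is an illustrative choice):

```python
import math

def ball_fraction(m):
    """Fraction of the hypercube [-1, 1]^m covered by the unit ball around the origin."""
    ball_volume = math.pi ** (m / 2) / math.gamma(m / 2 + 1)
    cube_volume = 2.0 ** m
    return ball_volume / cube_volume

for m in (2, 3, 8, 20):
    print(m, ball_fraction(m))
```

The values for m = 2 and m = 3 are π/4 and π/6, respectively, and the fraction drops quickly for larger m.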
Fig. 7.6 Curse of dimensionality: The higher the dimensionality, the smaller the fraction of the
hyperbox that is covered by the embedded hypersphere
This effect is also shown in Fig. 7.7, which shows the distribution of the vector
lengths of 20,000 random vectors in the m-dimensional unit hypercube [0, 1]^m. As
expected, in the one-dimensional case (bottom row) the unit interval [0, 1] is almost
uniformly covered. To obtain a Euclidean distance of zero in an m-dimensional
space, all m components must be zero at the same time, so the chance of a low overall
distance (which accumulates all components) decreases rapidly with the
dimensionality. In the 10-dimensional sample there are almost no vectors of length ≤1,
that is, within a range of 1.0 from the origin.
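An experiment of this kind is easy to repeat. The sketch below draws 20,000 uniform random vectors from [0, 1]^m and reports the fraction whose length is at most 1; the seed and the function name are arbitrary choices, and the exact numbers depend on the random sample.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice for repeatability

def fraction_within_unit_distance(m, n_vectors=20_000):
    """Fraction of uniform random vectors in [0, 1]^m with Euclidean length <= 1."""
    vectors = rng.random((n_vectors, m))
    lengths = np.linalg.norm(vectors, axis=1)
    return float(np.mean(lengths <= 1.0))

for m in (1, 2, 5, 10):
    print(m, fraction_within_unit_distance(m))
```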
Clusters are represented by subtrees in a dendrogram, and all cluster members are
represented individually by leaves. For large datasets, such dendrograms are too
detailed, and the clusters themselves somewhat difficult to grasp, because the only
way to inspect the cluster is by looking at the members one by one. A condensed
representation by, say, the mean and variance of the members may be misleading,
because the cluster’s shape is arbitrary. The alternative perspective on clustering in
this section is that we explicitly search for clusters that can be well represented by
simple statistics such as mean and variance.
7.3.1 Overview
Fig. 7.9 Fuzzy c-Means clustering for three example data sets (cf. Fig. 7.3)
tion (pi|j ∈ {0, 1}) is required only in some variants of prototype-based cluster-
ing.
As shown in Fig. 7.8, prototype-based clustering starts with some initial guesses
of the prototypes and usually alternates between the adjustment of the member-
ship matrix (using the currently given prototypes) and the adjustment of the proto-
types (using the currently given membership degrees). It is assumed that the number
of clusters c is known beforehand. Actually, all prototype-based clustering procedures
optimize an objective function during this iterative refinement (which could
not be formulated without knowing the number of clusters in advance). Although
an inspection of the membership matrix might be useful in some cases, the pro-
totypes are considered as the primary output, since they concisely characterize
the data associated with them. New data may be associated with the established
clusters easily by assigning it to the nearest prototype (according to the distance
d(pi , x)).
If the cluster is represented by a point prototype (assuming hyperspherical clus-
ters), results for the datasets already presented in Fig. 7.3 are shown in Fig. 7.9.
Where the model assumption was correct (cases (a) and (c)), the clusters are quickly
and robustly recovered. In case (b), however, the ring-shaped outer cluster cannot
be discriminated from the inner cluster: a typical prototype for the outer ring would
have roughly the same distance to all its members, and thus would have to be placed
in the center of the inner cluster. But then, both prototypes compete for the central
data, and eventually both clusters are split among the prototypes.
Table 7.9 Algorithm for prototype-based clustering: k-Means (kM), Fuzzy c-Means (FCM), and
Gaussian mixture decomposition (GMD)
Algorithm CentroidBasedClustering(D, p1, ..., pc) → p1, ..., pc
                                                      kM       FCM      GMD
repeat
  for all 1 ≤ j ≤ n:
    update pi|j according to ...                      (7.10)   (7.11)   (7.14)
  for all 1 ≤ i ≤ c:
    update prototype/model pi according to ...        (7.9)    (7.13)   (7.15)–(7.17)
until maximum number of iterations reached
      or p1, ..., pc converged
7.3.2 Construction
The general sketch of the algorithm has already been discussed and is depicted in
Table 7.9. However, there are still placeholders for the two main steps, the mem-
bership update and the prototype update. We will briefly review three common
algorithms for hyperspherical clusters, namely the k-Means, Fuzzy c-Means, and
Gaussian mixture decomposition.
The k-Means model corresponds to our initially stated situation where the proto-
types stem from the same space as the data and the membership degrees are binary
(pi|j ∈ {0, 1}). The objective function of k-Means is given by
\[
J_{kM} = \sum_{i=1}^{c} \sum_{j=1}^{n} p_{i|j} \, \|x_j - p_i\|^2
       = \sum_{i=1}^{c} \underbrace{\sum_{x_j \in C_i} \|x_j - p_i\|^2}_{\text{variance within cluster } C_i}
\tag{7.8}
\]
subject to $\sum_{i=1}^{c} p_{i|j} = 1$ for 1 ≤ j ≤ n. If we think of a prototype pi as the
center of a cluster, the term $\sum_{j=1}^{n} p_{i|j} \|x_j - p_i\|^2$ calculates (up to a
constant) the variance within the cluster (only data associated with the cluster is
considered because pi|j is zero otherwise). By minimizing JkM the prototypes will be
chosen such that the variance of all clusters is minimized, thereby seeking compact
clusters. From the necessary condition for a minimum of JkM (∇pi JkM = 0) the optimal
prototype location is derived:
\[
p_i = \frac{\sum_{j=1}^{n} p_{i|j} \, x_j}{\sum_{j=1}^{n} p_{i|j}},
\tag{7.9}
\]
7.3 Prototype- and Model-Based Clustering 165
which corresponds to the mean of the data associated with the prototype. All
prototypes are updated using (7.9) in the prototype-update step of the algorithm in
Table 7.9. As to the membership update step, optimization of JkM is simple: for data xj,
we have to select a single cluster i with pi|j = 1 (and pk|j = 0 for all k ≠ i). Since we
seek to minimize the objective function, we select the prototype where ‖xj − pi‖²
becomes minimal (which constitutes the membership update step):
\[
p_{i|j} =
\begin{cases}
1 & \text{if } \|p_i - x_j\| \text{ becomes minimal for } i \text{ (ties are broken arbitrarily)}, \\
0 & \text{otherwise.}
\end{cases}
\tag{7.10}
\]
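The two alternating steps of k-Means can be sketched as follows. This is a minimal illustration under some assumptions of our own: the initial prototypes are passed in by the caller, ties are broken in favor of the lowest cluster index, and a prototype without any assigned data simply keeps its position.

```python
import numpy as np

def k_means(data, prototypes, max_iter=100):
    """Alternate the membership update (7.10) and the prototype update (7.9)."""
    data = np.asarray(data, dtype=float)
    prototypes = np.asarray(prototypes, dtype=float).copy()

    def assign(protos):
        # Squared distances of every record (rows) to every prototype (columns).
        sq_dist = ((data[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
        return sq_dist.argmin(axis=1)  # index of the closest prototype, cf. (7.10)

    for _ in range(max_iter):
        labels = assign(prototypes)
        updated = prototypes.copy()
        for i in range(len(prototypes)):
            members = data[labels == i]
            if len(members) > 0:  # assumption: empty clusters keep their prototype
                updated[i] = members.mean(axis=0)  # cluster mean, cf. (7.9)
        if np.allclose(updated, prototypes):
            break
        prototypes = updated
    return prototypes, assign(prototypes)

points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, labels = k_means(points, [[0, 0], [10, 10]])
print(centers, labels)
```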
The k-Means model can be extended to a membership matrix that contains
continuous membership degrees within [0, 1] rather than only binary memberships from
{0, 1} (the cluster representation remains unchanged). A gradual membership
allows one to distinguish data close to the prototype (very typical for the cluster) from
incidentally associated data (far away from the prototype). To guarantee a meaningful
comparison of membership values, they should be normalized ($\sum_{i=1}^{c} p_{i|j} = 1$)
and inversely proportional to the distance to the prototype (if x1 has twice the
distance to the prototype as x2, x1 should receive only one half of the membership that
x2 receives). This determines the membership degrees up to a constant [32]:
\[
p_{i|j} = \frac{1}{\sum_{k=1}^{c} \frac{\|x_j - p_i\|^2}{\|x_j - p_k\|^2}}.
\tag{7.11}
\]
These membership degrees can also be obtained by minimizing the objective
function of Fuzzy c-Means
\[
J_{FcM} = \sum_{i=1}^{c} \sum_{j=1}^{n} p_{i|j}^{2} \, \|x_j - p_i\|^2,
\tag{7.12}
\]
which includes an exponent of 2 on the membership degrees; this exponent is usually
called the fuzzifier. Without the introduction of a fuzzifier, the minimization of JFcM
would still lead to binary membership degrees. Other choices than 2 are possible but
are less frequently used in practice. The cluster centroid update is very similar to
k-Means; the only difference is due to the introduction of the fuzzifier:
\[
p_i = \frac{\sum_{j=1}^{n} p_{i|j}^{2} \, x_j}{\sum_{j=1}^{n} p_{i|j}^{2}}.
\tag{7.13}
\]
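A minimal sketch of Fuzzy c-Means with fuzzifier 2 is given below. The membership update uses the fact that (7.11) can be rewritten as (1/‖xj − pi‖²) divided by the sum of 1/‖xj − pk‖² over all k. The small constant eps that guards against a zero distance, the convergence tolerance, and the function name are assumptions made for this illustration.

```python
import numpy as np

def fuzzy_c_means(data, prototypes, max_iter=100, eps=1e-12):
    """Alternate the membership update (7.11) and the prototype update (7.13)."""
    data = np.asarray(data, dtype=float)
    prototypes = np.asarray(prototypes, dtype=float).copy()
    for _ in range(max_iter):
        sq_dist = ((data[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
        inv = 1.0 / np.maximum(sq_dist, eps)  # eps avoids division by zero (assumption)
        u = inv / inv.sum(axis=1, keepdims=True)  # u[j, i] = p_{i|j}, cf. (7.11)
        w = u ** 2                                # fuzzifier 2
        updated = (w.T @ data) / w.sum(axis=0)[:, None]  # weighted mean, cf. (7.13)
        converged = np.allclose(updated, prototypes, atol=1e-9)
        prototypes = updated
        if converged:
            break
    return prototypes, u

points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
fcm_centers, fcm_u = fuzzy_c_means(points, [[0, 0], [10, 10]])
print(fcm_centers)
print(fcm_u.round(3))
```

Each row of the returned membership matrix sums to 1; for well-separated groups such as these, the memberships are close to 0 or 1.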
\[
g(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}.
\]
Since such a distribution is unimodal (i.e., there is only a single maximum) but
each cluster represents its own peak in the density itself, we assume the overall
density being a mixture of Gaussians. Given that the number of clusters c in the
data is a priori known, the generative model of a dataset of size n is repeating the
following two steps n times:
• First, draw a random integer between 1 and c with probability pi of drawing i.
The data generated next will then belong to cluster #i.
• Second, draw a random sample from the Gaussian distribution g(x; μi , σi ) with
the parameters μi , σi taken from model #i.
The overall density function is thus given by
\[
f(x; \theta) = \sum_{i=1}^{c} p_i \, g(x; \mu_i, \sigma_i),
\]
To find the most likely cluster configuration θ , the following maximum likelihood
estimate can be derived:
\[
\mu_i = \frac{\sum_{j=1}^{n} p_{i|j} \, x_j}{\sum_{j=1}^{n} p_{i|j}},
\tag{7.15}
\]
\[
\sigma_i^2 = \frac{1}{m} \, \frac{\sum_{j=1}^{n} p_{i|j} \, \|x_j - \mu_i\|^2}{\sum_{j=1}^{n} p_{i|j}},
\tag{7.16}
\]
\[
p_i = \frac{1}{n} \sum_{j=1}^{n} p_{i|j}.
\tag{7.17}
\]
1 Isotropic means that the density is symmetric around the mean, that is, the cluster shapes are
roughly hyperspheres. Using the Mahalanobis distance instead would allow for ellipsoidal shape,
but we restrict ourselves to the isotropic case here.
(Note the similarity of (7.15) with (7.9) and (7.13).) The so-called EM algorithm
performs an iterative local search for the maximum likelihood estimate, comparable
to a gradient-based search, and as such is prone to getting stuck in local optima. If
the model is poorly initialized and the clusters interact, convergence may be very slow.
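The parameter updates (7.15)–(7.17) can be sketched as follows. Two points are assumptions of this illustration rather than statements of the text: the membership update is taken to be the usual posterior probability pi · g(xj; μi, σi) normalized over all clusters, and g is the m-dimensional isotropic Gaussian density. The floor on σ and all names are likewise illustrative.

```python
import numpy as np

def em_isotropic_gmm(data, mu, sigma, prior, n_iter=50):
    """EM sketch for a mixture of isotropic Gaussians (n_iter >= 1 assumed)."""
    data = np.asarray(data, dtype=float)
    mu = np.asarray(mu, dtype=float).copy()
    sigma = np.asarray(sigma, dtype=float).copy()
    prior = np.asarray(prior, dtype=float).copy()
    n, m = data.shape
    for _ in range(n_iter):
        # Membership update (assumed form): posterior probability of each cluster,
        # computed in log space for numerical stability.
        sq_dist = ((data[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_dens = (-0.5 * sq_dist / sigma ** 2 - m * np.log(sigma)
                    - 0.5 * m * np.log(2 * np.pi))
        log_post = np.log(np.maximum(prior, 1e-300)) + log_dens
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)   # resp[j, i] = p_{i|j}
        # Parameter updates.
        weight = resp.sum(axis=0)
        mu = (resp.T @ data) / weight[:, None]                           # (7.15)
        sq_dist = ((data[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        sigma = np.sqrt((resp * sq_dist).sum(axis=0) / (m * weight))     # (7.16)
        sigma = np.maximum(sigma, 1e-6)  # assumption: guard against collapse
        prior = weight / n                                               # (7.17)
    return mu, sigma, prior, resp

rng = np.random.default_rng(1)  # arbitrary seed
sample = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(8.0, 1.0, (200, 2))])
gm_mu, gm_sigma, gm_prior, gm_resp = em_isotropic_gmm(
    sample, mu=[[1, 1], [6, 6]], sigma=[2.0, 2.0], prior=[0.5, 0.5])
print(gm_mu, gm_sigma, gm_prior)
```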
One drawback of all prototype-based methods is that the number of clusters needs to
be known (or suspected) in advance. If nothing is known whatsoever about the cor-
rect number of clusters, additional effort is necessary to apply the methods from this
section. One may think of at least three different approaches: (1) top-down, divisive
clustering: start with a relatively small number of clusters and split the cluster in
case it does not fit the associated data well (e.g., bisecting k-means; see, e.g., [61]);
(2) bottom-up, agglomerative clustering: overestimate the number of clusters and
merge similar clusters; and (3) run the algorithm for a full range of possible
numbers of clusters and evaluate each partition w.r.t. the overall goodness of fit.
From the plot of obtained results the optimal number of clusters is determined by
finding an extremum.
To find the optimal result, all these methods make use of so-called validity
measures (see, e.g., [19, 27]). Approaches (1) and (2) employ a local validity measure
that evaluates a single cluster only, such as the data density within the cluster or the
distribution of membership degrees (unambiguous memberships are preferable as
they indicate a successful separation of clusters). Another example is the silhouette
coefficient [46] of a cluster C, which is the average of silhouette coefficients s(x) of
its members x ∈ C:
\[
s(x) = \frac{b(x) - a(x)}{\max\{a(x), b(x)\}} \in [-1, 1],
\]
where a(x) = d(x, C) is the average distance of x to members of the same cluster
C, and b(x) is the average distance to the members of the nearest cluster C′ other
than C (b(x) = min_{C′≠C} d(x, C′)). Well-clustered data x is close to members of its
own cluster (small a(x)) but far away from members of other clusters (large b(x))
and thus receives a high silhouette coefficient near 1. A good cluster receives a high
average silhouette coefficient from its members.
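A direct computation of the silhouette coefficient might look as follows. The sketch assumes at least two clusters; excluding x itself from the average a(x) and assigning 0 to members of singleton clusters are common conventions that we assume here, not statements of the text.

```python
import numpy as np

def silhouette(data, labels):
    """Silhouette coefficient s(x) for every record (at least two clusters assumed)."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    s = np.zeros(len(data))
    for j in range(len(data)):
        same = labels == labels[j]
        same[j] = False  # assumption: x itself is not counted in a(x)
        if not same.any():
            continue     # assumption: singleton clusters receive s(x) = 0
        a = dist[j, same].mean()
        b = min(dist[j, labels == k].mean()
                for k in np.unique(labels) if k != labels[j])
        s[j] = (b - a) / max(a, b)
    return s

points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
s_good = silhouette(points, [0, 0, 0, 1, 1, 1])
s_bad = silhouette(points, [0, 1, 0, 1, 0, 1])
print(s_good.mean(), s_bad.mean())
```

The partition that matches the two visible groups receives a clearly higher average silhouette coefficient than the arbitrary one.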
In contrast, approach (3) requires a global validity measure that evaluates all
clusters at the same time. Regarding the global validity measures, we could
employ the objective functions themselves to determine the optimal number of clusters.
However, all the objective functions tend to become better and better the larger the
number of clusters is (and at the end, there would be n singleton clusters). Validity
measures such as the Akaike information criterion compensate for this effect by
penalizing the obtained likelihood by the number of necessary parameters. Other typical
measures, such as the separation index [20], identify compact and well-separated
clusters:
\[
D = \min_{i=1..c} \; \min_{j=i+1..c} \;
    \frac{\min_{x \in C_i,\, y \in C_j} d(x, y)}{\max_{k=1..c} \mathrm{diam}_k},
\]
where the numerator represents the distance between clusters Ci and Cj (which
should be large), and the denominator diam_k = max_{x,y∈Ck} ‖x − y‖ expresses the
extension of cluster Ck (which should be small for compact clusters). Since the
separation index considers neither fuzzy membership nor probabilities, it is well
applicable to k-Means. There are also fuzzified versions of the separation index [58].
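For a crisp partition, the separation index can be computed directly from the pairwise distance matrix. The sketch below assumes Euclidean distances, at least two clusters, and at least one cluster with more than one member (otherwise the largest diameter is zero); the function name is illustrative.

```python
import numpy as np

def separation_index(data, labels):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    clusters = np.unique(labels)
    # Largest diameter over all clusters (denominator).
    max_diam = max(dist[np.ix_(labels == k, labels == k)].max() for k in clusters)
    # Smallest distance between members of two different clusters (numerator).
    min_sep = min(dist[np.ix_(labels == i, labels == j)].min()
                  for pos, i in enumerate(clusters) for j in clusters[pos + 1:])
    return float(min_sep / max_diam)

points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
d_good = separation_index(points, [0, 0, 0, 1, 1, 1])
d_bad = separation_index(points, [0, 1, 0, 1, 0, 1])
print(d_good, d_bad)
```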
Admittedly these heuristic approaches fail to find the number of clusters if the
model assumptions do not hold (e.g., cluster shape is not hyperspherical, cluster
sizes are very different, cluster density is very different) or there is too much noise
in the data.
Although finding the correct number of clusters using model-based clustering
may be difficult, sometimes it is very handy to specify the number of groups in
advance. Clustering algorithms may also be used as a tool for other tasks, such as
discretization or allocation. If customers have to be assigned to k account managers,
k-Means clustering could help to assign a homogeneous group of customers to each
manager. In such settings, the number of groups is usually known beforehand.
We have only considered hyperspherical, isotropic clusters, that is, clusters that ex-
tend uniformly from the centre into the dataspace. With real data this assumption is
seldom satisfied. The Euclidean distance used in GMD and FcM can be generalized
to the Mahalanobis distance, which adapts to the correlations between dimensions
and accounts for ellipsoidal clusters. There are also many variants of Fuzzy c-Means
that consider somewhat exotic cluster shapes such as circles, ellipses, rectangles, or
quadrics [33] (shell clustering).
Integrating flexible cluster shapes into the distance measure greatly increases
the number of parameters; typically the resulting algorithms react very sensitively
to poor initializations. Another approach is to subdivide the data into a fairly
7.4 Density-Based Clustering 169
large number of small clusters (using, e.g., k-Means) and then agglomeratively join
these prototypes to clusters. In a sense, k-Means clustering is then used as a
preprocessing phase to compress the data volume before agglomerative hierarchical
clustering is applied. (On the other hand, such a condensed representation can help
to speed up k-Means and its variants significantly [31, 60].)
A very simple extension of fuzzy c-means clustering to cope with outliers is noise
clustering [18]. We have already discussed that, to consider a different cluster shape,
we only have to define an appropriate distance function. The idea of a noise cluster
is that it has a fixed (usually large) distance dnoise to any data point. As soon as the
distance of some data x ∈ X to the nearest prototype pi comes close to dnoise , the
noise cluster gains a considerable fraction of the total membership degree, thereby
reducing the influence of x with respect to pi. Noise clustering simply requires
replacing the membership update (7.11) by
\[
p_{i|j} = \frac{1}{\frac{\|x_j - p_i\|^2}{d_{noise}^2} + \sum_{k=1}^{c} \frac{\|x_j - p_i\|^2}{\|x_j - p_k\|^2}}
\tag{7.18}
\]
and represents an effective means to reduce the influence of noise and extract cluster
prototypes more clearly.
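The effect of the noise cluster on the membership degrees can be illustrated with a small sketch of (7.18). The prototypes, the value of d_noise, and the test points are arbitrary choices for illustration; the sketch assumes that x does not coincide with a prototype. The membership of the noise cluster is what remains after the regular memberships have been assigned.

```python
import numpy as np

def noise_memberships(x, prototypes, d_noise):
    """Memberships according to (7.18) plus the remaining noise membership."""
    x = np.asarray(x, dtype=float)
    prototypes = np.asarray(prototypes, dtype=float)
    inv = 1.0 / ((prototypes - x) ** 2).sum(axis=1)  # 1 / ||x - p_i||^2
    # (7.18), rewritten by dividing numerator and denominator by ||x - p_i||^2.
    u = inv / (1.0 / d_noise ** 2 + inv.sum())
    return u, 1.0 - u.sum()

protos = [[0.0, 0.0], [10.0, 0.0]]
u_mid, noise_mid = noise_memberships([5.0, 0.0], protos, d_noise=5.0)    # halfway
u_out, noise_out = noise_memberships([5.0, 50.0], protos, d_noise=5.0)   # outlier
u_near, noise_near = noise_memberships([1.0, 0.0], protos, d_noise=5.0)  # typical
print(u_mid, noise_mid)
print(u_out, noise_out)
print(u_near, noise_near)
```

In this toy setting, the point halfway between the prototypes and the far outlier both receive equal memberships to the two regular clusters, but only the outlier gives most of its membership to the noise cluster.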
The noise cluster allows for a better analysis of the clustering results. What
typically happens when no noise cluster is used is that the memberships of an outlier
tend to ≈ 1/k for the k closest clusters. This situation may also occur for data that
is well within the usual data range (and thus not an outlier) but halfway between
these k clusters. By inspection of membership degrees these two cases are
indistinguishable. When a noise cluster is present, the former case is easily detected,
as the noise cluster has gained most of the membership degree (and only a small
remainder is equally distributed among the closest prototypes).
We have seen in Sect. 7.1 that single-linkage clustering is flexible in terms of rec-
ognizable cluster shapes, but sensitive to outliers, which may easily be misused as
bridges between clusters. These bridges are particularly counterintuitive if the core
of the clusters consists of a considerable amount of data, but the bridge is estab-
lished by a few data objects only. To establish a substantial connection between the
clusters, the data density of the bridge should be comparable to that of the cluster
itself. Density-based clustering goes in this direction by requiring a minimum data
density within the whole cluster.
7.4.1 Overview
Fig. 7.10 Density-based clustering using (a) grid cells or (b)/(c) data neighborhoods to estimate
the density
7.4.2 Construction
We consider the DBScan algorithm given in Table 7.10. The density threshold is
encoded by two parameters, the radius ε of the (hyperspherical) neighborhood and the
number MinPts of data objects that is needed in the neighborhood to consider it as
dense. The actual density at some location x is measured within the ε-neighborhood
Nε(x) = {y ∈ D | ‖x − y‖ ≤ ε} of x. If this neighborhood contains at least MinPts
elements, x is located within a cluster and is called a core object. All data in the
ε-neighborhood of a core object also belong to the same cluster as the core object.
Other objects within Nε(x) may likewise be core objects (they are density-reachable)
such that their neighboring elements are also included in the same cluster. This
expansion strategy is repeated until no further expansion is possible. Then,
eventually, the full cluster has been determined.
The main routine dbscan passes once through the database and invokes the
subroutine expand for every data object x unlabeled so far (line 5). Whenever this subroutine
returns true, a new cluster has successfully been extracted from the seed point x.
The expand subroutine firstly identifies the data in the neighborhood of x (set S
in line 1). In case the desired density is not reached (|S| < MinPts), x is relabeled as
NOISE, and no cluster has been found at x. Otherwise, a (potentially new) cluster
has been found, and all data from the neighborhood are assigned to it (relabeled
with cluster-ID cid, line 4). At this point, the processing of x has been completed,
and the function expand tries to extend the cluster further. For all the data in the
neighborhood, it is consecutively checked if they also satisfy the density threshold
(line 8). If that is the case, they also belong to the core of the cluster (and are called
core-points). All data in the neighborhood of a core-point are then also added to
the cluster, that is, they are relabeled and inserted into the set S of data that has to
be tested for further cluster expansion. If the data are organized in a data structure
that supports neighborhood queries (as called in line 1), the runtime complexity is
low (O(n log n)). However, the performance of such data structures (e.g., R*-trees)
degrades as the data dimensionality increases.
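The procedure described above can be sketched in Python as follows. This is a simplified illustration and not a transcription of Table 7.10: the neighborhood query is a brute-force scan (quadratic runtime instead of an index structure), the neighborhood of x includes x itself, and the label constants and function names are our own choices.

```python
import numpy as np

NOISE, UNLABELED = -1, -2  # illustrative label constants

def region_query(data, j, eps):
    """Indices of all records within distance eps of record j (brute force)."""
    return np.flatnonzero(np.linalg.norm(data - data[j], axis=1) <= eps)

def expand(data, labels, j, cid, eps, min_pts):
    """Try to grow cluster cid from seed j; return True if a cluster was found."""
    neighbors = region_query(data, j, eps)
    if len(neighbors) < min_pts:
        labels[j] = NOISE  # may later be relabeled as a border object
        return False
    labels[j] = cid
    queue = [k for k in neighbors if k != j]
    while queue:
        k = queue.pop()
        if labels[k] == NOISE:
            labels[k] = cid   # border object: in the cluster, but not a core object
        if labels[k] != UNLABELED:
            continue          # already processed
        labels[k] = cid
        k_neighbors = region_query(data, k, eps)
        if len(k_neighbors) >= min_pts:  # k is a core object: expand further
            queue.extend(k_neighbors)
    return True

def dbscan(data, eps, min_pts):
    data = np.asarray(data, dtype=float)
    labels = np.full(len(data), UNLABELED)
    cid = 0
    for j in range(len(data)):
        if labels[j] == UNLABELED and expand(data, labels, j, cid, eps, min_pts):
            cid += 1
    return labels

points = [[0, 0], [0.5, 0], [1, 0], [1.5, 0],
          [10, 0], [10.5, 0], [11, 0], [11.5, 0],
          [5, 5]]
db_labels = dbscan(points, eps=0.6, min_pts=3)
print(db_labels)
```

In this toy example the first record is initially labeled as noise, because its own neighborhood is too sparse, and is later absorbed as a border object of the first cluster; the isolated record at (5, 5) remains noise.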
To obtain optimal results, two parameters need to be adjusted, MinPts and ε. An
alternative set of parameters is the desired density threshold and the resolution of the
analysis: Any choice of MinPts and ε corresponds to the selection of a lower bound
on the data density. But for some given density threshold, we have the freedom to
choose MinPts twice as large and at the same time double the volume of Nε without
affecting the density threshold. The larger the volume, the more robust the density
estimation, but at the same time we lose the capability of recognizing clusters
smaller than this volume. Besides that, a single best choice may not exist for a given
dataset if clusters of different densities are present. As a rule of thumb, MinPts may
be set to 2 · dimension − 1, and then ε is determined from visually inspecting a
sorted MinPts-distance plot: For all data xi, the radius ri is determined for which
|Nri(xi)| = MinPts. These radii ri are then plotted in descending order (cf. Fig. 7.11).
Under the assumption that the clusters have roughly comparable densities, the ri -
values of data belonging to clusters should roughly be the same (in the ordered
plot this corresponds to a marginal negative slope). In contrast, outside the clusters
The OPTICS algorithm [5] is an extension of DBScan that solves the problem of
determining MinPts and ε differently. It can also be considered as a hierarchical
clustering algorithm, as it provides the resulting partition for a full range of possible
values ε. For each data object, OPTICS determines the so-called reachability- and
core-distance. The core-distance is basically the smallest value of ε under which
the data object becomes a core point. The reachability distance is the smallest
distance under which the data object becomes a cluster member, that is, belongs to the
neighborhood of the closest core object. The reachability plot of all data objects in a
specific order determined during clustering can be interpreted as a cluster hierarchy,
as shown in Fig. 7.12. Valleys in this plot indicate clusters in the dataset: the broader
the valley, the larger the cluster, and the deeper the valley, the higher the density.
Another possibility is to define the data density over the full data space X by the
superposition of influence functions centered at every data object:
\[
\varrho(x) = \sum_{y \in D} f_y(x),
\]
where fy(·) tells us how well x is represented (or influenced) by y. In the case of
DBScan, we may define fy(x) = 1 if ‖x − y‖ ≤ ε and 0 otherwise. For a core object of
Fig. 7.12 The reachability plot of the dataset on the right. The data objects have been ordered, and
for each data object, its reachability value is shown. Contiguous areas on the horizontal axis, where
the reachability drops below a horizontal line, represent a cluster. The horizontal lines represent
certain data density levels; the lower the line, the higher the density
174 7 Finding Patterns
a DBScan-cluster, we have (x) ≥ MinPts, because at least MinPts data objects ex-
ist around x within the ε-neighborhood. The more data exist around x, the higher the
density function (x). The DENCLUE algorithm [29] uses a Gaussian distribution
for fy (x) instead and concentrates on local maxima of (called density attractors)
via a hill-climbing technique. Such local maxima can be considered as point-like
clusters as in prototype-based clustering. Alternatively, several points with densities
above some threshold can be considered as representing jointly a cluster of arbitrary
shape. The DENCLUE algorithm preprocesses the data and organizes it into grid
cells. The summation of is restricted to some neighboring cells, thereby keeping
the overall runtime requirements low. Since the neighborhood relationship between
grid cells is known, no expensive neighborhood queries are necessary as with DB-
Scan or OPTICS. The regular grid of DENCLUE is adapted to the data at hand in
the OptiGrid approach [30].
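The superposition of influence functions can be sketched in a few lines. The following is only a minimal illustration, not the DENCLUE implementation: it evaluates the density by brute force over all data objects (no grid cells, no hill climbing). The function names, the toy data, and the width parameter sigma of the Gaussian are assumptions made for this example.

```python
import math

def density(x, data, influence):
    """Superposition of influence functions: sum of f_y(x) over all y in data."""
    return sum(influence(x, y) for y in data)

def indicator_influence(eps):
    """DBScan-style influence: 1 if y lies within distance eps of x, else 0."""
    def f(x, y):
        return 1.0 if math.dist(x, y) <= eps else 0.0
    return f

def gaussian_influence(sigma):
    """Gaussian influence of width sigma (sigma is an assumed parameter name)."""
    def f(x, y):
        return math.exp(-math.dist(x, y) ** 2 / (2.0 * sigma ** 2))
    return f

# Hypothetical toy data: a tight group of four points and one isolated point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]

# With the indicator function the density counts the eps-neighbors (including x itself).
dens_group = density((0.0, 0.0), data, indicator_influence(eps=0.5))
dens_isolated = density((5.0, 5.0), data, indicator_influence(eps=0.5))

# With the Gaussian the density varies smoothly with the distances.
dens_gauss_group = density((0.0, 0.0), data, gaussian_influence(sigma=0.5))
dens_gauss_isolated = density((5.0, 5.0), data, gaussian_influence(sigma=0.5))
```

With the indicator function and MinPts = 4, the point (0, 0) would qualify as a core object in this toy data, whereas the isolated point would not.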
We have emphasized in Sect. 7.2 that the choice of the (dis)similarity measure has
a major impact on the kind and number of clusters the respective algorithm will find.
A certain distance function may ignore some attributes; therefore we may consider
the feature selection problem as part of the distance function design. All the
clustering algorithms we have discussed so far assume that the user is capable of
providing the right distance function and thus of selecting the right features or, in
other words, the right subspace that contains the clusters.
The term subspace clustering refers to those methods that try to solve the prob-
lem of subspace selection simultaneously while clustering. (This short section on
subspace clustering has been located in the section on density-based clustering, be-
cause most subspace clustering techniques base their measurements on a density-
based clustering notion.) As in feature selection, the problem with a naïve approach,
where we simply enumerate all possible subspaces (i.e., subsets of the attribute set)
and run the clustering algorithm of our choice, is the exponential complexity: the
number of potential subspaces increases exponentially with the dimensionality of
the dataspace. There are at least three ways to attack this problem.
In a bottom-up approach, we first look for clusters in the individual
one-dimensional spaces (1D). From those we can construct two-dimensional spaces
via the cross-product of dense one-dimensional clusters (blue shaded regions in
Fig. 7.13, right). We thereby save ourselves from investigating those 2D-areas whose
1D-projections are not dense (white areas in Fig. 7.13, right), as they have no chance
of becoming dense in higher dimensions anyway. The algorithm CLIQUE [4] follows
this approach and takes the basic idea of dimension-wise cluster construction
from frequent pattern mining, which will be discussed in Sect. 7.6.
In a top-down solution we look for a few data points with high data density in the
full dataset (anchor points) and then identify those variables that lead to the highest
data density. This can be done cluster-wise, such that every cluster may live in a
different subspace. A member of this family of algorithms is PROCLUS (projected
clustering) [1]. Each dimension gets a weight that indicates whether the dimension
belongs to the cluster’s subspace, and this weight is iteratively adjusted to increase
the cluster validity measure.
The third option is to rank subspaces according to their suspected suitability for
clustering and use the top-ranked subspace(s) for a conventional clustering algo-
rithm (similar to feature selection). The basic idea of the SURFING (subspaces
relevant for clustering) approach [7] is to look at the distribution of distances to
the kth nearest neighbor, as it has already been used to determine ε in the DBScan
algorithm, cf. Fig. 7.11. If the data is distributed on a regular grid, this dataset,
which is rather uninteresting with respect to clustering, is recognized by an almost
straight line in this graph. If clusters are present, we expect varying distances to the
kth neighbor. The more the distances deviate from a constant line, the more promising
the subspace.
Similar to CLIQUE, the potential subspaces are then explored levelwise to identify
those subspaces that appear most promising.
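The sorted distances to the kth nearest neighbor are easy to compute by brute force. The sketch below illustrates only this ingredient; it is not the SURFING quality measure. The deviation score used here (mean absolute deviation from the mean) is an assumed, simplified stand-in for "how far the curve is from a constant line", and both datasets are hypothetical.

```python
import math

def kth_neighbor_distances(data, k):
    """Distance of every point to its k-th nearest neighbor (the point itself
    excluded), returned in descending order as in a sorted k-distance plot."""
    result = []
    for i, x in enumerate(data):
        dists = sorted(math.dist(x, y) for j, y in enumerate(data) if j != i)
        result.append(dists[k - 1])
    return sorted(result, reverse=True)

def deviation_from_constant(values):
    """Mean absolute deviation from the mean (an assumed, simplified score)."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)

# Hypothetical data: a regular 5x5 grid ...
grid = [(float(i), float(j)) for i in range(5) for j in range(5)]
# ... versus two tight groups plus a few scattered points.
clustered = (
    [(0.1 * i, 0.1 * j) for i in range(3) for j in range(3)]
    + [(5 + 0.1 * i, 5 + 0.1 * j) for i in range(3) for j in range(3)]
    + [(2.5, 0.0), (0.0, 2.5), (2.5, 5.0), (5.0, 2.5), (2.5, 2.5), (8.0, 8.0), (8.0, 0.0)]
)

grid_curve = kth_neighbor_distances(grid, k=2)
clustered_curve = kth_neighbor_distances(clustered, k=2)

grid_score = deviation_from_constant(grid_curve)
clustered_score = deviation_from_constant(clustered_curve)
```

For the grid the curve is flat (every point has its second neighbor at distance 1), so the score is zero, while the clustered data yields a clearly positive score.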
7.5 Self-organizing Maps
7.5.1 Overview
A map consists of many tiny pictograms, letters, and numbers, each of them
meaningful to the reader, but the most important property when using a map is the
neighborhood relationship: If two points are close to each other on the map, they are
also close to each other in the real world. The self-organizing maps or Kohonen maps
[38] start from a mesh of map nodes² for which the neighborhood relationship is fixed.
One may think of a fishing net, where the knots represent the nodes, and all neighbors
are connected by strings. The structure of the fishing net is two-dimensional, but in a
three-dimensional world it may lie flat on the ground, forming a regular grid of mesh
nodes, may be packed tightly in a wooden box, or may even float freely in the ocean.
While the node positions in the higher-dimensional space vary, the neighborhood
relationship remains fixed by the a priori defined connections of the mesh (knots).
² The nodes are usually called neurons, since self-organizing maps are a special form of neural
network.
The idea of self-organizing maps is to define the mesh first (e.g., two-dimensional
mesh)—which serves as the basic layout for the pictorial representation—and then
let it float in the dataspace such that the data gets well covered. Every node pi
in the pictorial representation is associated with a corresponding reference node
ri in the high-dimensional space. At the end, the properties of the map at node
pi , such as color or saturation, are taken directly from the properties of the data
space at ri . The elastic network represents a nonlinear mapping of reference vectors
(high-dimensional positions in the data space) to the mesh nodes (two-dimensional
positions in the mesh).
Figure 7.14 shows an example for a two-dimensional data set. In such a
low-dimensional case the self-organizing map provides no advantages over a scatter plot;
it is used for illustration purposes only. The occupied dataspace is covered by the
reference vectors of the self-organizing map (overlaid on the scatter plot). The color
and data density information is taken from the respective locations and gets mapped
to the two-dimensional display coordinates (right subfigure).
7.5.2 Construction
Learning a self-organizing map is done iteratively (see Table 7.11): for any data
vector x from the dataset, the closest reference vector rw (in the high-dimensional
dataspace) is identified. The node w is usually called the winner neuron. From
the low-dimensional map topology the neighboring nodes of node w are identified.
Suppose that node i is one of these neighbors, then the associated reference vector ri
is moved slightly towards x so that it better represents this data vector. How much it
is shifted is controlled by a heuristic learning rate η, which slowly approaches zero
to ensure the convergence of the reference vectors. Thus, any new data vector leads
to a local modification of the map around the winner neuron. After a fixed number
of iterations or when the changes drop below some threshold, the map adjustment
is finished.
To determine the winner node any distance function might be used (see the mea-
sures discussed in Sect. 7.2), but the Euclidean and the cosine distances are most
frequently employed. As not only the winning reference vector rw is shifted towards
the new sample, but also neighboring vectors, we need a way to specify which nodes
will be affected. This is done by a neighborhood function h : ℝ² × ℝ² → [0, 1]: For
node i, this function provides a factor h(pw , pi ) that directly influences the refer-
ence vector modification. Note that the neighborhood information is taken from the
two-dimensional node locations pi , while the update concerns the high-dimensional
reference vectors ri . The smaller the factor, the smaller the effect of the modifica-
tion; a factor of zero means that the reference vector is left unaltered. The following,
typically used functions share a common parameter, the influence range δ, which
controls the size of the neighborhood:
Initially, the neighborhood should be relatively large as the map needs to be first
unfolded. The random initialization leads to an unorganized initial state, and neigh-
boring nodes i and j in the low-dimensional representation are not associated with
neighboring reference vectors. To avoid distorted or twisted maps, the so-called
stiffness parameter δ must be relatively large at the beginning and must not de-
crease too quickly. Figure 7.15 shows some intermediate steps in the evolution of a
self-organizing map.
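The construction described in this section can be sketched as a short training loop. This is a minimal illustration under stated assumptions, not a reproduction of Table 7.11: the Gaussian neighborhood function, the linear decay of the learning rate and of the influence range, the lower bound of 0.5 for the influence range, and all names (train_som, eta0, delta0) are choices made for this example only.

```python
import math
import random

def train_som(data, rows, cols, iterations=2000, eta0=0.5, delta0=None, seed=0):
    """Train a rows x cols self-organizing map on data (a list of equal-length tuples)."""
    rng = random.Random(seed)
    dim = len(data[0])
    if delta0 is None:
        delta0 = max(rows, cols) / 2.0   # start with a large neighborhood
    # Fixed two-dimensional mesh positions p_i and random initial reference vectors r_i.
    positions = [(r, c) for r in range(rows) for c in range(cols)]
    refs = [[rng.random() for _ in range(dim)] for _ in positions]
    for t in range(iterations):
        frac = t / iterations
        eta = eta0 * (1.0 - frac)                 # learning rate approaches zero
        delta = max(delta0 * (1.0 - frac), 0.5)   # influence range shrinks slowly
        x = rng.choice(data)
        # Winner neuron: the node whose reference vector is closest to x.
        w = min(range(len(refs)), key=lambda i: math.dist(refs[i], x))
        for i, p in enumerate(positions):
            # Neighborhood factor from the mesh positions (assumed Gaussian form).
            h = math.exp(-math.dist(positions[w], p) ** 2 / (2.0 * delta ** 2))
            # Shift the reference vector towards x, scaled by eta and h.
            refs[i] = [r + eta * h * (xk - r) for r, xk in zip(refs[i], x)]
    return positions, refs

# Hypothetical data: 200 points drawn uniformly from the unit square.
data_rng = random.Random(1)
data = [(data_rng.random(), data_rng.random()) for _ in range(200)]
positions, refs = train_som(data, rows=4, cols=4)
```

Note that the neighborhood factor is computed from the two-dimensional mesh positions, while the update itself is applied to the reference vectors in the data space, exactly as described above.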
7.6 Frequent Pattern Mining and Association Rules
Frequent pattern mining tackles the problem of finding common properties (pat-
terns) that are shared by all cases in certain sufficiently large subgroups of a given
data set. The general approach to find such patterns is to (efficiently) search a space
of potential patterns. Based on the type of data to be mined, frequent pattern mining
is divided into (1) frequent item set mining and association rule induction, (2) fre-
quent sequence mining, and (3) frequent (sub)graph mining with the special subarea
of frequent (sub)tree mining.
7.6.1 Overview
Frequent pattern mining was originally developed—in the special form of frequent
item set mining—for market basket analysis, which aims at finding regularities in
the shopping behavior of the customers of supermarkets, mail-order companies, and
online shops. In particular, it tries to identify sets of products that are frequently
bought together. Once identified, such sets of associated products may be exploited
to optimize the organization of the offered products on the shelves of a supermarket
or on the pages of a mail-order catalog or web shop, may be used to suggest other
products a customer could be interested in (so-called cross-selling), or may provide
hints as to which products may conveniently be bundled. In order to find sets of
associated products, one analyzes recordings of actual purchases and selections made
by customers in the past, that is, basically lists of items that were in their shopping
carts. Due to the widespread use of scanner checkouts (with bar code readers), such
recordings are nowadays readily available in basically every supermarket.
Generally, the challenge of frequent pattern mining consists in the fact that the
number of potential patterns is usually huge. For example, a typical supermarket
has thousands of products on offer, giving rise to astronomical numbers of sets
of potentially associated products, even if the size of these sets is limited to a few
items: there are about 8 trillion (8 · 10^12) different sets of 5 items that can be
selected from a set of 1000 items. As a consequence, a brute force approach, which
simply enumerates and checks all possible patterns, quickly turns out to be
infeasible.
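The order of magnitude of this count is easy to verify: the number of 5-element subsets of a 1000-element set is the binomial coefficient "1000 choose 5". The variable names in this small check are our own.

```python
import math

# Number of different sets of 5 items selected from 1000 items.
n_sets_of_5 = math.comb(1000, 5)

# Number of sets with at most 5 items (including the empty set), for comparison.
n_sets_up_to_5 = sum(math.comb(1000, k) for k in range(6))

print(n_sets_of_5)  # 8250291250200, i.e. about 8.25 * 10**12
```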
More sophisticated approaches exploit the structure of the pattern space to guide
the search and the simple fact that a pattern cannot occur more frequently than any
of its subpatterns to avoid useless search, which cannot yield any output. In addition,
clever representations of the data, which allow for efficient counting of the number
of cases satisfying a pattern, are employed. In the area of frequent item set mining,
research efforts in this direction led to well-known algorithms like Apriori [2, 3],
Eclat [59], and FP-growth [28], but there are also several variants and extensions
[8, 25].
Such rules are customarily called association rules, because they describe an
association of the item(s) in the consequent (then-part) with the item(s) in the an-
tecedent (if-part). Such rules are particularly useful for cross-selling purposes, be-
cause they indicate what other products may be suggested to a customer. They are
generated from the found frequent item sets by relating item sets one of which is a
subset of the other, using the smaller item set as the antecedent and the additional
items in the other as the consequent.
With the algorithms mentioned above, the basic task of frequent item set min-
ing can be considered satisfactorily solved, as all of them are fast enough for most
practical purposes. Nevertheless, there is still room for improvement. Recent ad-
vances include filtering the found frequent item sets and association rules (see, e.g.,
[53, 54]), identifying temporal changes in discovered patterns (see, e.g., [10, 11]),
and mining fault-tolerant or approximate frequent item sets (see, e.g., [16, 44, 52]).
Furthermore, sequences, trees, and generally graphs have been considered as
patterns, thus vastly expanding the possible application areas and also introducing
new problems. In particular, for patterns other than item sets—and especially for
general graphs—avoiding redundant search is much more difficult. In addition, the
fact that with these types of data a pattern may occur more than once in a single
sample case (for example, a specific subsequence may occur at several locations in
a longer sequence) allows for different definitions of what counts as frequent and
thus makes it possible to find frequent patterns for single instances (for example,
mining frequent subgraphs of a single large graph).
The application areas of frequent pattern mining in its different specific forms
include market basket analysis, quality control and improvement, customer man-
agement, fraud detection, click stream analysis, web link analysis, genome analysis,
drug design, and many more.
7.6.2 Construction
Formally, the task of frequent item set mining can be described as follows: we
are given a set B of items, called the item base, and a database T of transactions.
Each item may represent a product, a special equipment item, a service option, etc.,
and the item base represents the set of all products etc. that are offered. The term
item set refers to any subset of the item base B. Each transaction is an item set
and represents a set of products that has been bought by a customer. Because two or
more customers may have bought exactly the same set of products, the collection
of all transactions must be represented as a vector, a bag, or a multiset; in a
simple set each transaction could occur at most once. (Alternatively, each
transaction may be enhanced by a unique transaction identifier, and these enhanced
transactions may then be combined in a simple set.) Note that the item base B
is usually not given explicitly but only implicitly as the union of all transactions.
An example transaction database over the item base B = {a, b, c, d, e} is shown
in Fig. 7.16.
The support sT (I ) of an item set I ⊆ B is the number of transactions in the
database T it is contained in. Given a user-specified minimum support smin ∈ N,
an item set I is called frequent in T iff sT (I ) ≥ smin . The goal of frequent item
set mining is to identify all item sets I ⊆ B that are frequent in a given transaction
database T . In the transaction database shown in Fig. 7.16, 16 frequent item sets
with 0 to 3 items can be discovered if smin = 3 is chosen. Note that more frequent
item sets are found than there are transactions, which is a typical situation.
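The definitions of support and frequent item set translate directly into a brute-force check of all subsets of the item base. Since Fig. 7.16 is not reproduced here, the ten transactions below are an assumed stand-in, chosen to be consistent with the counts quoted in the text; the function names are our own. The enumeration is exponential in the number of items and is meant only to make the definitions concrete.

```python
from itertools import combinations

# Assumed toy database over the item base {a, b, c, d, e} (one string per transaction).
transactions = ["ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade"]
T = [frozenset(t) for t in transactions]

def support(itemset, db):
    """Number of transactions in db that contain the item set."""
    return sum(1 for t in db if itemset <= t)

def frequent_item_sets_brute_force(db, smin):
    """Enumerate all subsets of the item base and keep those with support >= smin."""
    base = sorted(set().union(*db))   # item base = union of all transactions
    result = {}
    for k in range(len(base) + 1):
        for combo in combinations(base, k):
            s = support(frozenset(combo), db)
            if s >= smin:
                result[frozenset(combo)] = s
    return result

freq = frequent_item_sets_brute_force(T, smin=3)
for itemset, s in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)) or "(empty)", s)
```

For this toy database and smin = 3, the enumeration yields 16 frequent item sets with 0 to 3 items, which is more than the number of transactions.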
In order to design an algorithm to find frequent item sets, it is beneficial
to first consider the properties of the support of an item set. Obviously we
have
∀I : ∀J ⊇ I : sT (J ) ≤ sT (I ).
That is: If an item set is extended, its support cannot increase. This is immediately
clear from the fact that each added item is like an additional constraint a transaction
has to satisfy. One also says that support is antimonotone or downward closed.
From this property it immediately follows what is known as the a priori prop-
erty:
∀smin : ∀I : ∀J ⊇ I : sT (I ) < smin → sT (J ) < smin .
Fig. 7.17 A subset lattice for five items (left) and the frequent item sets (right, frequent item sets
in blue) for the transaction database shown in Fig. 7.16
Fig. 7.18 A subset tree that results from assigning a unique parent to each item set (left) and a
corresponding prefix tree in which sibling nodes with the same prefix are merged (right)
Since with this scheme of assigning unique parents, all sibling item sets share the
same prefix (w.r.t. the chosen item order), namely the item set that is their parent,
it is convenient to structure the search as a prefix tree, as shown in Fig. 7.18. In this
prefix tree the concatenation of the edge labels on the path from the root to a node
is the common prefix of all item sets explored in this node.
A standard approach to find all frequent item sets w.r.t. a given database T and
given support threshold smin , which is adopted by basically all frequent item set
mining algorithms (except those of the Apriori family), is a depth-first search in
the subset tree of the item base B. Viewed properly, this approach can be inter-
preted as a simple divide-and-conquer scheme. For the first item i in the chosen
item order, the problem to find all frequent item sets is split into two subproblems:
(1) find all frequent item sets containing the item i, and (2) find all frequent item
sets not containing the item i. Each subproblem is then further divided based on the
next item j in the chosen item order: find all frequent item sets containing (1.1) both
items i and j , (1.2) item i but not j , (2.1) item j but not i, (2.2) neither item i nor j ,
and so on, always splitting with the next item (see Fig. 7.19 for an illustration).
All subproblems that occur in this divide-and-conquer recursion can be defined
by a conditional transaction database and a prefix. The prefix is a set of items
that has to be added to all frequent item sets that are discovered in the conditional
database. Formally, all subproblems are tuples S = (C, P ), where C is a conditional
database, and P ⊆ B is a prefix. The initial problem, with which the recursion is
started, is S = (T , ∅), where T is the given transaction database, and the prefix is
empty. The problem S is, of course, to find all item sets that are frequent in T .
A subproblem S0 = (T0 , P0 ) is processed as follows: Choose an item i ∈ B0 ,
where B0 is the set of items occurring in T0 . Note that, in principle, this choice is
arbitrary but usually respects a predefined order of the items, which then also defines
the structure of the subset tree as discussed above.
If sT0 (i) ≥ smin (that is, if the item i is frequent in T0 ), report the item set P0 ∪ {i}
as frequent with the support sT0 (i) and form the subproblem S1 = (T1 , P1 ) with
P1 = P0 ∪ {i}. The conditional database T1 comprises all transactions in T0 that
contain the item i, but with the item i removed. This also implies that transactions
that contain no other item than i are entirely removed: no empty transactions are
ever kept in the search. If T1 is not empty, S1 is processed recursively.
In any case (that is, regardless of whether sT0 (i) ≥ smin or not), form the sub-
problem S2 = (T2 , P2 ), where P2 = P0 . The conditional database T2 comprises all
transactions in T0 (including those that do not contain the item i), but again with
the item i removed (and, as before, transactions containing no other item than i
discarded). If T2 is not empty, S2 is processed recursively.
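The processing of a subproblem can be written down almost literally as a recursive function. The sketch below uses plain Python sets as a (deliberately naive) representation of the conditional databases and processes the items in alphabetical order; both choices, the function name mine, and the toy database are assumptions for illustration, not the data structures of any of the algorithms named in this section.

```python
def mine(db, prefix, smin, report):
    """Process the subproblem (db, prefix).

    db     -- conditional database: list of non-empty frozensets
    prefix -- frozenset of items to be added to every item set found in db
    report -- dict that collects frequent item sets with their support
    """
    if not db:
        return
    items = sorted(set().union(*db))
    i = items[0]                                # next item in the chosen order
    s_i = sum(1 for t in db if i in t)          # support of i in db
    if s_i >= smin:
        report[prefix | {i}] = s_i
        # First subproblem: transactions containing i, with i removed;
        # transactions that consist only of i are dropped.
        db1 = [t - {i} for t in db if i in t and len(t) > 1]
        mine(db1, prefix | {i}, smin, report)
    # Second subproblem (formed in any case): all transactions with i removed;
    # again, transactions that consist only of i are dropped.
    db2 = [t - {i} for t in db if t != {i}]
    mine(db2, prefix, smin, report)

# Assumed toy database (a stand-in for the example database of this section).
transactions = ["ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade"]
T = [frozenset(t) for t in transactions]

found = {}
mine(T, frozenset(), 3, found)
```

Because the scheme reports only sets of the form P0 ∪ {i}, the (trivially frequent) empty set is not part of the output; for the toy database and smin = 3 it finds the 15 non-empty frequent item sets.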
Eclat, FP-growth and several other frequent item set mining algorithms all rely
on the described basic recursive processing scheme. They differ mainly in how they
represent the conditional transaction databases. There are two basic approaches: in a
horizontal representation, the database is stored as a list (or array) of transactions,
each of which is a list (or array) of the items contained in it. In a vertical
representation, on the other hand, a database is represented by a list (or array)
of the different items. For each item, a list (or array) of transaction identifiers
is stored, which indicates the transactions that contain the item.
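The two representations can be contrasted with a small conversion routine. The transaction identifiers (1 to 10), the function names, and the toy database are assumptions for this illustration; the support computation by intersecting transaction identifier lists shows why a vertical representation is convenient for counting.

```python
# Horizontal representation: a list of transactions, each a list of items.
horizontal = [list(t) for t in
              ["ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade"]]

def to_vertical(horizontal_db):
    """Vertical representation: for each item, the list of identifiers
    (here: 1-based positions) of the transactions that contain it."""
    vertical = {}
    for tid, transaction in enumerate(horizontal_db, start=1):
        for item in transaction:
            vertical.setdefault(item, []).append(tid)
    return vertical

def support_vertical(itemset, vertical_db):
    """Support of a non-empty item set = size of the intersection of the
    transaction identifier lists of its items."""
    tid_sets = [set(vertical_db.get(item, [])) for item in itemset]
    return len(set.intersection(*tid_sets))

vertical = to_vertical(horizontal)
```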
However, this distinction is not pure, since there are many algorithms that use
a combination of the two forms of representing a database. For example, while
Eclat uses a purely vertical representation and the SaM algorithm presented in
the next section uses a purely horizontal representation, FP-growth combines in its
FP-tree structure (basically a prefix tree of the transaction database, with links be-
tween the branches that connect equal items) a vertical representation (links between
branches) and a (compressed) horizontal representation (prefix tree of transactions).
Apriori also uses a purely horizontal representation but relies on a different process-
ing scheme, because it traverses the subset tree levelwise rather than depth-first.
Fig. 7.20 The example database: original form (1), item frequencies (2), transactions with sorted
items (3), lexicographically sorted transactions (4), and the used data structure (5)
Fig. 7.21 The basic operations of the SaM algorithm: split (left) and merge (right)
In order to give a more concrete idea of the search process, we discuss the
particularly simple SaM (Split and Merge) algorithm [14] for frequent item set mining.
Like basically all other frequent item set mining algorithms, the SaM algorithm
first preprocesses the transaction database with the aim of finding a good item
order and setting up the representation of the initial transaction database. The steps
are illustrated in Fig. 7.20 for a simple example transaction database. Step 1 shows
a transaction database in its original form. In step 2 the frequencies of individual
items are determined from this input in order to be able to discard infrequent items
immediately. If we assume a minimum support of three transactions for this
example, there are two infrequent items, namely f and g, which are discarded. In step 3
the (frequent) items in each transaction are sorted according to their frequency in
the transaction database, since experience (also with other algorithms) has shown
that processing the items in the order of increasing frequency usually leads to the
shortest execution times. In step 4 the transactions are sorted lexicographically into
descending order, with item comparisons again being decided by the item
frequencies, although here the item with the higher frequency precedes the item with the
lower frequency. In step 5 the data structure on which SaM operates is built by
combining equal transactions and setting up an array, in which each element consists of
two fields: an occurrence counter and a pointer to the sorted transaction. This data
structure is then processed recursively, according to the divide-and-conquer scheme
discussed in the preceding section, to find the frequent item sets.
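The five preprocessing steps can be sketched as follows. Fig. 7.20 is not reproduced here, so the ten transactions are an assumed example, chosen so that f and g are the infrequent items for a minimum support of 3. Ties between equally frequent items are broken alphabetically, and Python tuples stand in for the pointers to sorted transactions; both are choices of this sketch, as is the function name sam_preprocess.

```python
from collections import Counter

def sam_preprocess(transactions, smin):
    """Return the item order and the array of [occurrence counter, transaction]."""
    # Step 2: item frequencies; infrequent items are discarded.
    freq = Counter(item for t in transactions for item in set(t))
    order = sorted((i for i in freq if freq[i] >= smin), key=lambda i: (freq[i], i))
    rank = {item: r for r, item in enumerate(order)}   # 0 = least frequent item
    # Step 3: sort the frequent items of each transaction by increasing frequency.
    sorted_tas = [tuple(sorted((i for i in set(t) if i in rank), key=rank.get))
                  for t in transactions]
    sorted_tas = [t for t in sorted_tas if t]          # drop emptied transactions
    # Step 4: sort the transactions lexicographically in descending order,
    # where a less frequent item counts as the larger item.
    sorted_tas.sort(key=lambda t: tuple(-rank[i] for i in t), reverse=True)
    # Step 5: combine equal transactions and count their occurrences.
    array = []
    for t in sorted_tas:
        if array and array[-1][1] == t:
            array[-1][0] += 1
        else:
            array.append([1, t])
    return order, array

# Assumed example database (one string per transaction, one character per item).
database = ["ad", "acde", "bd", "bcdg", "bcf", "abd", "bde", "bcde", "bc", "abdf"]
order, array = sam_preprocess(database, smin=3)
for count, t in array:
    print(count, "".join(t))
```

For this example the least frequent item e comes first in the item order, and the resulting array starts with the transactions that have e as their leading item.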
The basic operations used in the recursive processing are illustrated in Fig. 7.21.
In the split step (see the left part of Fig. 7.21) the given array is split w.r.t. the leading
item of the first transaction (the split item; item e in our example): all array elements
referring to transactions starting with this item are transferred to a new array. In
this process the pointer (in)to the transaction is advanced by one item, so that the
common leading item is removed from all transactions, and empty transactions are
discarded. Obviously, the new array represents the conditional database of the first
subproblem (see page 183), which is then processed recursively to find all frequent
item sets containing the split item (provided that this item is frequent).
The conditional database for frequent item sets not containing this item (needed
for the second subproblem, see page 184) is obtained with a simple merge step (see
the right part of Fig. 7.21). The created new array and the rest of the original array
(which refers to all transactions starting with a different item) are combined with a
procedure that is almost identical to one phase of the well-known mergesort algo-
rithm. Since both arrays are obviously lexicographically sorted, one merging traver-
sal suffices to create a lexicographically sorted merged array. The only difference
to a mergesort phase is that equal transactions (or transaction suffixes) are com-
bined. That is, there is always just one instance of each transaction (suffix), while its
number of occurrences is kept in the occurrence counter. In our example this results
in the merged array having two elements fewer than the input arrays together: the
transactions (suffixes) cbd and bd, which occur in both arrays, are combined, and,
consequently, their occurrence counters are increased to 2.
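The split and merge steps can be sketched on an array of [occurrence counter, transaction] pairs. The input array and the item ranks below are assumptions (Fig. 7.21 is not reproduced here), chosen so that the split item is e and the suffixes cbd and bd occur in both arrays; the function names are our own, and tuples again stand in for pointers into the transactions.

```python
def sam_split(array):
    """Split off all elements whose transaction starts with the leading item of
    the first transaction; the leading item is removed, empty suffixes are dropped."""
    item = array[0][1][0]
    item_support, new, rest = 0, [], []
    for count, t in array:
        if t[0] == item:
            item_support += count
            if len(t) > 1:
                new.append([count, t[1:]])
        else:
            rest.append([count, t])
    return item, item_support, new, rest

def sam_merge(a, b, rank):
    """Merge two arrays that are sorted in the same descending order (like one
    phase of mergesort), combining equal transactions by adding their counters."""
    key = lambda t: tuple(-rank[i] for i in t)
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        ka, kb = key(a[i][1]), key(b[j][1])
        if ka > kb:
            out.append(list(a[i])); i += 1
        elif ka < kb:
            out.append(list(b[j])); j += 1
        else:   # equal transactions: keep one instance, add the counters
            out.append([a[i][0] + b[j][0], a[i][1]]); i += 1; j += 1
    out.extend(list(e) for e in a[i:])
    out.extend(list(e) for e in b[j:])
    return out

# Assumed item order (increasing frequency) and assumed sorted input array.
rank = {"e": 0, "a": 1, "c": 2, "b": 3, "d": 4}
array = [[1, tuple("eacd")], [1, tuple("ecbd")], [1, tuple("ebd")], [2, tuple("abd")],
         [1, tuple("ad")], [1, tuple("cbd")], [2, tuple("cb")], [1, tuple("bd")]]

split_item, split_support, new, rest = sam_split(array)
merged = sam_merge(new, rest, rank)
```

Here new represents the conditional database for item sets containing the split item, and merged the conditional database for item sets not containing it; merged has six elements, two fewer than new and rest together.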
As association rules, we formally consider rules A → C, with A ⊆ B, C ⊆ B,
and A ∩ C = ∅. Here B is the underlying item base, and the symbols A and C
have been chosen to represent the antecedent and the consequent of the rule. For
simplicity, we confine ourselves in the following to rules with |C| = 1, that is, to
rules with only a single item in the consequent. Even though rules with several
items in the consequent have also been studied, they are usually a lot less useful and
only increase the size of the created rule set unnecessarily.
While (frequent) item sets are associated with only one measure, namely their
support, association rules are evaluated with two measures:
• support of an association rule
The support of an association rule A → C can be defined in two ways:
– support of all items appearing in the rule: $r_T(A \to C) = \frac{s_T(A \cup C)}{s_T(\emptyset)}$.
This is the more common definition:
the support of a rule is the fraction of cases in which it is correct.
– support of the antecedent of the rule: $r_T(A \to C) = \frac{s_T(A)}{s_T(\emptyset)}$.
This definition is actually more plausible:
the support of a rule is the fraction of cases in which it is applicable.
(Note that sT (∅) is simply the number of transactions in the database T to mine,
because the empty item set is contained in all transactions.)
• confidence of an association rule
The confidence of an association rule is the number of cases in which it is
correct relative to the number of cases in which it is applicable:
$c_T(A \to C) = \frac{s_T(A \cup C)}{s_T(A)}$.
The confidence can be seen as an estimate of the conditional probability
P (C | A).
Given a transaction database T over some item base B, a minimum support rmin ∈
[0, 1], and a minimum confidence cmin ∈ [0, 1] (both to be specified by a user),
the task of association rule induction is to find all association rules A → C with
A, C ⊂ B, rT (A → C) ≥ rmin , and cT (A → C) ≥ cmin .
The usual approach to this task consists in finding, in a first step, the frequent
item sets of T (as described above). For this step, it is important which rule support
definition is used. If the support of an association rule is the support of all items
appearing in the rule, then the minimum support to be used for the frequent item
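set mining step is simply smin = sT (∅) · rmin (which explains why this is the more
common definition). If, however, the support of an association rule is only the
support of its antecedent, then smin = sT (∅) · rmin · cmin has to be used (since the item
set A ∪ C of a rule with sufficient confidence occurs in at least the fraction cmin of
the transactions that contain A). In a second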
step the found frequent item sets are traversed, and all possible rules (with one item
in the consequent, see above) are generated from them and then filtered w.r.t. rmin
and cmin .
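The second step, generating rules with a single item in the consequent from the frequent item sets, can be sketched as follows. The sketch uses the more common support definition (support of all items in the rule), obtains the frequent item sets by brute force for brevity (any of the algorithms above could be used instead), and works on an assumed toy database that stands in for Fig. 7.16; all function names are our own.

```python
from itertools import combinations

# Assumed toy database (a stand-in for the example database of this section).
transactions = ["ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade"]
T = [frozenset(t) for t in transactions]

def frequent_item_sets(db, smin):
    """Brute-force enumeration, used here only to keep the sketch self-contained."""
    base = sorted(set().union(*db))
    result = {}
    for k in range(len(base) + 1):
        for combo in combinations(base, k):
            s = sum(1 for t in db if frozenset(combo) <= t)
            if s >= smin:
                result[frozenset(combo)] = s
    return result

def association_rules(freq, n_transactions, rmin, cmin):
    """Generate all rules A -> {c} from frequent item sets A ∪ {c}."""
    rules = []
    for itemset, s in freq.items():
        for c in itemset:
            antecedent = itemset - {c}
            # The antecedent is a subset of a frequent item set, hence frequent itself.
            confidence = s / freq[antecedent]
            rule_support = s / n_transactions
            if rule_support >= rmin and confidence >= cmin:
                rules.append((antecedent, c, rule_support, confidence))
    return rules

freq = frequent_item_sets(T, smin=3)   # smin = 10 transactions * rmin = 30%
rules = association_rules(freq, len(T), rmin=0.3, cmin=0.8)
for a, c, r, conf in sorted(rules, key=lambda x: (len(x[0]), sorted(x[0]), x[1])):
    print("".join(sorted(a)), "->", c, f"support={r:.0%}", f"confidence={conf:.1%}")
```

For the toy database this produces six rules, among them b → c and d, e → a with a confidence of 100%.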
As an example, consider again the transaction database shown in Fig. 7.16 and
the frequent item sets that can be discovered in it (for a minimum support, rmin =
30%, that is, smin = 3). The association rules that can be constructed from them
with a minimum confidence cmin = 0.8 = 80% are shown in Fig. 7.22. Due to the
fairly high minimum confidence requirement, the number of found association rules
is relatively small compared to the number of transactions.
The basic recursive processing scheme for finding frequent item sets described in
Sect. 7.6.2 can easily be improved with so-called perfect extension pruning (also
called parent equivalence pruning), which relies on the following simple idea:
given an item set I , an item i ∉ I is called a perfect extension of I iff I and I ∪ {i}
have the same support, that is, if i is contained in all transactions that contain I .
Perfect extensions have the following properties: (1) if the item i is a perfect extension
of an item set I , then it is also a perfect extension of any item set J ⊇ I as long as
i ∉ J , and (2) if I is a frequent item set and K is the set of all perfect extensions
of I , then all sets I ∪ J with J ∈ 2^K (where 2^K denotes the power set of K) are
also frequent and have the same support as I . These properties can be exploited by
collecting in the recursion not only prefix items but also, in a third element of a
subproblem description, perfect extension items. Once identified, perfect extension
items are no longer processed in the recursion but are only used to generate all su-
persets of the prefix that have the same support. Depending on the data set, this can
lead to a considerable acceleration of the search.
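The definition of a perfect extension and property (2) can be illustrated directly: the perfect extensions of an item set I are the items, other than those in I, that occur in every transaction containing I. The following sketch only identifies perfect extensions and expands them; it does not integrate them into the recursion. The toy database and the function names are assumptions for this example.

```python
from itertools import combinations

# Assumed toy database (a stand-in for the example database of this section).
transactions = ["ade", "bcd", "ace", "acde", "ae", "acd", "bc", "acde", "bce", "ade"]
T = [frozenset(t) for t in transactions]

def support(itemset, db):
    return sum(1 for t in db if itemset <= t)

def perfect_extensions(itemset, db):
    """Items outside the item set that occur in all transactions containing it."""
    covering = [t for t in db if itemset <= t]
    if not covering:
        return set()
    return set(frozenset.intersection(*covering) - itemset)

def expand(itemset, extensions):
    """All sets I ∪ J with J in the power set of the perfect extensions."""
    ext = sorted(extensions)
    return [itemset | frozenset(j) for k in range(len(ext) + 1)
            for j in combinations(ext, k)]

pe_b = perfect_extensions(frozenset("b"), T)
pe_de = perfect_extensions(frozenset("de"), T)
pe_a = perfect_extensions(frozenset("a"), T)
same_support = [support(s, T) for s in expand(frozenset("de"), pe_de)]
```

In the toy database, c is a perfect extension of {b} and a is a perfect extension of {d, e}, whereas {a} has no perfect extension.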
Often three types of frequent item sets are distinguished by additional constraints:
• frequent item set
A frequent item set is merely frequent:
I is a frequent item set ⇔ sT (I ) ≥ smin .
• closed item set
A frequent item set is called closed if no proper superset has the same support:
I is a closed item set ⇔ sT (I ) ≥ smin ∧ ∀J ⊃ I : sT (J ) < sT (I ).
• maximal item set
A frequent item set is called maximal if no proper superset is frequent:
I is a maximal item set ⇔ sT (I ) ≥ smin ∧ ∀J ⊃ I : sT (J ) < smin .
Obviously, all maximal item sets are closed, and all closed item sets are frequent.
The reason for distinguishing these item set types is that with them one can achieve
a compressed representation of all frequent item sets: the set of all frequent item
sets can be recovered from the maximal (or the closed) item sets by simply forming
all subsets of maximal (or closed) item sets, because each frequent item set has a
maximal superset. With closed item sets, one even preserves the knowledge of their
support, because any frequent item set has a closed superset with the same support.
Hence the support of a nonclosed frequent item set I can be reconstructed with
$\forall s_{\min}: \forall I: s_T(I) \ge s_{\min} \Rightarrow s_T(I) = \max_{J \in C_T(s_{\min}),\, J \supseteq I} s_T(J),$
where CT (smin ) is the set of all closed item sets that are discovered in a given trans-
action database T if the minimum support smin is chosen.
Note that closed item sets are directly related to perfect extension pruning, since
a closed item set can also be defined as an item set that does not possess a perfect
extension. As a consequence, identifying perfect extensions is particularly useful
and effective if the output is restricted to closed item sets.
For the simple example database shown in Fig. 7.16, all frequent item sets are
closed, with the exception of {b} (since s_T({b}) = s_T({b, c}) = 3) and {d, e} (since
s_T({d, e}) = s_T({a, d, e}) = 4). The maximal item sets are {b, c}, {a, c, d}, {a, c, e},
and {a, d, e}. Note that indeed any frequent item set is a subset of at least one of
these four maximal item sets, demonstrating that they suffice to represent all fre-
quent item sets. In order to give an impression of the relative number of frequent,
closed, and maximal item sets in practice, Fig. 7.23 displays the decimal logarithms
of the numbers of these item sets that are found for different minimum support
values for a common benchmark data set, namely the BMS-Webview-1 database.
Clearly, restricting the output to closed or even to maximal item sets can reduce the
size of the output by orders of magnitude in this case.
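The three notions can be checked by brute force on a small example. The following is a minimal sketch in base R, not an efficient mining algorithm: the toy database and the minimum support of 3 are our assumptions, chosen so that the supports quoted above are reproduced (the database may differ from the one in Fig. 7.16), and all function names are hypothetical.

```r
# Toy database (an assumption, chosen to reproduce the supports quoted above).
transactions <- list(
  c("a", "d", "e"), c("b", "c", "d"), c("a", "c", "e"), c("a", "c", "d", "e"),
  c("a", "e"), c("a", "c", "d"), c("b", "c"), c("a", "c", "d", "e"),
  c("b", "c", "e"), c("a", "d", "e"))
smin  <- 3
items <- sort(unique(unlist(transactions)))

# Support: number of transactions that contain every item of the item set.
support <- function(I) {
  sum(vapply(transactions, function(tr) all(I %in% tr), logical(1)))
}

# Brute-force enumeration of all non-empty item sets (feasible for 5 items only).
candidates <- unlist(
  lapply(seq_along(items), function(k) combn(items, k, simplify = FALSE)),
  recursive = FALSE)
supp     <- vapply(candidates, support, integer(1))
frequent <- candidates[supp >= smin]
fsupp    <- supp[supp >= smin]

proper_superset <- function(J, I) length(J) > length(I) && all(I %in% J)

# Does frequent set i have a frequent proper superset j that satisfies pred?
# (A superset with the same support is itself frequent, so checking the
# frequent sets suffices for closedness as well.)
has_superset <- function(i, pred) {
  any(vapply(seq_along(frequent), function(j) {
    proper_superset(frequent[[j]], frequent[[i]]) && pred(i, j)
  }, logical(1)))
}
idx     <- seq_along(frequent)
closed  <- !vapply(idx, has_superset, logical(1),
                   pred = function(i, j) fsupp[j] == fsupp[i])
maximal <- !vapply(idx, has_superset, logical(1),
                   pred = function(i, j) TRUE)

label <- function(sets) vapply(sets, paste, character(1), collapse = "")
print(label(frequent[!closed]))   # frequent item sets that are not closed
print(label(frequent[maximal]))   # maximal item sets

# Support of an item set recovered as the maximum over its closed supersets.
closed_sets <- frequent[closed]
closed_supp <- fsupp[closed]
recover <- function(I) {
  max(closed_supp[vapply(closed_sets, function(J) all(I %in% J), logical(1))])
}
print(recover(c("d", "e")))
```

On this toy database the two non-closed sets and the four maximal sets agree with the ones named in the text, and recover() returns the support of {d, e} without looking at the transactions again.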
Naturally, the notions of a closed and a maximal item set can easily be transferred
to other pattern types, leading to closed and maximal sequences, trees, and
(sub)graphs (with analogous definitions). Their purpose is the same in these cases:
to reduce the output to a more manageable size. Nevertheless, the size of
the output remains a serious problem in frequent pattern mining, despite several
research efforts, for example, to define relevance or interestingness measures by which
the best patterns can be singled out for user inspection.
In practice the number of association rules found is usually huge, often exceeding
(by far) the number of transactions. Therefore considerable effort has been devoted
to filtering them in order to single out the interesting ones, or at least to rank
them. Here we study only a few simple, though often effective, methods. More
sophisticated approaches are discussed, for instance, in [53, 54].
The basic idea underlying all approaches described in the following is that a
rule is interesting only if the presence of its antecedent has a sufficient effect on
the presence of its consequent. In order to assess this, the confidence of a rule
is compared to its expected confidence under the assumption that antecedent and
190 7 Finding Patterns
where p_kl = n_kl / n_∗∗ for k, l ∈ {0, 1, ∗}. For both of these measures, a higher value
means a stronger dependence of C on A and thus a higher interestingness of the
rule.
It should be noted that such measures are only a simple aid, as they do not detect
and properly assess situations like the following: suppose that c_T(∅ → C) = 20%
but c_T(A → C) = 80%. According to all of the above measures, the rule A → C
is highly interesting. However, if we know that c_T(A − {i} → C) = 76% for some
item i ∈ A, the rule A → C does not look so interesting anymore: the change in
confidence that is brought about by the item i is fairly small, and thus we should
rather consider a rule with a smaller antecedent.
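The comparison with simpler antecedents can be sketched in a few lines of base R. The transactions below are hypothetical and only serve to produce a situation of the kind just described; the helper names support and confidence are ours.

```r
# Hypothetical toy transactions; the item names are illustrative only.
transactions <- c(rep(list(c("a", "b", "c")), 4), list(c("a", "b")),
                  rep(list(c("a", "c")), 3), list("a"),
                  rep(list("b"), 3), rep(list("d"), 8))

# Support: number of transactions that contain every item of the item set
# (the empty set is contained in all transactions).
support <- function(I) {
  sum(vapply(transactions, function(tr) all(I %in% tr), logical(1)))
}

# Confidence of the rule A -> C.
confidence <- function(A, C) support(c(A, C)) / support(A)

A <- c("a", "b")
C <- "c"
conf_full  <- confidence(A, C)               # rule with the full antecedent
conf_prior <- confidence(character(0), C)    # rule with an empty antecedent

# Confidence of each rule obtained by dropping one item from the antecedent;
# the names of the result are the dropped items.
conf_simpler <- vapply(A, function(i) confidence(setdiff(A, i), C), numeric(1))

# How much confidence is gained over the best simpler rule?
gain <- conf_full - max(conf_simpler)
print(round(c(prior = conf_prior, full = conf_full, conf_simpler, gain = gain), 3))
```

Compared to the empty antecedent, the rule looks strong, but dropping the item b from the antecedent costs only about two percentage points of confidence, so the simpler rule a → c is arguably the one to report.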
While this problem can at least be attacked by comparing a rule not only to
its counterpart with an empty antecedent but to all rules with a simpler antecedent
(even though this is more difficult than it may appear at first sight), semantic aspects
of interestingness are completely neglected. For example, the rule “pregnant → fe-
male” has a confidence of 100%, while “∅ → female” has a confidence of only 50%.
Nevertheless the rule “pregnant → female” is obviously completely uninteresting,
because it is part of our background knowledge (or can at least be derived from it).
In analogy to frequent item set mining, with which item sets are found that are con-
tained in a sufficiently large number of transactions of a given database (as speci-
fied by the minimum support), frequent subgraph mining tries to find (sub)graphs
that are contained in a sufficiently large number of (attributed or labeled) graphs
of a given graph database. Since the advent of this research area around the turn
of the century, several clever algorithms for frequent subgraph mining have been
developed. Some of them rely on principles from inductive logic programming
and describe the graph structure by logical expressions [23]. However, the vast
majority transfers techniques developed originally for frequent item set mining
(see Sect. 7.6.2). Examples include MolFea [39], FSG [40], MoSS/MoFa [12],
gSpan [56], CloseGraph [57], FFSM [34], and Gaston [42]. A related, but slightly
different approach, is used in Subdue [17], which is geared toward graph compres-
sion with common subgraphs rather than frequent subgraph mining.
Here we confine ourselves to a brief outline of the core principles of frequent
subgraph mining, because frequent sequence and tree mining can obviously be seen
as special cases of frequent subgraph mining. Nevertheless, it should be
noted that specific optimizations are possible due to the restricted structure of trees
and those graphs to which different types of sequences can be mapped. Note also
that this is the only place in this book where we consider the analysis of data that
does not have a tabular structure. While we cannot provide a comprehensive coverage
of all complex data types, we strive in this section to convey at least some idea
of the special requirements and problems encountered when analyzing such data.
Formally, frequent subgraph mining works on a database of labeled graphs (also
called attributed graphs). A labeled graph is a triple G = (V , E, l), where V is the
set of vertices, E ⊆ V × V − {(v, v) | v ∈ V } is the set of edges, and l : V ∪ E → L
assigns labels from some label set L to vertices and edges. The graphs we consider
are undirected and simple (that is, there is at most one edge between two given
vertices) and contain no loops (that is, there are no edges connecting a vertex to
itself). However, graphs without these restrictions (that is, directed graphs, graphs
with loops and/or multiple edges) can be handled as well with properly adapted
methods. Note also that several vertices and edges may carry the same label.
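As an illustration of this definition, a labeled graph can be stored in R as a vector of vertex labels plus a table of labeled edges. This is only one possible representation, sketched under our own naming (make_graph and the field names are hypothetical); the constructor checks the restrictions stated above.

```r
# A labeled graph G = (V, E, l): vertices are numbered 1..length(vlabels),
# vlabels[v] is the label of vertex v, and each edge carries its own label.
make_graph <- function(vlabels, from, to, elabels) {
  stopifnot(length(from) == length(to), length(to) == length(elabels))
  stopifnot(all(c(from, to) %in% seq_along(vlabels)))   # edges connect known vertices
  stopifnot(all(from != to))                            # no loops
  a <- pmin(from, to)                                   # undirected: store a < b
  b <- pmax(from, to)
  stopifnot(!any(duplicated(cbind(a, b))))              # simple: no multiple edges
  list(vlabels = vlabels,
       edges = data.frame(a = a, b = b, label = elabels,
                          stringsAsFactors = FALSE))
}

# A molecule-like example: the path N-C-C=O.
g <- make_graph(vlabels = c("N", "C", "C", "O"),
                from = c(1, 2, 3), to = c(2, 3, 4),
                elabels = c("-", "-", "="))
print(g$edges)
print(table(g$vlabels))   # the label C is carried by two different vertices
```

Because two vertices share the label C, the labels alone do not identify a vertex; the edge table therefore refers to vertex numbers rather than to labels.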
Fig. 7.26 A Hasse diagram of subgraphs of three molecule-like graphs, to be seen at the bottom
(top) and an assignment of unique parents, which turns the Hasse diagram into a tree (bottom)
With a unique parent for each subgraph, we can carry out the search for frequent
subgraphs according to the following simple recursive scheme, which parallels the
divide-and-conquer scheme described in Sect. 7.6.2: in a base loop, all possible ver-
tex labels are traversed (their unique parent is the empty graph). All vertex labels
(and thus all single vertex subgraphs) that are frequent are processed recursively.
A given frequent subgraph S is processed recursively by forming all possible ex-
tensions R of S by a single edge and a vertex for which S is the chosen unique
parent. All such extensions R that are frequent (that is, for which s_G(R) ≥ s_min) are
processed recursively, while infrequent extensions are discarded.
Whereas for frequent item sets, it was trivial to assign unique parents, namely
by simply defining an (arbitrary but fixed) order of the items, ordering the labels
in the set L, though also necessary, is not enough. The reason is mainly that sev-
eral vertices (and several edges) may carry the same label. Hence the labels do not
uniquely identify a vertex, thus rendering it impossible to describe the graph struc-
ture solely with these labels. We rather have to endow each vertex with a unique
identifier (usually a number), so that we can unambiguously specify the edges of
the graph.
Given an assignment of unique identifiers, we can describe a graph with a code
word which specifies the vertex and edge labels and the connection structure, and
from which the graph can be reconstructed. Of course, the form of this code word
depends on the chosen numbering of the vertices: each numbering leads to a differ-
ent code word. In order to single out one of these code words, the canonical code
word, we simply select the lexicographically smallest code word.
With canonical code words, we can easily define unique parents: the canonical
code word of a (sub)graph S is obtained with a specific numbering of its vertices and
thus also fixes (maybe with some additional stipulation) an order of its edges. By
removing the last edge in this order that either is not a bridge or is incident
to at least one vertex of degree 1 (which is then also removed), we obtain a graph
that is exactly one level up in the partial order of subgraphs and thus may be chosen
as the unique parent of S. Details about code words, together with a discussion of
the important prefix property of code words, can be found, for example, in [13].
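The idea of selecting the lexicographically smallest code word can be illustrated with a deliberately simplified code word form. In the following sketch (base R) the form of the code word is our assumption and not the one used by any of the algorithms cited above, and trying all vertex numberings is only feasible for very small graphs; practical algorithms avoid this enumeration.

```r
# All orderings of the elements of v (recursive; fine for a handful of vertices).
permutations <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in permutations(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

# Code word for one numbering, where number[k] is the new number of vertex k.
# Simplified form (an assumption): the vertex labels in numbering order, followed
# by the edges written as <lower number><edge label><higher number>, sorted by
# their vertex numbers.
code_word <- function(vlabels, edges, number) {
  a <- number[edges$a]
  b <- number[edges$b]
  lo <- pmin(a, b)
  hi <- pmax(a, b)
  ord <- order(lo, hi)
  paste(c(vlabels[order(number)],
          paste0(lo[ord], edges$label[ord], hi[ord])), collapse = " ")
}

# Canonical code word: the lexicographically smallest one over all numberings.
canonical <- function(vlabels, edges) {
  words <- vapply(permutations(seq_along(vlabels)),
                  function(number) code_word(vlabels, edges, number),
                  character(1))
  sort(words, method = "radix")[1]   # radix sorting compares in the C locale
}

# The same path N-C-C=O, given with two different vertex numberings.
g1 <- list(vlabels = c("N", "C", "C", "O"),
           edges = data.frame(a = c(1, 2, 3), b = c(2, 3, 4),
                              label = c("-", "-", "="), stringsAsFactors = FALSE))
g2 <- list(vlabels = c("O", "C", "N", "C"),
           edges = data.frame(a = c(3, 4, 2), b = c(4, 2, 1),
                              label = c("-", "-", "="), stringsAsFactors = FALSE))
w1 <- canonical(g1$vlabels, g1$edges)
w2 <- canonical(g2$vlabels, g2$edges)
print(w1)
print(identical(w1, w2))   # both numberings lead to the same canonical code word
```

Since the canonical code word does not depend on the numbering the graph was given with, it identifies the graph up to isomorphism, which is what makes it usable for defining unique parents.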
7.7.1 Overview
To gain insights from the discovery, the respective subgroup should be easy to
grasp by the analyst and is therefore best characterized by a conjunction of features
(conditions on attributes like fuel = liquid petrol gas, offeredby = dealer, mileage
> 100000, etc.). All possible combinations of constraints define jointly the set of
possible subgroups. A search method is then applied to systematically explore this
subgroup space. The target measure of each candidate subgroup is compared against
the value obtained for, say, the whole dataset, and a statistical test decides about the
significance of a discovered deviation (under the null hypothesis that the subgroup
does not differ from the whole dataset). Only significant deviations are considered
as interesting and reported to the analyst. The found subgroups are ranked to allow
the search method to focus on the most interesting patterns and to point the user
to the most promising findings. This ranking involves a quality function, which is
usually a trade-off between the size of the subgroup and its unusualness.
One of the major practical benefits of deviation analysis is that we do not even try
to find a model for the full dataset, because in practice such a model is either coarse
and well known to the expert or very large and thus highly complex. Concentrating
on the (potentially most interesting) highest deviation from the expected average
and deriving short, understandable rules delivers invaluable insights.
7.7.2 Construction
The ingredients of deviation analysis are (1) the target measure and a verification
test serving as a filter for irrelevant or incidental patterns, (2) a quality measure
to rank subgroups, and (3) a search method that enumerates candidate subgroups
systematically.
Any attribute of the dataset may serve as the target measure, and any combi-
nation or derived variable, and even models themselves, may be considered (e.g.,
regression models [41]). Depending on the scale of the target attribute, an appropri-
ate test has to be selected (see Fig. 7.27). For instance, for some binary target flag,
we have a probability of p0 = 0.7 in the full population of observing the flag F .
Now, in a subgroup of size n = 100 we observe m = 80 cases where the flag is set,
so p = P(F) = 0.8. Before we claim that we have found a deviation (and thus an
interesting subgroup), we want to be confident that 80 cases or more are unlikely
to occur by chance (say, only in 1% of the cases). We state the null hypothesis that p = p0 and
apply a one-sample z-test3 with a significance level of 0.01. We have to compute
the z-score
\[ z = \frac{m - \mu}{\sigma} = \frac{80 - 70}{4.58} \approx 2.18, \]
where μ = n p0 is the mean and σ is the corresponding standard deviation (of the
number of set flags among n cases under the null hypothesis), which is given by
σ = √(n p0 (1 − p0)) = √21 ≈ 4.58. A table of the standard normal distribution
tells us that the probability of observing a z-value of 2.18 or above is approximately
0.0146. Thus, in 1 − 0.0146 = 98.54% of the cases there will be fewer than 80 observed
target flags. A significance level of 0.01 means that we reject the null hypothesis only if this
situation occurs in at most 1 out of 100 times; actually it may occur in ≈1.5 out of
100 times, so the null hypothesis cannot be rejected (at this significance level), and
the subgroup is not reported. As we will perform quite a large number of tests,
the significance level must be adjusted appropriately. If we test 1000 subgroups for
being interesting and use a significance level of 0.01, we have to expect about 10
subgroups that pass this test (even if there is no substantial deviation).
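The computation can be retraced in R, where pnorm replaces the table of the standard normal distribution. This is a minimal sketch with the numbers of the example; the Bonferroni correction in the last lines is merely one common way to adjust the significance level and is our choice, not one prescribed here.

```r
n  <- 100    # subgroup size
m  <- 80     # cases in the subgroup with the flag set
p0 <- 0.7    # share of the flag in the full population

mu    <- n * p0                        # expected number of flags under H0
sigma <- sqrt(n * p0 * (1 - p0))       # standard deviation of that number
z     <- (m - mu) / sigma
p_value <- pnorm(z, lower.tail = FALSE)   # P(Z >= z) for a standard normal Z

alpha  <- 0.01
reject <- p_value < alpha              # FALSE: the subgroup is not reported

# With k tests at level alpha, about k * alpha subgroups pass by chance alone.
k <- 1000
expected_false_positives <- k * alpha
alpha_bonferroni <- alpha / k          # Bonferroni-adjusted level per test

print(round(c(z = z, p_value = p_value), 4))
print(reject)
print(expected_false_positives)
```

The p-value differs slightly from the tabulated 0.0146 because the table is read at the rounded value z = 2.18.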
All interesting subgroups are ranked according to their interestingness. One pos-
sibility is to reuse the outcome of the statistical test itself, as it expresses how un-
usual the finding is. In the above case, this could be the z-score itself:
\[ z = \frac{m - n p_0}{\sqrt{n p_0 (1 - p_0)}} = \frac{\sqrt{n}\,(p - p_0)}{\sqrt{p_0 (1 - p_0)}}. \]
The z-score value balances the size of the subgroup (factor of √n), the deviation (factor
of p − p0), and the level of the target share (p0), which is important because an
absolute deviation of p − p0 = 0.1 may represent either a small increase (from 0.9
to 1.0) or a doubled share (from 0.1 to 0.2). Another aspect that might be considered
by an additional factor of N/(N − n) is the relative size of the subgroup with respect to
the total population size. A frequently used quality function is the weighted relative
accuracy (WRAcc), which is typically used for predictive tasks:
\[ \mathrm{WRAcc} = (p - p_0) \cdot \frac{n}{N}. \]
However, the user may want to focus on other (application-dependent) aspects such
as practical usefulness, novelty, and simplicity and employ other quality functions
to rank the subgroups.
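To see that the two quality functions may rank subgroups differently, consider the following sketch in R. The three subgroups, the population size N = 1000, and the function names are hypothetical; only the two formulas are taken from the text.

```r
# z-score quality and weighted relative accuracy of a subgroup of size n with
# target share p, given the target share p0 and the size N of the population.
z_quality <- function(n, p, p0) sqrt(n) * (p - p0) / sqrt(p0 * (1 - p0))
wracc     <- function(n, N, p, p0) (p - p0) * n / N

N  <- 1000   # hypothetical population size
p0 <- 0.7
subgroups <- data.frame(name = c("S1", "S2", "S3"),
                        n = c(100, 400, 25),
                        p = c(0.80, 0.76, 0.96),
                        stringsAsFactors = FALSE)
subgroups$z     <- z_quality(subgroups$n, subgroups$p, p0)
subgroups$wracc <- wracc(subgroups$n, N, subgroups$p, p0)

by_z     <- subgroups$name[order(-subgroups$z)]
by_wracc <- subgroups$name[order(-subgroups$wracc)]
print(subgroups)
print(by_z)       # ranking by z-score
print(by_wracc)   # ranking by WRAcc
```

The small subgroup with the very high target share leads the z-score ranking, whereas WRAcc, which weights the deviation with n/N instead of √n, prefers the large subgroup with a moderate deviation.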
Since subgroups are to be described by a conjunction of features (such as variable
= value), the number of possible subgroups on the syntactical level grows exponen-
tially with the number of attributes and the number of possible values. Given an at-
tribute A with ΩA = {high, medium, low, absent}, we may form four conditions on
identity (e.g., A = high), four on imparity (e.g., A ≠ medium), and as A has ordinal
scale, even more features regarding the order (e.g., A ≤ low or A ≥ medium). When
constructing 11 features from A alone, another 9 attributes with 4 values would al-
7.7 Deviation Analysis 197
If a high quality subgroup is present, it is very likely that several variants of this
subgroup exist, which are also evaluated as good subgroups (e.g., by adding a con-
straint A = v for some attribute A and a rare feature v). Thus, it may happen that
many of the subgroups in the beam are variations of a single scheme, which prevents
the beam search from focussing on other parts of the dataspace—a small number of
diverse subgroups would be preferable. A set of diverse subgroups can be extracted
from the beam by selecting successively those subgroups that cover most of the
data. Once a subgroup has been selected, the covered data is excluded from the
subsequent selection steps [24]. If most of the subgroups in the beam are variations of
one core subgroup, only a few diverse subgroups will be selected. A better approach
is a sequential covering search, where good subgroups are discovered one
after the other. Over several runs, only a few or a single best subgroup is identi-
fied, and the data covered by this subgroup is then excluded from subsequent runs.
Thereby subgroups cannot be rediscovered, but the method has to focus on different
parts of the dataspace. Similar techniques are applied for learning sets of classifi-
cation rules (see Chap. 8). It is also possible to generate a new sample from the
original dataset that no longer exhibits the unusualness that has been discovered by
a given subgroup [48]. If this subsampling is applied before any subsequent run,
new subgroups rather than known subgroups will be discovered.
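The coverage-based selection of diverse subgroups from a beam can be sketched as a greedy loop. In the following R code the beam of four subgroups over ten records is hypothetical, each subgroup is represented simply by a logical vector marking the records it covers, and select_diverse is our own name; the sketch illustrates the selection step only, not the procedure of [24] in detail.

```r
# Greedy selection: repeatedly pick the subgroup that covers the most records
# not yet covered, then exclude the covered records from further steps.
select_diverse <- function(covers) {
  uncovered <- rep(TRUE, length(covers[[1]]))
  chosen <- character(0)
  repeat {
    gain <- vapply(covers, function(cv) sum(cv & uncovered), integer(1))
    if (max(gain) == 0) break          # nothing new can be covered
    best <- names(gain)[which.max(gain)]
    chosen <- c(chosen, best)
    uncovered <- uncovered & !covers[[best]]
  }
  chosen
}

# Hypothetical beam: A1 and A2 are variations of the core subgroup A.
records <- 1:10
beam <- list(A  = records %in% 1:6,
             A1 = records %in% 1:5,
             A2 = records %in% 2:6,
             B  = records %in% 7:9)
chosen <- select_diverse(beam)
print(chosen)
```

The variations A1 and A2 are never selected, because after A has been chosen they cover no new records. With a beam that consists almost entirely of such variations, very few subgroups survive, which is the weakness noted above.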
Another issue is the efficiency of the search, that is, its scalability to large datasets.
Exhaustive search is prohibitive unless intelligent pruning techniques are applied
that prevent us from losing too much time with redundant, uninteresting subgroups.
On the other hand, any kind of heuristic search (like beam search) bears the risk of
missing the most interesting subgroups. There are several ways to attack
this problem.
For nominal target variables, efficient algorithms from association rule mining
can be utilized. If it is possible to transform the database such that association min-
ing (e.g., via the Apriori algorithm) can be applied, the validation and ranking of
the patterns found are merely a post-processing step [15]. Missing values require
special care in this approach, as the case of missing data is usually not considered
in market basket analysis. Furthermore, deviation analysis typically focusses on a
target variable rather than associations between any attributes, which offers some
optimization potential [6].
If the dataset size becomes an issue, one may use a subsample to test and rank
subgroups rather than the full dataset. For a broad range of quality measures, one
can derive upper bounds for the size of the sample with guaranteed upper bounds
on the subgroup quality estimation error [47]. This speeds up the discovery process
considerably, because there is no need for a full database scan.
Fig. 7.28 A KNIME dendrogram for the Iris data. The colors indicate the underlying class infor-
mation of the three different types of iris plants
KNIME offers a number of nodes for pattern finding. Hierarchical clustering is, of
course, among them. Applying the node is fairly straightforward. The user selects
the (numerical4 ) columns to be used along with the distance function and the type
of linkage. The node will also provide a cluster ID assignment of the input on its
outport, and for this, the user can specify the cutoff, i.e., the number of clusters to
be created. Figure 7.28 shows a dendrogram for the Iris data described earlier. We
have also used the color assigner based on the class labels. However, the Euclidean
distance was only computed on the four numerical attributes.
The standard method for prototype based clustering is provided via the k-Means
node. The options here are again the selection of numerical columns to be used
and additionally the number of clusters to be used and the maximum number of
iterations. The latter is usually not required, but in the (rare) case of oscillations and
hence nontermination of the underlying algorithm, KNIME will force termination
after reaching this number of iterations. The node has two outports: one carrying
the input data together with cluster labels and a second port holding the cluster
model. This model can be exported for use in other tools (using, e.g., the Model
Writer node), or it can be used to apply the clustering to new data. The Cluster
Assigner node will compute the closest cluster prototype for each pattern and assign
the corresponding label. Figure 7.29 shows two small flows to demonstrate this.
The part on the left reads a file, runs the k-Means algorithm, and writes the result-
ing cluster model into a file. The flow on the right reads this model and a second file
and applies the cluster model to this new data. Of course, in this little example we
could have simply connected the clustering node to the node assigning the cluster
model, but this demonstrates how we can use one analysis tool to create a clustering
and then use a second tool (maybe a nightly batch job clustering customers?) using
the very same model. We will discuss this aspect of model deployment in more de-
tail in Chap. 10. KNIME also provides a node which allows one to apply the fuzzy
version of k-Means, fuzzy c-Means.
4 We will discuss the use of other types of distance metrics in KNIME later.
Fig. 7.29 Two KNIME workflows creating a clustering and writing the resulting model to a file
(left) and reading data and model from file and applying the read clustering model (right)
Fig. 7.30 Through additional plugins, KNIME can also process other types of data. In this exam-
ple, molecular file formats are read and processed in a discriminative fragment mining node
the left (connected to the input data) which rows (and molecules) the fragment is
contained in.
As an example, we apply hierarchical clustering to the Iris data set, ignoring the
categorical attribute Species. We use the normalized Iris data set iris.norm that
is constructed in Sect. 6.6.2.3. We can apply hierarchical clustering after removing
the categorical attribute and can plot the dendrogram afterwards:
Here, the Ward method for the cluster distance aggregation function as described
in Table 7.3 was chosen. For the other cluster distance aggregation functions in the
table, one simply has to replace ward by single (for single linkage), by com-
plete (for complete linkage), by average (for average linkage), or by cen-
troid.
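A minimal sketch of such a call is given below. It rests on assumptions: R's built-in iris data frame is used, scale() serves as a stand-in for the normalization of Sect. 6.6.2.3 (which may differ), and the method is spelled ward.D, which is the name current versions of R use for the option formerly called ward.

```r
# Numerical attributes only; scale() is a stand-in for the normalized data.
iris.num <- scale(iris[, 1:4])

# Agglomerative hierarchical clustering on Euclidean distances.
iris.cl <- hclust(dist(iris.num), method = "ward.D")
plot(iris.cl)                          # draws the dendrogram

# Cutting the dendrogram into three clusters and comparing with the species.
iris.groups <- cutree(iris.cl, k = 3)
print(table(iris.groups, iris$Species))
```

The call to cutree is our addition; it turns the dendrogram into a flat cluster assignment, so that the result can be compared with the categorical attribute that was left out of the clustering.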
For heatmaps, the library gplots is required, which needs to be installed first:
> library(gplots)
> rowv <- as.dendrogram(hclust(dist(iris.num),
method="ward"))
> colv <- as.dendrogram(hclust(dist(t(iris.num)),
method="ward"))
> heatmap.2(as.matrix(iris.num), Rowv=rowv,Colv=colv,
trace="none")
The desired number of clusters is specified by the parameter centers. The locations
of the cluster centers and the assignment of the data to the clusters are obtained
by the print function:
> print(iris.km)
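A minimal sketch of how an object like iris.km could be created with kmeans is the following. It assumes the four numerical columns of R's built-in iris data; the calls to set.seed and the parameter nstart are our additions to make the result reproducible and less dependent on the random initialization.

```r
set.seed(1)                      # k-means starts from randomly chosen centers
iris.num <- iris[, 1:4]          # numerical attributes only

# Three clusters; nstart = 10 keeps the best of ten random initializations.
iris.km <- kmeans(iris.num, centers = 3, nstart = 10)

print(iris.km$centers)                        # locations of the cluster centers
print(table(iris.km$cluster, iris$Species))   # clusters versus species
```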
For fuzzy c-means clustering, the library cluster is required. The clustering
is carried out by the method fanny similar to kmeans:
> library(cluster)
> iris.fcm <- fanny(iris.num,3)
> iris.fcm
The last line provides the necessary information on the clustering results, especially
the membership degrees to the clusters.
Gaussian mixture decomposition automatically determining the number of clusters
requires the library mclust to be installed first:
> library(mclust)
> iris.em <- mclustBIC(iris.num[,1:4])
> iris.mod <- mclustModel(iris.num[,1:4],iris.em)
> summary(iris.mod)
The last line lists the assignment of the data to the clusters.
Density-based clustering with DBSCAN is implemented in the library fpc
which needs installation first:
> library(fpc)
> iris.dbscan <- dbscan(iris.num[,1:4],1.0,showplot=T)
> iris.dbscan$cluster
The last line will print out the assignment of the data to the clusters. Singletons or
outliers are marked by the number zero. The second argument in dbscan (in the
above example 1.0) is the parameter ε for DBSCAN. showplot=T will generate
a plot of the clustering result projected to the first two dimensions of the data set.
The library som provides methods for self-organizing maps. It needs
to be installed first:
7.9 Further Reading 203
> library(som)
> iris.som <- som(iris.num,xdim=5,ydim=5)
> plot(iris.som)
xdim and ydim define the number of nodes in the mesh in x- and y-directions,
respectively. plot will show, for each node in the mesh, a representation of the
values in the form of parallel coordinates.
For association rule mining, the library arules is required in which the function
apriori is defined. This library does not come along with R directly and needs to
be installed first.
Here we use an artificial data set basket that we enter manually. The data set
is a list of vectors where each vector contains the items that were bought:
> library(arules)
> basket <- list(c("a","b","c"), c("a","d","e"),
c("b","c","d"), c("a","b","c","d"),
c("b","c"), c("a","b","d"),
c("d","e"), c("a","b","c","d"),
c("c","d","e"), c("a","b","c"))
> rules <- apriori(basket,parameter = list(supp=0.1,
conf=0.8,
target="rules"))
> inspect(rules)
The last command lists the rules with their support, confidence, and lift.
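To make the three reported measures concrete, they can be recomputed by hand for a single rule without any additional library. The following sketch uses the basket data from above and the rule b → c as an example; the helper rel_support is our own, and support is taken as the relative frequency of transactions that contain antecedent and consequent together.

```r
basket <- list(c("a","b","c"), c("a","d","e"),
               c("b","c","d"), c("a","b","c","d"),
               c("b","c"), c("a","b","d"),
               c("d","e"), c("a","b","c","d"),
               c("c","d","e"), c("a","b","c"))

# Relative support: fraction of transactions that contain all items of I.
rel_support <- function(I) {
  mean(vapply(basket, function(tr) all(I %in% tr), logical(1)))
}

antecedent <- "b"
consequent <- "c"
supp <- rel_support(c(antecedent, consequent))   # support of the rule
conf <- supp / rel_support(antecedent)           # confidence of the rule
lift <- conf / rel_support(consequent)           # confidence relative to prior
print(round(c(support = supp, confidence = conf, lift = lift), 3))
```

A lift above 1 indicates that the consequent occurs more often in transactions containing the antecedent than in the database as a whole.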
There are quite a number of books about cluster analysis. A good reference to
start is [22]. A review of subspace clustering methods can be found in [43]. Self-
organizing maps are mentioned in almost every book about neural networks, but
only a few books are more or less dedicated to them [35, 45]. The usefulness of
SOMs for visualization is shown in [51].
There were two workshops on frequent itemset mining implementations in 2003
and 2004 by Bart Goethals, Mohammed J. Zaki, and Roberto Bayardo, and the
workshop proceedings are available online at http://fimi.cs.helsinki.fi/. The website
provides a comparison and the source code of many different implementations.
References
1. Aggarwal, C.C., Wolf, J.L., Yu, P.S., Procopiuc, C., Park, J.S.: Fast algorithms for projected
clustering. In: Proc. 1999 ACM SIGMOD Int. Conf. on Management of Data, pp. 61–72.
ACM Press, New York (1999)
2. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proc. 20th Int. Conf.
on very Large Databases (VLDB 1994, Santiago de Chile), pp. 487–499. Morgan Kaufmann,
San Mateo (1994)
3. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.: Fast discovery of asso-
ciation rules. In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (eds.): Ad-
vances in Knowledge Discovery and Data Mining, pp. 307–328. AAAI Press/MIT Press, Cam-
bridge (1996)
4. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high
dimensional data. Data Min. Knowl. Discov. 11, 5–33 (2005)
5. Ankerst, M., Breunig, M.M., Kriegel, H.-P., Sander, J.: OPTICS: ordering points to identify
the clustering structure. In: ICMD, pp. 49–60, Philadelphia (1999)
6. Atzmueller, M., Puppe, F.: Sd-map: a fast algorithm for exhaustive subgroup discovery. In:
Proc. Int. Conf. Knowledge Discovery in Databases (PKDD). Lecture Notes in Computer
Science, vol. 4213. Springer, Berlin (2006)
7. Baumgartner, C., Plant, C., Kailing, K., Kriegel, H.-P., Kröger, P.: Subspace selection for
clustering high-dimensional data. In: Proc. IEEE Int. Conf. on Data Mining, pp. 11–18. IEEE
Press, Piscataway (2003)
8. Bayardo, R., Goethals, B., Zaki, M.J. (eds.): Proc. Workshop Frequent Item Set Mining Imple-
mentations (FIMI 2004, Brighton, UK), CEUR Workshop Proceedings 126, Aachen, Germany
(2004). http://www.ceur-ws.org/Vol-126/
9. Bellman, R.: Adaptive Control Processes. Princeton University Press, Princeton (1961)
10. Böttcher, M., Spott, M., Nauck, D.: Detecting temporally redundant association rules. In: Proc.
4th Int. Conf. on Machine Learning and Applications (ICMLA 2005, Los Angeles, CA), pp.
397–403. IEEE Press, Piscataway (2005)
11. Böttcher, M., Spott, M., Nauck, D.: A framework for discovering and analyzing changing
customer segments. In: Advances in Data Mining—Theoretical Aspects and Applications.
Lecture Notes in Computer Science, vol. 4597, pp. 255–268. Springer, Berlin (2007)
12. Borgelt, C., Berthold, M.R.: Mining molecular fragments: finding relevant substructures of
molecules. In: Proc. IEEE Int. Conf. on Data Mining (ICDM 2002, Maebashi, Japan), pp.
51–58. IEEE Press, Piscataway (2002)
13. Borgelt, C.: On canonical forms for frequent graph mining. In: Proc. 3rd Int. Workshop on
Mining Graphs, Trees and Sequences (MGTS’05, Porto, Portugal), pp. 1–12. ECML/PKDD
2005 Organization Committee, Porto (2005)
14. Borgelt, C., Wang, X.: SaM: a split and merge algorithm for fuzzy frequent item set mining
(to appear)
15. Branko, K., Lavrac, N.: Apriori-sd: adapting association rule learning to subgroup discovery.
Appl. Artif. Intell. 20(7), 543–583 (2006)
16. Cheng, Y., Fayyad, U., Bradley, P.S.: Efficient discovery of error-tolerant frequent itemsets in
high dimensions. In: Proc. 7th Int. Conf. on Knowledge Discovery and Data Mining (KDD’01,
San Francisco, CA), pp. 194–203. ACM Press, New York (2001)
17. Cook, D.J., Holder, L.B.: Graph-based data mining. IEEE Trans. Intell. Syst. 15(2), 32–41
(2000)
18. Davé, R.N.: Characterization and detection of noise in clustering. Pattern Recognit. Lett. 12,
657–664 (1991)
19. Ding, C., He, X.: Cluster merging and splitting in hierarchical clustering algorithms. In: Proc.
IEEE Int. Conference on Data Mining, p. 139. IEEE Press, Piscataway (2002)
20. Dunn, J.: Well separated clusters and optimal fuzzy partitions. J. Cybern. 4, 95–104 (1974)
21. Ester, M., Kriegel, H.-P., Sander, J., Xiaowei, X.: A density-based algorithm for discovering
clusters in large spatial databases with noise. In: Proc. 2nd Int. Conf. on Knowledge Discovery
and Data Mining (KDD 96, Portland, Oregon), pp. 226–231. AAAI Press, Menlo Park (1996)
22. Everitt, B.S., Landau, S., Leese, M.: Cluster Analysis. Wiley, Chichester (2001)
23. Finn, P.W., Muggleton, S., Page, D., Srinivasan, A.: Pharmacophore discovery using the inductive
logic programming system PROGOL. Mach. Learn. 30(2–3), 241–270 (1998)
24. Gamberger, D., Lavrac, N.: Expert-guided subgroup discovery: methodology and application.
J. Artif. Intell. Res. 17, 501–527 (2002)
25. Goethals, B., Zaki, M.J. (eds.): Proc. Workshop Frequent Item Set Mining Implementations
(FIMI 2003, Melbourne, FL, USA), CEUR Workshop Proceedings 90, Aachen, Germany
(2003). http://www.ceur-ws.org/Vol-90/
26. Guha, S., Rastogi, R., Shim, K.: ROCK: a robust clustering algorithm for categorical attributes.
Inf. Syst. 25(5), 345–366 (2000)
27. Halkidi, M., Batistakis, Y., Vazirgiannis, M.: On clustering validation techniques. J. Intell. Inf.
Syst. 17(2–3), 107–145 (2001)
28. Han, J., Pei, H., Yin, Y.: Mining frequent patterns without candidate generation. In: Proc.
Conf. on the Management of Data (SIGMOD’00, Dallas, TX), pp. 1–12. ACM Press, New
York (2000)
29. Hinneburg, A., Keim, D.A.: An efficient approach to clustering in large multimedia databases
with noise. In: Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining (KDD), pp.
224–228. AAAI Press, Menlo Park (1998)
30. Hinneburg, A., Keim, D.A.: Optimal grid-clustering: towards breaking the curse of dimen-
sionality in high-dimensional clustering. In: Proc. 25th Int. Conf. on Very Large Databases,
pp. 506–517. Morgan Kaufmann, San Mateo (1999)
31. Höppner, F.: Speeding up Fuzzy C-means: using a hierarchical data organisation to control the
precision of membership calculation. Fuzzy Sets Syst. 128(3), 365–378 (2002)
32. Höppner, F., Klawonn, F.: A contribution to convergence theory of fuzzy C-means and deriva-
tives. IEEE Trans. Fuzzy Syst. 11(5), 682–694 (2003)
33. Höppner, F., Klawonn, F., Kruse, R., Runkler, T.A.: Fuzzy Cluster Analysis. Wiley, Chichester
(1999)
34. Huan, J., Wang, W., Prins, J.: Efficient mining of frequent subgraphs in the presence of iso-
morphism. In: Proc. 3rd IEEE Int. Conf. on Data Mining (ICDM 2003, Melbourne, FL), pp.
549–552. IEEE Press, Piscataway (2003)
35. Kaski, S., Oja, E., Oja, E.: Kohonen Maps. Elsevier, Amsterdam (1999)
36. Klösgen, W.: Efficient discovery of interesting statements in databases. J. Intell. Inf. Syst. 4,
53–69 (1995)
37. Klösgen, W.: Explora: a multipattern and multistrategy discovery assistant. In: Advances in
Knowledge Discovery and Data Mining. MIT Press, Cambridge (1996). Chap. 10
38. Kohonen, T.: The self-organizing map. Proc. IEEE 78, 1464–1480 (1990)
39. Kramer, S., de Raedt, L., Helma, C.: Molecular feature mining in HIV data. In: Proc. 7th ACM
SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 2001, San Francisco,
CA), pp. 136–143. ACM Press, New York (2001)
40. Kuramochi, M., Karypis, G.: Frequent subgraph discovery. In: Proc. 1st IEEE Int. Conf. on
Data Mining (ICDM 2001, San Jose, CA), pp. 313–320. IEEE Press, Piscataway (2001)
41. Leman, D., Feelders, A., Knobbe, A.: Exceptional model mining. In: Proc. Europ. Conf. Ma-
chine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science,
vol. 5212, pp. 1–16. Springer, Berlin (2008)
42. Nijssen, S., Kok, J.N.: A quickstart in frequent structure mining can make a difference. In:
Proc. 10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD2004,
Seattle, WA), pp. 647–652. ACM Press, New York (2004)
43. Parsons, L., Haque, E., Liu, H.: Subspace clustering for high dimensional data: a review.
SIGKDD Explor. Newsl. 6(1), 90–105 (2004)
44. Pei, J., Tung, A.K.H., Han, J.: Fault-tolerant frequent pattern mining: problems and challenges.
In: Proc. ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge
Discovery (DMK’01, Santa Barbara, CA). ACM Press, New York (2001)
45. Ritter, H., Martinez, T., Schulten, K.: Neural Computation and Self-Organizing Maps: An
Introduction. Addison-Wesley, Reading (1992)
206 7 Finding Patterns
46. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster
analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
47. Scheffer, T., Wrobel, S.: Finding the most interesting patterns in a database quickly by using
sequential sampling. J. Mach. Learn. Res. 3, 833–862 (2003)
48. Scholz, M.: Sampling-based sequential subgroup mining. In: Proc. 11th ACM SIGKDD Int.
Conf. on Knowledge Discovery and Data Mining, pp. 265–274. AAAI Press, Menlo Park
(2005)
49. Smyth, P., Goodman, R.M.: An information theoretic approach to rule induction from
databases. IEEE Trans. Knowl. Discov. Eng. 4(4), 301–316 (1992)
50. Steinbeck, C., Han, Y., Kuhn, S., Horlacher, O., Luttmann, E., Willighagen, E.: The chemistry
development kit (CDK): an open-source Java library for chemo- and bioinformatics. J. Chem.
Inf. Comput. Sci. 43(2), 493–500 (2003)
51. Vesanto, J.: SOM-based data visualization methods. Intell. Data Anal. 3(2), 111–126 (1999)
52. Wang, X., Borgelt, C., Kruse, R.: Mining fuzzy frequent item sets. In: Proc. 11th Int. Fuzzy
Systems Association World Congress (IFSA’05, Beijing, China), pp. 528–533. Tsinghua Uni-
versity Press/Springer, Beijing/Heidelberg (2005)
53. Webb, G.I., Zhang, S.: k-Optimal-rule-discovery. Data Min. Knowl. Discov. 10(1), 39–79
(2005)
54. Webb, G.I.: Discovering significant patterns. Mach. Learn. 68(1), 1–33 (2007)
55. Wrobel, S.: An algorithm for multi-relational discovery of subgroups. In: Proc. 1st Europ.
Symp. on Principles of Data Mining and Knowledge Discovery. Lecture Notes in Computer
Science, vol. 1263, pp. 78–87. Springer, London (1997)
56. Yan, X., Han, J.: gSpan: graph-based substructure pattern mining. In: Proc. 2nd IEEE Int.
Conf. on Data Mining (ICDM 2003, Maebashi, Japan), pp. 721–724. IEEE Press, Piscataway
(2002)
57. Yan, X., Han, J.: Close-graph: mining closed frequent graph patterns. In: Proc. 9th ACM
SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 2003, Washington,
DC), pp. 286–295. ACM Press, New York (2003)
58. Xie, X.L., Beni, G.A.: Validity measure for fuzzy clustering. IEEE Trans. Pattern Anal. Mach.
Intell. (PAMI) 3(8), 841–846 (1991)
59. Zaki, M., Parthasarathy, S., Ogihara, M., Li, W.: New algorithms for fast discovery of asso-
ciation rules. In: Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining (KDD’97,
Newport Beach, CA), pp. 283–296. AAAI Press, Menlo Park (1997)
60. Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: a new data clustering algorithm and its
applications. Data Min. Knowl. Discov. 1(2), 141–182 (1997)
61. Zhao, Y., Karypis, G., Fayyad, U.: Hierarchical clustering algorithms for document datasets.
Data Min. Knowl. Discov. 10, 141–168 (2005)
Chapter 8
Finding Explanations
In the previous chapter we have discussed methods that find patterns of different
shapes in data sets. All these methods needed measures of similarity in order to
group similar objects. In this chapter we will discuss methods that address a very
different setup: instead of finding structure in a data set, we are now focusing on
methods that find explanations for an unknown dependency within the data. Such a
search for a dependency usually focuses on a so-called target attribute, that is, we
are particularly interested in why one specific attribute has a certain value. In case
of the target attribute being a nominal variable, we are talking about a classification
problem; in case of a numerical value we are referring to a regression problem.
Examples of such problems would be understanding why a customer belongs to
the category of people who cancel their account (e.g., classifying her into a yes/no
category) or better understanding the risk factors of customers in general.
We are therefore assuming that, in addition to the object description x, we have
access to a value for the target attribute Y . In contrast to the methods described in
previous chapters, where no target value was available, we now aim to model a de-
pendency towards one particular attribute. Since this can be compared to a teacher
model, where the teacher gives us the desired output, this is often called supervised
learning. Our dataset in this setup consists of tuples D = {(x_i, Y_i) | i = 1, ..., n}.
Based on this data, we are interested in finding an explanation in the form of some
type of interpretable model, which allows us to understand the dependency of the
target variable to the input vectors. The focus of these models lies on their inter-
pretability, and in a subsequent chapter we will focus on the quality of the forecast-
ing or predictive ability. Here we are interested in supervised (because we know the
desired outcome) and descriptive (because we care about the explanation) data anal-
ysis. It is important to emphasize immediately that this model will not necessarily
express a causal relationship, that is, the true underlying cause of how the outcome
depends on the inputs. All we can extract (automatically, at least) from the given
data are relationships expressed by some type of numerical correlations.
This chapter will cover the main representatives of model-based explanation
methods: decision trees, Bayes classifiers, regression models, and rule extraction
methods. These four types of explanation finding methods nicely cover four rather
different flavors.
M.R. Berthold et al., Guide to Intelligent Data Analysis, 207
Texts in Computer Science 42,
DOI 10.1007/978-1-84882-260-3_8, © Springer-Verlag London Limited 2010
Decision trees. Decision trees aim to find a hierarchical structure to explain how
different areas in the input space correspond to different outcomes. The hierarchical
way to partition this space is particularly useful for applications where we drown in
a series of attributes of unknown importance: the final decision tree often only uses
a small subset of the available set of attributes. Decision trees are often thrown first
at classification problems because they tend to be insensitive to normalization issues
and tolerant toward many correlated or noisy attributes. In addition, the structure of
the tree also allows for data clean-up. Quite often, the first attempt at generating a
decision tree reveals unexpected dependencies in the data which would otherwise
be hidden in a more complex model.
Bayes classifiers. Bayes classifiers form a solid baseline for achievable classification
accuracy: any other model should perform at least as well as the naive Bayes
classifier. In addition, they allow quick inspection of possible correlations between
any given input attribute and the target value. Before trying to apply more complex
models, a quick look at a Bayes classifier can be helpful to get a feeling for realistic
accuracy expectations and simple dependencies in the data. Bayes classifiers
express their model in terms of simple probabilities (or parameters of simple
probability distributions, such as mean and standard deviation) and hence explain nicely
how the classifier works.
Regression models. Regression models are the counterpart for numerical approximation
problems. Instead of finding a classifier and minimizing the classification
error, regression models try to minimize the approximation error, that is, some
measure for the average deviation between the expected and predicted numerical output
value. Again, many more complex models exist, but the regression coefficients allow
easy access to the internals of the model. Each coefficient shows the dependency
between one input attribute and the target value.
Rule models. Rule models are the most intuitive, interpretable representation.
However, not many efficient or usable algorithms exist to date for complex real-world
data sets. As a result, rule methods are generally not the first choice when
it comes to finding explanations in complex data sets. We still discuss some of the
most prominent methods in this chapter because the knowledge representation is
best suited to offer insights into the data set and some of those methods deserve
a bit more attention in data analysis setups. Generally, one would apply rule
extraction algorithms only to data sets with a reasonably well-understood structure.
Many of the algorithms tend to be rather sensitive toward useless or highly correlated
attributes and excessive noise in the data.
8.1 Decision Trees
8.1.1 Overview
Decision Trees come in two flavors, classification and regression trees. We will first
concentrate on the former, since classification problems are more common and most
training algorithms and other issues can easily be generalized to regression problems
as well. We will discuss issues particularly related to regression trees at the end of
this section. Figure 8.1 shows a typical example for a classification tree. The simple
tree can be used to classify animals, based on a number of attributes (the ability to
swim and fly, in this case). As usual in computer science related areas, trees grow
from top to bottom. The tree then builds a hierarchical decision structure which
helps to understand the classification process by traversing the tree from the root
node until a leaf is reached. At each intermediate node, the relevant attribute is in-
vestigated and the branch matching the attribute’s value is followed. The leaves then
hold the classifications. Obviously, the tree in our example does not work correctly
for all types of animals—we have already discussed this issue of generalization and
performance on unseen cases in Chap. 5. However, it is a reasonably compact
representation of a generic classification mechanism for animals based on easy-to-observe
attributes and, more importantly, it summarizes (most of) our data.
Note how the tree in Fig. 8.1 uses different types of splits. In practice there are
really three options:
1. Boolean splits consider Boolean attributes and have two child nodes. An
example of such a split would be an attribute “married” with two branches “yes”
and “no.”
2. Nominal splits are based on nominal attributes and can be binary, that is, they
split based on two disjoint subsets of the attribute values. Sometimes splits
with more children are considered as well, in the extreme case one child for each
nominal value. An example of such a split would be connected to the branch
“married=no” above and could split into “never”, “divorced”, and “widow(er)”. Note
that one can also model such a split with a sequence of nodes splitting the same
attribute into increasingly smaller subsets. We will see in the section on decision
tree construction how such potentially very wide splits can lead to rather useless
trees.
3. Splits on continuous attributes, finally, use a numerical variable and can either
split based on one particular value (e.g., “temperature” with the two branches
≤ 80 °F and > 80 °F) or on a series of values defining bins for each child.
Decision tree algorithms are also known as recursive partitioning methods since
each split of the tree essentially divides the remaining space into two (or more, in
the case of nonbinary splits) disjoint subpartitions. Figure 8.2 shows an example of a
tree operating on two numerical attributes and illustrates the corresponding
partitioning of the underlying feature space.
Fig. 8.2 A simple example of a decision tree using two numerical attributes (left) and the
corresponding partitioning of the underlying feature space (right)
Decision trees are well received by users because they are easy to read and interpret.
Additionally, the recursive partitioning approach seems to resemble the way humans
hierarchically structure the description of a classification problem.
8.1.2 Construction
From a data analysis point of view it is now of course interesting to know how we
can generate such simple, easy to understand structures from real-world data sets.
Finding the optimal decision tree for a given set of training examples is nontrivial,
but in most cases it is sufficient to find a reasonably small tree, which explains the
training data well. Note also that “optimal” is not obvious to define: do we mean the
smallest tree or the one with the best accuracy on the training data? And if we are
indeed interested in the smallest tree, does this relate to the number of nodes or the
depth or width of the tree? At the end of this chapter we will discuss this issue of
“learning bias” in more detail.
The most prominent algorithms therefore do not attempt to optimize a global
measure but employ a greedy strategy, that is, they focus on building the tree root-
first and then add subsequent branches and splits along those branches for the re-
mainder of the training data—they recursively find the best split at each point. Gen-
erally, such an algorithm looks like the one shown in Table 8.1, where D indicates
the available training examples, C the target attribute, and A the set of available
input attributes.
The algorithm recursively splits up the data and constructs the decision tree start-
ing with its root node. The recursion stops at steps 1 or 3 if a subset of patterns of
only one class is encountered (the resulting tree is then simply a leaf carrying that
class label) or no further attributes for splits are available (the leaf then carries the
label of the majority class in the subset). Otherwise we find the best split for the
subset at hand (line 6) and create a new split node. We then recursively call the tree
construction method on the subsets created by applying the chosen split (lines 8–14).
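The recursion described above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the listing of Table 8.1: the dictionary-based tree layout, the names build_tree and classify, and the pluggable choose_attribute argument are hypothetical, only nominal attributes with one branch per value are handled, and the toy data are invented.

```python
from collections import Counter

def build_tree(rows, target, attributes, choose_attribute):
    """Greedy recursive construction of a classification tree.

    rows: list of dicts, one per training example (the set D)
    target: name of the class attribute C
    attributes: list of still available input attributes A
    choose_attribute: function(rows, target, attributes) -> attribute name
    """
    labels = [row[target] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop: only one class is left, so the tree is a leaf with that label.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # Stop: no attributes are left, so the leaf carries the majority class.
    if not attributes:
        return {"leaf": majority}
    # Otherwise pick the locally best attribute and split on its values.
    best = choose_attribute(rows, target, attributes)
    remaining = [a for a in attributes if a != best]
    node = {"split": best, "children": {}, "default": majority}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        node["children"][value] = build_tree(subset, target, remaining,
                                             choose_attribute)
    return node

def classify(tree, row):
    """Follow the branches matching the row's values until a leaf is reached."""
    while "leaf" not in tree:
        child = tree["children"].get(row[tree["split"]])
        if child is None:  # attribute value never seen during training
            return tree["default"]
        tree = child
    return tree["leaf"]

# Invented toy data in the spirit of the animal example.
animals = [
    {"swims": "yes", "flies": "no", "cls": "fish"},
    {"swims": "no", "flies": "yes", "cls": "bird"},
    {"swims": "no", "flies": "no", "cls": "mammal"},
    {"swims": "yes", "flies": "no", "cls": "fish"},
]

def first_attribute(rows, target, attributes):
    """Placeholder criterion; a real one would maximize information gain."""
    return attributes[0]

tree = build_tree(animals, "cls", ["swims", "flies"], first_attribute)
print(classify(tree, {"swims": "no", "flies": "yes"}))
```

The split criterion is deliberately left as a parameter, since the choice of the best split is exactly the part of the algorithm that still has to be specified.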
Looking at this algorithm, two issues remain unclear:
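The entropy of the class attribute C in a dataset D, on which the following
discussion relies, is
\[ H_D(C) = -\sum_{c \in \mathrm{dom}(C)} \frac{|D_{C=c}|}{|D|} \log \frac{|D_{C=c}|}{|D|} \]
with 0 log 0 := 0, where D_{C=c} denotes the subset of D belonging to class c. For
two classes and the logarithm to base 2, the entropy ranges from 0 to 1 and is
maximal for an even 50 : 50 distribution of patterns of those classes. An entropy
of exactly 0 tells us that only patterns of the same class exist. Entropy therefore
provides us with a measure of how impure (with respect to the class variable C) a
dataset is.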
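As a small illustration, the entropy of a list of class labels can be computed as follows. This is a sketch assuming the base-2 logarithm; the function name entropy and the example labels are hypothetical. With k equally frequent classes the value is log2 k, so the upper bound of 1 holds for two classes only.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy in bits of a list of class labels, with 0 log 0 treated as 0."""
    n = len(labels)
    # Counter only contains classes that actually occur, so no term with
    # relative frequency 0 is ever evaluated.
    return sum(-(k / n) * log2(k / n) for k in Counter(labels).values())

print(entropy(["a", "a", "b", "b"]))  # two classes, even 50:50 distribution
print(entropy(["a", "a", "a", "a"]))  # pure set
print(entropy(["a", "b", "c"]))       # three classes: exceeds 1
```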
We can now determine what the best attribute at any given point is: it is the
attribute A with the biggest reduction in entropy compared to the original set (recall
that our aim is to find pure leaves). We also call this reduction the information gain:
\[ I_D(C, A) = H_D(C) - H_D(C, A), \]
where
\[ H_D(C, A) = \sum_{a \in \mathrm{dom}(A)} \frac{|D_{A=a}|}{|D|} H_{D_{A=a}}(C), \]
and D_{A=a} indicates the subset of D for which attribute A has value a. H_D(C, A)
then denotes the entropy that is left in the subsets of the original data after they have
been split according to their values of A. It is interesting to note that the information
gain can never be negative, that is, no matter which split we choose, the entropy
spread out over the resulting subsets is not going to increase.
1 Rumors say that ID3 stands for “Iterative Dichotomiser 3” (from Greek dichotomia: divided);
supposedly it was Quinlan’s third attempt. Another interpretation by one of the authors is
“Induction of Decision 3rees.”
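The information gain of a nominal attribute, that is, the entropy of the class in D minus the size-weighted entropy remaining in the subsets D_{A=a}, can be sketched as follows. The function names and the four-row data set are hypothetical and serve only to illustrate the computation.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(k / n) * log2(k / n) for k in Counter(labels).values())

def information_gain(rows, target, attribute):
    """Entropy of the target minus the entropy left after splitting on attribute."""
    groups = defaultdict(list)  # attribute value a -> class labels in D_{A=a}
    for row in rows:
        groups[row[attribute]].append(row[target])
    remaining = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remaining

# Invented data: "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
for attribute in ("outlook", "windy"):
    print(attribute, information_gain(rows, "play", attribute))
```

In this toy example the split on outlook yields pure subsets and removes the entire entropy of one bit, whereas the split on windy leaves the class distribution in each subset unchanged and gains nothing.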
However, what happens if we have an attribute ID with unique values for every
single example pattern and we use these kinds of splits and the entropy to measure
the information gain? Simple, the attribute holding the unique IDs will be chosen
first because it results in a node with n branches for n = |D| training examples: each
example has its individual ID resulting in its own branch. The entropy of the sets
passed down to all of these branches goes down to zero (since only one element
of one class is contained in each set), and the information gain is maximized. This
is clearly not desirable in most applications. The reason for choosing this odd split
is that we are reducing the entropy at all costs, completely ignoring the costs of
actually making the split. In order to compensate for this we can also compute the
split information:
\[ SI_D(A) = -\sum_{a \in \mathrm{dom}(A)} \frac{|D_{A=a}|}{|D|} \log \frac{|D_{A=a}|}{|D|}, \]
which computes the entropy of the distribution of our original set into subsets, and use
this to normalize the purely entropy-driven information gain:
\[ GR_D(C, A) = \frac{I_D(C, A)}{SI_D(A)}, \]
resulting in GR, the gain ratio of a given split. The gain ratio penalizes very wide
splits, that is, splits with many branches, and biases the selection toward more nar-
row splits.
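The effect of this normalization on an ID-like attribute can be checked with a short sketch. All names and the data are hypothetical; returning 0 for an attribute with a single value (split information 0) is an assumption made here to avoid a division by zero, not something prescribed by the text.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(values):
    """Entropy in bits of a list of nominal values."""
    n = len(values)
    return sum(-(k / n) * log2(k / n) for k in Counter(values).values())

def information_gain(rows, target, attribute):
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    remaining = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remaining

def gain_ratio(rows, target, attribute):
    # Split information: entropy of the attribute's own value distribution.
    split_info = entropy([row[attribute] for row in rows])
    if split_info == 0:  # assumption: a single-valued attribute gets ratio 0
        return 0.0
    return information_gain(rows, target, attribute) / split_info

# Invented data with a unique ID per example.
rows = [
    {"id": 1, "outlook": "sunny", "play": "yes"},
    {"id": 2, "outlook": "sunny", "play": "yes"},
    {"id": 3, "outlook": "rainy", "play": "no"},
    {"id": 4, "outlook": "rainy", "play": "no"},
]
for attribute in ("id", "outlook"):
    print(attribute,
          information_gain(rows, "play", attribute),
          gain_ratio(rows, "play", attribute))
```

Both attributes achieve the same information gain here, but the four-way split on id has a split information of two bits, so its gain ratio is only half that of the two-way split on outlook.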
Of course, building decision trees on nominal values only is of limited interest.
In the following we will touch upon some of the extensions to address this and other
limitations. However, the main underlying algorithm always remains the same—
a greedy search which attempts to locally maximize some measure of information
gain over the available attributes and possible splits at each stage until a certain
criterion is reached.
Decision trees are a widely researched topic in machine learning and data analysis.
There are numerous extensions and variants that address various shortcomings or
specific types of data and applications. In the following we try to briefly
summarize some of the more general variations.
As mentioned above, the biggest limitation of ID3 is the focus on nominal attributes
only. In order to extend this to numerical values, let us first explore possible splits
for such variables and worry about how to actually find those splits a tad later.
A small example for a numerical attribute (top row) and the corresponding class
values (bottom row) is shown in Fig. 8.3.
For convenience, we have sorted the training instances (columns showing values
of one input and one class value in this case) according to the numerical attribute.
Clearly, splits can divide the range of the numerical attribute “temperature” only
into two (or more) ranges. Hence, looking at the sorted series of training instances,
we can really only divide this into two pieces, a set of patterns to the left of the
“cut” and a set to the right. One could imagine performing more than one split at
each branching point in the tree, but this can easily be modeled by a series of binary
splits. So in the following we will concentrate on the binary split setup.
In addition, in the case of a decision tree, we need to consider only splits that occur
at class boundaries. Why is that? An intuitive explanation is simple: if
we already have instances of one class (e.g., C) in one branch of our considered
split, it really does not make sense to assign any patterns of that class to the other
branch, that is, splitting a uniform subset into even smaller subsets never gives us
any advantage. This, of course, depends on the chosen measure of information gain,
but in most cases, this assumption holds.
Therefore the only split points we need to consider are the splits indicated in
Fig. 8.3. Note that even though we would like to split at the dashed line (i.e., in
between the patterns of class B and class C), we cannot do this since the value of
temperature is 41 for both instances. So in this particular case we have to consider
neighboring splits, e.g., (b) and (b′). This results in four possible splits that we need
to investigate further. And suddenly, the problem can be converted back into the
issue of looking into finding the one best nominal split: we can simply define four
binary variables temp_leq_35, temp_leq_39, temp_leq_42, and temp_leq_46.5
that take on the value true when the value of the original attribute temperature
lies at or below the respective threshold and false when it lies above. We can then
apply the strategies described in the previous section to determine information gains
for each one of these attributes and compare them with other possible splits on
nominal or numerical attributes.
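The search for candidate split points can be sketched as follows. The data below are invented (they are not the values of Fig. 8.3) and were merely chosen so that the resulting cuts coincide with the thresholds named above; the function name and the choice of the midpoint between neighboring values as the cut are likewise assumptions.

```python
from itertools import groupby

def candidate_thresholds(values, labels):
    """Cut points between neighboring distinct values at class boundaries."""
    pairs = sorted(zip(values, labels))
    # Set of classes observed for each distinct value, in increasing order.
    groups = [(v, {c for _, c in grp})
              for v, grp in groupby(pairs, key=lambda p: p[0])]
    cuts = []
    for (v1, c1), (v2, c2) in zip(groups, groups[1:]):
        # A cut inside a run of one single class can be skipped; if a value
        # carries several classes, both neighboring cuts are kept.
        if not (len(c1) == 1 and c1 == c2):
            cuts.append((v1 + v2) / 2)
    return cuts

# Hypothetical sorted training instances; value 41 occurs with classes B and C.
temperature = [30, 33, 37, 41, 41, 43, 50, 52]
classes = ["A", "A", "B", "B", "C", "C", "A", "A"]

cuts = candidate_thresholds(temperature, classes)
# One binary attribute per candidate threshold.
binary = {f"temp_leq_{t:g}": [v <= t for v in temperature] for t in cuts}
print(cuts)
print(sorted(binary))
```

Each of the derived Boolean attributes can then be scored with the same information gain or gain ratio computation as any nominal attribute.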
Note that using this approach, a numerical attribute may appear several times
along the branch of a decision tree, each time being split at a different possible
split point. This may make the tree somewhat harder to read. One could circumvent
this by adding nominal values that split the numerical attributes in class-wise pure
ranges only, but especially for somewhat noisy attributes, this will result in many
small bins and subsequently many nominal values for the resulting variable. As we
have seen above, such splits are favored by pure entropy-based information gains
and discouraged by other measures. Either way, these splits will be hard to interpret.
In addition to handling numerical attributes for the feature vector, one is often also
interested in dealing with numerical target values. The underlying issue then turns
from predicting a class (or the probability of an input vector belonging to a certain
class) to the problem of predicting a continuous output value. We call this regres-
sion, and, in case of decision trees, the resulting tree is called a regression tree.
At about the same time that Quinlan was discussing algorithms for the construction of
classification trees, Breiman and his colleagues described CART (Classification And
Regression Trees, see [5]).² The structure of such a tree is the same as that of the
trees discussed above; the only difference is the leaves: instead of class labels (or
distributions of class labels), the leaves now contain numerical constants. The tree
output is then simply the constant corresponding to the leaf the input pattern ends
up in. So the regression process is straightforward, but what about constructing such
trees from data? Instead of an entropy-based measurement which aims to measure
class impurity, we now need to find a measure for the quality of fit of a tree, or
at least a branch. For this, we can use the sum of squares measurement, discussed
earlier. The sum of squares for a specific node n may then simply be defined as the
squared mean error of fit:
\[ SME(D_n) = \frac{1}{|D_n|} \sum_{(x, Y) \in D_n} (Y - f(x))^2, \]
where f(x) = f(n) is the constant value assigned to the node n that the pattern x ends
up in, and D_n are the data points ending up at the node. Using this quality measure,
we can again apply our
decision tree construction algorithm and continue analogously to the classification
case: we greedily investigate all possible splits, determine the error on the respective
splits, combine them (weighted by the respective sizes of the subsets, of course), and
choose the split which results in the biggest reduction of error.
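For a single numerical input attribute, this greedy search can be sketched as follows. The sketch assumes that the constant in each node is the mean of the target values ending up there (the constant that minimizes the squared error); the function names and the data are hypothetical.

```python
def sme(ys):
    """Mean squared deviation of the targets from the node constant (their mean)."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def best_regression_split(xs, ys):
    """Return (error reduction, threshold) of the best binary split on xs."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    base = sme(ys)
    best = (0.0, None)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut is possible between identical input values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Combine the errors of both subsets, weighted by their sizes.
        weighted = len(left) / n * sme(left) + len(right) / n * sme(right)
        if base - weighted > best[0]:
            best = (base - weighted, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best

# Invented data with two clearly separated groups of target values.
xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
reduction, threshold = best_regression_split(xs, ys)
print(threshold, reduction)
```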
More sophisticated versions of regression trees, so-called model trees, not only
allow for constants in the leaves but more complex functions on all or a subset of the
input variables. However, these trees are much less commonly used, since normal
regression trees are usually sufficient to approximate regression problems. In effect,
model trees allow one to move some of the complexity from the tree structure into
the functions in each leaf, resulting in smaller trees with leaves that are harder to
interpret.
2 Quinlan later also developed methods for regression problems, similar to CART.
8.1.3.3 Pruning
So far we have not really worried much about when to stop growing a tree. Driven
to the extreme, the base algorithm described in Sect. 8.1.2 simply continues until
the tree cannot be refined any further, that is, no split attributes remain to be used,
or the dataset at a node contains one pattern or only patterns of the same class or
output value. As we have discussed, this behavior is most likely not desirable since
it leads to overfitting and, in this particular case, will also make the interpretation of
the resulting tree unnecessarily complex.
This is why pruning of decision trees gains importance quickly in real-world
applications. Pruning decision trees comes in two variants:
• prepruning refers to the process of stopping the construction of the decision tree
already during the training process, that is, specific splits will not be added since
they do not offer sufficient advantage even though they would produce a (albeit
small) numerical improvement;
• postpruning refers to the reduction of a decision tree which was built to a clearly
overly large size. Here we distinguish two variants; one replaces subtrees with
leaves (subtree replacement), and the other one also removes nodes from within
the tree itself (subtree raising). The latter is substantially more complex since
the instances contained in a node need to be redistributed among the branches
of the raised subtree, and its benefits are questionable since in practice trees rarely
become large enough to make this approach beneficial.
Figure 8.4 illustrates both approaches.
The important question is, independent of when we prune, how do we determine
if a node can be removed (or should not be split in the first place)? Most commonly,
one splits the available data into a training and validation data set, where the training
set is used to determine possible splits, and the validation data then helps to estimate
the split’s actual usefulness. Often a sufficient amount of training data is, however,
not available, and one needs to rely on other measures. One possibility is statistical
tests that help to estimate if expanding a node will likely bring a significant improve-
ment. Quinlan proposes a heuristic which is based on the training data and computes
confidence intervals for each node. From these, a standard Bernoulli-based estimate
(see Sect. A.3 in the appendix) is used to determine which nodes can be pruned.
The statistical foundation of this method is somewhat shaky, but it works well in
practice. Other ways for pruning are based on measures for complexity, similar to
the split information we discussed above. The Minimum Description Length Principle (see
Sect. 5.5.4.1) can help to weigh the expected improvement in accuracy against the
required enlargement of the model.
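The validation-based variant can be sketched as a bottom-up subtree replacement, often called reduced-error pruning. This is not Quinlan's confidence-interval heuristic; the dictionary-based tree layout (with a stored majority class under the key "default"), the function names, and the data are assumptions made for illustration. Note that the sketch modifies the tree in place and, as a design choice, also prunes a subtree when the pruned and unpruned versions make equally many validation errors.

```python
def classify(tree, row):
    """Follow matching branches; fall back to the node's majority class."""
    while "leaf" not in tree:
        tree = tree["children"].get(row[tree["split"]],
                                    {"leaf": tree["default"]})
    return tree["leaf"]

def errors(tree, rows, target):
    """Number of misclassified rows."""
    return sum(classify(tree, row) != row[target] for row in rows)

def prune(tree, validation, target):
    """Bottom-up subtree replacement guided by a validation set."""
    if "leaf" in tree:
        return tree
    for value, child in tree["children"].items():
        subset = [r for r in validation if r[tree["split"]] == value]
        tree["children"][value] = prune(child, subset, target)
    leaf = {"leaf": tree["default"]}
    # Replace the subtree by a leaf if this does not increase the error.
    if errors(leaf, validation, target) <= errors(tree, validation, target):
        return leaf
    return tree

# Hypothetical overgrown tree: the split on "color" only fits noise.
tree = {
    "split": "swims", "default": "fish",
    "children": {
        "yes": {"leaf": "fish"},
        "no": {"split": "color", "default": "bird",
               "children": {"red": {"leaf": "bird"},
                            "blue": {"leaf": "mammal"}}},
    },
}
validation = [
    {"swims": "no", "color": "blue", "cls": "bird"},
    {"swims": "no", "color": "red", "cls": "bird"},
    {"swims": "yes", "color": "red", "cls": "fish"},
]
pruned = prune(tree, validation, "cls")
print(pruned)
```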
Decision trees make it relatively easy to deal with missing values. Instead of simply
ignoring rows which contain missing values, we can make use of the remaining in-
formation both during training and classification. What all approaches for missing
value handling during training have in common is that they use more or less sophis-
ticated ways to estimate the impact of the missing value on the information gain
measure. The most basic one simply adds a fraction of each class to each partition
if the split attribute’s value is missing for that record. During classification, dealing
with missing attribute values is straightforward: special treatment is only required
if, during tree traversal, a node relies on the value which is missing in the pattern
to be classified. Then the remainder of the tree traversal can be simply done in both
branches and, later on, the results that are encountered in the two (or more, if more
than one missing value was encountered) leaves are merged.
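The traversal of several branches during classification can be sketched as follows. The text only states that the results of the reached leaves are merged; weighting each branch by the fraction of training examples that followed it is one common choice and an assumption here, as are the tree layout with class distributions in the leaves and all names.

```python
def class_distribution(tree, row):
    """Return {class: weight}; follow all branches if the split value is missing."""
    if "dist" in tree:
        return dict(tree["dist"])
    value = row.get(tree["split"])
    if value is not None and value in tree["children"]:
        return class_distribution(tree["children"][value], row)
    # Missing value: descend into every branch and merge the results,
    # weighted by the (assumed) training fraction of each branch.
    merged = {}
    for branch, child in tree["children"].items():
        weight = tree["weights"][branch]
        for cls, p in class_distribution(child, row).items():
            merged[cls] = merged.get(cls, 0.0) + weight * p
    return merged

# Hypothetical tree: a quarter of the training examples had swims = yes.
tree = {
    "split": "swims",
    "weights": {"yes": 0.25, "no": 0.75},
    "children": {
        "yes": {"dist": {"fish": 1.0}},
        "no": {"dist": {"bird": 0.5, "mammal": 0.5}},
    },
}
print(class_distribution(tree, {"swims": "yes"}))
print(class_distribution(tree, {}))  # value of "swims" is missing
```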
Decision trees are one of the most well-known and prominently used examples of
more sophisticated data analysis methods. However, one often forgets that they are
notoriously unstable. Instability means that small changes to the training data, such
as removing just one example, can result in drastic changes in the resulting tree.
This is mostly due to the greedy nature of the underlying algorithm, which never re-
considers chosen splits but is also part of the general nature of the tree structure: two
very similar but not necessarily highly correlated attributes can exchange roles in a
decision tree when a few training examples are added or removed, in turn affecting
the rest of the tree dramatically.
When better, more stable performance is needed but interpretation is not such
an issue, one often resorts to forests of decision trees or decision stumps, which
belong to the family of ensemble methods: instead of building one large tree, a set
of differently initialized, much smaller decision trees (“stumps”) is created, and
the classification (or regression) output is created by committee voting. Forests of
trees have the advantage that they are more stable than classical decision trees and
often show superior generalization performance. This type of model class can also
be used for feature selection wrapper methods, such as backward feature elimination
as discussed in Sect. 6.1.1.
There are many other variations of decision trees around: for instance, fuzzy
decision trees [12] which allow one to process imprecise data and handle degrees
of class membership. Other variations allow one to include costs of attributes, that
is, they allow one to consider during the coding phase that different attribute values
may be obtainable at different costs. The result is a decision tree which attempts to
optimize both the quality and the expected cost of processing new instances.
8.2 Bayes Classifiers
8.2.1 Overview
Let us consider the question: what is the best possible classifier? The obvious and
immediate answer seems to be: a classifier that always predicts the correct class,
that is, the class the instance under consideration actually belongs to. Although
this is certainly the ideal we strive for, it is rarely possible to achieve it in prac-
tice.
A fundamental obstacle is that an instantiation of the available descriptive
attributes rarely allows us to single out one class as the correct one. Rather, there
are usually several possible classes, either because there exist one or more hidden
(unobserved) attributes that influence the class an instance belongs to, or because
there is some random influence which cannot be predicted in principle (see Sect. 5.4
for a more detailed discussion).
Such a situation may be reflected in the available data by the fact that there are
contradictory instances, that is, instances that coincide in their values for all de-
scriptive attributes but differ in the class they belong to. Note, however, that the
absence of contradictory instances does not guarantee that a perfect classifier is
possible: the fact that no contradictory instances were observed (up to now) does
not imply that they are impossible. Even if the training data set does not contain
contradictory instances, it may still be that future cases, for which we want to pre-
dict the class, contradict each other or contradict instances from the training data
set.
In a situation where there exists no unique class that can be predicted with cer-
tainty, but we still have to predict a single class (and not only a set of possible
classes), the best we can do is to predict the class that has the highest probability
(provided, of course, that all misclassification costs are equal—unequal misclas-
sification costs are discussed in Sect. 8.2.3). The reason is that this scheme ob-
viously yields (at least on average) the highest number or rate of correct predic-
tions.
If we try to build a classifier that predicts, for any instantiation of the available
descriptive attributes, the most probable class, we face two core problems: (1) how
can we properly estimate which is the most probable class for a given instantiation?
and (2) how can we store all of these estimates in a feasible way, so that they are
easily accessible whenever we have to classify a new case? The second problem
is actually a more fundamental one: if we are able to estimate, but cannot store
efficiently, obtaining the estimates is obviously pointless.
Unfortunately, the most straightforward approach, namely simply storing the
most probable class or a probability distribution over the classes for each possi-
ble instance is clearly infeasible: from a theoretical point of view, a single metric
attribute would give rise to an uncountably infinite number of possible instances.
In addition, even if we concede that in practice metric attributes are measured with
finite precision and thus can have only a finite number of values, the number of
possible instances grows exponentially with the number of attributes.
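To make the problem concrete, the straightforward approach can be sketched as a lookup table that stores class counts for every observed instantiation and predicts the most frequent class. The function names and the data are hypothetical. The sketch also shows the weakness discussed above: the table needs one entry per distinct instantiation and has nothing to say about instantiations it has never seen.

```python
from collections import Counter, defaultdict

def fit_lookup(instances, labels):
    """Count the classes observed for every instantiation of the attributes."""
    table = defaultdict(Counter)
    for x, y in zip(instances, labels):
        table[tuple(x)][y] += 1
    return table

def predict(table, x):
    """Most frequent class for x, or None if x was never observed."""
    counts = table.get(tuple(x))
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Invented data; the first three instances contradict each other.
X = [("yes", "no"), ("yes", "no"), ("yes", "no"), ("no", "no")]
y = ["cancel", "cancel", "stay", "stay"]
table = fit_lookup(X, y)
print(predict(table, ("yes", "no")))   # majority among contradictory instances
print(predict(table, ("no", "yes")))   # unseen instantiation
```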
As a consequence, we have to introduce simplifying assumptions. The most com-
mon is to assume that the descriptive attributes are conditionally independent given
the class (see Sect. A.3.2.3 in the appendix for a definition of this notion), thus re-
ducing the number of needed parameters from the product of the domain sizes to
their sum times the number of classes. However, the disadvantage is that this as-
sumption is very strong and not particularly realistic. Therefore the result is also
known as the naive or even the idiot’s Bayes classifier. Despite this pejorative
name, naive Bayes classifiers perform very well in practice and are highly valued
in domains in which large numbers of descriptive attributes have to be taken into
account (for example, chemical compound and text document classification).
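The reduction in the number of parameters can be made concrete with a small count. The following is only an illustrative sketch: the attribute and class numbers are hypothetical, and the function names are not taken from the text. The naive count also includes one prior entry per class.

```python
from math import prod

def full_table_size(domain_sizes, n_classes):
    """Entries needed to store a class distribution for every possible instance."""
    return prod(domain_sizes) * n_classes

def naive_bayes_size(domain_sizes, n_classes):
    """Entries for one table P(x | y) per attribute, plus the prior P(y)."""
    return sum(domain_sizes) * n_classes + n_classes

# Hypothetical setting: 20 categorical attributes with 5 values each, 3 classes.
sizes = [5] * 20
print(full_table_size(sizes, 3))   # 286102294921875
print(naive_bayes_size(sizes, 3))  # 303
```

Under these assumed sizes, the explicit table is out of reach, whereas the naive factorization needs only a few hundred numbers.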
Other simplifying assumptions concern metric (numeric) attributes. They are
usually treated by estimating the parameters of one conditional distribution func-
tion per class—most commonly a normal distribution. Of course, this approach
limits how well the actual distribution can be fitted but has the advantage that it
considerably reduces the number of needed parameters. As a consequence, it may
even become possible to abandon the naive conditional independence assumption:
if we model all metric attributes jointly with one (multivariate) normal distribu-
tion per class, the number of parameters is merely quadratic in the number of
(metric) attributes. The result is known as the full Bayes classifier. Note, how-
ever, that this applies only to metric (numeric) attributes. If categorical attributes
are present, assuming conditional independence or something similar may still be
necessary.
Extensions of the basic approach deal with mitigating the objectionable condi-
tional independence assumption, selecting an appropriate subset of descriptive at-
tributes to simplify the classifier, or incorporating misclassification costs.
8.2.2 Construction
Note that this simplification is possible even if we desire not only to predict the
most probable class, but want to report its probability as well. The reason is that by
exploiting the law of total probability we can compute P (x) for any x as
\[ P(\mathbf{x}) = \sum_{y \in \mathrm{dom}(Y)} P(\mathbf{x} \mid y)\, P(y). \]
3 Since one or more of them may be metric, we may have to use a probability density function f to
refer to descriptive attributes: f (x | y). However, we ignore such notational subtleties here.
where x is the element of the data tuple x that refers to the descriptive attribute X.
This is the core classification formula of the naive Bayes classifier.
If an attribute X is categorical, the corresponding factor P (x | y) is fairly eas-
ily manageable: it takes only | dom(X)| · | dom(Y )| parameters to store it explicitly.
However, if X is metric, a distribution assumption is needed to store the corresponding
conditional probability density f (x | y). The most common choice is a normal
distribution $f(x \mid y) = N(\mu_{X|y}, \sigma^2_{X|y})$ with parameters $\mu_{X|y}$ (mean) and
$\sigma^2_{X|y}$ (variance).
Once we have the above classification formula, estimating the model parameters
becomes very simple. Given a data set D = {(x1 , y1 ), . . . , (xn , yn )}, we use
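\[ \forall y \in \mathrm{dom}(Y): \quad \hat{P}(y) = \frac{\gamma + n_y}{\gamma\,|\mathrm{dom}(Y)| + n} \]
with
\[ n_y = \sum_{i=1}^{n} \tau(y_i = y) \quad\text{and}\quad n_{xy} = \sum_{i=1}^{n} \tau\bigl(\mathbf{x}_i[X] = x \wedge y_i = y\bigr), \]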
where xi [X] is the value that attribute X has in the ith tuple, and yi is the class of
the ith tuple (that is, the value of the class attribute Y ). The function τ is a kind of
truth function, that is, τ (ϕ) = 1 if ϕ is true and 0 otherwise. Hence, ny is simply the
number of sample cases belonging to class y, and nxy is the number of sample cases
in class y for which the attribute X has the value x.
Finally, γ is a constant that is known as the Laplace correction. It may be chosen
as γ = 0, thus reducing the procedure to simple maximum likelihood estimation
(see Sect. A.4.2.3 in the appendix for more details on maximum likelihood estima-
tion). However, in order to appropriately treat categorical attribute values that do not
occur with some class in the given data set but may nevertheless be possible, it is
advisable to choose γ > 0. This also renders the estimation more robust, especially
for a small sample size (small data set or small number of cases for a given class).
Clearly, the larger the value of γ , the stronger the tendency toward a uniform distribution.
Common choices are $\gamma = 1$, $\gamma = \frac{1}{2}$, or $\gamma = \gamma_0 / (|\mathrm{dom}(X)| \cdot |\mathrm{dom}(Y)|)$, where $\gamma_0$ is a
user-specified constant that is known as the equivalent sample size (since it can be
justified with an argument from Bayesian statistics, where it represents the weight
of the prior distribution, measured as the size of a sample having the same effect).
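The effect of the Laplace correction is easiest to see on a toy sample. The sketch below implements the estimate of the class prior given above. For the conditional probabilities of a categorical attribute it assumes the analogous form (γ + n_xy) / (γ |dom(X)| + n_y); this is the usual Laplace-corrected estimate, but it is an assumption here, as are the function names and the data.

```python
from collections import Counter

def estimate_prior(classes, class_domain, gamma=1.0):
    """Laplace-corrected estimate (gamma + n_y) / (gamma * |dom(Y)| + n)."""
    n = len(classes)
    n_y = Counter(classes)
    denom = gamma * len(class_domain) + n
    return {y: (gamma + n_y[y]) / denom for y in class_domain}

def estimate_conditional(values, classes, value_domain, class_domain, gamma=1.0):
    """Assumed analogous estimate (gamma + n_xy) / (gamma * |dom(X)| + n_y)."""
    n_y = Counter(classes)
    n_xy = Counter(zip(values, classes))
    return {
        (x, y): (gamma + n_xy[(x, y)]) / (gamma * len(value_domain) + n_y[y])
        for x in value_domain
        for y in class_domain
    }

# Hypothetical toy data: one categorical attribute and a binary class.
colors = ["red", "red", "blue", "red", "blue"]
labels = ["a", "a", "a", "b", "b"]
prior = estimate_prior(labels, ["a", "b"], gamma=1.0)
cond = estimate_conditional(colors, labels, ["red", "blue", "green"], ["a", "b"])
print(prior["a"])             # (1 + 3) / (2 + 5) = 4/7
print(cond[("green", "a")])   # (1 + 0) / (3 + 3) = 1/6, although "green" never occurs
```

With gamma=0.0 the value "green" would receive probability 0 in both classes, which is exactly the situation the correction is meant to avoid.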
4 This is a prior probability, because it describes the class probability before observing the values
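\[ \hat{\mu}_{X|y} = \frac{1}{n_y} \sum_{i=1}^{n} \tau(y_i = y) \cdot \mathbf{x}_i[X] \quad\text{and}\quad \hat{\sigma}^2_{X|y} = \frac{1}{n'_y} \sum_{i=1}^{n} \tau(y_i = y) \cdot \bigl(\mathbf{x}_i[X] - \hat{\mu}_{X|y}\bigr)^2, \]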
where XM is the set of metric attributes, and xM is the vector of values of these
attributes. The (class-conditional) mean vectors μXM |y and the corresponding co-
variance matrices ΣXM |y can be estimated from a given data set X as
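\[ \forall y \in \mathrm{dom}(Y): \quad \hat{\boldsymbol{\mu}}_{X_M|y} = \frac{1}{n_y} \sum_{i=1}^{n} \tau(y_i = y) \cdot \mathbf{x}_i[X_M] \]
and
\[ \hat{\boldsymbol{\Sigma}}_{X_M|y} = \frac{1}{n'_y} \sum_{i=1}^{n} \tau(y_i = y) \cdot \bigl(\mathbf{x}_i[X_M] - \hat{\boldsymbol{\mu}}_{X_M|y}\bigr) \bigl(\mathbf{x}_i[X_M] - \hat{\boldsymbol{\mu}}_{X_M|y}\bigr)^{\top}, \]
with either $n'_y = n_y$ or $n'_y = n_y - 1$ (as above). If all attributes are metric, the result
is
\[ \mathrm{pred}(\mathbf{x}) = \arg\max_{y} P(y)\, f(\mathbf{x} \mid y), \]
that is, the core classification formula of the full Bayes classifier.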
As a consequence, the classification formula for a mixed Bayes classifier is
\[ \mathrm{pred}(\mathbf{x}) = \arg\max_{y} P(y)\, f(\mathbf{x}_M \mid y) \prod_{X \in X_C} P(x \mid y), \]
Table 8.2 A naive Bayes classifier for the iris data. The class-conditional normal distributions are
described by μ̂ ± σ̂ (that is, expected value ± standard deviation)
Iris type Iris setosa Iris versicolor Iris virginica
Fig. 8.5 Naive Bayes density functions for the Iris data (axes-parallel ellipses, left) and density
functions that take the covariance of the two measures into account (general ellipses, right). The
ellipses are the 1σ̂ - and 2σ̂ -boundaries (lines of equal probability density)
where XC is the set of categorical attributes. In a mixed Bayes classifier the metric
attributes are modeled with a multivariate normal distribution (as in a full Bayes
classifier), while the categorical attributes are treated with the help of the conditional
independence assumption (as in a naive Bayes classifier).
In order to illustrate the effect of the conditional independence assumption, we
consider a naive Bayes classifier for the well-known iris data [2, 8]. The goal is
to predict the iris type (Iris setosa, Iris versicolor, or Iris virginica) from measure-
ments of the length and width of the petals and sepals. Here we confine ourselves
to the measurements of the petals, which are most informative w.r.t. the class. The
parameters of a naive Bayes classifier, derived with a normal distribution assump-
tion, are shown in Table 8.2, and its graphical illustration in Fig. 8.5 on the left.
In addition, Fig. 8.5 contains, on the right, an illustration of a full Bayes classifier,
in which the two descriptive attributes (petal length and width) are modeled with
class-conditional bivariate normal distributions.
These diagrams show that the conditional independence assumption is expressed
by the orientation of the ellipses, which are lines of equal probability density: in a
naive Bayes classifier, the major axes of these ellipses are always parallel to the co-
ordinate axes, while a full Bayes classifier allows them to have arbitrary directions.
As a consequence, the dependence of the attributes petal length and width can be
modeled better, especially for Iris versicolor.
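A naive Bayes classifier with class-conditional normal distributions can be sketched in a few lines. The code below is a minimal illustration, not the procedure used to produce Table 8.2: the measurements and class names are made up, the function names are hypothetical, the prior is estimated without Laplace correction, and the scores are computed with logarithms to avoid numerical underflow.

```python
import math

def fit_gaussian_nb(X, y, unbiased=True):
    """Estimate P(y) and a (mean, variance) pair per attribute and class."""
    n = len(y)
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, yi in zip(X, y) if yi == c]
        n_c = len(rows)
        denom = n_c - 1 if unbiased else n_c  # the two common normalizations
        stats = []
        for j in range(len(X[0])):
            mu = sum(r[j] for r in rows) / n_c
            var = sum((r[j] - mu) ** 2 for r in rows) / denom
            stats.append((mu, var))
        model[c] = (n_c / n, stats)
    return model

def log_normal_density(x, mu, var):
    """Logarithm of the density of N(mu, var) at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict(model, x):
    """Return arg max_y P(y) * prod_j f(x_j | y), evaluated in log space."""
    def log_score(c):
        prior, stats = model[c]
        return math.log(prior) + sum(
            log_normal_density(xj, mu, var) for xj, (mu, var) in zip(x, stats)
        )
    return max(model, key=log_score)

# Made-up (length, width) measurements for two hypothetical classes "A" and "B".
X = [[1.4, 0.2], [1.5, 0.3], [1.3, 0.2], [4.5, 1.5], [4.7, 1.4], [4.4, 1.3]]
y = ["A", "A", "A", "B", "B", "B"]
model = fit_gaussian_nb(X, y)
print(predict(model, [1.45, 0.25]))  # A
print(predict(model, [4.6, 1.4]))    # B
```

Because each attribute gets its own one-dimensional normal distribution, the implied contours are axes-parallel ellipses; a full Bayes classifier would instead estimate one covariance matrix per class.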
Bayes classifiers have been extended in several ways, for example, by mitigating
the conditional independence assumption (Sect. 8.2.3.3) and by taking misclassifi-
cation costs into account (Sect. 8.2.3.5). In addition, we argue in this section why
naive Bayes classifiers often perform very well despite the strong independence as-
sumptions they make (Sect. 8.2.3.1), study how full Bayes classifiers are related
to linear discriminant analysis (Sect. 8.2.3.2), and consider how missing values are
handled in Bayes classifiers (Sect. 8.2.3.4).
8.2.3.1 Performance
The naive assumption that the descriptive attributes are conditionally independent
given the class attribute is clearly objectionable: it is rarely (exactly) satisfied in
practice. Even in the simple iris classification example studied above it does not
hold. Nevertheless, naive Bayes classifiers usually perform surprisingly well and are
often not much worse than much more sophisticated and complicated classifiers.
The reasons for this behavior are investigated in detail in [7], where it is revealed
that the good performance of naive Bayes classifiers is less surprising than one may
think at first sight. In a nutshell, this behavior is due to the fact that a classifier is
usually evaluated with accuracy or 0–1 loss, that is, it is simply counted how often
it makes a correct prediction (on test data). However, in order to make a correct
prediction, it is not necessary that the class probabilities are predicted with high
accuracy. It suffices that the most probable class receives the highest probability
assignment.
For example, in a classification problem with only two classes, the true proba-
bility of class 1 given some instantiation of the descriptive attributes may be 94%.
If a naive Bayes classifier predicts it to be 51% instead, it wildly misestimates this
probability but still yields the correct classification result.
In a large fraction of all cases in which the conditional independence assump-
tion is not satisfied, the effect of this assumption is mainly the distortion of the
probability distribution, while the most probable class remains the same for most
instantiations. Therefore the classification performance can still be very good, and
this is what one actually observes in practice on many (though not all) data sets.
Note also that, even though it sounds highly plausible, it is not the case that
a (naive) Bayes classifier can only profit from additional attributes. There are cases
where additional attributes actually worsen the prediction performance, even if these
additional attributes carry information about the class. An example is shown in
Fig. 8.6: a naive Bayes classifier that uses only the horizontal axis to discriminate
cases belonging to class • from cases belonging to class ◦ yields a perfect
result (solid line). However, if the vertical axis is also used, the classification
boundary changes because of the diagonal location of the mass of the data points
(dashed line). As a consequence, two instances are misclassified, namely the data
point marked with • at the top and the data point marked with ◦ at the bottom.
Fig. 8.6 The classification performance of a (naive) Bayes classifier need not get better if addi-
tional attributes are used. In this example adding the vertical axis, despite its being informative,
yields a worse classification accuracy than using the horizontal axis alone, which allows for a
perfect classification of the data points
\[ \mathrm{pred}(\mathbf{x}) = \frac{1}{2}\, \mathrm{sgn}\bigl(\mathrm{dec}(\mathbf{x})\bigr) + \frac{1}{2}. \]
In the decision function, μ0 and μ1 are the mean vectors for the two classes that
are estimated in the same way as for a full Bayes classifier. In order to estimate the
covariance matrix, however, the data points from the two classes are pooled:
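\[ \hat{\boldsymbol{\Sigma}} = \frac{1}{n} \sum_{i=1}^{n} \bigl(\mathbf{x}_i[X] - \hat{\boldsymbol{\mu}}_{y_i}\bigr) \bigl(\mathbf{x}_i[X] - \hat{\boldsymbol{\mu}}_{y_i}\bigr)^{\top}. \]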
This pooling of data points is possible because of the assumption that the class-
conditional density functions have the same shape. It has the advantage that the
covariance estimate becomes more robust (since it is computed from more data
points) but also the disadvantage that it misrepresents the density functions if the
assumption of equal shape is not valid.
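The pooling can be sketched as follows: every data point is centered with the mean of its own class, and the outer products of the centered points are averaged over all points. This is an illustrative sketch with made-up data and a hypothetical function name; it normalizes by n, as in the formula above.

```python
def pooled_covariance(X, y):
    """Average outer product of the deviations from the respective class mean."""
    m = len(X[0])
    means = {}
    for c in set(y):
        rows = [x for x, yi in zip(X, y) if yi == c]
        means[c] = [sum(r[j] for r in rows) / len(rows) for j in range(m)]
    cov = [[0.0] * m for _ in range(m)]
    for x, yi in zip(X, y):
        d = [x[j] - means[yi][j] for j in range(m)]  # centered with own class mean
        for j in range(m):
            for k in range(m):
                cov[j][k] += d[j] * d[k] / len(X)
    return cov

# Made-up data: two classes with the same shape but different locations.
X = [[0.0, 0.0], [2.0, 2.0], [10.0, 0.0], [12.0, 2.0]]
y = [0, 0, 1, 1]
print(pooled_covariance(X, y))  # [[1.0, 1.0], [1.0, 1.0]]
```

The distance between the two class means does not enter the estimate, since each point is centered with the mean of its own class.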
Bayes classifiers can be seen as a special case of Bayes networks. Although we cannot
provide a comprehensive treatment of Bayes networks here (an interested reader
is referred to, for instance, [4, 13] for an in-depth treatment), we try to capture some
basics, because otherwise the name of the most common augmented naive Bayes
classifier remains mysterious. The core idea underlying Bayes networks is to ex-
ploit conditional independence statements that hold in a multidimensional proba-
bility distribution, to decompose the distribution function. This already shows the
connection to Bayes classifiers, in which we also factorize the joint probability dis-
tribution of the descriptive attributes based on the assumption that they are condi-
tionally independent given the class.
The other core ingredient of Bayes networks is that, generally, the set of con-
ditional independence statements that hold in a probability distribution has proper-
ties that are highly analogous to those of certain separation statements in graphs or
networks. As a consequence, the idea suggests itself to use graphs or networks to
express these conditional independence statements in a concise and intuitive form.
Although we cannot go into the full formal details here, we would like to mention
that a naive Bayes classifier is a Bayes network with a star-like structure, see Fig. 8.7
on the left: all edges are directed from the class attribute Y at the center to the de-
scriptive attributes X1 , . . . , Xm . This graph structure expresses the conditional inde-
pendence assumption by a graph-theoretic criterion that is known as d-separation:
any path from a descriptive attribute Xi to another attribute Xj , i ≠ j , is blocked by
the class attribute Y if it is known (details can be found, for instance, in [4]).
Drawing on the theory of representing conditional independencies by graphs or
networks, these conditional independence assumptions can be mitigated by adding
edges between descriptive attributes that are not conditionally independent given
the class, see Fig. 8.7 on the right: the direct connections between these attributes
express that even if the class is known, and thus the path through the class attribute
Fig. 8.7 A naive Bayes classifier is a Bayes network with a star-like structure (with the class
attribute at the center). Additional edges can mitigate the conditional independence assumptions
is blocked, the two attributes are still dependent on each other. In this case the clas-
sification formula of a Bayes classifier is generalized with the standard factorization
formula for a Bayes network, namely to
\[ \mathrm{pred}(\mathbf{x}) = \arg\max_{y \in \mathrm{dom}(Y)} P(y) \prod_{X \in \mathbf{X}} P\bigl(x \mid \mathrm{parents}(X)\bigr), \]
where parents(X) denotes the set of parents of the attribute X in the graph structure.
Clearly, with a purely star-like structure (as in Fig. 8.7 on the left), this formula coin-
cides with the standard classification formula of a naive Bayes classifier, because in
a purely star-like structure the class attribute Y is the only parent of any descriptive
attribute. If, however, there are additional edges, a descriptive attribute may have
additional parents (other than the class attribute Y ) on which it depends and which
are needed in order to compute the correct conditional probability.
The most common form of a Bayes classifier that is extended in this way is the
tree-augmented naive Bayes classifier (TAN) [9, 10]. In this classifier the number
of parents of each descriptive attribute is restricted to two: the class attribute Y and
at most one descriptive attribute. Since Bayes networks are required to be acyclic
directed graphs (that is, there must not be a directed path connecting a node to itself),
this constraint allows at most edges forming a tree to be added to the star structure of
a naive Bayes classifier (and hence the name tree-augmented naive Bayes classifier).
The standard way to choose the additional edges is to construct a maximum
weight spanning tree for the descriptive attributes with conditional mutual infor-
mation providing the edge weights. This measure is defined as
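\[ I(X_i, X_j \mid Y) = \sum_{x_i \in \mathrm{dom}(X_i)} \sum_{x_j \in \mathrm{dom}(X_j)} \sum_{y \in \mathrm{dom}(Y)} P(x_i, x_j, y) \times \log_2 \frac{P(x_i, x_j \mid y)}{P(x_i \mid y) \cdot P(x_j \mid y)}. \]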
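The edge weights can be estimated directly from co-occurrence counts. The sketch below uses plain relative frequencies (no Laplace correction) and a hypothetical function name; constructing the maximum weight spanning tree from these weights is not shown.

```python
import math
from collections import Counter

def conditional_mutual_information(xi, xj, y):
    """Estimate the conditional mutual information of two attributes (in bits)."""
    n = len(y)
    n_ijy = Counter(zip(xi, xj, y))
    n_iy = Counter(zip(xi, y))
    n_jy = Counter(zip(xj, y))
    n_y = Counter(y)
    total = 0.0
    for (a, b, c), count in n_ijy.items():
        p_joint = count / n                   # P(x_i, x_j, y)
        p_ij_given_y = count / n_y[c]         # P(x_i, x_j | y)
        p_i_given_y = n_iy[(a, c)] / n_y[c]   # P(x_i | y)
        p_j_given_y = n_jy[(b, c)] / n_y[c]   # P(x_j | y)
        total += p_joint * math.log2(p_ij_given_y / (p_i_given_y * p_j_given_y))
    return total

# Made-up samples with a single class: identical versus independent attributes.
y = [0, 0, 0, 0]
print(conditional_mutual_information([0, 0, 1, 1], [0, 0, 1, 1], y))  # 1.0
print(conditional_mutual_information([0, 0, 1, 1], [0, 1, 0, 1], y))  # 0.0
```

Value combinations that never occur contribute nothing to the sum, so iterating over the observed combinations suffices.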
Handling missing values is particularly easy in naive Bayes classifiers. In the con-
struction step, it is convenient that the (conditional) distributions refer to at most two
attributes, namely the class and at most one conditioned attribute. Therefore data
points that are only partially known (that is, miss some attribute values) can still be
used for the estimation: they are useless only for estimating those distributions that
involve one of the attributes for which the value is missing. For all other attributes,
the data point can be exploited. The only exception is the case for which the class is
unknown, since the class is involved in all relevant (conditional) distributions.
In the execution step—that is, when a naive Bayes classifier is employed to com-
pute a prediction for a new case—the factors of the classification formula, which
refer to attributes the values of which are unknown, are simply dropped. The pre-
diction computed in this way is the same as that of a naive Bayes classifier which
has been constructed on the subspace of those attributes that are known for the sam-
ple case under consideration. Hence a naive Bayes classifier can flexibly adapt to
whatever subset of the descriptive attributes is known.
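Dropping the factors of unknown attributes takes only one extra line in the prediction step. The sketch below assumes categorical attributes and represents a missing value by None; the probability tables are made up, and the function name and data layout are hypothetical.

```python
def predict_with_missing(prior, cond, x):
    """Naive Bayes prediction that skips the factors of unknown attribute values.

    prior: {class: P(class)}
    cond:  {attribute: {(value, class): P(value | class)}}
    x:     {attribute: value or None}
    """
    best, best_score = None, -1.0
    for y, p in prior.items():
        score = p
        for attr, value in x.items():
            if value is None:
                continue  # unknown value: drop this factor
            score *= cond[attr][(value, y)]
        if score > best_score:
            best, best_score = y, score
    return best

# Made-up probability tables for two classes and two attributes.
prior = {"a": 0.5, "b": 0.5}
cond = {
    "color": {("red", "a"): 0.8, ("blue", "a"): 0.2,
              ("red", "b"): 0.3, ("blue", "b"): 0.7},
    "size": {("small", "a"): 0.1, ("large", "a"): 0.9,
             ("small", "b"): 0.6, ("large", "b"): 0.4},
}
print(predict_with_missing(prior, cond, {"color": "red", "size": "small"}))  # b
print(predict_with_missing(prior, cond, {"color": "red", "size": None}))     # a
```

In the second call only the prior and the factor for the color remain, which is the prediction of a classifier built on the color attribute alone.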
The situation is less convenient for a full Bayes classifier: since all descriptive at-
tributes are treated jointly (no conditional independence assumption), a single miss-
ing value can already make it impossible to use the sample case in the estimation
of the class-conditional distributions. With a multivariate normal distribution as-
sumption, however, the case may still be used to estimate the covariances of known
attribute pairs. One merely has to keep track of the different values of n , which
may be different for each element of the covariance matrix. In the execution step,
integrating the class-conditional multivariate normal distributions over unknown at-
tributes provides a means of computing a prediction despite the fact that values are
missing.
Since Bayes classifiers predict a class by estimating the probability of the dif-
ferent possible classes and then selecting the most probable one, misclassification
costs can fairly easily be incorporated into the process. Suppose that we model the
misclassification costs with a cost function c(y), which assigns to each class y the
costs of wrongly classifying an instance of this class as belonging to some other
class y′. Then the classification formula should be modified to
\[ \mathrm{pred}(\mathbf{x}) = \arg\min_{y \in \mathrm{dom}(Y)} c(y) \bigl(1 - P(y \mid \mathbf{x})\bigr). \]
In this way a class with nonmaximal probability may be predicted if the costs for
misclassifying an instance of it as belonging to some other class are high.
A more complicated situation is where we are given misclassification costs in the
form of a matrix C of size |dom(Y )| × |dom(Y )|. Each element cyz of this matrix states
the costs of misclassifying an instance of class y as belonging to class z. Naturally,
the diagonal elements cyy are all zero (as these refer to a correct classification). In
this case the classification formula is modified to
\[ \mathrm{pred}(\mathbf{x}) = \arg\min_{y \in \mathrm{dom}(Y)} \sum_{z \in \mathrm{dom}(Y)} c_{zy}\, P(z \mid \mathbf{x}). \]
With this prediction procedure, the expected costs (under the induced probability
model) that result from possible misclassifications are minimized.
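The cost-matrix rule is easy to check numerically. In the sketch below, cost[z][y] plays the role of the matrix element for misclassifying an instance of class z as class y; the class names, probabilities, and costs are made up, and the function name is hypothetical.

```python
def min_expected_cost_class(posterior, cost):
    """Return the class y that minimizes sum_z cost[z][y] * P(z | x)."""
    def expected_cost(y):
        return sum(cost[z][y] * p for z, p in posterior.items())
    return min(posterior, key=expected_cost)

# Made-up posterior and costs: missing an "ill" case is ten times as expensive.
posterior = {"healthy": 0.8, "ill": 0.2}
cost = {
    "healthy": {"healthy": 0.0, "ill": 1.0},
    "ill": {"healthy": 10.0, "ill": 0.0},
}
print(min_expected_cost_class(posterior, cost))  # ill

# With equal costs the rule falls back to the most probable class.
equal = {
    "healthy": {"healthy": 0.0, "ill": 1.0},
    "ill": {"healthy": 1.0, "ill": 0.0},
}
print(min_expected_cost_class(posterior, equal))  # healthy
```

Predicting "healthy" has an expected cost of 10 · 0.2 = 2, whereas predicting "ill" costs 1 · 0.8 = 0.8, so the less probable class is chosen.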
Note that both this and the previous modified classification formula reduce to the
standard case (that is, predicting the most probable class) where all costs are equal:
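\[ \mathrm{pred}(\mathbf{x}) = \arg\min_{y \in \mathrm{dom}(Y)} \sum_{z \in \mathrm{dom}(Y) - \{y\}} c \cdot P(z \mid \mathbf{x}) = \arg\min_{y \in \mathrm{dom}(Y)} c \cdot \bigl(1 - P(y \mid \mathbf{x})\bigr) = \arg\max_{y \in \mathrm{dom}(Y)} P(y \mid \mathbf{x}). \]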
8.3 Regression
8.3.1 Overview
Up to now we considered classification, that is, the task to predict a class from a
finite set of possible classes. However, in many applications the quantity to predict
is not categorical, but rather metric (numerical), whether it is the price of the shares
of a given company, the electrical power consumption in a given area, the demand
for a given product, etc. In such cases classification techniques may, in principle,
still be applied, but doing so requires us to discretize the quantity to predict into a
finite set of intervals, which are then considered as the classes. Clearly, however, a
better approach is to use a prediction method that can yield a numeric output.
The main problem of such an approach is that (physical) measurement values
rarely show the exact relationship between the considered quantities, because they
are inevitably afflicted by errors. If one wants to determine the relationship between
the considered quantities nevertheless, at least approximately, one faces the task to
find a function that fits the given data as well as possible, so that the measurement er-
rors are “neutralized.” Naturally, for such an approach, one should possess at least a
conjecture about how the target attribute (in statistics also called the response vari-
able) depends on the descriptive attributes (in statistics also called the explanatory
or regressor variables), so that one can choose a (parameterized) function class and
thus can reduce the problem to a parameter estimation task. This choice is a critical
issue: if a chosen function class does not fit the data (for example, if one tries to fit a
linear function to nonlinear data), the result can be completely useless, because the
function cannot, in principle, be made to fit the data.5
Generally, one has to be particularly careful not to choose a function class with
too many degrees of freedom (too many free parameters), as this invites overfitting.
For example, any set of n data points with one explanatory and one response variable
can be fitted perfectly with a polynomial of degree n − 1 (and thus n free parame-
ters). An example with eight data points is shown in Fig. 8.8 (blue curve: polynomial
Fig. 8.8 The function class for regression has to be chosen with care. If a very complex function is
chosen, a perfect fit can be achieved, but this fit does not allow for reliable inter- or extrapolations.
A simpler function, although it fits the training data less well, is usually a much better predictor for
new function values
of degree 7). Clearly, even though the blue curve fits the data points perfectly, it is
completely useless for interpolation or extrapolation purposes: it is not likely that a
data point with a value for the explanatory variable (x-axis) other than those of the
given data points actually has a value for the response variable (y-axis) so that it lies
on the blue curve. This is particularly obvious for points with an x-value beyond 8.
A better fit of this data set can be obtained with a simple straight line (shown in
black). Intuitively, interpolating or extrapolating the values of the response variable
based on this straight line is much more likely to yield a useful result.
In order to deal with the unavoidable measurement errors and to achieve a good
fit to the data within the limitations imposed by the choice of the function class,
we have to choose a cost function that penalizes (at least large) deviations from the
actual values. The most common cost function is the sum of squared errors (SSE),
with which the approach is also known as the method of least squares (OLS for
“ordinary least squares”). It has the advantage that for a large family of parameter-
ized function classes—in particular, any polynomial of the explanatory variables—
the solution (that is, the set of parameters yielding the least sum of squared errors)
can be obtained by taking derivatives and applying simple methods from linear al-
gebra. However, it has the disadvantage that outliers, either in the explanatory or
the response variable, can have a distorting effect due to the fact that the errors
are squared, and thus outliers strongly influence the estimation. As a consequence,
other cost functions—like the sum of absolute errors or functions that limit the
contribution of a single data point to the total cost to a certain maximum—are also
used.
Once the (parameterized) function class and the cost function have been chosen,
the model construction process is straightforward: try to find the set of parameters
that identifies the function from the chosen class that minimizes the costs. Whether
the result is unique or not and whether it can be computed directly or has to be found
by an iterative improvement procedure (like gradient descent) depends mainly on the
cost function used. As we will see below, a direct solution can be obtained for any
polynomial as the function class and the sum of squared errors as the cost function.
8.3.2 Construction
We start our description of the model construction procedure with the simplest case,
finding a linear regression function for a single explanatory variable. In this case
one has to determine the parameters a and b of a straight line y = f (x) = a + bx.
However, due to the unavoidable measurement errors, it will usually not be possible
to find a straight line such that all n data points (xi , yi ), 1 ≤ i ≤ n, lie exactly on this
straight line. Rather we have to find a straight line from which the given data points
deviate as little as possible. Hence it is plausible to determine the parameters a and b
in such a way that the sum of squared deviations, that is,
\[ F(a, b) = \sum_{i=1}^{n} \bigl(f(x_i) - y_i\bigr)^2 = \sum_{i=1}^{n} (a + b x_i - y_i)^2, \]
is minimal. In other words, the y-values that are computed with the linear equation
should (in total) deviate as little as possible from the measured values. The reasons
for choosing squared deviations are twofold: (1) by using squares the error function
becomes continuously differentiable everywhere (whereas the derivative of the sum
of absolute errors is discontinuous/does not exist at zero), and (2) the squares weight
larger deviations higher, so that individual large deviations are avoided.6
A necessary condition for a minimum of the error function F (a, b) is that the
partial derivatives of this function w.r.t. the parameters a and b vanish, that is,
\[ \frac{\partial F}{\partial a} = \sum_{i=1}^{n} 2(a + b x_i - y_i) = 0 \quad\text{and}\quad \frac{\partial F}{\partial b} = \sum_{i=1}^{n} 2(a + b x_i - y_i)\, x_i = 0. \]
As a consequence, we obtain (after a few simple steps) the so-called normal equa-
tions
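\[ n a + \Bigl(\sum_{i=1}^{n} x_i\Bigr) b = \sum_{i=1}^{n} y_i \quad\text{and}\quad \Bigl(\sum_{i=1}^{n} x_i\Bigr) a + \Bigl(\sum_{i=1}^{n} x_i^2\Bigr) b = \sum_{i=1}^{n} x_i y_i, \]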
that is, a linear two-equation system with two unknowns a and b. It can be shown
that this equation system has a unique solution unless all x-values are identical and
that this solution specifies a minimum of the function F . The straight line determined
in this way is called the regression line for the data set (x1 , y1 ), . . . , (xn , yn ).
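Since the normal equations form a 2 × 2 linear system, they can be solved in closed form (for example, with Cramer's rule). The following sketch does this for made-up data points; the function name is hypothetical.

```python
def regression_line(xs, ys):
    """Solve the two normal equations for the intercept a and the slope b."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx  # vanishes if all x-values are identical
    if det == 0:
        raise ValueError("all x-values are identical; no unique solution")
    b = (n * sxy - sx * sy) / det
    a = (sy - b * sx) / n
    return a, b

# Made-up points that lie exactly on the line y = 1 + 2x.
print(regression_line([0, 1, 2, 3], [1, 3, 5, 7]))  # (1.0, 2.0)
```

The determinant check mirrors the uniqueness condition stated above: the system has no unique solution when all x-values coincide.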
Note that finding a regression line can also be seen as a maximum likelihood
estimation (see Sect. A.4.2.3 in the appendix) of the parameters of the linear model
\[ Y = a + bX + \xi, \]
\[ y = \frac{3}{4} + \frac{7}{12}\, x. \]
6 Note, however, that this second property can also be a disadvantage, as it can make outliers have
This line is shown in Fig. 8.8 on page 230 in black. Obviously, it provides a reason-
able fit to the data and may be used to interpolate between or to extrapolate the data
points.
The least squares method is not limited to straight lines but can be extended (at
least) to regression polynomials. In this case one tries to find a polynomial
\[ y = p(x) = a_0 + a_1 x + \cdots + a_m x^m, \]
with a given fixed degree m, which best fits the n data points (x1 , y1 ), . . . , (xn , yn ).
Consequently, we have to minimize the error function
\[ F(a_0, a_1, \ldots, a_m) = \sum_{i=1}^{n} \bigl(p(x_i) - y_i\bigr)^2 = \sum_{i=1}^{n} \bigl(a_0 + a_1 x_i + \cdots + a_m x_i^m - y_i\bigr)^2. \]
In analogy to the linear case, we form the partial derivatives of this function w.r.t. the
parameters ak , 0 ≤ k ≤ m, and equate them to zero (as this is a necessary condition
for a minimum). The resulting system of linear equations (m + 1 unknowns and
m + 1 equations) can be solved with one of the standard methods of linear algebra
(Gaussian elimination, inverting the coefficient matrix, etc.).
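Setting the partial derivative w.r.t. a_j to zero yields one equation per coefficient, in which the coefficient of a_k is the sum of the powers x_i^(j+k), and the right-hand side is the sum of x_i^j · y_i. The sketch below builds this system and solves it with Gaussian elimination. It is an illustration with made-up data and hypothetical function names; a singular system is not handled, and for higher degrees the system tends to become numerically ill-conditioned.

```python
def solve_linear_system(A, b):
    """Gaussian elimination with partial pivoting; returns x with A x = b."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def fit_polynomial(xs, ys, degree):
    """Least squares coefficients a_0, ..., a_m from the normal equations."""
    m = degree
    A = [[sum(x ** (j + k) for x in xs) for k in range(m + 1)]
         for j in range(m + 1)]
    b = [sum((x ** j) * y for x, y in zip(xs, ys)) for j in range(m + 1)]
    return solve_linear_system(A, b)

# Made-up points that lie exactly on y = 1 + 2x + 3x^2.
xs = [-2, -1, 0, 1, 2]
ys = [1 + 2 * x + 3 * x * x for x in xs]
print(fit_polynomial(xs, ys, 2))  # approximately [1.0, 2.0, 3.0]
```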
Note that finding a regression polynomial can also be interpreted as a maximum
likelihood estimation of the parameters (as it was possible for linear regression) if
one assumes a normally distributed error term ξ that is independent of X and Y .
Furthermore, the least squares method can be used for functions with more than
one argument. This case is known as multivariate regression. We consider only
multilinear regression, that is, we are given a data set ((x1 , y1 ), . . . , (xn , yn )) with
input vectors xi and the corresponding responses yi , 1 ≤ i ≤ n, for which we want
to determine the linear regression function
\[ y = f(x_1, \ldots, x_m) = a_0 + \sum_{k=1}^{m} a_k x_k. \]
where
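\[ \mathbf{X} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1m} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nm} \end{pmatrix} \quad\text{and}\quad \mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} \qquad (8.2) \]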
represent the data set, and a = (a0 , a1 , . . . , am ) is the vector of coefficients we have
to determine. (Note that the ones in the matrix X refer to the coefficient a0 .) Again
a necessary condition for a minimum is that the partial derivatives of this function
\[ \mathbf{X}^{\top} \mathbf{X} \mathbf{a} = \mathbf{X}^{\top} \mathbf{y}. \]
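As a plausibility check (a worked special case added here for illustration, not part of the derivation itself), consider m = 1, so that the ith row of the matrix X is (1, x_i) and the coefficient vector is (a_0, a_1). Multiplying out the matrix products recovers exactly the two normal equations of the regression line, with a = a_0 and b = a_1:

```latex
\[
\mathbf{X}^{\top}\mathbf{X} =
\begin{pmatrix} n & \sum_{i=1}^{n} x_i \\[2pt] \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \end{pmatrix},
\qquad
\mathbf{X}^{\top}\mathbf{y} =
\begin{pmatrix} \sum_{i=1}^{n} y_i \\[2pt] \sum_{i=1}^{n} x_i y_i \end{pmatrix},
\]
\[
\mathbf{X}^{\top}\mathbf{X}\,\mathbf{a} = \mathbf{X}^{\top}\mathbf{y}
\quad\Longleftrightarrow\quad
n a_0 + \Bigl(\sum_{i=1}^{n} x_i\Bigr) a_1 = \sum_{i=1}^{n} y_i
\;\;\text{and}\;\;
\Bigl(\sum_{i=1}^{n} x_i\Bigr) a_0 + \Bigl(\sum_{i=1}^{n} x_i^2\Bigr) a_1 = \sum_{i=1}^{n} x_i y_i .
\]
```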
As we have seen above, an analytical solution of the least squares problem can
easily be obtained for polynomials. However, by exploiting transformations (for
example, the logit transformation), the regression approach can also be applied to
other function classes (Sect. 8.3.3.1). As least squares regression is sensitive to
outliers, several more robust methods have been developed (Sect. 8.3.3.3). Finally,
methods to automatically select the function class, at least from a wider family of
functions of different complexity, may be worth considering (Sect. 8.3.3.4).
8.3.3.1 Transformations
In certain special cases the procedure described above can also be used to find other
regression functions. In order for this to be possible, one has to find a suitable trans-
formation which reduces the problem to the problem of finding a regression line or
regression polynomial. For example, regression functions of the form
\[ y = a x^b \]
can be found by finding a regression line. By simply taking the (natural) logarithm
of this equation, we arrive at
ln y = ln a + b · ln x.
This equation can be handled by finding a regression line if we take the (natural)
logarithms of the data points (xi , yi ), 1 ≤ i ≤ n, set a′ = ln a, and then carry out all
computations with these transformed values.
It should be noted, though, that with such an approach, only the sum of squared
errors in the transformed space (coordinates x′ = ln x and y′ = ln y) is minimized,
but not necessarily also the sum of squared errors in the original space (coordi-
nates x and y). Nevertheless, this approach often yields good results. In addition,
one may use the result only as a good starting point for a subsequent gradient descent
(see generally Sect. 5.3, also Sect. 9.2 on training artificial neural networks, and
Sect. 8.3.3.2 below) in the original space, with which the solution can be improved,
and the true minimum in the original space may be obtained.
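The log-log transformation can be sketched as follows: the data points are transformed, a regression line is fitted in the transformed space, and the intercept is transformed back. The data are made up, the function name is hypothetical, and all values must be positive for the logarithms to exist.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**b via a regression line through the points (ln x, ln y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    sx, sy = sum(lx), sum(ly)
    sxx = sum(u * u for u in lx)
    sxy = sum(u * v for u, v in zip(lx, ly))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope in the transformed space
    ln_a = (sy - b * sx) / n                       # intercept, that is, ln a
    return math.exp(ln_a), b

# Made-up points that lie exactly on y = 2 * x**1.5.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x ** 1.5 for x in xs]
a, b = fit_power_law(xs, ys)
print(round(a, 6), round(b, 6))  # 2.0 1.5
```

For noisy data the result minimizes the squared errors of ln y rather than of y, which is the caveat discussed above.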
For practical purposes, it is important that one can transform the logistic func-
tion,
\[ y = \frac{y_{\max}}{1 + e^{a + bx}}, \]
where ymax , a, and b are constants, to the linear or polynomial case (so-called logis-
tic regression). The logistic function is relevant for many applications, because it
describes growth processes with a limit, for example, the growth of an animal pop-
ulation with a habitat of limited size or the sales of a (new) product with a limited
market. In addition, it is popular in artificial neural networks (especially multilayer
perceptrons, see Sect. 9.2), where it is often used as the activation function of the
neurons.
In order to linearize the logistic function, we first take the reciprocal values
\[ \frac{1}{y} = \frac{1 + e^{a + bx}}{y_{\max}}. \]
As a consequence, we have
\[ \frac{y_{\max} - y}{y} = e^{a + bx}. \]
Taking the (natural) logarithm of this equation yields
\[ \ln \frac{y_{\max} - y}{y} = a + bx. \]
By transforming these data points with the logit transformation using ymax = 6 and
finding a regression line for the result, we obtain
z ≈ 4.133 − 1.3775x,
and thus, for the original data, the logistic regression curve is
\[ y \approx \frac{6}{1 + e^{4.133 - 1.3775x}}. \]
These two regression functions are shown together with the (transformed or origi-
nal) data points in Fig. 8.9.
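The same two steps (logit transformation, then a regression line) can be sketched in code. The data below are made up and generated from known parameters, so they are not the data points of the example above; the function name is hypothetical, and all y-values must lie strictly between 0 and y_max.

```python
import math

def fit_logistic(xs, ys, y_max):
    """Fit y = y_max / (1 + exp(a + b*x)) via z = ln((y_max - y) / y) = a + b*x."""
    zs = [math.log((y_max - y) / y) for y in ys]  # logit transformation
    n = len(xs)
    sx, sz = sum(xs), sum(zs)
    sxx = sum(x * x for x in xs)
    sxz = sum(x * z for x, z in zip(xs, zs))
    b = (n * sxz - sx * sz) / (n * sxx - sx * sx)
    a = (sz - b * sx) / n
    return a, b

# Made-up data generated from a = 4, b = -1.5, y_max = 6 (no noise).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [6.0 / (1.0 + math.exp(4.0 - 1.5 * x)) for x in xs]
a, b = fit_logistic(xs, ys, y_max=6.0)
print(round(a, 6), round(b, 6))  # 4.0 -1.5
```

Note that y_max has to be known or guessed beforehand; it is not estimated by this procedure.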
Fig. 8.9 Transformed data (left) and original data (right) as well as the regression line (left) and
the corresponding logistic regression curve (right) computed with the method of least squares
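The linearization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are hypothetical, noise-free values generated from a known logistic curve (they are not the data set of Fig. 8.9), the saturation value y_max is assumed to be known, and the function name is invented for this sketch.

```python
import numpy as np

def fit_logistic_by_linearization(x, y, y_max):
    """Fit y = y_max / (1 + exp(a + b*x)) by least squares in the transformed space."""
    # Transformation z = ln((y_max - y) / y); requires 0 < y < y_max for every point.
    z = np.log((y_max - y) / y)
    # Ordinary regression line z = a + b*x (polyfit returns the slope first).
    b, a = np.polyfit(x, z, deg=1)
    return a, b

# Hypothetical data generated from a known curve with a = 4.0, b = -1.5, y_max = 6.
x = np.linspace(0.0, 6.0, 13)
y = 6.0 / (1.0 + np.exp(4.0 - 1.5 * x))

a, b = fit_logistic_by_linearization(x, y, y_max=6.0)
print(f"a = {a:.4f}, b = {b:.4f}")
```

Because the example data contain no noise, the regression line in the transformed space recovers the generating parameters. With noisy data the result minimizes the squared error only in the transformed space.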
Hence there is room for improvement, which may be achieved by using the solution
that is obtained by solving the system of normal equations in the transformed space
only as an initial point in the parameter space of the regression model. This initial
solution is then iteratively improved by gradient descent (see Sect. 5.3 for a general
treatment), that is, by repeatedly computing the gradient of the objective function
(here, the sum of squared errors) at the current point in the parameter space and then
making a small step in the opposite direction, that is, in the direction of steepest descent.
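A minimal sketch of this refinement for the logistic case follows. It assumes a fixed, known y_max, uses hypothetical noisy data, and adds a simple safeguard that halves the step width whenever a step would increase the error. The safeguard and all function names are choices made for this illustration, not part of the method as described in the text.

```python
import numpy as np

def logistic(x, a, b, y_max):
    return y_max / (1.0 + np.exp(a + b * x))

def sse(x, y, a, b, y_max):
    """Sum of squared errors in the original space."""
    return float(np.sum((logistic(x, a, b, y_max) - y) ** 2))

def refine_by_gradient_descent(x, y, a, b, y_max, step=0.01, iterations=500):
    for _ in range(iterations):
        f = logistic(x, a, b, y_max)
        df_da = -f * (1.0 - f / y_max)              # derivative of f with respect to a
        grad_a = 2.0 * np.sum((f - y) * df_da)
        grad_b = 2.0 * np.sum((f - y) * df_da * x)  # df/db = x * df/da
        # Step against the gradient.
        new_a, new_b = a - step * grad_a, b - step * grad_b
        if sse(x, y, new_a, new_b, y_max) <= sse(x, y, a, b, y_max):
            a, b = new_a, new_b
        else:
            step *= 0.5                             # step was too large: shrink it
    return a, b

# Hypothetical noisy data around a logistic curve with y_max = 6.
rng = np.random.default_rng(0)
x = np.linspace(1.0, 5.0, 17)
y = logistic(x, 4.0, -1.4, 6.0) + rng.normal(0.0, 0.05, x.size)

# Initial point: least squares solution in the transformed space.
b0, a0 = np.polyfit(x, np.log((6.0 - y) / y), 1)
a1, b1 = refine_by_gradient_descent(x, y, a0, b0, 6.0)
print(sse(x, y, a0, b0, 6.0), sse(x, y, a1, b1, 6.0))
```

By construction the refined parameters never have a larger error in the original space than the initial point.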
If the functional relationship is logistic (as in the example in the previous sec-
tion), this procedure is actually equivalent to training a multilayer perceptron with-
out a hidden layer (so actually a two-layer perceptron, with only an input layer and
one output neuron) with standard error backpropagation, provided, of course, that
the activation function of the output neuron is logistic. Details for this special case
can be found in Sect. 9.2.
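Method              ρ(e)
Least squares       \( e^2 \)
Huber               \( \begin{cases} \frac{1}{2} e^2 & \text{if } |e| \le k, \\ k|e| - \frac{1}{2} k^2 & \text{if } |e| > k. \end{cases} \)
Tukey's bisquare    \( \begin{cases} \frac{k^2}{6}\left(1 - \left(1 - \left(\frac{e}{k}\right)^2\right)^3\right) & \text{if } |e| \le k, \\ \frac{k^2}{6} & \text{if } |e| > k. \end{cases} \)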
where \( \rho(e_i) = e_i^2 \), and \( e_i \) is the (signed) error of the regression function at the ith
point. Is this the only reasonable choice for the function ρ? The answer is definitely
no. However, ρ should satisfy at least some reasonable restrictions. ρ should always
be positive, except for the case \( e_i = 0 \), for which we should have \( \rho(e_i) = 0 \). The sign of
the error \( e_i \) should not matter for ρ, and ρ should be increasing when the absolute
value of the error increases. These requirements can be formalized in the following
way:
ρ(e) ≥ 0, (8.4)
ρ(0) = 0, (8.5)
ρ(e) = ρ(−e), (8.6)
ρ(ei ) ≥ ρ(ej ) if |ei | ≥ |ej |. (8.7)
\[ \sum_{i=1}^{n} \psi\bigl(\mathbf{x}_i^\top \mathbf{a} - y_i\bigr)\, \mathbf{x}_i^\top = 0. \tag{8.8} \]
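Defining w(e) = ψ(e)/e and \( w_i = w(e_i) \), Equation (8.8) can be rewritten in the
form
\[ \sum_{i=1}^{n} \frac{\psi\bigl(\mathbf{x}_i^\top \mathbf{a} - y_i\bigr)}{e_i} \cdot e_i \cdot \mathbf{x}_i^\top = \sum_{i=1}^{n} w_i \cdot \bigl(\mathbf{x}_i^\top \mathbf{a} - y_i\bigr) \cdot \mathbf{x}_i^\top = 0. \tag{8.9} \]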
\[ \sum_{i=1}^{n} w_i e_i^2. \tag{8.10} \]
However, the weights wi depend on the residuals ei , the residuals depend on the
coefficients ai , and the coefficients depend on the weights. Therefore, it is in general
not possible to provide an explicit solution to the system of equations. Instead, the
following iteration scheme is applied.
1. Choose an initial solution \( \mathbf{a}^{(0)} \), for instance, the standard least squares solution,
setting all weights to \( w_i = 1 \).
2. In each iteration step t, calculate the residuals \( e^{(t-1)} \) and the corresponding
weights \( w^{(t-1)} = w(e^{(t-1)}) \) determined by the previous step.
3. Compute the solution of the weighted least squares problem \( \sum_{i=1}^{n} w_i e_i^2 \), which
leads to
\[ \mathbf{a}^{(t)} = \bigl(\mathbf{X}^\top \mathbf{W}^{(t-1)} \mathbf{X}\bigr)^{-1} \mathbf{X}^\top \mathbf{W}^{(t-1)} \mathbf{y}, \tag{8.11} \]
where W stands for the diagonal matrix with weights \( w_i \) on the diagonal.
Table 8.4 lists the formulae for the weights in the regression scheme based on the
error measures listed in Table 8.3.
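The iteration scheme can be sketched as follows. The two weight functions are obtained from w(e) = ψ(e)/e with ψ taken as the derivative of the corresponding error measure ρ. This derivation, the fixed tuning constant k, the fixed number of iterations, and the toy data with a single outlier are assumptions of this sketch rather than prescriptions from the text.

```python
import numpy as np

def huber_weight(e, k):
    # w = 1 for |e| <= k and k/|e| otherwise (written without a division by zero).
    return k / np.maximum(np.abs(e), k)

def bisquare_weight(e, k):
    # w = (1 - (e/k)^2)^2 for |e| <= k and 0 otherwise.
    return np.where(np.abs(e) <= k, (1.0 - (e / k) ** 2) ** 2, 0.0)

def irls(X, y, weight, k, iterations=20):
    """Iteratively reweighted least squares following steps 1 to 3 above."""
    a = np.linalg.lstsq(X, y, rcond=None)[0]   # step 1: ordinary least squares
    w = np.ones(len(y))
    for _ in range(iterations):
        e = X @ a - y                           # step 2: residuals and weights
        w = weight(e, k)
        XtW = X.T * w                           # X^T W for a diagonal W
        a = np.linalg.solve(XtW @ X, XtW @ y)   # step 3: weighted least squares
    return a, w

# Hypothetical data: nine points on the line y = 1 + 2x and one gross outlier.
x = np.arange(10.0)
y = 1.0 + 2.0 * x
y[-1] -= 30.0
X = np.column_stack([np.ones_like(x), x])

a_ols = np.linalg.lstsq(X, y, rcond=None)[0]
a_rob, w = irls(X, y, bisquare_weight, k=4.0)
print("least squares:", a_ols, "bisquare:", a_rob, "outlier weight:", w[-1])
```

In this constructed example the bisquare weights drive the influence of the outlier to zero, so the robust fit returns to the line through the remaining points. In practice k is usually chosen relative to an estimate of the residual scale.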
Figure 8.10 shows the graph of the error measure ρ and the weighting function
for the standard least squares approach. The error measure ρ increases in a quadratic
manner with increasing distance. The weights are always constant. This means that
extreme outliers will have full influence on the regression coefficients and can cor-
rupt the result completely.
In the more robust approach by Huber, the error measure ρ switches from a quadratic
increase for small errors to a linear increase for larger errors. As a result, only
data points with small errors have full influence on the regression coefficients.
For extreme outliers, the weights tend to zero. This is illustrated by the
corresponding graphs in Fig. 8.11.
Fig. 8.10 The error measure ρ and the weight w for the standard least squares approach
Fig. 8.11 The error measure ρ and the weight w for Huber’s approach
Fig. 8.12 The error measure ρ and the weight w for the bisquare approach
Tukey’s bisquare approach is even more drastic than Huber’s approach. For larger
errors, the error measure ρ does not increase at all but remains constant. As a con-
sequence, the weights for outliers drop to zero when they are too far away from the
regression curve. This means that extreme outliers have no influence on the regres-
sion curve at all. The corresponding graphs for the error measure and the weights
are shown in Fig. 8.12.
To illustrate how robust regression works, consider the simple regression prob-
lem in Fig. 8.13. There is one outlier that leads to the red regression line that neither
fits the outlier nor the other points. With robust regression, for instance, based on
Tukey’s ρ-function, we obtain the blue regression line that simply ignores the out-
lier.
An additional result of robust regression is the set of computed weights for the data
points. The weights for the regression problem in Fig. 8.13 are plotted in Fig. 8.14.
All weights, except one, have a value close to 1. The right-most weight with the
value close to 0 is the weight for the outlier. In this way, outliers can be identified
by robust regression. This applies also to the case of multivariate regression where
we cannot simply plot the regression function as in Fig. 8.13. But the weights can
still be computed and plotted. We can also take a closer look at the data points with
low weights for the regression function. They might be exceptional cases or even
erroneous measurements.
In their basic form, regression approaches require a conjecture about the form of the
functional relationship between the involved variables, so that the task only consists
Finally, let a data set D = {(x1 , y1 ), . . . , (xn , yn )} be given, the elements of which
are assigned to one of the classes c1 or c2 (that is, yi = c1 or yi = c2 for i =
1, . . . , n).
We desire to find a reasonably simple description of the function p(x), the
parameters of which have to be estimated from the data set D. A common approach is
to model p(x) as a logistic function, that is, as
\[ p(\mathbf{x}) = \frac{1}{1 + e^{a_0 + \mathbf{a}^\top \mathbf{x}}} = \frac{1}{1 + \exp\bigl(a_0 + \sum_{i=1}^{r} a_i x_i\bigr)}. \]
that is, a multilinear regression problem, which can easily be solved with the tech-
niques introduced above.
What remains to be clarified is how we determine the values p(x) that enter the
above equation. If the data space is small enough, so that there are sufficiently many
realizations for every possible point (that is, for every possible instantiation of the
random variables X1 , . . . , Xm ), we may estimate the class probabilities simply as
the relative frequencies of the classes (see Sect. A.4.2 in the appendix).
If this is not the case, we may rely on an approach known as kernel estimation
in order to determine the class probabilities at the data points. The basic idea of
such an estimation is to define a kernel function K which describes how strongly
a data point influences the estimation of the probability (density) at a neighboring
point (see Sect. 9.1.3.1 for a related approach in connection with k-nearest-neighbor
classifiers and Sect. 9.3 on support vector machines). The most common choice is a
Gaussian function, that is,
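\[ K(\mathbf{x}, \mathbf{y}) = \frac{1}{(2\pi\sigma^2)^{m/2}} \exp\left(-\frac{(\mathbf{x} - \mathbf{y})^\top (\mathbf{x} - \mathbf{y})}{2\sigma^2}\right), \]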
where the variance σ 2 has to be chosen by a user. With the help of this kernel
function the probability density at a point x is estimated from an (unclassified) data
set D = {x1 , . . . , xn } as
\[ \hat{f}(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} K(\mathbf{x}, \mathbf{x}_i). \]
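If we deal with a two-class problem, one estimates the class probabilities by relating
the probability density resulting from data points of one of the classes to the total
probability density. That is, we estimate
\[ \hat{p}(\mathbf{x}) = \frac{\sum_{i=1}^{n} c(\mathbf{x}_i)\, K(\mathbf{x}, \mathbf{x}_i)}{\sum_{i=1}^{n} K(\mathbf{x}, \mathbf{x}_i)}, \]
where
\[ c(\mathbf{x}_i) = \begin{cases} 1 & \text{if } \mathbf{x}_i \text{ belongs to class } c_1, \\ 0 & \text{if } \mathbf{x}_i \text{ belongs to class } c_2. \end{cases} \]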
Solving the resulting regression problem yields a (multidimensional) logistic func-
tion, which describes the probability of one of the two classes for the points of the
data space. For this function, one has to choose a threshold value: if the function
value exceeds the threshold, the class the function refers to is predicted; otherwise
the other class is predicted. Note that with such a threshold a linear separation of the
input space is described (see also Sect. 9.3).
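The whole procedure (kernel estimation of the class probabilities, transformation, multilinear regression, and thresholding) can be sketched as follows. The one-dimensional toy data, the kernel width σ = 1, the clipping constant that keeps the transformation finite, and the function names are assumptions made for this illustration. The sign convention follows the logistic function as written above, so the transformed value is ln((1 − p)/p).

```python
import numpy as np

def gaussian_kernel(x, data, sigma):
    """K(x, x_i) for one point x (shape (m,)) and all data points (shape (n, m))."""
    m = x.shape[-1]
    d = x - data
    return np.exp(-np.sum(d * d, axis=-1) / (2 * sigma**2)) / (2 * np.pi * sigma**2) ** (m / 2)

def class_probability(x, data, c, sigma):
    """Kernel estimate of the probability of class c1 at the point x."""
    k = gaussian_kernel(x, data, sigma)
    return np.sum(c * k) / np.sum(k)

def fit_logistic_classifier(data, c, sigma, eps=1e-6):
    p = np.array([class_probability(x, data, c, sigma) for x in data])
    p = np.clip(p, eps, 1 - eps)                    # keep the transformation finite
    z = np.log((1 - p) / p)                         # z = a0 + a^T x
    X = np.column_stack([np.ones(len(data)), data])
    coef = np.linalg.lstsq(X, z, rcond=None)[0]     # multilinear regression
    return coef[0], coef[1:]

def predict_c1(x, a0, a, threshold=0.5):
    p = 1.0 / (1.0 + np.exp(a0 + a @ x))
    return p > threshold

# Hypothetical data: class c2 (indicator 0) on the left, class c1 (indicator 1) on the right.
data = np.array([[0.0], [0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5], [5.0]])
c = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

a0, a = fit_logistic_classifier(data, c, sigma=1.0)
print(predict_c1(np.array([5.0]), a0, a), predict_c1(np.array([0.0]), a0, a))
```

With a threshold of 0.5 the decision boundary is the set of points where a0 + aᵀx = 0, which is a hyperplane (here a single point on the line).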
If this method is used in finance to assess the credit-worthiness of a customer,
one of the classes means that the loan applied for is granted, while the other means
that the application is declined. In practice, several threshold values are often
chosen, which refer to different loan conditions (interest rate, liquidation rate,
required securities, loan duration, etc.).
The last type of explanation-based methods that we are discussing in this chapter are
rule learning methods. In contrast to association rule learners as discussed before,
we are now concentrating on algorithms that generate sets of rules which explain
all of the training data. Hence we are interested in a global rule system vs. associ-
ation rules that are a collection of local models. These local models can be highly
redundant (and they usually are), and we expect a global rule system to explain the
data reasonably free of overlaps. Rule systems are one of the easiest ways (if not
the easiest way) to express knowledge in a human-readable form. By the end of this
section we hope to have given a better understanding of what these types of methods
can achieve and why they are still relatively unknown to data analysis practitioners.
Rule learning algorithms can be roughly divided into two types: those that learn
simple “if this fact is true, then that fact holds”-type rules and those that learn
more complex rules which allow one to include variables, something along the lines
of “if x has wings then x is a bird”.7 Rules of the first type are called
propositional rules and will be discussed first. Afterwards we discuss first-order
rules, or the field of inductive logic programming, as this area of rule learning is
often also called.
7 Note that we are not saying much about the truthfulness or precision of rules at this stage.
8.4 Rule learning 245
Propositional rules are rules consisting of atomic facts and combinations of those
using logical operators. In contrast to first-order logic (see next section), no variables
are allowed to be part of those rules. A simple propositional classification rule could
look like this:
IF x1 ≤ 10 AND x3 = red THEN class A.
Note that we have an antecedent part of the rule (to the left of THEN) indicating
the conditions to be fulfilled in order for the consequent part (to the right) to be
true. As with typical implications, we do not know anything about the truth value
of the consequent if the antecedent does not hold. The atomic facts of propositional
rules more commonly occurring in data analysis tasks are constraints on individual
attributes:
• constraints on numerical attributes, such as greater/smaller (-or-equal) than a con-
stant, equality (usually for integer values) or containment in given intervals;
• constraints on nominal attributes, such as checks for equality or containment in a
set of possible values;
• constraints on ordinal attributes, which add range checks to the list of possible
constraints on nominal attributes.
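As an illustration, the constraint types listed above can be represented as small predicate functions, and a propositional rule as a conjunction of such predicates together with a class label. The names and the dictionary-based instance format are hypothetical choices for this sketch.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Instance = Dict[str, Any]
Constraint = Callable[[Instance], bool]

def at_most(attr: str, bound: float) -> Constraint:
    """Numerical constraint: attribute value <= bound."""
    return lambda inst: inst[attr] <= bound

def in_interval(attr: str, low: float, high: float) -> Constraint:
    """Numerical (or ordinal range) constraint: low <= attribute value <= high."""
    return lambda inst: low <= inst[attr] <= high

def one_of(attr: str, values: set) -> Constraint:
    """Nominal constraint: attribute value contained in a set of allowed values."""
    return lambda inst: inst[attr] in values

@dataclass
class Rule:
    antecedent: List[Constraint]   # all constraints must hold (conjunction)
    consequent: str                # predicted class

    def matches(self, instance: Instance) -> bool:
        return all(constraint(instance) for constraint in self.antecedent)

# IF x1 <= 10 AND x3 = red THEN class A
rule = Rule([at_most("x1", 10), one_of("x3", {"red"})], "A")
print(rule.matches({"x1": 7, "x3": "red"}), rule.matches({"x1": 12, "x3": "red"}))
```

If the antecedent does not match, the rule simply makes no statement about the instance, in line with the usual semantics of an implication.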
The rule above shows examples for a numerical constraint and a constraint on a
nominal attribute. How can we now find such rules given a set of training instances?
One very straightforward way to find propositional rules was already presented
earlier in this chapter: we can simply train a decision tree and extract the rules from
the resulting representation. The tree shown in Fig. 8.2 can also be interpreted as
four (mutually exclusive—we will get back to this in a second) rules:
Those rules are simply generated by traversing the tree from each leaf up to the root
and collecting the conditions along the way. Note that in rule Rc we collapsed the
two tests on temperature into one. Since the conditions of a rule are combined
conjunctively and conjunction is commutative, this does not alter the antecedent.
Rules extracted from decision trees have two interesting properties:
• mutual exclusivity: this means that a pattern will be explained by one rule only.
This stems from the origin of the rules: they were generated from a hierarchical
structure where each branch partitions the feature space into two disjoint parti-
tions.
• unordered: the rules are extracted in arbitrary order from the tree, that is, no rule
has preference over any other one. This is not a problem since only one (and
exactly one!) rule will apply to a training pattern. However, as we will see later,
rules can also overlap, and then a conflict avoidance strategy may be needed: one
such strategy requires rules to be ordered, and the first one that matches is the one
creating the response.
Creating rules from decision trees is straightforward and uses well-established
and efficient training algorithms. However, we inherit all of the disadvantages of
decision tree learning algorithms (most notably with respect to their notorious in-
stability!), and the rules can also be quite redundant. We can sometimes avoid this
redundancy by transforming the rule set into a set of ordered rules:
Now we really need to apply these rules in the correct order, but then they still
represent exactly the same function as the original decision tree and the unordered
rule set shown before. This type of conversion only works because we have an un-
balanced tree—no branch going to the left actually carries a subtree. So we could
simply assign labels for those branches and carry the rest forward to the next rule
describing the branch to the right. In general, simplifying rule sets extracted from
decision trees is considerably more complex.
(or decrease) the set of training patterns that fall within the influence of that rule’s
constraint.
The first type of rule learners generally starts with a set of extremely small special
rules. In the extreme, these will be rules centered on one or more training examples
having constraints that limit numerical values to the exact value of that instance and
a precise match of the nominal value. So, for a training instance (v, k) with
There are two typical generalization operators that can now be applied iteratively.
If we start with one rule, we will attempt to make this rule cover more training
instances by finding one (usually the closest) example that can be included in this
rule without including any other example. A second training instance (v2 , k) with
Alternatively, we can also combine two rules into a new one, taking care that
we do not accidentally include a third rule of a different class. The first approach
(extending rules by covering one additional example) can be seen as a special case
of the second approach (merging two rules) since we can always model a single
training example as a (very) specific rule.
More generally, we can regard these training algorithms as a heuristic greedy
search which iteratively tries to make rules more general by either merging two rules
of the original set or by enlarging one rule to cover an additional training example.8
The biggest question is then which two rules (or one rule and training example) to
pick in the next step and how to do the generalization in order to merge the rules
(or cover the additional example). As with all greedy algorithms, no matter how
these two questions are answered, the resulting algorithm will usually not return
the optimal set of rules, but in most cases the result will be relatively close to the
optimum.
Specializing Rule Learners operate exactly the opposite way. They start with very
general rules, in the extreme with one rule of the following form:
8 Note that this is a substantial deviation from the abstract concepts of rule learners in Mitchell’s
version space setup: real-world rule learners usually do not investigate all more general (or more
specific) rules but only a subset of those chosen by the employed heuristic(s).
1 R = ∅
2 D_rest = D
3 while (Performance(R, D_rest) < p_min)
4     r = FindOneGoodRule(D_rest)
5     R = R ∪ {r}
6     D_rest = D_rest − covered(r, D_rest)
7 endwhile
8 return R
for each class k. Then they iteratively attempt to avoid misclassifications by special-
izing these rules, i.e., adding new or narrowing existing constraints.
We have now, of course, not yet really talked about learning more than one rule;
all we know is how to generate one rule for a data set. However, it is quite optimistic
to assume that one simple rule will be sufficient to explain a complex real-world data
set. Most rule learning algorithms hence wrap this “one rule learning” approach into
an outer loop, which tries to construct an entire set of rules. This outer loop often
employs a set covering strategy, also known as sequential covering and generically
looks as shown in Table 8.5.
The initialization (steps 1 and 2) creates an empty set of rules R and sets the
training instances which are still to be explained to the entire data set. The while
loop runs until the performance of the rule set R reaches a certain threshold p_min.
In each iteration it creates a new rule r, adds it to the rule base R, and removes the
instances it covers from the “still to be explained” instances. Once the threshold is
reached, the resulting rule set is returned.
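The outer loop can be sketched in Python as follows. Two ingredients are stand-ins invented for this illustration and are not taken from the text: the performance is measured as the fraction of training instances already covered, and the inner routine simply picks the single attribute-value test with the highest purity (ties broken by coverage) on the remaining data.

```python
from collections import Counter

def find_one_good_rule(d_rest):
    """Hypothetical stand-in: best single test 'attribute == value' on the remaining data."""
    best_score, best_rule = None, None
    tests = {(attr, value) for inst, _ in d_rest for attr, value in inst.items()}
    for attr, value in sorted(tests):
        labels = [label for inst, label in d_rest if inst[attr] == value]
        majority, hits = Counter(labels).most_common(1)[0]
        score = (hits / len(labels), len(labels))      # (purity, coverage)
        if best_score is None or score > best_score:
            best_score, best_rule = score, (attr, value, majority)
    return best_rule

def covers(rule, inst):
    attr, value, _ = rule
    return inst[attr] == value

def sequential_covering(d, p_min=1.0):
    rules, d_rest = [], list(d)                        # steps 1 and 2
    while 1.0 - len(d_rest) / len(d) < p_min:          # step 3
        r = find_one_good_rule(d_rest)                 # step 4
        rules.append(r)                                # step 5
        d_rest = [(inst, label) for inst, label in d_rest if not covers(r, inst)]  # step 6
    return rules                                       # step 8

# Hypothetical toy data set with two nominal attributes.
d = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "no"}, "play"),
]
rules = sequential_covering(d)
for attr, value, label in rules:
    print(f"IF {attr} = {value} THEN {label}")
```

Because each rule is learned only on the instances left over by its predecessors, the resulting list has to be read as an ordered rule set in which the first matching rule determines the class.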
The biggest variations in existing implementations of this base algorithm are the
chosen error measure to measure the performance of a given rule (or sets of rules)
and the strategy to find one good rule which optimizes this performance criterion.
One of the earlier and still very prominent methods is called CN2 [6] and uses a
simple specializing rule searching heuristic as shown in Table 8.6.
This routine essentially performs a search over hypotheses, starting with a general
one (lines 1 and 2) and iteratively specializing the candidates (line 4). During each
iteration, the best hypothesis so far is remembered (line 5). There are two heuristics
involved, controlling the specialization (line 4) and the termination criterion; the
latter is applied in line 6, where all newly generated hypotheses which do not fulfill
a validity criterion are eliminated. For this, CN2 uses a significance test on the data
set. The specialization in line 4 returns only consistent, maximally specific
hypotheses. In line 8, finally, a rule is assembled, assigning to the chosen antecedent
the majority class of the patterns covered by the best hypothesis.
CN2 therefore not only returns 100% correct hypotheses but also rules which do
make some errors on the training data; the amount of this error is controlled by the
statistical significance test in the update routine used in line 6. The CN2 algorithm
essentially performs a beam search with variable beam width (controlled by the
significance test) and evaluates all hypotheses generated within the beam based on
the performance criterion also used in the main algorithm.
1 h_best = true
2 H_candidates = {h_best}
3 while H_candidates ≠ ∅
4     H_candidates = specialize(H_candidates)
5     h_best = arg max_{h ∈ H_candidates ∪ {h_best}} {Performance(h, D_rest)}
6     update(H_candidates)
7 endwhile
8 return IF h_best THEN arg max_k {|covered_k(h_best, D_rest)|}
We have not yet specified which attribute types we are dealing with; the “specialize”
routine just assumes that we know how to specialize (or generalize in other
types of algorithms) our rules. For nominal attributes it is pretty straightforward:
specialization removes individual values from the set of allowed ones, and general-
ization adds some. In the extreme we have either only one value left (leaving none
does not make much sense as it results in a rule that is never valid) or all possible
values, resulting in a constraint that is always true. But what about numerical val-
ues? The most specific case is easy as it translates an equality to one exact numerical
value. Also the most generic case is simple: it results in an interval from minus to
plus infinity. But also generalizing a numerical constraint so that it contains a value
it previously did not contain is simple: we only have to expand the interval so that
it uses this new value either as new lower or upper bound, depending on which side
of the interval the value lay. Specializing is a bit more complicated as we have
two options. In order to move a specific value out of a given numerical interval,
we can either move the lower or upper bound to be just above (or below) the given
value. This is one point where heuristics start playing a role. However, the much
bigger impact of heuristics happens during multidimensional specialization opera-
tions. Figure 8.15 illustrates this for the case of two numerical attributes. On the
left we show how we can generalize a given rule (solid line) to cover an additional
training instance (cross). The new rule (dashed lines) is given without any ambigu-
ity. On the right we show a given rule (solid line) and four possibilities to avoid a
given conflict (cross). Note that the different rectangles are not drawn right on top of
each other as they should be but slightly apart for better visibility. It is obvious that
we have a number of choices and this number will not decrease with an increase in
dimensionality.
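The difference between the unambiguous generalization and the ambiguous specialization of numerical constraints can be made concrete with a small sketch. It assumes closed intervals, a rule given as one interval per attribute, and a conflicting instance that lies inside the rule. The use of math.nextafter (available from Python 3.9) to place a bound "just above" or "just below" a value is one possible choice made for this illustration.

```python
import math

def generalize(interval, value):
    """Expand a closed interval just enough to contain the value."""
    low, high = interval
    return (min(low, value), max(high, value))

def specialize(interval, value):
    """Return the (up to two) ways to exclude a value from a closed interval."""
    low, high = interval
    if not low <= value <= high:
        return [interval]                        # nothing to exclude
    options = [
        (math.nextafter(value, math.inf), high),   # move the lower bound just above the value
        (low, math.nextafter(value, -math.inf)),   # move the upper bound just below the value
    ]
    return [(lo, hi) for lo, hi in options if lo <= hi]

def specialize_rule(rule, conflict):
    """All single-attribute specializations of a rule that exclude the conflict."""
    return [
        {**rule, attr: narrowed}
        for attr, interval in rule.items()
        for narrowed in specialize(interval, conflict[attr])
    ]

rule = {"x1": (0.0, 10.0), "x2": (0.0, 5.0)}
conflict = {"x1": 4.0, "x2": 2.0}

bigger = {"x1": generalize(rule["x1"], 12.0), "x2": generalize(rule["x2"], 3.0)}
options = specialize_rule(rule, conflict)
print(bigger)
print(len(options), "possible specializations")
```

For two numerical attributes this yields four candidate rules, two per attribute, so some heuristic is needed to pick one of them.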
However, there is one big problem with all of the current propositional rule
induction approaches following the above schema: for real-world data sets, they tend
to generate an enormous number of rules. Each one of these rules is, of course,
interpretable and hence fulfills the requirements of the methods described in this
chapter. However, the overall set of rules is often far too large to be even remotely
comprehensible. Some approaches have been proposed to learn hierarchical rule models
or find rules with outlier models, but those have not yet gained great prominence,
in part also because they rely on a set of heuristics which are hard to control. In the
following section we discuss a few approaches that try to address this problem.
In order to handle the larger number of rules, various approaches have been pro-
posed. Some simply prune rules by their importance, which is usually measured by
the number of patterns they cover. Others include the minimum description length
principle to balance the complexity (e.g., length) of a rule against the amount of
data it covers. Similarly to decision tree learning, we can of course also employ
pruning strategies here to collapse several rules into one or to eliminate them
completely. Other approaches attempt to further reduce the number of rules by looking
at their discriminative power between classes or other measures of interestingness.
The most prominent such measure is the J-measure [11], which essentially estimates
how dissimilar the a priori and a posteriori beliefs about the rule’s consequent
are. A rule is potentially interesting only if these two probabilities differ
substantially. In the J-measure this difference is additionally weighted by the
generality of the rule (the probability that the rule’s conditional part holds),
because a rule becomes more interesting the more often it applies. See Chap. 5 for
further details on some of these techniques.
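In the form in which it is commonly stated, the J-measure of a rule IF X = x THEN Y = y multiplies the probability of the antecedent with the relative entropy between the a posteriori and the a priori distribution of the consequent. The following sketch computes it from three probabilities; the formula is quoted here from the general literature, not from the text above, and the function and parameter names are hypothetical.

```python
import math

def j_measure(p_x, p_y, p_y_given_x):
    """J-measure of a rule 'IF x THEN y' (in bits).

    p_x          probability that the antecedent holds (generality of the rule)
    p_y          a priori probability of the consequent
    p_y_given_x  a posteriori probability of the consequent given the antecedent
    """
    def term(posterior, prior):
        # Convention: 0 * log(0 / prior) = 0.
        return 0.0 if posterior == 0.0 else posterior * math.log2(posterior / prior)

    return p_x * (term(p_y_given_x, p_y) + term(1.0 - p_y_given_x, 1.0 - p_y))

print(j_measure(0.5, 0.3, 0.3))   # antecedent tells nothing about the consequent
print(j_measure(0.5, 0.5, 1.0))   # consequent becomes certain
```

A rule whose antecedent does not change the belief about the consequent gets the value 0, and for a fixed change in belief the value grows with the probability that the antecedent holds.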
Note that in the toy setup, as used in the Version Spaces, another very elegant
way to dramatically reduce the number of matching rules was presented: by only
reporting the most general and most specific rules which cover the training data, the
entire part of the lattice in between these two boundaries is reported as well. Un-
fortunately this cannot as easily be applied to rules stemming from real-world data
since coverage will hardly ever be exactly 100%. However, the setup described in
Sect. 7.6.3.3, where we describe a propositional rule learning system finding
association rules, uses a related strategy. Here, too, it is hard to dig through the
resulting list of association rules. However, looking at closed or maximum itemsets
(and the resulting association rules) drastically reduces the set of rules to consider. So we,
in effect, report only the most specific rules describing a local aspect of the training
data.
One of the reasons why propositional rule learners tend to produce excessive
numbers of rules, especially in numerical feature spaces, is the sharp boundaries
they attempt to impose. If training instances of different classes are not separable
by axis-parallel lines in some reasonably large local area, the rule learning
algorithms are forced to introduce many small rules to model these relationships.
Often the precise nature of these boundaries is not interesting or, worse, caused only
by noisy data in the first place. To be able to incorporate this imprecision into the
learning process, a large number of fuzzy rule learners have been introduced. They all
use the notion of fuzzy sets, which allows one to model degrees of membership
(to rules in this case) and hence makes it possible to model gray zones where
areas of different classes overlap. Fuzzy rule systems have the interesting side
effect that a fuzzy rule system can be expressed as a system of differentiable
equations if the norm and membership functions are chosen accordingly. Then other
adaptation methods can be applied, often motivated by work in the neural network
community. The resulting Neuro-Fuzzy systems allow one, for instance, to use expert
knowledge to initialize a set of fuzzy rules and then employ a gradient descent
training algorithm to update those rules to better fit training data. In addition,
Takagi–Sugeno–Kang-type fuzzy rules also allow one to build regression systems based
on fuzzy antecedents and (local) regression consequents. We refer to [3] for more
details on fuzzy rules and the corresponding training methods and [16] for details on
Neuro-Fuzzy systems.
Propositional rules are quite limited in their expressive power. If we want to express
a rule of the form
using propositional rules, we would need to enumerate all possible values for x and
y and put those pairs into individual rules. In order to express such types of rules,
we need to introduce variables. First-order rules allow one to do just that and are
based on only a few base constructs:
• constants, such as Bob, Luise, red, green,
• variables, such as x and y in the example above,
• predicates, such as is_father(x, y), which produce truth values as a result, and
• functions, such as age(x), which produce constants.
From this we can construct
• terms which are constants, variables, or functions applied to a term,
• literals, which are predicates (or negations of predicates) applied to any set of
terms,
• ground literals, which are literals that do not contain a variable,
• clauses, which are disjunctions of literals whose variables are universally quanti-
fied, and
• Horn clauses, which are clauses with at most one positive literal.
The latter are especially interesting because any disjunction with at most one
positive literal can be written as
\[ \begin{aligned} H \lor \lnot L_1 \lor \cdots \lor \lnot L_n \;&\mathrel{\hat{=}}\; H \Leftarrow (L_1 \land \cdots \land L_n) \\ &\mathrel{\hat{=}}\; \text{IF } L_1 \text{ AND } \ldots \text{ AND } L_n \text{ THEN } H. \end{aligned} \]
So Horn clauses express rules: H, the head of the rule, is the consequent, and the
\( L_i \) together form the body or antecedent.
Horn clauses are also used to express Prolog programs, which is also the reason
why learning first-order rules is often referred to as inductive logic programming
(ILP), because we can see this as learning (Prolog) programs, not only sets of rules.
A substitution denotes any replacement (called a binding) of the variables in a
literal with appropriate constants. A rule body is satisfied if at least one
binding exists that satisfies the literals. A few example rules are:
• simple rules:
IF x is Parent of y AND y is male THEN x is Father of y
• existentially quantified variables (z in this case):
IF y is Parent of z AND z is Parent of x THEN x is Granddaughter of y
• recursive rules:
IF x is Parent of z AND z is Ancestor of y THEN x is Ancestor of y
IF x is Parent of y THEN x is Ancestor of y
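How such rules are applied through variable bindings can be illustrated with the two recursive ancestor rules. The sketch below repeatedly searches for bindings of x, z, and y that satisfy a rule body and adds the corresponding head as a new fact until nothing changes. The parent facts and names are invented for this example, and the naive fixed-point iteration is only one simple way to evaluate such rules.

```python
def ancestors(parent_facts):
    """Derive all Ancestor(x, y) facts from Parent(x, y) facts."""
    # IF x is Parent of y THEN x is Ancestor of y
    ancestor = set(parent_facts)
    changed = True
    while changed:
        changed = False
        # IF x is Parent of z AND z is Ancestor of y THEN x is Ancestor of y
        for x, z in parent_facts:
            for z2, y in list(ancestor):
                if z == z2 and (x, y) not in ancestor:   # binding for x, z, y found
                    ancestor.add((x, y))
                    changed = True
    return ancestor

# Hypothetical ground facts Parent(x, y).
parent = {("Ann", "Bob"), ("Bob", "Cid"), ("Cid", "Dee")}
derived = ancestors(parent)
print(sorted(derived))
```

Expressing the same knowledge with propositional rules would require one rule per pair of constants.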
Inductive Logic Programming allows one to learn concepts also from data spread
out over several multirelation databases—without the need for previous integration
as discussed in Sect. 6.5. But how do we learn these types of rules from data? A
number of learning methods have been proposed recently, and we will explain one of
the earlier algorithms in more detail in the following section.
FOIL, the First Order Inductive Learning method developed by Quinlan [19],
operates very similarly to the rule learners we have discussed in the previous section.
It also follows the strategy of a sequential covering approach, the only difference
being the inner FindOneGoodRule() routine. The rules it creates are not quite Horn
clauses but something very similar, with two differences:
• the rules learned by FOIL are more restrictive in that they do not allow literals to
contain function symbols (this reduces the hypothesis space dramatically!), and
• FOIL rules are more expressive than Horn clauses because they allow literals in
the body to be negated.
The FindOneGoodRule() routine of FOIL again specializes candidate rules
greedily. It does so by adding literals one by one where these new literals can be
one of the following:
• P(v1, . . . , vr), where P is any predicate name occurring in the set of available
predicates. At least one of the variables must already be present in the original
rule, the others can be either new or existing;
8.5 Finding Explanations in Practice 253
• Equal(x, y), where x and y are variables already present in the rule; or
• the negation of either of the above forms of literals.
How does FOIL now pick the “best” specialization to continue? For this, Quinlan
introduced a measure (FoilGain) which essentially estimates the utility of the newly
added literal by comparing the number of positive and negative bindings of the
original and the extended rules.
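In the form commonly given in the literature, FoilGain compares the information needed to signal that a binding is positive before and after adding the literal, weighted by the number of positive bindings that remain covered. The following sketch uses that form; the formula is quoted from the general literature rather than from the text above, and the parameter names are hypothetical.

```python
import math

def foil_gain(p0, n0, p1, n1, t):
    """FoilGain of extending a rule by one literal.

    p0, n0  positive and negative bindings of the original rule
    p1, n1  positive and negative bindings of the extended rule (p1 > 0)
    t       positive bindings of the original rule still covered by the extended rule
    """
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Hypothetical counts: the new literal removes all negative bindings
# while keeping three of the four positive ones.
gain = foil_gain(p0=4, n0=4, p1=3, n1=0, t=3)
print(gain)
```

A literal that removes negative bindings while keeping many positive ones receives a high value, whereas a literal that leaves the ratio of positive to negative bindings unchanged receives a gain of zero.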
This is, of course, a system that does not scale well with large, real-world data
sets. ILP systems are in general not heavily used for real data analysis. They are
more interesting for concept learning in data sets with purely nominal values and/or
structured data sets. However, some of the ideas are used in other applications such
as molecular fragment mining.
Finding explanations is, similar to finding patterns, a central piece of any analytics
software. The main difference is the way the extracted model is presented to the
user: earlier tools produced long lists of rules or other ASCII representations, but
nowadays tools increasingly offer interactive views which allow one to select in-
dividual parts of an explanation, e.g., a leaf in a decision tree, and propagate this
selection to other views on the underlying data—see Sect. 4.8 for a more detailed
discussion of this type of visual brushing. KNIME is, of course, inherently better
suited for this type of visual exploration, but R also offers quite a number of
graphical representations of the discovered explanations.
When constructing decision trees in data mining software, a number of options are
available. KNIME allows one to adjust the split quality measure and a number
of other options. Figure 8.16 (left) shows the dialog of the native KNIME decision
tree learner. First and foremost, we need to select the target attribute (class; it has
to be a nominal attribute, a string in this case). Afterwards, we can choose between
two different split quality measures (Gini index and gain ratio) and decide
whether the tree is to be pruned (KNIME offers to either skip pruning
or perform minimum description length (MDL)-based pruning). Most of the remaining
options are self-explanatory or related to the hiliting support of KNIME
(number of records to be stored for the interaction). Noteworthy is the last option,
“number threads,” which allows one to control how many threads KNIME can use
to execute the learning method in parallel on, e.g., a multicore machine. Once the
node is run, we can display the resulting decision tree. Figure 8.16 (right) shows
the resulting tree for the training part of the iris data, which, hopefully, offers
interesting insights into the structure of the data. The KNIME view shows the color
Fig. 8.16 The KNIME dialog of the decision tree construction node (left) along with the tree view
(right) after running on the training part of the iris data
distribution (if available) for each node of the tree, along with the split attribute and
the majority class. The numbers in brackets show the number of training patterns
classified correctly together with the overall number of training patterns falling into
this branch. In addition, the fractions of patterns falling into the individual splits are
displayed by the vertical orange bar charts. The Weka decision tree learning node works
similarly. Although Weka’s J4.8 (an implementation that closely follows revision 8 of
Quinlan’s C4.5 algorithm) offers more options to fine-tune
the training algorithm, most of those options will hardly be used in practice. After
installing the Weka integration, all Weka learning and clustering algorithms are available
as individual nodes in KNIME as well. The Weka views are accessible, too.
Figure 8.17 shows the view of the Weka decision tree node. The Weka
integration also offers access to a regression tree implementation, which is not yet
available natively in KNIME as of version 2.1.
Many other explanation finding methods are available in KNIME, among them,
Naive Bayes, as described in Sect. 8.2. The corresponding node in KNIME produces
R also allows for much finer control of the decision tree construction. The script
below demonstrates how to create a simple tree for the Iris data set using a training
set of 75 randomly selected records. Then the tree is displayed, and a confusion matrix
for the test set (the remaining 75 records of the Iris data set) is printed. The libraries
rpart, which comes with the standard installation of R, and rattle, which needs to
be installed, are required:
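> library(rpart)
> # randomly pick 75 of the 150 row indices as the training set
> iris.train <- sample(1:150,75)
> # grow the tree on the training rows only
> iris.dtree <- rpart(Species~.,data=iris,
                      subset=iris.train)
> library(rattle)
> drawTreeNodes(iris.dtree)
> # confusion matrix on the test rows: predicted class (rows)
> # versus true class (columns)
> table(predict(iris.dtree,iris[-iris.train,],
                type="class"),
        iris[-iris.train,"Species"])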
In addition to many options related to tree construction, R also offers many ways
to beautify the graphical representation. We refer to R manuals for more details.
Naive Bayes classifiers use normal distributions by default for numerical attributes.
The package e1071 must be installed first:
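> library(e1071)
> # draw a new random training set of 75 row indices
> iris.train <- sample(1:150,75)
> # estimate class priors and per-class normal distributions
> iris.nbayes <- naiveBayes(Species~.,data=iris,
                            subset=iris.train)
> # confusion matrix on the test rows
> table(predict(iris.nbayes,iris[-iris.train,],
                type="class"),
        iris[-iris.train,"Species"])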
As in the example of the decision tree, the Iris data set is split into a training and
a test data set, and the confusion matrix is printed. The parameters for the normal
distributions of the classes can be obtained in the following way:
> print(iris.nbayes)
8.5.2.3 Regression
The summary provides the necessary information about the regression result,
including the coefficients of the regression function.
If we want to use a polynomial as the regression function, we need to protect the
evaluation of the corresponding power by the function I inhibiting interpretation.
As an example, we compute a regression function to predict the petal width based
on a quadratic function in the petal length:
Robust regression requires the library MASS, which needs installation. Otherwise
it is handled in the same way as least squares regression, using the function rlm
instead of lm:
The default method is based on Huber’s error function. If Tukey’s biweight should
be used, the parameter method should be changed in the following way:
> plot(iris.rlm$w)
Explanation finding methods are almost always the centerpiece of data mining
books, so any book on this topic will likely cover most, if not all, of what we discussed
in this chapter. For decision trees, there are two main directions:
• Classification and Regression Trees (CART), which are covered in much more
detail in the following book: Hastie, T., Tibshirani, R. and Friedman, J.H.: The
Elements of Statistical Learning (Springer, 2001).
• Quinlan’s C4.5 algorithm and subsequent improvements, with the original
textbook describing C4.5: Quinlan, J.R.: C4.5: Programs for Machine Learning
(Morgan Kaufmann, 1993); its predecessor ID3 is described in Quinlan, J.R.:
Induction of Decision Trees (Mach. Learn., 1986).
For Bayes and Regression, any more statistically oriented book offers more back-
ground; these (and much more) with a more statistical view are also described in the
book by Hastie, Tibshirani, and Friedman.
Tom Mitchell’s excellent Machine Learning book (McGraw Hill, 1997) intro-
duces the concept of version space learning and also describes decision tree induc-
tion, among others. In order to get a better feeling for how learning systems work
in general, this is highly recommended reading, especially the first half. For fuzzy
rule learning systems, we recommend the chapter on “Fuzzy Logic” in Intelligent
Data Analysis: An Introduction, M.R. Berthold and D.J. Hand (Eds), published by
Springer Verlag (2003) and the book on Foundations of Neuro-Fuzzy Systems by
D. Nauck, F. Klawonn, and R. Kruse (Wiley, 1997). For inductive logic program-
ming, Peter Flach and Nada Lavrac contributed a nice chapter in the Intelligent Data
Analysis book edited by Berthold and Hand.
References
1. Albert, A.: Regression and the Moore–Penrose Pseudoinverse. Academic Press, New York
(1972)
2. Anderson, E.: The irises of the Gaspe Peninsula. Bull. Am. Iris Soc. 59, 2–5 (1935)
3. Berthold, M.R.: Fuzzy logic. In: Berthold, M.R., Hand, D.J. (eds.) Intelligent Data Analysis:
An Introduction, 2nd edn. Springer, Berlin (2003)
4. Borgelt, C., Steinbrecher, M., Kruse, R.: Graphical Models—Representations for Learning,
Reasoning and Data Mining, 2nd edn. Wiley, Chichester (2009)
5. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: CART: Classification and Regression
Trees. Wadsworth, Belmont (1983)
6. Clark, P., Niblett, T.: The CN2 induction algorithm. Mach. Learn. 3(4), 261–283 (1989)
7. Domingos, P., Pazzani, M.: On the optimality of the simple Bayesian classifier under zero-one
loss. Mach. Learn. 29, 103–137 (1997)
8. Fisher, R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugen. 7(2),
179–188 (1936)
9. Friedman, N., Goldszmidt, M.: Building classifiers using Bayesian networks. In: Proc. 13th
Nat. Conf. on Artificial Intelligence (AAAI’96, Portland, OR, USA), pp. 1277–1284. AAAI
Press, Menlo Park (1996)
10. Geiger, D.: An entropy-based learning algorithm of Bayesian conditional trees. In: Proc. 8th
Conf. on Uncertainty in Artificial Intelligence (UAI’92, Stanford, CA, USA), pp. 92–97. Mor-
gan Kaufmann, San Mateo (1992)
11. Goodman, R.M., Smyth, P.: An information-theoretic model for rule-based expert systems. In:
Int. Symposium in Information Theory. Kobe, Japan (1988)
12. Janikow, C.Z.: Fuzzy decision trees: issues and methods. IEEE Trans. Syst. Man, Cybern.,
Part B 28(1), 1–14 (1998)
13. Jensen, F.V., Nielsen, T.D.: Bayesian Networks and Decision Graphs, 2nd edn. Springer, Lon-
don (2007)
14. Larrañaga, P., Poza, M., Yurramendi, Y., Murga, R., Kuijpers, C.: Structural learning of
Bayesian networks by genetic algorithms: a performance analysis of control parameters. IEEE
Trans. Pattern Anal. Mach. Intell. 18, 912–926 (1996)
15. Mitchell, T.: Machine Learning. McGraw-Hill, New York (1997)
16. Nauck, D., Klawonn, F., Kruse, R.: Neuro-Fuzzy Systems. Wiley, Chichester (1997)
17. Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986)
18. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo (1993)
19. Quinlan, J.R., Cameron-Jones, R.M.: FOIL: a midterm report. In: Proc. European Conference
on Machine Learning. Lecture Notes in Computer Science, vol. 667, pp. 3–20. Springer, Berlin
(1993)
20. Sahami, M.: Learning limited dependence Bayesian classifiers. In: Proc. 2nd Int. Conf. on
Knowledge Discovery and Data Mining (KDD’96, Portland, OR, USA), pp. 335–338. AAAI
Press, Menlo Park (1996)
Chapter 9
Finding Predictors
In this chapter we consider methods of constructing predictors for class labels or nu-
meric target attributes. However, in contrast to Chap. 8, where we discussed meth-
ods for basically the same purpose, the methods in this chapter yield models that
do not help much to explain the data or even dispense with models altogether. Nev-
ertheless, they can be useful, namely if the main goal is good prediction accuracy
rather than an intuitive and interpretable model. Especially artificial neural networks
and support vector machines, which we study in Sects. 9.2 and 9.3, are known to
outperform other methods w.r.t. accuracy in many tasks. However, due to the ab-
stract mathematical structure of the prediction procedure, which is usually difficult
to map to the application domain, the models they yield are basically “black boxes”
and almost impossible to interpret in terms of the application domain. Hence they
should be considered only if a comprehensible model that can easily be checked for
plausibility is not required, and high accuracy is the main concern.
Artificial Neural Networks Among methods that try to endow machines with
learning ability, artificial neural networks are among the oldest and most intensely
studied approaches. They take their inspiration from biological neural networks and
try to mimic the processes that make animals and human beings able to learn and
adapt to new situations. However, the model of biological processes that is used is very
coarse, and several improvements to the basic approach have even abandoned the
biological analogy. The most common form of artificial neural networks, multilayer
perceptrons, can be described as a staged or hierarchical logistic regression (see
Sect. 8.3), which is trained with a gradient descent scheme, because an analytical
solution is no longer possible due to the staged/hierarchical structure. The advan-
tage of artificial neural networks is that they are very flexible and thus often achieve
very good accuracy. However, the resulting models, due to their involved mathemat-
ical structure (complex prediction function), are basically impossible to interpret
in terms of the application domain. In addition, choosing an appropriate network
structure and conducting the training in such a way that overfitting is avoided can
be tricky.
Ensemble Methods If we are sick and seek help from a physician but have some
doubts about the diagnosis we are given, it is standard practice to seek a second or
even a third opinion. Ensemble methods follow the same principle: for a difficult
prediction task, do not rely on a single classifier or numeric predictor but generate
several different predictors and aggregate their individual predictions. Provided that
the predictors are sufficiently different (that is, exploit different properties of the data
to avoid what is expressed in the saying, usually ascribed to Ludwig Wittgenstein:
“If I do not believe the news in today’s paper, I buy 100 copies of the paper. Then I
believe.”), this can often lead to a considerably improved prediction accuracy. The
main tasks to consider when constructing such ensemble predictors are how to select
or generate a set of predictors that exhibits sufficient variation to offer good chances
of improving the prediction quality and how to combine the individual predictions.
9.1 Nearest-Neighbor Predictors
The nearest-neighbor algorithm [13] is one of the simplest and most natural classi-
fication and numeric prediction methods: it derives the class labels or the (numeric)
target values of new input objects from the most similar training examples, where
similarity is measured by distance in the feature space. The prediction is computed
by a majority vote of the nearest neighbors or by averaging their (numeric) target
values. The number k of neighbors to be taken into account is a parameter of the
algorithm, the best choice of which depends on the data and the prediction task.
9.1.1 Overview
Fig. 9.1 Illustrations of nearest neighbor classification (left) and numeric prediction (right)
metric space, has been derived from them. In such a Voronoi diagram each training
example s is a Voronoi site, which defines a Voronoi cell consisting of all points
that are closer to s than to any other site. The shown line segments comprise those
points that are equidistant to their two nearest sites, the nodes (i.e., points where line
segments meet) are the points that are equidistant to three (or more) sites. A nearest-
neighbor classifier transfers the class of the site s of a Voronoi cell to all points in
the cell of s, illustrated here by two different colorings (shades of grey) of the cells,
which indicates two classes and thus the regions where these classes are predicted.
For points on the line segments, if they border Voronoi cells with different classes,
some (arbitrary) tie-breaking rule has to be applied.
In the right diagram, the horizontal axis represents the input space, and the ver-
tical axis the output (or target) space of a regression task (i.e., a numeric prediction
task). The training examples are again drawn as small dots. The (now metric) target
value is transferred from a training example s to all query values that are closer to
(the input value of) s than to (the input value of) any other training example. In
effect, this yields a piecewise constant regression function as the predictor.
Note, however, that neither the Voronoi tessellation nor the piecewise constant function
is actually computed in the learning process, and thus no model is built at training
time. The prediction is determined only in response to a query for the class
or target value of a new input object, namely by finding the closest neighbor of
the query object and then transferring its class or target value. Hence the diagrams
should be seen as illustrations that summarize the results of all possible queries
within the range of the diagrams rather than depictions of learned models.
A straightforward generalization of the nearest-neighbor approach is to use not
just the one closest, but the k nearest neighbors (usually abbreviated as k-NN). If
the task is classification, the prediction is then determined by a majority vote among
these k neighbors (breaking ties arbitrarily); if the task is numeric prediction, the
average of the target values of these k neighbors is computed. Not surprisingly,
using more than one neighbor improves the robustness of the algorithm, since it is
not so easily fooled by individual training instances that are labeled incorrectly or
outliers for a class (that is, data points that have an unusual location for the class
assigned to them).1 However, using too many neighbors can reduce the capability
of the algorithm as it may smooth the classification boundaries or the interpolation
too much to yield good results. As a consequence, apart from the core choice of
the distance function that determines which training examples are the nearest, the
choice of the number of neighbors to consider is crucial.
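The majority-vote and averaging rules described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: Euclidean distance, a linear scan over all training examples, and made-up toy data. The function names (k_nearest, knn_classify, knn_regress) are hypothetical and not taken from any library.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the k training examples (x, target) closest to the query point.
    Assumption: Euclidean distance; ties are broken by the order of the list."""
    return sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]

def knn_classify(train, query, k):
    """Majority vote among the class labels of the k nearest neighbors."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k):
    """Average of the numeric target values of the k nearest neighbors."""
    neighbors = k_nearest(train, query, k)
    return sum(y for _, y in neighbors) / len(neighbors)

# Toy data, invented for illustration. The last labeled example lies inside
# the "a" cluster but carries the label "b" (an incorrectly labeled instance).
labeled = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.3), "a"),
           ((3.0, 3.0), "b"), ((3.2, 2.9), "b"), ((1.1, 1.05), "b")]
numeric = [((0.0,), 1.0), ((1.0,), 2.0), ((2.0,), 4.0), ((3.0,), 8.0)]

print(knn_classify(labeled, (1.1, 1.0), k=1))   # follows the mislabeled neighbor
print(knn_classify(labeled, (1.1, 1.0), k=3))   # the two correct neighbors outvote it
print(knn_regress(numeric, (1.4,), k=2))        # mean of the targets at x=1 and x=2
```

With k = 1 the query is classified by the single mislabeled example next to it, whereas with k = 3 that example is outvoted, which is the robustness effect described in the text.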
Once multiple neighbors are considered, further extensions become possible (see
the next sections for details). For example, the (relative) influence of a neighbor may
be made dependent on its distance from the query point, and the prediction may be
computed from a local model that is constructed on the fly for a given query point
(i.e., from its nearest neighbors) rather than by a simple majority or averaging rule.
1 Outliers for the complete data set, on the other hand, do not affect nearest-neighbor predictors
much, because they can only change the prediction for data points that should not occur or should
occur only very rarely (provided that the rest of the data is representative).
9.1.2 Construction
2 The fold sizes may differ by one data point, to account for the fact that the total number of training
Fig. 9.2 Illustrations of standard 3-nearest neighbor prediction (average of three nearest neigh-
bors, left) and distance-weighted 2-nearest neighbor prediction (right) in one dimension
The most straightforward choices for the prediction function are, as already men-
tioned, a (weighted) majority vote for classification or a simple average for numeric
prediction. However, especially for numeric prediction, one may also consider more
complex prediction functions, like building a local regression model from the neigh-
bors (usually with a low-degree polynomial), thus arriving at locally weighted poly-
nomial regression (see Sect. 9.1.3 for some details).
A core issue of implementing nearest-neighbor prediction is the data structure
used to store the training examples. In a naive implementation they are simply stored
as a list, which requires merely O(n) time, where n is the number of training exam-
ples. However, though fast at training time, this approach has the serious drawback
of being very slow at execution time, because a linear traversal of all training ex-
amples is needed to find the nearest neighbor(s), requiring O(nm) time, where m
is the dimensionality of the data. As a consequence, this approach quickly becomes
infeasible with a growing number of training examples or for high-dimensional data.
Better approaches rely on data structures like a kd-tree (short for k-dimensional
tree3) [4, 20], an R- or R*-tree [3, 24], a UB-tree [33], etc. With such data structures,
the query time can be reduced to O(log n) per query data point, at least on average
and for data of low dimensionality. The time to store
the training examples (that is, the time to construct an efficient access structure for
them) is, of course, longer than for storing them in a simple list. However, the additional
time is usually acceptable. For example, a kd-tree is constructed by iterative bisections
in different dimensions that split the set of data points (roughly) equally. As a consequence,
constructing it from n training examples takes O(n log n) time if a linear
time algorithm for finding the median in a dimension [5, 14] is employed.
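The construction by repeated median bisection, together with a nearest-neighbor query, might look as follows in Python. This is a simplified sketch, not an optimized implementation: nodes are plain dictionaries, the split dimension simply cycles with the depth, and the median is found by sorting (an assumption made for brevity, which gives O(n log² n) construction time instead of the O(n log n) achievable with a linear-time median algorithm). The names build_kdtree and nearest are hypothetical.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively bisect the points at the median of one coordinate per level."""
    if not points:
        return None
    axis = depth % len(points[0])                    # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])   # simplification: sort to find the median
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, query, best=None):
    """Return (distance, point) for the stored point closest to the query."""
    if node is None:
        return best
    d = math.dist(node["point"], query)
    if best is None or d < best[0]:
        best = (d, node["point"])
    # signed distance of the query from the splitting hyperplane of this node
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)     # descend first into the query's own half
    if abs(diff) < best[0]:               # the other half may still hold a closer point
        best = nearest(far, query, best)
    return best

# Toy data, invented for illustration.
points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
print(nearest(tree, (9, 2)))   # closest stored point and its distance
```

The pruning test in `nearest` is what saves work: a subtree is visited only if the splitting hyperplane is closer to the query than the best neighbor found so far.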
The basic k-nearest-neighbor scheme can be varied and extended in several ways.
By using kernel functions to weight the neighbors (Sect. 9.1.3.1), all data points
or a variable number of neighbors, depending on the query point, may be used.
The simple averaging of the target values for numeric prediction may be replaced
by building a (simple) local regression model (Sect. 9.1.3.2). Feature weights may
be used to adapt the employed distance function to the needs of the prediction
task (Sect. 9.1.3.3). In order to reduce the effort of extracting the nearest neighbors
from the training data, one may try to form prototypes in a preprocessing step
(Sect. 9.1.3.4).
3 Note that this k is independent of and not to be confused with the k denoting the number of
neighbors. This equivocation is an unfortunate accident, which, however, cannot be avoided with-
out deviating from standard nomenclature.
weighted with a kernel function K that is defined on its distance d to the query point
and that satisfies the following properties: (1) K(d) ≥ 0, (2) K(0) = 1 (or at least
that K has its mode at 0), and (3) K(d) decreases monotonically for d → ∞. In this
case all training examples for which the kernel function yields a nonvanishing value
w.r.t. a given query point are used for the prediction. Since the density of training
examples may, of course, differ for different regions of the feature space, this may
lead to a different number of neighbors being considered, depending on the query
point. If the kernel function has an infinite support (that is, does not vanish for any
finite argument value), all data points are considered for any query point. By using
such a kernel function, we try to mitigate the problem of choosing a good value for
the number k of neighbors, which is now taken care of by the fact that instances
that are farther away have a smaller influence on the prediction result. On the other
hand, we now face the problem of having to decide how quickly the influence of a
data point should decline with increasing distance, which is analogous to choosing
the right number of neighbors and can be equally difficult to solve.
Examples of kernel functions with a finite support, given as a radius σ around
the query point within which training examples are considered, are
$$K_{\mathrm{rect}}(d) = \tau(d \le \sigma),$$
$$K_{\mathrm{triangle}}(d) = \tau(d \le \sigma) \cdot (1 - d/\sigma),$$
$$K_{\mathrm{tricubic}}(d) = \tau(d \le \sigma) \cdot \left(1 - d^3/\sigma^3\right)^3,$$
where $\tau(\phi)$ is 1 if $\phi$ is true and 0 otherwise. A typical kernel function with infinite
support is the Gaussian function
$$K_{\mathrm{gauss}}(d) = \exp\left(-\frac{d^2}{2\sigma^2}\right),$$
where d is the distance of the training example to the query point, and $\sigma^2$ is a
parameter that determines the spread of the Gaussian function. The advantage of a
kernel with infinite support is that the prediction function is smooth (has no jumps)
if the kernel is smooth, because then a training case does not suddenly enter the
prediction if a query point is moved by an infinitesimal amount, but its influence
rises smoothly in line with the kernel function. One also does not have to choose a
number of neighbors. However, the disadvantage is, as already pointed out, that one
has to choose an appropriate radius σ for the kernel function, which can be more
difficult to choose than an appropriate number of neighbors.
An example of kernel regression with a Gaussian kernel function is shown in
Fig. 9.3 on the left. Note that the regression function is smooth, because the kernel
function is smooth and always refers to all data points as neighbors, so that no jumps
occur due to a change in the set of nearest neighbors. The price one has to pay for
this advantage is an increased computational cost, since the kernel function has to
be evaluated for all data points, not only for the nearest neighbors.
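To make the kernel functions concrete, the following Python sketch implements the four kernels given above and uses them for a one-dimensional numeric prediction. It assumes that the prediction is the kernel-weighted average of the target values (a common choice, stated here as an assumption), and the toy data and the function names are invented for illustration.

```python
import math

def k_rect(d, sigma):
    """Rectangular kernel: weight 1 inside the radius sigma, 0 outside."""
    return 1.0 if d <= sigma else 0.0

def k_triangle(d, sigma):
    """Triangular kernel: linear decline to 0 at the radius sigma."""
    return 1.0 - d / sigma if d <= sigma else 0.0

def k_tricubic(d, sigma):
    """Tricubic kernel: (1 - d^3/sigma^3)^3 inside the radius sigma."""
    return (1.0 - d**3 / sigma**3) ** 3 if d <= sigma else 0.0

def k_gauss(d, sigma):
    """Gaussian kernel with infinite support."""
    return math.exp(-d**2 / (2.0 * sigma**2))

def kernel_regress(train, query, kernel, sigma):
    """Kernel-weighted average of the targets (assumed prediction rule)."""
    weights = [kernel(abs(x - query), sigma) for x, _ in train]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no training example lies within the kernel support")
    return sum(w * y for w, (_, y) in zip(weights, train)) / total

# Toy data, invented for illustration: (input value, target value).
train = [(0.0, 1.0), (1.0, 2.0), (2.0, 4.0), (3.0, 8.0)]
for kernel in (k_rect, k_triangle, k_tricubic, k_gauss):
    print(kernel.__name__, kernel_regress(train, 1.0, kernel, sigma=1.0))
```

Note that the finite-support kernels ignore every example farther away than sigma, while the Gaussian kernel evaluates all training examples for every query, which is the extra computational cost mentioned above.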
of the training examples that yields the best prediction quality (on a given test data
set): finding best subsets is a standard combinatorial optimization problem.
9.2 Artificial Neural Networks
9.2.1 Overview
At least many higher organisms exhibit the ability to adapt to new situations and
to learn from experience. Since these capabilities clearly derive from the fact that
these organisms are endowed with a brain or at least a central nervous system, it is
a plausible approach to try to achieve similar capabilities in an artificial system by
mimicking the functionality of (biological) neurons and their interaction.
Depending on which aspects of (theories of) biological neurons are emphasized
and which plausible modifications are introduced, different types of artificial neu-
ral networks can be distinguished. The most popular type is the multilayer per-
ceptron, which is based on a “threshold logic” view of neuronal activity, but can
also be seen—from a statistical perspective—as a staged or hierarchical logistic
regression system. The basic idea underlying multilayer perceptrons relies on the
following operation principle of biological neurons: a neuron receives, through “ex-
tension lines” of other neurons, electrical input, which either increases (excitatory
input) or decreases (inhibitory input) the electrical potential of the neural cell rel-
ative to its environment. If the total input exceeds a certain threshold, the neuron
is activated and “fires,” that is, emits an electrical signal to other neurons it is connected
to. The type (excitatory or inhibitory) and the strength of the influence of an
input are determined by the chemical conditions at the connection point (the so-called
synapse) between the transmission lines (biologically: axons) and the neuron’s
receptors (biologically: dendrites). By modifying these chemical conditions, the behavior
of the network of neurons can be changed, and thus adaptation and learning
can be achieved.
By heavily simplifying the actually fairly complex mechanisms, this operation
principle gives rise to a neuron model as it is displayed in Fig. 9.4 on the left: the
influence of the incoming signals from other neurons is encoded as simple weights
(real-valued numbers), which are multiplied with (the strength of) the incoming
Fig. 9.4 A prototypical artificial neuron with connections to its predecessor neurons, including a
dummy neuron v0 (left) and general structure of a multilayer perceptron with r layers (right)
signal (also a real-valued number). The weighted signals (or signal strengths) are
then summed and submitted to an activation function, which is a kind of threshold
function4 and which determines whether the neuron fires (emits a signal) or not.
This signal is then transmitted to other neurons.
A standard connection structure for the neurons is a layered pattern, as it is de-
picted schematically in Fig. 9.4 on the right. This connection structure and the fact
that a single neuron in this structure is called a perceptron (for historical reasons
[26, 27]) explains the name “multilayer perceptron.” The first (leftmost) layer is
called the input layer, because it receives input (represented by the xi ) from the
environment, while the last (rightmost) layer is called the output layer, because it
emits output (represented by the yi ) to the environment. All intermediate layers are
called hidden layers, because they do not interact with the environment and thus
are “hidden” from it. In principle, a multilayer perceptron may have any number of
hidden layers, but it is most common to use only a single hidden layer, based on
certain theoretical results about the capabilities of such neural networks [16].
A multilayer perceptron works by executing the neurons in the layers from left
to right, computing the output of each neuron based on its weighted inputs and the
activation function. This procedure is called forward propagation and gives rise
to the general name feed-forward network for a neural network that operates in
this manner. Note that in this fashion the neural network implements, in a staged or
layered form, the computation of a (possibly fairly complex) function of the input
signals, which is emitted by the output layer at the end of the propagation process.
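Forward propagation can be illustrated with a short Python sketch. It assumes logistic activation functions in all non-input layers and represents each layer as a list of weight rows, where the first weight of each row belongs to the dummy neuron that emits a permanent signal of 1. The weights below are set by hand (not trained) so that the network computes the exclusive or of two binary inputs; they and the function names are purely illustrative.

```python
import math

def logistic(z):
    """Logistic (sigmoid) activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(layers, inputs):
    """Propagate the inputs through the layers from left to right.
    layers: one weight matrix per non-input layer; each row holds the weights
    of one neuron, with the weight for the dummy neuron (bias) first."""
    activations = list(inputs)
    for weights in layers:
        extended = [1.0] + activations      # prepend the dummy neuron's constant 1
        activations = [logistic(sum(w * a for w, a in zip(row, extended)))
                       for row in weights]
    return activations

# Hand-set weights (an assumption for illustration, not a training result):
# hidden neuron 1 acts like OR, hidden neuron 2 like AND,
# and the output neuron fires for "OR but not AND", i.e., XOR.
xor_net = [
    [[-5.0, 10.0, 10.0],      # hidden neuron 1
     [-15.0, 10.0, 10.0]],    # hidden neuron 2
    [[-5.0, 10.0, -10.0]],    # output neuron
]
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, forward(xor_net, x))
```

The example also shows why the staged structure matters: a single neuron of this kind cannot represent the exclusive or, but two layers of such neurons can.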
A multilayer perceptron is trained to implement a desired function with the help
of a data set of sample cases, which are pairs of input values and associated output
values. The input values are fed into the multilayer perceptron, and its output is com-
puted. This output is then compared to the desired output (as specified by the second
part of the sample case). If the two differ, a process called error backpropagation
is executed, which adapts the connection weights and possibly other parameters of
4 Note that for technical reasons, the threshold or bias value β of the neuronal activation function is
turned into a connection weight by adding a connection to a dummy neuron emitting a permanent
signal of 1, while the actual activation function has a threshold of zero.
Fig. 9.5 The logistic function and the hyperbolic tangent, two common sigmoid activation func-
tions for multilayer perceptrons, both of which describe a “soft” threshold behavior
Fig. 9.6 Logistic function and Gaussian radial basis function for a two-dimensional input space
the activation functions in such a way that the output error (that is, the deviation of
the actual output from the desired output) is reduced.
For error backpropagation to be feasible, it is mandatory that the activation func-
tions of the neurons are not crisp threshold functions (as suggested in the above
description), but differentiable sigmoid (s-shaped) functions. That is, the activation
of the neuron should not jump at the threshold from completely inactive to fully
active but should rise smoothly, over some input range, from inactive (0 or −1) to
active (+1). With such functions, the output of a multilayer perceptron is a differ-
entiable function of the inputs, which also depends on the connection weights and
the function parameters (basically the threshold or bias value β). As a consequence,
the weights and parameters can be adapted with a gradient descent scheme, which
minimizes the sum of the squared output errors for the training data set.
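For a single neuron with a logistic activation function, the gradient descent scheme can be written down in a few lines of Python. The sketch below is only the simplest special case (no hidden layer, so no error has to be propagated backwards), it updates the weights after every training example rather than after a full pass, and the learning rate, the number of epochs, and the function names are arbitrary assumptions made for illustration.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def output(weights, x):
    """Neuron output; weights[0] is the weight for the dummy input 1 (bias)."""
    return logistic(sum(w * xi for w, xi in zip(weights, [1.0] + list(x))))

def sse(weights, data):
    """Sum of squared output errors over the data set."""
    return sum((output(weights, x) - y) ** 2 for x, y in data)

def train(data, eta=0.5, epochs=2000):
    """Gradient descent on the squared error of a single logistic neuron."""
    weights = [0.0] * (len(data[0][0]) + 1)
    for _ in range(epochs):
        for x, y in data:
            o = output(weights, x)
            # derivative of the squared error w.r.t. the weighted input sum;
            # o * (1 - o) is the derivative of the logistic function and the
            # constant factor 2 is absorbed into the learning rate eta
            delta = (o - y) * o * (1.0 - o)
            extended = [1.0] + list(x)
            weights = [w - eta * delta * xi for w, xi in zip(weights, extended)]
    return weights

# Toy task, chosen for illustration: the logical OR of two binary inputs.
or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
initial_error = sse([0.0, 0.0, 0.0], or_data)
weights = train(or_data)
print(initial_error, sse(weights, or_data))
```

The update only works because the logistic function is differentiable; with a crisp threshold function the factor o * (1 - o) would be zero almost everywhere and no gradient information would be available.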
The most commonly used activation functions are the logistic function (unipo-
lar) and the hyperbolic tangent (bipolar). Illustrations for a one-dimensional input
space are shown in Fig. 9.5, and a logistic function for a two-dimensional input
space with threshold β = 0 in Fig. 9.6 on the left. Note that the weight vector can
be interpreted as the direction in which the logistic function rises.
Like multilayer perceptrons, radial basis function networks are feed-forward
networks. However, they always have three layers (one hidden layer), while this
is only the most common choice for multilayer perceptrons. In addition, they do
not employ a sigmoid activation function, at least not in the hidden layer. Rather, a
distance from a reference point (or center) in the data space, which is represented by
the neuron weights, is computed.5 This distance is transformed with a radial (basis)
function, the name of which derives from the fact that it is defined on a distance from
a given point and thus on a radius from this point. Note that radial basis function is
actually just another name for a kernel function, as we considered it in Sect. 9.1.3.1.
Such a function is 0 for infinite distance, increases monotonically for decreasing
distance, and is 1 at zero distance. Thus it is, in a way, similar to a sigmoid function,
which increases monotonically from −∞, where it is 0 (or −1), to +∞, where
it is 1. A radial basis function is parameterized with a reference radius σ, which
takes the place of the bias value β of the sigmoid activation functions of multilayer
perceptrons. An illustration for a two-dimensional input space is shown in Fig. 9.6
on the right, which shows a Gaussian radial basis function centered at (0, 0) and
having a reference radius of 1. The value of the radial function for the computed
distance from the reference point is passed on to the output layer.
The neurons in the output layer either have a linear activation function (that is,
the weighted sum of the inputs is merely transformed with a scaling factor and an
offset) or a sigmoid one (like in multilayer perceptrons). The former choice has
certain advantages when it comes to the technical task of initializing the network
before training,6 but otherwise no fundamental difference exists between the two
choices. Like multilayer perceptrons, radial basis function networks are trained with
error backpropagation, which differs only due to the different activation functions.
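The computation carried out by a radial basis function network can be sketched as follows in Python. The sketch assumes Euclidean distance, Gaussian radial functions of the form exp(−d²/(2σ²)) in the hidden layer, and a single output neuron with a linear activation function; the centers, radii, and weights are invented for illustration and are not the result of any training.

```python
import math

def gaussian_rbf(x, center, sigma):
    """Gaussian radial function of the distance between x and the center."""
    d = math.dist(x, center)
    return math.exp(-d**2 / (2.0 * sigma**2))

def rbf_network(x, hidden, out_weights, out_bias=0.0):
    """Output of a network with one hidden RBF layer and one linear output neuron.
    hidden: list of (center, sigma) pairs, one per hidden neuron."""
    activations = [gaussian_rbf(x, center, sigma) for center, sigma in hidden]
    return out_bias + sum(w * a for w, a in zip(out_weights, activations))

# Illustrative parameters (assumptions): two hidden neurons in a 2-d input space.
hidden = [((0.0, 0.0), 1.0), ((2.0, 2.0), 0.5)]
out_weights = [1.0, -1.0]

print(gaussian_rbf((0.0, 0.0), (0.0, 0.0), 1.0))   # activation at the center
print(gaussian_rbf((1.0, 0.0), (0.0, 0.0), 1.0))   # activation at distance 1
print(rbf_network((1.0, 1.0), hidden, out_weights))
```

With a linear output neuron, the network output is simply a weighted sum of the hidden activations, so each hidden neuron contributes a localized "bump" around its center.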
The term “basis” in the name radial basis function derives from the fact that these
functions are the basis, in the sense of the basis of a vector space, for constructing
the function that the neural network is supposed to compute; especially with a linear activation
function in the output layer, the outputs are linear combinations of the basis
functions and thus vector representations w.r.t. the basis functions. To some degree
one may also say that a multilayer perceptron behaves in this way, even though it
uses logistic activation functions for the output neurons, because close to the bias
value, a logistic function is almost linear and thus models a linear combination of
the activation functions of the hidden layer. The difference is that the basis func-
tions of multilayer perceptrons are not radial functions, but rather logistic functions,
which are sigmoid functions along a direction in the data space (see Fig. 9.6).
9.2.2 Construction
The first step of the construction of a neural network model consists in choosing the
network structure. Since the number of input and output neurons is fixed by the data
analysis task (or, for the inputs, by a feature selection method, see Sect. 6.1.1), the
only choices left for a multilayer perceptron are the number of hidden layers and
the number of neurons in these layers. Since single-hidden-layer networks are most
common (because of certain theoretical properties [16]), the choice is usually even
reduced to the number of hidden neurons. The same applies to radial basis function
networks, which have only one hidden layer by definition.
5 Note that with respect to the employed distance function all considerations of Sect. 7.2 can be
one to interpret the output as an approximation of the desired function in the vector space spanned
by the radial functions computed by the hidden neurons: the connection weights from the hidden
layer to the output are the coordinates of the approximating function w.r.t. this vector space.
A simple rule of thumb, which often leads to acceptable training results, is to use
$\frac{1}{2}(\#\text{inputs} + \#\text{outputs})$ hidden neurons, where #inputs and #outputs are the numbers
of input and output attributes, respectively. There also exist approaches to optimize
the number of hidden neurons during the training process [21], or one may employ
a wrapper scheme as for feature selection (see Sect. 6.1.1) in order to find the best
number of hidden neurons for a given task. However, even though a wrong number
of hidden neurons, especially if it is chosen too small, can lead to bad results, one has
to concede that other factors, especially the choice and scaling of the input attributes,
are much more important for the success of neural network model building.
Once the network structure is fixed, the connection weights and the activa-
tion function parameters are initialized randomly. For multilayer perceptrons, the
weights and the bias values are usually chosen uniformly from a small interval cen-
tered around 0. For a radial basis function network, the reference or center points
(the coordinates of which are the weights of the neurons in the hidden layer) may
be chosen by randomly selecting data points or by sampling randomly from some
distribution (Gaussian or rectangular) centered at the center of the data space. The
reference radii are usually initialized to equal values, which are derived from the
size of the data space and the number of hidden neurons, for example, as $l_d / k$,
where $l_d$ is the length of the data space diagonal, and k is the number of hidden
neurons. If a linear activation function is chosen for the output layer, the con-
nection weights from the hidden to the output layer can be initialized by solving
a linear optimization problem. Alternatively, the weights can be initialized ran-
domly.
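The initialization choices described above can be sketched in a few lines of Python.
This is an illustration under stated assumptions, not a prescribed procedure: the data
space diagonal is taken to be the diagonal of the bounding box of the data, the interval
half-width `limit` is an arbitrary small value, and all function names are hypothetical.

```python
import math
import random

def init_mlp_weights(n_inputs, n_neurons, limit=0.5, seed=0):
    """Weights and bias value per neuron, uniform in [-limit, limit]."""
    rng = random.Random(seed)
    return [[rng.uniform(-limit, limit) for _ in range(n_inputs + 1)]
            for _ in range(n_neurons)]

def init_rbf_hidden_layer(data, k, seed=0):
    """Centers: k randomly selected data points; radii: all equal to l_d / k."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)
    dims = range(len(data[0]))
    lows = [min(p[d] for p in data) for d in dims]
    highs = [max(p[d] for p in data) for d in dims]
    # Assumption: "data space diagonal" = diagonal of the bounding box.
    diagonal = math.sqrt(sum((h - l) ** 2 for h, l in zip(highs, lows)))
    radii = [diagonal / k] * k
    return centers, radii

data = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (2.0, 2.0)]
centers, radii = init_rbf_hidden_layer(data, k=2)  # radii are [2.5, 2.5]
```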
After all network parameters (connection weights and activation function param-
eters) have been initialized, the neural network already implements a function of the
input attributes. However, unless a radial basis function network with linear activa-
tion functions in the output layer has been initialized by solving the corresponding
linear optimization problem, this function is not likely to be anywhere close to the
desired function as represented by the training samples. Rather it will produce (sig-
nificant) errors and thus needs to be adapted or trained.
The rationale of neural network training is as follows: the deviation between the
function implemented by the neural network and the desired function as represented
by the given training data is measured by the sum of squared errors,
$$e(D) = \sum_{(x,y) \in D} \sum_{u \in U_{\text{out}}} (o_u(x) - y_u)^2,$$
where $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ is the given training data set, $U_{\text{out}}$ the set of
output neurons, $y_u$ is the desired output of neuron u for the training case (x, y), and
$o_u(x)$ the computed output of neuron u for training case (x, y). Furthermore, this
error of the neural network for a given training sample can be seen as a function of
the network parameters (connection weights and activation function parameters like
bias value and reference radius), since ou (x) depends on these parameters, while
everything else is fixed by the network structure and the given training data:
e(D) = e(D; θ ),
where θ is the total of all network parameters. Provided that all activation functions
are (continuously) differentiable (which is the case for the common choices of acti-
vation functions, see above, and also one of the reasons why the error is squared7 ),
we can carry out a gradient descent on the error function in order to minimize the
error. Intuitively, the principle of gradient descent (see also Chap. 5 for a general de-
scription) is the same that scouts follow in order to find water: always go downhill.
Formally, we compute the gradient of the error function, that is, we consider
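$$\nabla_\theta e(D; \theta) = \left( \frac{\partial}{\partial\theta_1} e(D; \theta), \ldots, \frac{\partial}{\partial\theta_r} e(D; \theta) \right),$$
where $\theta_k$, $1 \le k \le r$, are the network parameters, $\frac{\partial}{\partial\theta_k}$ denotes the partial derivative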
w.r.t. θk , and ∇θ (pronounced “nabla”) is the gradient operator. Intuitively, the gra-
dient describes the direction of steepest ascent of the error function (see Fig. 9.7 for
a sketch). In order to carry out a gradient descent, it is negated and multiplied by a
factor η, which is called the learning rate and which determines the size of the step
in the parameter space that is carried out. Formally, we thus have
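$$\theta^{(\text{new})} = \theta^{(\text{old})} - \eta \nabla_\theta e\bigl(D; \theta^{(\text{old})}\bigr) \quad\text{or}\quad \theta_k^{(\text{new})} = \theta_k^{(\text{old})} - \eta \frac{\partial}{\partial\theta_k} e\bigl(D; \theta^{(\text{old})}\bigr)$$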
if written for an individual parameter θk . The exact form of this expression depends
on the activation functions that are used in the neural network. A particularly simple
case results for the standard case of the logistic activation function
$$f_{\text{act}}(z) = \frac{1}{1 + e^{-z}},$$
which is applied to $z = wx = \sum_{i=0}^{p} w_i x_i$, that is, the weighted sum of the
inputs, written as a scalar product. Here x = (x0 = 1, x1 , . . . , xp ) is the in-
put vector of the neuron, extended by a fixed input x0 = 1 (see above), and
7 The other reason is that without squaring the error, positive and negative errors could cancel out.
Fig. 9.8 Cookbook recipe for the execution (forward propagation) and training (error backpropa-
gation) of a multilayer perceptron for the standard choice of a logistic activation function
In addition, the formulas for the different layers are connected by a recursive
scheme, which makes it possible to propagate an error term, usually denoted
by δ, from a given layer to its preceding layer. We skip the detailed deriva-
tion here, which does not pose special mathematical problems (see, for example,
[21, 23]).
Rather, we present the final result in the form of a cookbook recipe in Fig. 9.8,
which contains all relevant formulas. The computations start at the input layer
(green), where the external inputs are simply copied to the outputs of the input
neurons. For all hidden and output neurons, forward propagation yields their output
values (blue). The output of the output neurons is then compared to the desired out-
put and a first error factor (for the output neurons) is computed (red), into which the
derivative of the activation function enters (here, the logistic function, for which the
derivative is shown in white). This error factor can be propagated back to the pre-
ceding layer with a simple recursive formula (yellow), using the connection weights
and again the derivative of the activation function. From the error factors the weight
changes are computed, using a user-specified learning rate η and the output of the
neuron the connection, the weight is associated with, leads to.
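The recipe can be illustrated with a small Python sketch for one hidden layer and a
single output neuron, both with logistic activation. Since Fig. 9.8 is not reproduced
here, the formulas below are the standard ones and should be read as an assumption:
the error factor of the output neuron is (y - out) * out * (1 - out), it is propagated
back through the connection weights, and each weight changes by the learning rate times
the error factor times the value arriving over that connection. The constant factor 2
from the squared error is absorbed into the learning rate, and all names are hypothetical.

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Forward propagation; index 0 of every weight vector is the bias weight."""
    xs = [1.0] + list(x)
    hs = [1.0] + [logistic(sum(w * v for w, v in zip(ws, xs))) for ws in w_hidden]
    out = logistic(sum(w * v for w, v in zip(w_out, hs)))
    return xs, hs, out

def train_step(x, y, w_hidden, w_out, eta):
    """One online backpropagation step (weights are updated in place)."""
    xs, hs, out = forward(x, w_hidden, w_out)
    delta_out = (y - out) * out * (1.0 - out)
    # Propagate the error factor back, using the weights before the update.
    delta_hidden = [delta_out * w_out[j + 1] * hs[j + 1] * (1.0 - hs[j + 1])
                    for j in range(len(w_hidden))]
    for j in range(len(w_out)):
        w_out[j] += eta * delta_out * hs[j]
    for j, ws in enumerate(w_hidden):
        for i in range(len(ws)):
            ws[i] += eta * delta_hidden[j] * xs[i]

def sse(data, w_hidden, w_out):
    return sum((forward(x, w_hidden, w_out)[2] - y) ** 2 for x, y in data)

def train_or_demo(epochs=2000, eta=0.5, seed=1):
    """Train on the logical OR and return the error before and after training."""
    rng = random.Random(seed)
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(3)]
    before = sse(data, w_hidden, w_out)
    for _ in range(epochs):
        for x, y in data:
            train_step(x, y, w_hidden, w_out, eta)
    return before, sse(data, w_hidden, w_out)
```

Calling `train_or_demo()` returns the sum of squared errors before and after training,
which makes it easy to check that the descent actually reduces the error.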
Since standard (also called “vanilla”8 ) error backpropagation can be fairly slow and
it may also be difficult to choose an appropriate learning rate, several variants have
been suggested (see Sect. 9.2.3.1). An important technique to achieve robust learn-
ing results with neural networks is weight decay (see Sect. 9.2.3.2). The relative
importance of the different inputs for predicting the target variable can be deter-
mined with sensitivity analysis (see Sect. 9.2.3.3).
quick backpropagation approximates the error function locally and per weight by
a parabola (derived from the gradient of the current and the previous step) and then
sets the weight directly to the value of the apex of this parabola.
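The parabola step can be written down for a single weight as follows. This is a sketch
under the assumption that the parabola is fitted through the gradients at the previous
and the current weight value: its derivative is then a straight line through these two
points, and the apex lies where this line crosses zero. The function name is hypothetical,
and safeguards used in practice (for example, limiting the step size or handling equal
gradients) are omitted.

```python
def quickprop_step(w_prev, w_cur, grad_prev, grad_cur):
    """Return the weight at the apex of the parabola defined by two gradients.

    Assumes grad_prev != grad_cur; otherwise the parabola is degenerate.
    """
    return w_cur + grad_cur / (grad_prev - grad_cur) * (w_cur - w_prev)

# For a truly quadratic error e(w) = (w - 3)^2 with gradient 2 * (w - 3),
# a single step from the points w = 0 and w = 1 lands on the minimum.
w_new = quickprop_step(0.0, 1.0, -6.0, -4.0)  # 3.0
```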
Even though neural networks are essentially black boxes, which compute their out-
put from the input through a series of numerical transformations that are usually
difficult to interpret, there is a simple method to get at least some idea of how im-
portant different inputs are for the computation of the output. This method is called
sensitivity analysis and consists in forming, for the training examples, the partial
derivatives of the function computed by a trained neural network w.r.t. the different
inputs. Given that the inputs are properly normalized, or these derivatives are put
in proper relation with the range of input values, they provide an indication how
strongly an input affects the output, because they describe how much the output
changes if an input is changed. This may be used for feature selection if the relevant
features are filtered out with an appropriately chosen threshold for the derivatives.
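A rough sketch of sensitivity analysis in Python is given below. Instead of analytic
partial derivatives it uses central finite differences, which is a swapped-in
approximation that works for any model given as a function. Averaging the absolute
derivatives over the training examples is one possible aggregation and an assumption
of this sketch, as are the names `sensitivities`, `model`, and `h`.

```python
def sensitivities(model, examples, h=1e-5):
    """Mean absolute partial derivative of model w.r.t. each input.

    model: function mapping a list of input values to a number.
    examples: list of input vectors (assumed to be normalized already).
    """
    n_inputs = len(examples[0])
    totals = [0.0] * n_inputs
    for x in examples:
        for i in range(n_inputs):
            up, down = list(x), list(x)
            up[i] += h
            down[i] -= h
            totals[i] += abs((model(up) - model(down)) / (2.0 * h))
    return [t / len(examples) for t in totals]

# Toy "trained model": the second input has no effect on the output.
toy_model = lambda x: 3.0 * x[0] + 0.0 * x[1] - 0.5 * x[2]
scores = sensitivities(toy_model, [[0.1, 0.2, 0.3], [0.5, 0.5, 0.5]])
```

Inputs whose score stays below a chosen threshold would be candidates for removal.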
Finding the embedding function Φ, so that the data is linearly separable in the new
space, is, of course, still a problem, as we will discuss in more detail later.
Nevertheless, kernel methods are a well-understood mechanism to build powerful
classifiers and, as we will see later, also regression functions.
9.3.1 Overview
Before we look into linear separability again, let us first reformulate the problem
slightly, so that we can more easily extend it to other spaces later. We are, as usual,
considering a set of training examples D = {(xj , yj )|j = 1, . . . , n}. The binary class
information is encoded as ±1 in contrast to the often used 0/1, which will allow us
later to simplify equations considerably. Our goal is to find a linear discriminant
function f (·) together with a decision function h(·). The latter reduces the continu-
ous output of f to a binary class label, i.e., ±1:
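$$f(x) = \langle w, x \rangle + b \quad\text{and}\quad h(x) = \operatorname{sign}(f(x)).$$
Apart from the offset b and the scaling factor |w||x|, the discriminant function f(·)
returns the cosine of the angle between the weight vector w and the input vector x, since
$$\cos \angle(w, x) = \frac{\langle w, x \rangle}{|w||x|}.$$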
Figure 9.10 illustrates this.
Finding such linear discriminant functions has also attracted interest in the
field of artificial neural networks. Before multilayer perceptrons, as discussed in
Sect. 9.2, grew in popularity with the development of error backpropagation,
most of the interest lay in single perceptrons. A single perceptron can be seen
as computing just this: the linear discriminant function shown above. In [26, 27] a
learning rule for the single perceptron was introduced, which updates the parameters
when sequentially presenting the training patterns one after the other:
IF $y_j \cdot (\langle w, x_j \rangle + b) < 0$
THEN $w_{t+1} = w_t + y_j \cdot x_j$ and $b_{t+1} = b_t + y_j \cdot R^2$
with $R = \max_j \|x_j\|$. Whenever the product of the actual output value and the
desired output is negative, we know that the signs of those two values were different
and the classification was incorrect. We then update the weight vector using the
learning rule. One nice property is that this process is guaranteed to converge to a
solution if the training patterns are indeed perfectly classifiable with such a simple,
linear discriminant function.
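The learning rule can be sketched in Python as follows. One deliberate deviation is
labeled here as an assumption: the test uses "<= 0" instead of "< 0", because with a
weight vector and offset initialized to zero the product is exactly 0 and the strict
test would never trigger an update. The stopping criterion (an epoch without updates,
or `max_epochs`) and all names are likewise choices of this sketch.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_perceptron(X, y, max_epochs=100):
    """Primal perceptron learning; classes in y are coded as +1 / -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    r_sq = max(dot(x, x) for x in X)  # R^2 with R = max_j ||x_j||
    for _ in range(max_epochs):
        updates = 0
        for xj, yj in zip(X, y):
            if yj * (dot(w, xj) + b) <= 0:  # misclassified (or undecided)
                w = [wi + yj * xi for wi, xi in zip(w, xj)]
                b += yj * r_sq
                updates += 1
        if updates == 0:  # every pattern is classified correctly
            break
    return w, b

X = [(2.0, 2.0), (3.0, 3.0), (-1.0, -1.0), (-2.0, -3.0)]
y = [1, 1, -1, -1]
w, b = train_perceptron(X, y)
```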
One key observation is now that we can represent f(·) based on a weighted
sum of the training examples instead of some arbitrary weight vector because
$$w = \sum_{j=1}^{n} \alpha_j \cdot y_j \cdot x_j.$$
The $\alpha_j$ essentially count how often the jth training pattern triggered an invocation
of the learning rule above. Including the $y_j$, that is, the sign of the correct output,
ensures that the $\alpha_j$ never become negative, a small trick that will make our
formalism a lot simpler later on.
Note that this representation does require that during initialization we do not
assign random values to w but set it to 0 or at least a (random) linear combination
of the training vectors. This, however, is not a substantial limitation.
From this it is straightforward to also represent the discriminant function based
on the weighted training examples:
$$f(x) = \langle w, x \rangle + b = \sum_{j=1}^{n} \alpha_j \cdot y_j \cdot \langle x_j, x \rangle + b.$$
Hence we can perform the classification of new patterns solely by computing the
inner products between the new pattern x and the training patterns $x_j$.
Finally, the update rule can also be represented based on inner products between
training examples:
IF $y_j \cdot \left( \sum_{i=1}^{n} \alpha_i y_i \langle x_i, x_j \rangle + b \right) < 0$
THEN $\alpha_j^{(t+1)} = \alpha_j^{(t)} + 1$ and $b^{(t+1)} = b^{(t)} + y_j \cdot R^2$.
This representation uses only inner products with training examples. Note that we
suddenly do not need to know anything about the input space anymore: as long as
we have some way to compute the inner product between the training instances, we
can derive the corresponding α’s. This representation is called dual representation
in contrast to the primal representation, which represents the solution through a
weight vector. A nice property of the dual representation is that the vector of α’s
expresses how much each training instance contributes to the solution, that is, how
difficult to classify it was. In fact, if the problem is not linearly separable, the
α’s of the misclassified patterns will grow without bound. At the other extreme,
some α’s will remain zero, since the corresponding patterns are never misclassified.
Those are patterns that are easy to classify, and we do not need to record their
influence on the solution. We will return to this effect later.
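A sketch of the dual form in Python shows that training needs nothing but the matrix
of inner products between training examples. As assumptions of this sketch, each
triggered update increases the count α_j by one (matching the interpretation of the α's
as counters), the test uses "<= 0" so that the all-zero initialization triggers updates,
and R^2 is read off the diagonal of the inner product matrix. All names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_dual_perceptron(gram, y, max_epochs=100):
    """Dual perceptron; gram[i][j] is the inner product of examples i and j."""
    n = len(y)
    alpha = [0] * n
    b = 0.0
    r_sq = max(gram[j][j] for j in range(n))  # R^2 = max_j <x_j, x_j>
    for _ in range(max_epochs):
        updates = 0
        for j in range(n):
            f_j = sum(alpha[i] * y[i] * gram[i][j] for i in range(n)) + b
            if y[j] * f_j <= 0:
                alpha[j] += 1
                b += y[j] * r_sq
                updates += 1
        if updates == 0:
            break
    return alpha, b

X = [(2.0, 2.0), (3.0, 3.0), (-1.0, -1.0), (-2.0, -3.0)]
y = [1, 1, -1, -1]
gram = [[dot(a, c) for c in X] for a in X]
alpha, b = train_dual_perceptron(gram, y)  # alpha == [1, 0, 1, 0]
```

In this toy run only the first and third pattern ever trigger the rule; the other two
keep α = 0 and do not contribute to the solution.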
The observation above indicates that we could ignore our input space if we had
access to a function which returned the inner product for arbitrary vectors. So-called
kernel functions offer, among others, precisely that property:
$$K(x_1, x_2) = \langle \Phi(x_1), \Phi(x_2) \rangle.$$
If we find a kernel for which Φ = I , we can replace all our inner products with
K(·, ·). This, however, is obviously boring: we can simply define K to compute the
inner product. However, these kernels offer a very interesting perspective: we can
suddenly compute inner products in spaces that we never really have to deal with.
As long as, for the kernel K, there exists a function Φ which projects our original
space into some other space, we can use the corresponding kernel K to compute
the inner product directly. We can, of course, always define a kernel the way we
see it above, e.g., as the inner product on the results of applying Φ to our original
vectors. But what if we had much simpler kernel functions K that in effect computed
an inner product in some other, Φ-induced space? Polynomial kernels allow us to
demonstrate this nicely. Consider the function
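$$\Phi\left( \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \right) = \left( x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2 \right)^T,$$
for which we can easily derive the corresponding kernel
$$K\left( \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \right) = (x_1 y_1)^2 + (x_2 y_2)^2 + 2 (x_1 y_1 x_2 y_2) = \left\langle \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \right\rangle^2.$$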
This particular kernel has the additional twist that we can represent the inner
product in our Φ-induced space through (among others) the inner product in the original
space. This is, of course, not a requirement. More interestingly, the general kernel
$K(x, y) = \langle x, y \rangle^d$ gives us an implicit induced space of dimension $\binom{n+d-1}{d}$,
where n is the dimension of the original space (for n = 2 and d = 2, these are exactly
the three dimensions of Φ above). Calculating the resulting Φ(·) directly would very
quickly become computationally very expensive. This kernel is a nice example of how we
can find a model in a very high-dimensional space without ever explicitly dealing with
the vectors in that space.
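The equivalence between the explicit mapping and the kernel can be checked numerically.
The following Python sketch compares the inner product of the mapped vectors with the
squared inner product in the original two-dimensional space; the function names are
chosen for illustration only.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit mapping into the three-dimensional induced space."""
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2.0) * x[0] * x[1])

def poly_kernel(x, y, d=2):
    """Polynomial kernel <x, y>^d, computed in the original space."""
    return dot(x, y) ** d

x, y = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(y))  # inner product in the induced space
implicit = poly_kernel(x, y)    # same value without ever computing phi
```

Both values agree up to rounding, but the kernel never constructs the mapped vectors.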
A few examples for other kernels are
$$K(x, y) = \langle x, y \rangle^d$$
and
$$K(x, y) = e^{-\frac{\|x - y\|^2}{2\sigma}}.$$
Since the set of kernels is closed under certain arithmetic operations, we can con-
struct much more complex kernels based on simpler ones. Additionally, we can
test for an arbitrary function K if it does indeed represent a kernel in some other
space or not. There are, in fact, kernels which represent inner products in infinite-
dimensional spaces, so we can find linear discriminant functions in spaces that we
could not possibly deal with directly.
Even more interesting is the ability to define kernels for objects without numer-
ical representations such as texts, images, graphs (such as molecular structures), or
(biological) sequences. We can then create a classifier for such objects without ever
entering an original or the derived space—all we need is a kernel function which re-
turns a measure for the two objects which can then be used to derive the weighting
factors α determining the weight vector w.
Note that for the training, we do not even need access to the kernel function itself
as long as we have the inner products of all training examples to each other. The
resulting kernel (or Gram) matrix looks as follows:
$$K = \begin{pmatrix} K(x_1, x_1) & K(x_1, x_2) & \cdots & K(x_1, x_m) \\ K(x_2, x_1) & K(x_2, x_2) & \cdots & K(x_2, x_m) \\ \vdots & \vdots & \ddots & \vdots \\ K(x_m, x_1) & K(x_m, x_2) & \cdots & K(x_m, x_m) \end{pmatrix}.$$
This matrix is the centerpiece of kernel machines and contains all the information
required during training. The matrix combines information from both the training data
and the chosen kernel. For kernel matrices, a couple of interesting observations hold:
• a kernel matrix is symmetric and positive semidefinite;
• every positive semidefinite, symmetric matrix is a kernel matrix, that is, it represents
an inner product in some space; and
• the eigenvectors of the matrix correspond to the input vectors.
Note that it is still crucial to choose an appropriate kernel. If the Gram matrix is
close to being a diagonal matrix, all points end up being essentially orthogonal to
each other in the induced space, and finding a linear separation plane is very simple
but also does not offer any generalization power.
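The effect of the kernel choice on the Gram matrix can be made concrete with a small
Python sketch. It uses the Gaussian kernel in the form given above (with 2σ in the
denominator); the data points and the two values of σ are arbitrary illustrations, and
the function names are hypothetical.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma))

def gram_matrix(points, kernel):
    """Matrix of all pairwise kernel values of the given points."""
    return [[kernel(a, b) for b in points] for a in points]

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K_wide = gram_matrix(points, gaussian_kernel)
# A very narrow kernel makes all distinct points nearly orthogonal, so the
# Gram matrix is close to the identity matrix.
K_narrow = gram_matrix(points, lambda a, b: gaussian_kernel(a, b, sigma=0.01))
```

With the narrow kernel every training point can be separated from all others, which is
exactly the situation without generalization power described above.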
If we can find a separating hyperplane, we can already visually motivate that in
order to represent w, we do not need to use all our original training data points. It
is sufficient to use points which lie closest to this hyperplane. Those points are the
ones ending up with an α ≠ 0 and are called support vectors. Figure 9.11 illustrates
this: only the two boxed x’s and the one boxed o are really needed to define the
weight vector w.
From this picture it also becomes evident that there are actually many differ-
ent solutions for our classification problem—any line which correctly separates the
points of different classes works just fine. However, there are lines that lie closer
to training instances than others. If we were to maximize the minimum distance of
any of the training instances to the separation line, we would create a solution with
maximum distance to possibly making an error: the maximum margin classifier.
Figure 9.11 shows this distance for the optimal, that is, largest such margin of error:
$$\gamma = \max_w \min_j \langle w, x_j \rangle.$$
There are solid theoretical explanations why this is indeed the best choice for the
separating hyperplane. From statistical learning theory we can derive that the com-
plexity of the class of all hyperplanes with constant margin is smaller than the class
of hyperplanes with smaller margins. From this we can then derive upper bounds
on the generalization error of the resulting SVM. We refer to [15] for a detailed
treatment of these issues.
9.3.2 Construction
We will not describe the many training methods and their variants for SVMs in great
detail here but instead refer to [15]. In a nutshell, the main idea reduces to solving a
quadratic programming problem. In order to do this, we reformulate our constraint
a bit. We now require every training example to satisfy
$$y_j \cdot (\langle w, x_j \rangle + b) \ge 1,$$
instead of the product merely being greater than zero. The decision line is still given by
$$\langle w, x \rangle + b = 0,$$
but we now can also describe the upper and lower margins by
$$\langle w, x \rangle + b = 1$$
and
$$\langle w, x \rangle + b = -1,$$
and the distance between those two hyperplanes is $2 / \|w\|$. Our goal of finding the
maximum margin can now be formulated as the minimization problem
minimize (in w, b)
$$\|w\|$$
subject to (for any j = 1, . . . , n)
$$y_j (\langle w, x_j \rangle + b) \ge 1.$$
This is fairly complex to solve because it depends on the norm of w, which involves
a square root. However, we can convert this into a quadratic form by substituting
$\|w\|$ with $\frac{1}{2}\|w\|^2$ without changing the solution. After expressing this by
means of Lagrange multipliers, this turns into a standard quadratic programming
problem. In [15] more details and references to other, more memory- or time-
efficient solutions are given.
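For orientation, the step via Lagrange multipliers can be written out as follows. This
is the standard textbook derivation for the constraint $y_j(\langle w, x_j \rangle + b) \ge 1$,
added here as a worked sketch and not taken from the text; the multipliers are named
$\alpha_j$ because they play the role of the weighting factors of the dual representation.

```latex
% Lagrangian of the primal problem, with multipliers \alpha_j \ge 0
L(w, b, \alpha) = \frac{1}{2} \|w\|^2
  - \sum_{j=1}^{n} \alpha_j \bigl( y_j (\langle w, x_j \rangle + b) - 1 \bigr)

% Setting the derivatives w.r.t. w and b to zero
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{j=1}^{n} \alpha_j y_j x_j,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{j=1}^{n} \alpha_j y_j = 0

% Substituting back yields the dual quadratic programming problem
\max_{\alpha} \; \sum_{j=1}^{n} \alpha_j
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j \, y_i y_j \, \langle x_i, x_j \rangle
\quad \text{subject to} \quad \alpha_j \ge 0, \;\; \sum_{j=1}^{n} \alpha_j y_j = 0
```

The dual depends on the training examples only through their inner products, so these
can again be replaced by a kernel.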
Much work has been done on support vector machines. In the following we describe
a few extensions which are essential for practical applications.
We cannot always assume that we can find a hyperplane which cleanly separates
our training examples. Especially for noisy training examples, this may not even be
desirable, as we would end up overfitting our data. So-called soft margin
classifiers introduce slack variables, which allow some of the training
examples to be within the margin or even on the wrong side of the separation line.
These slack variables end up expressing a degree of misclassification of the
individual training examples. The constraint of our optimization problem in
Sect. 9.3.2 is modified to
$$\forall j = 1, \ldots, n: \quad y_j \cdot (\langle w, x_j \rangle + b) \ge 1 - \varepsilon_j,$$
and we need to introduce an additional penalty term to punish nonzero $\varepsilon_j$:
$$\arg\min \; \frac{1}{2} \|w\|^2 + C \sum_j \varepsilon_j$$
subject to $y_j \cdot (\langle w, x_j \rangle + b) \ge 1 - \varepsilon_j$ for $1 \le j \le n$.
This can again be solved using Lagrange multipliers.
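To see what the slack variables measure, the following Python sketch evaluates the soft
margin objective for a fixed hyperplane. It assumes, as is standard but not stated above,
that the slack variables are nonnegative; the smallest slack that satisfies the constraint
for example j is then max(0, 1 - y_j f(x_j)). The function name and the toy data are
illustrative only.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_margin_objective(w, b, X, y, C):
    """Objective value and smallest feasible slacks for a fixed (w, b)."""
    slacks = [max(0.0, 1.0 - yj * (dot(w, xj) + b)) for xj, yj in zip(X, y)]
    return 0.5 * dot(w, w) + C * sum(slacks), slacks

w, b = (1.0, 0.0), 0.0
X = [(2.0, 0.0), (0.5, 0.0), (-1.0, 0.0), (1.0, 0.0)]
y = [1, 1, -1, -1]
value, slacks = soft_margin_objective(w, b, X, y, C=1.0)
# slacks == [0.0, 0.5, 0.0, 2.0]: outside the margin, inside the margin,
# exactly on the margin, and on the wrong side of the separation line.
```

A larger C punishes the slack more strongly and thus moves the solution closer to the
hard margin classifier.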
Not all real-world problems are binary classification tasks. In order to classify ex-
amples into more than two classes, one usually transforms the problem into a set
of binary classification problems. Those can be classifiers which either separate one
class from all others or separate pairs of classes from each other. In the former case,
the class with the highest distance from the hyperplane wins; in the other case, the
winners are counted, and the class which wins the most class-pair classifications
determines the final classification.
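Both combination schemes can be sketched in Python, given already trained binary
discriminant functions. The sketch assumes that the raw discriminant value can stand in
for the distance from the hyperplane (which requires comparably scaled weight vectors)
and that a nonnegative value of a pairwise classifier votes for the first class of the
pair; the toy discriminants on a one-dimensional input are made up for illustration.

```python
from collections import Counter

def predict_one_vs_rest(x, discriminants):
    """discriminants: dict mapping a class to its 'class vs. rest' function."""
    return max(discriminants, key=lambda c: discriminants[c](x))

def predict_one_vs_one(x, pair_discriminants):
    """pair_discriminants: dict mapping (class_a, class_b) to a function
    whose value is >= 0 for class_a and < 0 for class_b."""
    votes = Counter(a if f(x) >= 0 else b
                    for (a, b), f in pair_discriminants.items())
    return votes.most_common(1)[0][0]

one_vs_rest = {
    "low": lambda x: 1.0 - x[0],
    "mid": lambda x: 1.0 - abs(x[0] - 3.0),
    "high": lambda x: x[0] - 5.0,
}
one_vs_one = {
    ("low", "mid"): lambda x: 1.5 - x[0],
    ("low", "high"): lambda x: 3.0 - x[0],
    ("mid", "high"): lambda x: 4.5 - x[0],
}
label_a = predict_one_vs_rest((3.2,), one_vs_rest)  # "mid"
label_b = predict_one_vs_one((3.2,), one_vs_one)    # "mid"
```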
One interesting variation of support vector machines allows one to address regression
problems instead of binary classifications. The key idea is to change the optimization
to the following expression:
$$\arg\min \; \frac{1}{2} \|w\|^2$$
subject to $|y_j - (\langle w, x_j \rangle + b)| \le \varepsilon$ for $1 \le j \le n$.
So, instead of requiring the signs of the target variable and the prediction to match,
we require the prediction error to stay within a certain range (or margin) ε.
We can, of course, also introduce slack variables to tolerate larger errors. Moreover,
we can use the kernel trick to obtain nonlinear regression functions as well. A very
good tutorial on support vector regression can be found in [32].
9.4.1 Overview
It is well known from psychological studies of problem solving activities (but also
highly plausible without such scientific backing) that a committee of (human) ex-
perts with different, but complementary skills, usually produces better solutions than
any individual. As a consequence, the idea suggests itself to combine several pre-
dictors (classifiers or regression models) in order to achieve a prediction accuracy
exceeding the quality of the individual predictors. That is, instead of using a single
model to predict the target value, we employ an ensemble of predictors and combine
their predictions (for example, by majority voting for classification or by simple av-
eraging for numeric targets) in order to obtain a joint prediction.
A necessary and sufficient condition for an ensemble of predictors to outperform
the individuals it is made of is that the predictors are reasonably accurate and di-
verse. Technically, a predictor is already called accurate if it predicts the correct
target value for a new input object better than random guessing. Hence this is a
pretty weak requirement that is easy to meet in practice. Two predictors are called
diverse predictors if they do not make the same mistakes on new input objects.
It is obvious that this requirement is essential: if the predictors always made the
same mistakes, no improvement could possibly result from combining them. As
an extreme case, consider that the predictors in the ensemble are all identical: the
combined prediction is necessarily the same as that of any individual predictor—
regardless of how the individual predictions are combined. However, if the errors
made by the individual predictors are uncorrelated, their combination will reduce
these errors. For example, if we combine classifiers by majority voting and if we
assume that the mistakes made by these classifiers are independent, the resulting
ensemble yields a wrong result only if more than half of the classifiers misclassify
the new input object, which is a lot less likely than any individual classifier assign-
ing it to the wrong class. For instance, for five independent classifiers for a two-class
problem, each of which has an error probability of 0.3, the probability that three or
more yield a wrong result is
$$\sum_{i=3}^{5} \binom{5}{i} \, 0.3^i \cdot 0.7^{5-i} = 0.16308.$$
Note, however, that this holds only for the ideal case that the classifiers are fully in-
dependent, which is usually not the case in practice. Fortunately, though, improve-
ments are also achieved if the dependence is sufficiently weak, although the gains
are naturally smaller. Note also that even in the ideal case no gains result (but rather
a degradation) if the error probability of an individual classifier exceeds 0.5, which
substantiates the requirement that the individual predictors should be accurate.
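The majority vote computation is easy to reproduce. The following Python sketch sums
the binomial probabilities that more than half of n independent classifiers are wrong;
the function name is chosen for illustration, and the independence of the classifiers
is exactly the idealizing assumption discussed above.

```python
from math import comb

def majority_error(n, p):
    """Probability that more than half of n independent classifiers,
    each with error probability p, are wrong at the same time."""
    k_min = n // 2 + 1
    return sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i)
               for i in range(k_min, n + 1))

ensemble_error = majority_error(5, 0.3)  # about 0.163, below 0.3
degraded_error = majority_error(5, 0.6)  # about 0.683, above 0.6
```

The second call illustrates the degradation that results if the individual classifiers
are worse than random guessing.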
According to [17], there are basically three reasons why ensemble methods work:
statistical, computational, and representational. The statistical reason is that in prac-
tice any learning method has to work on a finite data set and thus may not be able to
identify the correct predictor, even if this predictor lies within the set of models that
the learning method can, in principle, return as a result (see also Sect. 5.4). Rather,
it is to be expected that there are several predictors that yield similar accuracy. Since
there is thus no sufficiently clear evidence which model is the correct or best one,
there is a certain risk that the learning method selects a suboptimal model. By re-
moving the requirement to produce a single model, it becomes possible to “average”
over many or even all of the good models. This reduces the risk of excluding the best
predictor and the influence of actually bad models.
The computational reason refers to the fact that learning algorithms usually can-
not traverse the complete model space but must use certain heuristics (greedy, hill
climbing, gradient descent, etc.) in order to find a model. Since these heuristics
may yield suboptimal models (for example, local minima of the error function), a
suboptimal model may be chosen (see also Sect. 5.4). However, if several models
constructed with heuristics are combined in an ensemble, the result may be a better
approximation of the true dependence between the inputs and the target variable.
The representational reason is that for basically all learning methods, even the
most flexible ones, the class of models that can be learned is limited, and thus it
may be that the true model cannot be represented accurately. By combining several
models in a predictor ensemble, the model space can be enriched, that is, the ensem-
ble may be able to represent a dependence between the inputs and the target variable
that cannot be expressed by any of the individual models the learning method is able
to produce. That is, from a representational point of view, ensemble methods make
it possible to reduce the bias of a learning algorithm by extending its model space,
while the statistical and computational reasons indicate that they can also reduce the
variance. In this sense, ensemble methods are able to sever the usual link between
bias and variance (see also Sect. 5.4.5).
9.4.2 Construction
Bayesian Voting In pure Bayesian voting the set of all possible models in a user-
defined hypothesis space is enumerated to form the ensemble. The predictions of
the individual models are combined weighted with the posterior probability of the
model given the training data [17]. That is, models that are unlikely to be correct
given the data have a low influence on the ensemble prediction, whereas models that
are likely to be correct have a high influence. The posterior probability of the model given the data can often
be computed conveniently by exploiting P (M | D) ∝ P (D | M)P (M), where M is
the model, D the data, P (M) the prior probability of the model (often assumed to
be the same for all models), and P (D | M) the data likelihood given the model.
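For a small, explicitly enumerated set of models the weighting can be sketched in Python
as follows. The sketch assumes a two-class problem with classes coded as +1 and -1 and
normalizes the products of likelihood and prior so that the weights sum to one; the
function names and the numbers are purely illustrative.

```python
def posterior_weights(likelihoods, priors=None):
    """Normalized P(M | D), proportional to P(D | M) * P(M); equal priors by default."""
    if priors is None:
        priors = [1.0] * len(likelihoods)
    unnormalized = [l * p for l, p in zip(likelihoods, priors)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def bayesian_vote(predictions, weights):
    """Weighted vote over predictions coded as +1 / -1."""
    score = sum(w * p for w, p in zip(weights, predictions))
    return 1 if score >= 0 else -1

weights = posterior_weights([0.6, 0.3, 0.1])
joint = bayesian_vote([1, -1, -1], weights)
```

Here the single most probable model outvotes the two less probable ones, although a
simple majority vote would have decided the other way.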
Theoretically, Bayesian voting is the optimal combination method, because all
possible models are considered and their relative influence reflects their likelihood
given the data. In practice, however, it suffers from several drawbacks. In the first
place, it is rarely possible to actually enumerate all models in the hypothesis space
defined by a learning method. For example, even if we restrict the tree size, it is
usually infeasible to enumerate all decision trees that could be constructed for a
given classification problem. In order to overcome this problem, model sampling
methods are employed, which ideally should select a model with a probability that
corresponds to its likelihood given the data. However, most such methods are biased
and thus usually do not yield a representative sample of the total set of models,
sometimes seriously degrading the ensemble performance.
Injecting Randomness Both bagging and random subspace selection employ ran-
dom processes in order to obtain diverse predictors. This approach can of course be
generalized to the principle of injecting randomness into the learning process. For
example, such an approach is very natural and straightforward for artificial neural
networks (see Sect. 9.2): different initializations of the connection weights often
yield different learning results, which may then be used as the members of an ensem-
ble. Alternatively, the network structure can be modified, for example, by randomly
deleting a certain fraction of the connections between two consecutive layers.
Boosting While in all ensemble methods described so far the predictors can, in
principle, be learned in parallel, boosting constructs them progressively, with the
prediction results of the model learned last influencing the construction of the next
model [18, 19, 28]. Like bagging, boosting varies the training data. However, in-
stead of drawing random samples, boosting always works on the complete training
data set. It rather maintains and manipulates a data point weight for each training
example in order to generate diverse models. Boosting is usually described for clas-
sification problems with two classes, which are assumed to be coded by 1 and −1.
The best-known boosting approach is AdaBoost [18, 19, 28], which works as follows:
Initially, all data point weights are equal and therefore set to $w_{i,1} = 1/n$,
i = 1, . . . , n, where n is the size of the data set. After a predictor $M_t$ has been
constructed in step t using the current weights $w_{i,t}$, i = 1, . . . , n, it is applied to the
training data, and
$$e_t = \frac{\sum_{i=1}^{n} w_{i,t} \, y_i M_t(x_i)}{\sum_{i=1}^{n} w_{i,t}} \quad\text{and}\quad \alpha_t = \frac{1}{2} \ln \frac{1 + e_t}{1 - e_t}$$
are computed [29], where $x_i$ is the input vector, $y_i$ the class of the ith training
example, and $M_t(x_i)$ is the prediction of the model for the input $x_i$. Since
$y_i M_t(x_i)$ is +1 for a correct and −1 for a wrong prediction, $e_t$ measures the
weighted agreement between predictions and true classes, so that $\alpha_t > 0$ for any
model that is better than random guessing. The data point weights are then updated
according to
$$w_{i,t+1} = c \cdot w_{i,t} \exp(-\alpha_t y_i M_t(x_i)),$$
where c is a normalization constant chosen so that $\sum_{i=1}^{n} w_{i,t+1} = 1$. The
procedure of learning a predictor and updating the data point weights is repeated a
user-specified number of times $t_{\max}$. The constructed ensemble classifies new data points
by majority voting, with each model $M_t$ weighted with $\alpha_t$. That is, the joint
prediction is
$$M_{\text{joint}}(x_i) = \operatorname{sign}\left( \sum_{t=1}^{t_{\max}} \alpha_t M_t(x_i) \right).$$
Since there is no convergence guarantee and the performance of the ensemble clas-
sifier can even degrade after a certain number of steps, the inflection point of the
error curve over t is often chosen as the ensemble size [17].
For low-noise data, boosting clearly outperforms bagging and random subspace
selection in experimental studies [17]. However, if the training data contains noise,
the performance of boosting can degrade quickly, because it tends to focus on the
noisy data points (which are necessarily difficult to classify and thus receive high
weights after fairly few steps). As a consequence, boosting overfits the data. For
noisy data, bagging and random subspace selection yield much better results [17].
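To make the interplay of the agreement measure, the model weight, and the data point weights concrete, the following Python sketch runs the AdaBoost procedure described above on a tiny one-dimensional data set. It is only an illustration under several assumptions that are not part of the text: the base predictors are decision stumps (a threshold on a single attribute), the function names train_stump, stump_predict, adaboost, and predict are made up for this example, and the model weight is computed as alpha = 0.5 * ln((1 + e) / (1 - e)), the form used by Schapire and Singer [29], so that a stump that agrees with the classes more often than not (e > 0) receives a positive weight.

```python
import math

def train_stump(xs, ys, w):
    """Return (threshold, polarity) of the stump with the largest weighted agreement."""
    values = sorted(set(xs))
    # Candidate thresholds: one below the minimum, then midpoints between neighbors.
    thresholds = [values[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    best = None
    for thr in thresholds:
        for pol in (1, -1):
            agree = sum(wi * yi * (pol if x > thr else -pol)
                        for x, yi, wi in zip(xs, ys, w))
            if best is None or agree > best[0]:
                best = (agree, thr, pol)
    return best[1], best[2]

def stump_predict(stump, x):
    thr, pol = stump
    return pol if x > thr else -pol

def adaboost(xs, ys, t_max):
    """Learn up to t_max weighted stumps; classes must be coded as 1 and -1."""
    n = len(xs)
    w = [1.0 / n] * n                      # initially all weights are equal
    ensemble = []
    for _ in range(t_max):
        stump = train_stump(xs, ys, w)
        preds = [stump_predict(stump, x) for x in xs]
        e = sum(wi * yi * p for wi, yi, p in zip(w, ys, preds)) / sum(w)
        if e >= 1.0:                       # perfect stump: alpha would be infinite
            ensemble.append((1.0, stump))
            break
        if e <= 0.0:                       # no better than guessing: stop
            break
        alpha = 0.5 * math.log((1.0 + e) / (1.0 - e))
        ensemble.append((alpha, stump))
        # Misclassified points (y * prediction = -1) gain weight, the others lose weight.
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, ys, preds)]
        c = 1.0 / sum(w)                   # normalization constant
        w = [c * wi for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted majority vote of all stumps."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, t_max=3)
print([predict(ensemble, x) for x in xs])
```

On this toy data no single stump classifies all ten points correctly (the best one makes three errors), whereas the weighted vote of three stumps reproduces all classes.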
Stacking Like a mixture of experts, stacking takes the set of predictors as already
given and focuses on combining their individual predictions. The core idea is to
view the outputs of the predictors as new features and to use a learning algorithm
to find a model that combines them optimally [36]. Technically, a new data table
is set up with one row for each training example, the columns of which contain the
predictions of the different (level-1) models for that training example. In addition, a final
column states the true classes. With this new training data set a (level-2) model is
learned, the output of which is the prediction of the ensemble.
Note that the level-2 model may be of the same or of a different type than the
level-1 models. For example, the output of several regression trees (e.g., a random
forest) may be combined with a linear regression [6] or with a neural network.
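The construction of the level-2 data table can be sketched in a few lines of Python. The example below is a minimal illustration, not a reference implementation: the two level-1 regression models model_1 and model_2 are hypothetical stand-ins for predictors that are already given, and the level-2 model is assumed to be a linear regression without intercept, fitted by solving the 2x2 normal equations directly.

```python
# Two level-1 regression models, taken as already given (hypothetical stand-ins).
def model_1(x):
    return 2.0 * x

def model_2(x):
    return x + 3.0

def level2_table(xs):
    """One row per training example; the columns hold the level-1 predictions."""
    return [(model_1(x), model_2(x)) for x in xs]

def fit_level2(rows, targets):
    """Least-squares weights (a, b) for y ~ a*p1 + b*p2 via the normal equations."""
    s11 = sum(p1 * p1 for p1, _ in rows)
    s12 = sum(p1 * p2 for p1, p2 in rows)
    s22 = sum(p2 * p2 for _, p2 in rows)
    t1 = sum(p1 * y for (p1, _), y in zip(rows, targets))
    t2 = sum(p2 * y for (_, p2), y in zip(rows, targets))
    det = s11 * s22 - s12 * s12
    return (t1 * s22 - t2 * s12) / det, (s11 * t2 - s12 * t1) / det

def stacked_predict(weights, x):
    """Prediction of the ensemble: the level-2 model applied to the level-1 outputs."""
    a, b = weights
    return a * model_1(x) + b * model_2(x)

train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.5 * x + 1.5 for x in train_x]   # final column: the true target values
weights = fit_level2(level2_table(train_x), train_y)
print(weights, stacked_predict(weights, 5.0))
```

Here neither level-1 model matches the target on its own, but the level-2 regression finds that averaging the two reproduces it.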
Using predictors in practice is the classic application scenario for many data analy-
sis, data mining, and machine learning toolkits. As discussed earlier in this chapter,
there are two types of predictors: lazy learners, which require both training data and
the records to assign labels to during prediction, and the model-based predictors,
which first learn a model and then use this model later during prediction to deter-
mine the target value. This difference is also reflected in KNIME and R, as we will
see in the next two sections.
Nearest-neighbor algorithms are present in essentially all tools and are usually very
straightforward to use. In KNIME, the kNN node allows one to set the number
of neighbors to be considered and to choose whether those neighbors should be
weighted by their distance. Figure 9.12 shows a small workflow that reads in training
data and the data to be classified. Note that kNN is very sensitive to the chosen distance
function, so we should make sure to normalize the data and use exactly the same
normalization procedure for both the training and test data. This can be achieved
by using the “Normalizer” node together with the “Normalizer (Apply)” node, which copies the
settings from the first node (see also Sect. 6.6.1). We then feed those two data tables
into the “K-Nearest-Neighbor” node which adds a column with the predicted class
to the test data.
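Outside of KNIME, the same idea (learn the normalization on the training data, then apply exactly these settings to the data to be classified) can be sketched in plain Python. The functions fit_minmax, apply_minmax, and knn_predict as well as the small data set are assumptions made for this illustration; they do not correspond to any KNIME node or library API.

```python
import math
from collections import Counter

def fit_minmax(rows):
    """Learn (min, max) per column from the training data only."""
    return [(min(col), max(col)) for col in zip(*rows)]

def apply_minmax(rows, params):
    """Rescale rows with previously learned parameters."""
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, (lo, hi) in zip(row, params))
        for row in rows
    ]

def knn_predict(train_x, train_y, query, k):
    """Majority class among the k training points closest to the query."""
    def distance(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    neighbors = sorted(zip(train_x, train_y),
                       key=lambda pair: distance(pair[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Two attributes on very different scales.
train_x = [(1.0, 1000.0), (1.2, 5000.0), (3.0, 1200.0), (3.2, 4800.0)]
train_y = ["a", "a", "b", "b"]
query = (1.1, 3000.0)

params = fit_minmax(train_x)                      # settings come from the training data
scaled_train = apply_minmax(train_x, params)
scaled_query = apply_minmax([query], params)[0]   # the same settings are reused

raw_label = knn_predict(train_x, train_y, query, k=3)
scaled_label = knn_predict(scaled_train, train_y, scaled_query, k=3)
print(raw_label, scaled_label)
```

Without normalization the second attribute dominates the Euclidean distance and the query is assigned to class "b"; after normalization the first attribute counts as well, and the query is assigned to class "a".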
Both artificial neural networks and support vector machines follow a different
setup. Figure 9.13 shows the flow when training and applying a multilayer percep-
tron. KNIME uses two nodes, one creating the model based on training data and the
second node applying the model to new data. Note that the connection between the
model creation and model consumption (prediction) node has different port icons
indicating that, along this connection, a model is transported instead of a data table.
Note that the workflow also writes the network to a file (node “PMML Writer”)
in a standardized XML dialect (Predictive Model Markup Language, PMML), which
allows one to use this model in other learning environments as well as in databases and
other prediction/scoring toolkits. KNIME offers other types of neural networks as
Fig. 9.14 The base SVM nodes in KNIME offer well-known kernels along with simple dialogs to
adjust the corresponding parameters
kernel functions (or at least the settings of their respective parameters) are rather
sensitive to the normalization of the input data. For instance, the sigma parameter
of the RBF kernels controls the width of the basis functions in relation to the Eu-
clidean distance of the input vectors to the support vectors. Hence it is critical to
adjust this parameter accordingly. In the above example workflow we again use the
normalization module to avoid problems with different normalizations.
Using a different implementation, for instance, the Weka SVM classifier or regression
nodes, works analogously: one simply replaces the learner and predictor
nodes in the above workflow with their respective alternatives. The LibSVM [10]
integration is of particular interest, since it is computationally substantially more
efficient than both the native KNIME and Weka nodes. LibSVM offers numerous
different learning methods in addition to a number of kernel functions. We refer to
the respective literature for more information on the LibSVM library, which also
includes a nice guide for classification using SVMs.
An experimental extension to KNIME also allows one to separate the computa-
tion of the Gram matrix from the linear discriminant learning. This allows one to
easily integrate new kernel functions, for instance, to compute kernels on graphs or
text databases without having to recode the entire learning procedure as well. We
refer to the KNIME Labs webpage9 for more details.
> library(class)
> iris.knn <- knn(iris.training[,1:4],iris.test[,1:4],
iris.training[,5],k=3)
> table(iris.knn,iris.test[,5])
For the example of multilayer perceptrons in R, we use the same training and test
data as for the nearest-neighbor classifier above. The multilayer perceptron can only
process numerical values. Therefore, we first have to transform the categorical at-
tribute Species into a numerical attribute:
The multilayer perceptron is constructed and trained in the following way, where
the library neuralnet needs to be installed first:
> library(neuralnet)
> iris.nn <- neuralnet(x$Species + x$Sepal.Length ~
x$Sepal.Width + x$Petal.Length
+ x$Petal.Width, x,
hidden=c(3))
The first argument of neuralnet defines that the attributes species and sepal
length correspond to the output neurons. The other three attributes correspond to
the input neurons. x specifies the training data set. The parameter hidden defines
how many hidden layers the multilayer perceptron should have and how many neurons
each hidden layer should contain. In the above example, there is only one hidden
layer with three neurons. If we replaced c(3) by c(4,2), there would be two
hidden layers, one with four and one with two neurons.
The training of the multilayer perceptron can take some time, especially for larger
data sets.
When the training is finished, the multilayer perceptron can be visualized:
> plot(iris.nn)
We can then compare the target outputs for the training set with the outputs from
the multilayer perceptron. If we want to compute the squared errors for the second
output neuron—the sepal length—we can do this in the following way:
For support vector machines, we use the same training and test data as for the
nearest-neighbor classifier and for the neural networks. A support vector machine to
predict the species in the Iris data set based on the other attributes can be constructed
in the following way. The package e1071 is needed and should be installed first if
it has not been installed before:
> library(e1071)
> iris.svm <- svm(Species ~ ., data = iris.training)
> table(predict(iris.svm,iris.test[1:4]),iris.test[,5])
The last line prints the confusion matrix for the test data set.
The function svm works also for support vector regression. We could, for in-
stance, use
to predict the numerical attribute petal width based on the other attributes and to
compute the squared errors for the test set.
As an example for ensemble methods, we consider random forest with the training
and test data of the Iris data set as before. The package randomForest needs to
be installed first:
> library(randomForest)
> iris.rf <- randomForest(Species ~., iris.training)
> table(predict(iris.rf,iris.test[1:4]),iris.test[,5])
In this way, a random forest is constructed to predict the species in the Iris data set
based on the other attributes. The last line of the code prints the confusion matrix
for the test data set.
References
1. Aha, D.W., Kibler, D., Albert, M.K.: Instance-based learning algorithms. Mach. Learn. 6(1),
37–66 (1991)
2. Aurenhammer, F.: Voronoi diagrams—a survey of a fundamental geometric data structure.
ACM Comput. Surv. 23(3), 345–405 (1991)
3. Beckmann, N., Kriegel, H.-P., Schneider, R., Seeger, B.: The R*-tree: an
efficient and robust access method for points and rectangles. In: Proc. ACM SIGMOD Con-
ference on Management of Data (Atlantic City, NJ), pp. 322–331. ACM Press, New York
(1990)
4. Bentley, J.L.: Multidimensional divide and conquer. Commun. ACM 23(4), 214–229 (1980)
5. Blum, M., Floyd, R.W., Pratt, V., Rivest, R., Tarjan, R.: Time bounds for selection. J. Comput.
Syst. Sci. 7, 448–461 (1973)
6. Breiman, L.: Stacked regressions. Mach. Learn. 24(1), 49–64 (1996)
7. Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123–140 (1996)
8. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
9. Chang, C.-L.: Finding prototypes for nearest neighbor classifiers. IEEE Trans. Comput.
23(11), 1179–1184 (1974)
10. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. Manual (2001).
http://www.csie.ntu.edu.tw/~cjlin/libsvm
11. Cleveland, W.S.: Robust locally weighted regression and smoothing scatterplots. J. Am. Stat.
Assoc. 74(368), 829–836 (1979)
12. Cleveland, W.S., Devlin, S.J.: Locally-weighted regression: an approach to regression analysis
by local fitting. J. Am. Stat. Assoc. 83(403), 596–610 (1988)
13. Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13(1),
21–27 (1967)
14. Cormen, T.H., Stein, C., Leiserson, C.E., Rivest, R.L.: Introduction to Algorithms, 2nd edn.
MIT Press/McGraw-Hill, Cambridge/New York (2001)
15. Cristianini, N., Shawe-Taylor, J.: Kernel Methods for Pattern Analysis. Cambridge University
Press, Cambridge (2004)
16. Cybenko, G.V.: Approximation by superpositions of a sigmoidal function. Math. Control Sig-
nals Syst. 2, 303–314 (1989)
17. Dietterich, T.G.: Ensemble methods in machine learning. In: Proc. 1st Int. Workshop on Mul-
tiple Classifier Systems (MCS 2000, Cagliari, Italy). Lecture Notes in Computer Science, vol.
1857, pp. 1–15. Springer, Heidelberg (2000)
18. Freund, Y.: Boosting a weak learning algorithm by majority. In: Proc. 3rd Ann. Workshop
on Computational Learning Theory (COLT’90, Rochester, NY), pp. 202–216. Morgan Kauf-
mann, San Mateo (1990)
19. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an ap-
plication to boosting. J. Comput. Syst. Sci. 55(1), 119–139 (1997)
20. Friedman, J.H., Bentley, J.L., Finkel, R.A.: An algorithm for finding best matches in logarith-
mic expected time. ACM Trans. Math. Softw. 3(3), 209–226 (1977)
21. Haykin, S.: Neural Networks and Learning Machines. Prentice Hall, Englewood Cliffs (2008)
22. Ho, T.K.: The random subspace method for constructing decision forests. IEEE Trans. Pattern
Anal. Mach. Intell. 20, 832–844 (1998)
23. Mitchell, T.: Machine Learning. McGraw-Hill, New York (1997)
24. Manolopoulos, Y., Nanopoulos, A., Papadopoulos, A.N., Theodoridis, Y.: R-Trees: Theory
and Applications. Springer, Heidelberg (2005)
25. Polikar, R.: Ensemble based systems in decision making. IEEE Circuits Syst. Mag. 6, 21–45
(2006)
26. Rosenblatt, F.: The Perceptron: a probabilistic model for information storage and organization
in the brain. Psychol. Rev. 65, 386–408 (1958)
27. Rosenblatt, F.: Principles of Neurodynamics. Spartan Books, New York (1962)
28. Schapire, R.E.: Strength of weak learnability. Mach. Learn. 5, 197–227 (1990)
29. Schapire, R.E., Singer, Y.: Improved boosting algorithms using confidence-rated predictors.
Mach. Learn. 37(3), 297–336 (1999)
30. Schölkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization,
Optimization, and Beyond. MIT Press, Cambridge (2001)
31. Shakhnarovich, G., Darrell, T., Indyk, P. (eds.): Nearest Neighbor Methods in Learning and
Vision: Theory and Practice. MIT Press, Cambridge (2006)
32. Smola, A.J., Schölkopf, B.: A tutorial on support vector regression. Technical Report (2003).
http://eprints.pascal-network.org/archive/00002057/01/SmoSch03b.pdf
33. Tropf, H., Herzog, H.: Multidimensional range search in dynamically balanced trees. Angew.
Inform. 1981(2), 71–77 (1981)
34. Voronoi, G.: Nouvelles applications des paramètres continus à la théorie des formes quadra-
tiques. J. Reine Angew. Math. 133, 97–178 (1907)
35. Xu, L., Amari, S.-I.: Combining classifiers and learning mixture of experts. In: Encyclopedia
of Artificial Intelligence, pp. 318–326. IGI Global, Hershey (2008)
36. Wolpert, D.: Stacked generalization. Neural Netw. 5(2), 241–259 (1992)
Chapter 10
Evaluation and Deployment
The models generated by techniques from Chaps. 7–9 have already been evalu-
ated during modeling (as discussed in Chap. 5). Performance on technical measures
such as classification accuracy has been checked routinely whenever changes to the
model were made, to judge whether the modifications were beneficial. The models
were also interpreted to gain new insights for feature construction (or even data
acquisition). Once we are satisfied with the technical performance, what remains is
to judge the potential impact the model will have in the project’s domain should we
implement and deploy it. We will tackle these two steps only briefly in the following
two sections. A deployment in the form of, say, a software system for decision
support involves several planning and coordination tasks, which are out of the scope
of this book.
10.1 Evaluation
The analyst should assure herself that the interpretations of the data and models and
the conclusions drawn conform to the knowledge of the problem owner or, if
available, the domain expert. In particular, the evidence, findings, and conclusions (not
only a final model) need to be documented throughout the process and presented at
this evaluation phase, where the results are discussed to decide if and how they may
be deployed later. There are at least three more reasons for documentation: First, it
is important that all steps are reviewed from the perspective of the project’s owner to
guarantee flawless interpretations and conclusions. Second, the findings represent
important resources for future projects, where data understanding and data prepa-
ration phases may exhibit a large overlap with the current project. For instance, the
cognitive map may be revised or extended and possibly linked to evidence found in
the data. Finally, the drawbacks and problems faced may initiate improved data en-
tering procedures (to improve data quality where it is crucial) or establish new data
collections (data acquisition) such that future projects can benefit from an improved
situation.
M.R. Berthold et al., Guide to Intelligent Data Analysis, 297
Texts in Computer Science 42,
DOI 10.1007/978-1-84882-260-3_10, © Springer-Verlag London Limited 2010
298 10 Evaluation and Deployment
Testbed Evaluation If possible, a further evaluation that comes close to the final
area of operation may be carried out. This is particularly important if several models
are needed to achieve the project’s objective: they may perform well individually, but
not much is known about their orchestration and possible interdependencies.
In a narrow sense, we have seen this already in the third example of the overview
chapter (Sect. 2.4): the effects of Stan’s strategy are observed in the database by
Laura to judge its success. Depending on the project goal, such an evaluation
may be costly but offers a realistic assessment of the performance in practice. We
have, however, to plan this evaluation in the same way as we have done earlier with
the technical model assessment: we need a baseline to compare with. For instance,
if a new product is introduced and at the same time we start the new marketing cam-
paign, there is no way to separate the effect of marketing. Either we use data from
different points in time (which is not present in this example, because the product
was not available earlier) or we have to generate a control group for comparison
(people excluded from the marketing campaign).
To avoid self-fulfilling prophecies, the decision about the deployment should not
be based on observation of extreme cases. Suppose that we investigate the sales fig-
ures of individual customers and identify the, say, 10% best- and worst-performing
customers. Now, we just developed and started a marketing campaign and want to
convince others that we were really successful by showing the sales figures of the
same customers during the weeks following the campaign. Which group (best or
worst performers) would you choose? Which would somebody choose who does not
believe in the success of your campaign? Retail is not deterministic, and the needs
of people may vary from week to week substantially. By cherry-picking the worst
performing customers first and observing how they perform next week, it is very
likely that not all of them will perform as poorly as before (some other customers
will take their place), simply because of the inherent arbitrariness of many shopping decisions.
So we are almost guaranteed to see better figures (regardless of the marketing campaign)
when picking the lowest 10%. Likewise for the top 10%: customers
who had an extremely full cart last time probably still have some stockpile at home
and need less next time. So these customers will underperform. This effect becomes
more prominent the smaller the groups are chosen and the less the two measurements
are correlated. If we identify the worst-performing customers before and after
the campaign independently from the first selection (rather than sticking to the same
customers), the observed differences level off.
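The effect described above can be reproduced with a small simulation in which the campaign has no effect at all. The following Python sketch rests on assumptions chosen purely for illustration: weekly sales of a customer are modeled as a stable customer-specific level plus independent Gaussian noise, and the numbers (5000 customers, means and standard deviations) as well as the function name simulate are hypothetical.

```python
import random

def simulate(n_customers=5000, seed=1):
    """Compare the extreme 10% of week 1 with the same customers in week 2."""
    rng = random.Random(seed)
    level = [rng.gauss(100.0, 10.0) for _ in range(n_customers)]
    week1 = [v + rng.gauss(0.0, 20.0) for v in level]
    week2 = [v + rng.gauss(0.0, 20.0) for v in level]   # no campaign effect at all

    order = sorted(range(n_customers), key=lambda i: week1[i])
    tenth = n_customers // 10
    worst, best = order[:tenth], order[-tenth:]          # selected on week 1 only

    def mean(indices, data):
        return sum(data[i] for i in indices) / len(indices)

    return {
        "worst": (mean(worst, week1), mean(worst, week2)),
        "best": (mean(best, week1), mean(best, week2)),
    }

result = simulate()
print("worst 10%%: week 1 %.1f, week 2 %.1f" % result["worst"])
print("best 10%%:  week 1 %.1f, week 2 %.1f" % result["best"])
```

The worst-performing group improves and the best-performing group declines in the second week, although nothing was changed between the two weeks. A control group selected in the same way is therefore needed to separate a real campaign effect from this artifact.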
10.2 Deployment and Monitoring
drifting) decline in the model’s accuracy indicates divergence of the model and the
world, and we must consider a recalibration. The dataset that has been used for
modeling has, however, been constructed in a laborious preprocessing phase, which
sometimes also includes the target attribute. (See, for example, the introductory ex-
ample in Sect. 2.4, where the target attribute has been generated from a preceding
experiment.) Providing the information whether the prediction was correct or not is
thus highly important but often easier said than done.
Additionally, we may look directly for changes in the world that could possi-
bly invalidate the model. Some changes may be detected (semi-)automatically, and
others require external input. Possible reasons include:
• Objectives change. First of all, obviously the objectives can change because of,
say, a different business strategy. This may lead to revised goals and modified
problem statements. Such a management decision is typically propagated through
the whole company, but we must not forget to check the impact on the deployed
model and problem statement.
• Invalidated assumptions. Starting in the project understanding phase, we have
carefully collected all assumptions we made. We have checked, during data un-
derstanding and validation, that these assumptions hold in principle. So our model
may implicitly or explicitly rely on these assumptions, and if one of them no
longer holds, it may perform poorly because it was not designed for this kind of
setting. At least for those assumptions we were able to verify in the data, we
could periodically repeat this check to give at least a warning that the model
performance might be affected. Sometimes this is easily accomplished by statistical
tests that look for deviations in data distributions (such as the chi-square
goodness-of-fit test, Sect. A.4.3.4).
• The world changes. In the supermarket domain new products are offered, customer
habits adapt, other products become affordable, competitors may open new stores (or
close existing ones), etc. In manufacturing, newly installed machines (and measurement
devices that provide the data) may have different characteristics (e.g., precision,
operation point), machines deteriorate over time, operators may be replaced,
etc. The data that has been used to construct the model may be affected by any
of these factors, so that the distributions of arbitrary variables may change. We have seen
that many models estimate distribution parameters, derive optimal thresholds, etc.
from the data. So if these distributions change, the parameters and thresholds may
no longer be optimal. Again, statistical tests can be applied to detect changes in
the variables’ distributions.
• Shift from inter- to extrapolation. There is one particular situation (typically in
modeling technical systems) which deserves special care, namely when the model
is applied to data that is atypical of the data that has been used for learning. Unlike
humans, who would recognize (and possibly complain about) situations that
never occurred before, most models always deliver some result; some do
not even carry information about the range of data for which they were designed.
A polynomial of a higher degree, fitted to the data, usually yields poor predictions
outside the range of the training data (see, for example, Fig. 8.8 on page 230).
Thus the detection of such cases is extremely important, because otherwise the
model’s behavior becomes objectionable.
Compared to the training data, such cases of extrapolation represent outliers.
Therefore a feasible approach to overcome this problem is to employ cluster-
ing algorithms on the training data. As in outlier detection, the clusters may be
used as a coarse representation of the data distribution (say, k prototypes of k-
means, where k is usually chosen relatively large as the prototypes shall not find
some true clusters in the data but distribute themselves over the whole dataset1 ).
Then, for every new incoming data point, its typicality is determined by finding the
closest cluster (e.g., prototype). If the point is too far away from the known data, a case
of extrapolation is at hand.
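The prototype-based check for extrapolation from the last item of the list above can be sketched as follows. This is a minimal Python illustration under explicit assumptions: a very plain k-means implementation with random initial prototypes is used, "too far away" is taken to mean farther from the closest prototype than any training point is (other thresholds are equally conceivable), clusters consisting of a single outlier are not treated specially, and all function names are made up for this example.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means; returns k prototypes that spread over the data."""
    rng = random.Random(seed)
    prototypes = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: dist(p, prototypes[c]))
            groups[j].append(p)
        # New prototype = mean of its group; an empty group keeps its old prototype.
        prototypes = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else prototypes[j]
            for j, g in enumerate(groups)
        ]
    return prototypes

def nearest_distance(point, prototypes):
    return min(dist(point, q) for q in prototypes)

# Training data: a regular grid on the unit square.
train = [(i / 10, j / 10) for i in range(11) for j in range(11)]
prototypes = kmeans(train, k=10)

# Assumed threshold: the largest distance any training point has to its prototype.
limit = max(nearest_distance(p, prototypes) for p in train)

def is_extrapolation(point):
    return nearest_distance(point, prototypes) > limit

print(is_extrapolation((0.5, 0.5)), is_extrapolation((3.0, 3.0)))
```

A point inside the region covered by the training data passes the check, whereas a point far outside is flagged, so that the model's prediction for it can be withheld or marked as unreliable.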
The objective of monitoring is in the first place to avoid situations where the
model is still applied even if the model may be inappropriate. However, detecting
changes is often interesting by itself, as it may indicate the need to derive, say, new
marketing strategies. We have already seen such a case in the example of Sect. 2.3,
where Laura planned to compare models learned from data of different periods of
time. Such questions are investigated in a relatively new field called change min-
ing [2]. Prominent methods integrate the detection of change and an automatic adap-
tion. For instance, in [4] each node of a decision tree monitors the distribution of
newly arriving data and compares it with the training data distribution. If the de-
viation is significant, the node is replaced with a tree that has been learned in the
meantime with more recent examples.
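A simple way to compare the distribution of newly arriving data with the training data distribution is the chi-square goodness-of-fit statistic mentioned above. The following Python sketch is an illustration only: the category counts are invented, the function name chi_square_statistic is hypothetical, and the critical value 5.991 is the 95% quantile of the chi-square distribution with two degrees of freedom (three categories), taken from standard tables.

```python
def chi_square_statistic(observed_counts, expected_probs):
    """Sum of (observed - expected)^2 / expected over all categories."""
    n = sum(observed_counts)
    return sum((o - n * p) ** 2 / (n * p)
               for o, p in zip(observed_counts, expected_probs))

# Distribution of a nominal attribute in the training data (invented counts).
train_counts = [500, 300, 200]
train_probs = [c / sum(train_counts) for c in train_counts]

CRITICAL_VALUE = 5.991   # chi-square, 2 degrees of freedom, significance level 5%

def has_drifted(new_counts):
    return chi_square_statistic(new_counts, train_probs) > CRITICAL_VALUE

stable_batch = [50, 30, 20]    # same proportions as in the training data
shifted_batch = [30, 30, 40]   # the third category has become much more frequent

print(chi_square_statistic(stable_batch, train_probs), has_drifted(stable_batch))
print(chi_square_statistic(shifted_batch, train_probs), has_drifted(shifted_batch))
```

Such a check can be repeated periodically for every attribute whose distribution the model relies on; a significant deviation is a warning that the model may need to be recalibrated.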
References
1. Becker, K., Ghedini, C.: A documentation infrastructure for the management of data mining
projects. Inf. Softw. Technol. 47, 95–111 (2005)
2. Böttcher, M., Höppner, F., Spiliopoulou, M.: On exploiting the power of time in data mining.
SIGKDD Explorations 10(2), 3–11 (2008)
3. Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., Wirth, R.: Cross
Industry Standard Process for Data Mining 1.0, Step-by-step Data Mining Guide. CRISP-DM
consortium (2000)
4. Hulten, G., Spencer, L., Domingos, P.: Mining time changing data streams. In: Proc. Int. ACM
SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD’01, San Francisco, CA), pp.
97–106. ACM Press, New York (2001)
5. Wirth, R., Hipp, J.: CRISP-DM: towards a standard process model for data mining. In: Proc. 4th
Int. Conf. on the Practical Application of Knowledge Discovery and Data Mining, pp. 29–39.
London, United Kingdom (2000)
1 Sometimes a k-means cluster consists of a single outlier—such clusters should be, of course,
rejected.
Appendix A
Statistics
Since classical statistics provides many data analysis methods and supports and jus-
tifies a lot of others, we provide in this appendix a brief review of some basics of
statistics. We discuss descriptive statistics, inferential statistics, and needed funda-
mentals from probability theory. Since we strove to make this appendix as self-
contained as possible, some overlap with the chapters of this book is unavoidable.
However, material is not simply repeated here but presented from a slightly different
point of view, emphasizing different aspects and using different examples.
Before we can turn to statistical procedures, we have to introduce some terms and
notions, together with some basic notation, with which we can refer to data.
• object, case
Data describe objects, cases, people etc. For example, medical data usually de-
scribes patients, stockkeeping data usually describes components, devices or gen-
erally products, etc. The objects or cases are sometimes called the statistical units.
• (random) sample
The set of objects or cases that are described by a data set is called a sample, its
size (number of elements) is called the sample size. If the objects or cases are
the outcomes of a random experiment (for example, drawing the numbers in a
lottery), we call the sample a random sample.
• attribute, feature
The objects or cases of the sample are characterized by attributes or features that
refer to different properties of these objects or cases. Patients, for example, may
be described by the attributes sex, age, weight, blood group, etc., component parts
may have features like their physical dimensions or electrical parameters.
• attribute value
The attributes, by which the objects/cases are described, can take different values.
For example, the sex of a patient can be male or female, its age can be a positive
integer number, etc. The set of values an attribute can take is called its domain.
Depending on the kind of the attribute values, one distinguishes different scale types
(also called attribute types). This distinction is important, because certain charac-
teristic measures (which we study in Sect. A.2.3) can be computed only for certain
scale types. Furthermore, certain statistical procedures presuppose attributes of spe-
cific scale types. Table A.1 shows the most important scale types nominal, ordinal,
and metric (or numerical), together with the core operations that are possible on
them and a few examples of attributes having the scale type.
The task of descriptive statistics is to describe states and processes on the basis of
observed data. The main tools to tackle this task are tabular and graphical represen-
tations and the computation of characteristic measures.
Tables are used to display observed data in a clearly arranged form, and also to
collect and display characteristic measures. The simplest tabular representation of
a (one-dimensional) data set is the frequency table, which is the basis for many
graphical representations. A frequency table records for every attribute value its (ab-
solute and/or relative) frequency in a sample, where the absolute frequency fk is
simply the occurrence frequency of an attribute value $a_k$ in the sample, and the relative
frequency $r_k$ is defined as $r_k = f_k / n$ with the sample size $n$. In addition, columns
for the cumulated (absolute and/or relative) frequencies (also simply referred to as
frequency sums) may be present. As an example, we consider the data set
x = (3, 4, 3, 2, 5, 3, 1, 2, 4, 3, 3, 4, 4, 1, 5, 2, 2, 3, 5, 3, 2, 4, 3, 2, 3),
which may be, for instance, the grades of a written exam at school.1 A frequency
table for this data set is shown in Table A.2. Obviously, this table provides a much
better view of the data than the raw data set as it is shown above, which only lists
the sample values (and not even in a sorted fashion).
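The absolute, relative, and cumulated frequencies of the example data set can be computed with a few lines of Python. The sketch below is only meant to illustrate the definitions; the function name frequency_table and the column layout of the printed output are choices made for this example.

```python
from collections import Counter

x = (3, 4, 3, 2, 5, 3, 1, 2, 4, 3, 3, 4, 4, 1, 5, 2, 2, 3, 5, 3, 2, 4, 3, 2, 3)

def frequency_table(sample):
    """Rows of (value, absolute, relative, cumulated absolute, cumulated relative)."""
    n = len(sample)
    counts = Counter(sample)
    rows, cumulated = [], 0
    for value in sorted(counts):
        f = counts[value]
        cumulated += f
        rows.append((value, f, f / n, cumulated, cumulated / n))
    return rows

for row in frequency_table(x):
    print("%d  %2d  %.2f  %2d  %.2f" % row)
```

For this sample of size 25, the value 3 occurs most often (9 times, relative frequency 0.36), and the cumulated relative frequency reaches 1 at the largest value.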
A two- or generally multidimensional frequency table, into which the (relative
and/or absolute) frequency of every combination of attribute values is entered, is also
called a contingency table. An example of a contingency table for two attributes A
and B (with absolute frequencies), which also records the row and column sums,
that is, the frequencies of the values of the individual attributes, is shown in Ta-
ble A.3.
Graphical representations serve the purpose to make tabular data more easily com-
prehensible. The main tool to achieve this is to use geometric quantities—like
lengths, areas, and angles—to represent numbers, since such geometric properties
are more quickly interpretable for humans than abstract numbers. The most impor-
tant types of graphical representations are:
1 In most of Europe it is more common to use numbers for grades, with 1 being the best and 6
being the worst possible, while in the United States it is more common to use letters, with A being
the best and F being the worst possible. However, there is an obvious mapping between the two
scales. We chose numbers here to emphasize that nominal scales may use numbers and thus may
look deceptively metric.
Fig. A.1 Pole (a) and bar chart (b) and frequency polygons (c) for the data shown in Table A.2
• pole/stick/bar chart
Numbers, which may be, for instance, the frequencies of different attribute values
in a sample, are represented by the lengths of poles, sticks, or bars. In this way
a good impression especially of ratios can be achieved (see Figs. A.1a and b, in
which the frequencies of Table A.2 are displayed).
• area and volume charts
Area and volume charts are closely related to pole and bar charts: the difference
is merely that they use areas and volumes instead of lengths to represent numbers
and their ratios (see Fig. A.2, which again shows the frequencies of Table A.2).
However, area and volume charts are usually less comprehensible (maybe except
if the represented quantities are actually areas and volumes), since human be-
ings usually have trouble comparing areas and volumes and often misjudge their
numerical ratios. This can already be seen in Fig. A.2: only very few people cor-
rectly estimate that the area of the square for the value 3 (frequency 9) is three
times as large as that of the square for the value 5 (frequency 3).
• frequency polygons and line chart
A frequency polygon results if the ends of the poles of a pole diagram are connected
by lines, so that a polygonal course results. This can be advantageous if the at-
tribute values have an inherent order and one wants to show the development of
the frequency along this order (see Fig. A.1c). In particular, it can be used if
numbers are to be represented that depend on time. This particular case is usually
referred to as a line chart, even though the name is not exclusively reserved for
this case.
• mosaic chart
Contingency tables (that is, two- or generally multidimensional frequency tables)
can nicely be represented as mosaic charts. For the first attribute, the horizontal
direction is divided like in a stripe diagram. Each section is then divided accord-
ing to the second attribute along the vertical direction—again like in a stripe di-
agram (see Fig. A.4). Mosaic charts can have advantages over two-dimensional
bar charts, because bars at the front can hide bars at the back, making it difficult
to see their height, as shown in Fig. A.5. In principle, arbitrarily many attributes
can be displayed by subdividing the resulting mosaic pieces alternatingly along
the horizontal and vertical axis. However, even if one uses the widths of the gaps
and colors in order to help a viewer to identify attribute values, mosaic charts can
easily become confusing if one tries to display too many attributes.
• histogram
In principle, a histogram looks like a bar chart, with the only difference that the
domain of the underlying attribute is metric (numerical). As a consequence, it is
usually impossible to simply enumerate the frequencies of the individual attribute
values (because there are usually too many different values), but one has to form
counting intervals, which are usually called bins or buckets. The width (or, if the
domain is fixed, equivalently the number) of these bins has to be chosen by a user.
All bins should have the same width, since histograms with varying bin widths
are usually more difficult to read—for the same reasons why area charts are more
difficult to interpret than bar charts (see above). In addition, a histogram provides
a good impression of the data only if an appropriate bin width has been chosen; the
impression also depends on the values onto which the borders of the bins fall (see
Sect. 4.3.1).
• scatter plot
A scatter plot displays a two-dimensional data set of metric attributes by interpret-
ing the sample values as coordinates of a point in a metric space (see Fig. A.6).
A scatter plot is very well suited if one wants to see whether the two represented
quantities depend on each other or vary independently (see also Sects. A.2.4
and 8.3).
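The dependence of a histogram on the bin width and on the position of the bin borders, mentioned in the histogram item above, can be illustrated with a small sketch. The function name histogram_counts and the sample values are assumptions made only for this illustration.

```python
import math

def histogram_counts(values, width, origin=0.0):
    """Count values in equal-width bins [origin + k*width, origin + (k+1)*width)."""
    counts = {}
    for v in values:
        k = math.floor((v - origin) / width)  # index of the bin that contains v
        counts[k] = counts.get(k, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical sample of a metric attribute.
data = [1.9, 2.1, 2.2, 3.8, 4.1, 4.2, 5.9, 6.1]

coarse = histogram_counts(data, width=2.0)                # borders at 0, 2, 4, ...
shifted = histogram_counts(data, width=2.0, origin=1.0)   # same width, borders at 1, 3, 5, ...
print(coarse)
print(shifted)
```

Both calls use the same bin width, yet the counts per bin differ, because values close to a border change bins when the borders are moved.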
Examples of how graphical representations can be misleading can be found in the
highly recommended books [6, 8]. This property is sometimes (all too often, actually)
exploited to convey a deceptively favorable or unfavorable impression, in particular
in the press and in advertisements.
The goal of computing characteristic measures is to summarize the data set, that
is, to capture characteristic and relevant properties in as few quantities as possible.
There are basically three types of characteristic measures:
• location measures
As their name indicates, location measures specify the location of the
(majority of the) data in the domain of an attribute by a single number or attribute
value. Thus location measures summarize the data heavily.
• dispersion measures
Given the value of a location measure, dispersion measures specify how much the
data scatter around this value (how much they deviate from it) and thus charac-
terize how well the location measure captures the location of the data.
• shape measures
Given the values of a location and a dispersion measure, shape measures char-
acterize the distribution of the data by comparing its shape to a reference shape.
The most common reference shape is the normal distribution (see Sect. A.3.5.7).
In the following we study these characteristic measures, which will turn out to be
very useful in inferential statistics (see Sect. A.4), in more detail.
Median (Central Value) The (empirical) median or central value x̃ can be in-
troduced as a value that minimizes the sum of the absolute deviations. That is, a
median x̃ is any value that satisfies
\[ \sum_{i=1}^{n} |x_i - \tilde{x}| = \min. \]
In order to find a value for x̃, we take the derivative of the left-hand side and equate
the result to zero (since the derivative must vanish at the minimum). In this way we
obtain
\[ \sum_{i=1}^{n} \operatorname{sgn}(x_i - \tilde{x}) = 0, \]
where sgn is the sign function (which is −1 if its argument is negative, +1 if its
argument is positive, and 0 if its argument is 0).² Therefore a median is a value that
lies “in the middle of the data.” That is, in the data set there are as many values
greater than x̃ as smaller than x̃ (this justifies the expression central value as an
alternative to median).
With the above characterization, the median is not always uniquely determined.
For example, if all sample values are distinct, there is only a unique middle element
if the sample size is odd. If it is even, there may be several values that satisfy the
above defining equations. As an example, consider the data set (1, 2, 3, 4). Any
value in the interval [2, 3] minimizes the sum of absolute deviations. In order to
obtain a unique value, one usually defines the median as the arithmetic mean of the
two sample values in the middle of the (sorted) data set in such a case. In the above
example, this would result in $\tilde{x} = \frac{2+3}{2} = \frac{5}{2}$. Note that the median is always uniquely
determined for even sample size if the two middle values are equal.
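Formally the median is defined as follows: let $x = (x_{(1)}, \ldots, x_{(n)})$ be a sorted
data set, that is, we have $\forall i, j: (j > i) \rightarrow (x_{(j)} \ge x_{(i)})$. Then
\[ \tilde{x} = \begin{cases}
x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd,} \\
\frac{1}{2}\left( x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right) & \text{if } n \text{ is even,}
\end{cases} \]
is called the (empirical) median of the data set x.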
The median can be computed for ordinal and metric attributes, because all it re-
quires is a test for greater or less than, namely for sorting the sample values. For
ordinal values, computing the arithmetic mean of the two middle values for even
sample size is replaced by simply choosing one of them, thus eliminating the need
for the computation. Note, however, that the above characterization of the median as
the minimizer of the absolute deviations can, of course, not be used for ordinal at-
tributes as they do not allow for computing differences. We used it here nevertheless
in order to show the analogy to the mean, which is considered below.
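The formal definition of the median for metric attributes can be sketched in a few lines. The function names median and sum_abs_dev are chosen here for illustration; the second one only serves to check the characterization as a minimizer of the sum of absolute deviations on the example (1, 2, 3, 4).

```python
def median(values):
    """Empirical median of a sample of metric values."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]                      # x_((n+1)/2) in 1-based indexing
    return 0.5 * (xs[mid - 1] + xs[mid])    # mean of x_(n/2) and x_(n/2+1)

def sum_abs_dev(values, c):
    """Sum of the absolute deviations of the sample values from c."""
    return sum(abs(x - c) for x in values)

sample = [1, 2, 3, 4]
print(median(sample))                       # middle of the interval [2, 3]
print([sum_abs_dev(sample, c) for c in (1.5, 2, 2.5, 3, 3.5)])
```

For this sample every value in [2, 3] yields the same sum of absolute deviations, while values outside this interval yield a larger sum.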
Quantiles We have seen in the preceding section that the median is an attribute
value such that half of the sample values are less than it, and the other half is greater.
This idea can easily be generalized by finding an attribute value such that a certain
fraction p, 0 < p < 1, of the sample values is less than this attribute value (and a
fraction of 1 − p of the sample values are greater). These values are called (empirical)
p-quantiles. The median in particular is the (empirical) $\frac{1}{2}$-quantile of a data set.
Other important quantiles are the first, second, and third quartiles, for which
$p = \frac{1}{4}$, $\frac{2}{4}$, and $\frac{3}{4}$, respectively, of the data set are smaller (therefore the median
is also identical to the second quartile), and the deciles (k tenths of the data set are
smaller) and the percentiles (k hundredths of the data set are smaller).
Note that for metric attributes, it may be necessary, depending on the sample
size and the exact sample values, to introduce adaptations that are analogous to the
computation of the arithmetic mean of the middle values for the median.
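One possible adaptation of this kind is linear interpolation between the two neighboring sample values. The following sketch is only one of several conventions in use and is not necessarily the one intended by the text; the function name quantile and the position formula (n − 1)p are assumptions. For p = 1/2 it reproduces the median as defined above.

```python
import math

def quantile(values, p):
    """Empirical p-quantile (0 < p < 1) with linear interpolation between neighbors."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if not values:
        raise ValueError("quantiles of an empty sample are undefined")
    xs = sorted(values)
    h = (len(xs) - 1) * p            # fractional 0-based position in the sorted sample
    lo = math.floor(h)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])

sample = [1, 2, 3, 4, 5]
quartiles = [quantile(sample, k / 4) for k in (1, 2, 3)]
print(quartiles)
```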
2 Note that, with the standard definition of the sign function, this equation cannot always be
satisfied. In this case one contents oneself with the closest possible approximation to zero.
Mean While the median minimizes the absolute deviations of the sample values,
the (empirical) mean x̄ can be defined as the value that minimizes the sum of the
squares of the deviations of the sample values. That is, the mean is the attribute
value that satisfies
\[ \sum_{i=1}^{n} (x_i - \bar{x})^2 = \min. \]
In order to find a value for x̄, we take the derivative of the left-hand side and equate
it to zero (since the derivative must vanish at a minimum). In this way we obtain
\[ \sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = 0, \]
and thus
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. \]
Therefore the mean of a sample is the arithmetic mean of the sample values.
Even though the mean is the most commonly used location measure for metric at-
tributes (note that it cannot be applied for ordinal attributes as it requires summation
and thus an interval scale), the median should be preferred for
• few measurement values,
• asymmetric (skewed) data distributions, and
• likely presence of outliers,
since the median is more robust in these cases and conveys a better impression of the
data. In order to make the mean more robust against outliers, it is often computed
by eliminating the largest and the smallest sample values (a typical procedure for
averaging the ratings of the judges in sports events) or even multiple extreme values,
like all values before the 1st and beyond the 99th percentile.
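The effect of an outlier on the mean, and the two remedies mentioned above (median and trimming), can be sketched as follows. The ratings are hypothetical values chosen for illustration, and the function name trimmed_mean is an assumption.

```python
def mean(values):
    """Arithmetic mean of the sample values."""
    return sum(values) / len(values)

def trimmed_mean(values, k=1):
    """Mean after dropping the k smallest and the k largest sample values."""
    xs = sorted(values)
    if len(xs) <= 2 * k:
        raise ValueError("no values left after trimming")
    return mean(xs[k:len(xs) - k])

# Hypothetical ratings of five judges; the last one is an outlier.
ratings = [8.5, 8.6, 8.7, 8.8, 2.0]

print(mean(ratings))            # pulled down strongly by the single outlier
print(trimmed_mean(ratings))    # largest and smallest rating removed
print(sorted(ratings)[len(ratings) // 2])   # median for odd sample size
```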
Range The range of a data set is simply the difference between the largest and
the smallest sample value:
\[ R = x_{\max} - x_{\min} = \max_{i=1}^{n} x_i - \min_{i=1}^{n} x_i. \]
The range is a very intuitive dispersion measure. However, it is very sensitive to
outliers, which tend to corrupt one or even both of the values it is computed from.
Interquantile Range The difference between the (empirical) (1 − p)- and the
(empirical) p-quantiles of a data set is called the p-interquantile range, $0 < p < \frac{1}{2}$.
Commonly used interquantile ranges are the interquartile range ($p = \frac{1}{4}$, difference
between the third and first quartiles), the interdecile range ($p = \frac{1}{10}$, difference
between the 9th and the 1st deciles), and the interpercentile range ($p = \frac{1}{100}$, difference
between the 99th and the 1st percentiles). For small p, the p-interquantile range
transfers to the range the idea of making the mean more robust by eliminating
extreme values.
Mean Absolute Deviation The mean absolute deviation is the arithmetic mean of
the absolute deviations of the sample values from the (empirical) median or mean:
\[ d_{\tilde{x}} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \tilde{x}| \]
is the mean absolute deviation from the median, and
\[ d_{\bar{x}} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| \]
is the mean absolute deviation from the mean. We always have $d_{\tilde{x}} \le d_{\bar{x}}$, because the
median minimizes the sum and thus also the mean of the absolute deviations.
Variance and Standard Deviation In analogy to the absolute deviation, one may
also compute the mean squared deviation. (Recall that the mean minimizes the sum
of the squared deviations.) However, instead of
\[ m_2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2, \]
it is more common to employ
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
as a dispersion measure, which is called the (empirical) variance of the sample.
The reason for the value n − 1 in the denominator is provided by inferential statis-
tics, in which the characteristic measures of descriptive statistics are related to cer-
tain parameters of probability distributions and density functions (see Sect. A.4).
A detailed explanation will be provided in Sect. A.4.2, which deals with parameter
estimation (unbiasedness of an estimator for the variance of a normal distribution).
The positive square root of the variance, that is,
\[ s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}, \]
is called the (empirical) standard deviation of the sample.
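Note that the (empirical) variance can often be computed more conveniently with
a formula that is obtained with the following transformation:
\begin{align*}
s^2 &= \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
     = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i^2 - 2 x_i \bar{x} + \bar{x}^2 \right) \\
    &= \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + \sum_{i=1}^{n} \bar{x}^2 \right)
     = \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \right) \\
    &= \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right)
     = \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right).
\end{align*}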
The advantage of this formula is that it allows us to compute both the mean and
the variance of a sample with one pass through the data, by computing the sum of
sample values and the sum of their squares. A computation via the original formula,
on the other hand, needs two passes: in the first pass the mean is computed, and in
the second pass the variance is computed from the sum of the squared deviations.
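The one-pass and the two-pass computation can be compared with a short sketch. The function names are chosen for illustration. It should be kept in mind that, in floating-point arithmetic, the one-pass formula may lose precision through cancellation if the mean is large compared to the spread of the data, so the sketch is meant to illustrate the formula rather than to recommend it for all data.

```python
def mean_and_variance_one_pass(values):
    """Mean and empirical variance from the sum and the sum of squares (one pass)."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:
        n += 1
        total += x
        total_sq += x * x
    if n < 2:
        raise ValueError("at least two sample values are needed")
    mean = total / n
    variance = (total_sq - total * total / n) / (n - 1)
    return mean, variance

def variance_two_pass(values):
    """Empirical variance from the squared deviations from the mean (two passes)."""
    n = len(values)
    mean = sum(values) / n                                  # first pass
    return sum((x - mean) ** 2 for x in values) / (n - 1)   # second pass

sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean_and_variance_one_pass(sample))
print(variance_two_pass(sample))
```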
If one plots a histogram of observed metric data, one often obtains a bell shape. In
practice, this bell shape usually differs more or less from the reference of an ideal
Gaussian bell curve (normal distribution, see Sect. A.3.5.7). For example, the em-
pirical distribution, as shown by the histogram, is asymmetric or differently curved.
With shape measures one tries to capture these deviations.
Skewness The skewness or simply skew α3 states whether, and if so, by how
much a distribution differs from a symmetric distribution.³ The skewness is
computed as
\[ \alpha_3 = \frac{1}{n \cdot s^3} \sum_{i=1}^{n} (x_i - \bar{x})^3
            = \frac{1}{n} \sum_{i=1}^{n} z_i^3
   \quad \text{with} \quad z_i = \frac{x_i - \bar{x}}{s}, \]
that is, z is the z-score normalized variable. For a symmetric distribution, α3 = 0. If
the skew is positive, the distribution is steeper on the left, and if it is negative, the
distribution is steeper on the right (see Fig. A.7).
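The skewness formula can be checked on small samples with the following sketch, which uses the empirical standard deviation s (with n − 1 in the denominator) exactly as in the formula above. The function name skewness and the two samples are assumptions made for illustration.

```python
import math

def skewness(values):
    """Skewness as the mean of the cubed z-scores of the sample values."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return sum(((x - mean) / s) ** 3 for x in values) / n

symmetric = [1, 2, 3, 4, 5]       # symmetric around the mean 3
long_right = [1, 1, 1, 2, 10]     # most values on the left, long tail to the right

print(skewness(symmetric))
print(skewness(long_right))
```

The symmetric sample yields a skewness of zero, while the sample with the long right tail yields a positive value.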
3 The index 3 indicates that the skew is the 3rd moment of the sample around the mean—see the
Some characteristic measures, namely the median, the mean, the range, and the
interquartile range, are often displayed jointly in a so-called box plot (see Fig. A.9):
the outer lines show the range, and the box in the middle, which gives this diagram
its name, indicates the interquartile range. Inside the box, the median is drawn
as a solid line and the mean as a dashed line (alternatively, mean and median can be
drawn in different colors). The range may be replaced by the interpercentile range.
In this case the extreme values outside this range are depicted as individual dots.
Sometimes the box that represents the interquartile range is drawn constricted at the
location of the mean in order to emphasize this location. This simple diagram
provides a good compact impression of the rough shape of the data distribution.
4 The index 4 indicates that the kurtosis is the 4th moment around the mean—see the defining
Mean For multidimensional data, the mean turns into the vector mean of the data
points. For example, for two-dimensional data, we have
\[ \overline{(x, y)} = \frac{1}{n} \sum_{i=1}^{n} (x_i, y_i) = (\bar{x}, \bar{y}). \]
It should be noted that one obtains the same result if one forms the vector that
consists of the means of the individual attributes. Hence, for computing the mean,
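the attributes can be treated independently.
\[ \Sigma = \frac{1}{n-1} \sum_{i=1}^{n}
   \left( \begin{pmatrix} x_i \\ y_i \end{pmatrix} - \begin{pmatrix} \bar{x} \\ \bar{y} \end{pmatrix} \right)
   \left( \begin{pmatrix} x_i \\ y_i \end{pmatrix} - \begin{pmatrix} \bar{x} \\ \bar{y} \end{pmatrix} \right)^{\top}
   = \begin{pmatrix} s_x^2 & s_{xy} \\ s_{xy} & s_y^2 \end{pmatrix}, \]
where
\begin{align*}
s_x^2  &= \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right) && \text{(variance of } x\text{)}, \\
s_y^2  &= \frac{1}{n-1} \left( \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 \right) && \text{(variance of } y\text{)}, \\
s_{xy} &= \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} \right) && \text{(covariance of } x \text{ and } y\text{)}.
\end{align*}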
In addition to the variances of the individual attributes, a covariance matrix con-
tains an additional quantity, the so-called covariance. It yields information about the
strength of the (linear) dependence of the two attributes. However, since its value
also depends on the variance of the individual attributes, it is normalized by dividing
it by the standard deviations of the individual attributes, which yields the so-called
correlation coefficient (more precisely, Pearson’s product moment correlation
coefficient; see Sect. 4.4 for alternatives, especially for ordinal attributes),
\[ r = \frac{s_{xy}}{s_x s_y}. \]
It should be noted that the correlation coefficient is identical to the covariance of the
two attributes if their values are first normalized to mean 0 and standard deviation 1.
The correlation coefficient has a value between −1 and +1 and characterizes the
strength of the linear dependence of two metric quantities: if all data points lie ex-
actly on an ascending straight line, its value is +1. If they lie exactly on a descending
straight line, its value is −1. In order to convey an intuition of intermediate values,
Fig. A.10 shows some examples.
Note that it does not mean that two measures are (stochastically) independent if
their correlation coefficient vanishes. For example, if the data points lie symmet-
rically on a parabola, the correlation coefficient is r = 0. Nevertheless there is, of
course, a clear and exact functional dependence of the two measures. If the correla-
tion coefficient is zero, this only means that this dependence is not linear.
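The covariance, the correlation coefficient, and the parabola example can be reproduced with a short sketch. The function names and the sample points are assumptions made for illustration.

```python
import math

def covariance(xs, ys):
    """Empirical covariance s_xy (with n - 1 in the denominator)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson's correlation coefficient r = s_xy / (s_x * s_y)."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [-2, -1, 0, 1, 2]
ascending = [2 * x + 1 for x in xs]     # points on an ascending straight line
descending = [-3 * x for x in xs]       # points on a descending straight line
parabola = [x * x for x in xs]          # points lying symmetrically on a parabola

print(correlation(xs, ascending), correlation(xs, descending), correlation(xs, parabola))
```

For the parabola, y is an exact function of x, and yet the correlation coefficient vanishes, since the dependence has no linear component.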
Since the covariance and correlation describe the linear dependence of two mea-
sures, it is not surprising that they can be used to fit a straight line to the data, that
is, to determine a so-called regression line. This line is defined as
\[ (y - \bar{y}) = \frac{s_{xy}}{s_x^2} (x - \bar{x})
   \quad \text{or} \quad
   y = \frac{s_{xy}}{s_x^2} (x - \bar{x}) + \bar{y}. \]
The regression line can be seen as a kind of mean function, which assigns a mean
of the y-values to each of the x-values (conditional mean). This interpretation is
supported by the fact that the regression line minimizes the sum of the squares of
the deviations of the data points (in y-direction), just like the mean. More details
about regression and the method of least squares, together with generalizations to
larger function classes, can be found in Sect. 8.3.
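The defining equation of the regression line translates directly into code. The following sketch returns the slope s_xy / s_x² and the intercept of the line; the function name regression_line and the data points are assumptions made for illustration.

```python
def regression_line(xs, ys):
    """Return (slope, intercept) of the regression line y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    s_xx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    slope = s_xy / s_xx
    intercept = my - slope * mx      # the line passes through the mean (mx, my)
    return slope, intercept

xs = [1, 2, 3, 4]
ys = [2, 4, 5, 7]
slope, intercept = regression_line(xs, ys)
print(slope, intercept)
```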
Correlations between the attributes of a data set can be used to reduce its dimension:
if an attribute is (strongly) correlated with another, then this attribute is essentially
a linear function of the other (plus some noise). In such a case it is often sufficient
to consider only one of the two attributes, since the other can be reconstructed (ap-
proximately) via the regression line. However, this approach has the disadvantage
that it is not trivial to decide which of several correlated attributes should be kept
and which can be discarded.
A better approach to reduce the dimension of a data set is the so-called principal
component analysis (PCA; see also Sect. 4.3.2.1). The basic idea of this procedure is
not to select a subset of the features of the data set, but to construct a small number
of new features as linear combinations of the original ones. These new quantities
are supposed to capture the greater part of the information in the data set, where the
information content is measured by the (properly normalized) variance: the larger
the variance, the more important the (constructed) feature.
In order to find the linear combinations that define the new features, the data
is first normalized to mean 0 and standard deviation 1 in all original attributes, so
that the scale of the attributes (that is, for example, the units in which they were
measured) does not influence the result. In the next step one tries to find a new
basis for the data space, that is, perpendicular directions. This is done in such a way
that the first direction is the one in which the (normalized) data exhibits the largest
variance. The second direction is the one which is perpendicular to the first and in
which the data exhibits the largest variance among all directions perpendicular to
the first, and so on. Finally, the data is transformed to the new basis of the data
space, and some of the constructed features are discarded, namely those for which
the transformed data exhibits the lowest variances. How many features are discarded
is decided based on the sum of the variances of the kept features relative to the total
sum of the variances of all features.
Formally, the perpendicular directions referred to above can be found with a
mathematical method that is known as principal axes transformation. This trans-
formation is applied to the correlation matrix, that is, the covariance matrix of the
data set that has been normalized to mean 0 and standard deviation 1 in all fea-
tures. That is, one finds a rotation of the coordinate system such that the correlation
matrix becomes a diagonal matrix. The elements of this diagonal matrix are the
variances of the data set w.r.t. the new basis of the data space, while all covariances
vanish. This is also a fundamental goal of principal component analysis: one wants
to obtain features that are uncorrelated, that is, free of linear dependence. As is well
known from linear algebra (see, for example, [5, 11]), a principal axes transformation
consists basically in computing the eigenvalues and eigenvectors of a matrix. The eigenvalues show up
on the diagonal of the transformed correlation matrix, the eigenvectors (which can
be obtained in parallel with appropriate methods) indicate the desired directions in
the data space. The directions w.r.t. which the data is now described are selected
based on the eigenvalues, and finally the data is projected to the subspace chosen
in this way.
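For two attributes the procedure can be sketched without any linear algebra library, because the correlation matrix [[1, r], [r, 1]] has the closed-form eigenvalues 1 + r and 1 − r with eigenvectors along (1, 1) and (1, −1). The function names and the data are assumptions made for illustration; for more than two attributes a numerical eigenvalue routine would be needed instead of this special case.

```python
import math

def standardize(values):
    """Normalize a sample to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / s for v in values]

def pca_2d(xs, ys):
    """Principal axes of two attributes and the data expressed in the new basis."""
    zx, zy = standardize(xs), standardize(ys)
    n = len(zx)
    r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)   # correlation coefficient
    c = 1.0 / math.sqrt(2.0)
    # Eigenpairs of [[1, r], [r, 1]], sorted by descending eigenvalue (variance).
    pairs = sorted([(1.0 + r, (c, c)), (1.0 - r, (c, -c))], reverse=True)
    # Coordinates of every (normalized) data point w.r.t. the principal axes.
    scores = [[a * v[0] + b * v[1] for _, v in pairs] for a, b in zip(zx, zy)]
    return pairs, scores

pairs, scores = pca_2d([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print([eigenvalue for eigenvalue, _ in pairs])
```

The variance of the data along the first axis equals the larger eigenvalue, and the covariance between the two new features vanishes.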
The following physical analogy may make the idea of principal axes transformation
clearer: how a solid body reacts to a rotation around a given axis can be
described by the so-called tensor of inertia [9]. Formally, this tensor is a symmetric
3 × 3 matrix,
\[ \Theta = \begin{pmatrix}
\Theta_{xx} & \Theta_{xy} & \Theta_{xz} \\
\Theta_{xy} & \Theta_{yy} & \Theta_{yz} \\
\Theta_{xz} & \Theta_{yz} & \Theta_{zz}
\end{pmatrix}. \]
The diagonal elements of this matrix are the moments of inertia of the body w.r.t.
the axes that pass through its center of gravity⁷ and are parallel to the axes of the
coordinate system that we use to describe the rotation. The remaining (off-diagonal)
elements of the matrix are called deviation moments and describe the forces that act
perpendicular to the axes during the rotation.⁸ However, for any solid body, regardless
of its shape, there are three axes w.r.t. which the deviation moments vanish, the
so-called principal axes of inertia.⁹ As an example, Fig. A.11 shows the principal
axes of inertia of a box. The principal axes of inertia are always perpendicular to
each other. In the coordinate system that is spanned by the principal axes of inertia,
the tensor of inertia is a diagonal matrix.
7 The inertia behavior of axes that do not pass through the center of gravity can easily be described
with Steiner’s law. However, this goes beyond the scope of this discussion; see standard textbooks
on theoretical mechanics like [9] for details.
8 These forces result from the fact that generally the vector of angular momentum is not parallel to
the vector of angular velocity. However, this again leads beyond the scope of this discussion.
Formally, the principal axes of inertia are found by carrying out a principal axes
transformation of the tensor of inertia (given w.r.t. an arbitrary coordinate system):
its eigenvectors are the directions of the principal axes of inertia.
In the real world, the deviation moments cause shear forces, which lead to vibrations
and jogs in the bearings of the axis. Since such vibrations and jogs naturally
lead to quick abrasion of the bearings, one tries to minimize the deviation moments.
As a consequence, a car mechanic who balances a wheel can be seen as carrying out
a principal axes transformation (though not in mathematical form), because he/she
tries to equate the rotation axis with a principal axis of inertia. However, he/she does
not do so by changing the direction of the rotation axis, as this is fixed in the wheel.
Rather, he/she changes the mass distribution by adding, removing, and shifting small
weights so that the deviation moments vanish.
Based on this analogy, we may say that a statistician looks, in the first step of a
principal component analysis, for axes around which a mass distribution with unit
weights at the locations of the data points can be rotated without vibrations or jogs
in the bearings. Afterwards, he/she selects a subset of the axes by removing those axes
around which the rotation needs the most energy, that is, those axes for which the
moments of inertia are largest (in the direction of these axes, the variance is smallest,
and perpendicularly to them, the variance is largest).
Formally, the axes are selected via the percentage of explained variance. It can
be shown that the sum of the eigenvalues of a correlation matrix equals the dimen-
sion m of the data set, that is, it is equal to the number of features (see, for example,
[5, 11]). In this case it is plausible to define the proportion of the total variance
that is captured by the j th principal axis as
\[ p_j = \frac{\lambda_j}{m} \cdot 100\%, \]
where λj is the eigenvalue corresponding to the j th principal axis.
9 Note that a body may possess more than three axes w.r.t. which the deviation moments vanish.
For a sphere with homogeneous mass distribution, for example, any axis that passes through the
center is such an axis. However, any body, regardless of its shape, has at least three such axes.
\[ \sum_{j=1}^{k} p_{(j)} \ge \alpha \cdot 100\% \]
with a proportion α that has to be chosen by a user (for example, α = 0.9). The
corresponding k principal axes are chosen as the new features, and the data points
are projected to them. Alternatively, one may specify the number of features to
which one desires to reduce the data set and then choose the axes following a
descending proportion $\frac{\lambda_j}{m}$. In this case the above sum provides information about how much
information contained in the original data is lost.
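The selection of axes by explained variance can be sketched as follows. The sketch assumes that p(j) denotes the proportions sorted in descending order and that the smallest k satisfying the condition is wanted; the function name choose_axes and the eigenvalues are hypothetical.

```python
def choose_axes(eigenvalues, alpha=0.9):
    """Indices of the fewest principal axes whose explained variance reaches alpha."""
    m = len(eigenvalues)   # for a correlation matrix the eigenvalues sum to m
    order = sorted(range(m), key=lambda j: eigenvalues[j], reverse=True)
    chosen, explained = [], 0.0
    for j in order:
        chosen.append(j)
        explained += eigenvalues[j] / m    # proportion captured by the j-th axis
        if explained >= alpha:
            break
    return chosen, explained

# Hypothetical eigenvalues of a 4 x 4 correlation matrix (they sum to 4).
chosen, explained = choose_axes([2.6, 0.2, 1.1, 0.1], alpha=0.9)
print(chosen, explained)
```

Here two of the four axes suffice to capture more than 90% of the total variance, so the remaining share is the information that is lost by the reduction.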
\[ r = s_{xy} = \frac{1}{9} \sum_{i=1}^{10} x_i y_i = \frac{-8.28}{9} = -\frac{23}{25} = -0.92. \]
10 Note, however, that for larger matrices, this method is numerically unstable and should be re-
A.3.1 Probability
Probability theory is concerned with random events. That is, it is known what spe-
cific events can occur in principle, but it is uncertain which of the possible events
will actually occur in any given instance. Examples are the throw of a coin or the cast
of a die. In probability theory a numerical quantity, called probability, is assigned to
random events, which is intended to capture the chance or the tendency of the event
to actually occur, at least in relation to the other possible outcomes.
However, this does not work. Even though it would be necessary for the above definition
to be valid, it is not possible to rule out sequences of experiments in which
\[ \forall n \ge n_0(\varepsilon): \; P(A) - \varepsilon \le r_n(A) \le P(A) + \varepsilon \]
does not hold, regardless of how large n0 is chosen. (Even though it is highly unlikely,
it is not impossible that repeated throws of a die result only in, say, ones.)
Nevertheless the intuition of a probability as a relative frequency is helpful, espe-
cially if it is interpreted as our estimate of the relative frequency of the event (in
future executions of the same experiment).
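The intuition of a probability as an estimate of a relative frequency can be made tangible with a small simulation. The sketch assumes a fair die, simulated by a pseudo-random number generator; the function name relative_frequency is chosen for illustration, and the seed is fixed only to make the run reproducible.

```python
import random

def relative_frequency(event, n, rng):
    """Relative frequency of an event in n simulated throws of a fair die."""
    hits = sum(1 for _ in range(n) if event(rng.randint(1, 6)))
    return hits / n

rng = random.Random(0)
for n in (10, 100, 10_000, 100_000):
    # Relative frequency of the event "the die shows a six".
    print(n, relative_frequency(lambda pips: pips == 6, n, rng))
```

For large n the relative frequencies are typically close to 1/6, although, as stated above, no finite number of throws can guarantee this.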
As an example, consider the sex of a newborn child. Usually the number of girls
roughly equals the number of boys. Hence the relative frequency of a girl being born
is equal to the relative frequency of a boy being born. Therefore we say that the
probability that a girl is born (and also the probability that a boy is born) equals $\frac{1}{2}$.
Note that this probability cannot be derived from considerations of symmetry (as
we can derive the probabilities of heads and tails when throwing a coin).¹²
Definition A.1 Let Ω be a base set of elementary events (the sample space). Any
subset E ⊆ Ω is called an event. A system $S \subseteq 2^{\Omega}$ of events is called an event
algebra iff
1. S contains the certain event Ω and the impossible event ∅.
2. If an event A belongs to S, then the event Ā = Ω − A also belongs to S.
12 It should be noted, though, that explanations from evolutionary biology for the usually equal
Hence the probability P (A) can be seen as a function that is defined on an event
algebra or on a σ -algebra and that has certain properties. From the above definition
several immediate consequences follow:
1. For every event A, we have P (Ā) = 1 − P (A).
2. The impossible event has probability 0, that is, P (∅) = 0.
3. From A ⊆ B it follows that P (A) ≤ P (B).
4. For arbitrary events A and B, we have
P (A − B) = P (A ∩ B̄) = P (A) − P (A ∩ B).
5. For arbitrary events A and B, we have
P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
6. Since P (A ∩ B) cannot be negative, it follows that
P (A ∪ B) ≤ P (A) + P (B).
7. By simple induction we infer from the additivity axiom:
if A1 , . . . , Am are (finitely many) pairwise mutually exclusive events, we have
$P(A_1 \cup \cdots \cup A_m) = \sum_{i=1}^{m} P(A_i)$.
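These consequences can be checked on a concrete finite probability space. The sketch assumes a single fair die, so that the probability of an event is the number of its elements divided by six; the events A and B are chosen arbitrarily for illustration.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))   # sample space of one fair die (assumed)

def prob(event):
    """Probability of an event as the fraction of elementary events it contains."""
    return Fraction(len(event & OMEGA), len(OMEGA))

A = frozenset({2, 4, 6})   # "the number of pips is even"
B = frozenset({4, 5, 6})   # "the number of pips is at least four"

complement_rule = prob(OMEGA - A) == 1 - prob(A)                 # consequence 1
difference_rule = prob(A - B) == prob(A) - prob(A & B)           # consequence 4
union_rule = prob(A | B) == prob(A) + prob(B) - prob(A & B)      # consequence 5
# Consequence 7 for three pairwise mutually exclusive events.
parts = [frozenset({1}), frozenset({2}), frozenset({3})]
additivity = prob(frozenset().union(*parts)) == sum(prob(p) for p in parts)

print(complement_rule, difference_rule, union_rule, additivity)
```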
Kolmogorov’s system of axioms is consistent, because there are actual systems that
satisfy these axioms. This system of axioms allows us to construct probability theory
as part of measure theory and to interpret a probability as a nonnegative, normalized,
additive set function and thus as a measure (in the sense of measure theory).
After we have defined probability, we now turn to basic methods and theorems of
probability theory, which we explain with simple examples. Among these are the
computation of probabilities with combinatorial and geometrical approaches, the
notions of conditional probability and (conditionally) independent events, the prod-
uct law, the theorem of total probability, and finally the very important Bayes’ rule.
13 This is not quite correct, since surveys show that the frequency of births is not quite uniformly
The surprising property of this formula is that for m as low as 23, we already have
P (A23 ) ≈ 0.507. Hence for 23 or more people, it is more likely that two people have
their birthday on the same day than that all have their birthdays on different days.
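The value for m = 23 can be reproduced numerically. The sketch assumes that A_m denotes the event that at least two of m people share a birthday and that all 365 days of a year are equally likely; the function name prob_shared_birthday is chosen for illustration.

```python
def prob_shared_birthday(m, days=365):
    """Probability that at least two of m people have their birthday on the same day."""
    p_all_different = 1.0
    for i in range(m):
        p_all_different *= (days - i) / days   # person i+1 avoids the i days already taken
    return 1.0 - p_all_different

for m in (10, 22, 23, 50):
    print(m, round(prob_shared_birthday(m), 3))
```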
In many cases one has to determine the probability of an event A when it is already
known that some other event B has occurred. Such probabilities are called condi-
tional probabilities and denoted P (A | B). In a strict sense the “unconditional” prob-
abilities we considered up to now are also conditional, because they always refer to
specific frame conditions and circumstances. For example, we assumed that the die
we throw is symmetric and made of homogeneous material. Only under these, and
possibly other, silently adopted frame conditions (like no electromagnetic influence
on the die, etc.) did we state that the probability of each number is $\frac{1}{6}$.
Definition A.4 Let A and B be two arbitrary events with P (B) > 0. Then
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \]
is called the conditional probability of A given B.
A simple example: two dice are cast. What is the probability that one of the dice
displays a five if it is known that the sum of the pips is eight? If two dice are cast,
we have 36 elementary events, five of which satisfy that the sum of the pips is eight
(4 + 4, 5 + 3, and 6 + 2, where the last two have to be counted twice due to the two
possible distributions of the numbers to the two dice). That is, we have $P(B) = \frac{5}{36}$.
The event “The sum of the pips is eight, and one of the dice shows a five” can be
obtained from two elementary events: either the first die shows a five, and the second
a three, or vice versa. Therefore $P(A \cap B) = \frac{2}{36}$, and thus the desired conditional
probability is $P(A \mid B) = \frac{2}{5}$.
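The result of this example can be verified by enumerating all 36 elementary events. The helper names prob and cond_prob are chosen for illustration; exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

OMEGA = list(product(range(1, 7), repeat=2))   # all outcomes of casting two dice

def prob(event):
    """Probability of an event given as a predicate on elementary events."""
    return Fraction(sum(1 for w in OMEGA if event(w)), len(OMEGA))

def cond_prob(a, b):
    """Conditional probability P(a | b) = P(a and b) / P(b), requires P(b) > 0."""
    return prob(lambda w: a(w) and b(w)) / prob(b)

def shows_five(w):
    return 5 in w

def sum_is_eight(w):
    return sum(w) == 8

print(prob(sum_is_eight), cond_prob(shows_five, sum_is_eight))
```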
Theorem A.2 For an arbitrary, but fixed, event B with P (B) > 0, the function PB
that is defined as PB (A) = P (A | B) is a probability which satisfies PB (B̄) = 0.
With the help of the notion of a conditional probability, we can now define the
notion of (stochastic) independence of events. This notion can be motivated as fol-
lows: if, for example, smoking had no influence on the development of lung cancer,
then the proportion of people with lung cancer among smokers should be (roughly)
equal to the proportion of people with lung cancer among nonsmokers.
Definition A.5 Let B be an event with 0 < P (B) < 1. Then an event A is called
(stochastically) independent of B iff
P (A | B) = P (A | B̄).
The following two relations, which are usually easier to handle, are equivalent:
14 Formally this argument is not quite valid, though, since for P (B) = 0, the conditional probability
P (A | B) is undefined (see Definition A.4). However, since the equation holds for any value that
may be fixed for P (A | B) in case that P (B) = 0, we allow ourselves to be slightly sloppy here.
Note that for the complete (stochastic) independence of more than two events,
their pairwise independence is necessary but not sufficient.
Let us consider a simple example: A white and a red die are cast. Let A be
the event “The number of pips shown by the white die is even,” B the event “The
number of pips shown by the red die is odd,” and C the event “The sum of the pips
is even.” It is easy to check that A and B are pairwise (stochastically) independent,
as well as B and C, and also A and C. However, due to P (A ∩ B ∩ C) = 0 (since the
sum of an even number and an odd number must be odd), they are not completely
independent.
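The claims of this example can be checked by enumeration. The sketch assumes that complete independence requires the product relation for every subset of the events, in particular for all three together; the helper names are chosen for illustration.

```python
from fractions import Fraction
from itertools import product

OMEGA = list(product(range(1, 7), repeat=2))   # (white die, red die)

def prob(event):
    """Probability of an event given as a predicate on elementary events."""
    return Fraction(sum(1 for w in OMEGA if event(w)), len(OMEGA))

def A(w):
    return w[0] % 2 == 0             # white die shows an even number

def B(w):
    return w[1] % 2 == 1             # red die shows an odd number

def C(w):
    return (w[0] + w[1]) % 2 == 0    # sum of the pips is even

def both(e, f):
    return lambda w: e(w) and f(w)

pairwise = all(prob(both(e, f)) == prob(e) * prob(f)
               for e, f in [(A, B), (B, C), (A, C)])
p_all_three = prob(lambda w: A(w) and B(w) and C(w))

print(pairwise, p_all_three, prob(A) * prob(B) * prob(C))
```

All three pairs satisfy the product relation, but the probability of the joint occurrence of A, B, and C is 0, whereas the product of the three individual probabilities is 1/8.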
Another generalization of (stochastic) independence can be achieved by intro-
ducing another condition for all involved probabilities:
Note that two events A and B may be conditionally independent but not uncon-
ditionally independent and vice versa. To see this, consider again the example of the
red and white die discussed above. A and B are independent but not conditionally
independent given C, because if C holds, only one of A and B can be true, even
though either of them can be true (provided that the other is false). Hence the joint
probability P (A ∩ B | C) = 0, while P (A | C) · P (B | C) > 0. Examples for the re-
verse case (conditional independence, but unconditional dependence) are also easy
to find.
Often we face situations in which the probabilities of disjoint events Ai are known,
which together cover the whole sample space. In addition, we know the conditional
probabilities of an event B given these Ai . Desired is the (unconditional) probabil-
ity of the event B. As an example, consider a plant that has a certain number of
machines to produce the same product. Suppose that the capacities of the individ-
ual machines and their probabilities (rates) of producing faulty products are known.
The total probability (rate) of faulty products is to be computed. This rate can easily
be found by using the law of total probability. However, before we turn to it, we
formally define the notion of an event partition.
The law of total probability can be derived by applying the product rule (see
Theorem A.1) to the relation
\[
P(B) = P(B \cap \Omega) = P\Bigl(B \cap \bigcup_{i=1}^{m} A_i\Bigr) = P\Bigl(\bigcup_{i=1}^{m} (B \cap A_i)\Bigr) = \sum_{i=1}^{m} P(B \cap A_i),
\]
the last step of which follows from the additivity axiom.
With the help of this theorem we can easily derive the important Bayes’ rule. To
do so, it is merely necessary to realize that the product rule can be applied in two
ways to the simultaneous occurrence of two events A and B:
P (A ∩ B) = P (A | B) · P (B) = P (B | A) · P (A).
Dividing the right-hand side by P (B) (which, of course, must be positive to be
able to do so), yields the simple form of Bayes’ rule. By applying the law of total
probability to the denominator, we obtain the extended form.
This rule¹⁵ is also called the formula for the probability of hypotheses, since it
can be used to compute the probability of hypotheses (for example, the probability
that a patient suffers from a certain disease) if the probabilities are known with
which the hypotheses lead to the considered events Ai (for example, medical symptoms).
15 Note that Thomas Bayes (1702–1761) did not derive this formula, despite the fact that it bears
his name. In the form given here it was stated only later by Pierre-Simon de Laplace (1749–1827).
This supports a basic law of the history of science: a law or an effect that bears the name of a
person was found by somebody else.
As an example, we consider five urns with the following contents:
• two urns with the contents A1 with two white and three black balls each,
• two urns with the contents A2 with one white and four black balls each,
• one urn with the contents A3 with four white and one black ball.
Suppose that an urn is chosen at random and a ball is drawn from it, also at random.
Let this ball be white: this is the event B. What is the (posterior) probability that the
ball stems from an urn with the contents A3 ?
According to our presuppositions, we have:
\[
P(A_1) = \frac{2}{5}, \quad P(A_2) = \frac{2}{5}, \quad P(A_3) = \frac{1}{5},
\]
\[
P(B \mid A_1) = \frac{2}{5}, \quad P(B \mid A_2) = \frac{1}{5}, \quad P(B \mid A_3) = \frac{4}{5}.
\]
We start by applying the law of total probability in order to find the probabil-
ity P (B):
\[
P(B) = P(B \mid A_1)P(A_1) + P(B \mid A_2)P(A_2) + P(B \mid A_3)P(A_3)
     = \frac{2}{5} \cdot \frac{2}{5} + \frac{1}{5} \cdot \frac{2}{5} + \frac{4}{5} \cdot \frac{1}{5} = \frac{10}{25}.
\]
Using Bayes’ rule, we then obtain
\[
P(A_3 \mid B) = \frac{P(B \mid A_3)P(A_3)}{P(B)} = \frac{\frac{4}{5} \cdot \frac{1}{5}}{\frac{10}{25}} = \frac{2}{5}.
\]
Likewise, we can obtain
\[
P(A_1 \mid B) = \frac{2}{5} \quad\text{and}\quad P(A_2 \mid B) = \frac{1}{5}.
\]
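The urn computation can be retraced with exact fractions. The sketch below first applies the law of total probability and then Bayes' rule; the dictionary layout and the function name `posterior` are illustrative choices of ours.

```python
from fractions import Fraction

# Prior probabilities P(A_i): 2, 2 and 1 of the 5 urns have contents A1, A2, A3.
prior = {"A1": Fraction(2, 5), "A2": Fraction(2, 5), "A3": Fraction(1, 5)}
# Likelihoods P(B | A_i): fraction of white balls in an urn of each type.
likelihood = {"A1": Fraction(2, 5), "A2": Fraction(1, 5), "A3": Fraction(4, 5)}

def posterior(prior, likelihood):
    """Return P(B) and the posterior probabilities P(A_i | B)."""
    # Law of total probability: P(B) = sum_i P(B | A_i) * P(A_i).
    p_b = sum(likelihood[a] * prior[a] for a in prior)
    # Bayes' rule: P(A_i | B) = P(B | A_i) * P(A_i) / P(B).
    return p_b, {a: likelihood[a] * prior[a] / p_b for a in prior}

p_b, post = posterior(prior, likelihood)
print(p_b, post)
```

Note that `Fraction` reduces 10/25 to 2/5 automatically, and that the three posterior probabilities sum to 1, as they must for an event partition.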
In Sect. A.3.1 we already considered the relation between the probability P (A) and
the relative frequency rn (A) = hn (A)/n of an event A, where hn (A) is the absolute
frequency of this event in n trials. We saw that it is not possible to define the prob-
ability of A as the limit of the relative frequency as n → ∞. However, a slightly
weaker statement holds, namely the famous law of large numbers.
Definition A.9 A random experiment in which the event A can occur is repeated
n times. Let Ai be the event that A occurs in the ith trial. Then the sequence of
experiments of length n is called a Bernoulli experiment¹⁶ for the event A iff the
following conditions are satisfied:
1. ∀1 ≤ i ≤ n : P (Ai ) = p.
2. The events A1 , . . . , An are fully independent.
16 The notion “Bernoulli experiment” was introduced in recognition of the Swiss mathematician
Theorem A.6 (Bernoulli’s law of large numbers) Let hn (A) be the number of oc-
currences of an event A in n independent trials of a Bernoulli experiment, where in
each of the trials the probability P (A) of the occurrence of A equals an arbitrary,
but fixed, value p, 0 ≤ p ≤ 1. Then for any ε > 0,
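\[
\lim_{n \to \infty} P\left( \left| \frac{h_n(A)}{n} - p \right| < \varepsilon \right)
= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{z^2}{2}} \, dz = 1.
\]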
This property of the relative frequency rn (A) = hn (A)/n can be interpreted as follows:
even though p = P (A) is not the limit of the relative frequency for infinite sample
size, as von Mises [12] tried to define it, it can be seen as very probable (practically
certain) that in a Bernoulli experiment of sufficiently large size n, the relative
frequency rn (A) differs only very little from a fixed value, the probability p. This
is often referred to by saying that the relative frequency rn (A) converges in
probability to the probability P (A) = p. With the above law we have the fundamental
relationship between the relative frequency and the probability of an event A.
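The convergence in probability can be made tangible with a small simulation. The following sketch assumes p = 0.3 and a seeded pseudo-random number generator (both arbitrary choices of ours) and prints the deviation |rn (A) − p| for growing n.

```python
import random

def relative_frequency(p, n, rng):
    """Relative frequency of an event with probability p in n Bernoulli trials."""
    hits = sum(1 for _ in range(n) if rng.random() < p)
    return hits / n

rng = random.Random(42)   # fixed seed, so that the run is reproducible
p = 0.3                   # assumed probability of the event A
for n in (10, 100, 10_000, 100_000):
    deviation = abs(relative_frequency(p, n, rng) - p)
    print(n, deviation)
```

In a single run the deviations need not decrease monotonically; the law only states that large deviations become more and more improbable as n grows.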
In many situations we are not interested in the complete set of elementary events and
their individual probabilities, but in the probabilities of events that result from a par-
tition of the sample space. That is, these events are mutually exclusive, but together
cover the whole sample space (an event partition, see above). The probabilities of
such events are commonly described by so-called random variables, which can be
seen as transformations from one sample space into another [10].
Definition A.10 A function X that is defined on a sample space Ω and has the do-
main dom(X) is called random variable if the preimage of any subset of its domain
possesses a probability. Here the preimage of a subset U ⊆ dom(X) is defined as
X −1 (U ) = {ω ∈ Ω | X(ω) ∈ U }.
The simplest random variable is obviously one that has the elementary events as
its possible values. In principle, however, any set can be the domain of a random
variable. Most often the domain is the set of real numbers.
Definition A.11 A function X that maps a sample space Ω to the real numbers is
called a real-valued random variable if it possesses the following properties: for
any x ∈ R and any interval (a, b], a < b, (where a = −∞ is possible), the events
Ax = {ω ∈ Ω | X(ω) = x} and A(a,b] = {ω ∈ Ω | a < X(ω) ≤ b} possess probabili-
ties.
Sometimes the required properties are stated with an interval [a, b) that is open
on the right. This does not lead to any significant differences.
Every discrete real-valued random variable X has a step function F as its distribu-
tion function, which has jumps of height P (X = x ) only at those values x that are
in the domain dom(X). From x < y it follows that F (x) ≤ F (y), that is, F is mono-
tone nondecreasing. The function values F (x) become arbitrarily small if only x is
chosen small enough, while the values F (x) get arbitrarily close to 1 for growing x.
Therefore we have
\[
\lim_{x \to -\infty} F(x) = 0 \quad\text{and}\quad \lim_{x \to \infty} F(x) = 1.
\]
Vice versa, from any step function F , which satisfies the above conditions, the dis-
tribution (x, P (X = x)), x ∈ dom(X), of a real-valued random variable can be de-
rived.
With the help of a distribution function F it is very simple to compute the prob-
ability that X assumes a value from a given interval:
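For a half-open interval (a, b], the standard relation is P (a < X ≤ b) = F (b) − F (a). The sketch below illustrates this for a fair die, which is an example we assume here merely for illustration; the names `distribution`, `F`, and `interval_probability` are our own.

```python
from fractions import Fraction

# Assumed example: X is the number of pips shown by a fair die.
distribution = {x: Fraction(1, 6) for x in range(1, 7)}

def F(x):
    """Distribution function F(x) = P(X <= x); a step function for discrete X."""
    return sum(p for value, p in distribution.items() if value <= x)

def interval_probability(a, b):
    """P(a < X <= b) computed as F(b) - F(a)."""
    return F(b) - F(a)

print(interval_probability(2, 5))   # probability of the values 3, 4 and 5
```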
Since P (−∞ < X < ∞) = P ({ω ∈ Ω | −∞ < X(ω) < ∞}) = P (Ω), the den-
sity function f satisfies
\[
\int_{-\infty}^{\infty} f(u) \, du = 1.
\]
For continuous random variables, similar relations hold as for discrete real-valued
random variables.
Definition A.16 Let X and Y be two real-valued random variables. The function F
which is defined for all pairs (x, y) ∈ R² as
F (x, y) = P (X ≤ x, Y ≤ y)
is called the distribution function of the two-dimensional random variable (X, Y ).
The one-dimensional distribution functions
F1 (x) = P (X ≤ x) and F2 (y) = P (Y ≤ y)
are called marginal distribution functions.
For discrete random variables, the notion of their joint distribution is defined in
an analogous way.
Definition A.17 Let X and Y be two discrete random variables. Then the total
of pairs ∀x ∈ dom(X) : ∀y ∈ dom(Y ) : ((x, y), P (X = x, Y = y)) is called the
joint distribution of X and Y . The one-dimensional distributions ∀x ∈ dom(X) :
(x, Σ_y P (X = x, Y = y)) and ∀y ∈ dom(Y ) : (y, Σ_x P (X = x, Y = y)), where the
sums run over all y ∈ dom(Y ) and all x ∈ dom(X), respectively, are
marginal distributions.
The function f (x, y) is called the joint density function or simply joint density of
the random variables X and Y . The one-dimensional density functions
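\[
f_1(x) = \int_{-\infty}^{+\infty} f(x, y) \, dy \quad\text{and}\quad f_2(y) = \int_{-\infty}^{+\infty} f(x, y) \, dx
\]
are called marginal density functions.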
By extending the notion of the independence of events one can define the notion
of the independence of random variables.
Definition A.20 Two discrete random variables X and Y with joint distribution
∀x ∈ dom(X) : ∀y ∈ dom(Y ) : ((x, y), P (X = x, Y = y)) and marginal distribu-
tions ∀x ∈ dom(X) : (x, P (X = x)) and ∀y ∈ dom(Y ) : (y, P (Y = y)) are called
(stochastically) independent, if
∀x ∈ dom(X) : ∀y ∈ dom(Y ): P (X = x, Y = y) = P (X = x) · P (Y = y).
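For discrete random variables with finite domains, this condition can be checked mechanically: compute the marginal distributions from the joint distribution and compare every joint probability with the product of the marginals. The following sketch uses two hypothetical joint distributions of our own (two independent fair coins, and a coin together with a copy of itself).

```python
from fractions import Fraction
from itertools import product

def marginals(joint):
    """Marginal distributions of X and Y from a joint distribution {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p   # sum over y
        py[y] = py.get(y, 0) + p   # sum over x
    return px, py

def is_independent(joint):
    """Check P(X = x, Y = y) = P(X = x) * P(Y = y) for all pairs (x, y)."""
    px, py = marginals(joint)
    return all(joint.get((x, y), 0) == px[x] * py[y]
               for x, y in product(px, py))

# Hypothetical example 1: two fair coins tossed independently.
coins = {(x, y): Fraction(1, 4) for x in (0, 1) for y in (0, 1)}
# Hypothetical example 2: Y is always equal to X.
copy = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}

print(is_independent(coins), is_independent(copy))
```

Exact fractions are used so that the equality test is not disturbed by rounding errors; with floating point numbers a tolerance would be needed.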
Definition A.21 Let X, Y , and Z be three discrete random variables. Let
X and Y have the conditional joint distribution ∀x ∈ dom(X) : ∀y ∈ dom(Y ) :
∀z ∈ dom(Z) : ((x, y, z), P (X = x, Y = y | Z = z)) given Z and the conditional
marginal distributions ∀x ∈ dom(X) : ∀z ∈ dom(Z) : ((x, z), P (X = x | Z = z)) and
∀y ∈ dom(Y ) : ∀z ∈ dom(Z) : ((y, z), P (Y = y | Z = z)). X and Y are conditionally
(stochastically) independent given Z if
∀x ∈ dom(X) : ∀y ∈ dom(Y ) : ∀z ∈ dom(Z) :
P (X = x, Y = y | Z = z) = P (X = x | Z = z) · P (Y = y | Z = z).
Fig. A.13 Illustration of marginal dependence and conditional independence: the two measures
describing the points are marginally dependent (left), but if a third, binary variable is fixed and thus
the data points are split into two groups, the dependence vanishes (middle and right)
plot of a sample from two continuous random variables that is shown in Fig. A.13
on the left. Obviously, the two quantities are not independent, because there is a
clear tendency for the Y variable (vertical axis) to take lower values if X (hori-
zontal axis) has higher values. Hence X and Y are not marginally independent. To
make the example more vivid, we can interpret the horizontal axis as the average
number of cigarettes a person smoked per day and the vertical axis as the age of
death of that person. Of course, this is a fictitious example, with artificially generated
data points, but medical surveys usually show such a dependence.¹⁷ From
such an observed dependence it is usually concluded that smoking is a health haz-
ard.
However, this need not be conclusive. There may be a third variable that couples
the two, which, if we fix its value, renders the two quantities independent. A pas-
sionate smoker may claim, for example, that this third variable is whether a person
is exposed to severe stress at work. Such stress is certainly a health hazard and it
causes, as our passionate smoker may argue, both: a shorter life span (due to the
strain on the person’s health by the stress he or she is exposed to) and a higher cigarette
consumption (due to the fact that smoking has a calming effect and thus can help
to cope with stress). If this argument were correct,¹⁸ the dependence should vanish
if we consider people that are exposed to stress at work and those who are not
separately. That is, if the argument were correct, we should see the separate data as
depicted in Fig. A.13 in the middle (people that are not exposed to stress at work and
thus smoke less and live longer) and on the right (people that are exposed to stress
at work and thus smoke more and live less long). In both cases the dependence be-
tween the two quantities has vanished, and thus they are conditionally independent
given the third variable (stress at work or not).
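The situation can be imitated with artificial data. In the sketch below, all numbers (group means, standard deviations, sample sizes, seed) are invented for illustration and have no empirical basis: within each group the two quantities are drawn independently, but the groups differ in both means.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def sample_group(rng, n, mean_cigarettes, mean_age):
    """Within a group, cigarettes per day and age of death are independent."""
    cigarettes = [rng.gauss(mean_cigarettes, 3.0) for _ in range(n)]
    ages = [rng.gauss(mean_age, 5.0) for _ in range(n)]
    return cigarettes, ages

rng = random.Random(1)
calm_x, calm_y = sample_group(rng, 2000, 10.0, 80.0)      # no stress at work
stress_x, stress_y = sample_group(rng, 2000, 25.0, 65.0)  # stress at work

r_all = pearson(calm_x + stress_x, calm_y + stress_y)  # third variable ignored
r_calm = pearson(calm_x, calm_y)                       # third variable fixed
r_stress = pearson(stress_x, stress_y)
print(r_all, r_calm, r_stress)
```

The pooled data shows a clearly negative correlation, while the correlation within each group is close to zero. A vanishing correlation is only a necessary indication of independence, of course, but here independence within the groups holds by construction.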
17 However, we do not claim that the actual dependence looks like our data.
18 We do not believe it is. The claim that smoking harms your health is much better supported
than just by an observation of a correlation like the one depicted in Fig. A.13, even though such
correlations are part of the argument.
In analogy to Sect. A.2.3, where we defined characteristic measures for data sets,
random variables can be described by analogous measures. While for data sets, these
measures are derived from the sample values, measures for random variables are
derived from their distributions and distribution functions. The analogy is actually
very close: the unit weight of each sample data point is replaced by the probability
mass that is assigned to the different values in the domain of a random variable.
If the notion of a random variable is applied to gambling, where the value of the
random variable could be, for example, the gains or losses connected to different
outcomes, the idea suggests itself to consider a kind of average or expected win
(or loss) if a sufficiently large number of gambles is played. This idea leads to the
notion of an expected value.
As an example, consider the expected value of the winnings in the classical game
of roulette. We do not bet on a so-called simple chance (Rouge vs. Noir, Pair vs.
Impair, Manque vs. Passe), in order to avoid the difficulties that result from the
special rules applying to them, but bet on a column (one of the sets of numbers
1–12, 13–24, and 25–36). In case of a win we receive three times our wager. Since
in roulette 37 numbers can occur (0–36),¹⁹ all of which are equally likely if we
assume perfect conditions, winning with a bet on a column has the probability 12/37,
and losing has the probability 25/37. Let us assume that the wager consists of m chips.
In case of a win we have twice the wager as a net gain (the paid out win has to
be reduced by the initially wagered m chips), that is, 2m chips, whereas in case of a
failure we lose m chips. As a consequence, the expected value is
\[
E(X) = 2m \cdot \frac{12}{37} - m \cdot \frac{25}{37} = -\frac{1}{37} m \approx -0.027m.
\]
On average we thus lose 2.7% of our wager in every gamble.
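The computation for the column bet can be reproduced with a few lines of code. This is a minimal sketch in which a discrete distribution is represented as a dictionary mapping values to probabilities (a representation we choose for illustration), with a wager of m = 1 chip.

```python
from fractions import Fraction

def expected_value(distribution):
    """E(X) = sum of x * P(X = x) for a discrete distribution {x: P(X = x)}."""
    return sum(x * p for x, p in distribution.items())

m = 1  # wager in chips
# Net gain 2m with probability 12/37, net loss m with probability 25/37.
column_bet = {2 * m: Fraction(12, 37), -m: Fraction(25, 37)}

e = expected_value(column_bet)
print(e, float(e))
```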
In order to define the expected value of continuous random variables, we only
have to replace the sum by an integral, and the probabilities of the distribution by
the density function.
19 In certain types of American roulette even 38, as these have 0 and 00 (double zero).
In this section some properties of the expected value are collected, which can often
be exploited when one has to compute the expected value.
Theorem A.9 Let X be a discrete random variable that takes no other values than
some constant c. Then its expected value is equal to c: μ = E(X) = c.
Theorem A.11 (expected value of a sum of random variables) The expected value
of a sum Z = X + Y of two arbitrary real-valued random variables X and Y , whose
expected values E(X) and E(Y ) both exist, is equal to the sum of their expected
values,
E(Z) = E(X + Y ) = E(X) + E(Y ).
Again the validity of these theorems can easily be checked by inserting the
sum/product into the definition of the expected value, in the case of a product of
random variables by also exploiting the definition of independence. It should be
clear that both theorems can easily be generalized to sums and products of finitely
many (independent) random variables. Never forget the presupposition of independence
in the second theorem: for dependent random variables the expected value of a
product need not equal the product of the expected values.
The expected value alone does not sufficiently characterize a random variable. We
must consider also what deviation from the expected value can occur on average
(see Sect. A.2.3.2). This dispersion is described by variance and standard deviation.
and thus a standard deviation D(X) of about 5.84m. Despite the same expected
value, the average deviation from the expected value is about 4 times as large for a
bet on a plain chance as for a bet on a column.
In order to define the variance of a continuous random variable, we only have to
replace the sum by an integral—just as we did for the expected value.
Theorem A.13 Let X be a discrete random variable which takes no other values
than a constant c. Then its variance is 0: σ² = D²(X) = 0.
The validity of this theorem (like the validity of the next theorem) can easily be
checked by inserting the given expressions into the definition of the variance, once
for discrete and once for continuous random variables.
The expression E(X · Y ) − E(X) · E(Y ) = E[(X − E(X))(Y − E(Y ))] is called the
covariance of X and Y . From the (stochastic) independence of X and Y it follows
that
\[ D^2(Z) = D^2(X + Y) = D^2(X) + D^2(Y), \]
that is, the covariance of independent random variables vanishes.
Again the validity of this theorem can easily be checked by inserting the sum
into the definition of the variance. By simple induction it can easily be generalized
to finitely many random variables.
A.3.4.5 Quantiles
Quantiles are defined in direct analogy to the quantiles of a data set, with the frac-
tion of the data set replaced by the fraction of the probability mass. For continuous
random variables, quantiles are often also called percentage points.
Note that for discrete random variables, several values may satisfy both inequali-
ties, because their distribution function is piecewise constant. It should also be noted
that the pair of inequalities is equivalent to the double inequality
α − P (X = x) ≤ FX (x) ≤ α,
where FX (x) is the distribution function of a random variable X. For a continuous
random variable X, it is usually more convenient to define that the α-quantile is the
value x that satisfies FX (x) = α. In this case a quantile can be computed from the
inverse of the distribution function FX (provided that it exists and can be specified
in closed form).
In this section we study some special distributions, which are often needed in appli-
cations (see Sect. A.4 about inferential statistics).
Let X be a random variable that describes the number of trials of a Bernoulli exper-
iment of size n in which an event A occurs with probability p = P (A) in each trial.
Bernoulli experiments can easily be generalized to more than two mutually exclu-
sive events. In this way one obtains the polynomial distribution, which is a multi-
dimensional distribution: a random experiment is executed independently n times.
Let A1 , . . . , Ak be mutually exclusive events, of which in each trial exactly one must
occur, that is, let A1 , . . . , Ak be an event partition. In every trial each event Ai occurs
with constant probability pi = P (Ai ), 1 ≤ i ≤ k. Then the probability that in
n trials the event Ai , i = 1, . . . , k, occurs xi times, with x1 + · · · + xk = n, is equal to
\[
P(X_1 = x_1, \ldots, X_k = x_k) = \binom{n}{x_1 \ldots x_k} p_1^{x_1} \cdots p_k^{x_k} = \frac{n!}{x_1! \cdots x_k!} \, p_1^{x_1} \cdots p_k^{x_k}.
\]
The total of all probabilities of all vectors (x1 , . . . , xk ) with x1 + · · · + xk = n is called
the (k-dimensional) polynomial distribution with parameters p1 , . . . , pk and n.
The binomial distribution is obviously a special case for k = 2. The expression
\[
\binom{n}{x_1 \ldots x_k} = \frac{n!}{x_1! \cdots x_k!}
\]
is called a polynomial coefficient, in analogy to the binomial coefficient
\[
\binom{n}{x} = \frac{n!}{x!(n-x)!}.
\]
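The probability formula of the polynomial distribution (more commonly known as the multinomial distribution) translates directly into code. The following sketch uses only the standard library; the function names are our own.

```python
from fractions import Fraction
from math import comb, factorial, prod

def polynomial_coefficient(counts):
    """n! / (x_1! * ... * x_k!) with n = x_1 + ... + x_k."""
    result = factorial(sum(counts))
    for x in counts:
        result //= factorial(x)   # every intermediate quotient is an integer
    return result

def polynomial_pmf(counts, probs):
    """P(X_1 = x_1, ..., X_k = x_k) for n = sum(counts) independent trials."""
    return polynomial_coefficient(counts) * prod(
        p ** x for p, x in zip(probs, counts))

# Assumed example: a fair die is cast six times; every face shows exactly once.
each_face_once = polynomial_pmf([1] * 6, [Fraction(1, 6)] * 6)
# Special case k = 2: the binomial distribution.
two_of_three = polynomial_pmf([2, 1], [Fraction(1, 2), Fraction(1, 2)])
print(each_face_once, two_of_three)
```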
Let X be a random variable that describes the number of trials in a Bernoulli exper-
iment that are needed until an event A, which occurs with p = P (A) > 0 in each
trial, occurs for the first time. Then X has the distribution ∀x ∈ N : (x; P (X = x))
with
\[ P(X = x) = g_X(x; p) = p(1 - p)^{x-1} \]
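As a quick plausibility check of this formula, the probabilities must sum to 1 over x = 1, 2, . . . The sketch below evaluates a long partial sum for the assumed example p = 1/6 (waiting for the first six when casting a die).

```python
def geometric_pmf(x, p):
    """P(X = x) = p * (1 - p)**(x - 1) for x = 1, 2, ...; 0 otherwise."""
    if x < 1:
        return 0.0
    return p * (1.0 - p) ** (x - 1)

p = 1 / 6  # assumed example: probability of a six in a single cast
partial_sum = sum(geometric_pmf(x, p) for x in range(1, 201))
print(partial_sum)
```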
From an urn which contains M black and N − M white, and thus in total N balls,
n balls are drawn without replacement. Let X be the random variable that de-
scribes the number of black balls that have been drawn. Then X has the distribution
∀x; max(0, n − (N − M)) ≤ x ≤ min(n, M) : (x; P (X = x)) with
\[
P(X = x) = h_X(x; n, M, N) = \frac{\binom{M}{x} \binom{N-M}{n-x}}{\binom{N}{n}}
\]
and is said to be hypergeometrically distributed with parameters n, M and N .
This distribution satisfies the recursive relation
∀x; max(0, n − (N − M)) ≤ x ≤ min(n, M) :
\[
h_X(x + 1; n, M, N) = \frac{(M - x)(n - x)}{(x + 1)(N - M - n + x + 1)} \, h_X(x; n, M, N)
\quad\text{with}\quad h_X(1; n, M, N) = \frac{M}{N}.
\]
With p = M/N and q = 1 − p, the expected value and variance are
\[
\mu = E(X) = np; \qquad \sigma^2 = D^2(X) = npq \, \frac{N - n}{N - 1}.
\]
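The formulas for the hypergeometric distribution can be verified numerically for concrete parameters. The sketch below uses the hypothetical values n = 5, M = 7, N = 20 and exact fractions; it checks that the probabilities sum to 1 and compares expected value and variance with the closed forms.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(x, n, M, N):
    """P(X = x): x black balls among n drawn without replacement."""
    if x < max(0, n - (N - M)) or x > min(n, M):
        return Fraction(0)
    return Fraction(comb(M, x) * comb(N - M, n - x), comb(N, n))

n, M, N = 5, 7, 20   # hypothetical parameters, chosen for illustration
support = range(max(0, n - (N - M)), min(n, M) + 1)

total = sum(hypergeom_pmf(x, n, M, N) for x in support)
mean = sum(x * hypergeom_pmf(x, n, M, N) for x in support)
var = sum((x - mean) ** 2 * hypergeom_pmf(x, n, M, N) for x in support)

p = Fraction(M, N)
q = 1 - p
print(total, mean == n * p, var == n * p * q * Fraction(N - n, N - 1))
```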
20 This distribution bears its name in recognition of the French mathematician Siméon-Denis Pois-
son (1781–1840).
Intuitively this theorem says that the sum of a large number of almost arbitrarily
distributed random variables (the Lindeberg condition is a very weak restriction)
is approximately normally distributed. Since physical measurements are usually af-
fected by a large number of random influences from several independent sources,
which all add up to form the total measurement error, the result is often approxi-
mately normally distributed. The central limit theorem thus explains why normally
distributed quantities are so common in practice.
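A small simulation illustrates this effect. In the sketch below we sum 30 independent random numbers that are uniformly distributed on [0, 1) (an arbitrary choice of a clearly non-normal distribution), standardize the sum, and count how often it falls within one and two standard deviations of the expected value. For a normal distribution these fractions are about 0.683 and 0.954.

```python
import random
from math import sqrt

def standardized_sum(k, rng):
    """Sum of k independent uniform(0, 1) values, scaled to mean 0, variance 1."""
    s = sum(rng.random() for _ in range(k))
    # A uniform(0, 1) variable has expected value 1/2 and variance 1/12.
    return (s - k * 0.5) / sqrt(k / 12.0)

rng = random.Random(5)   # fixed seed for reproducibility
zs = [standardized_sum(30, rng) for _ in range(20_000)]

within_one = sum(1 for z in zs if abs(z) < 1.0) / len(zs)
within_two = sum(1 for z in zs if abs(z) < 2.0) / len(zs)
print(within_one, within_two)
```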
Like the Poisson distribution, the normal distribution can be used as an approxi-
mation of the binomial distribution, even if the probabilities p are not small.
21 This theorem bears its name in recognition of the French mathematicians Abraham de Moivre
If one forms the sum of the squares of m independent, standard normally distributed
random variables (expected value 0 and variance 1), one obtains a random variable X
with the density function
\[
f_X(x; m) =
\begin{cases}
0 & \text{for } x < 0, \\[4pt]
\dfrac{1}{2^{\frac{m}{2}} \cdot \Gamma\left(\frac{m}{2}\right)} \cdot x^{\frac{m}{2} - 1} \cdot e^{-\frac{x}{2}} & \text{for } x \ge 0,
\end{cases}
\]
where Γ is the gamma function,
\[
\Gamma(x) = \int_{0}^{\infty} t^{x-1} e^{-t} \, dt
\]
for x > 0. This random variable is said to be χ²-distributed with m degrees of freedom.
The expected value and variance are
\[ E(X) = m; \qquad D^2(X) = 2m. \]
The χ² distribution plays an important role in the statistical theory of hypothesis
testing (see Sect. A.4.3), for example, for independence tests.
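The stated expected value and variance can be checked by simulation: we draw the sum of the squares of m independent standard normally distributed values many times and compare the sample mean and sample variance with m and 2m. The parameters (m = 4, 20,000 repetitions, seed) are arbitrary choices for this sketch.

```python
import random

def chi_square_sample(m, rng):
    """Sum of the squares of m independent standard normal values."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(m))

rng = random.Random(7)   # fixed seed for reproducibility
m, trials = 4, 20_000
xs = [chi_square_sample(m, rng) for _ in range(trials)]

mean = sum(xs) / trials
var = sum((x - mean) ** 2 for x in xs) / (trials - 1)
print(mean, var)         # should be close to m = 4 and 2m = 8
```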
It is immediately clear that not every statistic, that is, not every function computed
from the sample values, is a usable point estimator for an examined parameter θ .
Rather, a statistic should have certain properties, in order to be a reasonable estima-
tor. Desirable properties are:
• consistency
If the amount of available data grows, the estimated value should become closer
and closer to the actual value of the estimated parameter, at least with higher and
higher probability. This can be formalized by requiring that for growing sam-
ple size, the estimation function converges in probability to the true value of the
parameter. For example, if T is an estimator for the parameter θ , it should be
\[ \forall \varepsilon > 0: \quad \lim_{n \to \infty} P(|T - \theta| < \varepsilon) = 1, \]
where n is the sample size. This condition should be satisfied by every point
estimator; otherwise we have no reason to assume that the estimated value is in
any way related to the true value.
• unbiasedness
An estimator should not tend to generally under- or over-estimate the parame-
ter, but should, on average, yield the right value. Formally, this means that the
expected value of the estimator should coincide with the true value of the param-
eter. For example, if T is an estimator for the parameter θ , it should be
E(T ) = θ,
independently of the sample size.
• efficiency
The estimation should be as precise as possible, that is, the deviation from the true
value of the parameter should be as small as possible. Formally, one requires that
the variance of the estimator should be as small as possible, since the variance is
a natural measure for the precision of the estimation. For example, let T and U be
unbiased estimators for a parameter θ . Then T is called more efficient than U iff
\[ D^2(T) < D^2(U). \]
However, it is rarely possible to show that an estimator achieves the highest pos-
sible efficiency for a given estimation problem.
• sufficiency
An estimation function should exploit the information that is contained in the
data in an optimal way. This can be made more precise by requiring that different
samples, which yield the same estimated value, should be equally likely (given
the estimated value of the parameter). The reason is that if they are not equally
likely, it must be possible to derive additional information about the parameter
value from the data. Formally, this means that an estimator T for a parameter θ
is called sufficient if for all random samples x = (x1 , . . . , xn ) with T (x) = t, the
expression
\[
\frac{f_{X_1}(x_1; \theta) \cdots f_{X_n}(x_n; \theta)}{f_T(t; \theta)}
\]
does not depend on θ [10].
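Unbiasedness and efficiency can be illustrated by a simulation. The sketch below compares two estimators for the expected value θ of a normal distribution, the arithmetic mean and the median of the sample; the setting (θ = 5, variance 1, sample size 25, 4000 repetitions, seed) is an assumption made only for this illustration.

```python
import random
from statistics import mean, median, pvariance

rng = random.Random(3)            # fixed seed for reproducibility
theta, n, repetitions = 5.0, 25, 4000

means, medians = [], []
for _ in range(repetitions):
    sample = [rng.gauss(theta, 1.0) for _ in range(n)]
    means.append(mean(sample))      # estimator T: arithmetic mean
    medians.append(median(sample))  # estimator U: sample median

# Unbiasedness: both estimators yield theta on average.
print(mean(means), mean(medians))
# Efficiency: the estimator with the smaller variance is more efficient.
print(pvariance(means), pvariance(medians))
```

In this setting both estimators scatter around the true value, but the arithmetic mean has the smaller variance and is thus the more efficient of the two.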
Note that the estimators used in the definition of efficiency must be unbiased, since
otherwise arbitrary constants (variance D² = 0) would be efficient estimators.
Consistency, on the other hand, can often be neglected as an additional condition, since
an unbiased estimator T for a parameter θ which also satisfies
\[ \lim_{n \to \infty} D^2(T) = 0 \]
is consistent (not surprisingly).
22 Recall that estimators are functions of random variables and thus random variables themselves.
\[
f_{X_1,\ldots,X_k}(x_1, \ldots, x_k; \theta_1, \ldots, \theta_k, n) = \frac{n!}{\prod_{i=1}^{k} x_i!} \prod_{i=1}^{k} \theta_i^{x_i},
\]
where θi is the probability that the value ai occurs, and the random variable
Xi describes how often the value ai occurs in the sample.
Desired: Estimators for the unknown parameters θ1 , . . . , θk .
The relative frequencies Ri = Xi /n of the feature values in the sample are consistent,
unbiased, most efficient, and sufficient estimators for the unknown parameters θi ,
i = 1, . . . , k. This is the reason why relative frequencies are used in basically all
cases to estimate the probabilities of nominal values.
Up to now we have simply stated estimation functions for parameters. This is possi-
ble because for many standard problems, consistent, unbiased, and efficient estima-
tors are known, so that they can be looked up in standard textbooks. Nevertheless,
we consider briefly how one can find estimation functions in principle.
Besides the method of moments, which we omit here, maximum likelihood estimation,
as it was developed by R.A. Fisher,²³ is one of the most popular methods
for finding estimation functions. The underlying principle is very simple: choose
the value of the parameter to estimate (or the set of values of the parameters to es-
timate if there are several) that renders the given random sample most likely. This
is achieved as follows: if the parameter(s) of the true underlying distribution were
known, we could easily compute the probability of a random experiment generating
the observed random sample. However, this probability can also be written with
unknown parameters (though it can then not necessarily be computed numerically). The result
is a function that describes the likelihood of a random sample given the unknown
parameters. This function is called a likelihood function. By taking partial deriva-
tives of this function w.r.t. the parameters to estimate and setting them equal to zero
(since the derivative must vanish at a maximum), estimation functions are derived.
23 However, R.A. Fisher did not invent this method as is often believed. Earlier on C.F. Gauß and
D. Bernoulli already made use of it, but Fisher was the first to study it systematically and to
establish it in statistics [10].
It describes the probability of the sample (the data set) depending on the parame-
ters μ and σ 2 . By exploiting the known rules for computing with exponential func-
tions, this expression can be transformed into
\[
L(x_1, \ldots, x_n; \mu, \sigma^2) = \frac{1}{\left(\sqrt{2\pi\sigma^2}\right)^n} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \right).
\]
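In order to find the maximum, we set the partial derivatives w.r.t. μ and σ² equal
to 0. Since the logarithm is a monotone function, ln L has its maximum at the same
place as L, and it is easier to differentiate. The partial derivative w.r.t. μ is
\[
\frac{\partial}{\partial\mu} \ln L(x_1, \ldots, x_n; \mu, \sigma^2) = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) \overset{!}{=} 0,
\]
from which
\[
\sum_{i=1}^{n} (x_i - \mu) = \left( \sum_{i=1}^{n} x_i \right) - n\mu \overset{!}{=} 0,
\]
and thus
\[
\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i
\]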
follows as an estimate for the parameter μ. The partial derivative of the
log-likelihood function w.r.t. σ² yields
\[
\frac{\partial}{\partial\sigma^2} \ln L(x_1, \ldots, x_n; \mu, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (x_i - \mu)^2 \overset{!}{=} 0.
\]
By inserting the estimated value μ̂ for the parameter μ, we obtain the estimator
\[
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \frac{1}{n^2} \left( \sum_{i=1}^{n} x_i \right)^2
\]
for the parameter σ². Note that the result is not unbiased. (Recall that, as we
mentioned above, the empirical variance with a factor of 1/n instead of 1/(n − 1)
is not unbiased.) This shows that there is no estimator for the variance of a normal
distribution that has all desirable properties: an unbiased estimator does not make
the data maximally likely, and the estimator that makes the data maximally likely
is not unbiased.
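The resulting estimators are easy to compute. The following sketch evaluates the maximum likelihood estimates for a small, made-up data set and contrasts the estimate of the variance with the unbiased empirical variance; the function names and the data are our own.

```python
def normal_mle(xs):
    """Maximum likelihood estimates (mu_hat, sigma2_hat) for a normal sample."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n   # factor 1/n
    return mu_hat, sigma2_hat

def unbiased_variance(xs):
    """Empirical variance with the factor 1/(n - 1)."""
    n = len(xs)
    mu_hat = sum(xs) / n
    return sum((x - mu_hat) ** 2 for x in xs) / (n - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical sample
mu_hat, sigma2_hat = normal_mle(data)
print(mu_hat, sigma2_hat, unbiased_variance(data))
```

The two variance estimates differ exactly by the factor (n − 1)/n, so the difference becomes negligible for large samples.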
In order to illustrate why it can be useful to assume a prior distribution on the pos-
sible values of the parameter θ , we consider three situations:
• A drunkard claims to be able to predict the side onto which a tossed coin will
land (heads or tails). On ten trials he always states the correct side beforehand.
• A tea lover claims that she is able to taste whether the tea or the milk was poured
into the cup first. On ten trials she always identifies the correct order.
• An expert of classical music claims to be able to recognize from a single sheet
of music whether the composer was Mozart or somebody else. On ten trials he is
indeed correct every time.
Let θ be the (unknown) parameter that states the probability that a correct prediction
is made. The data is formally identical in all three cases: 10 correct, 0 wrong predic-
tions. Nevertheless we are reluctant to treat these three cases equally, as maximum
likelihood estimation does. We hardly believe that the drunkard can actually predict
the side a tossed coin will land on but assume that he was simply “lucky.” The tea
lover we also view skeptically, even though our skepticism is less pronounced than in
the case of the drunkard. Maybe there are certain chemical processes that depend on
the order in which tea and milk are poured into the cup and which change the taste
slightly and thus are noticeable to a passionate tea drinker. We just see this possibil-
ity as unlikely. On the other hand, we are easily willing to believe the music expert.
Clearly, there are differences in the style of different composers that may allow a
knowledgeable music expert to see even from a single sheet of music whether it was
composed by Mozart or not.
The three attitudes with which we see the three situations can be expressed by
prior distributions on the domain of the parameter θ . In the case of the drunkard we
ascribe a nonvanishing probability density only to the value 0.5.²⁴ In the case of the
tea lover we may choose a prior distribution that ascribes a high probability density
to values close to 0.5 and quickly declines towards 1. In the case of the music
expert, however, we ascribe a significant probability density also to values closer
to 1. In effect, this means that in the case of the drunkard we always estimate θ
as 0.5, regardless of the data. In the case of the tea lover only fairly clear evidence in
favor of her claim will make us accept higher values for θ . In the case of the music
expert, however, few positive examples suffice to obtain a fairly high value for θ .
Obviously, the prior distribution contains background knowledge about the data-
generating process and expresses which parameter values we expect and how easily
we are willing to accept them. However, how to choose the prior distribution is a
tricky and critical problem, since it has to be chosen subjectively. Depending on
their experience, different people will choose different distributions.
A parameter value that is estimated with a point estimator from a data set usually
deviates from the true value of the parameter. Therefore it is useful if one can make
statements about these unknown deviations and their expected magnitude. The most
straightforward approach is certainly to provide a point-estimated value t and the
standard deviation D(T ) of the estimator, that is,
\[ t \pm D(T) = t \pm \sqrt{D^2(T)}. \]
However, a better possibility consists in determining intervals—so-called confi-
dence intervals—that contain the true value of the parameter with high probability.
The boundaries of these confidence intervals are computed by certain rules from
the sample values. Hence they are also statistics, and thus, like point estimators,
(realizations of) random variables. Therefore they can be treated analogously. For-
mally, they are defined as follows:
Let X = (X1 , . . . , Xn ) be a simple random vector the random variables of which
have the distribution function FXi (xi ; θ ) with (unknown) parameter θ . Furthermore,
let A = gA (X1 , . . . , Xn ) and B = gB (X1 , . . . , Xn ) be two estimators defined on X
such that
\[ P(A < \theta < B) = 1 - \alpha, \qquad P(\theta \le A) = \frac{\alpha}{2}, \qquad P(\theta \ge B) = \frac{\alpha}{2}. \]
Then the random interval [A, B] (or a realization [a, b] of this random interval) is
called a (1 − α) · 100% confidence interval for the (unknown) parameter θ . The
value 1 − α is called confidence level.
Note that the term “confidence” refers to the method and not to the result of the
procedure (that is, to a realization of the random interval). Before data has been
collected, a (1 − α) · 100% confidence interval contains the true parameter value
with probability 1 − α. However, after the data has been collected and the interval
boundaries have been computed, the interval boundaries are not random variables
anymore. Therefore the interval either contains the true value of the parameter θ
or it does not (probability 1 or 0—even though it is not known which of the two
possibilities is obtained).
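That the confidence level is a property of the procedure rather than of a single computed interval can be illustrated by simulation. The following R sketch assumes normally distributed data with known standard deviation; the true parameter values, the sample size, and the number of repetitions are hypothetical choices made only for this illustration.

```r
set.seed(1)                      # fixed seed so the run is reproducible
mu <- 10; sigma <- 2; n <- 20    # hypothetical true parameters and sample size
alpha <- 0.05
z <- qnorm(1 - alpha / 2)        # two-sided quantile of the standard normal

covers <- replicate(2000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  half <- z * sigma / sqrt(n)    # half-width of the known-sigma interval
  a <- mean(x) - half
  b <- mean(x) + half
  a < mu && mu < b               # TRUE or FALSE for this one realization
})

coverage <- mean(covers)         # fraction of intervals that contain mu
print(coverage)
```

Each entry of `covers` is either TRUE or FALSE, matching the statement that a realized interval either contains the true value or does not. Only the long-run fraction should be close to 1 − α.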
The above definition of a confidence interval is not specific enough to derive a
computation procedure from it. Indeed, the estimators A and B are not uniquely
determined: the sets of realizations of the random vector (X1, ..., Xn) for which
A ≥ θ and B ≤ θ hold merely have to be disjoint and must each possess the probability α/2.
In order to derive a procedure to obtain the boundaries A and B of a confidence
interval, the estimators are restricted as follows: they are not defined as general
functions of the random vector but rather as functions of a chosen point estimator T
for the parameter θ. That is,
\[ A = h_A(T) \quad\text{and}\quad B = h_B(T). \]
In this way confidence intervals can be determined generally, namely by replacing
an investigation of A < θ < B with the corresponding event w.r.t. the estimator T,
that is, A∗ < T < B∗. Of course, this is only possible if we can derive the
functions $h_A(T)$ and $h_B(T)$ from the inverse functions $A^* = h_A^{-1}(\theta)$ and $B^* = h_B^{-1}(\theta)$
that we have to consider w.r.t. T.
Idea:
\[
\begin{aligned}
& P(A^* < T < B^*) = 1 - \alpha \\
\Rightarrow\ & P\bigl(h_A^{-1}(\theta) < T < h_B^{-1}(\theta)\bigr) = 1 - \alpha \\
\Rightarrow\ & P\bigl(h_A(T) < \theta < h_B(T)\bigr) = 1 - \alpha \\
\Rightarrow\ & P(A < \theta < B) = 1 - \alpha.
\end{aligned}
\]
Unfortunately, this is not always possible (in a sufficiently simple way).
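Thus we obtain
\[ B^* = \sqrt[n]{\frac{\alpha}{2}}\;\frac{n+1}{n}\,\theta \quad\text{and}\quad A^* = \sqrt[n]{1 - \frac{\alpha}{2}}\;\frac{n+1}{n}\,\theta, \]
that is,
\[ P\left( \sqrt[n]{\frac{\alpha}{2}}\;\frac{n+1}{n}\,\theta < U < \sqrt[n]{1 - \frac{\alpha}{2}}\;\frac{n+1}{n}\,\theta \right) = 1 - \alpha, \]
from which we can derive easily
\[ P\left( \frac{U}{\sqrt[n]{1 - \frac{\alpha}{2}}\;\frac{n+1}{n}} < \theta < \frac{U}{\sqrt[n]{\frac{\alpha}{2}}\;\frac{n+1}{n}} \right) = 1 - \alpha. \]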
Due to the symmetry of the normal distribution, the computations become fairly
simple. For example, due to this symmetry, we know that B∗ = −A∗. Hence we can
write
\[
P\left( -A^* < \frac{X - n\theta}{\sqrt{n\theta(1-\theta)}} < A^* \right)
\approx \int_{-A^*}^{A^*} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{z^2}{2} \right) dz
= \Phi(A^*) - \Phi(-A^*) = 2\Phi(A^*) - 1 = 1 - \alpha,
\]
where Φ is the distribution function of the standard normal distribution (the
integrand is the density of the standardized variable). This function cannot be
computed analytically but is available in tabulated form, so that one can easily
find the value x that corresponds to a given value Φ(x). Thus we only
have to derive an expression P(A < θ < B) from the above expression. This is done
as follows:
\[
\begin{aligned}
& -A^* < \frac{X - n\theta}{\sqrt{n\theta(1-\theta)}} < A^* \\
\Rightarrow\ & |X - n\theta| < A^* \sqrt{n\theta(1-\theta)} \\
\Rightarrow\ & (X - n\theta)^2 < (A^*)^2\, n\theta(1-\theta) \\
\Rightarrow\ & \theta^2 \bigl( n (A^*)^2 + n^2 \bigr) - \theta \bigl( 2nX + (A^*)^2 n \bigr) + X^2 < 0.
\end{aligned}
\]
Solving the corresponding quadratic equation, we easily obtain the values of A and B as
\[
A/B = \frac{1}{n + (A^*)^2} \left( X + \frac{(A^*)^2}{2} \mp A^* \sqrt{\frac{X(n - X)}{n} + \frac{(A^*)^2}{4}} \right),
\]
where $\Phi(A^*) = 1 - \frac{\alpha}{2}$.
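The closed-form boundaries can be evaluated directly in R. The sketch below implements the formula for A and B as a function (the name `binom_ci` and the counts 37 out of 100 are hypothetical) and compares the result with the interval reported by `prop.test` without continuity correction, which is based on the same quadratic inequality.

```r
# Confidence interval for a binomial parameter theta from the quadratic
# inequality above; x successes in n trials, confidence level 1 - alpha.
binom_ci <- function(x, n, alpha = 0.05) {
  a_star <- qnorm(1 - alpha / 2)             # Phi(A*) = 1 - alpha/2
  centre <- x + a_star^2 / 2
  spread <- a_star * sqrt(x * (n - x) / n + a_star^2 / 4)
  c(lower = (centre - spread) / (n + a_star^2),
    upper = (centre + spread) / (n + a_star^2))
}

ci <- binom_ci(x = 37, n = 100)              # hypothetical counts
print(ci)

# Cross-check against R's score interval (no continuity correction).
ref <- prop.test(37, 100, correct = FALSE)$conf.int
print(all.equal(unname(ci), as.numeric(ref)))
```

Using `qnorm` replaces the table lookup for A∗ mentioned in the text.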
25 Alternatively, one may say that a court trial is held against the null hypothesis, where the data
(sample) act as evidence. In case of doubt the defendant is acquitted (the null hypothesis is
accepted). Only if the evidence is sufficiently incriminating is the defendant convicted (the null
hypothesis is rejected).
The test decision is made on the basis of a test statistic, that is, a function of the
sample values of the given data set. The null hypothesis is rejected if the value of
the test statistic lies in the so-called critical region C. The development of a
statistical test consists in choosing, for a given distribution assumption and a parameter,
an appropriate test statistic and then determining, for a user-specified significance
level (see the next section), the corresponding critical region C (see the following
sections).
Since the data on which the test decision rests is the outcome of a random process,
we cannot be sure that the decision made with a hypothesis test is correct. We may
decide wrongly and may do so in two different ways:
• error of the first kind:
The null hypothesis H0 is rejected, even though it is correct.
• error of the second kind:
The null hypothesis H0 is accepted, even though it is wrong.
Errors of the first kind are seen as more severe, because the null hypothesis receives
the benefit of the doubt and thus is not rejected as easily as the alternative hypothesis.
If the null hypothesis is rejected nevertheless, despite being correct, we commit a
serious error. Therefore one tries to limit the probability of an error of the first kind
to a certain maximal value. This maximal value α is called the significance level of
the hypothesis test. It has to be chosen by the user. Typical values of the significance
level are 10%, 5%, or 1%.
In a parameter test the contrary hypotheses make statements about the values of
one or more parameters. For example, the null hypothesis may be that the true value
of a parameter θ is at least (or at most) θ0 :
H0 : θ ≥ θ0 , Ha : θ < θ0 .
In such a case the test is called one-sided. On the other hand, in a two-sided test
the null hypothesis consists of a statement that the true value of a parameter lies in
a certain interval or equals a specific value. Other forms of parameter tests compare
the parameters of the distributions that underlie two different samples. Here we only
consider a one-sided test as an example.
For a one-sided test, like the one described above, one usually chooses a point
estimator T for the parameter θ as a test statistic. In such a case we will reject the
null hypothesis H0 only if the value of the point estimator T does not exceed a
critical value c. Therefore the critical region is C = (−∞, c].
Hence it is clear that the value c must lie to the left of θ0, because we will not be
able to reasonably reject H0 if even the value of the point estimator T exceeds θ0.
However, even a value that is only slightly smaller than θ0 will not be sufficient to
make the probability of an error of the first kind (the null hypothesis H0 is rejected
even though it is correct) sufficiently small. Therefore c must lie at some distance to
the left of θ0. Formally, the critical value c is determined as follows: We consider
\[ \beta(\theta) = P_\theta(H_0 \text{ is rejected}) = P_\theta(T \in C), \]
which can be simplified to $\beta(\theta) = P_\theta(T \le c)$ for a one-sided test. The quantity β(θ)
is also called the power of the test. It describes the probability of a rejection of H0
depending on the value of the parameter θ. For all values θ that satisfy the null
hypothesis, the value of β(θ) must not exceed the significance level α. The reason
is that if the null hypothesis is true, we want to reject it at most with probability α
in order to commit an error of the first kind at most with this probability. Therefore
we must have
\[ \max_{\theta:\ \theta \text{ satisfies } H_0} \beta(\theta) \le \alpha. \]
For the test we consider here, it is easy to see that the power β(θ ) of the test reaches
its maximum for θ = θ0 : the larger the true value of θ , the less likely it is that the test
statistic (the point estimator T ) yields a value of at most c. Hence we must choose
the smallest value θ that satisfies the null hypothesis H0 : θ ≥ θ0 . The expression
reduces to
β(θ0 ) ≤ α.
At this point all that is left to do to complete the test is to determine β(θ0 ) from the
distribution assumption and the point estimator T .
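The behavior of the power function on the null hypothesis can be checked numerically. The following R sketch assumes, purely for illustration, that the test statistic is a sample mean that is normally distributed with known standard deviation; the function name `power_fn` and all numeric settings are hypothetical.

```r
# Power function beta(mu) = P_mu(T <= crit) of a one-sided test of
# H0: mu >= mu0, assuming T (the sample mean) is N(mu, sigma^2 / n).
power_fn <- function(mu, crit, sigma, n) pnorm((crit - mu) / (sigma / sqrt(n)))

mu0 <- 0; sigma <- 1; n <- 16; alpha <- 0.05   # hypothetical settings
crit <- mu0 + qnorm(alpha) * sigma / sqrt(n)   # chosen so that beta(mu0) = alpha

mu_grid <- seq(mu0, mu0 + 1, by = 0.25)        # parameter values satisfying H0
beta <- power_fn(mu_grid, crit, sigma, n)
print(round(beta, 4))

# The rejection probability under H0 is largest at the boundary mu0.
print(which.max(beta) == 1)
```

On the grid, the rejection probability decreases as the true parameter moves away from the boundary, so bounding it at the boundary value bounds it on the whole null hypothesis.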
that is, the arithmetic mean of the sample values. (n is the sample size.) As one can
easily check, this estimator has the probability density
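\[ f_{\bar{X}}(x) = N\left( x; \mu, \frac{\sigma^2}{n} \right). \]
Therefore it is
\[
\alpha = \beta(\mu_0) = P_{\mu_0}(\bar{X} \le c)
= P\left( \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \le \frac{c - \mu_0}{\sigma/\sqrt{n}} \right)
= P\left( Z \le \frac{c - \mu_0}{\sigma/\sqrt{n}} \right)
\]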
with standard normally distributed random variable Z. (The third step in the above
transformation served the purpose to obtain a statement about such a random vari-
able.) Thus we have
\[ \alpha = \Phi\left( \frac{c - \mu_0}{\sigma/\sqrt{n}} \right), \]
where Φ is the distribution function of the standard normal distribution, which can
be found in a tabulated form in many textbooks. From such a table we obtain the
value zα for which Φ(zα ) = α. Then the critical value is
\[ c = \mu_0 + z_\alpha \frac{\sigma}{\sqrt{n}}. \]
Note that due to the small value of α, the value of zα is negative, and therefore c, as
already made plausible above, is smaller than μ0 .
In order to give a numeric example, we choose [1] μ0 = 130 and α = 0.05. In
addition, let σ = 5.4, n = 25, and x̄ = 128. From a table of the standard normal
distribution we obtain z0.05 ≈ −1.645 and arrive at
\[ c_{0.05} \approx 130 - 1.645 \cdot \frac{5.4}{\sqrt{25}} \approx 128.22. \]
Since x̄ = 128 < 128.22 = c, the null hypothesis H0 is rejected. If we had chosen
α = 0.01 instead, we would have obtained (with z0.01 ≈ −2.326)
\[ c_{0.01} \approx 130 - 2.326 \cdot \frac{5.4}{\sqrt{25}} \approx 127.49, \]
and thus H0 would not have been rejected.
As an alternative, the significance level can be left unspecified. Instead, one
provides the value α from which upward the null hypothesis H0 is rejected. This value
is also called the p-value. For the above example, it has the value
\[ p = \Phi\left( \frac{128 - 130}{5.4/\sqrt{25}} \right) \approx 0.032. \]
That is, the null hypothesis H0 is rejected for a significance level above 0.032 but
accepted for a significance level less than 0.032. Note, however, that one must not
choose the significance level after computing the p-value, as this would undermine
the validity of the test. The p-value is only a convenience in order to accommodate
the different attitudes of users, some of whom are more cautious and thus choose
lower significance levels α, while others are more daring and thus choose higher
significance levels. From the p-value all users can see whether they would reject or
accept the null hypothesis and thus need not follow the choice of the writer.
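The numbers of this example can be reproduced in R, where `qnorm` and `pnorm` replace the table lookups. The sketch uses n = 25, the sample size that appears under the square root in the formulas; the variable names are illustrative.

```r
mu0 <- 130; sigma <- 5.4; n <- 25; xbar <- 128   # values from the example
se <- sigma / sqrt(n)                            # standard deviation of the mean

crit_05 <- mu0 + qnorm(0.05) * se                # critical value for alpha = 0.05
crit_01 <- mu0 + qnorm(0.01) * se                # critical value for alpha = 0.01
p_value <- pnorm((xbar - mu0) / se)              # smallest alpha that rejects H0

print(round(c(crit_05, crit_01, p_value), 3))
print(c(reject_05 = xbar <= crit_05, reject_01 = xbar <= crit_01))
```

The decision can be read off either by comparing x̄ with the critical value or by comparing the p-value with the chosen significance level; both routes give the same answer.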
A goodness-of-fit test checks whether two distributions coincide: either two empirical
distributions or one empirical and one theoretical distribution. Often a goodness-of-fit
test is used to check a distribution assumption, as it is needed for parameter
estimation. As an example, we consider the χ² goodness-of-fit test for a polynomial
(that is, multinomial) distribution: let a one-dimensional data set of size n be given
for k attribute values a1, ..., ak. In addition, let $p_i^*$, 1 ≤ i ≤ k, be an assumption
about the probabilities with which the attribute values ai occur. We want to check
whether the hypothesis fits the data set, that is, whether the actual probabilities pi
coincide with the hypothetical $p_i^*$, 1 ≤ i ≤ k, or not. Thus we contrast the hypotheses
\[ H_0: \forall i, 1 \le i \le k: p_i = p_i^* \quad\text{and}\quad H_a: \exists i, 1 \le i \le k: p_i \ne p_i^*. \]
An appropriate test statistic can be derived from the following theorem about
polynomially distributed random variables, which describe the frequency of the
occurrence of the different values ai in a sample.
In the expression for calculating the random variable Y, the values of the random
variables Xi are compared to their expected values npi, the deviations are squared
(among other reasons, so that positive and negative deviations do not cancel), and
summed with weights: each squared deviation is divided by the expected value, so
that a deviation is weighted the more heavily, the smaller the expected value is.
Since Y is χ² distributed, large values are unlikely.
The degrees of freedom result from the number of free parameters of the
distribution. The number n is not a free parameter, since it is fixed by the size of the
sample. From the k parameters p1, ..., pk only k − 1 can be chosen freely, since it
must be $\sum_{i=1}^{k} p_i = 1$. Hence only k − 1 of the k + 1 parameters of the polynomial
distribution remain that determine the degrees of freedom.
By replacing the actual probabilities pi by the hypothetical $p_i^*$ and replacing the
random variables Xi by their realizations xi (absolute frequency of the occurrence of
ai in the sample), we obtain a test statistic for the goodness-of-fit test, namely
\[ y = \sum_{i=1}^{k} \frac{(x_i - n p_i^*)^2}{n p_i^*}. \]
If the null hypothesis H0 is correct, that is, if all hypothetical probabilities coin-
cide with the actual ones, it is very unlikely that y takes a large value, since y
A die is suspected to be unfair, that is, when tossed, the die shows the different
numbers of pips with different probabilities. In order to test this hypothesis, the die
is tossed 30 times and it is counted how frequently the different numbers turn up:
x1 = 2, x2 = 4, x3 = 3, x4 = 5, x5 = 3, x6 = 13.
That is, one pip turned up twice, two pips four times, etc. Now we contrast the
hypotheses
\[ H_0: \forall i, 1 \le i \le 6: p_i = \frac{1}{6} \quad\text{and}\quad H_a: \exists i, 1 \le i \le 6: p_i \ne \frac{1}{6}. \]
Since n = 30, we have $\forall i: n p_i = 30 \cdot \frac{1}{6} = 5$, and thus the prerequisites of
Theorem A.20 are satisfied. Hence the χ² distribution with 5 degrees of freedom is a
good approximation of the distribution of the random variable Y. We compute the test
statistic
\[
y = \sum_{i=1}^{6} \frac{(x_i - 30 \cdot \frac{1}{6})^2}{30 \cdot \frac{1}{6}}
= \frac{1}{5} \sum_{i=1}^{6} (x_i - 5)^2 = \frac{82}{5} = 16.4.
\]
For a significance level of α1 = 0.05 (5% probability for an error of the first kind),
the critical value is c ≈ 11.07, since a χ² distributed random variable Y with five
degrees of freedom satisfies
\[ P(Y \le 11.07) = F_Y(11.07) = 0.95 = 1 - \alpha_1, \]
as one may easily obtain from tables of the χ² distribution. Since 16.4 > 11.07,
the null hypothesis that the die is fair can be rejected at a significance level of
α1 = 0.05. It can also be rejected at a significance level of α2 = 0.01, since
\[ P(Y \le 15.09) = F_Y(15.09) = 0.99 = 1 - \alpha_2 \]
and 16.4 > 15.09. The p-value is
\[ p = 1 - F_Y(16.4) \approx 1 - 0.9942 = 0.0058. \]
That is, for a significance level of 0.0058 and above, the null hypothesis H0 is
rejected, while for a significance level below 0.0058 it is accepted.
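The die example can be checked in R by recomputing the statistic directly from the six listed counts: the squared deviations from the expected count 5 sum to 82, so the statistic evaluates to 82/5 = 16.4. The sketch below does this by hand and then with the built-in `chisq.test`; the variable names are illustrative.

```r
x <- c(2, 4, 3, 5, 3, 13)          # observed counts for one to six pips
n <- sum(x)                        # 30 tosses
p_star <- rep(1 / 6, 6)            # hypothetical probabilities under H0

y <- sum((x - n * p_star)^2 / (n * p_star))   # test statistic
df <- length(x) - 1                            # k - 1 degrees of freedom

crit_05 <- qchisq(0.95, df)        # critical value for alpha = 0.05
crit_01 <- qchisq(0.99, df)        # critical value for alpha = 0.01
p_value <- 1 - pchisq(y, df)

print(c(y = y, crit_05 = crit_05, crit_01 = crit_01, p_value = p_value))

# The built-in test computes the same statistic and p-value.
res <- chisq.test(x, p = p_star)
print(res)
```

Here `qchisq` and `pchisq` take the place of the tables of the χ² distribution.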
With a dependence test, it is checked whether two quantities are dependent. In prin-
ciple, any goodness-of-fit test can easily be turned into a dependence test: simply
compare the empirical joint distribution of two quantities with a hypothetical in-
dependent distribution that has the same marginals. In such a case the marginal
distributions are usually estimated from the data.
As an example, we consider the χ² dependence test for two nominal attributes,
which is derived from the χ² goodness-of-fit test. Let $X_{ij}$, 1 ≤ i ≤ k1, 1 ≤ j ≤ k2,
be random variables that describe the absolute frequency of the joint occurrence of
the values ai and bj of two attributes A and B, respectively. Furthermore, let
$X_{i.} = \sum_{j=1}^{k_2} X_{ij}$ and $X_{.j} = \sum_{i=1}^{k_1} X_{ij}$ be the marginal frequencies (absolute frequencies
of the attribute values ai and bj). Then, as a test statistic, we compute
\[
y = \sum_{i=1}^{k_1} \sum_{j=1}^{k_2} \frac{\left( x_{ij} - \frac{1}{n} x_{i.} x_{.j} \right)^2}{\frac{1}{n} x_{i.} x_{.j}}
= n \sum_{i=1}^{k_1} \sum_{j=1}^{k_2} \frac{(p_{ij} - p_{i.} p_{.j})^2}{p_{i.} p_{.j}}
\]
from the realizations $x_{ij}$, $x_{i.}$, and $x_{.j}$ of these random variables, which are counted in
a sample of size n, or from the estimated joint probabilities $p_{ij} = \frac{x_{ij}}{n}$ and marginal
probabilities $p_{i.} = \frac{x_{i.}}{n}$ and $p_{.j} = \frac{x_{.j}}{n}$. The critical value c is determined with the help
of the chosen significance level from a χ² distribution with (k1 − 1)(k2 − 1)
degrees of freedom. The degrees of freedom are justified as follows: for the k1 · k2
probabilities $p_{ij}$, 1 ≤ i ≤ k1, 1 ≤ j ≤ k2, of the occurrence of the different
combinations of ai and bj, it must be $\sum_{i=1}^{k_1} \sum_{j=1}^{k_2} p_{ij} = 1$. Thus k1 · k2 − 1 free
parameters remain. From the data we estimate the k1 probabilities $p_{i.}$ and the k2
probabilities $p_{.j}$. However, they must also satisfy $\sum_{i=1}^{k_1} p_{i.} = 1$ and $\sum_{j=1}^{k_2} p_{.j} = 1$, so
that the degrees of freedom are reduced by only (k1 − 1) + (k2 − 1). In total, we
have k1 k2 − 1 − (k1 − 1) − (k2 − 1) = (k1 − 1)(k2 − 1) degrees of freedom.
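The dependence test can be sketched in R as follows. The 2 × 3 table of joint frequencies is hypothetical and serves only to show that both forms of the test statistic agree and that the result matches the built-in `chisq.test`.

```r
# Hypothetical 2 x 3 contingency table of joint absolute frequencies x_ij.
x <- matrix(c(20, 15,  5,
              10, 25, 25), nrow = 2, byrow = TRUE)
n <- sum(x)

expected <- outer(rowSums(x), colSums(x)) / n   # (1/n) * x_i. * x_.j
y <- sum((x - expected)^2 / expected)           # statistic from frequencies

p <- x / n                                      # estimated joint probabilities
p_indep <- outer(rowSums(p), colSums(p))        # p_i. * p_.j
y2 <- n * sum((p - p_indep)^2 / p_indep)        # statistic from probabilities

df <- (nrow(x) - 1) * (ncol(x) - 1)             # (k1 - 1)(k2 - 1)
p_value <- 1 - pchisq(y, df)
print(c(y = y, y2 = y2, df = df, p_value = p_value))

# Same test via the built-in function, without continuity correction.
res <- chisq.test(x, correct = FALSE)
print(res)
```

The marginals are estimated from the same table that is tested, which is why the degrees of freedom are (k1 − 1)(k2 − 1) rather than k1 · k2 − 1.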
Appendix B
The R Project
R is an open-source statistics and data analysis software available under the General
Public License (GPL). This means especially that R can be downloaded, used, and
distributed freely.
R is based on a very simple command-line language that can be used interactively,
but also for writing programs in R. The sections in this book referring to R
are in no way intended to give a comprehensive introduction to R and do not claim
to be complete in any way. The main purpose of these sections is to enable the reader
to apply methods introduced in the “theoretical chapters” directly to their own data.
For most of the methods whose usage is explained in R, one or two commands will
be sufficient.
This appendix explains how to get started with R and provides a quick overview of
the very basics of R. More details can be found at the website for R
http://www.r-project.org
> plot(iris)
has been entered in the console window. Note that the prompt symbol > does not
belong to the command. We will always display the prompt symbol in order to
distinguish R commands from outputs generated after a command has been entered.
Outputs will be shown without the prompt symbol.
M.R. Berthold et al., Guide to Intelligent Data Analysis, 369
Texts in Computer Science 42,
DOI 10.1007/978-1-84882-260-3, © Springer-Verlag London Limited 2010
The graphic displayed on the right in Fig. B.1 is generated by the plot command.
R is a dynamically typed language. This means that variables need not be declared before
their use. A variable might be used to store just a single number, but it can also
be a complex object with many attributes. In most cases, the objects in this book
will contain data sets or analysis results. Assignments in R are denoted by the two
symbols <-. Before we can analyze a data set with R, we have to load the data set
into R. The easiest way to achieve this uses the function read.table:
> mydata <- read.table(file.choose(), header=TRUE)
This command will open a file chooser window to find and select the file with
the data set to be analyzed. After the file has been chosen, it will be stored in
the variable mydata. The specification header=TRUE when calling the function
read.table tells R that the first line in the file should not be interpreted as data,
but rather as names for the attributes. It is therefore assumed that the structure of the
file looks like
x y z
1.3 2.8 a
3.4 1.9 b
2.7 4.2 a
... ... ...
In this case, there are three attributes named x, y, and z given in the first line.
The records then come in the following lines. Because three attribute names have
been given in the first line, each of the following lines must also have three entries
separated by an arbitrary number of blanks. If the values of the attributes are not
separated by blanks but by another symbol, say a comma, then one would have to
write
> mydata <- read.table(file.choose(), header=TRUE, sep=",")
Now the object named mydata contains the data from the file. Assume that the file
contains only three records and not more as indicated by the dots above.
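For readers who want to try reading a file without a file chooser, the following sketch writes the three example records to a temporary, comma-separated file and reads them back. The file location comes from `tempfile()` and is purely illustrative. Note also that since R 4.0.0 character columns are no longer converted to factors by default, so `stringsAsFactors = TRUE` is set here to obtain a summary that counts the values of z.

```r
# Write the three example records to a temporary file, comma-separated.
path <- tempfile(fileext = ".csv")
writeLines(c("x,y,z",
             "1.3,2.8,a",
             "3.4,1.9,b",
             "2.7,4.2,a"), path)

# header = TRUE: first line holds attribute names; sep = ",": comma-separated.
# stringsAsFactors = TRUE makes z a factor, so summary() counts its values.
mydata <- read.table(path, header = TRUE, sep = ",", stringsAsFactors = TRUE)

print(mydata)
print(summary(mydata))
unlink(path)   # remove the temporary file
```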
At least for smaller data sets, one can take a look at the data by simply typing
> mydata
x y z
1 1.3 2.8 a
2 3.4 1.9 b
3 2.7 4.2 a
> summary(mydata)
       x               y         z
 Min.   :1.300   Min.   :1.900   a:2
 1st Qu.:2.000   1st Qu.:2.350   b:1
 Median :2.700   Median :2.800
 Mean   :2.467   Mean   :2.967
 3rd Qu.:3.050   3rd Qu.:3.500
 Max.   :3.400   Max.   :4.200
> mydata$y
[1] 2.8 1.9 4.2
In this simple example, we only have three records and therefore only three values
for the attribute y. If one line is not enough to list all values, R will simply continue
in the next line and list the index of the first data entry in each line in square brackets.
So if we had 15 records, the result might look like the following one:
> mydata$y
[1] 2.8 1.9 4.2 2.4 3.0 1.7
[7] 4.1 3.3 2.6 1.8 4.3 3.1
[13] 3.7 2.1 1.8
We can also access the values of an attribute (a column in our data table) by
using its index in square brackets:
> mydata[2]
y
1 2.8
2 1.9
3 4.2
A specific record, a row in our data table, can be selected in the following way:
> mydata[2,]
x y z
2 3.4 1.9 b
Most of the R code examples given in the “Practical . . .” section at the ends of the
corresponding chapters are based on the Iris data set. It is not necessary to first load
this data set into R. R provides some simple data sets, and, among them, there is the
Iris data set that can be accessed via an object called iris. The attribute names
are Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and
Species. So, if we want to know the mean value of the sepal length, we simply
need to enter
> mean(iris$Sepal.Length)
[1] 5.843333
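The access patterns shown for mydata work in the same way on the built-in iris object. A short sketch (the use of `tapply` to compute the mean per species is an addition for illustration, not something the text relies on):

```r
print(dim(iris))                    # number of records and attributes
print(iris[1:3, ])                  # first three records (rows)
print(iris[1:3, "Petal.Length"])    # one attribute for those records

# Mean sepal length per species.
means <- tapply(iris$Sepal.Length, iris$Species, mean)
print(means)
```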
different from the default value of the parameter. Such parameters can be set inside
the call of the function by specifying the parameter name, followed by “=” and the
value of the parameter. We have seen examples for such parameters already in the
function read.table in the beginning of the previous section. The default value
of the parameter header is FALSE, assuming that the file to be read does not
contain the names of the attributes. Only if we have a file whose first line defines
the names of the attributes do we have to use header=TRUE, as we have done using
the function read.table. Another parameter of this function is sep. We do not
need to specify the value of this parameter when the attribute values in a row of our
file are separated by blanks. If another symbol like a comma is used to separate the
values, we must assign the corresponding value (symbol) to the parameter sep.
Not all parameters are specified in this form. As in most programming languages,
the data set is just handed over to the function as a normal argument, as in the very
first example of this appendix (plot(iris)).
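The difference between named parameters and normal arguments can be tried out without a file by passing the data as a string through the `text` parameter of read.table. The two-record data set below is hypothetical.

```r
txt <- "x;y\n1.5;a\n2.5;b"          # hypothetical two-record data set

# The text is passed via the parameter 'text'; header and sep are set by name.
d1 <- read.table(text = txt, header = TRUE, sep = ";")

# Named parameters may appear in any order.
d2 <- read.table(sep = ";", header = TRUE, text = txt)

# With the default header = FALSE, the first line is read as a record and
# R generates the column names V1 and V2.
d3 <- read.table(text = txt, sep = ";")

print(identical(d1, d2))
print(names(d3))
print(mean(d1$x))                   # data handed over as a normal argument
```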
One can browse through the history of commands or functions that have been
used in an R session by the key “cursor up” and “cursor down.”
B.4 Libraries/Packages
There are various libraries or packages for R for special topics or specialized
methods. Some of these libraries come along with R, and others need to be downloaded.
Downloading an additional package is very easy. Given that the computer is
connected to the Internet, just type
> install.packages()
and wait for the window asking you to choose a mirror site from which you would
like to download the package. After you have clicked the mirror site, the packages
will be listed in alphabetical order in a new window, and you can choose the package
you need by clicking it.
Once a package—for instance, the package cluster—has been downloaded,
it can be added to an R session by the following command:
> library(cluster)
B.5 R Workspace
The current R session can be stored, so that it can be reloaded next time, and all the
R objects, like the data that had been loaded and the analysis results that have been
stored in objects, can be recovered:
> save.image("all.Rdata")
In order to load a workspace that has been stored before, the command
> load("all.Rdata")
can be used. Of course, the file does not have to be called all.Rdata.
> help(...)
will provide the description of the R function or command that has been entered in
place of the three dots.
If one does not know the command,
> help.search("...")
will help. It will list all the R functions/commands whose documentation contains
the term given in place of the three dots.
Even if you know the correct name of the command or function you are interested
in, R will not be able to provide help if the function belongs to a package that has
not been included in the corresponding R session. So
> help(scatterplot3d)
will not give any information on scatterplot3d, unless the package to which
the function scatterplot3d belongs has been added to the session. (In this case,
the package name is even identical with the function name.)
If you know neither the exact function name nor the corresponding package, you
can simply search the R website for the topic or use a search engine and type in “R”
together with the term for which you would like to find an R function.
KNIME, pronounced [naim], is a modular data exploration platform that enables the
user to visually create data flows (often referred to as pipelines), selectively execute
some or all analysis steps, and later investigate the results through interactive views
on data and models. This appendix will give a short introduction to familiarize the
readers of this book with the basic usage of KNIME. Considerably more information
regarding the use of KNIME is available online at
http://www.knime.org
In order to install KNIME, download one of the versions suitable for your operating
system and unzip it to any directory for which you have write permissions. No
other action to install KNIME is required; in particular, no setup routine has to be
launched. In order to start KNIME for the first time, double click the knime.exe file
on Windows or launch knime on Linux.
KNIME is uninstalled from the system by simply deleting the installation
directory. By default the workspace is also in this directory. If a different location for the
workspace was chosen, this directory needs to be deleted manually as well.
When KNIME is started the first time, a welcome screen opens. From here the
user can
• Open KNIME workbench: opens the KNIME workbench to immediately start
exploring KNIME, build your own workflows, and explore your data.
• Get additional nodes: In addition to the ready-to-use basic KNIME installation,
there are additional plug-ins for KNIME, e.g., an R and Weka integration, mod-
ules for image and text processing, or the integration of the Chemistry Develop-
ment Kit with additional nodes for the processing of molecular structures. These
features can be downloaded also later from within KNIME itself if you choose to
skip this step.
moved with the mouse, which causes the editor to scroll so that the visible part
matches the gray rectangle.
• Console: The console view prints out error and warning messages in order to give
you a clue of what is going on under the hood. The same information is written
to a log file, which is located in the workspace directory.
• Node Description: The node description displays information about the selected
node (or the nodes contained in a selected category). In particular, it explains the
dialog options, the available views, the expected input data, and resulting output
data.
• Workflow Editor: The workflow editor is used to assemble workflows, config-
ure and execute nodes, inspect the results, and explore your data. This section
describes the interactions possible within the editor.
Fig. C.2 The node dialog allows one to configure (left) and later execute (right) individual nodes
and the result of this node will be available at the out-port (see Fig. C.2 on the right).
After a successful execution the status light of the node is green, indicating that the
processed data is now available on the outports. The result(s) can be inspected by
exploring the out-port view(s): the last entries in the context menu open them.
Ports on the left are input ports, where the data from the outport of the predeces-
sor node is fed into the node. Ports on the right are outgoing ports. The result of the
node’s operation on the data is provided at the out-port to successor nodes. A tooltip
gives information about the output of the node.
Ports are typed such that only ports of the same type can be connected; Fig. C.3
shows the corresponding symbols for the following, most prominently encountered
port types:
• Data Ports: The most common type is the data port (a white triangle), which
transfers flat data tables from node to node.
• Database Ports: Nodes executing commands inside a database can be identified
by their port color and shape (brown square).
• PMML Ports: Data Mining nodes learn a model, which is passed to a model writer
or predictor node via a blue square PMML port.
• Other Ports: Whenever a node provides data which does not fit a flat data table
structure, a general purpose port for structured data is used (dark cyan square).
Ports that are not data, database, PMML, or ports for structured data are displayed
as unknown types (gray square).
data views later). To again see all nodes in the repository, press ESC or Backspace
in the search field of the Node Repository.
Finally, drag the Interactive Table and the Scatter Plot from the Data Views cat-
egory to the Workflow Editor and position them to the right of the Color Manager
node.
After placing all nodes, we can now connect them (one can, of course, also later
drag new nodes onto the workbench). Click one output (right) port of the file reader
and drag the connection to the input port of the k-means node. Then continue to
connect the ports as shown in Fig. C.6. (Note that your nodes will not show a green
status, as long as they are not configured and executed.)
Some of the now connected nodes may still show a red status icon, indicating
that they must be configured in order to produce meaningful results. Right click the
File Reader and select Configure from the menu. Navigate to the IrisDataSet direc-
tory located in the KNIME installation directory. Select the data.all file from this
location. The File Reader’s preview table shows a sample of the data, which should
match the structure of the data file correctly. Click OK to confirm this configuration.
Once the node has been configured correctly, it switches to yellow (indicating that
it is ready for execution). After that, the K-Means node will also turn yellow, since
its default settings can be applied. To be sure that the default settings fit your needs,
open the dialog and inspect the default settings.
In order to configure the Color Manager node, you must first execute the K-Means
node by right clicking the node and selecting Execute. Note how the File
Reader node will automatically be executed as well. After execution all nominal
values and ranges of all attributes are known at the outport of the executed node:
this meta information is propagated to the successor nodes. The Color Manager
needs this data before it can be configured. Once the K-Means node is executed,
open the configuration dialog of the Color Manager node. The node will suggest
coloring the rows in our table based on the clustering results. Accept these default
settings by clicking OK.
Fig. C.6 The example flow to cluster and visualize the sample data set
Finally, execute the Scatter Plot. In order to examine the data and the results,
open the nodes’ views. In our example, the K-Means, the Interactive Table, and the
Scatter Plot have views. Open them from the nodes’ context menus.
Select some points in the scatter plot and choose “Hilite Selected” from the
“Hilite” menu. The hilited points are marked with an orange border. You will also
see the hilited points in the table view. The propagation of the hilite status works for
all views in all branches of the flow displaying the same data. Figure C.7 shows an
example of the views with a couple of highlighted points.
C.4 R Integration
One of the nice features of KNIME is the modular, open API, which allows one to
easily integrate other data processing or analysis projects. From the KNIME webpage
one can already download a number of such integrations of third party libraries
and projects, most notably the statistical data analysis package R and the machine
learning library Weka. In addition, a number of external contributors are providing
nodes integrating their own projects into KNIME.
The Weka integration is fairly straightforward to use: one simply drags the node
corresponding to the desired learning algorithm onto the workbench, connects it,
and opens the configuration dialog, which then provides access to all appropriate
parameters. If views are available, the KNIME–Weka nodes allow one to open those
just like other KNIME views. Weka models can also be fed into a predictor node,
similar to KNIME models, and hence applied to other data.
The R integration is a bit different, however. Since R is really more of a
statistical programming language, it would require thousands of nodes to cover all the
possibilities hidden within the language. KNIME therefore offers nodes that allow
one to call small fragments of R code instead. This makes it possible to use the power
of R when needed, e.g., for sophisticated statistical analyses, and to rely on KNIME’s
strengths for data loading, integration, and transformation, as well as some of the
built-in analysis routines. KNIME can be pointed to a local R installation (one can
actually download an integrated R installation together with the corresponding KNIME
nodes), and it can also access an R installation residing on a server. Different
R nodes allow one to execute an R script on incoming data and produce
a data table, a view, or a model. The latter can then be used in the R predictor node
and applied to other data. Figure C.8 shows a small example flow and the dialog for
an R snippet node.
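To give an impression of the kind of R fragments such nodes might contain, the following sketch shows one script for each of the three result types (table, view, model) and the predictor step. It is plain R meant to be run outside KNIME: the variable `incoming` and the use of the iris data are assumptions made for illustration. The names under which KNIME passes the incoming table to the script and picks up the result depend on the node and KNIME version and are not shown here.

```r
# Stand-in for the table arriving at the node's input port (assumption).
incoming <- iris
train <- incoming[seq(1, nrow(incoming), by = 2), ]  # odd rows
other <- incoming[seq(2, nrow(incoming), by = 2), ]  # even rows, the "other data"

# 1. A script that produces a data table:
#    per-species means of all numeric columns.
result_table <- aggregate(. ~ Species, data = train, FUN = mean)

# 2. A script that produces a view:
#    a box plot drawn on the current graphics device.
boxplot(Petal.Length ~ Species, data = train)

# 3. A script that produces a model: a simple linear regression.
model <- lm(Petal.Width ~ Petal.Length, data = train)

# Predictor step: apply the model to data it has not seen.
other$prediction <- predict(model, newdata = other)
```

Each fragment is only a few lines long, which is the point of the snippet approach: data loading and preparation stay in the flow, and R is used only for the statistical step.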
References
Appendix A
1. Berthold, M., Hand, D.: Intelligent Data Analysis. Springer, Berlin (2009)
2. Buffon, G.-L.L.: Mémoire sur le Jeu Franc-Carreau, France (1733)
3. Everitt, B.S.: The Cambridge Dictionary of Statistics, 3rd edn. Cambridge University Press,
Cambridge (2006)
4. Freedman, S., Pisani, R., Purves, R.: Statistics, 4th edn. Norton, London (2007)
5. Friedberg, S.H., Insel, A.J., Spence, L.E.: Linear Algebra, 4th edn. Prentice Hall, Englewood
Cliffs (2002)
6. Huff, D.: How to Lie with Statistics. Norton, New York (1954)
7. Kolmogorow, A.N.: Foundations of the Theory of Probability. Chelsea, New York (1956)
8. Krämer, W.: So lügt man mit Statistik, 7 Auflage. Campus-Verlag, Frankfurt (1997)
9. Landau, L.D., Lifshitz, E.M.: Mechanics, 3rd edn. Butterworth-Heinemann, Oxford (1976)
10. Larsen, R.J., Marx, M.L.: An Introduction to Mathematical Statistics and Its Applications, 4th
edn. Prentice Hall, Englewood Cliffs (2005)
11. Lay, D.C.: Linear Algebra and Its Applications, 3rd edn. Addison Wesley, Reading (2005)
12. von Mises, R.: Wahrscheinlichkeit, Statistik und Wahrheit. Berlin (1928)
13. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C—The
Art of Scientific Computing, 2nd edn. Cambridge University Press, Cambridge (1992)
14. Sachs, L.: Angewandte Statistik—Anwendung statistischer Methoden, 11 Auflage. Springer,
Berlin (2003)
15. Wichura, M.J.: Algorithm AS 241: the percentage points of the normal distribution. Appl.
Stat. 37, 477–484 (1988)
Appendix B
16. Chambers, J.: Software for Data Analysis: Programming with R. Springer, New York (2008)
17. Dalgaard, P.: Introductory Statistics with R, 2nd edn. Springer, New York (2008)
18. Murrell, P.: R Graphics. Chapman & Hall/CRC, Boca Raton (2006)
19. Spector, P.: Data Manipulation with R. Springer, New York (2008)