CHAPTER 4

Classification
4.1 INTRODUCTION
4.2 STATISTICAL-BASED ALGORITHMS
4.3 DISTANCE-BASED ALGORITHMS
4.4 DECISION TREE-BASED ALGORITHMS
4.5 NEURAL NETWORK-BASED ALGORITHMS
4.6 RULE-BASED ALGORITHMS
4.7 COMBINING TECHNIQUES
4.8 SUMMARY
4.9 EXERCISES
4.10 BIBLIOGRAPHIC NOTES
4.1 INTRODUCTION
Classification is perhaps the most familiar and most popular data mining technique. Examples of classification applications include image and pattern recognition, medical diagnosis, loan approval, detecting faults in industry applications, and classifying financial market trends. Estimation and prediction may be viewed as types of classification. When someone estimates your age or guesses the number of marbles in a jar, these are actually classification problems. Prediction can be thought of as classifying an attribute value into one of a set of possible classes. It is often viewed as forecasting a continuous value, while classification forecasts a discrete value. Example 1.1 in Chapter 1 illustrates the use of classification for credit card purchases. The use of decision trees and neural networks (NNs) to classify people according to their height was illustrated in Chapter 3. Before the use of current data mining techniques, classification was frequently performed by simply applying knowledge of the data. This is illustrated in Example 4.1.
EXAMPLE 4.1

Teachers classify students' numeric grades into letter grades by applying simple, well-known boundaries:

    90 ≤ grade               A
    80 ≤ grade < 90          B
    70 ≤ grade < 80          C
    60 ≤ grade < 70          D
    grade < 60               F
All approaches to performing classification assume some knowledge of the data. Often a training set is used to develop the specific parameters required by the technique. Training data consist of sample input data as well as the classification assignment for the data. Domain experts may also be used to assist in the process.

The classification problem is stated as shown in Definition 4.1:

DEFINITION 4.1. Given a database D = {t1, t2, ..., tn} of tuples and a set of classes C = {C1, ..., Cm}, the classification problem is to define a mapping f : D → C where each ti is assigned to one class. A class, Cj, contains precisely those tuples mapped to it; that is, Cj = {ti | f(ti) = Cj, 1 ≤ i ≤ n, and ti ∈ D}.

Our definition views classification as a mapping from the database to the set of classes. Note that the classes are predefined, are nonoverlapping, and partition the entire database. Each tuple in the database is assigned to exactly one class. The classes that exist for a classification problem are indeed equivalence classes. In actuality, the problem usually is implemented in two phases:

1. Create a specific model by evaluating the training data. This step has as input the training data (including defined classification for each tuple) and as output a definition of the model developed. The model created classifies the training data as accurately as possible.

2. Apply the model developed in step 1 by classifying tuples from the target database.

Although the second step actually does the classification (according to the definition in Definition 4.1), most research has been applied to step 1. Step 2 is often straightforward.

As discussed in [KLR+98], there are three basic methods used to solve the classification problem:

• Specifying boundaries. Here classification is performed by dividing the input space of potential database tuples into regions where each region is associated with one class.

• Using probability distributions. For any given class, Cj, P(ti | Cj) is the PDF for the class evaluated at one point, ti.¹ If a probability of occurrence for each class, P(Cj), is known (perhaps determined by a domain expert), then P(Cj) P(ti | Cj) is used to estimate the probability that ti is in class Cj.

• Using posterior probabilities. Given a data value ti, we would like to determine the probability that ti is in a class Cj. This is denoted by P(Cj | ti) and is called the posterior probability. One classification approach would be to determine the posterior probability for each class and then assign ti to the class with the highest probability.

The naive divisions used in Example 4.1 as well as decision tree techniques are examples of the first modeling approach. Neural networks fall into the third category.

¹In this discussion each tuple in the database is assumed to consist of a single value rather than a set of values.

Suppose we are given that a database consists of tuples of the form t = ⟨x, y⟩ where 0 ≤ x ≤ 8 and 0 ≤ y ≤ 10. Figure 4.1 illustrates the classification problem. Figure 4.1(a) shows the predefined classes by dividing the reference space, Figure 4.1(b) provides sample input data, and Figure 4.1(c) shows the classification of the data based on the defined classes.

FIGURE 4.1: Classification problem: (a) predefined classes dividing the reference space, (b) sample input data, (c) the classified data.

A major issue associated with classification is that of overfitting. If the classification strategy fits the training data exactly, it may not be applicable to a broader population of data. For example, suppose that the training data contain erroneous or noisy data. Certainly in this case, fitting the data exactly is not desired.

In the following sections, various approaches to performing classification are examined. Table 4.1 contains data to be used throughout this chapter to illustrate the various techniques. This example assumes that the problem is to classify adults as short, medium, or tall. Table 4.1 lists height in meters. The last two columns of this table show two classifications that could be made, labeled Output1 and Output2, respectively. The Output1 classification uses the simple divisions shown below:

    2 m ≤ Height                Tall
    1.7 m < Height < 2 m        Medium
    Height ≤ 1.7 m              Short

The Output2 results require a much more complicated set of divisions using both height and gender attributes.

In this chapter we examine classification algorithms based on the categorization as seen in Figure 4.2. Statistical algorithms are based directly on the use of statistical information. Distance-based algorithms use similarity or distance measures to perform the classification. Decision tree and NN approaches use these structures to perform the classification. Rule-based classification algorithms generate if-then rules to perform the classification.
TABLE 4.1: Data for Height Classification

Name        Gender   Height    Output1   Output2
Kristina    F        1.6 m     Short     Medium
Jim         M        2 m       Tall      Medium
Maggie      F        1.9 m     Medium    Tall
Martha      F        1.88 m    Medium    Tall
Stephanie   F        1.7 m     Short     Medium
Bob         M        1.85 m    Medium    Medium
Kathy       F        1.6 m     Short     Medium
Dave        M        1.7 m     Short     Medium
Worth       M        2.2 m     Tall      Tall
Steven      M        2.1 m     Tall      Tall
Debbie      F        1.8 m     Medium    Medium
Todd        M        1.95 m    Medium    Medium
Kim         F        1.9 m     Medium    Tall
Amy         F        1.8 m     Medium    Medium
Wynette     F        1.75 m    Medium    Medium

FIGURE 4.2: Categorization of classification algorithms: Statistical, Distance, DT, NN, Rules.

4.1.1 Issues in Classification

Missing Data. Missing data values cause problems during both the training phase and the classification process itself. Missing values in the training data must be handled. There are several ways to handle missing data:

• Assume a value for the missing data. This may be determined by using some method to predict what the value could be.

• Assume a special value for the missing data. This means that the value of missing data is taken to be a specific value all of its own.

Notice the similarity between missing data in the classification problem and that of nulls in traditional databases.

Measuring Performance. Table 4.1 shows two different classification results using two different classification tools. Determining which is best depends on the interpretation of the problem by users. The performance of classification algorithms is usually examined by evaluating the accuracy of the classification. However, since classification is often a fuzzy problem, the correct answer may depend on the user. Traditional algorithm evaluation approaches such as determining the space and time overhead can be used, but these approaches are usually secondary.

Classification accuracy is usually calculated by determining the percentage of tuples placed in the correct class. This ignores the fact that there also may be a cost associated with an incorrect assignment to the wrong class. This perhaps should also be determined.

We can examine the performance of classification much as is done with information retrieval systems. With only two classes, there are four possible outcomes with the classification, as is shown in Figure 4.3. The upper left and lower right quadrants [for both Figure 4.3(a) and (b)] represent correct actions. The remaining two quadrants are incorrect actions. The performance of a classification could be determined by associating costs with each of the quadrants. However, this would be difficult because the total number of costs needed is m², where m is the number of classes.

Given a specific class, Cj, and a database tuple, ti, that tuple may or may not be assigned to that class while its actual membership may or may not be in that class. This again gives us the four quadrants shown in Figure 4.3(c), which can be described in the following ways:

• True positive (TP): ti predicted to be in Cj and is actually in it.

• False positive (FP): ti predicted to be in Cj but is not actually in it.

• True negative (TN): ti not predicted to be in Cj and is not actually in it.

• False negative (FN): ti not predicted to be in Cj but is actually in it.

FIGURE 4.3: Comparing classification performance to information retrieval: (a) information retrieval (RET/NOTRET versus REL/NOTREL), (b) classification into Class A, (c) class prediction (true positive, false negative, false positive, true negative).
An operating characteristic (OC) curve shows the relationship between the false positives and true positives produced as a sample is evaluated. At the beginning of evaluating a sample, there are none of either category, while at the end there are 100 percent of each. When evaluating the results for a specific sample, the curve looks like a jagged stair-step, as seen in Figure 4.4, as each new tuple is either a false positive or a true positive. A more smoothed version of the OC curve can also be obtained.

FIGURE 4.4: Operating characteristic curve.

A confusion matrix illustrates the accuracy of the solution to a classification problem. Given m classes, a confusion matrix is an m × m matrix where entry ci,j indicates the number of tuples from D that were assigned to class Cj but where the correct class is Ci. Obviously, the best solutions will have only zero values outside the diagonal. Table 4.2 shows a confusion matrix for the height example in Table 4.1 where the Output1 assignment is assumed to be correct and the Output2 assignment is what is actually made.

TABLE 4.2: Confusion Matrix

Actual                  Assignment
Membership       Short   Medium   Tall
Short            0       4        0
Medium           0       5        3
Tall             0       1        2
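A confusion matrix like Table 4.2 can be tallied directly from the two labelings in Table 4.1. The following is a minimal sketch in plain Python (the variable names are ours, not the book's), treating Output1 as the correct class and Output2 as the assignment actually made:

```python
# Output1 (correct class) and Output2 (assigned class) for the 15 tuples of Table 4.1, in order.
output1 = ["Short", "Tall", "Medium", "Medium", "Short", "Medium", "Short", "Short",
           "Tall", "Tall", "Medium", "Medium", "Medium", "Medium", "Medium"]
output2 = ["Medium", "Medium", "Tall", "Tall", "Medium", "Medium", "Medium", "Medium",
           "Tall", "Tall", "Medium", "Medium", "Tall", "Medium", "Medium"]

classes = ["Short", "Medium", "Tall"]

# confusion[c][d] counts tuples whose correct class is c but which were assigned to d.
confusion = {c: {d: 0 for d in classes} for c in classes}
for actual, assigned in zip(output1, output2):
    confusion[actual][assigned] += 1

for c in classes:
    print(c, [confusion[c][d] for d in classes])
# Reproduces the rows of Table 4.2: Short [0, 4, 0], Medium [0, 5, 3], Tall [0, 1, 2]
```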
4.2 STATISTICAL-BASED ALGORITHMS

4.2.1 Regression

Regression problems deal with estimation of an output value based on input values. When used for classification, the input values are values from the database D and the output values represent the classes. Regression can be used to solve classification problems, but it can also be used for other applications such as forecasting. Although not explicitly described in this text, regression can be performed using many different types of techniques, including NNs. In actuality, regression takes a set of data and fits the data to a formula.

Looking at Figure 3.3 in Chapter 3, we see that a simple linear regression problem can be thought of as estimating the formula for a straight line (in a two-dimensional space). This can be equated to partitioning the data into two classes. With the banking example, these would be to approve or reject a loan application. The straight line is the break-even point or the division between the two classes.

In Chapter 2, we briefly introduced linear regression using the formula

    y = c0 + c1x1 + ··· + cnxn    (4.1)

By determining the regression coefficients c0, c1, ..., cn, the relationship between the output parameter, y, and the input parameters, x1, ..., xn, can be estimated. All high school algebra students are familiar with determining the formula for a straight line, y = mx + b, given two points in the xy plane. They are determining the regression coefficients m and b. Here the two points represent the training data.

Admittedly, Example 3.5 is an extremely simple problem. However, it illustrates how we all use the basic classification or prediction techniques frequently. Figure 4.5 illustrates the more general use of linear regression with one input value. Here there is a sample of data that we wish to model (shown by the scatter dots) using a linear model. The line generated by the linear regression technique is shown in the figure. Notice, however, that the actual data points do not fit the linear model exactly. Thus, this model is an estimate of what the actual input-output relationship is. We can use the generated linear model to predict an output value given an input value, but unlike that for Example 3.5, the prediction is an estimate rather than the actual output value. If we attempt to fit data that are not linear to a linear model, the results will be a poor model of the data, as illustrated by Figure 4.5.

FIGURE 4.5: Example of poor fit for linear regression.
There are many reasons why the linear regression model may not be used to estimate output data. One is that the data do not fit a linear model. It is possible, however, that the data generally do actually represent a linear model, but the linear model generated is poor because noise or outliers exist in the data. Noise is erroneous data. Outliers are data values that are exceptions to the usual and expected data. Example 4.2 illustrates outliers. In these cases the observable data may actually be described by the following:

    y = c0 + c1x1 + ··· + cnxn + ε    (4.2)

Here ε is a random error with a mean of 0. As with point estimation, we can estimate the accuracy of the fit of a linear regression model to the actual data using a mean squared error function.

EXAMPLE 4.2

Suppose that a graduate level abstract algebra class has 100 students. Kristina consistently outperforms the other students on exams. On the final exam, Kristina gets a grade of 99. The next highest grade is 75, with the range of grades being between 5 and 99. Kristina clearly is resented by the other students in the class because she does not perform at the same level they do. She "ruins the curve." If we were to try to fit a model to the grades, this one outlier grade would cause problems because any model that attempted to include it would then not accurately model the remaining data.

We illustrate the process using a simple linear regression formula and assuming k points in our training sample. We thus have the following k formulas:

    yi = c0 + c1x1i + εi,   i = 1, ..., k    (4.3)

The coefficients c0 and c1 are chosen so as to minimize the sum of the squared errors:

    L = Σ_{i=1}^{k} εi² = Σ_{i=1}^{k} (yi − c0 − c1x1i)²    (4.4)

Example 4.3 illustrates the division process, while Example 4.4 illustrates the prediction process using the data from Table 4.1. For simplicity, we assume the training data include only data for short and medium people and that the classification is performed using the Output1 column values. If you extend this example to all three classes, you will see that it is a nontrivial task to use linear regression for classification. It also will become obvious that the result may be quite poor.

EXAMPLE 4.3

By looking at the data in the Output1 column from Table 4.1 and the basic understanding that the class to which a person is assigned is based only on the numeric value of his or her height, in this example we apply the linear regression concept to determine how to distinguish between the short and medium classes. Figure 4.6(a) shows the points under consideration. We thus have the linear regression formula of y = c0 + ε. This implies that we are actually going to be finding the value for c0 that best partitions the height numeric values into those that are short and those that are medium. Looking at the data in Table 4.1, we see that only 12 of the 15 entries can be used to differentiate between short and medium persons. We thus obtain the following values for yi in our training data: {1.6, 1.9, 1.88, 1.7, 1.85, 1.6, 1.7, 1.8, 1.95, 1.9, 1.8, 1.75}. We wish to minimize

    L = Σ_{i=1}^{12} εi² = Σ_{i=1}^{12} (yi − c0)²

Taking the derivative with respect to c0 and setting it equal to zero we get

    −2 Σ_{i=1}^{12} yi + Σ_{i=1}^{12} 2c0 = 0

Solving for c0 we find that

    c0 = (Σ_{i=1}^{12} yi) / 12 = 1.786

We thus have the division between short and medium persons as being determined by y = 1.786, as seen in Figure 4.6(b).

FIGURE 4.6: Data for Example 4.3: (a) the heights under consideration, (b) the division between short and medium at y = 1.786.
EXAMPLE 4.4

We now look at predicting the class using the short and medium data as input and looking at the Output1 classification. The data are the same as those in Example 4.3 except that we now look at the classes as indicated in the training data. Since regression assumes numeric data, we assume that the value for the short class is 0 and the value for the medium class is 1. Figure 4.7(a) shows the data for this example: {(1.6, 0), (1.9, 1), (1.88, 1), (1.7, 0), (1.85, 1), (1.6, 0), (1.7, 0), (1.8, 1), (1.95, 1), (1.9, 1), (1.8, 1), (1.75, 1)}. In this case we are using the regression formula with one variable:

    y = c0 + c1x1 + ε

We thus wish to minimize

    L = Σ_{i=1}^{12} εi² = Σ_{i=1}^{12} (yi − c0 − c1x1i)²

Taking the partial derivative with respect to c0 and setting it equal to zero we get

    ∂L/∂c0 = −2 Σ_{i=1}^{12} yi + Σ_{i=1}^{12} 2c0 + Σ_{i=1}^{12} 2c1x1i = 0

To simplify the notation, in the rest of the example we drop the range values for the summation because all are the same. Solving for c0, we find that:

    c0 = (Σ yi − c1 Σ x1i) / 12

Now taking the partial of L with respect to c1, substituting the value for c0, and setting it equal to zero we obtain

    ∂L/∂c1 = 2 Σ (yi − c0 − c1x1i)(−x1i) = 0

Solving for c1, we finally have

    c1 = (Σ x1i yi − (1/12) Σ x1i Σ yi) / (Σ x1i² − (1/12)(Σ x1i)²)

We can now solve for c0 and c1. Using the data from the 12 points in the training data, we have Σ x1i = 21.43, Σ yi = 8, Σ (x1i yi) = 14.83, and Σ x1i² = 38.42. Thus, we get c1 = 3.63 and c0 = −5.816. The prediction for the class value is thus

    y = −5.816 + 3.63 x1

This line is plotted in Figure 4.7(b).

FIGURE 4.7: Data for Example 4.4: (a) class values plotted against height, (b) the prediction line y = −5.816 + 3.63 x1.

In Example 4.4 a line that predicts the class value is generated. This was done for two classes, but it also could have been done for all three classes. Unlike the division approach where class membership is obvious based on the region within which a point occurs, with prediction the class to which a point belongs is less obvious. Here we predict a class value. In Figure 4.7(b) the class value is predicted based on the height value alone. Since the prediction line is continuous, however, the class membership is not always obvious. For example, if the prediction for a value is 0.4, what would its class be? We can determine the class by splitting the line, so that a height is in the short class if its predicted value falls below 0.5; the value 0.5 then serves as the division between the short class and the medium class.
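The computations of Examples 4.3 and 4.4 can be reproduced with a few lines of code. The sketch below (plain Python; the helper names are ours) fits the constant c0 of Example 4.3 and the least squares line of Example 4.4, then classifies a height by splitting the predicted value at 0.5:

```python
# The 12 short/medium heights from Table 4.1 (Output1), as used in Examples 4.3 and 4.4.
heights = [1.6, 1.9, 1.88, 1.7, 1.85, 1.6, 1.7, 1.8, 1.95, 1.9, 1.8, 1.75]
# Class values for Example 4.4: short = 0, medium = 1.
labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1]
k = len(heights)

# Example 4.3: y = c0 + e.  Minimizing sum (y_i - c0)^2 gives c0 = mean of the y_i.
c0_division = sum(heights) / k
print(round(c0_division, 3))          # 1.786 -- the division between short and medium

# Example 4.4: y = c0 + c1*x1 + e, fit by least squares.
sx = sum(heights)
sy = sum(labels)
sxy = sum(x * y for x, y in zip(heights, labels))
sxx = sum(x * x for x in heights)
c1 = (sxy - sx * sy / k) / (sxx - sx * sx / k)
c0 = (sy - c1 * sx) / k
print(round(c1, 2), round(c0, 2))     # about 3.59 and -5.74 (the text's 3.63 and -5.816
                                      # come from rounding the sums before solving)

def classify(height):
    """Predict the class value and split the line at 0.5."""
    return "Medium" if c0 + c1 * height >= 0.5 else "Short"

print(classify(1.65), classify(1.95))   # Short Medium
```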
If the predictors in the linear regression function are modified by some function (square, square root, etc.), then the model looks like

    y = c0 + c1 f1(x1) + ··· + cn fn(xn) + ε    (4.5)

Linear regression is not always appropriate, because the data may not fit a straight line, but also because the straight line values can be greater than 1 and less than 0. Thus, they certainly cannot be used as the probability of occurrence of the target class. Another commonly used regression technique is called logistic regression. Instead of fitting the data to a straight line, logistic regression uses a logistic curve such as is illustrated in Figure 4.8. The formula for a univariate logistic curve is

    p = e^(c0 + c1x1) / (1 + e^(c0 + c1x1))    (4.6)

The logistic curve gives a value between 0 and 1, so it can be interpreted as the probability of class membership. As with linear regression, it can be used when classification into two classes is desired. To perform the regression, the logarithmic function can be applied to obtain the logistic function

    loge( p / (1 − p) ) = c0 + c1x1    (4.7)

Here p is the probability of being in the class and 1 − p is the probability that it is not. However, the process chooses values for c0 and c1 that maximize the probability of observing the given values.

FIGURE 4.8: Logistic curve.
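A minimal sketch of Equations (4.6) and (4.7) is shown below. The coefficients c0 and c1 here are illustrative assumptions only (they are not fitted to Table 4.1); in practice they would be chosen by maximizing the likelihood of the observed values, as described above:

```python
import math

def logistic(x1, c0, c1):
    """Univariate logistic curve, Equation (4.6): returns a value in (0, 1)."""
    z = c0 + c1 * x1
    return math.exp(z) / (1 + math.exp(z))

def logit(p):
    """The transform of Equation (4.7): log_e(p / (1 - p))."""
    return math.log(p / (1 - p))

# Illustrative (assumed) coefficients for a height-based two-class problem.
c0, c1 = -30.0, 17.0

for height in (1.6, 1.75, 1.9):
    p = logistic(height, c0, c1)
    print(height, round(p, 3), round(logit(p), 3))
# The printed logit values equal c0 + c1*height, confirming that (4.7) inverts (4.6).
```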
4.2.2 Bayesian Classification

Assuming that the contributions by all attributes are independent and that each contributes equally to the classification problem, a simple classification scheme called naive Bayes classification has been proposed that is based on Bayes rule of conditional probability as stated in Definition 3.1. This approach was briefly outlined in Chapter 3. By analyzing the contribution of each "independent" attribute, a conditional probability is determined. A classification is made by combining the impact that the different attributes have on the prediction to be made. The approach is called "naive" because it assumes the independence between the various attribute values. Given a data value xi, the probability that a related tuple, ti, is in class Cj is described by P(Cj | xi). Training data can be used to determine P(xi), P(xi | Cj), and P(Cj). From these values, Bayes theorem allows us to estimate the posterior probability P(Cj | xi) and then P(Cj | ti).

Given a training set, the naive Bayes algorithm first estimates the prior probability P(Cj) for each class by counting how often each class occurs in the training data. For each attribute, xi, the number of occurrences of each attribute value xi can be counted to determine P(xi). Similarly, the probability P(xi | Cj) can be estimated by counting how often each value occurs in the class in the training data. Note that we are looking at attribute values here. A tuple in the training data may have many different attributes, each with many values. This must be done for all attributes and all values of attributes. We then use these derived probabilities when a new tuple must be classified. This is why naive Bayes classification can be viewed as both a descriptive and a predictive type of algorithm. The probabilities are descriptive and are then used to predict the class membership for a target tuple.

When classifying a target tuple, the conditional and prior probabilities generated from the training set are used to make the prediction. This is done by combining the effects of the different attribute values from the tuple. Suppose that tuple ti has p independent attribute values {xi1, xi2, ..., xip}. From the descriptive phase, we know P(xik | Cj) for each class Cj and attribute xik. We then estimate P(ti | Cj) by

    P(ti | Cj) = Π_{k=1}^{p} P(xik | Cj)    (4.8)

At this point in the algorithm, we then have the needed prior probabilities P(Cj) for each class and the conditional probability P(ti | Cj). To calculate P(ti), we can estimate the likelihood that ti is in each class. This can be done by finding the likelihood that this tuple is in each class and then adding all these values. The probability that ti is in a class is the product of the conditional probabilities for each attribute value. The posterior probability P(Cj | ti) is then found for each class. The class with the highest probability is the one chosen for the tuple. Example 4.5 illustrates the use of naive Bayes classification.

EXAMPLE 4.5

Using the Output1 classification results for Table 4.1, there are four tuples classified as short, eight as medium, and three as tall. To facilitate classification, we divide the height attribute into six ranges:

    (0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, ∞)

Table 4.3 shows the counts and subsequent probabilities associated with the attributes. With these training data, we estimate the prior probabilities:

    P(short) = 4/15 = 0.267, P(medium) = 8/15 = 0.533, and P(tall) = 3/15 = 0.2
TABLE 4.3: Probabilities Associated with Attributes

                            Count                    Probabilities
Attribute   Value           Short  Medium  Tall      Short  Medium  Tall
Gender      M               1      2       3         1/4    2/8     3/3
            F               3      6       0         3/4    6/8     0/3
Height      (0, 1.6]        2      0       0         2/4    0       0
            (1.6, 1.7]      2      0       0         2/4    0       0
            (1.7, 1.8]      0      3       0         0      3/8     0
            (1.8, 1.9]      0      4       0         0      4/8     0
            (1.9, 2]        0      1       1         0      1/8     1/3
            (2, ∞)          0      0       2         0      0       2/3

We use these values to classify a new tuple. For example, suppose we wish to classify t = ⟨Adam, M, 1.95 m⟩. By using these values and the associated probabilities of gender and height, we obtain the following estimates:

    P(t | short) = 1/4 × 0 = 0
    P(t | medium) = 2/8 × 1/8 = 0.031
    P(t | tall) = 3/3 × 1/3 = 0.333

Combining these, we get

    Likelihood of being short = 0 × 0.267 = 0    (4.9)
    Likelihood of being medium = 0.031 × 0.533 = 0.0166    (4.10)
    Likelihood of being tall = 0.33 × 0.2 = 0.066    (4.11)

We estimate P(t) by summing up these individual likelihood values since t will be either short or medium or tall:

    P(t) = 0 + 0.0166 + 0.066 = 0.0826    (4.12)

Finally, we obtain the actual probabilities of each event:

    P(short | t) = (0 × 0.267) / 0.0826 = 0    (4.13)
    P(medium | t) = (0.031 × 0.533) / 0.0826 = 0.2    (4.14)
    P(tall | t) = (0.333 × 0.2) / 0.0826 = 0.799    (4.15)

Therefore, based on these probabilities, we classify the new tuple as tall because it has the highest probability.

The naive Bayes approach has several advantages. First, it is easy to use. Second, unlike other classification approaches, only one scan of the training data is required. The naive Bayes approach can easily handle missing values by simply omitting that probability when calculating the likelihoods of membership in each class. In cases where there are simple relationships, the technique often does yield good results.

Although the naive Bayes approach is straightforward to use, it does not always yield satisfactory results. First, the attributes usually are not independent. We could use a subset of the attributes by ignoring any that are dependent on others. The technique does not handle continuous data. Dividing the continuous values into ranges could be used to solve this problem, but the division of the domain into ranges is not an easy task, and how this is done can certainly impact the results.
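The arithmetic of Example 4.5 can be written out directly. The sketch below (plain Python, with the counts taken from Table 4.3) computes the likelihoods and posterior estimates for the tuple ⟨Adam, M, 1.95 m⟩:

```python
# Prior probabilities from the Output1 classification (4 short, 8 medium, 3 tall of 15).
prior = {"Short": 4 / 15, "Medium": 8 / 15, "Tall": 3 / 15}

# Conditional probabilities P(x_ik | C_j) from Table 4.3 for the two attribute values
# of the tuple t = <Adam, M, 1.95 m>: gender = M and height in (1.9, 2.0].
cond = {
    "Short":  {"gender=M": 1 / 4, "height=(1.9,2]": 0 / 4},
    "Medium": {"gender=M": 2 / 8, "height=(1.9,2]": 1 / 8},
    "Tall":   {"gender=M": 3 / 3, "height=(1.9,2]": 1 / 3},
}

# P(t | C_j) is the product of the conditional probabilities (Equation 4.8);
# the likelihood of each class is then P(t | C_j) * P(C_j).
likelihood = {}
for c in prior:
    p_t_given_c = cond[c]["gender=M"] * cond[c]["height=(1.9,2]"]
    likelihood[c] = p_t_given_c * prior[c]

p_t = sum(likelihood.values())                      # estimate of P(t), about 0.083
posterior = {c: likelihood[c] / p_t for c in likelihood}

print({c: round(p, 3) for c, p in posterior.items()})
# {'Short': 0.0, 'Medium': 0.2, 'Tall': 0.8} -> Adam is classified as tall
# (the text's 0.799 reflects intermediate rounding).
```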
4.3 DISTANCE-BASED ALGORITHMS

Each item that is mapped to the same class may be thought of as more similar to the other items in that class than it is to the items found in other classes. Therefore, similarity (or distance) measures may be used to identify the "alikeness" of different items in the database. The concept of similarity measure was introduced in Chapter 2 with respect to IR retrieval. Certainly, the concept is well known to anyone who has performed Internet searches using a search engine. In these cases, the set of Web pages represents the whole database and these are divided into two classes: those that answer your query and those that do not. Those that answer your query should be more alike than those that do not answer your query. The similarity in this case is defined by the query you state, usually a keyword list. Thus, the retrieved pages are similar because they all contain (to some degree) the keyword list you have specified.

The idea of similarity measures can be abstracted and applied to more general classification problems. The difficulty lies in how the similarity measures are defined and applied to the items in the database. Since most similarity measures assume numeric (and often discrete) values, they might be difficult to use for more general or abstract data types. A mapping from the attribute domain to a subset of the integers may be used.

Using a similarity measure for classification where the classes are predefined is somewhat simpler than using a similarity measure for clustering where the classes are not known in advance. Again, think of the IR example. Each IR query provides the class definition in the form of the IR query itself. So the classification problem then becomes one of determining similarity not among all tuples in the database but between each tuple and the query. This makes the problem an O(n) problem rather than an O(n²) problem.

4.3.1 Simple Approach

Using the IR approach, if we have a representative of each class, we can perform classification by assigning each tuple to the class to which it is most similar. We assume here that each tuple, ti, in the database is defined as a vector ⟨ti1, ti2, ..., tik⟩ of numeric values. Likewise, we assume that each class Cj is defined by a tuple ⟨Cj1, Cj2, ..., Cjk⟩ of numeric values. The classification problem is then restated in Definition 4.2.

DEFINITION 4.2. Given a database D = {t1, t2, ..., tn} where each tuple ti = ⟨ti1, ti2, ..., tik⟩ contains numeric values, and a set of classes C = {C1, ..., Cm} where each class Cj = ⟨Cj1, Cj2, ..., Cjk⟩ also contains numeric values, the classification problem is to assign each ti to the class Cj such that sim(ti, Cj) ≥ sim(ti, Cl) for all Cl ∈ C with Cl ≠ Cj.
To calculate these similarity measures, the representative vector for each class must be determined. Referring to the three classes in Figure 4.1(a), we can determine a representative for each class by calculating the center of each region. Thus class A is represented by ⟨4, 7.5⟩, class B by ⟨2, 2.5⟩, and class C by ⟨6, 2.5⟩. A simple classification technique, then, would be to place each item in the class where it is most similar (closest) to the center of that class. The representative for the class may be found in other ways. For example, in pattern recognition problems, a predefined pattern can be used to represent each class. Once a similarity measure is defined, each item to be classified will be compared to each predefined pattern. The item will be placed in the class with the largest similarity value. Algorithm 4.1 illustrates a straightforward distance-based approach assuming that each class, Ci, is represented by its center or centroid. In the algorithm we use ci to be the center for its class. Since each tuple must be compared to the center for a class and there are a fixed (usually small) number of classes, the complexity to classify one tuple is O(n).

ALGORITHM 4.1

Input:
    c1, ..., cm    // Centers for each class
    t              // Input tuple to classify
Output:
    c              // Class to which t is assigned

Simple distance-based algorithm:
    dist = ∞;
    for i := 1 to m do
        if dis(ci, t) < dist, then
            c = i;
            dist = dis(ci, t);

Figure 4.9 illustrates the use of this approach to perform classification using the data found in Figure 4.1. The three large dark circles are the class representatives for the three classes. The dashed lines show the distance from each item to the closest center.

FIGURE 4.9: Classification using simple distance algorithm.
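Algorithm 4.1 translates almost directly into code. A sketch using the class centers given above — A at (4, 7.5), B at (2, 2.5), C at (6, 2.5) — with Euclidean distance (the choice of distance measure here is ours, for illustration only):

```python
import math

# Class representatives (centroids) for the three regions of Figure 4.1(a).
centers = {"A": (4, 7.5), "B": (2, 2.5), "C": (6, 2.5)}

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def classify(t):
    """Simple distance-based algorithm: assign t to the class of the nearest center."""
    best_class, best_dist = None, float("inf")
    for label, center in centers.items():
        d = euclidean(t, center)
        if d < best_dist:
            best_class, best_dist = label, d
    return best_class

print(classify((3, 8)))   # A -- closest to (4, 7.5)
print(classify((6, 1)))   # C -- closest to (6, 2.5)
```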
4.3.2 K Nearest Neighbors

One common classification scheme based on the use of distance measures is that of K nearest neighbors (KNN). With KNN, the training data themselves serve as the model. When a classification is to be made for a new item, its distance to each item in the training set is computed; only the K closest entries in the training set are considered further, and the new item is placed in the class that contains the most items from this set of K closest items. Figure 4.10 illustrates the process used by KNN. Here the points in the training set are shown and K = 3. Algorithm 4.2 outlines the KNN approach, and Example 4.6 applies it to the height data.

FIGURE 4.10: Classification using KNN.
ALGORITHM 4.2

Input:
    T    // Training data
    K    // Number of neighbors
    t    // Input tuple to classify
Output:
    c    // Class to which t is assigned

KNN algorithm:
    N = ∅;
    // Find the set N of the K items in T closest to t
    for each d ∈ T do
        if |N| < K, then
            N = N ∪ {d};
        else if there exists u ∈ N with dis(t, u) > dis(t, d), then
            N = (N − {u}) ∪ {d};
    c = class to which the most items in N belong;

EXAMPLE 4.6

Using the sample data from Table 4.1 and the Output1 classification as the training set output value, we classify the tuple ⟨Pat, F, 1.6⟩. Only the height is used for distance calculation so that both the Euclidean and Manhattan distance measures yield the same results; that is, the distance is simply the absolute value of the difference between the values. Suppose that K = 5 is given. We then have that the K nearest neighbors to the input tuple are {⟨Kristina, F, 1.6⟩, ⟨Kathy, F, 1.6⟩, ⟨Stephanie, F, 1.7⟩, ⟨Dave, M, 1.7⟩, ⟨Wynette, F, 1.75⟩}. Of these five items, four are classified as short and one as medium. Thus, the KNN will classify Pat as short.
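Example 4.6 can be reproduced with a short script. The sketch below (plain Python, not the book's pseudocode) uses K = 5 and the absolute difference in height as the distance:

```python
from collections import Counter

# Training data: (name, height, Output1 class) from Table 4.1.
training = [
    ("Kristina", 1.6, "Short"), ("Jim", 2.0, "Tall"), ("Maggie", 1.9, "Medium"),
    ("Martha", 1.88, "Medium"), ("Stephanie", 1.7, "Short"), ("Bob", 1.85, "Medium"),
    ("Kathy", 1.6, "Short"), ("Dave", 1.7, "Short"), ("Worth", 2.2, "Tall"),
    ("Steven", 2.1, "Tall"), ("Debbie", 1.8, "Medium"), ("Todd", 1.95, "Medium"),
    ("Kim", 1.9, "Medium"), ("Amy", 1.8, "Medium"), ("Wynette", 1.75, "Medium"),
]

def knn_classify(height, k=5):
    # Only height is used, so the distance is the absolute difference in height.
    neighbors = sorted(training, key=lambda row: abs(row[1] - height))[:k]
    votes = Counter(cls for _, _, cls in neighbors)
    return votes.most_common(1)[0][0], [name for name, _, _ in neighbors]

cls, names = knn_classify(1.6)
print(cls, names)
# Short, with neighbors Kristina, Kathy, Stephanie, Dave, Wynette -- as in Example 4.6.
```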
4.4 DECISION TREE-BASED ALGORITHMS

The decision tree approach is most useful in classification problems. With this technique, a tree is constructed to model the classification process. Once the tree is built, it is applied to each tuple in the database and results in a classification for that tuple. There are two basic steps in the technique: building the tree and applying the tree to the database. Most research has focused on how to build effective trees as the application process is straightforward.

The decision tree approach to classification is to divide the search space into rectangular regions. A tuple is classified based on the region into which it falls. A definition for a decision tree used in classification is contained in Definition 4.3. There are alternative definitions; for example, in a binary DT the nodes could be labeled with the predicates themselves and each arc would be labeled with yes or no (like in the "Twenty Questions" game).

DEFINITION 4.3. Given a database D = {t1, ..., tn} where ti = ⟨ti1, ..., tih⟩ and the database schema contains attributes {A1, A2, ..., Ah}, and given a set of classes C = {C1, ..., Cm}, a decision tree (DT) or classification tree is a tree associated with D in which each internal node is labeled with an attribute Ai, each arc is labeled with a predicate that can be applied to the attribute associated with its parent node, and each leaf node is labeled with a class Cj.

One DT for the height sample data found in Table 4.1 yields the classification shown in the column labeled Output2. A different DT could yield a different classification. Since the application of a given tuple to a DT is relatively straightforward, we do not consider the second part of the problem further. Instead, we focus on algorithms to construct decision trees. Several algorithms are surveyed in the following subsections.

There are many advantages to the use of DTs for classification. DTs certainly are easy to use and efficient. Rules can be generated that are easy to interpret and understand. They scale well for large databases because the tree size is independent of the database size. Each tuple in the database must be filtered through the tree. This takes time proportional to the height of the tree, which is fixed. Trees can be constructed for data with many attributes.

Disadvantages also exist for DT algorithms. First, they do not easily handle continuous data. These attribute domains must be divided into categories to be handled. The approach used is that the domain space is divided into rectangular regions [such as is seen in Figure 4.1(a)]. Not all classification problems are of this type. The division shown by the simple loan classification problem in Figure 2.4(a) in Chapter 2 cannot be handled by DTs. Handling missing data is difficult because correct branches in the tree could not be taken. Since the DT is constructed from the training data, overfitting may occur. This can be overcome via tree pruning. Finally, correlations among attributes in the database are ignored by the DT process.
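Applying a decision tree is indeed straightforward. A sketch of the height tree from Chapter 3 [Figure 4.11(a) in the next subsection] written as nested conditionals; running it over Table 4.1 reproduces the Output2 column:

```python
def classify(gender, height):
    """Decision tree of Figure 4.11(a): split on gender, then on gender-specific height predicates."""
    if gender == "F":
        if height < 1.3:
            return "Short"
        elif height > 1.8:
            return "Tall"
        else:
            return "Medium"
    else:  # gender == "M"
        if height < 1.5:
            return "Short"
        elif height > 2.0:
            return "Tall"
        else:
            return "Medium"

print(classify("F", 1.6))   # Medium  (Kristina's Output2 value)
print(classify("M", 2.2))   # Tall    (Worth's Output2 value)
print(classify("M", 2.0))   # Medium  (Jim's Output2 value: 2.0 is not > 2 m)
```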
ALGORITHM 4.3

Input:
    D    // Training data
Output:
    T    // Decision tree

DTBuild algorithm:
    // Simplistic algorithm to illustrate naive approach to building DT
    T = ∅;
    Determine best splitting criterion;
    T = Create root node and label with splitting attribute;
    T = Add arc to root node for each split predicate and label;
    for each arc do
        D = Database created by applying splitting predicate to D;
        if stopping point reached for this path, then
            T' = Create leaf node and label with appropriate class;
        else
            T' = DTBuild(D);
        T = Add T' to arc;

There have been many decision tree algorithms. We illustrate the tree-building phase in the simplistic DTBuild Algorithm 4.3. Attributes in the database schema that will be used to label nodes in the tree and around which the divisions will take place are called the splitting attributes. The predicates by which the arcs in the tree are labeled are called the splitting predicates. In the decision trees shown in Figure 4.11, the splitting attributes are {gender, height}. The splitting predicates for gender are {= female, = male}, while those for height include {<1.3 m, >1.8 m, <1.5 m, >2 m}. The splitting predicates for height differ based on whether the tuple is for a male or a female. This recursive algorithm builds the tree in a top-down fashion by examining the training data. Using the initial training data, the "best" splitting attribute is chosen first. Algorithms differ in how they determine the "best attribute" and its "best predicates" to use for splitting. Once this has been determined, the node and its arcs are created and added to the created tree. The algorithm continues recursively by adding new subtrees to each branching arc. The algorithm terminates when some "stopping criteria" is reached. Again, each algorithm determines when to stop the tree differently. One simple approach would be to stop when the tuples in the reduced training set all belong to the same class. This class is then used to label the leaf node created.

FIGURE 4.11: Comparing decision trees: (a) balanced tree, (b) deep tree, (c) bushy tree, (d) no gender attribute.

Note that the major factors in the performance of the DT building algorithm are the size of the training set and how the best splitting attribute is chosen. The following issues are faced by most DT algorithms:

• Choosing splitting attributes: Which attributes to use as splitting attributes impacts the performance of applying the built DT. Some attributes are better than others. In the data shown in Table 4.1, the name attribute definitely should not be used and the gender may or may not be used. The choice of attribute involves not only an examination of the data in the training set but also the informed input of domain experts.

• Ordering of splitting attributes: The order in which the attributes are chosen is also important. In Figure 4.11(a) the gender attribute is chosen first. Alternatively, the height attribute could be chosen first. As seen in Figure 4.11(b), in this case the height attribute must be examined a second time, requiring unnecessary comparisons.

• Splits: Associated with the ordering of the attributes is the number of splits to take. With some attributes, the domain is small, so the number of splits is obvious based on the domain (as with the gender attribute). However, if the domain is continuous or has a large number of values, the number of splits to use is not easily determined.
• Tree structure: To improve the performance of applying the tree for classification, a balanced tree with the fewest levels is desirable. However, in this case, more complicated comparisons with multiway branching [see Figure 4.11(c)] may be needed. Some algorithms build only binary trees.

• Stopping criteria: The creation of the tree definitely stops when the training data are perfectly classified. There may be situations when stopping earlier would be desirable to prevent the creation of larger trees. This is a trade-off between accuracy of classification and performance. In addition, stopping earlier may be performed to prevent overfitting. It is even conceivable that more levels than needed would be created in a tree if it is known that there are data distributions not represented in the training data.

• Training data: The structure of the DT created depends on the training data. If the training data set is too small, then the generated tree might not be specific enough to work properly with the more general data. If the training data set is too large, then the created tree may overfit.

• Pruning: Once a tree is constructed, some modifications to the tree might be needed to improve the performance of the tree during the classification phase. The pruning phase might remove redundant comparisons or remove subtrees to achieve better performance.

To illustrate some of these design decisions, Figure 4.11 shows four different decision trees that can be used to classify persons according to height. The first tree is a duplicate of that from Chapter 3. The first three trees of this figure all perform the same classification. However, they all perform it differently. Underneath each tree is a table showing the logical divisions used by the associated tree for classification. A nice feature of Figure 4.11(a) is that it is balanced. The tree is of the same depth for any path from root to leaf. Figures 4.11(b) and (c), however, are not balanced. In addition, the height of the tree in (b) is greater than that of any of the others, implying a slightly worse behavior when used for classification. However, all of these factors impact the time required to do the actual classification. These may not be crucial performance issues unless the database is extremely large. In that case a balanced shorter tree would be desirable. The tree shown in Figure 4.11(d) does not represent the same classification logic as the others.

The training data and the tree induction algorithm determine the tree shape. Thus, the best-shaped tree that performs perfectly on the training set is desirable. Some algorithms create only binary trees. Binary trees are easily created, but they tend to be deeper. The performance results when applying these types of trees for classification may be worse because more comparisons usually are needed. However, since these comparisons are simpler than those that require multiway branches, the ultimate performance may be comparable.

The DT building algorithms may initially build the tree and then prune it for more effective classification. With pruning techniques, portions of the tree may be removed or combined to reduce the overall size of the tree. Portions of the tree relating to classification using an unimportant attribute may be removed. This sort of change with a node close to the root could ripple down to create major changes in the lower parts of the tree. For example, with the data in Figure 4.1, if a tree were constructed by looking at values of the name attribute, all nodes labeled with that attribute would be removed. Lower-level nodes would move up or be combined in some way. The approach to doing this could become quite complicated. In the case of overfitting, lower-level subtrees may be removed completely. Pruning may be performed while the tree is being created, thus preventing a tree from becoming too large. A second approach prunes the tree after it is built.

The time and space complexity of DT algorithms depends on the size of the training data, q; the number of attributes, h; and the shape of the resulting tree. In the worst case the DT that is built may be quite deep and not bushy. As the tree is built, for each of these nodes, each attribute will be examined to determine if it is the best. This gives a time complexity to build the tree of O(h q log q). The time to classify a database of size n is based on the height of the tree. Assuming a height of O(log q), this is then O(n log q).

In the following subsections we examine several popular DT approaches.

4.4.1 ID3

The ID3 technique to building a decision tree is based on information theory and attempts to minimize the expected number of comparisons. The basic idea of the induction algorithm is to ask questions whose answers provide the most information. This is similar to the intuitive approach taken by adults when playing the "Twenty Questions" game. The first question an adult might ask could be "Is the thing alive?" while a child might ask "Is it my Daddy?" The first question divides the search space into two large search domains, while the second performs little division of the space. The basic strategy used by ID3 is to choose splitting attributes with the highest information gain first. The amount of information associated with an attribute value is related to the probability of occurrence.

The concept used to quantify information is called entropy. Entropy is used to measure the amount of uncertainty or surprise or randomness in a set of data. Certainly, when all data in a set belong to a single class, there is no uncertainty. In this case the entropy is zero. The objective of decision tree classification is to iteratively partition the given data set into subsets where all elements in each final subset belong to the same class. Figure 4.12(a, b, and c) will help to explain the concept. Figure 4.12(a) shows log(1/p) as the probability p ranges from 0 to 1. This intuitively shows the amount of surprise based on the probability. When p = 1, there is no surprise. This means that if an event has a probability of 1 and you are told that the event occurred, you would not be surprised. As p → 0, the surprise increases. When we deal with a divide and conquer approach such as that used with decision trees, the division results in multiple probabilities whose sum is 1. In the "Twenty Questions" game, P(Daddy) < P(¬Daddy) and P(Daddy) + P(¬Daddy) = 1. Looking again at that example, the child's question divides the search space into two sets: one set (Daddy) has an infinitesimal probability associated with it and the other set is almost certain, while the question the adult asks divides the search space into two subsets with almost equal probability of occurring. To measure the information associated with a division, we must be able to combine the information associated with both events. That is, we must be able to calculate the average information associated with the division. This can be performed by adding the two values together and taking into account the probability that each occurs. Figure 4.12(b) shows the function p log(1/p),
which is the expected information based on the probability of an event. To determine the expected information associated with two events, we add the individual values together. This function p log(1/p) + (1 − p) log(1/(1 − p)) is plotted in Figure 4.12(c). Note that the maximum occurs when the two probabilities are equal. This supports our intuitive idea that the more sophisticated questions posed by the adult are better than those posed by the child.

FIGURE 4.12: Entropy: (a) log(1/p), (b) p log(1/p), (c) p log(1/p) + (1 − p) log(1/(1 − p)).

The formal definition of entropy is shown in Definition 4.4. The value for entropy is between 0 and 1 and reaches a maximum when the probabilities are all the same.

DEFINITION 4.4. Given probabilities p1, p2, ..., ps where Σ_{i=1}^{s} pi = 1, entropy is defined as

    H(p1, p2, ..., ps) = Σ_{i=1}^{s} (pi log(1/pi))    (4.16)

Given a database state, D, H(D) finds the amount of order (or lack thereof) in that state. When that state is split into s new states S = {D1, D2, ..., Ds}, we can again look at the entropy of those states. Each step in ID3 chooses the state that orders splitting the most. A database state is completely ordered if all tuples in it are in the same class. ID3 chooses the splitting attribute with the highest gain in information, where gain is defined as the difference between how much information is needed to make a correct classification before the split versus how much information is needed after the split. Certainly, the split should reduce the information needed by the largest amount. This is calculated by determining the differences between the entropies of the original dataset and the weighted sum of the entropies from each of the subdivided datasets. The entropies of the split datasets are weighted by the fraction of the dataset being placed in that division. The ID3 algorithm calculates the gain of a particular split by the following formula:

    Gain(D, S) = H(D) − Σ_{i=1}^{s} P(Di) H(Di)    (4.17)

Example 4.7 and associated Figure 4.13 illustrate this process using the height example. In this example, six divisions of the possible ranges of heights are used. This division into ranges is needed when the domain of an attribute is continuous or (as in this case) consists of many possible values. While the choice of these divisions is somewhat arbitrary, a domain expert should be able to perform the task.

EXAMPLE 4.7

The beginning state of the training data in Table 4.1 (with the Output1 classification) is that (4/15) are short, (8/15) are medium, and (3/15) are tall. Thus, the entropy of the starting set is

    4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384

Choosing the gender as the splitting attribute, there are nine tuples that are F and six that are M. The entropy of the subset that are F is

    3/9 log(9/3) + 6/9 log(9/6) = 0.2764    (4.18)

whereas that for the M subset is

    1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392    (4.19)

The ID3 algorithm must determine what the gain in information is by using this split. To do this, we calculate the weighted sum of these last two entropies to get

    ((9/15) 0.2764) + ((6/15) 0.4392) = 0.34152    (4.20)

The gain in entropy by using the gender attribute is thus

    0.4384 − 0.34152 = 0.09688    (4.21)

Looking at the height attribute, we have two tuples that are 1.6, two are 1.7, one is 1.75, two are 1.8, one is 1.85, one is 1.88, two are 1.9, one is 1.95, one is 2, one is 2.1, and one is 2.2. Determining the split values for height is not easy. Even though the training dataset has these 11 values, we know that there will be many more. Just as with continuous data, we divide into ranges:

    (0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, ∞)
There are 2 tuples in the first division with entropy (2/2(0) + 0 + 0) = 0, 2 in (1.6, 1.7], 3 in (1.7, 1.8], 4 in (1.8, 1.9], 2 in (1.9, 2.0], and 2 in (2.0, ∞). All of these states are completely ordered and thus have an entropy of 0 except for the (1.9, 2.0] state, which contains one medium and one tall tuple and so has entropy 1/2 log(2) + 1/2 log(2) = 0.301. The gain in entropy by using the height attribute is thus

    0.4384 − 2/15 (0.301) = 0.3983    (4.22)

Thus, this has the greater gain, and we choose this over gender as the first splitting attribute. Within the (1.9, 2.0] division there are two males, one medium and one tall. This has occurred because this grouping was too large. A further subdivision on height is needed, and this generates the DT seen in Figure 4.13(a).

Figure 4.13(a) illustrates a problem in that the tree has multiple splits with identical results. In addition, there is a subdivision of range (1.9, 2.0]. Figure 4.13(b) shows an optimized version of the tree.

FIGURE 4.13: Decision trees for Example 4.7: (a) tree built by splitting on the height ranges, (b) an optimized version.
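The entropy and gain arithmetic of Example 4.7 (and the GainRatio used by C4.5 in the next subsection) is easy to script. Note that the text's figures use base-10 logarithms; the sketch below follows that convention:

```python
import math

def entropy(probs):
    """H(p1, ..., ps) = sum p_i log(1/p_i), Definition 4.4, using base-10 logs as in the text."""
    return sum(p * math.log10(1 / p) for p in probs if p > 0)

# Starting state: 4 short, 8 medium, 3 tall out of 15.
h_start = entropy([4/15, 8/15, 3/15])
print(round(h_start, 4))              # about 0.4385 (the text reports 0.4384)

# Split on gender: 9 F (3 short, 6 medium) and 6 M (1 short, 2 medium, 3 tall).
h_f = entropy([3/9, 6/9])
h_m = entropy([1/6, 2/6, 3/6])
gain_gender = h_start - (9/15 * h_f + 6/15 * h_m)
print(round(gain_gender, 4))          # about 0.0969 (the text reports 0.09688)

# Split on the six height ranges: only (1.9, 2.0] (one medium, one tall) is impure.
gain_height = h_start - (2/15) * entropy([1/2, 1/2])
print(round(gain_height, 4))          # about 0.3983

# GainRatio: divide the gain by the entropy of the split sizes, ignoring classes.
gain_ratio_gender = gain_gender / entropy([9/15, 6/15])
print(round(gain_ratio_gender, 3))    # about 0.332
```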
4.4.2 C4.5 and C5.0

The decision tree algorithm C4.5 improves ID3 in the following ways:

• Missing data: When the decision tree is built, missing data are simply ignored. That is, the gain ratio is calculated by looking only at the other records that have a value for that attribute. To classify a record with a missing attribute value, the value for that item can be predicted based on what is known about the attribute values for the other records.

• Continuous data: The basic idea is to divide the data into ranges based on the attribute values for that item that are found in the training sample.

• Pruning: There are two primary pruning strategies proposed in C4.5:

  - With subtree replacement, a subtree is replaced by a leaf node if this replacement results in an error rate close to that of the original tree. Subtree replacement works from the bottom of the tree up to the root.

  - Another pruning strategy, called subtree raising, replaces a subtree by its most used subtree. Here a subtree is raised from its current location to a node higher up in the tree. Again, we must determine the increase in error rate for this replacement.

• Rules: C4.5 allows classification via either decision trees or rules generated from them. In addition, some techniques to simplify complex rules are proposed. One approach is to replace the left-hand side of a rule by a simpler version if all records in the training set are treated identically. An "otherwise" type of rule can be used to indicate what should be done if no other rules apply.

• Splitting: The ID3 approach favors attributes with many divisions and thus may lead to overfitting. In the extreme, an attribute that has a unique value for each tuple in the training set would be the best because there would be only one tuple (and thus one class) per division. To compensate, C4.5 uses the GainRatio as opposed to Gain, which takes the cardinality of each division into account. The GainRatio is defined as

    GainRatio(D, S) = Gain(D, S) / H( |D1|/|D|, ..., |Ds|/|D| )    (4.23)

For splitting purposes, C4.5 uses the largest GainRatio that ensures a larger than average information gain. This is to compensate for the fact that the GainRatio value is skewed toward splits where the size of one subset is close to that of the starting one. Example 4.8 shows the calculation of GainRatio for the first split in Example 4.7.

EXAMPLE 4.8

To calculate the GainRatio for the gender split, we first find the entropy associated with the split ignoring classes

    H(9/15, 6/15) = 9/15 log(15/9) + 6/15 log(15/6) = 0.292    (4.24)

This gives the GainRatio value for the gender attribute as

    0.09688 / 0.292 = 0.332    (4.25)

The entropy for the split on height (ignoring classes) is

    H(2/15, 2/15, 3/15, 4/15, 2/15, 2/15) = 0.76    (4.26)

This gives a GainRatio value for the height attribute of 0.3983 / 0.76 = 0.52.

C5.0 (called See 5 on Windows) is a commercial version of C4.5 now widely used in many data mining packages such as Clementine and RuleQuest. It is targeted toward use with large datasets. The DT induction is close to that of C4.5, but the rule generation is different. Unlike C4.5, the precise algorithms used for C5.0 have not been divulged. C5.0 does include improvements to generate rules. Results show that C5.0 improves on memory usage by about 90 percent, runs between 5.7 and 240 times faster than C4.5, and produces more accurate rules [Res01].

One major improvement to the accuracy of C5.0 is based on boosting. Boosting is an approach to combining different classifiers. While boosting normally increases the time that it takes to run a specific classifier, it does improve the accuracy. The error rate has been shown to be less than half of that found with C4.5 on some datasets [Res01]. Boosting does not always help when the training data contains a lot of noise. Boosting works by creating multiple training sets from one training set. Each item in the training set is assigned a weight. The weight indicates the importance of this item to the classification. A classifier is constructed for each combination of weights used. Thus, multiple classifiers are actually constructed. When C5.0 performs a classification, each classifier is assigned a vote, voting is performed, and the target tuple is assigned to the class with the most number of votes.
4.4.3 CART

Classification and regression trees (CART) is a technique that generates a binary decision tree. As with ID3, entropy is used as a measure to choose the best splitting attribute and criterion. Unlike ID3, however, where a child is created for each subcategory, only two children are created. The splitting is performed around what is determined to be the best split point. At each step, an exhaustive search is used to determine the best split, where "best" is defined by

    Φ(s/t) = 2 PL PR Σ_{j=1}^{m} | P(Cj | tL) − P(Cj | tR) |

Here L and R refer to the left and right subtrees produced by split s at node t. PL and PR are the probabilities that a tuple in the training set will be on the left or right side of the tree. This is defined as |tuples in subtree| / |tuples in training set|. We assume that the right branch is taken on equality. P(Cj | tL) or P(Cj | tR) is the probability that a tuple is in this class, Cj, and in the left or right subtree. This is defined as |tuples of class j in subtree| / |tuples at the target node|. At each step, only one criterion is chosen as the best over all possible criteria. Example 4.9 shows its use with the height example with Output1 results.

EXAMPLE 4.9

Since gender is really unordered, we assume M < F. Evaluating the splitting criterion at each candidate split point, the largest value occurs for the split at height 1.8 m. The remainder of this example is left as an exercise.

As illustrated with the gender attribute, CART forces that an ordering of the attributes be used. CART handles missing data by simply ignoring that record in calculating the goodness of a split on that attribute. The tree stops growing when no split will improve the performance. Note that even though it is the best for the training data, it may not be the best for all possible data to be added in the future. The CART algorithm also contains a pruning strategy, which we will not discuss here but which can be found in [KLR+98].

4.4.4 Scalable DT Techniques

We briefly examine some DT techniques that address creation of DTs for large datasets. The SPRINT (Scalable PaRallelizable INduction of decision Trees) algorithm addresses the scalability issue by ensuring that the CART technique can be applied regardless of availability of main memory. In addition, it can be easily parallelized. With SPRINT, a gini index is used to find the best split. Here gini for a database D is defined as

    gini(D) = 1 − Σ_{j=1}^{m} pj²    (4.34)

where pj is the frequency of class Cj in D. The split with the best gini value is chosen. Unlike the earlier approaches, SPRINT does not need to sort the data by goodness value at each node during the DT induction process. With continuous data, the split point is chosen to be the midpoint of every pair of consecutive values from the training set.
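The gini index is simple to compute. The sketch below evaluates gini for a node and the weighted gini of a candidate binary split; the split shown (height at 1.8 m on the Output1 data) and the helper names are illustrative assumptions, not SPRINT's actual bookkeeping:

```python
def gini(counts):
    """gini(D) = 1 - sum p_j^2, where p_j is the frequency of class C_j in D."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(left_counts, right_counts):
    """Weighted gini of a binary split into D1 and D2 (smaller is better)."""
    n1, n2 = sum(left_counts), sum(right_counts)
    n = n1 + n2
    return n1 / n * gini(left_counts) + n2 / n * gini(right_counts)

# Output1 data: 4 short, 8 medium, 3 tall.
print(round(gini([4, 8, 3]), 3))                    # 0.604 for the unsplit node

# Candidate split on height at 1.8 m (Output1): <= 1.8 gives 4 short + 3 medium,
# > 1.8 gives 5 medium + 3 tall.
print(round(gini_split([4, 3, 0], [0, 5, 3]), 3))   # about 0.479 -- lower, so purer
```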
With neural networks (NNs), just as with decision trees, a model representing how to
The largest of these is the split at 1. 8. The remainder of this example is left as an exercise.
classify any given database tuple is constructed. The activation functions typically are
sigmoidal. When a tuple must be classified, certain attribute values from that tuple are
Since gender is really unordered, we assume M < F. input into the directed graph at the corresponding source nodes. There often is one sink
As illustrated with the gender attribute, CART forces that an ordering of the node for each class . The output value that is generated indicates the probability that
attributes be used. CART handles missing data by simply ignoring that record in calcu the corresponding input tuple belongs to that class. The tuple will then be assigned to
lating the goodness of a split on that attribute. The tree stops growing when no split will the class with the highest probability of membership. The learning process modifies the
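The gini computations in Equations 4.34 and 4.35 can be sketched as follows; representing each partition as a plain list of class labels is an assumption made for the illustration.

    # Sketch of the gini index (Eq. 4.34) and the goodness of a binary split (Eq. 4.35).
    from collections import Counter

    def gini(labels):
        # gini(D) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j in D.
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def gini_split(left, right):
        # Weighted gini of a candidate binary split D -> (D1, D2); lower is better.
        n = len(left) + len(right)
        return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

    # For example, a split of the height data at 1.8 m might produce partitions like:
    print(gini_split(["Short", "Short", "Medium"], ["Medium", "Tall", "Tall"]))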
4.5 NEURAL NETWORK-BASED ALGORITHMS

With neural networks (NNs), just as with decision trees, a model representing how to classify any given database tuple is constructed. The activation functions typically are sigmoidal. When a tuple must be classified, certain attribute values from that tuple are input into the directed graph at the corresponding source nodes. There often is one sink node for each class. The output value that is generated indicates the probability that the corresponding input tuple belongs to that class. The tuple will then be assigned to the class with the highest probability of membership. The learning process modifies the labeling of the arcs to better classify tuples. Given a starting structure and value for all the labels in the graph, as each tuple in the training set is sent through the network, the projected classification made by the graph can be compared with the actual classification. Based on the accuracy of the prediction, various labelings in the graph can change. This
learning process continues with all the training data or until the classification accuracy is adequate.

Solving a classification problem using NNs involves several steps:

1. Determine the number of output nodes as well as what attributes should be used as input. The number of hidden layers (between the source and the sink nodes) also must be decided. This step is performed by a domain expert.
2. Determine weights (labels) and functions to be used for the graph.
3. For each tuple in the training set, propagate it through the network and compare the output prediction to the actual result. If the prediction is accurate, adjust labels to ensure that this prediction has a higher output weight the next time. If the prediction is not correct, adjust the weights to provide a lower output value for this class.
4. For each tuple t_i ∈ D, propagate t_i through the network and make the appropriate classification.

There are many issues to be examined:

• Attributes (number of source nodes): This is the same issue as determining which attributes to use as splitting attributes.
• Number of hidden layers: In the simplest case, there is only one hidden layer.
• Number of hidden nodes: Choosing the best number of hidden nodes per hidden layer is one of the most difficult problems when using NNs. There have been many empirical and theoretical studies attempting to answer this question. The answer depends on the structure of the NN, types of activation functions, training algorithm, and problem being solved. If too few hidden nodes are used, the target function may not be learned (underfitting). If too many nodes are used, overfitting may occur. Rules of thumb are often given that are based on the size of the training set.
• Training data: As with DTs, with too much training data the NN may suffer from overfitting, while with too little it may not be able to classify accurately enough.
• Number of sinks: Although it is usually assumed that the number of output nodes is the same as the number of classes, this is not always the case. For example, with two classes there could only be one output node, with the resulting value being the probability of being in the associated class. Subtracting this value from one would give the probability of being in the second class.
• Interconnections: In the simplest case, each node is connected to all nodes in the next level.
• Weights: The weight assigned to an arc indicates the relative weight between those two nodes. Initial weights are usually assumed to be small positive numbers and are assigned randomly.
• Activation functions: Many different types of activation functions can be used.
• Learning technique: The technique for adjusting the weights is called the learning technique. Although many approaches can be used, the most common approach is some form of backpropagation, which is discussed in a subsequent subsection.
• Stop: The learning may stop when all the training tuples have propagated through the network or may be based on time or error rate.

There are many advantages to the use of NNs for classification:

• NNs are more robust than DTs because of the weights.
• The NN improves its performance by learning. This may continue even after the training set has been applied.
• The use of NNs can be parallelized for better performance.
• There is a low error rate and thus a high degree of accuracy once the appropriate training has been performed.
• NNs are more robust than DTs in noisy environments.

Conversely, NNs have many disadvantages:

• NNs are difficult to understand. Nontechnical users may have difficulty understanding how NNs work. While it is easy to explain decision trees, NNs are much more difficult to understand.
• Generating rules from NNs is not straightforward.
• Input attribute values must be numeric.
• Testing and verification of the resulting network are difficult.
• As with DTs, overfitting may result.
• The learning phase may fail to converge.
• NNs may be quite expensive to use.

4.5.1 Propagation

The normal approach used for processing is called propagation. Given a tuple of values input to the NN, X = (x_1, ..., x_h), one value is input at each node in the input layer. Then the summation and activation functions are applied at each node, with an output value created for each output arc from that node. These values are in turn sent to the subsequent nodes. This process continues until a tuple of output values, Y = (y_1, ..., y_m), is produced from the nodes in the output layer. The process of propagation is shown in Algorithm 4.4 using a neural network with one hidden layer. Here a hyperbolic tangent activation function is used for the nodes in the hidden layer, while a sigmoid function is used for nodes in the output layer. We assume that the constant c in the activation function has been provided. We also use k to be the number of edges coming into a node.
ALGORITHM 4.4
Input:
  N                    // Neural network
  X = (x_1, ..., x_h)  // Input tuple consisting of values for input attributes only
Output:
  Y = (y_1, ..., y_m)  // Tuple consisting of output values from NN
Propagation algorithm:
  // Algorithm illustrates propagation of a tuple through a NN
  for each node i in the input layer do
    Output x_i on each output arc from i;
  for each hidden layer do
    for each node i do
      S_i = Σ_{j=1}^{k} (w_{ji} x_{ji});
      for each output arc from i do
        Output (1 − e^{−c S_i}) / (1 + e^{−c S_i});
  for each node i in the output layer do
    S_i = Σ_{j=1}^{k} (w_{ji} x_{ji});
    Output y_i = 1 / (1 + e^{−c S_i});

FIGURE 4.14: Example propagation for tall data.

A simple application of propagation is shown in Example 4.10 for the height data. Here the classification performed is the same as that seen with the decision tree in Figure 4.11(d).
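Algorithm 4.4 can be rendered in Python roughly as follows, under the stated assumptions (one hidden layer, hyperbolic-tangent-style activation in the hidden layer, sigmoid at the output). The weight-matrix layout is an illustrative choice, not part of the algorithm as given.

    import math

    def propagate(x, w_hidden, w_output, c=1.0):
        # w_hidden[j][i]: weight on the arc from input node j to hidden node i.
        # w_output[j][i]: weight on the arc from hidden node j to output node i.
        hidden = []
        for i in range(len(w_hidden[0])):
            s = sum(w_hidden[j][i] * x[j] for j in range(len(x)))
            hidden.append((1 - math.exp(-c * s)) / (1 + math.exp(-c * s)))   # tanh-style
        y = []
        for i in range(len(w_output[0])):
            s = sum(w_output[j][i] * hidden[j] for j in range(len(hidden)))
            y.append(1 / (1 + math.exp(-c * s)))                             # sigmoid
        return y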
EXAMPLE 4.10
Figure 4.14 shows a very simple NN used to classify university students as short, medium, or tall. There are two input nodes, one for the gender data and one for the height data. There are three output nodes, each associated with one class and using a simple threshold activation function. Activation function f_3 is associated with the short class, f_4 is associated with the medium class, and f_5 is associated with the tall class. In this case, the weight of each arc from the height node is 1. The weights on the gender arcs are 0. This implies that in this case the gender values are ignored. The plots for the graphs of the three activation functions are shown.

4.5.2 NN Supervised Learning

The NN starting state is modified based on feedback of its performance with the data in the training set. This type of learning is referred to as supervised because it is known a priori what the desired output should be. Unsupervised learning can also be performed if the output is not known. With unsupervised approaches, no external teacher set is used. A training set may be provided, but no labeling of the desired outcome is included. In this case, similarities and differences between different tuples in the training set are uncovered. In this chapter, we examine supervised learning. We briefly explore unsupervised learning in Chapter 5.

Supervised learning in an NN is the process of adjusting the arc weights based on its performance with a tuple from the training set. The behavior of the training data is known a priori and thus can be used to fine-tune the network for better behavior in future similar situations. Thus, the training set can be used as a "teacher" during the training process. The output from the network is compared to this known desired behavior. Algorithm 4.5 outlines the steps required. One potential problem with supervised learning is that the error may not be continually reduced. It would, of course, be hoped that each iteration in the learning process reduces the error so that it is ultimately below an acceptable level. However, this is not always the case. This may be due to the error calculation technique or to the approach used for modifying the weights. This is actually the general problem of NNs. They do not guarantee convergence or optimality.

ALGORITHM 4.5
Input:
  N  // Starting neural network
  X  // Input tuple from training set
  D  // Output tuple desired
Output:
  N  // Improved neural network
SupLearn algorithm:
  // Simplistic algorithm to illustrate approach to NN learning
  Propagate X through N producing output Y;
  Calculate error by comparing D to Y;
  Update weights on arcs in N to reduce error;

Notice that this algorithm must be associated with a means to calculate the error as well as some technique to adjust the weights. Many techniques have been proposed to calculate the error. Assuming that the output from node i is y_i but should be d_i, the error at that node can be calculated as
y_i − d_i   (4.36)

The mean squared error (MSE) is found by

(y_i − d_i)² / 2   (4.37)

This MSE can then be used to find a total error over all nodes in the network or over only the output nodes. In the following discussion, the assumption is made that only the final output of the NN is known for a tuple in the training data. Thus, the total MSE error over all m output nodes in the NN is

Σ_{i=1}^{m} (y_i − d_i)² / m   (4.38)

This formula could be expanded over all tuples in the training set to see the total error over all of them. Thus, an error can be calculated for a specific test tuple or for the total set of all entries.

The Hebb and delta rules are approaches to change the weight on an input arc to a node based on the knowledge that the output value from that node is incorrect. With both techniques, a learning rule is used to modify the input weights. Suppose for a given node, j, the input weights are represented as a tuple (w_{1j}, ..., w_{kj}), while the input and output values are (x_{1j}, ..., x_{kj}) and y_j, respectively. The objective of a learning technique is to change the weights based on the output obtained for a specific input tuple. The change in weights using the Hebb rule is represented by the following rule

Δw_{ij} = c x_{ij} y_j   (4.39)

Here c is a constant often called the learning rate. A rule of thumb is that c = 1 / |# entries in training set|.

A variation of this approach, called the delta rule, examines not only the output value y_j but also the desired value d_j for output. In this case the change in weight is found by the rule

Δw_{ij} = c x_{ij} (d_j − y_j)   (4.40)

The nice feature of the delta rule is that it minimizes the error d_j − y_j at each node.
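Both rules are simple enough to state directly in code; the learning-rate value below is arbitrary and the list representation of a node's weights is an assumption made for the sketch.

    def hebb_update(w, x, y_j, c=0.1):
        # Hebb rule (Eq. 4.39): w_ij <- w_ij + c * x_ij * y_j
        return [w_ij + c * x_ij * y_j for w_ij, x_ij in zip(w, x)]

    def delta_update(w, x, y_j, d_j, c=0.1):
        # Delta rule (Eq. 4.40): w_ij <- w_ij + c * x_ij * (d_j - y_j)
        return [w_ij + c * x_ij * (d_j - y_j) for w_ij, x_ij in zip(w, x)]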
Backpropagation is a learning technique that adjusts weights in the NN by propagating weight changes backward from the sink to the source nodes. Backpropagation is the most well known form of learning because it is easy to understand and generally applicable. Backpropagation can be thought of as a generalized delta rule approach.

Figure 4.15 shows the structure and use of one node, j, in a neural network graph. The basic node structure is shown in part (a). Here the representative input arc has a weight of w_{?j}, where ? is used to show that the input to node j is coming from another node shown here as ?. Of course, there probably are multiple input arcs to a node. The output weight is similarly labeled w_{j?}. During propagation, data values input at the input layer flow through the network, with final values coming out of the network at the output layer. The propagation technique is shown in part (b) of Figure 4.15.

FIGURE 4.15: Neural network usage. (a) Node j in NN; (b) Propagation at node j; (c) Back-propagation at node j.

Here the smaller dashed arrow underneath the regular graph arc shows the input value x_{?j} flowing into node j. The activation function f_j is applied to all the input values and weights, with output values resulting. There is an associated input function that is applied to the input values and weights before applying the activation function. This input function is typically a weighted sum of the input values. Here y_{j?} shows the output value flowing (propagating) to the next node from node j. Thus, propagation occurs by applying the activation function at each node, which then places the output value on the arc to be sent as input to the next nodes. In most cases, the activation function produces only one output value that is propagated to the set of connected nodes. The NN can be used for classification and/or learning. During the classification process, only propagation occurs. However, when learning is used after the output of the classification occurs, a comparison to the known classification is used to determine how to change the weights in the graph. In the simplest types of learning, learning progresses from the output layer backward to the input layer. Weights are changed based on the changes that were made in weights in subsequent arcs. This backward learning process is called backpropagation and is illustrated in Figure 4.15(c). Weight w_{j?} is modified to become w_{j?} + Δw_{j?}. A learning rule is applied to this Δw_{j?} to determine the change at the next higher level Δw_{?j}.

ALGORITHM 4.6
Input:
  N                    // Starting neural network
  X = (x_1, ..., x_h)  // Input tuple from training set
  D = (d_1, ..., d_m)  // Output tuple desired
Output:
  N  // Improved neural network
Backpropagation algorithm:
  // Illustrate backpropagation
  Propagation(N, X);
  E = 1/2 Σ_{i=1}^{m} (d_i − y_i)²;
  Gradient(N, E);

A simple version of the backpropagation algorithm is shown in Algorithm 4.6. The MSE is used to calculate the error. Each tuple in the training set is input to this algorithm. The last step of the algorithm uses gradient descent as the technique to modify the weights in the graph. The basic idea of gradient descent is to find the set of weights that minimizes the MSE. ∂E/∂w gives the slope (or gradient) of the error function for one weight. We thus wish to find the weight where this slope is zero. Figure 4.16 and Algorithm 4.7 illustrate the concept. The stated algorithm assumes only one hidden layer. More hidden layers would be handled in the same manner with the error propagated backward.
FIGURE 4.16: Gradient descent. The error E is plotted against one weight w; the desired weight is where the slope of the error curve is zero.

ALGORITHM 4.7
Input:
  N  // Starting neural network
  E  // Error found from back algorithm
Output:
  N  // Improved neural network
Gradient algorithm:
  // Illustrates incremental gradient descent
  for each node i in output layer do
    for each node j input to i do
      Δw_{ji} = η (d_i − y_i) y_j (1 − y_i) y_i;
      w_{ji} = w_{ji} + Δw_{ji};
  for each node j in the hidden layer do
    for each node k input to j do
      Δw_{kj} = η y_k ((1 − y_j²)/2) Σ_m (d_m − y_m) w_{jm} y_m (1 − y_m);
      w_{kj} = w_{kj} + Δw_{kj};

This algorithm changes weights by working backward from the output layer to the input layer. There are two basic versions of this algorithm. With the batch or offline approach, the weights are changed once after all tuples in the training set are applied and a total MSE is found. With the incremental or online approach, the weights are changed after each tuple in the training set is applied. The incremental technique is usually preferred because it requires less space and may actually examine more potential solutions (weights), thus leading to a better solution. In this algorithm, η is referred to as the learning parameter. It typically is found in the range (0, 1), although it may be larger. This value determines how fast the algorithm learns.

Applying a learning rule back through multiple layers in the network may be difficult. Doing this for the hidden layers is not as easy as doing it with the output layer. Overall, however, we are trying to minimize the error at the output nodes, not at each node in the network. Thus, the approach that is used is to propagate the output errors backward through the network.

Figure 4.17 shows the structure we use to discuss the gradient descent algorithm. Here node i is at the output layer and node j is at the hidden layer just before it; y_i is the output of i and y_j is the output of j.

The learning function in the gradient descent technique is based on using the following value for delta at the output layer:

Δw_{ji} = −η ∂E/∂w_{ji} = −η (∂E/∂y_i)(∂y_i/∂S_i)(∂S_i/∂w_{ji})   (4.41)

Here the weight w_{ji} is that at one arc coming into i from j. Assuming a sigmoidal activation function in the output layer, for the output layer we have

∂y_i/∂S_i = ∂/∂S_i (1/(1 + e^{−S_i})) = (1 − 1/(1 + e^{−S_i})) (1/(1 + e^{−S_i})) = (1 − y_i) y_i   (4.42)

Also,

∂S_i/∂w_{ji} = y_j   (4.43)

For nodes in the output layer, the third partial is

∂E/∂y_i = ∂/∂y_i (Σ_m 1/2 (d_m − y_m)²) = −(d_i − y_i)   (4.44)

Combining these partials gives the output-layer weight change used in Algorithm 4.7:

Δw_{ji} = η (d_i − y_i) y_j (1 − y_i) y_i   (4.45)

For a weight w_{kj} on an arc coming into hidden node j from node k, we likewise have

Δw_{kj} = −η ∂E/∂w_{kj}   (4.46)

where

∂E/∂w_{kj} = (Σ_m (∂E/∂y_m)(∂y_m/∂S_m)(∂S_m/∂y_j)) (∂y_j/∂S_j)(∂S_j/∂w_{kj})   (4.47)

Here the variable m ranges over all output nodes with arcs from j. We then derive

∂E/∂y_m = −(d_m − y_m)   (4.48)

∂y_m/∂S_m = y_m (1 − y_m)   (4.49)

∂S_m/∂y_j = w_{jm}   (4.50)
For hidden layers, where a hyperbolic tangent activation function is assumed, we have

∂y_j/∂S_j = (1 − y_j²) / 2   (4.51)

Also,

∂S_j/∂w_{kj} = y_k   (4.52)

This gives us the formula in Algorithm 4.7:

Δw_{kj} = η y_k ((1 − y_j²)/2) Σ_m (d_m − y_m) w_{jm} y_m (1 − y_m)   (4.53)

Another common formula for the change in weight is

Δw_{ji}(t + 1) = −η ∂E/∂w_{ji} + α Δw_{ji}(t)   (4.54)

Here the change in weight at time t + 1 is based not only on the same partial derivative as earlier, but also on the last change in weight. Here α is called the momentum and is used to prevent oscillation problems that may occur without it.
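One incremental gradient-descent pass over a single training tuple, combining the output-layer update (Equation 4.45) and the hidden-layer update (Equation 4.53), can be sketched as follows. The nested-list weight layout and the value of η are illustrative assumptions.

    def gradient_step(y_in, y_hidden, y_out, d, w_hid, w_out, eta=0.1):
        # Backpropagated error term for each hidden node j, using the current output weights.
        back = [sum((d[m] - y_out[m]) * w_out[j][m] * y_out[m] * (1 - y_out[m])
                    for m in range(len(y_out)))
                for j in range(len(y_hidden))]
        # Output layer (Eq. 4.45): delta_w_ji = eta * (d_i - y_i) * y_j * (1 - y_i) * y_i
        for j in range(len(y_hidden)):
            for i in range(len(y_out)):
                w_out[j][i] += eta * (d[i] - y_out[i]) * y_hidden[j] * (1 - y_out[i]) * y_out[i]
        # Hidden layer (Eq. 4.53): delta_w_kj = eta * y_k * (1 - y_j^2)/2 * back_j
        for k in range(len(y_in)):
            for j in range(len(y_hidden)):
                w_hid[k][j] += eta * y_in[k] * (1 - y_hidden[j] ** 2) / 2 * back[j]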
4.5.3 Radial Basis Function Networks

A radial function or a radial basis function (RBF) is a class of functions whose value decreases (or increases) with the distance from a central point. The Gaussian activation function shown in Equation 3.35 is an RBF with a central point of 0. An RBF has a Gaussian shape, and an RBF network is typically an NN with three layers. The input layer is used to simply input the data. A Gaussian activation function is used at the hidden layer, while a linear activation function is used at the output layer. The objective is to have the hidden nodes learn to respond only to a subset of the input, namely, that where the Gaussian function is centered. This is usually accomplished via supervised learning. When RBF functions are used as the activation functions on the hidden layer, the nodes can be sensitive to a subset of the input values. Figure 4.18 shows the basic structure of an RBF unit with one output node.

FIGURE 4.18: Radial basis function network.

4.5.4 Perceptrons

The simplest NN is called a perceptron. A perceptron is a single neuron with multiple inputs and one output. Figure 4.19(a) shows a perceptron with two inputs, x_1 and x_2, whose weighted sum is S = 3x_1 + 2x_2 − 6. Using a simple unipolar step activation function, we get

f_4 = { 1 if S > 0; 0 otherwise }   (4.55)

FIGURE 4.19: Perceptron classification example. (a) Classification perceptron; (b) Classification problem.

An alternative way to view this classification problem is shown in Figure 4.19(b). Here x_1 is shown on the horizontal axis and x_2 is shown on the vertical axis. The area of the plane to the right of the line x_2 = 3 − (3/2)x_1 represents one class and the rest of the plane represents the other class.
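The perceptron of Figure 4.19(a) and Equation 4.55 amounts to only a few lines of code; the weights and bias below are those of the example above.

    def perceptron(x1, x2, w1=3.0, w2=2.0, bias=-6.0):
        # S = 3*x1 + 2*x2 - 6 followed by a unipolar step activation (Eq. 4.55).
        s = w1 * x1 + w2 * x2 + bias
        return 1 if s > 0 else 0

    # Points to the right of the line x2 = 3 - (3/2)*x1 produce output 1.
    print(perceptron(2, 2), perceptron(0, 0))   # prints: 1 0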
It has been shown that an arbitrary association between two sets of numbers can be performed using an NN with only one hidden layer. In this case, the NN is to have one input node for each attribute input, and given n input attributes the hidden layer should have (2n + 1) nodes, each with input from each of the input nodes. The output layer has one node for each desired output value.

4.6 RULE-BASED ALGORITHMS

One straightforward way to perform classification is to generate if-then rules that cover all cases. For example, we could have the following rules to determine classification of grades:

If 90 ≤ grade, then class A
If 80 ≤ grade and grade < 90, then class B
If 70 ≤ grade and grade < 80, then class C
If 60 ≤ grade and grade < 70, then class D
If grade < 60, then class F

A classification rule, r = (a, c), consists of the if or antecedent, a, part and the then or consequent portion, c. The antecedent contains a predicate that can be evaluated as true or false against each tuple in the database (and obviously in the training data). These rules relate directly to the corresponding DT that could be created. A DT can always be used to generate rules, but they are not equivalent. There are differences between rules and trees:

• The tree has an implied order in which the splitting is performed. Rules have no order.
• A tree is created based on looking at all classes. When generating rules, only one class must be examined at a time.

There are algorithms that generate rules from trees as well as algorithms that generate rules without first creating DTs.

4.6.1 Generating Rules from a DT

The process to generate a rule from a DT is straightforward and is outlined in Algorithm 4.8. This algorithm will generate a rule for each leaf node in the decision tree. All rules with the same consequent could be combined together by ORing the antecedents of the simpler rules.

ALGORITHM 4.8
Input:
  T  // Decision tree
Output:
  R  // Rules
Gen algorithm:
  // Illustrate simple approach to generating classification rules from a DT
  R = ∅
  for each path from root to a leaf in T do
    a = True
    for each non-leaf node do
      a = a ∧ (label of node combined with label of incident outgoing arc)
    c = label of leaf node
    R = R ∪ {r = (a, c)}

Using this algorithm, the following rules are generated for the DT in Figure 4.13(a):

{ ((Height ≤ 1.6 m), Short)
  (((Height > 1.6 m) ∧ (Height ≤ 1.7 m)), Short)
  (((Height > 1.7 m) ∧ (Height ≤ 1.8 m)), Medium)
  (((Height > 1.8 m) ∧ (Height ≤ 1.9 m)), Medium)
  (((Height > 1.9 m) ∧ (Height ≤ 2 m) ∧ (Height ≤ 1.95 m)), Medium)
  (((Height > 1.9 m) ∧ (Height ≤ 2 m) ∧ (Height > 1.95 m)), Tall)
  ((Height > 2 m), Tall) }

An optimized version of these rules is then:

{ ((Height ≤ 1.7 m), Short)
  (((Height > 1.7 m) ∧ (Height ≤ 1.95 m)), Medium)
  ((Height > 1.95 m), Tall) }
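Algorithm 4.8 can be sketched as a walk over root-to-leaf paths; the nested-tuple encoding of the tree below is an assumption made for the illustration, not a structure defined in the text.

    def generate_rules(tree, antecedent=()):
        # A node is either a leaf (class label string) or (attribute, [(condition, subtree), ...]).
        if isinstance(tree, str):
            return [(antecedent, tree)]                     # one rule per leaf
        attribute, branches = tree
        rules = []
        for condition, subtree in branches:
            rules.extend(generate_rules(subtree, antecedent + (attribute + " " + condition,)))
        return rules

    height_tree = ("Height", [("<= 1.7 m", "Short"),
                              ("> 1.7 m and <= 1.95 m", "Medium"),
                              ("> 1.95 m", "Tall")])
    for antecedent, cls in generate_rules(height_tree):
        print(" AND ".join(antecedent), "->", cls)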
4.6.2 Generating Rules from a Neural Net

To increase the understanding of an NN, classification rules may be derived from it. While the source NN may still be used for classification, the derived rules can be used to verify or interpret the network. The problem is that the rules do not explicitly exist. They are buried in the structure of the graph itself. In addition, if learning is still occurring, the rules themselves are dynamic. The rules generated tend both to be more concise and to have a lower error rate than rules used with DTs. The basic idea of the RX algorithm is to cluster output values with the associated hidden nodes and input. A major problem with rule extraction is the potential size that these rules should be. For example, if you have a node with n inputs each having 5 values, there are 5^n different input combinations to this one node alone. These patterns would all have to be accounted for when constructing rules. To overcome this problem and that of having continuous ranges of output values from nodes, the output values for both the hidden and output layers are first discretized. This is accomplished by clustering the values and dividing continuous values into disjoint ranges. The rule extraction algorithm, RX, shown in Algorithm 4.9, is derived from [LSL95].
ALGORITHM 4.9
Input:
  D  // Training data
  N  // Initial neural network
Output:
  R  // Derived rules
RX algorithm:
  // Rule extraction algorithm to extract rules from NN
  cluster output node activation values;
  cluster hidden node activation values;
  generate rules that describe the output values in terms of the hidden activation values;
  generate rules that describe hidden output values in terms of inputs;
  combine the two sets of rules.

4.6.3 Generating Rules without a DT or NN

These techniques are sometimes called covering algorithms because they attempt to generate rules that exactly cover a specific class [WF00]. Tree algorithms work in a top-down divide-and-conquer approach, but this need not be the case for covering algorithms. They generate the best rule possible by optimizing the desired classification probability. Usually the "best" attribute-value pair is chosen, as opposed to the best attribute with the tree-based algorithms. Suppose that we wished to generate a rule to classify persons as tall. The basic format for the rule is then

If ? then class = tall

The objective for the covering algorithms is to replace the "?" in this statement with predicates that can be used to obtain the "best" probability of being tall.

One simple approach is called 1R because it generates a simple set of rules that are equivalent to a DT with only one level. The basic idea is to choose the best attribute to perform the classification based on the training data. "Best" is defined here by counting the number of errors. In Table 4.4 this approach is illustrated using the height example with the Output1 classification.

TABLE 4.4: 1R Classification (columns: Option, Attribute, Rules, Errors, Total Errors)

If we only use the gender attribute, there are a total of 6/15 errors, whereas if we use the height attribute, there are only 1/15. Thus, the height would be chosen, and the six rules stated in the table would be used. As with ID3, 1R tends to choose attributes with a large number of values, leading to overfitting. 1R can handle missing data by adding an additional attribute value for the value of missing. Algorithm 4.10, which is adapted from [WF00], shows the outline for this algorithm.

ALGORITHM 4.10
Input:
  D  // Training data
  R  // Attributes to consider for rules
  C  // Classes
Output:
  R  // Rules
1R algorithm:
  // 1R algorithm generates rules based on one attribute
  R = ∅;
  for each A ∈ R do
    R_A = ∅;
    for each possible value, v, of A do
      // v may be a range rather than a specific value
      for each C_j ∈ C find count(C_j);
      // Here count is the number of occurrences of this class for this attribute
      let C_m be the class with the largest count;
      R_A = R_A ∪ ((A = v) → (class = C_m));
    ERR_A = number of tuples incorrectly classified by R_A;
  R = R_A where ERR_A is minimum;

Another approach to generating rules without first having a DT is called PRISM. PRISM generates rules for each class by looking at the training data and adding rules that completely describe all tuples in that class. Its accuracy is 100 percent. Example 4.12 illustrates the use of PRISM. Algorithm 4.11, which is adapted from [WF00], shows the process. Note that the algorithm refers to attribute-value pairs. Note that the values will include an operator, so that in Example 4.12 the first attribute-value pair chosen is with attribute height and value 72.0. As with earlier classification techniques, this must be modified to handle continuous attributes. In the example, we have again used the ranges of height values used in earlier examples.
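The core of Algorithm 4.10 can be sketched as follows; the dictionary-based representation of the training tuples is an assumption made for the example.

    from collections import Counter, defaultdict

    def one_r(data, attributes, class_attr):
        # For each attribute, build one rule per attribute value (predict the majority class),
        # then keep the attribute whose rule set misclassifies the fewest training tuples.
        best_attr, best_rules, best_errors = None, None, None
        for attr in attributes:
            counts = defaultdict(Counter)
            for row in data:
                counts[row[attr]][row[class_attr]] += 1
            rules = {value: cls.most_common(1)[0][0] for value, cls in counts.items()}
            errors = sum(1 for row in data if rules[row[attr]] != row[class_attr])
            if best_errors is None or errors < best_errors:
                best_attr, best_rules, best_errors = attr, rules, errors
        return best_attr, best_rules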
EXAMPLE 4.12
Only one of the tuples that satisfies the first attribute-value pair is actually tall, so we need to add another predicate to it. We then look only at the other predicates affecting these two tuples. We now see a problem in that both of these are males. The problem is actually caused by our "arbitrary" range divisions. We now divide the range into two subranges:

1.9 < Height ≤ 1.95   0/1
1.95 < Height ≤ 2.0   1/1

The second subrange covers only tall tuples, so this predicate is added to the rule. We would then proceed by classifying the short and medium classes. This is left as an exercise.

4.7 COMBINING TECHNIQUES

Given a classification problem, no one classification technique always yields the best results. Therefore, there have been some proposals that look at combining techniques. While discussing C5.0, we briefly introduced one technique for combining classifiers called boosting. Two basic techniques can be used to accomplish this:

• A synthesis of approaches takes multiple techniques and blends them into a new approach. An example of this would be using a prediction technique, such as linear regression, to predict a future value for an attribute that is then used as input to a classification NN. In this way the NN is used to predict a future classification value.
• Multiple independent techniques may be applied to the same problem, with their individual results combined into one final answer; this is referred to as the combination of multiple classifiers (CMC). One simple combination assigns a target tuple t_i to the class C_j that maximizes a weighted sum over the K classifiers:

Σ_{k=1}^{K} w_k P_k(C_j | t_i)   (4.56)
Here the weights, w_k, can be assigned by a user or learned based on the past accuracy of each classifier. Another technique is to choose the classifier that has the best accuracy in a database sample. This is referred to as dynamic classifier selection (DCS). Example 4.13, which is modified from [LJ98], illustrates the use of DCS. Another variation is simple voting: assign the tuple to the class to which a majority of the classifiers have assigned it. This may have to be modified slightly in case there are many classes and no majority is found.

FIGURE 4.20: Combination of multiple classifiers: (a) Classifier 1; (b) Classifier 2. Triangles are tuples in class 1, squares are tuples in class 2, and darkened shapes are incorrectly classified.

EXAMPLE 4.13
Two classifiers exist to classify tuples into two classes. A target tuple, X, needs to be classified. Using a nearest neighbor approach, the 10 tuples closest to X are identified. Figure 4.20 shows the 10 tuples closest to X. In Figure 4.20(a) the results for the first classifier are shown, while in Figure 4.20(b) those for the second classifier are shown. The tuples designated with triangles should be in class 1, while those shown as squares should be in class 2. Any shapes that are darkened indicate an incorrect classification by that classifier. To combine the classifiers using DCS, look at the general accuracy of each classifier. With classifier 1, 7 tuples in the neighborhood of X are correctly classified, while with the second classifier, only 6 are correctly classified. Thus, X will be classified according to how it is classified with the first classifier.

Recently, a new CMC technique, adaptive classifier combination (ACC), has been proposed [LJ98]. Given a tuple to classify, the neighborhood around it is first determined, then the tuples in that neighborhood are classified by each classifier, and finally the accuracy for each class is measured. By examining the accuracy across all classifiers for each class, the tuple is placed in the class that has the highest local accuracy. In effect, the class chosen is that to which most of its neighbors are accurately classified independent of classifier. Example 4.14 illustrates the use of ACC.

EXAMPLE 4.14
Using the same data as in Example 4.13, the ACC technique examines how accurate all classifiers are for each class. With the tuples in class 1, classifier 1 accurately classifies 3 tuples, while classifier 2 accurately classifies only 1 tuple. A measure of the accuracy for both classifiers with respect to class 1 is then 3/4 + 1/4. When looking at class 2, the measure is 4/6 + 5/6. Thus, X is placed in class 2.
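The DCS decision of Example 4.13 can be sketched directly; the neighborhood is assumed to have been found already (for example, by a nearest neighbor search), and each classifier is assumed to be a callable.

    def dcs_classify(x, neighbors, classifiers):
        # neighbors: list of (tuple, true_class) pairs in the neighborhood of x.
        # Pick the classifier with the highest local accuracy and use it to classify x.
        def local_accuracy(clf):
            return sum(1 for t, true_class in neighbors if clf(t) == true_class)
        best = max(classifiers, key=local_accuracy)
        return best(x)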
4.8 SUMMARY

No one classification technique is always superior to the others in terms of classification accuracy. However, there are advantages and disadvantages to the use of each. The regression approaches force the data to fit a predefined model. If a linear model is chosen, then the data are fit into that model even though it might not be linear. It requires that linear data be used. The KNN technique requires only that the data be such that distances can be calculated. This can then be applied even to nonnumeric data. Outliers are handled by looking only at the K nearest neighbors. Bayesian classification assumes that the data attributes are independent with discrete values. Thus, although it is easy to use and understand, results may not be satisfactory. Decision tree techniques are easy to understand, but they may lead to overfitting. To avoid this, pruning techniques may be needed. ID3 is applicable only to categorical data. Improvements on it, C4.5 and C5.0, allow the use of continuous data and improved techniques for splitting. CART creates binary trees and thus may result in very deep trees.

When looking at the approaches based on complexity analysis, we see that they are all very efficient. This is due to the fact that once the model is created, applying it for classification is relatively straightforward. The statistical techniques, regression and naive Bayes, require constant time to classify a tuple once the models are built. The distance-based approaches, simple and KNN, are also constant but require that each tuple be compared either to a representative for each class or to all items in the training set. Assuming there are q of these, the KNN then requires O(q) time per tuple. DT classification techniques, ID3, C4.5, and CART, require a number of comparisons that are (in the worst case) equal to the longest path from a root to a leaf node. Thus, they require O(log q) time per tuple. Since q is a constant, we can view these as being performed in constant time as well. The NN approaches again require that a tuple be propagated through the graph. Since the size of the graph is constant, this can be viewed as being performed in constant time. Thus, all algorithms are O(n) to classify the n items in the database.
4.9 EXERCISES

1. Explain the differences between the definition of the classification problem found in Definition 4.1 and an alternative one with the mapping from C to D.
2. Using the data in Table 4.1, draw OC curves assuming that the Output2 column is the correct classification and Output1 is what is seen. You will need to draw three curves, one for each class.
3. Using the data in Table 4.1, construct a confusion matrix assuming Output2 is the correct assignment and Output1 is what is actually made.
4. Apply the method of least squares technique to determine the division between medium and tall persons using the training data in Table 4.1 and the classification shown in the Output1 column (see Example 4.3). You may use either the division technique or the prediction technique.
that Output2 is correct.
8. Explain the difference between P(t_i | C_j) and P(C_j | t_i).
9. Redo Example 4.5 using Output2 data.
10. Determine the expected number of comparisons for each tree shown in Figure 4.11.
11. Generate a DT for the height example in Table 4.1 using the ID3 algorithm and the training classifications shown in the Output2 column of that table.
12. Repeat Exercise 11 using the GainRatio instead of the Gain.
13. Construct the confusion matrix for the results of Exercises 11 and 12.
14. Using 1R, generate rules for the height example using the Output2 column in Table 4.1.
15. Complete Example 4.9 by generating the DT for the height example (using the Output1 classification) using the CART algorithm.
16. Suppose that the output for Mary in Exercise 8 in Chapter 3 should have been 0 for small, 0 for medium, and 1 for tall. Use the gradient descent algorithm to modify the weights in the NN.

4.10 BIBLIOGRAPHIC NOTES

In CHAID, a statistical significance test is used to prevent the tree from growing too big. A modified version of CHAID was proposed in 1980 by Kass; it reduces the overall time, but does not guarantee the best result. Kass proposes that pairs of categories (for the predictor variable) be merged if there is no statistically significant difference between them. The resulting set of categories defines the split [Kas80]. With exhaustive CHAID, which was proposed in 1991, the split point is determined by merging pairs of categories until only one pair remains [BdVS91]. The predictor with the best overall prediction is then chosen for the split.

ID3 was first proposed in the mid-1970s by Quinlan [Qui86]. A thorough investigation of C4.5 can be found in the seminal text on the subject [Qui93].

CART was developed by Breiman in 1984 [BFOS84]. In 1997, another binary decision tree technique was proposed [LS97]. QUEST (quick unbiased efficient statistical tree) addresses the problem in CART that it tends to select variables with many values, which creates a bias in the model. QUEST handles variable selection and split point selection differently. Also, unlike CART, QUEST does not perform an exhaustive search, so it is more efficient. The approach used by QUEST is to determine the association between