UNIT-III
DATA CLASSIFICATION
Classification is a form of data analysis that extracts models describing important data
classes. Such models, called classifiers, predict categorical (discrete, unordered) class labels.
For example, we can build a classification model to categorize bank loan applications as either
safe or risky. Such analysis can help provide us with a better understanding of the data at large.
Many classification methods have been proposed by researchers in machine learning, pattern
recognition, and statistics.
Why Classification?
A bank loans officer needs analysis of her data to learn which loan applicants are “safe”
and which are “risky” for the bank. A marketing manager at AllElectronics needs data analysis
to help guess whether a customer with a given profile will buy a new computer.
A medical researcher wants to analyze breast cancer data to predict which one of three
specific treatments a patient should receive. In each of these examples, the data analysis task
is classification, where a model or classifier is constructed to predict class (categorical) labels,
such as “safe” or “risky” for the loan application data; “yes” or “no” for the marketing data; or
“treatment A,” “treatment B,” or “treatment C” for the medical data.
Suppose that the marketing manager wants to predict how much a given customer will
spend during a sale at AllElectronics. This data analysis task is an example of numeric
prediction, where the model constructed predicts a continuous-valued function, or ordered
value, as opposed to a class label. This model is a predictor.
Regression analysis is a statistical methodology that is most often used for numeric
prediction; hence the two terms tend to be used synonymously, although other methods for
numeric prediction exist. Classification and numeric prediction are the two major types of
prediction problems.
General Approach for Classification:
Data classification is a two-step process, consisting of a learning step (where a
classification model is constructed) and a classification step (where the model is used to predict
class labels for given data).
In the first step, a classifier is built describing a predetermined set of data classes or
concepts. This is the learning step (or training phase), where a classification algorithm
builds the classifier by analyzing or “learning from” a training set made up of database
tuples and their associated class labels.
Each tuple/sample is assumed to belong to a predefined class, as determined by the class
label attribute.
In the second step, the model is used for classification. First, the predictive accuracy of the
classifier is estimated. If we were to use the training set to measure the classifier’s accuracy,
this estimate would likely be optimistic, because the classifier tends to overfit the data.
Therefore, a test set of class-labeled tuples, independent of the training tuples, is used instead. The accuracy rate is the percentage of test set samples that are correctly classified by the model.
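The following is a minimal sketch of the two-step process, assuming scikit-learn is available; the tiny feature matrix X ([age, income]) and the "safe"/"risky" labels are illustrative stand-ins for a loan-application data set, not data from the text.

```python
# A minimal sketch of the two-step classification process (learning + classification).
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Illustrative [age, income] tuples with "safe"/"risky" class labels.
X = [[25, 40000], [47, 90000], [35, 20000], [52, 110000],
     [23, 15000], [40, 60000], [60, 95000], [30, 30000]]
y = ["risky", "safe", "risky", "safe", "risky", "safe", "safe", "risky"]

# Hold out part of the data so accuracy is not measured on the training tuples,
# which would give an optimistic (overfitted) estimate.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

clf = DecisionTreeClassifier()      # learning step: build the classifier
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)        # classification step: predict class labels
print("Accuracy rate:", accuracy_score(y_test, y_pred))
```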
Decision Tree Induction:
Decision tree induction is the learning of decision trees from class-labeled training tuples; a decision tree is a flowchart-like structure in which each internal node denotes a test on an attribute, each branch an outcome of the test, and each leaf node a class label.
During tree construction, attribute selection measures are used to select the attribute
that best partitions the tuples into distinct classes. When decision trees are built, many of the
branches may reflect noise or outliers in the training data. Tree pruning attempts to identify
and remove such branches, with the goal of improving classification accuracy on unseen data.
During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning,
developed a decision tree algorithm known as ID3 (Iterative Dichotomiser).
This work expanded on earlier work on concept learning systems, described by E. B. Hunt, J.
Marin, and P. T. Stone. Quinlan later presented C4.5 (a successor of ID3), which became a
benchmark to which newer supervised learning algorithms are often compared.
In 1984, a group of statisticians (L. Breiman, J. Friedman, R. Olshen, and C. Stone) published the
book Classification and Regression Trees (CART), which described the generation of binary
decision trees.
Algorithm: Generate decision tree. Generate a decision tree from the training tuples of data
partition, D.
Input:
Data partition, D, which is a set of training tuples and their associated class labels;
attribute list, the set of candidate attributes;
Attribute selection method, a procedure to determine the splitting criterion that “best”
partitions the data tuples into individual classes. This criterion consists of a splitting
attribute and, possibly, either a split-point or splitting subset.
Output: A decision tree.
Method:
1) create a node N;
2) if tuples in D are all of the same class, C, then
3) return N as a leaf node labeled with the class C;
4) if attribute list is empty then
5) return N as a leaf node labeled with the majority class in D; // majority voting
6) apply Attribute selection method(D, attribute list) to find the “best” splitting
criterion;
7) label node N with splitting criterion;
8) if splitting attribute is discrete-valued and
multiway splits allowed then // not restricted to binary trees
9) attribute list ← attribute list − splitting attribute; // remove splitting attribute
10) for each outcome j of splitting criterion
// partition the tuples and grow subtrees for each partition
11) let Dj be the set of data tuples in D satisfying outcome j; // a partition
12) if Dj is empty then
13) attach a leaf labeled with the majority class in D to node N;
14) else attach the node returned by Generate decision tree(Dj , attribute list) to node N;
endfor
15) return N;
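Below is a minimal Python sketch of this procedure, assuming all attributes are discrete-valued and multiway splits are allowed; tuples are represented as dictionaries with a "class" key, and the attribute selection measure is passed in as a function (all names here are illustrative, not from the text).

```python
# A minimal sketch of the Generate_decision_tree procedure above.
from collections import Counter

def majority_class(D):
    """Most frequent class label among the tuples in D (majority voting)."""
    return Counter(t["class"] for t in D).most_common(1)[0][0]

def generate_decision_tree(D, attribute_list, attribute_selection_method):
    classes = {t["class"] for t in D}
    if len(classes) == 1:                              # steps 2-3: all of the same class
        return {"leaf": classes.pop()}
    if not attribute_list:                             # steps 4-5: majority voting
        return {"leaf": majority_class(D)}

    splitting_attribute = attribute_selection_method(D, attribute_list)  # step 6
    node = {"attribute": splitting_attribute, "branches": {}}            # step 7

    # steps 8-9: remove the discrete-valued splitting attribute for the subtrees
    remaining = [a for a in attribute_list if a != splitting_attribute]

    # step 10: one branch per outcome; here outcomes are the values present in D,
    # so the "Dj is empty" case of steps 12-13 cannot occur, but it is kept for fidelity
    for outcome in {t[splitting_attribute] for t in D}:
        Dj = [t for t in D if t[splitting_attribute] == outcome]         # step 11
        if not Dj:                                                       # steps 12-13
            node["branches"][outcome] = {"leaf": majority_class(D)}
        else:                                                            # step 14
            node["branches"][outcome] = generate_decision_tree(
                Dj, remaining, attribute_selection_method)
    return node                                                          # step 15
```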
Binary Attributes: The test condition for a binary attribute generates two potential
outcomes.
Nominal Attributes: These can have many values, and a test on a nominal attribute can be expressed in two ways: as a multiway split with one branch per distinct value, or as a binary split obtained by grouping the values into two subsets.
Ordinal attributes: These can produce binary or multiway splits. The values can be grouped as
long as the grouping does not violate the order property of attribute values.
Information Gain
ID3 uses information gain as its attribute selection measure. Let node N represent or
hold the tuples of partition D. The attribute with the highest information gain is chosen as the
splitting attribute for node N. This attribute minimizes the information needed to classify the
tuples in the resulting partitions and reflects the least randomness or “impurity” in these
partitions. Such an approach minimizes the expected number of tests needed to classify a given
tuple and guarantees that a simple (but not necessarily the simplest) tree is found.
The expected information needed to classify a tuple in D is given by

Info(D) = − Σ_{i=1}^{m} p_i log2(p_i)

where p_i is the nonzero probability that an arbitrary tuple in D belongs to class C_i and is estimated
by |C_i,D|/|D|. A log function to the base 2 is used because the information is encoded in
bits. Info(D) is also known as the entropy of D.
Information gain is defined as the difference between the original information requirement (i.e.,
based on just the proportion of classes) and the new requirement (i.e., obtained after
partitioning on A). That is,

Gain(A) = Info(D) − Info_A(D)

where Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j) is the expected information required to classify a tuple from D after partitioning D on attribute A into v partitions D_1, ..., D_v.
The attribute A with the highest information gain, Gain(A), is chosen as the
splitting attribute at node N. This is equivalent to saying that we want to partition on the
attribute A that would do the “best classification,” so that the amount of information still
required to finish classifying the tuples is minimal.
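A minimal Python sketch of these quantities, using the same tuple-as-dictionary representation as the earlier sketch (the helper names are illustrative):

```python
# Info(D), Info_A(D), and Gain(A) for discrete-valued attributes.
import math
from collections import Counter

def info(D):
    """Info(D) = -sum(p_i * log2(p_i)): the entropy of partition D."""
    n = len(D)
    counts = Counter(t["class"] for t in D)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_after_split(D, A):
    """Info_A(D): expected information needed after partitioning D on attribute A."""
    n = len(D)
    partitions = {}
    for t in D:
        partitions.setdefault(t[A], []).append(t)
    return sum((len(Dj) / n) * info(Dj) for Dj in partitions.values())

def gain(D, A):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(D) - info_after_split(D, A)
```

The splitting attribute for a node is then simply max(attribute_list, key=lambda A: gain(D, A)).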
Gain Ratio
C4.5, a successor of ID3, uses an extension to information gain known as gain ratio,
which attempts to overcome the bias of information gain toward tests with many outcomes. It applies a kind of normalization to information gain
using a “split information” value defined analogously with Info(D) as

SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j| / |D|) log2(|D_j| / |D|)

This value represents the potential information generated by splitting the training data set, D, into
v partitions, corresponding to the v outcomes of a test on attribute A. Note that, for each
outcome, it considers the number of tuples having that outcome with respect to the total number
of tuples in D. It differs from information gain, which measures the information with respect
to classification that is acquired based on the same partitioning. The gain ratio is defined as

GainRatio(A) = Gain(A) / SplitInfo_A(D)

The attribute with the maximum gain ratio is selected as the splitting attribute.
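A short sketch of the split information and gain ratio follows; it reuses the info() and gain() helpers from the information gain sketch above (names are illustrative):

```python
# SplitInfo_A(D) and GainRatio(A) for a discrete-valued attribute A.
import math
from collections import Counter

def split_info(D, A):
    """SplitInfo_A(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|)) over the outcomes of A."""
    n = len(D)
    sizes = Counter(t[A] for t in D)
    return -sum((s / n) * math.log2(s / n) for s in sizes.values())

def gain_ratio(D, A):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D); gain() is defined in the sketch above."""
    return gain(D, A) / split_info(D, A)
```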
Gini Index
The Gini index is used in CART. Using the notation previously described, the Gini
index measures the impurity of D, a data partition or set of training tuples, as

Gini(D) = 1 − Σ_{i=1}^{m} p_i²

where p_i is the nonzero probability that an arbitrary tuple in D belongs to class C_i and is
estimated by |C_i,D|/|D| over m classes.
Note: The Gini index considers a binary split for each attribute.
When considering a binary split, we compute a weighted sum of the impurity of
each resulting partition. For example, if a binary split on A partitions D into D1 and D2, the Gini
index of D given that partitioning is

Gini_A(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)
For each attribute, each of the possible binary splits is considered. For a discrete-valued
attribute, the subset that gives the minimum Gini index for that attribute is selected as its
splitting subset.
For continuous-valued attributes, each possible split-point must be considered. The strategy
is similar to that described earlier for information gain, where the midpoint between each
pair of (sorted) adjacent values is taken as a possible split-point.
The reduction in impurity that would be incurred by a binary split on a discrete- or
continuous-valued attribute A is

ΔGini(A) = Gini(D) − Gini_A(D)

The attribute that maximizes the reduction in impurity (equivalently, that minimizes Gini_A(D)) is selected as the splitting attribute.
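A minimal sketch of the Gini index and of searching the candidate splitting subsets of a discrete-valued attribute, again using the tuple-as-dictionary representation (names are illustrative):

```python
# Gini(D), the Gini index of a binary split, and the best splitting subset for A.
from collections import Counter
from itertools import combinations

def gini(D):
    """Gini(D) = 1 - sum(p_i^2) over the m classes in partition D."""
    n = len(D)
    counts = Counter(t["class"] for t in D)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_binary_split(D, A, subset):
    """Gini_A(D) = |D1|/|D|*Gini(D1) + |D2|/|D|*Gini(D2) for the split 'A in subset'."""
    D1 = [t for t in D if t[A] in subset]
    D2 = [t for t in D if t[A] not in subset]
    n = len(D)
    return (len(D1) / n) * gini(D1) + (len(D2) / n) * gini(D2)

def best_gini_subset(D, A):
    """Try every proper, non-empty subset of A's values; keep the minimum Gini_A(D)."""
    values = sorted({t[A] for t in D})
    candidates = [set(c) for r in range(1, len(values))
                  for c in combinations(values, r)]
    return min(candidates, key=lambda s: gini_binary_split(D, A, s))
```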
Tree Pruning:
When a decision tree is built, many of the branches will reflect anomalies in the training
data due to noise or outliers.
Tree pruning methods address this problem of overfitting the data. Such methods
typically use statistical measures to remove the least-reliable branches.
Pruned trees tend to be smaller and less complex and, thus, easier to comprehend.
They are usually faster and better at correctly classifying independent test data (i.e.,
previously unseen tuples) than unpruned trees.
“How does tree pruning work?” There are two common approaches to tree pruning:
prepruning and postpruning.
In the prepruning approach, a tree is “pruned” by halting its construction early. Upon
halting, the node becomes a leaf. The leaf may hold the most frequent class among the
subset tuples or the probability distribution of those tuples.
If partitioning the tuples at a node would result in a split that falls below a prespecified
threshold, then further partitioning of the given subset is halted. There are difficulties,
however, in choosing an appropriate threshold.
The second approach, postpruning, removes subtrees from a “fully grown” tree. A subtree at a given
node is pruned by removing its branches and replacing it with a leaf. The leaf is labeled
with the most frequent class among the subtree being replaced.
The cost complexity pruning algorithm used in CART is an example of the postpruning approach; it uses a separate pruning set of class-labeled tuples to estimate the cost complexity of a subtree. This set is independent of the training set used to build the unpruned tree and of any test
set used for accuracy estimation.
The algorithm generates a set of progressively pruned trees. In general, the smallest
decision tree that minimizes the cost complexity is preferred.
C4.5 uses a method called pessimistic pruning, which is similar to the cost
complexity method in that it also uses error rate estimates to make decisions regarding
subtree pruning.
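As a hedged illustration, scikit-learn's DecisionTreeClassifier exposes cost-complexity pruning through cost_complexity_pruning_path and the ccp_alpha parameter; the sketch below assumes hypothetical training and independent pruning sets (X_train, y_train, X_prune, y_prune), which are not from the text.

```python
# A sketch of cost-complexity postpruning using scikit-learn (assumed available).
from sklearn.tree import DecisionTreeClassifier

def prune_by_cost_complexity(X_train, y_train, X_prune, y_prune):
    # Grow the full tree and obtain the sequence of effective alphas, each of
    # which corresponds to a progressively more heavily pruned tree.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
        X_train, y_train)

    best_tree, best_score = None, -1.0
    for alpha in path.ccp_alphas:
        tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
        tree.fit(X_train, y_train)
        score = tree.score(X_prune, y_prune)   # accuracy on the independent prune set
        # Among equally accurate trees, prefer the smaller (more pruned) one.
        if score > best_score or (score == best_score
                                  and tree.tree_.node_count < best_tree.tree_.node_count):
            best_tree, best_score = tree, score
    return best_tree
```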
Scalability of Decision Tree Induction:
“What if D, the disk-resident training set of class-labeled tuples, does not fit in
memory? In other words, how scalable is decision tree induction?” The efficiency of existing
decision tree algorithms, such as ID3, C4.5, and CART, has been well established for relatively
small data sets. Efficiency becomes an issue of concern when these algorithms are applied to
the mining of very large real-world databases. The pioneering decision tree algorithms that we
have discussed so far have the restriction that the training tuples should reside in memory.
In data mining applications, very large training sets of millions of tuples are common.
Most often, the training data will not fit in memory! Therefore, decision tree construction
becomes inefficient due to swapping of the training tuples in and out of main and cache
memories. More scalable approaches, capable of handling training data that are too large to fit
in memory, are required. Earlier strategies to “save space” included discretizing continuous-
valued attributes and sampling data at each node. These techniques, however, still assume that
the training set can fit in memory.
Several scalable decision tree induction methods have been introduced in recent studies.
RainForest, for example, adapts to the amount of main memory available and applies to any
decision tree induction algorithm. The method maintains an AVC-set (where “AVC” stands
for “Attribute-Value, Classlabel”) for each attribute, at each tree node, describing the training
tuples at the node. The AVC-set of an attribute A at node N gives the class label counts for each
value of A for the tuples at N. The set of all AVC-sets at a node N is the AVC-group of N. The
size of an AVC-set for attribute A at node N depends only on the number of distinct values of
A and the number of classes in the set of tuples at N. Typically, this size should fit in memory,
even for real-world data. RainForest also has techniques, however, for handling the case where
the AVC-group does not fit in memory. Therefore, the method has high scalability for decision
tree induction in very large data sets.
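A minimal sketch of a RainForest-style AVC-set and AVC-group, using the tuple-as-dictionary representation from the earlier sketches (names are illustrative):

```python
# AVC-set and AVC-group for the tuples D at a tree node.
from collections import defaultdict

def avc_set(D, A):
    """AVC-set of attribute A at a node: {attribute value: {class label: count}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for t in D:
        counts[t[A]][t["class"]] += 1
    return {value: dict(class_counts) for value, class_counts in counts.items()}

def avc_group(D, attribute_list):
    """AVC-group of a node: one AVC-set per candidate attribute."""
    return {A: avc_set(D, A) for A in attribute_list}
```

The size of each AVC-set depends only on the number of distinct values of A and the number of classes, not on the number of tuples, which is why it typically fits in memory.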
Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
I(3,2) = −(3/5) log2(3/5) − (2/5) log2(2/5) ≈ 0.971
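As a quick numeric check, the following short Python snippet reproduces both worked values:

```python
# Verify the worked entropy values above.
import math

def I(*counts):
    """Entropy of a partition with the given class counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts)

print(round(I(9, 5), 3))   # 0.940
print(round(I(3, 2), 3))   # 0.971
```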
Because age has the highest information gain among the attributes, it is selected as the splitting
attribute. Node N is labeled with age, and branches are grown for each of the attribute’s values.
The tuples are then partitioned accordingly.