DM Module 4
CLASSIFICATION
Classification is a task in data mining that involves assigning a class label to each instance in a
dataset based on its features. The goal of classification is to build a model that accurately
predicts the class labels of new instances based on their features.
There are two main types of classification: binary classification and multi-class classification.
Binary classification involves classifying instances into two classes, such as “spam” or “not
spam”, while multi-class classification involves classifying instances into more than two
classes.
Classification is a widely used technique in data mining and is applied in a variety of domains,
such as email filtering, sentiment analysis, and medical diagnosis.
Classification: It is a data analysis task, i.e. the process of finding a model that describes and
distinguishes data classes and concepts. Classification is the problem of identifying to which of
a set of categories (subpopulations) a new observation belongs, on the basis of a training set
of data containing observations whose category membership is known.
Example: Before starting any project, we need to check its feasibility. In this case, a classifier
is required to predict class labels such as 'Safe' and 'Risky' before the project is adopted and
approved. Classification is a two-step process:
1. Learning Step (Training Phase): Construction of Classification Model
Different algorithms are used to build a classifier by making the model learn from the available
training set. The model has to be trained so that it can make accurate predictions.
2. Classification Step: The model is used to predict class labels; the constructed model is tested
on test data to estimate the accuracy of the classification rules.
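To make the two steps concrete, here is a minimal Python sketch, assuming scikit-learn is available; the tiny project-feasibility dataset and its feature names are invented purely for illustration.

# Minimal sketch of the two-step classification process (assumes scikit-learn).
# The toy 'Safe'/'Risky' project data below is invented for illustration only.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical features: [budget, team_size, duration_in_months]
X = [[10, 5, 6], [2, 2, 12], [8, 6, 4], [1, 1, 18], [12, 8, 5], [3, 2, 15]]
y = ["Safe", "Risky", "Safe", "Risky", "Safe", "Risky"]  # class labels

# Hold out part of the data so the model can be tested later
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# 1. Learning step (training phase): construct the classification model
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# 2. Classification step: predict class labels for test data and estimate accuracy
predictions = model.predict(X_test)
print("Predicted labels:", predictions)
print("Accuracy:", accuracy_score(y_test, predictions))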
Classification Algorithms:
DECISION TREE INDUCTION
A decision tree is a supervised learning method used in data mining for classification and
regression tasks. It is a tree structure that supports decision making. The decision tree builds
classification or regression models in the form of a tree: it separates a data set into smaller
and smaller subsets while the tree is incrementally developed. The final result is a tree with
decision nodes and leaf nodes. A decision node has two or more branches, while a leaf node
represents a classification or decision and cannot be split any further. The uppermost decision
node in a tree, which corresponds to the best predictor, is called the root node.
Decision trees can deal with both categorical and numerical data.
Key factors:
Entropy:
Entropy is a common way to measure impurity. In a decision tree, it measures
the randomness or impurity of the data at a node.
Information Gain:
Information gain is the decline in entropy after the dataset is split on an attribute. It is
also called entropy reduction. Building a decision tree is all about discovering the attributes
that return the highest information gain.
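To make these two measures concrete, the following self-contained Python sketch computes the entropy of a node and the information gain of a hypothetical binary split; the class counts used are made up for illustration.

import math

def entropy(class_counts):
    # Entropy of a node, given the number of records in each class.
    total = sum(class_counts)
    ent = 0.0
    for count in class_counts:
        if count > 0:
            p = count / total        # proportion of records in this class
            ent -= p * math.log2(p)  # impurity contribution of this class
    return ent

# Hypothetical parent node: 9 records of one class and 5 of the other
parent = [9, 5]
# Hypothetical split of the parent into two child nodes
children = [[6, 1], [3, 4]]

parent_entropy = entropy(parent)
total = sum(parent)
# Weighted average entropy of the children after the split
children_entropy = sum(sum(c) / total * entropy(c) for c in children)
# Information gain = decline in entropy caused by the split
info_gain = parent_entropy - children_entropy

print("Parent entropy:", round(parent_entropy, 3))
print("Information gain:", round(info_gain, 3))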
We can say that a decision tree is a hierarchical tree structure that can be used to split an
extensive collection of records into smaller sets of the class by applying a sequence of simple
decision rules. A decision tree model comprises a set of rules for partitioning a large,
heterogeneous population into smaller, more homogeneous, mutually exclusive classes. The
attributes of the classes can be any type of variable, from nominal, ordinal, and binary to
quantitative values; in contrast, the classes must be of a qualitative type, such as categorical,
ordinal, or binary.
In brief, given data described by attributes together with their class labels, a decision tree
produces a set of rules that can be used to identify the class. One rule is applied after another,
resulting in a hierarchy of segments within a segment. The hierarchy is known as the tree, and
each segment is called a node.
The following decision tree is for the concept buy_computer that indicates whether a customer at
a company is likely to buy a computer or not. Each internal node represents a test on an attribute.
Each leaf node represents a class.
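Because the tree itself appears only as a figure, the sketch below shows what such a buy_computer tree might look like as code; the attributes (age, student, credit_rating) and the branch structure are assumed from the standard textbook example and may differ from the actual figure.

def buy_computer(age, student, credit_rating):
    # Sketch of a possible buy_computer decision tree (assumed structure).
    # Each if-test plays the role of an internal node; each return value is a leaf class.
    if age == "youth":                                     # root node: test on age
        return "yes" if student == "yes" else "no"
    elif age == "middle_aged":
        return "yes"                                       # leaf node: class 'yes'
    else:  # senior
        return "yes" if credit_rating == "fair" else "no"

print(buy_computer("youth", "yes", "fair"))  # -> yes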
The machine learning researcher J. Ross Quinlan developed a decision tree algorithm known as
ID3 (Iterative Dichotomiser) in 1980. Later, he presented C4.5, the successor of ID3. ID3 and
C4.5 adopt a greedy approach: there is no backtracking, and the trees are constructed in a
top-down, recursive, divide-and-conquer manner.
Algorithm: Generate_decision_tree(D, attribute_list)
Output:
A Decision Tree
Method
create a node N;
if the tuples in D are all of the same class C then
return N as a leaf node labeled with the class C;
if attribute_list is empty then
return N as a leaf node labeled with the majority class in D;
apply an attribute selection method to D and attribute_list to find the best splitting criterion;
label node N with the splitting criterion;
for each outcome j of the splitting criterion
let Dj be the set of tuples in D satisfying outcome j;
if Dj is empty then
attach a leaf labeled with the majority
class in D to node N;
else
attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
BAYES CLASSIFICATION
Bayesian classification uses Bayes' theorem to predict the probability of an event. Bayesian
classifiers are statistical classifiers based on Bayesian probability. The theorem expresses how
a degree of belief, expressed as a probability, should be updated in the light of new evidence.
Bayes' theorem is named after Thomas Bayes, who first used conditional probability to provide
an algorithm that uses evidence to calculate limits on an unknown parameter.
Bayes' theorem is expressed mathematically by the following equation:
P(X|Y) = P(Y|X) P(X) / P(Y)
P(X|Y) is a conditional probability: the probability of event X occurring given that Y is true.
P(Y|X) is a conditional probability: the probability of event Y occurring given that X is true.
P(X) and P(Y) are the probabilities of observing X and Y independently of each other; each is
known as a marginal probability.
The theorem follows from the definition of conditional probability, because the joint probability
P(X⋂Y) of both X and Y being true can be written either as P(X|Y) P(Y) or as P(Y|X) P(X).
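As a small worked example, the following Python snippet applies Bayes' theorem to the spam setting used elsewhere in this module; all of the probabilities are invented for illustration.

# Worked example of Bayes' theorem with made-up numbers:
# X = "email is spam", Y = "email contains the word 'offer'"
p_spam = 0.20                   # P(X): prior probability that an email is spam
p_offer_given_spam = 0.60       # P(Y|X): 'offer' appears in 60% of spam emails
p_offer_given_not_spam = 0.05   # 'offer' appears in 5% of non-spam emails

# Marginal probability P(Y) via the law of total probability
p_offer = p_offer_given_spam * p_spam + p_offer_given_not_spam * (1 - p_spam)

# Bayes' theorem: P(X|Y) = P(Y|X) * P(X) / P(Y)
p_spam_given_offer = p_offer_given_spam * p_spam / p_offer
print("P(spam | contains 'offer') =", round(p_spam_given_offer, 3))  # 0.75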
RULE-BASED CLASSIFICATION
Rule-based classifiers are another type of classifier that makes the class decision by using
various "if…else" rules. These rules are easily interpretable, so such classifiers are often used
to generate descriptive models. The condition used with "if" is called the antecedent, and the
predicted class of each rule is called the consequent.
Properties of rule-based classifiers:
Coverage: The percentage of records which satisfy the antecedent conditions of a particular
rule.
The rules generated by the rule-based classifiers are generally not mutually exclusive, i.e.
many rules can cover the same record.
The rules generated by the rule-based classifiers may not be exhaustive, i.e. there may be
some records which are not covered by any of the rules.
The decision boundaries created by rule-based classifiers are linear, but the overall boundary
can be much more complex than that of a decision tree, because many rules may be triggered for
the same record.
IF-THEN Rules
A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a
rule in the following form:
IF condition THEN conclusion
Points to remember −
The IF part of the rule is called the rule antecedent or precondition, and the THEN part is
called the rule consequent.
If the condition holds true for a given tuple, then the antecedent is satisfied and the rule is
said to cover the tuple.
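A minimal Python sketch of a single IF-THEN rule and its coverage is given below; the small dataset and attribute values are invented purely for illustration.

# One IF-THEN rule evaluated on a tiny, invented dataset.
records = [
    {"age": "youth", "student": "yes", "buys_computer": "yes"},
    {"age": "youth", "student": "no",  "buys_computer": "no"},
    {"age": "senior", "student": "yes", "buys_computer": "yes"},
    {"age": "youth", "student": "yes", "buys_computer": "yes"},
]

# Rule R1: IF age = youth AND student = yes THEN buys_computer = yes
def antecedent(record):
    return record["age"] == "youth" and record["student"] == "yes"

covered = [r for r in records if antecedent(r)]  # records satisfying the IF part
coverage = len(covered) / len(records)           # fraction of records covered by R1
print("Coverage of R1:", coverage)               # 2 / 4 = 0.5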
KNN ALGORITHM (LAZY LEARNER ALGORITHM)
The working of K-NN can be explained with the following steps. Suppose we have a new data point
and we need to assign it to one of two categories, A or B:
o Firstly, we choose the number of neighbors; here we choose k = 5.
o Next, we calculate the Euclidean distance between the new point and the existing data points.
The Euclidean distance is the distance between two points, which we have already studied in
geometry; for points (x1, y1) and (x2, y2) it is calculated as d = sqrt((x2 - x1)^2 + (y2 - y1)^2).
o By calculating the Euclidean distance we obtain the nearest neighbors: three nearest
neighbors in category A and two nearest neighbors in category B.
o Since the majority of the nearest neighbors (3 of the 5) are from category A, the new data
point is assigned to category A.
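The following self-contained Python sketch mirrors these steps for a made-up two-dimensional dataset; the points and the choice k = 5 are for illustration only.

import math
from collections import Counter

def euclidean(p, q):
    # Euclidean distance between two 2-D points
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def knn_classify(training_data, new_point, k=5):
    # Assign new_point to the majority category among its k nearest neighbors
    distances = [(euclidean(point, new_point), label) for point, label in training_data]
    nearest = sorted(distances)[:k]                  # the k nearest neighbors
    votes = Counter(label for _, label in nearest)   # count categories among them
    return votes.most_common(1)[0][0]

# Invented training data: (x, y) points labeled with category A or B
training_data = [
    ((1, 2), "A"), ((2, 3), "A"), ((3, 3), "A"), ((2, 1), "A"),
    ((6, 7), "B"), ((7, 8), "B"), ((8, 6), "B"),
]

print(knn_classify(training_data, (3, 2), k=5))  # expected: 'A'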
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a
decision tree.
One rule is created for each path from the root to the leaf node.
To form a rule antecedent, the splitting criteria along the path are logically ANDed.
The leaf node holds the class prediction, forming the rule consequent.
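If scikit-learn is available, the IF-THEN rules along each root-to-leaf path of a fitted tree can be printed directly; the toy dataset below is invented for illustration.

# Extracting rule-like paths from a fitted decision tree (assumes scikit-learn).
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented toy data: [age_in_years, is_student]
X = [[22, 1], [25, 0], [40, 0], [45, 1], [60, 0], [35, 1]]
y = ["yes", "no", "yes", "yes", "no", "yes"]  # buys_computer

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text prints one indented branch per split; each root-to-leaf path
# corresponds to one rule whose antecedent is the AND of the tests on the path
# and whose consequent is the class held by the leaf.
print(export_text(tree, feature_names=["age", "is_student"]))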
Confusion matrix
In binary classification, there are two possible target classes, which are typically labeled as
"positive" and "negative" or "1" and "0". In our spam example above, the target (positive class)
is "spam," and the negative class is "not spam."
When evaluating the accuracy, we looked at correct and wrong predictions disregarding the class
label. However, in binary classification, we can be "correct" and "wrong" in two different ways.
Correct predictions include so-called true positives and true negatives. This is how it unpacks
for our spam use case example:
True positive (TP): An email that is actually spam and is correctly classified by the
model as spam.
True negative (TN): An email that is actually not spam and is correctly classified by the
model as not spam.
Model errors include so-called false positives and false negatives. In our example:
False Positive (FP): An email that is actually not spam but is incorrectly classified by the
model as spam (a "false alarm").
False Negative (FN): An email that is actually spam but is incorrectly classified by the
model as not spam (a "missed spam").
Using the confusion matrix, you can visualize all 4 different outcomes in a single table.
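Here is a short Python sketch that tallies the four outcomes for the spam example; the actual and predicted labels are made up for illustration.

# Counting the four confusion-matrix cells for the spam example (invented labels).
actual    = ["spam", "spam", "not spam", "not spam", "spam", "not spam", "spam", "not spam"]
predicted = ["spam", "not spam", "not spam", "spam", "spam", "not spam", "spam", "not spam"]

tp = sum(1 for a, p in zip(actual, predicted) if a == "spam" and p == "spam")          # true positives
tn = sum(1 for a, p in zip(actual, predicted) if a == "not spam" and p == "not spam")  # true negatives
fp = sum(1 for a, p in zip(actual, predicted) if a == "not spam" and p == "spam")      # false alarms
fn = sum(1 for a, p in zip(actual, predicted) if a == "spam" and p == "not spam")      # missed spam

print("                 predicted spam   predicted not spam")
print(f"actual spam            {tp}                  {fn}")
print(f"actual not spam        {fp}                  {tn}")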
Accuracy
Accuracy is a metric that measures how often a machine learning model correctly predicts the
outcome. You can calculate accuracy by dividing the number of correct predictions by the
total number of predictions.
Precision
Precision is a metric that measures how often a machine learning model correctly predicts the
positive class. You can calculate precision by dividing the number of correct positive predictions
(true positives) by the total number of instances the model predicted as positive (both true and
false positives).
Recall
Recall is a metric that measures how often a machine learning model correctly identifies positive
instances (true positives) from all the actual positive samples in the dataset. You can calculate
recall by dividing the number of true positives by the number of positive instances. The latter
includes true positives (successfully identified cases) and false negative results (missed cases).
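Continuing from the confusion-matrix counts in the previous sketch, the three metrics can be computed as follows; the counts are the invented ones from that example.

# Accuracy, precision and recall from the confusion-matrix counts above.
tp, tn, fp, fn = 3, 3, 1, 1  # values from the invented spam example

accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct predictions / all predictions
precision = tp / (tp + fp)                  # true positives / all predicted positives
recall = tp / (tp + fn)                     # true positives / all actual positives

print("Accuracy: ", round(accuracy, 2))   # 0.75
print("Precision:", round(precision, 2))  # 0.75
print("Recall:   ", round(recall, 2))     # 0.75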