Machine Learning
❖Machine Learning is the field of study that gives computers the capability to learn without
being explicitly programmed
❖The primary aim is to allow computers to learn automatically, without human intervention or
assistance, and to adjust their actions accordingly
❖It is a branch of Artificial Intelligence
❖Machine Learning Techniques
❖Supervised learning
❖Unsupervised learning
❖Semi-supervised learning
❖Reinforcement learning
Terms
❑Class label attribute
❑In a dataset, one attribute (column) is the class label, which assigns each tuple to a category
❑Examples of class labels: red/blue/green, bird/animal, safe/risky, yes/no
❑Labelled dataset - a dataset in which every tuple carries the class label attribute
❑Training data - the major portion of the dataset, fed to the classification algorithm. Because
this data is labelled, the algorithm can construct a model, or classifier
Terms
❑Test data (Validation set) - the remaining portion of the dataset, used to test the classifier.
The predicted results are compared with the true labels to find the accuracy of the classifier
❑Ground truth - a term used in statistics and machine learning for checking the results of
machine learning for accuracy against the real world
❑Training tuples - the individual tuples making up the training set. They are also referred to as
samples, examples, instances, data points or objects
Supervised vs. Unsupervised Learning
Supervised learning (classification/prediction)
◦ The training data (observations, measurements, etc.) are accompanied by labels indicating
the class of the observations
◦ New data is classified based on the training set
◦ Ground truth is available
◦ Techniques for Supervised Learning – Classification and Numeric Prediction
Unsupervised learning (clustering)
◦ The class labels of the training data are unknown
◦ Its goal is to infer the natural structure present within a set of data points
◦ Exploratory analysis
◦ In situations where it is either impossible or impractical for a human to propose trends in the
data, unsupervised learning can provide initial insights that can then be used to test individual
hypotheses.
Classification
A form of data analysis that extracts models describing important data classes
Such models are called classifiers; they are built to predict class labels
Applications
➢ A bank loan officer wants to analyze loan-application data in order to know which customers
(loan applicants) are risky and which are safe.
➢ A marketing manager at a company needs to predict whether a customer with a given profile
will buy a new computer.
➢ Image classification, Face recognition
➢ Spam filtering, Cancer diagnosis
➢ Sentiment analysis, Mood detection
➢ Prediction of rainfall
Definition - Classification
Classification is the problem of identifying which of a set of categories (subpopulations) a new
observation belongs to, on the basis of a training set of data containing observations whose
category membership is known.
Numeric Prediction
In this case, a model or predictor is constructed that predicts a continuous-valued function
or ordered value. Regression analysis is a statistical methodology that is most often used
for numeric prediction.
Examples/ applications of prediction:
✓Marketing manager needs to predict how much a given customer will spend during a sale at his
company.
✓Predict the agricultural yield
✓Predict the salary of the customer
✓Predict the number of runs by a cricketer
✓Predict stock market value
Classification—A Two-Step Process
Learning phase or Model construction: describing a set of predetermined classes
◦ Each tuple/sample is assumed to belong to a predefined class, as determined by the class
label attribute
◦ The set of tuples used for model construction is training set
◦ The model is represented as classification rules, decision trees, or mathematical formulae
Classification phase or Model usage: for classifying future or unknown objects
◦ Estimate accuracy of the model
◦ The known label of test sample is compared with the classified result from the model
◦ If the accuracy is acceptable, use the model to classify new data
Note: If the test set is used to select models, it is called validation set
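A minimal scikit-learn sketch of the two-step process above; the iris data, the decision-tree classifier (introduced later in these slides) and the single unseen tuple are illustrative assumptions, not taken from the slides:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: model construction on the labelled training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage - estimate accuracy on the held-out test set,
# then classify a previously unseen tuple
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
print("predicted class:", clf.predict([[5.0, 3.4, 1.5, 0.2]]))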
Step (1): Model Construction
[Figure: the training data is fed to a classification algorithm, which produces a classifier (model).
Example of a learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes']
Step (2): Using the Model in Prediction
[Figure: the classifier is applied first to the test data, then to new, unseen data,
e.g. the tuple (Jeff, Professor, 4) - tenured?]
Classification Techniques
✓Decision tree based methods
✓Rule based methods
✓Memory based reasoning
✓Neural Networks
✓Genetic algorithms
✓Bayesian networks
✓Support vector machines
Decision Tree – Whether to play or not?
Outlook?
  Sunny    -> Humidity?
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind?
                Strong -> No
                Weak   -> Yes
Decision Tree : Another Example
❑ Training data set: Buys_computer
❑ Resulting tree:
age?
  <=30   -> student?
              no  -> no
              yes -> yes
  31..40 -> yes
  >40    -> credit rating?
              excellent -> no
              fair      -> yes
Decision Tree Induction
❑Decision tree induction is the learning of a decision tree from class-labelled training
tuples
❑In a decision tree, each internal node denotes a test on an attribute, each
branch represents an outcome of the test, and each leaf node holds a class label
❑For an unknown tuple X, a path is traced from the root to a leaf node to predict
its class
❑A decision tree can be converted to classification rules
Algorithms
❖ID3 (Iterative Dichotomiser 3)
❖C4.5 (Successor of ID3)
❖CART (Classification and Regression Tree)
❖SLIQ
❖SPRINT
ID3 Algorithm
Input :
Data partition D, which is a set of training tuples and their associated
class labels
Attribute_list: the set of candidate attributes
Attribute_selection_method: a procedure to determine the splitting criterion that
best partitions the data tuples into individual classes
Output: A decision tree
Method:
  create a node N;
  if the tuples in D are all of the same class C then
      return N as a leaf node labeled with class C;
  if attribute_list is empty then
      return N as a leaf node labeled with the majority class in D;
  apply Attribute_selection_method(D, attribute_list)
      to find the best splitting_criterion;
  label node N with the splitting_criterion;
  if splitting_attribute is discrete-valued and multiway splits are allowed then
      attribute_list = attribute_list - splitting_attribute;
  for each outcome j of splitting_criterion
      // partition the tuples and grow subtrees for each partition
      let Dj be the set of tuples in D satisfying outcome j;
      if Dj is empty then
          attach a leaf labeled with the majority class in D to node N;
      else
          attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
  endfor
  return N;
End of algorithm

Time complexity: O(m * |D| * log |D|), where
  m = number of attributes
  |D| = number of records (tuples) in the dataset
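A compact Python sketch of the method above, assuming categorical attributes and information gain (introduced on the following slides) as the attribute selection measure; the function and variable names are illustrative, not the textbook's:

import math
from collections import Counter

def entropy(labels):
    # Info(D): expected information needed to classify a tuple
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Gain(A) = Info(D) - Info_A(D) for a categorical attribute attr
    total = len(labels)
    info_a = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        info_a += (len(subset) / total) * entropy(subset)
    return entropy(labels) - info_a

def generate_decision_tree(rows, labels, attributes):
    if len(set(labels)) == 1:           # all tuples of the same class -> leaf
        return labels[0]
    if not attributes:                  # attribute list empty -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    # choose the splitting attribute with the highest information gain
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    node = {best: {}}
    for value in set(row[best] for row in rows):   # one branch per outcome
        part = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*part)
        node[best][value] = generate_decision_tree(list(sub_rows),
                                                   list(sub_labels), remaining)
    return node

On the play-golf data used later in these slides (rows as dicts keyed by Outlook, Temperature, Humidity and Wind), this sketch produces a tree rooted at Outlook, with Humidity under Sunny and Wind under Rain, matching the tree shown earlier.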
Attribute Selection
1. Attribute selection method specifies a heuristic procedure for selecting the attribute that
best discriminates the given tuples according to class
2. The measure so selected determines the splitting criterion. Splitting criterion tells us which
attribute to test at node N
3. It determines the best way to separate or partition tuples in D into individual classes
4. It indicates the splitting attribute such that the resulting partitions at each branch are as pure
as possible
5. A partition is pure if all the tuples in it belong to the same class
Cases
a) A is discrete-valued
b) A is continuous-valued
c) A is discrete-valued and a binary tree must be produced
Attribute selection measure
oInformation Gain
oGain Ratio
oGini Index
oChi-square test
oC-SEP
oMinimum description length
Attribute selection measures are also known as splitting rules because
they determine how the tuples at a given node are to be split
Information Gain
Based on information theory, which studies the information content of messages
Entropy -> a measure of disorder in the dataset
Information gain = a measure of the decrease in disorder achieved by partitioning the original
dataset
Information gain = difference between the original information requirement and the new
requirement after partitioning on a certain attribute
Information gain = entropy of the dataset - entropy after partitioning
The attribute with the highest information gain is chosen as the splitting attribute for a node
This attribute minimizes the information needed to classify the tuples in the resulting
partitions, and reflects the least impurity/randomness in these partitions
[Figure: Entropy - Example]
[Figure: Partitioning on color to reduce the disorder]
Attribute Selection Measure: Information Gain (ID3/C4.5)
▪Select the attribute with the highest information gain, reflecting the least impurity
▪Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D|
▪Let m = number of distinct values of the class label attribute
▪Expected information (entropy of D) needed to classify a tuple in D:
      Info(D) = - Σ i=1..m  pi log2(pi)
▪Information needed (after using A to split D into v partitions) to classify D:
      Info_A(D) = Σ j=1..v  (|Dj| / |D|) × Info(Dj)
▪Information gained by branching on attribute A:
      Gain(A) = Info(D) - Info_A(D)
Day Outlook Temperature Humidity Wind PlayGolf
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
The class label PlayGolf has 9 Yes and 5 No tuples, so
Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits
Now, consider the attribute outlook.
For outlook, there are 3 categories: sunny, overcast and rain
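Using the class counts from the table above for each outlook value:
Sunny (5 tuples: 2 Yes, 3 No):     Info = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971
Overcast (4 tuples: 4 Yes, 0 No):  Info = 0
Rain (5 tuples: 3 Yes, 2 No):      Info = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.971
Info_outlook(D) = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) = 0.694
Gain(outlook) = Info(D) - Info_outlook(D) = 0.940 - 0.694 ≈ 0.247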
Calculating entropy for each attribute
➢The dataset is split on the different attributes.
➢The entropy for each branch is calculated.
➢Then it is added proportionally, to get total entropy for the split.
➢The resulting entropy is subtracted from the entropy before the split.
➢The result is the Information Gain, or decrease in entropy.
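A small self-contained Python check of this procedure on the play-golf table (the tuple encoding and helper names below are my own, not from the slides):

import math
from collections import Counter

rows = [  # (Outlook, Temperature, Humidity, Wind, PlayGolf)
    ("Sunny","Hot","High","Weak","No"),        ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),    ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),     ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"),("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),    ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),  ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),  ("Rain","Mild","High","Strong","No"),
]
attributes = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

labels = [r[-1] for r in rows]
print("Info(D) =", round(entropy(labels), 3))          # 0.94

for i, attr in enumerate(attributes):
    info_a = 0.0
    for value in set(r[i] for r in rows):
        part = [r for r in rows if r[i] == value]       # one partition per value
        info_a += len(part) / len(rows) * entropy([p[-1] for p in part])
    print(f"Gain({attr}) = {entropy(labels) - info_a:.3f}")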
Gain when split on different attributes:
Gain(Outlook) = 0.247, Gain(Temperature) = 0.029, Gain(Humidity) = 0.152, Gain(Wind) = 0.048
Choose attribute with the largest information gain as the decision node, divide
the dataset by its branches and repeat the same process on every branch.
A branch with entropy of 0 is a leaf node. For outlook=overcast, the
entire partition has play golf = yes
A branch with entropy more than 0 needs further splitting.
The ID3 algorithm is run recursively on the non-leaf branches, until all data is
classified.
Final Decision Tree
[Figure: the final tree is the play/don't-play tree shown earlier - Outlook at the root;
Sunny -> Humidity (High: No, Normal: Yes); Overcast -> Yes; Rain -> Wind (Strong: No, Weak: Yes)]
Decision Tree to Decision Rules
Each path from the root to a leaf gives one rule:
IF outlook = sunny AND humidity = high THEN play = no
IF outlook = sunny AND humidity = normal THEN play = yes
IF outlook = overcast THEN play = yes
IF outlook = rain AND wind = strong THEN play = no
IF outlook = rain AND wind = weak THEN play = yes
Limitation
The information gain measure is biased toward tests with many outcomes.
That is, it prefers to select attributes having a large number of values.
For example, consider an attribute that acts as a unique identifier, such as product ID. A split on
product ID would result in a large number of partitions (as many as there are values), each one
containing just one tuple. Because each partition is pure, the information required to classify
data set D based on this partitioning would be Info_product_ID(D) = 0. Therefore, the information
gained by partitioning on this attribute is maximal, even though such a split is useless for
classification.
Gain Ratio
C4.5, a successor of ID3, uses an extension to information gain known as gain ratio, which
attempts to overcome this bias.
It applies a kind of normalization to information gain using a "split information" value, defined
analogously to Info(D) as:
      SplitInfo_A(D) = - Σ j=1..v  (|Dj| / |D|) × log2(|Dj| / |D|)
Gain Ratio
Gain Ratio is defined as:
      GainRatio(A) = Gain(A) / SplitInfo_A(D)
The attribute with the maximum gain ratio is selected as the splitting attribute.
Solution
Computation of the gain ratio for the attribute income.
❖A test on income splits the buys_computer training data (14 tuples) into three partitions, namely
low, medium, and high, containing four, six, and four tuples, respectively.
❖To compute the gain ratio of income, we first use the SplitInfo equation to obtain
      SplitInfo_income(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557
❖We have Gain(income) = 0.029. Therefore, GainRatio(income) = 0.029/1.557 = 0.019.
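A quick numeric check of this example (the helper name split_info is my own):

import math

def split_info(partition_sizes):
    total = sum(partition_sizes)
    return -sum(n / total * math.log2(n / total) for n in partition_sizes)

si = split_info([4, 6, 4])          # income partitions: low, medium, high
gain_income = 0.029                 # taken from the slide above
print(round(si, 3))                 # 1.557
print(round(gain_income / si, 3))   # 0.019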
Gini Index
❖The Gini index is used in CART. Using the notation previously described, the Gini index measures the
impurity of D, a data partition or set of training tuples:
      Gini(D) = 1 - Σ i=1..m  pi^2
❖where pi is the probability that a tuple in D belongs to class Ci and is estimated by |Ci,D|/|D|. The sum
is computed over m classes.
❖The Gini index considers a binary split for each attribute.
❖For example, if income has three possible values, namely {low, medium, high}, then the possible subsets
are {low, medium, high}, {low, medium}, {low, high}, {medium, high}, {low}, {medium}, {high}, and {}.
❖We exclude the full set, {low, medium, high}, and the empty set from consideration since, conceptually,
they do not represent a split.
❖Therefore, there are 2^v - 2 possible ways to form two partitions of the data, D, based on a binary split
on A.
Gini Index
❖When considering a binary split, we compute a weighted sum of the impurity of each resulting
partition.
❖For example, if a binary split on A partitions D into D1 and D2, the Gini index of D given that
partitioning is
      Gini_A(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)
❖For a continuous-valued attribute, each possible split-point of A must be considered: D1 is the set
of tuples in D satisfying A ≤ split_point, and D2 is the set of tuples in D satisfying A > split_point.
❖The reduction in impurity that would be incurred by a binary split on a discrete- or continuous-
valued attribute A is
      ΔGini(A) = Gini(D) - Gini_A(D)
❖The attribute that maximizes the reduction in impurity (equivalently, gives the minimum Gini index)
is selected as the splitting attribute.
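A brief Python sketch of the Gini computation for one candidate binary split; the class counts used here are illustrative, not taken from the slides:

def gini(class_counts):
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

def gini_split(d1_counts, d2_counts):
    n1, n2 = sum(d1_counts), sum(d2_counts)
    n = n1 + n2
    return n1 / n * gini(d1_counts) + n2 / n * gini(d2_counts)

parent = gini([9, 5])               # e.g. 9 "yes" and 5 "no" tuples in D
split = gini_split([6, 1], [3, 4])  # one candidate binary split into D1, D2
print(round(parent - split, 3))     # reduction in impurity, ΔGini(A)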
Tree Pruning
❖Tree pruning addresses the problem of overfitting the data
❖Tree overfitting can occur due to irrelevant attributes, noisy data or too
little training data.
❖As the depth of the tree increases, training error decreases but test error
may increase
❖Statistical measures are used to remove the least reliable branches
❖Approaches – prepruning and postpruning
❖Prepruning – construction halted early, node converted to leaf
❖Postpruning – removes subtrees from a fully grown tree
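An illustrative scikit-learn sketch of the two approaches (the dataset and parameter values are my own choices, not from the slides):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Prepruning: halt construction early via limits on tree depth / leaf size
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10).fit(X_tr, y_tr)

# Postpruning: grow the full tree, then prune with cost-complexity pruning
full = DecisionTreeClassifier(random_state=0)
alphas = full.cost_complexity_pruning_path(X_tr, y_tr).ccp_alphas
post = DecisionTreeClassifier(ccp_alpha=alphas[len(alphas) // 2],
                              random_state=0).fit(X_tr, y_tr)

for name, clf in [("prepruned", pre), ("postpruned", post)]:
    print(name, clf.get_n_leaves(), "leaves, test acc:", clf.score(X_te, y_te))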
Advantages of Decision Tree
1. Simple to understand and interpret. Trees can also be displayed graphically in a way
that is easy for non-experts to interpret
2. Able to handle both numerical and categorical data
3. Requires little data preparation. Other techniques often require data normalization
4. Uses a white box model. By contrast, in a black box model, the explanation for the
results is typically difficult to understand, for example with an artificial neural network
5. Possible to validate a model using statistical tests, hence reliable
6. Makes no assumptions about the distribution of the training data
7. Performs well with large datasets
Limitations of Decision Tree
1. A small change in the training data can result in a large change in the tree and consequently
the final predictions.
2. Decision-tree learners can create over-complex trees that do not generalize well from the
training data. (overfitting)
Implementations
R, Weka, Orange, scikit-learn (Python), IBM SPSS, SAS, RapidMiner, MATLAB
Bayes Classification Methods
❖Bayesian classifiers are statistical classifiers
❖They predict class membership probabilities – probability that a tuple belongs
to a particular class
❖Classification is based on Bayes’ theorem
❖High accuracy and speed
❖Naïve Bayesian classifiers assume that the effect of an attribute value on a given
class is independent of the values of the other attributes → class conditional
independence
❖Application: Chinese word segmentation
Bayes’ Theorem
X : data tuple or evidence (described by n attributes)
H : hypothesis (e.g., data tuple X belongs to class C)
P(H|X) : probability that hypothesis H holds given evidence X [the classification problem]
It is called the posterior probability of H conditioned on X
P(H) : probability of the hypothesis irrespective of any evidence (prior probability)
P(X|H) : posterior probability of X conditioned on H (the likelihood of X given H)
P(X) : prior probability of X
Bayes' Theorem :
P(H|X) = [ P(X|H) P(H) ] / P(X)
Suppose the fraction of UG students who play cricket is 15% and
fraction of graduate students playing cricket is 23%. If one-fifth of
the college students are graduate students and rest are UG, what is
the probability that a student who plays cricket is a graduate
student?
Let C = plays cricket, U = undergraduate (UG), G = graduate.
P(C|U) = 0.15
P(C|G) = 0.23
P(G) = 0.2
P(U) = 0.8
To find: P(G|C)
P(G|C) = [P(C|G) × P(G)] / P(C) = (0.23 × 0.2) / (0.15 × 0.8 + 0.23 × 0.2)
       = 0.046 / 0.166 = 0.277
Working of Naïve Bayesian Classifier
❑Let D be a training set of tuples and their associated class labels
❑Each tuple is represented by an attribute vector X = (x1, x2, …, xn)
❑There are m classes, C1, C2, … ,Cm
❑Given a tuple X, the classifier predicts that it belongs to the class having the highest posterior
probability, i.e.
❑ P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i
Thus, we have to maximize P(Ci|X)
❑The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis
❑ P(Ci|X) = [ P(X|Ci) P(Ci) ] / P(X) …… Bayes' theorem
Continued…
❑P(X) is constant for all classes, hence only P(X|Ci) P(Ci) needs to be
maximized.
❑If the class prior probabilities are not known, it is assumed that the classes
are equally likely, i.e. P(C1) = P(C2) = … = P(Cm)
Hence only P(X|Ci) needs to be maximized
❑With class conditional independence:
P(X|Ci) = ∏ P(xk|Ci) ….. k = 1 to n
        = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
RID Age Income Student Credit-rating Class: buys_computer
1 Youth High No Fair No
2 Youth High No Excellent No
3 Middle-aged High No Fair Yes
4 Senior Medium No Fair Yes
5 Senior Low Yes Fair Yes
6 Senior Low Yes Excellent No
7 Middle-aged Low Yes Excellent Yes
8 Youth Medium No Fair No
9 Youth Low Yes Fair Yes
10 Senior Medium Yes Fair Yes
11 Youth Medium Yes Excellent Yes
12 Middle-aged Medium No Excellent Yes
13 Middle-aged High Yes Fair Yes
14 Senior Medium No Excellent No
Problem statement
Given the training data, classify the tuple :
X =( age=youth, income=medium, student=yes, credit_rating=fair)
Let C1: buys_computer = yes, C2: buys_computer = no
We need to maximize P(X|Ci) P(Ci) for i = 1, 2
P(C1) = P(buys_computer=yes) = 9/14 = 0.643
P(C2) = P(buys_computer=no) = 5/14 = 0.357
Computing conditional probabilities:
P(age=youth|buys_computer=yes) = 2/9 = 0.222
P(age=youth|buys_computer=no) = 3/5 = 0.600
P(income=medium | buys_computer=yes) = 4/9 = 0.444
P(income=medium | buys_computer=no) = 2/5 = 0.400
P(student=yes | buys_computer=yes) = 6/9 = 0.667
P(student=yes | buys_computer=no) = 1/5 = 0.2
P(credit_rating=fair | buys_computer=yes) = 6/9 = 0.667
P(credit_rating=fair | buys_computer=no) = 2/5 = 0.4
P(X|buys_computer=yes) = P(age=youth|buys_computer=yes) *
P(income=medium|buys_computer=yes) *
P(student=yes|buys_computer=yes) *
P(credit_rating=fair|buys_computer=yes)
= 0.222 * 0.444 * 0.667 * 0.667
= 0.044
Similarly,
P (X|buys_computer=no) = 0.6 * 0.4 * 0.2 * 0.4 = 0.019
To find the class Ci that maximizes P(X|Ci).P(Ci):
P (X|buys_computer=yes).P(buys_computer=yes)= 0.044 * 0.643 = 0.028
P (X|buys_computer=no).P(buys_computer=no)= 0.019 * 0.357 = 0.007
Thus, Naïve Bayesian Classifier predicts buys_computer = YES for tuple X
If X =( age=youth, income=medium, student=yes, credit_rating=fair)
Then buys_computer = yes
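A compact Python sketch that reproduces the hand computation above (the dataset encoding and function name are my own):

# (age, income, student, credit_rating, buys_computer)
data = [
    ("youth","high","no","fair","no"),        ("youth","high","no","excellent","no"),
    ("middle","high","no","fair","yes"),      ("senior","medium","no","fair","yes"),
    ("senior","low","yes","fair","yes"),      ("senior","low","yes","excellent","no"),
    ("middle","low","yes","excellent","yes"), ("youth","medium","no","fair","no"),
    ("youth","low","yes","fair","yes"),       ("senior","medium","yes","fair","yes"),
    ("youth","medium","yes","excellent","yes"),("middle","medium","no","excellent","yes"),
    ("middle","high","yes","fair","yes"),     ("senior","medium","no","excellent","no"),
]

def naive_bayes_score(x, cls):
    rows = [r for r in data if r[-1] == cls]
    prior = len(rows) / len(data)                  # P(Ci)
    likelihood = 1.0
    for k, value in enumerate(x):                  # P(xk|Ci), independence assumption
        likelihood *= sum(r[k] == value for r in rows) / len(rows)
    return prior * likelihood                      # P(X|Ci) * P(Ci)

x = ("youth", "medium", "yes", "fair")
scores = {c: naive_bayes_score(x, c) for c in ("yes", "no")}
print(scores)                        # roughly {'yes': 0.028, 'no': 0.007}
print(max(scores, key=scores.get))   # predicted class: 'yes'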
Try it out on Weather Data
X = (outlook = sunny, temperature = cool, humidity = high, wind = strong, play = ?)
Zero probability value – Laplace correction
If there are no training tuples for a particular attribute value, its estimated probability is zero,
which makes the whole product zero
Thus we would have P(X|Ci) = 0, cancelling the effect of all the other probabilities
To avoid this problem, assume that D is large enough that adding one to each count makes only a
negligible difference, and add one to each count to avoid zero probabilities.
This technique is known as the Laplacian correction or Laplace estimator.
Example: In a training database D, for the class buys_computer = yes, there are
0 tuples with income = low, 990 tuples with income = medium and 10 tuples with income = high
Probabilities of these events without the Laplacian correction: 0, 0.990, 0.010
Using the Laplace correction, add 1 tuple of each income value to D
New probabilities: 1/1003, 991/1003, 11/1003, i.e., 0.001, 0.988, 0.011
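A quick check of the Laplace-corrected estimates above in Python:

counts = {"low": 0, "medium": 990, "high": 10}

plain = {k: v / sum(counts.values()) for k, v in counts.items()}
laplace = {k: (v + 1) / (sum(counts.values()) + len(counts))
           for k, v in counts.items()}
print(plain)    # {'low': 0.0, 'medium': 0.99, 'high': 0.01}
print(laplace)  # roughly {'low': 0.001, 'medium': 0.988, 'high': 0.011}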
Model Evaluation and Selection
Terminology
➢Positive tuples- tuples of main class of interest
➢Negative tuples – all other tuples
➢True Positives (TP) – Positive tuples that were correctly labelled by the classifier
➢True Negatives (TN) – Negative tuples that were correctly labelled by the classifier
➢False Positives (FP) – Negative tuples that were incorrectly labelled as positives
➢False Negatives (FN) – Positive tuples that were incorrectly labelled as negative
Confusion Matrix
Confusion Matrix - a method to analyze how well a classifier can recognize
tuples of different classes
                           Predicted: C1 (yes / +ve)    Predicted: C2 (no / -ve)
Actual: C1 (yes / +ve)     True positive (TP)           False negative (FN)
Actual: C2 (no / -ve)      False positive (FP)          True negative (TN)
[Figure: example of a confusion matrix]
Evaluation Measures - Accuracy
1. Accuracy - the accuracy of a classifier on a given test set is the percentage of test set tuples that
are correctly classified by the classifier. It is also termed the overall recognition rate
2. Class imbalance problem –
◦ In classification involving skewed or highly imbalanced data, e.g., network intrusion detection,
financial fraud detection or cancer detection, we are interested only in the minority class.
◦ A classifier can report high accuracy even if it detects no intrusions at all.
◦ Hence accuracy is not suitable in some applications
Classes          Cancer = yes   Cancer = no    Total
Cancer = yes          90            210          300
Cancer = no          140           9560         9700
Total                230           9770        10000

Find the overall accuracy of the classifier to detect cancer patients
Precision and Recall measures
      p = TP / (TP + FP)          r = TP / (TP + FN)
◼ Precision p is the number of correctly classified positive examples divided by the total
number of examples that are classified as positive. [Exactness]
◼ Recall r is the number of correctly classified positive examples divided by the total
number of actual positive examples in the test set. [Completeness]
o Find the precision and recall for the cancer patients in the previous slide
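A small Python helper implementing these measures; the confusion-matrix counts passed in are illustrative placeholders, not taken from the exercises:

def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # overall recognition rate
    precision = tp / (tp + fp)                   # exactness
    recall = tp / (tp + fn)                      # completeness
    return accuracy, precision, recall

acc, p, r = metrics(tp=80, fp=20, fn=10, tn=90)
print(f"accuracy={acc:.3f}  precision={p:.3f}  recall={r:.3f}")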
Calculate the accuracy, precision and recall from the following confusion matrix:

                 Predicted Yes   Predicted No   Total
Actual Yes            100              5          105
Actual No              10             50           60
Total                 110             55          165
Other parameters to compare classifiers
1. Speed – computational costs involved in generating and using the classifier
2. Robustness – ability of classifier to make correct predictions by handling noise and missing
values
3. Scalability – ability to construct classifier efficiently given large amounts of data
4. Interpretability - the level of understanding and insight that the model provides