Naïve Bayes Classifier
Classification Methods
Bayesian Classification, Nearest Neighbor, Ensemble Methods
Bayesian Classification: Why?
P(M | S) = P(S | M) · P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
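A quick check of the arithmetic above in plain Python:

```python
# Bayes' theorem: P(M|S) = P(S|M) * P(M) / P(S), using the numbers from the slide
p_s_given_m = 0.5        # P(S|M)
p_m = 1 / 50000          # P(M)
p_s = 1 / 20             # P(S)

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)       # 0.0002
```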
Choosing Hypotheses
Maximum Likelihood hypothesis: h_ML = argmax_{h ∈ H} P(d | h)
Class-conditional independence assumption: P(A_1, A_2, …, A_n | C) = P(A_1 | C) · P(A_2 | C) · … · P(A_n | C)
P(Status=Married|No) = 4/7
P(Refund=Yes|Yes)=0
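Conditional probabilities such as P(Status=Married|No) above are ratios of counts in the training data. A minimal counting sketch in Python (the toy records below are hypothetical, not the training table from the slides):

```python
def conditional_prob(records, attr, value, cls):
    """Estimate P(attr = value | class = cls) by counting training records."""
    in_class = [r for r in records if r["class"] == cls]
    matches = sum(1 for r in in_class if r[attr] == value)
    return matches / len(in_class)

# Hypothetical toy records (not the actual training table from the slides)
records = [
    {"Status": "Married",  "Refund": "No",  "class": "No"},
    {"Status": "Married",  "Refund": "No",  "class": "No"},
    {"Status": "Single",   "Refund": "Yes", "class": "No"},
    {"Status": "Divorced", "Refund": "No",  "class": "Yes"},
]
print(conditional_prob(records, "Status", "Married", "No"))  # 2/3 for this toy data
```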
How to Estimate Probabilities from Data?
For continuous attributes:
Discretize the range into bins
one ordinal attribute per bin
violates independence assumption
Two-way split: (A < v) or (A > v)
choose only one of the two splits as new attribute
Probability density estimation:
Assume attribute follows a normal distribution
Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
Once the probability distribution is known, it can be used to
estimate the conditional probability P(A_i | c)
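As a quick illustration of the discretization option, a sketch using NumPy's digitize (the income values and bin edges are arbitrary examples, not taken from the slides):

```python
import numpy as np

# Discretize a continuous attribute (e.g., income in $K) into ordinal bins
income = np.array([125, 100, 70, 120, 95, 60, 220, 85, 75, 90])
bin_edges = [80, 100, 150]                    # arbitrary example edges
income_bin = np.digitize(income, bin_edges)   # 0..3: one ordinal value per bin
print(income_bin)
# Conditional probabilities P(bin | class) can now be estimated by counting,
# exactly as for categorical attributes.
```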
How to Estimate Probabilities from Data?
Normal distribution:
P(A_i | c_j) = 1 / sqrt(2π·σ_ij²) · exp( −(A_i − μ_ij)² / (2·σ_ij²) )
where μ_ij and σ_ij² are the sample mean and variance of attribute A_i for class c_j.
Example (Income = 120, Class = No): with sample mean 110 and sample variance 2975,
P(Income = 120 | No) = 1 / sqrt(2π·2975) · exp( −(120 − 110)² / (2·2975) ) ≈ 0.0072
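The density above is easy to evaluate directly; this sketch reproduces the Income = 120, Class = No computation (mean 110, variance 2975 as in the example):

```python
import math

def gaussian_density(x, mean, var):
    """Normal density used for P(A_i | c_j) with a continuous attribute."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Income = 120, Class = No: sample mean = 110, sample variance = 2975
print(round(gaussian_density(120, 110, 2975), 4))  # ~0.0072
```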
First step: compute P(C). The prior probability of each class can be
computed from the training tuples:
P(buys_computer=yes)=9/14=0.643
P(buys_computer=no)=5/14=0.357
Naïve Bayesian Classifier: An Example
P(X | Class=No) = P(Refund=No | Class=No) × P(Married | Class=No) × P(Income=120K | Class=No)
= 4/7 × 4/7 × 0.0072 = 0.0024
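Multiplying the individual estimates reproduces the value above; the class priors and the corresponding Class=Yes computation from the original slide are not shown in this excerpt:

```python
# P(X | Class=No) = P(Refund=No|No) * P(Married|No) * P(Income=120K|No)
p_x_given_no = (4 / 7) * (4 / 7) * 0.0072
print(round(p_x_given_no, 4))  # 0.0024
# The predicted class is the one maximizing P(X | C) * P(C).
```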
Original:    P(A_i | C) = N_ic / N_c
Laplace:     P(A_i | C) = (N_ic + 1) / (N_c + c)
m-estimate:  P(A_i | C) = (N_ic + m·p) / (N_c + m)
where N_ic is the number of class-C training examples with attribute value A_i, N_c is the number of class-C examples, c is the number of classes, p is a prior probability, and m is a parameter.
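A minimal sketch of the three estimates, written as plain functions of the counts N_ic and N_c; the counts in the example calls are hypothetical:

```python
def original_estimate(n_ic, n_c):
    return n_ic / n_c

def laplace_estimate(n_ic, n_c, c):
    """Add-one (Laplace) smoothing; c as defined on the slide."""
    return (n_ic + 1) / (n_c + c)

def m_estimate(n_ic, n_c, m, p):
    """m-estimate with prior probability p and parameter m."""
    return (n_ic + m * p) / (n_c + m)

# Hypothetical counts: an attribute value never seen with this class (n_ic = 0)
print(original_estimate(0, 7))          # 0.0 -> zeroes out the whole product
print(laplace_estimate(0, 7, c=3))      # 0.1
print(m_estimate(0, 7, m=3, p=1 / 3))   # 0.1
```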
Naïve Bayes (Summary)
Advantage
Robust to isolated noise points
Handle missing values by ignoring the instance during probability
estimate calculations
Robust to irrelevant attributes
Disadvantage
Assumption: class conditional independence, which may cause loss
of accuracy
The independence assumption may not hold for some attributes
In practice, dependencies exist among variables
Use other techniques such as Bayesian Belief Networks (BBN)
Remember
• Eager learners (e.g., Naïve Bayes): do less work when test tuples are presented
• Lazy learners (e.g., nearest neighbor): do more work when test tuples are presented
Basic k-Nearest Neighbor Classification
Given training data (x_1, y_1), …, (x_N, y_N)
Define a distance metric D(x_i, x_j) between points in input space
E.g., Euclidean distance, weighted Euclidean distance, Mahalanobis distance, TF-IDF, etc.
Training method:
Save the training examples
At prediction time:
Find the k training examples (x_1, y_1), …, (x_k, y_k) that are closest
to the test example x under the distance D(x, x_i)
Predict the most frequent class among those yi’s.
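A minimal sketch of this train/predict procedure, assuming NumPy, with Euclidean distance and a majority vote (the class and variable names are illustrative, not from a particular library):

```python
import numpy as np
from collections import Counter

class KNNClassifier:
    """Basic k-nearest-neighbor classifier: training = storing the examples."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        return self

    def predict_one(self, x):
        # Euclidean distance from x to every stored training example
        dists = np.linalg.norm(self.X - np.asarray(x, dtype=float), axis=1)
        nearest = np.argsort(dists)[: self.k]
        # Majority vote among the k nearest labels
        return Counter(self.y[nearest]).most_common(1)[0][0]

    def predict(self, X):
        return np.array([self.predict_one(x) for x in X])

# Tiny illustrative data set (hypothetical)
X = [[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 8]]
y = ["a", "a", "a", "b", "b", "b"]
print(KNNClassifier(k=3).fit(X, y).predict([[0, 0], [10, 10]]))  # ['a' 'b']
```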
Nearest-Neighbor Classifiers
Requires three things:
– The set of stored records
– Distance metric to compute distance between records
– The value of k, the number of nearest neighbors to retrieve
Classification: take a majority vote of the class labels among the K nearest neighbors
Regression: ŷ = (1/K) Σ_{k=1}^{K} y_k (the average of the K nearest neighbors' target values)
K-Nearest Neighbor Model: Weighted by Distance
Classification: weight each neighbor's vote by its distance to the query point (e.g., w_k = 1 / d(x, x_k)²) and predict the class with the largest total weight
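A sketch of a distance-weighted vote, assuming the common w = 1/d² weighting (the exact weighting used on the original slide is not shown in this excerpt):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=3, eps=1e-12):
    """Distance-weighted k-NN classification with weights w = 1 / d**2."""
    X_train = np.asarray(X_train, dtype=float)
    dists = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    votes = {}
    for i in nearest:
        w = 1.0 / (dists[i] ** 2 + eps)   # closer neighbors get larger weight
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    return max(votes, key=votes.get)

X = [[1, 1], [2, 2], [8, 8], [9, 9]]
y = ["a", "a", "b", "b"]
print(weighted_knn_predict(X, y, [3, 3], k=3))  # 'a': the two nearby 'a' points outweigh the 'b'
```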
Definition of Nearest Neighbor
Slide by Hinton
Nearest Neighbor Classification…
Determining the value of k
In typical applications k is in units or tens rather than in
hundreds or thousands
Higher values of k provide smoothing that reduces the
risk of overfitting due to noise in the training data
Value of k can be chosen based on error rate measures
We should also avoid the over-smoothing caused by choosing k = n,
where n is the total number of tuples in the training data set
(this always predicts the majority class)
Determining the value of k
Given training examples (x_1, y_1), …, (x_N, y_N)
Use N-fold (leave-one-out) cross-validation
Search over K = (1,2,3,…,Kmax). Choose search size Kmax
based on compute constraints
Calculate the average error for each K:
Calculate the predicted class ŷ_i for each training point (x_i, y_i), i = 1, …, N
(using all other points to build the model)
Average the errors over all training examples
Pick K to minimize the cross validation error
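A self-contained sketch of this leave-one-out procedure for choosing K (the helper names and the toy data are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X, y, x, k):
    dists = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y[nearest]).most_common(1)[0][0]

def choose_k(X, y, k_max):
    """Pick K by leave-one-out cross-validation (N-fold with N = number of examples)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = len(y)
    errors = {}
    for k in range(1, k_max + 1):
        wrong = 0
        for i in range(n):
            mask = np.arange(n) != i          # hold out example i
            pred = knn_predict(X[mask], y[mask], X[i], k)
            wrong += pred != y[i]
        errors[k] = wrong / n                 # average error for this K
    return min(errors, key=errors.get), errors

# Hypothetical two-cluster data
X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = ["a", "a", "a", "b", "b", "b"]
best_k, errors = choose_k(X, y, k_max=4)
print(best_k, errors)
```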
Example
Step 1: Create multiple data sets D1, D2, …, Dt-1, Dt
Step 2: Build multiple classifiers C1, C2, …, Ct-1, Ct
Step 3: Combine the classifiers into C*
Why does it work?
Boosting
Random Forests
Bagging: Bootstrap AGGregatING
Bootstrap: data resampling
Generate multiple training sets
Resample the original training data
With replacement
Data sets have different “specious” patterns
Sampling with replacement
Each example has probability 1 − (1 − 1/n)^n (≈ 0.632 for large n) of being selected in a given bootstrap sample
Original Data 1 2 3 4 5 6 7 8 9 10
Bagging (Round 1) 7 8 10 8 2 5 10 10 5 9
Bagging (Round 2) 1 4 9 1 2 3 2 7 3 2
Bagging (Round 3) 1 8 5 10 5 5 9 6 3 7
Build classifier on each bootstrap sample
Specious patterns will not correlate
Underlying true pattern will be common to many
Combine the classifiers: Label new test examples by a majority vote
among classifiers
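A sketch of bagging, assuming scikit-learn is available and using a depth-1 decision tree as a stand-in base learner (the slides do not prescribe a particular base classifier); the data set is a hypothetical 1-D example:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # scikit-learn assumed for the base learner

def bagging_fit(X, y, n_estimators=10, seed=0):
    """Train one classifier per bootstrap sample (sampling with replacement)."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(y), size=len(y))   # bootstrap: n draws with replacement
        models.append(DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine the classifiers: majority vote over their predictions."""
    preds = np.array([m.predict(X) for m in models])        # shape: (n_estimators, n_samples)
    return np.array([np.bincount(col).argmax() for col in preds.T])

# Hypothetical 1-D data where a single depth-1 tree is a weak fit
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 1, 1, 0, 0, 1, 1])
models = bagging_fit(X, y, n_estimators=25)
print(bagging_predict(models, X))
```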
Boosting