Lecture Slides for
INTRODUCTION
TO
MACHINE
LEARNING
3RD EDITION
ETHEM ALPAYDIN
© The MIT Press, 2014
alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e
CHAPTER 8:
NONPARAMETRIC
METHODS
Nonparametric Estimation
Parametric (single global model), semiparametric
(small number of local models)
Nonparametric: Similar inputs have similar outputs
Functions (pdf, discriminant, regression) change
smoothly
Keep the training data; "let the data speak for
itself"
Given x, find a small number of closest training
instances and interpolate from these
Aka lazy/memory-based/case-based/instance-based learning
Density Estimation
Given the training set $X=\{x^t\}_{t=1}^{N}$ drawn iid from $p(x)$
Divide data into bins of size h
Histogram:
$$\hat{p}(x) = \frac{\#\{x^t \text{ in the same bin as } x\}}{Nh}$$
Naive estimator:
$$\hat{p}(x) = \frac{\#\{x - h < x^t \le x + h\}}{2Nh}$$
or
$$\hat{p}(x) = \frac{1}{Nh}\sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right), \qquad
w(u) = \begin{cases} 1/2 & \text{if } |u| < 1 \\ 0 & \text{otherwise} \end{cases}$$
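Below is a minimal sketch (not from the slides) of the naive estimator in Python, assuming one-dimensional data and the window half-width h defined above:

```python
import numpy as np

# Illustrative sketch: naive density estimator
# p_hat(x) = #{x - h < x^t <= x + h} / (2 N h)
def naive_estimator(x, data, h):
    """Estimate p(x) by counting training points within a window of half-width h."""
    data = np.asarray(data)
    N = len(data)
    count = np.sum((data > x - h) & (data <= x + h))
    return count / (2 * N * h)

# Example: estimate the density of a standard normal sample at x = 0
rng = np.random.default_rng(0)
sample = rng.standard_normal(1000)
print(naive_estimator(0.0, sample, h=0.5))   # close to 1/sqrt(2*pi) ~ 0.40
```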
Kernel Estimator
Kernel function, e.g., Gaussian kernel:
$$K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left[-\frac{u^2}{2}\right]$$
Kernel estimator (Parzen windows)
$$\hat{p}(x) = \frac{1}{Nh}\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)$$
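A corresponding sketch of the Parzen-window estimator with a Gaussian kernel (illustrative, not the book's code):

```python
import numpy as np

# Illustrative sketch: Parzen-window / kernel density estimate
# p_hat(x) = (1 / (N h)) * sum_t K((x - x^t) / h) with a Gaussian kernel
def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kernel_estimator(x, data, h):
    data = np.asarray(data)
    N = len(data)
    return np.sum(gaussian_kernel((x - data) / h)) / (N * h)

rng = np.random.default_rng(0)
sample = rng.standard_normal(1000)
print(kernel_estimator(0.0, sample, h=0.3))   # ~0.4 for a standard normal
```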
k-Nearest Neighbor Estimator
Instead of fixing the bin width h and counting the number of instances in it, fix the number of nearest instances (neighbors) k and compute the corresponding bin width
$$\hat{p}(x) = \frac{k}{2N d_k(x)}$$
where $d_k(x)$ is the distance to the kth closest training instance to x
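A minimal sketch of the k-NN density estimate for one-dimensional data (illustrative only; the choice of k is an assumption):

```python
import numpy as np

# Illustrative sketch: k-NN density estimate
# p_hat(x) = k / (2 N d_k(x)), where d_k(x) is the distance to the kth nearest training point
def knn_estimator(x, data, k):
    data = np.asarray(data)
    N = len(data)
    d_k = np.sort(np.abs(data - x))[k - 1]   # distance to the kth closest instance
    return k / (2 * N * d_k)

rng = np.random.default_rng(0)
sample = rng.standard_normal(1000)
print(knn_estimator(0.0, sample, k=30))
```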
Multivariate Data
Kernel density estimator
$$\hat{p}(x) = \frac{1}{Nh^d}\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)$$
Multivariate Gaussian kernel
spheric:
$$K(u) = \left(\frac{1}{\sqrt{2\pi}}\right)^{\!d} \exp\!\left[-\frac{\|u\|^2}{2}\right]$$
ellipsoid:
$$K(u) = \frac{1}{(2\pi)^{d/2}\,|S|^{1/2}} \exp\!\left[-\frac{1}{2}\, u^T S^{-1} u\right]$$
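An illustrative sketch of the multivariate estimator with the spheric Gaussian kernel (not from the slides):

```python
import numpy as np

# Illustrative sketch: multivariate kernel density estimate
# with a spheric Gaussian kernel, p_hat(x) = (1/(N h^d)) * sum_t K((x - x^t)/h)
def multivariate_kde(x, data, h):
    data = np.asarray(data)            # shape (N, d)
    N, d = data.shape
    u = (x - data) / h                 # scaled differences, shape (N, d)
    k = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi) ** (d / 2)
    return np.sum(k) / (N * h**d)

rng = np.random.default_rng(0)
sample = rng.standard_normal((1000, 2))
print(multivariate_kde(np.zeros(2), sample, h=0.5))   # roughly 1/(2*pi) at the mode
```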
Nonparametric Classification
Estimate p(x|Ci) and use Bayes’ rule
Kernel estimator
$$\hat{p}(x|C_i) = \frac{1}{N_i h^d}\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right) r_i^t, \qquad \hat{P}(C_i) = \frac{N_i}{N}$$
$$g_i(x) = \hat{p}(x|C_i)\,\hat{P}(C_i) = \frac{1}{N h^d}\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right) r_i^t$$
k-NN estimator
$$\hat{p}(x|C_i) = \frac{k_i}{N_i V^k(x)}, \qquad \hat{p}(x) = \frac{k}{N V^k(x)}$$
$$\hat{P}(C_i|x) = \frac{\hat{p}(x|C_i)\,\hat{P}(C_i)}{\hat{p}(x)} = \frac{k_i}{k}$$
where $V^k(x)$ is the volume of the hypersphere centered at $x$ with radius $d_k(x)$, and $k_i$ of the $k$ nearest neighbors of $x$ belong to class $C_i$
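Since $\hat{P}(C_i|x) = k_i/k$, k-NN classification reduces to a majority vote among the k nearest neighbors. A minimal sketch (illustrative; Euclidean distance is an assumption):

```python
import numpy as np

# Illustrative sketch: k-NN classification.
# P_hat(C_i | x) = k_i / k, so predict the class with the most votes
# among the k nearest training instances.
def knn_classify(x, data, labels, k):
    data = np.asarray(data)                       # shape (N, d)
    labels = np.asarray(labels)                   # shape (N,)
    dists = np.linalg.norm(data - x, axis=1)      # distances to all instances
    nearest = np.argsort(dists)[:k]               # indices of the k closest instances
    votes = np.bincount(labels[nearest])          # k_i for each class i
    return np.argmax(votes)                       # class with the largest k_i / k

rng = np.random.default_rng(0)
X0 = rng.normal(loc=0.0, size=(50, 2))
X1 = rng.normal(loc=3.0, size=(50, 2))
X = np.vstack([X0, X1]); y = np.array([0]*50 + [1]*50)
print(knn_classify(np.array([2.5, 2.5]), X, y, k=5))   # expected: 1
```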
Condensed Nearest Neighbor
Time/space complexity of k-NN is O(N)
Find a subset Z of X that is small and is accurate in
classifying X (Hart, 1968)
$$E'(Z|X) = E(X|Z) + \lambda|Z|$$
Condensed Nearest Neighbor
Incremental algorithm: add an instance to Z only if it is misclassified by the current Z (see the sketch below)
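A sketch of the incremental condensing pass, under the assumption that 1-NN with Euclidean distance is used and instances are visited in random order:

```python
import numpy as np

# Illustrative sketch: condensed 1-NN (Hart, 1968).
# Add an instance to the stored subset Z only if the current Z misclassifies it,
# and repeat passes over X until no more instances are added.
def condense(X, y, rng=np.random.default_rng(0)):
    X, y = np.asarray(X), np.asarray(y)
    Z_idx = [0]                                   # start with one stored instance
    changed = True
    while changed:
        changed = False
        for i in rng.permutation(len(X)):         # pass over X in random order
            Z = X[Z_idx]
            nearest = Z_idx[np.argmin(np.linalg.norm(Z - X[i], axis=1))]
            if y[nearest] != y[i]:                # misclassified by current Z: store it
                Z_idx.append(i)
                changed = True
    return X[Z_idx], y[Z_idx]
```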
Distance-based Classification
Find a distance function $D(x^r, x^s)$ such that if $x^r$ and $x^s$ belong to the same class, the distance is small, and if they belong to different classes, the distance is large
Assume a parametric model and learn its parameters using data, e.g., a quadratic distance parameterized by a positive semidefinite matrix M:
$$D(x^r, x^s \mid M) = (x^r - x^s)^T M (x^r - x^s)$$
Learning a Distance Function
The three-way relationship between distances,
dimensionality reduction, and feature extraction.
$M = L^T L$ is $d \times d$ and $L$ is $k \times d$
Similarity-based representation using similarity
scores
Large-margin nearest neighbor (chapter 13)
Figure: Euclidean distance (circle) is not suitable; Mahalanobis distance using an M (ellipse) is suitable. After the data is projected along L, Euclidean distance can be used.
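A small sketch showing that the learned quadratic distance with $M = L^T L$ equals the squared Euclidean distance after projecting along L (the matrices below are random placeholders, not learned):

```python
import numpy as np

# Illustrative sketch: quadratic distance D(x, y | M) = (x - y)^T M (x - y)
# with M = L^T L, which equals the squared Euclidean distance between L x and L y.
def quadratic_distance(x, y, M):
    d = x - y
    return d @ M @ d

rng = np.random.default_rng(0)
L = rng.standard_normal((2, 5))       # k x d projection matrix (k=2, d=5)
M = L.T @ L                           # d x d, positive semidefinite
x, y = rng.standard_normal(5), rng.standard_normal(5)
print(quadratic_distance(x, y, M))
print(np.sum((L @ x - L @ y) ** 2))   # same value: distance in the projected space
```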
Outlier Detection
Find outlier/novelty points
Not a two-class problem because outliers are very
few, of many types, and seldom labeled
Instead, one-class classification problem: Find
instances that have low probability
In nonparametric case: Find instances far away from
other instances
Local Outlier Factor
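As a rough sketch (not from the slides): LOF compares the local density around an instance with the local densities around its k nearest neighbors; a score well above 1 flags an outlier.

```python
import numpy as np

# Illustrative sketch: local outlier factor (LOF).
# An instance whose local density is much lower than that of its neighbors
# gets LOF >> 1 and is flagged as an outlier.
def lof_scores(X, k=5):
    X = np.asarray(X)
    N = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    np.fill_diagonal(D, np.inf)                                  # a point is not its own neighbor
    neighbors = np.argsort(D, axis=1)[:, :k]                     # k nearest neighbors
    k_dist = np.sort(D, axis=1)[:, k - 1]                        # distance to the kth neighbor

    # reachability distances and local reachability density (lrd)
    lrd = np.empty(N)
    for i in range(N):
        reach = np.maximum(k_dist[neighbors[i]], D[i, neighbors[i]])
        lrd[i] = 1.0 / np.mean(reach)

    # LOF: average ratio of the neighbors' lrd to the instance's own lrd
    return np.array([np.mean(lrd[neighbors[i]]) / lrd[i] for i in range(N)])

rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((100, 2)), [[8.0, 8.0]]])     # one far-away point
print(lof_scores(X, k=5)[-1])                                    # large score for the outlier
```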
Nonparametric Regression
Aka smoothing models
Regressogram
$$\hat{g}(x) = \frac{\sum_{t=1}^{N} b(x, x^t)\, r^t}{\sum_{t=1}^{N} b(x, x^t)}$$
where
$$b(x, x^t) = \begin{cases} 1 & \text{if } x^t \text{ is in the same bin as } x \\ 0 & \text{otherwise} \end{cases}$$
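A minimal regressogram sketch for one-dimensional inputs (the bin origin and width are assumptions):

```python
import numpy as np

# Illustrative sketch: regressogram.
# g_hat(x) = average of r^t over the training instances x^t in the same bin as x.
def regressogram(x, data_x, data_r, origin=0.0, h=0.5):
    data_x, data_r = np.asarray(data_x), np.asarray(data_r)
    same_bin = np.floor((data_x - origin) / h) == np.floor((x - origin) / h)
    if not np.any(same_bin):
        return np.nan                      # empty bin: estimate undefined
    return np.mean(data_r[same_bin])

rng = np.random.default_rng(0)
xs = rng.uniform(0, 5, 200)
rs = np.sin(xs) + rng.normal(scale=0.1, size=200)
print(regressogram(1.0, xs, rs, h=0.5))    # average r^t over the bin containing x = 1.0
```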
Running Mean/Kernel Smoother
Running mean smoother:
$$\hat{g}(x) = \frac{\sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right) r^t}{\sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right)}, \qquad
w(u) = \begin{cases} 1 & \text{if } |u| < 1 \\ 0 & \text{otherwise} \end{cases}$$
Kernel smoother:
$$\hat{g}(x) = \frac{\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right) r^t}{\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)}$$
where K(·) is Gaussian
Additive models (Hastie and Tibshirani, 1990)
Running line smoother
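A sketch of the Gaussian kernel smoother (illustrative; the bandwidth h is an assumption):

```python
import numpy as np

# Illustrative sketch: Gaussian kernel smoother.
# g_hat(x) = weighted average of r^t with weights K((x - x^t)/h).
def kernel_smoother(x, data_x, data_r, h):
    data_x, data_r = np.asarray(data_x), np.asarray(data_r)
    w = np.exp(-0.5 * ((x - data_x) / h) ** 2)    # Gaussian weights
    return np.sum(w * data_r) / np.sum(w)

rng = np.random.default_rng(0)
xs = rng.uniform(0, 5, 200)
rs = np.sin(xs) + rng.normal(scale=0.1, size=200)
print(kernel_smoother(1.0, xs, rs, h=0.25))       # close to sin(1.0) ~ 0.84
```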
How to Choose k or h ?
When k or h is small, single instances matter; bias is
small, variance is large (undersmoothing): High
complexity
As k or h increases, we average over more instances
and variance decreases but bias increases
(oversmoothing): Low complexity
Cross-validation is used to fine-tune k or h.
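A sketch of choosing h for a kernel smoother by k-fold cross-validation (illustrative; the candidate grid and fold count are assumptions):

```python
import numpy as np

# Illustrative sketch: choose the smoother width h by k-fold cross-validation,
# keeping the h with the smallest average validation squared error.
def smooth(x, tx, tr, h):
    w = np.exp(-0.5 * ((x - tx) / h) ** 2)        # Gaussian kernel smoother
    return np.sum(w * tr) / np.sum(w)

def cv_choose_h(xs, rs, candidates, n_folds=5, seed=0):
    xs, rs = np.asarray(xs), np.asarray(rs)
    idx = np.random.default_rng(seed).permutation(len(xs))
    folds = np.array_split(idx, n_folds)
    errors = []
    for h in candidates:
        errs = []
        for f in folds:
            train = np.setdiff1d(idx, f)          # training indices for this fold
            preds = np.array([smooth(x, xs[train], rs[train], h) for x in xs[f]])
            errs.append(np.mean((rs[f] - preds) ** 2))
        errors.append(np.mean(errs))
    return candidates[int(np.argmin(errors))]

rng = np.random.default_rng(0)
xs = rng.uniform(0, 5, 200)
rs = np.sin(xs) + rng.normal(scale=0.1, size=200)
print(cv_choose_h(xs, rs, candidates=[0.1, 0.25, 0.5, 1.0]))
```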