Machine learning
Instance Based Learning
Hamid Beigy
Sharif University of Technology
November 14, 2021
Table of contents
1. Introduction
2. Nearest neighbor algorithms
3. Distance-weighted nearest neighbor algorithms
4. Locally weighted regression
5. Finding KNN(x) efficiently
6. Reading
Introduction
Introduction
1. The methods described before, such as decision trees, first find a hypothesis and then use this hypothesis to classify new test examples.
2. These methods are called eager learning methods.
3. Instance-based learning algorithms such as k-NN store all of the training examples and classify a new example x by finding the training example (x_i, y_i) that is nearest to x according to some distance metric.
4. Instance-based classifiers do not explicitly compute decision boundaries; however, the boundaries form a subset of the Voronoi diagram of the training data.
Nearest neighbor algorithms
Nearest neighbor algorithms
1. Fix k ≥ 1 and let a labeled sample be given,
S = \{(x_1, t_1), \dots, (x_N, t_N)\},
where t_i ∈ {0, 1}. For every test example x, k-NN returns the hypothesis h defined by (see the sketch below)
h(x) = \mathbb{1}\left[\sum_{i:\, t_i = 1} w_i > \sum_{i:\, t_i = 0} w_i\right],
where the weights w_1, \dots, w_N are chosen such that w_i = 1/k if x_i is among the k nearest neighbors of x and w_i = 0 otherwise.
2. The decision boundaries form a subset of the Voronoi diagram of the training data.
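A minimal sketch of this voting rule (not from the slides; NumPy and Euclidean distance assumed, function names illustrative):

```python
import numpy as np

def knn_classify(X_train, t_train, x_query, k=3):
    """Plain k-NN vote: returns 1 iff the k nearest neighbors
    contain more examples with t_i = 1 than with t_i = 0."""
    # Euclidean distances from the query to every stored example.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training examples.
    nearest = np.argsort(dists)[:k]
    # Each of the k neighbors gets weight 1/k; every other example gets weight 0.
    votes_for_1 = np.sum(t_train[nearest] == 1) / k
    votes_for_0 = np.sum(t_train[nearest] == 0) / k
    return int(votes_for_1 > votes_for_0)

# Toy usage
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
t = np.array([0, 0, 1, 1])
print(knn_classify(X, t, np.array([0.95, 0.9]), k=3))  # -> 1
```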
Nearest neighbor algorithms
1. The k-NN algorithm only requires
- An integer k.
- A set of labeled examples S.
- A metric to measure closeness.
2. For all points x, y, z, a metric d must satisfy the following properties (checked numerically in the sketch below).
- Non-negativity: d(x, y) ≥ 0.
- Reflexivity: d(x, y) = 0 ⇔ x = y.
- Symmetry: d(x, y) = d(y, x).
- Triangle inequality: d(x, y) + d(y, z) ≥ d(x, z).
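An illustrative check of these four axioms for the Euclidean distance on a few random points (the choice of metric and the random data are assumptions for the example):

```python
import numpy as np

def euclidean(x, y):
    return np.linalg.norm(x - y)

rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 5))        # three random 5-dimensional points

assert euclidean(x, y) >= 0                                   # non-negativity
assert np.isclose(euclidean(x, x), 0)                         # reflexivity: d(x, x) = 0
assert np.isclose(euclidean(x, y), euclidean(y, x))           # symmetry
assert euclidean(x, y) + euclidean(y, z) >= euclidean(x, z)   # triangle inequality
print("Euclidean distance satisfies the metric axioms on this sample.")
```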
Distance functions
1. The Minkowski distance for D-dimensional examples is the L_p norm:
L_p(x, y) = \left( \sum_{i=1}^{D} |x_i - y_i|^p \right)^{1/p}
2. The Euclidean distance is the L_2 norm:
L_2(x, y) = \left( \sum_{i=1}^{D} |x_i - y_i|^2 \right)^{1/2}
3. The Manhattan or city-block distance is the L_1 norm:
L_1(x, y) = \sum_{i=1}^{D} |x_i - y_i|
4. The L_\infty norm is the maximum of the distances along the individual coordinate axes:
L_\infty(x, y) = \max_i |x_i - y_i|
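A sketch of these distance functions in Python (NumPy assumed; names are illustrative):

```python
import numpy as np

def minkowski(x, y, p):
    """L_p distance: (sum_i |x_i - y_i|^p)^(1/p)."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def euclidean(x, y):        # L_2
    return minkowski(x, y, 2)

def manhattan(x, y):        # L_1 (city block)
    return np.sum(np.abs(x - y))

def chebyshev(x, y):        # L_inf: maximum coordinate-wise difference
    return np.max(np.abs(x - y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.5])
print(euclidean(x, y), manhattan(x, y), chebyshev(x, y))
# -> 2.291..., 3.5, 2.0
```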
Nearest neighbor algorithm for regression
1. The k-NN algorithm can be adapted to approximate a continuous-valued target function.
2. We compute the mean of the k nearest training examples rather than taking a majority vote (see the sketch below):
\hat{f}(x) = \frac{\sum_{i=1}^{k} f(x_i)}{k}
3. The effect of k on the performance of the algorithm.
[Figure: k-NN behavior for regression around a query point, for different values of k. Pictures are taken from P. Rai's slides.]
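A minimal sketch of this averaging rule, assuming NumPy and Euclidean distance (names are illustrative):

```python
import numpy as np

def knn_regress(X_train, f_train, x_query, k=3):
    """k-NN regression: average of the target values of the k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return f_train[nearest].mean()

# Toy usage on a noiseless sine curve (illustrative data)
X = np.linspace(0, 1, 11).reshape(-1, 1)
f = np.sin(2 * np.pi * X).ravel()
print(knn_regress(X, f, np.array([0.33]), k=3))
```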
Nearest neighbor algorithms
1. The k-NN algorithm is a lazy learning algorithm.
- It defers finding a hypothesis until a test example x arrives.
- For a test example x, it uses the stored training data directly.
- It discards the found hypothesis and any intermediate results afterwards.
2. This strategy is opposed to an eager learning algorithm, which
- finds a hypothesis h using the training set, and
- uses the found hypothesis h to classify every test example x.
3. Trade-offs
- During the training phase, lazy algorithms have lower computational costs than eager algorithms.
- During the testing phase, lazy algorithms have greater storage requirements and higher computational costs.
4. What is the inductive bias of k-NN?
Properties of nearest neighbor algorithms
1. Advantages
- Analytically tractable.
- Simple implementation.
- Uses local information, which results in highly adaptive behavior.
- Parallel implementation is very easy.
- Nearly optimal in the large-sample limit (N → ∞):
E(Bayes) ≤ E(NN) ≤ 2 × E(Bayes).
2. Disadvantages
- Large storage requirements.
- High computational cost during testing.
- Highly susceptible to irrelevant features.
3. Large values of k
- Result in smoother decision boundaries.
- Provide more accurate probabilistic information.
4. But large values of k also
- Increase the computational cost.
- Destroy the locality of the estimation.
Distance-weighted nearest neighbor algorithms
Distance-weighted nearest neighbor algorithms
1. One refinement of k-NN is to weight the contribution of each of the k neighbors according to its distance to the query point x (see the sketch below).
2. For two-class classification,
h(x) = \mathbb{1}\left[\sum_{i:\, t_i = 1} w_i > \sum_{i:\, t_i = 0} w_i\right],
where
w_i = \frac{1}{d(x, x_i)^2}
3. For C-class classification,
h(x) = \operatorname*{argmax}_{c \in C} \sum_{i=1}^{k} w_i \, \delta(c, t_i)
4. For regression,
\hat{f}(x) = \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}
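A sketch of the distance-weighted rules above, assuming NumPy, Euclidean distance, and a small epsilon to guard against zero distances (names are illustrative):

```python
import numpy as np

def weighted_knn_regress(X_train, f_train, x_query, k=5, eps=1e-12):
    """Distance-weighted k-NN regression with w_i = 1 / d(x, x_i)^2."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    w = 1.0 / (dists[nearest] ** 2 + eps)   # eps avoids division by zero when d = 0
    return np.sum(w * f_train[nearest]) / np.sum(w)

def weighted_knn_classify(X_train, t_train, x_query, classes, k=5, eps=1e-12):
    """Distance-weighted k-NN classification: argmax_c sum_i w_i * delta(c, t_i)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    w = 1.0 / (dists[nearest] ** 2 + eps)
    scores = {c: np.sum(w[t_train[nearest] == c]) for c in classes}
    return max(scores, key=scores.get)
```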
Locally weighted regression
Locally weighted regression
1. In locally weighted regression (LWR), we use a linear model to do the local approximation \hat{f}:
\hat{f}(x) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_D x_D
2. Suppose we aim to minimize the total squared error:
E = \frac{1}{2} \sum_{x \in S} (f(x) - \hat{f}(x))^2
3. Using gradient descent,
\Delta w_j = \eta \sum_{x \in S} (f(x) - \hat{f}(x)) \, x_j
where η is a small number (the learning rate).
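A minimal gradient-descent sketch for this global linear fit (illustrative names; the bias term is handled by prepending x_0 = 1, and the step is rescaled by 1/N purely for numerical stability):

```python
import numpy as np

def fit_linear_gd(X, f, eta=0.1, n_iters=5000):
    """Batch gradient descent for f_hat(x) = w0 + w1*x1 + ... + wD*xD,
    minimizing E = 1/2 * sum_x (f(x) - f_hat(x))^2."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x_0 = 1 for w_0
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        residual = f - Xb @ w                       # f(x) - f_hat(x) for every x in S
        w += (eta / len(f)) * Xb.T @ residual       # Delta w_j = eta * sum_x residual * x_j
        # (the 1/N factor only rescales eta so the step stays stable)
    return w

# Toy usage: recover w0 = 1, w1 = 2 from noiseless data
X = np.linspace(0, 1, 20).reshape(-1, 1)
f = 1.0 + 2.0 * X.ravel()
print(fit_linear_gd(X, f))   # approximately [1.0, 2.0]
```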
Locally weighted regression i
1. How shall we modify this procedure to derive a local approximation rather than a global one?
2. The simplest way is to redefine the error criterion E to emphasize fitting the local training examples.
3. Three possible criteria are given below. Note that we write the error E(x_q) to emphasize the fact that the error is now defined as a function of the query point x_q.
- Minimize the squared error over just the k nearest neighbors:
E_1(x_q) = \frac{1}{2} \sum_{x \in KNN(x_q)} (f(x) - \hat{f}(x))^2
- Minimize the squared error over the entire set S of training examples, while weighting the error of each training example by some decreasing function K of its distance from x_q:
E_2(x_q) = \frac{1}{2} \sum_{x \in S} (f(x) - \hat{f}(x))^2 \, K(d(x_q, x))
- Combine 1 and 2:
E_3(x_q) = \frac{1}{2} \sum_{x \in KNN(x_q)} (f(x) - \hat{f}(x))^2 \, K(d(x_q, x))
Locally weighted regression ii
4. If we choose criterion (3) and re-derive the gradient descent rule, we obtain (a code sketch follows this slide)
\Delta w_j = \eta \sum_{x \in KNN(x_q)} K(d(x_q, x)) \, (f(x) - \hat{f}(x)) \, x_j
where η is a small number (the learning rate).
5. Criterion (2) is perhaps the most esthetically pleasing because it allows every training
example to have an impact on the classification of xq .
6. However, this approach requires computation that grows linearly with the number of
training examples.
7. Criterion (3) is a good approximation to criterion (2) and has the advantage that
computational cost is independent of the total number of training examples; its cost
depends only on the number k of neighbors considered.
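A sketch of criterion (3) at a single query point, assuming NumPy, a Gaussian kernel for K, and gradient descent on the kernel-weighted error over the k nearest neighbors only (names, kernel width, and learning rate are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def gaussian_kernel(d, width=0.2):
    return np.exp(-(d ** 2) / (2 * width ** 2))

def lwr_predict(X, f, x_query, k=8, eta=0.1, n_iters=2000, width=0.2):
    """Locally weighted regression at x_q using criterion (3): gradient descent
    on the kernel-weighted squared error over the k nearest neighbors only."""
    dists = np.linalg.norm(X - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    Xn = np.hstack([np.ones((k, 1)), X[nearest]])   # local design matrix with x_0 = 1
    fn = f[nearest]
    Kw = gaussian_kernel(dists[nearest], width)     # K(d(x_q, x)) for each neighbor
    w = np.zeros(Xn.shape[1])
    for _ in range(n_iters):
        residual = fn - Xn @ w
        w += (eta / k) * Xn.T @ (Kw * residual)     # Delta w_j = eta * sum K * residual * x_j
    xq = np.concatenate([[1.0], np.atleast_1d(x_query)])
    return xq @ w

# Toy usage on a noisy sine curve (illustrative data)
rng = np.random.default_rng(0)
X = np.linspace(0, 1, 50).reshape(-1, 1)
f = np.sin(2 * np.pi * X).ravel() + 0.1 * rng.normal(size=50)
print(lwr_predict(X, f, np.array([0.3]), k=8))      # roughly sin(0.6*pi) ~ 0.95
```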
Finding KNN(x) efficiently
" the
Bucketing (a.k.a Elias’s algorithm) [Welch 1971] cell exceeds the distance to the closest point already visited
k-d trees [Bentley, 1975; Friedman et al, 1977]
Finding
nearest
!
neighbor KNN(x)
"
Bucketing
efficiently
" In the Bucketing algorithm, the space is divided into identical cells and for
each cell the data points inside it are stored in a list
" The cells are examined in order of increasing distance from the query point
and for each cell the distance is computed between its internal data points
"
1. How efficiently find KNN(x)?
and the query point
The search terminates when the distance from the query point to the cell
2. Tree-based data structures: pre-processing.
exceeds the distance to the closest point already visited
! k-d trees
" A k-d3.
tree Often kd-trees
is a generalization of a binary(k-dimensional trees) used in applications.
search tree in high dimensions
! Each internal node in a k-d tree is associated with a hyper-rectangle and a hyper-plane orthogonal to one of the
4.coordinate
!
A kd-tree
axis is a generalization of binary tree in high dimensions
The hyper-plane splits the hyper-rectangle into two parts, which are associated with the child nodes
The partitioning process goes on until the number of data points in the hyper-rectangle falls below some given threshold
"
!
4.1 Each internal node is associated with a hyper-rectangle and the hyper-plans is orthogonal to
The effect of a k-d tree is to partition the (multi-dimensional) sample space according to the underlying
distribution of the data, the partitioning being finer in regions where the density of data points is higher
one of its coordinates.
For a given query point, the algorithm works by first descending the the tree to find the data points lying in the cell that
!
contains the query point
Then 4.2
! The
it examines hyper-plan
surrounding splits
cells if they overlap the ballthe hyper-rectangle
centered to data
at the query point and the closest two pointparts,
so far which are associated with the child
nodes.
Introduction to Pattern Analysis KD-tree construction
4.3 The partitioning goes on until the number of data points in the hyper-plane falls below some
Ricardo Gutierrez-Osuna
19
Texas A&M University
high dimensions given threshold.
er-rectangle and a
hich are associated
X Y
ata points in the .15 .1
.03 .55
ulti-dimensional) .95 .1
on of the data,
y of data points ... ...
ing the the tree to
oint
ntered at the 5. Splitting
query
Start with a list of n-dimensional points
order : Widest first
6. Splitting value : Median
7. Stop condition : fewer than a threshold or box hit some minimum width.
13/18
kd-tree
1. Initial data set: start with a list of n-dimensional points, e.g.
X     Y
.15   .1
.03   .55
.95   .1
...   ...
2. After the first split (on X > .5):
No branch:  (.15, .1), (.03, .55), ...
Yes branch: (.95, .1), ...
3. Consider each group separately and possibly split again.
kd-tree
1. After the second split:
X > .5
No  -> Y > .5
         No:  (.15, .1), ...
         Yes: (.03, .55), ...
Yes -> (.95, .1), ...
2. Final split: consider each group separately and possibly split again (along the same or a different dimension).
3. Keep one additional piece of information at each node: the median value of the split dimension for its points.
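A compact construction sketch following this recipe (widest dimension first, median splitting value, leaf-size stop condition); the node class and names are illustrative, not a reference implementation:

```python
import numpy as np

class KDNode:
    def __init__(self, points=None, dim=None, median=None, left=None, right=None):
        self.points = points      # only leaves store points
        self.dim = dim            # splitting dimension (widest spread)
        self.median = median      # splitting value (median along that dimension)
        self.left = left          # points with x[dim] <= median
        self.right = right        # points with x[dim] > median

def build_kdtree(points, leaf_size=5):
    if len(points) <= leaf_size:                    # stop condition: few enough points
        return KDNode(points=points)
    spread = points.max(axis=0) - points.min(axis=0)
    dim = int(np.argmax(spread))                    # splitting order: widest dimension first
    median = float(np.median(points[:, dim]))       # splitting value: the median
    left_mask = points[:, dim] <= median
    if left_mask.all() or not left_mask.any():      # degenerate split -> make a leaf
        return KDNode(points=points)
    return KDNode(dim=dim, median=median,
                  left=build_kdtree(points[left_mask], leaf_size),
                  right=build_kdtree(points[~left_mask], leaf_size))

rng = np.random.default_rng(0)
tree = build_kdtree(rng.random((200, 2)), leaf_size=10)
```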
Nearest neighbor with kd-tree
1. Traverse the tree looking for the nearest neighbor of the query point.
2. Examine nearby points first: explore the branch of the tree that is closest to the query point first.
3. The algorithm first descends the tree to find the data points lying in the cell that contains the query point; it then examines surrounding cells if they overlap the ball centered at the query point with radius equal to the distance to the closest data point found so far.
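In practice one usually relies on a library kd-tree. The sketch below assumes SciPy is available and uses scipy.spatial.KDTree, whose query method performs this kind of descend-then-backtrack nearest-neighbor search:

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
X = rng.random((1000, 3))                 # 1000 stored examples in 3 dimensions
tree = KDTree(X)                          # pre-processing: build the kd-tree once

x_query = np.array([0.5, 0.5, 0.5])
dists, idx = tree.query(x_query, k=5)     # KNN(x): the 5 nearest neighbors of the query
print(idx, dists)

# Brute-force check that the tree returns the true nearest neighbors
brute = np.argsort(np.linalg.norm(X - x_query, axis=1))[:5]
assert set(idx) == set(brute)
```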
Reading
Readings
1. Chapter 8 of the book Machine Learning (Mitchell 1997).
References i
Mitchell, Tom M. (1997). Machine Learning. McGraw-Hill.
Questions?