Lecture#2. K Nearest Neighbors
Introduction
KNN is a non-parametric supervised learning technique in which we classify a data point into a
given category with the help of a training set.
“A supervised machine learning algorithm (as opposed to an unsupervised machine learning algorithm)
is one that relies on labeled input data to learn a function that produces an appropriate output when
given new unlabeled data”.
K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based
on a similarity measure (e.g., a distance function). KNN has been used as a non-parametric technique in
statistical estimation and pattern recognition since the early 1970s.
A case is classified by a majority vote of its neighbors, with the case being assigned to the class most
common amongst its K nearest neighbors measured by a distance function. If K = 1, then the case is
simply assigned to the class of its nearest neighbor.
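To make the majority-vote rule concrete, here is a minimal sketch of a KNN classifier in Python; the function and variable names are illustrative rather than taken from any particular library:

    import math
    from collections import Counter

    def euclidean(a, b):
        """Euclidean distance between two equal-length numeric vectors."""
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def knn_predict(train_X, train_y, query, k):
        """Classify `query` by a majority vote of its k nearest neighbors."""
        # Pair each training case with its distance to the query point.
        dists = [(euclidean(x, query), label) for x, label in zip(train_X, train_y)]
        # Keep the k cases with the smallest distances.
        nearest = sorted(dists, key=lambda t: t[0])[:k]
        # Majority vote among the k labels; for k = 1 this is just the
        # label of the single nearest neighbor.
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]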
Standardization formulas (these will be used again in the examples below):

$X_s = \dfrac{X - \text{mean}}{\text{s.d.}}$  (z-score)

$X_s = \dfrac{X - \text{mean}}{\max - \min}$

$X_s = \dfrac{X - \min}{\max - \min}$  (min-max scaling, into [0, 1])

where the sample standard deviation is

$\text{s.d.} = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
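These formulas translate directly into code; a quick sketch in Python (the function names are illustrative):

    import math

    def sample_sd(xs):
        """s.d. = sqrt(sum((x_i - mean)^2) / (n - 1))."""
        mean = sum(xs) / len(xs)
        return math.sqrt(sum((x - mean) ** 2 for x in xs) / (len(xs) - 1))

    def z_score(x, xs):
        """X_s = (X - mean) / s.d."""
        return (x - sum(xs) / len(xs)) / sample_sd(xs)

    def mean_range(x, xs):
        """X_s = (X - mean) / (max - min)."""
        return (x - sum(xs) / len(xs)) / (max(xs) - min(xs))

    def min_max(x, xs):
        """X_s = (X - min) / (max - min); maps the sample into [0, 1]."""
        return (x - min(xs)) / (max(xs) - min(xs))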
Distance Functions
There are many distance functions, but Euclidean is the most commonly used measure; it is mainly used
when the data are continuous. Manhattan distance is also very common for continuous variables, and the
Minkowski distance generalizes both:

Euclidean: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

Manhattan: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$

Minkowski: $d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$

It should also be noted that all three distance measures are only valid for continuous variables. For
categorical variables, the Hamming distance (the number of coordinates at which two cases differ) must
be used instead. When the dataset mixes numerical and categorical variables, this also raises the issue
of standardizing the numerical variables between 0 and 1.
The idea behind using a distance measure is to find the distance (similarity) between the new sample and
the training cases, and then find the k closest cases, for instance, the k customers closest to a new
customer in terms of height and weight. A sketch of these distance functions is shown below.
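Here is a minimal sketch of these distance functions in Python (illustrative, not tied to any library):

    def euclidean(a, b):
        """Square root of the sum of squared coordinate differences."""
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def manhattan(a, b):
        """Sum of absolute coordinate differences."""
        return sum(abs(x - y) for x, y in zip(a, b))

    def minkowski(a, b, p):
        """Generalizes Manhattan (p = 1) and Euclidean (p = 2)."""
        return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

    def hamming(a, b):
        """Number of positions at which two categorical vectors differ."""
        return sum(x != y for x, y in zip(a, b))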
Example 1
Consider the following data concerning credit default. Age and Loan are two numerical variables
(predictors) and Default is the target. This constitutes the training data. We can now use the training set
to classify an unknown case (Age=48 and Loan=$142,000) using Euclidean distance. There are 13
observations.
Using the Euclidean distance formula, $D = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$, the nearest neighbor
for K = 1 is the last case in the training set (Age = 33, Loan = $150,000), which has Default = Y:
D = Sqrt[(48-33)^2 + (142000-150000)^2] = 8000.01 >> Default=Y
With K = 3, two of the three closest neighbors have Default = Y and one has Default = N, so the
prediction for the unknown case is again Default = Y.
Using the standardized distance on the same training set, the unknown case is assigned a different
nearest neighbor: observation no. 3 now has the smallest distance, so the K = 1 prediction changes. That
the answer depends on the scale of the predictors is not a good sign for the robustness of the
unstandardized approach.
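Since the full 13-row training table is not reproduced here, the following sketch only verifies the single distance worked out above (values taken from the text) and shows why the raw distance is fragile:

    import math

    # Unknown case from the text: Age = 48, Loan = $142,000. Its nearest
    # neighbor under the raw Euclidean distance (from the calculation
    # above) has Age = 33 and Loan = $150,000.
    age_diff = 48 - 33            # 15
    loan_diff = 142000 - 150000   # -8000

    d = math.sqrt(age_diff ** 2 + loan_diff ** 2)
    print(round(d, 2))  # 8000.01, matching the worked example

    # Loan differences dwarf age differences, so Loan alone effectively
    # determines the ranking; this is why standardizing the predictors
    # changes the K = 1 result on this data.
    print(abs(loan_diff) / abs(age_diff))  # ~533 times larger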
Example 2
Suppose we have the height, weight, and T-shirt size of some customers, and we need to predict the
T-shirt size of a new customer given only the height and weight information we have. The data, including
height, weight, and T-shirt size, are shown below; there are 18 observations.
The new customer, named 'Monica', has height 161 cm and weight 61 kg.
Height (cm)  Weight (kg)  T-Shirt Size
158 58 M
158 59 M
158 63 M
160 59 M
160 60 M
163 60 M
163 61 M
160 64 L
163 64 L
165 61 L
165 62 L
165 65 L
168 62 L
168 63 L
168 66 L
170 63 L
170 64 L
170 68 L
The Euclidean distance between the first observation and the new observation (Monica) is as follows:
sqrt((161-158)^2 + (61-58)^2) = 4.2426
Similarly, we calculate the distance from every training case to the new case and rank the cases by that
distance. The smallest distance receives rank 1 and is treated as the nearest neighbor.
Let k be 5. The algorithm then searches for the 5 customers closest to Monica, i.e., most similar to
Monica in terms of attributes, and checks which categories those 5 customers fall into. If 4 of them have
a 'Medium' T-shirt size and 1 has a 'Large' T-shirt size, the best guess for Monica is 'Medium', as the
script below confirms.
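The whole procedure can be verified end to end with a short script over the 18 training cases (a sketch; the variable names are illustrative):

    import math
    from collections import Counter

    # Training data from the table above: (height_cm, weight_kg, size).
    customers = [
        (158, 58, "M"), (158, 59, "M"), (158, 63, "M"), (160, 59, "M"),
        (160, 60, "M"), (163, 60, "M"), (163, 61, "M"), (160, 64, "L"),
        (163, 64, "L"), (165, 61, "L"), (165, 62, "L"), (165, 65, "L"),
        (168, 62, "L"), (168, 63, "L"), (168, 66, "L"), (170, 63, "L"),
        (170, 64, "L"), (170, 68, "L"),
    ]
    monica = (161, 61)

    # Rank all training cases by Euclidean distance to Monica.
    ranked = sorted(
        customers,
        key=lambda c: math.hypot(c[0] - monica[0], c[1] - monica[1]),
    )

    # Majority vote among the k = 5 nearest neighbors.
    votes = Counter(size for _, _, size in ranked[:5])
    print(votes)                       # Counter({'M': 4, 'L': 1})
    print(votes.most_common(1)[0][0])  # 'M', i.e., Medium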
In the graph below, the binary dependent variable (T-shirt size) is displayed in blue and orange:
'Medium T-shirt size' is in blue and 'Large T-shirt size' is in orange. The new customer is shown as a
yellow circle. Four blue data points and one orange data point lie closest to the yellow circle, so the
prediction for the new case is the blue class, i.e., Medium T-shirt size.
Standardized Training Data Set
After standardization, the 5th closest neighbor changes, because height (which has the larger spread)
dominated the distance before standardization. Hence, it is important to standardize the predictors
before running the K-nearest-neighbor algorithm; the sketch below checks this on the T-shirt data.
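A quick check of this claim, assuming the z-score standardization identified below (a sketch reusing the same 18 cases):

    import math
    from collections import Counter

    # Same 18 training cases: (height_cm, weight_kg, size).
    customers = [
        (158, 58, "M"), (158, 59, "M"), (158, 63, "M"), (160, 59, "M"),
        (160, 60, "M"), (163, 60, "M"), (163, 61, "M"), (160, 64, "L"),
        (163, 64, "L"), (165, 61, "L"), (165, 62, "L"), (165, 65, "L"),
        (168, 62, "L"), (168, 63, "L"), (168, 66, "L"), (170, 63, "L"),
        (170, 64, "L"), (170, 68, "L"),
    ]
    monica = (161, 61)

    def sd(xs):
        """Sample standard deviation (n - 1 denominator)."""
        m = sum(xs) / len(xs)
        return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

    h_sd = sd([c[0] for c in customers])
    w_sd = sd([c[1] for c in customers])

    def z_distance(c):
        # Euclidean distance after z-scoring both predictors; the means
        # cancel when differencing, so only the s.d. rescales each axis.
        return math.hypot((c[0] - monica[0]) / h_sd,
                          (c[1] - monica[1]) / w_sd)

    ranked = sorted(customers, key=z_distance)
    print(ranked[:5])
    # The same four cases stay nearest (in a slightly different order),
    # but the 5th neighbor changes from (160, 64, 'L') to (165, 61, 'L').
    # The k = 5 vote is still 4 'M' to 1 'L', so Monica is still 'Medium'.
    print(Counter(size for _, _, size in ranked[:5]))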
(Table: original vs. standardized training data.)
Since Monica's standardized height would be (161 - 158)/(170 - 158) = 0.25 under min-max scaling, which
does not match the value in the standardized table, the formula $X_s = \dfrac{X - \min}{\max - \min}$
has NOT been used. Actually, the formula $X_s = \dfrac{X - \text{mean}}{\text{s.d.}}$ (the z-score) has
been used for standardizing the data.