Lecture#2. K Nearest Neighbors


K Nearest Neighbors (KNN)

Introduction
KNN is a non-parametric supervised learning technique in which we try to classify a data point into a given category with the help of a training set.

“A supervised machine learning algorithm (as opposed to an unsupervised machine learning algorithm)
is one that relies on labeled input data to learn a function that produces an appropriate output when
given new unlabeled data”.

K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). KNN has been used as a non-parametric technique in statistical estimation and pattern recognition since the beginning of the 1970s.

A case is classified by a majority vote of its neighbors, with the case being assigned to the class most
common amongst its K nearest neighbors measured by a distance function. If K = 1, then the case is
simply assigned to the class of its nearest neighbor.
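As a concrete illustration of this voting procedure, here is a minimal from-scratch sketch; the toy data and the function name are illustrative assumptions, not part of the lecture:

```python
from collections import Counter
import math

def knn_classify(training_points, training_labels, new_point, k=3):
    """Classify new_point by a majority vote of its k nearest neighbors."""
    # Distance from the new case to every stored training case.
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(training_points, training_labels)
    ]
    # Keep the k closest cases and take a majority vote among their labels.
    k_nearest = sorted(distances, key=lambda d: d[0])[:k]
    votes = Counter(label for _, label in k_nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two features per case, two classes.
X_train = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
y_train = ["A", "A", "B", "B"]
print(knn_classify(X_train, y_train, (1.1, 0.9), k=3))  # -> "A"
```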

Can KNN be used for regression?


Yes, K-nearest neighbor can be used for regression. In other words, the K-nearest neighbor algorithm
can be applied when a dependent variable is continuous. In this case, the predicted value is the average
of the values of its k nearest neighbors (rather than the majority vote).
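A minimal sketch of this averaging behaviour, assuming scikit-learn and a made-up one-feature dataset (not from the lecture):

```python
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical training data with a continuous target.
X_train = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y_train = [1.1, 1.9, 3.2, 3.9, 5.1]

# The prediction is the average of the targets of the 3 nearest neighbors.
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X_train, y_train)
print(reg.predict([[2.2]]))  # ~2.07, the mean of 1.1, 1.9 and 3.2
```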

Pros and Cons of KNN


Pros
- Easy to understand
- No assumptions about the data
- Can be applied to both classification and regression
- Works easily on multi-class problems
Cons
- Memory intensive / computationally expensive
- Sensitive to the scale of the data
- Does not work well on a rare-event (skewed) target variable
- Struggles when there is a large number of independent variables
For any given problem, a small value of k will lead to a large variance in predictions. Alternatively,
setting k to a large value may lead to a large model bias.

Why is KNN non-parametric?


Non-parametric means not making any assumptions about the underlying data distribution. Non-parametric methods do not have a fixed number of parameters in the model. Similarly, in KNN the effective number of model parameters grows with the training data set: you can imagine each training case as a "parameter" in the model.
Choosing the Optimal Value for K
- Choosing the optimal value for K is best done by first inspecting the data.
- In general, a large K value is more precise as it reduces the overall noise, but there is no guarantee.
- Cross-validation is a smart way to find the optimal K value. It estimates the validation error rate by holding out a subset of the training set from the model-building process (a code sketch follows this list).
  - Cross-validation (say, 10-fold validation) involves randomly dividing the training set into 10 groups, or folds, of approximately equal size. 90% of the data is used to train the model and the remaining 10% to validate it. The misclassification rate is then computed on the 10% validation data. This procedure repeats 10 times, with a different group of observations treated as the validation set each time, resulting in 10 estimates of the validation error which are then averaged.
- Historically, the optimal K for most datasets has been between 3 and 10, which produces much better results than 1-NN.
- A low K value is sensitive to outliers, whereas a higher K value is more resilient to outliers because it considers more voters when deciding the prediction.
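The sketch below implements this search with scikit-learn's cross_val_score; the dataset (Iris) and the candidate range of K values are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Estimate 10-fold cross-validated accuracy for each candidate K.
scores = {}
for k in range(1, 21):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(model, X, y, cv=10).mean()

best_k = max(scores, key=scores.get)
print("best K:", best_k, "cross-validated accuracy:", round(scores[best_k], 3))
```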

Standardization (of Training Data Set)


When the independent variables in the training data are measured in different units, it is important to standardize the variables before calculating distance. For example, if one variable is based on height in cm and the other is based on weight in kg, then height will have more influence on the distance calculation. To make them comparable, we need to standardize them, which can be done by any of the following methods:

X_s = (X - mean) / s.d.

X_s = (X - mean) / (max - min)

X_s = (X - min) / (max - min)

Mean and Standard Deviation Formulas


mean: x-bar = (1/n) * Σ_{i=1..n} x_i

standard deviation: s = sqrt( Σ_{i=1..n} (x_i - x-bar)^2 / (n - 1) )

KNN vs. K-Means


Many people get confused between these two statistical techniques: K-means and K-nearest neighbors. Some of the differences are listed below.
- K-means is an unsupervised learning technique (there is no dependent variable), whereas KNN is a supervised learning algorithm (a dependent variable exists).
- K-means is a clustering technique that tries to split the data points into K clusters such that the points in each cluster tend to be near each other, whereas K-nearest neighbors tries to determine the classification of a point by combining the classifications of its K nearest points.

Distance Functions
There are many distance functions, but Euclidean distance is the most commonly used measure. It is mainly used when the data is continuous. Manhattan distance is also very common for continuous variables.

It should also be noted that these distance measures are only valid for continuous variables. For categorical variables, the Hamming distance must be used. This also brings up the issue of standardizing the numerical variables between 0 and 1 when there is a mixture of numerical and categorical variables in the dataset.
The idea behind using a distance measure is to find the distance (similarity) between the new sample and the training cases, and then find the k closest customers to the new customer in terms of, say, height and weight.
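The sketch below, assuming NumPy and small made-up vectors, illustrates the Euclidean, Manhattan and Hamming distances:

```python
import numpy as np

# Two customers described by (height in cm, weight in kg).
a = np.array([161.0, 61.0])
b = np.array([158.0, 58.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # straight-line distance: ~4.24
manhattan = np.sum(np.abs(a - b))          # sum of absolute differences: 6.0

# Hamming distance for categorical variables: number of mismatched attributes.
p = np.array(["red", "small", "round"])
q = np.array(["red", "large", "round"])
hamming = np.sum(p != q)                   # 1

print(euclidean, manhattan, hamming)
```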

Example 1
Consider the following data concerning credit default. Age and Loan are two numerical variables
(predictors) and Default is the target. This constitutes the training data. We can now use the training set
to classify an unknown case (Age=48 and Loan=$142,000) using Euclidean distance. There are 13
observations.
Using the Euclidean distance formula:

- If K = 1, the nearest neighbor is the last case in the training set, which has Default = Y:
  D = sqrt[(48 - 33)^2 + (142000 - 150000)^2] = 8000.01 >> Default = Y

- With K = 3, two of the three closest neighbors have Default = Y and one has Default = N. The prediction for the unknown case is therefore again Default = Y.
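The distance quoted above can be checked with a few lines of Python; only the unknown case and the one neighbor mentioned in the text (Age = 33, Loan = $150,000) are used, since the full 13-row training table is not reproduced here:

```python
import math

unknown  = (48, 142000)   # Age = 48, Loan = $142,000
neighbor = (33, 150000)   # the last training case, which has Default = Y

d = math.dist(unknown, neighbor)
print(round(d, 2))        # 8000.01, so with K = 1 the prediction is Default = Y
```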

Standardized Training Data Set


One major drawback in calculating distance measures directly from the training set is in the case where
variables have different measurement scales or there is a mixture of numerical and categorical
variables. For example, if one variable is based on annual income in dollars, and the other is based on
age in years then income will have a much higher influence on the distance calculated. One solution is to
standardize the training set as shown below.

[Table: Euclidean distance vs. standardized distance for the 13 training observations]

Using the standardized distance on the same training set, the unknown case returns a different nearest neighbor (observation no. 3 now gives the smallest distance from the unknown case, so the K = 1 result changes), which is not a good sign of robustness.

Example 2
Suppose we have the height, weight and T-shirt size of some customers, and we need to predict the T-shirt size of a new customer given only her height and weight. The data, including height, weight and T-shirt size, is shown below. There are 18 observations.

New customer named 'Monica' has height 161 cm and weight 61 kg.
Height (cm) Weight (kg) T Shirt Size
158 58 M
158 59 M
158 63 M
160 59 M
160 60 M
163 60 M
163 61 M
160 64 L
163 64 L
165 61 L
165 62 L
165 65 L
168 62 L
168 63 L
168 66 L
170 63 L
170 64 L
170 68 L

The Euclidean distance between the first observation and the new observation (Monica) is as follows:
sqrt((161-158)^2 + (61-58)^2) = 4.2426

Similarly, we calculate the distance of all the training cases from the new case and rank them by distance. The smallest distance value is ranked 1 and considered the nearest neighbor.
Let k be 5. The algorithm then searches for the 5 customers closest to Monica, i.e., most similar to Monica in terms of attributes, and sees what categories those 5 customers are in. If 4 of them have a 'Medium' T-shirt size and 1 has a 'Large' T-shirt size, then the best guess for Monica is a 'Medium' T-shirt.
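Putting the example together, the sketch below fits a KNN classifier to the 18 training cases from the table above and classifies Monica; the data and K = 5 come from the lecture, while the choice of scikit-learn is an assumption:

```python
from sklearn.neighbors import KNeighborsClassifier

# The 18 training cases from the table above: (height in cm, weight in kg).
X_train = [
    (158, 58), (158, 59), (158, 63), (160, 59), (160, 60), (163, 60), (163, 61),
    (160, 64), (163, 64), (165, 61), (165, 62), (165, 65), (168, 62), (168, 63),
    (168, 66), (170, 63), (170, 64), (170, 68),
]
y_train = ["M"] * 7 + ["L"] * 11

# K = 5: the prediction is the majority class among Monica's 5 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.predict([(161, 61)]))  # Monica: height 161 cm, weight 61 kg -> ['M']
```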

In the graph below, the binary dependent variable (T-shirt size) is displayed in blue and orange: 'Medium T-shirt size' is shown in blue and 'Large T-shirt size' in orange. The new customer is shown as a yellow circle. Four blue data points and one orange data point lie close to the yellow circle, so the prediction for the new case is the majority blue class, i.e., Medium T-shirt size.
Standardized Training Data Set
After standardization, the 5th closest neighbor changes, because height was dominating the distance calculation before standardization. Hence, it is important to standardize the predictors before running the K-nearest neighbors algorithm.
[Tables: original vs. standardized training data]

As (161 - 158)/(170 - 158) = 0.25, this implies that the formula X_s = (X - min)/(max - min) has NOT been used here. Actually, the formula X_s = (X - mean)/s.d. has been used for standardizing the data.
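A sketch of the standardized run, again assuming scikit-learn: StandardScaler applies the X_s = (X - mean)/s.d. formula to each predictor before the neighbors are found.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same 18 training cases as in the table above.
X_train = [
    (158, 58), (158, 59), (158, 63), (160, 59), (160, 60), (163, 60), (163, 61),
    (160, 64), (163, 64), (165, 61), (165, 62), (165, 65), (168, 62), (168, 63),
    (168, 66), (170, 63), (170, 64), (170, 68),
]
y_train = ["M"] * 7 + ["L"] * 11

# StandardScaler rescales height and weight so that height no longer dominates
# the distance calculation.
knn_std = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_std.fit(X_train, y_train)
print(knn_std.predict([(161, 61)]))  # -> ['M']; the 5th nearest neighbor changes,
                                     # but the majority vote is still Medium
```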
