Lecture#2. K Nearest Neighbors


K Nearest Neighbors (KNN)

Introduction
KNN is a non-parametric supervised learning technique in which we try to classify a data point into a given category with the help of a training set.

“A supervised machine learning algorithm (as opposed to an unsupervised machine learning algorithm)
is one that relies on labeled input data to learn a function that produces an appropriate output when
given new unlabeled data”.

K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). KNN has been used as a non-parametric technique in statistical estimation and pattern recognition since the beginning of the 1970s.

A case is classified by a majority vote of its neighbors, with the case being assigned to the class most
common amongst its K nearest neighbors measured by a distance function. If K = 1, then the case is
simply assigned to the class of its nearest neighbor.
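As a concrete illustration of this voting procedure, here is a minimal from-scratch sketch; the toy data and the function name are illustrative assumptions, not part of the lecture:

```python
from collections import Counter
import math

def knn_classify(training_points, training_labels, new_point, k=3):
    """Classify new_point by a majority vote of its k nearest neighbors."""
    # Distance from the new case to every stored training case.
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(training_points, training_labels)
    ]
    # Keep the k closest cases and take a majority vote among their labels.
    k_nearest = sorted(distances, key=lambda d: d[0])[:k]
    votes = Counter(label for _, label in k_nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two features per case, two classes.
X_train = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
y_train = ["A", "A", "B", "B"]
print(knn_classify(X_train, y_train, (1.1, 0.9), k=3))  # -> "A"
```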

Can KNN be used for regression?


Yes, K-nearest neighbor can be used for regression. In other words, the K-nearest neighbor algorithm
can be applied when a dependent variable is continuous. In this case, the predicted value is the average
of the values of its k nearest neighbors (rather than the majority vote).
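A minimal sketch of this averaging behaviour, assuming scikit-learn and a made-up one-feature dataset (not from the lecture):

```python
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical training data with a continuous target.
X_train = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y_train = [1.1, 1.9, 3.2, 3.9, 5.1]

# The prediction is the average of the targets of the 3 nearest neighbors.
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X_train, y_train)
print(reg.predict([[2.2]]))  # ~2.07, the mean of 1.1, 1.9 and 3.2
```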

Pros and Cons of KNN


Pros
- Easy to understand
- No assumptions about the data
- Can be applied to both classification and regression
- Works easily on multi-class problems
Cons
- Memory intensive / computationally expensive
- Sensitive to the scale of the data
- Does not work well on a rare-event (skewed) target variable
- Struggles when there is a large number of independent variables
For any given problem, a small value of k will lead to a large variance in predictions. Alternatively,
setting k to a large value may lead to a large model bias.

Why is KNN non-parametric?


Non-parametric means not making any assumptions about the underlying data distribution. Non-parametric methods do not have a fixed number of parameters in the model. Similarly, in KNN the effective number of model parameters grows with the training data set: you can imagine each training case as a "parameter" in the model.
Choosing the Optimal Value for K
- Choosing the optimal value for K is best done by first inspecting the data.
- In general, a large K value is more precise as it reduces the overall noise, but there is no guarantee.
- Cross-validation is a smart way to find the optimal K value. It estimates the validation error rate by holding out a subset of the training set from the model-building process (a code sketch follows this list).
  - Cross-validation (say, 10-fold validation) involves randomly dividing the training set into 10 groups, or folds, of approximately equal size. 90% of the data is used to train the model and the remaining 10% to validate it. The misclassification rate is then computed on the 10% validation data. This procedure repeats 10 times, with a different group of observations treated as the validation set each time, resulting in 10 estimates of the validation error which are then averaged.
- Historically, the optimal K for most datasets has been between 3 and 10, which produces much better results than 1-NN.
- A low K value is sensitive to outliers, whereas a higher K value is more resilient to outliers because it considers more voters when deciding the prediction.
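The sketch below implements this search with scikit-learn's cross_val_score; the dataset (Iris) and the candidate range of K values are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Estimate 10-fold cross-validated accuracy for each candidate K.
scores = {}
for k in range(1, 21):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(model, X, y, cv=10).mean()

best_k = max(scores, key=scores.get)
print("best K:", best_k, "cross-validated accuracy:", round(scores[best_k], 3))
```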

Standardization (of Training Data Set)


When the independent variables in the training data are measured in different units, it is important to standardize the variables before calculating distance. For example, if one variable is based on height in cm and the other is based on weight in kg, then height will have more influence on the distance calculation. To make them comparable, we need to standardize them, which can be done by any of the following methods:

X_s = (X - mean) / s.d.

X_s = (X - mean) / (max - min)

X_s = (X - min) / (max - min)

Mean and Standard Deviation Formulas


mean: x-bar = (1/n) * Σ_{i=1..n} x_i

standard deviation: s = sqrt( Σ_{i=1..n} (x_i - x-bar)^2 / (n - 1) )

KNN vs. K-Means


Many people get confused between these two statistical techniques: K-means and K-nearest neighbors. Some of the differences are listed below.
- K-means is an unsupervised learning technique (there is no dependent variable), whereas KNN is a supervised learning algorithm (a dependent variable exists).
- K-means is a clustering technique that tries to split the data points into K clusters such that the points in each cluster tend to be near each other, whereas K-nearest neighbors tries to determine the classification of a point by combining the classifications of its K nearest points.

Distance Functions
There are many distance functions, but Euclidean distance is the most commonly used measure. It is mainly used when the data is continuous. Manhattan distance is also very common for continuous variables.

It should also be noted that these distance measures are only valid for continuous variables. For categorical variables, the Hamming distance must be used. This also brings up the issue of standardizing the numerical variables between 0 and 1 when there is a mixture of numerical and categorical variables in the dataset.
The idea behind using a distance measure is to find the distance (similarity) between the new sample and the training cases, and then find the k closest customers to the new customer in terms of, say, height and weight.
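The sketch below, assuming NumPy and small made-up vectors, illustrates the Euclidean, Manhattan and Hamming distances:

```python
import numpy as np

# Two customers described by (height in cm, weight in kg).
a = np.array([161.0, 61.0])
b = np.array([158.0, 58.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # straight-line distance: ~4.24
manhattan = np.sum(np.abs(a - b))          # sum of absolute differences: 6.0

# Hamming distance for categorical variables: number of mismatched attributes.
p = np.array(["red", "small", "round"])
q = np.array(["red", "large", "round"])
hamming = np.sum(p != q)                   # 1

print(euclidean, manhattan, hamming)
```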

Example 1
Consider the following data concerning credit default. Age and Loan are two numerical variables
(predictors) and Default is the target. This constitutes the training data. We can now use the training set
to classify an unknown case (Age=48 and Loan=$142,000) using Euclidean distance. There are 13
observations.
Using the Euclidean distance formula:

- If K = 1, the nearest neighbor is the last case in the training set, which has Default = Y:
  D = sqrt[(48 - 33)^2 + (142000 - 150000)^2] = 8000.01 >> Default = Y

- With K = 3, two of the three closest neighbors have Default = Y and one has Default = N. The prediction for the unknown case is therefore again Default = Y.
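The distance quoted above can be checked with a few lines of Python; only the unknown case and the one neighbor mentioned in the text (Age = 33, Loan = $150,000) are used, since the full 13-row training table is not reproduced here:

```python
import math

unknown  = (48, 142000)   # Age = 48, Loan = $142,000
neighbor = (33, 150000)   # the last training case, which has Default = Y

d = math.dist(unknown, neighbor)
print(round(d, 2))        # 8000.01, so with K = 1 the prediction is Default = Y
```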

Standardized Training Data Set


One major drawback in calculating distance measures directly from the training set is in the case where
variables have different measurement scales or there is a mixture of numerical and categorical
variables. For example, if one variable is based on annual income in dollars, and the other is based on
age in years then income will have a much higher influence on the distance calculated. One solution is to
standardize the training set as shown below.

[Table: Euclidean distance vs. standardized distance for the 13 training observations]

Using the standardized distance on the same training set, the unknown case returns a different nearest neighbor (observation no. 3 now gives the smallest distance from the unknown case, so the K = 1 result changes), which is not a good sign of robustness.

Example 2
Suppose we have the height, weight and T-shirt size of some customers, and we need to predict the T-shirt size of a new customer given only her height and weight. The data, including height, weight and T-shirt size, is shown below. There are 18 observations.

New customer named 'Monica' has height 161 cm and weight 61 kg.
Height (cm) Weight (kg) T Shirt Size
158 58 M
158 59 M
158 63 M
160 59 M
160 60 M
163 60 M
163 61 M
160 64 L
163 64 L
165 61 L
165 62 L
165 65 L
168 62 L
168 63 L
168 66 L
170 63 L
170 64 L
170 68 L

The Euclidean distance between the first observation and the new observation (Monica) is as follows:
sqrt((161-158)^2 + (61-58)^2) = 4.2426

Similarly, we calculate the distance of all the training cases from the new case and rank them by distance. The smallest distance value is ranked 1 and considered the nearest neighbor.
Let k be 5. The algorithm then searches for the 5 customers closest to Monica, i.e., most similar to Monica in terms of attributes, and sees what categories those 5 customers are in. If 4 of them have a 'Medium' T-shirt size and 1 has a 'Large' T-shirt size, then the best guess for Monica is a 'Medium' T-shirt.
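Putting the example together, the sketch below fits a KNN classifier to the 18 training cases from the table above and classifies Monica; the data and K = 5 come from the lecture, while the choice of scikit-learn is an assumption:

```python
from sklearn.neighbors import KNeighborsClassifier

# The 18 training cases from the table above: (height in cm, weight in kg).
X_train = [
    (158, 58), (158, 59), (158, 63), (160, 59), (160, 60), (163, 60), (163, 61),
    (160, 64), (163, 64), (165, 61), (165, 62), (165, 65), (168, 62), (168, 63),
    (168, 66), (170, 63), (170, 64), (170, 68),
]
y_train = ["M"] * 7 + ["L"] * 11

# K = 5: the prediction is the majority class among Monica's 5 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.predict([(161, 61)]))  # Monica: height 161 cm, weight 61 kg -> ['M']
```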

In the graph below, the binary dependent variable (T-shirt size) is displayed in blue and orange: 'Medium T-shirt size' is shown in blue and 'Large T-shirt size' in orange. The new customer is shown as a yellow circle. Four blue data points and one orange data point lie close to the yellow circle, so the prediction for the new case is the majority blue class, i.e., Medium T-shirt size.
Standardized Training Data Set
After standardization, the 5th closest neighbor changes, because height was dominating the distance calculation before standardization. Hence, it is important to standardize the predictors before running the K-nearest neighbors algorithm.
[Tables: original vs. standardized training data]

As (161 - 158)/(170 - 158) = 0.25, this implies that the formula X_s = (X - min)/(max - min) has NOT been used here. Actually, the formula X_s = (X - mean)/s.d. has been used for standardizing the data.
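A sketch of the standardized run, again assuming scikit-learn: StandardScaler applies the X_s = (X - mean)/s.d. formula to each predictor before the neighbors are found.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same 18 training cases as in the table above.
X_train = [
    (158, 58), (158, 59), (158, 63), (160, 59), (160, 60), (163, 60), (163, 61),
    (160, 64), (163, 64), (165, 61), (165, 62), (165, 65), (168, 62), (168, 63),
    (168, 66), (170, 63), (170, 64), (170, 68),
]
y_train = ["M"] * 7 + ["L"] * 11

# StandardScaler rescales height and weight so that height no longer dominates
# the distance calculation.
knn_std = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_std.fit(X_train, y_train)
print(knn_std.predict([(161, 61)]))  # -> ['M']; the 5th nearest neighbor changes,
                                     # but the majority vote is still Medium
```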
