Similarity-Based Learning in ML

Chapter 4 focuses on Similarity-Based Learning in machine learning, covering key algorithms such as Nearest-Neighbor Learning, Weighted K-Nearest algorithms, Nearest Centroid Classifier, and Locally Weighted Regression. It discusses the principles of instance-based learning, its applications, advantages, and disadvantages, as well as the differences between instance-based and model-based learning. The chapter also highlights the importance of distance metrics and provides practical examples of using K-Nearest Neighbors for classification tasks.


6th Semester

MACHINE LEARNING
(BCS602)

Module – 3

Chapter 4: Similarity-based Learning

By,
Prof. Samatha R Swamy
ISE Department

● Chapter 4 is about Similarity-Based Learning.

● It discusses the following algorithms:
○ Nearest-Neighbour Learning

○ Weighted k-Nearest-Neighbour

○ Nearest Centroid Classifier

○ Locally Weighted Regression (LWR)

Textbook: Chapter 4 - 4.2 to 4.5

4.2 Nearest-Neighbour Learning

4.3 Weighted k-Nearest-Neighbour Algorithm

4.4 Nearest Centroid Classifier

4.5 Locally Weighted Regression (LWR)


Chapter 4: Learning Objectives

● Know the concepts of Nearest-Neighbour Learning using the k-Nearest-Neighbours (k-NN) algorithm.
● Learn about the Weighted k-Nearest-Neighbour classifier, which weights the chosen neighbours by their distance.
● Gain knowledge of the Nearest Centroid classifier, a simple alternative to k-NN classifiers.
● Understand Locally Weighted Regression (LWR), which fits local linear functions to the k nearest neighbours to minimize the prediction error.

Dept of Information Science & Engineering Prof. Samatha R Swamy 1


Introduction to Similarity or Instance-Based Learning


Similarity-Based Learning is also called Instance-Based Learning or Just-in-Time Learning, since it does not build an abstract model of the training instances and performs lazy learning when classifying a new instance.

It makes predictions by computing distances or similarities between the test instance and the set of training instances local to it, in an incremental process.

In contrast to other learning mechanisms, it considers only the nearest instance or instances to predict the class of unseen instances. This keeps classification efficient, since only a specific, relevant subset of instances is examined for each prediction, as an incremental learning task.
A practical application of this kind of learning is predicting daily stock index price changes.


Popular Distance Metrics
● Hamming Distance
● Euclidean Distance
● Manhattan Distance
● Minkowski Distance
● Cosine Similarity
● Mahalanobis Distance
● Pearson’s Correlation (correlation similarity)
● Mean Squared Difference
● Jaccard Coefficient
● Tanimoto Coefficient
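As an illustrative sketch (not from the textbook), the most common of these metrics can be computed in a few lines of Python:

```python
import math

def euclidean(a, b):
    # Straight-line (L2) distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block (L1) distance: sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    # Generalization: p=1 gives Manhattan, p=2 gives Euclidean.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def cosine_similarity(a, b):
    # Cosine of the angle between the vectors (1.0 = same direction).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def hamming(a, b):
    # Number of positions at which two equal-length sequences differ.
    return sum(x != y for x, y in zip(a, b))

print(euclidean([0, 0], [3, 4]))      # 5.0
print(manhattan([0, 0], [3, 4]))      # 7
print(hamming("karolin", "kathrin"))  # 3
```

The choice of metric matters: Euclidean and Manhattan suit numeric features, Hamming suits categorical or binary features, and cosine similarity suits high-dimensional sparse data such as text.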


● It is a Supervised Learning Technique.


● Belongs to a family of Instance-Based learning.
● Used to solve both Classification and Regression Problems.
● Used in fields such as
○ Image Processing
○ Text Classification
○ Pattern Classification
○ Bioinformatics
○ Data Mining
○ Information Retrieval
○ Natural Language Processing (NLP)

Advantage:

● Processing occurs only when a request to classify a new instance is made. This is particularly useful when the whole dataset is not available at the beginning but is collected incrementally.


Disadvantages:

● Requires large memory to store the data, since no global abstract model is constructed from the training data.
● Sensitive to irrelevant and correlated features, which can lead to misclassification of instances.
● Sensitive to the range of feature values, since features with larger ranges dominate the distance computation.


4.1 Difference between Instance-Based and Model-Based Learning

Instance-Based Learning | Model-Based Learning
Lazy learners. | Eager learners.
Processing of training instances is done only during the testing phase. | Processing of training instances is done only during the training phase.
No model is built from the training instances before a test instance is received. | Generalizes a model from the training instances before a test instance is received.
Predicts the class of an instance directly from the training data. | Predicts the class of an instance from the model built.
Slow in the testing phase. | Fast in the testing phase.
Learns by making many local approximations. | Learns by making a single global approximation.


Examples of Instance-Based Learning algorithms:

1. k-Nearest Neighbour (k-NN)
2. Variants of Nearest Neighbour Learning
3. Locally Weighted Regression
4. Learning Vector Quantization (LVQ)

Examples of Model-Based Learning algorithms:

1. Support Vector Machines (SVM)
2. Neural Networks (NN)
3. Decision Trees (DT)

Nearest Centroid Classifier (detailed in Section 4.4)

● The Nearest Centroid classifier is arguably the simplest classification algorithm in Machine Learning.
● It works on a simple principle: given a data point (observation), the Nearest Centroid classifier simply assigns it the label of the class whose training-sample mean, or centroid, is closest to it.
● When applied to text classification, the Nearest Centroid classifier is also called the Rocchio classifier.

4.2 Nearest-Neighbor Learning

What is the k-Nearest Neighbors (KNN) algorithm?

The KNN classifier is a machine learning algorithm used for classification and regression problems. It works by finding the K nearest points in the training dataset and using their classes to predict the class (or value) of a new data point. It can handle complex data and is easy to implement, which is why KNN has become a popular tool in the field of artificial intelligence.


The KNN algorithm is most commonly used for:

1. Disease Prediction – predicting the likelihood of a disease based on symptoms and the available data.
2. Handwriting Recognition – recognizing handwritten characters.
3. Image Classification – recognizing images in computer vision.


Difference between KNN and Artificial Neural Networks

● K-Nearest Neighbors (KNN) is mainly used for classification and regression problems, while Artificial Neural Networks (ANN) are used for complex function approximation and pattern recognition problems.

● Moreover, ANN has a higher training cost than KNN; KNN instead pays its computational cost at prediction time.



k-NN Algorithm

Step #1 - Assign a value to K.

Step #2 - Calculate the distance between the new data entry and all existing data entries, and arrange the distances in ascending order.

Step #3 - Find the K nearest neighbors to the new entry based on the calculated distances.

Step #4 - Assign the new data entry to the majority class among those K neighbors.
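The four steps above can be sketched directly in Python (a minimal illustration, not the textbook's code):

```python
import math
from collections import Counter

def knn_predict(train, test_point, k):
    """Classify test_point by majority vote among its k nearest training rows.

    train: list of (features, label) pairs; features are numeric tuples.
    """
    # Step 2: distance from the test point to every training instance.
    dists = [(math.dist(features, test_point), label) for features, label in train]
    # Step 3: keep the k closest.
    nearest = sorted(dists)[:k]
    # Step 4: majority vote among the k neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Tiny toy example: two "A" points near the query outvote the distant "B".
print(knn_predict([((0, 0), "A"), ((1, 0), "A"), ((5, 5), "B")], (0.5, 0.2), 3))  # A
```

Euclidean distance is used here (`math.dist`); any of the metrics listed earlier could be substituted.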

S. No.  CGPA  Assessment  Project Submitted  Result
1       9.2   85          8                  PASS
2       8     80          7                  PASS
3       8.5   81          8                  PASS
4       6     45          5                  FAIL
5       6.5   50          4                  FAIL
6       8.2   72          7                  PASS
7       5.8   38          5                  FAIL
8       8.9   91          9                  PASS

Test Instance: (6.1, 40, 5)

Classes: PASS, FAIL

Based on the student's performance attributes, classify whether the test instance corresponds to a student who will PASS or FAIL.


Assign k = 3.

Step 1: Compute the Euclidean distance between the test instance (6.1, 40, 5) and each of the training instances.


S. No.  CGPA  Assessment  Project Submitted  Result  Euclidean Distance
1       9.2   85          8                  PASS    45.2063
2       8     80          7                  PASS    40.0950
3       8.5   81          8                  PASS    41.1796
4       6     45          5                  FAIL    5.0010
5       6.5   50          4                  FAIL    10.0578
6       8.2   72          7                  PASS    32.1311
7       5.8   38          5                  FAIL    2.0224
8       8.9   91          9                  PASS    51.2332

Step 2: Sort the distances in ascending order and select the first 3 training instances nearest to the test instance.

Instance  Euclidean Distance  Class
7         2.0224              FAIL
4         5.0010              FAIL
5         10.0578             FAIL


Step 3: Predict the class of the test instance by majority voting among the 3 nearest neighbours.

All three nearest neighbours are labelled FAIL, so the class for the test instance is predicted as “FAIL”.

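The whole worked example can be checked with a short script (an independent sketch; the numbers match the tables above):

```python
import math
from collections import Counter

# Training data from the slides: (CGPA, Assessment, Project Submitted) -> Result
train = [
    ((9.2, 85, 8), "PASS"),
    ((8.0, 80, 7), "PASS"),
    ((8.5, 81, 8), "PASS"),
    ((6.0, 45, 5), "FAIL"),
    ((6.5, 50, 4), "FAIL"),
    ((8.2, 72, 7), "PASS"),
    ((5.8, 38, 5), "FAIL"),
    ((8.9, 91, 9), "PASS"),
]
test = (6.1, 40, 5)
k = 3

# Step 1: Euclidean distance to every training instance, sorted ascending.
dists = sorted((math.dist(x, test), label) for x, label in train)
# Step 2: the k nearest neighbours.
nearest = dists[:k]
print([f"{d:.4f} {c}" for d, c in nearest])
# Step 3: majority vote.
prediction = Counter(c for _, c in nearest).most_common(1)[0][0]
print(prediction)  # FAIL
```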

k-NN performance is affected by:

1. The number of nearest neighbours (k)

2. The distance metric

3. The decision rule


● k-NN suits lower-dimensional data.

● One of the main issues affecting the performance of the k-NN algorithm is the choice of the hyperparameter k.

● If k is too small, the algorithm is more sensitive to outliers; if k is too large, the neighborhood may include too many points from other classes.

● Another issue is how the class labels are combined. The simplest method is a majority vote, but this can be a problem when the nearest neighbors vary widely in their distances and the closest neighbors indicate the class of the object more reliably.
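The sensitivity to small k can be demonstrated with a toy dataset of our own making (a hypothetical example, not from the textbook):

```python
import math
from collections import Counter

def knn(train, q, k):
    # Majority vote among the k nearest training points.
    nearest = sorted((math.dist(x, q), c) for x, c in train)[:k]
    return Counter(c for _, c in nearest).most_common(1)[0][0]

# Three "A" points and one "B" outlier sitting close to the query.
train = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"), ((0.2, 0.2), "B")]
q = (0.15, 0.15)
print(knn(train, q, 1))  # B  (k too small: the single outlier decides)
print(knn(train, q, 3))  # A  (a larger neighbourhood outvotes the outlier)
```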


4.3 Weighted K-Nearest-Neighbor Algorithm

Weighted k-NN is a modified version of k-nearest neighbors in which each of the k neighbours votes with a weight, typically the inverse of its distance to the test instance.

Test Instance: (7.6, 60, 8)

Classes: PASS, FAIL

Based on the student's performance attributes, classify whether the test instance corresponds to a student who will PASS or FAIL.


Assign k = 3.

Step 1: Compute the Euclidean distance between the test instance (7.6, 60, 8) and each of the training instances.


S. No.  CGPA  Assessment  Project Submitted  Result  Euclidean Distance
1       9.2   85          8                  PASS    25.0511
2       8     80          7                  PASS    20.0290
3       8.5   81          8                  PASS    21.0193
4       6     45          5                  FAIL    15.3805
5       6.5   50          4                  FAIL    10.8264
6       8.2   72          7                  PASS    12.0565
7       5.8   38          5                  FAIL    22.2764
8       8.9   91          9                  PASS    31.0434

Step 2: Sort the distances in ascending order and select the first 3 training instances nearest to the test instance.

Instance  Euclidean Distance  Class
5         10.8264             FAIL
6         12.0565             PASS
4         15.3805             FAIL


Step 3: Predict the class of the instance by majority voting.

Two of the three nearest neighbours are FAIL, so plain (unweighted) k-NN predicts “FAIL”.


In weighted k-NN, each neighbour's vote is weighted by the inverse of its distance:

Instance  Euclidean Distance  Inverse Distance  Class
5         10.8264             0.0924            FAIL
6         12.0565             0.0829            PASS
4         15.3805             0.0650            FAIL


Find the sum of the inverse distances:

Sum = 0.0924 + 0.0829 + 0.0650 = 0.2403

Compute each weight by dividing its inverse distance by this sum.


Instance  Euclidean Distance  Inverse Distance  Weight = Inverse / Sum  Class
5         10.8264             0.09237           0.38434                 FAIL
6         12.0565             0.08294           0.34512                 PASS
4         15.3805             0.06502           0.27054                 FAIL


Add the weights of the same class:

FAIL = 0.27054 + 0.38434 = 0.65488
PASS = 0.34512

Predict the class with the maximum total weight.

The class predicted is “FAIL”.

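The inverse-distance weighting above can be verified with a short script (an independent sketch using the same data):

```python
import math
from collections import defaultdict

# Training data from the slides: (CGPA, Assessment, Project Submitted) -> Result
train = [
    ((9.2, 85, 8), "PASS"), ((8.0, 80, 7), "PASS"),
    ((8.5, 81, 8), "PASS"), ((6.0, 45, 5), "FAIL"),
    ((6.5, 50, 4), "FAIL"), ((8.2, 72, 7), "PASS"),
    ((5.8, 38, 5), "FAIL"), ((8.9, 91, 9), "PASS"),
]
test = (7.6, 60, 8)
k = 3

# The k nearest neighbours by Euclidean distance.
nearest = sorted((math.dist(x, test), label) for x, label in train)[:k]

# Each neighbour votes with weight 1/distance, normalized to sum to 1.
inv = [(1.0 / d, label) for d, label in nearest]
total = sum(w for w, _ in inv)
votes = defaultdict(float)
for w, label in inv:
    votes[label] += w / total

print(dict(votes))                # total weight per class
print(max(votes, key=votes.get))  # FAIL
```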

4.4 Nearest Centroid Classifier

An alternative to k-NN classifiers for similarity-based classification. The Nearest Centroid Classifier is also called the Mean Difference classifier: each class is represented by the mean (centroid) of its training instances, and a test instance is assigned to the class whose centroid is nearest.


X Y Class
3 1 A
5 2 A
4 3 A
7 6 B
6 7 B
8 5 B

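The worked-example slides for this section were figures that did not survive conversion, but the computation is easy to reconstruct from the table above: the centroid of class A is ((3+5+4)/3, (1+2+3)/3) = (4, 2) and the centroid of class B is (7, 6). A sketch, with a hypothetical test point of our own choosing:

```python
import math
from collections import defaultdict

# Training data from the slides: (X, Y, Class)
data = [(3, 1, "A"), (5, 2, "A"), (4, 3, "A"),
        (7, 6, "B"), (6, 7, "B"), (8, 5, "B")]

# Compute the centroid (mean point) of each class.
points = defaultdict(list)
for x, y, c in data:
    points[c].append((x, y))
centroids = {c: (sum(p[0] for p in pts) / len(pts),
                 sum(p[1] for p in pts) / len(pts))
             for c, pts in points.items()}
print(centroids)  # {'A': (4.0, 2.0), 'B': (7.0, 6.0)}

def nearest_centroid(point):
    # Assign the label of the closest class centroid.
    return min(centroids, key=lambda c: math.dist(point, centroids[c]))

print(nearest_centroid((6, 5)))  # hypothetical test point -> 'B'
```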


4.5 Locally Weighted Regression (LWR)
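The slides for this section consisted of figures and equations that did not survive conversion. As a hedged reconstruction of the idea: LWR fits a separate linear function around each query point, weighting nearby training points more heavily (typically with a Gaussian kernel), and uses that local fit to make the prediction, minimizing the weighted error. A minimal one-dimensional sketch, with a kernel bandwidth `tau` chosen arbitrarily for illustration:

```python
import math

def lwr_predict(xs, ys, x0, tau=1.0):
    """Predict y at x0 with a locally weighted linear fit y = a + b*x.

    Each training point gets a Gaussian weight based on its distance
    to the query point x0; the weighted least-squares line is solved
    in closed form via the weighted normal equations.
    """
    w = [math.exp(-((x - x0) ** 2) / (2 * tau ** 2)) for x in xs]
    S = sum(w)
    Sx = sum(wi * x for wi, x in zip(w, xs))
    Sy = sum(wi * y for wi, y in zip(w, ys))
    Sxx = sum(wi * x * x for wi, x in zip(w, xs))
    Sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    b = (S * Sxy - Sx * Sy) / (S * Sxx - Sx * Sx)
    a = (Sy - b * Sx) / S
    return a + b * x0

# On exactly linear data the local fit recovers the line everywhere.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]
print(lwr_predict(xs, ys, 1.5))  # ~4.0
```

Because a new fit is solved for every query, LWR is lazy in the same sense as k-NN: all the work happens at prediction time.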




Thank You
Any Queries:
Mail - [Link]@[Link]
