Detailed Explanation of Module 3 Lab 1: Understanding Distance Metrics and
Introduction to KNN
This lab introduces the concept of distance metrics—how to measure the "closeness" of data
points—and shows how these are used in the K-Nearest Neighbors (KNN) algorithm. Below, each
section and concept is explained step-by-step, with examples and answers to key questions.
Section 1: Distance Metrics
A. What is a Distance Metric?
A distance metric is a mathematical way to measure how far apart two points (data samples) are
in space. Different metrics are used depending on the data type and problem.
B. Common Distance Metrics (with Examples)
1. Euclidean Distance
Definition: The straight-line distance between two points.
Formula:
$ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $
Example:
import numpy as np

x_1 = np.array([1, 2])
x_2 = np.array([4, 6])
euclidean_dist = np.sqrt(np.sum((x_1 - x_2) ** 2))  # straight-line distance
print(euclidean_dist) # Output: 5.0
Visualization: The shortest path between two points on a plane.
2. Manhattan Distance
Definition: The sum of the absolute differences of their coordinates (like a taxi driving
on a city grid).
Formula:
$ d(x, y) = \sum_{i=1}^{n} |x_i - y_i| $
Example:
manhattan_dist = np.sum(np.abs(x_1 - x_2))
print(manhattan_dist) # Output: 7
3. Minkowski Distance
Generalizes Euclidean (p=2) and Manhattan (p=1) distances.
Formula:
$ d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} $
Example:
For $ p = 3 $, the Minkowski distance between the same points is about 4.5.
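A quick sketch of that computation using scipy (the lab may compute it differently; scipy.spatial.distance.minkowski is one standard way):
from scipy.spatial import distance
import numpy as np

x_1 = np.array([1, 2])
x_2 = np.array([4, 6])
# p=1 reproduces Manhattan, p=2 reproduces Euclidean; p=3 gives ~4.498
print(distance.minkowski(x_1, x_2, p=3)) # Output: 4.4979...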
4. Hamming Distance
Definition: Number of positions at which the corresponding values are different (used
for categorical/binary data).
Example:
from scipy.spatial import distance

str_1 = 'euclidean'
str_2 = 'manhattan'
# scipy returns the fraction of mismatched positions; multiply by length for a count
hamming_dist = distance.hamming(list(str_1), list(str_2)) * len(str_1)
print(hamming_dist) # Output: 7.0
5. Cosine Similarity
Definition: Measures the cosine of the angle between two vectors (used for text and
high-dimensional data).
Formula:
$ \cos(\theta) = \frac{x \cdot y}{\|x\| \, \|y\|} $
Example:
from numpy.linalg import norm

cosine_similarity = np.dot(x_1, x_2) / (norm(x_1) * norm(x_2))
print(cosine_similarity) # Output: 0.992...
6. Chebyshev Distance
Definition: The maximum absolute difference along any single dimension.
Example:
chebyshev_distance = distance.chebyshev(x_1, x_2)
print(chebyshev_distance) # Output: 4
7. Jaccard Distance
Definition: Measures dissimilarity between sets.
Formula:
$ d_J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|} $
Example:
print(distance.jaccard([1, 0, 0], [0, 1, 0])) # Output: 1.0
8. Haversine Distance
Definition: Used for geographic coordinates (latitude/longitude) on a sphere (e.g.,
Earth).
Example:
from haversine import haversine  # pip install haversine

london = (51.510357, -0.116773)       # (latitude, longitude)
washington = (38.889931, -77.009003)  # (latitude, longitude)
print(haversine(london, washington))  # Output: 5897.658 (km)
C. How to Choose the Right Distance Metric?
Euclidean: Most common for continuous, low-dimensional data.
Manhattan: Useful for high-dimensional or grid-like data.
Cosine Similarity: Good when only direction matters (e.g., text).
Hamming: For categorical/binary variables.
Jaccard: For set/binary data.
Haversine: For geographic data.
D. Visualizing Distance Metrics
The lab uses 3D plots to show how Euclidean and Manhattan distances look from the origin,
helping you understand how each metric "measures" space differently.
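As a 2D stand-in for the lab's 3D plots, the following sketch contours each metric's distance from the origin: Euclidean level sets come out as circles, Manhattan level sets as diamonds.
import numpy as np
import matplotlib.pyplot as plt

# Grid of points around the origin
xx, yy = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-2, 2, 200))

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (name, dist) in zip(axes, [("Euclidean", np.sqrt(xx**2 + yy**2)),
                                   ("Manhattan", np.abs(xx) + np.abs(yy))]):
    ax.contour(xx, yy, dist, levels=10)  # circles vs. diamonds
    ax.set_title(f"{name} distance from origin")
    ax.set_aspect("equal")
plt.show()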
Section 2: K-Nearest Neighbors (KNN)
A. What is KNN?
KNN is a supervised, non-parametric, instance-based algorithm used for classification and
regression.
How it works: For a new data point, KNN finds the k closest points in the training set (using
a distance metric) and assigns the most common class among them (for classification) or
averages their values (for regression).
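To make the mechanics concrete, here is a minimal from-scratch sketch of KNN classification (knn_predict is a hypothetical helper, not the lab's code), using Euclidean distance and a majority vote:
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                          # indices of k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]    # majority class

X_train = np.array([[1, 2], [2, 3], [8, 8], [9, 10]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 2])))  # Output: 0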
B. KNN on a Synthetic Dataset
The lab generates two clusters of 2D points (red and blue) and uses KNN to classify new
points.
Example:
from sklearn.neighbors import KNeighborsClassifier

# pts/tgts are the synthetic training points and labels; test_pts/test_tgts the held-out set
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(pts, tgts)
our_predictions = knn.predict(test_pts)
print("Prediction Accuracy: ", 100 * np.mean(our_predictions == test_tgts))
# Output: e.g., 80.0
Experiment:
Try different distance metrics ('euclidean', 'manhattan', 'chebyshev', 'minkowski',
'hamming') and observe how accuracy changes.
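A sketch of that experiment loop (assuming pts, tgts, test_pts, and test_tgts from the lab's synthetic dataset):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

for metric in ['euclidean', 'manhattan', 'chebyshev', 'minkowski', 'hamming']:
    knn = KNeighborsClassifier(n_neighbors=3, metric=metric)
    knn.fit(pts, tgts)
    acc = 100 * np.mean(knn.predict(test_pts) == test_tgts)
    print(f"{metric:>10}: {acc:.1f}% accuracy")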
C. KNN on the Iris Dataset (Real Data Example)
Iris dataset: 150 samples, 3 species, 4 features each.
Data is split into training and testing sets.
KNN is run with different distance metrics (Euclidean, Cosine, Manhattan, Chebyshev).
Result:
For this dataset and split, all metrics gave 100% accuracy; this will not always hold for other datasets or splits.
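A minimal sketch of that comparison (the split proportions and random_state are assumptions; the notebook's values may differ):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the 150-sample, 4-feature Iris dataset and hold out a test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for metric in ['euclidean', 'cosine', 'manhattan', 'chebyshev']:
    knn = KNeighborsClassifier(n_neighbors=3, metric=metric)
    knn.fit(X_train, y_train)
    print(metric, knn.score(X_test, y_test))  # accuracy on the test set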
Section 3: Questions to Think About and Answer
1. How are similarity and distance different?
Similarity measures how alike two data points are (higher = more alike, e.g., cosine
similarity).
Distance measures how far apart two data points are (lower = more similar, e.g.,
Euclidean).
In KNN, distance metrics like Euclidean and Manhattan are used to find the closest
neighbors; a similarity score can be converted into a distance when one is needed, as sketched below.
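For example, cosine distance is commonly defined as one minus cosine similarity:
import numpy as np
from numpy.linalg import norm

x_1, x_2 = np.array([1, 2]), np.array([4, 6])
cos_sim = np.dot(x_1, x_2) / (norm(x_1) * norm(x_2))  # higher = more alike
cos_dist = 1 - cos_sim                                # lower = more alike
print(round(cos_sim, 3), round(cos_dist, 3))  # Output: 0.992 0.008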
2. What makes a valid distance metric?
A valid distance metric must satisfy:
Non-negativity: $ d(x, y) \geq 0 $
Identity: $ d(x, y) = 0 $ if and only if $ x = y $
Symmetry: $ d(x, y) = d(y, x) $
Triangle Inequality: $ d(x, z) \leq d(x, y) + d(y, z) $
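These axioms can be spot-checked numerically; here is a minimal sketch for Euclidean distance (random sampling can only refute the properties, never prove them):
import numpy as np

rng = np.random.default_rng(0)
d = lambda a, b: np.sqrt(np.sum((a - b) ** 2))  # Euclidean distance

for _ in range(1000):
    x, y, z = rng.normal(size=(3, 4))        # three random 4-dimensional points
    assert d(x, y) >= 0                      # non-negativity
    assert np.isclose(d(x, y), d(y, x))      # symmetry
    assert d(x, z) <= d(x, y) + d(y, z) + 1e-12  # triangle inequality
assert d(x, x) == 0                          # identity
print("All spot checks passed.")
Note that cosine distance (one minus cosine similarity) fails the triangle inequality in general, which is one reason cosine is described as a similarity rather than a true metric.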
Section 4: Best Practices and Observations
Metric choice matters: For some data, the right distance metric can significantly improve
KNN performance [1][2][3].
Curse of dimensionality: In very high-dimensional data, distances between points become
less meaningful, and KNN may not work well [2].
Feature scaling: Always scale features before using KNN, especially with Euclidean or
Manhattan distance [2][4].
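A minimal sketch of this practice with a scikit-learn pipeline (X_train/y_train are placeholders for your own split):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Scaling first prevents large-valued features from dominating the distances
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
# model.fit(X_train, y_train); print(model.score(X_test, y_test))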
Section 5: Summary Table
Metric    | Use Case                               | Formula/Example
Euclidean | Continuous, low-dimensional data       | $ \sqrt{\sum_i (x_i - y_i)^2} $
Manhattan | High-dimensional, grid data            | $ \sum_i |x_i - y_i| $
Chebyshev | Max difference in any dimension        | $ \max_i |x_i - y_i| $
Hamming   | Categorical/binary variables           | number of positions where $ x_i \neq y_i $
Jaccard   | Set/binary data                        | $ 1 - \frac{|A \cap B|}{|A \cup B|} $
Cosine    | Text; direction matters, not magnitude | $ \frac{x \cdot y}{\|x\| \, \|y\|} $
Haversine | Geographic coordinates                 | See code in notebook
Key Takeaways
Distance metrics are foundational for KNN and many other algorithms.
KNN is simple and effective, but its performance depends on the distance metric and value
of $ k $.
Experiment with different metrics and always scale your features.
References
1. https://www.ustcnewly.com/teaching/2020_2_3.pdf
2. https://www.kdnuggets.com/2020/11/most-popular-distance-metrics-knn.html
3. https://blog.devgenius.io/exploring-knn-with-different-distance-metrics-85aea1e8299
4. https://www.freecodecamp.org/news/k-nearest-neighbors-algorithm-classifiers-and-model-example/