International Journal of Research in Computer Science
eISSN 2249-8265 Volume 2 Issue 1 (2011) pp. 29-31
© White Globe Publications
www.ijorcs.org
A COMPARATIVE STUDY ON DISTANCE MEASURING
APPROACHES FOR CLUSTERING
Shraddha Pandit, Suchita Gupta
Assistant Professor, Gyan Ganga Institute of Information Technology and Management, Bhopal
Abstract: Clustering plays a vital role in many areas of research, such as data mining, image retrieval and bio-computing. The distance measure plays an important role in clustering data points, and choosing the right distance measure for a given dataset is a big challenge. In this paper we study various distance measures and their effect on clustering. The paper surveys existing distance measures for clustering and presents a comparison between them based on application domain, efficiency, benefits and drawbacks. This comparison helps researchers make a quick decision about which distance measure to use for clustering. We conclude this work by identifying trends and challenges in research and development towards clustering.

Keywords: Clustering, Distance Measure, Clustering Algorithms
I. INTRODUCTION

Clustering is an important data mining technique that has a wide range of applications in areas such as biology, medicine, market research and image analysis. It is the process of partitioning a set of objects into subsets such that the data in each subset are similar to each other. In cluster analysis, both the distance measure and the clustering algorithm play an important role [1].

An important step in any clustering task is selecting a distance measure, which determines how the similarity [1] of two elements is calculated. This choice influences the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another.

It is expected that the distance between objects within a cluster is minimal, while the distance between objects in different clusters is maximal. In this paper we compare different distance measures. The comparison shows that different distance measures behave differently depending on the application domain. The rest of the paper is organized as follows: in Section II we discuss distance measures and their significance in a nutshell; in Section III we present the comparison between these distance measures in TABLE I; in Section IV we describe how accuracy can be measured and interpret the comparison; and we conclude the report.
II. DISTANCE MEASURES AND THEIR SIGNIFICANCE

A cluster is a collection of data objects that are similar to objects within the same cluster and dissimilar to those in other clusters. The similarity between two objects is calculated using a distance measure [6]. Since clustering forms groups, it can be used as a pre-processing step for methods such as classification.

Many distance measures have been proposed in the literature for data clustering. Most often, these measures are metric functions: Manhattan distance, Minkowski distance and Hamming distance. The Jaccard index, cosine similarity and the Dice coefficient are also popular distance measures. For non-numeric datasets, special distance functions have been proposed; for example, edit distance is a well-known distance measure for text attributes.

In this section we briefly describe seven commonly used distance measures.
A. Euclidean Distance

The Euclidean distance, or Euclidean metric, is the ordinary distance between two points that one would measure with a ruler: the straight-line distance between the two points.

In a plane with p1 at (x1, y1) and p2 at (x2, y2), it is √((x1 - x2)² + (y1 - y2)²). In N dimensions, the Euclidean distance between two points p and q is √(Σ i=1..N (pi - qi)²), where pi (or qi) is the coordinate of p (or q) in dimension i.
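As a minimal illustrative sketch (ours, not part of the original paper), the N-dimensional definition above translates directly into code; the function name and sample points are our own:

```python
import math

def euclidean_distance(p, q):
    # Straight-line distance between two equal-length coordinate sequences.
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Example: sqrt((1 - 4)^2 + (2 - 6)^2) = sqrt(25) = 5.0
print(euclidean_distance((1, 2), (4, 6)))
```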
B. Manhattan Distance

The Manhattan distance is the distance between two points measured along axes at right angles. In a plane with p1 at (x1, y1) and p2 at (x2, y2), it is |x1 - x2| + |y1 - y2|. This is easily generalized to higher dimensions. Manhattan distance is often used in integrated circuit design, where wires only run parallel to the X or Y axis. It is also known as rectilinear distance, the Minkowski [7] [3] L1 distance, the taxicab metric, or the city block distance.
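A matching sketch (again ours) for the N-dimensional generalization mentioned above:

```python
def manhattan_distance(p, q):
    # City-block (L1) distance: sum of absolute coordinate differences.
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

# Example: |1 - 4| + |2 - 6| = 7
print(manhattan_distance((1, 2), (4, 6)))
```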
C. Bit-Vector Distance

An N × N matrix Mb is calculated, where each point has d dimensions and Mb(Pi, Pj) is determined as a d-bit vector. This vector is obtained as follows: if the numerical value of the x-th dimension of point Pi is greater than the numerical value of the x-th dimension of point Pj, then bit x of Mb(Pi, Pj) is set to 1 and bit x of Mb(Pj, Pi) is set to 0. All the bit vectors in Mb are then converted to integers.
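A short sketch of this construction as we read it; the variable names and the choice of storing bit x as the x-th least significant bit of the resulting integer are our assumptions:

```python
def bit_vector_matrix(points):
    # Mb[i][j] is a d-bit vector, stored as an integer, whose bit x is 1
    # when dimension x of point i exceeds dimension x of point j.
    # Bit x of Mb[j][i] is left at 0 in that case, as described above.
    n = len(points)
    d = len(points[0])
    mb = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            value = 0
            for x in range(d):
                if points[i][x] > points[j][x]:
                    value |= 1 << x  # set bit x
            mb[i][j] = value
    return mb

# Example with three 2-dimensional points:
print(bit_vector_matrix([(1, 5), (2, 3), (0, 0)]))  # [[0, 2, 3], [1, 0, 3], [0, 0, 0]]
```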
D. Hamming Distance

The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ.

Let x, y ∈ Aⁿ be strings of length n over an alphabet A. We define the Hamming distance between x and y, denoted dH(x, y), to be the number of places where x and y differ.

The Hamming distance [1] [6] can be interpreted as the number of bits which need to be changed (corrupted) to turn one string into the other. Sometimes the number of characters is used instead of the number of bits. The Hamming distance can also be seen as the Manhattan distance between bit vectors.
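As a brief illustration (our own helper), the definition amounts to counting mismatched positions:

```python
def hamming_distance(x, y):
    # Number of positions at which two equal-length strings differ.
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(x, y))

# Example: "karolin" and "kathrin" differ in 3 positions
print(hamming_distance("karolin", "kathrin"))
```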
E. Jaccard Index

The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for comparing the similarity and diversity of sample sets.

The Jaccard coefficient [11] measures similarity between sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:

J(A, B) = |A ∩ B| / |A ∪ B|
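A minimal sketch over Python sets (treating two empty sets as having similarity 1.0 is our own convention, not the paper's):

```python
def jaccard_index(a, b):
    # |A intersect B| / |A union B| for two sets.
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Example: {0, 1, 2} and {1, 2, 3} share 2 of 4 distinct elements
print(jaccard_index({0, 1, 2}, {1, 2, 3}))  # 0.5
```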
F. Cosine Index

The cosine index is a measure of similarity between two vectors of n dimensions obtained by finding the angle between them; it is often used to compare documents in text mining. Given two vectors of attributes, A and B, the cosine similarity [11], θ, is expressed using a dot product and magnitudes as:

θ = arccos(A · B / (‖A‖ ‖B‖))

For text matching, the attribute vectors A and B are usually the tf-idf vectors of the documents. Since the angle θ lies in the range [0, π], a value of π means exactly opposite, π/2 means independent, and 0 means exactly the same, with in-between values indicating intermediate similarities or dissimilarities.
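A small sketch computing θ as defined above; clamping the cosine to [-1, 1] is our own guard against floating-point rounding:

```python
import math

def cosine_angle(a, b):
    # Angle arccos(A.B / (||A|| ||B||)) between two vectors, in radians.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return math.acos(max(-1.0, min(1.0, dot / (norm_a * norm_b))))

# Example: perpendicular vectors give pi/2, i.e. "independent"
print(cosine_angle((1, 0), (0, 1)))  # ~1.5708
```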
G. Dice Index

Dice's coefficient [11] (also known as the Dice coefficient) is a similarity measure related to the Jaccard index.

For sets X and Y of keywords used in information retrieval, the coefficient may be defined as:

S = 2|X ∩ Y| / (|X| + |Y|)

When taken as a string similarity measure, the coefficient may be calculated for two strings x and y using bigrams as follows:

S = 2nt / (nx + ny)

where nt is the number of character bigrams found in both strings, nx is the number of bigrams in string x, and ny is the number of bigrams in string y.
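A minimal sketch of the bigram-based form; counting shared bigrams with multiplicity is our reading of nt:

```python
from collections import Counter

def dice_coefficient(x, y):
    # S = 2*nt / (nx + ny) over character bigrams of the two strings.
    bigrams_x = [x[i:i + 2] for i in range(len(x) - 1)]
    bigrams_y = [y[i:i + 2] for i in range(len(y) - 1)]
    if not bigrams_x and not bigrams_y:
        return 1.0  # our convention for two strings with no bigrams
    nt = sum((Counter(bigrams_x) & Counter(bigrams_y)).values())
    return 2 * nt / (len(bigrams_x) + len(bigrams_y))

# Example: "night" and "nacht" share only the bigram "ht": 2*1 / (4 + 4) = 0.25
print(dice_coefficient("night", "nacht"))
```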
III. ACCURACY AND RESULT INTERPRETATION

In general, the larger the number of sub-clusters produced by the clustering, the more accurate the final result is. However, too many sub-clusters will slow down the clustering. TABLE I compares the proximity measures discussed above based on four different criteria that are generally required to decide upon a distance measure and a clustering algorithm.

All of the above comparisons are tested using a standard synthetic dataset generated by the Syndeca [3] software, and a few are tested using the open-source clustering tool CLUTO.

IV. CONCLUSION

This paper surveys existing proximity measures for clustering and presents a comparison between them based on application domain, efficiency, benefits and drawbacks. This comparison helps researchers make a quick decision about which distance measure to use for clustering. We ran our experiments on synthetic datasets for validation. Future work involves running the experiments on larger and different kinds of datasets and extending our study to other proximity measures and clustering algorithms.
TABLE I: COMPARISON OF DISTANCE MEASURES

Euclidean
  Formula: √((x1 - x2)² + (y1 - y2)²)
  Algorithms in which it is used: Partitional algorithms, K-Modes, AutoClass, ROCK
  Benefits: Easy to implement and test
  Drawbacks: Results are greatly influenced by variables that have the largest values; does not work well for image data or document classification
  Application area: Applications involving interval data; health psychology analysis; DNA analysis

Manhattan
  Formula: |x1 - x2| + |y1 - y2|
  Algorithms in which it is used: Partitional algorithms
  Benefits: Easily generalized to higher dimensions
  Drawbacks: Does not work well for image data or document classification
  Application area: Integrated circuits

Cosine Similarity
  Formula: θ = arccos(A · B / (‖A‖ ‖B‖))
  Algorithms in which it is used: Ontology- and graph-based algorithms
  Benefits: Handles both continuous and categorical variables
  Drawbacks: Does not work well for nominal data
  Application area: Text mining

Jaccard Index
  Formula: J(A, B) = |A ∩ B| / |A ∪ B|
  Algorithms in which it is used: Neural networks
  Benefits: Handles both continuous and categorical variables
  Drawbacks: Does not work well for nominal data
  Application area: Document classification
V. REFERENCES

[1] Ankita Vimal, Satyanarayana R Valluri and Kamalakar Karlapalem, "An Experiment with Distance Measures for Clustering", Technical Report IIIT/TR/2008/132, 2008.
[2] John W. Ratcliff and David E. Metzener, "Pattern Matching: The Gestalt Approach", Dr. Dobb's Journal, July 1988, p. 46.
[3] Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", AAAI Press, 1996, pp. 226-231.
[4] Aharon Bar-Hillel, Tomer Hertz, Noam Shental and Daphna Weinshall, "Learning Distance Functions using Equivalence Relations", Proceedings of the 20th International Conference on Machine Learning (ICML), 2003.
[5] Keinosuke Fukunaga, "Introduction to Statistical Pattern Recognition", 2nd edition, Academic Press, San Diego, 1990.
[6] Rui Xu and Donald Wunsch, "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, vol. 16, no. 3, May 2005. doi:10.1109/TNN.2005.845141
[7] http://en.wikipedia.org/wiki/Data_clustering
[8] http://en.wikipedia.org/wiki/K-means
[9] http://en.wikipedia.org/wiki/DBSCAN
[10] http://en.wikipedia.org/wiki/Jaccard_index
[11] http://en.wikipedia.org/wiki/Dice_coefficient
How to cite
Shraddha Pandit, Suchita Gupta, "A Comparative Study on Distance Measuring Approaches for Clustering".
International Journal of Research in Computer Science, 2 (1): pp. 29-31, December 2011.
doi:10.7815/ijorcs.21.2011.011