International Journal of Research in Computer Science
eISSN 2249-8265 Volume 2 Issue 1 (2011) pp. 29-31
© White Globe Publications
www.ijorcs.org
A COMPARATIVE STUDY ON DISTANCE MEASURING
APPROACHES FOR CLUSTERING
Shraddha Pandit, Suchita Gupta
Assistant Professor, Gyan Ganga Institute of Information Technology and Management, Bhopal
Abstract: Clustering plays a vital role in many areas of research, such as data mining, image retrieval and bio-computing. The distance measure plays an important role in clustering data points, and choosing the right distance measure for a given dataset is a big challenge. In this paper we study various distance measures and their effect on clustering. The paper surveys existing distance measures for clustering and presents a comparison between them based on application domain, efficiency, benefits and drawbacks. This comparison helps researchers make a quick decision about which distance measure to use for clustering. We conclude this work by identifying trends and challenges in research and development towards clustering.

Keywords: Clustering, Distance Measure, Clustering Algorithms
I. INTRODUCTION

Clustering is an important data mining technique that has a wide range of applications in areas such as biology, medicine, market research and image analysis. It is the process of partitioning a set of objects into subsets such that the data in each subset are similar to each other. In cluster analysis, both the distance measure and the clustering algorithm play an important role [1].

An important step in any clustering task is selecting a distance measure, which determines how the similarity [1] of two elements is calculated. This choice influences the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another.

It is expected that the distance between objects within a cluster is minimal, while the distance between objects in different clusters is maximal. In this paper we compare different distance measures. The comparison shows that different distance measures behave differently depending on the application domain. The rest of the paper is organized as follows: in Section II we discuss distance measures and their significance in a nutshell; in Section III we present the comparison between these distance measures in TABLE I; in Section IV we describe how accuracy can be measured and interpret the comparison; and we conclude the report.
II. DISTANCE MEASURES AND THEIR SIGNIFICANCE

A cluster is a collection of data objects that are similar to objects within the same cluster and dissimilar to those in other clusters. The similarity between two objects is calculated using a distance measure [6]. Since clustering forms groups, it can be used as a pre-processing step for methods such as classification.

Many distance measures have been proposed in the literature for data clustering. Most often, these measures are metric functions: Manhattan distance, Minkowski distance and Hamming distance. The Jaccard index, cosine similarity and the Dice coefficient are also popular distance measures. For non-numeric datasets, special distance functions have been proposed; for example, edit distance is a well-known distance measure for text attributes.

In this section we briefly describe seven commonly used distance measures.
A. Euclidean Distance

The Euclidean distance, or Euclidean metric, is the ordinary distance between two points that one would measure with a ruler: the straight-line distance between the two points.

In a plane with p1 at (x1, y1) and p2 at (x2, y2), it is √((x1 - x2)² + (y1 - y2)²). In N dimensions, the Euclidean distance between two points p and q is √(Σ i=1..N (pi - qi)²), where pi (or qi) is the coordinate of p (or q) in dimension i.
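As a minimal illustrative sketch (ours, not part of the original paper), the N-dimensional definition above translates directly into code; the function name and sample points are our own:

```python
import math

def euclidean_distance(p, q):
    # Straight-line distance between two equal-length coordinate sequences.
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Example: sqrt((1 - 4)^2 + (2 - 6)^2) = sqrt(25) = 5.0
print(euclidean_distance((1, 2), (4, 6)))
```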
B. Manhattan Distance

The Manhattan distance is the distance between two points measured along axes at right angles. In a plane with p1 at (x1, y1) and p2 at (x2, y2), it is |x1 - x2| + |y1 - y2|. This is easily generalized to higher dimensions. Manhattan distance is often used in integrated circuit design, where wires only run parallel to the X or Y axis. It is also known as rectilinear distance, the Minkowski [7] [3] L1 distance, the taxicab metric, or the city block distance.
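A matching sketch (again ours) for the N-dimensional generalization mentioned above:

```python
def manhattan_distance(p, q):
    # City-block (L1) distance: sum of absolute coordinate differences.
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

# Example: |1 - 4| + |2 - 6| = 7
print(manhattan_distance((1, 2), (4, 6)))
```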
C. Bit-Vector Distance

An N × N matrix Mb is calculated, where each point has d dimensions and Mb(Pi, Pj) is determined as a d-bit vector. This vector is obtained as follows: if the numerical value of the x-th dimension of point Pi is greater than the numerical value of the x-th dimension of point Pj, then bit x of Mb(Pi, Pj) is set to 1 and bit x of Mb(Pj, Pi) is set to 0. All the bit vectors in Mb are then converted to integers.
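A short sketch of this construction as we read it; the variable names and the choice of storing bit x as the x-th least significant bit of the resulting integer are our assumptions:

```python
def bit_vector_matrix(points):
    # Mb[i][j] is a d-bit vector, stored as an integer, whose bit x is 1
    # when dimension x of point i exceeds dimension x of point j.
    # Bit x of Mb[j][i] is left at 0 in that case, as described above.
    n = len(points)
    d = len(points[0])
    mb = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            value = 0
            for x in range(d):
                if points[i][x] > points[j][x]:
                    value |= 1 << x  # set bit x
            mb[i][j] = value
    return mb

# Example with three 2-dimensional points:
print(bit_vector_matrix([(1, 5), (2, 3), (0, 0)]))  # [[0, 2, 3], [1, 0, 3], [0, 0, 0]]
```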
D. Hamming Distance

The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ.

Let x, y ∈ Aⁿ be strings of length n over an alphabet A. We define the Hamming distance between x and y, denoted dH(x, y), to be the number of places where x and y differ.

The Hamming distance [1] [6] can be interpreted as the number of bits which need to be changed (corrupted) to turn one string into the other. Sometimes the number of characters is used instead of the number of bits. The Hamming distance can also be seen as the Manhattan distance between bit vectors.
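As a brief illustration (our own helper), the definition amounts to counting mismatched positions:

```python
def hamming_distance(x, y):
    # Number of positions at which two equal-length strings differ.
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(x, y))

# Example: "karolin" and "kathrin" differ in 3 positions
print(hamming_distance("karolin", "kathrin"))
```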
E. Jaccard Index

The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for comparing the similarity and diversity of sample sets.

The Jaccard coefficient [11] measures similarity between sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:

J(A, B) = |A ∩ B| / |A ∪ B|
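A minimal sketch over Python sets (treating two empty sets as having similarity 1.0 is our own convention, not the paper's):

```python
def jaccard_index(a, b):
    # |A intersect B| / |A union B| for two sets.
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Example: {0, 1, 2} and {1, 2, 3} share 2 of 4 distinct elements
print(jaccard_index({0, 1, 2}, {1, 2, 3}))  # 0.5
```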
F. Cosine Index

The cosine index is a measure of similarity between two vectors of n dimensions obtained by finding the angle between them; it is often used to compare documents in text mining. Given two vectors of attributes, A and B, the cosine similarity [11], θ, is expressed using a dot product and magnitudes as:

θ = arccos(A · B / (‖A‖ ‖B‖))

For text matching, the attribute vectors A and B are usually the tf-idf vectors of the documents. Since the angle θ lies in the range [0, π], a value of π means exactly opposite, π/2 means independent, and 0 means exactly the same, with in-between values indicating intermediate similarities or dissimilarities.
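A small sketch computing θ as defined above; clamping the cosine to [-1, 1] is our own guard against floating-point rounding:

```python
import math

def cosine_angle(a, b):
    # Angle arccos(A.B / (||A|| ||B||)) between two vectors, in radians.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return math.acos(max(-1.0, min(1.0, dot / (norm_a * norm_b))))

# Example: perpendicular vectors give pi/2, i.e. "independent"
print(cosine_angle((1, 0), (0, 1)))  # ~1.5708
```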
G. Dice Index

Dice's coefficient [11] (also known as the Dice coefficient) is a similarity measure related to the Jaccard index.

For sets X and Y of keywords used in information retrieval, the coefficient may be defined as:

S = 2|X ∩ Y| / (|X| + |Y|)

When taken as a string similarity measure, the coefficient may be calculated for two strings x and y using bigrams as follows:

S = 2nt / (nx + ny)

where nt is the number of character bigrams found in both strings, nx is the number of bigrams in string x, and ny is the number of bigrams in string y.
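A minimal sketch of the bigram-based form; counting shared bigrams with multiplicity is our reading of nt:

```python
from collections import Counter

def dice_coefficient(x, y):
    # S = 2*nt / (nx + ny) over character bigrams of the two strings.
    bigrams_x = [x[i:i + 2] for i in range(len(x) - 1)]
    bigrams_y = [y[i:i + 2] for i in range(len(y) - 1)]
    if not bigrams_x and not bigrams_y:
        return 1.0  # our convention for two strings with no bigrams
    nt = sum((Counter(bigrams_x) & Counter(bigrams_y)).values())
    return 2 * nt / (len(bigrams_x) + len(bigrams_y))

# Example: "night" and "nacht" share only the bigram "ht": 2*1 / (4 + 4) = 0.25
print(dice_coefficient("night", "nacht"))
```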
III. ACCURACY AND RESULT INTERPRETATION

In general, the larger the number of sub-clusters produced by the clustering, the more accurate the final result is. However, too many sub-clusters will slow down the clustering. TABLE I compares the proximity measures discussed above based on four different criteria that are generally required to decide upon a distance measure and a clustering algorithm.

All of the above comparisons are tested using a standard synthetic dataset generated by the Syndeca [3] software, and a few are tested using the open-source clustering tool CLUTO.

IV. CONCLUSION

This paper surveys existing proximity measures for clustering and presents a comparison between them based on application domain, efficiency, benefits and drawbacks. This comparison helps researchers make a quick decision about which distance measure to use for clustering. We ran our experiments on synthetic datasets for validation. Future work involves running the experiments on larger and different kinds of datasets and extending our study to other proximity measures and clustering algorithms.
TABLE I: COMPARISON OF DISTANCE MEASURES

Euclidean
  Formula: √((x1 - x2)² + (y1 - y2)²)
  Algorithms in which it is used: Partitional algorithms, K-Modes, AutoClass, ROCK
  Benefits: Easy to implement and test
  Drawbacks: Results are greatly influenced by variables that have the largest values; does not work well for image data or document classification
  Application area: Applications involving interval data; health psychology analysis; DNA analysis

Manhattan
  Formula: |x1 - x2| + |y1 - y2|
  Algorithms in which it is used: Partitional algorithms
  Benefits: Easily generalized to higher dimensions
  Drawbacks: Does not work well for image data or document classification
  Application area: Integrated circuits

Cosine Similarity
  Formula: θ = arccos(A · B / (‖A‖ ‖B‖))
  Algorithms in which it is used: Ontology- and graph-based algorithms
  Benefits: Handles both continuous and categorical variables
  Drawbacks: Does not work well for nominal data
  Application area: Text mining

Jaccard Index
  Formula: J(A, B) = |A ∩ B| / |A ∪ B|
  Algorithms in which it is used: Neural networks
  Benefits: Handles both continuous and categorical variables
  Drawbacks: Does not work well for nominal data
  Application area: Document classification
V. REFERENCES

[1] Ankita Vimal, Satyanarayana R Valluri and Kamalakar Karlapalem, "An Experiment with Distance Measures for Clustering", Technical Report IIIT/TR/2008/132, 2008.
[2] John W. Ratcliff and David E. Metzener, "Pattern Matching: The Gestalt Approach", Dr. Dobb's Journal, July 1988, p. 46.
[3] Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", AAAI Press, 1996, pp. 226-231.
[4] Aharon Bar-Hillel, Tomer Hertz, Noam Shental and Daphna Weinshall, "Learning Distance Functions using Equivalence Relations", Proceedings of the 20th International Conference on Machine Learning (ICML), 2003.
[5] Keinosuke Fukunaga, "Introduction to Statistical Pattern Recognition", 2nd edition, Academic Press, San Diego, 1990.
[6] Rui Xu and Donald Wunsch, "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, vol. 16, no. 3, May 2005. doi:10.1109/TNN.2005.845141
[7] http://en.wikipedia.org/wiki/Data_clustering
[8] http://en.wikipedia.org/wiki/K-means
[9] http://en.wikipedia.org/wiki/DBSCAN
[10] http://en.wikipedia.org/wiki/Jaccard_index
[11] http://en.wikipedia.org/wiki/Dice_coefficient
How to cite
Shraddha Pandit, Suchita Gupta, "A Comparative Study on Distance Measuring Approaches for Clustering".
International Journal of Research in Computer Science, 2 (1): pp. 29-31, December 2011.
doi:10.7815/ijorcs.21.2011.011