Abstract
Density based clustering algorithms (DBCLAs) rely on the notion of density to identify clusters of arbitrary shapes, sizes with varying densities. Existing surveys on DBCLAs cover only a selected set of algorithms. These surveys fail to provide an extensive information about a variety of DBCLAs proposed till date including a taxonomy of the algorithms. In this paper we present a comprehensive survey of various DBCLAs over last two decades along with their classification. We group the DBCLAs in each of the four categories: density definition, parameter sensitivity, execution mode and nature of data and further divide them into various classes under each of these categories. In addition, we compare the DBCLAs through their common features and variations in citation and conceptual dependencies. We identify various application areas of DBCLAs in domains such as astronomy, earth sciences, molecular biology, geography, multimedia. Our survey also identifies probable future directions of DBCLAs where involvement of density based methods may lead to favorable results.
Similar content being viewed by others
References
Jain A K, Murty M N, Flynn P J. Data clustering: a review. ACM Computing Surveys, 1999, 31(3): 264–323
Yan Z, Luo W, Bu C, Ni L. Clustering spatial data by the neighbors intersection and the density difference. In: Proceedings of the 3rd IEEE/ACM International Conference on Big Data Computing Applications and Technologies (BDCAT). 2016, 217–226
Xu R, Wunsch D. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 2005, 16(3): 645–678
Ertöz L, Steinbach M, Kumar V. Finding clusters of different sizes, shapes and densities in noisy, high dimensional data. In: Proceedings of the 2003 SIAM International Conference on Data Mining. 2003, 47–58
Karypis G, Han E H, Kumar V. Chameleon: hierarchical clustering using dynamic modeling. Computer, 1999, 32(8): 68–75
Ester M, Kriegel H P, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceeedings of International Conference on KDD. 1996, 226–231
Keogh E, Mueen A. Encyclopedia of Machine Learning and Data Mining. Curse of Dimensionality. 2nd ed. Springer, Boston, MA, 2017, 314–315
Raymond T N, Han J. Clarans: a method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering, 2002, 14(5): 1003–1016
Hartigan J A, Wong M A. Algorithm as 136: a k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 1979, 28(1): 100–108
Silverman B W. Density Estimation for Statistics and Data Analysis. Chapman and Hall/CRC press, 1998
Aggarwal C, Reddy C K. Data Clustering: Algorithms and Applications. CRC press, 2013
Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the ACM SIGMOD International Conference on Management of data. 1998, 94–105
Parimala M, Lopez D, Senthilkumar N C. A survey on density based clustering algorithms for mining large spatial databases. International Journal of Advanced Science and Technology, 2011, 31(1): 59–66
Ankerst M, Breunig M M, Kriegel H P, Sander J. Optics: ordering points to identify the clustering structure. ACM Sigmod Record, 1999, 28: 49–60
Ram A, Jalal S, Jalal A S, Kumar M. A density based algorithm for discovering density varied clusters in large spatial databases. International Journal of Computer Applications, 2010, 3(6): 1–4
Birant D, Kut A. ST-DBSCAN: an algorithm for clustering spatial-temporal data. Data & Knowledge Engineering, 2007, 60(1): 208–221
Bhuyan R, Borah S. A survey of some density based clustering techniques. In: Proceedings of the 2013 Conference on Advancements in Information, Computer and Communication. 2013
Chaudhary S, Chauhan R, Batra P. A survey of density based clustering algorithms. International Journal of Computer Science and Technology, 2014, 5(2): 169–171
Hinneburg A, Keim D A. An efficient approach to clustering in large multimedia databases with noise. In: Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining. 1998, 58–65
Nagpal P B, Mann P A. Comparative study of density based clustering algorithms. International Journal of Computer Applications, 2011, 27(11): 421–435
Xu X, Ester M, Kriegel H P, Sander J. A distribution-based clustering algorithm for mining in large spatial databases. In: Proceedings of the 14th IEEE International Conference on Data Engineering. 1998, 324–331
Thiagarasu V, Prabahari R. A comparative analysis of density based clustering techniques for outlier mining. International Journal of Engineering Sciences and Research Technology, 2014, 3(11): 132–136
Rajavat A, Dubey P. Comparative study between density based clustering — DBSCAN and OPTICS. International Journal of Advance Computational Engineering and Networking, 2016, 4(12): 34–37
Smiti A, Eloudi Z. Soft DBSCAN: improving DBSCAN clustering method using fuzzy set theory. In: Proceedings of the 6th International Conference on Human System Interactions. 2013, 380–385
Loh W K, Park Y H. A survey on density-based clustering algorithms. In: Jeong Y S, Park Y H, Hsu C H, Park J, eds. Ubiquitous Information Technologies and Applications. Springer, Berlin, Heidelberg, 2014, 775–780
Xu X, Jäger J, Kriegel H P. A fast parallel clustering algorithm for large spatial databases. In: Guo Y, Grossman R, eds. High Performance Data Mining. Springer, Boston, MA, 1999, 263–290
Böhm C, Noll R, Plant C, Wackersreuther B. Density-based clustering using graphics processors. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management. 2009, 661–670
Loh W K, Moon Y S, Park Y H. Erratum: fast density-based clustering using graphics processing units. IEICE TRANSACTIONS on Information and Systems, 2014, 97(7): 1947–1951
Ting K M, Zhu Y, Carman M J, Zhu Y, Zhou Z H. Overcoming key weaknesses of distance-based neighbourhood methods using a data dependent dissimilarity measure. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016, 1205–1214.
Sander J, Ester M, Kriegel H P, Xu X. Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery, 1998, 2(2): 169–194
Ester M, Kriegel H P, Sander J, Wimmer M, Xu X. Incremental clustering for mining in a data warehousing environment. In: Proceedings of the 24th International Conference on Very Large Data Bases. 1998, 323–333
Zhou S, Zhou A, Cao J, Wen J, Fan Y, Hu Y. Combining sampling technique with DBSCAN algorithm for clustering large spatial databases. In: Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining. 2000, 169–172
Jarvis R A, Patrick E A. Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, 1973, 100(11): 1025–1034
Januzaj E, Kriegel H P, Pfeifle M. DBDC: density based distributed clustering. Advances in Database Technology-EDBT 2004, 2004, 529–530
Borah B, Bhattacharyya D K. An improved sampling-based DBSCAN for large spatial databases. In: Proceedings of IEEE International Conference on Intelligent Sensing and Information Processing. 2004, 92–96
Tsai CF, Liu CW. KIDBSCAN: anew efficient data clustering algorithm. In: Proceedings of International Conference on Artifical Intelligence and Soft Computing. 2006, 702–711
Kisilevich S, Mansmann F, Keim D. P-DBSCAN: a density based clustering algorithm for exploration and analysis of attractive areas using collections of geo-tagged photos. In: Proceedings of the 1st International Conference and Exhibition on Computing for Geospatial Research and Application. 2010
An N, Khac L, Kechadi M T. On a distributed approach for density-based clustering. In: Proceedings of the 10th International Conference on Machine Learning and Applications and Workshops. 2011, 283–286
He Y, Tan H, Luo W, Mao H, Ma D, Feng S, Fan J. MR-DBSCAN: an efficient parallel density-based clustering algorithm using MapReduce. In: Proceedings of the 17th IEEE International Conference on Parallel and Distributed Systems. 2011, 473–480
Campello R J, Moulavi D, Sander J. Density-based clustering based on hierarchical density estimates. In: Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining. 2013, 160–172
Yu Y, Zhao J, Wang X, Wang Q, Zhang Y. Cludoop: an efficient distributed density-based clustering for big data using hadoop. International Journal of Distributed Sensor Networks, 2015, 11(6): 379–391
Hou J, Gao H, Li X. DSets-DBSCAN: a parameter-free clustering algorithm. IEEE Transactions on Image Processing, 2016, 25(7): 3182–3193
Pavan M, Pelillo M. Dominant sets and pairwise clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006, 29(1): 167–172
Gan J, Tao Y. Dynamic density based clustering. In: Proceedings of the 2017 ACM International Conference on Management of Data. 2017, 1493–1507
Wang W, Yang J, Muntz R. Sting: a statistical information grid approach to spatial data mining. In: Proceedings of the 23rd International Conference on Very Large Data Bases. 1997, 186–195
Davis L. Bit-climbing, representational bias, and test suite design. In: Proceedings of the 4th International Conference on Genetic Algorithms. 1991, 18–23
Hinneburg A, Keim D A. Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering. In: Proceedings of the 25th International Conference on Very Large Data Bases. 1999
Zhang T, Ramakrishnan R, Livny M. BIRCH: a new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1997, 1(2): 141–182
Sheikholeslami G, Chatterjee S, Zhang A. WaveCluster: a multi-resolution clustering approach for very large spatial databases. In: Proceedings of the 24th International Conference on Very Large Data Bases. 1998, 428–439
Aggarwal C C. A human-computer interactive method for projected clustering. IEEE Transactions on Knowledge and Data Engineering, 2004, 16(4): 448–460
Chen Y, Tu L. Density-based clustering for real-time stream data. In: Proceedings of the 13thACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2007, 133–142
Aggarwal C C. Outlier Analysis. In: Data Mining. Springer, 2015, 237–263
Kriegel H P, Pfeifle M. Hierarchical density-based clustering of uncertain data. In: Proceedings of the 5th IEEE International Conference on Data Mining. 2005
Kriegel H P, Pfeifle M. Density-based clustering of uncertain data. In: Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. 2005, 672–677
Stonebraker M, Frew J, Dozier J. The SEQUOIA 2000 Project. In: Proceedings of International Symposium on Spatial Databases. 1993, 397–412
Zhang T, Ramakrishnan R, Livny M. BIRCH: an efficient data clustering method for very large databases. ACM Sigmod Record, 1996, 25(2): 103–114
Zheng Y, Xie X, Ma W Y. Geolife: a collaborative social networking service among user, location and trajectory. IEEE Data Engineering Bulletin, 2010, 33(2): 32–39
Yuan J, Zheng Y, Xie X, Sun G. T-drive: enhancing driving directions with taxi drivers’ intelligence. IEEE Transactions on Knowledge and Data Engineering, 2011, 25(1): 220–232, 2011
Yeung K Y, Fraley C, Murua A, Raftery A E, Ruzzo W L. Model-based clustering and data transformations for gene expression data. Bioinformatics, 2001, 17(10): 977–987
Bishop C M. Pattern Recognition and Machine Learning. Springer, 2006
Lu C D, Zhang T Y, Du X Z, Li C P. A robust kernel PCA algorithm. In: Proceedings of 2004 International Conference on Machine Learning and Cybernetics. 2004, 3084–3087
Balakrishnama S, Ganapathiraju A. Linear discriminant analysis-a brief tutorial. Institute for Signal and Information Processing, 1998, 18: 1–8
Baudat G, Anouar F. Generalized discriminant analysis using a kernel approach. Neural Computation, 2000, 12(10): 2385–2404
Epanechnikov V A. Non-parametric estimation of a multivariate probability density. Theory of Probability & Its Applications, 1969, 14(1): 153–158
Guha S, Rastogi R, Shim K. CURE: an efficient clustering algorithm for large databases. ACM Sigmod Record, 1998, 27(2): 73–84
Tavallaee M, Bagheri E, Lu W, Ghorbani A A. A detailed analysis of the KDD cup 99 data set. In: Proceedings of IEEE Symposium on Computational Intelligence for Security and Defense Applications. 2009, 1–6
Bader G D, Hogue C W V. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 2003, 4(1): 2
Connolly M L. Measurement of protein surface shape by solid angles. Journal of Molecular Graphics, 1986, 4(1): 3–6
Becker R H, White R L, Helfand D J. The first survey: faint images of the radio sky at twenty centimeters. The Astrophysical Journal, 1995, 450: 559
Zepka A F, Cordes J M, Wasserman I. Signal detection amid noise with known statistics. The Astrophysical Journal, 1994, 427: 438–445
Weir N, Fayyad U M, Djorgovski S. Automated star/galaxy classification for digitized POSS-II. The Astronomical Journal, 1995, 109: 2401
Chrisman N R, Dougenik J A, White D. Lessons for the design of polygon overlay processing from the odyssey whirlpool algorithm. In: Proceedings of the 5th International Symposium on Spatial Data Handling. 1992, 401–410
Du Q, Dong Z, Huang C, Ren F. Density-based clustering withgeographical background constraints using a semantic expression model. ISPRS International Journal of Geo-Information, 2016, 5(5): 72
Hendler J. Data integration for heterogenous datasets. Big Data, 2014, 2(4): 205–215
Allen T E, Herrgård M J, Liu M, Qiu Y, Glasner J, Blattner F R, Palsson B. Genome-scale analysis of the uses of the escherichia coli genome: model-driven analysis of heterogeneous data sets. Journal of Bacteriology, 2003, 185(21): 6392–6399
Pavlidis P, Weston J, Cai J, Grundy W N. Gene functional classification from heterogeneous data. In: Proceedings of the 5thAnnual International Conference on Computational Biology, 2001, 249–255
Angelier J. Tectonic analysis of fault slip data sets. Journal of Geophysical Research: Solid Earth, 1984, 89(B7): 5835–5848
Kowalski M, Rubin D, Aldering G, Agostinho R J, Amadon A, Amanullah R, Balland C, Barbary K, Blanc G, Challis P J, et al. Improved cosmological constraints from new, old, and combined supernova data sets. The Astrophysical Journal, 2008, 686(2): 749
Nguyen M D, Shin W Y. DBSTexC: density-based spatio-textual clustering on twitter. In: Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. 2017, 23–26
Author information
Authors and Affiliations
Corresponding author
Additional information
Panthadeep Bhattacharjee is currently a PhD research scholar in the Department of Computer Science & Engineering, Indian Institute of Technology (IIT) Guwahati, India. He obtained his BTech in Information Technology from Assam University Silchar, India in 2010 and MTech in Computer Science & Engineering from National Institute of Technology Durgapur (NIT), India in 2012. He worked as assistant professor in the School of Computer Engineering, Kalinga Institute of Industrial Technology (KIIT), Bhubaneswar, India from June 2012 to December 2013. He joined IIT Guwahati in January 2014. His research interest includes pattern recognition, machine learning, data science, online algorithms. He has been a winner in the Artificial Intelligence-Design Challenge Championship held at International Institute of Information Technology (HIT) — Bangalore in 2018.
Pinaki Mitra is currently an associate professor at the Department of Computer Science & Engineering, Indian Institute of Technology (IIT) Guwahati, India. He obtained his BTech in Computer Science & Engineering from Jadavpur University, Kolkata, India in 1987 and MTech in Computer Science & Engineering from Indian Institute of Science (IISc) Bangalore, India in 1989. He received his PhD from Simon Fraser University, Canada in 1994. He worked as a Research Scientist at Center for Microprocessor Training Education and Research, Jadavpur University. Subsequently he joined National Institute of Management, Kolkata and served as an assistant professor. He joined IIT Guwahati in December 2004. His research interest includes cryptography, network security, computer graphics, multimedia, pattern recognition and machine learning.
Electronic Supplementary Material
Rights and permissions
About this article
Cite this article
Bhattacharjee, P., Mitra, P. A survey of density based clustering algorithms. Front. Comput. Sci. 15, 151308 (2021). https://doi.org/10.1007/s11704-019-9059-3
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s11704-019-9059-3