DOI: 10.1145/502512.502549

A robust and scalable clustering algorithm for mixed type attributes in large database environment

Published: 26 August 2001

Abstract

Clustering is widely used in data mining applications to discover patterns in the underlying data. Most traditional clustering algorithms are limited to datasets that contain either continuous or categorical attributes, but not both. However, datasets with mixed attribute types are common in real-life data mining problems. In this paper, we propose a distance measure that enables clustering of data with both continuous and categorical attributes. The measure is derived from a probabilistic model in which the distance between two clusters is equivalent to the decrease in the log-likelihood function that results from merging them. Calculating this measure is memory efficient, because it depends only on the pair of clusters being merged and not on any of the other clusters. Zhang et al. [8] proposed a clustering method named BIRCH that is especially suitable for very large datasets. We develop a clustering algorithm that uses our distance measure within the BIRCH framework. Like BIRCH, our algorithm first performs a pre-clustering step: it scans the entire dataset and stores the dense regions of data records as summary statistics. A hierarchical clustering algorithm is then applied to cluster the dense regions. Apart from its ability to handle mixed-type attributes, our algorithm differs from BIRCH in two ways: it adds a procedure that automatically determines the appropriate number of clusters, and it uses a new strategy for assigning cluster membership to noisy data. For data with mixed-type attributes, our experimental results confirm that the algorithm not only generates better-quality clusters than the traditional k-means algorithms, but also exhibits good scalability and correctly identifies the underlying number of clusters in the data. The algorithm is implemented in the commercial data mining tool Clementine 6.0, which supports the PMML standard for data mining model deployment.
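
To illustrate the distance measure described in the abstract, the following Python sketch computes the decrease in log-likelihood caused by merging two clusters, using only per-cluster summary statistics. The abstract does not spell out the probabilistic model, so the sketch assumes independent Gaussian continuous attributes and independent multinomial categorical attributes. The names ClusterSummary, summarize, log_likelihood, and merge_distance, as well as the var_floor parameter, are hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

Row = Tuple[Sequence[float], Sequence[str]]  # (continuous values, categorical values)


@dataclass
class ClusterSummary:
    """Hypothetical per-cluster summary statistics (not the paper's data structure)."""
    n: int                        # number of records in the cluster
    sums: List[float]             # per continuous attribute: sum of values
    sq_sums: List[float]          # per continuous attribute: sum of squared values
    counts: List[Dict[str, int]]  # per categorical attribute: category -> count

    def merge(self, other: "ClusterSummary") -> "ClusterSummary":
        """Summaries are additive, so merging never revisits the raw records."""
        counts = []
        for mine, theirs in zip(self.counts, other.counts):
            merged = dict(mine)
            for category, count in theirs.items():
                merged[category] = merged.get(category, 0) + count
            counts.append(merged)
        return ClusterSummary(
            self.n + other.n,
            [x + y for x, y in zip(self.sums, other.sums)],
            [x + y for x, y in zip(self.sq_sums, other.sq_sums)],
            counts,
        )


def summarize(rows: Sequence[Row]) -> ClusterSummary:
    """Build a summary from raw rows in a single pass."""
    n_cont, n_cat = len(rows[0][0]), len(rows[0][1])
    sums, sq_sums = [0.0] * n_cont, [0.0] * n_cont
    counts: List[Dict[str, int]] = [{} for _ in range(n_cat)]
    for cont, cat in rows:
        for k, x in enumerate(cont):
            sums[k] += x
            sq_sums[k] += x * x
        for k, value in enumerate(cat):
            counts[k][value] = counts[k].get(value, 0) + 1
    return ClusterSummary(len(rows), sums, sq_sums, counts)


def log_likelihood(c: ClusterSummary, var_floor: Sequence[float]) -> float:
    """Fitted log-likelihood of one cluster, omitting per-record constants.

    Assumption: var_floor[k] is a positive value added to each cluster
    variance so the logarithm stays finite for single-record clusters.
    The omitted constants cancel when clusters are merged.
    """
    ll = 0.0
    for s, sq, floor in zip(c.sums, c.sq_sums, var_floor):
        variance = max(sq / c.n - (s / c.n) ** 2, 0.0)
        ll -= 0.5 * c.n * math.log(variance + floor)
    for table in c.counts:
        for count in table.values():
            if count > 0:
                ll += count * math.log(count / c.n)
    return ll


def merge_distance(a: ClusterSummary, b: ClusterSummary,
                   var_floor: Sequence[float]) -> float:
    """Distance = decrease in total log-likelihood if a and b are merged."""
    return (log_likelihood(a, var_floor) + log_likelihood(b, var_floor)
            - log_likelihood(a.merge(b), var_floor))


# Usage: one continuous and one categorical attribute per record.
a = summarize([([1.0], ["x"]), ([1.2], ["x"]), ([0.8], ["x"])])
near = summarize([([1.1], ["x"]), ([0.9], ["x"])])
far = summarize([([9.0], ["y"]), ([9.5], ["y"])])
floor = [1.0]  # assumed variance floor for the single continuous attribute
d_near = merge_distance(a, near, floor)
d_far = merge_distance(a, far, floor)
print(d_near < d_far)
```

Because merge_distance reads only the two summaries passed to it, the cost of evaluating a candidate merge does not grow with the number of other clusters, which is the memory-efficiency property the abstract points to.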

References

[1] Banfield, J. D., and Raftery, A. E. (1993). Model-based Gaussian and Non-Gaussian Clustering. Biometrics, 49, p. 803-821.
[2] Fraley, C., and Raftery, A. E. (1998). How Many Clusters? Which Clustering Method? Answers via Model-based Cluster Analysis. Computer Journal, 41, p. 578-588.
[3] Ganti, V., Gehrke, J., and Ramakrishnan, R. (1999). CACTUS - Clustering Categorical Data Using Summaries. In Proceedings of 1999 SIGKDD Conference, p. 73-82.
[4] Guha, S., Rastogi, R., and Shim, K. (1999). ROCK: A Robust Clustering Algorithm for Categorical Attributes. URL: http://www.bell-labs.com/project/serendip/.
[5] Huang, Z. (1998). Extensions to the K-means Algorithm for Clustering Large Datasets with Categorical Values. Data Mining and Knowledge Discovery, 2, p. 283-304.
[6] MacQueen, J. B. (1967). Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, p. 281-297.
[7] Schwarz, G. (1978). Estimating the Dimension of a Model. The Annals of Statistics, 6, p. 461-464.
[8] Zhang, T., Ramakrishnan, R., and Livny, M. (1997). BIRCH: A New Data Clustering Algorithm and its Applications. Data Mining and Knowledge Discovery, 1 (2).

Published In

KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
August 2001
493 pages
ISBN: 158113391X
DOI: 10.1145/502512

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Mixed type of attributes
  2. clustering
  3. log-likelihood
  4. noisy data
  5. number of clusters

Qualifiers

  • Article

Conference

KDD01

Acceptance Rates

KDD '01 Paper Acceptance Rate: 31 of 237 submissions, 13%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%

Cited By

  • (2024) Integrating Heritage and Environment: Characterization of Cultural Landscape in Beijing Great Wall Heritage Area. Land 13:4 (536). DOI: 10.3390/land13040536. Online publication date: 17-Apr-2024
  • (2024) A study on the stratification of long-tail customers in civil aviation based on a cluster ensemble. Journal of Intelligent & Fuzzy Systems 46:3 (5783-5799). DOI: 10.3233/JIFS-234155. Online publication date: 5-Mar-2024
  • (2024) Digitalization and digital technologies: The obstacles to adaptation among Hungarian farmers. Equilibrium. Quarterly Journal of Economics and Economic Policy 19:3 (1075-1110). DOI: 10.24136/eq.3237. Online publication date: 27-Sep-2024
  • (2024) Financial Situation of Cities in the Lodz Voivodeship in the Era of the COVID-19 Pandemic – Trends of Change and Impact on Indebtedness. Optimum. Economic Studies (110-133). DOI: 10.15290/oes.2024.01.115.06. Online publication date: 2024
  • (2024) Early trajectories and moderators of autistic language profiles: A longitudinal study in preschoolers. Autism. DOI: 10.1177/13623613241253015. Online publication date: 21-May-2024
  • (2024) The relationship between students' self‐regulated learning behaviours and problem‐solving efficiency in technology‐rich learning environments. Journal of Computer Assisted Learning. DOI: 10.1111/jcal.13043. Online publication date: 23-Jul-2024
  • (2024) Network analysis to prioritize issues for intervention to improve the health‐related quality of life of people with HIV in Spain. HIV Medicine. DOI: 10.1111/hiv.13693. Online publication date: 10-Aug-2024
  • (2024) Exploring the distribution and cognitive profiles of poor readers across varying levels of reading difficulty: implications for identification and support. Journal of Research in Reading. DOI: 10.1111/1467-9817.12454. Online publication date: 25-Apr-2024
  • (2024) Towards Robust LIDAR Lane Clustering for Autonomous Vehicle Perception in ROS 2. 2024 IEEE International Conference on Mobility, Operations, Services and Technologies (MOST) (229-234). DOI: 10.1109/MOST60774.2024.00031. Online publication date: 1-May-2024
  • (2024) Private investigations and ethical orientations: a cause for concern? Journal of Financial Crime. DOI: 10.1108/JFC-02-2024-0066. Online publication date: 7-May-2024
