Abstract
This paper applies divide and conquer approach in an iterative way to handle the clustering process. The target is a parallelized effective and efficient approach that produces the intended clustering result. We achieve scalability by first partitioning a large dataset into subsets of manageable sizes based on the specifications of the machine to be used in the clustering process; then cluster the partitions separately in parallel. The centroid of each obtained cluster is treated like the root of a tree with instances in its cluster as leaves. The partitioning and clustering process is iteratively applied on the centroids with the trees growing up until we get the final clustering; the outcome is a forest with one tree per cluster. Finally, a conquer process is performed to get the actual intended clustering, where each instance (leaf node) belongs to the final cluster represented by the root of its tree. We use multi-objective genetic algorithm combined with validity indices to decide on the number of classes. This approach fits well for interactive online clustering. It facilitates for incremental clustering because chunks of instances are clustered as stand alone sets, and then the results are merged with existing clusters. This is attractive and feasible because we consider the clustering of only centroids after the first clustering stage. The reported test results demonstrate the applicability and effectiveness of the proposed approach.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Aggarwal CC, Yu PS (2000) Finding generalized projected clusters in high dimensional spaces. In: Proceedings of ACM SIGMOD international conference on management of data, pp 70–81
Aggarwal CC, Procopiuc CM, Wolf JL, Yu PS, Park JS (1999) Fast algorithms for projected clustering. In: Proceedings ACM SIGMOD international conference on management of data, pp 61–72
Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of ACM SIGMOD international conference on management of data, pp 94–105
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of the international conference on very large data bases, pp 487–499
Andersson J (2000) A survey of multiobjective optimization in engineering design. Technical report LiTH-IKP-R-1097, Department of Mechanical Engineering, Linkping University, Linkping, Sweden
Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) Optics: ordering points to identify the clustering structure. In: Proceedings of ACM SIGMOD conference on management of data, pp 49–60
Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49:803–821
Ben-Dor A, Shamir R, Yahkini Z (1999) Clustering gene expression patterns. J Comput Biol 6(3):281–297
Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York
Bezdek JC, Boggavarapu S, Hall LO, Bensaid A (1994) Genetic algorithm guided clustering. In: Proceedings of the international conference on evolutionary computation, pp 34–39
Bezdek JC, Hathaway RJ (1994) Optimization of fuzzy clustering criteria using genetic algorithms. In: Proceedings of the international conference on evolutionary computation, pp 589–594
Cheng C-H, Fu AW, Zhang Y (1999) Entropy-based subspace clustering for mining numerical data. In: Proceedings of ACM SIGKDD international conference on knowledge discovery and data mining, pp 84–93
Cole RM (1998) Clustering with genetic algorithms. Master’s thesis, Nedlands 6907, Australia
Deb K, Agrawal S, Pratab A, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput 6(2):182–197
Demiriz A, Bennett K, Embrechts M (1999) Semi-supervised clustering using genetic algorithms. Technical report 9901, Rensselaer Polytechnic Institute, Troy, NY
Dimitriadou E, Dolnicar S, Weingessel A (2002) An examination of indexes for determining the number of clusters in binary data sets. Psychometrika 67(1):137–160
Fayyad U, Piatetsky-Shapiro G, Smyth P (1996) The KDD process for extracting useful knowledge from volumes of data. Commun ACM 39(11):27–34
Fonseca CM, Flemming PJ (1995) An overview of evolutionary algorithms in multiobjective optimization. Evol Comput 3(1):1–16
Fowlkes E, Mallows C (1983) A method for comparing two hierarchical clusterings. J Am Stat Assoc 78:553–569
Friedman J, Meulman J (2004) Clustering objects on subsets of attributes. J R Stat Soc Ser B 4:815–849
Guha S, Rastogi R, Shim K (1998) Cure: an efficient clustering algorithm for large databases. In: Proceedings of ACM SIGMOD conference on management of data, pp 73–84
Guha S, Rastogi R, Shim K (1999) Rock: a robust clustering algorithm for categorical attributes. In: Proceedings of IEEE international conference on data engineering, pp 512–521
Hall LO, Özyurt IB, Bezdek JC (1999) Clustering with a genetically optimized approach. IEEE Trans Evol Comput 3(2):103–112
Hartigan J (1975) Clustering algorithms. Wiley, New York
Hinneburg A, Keim DA (1998) An efficient approach to clustering in large multimedia databases with noise. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 58–65
Huang JZ, Ng MK, Rong H, Li Z (2005) Automated variable weighting in k-means type clustering. IEEE Trans Pattern Anal Mach Intell 27(5):657–668
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 316(3):264–323
Kaski S, Nikkilä J, Kohonen T (1998) Methods for interpreting a self-organized map in data analysis. In: Proceedings of the European symposium on artificial neural networks, Brussels, Belgium, pp 185–190
Kaufman L, Rouseeauw PL (1990) Finding group in data: an introduction to cluster analysis. Wiley, New York
Likas A, Vlassis N, Verbeek J (2003) The global k-means clustering algorithm. Pattern Recognit 36(2):451–461
Liu Y, Özyer T, Alhajj R, Barker K (2004) Validity analysis of clustering obtained using multi-objective genetic algorithm. In: Proceedings of the international conference on intelligent systems design and applications. Lecture notes in computer science. Springer, Berlin
Liu Y, Özyer T, Alhajj R, Barker K (2005) Cluster validity analysis of alternative solutions from multi-objective optimization. In: Proceedings of SIAM international conference on data mining
Liu Y, Özyer T, Alhajj R, Barker K (2005) Integrating multi-objective genetic algorithm and validity analysis for locating and ranking alternative clustering. Eur J Inform 29(1):33–40
Lu Y, Lu S, Fotouhi F, Deng Y, Brown S (2004) FGKA: a fast genetic k-means clustering algorithm. In: Proceedings of ACM symposium on applied computing, pp 162–163
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. University of California Press, Berkeley, pp 281–297
Nagesh H, Goil S, Choudhary A (1999) Mafia: efficient and scalable subspace clustering for very large data sets. Technical report, Stanford University, Northwestern University, June 1999
Nakayama H (2005) Multi-objective optimization and its engineering applications. In: Branke J, Deb K, Miettinen K, Steuer RE (eds) Practical approaches to multi-objective optimization. Dagstuhl seminar proceedings, N 04461. Internationales Begegnungs- und Forschungszentrum fuer Informatik (IBFI), Schloss Dagstuhl, Germany. http://drops.dagstuhl.de/opus/volltexte/2005/234
Ng R, Han J (1994) Efficient and effective clustering methods for spatial data mining. In: Proceedings of the international conference on very large data bases, Santiago de Chile, Chile, pp 144–155
Noel SE, Szu HH (1997) Multiple-resolution clustering for recursive divide and conquer. In: Szu HH (ed) Wavelet applications IV. Proceedings of SPIE, vol 3078, pp 266–279, April 1997
Özyer T, Alhajj R (2006) Achieving natural clustering by validating results of iterative evolutionary clustering approach. In: Proceedings of the IEEE conference on intelligent systems
Özyer T, Alhajj R (2006) Clustering by integrating multi-objective optimization with weighted k-means and validity analysis. In: 7th international conference on intelligent data engineering and automated learning. Springer, Berlin
Özyer T, Liu Y, Alhajj R, Barker K (2004) Multi-objective genetic algorithm based clustering approach and its application to gene expression data. In: Proceedings of biennial international conference on advances in information systems. Lecture notes in computer science. Springer, Berlin
Sander J, Ester M, Kriegel HP, Xu X (1998) Density-based clustering in spatial databases: the algorithm gdbscan and its applications. Data Min Knowl Discov 2(2):169–194
Schikuta E, Erhart M (1998) Bang-clustering: a novel grid-clustering algorithm for huge data sets. In: SSPR/SPR, pp 867–874
Scott AJ, Symons MJ (1971) Model-based Gaussian and non-Gaussian clustering. Biometrics 27:387–397
Sharan R, Shamir R (2000) Click: a clustering algorithm with applications to gene expression analysis. In: Proceedings of the international conference on intelligent systems on molecular biology, pp 307–316
Strehl A, Gosh J (2002) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617
Thierens D (1999) Scalability problems of simple genetic algorithms. Evol Comput 7(4):331–352
Wang W, Yang J, Muntz RR (1997) Sting: a statistical information grid approach to spatial data mining. In: Proceedings of the international conference on very large databases, pp 186–195
Zitzler E (1999) Evolutionary algorithms for multiobjective optimization: methods and applications. PhD thesis, Zurich: Swiss Federal Institute of Technology (ETH), Aachen, Germany
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Özyer, T., Alhajj, R. Parallel clustering of high dimensional data by integrating multi-objective genetic algorithm with divide and conquer. Appl Intell 31, 318–331 (2009). https://doi.org/10.1007/s10489-008-0129-8
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10489-008-0129-8