Abstract
Discovering connected regions in data space is a complex problem that places a heavy burden on the user. Datasets often require preprocessing and postprocessing before they are fit for algorithm and user consumption. Existing clustering algorithms require performance parameters to achieve adequate results; typically, these parameters are either empirically optimized or scanned by brute force, which adds further burden. We present RADDACL2, a density-based clustering algorithm designed to reduce overall user burden. The algorithm requires no input other than the dataset itself to identify clusters. In addition, the algorithm is deterministic: given the same dataset, it always produces the same result. Both of these features reduce user burden by decreasing the number of passes one must make to obtain an outcome. A number of experiments are performed on toy and real datasets to verify the capabilities of RADDACL2 against existing algorithms.
Abbreviations
- Centroid [\({C}_{\mathrm{X}}\)]: The centroid of the current region; the subscript X associates the centroid with a specific region
- Observation [O]: A series of attributes representing a single observation in the dataset
- Region [\({R}_{\mathrm{X}}\)]: A subdivision of the dataset, created during Density Discovery
- Precluster [PC]: A set of related observations that share a very high density, identified during Density Discovery
- Average Absolute Deviation [AAD]: The average of the absolute deviations from a measure of central tendency
- Standard Deviation [STD]: The variability of the data about a measure of central tendency
- Neighborhood Function [NF]: An algorithm used to determine whether a pair of preclusters are neighbors
- Neighborhood Radius [NR]: A threshold assigned to each precluster, used by the neighborhood function to determine whether two preclusters should be merged
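The AAD, STD, and neighborhood entries above can be made concrete with a short sketch. The AAD computation follows the definition given (mean of absolute deviations about a measure of central tendency, here the mean); the `are_neighbors` function and its centroid-distance rule are hypothetical illustrations of how a neighborhood radius might gate a precluster merge, not the paper's exact neighborhood function.

```python
import math
import statistics

def aad(values, center=None):
    """Average Absolute Deviation (AAD): mean of the absolute
    deviations from a measure of central tendency (the mean here)."""
    if center is None:
        center = statistics.mean(values)
    return sum(abs(v - center) for v in values) / len(values)

def are_neighbors(centroid_a, radius_a, centroid_b, radius_b):
    """Hypothetical neighborhood function: treat two preclusters as
    neighbors (merge candidates) when the distance between their
    centroids is within the sum of their neighborhood radii."""
    return math.dist(centroid_a, centroid_b) <= radius_a + radius_b

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(aad(data))                # AAD about the mean -> 1.5
print(statistics.pstdev(data))  # population STD -> 2.0
print(are_neighbors((0.0, 0.0), 1.0, (1.5, 0.0), 1.0))  # True
```

Note that AAD is less sensitive to outliers than STD, since deviations are not squared; for this sample the AAD (1.5) is smaller than the STD (2.0).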
Acknowledgments
The authors wish to thank Mr. Derek Beaton for his comments and suggestions in developing RADDACL2. He is one of the co-authors on the original RADDACL algorithm.
Cite this article
Avila, D., Valova, I. RADDACL2: a recursive approach to discovering density clusters. Prog Artif Intell 4, 21–36 (2015). https://doi.org/10.1007/s13748-015-0066-9