RADDACL2: a recursive approach to discovering density clusters

  • Regular Paper
Progress in Artificial Intelligence

Abstract

Discovering connected regions in data space is a complex problem that places a heavy demand on the user. Datasets often require preprocessing and postprocessing before they are fit for algorithm and user consumption. Existing clustering algorithms require performance parameters to achieve adequate results; typically, these parameters are either empirically tuned or scanned by brute force, which places further burden on the user. We present RADDACL2, a density-based clustering algorithm designed to reduce this burden. The algorithm requires no input other than the dataset itself to identify clusters. In addition, it is deterministic: repeated runs on the same dataset always produce the same result. Both features reduce user burden by decreasing the number of runs needed to reach a usable outcome. Experiments on toy and real datasets verify the capabilities of RADDACL2 in comparison with existing algorithms.
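
The parameter burden described above can be made concrete. The sketch below is illustrative only and is not taken from the paper: it assumes scikit-learn and uses DBSCAN, a well-known density-based algorithm, as a stand-in to show the kind of brute-force parameter scan (over eps and min_samples, scored here with the silhouette coefficient) that a parameter-free, deterministic method such as RADDACL2 is intended to avoid.

```python
# Illustration only (not from the paper): brute-force parameter scanning for a
# density-based method, here scikit-learn's DBSCAN, scored with the silhouette
# coefficient. RADDACL2 is intended to remove exactly this kind of search.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

best = None
for eps in np.linspace(0.05, 0.5, 10):        # candidate neighborhood radii
    for min_samples in (3, 5, 10):            # candidate density thresholds
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        if n_clusters < 2:                    # silhouette needs >= 2 clusters
            continue
        score = silhouette_score(X, labels)
        if best is None or score > best[0]:
            best = (score, eps, min_samples, n_clusters)

print("best (silhouette, eps, min_samples, clusters):", best)
```

Every additional parameter multiplies the number of runs the user must make; a deterministic, parameter-free algorithm collapses this loop to a single pass over the dataset.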



Abbreviations

Centroid [\({C}_{\mathrm{X}}\)]:

The centroid of the current region. The value of X associates the centroid with a specific region

Observation [O]:

A series of attributes representing a single observation in the dataset

Region [\({R}_{\mathrm{X}}\)]:

A region represents a subdivision of the dataset. These are created during Density Discovery

Precluster [PC]:

A set of related observations that form a region of high density. They are identified during Density Discovery

Average Absolute Deviation [AAD]:

An average of the absolute deviations from a measure of central tendency

Standard Deviation [STD]:

The variability of data from a measure of central tendency

Neighborhood Function [NF]:

An algorithm that is used to identify if a pair of preclusters are neighbors

Neighborhood Radius [NR]:

A threshold assigned to each precluster. These are used by the neighborhood function to determine if two preclusters should be merged (see the sketch that follows this list)
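
To make the abbreviations concrete, the following sketch (not the paper's implementation) shows one way the listed quantities could fit together in code. The merge test in the neighborhood function is a hypothetical placeholder; the paper defines its own NF and NR.

```python
# A minimal sketch (not the paper's implementation) tying the abbreviations
# together. The merge criterion in neighbors() is a hypothetical placeholder.
from dataclasses import dataclass
import numpy as np

@dataclass
class Precluster:
    points: np.ndarray   # observations (O) assigned to this precluster (PC)
    radius: float        # neighborhood radius (NR) assigned to the precluster

    @property
    def centroid(self) -> np.ndarray:
        # Centroid (C_X) of the observations in this precluster/region (R_X)
        return self.points.mean(axis=0)

def aad(x: np.ndarray) -> float:
    """Average absolute deviation (AAD) from the mean."""
    return float(np.mean(np.abs(x - x.mean(axis=0))))

def std(x: np.ndarray) -> float:
    """Standard deviation (STD) from the mean."""
    return float(np.std(x))

def neighbors(a: Precluster, b: Precluster) -> bool:
    """Neighborhood function (NF): hypothetical rule that treats two
    preclusters as neighbors when their centroids lie within the sum
    of their neighborhood radii."""
    return bool(np.linalg.norm(a.centroid - b.centroid) <= a.radius + b.radius)
```

A driver that recursively subdivides regions during Density Discovery and then applies such a test to pairs of preclusters would mirror the structure these definitions suggest; the actual RADDACL2 criteria are given in the full text.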


Acknowledgments

The authors wish to thank Mr. Derek Beaton for his comments and suggestions in developing RADDACL2. He is a co-author of the original RADDACL algorithm.

Author information

Corresponding author

Correspondence to Iren Valova.

About this article

Cite this article

Avila, D., Valova, I. RADDACL2: a recursive approach to discovering density clusters. Prog Artif Intell 4, 21–36 (2015). https://doi.org/10.1007/s13748-015-0066-9

