Abstract
Density-based clustering methods hinge on the idea of associating groups to the connected components of the level sets of the density underlying the data, to be estimated by a nonparametric method. These methods claim some desirable properties and generally good performance, but they involve a non-trivial computational effort, required for the identification of the connected regions. In a previous work, the use of spatial tessellation such as the Delaunay triangulation has been proposed, because it suitably generalizes the univariate procedure for detecting the connected components. However, its computational complexity grows exponentially with the dimensionality of data, thus making the triangulation unfeasible for high dimensions. Our aim is to overcome the limitations of Delaunay triangulation. We discuss the use of an alternative procedure for identifying the connected regions associated to the level sets of the density. By measuring the extent of possible valleys of the density along the segment connecting pairs of observations, the proposed procedure shifts the formulation from a space with arbitrary dimension to a univariate one, thus leading benefits both in computation and visualization.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Abramson, I.S.: On bandwidth variation in kernel estimates—a square root law. Ann. Stat. 10, 1217–1223 (1982)
Azzalini, A., Torelli, N.: Clustering via nonparametric density estimation. Stat. Comput. 17, 71–80 (2007)
Azzalini, A., Menardi, G., Rosolin, T.: R package pdfCluster: cluster analysis via nonparametric density estimation (version 1.0-0) (2012). http://cran.r-project.org/package=pdfCluster
Burman, P., Polonik, W.: Multivariate mode hunting: data analytic tools with measures of significance. J. Multivar. Anal. 100, 1198–1218 (2009)
Cuevas, A., Febrero, M., Fraiman, R.: Estimating the number of clusters. Can. J. Stat. 28, 367–382 (2000)
Cuevas, A., Febrero, M., Fraiman, R.: Cluster analysis: a further approach based on density estimation. Comput. Stat. Data Anal. 36, 441–459 (2001)
Dazard, J.E., Rao, J.S.: Local sparse bump hunting. J. Comput. Graph. Stat. 19, 900–929 (2010)
De la Cruz, R.: Bayesian non-linear regression models with skew-elliptical errors: applications to the classification of longitudinal profiles. Comput. Stat. Data Anal. 53, 436–449 (2008)
Du, Q., Faber, V., Gunzburger, M.: Centroidal Voronoi tessellations: applications and algorithms. SIAM Rev. 41, 637–676 (1999)
Even, S.: Graph Algorithms, 2nd edn. Cambridge University Press, New York (2011)
Forina, M., Armanino, C., Lanteri, S., Tiscornia, E.: Classication of olive oils from their fatty acid composition. In: Food Research and Data Analysis, pp. 189–214. Applied Sc. Publishers, London (1983)
Forina, M., Armanino, C., Castino, M., Ubigli, M.: Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25, 189–201 (1986)
Fraley, C., Raftery, A.E.: Model-based clustering, discriminant analysis and density estimation. J. Am. Stat. Assoc. 97, 611–631 (2002)
Fraley, C., Raftery, A.E.: MCLUST vers. 3 for R: normal mixture modeling and model-based clustering. Tech. Rep. 504, Univ. Washington, Dep. Stat. (2006), rev. 2009
Friedman, J.H.: On bias, variance, 0–1 loss, and the curse of dimensionality. Data Min. Knowl. Discov. 1, 55–77 (1997)
Gower, J.C., Ross, G.J.S.: Minimum spanning trees and single linkage cluster analysis. J. R. Stat. Soc., Ser. C, Appl. Stat. 18, 54–64 (1969)
Hartigan, J.A.: Clustering Algorithms. Wiley, New York (1975)
Kriegel, H., Kröger, P., Sander, J., Zimek, A.: Density-based clustering. Data Min. Knowl. Discov. 1, 231–240 (2011)
Krznaric, D., Levcopoulos, C.: Fast algorithms for complete linkage clustering. Discrete Comput. Geom. 19, 131–145 (1998)
Lubischew, A.A.: On the use of discriminant analysis in taxonomy. Biometrics 18, 455–477 (1962)
Menardi, G., Torelli, N.: Reducing data dimension for cluster detection. J. Stat. Comput. Simul. (2012). doi:10.1080/00949655.2012.679032
Minnotte, M.C.: Nonparametric testing for the existence of modes. Ann. Stat. 25, 1646–1660 (1997)
Müller, D.W., Sawitzki, G.: Excess mass estimates and tests for multimodality. J. Am. Stat. Assoc. 86, 738–746 (1991)
Prates, M., Lachos, V., Cabral, C.: R package mixsmsn: fitting finite mixture of scale mixture of skew-normal distributions (version 1.0-3) (2012). http://cran.r-project.org/web/packages/mixsmsn/index.html
Ooi, H.: Density visualization and mode hunting using trees. J. Comput. Graph. Stat. 11, 328–347 (2002)
R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna (2013). http://www.R-project.org/. ISBN 3-900051-07-3-900051-0
Rinaldo, A., Wasserman, L.: Generalized density clustering. Ann. Stat. 38, 2678–2722 (2010)
Sahu, S.K., Dey, D.K., Branco, M.D.: A new class of multivariate skew distributions with applications to Bayesian regression models. Can. J. Stat. 31, 129–150 (2003)
Scott, D.W., Sain, S.: Multidimensional Density Estimation. Handbook of Statistics, vol. 24, pp. 229–261 (2005)
Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman and Hall, New York (1986)
Stuetzle, W.: Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. J. Classif. 20, 25–47 (2003)
Stuetzle, W., Nugent, R.: A generalized single linkage method for estimating the cluster tree of a density. J. Comput. Graph. Stat. 19, 397–418 (2010)
Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a dataset via the GAP statistic. J. R. Stat. Soc., Ser. B, Stat. Methodol. 63, 411–423 (2000)
Wishart, D.: Mode analysis: a generalization of nearest neighbor which reduces chaining effects. In: Cole, A.J. (ed.) Numerical Taxonomy, pp. 282–308. Academic Press, London (1969)
Acknowledgements
We wish to thank the two anonymous referees and an Associate editor for their valuable comments which have stimulated a clearer exposition of the original formulation and helped to improve the overall outcome.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Menardi, G., Azzalini, A. An advancement in clustering via nonparametric density estimation. Stat Comput 24, 753–767 (2014). https://doi.org/10.1007/s11222-013-9400-x
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11222-013-9400-x