Abstract
A new clustering strategy, TermCut, is presented for clustering short text snippets by finding core terms in the corpus. We model the collection of short text snippets as a graph in which each vertex represents a short text snippet and each weighted edge measures the relationship between the two snippets it connects. TermCut then recursively selects a core term and bisects the graph so that the snippets in one part contain the term and those in the other part do not. We apply the proposed method to different types of short text snippets, including questions and search results. Experimental results show that the proposed method outperforms state-of-the-art clustering algorithms for clustering short text snippets.
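The recursive bisection described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state how edges are weighted or how a core term is scored, so everything specific here is an assumption made for illustration: the Jaccard edge weight, the `cut_score` criterion (cross-part similarity divided by within-part similarity), the recursion into both parts, and the `min_size` and `max_depth` stopping parameters. It should not be read as the authors' implementation.

```python
from itertools import combinations


def jaccard(a, b):
    """Assumed edge weight: Jaccard overlap between two snippets' term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cut_score(snippets, with_term, without_term):
    """Hypothetical criterion (not from the paper): similarity across the
    two parts divided by similarity inside the parts. Lower is better."""
    cross = sum(jaccard(snippets[i], snippets[j])
                for i in with_term for j in without_term)
    within = sum(jaccard(snippets[i], snippets[j])
                 for part in (with_term, without_term)
                 for i, j in combinations(part, 2))
    return cross / (within + 1e-9)


def term_bisect(snippets, ids, min_size=2, max_depth=3):
    """Recursively split `ids` on the best-scoring term: one part contains
    the term, the other does not. Returns a list of clusters (lists of ids)."""
    if len(ids) <= min_size or max_depth == 0:
        return [ids]
    best = None
    vocabulary = set().union(*(snippets[i] for i in ids))
    for term in sorted(vocabulary):  # sorted for deterministic tie-breaking
        with_term = [i for i in ids if term in snippets[i]]
        without_term = [i for i in ids if term not in snippets[i]]
        if not with_term or not without_term:
            continue  # this term does not split the set
        score = cut_score(snippets, with_term, without_term)
        if best is None or score < best[0]:
            best = (score, with_term, without_term)
    if best is None:
        return [ids]
    _, with_term, without_term = best
    return (term_bisect(snippets, with_term, min_size, max_depth - 1)
            + term_bisect(snippets, without_term, min_size, max_depth - 1))


# Toy snippets, already tokenized into term sets (illustrative data only).
snippets = [
    {"python", "list", "sort"},
    {"python", "dict", "sort"},
    {"java", "thread", "lock"},
    {"java", "thread", "pool"},
]
clusters = term_bisect(snippets, list(range(len(snippets))))
print(clusters)
```

On this toy input the two Java snippets and the two Python snippets end up in separate clusters, because a term such as "java" splits the set with no similarity crossing the cut. A real system would need tokenization, stop-word handling, and the scoring criterion defined in the paper itself.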
Cite this article
Ni, X., Quan, X., Lu, Z. et al. Short text clustering by finding core terms. Knowl Inf Syst 27, 345–365 (2011). https://doi.org/10.1007/s10115-010-0299-7
DOI: https://doi.org/10.1007/s10115-010-0299-7