Short text clustering by finding core terms

Xingliang Ni^1,2,3, Xiaojun Quan^2, Zhi Lu^2, Liu Wenyin^1,2,3 & Bei Hua^1,3


Abstract

A new clustering strategy, TermCut, is presented to cluster short text snippets by finding core terms in the corpus. We model the collection of short text snippets as a graph in which each vertex represents one short text snippet and each weighted edge between two vertices measures the relationship between the corresponding snippets. TermCut is then applied to recursively select a core term and bisect the graph such that the short text snippets in one part of the graph contain the term, whereas the snippets in the other part do not. We apply the proposed method to different types of short text snippets, including questions and search results. Experimental results show that the proposed method outperforms state-of-the-art clustering algorithms for clustering short text snippets.
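The recursive bisection described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state how edges are weighted or how a core term is scored, so the Jaccard edge weight, the cross-to-within similarity ratio used as the cut criterion, the `max_score` stopping threshold, and all function names below are assumptions made only for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Edge weight between two snippets: Jaccard overlap of term sets (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cut_score(docs, with_term, without_term):
    """Cross-part similarity divided by within-part similarity (assumed criterion).

    Lower values mean the two parts are better separated.
    """
    cross = sum(jaccard(docs[i], docs[j]) for i in with_term for j in without_term)
    within = sum(
        jaccard(docs[i], docs[j])
        for part in (with_term, without_term)
        for i, j in combinations(part, 2)
    )
    return cross / (within + 1e-9)  # small constant avoids division by zero


def best_core_term(docs, ids):
    """Return (score, term, with_term, without_term) for the lowest-scoring split, or None."""
    best = None
    vocab = sorted(set().union(*(docs[i] for i in ids)))
    for term in vocab:
        with_term = [i for i in ids if term in docs[i]]
        without_term = [i for i in ids if term not in docs[i]]
        if not with_term or not without_term:
            continue  # the term does not bisect this set of snippets
        score = cut_score(docs, with_term, without_term)
        if best is None or score < best[0]:
            best = (score, term, with_term, without_term)
    return best


def term_bisect(docs, ids=None, path=(), max_score=0.1):
    """Recursively bisect snippets on a core term.

    Returns a list of (path, snippet_ids); "+t" in a path means the snippets
    contain term t, "-t" means they do not. Recursion stops when no term
    yields a cut score at or below max_score (assumed stopping rule).
    """
    if ids is None:
        ids = list(range(len(docs)))
    split = best_core_term(docs, ids)
    if split is None or split[0] > max_score:
        return [(list(path), ids)]
    _, term, with_term, without_term = split
    return (
        term_bisect(docs, with_term, path + ("+" + term,), max_score)
        + term_bisect(docs, without_term, path + ("-" + term,), max_score)
    )


# Toy corpus: each snippet is already reduced to a set of terms.
snippets = [
    {"python", "list", "sort"},
    {"python", "dict", "sort"},
    {"python", "list", "loop"},
    {"hotel", "paris", "cheap"},
    {"hotel", "paris", "booking"},
    {"hotel", "rome", "cheap"},
]

for path, ids in term_bisect(snippets):
    print(path, ids)
```

On this toy corpus the first bisection separates the snippets that contain "hotel" from those that do not, and no further split passes the threshold. Scoring every candidate term exhaustively is quadratic in the number of snippets per term, which is acceptable for a sketch but would need a cheaper criterion on a real corpus.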



Author information

Authors and Affiliations

School of Computer Science and Technology, University of Science and Technology of China, Hefei, China
Xingliang Ni, Liu Wenyin & Bei Hua
Department of Computer Science, City University of Hong Kong, HKSAR, China
Xingliang Ni, Xiaojun Quan, Zhi Lu & Liu Wenyin
Joint Research Lab of Excellence, CityU-USTC Advanced Research Institute, Suzhou, China
Xingliang Ni, Liu Wenyin & Bei Hua


Corresponding author

Correspondence to Liu Wenyin.


About this article

Cite this article

Ni, X., Quan, X., Lu, Z. et al. Short text clustering by finding core terms. Knowl Inf Syst 27, 345–365 (2011). https://doi.org/10.1007/s10115-010-0299-7


Received: 16 June 2009
Revised: 07 April 2010
Accepted: 11 April 2010
Published: 25 June 2010
Issue Date: June 2011
DOI: https://doi.org/10.1007/s10115-010-0299-7
