KP-Rank: a semantic-based unsupervised approach for keyphrase extraction from text data

Multimedia Tools and Applications

Abstract

Automatic key concept identification from text is a major challenge in information extraction, information retrieval, digital libraries, ontology learning, and text analysis. The main difficulty lies in the text data itself: noise, diversity, scale, context dependency, and word sense ambiguity. To cope with this challenge, numerous supervised and unsupervised approaches have been devised. Existing topical clustering-based approaches for keyphrase extraction are domain dependent and overlook semantic similarity between candidate features while extracting topical phrases. In this paper, a semantic-based unsupervised approach (KP-Rank) is proposed for keyphrase extraction. The proposed approach exploits Latent Semantic Analysis (LSA) and clustering techniques, and introduces a novel frequency-based algorithm for candidate ranking that considers locality-based sentence, paragraph, and section frequencies. To evaluate the performance of the proposed method, three benchmark datasets from different domains (Inspec, 500N-KPCrowd, and SemEval-2010) are used. The experimental results show that, overall, KP-Rank achieved significant improvements over the existing approaches on the selected performance measures.
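
The idea of locality-based frequencies can be illustrated with a small sketch. The abstract does not give the actual KP-Rank scoring formula, so everything below is an assumption made for illustration: the function names, the substring matching, the naive sentence splitter, and the equally weighted sum are hypothetical, and the LSA and clustering stages are not shown. The sketch counts how many sentences, paragraphs, and sections mention each candidate phrase and ranks candidates by a weighted sum of those counts.

```python
import re
from collections import Counter


def locality_frequencies(sections, candidates):
    """Count how many sentences, paragraphs, and sections mention each candidate.

    sections:   list of sections, each a list of paragraph strings.
    candidates: lowercase candidate phrases (hypothetical input; candidate
                selection itself is not shown here).
    """
    counts = {c: Counter() for c in candidates}
    for section in sections:
        in_section = set()
        for paragraph in section:
            in_paragraph = set()
            # Naive sentence split: break after '.', '!' or '?' plus whitespace.
            for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
                text = sentence.lower()
                for c in candidates:
                    if c in text:  # plain substring match, for illustration only
                        counts[c]["sentence"] += 1
                        in_paragraph.add(c)
            for c in in_paragraph:
                counts[c]["paragraph"] += 1
            in_section |= in_paragraph
        for c in in_section:
            counts[c]["section"] += 1
    return counts


def rank_candidates(counts, weights=(1.0, 1.0, 1.0)):
    """Rank by an ASSUMED weighted sum of the three frequencies (not the paper's formula)."""
    w_sent, w_para, w_sec = weights
    scores = {
        c: w_sent * f["sentence"] + w_para * f["paragraph"] + w_sec * f["section"]
        for c, f in counts.items()
    }
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy document: two sections, the first with two paragraphs.
doc = [
    ["Keyphrase extraction is hard. Keyphrase extraction needs ranking.",
     "Ranking uses frequency."],
    ["Clustering groups candidates. Keyphrase extraction uses clustering."],
]
ranking = rank_candidates(
    locality_frequencies(doc, ["keyphrase extraction", "ranking", "clustering"])
)
```

In this toy setup a phrase that recurs across several sections outranks one that is repeated only within a single paragraph, which is the intuition behind weighting frequencies by locality. The weights are placeholders and would need to be tuned or replaced by the formula given in the full paper.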


References

  1. Adar E, Datta S (2015) Building a scientific concept hierarchy database (schbase). Ann Arbor 1001:48104

  2. Aman M, bin Md Said A, Jadid Abdul Kadir S, Ullah I (2018) Key concept identification: a comprehensive analysis of frequency and topical graph-based approaches. Information 9(5):128

  3. Aman M, bin Md Said A, Kadir SJA, Ullah I (2018) Key concept identification: a sentence parse tree-based technique for candidate feature extraction from unstructured texts. IEEE Access

  4. Barker K, Cornacchia N (2000) Using noun phrase heads to extract document keyphrases. Adv Artif Intell, pp 40–52

  5. Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3(Jan):993–1022

  6. Boudin F (2016) pke: an open source python-based keyphrase extraction toolkit. In: COLING (Demos), pp 69–73

  7. Bougouin A, Boudin F, Daille B (2013) Topicrank: Graph-based topic ranking for keyphrase extraction. In: International joint conference on natural language processing (IJCNLP), pp 543–551

  8. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1-7):107–117

  9. Chandu K, Naik A, Chandrasekar A, Yang Z, Gupta N, Nyberg E (2017) Tackling biomedical text summarization: OAQA at BioASQ 5B. In: BioNLP 2017, pp 58–66

  10. Danesh S, Sumner T, Martin JH (2015) Sgrank: Combining statistical and graphical methods to improve the state of the art in unsupervised keyphrase extraction. In: *SEM, NAACL-HLT, pp 117–126

  11. Danilevsky M, Wang C, Desai N, Ren X, Guo J, Han J (2014) Automatic construction and ranking of topical keyphrases on collections of short documents. In: Proceedings of the 2014 SIAM international conference on data mining, SIAM, pp 398–406

  12. Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inform Sci 41(6):391

  13. El-Beltagy SR, Rafea A (2009) Kp-miner: a keyphrase extraction system for english and arabic documents. Inf Syst 34(1):132–144

  14. Florescu C, Caragea C (2017) Positionrank: An unsupervised approach to keyphrase extraction from scholarly documents. In: Proceedings of the 55th Annual meeting of the association for computational linguistics (Volume 1: Long Papers), vol 1, pp 1105–1115

  15. Frank E, Paynter GW, Witten IH, Gutwin C, Nevill-Manning CG (1999) Domain-specific keyphrase extraction. In: 16th International joint conference on artificial intelligence (IJCAI 99), vol. 2. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 668–673

  16. Geiss J (2011) Latent semantic sentence clustering for multi-document summarization. University of Cambridge, Computer Laboratory, Tech. Rep.

  17. Gollapalli SD, Caragea C (2014) Extracting keyphrases from research papers using citation networks. In: AAAI, pp 1629–1635

  18. Gollapalli SD, Li X-L, Yang P (2017) Incorporating expert knowledge into keyphrase extraction. In: AAAI, pp 3180–3187

  19. Hartigan JA, Wong MA (1979) Algorithm as 136: a k-means clustering algorithm. J R Stat Soc Ser C Appl Stat 28(1):100–108

  20. Hasan KS, Ng V (2010) Conundrums in unsupervised keyphrase extraction: making sense of the state-of-the-art. In: Proceedings of the 23rd international conference on computational linguistics: posters. Association for Computational Linguistics, pp 365–373

  21. Haveliwala TH (2003) Topic-sensitive pagerank: a context-sensitive ranking algorithm for web search. IEEE Trans Knowl Data Eng 15(4):784–796

  22. Hulth A (2003) Improved automatic keyword extraction given more linguistic knowledge. In: Proceedings of the 2003 conference on empirical methods in natural language processing. Association for Computational Linguistics, pp 216–223

  23. Hulth A, Megyesi BB (2006) A study on automatically extracted keywords in text categorization. In: Proceedings of the 21st international conference on computational linguistics and the 44th annual meeting of the association for computational linguistics. Association for Computational Linguistics, pp 537–544

  24. Kang Y-B, Haghighi PD, Burstein F (2014) Cfinder: an intelligent key concept finder from text for ontology development. Expert Syst Appl 41(9):4494–4504

  25. Kashyap A, Han L, Yus R, Sleeman J, Satyapanich T, Gandhi S, Finin T (2016) Robust semantic text similarity using lsa, machine learning, and linguistic resources. Lang Resour Eval 50(1):125–161

  26. Kim SN, Medelyan O, Kan M-Y, Baldwin T (2013) Automatic keyphrase extraction from scientific articles. Lang Resour Eval 47(3):723–742

  27. Klein D, Manning CD (2003) Accurate unlexicalized parsing. In: Proceedings of the 41st annual meeting of the association for computational linguistics

  28. Kwon H, Kim J, Park Y (2017) Applying lsa text mining technique in envisioning social impacts of emerging technologies: the case of drone technology. Technovation 60:15–28

  29. Lahiri S, Choudhury SR, Caragea C (2014) Keyword and keyphrase extraction using centrality measures on collocation networks. arXiv:1401.6571

  30. Le TTN, Le Nguyen M, Shimazu A (2016) Unsupervised keyphrase extraction: Introducing new kinds of words to keyphrases. In: Australasian joint conference on artificial intelligence, Springer, pp 665–671

  31. Lewis DD (1995) Evaluating and optimizing autonomous text classification systems. In: Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval, ACM, pp 246–254

  32. Li L, Simek O, Lai A, Daggett M, Dagli CK, Jones C (2018) Detection and characterization of human trafficking networks using unsupervised scalable text template matching. In: 2018 IEEE international conference on big data (Big Data). IEEE, pp 3111–3120

  33. Liu Z, Huang W, Zheng Y, Sun M (2010) Automatic keyphrase extraction via topic decomposition. In: Proceedings of the 2010 conference on empirical methods in natural language processing. Association for Computational Linguistics, pp 366–376

  34. Liu Z, Li P, Zheng Y, Sun M (2009) Clustering to find exemplar terms for keyphrase extraction. In: Proceedings of the 2009 conference on empirical methods in natural language processing: Volume 1. Association for Computational Linguistics, pp 257–266

  35. Liu Z, Liang C, Sun M (2012) Topical word trigger model for keyphrase extraction. In: Proceedings of COLING 2012, pp 1715–1730

  36. Liu F, Pennell D, Liu F, Liu Y (2009) Unsupervised approaches for automatic keyword extraction using meeting transcripts. In: Proceedings of human language technologies: The 2009 annual conference of the North American chapter of the association for computational linguistics. Association for Computational Linguistics, pp 620–628

  37. Marcus MP, Marcinkiewicz MA, Santorini B (1993) Building a large annotated corpus of english: the penn treebank. Comput Linguist 19(2):313–330

  38. Martinez-Romo J, Araujo L, Duque Fernandez A (2016) Semgraph: Extracting keyphrases following a novel semantic graph-based approach. J Assoc Inf Sci Technol 67(1):71–82

  39. Marujo L, Gershman A, Carbonell J, Frederking R, Neto JP (2012) Supervised topical key phrase extraction of news stories using crowdsourcing, light filtering and co-reference normalization. In: Proceedings of the eighth international conference on language resources and evaluation (LREC-2012), pp 399–403

  40. Matsuo Y, Ishizuka M (2004) Keyword extraction from a single document using word co-occurrence statistical information. Int J Artif Intell Tools 13(01):157–169

  41. McInnes L, Healy J, Astels S (2017) Hdbscan: Hierarchical density based clustering. J Open Source Softw 2(11):205

  42. Medelyan O, Frank E, Witten IH (2009) Human-competitive tagging using automatic keyphrase extraction. In: Proceedings of the 2009 conference on empirical methods in natural language processing: Volume 3. Association for Computational Linguistics, pp 1318–1327

  43. Mendoza M, Ormeno P, Valle C (2018) Boosting text clustering using topic selection

  44. Merchant K, Pande Y (2018) NLP based latent semantic analysis for legal text summarization. In: 2018 International conference on advances in computing, communications and informatics (ICACCI), IEEE, pp 1803–1807

  45. Mihalcea R, Tarau P (2004) Textrank: Bringing order into text. In: EMNLP, vol. 4, pp 404–411

  46. Kim SN, Medelyan O, Kan M-Y, Baldwin T (2010) Semeval-2010 task 5: Automatic keyphrase extraction from scientific articles. In: Proceedings of the 5th International workshop on semantic evaluation. Association for Computational Linguistics, pp 21–26

  47. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine learning in Python. J Mach Learn Res 12:2825–2830

  48. Rafiei-Asl J, Nickabadi A (2017) Tsake: a topical and structural automatic keyphrase extractor. Appl Soft Comput 58:620–630

  49. Rijsbergen CJV (1979) Information retrieval. 2nd ed. Newton, MA USA: Butterworth-Heinemann

  50. Hasan KS, Ng V (2014) Automatic keyphrase extraction: A survey of the state of the art. In: ACL (1), pp 1262–1273

  51. Shen Y, Zhang Q, Zhang J, Huang J, Lu Y, Lei K (2018) Improving medical short text classification with semantic expansion using word-cluster embedding. In: International conference on information science and applications, Springer, pp 401–411

  52. Süzek TÖ (2017) Using latent semantic analysis for automated keyword extraction from large document corpora. Turk J Electr Eng Comput Sci 25(3):1784–1794

  53. Teneva N, Cheng W (2017) Salience rank: efficient keyphrase extraction with topic modeling. In: Proceedings of the 55th Annual meeting of the association for computational linguistics (Volume 2: Short Papers), vol 2, pp 530–535

  54. Tomokiyo T, Hurst M (2003) A language model approach to keyphrase extraction. In: Proceedings of the ACL 2003 workshop on multiword expressions: analysis, acquisition and treatment, Volume 18. Association for Computational Linguistics, pp 33–40

  55. Turney P (1997) Extraction of keyphrases from text: Evaluation of four algorithms. National Research Council Canada, Institute for Information Technology

  56. Turney PD (2000) Learning algorithms for keyphrase extraction. Inform Retr 2(4):303–336

  57. Turney PD (2003) Coherent keyphrase extraction via web mining. In: Proceedings of the 18th international joint conference on artificial intelligence, pp 434–439

  58. Turpin A, Scholer F (2006) User performance versus precision measures for simple search tasks. In: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, ACM, pp 11–18

  59. Wan X, Xiao J (2008) Single document keyphrase extraction using neighborhood knowledge. In: AAAI, vol 8, pp 855–860

  60. Wan X, Yang J, Xiao J (2007) Towards an iterative reinforcement approach for simultaneous document summarization and keyword extraction. In: ACL, vol 7, pp 552–559

  61. Wang R, Liu W, McDonald C (2014) Corpus-independent generic keyphrase extraction using word embedding vectors. In: Software engineering research conference, vol 39

  62. Ward JH Jr (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58(301):236–244

  63. Wan X, Xiao J (2008) Collabrank: towards a collaborative approach to single-document keyphrase extraction. In: Proceedings of the 22nd International Conference on Computational Linguistics, Volume 1. Association for Computational Linguistics, pp 969–976

  64. Zha H (2002) Generic summarization and keyphrase extraction using mutual reinforcement principle and sentence clustering. In: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, ACM, pp 113–120

  65. Zhang Y, Milios E, Zincir-Heywood N (2007) A comparative study on key phrase extraction methods in automatic web site summarization. J Digit Inf Manag 5(5):323

  66. Zhang Q, Wang Y, Gong Y, Huang X (2016) Keyphrase extraction using deep recurrent neural networks on twitter. In: Proceedings of the 2016 conference on empirical methods in natural language processing, pp 836–845

Acknowledgments

This research was fully funded and supported by Universiti Teknologi PETRONAS, under the Yayasan Universiti Teknologi PETRONAS (YUTP), Cost Centre (015LC0-119).

Author information

Corresponding author

Correspondence to Said Jadid Abdulkadir.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Description of part-of-speech (POS) tags

Table 15 Description of part-of-speech (POS) tags [37]

About this article

Cite this article

Aman, M., Abdulkadir, S.J., Aziz, I.A. et al. KP-Rank: a semantic-based unsupervised approach for keyphrase extraction from text data. Multimed Tools Appl 80, 12469–12506 (2021). https://doi.org/10.1007/s11042-020-10215-x

  • DOI: https://doi.org/10.1007/s11042-020-10215-x
