Comparative Analysis of Topic Modeling Algorithms Based on Arabic News Documents

  • Conference paper
Advances in Intelligent Computing Techniques and Applications (IRICT 2023)

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 211)

Abstract

Topic modeling is a text mining technique for extracting latent topics from a collection of documents. Most research on topic modeling has been conducted on English text. In recent years, interest in applying topic modeling to Arabic has grown, although its use in this language remains limited. In this paper, we compare several techniques commonly used in topic modeling: a probabilistic model, Latent Dirichlet Allocation (LDA); the matrix factorization methods Non-Negative Matrix Factorization (NMF) and Latent Semantic Indexing (LSI); and a transformer-based model, BERTopic. The models were applied to Arabic text, and all algorithms were trained on the TF-IDF text representation to ensure a fair comparison. Each model is evaluated using topic coherence as the metric. The results indicate that both NMF and BERTopic perform very well.
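
The two ingredients shared by all models in the comparison, the TF-IDF representation and a topic coherence score, can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say which TF-IDF variant or which coherence measure was used, so the sketch assumes raw term frequency times log(N / df) and an average NPMI over document-level co-occurrence. The function names `tf_idf` and `npmi_coherence` and the English placeholder tokens are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: raw term frequency times log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


def npmi_coherence(top_words, docs):
    """Average NPMI over all pairs of a topic's top words.

    Probabilities are estimated from document-level co-occurrence
    (an assumption; sliding windows are another common choice).
    """
    n_docs = len(docs)
    doc_sets = [set(doc) for doc in docs]

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # the pair never co-occurs
        elif p12 == 1:
            scores.append(1.0)   # the pair co-occurs in every document
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Hypothetical toy corpus; real input would be preprocessed Arabic news tokens.
docs = [
    ["bank", "economy", "market"],
    ["bank", "market", "growth"],
    ["match", "team", "goal"],
    ["team", "goal", "league"],
]

weights = tf_idf(docs)
print(weights[0])
print(npmi_coherence(["bank", "market"], docs))  # words that co-occur
print(npmi_coherence(["bank", "team"], docs))    # words that never co-occur
```

In this sketch a topic whose top words tend to appear in the same documents scores close to 1, while a topic that mixes unrelated words scores close to -1. Averaging such scores over all topics of a model gives a single number that can be compared across LDA, NMF, LSI, and BERTopic.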

Notes

  1. https://radimrehurek.com/gensim/

Author information

Corresponding author

Correspondence to Islam Djemmal.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Djemmal, I., Belhadef, H. (2024). Comparative Analysis of Topic Modeling Algorithms Based on Arabic News Documents. In: Saeed, F., Mohammed, F., Fazea, Y. (eds) Advances in Intelligent Computing Techniques and Applications. IRICT 2023. Lecture Notes on Data Engineering and Communications Technologies, vol 211. Springer, Cham. https://doi.org/10.1007/978-3-031-59707-7_10
