Topic Classification of Online News Articles Using Optimized Machine Learning Models
Figure 1. Dataflow diagram of the methodology used in this paper.
Figure 2. Dataset class distribution: (a) before dataset balancing and (b) after balancing.
Figure 3. Comparison of optimized and unoptimized models.
Figure 4. Comparison of SVM kernel performance.
Figure 5. Comparison of SVM kernels: (a) linear, (b) sigmoid, (c) polynomial, (d) RBF.
Abstract
1. Introduction
2. Literature Review
3. Methods and Experimentation
3.1. Mathematical Model of Topic Classification Problem
- The relation RCF is functional, i.e., each topic corresponds to a unique description.
- Each topic description contains the set of features used by the classification operation, together with their values.
- For each topic, we define the set of documents associated with it, either a priori or with the help of the classification operation.
- It can be shown that the topics form an upper semilattice, i.e., there is a unique root topic that contains all other classes (topics).
- The result of determining the topic of a document (text), i.e., of classifying it, is the set of topics to which the document corresponds.
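The symbols of this model did not survive extraction, so the block below is only one possible formalization of the five properties listed above. Every symbol in it ($T$ for topics, $D$ for descriptions, $X$ for documents, $\mathrm{Docs}$, $t_0$, and $\mathrm{cls}$) is an assumption introduced for illustration and is not the authors' notation.

```latex
% Hypothetical notation: T = topics, D = descriptions, X = documents.
\begin{align}
  & R_{CF} \subseteq T \times D, \qquad
    \forall t \in T \;\, \exists!\, d \in D : (t, d) \in R_{CF}
    && \text{(RCF is functional)} \\
  & \mathrm{Docs}(t) \subseteq X \quad \text{for each } t \in T
    && \text{(documents of a topic)} \\
  & \exists!\, t_0 \in T : \; \mathrm{Docs}(t) \subseteq \mathrm{Docs}(t_0)
    \quad \forall t \in T
    && \text{(root topic)} \\
  & \mathrm{cls}(x) = \{\, t \in T : x \in \mathrm{Docs}(t) \,\}, \qquad x \in X
    && \text{(classification result)}
\end{align}
```

Under this reading, the classification result is a set, so a document may belong to several topics at once, and it always belongs to at least the root topic $t_0$.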
3.2. Dataset Description
3.3. Workflow of Methodology
3.4. Preprocessing
3.4.1. Dataset Structuring and Balancing
3.4.2. Label Encoding
3.4.3. Article Preprocessing
3.5. Feature Extraction
3.6. Model Training
3.7. Model Evaluation
4. Results
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Column Name | Description |
---|---|
Id | The numbered ID of the news article |
Headline | The headline of the news article |
Body | The main text of the news article |
Kat | Category/Label of the news article |
Date | The date the news article was published |
Category | Encoded Label |
---|---|
Weltnachrichten | 0 |
TopNachrichten | 1 |
Politik | 2 |
Inlandsnachrichten | 3 |
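As a small illustration of the encoding in the table above, the sketch below maps each category string to its integer code with a plain dictionary. The names `LABEL_CODES` and `encode_labels` are hypothetical, and the source does not say which tool performed the encoding.

```python
# Category-to-code mapping copied from the table above.
LABEL_CODES = {
    "Weltnachrichten": 0,
    "TopNachrichten": 1,
    "Politik": 2,
    "Inlandsnachrichten": 3,
}


def encode_labels(categories):
    """Return the integer code for each category name (KeyError if unknown)."""
    return [LABEL_CODES[name] for name in categories]


print(encode_labels(["Politik", "Weltnachrichten", "TopNachrichten"]))  # [2, 0, 1]
```

A fixed dictionary keeps the codes stable across runs, whereas an encoder fitted on the data would assign codes according to whatever ordering it derives from the labels it sees.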
Input | Output |
---|---|
U.S. weekly jobless claims rebound from near 45 year lows | u.s. weekly jobless claims rebound from near 45 year lows |
Thomson Reuters says CEO Jim Smith to make full recovery after an arrhythmia incident | thomson reuters says ceo jim smith to make full recovery after an arrhythmia incident |
Trump says FBI missed signs on Florida shooting due to Russia probe, draws criticism | trump says fbi missed signs on florida shooting due to russia probe, draws criticism |
Moscow says no evidence behind U.S. indictment of Russians for alleged election meddling | moscow says no evidence behind u.s. indictment of russians for alleged election meddling |
‘Do you fear me?’: Venezuela’s Maduro vows to gatecrash regional summit | ‘do you fear me?’: venezuela’s maduro vows to gatecrash regional summit |
Input | Output |
---|---|
u.s. weekly jobless claims rebound from near 45 year lows | u s weekly jobless claims rebound from near year lows |
thomson reuters says ceo jim smith to make full recovery after an arrhythmia incident | thomson reuters says ceo jim smith to make full recovery after an arrhythmia incident |
trump says fbi missed signs on florida shooting due to russia probe, draws criticism | trump says fbi missed signs on florida shooting due to russia probe draws criticism |
moscow says no evidence behind u.s. indictment of russians for alleged election meddling | moscow says no evidence behind u s indictment of russians for alleged election meddling |
‘do you fear me?’: venezuela’s maduro vows to gatecrash regional summit | do you fear me venezuela s maduro vows to gatecrash regional summit |
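The lowercasing and punctuation-removal steps shown in the two tables above can be sketched with the standard library alone. This is a minimal sketch, not the authors' code: the function name `normalize` and the choice to replace every run of non-letter characters with a single space are assumptions, chosen because they reproduce the table rows (digits such as "45" disappear and "u.s." becomes "u s").

```python
import re


def normalize(text):
    """Lowercase, then replace every run of non-letter characters with one space."""
    lowered = text.lower()
    # Anything outside a-z (digits, punctuation, curly quotes, spaces) becomes a space.
    return re.sub(r"[^a-z]+", " ", lowered).strip()


print(normalize("U.S. weekly jobless claims rebound from near 45 year lows"))
# u s weekly jobless claims rebound from near year lows
```

Note that this pattern also strips accented or other non-ASCII letters, which is acceptable for the English headlines shown here but would need adjusting for other languages.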
Input | Output |
---|---|
u s weekly jobless claims rebound from near year lows | u weekly jobless claims rebound near year lows |
thomson reuters says ceo jim smith to make full recovery after an arrhythmia incident | thomson reuters says ceo jim smith make full recovery arrhythmia incident |
trump says fbi missed signs on florida shooting due to russia probe draws criticism | trump says fbi missed signs florida shooting due russia probe draws criticism |
moscow says no evidence behind u s indictment of russians for alleged election meddling | moscow says evidence behind u indictment russians alleged election meddling |
do you fear me venezuela s maduro vows to gatecrash regional summit | fear venezuela maduro vows gatecrash regional summit |
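Stop-word removal as shown in the table above amounts to filtering tokens against a fixed word list. The sketch below is hedged: `STOP_WORDS` is a tiny hand-picked subset that covers only the words dropped in these example rows, and `remove_stop_words` is a hypothetical name. The source does not state which stop-word list was used.

```python
# Tiny illustrative stop-word set; a real list would be much longer.
STOP_WORDS = {
    "s", "from", "to", "after", "an", "on",
    "no", "of", "for", "do", "you", "me",
}


def remove_stop_words(text):
    """Drop every whitespace-separated token that appears in STOP_WORDS."""
    return " ".join(tok for tok in text.split() if tok not in STOP_WORDS)


print(remove_stop_words("u s weekly jobless claims rebound from near year lows"))
# u weekly jobless claims rebound near year lows
```

The single letter "u" survives because it is not in the list, which matches the table and shows how splitting "u.s." earlier in the pipeline leaves a low-information token behind.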
Input | Output |
---|---|
u weekly jobless claims rebound near year lows | [‘u’, ‘weekli’, ‘jobless’, ‘claim’, ‘rebound’, ‘near’, ‘year’, ‘low’] |
thomson reuters says ceo jim smith make full recovery arrhythmia incident | [‘thomson’, ‘reuter’, ‘say’, ‘ceo’, ‘jim’, ‘smith’, ‘make’, ‘full’, ‘recoveri’, ‘arrhythmia’, ‘incid’] |
trump says fbi missed signs florida shooting due russia probe draws criticism | [‘trump’, ‘say’, ‘fbi’, ‘miss’, ‘sign’, ‘florida’, ‘shoot’, ‘due’, ‘russia’, ‘probe’, ‘draw’, ‘critic’] |
moscow says evidence behind u indictment russians alleged election meddling | [‘moscow’, ‘say’, ‘evid’, ‘behind’, ‘u’, ‘indict’, ‘russian’, ‘alleg’, ‘elect’, ‘meddl’] |
fear venezuela maduro vows gatecrash regional summit | [‘fear’, ‘venezuela’, ‘maduro’, ‘vow’, ‘gatecrash’, ‘region’, ‘summit’] |
Model | Hyper-Parameters |
---|---|
RF | n_estimators = 3000, max_depth = 100 |
NB | alpha reduced to 0.01 |
SGD | max_depth = 10, average = True |
KNN | K = 9, otherwise default settings |
SVM | kernel = rbf, C = 1.0 |
LR | solver = saga, C = 2.8 |
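For reuse in experiments, the settings in the table above can be collected into a single configuration mapping. The sketch below is only a transcription: the name `HYPERPARAMS` is hypothetical, the keyword spellings (for example `n_neighbors` for K) follow scikit-learn conventions as an assumption, and the source does not name the library used.

```python
# Hyperparameters from the table above, keyed by the model abbreviations used there.
# Keyword names follow scikit-learn conventions, which is an assumption.
HYPERPARAMS = {
    "RF": {"n_estimators": 3000, "max_depth": 100},
    "NB": {"alpha": 0.01},
    # Copied as listed; max_depth may need mapping to another argument
    # depending on the library's SGD implementation.
    "SGD": {"max_depth": 10, "average": True},
    "KNN": {"n_neighbors": 9},  # the table's K
    "SVM": {"kernel": "rbf", "C": 1.0},
    "LR": {"solver": "saga", "C": 2.8},
}

for model, params in HYPERPARAMS.items():
    print(model, params)
```

Keeping the values in one mapping makes it easy to compare an optimized run against a baseline that simply passes no overrides.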
Classifier | Accuracy (w/o Optimization) | Accuracy (With Optimization) |
---|---|---|
SVM | 0.6435 | 0.8516 |
SGD | 0.8480 | 0.8476 |
LR | 0.8437 | 0.8470 |
RF | 0.7587 | 0.8110 |
NB | 0.8106 | 0.8183 |
KNN | 0.8104 | 0.8135 |
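To make the effect of optimization easier to read off, the sketch below computes the accuracy difference per classifier from the values in the table above. The names `ACCURACY` and `accuracy_gains` are hypothetical; the numbers are copied from the table and nothing else is assumed.

```python
# (accuracy without optimization, accuracy with optimization), from the table above.
ACCURACY = {
    "SVM": (0.6435, 0.8516),
    "SGD": (0.8480, 0.8476),
    "LR": (0.8437, 0.8470),
    "RF": (0.7587, 0.8110),
    "NB": (0.8106, 0.8183),
    "KNN": (0.8104, 0.8135),
}


def accuracy_gains(table):
    """Return optimized-minus-baseline accuracy per classifier, rounded to 4 places."""
    return {name: round(after - before, 4) for name, (before, after) in table.items()}


gains = accuracy_gains(ACCURACY)
# Print classifiers from largest to smallest gain.
for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {gain:+.4f}")
```

Read this way, SVM benefits most from tuning (a gain of about 0.21), RF gains about 0.05, and SGD is essentially unchanged, dropping by 0.0004.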
Classifier | Accuracy |
---|---|
SVM | 0.9130 |
SGD | 0.9186 |
LR | 0.9126 |
RF | 0.9118 |
NB | 0.8995 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Daud, S.; Ullah, M.; Rehman, A.; Saba, T.; Damaševičius, R.; Sattar, A. Topic Classification of Online News Articles Using Optimized Machine Learning Models. Computers 2023, 12, 16. https://doi.org/10.3390/computers12010016