Exploitation of Vulnerabilities: A Topic-Based Machine Learning Framework for Explaining and Predicting Exploitation
Figure 1. Main steps of the study's experiments.
Figure 2. (a) Subset of the main information of a vulnerability entry. (b) Snippet of the content included in the external reference.
Figure 3. Flowchart of the proposed word clustering approach.
Figure 4. Example of an ROC curve.
Figure 5. Projection and formed clusters of the proposed approach.
Abstract
1. Introduction
2. Related Work
2.1. Exploitation of Security Vulnerabilities
2.2. Cluster Analysis and Word Embeddings
3. Methodology
3.1. Data Collection and Preprocessing
3.2. Topic Extraction
3.2.1. Text Preprocessing
3.2.2. Topic Modeling
3.2.3. Keyword Clustering
GloVe
UMAP
Fuzzy K-Means
3.3. Classification Models
3.3.1. Data Oversampling
3.3.2. Model Selection and Tuning
3.3.3. Performance Evaluation
3.4. Summary
- Retrieve information related to vulnerability descriptions and exploitability indicators;
- Apply cleaning and preprocessing procedures to the initial datasets;
- Define datasets for topic extraction (2015–2021 data feeds in our case) and classification models (2022 data feeds in our case);
- Employ text preprocessing techniques to establish the keyword vocabulary and the DTM;
- Use the preprocessed descriptions to train word embeddings with the GloVe algorithm;
- Employ UMAP to project these word embeddings into a low-dimensional space;
- Pipeline the outcomes of UMAP into the FKM algorithm to extract the cluster memberships of keywords (U);
- Use the DTM to train topic models;
- Calculate document memberships (the DCM) using the DTM and U;
- Evaluate the topic coherence of the topic and cluster models using the NPMI measure;
- Train models for different numbers of topics as indicated by the highest NPMI of each algorithm (in our case, 24 for the proposed framework, and 21 and 10 for the two baseline topic models, LDA and CTM);
- Provide a topic title for each cluster using the top keywords and some representative descriptions;
- Evaluate coefficients that assess the potential effects of each cluster on the exploitability indicators by employing a GLM;
- Identify exploitable weaknesses and products to assist vulnerability prioritization based on the highest coefficients of the GLM;
- Split the dataset into training (70%) and testing (30%) datasets;
- Balance the training dataset by employing an oversampling algorithm, in our case Adaptive Synthetic (ADASYN) oversampling;
- Select machine learning algorithms (in our case, two were selected: C5.0 and Random Forest);
- Apply a strategy that combines 10-fold cross-validation and grid search, using the training dataset, to tune the parameters of each algorithm for every set of inputs;
- Select the best parameter combinations based on the average accuracy of the respective models in the 10-fold cross-validation process;
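The clustering step of the pipeline above can be sketched as follows. This is a minimal fuzzy k-means (fuzzy c-means) implementation in Python, shown for illustration only; the study itself relies on R packages (e.g., fclust), and the GloVe/UMAP stages are assumed to have already produced the low-dimensional keyword vectors `X`, which the toy blobs below stand in for.

```python
import numpy as np

def fuzzy_kmeans(X, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Fuzzy k-means: returns (centers, U), where U[i, g] is the
    membership of point i in cluster g and each row of U sums to 1."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, n_clusters))
    U /= U.sum(axis=1, keepdims=True)            # valid initial memberships
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]   # membership-weighted means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                    # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))            # standard FCM membership update
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return centers, U

# Toy data: two well-separated 2-D blobs standing in for
# UMAP-projected keyword vectors.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
centers, U = fuzzy_kmeans(X, n_clusters=2)
hard = U.argmax(axis=1)   # hard assignment, for inspection only
```

Unlike hard k-means, each keyword retains a graded membership in every cluster, which is what the framework later aggregates into document-level memberships.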
4. Results
4.1. RQ1 Topic Assignment
4.2. RQ2 Exploit Prediction
5. Discussion
6. Threats to Validity
7. Conclusions
8. Future Work
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
Abbreviation | Description
---|---
PoC | Proof of Concept. We refer to this abbreviation when we discuss exploit proof of concepts.
NVD | National Vulnerability Database. The main data of this study were downloaded from their data feeds.
RF | Random Forest machine learning algorithm. It was employed in our experiments for exploitation prediction.
 | Bag-of-words term weighting scheme representing the raw term frequency of each keyword in a document. This scheme was employed to finalize LDA and the DCM.
DTM | Document-Term Matrix. An appropriate document representation for topic modeling.
GloVe | Global Vectors machine learning algorithm. Helped us establish word representations in a multidimensional embedding space.
UMAP | Uniform Manifold Approximation and Projection for dimension reduction.
 | The number of nearest neighbors (prior parameter for UMAP) that are considered in projecting the topology of a data point in a low-dimensional space.
FKM | Fuzzy K-means algorithm. It was employed to cluster keyword vectors extracted from UMAP and to later interpret topics.
LDA | Latent Dirichlet Allocation. A standard topic modeling algorithm that was used to compare the efficiency of the proposed approach, which is based on keyword clustering.
CTM | Correlated Topic Models. A standard topic modeling algorithm that was used to compare the efficiency of the proposed approach, which is based on keyword clustering.
VEM | Variational Expectation Maximization algorithm. This algorithm was followed as a basis to define stoppage criteria for the topic models (LDA and CTM).
NPMI | Normalized Pointwise Mutual Information. This measure was used to evaluate the topic coherence of the topic and clustering models in capturing the semantics of the dataset.
 | Number of clusters and topics.
U | Posterior cluster (g) memberships of the keywords (i).
 | Topology of the cluster (g) centers.
 | Initial vector space inserted in the Fuzzy K-means algorithm.
KDP | Keyword Document Presence. The number of documents that contain a keyword at least once.
 | Keyword–Topic Linkage Strength. Linkage strength calculated by multiplying KDP and U. It is used to identify the top words of each cluster and topic.
RS | Row Sums, the numerical sums of the elements included in each row of the DTM.
DCM | Document Cluster Membership. It denotes the matrix that stores the memberships linking each document with the clusters extracted from an FKM model.
AUC | Area Under Curve. A standard performance evaluation measure of classification machine learning models.
PF | Proposed Framework. It is first introduced in the second section of the presented results and refers to the complete framework that combines GloVe, UMAP, and FKM.
GLM | Generalized Linear Models. In our study, a model of this nature was employed to address the potential effects of the extracted topics on a target class.
β_j | Coefficient of the j-th predictor in a GLM.
EDF | Expected Document Frequency. This abbreviation refers to the expected (mean) cluster memberships of the latest documents (2022) included in this study.
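The NPMI measure listed above can be made concrete with a short Python sketch: it estimates word probabilities from document-level counts and normalizes pointwise mutual information into [-1, 1]. The toy counts and the small smoothing constant are my own assumptions for illustration.

```python
import math

def npmi(n_x, n_y, n_xy, n_docs, eps=1e-12):
    """Normalized pointwise mutual information of words x and y,
    estimated from document counts: n_x docs contain x, n_y contain y,
    n_xy contain both, out of n_docs total. Ranges from -1 to 1."""
    p_x, p_y, p_xy = n_x / n_docs, n_y / n_docs, n_xy / n_docs
    if p_xy == 0:
        return -1.0                       # the words never co-occur
    pmi = math.log((p_xy + eps) / (p_x * p_y + eps))
    return pmi / (-math.log(p_xy + eps))  # normalize by -log p(x, y)

# Words that always co-occur approach NPMI = 1:
print(npmi(10, 10, 10, 100))
# Statistically independent words approach NPMI = 0:
print(npmi(50, 50, 25, 100))
```

Averaging NPMI over the top keyword pairs of each topic yields the coherence scores compared in the results.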
References
- Nayak, K.; Marino, D.; Efstathopoulos, P.; Dumitraş, T. Some vulnerabilities are different than others. In Proceedings of the International Workshop on Recent Advances in Intrusion Detection, Gothenburg, Sweden, 17–19 September 2014; Springer: Cham, Switzerland, 2014; pp. 426–446. [Google Scholar]
- Spanos, G.; Angelis, L. A multi-target approach to estimate software vulnerability characteristics and severity scores. J. Syst. Softw. 2018, 146, 152–166. [Google Scholar] [CrossRef]
- Bullough, B.L.; Yanchenko, A.K.; Smith, C.L.; Zipkin, J.R. Predicting exploitation of disclosed software vulnerabilities using open-source data. In Proceedings of the 3rd ACM on International Workshop on Security and Privacy Analytics, Scottsdale, AZ, USA, 24 March 2017; pp. 45–53. [Google Scholar]
- Tavabi, N.; Goyal, P.; Almukaynizi, M.; Shakarian, P.; Lerman, K. Darkembed: Exploit prediction with neural language models. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
- Almukaynizi, M.; Nunes, E.; Dharaiya, K.; Senguttuvan, M.; Shakarian, J.; Shakarian, P. Proactive identification of exploits in the wild through vulnerability mentions online. In Proceedings of the 2017 International Conference on Cyber Conflict (CyCon US), Washington, DC, USA, 7–8 November 2017; pp. 82–88. [Google Scholar]
- Bhatt, N.; Anand, A.; Yadavalli, V.S. Exploitability prediction of software vulnerabilities. Qual. Reliab. Eng. Int. 2021, 37, 648–663. [Google Scholar] [CrossRef]
- Bozorgi, M.; Saul, L.K.; Savage, S.; Voelker, G.M. Beyond heuristics: Learning to classify vulnerabilities and predict exploits. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 25–28 July 2010; pp. 105–114. [Google Scholar]
- Fang, Y.; Liu, Y.; Huang, C.; Liu, L. FastEmbed: Predicting vulnerability exploitation possibility based on ensemble machine learning algorithm. PLoS ONE 2020, 15, e0228439. [Google Scholar] [CrossRef] [PubMed]
- Sabottke, C.; Suciu, O.; Dumitraș, T. Vulnerability Disclosure in the Age of Social Media: Exploiting Twitter for Predicting Real-World Exploits. In Proceedings of the 24th USENIX Security Symposium (USENIX Security 15), Washington, DC, USA, 12–14 August 2015; pp. 1041–1056. [Google Scholar]
- Kalouptsoglou, I.; Siavvas, M.; Kehagias, D.; Chatzigeorgiou, A.; Ampatzoglou, A. An empirical evaluation of the usefulness of word embedding techniques in deep learning-based vulnerability prediction. In Security in Computer and Information Sciences; Springer Nature: Cham, Switzerland, 2022; p. 23. [Google Scholar]
- Kalouptsoglou, I.; Siavvas, M.; Kehagias, D.; Chatzigeorgiou, A.; Ampatzoglou, A. Examining the Capacity of Text Mining and Software Metrics in Vulnerability Prediction. Entropy 2022, 24, 651. [Google Scholar] [CrossRef] [PubMed]
- Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
- Mohammed, S.M.; Jacksi, K.; Zeebaree, S.R. Glove word embedding and DBSCAN algorithms for semantic document clustering. In Proceedings of the 2020 International Conference on Advanced Science and Engineering (ICOASE), Duhok, Iraq, 23–24 December 2020; pp. 1–6. [Google Scholar]
- Singh, K.N.; Devi, S.D.; Devi, H.M.; Mahanta, A.K. A novel approach for dimension reduction using word embedding: An enhanced text classification approach. Int. J. Inf. Manag. Data Insights 2022, 2, 100061. [Google Scholar] [CrossRef]
- McInnes, L.; Healy, J.; Melville, J. Umap: Uniform manifold approximation and projection for dimension reduction. J. Open Source Softw. 2018, 3, 861. [Google Scholar] [CrossRef]
- Allaoui, M.; Kherfi, M.L.; Cheriet, A. Considerably improving clustering algorithms using UMAP dimensionality reduction technique: A comparative study. In Proceedings of the International Conference on Image and Signal Processing, Marrakesh, Morocco, 4–6 June 2020; Springer: Cham, Switzerland, 2020; pp. 317–325. [Google Scholar]
- Ordun, C.; Purushotham, S.; Raff, E. Exploratory analysis of COVID-19 tweets using topic modeling, umap, and digraphs. arXiv 2020, arXiv:2005.03082. [Google Scholar]
- Rao, R.N.; Chakraborty, M. Vec2GC—A Graph Based Clustering Method for Text Representations. arXiv 2021, arXiv:2104.09439. [Google Scholar]
- Bezdek, J.C. Pattern Recognition with Fuzzy Objective Function Algorithms; Plenum Press: New York, NY, USA, 1981. [Google Scholar]
- Rashid, J.; Shah, S.M.A.; Irtaza, A. Fuzzy topic modeling approach for text mining over short text. Inf. Process. Manag. 2019, 56, 102060. [Google Scholar] [CrossRef]
- Rashid, J.; Shah, S.M.A.; Irtaza, A.; Mahmood, T.; Nisar, M.W.; Shafiq, M.; Gardezi, A. Topic modeling technique for text mining over biomedical text corpora through hybrid inverse documents frequency and fuzzy k-means clustering. IEEE Access 2019, 7, 146070–146080. [Google Scholar] [CrossRef]
- Ikonomakis, M.; Kotsiantis, S.; Tampakas, V. Text classification using machine learning techniques. WSEAS Trans. Comput. 2005, 4, 966–974. [Google Scholar]
- Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
- Blei, D.; Lafferty, J. Correlated topic models. Adv. Neural Inf. Process. Syst. 2006, 18, 147. [Google Scholar]
- Le, T.H.; Chen, H.; Babar, M.A. A survey on data-driven software vulnerability assessment and prioritization. ACM Comput. Surv. 2022, 55, 1–39. [Google Scholar] [CrossRef]
- Jacobs, J.; Romanosky, S.; Edwards, B.; Adjerid, I.; Roytman, M. Exploit prediction scoring system (epss). Digit. Threat. Res. Pract. 2021, 2, 1–17. [Google Scholar] [CrossRef]
- Chen, H.; Liu, J.; Liu, R.; Park, N.; Subrahmanian, V.S. VEST: A System for Vulnerability Exploit Scoring & Timing. In Proceedings of the IJCAI, Macao, China, 10–16 August 2019; pp. 6503–6505. [Google Scholar]
- Chen, H.; Liu, R.; Park, N.; Subrahmanian, V.S. Using twitter to predict when vulnerabilities will be exploited. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 3143–3152. [Google Scholar]
- Charmanas, K.; Mittas, N.; Angelis, L. Predicting the existence of exploitation concepts linked to software vulnerabilities using text mining. In Proceedings of the 25th Pan-Hellenic Conference on Informatics, Volos, Greece, 26–28 November 2021; pp. 352–356. [Google Scholar]
- Joulin, A.; Grave, E.; Bojanowski, P.; Mikolov, T. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, 3–7 April 2017; pp. 427–431. [Google Scholar]
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; Volume 26. [Google Scholar]
- Das, R.; Zaheer, M.; Dyer, C. Gaussian LDA for topic models with word embeddings. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, 26–31 July 2015; pp. 795–804. [Google Scholar]
- Moody, C.E. Mixing dirichlet topic models and word embeddings to make lda2vec. arXiv 2016, arXiv:1605.02019. [Google Scholar]
- Singh, A.K.; Shashi, M. Vectorization of text documents for identifying unifiable news articles. Int. J. Adv. Comput. Sci. Appl. 2019, 10, 305–310. [Google Scholar] [CrossRef] [Green Version]
- Iyyer, M.; Manjunatha, V.; Boyd-Graber, J.; Daumé, H., III. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, 26–31 July 2015; pp. 1681–1691. [Google Scholar]
- Minaee, S.; Kalchbrenner, N.; Cambria, E.; Nikzad, N.; Chenaghlu, M.; Gao, J. Deep learning—Based text classification: A comprehensive review. ACM Comput. Surv. (CSUR) 2021, 54, 1–40. [Google Scholar]
- Curiskis, S.A.; Drake, B.; Osborn, T.R.; Kennedy, P.J. An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit. Inf. Process. Manag. 2020, 57, 102034. [Google Scholar] [CrossRef]
- Rosalina, R.; Huda, R.; Sahuri, G. Multidocument Summarization using GloVe Word Embedding and Agglomerative Cluster Methods. In Proceedings of the 2020 IEEE International Conference on Sustainable Engineering and Creative Computing (ICSECC), Cikarang, Indonesia, 16–17 December 2020; pp. 260–264. [Google Scholar]
- Ashwini, K.S.; Shantala, C.P.; Jan, T. Impact of Text Representation Techniques on Clustering Models. Res. Sq. 2022. [Google Scholar] [CrossRef]
- Salih, N.M.; Jacksi, K. State of the art document clustering algorithms based on semantic similarity. J. Inform. 2020, 14, 58–75. [Google Scholar]
- Sridhar, V.K.R. Unsupervised topic modeling for short texts using distributed representations of words. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, Denver, CO, USA, 31 May–5 June 2015; pp. 192–200. [Google Scholar]
- Angelov, D. Top2vec: Distributed representations of topics. arXiv 2020, arXiv:2008.09470. [Google Scholar]
- Goswami, S.; Shishodia, M. A Fuzzy Based Approach to Text Mining and Document Clustering. Int. J. Data Min. Knowl. Manag. Process 2013, 3, 43–52. [Google Scholar] [CrossRef]
- Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
- Juneja, P.; Jain, H.; Deshmukh, T.; Somani, S.; Tripathy, B.K. Context aware clustering using glove and K-means. Int. J. Softw. Eng. Appl. 2017, 8, 21–38. [Google Scholar] [CrossRef]
- Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; Volume 96, pp. 226–231. [Google Scholar]
- Kenter, T.; Borisov, A.; De Rijke, M. Siamese cbow: Optimizing word embeddings for sentence representations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; pp. 941–951. [Google Scholar]
- Xing, C.; Wang, D.; Zhang, X.; Liu, C. Document classification with distributions of word vectors. In Proceedings of the Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific, Siem Reap, Cambodia, 9–12 December 2014; pp. 1–5. [Google Scholar]
- Janani, R.; Vijayarani, S. Text document clustering using spectral clustering algorithm with particle swarm optimization. Expert Syst. Appl. 2019, 134, 192–200. [Google Scholar] [CrossRef]
- Mehta, V.; Bawa, S.; Singh, J. WEClustering: Word embeddings based text clustering technique for large datasets. Complex Intell. Syst. 2021, 7, 3211–3224. [Google Scholar] [CrossRef]
- Ruspini, E.H.; Bezdek, J.C.; Keller, J.M. Fuzzy clustering: A historical perspective. IEEE Comput. Intell. Mag. 2019, 14, 45–55. [Google Scholar] [CrossRef]
- D’urso, P.; Massari, R. Fuzzy clustering of mixed data. Inf. Sci. 2019, 505, 513–534. [Google Scholar] [CrossRef]
- Sonalitha, E.; Zubair, A.; Mulyo, P.D.; Nurdewanto, B.; Prambanan, B.R.; Mujahidin, I. Combined text mining: Fuzzy clustering for opinion mining on the traditional culture arts work. Int. J. Adv. Comput. Sci. Appl. (IJACSA) 2020, 11, 294–299. [Google Scholar] [CrossRef]
- Gosain, A.; Dahiya, S. Performance analysis of various fuzzy clustering algorithms: A review. Procedia Comput. Sci. 2016, 79, 100–111. [Google Scholar] [CrossRef] [Green Version]
- Hunt, L.; Jorgensen, M. Clustering mixed data. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2011, 1, 352–361. [Google Scholar] [CrossRef]
- Ichino, M.; Yaguchi, H. Generalized Minkowski metrics for mixed feature-type data analysis. IEEE Trans. Syst. Man Cybern. 1994, 24, 698–708. [Google Scholar] [CrossRef]
- Ghosal, A.; Nandy, A.; Das, A.K.; Goswami, S.; Panday, M. A short review on different clustering techniques and their applications. In Emerging Technology in Modelling and Graphics; Springer: Singapore, 2020; pp. 69–83. [Google Scholar]
- McCullagh, P.; Nelder, J.A. Generalized Linear Models; Chapman and Hall: London, UK, 1989. [Google Scholar]
- Kowsari, K.; Jafari Meimandi, K.; Heidarysafa, M.; Mendu, S.; Barnes, L.; Brown, D. Text classification algorithms: A survey. Information 2019, 10, 150. [Google Scholar] [CrossRef] [Green Version]
- Kherwa, P.; Bansal, P. Topic modeling: A comprehensive review. EAI Endorsed Trans. Scalable Inf. Syst. 2019, 7. [Google Scholar] [CrossRef] [Green Version]
- Grün, B.; Hornik, K. topicmodels: An R package for fitting topic models. J. Stat. Softw. 2011, 40, 1–30. [Google Scholar] [CrossRef] [Green Version]
- Bouma, G. Normalized (pointwise) mutual information in collocation extraction. Proc. GSCL 2009, 30, 31–40. [Google Scholar]
- Ferraro, M.B.; Giordani, P.; Serafini, A. fclust: An R Package for Fuzzy Clustering. R J. 2019, 11, 198. [Google Scholar] [CrossRef]
- Siriseriwan, W. Smotefamily: A Collection of Oversampling Techniques for Class Imbalance Problem Based on SMOTE. R Package Version 1.3.1. 2019. Available online: https://CRAN.R-project.org/package=smotefamily (accessed on 11 July 2023).
- Kuhn, M. caret: Classification and Regression Training. R package Version 6.0-90. 2021. Available online: https://cran.r-project.org/web/packages/caret/index.html (accessed on 11 July 2023).
- Yan, Y. MLmetrics: Machine Learning Evaluation Metrics. R Package Version 1.1.1. 2016. Available online: https://CRAN.R-project.org/package=MLmetrics (accessed on 11 July 2023).
- He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; pp. 1322–1328. [Google Scholar]
- He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar]
- Hand, D.J.; Till, R.J. A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach. Learn. 2001, 45, 171–186. [Google Scholar] [CrossRef]
- Nelder, J.A.; Wedderburn, R.W. Generalized linear models. J. R. Stat. Soc. Ser. A (Gen.) 1972, 135, 370–384. [Google Scholar] [CrossRef]
- Ghosh, S.; Dubey, S.K. Comparative analysis of k-means and fuzzy c-means algorithms. Int. J. Adv. Comput. Sci. Appl. 2013, 4, 35–39. [Google Scholar] [CrossRef] [Green Version]
Reference | Scoring | Multiple Databases | Exclusive Use of Descriptions | Topics | Period |
---|---|---|---|---|---|
Jacobs et al. [26] | Yes | Yes | No | No | 2016–2018 |
Chen, Liu, Liu et al. [27] | Yes | No | No | No | Updated daily |
Chen, Liu, Park et al. [28] | Yes | Yes | No | No | 2016–2018 |
Sabottke et al. [9] | No | Yes | No | No | 2014–2015 |
Almukaynizi et al. [5] | No | Yes | No | No | 2015–2016 |
Bozorgi et al. [7] | No | Yes | No | No | 1991–2007 |
Bullough et al. [3] | No | Yes | No | No | 2014–2015 |
Fang et al. [8] | No | Yes | No | No | 2013–2018 |
Tavabi et al. [4] | No | Yes | No | No | 2010–2017 |
Bhatt et al. [6] | No | Yes | No | No | 2012–2015 |
Charmanas et al. [29] | No | No | Yes | No | 2015–2021 |
This study | Yes | No | Yes | Yes | 2015–2022 |
 | Predicted Positive | Predicted Negative
---|---|---
Real Positive | Observations classified correctly that belong to the positive output class (True Positives, TP) | Observations classified incorrectly that belong to the positive output class (False Negatives, FN)
Real Negative | Observations classified incorrectly that belong to the negative output class (False Positives, FP) | Observations classified correctly that belong to the negative output class (True Negatives, TN)
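The four cells of the confusion matrix above combine into the standard evaluation measures reported later. A small, generic Python helper (the function name and example counts are my own) makes the formulas explicit:

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary-classification measures from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of predicted positives, how many are real
    recall = tp / (tp + fn)      # of real positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts, for illustration only:
m = classification_metrics(tp=80, fp=20, fn=10, tn=90)
print(m)  # accuracy 0.85, precision 0.8, recall ~0.889
```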
Model | Number of Topics | NPMI
---|---|---
Proposed framework (PF) | 10 | 0.239
Proposed framework (PF) | 21 | 0.246
Proposed framework (PF) | 24 | 0.263
 | 10 | 0.177
 | 21 | 0.184
 | 24 | 0.167
 | 10 | 0.202
 | 21 | 0.140
 | 24 | 0.170
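Selecting the number of topics in the table above reduces, for each model, to taking the candidate with the highest NPMI coherence. A minimal Python sketch using the first block of values from the table:

```python
# NPMI coherence per candidate number of topics
# (first model block of the table above).
scores = {10: 0.239, 21: 0.246, 24: 0.263}

# Pick the candidate with the highest coherence.
best_k = max(scores, key=scores.get)
print(best_k)  # 24
```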
Topic No.—EDF | Titles | Top 10 Stemmed Keywords
---|---|---
1—0.0145 | Vulnerabilities on Oracle products with displayed severity metrics | oracl; score; confidenti; cvss; easili; complet; subcompon; critic; abil; person |
2—0.0652 | Successful attacks–exploits that affect specific data components | affect; exploit; access; result; data; success; prior; attack; network; interact |
3—0.0338 | Buffer and stack overflow | buffer; craft; function; overflow; stack; heap; servic; trigger; denial; length |
4—0.0210 | Credentials disclosure and session breaches especially on IBM products | ibm; forc; session; credenti; intend; javascript; site; trust; cross; thus |
5—0.0507 | Vulnerabilities on Android, especially local privilege escalation | lead; local; potenti; possibl; discov; can; enabl; android; need; escal |
6—0.0273 | Bugs on devices due to packet mishandling from protocol units, e.g., ipv, udp, tcp | send; devic; packet; softwar; condit; bug; bypass; cisco; firmwar; seri |
7—0.0344 | Path-directory traversal | file; directori; path; travers; search; name; dll; untrust; command; attack |
8—0.0100 | SSL certificates that allow spoofing from man-in-the-middle attacks | token; certif; spoof; middl; verifi; man; ssl; smart; domain; verif |
9—0.0305 | Vulnerabilities that cause system crashes and denial of service especially from null pointer dereference | caus; denial; crash; servic; craft; pointer; function; null; leak; derefer |
10—0.0294 | Unauthorized compromises on specific Enterprise product components | compon; product; impact; compromis; support; integr; unauthor; delet; avail; low |
11—0.0349 | Attacks on the web via URL redirection and malicious links | web; manag; url; link; site; attack; browser; conduct; cross; user |
12—0.0276 | Vulnerabilities with discussed affected or fixing patches especially on TensorFlow platform | includ; will; patch; sourc; platform; also; one; upgrad; issu; machin |
13—0.0501 | Cross-site scripting | xss; script; html; cross; site; page; store; reflect; inject; payload |
14—0.0652 | SQL injection and input mishandling, especially on php components | paramet; php; inject; file; sql; get; admin; mishandl; demonstr; post |
15—0.0967 | Remote code execution | via; remot; execut; arbitrari; attack; code; request; user; vector; unspecifi |
16—0.0567 | Incorrect controlling, configuring, and granted permissions | use; can; configur; control; issu; attack; permiss; user; set; default |
17—0.0051 | Vulnerabilities on Snapdragon products | mobil; snapdragon; comput; consum; msm; infrastructur; industri; iot; auto; qualcomm |
18—0.1459 | Vulnerabilities that affect specific versions and allow privilege gain | vulner; version; privileg; attack; user; may; read; system; server; contain |
19—0.0260 | Vulnerabilities with reported issues and fixes | issu; can; note; fix; address; run; abl; discov; report; use |
20—0.0302 | Hardcoded keys along with weak passwords and encryptions | password; key; attack; administr; chang; file; user; account; encrypt; can |
21—0.0404 | Vulnerabilities including discussed CVE codes, especially on Microsoft products and on vulnerabilities with memory issues | aka; memori; cve; window; handl; corrupt; object; differ; microsoft; uniqu
22—0.0391 | Vulnerabilities on media components and Nvidia GPU drivers, especially denial of service, use after free, and double free | servic; applic; craft; denial; function; driver; overflow; content; librari; free |
23—0.0262 | Lack of security within specific processes that attackers leverage to intervene in the related context | context; within; open; specif; exist; lack; leverag; process; current; pars |
24—0.0387 | Vulnerabilities on WordPress plugins | plugin; wordpress; admin; page; escap; form; file; output; attribut; ajax |
Topic No. | GLM Coefficient Estimate (β_j) | Odds Ratio (exp(β_j))
---|---|---
1 | −10.627 *** | 0.000024 |
2 | −7.283 *** | 0.000687 |
3 | 5.155 *** | 173.286 |
4 | −14.230 *** | 0.000001 |
5 | −7.755 *** | 0.000429 |
6 | −7.119 *** | 0.000809 |
7 | −0.988 | 0.372251 |
8 | −11.790 *** | 0.000008 |
9 | 2.200 * | 9.031704 |
10 | −5.950 *** | 0.002605 |
11 | −1.245 | 0.287895 |
12 | 4.0589 *** | 57.91047 |
13 | −2.162 ** | 0.115115 |
14 | 6.051 *** | 424.7054 |
15 | −1.285 * | 0.276587 |
16 | −8.134 *** | 0.000293 |
17 | −12.853 *** | 0.000003 |
18 | −6.438 *** | 0.001599 |
19 | −3.635 ** | 0.02638 |
20 | 1.367 | 3.925521 |
21 | −11.616 *** | 0.000009 |
22 | −9.0170 *** | 0.000121 |
23 | −9.6196 *** | 0.000066 |
24 | Linearly Dependent (Aliased) on Intercept | Aliased with Intercept |
Intercept | 3.480 *** | 32.444549
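The odds ratios in the table are simply the exponentiated logistic-GLM coefficients, so the two columns can be checked against each other directly; here against the estimate reported for Topic 3 (the printed value differs slightly from the table's 173.286 because the displayed coefficient is rounded):

```python
import math

def odds_ratio(beta):
    """In a logistic GLM, a one-unit increase in predictor j multiplies
    the odds of the positive class by exp(beta_j)."""
    return math.exp(beta)

# Topic 3's coefficient from the table above:
print(round(odds_ratio(5.155), 3))  # ~173.3
```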
Model | Topics | Classifier | Accuracy | Precision | Recall | F1 | AUC
---|---|---|---|---|---|---|---
 | 10 | C5.0 | 0.7332521 | 0.8528864 | 0.6473498 | 0.7360386 | 0.8507613
 | 10 | RF | 0.7945595 | 0.8369162 | 0.7978799 | 0.816932 | 0.8742083
 | 21 | C5.0 | 0.7803492 | 0.861755 | 0.735689 | 0.7937476 | 0.870051
 | 21 | RF | 0.8018676 | 0.8508706 | 0.7943463 | 0.8216374 | 0.8851462
 | 24 | C5.0 | 0.7864393 | 0.8459144 | 0.7681979 | 0.8051852 | 0.8778039
 | 24 | RF | 0.8075518 | 0.8477458 | 0.8106007 | 0.8287572 | 0.8925603
 | 10 | C5.0 | 0.7860333 | 0.852381 | 0.7590106 | 0.8029907 | 0.8733546
 | 10 | RF | 0.8148599 | 0.8477157 | 0.8261484 | 0.8367931 | 0.8904948
 | 21 | C5.0 | 0.8262282 | 0.862601 | 0.829682 | 0.8458213 | 0.9061763
 | 21 | RF | 0.8278522 | 0.8511694 | 0.8487633 | 0.8499646 | 0.9114623
 | 24 | C5.0 | 0.8278522 | 0.8630037 | 0.8325088 | 0.847482 | 0.9136154
 | 24 | RF | 0.8367844 | 0.8594748 | 0.8558304 | 0.8576487 | 0.9138086
 | 10 | C5.0 | 0.7941535 | 0.8229018 | 0.8176678 | 0.8202765 | 0.8635709
 | 10 | RF | 0.8124239 | 0.8420675 | 0.8289753 | 0.8354701 | 0.8866159
 | 21 | C5.0 | 0.81689 | 0.8559823 | 0.8190813 | 0.8371253 | 0.9060425
 | 21 | RF | 0.8327243 | 0.8663258 | 0.8381625 | 0.8520115 | 0.912211
 | 24 | C5.0 | 0.8282582 | 0.8547926 | 0.844523 | 0.8496267 | 0.9125877
 | 24 | RF | 0.8371904 | 0.8611111 | 0.854417 | 0.857751 | 0.9157352
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Charmanas, K.; Mittas, N.; Angelis, L. Exploitation of Vulnerabilities: A Topic-Based Machine Learning Framework for Explaining and Predicting Exploitation. Information 2023, 14, 403. https://doi.org/10.3390/info14070403